Hacker News: noamteyssier's comments

Good question! I just added that comparison, and the Rust uutils coreutils port is significantly faster than the standard coreutils.


I think this could significantly reduce the amount of memory required - especially in cases where there are many repeated prefixes.

Would be interesting to try this out


I think you'd still need to go through that if you were optimizing `sort` and `uniq` individually, working within their constraints.

What I'm really optimizing here is the functional equivalent of `sort | uniq -c | sort -n`.
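For context, a minimal sketch of the single-pass hash-map approach (hypothetical - not the actual `hist` implementation): count each line's occurrences in a map, then sort the counts once at the end.

```rust
use std::collections::HashMap;

/// Count occurrences of each line and return (count, line) pairs sorted
/// ascending by count - the functional equivalent of
/// `sort | uniq -c | sort -n`, but in a single pass over the input.
fn count_lines<'a, I: IntoIterator<Item = &'a str>>(lines: I) -> Vec<(usize, &'a str)> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for line in lines {
        *counts.entry(line).or_insert(0) += 1;
    }
    let mut out: Vec<(usize, &str)> = counts.into_iter().map(|(l, c)| (c, l)).collect();
    out.sort(); // tuples sort by count first, then by line
    out
}

fn main() {
    for (count, line) in count_lines(["b", "a", "b", "c", "b", "a"]) {
        println!("{count}\t{line}");
    }
}
```

Memory here scales with the number of unique lines rather than the total input size, which is the trade-off discussed below.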


Yeah, that's right - there are trade-offs in doing so, as it can require much more memory. So like everything it's an application-specific decision.


I've added this functionality to `hist-0.1.5` with a benchmark of other tools that do this on the CLI


This looks very interesting and I'd love to add it to the benchmarks! I was interested in trying it, but unfortunately hit an installation error on the MacBook where I'm running the benchmarks:

```
clang \
    -g -ggdb -O3 \
    -Wall -Wextra -Wpointer-arith \
    -D_FORTIFY_SOURCE=2 -fPIE \
    mmuniq.c \
    -lm \
    -Wl,-z,now -Wl,-z,relro \
    -o mmuniq
mmuniq.c:1:10: fatal error: 'byteswap.h' file not found
    1 | #include <byteswap.h>
      |          ^~~~~~~~~~~~
1 error generated.
make: *** [mmuniq] Error 1
```


I've actually added a benchmark for this specific task and added `unic` to it.

It may not be the fairest comparison, because with the random FASTQs I'm generating, the vast majority of the input is unique, which could be overloading the cuckoo filter.


This shows up a lot in bioinformatics, actually - identifying sequences containing a specific subsequence (grep) and counting how many of each unique sequence there are. The input here can be massive (on the order of 1-10s of GB).

You don't really end up using these results in any specific analysis, but it's super helpful for troubleshooting tools or edge cases.


I've added awk into the benchmarks also!


Also Perl, which should be faster than awk IIRC:

perl -ne 'print if ! $a{$_}++'


Yeah, I'm using it to serialize the output lines as TSV. Rust's `println!` is notoriously slow, and using `csv` to serialize the output is a nice way to boost throughput.


Thanks, nice find! Though it feels weird to have to pull in a CSV crate for that. Ideally the fast-printing part would be well understood and either used directly or extracted into a separate, smaller crate.

