Hacker News: noamteyssier's comments

Good question! I just added that comparison, and the Rust uutils coreutils port is significantly faster than the standard coreutils.


I think this could significantly reduce the amount of memory required - especially in cases where there are many repeated prefixes.

Would be interesting to try this out


I think you'd still need to go through that if you were optimizing `sort` and `uniq` individually, working within their constraints.

What I'm really optimizing here is the functional equivalent of `sort | uniq -c | sort -n`.
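For context, a minimal sketch of the single-pass hash-map approach (hypothetical - not the actual `hist` implementation): count each line's occurrences in a map, then sort the counts once at the end.

```rust
use std::collections::HashMap;

/// Count occurrences of each line and return (count, line) pairs sorted
/// ascending by count - the functional equivalent of
/// `sort | uniq -c | sort -n`, but in a single pass over the input.
fn count_lines<'a, I: IntoIterator<Item = &'a str>>(lines: I) -> Vec<(usize, &'a str)> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for line in lines {
        *counts.entry(line).or_insert(0) += 1;
    }
    let mut out: Vec<(usize, &str)> = counts.into_iter().map(|(l, c)| (c, l)).collect();
    out.sort(); // tuples sort by count first, then by line
    out
}

fn main() {
    for (count, line) in count_lines(["b", "a", "b", "c", "b", "a"]) {
        println!("{count}\t{line}");
    }
}
```

Memory here scales with the number of unique lines rather than the total input size, which is the trade-off discussed below.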


Yeah, that's right - there are trade-offs in doing so, as it can require much more memory. So like everything it's an application-specific decision.


I've added this functionality to `hist-0.1.5` with a benchmark of other tools that do this on the CLI


This looks very interesting and I'd love to add it to the benchmarks! I was interested in trying it, but unfortunately hit an installation error on the MacBook where I'm running the benchmarks:

```
clang \
    -g -ggdb -O3 \
    -Wall -Wextra -Wpointer-arith \
    -D_FORTIFY_SOURCE=2 -fPIE \
    mmuniq.c \
    -lm \
    -Wl,-z,now -Wl,-z,relro \
    -o mmuniq
mmuniq.c:1:10: fatal error: 'byteswap.h' file not found
    1 | #include <byteswap.h>
      |          ^~~~~~~~~~~~
1 error generated.
make: *** [mmuniq] Error 1
```


I've actually added a benchmark for this specific task and added `unic` to it.

It may not be the fairest comparison, because with the random FASTQs I'm generating, the vast majority of the input is unique, which could be overloading the cuckoo filter.


This shows up a lot in bioinformatics, actually - identifying sequences containing a specific subsequence (grep) and counting how many of each unique sequence there are. The input here can be massive (on the order of 1-10s of GB).

You don't really end up using these results in any specific analysis, but it's super helpful for troubleshooting tools or edge cases.


I've added awk into the benchmarks also!


Also Perl, which should be faster than awk IIRC:

perl -ne 'print if ! $a{$_}++'


Yeah, I'm using it to serialize the output lines as TSV. Rust's `println!` is notoriously slow, and using `csv` to serialize the output is a nice way to boost throughput.


Thanks, nice find! Though it feels weird to have to pull in a CSV crate for that. Ideally the fast-printing part would be well understood and either used directly or extracted into a separate, smaller crate.

