
Speeding Up Grep Searches

Sometimes you may find yourself needing to filter a large amount of output using the grep command. However, grep can struggle when you try to filter files with an extremely large number of lines, as it loads each line into RAM one at a time, and this can quickly exhaust even large amounts of requested RAM. There are a few ways around this.

Using the C locale instead of UTF-8

Prefixing the grep command with LC_ALL=C (as per this helpful Stack Overflow post) uses the C locale instead of UTF-8 and improves job runtimes significantly (between 20 and 100 times faster). Recently a user was running a grep command to filter over 6,000 lines of patterns from a 300GB file with over 4 billion lines, using a command similar to:

grep -f patterns.txt file.txt > final.txt

The -f option obtains patterns from FILE (in this case patterns.txt), one per line (for more information see the grep manpage which can be viewed on Apocrita using man grep). So the command above looks for matching patterns defined in patterns.txt in the file file.txt and redirects the output to final.txt.
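As a small illustration with made-up file contents, the patterns file and its effect might look like this:

$ cat patterns.txt     # hypothetical example: two patterns, one per line
ERROR
timeout
$ grep -f patterns.txt file.txt > final.txt     # keep every line of file.txt matching either pattern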

The user's command had been running for 24 hours and had still not finished. Simply by adding the LC_ALL=C prefix:

LC_ALL=C grep -f patterns.txt file.txt > final.txt

the job completed in around 15 minutes. If your data consists purely of ASCII characters, then LC_ALL=C should be fine to use.
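If you are unsure whether your data is pure ASCII, or want to measure the difference yourself, a quick check and timing comparison (a sketch, assuming the same file names as above) might look like:

file file.txt                                           # reports something like "ASCII text" for pure ASCII data
time grep -f patterns.txt file.txt > final.txt          # default (UTF-8) locale
time LC_ALL=C grep -f patterns.txt file.txt > final.txt # C locale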

Interpreting patterns as fixed strings

Adding -F makes grep interpret the patterns to be matched as a list of fixed strings (one per line) rather than regular expressions, any of which is to be matched. If you don't require regular expressions (also known as regex), then this can be a real improvement as well.
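To see the difference, here is a small made-up example: the pattern 1.5 contains a regex metacharacter (the dot), which matches any single character unless -F is used:

$ echo "version 1x5" | grep "1.5"       # matches, because "." matches any single character
version 1x5
$ echo "version 1x5" | grep -F "1.5"    # no match: the pattern is treated as the literal string "1.5"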

Recently a user was running a grep command to filter over 470,000 lines of patterns from a 300GB file with over 53 million lines:

LC_ALL=C grep -f patterns.txt file.txt > final.txt

Despite using LC_ALL=C to select the C locale instead of UTF-8, the job still used an enormous amount of RAM as the 470,000 lines of patterns in patterns.txt were loaded line-by-line. Even a large allocation such as 256GB was quickly exhausted, and the job was killed by the scheduler on Apocrita before it could finish.

By adding -F to the grep command, like so:

LC_ALL=C grep -Ff patterns.txt file.txt > final.txt

the user's job completed in just a few minutes.

Splitting big files into smaller ones

If neither of the above tips works and your job is still running out of RAM, you may need to use split (available on the cluster without needing to load any additional modules) to split your files into smaller ones:

$ split --help
Usage: split [OPTION]... [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is 'x'.  With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   generate suffixes of length N (default 2)
      --additional-suffix=SUFFIX  append an additional SUFFIX to file names
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes[=FROM]  use numeric suffixes instead of alphabetic;
                                   FROM changes the start value (default 0)
  -e, --elide-empty-files  do not generate empty output files with '-n'
      --filter=COMMAND    write to shell COMMAND; file name is $FILE
  -l, --lines=NUMBER      put NUMBER lines per output file
  -n, --number=CHUNKS     generate CHUNKS output files; see explanation below
  -u, --unbuffered        immediately copy input to output with '-n r/...'
      --verbose           print a diagnostic just before each
                            output file is opened
      --help     display this help and exit
      --version  output version information and exit

For example, to split a large patterns file with 400,000 lines into multiple parts of 10,000 lines each:

split --numeric-suffixes -l 10000 patterns.txt patterns.txt_

This would give you 40 files, named with numeric suffixes, like so:

patterns.txt_00
patterns.txt_01
patterns.txt_02
patterns.txt_03
patterns.txt_04
<etc>
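As a quick sanity check (assuming the file names above), you can confirm that the parts add up to the original:

wc -l patterns.txt_*          # each part should contain 10000 lines
cat patterns.txt_* | wc -l    # total should equal wc -l patterns.txt (400000)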

You could then use an array job, perhaps using the list_of_files.txt method, to run your grep command concurrently against each smaller 10,000-line file, and then combine the output at the end (see the sketch after the job script below), something like:

#!/bin/bash

#$ -cwd
#$ -pe smp 1
#$ -l h_rt=240:0:0
#$ -l h_vmem=8G
#$ -N grep_array
#$ -t 1-40
#$ -j y

# Select the pattern file for this array task (list_of_patterns.txt lists the split files, one per line)
INPUT_FILE=$(sed -n "${SGE_TASK_ID}p" list_of_patterns.txt)

# Search using the C locale and fixed-string matching, writing a separate output file per task
LC_ALL=C grep -Ff ${INPUT_FILE} file.txt > final.txt.${SGE_TASK_ID}-${INPUT_FILE}.txt

On this occasion the large 400,000-line patterns file was split, but you could also split file.txt if it is too large.
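As a rough sketch (using the file names from this example), the list of split pattern files can be generated before submitting the array job, and the per-task outputs combined once all tasks have finished:

# Before submitting: list the split pattern files, one per line, for the array tasks to read
ls patterns.txt_* > list_of_patterns.txt

# After all tasks have completed: combine the per-task outputs into a single file
# (parts are concatenated in shell glob order, which is fine if line order does not matter)
cat final.txt.* > final.txt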

We hope you find these tips useful. As usual, you can ask a question on our Slack channel (QMUL users only) or send an email to its-research-support@qmul.ac.uk, which is handled directly by staff with relevant expertise.


Title image: Generated using DALL-E-2