Blastn options

Generally when using BLAST (Basic Local Alignment Search Tool) I would use a script.

Example BLAST script from NeSI – https://support.nesi.org.nz/hc/en-gb/articles/208619807

#!/bin/bash -e

#SBATCH --job-name       BLAST
#SBATCH --time           02:30:00      # Allow 100 CPU hrs / GB of blastn query seq
#SBATCH --hint           nomultithread # Unless trying 72 threads
#SBATCH --ntasks         1
#SBATCH --cpus-per-task  36            # 1 whole node.
#SBATCH --mem            105G          # 1 whole node. Allow for whole database.
##SBATCH --mem            200G          # For large databases (nr, refseq_genomic)

module load BLAST/2.9.0-gimkl-2018b
module load BLASTDB/2020-01

# This script takes one argument, the FASTA file of query sequences.
QUERIES=$1
FORMAT="6 qseqid qstart qend qseq sseqid sgi sacc sstart send staxids sscinames stitle length evalue bitscore"
BLASTOPTS="-evalue 0.05 -max_target_seqs 10"
BLASTAPP=blastn
DB=nt
#BLASTAPP=blastx
#DB=nr

# Keep the database in RAM if searching multiple 
# query sequences against a database of over 10GB.
cp $BLASTDB/{$DB,taxdb}* $TMPDIR/ 
export BLASTDB=$TMPDIR

# Single node multithreaded BLAST.
srun $BLASTAPP $BLASTOPTS -db $DB -query $QUERIES -outfmt "$FORMAT" \
    -out $QUERIES.$DB.$BLASTAPP -num_threads $SLURM_CPUS_PER_TASK

However, I am not a fan of the chosen format options. Metagenomics.wiki is a great source and provides good starting points for different tools, which not only includes BLAST but also 16S Tools, BOWTIE2 and many more.

I like to know the percentage of identical matches when my BLAST results are returned – Ident[ity]: the highest percent identity for a set of aligned segments to the same subject sequence. This can be returned by adding pident to FORMAT.

More useful options can be found here -> http://www.metagenomics.wiki/tools/blast/blastn-output-format-6

Previously, I also found discrepancies between using a script and a manual BLAST search for some sequences. After some digging I found that the issue seems to have been with -max_target_seqs.

So basically while most of the time -max_target_seqs does seem to be a final filter before the output is written, this setting actually happens much earlier in the search and can (unexpectedly) cull what turns out to be the best hit.

The blog cites an article by Shah et al. (2018) and probably what was most concerning at the time was that “a feature of BLAST that operates in a non-intuitive way and that is frequently misused in bioinformatics workflows, potentially leading to erroneous results impacting numerous scientific articles”.

These erroneous results were exactly what I had found. I was not getting the best hit from the BLAST database

The BLAST team replied to the Shah et al. (2018) paper and released BLAST+ 2.8.1, which fixed the bug. They also wrote a reply – Madden et al. (2018).

The main point here is that the top hits should not be taken at face value! Look at all the different output values and use at least two when making your decision on a sequence match. Also, do a quick scan and look at the top 2 or 3 hits. Do not rely solely on the e-value to decide on the most appropriate match – look at the length of the sequence that has matched your sequence and see if it is appropriate! Maybe that match needs to be disregarded? Finally, always make sure your databases, and in this case, BLAST modules are up-to-date 🙂

Share this: