(sometimes) useful one-liners in mass spectrometry or proteomics


First of all, congratulations on finding this page! These one-liners are not really recommended for any serious work, but can be used to illustrate what can be achieved with very simple means in bash.

awk 'BEGIN {FS=""} /^>/ {next} {for(i=1;i<=NF;i++) {A[$i]++; n++}} END {for(i in A) print i,A[i]/n}'
counts the frequencies of symbols (amino acids or nucleotides) in a FASTA file
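
The one-liner above relies on an empty FS splitting a line into single characters, which works in gawk but is not guaranteed in every awk. As a rough sketch only, the same count can be written with substr(); the function name symbol_freq, the two-decimal output format and the toy input are assumptions made for illustration, not part of the original.

```shell
# symbol_freq (hypothetical name): relative frequency of each symbol
# in the FASTA data given as file arguments or on stdin.
symbol_freq() {
  awk '/^>/ {next}                                   # skip header lines
       {for (i = 1; i <= length($0); i++) {          # walk the line one character at a time
          A[substr($0, i, 1)]++; n++}}
       END {for (s in A) printf "%s %.2f\n", s, A[s] / n}' "$@" | sort
}

# Toy input: one sequence wrapped over two lines (A=2, G=1, C=1)
printf '>toy\nAAG\nC\n' | symbol_freq
# A 0.50
# C 0.25
# G 0.25
```

The final sort is only there because awk does not define the order of "for (s in A)".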

sed 's/>.*/#/'|tr -d '\n'|sed 's/#/\nYOCED>\n/g'|rev|fold -60|sed 1d
reverses each sequence in a FASTA file to do decoy searches (all decoy proteins have the same name: DECOY)

awk 'BEGIN {RS=">[^\n]+\n"} {gsub("\n","");} /[WLIFVMC]K.[DE]/ {N++} END {print N}'
finds and counts the protein sequences in a FASTA file with a SUMOylation motif (beware of the first empty record)

awk 'BEGIN {RS=">[^\n]+\n"} {gsub("\n","");} {N+=gsub(/[WLIFVMC]K.[DE]/,1)} END {print N}'
as above, but counts the total number of sites

awk 'BEGIN {RS=">[^\n]+\n"} {gsub("\n","");} /N[^P][ST]/ {N++} END {print N}'
finds and counts the protein sequences in a FASTA file with at least one classical N-glycosylation motif

awk 'BEGIN {RS=">"} {P=substr($0,4,6);} {gsub("[st][^\n]+\n|\n",""); print P,gsub(/N[^P][ST]/,1)}'
prints the number of N-glycosylation motifs in each protein, together with the protein accession number, from a UniProt FASTA file

awk 'BEGIN {RS=">"} {P=substr($0,4,6);} {gsub("[st][^\n]+\n|\n",""); print P,gsub(/N[^P][ST]/,1)}'|grep -E '^[A-Z0-9]{6}'
like the above, but robust against sequence headers that contain additional '>' characters and happen to match the motif

grep -oP '[pr]\|\K.{6}'
extracts all accession numbers for all entries in a UniProt FASTA file

grep '^>'|cut -c5-10
same as the above

awk 'BEGIN {FS="[ =\t]+"} /^PEPM/{print $1"="$2-$2*24E-6,$3;next}/^[0-9]/{print $1+$1*75E-6,$2,$3;next}{print}'
(statistical) recalibration of an MGF file, shifting MS1 m/z values by -24 ppm and MS2 m/z values by +75 ppm
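
To make the ppm arithmetic in the one-liner above concrete: a shift of p ppm turns m/z into m/z + m/z * p * 1E-6. A minimal sketch, where the helper name ppm_shift and the four-decimal output are assumptions chosen for illustration:

```shell
# ppm_shift MZ PPM (hypothetical helper): print MZ shifted by PPM parts per million
ppm_shift() {
  awk -v mz="$1" -v ppm="$2" 'BEGIN { printf "%.4f\n", mz + mz * ppm * 1e-6 }'
}

ppm_shift 1000 -24   # the MS1 shift used above: 999.9760
ppm_shift 500 75     # the MS2 shift used above: 500.0375
```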

sed 1d|tr -d '\n'|awk 'BEGIN {RS="[RK]"} (NR==1)&&/.{5,}/ {print $1 RT; next} /^[^P].{4,}/ {print $1 RT}'
in silico digestion (minimum 6 amino acids, no missed cleavages) of a protein sequence in FASTA format with trypsin (works with default Cygwin AWK)
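
For comparison, a longer and more explicit sketch of the same task that does not depend on gawk's RT variable. It is not a one-liner and not the method above: it walks the sequence residue by residue, cuts after K or R unless the next residue is P (so a peptide continues across K-P and R-P), and prints peptides of at least a minimum length. The function name digest and its optional length argument are assumptions for illustration; like the one-liner, it expects a single protein.

```shell
# digest [MINLEN] (hypothetical name): tryptic peptides of one FASTA protein on stdin
digest() {
  awk -v min="${1:-6}" '
    /^>/ {next}                      # skip the header line
    {seq = seq $0}                   # join wrapped sequence lines
    END {
      pep = ""
      n = length(seq)
      for (i = 1; i <= n; i++) {
        c = substr(seq, i, 1)
        pep = pep c
        # cleave after K or R unless the next residue is P
        if ((c == "K" || c == "R") && substr(seq, i + 1, 1) != "P") {
          if (length(pep) >= min) print pep
          pep = ""
        }
      }
      if (length(pep) >= min) print pep   # C-terminal peptide, if long enough
    }'
}

printf '>toy\nMAGTEKPLVRSSDEFGHK\nAAK\n' | digest
# MAGTEKPLVR
# SSDEFGHK
```

In the toy sequence the K in "EKP" is not cleaved, and the final "AAK" is dropped because it is shorter than six residues.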

awk 'BEGIN {OFS="\t"} ($1~/[0-9]\./) {print $1,100*$5/($4+$5),$6/14.5}'
extracts the fraction of mobile phase B in % and the column pressure in bar as a function of time in minutes from an Eksigent autosave file and outputs a tab-delimited file; in other words, it finds the table in this file, performs one calculation and two unit conversions, and formats the output for Microsoft Excel
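
A worked example of the arithmetic in the one-liner above, on a made-up row. The column meanings are inferred from the command itself and are assumptions: $1 is time, $4 and $5 are the flows of channels A and B, and $6 is pressure in psi (1 bar is about 14.5 psi). The function name gradient_table and the sample header and values are hypothetical and do not reproduce a real autosave file.

```shell
# gradient_table (hypothetical name): time, %B and pressure in bar, tab-delimited
gradient_table() {
  awk 'BEGIN {OFS="\t"} ($1~/[0-9]\./) {print $1, 100*$5/($4+$5), $6/14.5}' "$@"
}

# The header row is skipped because its first field contains no "digit followed by a dot"
printf 'Time c2 c3 Qa Qb P\n1.0 0 0 150 50 2900\n' | gradient_table
# 1.0	25	200
```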


If you find any of the above fun, interesting or useful, or have suggestions for improvements or other potentially handy one-liners, please feel free to e-mail me (Magnus Palmblad) on first name dot last name at google dot com. I am particularly curious to find the minimum one-liner to perform tasks such as digesting a protein sequence with trypsin or reversing a FASTA database (as above) and other common, well-defined tasks. Prizes may be awarded to particularly elegant solutions.