(sometimes) useful one-liners in mass spectrometry or proteomics
First of all, congratulations on finding this page! These one-liners are not really recommended for any serious work, but they illustrate what can be achieved with very simple means in bash.
awk 'BEGIN {FS=""} /^>/ {next} {for(i=1;i<=NF;i++) {A[$i]++; n++}} END {for(i in A) print i,A[i]/n}'
counts the frequencies of symbols (amino acids or nucleotides) in a FASTA file
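Splitting on an empty FS works in gawk and mawk, but POSIX leaves that behaviour unspecified. A more portable sketch of the same count walks each line with substr() instead. The function name symbol_freq, the four-decimal output and the final sort are choices made for this example only.

```shell
# Hypothetical helper: relative frequency of every symbol in a FASTA stream.
# Reads FASTA on stdin, skips header lines, prints "symbol frequency" pairs.
symbol_freq() {
  awk '
    /^>/ { next }                      # ignore header lines
    {
      for (i = 1; i <= length($0); i++) {
        count[substr($0, i, 1)]++      # one counter per symbol
        total++
      }
    }
    END {
      for (s in count) printf "%s %.4f\n", s, count[s] / total
    }
  ' | sort                             # array order is arbitrary, so sort
}
```

Used as `symbol_freq < proteins.fasta`, where proteins.fasta stands for any FASTA file you have at hand.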
sed 's/>.*/#/'|tr -d '\n'|sed 's/#/\nYOCED>\n/g'|rev|fold -60|sed 1d
reverses each sequence in a FASTA file to do decoy searches (all decoy proteins have the same name: DECOY)
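If the decoys should stay traceable to their targets, a longer sketch can keep each original header and only add a prefix. The function name decoy_fasta, the DECOY_ prefix and the 60-column line width are assumptions for illustration, not part of the one-liner above.

```shell
# Hypothetical helper: reverse every sequence in a FASTA stream and
# write it back under its own header, prefixed with DECOY_.
decoy_fasta() {
  awk '
    function emit(   i, rev) {
      if (!seen) return                # nothing collected yet
      rev = ""
      for (i = length(seq); i > 0; i--) rev = rev substr(seq, i, 1)
      print ">DECOY_" hdr
      for (i = 1; i <= length(rev); i += 60) print substr(rev, i, 60)
    }
    /^>/ { emit(); hdr = substr($0, 2); seq = ""; seen = 1; next }
    { seq = seq $0 }                   # join wrapped sequence lines
    END { emit() }                     # flush the last entry
  '
}
```

Used as `decoy_fasta < target.fasta > decoy.fasta`, with both file names being placeholders.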
awk 'BEGIN {RS=">[^\n]+\n"} {gsub("\n","");} /[WLIFVMC]K.[DE]/ {N++} END {print N}'
finds and counts the protein sequences in a FASTA file with a SUMOylation motif (beware of the first empty record)
awk 'BEGIN {RS=">[^\n]+\n"} {gsub("\n","");} {N+=gsub(/[WLIFVMC]K.[DE]/,1)} END {print N}'
as above, but counts the total number of sites
awk 'BEGIN {RS=">[^\n]+\n"} {gsub("\n","");} /N[^P][ST]/ {N++} END {print N}'
finds and counts the protein sequences in a FASTA file with at least one classical N-glycosylation motif
awk 'BEGIN {RS=">"} {P=substr($0,4,6);} {gsub("[st][^\n]+\n|\n",""); print P,gsub(/N[^P][ST]/,1)}'
prints the number of N-glycosylation motifs in each protein, with the protein accession numbers, from a UniProt FASTA file
awk 'BEGIN {RS=">"} {P=substr($0,4,6);} {gsub("[st][^\n]+\n|\n",""); print P,gsub(/N[^P][ST]/,1)}'|grep -E '^[A-Z0-9]{6}'
like the above, but robust against sequence headers that contain additional '>' characters or text matching the motif
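Two details of the per-protein counters above are worth a longer sketch. gsub() counts non-overlapping matches, so NNTT is reported as one motif although both asparagines sit in a sequon, and a fixed six-character substr() truncates the newer ten-character UniProt accessions. The version below tests every position and takes the accession from between the first two '|' characters. The name count_sequons is hypothetical, and the header layout >db|ACCESSION|NAME is assumed.

```shell
# Hypothetical helper: per-entry count of N[^P][ST] sequons, overlaps included.
# Assumes UniProt-style headers of the form >db|ACCESSION|ENTRY_NAME ...
count_sequons() {
  awk '
    function report(   i, n) {
      if (!seen) return
      n = 0
      for (i = 1; i + 2 <= length(seq); i++)
        if (substr(seq, i, 1) == "N" &&
            substr(seq, i + 1, 1) != "P" &&
            substr(seq, i + 2, 1) ~ /^[ST]$/) n++
      print id, n
    }
    /^>/ { report(); split($0, h, "|"); id = h[2]; seq = ""; seen = 1; next }
    { seq = seq $0 }                   # join wrapped sequence lines
    END { report() }                   # flush the last entry
  '
}
```

Used as `count_sequons < uniprot.fasta`, where uniprot.fasta is a placeholder file name.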
grep -oP '[pr]\|\K.{6}'
extracts all accession numbers for all entries in a UniProt FASTA file
grep '^>'|cut -c5-10
same as the above
awk 'BEGIN {FS="[ =\t]+"} /^PEPM/{print $1"="$2-$2*24E-6,$3;next}/^[0-9]/{print $1+$1*75E-6,$2,$3;next}{print}'
(statistical) recalibration of an MGF file, shifting MS1 m/z values by -24 ppm and MS2 m/z values by +75 ppm
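One caveat when printing computed m/z values: awk formats non-integer numbers with OFMT, which defaults to %.6g, so 1234.5678 comes out as 1234.57. That is coarser than the ppm shifts being applied. A sketch that formats the masses explicitly could look like this. The function name recalibrate_mgf, its two arguments (MS1 and MS2 shift in ppm) and the five decimals are assumptions for illustration.

```shell
# Hypothetical helper: shift precursor and fragment m/z values in an MGF stream.
# $1 = MS1 shift in ppm (default -24), $2 = MS2 shift in ppm (default 75).
recalibrate_mgf() {
  awk -v ms1="${1:--24}" -v ms2="${2:-75}" '
    /^PEPMASS=/ {                      # precursor line: PEPMASS=mz [intensity]
      split($1, kv, "=")
      $1 = sprintf("PEPMASS=%.5f", kv[2] * (1 + ms1 * 1e-6))
      print; next
    }
    /^[0-9]/ {                         # peak line: mz intensity [charge]
      $1 = sprintf("%.5f", $1 * (1 + ms2 * 1e-6))
      print; next
    }
    { print }                          # everything else passes through
  '
}
```

Used as `recalibrate_mgf -24 75 < in.mgf > out.mgf`, with placeholder file names. Rewritten lines come out space-separated.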
sed 1d|tr -d '\n'|awk 'BEGIN {RS="[RK]"} (NR==1)&&/.{5,}/ {print $1 RT; next} /[^P].{4,}/ {print $1 RT}'
in silico digestion (minimum 6 amino acids, no missed cleavages) of a protein sequence in FASTA format with trypsin (works with default Cygwin AWK)
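For comparison, here is a longer and more explicit sketch of the same digestion. It cleaves after K or R unless the next residue is P and keeps peptides of at least a minimum length. The function name tryptic_digest and its optional length argument are assumptions for illustration. Like the one-liner, it expects a single protein per input.

```shell
# Hypothetical helper: tryptic digestion of one protein in FASTA format,
# no missed cleavages, no cleavage before proline.
# $1 = minimum peptide length (default 6).
tryptic_digest() {
  awk -v min="${1:-6}" '
    /^>/ { next }                      # skip the header line
    { seq = seq $0 }                   # join wrapped sequence lines
    END {
      pep = ""
      n = length(seq)
      for (i = 1; i <= n; i++) {
        aa = substr(seq, i, 1)
        pep = pep aa
        if ((aa == "K" || aa == "R") && substr(seq, i + 1, 1) != "P") {
          if (length(pep) >= min) print pep
          pep = ""
        }
      }
      if (length(pep) >= min) print pep   # C-terminal peptide
    }
  '
}
```

Used as `tryptic_digest < protein.fasta`, or `tryptic_digest 8 < protein.fasta` for a minimum of eight residues. The file name is a placeholder.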
awk 'BEGIN {OFS="\t"} ($1~/[0-9]\./) {print $1,100*$5/($4+$5),$6/14.5}'
extracts the fraction of mobile phase B (in %) and the column pressure (in bar) as a function of time (in minutes) from an Eksigent autosave file and writes tab-delimited output. In other words, it finds the table in this file, performs one calculation and two unit conversions, and formats the output for Microsoft Excel
If you find any of the above fun, interesting or useful, or have suggestions for improvements or other potentially handy one-liners, please feel free to e-mail me (Magnus Palmblad) at first name dot last name at google dot com. I am particularly curious to find the minimal one-liner for tasks such as digesting a protein sequence with trypsin or reversing a FASTA database (as above), and for other common, well-defined tasks. Prizes may be awarded to particularly elegant solutions.