Copyright Notice

make_random is Copyright (C) 2001- Magnus Palmblad

make_random is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

About make_random

make_random generates a random ("decoy") database from the amino acid frequencies of a real database in FASTA format and writes the result in FASTA format. The random database has the same amino acid frequencies at any given position in the sequence, and the same protein size distribution, as the original database. This may be useful for estimating false positive rates in protein identification.

Figure 1. Markov chain used to generate random sequences with the same amino acid, tryptic peptide size and protein size distributions as a true database. The † represents termination of the sequence. All transition probabilities are derived from the true database.
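
The idea of drawing residues from observed frequencies until a termination symbol is reached can be illustrated with a minimal sketch. This is not the code in make_random.c and not the exact chain of Figure 1: it assumes a simplified model with one frequency table per sequence position plus a terminator slot, and it makes no attempt to preserve the tryptic peptide size distribution. All names (PositionModel, model_add, model_sample, model_generate, MAX_LEN) are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LEN    64   /* illustrative cap on sequence length (assumption) */
#define N_SYMBOLS  27   /* 'A'..'Z' plus one terminator slot */
#define TERMINATOR 26   /* index of the end-of-sequence symbol */

/* counts[i][s]: how often symbol s was observed at position i.
   The structure must be zeroed (e.g. with memset) before use. */
typedef struct {
    unsigned long counts[MAX_LEN + 1][N_SYMBOLS];
} PositionModel;

/* Add one training sequence (uppercase letters) to the model. */
void model_add(PositionModel *m, const char *seq)
{
    size_t len = strlen(seq);
    if (len > MAX_LEN)
        len = MAX_LEN;              /* over-long input is truncated in this sketch */
    for (size_t i = 0; i < len; i++) {
        if (seq[i] >= 'A' && seq[i] <= 'Z')
            m->counts[i][seq[i] - 'A']++;
    }
    m->counts[len][TERMINATOR]++;   /* the sequence ended at this position */
}

/* Draw one symbol index at position pos, proportional to the observed counts.
   Returns TERMINATOR if nothing was observed at this position. */
int model_sample(const PositionModel *m, size_t pos)
{
    unsigned long total = 0;
    assert(pos <= MAX_LEN);
    for (int s = 0; s < N_SYMBOLS; s++)
        total += m->counts[pos][s];
    if (total == 0)
        return TERMINATOR;
    /* rand() with modulo is slightly biased; acceptable for an illustration */
    unsigned long r = (unsigned long)rand() % total;
    for (int s = 0; s < N_SYMBOLS; s++) {
        if (r < m->counts[pos][s])
            return s;
        r -= m->counts[pos][s];
    }
    return TERMINATOR;              /* not reached when total > 0 */
}

/* Generate one random sequence into out (capacity MAX_LEN + 1).
   Returns the length of the generated sequence. */
size_t model_generate(const PositionModel *m, char *out)
{
    size_t i = 0;
    while (i < MAX_LEN) {
        int s = model_sample(m, i);
        if (s == TERMINATOR)
            break;
        out[i++] = (char)('A' + s);
    }
    out[i] = '\0';
    return i;
}
```

Because termination is sampled like any other symbol, the lengths of the generated sequences follow the lengths seen in the training data, which is how a single model can reproduce both residue frequencies and the size distribution.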


Compiling make_random

Compile make_random with: gcc -o make_random make_random.c -lm

Using make_random

Usage: make_random <sequence database> <number of sequences to generate> <output>

where <sequence database> is the original sequence database in FASTA format, <number of sequences to generate> is the number of random sequences to write, and <output> is the file to which the random database will be written (also in FASTA format).
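
If the random database should contain as many entries as the original, the value for <number of sequences to generate> can be found by counting the header lines of the FASTA file. The following is a minimal sketch of such a count, assuming every record starts with a '>' in the first column of a line. The function name count_fasta_records is hypothetical and is not part of make_random.

```c
#include <assert.h>
#include <stdio.h>

/* Count FASTA records by counting lines that begin with '>'.
   fp must be an open stream positioned at the start of the file. */
long count_fasta_records(FILE *fp)
{
    long n = 0;
    int c;
    int at_line_start = 1;   /* the first character of the file starts a line */

    assert(fp != NULL);
    while ((c = fgetc(fp)) != EOF) {
        if (at_line_start && c == '>')
            n++;
        at_line_start = (c == '\n');
    }
    return n;
}
```

A '>' that appears in the middle of a line is ignored, so only true header lines are counted.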

For example, from the A. thaliana sequence database A_thaliana_20070220.fasta, a random database of 34555 sequences (the same number as in the original database) may be generated with: make_random A_thaliana_20070220.fasta 34555 A_thaliana_random.fasta (the output file name is only an example). The false positive rate in protein identification can then be estimated by searching the same data against the true and the random databases with identical search parameters. If the original database contains many redundant entries, the randomized database will have more unique peptides than the original, which may lead to an overestimation of the false positive rate.
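
One simple way to turn the two sets of search results into an estimate is sketched below. It rests on several assumptions not stated above: both databases are of equal size, each search yields one score per identification, a higher score is better, and every hit against the random database is a false positive. The estimate at a given score threshold is then the number of random-database hits passing the threshold divided by the number of true-database hits passing it. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of scores at or above the threshold. */
size_t count_at_or_above(const double *scores, size_t n, double threshold)
{
    size_t k = 0;
    assert(scores != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (scores[i] >= threshold)
            k++;
    }
    return k;
}

/* Estimated false positive rate at a score threshold:
   (random-database hits passing) / (true-database hits passing).
   Returns -1.0 when no true-database hit passes, since the ratio is undefined. */
double estimate_fpr(const double *true_scores, size_t n_true,
                    const double *random_scores, size_t n_random,
                    double threshold)
{
    size_t passed_true = count_at_or_above(true_scores, n_true, threshold);
    size_t passed_random = count_at_or_above(random_scores, n_random, threshold);

    if (passed_true == 0)
        return -1.0;
    return (double)passed_random / (double)passed_true;
}
```

With true-database scores {10, 20, 30, 40}, random-database scores {5, 12, 35} and a threshold of 25, two true hits and one random hit pass, giving an estimated false positive rate of 0.5.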

N.B. The expectation maximization in PeptideProphet does not work with random databases. Instead, PeptideProphet estimates the probability that each peptide identification is correct from approximated score distributions for correct and incorrect matches in the search results.