GenoSNP

[1] Giannoulatou E, Yau C, Colella S, Ragoussis J, Holmes CC. A Variational Bayes within-sample SNP genotyping algorithm that does not require a reference population. Bioinformatics. 2008 Oct 1;24(19):2209-14.



GenoSNP is a genotyping algorithm for the Illumina Infinium SNP genotyping assay [1]. It works entirely within-sample: it requires neither a population of control samples nor parameters derived from such a population. Calling genotypes from within-sample information alone makes the method computationally light and practical for studies involving rare variants and small sample sizes.


GenoSNP works on Illumina Infinium raw intensity data. It can be used for genotyping SNPs from HumanHap300Duo and Duo+, Hap240S, HumanHap550, HumanHap550-Duo, HumanHap650Y and Human1M BeadChip data. It clusters the intensities for each BeadPool (or BeadSet) separately, so the BeadPool assignment of every SNP is required. This information is contained in the .bpm file for the array (provided by Illumina), and it is also listed in the SNP table of GenomeStudio under “Norm ID”. There are 24 different BeadPools for the Human-1, 14 for the Hap240S, 11 for the Hap300, 25 for the Hap550, and 27 for the Hap650Y.

Command Line Arguments

./GenoSNP -snps snpfile.txt -samples samplefile.txt -cutoff 0.7 -calls calls.txt -probs probabilities.txt


Input parameters. Both SNPs and Samples files are mandatory:


-snps mandatory SNPs input file

-samples mandatory Samples input file

-cutoff optional cutoff, default is 0.7



Output parameters. The Calls file is mandatory; the raw probabilities can optionally be written to a second file.


-calls mandatory Calls output file

-probs optional Probabilities output file
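As a minimal sketch of how the options above fit together, the following Python helper assembles the argument list and hands it to the executable. The function names `build_genosnp_command` and `run_genosnp` are hypothetical and not part of GenoSNP; only the flags documented above are used, and the assumption that GenoSNP reports failure through a non-zero exit status has not been verified.

```python
import subprocess


def build_genosnp_command(snps, samples, calls, probs=None, cutoff=None,
                          exe="./GenoSNP"):
    """Assemble the GenoSNP argument list from the documented options.

    This helper is illustrative only; it is not shipped with GenoSNP.
    """
    cmd = [exe, "-snps", snps, "-samples", samples]
    if cutoff is not None:
        # When omitted, GenoSNP uses its documented default of 0.7.
        cmd += ["-cutoff", str(cutoff)]
    cmd += ["-calls", calls]
    if probs is not None:
        # The probabilities file is optional.
        cmd += ["-probs", probs]
    return cmd


def run_genosnp(**kwargs):
    """Run GenoSNP with the given options.

    check=True raises CalledProcessError on a non-zero exit status
    (assumption: GenoSNP signals errors that way).
    """
    return subprocess.run(build_genosnp_command(**kwargs), check=True)
```

For instance, `build_genosnp_command("snpfile.txt", "samplefile.txt", "calls.txt", probs="probabilities.txt", cutoff=0.7)` reproduces the command line shown above.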

GenoSNP can be downloaded for academic research use only. For commercial or other use please contact the authors. Copyright is retained by the University of Oxford.


You can download the C code from here.

A statically linked executable file for Linux (x86_64) is provided here.


Please address all questions and comments to Eleni Giannoulatou at E.Giannoulatou@victorchang.edu.au

The input and output files are tab-separated and have no header line. Values that are closely coupled, such as AlleleA/AlleleB, RawX/RawY or probAA/probAB/probBB, are stored in the same cell, separated by single spaces.


SNPs file

The SNPs input file defines the order of the SNP columns in the Samples file: the n-th row describes the n-th SNP column. It also records the Bead Pool and the Alleles of each SNP.


A tab-separated file with one SNP per row, and alleles separated by a single space.

<SNP_ID>[TAB]<BEADSET_ID>[TAB]<ALLELE_A>[SP]<ALLELE_B>


For example:

rs1256528 225 T G

rs2905062 224 T C

rs1815606 223 A G

rs6603811 225 A G

rs7531583 225 A C
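The SNPs file format can be read with a few lines of Python. The sketch below parses each row and groups the SNP column indexes by BeadSet ID, which mirrors the per-BeadPool organisation described earlier. The function names `parse_snps` and `group_by_beadset` are hypothetical, and the example rows are written with explicit tab characters on the assumption that the file follows the layout given above.

```python
from collections import defaultdict


def parse_snps(lines):
    """Parse SNPs-file rows into (snp_id, beadset_id, allele_a, allele_b)."""
    snps = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines
        snp_id, beadset_id, alleles = line.split("\t")
        allele_a, allele_b = alleles.split(" ")
        snps.append((snp_id, beadset_id, allele_a, allele_b))
    return snps


def group_by_beadset(snps):
    """Map each BeadSet ID to the 0-based indexes of its SNP columns."""
    groups = defaultdict(list)
    for index, (_, beadset_id, _, _) in enumerate(snps):
        groups[beadset_id].append(index)
    return dict(groups)


example = [
    "rs1256528\t225\tT G\n",
    "rs2905062\t224\tT C\n",
    "rs1815606\t223\tA G\n",
    "rs6603811\t225\tA G\n",
    "rs7531583\t225\tA C\n",
]
snps = parse_snps(example)
groups = group_by_beadset(snps)
```

The indexes in `groups` refer to SNP columns in the Samples file, counted from zero after the two ID columns.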


Samples file

The Samples input file is the core data file containing the RawX/RawY values for each Sample. The SNP columns follow the order of the rows in the SNPs file. If only an Individual ID is needed, use the same value for the Family ID.


A tab-separated file with one row per Sample, and the rawX, rawY values separated by a single space.

<FAMILY_ID>[TAB]<INDIVIDUAL_ID>[TAB]<RAW_X_SNP_1>[SP]<RAW_Y_SNP_1>[TAB]<RAW_X_SNP_2>[SP]<RAW_Y_SNP_2> ...and so on as per SNPs file...


For example:

WTD0001-O-A08 WTD0001-O-A08 15999 10834 1284 17773 20689 606 ...

WTD0001-O-B01 WTD0001-O-B01 15600 31048 1528 12968 19610 2118 ...

WTD0001-O-B04 WTD0001-O-B04 15600 21048 1528 12968 19610 2118 ...
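Because the Samples file usually has to be generated from exported intensity data, a small writer can help keep the layout consistent. The following is a sketch only: `format_sample_line` is a hypothetical helper, and it assumes the intensities are already ordered as in the SNPs file.

```python
def format_sample_line(individual_id, intensities, family_id=None):
    """Format one Samples-file row.

    intensities is a sequence of (raw_x, raw_y) pairs, one per SNP, in
    SNPs-file order. If no family ID is given, the individual ID is
    duplicated, as the format description suggests.
    """
    if family_id is None:
        family_id = individual_id
    # X and Y share a cell, separated by a single space.
    cells = [f"{raw_x} {raw_y}" for raw_x, raw_y in intensities]
    # Cells are separated by tabs; there is no header line.
    return "\t".join([family_id, individual_id] + cells)


line = format_sample_line(
    "WTD0001-O-A08",
    [(15999, 10834), (1284, 17773), (20689, 606)],
)
```

Writing one such line per sample, each terminated by a newline, yields a file in the layout shown above.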

Calls file

The Calls output file contains the resulting genotype for each SNP. 'No calls' are marked by 0 0. Otherwise the Alleles are carried over from the SNPs file.


A tab-separated file with one row per Sample, and the allele A & B values separated by a single space.

<FAMILY_ID>[TAB]<INDIVIDUAL_ID>[TAB]<ALLELE_A_SNP_1>[SP]<ALLELE_B_SNP_1>[TAB]<ALLELE_A_SNP_2>[SP]<ALLELE_B_SNP_2> ...and so on as per SNP file...


For example:

WTD0001-O-A08 WTD0001-O-A08 0 0 A G A A G G C G ...

WTD0001-O-B01 WTD0001-O-B01 A A G G C G T C 0 0 ...

WTD0001-O-B04 WTD0001-O-B04 A A G G C G T C 0 0 ...
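To illustrate how the Calls file might be consumed downstream, the sketch below parses one row and computes the fraction of SNPs that received a call. The names `parse_calls_line` and `call_rate` are hypothetical; the only format facts relied on are the tab/space layout and the `0 0` no-call marker described above.

```python
def parse_calls_line(line):
    """Parse one Calls-file row.

    Returns (family_id, individual_id, genotypes), where each genotype is
    a pair of allele strings, or None for a no call ('0 0').
    """
    fields = line.rstrip("\r\n").split("\t")
    genotypes = []
    for cell in fields[2:]:
        allele_1, allele_2 = cell.split(" ")
        if (allele_1, allele_2) == ("0", "0"):
            genotypes.append(None)
        else:
            genotypes.append((allele_1, allele_2))
    return fields[0], fields[1], genotypes


def call_rate(genotypes):
    """Fraction of SNPs with a genotype call (0.0 for an empty list)."""
    if not genotypes:
        return 0.0
    return sum(g is not None for g in genotypes) / len(genotypes)


row = "WTD0001-O-A08\tWTD0001-O-A08\t0 0\tA G\tA A\tG G\tC G"
family_id, individual_id, genotypes = parse_calls_line(row)
rate = call_rate(genotypes)
```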


Probabilities file

The Probabilities output file contains all three genotype probabilities (AA, AB, BB) for each SNP.


A tab-separated file with one row per Sample, and the probabilities AA, AB, BB separated by single spaces.

<FAMILY_ID>[TAB]<INDIVIDUAL_ID>[TAB]<PROB_AA_SNP_1>[SP]<PROB_AB_SNP_1>[SP]<PROB_BB_SNP_1>[TAB]<PROB_AA_SNP_2>[SP]<PROB_AB_SNP_2>[SP]<PROB_BB_SNP_2> ...and so on as per SNP file...


For example:

WTD0001-O-A08 WTD0001-O-A08 0.999212 0.000716231 7.20932e-05 0.999212 0.000716231 7.20932e-05 ...

WTD0001-O-B01 WTD0001-O-B01 0.999991 5.65942e-06 3.80093e-06 0.999991 5.65942e-06 3.80093e-06 ...

WTD0001-O-B04 WTD0001-O-B04 0.999991 5.65942e-06 3.80093e-06 0.999991 5.65942e-06 3.80093e-06 ...
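The Probabilities file can be parsed in the same way as the other files. The sketch below reads one row into per-SNP triples and picks the most probable genotype. Note the assumption: the page does not state how `-cutoff` is applied, so comparing the largest probability against the cutoff is a guess at the intended behaviour, not GenoSNP's confirmed rule. The names `parse_probs_line` and `most_probable` are hypothetical.

```python
GENOTYPES = ("AA", "AB", "BB")


def parse_probs_line(line):
    """Parse one Probabilities-file row into (family, individual, triples)."""
    fields = line.rstrip("\r\n").split("\t")
    triples = []
    for cell in fields[2:]:
        triple = tuple(float(p) for p in cell.split(" "))
        if len(triple) != 3:
            raise ValueError(f"expected 3 probabilities, got {len(triple)}")
        triples.append(triple)
    return fields[0], fields[1], triples


def most_probable(triple, cutoff=0.7):
    """Return 'AA', 'AB' or 'BB', or None if the best probability < cutoff.

    Assumption: the cutoff is compared with the largest of the three
    probabilities. This is not confirmed by the GenoSNP documentation.
    """
    best = max(range(3), key=lambda i: triple[i])
    return GENOTYPES[best] if triple[best] >= cutoff else None


row = "WTD0001-O-A08\tWTD0001-O-A08\t0.999212 0.000716231 7.20932e-05"
_, _, triples = parse_probs_line(row)
labels = [most_probable(t) for t in triples]
```

The labels here are the abstract genotypes AA/AB/BB; mapping them to nucleotides would use the alleles recorded in the SNPs file.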

The MATLAB version of GenoSNP is available here. Please consult the included README file for input/output data formats.


You can also find some R functions for plotting here. These use the previous version of the input/output format, so please consult the included README file.