GEMINI: The Genomic Search Engine

GEMINI is an open-source bioinformatics tool and website, written in Python, that facilitates nearest-neighbor searching of genomic data. This website is currently under construction. Use the Submit a Query panel to search one of our datasets, or read through the tutorial to learn more.

Submit a query

Queries must use the .hdf5 file format. Click here to download an example.

GEMINI Tutorial

GEMINI lets users search Level 3 gene expression datasets from The Cancer Genome Atlas (TCGA) more effectively by using the data itself as the query. Rather than performing a keyword search, GEMINI measures the similarity between your data and existing TCGA samples to determine the most relevant results. To submit a query, upload your data to the Query section in one of the following formats:

  • An HDF5 file, with three datasets:
    • 'Sample': an array with one entry, the sample ID for this query
    • 'Feature': an array of gene IDs
    • 'Data': an array of gene expression values, one for each gene in 'Feature'
  • A tab-delimited .txt file with Level 3 gene expression data from the TCGA
  • A comma-separated (.csv) file with a header and two columns for GeneID and expression value
The query submission box provides an example of each of the above formats. All formats use the same set of genes: those measured on the Agilent G4502A microarray platform.
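The formats above can be produced with a few lines of Python. The following is a minimal sketch, not GEMINI's own code. The function names, the CSV header labels, the byte-string encoding of IDs, and the gene IDs in the usage note are all illustrative assumptions; only the dataset names 'Sample', 'Feature', and 'Data' come from the format description. The HDF5 writer relies on the third-party h5py package.

```python
import csv
import io


def csv_query_text(gene_ids, values):
    """Build the two-column CSV query (header, then one row per gene)."""
    if len(gene_ids) != len(values):
        raise ValueError("gene_ids and values must have the same length")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Header labels are an assumption; the tutorial only says the file
    # has a header and columns for GeneID and expression value.
    writer.writerow(["GeneID", "Expression"])
    writer.writerows(zip(gene_ids, values))
    return buffer.getvalue()


def write_hdf5_query(path, sample_id, gene_ids, values):
    """Write an HDF5 query with the 'Sample', 'Feature' and 'Data' datasets."""
    import h5py  # third-party; imported here so the CSV helper works without it

    if len(gene_ids) != len(values):
        raise ValueError("gene_ids and values must have the same length")
    with h5py.File(path, "w") as f:
        # Storing IDs as UTF-8 byte strings is an assumption about what
        # the server expects.
        f.create_dataset("Sample", data=[sample_id.encode("utf-8")])
        f.create_dataset("Feature", data=[g.encode("utf-8") for g in gene_ids])
        f.create_dataset("Data", data=[float(v) for v in values])
```

For example, `csv_query_text(["TP53", "BRCA1"], [1.5, -0.25])` returns a header line followed by one line per gene, and `write_hdf5_query("query.hdf5", "my-sample", ["TP53", "BRCA1"], [1.5, -0.25])` writes the equivalent HDF5 file. A real query would list every gene on the platform rather than two placeholders.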

GEMINI organizes samples into distinct searchable datasets, such as OV for ovarian cancer samples. Once you have uploaded your query, select a dataset from the dropdown menu and click the search button.

Sample Results

Example query results

GEMINI returns a list of the ten nearest neighbors in the dataset, ranked by a similarity function. Currently, GEMINI determines similarity by combining principal component analysis with the Euclidean distance between samples, though more sophisticated metrics will be added in the future. The results page also shows a visual representation of the nearest neighbors' top ten principal components. Future work will enable users to view and compare associated sample information, such as clinical outcomes.
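One plausible reading of the similarity search described above is: project every sample onto the dataset's leading principal components, then rank samples by Euclidean distance to the projected query. The sketch below uses NumPy and assumes that reading. The function name, the choice of ten components, and the SVD-based PCA are illustrative assumptions, not GEMINI's actual implementation.

```python
import numpy as np


def top_neighbors(dataset, query, n_components=10, k=10):
    """Return (indices, distances) of the k samples nearest to the query.

    dataset: array of shape (n_samples, n_genes)
    query:   array of shape (n_genes,), same gene order as dataset
    """
    dataset = np.asarray(dataset, dtype=float)
    query = np.asarray(query, dtype=float)

    # Center on the dataset mean; the query is shifted by the same mean.
    mean = dataset.mean(axis=0)
    centered = dataset - mean

    # PCA via SVD: the rows of vt are the principal axes, ordered by
    # decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]

    # Project the samples and the query into the reduced space.
    projected = centered @ components.T
    projected_query = (query - mean) @ components.T

    # Rank the samples by Euclidean distance to the query.
    distances = np.linalg.norm(projected - projected_query, axis=1)
    order = np.argsort(distances)[:k]
    return order, distances[order]
```

Calling `top_neighbors(dataset, query)` with a samples-by-genes matrix returns the row indices of the ten closest samples together with their distances, sorted from nearest to farthest. Comparing samples in the reduced space keeps the distance computation cheap and discards low-variance directions, at the cost of ignoring any differences that fall outside the retained components.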