YeastPhenome.org

About
Criteria for inclusion
Strategy for identifying relevant data sources
Collecting and organizing screen meta-data
Collecting and organizing screen data
Tested mutants
Data handling
Data normalization
Computing correlations
Data usage and citation policy

About

The Yeast Phenome is a collaborative project dedicated to building and maintaining a comprehensive compendium of systematic loss-of-function phenotypes for the budding yeast Saccharomyces cerevisiae.

As part of this effort, we systematically track, collect, and annotate all published phenotypic screens utilizing the yeast knock-out collection. We extract the data, standardize their format and structure, and make them readily available for search, download, and analysis. Our primary goal is to empower ourselves and the wider research community to answer key questions, such as:

Has phenotype X been measured systematically for all knock-out mutants? If so, how many times? Under what conditions? Where were the results published? Where can the data be downloaded?
Have any phenotypes related to X been investigated? Do any phenotypes appear to be related to X through co-occurrence in the same mutants?
What genes exhibit the highest/lowest values for phenotype X? How does gene Y rank relative to all other genes? What other phenotypes have been reported for gene Y? Do any other genes share similar phenotypic profiles?

For a more in-depth discussion of our motivations, the principles of data librarianship, and the global impact that similar initiatives could have on biomedical research, please refer to our 2020 Viewpoint article published in The FEBS Journal.

Criteria for inclusion

To be included in Yeast Phenome, a phenotypic screen of the yeast knock-out (YKO) collection must meet all of the following criteria:

Measure a quantitative (continuous) or qualitative (binary, discrete, categorical) phenotype (or the lack thereof) for at least 1,000 knock-out mutants. (Note: 1,000 is a cutoff defined empirically by examining the distribution of tested mutants in an early version of Yeast Phenome.)
Use any current or past version of the haploid Mat-a (BY4741), haploid Mat-alpha (BY4742) or homozygous diploid (BY4743) deletion collection.
Report data on all tested mutants or just the strongest hits (these may include mutants with the largest effect sizes, highest reproducibility or most confident deviations from wild-type).
Be associated with a journal article indexed in PubMed.

Yeast Phenome does not include:

Large-scale screens of other mutant collections (e.g., DAmP strains, temperature sensitive mutants or the prototrophic variant of the knock-out library).
Studies that report an arbitrary subset of screen results chosen based on biological interest, rather than signal strength or confidence.
Genetic interaction screens, i.e. screens where knock-out mutants are examined under a secondary genetic perturbation.

Strategy for identifying relevant data sources

To identify phenotypic screens that fit these criteria, we developed a comprehensive search strategy and applied it systematically over a 10-year period (2012–2022). As a starting point, we searched the Saccharomyces Genome Database (SGD) for gene phenotypes associated with terms such as “systematic mutation set” and “competitive growth”, and compiled a preliminary list of publications that reported phenotypic screens of the YKO collection. By curating these publications and parsing their citations, we discovered many additional publications reporting YKO screens. Furthermore, we examined the publication records of research labs that have released numerous YKO screens and made sure that we captured all of their publications in this domain.

We incorporated relevant YKO screens from existing repositories such as the Yeast Functional Genomics Database (YFGdb), the Database for High Throughput Screening hits (dHITS), ScreenTroll and FitSearch.

We received pointers to potentially relevant papers from BioGRID curators and set up automated PubMed queries for keywords such as “yeast knockout collection”, “yeast deletion collection” and “phenotypic screen”.

Collecting and organizing screen meta-data

During the curation process, each publication is associated with a list of screens. Each screen is then annotated with extensive meta-data that capture the type of collection (haploid Mat-a, haploid Mat-alpha or homozygous diploid), the phenotype, the experimental environment (including the growth media), the type of released data (quantitative or discrete) and the source from which the data were obtained (see below).

The phenotypes and the environments are recorded using a set of controlled vocabularies, i.e. unique terms chosen to avoid duplications and errors. For example, a chemogenomic screen for hydroxyurea was annotated with the phenotype “growth” and the environment “hydroxyurea”.

Whenever applicable, phenotypes are linked to reporters, i.e. specific readouts through which the phenotype was assessed (e.g., the phenotype “unfolded protein response” has the reporter “UPRE-GFP” since the activation of the response is measured via GFP expression driven by a UPRE promoter).

Environments, especially chemical and physical perturbations, are associated with a dose (e.g., environment “hydroxyurea” with dose “100 mM”) and alternative names used in the literature (e.g., “hydroxyurea” and “HU”). Whenever available, chemical compounds are also linked with external identifiers from ChEBI and PubChem.
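As an illustration only, the annotation scheme described above could be modeled with a few small record types. This is a minimal sketch, not the actual Yeast Phenome schema: the class and field names are hypothetical, and the collection, data type and source values in the example are placeholders chosen for illustration. Only the hydroxyurea terms, dose and alias come from the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Phenotype:
    name: str                       # controlled-vocabulary term, e.g. "growth"
    reporter: Optional[str] = None  # readout, e.g. "UPRE-GFP", when applicable


@dataclass(frozen=True)
class Environment:
    name: str                        # controlled-vocabulary term, e.g. "hydroxyurea"
    dose: Optional[str] = None       # e.g. "100 mM"
    aliases: Tuple[str, ...] = ()    # alternative names used in the literature
    chebi_id: Optional[str] = None   # external identifiers, left unset in this sketch
    pubchem_id: Optional[str] = None


@dataclass(frozen=True)
class Screen:
    collection: str  # "haploid Mat-a", "haploid Mat-alpha" or "homozygous diploid"
    phenotype: Phenotype
    environment: Environment
    data_type: str   # "quantitative" or "discrete"
    source: str      # where the data were obtained


# Hypothetical record for a hydroxyurea chemogenomic screen.
hu_screen = Screen(
    collection="haploid Mat-a",        # placeholder value
    phenotype=Phenotype(name="growth"),
    environment=Environment(name="hydroxyurea", dose="100 mM", aliases=("HU",)),
    data_type="quantitative",          # placeholder value
    source="supplementary material",   # placeholder value
)
```

Keeping phenotype and environment as separate records mirrors the use of two controlled vocabularies, so the same term can be reused across screens without duplication.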

Missing meta-data. A set of 1,719 chemical compounds, screened in Hoepfner D~Movva NR, 2014, are proprietary. The names or chemical structures of these compounds were not released in the publication and, as such, are not available in Yeast Phenome.

Collecting and organizing screen data

In addition to meta-data describing the parameters of the experiment, each screen is linked to its corresponding data, i.e. the list of tested mutants and their phenotypic values. The data are obtained from one of three main sources:

the main text (e.g., list of hits reported in a table or figure)
the supplementary material (e.g., an Excel table or PDF file)
a website associated with the publication (typical for larger datasets)

After retrieving the data, we evaluate their completeness relative to the experiment described in the publication and determine whether additional data might be available. An example of "incomplete" data is a case where the publication describes the measurement of a quantitative phenotype but only reports a binary list of hits. Similarly, if a publication reports only the list of hits but not the list of tested strains, the data are considered "incomplete".

In all cases of apparent data incompleteness, we contact the authors and ask them to share the missing data with Yeast Phenome. Since, in most cases, these data are unpublished, we ask the authors to give us explicit permission to upload the data onto Yeast Phenome. The authors who agreed to share data relevant to any given screen are acknowledged on the screen’s page, as well as on our Data contributors page.

Tested mutants

By definition, all screens in Yeast Phenome used the YKO collection and tested the vast majority of the ~5,000 non-essential gene knock-outs (the biggest exception is Kemmeren P~Holstege FC, 2014, which tested only ~1,500 knock-out mutants). However, our preliminary analyses showed that the composition of the YKO collection varied over time and between labs. Furthermore, many screens had to exclude small subsets of the collection for technical or biological reasons (e.g., failure to transform the strain with a plasmid carrying a fluorescent reporter). As a result of such limitations, the set of mutants tested in one screen could differ from that of another screen by as much as 20%.

Variation in tested space can prevent an accurate interpretation of the screen results: it may be unclear whether a particular gene is absent from a screen’s hit list because it didn’t show a strong phenotype (or didn’t validate at re-testing), or because it was never tested in the first place.

To address this issue, we did our best to recover a list of tested strains for as many screens as possible. Whenever the list was not released as part of the original publication, we contacted the authors via email (see above). If the authors were unable to provide the list, we estimated the screen’s tested space from the tested spaces of all other screens. The estimate was based on a consensus list, i.e. the list of strains that have been tested in at least 50% of all screens that did declare their tested space (Kemmeren P~Holstege FC, 2014 and Huseinovic A~Vos JC, 2017 were excluded from the estimate because their declared tested space was ~1,500 genes, which is considerably lower than other screens). If a screen reported values outside of the consensus tested space, those values were retained.
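The consensus rule above can be sketched in a few lines of Python. This is an illustrative sketch rather than the project's own code: the function names are hypothetical and the ORF identifiers in the example are placeholders. It assumes that the screens excluded from the estimate have already been removed from the input.

```python
from collections import Counter


def consensus_tested_space(declared_spaces, min_fraction=0.5):
    """Strains tested in at least `min_fraction` of the screens that declared a tested space."""
    n_screens = len(declared_spaces)
    counts = Counter(orf for space in declared_spaces for orf in set(space))
    return {orf for orf, n in counts.items() if n >= min_fraction * n_screens}


def estimated_tested_space(reported_orfs, consensus):
    """Estimated tested space for a screen with no declared list.

    Values reported outside the consensus are retained, so the estimate is the union.
    """
    return set(consensus) | set(reported_orfs)


# Four hypothetical screens that declared their tested space.
declared = [
    {"YAL002W", "YAL005C", "YAL007C"},
    {"YAL002W", "YAL005C"},
    {"YAL002W", "YAL008W"},
    {"YAL002W", "YAL005C", "YAL008W"},
]
consensus = consensus_tested_space(declared)

# A screen that only published a hit, which happens to fall outside the consensus.
estimated = estimated_tested_space({"YAL007C"}, consensus)
```

In this toy example, YAL007C appears in only one of four declared lists and is therefore absent from the consensus, but it stays in the estimated tested space of the screen that reported a value for it.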

Data handling

For all data, we minimize manual handling to maximize record keeping and reproducibility. The only two operations performed manually are:

downloading the data from the source;
whenever required, converting PDF files to a computer-readable format (text or Excel files).

All other data manipulations (quality control, clean up, reformatting, upload to database, etc.) are done programmatically. These typically include fixing typos in gene names, translating gene names into ORFs, excluding ORFs that are currently marked as merged or deleted in SGD, and converting phenotype data into digital scores as per convention (see below). For each publication, data manipulations are encoded in a Python-based Jupyter notebook, which is version-controlled and stored in a GitHub repository.
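To give a sense of what such programmatic clean-up might look like, here is a minimal sketch of the first three steps. It is not taken from the project's notebooks: the function name and the lookup tables are hypothetical, and the entries `FAKE1` and `YZZ999W` are invented placeholders standing in for an ORF that SGD marks as merged or deleted.

```python
def clean_screen(raw_rows, typo_fixes, gene_to_orf, retired_orfs):
    """Return {ORF: value} after fixing typos, translating names and dropping retired ORFs."""
    cleaned = {}
    for gene, value in raw_rows:
        name = gene.strip().upper()         # normalize whitespace and case
        name = typo_fixes.get(name, name)   # fix known typos in gene names
        orf = gene_to_orf.get(name, name)   # translate gene name to ORF; ORFs pass through
        if orf in retired_orfs:             # skip ORFs marked as merged or deleted
            continue
        cleaned[orf] = value
    return cleaned


# Hypothetical lookup tables for illustration only.
TYPO_FIXES = {"VSP8": "VPS8"}
GENE_TO_ORF = {"VPS8": "YAL002W", "SSA1": "YAL005C", "FAKE1": "YZZ999W"}
RETIRED_ORFS = {"YZZ999W"}

raw = [(" vsp8 ", 0.4), ("SSA1", -1.2), ("FAKE1", 2.0), ("YAL008W", 0.1)]
cleaned = clean_screen(raw, TYPO_FIXES, GENE_TO_ORF, RETIRED_ORFS)
```

Encoding each correction as data (a lookup table) rather than as a manual edit is what keeps the process reproducible and auditable.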

Data normalization

To facilitate both analysis and interpretation of diverse Yeast Phenome data, we implemented several conventions and normalizations.

Phenotypes reported on qualitative (e.g., mild, intermediate, severe) or categorical (e.g., round, elongated, irregular) scales (2% of datasets) were converted into sets of discrete (1, 2, 3) or binary (0, 1) values.
Quantitative and truncated quantitative phenotypes (i.e., those reported on a quantitative scale only for the set of mutants considered to be hits) were transformed so that the sign and magnitude of their values were consistent with how the phenotype and its corresponding experimental condition are defined in their respective controlled vocabularies (higher values correspond to higher expressions of the phenotype, and vice versa).
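The first convention can be illustrated with a short sketch. The function names and ORF identifiers are hypothetical, and the exact encoding used for any given dataset may differ; the sketch assumes that ordered labels map to discrete scores and that unordered categories are expanded into one binary indicator set per category.

```python
SEVERITY_SCORES = {"mild": 1, "intermediate": 2, "severe": 3}


def ordinal_scores(labels, scale=SEVERITY_SCORES):
    """Map ordered qualitative labels to discrete scores (1, 2, 3)."""
    return {orf: scale[label] for orf, label in labels.items()}


def binary_indicators(labels):
    """Expand unordered categories into one binary (0, 1) set per category."""
    categories = sorted(set(labels.values()))
    return {
        category: {orf: int(label == category) for orf, label in labels.items()}
        for category in categories
    }


severity = ordinal_scores({"YAL002W": "mild", "YAL005C": "severe"})
shape = binary_indicators({"YAL002W": "round", "YAL005C": "elongated", "YAL008W": "round"})
```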

Since different phenotypes followed dramatically different distributions but were consistently unimodal, we used the mode as a reference to normalize each dataset using a modified z-score transformation.

$$NPV_i = \frac{P_i - P_{mode}}{\sqrt{\frac{1}{N} \sum_{j=1}^{N} (P_j - P_{mode})^2}}$$

Where:

NPV_i = normalized phenotypic value for knock-out mutant i
P_i = raw phenotypic value for knock-out mutant i (as reported by the original publication)
P_mode = mode of the kernel density distribution of phenotypic values for all knock-out mutants in this screen (KDE was performed using Gaussian kernels, as implemented in scipy.stats.gaussian_kde, with a scalar bandwidth of 0.25)
N = total number of knock-out mutants tested in this screen

As a result, all phenotypic values reported in Yeast Phenome can be universally interpreted as standardized deviations from the most typical mutant, which, assuming extreme phenotypes are rare, is also likely to represent wild-type. The transformed data, which we refer to as normalized phenotypic values (NPVs), are used throughout the website. The original data are available at the GitHub repository.
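The transformation can be sketched as follows. This is an approximation under stated assumptions, not the project's exact code: it assumes the scalar bandwidth of 0.25 is passed as the `bw_method` argument of `scipy.stats.gaussian_kde`, and it locates the mode by evaluating the density on an evenly spaced grid (the function name and the `grid_size` parameter are hypothetical).

```python
import numpy as np
from scipy.stats import gaussian_kde


def normalize_phenotypic_values(values, bandwidth=0.25, grid_size=2048):
    """Modified z-score: center on the KDE mode, scale by the RMS deviation from it."""
    p = np.asarray(values, dtype=float)
    observed = p[np.isfinite(p)]                       # ignore missing values when fitting
    kde = gaussian_kde(observed, bw_method=bandwidth)  # assumption: scalar passed as bw_method
    grid = np.linspace(observed.min(), observed.max(), grid_size)
    mode = grid[np.argmax(kde(grid))]                  # grid approximation of P_mode
    spread = np.sqrt(np.mean((observed - mode) ** 2))  # denominator of the formula
    return (p - mode) / spread


# Synthetic screen: most mutants near a typical value of 10, plus a few extreme ones.
rng = np.random.default_rng(0)
raw = np.concatenate([rng.normal(10.0, 1.0, 500), [2.0, 3.0, 25.0]])
npv = normalize_phenotypic_values(raw)
```

Because the denominator is the root-mean-square deviation from the mode, the normalized values of a complete dataset always have a root-mean-square of exactly 1, whatever the scale of the raw measurements.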

Computing correlations

To obtain robust estimates of screen-screen phenotypic similarity in a computationally efficient and parallelizable manner, we adopted the following strategy.

Given a data matrix, where rows correspond to genes, columns correspond to phenotypic screens and the values are normalized phenotypic values (NPV), we created 100 submatrices by subsetting 10% of the rows using random sampling with replacement.
For each of the submatrices, we computed screen-screen cosine correlations using a parallelized implementation by deepgraph.
For each screen pair, we combined the 100 sampled cosine correlations by computing the mean and standard deviation.
Values corresponding to screen pairs with fewer than 100 non-zero data points in common were masked.
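The resampling strategy can be sketched with plain NumPy. This is a simplified, single-process stand-in and does not use deepgraph; the function names are hypothetical. It assumes that similarities are computed between columns (screens) of the sampled rows (genes) and that missing values are encoded as zeros, neither of which is confirmed in detail by the text above.

```python
import numpy as np


def cosine_between_columns(x):
    """Cosine similarity between every pair of columns of x."""
    norms = np.linalg.norm(x, axis=0)
    norms[norms == 0] = np.nan          # similarity is undefined for all-zero columns
    unit = x / norms
    return unit.T @ unit


def bootstrap_screen_similarity(npv, n_samples=100, row_fraction=0.1, min_shared=100, seed=0):
    """Mean and standard deviation of column-wise cosine similarity over row resamples."""
    rng = np.random.default_rng(seed)
    n_genes, n_screens = npv.shape
    n_rows = max(1, int(round(row_fraction * n_genes)))
    samples = np.empty((n_samples, n_screens, n_screens))
    for k in range(n_samples):
        rows = rng.integers(0, n_genes, size=n_rows)   # random sampling with replacement
        samples[k] = cosine_between_columns(npv[rows])
    mean = np.nanmean(samples, axis=0)
    std = np.nanstd(samples, axis=0)
    nonzero = (npv != 0).astype(int)
    shared = nonzero.T @ nonzero                       # non-zero data points in common
    mask = shared < min_shared
    mean[mask] = np.nan                                # mask poorly supported pairs
    std[mask] = np.nan
    return mean, std


# Toy matrix: screens 0 and 1 are near-duplicates, screen 2 has only 50 non-zero values.
rng = np.random.default_rng(1)
a = rng.normal(size=1000)
b = a + 0.1 * rng.normal(size=1000)
c = np.zeros(1000)
c[:50] = rng.normal(size=50)
mean, std = bootstrap_screen_similarity(np.column_stack([a, b, c]))
```

Each resample is independent of the others, which is what makes the procedure easy to parallelize; the standard deviation across resamples gives a rough measure of how stable each similarity estimate is.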

The same approach was used to compute gene-gene phenotypic similarities available for bulk download.

In contrast, to compute gene-gene phenotypic similarity values displayed online, we used a different approach. We trained a Supervised Contrastive (SupCon) machine learning model on the same data matrix. The model was trained to maximize the cosine similarity between pairs of genes co-annotated to the same GO Slim Biological Process term, while also avoiding over-fitting and ground truth memorization. The model enabled us to reduce the dimensionality of the data matrix to a 128-dimensional vector space, where each gene is represented by a vector of 128 values. The cosine similarity between these vectors was then used to estimate gene-gene phenotypic similarity.
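The final step, going from embeddings to similarities, can be illustrated with a short sketch. Model training is not shown, and the embeddings below are random placeholders rather than output of the SupCon model; the function names and gene labels are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between rows (one embedding vector per gene)."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T


def most_similar(similarity, names, query, top_n=3):
    """Genes ranked by decreasing similarity to `query`, excluding the query itself."""
    i = names.index(query)
    order = np.argsort(-similarity[i])
    return [(names[j], float(similarity[i, j])) for j in order if j != i][:top_n]


# Placeholder 128-dimensional embeddings; gene_b is built to resemble gene_a.
rng = np.random.default_rng(0)
names = ["gene_a", "gene_b", "gene_c", "gene_d"]
embeddings = rng.normal(size=(4, 128))
embeddings[1] = embeddings[0] + 0.05 * rng.normal(size=128)

similarity = cosine_similarity_matrix(embeddings)
neighbors = most_similar(similarity, names, "gene_a")
```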

For a more detailed method explanation and evaluation, please read our full blog post.

Data usage and citation policy

If you use data downloaded from YeastPhenome.org in a talk or a manuscript, please acknowledge the data source by citing the original publication from which the data are derived, as well as the YeastPhenome.org database. The database should be cited as follows:

Turco G, Chang C, Wang RY, Kim G, Stoops EH, Richardson B, Sochat V, Rust J, Oughtred R, Thayer N, Kang F, Livstone MS, Heinicke S, Schroeder M, Dolinski KJ, Botstein D, Baryshnikova A. 2023. Global analysis of the yeast knock-out phenome. Science Advances. 9(21):eadg5702. PMID: 37235661
