The Yeast Phenome is a collaborative project dedicated to building and maintaining a comprehensive compendium of systematic loss-of-function phenotypes for the budding yeast Saccharomyces cerevisiae.
As part of this effort, we systematically track, collect, and annotate all published phenotypic screens utilizing the yeast knock-out collection. We extract the data, standardize their format and structure, and make them readily available for search, download, and analysis. Our primary goal is to empower ourselves and the wider research community to answer key questions, such as:
For a more in-depth discussion of our motivations, the principles of data librarianship, and the global impact that similar initiatives could have on biomedical research, please refer to our 2020 Viewpoint article published in The FEBS Journal.
To be included in Yeast Phenome, a phenotypic screen of the yeast knock-out (YKO) collection must meet all of the following criteria:
Yeast Phenome does not include:
To identify phenotypic screens that fit these criteria, we developed a comprehensive search strategy and applied it systematically over a 10-year period (2012–2022). As a starting point, we searched the Saccharomyces Genome Database (SGD) for gene phenotypes associated with terms such as “systematic mutation set” and “competitive growth”, and compiled a preliminary list of publications that reported phenotypic screens of the YKO collection. By curating these publications and parsing their citations, we discovered many additional publications reporting YKO screens. Furthermore, we examined the publication records of research labs that have released numerous YKO screens and made sure that we captured all of their publications in this domain.
We incorporated relevant YKO screens from existing repositories such as the Yeast Functional Genomics Database (YFGdb), the Database for High Throughput Screening hits (dHITS), ScreenTroll and FitSearch.
We received pointers to potentially relevant papers from BioGRID curators and set up automated PubMed queries for keywords such as “yeast knockout collection”, “yeast deletion collection” and “phenotypic screen”.
During the curation process, each publication is associated with a list of screens. Each screen is then annotated with extensive meta-data that capture the type of collection (haploid Mat-a, haploid Mat-alpha or homozygous diploid), the phenotype, the experimental environment (including the growth media), the type of released data (quantitative or discrete) and the source from which the data were obtained (see below).
The phenotypes and the environments are recorded using a set of controlled vocabularies, i.e. unique terms chosen to avoid duplications and errors. For example, a chemogenomic screen for hydroxyurea was annotated with the phenotype “growth” and the environment “hydroxyurea”.
Whenever applicable, phenotypes are linked to reporters, i.e. specific readouts through which the phenotype was assessed (e.g., the phenotype “unfolded protein response” has the reporter “UPRE-GFP” since the activation of the response is measured via GFP expression driven by a UPRE promoter).
Environments, especially chemical and physical perturbations, are associated with a dose (e.g., environment “hydroxyurea” with dose “100 mM”) and alternative names used in the literature (e.g., “hydroxyurea” and “HU”). Whenever available, chemical compounds are also linked with external identifiers from ChEBI and PubChem.
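The annotation scheme described above can be pictured as a small record type. The following is a minimal sketch only: the class and field names (`Screen`, `Environment`, `data_type`, etc.) are illustrative assumptions and do not reflect the actual Yeast Phenome database schema. The allowed values for collection type and data type are taken from the text above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Allowed values taken from the text; the constant names are hypothetical.
COLLECTIONS = {"haploid Mat-a", "haploid Mat-alpha", "homozygous diploid"}
DATA_TYPES = {"quantitative", "discrete"}


@dataclass
class Environment:
    name: str                          # controlled-vocabulary term, e.g. "hydroxyurea"
    dose: Optional[str] = None         # e.g. "100 mM"
    aliases: List[str] = field(default_factory=list)  # e.g. ["HU"]
    chebi_id: Optional[str] = None     # external identifier, when available
    pubchem_id: Optional[str] = None   # external identifier, when available


@dataclass
class Screen:
    collection: str                    # one of COLLECTIONS
    phenotype: str                     # controlled-vocabulary term, e.g. "growth"
    environment: Environment
    data_type: str                     # one of DATA_TYPES
    reporter: Optional[str] = None     # e.g. "UPRE-GFP", when applicable

    def __post_init__(self):
        # Reject values outside the small fixed vocabularies.
        if self.collection not in COLLECTIONS:
            raise ValueError(f"unknown collection: {self.collection!r}")
        if self.data_type not in DATA_TYPES:
            raise ValueError(f"unknown data type: {self.data_type!r}")


# The hydroxyurea example from the text, expressed as a record.
hu_screen = Screen(
    collection="homozygous diploid",
    phenotype="growth",
    environment=Environment(name="hydroxyurea", dose="100 mM", aliases=["HU"]),
    data_type="quantitative",
)
```

Validating the fixed vocabularies at construction time mirrors the purpose of controlled terms: duplicates and misspellings are caught when a record is created rather than at analysis time.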
Missing meta-data. A set of 1,719 chemical compounds, screened in Hoepfner D~Movva NR, 2014, are proprietary. The names or chemical structures of these compounds were not released in the publication and, as such, are not available in Yeast Phenome.
In addition to meta-data describing the parameters of the experiment, each screen is linked to its corresponding data, i.e. the list of tested mutants and their phenotypic values. The data are obtained from one of three main sources:
After retrieving the data, we evaluate their completeness relative to the experiment described in the publication and determine whether additional data might be available. An example of "incomplete" data is a case where the publication describes the measurement of a quantitative phenotype but only reports a binary list of hits. Similarly, if a publication reports only the list of hits but not the list of tested strains, the data are considered "incomplete".
In all cases of apparent data incompleteness, we contact the authors and ask them to share the missing data with Yeast Phenome. Since, in most cases, these data are unpublished, we ask the authors to give us explicit permission to upload the data onto Yeast Phenome. The authors who agreed to share data relevant to any given screen are acknowledged on the screen’s page, as well as on our Data contributors page.
By definition, all screens in Yeast Phenome used the YKO collection and tested the vast majority of the ~5,000 non-essential gene knock-outs (the biggest exception is Kemmeren P~Holstege FC, 2014, which tested only ~1,500 knock-out mutants). However, our preliminary analyses showed that the composition of the YKO collection varied over time and between labs. Furthermore, many screens had to exclude small subsets of the collection for technical or biological reasons (e.g., failure to transform the strain with a plasmid carrying a fluorescent reporter). As a result of such limitations, the set of mutants tested in one screen could differ from that of another screen by as much as 20%.
Variation in tested space can prevent an accurate interpretation of the screen results: it may be unclear whether a particular gene is absent from a screen’s hit list because it didn’t show a strong phenotype (or didn’t validate at re-testing), or because it was never tested in the first place.
To address this issue, we did our best to recover a list of tested strains for as many screens as possible. Whenever the list was not released as part of the original publication, we contacted the authors via email (see above). If the authors were unable to provide the list, we estimated the screen’s tested space from the tested spaces of all other screens. The estimate was based on a consensus list, i.e. the list of strains that have been tested in at least 50% of all screens that did declare their tested space (Kemmeren P~Holstege FC, 2014 and Huseinovic A~Vos JC, 2017 were excluded from the estimate because their declared tested space was ~1,500 genes, which is considerably lower than other screens). If a screen reported values outside of the consensus tested space, those values were retained.
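The consensus estimate can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's actual code: the function names and the dictionary-of-sets input format are hypothetical, and the screens excluded from the estimate are assumed to have been removed from the input beforehand.

```python
from collections import Counter


def consensus_tested_space(declared_spaces, min_fraction=0.5):
    """Return the ORFs tested in at least `min_fraction` of the screens
    that declared a tested space.

    `declared_spaces` maps a screen identifier to the collection of ORFs
    that the screen declared as tested (hypothetical input format).
    """
    counts = Counter()
    for orfs in declared_spaces.values():
        counts.update(set(orfs))  # count each ORF at most once per screen
    threshold = min_fraction * len(declared_spaces)
    return {orf for orf, n in counts.items() if n >= threshold}


def estimate_tested_space(reported_orfs, consensus):
    """Estimate the tested space of a screen that did not declare one.

    ORFs with reported values outside the consensus are retained, so the
    estimate is the union of the consensus and the reported ORFs.
    """
    return set(consensus) | set(reported_orfs)
```

For example, with four declaring screens an ORF must appear in at least two of them to enter the consensus, and a hit reported by an undeclared screen is kept even if it falls outside that consensus.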
For all data, we minimize manual handling to maximize record keeping and reproducibility. The only two operations performed manually are:
All other data manipulations (quality control, clean-up, reformatting, upload to the database, etc.) are done programmatically. These typically include fixing typos in gene names, translating gene names into ORFs, excluding ORFs that are currently marked as merged or deleted in SGD, and converting phenotype data into digital scores as per convention (see below). For each publication, data manipulations are encoded in a Python-based Jupyter notebook, which is version-controlled and stored in a GitHub repository.
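A clean-up step of this kind might look roughly like the sketch below. It is not taken from the project's notebooks: the function name, its arguments, and the lookup tables (`name_to_orf`, `retired_orfs`, `typo_fixes`) are hypothetical stand-ins for whatever gene-name resources a given notebook actually uses.

```python
def clean_gene_names(raw_names, name_to_orf, retired_orfs, typo_fixes=None):
    """Translate reported gene names into ORFs and drop retired ORFs.

    raw_names    -- gene names or ORFs as reported in the publication
    name_to_orf  -- hypothetical lookup from standard gene name to ORF
    retired_orfs -- ORFs assumed to be marked as merged or deleted
    typo_fixes   -- optional, per-publication corrections of known typos

    Returns (orfs, unresolved): the translated ORFs in input order, and
    the raw names that could not be resolved and need a closer look.
    """
    typo_fixes = typo_fixes or {}
    known_orfs = set(name_to_orf.values())
    orfs, unresolved = [], []
    for raw in raw_names:
        name = raw.strip().upper()          # normalize whitespace and case
        name = typo_fixes.get(name, name)   # apply recorded typo corrections
        if name in name_to_orf:
            orf = name_to_orf[name]         # standard name -> ORF
        elif name in known_orfs:
            orf = name                      # already an ORF
        else:
            unresolved.append(raw)
            continue
        if orf not in retired_orfs:         # skip merged/deleted ORFs
            orfs.append(orf)
    return orfs, unresolved
```

Returning the unresolved names separately, rather than silently dropping them, keeps a record of every entry that still needs attention.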
To facilitate both analysis and interpretation of diverse Yeast Phenome data, we implemented several conventions and normalizations.
Since different phenotypes followed dramatically different distributions but were consistently unimodal, we used the mode as a reference to normalize each dataset using a modified z-score transformation.
$$NPV_i = \frac{P_i - P_{mode}}{\sqrt{\frac{1}{N} \sum_{j=1}^{N} (P_j - P_{mode})^2}}$$
Where:
As a result, all phenotypic values reported in Yeast Phenome can be universally interpreted as standardized deviations from the most typical mutant, which, assuming extreme phenotypes are rare, is also likely to represent wild-type. The transformed data, which we refer to as normalized phenotypic values (NPVs), are used throughout the website. The original data are available at the GitHub repository.
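The transformation can be sketched with NumPy as follows. Two points are assumptions rather than statements from the text: the symbols are read as the phenotypic value of a mutant, the mode of the screen's distribution, and the number of mutants with a measured value; and the mode of a continuous dataset is approximated here by the midpoint of the most populated histogram bin, since the estimator actually used is not described above. Function names are illustrative.

```python
import numpy as np


def estimate_mode(values, bins=50):
    """Approximate the mode of a continuous sample as the midpoint of the
    most populated histogram bin (an assumed estimator, not the source's)."""
    counts, edges = np.histogram(values, bins=bins)
    k = int(np.argmax(counts))
    return 0.5 * (edges[k] + edges[k + 1])


def normalize_phenotypic_values(values, mode=None, bins=50):
    """Modified z-score: center on the mode and scale by the root mean
    squared deviation from the mode. Missing values (NaN) stay NaN."""
    p = np.asarray(values, dtype=float)
    measured = p[np.isfinite(p)]
    if mode is None:
        mode = estimate_mode(measured, bins=bins)
    scale = np.sqrt(np.mean((measured - mode) ** 2))
    if scale == 0:
        raise ValueError("all values equal the mode; cannot normalize")
    return (p - mode) / scale
```

By construction, a mutant sitting exactly at the mode gets a value of 0, and the root mean square of the normalized values over all measured mutants equals 1, which is what makes values comparable across screens.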
To obtain robust estimates of screen-screen phenotypic similarity in a computationally efficient and parallelizable manner, we adopted the following strategy.
The same approach was used to compute gene-gene phenotypic similarities available for bulk download.
In contrast, to compute gene-gene phenotypic similarity values displayed online, we used a different approach. We trained a Supervised Contrastive (SupCon) machine learning model on the same data matrix. The model was trained to maximize the cosine similarity between pairs of genes co-annotated to the same GO Slim Biological Process term, while also avoiding over-fitting and ground truth memorization. The model enabled us to reduce the dimensionality of the data matrix to a 128-dimensional vector space, where each gene is represented by a vector of 128 values. The cosine similarity between these vectors was then used to estimate gene-gene phenotypic similarity.
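The final step, computing cosine similarity between gene embeddings, can be illustrated as below. The sketch covers only that step and assumes the 128-dimensional vectors have already been produced by the trained model. The function name and the genes-by-dimensions array layout are assumptions, and the model training itself is not shown.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between rows of `embeddings`.

    `embeddings` is assumed to be a (genes x dimensions) array, e.g. one
    128-value vector per gene. Returns a (genes x genes) matrix.
    """
    x = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    # Guard against division by zero for all-zero vectors.
    unit = x / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T
```

Normalizing each vector to unit length first turns the whole computation into a single matrix product, which scales well to several thousand genes.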
For a more detailed method explanation and evaluation, please read our full blog post.
If you use data downloaded from YeastPhenome.org in a talk or a manuscript, please acknowledge the data source by citing the original publication from which the data are derived, as well as the YeastPhenome.org database. The database should be cited as follows:
Turco G, Chang C, Wang RY, Kim G, Stoops EH, Richardson B, Sochat V, Rust J, Oughtred R, Thayer N, Kang F, Livstone MS, Heinicke S, Schroeder M, Dolinski KJ, Botstein D, Baryshnikova A. 2023. Global analysis of the yeast knock-out phenome. Science Advances. 9(21):eadg5702. PMID: 37235661