Optimizing differential expression analysis for proteomics data via high-performing rules and ensemble inference.
- Publisher:
- NATURE PORTFOLIO
- Publication Type:
- Journal Article
- Citation:
- Nat Commun, 2024, 15, (1), pp. 3922
- Issue Date:
- 2024-05-09
Open Access
This item is open access.
Full metadata record
Field | Value | Language |
---|---|---|
dc.contributor.author | Peng, H | |
dc.contributor.author | Wang, H | |
dc.contributor.author | Kong, W | |
dc.contributor.author | Li, J https://orcid.org/0000-0003-1833-7413 | |
dc.contributor.author | Goh, WWB | |
dc.date.accessioned | 2024-08-01T04:37:29Z | |
dc.date.available | 2024-04-16 | |
dc.date.available | 2024-08-01T04:37:29Z | |
dc.date.issued | 2024-05-09 | |
dc.identifier.citation | Nat Commun, 2024, 15, (1), pp. 3922 | |
dc.identifier.issn | 2041-1723 | |
dc.identifier.uri | http://hdl.handle.net/10453/179946 | |
dc.description.abstract | Identification of differentially expressed proteins in a proteomics workflow typically encompasses five key steps: raw data quantification, expression matrix construction, matrix normalization, missing value imputation (MVI), and differential expression analysis. The plethora of options in each step makes it challenging to identify optimal workflows that maximize the identification of differentially expressed proteins. To identify optimal workflows and their common properties, we conduct an extensive study involving 34,576 combinatoric experiments on 24 gold standard spike-in datasets. Applying frequent pattern mining techniques to top-ranked workflows, we uncover high-performing rules that demonstrate optimality has conserved properties. Via machine learning, we confirm optimal workflows are indeed predictable, with average cross-validation F1 scores and Matthews correlation coefficients surpassing 0.84. We introduce ensemble inference, which integrates results from individual top-performing workflows to expand differential proteome coverage and resolve inconsistencies. Ensemble inference provides gains in pAUC (up to 4.61%) and G-mean (up to 11.14%) and facilitates effective aggregation of information across varied quantification approaches such as topN, directLFQ, MaxLFQ intensities, and spectral counts. However, further development and evaluation are needed to establish acceptable frameworks for conducting ensemble inference on multiple proteomics workflows. | |
dc.format | Electronic | |
dc.language | eng | |
dc.publisher | NATURE PORTFOLIO | |
dc.relation.ispartof | Nat Commun | |
dc.relation.isbasedon | 10.1038/s41467-024-47899-w | |
dc.rights | info:eu-repo/semantics/openAccess | |
dc.subject.mesh | Proteomics | |
dc.subject.mesh | Workflow | |
dc.subject.mesh | Machine Learning | |
dc.subject.mesh | Proteome | |
dc.subject.mesh | Humans | |
dc.subject.mesh | Algorithms | |
dc.subject.mesh | Databases, Protein | |
dc.title | Optimizing differential expression analysis for proteomics data via high-performing rules and ensemble inference. | |
dc.type | Journal Article | |
utslib.citation.volume | 15 | |
utslib.location.activity | England | |
pubs.organisational-group | University of Technology Sydney | |
pubs.organisational-group | University of Technology Sydney/Faculty of Engineering and Information Technology | |
pubs.organisational-group | University of Technology Sydney/Strength - CHT - Health Technologies | |
pubs.organisational-group | University of Technology Sydney/Strength - AAI - Advanced Analytics Institute Research Centre | |
pubs.organisational-group | University of Technology Sydney/All Manual Groups | |
pubs.organisational-group | University of Technology Sydney/All Manual Groups/Centre for Health Technologies (CHT) | |
pubs.organisational-group | University of Technology Sydney/All Manual Groups/Data Science Institute (DSI) | |
pubs.organisational-group | University of Technology Sydney/Strength - DSI - Data Science Institute | |
utslib.copyright.status | open_access | * |
dc.rights.license | This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/ | |
dc.date.updated | 2024-08-01T04:37:24Z | |
pubs.issue | 1 | |
pubs.publication-status | Published online | |
pubs.volume | 15 | |
utslib.citation.issue | 1 | |
Abstract:
Identification of differentially expressed proteins in a proteomics workflow typically encompasses five key steps: raw data quantification, expression matrix construction, matrix normalization, missing value imputation (MVI), and differential expression analysis. The plethora of options in each step makes it challenging to identify optimal workflows that maximize the identification of differentially expressed proteins. To identify optimal workflows and their common properties, we conduct an extensive study involving 34,576 combinatoric experiments on 24 gold standard spike-in datasets. Applying frequent pattern mining techniques to top-ranked workflows, we uncover high-performing rules that demonstrate optimality has conserved properties. Via machine learning, we confirm optimal workflows are indeed predictable, with average cross-validation F1 scores and Matthews correlation coefficients surpassing 0.84. We introduce ensemble inference, which integrates results from individual top-performing workflows to expand differential proteome coverage and resolve inconsistencies. Ensemble inference provides gains in pAUC (up to 4.61%) and G-mean (up to 11.14%) and facilitates effective aggregation of information across varied quantification approaches such as topN, directLFQ, MaxLFQ intensities, and spectral counts. However, further development and evaluation are needed to establish acceptable frameworks for conducting ensemble inference on multiple proteomics workflows.
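The abstract reports F1 scores, Matthews correlation coefficients, and G-mean without defining them. As a minimal sketch, the following computes the three metrics from confusion-matrix counts using their standard textbook definitions. It assumes G-mean is the geometric mean of sensitivity and specificity, which the abstract does not spell out; the function name and the example counts are hypothetical and do not come from the paper.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return F1, MCC and G-mean from confusion-matrix counts.

    Standard definitions only; this is not the authors' code.
    Undefined ratios (zero denominators) are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0

    # F1: harmonic mean of precision and recall.
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0

    # MCC: correlation between predicted and true binary labels.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # G-mean (assumed definition): sqrt(sensitivity * specificity).
    g_mean = math.sqrt(recall * specificity)

    return {"f1": f1, "mcc": mcc, "g_mean": g_mean}


# Hypothetical spike-in example: 50 truly changed proteins, 950 unchanged.
example = classification_metrics(tp=40, fp=10, tn=940, fn=10)
```

In a spike-in dataset the unchanged background proteins usually far outnumber the spiked-in ones, which is one reason to prefer MCC and G-mean, both of which account for true negatives, over accuracy alone.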