Optimizing differential expression analysis for proteomics data via high-performing rules and ensemble inference.

Peng, H; Wang, H; Kong, W; Li, J; Goh, WWB

Publisher: NATURE PORTFOLIO
Publication Type: Journal Article
Citation: Nat Commun, 2024, 15, (1), pp. 3922
Issue Date: 2024-05-09

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Peng, H
dc.contributor.author	Wang, H
dc.contributor.author	Kong, W
dc.contributor.author	Li, J https://orcid.org/0000-0003-1833-7413
dc.contributor.author	Goh, WWB
dc.date.accessioned	2024-08-01T04:37:29Z
dc.date.available	2024-04-16
dc.date.available	2024-08-01T04:37:29Z
dc.date.issued	2024-05-09
dc.identifier.citation	Nat Commun, 2024, 15, (1), pp. 3922
dc.identifier.issn	2041-1723
dc.identifier.uri	http://hdl.handle.net/10453/179946
dc.description.abstract	Identification of differentially expressed proteins in a proteomics workflow typically encompasses five key steps: raw data quantification, expression matrix construction, matrix normalization, missing value imputation (MVI), and differential expression analysis. The plethora of options in each step makes it challenging to identify optimal workflows that maximize the identification of differentially expressed proteins. To identify optimal workflows and their common properties, we conduct an extensive study involving 34,576 combinatoric experiments on 24 gold standard spike-in datasets. Applying frequent pattern mining techniques to top-ranked workflows, we uncover high-performing rules that demonstrate that optimality has conserved properties. Via machine learning, we confirm optimal workflows are indeed predictable, with average cross-validation F1 scores and Matthews correlation coefficients surpassing 0.84. We introduce an ensemble inference to integrate results from individual top-performing workflows to expand differential proteome coverage and resolve inconsistencies. Ensemble inference provides gains in pAUC (up to 4.61%) and G-mean (up to 11.14%) and facilitates effective aggregation of information across varied quantification approaches such as topN, directLFQ, MaxLFQ intensities, and spectral counts. However, further development and evaluation are needed to establish acceptable frameworks for conducting ensemble inference on multiple proteomics workflows.
dc.format	Electronic
dc.language	eng
dc.publisher	NATURE PORTFOLIO
dc.relation.ispartof	Nat Commun
dc.relation.isbasedon	10.1038/s41467-024-47899-w
dc.rights	info:eu-repo/semantics/openAccess
dc.subject.mesh	Proteomics
dc.subject.mesh	Workflow
dc.subject.mesh	Machine Learning
dc.subject.mesh	Proteome
dc.subject.mesh	Humans
dc.subject.mesh	Algorithms
dc.subject.mesh	Databases, Protein
dc.title	Optimizing differential expression analysis for proteomics data via high-performing rules and ensemble inference.
dc.type	Journal Article
utslib.citation.volume	15
utslib.location.activity	England
pubs.organisational-group	University of Technology Sydney
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	University of Technology Sydney/Strength - CHT - Health Technologies
pubs.organisational-group	University of Technology Sydney/Strength - AAI - Advanced Analytics Institute Research Centre
pubs.organisational-group	University of Technology Sydney/All Manual Groups
pubs.organisational-group	University of Technology Sydney/All Manual Groups/Centre for Health Technologies (CHT)
pubs.organisational-group	University of Technology Sydney/All Manual Groups/Data Science Institute (DSI)
pubs.organisational-group	University of Technology Sydney/Strength - DSI - Data Science Institute
utslib.copyright.status	open_access	*
dc.rights.license	This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/
dc.date.updated	2024-08-01T04:37:24Z
pubs.issue	1
pubs.publication-status	Published online
pubs.volume	15
utslib.citation.issue	1

Abstract:

Identification of differentially expressed proteins in a proteomics workflow typically encompasses five key steps: raw data quantification, expression matrix construction, matrix normalization, missing value imputation (MVI), and differential expression analysis. The plethora of options in each step makes it challenging to identify optimal workflows that maximize the identification of differentially expressed proteins. To identify optimal workflows and their common properties, we conduct an extensive study involving 34,576 combinatoric experiments on 24 gold standard spike-in datasets. Applying frequent pattern mining techniques to top-ranked workflows, we uncover high-performing rules that demonstrate that optimality has conserved properties. Via machine learning, we confirm optimal workflows are indeed predictable, with average cross-validation F1 scores and Matthews correlation coefficients surpassing 0.84. We introduce an ensemble inference to integrate results from individual top-performing workflows to expand differential proteome coverage and resolve inconsistencies. Ensemble inference provides gains in pAUC (up to 4.61%) and G-mean (up to 11.14%) and facilitates effective aggregation of information across varied quantification approaches such as topN, directLFQ, MaxLFQ intensities, and spectral counts. However, further development and evaluation are needed to establish acceptable frameworks for conducting ensemble inference on multiple proteomics workflows.
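
For readers unfamiliar with the metrics named in the abstract, the following is a minimal sketch of how F1, the Matthews correlation coefficient, and G-mean can be computed from confusion-matrix counts. It uses the standard textbook definitions (G-mean taken here as the geometric mean of sensitivity and specificity) and is not taken from the authors' code; the function name `classification_metrics` and the example counts are hypothetical.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return F1, MCC, and G-mean from confusion-matrix counts.

    Assumption: standard definitions, with G-mean as the geometric
    mean of sensitivity (recall) and specificity. Undefined ratios
    (zero denominators) are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0

    # F1: harmonic mean of precision and recall.
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0

    # MCC: correlation between predicted and true binary labels.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # G-mean: balances performance on positives and negatives.
    g_mean = math.sqrt(recall * specificity)

    return {"f1": f1, "mcc": mcc, "g_mean": g_mean}


# Hypothetical spike-in example: 50 truly changed proteins among 1000.
metrics = classification_metrics(tp=40, fp=10, tn=940, fn=10)
```

In a spike-in setting, where true differentially expressed proteins are a small minority, MCC and G-mean are less flattered by the large number of true negatives than plain accuracy would be.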

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/179946