What Do They Capture? - A Structural Analysis of Pre-Trained Language Models for Source Code

Wan, Y; Zhao, W; Zhang, H; Sui, Y; Xu, G; Jin, H

Publisher: ACM
Publication Type: Conference Proceeding
Citation: Proceedings - International Conference on Software Engineering, 2022, 2022-May, pp. 2377-2388
Issue Date: 2022-01-01

Closed Access

	Filename	Description	Size	Format
	icse22_code_probing_camera.pdf	Accepted version	7.55 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Wan, Y
dc.contributor.author	Zhao, W
dc.contributor.author	Zhang, H
dc.contributor.author	Sui, Y https://orcid.org/0000-0002-9510-6574
dc.contributor.author	Xu, G https://orcid.org/0000-0003-4493-6663
dc.contributor.author	Jin, H
dc.date.accessioned	2022-08-25T10:43:33Z
dc.date.available	2022-08-25T10:43:33Z
dc.date.issued	2022-01-01
dc.identifier.citation	Proceedings - International Conference on Software Engineering, 2022, 2022-May, pp. 2377-2388
dc.identifier.isbn	9781450392211
dc.identifier.issn	0270-5257
dc.identifier.uri	http://hdl.handle.net/10453/160863
dc.description.abstract	Many pre-trained language models for source code have recently been proposed to model the context of code and to serve as a basis for downstream code intelligence tasks such as code completion, code search, and code summarization. These models leverage masked pre-training and the Transformer architecture and have achieved promising results. However, there has been little progress on the interpretability of existing pre-trained code models: it is not clear why these models work or what feature correlations they capture. In this paper, we conduct a thorough structural analysis aiming to provide an interpretation of pre-trained language models for source code (e.g., CodeBERT and GraphCodeBERT) from three distinct perspectives: (1) attention analysis, (2) probing on the word embedding, and (3) syntax tree induction. The analysis yields several findings that may inspire future studies: (1) Attention aligns strongly with the syntax structure of code. (2) Pre-trained language models of code can preserve the syntax structure of code in the intermediate representations of each Transformer layer. (3) Pre-trained models of code are able to induce syntax trees of code. These findings suggest that it may be helpful to incorporate the syntax structure of code into the pre-training process to obtain better code representations.
dc.language	en
dc.publisher	ACM
dc.relation.ispartof	Proceedings - International Conference on Software Engineering
dc.relation.ispartof	Proceedings of the 44th International Conference on Software Engineering
dc.relation.isbasedon	10.1145/3510003.3510050
dc.rights	info:eu-repo/semantics/closedAccess
dc.title	What Do They Capture? - A Structural Analysis of Pre-Trained Language Models for Source Code
dc.type	Conference Proceeding
utslib.citation.volume	2022-May
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAI - Advanced Analytics Institute Research Centre
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
utslib.copyright.status	closed_access	*
dc.date.updated	2022-08-25T10:43:29Z
pubs.publication-status	Published
pubs.volume	2022-May

Abstract:

Many pre-trained language models for source code have recently been proposed to model the context of code and to serve as a basis for downstream code intelligence tasks such as code completion, code search, and code summarization. These models leverage masked pre-training and the Transformer architecture and have achieved promising results. However, there has been little progress on the interpretability of existing pre-trained code models: it is not clear why these models work or what feature correlations they capture. In this paper, we conduct a thorough structural analysis aiming to provide an interpretation of pre-trained language models for source code (e.g., CodeBERT and GraphCodeBERT) from three distinct perspectives: (1) attention analysis, (2) probing on the word embedding, and (3) syntax tree induction. The analysis yields several findings that may inspire future studies: (1) Attention aligns strongly with the syntax structure of code. (2) Pre-trained language models of code can preserve the syntax structure of code in the intermediate representations of each Transformer layer. (3) Pre-trained models of code are able to induce syntax trees of code. These findings suggest that it may be helpful to incorporate the syntax structure of code into the pre-training process to obtain better code representations.
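
Illustrative sketch (not taken from the paper): the first finding, that attention aligns with the syntax structure of code, can be made concrete with a small alignment score. The function below computes the fraction of high-attention token pairs that are also syntactically related. The function name, the 0.3 threshold, the toy attention matrix, and the hand-written set of syntax-related token pairs are all assumptions chosen for illustration; the abstract does not specify the metric the authors use, and a real analysis would take attention weights from a model such as CodeBERT and syntax relations from a parser.

```python
from typing import List, Set, Tuple


def attention_syntax_alignment(
    attention: List[List[float]],
    syntax_pairs: Set[Tuple[int, int]],
    threshold: float = 0.3,
) -> float:
    """Fraction of high-attention token pairs that are also syntax-related.

    attention[i][j] is the weight token i places on token j (one head).
    syntax_pairs holds (i, j) index pairs treated as syntactically related;
    the relation is treated as symmetric. Both inputs are hypothetical here.
    """
    high = 0      # pairs whose attention weight exceeds the threshold
    aligned = 0   # subset of those pairs that are syntax-related
    for i, row in enumerate(attention):
        for j, weight in enumerate(row):
            if i == j or weight <= threshold:
                continue
            high += 1
            if (i, j) in syntax_pairs or (j, i) in syntax_pairs:
                aligned += 1
    return aligned / high if high else 0.0


# Toy input for the statement `x = a + b` (values are made up).
tokens = ["x", "=", "a", "+", "b"]
attention = [
    [0.10, 0.50, 0.20, 0.10, 0.10],
    [0.40, 0.10, 0.10, 0.35, 0.05],
    [0.10, 0.10, 0.10, 0.60, 0.10],
    [0.10, 0.10, 0.40, 0.00, 0.40],
    [0.10, 0.10, 0.10, 0.60, 0.10],
]
# Assumed syntax relations: x with =, a with +, + with b.
syntax_pairs = {(0, 1), (2, 3), (3, 4)}

score = attention_syntax_alignment(attention, syntax_pairs)
print(f"alignment score: {score:.3f}")
```

In this toy input, seven token pairs exceed the threshold and six of them are in the assumed syntax relation, so the score is 6/7. A score near 1 would indicate that a head attends mostly along syntactic relations, while a score near the base rate of syntax-related pairs would indicate no particular alignment.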

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/160863