Automatic Chinese character similarity measurement

Liu, M; Rus, V; Li, Y; Sheng, C; Liu, L

Publication Type: Journal Article
Citation: Web Intelligence, 2018, 16 (3), pp. 195-202
Issue Date: 2018-01-01

Closed Access

	Filename	Description	Size	Format
	Automatic Chinese character similarity measurement.pdf	Published Version	201.7 kB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Liu, M https://orcid.org/0000-0003-4256-6531	en_US
dc.contributor.author	Rus, V	en_US
dc.contributor.author	Li, Y	en_US
dc.contributor.author	Sheng, C	en_US
dc.contributor.author	Liu, L	en_US
dc.date.issued	2018-01-01	en_US
dc.identifier.citation	Web Intelligence, 2018, 16 (3), pp. 195 - 202	en_US
dc.identifier.issn	2405-6456	en_US
dc.identifier.uri	http://hdl.handle.net/10453/134181
dc.description.abstract	© 2018 - IOS Press and the authors. All rights reserved. Automatically identifying Chinese characters that are similar in glyph, pronunciation and meaning is important for building smart question generation tools in a computer-assisted language-learning environment. Previous research on Chinese character similarity measurement focused on the character glyph (e.g. structures, strokes and radicals), using heuristic algorithms whose parameters have preset values. This article presents a machine learning (regression) approach to measuring the similarity between two Chinese characters, based on information that includes not only the glyph but also pronunciation (pinyin) and semantic meaning derived from HowNet. We evaluated various regression models using a testing set consisting of 2586 pairs of characters selected from elementary Chinese textbooks. The results showed that four regression models (M5, Support Vector Machine, Gaussian Process and Linear Regression) performed similarly (0.617 ≤ Mean Absolute Error ≤ 0.641, 0.772 ≤ Root Mean Square Error ≤ 0.790). In addition, the study suggested that the performance of the regression model could be influenced by character frequency. Moreover, we evaluated the regression model on a well-known Chinese language learning resource called "100 pairs of the most confusing Chinese characters". The experimental results indicated that this approach has potential in the recognition and generation of confusing Chinese character pairs.	en_US
dc.relation.ispartof	Web Intelligence	en_US
dc.relation.isbasedon	10.3233/WEB-180387	en_US
dc.title	Automatic Chinese character similarity measurement	en_US
dc.type	Journal Article
utslib.citation.volume	3	en_US
utslib.citation.volume	16	en_US
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/DVC (Teaching and Learning)
pubs.organisational-group	/University of Technology Sydney/DVC (Teaching and Learning)/Connected Intelligence Centre
utslib.copyright.status	closed_access
pubs.issue	3	en_US
pubs.publication-status	Published	en_US
pubs.volume	16	en_US
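
The abstract compares the regression models by Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). As a minimal sketch of how these two metrics are computed, the Python below scores a set of predicted similarity values against reference ratings. The function names, the rating values and the rating scale are illustrative assumptions; the record does not give the authors' data, scale or implementation.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between reference and predicted scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true, y_pred):
    """Square root of the average squared difference; penalises large errors more."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Hypothetical similarity ratings for four character pairs (not from the paper).
human_scores = [1.0, 3.0, 2.0, 4.0]
predicted_scores = [1.5, 2.5, 2.0, 5.0]

mae = mean_absolute_error(human_scores, predicted_scores)
rmse = root_mean_square_error(human_scores, predicted_scores)
print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```

RMSE is never smaller than MAE on the same predictions, which is consistent with the ranges reported in the abstract (MAE around 0.62 to 0.64, RMSE around 0.77 to 0.79).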

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/134181