Empirical analysis of Zipf's law, power law, and lognormal distributions in medical discharge reports.

Quiroz, JC; Laranjo, L; Tufanaru, C; Kocaballi, AB; Rezazadegan, D; Berkovsky, S; Coiera, E

Publisher: Elsevier BV
Publication Type: Journal Article
Citation: International Journal of Medical Informatics, 2020, 145, pp. 104324
Issue Date: 2020-11-02

Closed Access

Filename	Description	Size	Format
1-s2.0-S1386505620310984-main.pdf	Published version	1.16 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Quiroz, JC
dc.contributor.author	Laranjo, L
dc.contributor.author	Tufanaru, C
dc.contributor.author	Kocaballi, AB
dc.contributor.author	Rezazadegan, D
dc.contributor.author	Berkovsky, S
dc.contributor.author	Coiera, E
dc.date.accessioned	2020-12-27T23:00:43Z
dc.date.available	2020-10-29
dc.date.available	2020-12-27T23:00:43Z
dc.date.issued	2020-11-02
dc.identifier.citation	International journal of medical informatics, 2020, 145, pp. 104324
dc.identifier.issn	1386-5056
dc.identifier.issn	1872-8243
dc.identifier.uri	http://hdl.handle.net/10453/144966
dc.description.abstract	BACKGROUND: Bayesian modelling and statistical text analysis rely on informed probability priors to encourage good solutions. OBJECTIVE: This paper empirically analyses whether text in medical discharge reports follows Zipf's law, a commonly assumed statistical property of language in which word frequency follows a discrete power-law distribution. METHOD: We examined 20,000 medical discharge reports from the MIMIC-III dataset. Methods included splitting the discharge reports into tokens, counting token frequency, fitting power-law distributions to the data, and testing whether alternative distributions (lognormal, exponential, stretched exponential, and truncated power-law) provided superior fits to the data. RESULT: Discharge reports are best fit by the truncated power-law and lognormal distributions. Discharge reports appear to be near-Zipfian, in that the truncated power-law provides a better fit than a pure power-law. CONCLUSION: Our findings suggest that Bayesian modelling and statistical text analysis of discharge report text would benefit from using truncated power-law and lognormal probability priors and non-parametric models that capture power-law behavior.
dc.format	Print-Electronic
dc.language	eng
dc.publisher	Elsevier BV
dc.relation.ispartof	International journal of medical informatics
dc.relation.isbasedon	10.1016/j.ijmedinf.2020.104324
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	08 Information and Computing Sciences, 09 Engineering, 11 Medical and Health Sciences
dc.subject.classification	Medical Informatics
dc.title	Empirical analysis of Zipf's law, power law, and lognormal distributions in medical discharge reports.
dc.type	Journal Article
utslib.citation.volume	145
utslib.location.activity	Ireland
utslib.for	08 Information and Computing Sciences
utslib.for	09 Engineering
utslib.for	11 Medical and Health Sciences
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
pubs.organisational-group	/University of Technology Sydney
utslib.copyright.status	closed_access	*
dc.date.updated	2020-12-27T23:00:26Z
pubs.publication-status	Published
pubs.volume	145

Abstract:

BACKGROUND: Bayesian modelling and statistical text analysis rely on informed probability priors to encourage good solutions.

OBJECTIVE: This paper empirically analyses whether text in medical discharge reports follows Zipf's law, a commonly assumed statistical property of language in which word frequency follows a discrete power-law distribution.

METHOD: We examined 20,000 medical discharge reports from the MIMIC-III dataset. Methods included splitting the discharge reports into tokens, counting token frequency, fitting power-law distributions to the data, and testing whether alternative distributions (lognormal, exponential, stretched exponential, and truncated power-law) provided superior fits to the data.

RESULT: Discharge reports are best fit by the truncated power-law and lognormal distributions. Discharge reports appear to be near-Zipfian, in that the truncated power-law provides a better fit than a pure power-law.

CONCLUSION: Our findings suggest that Bayesian modelling and statistical text analysis of discharge report text would benefit from using truncated power-law and lognormal probability priors and non-parametric models that capture power-law behavior.
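
The first steps named in the METHOD part of the abstract (splitting text into tokens, counting token frequency, and fitting a power law) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the regular-expression tokenizer, and the sample sentences are assumptions made for illustration, and the exponent estimate uses a standard closed-form approximation to the discrete power-law maximum-likelihood estimator rather than whatever fitting procedure the paper used. The comparison against lognormal, exponential, stretched exponential, and truncated power-law alternatives is not reproduced here.

```python
import math
import re
from collections import Counter


def token_frequencies(texts):
    """Count lowercase alphanumeric tokens across an iterable of strings.

    The tokenizer is an assumption; the paper's tokenization may differ.
    """
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return counts


def rank_frequency(counts):
    """Return (rank, frequency) pairs, with rank 1 the most frequent token."""
    ordered = sorted(counts.values(), reverse=True)
    return list(enumerate(ordered, start=1))


def power_law_alpha(values, xmin=1):
    """Approximate maximum-likelihood exponent of a discrete power law.

    Uses alpha ~= 1 + n / sum(ln(x / (xmin - 0.5))) over values >= xmin.
    This describes the distribution of token counts, not the slope of the
    rank-frequency curve, and the approximation is rough for small xmin.
    """
    tail = [v for v in values if v >= xmin]
    if not tail:
        raise ValueError("no values at or above xmin")
    log_sum = sum(math.log(v / (xmin - 0.5)) for v in tail)
    return 1.0 + len(tail) / log_sum


if __name__ == "__main__":
    # Made-up sentences for illustration only; these are not MIMIC-III data.
    sample_reports = [
        "The patient was admitted with chest pain.",
        "The patient was discharged home in stable condition.",
    ]
    counts = token_frequencies(sample_reports)
    print(rank_frequency(counts)[:5])
    print(power_law_alpha(list(counts.values())))
```

On a real corpus, the estimated exponent would normally be followed by goodness-of-fit checks and likelihood comparisons against the alternative distributions listed in the abstract, since a plausible-looking exponent alone does not show that a power law is the best description of the data.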

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/144966