Joint model for Thai word segmentation and part-of-speech tagging via a local Transformer

Indexing: OA (open access); 北大核心 (Peking University Core Journals); CSTPCD

Abstract

Thai word segmentation (WS) and part-of-speech (POS) tagging are closely related tasks, and prior work has shown that learning them jointly can effectively improve model performance. We therefore propose a joint WS and POS tagging model tailored to the spelling and word-formation characteristics of Thai. Because Thai characters form syllables and syllables form words, a local Transformer network is used to learn WS features from windowed syllable sequences. Because syllables such as roots and affixes are associated with POS, the syllable features used for WS are fused into the word-sequence features, which alleviates the lack of POS tagging features for out-of-vocabulary words. On this basis, a linear classification layer predicts the WS labels, and a linear conditional random field models the dependencies within the POS label sequence. Experimental results on the Thai LST20 dataset show that the model achieves a WS F1, a POS tagging micro-averaged F1, and a POS tagging macro-averaged F1 of 96.33%, 97.06%, and 85.98%, respectively, which are improvements of 0.33%, 0.44%, and 0.12% over the baseline models.
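The abstract's idea of a local Transformer over windowed syllable sequences can be illustrated with a minimal sketch of window-restricted self-attention. This is not the authors' implementation: the function names `local_attention_mask` and `local_self_attention`, the symmetric window, the single head, and the absence of learned query/key/value projections are all assumptions made for illustration.

```python
import math


def local_attention_mask(n, window):
    """Return an n x n boolean mask.

    mask[i][j] is True when position j lies within `window` positions of i,
    i.e. when syllable i is allowed to attend to syllable j.
    """
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]


def local_self_attention(x, window):
    """Single-head scaled dot-product self-attention restricted to a window.

    x is a list of n vectors (lists of floats) of equal dimension d.
    Assumption: queries, keys and values are x itself, with no learned
    projections, to keep the sketch dependency-free.
    """
    n, d = len(x), len(x[0])
    mask = local_attention_mask(n, window)
    out = []
    for i in range(n):
        # Only positions inside the window contribute to the softmax.
        allowed = [j for j in range(n) if mask[i][j]]
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            for j in allowed
        ]
        # Numerically stable softmax over the allowed positions.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors inside the window.
        out.append(
            [sum(w * x[j][k] for w, j in zip(weights, allowed)) for k in range(d)]
        )
    return out


# Toy "syllable embeddings" (hypothetical values, for illustration only).
syllables = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = local_self_attention(syllables, window=1)
```

With `window=0` each position attends only to itself, so the output equals the input. A larger window lets each syllable representation mix in its neighbours, which is the kind of local context a segmentation boundary decision would plausibly draw on.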

Authors: Zhu Yefen (朱叶芬); Xian Yantuan (线岩团); Yu Zhengtao (余正涛); Xiang Yan (相艳)

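Affiliations: Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan, China; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, Yunnan, China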

Category: Computer science and automation

Keywords: Thai word segmentation; part-of-speech tagging; joint learning; local Transformer; sub-word features; syllable features; linear conditional random field; joint model

Journal: CAAI Transactions on Intelligent Systems (《智能系统学报》), 2024, issue 2

Pages: 401-410 (10 pages)

Funding: National Natural Science Foundation of China (62266028); Major Science and Technology Special Program of Yunnan Province (202002AD080001).

DOI: 10.11992/tis.202209034
