SIG-FIN-018-17


[[18th Research Meeting>第18回研究会]]

//*Revenue Prediction based on Multi-task Max-margin Topic Models 

*Revenue Prediction Based on Multi-task Max-margin Topic Models [#tf9cf3aa]

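**Authors [#nf474624]
中川雄太 (Graduate School of System Informatics, Kobe University), 上野良輔 (Graduate School of System Informatics, Kobe University), 江口浩二 (Graduate School of System Informatics, Kobe University)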

**Abstract [#wa52b671]
The goal of this work is a model that predicts indicators of corporate profitability from financial and economic text data. To this end, we extend the existing MedLDA method and generalize it to a multi-task max-margin topic model (MultiMedLDA) that solves classification and regression tasks simultaneously. MultiMedLDA targets document data annotated with multiple types of labels and estimates latent topics while taking several kinds of side information into account at the same time. This is expected to improve prediction accuracy. In this paper, we evaluate the effectiveness of MultiMedLDA on company evaluation texts that carry discrete industry labels and continuous labels for the rate of change in operating income, and discuss it in comparison with the regression task of MedLDA.
//With the development of information technology in recent years, the forms of information transmission have diversified substantially and the amount of document data in the world has grown exponentially. Such information can be found in various fields, including the economic and financial fields, for example as company valuation documents in online news and as numerical data on company financial indices and global exchange transactions on economic and financial websites. Researchers and practitioners in this field have recently taken a keen interest in discovering new ideas by making full use of these data. One promising approach to analyzing large-scale data is topic modeling, typically with Latent Dirichlet Allocation (LDA). This model assumes that each group (e.g., a document) is represented as a mixture of latent topics, where each latent topic is represented as a distribution over data points (e.g., words). In general, real-world document data are associated with side information in discrete and continuous forms. Maximum Entropy Discrimination LDA (MedLDA) is a supervised topic model that can improve the accuracy of latent topic estimation by making use of the side information associated with the documents. In this model, a margin maximization method as in the Support Vector Machine (SVM) is incorporated into the framework of topic modeling with LDA, and the estimated topics are used as features for the classifier. However, MedLDA cannot be applied to document data that are associated with both discrete and continuous labels. In this paper, we generalize MedLDA to Multi-task MedLDA (MultiMedLDA), which simultaneously addresses classification and regression tasks. For document data with multiple types of labels, MultiMedLDA introduces an optimization method called dual decomposition to solve the multi-objective optimization problem over multiple tasks involving classification and regression. It is expected that prediction performance can be improved by estimating latent topics using more side information. In this paper, we evaluate the effectiveness of MultiMedLDA through experiments with enterprise evaluation documents associated with continuous labels for the rate of change in operating income and discrete labels for business category, and discuss it in comparison with single-task MedLDA.

**Keywords [#v1220acc]
//Topic models, Latent Dirichlet Allocation, Multi-task learning
Multi-task supervised topic models, financial text mining, revenue prediction

**Paper [#s442d956]

//(To be published on or after March 6)
&ref(SIG-FIN-018-17.pdf);