[Talk Summary 10] Parse Tree Fragmentation of Ungrammatical Sentences

Huma Hashemi, an ISP graduate student at the University of Pittsburgh, gave a talk titled "Parse Tree Fragmentation of Ungrammatical Sentences" on Friday, 2016/11/18. She presented an evaluation of parser robustness for ungrammatical sentences.

Huma started the talk by giving an introduction to natural language processing (NLP) that motivated her proposal. One of the most challenging issues that NLP has to deal with is "noisier" text, such as English-as-a-second-language writing and machine translation output. Many NLP applications that require a parser, for instance information extraction, question answering, and summarization systems, may receive sentences that are not well-formed. Therefore, to build a good NLP application, a parser should be able to parse ungrammatical sentences.
Huma's research focuses on answering the question "how much does a parser's performance degrade when it deals with grammar mistakes?" and on evaluating parsers on ungrammatical sentences. There are three main contributions of Huma's work presented in the talk:

  • a metric and methodology for evaluating ungrammatical sentences without referring to a gold standard corpus;
  • a quantitative comparison of parser accuracy of leading dependency parsers on ungrammatical sentences; this may help practitioners to select an appropriate parser for their applications; and 
  • a suite of robustness analyses of the parsers on specific kinds of problems in the ungrammatical sentences; this may help developers to improve parser robustness in the future.
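To give a flavor of the first contribution, one way to score a parser without a gold standard corpus is to parse both an ungrammatical sentence and its grammatical counterpart, then measure how much the dependency structure changed. The sketch below is my own minimal illustration of that idea, not code from the talk; the function names and the exact F1-over-arcs metric are assumptions, and it glosses over token alignment when the two sentence versions differ in length.

```python
def arc_set(heads):
    """Represent a dependency parse as a set of (head, dependent) arcs.
    `heads` is a list where heads[i] is the head index of token i+1
    (0 denotes the root)."""
    return {(h, d) for d, h in enumerate(heads, start=1)}

def robustness_f1(gram_heads, ungram_heads):
    """Compare the parse of an ungrammatical sentence against the parse
    of its grammatical counterpart, used here in place of a gold tree.
    Returns an F1 score over shared dependency arcs."""
    gold = arc_set(gram_heads)
    pred = arc_set(ungram_heads)
    overlap = len(gold & pred)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A hypothetical 4-token sentence whose parse changed on one arc:
score = robustness_f1([2, 0, 2, 3], [2, 0, 2, 2])  # 3 of 4 arcs kept
```

A score of 1.0 would mean the ungrammatical input left the parse tree intact; lower scores indicate the tree fragmented under the noise.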

Huma ran an experiment on eight well-known parsers (Malt, Mate, MST, SNN, SyntaxNet, Turbo, Tweebo, and Yara) using two datasets: (1) the Penn Treebank (news data), 5,000 sentences, and (2) Tweebank (Twitter data), 1,000 tweets. She concluded that parsers indeed respond differently to ungrammatical sentences of various types. If the input text is more similar to tweets (e.g., containing URLs and emoticons), the Malt or Turbo parser may be a good choice. If the input is more similar to machine translation output, SyntaxNet, Malt, Tweebo, and Turbo are good options.

5317 Sennott Square
University of Pittsburgh
