Enriching Linguistic Analyses by Modelling Neutral and Controversial Items
Linguistic analyses are typically performed over datasets of text items, each assigned a category that represents a phenomenon. This category is obtained by combining multiple human annotations. The items considered for analysis are often those that exhibit a clearly polarized phenomenon (e.g. either polite or impolite). However, language can also exhibit neither phenomenon (e.g. neither polite nor impolite) or a combination of phenomena (e.g. both polite and impolite). This is evident in NLU datasets, which contain a significant number of items on which annotators either disagreed or agreed that no phenomenon is present. The goal of this work is to discover linguistic patterns associated with such items. Doing so enriches linguistic analyses by providing insight into how language may be interpreted by different listeners.