This paper proposes an improvement over a paper[A generic unsupervised methods for decomposing multi-author documents, N. Akiva and M. Koppel 2013]. We have worked on two aspects, In the first aspect, we try to capture writing style of author by ngram model of words, POS Tags and PQ Gram model of syntactic parsing over used basic uni-gram model. In the second aspect, we added some layers of refinements in existing baseline model and introduce new term ”similarity index” to distinguish between pure and mixed segments before unsupervised labeling. Similarity index uses overall and sudden change of writing style by PQ Gram model and words used using n-gram model between lexicalised/unlexicalised sentences in segments for refinement. In this paper, we investigate the role of feature selection that captures the syntactic patterns specific to an author and its overall effect in the final accuracy of the baseline system. More specifically, we insert a layer of refinement to the baseline system and define a threshold based on the similarity measure among the sentences to consider the purity of the segments to be given as input to the GMM.The key idea of our approach is to provide theGMMclustering with the ”good segments” so that the clustering precision is maximised which is then used as labels to train a classifier. We also try different features set like bigrams and trigrams of POS tags and an PQ Grams based feature on unlexicalised PCFG to capture the distinct writing styles which is then given as an input to a GMM trained by iterative EM algorithm to generate good clusters of the segments of the merged document.
Comments: 4 Pages.
[v1] 2019-10-28 08:51:06
Unique-IP document downloads: 13 times
Vixra.org is a pre-print repository rather than a journal. Articles hosted may not yet have been verified by peer-review and should be treated as preliminary. In particular, anything that appears to include financial or legal advice or proposed medical treatments should be treated with due caution. Vixra.org will not be responsible for any consequences of actions that result from any form of use of any documents on this website.
Add your own feedback and questions here:
You are equally welcome to be positive or negative about any paper but please be polite. If you are being critical you must mention at least one specific error, otherwise your comment will be deleted as unhelpful.