On the K-Mer Frequency Spectra of Organism Genome and Proteome Sequences with a Preliminary Machine Learning Assessment of Prime Predictability

Nathan O. Schmidt

A regular-expression and region-specific filtering system for biological records in the National Center for Biotechnology Information (NCBI) database is integrated into an object-oriented sequence counting application, and a statistical software suite is designed and deployed to interpret the resulting k-mer frequencies, with a priority focus on nullomers. The proteome k-mer frequency spectra of ten model organisms and the genome k-mer frequency spectra of two bacteria and virus strains, for both the coding and non-coding regions, are comparatively scrutinized. We observe that the naturally evolved (NCBI/organism) and the artificially biased (randomly generated) sequences exhibit a clear deviation from the artificially unbiased (randomly generated) histogram distributions. Furthermore, a preliminary assessment of prime predictability is conducted on chronologically ordered NCBI genome snapshots over an 18-month period using an artificial neural network. Three distinct supervised machine learning algorithms are used to train and test the system on customized NCBI data sets to forecast future prime states, revealing that, to a modest degree, it is feasible to make such predictions.
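
To make the central quantities concrete, the following is a minimal sketch of k-mer counting, nullomer enumeration, and a frequency spectrum for a single sequence. It is not the author's application: the function names (kmer_counts, nullomers, frequency_spectrum), the overlapping-window counting, and the four-letter DNA alphabet are all assumptions chosen for illustration, and a nullomer is taken here to mean a k-mer over the alphabet that never occurs in the given sequence.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k):
    """Count every overlapping k-mer (window of length k) in `sequence`."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def nullomers(sequence, k, alphabet="ACGT"):
    """Return, in sorted order, the k-mers over `alphabet` absent from `sequence`.

    Assumption: "nullomer" means a k-mer with zero occurrences in this sequence.
    """
    seen = kmer_counts(sequence, k)
    candidates = ("".join(letters) for letters in product(alphabet, repeat=k))
    return sorted(kmer for kmer in candidates if kmer not in seen)


def frequency_spectrum(sequence, k, alphabet="ACGT"):
    """Map each occurrence count to the number of distinct k-mers with that count.

    Every possible k-mer over `alphabet` is considered, so the entry for
    count 0 equals the number of nullomers.
    """
    seen = kmer_counts(sequence, k)
    candidates = ("".join(letters) for letters in product(alphabet, repeat=k))
    return dict(Counter(seen.get(kmer, 0) for kmer in candidates))


if __name__ == "__main__":
    toy = "ACGTACGT"  # hypothetical toy sequence, not NCBI data
    print(kmer_counts(toy, 2))
    print(nullomers(toy, 2))
    print(frequency_spectrum(toy, 2))
```

Enumerating all candidate k-mers costs time proportional to the alphabet size raised to the power k, so this direct approach is only practical for small k; for proteomes the alphabet argument would be replaced by the twenty amino-acid letters.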

Comments: 130 pages and 18 figures

Submission history

[v1] 2013-02-04 16:39:12

Vixra.org is a pre-print repository rather than a journal. Articles hosted may not yet have been verified by peer review and should be treated as preliminary.

Quantitative Biology