Authors: Nathan O. Schmidt
A regular-expression and region-specific filtering system for biological records in the National Center for Biotechnology Information (NCBI) database is integrated into an object-oriented sequence-counting application, and a statistical software suite is designed and deployed to interpret the resulting k-mer frequencies, with a priority focus on nullomers. The proteome k-mer frequency spectra of ten model organisms and the genome k-mer frequency spectra of two bacteria and virus strains for the coding and non-coding regions are compared. We observe that the naturally evolved (NCBI/organism) and the artificially biased (randomly generated) sequences deviate clearly from the histogram distributions of the artificially unbiased (randomly generated) sequences. Furthermore, a preliminary assessment of prime predictability is conducted on chronologically ordered NCBI genome snapshots spanning an 18-month period using an artificial neural network: three distinct supervised machine learning algorithms are used to train and test the system on customized NCBI data sets to forecast future prime states. The results indicate that such predictions are feasible to a modest degree.
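As a rough illustration of the k-mer counting and nullomer detection the abstract refers to, the following minimal Python sketch counts overlapping k-mers in a single sequence and lists the k-mers over a given alphabet that never occur. It is not the author's application: the function names `count_kmers` and `find_nullomers`, the default nucleotide alphabet, and the toy input are assumptions made only for this example.

```python
from collections import Counter
from itertools import product


def count_kmers(sequence, k):
    """Count every overlapping k-mer (substring of length k) in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def find_nullomers(sequence, k, alphabet="ACGT"):
    """Return, in sorted order, the k-mers over `alphabet` absent from the sequence.

    Hypothetical helper: enumerates all len(alphabet)**k candidates,
    so it is only practical for small k.
    """
    counts = count_kmers(sequence, k)
    candidates = ("".join(letters) for letters in product(alphabet, repeat=k))
    return sorted(kmer for kmer in candidates if kmer not in counts)


if __name__ == "__main__":
    toy_sequence = "ACGTACG"  # illustrative input, not real NCBI data
    print(count_kmers(toy_sequence, 2))
    print(find_nullomers(toy_sequence, 2))
```

For protein sequences the same sketch applies with the 20-letter amino acid alphabet passed as `alphabet`, although exhaustive enumeration grows as the alphabet size raised to the power k.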
Comments: 130 pages and 18 figures
[v1] 2013-02-04 16:39:12