Get the All New 11" iPad Pro, or Microsoft Surface Pro, or Take $350 Off OnDemand and vLive Training - ends 12/5!

Reading Room

Subscribe to SANS Newsletters

Join the SANS Community to receive the latest curated cyber security news, vulnerabilities and mitigations, training opportunities, and our webcast schedule.






Artificial Intelligence

Featuring 1 Paper as of July 11, 2018

  • Times Change and Your Training Data Should Too: The Effect of Training Data Recency on Twitter Classifiers STI Graduate Student Research
    by Ryan O'Grady - July 11, 2018 

    Sophisticated adversaries are moving their botnet command and control infrastructure to social media microblogging sites such as Twitter. As security practitioners work to identify new methods for detecting and disrupting such botnets, including machine-learning approaches, we must better understand what effect training data recency has on classifier performance. This research investigates the performance of several binary classifiers and their ability to distinguish between non-verified and verified tweets as the offset between the age of the training data and test data changed. Classifiers were trained on three feature sets: tweet-only features, user-only features, and all features. Key findings show that classifiers perform best at +0 offset, feature importance changes over time, and more features are not necessarily better. Classifiers using user-only features performed best, with a mean Matthews correlation coefficient of 0.95 ± 0.04 at +0 offset, 0.58 ± 0.43 at -8 offset, and 0.51 ± 0.21 at +8 offset. The R2 values are 0.90, 0.34, and 0.26, respectively. Thus, the classifiers tested with +0 offset accounted for 56% to 64% more variance than those tested with −8 and +8 offset. These results suggest that classifier performance is sensitive to the recency of the training data relative to the test data. Further research is needed to replicate this experiment with botnet vs. non-botnet tweets to determine if similar classifier performance is possible and the degree to which performance is sensitive to training data recency.


Most of the computer security white papers in the Reading Room have been written by students seeking GIAC certification to fulfill part of their certification requirements and are provided by SANS as a resource to benefit the security community at large. SANS attempts to ensure the accuracy of information, but papers are published "as is". Errors or inconsistencies may exist or may be introduced over time as material becomes dated. If you suspect a serious error, please contact webmaster@sans.org.

All papers are copyrighted. No re-posting or distribution of papers is permitted.

STI Graduate Student Research - This paper was created by a SANS Technology Institute student as part of the graduate program curriculum.