Optimizing Anomaly Detection for Log Filtering and Integration with SIEM Using Machine Learning

Salsabila Amalia Harjanto, Mutiara Nurhaliza, Jody Hezekiah Tanasa Sagala

Abstract

Cybersecurity has become a paramount concern in today's digital age, necessitating robust systems such as Security Information and Event Management (SIEM) for effective threat detection through log analysis. Traditional rule-based methods are often inadequate because static rules are prone to false positives. In this study, we propose a Machine Learning-based approach to optimize anomaly detection in Hadoop Distributed File System (HDFS) logs. We evaluate Decision Tree, Naive Bayes, Log Clustering, Support Vector Machine (SVM), and Logistic Regression. Among the models tested, Log Clustering achieves the highest accuracy (98.19%) and the highest recall (56.05%). These findings suggest that Log Clustering is an effective choice for anomaly detection in big data environments and is well suited to integration with SIEM systems.
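
The abstract reports accuracy and recall side by side, and the two can diverge sharply when anomalies are rare. The following minimal sketch shows how both metrics are computed from confusion-matrix counts. The counts are hypothetical values chosen only for illustration; they are not taken from the study, and the helper names `accuracy` and `recall` are assumptions of this sketch.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all log sessions that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp: int, fn: int) -> float:
    """Fraction of true anomalies that the model actually flagged."""
    total_anomalies = tp + fn
    return tp / total_anomalies if total_anomalies else 0.0


# Hypothetical counts for an imbalanced log dataset (NOT from the paper):
# 100 anomalous sessions and 3900 normal sessions.
tp, fn = 56, 44      # anomalies detected / anomalies missed
tn, fp = 3890, 10    # normal sessions passed / normal sessions falsely flagged

acc = accuracy(tp, fp, tn, fn)
rec = recall(tp, fn)
print(f"accuracy={acc:.4f} recall={rec:.4f}")
```

Because normal sessions dominate the hypothetical dataset, the accuracy stays high even though nearly half of the anomalies are missed, which is why recall is worth reading alongside accuracy when comparing anomaly detectors.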

Keywords

Cyber Security, SIEM, Anomaly Detection, HDFS Log, Machine Learning


