Project Description
Artificial Intelligence (AI) is expected to be an integral part of next-generation AI-native 6G networks. With the prevalence of AI, researchers have identified numerous use cases of AI in network security. However, studies that analyze the suitability of Large Language Models (LLMs) for network security are almost nonexistent.
In this project, we investigate the suitability of LLMs for network security use cases. For this purpose, we select the case study of "Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, and Elevation of Privilege (STRIDE)" threat modeling of 5G threats and vulnerabilities. We perform experiments by selecting six 5G threats and utilizing four different prompting techniques with five LLMs. Based on the evaluation results of the LLM-based STRIDE classification, we provide detailed insights and discuss potential underlying factors that influence the behavior of LLMs when modeling certain threats, including an incorrect threat perspective, failure to identify second-order threats, and the positive impact of Few-Shot (FS) prompting on performance. We analyze the suitability of LLMs using numerical testing and various performance metrics, including accuracy, precision, recall, and F1 score. Our results indicate that the performance of the selected LLMs is comparable, highlighting the need to enhance these models for STRIDE threat modeling in 5G networks and to fine-tune LLMs for network security use cases.
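To make one of the prompting techniques concrete, below is a minimal sketch of how a Few-Shot (FS) STRIDE classification prompt might be assembled. The exemplar threats, wording, and function names are illustrative assumptions of ours, not the paper's actual prompts.

```python
# Illustrative sketch of a Few-Shot (FS) STRIDE classification prompt.
# The exemplar threats and wording are hypothetical; the paper's exact
# prompts may differ.

STRIDE_CATEGORIES = [
    "Spoofing", "Tampering", "Repudiation",
    "Information Disclosure", "Denial of Service", "Elevation of Privilege",
]

FEW_SHOT_EXEMPLARS = [
    # (threat description, applicable STRIDE categories) -- illustrative only
    ("An attacker replays captured authentication messages to impersonate "
     "a legitimate subscriber.", ["Spoofing"]),
    ("An attacker floods a core network function with registration "
     "requests, exhausting its resources.", ["Denial of Service"]),
]

def build_few_shot_prompt(threat_description: str) -> str:
    """Assemble a few-shot prompt asking an LLM to map a 5G threat
    onto the six STRIDE categories."""
    lines = [
        "You are a security analyst performing STRIDE threat modeling "
        "of 5G networks.",
        "Classify the threat into one or more of: "
        + ", ".join(STRIDE_CATEGORIES) + ".",
        "",
    ]
    # Prepend the worked exemplars before the threat under test.
    for description, labels in FEW_SHOT_EXEMPLARS:
        lines.append(f"Threat: {description}")
        lines.append(f"STRIDE categories: {', '.join(labels)}")
        lines.append("")
    lines.append(f"Threat: {threat_description}")
    lines.append("STRIDE categories:")
    return "\n".join(lines)
```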
Evaluation Methodology
Our evaluation methodology is shown in the figure below. We first select multiple threats and vulnerabilities in the 5G network, along with their baseline STRIDE classifications, from the published literature and standards. Then, we use various prompting techniques to perform the LLM-based STRIDE classification of the selected threats. Finally, we compare the LLM-based STRIDE classifications to the baseline and evaluate the results.

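To make the comparison step concrete, here is a minimal sketch of how the LLM-based classifications could be checked against the baseline. The threat identifiers, baseline entries, and function names are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch of the comparison step: an LLM's predicted STRIDE labels
# for each threat are checked against the baseline taken from the
# literature. Threat names and baseline labels here are placeholders.

from typing import Callable, Dict, Set

# Baseline STRIDE classification per threat (placeholder entries).
BASELINE: Dict[str, Set[str]] = {
    "threat-1": {"Spoofing", "Information Disclosure"},
    "threat-2": {"Denial of Service"},
}

def evaluate_llm(classify: Callable[[str], Set[str]]) -> Dict[str, Dict[str, Set[str]]]:
    """Compare an LLM classifier's output against the baseline.

    `classify` maps a threat identifier to the set of STRIDE categories
    the LLM assigns to it."""
    report = {}
    for threat, baseline_labels in BASELINE.items():
        predicted = classify(threat)
        report[threat] = {
            "matched": predicted & baseline_labels,  # agrees with baseline
            "extra": predicted - baseline_labels,    # LLM-only labels
            "missed": baseline_labels - predicted,   # baseline-only labels
        }
    return report

if __name__ == "__main__":
    # Toy classifier standing in for a real LLM call.
    toy = lambda threat: {"Spoofing"} if threat == "threat-1" else {"Tampering"}
    print(evaluate_llm(toy))
```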
Results
The results of the LLM-based STRIDE classification of the six 5G threats are shown in the table below. The first column lists the prompting techniques we employ in our evaluation, and the second column lists the LLMs we select. The remaining columns present the baseline STRIDE classification alongside the LLM-based STRIDE classification of the 5G threats. In this table, a white cell with a dot (∙) represents a positive baseline value, while an empty white cell represents a negative baseline value. Against this baseline, we categorize the LLM-based STRIDE classifications as True Positive (TP, dark green cell with a dot), True Negative (TN, empty green cell), False Positive (FP, yellow cell with a dot), and False Negative (FN, empty red cell). In this coloring scheme, green (TP and TN) marks correct classifications; yellow (FP) marks an incorrect classification that over-predicts positive, which is not the worst outcome; and red (FN) marks the worst outcome, an incorrect classification that under-predicts positive. The main insights and observations from these results are explained in the paper [1].
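The cell categorization described above reduces to a binary comparison per (threat, STRIDE category) pair. The sketch below mirrors the table's color coding; the function name is our own.

```python
# Sketch of how each table cell is categorized: for every (threat, STRIDE
# category) pair, the LLM's binary decision is compared against the binary
# baseline value, yielding TP, TN, FP, or FN as in the table's color coding.

def categorize_cell(baseline_positive: bool, llm_positive: bool) -> str:
    if baseline_positive and llm_positive:
        return "TP"  # dark green cell with a dot
    if not baseline_positive and not llm_positive:
        return "TN"  # empty green cell
    if not baseline_positive and llm_positive:
        return "FP"  # yellow cell with a dot (over-predicting positive)
    return "FN"      # empty red cell (under-predicting positive, worst outcome)
```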
![STRIDE classification results of six 5G threats across five LLMs and four prompting techniques](https://arxiv.org/html/2505.04101v1/x2.png)
The figure below illustrates the performance of the LLMs in terms of accuracy, precision, recall, and F1 score using a heatmap chart. Each result is averaged across all threats and all prompting techniques for a given LLM. We observe that GPT-4o, Claude 3.7 Sonnet, and Grok-2 show relatively higher accuracy and precision but lower recall compared to the other two LLMs, while Sonar and Gemini 2.5 Pro achieve lower precision and higher recall. The F1 scores indicate that the performance of all the LLMs is comparable for this case study. The maximum accuracy achieved is 72%, which highlights room for improvement across all LLMs, perhaps through fine-tuning for the specific application of STRIDE threat modeling in 5G networks.
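For reference, the four reported metrics follow their standard definitions over the confusion counts. The sketch below assumes counts are pooled across all threats and prompting techniques for one LLM; whether the paper pools counts or averages per-run metrics is an assumption on our part.

```python
# Standard metric definitions over confusion counts (TP, TN, FP, FN),
# assumed here to be pooled over all threats and prompting techniques
# for a single LLM.

def metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```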

Reference:
[1] AbdulGhaffar, AbdulAziz, and Ashraf Matrawy. “LLMs’ Suitability for Network Security: A Case Study of STRIDE Threat Modeling.” arXiv preprint arXiv:2505.04101 (2025).