Large language models (LLMs) are increasingly applied across cybersecurity domains, including threat intelligence, vulnerability detection, anomaly analysis, and red teaming. These applications are supported by both specialized cybersecurity LLMs and general-purpose models.
Specialized cybersecurity LLMs
| Model | Release date | Model type | Training focus |
| --- | --- | --- | --- |
| SecLLM | 2024 | Code LLaMA variant | Insecure code samples; CVE-linked code snippets; exploit patterns |
| LLM4Cyber | 2024 | Fine-tuned general LLM | MITRE ATT&CK; CVE; threat intelligence feeds (CTI) |
| LlamaGuard | 2024 | Safety-aligned LLaMA | Safety filter prompts; input/output policy enforcement; adversarial prompt handling |
| SecGPT | 2023 | GPT-style LLM | Cybersecurity text; CVE reports |
| Cybersecurity-BERT | 2023 | BERT (encoder-only) | Malware reports; vulnerability descriptions; technical security documentation |
General-purpose LLMs for cybersecurity
These large language models are not trained solely on cybersecurity data, yet they can still perform well in the domain when prompted appropriately, as benchmarks like SecBench show.
Examples:
- GPT-4 / GPT-4o
- DeepSeek-V3
- Mistral
- Qwen2 / Yi / LLaMA-3-Instruct
- Hunyuan-Turbo
Benchmarking LLM performance across cybersecurity domains
This benchmark, which follows the SecBench design, evaluates 7 general-purpose LLMs, including both proprietary (e.g., GPT-4) and open-source models (e.g., DeepSeek, Mistral). It spans 9 cybersecurity subfields, including:
- Data Security
- Identity & Access Management
- Application Security
- Network Security
- Security Standards (and others)
In the charts below, the x-axis domains are sorted by LLM performance, with lower-scoring domains placed toward the left and higher-scoring ones toward the right.
MCQ (Multiple-Choice Question) benchmarking results:
SAQ (Short Answer Question) benchmarking results:
Source: SecBench design.1 See the benchmark methodology section below for details.
The role of LLMs in cybersecurity
Large language models (LLMs) are used across cybersecurity operations to extract actionable insights from unstructured sources such as threat intelligence reports, incident logs, CVE databases, and attacker TTPs.
LLMs automate key tasks, including threat classification, alert summarization, and correlation of indicators of compromise (IOCs).
When fine-tuned on cybersecurity data, large language models can detect anomalies in logs, analyze phishing emails, prioritize vulnerabilities, and map threats to frameworks like MITRE ATT&CK.
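To make this concrete, the sketch below prompts a model to pull indicators of compromise out of a raw alert and suggest candidate MITRE ATT&CK technique IDs. It is a minimal illustration of the workflow described above, not any specific product's pipeline; the `llm()` helper, the prompt wording, and the sample alert are assumptions for this example.

```python
import json

def llm(prompt: str) -> str:
    """Placeholder for any chat/completion API (hosted or local); assumed to
    return the model's text response. Wire in a real provider to run this."""
    raise NotImplementedError

SAMPLE_ALERT = """\
2024-06-02 03:14:07 host=web-01 proc=powershell.exe
cmdline="powershell -enc SQBFAFgA..." outbound=203.0.113.47:443
"""

def triage(alert: str) -> dict:
    prompt = (
        "You are a SOC assistant. For the alert below:\n"
        "1. List any indicators of compromise (IPs, hashes, domains).\n"
        "2. Suggest likely MITRE ATT&CK technique IDs with one-line justifications.\n"
        'Return JSON with keys "iocs" and "techniques".\n\n'
        f"Alert:\n{alert}"
    )
    raw = llm(prompt)
    try:
        # LLM output is untrusted text: parse defensively, fall back to raw output.
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"raw": raw}
```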
Applications of large language models in cybersecurity
Threat intelligence
Co-pilot for contextual threat analysis: LLM-powered tools like CyLens support security analysts throughout the threat intelligence lifecycle—handling attribution, detection, correlation, triage, and remediation by analyzing extensive threat reports with modular NLP pipelines and entity correlation filters.2
Real-time proactive threat intelligence: Systems integrate LLMs with retrieval-augmented generation (RAG) frameworks to ingest continuous CTI feeds (e.g., CVE entries) into vector databases (such as Milvus), enabling up-to-date automated detection, scoring, and contextual reasoning.3 A minimal retrieval sketch follows this list.
Forum-based CTI extraction: LLMs analyze unstructured data from cybercrime forums to extract key threat indicators using simple prompts.4
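A minimal sketch of the retrieval step mentioned above: CVE descriptions are embedded and kept in a tiny in-memory index that stands in for a vector database such as Milvus, and the closest entries are pulled into the prompt at question time. The `embed()` and `llm()` helpers are placeholders for whichever embedding and chat models a real deployment would use; this illustrates the general RAG pattern rather than the cited system.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for an embedding model (e.g., a sentence-transformer)."""
    raise NotImplementedError

def llm(prompt: str) -> str:
    """Placeholder for a chat/completion model."""
    raise NotImplementedError

class CTIIndex:
    """Tiny in-memory stand-in for a vector database such as Milvus."""

    def __init__(self):
        self.vectors, self.docs = [], []

    def ingest(self, cve_id: str, description: str):
        doc = f"{cve_id}: {description}"
        self.vectors.append(embed(doc))
        self.docs.append(doc)

    def search(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        # Cosine similarity against every stored CTI entry, keep the top-k.
        sims = [
            float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
            for v in self.vectors
        ]
        top = np.argsort(sims)[::-1][:k]
        return [self.docs[i] for i in top]

def answer(index: CTIIndex, question: str) -> str:
    # Ground the model's answer in the most relevant retrieved CTI entries.
    context = "\n".join(index.search(question))
    return llm(f"Using only this threat intel:\n{context}\n\nQuestion: {question}")
```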
Vulnerability detection
Vulnerability description enrichment: LLMs such as CVE‑LLM enrich vulnerability descriptions using domain ontologies, enabling automated triage and CVSS scoring integration within existing security management systems.5
Android filesystem vulnerability detection: Investigates how LLMs can detect file system access vulnerabilities in Android apps, including permission abuse and insecure storage.6 A simple prompting sketch follows this list.
RL fine‑tuning for vulnerability detection: Applies reinforcement learning (RL) to fine-tune LLMs (LLaMA 3B/8B, Qwen 2.5B) for improved accuracy in identifying software vulnerabilities.7
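For the Android filesystem case referenced above, one simple pattern is to hand the model a code snippet together with a checklist of risky behaviors and ask for structured findings. The snippet, the checklist, and the `llm()` helper below are illustrative assumptions, not the method of the cited paper.

```python
def llm(prompt: str) -> str:
    """Placeholder for any LLM completion call."""
    raise NotImplementedError

# MODE_WORLD_READABLE is a deprecated, insecure Android file mode: a deliberate red flag.
JAVA_SNIPPET = """
FileOutputStream fos = openFileOutput("tokens.txt", Context.MODE_WORLD_READABLE);
fos.write(apiToken.getBytes());
"""

CHECKLIST = [
    "world-readable or world-writable file modes",
    "secrets written to external storage",
    "unvalidated paths built from user input (path traversal)",
]

def review_snippet(code: str) -> str:
    prompt = (
        "Review this Android code for file-system vulnerabilities.\n"
        "Check specifically for: " + "; ".join(CHECKLIST) + ".\n"
        "Report each finding with the offending line and a suggested fix.\n\n"
        + code
    )
    return llm(prompt)
```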
Anomaly detection & log analysis
Semantic log anomaly detection: Frameworks like LogLLM use LLM encoders/decoders to parse and classify log entries, improving anomaly detection beyond pattern matching.8
Log parsing with large language models: Automated LLM parsing converts unstructured logs into structured formats via prompt‑based and fine‑tuned approaches.9
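A rough sketch of the prompt-based parsing approach: the model is asked to turn raw log lines into a fixed JSON structure that downstream anomaly detectors can consume. The field list, the sample logs, and the `llm()` helper are assumptions for illustration rather than LogLLM's actual pipeline.

```python
import json

def llm(prompt: str) -> str:
    """Placeholder for any LLM completion call."""
    raise NotImplementedError

RAW_LOGS = [
    "Jun 02 03:14:07 web-01 sshd[8812]: Failed password for root from 203.0.113.9 port 52144 ssh2",
    "Jun 02 03:14:12 web-01 sshd[8812]: Accepted password for deploy from 10.0.0.5 port 40110 ssh2",
]

# Informal field list the model is asked to follow; not a formal JSON Schema document.
FIELDS = "timestamp, host, process, event, src_ip, outcome (success|failure)"

def parse_logs(lines: list[str]) -> list[dict]:
    prompt = (
        f"Convert each log line into a JSON object with these fields: {FIELDS}\n"
        "Return a JSON array with one object per line and nothing else.\n\n"
        + "\n".join(lines)
    )
    try:
        return json.loads(llm(prompt))
    except json.JSONDecodeError:
        return []  # model output is untrusted; fail closed rather than crash
```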
Red teaming / LLM-assisted attack prevention
LLM-driven pentesting and remediation (PenHeal): Automates penetration testing using a two-stage pipeline that first identifies security weaknesses and then generates remediation actions with a custom LLM setup.10 A simplified sketch of this two-stage pattern follows this list.
On-prem red team agent for internal security (Hackphyr): Deploys a fine-tuned 7B LLM agent locally to perform red-team tasks such as lateral movement simulation, credential harvesting, and vulnerability scanning in network environments.11
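The two-stage idea behind such pipelines can be sketched as two chained prompts: a first pass enumerates likely weaknesses from a scan summary, and a second pass proposes remediations for each finding. This is a simplified illustration of the pattern, not PenHeal's actual agent design; the `llm()` helper and the scan summary are assumptions.

```python
def llm(prompt: str) -> str:
    """Placeholder for any LLM completion call."""
    raise NotImplementedError

SCAN_SUMMARY = """\
host 10.0.0.12: OpenSSH 7.2p2 on port 22; Apache 2.4.29 on port 80
exposing /phpmyadmin; anonymous FTP enabled on port 21
"""

def find_weaknesses(scan: str) -> str:
    # Stage 1: enumerate likely weaknesses, one per line, most severe first.
    return llm(
        "List the most likely exploitable weaknesses in this scan summary, "
        "one per line, most severe first:\n" + scan
    )

def plan_remediation(findings: str) -> str:
    # Stage 2: turn each finding into a concrete, prioritized remediation step.
    return llm(
        "For each finding below, give a concrete remediation step "
        "(patch, configuration change, or compensating control), keeping the same order:\n"
        + findings
    )

def run_pipeline(scan: str) -> str:
    return plan_remediation(find_weaknesses(scan))
```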
Benchmark methodology
SecBench is a large-scale, multi-dimensional benchmark for evaluating LLMs in cybersecurity across different tasks, domains, languages, and formats.
Evaluation dimensions
1. Multi-level reasoning:
- Knowledge Retention (KR): Questions that test factual knowledge or definitions. These are more straightforward.
- Logical Reasoning (LR): Questions that require inference and deeper understanding. These are more challenging and test the model’s ability to reason based on context.
2. Multi-format:
- MCQs (Multiple-Choice Questions): Traditional format where the model selects from predefined answers. Total of 44,823 questions.
- SAQs (Short Answer Questions): Open-ended format requiring the model to generate a free-form answer, used to evaluate reasoning, clarity, and resistance to hallucination. Total of 3,087 questions.
3. Multi-language:
SecBench includes questions in both Chinese and English.
4. Multi-domain:
Questions span 9 cybersecurity domains (D1–D9), including: security management, data security, network security, application security, cloud security, and more.
Evaluation
MCQs are graded by checking if the model selects the correct choice(s).
SAQs are graded using a GPT-4o mini “grading agent”, which compares the model’s response to the ground truth and assigns a score based on accuracy and completeness.
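A hedged sketch of what such an LLM-as-judge step can look like is shown below; the prompt wording, the 0-10 scale, and the `llm()` helper are assumptions for illustration, not SecBench's exact grading prompt.

```python
import re

def llm(prompt: str) -> str:
    """Placeholder for the grading model (a GPT-4o mini grading agent per the text)."""
    raise NotImplementedError

GRADING_PROMPT = """You are grading a short-answer cybersecurity question.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Score the candidate from 0 to 10 for accuracy and completeness.
Output only the integer score."""

def grade_saq(question: str, reference: str, candidate: str) -> int:
    raw = llm(GRADING_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    ))
    match = re.search(r"\d+", raw)  # tolerate extra text around the score
    return int(match.group()) if match else 0
```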
LLM performance evaluation: For example, Network Security (D3) is assessed by grouping the relevant questions from the 44,823-question MCQ dataset. Accuracy is then measured per model on the questions labeled D3: a model’s percentage score for D3 reflects the proportion of network security questions it answered correctly.
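The per-domain score described here is ordinary accuracy restricted to one label. A minimal sketch, assuming graded results are available as (domain, answered_correctly) records:

```python
from collections import defaultdict

def domain_accuracy(results: list[tuple[str, bool]]) -> dict[str, float]:
    """results: (domain_label, answered_correctly) pairs, e.g. ("D3", True)."""
    totals, correct = defaultdict(int), defaultdict(int)
    for domain, ok in results:
        totals[domain] += 1
        correct[domain] += int(ok)
    return {d: correct[d] / totals[d] for d in totals}

# Example: 2 of 3 Network Security (D3) questions answered correctly
print(domain_accuracy([("D3", True), ("D3", False), ("D3", True), ("D1", True)]))
# {'D3': 0.6666666666666666, 'D1': 1.0}
```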
External Links
- 1. https://arxiv.org/pdf/2412.20787
- 2. https://arxiv.org/abs/2502.20791
- 3. https://arxiv.org/abs/2504.00428
- 4. https://arxiv.org/pdf/2408.03354
- 5. https://arxiv.org/pdf/2502.15932
- 6. https://arxiv.org/pdf/2407.11279
- 7. https://arxiv.org/pdf/2505.02079
- 8. https://arxiv.org/pdf/2411.08561
- 9. https://arxiv.org/pdf/2504.04877
- 10. https://arxiv.org/pdf/2407.13267
- 11. https://arxiv.org/pdf/2407.08991