Large language models (LLMs) are increasingly applied across cybersecurity domains, including threat intelligence, vulnerability detection, anomaly analysis, and red teaming. These applications are supported by both specialized cybersecurity LLMs and general-purpose models.
Specialized cybersecurity LLMs
| Model | Release date | Model type | Training focus |
| --- | --- | --- | --- |
| SecLLM | 2024 | Code LLaMA variant | Insecure code samples; CVE-linked code snippets; exploit patterns |
| LLM4Cyber | 2024 | Fine-tuned general LLM | MITRE ATT&CK; CVE; threat intelligence feeds (CTI) |
| LlamaGuard | 2024 | Safety-aligned LLaMA | Safety filter prompts; input/output policy enforcement; adversarial prompt handling |
| SecGPT | 2023 | GPT-style LLM | Cybersecurity text; CVE reports |
| Cybersecurity-BERT | 2023 | BERT (encoder-only) | Malware reports; vulnerability descriptions; technical security documentation |
General-purpose LLMs for cybersecurity
These large language models are not trained solely on cybersecurity data, yet they can still perform well in the domain when prompted appropriately, as benchmarks like SecBench show.
Examples:
- GPT-4 / GPT-4o
- DeepSeek-V3
- Mistral
- Qwen2 / Yi / LLaMA-3-Instruct
- Hunyuan-Turbo
Benchmarking LLM performance across cybersecurity domains
This benchmark, which follows the SecBench design, evaluates 7 general-purpose LLMs, including both proprietary (e.g., GPT-4) and open-source models (e.g., DeepSeek, Mistral). It spans 9 cybersecurity subfields, including:
- Data Security
- Identity & Access Management
- Application Security
- Network Security
- Security Standards (and others)
In the charts below, the x-axis domains are sorted by LLM performance, with lower-scoring domains placed toward the left and higher-scoring ones toward the right.
MCQ (Multiple-Choice Question) benchmarking results:
SAQ (Short Answer Question) benchmarking results:
Source: SecBench design.1 See the benchmark methodology section below for details.
The role of LLMs in cybersecurity
Large language models (LLMs) are used across cybersecurity operations to extract actionable insights from unstructured sources such as threat intelligence reports, incident logs, CVE databases, and attacker TTPs.
LLMs automate key tasks, including threat classification, alert summarization, and correlation of indicators of compromise (IOCs).
When fine-tuned on cybersecurity data, large language models can detect anomalies in logs, analyze phishing emails, prioritize vulnerabilities, and map threats to frameworks like MITRE ATT&CK.
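To make this concrete, the sketch below prompts a model to pull indicators of compromise out of a raw alert and suggest candidate MITRE ATT&CK technique IDs. It is a minimal illustration of the workflow described above, not any specific product's pipeline; the `llm()` helper, the prompt wording, and the sample alert are assumptions for this example.

```python
import json

def llm(prompt: str) -> str:
    """Placeholder for any chat/completion API (hosted or local); assumed to
    return the model's text response. Wire in a real provider to run this."""
    raise NotImplementedError

SAMPLE_ALERT = """\
2024-06-02 03:14:07 host=web-01 proc=powershell.exe
cmdline="powershell -enc SQBFAFgA..." outbound=203.0.113.47:443
"""

def triage(alert: str) -> dict:
    prompt = (
        "You are a SOC assistant. For the alert below:\n"
        "1. List any indicators of compromise (IPs, hashes, domains).\n"
        "2. Suggest likely MITRE ATT&CK technique IDs with one-line justifications.\n"
        'Return JSON with keys "iocs" and "techniques".\n\n'
        f"Alert:\n{alert}"
    )
    raw = llm(prompt)
    try:
        # LLM output is untrusted text: parse defensively, fall back to raw output.
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"raw": raw}
```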
Applications of large language models in cybersecurity
Threat intelligence
Co-pilot for contextual threat analysis: LLM-powered tools like CyLens support security analysts throughout the threat intelligence lifecycle—handling attribution, detection, correlation, triage, and remediation by analyzing extensive threat reports with modular NLP pipelines and entity correlation filters.2
Real-time proactive threat intelligence: Systems integrate LLMs with retrieval-augmented generation (RAG) frameworks to ingest continuous CTI feeds (e.g., CVE entries) into vector databases (such as Milvus), enabling up-to-date automated detection, scoring, and contextual reasoning.3 A minimal retrieval sketch follows this list.
Forum-based CTI extraction: LLMs analyze unstructured data from cybercrime forums to extract key threat indicators using simple prompts.4
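A minimal sketch of the retrieval step mentioned above: CVE descriptions are embedded and kept in a tiny in-memory index that stands in for a vector database such as Milvus, and the closest entries are pulled into the prompt at question time. The `embed()` and `llm()` helpers are placeholders for whichever embedding and chat models a real deployment would use; this illustrates the general RAG pattern rather than the cited system.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for an embedding model (e.g., a sentence-transformer)."""
    raise NotImplementedError

def llm(prompt: str) -> str:
    """Placeholder for a chat/completion model."""
    raise NotImplementedError

class CTIIndex:
    """Tiny in-memory stand-in for a vector database such as Milvus."""

    def __init__(self):
        self.vectors, self.docs = [], []

    def ingest(self, cve_id: str, description: str):
        doc = f"{cve_id}: {description}"
        self.vectors.append(embed(doc))
        self.docs.append(doc)

    def search(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        # Cosine similarity against every stored CTI entry, keep the top-k.
        sims = [
            float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
            for v in self.vectors
        ]
        top = np.argsort(sims)[::-1][:k]
        return [self.docs[i] for i in top]

def answer(index: CTIIndex, question: str) -> str:
    # Ground the model's answer in the most relevant retrieved CTI entries.
    context = "\n".join(index.search(question))
    return llm(f"Using only this threat intel:\n{context}\n\nQuestion: {question}")
```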
Vulnerability detection
Vulnerability description enrichment: LLMs such as CVE‑LLM enrich vulnerability descriptions using domain ontologies, enabling automated triage and CVSS scoring integration within existing security management systems.5
Android filesystem vulnerability detection: Investigates how LLMs can detect file system access vulnerabilities in Android apps, including permission abuse and insecure storage.6 A simple prompting sketch follows this list.
RL fine‑tuning for vulnerability detection: Applies reinforcement learning (RL) to fine-tune LLMs (LLaMA 3B/8B, Qwen 2.5B) for improved accuracy in identifying software vulnerabilities.7
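For the Android filesystem case referenced above, one simple pattern is to hand the model a code snippet together with a checklist of risky behaviors and ask for structured findings. The snippet, the checklist, and the `llm()` helper below are illustrative assumptions, not the method of the cited paper.

```python
def llm(prompt: str) -> str:
    """Placeholder for any LLM completion call."""
    raise NotImplementedError

# MODE_WORLD_READABLE is a deprecated, insecure Android file mode: a deliberate red flag.
JAVA_SNIPPET = """
FileOutputStream fos = openFileOutput("tokens.txt", Context.MODE_WORLD_READABLE);
fos.write(apiToken.getBytes());
"""

CHECKLIST = [
    "world-readable or world-writable file modes",
    "secrets written to external storage",
    "unvalidated paths built from user input (path traversal)",
]

def review_snippet(code: str) -> str:
    prompt = (
        "Review this Android code for file-system vulnerabilities.\n"
        "Check specifically for: " + "; ".join(CHECKLIST) + ".\n"
        "Report each finding with the offending line and a suggested fix.\n\n"
        + code
    )
    return llm(prompt)
```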
Anomaly detection & log analysis
Semantic log anomaly detection: Frameworks like LogLLM use LLM encoders/decoders to parse and classify log entries, improving anomaly detection beyond pattern matching.8
Log parsing with large language models: Automated LLM parsing converts unstructured logs into structured formats via prompt‑based and fine‑tuned approaches.9
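A rough sketch of the prompt-based parsing approach: the model is asked to turn raw log lines into a fixed JSON structure that downstream anomaly detectors can consume. The field list, the sample logs, and the `llm()` helper are assumptions for illustration rather than LogLLM's actual pipeline.

```python
import json

def llm(prompt: str) -> str:
    """Placeholder for any LLM completion call."""
    raise NotImplementedError

RAW_LOGS = [
    "Jun 02 03:14:07 web-01 sshd[8812]: Failed password for root from 203.0.113.9 port 52144 ssh2",
    "Jun 02 03:14:12 web-01 sshd[8812]: Accepted password for deploy from 10.0.0.5 port 40110 ssh2",
]

# Informal field list the model is asked to follow; not a formal JSON Schema document.
FIELDS = "timestamp, host, process, event, src_ip, outcome (success|failure)"

def parse_logs(lines: list[str]) -> list[dict]:
    prompt = (
        f"Convert each log line into a JSON object with these fields: {FIELDS}\n"
        "Return a JSON array with one object per line and nothing else.\n\n"
        + "\n".join(lines)
    )
    try:
        return json.loads(llm(prompt))
    except json.JSONDecodeError:
        return []  # model output is untrusted; fail closed rather than crash
```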
Red teaming / LLM-assisted attack prevention
LLM-driven pentesting and remediation (PenHeal): Automates penetration testing using a two-stage pipeline that first identifies security weaknesses and then generates remediation actions with a custom LLM setup.10 A simplified sketch of this two-stage pattern follows this list.
On-prem red team agent for internal security (Hackphyr): Deploys a fine-tuned 7B LLM agent locally to perform red-team tasks such as lateral movement simulation, credential harvesting, and vulnerability scanning in network environments.11
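The two-stage idea behind such pipelines can be sketched as two chained prompts: a first pass enumerates likely weaknesses from a scan summary, and a second pass proposes remediations for each finding. This is a simplified illustration of the pattern, not PenHeal's actual agent design; the `llm()` helper and the scan summary are assumptions.

```python
def llm(prompt: str) -> str:
    """Placeholder for any LLM completion call."""
    raise NotImplementedError

SCAN_SUMMARY = """\
host 10.0.0.12: OpenSSH 7.2p2 on port 22; Apache 2.4.29 on port 80
exposing /phpmyadmin; anonymous FTP enabled on port 21
"""

def find_weaknesses(scan: str) -> str:
    # Stage 1: enumerate likely weaknesses, one per line, most severe first.
    return llm(
        "List the most likely exploitable weaknesses in this scan summary, "
        "one per line, most severe first:\n" + scan
    )

def plan_remediation(findings: str) -> str:
    # Stage 2: turn each finding into a concrete, prioritized remediation step.
    return llm(
        "For each finding below, give a concrete remediation step "
        "(patch, configuration change, or compensating control), keeping the same order:\n"
        + findings
    )

def run_pipeline(scan: str) -> str:
    return plan_remediation(find_weaknesses(scan))
```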
Benchmark methodology
SecBench is a large-scale, multi-dimensional benchmark for evaluating LLMs in cybersecurity across different tasks, domains, languages, and formats.
Evaluation dimensions
1. Multi-level reasoning:
- Knowledge Retention (KR): Questions that test factual knowledge or definitions. These are more straightforward.
- Logical Reasoning (LR): Questions that require inference and deeper understanding. These are more challenging and test the model’s ability to reason based on context.
2. Multi-format:
- MCQs (Multiple-Choice Questions): Traditional format where the model selects from predefined answers. Total of 44,823 questions.
- SAQs (Short Answer Questions): Open-ended format requiring the model to generate a free-form answer, used to evaluate reasoning, clarity, and resistance to hallucination. Total of 3,087 questions.
3. Multi-language:
SecBench includes questions in both Chinese and English.
4. Multi-domain:
Questions span 9 cybersecurity domains (D1–D9), including: security management, data security, network security, application security, cloud security, and more.
Evaluation
MCQs are graded by checking if the model selects the correct choice(s).
SAQs are graded using a GPT-4o mini “grading agent”, which compares the model’s response to the ground truth and assigns a score based on accuracy and completeness.
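A hedged sketch of what such an LLM-as-judge step can look like is shown below; the prompt wording, the 0-10 scale, and the `llm()` helper are assumptions for illustration, not SecBench's exact grading prompt.

```python
import re

def llm(prompt: str) -> str:
    """Placeholder for the grading model (a GPT-4o mini grading agent per the text)."""
    raise NotImplementedError

GRADING_PROMPT = """You are grading a short-answer cybersecurity question.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Score the candidate from 0 to 10 for accuracy and completeness.
Output only the integer score."""

def grade_saq(question: str, reference: str, candidate: str) -> int:
    raw = llm(GRADING_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    ))
    match = re.search(r"\d+", raw)  # tolerate extra text around the score
    return int(match.group()) if match else 0
```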
LLM performance evaluation: For example, Network Security (D3) is assessed by grouping the relevant questions from the 44,823-question MCQ dataset. Accuracy is then measured per model on the questions labeled D3: a model’s percentage score for D3 reflects the proportion of network security questions it answered correctly.
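The per-domain score described here is ordinary accuracy restricted to one label. A minimal sketch, assuming graded results are available as (domain, answered_correctly) records:

```python
from collections import defaultdict

def domain_accuracy(results: list[tuple[str, bool]]) -> dict[str, float]:
    """results: (domain_label, answered_correctly) pairs, e.g. ("D3", True)."""
    totals, correct = defaultdict(int), defaultdict(int)
    for domain, ok in results:
        totals[domain] += 1
        correct[domain] += int(ok)
    return {d: correct[d] / totals[d] for d in totals}

# Example: 2 of 3 Network Security (D3) questions answered correctly
print(domain_accuracy([("D3", True), ("D3", False), ("D3", True), ("D1", True)]))
# {'D3': 0.6666666666666666, 'D1': 1.0}
```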
External Links
- 1. https://arxiv.org/pdf/2412.20787
- 2. https://arxiv.org/abs/2502.20791
- 3. https://arxiv.org/abs/2504.00428
- 4. https://arxiv.org/pdf/2408.03354
- 5. https://arxiv.org/pdf/2502.15932
- 6. https://arxiv.org/pdf/2407.11279
- 7. https://arxiv.org/pdf/2505.02079
- 8. https://arxiv.org/pdf/2411.08561
- 9. https://arxiv.org/pdf/2504.04877
- 10. https://arxiv.org/pdf/2407.13267
- 11. https://arxiv.org/pdf/2407.08991