Updated on Jul 11, 2025

Large Language Models in Cybersecurity [2025]


Large language models (LLMs) are increasingly applied across cybersecurity domains, including threat intelligence, vulnerability detection, anomaly analysis, and red teaming. These applications are supported by both specialized cybersecurity LLMs and general-purpose models.

Specialized cybersecurity LLMs

| Model | Release date | Model type | Training focus |
| --- | --- | --- | --- |
| SecLLM | 2024 | Code LLaMA variant | Insecure code samples, CVE-linked code snippets, exploit patterns |
| LLM4Cyber | 2024 | Fine-tuned general LLM | MITRE ATT&CK, CVE, threat intelligence feeds (CTI) |
| LlamaGuard | 2024 | Safety-aligned LLaMA | Safety filter prompts, input/output policy enforcement, adversarial prompt handling |
| SecGPT | 2023 | GPT-style LLM | Cybersecurity text, CVE reports |
| Cybersecurity-BERT | 2023 | BERT (encoder-only) | Malware reports, vulnerability descriptions, technical security documentation |

General-purpose LLMs for cybersecurity

These large language models are not trained solely on cybersecurity data but can still perform well in the domain when prompted correctly or evaluated on benchmarks like SecBench.

Examples:

  • GPT-4 / GPT-4o
  • DeepSeek-V3
  • Mistral 
  • Qwen2 / Yi / LLaMA-3-Instruct
  • Hunyuan-Turbo

Benchmarking LLM performance across cybersecurity domains 

The SecBench benchmark evaluates 7 general-purpose LLMs, including both proprietary (e.g., GPT-4) and open-source models (e.g., DeepSeek, Mistral). The benchmark spans 9 cybersecurity subfields, including:

  • Data Security
  • Identity & Access Management
  • Application Security
  • Network Security
  • Security Standards (and others)

The x-axis domains are sorted by LLM performance, with lower-scoring domains placed toward the left and higher-scoring ones toward the right.

MCQ (Multiple-Choice Question) benchmark results: [chart]

SAQ (Short Answer Question) benchmark results: [chart]

Source: SecBench design.1 See the benchmark methodology section below.

The role of LLMs in cybersecurity

Large language models (LLMs) are used across cybersecurity operations to extract actionable insights from unstructured sources such as threat intelligence reports, incident logs, CVE databases, and attacker TTPs.

LLMs automate key tasks, including threat classification, alert summarization, and correlation of indicators of compromise (IOCs). 

When fine-tuned on cybersecurity data, large language models can detect anomalies in logs, analyze phishing emails, prioritize vulnerabilities, and map threats to frameworks like MITRE ATT&CK.

Applications of large language models in cybersecurity

Threat intelligence

Co-pilot for contextual threat analysis: LLM-powered tools like CyLens support security analysts throughout the threat intelligence lifecycle, handling attribution, detection, correlation, triage, and remediation by analyzing extensive threat reports with modular NLP pipelines and entity correlation filters.2

Real-time proactive threat intelligence: Systems integrate LLMs with retrieval-augmented generation (RAG) frameworks to ingest continuous CTI feeds (e.g., CVE) into vector databases (such as Milvus), enabling up-to-date automated detection, scoring, and contextual reasoning.3
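The core loop of such a system can be illustrated with a minimal sketch: CVE entries are embedded, stored in an index, and retrieved by similarity when a question arrives. The `embed()` helper and the in-memory index below are illustrative stand-ins for a real embedding model and a production vector database like Milvus, not the cited system's implementation.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashing embedder that keeps the sketch self-contained;
    a real pipeline would call an embedding model instead."""
    v = np.zeros(dim)
    for token in text.lower().split():
        v[hash(token) % dim] += 1.0
    return v

class CTIIndex:
    """In-memory stand-in for a vector database such as Milvus."""

    def __init__(self):
        self.vectors: list[np.ndarray] = []
        self.entries: list[dict] = []

    def ingest(self, cve_entries: list[dict]) -> None:
        # Each entry: {"id": "CVE-2024-XXXX", "description": "..."}
        for entry in cve_entries:
            self.vectors.append(embed(entry["description"]))
            self.entries.append(entry)

    def retrieve(self, query: str, k: int = 5) -> list[dict]:
        # Cosine similarity between the query and every stored CVE.
        q = embed(query)
        sims = [
            float(q @ v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9)
            for v in self.vectors
        ]
        top = np.argsort(sims)[::-1][:k]
        return [self.entries[i] for i in top]
```

The retrieved entries are then concatenated into the LLM's prompt as grounding context, which is how the system's reasoning stays current as new feed items are ingested.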

Forum-based CTI extraction: LLMs analyze unstructured data from cybercrime forums to extract key threat indicators using simple prompts.4
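As a rough illustration of this prompt-based extraction (the prompt wording, JSON schema, and model name below are assumptions, not the cited study's setup):

```python
import json
from openai import OpenAI  # any chat-completion client works similarly

client = OpenAI()

EXTRACTION_PROMPT = """Extract threat indicators from the forum post below.
Return a JSON object with keys: ips, domains, hashes, malware_names, ttps.
Use empty lists for missing categories.

Post:
{post}"""

def extract_indicators(post: str) -> dict:
    # Model name is an example; swap in whichever model you evaluate.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(post=post)}],
        response_format={"type": "json_object"},  # request parseable JSON
    )
    return json.loads(response.choices[0].message.content)
```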

Vulnerability detection

Vulnerability description enrichment: LLMs such as CVE‑LLM enrich vulnerability descriptions using domain ontologies, enabling automated triage and CVSS scoring integration within existing security management systems.5
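CVE-LLM's ontology and prompts are not reproduced here; the sketch below only illustrates the general pattern of asking a model for ontology-constrained fields and validating the output before it feeds a scoring pipeline. The field names and the `llm` callable are hypothetical.

```python
import json

# CVSS v3 attack-vector values act as a small ontology constraint.
ALLOWED_ATTACK_VECTORS = {"NETWORK", "ADJACENT", "LOCAL", "PHYSICAL"}

def enrich_cve(description: str, llm) -> dict:
    """Ask the model for ontology-aligned fields, then validate them.

    `llm` is any text-in/text-out callable (API client or local model).
    """
    prompt = (
        "Given this vulnerability description, return JSON with fields "
        "'affected_component', 'weakness_type' (a CWE id), and "
        "'attack_vector' (NETWORK/ADJACENT/LOCAL/PHYSICAL):\n" + description
    )
    fields = json.loads(llm(prompt))
    if fields.get("attack_vector") not in ALLOWED_ATTACK_VECTORS:
        fields["attack_vector"] = None  # reject values outside the ontology
    return fields
```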

Android filesystem vulnerability detection: Investigates how LLMs can detect file system access vulnerabilities in Android apps, including permission abuse and insecure storage.6

RL fine‑tuning for vulnerability detection: Applies reinforcement learning (RL) to fine-tune LLMs (LLaMA 3B/8B, Qwen 2.5B) for improved accuracy in identifying software vulnerabilities.7
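The cited training setup isn't reproduced here, but the core ingredient of such RL fine-tuning is a reward function that scores each generated verdict against a ground-truth label. A minimal sketch, with the shaping term as an illustrative assumption:

```python
def vulnerability_reward(model_output: str, label: str) -> float:
    """Scalar reward for one rollout during RL fine-tuning.

    `label` is the ground-truth verdict: "vulnerable" or "safe".
    The returned value would feed a policy-gradient update (e.g., PPO).
    """
    verdict = "vulnerable" if "vulnerable" in model_output.lower() else "safe"
    reward = 1.0 if verdict == label else -1.0
    if len(model_output.split()) > 200:
        reward -= 0.2  # mild shaping: discourage rambling explanations
    return reward
```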

Anomaly detection & log analysis

Semantic log anomaly detection: Frameworks like LogLLM use LLM encoders/decoders to parse and classify log entries, improving anomaly detection beyond pattern matching.8
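LogLLM's actual encoder/decoder architecture is more involved; the sketch below shows only the simplest form of the semantic idea: embed log lines and flag those that sit far from the distribution of known-normal lines.

```python
import numpy as np

def anomaly_scores(log_vectors: np.ndarray, normal_vectors: np.ndarray) -> np.ndarray:
    """Distance of each log-line embedding from the centroid of
    known-normal lines; higher means more anomalous.

    Both arrays hold one embedding row per log line, produced by the
    LLM encoder.
    """
    centroid = normal_vectors.mean(axis=0)
    return np.linalg.norm(log_vectors - centroid, axis=1)

# Lines scoring above a threshold calibrated on held-out normal logs
# are flagged for analyst review.
```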

Log parsing with large language models: Automated LLM parsing converts unstructured logs into structured formats via prompt‑based and fine‑tuned approaches.9
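A minimal sketch of the prompt-based variant, where the output schema and the `llm` callable are illustrative assumptions:

```python
import json

PARSE_PROMPT = """Convert the raw log line below into a JSON object with keys:
timestamp, host, process, severity, message. Use null for missing fields.

Log line: {line}"""

def parse_log_line(line: str, llm) -> dict:
    # `llm` is any text-in/text-out callable (API client or local model).
    return json.loads(llm(PARSE_PROMPT.format(line=line)))

# parse_log_line("Jan 12 03:14:07 web01 sshd[4121]: Failed password for root",
#                llm=my_model)
```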

Red teaming / LLM-assisted attack prevention

LLM-driven pentesting and remediation (PenHeal): Automates penetration testing with a two-stage pipeline that first identifies security weaknesses, then generates remediation actions using a custom LLM setup.10
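PenHeal's actual prompts and agent orchestration are not detailed in this summary; the sketch below shows only the two-stage shape of such a pipeline, with `llm` standing in for the underlying model calls.

```python
def identify_weaknesses(scan_output: str, llm) -> list[str]:
    """Stage 1: turn raw scan/exploit output into a list of findings."""
    prompt = ("List each security weakness found in this scan output, "
              "one per line:\n" + scan_output)
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def generate_remediations(findings: list[str], llm) -> dict[str, str]:
    """Stage 2: propose a concrete remediation for each finding."""
    return {finding: llm("Suggest a concrete remediation for: " + finding)
            for finding in findings}
```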

On-prem red team agent for internal security (Hackphyr): Deploys a fine-tuned 7B LLM agent locally to perform red-team tasks such as lateral movement simulation, credential harvesting, and vulnerability scanning within networks.11

Benchmark methodology

SecBench is a large-scale, multi-dimensional benchmark for evaluating LLMs in cybersecurity across different tasks, domains, languages, and formats.

Evaluation dimensions

1. Multi-level reasoning:

  • Knowledge Retention (KR): Questions that test factual knowledge or definitions. These are more straightforward.
  • Logical Reasoning (LR): Questions that require inference and deeper understanding. These are more challenging and test the model’s ability to reason based on context.

2. Multi-format:

  • MCQs (Multiple-Choice Questions): Traditional format where the model selects from predefined answers. Total of 44,823 questions.
  • SAQs (Short Answer Questions): Open-ended format requiring the model to generate a free-text response, used to evaluate reasoning, clarity, and hallucination resistance. Total of 3,087 questions.

3. Multi-Language:

SecBench includes questions in both Chinese and English.

4. Multi-Domain:

Questions span 9 cybersecurity domains (D1–D9), including security management, data security, network security, application security, cloud security, and more.

Evaluation

MCQs are graded by checking if the model selects the correct choice(s).
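SecBench's exact answer-matching rules aren't spelled out here; a minimal sketch of exact-match grading for (possibly multi-select) MCQs:

```python
import re

def grade_mcq(model_answer: str, correct_choices: set[str]) -> bool:
    """Exact-match grading: the set of selected option letters must
    equal the answer key, which also handles multi-select questions."""
    # Match option letters that stand alone (e.g., "A, C" or "B").
    selected = set(re.findall(r"\b[A-D]\b", model_answer.upper()))
    return selected == correct_choices

# grade_mcq("A and C", {"A", "C"})  -> True
```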

SAQs are graded using a GPT-4o mini “grading agent”, which compares the model’s response to the ground truth and assigns a score based on accuracy and completeness.
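The grading agent's actual rubric is not reproduced in this article; below is a sketch of the pattern, where the 0–10 scale and prompt wording are assumptions and `grader` wraps the GPT-4o mini call.

```python
GRADING_PROMPT = """You are a grading agent. Compare the candidate answer to
the ground truth and return a single score from 0 to 10 reflecting accuracy
and completeness. Return only the number.

Ground truth: {truth}
Candidate answer: {answer}"""

def grade_saq(answer: str, truth: str, grader) -> float:
    # `grader` is a text-in/text-out callable around the grading model.
    return float(grader(GRADING_PROMPT.format(truth=truth, answer=answer)))
```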

LLM performance evaluation: For example, Network Security (D3) is assessed by grouping the relevant questions from the 44,823-question MCQ dataset.

Accuracy is measured based on each model’s performance, specifically on questions labeled under the D3 domain. A model’s percentage score for D3 reflects the proportion of network security questions it answered correctly.
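Computing such per-domain scores is straightforward once each question carries a domain label; a minimal sketch:

```python
from collections import defaultdict

def domain_accuracy(results: list[tuple[str, bool]]) -> dict[str, float]:
    """`results` holds (domain, is_correct) pairs, e.g. ("D3", True)."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for domain, ok in results:
        total[domain] += 1
        correct[domain] += ok  # bool counts as 0/1
    return {d: correct[d] / total[d] for d in total}

# domain_accuracy([("D3", True), ("D3", False), ("D1", True)])
# -> {"D3": 0.5, "D1": 1.0}
```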
