Şevval Alper

AI Researcher
16 Articles

Şevval is an AI researcher at AIMultiple. She has previous research experience in pseudorandom number generation using chaotic systems.

Research interests

Şevval focuses on AI coding tools, AI agents, and quantum technologies.

She is part of the AIMultiple benchmark team, conducting assessments and providing insights to help readers understand various emerging technologies and their applications.

Professional experience

She helped organize three “CERN International Masterclasses - hands-on particle physics” events in Türkiye, guiding participants and working alongside faculty to facilitate learning.

Education

Şevval holds a Bachelor's degree in Physics from Middle East Technical University.

Latest Articles from Şevval

AI · Dec 3

AI Reasoning Benchmark: MathR-Eval

We evaluated eight leading LLMs using a 100-question mathematical reasoning dataset, MathR-Eval, to measure how well each model solves structured, logic-based math problems. All models were tested zero-shot, with identical prompts and standardized answer checking. This enabled us to measure pure reasoning accuracy and compare both reasoning and non-reasoning models under the same conditions.

AI · Dec 2

OCR Benchmark: Text Extraction / Capture Accuracy

OCR accuracy is critical for many document processing tasks, and state-of-the-art (SOTA) multimodal LLMs now offer an alternative to traditional OCR.

Agentic AI · Nov 25

MCP Benchmark: Top MCP Servers for Web Access

We benchmarked 8 MCP servers across web search, extraction, and browser automation by running 4 different tasks 5 times each. We also tested scalability with 250 concurrent AI agents.

AI · Nov 21

LLM Parameters: GPT-5 High, Medium, Low and Minimal

New LLMs, such as OpenAI’s GPT-5 family, come in different versions (e.g., GPT-5, GPT-5-mini, and GPT-5-nano) and with various parameter settings, including high, medium, low, and minimal. Below, we explore the differences between these model versions by gathering their benchmark performance and the costs of running the benchmarks.

AI · Nov 14

Screenshot to Code: Lovable vs v0 vs Bolt

During my 20 years as a software developer, I led many front-end teams that built pages from designs inspired by screenshots. With AI tools, such designs can now be transferred directly to code.

AI · Nov 12

AGI Benchmark: Can AI Generate Economic Value?

AI will have its greatest impact when AI systems start to create economic value autonomously. We benchmarked whether frontier models can generate economic value. We prompted them to build a new digital application (e.g., website or mobile app) that can be monetized with a SaaS or advertising-based model.

AI · Nov 7

Best AI Code Editor: Cursor vs Windsurf vs Replit

Building an app without coding skills is a major trend right now. But can these tools successfully build and deploy an app? To answer this question, we spent three days testing the following agentic IDEs/AI coding tools: Claude Code, Cline, Cursor, Windsurf, and Replit Agent.

AI · Nov 6

E-Commerce AI Video Maker Benchmark: Veo 3 vs Sora 2

Product visualization plays a crucial role in e-commerce success, yet creating high-quality product videos remains a significant challenge. Recent advancements in AI video generation technology offer promising solutions.

AI · Oct 27

Speech-to-Text Benchmark: Deepgram vs. Whisper

We benchmarked the leading speech-to-text (STT) providers, focusing specifically on healthcare applications. Our benchmark used real-world examples to assess transcription accuracy in medical contexts, where precision is crucial. Based on both WER and CER results, GPT-4o-transcribe demonstrated the highest transcription accuracy among all evaluated speech-to-text systems.

Agentic AI · Oct 18

Top 4 AI Search Engines Compared

Searching with LLMs has become a major alternative to Google Search. We benchmarked leading AI search engines to see which one provides the most correct results. DeepSeek leads this benchmark, correctly providing 57% of the data in our ground-truth dataset.