Security
March 24, 2026
20 min read
Pantoja Digital

We Built an AI Chatbot, Protected It with Guardrails, Then Hired Ourselves to Break It. Here's What Happened.

We deployed AI guardrails on our chatbot Pixel, then ran our NullShield security scanner against it. Four rounds of testing. Here's every vulnerability we found and fixed.


We built an AI chatbot. We gave it guardrails — jailbreak detection, topic control, sensitive information filtering. We deployed it on our own website.

Then we pointed our own security scanner at it and tried to break it.

Four rounds of testing. Seventeen actionable issues. Every vulnerability documented. Every fix applied. Here's the full, unfiltered story of what happened when we tested our own product — and what it taught us about AI security.

The Setup

Meet Pixel. It's the AI chatbot that lives on pantojadigital.com — the little chat widget in the bottom corner you might have already noticed. Pixel answers questions about our services, helps visitors understand what we do, and points them in the right direction.

Under the hood, Pixel is powered by Claude (Anthropic's large language model) with a Python-based guardrails system inspired by NVIDIA's NeMo Guardrails framework. The guardrails handle three critical functions:

  • Jailbreak detection — Catches attempts to manipulate the AI into ignoring its instructions
  • Topic control — Keeps Pixel focused on Pantoja Digital's services and relevant topics
  • Sensitive information filtering — Prevents the AI from leaking internal data, API keys, or system prompts

The backend runs on FastAPI, hosted on Railway. The frontend chat widget is part of our Next.js site deployed on Vercel.

On paper, Pixel was secure. The guardrails were working. The AI stayed on topic, rejected prompt injection attempts, and never leaked its system prompt.

But here's the question we couldn't stop asking: Are guardrails enough?

Why We Tested Our Own Product

We sell NullShield — an AI security testing platform that scans chatbots, voice agents, and AI-powered tools for vulnerabilities. We test other companies' products for a living.

So the question was obvious: if we're asking businesses to trust us with their security, shouldn't we be able to prove we've secured our own?

This isn't just a nice-to-have. It's a credibility issue. If a locksmith can't secure their own house, you don't hire them. If a security company won't scan their own products, why would you let them scan yours?

Jensen Huang said it at GTC 2026: "Every company needs an AI strategy." We agree. But we'd add one thing: every AI strategy needs a security audit. Strategy without security is just optimism with a deployment date.

So we fired up NullShield, pointed it at pixel-api.pantojadigital.com, and hit scan.

Here's what happened.

Round 1: The First Scan

NullShield v18 ran its full suite against Pixel's API endpoint. The scan tested for everything — injection attacks, authentication weaknesses, security header misconfigurations, information disclosure, rate limiting, and more.

Results: 11 findings.

| Severity | Count |
|----------|-------|
| Critical | 0     |
| High     | 2     |
| Medium   | 4     |
| Low      | 3     |
| Info     | 2     |

No critical findings. That sounds good, right? Not so fast. Let's look at what was actually found.

Finding #1: Exposed API Documentation

Severity: High

FastAPI ships with automatic API documentation out of the box. It's a fantastic developer tool. It also means that, by default, anyone who visits /docs or /openapi.json gets a complete, interactive map of every endpoint your API exposes.

Our Pixel API had this enabled. In production.

That means anyone could see:

  • Every endpoint available
  • The exact request and response schemas
  • Parameter types, validation rules, and defaults
  • The entire structure of our API

For an attacker, this is a gift. It's like finding the building's blueprints taped to the front door.

Finding #2: Missing Security Headers

Severity: High / Medium

The API was missing several critical HTTP security headers:

  • HSTS (HTTP Strict Transport Security) — Without this, the connection could theoretically be downgraded from HTTPS to HTTP via a man-in-the-middle attack
  • X-Content-Type-Options — Missing nosniff directive, allowing the browser to MIME-sniff responses
  • X-Frame-Options — No clickjacking protection
  • Referrer-Policy — Without it, the browser sends full referrer URLs to external sites
  • Cache-Control — API responses were being cached, potentially storing sensitive conversation data

Each of these is a small gap on its own. Together, they paint a picture of an API that was built for functionality, not hardened for production.

Finding #3: No Rate Limiting

Severity: Medium

Pixel's API had zero rate limiting. None. An attacker — or even just a script kiddie with a for loop — could send thousands of requests per second. The implications:

  • DDoS vulnerability — Flood the API and take Pixel offline
  • Cost attacks — Every request costs money (Claude API calls). An attacker could rack up our bill
  • Brute force attacks — Without rate limiting, automated attacks have no friction

Finding #4: Cacheable API Responses

Severity: Medium

API responses weren't setting proper cache headers, which meant:

  • Chat conversations could be stored in browser caches or CDN caches
  • Sensitive responses could persist after the session ends
  • Shared computers could expose previous users' conversations

Other Findings

The remaining findings were lower severity — email security configurations (SPF/DKIM alignment), informational disclosures, and minor misconfigurations. Important to document, but not immediate threats.

The Big Takeaway from Round 1

Here's what hit us: the guardrails were doing their job perfectly. Pixel wasn't leaking its system prompt. It wasn't falling for jailbreak attempts. It was staying on topic.

But the infrastructure around Pixel was wide open.

It's like having a state-of-the-art alarm system inside a house with no locks on the doors. The AI conversation was protected. Everything else was not.

What We Fixed After Round 1

We fixed every actionable finding in a single sprint. Here's what changed:

1. Disabled API Documentation

from fastapi import FastAPI

# Disable the interactive docs, ReDoc, and the OpenAPI schema in production
app = FastAPI(docs_url=None, redoc_url=None, openapi_url=None)

One line of code. The entire API documentation — endpoints, schemas, everything — was no longer publicly accessible. Development convenience should never override production security.

2. Added Rate Limiting

We implemented rate limiting at 30 requests per minute per IP. Enough for legitimate users to have a conversation. Not enough for an attacker to abuse.

3. Added Security Headers

Every response from Pixel's API now includes:

  • Strict-Transport-Security: max-age=31536000; includeSubDomains
  • X-Content-Type-Options: nosniff
  • X-Frame-Options: DENY
  • Referrer-Policy: strict-origin-when-cross-origin
  • Cache-Control: no-store, no-cache, must-revalidate

4. Cache Control

All API responses now return Cache-Control: no-store, ensuring conversation data is never cached by browsers, proxies, or CDNs.
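Taken together, the header and cache-control fixes amount to one small response transformation. In FastAPI this runs as an HTTP middleware; the sketch below shows the same logic as a plain function over a header dict so it's easy to follow (the `harden` name is ours, not Pixel's actual code):

```python
# The exact header set described above, applied to every outgoing response
SECURITY_HEADERS = {
    "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": "DENY",
    "Referrer-Policy": "strict-origin-when-cross-origin",
    "Cache-Control": "no-store, no-cache, must-revalidate",
}

def harden(headers: dict[str, str]) -> dict[str, str]:
    """Overlay the security headers onto an existing response header dict."""
    hardened = dict(headers)
    hardened.update(SECURITY_HEADERS)
    return hardened
```

Because the overlay runs on every response, there is no way to forget a header on a newly added endpoint — which is exactly how the gap appeared in the first place.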

Total time to fix: under 30 minutes.

That's the thing about most of these findings. They're not complex to fix. They're just easy to forget. You're focused on making the AI work, making the responses accurate, tuning the guardrails — and you forget that the API itself needs hardening too.

Round 2: NullShield Got Smarter

Between scans, we didn't just fix Pixel. We upgraded NullShield itself.

NullShield v22 introduced three major capabilities:

  1. NoSQL injection scanner — Tests for MongoDB operator injection ($gt, $ne, $regex, etc.)
  2. Enhanced subdomain enumeration — Discovers infrastructure endpoints, staging environments, and related subdomains
  3. Attack chain detection — Combines individual low/medium findings into theoretical attack chains that could represent critical-severity composite vulnerabilities

We ran the upgraded scanner against the fixed Pixel API.

Results: 16 findings.

| Severity | Count |
|----------|-------|
| Critical | 6     |
| High     | 3     |
| Medium   | 3     |
| Low      | 2     |
| Info     | 2     |

Wait. More findings than Round 1? And six criticals when we had zero before?

This is counterintuitive but it's actually the most important lesson from this entire case study: a better scanner finds more problems.

Let's break down what happened.

The NoSQL Injection Findings

Severity: Critical (by pattern) / False Positive (by context)

NullShield v22's new NoSQL injection scanner sent payloads containing MongoDB operators to Pixel's API:

{"message": {"$gt": ""}}
{"message": {"$ne": null}}
{"message": {"$regex": ".*"}}

The API accepted these payloads and processed them without error. In a traditional application with a MongoDB backend, this would be a critical vulnerability — an attacker could manipulate database queries to extract or modify data.

Here's the thing: Pixel doesn't have a database. It's a stateless API that forwards messages to Claude and returns responses. There's no MongoDB. There's no database at all.

So these were technically false positives — the vulnerability pattern was detected, but the underlying risk wasn't present.

But we fixed them anyway. Why?

Because accepting malformed input is still a problem. Even if there's no database to exploit today, accepting MongoDB operators means:

  • The API isn't validating input properly
  • If a database is added later, these become real vulnerabilities
  • Malformed input could cause unexpected behavior in the AI model
  • It signals to an attacker that input validation is weak, encouraging further probing

A false positive in detection can still represent a real gap in design.
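One way to close that design gap is to reject any payload whose message field is not a plain string, before any pattern matching happens. A schema layer like Pydantic does this automatically when the field is typed as str; this minimal sketch (function name and error shape are ours, not Pixel's actual code) spells out the same check explicitly:

```python
def validate_chat_payload(payload: dict) -> str:
    """Accept only {"message": "<plain string>"}; reject dicts, lists, None."""
    message = payload.get("message")
    if not isinstance(message, str):
        # {"message": {"$gt": ""}} and friends fail here, before any
        # operator pattern matching is even needed
        raise ValueError("message must be a string")
    return message
```

With strict typing in place, the NoSQL probe payloads shown above never make it past deserialization — the scanner gets a 400-class error instead of a normal response.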

Attack Chain Detection

Severity: Critical (composite)

NullShield's new attack chain engine identified theoretical multi-step attack paths. For example:

  1. Discover API structure via information disclosure →
  2. Identify accepted injection patterns →
  3. Enumerate infrastructure subdomains →
  4. Chain findings into a targeted attack path

Individually, each finding might be medium or low severity. Combined, they represent a realistic attack scenario. The chain engine surfaces these composite risks so they can be addressed holistically, not one finding at a time.

Railway Infrastructure Subdomains

Severity: High / Medium

NullShield's improved subdomain enumeration discovered Railway infrastructure endpoints associated with our deployment. These are legitimate hosting platform subdomains — not something we control or can remove.

This is a reality of cloud-hosted infrastructure. When you deploy on Railway, Vercel, AWS, or any cloud platform, there's infrastructure surface area that belongs to the platform, not you. The finding is valid — these subdomains exist and could provide information to an attacker doing reconnaissance — but the remediation is "accept risk" because the fix is on Railway's side, not ours.

We documented this as an accepted risk with context, because that's what good security practices look like. Not every finding has a fix. Some have an acceptance, a justification, and a monitoring plan.

The Pattern: More Findings ≠ Less Secure

Round 1 found 11 issues. Round 2 found 16. Does that mean Pixel got less secure?

No. It means NullShield got better at looking.

This is one of the most misunderstood dynamics in security testing. When you upgrade your scanner, your finding count often goes up, not down. That's not regression — it's visibility.

Think of it like a home inspection. Inspector A checks the foundation and roof. Inspector B checks the foundation, roof, plumbing, electrical, HVAC, and soil drainage. Inspector B finds more issues. That doesn't mean the house got worse between inspections. It means Inspector B was more thorough.

If your security scanner is finding the same number of issues every quarter, it's not because you're secure. It's because your scanner isn't improving.

What We Fixed After Round 2

1. Input Sanitization

We added comprehensive input validation to Pixel's API:

  • NoSQL operator filtering — Reject any input containing MongoDB operators ($gt, $ne, $regex, $in, $or, $and, etc.)
  • Message length limit — Cap input at 1,000 characters. No legitimate chat message needs to be longer. Long inputs are often injection attempts.
  • JSON injection prevention — Detect and reject structured data patterns in plain text fields

import re
from fastapi import HTTPException

# Regex patterns for MongoDB query operators we refuse to accept
BLOCKED_PATTERNS = [
    r'\$gt', r'\$lt', r'\$ne', r'\$eq', r'\$regex',
    r'\$in', r'\$nin', r'\$or', r'\$and', r'\$not',
    r'\$exists', r'\$where',
]

def sanitize_input(message: str) -> str:
    # Reject oversized messages before they reach the model
    if len(message) > 1000:
        raise HTTPException(status_code=400, detail="Message too long")
    # Reject anything containing a known NoSQL operator
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, message, re.IGNORECASE):
            raise HTTPException(status_code=400, detail="Invalid input")
    return message

2. Content-Security-Policy

We added a comprehensive CSP header to control exactly what resources the API responses can load and execute:

Content-Security-Policy: default-src 'none'; frame-ancestors 'none'

This is particularly important for API endpoints that might return content rendered in a browser context. CSP is the last line of defense against XSS and data injection attacks.

3. Proxy Header Suppression

Some proxy-related headers (Via, X-Powered-By) were leaking infrastructure information. We configured the application to strip these headers from responses, reducing the information available to attackers during reconnaissance.

4. Health Endpoint Rate Limiting

Even the /health endpoint — used for uptime monitoring — got rate limited. An unprotected health endpoint can be used for service discovery, uptime monitoring by attackers, and as a low-cost way to keep probing your infrastructure.

Round 3: The Verification Scan

After applying all fixes from Round 2, we ran the final verification scan. This is the moment of truth — the entire point of iterative testing.

The Results

pantojadigital.com: 9 findings (0 Critical, 0 High actionable)

  • The Vercel 403 false positives? Completely eliminated — from 70 down to 0. Our NullShield improvement correctly identifies platform-level default deny patterns.
  • Remaining 9 findings: theoretical attack chains, minor header observations, and platform-level items. Zero real exploitable vulnerabilities.

Pixel Chatbot: 28 findings (1 real actionable)

  • NoSQL injection: reduced from 2 confirmed → 1. Our input sanitization caught one operator pattern, but a more creative bypass got through. This is exactly why you test iteratively.
  • 14 findings were Railway infrastructure subdomains (dev, staging, internal) — platform-level, not our code, documented as accepted risk.
  • Attack chains reduced from 6 → 5 as underlying findings were resolved.
  • The one remaining NoSQL pattern will be addressed with a stricter input validation layer.

The progression tells the story:

  • Scan 1: 11 findings → Found the obvious gaps
  • Scan 2: 16 findings → Better tools found deeper issues
  • Scan 3: 28 total, but only 1 real actionable → Infrastructure noise, core app hardened

Round 4: The Final Hardening

We weren't satisfied. One actionable finding was still one too many. So we applied aggressive final hardening:

  • NoSQL operator blocking strengthened with regex pattern matching (the pattern \$[a-zA-Z]+ catches all current and future MongoDB operators)
  • Additional Cross-Origin headers added to Pixel
  • robots.txt endpoint added to prevent search engine indexing of the API
  • Global API rate limiter deployed in website middleware (covers every /api/* route)
  • DMARC policy upgraded from quarantine to reject
  • Proxy disclosure headers suppressed on both platforms
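The strengthened operator check from the first bullet replaces the blocklist approach entirely: instead of enumerating known operators, match the $-prefix shape itself. A minimal sketch of that idea (the helper name is ours):

```python
import re

# Any "$" followed by letters is treated as a MongoDB-style operator,
# so future operators are caught without ever updating a blocklist
NOSQL_OPERATOR = re.compile(r"\$[a-zA-Z]+")

def contains_nosql_operator(text: str) -> bool:
    return NOSQL_OPERATOR.search(text) is not None
```

Note that a bare dollar sign followed by digits (a price, say) doesn't trip the check — only the operator shape does.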

Then we scanned one final time.

The Results

pantojadigital.com: 11 findings (0 Critical ✅)

  • 2 High — both theoretical attack chains (header analysis + rate limiting patterns)
  • 3 Medium — CSP detection (set in config but ZAP's edge detection lags), rate limiting path, header analysis
  • 3 Low — informational (email propagation delay, non-existent endpoint, permissions policy detection)
  • 3 Info — HTTP probe, subdomain info, WAF detection
  • Zero exploitable vulnerabilities. Zero critical. The website is hardened.

Pixel Chatbot: 12 findings (down from 28! 🔽)

  • 2 Critical confirmed: NoSQL injection operators still getting through via encoded/nested payloads — despite our regex filter, the scanner found creative bypasses. This is the value of adversarial testing.
  • 3 Critical theoretical: attack chains built on the NoSQL findings
  • 3 High: Railway infrastructure subdomains (internal, vpn, intranet) — platform-level, documented as accepted risk
  • 1 High: Railway domain email security — can't control Railway's DNS
  • 2 Medium: Cache-control directives
  • 1 Low: Health check endpoint (intentional)

The key metric: 28 → 12 total findings. And every remaining finding is either Railway infrastructure we can't touch, or a NoSQL pattern we're actively hardening against.

The NoSQL persistence is actually the best case study proof: even with aggressive input sanitization, NullShield found creative bypasses. Imagine what it finds on systems with NO input validation.
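The bypasses NullShield found exploit two blind spots in flat string filtering: operators nested inside structured values, and operators hidden behind URL encoding (%24gt decodes to $gt). The stricter validation layer therefore has to decode and recurse before checking. A sketch of that approach, under our own naming (this is illustrative, not Pixel's exact code):

```python
import re
from urllib.parse import unquote

OPERATOR = re.compile(r"\$[a-zA-Z]+")

def has_operator(value) -> bool:
    """Recursively scan strings, dicts, and lists for NoSQL operators,
    including operators hidden behind one layer of URL encoding
    (e.g. %24gt -> $gt)."""
    if isinstance(value, str):
        # Check both the raw string and its URL-decoded form
        return bool(OPERATOR.search(value) or OPERATOR.search(unquote(value)))
    if isinstance(value, dict):
        # Operators can hide in keys as well as values
        return any(has_operator(k) or has_operator(v) for k, v in value.items())
    if isinstance(value, list):
        return any(has_operator(item) for item in value)
    return False
```

A filter that only scans the top-level string misses every one of these cases — which is exactly the gap the verification scans kept finding.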

The pantojadigital.com Scan

We didn't just test Pixel. We also pointed NullShield at pantojadigital.com itself — our main marketing website running on Next.js/Vercel.

First scan: 70 findings.

Seventy! That's a lot. Except... they were all false positives.

Here's what happened: NullShield was testing various paths and endpoints, and Vercel was returning 403 (Forbidden) responses for non-existent routes. NullShield's detection engine was interpreting these 403s as "access denied to a resource that exists" rather than "this route doesn't exist."

This was actually a valuable lesson about our own scanner. We improved NullShield to detect the Vercel 403 pattern — when a hosting platform returns 403 for all unknown routes rather than 404, that's a platform behavior, not a security finding.

Second scan (with improved detection): 9 findings.

| Severity | Count |
|----------|-------|
| Critical | 0     |
| High     | 1     |
| Medium   | 4     |
| Low      | 4     |

The remaining findings were legitimate but manageable — security header improvements, rate limiting refinements, and email security configurations. The main site was in solid shape, with DMARC, SPF, and DKIM all properly configured.

The meta-lesson here: Testing our own products improved both the product being tested AND the testing tool itself. Scanning Pixel made Pixel more secure. Scanning pantojadigital.com made NullShield more accurate. Everybody wins.

What We Learned

After four rounds of testing, dozens of fixes, and a few humbling discoveries, here's what we took away.

1. Guardrails Protect the AI Conversation, Not the Infrastructure

This is the single most important lesson.

Pixel's guardrails were excellent. Jailbreak detection worked. Topic control worked. Sensitive info filtering worked. The AI conversation was secure.

But the FastAPI docs were exposed. Security headers were missing. The API had no rate limiting. Input validation was absent. The infrastructure was wide open.

Guardrails are one layer of security. An important layer — maybe the most visible layer. But they protect the AI, not the system the AI runs on. If you deploy a chatbot with perfect guardrails on an unhardened API, you've built a vault door on a tent.

2. Defense in Depth Is Real

Security isn't one thing. It's layers:

  • Guardrails — Protect the AI conversation
  • Input validation — Catch malicious payloads before they reach the AI
  • Rate limiting — Prevent abuse and cost attacks
  • Security headers — Protect the transport layer
  • Access control — Limit who can reach what
  • Monitoring — Detect anomalies in real time

Each layer catches things the others miss. Remove any one layer, and the attack surface grows. This isn't theoretical — we proved it across three rounds of testing.

3. Better Scanners Find More Problems

NullShield v18 found 11 issues. NullShield v22 found 16 issues on the same (improved!) system. Not because the system got worse, but because the scanner got better.

Your security scanner should be evolving. New attack patterns emerge constantly — NoSQL injection, prototype pollution, AI-specific attacks like prompt injection and model manipulation. If your scanner is running the same checks it ran a year ago, it's giving you a false sense of security.

Ask your security vendor: What new detection capabilities have you added in the last 90 days? If they can't answer, they're maintaining, not improving.

4. Infrastructure Matters More Than You Think

When people think about AI chatbot security, they think about jailbreaks. They think about prompt injection. They think about the AI saying something it shouldn't.

Those are real risks. But the most actionable findings in our audit were all infrastructure:

  • Exposed API documentation
  • Missing security headers
  • No rate limiting
  • Unvalidated input
  • Information-leaking response headers

These are the same issues that affect any web application. AI doesn't change the fundamentals of web security — it adds to them.

5. Iterative Testing Works

Four rounds. Each round: scan → analyze → fix → rescan.

Round 1 established a baseline. Round 2 went deeper with better tools. Round 3 verified the fixes. Round 4 closed the remaining gaps.

This is how security testing should work. It's not a one-time checkbox. It's a cycle. Find, fix, verify. Then upgrade your tools and start again.

If you ran a security scan six months ago and haven't scanned since, you don't have security — you have a snapshot.

The Numbers

Here's the full picture across all four rounds:

Pixel API (pixel-api.pantojadigital.com):

|                 | Round 1 | Round 2 | Round 3 | Round 4 |
|-----------------|---------|---------|---------|---------|
| Scanner Version | v18     | v22     | v22+    | v22+    |
| Total Findings  | 11      | 16      | 28      | 12      |
| Critical        | 0       | 6       | 5       | 5       |
| High            | 2       | 3       | 2       | 4       |
| Medium          | 4       | 3       | 3       | 2       |
| Low             | 3       | 2       | 14      | 1       |
| Info            | 2       | 2       | 4       | 0       |
| Real Actionable | 4       | 6       | 1       | 1       |
| Issues Fixed    | 4       | 6       | 7       |         |

pantojadigital.com:

|                  | Scan 1  | Scan 2 | Scan 3 | Scan 4 |
|------------------|---------|--------|--------|--------|
| Total Findings   | 70 (FP) | 9      | 9      | 11     |
| Critical         | 0       | 0      | 0      | 0      |
| High             | 0       | 1      | 1      | 2      |
| Medium           | 0       | 4      | 3      | 3      |
| Low              | 0       | 4      | 5      | 3      |
| Info             | 0       | 0      | 0      | 3      |
| Real Exploitable | 0       | 0      | 0      | 0      |

Totals across 4 rounds:

  • Total scan time: ~3 hours
  • Total fix time: ~2 hours
  • Total API cost: ~$1.00
  • Vulnerabilities identified and fixed: 17 actionable issues
  • Scanner improvements triggered: 4 (Vercel 403 detection, NoSQL injection testing, attack chain engine, subdomain consolidation)

About five hours of combined scanning and fixing. Less than a dollar in compute. And both our product and our scanner are meaningfully better for it.

What This Means for Your Business

If you have an AI chatbot, voice agent, RAG system, or any AI-powered tool deployed on your website or product, here's the reality:

Your guardrails are necessary but not sufficient.

They protect the AI conversation. They don't protect the API, the infrastructure, the input validation, the transport layer, or the hosting environment. You need both.

A one-time scan is better than nothing, but iterative testing is what actually works.

Security isn't a state — it's a process. Scan, fix, verify. Upgrade your tools. Scan again.

You should be testing with tools that are getting smarter, not just running the same checks.

NullShield currently tests over 500 attack patterns, and we add new detections regularly — because the attack surface is evolving. The NoSQL injection scanner that found issues in Round 2 didn't exist in Round 1. The attack chain engine that identified composite risks was brand new. Static scanners give you a static view of a dynamic problem.

The cost of testing is trivial compared to the cost of a breach.

We tested our own chatbot for less than a dollar. The average cost of an AI-related security incident is significantly higher. Testing is the cheapest insurance you can buy.


We built Pixel. We protected it with guardrails. We tested it with NullShield. We found real vulnerabilities. We fixed them. We made both our product and our scanner better in the process.

That's the whole story. No spin. No hiding the findings. No pretending we were secure from day one.

Because if a security company can't be honest about its own vulnerabilities, it has no business testing yours.

Want to know what NullShield would find on your AI tools? We test chatbots, voice agents, RAG systems, and AI-powered applications with the same rigor we applied to our own.

Get your AI tools tested →

