# Protocol, Proxy, and Security Integration Tests Analysis
This document provides a detailed analysis of the integration tests covering protocol support, IPv6, API proxy, credential hiding, token unsetting, one-shot tokens, exit code propagation, and error handling in the AWF (Agentic Workflow Firewall).
## 1. Protocol Support Tests

### What It Tests

| Test Case | Description |
| --- | --- |
| HTTPS to allowed domain | Sends `curl -fsS https://api.github.com/zen` with `github.com` allowed. Verifies the request succeeds (exit code 0). |
| HTTPS to non-allowed domain | Sends `curl -f https://example.com` with only `github.com` allowed. Verifies the CONNECT is denied by Squid and curl fails. |
| HTTPS verbose output | Runs `curl -v` and greps for SSL/TLS connection info, confirming TLS handshakes happen through the proxy. |
| HTTP/2 connections | Uses `curl --http2` to verify HTTP/2 negotiation works through the Squid CONNECT tunnel. |
| HTTP/1.1 fallback | Uses `curl --http1.1` to verify HTTP/1.1 also works, confirming protocol negotiation flexibility. |
| HTTP requests | Attempts plain HTTP to `github.com`. Expects failure because HTTP requests hit Squid's intercept port, where HTTP-to-HTTPS redirects fail. Documents a known limitation. |
| Custom headers | Passes `-H "Accept: application/json"` to verify headers traverse the proxy. |
| User-Agent header | Passes `-A "Test-Agent/1.0"` to verify a custom User-Agent traverses the proxy. |
| IPv4 connections | Uses `curl -4` to force IPv4, verifying IPv4-only connectivity works. |
| IPv6 (may not be available) | Uses `curl -6` with a fallback so the test handles environments where IPv6 is unavailable. |
| curl max-time | Uses `--max-time 5` to verify the timeout option works through the proxy. |
| curl connect-timeout | Uses `--connect-timeout 10` to verify the connection timeout works through the proxy. |
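These assertions ultimately key off curl's exit codes, which distinguish failure modes. A minimal sketch of the two most relevant ones, runnable without AWF or Squid (it assumes nothing listens on loopback port 1, and uses a reserved `.invalid` name that can never resolve):

```shell
# Connection refused: no listener on 127.0.0.1:1, so curl exits 7
# (CURLE_COULDNT_CONNECT).
curl -fsS --connect-timeout 2 http://127.0.0.1:1/ >/dev/null 2>&1
echo "refused: $?"

# DNS failure: the .invalid TLD is reserved (RFC 2606) and never
# resolves, so curl exits 6 (CURLE_COULDNT_RESOLVE_HOST).
curl -fsS https://example.invalid/ >/dev/null 2>&1
echo "dns: $?"
```

A denied CONNECT looks different again (an HTTP error from the proxy rather than a transport failure), which is why the gaps table below suggests pinning down the exact failure mode of the HTTP test.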
### Real-World Mapping

These tests map to the core firewall functionality: when an AI agent (Claude, Copilot, Codex) makes HTTP/HTTPS requests, the proxy must:

- Allow HTTPS to whitelisted domains (API calls to `api.github.com`, `api.anthropic.com`, etc.)
- Block HTTPS to non-whitelisted domains (preventing data exfiltration)
- Support the various HTTP versions and headers that real tools use
- Handle connection timeouts gracefully when domains are blocked
### Gaps and Missing Coverage

| Gap | Impact | Recommendation |
| --- | --- | --- |
| No test for multiple allowed domains simultaneously | Real workflows allow 5-10+ domains. No test verifies correct behavior with many domains. | Add a test with 3-5 allowed domains; verify each works and a non-listed one is blocked. |
| No test for wildcard subdomain matching | Allowing `github.com` should also allow `api.github.com`, `raw.githubusercontent.com`, etc. | Add a test verifying subdomain matching behavior. |
| No test for large request/response bodies | AI agents may download large files (repos, models). | Add a test downloading a file >1 MB through the proxy. |
| No test for concurrent requests | AI agents make many parallel requests. | Add a test making 5-10 concurrent curl requests. |
| No test for WebSocket upgrade | Some MCP servers use WebSocket. | Add a test attempting a WebSocket connection through the proxy (expected to fail, or document the limitation). |
| No test for POST/PUT/PATCH/DELETE methods | Only GET-like requests are tested. AI agents use POST for API calls. | Add a test for POST with a JSON body through the proxy. |
| HTTP intercept mode behavior not thoroughly validated | The HTTP test documents a "known limitation" but doesn't verify the exact failure mode. | Add a test that checks the specific error (403 page vs. connection refused). |
| No test for redirect following | `curl -L` follows redirects across domains; this is untested. | Test that redirects from an allowed domain to a non-allowed domain are blocked. |
### Edge Cases

- What happens when Squid is slow to start and the first request arrives before it's ready?
- What happens with very long domain names (>255 chars)?
- What about requests to IP addresses directly (bypassing DNS)?
## 2. IPv6 Integration Tests

File: `tests/integration/ipv6.test.ts`

### What It Tests
| Test Case | Description |
| --- | --- |
| Accept IPv6 DNS servers | Configures `dnsServers: ['2001:4860:4860::8888', '2001:4860:4860::8844']` and verifies they appear in debug logs. |
| Accept mixed IPv4/IPv6 DNS | Configures both `8.8.8.8` and `2001:4860:4860::8888`, verifies both appear in logs. |
| DNS resolution with IPv4-only DNS | Runs `nslookup github.com` with IPv4 DNS only, verifies resolution works. |
| IPv6 traffic blocked for non-whitelisted | When IPv6 is available, attempts `curl -6 https://example.com` (not in the allowlist). Expects failure. |
| IPv6 curl graceful failure | Tests `curl -6` with a fallback, ensuring graceful handling when IPv6 is unavailable. |
| ip6tables status logging | Verifies that starting with IPv6 DNS servers logs ip6tables availability status. |
| IPv6 chain cleanup | Verifies that ip6tables chains are cleaned up on exit. |
| IPv4 curl with IPv4 DNS | Baseline test confirming IPv4 curl works with IPv4 DNS servers. |
| Dual-stack DNS configuration | Configures 4 DNS servers (2 IPv4 + 2 IPv6), verifies all appear in logs. |
| Valid IPv6 loopback | Tests that `::1` (loopback) is accepted as a DNS server. |
| Link-local addresses | Tests graceful handling by using IPv4 DNS to avoid link-local issues. |
| Empty IPv6 DNS server list | Tests that IPv4-only DNS servers work and DNS resolution succeeds. |
### Real-World Mapping

GitHub Actions runners may have IPv6 enabled. The firewall must handle IPv6 DNS configuration correctly to prevent DNS-based data exfiltration while allowing legitimate IPv6 DNS resolution. The ip6tables rules must match the iptables rules used for IPv4.
### Gaps and Missing Coverage

| Gap | Impact | Recommendation |
| --- | --- | --- |
| No test for actual IPv6 traffic through proxy | Only tests DNS config and `curl -6` blocking. Doesn't verify IPv6 traffic flows correctly through Squid when allowed. | Add a conditional test (when IPv6 is available) that makes a successful IPv6 request to an allowed domain. |
| Invalid IPv6 address rejection not tested | The test for "Invalid IPv6 address rejected at CLI level" actually uses a valid IPv6 address (`::1`). | Add a test with a genuinely invalid IPv6 address like `not:an:ipv6` and verify the CLI error. |
| No test for IPv6-only environment | All tests use either IPv4 or mixed DNS. What if IPv4 is unavailable? | Add a conditional test for an IPv6-only DNS configuration. |
| Link-local test doesn't actually test link-local | Uses IPv4 DNS "to avoid link-local issues"; it never exercises `fe80::` handling. | Add a test with `fe80::1%eth0` to verify proper rejection or handling. |
| ip6tables cleanup not strongly verified | Checks for the presence of log keywords but doesn't verify rules are actually removed. | After cleanup, run `ip6tables -L` and verify no AWF rules remain. |
| Many tests only check log output | Several tests just verify strings appear in stderr rather than testing actual network behavior. | Where feasible, add actual connectivity tests. |
### Edge Cases

- What happens when IPv6 is available but ip6tables is not installed?
- Race condition: ip6tables rules being applied while IPv6 traffic is already in flight.
- DNS64/NAT64 environments where IPv6 traffic is translated to IPv4.
## 3. API Proxy Sidecar Tests

File: `tests/integration/api-proxy.test.ts`

### What It Tests
| Test Case | Description |
| --- | --- |
| Anthropic healthcheck | Starts the API proxy with `ANTHROPIC_API_KEY`, checks `/health` on port 10001. Expects `"status":"healthy"` and `"anthropic-proxy"`. |
| OpenAI healthcheck | Same pattern with `OPENAI_API_KEY` on port 10000. |
| ANTHROPIC_BASE_URL set | Verifies the agent container has `ANTHROPIC_BASE_URL=http://172.30.0.30:10001`. |
| ANTHROPIC_AUTH_TOKEN placeholder | Verifies the agent gets `ANTHROPIC_AUTH_TOKEN=placeholder-token-for-credential-isolation` (the real key is held by the sidecar). |
## 4. Credential Hiding Tests

MCP log hiding prevents reading log files that might contain sensitive operation details.
### Gaps and Missing Coverage

| Gap | Impact | Recommendation |
| --- | --- | --- |
| No test for Cargo credentials | `~/.cargo/credentials.toml` may contain crates.io tokens. | Add a test verifying Cargo credentials are hidden. |
| No test for Composer auth | `~/.composer/auth.json` may contain Packagist tokens. | Add a test for Composer credentials. |
| No test for pip/PyPI credentials | `~/.pip/pip.conf` or `~/.config/pip/pip.conf` may contain index tokens. | Add a test for Python package manager credentials. |
| No test for SSH keys | `~/.ssh/id_rsa` and `~/.ssh/id_ed25519` could be exfiltrated. | Add a test verifying SSH keys are not accessible (or document if they're needed). |
| No test for /proc/self/environ credential reading | The agent could read its own environment via `/proc/self/environ`. | Add a test reading `/proc/self/environ` and verifying sensitive tokens are cleared. (Partially covered by the token-unset tests.) |
| No test for symlink bypass | The agent could create symlinks from accessible paths to hidden paths. | Test that symlinks to credential files resolve to `/dev/null` content. |
| No test for bind mount information leakage | `mount` or `cat /proc/mounts` could reveal the `/dev/null` pattern. | Test that mount info doesn't reveal the hiding mechanism. |
| No test for credential files that don't exist on host | Tests assume credential files exist on the host machine. | Add conditional logic or mock files for test portability. |
| No test for cloud provider credentials | `~/.aws/credentials`, `~/.config/gcloud/`, `~/.azure/` could contain cloud tokens. | Add tests for cloud provider credential files. |
### Edge Cases

- What if a credential file is a symlink on the host?
- What if the agent creates a new credential file (e.g., `docker login` creates `~/.docker/config.json`)?
- What about credential discovery via `find / -name "*.json" -exec grep -l auth`?
- What if a tool writes credentials to a non-standard location?
## 5. Token Unsetting Tests

File: `tests/integration/token-unset.test.ts`

### What It Tests
| Test Case | Description |
| --- | --- |
| GITHUB_TOKEN unset from /proc/1/environ | After a 7 s delay, reads `/proc/1/environ` and verifies `GITHUB_TOKEN` is gone. Also verifies the agent can still read it via `getenv`. |
| OPENAI_API_KEY unset | Same pattern for the OpenAI key. |
| ANTHROPIC_API_KEY unset | Same pattern for the Anthropic key. |
| Multiple tokens simultaneously | Sets all three tokens, verifies all are cleared from `/proc/1/environ` after 7 s. |
| Works in non-chroot mode | Sets `AWF_CHROOT_ENABLED=false`, verifies token clearing works in container-only mode. |
### Real-World Mapping

The `entrypoint.sh` runs the agent in the background, waits 5 seconds for initialization, then unsets sensitive tokens from the parent shell's environment. This prevents a secondary attack vector: even if credential files are hidden, an attacker could read `/proc/1/environ` to find API keys passed as environment variables. The one-shot token library caches the values, so the agent process can still use them.
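The sequencing is the important part: the agent must inherit the token before the parent clears it. A minimal sketch of the pattern (the delays are shortened, `sh -c 'sleep ...'` stands in for the agent, and the token value is a placeholder, not AWF's actual entrypoint):

```shell
#!/bin/sh
# Sketch of the unset-after-startup pattern described above.
export GITHUB_TOKEN="example-value"

# Stand-in for the agent: forked now, so it inherits its own copy
# of the environment, including the token.
sh -c 'sleep 2; echo "agent still sees: $GITHUB_TOKEN"' &
agent_pid=$!

sleep 1            # the real entrypoint waits ~5 s for agent startup
unset GITHUB_TOKEN # cleared from the parent shell's environment

wait "$agent_pid"
echo "parent now sees: ${GITHUB_TOKEN:-<unset>}"
```

The child keeps its inherited copy even after the parent unsets the variable, which is why the tests can assert both "gone from `/proc/1/environ`" and "still readable by the agent".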
### Gaps and Missing Coverage

| Gap | Impact | Recommendation |
| --- | --- | --- |
| Fixed 7 s sleep is fragile | If agent startup takes longer (or shorter), tests may be flaky. | Use a more robust synchronization mechanism (e.g., a signal file). |
| No test for COPILOT_GITHUB_TOKEN | Listed in `unset_sensitive_tokens()` but not tested here. | Add a test for `COPILOT_GITHUB_TOKEN`. |
| No test for all tokens in the sensitive list | `unset_sensitive_tokens()` lists 12+ tokens. Only 3 are tested. | Add tests for `GH_TOKEN`, `GITHUB_API_TOKEN`, `GITHUB_PAT`, `CLAUDE_API_KEY`, etc. |
| No test for /proc/self/environ | Tests check `/proc/1/environ` (PID 1) but not `/proc/self/environ` of child processes. | Add a test reading `/proc/self/environ` from a child process. |
| No test for timing race | What if the agent reads `/proc/1/environ` before the 5 s unsetting delay? | Add a test that reads immediately (before the sleep) and after, showing the transition. |
| No test for token values in /proc/*/cmdline | Tokens passed via `--env` might appear in cmdline. | Verify token values don't appear in `/proc/*/cmdline`. |
| No negative test: token not set | Should verify that unsetting a non-existent token doesn't cause errors. | Add a test with no tokens set. |
### Edge Cases

- What if the agent process forks and the child inherits the pre-unset environment?
- What if `/proc/1/environ` is not readable (restricted `/proc` mount)?
- What about tokens set via the `--env` flag vs. inherited from the host environment?
## 6. One-Shot Token Protection Tests

File: `tests/integration/one-shot-tokens.test.ts`

### What It Tests

#### Container Mode (7 tests)
| Test Case | Description |
| --- | --- |
| Cache GITHUB_TOKEN | Uses `printenv` (forks a new process each time). Both reads succeed. Verifies the debug log shows token access with a value preview. |
| Cache COPILOT_GITHUB_TOKEN | Same pattern for the Copilot token. |
| Cache OPENAI_API_KEY | Same pattern for the OpenAI key. |
| Multiple tokens independently | Sets `GITHUB_TOKEN` and `OPENAI_API_KEY`. Verifies both are cached independently. |
| Non-sensitive vars unaffected | Sets `NORMAL_VAR`. Verifies it's readable multiple times without one-shot-token log messages. |
| Python same-process caching | Uses Python `os.getenv()` to call `getenv()` directly (same process). Both reads succeed via the cache. |
| Clear from /proc/self/environ | Python test: verifies the first `getenv()` caches, checks the `os.environ` dict, and the second `getenv()` returns the cached value. |
#### Chroot Mode (5 tests)

| Test Case | Description |
| --- | --- |
| Cache GITHUB_TOKEN in chroot | Same as container mode but in chroot. Verifies the library is copied into the chroot. |
| Cache COPILOT_GITHUB_TOKEN in chroot | Same for the Copilot token in chroot. |
| Python caching in chroot | Python `os.getenv()` test in chroot mode. |
| Non-sensitive vars in chroot | Verifies non-sensitive vars are unaffected in chroot. |
| Multiple tokens in chroot | Multiple independent tokens in chroot mode. |
#### Edge Cases (3 tests)

| Test Case | Description |
| --- | --- |
| Empty token value | Sets `GITHUB_TOKEN=''`. Both reads return empty. |
| Token not set | `NONEXISTENT_TOKEN` is not in the environment. Both reads return empty. |
| Special characters | Sets a token with `@#$%` characters. Both reads preserve the special chars. |
### Real-World Mapping

The one-shot token library (LD_PRELOAD) is the core mechanism preventing token theft within the agent process. When an AI agent like Claude Code calls `getenv("ANTHROPIC_API_KEY")`:

1. First call: the library caches the value in memory and unsets it from the environment.
2. Subsequent calls: the library returns the cached value.
3. `/proc/self/environ` no longer contains the token.
4. A prompt injection attack that reads `/proc/self/environ` or spawns a subprocess to run `printenv` gets nothing.
### Gaps and Missing Coverage

| Gap | Impact | Recommendation |
| --- | --- | --- |
| ANTHROPIC_API_KEY caching lacks explicit debug-log verification | Tested in container mode, but without the debug-log verification that `GITHUB_TOKEN` gets. | Largely covered already (Test 3 in container mode); add the debug-log assertion for parity. |
| No test for Node.js `process.env` access | Python is tested, but Node.js (used by Claude Code and the Copilot CLI) is not. | Add a Node.js test with `process.env.GITHUB_TOKEN`. |
| No test for Go `os.Getenv()` access | Go is common in GitHub Actions, and pure-Go binaries that don't link libc bypass `getenv()` interposition entirely. | Add a Go test verifying whether LD_PRELOAD works with Go binaries. |
| No test for a compiled C program | The library intercepts C's `getenv()`; a compiled C program would be the most direct test. | Add a test with a simple C program calling `getenv()`. |
| No test for the `environ` global variable | Some programs access the `environ` array directly instead of calling `getenv()`. | Test that iterating `extern char **environ` doesn't show tokens. |
## 7. Exit Code Propagation Tests

### What It Tests

#### Basic Exit Codes (6 tests)

| Test Case | Description |
| --- | --- |
| Exit code 0 | `exit 0` propagates as 0. Verifies "Process exiting with code: 0" in stderr. |
| Exit code 1 | `exit 1` propagates as 1. |
| Exit code 2 | `exit 2` propagates as 2. |
| Exit code 42 | A custom exit code propagates correctly. |
| Exit code 127 | Running `nonexistent_command_xyz` produces 127 (command not found). |
| Exit code 255 | The maximum standard exit code propagates correctly. |
#### Command Exit Codes (6 tests)

| Test Case | Description |
| --- | --- |
| true | The `true` command returns 0. |
| false | The `false` command returns 1. |
| test success | `test 1 -eq 1` returns 0. |
| test failure | `test 1 -eq 2` returns 1. |
| grep found | `echo "hello world" \| grep hello` returns 0. |
| grep not found | `echo "hello world" \| grep xyz` returns 1. |
#### Pipeline Exit Codes (3 tests)

| Test Case | Description |
| --- | --- |
| Last command in pipeline | `echo "test" \| cat \| exit 5` returns 5. |
| Compound command success | `echo "a" && echo "b" && exit 0` returns 0. |
| Compound command failure | `echo "a" && false && echo "c"` returns 1 (short-circuits at `false`). |
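The default the pipeline tests rely on (the last command's status wins) flips under `set -o pipefail`, which is the gap flagged in this section's gaps table. A quick sketch of the difference:

```shell
# By default a pipeline's status is the last command's, so the
# earlier `false` is invisible.
sh -c 'false | true'; echo "default: $?"

# With pipefail (a bash option), the rightmost failing status wins.
bash -c 'set -o pipefail; false | true'; echo "pipefail: $?"
```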
### Real-World Mapping

Exit code propagation is essential for CI/CD integration. When AWF runs in a GitHub Actions step:

- Exit code 0 means the agent succeeded → the workflow continues.
- A non-zero exit code means the agent failed → the workflow marks the step as failed.
- The correct exit code helps users diagnose what went wrong.
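Two of the conventions these tests lean on are easy to pin down in plain shell, independent of AWF: the shell's 127 "command not found" status and the modulo-256 wrapping noted in the gaps below:

```shell
sh -c 'exit 42'; echo "custom: $?"     # arbitrary codes pass through

sh -c 'nonexistent_command_xyz' 2>/dev/null
echo "missing: $?"                     # 127 = command not found

sh -c 'exit 257'; echo "wrapped: $?"   # exit status is 8-bit: 257 -> 1
```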
### Gaps and Missing Coverage

| Gap | Impact | Recommendation |
| --- | --- | --- |
| No test for signal-induced exit codes | SIGTERM → 143, SIGKILL → 137. Not tested. | Add tests for signal exit codes. |
| No test for exit codes > 255 | The shell wraps codes modulo 256. | Test that `exit 256` returns 0 and `exit 257` returns 1. |
| No test for exit code from a killed process | What exit code does a process killed with `kill -9` produce? | Test that AWF propagates signal-killed process codes. |
| No test for exit code from timeout | When `--max-time` causes curl to time out. | Test exit codes from timed-out commands. |
| No test for exit code from an OOM-killed process | When a command is killed by the OOM killer. | Document the behavior (may not be testable in CI). |
| No test for pipefail behavior | `set -o pipefail` changes which exit code is returned from pipelines. | Test with `set -o pipefail` to verify behavior. |
| No test for nested AWF invocations | AWF running AWF (if that's ever a pattern). | May not be relevant, but worth documenting. |
### Edge Cases

- What if the Docker container itself crashes (not the command)?
- What if `docker wait` returns a different code than the process's actual exit code?
- What about exit codes from exec'd processes within the command?
## 8. Error Handling Tests

File: `tests/integration/error-handling.test.ts`

### What It Tests

#### Network Errors (3 tests)

| Test Case | Description |
| --- | --- |
| Blocked domain | `curl -f https://example.com` with only `github.com` allowed. Expects a non-zero exit. |
| Connection refused | `curl http://localhost:12345` where no server listens. Expects "connection failed". |
| DNS resolution failure | curl to a non-existent domain (in the allowlist). Expects "dns failed". |
#### Command Errors (3 tests)

| Test Case | Description |
| --- | --- |
| Command not found | `nonexistent_command_xyz123` returns exit code 127. |
| Permission denied | `cat /etc/shadow` fails with permission denied. |
| File not found | `cat /nonexistent/file/path` fails with "No such file". |
#### Script Errors (2 tests)

| Test Case | Description |
| --- | --- |
| Bash syntax error | `bash -c "if then fi"` caught with `\|\| echo`. |
| Division by zero | `bash -c "echo $((1/0))"` caught with `\|\| echo`. |
#### Process Signals (1 test)

| Test Case | Description |
| --- | --- |
| SIGTERM from command | `kill -TERM $$` self-terminates. Verifies the firewall handles it gracefully (the result is defined). |
#### Recovery After Errors (1 test)

| Test Case | Description |
| --- | --- |
| Continue after failure | Runs `false`, then `echo "recovery test"`. Verifies the second command works (they're separate AWF invocations). |
### Real-World Mapping

Error handling is critical for production reliability:

- Network errors occur when agents try to access domains not in the allowlist.
- Command errors occur when agent-generated commands have bugs.
- Script errors happen when AI-generated code has syntax issues.
- Signal handling ensures the firewall cleans up properly when interrupted.
- Recovery after errors ensures the system doesn't enter a broken state.
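Cleanup-on-interrupt in shell entrypoints typically rests on `trap` handlers. A minimal sketch of the pattern (the `cleanup` function is hypothetical, not AWF's actual handler):

```shell
#!/bin/sh
# An EXIT trap fires when the script ends, whether the wrapped
# command succeeded or failed, so teardown always runs.
cleanup() { echo "cleanup ran"; }
trap cleanup EXIT

sh -c 'exit 3' || true   # a failing command does not skip the trap
```

Running this prints `cleanup ran` as the script exits. SIGKILL is the notable exception: it cannot be trapped, which is why the gaps table below asks what state is left behind after one.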
### Gaps and Missing Coverage

| Gap | Impact | Recommendation |
| --- | --- | --- |
| No test for Docker daemon unavailable | What if Docker is not running when AWF starts? | Test CLI behavior when the Docker socket is missing. |
| No test for disk full | What if `/tmp` is full and the work directory can't be created? | Test behavior with limited disk space. |
| No test for Squid proxy crash | What if Squid crashes mid-session? | Test agent behavior when Squid becomes unreachable. |
| No test for network partition | What if the Docker network goes down? | Test behavior when `awf-net` becomes unavailable. |
| No test for SIGKILL handling | SIGKILL cannot be caught. What state is left behind? | Document and test cleanup behavior after SIGKILL. |
| No test for SIGINT (Ctrl+C) during a command | The most common user interruption pattern. | Test that SIGINT properly stops containers and cleans up. |
| No test for concurrent error + cleanup | What if an error occurs during cleanup? | Test nested error handling scenarios. |
| No test for invalid CLI arguments | What happens with `--allow-domains ""` or `--dns-servers invalid`? | Already tested at the unit level, but not at the integration level. |
| No test for container startup timeout | What if the container image pull takes too long? | Test behavior with an unreachable registry. |
| No test for resource cleanup verification | After an error, verify Docker networks/containers/volumes are cleaned up. | Add post-test assertions checking Docker state. |
### Edge Cases

- What if the error occurs in a trap handler itself?
- What if multiple signals arrive in rapid succession?
- What about errors in the iptables cleanup phase?
- What if the work directory is deleted by another process during execution?
## 9. Cross-Cutting Observations

### Strengths

- **Defense in depth:** The security tests cover multiple layers (credential hiding, token unsetting, one-shot tokens, API proxy isolation).
- **Both modes tested:** Most security tests cover both normal and chroot mode.
- **Real-world attack simulation:** The credential hiding tests simulate actual exfiltration attacks (base64, xxd, grep patterns).
- **Custom matchers:** The `toSucceed()`, `toFail()`, and `toExitWithCode()` matchers provide clear, readable assertions.
- **Bypass prevention:** Tests specifically cover the chroot bypass vulnerability (Test 8) that was previously discovered and fixed.
- **Broad API proxy coverage:** Tests cover all three API providers (OpenAI, Anthropic, Copilot) for healthchecks and env wiring; end-to-end request routing and credential isolation are currently verified in depth only for Anthropic.
### Systemic Gaps

- **No stress testing:** No tests verify behavior under load, concurrent requests, or resource exhaustion.
- **No timing/performance tests:** No tests verify that proxy overhead is within acceptable bounds.
- **Limited negative testing for security:** While credential hiding is well tested, other security bypasses (iptables manipulation, network namespace escape, container escape) are not tested.
- **Fragile timing dependencies:** The token-unset tests rely on fixed `sleep 7` delays, which could be flaky in slow CI environments.
- **Test isolation concerns:** Tests share cleanup fixtures; a failed test could leave state that affects subsequent tests.
- **No test for the full workflow:** No end-to-end test runs an actual AI agent (even a mock one) through the complete AWF pipeline.
- **Missing SSL Bump tests:** The `--ssl-bump` feature is configured in the CLI but has no integration tests in this set.
- **Missing blocked-domains tests:** The `--block-domains` feature (a deny-list on top of the allow-list) has no integration tests.
- **No test for `--env-all` security implications:** Passing all host env vars could leak credentials.
- **No test for log output sanitization:** Nothing verifies that real API keys never appear in logs/stderr.