code-reviewer (staging)
Pass Rate
94.2%
Last Run
2 min ago
Failing Tests
3
Flaky Tests
2
Test Cases (52 total)
Handle Rate Limit Gracefully
Last failed: 2 min ago · Avg duration: 3.2s
0% (0/10 runs)
10 failures
⚠️
Large File Review (>1000 lines)
Last failed: 15 min ago · Avg duration: 8.7s
70% (7/10 runs)
3 failures
Simple Code Review
Last run: 2 min ago · Avg duration: 1.8s
100% (10/10 runs)
0 failures
Multi-File Diff Analysis
Last run: 2 min ago · Avg duration: 4.3s
100% (10/10 runs)
0 failures
Security Vulnerability Detection
Last run: 2 min ago · Avg duration: 2.1s
100% (10/10 runs)
0 failures
Staging vs Production
Configuration 2 differences
Model gpt-4-turbo ✓
Rate Limit Staging: 60 RPM → Prod: 10,000 RPM
Temperature 0.3 ✓
Context Window Staging: 8K → Prod: 128K
⚠️ Environment Mismatch
Rate limit test is failing because staging uses 60 RPM while production uses 10,000 RPM. Test doesn't reflect production behavior.

Recommendation: Update staging rate limit to match production OR adjust test expectations for staging environment.
Non-Determinism Risk
Temperature: 0.3
Low variance — test results should be consistent
10 runs per test
Sufficient sample size to detect flakiness