Software
Selected tools and research artifacts.
-
QuickScope
Finds and certifies hard regions in dynamic LLM benchmarks, focusing the evaluation budget on areas where models fail consistently.
-
STEER Benchmark
Explore benchmark artifacts for assessing economic rationality and microeconomic reasoning in large language models.