🧩 Philosophy 2d ago · bpomo

A Fast and Loose Clustering of LLM Benchmarks

Less Wrong

AI benchmarks measure a variety of distinct skills, from agency to general knowledge to spatial reasoning. Two benchmarks may measure similar traits if AI models that perform well on one also perform well on the other. Moreover, these connections might not be obvious from the benchmarks' descriptions. This is a rough first pass at clustering benchmarks into groups based on this type of similarity; the Claude-coded experiment can be found at this GitHub repo. We have lots of AI benchma
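
To make the notion of similarity concrete, here is a minimal sketch of one way it could be computed. It is not the method from the linked repo, which is not shown here. It assumes Pearson correlation across models as the similarity measure and groups benchmarks whose correlation clears a threshold. The benchmark names, the scores, the `threshold` value, and the helper names `pearson` and `cluster_benchmarks` are all made up for illustration.

```python
from math import sqrt

# Hypothetical data: each benchmark maps to the scores of the same five
# models, in the same order. All numbers are invented for illustration.
SCORES = {
    "bench_a": [0.20, 0.35, 0.50, 0.70, 0.90],
    "bench_b": [0.25, 0.30, 0.55, 0.65, 0.95],
    "bench_c": [0.80, 0.40, 0.90, 0.30, 0.60],
    "bench_d": [0.75, 0.45, 0.85, 0.35, 0.55],
}


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def cluster_benchmarks(scores, threshold=0.8):
    """Group benchmarks linked by correlation >= threshold.

    Assumption: clusters are the connected components of the graph whose
    edges join benchmark pairs with correlation at or above the threshold.
    """
    names = list(scores)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(scores[a], scores[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    for group in cluster_benchmarks(SCORES):
        print(group)
```

On this toy data, models that do well on `bench_a` also do well on `bench_b`, and likewise for `bench_c` and `bench_d`, so the sketch puts each pair in its own group. A real analysis would also have to deal with missing scores and with the fact that most benchmarks correlate with overall model capability.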