zone411's submissions

1.		LLM Position Bias Benchmark: Swapped-Order Pairwise Judging (github.com/lechmazur)
		1 point by zone411 3 days ago \| past \| discuss
2.		Show HN: Buyout Game Benchmark: Multi-Agent Bargaining, Transfers, and Takeovers (github.com/lechmazur)
		6 points by zone411 25 days ago \| past
3.		LLM Persuasion Benchmark: Multi-Turn Persuasion Between Models (github.com/lechmazur)
		9 points by zone411 28 days ago \| past
4.		Show HN: LLM Debate Benchmark (github.com/lechmazur)
		9 points by zone411 32 days ago \| past \| 3 comments
5.		Show HN: LLM Sycophancy Benchmark: Opposite-Narrator Contradictions (github.com/lechmazur)
		3 points by zone411 45 days ago \| past
6.		Show HN: LLM Round‑Trip Translation Benchmark (github.com/lechmazur)
		6 points by zone411 7 months ago \| past
7.		Show HN: LLM Creative Story‑Writing Benchmark V3 (github.com/lechmazur)
		8 points by zone411 7 months ago \| past
8.		Show HN: Mapping LLM Style and Range in Flash Fiction (github.com/lechmazur)
		7 points by zone411 7 months ago \| past
9.		Pact: Head-to-head negotiation benchmark for LLMs (github.com/lechmazur)
		6 points by zone411 8 months ago \| past
10.		Show HN: Bazaar – a new LLM benchmark for economic reasoning under uncertainty (github.com/lechmazur)
		8 points by zone411 9 months ago \| past \| 1 comment
11.		AI Comes Up with Physics Experiments. But They Work (quantamagazine.org)
		4 points by zone411 9 months ago \| past
12.		Emergent Price-Fixing by LLM Auction Agents (github.com/lechmazur)
		7 points by zone411 9 months ago \| past
13.		Public Goods Game Benchmark: Contribute and Punish, a Multi-Agent Benchmark (github.com/lechmazur)
		7 points by zone411 on March 20, 2025 \| past
14.		Elimination Game: Multi-Agent LLM Social Reasoning, Strategy, and Deception (github.com/lechmazur)
		5 points by zone411 on Feb 25, 2025 \| past
15.		SWE-Lancer: a benchmark of freelance software engineering tasks from Upwork (arxiv.org)
		111 points by zone411 on Feb 18, 2025 \| past \| 74 comments
16.		LLM Hallucination Benchmark: R1, o1, o3-mini, Gemini 2.0 Flash Think Exp 01-21 (github.com/lechmazur)
		17 points by zone411 on Feb 10, 2025 \| past \| 3 comments
17.		Multi-Agent Step Race Benchmark: LLM Collaboration and Deception Under Pressure (github.com/lechmazur)
		7 points by zone411 on Jan 22, 2025 \| past \| 1 comment
18.		Show HN: LLM Thematic Generalization Benchmark (github.com/lechmazur)
		6 points by zone411 on Jan 14, 2025 \| past
19.		Show HN: LLM Creative Story-Writing Benchmark (github.com/lechmazur)
		5 points by zone411 on Jan 6, 2025 \| past
20.		Show HN: LLM Divergent Thinking Creativity Benchmark (github.com/lechmazur)
		8 points by zone411 on Dec 30, 2024 \| past
21.		Show HN: LLM Deceptiveness and Gullibility Benchmark (github.com/lechmazur)
		7 points by zone411 on Oct 22, 2024 \| past \| 1 comment
22.		LLM Confabulation (Hallucination) Leaderboard (github.com/lechmazur)
		6 points by zone411 on Oct 10, 2024 \| past
23.		O1-preview and o1-mini results on NYT Connections (twitter.com/lechmazur)
		2 points by zone411 on Sept 13, 2024 \| past \| 1 comment
24.		Grok is an AI modeled after the Hitchhiker’s Guide to the Galaxy (twitter.com/xai)
		213 points by zone411 on Nov 5, 2023 \| past \| 226 comments
25.		Can you beat a stochastic parrot? ParrotChess.com (parrotchess.com)
		3 points by zone411 on Sept 22, 2023 \| past \| 4 comments
26.		Generative AI while browsing in Chrome (labs.google.com)
		3 points by zone411 on Aug 15, 2023 \| past
27.		Statement on AI Risk (safe.ai)
		341 points by zone411 on May 30, 2023 \| past \| 921 comments
28.		Google tells staff it plans to limit publishing AI research (businessinsider.com)
		63 points by zone411 on May 5, 2023 \| past \| 28 comments
29.		4th Gen Intel Xeon Scalable Sapphire Rapids Leaps Forward (servethehome.com)
		2 points by zone411 on Jan 10, 2023 \| past \| 1 comment
30.		Fast and Furious Movie Titles by 'Claude' from Anthropic AI (twitter.com/jayelmnop)
		2 points by zone411 on Jan 9, 2023 \| past