I’ve been using AI tools like Claude, ChatGPT, and Gemini for everything from coding to organizing my messy file folders, but I realized I’ve never done any fundamental testing of how they behave when given the same task. Sure, I’ve handed them similar tasks, but never the exact same one, so I’ve never seen how their outputs compare side by side.