I showcase ChainForge (chainforge.ai), an open-source tool for open-ended testing of hypotheses about large language model (LLM) outputs. First, I cover the motivations behind ChainForge and demo the tool. Then, I detail results from in-lab and in-the-wild usage studies conducted with colleagues at Harvard CS. We find that prompt engineering proceeds in three stages (opportunistic exploration, limited evaluation, and iterative refinement) and argue that designers of future LLM sensemaking tools need to explicitly consider each stage. Finally, I cover more informal lessons learned, such as surprising (at least to us!) ways people have been using ChainForge, trade-offs between a tool's breadth of applicability and the learning curve for specific user groups, and future research directions.