Built an AI evaluation and testing workflow to measure and improve LLM quality on code-focused tasks (Swift) and general NLP tasks. Delivered automated evaluation suites, human-in-the-loop review workflows, and iterative prompt-optimization cycles that improved accuracy, reliability, and consistency across releases. Designed metrics, test sets, and dashboards that made model quality visible and actionable for product and engineering teams.
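For illustration only, a minimal sketch of the kind of automated evaluation harness this workflow centers on: running each prompt in a labeled test set through the model under test, scoring the output, and reporting a pass rate per task category. The file name, test-set format, exact_match scorer, and evaluate function are hypothetical stand-ins under assumed conventions, not the production implementation.

```python
# Minimal evaluation-harness sketch (hypothetical names and test-set format).
# Loads a labeled test set, scores model outputs against expected answers,
# and reports an aggregate pass rate per task category.
import json
from collections import defaultdict
from typing import Callable, Dict


def exact_match(output: str, expected: str) -> bool:
    """Simplest possible scorer: whitespace-normalized string equality."""
    return output.strip() == expected.strip()


def evaluate(test_set_path: str, generate: Callable[[str], str]) -> Dict[str, float]:
    """Run every test case through `generate` (the model under test) and score it."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    with open(test_set_path) as f:
        # Assumed format: [{"prompt": ..., "expected": ..., "category": ...}, ...]
        cases = json.load(f)
    for case in cases:
        category = case.get("category", "general")
        total[category] += 1
        if exact_match(generate(case["prompt"]), case["expected"]):
            passed[category] += 1
    return {category: passed[category] / total[category] for category in total}


if __name__ == "__main__":
    # Stub model for demonstration; in practice this would call the LLM under test.
    scores = evaluate("test_set.json", generate=lambda prompt: "42")
    for category, pass_rate in sorted(scores.items()):
        print(f"{category}: {pass_rate:.1%}")
```

In practice the exact-match scorer would be swapped for task-appropriate checks (for example, compiling or unit-testing generated Swift code, or rubric-based human review), and the per-category pass rates would feed the dashboards tracked across releases.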