Evaluation on Hi, I'm Muhammad Amal

Recent content in Evaluation on Hi, I'm Muhammad Amal
https://muhammadamal.my.id/tags/evaluation/
Last updated: Wed, 29 Jan 2025 09:00:00 +0700

Benchmarking SLMs for Your Use Case, From Lmeval to Custom Suites
Wed, 29 Jan 2025 09:00:00 +0700
https://muhammadamal.my.id/blog/benchmarking-slms-for-your-use-case-lmeval-to-custom/
Public leaderboards lie about your task. Build a benchmark that measures what your users actually need.

Evaluating LLM Agents, From Vibes to Regression Suites
Fri, 24 May 2024 09:00:00 +0700
https://muhammadamal.my.id/blog/evaluating-llm-agents/
A practical agent evaluation system with deterministic checks, LLM-as-judge rubrics, and the regression discipline that survives model upgrades.

Evaluating RAG, Beyond Vibes-Based Testing
Mon, 26 Feb 2024 09:00:00 +0700
https://muhammadamal.my.id/blog/rag-evaluation-ragas-trulens-deepeval/
Ragas, TruLens, DeepEval: measuring RAG quality. Faithfulness, context precision, answer relevance. CI integration without LLM-as-judge bills.

Putting a RAG Evaluation Pipeline in CI, The Setup I Actually Use
Mon, 20 Nov 2023 09:00:00 +0700
https://muhammadamal.my.id/blog/rag-evaluation-pipeline-ci/
A practical RAG eval setup wired into CI: retrieval and generation metrics, golden questions, and catching silent regressions.