axpy906 2 days ago [-]
Good post. I've been thinking about offline testing of LLM tasks lately and have concluded that old-school testing is the best option until more mature tooling is developed. Specifically, I mean running a power analysis to determine sample size, drawing a random sample of that size, and then running a test such as a z-test to see whether there is a difference and between what bounds. Tests are expensive, and I wish there were a better way to get realizable offline evals.
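For anyone who wants to try the workflow described above, here is a minimal sketch using only the Python standard library. It assumes the eval is a binary pass/fail metric compared between two prompt or model variants, and it uses the usual normal approximation for two proportions. The function names and the 70% vs 75% pass rates are hypothetical, chosen only for illustration.

```python
import math
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided two-proportion z-test.

    Uses the standard normal-approximation formula; p1 and p2 are the
    pass rates you expect (or want to be able to tell apart).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)


def two_proportion_z_test(passes_a, n_a, passes_b, n_b, alpha=0.05):
    """Return (z, two-sided p-value, confidence interval for p_b - p_a)."""
    p_a = passes_a / n_a
    p_b = passes_b / n_b
    # The pooled standard error is used for the test statistic.
    pooled = (passes_a + passes_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # The unpooled standard error is used for the interval on the difference.
    se_unpooled = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se_unpooled
    diff = p_b - p_a
    return z, p_value, (diff - margin, diff + margin)


if __name__ == "__main__":
    # Hypothetical scenario: baseline passes 70% of cases, candidate 75%.
    n = sample_size_two_proportions(0.70, 0.75)
    print(f"samples needed per variant: {n}")

    # Hypothetical results after randomly sampling 1000 cases per variant.
    z, p, (low, high) = two_proportion_z_test(700, 1000, 750, 1000)
    print(f"z={z:.2f}, p={p:.4f}, 95% CI for the difference: [{low:.3f}, {high:.3f}]")
```

The sample size grows quickly as the gap between the two pass rates shrinks, which is the cost problem mentioned above. The sketch does not handle degenerate inputs such as identical rates or a pooled rate of exactly 0 or 1.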
rhdunn 2 days ago [-]
Have you seen LLM testing tools like promptfoo?
axpy906 2 days ago [-]
Yes, I have seen it and BrainTrust too. Unfortunately, I need FOSS without a vendor at scale.
amarcheschi 2 days ago [-]
I, luckily, passed my statistics exam this summer, but it's still cool to visualize what's happening.
qefduzh 2 days ago [-]
[flagged]
aduffy 2 days ago [-]
- Brand new burner account
- upset about “AI slop” (the image is clearly not AI)
- mentions tech buzzwords that annoy you
- claiming the article is not as rigorous as an academic paper
Perhaps I’m just old school. But I miss the HN where the best way to get upvotes was to be insightful and not to send low-effort snarky replies
overbytecode 2 days ago [-]
One of those points is not like the others. Marimo's ability to deploy a notebook as WASM is a very nice feature imo.