I'm not one for benchmarks in general, but during development they serve as a baseline to build on. In my opinion BEAM is the most relevant benchmark because it tests end-to-end answer quality, not just retrieval. LongMemEval is solid for retrieval evaluation, but it only measures whether the right document is in the top-k, not whether the system answers correctly. LoCoMo tests useful abilities (multi-hop, temporal), but its recall metric is trivially gameable: once top-k exceeds the number of sessions per conversation, every session gets retrieved and recall is perfect regardless of ranking quality.
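To make the LoCoMo point concrete, here is a minimal sketch of how session-level recall saturates. It is not LoCoMo's actual evaluation code: the `recall_at_k` and `random_retriever` functions, the session IDs, and the gold labels are all made up for illustration. The retriever has no skill at all and just shuffles the sessions.

```python
import random


def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of gold session IDs that appear in the top-k retrieved IDs."""
    top_k = set(ranked_ids[:k])
    return len(top_k & set(gold_ids)) / len(gold_ids)


def random_retriever(session_ids, rng):
    """A retriever with no skill: returns the sessions in random order."""
    ranked = list(session_ids)
    rng.shuffle(ranked)
    return ranked


# Hypothetical conversation with 10 sessions; 2 of them hold the evidence.
sessions = [f"s{i}" for i in range(10)]
gold = ["s3", "s7"]

rng = random.Random(0)
ranked = random_retriever(sessions, rng)

# With k below the session count, recall depends on the ranking.
recall_small_k = recall_at_k(ranked, gold, k=3)

# With k above the session count, every session is in the top-k,
# so recall is 1.0 no matter how bad the ranking is.
recall_large_k = recall_at_k(ranked, gold, k=20)

print(recall_small_k, recall_large_k)
```

With `k=20` and only 10 sessions, the shuffled ranking scores the same as a perfect one, so the metric only says something about the retriever when k is well below the number of sessions.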