Knut Jägersberg

KnutJaegersberg

AI & ML interests

NLP, opinion mining, narrative intelligence

KnutJaegersberg's activity

replied to s3nh's post 13 days ago

Don't burn out! Lighten up again, will you?

posted an update 13 days ago
replied to BramVanroy's post 3 months ago

It mixed up content in the output and gave weird answers. I didn't have that problem with other models. Maybe the update they released solved that issue; I just never bothered to check, given the alternatives.

replied to BramVanroy's post 3 months ago

I got some weird results. Since there are a lot of other models in that performance-parameter range, I just didn't try it again.

replied to macadeliccc's post 4 months ago
replied to bwang0911's post 4 months ago
replied to JustinLin610's post 4 months ago
replied to osanseviero's post 4 months ago

I hear there is an incredible amount of competition among LLM makers within China; I guess one would publish, and thus promote, only the best. Hundreds of models. Competition is good for performance.

replied to s3nh's post 5 months ago

I didn't dive deeply into all the creative role-play models, although I sense there is a great deal of unrecognized innovation happening there. Beautiful art!

replied to their post 5 months ago

That's a nice Space you made there, but it's also unrelated to my post.

replied to their post 5 months ago

I didn't see a link to the prompt in the video, but the prompt format can be optimized.

replied to their post 5 months ago
posted an update 5 months ago
Shocking: 2/3 of LLMs fail at 2K context length

code_your_own_ai makes a great vlog, mostly about LLM-related AI content.
As I watched the video below, I wondered about current best practices for LLM evaluation. We have benchmarks, we have SOTA LLMs evaluating other LLMs, and we have tools that evaluate based on human comparison.
Often I hear: just play with the LLM for 15 minutes to form an opinion.
While I think this could yield signal-carrying experience for a specific use case with clear expectations, I also see that often a single prompt is used to judge models.
Benchmarks have their weaknesses and are by themselves not enough to judge model quality, but I still think systematic methods that try to reduce well-known sources of error should be the way forward, even for qualitative estimates.
What do you think? How can a public tool for judging models like lmsys/chatbot-arena-leaderboard help leverage measurement standards known from social science?

https://www.youtube.com/watch?v=mWrivekFZMM
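
For concreteness, arena-style tools turn pairwise human votes into a ranking; here is a minimal Elo-style sketch of that aggregation (the K-factor and starting ratings are illustrative, not lmsys's exact method):

```python
def update_elo(rating_a: float, rating_b: float, winner: str, k: float = 32.0):
    """One rating update from a single pairwise vote; winner is 'a', 'b', or 'tie'."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

# Example: model A beats model B once, both starting at 1000
print(update_elo(1000.0, 1000.0, "a"))  # (1016.0, 984.0)
```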
posted an update 5 months ago
QuIP# ecosystem is growing :)

I saw a QuIP# 2-bit Qwen-72B-Chat model on the Hub today, which shows there is support for vLLM inference.
This will speed up inference and make high-performing 2-bit models more practical. I'm considering quipping MoMo now, as otherwise I can only use a brief context window of Qwen-72B on my system, even with bnb double quantization (see the sketch below).

keyfan/Qwen-72B-Chat-2bit

Also note the easier-to-use QuIP#-for-all library :)

https://github.com/chu-tianxiang/QuIP-for-all
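
For anyone curious what I mean by bnb double quantization, here is a minimal loading sketch with transformers + bitsandbytes (the model ID and dtype choices are just illustrative, not my exact setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 with nested ("double") quantization to squeeze the weights further
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-72B-Chat",          # illustrative model ID
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-72B-Chat", trust_remote_code=True)
```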
posted an update 5 months ago
Microsoft: Improving Text Embeddings with Large Language Models

- uses an LLM instead of complex pipelines to create the training data
- directly generates data for numerous text embedding tasks
- fine-tunes standard models with a contrastive loss, achieving strong performance
- critical thought: isn't this a kind of benchmark hacking? If the benchmarks are encompassing enough to capture the complete idea of embedding, it may be a good approach, but I find they often oversimplify.

Feel free to share your thoughts, even if, like mine, they don't beat the benchmarks ;P


https://arxiv.org/abs/2401.00368
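
The contrastive fine-tuning step is essentially an InfoNCE-style loss over (query, positive) pairs with in-batch negatives; a minimal PyTorch sketch (the temperature is an illustrative default, not the paper's exact value):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, pos_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """Contrastive loss over a batch of (query, positive) embedding pairs.

    Every other positive in the batch serves as a negative for a given query.
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature                      # scaled cosine similarities
    labels = torch.arange(q.size(0), device=q.device)   # the matching row is the positive
    return F.cross_entropy(logits, labels)
```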
replied to fffiloni's post 6 months ago