WASDx 4 days ago | on: Open source AI must win

> distributed LLM inference

This seems extremely inefficient, considering the data transfer between model layers when the model is distributed. I found a project called Petals that claims up to 4 tok/s for a 180B model, although its repository hasn't been updated in two years. https://petals.dev/

stymaar 4 days ago

For token generation, yes: because current-gen LLMs are autoregressive, you need to add the inter-node latency for every single token.

For prompt processing it would work, though, and it could work for diffusion LLMs as well.
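
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the function names, the 4 hops, the 50 ms per hop, and the simplification that only network latency matters (compute time and bandwidth for the activations are ignored). The one idea it models is that latency is paid once per sequential step: one step per token when decoding autoregressively, a single pass for the prompt, and one step per denoising iteration for a diffusion model.

```python
def network_overhead_ms(sequential_steps: int, hops: int, latency_ms: int) -> int:
    """Total network latency when each sequential step crosses every hop once.

    Assumption: compute time and transfer bandwidth are ignored entirely.
    """
    return sequential_steps * hops * latency_ms


def latency_bound_tokens_per_second(hops: int, latency_ms: int) -> float:
    """Upper bound on autoregressive decode speed from latency alone."""
    return 1000 / (hops * latency_ms)


# Hypothetical setup: model split across nodes with 4 hops, 50 ms per hop.
HOPS, LATENCY_MS = 4, 50

# Autoregressive decoding of 100 tokens: one sequential step per token.
decode_ms = network_overhead_ms(100, HOPS, LATENCY_MS)   # 20000 ms

# Prompt processing: all prompt tokens go through in one pass.
prefill_ms = network_overhead_ms(1, HOPS, LATENCY_MS)    # 200 ms

# Diffusion-style generation, assuming 10 denoising steps for the whole output.
diffusion_ms = network_overhead_ms(10, HOPS, LATENCY_MS)  # 2000 ms

print(decode_ms, prefill_ms, diffusion_ms)
print(latency_bound_tokens_per_second(HOPS, LATENCY_MS))  # 5.0 tok/s ceiling
```

With these made-up numbers, decoding pays 20 seconds of pure latency for 100 tokens, while the prompt pass pays 0.2 seconds no matter how long the prompt is. Real figures depend on how the layers are split and on the actual links between nodes.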
