A French nuclear submarine crew accidentally leaked sensitive data through the Strava fitness app, revealing the vessel's location and patrol schedule.
Eventually, they managed to sustain a throughput of 39.31 tokens per second running a Llama-based LLM with 260,000 parameters. Cranking up the model size significantly reduced performance ...
Gains here are generally lower than for prompt processing (PP) because token generation (TG) is limited by memory bandwidth. Nevertheless, for some quantization formats/architectures/thread counts the speedup is quite remarkable (e.g., almost ...
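As a rough illustration of why token generation is memory-bandwidth bound, the sketch below estimates an upper limit on tokens per second from the weight size and the machine's memory bandwidth. The model size is taken from the table of models below; the 50 GB/s bandwidth figure is an assumed example, not a measurement from the source.

```python
# Back-of-envelope bound on token-generation (TG) speed.
# During TG, each generated token requires reading roughly all model weights
# once, so throughput is capped at (memory bandwidth / model size).
# The concrete numbers below are illustrative assumptions.

def tg_upper_bound(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Approximate max tokens/s when TG is purely memory-bandwidth bound."""
    return bandwidth_bytes_per_s / model_bytes

model_size_gb = 6.32    # e.g. Llama 3 8B Q40 weights (see the model table)
bandwidth_gb_s = 50.0   # assumed DRAM bandwidth of the host machine

max_tps = tg_upper_bound(model_size_gb * 1e9, bandwidth_gb_s * 1e9)
print(f"TG upper bound: ~{max_tps:.1f} tokens/s")   # ~7.9 tokens/s
```

Prompt processing, by contrast, pushes many tokens through each pass over the weights, so it is largely compute-bound and tends to benefit much more from extra threads and faster kernels.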
Model                     Purpose     Size      Command
TinyLlama 1.1B 3T Q40     Benchmark   844 MB    python launch.py tinyllama_1_1b_3t_q40
Llama 3 8B Q40            Benchmark   6.32 GB   python launch.py llama3_8b_q40
Llama 3 8B Instruct Q40   Chat, API   6.32 GB   python launch.py ...