brenthor.bsky.social

It's not apples to apples, though. Going PC, the trade-off is a more open marketplace, cheaper games, and less optimization. Sony is more optimized but very consumer-hostile and proprietary to basically Apple levels. For $1k you can get a good PC that will do a lot more than a PS5 Pro.

ghawinriver.bsky.social

Mostly because crypto bros and AI evangelists sorta ate up all the components on one hand, and on the other, Google and IBM's fetish for poor memory utilization and optimization kinda made desktop components pricier.

pinkgallica.bsky.social

Built my computer exactly 5 years ago now. Kind of feeling the itch to upgrade/build a new one... Monster Hunter Wilds also comes out in February (not sure about its optimization) and my current setup might struggle, to say the least. 🤔


Convergence guarantee for linearly-constrained combinatorial optimization with a quantum alternating operator ansatz https://arxiv.org/pdf/2409.18829 Brayden Goldstein-Gelb, Phillip C. Lotshaw


mwisk.bsky.social

He actually seems a bit technical by SF6 standards; a lot of his optimization involves spacing and timing adjustments. He even has a just-frame in his Power Dunk target combo. Pretty unique indeed. I say work on him a step at a time. Definitely not the s.hp xx drive rush char.

hackernewsbot.bsky.social

PyTorch Native Architecture Optimization: Torchao Discussion


We’re happy to officially launch torchao, a PyTorch native library that makes models faster and smaller by leveraging low bit dtypes, quantization and sparsity. torchao is an accessible toolkit of techniques written (mostly) in easy-to-read PyTorch code spanning both inference and training. This blog will help you pick which techniques matter for your workloads.

We benchmarked our techniques on popular GenAI models like Llama 3 and diffusion models and saw minimal drops in accuracy. Unless otherwise noted, the baselines are bf16 run on an A100 80GB GPU.

Our topline metrics for Llama 3:
- 97% speedup for Llama 3 8B inference using autoquant with int4 weight-only quantization and hqq
- 73% peak VRAM reduction for Llama 3.1 8B inference at 128K context length with a quantized KV cache
- 50% speedup for Llama 3 70B pretraining using float8 training on H100
- 30% peak VRAM reduction for Llama 3 8B using 4-bit quantized optimizers

Our topline metrics for diffusion model inference:
- 53% speedup using float8 dynamic quantization inference with float8 row-wise scaling on flux1.dev on H100
- 50% reduction in model VRAM for CogVideoX using int8 dynamic quantization

Below we’ll walk through some of the techniques available in torchao that you can apply to your models for inference and training.

Inference

Our inference quantization algorithms work over arbitrary PyTorch models that contain nn.Linear layers. Weight-only and dynamic activation quantization for various dtypes and sparse layouts can be chosen using our top-level quantize_ API:

    from torchao.quantization import (
        quantize_,
        int4_weight_only,
    )
    quantize_(model, int4_weight_only())

Sometimes quantizing a layer can make it slower because of overhead, so if you’d rather we just pick how to quantize each layer in a model for you, you can instead run:

    model = torchao.autoquant(torch.compile(model, mode='max-autotune'))

The quantize_ API has a few different options depending on whether your model is compute bound or memory bound:

    from torchao.quantization import (
        # Memory bound models
        int4_weight_only,
        int8_weight_only,
        # Compute bound models
        int8_dynamic_activation_int8_semi_sparse_weight,
        int8_dynamic_activation_int8_weight,
        # Device capability 8.9+
        float8_weight_only,
        float8_dynamic_activation_float8_weight,
    )

We also have extensive benchmarks on diffusion models in collaboration with the HuggingFace diffusers team in diffusers-torchao, where we demonstrated a 53.88% speedup on Flux.1-Dev and a 27.33% speedup on CogVideoX-5b.

Our APIs are composable, so we have, for example, composed sparsity and quantization to bring a 5% speedup for ViT-H inference. We can also do things like quantize weights to int4 and the KV cache to int8 to support Llama 3.1 8B at the full 128K context length running in under 18.9GB of VRAM.

QAT

Post-training quantization, especially at less than 4 bits, can suffer from serious accuracy degradations. Using Quantization Aware Training (QAT) we’ve managed to recover up to 96% of the accuracy degradation on hellaswag.
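As a rough sketch of how a QAT flow like this plugs into a fine-tuning loop, the snippet below prepares a model with a torchao QAT quantizer, fine-tunes it with fake quantization, and then converts it for inference. The import path and the Int8DynActInt4WeightQATQuantizer name are assumptions based on torchao's QAT recipes rather than something stated in this post, and build_model() is a hypothetical helper, so check the torchao/torchtune docs for the current API.

    # Assumed import path; QAT lived under a prototype namespace in early torchao releases
    from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer

    model = build_model()  # hypothetical helper returning a model with nn.Linear layers
    quantizer = Int8DynActInt4WeightQATQuantizer()

    # Swap nn.Linear layers for fake-quantized versions so training sees quantization error
    model = quantizer.prepare(model)

    # ... run the usual fine-tuning loop on `model` here ...

    # Replace the fake-quantized layers with actually quantized ones for deployment
    model = quantizer.convert(model)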
We’ve integrated this as an end-to-end recipe in torchtune with a minimal tutorial.

Training

Low precision compute and communications

torchao provides easy-to-use e2e workflows for reducing the precision of training compute and distributed communications, starting with float8 for `torch.nn.Linear` layers. Here is a one-liner to convert the compute gemms of your training run to float8:

    from torchao.float8 import convert_to_float8_training
    convert_to_float8_training(model)

For an e2e example of how to speed up Llama 3 70B pretraining by up to 1.5x with float8, see our README, and torchtitan’s blog and float8 recipe.

Performance and accuracy of float8 pretraining of Llama 3 70B vs bfloat16 (source: https://dev-discuss.pytorch.org/t/enabling-float8-all-gather-in-fsdp2/2359)

We are expanding our training workflows to more dtypes and layouts:
- NF4 QLoRA in torchtune
- Prototype int8 training support
- Accelerated sparse 2:4 training

Low bit Optimizers

Inspired by Bits and Bytes, we’ve also added prototype support for 8- and 4-bit optimizers as a drop-in replacement for AdamW:

    from torchao.prototype.low_bit_optim import AdamW8bit, AdamW4bit
    optim = AdamW8bit(model.parameters())

Integrations

We’ve been actively working on making sure torchao works well in some of the most important projects in open source:
- HuggingFace transformers as an inference backend
- diffusers-torchao as a reference implementation for accelerating diffusion models
- HQQ for fast 4-bit inference
- torchtune for PyTorch native QLoRA and QAT recipes
- torchchat for post-training quantization
- SGLang for int4 and int8 post-training quantization

Conclusion

If you’re interested in making your models faster and smaller for training or inference, we hope you’ll find torchao useful and easy to integrate:

    pip install torchao

There are a lot of things we’re excited about next, ranging from going lower than 4 bit, performant kernels for high-throughput inference, expanding to more layers, scaling types or granularities, MX hardware support, and supporting more hardware backends. If any of the above sounds exciting, you can follow our progress at https://github.com/pytorch/ao

If you’re interested in working on torchao, we’ve created a contributors guide, and if you have any questions we hang out on the #torchao channel on discord.gg/gpumode

Acknowledgements

We are fortunate to stand on the shoulders of giants and collaborate with some of the best people in open source. Thank you!
- Bits and Bytes for pioneering work in low bit optimizers and QLoRA
- Answer.ai for their engineering work to get FSDP and QLoRA composing
- Mobius Labs for the lovely back-and-forths on quantization algorithms and low bit kernels
- HuggingFace transformers for their help in battle testing and integrating our work
- HuggingFace diffusers for our collaboration on extensive benchmarks and best practices
- torch.compile so we could write our algorithms in pure PyTorch
- GPU MODE for most of our early contributors
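To make the training-side pieces above concrete, here is a minimal sketch, assuming a CUDA device, that combines the float8 conversion and the 8-bit optimizer from the snippets in this post in a single toy training step; the Sequential model and random data are placeholders, and the float8 speedups quoted above additionally assume an H100-class GPU.

    import torch
    import torch.nn as nn
    from torchao.float8 import convert_to_float8_training
    from torchao.prototype.low_bit_optim import AdamW8bit

    # Toy stand-in for a transformer; float8 conversion targets the nn.Linear layers
    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()

    # One-liner from the post: swap nn.Linear matmuls to float8 compute
    convert_to_float8_training(model)
    model = torch.compile(model)  # compile is generally needed for the float8 kernels to pay off

    # Drop-in AdamW replacement with 8-bit optimizer state
    optim = AdamW8bit(model.parameters())

    x = torch.randn(16, 1024, device="cuda")
    target = torch.randn(16, 1024, device="cuda")

    loss = nn.functional.mse_loss(model(x), target)
    loss.backward()
    optim.step()
    optim.zero_grad()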

chronographia.bsky.social

Prepared food tasting of Profit Optimization and not much else—is there a particular name for this part of late-stage capitalism or…?

andregivenchy.bsky.social

Have you ever felt paralyzed by the need for perfect timing? Let's break free from that. The truth is, there's power in simply deciding to do something because you want to. No optimization needed, just do the thing.
