This is likely too late for your use case. I can't comment on Falcon; however, you can find information on the Neuron implementations of Llama V2 here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/torch/transformers-neuronx/index.html#model-classes-status. These model classes are available in the transformers_neuronx pip package.
Because of the number of parameters (7B, 13B, or 70B being typical), an inf2.8xlarge (2 cores) is likely too small. The parameters need to be spread across multiple cores, making an inf2.24xlarge (12 cores) or inf2.48xlarge (24 cores) a better choice. Instance choice will also depend on whether quantization is used.
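For context, a minimal sketch of what sharding across cores looks like with transformers_neuronx, following the pattern in the Neuron docs. The checkpoint paths are placeholders, and tp_degree=24 assumes an inf2.48xlarge; set it to the core count of your instance:

```python
import torch
from transformers import AutoTokenizer, LlamaForCausalLM
from transformers_neuronx.module import save_pretrained_split
from transformers_neuronx.llama.model import LlamaForSampling

# One-time step: split the Hugging Face checkpoint into per-layer files
# so it can be loaded piecewise (placeholder paths).
cpu_model = LlamaForCausalLM.from_pretrained('./Llama-2-13b')
save_pretrained_split(cpu_model, './Llama-2-13b-split')

# tp_degree shards the weights across Neuron cores via tensor parallelism
# (24 cores on an inf2.48xlarge, 12 on an inf2.24xlarge).
neuron_model = LlamaForSampling.from_pretrained(
    './Llama-2-13b-split', batch_size=1, tp_degree=24, amp='f16'
)
neuron_model.to_neuron()  # compiles the model for the Neuron cores

# Generate a completion.
tokenizer = AutoTokenizer.from_pretrained('./Llama-2-13b')
input_ids = tokenizer('Hello, how are you?', return_tensors='pt').input_ids
with torch.inference_mode():
    generated = neuron_model.sample(input_ids, sequence_length=256)
print(tokenizer.decode(generated[0]))
```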
This tutorial https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/neuronx_distributed/llama/llama2_inference.html uses a trn1.32xlarge.