1 Answer
Unfortunately no, I believe it's not currently supported, and the error message you saw is consistent with that.
I'd like to see the wording on this page (which says "Multi-model endpoints are not supported on GPU instance types.") expanded to make this clearer, since Inferentia accelerators aren't "GPUs" as such.
You could perhaps test CPU inference performance for MME serving of a large number of models, or move some of your higher-traffic models to dedicated single-model endpoints on Inferentia? A sketch of both setups follows below.
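For reference, here's roughly what both options look like with boto3. This is a minimal sketch, not a drop-in deployment: the role ARN, bucket paths, image URI, and all resource names are placeholders you'd swap for your own.

```python
import boto3

sm = boto3.client("sagemaker")
runtime = boto3.client("sagemaker-runtime")

ROLE_ARN = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder
IMAGE_URI = "<your-inference-container-image>"                       # placeholder

# --- Option 1: multi-model endpoint (MME) on a CPU instance type ---
# All model artifacts live under one S3 prefix; SageMaker loads them on demand.
sm.create_model(
    ModelName="mme-cpu-model",
    ExecutionRoleArn=ROLE_ARN,
    PrimaryContainer={
        "Image": IMAGE_URI,
        "Mode": "MultiModel",                      # enables MME behavior
        "ModelDataUrl": "s3://my-bucket/models/",  # prefix holding *.tar.gz artifacts
    },
)
sm.create_endpoint_config(
    EndpointConfigName="mme-cpu-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "mme-cpu-model",
        "InstanceType": "ml.c5.2xlarge",   # CPU instance type; MME is supported here
        "InitialInstanceCount": 1,
    }],
)
sm.create_endpoint(EndpointName="mme-cpu-endpoint",
                   EndpointConfigName="mme-cpu-config")

# Invoke a specific model on the MME by naming its artifact via TargetModel.
resp = runtime.invoke_endpoint(
    EndpointName="mme-cpu-endpoint",
    TargetModel="model-a.tar.gz",          # which artifact under the S3 prefix
    ContentType="application/json",
    Body=b'{"inputs": [1, 2, 3]}',
)

# --- Option 2: dedicated single-model endpoint on Inferentia ---
# One model per endpoint; the artifact must be compiled for Neuron and the
# container image must be Neuron-compatible.
sm.create_model(
    ModelName="hot-model-inf1",
    ExecutionRoleArn=ROLE_ARN,
    PrimaryContainer={
        "Image": IMAGE_URI,
        "ModelDataUrl": "s3://my-bucket/models/hot-model.tar.gz",
    },
)
sm.create_endpoint_config(
    EndpointConfigName="inf1-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "hot-model-inf1",
        "InstanceType": "ml.inf1.xlarge",  # Inferentia; single-model endpoints only
        "InitialInstanceCount": 1,
    }],
)
sm.create_endpoint(EndpointName="hot-model-inf1-endpoint",
                   EndpointConfigName="inf1-config")
```

The trade-off is roughly: MME on CPU keeps many low-traffic models cheap behind one endpoint, while dedicated Inferentia endpoints reserve the accelerator throughput for the models that actually need it.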
What a shame. We handle many concurrent requests per second, and Inferentia instances were the best fit we found... Is there any instance type that can handle a similar workload without costing us a fortune?