Hi,
First of all, thanks for sharing this issue in so much detail. The problem arises when the computed hash value is a negative number. Different languages implement modulo differently for negative numbers. Our implementation matches Java, JS, Go, etc., but Python gives a different result, and that is where we see the problem.
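To illustrate the difference, here is a minimal Python sketch. The helper names `floored_mod` and `truncated_mod` are made up for this example, and the input is the signed 64-bit value reported further down in this thread. Python's `%` returns a result with the sign of the divisor, while the remainder operator in Java, JS, and Go returns a result with the sign of the dividend.

```python
def floored_mod(a: int, n: int) -> int:
    """Python's native %: the result takes the sign of the divisor."""
    return a % n


def truncated_mod(a: int, n: int) -> int:
    """Remainder with the sign of the dividend, as in Java, JS, and Go."""
    r = abs(a) % abs(n)
    return -r if a < 0 else r


# Signed 64-bit interpretation of the hash discussed in this thread.
signed_hash = -8521220883536302885

print(floored_mod(signed_hash, 8192))    # 3291
print(truncated_mod(signed_hash, 8192))  # -4901
```

The two results differ in both sign and magnitude, which is why the same hash can end up in different buckets depending on the language.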
We have updated the documentation to take the absolute value before taking the modulo. This avoids the problem by eliminating negative values, so that all languages behave the same. This approach keeps producing the same measure_name values that our query engine currently produces with the method currently prescribed in our documentation.
This is how the generation and query sides of the hashing process will look:
SELECT mod(abs(from_big_endian_64(xxhash64(CAST('466edb76-a2a5-410d-8ba2-5a53c42e85e7' AS varbinary)))), 8192)
On the client side, customers need to follow the same approach:
measure_name = getMeasureName(UserId)
int getMeasureName(value) {
    hash_value = abs(hash(value))
    return hash_value % 8192
}
Thanks again for your support
Timestream Team
First of all, thanks for your reply. Regarding the problem: taking the absolute value beforehand makes no difference, see:
import xxhash

# The "problem" with Python is here: the abs() call has no effect, because
# intdigest() returns an unsigned int, which is not simply the absolute value
# of the negative result. One thing that worked was to use the big-endian byte
# representation and cast it explicitly back to a signed int.
hash = abs(xxhash.xxh64('466edb76-a2a5-410d-8ba2-5a53c42e85e7').intdigest())  # 9925523190173248731
hash = xxhash.xxh64('466edb76-a2a5-410d-8ba2-5a53c42e85e7').intdigest()  # 9925523190173248731
hash % 8192  # 3291

# digest() returns the bytes of the big-endian representation of the integer digest
b = xxhash.xxh64('466edb76-a2a5-410d-8ba2-5a53c42e85e7').digest()
# cast to a signed integer
int.from_bytes(b, "big", signed=True)  # -8521220883536302885
The same goes, by the way, for the Go implementation; your example is also wrong there. Go also returns the output natively as a uint64. You have to actively cast it back to a signed integer and then take the absolute value to be compliant with Timestream's xxhash:
func Test(t *testing.T) {
    x := xxhash.New()
    _, err := x.Write([]byte("466edb76-a2a5-410d-8ba2-5a53c42e85e7"))
    if err != nil {
        log.Fatal(err)
    }
    t.Log(x.Sum64())        // 9925523190173248731
    t.Log(int64(x.Sum64())) // -8521220883536302885
}
For Timestream:
SELECT
    cast(abs(from_big_endian_64(xxhash64(CAST('466edb76-a2a5-410d-8ba2-5a53c42e85e7' AS varbinary)))%8192) AS varchar) as hash_old, -- hash_old: 4901
    mod(abs(from_big_endian_64(xxhash64(CAST('466edb76-a2a5-410d-8ba2-5a53c42e85e7' AS varbinary)))), 8192) as hash_new, -- hash_new: 4901
    cast(abs(from_big_endian_64(xxhash64(CAST('466edb76-a2a5-410d-8ba2-5a53c42e85e7' AS varbinary)))) AS varchar) as hash_without_mod -- hash_without_mod: 8521220883536302885
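For anyone who wants to reproduce the Timestream value (4901) on the client side, here is a minimal sketch of the signed reinterpretation described above. It is only an illustration under my assumptions: `to_signed_64` and `measure_bucket` are hypothetical helper names, and the input is the unsigned digest that `intdigest()` returned for the example ID, hard-coded so the snippet runs without the xxhash package.

```python
MASK_64 = (1 << 64) - 1
NUM_BUCKETS = 8192


def to_signed_64(unsigned: int) -> int:
    """Reinterpret an unsigned 64-bit value as a two's-complement signed one."""
    unsigned &= MASK_64
    return unsigned - (1 << 64) if unsigned >= (1 << 63) else unsigned


def measure_bucket(unsigned_digest: int, buckets: int = NUM_BUCKETS) -> int:
    """abs() of the signed 64-bit interpretation, then modulo the bucket count."""
    return abs(to_signed_64(unsigned_digest)) % buckets


# Unsigned digest for '466edb76-a2a5-410d-8ba2-5a53c42e85e7' as reported above.
UNSIGNED_DIGEST = 9925523190173248731

print(to_signed_64(UNSIGNED_DIGEST))    # -8521220883536302885
print(measure_bucket(UNSIGNED_DIGEST))  # 4901, matches hash_new
print(UNSIGNED_DIGEST % NUM_BUCKETS)    # 3291, the mismatching unsigned result
```

In real code the digest would come from `xxhash.xxh64(value).intdigest()` instead of the constant.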
This leads me to the following question: is this really the intended behavior of Timestream's xxhash function? Or should Timestream's xxhash implementation be compliant with the implementations in other languages?
Hopefully this clarifies things for others, and the documentation can be updated accordingly.
Best,