AWS Timestream: bug in xxhash in "Recommendations for partitioning multi-measure records"?


Hey,

I followed the article "Recommendations for partitioning multi-measure records". First of all: the Go package used in the example doesn't exist anymore. A newer implementation of the xxhash64 algorithm can be found here: https://github.com/cespare/xxhash/v2

Nevertheless, I have the following problem: when I use the xxhash64 library to compute the hash of my measure_name, I get different results in Timestream and in my Go/Python implementation.

import xxhash
print(abs(xxhash.xxh64('466edb76-a2a5-410d-8ba2-5a53c42e85e7').intdigest()%8192))
#evaluates to: 3291

SELECT cast(abs(from_big_endian_64(xxhash64(CAST('466edb76-a2a5-410d-8ba2-5a53c42e85e7' AS varbinary)))%8192) AS varchar)
-- evaluates to: 4901

Maybe I am doing something wrong, but it doesn't happen with all my hashes: around 50% of my calculated hashes diverge from the Timestream hash. Does anyone see the problem here?

asked a year ago, 305 views
2 Answers

Hi,

First of all, thanks for sharing this issue in so much detail. The problem occurs when the computed hash value is a negative number. Languages implement modulo differently for negative operands: our implementation matches Java, JS, Go, etc., but Python produces a different result, and that is where the mismatch appears.
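
To illustrate the difference in modulo semantics, here is a small Python sketch. Python's `%` uses floored division, so the result takes the sign of the divisor, whereas Java, Go, and JavaScript truncate toward zero, so the result takes the sign of the dividend. The helper `truncated_mod` is a hypothetical name used only for this illustration.

```python
def truncated_mod(a: int, n: int) -> int:
    """Remainder with the sign of the dividend, as in Java, Go, and JavaScript."""
    r = abs(a) % abs(n)
    return -r if a < 0 else r


# Python's native % uses floored division: the result has the sign of the divisor.
floored = -7 % 3
# Truncated remainder, as computed by languages that round toward zero.
truncated = truncated_mod(-7, 3)

print(floored)         # 2
print(truncated)       # -1
print(abs(truncated))  # 1, so abs(x % n) differs between the two conventions
```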

We have updated the documentation to take the absolute value before taking the modulo. This avoids negative values entirely, so all languages behave the same way. The new approach keeps producing the same measure_name values that our query engine currently returns with the method previously prescribed in our documentation.

This is how the generation and query sides of the hashing process will look:

SELECT mod(abs(from_big_endian_64(xxhash64(CAST('466edb76-a2a5-410d-8ba2-5a53c42e85e7' AS varbinary)))), 8192)

On the client side, customers need to follow the same approach:

measure_name = getMeasureName(UserId)
int getMeasureName(value) {
    hash_value = abs(hash(value))
    return hash_value % 8192
}

Thanks again for your support

Timestream Team

AWS
answered a year ago

First of all, thanks for your reply. Regarding the problem: taking the absolute value beforehand makes no difference, see:

import xxhash
# The "problem" with Python is here: the abs() call has no effect, because intdigest()
# already returns an unsigned int, which is not simply the absolute value of the
# negative result. What worked was to take the big-endian byte representation and
# explicitly convert it back to a signed int.
hash=abs(xxhash.xxh64('466edb76-a2a5-410d-8ba2-5a53c42e85e7').intdigest())
#9925523190173248731
hash=xxhash.xxh64('466edb76-a2a5-410d-8ba2-5a53c42e85e7').intdigest()
#9925523190173248731
hash%8192
#3291

#digest() returns bytes of the big-endian representation of the integer digest
b = xxhash.xxh64('466edb76-a2a5-410d-8ba2-5a53c42e85e7').digest()
#cast to signed integer
int.from_bytes(b, "big",signed=True)
#-8521220883536302885
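
Putting these steps together, a client-side helper might look like the following minimal sketch. It takes the 8 digest bytes (for example the output of digest() above) instead of calling the hashing library itself, so it runs with the standard library only. The function name `measure_name_bucket` and the default of 8192 buckets are assumptions for illustration.

```python
def measure_name_bucket(digest: bytes, buckets: int = 8192) -> int:
    """Map an 8-byte big-endian xxhash64 digest to a bucket, reading it as signed."""
    if len(digest) != 8:
        raise ValueError("expected an 8-byte digest")
    # Interpret the bytes as a signed 64-bit integer.
    signed = int.from_bytes(digest, "big", signed=True)
    # Take the absolute value first, then the modulo.
    return abs(signed) % buckets


# Digest bytes of the example value, reconstructed from the unsigned integer shown above.
example_digest = (9925523190173248731).to_bytes(8, "big")
print(measure_name_bucket(example_digest))  # 4901
```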

The same goes, by the way, for the Go implementation; your example here is also wrong. It also returns the output natively as a uint64. You have to explicitly cast it back to a signed integer and then take the absolute value to be compliant with Timestream's xxhash64:

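package main

import (
	"log"
	"testing"

	"github.com/cespare/xxhash/v2"
)

func Test(t *testing.T) {
	x := xxhash.New()
	_, err := x.Write([]byte("466edb76-a2a5-410d-8ba2-5a53c42e85e7"))
	if err != nil {
		log.Fatal(err)
	}
	// Sum64 returns the digest as a uint64.
	t.Log(x.Sum64())
	// 9925523190173248731
	// Reinterpret the same bits as a signed int64.
	t.Log(int64(x.Sum64()))
	// -8521220883536302885
}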
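
This would also explain why only about half of the hashes diverge: the signed and unsigned readings differ only when the highest bit of the 64-bit digest is set, which should be the case for roughly half of all hash values. A minimal Python sketch follows, where `to_signed64` is a hypothetical helper and `small_digest` is a made-up value chosen only because its top bit is clear:

```python
def to_signed64(u: int) -> int:
    """Reinterpret an unsigned 64-bit integer as a two's-complement signed one."""
    return u - (1 << 64) if u >= (1 << 63) else u


big_digest = 9925523190173248731    # digest from the example, top bit set
small_digest = 1234567890123456789  # hypothetical digest, top bit clear

for u in (big_digest, small_digest):
    unsigned_bucket = u % 8192
    signed_bucket = abs(to_signed64(u)) % 8192
    # The buckets agree only when the top bit is clear.
    print(u, unsigned_bucket, signed_bucket, unsigned_bucket == signed_bucket)
```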

For Timestream:

SELECT cast(abs(from_big_endian_64(xxhash64(CAST('466edb76-a2a5-410d-8ba2-5a53c42e85e7' AS varbinary)))%8192) AS varchar) as hash_old, 
--hash_old 4901
mod(abs(from_big_endian_64(xxhash64(CAST('466edb76-a2a5-410d-8ba2-5a53c42e85e7' AS varbinary)))), 8192) as hash_new, 
--hash_new 4901
cast(abs(from_big_endian_64(xxhash64(CAST('466edb76-a2a5-410d-8ba2-5a53c42e85e7' AS varbinary)))) AS varchar) as hash_without_mod
--hash_without_mod	8521220883536302885
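
For comparison, the three Timestream columns can be reproduced on the client with plain integer arithmetic once the digest is treated as signed. This is a sketch under the assumption that Timestream's `%` and `mod` truncate toward zero, as Java's remainder does; `truncated_mod` is a hypothetical helper name.

```python
def truncated_mod(a: int, n: int) -> int:
    """Remainder with the sign of the dividend (truncation toward zero)."""
    r = abs(a) % abs(n)
    return -r if a < 0 else r


signed_digest = -8521220883536302885  # signed reading of the example digest

hash_old = abs(truncated_mod(signed_digest, 8192))  # modulo first, then abs
hash_new = truncated_mod(abs(signed_digest), 8192)  # abs first, then modulo
hash_without_mod = abs(signed_digest)
print(hash_old, hash_new, hash_without_mod)  # 4901 4901 8521220883536302885

# Python's native floored % gives a different result for the old formula.
python_native_old = abs(signed_digest % 8192)
print(python_native_old)  # 3291
```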

This leads me to the following question: is this really the intended behavior of Timestream's xxhash64 function? Or should Timestream's xxhash implementation be compliant with the implementations in other languages?

Hopefully this clarifies things for others, and the documentation can be updated accordingly.

Best,

answered a year ago
