Bringing together the Apache Cassandra experts from the community and DataStax.

Want to learn? Have a question? Want to share your expertise? You are in the right place!

Not sure where to begin? Getting Started

 

question

andreasrimmelspacher_189038 avatar image
andreasrimmelspacher_189038 asked andreasrimmelspacher_189038 commented

Need help with user-defined functions and aggregates

Hello everybody,

I am new to Cassandra and I am trying to extract sketches (e.g. bloom filter) from some given data. While doing that, I came this far - my questions break down to the below:

CREATE TYPE bloomfilter_udt (
    n_as_sample_size int,
    m_as_number_of_buckets int,
    p_as_next_prime_above_m bigint,
    hash_for_string_coefficient_a list <bigint>,
    hash_for_number_coefficients_a list <bigint>,
    hash_for_number_coefficients_b list <bigint>,
    bloom_filter_as_map map<int, int>
);

CREATE OR REPLACE FUNCTION bloomfilter_udf (
    state bloomfilter_udt,
    value text,
    sample_size int
)
    CALLED ON NULL INPUT
    RETURNS bloomfilter_udt
    LANGUAGE java AS
        $$
        //fill state = bloomfilter_udt with some data
        return state;
        $$
    ;

CREATE OR REPLACE AGGREGATE bloomfilter_uda (
    text,
    int
)
    SFUNC bloomfilter_udf
    STYPE bloomfilter_udt
    INITCOND {};

1) When I call the aggregate, I would like to pass sample_size with a sub-query, e.g.

==> "SELECT bloomfilter_uda(name, (SELECT count(*) FROM test_table)) FROM test_table;" <==

Is that possible with Cassandra?

2) When I try to register the bloomfilter_uda, I get the following error:

==> InvalidRequest: Error from server: code=2200 [Invalid query] message="Invalid set literal for (dummy) of type bloomfilter_udt" <==

Can I just pass Cassandra data types as a state (map, list, set)?

3) If I assume, all of the above is my bad, how can I access the props of the state? Like

==> state.n_as_sample_size <==

Is this somehow possible?

I'd appreciate some help/hints.

Thanks

Andreas

[EDIT] For my minimum example, I follow these steps:

CREATE KEYSPACE my_keyspace
WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};

CREATE TABLE my_table (
id timeuuid,
name text,
PRIMARY KEY(id, name)
);

INSERT INTO my_table (id, name) VALUES( now(), 'This is some random string');
INSERT INTO my_table (id, name) VALUES( now(), 'This is another random string');
INSERT INTO my_table (id, name) VALUES( now(), 'stringstringstring');
INSERT INTO my_table (id, name) VALUES( now(), 'test test test test');
INSERT INTO my_table (id, name) VALUES( now(), 'some final string');

CREATE OR REPLACE FUNCTION my_user_defined_function_triggered_on_each_row (state map<text,int>, value text, sample_size int)
CALLED ON NULL INPUT
RETURNS map<text,int>
LANGUAGE java
AS 
$$
//if sample size is not in result map, then put it
if(!state.containsKey("sample_size")) {state.put("sample_size", sample_size);};
//put the hashvalue (integer represented as a string) of each row as a key in result map and set the value to 1
String hashvalue = "" + value.hashCode();
if(!state.containsKey(hashvalue)) {state.put(hashvalue, 1);};
return state;
$$;

CREATE OR REPLACE AGGREGATE my_user_defined_aggregate_to_get_sketch (text, int)
SFUNC my_user_defined_function_triggered_on_each_row
STYPE map<text, int>
INITCOND {};

SELECT my_user_defined_aggregate_to_get_sketch(name, 5)
FROM my_table;

It work's until here and gives me the following output:

Now instead of storing everything in one map<text,int>, I would like to use a user defined data type, where I can put some more attributes with different data types. I follow there steps:

CREATE TYPE my_sketch(
sample_size int,
values map<text,int>
);

CREATE OR REPLACE FUNCTION my_user_defined_function_triggered_on_each_row_new(state my_sketch, value text, sample_size int)
CALLED ON NULL INPUT
RETURNS my_sketch
LANGUAGE java
AS 
$$
//For a minimal example let's do nothing and just return the state.
//I am guessing though, accessing my_sketch will be the next problem. I already tried something like state.my_size...
// I tried some versions like:
return state;
$$;

I test the new function calling on my_table. I expect the first parameter to be the return of each row. I call.

SELECT my_user_defined_function_triggered_on_each_row_new({"sample_size": 5, "values": {'value1': 5 }},  name, 5 )
FROM my_table ;

Gives me the expected output:

When i now try to wrap this up to a single return row with an aggregate, I implement my new aggregate like:

CREATE OR REPLACE AGGREGATE my_user_defined_aggregate_to_get_sketch_new (text, int)
SFUNC my_user_defined_function_triggered_on_each_row_new
STYPE my_sketch
INITCOND {}; 

But this gives me the error:

InvalidRequest: Error from server: code=2200 [Invalid query] message="Invalid set literal for (dummy) of type my_sketch"

I am using Cassandra 3.11.6 though.

user-defined functionuser-defined aggregate
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

alexandre.dutra avatar image
alexandre.dutra answered andreasrimmelspacher_189038 commented
1) When I call the aggregate, I would like to pass sample_size with a sub-query [..] Is that possible with Cassandra?

No.


2) When I try to register the bloomfilter_uda, I get the following error [...] Can I just pass Cassandra data types as a state (map, list, set)?

Yes, but you need to input a valid literal for your UDT by initializing at least one field:

CREATE OR REPLACE AGGREGATE bloomfilter_uda ( text, int )
    SFUNC bloomfilter_udf
    STYPE bloomfilter_udt
    INITCOND { n_as_sample_size : 0 };

This is a subtlety of the CQL parser; if you input just {} the parser would be fooled into thinking that this is a set literal (an empty set).


3) If I assume, all of the above is my bad, how can I access the props of the state?

Inside functions and aggregates, if you need to access or modify a tuple or a user-defined type, you actually need to use the DataStax Java driver 3.x API for User-defined types:

CREATE OR REPLACE FUNCTION bloomfilter_udf (
    state bloomfilter_udt,
    value text,
    sample_size int
)
    CALLED ON NULL INPUT
    RETURNS bloomfilter_udt
    LANGUAGE java AS
        $$
        state.setInt("n_as_sample_size", 42);
        state.setInt("m_as_number_of_buckets" 42);
        state.setLong("p_as_next_prime_above_m", 4242L);
        List<Long> hashForStringCoefficients = ...;
        state.setList("hash_for_string_coefficient_a", 
            hashForStringCoefficients, Long.class);
        // etc.
        return state;
        $$   

The variable state inside the Java block is of type UDTValue.

3 comments Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Thanks, that's I was looking for. Is there any reason in this forum, why previous answers and updates on the question from my side are not shown?

0 Likes 0 ·

[Follow up question re-posted in #3939]

0 Likes 0 ·
Erick Ramirez avatar image Erick Ramirez ♦♦ andreasrimmelspacher_189038 ·

@andreasrimmelspacher_189038 Since this is not an "answer", I've reposted it as a brand new question. Cheers!

0 Likes 0 ·
Erick Ramirez avatar image
Erick Ramirez answered andreasrimmelspacher_189038 commented

@andreasrimmelspacher_189038 The schema you provided is too obfuscated for me to understand the problem you're facing. Let me respond to them in turn:

(1) You can't nest a SELECT inside another SELECT statement. Cassandra just wasn't designed for that kind of access pattern.

(2) Can you provide a "minimum" viable schema + the steps to replicate? I'd like to understand what you're doing and replicate it in one of my clusters. Also, please include the version of Cassandra (or DSE) you are using so I can recreate it exactly as you see it.

(3) I don't know enough until (2) to answer this. Cheers!

1 comment Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.