Hi,
I've gone through the documentation but didn't get a clear picture on the questions below.
1. How do SQL or SQL-MR queries work internally in the Aster database? Do they generate MapReduce jobs when queries are fired, like Hive and Pig do in Hadoop?
For example,
The customer table has 10 records distributed by hash(customer_id), and say there are 3 workers, so the data is split among them as:
worker1 - 3 records, worker2 - 3 records, worker3 - 4 records.
Since the data is distributed among three different machines (nodes), how does a simple SQL query (select * from customer) or any SQL-MR query fetch the data? Will it send the same query to all three nodes, or will it generate MapReduce jobs that fetch the rows?
2. Once the customer table is created with data distributed by hash(customer_id), can the distribution be changed later to a different column, say age instead of customer_id?
3. In Hadoop, can jobs only run one at a time, so you can't run multiple simultaneously and the others are queued? Does the same happen in the Aster database, or can we run multiple jobs on the cluster?
Please help me to understand these.
Pradi
Thanks for the reply, Raja.
I got to know some of this by going through the documentation, but it didn't answer my questions: does SQL-MR internally generate MapReduce jobs or not? And how does plain SQL fetch distributed data from different nodes?
An SQL-MapReduce job is automatically started when you issue a query that includes an SQL-MapReduce function. There is also a GUI for monitoring the jobs.
Let me try...
1. How do SQL or SQL-MR queries work internally in the Aster database? Do they generate MapReduce jobs when queries are fired, like Hive and Pig do in Hadoop?
For example,
The customer table has 10 records distributed by hash(customer_id), and say there are 3 workers, so the data is split among them as:
worker1 - 3 records, worker2 - 3 records, worker3 - 4 records.
Since the data is distributed among three different machines (nodes), how does a simple SQL query (select * from customer) or any SQL-MR query fetch the data? Will it send the same query to all three nodes, or will it generate MapReduce jobs that fetch the rows?
MapReduce jobs are only used when you call an MR function (i.e. anything with an ON clause is an MR function). When you run regular SQL such as select *, it is not like Hadoop, where a JVM is spun up to execute the job - it works in a fashion similar to regular Postgres. With regular SQL, the queen sends the commands to each individual worker; each worker compiles its answer based on the data on its node and returns its result set to the queen, which combines the results into a single dataset. When it comes to things like joins, this can become complicated because data has to be shuffled between nodes. The queen facilitates this process, as the workers are blind to what exists on other workers.
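To make the distinction concrete, here is a sketch of a regular SQL query next to an SQL-MR invocation. The sessionize call is illustrative only: the table, column names, and argument values are assumptions, not from this thread, so check your version's function reference before running anything like this.

```sql
-- Regular SQL: no ON clause, so no MR job. The queen sends this to
-- every worker, each worker scans its local slice of the table, and
-- the queen merges the partial result sets.
SELECT * FROM customer;

-- SQL-MR: the ON clause marks this as an SQL-MapReduce function call,
-- which launches an MR job across the workers. sessionize is one of
-- the shipped Aster functions; clicks/customer_id/click_time are
-- made-up names for illustration.
SELECT *
FROM sessionize(
    ON clicks
    PARTITION BY customer_id
    ORDER BY click_time
    TIMECOLUMN('click_time')
    TIMEOUT(600)
);
```

In the second query, PARTITION BY plays roughly the role of the map-side key in Hadoop: rows with the same customer_id are routed to the same function instance.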
2. Once creating table customer, data distributed by hash(customer id), can change distribution later by different column, age instead of customer id?
I'm 99 percent sure you can't change the distribution key in place - you would have to perform a CTAS (CREATE TABLE AS SELECT) with the new key.
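A sketch of what that CTAS might look like, using Aster's DISTRIBUTE BY syntax from memory (the table names and the rename step are assumptions - verify against your version's documentation before using):

```sql
-- Create a copy of the table distributed by the new key...
CREATE TABLE customer_new
DISTRIBUTE BY HASH(age)
AS SELECT * FROM customer;

-- ...then swap it in place of the original.
DROP TABLE customer;
ALTER TABLE customer_new RENAME TO customer;
```

Note the rows will be reshuffled across the workers during the CTAS, since the hash of age sends each row to a (generally) different worker than the hash of customer_id did.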
3. In hadoop jobs can be run one at time? simultaneously you cant run multiple others will be queued? the same thing happens in aster database or can we run multiple jobs on cluster?
You can run multiple jobs on Aster and they will execute in parallel. The caveat is that concurrent jobs contend for cluster resources, so performance will be impacted.
HTH... let me know if you need clarification
Thanks a lot, Ewan.
So in the case of a simple SQL select * statement, the queen sends commands to each worker, and each worker answers based on the data it holds.
If possible, can you explain in a bit more detail how SQL-MR queries work - how many mappers are generated, etc.?
How does the queen node know whether a worker node is alive or dead? Do the worker nodes send heartbeats to the queen regularly, like in Hadoop?
Once again, thanks a lot for the reply.
Pradi