In Aster on Hadoop, is any data movement needed? I have a set of questions:
- Do we need to populate the Aster tables for analysis, or can I access the Hadoop files directly in place, with no data movement required at all?
- Does Aster support all Hadoop formats through Hive?
- When we create views in Aster, will data be moved on demand during analysis?
- Do I need Presto installed separately on Hadoop, or is it optional?
- You don't have to populate the Aster tables unless you use them frequently (to avoid the latency of HDFS reads). You can create a view in Aster that refers to a Hive table in HCatalog.
- My understanding is that if it's in HCatalog, you can use it in Aster. For access outside HCatalog, you need to write a special connector to pull data from the Parquet or Avro format.
- Data is moved on demand.
- If you use Presto, you can use it as a gateway to access data in HDFS in any format without the need for a special connector. You would, however, need a field tool called ANYDATABASE2ASTER that uses JDBC to talk to Presto.
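The view-over-HCatalog approach mentioned above can be sketched roughly as follows. This is a hypothetical, untested example: the server host, database name, and table name are placeholders, and the exact argument names may vary by Aster version.

```sql
-- Hypothetical sketch: expose a Hive table (registered in HCatalog)
-- to Aster through a view, so no data is copied until query time.
CREATE VIEW hdp_documents AS
SELECT * FROM load_from_hcatalog(
    ON mr_driver
    SERVER('hdp-hcatalog-host')   -- placeholder HCatalog server host
    USERNAME('hive')
    DBNAME('default')
    TABLENAME('documents')        -- placeholder Hive table
);

-- Queries against the view pull the data from HDFS on demand:
SELECT COUNT(*) FROM hdp_documents;
```

Because the view only wraps the load_from_hcatalog() call, nothing is materialized in Aster; each query reads from HDFS at execution time, which is exactly the on-demand movement described above.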
Hi, regarding the 1st and 3rd points above: I am wondering whether it is wise to query a huge Hadoop set of documents (PDF, Word, etc.) for text mining using functions like document_from_table (https://aster-community.teradata.com/community/aster-field-strong/blog/2016/07/20/underground-docume... ), since the data has to be moved from HDP to Aster for processing. Second question: can all the SQL-MR functions run over Aster-HDP bridges? What would be the safest and fastest way to bridge Aster to HDP so that such functions run efficiently? Thanks
AFS doesn't exist on Aster on Hadoop. I think document_from_afs() is designed for AFS and may not work with HDFS. Can you post a request at the above link to get the attention of the author, omri.shiv (copied in the link)? Also, the best way to run SQL/MR functions is through views you build that contain a load_from_hcatalog() query.
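As a rough, untested sketch of that pattern, a SQL-MR analytic function can consume a load_from_hcatalog() query directly. The example below uses nGram for illustration, and assumes a Hive table named documents with its text in a column named content; all host, table, and column names are placeholders.

```sql
-- Hypothetical sketch: run a SQL-MR function over Hive data in place,
-- without first copying it into an Aster table.
SELECT * FROM nGram(
    ON (
        SELECT * FROM load_from_hcatalog(
            ON mr_driver
            SERVER('hdp-hcatalog-host')   -- placeholder HCatalog server host
            USERNAME('hive')
            DBNAME('default')
            TABLENAME('documents')        -- placeholder Hive table
        )
    )
    TEXT_COLUMN('content')   -- placeholder column holding document text
    GRAMS(2)                 -- emit bigrams
);
```

Wrapping the inner load_from_hcatalog() query in a view, as suggested above, keeps the SQL-MR invocations shorter and hides the connection details from analysts.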
Thanks, I got his reply, and in the meanwhile I also managed to install the document_from_table function and play with it a little. It is nice for small amounts of data, but when it comes to millions of PDF files it would be unproductive to load them into Aster just to have them available afterwards for querying with a SQL-MR function. The most efficient way would be to have them loaded (I suppose without parsing them) into HDFS and query them directly from inside Aster. There is a function called documentParser that presumably does this, and right now I am struggling to build the environment to test the scenario described below (see the inline image), namely:
1) Find a suitable CDH or Hortonworks HDP VM (AE 6.1 works only up to HDP 2.1 on Hortonworks and CDH 5.0 on Cloudera, but the Cloudera 5.0.0 VM is no longer available).
2) Set up the HDP host in the AMC and verify it works.
3) Find a way to load the PDF files into HDFS, according to the instructions in the pptx on SQL MR: documentParser, which are a little scarce; see below.
4) Use the documentParser function as in the usage below, hoping the mr_driver and directory settings work.
I am aware of the load_from_hcatalog function, but the documentParser function works, as described in the pptx, in three ways that all bypass the HCatalog, so I have to stick to them.
Sorry for the missing inline images; they are from the pptx located at the above link and cover the put function on HDP and the function usage scenarios, none of which use load_from_hcatalog. Thanks again.