Using R and Aster to Mine Trip Adviser Comments

Learn Data Science
Not applicable

With the holiday season set to kick off, and many people hitting the ski slopes, I thought now would be a good time to demonstrate the power of Aster. In this exercise we will harness the power of Aster and Trip Advisor to mine ski resort reviews (Being from the Western United States, I will mine a Colorado Ski Resort). Before we get started on mining the web we need to answer a question?

What are two of Aster’s biggest strengths? Though this can and will be debated by many Aster Enthusiasts; the two that come to my find first are time to value and ease of use. Aster is changing the paradigm of Extract, Load, Transform (ETL) data to Extract, Load, Transform (ELT). Being a discovery platform Aster can quickly ingest multi-structured or unstructured data such as: text, web logs, machine data, and social media.

In this post I will demonstrate both strengths by scrapping www.tripadvisor.com for reviews about a Ski Resort in Colorado, creating a table in Aster, populating that table with the Trip Advisor reviews, and using the SQL-MR function Text_Parser to put the reviews in a relational form. In subsequent posts I will further demonstrate the power of Aster by mining the reviews with popular easy to use SQL-MR functions such as: ExtractSentiment, NGrams, and TF_IDF.

The tools that I will use are: R, R Studio, ACT, Teradata Studio, WINSCP, PUTTY, and Aster Express. Each of these tools can be downloaded to your machine for free, so please follow along and post some insights from the ski resort you chose. (My assumptions on this exercise are that you are familiar with each of the aforementioned tools. If you need any assistance please post questions in the comment and I will get them answered)

Step 1:
Scrapping Trip Advisor

  1. Open your favorite browser and go to www.tripadvisor.com
  2. Search for your favorite ski resort and click on the reviews. My url looks like thishttp://www.tripadvisor.com/Attraction_Review-g33676-d125452-Reviews-Vail_Mountain_Resort-Vail_Colora...
  3. Copy your url and open up R Studio. Load the rvest package and run the code. You will need to replace my url in the url object with yours. Running this R program we will have 5 variables in our Trip Advisor reviews data set. They are: resort, id, quote, rating, date, and review. Here is the R code I ran to mine 1100 comments about Vail Mountain Resort:

library(rvest)

#Each subsequent webpage differs, in the middle of the address, by the terms in the following 'looping' vector. Change number of terms depending on case

looping<- c(1:1180)

n<-length(looping)

tableout <- data.frame()

 

for(i in looping){

 

  #Change URL address here depending on attraction for review

  url <- paste ("http://www.tripadvisor.com/Attraction_Review-g33676-d125452-Reviews-or",i,"0-Vail_Mountain-Vail_Colorado.html#REVIEWS",sep="")

  reviews <- url %>%

    html() %>%

    html_nodes("#REVIEWS .innerBubble")

 

  id <- reviews %>%

    html_node(".quote a") %>%

    html_attr("id")

 

  quote <- reviews %>%

    html_node(".quote span") %>%

    html_text()

 

  rating <- reviews %>%

    html_node(".rating .rating_s_fill") %>%

    html_attr("alt") %>%

    gsub(" of 5 stars", "", .) %>%

    as.integer()

  #rawdates <- reviews %>% html_node(".rating .ratingDate") %>% html_text("ratingDate")

  #datebits <- lapply(rawdates, function(x) unlist(strsplit(x, " |\n"))[2:4])

  #date <- unlist(lapply(datebits, function(x) paste0(paste(x[1:2], collapse=" "), paste(",", x[3]))))

  date <- reviews %>%

    html_node(".rating .ratingDate") %>%

    html_attr("title") %>%

    strptime("%b %d, %Y") %>%

    as.POSIXct()

 

  review <- reviews %>%

    html_node(".entry .partial_entry") %>%

    html_text()

 

  #get rid of \n in reviews as this stands for 'enter' and is confusing dataframe layout

  reviewnospace <- gsub("\n", "", review)

 

  temp.tableout <- data.frame(id, quote, rating, date, reviewnospace)

 

  tableout <- rbind(tableout,temp.tableout)

}

  1. write.csv(tableout, file="VailMountaintest")

 

  1. The R Output will be saved as a .csv file to our working directory. Now we need to make sure the reviews are saved in UTC-8 character encoding (Otherwise we will get an error when we try and populate our table in Aster). This can be done in notepad or at the command line. Now we are ready for step 2.

Step 2:

Creating a table in Aster

  1. Now connect to Aster Express with Teradata Studios or ACT and create a database called SkiResort.
  2. Create a table to hold our ski resort reviews. This is how mine looks

Step 3:

Transferring SkiResortReview.csv file to our Aster Express with WINSCP and populating the table with data.

  1. Transfer the SkiResortReview.csv to a directory on your Aster Express image using WINSCP
  2. Open up a PUTTY session and connect to Aster Express
  3. Navigate to folder where you place the SkiResortReview.csv file
  4. Now use Ncluster Loader to load the data into the table.

ncluster_loader –U db_superuser –w db_superuser –d SkiResort –skip-rows 1 –el-enabled –c public.resorttext SkiResortReview.csv

  1. Now are table should be populated with the reviews.

Step 4:

Now that the reviews are loaded in the table we are ready to use the SQL-MR function Text_Parser to give structure to our data. We will create a table with the Text_Parser function and remove stop words and punctuation. The syntax I used is:

 CREATE TABLE public.resort_tparser_out
DISTRIBUTE BY HASH(id) AS
SELECT * FROM Text_Parser( ON resorttext
TEXT_COLUMN('review')
CASE_INSENSITIVE('true')
STEMMING('true')
PUNCTUATION('\[.,?\!\]')
REMOVE_STOP_WORDS('true')
ACCUMULATE('id', 'review'))
ORDER BY id;

Now that we have our reviews in a structured format, we can start doing some fun analytics to see what others are saying about our ski resort of choice. This will be covered in a future blog post so stay tuned.