Learn Data Science

Looking Glass

Explore and Discover

The latest articles, videos and blog posts speak to those interested in analytics-based decision making. Browse all content in the Teradata Data Science Community to gain valuable insights.

29 Views
0 Comments

Robot-Keychain.png

 

AI has started a race in the automotive industry, and this race will change the whole industry. Souma Das, Teradata India, talks about the importance of analytics and AI in this race.

 

To learn more, click HERE

26 Views
0 Comments

hurdle 2.jpg

 

The most difficult aspect of realizing success with analytics has been, and continues to be, the organizational challenge of getting stakeholders to embrace an innovative mindset.

 

To learn more, click HERE

36 Views
0 Comments

AirNebula-KarthikGuruswamy-Web-650.png

 

About the Insights

As of Jan 2012 there were roughly 60K direct flights between 3000+ airports with 500+ airlines recorded on the open source website OPENFLIGHTS.ORG.

 

Seen through the lens of advanced analytics, the different airline carriers of the world appear together like a beautiful nebula (an interstellar cloud formation). Similar color groupings of nodes with thick edges provide insight into airlines that have common routes, exposing competition and also potential synergies in different regions.

 

This Sigma graph-based data visualization shows airline carrier similarity, measured by the cities the carriers serve in common. The nodes (circles) in the graph are the airline carriers, and the edge thickness and proximity of the nodes indicate the degree of similarity: the thicker the edge and the closer the nodes, the more cities the carriers serve in common. The visual has multiple clusters of airlines, which map intuitively to the geographical regions they serve. Some of the key insights in the visual are the similarities, or overlap, between China Southern and China Eastern Airlines, Emirates and Qatar, British Airways and Lufthansa, and American and Delta, each indicating a competitive situation. Ryanair seems to have carved out a niche by serving cities with potential synergies with Lufthansa and British Airways. Air France has more similarity with US carriers like United than with other European carriers like Alitalia and Lufthansa, which can probably be explained by co-branding. In essence, the visualization is a multi-dimensional Venn diagram that exposes the complex relationships rather succinctly.

 

Overall, the graph allows one to study similarities with competitors, whether to identify a potential partnership or to grow market share and coverage. Similar insights can be developed for any problem that involves multiple players in an ecosystem with common variables they touch.

 

About the Analytics

The visualization was created in Aster App Center. The analytics fall into the category of associative mining, where we look at co-occurrences of items within a context. The associative mining algorithm used is Collaborative Filtering, applied to the airline and city data, which was treated like retail basket data: the basket is the city and the airline carriers are the items. The commonality of any two airlines is determined by a score that weighs the cities each airline flies into on its own against the cities they have in common. The pair-wise affinity score is then treated as an edge weight, with the two airlines as nodes, and fed into a visualizer to create beautiful clusters using the force-atlas algorithm with modularity coloring.
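To make the pair-wise affinity score concrete, here is a minimal Python sketch. It is not the Aster Collaborative Filtering function: the exact score used in the visualization is not given, so the Jaccard index (shared cities divided by all cities either airline serves) is an assumed stand-in, and the airline names and city sets are hypothetical toy data.

```python
from itertools import combinations

# Hypothetical toy data: airline -> set of cities served (not OpenFlights data).
routes = {
    "AirA": {"SFO", "JFK", "LHR", "ORD"},
    "AirB": {"JFK", "LHR", "FRA"},
    "AirC": {"PEK", "CAN", "SHA"},
}


def affinity(cities_a, cities_b):
    """Jaccard index (assumed score): shared cities / cities served by either."""
    union = cities_a | cities_b
    return len(cities_a & cities_b) / len(union) if union else 0.0


def edge_list(routes):
    """Return (airline_1, airline_2, weight) for every pair with some overlap."""
    edges = []
    for a, b in combinations(sorted(routes), 2):
        weight = affinity(routes[a], routes[b])
        if weight > 0:
            edges.append((a, b, weight))
    return edges


edges = edge_list(routes)
print(edges)
```

Each tuple is one weighted edge between two airline nodes, which is the shape of input a graph visualizer with a force-directed layout would expect.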

 

About the Analyst

Karthik Guruswamy is based in the San Francisco Bay Area and lives with his wife Vidhya and two daughters. Karthik works as a Principal Consultant with Big Data & Advanced Analytics, Americas for Teradata.

 

Karthik's passion for data science and analytics spans 25+ years, starting out as an RDBMS developer at Informix. Karthik has worked with several startups in Silicon Valley as a Data and Server Architect. Karthik joined Teradata through the acquisition of Aster Data, where he was a Senior Consultant working with almost all of Aster's marquee customers. While in Teradata & Aster Professional Services, Karthik was engaged with social media customers such as LinkedIn and Edmodo. Most recently Karthik has been advising Fortune customers such as Dell, Big Automotive, Overstock and Wells Fargo Bank.

 

Karthik specializes in MPP / Map Reduce / Graph and works to unravel hidden patterns in customer data and create powerful visualizations that bring the insight to business users. Karthik uses a wide variety of algorithms around Time Series & Pattern Detection, Data Mining, Machine Learning, Neural Nets, Text Disambiguation and Statistical Analysis in his projects. Karthik is also a data science blogger on LinkedIn. He has written a number of blog posts on data science concepts, primarily targeted at a business audience.

129 Views
0 Comments

 

CellStormRider-SundaraRaman-Web-650.png

 

About the Insights

This visualization captures the journey of Sundara Raman as he rides the commuter train corridors in Sydney, Australia. Armed with his mobile phone and special software, Sundara's train ride through Sydney can be traced via his mobile phone cell tower connections, represented by the colored dots (or nodes) on the chart, as his train hurtles through the city.

 

It's part of a new form of analytics that uses mobile phone data to study the traffic patterns caused by movement and mass congregations of people. Its primary purpose is to optimize the cell tower network to avoid performance issues and improve customer experience. However, it also supports emerging data monetization initiatives where detailed traffic flows can be used for urban planning, retail store location analysis and marketing offers.

 

In this analysis Sundara is looking for cell signal 'storms' that can overwhelm towers and impact performance. As crowded commuter trains run down the lines and pool at stations, they send out hundreds to thousands of signals that move rapidly across towers and can overwhelm them. This visualization is part of a series of charts that overlay tower performance data, commuter traffic volumes and tower hand-offs to pinpoint cell signal 'storm surges', enabling detailed recommendations to optimize the network.

 

The chart also highlights specific customer experience issues caused by the transfer between 4G cell towers (darker shade dots) and lower speed 3G cell towers (lighter shade dots) and 'ping pong' impacts from the to-ing and fro-ing of signals between towers, represented by close clusters of connected towers near Lindfield, Killara, Waitara, North Sydney and Chatswood stations.

 

About the Analytics

This visualization was created using Teradata Aster and Aster Lens. Smartphone signaling data was collected from simultaneous use of 3G and 4G mobile phones, using special-purpose software, while travelling on crowded public transport along the North Shore and Strathfield lines in Sydney, Australia. Geospatial analytics were applied to the geolocation data for train stations and cell towers to isolate the cell towers located within a 1 km radius of the train stations.
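The 1 km radius filter can be sketched in a few lines of Python. This is an illustration only, not the Aster geospatial function used in the analysis: it assumes great-circle (haversine) distance on a spherical Earth, and the station and tower coordinates and IDs are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def towers_near(station, towers, radius_km=1.0):
    """IDs of towers within radius_km of the station."""
    return [
        t["id"]
        for t in towers
        if haversine_km(station["lat"], station["lon"], t["lat"], t["lon"]) <= radius_km
    ]


# Hypothetical coordinates, for illustration only.
station = {"name": "Station X", "lat": -33.800, "lon": 151.180}
towers = [
    {"id": "T1", "lat": -33.803, "lon": 151.180},  # roughly 0.3 km away
    {"id": "T2", "lat": -33.820, "lon": 151.180},  # roughly 2.2 km away
]
nearby = towers_near(station, towers)
print(nearby)
```

At the scale of a single city the spherical approximation is accurate to well within the 1 km threshold, which is why a simple haversine is usually enough for this kind of filter.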

 

This approach was used to measure the impact of signal propagation among cell towers within a short defined range from train stations. Color codes were added to the sigma chart on the GEXF file using Visual Basic scripts to uniformly distinguish between 4G and 3G cell tower areas. Each color signifies the network coverage area to which a group of cell towers belong. Statistics published by Sydney City Rail, covering peak time train traffic loads for each train station, were used to correlate cell site performance.

 

About the Analyst

Sundara is a Senior Telecom Industry Consultant by day and an aspiring Data Scientist by night. He has a Master's degree in Business and Administration from Massey University, New Zealand. He lives in Sydney, Australia with his wife and two children.

 

Sundara is an inventor and a joint holder of an Australian patent with his wife on Computer Assisted Psychological Assessment and Treatment that applies the principles of Cognitive Behavioural Therapy (CBT). So now, if during your next daily commute you happen to catch a glimpse of Sundara, juggling his multiple mobile phones, then you will know he is not crazy. He is just using analytics to gain insights that can help his Telecom clients improve their mobile network customer experience.

134 Views
0 Comments

The Teradata Analytics Platform is coming, and with it comes the ability for Teradata users to access Aster's machine learning and graph functions! For Aster users, it's time to dust off your Teradata SQL. In this article we list some tips and tricks for Aster users who are migrating to the Teradata Analytics Platform.

Read more...

129 Views
0 Comments

 

shutterstock_755847925.jpg

 

 

 

Let's start with the basics.

 

Models are only as good as your data ...

Period.

Models are very sensitive to data, the staple food of any information-driven organization. After all, the learning and training happen by looking at data. If you have issues with the quality of your data, or don't have enough of it to cover all possible dimensions, your models will reflect that. Models are more sensitive to the data than to the algorithms themselves. We will start with that:

 

1. Data & Feature Quality will affect both Model accuracy and consistency

The quality of the data we feed to the models on a regular basis, from an EDW or Data Lake through ETL processes etc., determines how accurate and how consistent the models are, no matter which algorithms one chooses to train and predict. For models to be useful, it's probably more important to have slightly lower accuracy with consistency than the other way around. Having 95% prediction accuracy one day and dropping to 30% the next is probably not desirable in an operational environment.

 

What can affect data quality?

Gaps in data (like missing data for a few days here and there) are the most common issue. Data can arrive late from upstream servers or devices, or software issues or wrong code can send you bad data without any ETL process catching it. A new transformation process, event code or parameter change can create bad data that is hard to detect, especially with numeric data. Even a simple misunderstanding of taxonomy at the collection point can leave you with data that means something slightly different than you think it does.
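Gaps like the ones described above are cheap to check for before training. The sketch below assumes one load per calendar day; the function name and the dates are hypothetical, and a real pipeline would read the dates from the load log or the table itself.

```python
from datetime import date, timedelta


def missing_days(observed_dates):
    """Calendar days absent between the first and last observed date."""
    seen = set(observed_dates)
    if not seen:
        return []
    day, last = min(seen), max(seen)
    gaps = []
    while day <= last:
        if day not in seen:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


# Hypothetical load dates: March 3 and 4 never arrived.
loaded = [date(2018, 3, 1), date(2018, 3, 2), date(2018, 3, 5)]
gaps = missing_days(loaded)
print(gaps)
```

A check like this only catches missing days. Late or silently wrong values need separate checks, such as range or distribution comparisons against earlier loads.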

 

What about Feature Quality?

The whole idea of using data for ML models is the expectation that the data will have features that can be extracted to model the real world they came from. You may have a lot of big data, but if the 'feature quality' is poor, the models will reflect that. For example, if you are building a multi-variate model using multiple streams of data, it's important that each stream carries unique information, i.e. that there is no feature collinearity. It is also critical that external factors that affect those features are somehow factored into the equation.
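One simple way to screen for collinear features is to compute pairwise Pearson correlations and flag pairs above a threshold. The sketch below is an assumption-laden illustration: the 0.95 threshold is arbitrary, the feature names and values are hypothetical, and Pearson correlation only detects linear relationships between two features at a time.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # constant feature: correlation is undefined, treated as 0 here
    return cov / (sd_x * sd_y)


def collinear_pairs(features, threshold=0.95):
    """Feature pairs whose absolute correlation is at or above the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(features), 2)
        if abs(pearson(features[a], features[b])) >= threshold
    ]


# Hypothetical features: temp_f is just temp_c in different units.
features = {
    "temp_c": [10.0, 15.0, 20.0, 25.0, 30.0],
    "temp_f": [50.0, 59.0, 68.0, 77.0, 86.0],
    "visits": [3.0, 9.0, 1.0, 7.0, 4.0],
}
flagged = collinear_pairs(features)
print(flagged)
```

A flagged pair is a candidate for dropping or combining one of the two features; it carries the same information twice.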

 

2. Simple sophistication vs. Complex approaches

 

You don't always need to use the most sophisticated algorithms and the most complex data and features you collected.

If you have a lot of data, as Google, Facebook or Amazon do, it makes sense to do Deep Learning, etc., where the data comes with "low bias and high variance." Of course, for accuracy you trade off computation cost, explainability, etc.

 

If you have 100K rows of data, it probably has low variance and high bias. You can mostly use algorithms like Logistic Regression, Random Forest, Support Vector Machines or Naive Bayes and get pretty good results. The results are also somewhat easy to explain, and the models run a lot faster. Using Deep Learning will require you to guard against overfitting and is just a lot of trouble, and in the end you will probably get accuracy close to that of the mainstream algorithms.

 

How about 10M-100M rows? Well, now we have options, and you can decide based on some factors. Logistic Regression, Random Forest, SVM, and Naive Bayes would work just fine. Deep Learning would probably do well too and would require some basic hyperparameter tuning, regularization options, etc., to get it right ...

 

3. Are you Overfitting? - check

 

This is a common theme that data scientists know. Don't train on your training set and then predict on that same set to get your accuracy numbers. Your first model will probably always overfit. It takes some iterations, cross-validation, regularization, etc., to loosen the model so that it generalizes well. Switching algorithms may also help.
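The gap between training-set accuracy and held-out accuracy is easy to demonstrate. The sketch below uses a deliberately overfit-prone model (1-nearest-neighbour, which memorizes its training rows) and a simple k-fold split; the data is a hypothetical toy set with two noisy labels, and the helper names are illustrative rather than from any library.

```python
def predict_1nn(train, x):
    """Label of the nearest training point (1-nearest-neighbour)."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def accuracy(train, test):
    """Fraction of test rows whose label the 1-NN model predicts correctly."""
    hits = sum(predict_1nn(train, x) == y for x, y in test)
    return hits / len(test)


def k_fold_accuracy(data, k=4):
    """Mean held-out accuracy over k folds (fold i = every k-th row from i)."""
    scores = []
    for i in range(k):
        test = data[i::k]
        train = [row for j, row in enumerate(data) if j % k != i]
        scores.append(accuracy(train, test))
    return sum(scores) / k


# Hypothetical toy data: (feature, label); the labels at 3.0 and 6.0 are noise.
data = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 0),
        (5.0, 1), (6.0, 0), (7.0, 1), (8.0, 1)]

train_acc = accuracy(data, data)  # scored on the rows the model memorized
cv_acc = k_fold_accuracy(data)    # scored on rows the model has not seen
print(train_acc, cv_acc)
```

The training-set score is perfect only because every row is its own nearest neighbour. The cross-validated score is the number to report, and a large gap between the two is the overfitting signal.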

 

4. Are the Classes balanced? - check

 

If you are doing classification and your data is skewed heavily towards a few classes, you have a class imbalance. Your model accuracy will be so-so when predicting, leading to too many false positives and negatives. Options to try include data augmentation, resampling, or different algorithms that have different regularization methods.
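As one illustration of resampling, the sketch below randomly oversamples every minority class until it matches the largest class. It is a minimal example under stated assumptions: the rows are hypothetical, duplicates are drawn with replacement, and the function name is made up for this example.

```python
import random
from collections import Counter


def oversample(rows, label_of, seed=0):
    """Duplicate minority-class rows at random until all classes are equal in size."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


# Hypothetical skewed data: nine rows of class 0, one row of class 1.
rows = [("a", 0)] * 9 + [("b", 1)]
balanced = oversample(rows, label_of=lambda row: row[1])
counts = Counter(label for _, label in balanced)
print(counts)
```

Resampling should be applied to the training split only. Oversampling before splitting leaks copies of the same rows into the test set and inflates the accuracy numbers.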

 

5. Are you Cursed with Dimensionality? - check

 

Modeling on a few columns or features is one thing. If you have 100+ features or columns or attributes, you probably will run into a phenomenon called 'the Curse of Dimensionality'. The data can get sparse, and model accuracy can vary quite a bit under different circumstances. Techniques such as 'Dimension Reduction' may need to be done to avoid the issue.

 

6. How is the "inference" performance of the model? Does it match operational requirements?

 

It's great to build models with low false positives and negatives every single time we try to predict, but if scoring turns into an iterative process in real time, that's not good. If you are scoring 1000s of customers each second on your 'inference farm', it's wise to pick an algorithm that returns results quickly for a given input. The speed of decision making is what matters in the end for most scenarios.

 

7. Correlation or Curve Fitting does not imply Causation

 

Machine learning tries to capture the "correlation reality" obtained from data to predict a situation based on the model. It cannot know which variables cause which. It can, however, detect that the presence of specific variables always coincides with the absence or presence of another variable. So if the underlying conditions that drive these observed variables change, the model's assumptions can go wrong, with a drop in accuracy.

 

Read my previous blog post on 'Correlation, Causation and Implication Rules' for more insights into this topic.

 

8. One size fits all vs. multiple models

 

Should you build one model for all data, or should you build a model per customer or product? This question is loaded and has tradeoffs in accuracy, performance, data imbalance (some customers have more data than others), etc. This needs to be a well-informed decision!

 

What else?

 

The above are common gotchas, and they are just the tip of the iceberg. Then you have the whole DevOps integration pipeline, which requires a lot of production-level sophistication like continuous integration/development, etc., to deploy and manage model lifecycles ... (to be covered in a future blog post!)

 

Thanks for reading!

160 Views
0 Comments

 

  

world.jpg

 

Various quotations, attributed to a wide range of thinkers, writers and thought leaders, have made the point that the best way to predict the future is to define it for yourself.

 

To learn more, click HERE

179 Views
0 Comments

 

 

universe 2.jpg

 

 

Join us at Teradata Analytics Universe as UC Berkeley Professor and bestselling author Morten Hansen & co-founder of Sudden Compass and global tech ethnographer Tricia Wang kick off our keynote sessions. Want to know more?

 

Register HERE

 

177 Views
0 Comments

CrownOfThorns-KailashPurang-Web-650.png

 

About the Insights

This is the second visualization of Kailash Purang's Two Part CIA Report series. It demonstrates the ability of advanced analytics to rapidly distill extremely complex documents into easily consumed visualizations, free of human bias. It should be viewed after the reader has seen Part One, 'Terror Report'.

 

In this second visualization, Kailash analyzes the same data as the 'Terror Report' word cloud, with more sophisticated text and graph analysis, to reveal significantly more of the storyline and meaning of the report itself. Each dot (or node) is a significant word appearing in the report; larger nodes are words that occur more often. The lines (or edges) link words to the other words they appeared with, and the darker, thicker edges link words that occur together with higher frequency. Now we can see the main story lines and subjects in the word clusters and their linkage to each other. If you start at the top left-hand corner, you see the name Abu Zubaydah amongst words like waterboarding, rectal, mother, brutal and harm. Edges link to enhanced interrogation techniques, CIA and detainees, and to smaller word groups like oversight and actively avoided. It shows the treatment the still-captive Abu Zubaydah has received, and we can trace the edges through to see the surrounding issues of how and why it was allowed to happen.

 

By studying the visualization the reader can now quickly absorb the key details and interplay between all the subjects covered by this very complex report, free of human bias and filtering.

 

About the Analytics

This visualization uses Teradata Aster's text mining capabilities on the 525-page excerpt, publicly released on December 9, 2014, of the Committee Study of the Central Intelligence Agency's Detention and Interrogation Program compiled by the U.S. Senate Select Committee on Intelligence.

 

Term frequency–inverse document frequency was used to isolate the critical words and word groups within the report. The algorithm compares how often a word occurs in a piece of text, relative to how often it occurs in the whole body of text. A word that is important to a specific piece of text will occur relatively frequently in that piece as compared to the whole body.
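For readers who want to see the calculation, here is a minimal Python sketch of a common TF-IDF formulation: term frequency within a piece of text multiplied by the log of (number of pieces / number of pieces containing the term). It is not the Aster function used in the analysis, the exact weighting variant applied there is not stated, and the three short "documents" are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical pieces of text, already lower-cased and tokenized.
docs = [
    "the report describes the program".split(),
    "the committee reviewed the report".split(),
    "the program lacked oversight".split(),
]
scores = tf_idf(docs)
print(scores[2])
```

A word such as "the" that appears in every piece scores zero, while a word confined to one piece, such as "oversight", scores highest there. That is how the critical words are separated from the filler.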

 

The detailed connections data linking the words was acquired by text mining using native Aster text mining functions such as nGram. The output was used to create an underlying node-edge table. This was visualized as a graph using Aster Lens emphasizing the connections. This allows clear clusters of words to occur representing individual ideas.
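The step from n-grams to a node-edge table can be sketched as follows. This is a plain-Python illustration, not the Aster nGram function: it assumes bigrams (adjacent word pairs), whitespace tokenization, and a hypothetical set of significant words to keep; the sentences are made up for the example.

```python
from collections import Counter


def bigram_edges(sentences, keep):
    """Count adjacent word pairs in which both words are in the keep set."""
    edges = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for left, right in zip(tokens, tokens[1:]):
            if left in keep and right in keep and left != right:
                # Sort the pair so (a, b) and (b, a) count as the same edge.
                edges[tuple(sorted((left, right)))] += 1
    return edges


# Hypothetical input sentences and significant-word list.
sentences = [
    "enhanced interrogation techniques were used",
    "the enhanced interrogation program lacked oversight",
]
keep = {"enhanced", "interrogation", "techniques", "program", "oversight"}

edges = bigram_edges(sentences, keep)
node_edge_table = sorted((a, b, weight) for (a, b), weight in edges.items())
print(node_edge_table)
```

Each row is one edge: two word nodes and a weight equal to how often they appeared side by side. The weight is what a graph visualizer would draw as edge thickness.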

 

About the Analyst

Kailash is the lead Data Scientist for Teradata in Singapore. He also works across South East Asia and most notably in Indonesia, supporting the leading banking and communication industry clients Teradata serves in the region.

 

Kailash holds a Bachelor of Economics and Statistics as well as a Master's in Economics from the National University of Singapore. He also holds a Bachelor of Management from the University of London. He has worked in the field of analytics for 15 years across various industries.

 

Despite having ‘sold his soul’ to join the commercial world, he still believes that the aim of all this learning and technology is to make people’s lives easier and more fun. To help introduce analytics in a fun, ‘tear-less’ way, he works in his spare time on creating visualizations that show how everybody can benefit from simple analytic applications.

 

As a Data Scientist for Teradata, he strives to make his clients realize the full potential of 'Big Data' so that their customers can benefit via better services and offerings.


Data Science Informative Articles and Blog Posts

Our blogs allow customers, prospects, partners, third-party influencers and analysts to share thoughts on a range of product and industry topics. We also have lots of content in the community, allowing you to gain valuable insights from Teradata data scientists.