How do I know whether my data is skewed?
After transferring my data (say, a table) to HDFS, I have no idea how it is laid out across the cluster (which part goes to which machine, or node).
Some people say that when running Spark SQL queries you can give Spark a hint that your data is skewed.
But how would I know that my data is skewed, so that I can give Spark that hint?
Solution 1:[1]
This really depends on the quality of your data and on how you want to use it. It also depends on how Spark implements the algorithms involved. Basically, you can run a SQL query that picks one of the columns as the key (for example user_name, id, and so on), groups by it, and shows whether the per-key row counts differ hugely.
For example, suppose you have a case like this:
select count(*), user_name from your_table group by user_name
count username
199999999999 abc123
12 abc124
6 abc121
In the example above, the username abc123 accounts for vastly more rows than any other key; that is a data skew problem.
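The group-by check described above can be sketched as follows. This is only an illustration under stated assumptions: Python's built-in sqlite3 stands in for Spark SQL so the snippet runs without a cluster, the table name `events` and the helpers `key_counts` and `skew_ratio` are hypothetical, and the max/median ratio is just one possible rule of thumb for spotting a dominant key, not something Spark itself defines.

```python
import sqlite3
from statistics import median


def key_counts(conn, table, key):
    """Return (key, row_count) pairs, largest first, using a GROUP BY."""
    # Identifiers cannot be bound as SQL parameters; table and key are
    # assumed to be trusted names in this sketch.
    query = (
        f"SELECT {key}, COUNT(*) AS cnt FROM {table} "
        f"GROUP BY {key} ORDER BY cnt DESC, {key}"
    )
    return conn.execute(query).fetchall()


def skew_ratio(counts):
    """Largest per-key count divided by the median per-key count."""
    sizes = [cnt for _, cnt in counts]
    return max(sizes) / median(sizes)


# Hypothetical sample data: one key dominates, mirroring the example above
# (with a much smaller row count so it fits in memory).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_name TEXT)")
rows = [("abc123",)] * 1000 + [("abc124",)] * 12 + [("abc121",)] * 6
conn.executemany("INSERT INTO events VALUES (?)", rows)

counts = key_counts(conn, "events", "user_name")
ratio = skew_ratio(counts)
print(counts)
print(f"max/median ratio: {ratio:.1f}")
```

A ratio far above 1 suggests that a single key would dominate any shuffle keyed on that column. On a real cluster, the same GROUP BY query could be run through Spark SQL against the actual table instead of sqlite3.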
There are a few references on resolving the data skew problem in Apache Spark:
1. http://silverpond.com.au/2016/10/06/balancing-spark.html
2. https://databricks.com/session/handling-data-skew-adaptively-in-spark-using-dynamic-repartitioning
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | SharpLu |
