How to know that my data is skewed?

After transferring my data (say, a table) to HDFS, I have no idea how my data is replicated (which part goes to which machine or node).

When running Spark SQL queries, some people say you can give Spark a hint that your data is skewed.

But how would I know that my data is skewed, so that I can give that hint to Spark?



Solution 1:[1]

This really depends on the characteristics of your data and how you want to use it. It also depends on how Spark implements the operation you run. Basically, you can use a SQL query: pick one of the columns as the key (for example user_name, user_id, and so on), group by it, and see whether the row counts per key differ hugely.

For example, suppose the query below returns a result like this:
select count(*) as count, user_name from your_table group by user_name order by count desc

count           user_name
199999999999      abc123
12                abc124
6                 abc121

In the example above, the user name abc123 has vastly more rows than the other keys, which is a data skew problem.
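To turn "huge differences" into something you can check, here is a minimal Python sketch that summarizes per-key row counts such as those returned by the group-by query above. The function names skew_report and looks_skewed and the ratio threshold of 10 are assumptions made for illustration, not part of Spark; the counts are taken from the example result.

```python
from statistics import median


def skew_report(key_counts):
    """Summarize how unevenly rows are spread across keys.

    key_counts maps each key to its row count, e.g. built from the rows
    returned by the GROUP BY query (hypothetical helper, not a Spark API).
    """
    if not key_counts:
        raise ValueError("key_counts must not be empty")
    total = sum(key_counts.values())
    # Key with the most rows and its row count.
    top_key, top_count = max(key_counts.items(), key=lambda kv: kv[1])
    med = median(key_counts.values())
    return {
        "top_key": top_key,
        # Fraction of all rows that belong to the heaviest key.
        "top_share": top_count / total,
        # How many times larger the heaviest key is than a typical key.
        "max_over_median": top_count / med if med else float("inf"),
    }


def looks_skewed(key_counts, ratio_threshold=10.0):
    """Flag skew when the heaviest key dwarfs the median key.

    The default threshold of 10 is an arbitrary assumption; tune it
    for your own data and cluster.
    """
    return skew_report(key_counts)["max_over_median"] >= ratio_threshold


# Counts from the example result above.
counts = {"abc123": 199999999999, "abc124": 12, "abc121": 6}
report = skew_report(counts)
print(report["top_key"], looks_skewed(counts))
```

If one key holds most of the rows, as abc123 does here, any join or aggregation on that column is likely to put most of the work on a single task. On a large table, running the count query on a sample is usually enough to spot this.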

There are a few references on resolving the data skew problem in Apache Spark:
1. http://silverpond.com.au/2016/10/06/balancing-spark.html
2. https://databricks.com/session/handling-data-skew-adaptively-in-spark-using-dynamic-repartitioning

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 SharpLu