How to write to a specific datanode in HDFS using PySpark

I have a requirement to write related data to the same HDFS datanodes, similar to how we repartition on a column in PySpark to bring similar data onto the same worker node; even the replicas should be on the same node.

For instance, we have a file, table1.csv

  • Id, data
  • 1, A
  • 1, B
  • 2, C
  • 2, D

And another file, table2.csv

  • Id, data
  • 1, X
  • 1, Y
  • 2, Z
  • 2, X1

Then datanode1 should only have (1,A),(1,B),(1,X),(1,Y) and datanode2 should only have (2,C),(2,D),(2,Z),(2,X1), with the replicas kept on those same datanodes.

It can be separate files based on keys as well, but each key should map to a particular node.

I tried writing to HDFS with PySpark, but the blocks were assigned to datanodes essentially at random when I checked with hdfs fsck.
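For reference, what I tried was roughly the following (paths and column names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("keyed-write").getOrCreate()

# Read both tables and co-partition them on the key column.
df1 = spark.read.csv("/data/table1.csv", header=True)
df2 = spark.read.csv("/data/table2.csv", header=True)

# repartition("Id") brings matching keys into the same Spark partition,
# but HDFS still picks the datanodes for each block on its own.
df1.repartition("Id").write.mode("overwrite").parquet("/data/out/table1")
df2.repartition("Id").write.mode("overwrite").parquet("/data/out/table2")

# Block placement was then inspected with:
#   hdfs fsck /data/out/table1 -files -blocks -locations
```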

I read about rack IDs and setting up a rack topology, but is there a way to select which rack the data is stored on?

Any help is appreciated, I'm totally stuck.

KR Alex



Solution 1:[1]

I maintain that without actually exposing the underlying problem this is not going to help you, but since you technically asked for a solution, here are a couple of ways to do what you want; they won't actually solve the underlying problem.

If you want to shift the problem to resource starvation:

Spark setting:
spark.locality.wait: technically this doesn't solve your problem, but it is likely to help you immediately, before you implement anything else listed here. It should be your go-to move because it's trivial to try. Pro: the scheduler simply waits until it gets a node with the data; cheap and fast to implement. Con: it doesn't promise data locality, it just waits for a while in case the right nodes free up, and it doesn't guarantee that when you run your job it will be placed on the nodes that have the data.
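A minimal sketch of setting this from PySpark (the wait values are just illustrative; tune them for your cluster):

```python
from pyspark.sql import SparkSession

# Raise the locality wait so the scheduler holds tasks back longer,
# hoping an executor on a node that already has the HDFS blocks frees up.
spark = (
    SparkSession.builder
    .appName("locality-wait-demo")
    .config("spark.locality.wait", "30s")        # default is 3s
    # Optionally tune the per-level waits as well:
    .config("spark.locality.wait.node", "30s")
    .config("spark.locality.wait.rack", "30s")
    .getOrCreate()
)
```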

YARN labels

Use node labels to allocate your worker nodes to specific machines.
Pro: This should ensure at least one copy of the data lands within a given set of worker/data nodes, and if subsequent jobs also use this node label you should get good data locality. Technically it doesn't specify where data is written, but as a side effect YARN containers will write to the HDFS datanode they are running on first. Con: You will create congestion on these nodes, you may have to wait for other jobs to finish before you get allocated, or you may carve these nodes into a queue that no other jobs can access, reducing the functional capacity of YARN. (HDFS will still work fine.)
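If node labels are already set up on the cluster, a sketch of pointing the Spark application master and executors at them (the label name "keyed_data" is made up):

```python
from pyspark.sql import SparkSession

# Assumes a YARN node label (here "keyed_data") has already been created
# by the cluster admin and assigned to the target nodes.
spark = (
    SparkSession.builder
    .appName("node-label-demo")
    .config("spark.yarn.am.nodeLabelExpression", "keyed_data")
    .config("spark.yarn.executor.nodeLabelExpression", "keyed_data")
    .getOrCreate()
)

# With executors running only on labelled nodes, the first replica of
# every block they write will normally land on those same nodes.
```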

Use Cluster federation

Ensures data lands inside a certain set of machines. Pro: A folder can be assigned to a set of datanodes. Con: You have to allocate another namenode, and although this satisfies your requirement it doesn't mean you'll get data locality. It's a great example of something that fits the requirement but might not solve the problem, and it doesn't guarantee that when you run your job it will be placed on the nodes that have the data.
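As a sketch of how a job could target such a federated namespace, assuming the cluster admin has created a separate namespace backed by a dedicated set of datanodes and mounted it via ViewFs (all names below are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("federation-demo").getOrCreate()

df = spark.read.csv("/data/table1.csv", header=True)

# Write into the ViewFs mount point that the admin has mapped to the
# dedicated namespace (mount point and cluster name are hypothetical).
df.repartition("Id").write.parquet("viewfs://mycluster/keyed_data/table1")

# Writing straight to that namespace's namenode also works:
# df.write.parquet("hdfs://nn-keyed.example.com:8020/keyed_data/table1")
```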

My-Data-is-Everywhere

hdfs dfs -setrep -w 10000 /path/to/folder

Turn up replication for just the folder that contains this data so it equals the number of nodes in the cluster. Pro: All your datanodes have the data you need. Con: You are wasting space. That's not necessarily bad, but it can't really be done for large amounts of data without eating into your cluster's capacity.

Whack-a-mole:

Turn off datanodes until the data is replicated where you want it, then turn the other nodes back on. Pro: You fulfil your requirement. Con: It's very disruptive to anyone trying to use the cluster, and it still doesn't guarantee that when you run your job it will be placed on the nodes that have the data. Again, it kind of points out how silly the requirement is.

Racking-my-brain

Someone smarter than me might be able to develop a rack-placement strategy for your cluster that would ensure data is always written to specific nodes that you could then "hope" to be allocated to. I haven't fully developed the strategy in my mind, but some math genius could likely work it out.
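For reference, the knob Hadoop exposes for this is a topology script: point net.topology.script.file.name in core-site.xml at a script that maps each datanode address to a rack path of your choosing. A hypothetical sketch (the host-to-rack mapping is invented):

```python
#!/usr/bin/env python3
# Hypothetical rack-topology script: Hadoop invokes it with one or more
# datanode IPs/hostnames and expects one rack path per argument on stdout.
import sys

# Made-up mapping: keep the "key 1" nodes on one rack, "key 2" on another.
RACKS = {
    "10.0.1.11": "/rack-key1",
    "10.0.1.12": "/rack-key1",
    "10.0.2.21": "/rack-key2",
    "10.0.2.22": "/rack-key2",
}

for host in sys.argv[1:]:
    print(RACKS.get(host, "/default-rack"))
```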

Solution 2:[2]

You could also implement HBase and allocate region servers such that the data lands on the three servers. (This would technically fulfill your requirement, since the data would be on three servers and in HDFS.)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Matt Andruff
Solution 2: Matt Andruff