'Is it possible to combine multiple input files with different schemas using Schema Drift / Dynamic Columns
I have around 20 tab-separated input files. They have in the region of 500 columns, but each will be slightly different.
The sink output schema is known and will contain all the possible input columns.
As a simplified example:
File 1
| Name | Age | DOB | Nationality |
|---|---|---|---|
| Bob | 21 | 01/01/1972 | British |
File2
| Name | Nationality | NINO |
|---|---|---|
| Joe | British | AA995654A |
File 3
| Name | DOB | Nationality |
|---|---|---|
| Sam | 01/01/1990 | British |
Is it possible to have one DataFlow with multiple inputs, where the schema is not known until runtime, that would cope with changes in the input files and in this case would output:
| Name | Age | DOB | NINO | Nationality |
|---|---|---|---|---|
| Bob | 21 | 01/01/1972 | NULL | British |
| Joe | NULL | NULL | AA995654A | British |
| Sam | NULL | 01/01/1990 | NULL | British |
I have looked at column pattern matching and schema drift, but don't see how/if it is possible to achieve this.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
