Regex to parse Azure Data Lake Storage Gen2 URI for production and testing with Azurite
In my Java application I am using Azure Data Lake Storage Gen2 for storage (ABFS). In the class that handles the requests to the filesystem, I get a file path as an input and then use some regex to extract Azure connection info from it.
The Azure Data Lake Storage Gen2 URI is in the following format:
abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file_name>
I use the following regex (shown as a Java string literal, hence the doubled backslashes) abfss?://([^/]+)@([^\\.]+)(\\.[^/]+)/?((.+)?) to parse a given file path and extract:
- fileSystem
- accountName
- accountSuffix
- relativePath (path + file_name)
Below is some test Java code; the comments state the value of each variable after matching.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private void parsePath(String path) {
    // path = abfs://storage@myaccount.dfs.core.windows.net/selim/test.csv
    Pattern azurePathPattern = Pattern.compile("abfss?://([^/]+)@([^\\.]+)(\\.[^/]+)/?((.+)?)");
    Matcher matcher = azurePathPattern.matcher(path);
    if (matcher.find()) {
        String fileSystem = matcher.group(1);    // storage
        String accountName = matcher.group(2);   // myaccount
        String accountSuffix = matcher.group(3); // .dfs.core.windows.net
        // relativePath is <path>/<file_name>
        String relativePath = matcher.group(4);  // selim/test.csv
    }
}
The problem arose when I decided to use Azurite, an Azure Storage API-compatible server (emulator) that allows me to run unit tests against it instead of against an actual Azure server, as recommended in the Microsoft documentation.
Azurite uses a different file URI format than Azure, which makes the regex above invalid for testing purposes. The Azurite file URI has the following format:
abfs[s]://<file_system>@<local_ip>:<local_port>/<account_name>/<path>/<file_name>
The default Azurite account_name is devstoreaccount1, so here is an example path for a file on Azurite:
abfs://storage@127.0.0.1:10000/devstoreaccount1/selim/test.csv
Parsing it with the regex above gives the following output, which causes incorrect API calls to the Azurite server:
- fileSystem: storage (correct)
- accountName: 127 (incorrect, should be: devstoreaccount1)
- accountSuffix: .0.0.1:10000 (incorrect, should be empty string)
- relativePath: devstoreaccount1/selim/test.csv (incorrect, should be selim/test.csv)
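The mismatch listed above can be reproduced with a small standalone check. This is only a sketch: the class name AbfsRegexRepro and the parse helper are hypothetical names introduced for illustration, and the Azurite host 127.0.0.1:10000 is taken from the parsed output listed above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only to reproduce the behaviour described above.
class AbfsRegexRepro {
    // The pattern from the question, unchanged.
    static final Pattern AZURE_PATH =
            Pattern.compile("abfss?://([^/]+)@([^\\.]+)(\\.[^/]+)/?((.+)?)");

    // Returns {fileSystem, accountName, accountSuffix, relativePath},
    // or null when the pattern does not match.
    static String[] parse(String path) {
        Matcher m = AZURE_PATH.matcher(path);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String azurite = "abfs://storage@127.0.0.1:10000/devstoreaccount1/selim/test.csv";
        // prints: storage | 127 | .0.0.1:10000 | devstoreaccount1/selim/test.csv
        System.out.println(String.join(" | ", parse(azurite)));
    }
}
```

The first dot in the IP address ends the accountName group, so everything up to the next slash is treated as the account suffix and the real account name ends up in relativePath.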
Is it possible to have one regex that handles both URI formats, or do I need two regexes to solve this issue?
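For reference, the two-regex option might look roughly like the sketch below. It is not a confirmed solution: it assumes that an explicit :<port> after the host is what distinguishes an Azurite URI from an Azure one, and the class name AbfsPathParser, both patterns, and the parse helper are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: try the Azurite pattern first, then fall back to the Azure pattern.
class AbfsPathParser {
    // Azurite: abfs[s]://<file_system>@<local_ip>:<local_port>/<account_name>/<path>/<file_name>
    // Assumption: the ":<port>" part only ever appears in emulator URIs.
    private static final Pattern AZURITE =
            Pattern.compile("abfss?://([^/@]+)@[^/:@]+:\\d+/([^/]+)/?(.*)");

    // Azure: abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file_name>
    private static final Pattern AZURE =
            Pattern.compile("abfss?://([^/@]+)@([^/.:@]+)(\\.[^/:@]+)/?(.*)");

    // Returns {fileSystem, accountName, accountSuffix, relativePath},
    // or null when neither pattern matches the whole input.
    static String[] parse(String path) {
        Matcher azurite = AZURITE.matcher(path);
        if (azurite.matches()) {
            // Azurite has no account suffix, so an empty string is returned for it.
            return new String[] { azurite.group(1), azurite.group(2), "", azurite.group(3) };
        }
        Matcher azure = AZURE.matcher(path);
        if (azure.matches()) {
            return new String[] { azure.group(1), azure.group(2), azure.group(3), azure.group(4) };
        }
        return null;
    }
}
```

Keeping two patterns keeps each one readable; a single combined pattern with alternation would need extra logic to work out which capture groups were populated.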
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
