'Google Cloud Storage atomic creation of a Blob
I'm using haddop-connectors project for writing BLOBs to Google Cloud Storage.
I'd like to make sure that a BLOB with a specific target name that is being written in a concurrent context is either written in FULL or not appearing at all as visible in case that an exception has occurred while writing.
In the code below, in case that that an I/O exception occurs, the BLOB written will appear on GCS because the stream is being closed in finally:
val stream = fs.create(path, overwrite)
try {
actions.map(_ + "\n").map(_.getBytes(UTF_8)).foreach(stream.write)
} finally {
stream.close()
}
The other possibility would be to not close the stream and let it "leak" so that the BLOB does not get created. However this is not really a valid option.
val stream = fs.create(path, overwrite)
actions.map(_ + "\n").map(_.getBytes(UTF_8)).foreach(stream.write)
stream.close()
Can anybody share with me a recipe on how to write to GCS a BLOB either with hadoop-connectors or cloud storage client in an atomic fashion?
Solution 1:[1]
I have used reflection within hadoop-connectors to retrieve an instance of com.google.api.services.storage.Storage from the GoogleHadoopFileSystem instance
GoogleCloudStorage googleCloudStorage = ghfs.getGcsFs().getGcs();
Field gcsField = googleCloudStorage.getClass().getDeclaredField("gcs");
gcsField.setAccessible(true);
Storage gcs = (Storage) gcsField.get(googleCloudStorage);
in order to have the ability to make a call based on an input stream corresponding to the data in memory.
private static StorageObject createBlob(URI blobPath, byte[] content, GoogleHadoopFileSystem ghfs, Storage gcs)
throws IOException
{
CreateFileOptions createFileOptions = new CreateFileOptions(false);
CreateObjectOptions createObjectOptions = objectOptionsFromFileOptions(createFileOptions);
PathCodec pathCodec = ghfs.getGcsFs().getOptions().getPathCodec();
StorageResourceId storageResourceId = pathCodec.validatePathAndGetId(blobPath, false);
StorageObject object =
new StorageObject()
.setContentEncoding(createObjectOptions.getContentEncoding())
.setMetadata(encodeMetadata(createObjectOptions.getMetadata()))
.setName(storageResourceId.getObjectName());
InputStream inputStream = new ByteArrayInputStream(content, 0, content.length);
Storage.Objects.Insert insert = gcs.objects().insert(
storageResourceId.getBucketName(),
object,
new InputStreamContent(createObjectOptions.getContentType(), inputStream));
// The operation succeeds only if there are no live versions of the blob.
insert.setIfGenerationMatch(0L);
insert.getMediaHttpUploader().setDirectUploadEnabled(true);
insert.setName(storageResourceId.getObjectName());
return insert.execute();
}
/**
* Helper for converting from a Map<String, byte[]> metadata map that may be in a
* StorageObject into a Map<String, String> suitable for placement inside a
* GoogleCloudStorageItemInfo.
*/
@VisibleForTesting
static Map<String, String> encodeMetadata(Map<String, byte[]> metadata) {
return Maps.transformValues(metadata, QuickstartParallelApiWriteExample::encodeMetadataValues);
}
// A function to encode metadata map values
private static String encodeMetadataValues(byte[] bytes) {
return bytes == null ? Data.NULL_STRING : BaseEncoding.base64().encode(bytes);
}
Note in the example above, that even if there are multiple callers trying to create a blob with the same name in parallel, ONE and only ONE will succeed in creating the blob. The other callers will receive 412 Precondition Failed.
Solution 2:[2]
GCS objects (blobs) are immutable 1, which means they can be created, deleted or replaced, but not appended.
The Hadoop GCS connector provides the HCFS interface which gives the illusion of appendable files. But under the hood, it is just one blob creation, GCS doesn't know if the content is complete or not from the application's perspective, just as you mentioned in the example. There is no way to cancel a file creation.
There are 2 options you can consider:
Create a temp blob/file, copy it to the final blob/file, then delete the temp blob/file, see 2. Note that there is no atomic rename operation in GCS, rename is implemented as copy-then-delete.
If your data fits into the memory, first read up the stream and buffer the bytes in memory, then create the blob/file, see 3.
GCS connector should also work with the 2 options above, but I think GCS client library gives you more control.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | marius_neo |
| Solution 2 |
