Split an avro file and upload to REST

I have created some avro files. I can use the following command to convert them to JSON, just to check whether the files are OK:

java -jar avro-tools-1.8.2.jar tojson FileName.avro > outputfilename.json

Now I have some big avro files, and the REST API I'm trying to upload to has size limitations, so I am trying to upload them in chunks using streams.

The following sample, which just reads the original file in chunks and copies them to another avro file, creates the new file perfectly:

using System;
using System.IO;

class Test
{
    public static void Main()
    {
        // Specify a file to read from and a file to create.
        string pathSource = @"D:\BDS\AVRO\filename.avro";
        string pathNew = @"D:\BDS\AVRO\test\filenamenew.avro";

        try
        {
            using (FileStream fsSource = new FileStream(pathSource,
                FileMode.Open, FileAccess.Read))
            {
                byte[] buffer = new byte[(20 * 1024 * 1024) + 100];
                long numBytesToRead = fsSource.Length;
                int numBytesRead = 0;

                using (FileStream fsNew = new FileStream(pathNew,
                    FileMode.Append, FileAccess.Write))
                {
                    while (numBytesToRead > 0)
                    {
                        // Read may return anything from 0 to buffer.Length.
                        int bytesRead = fsSource.Read(buffer, 0, buffer.Length);

                        // Copy only the bytes actually read this iteration.
                        byte[] actualbytes = new byte[bytesRead];
                        Array.Copy(buffer, actualbytes, bytesRead);

                        // Break when the end of the file is reached.
                        if (bytesRead == 0)
                            break;

                        numBytesRead += bytesRead;
                        numBytesToRead -= bytesRead;

                        // Write the chunk to the destination file.
                        fsNew.Write(actualbytes, 0, actualbytes.Length);
                    }
                }
            }
        }
        catch (FileNotFoundException ioEx)
        {
            Console.WriteLine(ioEx.Message);
        }
    }
}

How do I know this creates an OK avro file? Because the earlier command to convert to JSON works again, i.e.

java -jar avro-tools-1.8.2.jar tojson filenamenew.avro > outputfilename.json

However, when I use the same code but, instead of copying to another file, call a REST API, the file gets uploaded; yet downloading the same file from the server and running the command above to convert it to JSON fails with "Not a Data file".

So obviously something is getting corrupted, and I am struggling to figure out what.

This is the snippet:

string filenamefullyqualified = path + filename;
Stream stream = System.IO.File.Open(filenamefullyqualified, FileMode.Open, FileAccess.Read, FileShare.None);

long? position = 0;

byte[] buffer = new byte[(20 * 1024 * 1024) + 100];
long numBytesToRead = stream.Length;
int numBytesRead = 0;

do
{
    var content = new MultipartFormDataContent();
    int bytesRead = stream.Read(buffer, 0, buffer.Length);
    byte[] actualbytes = new byte[bytesRead];

    Array.Copy(buffer, actualbytes, bytesRead);

    if (bytesRead == 0)
        break;

    // Append data
    url = String.Format("https://{0}.dfs.core.windows.net/raw/datawarehouse/{1}/{2}/{3}/{4}/{5}?action=append&position={6}", datalakeName, filename.Substring(0, filename.IndexOf("_")), year, month, day, filename, position.ToString());
    numBytesRead += bytesRead;
    numBytesToRead -= bytesRead;

    ByteArrayContent byteContent = new ByteArrayContent(actualbytes);
    content.Add(byteContent);

    method = new HttpMethod("PATCH");

    request = new HttpRequestMessage(method, url)
    {
        Content = content
    };

    request.Headers.Add("Authorization", "Bearer " + accesstoken);

    var response = await client.SendAsync(request);
    response.EnsureSuccessStatusCode();

    position = position + request.Content.Headers.ContentLength;

    Array.Clear(buffer, 0, buffer.Length);

} while (numBytesToRead > 0);
stream.Close();
I have looked through the forum threads but haven't come across anything that deals with splitting avro files.

I have a hunch that my "content" for the HTTP request isn't right. What is it that I am missing?

If you need more details, I will be happy to provide them.



Solution 1:[1]

I have found the problem now. The problem was caused by MultipartFormDataContent. When an avro file is uploaded with that, it adds extra text, such as part boundaries and content-type headers, to the uploaded bytes, along with removing many lines (I do not know why).

So the solution was to upload the contents as ByteArrayContent itself, and not add it to a MultipartFormDataContent like I was doing earlier.
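For context, a multipart/form-data body wraps every part in boundary lines and per-part headers, so what actually got stored looked roughly like the following rather than the raw avro bytes (illustrative only; the boundary value is made up):

    --a1b2c3d4
    Content-Type: application/octet-stream

    <raw avro chunk bytes>
    --a1b2c3d4--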

Here is the snippet, almost identical to the one in the question, except that I no longer use MultipartFormDataContent:

string filenamefullyqualified = path + filename;
Stream stream = System.IO.File.Open(filenamefullyqualified, FileMode.Open, FileAccess.Read, FileShare.None);

long? position = 0;

byte[] buffer = new byte[(20 * 1024 * 1024) + 100];
long numBytesToRead = stream.Length;
int numBytesRead = 0;

do
{
    int bytesRead = stream.Read(buffer, 0, buffer.Length);
    byte[] actualbytes = new byte[bytesRead];

    Array.Copy(buffer, actualbytes, bytesRead);

    if (bytesRead == 0)
        break;

    // Append data
    url = String.Format("https://{0}.dfs.core.windows.net/raw/datawarehouse/{1}/{2}/{3}/{4}/{5}?action=append&position={6}", datalakeName, filename.Substring(0, filename.IndexOf("_")), year, month, day, filename, position.ToString());
    numBytesRead += bytesRead;
    numBytesToRead -= bytesRead;

    // Send the raw chunk bytes directly, with no multipart wrapper.
    ByteArrayContent byteContent = new ByteArrayContent(actualbytes);

    method = new HttpMethod("PATCH");

    request = new HttpRequestMessage(method, url)
    {
        Content = byteContent
    };

    request.Headers.Add("Authorization", "Bearer " + accesstoken);

    var response = await client.SendAsync(request);
    response.EnsureSuccessStatusCode();

    position = position + request.Content.Headers.ContentLength;

    Array.Clear(buffer, 0, buffer.Length);

} while (numBytesToRead > 0);
stream.Close();
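Note that this snippet only issues action=append calls. With the Data Lake Storage Gen2 REST API, appended data is only committed once a final flush call is made; that step is not shown above and is presumably handled elsewhere in the full code. A minimal sketch of it, reusing the same variables as the snippet and assuming position now equals the total number of bytes appended, could look like this:

// Hypothetical flush step: commits all previously appended bytes.
// 'position' must equal the total length of the file after all appends.
string flushUrl = String.Format("https://{0}.dfs.core.windows.net/raw/datawarehouse/{1}/{2}/{3}/{4}/{5}?action=flush&position={6}", datalakeName, filename.Substring(0, filename.IndexOf("_")), year, month, day, filename, position.ToString());

var flushRequest = new HttpRequestMessage(new HttpMethod("PATCH"), flushUrl);
flushRequest.Headers.Add("Authorization", "Bearer " + accesstoken);

var flushResponse = await client.SendAsync(flushRequest);
flushResponse.EnsureSuccessStatusCode();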

Solution 2:[2]

But streaming record by record will not handle the AVRO file as a whole in one transaction; we may end up with a partial success if some records fail, for example.

A small tool that can split AVRO files based on a threshold number of records would be great (see the sketch below).

The Spark-based split-by-partition technique does allow splitting a data set into a pre-defined number of files, but it does not allow splitting based on the number of records; i.e., I do not want an AVRO file with more than 500 records.

So we have to devise batching logic based on the heap size the application can comfortably handle, along with a two-phase commit, to handle transactions.
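For reference, here is a minimal sketch of such a record-threshold splitter, using the Apache.Avro NuGet package with generic records (the paths, the 500-record threshold, and the SplitAvro helper name are illustrative assumptions, not part of the original answer):

using System;
using Avro;
using Avro.File;
using Avro.Generic;

class AvroSplitter
{
    // Splits inputPath into files of at most maxRecords records each,
    // named <outputPrefix>_0.avro, <outputPrefix>_1.avro, ...
    // Each output file carries the same schema as the input.
    public static void SplitAvro(string inputPath, string outputPrefix, int maxRecords)
    {
        using (var reader = DataFileReader<GenericRecord>.OpenReader(inputPath))
        {
            Schema schema = reader.GetSchema();
            int part = 0;

            while (reader.HasNext())
            {
                string partPath = String.Format("{0}_{1}.avro", outputPrefix, part);
                using (var writer = DataFileWriter<GenericRecord>.OpenWriter(
                    new GenericDatumWriter<GenericRecord>(schema), partPath))
                {
                    int written = 0;
                    while (written < maxRecords && reader.HasNext())
                    {
                        writer.Append(reader.Next());
                        written++;
                    }
                }
                part++;
            }
        }
    }

    public static void Main()
    {
        // e.g. ensure no output file contains more than 500 records
        SplitAvro(@"D:\BDS\AVRO\filename.avro", @"D:\BDS\AVRO\split\filename", 500);
    }
}

Each part written this way is a complete, standalone avro data file, so every chunk can be uploaded (and checked with avro-tools) independently.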

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: Saugat Mukherjee
[2] Solution 2: Adeeb