Performance issues reading CSV files in a Java (Spring Boot) application

I am currently working on a Spring-based API which has to transform CSV data and expose it as JSON. It has to read big CSV files containing more than 500 columns and 2.5 million lines each. I am not guaranteed to have the same header between files (each file can have a completely different header from another), so I have no way to create a dedicated class that would provide a mapping to the CSV headers. Currently the API controller calls a CSV service which reads the CSV data using a BufferedReader.

The code works fine on my local machine, but it is very slow: it takes about 20 seconds to process 450 columns and 40,000 lines. To improve processing speed, I tried to implement multithreading with Callables, but I am not familiar with that kind of concept, so the implementation might be wrong.

On top of that, the API is running out of heap memory when running on the server. I know that one solution would be to increase the amount of available memory, but I suspect that the replace() and split() operations on strings done in the Callables are responsible for consuming a large amount of heap memory.

So I actually have several questions:

#1. How could I improve the speed of the CSV reading?

#2. Is the multithreaded implementation with Callable correct?

#3. How could I reduce the amount of heap memory used in the process?

#4. Do you know of a different approach to split at commas and replace the double quotes in each CSV line? Would StringBuilder be of any help here? What about StringTokenizer?

Below is the CSV-reading method:

    public static final int NUMBER_OF_THREADS = 10;

    public static List<List<String>> readCsv(InputStream inputStream) {
        List<List<String>> rowList = new ArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(NUMBER_OF_THREADS);
        List<Future<List<String>>> listOfFutures = new ArrayList<>();

        // Read the stream line by line; each line is parsed on the pool
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                CallableLineReader callableLineReader = new CallableLineReader(line);
                Future<List<String>> futureRow = pool.submit(callableLineReader);
                listOfFutures.add(futureRow);
            }
        } catch (Exception e) {
            // log: error reading CSV file
        } finally {
            pool.shutdown();
        }

        // Collect the parsed rows in submission order
        for (Future<List<String>> future : listOfFutures) {
            try {
                rowList.add(future.get());
            } catch (ExecutionException | InterruptedException e) {
                // log: CSV processing interrupted during execution
            }
        }

        return rowList;
    }

And the Callable implementation:

public class CallableLineReader implements Callable<List<String>>  {

        private final String line;

        public CallableLineReader(String line) {
                this.line = line;
        }

        @Override
        public List<String> call() throws Exception {
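                // Strip every double quote, then split at each comma.
                // Note: a plain split also breaks quoted fields that contain commas.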
                return Arrays.asList(line.replace("\"", "").split(","));
        }
}


Solution 1:[1]

Instead of trying out a different approach, run with a profiler first and see where time is actually being spent, then use that information to decide what to change.

Async-profiler is a very solid (and free!) profiler and will give you a very good impression of where time is being spent. It will also show the time spent on garbage collection, so you can easily see the ratio of CPU utilization caused by garbage collection. It also supports allocation profiling, to figure out which objects are being created (and where).

For a tutorial, see the async-profiler project page: https://github.com/async-profiler/async-profiler
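As a sketch of how it might be wired up (this is illustrative, not part of the original answer; the install path, app.jar, and the PID are placeholders):

    # Profile CPU from JVM startup and write an HTML flame graph
    java -agentpath:/path/to/libasyncProfiler.so=start,event=cpu,file=cpu-profile.html -jar app.jar

    # Or attach to a running JVM for 30 seconds of allocation profiling
    ./profiler.sh -d 30 -e alloc -f alloc-profile.html <pid>

The alloc event is what would confirm (or rule out) the suspicion that replace() and split() dominate the heap allocations.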

Solution 2:[2]

Try using Spring Batch and see if it helps your scenario.

Ref: https://howtodoinjava.com/spring-batch/flatfileitemreader-read-csv-example/
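As a rough sketch of that idea (assumptions: Spring Batch is on the classpath; the class name CsvBatchReadSketch, the file name data.csv, and the reader name csvReader are made up for the example), a FlatFileItemReader can stream the file row by row without a dedicated mapping class:

    import java.util.Arrays;
    import java.util.List;

    import org.springframework.batch.item.ExecutionContext;
    import org.springframework.batch.item.file.FlatFileItemReader;
    import org.springframework.core.io.FileSystemResource;

    public class CsvBatchReadSketch {

        public static void main(String[] args) throws Exception {
            FlatFileItemReader<List<String>> reader = new FlatFileItemReader<>();
            reader.setName("csvReader");                            // required when state saving is on
            reader.setResource(new FileSystemResource("data.csv")); // placeholder file
            // Same naive tokenization as in the question: drop quotes, split at commas
            reader.setLineMapper((line, lineNumber) -> Arrays.asList(line.replace("\"", "").split(",")));

            reader.open(new ExecutionContext());
            List<String> row;
            while ((row = reader.read()) != null) {
                // handle one row at a time instead of materializing 2.5 million rows
            }
            reader.close();
        }
    }

Because the reader hands back one row at a time, the heap only ever holds the rows you choose to keep, which also speaks to question #3.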

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1
[2] Solution 2: Dhanraj