Why is this task faster in Python than Julia?

I ran the following code in RStudio:

exo <- read.csv('exoplanets.csv',TRUE,",")
df <- data.frame(exo)

ranks <- 570
files <- 3198
datas <- vector()

for (w in 2:files) {
    listas <- vector()
    for (i in 1:ranks) {
        name <- as.character(df[i, w])
        listas <- append(listas, name)
    }
    datas <- append(datas, listas)
}

It reads a huge NASA CSV file, converts it to a dataframe, converts each element to string, and adds them to a vector.

RStudio took 4 minutes and 15 seconds.

So I decided to implement the same code in Julia. I ran the following in VS Code:

using CSV, DataFrames

df = CSV.read("exoplanets.csv", DataFrame)

fil, col = 570, 3198
arr = []

for i in 2:fil
    for j in 1:col
        push!(arr, string(df[i, j]))
    end
end

The result was good. The Julia code took only 1 minute and 25 seconds!

Then, out of pure curiosity, I implemented the same code in Python to compare. I ran the following in VS Code:

import numpy as np
import pandas as pd

exo = pd.read_csv("exoplanets.csv")
arr = np.array(exo)

fil, col = 570, 3198
lis = []

for i in range(1, fil):
    for j in range(col):
        lis.append(arr[i][j].astype('str'))

The result shocked me! Only 35 seconds!!! And in Spyder from Anaconda only 26 seconds!!! Almost 2 million floats!!! Is Julia slower than Python for data analysis? Can I improve the Julia code?



Solution 1:[1]

It depends on what you want to test (i.e. whether you want to test looping or just want the result fast). I assume you want the result fast and with clean code, in which case I would write this operation in Julia as follows:

arr = reduce(vcat, eachrow(Matrix(string.(df[2:570, 1:3198]))))

Can you please confirm that this produces the expected result, and report the timing of this operation? (Here I assume that you have more than 570 rows and more than 3198 columns, so I subset them first.)
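For readers following along in Python, here is a minimal sketch of the same three steps (subset, convert every cell to a string, flatten row by row) on a plain list of rows. The function name `subset_to_strings` and the tiny table are hypothetical stand-ins, not part of the original code; the point is only how Julia's 1-based inclusive ranges such as `2:570` map onto Python's 0-based slices.

```python
def subset_to_strings(table, first_row, last_row, first_col, last_col):
    """Subset a list-of-rows table using 1-based inclusive bounds
    (as in df[2:570, 1:3198]), convert each cell to str, flatten row by row."""
    rows = table[first_row - 1:last_row]  # 1-based inclusive -> 0-based slice
    return [str(cell)
            for row in rows
            for cell in row[first_col - 1:last_col]]


# Tiny hypothetical stand-in for the exoplanet table: 3 rows, 3 columns.
table = [
    [1.5, 2.5, 3.5],
    [4.0, 5.0, 6.0],
    [7.25, 8.25, 9.25],
]

flat = subset_to_strings(table, 2, 3, 1, 2)  # rows 2..3, columns 1..2
print(flat)
```

With a NumPy array, the slicing step would look similar, but the sketch sticks to the standard library so that it runs anywhere.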

If you want to test loops, then the comments under your question become relevant.

Also note that your DataFrames.jl code does not perform the same operation as the R and Python code (the looping order is different, so please double-check what you need). This difference is crucial for performance. The code I have given reproduces the behavior of your DataFrames.jl code (which is a harder/slower variant of what you want to do compared with the R/Python code).
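To make the looping-order point concrete: in the question, the R loop puts columns in the outer loop, while the Julia loop puts rows in the outer loop. The sketch below uses a tiny hypothetical table and two illustrative helper functions (neither is from the original post) to show that the two nestings visit the same cells but produce the output in a different order.

```python
def flatten_row_major(table):
    """Outer loop over rows, inner loop over columns."""
    return [str(table[i][j])
            for i in range(len(table))
            for j in range(len(table[0]))]


def flatten_column_major(table):
    """Outer loop over columns, inner loop over rows."""
    return [str(table[i][j])
            for j in range(len(table[0]))
            for i in range(len(table))]


# Hypothetical 2 x 3 table, just large enough to show the difference.
table = [[1, 2, 3],
         [4, 5, 6]]

by_rows = flatten_row_major(table)
by_cols = flatten_column_major(table)
print(by_rows)
print(by_cols)
```

Both lists contain the same strings, so a check that only compares lengths would not notice the difference. Which order is cheaper depends on how the underlying container stores its data.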

Solution 2:[2]

Given your comment on @phipsgabler's answer, what you are timing here is the fixed cost of importing the modules and compiling, more than the task itself.
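One way to check this is to time the setup phase and the task separately rather than timing the whole script. The following is a minimal sketch in Python, assuming a synthetic table in place of the CSV; `timed`, `build_table`, and `convert` are hypothetical helper names. In Julia, the analogous check would be to run the task a second time in the same session, since the first call of a function includes its compilation.

```python
import time


def timed(label, func, *args):
    """Run func(*args), print the elapsed wall-clock time, return (result, seconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.4f} s")
    return result, elapsed


def build_table(n_rows, n_cols):
    """Hypothetical stand-in for loading the CSV: builds a table of floats."""
    return [[float(i * n_cols + j) for j in range(n_cols)]
            for i in range(n_rows)]


def convert(table):
    """The task itself: convert every cell to a string, row by row."""
    return [str(cell) for row in table for cell in row]


# Time the two phases independently so fixed costs do not hide the task cost.
table, setup_time = timed("setup", build_table, 50, 40)
strings, task_time = timed("task", convert, table)
```

The printed times vary from run to run; what matters is comparing the two phases, not the absolute values.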

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Bogumił Kamiński
Solution 2 Antonello