'Initialize a column with missing values and copy+transform another column of a dataframe into the initialized column

I have a messy column in a csv file (column A of the dataframe).

using CSV, DataFrames
df = DataFrame(A = ["1", "3", "-", "4", missing, "9"], B = ["M", "F", "R", "G", "Z", "D"])

Want I want to do is:

  1. Transform the integer from string to numeric (e.g. Float64)
  2. Transform the string "-" in missing

The strategy would be to first define a new column vector filled with missing

df[:C] = fill(missing, size(df)[1])

and then perform the 2 transformations with for loops

for i in 1:size(df)[1]
    if df[:A][i] == "-"
        continue
    else
        df[:C][i] = parse(Float64,df[:A][i])
    end
end

However, when looking at df[:C] I have a column filled only with missing. What am I doing wrong?



Solution 1:[1]

There are several issues with your code, but first let me show how I would write this transformation:

df.C = passmissing(parse).(Float64, replace(df.A, "-"=>missing))

It is not the most efficient way to do it but is simple to reason about.

An implementation using a loop could look like:

df.C = similar(df.A, Union{Float64, Missing});

for (i, a) in enumerate(df.A)
    if !ismissing(a) && a != "-"
        df.C[i] = parse(Float64, a)
    else
        df.C[i] = missing
    end
end

Note that similar by default will fill the df.C with missing so the else part could be dropped, but this behavior is not documented so it is safer to write it.

You could also use a comprehension:

df. C = [ismissing(a) || a == "-" ? missing : parse(Float64, a) for a in df.A]

Now, to fix your code you could write:

# note a different initialization
# in your code df.C allowed only values of Missing type and disallows of Float64 type
df.C = Vector{Union{Float64, Missing}}(missing, size(df, 1))

for i in 1:size(df)[1]
    # note that we need to handle missing value and "=" separately
    if ismissing(df.A[i]) || df.A[i] == "-"
        continue
    else
        df.C[i] = parse(Float64,df.A[i])
    end
end

Finally note that it is preferred to write df.C than df[:C] to access a column in a data frame (currently both are equivalent but this might change in the future).

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1