Find mean of a column for every 1000000 count in R
I have a dataframe which has the following structure with two columns data1 and data2. Below is the sample data:
data1 data2
800000 1
800030 0.956521739130435
1000000 0.480916030534351
1686626 0.496
1687492 0.174757281553398
2148463 0.0344827586206897
2850823 0.05
2959087 0.0416666666666667
I would like to calculate the mean of the second column, data2, for every interval of 1000000 in data1.
That is, it should give the mean for the first 2 rows, then for the next 3 rows, then for the next 3 rows, and so on.
The output should be a dataframe with the last value within each 1000000 interval and the mean of data2 in that interval. Sample output is shown below:
800030 0.97826087
1687492 0.38389110
2959087 0.04204981
Could someone help me do this in R?
Solution 1:[1]
Assuming your data is in a data.frame DF, you can use the aggregate function to do this:
> with(DF, aggregate(data2, by=list((data1+0.01)%/%1000000), mean ))
Group.1 x
1 0 0.97826087
2 1 0.38389110
3 2 0.04204981
To get the values in the column over which the mean was calculated, you have to use aggregate again, this time on the data1 column itself. After that you can merge the two resulting data frames.
res <- with(DF, merge(aggregate(data1, by = list((data1 + 0.01)%/%1e+06), paste), aggregate(data2, by = list((data1 + 0.01)%/%1e+06), mean), by = "Group.1"))
names(res) <- c("Group", "Values", "Mean")
res
## Group Values Mean
## 1 0 800000, 800030 0.97826087
## 2 1 1000000, 1686626, 1687492 0.38389110
## 3 2 2148463, 2850823, 2959087 0.04204981
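The question asks for only the last data1 value in each interval rather than all of them. A minimal sketch of that variant, assuming data1 is sorted in ascending order so that the maximum in each bin is also its last value. The column name bin and the object names last_value, mean_value and res_last are illustrative and not part of the answer above:

```r
# Sample data from the question
DF <- data.frame(
  data1 = c(800000, 800030, 1000000, 1686626, 1687492,
            2148463, 2850823, 2959087),
  data2 = c(1, 0.956521739130435, 0.480916030534351, 0.496,
            0.174757281553398, 0.0344827586206897, 0.05,
            0.0416666666666667)
)

# Integer division assigns each row to a 1e6-wide bin: 0, 1, 2, ...
DF$bin <- DF$data1 %/% 1e6

# Largest data1 per bin (equal to the last value if data1 is sorted)
last_value <- aggregate(data1 ~ bin, data = DF, FUN = max)

# Mean of data2 per bin
mean_value <- aggregate(data2 ~ bin, data = DF, FUN = mean)

# One row per bin: bin, data1 (last value), data2 (mean)
res_last <- merge(last_value, mean_value, by = "bin")
res_last
```

If data1 is not sorted, max still gives the largest value in the interval, which may or may not be what "last" is meant to be.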
Solution 2:[2]
You can do something like this :
group <- cut(df$data1, c(0,1000000,2000000,3000000))
tapply(df$data2, group, mean)
# (0,1e+06] (1e+06,2e+06] (2e+06,3e+06]
# 0.81247926 0.33537864 0.04204981
EDIT : To automatically compute the breaks with seq, you can replace c(0,1000000,2000000,3000000) with something like :
c(seq(0, max(df$data1), by=1000000),max(df$data1))
EDIT 2 : The following, using ddply from plyr, will return both the mean and the max of data2 in a data frame :
library(plyr)
group <- cut(df$data1, c(seq(0, max(df$data1), by=1000000),max(df$data1)))
ddply(df, .(group), summarize, mean=mean(data2), max=max(data2))
# group mean max
# 1 (0,1e+06] 0.81247926 1.000
# 2 (1e+06,2e+06] 0.33537864 0.496
# 3 (2e+06,2.96e+06] 0.04204981 0.050
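Note that cut builds right-closed intervals by default, so the row with data1 equal to 1000000 lands in (0,1e+06]. That is why the first two means above differ from the ones requested in the question. A minimal sketch, assuming left-closed intervals are wanted instead; the names breaks, group_left and means_left are illustrative:

```r
# Sample data from the question
df <- data.frame(
  data1 = c(800000, 800030, 1000000, 1686626, 1687492,
            2148463, 2850823, 2959087),
  data2 = c(1, 0.956521739130435, 0.480916030534351, 0.496,
            0.174757281553398, 0.0344827586206897, 0.05,
            0.0416666666666667)
)

# Breaks at every multiple of 1e6, ending strictly above max(data1)
upper <- (floor(max(df$data1) / 1e6) + 1) * 1e6
breaks <- seq(0, upper, by = 1e6)

# right = FALSE makes the intervals [a, b), so 1000000 starts the second bin
group_left <- cut(df$data1, breaks, right = FALSE)

means_left <- tapply(df$data2, group_left, mean)
means_left
```

With this grouping the three means should agree with the sample output in the question.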
Solution 3:[3]
For the sake of diversity, here's another solution using split:
sapply(split(df,df$data1%/%1e6), function(x)mean(x$data2))
0 1 2
0.97826087 0.38389110 0.04204981
Edit: or even simpler:
sapply(split(df$data2,df$data1%/%1e6), mean)
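The same split can also produce the data frame requested in the question. A minimal sketch, assuming the rows are already ordered by data1 so that the last row of each piece holds the last value in its interval; the names pieces, res_split, last and mean are illustrative:

```r
# Sample data from the question
df <- data.frame(
  data1 = c(800000, 800030, 1000000, 1686626, 1687492,
            2148463, 2850823, 2959087),
  data2 = c(1, 0.956521739130435, 0.480916030534351, 0.496,
            0.174757281553398, 0.0344827586206897, 0.05,
            0.0416666666666667)
)

# One data frame per 1e6-wide bin
pieces <- split(df, df$data1 %/% 1e6)

# For each piece keep the last data1 value and the mean of data2,
# then stack the one-row results into a single data frame
res_split <- do.call(rbind, lapply(pieces, function(x) {
  data.frame(last = x$data1[nrow(x)], mean = mean(x$data2))
}))
res_split
```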
Solution 4:[4]
For the sake of completeness, I would like to propose a solution in the tidyverse universe:
library(tidyverse)
df <- tribble(
~data1, ~data2,
800000, 1,
800030, 0.956521739130435,
1000000, 0.480916030534351,
1686626, 0.496,
1687492, 0.174757281553398,
2148463, 0.0344827586206897,
2850823, 0.05,
2959087, 0.0416666666666667)
df |>
mutate(group= data1 %/% 1000000) |>
group_by(group) |>
summarize(max_in_group = max(data1), mean_data2 = mean(data2), n_in_group = n())
#> # A tibble: 3 x 4
#> group max_in_group mean_data2 n_in_group
#> <dbl> <dbl> <dbl> <int>
#> 1 0 800030 0.978 2
#> 2 1 1687492 0.384 3
#> 3 2 2959087 0.0420 3
Although not all decimals appear in the output above, that is only how the tibble is printed. All decimals are computed, as shown below:
df |>
mutate(group= data1 %/% 1000000) |>
group_by(group) |>
summarize(max_in_group = max(data1), mean_data2 = mean(data2), n_in_group = n()) |>
pull(mean_data2)
#> [1] 0.97826087 0.38389110 0.04204981
Created on 2022-03-05 by the reprex package (v2.0.1)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 | |
| Solution 4 | josep maria porrà |
