'Extracting specific number of rows from dataframe

I have a csv file having two columns i.e. imagename and ID. There are multiple image names for same ID as shown in picture. Number of image names against id is different. I need to extract same number of image names against ids and drop the exceeded rows . For example if id 5 has the lowest number of images say 8 then all ids would have corresponding 8 image names.

Table format

This code is extracting first 100 id's but number of images against each ID is different i.e. ID =1 has 11 images but Id=2 has 24 images and so on

`select_id = df.loc[df['id'] <= 100]`

Expected output equal number of images against each ID Expected output



Solution 1:[1]

You can do this:

# get min number of occurrences
min_len = df['id'].value_counts().min()

# group by and get head for each group
df.groupby('id').head(min_len)

Solution 2:[2]

Welcome to Stack Overflow! When asking a question on here its a good idea to provide a 'minimum reproduceable example" and example data (https://stackoverflow.com/help/minimal-reproducible-example). I created some example data for you this time. See the code below along with how to subset like you are describing.

# Create some example data
id <- c(0,0,0,1,1,1,1,2,2,3,3,3,3,4,4,4,4,4,4,4)
file_names <- paste("file_number_",seq(1:length(id)),".jpg",sep = "")
dataframe <- data.frame(id = id, file_names = file_names)

# view the data
dataframe

This initial data looks like:

# id         file_names
# 1   0  file_number_1.jpg
# 2   0  file_number_2.jpg
# 3   0  file_number_3.jpg
# 4   1  file_number_4.jpg
# 5   1  file_number_5.jpg
# 6   1  file_number_6.jpg
# 7   1  file_number_7.jpg
# 8   2  file_number_8.jpg
# 9   2  file_number_9.jpg
# 10  3 file_number_10.jpg
# 11  3 file_number_11.jpg
# 12  3 file_number_12.jpg
# 13  3 file_number_13.jpg
# 14  4 file_number_14.jpg
# 15  4 file_number_15.jpg
# 16  4 file_number_16.jpg
# 17  4 file_number_17.jpg
# 18  4 file_number_18.jpg
# 19  4 file_number_19.jpg
# 20  4 file_number_20.jpg

Now to subset as you describe.

# find the lowest amount of entries for an id
min <- min(table(dataframe$id))

#create and empty df for your subset
final_dataframe <- data.frame()

# get the first two images for every id
for (i in 1:length(unique(dataframe$id))) {
  final_dataframe <- rbind(dataframe[which(dataframe$id == unique(dataframe$id)[i]),][1:2,],final_dataframe)
}

# view
final_dataframe

Final dataframe:

# id         file_names
# 14  4 file_number_14.jpg
# 15  4 file_number_15.jpg
# 10  3 file_number_10.jpg
# 11  3 file_number_11.jpg
# 8   2  file_number_8.jpg
# 9   2  file_number_9.jpg
# 4   1  file_number_4.jpg
# 5   1  file_number_5.jpg
# 1   0  file_number_1.jpg
# 2   0  file_number_2.jpg

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 D.Manasreh
Solution 2 Henry Holm