R: select all rows that belong to the maximum count

In a monitoring scheme each species (A, B, ...) is counted at least twice in each area (a1, a2, ...). For the final result the rows from the sample with the highest total count need to be selected for each area and species.
Example data:
data_joined <- data.frame("species" = c("A","A","A","A","A","A","B","B","B","B"),
                          "area" = c("a1","a1","a1","a1","a1","a2","a1","a1","a2","a2"),
                          "sample_nr" = c(1,1,1,1,2,2,1,1,2,2),
                          "count" = c(1,1,1,1,6,1,1,1,3,3))
My current solution is pasted below. However, the loop is very slow on the original data, which contains about 18,000 rows. I can imagine there are much faster and more elegant solutions. The original data are in sf format and include geometries, which need to be kept after selection.
i_list <- list() # empty list
for (i in unique(data_joined$area)) { # all areas that are in the data
  loop_i_data <- data_joined[data_joined$area == i, ] # select data for area i
  j_list <- list()
  for (j in unique(data_joined$species)) { # all species that are in the data
    loop_j_data <- loop_i_data[loop_i_data$species == j, ] # select data of species j in area i
    max_select <- which.max(
      c(sum(loop_j_data[loop_j_data$sample_nr == "1", ]$count, na.rm = TRUE), # sum first count
        sum(loop_j_data[loop_j_data$sample_nr == "2", ]$count, na.rm = TRUE), # sum second count
        sum(loop_j_data[loop_j_data$sample_nr == "3", ]$count, na.rm = TRUE), # sum third count
        sum(loop_j_data[loop_j_data$sample_nr == "4", ]$count, na.rm = TRUE), # sum fourth count
        sum(loop_j_data[loop_j_data$sample_nr == "5", ]$count, na.rm = TRUE), # sum fifth count
        sum(loop_j_data[loop_j_data$sample_nr == "6", ]$count, na.rm = TRUE), # sum sixth count
        sum(loop_j_data[loop_j_data$sample_nr == "7", ]$count, na.rm = TRUE))) # sum seventh count
    j_list[[j]] <- loop_j_data[loop_j_data$sample_nr == max_select, ] # add maximum count occasion to list
  }
  i_list[[i]] <- do.call(rbind, j_list)
}
data_final <- do.call(rbind, i_list) # rbind all data
row.names(data_final) <- NULL
data_final

Using dplyr, we can find the sum of count for each species, area and sample_nr, keep the sample_nr with the maximum total, and then join back to the original data to recover all matching rows for each species and area.
library(dplyr)
data_joined %>%
  group_by(species, area, sample_nr) %>%
  summarise(n = sum(count)) %>%
  slice(which.max(n)) %>%
  left_join(data_joined) %>%
  select(-n)
#  species area  sample_nr count
#  <fct>   <fct>     <dbl> <dbl>
#1 A       a1            2     6
#2 A       a2            2     1
#3 B       a1            1     1
#4 B       a1            1     1
#5 B       a2            2     3
#6 B       a2            2     3
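If you need to keep the sf geometries, a variant worth trying (a sketch, untested on sf data) avoids the summarise()/join round trip: compute each sample's total within species and area, then keep the rows of the sample with the largest total. Because it only uses grouped verbs, for which sf provides methods, the geometry column should be carried along. Note that if two samples tie for the maximum total, this keeps both, whereas which.max() keeps only the first.
library(dplyr)
data_joined %>%
  group_by(species, area, sample_nr) %>%
  mutate(sample_total = sum(count, na.rm = TRUE)) %>% # total per sample
  group_by(species, area) %>%
  filter(sample_total == max(sample_total)) %>%       # rows of the top sample
  ungroup() %>%
  select(-sample_total)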

What is the best way to rewrite (simplify) the logic below to produce the same result in R?

I need to extract a sample that has an equal distribution in each experience-level group. For your info, there are 4 groups in total (1, 2, 3, 4 years of exp) and 8 people in total (A, B, C, D, E, F, G, H) in this example scenario. I was trying to come up with a function with loops, but I don't know how. Please help me out! Thank you! :)
library(tidyverse)
data <- tibble(id = c("A","A","A","B","B","C","C","D","D","D","D","E","E","E","E","F","F","G","G","G","H","H","H","H"),
               year_exp = c(1,2,3,1,2,1,2,1,2,3,4,1,2,3,4,1,2,1,2,3,1,2,3,4),
               pre_year_exp = year_exp - 1)
data_0 <- data %>% filter(year_exp == max(year_exp) - 0) %>% sample_n(2)
data_1 <- data %>% filter(year_exp == max(year_exp) - 1) %>% anti_join(data_0, by = 'id') %>% sample_n(2)
data_2 <- data %>% filter(year_exp == max(year_exp) - 2) %>% anti_join(data_0, by = 'id') %>% anti_join(data_1, by = 'id') %>% sample_n(2)
data_3 <- data %>% filter(year_exp == max(year_exp) - 3) %>% anti_join(data_0, by = 'id') %>% anti_join(data_1, by = 'id') %>% anti_join(data_2, by = 'id')
#Result Table
result <- data_0 %>% bind_rows(data_1, data_2, data_3)
result
The code below produces the same output as yours and extends the idea to an arbitrary number of year_exp values using a for loop.
Please note that because this simply extends your code, it shares the following (possibly undesirable) features with it:
The code moves sequentially through the groups, sampling from those members of later groups who were not sampled for earlier groups. Accordingly, there is a risk that the code throws an error because it tries to sample from a group whose members have all already been sampled for previous groups.
The probabilities of selection are not uniform across the members of a group, so the samples drawn from each group are not representative of that group.
If the data were instead a balanced panel, there would be much more efficient and simpler ways to accomplish this (see the sketch after the output below).
library(tibble)
library(dplyr)
set.seed(123)
# Create original data
data <- tibble(id = c("A","A","A","B","B","C","C","D","D","D","D","E","E","E","E","F","F","G","G","G","H","H","H","H"),
               year_exp = c(1,2,3,1,2,1,2,1,2,3,4,1,2,3,4,1,2,1,2,3,1,2,3,4),
               pre_year_exp = year_exp - 1)
# Assign values to parameters used by/in the loop.
J <- data$id %>% unique %>% length       # unique units/persons (8)
K <- data$year_exp %>% unique %>% length # unique groups/years (4)
N <- 2                                   # sample size per group (2)
# Initialize objects the loop will modify
samples_list <- vector(mode = "list", length = K) # stores each sample
used_ids <- rep(NA_character_, J)                 # stores used ids
index <- 1:N                                      # initial indices for used ids
# For-loop solution
for (k in 1:K) {
  # Identifier for the current group
  cur_group <- 1 + K - k
  # Sample from persons in the current group who were not previously sampled
  one_sample <- data %>%
    filter(year_exp == cur_group, !(id %in% used_ids)) %>%
    slice_sample(n = N)
  # Save the sample and the id values of those sampled
  samples_list[[k]] <- one_sample
  used_ids[index] <- one_sample$id
  index <- index + N
}
# Bind into a single data.frame
bind_rows(samples_list)
#> # A tibble: 8 x 3
#>   id    year_exp pre_year_exp
#>   <chr>    <dbl>        <dbl>
#> 1 H            4            3
#> 2 D            4            3
#> 3 G            3            2
#> 4 E            3            2
#> 5 C            2            1
#> 6 B            2            1
#> 7 F            1            0
#> 8 A            1            0
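For completeness, here is a hypothetical sketch of the balanced-panel case mentioned above. If every id appeared in every year_exp group, one random permutation of the ids could simply be partitioned into the K groups: no sequential sampling, and every member of a group has the same selection probability. The group sizes (N per group) are carried over from the loop; this is an illustration only, since the example data are not actually balanced.
# Hypothetical sketch: assumes a balanced panel (every id in every group)
set.seed(123)
ids <- sample(unique(data$id))                      # one shuffle of all ids
assignment <- tibble(id = ids,
                     year_exp = rep(K:1, each = N)) # first N ids -> group K, etc.
# keep each person's row for the group they were assigned to
data %>% semi_join(assignment, by = c("id", "year_exp"))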

Finding the exact match in the values in the categorical variables

I wanted to find an exact match in the values across all three columns (rg1, rg2, rg3). Below is my dataframe.
For instance, the first row has the combination (70, 71, 72); if this same combination appears in the remaining rows for other user ids, then keep only those users and delete the rest.
To describe it further: the first row has (70, 71, 72), and say row 10 had the same values in those three columns; then I just want to display row 1 and row 10 (using R).
I tried doing clustering on this with kmodes, but I'm not getting the expected results. The current code is grouping all the rgs, but it is really only validating the single rg that appears most frequently in the data frame and ranking accordingly.
Can someone please guide me on this? Is there a better way to do this?
kmodes <- klaR::kmodes(mapped_df, modes = 5, iter.max = 10, weighted = FALSE)
# Add these clusters to the main dataframe
final <- mapped_df %>%
  mutate(cluster = kmodes$cluster)
You can sort across the columns, then look for duplicates.
library(tibble)
set.seed(1234)
df <- tibble(Userids = 1:20,
             rg_1 = sample(1:20, 20, TRUE),
             rg_2 = sample(1:20, 20, TRUE),
             rg_3 = sample(1:20, 20, TRUE))
# make row 4 a reversed copy of row 15 so there is a duplicate to find
df[4, -1] <- rev(df[15, -1])
# sort across the columns
df_sorted <- t(apply(df[-1], 1, sort))
# return the duplicated rows
df[duplicated(df_sorted) | duplicated(df_sorted, fromLast = TRUE), ]
This will give you a data frame with all the duplicated rows. Once you have the sorted data frame, it should be easy enough to find whatever else you need; one sketch follows the output below.
  Userids  rg_1  rg_2  rg_3
    <int> <int> <int> <int>
1       4    16    17     6
2      15     6    17    16
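For instance, a possible sketch (not part of the answer above) that shows which rows match which: collapse each sorted row into a string key and group on it.
library(dplyr)
df %>%
  rowwise() %>%
  mutate(key = paste(sort(c_across(starts_with("rg_"))), collapse = "_")) %>%
  ungroup() %>%
  group_by(key) %>%
  filter(n() > 1) %>% # keep only keys shared by at least two rows
  arrange(key)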
I still do not understand precisely what you are looking for. Besides, it is always recommended to include the data frame you are referring to.
I can suggest a solution that uses a threshold value: for each row, if any of the differences (between rg1 and rg2, rg1 and rg3, or rg2 and rg3) is higher than the threshold, the row is not considered.
library(dplyr)
threshold <- 5
# pairwise absolute differences only (transmute drops the original columns)
diffs <- mapped_df %>%
  transmute(g1_g2 = abs(rg1 - rg2),
            g1_g3 = abs(rg1 - rg3),
            g2_g3 = abs(rg2 - rg3))
# keep rows where every pairwise difference is within the threshold
index <- apply(diffs, 1, function(x, threshold) all(x <= threshold),
               threshold = threshold)
mapped_df[index, ]
Maybe you're (just) after some filtering?
library(tidyverse)
data <- tibble(Userids = 1:10,
               rg1 = c(70, 1:8, 70),
               rg2 = c(71, 11:18, 71),
               rg3 = c(72, 21:28, 72))
data |>
  filter(rg1 == 70,
         rg2 == 71,
         rg3 == 72)
data |>
  filter(rg1 == rg1[row_number() == 1],
         rg2 == rg2[row_number() == 1],
         rg3 == rg3[row_number() == 1])
Output:
# A tibble: 2 × 4
  Userids   rg1   rg2   rg3
    <int> <dbl> <dbl> <dbl>
1       1    70    71    72
2      10    70    71    72
Or combine them for ease:
data |>
  unite(rg, starts_with("rg")) |>
  filter(rg == rg[row_number() == 1])
Output:
# A tibble: 2 × 2
  Userids rg
    <int> <chr>
1       1 70_71_72
2      10 70_71_72

R - Efficiently counting number of switches in binary variable for each group

To give some context, I have a dataframe of eyetracking data from a psychology experiment and I want to count the switches between two Areas Of Interest (AOI), for each participant.
Here's a simplified dataframe of the problem (we assume that AOI2 == !AOI1 so we don't need it):
library(tidyverse)
df <- tibble(Participant = rep(1:7, times = 1, each = 10),
             Time = rep(1:10, 7),
             AOI1 = rbinom(70, 1, .5))
What I want is to count how many times the value of AOI1 changes over time for each participant. I could do it using for loops like below, but I was wondering if there is a simpler, more idiomatic R way of doing it?
df.switches <- tibble(Participant = 1:7,
                      Switches = NA)
for (p in 1:7) {
  s <- 0
  for (i in 2:10) {
    if (subset(df, Participant == p & Time == i, select = AOI1) !=
        subset(df, Participant == p & Time == i - 1, select = AOI1)) {
      s <- s + 1
    }
  }
  df.switches <- df.switches %>%
    mutate(Switches = ifelse(Participant == p, s, Switches))
}
One option is to use dplyr::lag to compare each value of AOI1 with the one on the previous row and count the changes for each participant. Note that supplying a default that can never equal AOI1 makes the first row of each participant count as one change, so the result is the number of runs; use default = first(AOI1) if you want only the switches. (df was generated without a seed, so your counts will differ from those below.)
library(tidyverse)
df %>%
  group_by(Participant) %>%
  # default = -1 never matches AOI1 (0/1); -Inf also works in older dplyr,
  # but recent versions refuse to cast it to integer
  summarise(count = sum(AOI1 != lag(AOI1, default = -1)))
# # A tibble: 7 x 2
#   Participant count
#         <int> <int>
# 1           1     5
# 2           2     4
# 3           3     5
# 4           4     4
# 5           5     6
# 6           6     6
# 7           7     4
Since you are already using the tidyverse, you can use lag available as part of dplyr. This checks whether the value of AOI1 is the same as the previous value, and if not, sets a flag to 1. For the first record of each participant, the value is automatically set to NA. Note that the group_by is required, otherwise the flag won't get "reset" every time a new participant is encountered. Also it is assumed that the data is sorted by Participant and Time; if not, pipe arrange(Participant, Time) before the group_by.
df <- tibble(Participant = rep(1:7, times = 1, each = 10),
             Time = rep(1:10, 7),
             AOI1 = rbinom(70, 1, .5))
df2 <- df %>%
  group_by(Participant) %>%
  mutate(switch = ifelse(AOI1 != lag(AOI1), 1, 0)) %>%
  summarise(num_switches = sum(switch, na.rm = TRUE))
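For reference, a base R sketch of the same count (assuming the data are already sorted by Time within Participant): the number of switches is the number of runs in AOI1 minus one.
# rle() collapses consecutive repeats into runs; switches = runs - 1
tapply(df$AOI1, df$Participant, function(x) length(rle(x)$values) - 1)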

Computing average over different columns/rows in a list of data.frames

I have a list of 140 elements of type data.frame ('my.list'). I would like to compute 350 averages over certain value ranges in a certain column for a certain set of rows in a certain data.frame (this is a bit cryptic); so, 350 different averages like:
Of data.frame #1, the average of column 'Measure1', row 1:5;
Of data.frame #2, the average of column 'Measure3', row 1:4, etc. etc.
I have another data.frame ('my.dfAverage') which indicates for which data.frame, column and rows it needs the average. I want to write the 350 different averages and standard deviations to this data.frame (so with the columns: 'average_id', 'dataframe_number', 'column_name', 'row_numbers', 'average' and 'st_dev'). Some value ranges have NA's, these values can be dropped for computing the average.
What is the best way to automatically compute the 350 averages and standard deviations from the list of data.frames based on the info in this data.frame? I thought of writing a for loop (or maybe using lapply?), but I'm quite new to these functions, so I'm not sure which way to go here.
Small reproducible example of my list of data.frames:
my.df1 <- data.frame(ID = c(1:5),
                     Measure1 = c(2247,2247,1970,1964,1971),
                     Measure2 = c(2247,2247,NA,1964,1971))
my.df2 <- data.frame(ID = c(1:4),
                     Measure3 = c(2247,NA,1970,1964),
                     Measure5 = c(2247,2247,NA,1964))
my.df3 <- data.frame(ID = c(1:4),
                     Measure6 = c(2247,600,1970,1964),
                     Measure8 = c(2247,2247,NA,1964))
my.list <- list(list1 = my.df1, list2 = my.df2, list3 = my.df3)
Desired output table for the averages and standard deviation:
my.dfAverage <- data.frame(average_id = c(1:3),
                           dataframe_number = c(1,2,3),
                           column_name = c('Measure1','Measure3','Measure6'),
                           row_numbers = c('1:3','1:4','1:2'),
                           average = (NA),
                           st_dev = (NA))
This is a different approach from the one given above, using only base R functions. One point to note: make sure the data frame was created with stringsAsFactors = FALSE. The idea is to write a helper that indexes my.list correctly and applies a summary function f as f(..., na.rm = TRUE), via mapply:
fun1 <- function(f) {
  with(my.dfAverage,
       mapply(function(x, y, z) f(x[eval(parse(text = y)), z], na.rm = TRUE),
              my.list, row_numbers, column_name))
}
transform(my.dfAverage, average = fun1(mean), st_dev = fun1(sd))
  average_id dataframe_number column_name row_numbers  average    st_dev
1          1                1    Measure1         1:3 2154.667  159.9260
2          2                2    Measure3         1:4 2060.333  161.6859
3          3                3    Measure6         1:2 1423.500 1164.6049
Data Used:
my.dfAverage <- data.frame(average_id = c(1:3),
                           dataframe_number = c(1,2,3),
                           column_name = c('Measure1','Measure3','Measure6'),
                           row_numbers = c('1:3','1:4','1:2'),
                           average = (NA),
                           st_dev = (NA),
                           stringsAsFactors = FALSE)
A solution using tidyverse.
First, expand the my.dfAverage based on row_numbers.
library(tidyverse)
my.dfAverage2 <- my.dfAverage %>%
  separate(row_numbers, into = c("start", "end")) %>%
  mutate(row_numbers = map2(start, end, `:`)) %>%
  unnest() %>%
  select(-start, -end) %>%
  mutate(row_numbers = as.integer(row_numbers),
         dataframe_number = as.integer(dataframe_number))
Second, transform all data frames in my.list and combine them to a single data frame.
my.list.df <- my.list %>%
  setNames(1:length(.)) %>%
  map_dfr(function(x){
    x2 <- x %>%
      gather(column_name, value, -ID)
    return(x2)
  }, .id = "dataframe_number") %>%
  mutate(ID = as.integer(ID), dataframe_number = as.integer(dataframe_number)) %>%
  rename(row_numbers = ID)
Third, merge my.dfAverage2 and my.list.df and calculate the mean and standard deviation. my.dfAverage3 is the final output.
my.dfAverage3 <- my.dfAverage2 %>%
  left_join(my.list.df, by = c("dataframe_number", "column_name", "row_numbers")) %>%
  group_by(average_id, dataframe_number, column_name) %>%
  summarise(row_numbers = paste(min(row_numbers), max(row_numbers), sep = ":"),
            average = mean(value, na.rm = TRUE),
            st_dev = sd(value, na.rm = TRUE)) %>%
  ungroup()
my.dfAverage3
# # A tibble: 3 x 6
#   average_id dataframe_number column_name row_numbers average st_dev
#        <int>            <int> <chr>       <chr>         <dbl>  <dbl>
# 1          1                1 Measure1    1:3            2155    160
# 2          2                2 Measure3    1:4            2060    162
# 3          3                3 Measure6    1:2            1424   1165
DATA
my.list is the same as OP's my.list.
my.dfAverage <- data.frame(average_id = c(1:3),
                           dataframe_number = c(1,2,3),
                           column_name = c('Measure1','Measure3','Measure6'),
                           row_numbers = c('1:3','1:4','1:2'))

Add a column with count of NAs and Mean

I have a data frame and I need to add another column to it which shows the count of NAs in all the other columns for that row and also the mean of the non-NA values.
I think it can be done in dplyr.
> df1 <- data.frame(a = 1:5, b = c(1,2,NA,4,NA), c = c(NA,2,3,NA,NA))
> df1
  a  b  c
1 1  1 NA
2 2  2  2
3 3 NA  3
4 4  4 NA
5 5 NA NA
I want to mutate another column which counts the number of NAs in that row, and another column which shows the mean of all the non-NA values in that row.
library(dplyr)
count_na <- function(x) sum(is.na(x))
df1 %>%
  mutate(means = rowMeans(., na.rm = TRUE),
         count_na = apply(., 1, count_na))
#### ANSWER FOR RADEK ####
elected_cols <- c('b', 'c')
df1 %>%
  mutate(means = rowMeans(.[elected_cols], na.rm = TRUE),
         count_na = apply(.[elected_cols], 1, count_na))
As mentioned here https://stackoverflow.com/a/37732069/2292993
df1 <- data.frame(a = 1:5, b = c(1,2,NA,4,NA), c = c(NA,2,3,NA,NA))
df1 %>%
  mutate(means = rowMeans(., na.rm = TRUE),
         count_na = rowSums(is.na(.)))
To work on selected cols (the example here is for col a and col c):
df1 %>%
  mutate(means = rowMeans(., na.rm = TRUE),
         count_na = rowSums(is.na(select(., one_of(c('a','c'))))))
You can try this:
#Find the row mean and add it to a new column in the dataframe
df1$Mean <- rowMeans(df1, na.rm = TRUE)
#Find the count of NA and add it to a new column in the dataframe
df1$CountNa <- rowSums(apply(is.na(df1), 2, as.numeric))
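A small simplification worth noting: rowSums() accepts the logical matrix that is.na() returns directly, so the apply() step above can be dropped; this computes the same counts.
df1$CountNa <- rowSums(is.na(df1))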
I recently faced a variation on this question where I needed to compute the percent of complete values, but for specific variables (not all variables). Here is an approach that worked for me.
df1 %>%
  # create dummy variables representing if the observation is missing ----
  # can modify here for specific variables ----
  mutate_all(list(dummy = is.na)) %>%
  # compute a row-wise sum of missing ----
  rowwise() %>%
  mutate(
    # number of missing observations ----
    n_miss = sum(c_across(matches("_dummy"))),
    # percent of observations that are complete (non-missing) ----
    pct_complete = 1 - mean(c_across(matches("_dummy")))
  ) %>%
  # remove grouping from rowwise ----
  ungroup() %>%
  # remove dummy variables ----
  dplyr::select(-matches("dummy"))
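A vectorized sketch of the same idea without rowwise(), under the assumption that every column should be checked (swap everything() for a specific selection to restrict it):
df1 %>%
  mutate(n_miss = rowSums(is.na(across(everything()))), # NAs per row
         pct_complete = 1 - n_miss / ncol(df1))         # share of complete values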
