Finding the exact match in the values in the categorical variables

Finding the exact match in the values in the categorical variables - r

I wanted to find an exact match in the values between all three columns (rg1,rg2,rg3).Below is my dataframe.
For instance - first row has a combination of (70,71,72) , if this same combination appears in the remaining rows for the rest of the user ids , then, keep only those users and delete rest.
To describe it further - first row has (70,71,72) and say , if row 10 had the same values in B,C,D column, then I just want to display row 1 and row 10.(using R)
I tried doing clustering on this - kmodes. But I'm not getting the expected results.The current code is grouping all the rgs but it's kind of validating only a single Rg that has appeared most frequently in the data frame(above is my dataframe) and ranking them accordingly.
Can someone please guide me on this?Is there any better way to do this?
kmodes <- klaR::kmodes(mapped_df, modes= 5, iter.max = 10, weighted = FALSE)
#Add these clusters to the main dataframe
final <- mapped_df %>%
mutate(cluster = kmodes$cluster)

You can sort across the columns, then look for duplicates.
set.seed(1234)
df <- tibble(Userids = 1:20,
rg_1 = sample(1:20, 20, TRUE),
rg_2 = sample(1:20, 20, TRUE),
rg_3 = sample(1:20, 20, TRUE))
df[4, -1] <- rev(df[15, -1])
# sort across the columns
df_sorted <- t(apply(df[-1], 1, sort))
# return the duplicated rows
df[duplicated(df_sorted) | duplicated(df_sorted, fromLast = TRUE), ]
This will give you a data frame with all the duplicated values. Once you have the sorted data frame, it should be easy enough to find what you need.
Userids rg_1 rg_2 rg_3
<int> <int> <int> <int>
1 4 16 17 6
2 15 6 17 16

I still do not understand what are you precisely looking for. Besides, it is always recomended to include the data frame you are refering.
I could suggest a solution, which implies the use of a threshold value. So, for each row, if some of the differences (between rg1-rg2, rg1-rg3 and rg2-rg3) is higher than the threshold, it will not be consider.
threshold <- 5
index <- mapped_df %>%
tibble(g1_g2 = abs(rg1 - rg2),
g1_g3 = abs(rg1 - rg3),
g2_g3 = abs(rg2 - rg3)) %>%
apply(1, function(x, threshold) all(x <= threshold),
threshold = threshold)
mapped_df[index]

Maybe you're (just) after some filtering?
library(tidyverse)
data <- tibble(Userids = 1:10,
rg1 = c(70,1:8,70),
rg2 = c(71,11:18,71),
rg3 = c(72,21:28,72))
data |>
filter(rg1 == 70,
rg2 == 71,
rg3 == 72)
data |>
filter(rg1 == rg1[row_number()==1],
rg2 == rg2[row_number()==1],
rg3 == rg3[row_number()==1])
Output:
# A tibble: 2 × 4
Userids rg1 rg2 rg3
<int> <dbl> <dbl> <dbl>
1 1 70 71 72
2 10 70 71 72
Or combine them for ease:
data |>
unite(rg, starts_with("rg")) |>
filter(rg == rg[row_number()==1])
Output:
# A tibble: 2 × 2
Userids rg
<int> <chr>
1 1 70_71_72
2 10 70_71_72

Related

How to partition into equal sum subsets in R?

I have a dataset with a column, X1, of various values. I would like to order this dataset by the value of X1, and then partition into K number of equal sum subsets. How can this be accomplished in R? I am able to find quartiles for X1 and append the quartile groupings as a new column to the dataset, however, quartile is not quite what I'm looking for. Thank you in advance!
df <- data.frame(replicate(10,sample(0:1000,1000,rep=TRUE)))
df <- within(df, quartile <- as.integer(cut(X1, quantile(X1, probs=0:4/4), include.lowest=TRUE)))

Here's a rough solution (using set.seed(47) if you want to reproduce exactly). I calculate the proportion of the sum for each row, and do the cumsum of that proportion, and then cut that into the desired number of buckets.
library(dplyr)
n_groups = 10
df %>% arrange(X1) %>%
mutate(
prop = X1 / sum(X1),
cprop = cumsum(prop),
bins = cut(cprop, breaks = n_groups - 1)
) %>%
group_by(bins) %>%
summarize(
group_n = n(),
group_sum = sum(X1)
)
# # A tibble: 9 × 3
# bins group_n group_sum
# <fct> <int> <int>
# 1 (-0.001,0.111] 322 54959
# 2 (0.111,0.222] 141 54867
# 3 (0.222,0.333] 111 55186
# 4 (0.333,0.444] 92 55074
# 5 (0.444,0.556] 80 54976
# 6 (0.556,0.667] 71 54574
# 7 (0.667,0.778] 66 55531
# 8 (0.778,0.889] 60 54731
# 9 (0.889,1] 57 55397
This could of course be simplified--you don't need to keep around the extra columns, just mutate(bins = cut(cumsum(X1 / sum(X1)), breaks = n_groups - 1)) will add the bins column to the original data (and no other columns), and the group_by() %>% summarize() is just to diagnose the result.

R: select all rows that belong to maximum count

In a monitoring scheme each species (A, B, ...) is counted at least twice in each area (a1, a2, ...). For the final result the rows from the sample with the highest total count need to be selected for each area and species.
Example data:
data_joined <- data.frame("species" = c("A","A","A","A","A","A","B","B","B","B"),
"area" = c("a1","a1","a1","a1","a1","a2","a1","a1","a2","a2"),
"sample_nr" = c(1,1,1,1,2,2,1,1,2,2),
"count" = c(1,1,1,1,6,1,1,1,3,3))
My current solution is pasted below. However, the loop is very slow on the original data which contains about 18,000 rows. I can imagine there are much faster and elegant solutions. The original data are in sf format and include geometries which need to be kept after selection.
i_list <- list() # empty list
for (i in unique(data_joined$area)) # all areas that are in the data
{
loop_i_data <- data_joined[data_joined$area == i,] # select data for area i
j_list <- list()
for(j in unique(data_joined$species)) # all species that are in the data
{
loop_j_data <- loop_i_data[loop_i_data$species == j,] # select data of species j in area i
max_select <- which.max(
c(sum(loop_j_data[loop_j_data$sample_nr == "1",]$count, na.rm = TRUE), # sum first count
sum(loop_j_data[loop_j_data$sample_nr == "2",]$count, na.rm = TRUE), # sum second count
sum(loop_j_data[loop_j_data$sample_nr == "3",]$count, na.rm = TRUE), # sum third count
sum(loop_j_data[loop_j_data$sample_nr == "4",]$count, na.rm = TRUE), # sum fourth count
sum(loop_j_data[loop_j_data$sample_nr == "5",]$count, na.rm = TRUE), # sum fifth count
sum(loop_j_data[loop_j_data$sample_nr == "6",]$count, na.rm = TRUE), # sum sixth count
sum(loop_j_data[loop_j_data$sample_nr == "7",]$count, na.rm = TRUE))) # sum seventh count
j_list[[j]] <- loop_j_data[loop_j_data$sample_nr == max_select,] # add maximum count occasion to list
}
i_list[[i]] <- do.call(rbind, j_list)
}
data_final = do.call(rbind, i_list) # rbind all data
row.names(data_final) <- NULL
data_final

Using dplyr, we can find the sum of count for each species, area and sample_nr and select all the rows with max count in each species and area.
library(dplyr)
data_joined %>%
group_by(species, area, sample_nr) %>%
summarise(n = sum(count)) %>%
slice(which.max(n)) %>%
left_join(data_joined) %>%
select(-n)
# species area sample_nr count
# <fct> <fct> <dbl> <dbl>
#1 A a1 2 6
#2 A a2 2 1
#3 B a1 1 1
#4 B a1 1 1
#5 B a2 2 3
#6 B a2 2 3

How to quickly create multiple summary tables with group_by() / summarise()?

I have a data frame with N vars, M categorical and 2 numeric. I would like to create M data frames, one for each categorical variable.
Eg.,
data %>%
group_by(var1) %>%
summarise(sumVar5 = sum(var5),
meanVar6 = mean(var6))
data %>%
group_by(varM) %>%
summarise(sumVar5 = sum(var5),
meanVar6 = mean(var6))
etc...
Is there a way to iterate through the categorical variables and generate each of the summary tables? That is, without needing to repeat the above chunks M times.
Alternatively, these summary tables don't have to be individual objects, as long as I can easily reference / pull the summaries for each of the M variables.

Here is a solution (I hope). Creates a list of data frames with the formula you have:
library(tidyverse)
# Create sample data frame
data <- data.frame(var1 = sample(1:2, 5, replace = T),
var2 = sample(1:2, 5, replace = T),
var3 = sample(1:2, 5, replace = T),
varM = sample(1:2, 5, replace = T),
var5 = rnorm(5, 3, 6),
var6 = rnorm(5, 3, 6))
# Vars to be grouped (var1 until varM in this example)
vars_to_be_used <- names(select(data, var1:varM))
# Function to be used
group_fun <- function(x, .df = data) {
.df %>%
group_by_(.x) %>%
summarise(sumVar5 = sum(var5),
meanVar6 = mean(var6))
}
# Loop over vars
results <- map(vars_to_be_used, group_fun)
# Nice list names
names(results) <- vars_to_be_used
print(results)

You didn't supply a sample data.set so I created a small example to show how it works.
data <- data_frame(var1 = rep(letters[1:5], 2),
var2 = rep(LETTERS[11:15], 2),
var3 = 1:10,
var4 = 11:20)
A combination of tidyverse packages can get you where you need to be.
Steps used: First we gather all the columns we want to group by on in a cols column and keep the numeric vars separate. Next we split the data.frame in a list of data.frames so that every column we want to group by on has it's own table with the 2 numeric vars. Now that everything is in a list, we need to use the map functionality from the purrr package. Using map, we spread the data.frame again so the column names are as we expect them to be. Finally using map we use group_by_if to group by on the character column and summarise the rest. All the outcomes are stored in a list where you can access what you need.
Run the code in pieces to see what every step does.
library(dplyr)
library(purrr)
library(tidyr)
outcomes <- data %>%
gather(cols, value, -c(var3, var4)) %>%
split(.$cols) %>%
map(~ spread(.x, cols, value)) %>%
map(~ group_by_if(.x, is.character) %>%
summarise(sumvar3 = sum(var3),
meanvar4 = mean(var4)))
outcomes
$`var1`
# A tibble: 5 x 3
var1 sumvar3 meanvar4
<chr> <int> <dbl>
1 a 7 13.5
2 b 9 14.5
3 c 11 15.5
4 d 13 16.5
5 e 15 17.5
$var2
# A tibble: 5 x 3
var2 sumvar3 meanvar4
<chr> <int> <dbl>
1 K 7 13.5
2 L 9 14.5
3 M 11 15.5
4 N 13 16.5
5 O 15 17.5

R - Efficiently counting number of switches in binary variable for each group

To give some context, I have a dataframe of eyetracking data from a psychology experiment and I want to count the switches between two Areas Of Interest (AOI), for each participant.
Here's a simplified dataframe of the problem (we assume that AOI2 == !AOI1 so we don't need it):
library(tidyverse)
df <- tibble(Participant = rep(1:7, times = 1, each = 10),
Time = rep(1:10, 7),
AOI1 = rbinom(70, 1, .5))
What I want is to count how many times the value of AOI1 changes during time for each participant. I could do it using for loops like bellow, but I was wondering if there was a simpler and more R way of doing it?
df.switches <- tibble(Participant = 1:7,
Switches = NA)
for(p in 1:7){
s <- 0
for(i in 2:10){
if(subset(df, Participant == p & Time == i, select = AOI1) !=
subset(df, Participant == p & Time == i-1, select = AOI1)){
s <- s + 1
}
}
df.switches <- df.switches %>%
mutate(Switches = ifelse(Participant == p, s, Switches))
}

One option is to use dplyr::lag to compare the value with current row in order to count number of switches for each participants.
library(tidyverse)
df %>% group_by(Participant) %>%
summarise(count = sum(AOI1 != lag(AOI1, default = -Inf)))
# # A tibble: 7 x 2
# Participant count
# <int> <int>
# 1 1 5
# 2 2 4
# 3 3 5
# 4 4 4
# 5 5 6
# 6 6 6
# 7 7 4

Since you are already using the tidyverse, you can use lag available as part of dplyr. This checks whether the value of AOI1 is the same as the previous value, and if not, sets a flag to 1. For the first record of each participant, the value is automatically set to NA. Note that the group_by is required, otherwise the flag won't get "reset" every time a new participant is encountered. Also it is assumed that the data is sorted by Participant and Time; if not, pipe arrange(Participant, Time) before the group_by.
df <- tibble(Participant = rep(1:7, times = 1, each = 10),
Time = rep(1:10, 7),
AOI1 = rbinom(70, 1, .5))
df2 <- df %>%
group_by(Participant) %>%
mutate(switch = ifelse(AOI1 != lag(AOI1), 1, 0)) %>%
summarise(num_switches = sum(switch, na.rm = TRUE))

For loop: Count matches and unique elements among two dataframes and apply function to counts

I would like to conduct a very involved loop. I have multiple regions, each with hundreds of plots in my real data frame. I would like to subset by region and then plot and preform various functions on the subsets to ultimately calculate dissimilarity owed to only species that are shared. I will preface by saying each row is representative of an interaction.
My example df:
set.seed(540)
df<- data.frame(region= c(rep(1, 16), rep(2,8)),
plot= c(rep("A",5), rep("B",9), rep("C", 2), rep("D", 6),rep("E", 2)),
plantsp= sample(1:24,24, replace= TRUE),
lepsp= sample(1:24,24,replace= TRUE),
psitsp= sample(1:24,24,replace= TRUE))
df[] <- lapply(df, as.character)
df$plantsp<-paste('plantsp', df$plantsp, sep='_')
df$lepsp<-paste('lepsp', df$lepsp, sep='_')
df$psitsp<-paste('psitsp', df$psitsp, sep='_')
df$paste1<- paste(df$plantsp, df$lepsp, sep='_')
df$paste2<- paste(df$lepsp, df$psitsp, sep='_')
df$paste3<- paste(df$plantsp,df$lepsp, df$psitsp)
Step1: Subset df by region. Example:
region_sub <- split(df, df$region)
Step2: Subset df by plot. Example:
plot_sub <- split(region_sub[[1]], region_sub[[1]][[2]])
Step3: We will call each subset (each list component) from the step above a plot subset. In this example I will use the first subset (region1, plotA) as an example for all subsequent outputs. I will call this region1, plotA subset plot_sub1. I want to compare plot_sub1 to the original df to make three df subsets. We will call these df_sub1, df_sub2, df_sub3. First, df_sub1 consists of matches among entries in the plantsp, lepsp columns among plot_sub1 and df. Rows with any unique entries are removed, as well as and rows where a plantsp match, but not the lepsp and visa versa. Example of df_sub1:
df_sub1<- df[c(1,2,3,4,5,22),c(1:4,6)]
Notice, only those rows with shared species remain. Further, only those rows with shared species that also interact remain. Also, I have removed unnecessary columns (e.g. psitsp, paste2, paste3) to draw your attention to the results of this step. These columns do not need to be removed for the code.
Step4: Repeat step3 for lepsp and psitsp columns to make df_sub2. Example:
df_sub2<- df[1:5,c(1:2,4,5,7)]
Step5: Repeat step3 for plantsp,lepsp and psitsp column to make df_sub3. Example:
df_sub3<- df[1:5,c(1:5,8)]
Step6: Now that all subsets are made, I want to count matching elements in the paste1 column among plot_sub1 and df_sub1 (=5). Example:
This would be stored in a vector match. The results will be stored in the match or unique vector, accordingly. Example:
match<- length(intersect(df_sub1$paste1, plot_sub[[1]]$paste1))
match
I also want to count the unique elements (=1). This would be stored in a vector unique. This will be repeated for plot_sub1 and df_sub2and plot_sub1 and df_sub3. I am not sure how to count unique elements among two df so I cannot offer example code for that.
unique<- 1
Note: Matches among plot_sub only need to be counted 1 time in the event the df_sub has repeated interactions or matches. This needs to account for presence- absence of matches, not the abundance.
In summary for this subset, the two vectors would be:
match<- c( length(intersect(df_sub1$paste1, plot_sub[[1]]$paste1)),
length(intersect(df_sub2$paste2, plot_sub[[1]]$paste2)),
length(intersect(df_sub3$paste3, plot_sub[[1]]$paste3))
match
unique<-c(1,0,0)
The sum will then be totaled for each vector. Example:
sum_match<- 15
sum_unique<- 1
Step7: Lastly, these values would be input into a function:
((a + b)/((2*a + b)/2) - 1) Where a= sum_match and b=sum_unique.
The value is then input into the result vector res_vec.
Step8: This process (step3-7) would be iterated for each plot subset.
Effectively, this will calculate the dissimilarity of shared interactions among plot interactions and the corresponding metaweb (all possible interactions). This is a modification from (Poisot et al 2012) to account for tritrophic interactions.
It's quite pathetic, but to start the for loop I have:
res_vec<- NA
for (i in 1:length(unique(df$region)))
{
for (j in 1:length(unique(df$plot)))
{
I really appreciate any time one is willing to help me realize the arguments within the loop. That is where it gets tricky for me.

Thans #Gregor for all the clarification you've already done in the comments!
Here is my solution using the the tidyverse.
CODE + EXPLANATION
## Load packages
library(tidyverse)
## Nest data
new_df <- df %>%
group_by(region, plot) %>%
nest(.key = plot_sub)
new_df
# A tibble: 5 x 3
# region plot plot_sub
# <dbl> <fctr> <list>
# 1 1 A <tibble [5 x 3]>
# 2 1 B <tibble [9 x 3]>
# 3 1 C <tibble [2 x 3]>
# 4 2 D <tibble [6 x 3]>
# 5 2 E <tibble [2 x 3]>
The column plot_sub contains the same data as the list with the same name in your question. Think of this column as a list of dataframes.
I know write a function to create the df_sub's. This keeps our code more clean, and avoids unecessary repetition. This function will then be applied to our column plot_sub
# Function to create the df_sub
# Takes the plot_sub, original dataframe (df) and a list of columns, which should be compared
# Returns the desired df_sub with new interactions of species which are in plot_sub
# Only unique interactions are returned
create_df_sub <- function(plot_sub, df, col_list){
# Filter df such that it only contains species which are in plot_sub
for (x in col_list) {
df <- df[df[[x]] %in% plot_sub[[x]], ]
}
# Combine plot_sub and filtered df
df_sub <- rbind(plot_sub[, col_list], df[, col_list])
# Paste relevant colums together
df_sub$paste_col <- do.call(paste, c(df_sub[, col_list], sep = '_'))
# Exclude duplicated values
df_sub <- df_sub[!duplicated(df_sub$paste_col), ]
return(df_sub)
}
Now I define the columns I want to create the df_sub with and then apply the function to the plot_sub-column
col_list1 <- c('plantsp', 'lepsp')
col_list2 <- c('lepsp', 'psitsp')
col_list3 <- c('plantsp', 'lepsp', 'psitsp')
new_df <- new_df %>%
mutate(df_sub1 = map(plot_sub, create_df_sub, df = df, col_list = col_list1),
df_sub2 = map(plot_sub, create_df_sub, df = df, col_list = col_list2),
df_sub3 = map(plot_sub, create_df_sub, df = df, col_list = col_list3))
map takes a vector or list as argument and applies the specified function to each element (like lapply). Compare the first elements of df_sub1 and plot_sub to see the difference.
new_df$plot_sub[[1]]
# A tibble: 5 x 3
# plantsp lepsp psitsp
# <chr> <chr> <chr>
# 1 plantsp_2 lepsp_19 psitsp_19
# 2 plantsp_21 lepsp_19 psitsp_4
# 3 plantsp_19 lepsp_2 psitsp_11
# 4 plantsp_9 lepsp_13 psitsp_24
# 5 plantsp_24 lepsp_9 psitsp_2
new_df$df_sub1[[1]]
# A tibble: 6 x 3
# plantsp lepsp paste_col
# <chr> <chr> <chr>
# 1 plantsp_2 lepsp_19 plantsp_2_lepsp_19
# 2 plantsp_21 lepsp_19 plantsp_21_lepsp_19
# 3 plantsp_19 lepsp_2 plantsp_19_lepsp_2
# 4 plantsp_9 lepsp_13 plantsp_9_lepsp_13
# 5 plantsp_24 lepsp_9 plantsp_24_lepsp_9
# 6 plantsp_9 lepsp_2 plantsp_9_lepsp_2
The new interaction is added in df_sub1.
To extract matching and unique values, I use inner_join and anti_join on the plot_sub-column and the different df_sub's
new_df <- new_df %>%
mutate(match1 = map2(df_sub1, plot_sub, inner_join, by = col_list1),
match2 = map2(df_sub2, plot_sub, inner_join, by = col_list2),
match3 = map2(df_sub3, plot_sub, inner_join, by = col_list3),
unique1 = map2(df_sub1, plot_sub, anti_join, by = col_list1),
unique2 = map2(df_sub2, plot_sub, anti_join, by = col_list2),
unique3 = map2(df_sub3, plot_sub, anti_join, by = col_list3))
The inner_join returns all rows, which have matching values in the columns specified in the by-argument, whereas the anti_join returns all rows of df_sub, which are not matched.
Here I use the map2-function, which takes two vectors/list and applies the specified function.
new_df$match1[[1]]
# A tibble: 5 x 4
# plantsp lepsp psitsp paste_col
# <chr> <chr> <chr> <chr>
# 1 plantsp_2 lepsp_19 psitsp_19 plantsp_2_lepsp_19
# 2 plantsp_21 lepsp_19 psitsp_4 plantsp_21_lepsp_19
# 3 plantsp_19 lepsp_2 psitsp_11 plantsp_19_lepsp_2
# 4 plantsp_9 lepsp_13 psitsp_24 plantsp_9_lepsp_13
# 5 plantsp_24 lepsp_9 psitsp_2 plantsp_24_lepsp_9
new_df$unique1[[1]]
# A tibble: 1 x 3
# plantsp lepsp paste_col
# <chr> <chr> <chr>
# 1 plantsp_9 lepsp_2 plantsp_9_lepsp_2
In the last step I extract the number of rows of each match and unique and sum it up. I also calculate the res_vec.
new_df <- new_df %>%
mutate(sum_match = map_int(match1, nrow) + map_int(match2, nrow) + map_int(match3, nrow),
sum_unique = map_int(unique1, nrow) + map_int(unique2, nrow) + map_int(unique3, nrow),
res_vec = ((sum_match + sum_unique)/((2*sum_match + sum_unique)/2)) - 1)
Here I use map_int as my return value is an integer and I want to directly use it in a sum. Using map only would return a list which I first have to convert to a integer vector.
new_df %>% select(region, plot, sum_match, sum_unique, res_vec)
# A tibble: 5 x 5
# region plot sum_match sum_unique res_vec
# <dbl> <fctr> <int> <int> <dbl>
# 1 1 A 15 1 0.03225806
# 2 1 B 27 3 0.05263158
# 3 1 C 6 2 0.14285714
# 4 2 D 18 1 0.02702703
# 5 2 E 6 0 0.00000000
DATA
set.seed(540)
df <- data.frame(region = c(rep(1, 16), rep(2, 8)),
plot = c(rep('A', 5), rep('B', 9), rep('C', 2), rep('D', 6),rep('E', 2)),
plantsp = sample(1:24, 24, replace = TRUE),
lepsp = sample(1:24, 24, replace = TRUE),
psitsp = sample(1:24, 24, replace = TRUE))
df$plantsp <- paste('plantsp', df$plantsp, sep = '_')
df$lepsp <- paste('lepsp', df$lepsp, sep = '_')
df$psitsp <- paste('psitsp', df$psitsp, sep = '_')

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Finding the exact match in the values in the categorical variables - r

Related

How to partition into equal sum subsets in R?

R: select all rows that belong to maximum count

How to quickly create multiple summary tables with group_by() / summarise()?

R - Efficiently counting number of switches in binary variable for each group

For loop: Count matches and unique elements among two dataframes and apply function to counts

Categories

Resources