Hello, I have a dataframe such as:
species family Events groups
1 SP1 A 10,22 G1
2 SP1 B 7 G2
3 SP1 C,D 4,5,6,1,3 G3,G4,G5,G6
4 SP2 A 22,10 G1
5 SP2 D,C 6,5,4,3,1 G4,G6,G5,G3
6 SP3 C 4,5,3,6,1 G3,G6,G5
7 SP3 E 7 G2
8 SP3 A 10 G1
9 SP4 C 7,22 G12
and I would like to merge the rows that have at least one element in common in each column (except species).
For instance, I would merge the rows:
species family Events groups
SP1 A 10,22 G1
SP2 A 22,10 G1
SP3 A 10 G1
into
species family Events groups
SP1,SP2,SP3 A 10,22 G1
So if I do the same for each row, I should get the following expected output:
species family Events groups
SP1,SP2,SP3 A 10,22 G1
SP1,SP3 B,E 7 G2
SP1,SP2,SP3 C,D 1,3,4,5,6 G3,G4,G6,G5
SP4 C 7,22 G12
Note that SP4 has not been merged with any other row, since its group is not present in any other row.
Does anyone have an idea?
Thank you very much for your help and time.
Here is the dataframe, in case it helps:
structure(list(species = structure(c(1L, 1L, 1L, 2L, 2L, 3L,
3L, 3L, 4L), .Label = c("SP1", "SP2", "SP3", "SP4"), class = "factor"),
family = structure(c(1L, 2L, 4L, 1L, 5L, 3L, 6L, 1L, 3L), .Label = c("A",
"B", "C", "C,D", "D,C", "E"), class = "factor"), Events = structure(c(2L,
7L, 5L, 3L, 6L, 4L, 7L, 1L, 8L), .Label = c("10", "10,22",
"22,10", "4,5,3,6,1", "4,5,6,1,3", "6,5,4,3,1", "7", "7,22"
), class = "factor"), groups = structure(c(1L, 3L, 4L, 1L,
6L, 5L, 3L, 1L, 2L), .Label = c("G1", "G12", "G2", "G3,G4,G5,G6",
"G3,G6,G5", "G4,G6,G5,G3"), class = "factor")), class = "data.frame", row.names = c(NA,
-9L))
What I have tried:
So far I only know how to merge rows with exactly duplicated values, using something like this in dplyr:
desired_df <- df %>%
group_by_at(vars(-species)) %>%
summarize(species = toString(species)) %>%
ungroup() %>%
select(names(df))
but here the values are not exact duplicates; instead I am looking for comma-separated elements that also appear in another row.
Here is an all-tidyverse solution (calling the input data frame dat).
Please note that this solution is not identical to the desired output you gave. This is because you stated the rule is to "merge rows for which there is at least one duplicated element for each column, except species." By that rule, rows 2 and 7 should not be merged because they share no family in common.
First, convert the three columns we are going to test for overlapping values into list-columns, so that each element of those columns is a vector of the split values. I also coerce the Events column to numeric so that it sorts properly.
library(tidyverse)
dat <- dat %>%
mutate(across(c(family, Events, groups), ~ strsplit(as.character(.), split = ','))) %>%
mutate(Events = map(Events, as.numeric))
Next, define a function to collapse each row of the data frame. The function takes argument i which is a row index. Within the function, we do two things:
First, we use pmap_lgl to iterate over each row of the data frame and check which rows share at least one value with row i in all three columns family, Events, and groups, and should therefore be collapsed. For example, if i == 1 this returns TRUE for rows 1, 4, and 8.
Next, we filter dat for only those rows that returned TRUE, and apply a function to all columns of those rows. The function collapses all columns in those rows into comma-separated strings of the sorted unique values.
collapse_rows <- function(i) {
rows_collapse <- pmap_lgl(dat, function(family, Events, groups, ...)
any(dat$family[[i]] %in% family) & any(dat$Events[[i]] %in% Events) & any(dat$groups[[i]] %in% groups))
dat %>%
filter(rows_collapse) %>%
mutate(across(everything(), ~ paste(sort(unique(unlist(.))), collapse = ',')))
}
Finally we apply this function to each row index. We end up with duplicated rows, for example rows 1, 4, and 8 of the initial output will be identical. We use distinct to remove all of those duplicates.
dat_collapse <- map_dfr(seq_len(nrow(dat)), collapse_rows) %>% distinct()
Final output:
species family Events groups
1 SP1,SP2,SP3 A 10,22 G1
2 SP1 B 7 G2
3 SP1,SP2,SP3 C,D 1,3,4,5,6 G3,G4,G5,G6
4 SP3 E 7 G2
5 SP4 C 7,22 G12
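If the intended rule is looser than the one stated, for example merging any rows that share at least one group (which is what the expected output in the question seems to imply for rows 2 and 7), a base R sketch could look like the following. The rule itself and the names split_vals, collapse_vals and cluster are assumptions made for illustration; they are not part of the tidyverse solution above.

```r
# Rebuild the question's data as plain character columns (values copied from the dput).
dat <- data.frame(
  species = c("SP1", "SP1", "SP1", "SP2", "SP2", "SP3", "SP3", "SP3", "SP4"),
  family  = c("A", "B", "C,D", "A", "D,C", "C", "E", "A", "C"),
  Events  = c("10,22", "7", "4,5,6,1,3", "22,10", "6,5,4,3,1", "4,5,3,6,1", "7", "10", "7,22"),
  groups  = c("G1", "G2", "G3,G4,G5,G6", "G1", "G4,G6,G5,G3", "G3,G6,G5", "G2", "G1", "G12"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split comma-separated strings into vectors.
split_vals <- function(x) strsplit(x, ",", fixed = TRUE)

# Hypothetical helper: collapse a column to its sorted unique elements
# (sorted numerically when every element is a number, alphabetically otherwise).
collapse_vals <- function(x) {
  v <- unique(unlist(split_vals(x)))
  num <- suppressWarnings(as.numeric(v))
  v <- if (anyNA(num)) sort(v) else v[order(num)]
  paste(v, collapse = ",")
}

# Assumed rule: two rows belong together if they share at least one group.
# Rows are linked transitively by relabelling whole clusters.
grp <- split_vals(dat$groups)
cluster <- seq_len(nrow(dat))
for (i in seq_len(nrow(dat))) {
  for (j in seq_len(i - 1)) {
    if (any(grp[[i]] %in% grp[[j]])) {
      cluster[cluster == cluster[i]] <- cluster[j]
    }
  }
}

# Collapse every column within each cluster.
merged <- do.call(rbind, lapply(split(dat, cluster), function(d) {
  as.data.frame(lapply(d, collapse_vals), stringsAsFactors = FALSE)
}))
rownames(merged) <- NULL
merged
```

With this looser rule, rows 2 and 7 end up in one row with family B,E, while SP4 stays on its own because G12 appears nowhere else.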
I have the following data frame.
Input:
class id q1 q2 q3 q4
Ali 12 1 2 3 3
Tom 16 1 2 4 2
Tom 18 1 2 3 4
Ali 24 2 2 4 3
Ali 35 2 2 4 3
Tom 36 1 2 4 2
class indicates the teacher's name,
id indicates the student user ID, and,
q1, q2, q3 and q4 indicate marks on different test questions
Requirement:
I am interested in finding potential cases of cheating. I hypothesise that if the students are in the same class and have similar scores on different questions, they are likely to have cheated.
For this, I want to calculate absolute distance or difference, grouped by class name, across multiple columns, i.e., all the test questions q1, q2, q3 and q4. And I want to store this information in a couple of new columns as below:
difference:
For a given class name, it contains the pairwise distance or difference with all other students' id. For a given class name, it stores the information as (id1, id2 = difference)
cheating:
This column lists any id's based on the previously created new column where the difference was zero (or some threshold value). This will be a flag to alert the teacher that their student might have cheated.
class id q1 q2 q3 q4 difference cheating
Ali 12 1 2 3 3 (12,24 = 2), (12,35 = 2) NA
Tom 16 1 2 4 2 (16,18 = 3), (16,36 = 0) 36
Tom 18 1 2 3 4 (16,18 = 3), (18,36 = 3) NA
Ali 24 2 2 4 3 (12,24 = 2), (24,35 = 0) 35
Ali 35 2 2 4 3 (12,35 = 2), (24,35 = 0) 24
Tom 36 1 2 4 2 (16,36 = 0), (18,36 = 3) 16
Is it possible to achieve this using dplyr?
Related posts:
I have tried to look for related solutions but none of them address the exact problem that I am facing e.g.,
This post calculates the difference between all pairs of rows. It does not incorporate the group_by situation plus the solution is extremely slow: R - Calculate the differences in the column values between rows/ observations (all combinations)
This one compares only two columns using stringdist(). I want my solution over multiple columns and with a group_by() condition: Creating new field that shows stringdist between two columns in R?
The following post compares the initial values in a column with their preceding values: R Calculating difference between values in a column
This one compares values in one column to all other columns. I would want this but done row wise and through group_by(): R Calculate the difference between values from one to all the other columns
dput()
For your convenience, I am sharing data dput():
structure(list(class =
c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"),
id = c(12L, 16L, 18L, 24L, 35L, 36L),
q1 = c(1L, 1L, 1L, 2L, 2L, 1L),
q2 = c(2L, 2L, 2L, 2L, 2L, 2L),
q3 = c(3L, 4L, 3L, 4L, 4L, 4L),
q4 = c(3L, 2L, 4L, 3L, 3L, 2L)), row.names = c(NA, -6L), class = "data.frame")
Any help would be greatly appreciated!
You could try clustering the data, using hclust() for example. Once the relative distances are calculated and mapped, cut the tree at the threshold of expected cheating.
In this example I am using the standard dist() function to calculate the differences; the stringdist function may be better, or there may be another option worth trying.
library(dplyr)  # needed for %>%, filter() and arrange() below
df <- structure(list(class =
c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"),
id = c(12L, 16L, 18L, 24L, 35L, 36L),
q1 = c(1L, 1L, 1L, 2L, 2L, 1L),
q2 = c(2L, 2L, 2L, 2L, 2L, 2L),
q3 = c(3L, 4L, 3L, 4L, 4L, 4L),
q4 = c(3L, 2L, 4L, 3L, 3L, 2L)), row.names = c(NA, -6L), class = "data.frame")
#apply the standard distance function
scores <- hclust(dist(df[ , 3:6]))
plot(scores)
#divide into groups based on level of matching too closely
groups <- cutree(scores, h=0.1)
#summary table
summarytable <- data.frame(class= df$class, id =df$id, groupings =groups)
#select groups with at least 2 people in them
suspectgroups <- table(groups)[table(groups) >=2]
potential_cheaters <- summarytable %>% filter(groupings %in% names(suspectgroups)) %>% arrange(groupings)
potential_cheaters
This works for this test case, but for larger datasets the height in the cutree() function may need to be adjusted. Also consider splitting the initial dataset by class to eliminate the chance of matching people between classes (depending on the situation of course).
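To get closer to the per-class pairwise table asked for in the question, one possible base R sketch is shown below. It assumes the Manhattan distance (sum of absolute differences), which is inferred from the expected values in the question (for example 12 vs 24 = 2), and a threshold of zero; pairs_by_class and suspects are illustrative names only.

```r
df <- data.frame(
  class = c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"),
  id = c(12L, 16L, 18L, 24L, 35L, 36L),
  q1 = c(1L, 1L, 1L, 2L, 2L, 1L),
  q2 = c(2L, 2L, 2L, 2L, 2L, 2L),
  q3 = c(3L, 4L, 3L, 4L, 4L, 4L),
  q4 = c(3L, 2L, 4L, 3L, 3L, 2L),
  stringsAsFactors = FALSE
)

# For each class, compute all pairwise distances between students.
pairs_by_class <- do.call(rbind, lapply(split(df, df$class), function(d) {
  # Assumption: Manhattan distance over the four question columns.
  m <- as.matrix(dist(d[, c("q1", "q2", "q3", "q4")], method = "manhattan"))
  # Keep each unordered pair once (upper triangle of the distance matrix).
  idx <- which(upper.tri(m), arr.ind = TRUE)
  data.frame(
    class = rep(d$class[1], nrow(idx)),
    id1 = d$id[idx[, "row"]],
    id2 = d$id[idx[, "col"]],
    difference = m[idx],
    stringsAsFactors = FALSE
  )
}))
rownames(pairs_by_class) <- NULL

# Flag pairs at or below the chosen threshold (zero here).
suspects <- pairs_by_class[pairs_by_class$difference <= 0, ]
suspects
```

Because the data are split by class first, students are never compared across classes. The long table has one row per pair rather than the string column from the question, which is usually easier to filter and join back.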
I have a df with a categorical variable which has 2 levels: Feed and Food
structure(c(2L, 2L, 1L, 2L, 1L, 2L), .Label = c("Feed", "Food"), class = "factor")
I want to create a numeric value that matches each level of the categorical variable (i.e. Feed = 0, Food = 1)
The numeric values should sit alongside the categorical variable to form 2 columns
Probably very simple... but every time I've attempted to put the two together, both columns have ended up numeric
Like this?
library(dplyr)
df <- structure(c(2L, 2L, 1L, 2L, 1L, 2L), .Label = c("Feed", "Food"), class = "factor")
df <- tibble("foods" = df)
df %>%
mutate(numeric = as.numeric(foods))
# A tibble: 6 x 2
foods numeric
<fct> <dbl>
1 Food 2
2 Food 2
3 Feed 1
4 Food 2
5 Feed 1
6 Food 2
Or like this if you want 0/1 as numbers.
df %>%
mutate(numeric = as.numeric(foods) - 1)
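If the numeric codes should not depend on the order of the factor levels, an explicit lookup is another option. This is only a base R sketch; the lookup vector and its 0/1 values are taken from the example in the question, and the name lookup is hypothetical.

```r
foods <- factor(c("Food", "Food", "Feed", "Food", "Feed", "Food"),
                levels = c("Feed", "Food"))

# Named vector mapping each level to the number we want for it.
lookup <- c(Feed = 0, Food = 1)

# Index by the level names (not the integer codes), keeping foods as a factor.
df <- data.frame(foods = foods,
                 numeric = unname(lookup[as.character(foods)]))
df
```

Indexing with as.character(foods) matters here: indexing with the factor itself would use the underlying integer codes instead of the labels.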
Given
Group ss
B male
B male
B female
A male
A female
X male
Then
tab <- table(res$Group, res$ss)
I want the Group column to be in the order B, A, X, as it appears in the data. Currently it is in alphabetical order, which is not what I want. This is what I want:
MALE FEMALE
B 5 5
A 5 10
X 10 12
If you arrange the factor levels based on the order you want, you'll get the desired result.
res$Group <- factor(res$Group, levels = c('B', 'A', 'X'))
#If it is based on occurrence in Group column we can use
#res$Group <- factor(res$Group, levels = unique(res$Group))
table(res$Group, res$ss)
#Or just
#table(res)
# female male
# B 1 2
# A 1 1
# X 0 1
data
res <- structure(list(Group = structure(c(2L, 2L, 2L, 1L, 1L, 3L),
.Label = c("A", "B", "X"), class = "factor"), ss = structure(c(2L, 2L, 1L, 2L,
1L, 2L), .Label = c("female", "male"), class = "factor")),
class = "data.frame", row.names = c(NA, -6L))
unique returns the unique elements of a vector in the order they occur. A table can be ordered like any other structure by extracting its elements in the order you want. So if you pass the output of unique to [,] then you'll get the table sorted in the order of occurrence of the vector.
tab <- table(res$Group, res$ss)[unique(res$Group),]
I have a data frame which consists of two vectors and runs into millions of rows. I used a loop, but it takes a day to compare the values.
Can someone suggest any apply functions?
Names Sales
A 1
A 2
A 3
B 1
B 5
B 6
.
.
What I want is a unique list of names along with the maximum value of Sales for each name. For example, A has 3 rows and its highest sales value is 3.
The output should be a data frame.
Names Sales
A 3
B 6
You can try with aggregate()
aggregate(V2 ~ ., df1 , max)
# V1 V2
#1 A 3
#2 B 6
data
df1 <- structure(list(V1 = structure(c(1L, 1L, 1L, 2L, 2L, 2L),
.Label = c("A", "B"), class = "factor"), V2 = c(1L, 2L, 3L, 1L, 5L, 6L)),
.Names = c("V1","V2"), class = "data.frame", row.names = c(NA, -6L))
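With the column names used in the question (Names and Sales) instead of V1 and V2, the same idea can be sketched as follows. The small data frame sales is made up from the rows shown in the question.

```r
# Rows shown in the question, with the original column names.
sales <- data.frame(Names = c("A", "A", "A", "B", "B", "B"),
                    Sales = c(1, 2, 3, 1, 5, 6))

# One row per name, holding the maximum of Sales within that name.
max_sales <- aggregate(Sales ~ Names, data = sales, FUN = max)
max_sales
```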
I have a dataframe df containing two factor variables (Var and Year) as well as one (in reality several) column with values.
df <- structure(list(Var = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L,
3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"), Year = structure(c(1L,
2L, 3L, 1L, 2L, 3L, 3L, 1L, 2L, 3L), .Label = c("2000", "2001",
"2002"), class = "factor"), Val = structure(c(1L, 2L, 2L, 4L,
1L, 3L, 3L, 5L, 6L, 6L), .Label = c("2", "3", "4", "5", "8",
"9"), class = "factor")), .Names = c("Var", "Year", "Val"), row.names = c(NA,
-10L), class = "data.frame")
> df
Var Year Val
1 A 2000 2
2 A 2001 3
3 A 2002 3
4 B 2000 5
5 B 2001 2
6 B 2002 4
7 B 2002 4
8 C 2000 8
9 C 2001 9
10 C 2002 9
Now I'd like to find rows with the same value for Val for each Var and Year and only keep one of those. So in this example I would like row 7 to be removed.
I've tried to find a solution with plyr using something like
df_new <- ddply(df, .(Var, Year), summarise, !duplicate(Val))
but obviously that is not a function accepted by ddply.
I found this similar question but the plyr solution by Arun only gives me a dataframe with 0 rows and 0 columns and I do not understand the answer well enough to modify it according to my needs.
Any hints on how to go about that?
Non-duplicates of Val by Var and Year are the same as non-duplicates of Val, Var, and Year. You can specify several columns for duplicated (or the whole data frame).
I think this does what you'd like.
df[!duplicated(df), ]
Or.
df[!duplicated(df[, c("Var", "Year", "Val")]), ]
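As a quick check of what duplicated() flags, here is a small self-contained sketch on the data from the question, rebuilt with data.frame() for brevity (the column types are an assumption; the original uses factors).

```r
df <- data.frame(
  Var  = c("A", "A", "A", "B", "B", "B", "B", "C", "C", "C"),
  Year = c("2000", "2001", "2002", "2000", "2001", "2002", "2002", "2000", "2001", "2002"),
  Val  = c("2", "3", "3", "5", "2", "4", "4", "8", "9", "9")
)

# TRUE for every row that repeats an earlier Var/Year/Val combination.
dup <- duplicated(df[, c("Var", "Year", "Val")])
which(dup)

# Keep only the first occurrence of each combination.
df_new <- df[!dup, ]
```

Only row 7 is flagged, because rows such as 2 and 3 share Var and Val but differ in Year.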
You can just use the unique() function instead of !duplicate(Val):
df_new <- ddply(df, .(Var, Year), summarise, Val=unique(Val))
# or
df_new <- ddply(df, .(Var, Year), function(x) x[!duplicated(x$Val),])
# or if you only have these 3 columns:
df_new <- ddply(df, .(Var, Year), unique)
# with dplyr (the old %.% operator has been removed from dplyr; use %>% instead)
df %>% group_by(Var, Year) %>% filter(!duplicated(Val))
hth
You don't need the plyr package here. If your whole dataset consists of only these 3 columns and you need to remove the duplicates, then you can use:
df_new <- unique(df)
Otherwise, if you just need to pick the first observation for each combination of grouping variables, you can use the method suggested by Richard. That is how I usually do it.