Comparing Occurrence Of Data Within Two Groups - r

I have a data frame with User_Name and Group columns:
User_Name Group
MustafE A
fischeta A
LosperS1 A
MustafE B
fischeta B
jose B
MustafE c
fischeta c
I want to flag the users who do not appear in every group. For example, 'LosperS1' is in group A but not in group B, and 'jose' is in group B but not in group C, so in a new column they would be marked as "Not in group B" / "Not in group C".
Any help will be appreciated.

Here is one way to get the output using the tidyverse. Get the distinct elements of the 'User_Name' column, loop through them (map), filter the rows of the dataset where 'User_Name' matches the current element, paste together the groups from the full 'Group' column that are missing from the filtered 'Group' (setdiff, toString), keep the first row (slice), and right_join with the original dataset. map_df is used so that the result is a single data.frame instead of a list of data.frames.
library(tidyverse)
df1 %>%
  distinct(User_Name) %>%
  pull(User_Name) %>%
  map_df(~ df1 %>%
    filter(User_Name == .x) %>%
    mutate(Flag = toString(setdiff(unique(df1$Group),
                                   unique(Group)))) %>%
    slice(1) %>%
    select(-Group)) %>%
  right_join(df1, "User_Name")
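For comparison, the same flag can be sketched with a grouped mutate instead of a map/join round trip. This is only a minimal sketch: df1 is transcribed from the question's sample (including the lowercase 'c'), and all_groups and flagged are names introduced here for illustration. Like the answer above, it lists every group a user is missing from, which is an assumption about what the flag should contain.

```r
library(dplyr)

# Sample data transcribed from the question
df1 <- data.frame(
  User_Name = c("MustafE", "fischeta", "LosperS1", "MustafE",
                "fischeta", "jose", "MustafE", "fischeta"),
  Group = c("A", "A", "A", "B", "B", "B", "c", "c"),
  stringsAsFactors = FALSE
)

# Every group that occurs anywhere in the data
all_groups <- unique(df1$Group)

# For each user, list the groups they do not appear in
flagged <- df1 %>%
  group_by(User_Name) %>%
  mutate(Flag = toString(setdiff(all_groups, Group))) %>%
  ungroup()

flagged
```

Users present in every group get an empty string in Flag, so a readable label could be substituted afterwards if needed.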

Related

Determine the size of string in a particular cell in dataframe: R

In a data frame, I have a column (type: chr) that contains answers separated by a comma. I want to create another column based on the size of the string and award points. For example, some of the entries in a column are:
Column1
word1,word2,word3
word1,word2
word1
Now, for the first cell, I want the size of the cell to be evaluated as 3 (as it contains three distinct words and there are no duplicates in the cell values). I'm not sure how to achieve this.
An option is to split the column with strsplit into a list of vectors, get the unique elements by looping over the list with lapply and get the lengths
df1$Size <- lengths(lapply(strsplit(df1$Column1, ",\\s*"), unique))
Another option is separate_rows from tidyr
library(dplyr)
library(tidyr)
df1 %>%
  mutate(rn = row_number()) %>%
  separate_rows(Column1) %>%
  group_by(rn) %>%
  summarise(Size = n_distinct(Column1), .groups = 'drop') %>%
  select(Size) %>%
  bind_cols(df1, .)
-output
# Column1 Size
#1 word1,word2,word3 3
#2 word1,word2 2
#3 word1 1
data
df1 <- data.frame(Column1 = c('word1,word2,word3', 'word1,word2', 'word1'))
Original Answer:
Another option:
library(dplyr)
library(stringr)
df %>%
  mutate(Lengths = str_count(Column1, ",") + 1)
Edit:
I hadn't read the OP's requirements properly (about non-duplicates). As @Onyambu pointed out in the comments, this chunk only works if there are no duplicated words in the data.
It basically counts how many words there are.
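To illustrate the caveat, here is a small base R sketch comparing the two counts. The vector x is a hypothetical input (not from the question) in which the first cell repeats a word. Counting all split elements gives the same number as counting commas and adding one.

```r
# Hypothetical input in which the first cell repeats "word1"
x <- c("word1,word2,word1", "word1,word2", "word1")

words <- strsplit(x, ",\\s*")

# Counts every word, duplicates included (same as number of commas + 1)
total_words <- lengths(words)

# Counts each distinct word once per cell
distinct_words <- lengths(lapply(words, unique))

data.frame(x, total_words, distinct_words)
```

The two columns differ only for the first cell, which is exactly the case the comma-counting approach does not handle.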

select highest pairs from complex table

I want to make a new dataframe from a selection of rows in a complex table of pairwise comparisons. I want to select the rows such that the 2 highest values of each pairwise comparison are selected.
Below is an example dataset:
dataframe <- data.frame(X1 = c("OP2413iiia","OP2413iiib","OP2413iiic","OP2645ii_a","OP2645ii_b","OP2645ii_c","OP2645ii_d","OP2645ii_e","OP3088i__a","OP5043___a","OP5043___b","OP5044___a","OP5044___b","OP5044___c","OP5046___a","OP5046___b","OP5046___c","OP5046___d","OP5046___e","OP5047___a","OP5047___b","OP5048___b","OP5048___c","OP5048___d","OP5048___e","OP5048___f","OP5048___g","OP5048___h","OP5049___a","OP5049___b","OP5051DNAa","OP5051DNAb","OP5051DNAc","OP5052DNAa","OP5053DNAa"),
gr1 = c("2","2","2","3","3","3","3","3","3","4","4","4","3","4","2","3","3","3","4","2","4","3","3","3","4","2","4","2","3","3","3","4","2","4","2"),
X2 = c("OP2413iiib","OP2413iiic","OP5046___a","OP2645ii_a","OP2645ii_a","OP2645ii_a","OP2645ii_b","OP2645ii_b","OP5046___a","OP2645ii_b","OP2645ii_c","OP2645ii_c","OP2645ii_c","OP2645ii_c","OP5048___e","OP2645ii_d","OP5046___a","OP2645ii_d","OP2645ii_d","OP2645ii_d","OP2645ii_d","OP2645ii_e","OP5048___e","OP2645ii_e","OP2645ii_e","OP2645ii_e","OP2645ii_e","OP2645ii_e","OP3088i__a","OP3088i__a","OP3088i__a","OP3088i__a","OP3088i__a","OP3088i__a","OP3088i__a"),
gr2 = c("3","3","3","4","4","4","2","2","2","2","4","4","4","4","4","2","2","2","2","2","2","4","4","4","4","4","4","4","3","3","3","3","3","3","3"),
value = c("1.610613e+00","1.609732e+00","8.829263e-04","1.080257e+01","1.111006e+01","1.110978e+01","4.048302e+00","5.610458e+00","5.609584e+00","9.911490e+00","1.078518e+01","1.133728e+01","1.133686e+01","1.738092e+00","9.247411e+00","5.170646e+00","6.074909e+00","6.074287e+00","6.212711e+00","3.769029e+00","5.793390e+00","1.124045e+01","1.163326e+01","1.163293e+01","7.752766e-01","1.008434e+01","1.222854e+00","6.469443e+00","1.610828e+00","1.784774e+00","1.784235e+00","9.434803e+00","4.512563e+00","9.582847e+00","4.309312e+00"))
expected_output_dataframe <- rbind(dataframe[10,],dataframe[34,],dataframe[32,],dataframe[15,],dataframe[3,],dataframe[17,])
Many thanks in advance
Cheers
This method uses dplyr. I created an extra column, gr_pair, to identify each pairwise comparison regardless of the order of gr1 and gr2.
library(dplyr)
library(magrittr)
dataframe %>%
  filter(gr1 != gr2) %>% # Rows with gr1 == gr2 are excluded from your expected output
  mutate(gr_pair = paste(pmin(gr1, gr2), pmax(gr1, gr2), sep = ",")) %>%
  group_by(gr_pair) %>%
  top_n(2, value) # Keep the two rows with the highest value in each group
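One caveat worth checking: in the sample data, value is stored as character, so ranking it compares strings rather than numbers (for example, "9.911490e+00" ranks above "1.133728e+01"). If numeric ordering is what is wanted, a minimal sketch along these lines converts the column first. The data frame toy is a hypothetical miniature of the question's data, and slice_max() assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical miniature of the question's data; value is stored as character
toy <- data.frame(
  gr1 = c("2", "4", "4", "2"),
  gr2 = c("4", "2", "2", "4"),
  value = c("9.247411e+00", "9.911490e+00", "1.133728e+01", "5.793390e+00"),
  stringsAsFactors = FALSE
)

top_numeric <- toy %>%
  mutate(value = as.numeric(value), # compare as numbers, not strings
         gr_pair = paste(pmin(gr1, gr2), pmax(gr1, gr2), sep = ",")) %>%
  group_by(gr_pair) %>%
  slice_max(value, n = 2) %>% # two largest values per pair, in descending order
  ungroup()

top_numeric
```

Whether string or numeric ordering is the intended one depends on the expected output, so it is worth comparing both against it.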

Splitting a list column into an entry per column

I would like to split a list column element into individual columns.
For example, in the starwars dataset,
data("starwars")
I would want this list column (the entry in row 7)
c("Attack of the Clones", "Revenge of the Sith", "A New Hope")
To be broken into columns A,B,C... with the values of the movies
A B C D ...
Attack of the Clones Revenge of the Sith A New Hope NA ...
I have kind of hacked together a way to do this with
starwars %>% separate(films, into= letters[1:7],sep = ",")
Which would result in an output of
A B C D ...
c("Attack of the Clones" "Revenge of the Sith" "A New Hope") NA ...
But this will require some additional scrubbing, and I don't think this is general. Is there a way to do this in one swoop?
The 'films' column is a list of character vectors. To spread it into 7 columns (the maximum length found in 'films'), pad each vector to that maximum length with length<-, convert it to a one-row data.frame, set the column names, and then unnest the resulting list column.
library(tidyverse)
mx <- max(lengths(starwars$films))
starwars %>%
  mutate(films = map(films, ~ `length<-`(.x, mx) %>%
    as.data.frame.list %>%
    set_names(LETTERS[seq_len(mx)]))) %>%
  unnest(films)
Another option is to pull the 'films' column, convert each element to a one-row tibble within map_df, and bind the result with the columns of 'starwars' other than 'films'.
starwars %>%
  pull(films) %>%
  map_df(~ t(.x) %>%
    as_tibble) %>%
  bind_cols(starwars %>%
    select(-films), .)
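The padding idea can also be sketched in base R on a small list, which makes the intermediate steps easier to inspect. The list films below is a hypothetical stand-in for starwars$films, and padded and wide are names introduced here for illustration.

```r
# Hypothetical stand-in for a list column of film titles
films <- list(
  c("Attack of the Clones", "Revenge of the Sith", "A New Hope"),
  c("A New Hope"),
  c("The Empire Strikes Back", "Return of the Jedi")
)

# Longest element decides how many columns are needed
mx <- max(lengths(films))

# Pad every vector with NA up to length mx, then stack the vectors as rows
padded <- do.call(rbind, lapply(films, function(x) `length<-`(x, mx)))
colnames(padded) <- LETTERS[seq_len(mx)]

wide <- as.data.frame(padded, stringsAsFactors = FALSE)
wide
```

The resulting data frame has one row per list element and can be column-bound to the remaining columns of the original data.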

Finding elements from multiple columns of one dataframe that are not in multiple columns of another

library(tidyverse)
I have two dataframes (see sample code at bottom) called Df1 and Df2. I want to find phone numbers in Df1 (from all the columns) that are not in any of the phone number columns in Df2.
First, I restructure Df1 so that there is only one Id per row.
Df1 <- Df1 %>%
  gather(key, value, -Id) %>%
  filter(!is.na(value)) %>%
  select(-key) %>%
  group_by(Id) %>%
  filter(!duplicated(value)) %>%
  mutate(Phone = paste0("Phone_", 1:n())) %>%
  spread(Phone, value)
Next, I rename Df2 and then use a join to find only Ids in Df1 that are in Df2.
Df2<-Df2%>%set_names(c("Id","Ph1","Ph2"))
DfJoin<-left_join(Df2,Df1,by="Id")
This is where I'm stuck. I want to find all the numbers in Df1 (Phone_1, Phone_2, and Phone_3) that are not in Df2 (Ph1 and Ph2). Below are some ideas for code. I tried many variations of this idea but could not find a way to achieve what I want. The final product should just be a table with the phone number(s) in any Df1 column that are not in any Df2 column, together with the associated Id. I'm also wondering if there is another join or set operation that would achieve this in a more efficient way?
DfJoin<-DfJoin%>%mutate(New=if_else(! DfJoin[2:3] %in% DfJoin[4:6]),1,0)
DfJoin<-DfJoin%>%filter(! DfJoin[2:3] %in% DfJoin[2:4])
Sample Data:
Dataframe 1:
Id<-c(199,148,148,145,177,165,144,121,188,188,188,111)
Ph1<-c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,6451123425,6783450101,7890986543,6785554444,8764443344,6453348736)
Ph2<-c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
Df1<-data.frame(Id,Ph1,Ph2)
Dataframe 2:
Id2<-c(199,148,142,145,177,165,144,121,182,109,188,111)
Phone1<-c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,6451123425,6783450101,7890986543,6785554400,8764443344,6453348736)
Phone2<-c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
Df2<-data.frame(Id2,Phone1,Phone2)
One way to think about this problem:
You have a set of phone numbers in df1 for each ID number.
You have a set of phone numbers in df2 for each ID number.
You want to find, within each ID, the set difference between df1 and df2.
You can do this by mapping the base R function setdiff() onto your joined dataframe. To do this, you need to convert your data frames into list-column format, where all the phone numbers for each ID are present as a list in a "cell" of the dataframe. This is easily done by combining group_by(), summarize() and list().
# create example data
Id <- c(199,148,148,145,177,165,144,121,188,188,188,111)
ph1 <- c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,6451123425,6783450101,7890986543,6785554444,8764443344,6453348736)
ph2 <- c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
df1 <- data.frame(Id, ph1, ph2)
Id2 <- c(199,148,142,145,177,165,144,121,182,109,188,111)
phone1 <- c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,6451123425,6783450101,7890986543,6785554400,8764443344,6453348736)
phone2 <- c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
df2 <- data.frame(Id=Id2, phone1, phone2)
# convert the data to list-column format
df1.listcol <- df1 %>%
  gather(col, phone, -Id) %>%
  na.omit() %>%
  group_by(Id) %>%
  summarize(phone_list1 = list(phone))
df2.listcol <- df2 %>%
  gather(col, phone, -Id) %>%
  na.omit() %>%
  group_by(Id) %>%
  summarize(phone_list2 = list(phone))
Take a look at these dataframes to make sure you understand how we've reformatted them. Obviously, we could save a few lines of code by making this conversion process into a function, and then calling the function on each of df1 and df2, but I didn't do that here.
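The conversion function mentioned above could look roughly like this. It is only a sketch: to_listcol and toy1 are hypothetical names, pivot_longer() is used in place of gather() (assuming tidyr 1.0.0 or later), and the helper assumes the key column is called Id.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: collapse all phone columns into one list-column per Id
to_listcol <- function(df, list_name) {
  df %>%
    pivot_longer(-Id, names_to = "col", values_to = "phone") %>%
    filter(!is.na(phone)) %>%
    group_by(Id) %>%
    summarize(!!list_name := list(phone), .groups = "drop")
}

# Small hypothetical input with the same shape as df1
toy1 <- data.frame(
  Id  = c(148, 148, 188),
  ph1 = c(6572231223, 6541132112, 7890986543),
  ph2 = c(NA, NA, 6785554444)
)

out <- to_listcol(toy1, "phone_list1")
out
```

Calling the helper once per data frame, with a different list_name each time, would replace the two near-identical pipelines.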
# join the two listcol dfs by Id, then map setdiff on the two columns
result <-
  df1.listcol %>%
  left_join(df2.listcol, by = 'Id') %>%
  mutate(only_list_1 = map2(phone_list1, phone_list2, ~setdiff(.x, .y))) %>%
  select(Id, only_list_1) %>%
  unnest(only_list_1) # name the column explicitly; a bare unnest() is deprecated in tidyr >= 1.0
result
The result is
Id only_list_1
148 6541132112
188 7890986543
188 6785554444
Have you tried anti_join(a, b, by = "x1")?
This returns all rows in a that have no match in b:
DfJoin <- anti_join(Df1, Df2, by = "Id")
The tidyr/dplyr cheatsheet is a useful reference for data manipulation in the tidyverse.
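Note that joining on Id alone only compares Ids. To compare individual numbers with anti_join, one option is to reshape both data frames to long (Id, phone) pairs first. The sketch below uses the sample data from the question; to_long and only_in_df1 are names introduced here, and pivot_longer() assumes tidyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
Id <- c(199,148,148,145,177,165,144,121,188,188,188,111)
ph1 <- c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,
         6451123425,6783450101,7890986543,6785554444,8764443344,6453348736)
ph2 <- c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
df1 <- data.frame(Id, ph1, ph2)

Id2 <- c(199,148,142,145,177,165,144,121,182,109,188,111)
phone1 <- c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,
            6451123425,6783450101,7890986543,6785554400,8764443344,6453348736)
phone2 <- c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
df2 <- data.frame(Id = Id2, phone1, phone2)

# Hypothetical helper: one row per distinct (Id, phone) pair
to_long <- function(df) {
  df %>%
    pivot_longer(-Id, names_to = "col", values_to = "phone") %>%
    filter(!is.na(phone)) %>%
    distinct(Id, phone)
}

# Pairs present in df1 that have no matching pair in df2
only_in_df1 <- anti_join(to_long(df1), to_long(df2), by = c("Id", "phone"))
only_in_df1
```

On this sample the result contains the same three Id/number pairs as the setdiff approach.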

subset columns with common values in long data frame

I have the following data frame:
Group 1 ID A Value
Group 1 ID B Value
Group 1 ID C Value
Group 2 ID B Value
Group 2 ID C Value
Group 3 ID B Value
… … …
I am trying to use dplyr to get the mean value for each of the same ID across groups (e.g. the mean of the value of ID B across group 1, group 2, and group 3). However, not every group has all of the IDs so I wanted to subset so that only means for IDs which are in all groups get computed. I know that I can group_by(dataFrame, group) %>% filter subset %>% group_by(id) %>% mutate(mean) but I don't know what code to place in the filter subset.
How about
# ngroups is the total number of distinct groups in the data
df %>%
  group_by(id) %>%
  mutate(count = n()) %>%
  filter(count == ngroups) %>% #...
So basically remove all the rows in the dataframe that correspond to an ID that doesn't appear in all groups, then perform the computation.
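A fuller sketch of this idea, under stated assumptions: df below is hypothetical data shaped like the question's table, with columns named group, id and value, and ngroups is computed with n_distinct(). Counting distinct groups per ID, rather than rows, keeps the filter correct even if an ID appears more than once within a group.

```r
library(dplyr)

# Hypothetical data with the question's shape
df <- data.frame(
  group = c(1, 1, 1, 2, 2, 3),
  id    = c("A", "B", "C", "B", "C", "B"),
  value = c(10, 2, 5, 4, 7, 6),
  stringsAsFactors = FALSE
)

# Total number of distinct groups in the data
ngroups <- n_distinct(df$group)

means <- df %>%
  group_by(id) %>%
  filter(n_distinct(group) == ngroups) %>% # keep IDs present in every group
  summarise(mean_value = mean(value), .groups = "drop")

means
```

Here only ID B occurs in all three groups, so it is the only ID for which a mean is computed.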
