Mutate new columns in a data frame if email and unique ID are duplicated - r

I have a sample data frame and I want to check whether values are duplicated and add new columns that flag duplicates as 1 and non-duplicates as 0. I tried the code below, but it isn't working for me.
df4 <- data.frame(emp_id =c("DEV-2962","KTN_2252","ANA2719","ITI_2624","DEV2698","HRT2921","","KTN2624","DEV2698","ITI2535","DEV2698","HRT2837","ERV2951","KTN2542","ANA2813","ITI2210"),
email = c("akash.dev#abcd.com","rahul.singh#abcd.com","salman.abbas#abcd.com","ram.lal#abcd.com","ram.lal#xyz.com","prabal.garg#xyz.com","sanu.ali#abcd.com","kunal.singh#abcd.com","lakhan.tomar#abcd.com","praveen.thakur#abcd.com","sarman.ali#abcd.com","zuber.khan#dkl.com","giriraj.singh#dkl.com","lokesh.sharma#abcd.com","pooja.pawar#abcd.com","nikita.sharma#abcd.com"))
ID = "emp_id"
Email = "email"
ID <- sym(ID)
Email <- sym(email)
df4 <- df4 %>% group_by(!!ID) %>%
mutate(Flag=1:n(),`Duplicate_ID`=ifelse(Flag==1,0,1)) %>% select(-Flag)
df4 <- df4 %>% filter(!is.na(!!Email)) %>% group_by(!!Email) %>%
mutate(Flag=1:n(),`Duplicate_email`=ifelse(Flag==1,0,1)) %>% select(-Flag) %>% ungroup(.)
The ID and email columns can have different names in different data frames, so I also want to handle that.
I also want to let the user pass the column names of their data frame as input parameters, which I will then reuse in my script. Do you have any suggestions for that?
Here, for example, I am using sym to fix the parameters in the script.

Instead of getting into non-standard evaluation, try across. As far as I can tell from your code, you are trying to assign 0 to the first instance of a value in a column and 1 to every duplicate. You can do this with duplicated, so there is no need for group_by, ifelse, etc.
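library(dplyr)
ID <- "emp_id"
Email <- "email"
# all_of() selects the columns named in the character variables;
# duplicated() is FALSE for the first occurrence and TRUE for every repeat
df4 <- df4 %>%
  mutate(across(all_of(c(ID, Email)), ~as.integer(duplicated(.)), .names = 'flag_{.col}'))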
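To let the user supply the column names, the same idea can be wrapped in a small function. This is only a sketch: flag_duplicates and its cols argument are hypothetical names, and the demo data frame is made up for illustration.

```r
library(dplyr)

# Hypothetical helper: `cols` is a character vector of column names given by the user
flag_duplicates <- function(data, cols) {
  data %>%
    mutate(across(all_of(cols),
                  ~ as.integer(duplicated(.x)),
                  .names = "flag_{.col}"))
}

# Made-up demo data (not the data from the question)
demo <- data.frame(
  emp_id = c("A1", "B2", "A1", "C3"),
  email  = c("x#abcd.com", "y#abcd.com", "z#abcd.com", "y#abcd.com")
)

flagged <- flag_duplicates(demo, c("emp_id", "email"))
flagged
```

The function can then be called in a script with whatever column names a given data frame uses, for example flag_duplicates(df4, c("emp_id", "email")).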

Related

R: how to copy the (last) column name to a new column as values?

In short, I need to create a new column with timestamps, taken from the name of another column.
I already have this command to select the following columns from the dataset: Lat, Long_, last_col()
I use last_col() because the name of the last column (a date) keeps changing.
data_new <- data %>%
select(Lat, Long_, last_col() )
Results:
"Lat","Long_","5/26/20"
-14.271,-170.132,44
13.4443,144.7937,167
My goal is to achieve below results:
"Lat","Long_","date","Value"
-14.271,-170.132,"5/26/20",44
13.4443,144.7937,"5/26/20",167
Any idea please ?
We can use mutate
library(dplyr)
data_new %>%
mutate(date = names(.)[3]) %>%
rename(Value = `5/26/20`)
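If there are more rows, a less error-prone approach is pivot_longer, which does not hard-code the name of the date column (values_to names the value column as in the desired output)
library(tidyr)
pivot_longer(data_new, cols = -c(Lat:Long_), names_to = 'date', values_to = 'Value')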
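As a quick check of the pivot_longer approach, here is a minimal sketch on the two sample rows from the question. It assumes the date column is always the last one, so last_col() is used instead of naming it.

```r
library(dplyr)
library(tidyr)

# Sample rows from the question; check.names = FALSE keeps "5/26/20" as a column name
data_new <- data.frame(
  Lat = c(-14.271, 13.4443),
  Long_ = c(-170.132, 144.7937),
  `5/26/20` = c(44, 167),
  check.names = FALSE
)

# Move the last column's name into `date` and its values into `Value`
long <- data_new %>%
  pivot_longer(cols = last_col(), names_to = "date", values_to = "Value")
long
```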

Drop a column that was used as 'by' argument in join

I have the following query:
library(dplyr)
FinalQueryDplyr <- PostsWithFavorite %>%
inner_join(Users, by = c("OwnerUserId" = "Id"), keep = FALSE) %>%
select(DisplayName, Age, Location, FavoriteTotal, MostFavoriteQuestion, MostFavoriteQuestionLikes) %>%
select(-c(OwnerUserId)) %>%
arrange(desc(FavoriteTotal))
As you can see, I use the OwnerUserId column as the joining column between 2 data frames.
I want the result data frame to only have other columns, without the OwnerUserId column visible.
Even though I 'deselect' the OwnerUserId column 2 times in said query:
once by not including it in the first select clause
once by explicitly deselecting it with select(-c(OwnerUserId))
It is still visible in the result:
OwnerUserId DisplayName Age Location FavoriteTotal MostFavoriteQuestion MostFavoriteQuestionLikes
How can I get rid of the column that was used as a joining column in dplyr?
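One option is to remove the grouping attribute by converting to a data.frame. This assumes PostsWithFavorite is a grouped tibble, in which case select() always keeps the grouping column.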
library(dplyr)
PostsWithFavorite %>%
inner_join(Users, by = c("OwnerUserId" = "Id"), keep = FALSE) %>%
select(DisplayName, Age, Location, FavoriteTotal,
MostFavoriteQuestion, MostFavoriteQuestionLikes) %>%
as.data.frame %>%
select(-c(OwnerUserId)) %>%
arrange(desc(FavoriteTotal))
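If the column survives because the data is grouped by it (an assumption, since the question does not show how PostsWithFavorite was built), calling ungroup() before select() is another way to drop it. The sketch below uses small made-up tables named posts and users to illustrate this.

```r
library(dplyr)

# Made-up stand-ins for PostsWithFavorite and Users; posts is grouped by the join column
posts <- tibble(OwnerUserId = c(1, 2), FavoriteTotal = c(5, 9)) %>%
  group_by(OwnerUserId)
users <- tibble(Id = c(1, 2), DisplayName = c("a", "b"))

res <- posts %>%
  inner_join(users, by = c("OwnerUserId" = "Id")) %>%
  ungroup() %>%                          # drop the grouping so select() can omit the column
  select(DisplayName, FavoriteTotal) %>%
  arrange(desc(FavoriteTotal))
res
```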

Is there a way to extract multiple attributes efficiently from a JSON column?

I have a dataframe that has one column which contains json data. I want to extract some attributes from this json data into named columns of the data frame.
Sample data
json_col = c('{"name":"john"}','{"name":"doe","points": 10}', '{"name":"jane", "points": 20}')
id = c(1,2,3)
df <- data.frame(id, json_col)
I was able to achieve this using
library(tidyverse)
library(jsonlite)
extract_json_attr <- function(from, attr, default=NA) {
value <- from %>%
as.character() %>%
jsonlite::fromJSON(txt = .) %>%
.[attr]
return(ifelse(is.null(value[[1]]), default, value[[1]]))
}
df <- df %>%
rowwise() %>%
mutate(name = extract_json_attr(json_col, "name"),
points = extract_json_attr(json_col, "points", 0))
In this case, extract_json_attr has to parse the JSON column again for each attribute to be extracted.
Is there a better way to extract all attributes at one shot?
I tried this function to return multiple values as a list, but I am not able to use it with mutate to set multiple columns.
extract_multiple <- function(from, attributes){
values <- from %>%
as.character() %>%
jsonlite::fromJSON(txt = .) %>%
.[attributes]
return (values)
}
I am able to extract the desired values using this function
extract_multiple(df$json_col[1],c('name','points'))
extract_multiple(df$json_col[2],c('name','points'))
But cannot apply this to set multiple columns in a single go. Is there a better way to do this efficiently?
Here is one way using bind_rows from dplyr
dplyr::bind_rows(lapply(as.character(df$json_col), jsonlite::fromJSON))
# A tibble: 3 x 2
# name points
# <chr> <int>
#1 john NA
#2 doe 10
#3 jane 20
To keep only specific attributes from the parsed result, we can do
bind_rows(lapply(as.character(df$json_col), function(x)
jsonlite::fromJSON(x)[c('name', 'points')]))
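The bind_rows result above no longer contains the id column. A minimal sketch for keeping it, assuming every JSON string is a flat object, is to parse each string once into a one-row data frame and bind the result back onto id:

```r
library(dplyr)
library(jsonlite)

df <- data.frame(
  id = c(1, 2, 3),
  json_col = c('{"name":"john"}', '{"name":"doe","points": 10}', '{"name":"jane", "points": 20}'),
  stringsAsFactors = FALSE
)

# Parse each JSON string once and turn it into a one-row data frame
parsed <- lapply(df$json_col, function(x) {
  as.data.frame(fromJSON(x), stringsAsFactors = FALSE)
})

# bind_rows() fills attributes that are missing from a row with NA
result <- bind_cols(df["id"], bind_rows(parsed))
result
```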
On the R4DS slack channel I received an alternative approach for handling json arrays as columns. Using that, I found another approach that seems to work better on larger datasets.
library(tidyverse)
library(jsonlite)
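extract <- function(input, fields){
  json_df <- fromJSON(txt=input)
  # add any requested field that is absent from the parsed JSON as an NA column
  missing <- setdiff(fields, names(json_df))
  json_df[missing] <- NA
  # all_of() selects the columns named in the character vector fields
  return (json_df %>% select(all_of(fields)))
}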
df <- data.frame(id=c(1,2,3),
json_col=c('{"name":"john"}','{"name":"doe","points": 10}', '{"name":"jane", "points": 20}'),
stringsAsFactors=FALSE)
df %>%
mutate(json_col = paste0('[',json_col,']'),
json_col = map(json_col, function(x) extract(input=x, fields=c('name', 'points')))) %>%
unnest(cols=c(json_col))

how to merge the following dataset, as independent rows?

I would like to create a new data frame from two existing data frames. They share the columns first name, last name, and email, and I want to append the second data frame to the first so that I get a single list of all the emails I have. The data frames contain duplicates, which I want to keep for now and remove in the next step. Obviously, the code I posted below does not work. Any help?
first <- c("andrea","luis","mike","thomas")
last <- c("robinson", "trout", "rice","snell")
email <- c("andrea#gmail.com", "lt#gmail.com", "mr#gmail.com", "tom#gmail.com")
first <- c("mike","steven","mark","john", "martin")
last <- c("rice", "berry", "smalls","sale", "arnold")
email <- c("mr#gmail.com", "st#gmail.com", "ms#gmail.com", "js#gmail.com", "ma#gmail.com")
alz <- c(1,2,NA,3,4)
der <- c(0,2,3,NA,3)
all_emails <- data.frame(first,last,email)
no_contact_emails <- data.frame(first,last,email,alz,der)
df <- merge(no_contact_emails, all_emails, all = TRUE)
df <- df$email[!duplicated(df$email) & !duplicated(df$email, fromLast = TRUE)]
The expected output is a joined dataset with all the emails except the one for mike rice, since that is the one that is duplicated.
Your reproducible example is a little confusing, so I made you a new one to see if this is what you are looking for:
df1 <- data.frame(
first = c("andrea","luis","mike","thomas"),
last = c("robinson", "trout", "rice","snell"),
email = c("andrea#gmail.com", "lt#gmail.com", "mr#gmail.com", "tom#gmail.com")
)
df2 <- data.frame(
first = c("mike","steven","mark","john", "martin"),
last = c("rice", "berry", "smalls","sale", "arnold"),
email = c("mr#gmail.com", "st#gmail.com", "ms#gmail.com", "js#gmail.com",
"ma#gmail.com")
)
Now, there are 2 different ways you can do this, using dplyr:
library(dplyr)
df1 %>%
bind_rows(df2) %>%
distinct(first, last, .keep_all = TRUE)
Or:
df1 %>%
full_join(df2)
Hope this helps!
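Both options above keep one copy of a duplicated email. If, as the question's expected output seems to suggest, every copy of a duplicated email should be dropped instead, a minimal sketch (assuming email alone identifies a person) could look like this:

```r
library(dplyr)

df1 <- data.frame(
  first = c("andrea", "luis", "mike", "thomas"),
  last = c("robinson", "trout", "rice", "snell"),
  email = c("andrea#gmail.com", "lt#gmail.com", "mr#gmail.com", "tom#gmail.com")
)
df2 <- data.frame(
  first = c("mike", "steven", "mark", "john", "martin"),
  last = c("rice", "berry", "smalls", "sale", "arnold"),
  email = c("mr#gmail.com", "st#gmail.com", "ms#gmail.com", "js#gmail.com", "ma#gmail.com")
)

# Stack the two data frames, then keep only emails that occur exactly once
only_once <- bind_rows(df1, df2) %>%
  group_by(email) %>%
  filter(n() == 1) %>%
  ungroup()
only_once
```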

Finding elements from multiple columns of one dataframe that are not in multiple columns of another

library(tidyverse)
I have two dataframes (see sample code at bottom) called Df1 and Df2. I want to find phone numbers in Df1 (from all the columns) that are not in any of the phone number columns in Df2.
First, I restructure Df1 so that there is only one Id per row.
Df1<-Df1 %>%
gather(key, value, -Id) %>%
filter(!is.na(value)) %>%
select(-key) %>%
group_by(Id) %>%
filter(!duplicated(value)) %>%
mutate(Phone=paste0("Phone_",1:n())) %>%
spread(Phone, value)
Next, I rename Df2 and then use a join to find only Ids in Df1 that are in Df2.
Df2<-Df2%>%set_names(c("Id","Ph1","Ph2"))
DfJoin<-left_join(Df2,Df1,by="Id")
This is where I'm stuck. I want to find all the numbers in Df1 (Phone_1, Phone_2, and Phone_3) that are not in Df2 (Ph1 and Ph2). Below are some ideas for code. I tried many variations of this idea but could not find a way to achieve what I want. The final product should just be a table with the phone number(s) in any Df1 column that are not in any Df2 column, together with the associated Id. I'm also wondering whether there is another join or set operation that would achieve this more efficiently.
DfJoin<-DfJoin%>%mutate(New=if_else(! DfJoin[2:3] %in% DfJoin[4:6]),1,0)
DfJoin<-DfJoin%>%filter(! DfJoin[2:3] %in% DfJoin[2:4])
Sample Data:
Dataframe 1:
Id<-c(199,148,148,145,177,165,144,121,188,188,188,111)
Ph1<-c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,6451123425,6783450101,7890986543,6785554444,8764443344,6453348736)
Ph2<-c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
Df1<-data.frame(Id,Ph1,Ph2)
Dataframe 2:
Id2<-c(199,148,142,145,177,165,144,121,182,109,188,111)
Phone1<-c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,6451123425,6783450101,7890986543,6785554400,8764443344,6453348736)
Phone2<-c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
Df2<-data.frame(Id2,Phone1,Phone2)
One way to think about this problem:
You have a set of phone numbers in df1 for each ID number.
You have a set of phone numbers in df2 for each ID number.
You want to find, within each ID, the set difference between df1 and df2.
You can do this by mapping the base R function setdiff() onto your joined dataframe. To do this, you need to convert your data frames into list-column format, where all the phone numbers for each ID are present as a list in a "cell" of the dataframe. This is easily done by combining group_by(), summarize() and list().
library(tidyverse)
# create example data
Id <- c(199,148,148,145,177,165,144,121,188,188,188,111)
ph1 <- c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,6451123425,6783450101,7890986543,6785554444,8764443344,6453348736)
ph2 <- c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
df1 <- data.frame(Id, ph1, ph2)
Id2 <- c(199,148,142,145,177,165,144,121,182,109,188,111)
phone1 <- c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,6451123425,6783450101,7890986543,6785554400,8764443344,6453348736)
phone2 <- c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
df2 <- data.frame(Id=Id2, phone1, phone2)
# convert the data to list-column format
df1.listcol <- df1 %>%
gather(col, phone, -Id) %>%
na.omit() %>%
group_by(Id) %>%
summarize(phone_list1 = list(phone))
df2.listcol <- df2 %>%
gather(col, phone, -Id) %>%
na.omit() %>%
group_by(Id) %>%
summarize(phone_list2 = list(phone))
Take a look at these dataframes to make sure you understand how we've reformatted them. Obviously, we could save a few lines of code by making this conversion process into a function, and then calling the function on each of df1 and df2, but I didn't do that here.
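The conversion function mentioned above could be sketched as follows. The name to_listcol and its arguments are hypothetical, the Id column name is hard-coded, and pivot_longer is used in place of gather; the small data frame is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: collapse all phone columns into one list-column per Id.
# `name` is the name to give the resulting list-column.
to_listcol <- function(data, name) {
  data %>%
    pivot_longer(-Id, names_to = "col", values_to = "phone", values_drop_na = TRUE) %>%
    group_by(Id) %>%
    summarize("{name}" := list(phone))
}

# Made-up demo data
demo <- data.frame(Id = c(1, 1, 2), ph1 = c(10, 11, 12), ph2 = c(NA, 10, NA))
demo_listcol <- to_listcol(demo, "phone_list1")
demo_listcol
```

With this helper, the two conversions become to_listcol(df1, "phone_list1") and to_listcol(df2, "phone_list2").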
# join the two listcol dfs by Id, then map setdiff on the two columns
result <-
df1.listcol %>%
left_join(df2.listcol, by='Id') %>%
mutate(only_list_1 = map2(phone_list1, phone_list2, ~setdiff(.x, .y))) %>%
select(Id, only_list_1) %>%
unnest(only_list_1)
result
The result is
Id only_list_1
148 6541132112
188 7890986543
188 6785554444
Have you tried anti_join(a, b, by = "x1")?
This basically gives you all rows in a that have no match in b
DfJoin <- anti_join(Df1, Df2, by = "Id")
tidyr_dplyr cheatsheet
Use the above cheatsheet for data manipulation in tidyverse
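Note that joining by Id alone compares Ids, not phone numbers. To compare the numbers themselves with anti_join, one possible sketch is to reshape both data frames to one row per Id and phone number first, and then join on both columns. The names long1, long2 and only_in_df1 are made up; the data is the sample data from the question, with the Id column of the second data frame renamed to Id.

```r
library(dplyr)
library(tidyr)

Id <- c(199,148,148,145,177,165,144,121,188,188,188,111)
ph1 <- c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,
         6451123425,6783450101,7890986543,6785554444,8764443344,6453348736)
ph2 <- c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
df1 <- data.frame(Id, ph1, ph2)

Id2 <- c(199,148,142,145,177,165,144,121,182,109,188,111)
phone1 <- c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,
            6451123425,6783450101,7890986543,6785554400,8764443344,6453348736)
phone2 <- c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
df2 <- data.frame(Id = Id2, phone1, phone2)

# One row per (Id, phone) pair, without NAs or repeated pairs
long1 <- df1 %>%
  pivot_longer(-Id, values_to = "phone", values_drop_na = TRUE) %>%
  distinct(Id, phone)
long2 <- df2 %>%
  pivot_longer(-Id, values_to = "phone", values_drop_na = TRUE) %>%
  distinct(Id, phone)

# Pairs present in df1 that have no matching pair in df2
only_in_df1 <- anti_join(long1, long2, by = c("Id", "phone"))
only_in_df1
```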
