How to merge the following datasets as independent rows? - R

I would like to create a new data frame from two existing data frames. They share columns called first name, last name, and email, but I want to merge them so that the second data frame is simply appended to the first, giving me a list of all the emails I have. The data frames contain duplicates, and I want to keep them for now so I can eliminate them in the next step. Obviously, the code I posted below does not work. Any help?
first <- c("andrea","luis","mike","thomas")
last <- c("robinson", "trout", "rice","snell")
email <- c("andrea#gmail.com", "lt#gmail.com", "mr#gmail.com", "tom#gmail.com")
first <- c("mike","steven","mark","john", "martin")
last <- c("rice", "berry", "smalls","sale", "arnold")
email <- c("mr#gmail.com", "st#gmail.com", "ms#gmail.com", "js#gmail.com", "ma#gmail.com")
alz <- c(1,2,NA,3,4)
der <- c(0,2,3,NA,3)
all_emails <- data.frame(first,last,email)
no_contact_emails <- data.frame(first,last,email,alz,der)
df <- merge(no_contact_emails, all_emails, all = TRUE)
df <- df$email[!duplicated(df$email) & !duplicated(df$email, fromLast = TRUE)]
The expected output is a joined dataset with all the emails except the one for mike rice, since that is the one that is duplicated.

Your reproducible example is a little confusing, so I made you a new one to see if this is what you are looking for:
df1 <- data.frame(
first = c("andrea","luis","mike","thomas"),
last = c("robinson", "trout", "rice","snell"),
email = c("andrea#gmail.com", "lt#gmail.com", "mr#gmail.com", "tom#gmail.com")
)
df2 <- data.frame(
first = c("mike","steven","mark","john", "martin"),
last = c("rice", "berry", "smalls","sale", "arnold"),
email = c("mr#gmail.com", "st#gmail.com", "ms#gmail.com", "js#gmail.com",
"ma#gmail.com")
)
Now, there are 2 different ways you can do this, using dplyr:
library(dplyr)
df1 %>%
bind_rows(df2) %>%
distinct(first, last, .keep_all = TRUE)
Or:
df1 %>%
full_join(df2)
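Both snippets above keep one copy of mike rice, whereas the question asks for the duplicated email to be dropped entirely. A minimal base R sketch of that variant, assuming "duplicate" means the same email appearing more than once after stacking (the names `stacked` and `unique_only` are made up for illustration):

```r
df1 <- data.frame(
  first = c("andrea", "luis", "mike", "thomas"),
  last  = c("robinson", "trout", "rice", "snell"),
  email = c("andrea#gmail.com", "lt#gmail.com", "mr#gmail.com", "tom#gmail.com")
)
df2 <- data.frame(
  first = c("mike", "steven", "mark", "john", "martin"),
  last  = c("rice", "berry", "smalls", "sale", "arnold"),
  email = c("mr#gmail.com", "st#gmail.com", "ms#gmail.com", "js#gmail.com",
            "ma#gmail.com")
)

# Stack the second data frame under the first; duplicates are kept at this stage
stacked <- rbind(df1, df2)

# Mark every row whose email occurs more than once (first and later occurrences)
is_dup <- duplicated(stacked$email) | duplicated(stacked$email, fromLast = TRUE)

# Keep only the rows whose email appears exactly once
unique_only <- stacked[!is_dup, ]
```

This reuses the `duplicated(..., fromLast = TRUE)` idea from the question's own last line, applied after a plain row bind instead of `merge()`.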
Hope this helps!

Related

R: automatically generate dataframes from single function

I have a collection of ten dataframes (df.a, df.b, and so on) and the following concept in R
df.x.new = df.x %>% do this %>% do that...
I wondered if there is an elegant way to interchange the df.x variable of my single code line above iteratively with my dfs one by another, to get ten new dfs as an output.
Meaning something like this:
#place your elegant code here
df.a.new = df.a %>% do this %>% do that
df.b.new = df.b %>% do this %>% do that
#and so on
Edit:
#this should serve as a minimal reproducible code
df.a = c(1,2,3)
df.b = c(4,5,6)
df.c = c(7,8,9)
df.a.new = df.a %>% left_join(df.b)
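One possible approach, sketched here under assumptions rather than taken from the thread, is to keep the data frames in a named list and apply the same function to each element. The list `dfs`, the column `x`, and the helper `add_double` are hypothetical stand-ins for the real data and the real "do this, do that" steps:

```r
# Hypothetical inputs: three small data frames stored in one named list
dfs <- list(
  a = data.frame(x = c(1, 2, 3)),
  b = data.frame(x = c(4, 5, 6)),
  c = data.frame(x = c(7, 8, 9))
)

# Hypothetical transformation standing in for "do this %>% do that"
add_double <- function(d) {
  d$y <- d$x * 2
  d
}

# Apply the same transformation to every data frame; names are preserved
new_dfs <- lapply(dfs, add_double)

# Individual results are then available by name, e.g. new_dfs$a
```

Keeping the results in a list avoids creating ten separately named objects, which tends to make later steps easier to write.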

Lookup tables in R

I have a tibble with a ton of data in it, but most importantly, I have a column that references a row in a lookup table by number (ex. 1,2,3 etc).
df <- tibble(ref = c(1,1,1,2,5),
data = c(33,34,35,35,32))
lkup <- tibble(CurveID = c(1,2,3,4,5),
Slope = c(-3.8,-3.5,-3.1,-3.3,-3.3),
Intercept = c(40,38,40,38,36),
Min = c(25,25,21,21,18),
Max = c(36,36,38,37,32))
I need to do a calculation for each row in the original tibble based on the information in the referenced row in the lookup table.
df$result <- df$data - lkup$Intercept[lkup$CurveID == df$ref]/lkup$Slope[lkup$CurveID == df$ref]
The idea is to access the slope or intercept (etc) value from the correct row of the lookup table based on the number in the data table, and to do this for each data point in the column. But I keep getting an error telling me my data isn't compatible, and that my objects need to be of the same length.
You could also do it with match()
df$result <- df$data - lkup$Intercept[match(df$ref, lkup$CurveID)]/lkup$Slope[match(df$ref, lkup$CurveID)]
df$result
# [1] 43.52632 44.52632 45.52632 45.85714 42.90909
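To make the `match()` step more concrete, here is a small sketch that computes the index vector once and reuses it. It uses plain data frames instead of tibbles so that it runs without extra packages, and the intermediate name `idx` is made up for illustration:

```r
df <- data.frame(ref = c(1, 1, 1, 2, 5),
                 data = c(33, 34, 35, 35, 32))
lkup <- data.frame(CurveID = c(1, 2, 3, 4, 5),
                   Slope = c(-3.8, -3.5, -3.1, -3.3, -3.3),
                   Intercept = c(40, 38, 40, 38, 36))

# For each value in df$ref, find the row of lkup with the same CurveID.
# The result has one entry per row of df, so lengths always line up.
idx <- match(df$ref, lkup$CurveID)

# Index the lookup columns with idx to get per-row Intercept and Slope
df$result <- df$data - lkup$Intercept[idx] / lkup$Slope[idx]
```

Because `idx` has the same length as `df$ref`, this sidesteps the length mismatch that logical indexing with `==` runs into.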
You could use the dplyr package to join the tibbles together. If the ref column and CurveID column have the same name then left_join will combine the two tibbles by the matching rows.
library(dplyr)
df <- tibble(CurveID = c(1,1,1,2,5),
data = c(33,34,35,35,32))
lkup <- tibble(CurveID = c(1,2,3,4,5),
Slope = c(-3.8,-3.5,-3.1,-3.3,-3.3),
Intercept = c(40,38,40,38,36),
Min = c(25,25,21,21,18),
Max = c(36,36,38,37,32))
df <- df %>% left_join(lkup, by = "CurveID")
Then do the calculation on each row:
df <- df %>% mutate(result = data - (Intercept/Slope)) %>%
select(CurveID, data, result)
For completeness' sake, here's one way to literally do what OP was trying:
library(slider)
df %>%
mutate(result = slide_dbl(ref, ~ slice(lkup, .x)$Intercept /
slice(lkup, .x)$Slope))
though since slice goes by row number, this relies on CurveID equalling the row number (we make no reference to CurveID at all). You can write it differently with filter but it ends up being more code.
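As a rough illustration of the filter-style variant mentioned above, the following base R sketch looks each row up by `CurveID` instead of by row number. It is not the answerer's code: it swaps `slider` and `slice` for `vapply` and logical subsetting, and the lookup table is deliberately stored in reverse order (an assumption made here) to show that row position no longer matters:

```r
df <- data.frame(ref = c(1, 1, 1, 2, 5),
                 data = c(33, 34, 35, 35, 32))

# Same lookup values as in the question, but with rows in reverse order
lkup <- data.frame(CurveID = c(5, 4, 3, 2, 1),
                   Slope = c(-3.3, -3.3, -3.1, -3.5, -3.8),
                   Intercept = c(36, 38, 40, 38, 40))

# For each ref, filter the lookup table down to the matching CurveID
ratio <- vapply(df$ref, function(r) {
  row <- lkup[lkup$CurveID == r, ]
  row$Intercept / row$Slope
}, numeric(1))

df$result <- df$data - ratio
```

This assumes every `ref` matches exactly one `CurveID`; `vapply` stops with an error if a lookup returns zero or several rows.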

mutate the new data frame if email and unique ID is duplicate

I have a sample data frame and I want to check whether values are duplicated, adding new columns that flag duplicates as 1 and everything else as 0. I am trying the code below, but it isn't working for me.
df4 <- data.frame(emp_id =c("DEV-2962","KTN_2252","ANA2719","ITI_2624","DEV2698","HRT2921","","KTN2624","DEV2698","ITI2535","DEV2698","HRT2837","ERV2951","KTN2542","ANA2813","ITI2210"),
email = c("akash.dev#abcd.com","rahul.singh#abcd.com","salman.abbas#abcd.com","ram.lal#abcd.com","ram.lal#xyz.com","prabal.garg#xyz.com","sanu.ali#abcd.com","kunal.singh#abcd.com","lakhan.tomar#abcd.com","praveen.thakur#abcd.com","sarman.ali#abcd.com","zuber.khan#dkl.com","giriraj.singh#dkl.com","lokesh.sharma#abcd.com","pooja.pawar#abcd.com","nikita.sharma#abcd.com"))
ID = "emp_id"
Email = "email"
ID <- sym(ID)
Email <- sym(email)
df4 <- df4 %>% group_by(!!ID) %>%
mutate(Flag=1:n(),`Duplicate_ID`=ifelse(Flag==1,0,1)) %>% select(-Flag)
df4 <- df4 %>% filter(!is.na(!!Email)) %>% group_by(!!Email) %>%
mutate(Flag=1:n(),`Duplicate_email`=ifelse(Flag==1,0,1)) %>% select(-Flag) %>% ungroup(.)
The columns for name and email can have different names in different data frames, so I also want to handle that.
I also want an input parameter so that the user can give the column names for their own data frame,
and I will reuse it in my script. Do you have any suggestions for that?
For example, here I am using sym to fix the parameter in the script.
Instead of getting into non-standard evaluation, try across. Also, as far as I can read your code, you are trying to assign 0 to the first instance of each value in a column and 1 to all of its duplicates. You can do this with duplicated(), so there is no need for group_by, ifelse, etc.
library(dplyr)
ID = "emp_id"
Email = "email"
df4 <- df4 %>%
mutate(across(all_of(c(ID, Email)), ~as.integer(duplicated(.)), .names = 'flag_{col}'))
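Since the question also asks for user-supplied column names, the same idea can be wrapped in a function. The following is a minimal base R sketch rather than the answerer's dplyr code; the function name `flag_duplicates`, its `cols` argument, and the small `sample_df` are all hypothetical:

```r
# Add a 0/1 flag column for each requested column:
# 0 for the first occurrence of a value, 1 for every later repeat
flag_duplicates <- function(df, cols) {
  missing_cols <- setdiff(cols, names(df))
  if (length(missing_cols) > 0) {
    stop("Unknown column(s): ", paste(missing_cols, collapse = ", "))
  }
  for (col in cols) {
    df[[paste0("flag_", col)]] <- as.integer(duplicated(df[[col]]))
  }
  df
}

# Hypothetical sample data with one repeated ID and one repeated email
sample_df <- data.frame(
  emp_id = c("A1", "B2", "A1", ""),
  email  = c("x#abcd.com", "y#abcd.com", "z#abcd.com", "x#abcd.com")
)

flagged <- flag_duplicates(sample_df, cols = c("emp_id", "email"))
```

Note that `duplicated()` treats an empty string like any other value, so blank IDs would need to be filtered out or converted to `NA` first if they should not count as duplicates of each other.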

How to arrange the strings in my data frame given below in a desired order in R?

I have a data frame where the elements in one of the columns are
"1.cn3.ap.1"
"7.fr9.ap.3"
"4.dl.ap.2"
"5.d2.cr.1"
"4.dl.u.1"
"4.dl.ap.1"
df<- df[order(df$A),]
#this gave the following result :
"1.cn3.ap.1"
"4.dl.ap.1"
"4.dl.ap.2"
"4.dl.u.1"
"5.d2.cr.1"
"7.fr9.ap.3"
But I need my data in this manner:
"1.cn3.ap.1"
"4.dl.u.1"
"4.dl.ap.1"
"4.dl.ap.2"
"5.d2.cr.1"
"7.fr9.ap.3"
You may be able to get what you need by splitting the strings on the period, sorting on the individual columns, and then bringing the columns back together after sorting. Something similar to this, maybe?
library(dplyr)
library(tidyr)
df <- df %>%
separate(A, into = c("part1","part2","part3","part4"), sep = "\\.") %>%
arrange(part1, part2, desc(part3), part4) %>%
unite(A, part1:part4, sep = ".")
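The same split-then-sort idea can be sketched in base R. This version is an illustration rather than the answerer's code: it assumes every string has exactly four period-separated parts and that the first and last parts are whole numbers, which it compares numerically (a choice made here so that "10" would sort after "9"):

```r
A <- c("1.cn3.ap.1", "7.fr9.ap.3", "4.dl.ap.2",
       "5.d2.cr.1", "4.dl.u.1", "4.dl.ap.1")

# Split each string on the literal period into a 4-column character matrix
parts <- do.call(rbind, strsplit(A, ".", fixed = TRUE))

# Order by part 1 (numeric), part 2, part 3 descending, then part 4 (numeric).
# xtfrm() turns the character column into ranks so it can be negated.
ord <- order(as.integer(parts[, 1]),
             parts[, 2],
             -xtfrm(parts[, 3]),
             as.integer(parts[, 4]))

sorted <- A[ord]
```

Sorting the third part in descending order is what puts "u" ahead of "ap" within the same prefix, matching the order requested in the question.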

How to work around error while reshape data frame with spread()

I am trying to transform a long data frame into a wide one with flagged cases. I pivot it and use a temporary vector that serves as a flag. It works perfectly on small data sets (see the example; copy and paste it into RStudio), but when I try it on real data it reports an error:
churnTrain3 <- spread(churnTrain, key = "state", value = "temporary", fill = 0)
Error: Duplicate identifiers for rows (169, 249), (57, 109), (11, 226)
The wide structure of the data set matters for further processing.
Is there any workaround for this problem? I bet a lot of people run into the same problem when cleaning data.
Please help me.
Here is the code:
The first chunk, "example", builds a small data set to show how the result is supposed to look.
The second chunk, "real data", is a sliced portion of the data set from the churn library.
library(caret)
library(tidyr)
#example
#============
df <- data.frame(var1 = (1:6),
var2 = (7:12),
factors = c("facto1", "facto2", "facto3", "facto3","facto5", "facto1") ,
flags = c(1, 1, 1, 1, 1, 1))
df
df2 <- spread(data = df, key = "factors" , value = flags, fill = " ")
df2
#=============
# real data
#============
data(churn)
str(churnTrain)
churnTrain <- churnTrain[1:250,1:4]
churnTrain$temporary <-1
churnTrain3 <- spread(churnTrain, key = "state", value = "temporary", fill = 0)
str(churnTrain)
head(churnTrain3)
str(churnTrain3)
#============
Spread can only put one unique value in the 'cell' that intersects the spread 'key' and the rest of the data (in the churn example, account_length, area_code and international_plan). So the real question is how to manage these duplicate entries. The answer to that depends on what you are trying to do. I provide one possible solution below. Instead of making a dummy 'temporary' variable, I instead count the number of episodes and use that as the dummy variable. This can be done very easily with dplyr:
library(tidyr)
library(dplyr)
library(C50) # this is one source for the churn data
data(churn)
churnTrain <- churnTrain[1:250,1:4]
churnTrain2 <- churnTrain %>%
group_by(state, account_length, area_code, international_plan) %>%
tally %>%
dplyr::rename(temporary = n)
churnTrain3 <- spread(churnTrain2, key = "state", value = "temporary", fill = 0)
Spread now works.
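In newer versions of tidyr, `spread()` has been superseded by `pivot_wider()`, and the count-then-widen idea above carries over directly. The sketch below uses a small made-up data frame (`long`, with hypothetical `id` and `state` columns) instead of the churn data, and assumes a tidyr version that accepts a scalar `values_fill`:

```r
library(dplyr)
library(tidyr)

# Made-up long data: id 1 appears twice with the same state,
# which is the kind of duplicate that makes spread() fail
long <- data.frame(
  id = c(1, 1, 2, 3),
  state = c("KS", "KS", "OH", "KS")
)

wide <- long %>%
  # Collapse duplicate rows into a count, named like the thread's flag column
  count(id, state, name = "temporary") %>%
  # One column per state; combinations that never occur are filled with 0
  pivot_wider(names_from = state, values_from = temporary, values_fill = 0)
```

Counting first guarantees one row per identifier combination, so the widening step has a single value for each cell.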
As others point out, spread needs uniquely identified rows as input. My solution is to use base R:
library(C50)
f<- function(df, key){
if (sum(names(df)==key)==0) stop("No such key");
u <- unique(df[[key]])
id <- matrix(0,dim(df)[1],length(u))
uu <- lapply(df[[key]],function(x)which(u==x)) ## check 43697442 for details
for(i in 1:dim(df)[1]) id[i,uu[[i]]] <- 1
colnames(id) = as.character(u)
return(cbind(df,id));
}
df <- data.frame(var1 = (1:6),
var2 = (7:12),
factors = c("facto1", "facto2", "facto3", "facto3","facto5", "facto1"))
f(df, key='fact')
f(df, key='factors')
data(churn)
churnTrain <- churnTrain[1:250,1:4]
f(churnTrain, key='state')
Although the f function contains a for-loop and some temporary variables, it is not slow in practice.
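For comparison, the indicator columns that `f` builds can also be produced without an explicit loop. This is a sketch of an alternative, not the answerer's method: it uses `outer()` to compare every row's key against every unique key, and the names `keys`, `u`, `id`, and `wide` are chosen here for illustration:

```r
df <- data.frame(var1 = 1:6,
                 var2 = 7:12,
                 factors = c("facto1", "facto2", "facto3",
                             "facto3", "facto5", "facto1"))

# Work on character values so the comparison behaves the same
# whether or not the column was stored as a factor
keys <- as.character(df$factors)
u <- unique(keys)

# Rows x unique-keys matrix: 1 where the row's key equals the column's key
id <- outer(keys, u, "==") * 1
colnames(id) <- u

# Attach the indicator columns to the original data
wide <- cbind(df, id)
```

Each row ends up with exactly one 1 across the new columns, which mirrors the output of `f(df, key = 'factors')`.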

Resources