R: Find matching string, then copy the matching row

I have a multi-step problem. First step: match the text in one column of df1 against a range of columns in df2. The columns are in no particular order and the match could occur anywhere within the range. Once a match is found, copy the matching df2 row into df1. Finally, repeat for the entire df1 column.
df1= structure(list(Assay = c("ATG_AR_trans_up","NVS_PXR_cis","BSK_VCAM1_up"), p.value = c(0.01,0.05,0.0001)), .Names = c("Assay", "p.value"),row.names = c(NA, 3L), class = "data.frame")
df1
            Assay p.value
1 ATG_AR_trans_up  0.0100
2     NVS_PXR_cis  0.0500
3    BSK_VCAM1_up  0.0001
df2=structure(list(GeneID = c("AR", "VACM1", "TR", "ER", "PXR"), Assay1= c("ATG_ARE_cis", "BSK_hEDG_VCAM1", "NVS_TR_tran", "ATG_ER_UP", "NVS_PXRE_UP"), Assay2= c("ATG_AR_trans_up", "BSK_BE3K_VCAM1", "NA", "ATG_ERE_cis", "ATG_PXRE_cis"), Assay3= c("NVS_AR_trans", "BSK_VCAM1_UP", "NA", "NVS_ERa_CIS", "NVS_PXR_cis"), Assay4= c("Tox21_AR_ARE","NA", "NA", "Tox21_ERaERb_lig", "NA")), .Names = c("GeneID", "Assay1", "Assay2", "Assay3", "Assay4"),row.names = c(NA, 5L), class = "data.frame")
df2
  GeneID         Assay1          Assay2       Assay3           Assay4
1     AR    ATG_ARE_cis ATG_AR_trans_up NVS_AR_trans     Tox21_AR_ARE
2  VACM1 BSK_hEDG_VCAM1  BSK_BE3K_VCAM1 BSK_VCAM1_UP               NA
3     TR    NVS_TR_tran              NA           NA               NA
4     ER      ATG_ER_UP     ATG_ERE_cis  NVS_ERa_CIS Tox21_ERaERb_lig
5    PXR    NVS_PXRE_UP    ATG_PXRE_cis  NVS_PXR_cis               NA
Essentially becomes
df
            Assay p.value GeneID         Assay1          Assay2       Assay3       Assay4
1 ATG_AR_trans_up  0.0100     AR    ATG_ARE_cis ATG_AR_trans_up NVS_AR_trans Tox21_AR_ARE
2     NVS_PXR_cis  0.0500    PXR    NVS_PXRE_UP    ATG_PXRE_cis  NVS_PXR_cis           NA
3    BSK_VCAM1_up  0.0001  VACM1 BSK_hEDG_VCAM1  BSK_BE3K_VCAM1 BSK_VCAM1_UP           NA
For brevity I shortened the data frames substantially, but the real data is around 88 assay columns and some 4,000 rows to search for just one match (and there are about 30 matches). So my initial instinct is to loop, but I was told grep might be helpful (note that grep is a base R function, not a package). Any help would be appreciated.
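Before the answers below: the core lookup can also be written in a few lines of base R. This is only a sketch using the shortened example frames from the question (`row_for` is a hypothetical helper name; exact, case-insensitive matching, first hit wins):

```r
# Recreate the example frames from the question
df1 <- data.frame(Assay = c("ATG_AR_trans_up", "NVS_PXR_cis", "BSK_VCAM1_up"),
                  p.value = c(0.01, 0.05, 0.0001), stringsAsFactors = FALSE)
df2 <- data.frame(GeneID = c("AR", "VACM1", "TR", "ER", "PXR"),
                  Assay1 = c("ATG_ARE_cis", "BSK_hEDG_VCAM1", "NVS_TR_tran", "ATG_ER_UP", "NVS_PXRE_UP"),
                  Assay2 = c("ATG_AR_trans_up", "BSK_BE3K_VCAM1", NA, "ATG_ERE_cis", "ATG_PXRE_cis"),
                  Assay3 = c("NVS_AR_trans", "BSK_VCAM1_UP", NA, "NVS_ERa_CIS", "NVS_PXR_cis"),
                  Assay4 = c("Tox21_AR_ARE", NA, NA, "Tox21_ERaERb_lig", NA),
                  stringsAsFactors = FALSE)

# For each df1 assay, return the first df2 row containing it (NA if none)
row_for <- function(a) {
  hits <- which(apply(df2[-1], 1, function(r) any(tolower(r) == tolower(a), na.rm = TRUE)))
  if (length(hits)) hits[1] else NA_integer_
}
idx <- vapply(df1$Assay, row_for, integer(1))
result <- cbind(df1, df2[idx, ])
```

This avoids regular expressions entirely; whether exact matching is appropriate depends on whether the real assay names ever appear as substrings of each other.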

Since the OP was interested in a grep solution, another way to do it would be:
asDF2 <- apply(df2, 1, function(r) do.call(paste, as.list(r)))
do.call(rbind, lapply(1:nrow(df1), function(i) {
  matchIX <- grepl(df1$Assay[i], asDF2, ignore.case = TRUE)
  if (any(matchIX))
    cbind(df1[i, ], df2[matchIX, ])
}))
The first line creates a character vector whose elements concatenate the assay names of each df2 row. The second statement loops through df1 and finds matches in asDF2 using grepl.
Or equivalently,
do.call(rbind, lapply(1:nrow(df1), function(i) {
  matchIX <- grepl(df1$Assay[i],
                   data.frame(t(df2), stringsAsFactors = FALSE),
                   ignore.case = TRUE)
  if (any(matchIX))
    cbind(df1[i, ], df2[matchIX, ])
}))
Note that the variants above can match multiple df2 rows to a single df1 row.
NOTE
To test, I added new rows to the original data frames:
df1 <- rbind(df1, data.frame(Assay = "NoMatch", p.value = .2))
df2 <- rbind(df2, data.frame(GeneID = "My", Assay1 = "NVS_PXR_cis",
                             Assay2 = "NA", Assay3 = "NA", Assay4 = "NA"))

This can easily be done with reshaping. I converted all assay names to upper case because case differences were breaking the matching.
library(dplyr)
library(tidyr)
library(stringi)
df2_ID = df2 %>% mutate(new_ID = 1:n())

result =
  df2_ID %>%
  select(new_ID, Assay1:Assay85) %>%
  gather(assay_number, Assay, Assay1:Assay85) %>%
  mutate(Assay =
           Assay %>%
           iconv(to = "ASCII") %>%
           stri_trans_toupper) %>%
  inner_join(df1 %>%
               mutate(Assay =
                        Assay %>%
                        iconv(to = "ASCII") %>%
                        stri_trans_toupper)) %>%
  inner_join(df2_ID)
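The answer above targets the real data (columns Assay1:Assay85). For the shortened example frames, the same reshape-then-join idea can be sketched in base R; `new_ID`, `long` and `ASSAY` are hypothetical names introduced for illustration:

```r
# Example frames from the question
df1 <- data.frame(Assay = c("ATG_AR_trans_up", "NVS_PXR_cis", "BSK_VCAM1_up"),
                  p.value = c(0.01, 0.05, 0.0001), stringsAsFactors = FALSE)
df2 <- data.frame(GeneID = c("AR", "VACM1", "TR", "ER", "PXR"),
                  Assay1 = c("ATG_ARE_cis", "BSK_hEDG_VCAM1", "NVS_TR_tran", "ATG_ER_UP", "NVS_PXRE_UP"),
                  Assay2 = c("ATG_AR_trans_up", "BSK_BE3K_VCAM1", NA, "ATG_ERE_cis", "ATG_PXRE_cis"),
                  Assay3 = c("NVS_AR_trans", "BSK_VCAM1_UP", NA, "NVS_ERa_CIS", "NVS_PXR_cis"),
                  Assay4 = c("Tox21_AR_ARE", NA, NA, "Tox21_ERaERb_lig", NA),
                  stringsAsFactors = FALSE)

df2$new_ID <- seq_len(nrow(df2))
assay_cols <- paste0("Assay", 1:4)
# Stack the assay columns into long form, upper-cased for case-insensitive matching
long <- data.frame(new_ID = rep(df2$new_ID, times = length(assay_cols)),
                   Assay  = toupper(unlist(df2[assay_cols], use.names = FALSE)),
                   stringsAsFactors = FALSE)
df1$ASSAY <- toupper(df1$Assay)
hits <- merge(df1, long, by.x = "ASSAY", by.y = "Assay")
result <- merge(hits, df2, by = "new_ID")
```

The two merges play the role of the two inner_join calls above: one attaches the df2 row index, the other pulls the full df2 row back in.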

Since you're new to R, I think you are right that the most intuitive way to do this is with a for-loop. This is not the most concise or most efficient way to do this, but it should be clear what's going on.
# Creating example data
df1 <- as.data.frame(matrix(data = c("aa", "bb", "ee", .9, .5, .7), nrow = 3))
names(df1) <- c("assay", "p")
df2 <- as.data.frame(matrix(data = c("G1", "G2", "aa", "dd", "bb", "ee", "cc", "ff"), nrow = 2))
names(df2) <- c("GeneID", "assay1", "assay2", "assay3")

# Building a data frame to store output
df3 <- as.data.frame(matrix(data = NA, nrow = dim(df1)[1], ncol = dim(df2)[2]))
names(df3) <- names(df2)

# Populating the output data frame
for (i in 1:dim(df1)[1]) {
  index <- which(df2 == as.character(df1$assay[i]), arr.ind = TRUE)[1]
  for (j in 1:dim(df3)[2]) {
    df3[i, j] <- as.character(df2[index, j])
  }
}
df <- cbind(df1, df3)

Edit after clarification from the user:
I created a triple for-loop to check your values. It looks for a match by looping through all columns and all the values in each column.
My code is not perfect yet (I'm also a beginner in R), but I wanted to post it so that maybe we can work something out together :).
First I convert your data to data.frames. After that I create an empty output, which I later fill one match at a time.
A remaining weakness of this method is that append() will also append the column names, which results in multiple useless column names in the output.
df3 <- as.data.frame(df1)
df4 <- as.data.frame(df2)
output <- data.frame()
for (j in 1:nrow(df3)) {
  match <- FALSE
  for (i in 2:ncol(df4)) {
    for (p in 1:nrow(df4)) {
      if ((df3[j, 1] == df4[p, i]) && (match == FALSE)) {
        output <- append(output, c(df3[j, ], df4[p, ]))  # df4[p, ], not df4[j, ]
        match <- TRUE
      }
    }
  }
}
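One way around the append() column-name problem mentioned above is to accumulate complete rows with rbind instead. A sketch using trimmed versions of the question's frames (the tolower comparison is an addition, making the match case-insensitive where the plain == version is not):

```r
# Trimmed example frames (same shape as the question's df1/df2)
df3 <- data.frame(Assay = c("ATG_AR_trans_up", "NVS_PXR_cis", "BSK_VCAM1_up"),
                  p.value = c(0.01, 0.05, 0.0001), stringsAsFactors = FALSE)
df4 <- data.frame(GeneID = c("AR", "VACM1", "PXR"),
                  Assay1 = c("ATG_ARE_cis", "BSK_VCAM1_UP", "NVS_PXR_cis"),
                  Assay2 = c("ATG_AR_trans_up", "BSK_BE3K_VCAM1", "ATG_PXRE_cis"),
                  stringsAsFactors = FALSE)

output <- NULL
for (j in seq_len(nrow(df3))) {
  for (p in seq_len(nrow(df4))) {
    # Compare df3's assay against every assay column of df4's row p
    if (any(tolower(df3$Assay[j]) == tolower(unlist(df4[p, -1])), na.rm = TRUE)) {
      output <- rbind(output, cbind(df3[j, ], df4[p, ]))
      break  # stop at the first matching row (replaces the match flag)
    }
  }
}
```

Because rbind stacks whole one-row data frames, the column names stay intact and the inner break replaces the match flag.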

Assuming you don't have any repeated entries in df2 corresponding to an entry in df1, the following solves your problem:
assay <- as.matrix(df1[, 1])
m1 <- as.numeric(sapply(assay, function(x) grep(x, df2[, 2], ignore.case = TRUE), simplify = FALSE))
m2 <- as.numeric(sapply(assay, function(x) grep(x, df2[, 3], ignore.case = TRUE), simplify = FALSE))
m3 <- as.numeric(sapply(assay, function(x) grep(x, df2[, 4], ignore.case = TRUE), simplify = FALSE))
m4 <- as.numeric(sapply(assay, function(x) grep(x, df2[, 5], ignore.case = TRUE), simplify = FALSE))
m1[is.na(m1)] <- 0
m2[is.na(m2)] <- 0
m3[is.na(m3)] <- 0
m4[is.na(m4)] <- 0
m0 <- m1 + m2 + m3 + m4
df <- NULL
for (i in 1:nrow(df1)) {
  df3 <- cbind(df1[i, ], df2[m0[i], ])
  df <- rbind(df, df3)
}
Edit: Generalization
Since you have more than 80 assay columns, you can generalize it as follows:
assay <- as.matrix(df1[, 1])
# One list element of match positions per assay column of df2
m <- vector('list', ncol(df2) - 1)
for (i in 1:length(m)) {
  m[[i]] <- as.numeric(sapply(assay, function(x) grep(x, df2[, (i + 1)], ignore.case = TRUE), simplify = FALSE))
}
# Getting the row subscript for df2
m1 <- as.data.frame(m)
m1[is.na(m1)] <- 0
m2 <- rowSums(m1)
df <- NULL
for (i in 1:nrow(df1)) {
  df3 <- cbind(df1[i, ], df2[m2[i], ])
  df <- rbind(df, df3)
}
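The final rbind loop above grows df row by row; since the row subscripts are already known, a single indexed cbind gives the same result under the same at-most-one-match assumption. A sketch with small hypothetical frames (`hit` is an illustrative name):

```r
# Small hypothetical frames for illustration
df1 <- data.frame(Assay = c("ATG_AR_trans_up", "NVS_PXR_cis"),
                  p.value = c(0.01, 0.05), stringsAsFactors = FALSE)
df2 <- data.frame(GeneID = c("AR", "PXR"),
                  Assay1 = c("ATG_AR_trans_up", "NVS_PXRE_UP"),
                  Assay2 = c("NVS_AR_trans", "NVS_PXR_cis"),
                  stringsAsFactors = FALSE)

# One matching df2 row position per df1 row, across all assay columns at once
# (note: grepl treats the assay name as a regular expression)
hit <- sapply(df1$Assay, function(a) {
  s <- rowSums(sapply(df2[-1], function(col) grepl(a, col, ignore.case = TRUE)))
  which(s > 0)[1]
})
df <- cbind(df1, df2[hit, ])
```

Vectorizing the final step this way also avoids the subtle failure mode of summing subscripts: here each row's matches are combined with rowSums over a logical matrix, not added together as indices.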


extracting observations from matrix where columns and rows match a "key"

Given a matrix m, how can I create a TRUE/FALSE or 1/0 matrix where the columns and rows match some "key" in a data frame?
My goal is to assign a 1 or 0 to each location in the matrix where the column matches the cols and the row matches the rows in colsrows_df, then extract the observations where this is true, or paste them into colsrows_df next to the correct columns.
The for-loop below just creates 1's and 0's along the diagonal:
m <- matrix(runif(400), nrow = 20, ncol = 20)
dimnames(m) <- list(c(paste0("ID", 1:5, "_2000"), paste0("ID", 1:5, "_2001"),
                      paste0("ID", 1:5, "_2002"), paste0("ID", 1:5, "_2003")),
                    c(paste0("ID", 1:5, "_2000"), paste0("ID", 1:5, "_2001"),
                      paste0("ID", 1:5, "_2002"), paste0("ID", 1:5, "_2003")))
cols <- colnames(m)
rows <- rownames(m)
library(tidyr)
library(dplyr)
colsrows <- cbind(cols, rows)
# Here I just separate the rows/cols and then add an extra year and paste them back together
colsrows_df <- colsrows %>%
  data.frame %>%
  separate(cols, c("id_col", "year_col"), "_", remove = FALSE) %>%
  separate(rows, c("id_row", "year_row"), "_", remove = FALSE) %>%
  mutate(year_row_plus_1 = as.numeric(year_row) + 1,
         rows = paste0(id_row, "_", year_row_plus_1)) %>%
  select(cols, rows)
colsrows_df
for (i in 1:nrow(colsrows)) {
  m[i, ] <- colnames(m) == colsrows_df$cols
  m[, i] <- rownames(m) == colsrows_df$rows
}
m
EDIT:
This seems to "solve" the problem, but I am not sure how robust it is.
library(reshape2)  # for melt()
ids <- colsrows_df[colsrows_df$cols %in% colnames(m) &
                     colsrows_df$rows %in% rownames(m), ]
res <- melt(m[as.matrix(colsrows_df[colsrows_df$cols %in% colnames(m) &
                                      colsrows_df$rows %in% rownames(m), ][2:1])])
cbind(ids, res)
I think you can first filter colsrows_df down to the rownames and colnames actually present in m, then swap the column order, convert to a matrix, use it to subset m, and set those values to 1.
m[as.matrix(colsrows_df[colsrows_df$cols %in% colnames(m) &
                          colsrows_df$rows %in% rownames(m), ][2:1])] <- 1
Then convert remaining ones to 0
m[m != 1] <- 0
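The key trick above, subsetting with a two-column matrix of row and column names, is a base-R indexing feature worth isolating. A minimal standalone illustration (toy names, not from the question):

```r
# A named 3x3 matrix of zeros
m <- matrix(0, nrow = 3, ncol = 3,
            dimnames = list(c("r1", "r2", "r3"), c("c1", "c2", "c3")))
# A "key" of (row name, column name) pairs to flag
key <- data.frame(rows = c("r1", "r3"), cols = c("c2", "c1"),
                  stringsAsFactors = FALSE)
# Each row of the two-column character matrix addresses one (row, column) cell
m[as.matrix(key[c("rows", "cols")])] <- 1
```

Indexing with a two-column matrix touches exactly the listed cells, which is why the answer only needs a second pass (`m[m != 1] <- 0`) to clear everything else.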

Joining data frames without returning all matching combinations

I have a list of data.frames (in this example only 2):
set.seed(1)
df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)
df.list <- list(df1,df2)
I want to join them into a single data.frame only by a subset of the shared column names, in this case by id.
If I use:
library(dplyr)
df <- df.list %>% purrr::reduce(dplyr::inner_join,by="id")
The shared column names, which I'm not joining by, get suffixed with .x and .y:
id val.x val1 val.y val2
1 G -0.05612874 0.2914462 2.087167 0.7876396
2 G -0.05612874 0.2914462 -0.255027 1.4411577
3 J -0.15579551 -0.4432919 -1.286301 1.0273924
In reality, for the shared column names I'm not joining by, it's good enough to select them from a single data.frame in the list, whichever one they exist in with respect to the joined id.
I don't know these shared column names in advance, but they're not difficult to find out:
E.g.:
df.list.colnames <- unlist(lapply(df.list,function(l) colnames(l %>% dplyr::select(-id))))
df.list.colnames <- table(df.list.colnames)
repeating.colnames <- names(df.list.colnames)[which(df.list.colnames > 1)]
Which will then allow me to separate them from the data.frames in the list:
repeating.colnames.df <- do.call(rbind, lapply(df.list, function(r)
  r %>% dplyr::select_(.dots = c("id", repeating.colnames)))) %>%
  unique()
I can then join the list of data.frames excluding these columns, as above:
for(r in 1:length(df.list)) df.list[[r]] <- df.list[[r]] %>% dplyr::select_(.dots = paste0("-",repeating.colnames))
df <- df.list %>% purrr::reduce(dplyr::inner_join,by="id")
And now I'm left with adding repeating.colnames.df back to that. I don't know of any join in dplyr that won't return all combinations between df and repeating.colnames.df, so it seems all I can do is apply over each df$id, pick the first match in repeating.colnames.df, and join the result with df.
Is there anything less cumbersome for this situation?
If I followed correctly, I think you can handle this by writing a custom function to pass into reduce that identifies the common column names (excluding your joining columns) and excludes those columns from the "second" table in the merge. As reduce works through the list, the function will "accumulate" the unique columns, defaulting to the columns in the "left-most" table.
Something like this:
library(dplyr)
library(purrr)
set.seed(1)
df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)
df.list <- list(df1,df2)
fun <- function(df1, df2, by_col = "id") {
  df1_names <- names(df1)
  df2_names <- names(df2)
  dup_cols <- intersect(df1_names[!df1_names %in% by_col], df2_names[!df2_names %in% by_col])
  out <- dplyr::inner_join(df1, df2[, !(df2_names %in% dup_cols)], by = by_col)
  return(out)
}
df_chase <- df.list %>% reduce(fun, by_col = "id")
Created on 2019-01-15 by the reprex package (v0.2.1)
If I compare df_chase to your final solution, I yield the same answer:
> all.equal(df_chase, df_orig)
[1] TRUE
You can just get rid of the duplicated columns from one of the data frames, since you say you don't really care about them, and simply use base::merge:
set.seed(1)
df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)
# intersect() is safer than a positional names(df1) == names(df2) comparison
duplicates <- setdiff(intersect(names(df1), names(df2)), "id")
df2 <- df2[, !(names(df2) %in% duplicates)]
df12 <- base::merge.data.frame(df1, df2, by = "id")
head(df12)

Merging Long-Form Data that has NAs with Wide-Form Complete Data To Override NAs

So I have three data sets that I need to merge. These contain school data and read/math scores for grades 4 and 5. One of them is a long-form dataset with a lot of missingness in some variables (yes, I do need the data in long form) and the other two have the complete data in wide form. All of these data frames contain a column with a unique ID number for each individual in the database.
Here is a fully reproducible example that generates a small version of the types of data.frames I am working with. The three data frames I need to use are school_lf, school4 and school5: school_lf has the long-form data with NAs, and school4 and school5 are the dfs I need to use to populate the NAs in the long-form data (by id and grade).
set.seed(890)
school <- NULL
school$id <-sample(102938:999999, 100)
school$selected <-sample(0:1, 100, replace = T)
school$math4 <- sample(400:500, 100)
school$math5 <- sample(400:500, 100)
school$read4 <- sample(400:500, 100)
school$read5 <- sample(400:500, 100)
school <- as.data.frame(school)
# Delete observations at random from the school df
indm4 <- which(school$math4 %in% sample(school$math4, 25))
school$math4[indm4] <- NA
indm5 <- which(school$math5 %in% sample(school$math5, 50))
school$math5[indm5] <- NA
indr4 <- which(school$read4 %in% sample(school$read4, 70))
school$read4[indr4] <- NA
indr5 <- which(school$read5 %in% sample(school$read5, 81))
school$read5[indr5] <- NA
# Separate Read and Math
read <- as.data.frame(subset(school, select = -c(math4, math5)))
math <- as.data.frame(subset(school, select = -c(read4, read5)))
# Now turn this into long-form data...
library(reshape2)  # melt() comes from reshape2
clr <- melt(read, id.vars = c("id", "selected"), variable.name = "variable", value.name = "readscore")
clm <- melt(math, id.vars = c("id", "selected"), value.name = "mathscore")
# Clean up the grades for each of these...
clr$grade <- ifelse(clr$variable == "read4", 4,
                    ifelse(clr$variable == "read5", 5, NA))
clm$grade <- ifelse(clm$variable == "math4", 4,
                    ifelse(clm$variable == "math5", 5, NA))
# Put all these in one df
school_lf <-cbind(clm, clr$readscore)
school_lf$readscore <- school_lf$`clr$readscore` # renames
school_lf$`clr$readscore` <- NULL # deletes
school_lf$variable <- NULL # deletes
###############
# Generate the 2 data frames with IDs that have the full data
set.seed(890)
school4 <- NULL
school4$id <-sample(102938:999999, 100)
school4$selected <-sample(0:1, 100, replace = T)
school4$math4 <- sample(400:500, 100)
school4$read4 <- sample(400:500, 100)
school4$grade <- 4
school4 <- as.data.frame(school4)
set.seed(890)
school5 <- NULL
school5$id <-sample(102938:999999, 100)
school5$selected <-sample(0:1, 100, replace = T)
school5$math5 <- sample(400:500, 100)
school5$read5 <- sample(400:500, 100)
school5$grade <- 5
school5 <- as.data.frame(school5)
I need to merge the wide-form data into the long-form data to replace the NAs with the actual values. I have tried the code below, but it introduces several columns instead of merging the read scores and the math scores where there's NA's. I simply need one column with the read scores and one with the math scores, instead of six separate columns (read.x, read.y, math.x, math.y, mathscore and readscore).
sch <- merge(school_lf, school4, by = c("id", "grade", "selected"), all = T)
sch <- merge(sch, school5, by = c("id", "grade", "selected"), all = T)
Any help is highly appreciated! I've been trying to solve this for hours now and haven't made any progress, so I figured I'd ask here.
You can use the coalesce function from dplyr. If a value in the first vector is NA, it will check whether the value at the same position in the second vector is non-NA and select it. If that is again NA, it moves on to the third.
library(dplyr)
sch %>%
  mutate(mathscore = coalesce(mathscore, math4, math5)) %>%
  mutate(readscore = coalesce(readscore, read4, read5)) %>%
  select(id:readscore)
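For readers without dplyr, the two-vector case of coalesce can be sketched in base R (`coalesce2` is a hypothetical helper name, not part of any package):

```r
# Take x where it is non-NA, otherwise fall back to y
coalesce2 <- function(x, y) ifelse(is.na(x), y, x)
coalesce2(c(1, NA, 3), c(9, 2, 9))  # 1 2 3
```

dplyr's coalesce generalizes this to any number of vectors, falling through left to right until a non-NA value is found.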
EDIT: I just tried this approach on my actual data and it does not work, because the replacement data also has some NAs and, as a result, the dfs I try to coalesce have differing numbers of rows... Back to square one.
I was able to figure this out with the following code (albeit not the most elegant or straightforward), and @Edwin's response helped point me in the right direction. Any suggestions on how to make this code more elegant and efficient are more than welcome!
# Idea: put both in long form and stack on top of one another... then merge like that!
# school4/school5 store the scores as math4/read4 and math5/read5
sch4r <- as.data.frame(subset(school4, select = -c(math4)))
sch4m <- as.data.frame(subset(school4, select = -c(read4)))
sch5r <- as.data.frame(subset(school5, select = -c(math5)))
sch5m <- as.data.frame(subset(school5, select = -c(read5)))
# Put these in LF
sch4r_lf <- melt(sch4r, id.vars = c("id", "selected", "grade"), value.name = "readscore")
sch4m_lf <- melt(sch4m, id.vars = c("id", "selected", "grade"), value.name = "mathscore")
sch5r_lf <- melt(sch5r, id.vars = c("id", "selected", "grade"), value.name = "readscore")
sch5m_lf <- melt(sch5m, id.vars = c("id", "selected", "grade"), value.name = "mathscore")
# Combine in one DF
sch_full_4 <-cbind(sch4r_lf, sch4m_lf$mathscore)
sch_full_4$mathscore <- sch_full_4$`sch4m_lf$mathscore`
sch_full_4$`sch4m_lf$mathscore` <- NULL # deletes
sch_full_4$variable <- NULL
sch_full_5 <- cbind(sch5r_lf, sch5m_lf$mathscore)
sch_full_5$mathscore <- sch_full_5$`sch5m_lf$mathscore`
sch_full_5$`sch5m_lf$mathscore` <- NULL
sch_full_5$variable <- NULL
# Stack together
sch_full <- rbind(sch_full_4,sch_full_5)
sch_full$selected <- NULL # delete this column...
# MERGE together
final_school_math <- mutate(school_lf, mathscore = coalesce(school_lf$mathscore, sch_full$mathscore))
final_school_read <- mutate(school_lf, readscore = coalesce(school_lf$readscore, sch_full$readscore))
final_df <- cbind(final_school_math, final_school_read$readscore)
final_df$readscore <- final_df$`final_school_read$readscore`
final_df$`final_school_read$readscore` <- NULL
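The core of the problem above, patching NAs in a long frame from a keyed lookup, can be isolated in base R with match(). A generic sketch with hypothetical toy frames (`long` and `wide` are illustrative names only):

```r
# Toy frames: 'long' has gaps, 'wide' holds the complete values keyed by id
long <- data.frame(id = c(1, 2, 3), score = c(10, NA, NA))
wide <- data.frame(id = c(2, 3), score = c(20, 30))
# For each long row, position of its id in the wide lookup (NA if absent)
i <- match(long$id, wide$id)
# Keep existing values, fill NAs from the matched lookup rows
long$score <- ifelse(is.na(long$score), wide$score[i], long$score)
```

Because match() aligns the two frames row for row, this avoids both the row-count mismatch hit in the EDIT above and the duplicated .x/.y columns produced by merge. In the real data the key would be the pair (id, grade) rather than id alone.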

Big tasks in R: how to avoid for-loops to run faster

My code runs, but very, very slowly. That is a big problem: it has to run faster. So here is the task:
I have a dataset with telecommunication records, and I want to apply multiple functions to all records of each customer and put the results in another data frame.
df1 is the data frame where each row has a unique customer ID plus columns with some profile information. df2 is a very big data frame with about 800,000 telecommunication records identified by customer ID. Now I want to compute, e.g., the average data usage for each customer in df2 and save the result in df1.
df1 looks like
df1 <- read.table(header = TRUE, sep=",",
text="CUSTOMER_ID,Age,ContractType, Gender
ID1,45,Postpaid,m
ID2,50,Postpaid,f
ID3,35,Postpaid,f
ID4,44,Postpaid,m
ID5,32,Postpaid,m
ID6,48,Postpaid,f
ID7,50,Postpaid,m
ID8,51,Postpaid,f")
df2 looks like
df2 <- read.table(header = TRUE, sep=",",
text="CUSTOMER_ID,EVENT,VOLUME, DURATION, MONTH
ID1,100,500,200,201505
ID1,50,400,150,201506
ID1,80,600,50,201507
ID2,40,800,45,201505
ID2,25,650,120,201506
ID2,65,380,250,201507
ID3,30,950,110,201505
ID3,25,630,85,201506
ID3,15,780,60,201507")
My codes is like
USAGE <- c("EVENT", "VOLUME", "DURATION")  # column names of df2
# List of functions I want to apply to df2
StatFunctions <- list(
  max = function(x) max(x),
  mean = function(x) mean(x),
  sum = function(x) sum(x)
)
In my original dataset the customer IDs are more complex, so I chose this pattern search for the customer IDs. This is only a cut-out of my code, but the rest has the same problem with the for-loops.
func.num <- function(prefix, target.df, n) {
  active.df <- get(target.df)
  return(StatFunctions[[n]](active.df[grep(pattern = prefix,
                                           x = active.df$CUSTOMER_ID), USAGE[m]]))
}
for (x in df1$CUSTOMER_ID) {
  for (m in 1:length(USAGE)) {
    for (n in 1:length(StatFunctions)) {
      df1[df1$CUSTOMER_ID == x, paste(names(StatFunctions[n]),
                                      USAGE[m], sep = "_")] <- func.num(prefix = x, target.df = "df2", n)
    }
  }
}
I know the code is very complicated and should be simplified.
And i want a data frame like this
Customer_ID Age contractType Gender max_EVENT mean_EVENT sum_EVENT ... sum_DURATION
ID1 45 Postpaid m 100 76 230 ... 400
So how can i avoid the for loops to run faster?
I would use dplyr package to summarize df2 by customer ID, then merge with df1.
df1 <- read.table(header = TRUE, sep=",",
text="CUSTOMER_ID,Age,ContractType, Gender
ID1,45,Postpaid,m
ID2,50,Postpaid,f
ID3,35,Postpaid,f
ID4,44,Postpaid,m
ID5,32,Postpaid,m
ID6,48,Postpaid,f
ID7,50,Postpaid,m
ID8,51,Postpaid,f")
df2 <- read.table(header = TRUE, sep=",",
text="CUSTOMER_ID,EVENT,VOLUME, DURATION, MONTH
ID1,100,500,200,201505
ID1,50,400,150,201506
ID1,80,600,50,201507
ID2,40,800,45,201505
ID2,25,650,120,201506
ID2,65,380,250,201507
ID3,30,950,110,201505
ID3,25,630,85,201506
ID3,15,780,60,201507")
df1$CUSTOMER_ID <- gsub(" ", "", df1$CUSTOMER_ID)
df2$CUSTOMER_ID <- gsub(" ", "", df2$CUSTOMER_ID)
library(dplyr)
USAGE <- c("EVENT", "VOLUME", "DURATION")
FUNC <- c("max", "mean", "sum")
dots <- lapply(USAGE, function(u) sprintf("%s(%s)", FUNC, u)) %>% unlist()
dots <- setNames(dots, sub("\\)", "", sub("\\(", "_", dots)))
sum_df <- df2 %>% group_by(CUSTOMER_ID) %>%
summarize_(.dots = dots) %>%
ungroup()
df1$CUSTOMER_ID <- as.character(df1$CUSTOMER_ID)
sum_df$CUSTOMER_ID <- as.character(sum_df$CUSTOMER_ID)
df1 <- left_join(df1, sum_df)
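As a side note, summarize_() with .dots is deprecated in current dplyr (summarise(across(...)) is the modern replacement). The same per-customer summary can also be sketched in base R with aggregate() and merge(), shown here on a trimmed version of the example data (`agg` is a hypothetical helper):

```r
# Trimmed example records
df2 <- data.frame(CUSTOMER_ID = c("ID1", "ID1", "ID2"),
                  EVENT = c(100, 50, 40),
                  VOLUME = c(500, 400, 800),
                  DURATION = c(200, 150, 45),
                  stringsAsFactors = FALSE)
USAGE <- c("EVENT", "VOLUME", "DURATION")

# Apply one summary function per customer and prefix the result columns
agg <- function(f, name) {
  out <- aggregate(df2[USAGE], by = df2["CUSTOMER_ID"], FUN = f)
  names(out)[-1] <- paste(name, USAGE, sep = "_")
  out
}
sum_df <- Reduce(function(x, y) merge(x, y, by = "CUSTOMER_ID"),
                 list(agg(max, "max"), agg(mean, "mean"), agg(sum, "sum")))
```

Like the dplyr version, this produces one row per customer with max_/mean_/sum_ columns, ready for a left join onto df1.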
First we fetch the columns to operate on and the IDs:
mycols <- c("EVENT","VOLUME","DURATION")
id <- levels(df2$CUSTOMER_ID)
We are going to do this with the (much faster) apply family, which lets us operate on the columns in one call instead of one by one. Create a function that applies such an operation to each of the columns; this we will apply over each ID.
For taking the mean and the sum, we may use the (very fast) colMeans and colSums.
applyfun <- function(i, FUN) {
  FUN(df2[df2$CUSTOMER_ID == i, mycols])
}
For maximum, we create a similar function
colMax <- function(colData) {
  apply(colData, MARGIN = 2, max)
}
Apply the three functions
outmean <- sapply(id, applyfun, colMeans)
outsum <- sapply(id, applyfun, colSums)
outmax <- sapply(id, applyfun, colMax)
out <- data.frame(CUSTOMER_ID = rownames(t(outmean)),
                  mean = t(outmean),
                  sum = t(outsum),
                  max = t(outmax))
Merge the data onto df1
merge(df1,out,key = "CUSTOMER_ID",all.x = TRUE)
which gives the output:
CUSTOMER_ID Age ContractType Gender mean.EVENT ... max.DURATION
1 ID1 45 Postpaid m 76.66667 ... 200
2 ID2 50 Postpaid f 43.33333 ... 250
3 ID3 35 Postpaid f 23.33333 ... 110
4 ID4 44 Postpaid m NA ... NA
I had some whitespace problems with the CUSTOMER_ID values from your examples of df1 and df2, which I suppose you do not have. To fix this I used:
df1$CUSTOMER_ID <- as.factor(trimws(df1$CUSTOMER_ID))
df2$CUSTOMER_ID <- as.factor(trimws(df2$CUSTOMER_ID))

Merge rows with condition and limit in a dataframe

I have the following dummy dataset of 1000 observations:
obs <- 1000
df <- data.frame(
  a = c(1,0,0,0,0,1,0,0,0,0),
  b = c(0,1,0,0,0,0,1,0,0,0),
  c = c(0,0,1,0,0,0,0,1,0,0),
  d = c(0,0,0,1,0,0,0,0,1,0),
  e = c(0,0,0,0,1,0,0,0,0,1),
  f = c(10,2,4,5,2,2,1,2,1,4),
  g = sample(c("yes", "no"), obs, replace = TRUE),
  h = sample(letters[1:15], obs, replace = TRUE),
  i = sample(c("VF", "FD", "VD"), obs, replace = TRUE),
  j = sample(1:10, obs, replace = TRUE)
)
One key feature of this dataset is that in the variables a to e, exactly one value per row is 1 and the rest are 0. We are sure that exactly one of these five columns has a 1 as its value.
I found a way to extract these rows given a condition (with a 1) and assign to their respective variables:
df.a <- df[df[,"a"] == 1,,drop=FALSE]
df.b <- df[df[,"b"] == 1,,drop=FALSE]
df.c <- df[df[,"c"] == 1,,drop=FALSE]
df.d <- df[df[,"d"] == 1,,drop=FALSE]
df.e <- df[df[,"e"] == 1,,drop=FALSE]
My dilemma now is to limit the rows saved into df.a to df.e and to merge them afterwards.
Here's a shorter way to create df.merged:
# variables of 'df'
vars <- c("a", "b", "c", "d", "e")
# number of rows to extract
n <- 100
df.merged <- do.call(rbind, lapply(vars, function(x) {
  head(df[as.logical(df[[x]]), ], n)
}))
Here, rbind is sufficient. The function rbind.fill is necessary if your data frames differ with respect to the number of columns.
To get the n-rows subset, a simple data[1:n,] does the job.
df.a.sub <- df.a[1:10,]
df.b.sub <- df.b[1:10,]
df.c.sub <- df.c[1:10,]
df.d.sub <- df.d[1:10,]
df.e.sub <- df.e[1:10,]
Finally, merge them (it took me the longest to find a straightforward way to "merge multiple dataframes", and all I needed was rbind.fill(df1, df2, ..., dfn), thanks to this question and answer):
require(plyr)
df.merged <- rbind.fill(df.a.sub, df.b.sub, df.c.sub, df.d.sub, df.e.sub)
