Extracting first column that meets certain criteria for each row - r

I will try to explain what I am doing as best I can; it is kind of confusing, but I'll give it a shot. Essentially I start with two data frames, each containing one unique row per person and two items per person as columns. My goal is to combine these into one data frame with one unique row per person, taking the first item from each of the two data frames on the condition that the items do not repeat. For example, if customer 1's items in the first data frame are "a" and "d", and in the second data frame they are "a" and "c", I would want the final data frame to contain "a" and "c" for this customer (the second "a" is skipped because it repeats). I have written an apply that does this; however, when I run it on roughly 160,000 rows it takes quite a bit of time. I was hoping someone could come up with a more efficient solution.
# Ids, plus two data frames of items keyed by id via rownames
d1 <- data.frame(id = c("1", "2", "3"), stringsAsFactors = F)
r1 <- data.frame(i1 = c("a", "b", "c"), i2 = c("d", "e", "f"), stringsAsFactors = F)
rownames(r1) <- d1$id
r2 <- data.frame(i1 = c("a", "c", "f"), i2 = c("c", "t", "l"), stringsAsFactors = F)
rownames(r2) <- d1$id
dFinal <- data.frame(id = d1$id, r1 = "", r2 = "", stringsAsFactors = F)
# First item from r1, then the first item from r2 that doesn't repeat it
dFinal$r1 <- apply(dFinal, 1, function(x) r1[rownames(r1) == x["id"], "i1"])
dFinal$r2 <- apply(dFinal, 1, function(x) {
  r2[rownames(r2) == x["id"],
     which(!r2[rownames(r2) == x["id"], c("i1", "i2")] %in% x["r1"])[1]]
})

Would the following do what you're looking for?
# Keep only first column of first data.frame
df <- cbind(d1,r1,r2)[,-3]
names(df) <- c("id","r1_final","r2_i1","r2_i2")
df$r2_final <- df$r2_i1
# Keep only second column of second data.frame
# if the value in the first column is found in first data.frame
df[df$r1_final == df$r2_i1,"r2_final"] <- df[df$r1_final == df$r2_i1,"r2_i2"]
df_final <- df[,c("id","r1_final","r2_final")]
print(df_final)
id r1_final r2_final
1 1 a c
2 2 b c
3 3 c f
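The speedup here comes from replacing the rowwise apply() with whole-column comparisons. If you want to measure the difference on your own data, here is a minimal benchmark sketch using the microbenchmark package (an assumption on my part; any timing tool works), with the objects from the question in scope:
library(microbenchmark)
microbenchmark(
  apply_r1 = apply(dFinal, 1, function(x) r1[rownames(r1) == x["id"], "i1"]),
  vectorized = {
    df <- cbind(d1, r1, r2)[, -3]
    names(df) <- c("id", "r1_final", "r2_i1", "r2_i2")
    df$r2_final <- df$r2_i1
    df[df$r1_final == df$r2_i1, "r2_final"] <- df[df$r1_final == df$r2_i1, "r2_i2"]
  },
  times = 10
)
Note that the two expressions don't do identical work (the apply builds only the r1 column), so treat the result as a rough indication rather than a strict comparison.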
Edit:
The OP asked for a solution with four data.frames instead of the two in the example. Here is some code that I haven't tested, but it should work with two additional columns (a loop-based version of the same idea is sketched after this block):
df$r2_final <- df$r2_i1
df$r3_final <- df$r3_i1
df$r4_final <- df$r4_i1
df[df$r1_final == df$r2_i1,"r2_final"] <- df[df$r1_final == df$r2_i1,"r2_i2"]
# Compare within each row (%in% against the whole column would mix rows)
r3_used <- df$r3_i1 == df$r1_final | df$r3_i1 == df$r2_final
df[r3_used, "r3_final"] <- df[r3_used, "r3_i2"]
r4_used <- df$r4_i1 == df$r1_final | df$r4_i1 == df$r2_final | df$r4_i1 == df$r3_final
df[r4_used, "r4_final"] <- df[r4_used, "r4_i2"]
df_final <- df[,c("id","r1_final","r2_final","r3_final","r4_final")]
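If the number of item data frames keeps growing, the same fallback can be written as a loop instead of one copied line per frame. This is a hedged, untested sketch that assumes df already holds the columns named as above (r1_final, plus rK_i1/rK_i2 for each frame K) and, like the code above, that the second item never needs its own fallback:
# generalization of the above to n_frames item data frames
n_frames <- 4
for (k in 2:n_frames) {
  i1  <- paste0("r", k, "_i1")
  i2  <- paste0("r", k, "_i2")
  fin <- paste0("r", k, "_final")
  df[[fin]] <- df[[i1]]
  # rowwise: does this frame's first item repeat any earlier pick?
  earlier <- df[paste0("r", 1:(k - 1), "_final")]
  clash <- rowSums(earlier == df[[i1]]) > 0
  df[clash, fin] <- df[clash, i2]
}
df_final <- df[, c("id", paste0("r", 1:n_frames, "_final"))]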

Thanks for the accepted answer; it worked perfectly! However, it gave me an idea to use ifelse. While it doesn't perform any better or worse than the accepted answer, it was a little easier for me to wrap my head around when adding more columns or data frames.
dfInt <- cbind(df1, df2, df3, df4)
dfInt$R1_Final <- dfInt$R1_1
dfInt$R2_Final <- ifelse(dfInt$R1_Final == dfInt$R2_1,
                         dfInt$R2_2,
                         dfInt$R2_1)
dfInt$R3_Final <- ifelse(dfInt$R1_Final != dfInt$R3_1 & dfInt$R2_Final != dfInt$R3_1,
                         dfInt$R3_1,
                         ifelse(dfInt$R2_Final != dfInt$R3_2,
                                dfInt$R3_2,
                                dfInt$R3_3))
dfInt$R4_Final <- ifelse(dfInt$R1_Final != dfInt$R4_1 & dfInt$R2_Final != dfInt$R4_1 & dfInt$R3_Final != dfInt$R4_1,
                         dfInt$R4_1,
                         ifelse(dfInt$R2_Final != dfInt$R4_2 & dfInt$R3_Final != dfInt$R4_2,
                                dfInt$R4_2,
                                ifelse(dfInt$R3_Final != dfInt$R4_3,
                                       dfInt$R4_3,
                                       dfInt$R4_4)))

Related

Left join two R data frames with OR conditions

Problem
I have two data frames that I want to join using a conditional statement on three non-numeric variables. Here is a pseudo-code version of what I want to achieve.
Join DF1 and DF2 on DF1$A == DF2$A | DF1$A == DF2$B
Dataset
Here's some code to create the two data frames. variant_index is the data frame that will be used to annotate input using a left_join:
library(dplyr)
options(stringsAsFactors = FALSE)
set.seed(5)
variant_index <- data.frame(
  rsid = rep(sapply(1:5, function(x) paste0(c("rs", sample(0:9, 8, replace = TRUE)), collapse = "")), each = 2),
  chrom = rep(sample(1:22, 5), each = 2),
  ref = rep(sample(c("A", "T", "C", "G"), 5, replace = TRUE), each = 2),
  alt = sample(c("A", "T", "C", "G"), 10, replace = TRUE),
  eaf = runif(10),
  stringsAsFactors = FALSE
)
variant_index[1, "alt"] <- "T"
variant_index[8, "alt"] <- "A"
input <- variant_index[seq(1, 10, 2), ] %>%
  select(rsid, chrom)
input$assessed <- c("G", "C", "T", "A", "T")
What I tried
I would like to perform a left_join on input to annotate it with the eaf column from variant_index. As you can see from the input data frame, its assessed column can match either variant_index$ref or variant_index$alt. The rsid and chrom columns will always match.
I know I can specify multiple columns in the by argument of left_join, but if I understand correctly, the condition will always be
input$assessed == variant_index$ref & input$assessed == variant_index$alt
whereas I want to achieve
input$assessed == variant_index$ref | input$assessed == variant_index$alt
Possible solution
The desired output can be obtained like so:
input %>%
  left_join(variant_index) %>%
  filter(assessed == ref | assessed == alt)
But it doesn't seem like the best solution to me, since I am possibly generating double the number of rows before filtering, and I would like to apply this join to data frames containing 100M+ rows. Is there a better solution?
Complex joins are straightforward in SQL:
library(sqldf)
sqldf("select *
from variant_index v
join input i on i.assessed = v.ref or i.assessed = v.alt")
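One caveat: a bare join in SQL is an inner join, so input rows with no match get dropped. If you need the left-join behavior the question asks for, a hedged variant of the same query (untested; it also joins on the rsid and chrom columns, which the question says always match):
sqldf("select i.*, v.eaf
       from input i
       left join variant_index v
         on i.rsid = v.rsid and i.chrom = v.chrom
        and (i.assessed = v.ref or i.assessed = v.alt)")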
Try this:
library(dbplyr)
x1 <- memdb_frame(x = 1:5)
x2 <- memdb_frame(x1 = 1:3, x2 = letters[1:3])
x1 <- x1 %>% left_join(x2, sql_on = "LHS.x = RHS.x1 or LHS.x = RHS.x2")
We can use show_query() to see the SQL that gets generated.
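For example:
x1 %>% show_query()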

FOR loop optimization in R

I have the following code that is taking forever to run on my 80k-row CBP table. Could anyone help me optimize my loop? I am simply trying to find duplicates that share the same values in certain (not all) columns, get the number of duplicates, and then return the ids for each of the duplicates:
for (row in 1:nrow(CBP)) {
  subs <- subset(CBP,
                 CBP$Lower_Bound__c == CBP[row, "Lower_Bound__c"] &
                   CBP$Price_Book__c == CBP[row, "Price_Book__c"] &
                   CBP$Price__c == CBP[row, "Price__c"] &
                   CBP$Product__c == CBP[row, "Product__c"] &
                   CBP$Department__c == CBP[row, "Department__c"] &
                   CBP$UOM__c == CBP[row, "UOM__c"] &
                   CBP$Upper_Bound__c == CBP[row, "Upper_Bound__c"])
  if (nrow(subs) > 1) {
    CBP[row, ]$dup <- nrow(subs)
    CBP[row, ]$dupids <- paste(subs[, "Id"], collapse = ",")
  }
  print(row)
}
I'm having a hard time understanding your example. However, here's a simple approach with data.table that might work for your situation. You can create a counter variable (nsame in the example) within each combination of grouping variables (var1 and var2 in the example); a row is a duplicate whenever the counter is greater than 1. Then just grab the row indices.
library(data.table)
# generate some example data
dt <- data.table(
  var1 = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  var2 = c("a", "a", "z", "b", "y", "b", "c", "c", "c"),
  var3 = 1:9
)
# counter for each combination of var1-var2
dt[ , nsame := 1:.N, by=.(var1, var2)]
# duplicates are where the counter is > 1
which(dt$nsame > 1)
## 2 6 8 9
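The question also asked for the duplicate count and the collapsed ids. A hedged sketch of the same idea, with var3 standing in for the Id column:
# group size and collapsed ids per combination of var1-var2
dt[, dup := .N, by = .(var1, var2)]
dt[, dupids := paste(var3, collapse = ","), by = .(var1, var2)]
dt[dup > 1]  # the duplicated rows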
Using base R:
dupe_columns = c(
  "Lower_Bound__c", "Price_Book__c", "Price__c", "Product__c",
  "Department__c", "UOM__c", "Upper_Bound__c"
)
# which rows are duplicated
dupes = which(duplicated(CBP[, dupe_columns]) | duplicated(CBP[, dupe_columns], fromLast = TRUE))
# how many are there
length(dupes)
# IDs that are duplicated
CBP[dupes, "Id"]
# collapse Ids with duplicates by group:
aggregate(CBP$Id, by = CBP[dupe_columns], FUN = paste, collapse = ",")
If any of this doesn't work or you need more help, post 10-20 rows of sample data (use dput() so it is copy/pasteable!!!) so we can test and verify.
Subtle point, but I use CBP[, dupe_columns] in the duplicated() line because duplicated() will work the same whether we give it a data frame or a vector. CBP[, dupe_columns] will be a data frame if you have more than one column to check for dupes, but will be a vector if you give it a single column. However, when we get down to aggregate we need the by argument to be a list (like a data frame). So I use CBP[dupe_columns] (no comma) which will guarantee a data frame even if we are only checking a single column.
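And if you want the per-row dup and dupids columns that the original loop produced, rather than the aggregate() summary, here is a hedged base-R sketch using ave() (untested against real data; it assumes CBP and dupe_columns as above):
# one group label per unique key combination
grp <- interaction(CBP[dupe_columns], drop = TRUE)
CBP$dup <- ave(seq_len(nrow(CBP)), grp, FUN = length)  # group size per row
CBP$dupids <- ave(CBP$Id, grp, FUN = function(x) paste(x, collapse = ","))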

Extract rows from a single column to form two new columns

Update:
I realized that the dummy data frame I created originally does not reflect the structure of the data frame that I am working with. Allow me to rephrase my question here.
Data frame that I'm starting with:
StudentAndClass <- c("Anthropology College_Name","x","y",
"Geology College_Name","z","History College_Name", "x","y","z")
df <- data.frame(StudentAndClass)
Students ("x","y","z") are enrolled in classes that they are listed under. e.g. "x" and "y" are in Anthropology, while "x", "y", "z" are in History.
How can I create the desired data frame below?
Student <- c("x", "y", "z", "x", "y", "z")
Class <- c("Anthropology College_Name", "Anthropology College_Name",
           "Geology College_Name", "History College_Name",
           "History College_Name", "History College_Name")
df_tidy <- data.frame(Student, Class)
Original post:
I have a data frame with observations of two variables merged in a single column like so:
StudentAndClass <- c("A","x","y","A","B","z","B","C","x","y","z","C")
df <- data.frame(StudentAndClass)
where "A", "B", "C" represent classes, and "x", "y", "z" students who are taking these classes. Notice that observations of students are wedged between observations of classes.
I'm wondering how I can create a new data frame with the following format:
Student <- c("x", "y", "z", "x", "y","z")
Class <- c("A", "A", "B", "C", "C", "C")
df_tidy <- data.frame(Student, Class)
I want to extract the rows containing observations of students and put them in a new column, while making sure that each Student observation is paired with the corresponding Class observation in the Class column.
One option is to create a vector
v1 <- c('x', 'y', 'z')
Then split the data based on the logical vector and cbind the two groups:
setNames(do.call(cbind, split(df, !df[,1] %in% v1)), c('Student', 'Class'))
# Student Class
#2 x A
#3 y A
#6 z B
#9 x B
#10 y C
#11 z C
Or with tidyverse
library(tidyverse)
df %>%
  group_by(grp = c('Class', 'Student')[(StudentAndClass %in% v1) + 1]) %>%
  mutate(n = row_number()) %>%
  spread(grp, StudentAndClass) %>%
  select(-n)
# A tibble: 6 x 2
# Class Student
#* <fctr> <fctr>
#1 A x
#2 A y
#3 B z
#4 B x
#5 C y
#6 C z
Update
If we need this based on the elements between each pair of the same letter:
# Group rows by the most recent class letter seen (A/B/C); within each
# group, the students are the elements between the two class rows
grp <- with(df, cummax(match(StudentAndClass, LETTERS[1:3], nomatch = 0)))
do.call(rbind, lapply(split(df, grp), function(x)
  data.frame(Student = x[[1]][2:(nrow(x) - 1)], Class = x[[1]][1],
             stringsAsFactors = FALSE)))
Updated
In essence, you just need to find which indexes have college names, use those to get the range of students in each college, then subset the main vector by those ranges. Since students aren't guaranteed to be nested between two similar values, you have to be careful about any "empty" colleges.
college_indices <- which(endsWith(StudentAndClass, 'College_Name'))
colleges <- StudentAndClass[college_indices]
bounds_mat <- rbind(
  start = college_indices,
  # length + 1 so the final college keeps its last student after the end - 1 below
  end = c(college_indices[-1], length(StudentAndClass) + 1L)
)
colnames(bounds_mat) <- colleges
bounds_mat['start', ] <- bounds_mat['start', ] + 1
bounds_mat['end', ] <- bounds_mat['end', ] - 1
# This prevents any problems if a college has no listed students
empty_college <- bounds_mat['start', ] > bounds_mat['end', ]
bounds_mat <- bounds_mat[, !empty_college]
class_listing <- apply(
  bounds_mat,
  2,
  function(bounds) {
    StudentAndClass[bounds[1]:bounds[2]]
  }
)
df_tidy <- data.frame(
  Student = unlist(class_listing),
  Class = rep(names(class_listing), lengths(class_listing)),
  row.names = NULL
)
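A more compact variant of the same idea, under the same assumption that colleges are identified by the "College_Name" suffix: number each block with cumsum() and subset once. Empty colleges are handled for free, since they contribute no student rows:
# block id for every element, then drop the college rows themselves
is_college <- endsWith(StudentAndClass, "College_Name")
block <- cumsum(is_college)
df_tidy <- data.frame(
  Student = StudentAndClass[!is_college],
  Class = StudentAndClass[is_college][block[!is_college]],
  row.names = NULL
)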

Merge rows with condition and limit in a dataframe

I have the following dummy dataset of 1000 observations (the length-10 vectors a to f are recycled to fill the 1000 rows):
obs <- 1000
df <- data.frame(
  a = c(1,0,0,0,0,1,0,0,0,0),
  b = c(0,1,0,0,0,0,1,0,0,0),
  c = c(0,0,1,0,0,0,0,1,0,0),
  d = c(0,0,0,1,0,0,0,0,1,0),
  e = c(0,0,0,0,1,0,0,0,0,1),
  f = c(10,2,4,5,2,2,1,2,1,4),
  g = sample(c("yes", "no"), obs, replace = TRUE),
  h = sample(letters[1:15], obs, replace = TRUE),
  i = sample(c("VF","FD", "VD"), obs, replace = TRUE),
  j = sample(1:10, obs, replace = TRUE)
)
One key feature of this dataset is that, in every row, exactly one of the variables a to e has the value 1 and the rest are 0.
I found a way to extract these rows given a condition (with a 1) and assign to their respective variables:
df.a <- df[df[,"a"] == 1,,drop=FALSE]
df.b <- df[df[,"b"] == 1,,drop=FALSE]
df.c <- df[df[,"c"] == 1,,drop=FALSE]
df.d <- df[df[,"d"] == 1,,drop=FALSE]
df.e <- df[df[,"e"] == 1,,drop=FALSE]
My dilemma now is how to limit the number of rows saved into df.a through df.e and then merge them.
Here's a shorter way to create df.merged:
# variables of 'df'
vars <- c("a", "b", "c", "d", "e")
# number of rows to extract
n <- 100
df.merged <- do.call(rbind, lapply(vars, function(x) {
  head(df[as.logical(df[[x]]), ], n)
}))
Here, rbind is sufficient. The function rbind.fill is necessary if your data frames differ with respect to the number of columns.
To get the n-rows subset, a simple data[1:n,] does the job.
df.a.sub <- df.a[1:10,]
df.b.sub <- df.b[1:10,]
df.c.sub <- df.c[1:10,]
df.d.sub <- df.d[1:10,]
df.e.sub <- df.e[1:10,]
Finally, merge them (finding a straightforward way to "merge multiple data frames" took me the longest; all I needed was rbind.fill(df1, df2, ..., dfn), thanks to this question and answer):
require(plyr)
df.merged <- rbind.fill(df.a.sub, df.b.sub, df.c.sub, df.d.sub, df.e.sub)
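Since exactly one of a to e is 1 in every row, a hedged alternative sketch skips the five intermediate data frames entirely: label each row with max.col() and take the head of each group:
# which of columns a:e holds the 1 in each row
grp <- names(df)[1:5][max.col(df[, 1:5])]
df.merged <- do.call(rbind, lapply(split(df, grp), head, 10))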

subsetting in r based on a vector of conditions

This is a restatement of my poorly worded previous question. (To those who replied to it, I appreciate your efforts, and I apologize for not being as clear with my question as I should have been.) I have a large dataset, a subset of which might look like this:
a<-c(1,2,3,4,5,1)
b<-c("a","b","a","b","c","a")
c<-c("m","f","f","m","m","f")
d<-1:6
e<-data.frame(a,b,c,d)
If I want the sum of the entries in the fourth column based on a specific condition, I could do something like this:
attach(e)
total<-sum(e[which(a==3 & b=="a"),4])
detach(e)
However, I have a "vector" of conditions (call it condition_vector), the first four elements of which look more like this:
a==3 & b == "a"
a==2
a==1 & b=="a" & c=="m"
c=="f"
I'd like to create a "generalized" version of the "total" formula above that produces a results_vector of totals by reading in the condition_vector of conditions. In this example, the first four entries in the results_vector would be calculated conceptually as follows:
results_vector[1]<-sum(e[which(a==3 & b=="a"),4])
results_vector[2]<-sum(e[which(a==2),4])
results_vector[3]<-sum(e[which(a==1 & b=="a" & c=="m"),4])
results_vector[4]<-sum(e[which(c=="f"),4])
My actual data set has more than 20 variables. So each record in the condition_vector can contain anywhere from 1 to more than 20 conditions (as opposed to between 1 and 3 conditions, used in this example).
Is there a way to accomplish this other than using an eval(parse(text = ...)) approach (which takes a long time to run even on a relatively small dataset)?
Thanks in advance for any help you can provide (and again, I apologize that I wasn't as clear as I should have been last time around).
Spark
Here is a solution using eval(parse(text = ...)), even if, as you say, you find it slow:
cond <- c('a==3 & b == "a"', 'a==2', 'a==1 & b=="a" & c=="m"', 'c=="f"')
names(cond) <- cond
results_vector <- lapply(cond, function(x)
  sum(e[eval(parse(text = x)), "d"]))
results_vector
$`a==3 & b == "a"`
[1] 3
$`a==2`
[1] 2
$`a==1 & b=="a" & c=="m"`
[1] 1
$`c=="f"`
[1] 11
The advantage of naming your conditions vector is that you can access your results by condition.
results_vector[cond[2]]
$`a==2`
[1] 2
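If you want to avoid eval(parse()) altogether, here is a hedged alternative limited to the equality conditions joined by & that the examples use: store each condition as a named list of column = value pairs and combine the comparisons with Map() and Reduce():
# parse()-free sketch for equality conditions joined by &
conds <- list(
  list(a = 3, b = "a"),
  list(a = 2),
  list(a = 1, b = "a", c = "m"),
  list(c = "f")
)
results_vector <- sapply(conds, function(cnd) {
  keep <- Reduce(`&`, Map(function(col, val) e[[col]] == val, names(cnd), cnd))
  sum(e[keep, "d"])
})
results_vector
# [1]  3  2  1 11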
Here is a function that takes as arguments the condition for each column, in column order (NA if there is no condition on that column), and sums over a selected column of a selected data.frame:
conds.by.col <- function(..., sumcol, DF)  # NA if no condition on a column
{
  conds.ls <- list(...)
  res.ls <- vector("list", length(conds.ls))
  for (i in 1:length(conds.ls))
  {
    res.ls[[i]] <- which(DF[, i] == conds.ls[[i]])
  }
  res.ls <- res.ls[which(lapply(res.ls, length) != 0)]
  which_rows <- Reduce(intersect, res.ls)
  return(sum(DF[which_rows, sumcol]))
}
Test:
a <- c(1,2,3,4,5,1)
b <- c("a", "b", "a", "b", "c", "a")
c <- c("m", "f", "f", "m", "m", "f")
d <- 1:6
e <- data.frame(a, b, c, d)
conds.by.col(3, "a", "f", sumcol = 4, DF = e)
#[1] 3
For multiple conditions, mapply:
# all conditions in a data.frame:
myconds <- data.frame(con1 = c(3, "a", "f"),
                      con2 = c(NA, "a", NA),
                      con3 = c(1, NA, "f"),
                      stringsAsFactors = F)
mapply(conds.by.col, myconds[1,], myconds[2,], myconds[3,], MoreArgs = list(sumcol = 4, DF = e))
#con1 con2 con3
# 3 10 6
I guess "efficiency" isn't the first you say watching this, though...
