This is a restatement of my poorly worded previous question. (To those who replied to it, I appreciate your efforts, and I apologize for not being as clear with my question as I should have been.) I have a large dataset, a subset of which might look like this:
a<-c(1,2,3,4,5,1)
b<-c("a","b","a","b","c","a")
c<-c("m","f","f","m","m","f")
d<-1:6
e<-data.frame(a,b,c,d)
If I want the sum of the entries in the fourth column based on a specific condition, I could do something like this:
attach(e)
total<-sum(e[which(a==3 & b=="a"),4])
detach(e)
However, I have a "vector" of conditions (call it condition_vector), the first four elements of which look more like this:
a==3 & b == "a"
a==2
a==1 & b=="a" & c=="m"
c=="f"
I'd like to create a "generalized" version of the "total" formula above that produces a results_vector of totals by reading in the condition_vector of conditions. In this example, the first four entries in the results_vector would be calculated conceptually as follows:
results_vector[1]<-sum(e[which(a==3 & b=="a"),4])
results_vector[2]<-sum(e[which(a==2),4])
results_vector[3]<-sum(e[which(a==1 & b=="a" & c=="m"),4])
results_vector[4]<-sum(e[which(c=="f"),4])
My actual data set has more than 20 variables. So each record in the condition_vector can contain anywhere from 1 to more than 20 conditions (as opposed to between 1 and 3 conditions, used in this example).
Is there a way to accomplish this other than using a parse(eval(text= ... approach (which takes a long time to run on a relatively small dataset)?
Thanks in advance for any help you can provide (and again, I apologize that I wasn't as clear as I should have been last time around).
Spark
Here using a solution using eval(parse(text=..) here, even if obviously you find it slow:
cond <- c('a==3 & b == "a"','a==2','a==1 & b=="a" & c=="x"','c=="f"')
names(cond) <- cond
results_vector <- lapply(cond,function(x)
sum(dat[eval(parse(text=x)),"d"]))
$`a==3 & b == "a"`
[1] 3
$`a==2`
[1] 2
$`a==1 & b=="a" & c=="m"`
[1] 1
$`c=="f"`
[1] 11
The advantage of naming your conditions vector is to access to your results by condition.
results_vector[cond[2]]
$`a==2`
[1] 2
Here is a function that takes as arguments the condition in each column (if no condition in a column, then NA as argument) and sums in a selected column of a selected data.frame:
conds.by.col <- function(..., sumcol, DF) #NA if not condition in a column
{
conds.ls <- list(...)
res.ls <- vector("list", length(conds.ls))
for(i in 1: length(conds.ls))
{
res.ls[[i]] <- which(DF[,i] == conds.ls[[i]])
}
res.ls <- res.ls[which(lapply(res.ls, length) != 0)]
which_rows <- Reduce(intersect, res.ls)
return(sum(DF[which_rows , sumcol]))
}
Test:
a <- c(1,2,3,4,5,1)
b <- c("a", "b", "a", "b", "c", "a")
c <- c("m", "f", "f", "m", "m", "f")
d <- 1:6
e <- data.frame(a, b, c, d)
conds.by.col(3, "a", "f", sumcol = 4, DF = e)
#[1] 3
For multiple conditions, mapply:
#all conditions in a data.frame:
myconds <- data.frame(con1 = c(3, "a", "f"),
con2 = c(NA, "a", NA),
con3 = c(1, NA, "f"),
stringsAsFactors = F)
mapply(conds.by.col, myconds[1,], myconds[2,], myconds[3,], MoreArgs = list(sumcol = 4, DF = e))
#con1 con2 con3
# 3 10 6
I guess "efficiency" isn't the first you say watching this, though...
Related
I have a data frame ('ju') that has three columns and 230 rows. The first two columns represent a pair of objects. The third column includes one of those objects. I'd like to add the fourth column which will contain the second object from that pair, as shown below.
I wrote a code to identify the value for the forth column (loser), but it does not give me any output when I run it.
for (i in 1:230) {
if (ju$winner[i]==ju$letter2[i]) {
paste(ju$letter1[i])
} else {
paste (ju$letter2[i])
}
}
I can not see what is wrong with the code. Also I would appreciate if you can suggest how I could create this fourth column directly into my data frame, instead of creating a separate vector and then adding it to the data frame. Thanks
This will do it without a for loop:
ju$loser <- ifelse(ju$winner %in% ju$letter1, ju$letter2, ju$letter1)
Gives:
> ju
letter1 letter2 winner loser
1 a c a c
2 c b b c
3 t j j t
4 r k k r
If you want to print to console, you'll need to add:
cat(ju$letter1[i])
or
print(ju$letter1[i])
Regarding the New Column question, a possible solution (sub-optimal to use a for loop here -- See suggestion from #lab_rat_kid):
ju$NewColumn = NA
for (i in 1:230) {
if (ju$winner[i]==ju$letter2[i]) {
ju$NewColumn[i] <- ju$letter1[i]
} else {
ju$NewColumn[i] <- ju$letter2[i]
}
}
with tidyverse:
dt <- tibble(l1 = c("a", "c", "t", "r"),
l2 = c("c", "b", "j", "k"),
winner = c("a", "b", "j", "k"))
dt <- dt %>%
mutate(looser = if_else(winner == l1, l2, l1))
(dt)
In R, I want to set a [61,1] dummy vector a equal to one if the elements in one string vector [61,1] b matches the element in 3 [1,1] string vectors c1, c2, c3.
I've tried something like:
for(i in 1:3) {
data$a[data$b==data2$ci]<-1
}
without success.
To make it clear, the data looks something like (couldn't make column vectors, so think of the transpose of these):
Data
a 0 0 0 0 0 ... 0
b M M D M X ... E
Data2
c1 M
c2 X
c3 P
I want "a" to look like:
a 1 1 0 1 1 ... 0
EDIT:
This code works:
data$a <- ifelse(data$b %in% data2$p1, 1,
ifelse(data$b %in% data2$p2, 1,
ifelse(data$b %in% data2$p3, 1, 0)))
However, I'm sure that there is a more efficient solution using a loop. If there were more than three variables, (e.g., c1, c2, ... c99), then a lot of repetitive code would have to be written.
If I understand the question correctly, the easiest approach is to use an ifelse statement. Please add a reproducible example in your next question to make it easier for those trying to give you solutions.
Data <- data.frame(a = 0,
b = c("M", "M", "D", "M", "X", "P"),
stringsAsFactors = FALSE)
Data2 <- data.frame(a = c("c1", "c2", "c3"),
b = c("M", "X", "P"),
stringsAsFactors = FALSE)
Data$a <- ifelse(Data$b %in% Data2$b, 1, 0)
The %in% operator evaluates whether each element in the vector on the left hand side is present in the vector on the right. It outputs a logical vector that is used to define whether to assign a 1 or 0 in this case.
I have the following code that it taking forever to run on my 80k rows CBP table. Anyone could help me optimize my loop. Trying simply to find duplicates sharing the same values in certain (not all) columns, getting the number of duplicates there is and then returning the ids for each of the duplicates:
for (row in 1:nrow(CBP)){
subs <- subset(CBP, CBP$Lower_Bound__c == CBP[row,"Lower_Bound__c"] & CBP$Price_Book__c == CBP[row,"Price_Book__c"] & CBP$Price__c == CBP[row,"Price__c"] & CBP$Product__c == CBP[row,"Product__c"] & CBP$Department__c == CBP[row,"Department__c"] & CBP$UOM__c == CBP[row,"UOM__c"] & CBP$Upper_Bound__c == CBP[row,"Upper_Bound__c"])
if (nrow(subs)>1){
CBP[row,]$dup <- nrow(subs)
CBP[row,]$dupids <- paste(subs[,"Id"], collapse = ",")
}
print(row)
}
I'm having a hard time understanding your example. However, here's a simple approach with data.table that might work for your situation. You can create a variable (nsame in the example) that counts if the something is a duplicate by multiple variables (var1 and var2 in the example). Then just grab the row index.
library(data.table)
# generate some example data
dt <- data.table(
var1 = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
var2 = c("a", "a", "z", "b", "y", "b", "c", "c", "c"),
var3 = 1:9
)
# counter for each combination of var1-var2
dt[ , nsame := 1:.N, by=.(var1, var2)]
# duplicates are where the counter is > 1
which(dt$nsame > 1)
## 2 6 8 9
Using base R:
dupe_columns = c(
"Lower_Bound__c", "Price_Book__c", "Price__c", "Product__c",
"Department__c", "UOM__c", "Upper_Bound__c"
)
# which rows are duplicated
dupes = which(duplicated(CBP[, dupe_columns]) | duplicated(CBP[, dupe_columns], fromLast = TRUE))
# how many are there
length(dupes)
# IDs that are duplicated
CBP[dupes, "Id"]
# collapse Ids with duplicates by group:
aggregate(CBP$Id, by = CBP[dupe_columns], FUN = paste, collapse = ",")
If any of this doesn't work or you need more help, post 10-20 rows of sample data (use dput() so it is copy/pasteable!!!) so we can test and verify.
Subtle point, but I use CBP[, dupe_columns] in the duplicated() line because duplicated() will work the same whether we give it a data frame or a vector. CBP[, dupe_columns] will be a data frame if you have more than one column to check for dupes, but will be a vector if you give it a single column. However, when we get down to aggregate we need the by argument to be a list (like a data frame). So I use CBP[dupe_columns] (no comma) which will guarantee a data frame even if we are only checking a single column.
I will try to explain what I am doing the best I can it is kind of confusing but I'll give it a shot. Essentially I start with 2 data frames. Each one containing a unique row per person and two items per user as columns. My goal is to turn this into 1 data frame with one unique row per user and the first item from each of the two data frames upon the condition that the items do not repeat. For example if for customer 1 in the first data frame his items are "a" and "d" and in the second data frame his items are "a" and "c", I would want the final data frame to be "a" and "c" for this customer. I have written an apply that does this however when I perform this on roughly 160,000 rows it takes quite a bit of time. I was hoping someone would be able to come up with a more efficient solution to my problem.
d1 <- data.frame(id = c("1", "2", "3"), stringsAsFactors = F)
r1 <- data.frame(i1 = c("a", "b", "c"), i2 = c("d", "e", "f"), stringsAsFactors = F)
rownames(r1) = d1$id
r2 <- data.frame(i1 = c("a", "c", "f"), i2 = c("c", "t", "l"), stringsAsFactors = F)
rownames(r2) = d1$id
dFinal <- data.frame(id = d1$id, r1 = "", r2 = "", stringsAsFactors = F)
dFinal$r1 = apply(dFinal, 1, function(x){r1[rownames(r1) == x["id"], "i1"]})
dFinal$r2 = apply(dFinal, 1, function(x){r2[rownames(r2) == x["id"], which(!r2[rownames(r2) == x["id"],c("i1","i2")] %in% x["r1"])[1]]})
Would the following do what you're looking for:
# Keep only first column of first data.frame
df <- cbind(d1,r1,r2)[,-3]
names(df) <- c("id","r1_final","r2_i1","r2_i2")
df$r2_final <- df$r2_i1
# Keep only second column of second data.frame
# if the value in the first column is found in first data.frame
df[df$r1_final == df$r2_i1,"r2_final"] <- df[df$r1_final == df$r2_i1,"r2_i2"]
df_final <- df[,c("id","r1_final","r2_final")]
print(df_final)
id r1_final r2_final
1 1 a c
2 2 b c
3 3 c f
Edit:
OP asked for a solution if there were four data.frames instead of 2 like in the example, here is some code that I haven't tested but it should work with two additional columns
df$r2_final <- df$r2_i1
df$r3_final <- df$r3_i1
df$r4_final <- df$r4_i1
df[df$r1_final == df$r2_i1,"r2_final"] <- df[df$r1_final == df$r2_i1,"r2_i2"]
df[df$r3_i1 %in% c(df$r1_final,df$r2_final),"r3_final"] <- df[df$r3_i1 %in% c(df$r1_final,df$r2_final),"r3_i2"]
df[df$r4_i1 %in% c(df$r1_final,df$r2_final,df$r3_final),"r4_final"] <- df[df$r4_i1 %in% c(df$r1_final,df$r2_final,df$r3_final),"r4_i2"]
df_final <- df[,c("id","r1_final","r2_final","r3_final","r4_final")]
Thanks for the accepted answer as it worked perfectly! However it gave me an idea to use ifelse. While it doesn't work any better or worse than the accepted answer it was a little easier for me to wrap my head around when adding more columns or data frames.
dfInt <- cbind(df1, df2, df3, df4)
dfInt$R1_Final <- dfInt$R1_1
dfInt$R2_Final <- ifelse(dfInt$R1_Final == dfInt$R2_1,
dfInt$R2_2,
dfInt$R2_1)
dfInt$R3_Final <- ifelse(dfInt$R1_Final != dfInt$R3_1 & dfInt$R2_Final != dfInt$R3_1,
dfInt$R3_1,
ifelse(dfInt$R2_Final != dfInt$R3_2,
dfInt$R3_2,
dfInt$R3_3))
dfInt$R4_Final <- ifelse(dfInt$R1_Final != dfInt$R4_1 & dfInt$R2_Final != dfInt$R4_1 & dfInt$R3_Final != dfInt$R4_1,
dfInt$R4_1,
ifelse(dfInt$R2_Final != dfInt$R4_2 & dfInt$R3_Final != dfInt$R4_2,
dfInt$R4_2,
ifelse(dfInt$R3_Final != dfInt$R4_3,
dfInt$R4_3,
dfInt$R4_4)))
I have two data.frames:
pattern <- data.frame(pattern = c("A", "B", "C", "D"), val = c(1, 1, 2, 2))
match <- data.frame(match = c("A", "C"))
I want to add to my data.frame pattern another column called new_val and assign "X" to each row where the value for column pattern is in the data.frame match otherwise assign "Y"
is.element(pattern$pattern, match$match)
[1] TRUE FALSE TRUE FALSE
So, the resulting data.frame should look like:
pattern val new_val
1 A 1 X
2 B 1 Y
3 C 2 X
4 D 2 Y
I achieved to do it with an ugly for-loop but I am sure this can be pretty much done in a one line R command using fancy stuff :-)
Is anyone able to help?
Many thanks!
I'm only really posting this since Tyler said "if you wanted a one liner data.table would likely do it" and I knew it was definitely possible with a one liner in base. I am also assuming match had been renamed to mat.
pattern$new_val <- c("Y", "X")[(pattern$pattern %in% mat)+1]
pattern
# pattern val new_val
#1 A 1 X
#2 B 1 Y
#3 C 2 X
#4 D 2 Y
pattern$pattern %in% mat is finding which of the elements of pattern are in mat which returns TRUE if it's in mat, FALSE if it's not. Then I add 1 to make it numeric in the range of 1-2 so that it can be used for indexing. Then we use that as an index to the self defined vector c("Y", "X") and since the index we created is always 1 or 2 we're always able to grab an element of interest. So in this case we'll grab "Y" if pattern wasn't in mat and "X" if it was - which is what you wanted.
Here's one way (I renamed your match to mat since there's a pretty important base function named match that you could actually use to solve this problem; in fact %in% is a form of match:
pattern <- data.frame(pattern = c("A", "B", "C", "D"), val = c(1, 1, 2, 2))
mat <- c("A", "C")
pattern$new_val <- "Y" #pre allot everything to be Y
pattern$new_val[pattern$pattern %in% mat] <- "X" #replace any A or C with an X
pattern
PS if you wanted a one liner data.table would likely do it.
If you wanted something a little more complicated you could use a function from a package I'm working on:
library(qdap)
#original problem
pattern$new_val <- text2color(pattern$pattern, list(c("A", "C")), c("X", "Y"))
#extending it
#makes D a 5
text2color(pattern$pattern, list(c("A", "C"), "D"), c("X", 5, "Y"))
This function really is designed to do something else but if you want to grab the essential parts of it you can look at the source code.