FOR loop optimization in R

I have the following code that is taking forever to run on my 80k-row CBP table. Could anyone help me optimize my loop? I'm simply trying to find duplicates that share the same values in certain (not all) columns, count how many duplicates there are, and then return the Ids for each of the duplicates:
for (row in 1:nrow(CBP)) {
  subs <- subset(CBP,
    CBP$Lower_Bound__c == CBP[row, "Lower_Bound__c"] &
    CBP$Price_Book__c  == CBP[row, "Price_Book__c"] &
    CBP$Price__c       == CBP[row, "Price__c"] &
    CBP$Product__c     == CBP[row, "Product__c"] &
    CBP$Department__c  == CBP[row, "Department__c"] &
    CBP$UOM__c         == CBP[row, "UOM__c"] &
    CBP$Upper_Bound__c == CBP[row, "Upper_Bound__c"])
  if (nrow(subs) > 1) {
    CBP[row, ]$dup <- nrow(subs)
    CBP[row, ]$dupids <- paste(subs[, "Id"], collapse = ",")
  }
  print(row)
}

I'm having a hard time understanding your example. However, here's a simple approach with data.table that might work for your situation. You can create a variable (nsame in the example) that counts occurrences within each combination of grouping variables (var1 and var2 in the example); rows where the counter exceeds 1 are duplicates. Then just grab the row indices.
library(data.table)
# generate some example data
dt <- data.table(
  var1 = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  var2 = c("a", "a", "z", "b", "y", "b", "c", "c", "c"),
  var3 = 1:9
)
# running counter for each combination of var1-var2
dt[, nsame := 1:.N, by = .(var1, var2)]
# duplicates are where the counter is > 1
which(dt$nsame > 1)
## 2 6 8 9
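To get the OP's exact output (a duplicate count plus the collapsed Ids on every duplicated row) in one grouped pass, here is a sketch of the same idea, assuming CBP has the columns named in the question:
library(data.table)
setDT(CBP)
dupe_cols <- c("Lower_Bound__c", "Price_Book__c", "Price__c", "Product__c",
               "Department__c", "UOM__c", "Upper_Bound__c")
# group size and collapsed Ids, computed once per group instead of once per row
CBP[, `:=`(dup = .N, dupids = paste(Id, collapse = ",")), by = dupe_cols]
# mirror the original loop, which left non-duplicated rows untouched
CBP[dup == 1, c("dup", "dupids") := .(NA_integer_, NA_character_)]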

Using base R:
dupe_columns <- c(
  "Lower_Bound__c", "Price_Book__c", "Price__c", "Product__c",
  "Department__c", "UOM__c", "Upper_Bound__c"
)
# which rows are duplicated
dupes <- which(duplicated(CBP[, dupe_columns]) |
               duplicated(CBP[, dupe_columns], fromLast = TRUE))
# how many there are
length(dupes)
# Ids that are duplicated
CBP[dupes, "Id"]
# collapse Ids with duplicates by group:
aggregate(CBP$Id, by = CBP[dupe_columns], FUN = paste, collapse = ",")
If any of this doesn't work or you need more help, post 10-20 rows of sample data (use dput() so it is copy/pasteable!!!) so we can test and verify.
Subtle point, but I use CBP[, dupe_columns] in the duplicated() line because duplicated() will work the same whether we give it a data frame or a vector. CBP[, dupe_columns] will be a data frame if you have more than one column to check for dupes, but will be a vector if you give it a single column. However, when we get down to aggregate we need the by argument to be a list (like a data frame). So I use CBP[dupe_columns] (no comma) which will guarantee a data frame even if we are only checking a single column.
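A quick illustration of that indexing difference, using a throwaway two-column data frame (the names here are just for demonstration):
tmp <- data.frame(a = 1:3, b = 4:6)
class(tmp[, "a"])  # "integer" -- matrix-style indexing drops a single column to a vector
class(tmp["a"])    # "data.frame" -- list-style indexing always keeps a data frame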

add a column based on values in three other columns

I have a data frame ('ju') that has three columns and 230 rows. The first two columns represent a pair of objects. The third column contains one of the objects from that pair. I'd like to add a fourth column that contains the other object from the pair, as shown below.
I wrote code to identify the value for the fourth column (loser), but it does not give me any output when I run it.
for (i in 1:230) {
  if (ju$winner[i] == ju$letter2[i]) {
    paste(ju$letter1[i])
  } else {
    paste(ju$letter2[i])
  }
}
I cannot see what is wrong with the code. I would also appreciate a suggestion for creating the fourth column directly in my data frame, instead of creating a separate vector and then adding it to the data frame. Thanks
This will do it without a for loop (note the element-wise ==; %in% would compare each winner against the whole letter1 column rather than its own row):
ju$loser <- ifelse(ju$winner == ju$letter1, ju$letter2, ju$letter1)
Gives:
> ju
  letter1 letter2 winner loser
1       a       c      a     c
2       c       b      b     c
3       t       j      j     t
4       r       k      k     r
As for why the original loop is silent: paste() only returns a value, and values are not auto-printed inside a for loop. If you want to print to the console, you'll need to add:
cat(ju$letter1[i])
or
print(ju$letter1[i])
Regarding the new-column question, a possible solution (it is sub-optimal to use a for loop here -- see the suggestion from #lab_rat_kid):
ju$NewColumn <- NA
for (i in 1:230) {
  if (ju$winner[i] == ju$letter2[i]) {
    ju$NewColumn[i] <- ju$letter1[i]
  } else {
    ju$NewColumn[i] <- ju$letter2[i]
  }
}
with tidyverse:
library(dplyr)
library(tibble)
dt <- tibble(l1 = c("a", "c", "t", "r"),
             l2 = c("c", "b", "j", "k"),
             winner = c("a", "b", "j", "k"))
dt <- dt %>%
  mutate(loser = if_else(winner == l1, l2, l1))
dt

Select rows of data frame with several values using logical operators [duplicate]

I am using square brackets to select rows of data in a data frame based on logical operators. For example, if I have a data frame
df = data.frame(Letter = rep(c("A", "B", "C", "D", "E"), 10), Number = rep(c(1:25), 4))
and I want to select rows that contain the letters A, B, or C I use the code
df = df[df$Letter == "A" | df$Letter == "B" | df$Letter == "C",]
I'm wondering if there is a way to condense this, something along the lines of
df = df[df$Letter == c("A", "B", "C"),]
or maybe
df = df[df$Letter == "A" | "B" | "C",]
neither of which work, but basically I'm looking for a shorter, easier way to list several logical operators.
I would prefer to do it with square brackets rather than subset() or some other function, but if it really isn't possible with square brackets then I would be open to other ideas.
You could do this:
df <- df[df$Letter %in% c("A", "B", "C"),]
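The reason df$Letter == c("A", "B", "C") misbehaves is recycling: the three-element vector is repeated along the column and compared position by position, not treated as a set. A small illustration with throwaway data:
x <- c("A", "B", "C", "C", "B", "A")
x == c("A", "B", "C")    # recycled comparison: TRUE TRUE TRUE FALSE TRUE FALSE
x %in% c("A", "B", "C")  # set membership: all TRUE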
For these instances I use the dplyr package.
library(dplyr)
new.df <- df %>%
  filter(Letter %in% c("A", "B", "C"))
I hope that helps,
cheers!

Extracting first column that meets certain criteria for each row

I will try to explain what I am doing as best I can; it is kind of confusing, but I'll give it a shot. Essentially I start with two data frames, each containing one row per person and two items per person as columns. My goal is to turn this into one data frame with one row per person, taking the first item from each of the two data frames on the condition that the items do not repeat. For example, if customer 1's items in the first data frame are "a" and "d", and in the second data frame they are "a" and "c", I would want the final data frame to contain "a" and "c" for this customer ("a" from the first data frame, then "c" because "a" is already taken). I have written an apply that does this; however, when I run it on roughly 160,000 rows it takes quite a bit of time. I was hoping someone could come up with a more efficient solution to my problem.
d1 <- data.frame(id = c("1", "2", "3"), stringsAsFactors = F)
r1 <- data.frame(i1 = c("a", "b", "c"), i2 = c("d", "e", "f"), stringsAsFactors = F)
rownames(r1) <- d1$id
r2 <- data.frame(i1 = c("a", "c", "f"), i2 = c("c", "t", "l"), stringsAsFactors = F)
rownames(r2) <- d1$id
dFinal <- data.frame(id = d1$id, r1 = "", r2 = "", stringsAsFactors = F)
dFinal$r1 <- apply(dFinal, 1, function(x) r1[rownames(r1) == x["id"], "i1"])
dFinal$r2 <- apply(dFinal, 1, function(x) {
  r2[rownames(r2) == x["id"],
     which(!r2[rownames(r2) == x["id"], c("i1", "i2")] %in% x["r1"])[1]]
})
Would the following do what you're looking for?
# Keep only first column of first data.frame
df <- cbind(d1,r1,r2)[,-3]
names(df) <- c("id","r1_final","r2_i1","r2_i2")
df$r2_final <- df$r2_i1
# Keep only second column of second data.frame
# if the value in the first column is found in first data.frame
df[df$r1_final == df$r2_i1,"r2_final"] <- df[df$r1_final == df$r2_i1,"r2_i2"]
df_final <- df[,c("id","r1_final","r2_final")]
print(df_final)
  id r1_final r2_final
1  1        a        c
2  2        b        c
3  3        c        f
Edit:
OP asked for a solution with four data.frames instead of the two in the example; here is some code that I haven't tested, but it should work with the two additional columns:
df$r2_final <- df$r2_i1
df$r3_final <- df$r3_i1
df$r4_final <- df$r4_i1
df[df$r1_final == df$r2_i1,"r2_final"] <- df[df$r1_final == df$r2_i1,"r2_i2"]
df[df$r3_i1 %in% c(df$r1_final,df$r2_final),"r3_final"] <- df[df$r3_i1 %in% c(df$r1_final,df$r2_final),"r3_i2"]
df[df$r4_i1 %in% c(df$r1_final,df$r2_final,df$r3_final),"r4_final"] <- df[df$r4_i1 %in% c(df$r1_final,df$r2_final,df$r3_final),"r4_i2"]
df_final <- df[,c("id","r1_final","r2_final","r3_final","r4_final")]
Thanks for the accepted answer, as it worked perfectly! However, it gave me an idea to use ifelse. While it doesn't work any better or worse than the accepted answer, it was a little easier for me to wrap my head around when adding more columns or data frames.
dfInt <- cbind(df1, df2, df3, df4)
dfInt$R1_Final <- dfInt$R1_1
dfInt$R2_Final <- ifelse(dfInt$R1_Final == dfInt$R2_1,
                         dfInt$R2_2,
                         dfInt$R2_1)
dfInt$R3_Final <- ifelse(dfInt$R1_Final != dfInt$R3_1 & dfInt$R2_Final != dfInt$R3_1,
                         dfInt$R3_1,
                         ifelse(dfInt$R2_Final != dfInt$R3_2,
                                dfInt$R3_2,
                                dfInt$R3_3))
dfInt$R4_Final <- ifelse(dfInt$R1_Final != dfInt$R4_1 & dfInt$R2_Final != dfInt$R4_1 & dfInt$R3_Final != dfInt$R4_1,
                         dfInt$R4_1,
                         ifelse(dfInt$R2_Final != dfInt$R4_2 & dfInt$R3_Final != dfInt$R4_2,
                                dfInt$R4_2,
                                ifelse(dfInt$R3_Final != dfInt$R4_3,
                                       dfInt$R4_3,
                                       dfInt$R4_4)))
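For more data frames the nested ifelse grows quickly. As a hedged sketch (the helper name and list-of-data-frames interface are mine, not from the thread), the same "first item not already used" rule can be written once and applied row-wise, assuming each data frame has one row per id with its items as columns:
pick_first_new <- function(dfs) {
  n <- nrow(dfs[[1]])
  out <- matrix(NA_character_, nrow = n, ncol = length(dfs))
  for (i in seq_len(n)) {
    chosen <- character(0)
    for (j in seq_along(dfs)) {
      items <- unlist(dfs[[j]][i, ], use.names = FALSE)
      pick <- items[!items %in% chosen][1]  # first item not yet used (NA if none left)
      out[i, j] <- pick
      chosen <- c(chosen, pick)
    }
  }
  setNames(as.data.frame(out, stringsAsFactors = FALSE),
           paste0("r", seq_along(dfs), "_final"))
}
# with the question's data, cbind(d1, pick_first_new(list(r1, r2)))
# reproduces the accepted answer's result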

R - reshape dataframe from duplicated column names but unique values

Hi, I have a dataframe with duplicated column names but unique values in each column, like df1 in the answer below (type, model, make, type, model, make).
I want to reshape it so that the repeated columns are stacked under a single set of names: one type, model, and make column with one row per repeated set.
How would I do that?
Here is one option that could work. We loop through the unique names of the dataset, create a logical index with ==, extract the columns, unlist, create a data.frame, and then cbind everything together or just use data.frame (the assumption is that each set of duplicated columns has the same number of elements).
data.frame(lapply(unique(names(df1)), function(x)
  setNames(data.frame(unlist(df1[names(df1) == x], use.names = FALSE)), x)))
#   type model make
# 1    a     b    c
# 2    d     e    f
data
df1 <- data.frame(type = "a", model = "b", make = "c", type = "d",
                  model = "e", make = "f",
                  check.names = FALSE, stringsAsFactors = FALSE)

subsetting in r based on a vector of conditions

This is a restatement of my poorly worded previous question. (To those who replied to it, I appreciate your efforts, and I apologize for not being as clear with my question as I should have been.) I have a large dataset, a subset of which might look like this:
a<-c(1,2,3,4,5,1)
b<-c("a","b","a","b","c","a")
c<-c("m","f","f","m","m","f")
d<-1:6
e<-data.frame(a,b,c,d)
If I want the sum of the entries in the fourth column based on a specific condition, I could do something like this:
attach(e)
total<-sum(e[which(a==3 & b=="a"),4])
detach(e)
However, I have a "vector" of conditions (call it condition_vector), the first four elements of which look more like this:
a==3 & b == "a"
a==2
a==1 & b=="a" & c=="m"
c=="f"
I'd like to create a "generalized" version of the "total" formula above that produces a results_vector of totals by reading in the condition_vector of conditions. In this example, the first four entries in the results_vector would be calculated conceptually as follows:
results_vector[1]<-sum(e[which(a==3 & b=="a"),4])
results_vector[2]<-sum(e[which(a==2),4])
results_vector[3]<-sum(e[which(a==1 & b=="a" & c=="m"),4])
results_vector[4]<-sum(e[which(c=="f"),4])
My actual data set has more than 20 variables. So each record in the condition_vector can contain anywhere from 1 to more than 20 conditions (as opposed to between 1 and 3 conditions, used in this example).
Is there a way to accomplish this other than using an eval(parse(text = ...)) approach (which takes a long time to run on a relatively small dataset)?
Thanks in advance for any help you can provide (and again, I apologize that I wasn't as clear as I should have been last time around).
Spark
Here is a solution using eval(parse(text = ...)), even if you obviously find it slow:
cond <- c('a==3 & b == "a"', 'a==2', 'a==1 & b=="a" & c=="m"', 'c=="f"')
names(cond) <- cond
results_vector <- lapply(cond, function(x)
  sum(e[eval(parse(text = x)), "d"]))
results_vector
$`a==3 & b == "a"`
[1] 3
$`a==2`
[1] 2
$`a==1 & b=="a" & c=="m"`
[1] 1
$`c=="f"`
[1] 11
The advantage of naming your conditions vector is that you can access your results by condition.
results_vector[cond[2]]
$`a==2`
[1] 2
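If you want to avoid eval(parse(text = ...)) altogether, a hedged alternative (not from the thread, just a sketch) is to store each condition as a function of the data frame instead of as a string:
# each condition is an ordinary function of the data frame
conds <- list(
  function(e) e$a == 3 & e$b == "a",
  function(e) e$a == 2,
  function(e) e$a == 1 & e$b == "a" & e$c == "m",
  function(e) e$c == "f"
)
results_vector <- vapply(conds, function(f) sum(e$d[f(e)]), numeric(1))
results_vector
# [1]  3  2  1 11
This avoids parsing entirely, at the cost of writing the conditions as R code rather than as strings.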
Here is a function that takes as arguments the condition for each column (NA if there is no condition on a column) and sums a selected column of a selected data.frame:
conds.by.col <- function(..., sumcol, DF)  # NA if no condition on a column
{
  conds.ls <- list(...)
  res.ls <- vector("list", length(conds.ls))
  for (i in 1:length(conds.ls))
  {
    res.ls[[i]] <- which(DF[, i] == conds.ls[[i]])
  }
  # drop empty results (columns with an NA condition match nothing)
  res.ls <- res.ls[lengths(res.ls) != 0]
  which_rows <- Reduce(intersect, res.ls)
  return(sum(DF[which_rows, sumcol]))
}
Test:
a <- c(1,2,3,4,5,1)
b <- c("a", "b", "a", "b", "c", "a")
c <- c("m", "f", "f", "m", "m", "f")
d <- 1:6
e <- data.frame(a, b, c, d)
conds.by.col(3, "a", "f", sumcol = 4, DF = e)
#[1] 3
For multiple conditions, mapply:
# all conditions in a data.frame:
myconds <- data.frame(con1 = c(3, "a", "f"),
                      con2 = c(NA, "a", NA),
                      con3 = c(1, NA, "f"),
                      stringsAsFactors = F)
mapply(conds.by.col, myconds[1, ], myconds[2, ], myconds[3, ],
       MoreArgs = list(sumcol = 4, DF = e))
# con1 con2 con3
#    3   10    6
I guess "efficiency" isn't the first word that comes to mind looking at this, though...
