I have some code which includes a for loop, and nested if statements. The issue is that it is taking too long to run and I want to make it much faster.
I have data on cohorts in a data frame called f2_cebu_davao. There is also a column in this data frame called person_id. There are 3 categories of the cohorts: 'Baseline', 'Other Effects', 'Campaign'.
I want to loop through each person_id in the f2_cebu_davao data frame, and check to see which cohort it is in. If it is in the cohort 'Baseline' or 'Other Effects', then I will check the before_baseline_othereffects table to see if the ID can be found in that table. If it can, I make a new column in the f2_cebu_davao table and the value will be 'Returning'. Otherwise, 'New'.
If the cohort name is 'Campaign', I will check the before_campaign table and do the same procedure as above.
My data is quite big (all my objects are big) so this is taking a really long time to run (it's been running for more than 30 minutes and still not finished!).
How can I speed this up (possibly by using vectorization, or just by modifying the code a little)?
I tried to loop through, but it's taking too long.
before_baseline_othereffects <- subset(loans_final_full, submitted_at_date < '2018-05-21')
before_campaign <- subset(loans_final_full, submitted_at_date < '2019-01-21')
for(i in 1:nrow(f2_cebu_davao)){
  if(as.vector(f2_cebu_davao[, 'cohort'][i]) == 'Baseline') {
    if(as.vector(f2_cebu_davao[,'person_id'][i]) %in% as.vector(unique(before_baseline_othereffects$person_id)) == TRUE) {
      f2_cebu_davao$new_or_returning[i] <- 'Returning'
    } else {
      f2_cebu_davao$new_or_returning[i] <- 'New'
    }
  } else if (as.vector(f2_cebu_davao[, 'cohort'][i]) == 'Other Effects'){
    if(as.vector(f2_cebu_davao[,'person_id'][i]) %in% as.vector(unique(before_baseline_othereffects$person_id)) == TRUE) {
      f2_cebu_davao$new_or_returning[i] <- 'Returning'
    } else {
      f2_cebu_davao$new_or_returning[i] <- 'New'
    }
  } else {
    if(as.vector(f2_cebu_davao[,'person_id'][i]) %in% as.vector(unique(before_campaign$person_id)) == TRUE) {
      f2_cebu_davao$new_or_returning[i] <- 'Returning'
    } else {
      f2_cebu_davao$new_or_returning[i] <- 'New'
    }
  }
}
Happy to update and test this if you can provide some example data and desired output. I expect something like this should work.
Here I make up some fake data:
f2_cebu_davao <- data.frame(stringsAsFactors = FALSE,
  # rep() has no `by` argument; `times = 3` repeats the four cohorts to fill 12 rows
  cohort = rep(c("Baseline", "Other Effects", "Something else",
                 "Another Something"), times = 3),
  person_id = 1:12
)
before_baseline_othereffects <- c(1:4)
before_campaign <- c(5:8)
Here I apply it using dplyr's case_when, spelling out four cases. This code is vectorized, and I expect it would run much faster than the current loop code.
The cohort is either "Baseline" or "Other Effects", and the person_id appears in before_baseline_othereffects. This creates "Returning" in rows 1 & 2.
Given the first condition wasn't met, but the cohort is still either "Baseline" or "Other Effects", return "New", as is done in rows 5 & 6 and 9 & 10.
Given the first two conditions weren't met, but the person was in before_campaign, mark "Returning", as in rows 7 & 8.
Otherwise, mark "New", as in rows 3 & 4 and 11 & 12.
library(dplyr)
output <- f2_cebu_davao %>%
  mutate(new_or_returning = case_when(
    cohort %in% c("Baseline", "Other Effects") &
      person_id %in% before_baseline_othereffects ~ "Returning",
    cohort %in% c("Baseline", "Other Effects") ~ "New",
    person_id %in% before_campaign ~ "Returning",
    TRUE ~ "New"
  ))
Here's the output:
> output
              cohort person_id new_or_returning
1           Baseline         1        Returning
2      Other Effects         2        Returning
3     Something else         3              New
4  Another Something         4              New
5           Baseline         5              New
6      Other Effects         6              New
7     Something else         7        Returning
8  Another Something         8        Returning
9           Baseline         9              New
10     Other Effects        10              New
11    Something else        11              New
12 Another Something        12              New
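In the question, before_baseline_othereffects and before_campaign are data frames (subsets of loans_final_full) rather than plain vectors of IDs, so the lookup would go through their person_id columns. A minimal base R sketch of the same idea, assuming made-up stand-in data (the values below are hypothetical and only mimic the column names from the question):

```r
# Hypothetical stand-ins for the question's objects (not the real data)
f2_cebu_davao <- data.frame(
  cohort = c("Baseline", "Other Effects", "Campaign", "Campaign"),
  person_id = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)
before_baseline_othereffects <- data.frame(person_id = c(1, 9))
before_campaign <- data.frame(person_id = c(3, 9))

# One vectorized membership test per lookup table, instead of one per row
in_early    <- f2_cebu_davao$person_id %in% before_baseline_othereffects$person_id
in_campaign <- f2_cebu_davao$person_id %in% before_campaign$person_id

# Pick which membership test applies to each row, based on the cohort
is_early_cohort <- f2_cebu_davao$cohort %in% c("Baseline", "Other Effects")
seen_before <- ifelse(is_early_cohort, in_early, in_campaign)

f2_cebu_davao$new_or_returning <- ifelse(seen_before, "Returning", "New")
f2_cebu_davao
```

Because %in% is evaluated once for the whole column, the unique() call inside the loop is no longer repeated for every row, which is likely where most of the original running time went.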
I am working on market transaction data where each observation contains the value of the variable of the buyer's id, and the value of the variable of the seller's id. For each observation (i.e each transaction), I would like to create a variable equal to the number of other transactions the associated seller has done with a different buyer than the one involved in this transaction. As a consequence, in the following
data <- data.frame(Buyer_id = c("001","001","002","001"), Seller_id = c("021","022","022","021"))
I would like to obtain:
Result <- list(0,1,1,0)
I searched for existing answers to problems similar to mine, which usually involve the function mapply(), and tried to implement them, but without success.
Thank you very much for helping me.
Are you looking for something like this? If yes, then you might want to change your reproducible example to have a c instead of list when you construct your data.frame.
data <- data.frame(Buyer_id = c("001","001","002","001"),
Seller_id = c("021","022","022","021"))
data$n <- NA
for (i in seq_len(nrow(data))) {
seller <- as.character(data[i, "Seller_id"])
buyer <- as.character(data[i, "Buyer_id"])
with.buyers <- as.character(data[data$Seller_id == seller, "Buyer_id"])
with.buyers <- unique(with.buyers)
diff.buyers <- with.buyers[!(with.buyers %in% buyer)]
data[i, "n"] <- length(diff.buyers)
}
  Buyer_id Seller_id n
1      001       021 0
2      001       022 1
3      002       022 1
4      001       021 0
Apart from Roman Lustrik's solution, there is also an approach that uses graphs.
library(igraph)
data <- data.frame(Seller_id = c("021","022","022","021"),
Buyer_id = c("001","001","002","001"),
stringsAsFactors = FALSE)
my.graph <- graph_from_data_frame(data)
plot(my.graph)
degree(my.graph, mode = c("out"))
# Transform the graph into a simple graph. A simple graph does not allow
# duplicate edges.
my.graph <- simplify(my.graph)
plot(my.graph)
degree(my.graph, mode = c("out"))
V(my.graph)$out.degree <- degree(my.graph, mode = c("out"))
data$n <- apply(data,
                MARGIN = 1,
                FUN = function(transaction)
                {
                  node.out.degree <- V(my.graph)$out.degree[ V(my.graph)$name == transaction["Seller_id"] ]
                  if (node.out.degree <= 1) {
                    # Since the vertex has an out degree of at most 1, we know that the
                    # buyer in the current transaction is the seller's only buyer.
                    return(0)
                  } else {
                    # In this case, we know that the seller has sold to at least one other
                    # buyer. We therefore simply subtract one (the current buyer)
                    # from the out degree.
                    return(node.out.degree - 1)
                  }
                })
data
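Both answers above interpret the target as the number of distinct other buyers each seller has dealt with. Under that same interpretation, a loop-free base R sketch could count distinct buyers per seller once with tapply() and then look the count up for every row (the helper name n_buyers is made up for illustration):

```r
data <- data.frame(Buyer_id  = c("001", "001", "002", "001"),
                   Seller_id = c("021", "022", "022", "021"),
                   stringsAsFactors = FALSE)

# Number of distinct buyers for each seller, named by seller id
n_buyers <- tapply(data$Buyer_id, data$Seller_id,
                   function(b) length(unique(b)))

# Look up each row's seller by name and drop the current buyer from the count
data$n <- as.integer(n_buyers[as.character(data$Seller_id)]) - 1L
data
```

Indexing by as.character(data$Seller_id) keeps the lookup by name even if the column happens to be a factor.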
I have a data set which, when plotted, produces a graph that looks like this:
Plot
The head of this data is:
> head(data_frame)
      score position
73860    10    43000
73859    10    43001
73858    10    43002
73857    10    43003
73856    10    43004
73855    10    43005
I've uploaded the whole file as a tab delimited text file here.
As you can see, the plot has regions which have a score of around 10, but there's one region in the middle that "dips". I would like to identify these dips.
Defining a dip as:
Starting when the score is below 7
Ending when the score rises to 7 or above and stays at 7 or above for at least 500 positions
I would like to identify all the regions which meet the above definition, and output their start and end positions. In this case that would only be the one region.
However, I'm at a bit of a loss as to how to do this. Looks like the rle() function could be useful, but I'm not too sure how to implement it.
Expected output for the data frame would be something like:
[1] 44561 46568
(I haven't actually checked that everything in between these would qualify under the definition, but from the plot this looks about right)
I would be very grateful for any suggestions!
Andrei
So I've come up with one solution that uses a series of loops. I do realise this is inefficient, though, so if you have a better answer, please let me know.
results <- data.frame(matrix(ncol=2,nrow=1))
colnames(results) <- c("start","end")
state <- "out"
count <- 1
for (i in 1:dim(data_frame)[1]){
  print(i/dim(data_frame)[1])
  if (data_frame[i,3] < 7 & state=="out") {
    results[count,1] <- data_frame[i,2]
    state <- "in"
    next
  }
  if (data_frame[i,3] >= 7 & state=="in") {
    if ((i+500)>dim(data_frame)[1]){
      results[count,2] <- data_frame[dim(data_frame)[1],2]
      state <- "out"
      break
    }
    if (any(data_frame[(i+1):(i+500),3] < 7)) {
      next
    } else {
      results[count,2] <- data_frame[i-1,2]
      count <- count+1
      state <- "out"
      next
    }
  }
  if ((i+500)>dim(data_frame)[1] & state == "out") {
    break
  }
}
Something like this is a tidyverse solution and uses rle as OP suggested...
library(tidyverse)

below7 <- data_frame$score < 7
x <- rle(below7)
runs <- tibble(
  RunLength=x$lengths,
  Below7=x$values,
  # Placeholder; the original referred to an undefined `df` here
  RunStart=data_frame$position[1]
) %>%
  mutate(
    RunStart=ifelse(
      row_number() == 1,
      data_frame$position[1],
      data_frame$position[1] + cumsum(RunLength)-RunLength+1
    ),
    RunEnd=RunStart + RunLength-1,
    Dip=Below7,
    Dip=Dip | Below7 | (!Below7 & RunLength < 500)
  )
as.data.frame(runs)
Giving
   RunLength Below7 RunStart RunEnd   Dip
1       1393  FALSE    43000  44392 FALSE
2         84   TRUE    44394  44477  TRUE
3         84  FALSE    44478  44561  TRUE
...
19        60  FALSE    46338  46397  TRUE
20       171   TRUE    46398  46568  TRUE
21      2433  FALSE    46569  49001 FALSE
So to get OP's final answer
runs %>%
filter(Dip) %>%
summarise(
DipStart=min(RunStart),
DipEnd=max(RunEnd)
)
# A tibble: 1 x 2
  DipStart DipEnd
     <dbl>  <dbl>
1    44394  46568
If the original data.frame might contain more than one dip, you'd have to do a little more work when creating the runs tibble: having identified each individual run, you'd need to create an additional column, DipIndex say, which indexes each individual dip.
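One way such a DipIndex column might be built is sketched below in base R. The runs table here is a small made-up example with the same column names as above (not the real data), and it assumes that separate dips are always separated by at least one run with Dip equal to FALSE:

```r
# Hypothetical runs table containing two separate dips
runs <- data.frame(
  RunStart = c(1, 101, 151, 171, 1001, 1051),
  RunEnd   = c(100, 150, 170, 1000, 1050, 2000),
  Dip      = c(FALSE, TRUE, TRUE, FALSE, TRUE, FALSE)
)

# A new dip begins wherever Dip switches from FALSE to TRUE
dip_begins <- runs$Dip & !c(FALSE, head(runs$Dip, -1))

# Number the dips 1, 2, ...; runs outside any dip get NA
runs$DipIndex <- ifelse(runs$Dip, cumsum(dip_begins), NA)

# Collapse each dip to its first start and last end position
dips <- runs[runs$Dip, ]
result <- data.frame(
  DipStart = as.vector(tapply(dips$RunStart, dips$DipIndex, min)),
  DipEnd   = as.vector(tapply(dips$RunEnd, dips$DipIndex, max))
)
result
```

The same grouping could be written with group_by(DipIndex) and summarise() to stay within the tidyverse style of the answer.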
I am trying to mutate a column for salary in my data frame to adjust for inflation, since I have a multi-year sample. The new column is called adj_SALARY. The salary column is a character vector (unadj_SALARY), and I need to multiply its values by a ratio of Consumer Price Indices (shown below as a fraction) to convert all values to 2017 dollars. I also have dummy-variable columns YEAR_2014, YEAR_2015, YEAR_2016, YEAR_2017, and YEAR_2018. I have tried running the code below and am still being met with the error message "In if (YEAR_2014 == 1) { : the condition has length > 1 and only the first element will be used". Would love some help on the best way to set this up! Here's my code right now:
NHIS_test <- NHIS1 %>%
mutate(adj_SALARY = if(YEAR_2014 == 1) {
as.numeric(as.character(NHIS1$unadj_SALARY))*(242.839/230.280) }
else if(YEAR_2015 == 1) {
as.numeric(as.character(NHIS1$unadj_SALARY))*(242.839/233.916) }
else if (YEAR_2016 == 1) {
as.numeric(as.character(NHIS1$unadj_SALARY))*(242.839/233.707) }
else if (YEAR_2017 == 1) {
as.numeric(as.character(NHIS1$unadj_SALARY))*(242.839/236.916)}
else if (YEAR_2018 == 1) {
as.numeric(as.character(NHIS1$unadj_SALARY))*(1)})
We can use ifelse/case_when instead of if/else, because ifelse and case_when are vectorized:
library(dplyr)
NHIS1 %>%
  mutate(unadj_SALARY = as.numeric(as.character(unadj_SALARY)),
         adj_SALARY =
           case_when(
             YEAR_2014 == 1 ~ unadj_SALARY *(242.839/230.280),
             YEAR_2015 == 1 ~ unadj_SALARY *(242.839/233.916),
             YEAR_2016 == 1 ~ unadj_SALARY *(242.839/233.707),
             YEAR_2017 == 1 ~ unadj_SALARY *(242.839/236.916),
             YEAR_2018 == 1 ~ unadj_SALARY))
NOTE: Instead of doing the numeric conversion on 'unadj_SALARY' multiple times, it is better to do it once and then use that for further transformation/calculations
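As a different technique from case_when, the year dummies can also be turned into a per-row multiplier with a matrix product. The sketch below uses base R and made-up sample rows (the real NHIS1 data is not shown in the question), and it assumes exactly one YEAR_* dummy equals 1 in each row; the names cpi_ratio, dummies and multiplier are hypothetical:

```r
# Hypothetical sample rows; only the column names come from the question
NHIS1 <- data.frame(
  unadj_SALARY = c("50000", "60000", "70000"),
  YEAR_2014 = c(1, 0, 0),
  YEAR_2015 = c(0, 0, 0),
  YEAR_2016 = c(0, 1, 0),
  YEAR_2017 = c(0, 0, 0),
  YEAR_2018 = c(0, 0, 1),
  stringsAsFactors = FALSE
)

# CPI ratios copied from the question, named after the dummy columns
cpi_ratio <- c(
  YEAR_2014 = 242.839 / 230.280,
  YEAR_2015 = 242.839 / 233.916,
  YEAR_2016 = 242.839 / 233.707,
  YEAR_2017 = 242.839 / 236.916,
  YEAR_2018 = 1
)

# Each row has a single 1, so the product picks out that year's ratio
dummies <- as.matrix(NHIS1[, names(cpi_ratio)])
multiplier <- as.vector(dummies %*% cpi_ratio)

# Convert the character salary once, then scale it
NHIS1$adj_SALARY <- as.numeric(NHIS1$unadj_SALARY) * multiplier
NHIS1
```

Unlike case_when, a row with no dummy set would get a multiplier of 0 here rather than NA, so this variant only makes sense if every row belongs to exactly one year.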
Building off of this question: "Pass a data.frame with column names and fields as filter".
Let's say we have the following data set:
library(data.table)
filt = data.table(X1 = c("Gender","Male"),
                  X2 = c('jobFamilyGroup','Finance'),
                  X3 = c('jobFamilyGroup','Software Dev'))
df = data.table(Gender = c('Male','F','Male','Male','F'),
                EmployeeStatus = c('Active','na','Active','Active','na'),
                jobFamilyGroup = c('Finance','Software Dev','HR','Finance','Software Dev'))
and I want to use filt as a filter for df. filt is done by grabbing an input from Shiny and transforming it a bit to get me that data.table above. My goal is to filter df so we have: All rows that are MALE AND (Software Dev OR Finance).
Currently, I'm hardcoding it to always be an AND but that isn't ideal for situations like this. My thought would be to have multiple if conditions to catch things like this, but I feel like there could be an easier approach for building this logic in.
___________UPDATE______________
Once I have a table like filt I can pass code like:
if(!is.null(primary))
{
  if(ncol(primary)==1){
    d2 = df[get(as.character(primary[1,1]))==as.character(primary[2,1])]
  }
  else if(length(primary)==2){
    d2 = df[get(as.character(primary[1,1]))==as.character(primary[2,1]) &
              get(as.character(primary[1,2]))==as.character(primary[2,2])]
  }
  else{
    # Row 1 holds the column name and row 2 the value, so compare against primary[2,1]
    d2 = df[get(as.character(primary[1,1]))==as.character(primary[2,1]) &
              get(as.character(primary[1,2]))==as.character(primary[2,2]) &
              get(as.character(primary[1,3]))==as.character(primary[2,3])]
  }
}
But this code doesn't account for the OR logic needed when there are multiple inputs for one type of grouping. The current code says: give me all rows where Gender == 'Male' & jobFamilyGroup == 'Finance' & jobFamilyGroup == 'Software Dev', when really it should be Gender == 'Male' & (jobFamilyGroup == 'Finance' | jobFamilyGroup == 'Software Dev').
This is a minimal example; there are many other columns, so ideally the solution can detect when a grouping has multiple inputs.
Given your problem, what if you parsed it so your logic looked like:
Gender %in% c("Male") & jobFamilyGroup %in% c('Finance','Software Dev')
By lumping all filter values with the same column name together in an %in% you get your OR and you keep your AND between column names.
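To make that concrete, here is a small sketch of the hard-coded version of this expression applied to the sample data from the question. It uses a plain data.frame in base R so that it runs without data.table; the variable names keep and filtered are made up:

```r
df <- data.frame(
  Gender = c("Male", "F", "Male", "Male", "F"),
  EmployeeStatus = c("Active", "na", "Active", "Active", "na"),
  jobFamilyGroup = c("Finance", "Software Dev", "HR", "Finance", "Software Dev"),
  stringsAsFactors = FALSE
)

# %in% gives the OR within a column; & gives the AND between columns
keep <- df$Gender %in% c("Male") &
  df$jobFamilyGroup %in% c("Finance", "Software Dev")

filtered <- df[keep, ]
filtered
```

Only the two male Finance rows survive, since no male row has 'Software Dev' in this sample.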
UPDATE
Consider the case discussed in comments below.
Your reactive inputs a data.table specifying
Gender IS Male
Country IS China OR US
EmployeeStatus IS Active
In the sample data you provided there is no country column, so I added one. I extract the columns to be filtered and the values to be filtered and split the values to be filtered by the columns. I pass this into an lapply which does the logical check for each column using an %in% rather than a == so that options within the same column are treated as an | instead of a &. Then I rbind the logical results together and apply an all to the columns and then filter df by the results.
This approach handles the & between columns and the | within columns. It supports any number of columns to be searched removing the need for your if/else logic.
library(data.table)
df = data.table(Gender = c('Male','F','Male','Male','F'),
EmployeeStatus = c('Active','na','Active','Active','na'),
jobFamilyGroup = c('Finance','Software Dev','HR','Finance','Software Dev'),
Country = c('China','China','US','US','China'))
filt = data.table(x1 = c('Gender' , 'Male'),x2 = c('Country' , 'China'),x3 = c('Country','US'), x4 = c('EmployeeStatus','Active'))
column = unlist(filt[1,])
value = unlist(filt[2,])
tofilter = split(value,column)
tokeep = apply(do.call(rbind,lapply(names(tofilter),function(x){
`[[`(df,x) %in% tofilter[[x]]
})),2,all)
df[tokeep==TRUE]
#> Gender EmployeeStatus jobFamilyGroup Country
#> 1: Male Active Finance China
#> 2: Male Active HR US
#> 3: Male Active Finance US
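The rbind/apply step can also be written with Reduce(), which combines the per-column logical vectors with & directly. This is only a sketch of an equivalent formulation, shown on plain data.frames so that it runs in base R; it reuses the sample values from above:

```r
df <- data.frame(
  Gender = c("Male", "F", "Male", "Male", "F"),
  EmployeeStatus = c("Active", "na", "Active", "Active", "na"),
  jobFamilyGroup = c("Finance", "Software Dev", "HR", "Finance", "Software Dev"),
  Country = c("China", "China", "US", "US", "China"),
  stringsAsFactors = FALSE
)
filt <- data.frame(
  x1 = c("Gender", "Male"),
  x2 = c("Country", "China"),
  x3 = c("Country", "US"),
  x4 = c("EmployeeStatus", "Active"),
  stringsAsFactors = FALSE
)

# Row 1 holds column names, row 2 holds values; group the values by column
column <- unlist(filt[1, ])
value <- unlist(filt[2, ])
tofilter <- split(value, column)

# One logical vector per filtered column (OR within a column via %in%)
checks <- lapply(names(tofilter), function(col) df[[col]] %in% tofilter[[col]])

# AND the columns together element-wise
keep <- Reduce(`&`, checks)
df[keep, ]
```

This should select the same three rows as the output above, without building an intermediate matrix.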
I am still new to R and want to use a *ply function to extract information from a dataframe. A sample input dataframe looks like this:
# Construct the dataframe
season <- c("11/12","11/12","11/12","11/12","11/12")
hometeam <- c("Team A","MyTeam","MyTeam","Team D","Team E")
awayteam <- c("MyTeam","Team B","Team C","MyTeam","MyTeam")
score <- c("1 - 1","7 - 1","0 - 0","0 - 2","0 - 1")
stats <- data.frame(season,hometeam,awayteam,score)
print(stats)
  season hometeam awayteam score
1  11/12   Team A   MyTeam 1 - 1
2  11/12   MyTeam   Team B 7 - 1
3  11/12   MyTeam   Team C 0 - 0
4  11/12   Team D   MyTeam 0 - 2
5  11/12   Team E   MyTeam 0 - 1
What I want to do is extract both the opponent of 'MyTeam' as well as the winner. The score is always given as the home team's score vs. the away team's score. I have a way of extracting who the opponent is like this:
# Get the opponent to MyTeam; can add this to dataframe later
opponent <- ifelse(stats$hometeam == "MyTeam", stats$awayteam, stats$hometeam)
But I am stuck trying to get the winner of every match. I tried doing this with daply() and a named function like so:
# Separate out scores for home and away team to determine winner
stats <- separate(stats, score, c('homescore','awayscore'), sep=' - ', remove=TRUE)
# Function for use in ply to get the winner of a match
determineWinner <- function(homescore, awayscore, hometeam) {
  homewon <- FALSE
  if ( homescore < awayscore) {
    homewon <- FALSE
  } else if ( homescore > awayscore ) {
    homewon <- TRUE
  } else {
    return("tie")
  }
  if ( hometeam == "MyTeam" ) {
    ifelse(homewon, return("won"), return("lost"))
  } else {
    ifelse(homewon, return("lost"), return("won"))
  }
}#end of function
winner <- daply(stats, .(homescore,awayscore,hometeam), determineWinner(stats$homescore, stats$awayscore, stats$hometeam) )
But this clearly does not work. Am I applying the daply() method incorrectly? I think I am still unsure how the *ply functions really behave. It seems like a *ply function is the way to go here, but if there are other solutions out there, I am all ears. Any help is greatly appreciated!
Your logic can be implemented using nested ifelse:
winner <- ifelse(stats$homescore > stats$awayscore,
ifelse(stats$hometeam == "MyTeam","won","lost"),
ifelse(stats$homescore < stats$awayscore,
ifelse(stats$hometeam == "MyTeam","lost","won"),
"tie"))
##[1] "tie" "won" "tie" "won" "won"
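One caveat: separate() leaves homescore and awayscore as character columns by default, so the comparisons above are string comparisons. That happens to work for single-digit scores, but "10" > "9" is FALSE as text. A self-contained base R sketch that converts the scores to numbers first, and compares from MyTeam's point of view, might look like this (the names my_goals and opp_goals are made up):

```r
stats <- data.frame(
  hometeam = c("Team A", "MyTeam", "MyTeam", "Team D", "Team E"),
  awayteam = c("MyTeam", "Team B", "Team C", "MyTeam", "MyTeam"),
  score = c("1 - 1", "7 - 1", "0 - 0", "0 - 2", "0 - 1"),
  stringsAsFactors = FALSE
)

# Split "home - away" and convert both parts to numbers
parts <- strsplit(stats$score, " - ", fixed = TRUE)
home <- as.numeric(sapply(parts, `[`, 1))
away <- as.numeric(sapply(parts, `[`, 2))

# Goals scored by MyTeam and by the opponent, whichever side MyTeam played on
my_goals  <- ifelse(stats$hometeam == "MyTeam", home, away)
opp_goals <- ifelse(stats$hometeam == "MyTeam", away, home)

winner <- ifelse(my_goals > opp_goals, "won",
                 ifelse(my_goals < opp_goals, "lost", "tie"))
winner
```

Putting the scores in terms of MyTeam first means only one nested ifelse is needed for the result.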