I have a data frame of 16 million rows, and I want to add a column based on an existing column, Month. If the month is 3, 4 or 5, the new column Season should be "Spring", and so on.
for (i in 1:nrow(df)) {
if (df$Month[[i]] %in% c(3,4,5)) {
df$Season[[i]] <- "Spring"
} else if (df$Month[[i]] %in% c(6,7,8)) {
df$Season[[i]] <- "Summer"
} else if (df$Month[[i]] %in% c(9,10,11)) {
df$Season[[i]] <- "Autumn"
} else if (df$Month[[i]] %in% c(12,1,2)) {
df$Season[[i]] <- "Winter"
}
}
However, it is taking way too long for it to complete. What can I do?
One of the easier and faster ways is to create a data frame of the months and seasons and then join it to your parent data frame.
Like this:
seasons<-data.frame(Month=1:12, Season=c("Winter", "Winter", rep("Spring", 3), rep("Summer", 3), rep("Autumn", 3), "Winter"))
answer <- dplyr::left_join(df, seasons, by = "Month")
This assumes both data frames have a matching column named "Month".
I expect about a 1000x increase in performance over the for loop.
This is along the lines of @Dave2e's answer, but with base R:
Season=c("Winter", "Winter", rep("Spring", 3),
rep("Summer", 3), rep("Autumn", 3), "Winter")
df<-data.frame(month=sample(1:12,10,replace=TRUE)) #Sample data
df$season<-Season[df$month]
df
# month season
#1 8 Summer
#2 8 Summer
#3 5 Spring
#4 7 Summer
#5 2 Winter
#6 4 Spring
#7 12 Winter
#8 7 Summer
#9 11 Autumn
#10 1 Winter
This is significantly faster than the for loop method.
Using for loop (1000 rows):
#user system elapsed
#0.02 0.00 0.02
Using vectorised method (1000 rows):
#user system elapsed
# 0 0 0
Calculated using system.time.
This difference might look insignificant with only 1000 rows. However, it becomes very large as the number of rows increases (in the OP's case, 16 million).
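For anyone who wants to reproduce a comparison like this, here is a minimal sketch. The sample size, the seed and the names season_lookup, season_loop and season_vec are my own assumptions, and the loop is a simplified stand-in for the OP's if/else loop, not a copy of it. Timings will vary by machine, so none are quoted.

```r
# Hypothetical sample data: 100,000 rows of random month numbers.
set.seed(1)
n <- 1e5
df <- data.frame(Month = sample(1:12, n, replace = TRUE))

# Lookup vector: position i holds the season for month i.
season_lookup <- c("Winter", "Winter", rep("Spring", 3),
                   rep("Summer", 3), rep("Autumn", 3), "Winter")

# Loop version (simplified): one lookup per row into a preallocated vector.
loop_time <- system.time({
  season_loop <- character(n)
  for (i in seq_len(n)) {
    season_loop[i] <- season_lookup[df$Month[i]]
  }
})

# Vectorised version: a single indexing operation for the whole column.
vec_time <- system.time({
  season_vec <- season_lookup[df$Month]
})

# Both approaches should produce the same result.
identical(season_loop, season_vec)
loop_time["elapsed"]
vec_time["elapsed"]
```

The check with identical() is the important part: it confirms the vectorised lookup is a drop-in replacement before you trust it on the full data.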
I want to count the number of positive and negative returns for each month over a 10-year period and include them in dtf, because I want to calculate the probability of getting a positive or negative return. I'm not sure how to include the counting in my for loop.
For example: the average monthly return for July over 10 years is 2.18%, and 8 out of 10 returns are positive (80%), so the probability of a negative return is 20%. Since the 10-year average monthly return for July is positive (2.18%), I would like the probability of a positive return (80%) to be shown in dtf rather than the negative one (20%).
Another example: the same goes for May. Since the 10-year average monthly return for May is negative (-1.23%), with a 60% probability of a positive return (6 out of 10) and 40% for a negative one (4 out of 10), I would like the negative probability (40%) to be shown in dtf rather than the positive 60%.
The same applies to every month, so there will be a third column in dtf showing the probability of a positive or negative return.
I tried to put an if statement inside my for loop, but it doesn't work. My code is below; dtf currently has only two columns (Month and AverageMonthlyRet).
library(quantmod)
#obtain the historical stock price
prices <- getSymbols("^GSPC", src = 'yahoo', from = "2009-07-01", to = "2019-08-01",
periodicity = "monthly", auto.assign = FALSE, warnings = FALSE)[,4]
#calculate the log return and convert back to simple return
return <- diff(log(prices))
r <- na.omit(exp(return)-1)
monthlyRet <- as.numeric(r[,1])
#loop through all the months in 10 years
AverageMonthlyRet <- c()
for (j in 1:12){
Group <- c()
for (i in seq(j,length(monthlyRet),12)){
Group[i] <- monthlyRet[i]
}
AverageMonthlyRet[j] <- mean(Group, na.rm=TRUE)
}
AverageMonthlyRet <- round(AverageMonthlyRet,4)
#create a data frame to store the result
Month <- c("Aug","Sep","Oct","Nov","Dec","Jan","Feb","Mar","Apr","May","Jun","Jul")
dtf <- data.frame(Month, AverageMonthlyRet)
Here is a proposed solution, based on my understanding of your question. Comments are given in the code.
The main idea is to add some counter variables that you increment in the loop depending on the results of some if/else tests. They store the number of months with negative or positive income, from which the probability follows. Then with an additional test you determine whether you want to keep the positive or negative income probability.
There are ways to write it more concisely, but this longer version shows all the details and ideas. If you know that the income is never zero, you can compute just one counter, because one probability, e.g. positive, will always be 100 minus the other probability. In my version, zero income is possible, and thus positive probability + negative probability may be smaller than 100.
library(quantmod)
#obtain the historical stock price
prices <- getSymbols("^GSPC", src = 'yahoo', from = "2009-07-01", to = "2019-08-01",
periodicity = "monthly", auto.assign = FALSE, warnings = FALSE)[,4]
#calculate the log return and convert back to simple return
return <- diff(log(prices))
r <- na.omit(exp(return)-1)
monthlyRet <- as.numeric(r[,1])
#loop through all the months in 10 years
AverageMonthlyRet <- c()
#Added: Array to store the probability for each month
Probability <- c()
for (j in 1:12){
Group <- c()
#Added: Counters, for each month, of positive and negative income
count_pos=0
count_neg=0
for (i in seq(j,length(monthlyRet),12)){
Group[i] <- monthlyRet[i]
#Added: Increment the counters based on the sign of monthlyRet[i]
if(monthlyRet[i]>0){
count_pos <- count_pos+1
}
else if(monthlyRet[i]<0){
count_neg <- count_neg+1
}
}
AverageMonthlyRet[j] <- mean(Group, na.rm=TRUE)
#Added: Depending on whether the average monthly return is positive or negative,
#compute the probability of positive or negative income (in %)
prob=0
if(AverageMonthlyRet[j]>0)
{
prob=count_pos/(length(monthlyRet)/12)*100
}
else if (AverageMonthlyRet[j]<0){
prob=count_neg/(length(monthlyRet)/12)*100
}
#Added: Store the result
Probability[j] <- prob
}
AverageMonthlyRet <- round(AverageMonthlyRet,4)
#create a data frame to store the result
Month <- c("Aug","Sep","Oct","Nov","Dec","Jan","Feb","Mar","Apr","May","Jun","Jul")
#Added: Add the new probability column to the final data frame
dtf <- data.frame(Month, AverageMonthlyRet,Probability)
Using base R, we can count the proportion of sign for each month, merge by "Month" with dtf and select the appropriate value based on sign of AverageMonthlyRet.
tmp <- as.data.frame(do.call(rbind, tapply(sign(r), format(index(r), "%b"),
function(x) c(table(x)))))
tmp$Month <- rownames(tmp)
df1 <- merge(dtf, tmp)
df1$selected <- ifelse(sign(df1$AverageMonthlyRet) == 1, df1$`1`, df1$`-1`)
df1
# Month AverageMonthlyRet -1 1 selected
#1 Apr 0.0122 1 9 9
#2 Aug -0.0078 5 5 5
#3 Dec 0.0061 3 7 7
#4 Feb 0.0234 2 8 8
#5 Jan 0.0115 4 6 6
#6 Jul 0.0218 2 8 8
#7 Jun 0.0030 4 6 6
#8 Mar 0.0171 4 6 6
#9 May -0.0123 4 6 4
#10 Nov 0.0162 2 8 8
#11 Oct 0.0189 4 6 6
#12 Sep 0.0086 4 6 6
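If you would rather see the idea without downloading prices, here is a small self-contained sketch of the same selection logic. The returns are made up purely for illustration (two months, four years each), and the names returns, month, pos_pct and neg_pct are my own assumptions, not part of the answers above.

```r
# Hypothetical monthly simple returns: four years each for Jan and Feb.
returns <- c( 0.02,  0.01,  0.03, -0.01,   # Jan
             -0.05,  0.01,  0.02, -0.01)   # Feb
month   <- rep(c("Jan", "Feb"), each = 4)

# Per-month average return and share of positive / negative returns (in %).
avg_by_month <- tapply(returns, month, mean)
pos_pct <- as.vector(tapply(returns > 0, month, mean)) * 100
neg_pct <- as.vector(tapply(returns < 0, month, mean)) * 100
avg     <- as.vector(avg_by_month)

# Keep the positive share when the average is positive, the negative share
# when it is negative, and 0 when the average is exactly zero.
dtf <- data.frame(
  Month             = names(avg_by_month),
  AverageMonthlyRet = round(avg, 4),
  Probability       = ifelse(avg > 0, pos_pct,
                             ifelse(avg < 0, neg_pct, 0))
)
dtf
```

Note that tapply orders the groups alphabetically, so the rows come out as Feb then Jan; with real data you would probably want to reorder by calendar month afterwards.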
I have a very large set of data driven off of an id and a date. The dataset has several hundred million rows and about 10 million id's. I am running in a non-windows environment with ample RAM and multiple processors available. I am doing this in parallel. At the moment, I'm working with multidplyr, though am considering all options.
For illustration:
> df[1:11,]
id date gap episode
1 100000019 2015-01-24 0 1
2 100000019 2015-02-20 27 1
3 100000019 2015-03-31 39 2
4 100000019 2015-04-29 29 2
5 100000019 2015-05-27 28 2
6 100000019 2015-06-24 28 2
7 100000019 2015-07-24 30 2
8 100000019 2015-08-23 30 2
9 100000019 2015-09-21 29 2
10 100000019 2015-10-22 31 3
11 100000019 2015-12-30 69 4
The data is sorted before the function call. The order is important. For each id, after the first date, I need to determine the number of days between each subsequent date. I call this a gap. So, the first date for the id gets a gap of zero. The second date gets the value of the second date minus the date in the prior row. And so on.
I am splitting the data by id, then sending the data for each id to the following function.
assign_gap <- function(x) {
# x$gap <- NA
for(i in 1:nrow(x)) {
x[i, ]$gap <- ifelse(i == 1, 0, x[i,]$date - x[i-1, ]$date)
}
return(x)
}
cluster <- create_cluster(8)
cluster_assign_value(cluster, 'assign_gap', assign_gap)
system.time(df <- df %>% partition(id, cluster = cluster) %>% do(assign_gap(.)) %>% collect())
I then apply another function that groups the sequence of gaps across dates into "episodes" based on allowable_gap (I am using a value of 30). So, each id will potentially have multiple episodes assigned based on the date sequence and the gap.
assign_episode <- function(x, allowable_gap){
ep <- 1
for(i in 1:nrow(x)){
ifelse(x[i,]$gap <= allowable_gap, ep <- ep, ep <- ep + 1)
x[i, ]$episode <- ep
}
return(x)
}
cluster <- create_cluster(8)
cluster_assign_value(cluster, 'assign_episode', assign_episode)
cluster_assign_value(cluster, 'allowable_gap', allowable_gap)
system.time(df <- df %>% partition(id, cluster = cluster) %>% do(assign_episode(., allowable_gap)) %>% collect())
Given the amount of data I have, I'd really like to find a way to avoid these loops in the functions, which I expect will improve efficiency considerably. If anyone can think of an alternative that accomplishes the same thing, I would be grateful.
I would recommend using the data.table library. This library is extremely fast, particularly if one is working with large data sets like yours. Here is a partial solution, where I solve the first step of your question:
1. calculate gap between dates, making sure the first row of each id is 0
library(data.table)
setDT(df)
df[, gap := c(0L, diff(date)) , by = id ]
Even though this is not working in parallel, I would expect this code to be faster than the loop you're currently using.
2. Assign a group episode for consecutive observations when the gap is under 30 by id
I haven't found a solution for the second part of your question yet, but I would encourage others to complement this answer if they find a solution.
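One possible way to handle the second step, offered as a sketch rather than a tested solution for the full data: a new episode starts whenever the gap exceeds allowable_gap, so a running count of those rows within each id gives the episode number. The version below uses only base R (ave) so it runs without extra packages; the sample rows are copied from the question, and allowable_gap = 30 is the value the question mentions.

```r
# Sample rows from the question (one id, already sorted by date).
df <- data.frame(
  id   = 100000019,
  date = as.Date(c("2015-01-24", "2015-02-20", "2015-03-31", "2015-04-29",
                   "2015-05-27", "2015-06-24", "2015-07-24", "2015-08-23",
                   "2015-09-21", "2015-10-22", "2015-12-30"))
)
allowable_gap <- 30

# Step 1: days since the previous row within each id (0 for the first row).
df$gap <- ave(as.numeric(df$date), df$id,
              FUN = function(d) c(0, diff(d)))

# Step 2: count, within each id, how many rows so far had a gap above the
# allowed value; adding 1 makes the first episode number 1.
df$episode <- ave(as.numeric(df$gap > allowable_gap), df$id,
                  FUN = cumsum) + 1

df
```

On these rows this reproduces the gap and episode columns shown in the question. In data.table the same idea should translate to something like df[, episode := cumsum(gap > allowable_gap) + 1L, by = id], though I have not benchmarked it on data of your size.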
I am a beginner in R, and I would like to do a survival analysis on a dataset about light bulbs. I would like to calculate the lifetime of a light bulb, so I need to calculate the time period between date_broken in row 2 and date_fixed in row 1, for example.
I know I can use difftime(time1, time2, units = "days") to calculate the time between date_fixed and date_broken in the same row, but then I would calculate the time the light bulb was broken, and that is not what I am interested in.
I provided a small sample of my data below. For each light bulb on a particular location I have information about the date it broke and the day it was fixed.
(Besides the columns given in the example below, I have other features that should have predictive value.)
# date_broken date_fixed lightbulb location
# 1 26-2-2015 17-3-2015 1 A
# 2 19-3-2015 26-3-2015 1 A
# 3 26-3-2015 26-3-2015 1 A
# 4 17-4-2015 29-4-2015 2 B
# 5 19-6-2015 25-6-2015 2 B
# 6 9-7-2015 30-7-2015 2 B
ds <- data.frame( date_broken = c("26-2-2015", "19-3-2015",
"26-3-2015", "17-4-2015",
"19-6-2015", "9-7-2015"),
date_fixed = c("17-3-2015", "26-3-2015", "26-3-2015", "29-4-2015", "25-6-2015", "30-7-2015"),
lightbulb = c("1", "1", "1", "2", "2", "2"), location = c("A", "A", "A", "B", "B", "B"))
First you'll need to fix your dates, as @Gaurav suggested. Then, you'll need to group by lightbulb, or the difference will be meaningless.
I present here an alternative using packages lubridate and data.table:
library(lubridate)
library(data.table)
ds$date_broken <- dmy(ds$date_broken)
ds$date_fixed <- dmy(ds$date_fixed)
# time in service: current breakdown date minus the previous repair date, per bulb
setDT(ds)[, dt := difftime(date_broken, shift(date_fixed, 1L, type="lag"), units = "days"), by = lightbulb]
ds
Which produces:
## date_broken date_fixed lightbulb location dt
## 1: 2015-02-26 2015-03-17 1 A NA days
## 2: 2015-03-19 2015-03-26 1 A 2 days
## 3: 2015-03-26 2015-03-26 1 A 0 days
## 4: 2015-04-17 2015-04-29 2 B NA days
## 5: 2015-06-19 2015-06-25 2 B 51 days
## 6: 2015-07-09 2015-07-30 2 B 14 days
In future, it helps a lot if you include some expected results along with your question.
This should help
library(dplyr)
ds2 <- ds %>%
group_by(lightbulb) %>%
mutate(tp = as.Date(date_broken, "%d-%m-%Y") -
as.Date(lag(date_fixed,1), "%d-%m-%Y"))
I really love those super-impressive pipe operators in R. They're so elegant, and great if someone's got a ready-to-go solution.
I mostly do loops, probably 'cos I like to see what's going on, and I can debug as I go. (I was also brought up on BASIC some decades ago - but don't tell anyone.)
Anyway this was my approach for something very similar that I was doing, with hopefully a bit of added value using a sequence counter. This might be useful as a regression variable (covariate or stratification) or something by which you can subset, assuming you might for example want to look at later failures separately from earlier failures. Survival curves by sequence counter were quite informative in my work.
First convert the dates...
## convert dates. once done it's done
ds$date_broken <- as.Date(ds$date_broken, "%d-%m-%Y")
ds$date_fixed <- as.Date(ds$date_fixed, "%d-%m-%Y")
Add a sequence counter column (called seq) to keep track of number of failures
ds$seq <- 1
Populate that counter column
for (rdx in 2:nrow(ds)) {
## if same item, increment count. If new item, leave count at seq = 1
if (ds$lightbulb[rdx] == ds$lightbulb[rdx-1]) ds$seq[rdx] <- ds$seq[rdx-1]+1
}
Now add a difference column
ds$diff <- NA
Populate that difference column
for (rdx in 2:nrow(ds)) {
## if same item, difference is current failure date minus previous in-service date (in days)
if (ds$seq[rdx] != 1) ds$diff[rdx] <- as.numeric(ds$date_broken[rdx] - ds$date_fixed[rdx-1])
}
Well that worked for me, both to conceptualise and to implement. Please note that some folk do get a bit hung up with loops (http://paleocave.sciencesortof.com/2013/03/writing-a-for-loop-in-r/), but around my neck of the woods life is for living, not coding, and even I sometimes stir paint with a screwdriver (an old one though).
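For comparison, and only as a sketch, the same two columns (seq and diff) can be built without explicit loops in base R. It assumes the rows are already sorted by light bulb and then by date, as in the sample data; prev_fixed is a helper name I made up for illustration.

```r
# Sample data from the question, with dates already converted.
ds <- data.frame(
  date_broken = as.Date(c("26-2-2015", "19-3-2015", "26-3-2015",
                          "17-4-2015", "19-6-2015", "9-7-2015"), "%d-%m-%Y"),
  date_fixed  = as.Date(c("17-3-2015", "26-3-2015", "26-3-2015",
                          "29-4-2015", "25-6-2015", "30-7-2015"), "%d-%m-%Y"),
  lightbulb   = c("1", "1", "1", "2", "2", "2")
)

# Sequence counter: 1, 2, 3, ... restarting for each light bulb.
ds$seq <- ave(seq_along(ds$lightbulb), ds$lightbulb, FUN = seq_along)

# Previous row's repair date; the very first row has no predecessor.
prev_fixed <- c(as.Date(NA), head(ds$date_fixed, -1))

# Days in service: current failure date minus previous repair date,
# set to NA on the first row of each bulb.
ds$diff <- ifelse(ds$seq == 1, NA, as.numeric(ds$date_broken - prev_fixed))

ds
```

The seq column can then be used exactly as described above, for example as a covariate or to subset early versus late failures.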
I have a dataset with dates occurring randomly. For example:
10/21/15, 11/21/15, 11/22/15, 11/28/15,11/30/15, 12/12/15...etc
I am looking to create a rolling average by time period, NOT at the observation level. For instance, if I wanted a moving average of the last 7 days, I would not want to look at the last 7 rows, but rather the last 7 days.
For a tiny example:
dates = c('2015-08-07', '2015-08-08','2015-08-09','2015-09-09','2015-10-10')
value = c(5,10,5,3,2)
df=data.frame(dates, value)
df$desired = c(NA,5,7.5, NA,NA)
I am obviously looking to do this for much larger dataset, but I hope you get the idea. If I was to use 7 days for example this is the result I would expect.
Notice that I don't include the current observations value into the rolling average, only the previous. I want rolling average by time period, not observation row number.
I tried looking at rollmean and dplyr but I couldn't figure it out. I don't really care how it happens though.
Thanks!
Try this:
rollavgbyperiod <- function(i,window){
startdate <- dates[i]-window
enddate <- dates[i]-1
interval <- seq(startdate,enddate,1)
tmp <- value[dates %in% interval]
return(mean(tmp))
}
dates <- as.Date(dates)
window <- 7
res <- sapply(1:length(dates),function(m) rollavgbyperiod(m,window))
res[is.nan(res)] <- NA
> data.frame(dates,value,res)
dates value res
1 2015-08-07 5 NA
2 2015-08-08 10 5.0
3 2015-08-09 5 7.5
4 2015-09-09 3 NA
5 2015-10-10 2 NA
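The function above relies on dates and value existing as global variables. A variant that takes them as arguments might look like the sketch below; roll_mean_by_days is a name I chose for illustration, and the window covers the `window` days before each date, excluding the date itself, as the question asks.

```r
# Mean of `value` over the `window` days before each date (current day excluded).
roll_mean_by_days <- function(dates, value, window) {
  dates <- as.Date(dates)
  sapply(seq_along(dates), function(i) {
    # Rows whose date falls in [dates[i] - window, dates[i] - 1].
    in_window <- dates >= dates[i] - window & dates < dates[i]
    if (any(in_window)) mean(value[in_window]) else NA_real_
  })
}

dates <- c('2015-08-07', '2015-08-08', '2015-08-09', '2015-09-09', '2015-10-10')
value <- c(5, 10, 5, 3, 2)
res <- roll_mean_by_days(dates, value, window = 7)
res
# [1]  NA 5.0 7.5  NA  NA
```

Each call scans the whole date vector, so this is quadratic in the number of rows; for a much larger dataset a dedicated package is likely the better choice.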
I suggest using runner package in this case. What is needed here is mean_run with k = 7 window, lagged by 1 period. Simple one-liner:
library(runner)
dates = c('2015-08-07', '2015-08-08','2015-08-09','2015-09-09','2015-10-10')
value = c(5, 10, 5, 3, 2)
mean_run(x = value, k = 7, lag = 1, idx = as.Date(dates))
#[1] NA 5.0 7.5 NA NA
Check package and function documentation
I hope you're doing well. I am working on an assignment related to data preprocessing and I need some help in R.
I have a column of days in which there are 711 unique values. In total I have 2 million observations. The data has been collected over 2 years and each value corresponds to one day of the week.
For example, day 1 is a Monday, day 8 is a Monday as well, day 15 is a Monday, and so on.
Could someone help me replace these with 1 to 7? If day 1 is Monday, I want the cells containing 8 to be replaced by 1, 15 by 1, and so on.
I hope this makes sense.
thank you for your help.
Regards
A
Following the comments (since I can't comment), try this:
# An example data.frame
mydata <- data.frame(DAY= 1:21, ABC= letters[1:21])
mydata
# Do "mod 7" with variable DAY, so DAY have now values from 0 to 6,
# Then assign back to variable DAY
mydata$DAY <- mydata$DAY %% 7
mydata
# Replace 0 for 7 in DAY variable
mydata$DAY <- ifelse(mydata$DAY == 0, 7, mydata$DAY)
mydata
# Save final data.frame
write.csv(mydata, file='mydata.csv')
Rather than issue 7 separate commands (one for each day) you can use dplyr:
require(dplyr)
d <- data.frame(day = 1:711)
mutate(d, day = (day - 1) %% 7 + 1)
What we're doing here is subtracting 1 from the day number, finding its remainder when divided by 7 (a value from 0 to 6), and then adding 1 back so the result runs from 1 to 7, with days 1, 8, 15, ... all mapping to 1.
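A quick way to check the arithmetic on a few day numbers, and to turn the result into weekday labels if that is useful. This is just a sketch: it assumes day 1 is a Monday, as in the question, and weekday_names is a lookup vector I made up for illustration. Subtracting 1 before taking the remainder is what keeps day 1 mapped to 1.

```r
# A few day numbers, including the first and last of the 711 days.
day <- c(1, 7, 8, 14, 15, 711)

# Map to 1..7 so that days 1, 8, 15, ... all become 1.
weekday_num <- (day - 1) %% 7 + 1
weekday_num
# [1] 1 7 1 7 1 4

# Optional: look up a label for each weekday number (assumes day 1 is Monday).
weekday_names <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
weekday_names[weekday_num]
# [1] "Mon" "Sun" "Mon" "Sun" "Mon" "Thu"
```

Spot-checking the boundary values (7, 8, 14, 15) is the easiest way to catch an off-by-one before applying the formula to all 2 million rows.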