I am trying to create a vector (dateVec) that contains each date in the column Date, extended by the number of days in the column Days. I cannot understand why the code I wrote is not working. The dates are stored in Date format.
> for ( i in mydata[,1] ) {
> dateVec = mydata [,1] + 0 : mydata [,2] }
The data has many more rows; here is a sample:
Date (mydata[,1]) -- Days (mydata[,2])
10/05/2017 ---------- 3
05/05/2017 ---------- 2
The result that I would expect for dateVec would be:
(10/05/2017, 11/05/2017, 12/05/2017, 13/05/2017, 05/05/2017, 06/05/2017, 07/05/2017, ...)
There are a few reasons why your code isn't working.
For loop: for (i in mydata[,1]) iterates over the date values themselves rather than over row positions, and i is never used inside the loop body. A more useful way is to use seq_along to generate a sequence of integers from 1 to the length of the object passed to seq_along, and then index each row with i.
dateVec is not indexed, so you are overwriting dateVec on each iteration of your loop.
Variable length of days: for the first date you are generating a sequence that runs 3 days past the start, and for the second date 2 days. You will need a data structure that can handle variable-length elements, such as a list.
To modify your existing code:
mydata <- data.frame(Date = as.Date(c("10/05/2017", "05/05/2017"),
format = "%d/%m/%Y"), Days = c(3, 2))
dateVec <- list()
for (i in seq_along(mydata[, 1])) {
dateVec[[i]] = mydata [i, 1] + 0 : mydata [i, 2]
}
res <- do.call("c", dateVec)
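A more idiomatic R approach is to pass the starting date and the number of days in parallel using mapply to return a list, which is then concatenated into a vector of dates:
# length.out = y + 1 includes the start date, matching 0:Days in the loop above;
# SIMPLIFY = FALSE always returns a list, even when every row has the same Days
res <- do.call("c", mapply(function(x, y) seq(from = x, length.out = y + 1,
by = "1 day"), x = mydata[["Date"]], y = mydata[["Days"]], SIMPLIFY = FALSE))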
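The same expansion can also be done without any loop or apply call. The following is a minimal sketch, assuming (as in the expected output) that each start date is followed by Days further dates; the names n_dates and offsets are introduced here for illustration only.

```r
mydata <- data.frame(
  Date = as.Date(c("10/05/2017", "05/05/2017"), format = "%d/%m/%Y"),
  Days = c(3, 2)
)

# Number of dates per row: the start date plus Days following days
n_dates <- mydata$Days + 1

# Offsets within each row: 0 1 2 3 for the first row, 0 1 2 for the second
offsets <- sequence(n_dates) - 1

# Repeat each start date as often as needed and add the offsets
dateVec <- rep(mydata$Date, times = n_dates) + offsets
print(format(dateVec, "%d/%m/%Y"))
```

Because rep() and + both keep the Date class, no conversion back to Date is needed at the end.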
Here's a clunky solution:
library("lubridate")
mydata = data.frame(Date = dmy(c("10/05/2017", "05/05/2017")),
Days = c(3,2))
dateVec = dmy(character())
for(i in seq_along(mydata$Date)){
dateVec = c(dateVec,mydata$Date[i])
for(j in seq_len(mydata$Days[i])){
dateVec = c(dateVec, mydata$Date[i]+j)
}
}
Note that this uses the lubridate package and doesn't format the dates quite how you did. I also found it interesting that I had to initialize dateVec as a date object. Initially I tried dateVec = c() but R tried to coerce to numeric.
I have a problem subtracting two values from a data frame inside an apply function. I took a stock time series and am trying to subtract the value at time i-1 from the value at time i:
library(data.table)
do_apply <- rep(NA,nrow(DATA))
do_apply$close <- DATA$Zamkniecie
funkcja <- function(close){
close_lag_1 <- shift(x=close,
n = 1,
type = 'lag',
fill=0)
close <- close - close_lag_1
return(zamkniecie)
}
sapply(do_apply$close, funkcja)
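One possible reading of the intent above is a lagged difference over the whole column. A lag is a whole-vector operation, so it can be computed on the column once instead of through sapply, which hands the function one value at a time. Below is a minimal base-R sketch with made-up prices; the vector close and its values are assumptions for illustration, not the asker's data.

```r
# Hypothetical closing prices standing in for DATA$Zamkniecie
close <- c(100, 102, 101, 105)

# Lag by one step, padding the first position with 0
# (the same idea as shift(close, n = 1, type = "lag", fill = 0) in data.table)
close_lag_1 <- c(0, head(close, -1))

# Value at time i minus value at time i - 1, for all rows at once
close_diff <- close - close_lag_1
print(close_diff)
```

With a fill of 0 the first element is just the first price; base R's diff(close) gives the same differences without that first element.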
My dataset is called data and it has a column called time that contains times in mm:ss format. I also wrote a function functime(var1,var2).
Ultimately I would like to use apply or vapply with var2 set to a constant (let's say var2 = 6) and var1 taking each value of the column data$time.
Something like:
If
data$time <- c("10:10","11:00", "09:30"), when I do vapply(), I would like to get a
data$output <-c(functime(data$time[1],6),functime(data$time[2],6),functime(data$time[3],6))
which in this example is the same as
data$output <- c(functime("10:10",6),functime("11:00",6),functime("09:30",6))
My attempt at that is something like:
vapply(data$time,functime,var2 = 6,FUN.VALUE = 1)
The documentation for vapply gives the usage as vapply(X, FUN, FUN.VALUE, ...).
I am confused about how to tell vapply that its first argument should be every row of the data$time column, while the second argument is fixed at a value I define (6).
Ultimately I would like to add data$output to the original dataset using mutate.
Edit: (Include lines of data and function)
data$id <- c(9,6,5763,4)
data$time <- c("5:06","5:06","5:11","5:08")
data$city <-c("Kyle","Oklahoma","Monterey","Austin")
The function is:
calctime <- function(racePace, raceDistance){
# racePace is the per unit pace in mm:ss - character
# raceDistance is the total race distance - numeric
# Pace and race distance must use same units (km or mi or whatever)
# Seconds to character time function
CharMinSec <- function(sec){
outMin <- floor(sec/60)
outSec <- ((sec/60)-outMin)*60
if(outSec==0 | round(outSec)<10){
outChar <- paste0(outMin,":0",round(outSec))
} else {
outChar <- paste(outMin,round(outSec),sep=":")
}
outChar
}
paceMinSec <- as.numeric(strsplit(racePace,':')[[1]])
paceSec <- paceMinSec[1]*60+ paceMinSec[2]
raceMin <- floor(paceSec*raceDistance/60)
raceSec <- ((paceSec*raceDistance/60)-raceMin)*60
raceTime <- CharMinSec(raceMin*60+raceSec)
list(Seconds=raceSec)
}
# Example of 4:15/km for a half-marathon
calctime("4:15",21.097494)
calcTime <- function(pace,distance){
return (lubridate::period_to_seconds(lubridate::ms(pace)) * distance)
}
pace <- c("10:10","11:00", "09:30")
vapply(pace,calcTime,6,FUN.VALUE = 1)
## 10:10 11:00 09:30
##  3660  3960  3420
d <- tibble::as_tibble(list(pace = pace))
dplyr::mutate(d, raceSeconds = calcTime(pace,6))
## # A tibble: 3 x 2
##   pace  raceSeconds
##   <chr>       <dbl>
## 1 10:10        3660
## 2 11:00        3960
## 3 09:30        3420
I had to change two things, but your vapply call was right.
In the function, I changed the last line so that it returns a single numeric value instead of a list containing one value:
calctime <- function(racePace, raceDistance){
# racePace is the per unit pace in mm:ss - character
# raceDistance is the total race distance - numeric
# Pace and race distance must use same units (km or mi or whatever)
# Seconds to character time function
CharMinSec <- function(sec){
outMin <- floor(sec/60)
outSec <- ((sec/60)-outMin)*60
if(outSec==0 | round(outSec)<10){
outChar <- paste0(outMin,":0",round(outSec))
} else {
outChar <- paste(outMin,round(outSec),sep=":")
}
outChar
}
paceMinSec <- as.numeric(strsplit(racePace,':')[[1]])
paceSec <- paceMinSec[1]*60+ paceMinSec[2]
raceMin <- floor(paceSec*raceDistance/60)
raceSec <- ((paceSec*raceDistance/60)-raceMin)*60
raceTime <- CharMinSec(raceMin*60+raceSec)
raceSec
}
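Now that the function returns a plain numeric value, the vapply() call works, but in my case I also had to force the time column to be character (presumably because data.frame() had converted the strings to a factor, which was the default before R 4.0):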
data = data.frame(
id = c(9,6,5763,4),
time = c("5:06","5:06","5:11","5:08"),
city = c("Kyle","Oklahoma","Monterey","Austin")
)
data$time = as.character(data$time)
data$output = vapply(data$time,calctime,raceDistance = 6,FUN.VALUE=1) #works fine
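Note that calctime as written returns raceSec, which is only the seconds left over after the whole minutes. If the total time in seconds is what is wanted, a simpler helper may be enough. The sketch below uses a hypothetical function pace_to_total_seconds (not part of the original post) and shows how the fixed second argument travels through vapply's ... by name.

```r
# Hypothetical helper: total race time in seconds for a "mm:ss" pace
pace_to_total_seconds <- function(pace, distance) {
  parts <- as.numeric(strsplit(pace, ":", fixed = TRUE)[[1]])
  (parts[1] * 60 + parts[2]) * distance
}

data <- data.frame(
  id   = c(9, 6, 5763, 4),
  time = c("5:06", "5:06", "5:11", "5:08"),
  stringsAsFactors = FALSE   # keep time as character on any R version
)

# X supplies the first argument one element at a time;
# distance = 6 is passed unchanged to every call
data$output <- vapply(
  data$time,
  pace_to_total_seconds,
  FUN.VALUE = numeric(1),
  distance  = 6,
  USE.NAMES = FALSE
)
print(data)
```

Using FUN.VALUE = numeric(1) states the expected return type explicitly, so vapply stops with an error if the function ever returns something else, such as a list.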
I am looking to loop over a vector of dates, use each date as a subsetting criterion and carry out calculations. For simplicity's sake we will assume the calculation is a count of rows.
The problem is that R treats the dates as 5-digit numbers inside the loop, even though the vector was coerced with as.Date. As a result the loop creates a list of length 17,896, although my loop vector contains only 12 dates.
I very much look forward to any suggestions. Thank you.
# last date of each month in 2018 (tapply takes the max within each month)
dates_2018 = seq(as.Date("2018-1-1"), as.Date("2018-12-31"), "days")
loop_date = as.Date(as.vector(tapply(dates_2018, substr(dates_2018, 1, 7), max), mode="any"), origin = "1970-01-01")
# dummy df
df = data.frame(id = 1:length(dates_2018)
,dates_2018)
# count number of days satisfy criteria
y = list()
for (i in loop_date)
{
y[[i]] = nrow(df[df$dates_2018 >= i, ])
}; y
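You can write y[[i]] = nrow(df[df$dates_2018 >= as.Date(i, origin = "1970-01-01"), ]) and retrieve a result with y[[17562]], but the results will sit in a list with 17,896 elements, because for strips the Date class and i becomes the underlying day count. A more appropriate approach is to loop over the positions of loop_date: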
for (i in seq_along(loop_date))
{
y[[i]] = nrow(df[df$dates_2018 >= loop_date[i], ])
}
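The class loss can be checked directly. The following is a small self-contained sketch with three illustrative month-end dates (the values and the names all_days and counts are assumptions for the example); it also shows vapply as an alternative that returns a plain integer vector instead of a list.

```r
# Month-end dates for the first quarter of 2018 (illustrative values)
loop_date <- as.Date(c("2018-01-31", "2018-02-28", "2018-03-31"))
all_days  <- seq(as.Date("2018-01-01"), as.Date("2018-12-31"), by = "day")

# Iterating over the vector directly drops the Date class
for (d in loop_date) print(class(d))   # prints "numeric" each time

# Iterating over positions keeps it, because loop_date[k] is still a Date
counts <- vapply(
  seq_along(loop_date),
  function(k) sum(all_days >= loop_date[k]),
  integer(1)
)
names(counts) <- format(loop_date)
print(counts)
```

The result has one element per date, so there is no sparse list indexed by day numbers to clean up afterwards.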
for(t in 1921:2017) {
nam <- paste("", t, sep = "")
assign(nam, window(aCPI_gr, start=c(t,1), end=c(t,12)))
}
aCPI_gr_y <- cbind(`1921`: `2017`) #doesn't work
This loop generates a vector of monthly CPI data for each year. Now I would like to pack all of them into a data frame with cbind, but I am of course too lazy to type every year vector by hand in the cbind call. Is there an easy way to avoid that? Something like cbind(1921:2017).
1) matrix If you have a monthly unidimensional ts series spanning 97 years, such as the test series tser below, then this will convert it to a 12 x 97 matrix with one year per column. The dimnames argument can be omitted if you don't need the names.
tser <- ts(seq_len(12 * 97), start = 1921, freq = 12) # test data
m <- matrix(tser, 12, dimnames = list(month = 1:12, year = 1921:2017))
2) tapply An alternative is:
tapply(tser, list(month = cycle(tser), year = as.integer(time(tser))), c)
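As a quick sanity check on the matrix approach, one column can be compared with the window() call used in the question's loop. This is a sketch on the same kind of test series; the year 1950 is an arbitrary choice and one_year and same are names introduced for the example.

```r
# Test series: 97 years of monthly values, as in the example above
tser <- ts(seq_len(12 * 97), start = 1921, frequency = 12)

# One column per year, one row per month
m <- matrix(tser, nrow = 12, dimnames = list(month = 1:12, year = 1921:2017))

# The column for a given year should equal the windowed series for that year
one_year <- window(tser, start = c(1950, 1), end = c(1950, 12))
same <- isTRUE(all.equal(unname(m[, "1950"]), as.numeric(one_year)))
print(same)
print(dim(m))
```

This relies on the series starting in January and covering only complete years; a series with a partial first or last year would need padding before being reshaped.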
In trying to avoid using the for loop in R, I wrote a function that returns an average value from one data frame given row-specific values from another data frame. I then pass this function to sapply over the range of row numbers. My function works, but it returns ~ 2.5 results per second, which is not much better than using a for loop. So, I feel like I've not fully exploited the vectorized aspects of the apply family of functions. Can anyone help me rethink my approach? Here is a minimally working example. Thanks in advance.
#Creating first dataframe
dates<-seq(as.Date("2013-01-01"), as.Date("2016-07-01"), by = 1)
n<-length(seq(as.Date("2013-01-01"), as.Date("2016-07-01"), by = 1))
df1<-data.frame(date = dates,
hour = sample(1:24, n,replace = T),
cat = sample(c("a", "b"), n, replace = T),
lag = sample(1:24, n, replace = T))
#Creating second dataframe
df2<-data.frame(date = sort(rep(dates, 24)),
hour = rep(1:24, length(dates)),
p = runif(length(rep(dates, 24)), min = -20, max = 100))
df2<-df2[order(df2$date, df2$hour),]
df2$cat<-"a"
temp<-df2
temp$cat<-"b"
df2<-rbind(df2,temp)
#function
period_mean<-function(x){
tmp<-df2[df2$cat == df1[x,]$cat,]
#This line extracts the row name index from tmp,
#in which the two dataframes match on date and hour
he_i<-which(tmp$date == df1[x,]$date & tmp$hour == df1[x,]$hour)
#My lagged period is given by the variable "lag". I want the average
#over the period from (hour - lag) to hour. Since df2 is sorted so that hours
#are consecutive, this method requires that I subset on only the
#relevant value for cat (hence the creation of tmp in the first line
#of the function).
p<-mean(tmp[(he_i - df1[x,]$lag):he_i,]$p)
print(x)
print(p)
return(p)
}
#Execute function
out<-sapply(1:length(row.names(df1)), period_mean)
EDIT I have subsequently learned that part of the reason my original problem was iterating so slowly is that my data classes between the two dataframes were not the same. df1$date was a date field, while df2$date was a character field. Of course, this wasn't apparent with the example I posted because the data types were the same by construction. Hope this helps.
Here's one suggestion:
getIdx <- function(i) {
date <- df1$date[i]
hour <- df1$hour[i]
cat <- df1$cat[i]
which(df2$date==date & df2$hour==hour & df2$cat==cat)
}
v_getIdx <- Vectorize(getIdx)
df1$index <- v_getIdx(1:nrow(df1))
b_start <- match("b", df2$cat)
out2 <- apply(df1[,c("cat","lag","index")], MAR=1, function(x) {
flr <- ifelse(x[1]=="a", 1, b_start)
x <- as.numeric(x[2:3])
mean(df2$p[max(flr, (x[2]-x[1])):x[2]])
})
We make a function (getIdx) that retrieves the row of df2 matching the date, hour and cat of a given row of df1, and then Vectorize it.
We then run the vectorized function to get a vector of row indices into df2. We set b_start to be the row where the "b" category starts.
We then iterate through the rows of df1 with apply. In the mean(...) function, we set the "floor" to be either row 1 (if cat=="a") or b_start (if cat=="b"), which eliminates the need to subset (what you were doing with tmp).
Performance:
> system.time(out<-sapply(1:length(row.names(df1)), period_mean))
user system elapsed
11.304 0.393 11.917
> system.time({
+ df1$index <- v_getIdx(1:nrow(df1))
+ b_start <- match("b", df2$cat)
+ out2 <- apply(df1[,c("cat","lag","index")], MAR=1, function(x) {
+ flr <- ifelse(x[1]=="a", 1, b_start)
+ x <- as.numeric(x[2:3])
+ mean(df2$p[max(flr, (x[2]-x[1])):x[2]])
+ })
+ })
user system elapsed
2.839 0.405 3.274
> all.equal(out, out2)
[1] TRUE
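If this is still too slow, the per-row work can probably be removed altogether. The sketch below is one possible fully vectorised variant, not the method from the answer above: it finds the matching rows with a single match() on a pasted key and gets each range mean from cumulative sums. It uses a small synthetic data set with the same layout, and the names idx, floor_row, start and cs are introduced here for illustration.

```r
set.seed(1)
days <- seq(as.Date("2013-01-01"), by = "day", length.out = 5)

# Small stand-in for df2: hourly values, all of "a" first, then all of "b"
df2 <- data.frame(
  date = rep(rep(days, each = 24), times = 2),
  hour = rep(1:24, times = 2 * length(days)),
  cat  = rep(c("a", "b"), each = 24 * length(days)),
  stringsAsFactors = FALSE
)
df2$p <- runif(nrow(df2), min = -20, max = 100)

# Small stand-in for df1: one lookup per row
n <- 8
df1 <- data.frame(
  date = sample(days, n, replace = TRUE),
  hour = sample(1:24, n, replace = TRUE),
  cat  = sample(c("a", "b"), n, replace = TRUE),
  lag  = sample(1:24, n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Row of df2 matching each row of df1, found in one vectorised lookup
idx <- match(paste(df1$cat, df1$date, df1$hour),
             paste(df2$cat, df2$date, df2$hour))

# First row of the matching category, so a window never crosses categories
floor_row <- match(df1$cat, df2$cat)
start <- pmax(floor_row, idx - df1$lag)

# Prefix sums: sum(p[start:idx]) equals cs[idx + 1] - cs[start]
cs <- c(0, cumsum(df2$p))
out_vec <- (cs[idx + 1] - cs[start]) / (idx - start + 1)

# Reference computed row by row, for comparison only
out_loop <- vapply(seq_len(n),
                   function(k) mean(df2$p[start[k]:idx[k]]),
                   numeric(1))
print(all.equal(out_vec, out_loop))
```

The pasted key only matches when date has the same class and format in both data frames, which is the same pitfall described in the EDIT to the question.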