R: Loop requires long run time, suggestions on better structure

I currently have a data set with quarterly returns for 10 indices. My dataset (compoundrates) is structured so that the first column is "Scenario", the second column is "Quarter", and the following 10 are the quarterly index returns. The projection is 50 quarters, so lines 1-51 reflect quarters 0-50 for scenario 1, lines 52-102 reflect quarters 0-50 for scenario 2, and so on for 1000 scenarios.
To calculate cumulative compound rates, I need to multiply the current quarter's return by all of the previous returns in the projection. I set up a loop to do this in the code below:
for(i in 1:nrow(compoundrates)){
  if(compoundrates[i, "Quarter"] == 0){
    compoundrates[i, -c(1:2)] <- 1
  } else{
    compoundrates[i, -c(1:2)] <- compoundrates[i, -c(1:2)] * compoundrates[i - 1, -c(1:2)]
  }
}
The loop is simple and works how I want. However, with 51,000 rows, this takes about 13 minutes. Is there a way to speed up the code? I tried to think of a vectorized solution, but could only conclude that I would need to loop through all rows of the dataset. While 13 minutes is not the end of the world, I have other datasets with longer projections, up to 200 quarters, which would take an extremely long time.
Pivoting the dataset to a wide (horizontal) layout might require only 50 loop iterations rather than 51,000, but I thought I'd see if anyone else had a more elegant solution.
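As a rough sketch of that pivot idea (not tested against the full data; it assumes tidyr/dplyr are available and that the quarter columns can be addressed by name after reshaping):
library(dplyr)
library(tidyr)
# Reshape to one row per Scenario x Index, with quarters 0-50 as columns
wide <- compoundrates %>%
  ungroup() %>%
  pivot_longer(-c(Scenario, Quarter), names_to = "Index", values_to = "Return") %>%
  pivot_wider(names_from = Quarter, values_from = Return)
# Now only 50 iterations: multiply each quarter's column by the previous one
for (q in 1:50) {
  wide[[as.character(q)]] <- wide[[as.character(q)]] * wide[[as.character(q - 1)]]
}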
Edit: Included here is a sample of the first couple of lines of my dataset:
> dput(head(compoundrates[, 1:4])) # First part of data, only 2 indices
structure(list(Scenario = c(1L, 1L, 1L, 1L, 1L, 1L), Quarter = c(0,
1, 2, 3, 4, 5), US = c(1, 1.06658609144463, 1.1022314574062,
1.1683883540847, 1.29134306037902, 1.28907212981088), MidCap = c(1,
1.10361590084936, 1.12966579678275, 1.21702573464001, 1.2674372889915,
1.37286942499386)), row.names = c(NA, -6L), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), groups = structure(list(Scenario = 1L,
.rows = list(1:6)), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE))

Try this out; it uses vectorized functions to do basically exactly what you are trying to do with the for loop. It creates new columns so you can see what is going on. Vectorized functions usually run a lot faster than for loops.
library(tidyverse)
compoundrates %>%
  group_by(Scenario) %>%
  arrange(Quarter) %>%
  mutate(US_lag = lag(US),
         MidCap_lag = lag(MidCap),
         US_cum = US * US_lag,
         MidCap_cum = MidCap * MidCap_lag) %>%
  mutate_all(~ifelse(is.na(.), 1, .))
This should do the cumulative product you were asking for:
compoundrates %>%
  group_by(Scenario) %>%
  arrange(Quarter) %>%
  mutate(US_cum = cumprod(US),
         MidCap_cum = cumprod(MidCap))
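Since the real data has 10 index columns, a variant of the same idea (assuming dplyr 1.0+ for across(); note it overwrites the return columns with their cumulative products instead of adding _cum columns) would be:
library(dplyr)
compoundrates %>%
  group_by(Scenario) %>%
  arrange(Quarter, .by_group = TRUE) %>%
  mutate(across(-Quarter, cumprod))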
Per @carl-witthoft's suggestion, here is the benchmarking. I made a data frame of 60,000 rows from the 6 quarters in the OP.
Unit: milliseconds
 expr: big_data %>% group_by(Scenario) %>% mutate(US_cum = cumprod(US), MidCap_cum = cumprod(MidCap))
     min      lq     mean   median      uq     max neval
 63.1487 70.0257 77.16906 73.72995 79.7645 147.167   100
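For reference, a sketch of how a benchmark like this can be produced; big_data here is a hypothetical 60,000-row blow-up of the 6 sample rows, since the exact construction wasn't shown:
library(dplyr)
library(microbenchmark)
# Hypothetical large data set: the 6 sample rows repeated 10,000 times
big_data <- bind_rows(replicate(10000, compoundrates, simplify = FALSE))
microbenchmark(
  big_data %>%
    group_by(Scenario) %>%
    mutate(US_cum = cumprod(US),
           MidCap_cum = cumprod(MidCap)),
  times = 100)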

Related

R data.table Perform Aggregate On Select Rows Using Full Dataset

I'm calculating features for a machine learning algorithm in R using data.table. I'm only going to be making predictions on rows that have a future date, but to calculate each feature I need to aggregate over a large dataset that can have millions of rows. However, to improve processing speed and performance, I don't need the aggregate to be computed for rows with a past date.
In plain terms, I'm trying to calculate the feature for only the last n rows while still using the entire dataset, skipping rows where the Date is in the past. I have a user-defined function that counts the number of rows above the current row being processed in the loop. My example data.table below shows the outcome I'm trying to achieve: the row being processed counts the number of rows above it and then the loop moves to the next row. I want it to skip all rows with a past date and only calculate rows with a future date. The current date in this example is 2019-03-20.
Group       Date Appt Sum
    A 2019-03-18    1  NA
    A 2019-03-19    1  NA
    A 2019-03-20    1  NA
    A 2019-03-21    1   3
    A 2019-03-22    1   4
    A 2019-03-23    1   5
library(data.table)
dt = structure(list(Group = c("A", "A", "A", "A", "A", "A"), Date = structure(c(17973,
17974, 17975, 17976, 17977, 17978), class = "Date"), Appt = c(1L,
1L, 1L, 1L, 1L, 1L), Sum = c(NA, NA, NA, 3, 4, 5)), row.names = c(NA,
-6L), class = "data.frame")
setDT(dt)
This is the function and code I'm currently using, and it works perfectly. The only problem is that it performs the calculation on every row, even though I only need the results for a few rows at the end of a dataset that can have millions of rows. It wastes a ton of processing power and time making calculations that will be excluded from the prediction set.
library(zoo)  # rollingSum() relies on zoo() and rollapplyr()

rollingSum <- function(i, data, count, dates) {
  z <- with(data[i, ], zoo(count, dates))
  g <- zoo(, seq(start(z), end(z), by = "day"))
  m <- merge(z, g)
  window(rollapplyr(m, 365, sum, na.rm = TRUE, partial = TRUE), time(z))
}
dt[, Sum := as.numeric(rollingSum(data = dt, count = Appt, dates = Date) - Appt), by = Group]
Here are some more details on my comment above:
library(microbenchmark)

dt <- data.table(
  Group = "A",
  Date = as.IDate(c("2019-03-18", "2019-03-19", "2019-03-20",
                    "2019-03-21", "2019-03-22", "2019-03-23")),
  Appt = 1)
microbenchmark(
  dt[, Sum := as.numeric(rollingSum(data = dt, count = Appt, dates = Date) - Appt), by = Group],
  dt[, Sum2 := ifelse(Date > as.IDate("2019-03-20"), (1:.N) - Appt, as.numeric(NA)), by = Group],
  unit = "ms")
Here are the benchmarks ("rollingSum" is the original approach, "new func" is the Sum2 expression above):
Unit: milliseconds
       expr      min        lq      mean   median       uq         max neval
 rollingSum 3.463955 4.0644910 18.748804 4.353562 4.745325 1395.840823   100
   new func 0.768079 0.8757095  1.258782 1.015766 1.140316    8.275985   100

Why am I having Issues with Separating Rows in a Dataframe?

I'm having an issue with separating rows in a dataframe that I'm working with.
In my dataframe, there's a column called officialIndices that I want to separate the rows by. This column stores a list of numbers that act as indexes indicating which rows have the same data. For example, indices 2:3 mean that rows 2 and 3 have the same data.
Here is the code that I am working with.
library(jsonlite)
library(tidyr)

offices_list <- data_google$offices
offices_JSON <- toJSON(offices_list)
offices_from_JSON <-
  separate_rows(fromJSON(offices_JSON), officialIndices, convert = TRUE)
This is what my offices_list frame looks like
This is what it looks like after I try to separate the rows
My code works fine on indices like 2:3, since there is a difference of 1. However, on indices like 7:10, it separates the rows as 7 and 10 instead of 7, 8, 9, 10, which is how I want it to be done. How would I get my code to separate the rows like this?
Output of dput(head(offices_list))
structure(list(position = c("President of the United States",
"Vice-President of the United States", "United States Senate",
"Governor", "Mayor", "Auditor"), divisionId = c("ocd-division/country:us",
"ocd-division/country:us", "ocd-division/country:us/state:or",
"ocd-division/country:us/state:or", "ocd-division/country:us/state:or/place:portland",
"ocd-division/country:us/state:or/place:portland"), levels = list(
"country", "country", "country", "administrativeArea1", NULL,
NULL), roles = list(c("headOfState", "headOfGovernment"),
"deputyHeadOfGovernment", "legislatorUpperBody", "headOfGovernment",
NULL, NULL), officialIndices = list(0L, 1L, 2:3, 4L, 5L,
6L)), row.names = c(NA, 6L), class = "data.frame")
This should work. I expect it will work for further rows too, since I tested for ranges greater than two in officialIndices.
First I extracted the start and end rows, and used their difference to determine how many rows are needed. Then tidyr::uncount() will add that many copies.
library(dplyr); library(tidyr)
data_sep <- data %>%
  separate(officialIndices, into = c("start", "end"), sep = ":") %>%
  # Use 1 row, and more if "end" is defined and larger than "start"
  mutate(rows = 1 + if_else(is.na(end), 0, as.numeric(end) - as.numeric(start))) %>%
  uncount(rows)
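A quick usage sketch with made-up values (it assumes officialIndices has already been flattened to text such as "7:10"; fill = "right" just silences the warning for single indices):
library(dplyr); library(tidyr)
data <- tibble(
  position = c("Mayor", "Auditor"),
  officialIndices = c("5", "7:10"))
data %>%
  separate(officialIndices, into = c("start", "end"), sep = ":", fill = "right") %>%
  mutate(rows = 1 + if_else(is.na(end), 0, as.numeric(end) - as.numeric(start))) %>%
  uncount(rows)
# "Mayor" keeps 1 row, "Auditor" expands into 4 rows (indices 7, 8, 9, 10)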

Calculating value between two csv in R

I have two data tables (csv files) which contain information about a MOOC course.
The first table contains information about mouse movements (distance), like this:
1-2163.058../2-20903.66351.../3-25428.5415..
The first number is the day when it happened (1 = first day, 2 = second day, etc.) and the second number is the distance in pixels (2163.058, 20903.66351, etc.).
The second table contains the same information, but instead of distance it records time, like this:
1-4662.0/2-43738.0/3-248349.0....
The first number is the day when it happened (1 = first day, 2 = second day, etc.) and the second number is the time in milliseconds.
In each table, every column holds the data from a specific web page and every row records one user's behaviour on that page.
I want to create a new table with the same format in which I compute the speed in pixels: divide the distance table by the time table, giving a new table with the same order and shape.
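For concreteness, here is a tiny illustration of the per-cell arithmetic, using values adapted from the strings quoted above (the day labels are split off and the numbers after the dash are divided):
dist_cell <- "1-2163.058/2-20903.66351/3-25428.5415"
time_cell <- "1-4662.0/2-43738.0/3-248349.0"
dists <- sapply(strsplit(strsplit(dist_cell, "/")[[1]], "-"), function(x) as.numeric(x[2]))
times <- sapply(strsplit(strsplit(time_cell, "/")[[1]], "-"), function(x) as.numeric(x[2]))
dists / times  # pixels per millisecond for days 1, 2 and 3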
Here are two links for the two tables goo.gl/AVQW7D goo.gl/zqzgaQ
How can I do this starting from the raw csv files?
> dput(distancestream[1:3,1:3])
structure(list(id = c(2L, 9L, 10L),
`http//tanul.sed.hu/mod/szte/frontpage.php` = structure(c(2L, 1L, 1L),
.Label = c("1-0", "1-42522.28760403924"),
class = "factor"),
`http//tanul.sed.hu/mod/szte/register.php` = c(0L, 0L, 0L)),
.Names = c("id", "http//tanul.sed.hu/mod/szte/frontpage.php",
"http//tanul.sed.hu/mod/szte/register.php"),
class = c("data.table", 0x0000000002640788))
> dput(timestream[1:3,1:3])
structure(list(id = c(2L, 9L, 10L),
`http//tanul.sed.hu/mod/szte/frontpage.php` = structure(c(2L, 1L, 1L),
.Label = c("0", "1-189044.0"),
class = "factor"),
`http//tanul.sed.hu/mod/szte/register.php` = c(0L, 0L, 0L)),
.Names = c("id",
"http//tanul.sed.hu/mod/szte/frontpage.php",
"http//tanul.sed.hu/mod/szte/register.php"),
class = c("data.table", 0x0000000002640788))
This may not be the most efficient method, but I believe it should yield the result you are looking for.
# Set file paths (edit these to point at your own files)
dist.file <- "C:/Path/To/Distance/File.csv"
time.file <- "C:/Path/To/Time/File.csv"
# Read data files
dist <- read.csv(dist.file, stringsAsFactors = FALSE)
time <- read.csv(time.file, stringsAsFactors = FALSE)
# Create dataframe for speed values
speed <- dist
speed[,2:ncol(speed)] <- NA
# Create progress bar
pb <- txtProgressBar(min = 0, max = ncol(dist) * nrow(dist), initial = 0, style = 3, width = 20)
item <- 0
# Loop through all columns and rows of distance data
for(col in 2:ncol(dist)){
  for(r in 1:nrow(dist)){
    # Check that current item has data to be calculated
    if(dist[r, col] != 0 & dist[r, col] != "1-0" & !is.na(time[r, col])){
      # Split the data into its separate day values
      dists <- lapply(strsplit(strsplit(dist[r, col], "/")[[1]], "-"), as.numeric)
      times <- lapply(strsplit(strsplit(time[r, col], "/")[[1]], "-"), as.numeric)
      # Calculate the speeds for each day
      speeds <- sapply(dists, "[[", 2) / sapply(times, "[[", 2)
      # Paste together the day values and assign to the current item in the speed dataframe
      speed[r, col] <- paste(sapply(dists, "[[", 1), format(speeds, digits = 20), sep = "-", collapse = "/")
    } else{
      # No data to calculate, assign 0 to the current item in the speed dataframe
      speed[r, col] <- 0
    }
    # Increase progress bar counter
    item <- item + 1
    setTxtProgressBar(pb, item)
  }
}
# Create a csv for speed data
write.csv(speed, "speed.csv")

"unlist" a list stored in a variable

I have a dataframe with two odd variables. For one variable, each cell stores a list whose contents are simply a vector of two numbers. For the other variable, each cell stores a three-dimensional array (even though only two dimensions are necessary) of 8 numbers.
I want to simplify the dataset by breaking these odd variables out into separate variables. I figured out how to break all the data out using a for loop, but this is very slow. I know apply is supposed to be generally quicker, but I can't figure out how I would translate this to apply. Is it possible, or is there a better way to do this?
for (i in 1:nrow(df)){
  if (length(df$coordinates.coordinates[[i]]) > 0){
    df[i, "coordinates.lon"] <- df$coordinates.coordinates[[i]][1]
    df[i, "coordinates.lat"] <- df$coordinates.coordinates[[i]][2]
  }
  if (length(df$place.bounding_box.coordinates[[i]]) > 0){
    df[i, "place.bounding_box.a.lon"] <- df$place.bounding_box.coordinates[[i]][1, 1, 1]
    df[i, "place.bounding_box.b.lon"] <- df$place.bounding_box.coordinates[[i]][1, 2, 1]
    df[i, "place.bounding_box.c.lon"] <- df$place.bounding_box.coordinates[[i]][1, 3, 1]
    df[i, "place.bounding_box.d.lon"] <- df$place.bounding_box.coordinates[[i]][1, 4, 1]
    df[i, "place.bounding_box.a.lat"] <- df$place.bounding_box.coordinates[[i]][1, 1, 2]
    df[i, "place.bounding_box.b.lat"] <- df$place.bounding_box.coordinates[[i]][1, 2, 2]
    df[i, "place.bounding_box.c.lat"] <- df$place.bounding_box.coordinates[[i]][1, 3, 2]
    df[i, "place.bounding_box.d.lat"] <- df$place.bounding_box.coordinates[[i]][1, 4, 2]
  }
}
EDIT
Here is an example dataframe with one case (via dput)
structure(list(coordinates.coordinates = list(c(112.088477, -7.227974
)), place.bounding_box.coordinates = list(structure(c(112.044456,
112.044456, 112.143242, 112.143242, -7.263067, -7.134563, -7.134563,
-7.263067), .Dim = c(1L, 4L, 2L)))), .Names = c("coordinates.coordinates",
"place.bounding_box.coordinates"), class = c("tbl_df", "data.frame"
), row.names = c(NA, -1L))
In case it helps, this is the data format you get when you read Twitter stream data using jsonlite's stream_in function (with flatten = TRUE).
library(dplyr)
df = data_frame(
  coordinates.coordinates =
    list(c(0, 1), c(2, 3)),
  place.bounding_box.coordinates =
    list(array(0, dim = c(1, 4, 2)),
         array(1, dim = c(1, 4, 2))))
df %>%
  rowwise %>%
  do(with(., data_frame(
    longitude = coordinates.coordinates[1],
    latitude = coordinates.coordinates[2]) %>% bind_cols(
      place.bounding_box.coordinates %>%
        as.data.frame %>%
        setNames(c(
          "place.bounding_box.a.lon",
          "place.bounding_box.b.lon",
          "place.bounding_box.c.lon",
          "place.bounding_box.d.lon",
          "place.bounding_box.a.lat",
          "place.bounding_box.b.lat",
          "place.bounding_box.c.lat",
          "place.bounding_box.d.lat")))))

Returning first row of group

I have a dataframe consisting of an ID (which is the same for each element in a group), two datetimes, and the time interval between the two. One of the datetime columns is my relevant time marker. Now I would like to get a subset of the dataframe that consists of the earliest entry for each group. The entries (especially the time interval) need to stay untouched.
My first approach was to sort the frame by 1. ID and 2. the relevant datetime. However, I wasn't able to return the first entry for each new group.
I then looked at the aggregate() and ddply() functions, but I could not find an option in either that just returns the first entry without applying an aggregation function to the time interval value.
Is there an (easy) way to accomplish this?
ADDITION:
Maybe I was unclear by adding my aggregate() and ddply() notes. I do not necessarily need to aggregate. Given that the dataframe is sorted so that the first row of each new group is the row I am looking for, it would suffice to return a subset containing each row whose ID differs from the one before it (i.e. the start row of each new group).
Example data:
structure(list(ID = c(1454L, 1322L, 1454L, 1454L, 1855L, 1669L,
1727L, 1727L, 1488L), Line = structure(c(2L, 1L, 3L, 1L, 1L,
1L, 1L, 1L, 1L), .Label = c("A", "B", "C"), class = "factor"),
Start = structure(c(1357038060, 1357221074, 1357369644, 1357834170,
1357913412, 1358151763, 1358691675, 1358789411, 1359538400
), class = c("POSIXct", "POSIXt"), tzone = ""), End = structure(c(1357110430,
1357365312, 1357564413, 1358230679, 1357978810, 1358674600,
1358853933, 1359531923, 1359568151), class = c("POSIXct",
"POSIXt"), tzone = ""), Interval = c(1206.16666666667, 2403.96666666667,
3246.15, 6608.48333333333, 1089.96666666667, 8713.95, 2704.3,
12375.2, 495.85)), .Names = c("ID", "Line", "Start", "End",
"Interval"), row.names = c(NA, -9L), class = "data.frame")
By reproducing the example data frame and testing it I found a way of getting the needed result:
Order data by relevant columns (ID, Start)
ordered_data <- data[order(data$ID, data$Start),]
Find the first row for each new ID
final <- ordered_data[!duplicated(ordered_data$ID),]
As you don't provide any data, here is an example using base R with a sample data frame:
df <- data.frame(group=c("a", "b"), value=1:8)
## Order the data frame with the variable of interest
df <- df[order(df$value),]
## Aggregate
aggregate(df, list(df$group), FUN=head, 1)
EDIT: As Ananda suggests in his comment, the following call to aggregate is better:
aggregate(.~group, df, FUN=head, 1)
If you prefer to use plyr, you can replace aggregate with ddply:
ddply(df, "group", head, 1)
Using ffirst from collapse
library(collapse)
ffirst(df, g = df$group)
data
df <- data.frame(group=c("a", "b"), value=1:8)
This could also be achieved with dplyr, using group_by and the slice family of functions:
data %>%
  group_by(ID) %>%
  slice_head(n = 1)
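For completeness, a data.table sketch of the same idea (order by Start, then take the first row per ID; using the example data from the question):
library(data.table)
setDT(data)
data[order(Start), .SD[1], by = ID]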
