I have a dataframe with two odd variables. For one variable, each cell stores a list whose content is simply a vector of two numbers. For the other variable, each cell stores a three-dimensional array (even though only two dimensions are necessary) of 8 numbers.
I want to simplify the dataset by breaking the odd variables out into separate variables. I figured out how to break all the data out using a for loop, but this is very slow. I know apply is supposed to be generally quicker, but I can't figure out how I would translate this to apply. Is it possible, or is there a better way to do this?
for (i in 1:nrow(df)) {
  # Only fill in rows that actually have point coordinates
  if (length(df$coordinates.coordinates[[i]]) > 0) {
    df[i, "coordinates.lon"] <- df$coordinates.coordinates[[i]][1]
    df[i, "coordinates.lat"] <- df$coordinates.coordinates[[i]][2]
  }
  # Only fill in rows that actually have a bounding box
  if (length(df$place.bounding_box.coordinates[[i]]) > 0) {
    df[i, "place.bounding_box.a.lon"] <- df$place.bounding_box.coordinates[[i]][1, 1, 1]
    df[i, "place.bounding_box.b.lon"] <- df$place.bounding_box.coordinates[[i]][1, 2, 1]
    df[i, "place.bounding_box.c.lon"] <- df$place.bounding_box.coordinates[[i]][1, 3, 1]
    df[i, "place.bounding_box.d.lon"] <- df$place.bounding_box.coordinates[[i]][1, 4, 1]
    df[i, "place.bounding_box.a.lat"] <- df$place.bounding_box.coordinates[[i]][1, 1, 2]
    df[i, "place.bounding_box.b.lat"] <- df$place.bounding_box.coordinates[[i]][1, 2, 2]
    df[i, "place.bounding_box.c.lat"] <- df$place.bounding_box.coordinates[[i]][1, 3, 2]
    df[i, "place.bounding_box.d.lat"] <- df$place.bounding_box.coordinates[[i]][1, 4, 2]
  }
}
EDIT
Here is an example dataframe with one case (via dput)
structure(list(coordinates.coordinates = list(c(112.088477, -7.227974
)), place.bounding_box.coordinates = list(structure(c(112.044456,
112.044456, 112.143242, 112.143242, -7.263067, -7.134563, -7.134563,
-7.263067), .Dim = c(1L, 4L, 2L)))), .Names = c("coordinates.coordinates",
"place.bounding_box.coordinates"), class = c("tbl_df", "data.frame"
), row.names = c(NA, -1L))
In case it helps, this is the data format you get when you read Twitter stream data using jsonlite's stream_in function (with flatten=TRUE).
library(dplyr)
df = data_frame(
  coordinates.coordinates =
    list(c(0, 1), c(2, 3)),
  place.bounding_box.coordinates =
    list(array(0, dim=c(1, 4, 2)),
         array(1, dim=c(1, 4, 2))))
df %>%
  rowwise %>%
  do(with(., data_frame(
    longitude = coordinates.coordinates[1],
    latitude = coordinates.coordinates[2]) %>% bind_cols(
      place.bounding_box.coordinates %>%
        as.data.frame %>%
        setNames(c(
          "place.bounding_box.a.lon",
          "place.bounding_box.b.lon",
          "place.bounding_box.c.lon",
          "place.bounding_box.d.lon",
          "place.bounding_box.a.lat",
          "place.bounding_box.b.lat",
          "place.bounding_box.c.lat",
          "place.bounding_box.d.lat")))))
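As a possible alternative that stays in base R, the loop from the question can be sketched with vapply, pulling one element out of every cell at a time. This is only a sketch under assumptions: the toy data, the helper name pick, and the choice of NA for empty cells are mine, not something from the question. It relies on the array being stored in column-major order, so linear positions 1 to 4 are the four longitudes and 5 to 8 the four latitudes.

```r
# Toy data shaped like the question's; the second row has no coordinates at all.
df <- data.frame(id = 1:2)
df$coordinates.coordinates <- list(c(112.088477, -7.227974), NULL)
df$place.bounding_box.coordinates <- list(
  array(c(112.044456, 112.044456, 112.143242, 112.143242,
          -7.263067, -7.134563, -7.134563, -7.263067), dim = c(1, 4, 2)),
  NULL
)

# Hypothetical helper: i-th element of a cell, or NA when the cell is too short.
pick <- function(x, i) if (length(x) >= i) x[[i]] else NA_real_

coords <- df$coordinates.coordinates
df$coordinates.lon <- vapply(coords, pick, numeric(1), i = 1)
df$coordinates.lat <- vapply(coords, pick, numeric(1), i = 2)

# Linear positions 1..8 of the 1 x 4 x 2 array map to a-d lon, then a-d lat.
boxes <- df$place.bounding_box.coordinates
box_names <- paste0("place.bounding_box.",
                    rep(c("a", "b", "c", "d"), times = 2),
                    rep(c(".lon", ".lat"), each = 4))
for (k in seq_along(box_names)) {
  df[[box_names[k]]] <- vapply(boxes, pick, numeric(1), i = k)
}
```

The remaining loop runs over the 8 new columns rather than over the rows, so its cost should not grow with the number of rows the way the original loop does.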
I currently have a data set with quarterly returns for 10 indices. My dataset (compoundrates) is structured so that in the first column, we have "Scenario" and the second column is "Quarter", and the following 10 are the quarterly index return. The projection is 50 quarters, so lines 1-51 reflect quarters 0-50 for scenario 1, and lines 52-102 reflect quarter 0-50 for scenario 2, etc for 1000 scenarios.
To calculate cumulative compound rates, I need to multiply the current return by all previous returns from the projection. I set up a loop to do this in the code below:
for (i in 1:nrow(compoundrates)) {
  if (compoundrates[i, "Quarter"] == 0) {
    compoundrates[i, -c(1:2)] <- 1
  } else {
    compoundrates[i, -c(1:2)] <- compoundrates[i, -c(1:2)] * compoundrates[i - 1, -c(1:2)]
  }
}
The loop is simple and works how I want. However, with 51000 rows, this takes about 13 minutes. Is there a way to speed up the code? I tried thinking of a vectorized solution, but could only think that I would need to loop through all rows of the dataset. While 13 minutes is not the end of the world, I have other datasets with longer projections, up to 200 quarters, which would take extremely long.
Possibly pivoting the dataset to be horizontal would require only 50 loops rather than 51000, but thought I'd see if anyone else had a more elegant solution.
Edit: Included here is a sample of the first couple of lines of my dataset:
> dput(head(compoundrates[, 1:4])) # First part of data, only 2 indices
structure(list(Scenario = c(1L, 1L, 1L, 1L, 1L, 1L), Quarter = c(0,
1, 2, 3, 4, 5), US = c(1, 1.06658609144463, 1.1022314574062,
1.1683883540847, 1.29134306037902, 1.28907212981088), MidCap = c(1,
1.10361590084936, 1.12966579678275, 1.21702573464001, 1.2674372889915,
1.37286942499386)), row.names = c(NA, -6L), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), groups = structure(list(Scenario = 1L,
.rows = list(1:6)), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE))
Try this out. It uses vectorized functions to do basically what you are trying to do with the for loop, and it creates new columns so you can see what is going on. Vectorized functions usually run a lot faster than for loops.
library(tidyverse)
compoundrates %>%
  group_by(Scenario) %>%
  arrange(Quarter) %>%
  mutate(US_lag = lag(US),
         MidCap_lag = lag(MidCap),
         US_cum = US * US_lag,
         MidCap_cum = MidCap * MidCap_lag) %>%
  mutate_all(~ifelse(is.na(.), 1, .))
This should do the cumulative product you were asking for:
compoundrates %>%
  group_by(Scenario) %>%
  arrange(Quarter) %>%
  mutate(US_cum = cumprod(US),
         MidCap_cum = cumprod(MidCap))
Per @carl-witthoft's suggestion, here is a benchmark. I made a dataframe of 60,000 rows, grouped by the 6 quarters in the OP.
Unit: milliseconds
expr: big_data %>% group_by(Scenario) %>% mutate(US_cum = cumprod(US), MidCap_cum = cumprod(MidCap))
     min      lq     mean   median      uq     max neval
 63.1487 70.0257 77.16906 73.72995 79.7645 147.167   100
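Since the question has 10 index columns rather than 2, the same grouped cumprod idea could be applied to every index column at once. Below is a minimal base R sketch using ave; the toy values are made up, and it assumes (as in the sample data) that the Quarter 0 rows already hold 1, so no special case is needed for them.

```r
# Made-up toy data: 2 scenarios, quarters 0 to 2, two index columns.
compoundrates <- data.frame(
  Scenario = rep(1:2, each = 3),
  Quarter  = rep(0:2, times = 2),
  US       = c(1, 1.1, 1.2, 1, 0.9, 1.5),
  MidCap   = c(1, 2, 3, 1, 0.5, 4)
)

# Every column except the two key columns is treated as an index column.
index_cols <- setdiff(names(compoundrates), c("Scenario", "Quarter"))

# Sort so that cumprod runs through the quarters in order within each scenario.
compoundrates <- compoundrates[order(compoundrates$Scenario,
                                     compoundrates$Quarter), ]

# ave() applies cumprod separately within each Scenario and keeps row order.
compoundrates[index_cols] <- lapply(
  compoundrates[index_cols],
  function(col) ave(col, compoundrates$Scenario, FUN = cumprod)
)
```

Unlike the answer above, this overwrites the index columns in place, which mirrors what the original loop does.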
The main dataframe has a column "passings", which is its only nested variable. Each cell of that column holds a dataframe (see the example of a nested cell below). The number of rows in these nested dataframes varies, but the columns are always the same: "date" and "title". What I need is to grab the respective date and put it in the main dataframe as a new variable if the title is "Закон прийнято" (translation: "Law passed").
I'm a newbie in coding.
I would appreciate your help!
dataframe
an example of a dataframe within a nested cell
Here is an option where we loop over the 'passings' list column with map (based on the image, each element is a two-column data.frame), filter the rows where 'title' is "Закон прийнято" (assuming exactly one such row in each nested data.frame), and pull the 'date' column to create a new column 'date' in the original dataset.
library(dplyr)
library(purrr)
df1 %>%
  mutate(date = map_chr(passings, ~ .x %>%
    filter(title == "Закон прийнято") %>%
    pull(date)))
# id passed passings date
#1 54949 TRUE 2015-06-10, 2015-06-08, abcb, Закон прийнято 2015-06-08
#2 55009 TRUE 2015-06-10, 2015-09-08, bcb, Закон прийнято 2015-09-08
NOTE: It works as expected.
data
df1 <- structure(list(id = c(54949, 55009), passed = c(TRUE, TRUE),
passings = list(structure(list(date = c("2015-06-10", "2015-06-08"
), title = c("abcb", "Закон прийнято")), class = "data.frame", row.names = c(NA,
-2L)), structure(list(date = c("2015-06-10", "2015-09-08"
), title = c("bcb", "Закон прийнято")), class = "data.frame", row.names = c(NA,
-2L)))), row.names = c(NA, -2L), class = "data.frame")
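If some nested data frames have no row with that title (or more than one), map_chr would stop with an error, because each call has to return exactly one string. A hedged base R sketch of one way to guard against that is shown here; the third row of the toy data and the helper name first_passed are assumptions added for illustration, and taking the first match and falling back to NA are arbitrary choices.

```r
# Toy data: the third nested data frame has no "Закон прийнято" row.
df1 <- data.frame(id = c(54949, 55009, 55100))
df1$passings <- list(
  data.frame(date = c("2015-06-10", "2015-06-08"),
             title = c("abcb", "Закон прийнято"), stringsAsFactors = FALSE),
  data.frame(date = c("2015-06-10", "2015-09-08"),
             title = c("bcb", "Закон прийнято"), stringsAsFactors = FALSE),
  data.frame(date = "2015-07-01", title = "abcb", stringsAsFactors = FALSE)
)

# Hypothetical helper: first matching date, or NA when nothing matches.
first_passed <- function(p) {
  hit <- p$date[p$title == "Закон прийнято"]
  if (length(hit) > 0) hit[1] else NA_character_
}

df1$date <- vapply(df1$passings, first_passed, character(1))
```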
I'm having an issue with separating rows in a dataframe that I'm working in.
In my dataframe, there's a column called officialIndices that I want to separate the rows by. This column stores a list of numbers that act as indices indicating which rows have the same data. For example, indices 2:3 means that rows 2:3 have the same data.
Here is the code that I am working with.
offices_list <- data_google$offices
offices_JSON <- toJSON(offices_list)
offices_from_JSON <-
separate_rows(fromJSON(offices_JSON), officialIndices, convert = TRUE)
This is what my offices_list frame looks like
This is what it looks like after I try to separate the rows
My code works fine when it has indices 2:3, since there is a difference of 1. However, on indices like 7:10, it separates the rows as 7 and 10 instead of 7, 8, 9, 10, which is how I want it to be done. How would I get my code to separate the rows like this?
Output of dput(head(offices_list))
structure(list(position = c("President of the United States",
"Vice-President of the United States", "United States Senate",
"Governor", "Mayor", "Auditor"), divisionId = c("ocd-division/country:us",
"ocd-division/country:us", "ocd-division/country:us/state:or",
"ocd-division/country:us/state:or", "ocd-division/country:us/state:or/place:portland",
"ocd-division/country:us/state:or/place:portland"), levels = list(
"country", "country", "country", "administrativeArea1", NULL,
NULL), roles = list(c("headOfState", "headOfGovernment"),
"deputyHeadOfGovernment", "legislatorUpperBody", "headOfGovernment",
NULL, NULL), officialIndices = list(0L, 1L, 2:3, 4L, 5L,
6L)), row.names = c(NA, 6L), class = "data.frame")
This should work. I expect it will work for further rows too, since I tested for ranges greater than two in officialIndices.
First I extracted the start and end rows, and used their difference to determine how many rows are needed. Then tidyr::uncount() will add that many copies.
library(dplyr); library(tidyr)
data_sep <- data %>%
  separate(officialIndices, into = c("start", "end"), sep = ":") %>%
  # Use 1 row, and more if "end" is defined and larger than "start"
  mutate(rows = 1 + if_else(is.na(end), 0, as.numeric(end) - as.numeric(start))) %>%
  uncount(rows)
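If the expanded rows should also carry the individual index values (7, 8, 9, 10 rather than just four copies of the row), one possibility is to build the full sequence for each row first and then repeat the rows to match. This is a base R sketch under an assumption about the data: each list element is taken to hold either a single index or the first and last index of a range. The toy frame and the column name officialIndex are made up for illustration.

```r
# Toy data: a single index, a range stored as 2:3, and a range stored as c(7, 10).
offices <- data.frame(position = c("A", "B", "C"), stringsAsFactors = FALSE)
offices$officialIndices <- list(0L, 2:3, c(7L, 10L))

# Expand every element to the full run from its smallest to its largest value.
full <- lapply(offices$officialIndices, function(v) seq(min(v), max(v)))

# Repeat each row once per index it covers, then attach the indices.
expanded <- offices[rep(seq_len(nrow(offices)), lengths(full)),
                    "position", drop = FALSE]
expanded$officialIndex <- unlist(full)
rownames(expanded) <- NULL
```

With this toy input, row "C" ends up on four rows with indices 7 through 10.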
I simply want to take a dataframe with two columns, one with a grouping variable and the second with values, and transform it so that the grouping variable becomes columns with the appropriate values. A very simple question, but after searching for about an hour, I cannot find a good answer. Here is a toy example:
var <- c("Var1", "Var1", "Var2", "Var2")
value <- c(1, 2, 3, 4)
df <- data.frame(var, value)
df.one <- df[df$var == "Var1", ]
df.two <- df[df$var == "Var2", ]
desired.df <- data.frame(df.one[2], df.two[2])
colnames(desired.df) <- c("Var1", "Var2")
desired.df
With more variables and values, this bit of code could become extremely clunky. Can anyone suggest a better method? Any advice would be greatly appreciated!
Data:
df <- structure(list(var = structure(c(1L, 1L, 2L, 2L),
.Label = c("Var1", "Var2"), class = "factor"),
value = c(1, 2, 3, 4)), .Names = c("var", "value"),
class = "data.frame", row.names = c(NA, -4L))
It looks like it is useful to introduce a new variable that identifies the observation within var (I call this case below); you can remove it after reshaping it if you like.
With reshape2/plyr:
library("plyr")
library("reshape2")
## add 'case' identifier
df <- ddply(df,"var",mutate,case=1:length(var))
## dcast() to reshape; then drop identifier
dcast(df,case~var)[,-1]
With tidyr (same strategy):
library("tidyr")
library("dplyr")
df %>%
  group_by(var) %>%
  mutate(case=seq(n())) %>%
  spread(var,value) %>%
  select(-case)
This could probably be done with reshape() in base R as well, but I have never been able to figure it out ...
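For what it's worth, here is a sketch of how the same strategy might look with reshape() in base R. It follows the 'case' identifier idea from above; the prefix stripping at the end is just one way to tidy the column names, since reshape() names the wide columns value.Var1 and value.Var2.

```r
df <- data.frame(var = c("Var1", "Var1", "Var2", "Var2"),
                 value = c(1, 2, 3, 4))

# Number the observations within each level of var (the 'case' identifier).
df$case <- ave(seq_along(df$var), df$var, FUN = seq_along)

# Long to wide: one row per case, one value column per level of var.
wide <- reshape(df, idvar = "case", timevar = "var", direction = "wide")

# Strip the "value." prefix and drop the identifier again.
names(wide) <- sub("^value\\.", "", names(wide))
wide <- wide[, setdiff(names(wide), "case")]
```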
Base R solution:
data.frame(split(df$value,df$var))
# Var1 Var2
#1 1 3
#2 2 4
This solution assumes that all 'VarN' subsets have equal length.
A more general solution is:
z <- split(df$value,df$var)
max.length <- max(sapply(z,length))
data.frame(lapply(z,`length<-`,max.length))
which appends NA to shorter lists to make sure that all lists have the same length.
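Another base R option for the equal-length case is unstack(), which takes a formula of the form values ~ grouping. A small sketch on the toy data from the question:

```r
df <- data.frame(var = c("Var1", "Var1", "Var2", "Var2"),
                 value = c(1, 2, 3, 4))

# One column per level of var, filled with the matching values.
wide <- unstack(df, value ~ var)
```

When the groups have different lengths, unstack() returns a list instead of a data frame, so the padding approach above would still be needed in that case.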
I have a dataframe consisting of an ID that is the same for each element in a group, two datetimes, and the time interval between the two. One of the datetime columns is my relevant time marker. I would like to get a subset of the dataframe that consists of the earliest entry for each group. The entries (especially the time interval) need to stay untouched.
My first approach was to sort the frame by 1. ID and 2. the relevant datetime. However, I wasn't able to return the first entry for each new group.
I then looked at the aggregate() and ddply() functions, but I could not find an option in either that just returns the first entry without applying an aggregation function to the time interval value.
Is there an (easy) way to accomplish this?
ADDITION:
Maybe I was unclear by adding my aggregate() and ddply() notes. I do not necessarily need to aggregate. Given the fact that the dataframe is sorted in a way that the first row of each new group is the row I am looking for, it would suffice to just return a subset with each row that has a different ID than the one before (which is the start-row of each new group).
Example data:
structure(list(ID = c(1454L, 1322L, 1454L, 1454L, 1855L, 1669L,
1727L, 1727L, 1488L), Line = structure(c(2L, 1L, 3L, 1L, 1L,
1L, 1L, 1L, 1L), .Label = c("A", "B", "C"), class = "factor"),
Start = structure(c(1357038060, 1357221074, 1357369644, 1357834170,
1357913412, 1358151763, 1358691675, 1358789411, 1359538400
), class = c("POSIXct", "POSIXt"), tzone = ""), End = structure(c(1357110430,
1357365312, 1357564413, 1358230679, 1357978810, 1358674600,
1358853933, 1359531923, 1359568151), class = c("POSIXct",
"POSIXt"), tzone = ""), Interval = c(1206.16666666667, 2403.96666666667,
3246.15, 6608.48333333333, 1089.96666666667, 8713.95, 2704.3,
12375.2, 495.85)), .Names = c("ID", "Line", "Start", "End",
"Interval"), row.names = c(NA, -9L), class = "data.frame")
By reproducing the example data frame and testing it I found a way of getting the needed result:
Order data by relevant columns (ID, Start)
ordered_data <- data[order(data$ID, data$Start),]
Find the first row for each new ID
final <- ordered_data[!duplicated(ordered_data$ID),]
As you don't provide any data, here is an example using base R with a sample data frame :
df <- data.frame(group=c("a", "b"), value=1:8)
## Order the data frame with the variable of interest
df <- df[order(df$value),]
## Aggregate
aggregate(df, list(df$group), FUN=head, 1)
EDIT : As Ananda suggests in his comment, the following call to aggregate is better :
aggregate(.~group, df, FUN=head, 1)
If you prefer to use plyr, you can replace aggregate with ddply :
ddply(df, "group", head, 1)
Using ffirst from collapse
library(collapse)
ffirst(df, g = df$group)
data
df <- data.frame(group=c("a", "b"), value=1:8)
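This could also be achieved with dplyr, using group_by and the slice family of functions (assuming the rows are already sorted by the relevant datetime, since slice_head simply takes the first row of each group as it appears):
data %>%
  group_by(ID) %>%
  slice_head(n = 1)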