I would like to calculate the acf of a time series grouped by a grouping variable. Specifically, I have a data frame containing a single time series (variable a) and a grouping variable (e.g. weekday, variable b). Here is an example:
data <- data.frame(a=rnorm(1:150), b=rep(rep(1:3, each=5), 10))
Now, I would like to calculate the acf for the different values of the grouping variable. For example, for lag 2 and group 1 I would like to get the correlation between t and t-2, calculated only over time points t with b=1 (the value of b at t-2 does not matter). I know that the function acf can easily calculate the acf, but I can't find a way to include the grouping variable.
I could manually calculate the desired correlation, but as I have a large data set and many lags and values of the grouping variable, I hope there is a more elegant and faster way. Here is the manual calculation for the example above (lag 2, b=1):
sel <- which(data$b==1)
cor(data$a[sel[sel > 2]], data$a[sel[sel>2] - 2])
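For reference, this is what the brute-force route looks like when the manual calculation is wrapped in a small helper and repeated over several lags and groups (just a sketch; the helper name is made up):
# correlation of a[t] with a[t - lag], over time points t where b == group
lag_cor_by_group <- function(x, g, group, lag) {
  sel <- which(g == group)
  sel <- sel[sel > lag]
  cor(x[sel], x[sel - lag])
}
# one row per group (b = 1, 2, 3), one column per lag (1 to 5)
sapply(1:5, function(lag)
  sapply(1:3, function(group) lag_cor_by_group(data$a, data$b, group, lag))
)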
If the time series object is a tsibble, the following works for me (ACF() and autoplot() here come from the feasts package). Assuming the data frame is called df, the variable you are interested in is called var, and the series are keyed by Region. You can additionally specify the maximum lag.
df %>% group_by(Region) %>% ACF(var, lag_max = 18) %>% autoplot()
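For example, a minimal sketch (the data, column names, and key Region here are all made up; it assumes the dplyr, tsibble, and feasts packages, with one series per Region stored in a keyed tsibble):
library(dplyr)
library(tsibble)
library(feasts)
# hypothetical data: one series of 100 points per Region
df <- tibble(
  Region = rep(c("North", "South"), each = 100),
  time   = rep(1:100, times = 2),
  var    = rnorm(200)
) %>%
  as_tsibble(key = Region, index = time)
df %>%
  group_by(Region) %>%
  ACF(var, lag_max = 18) %>%
  autoplot()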
I'm not sure I understand exactly what information you are looking for, but if you just want the acf values for multiple groups, this should accomplish that. Some people have mentioned creating a tidy solution; this one uses dplyr, tidyr, and purrr to do the grouped calculations.
library(dplyr)
library(tidyr)
library(purrr)
sample_data <- dplyr::tibble(
  group = sample(c("a", "b", "c"), size = 100, replace = TRUE),
  value = sample.int(30, size = 100, replace = TRUE)
)
head(sample_data)
#> # A tibble: 6 × 2
#> group value
#> <chr> <int>
#> 1 c 28
#> 2 c 9
#> 3 c 13
#> 4 c 11
#> 5 a 9
#> 6 c 9
grouped_acf_values <- sample_data %>%
  tidyr::nest(-group) %>%
  dplyr::mutate(acf_results = purrr::map(data, ~ acf(.x$value, plot = FALSE)),
                acf_values = purrr::map(acf_results, ~ drop(.x$acf))) %>%
  tidyr::unnest(acf_values) %>%
  dplyr::group_by(group) %>%
  dplyr::mutate(lag = seq(0, n() - 1))
head(grouped_acf_values)
#> Source: local data frame [6 x 3]
#> Groups: group [1]
#>
#> group acf_values lag
#> <chr> <dbl> <int>
#> 1 c 1.00000000 0
#> 2 c -0.20192774 1
#> 3 c 0.07191805 2
#> 4 c -0.18440489 3
#> 5 c -0.31817935 4
#> 6 c 0.06368096 5
You can have a look at split to separate your data.frame into buckets and then lapply to apply your function to each group. Something like:
groups_data <- split(data, data$b)
groups_acf <- lapply(groups_data, acf,...)
Then you have to extract the required information from the output list, for instance with `sapply(groups_acf, function(acfobject) acfobject$acf)`.
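Put together on the example data, that looks something like this (a sketch; note that acf() here is computed within each subset, treating the values with b == k as one contiguous series):
data <- data.frame(a = rnorm(150), b = rep(rep(1:3, each = 5), 10))  # OP's example data
groups_data <- split(data, data$b)                                   # one data frame per level of b
groups_acf  <- lapply(groups_data, function(d) acf(d$a, plot = FALSE))
acf_values  <- sapply(groups_acf, function(acfobject) drop(acfobject$acf))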
For grouped computations, I would also definitely go with the newer approach à la Hadley Wickham with the %>% operator and group_by; studying that is on my to-do list...
I have the following sample data set
Time <- c(1,2,3,4,5,6,7,8,9,10,11,12)
Value <- c(0,1,2,3,2,1,2,3,2,1,2,3)
Data <- data.frame(Time, Value)
I would like to automatically find each maximum for the Value column and create a new data frame with only the Value and associated Time. In this example, maximum values occur every fourth time interval. I would like to group the data into bins and find the associated max value.
I kept my example simple for illustrative purposes, however, keep in mind:
Each max value in my data set will be different
Each max value is not guaranteed to occur at equal intervals; rather, I can guarantee that each max value will occur within a range (i.e. a bin) of time values.
Thank you for any help with this process!
You could find the local maxima by finding the points where the diff of the sign of the diff of the Value column is negative.
Data[which(diff(sign(diff(Data$Value))) < 0) + 1,]
#> Time Value
#> 4 4 3
#> 8 8 3
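To see why this picks out the peaks, here are the intermediate steps on the example data:
diff(Data$Value)               # change between consecutive values
#>  [1]  1  1  1 -1 -1  1  1 -1 -1  1  1
sign(diff(Data$Value))         # +1 where rising, -1 where falling
#>  [1]  1  1  1 -1 -1  1  1 -1 -1  1  1
diff(sign(diff(Data$Value)))   # -2 exactly where the series turns from rising to falling
#>  [1]  0  0 -2  0  2  0 -2  0  2  0
The + 1 in the indexing then shifts those positions back onto the corresponding rows of Data.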
We can see that this works in a more general case too:
Time <- seq(0, 10, 0.1)
Value <- sin(Time)
Data <- data.frame(Time, Value)
plot(Data$Time, Data$Value)
Data2 <- Data[which(diff(sign(diff(Data$Value))) < 0) + 1,]
abline(v = Data2$Time, col = 'red')
Edit
Following more info from the OP, it seems we are looking for the maxima within a 120-second window. This being the case, we can get the solution more easily like this:
library(dplyr)
bin_size <- 4 # Used for example only, will be 120 in real use case
Data %>%
  mutate(Bin = floor((Time - 1) / bin_size)) %>%
  group_by(Bin) %>%
  filter(Value == max(Value))
#> # A tibble: 3 x 3
#> # Groups: Bin [3]
#> Time Value Bin
#> <dbl> <dbl> <dbl>
#> 1 4 3 0
#> 2 8 3 1
#> 3 12 3 2
Obviously in the real data, change bin_size to 120.
Maybe this one? By default slice_max() keeps every row tied for the maximum Value (with_ties = TRUE), so all three peaks are returned here.
library(dplyr)
Data %>%
  slice_max(Value)
Time Value
1 4 3
2 8 3
3 12 3
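If each bin has its own maximum (as it will in the real data), the same idea can be combined with the binning from the answer above (a sketch, again using a bin size of 4 for the example):
Data %>%
  group_by(Bin = floor((Time - 1) / 4)) %>%
  slice_max(Value)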
I have a dataset that has 453 variables (columns) and 119 observations (rows). It comprises 118 health observations for different countries over a number of years. For example, 10 of the 453 variables contain health data from Australia over a 10-year period; 8 of the 453 variables contain health data from Bangladesh over an 8-year period.
I want to subset these 453 variables into their own country-based data frames. The country name and year are in row 1 (e.g. Australia_2013, Australia_2014, etc.). Seeing as there are > 40 countries in this dataset, I would like to create a loop for this.
From what I've read so far, I think I should create a vector list of country names and then write a loop function that subsets data according to the vector list. All of the examples I can find are for subsetting based on rows however.
Can anyone point me in the right direction, or share example code for this?
Much thanks in anticipation
Based on your description, I assume your data looks something like this:
country_year <- c("Australia_2013", "Australia_2014", "Bangladesh_2013")
health <- matrix(nrow = 3, ncol = 3, data = runif(9))
dataset <- data.frame(rbind(country_year, health), row.names = NULL, stringsAsFactors = FALSE)
dataset
# X1 X2 X3
#1 Australia_2013 Australia_2014 Bangladesh_2013
#2 0.665947273839265 0.677187719382346 0.716064820764586
#3 0.499680359382182 0.514755881391466 0.178317369660363
#4 0.730102791683748 0.666969108628109 0.0719663293566555
First, move your row 1 (e.g., Australia_2013, Australia_2014 etc.) to the column names, and then apply the loop to create country-based data frames.
library(dplyr)
# move header
dataset2 <- dataset %>%
  `colnames<-`(dataset[1, ]) %>%           # uses row 1 as column names
  slice(-1) %>%                            # removes row 1 from the data
  mutate_all(type.convert, as.is = TRUE)   # converts columns to appropriate types
# apply loop
for(country in unique(gsub("_\\d+", "", colnames(dataset2)))) {
  assign(country, select(dataset2, starts_with(country))) # makes subsets
}
Regarding the loop,
gsub("_\\d+", "", colnames(dataset2)) extracts the country names by replacing "_[year]" with nothing (i.e., removing it), and the unique() function that is applied extracts one of each country name.
assign(country, select(dataset2, starts_with(country))) creates a variable named after the country and this country variable only contains the columns from dataset2 that start with the country name.
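For instance, on column names like those in the example, the regex step gives:
gsub("_\\d+", "", c("Australia_2013", "Australia_2014", "Bangladesh_2013"))
#> [1] "Australia"  "Australia"  "Bangladesh"
unique(gsub("_\\d+", "", c("Australia_2013", "Australia_2014", "Bangladesh_2013")))
#> [1] "Australia"  "Bangladesh"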
Edit: Responding to Comment
The question in the comment was asking how to add row-wise summaries (e.g., rowSums(), rowMeans()) as new columns in the country-based data frames, while using this for-loop.
Here is one solution that requires minimal changes:
for(country in unique(gsub("_\\d+", "", colnames(dataset2)))) {
  assign(country,
         select(dataset2, starts_with(country)) %>% # makes subsets
           mutate( # creates new columns
             rowSums = rowSums(select(., starts_with(country))),
             rowMeans = rowMeans(select(., starts_with(country)))
           )
  )
}
mutate() adds new columns to a dataset.
select(., starts_with(country)) selects columns that start with the country name from the current object (represented as . in the function).
Here is a dplyr answer, for version >= 1.0.
I created a small example, and we nest the different columns into the data column. Then, since nest_by() already creates a rowwise-grouped data frame, we can subset each data for the columns that start with the country name. We need to convert the country to a character for that.
Finally, if needed, you can pull the subset list-column to get a list of tibbles that contain the relevant columns.
Of note, I think working in a tidy format (long, without country and year encoded together in the column names) would be easier.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- data.frame(country = c("A", "B", "C"),
                 A_1 = 1:3,
                 A_2 = 3:5,
                 B_1 = 2:4,
                 C_3 = 1:3)
df
#> country A_1 A_2 B_1 C_3
#> 1 A 1 3 2 1
#> 2 B 2 4 3 2
#> 3 C 3 5 4 3
nest_by(df, country) %>%
  mutate(subset = list(select(data, starts_with(as.character(country))))) %>%
  pull(subset)
#> [[1]]
#> # A tibble: 1 x 2
#> A_1 A_2
#> <int> <int>
#> 1 1 3
#>
#> [[2]]
#> # A tibble: 1 x 1
#> B_1
#> <int>
#> 1 3
#>
#> [[3]]
#> # A tibble: 1 x 1
#> C_3
#> <int>
#> 1 3
Created on 2020-12-08 by the reprex package (v0.3.0)
First of all, the data structure is not optimal: having strings in the first row means that the numbers in all the other rows (in each column) are also coded by R as strings. But that's not part of the question.
You cannot create a series of separate data frames in a loop, but you can store them as elements of a list (that's what lists in R are for!), with one list element holding a single country.
Pure base R approach, solution with a working example:
# example dataset df
data("mtcars")
df <- mtcars
df <- rbind(paste0(sample(letters, ncol(df), replace = TRUE), "_2014"), df)
str(df)
# solution
countries <- substr(df[1, ], 1, nchar(df[1, ]) - 5)
unique_countries <- unique(countries)
df <- rbind.data.frame(countries, df, stringsAsFactors = FALSE)
list_df_per_country <- list()
for (i in seq_along(unique_countries)) {
  list_df_per_country[[i]] <- df[which(df[1, ] == unique_countries[i])]
}
To use the code above, just save your dataframe as df, i.e. df <- your_dataframe, and run lines below the # solution, one by one.
This question already has answers here: calculating mean for every n values from a vector (3 answers). Closed 4 years ago.
I am new to R so any help is greatly appreciated!
I have a data frame of 278,800 observations for each of my 10 variables. I am trying to create an 11th variable that sums every 200 observations (rows) of a specific variable/column (sum of rows 1:200, 201:400, 401:600, etc.), similar to the OFFSET function in Excel.
I have tried subsetting my data to just the variable of interest with the aim of adding a new variable that continuously sums every 200 rows, however I cannot figure it out. I understand my new "variable" will produce 1,394 data points (278,800 / 200). I have tried to use the rollapply function, however the output does not sum in blocks of 200; it sums 1:200, 2:201, 3:202, etc.
Thanks,
E
rollapply has a by= argument for that. Here is a smaller example using n = 3 instead of n = 200. Note that 1+2+3=6, 4+5+6=15, 7+8+9=24 and 10+11+12=33.
# test data
DF <- data.frame(x = 1:12)
library(zoo)
n <- 3
rollapply(DF$x, n, sum, by = n)
## [1] 6 15 24 33
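Applied to the full problem, the same call would look something like this (a sketch; dat$x stands in for whichever column holds the 278,800 observations):
sums_200 <- rollapply(dat$x, 200, sum, by = 200)  # 278800 / 200 = 1394 block sums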
First let's generate some data and get a label for each group:
library(tidyverse)
df <-
  rnorm(1000) %>%
  as_tibble() %>%
  mutate(grp = floor(1 + (row_number() - 1) / 200))
> df
# A tibble: 1,000 x 2
value grp
<dbl> <dbl>
1 -1.06 1
2 0.668 1
3 -2.02 1
4 1.21 1
...
1000 0.78 5
This creates 1000 random N(0, 1) values, turns them into a data frame, and then adds an incrementing numeric label for each group of 200.
df %>%
  group_by(grp) %>%
  summarize(grp_sum = sum(value))
# A tibble: 5 x 2
grp grp_sum
<dbl> <dbl>
1 1 9.63
2 2 -12.8
3 3 -18.8
4 4 -8.93
5 5 -25.9
Then we just need to do a group-by operation on the second column and sum the values. You can use the pull() operation to get a vector of the results:
df %>%
  group_by(grp) %>%
  summarize(grp_sum = sum(value)) %>%
  pull(grp_sum)
[1] 9.62529 -12.75193 -18.81967 -8.93466 -25.90523
I created a vector a with 278,800 observations:
a <- rnorm(278800)
b <- NULL  # initializing the column of interest
j <- 1
for (i in seq(1, length(a), by = 200)) {
  b[j] <- sum(a[i:(i + 199)])  # sum of the 200-row block starting at row i; b is your column of interest
  j <- j + 1
}
View(b)
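A vectorized sketch that avoids the explicit loop (it relies on the length being an exact multiple of 200, as it is here):
# fill a 200-row matrix column by column, then sum each column
b_vec <- colSums(matrix(a, nrow = 200))  # length(b_vec) == 1394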
In the example below I am trying to determine which value is closest to each of the vals_int, by id. I can solve this problem using sapply() in a manner similar to below, but I am wondering if the sapply() part can be done with another function in dplyr.
I am really just interested in whether the sapply method and output can be reproduced using some function(s) in the dplyr package. I had thought that do() may work but am struggling to determine how.
library(tidyverse)
df <- data_frame(
  id = rep(1:10, 10) %>% sort(),
  visit = rep(1:10, 10),
  value = rnorm(100)
)
vals_int <- c(1, 2, 3)
tmp <- sapply(vals_int,
              function(val_i) abs(df$value - val_i))
Yes, you can use the rowwise() and do() functions in dplyr to perform the same operation on every row, like so:
df %>% rowwise %>% do(diffs = abs(.$value - vals_int))
This will create a column called diffs in a new tibble which is a list of vectors with length 3. If you coerce the output that do() returns to be a data frame, it will instead create a tibble with three columns, one for each of the values subtracted.
df %>% rowwise %>% do(as.data.frame(t(abs(.$value - vals_int))))
The answer by #qdread does what you are looking for, but the tidyverse is starting to move away from the do() function (if that matters to you, idk). Here is an alternative method using map from the purrr package.
df %>%
  mutate(closest = map(value, function(x){
    abs(x - vals_int) %>%
      t() %>%
      as.tibble()
  })) %>%
  unnest()
That gives you this:
# A tibble: 100 x 6
id visit value V1 V2 V3
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 1 1 0.91813183 0.08186817 1.081868 2.081868
2 1 2 -1.68556173 2.68556173 3.685562 4.685562
3 1 3 -0.05984289 1.05984289 2.059843 3.059843
4 1 4 0.40128729 0.59871271 1.598713 2.598713
5 1 5 -0.09995526 1.09995526 2.099955 3.099955
6 1 6 0.81802663 0.18197337 1.181973 2.181973
7 1 7 -1.49244225 2.49244225 3.492442 4.492442
8 1 8 -0.74256185 1.74256185 2.742562 3.742562
9 1 9 -0.43943907 1.43943907 2.439439 3.439439
10 1 10 0.54985857 0.45014143 1.450141 2.450141
# ... with 90 more rows
In dplyr, I'm looking for ways to group by unique keys (for the problem at hand, by unique row numbers). Given a data frame such as the one below:
df <- data.frame(A = rep(1:5, each = 2), B = rnorm(10, 3, 3), C= runif(10, 1.5, 4.5))
#> A B C
#> 1 1 -4.6399372 1.622857
#> 2 1 0.9933197 4.256062
#> 3 2 4.1381981 3.522439
#> 4 2 4.6943698 4.260124
#> 5 3 5.7183797 3.877568
#> 6 3 -3.6183500 2.236473
#> 7 4 -2.5711393 4.373780
#> 8 4 5.9092908 2.125349
#> 9 5 6.1531930 4.472758
#> 10 5 -1.9750869 1.516432
I would like to replace the three rows df[4:6, ] with a single row containing their means, so the result would have only 8 rows in total after grouping and collapsing. Normally, I would work it out in the following manner:
df %>%
  group_by(rownumber = c(1:3, rep(4, each = 3), 7:10)) %>%
  summarise_all(.funs = mean)
But I find the code overly explicit, in that each slice of the index has to be provided.
There must be more efficient/succinct ways to achieve the same feat. Thanks to anyone who can offer insights. Also, although the tidyverse community seems to avoid row names, for now I'd like to have proper row numbering here.
One option would be to replace those elements with a specific value so that we can avoid the rep and the later concatenation step:
df %>%
  group_by(grp = replace(row_number(), 4:6, 4)) %>%
  summarise_all(mean)
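If this comes up often, the replacement trick can be wrapped in a small helper (a sketch; the function name is made up):
collapse_rows <- function(df, rows) {
  df %>%
    group_by(grp = replace(row_number(), rows, min(rows))) %>%
    summarise_all(mean)
}
collapse_rows(df, 4:6)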