I think this is a split-apply-combine problem, but with a time series twist. My data consists of irregular counts, and I need to perform some summary statistics on each group of counts. A snapshot of the data was posted as an image; here it is for your console:
library(xts)
date <- as.Date(c("2010-11-18", "2010-11-19", "2010-11-26", "2010-12-03", "2010-12-10",
"2010-12-17", "2010-12-24", "2010-12-31", "2011-01-07", "2011-01-14",
"2011-01-21", "2011-01-28", "2011-02-04", "2011-02-11", "2011-02-18",
"2011-02-25", "2011-03-04", "2011-03-11", "2011-03-18", "2011-03-25",
"2011-03-26", "2011-03-27"))
returns <- c(0.002,0.000,-0.009,0.030, 0.013,0.003,0.010,0.001,0.011,0.017,
-0.008,-0.005,0.027,0.014,0.010,-0.017,0.001,-0.013,0.027,-0.019,
0.000,0.001)
count <- c(NA,NA,1,1,2,2,3,4,5,6,7,7,7,7,7,NA,NA,NA,1,2,NA,NA)
maxCount <- c(NA,NA,0.030,0.030,0.030,0.030,0.030,0.030,0.030,0.030,0.030,
0.030,0.030,0.030,0.030,NA,NA,NA,0.027,0.027,NA,NA)
sumCount <- c(NA,NA,0.000,0.030,0.042,0.045,0.056,0.056,0.067,0.084,0.077,
0.071,0.098,0.112,0.123,NA,NA,NA,0.000,-0.019,NA,NA)
xtsData <- xts(cbind(returns,count,maxCount,sumCount),date)
I have no idea how to construct the max and cumSum columns, especially since each count series is of an irregular length. Since I won't always know the start and end points of a count series, I'm lost at trying to figure out the index of these groups. Thanks for your help!
UPDATE: here is my for loop attempting to calculate cumSum. It doesn't compute the cumulative sum yet, it just pulls out the returns involved; I'm still unsure how to apply functions to these ranges!
xtsData <- cbind(xtsData, mySumCount = NA)
# find groups of returns
for (i in 1:nrow(xtsData)) {
  if (!is.na(xtsData[i, "count"])) {
    xtsData[i, "mySumCount"] <- xtsData[i, "returns"]
  } else {
    xtsData[i, "mySumCount"] <- NA
  }
}
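For reference, a vectorized sketch that does the same as this loop (assuming the xtsData object built above):
xtsData$mySumCount <- xtsData$returns
xtsData$mySumCount[is.na(xtsData$count)] <- NA   # blank out rows outside any count series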
UPDATE 2: thank you commenters!
# report returns when not NA count
x1 <- xtsData[!is.na(xtsData$count),"returns"]
# cum sum is close, but still need to exclude the first element
# -0.009 in the first series of counts and .027 in the second series of counts
x2 <- cumsum(xtsData[!is.na(xtsData$count),"returns"])
# this output is not accurate because 0.03 is repeated down the entire column, not just during periods when count != NA. Is this just a rounding error?
x3 <- max(xtsData[!is.na(xtsData$count),"returns"])
SOLUTION:
# function to pad a vector with a 0
lagpad <- function(x, k) {
c(rep(0, k), x)[1 : length(x)]
}
# group the counts
x1 <- na.omit(transform(xtsData, g = cumsum(c(0, diff(!is.na(count)) == 1))))
# cumulative sum of the count series
z1 <- transform(x1, cumsumRet = ave(returns, g, FUN = function(x) cumsum(replace(x, 1, 0))))
# max of the count series
z2 <- transform(x1, maxRet = ave(returns, g, FUN = function(x) max(lagpad(x, 1))))
merge(xtsData, z1$cumsumRet, z2$maxRet)
The code shown is not consistent with the output in the image, and no explanation is provided, so it's not clear what manipulations were wanted; however, the question did mention that the main problem is distinguishing the groups, so we will address that.
To do that we compute a new column g whose rows contain 1 for the first group, 2 for the second and so on. We also remove the NA rows since the g column is sufficient to distinguish groups.
The following code computes a vector the same length as count. First each NA position is set to FALSE and each non-NA position to TRUE. Then each position of that vector is differenced with the prior position, implicitly converting FALSE to 0 and TRUE to 1. Comparing the differences to 1 gives a logical vector that is TRUE exactly where a new group starts. Since the first component of the differenced vector has no prior position, we prepend a 0, which implicitly converts the TRUE and FALSE values just generated back to 1 and 0. Taking the cumsum then fills the first group with 1, the second with 2, and so on. Finally we omit the NA rows:
x <- na.omit(transform(x, g = cumsum(c(0, diff(!is.na(count)) == 1))))
giving:
> x
returns count maxCount sumCount g
2010-11-26 -0.009 1 0.030 0.000 1
2010-12-03 0.030 1 0.030 0.030 1
2010-12-10 0.013 2 0.030 0.042 1
2010-12-17 0.003 2 0.030 0.045 1
2010-12-24 0.010 3 0.030 0.056 1
2010-12-31 0.001 4 0.030 0.056 1
2011-01-07 0.011 5 0.030 0.067 1
2011-01-14 0.017 6 0.030 0.084 1
2011-01-21 -0.008 7 0.030 0.077 1
2011-01-28 -0.005 7 0.030 0.071 1
2011-02-04 0.027 7 0.030 0.098 1
2011-02-11 0.014 7 0.030 0.112 1
2011-02-18 0.010 7 0.030 0.123 1
2011-03-18 0.027 1 0.027 0.000 2
2011-03-25 -0.019 2 0.027 -0.019 2
attr(,"na.action")
2010-11-18 2010-11-19 2011-02-25 2011-03-04 2011-03-11 2011-03-26 2011-03-27
1 2 16 17 18 21 22
attr(,"class")
[1] "omit"
You can now use ave to perform any calculations you like. For example to take cumulative sums of returns by group:
transform(x, cumsumRet = ave(returns, g, FUN = cumsum))
Replace cumsum with any other function that is suitable for use with ave.
Ah, so "count" are the groups and you want the cumsum per group and the max per group. I think in data.table, so here is how I would do it.
library(xts)
library(data.table)
date <- as.Date(c("2010-11-18", "2010-11-19", "2010-11-26", "2010-12-03", "2010-12-10",
"2010-12-17", "2010-12-24", "2010-12-31", "2011-01-07", "2011-01-14",
"2011-01-21", "2011-01-28", "2011-02-04", "2011-02-11", "2011-02-18",
"2011-02-25", "2011-03-04", "2011-03-11", "2011-03-18", "2011-03-25",
"2011-03-26", "2011-03-27"))
returns <- c(0.002,0.000,-0.009,0.030, 0.013,0.003,0.010,0.001,0.011,0.017,
-0.008,-0.005,0.027,0.014,0.010,-0.017,0.001,-0.013,0.027,-0.019,
0.000,0.001)
count <- c(NA,NA,1,1,2,2,3,4,5,6,7,7,7,7,7,NA,NA,NA,1,2,NA,NA)
maxCount <- c(NA,NA,0.030,0.030,0.030,0.030,0.030,0.030,0.030,0.030,0.030,
0.030,0.030,0.030,0.030,NA,NA,NA,0.027,0.027,NA,NA)
sumCount <- c(NA,NA,0.000,0.030,0.042,0.045,0.056,0.056,0.067,0.084,0.077,
0.071,0.098,0.112,0.123,NA,NA,NA,0.000,-0.019,NA,NA)
DT <- data.table(date, returns, count)
DT[!is.na(count), max := max(returns), by = count]
DT[!is.na(count), cumSum := cumsum(returns), by = count]
# if you need an xts object at the end:
xtsData <- xts(cbind(DT$returns, DT$count, DT$max, DT$cumSum), DT$date)
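One caveat: by = count pools rows from different series that happen to share a count value (both series here reach count == 1 and 2). A hedged sketch that groups by runs of non-NA counts instead, using data.table's rleid():
DT[, grp := rleid(is.na(count))]   # one id per run of NA / non-NA rows
DT[!is.na(count), max := max(returns), by = grp]
DT[!is.na(count), cumSum := cumsum(returns), by = grp]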
The acf function in the stats package returns a complex object. For example
x = rnorm(1000, mean=100, sd=10)
acf(x)
returns a plot. If I do
acf_x = acf(x)
acf_x
it returns
Autocorrelations of series ‘x’, by lag
0 1 2 3 4 5 6 7 8 9 10 11
1.000 0.000 -0.031 -0.002 -0.052 0.017 -0.014 0.030 0.011 0.002 -0.044 0.000
12 13 14 15 16 17 18 19 20 21 22 23
0.055 -0.007 0.049 0.025 -0.027 -0.048 0.033 0.027 0.043 -0.007 -0.010 0.025
24 25 26 27 28 29 30
-0.083 0.045 -0.074 0.016 0.041 -0.046 0.010
If I look at class(acf_x) it returns 'acf'.
How do I extract the autocorrelation versus lag into a data_frame?
More generally, when presented with a function that returns a complex object, how do I extract the data from it, i.e. is there a general pattern for this type of function?
If you look at the help page for acf via ?acf, you'll see under "Value" what the output will look like.
In this case, the acf object is a list with several elements.
If you e.g. want the lags, you can simply access this via:
my_lags <- acf_x$lag
Deschen's answer to the original question gives the general response to "how do I discover the elements in a complex model object": str(). One can also use the names() function for S3 objects, where the result lists the names one can use to extract elements from the list with the $ or [[ forms of the extract operator.
set.seed(95014)
x = rnorm(1000, mean=100, sd=10)
acf_x <- acf(x)
names(acf_x)
> names(acf_x)
[1] "acf" "type" "n.used" "lag" "series" "snames"
>
Since the acf and lag elements are stored as arrays, we'll need to extract just the first dimension to obtain a simple vector. We can accomplish this by chaining the [ form of the extract operator onto the object that is generated by the [[ extract on the model object.
head(acf_x[["acf"]][,1,1]) # second extract returns a simple vector
> head(acf_x[["acf"]][,1,1])
[1] 1.000000000 -0.034863150 0.037745441 -0.020464290 -0.004974406
[6] 0.016770363
In this case R performs the extraction left to right - first acf_x[["acf"]] is evaluated, and then [,1,1] is applied to the result.
For the concrete part of the question, "how do I create a data frame with this data?", one can build a data frame from the output of acf() as follows.
set.seed(95014)
x = rnorm(1000, mean=100, sd=10)
acf_x <- acf(x)
results <- data.frame(acf_value = acf_x$acf[,1,1],
acf_lag = acf_x$lag[,1,1])
head(results)
...and the output:
> head(results)
acf_value acf_lag
1 1.000000000 0
2 -0.034863150 1
3 0.037745441 2
4 -0.020464290 3
5 -0.004974406 4
6 0.016770363 5
Try
str(acf_x)
or
print.default(acf_x)
This will give you an idea of how the object looks internally and how to access its elements.
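For example, str() on the acf object above prints roughly the following (the acf numbers depend on the seed used):
str(acf_x)
# List of 6
#  $ acf   : num [1:31, 1, 1] 1 -0.0349 0.0377 -0.0205 -0.005 ...
#  $ type  : chr "correlation"
#  $ n.used: int 1000
#  $ lag   : num [1:31, 1, 1] 0 1 2 3 4 5 6 7 8 9 ...
#  $ series: chr "x"
#  $ snames: NULL
#  - attr(*, "class")= chr "acf"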
I need to create a 'time' column, starting at 0 and increasing in increments of 0.0005. The length of the column should depend on the length of the existing columns. What I have tried so far is below.
So in my head, the script below says: create a column with 0 and 0.0005 as data points 1 and 2, cumulatively add the difference between data points 1 and 2, and repeat for the length of the specified column. This doesn't really work, hence why I am posting here. If anyone has some sage advice, it would be greatly appreciated.
df$time = c(0,0.0005, cumsum(diff(df$time [1:2], lag = 1)), length(df$other.column))
Expected outcome
time
0
0.0005
0.001
0.0015
0.002
0.0025
0.003
0.0035
0.004
0.0045
0.005
0.0055
0.006
0.0065
0.007
0.0075
0.008
0.0085
0.009
0.0095
etc
We can multiply 0.0005 by the sequence of row numbers, minus one so that the series starts at 0:
df$time <- (seq_len(nrow(df)) - 1) * 0.0005
data
df <- data.frame(a = 1:10)
It sounds like you just want the following sequence:
seq(0, 0.1, by=0.0005)
You may replace the from and to values with whatever you want via:
seq(from, to, by=0.0005)
You could use seq, specifying the length.out parameter as the number of rows of the dataframe.
df <- data.frame(a = 1:10)
df$time <- seq(0, by = 0.0005, length.out = nrow(df))
df
# a time
#1 1 0.0000
#2 2 0.0005
#3 3 0.0010
#4 4 0.0015
#5 5 0.0020
#6 6 0.0025
#7 7 0.0030
#8 8 0.0035
#9 9 0.0040
#10 10 0.0045
I want to apply two different formulas to four columns of my dataframe df. I have done this manually, but since my original data frame has several columns, I want to use loops or case_when to do this faster.
Here's how sample dataframe df looks like:
A B C D
20 100 4 1200
40 150 6 2300
34 200 3 1230
32 225 9 1100
12 220 10 1000
Formula 1:
(x-min(x))/(max(x)-min(x))
Formula 2:
(min(x)-x)/(max(x)-min(x))
I'd like to apply formula 1 on columns B and D and formula 2 on columns A and C.
After applying the formula, I want to store the values in a different dataframe but with the same column names.
Here's what I did:
formula_1 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
formula_2 <- function(x) {
  (min(x) - x) / (max(x) - min(x))
}
Then I create an empty dataframe BI_score and fill in each column:
BI_score$B <- formula_1(df$B)
BI_score$D <- formula_1 (df$D)
BI_score$A <- formula_2 (df$A)
BI_score$C <- formula_2 (df$C)
EDIT
As there are some NA and Inf values, and we want to exclude them from the calculation, we can handle this by updating the functions as below and then applying them to the columns as shown previously.
formula_1 <- function(x) {
  temp <- x[is.finite(x)]
  replace(x, is.finite(x), (temp - min(temp)) / (max(temp) - min(temp)))
}
formula_2 <- function(x) {
  temp <- x[is.finite(x)]
  replace(x, is.finite(x), (min(temp) - temp) / (max(temp) - min(temp)))
}
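A quick sanity check of the updated formula_1 on a toy vector containing an NA and an Inf (the non-finite entries pass through untouched; the finite ones are scaled):
formula_1(c(1, NA, 3, Inf, 5))
# [1] 0.0  NA 0.5 Inf 1.0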
The most straightforward approach would be to use lapply to apply the function separately to the selected columns.
BI_score <- df
fm1_cols <- c("B", "D")
fm2_cols <- c("A", "C")
BI_score[fm1_cols] <- lapply(df[fm1_cols], formula_1)
BI_score[fm2_cols] <- lapply(df[fm2_cols], formula_2)
BI_score
# A B C D
#1 -0.29 0.00 -0.14 0.154
#2 -1.00 0.40 -0.43 1.000
#3 -0.79 0.80 0.00 0.177
#4 -0.71 1.00 -0.86 0.077
#5 0.00 0.96 -1.00 0.000
As mentioned by @Sotos, if you want to apply the functions to alternate columns you could do
BI_score[c(TRUE, FALSE)] <- lapply(df[c(TRUE, FALSE)], formula_1)
BI_score[c(FALSE, TRUE)] <- lapply(df[c(FALSE, TRUE)], formula_2)
Just for fun, an approach using dplyr:
library(dplyr)
bind_cols(df %>% select(fm1_cols) %>% mutate_all(formula_1),
df %>% select(fm2_cols) %>% mutate_all(formula_2))
If your goal is to apply the two functions to alternating columns, you can do it via logical indexing:
cbind.data.frame(sapply(df[c(TRUE, FALSE)], formula_2),
sapply(df[c(FALSE, TRUE)], formula_1))
# A C B D
#1 -0.2857143 -0.1428571 0.00 0.15384615
#2 -1.0000000 -0.4285714 0.40 1.00000000
#3 -0.7857143 0.0000000 0.80 0.17692308
#4 -0.7142857 -0.8571429 1.00 0.07692308
#5 0.0000000 -1.0000000 0.96 0.00000000
We can use mutate_at from dplyr
library(dplyr)
df %>%
mutate_at(vars(B, D), formula_1) %>%
mutate_at(vars(A, C), formula_2)
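In more recent dplyr versions (1.0 and later), mutate_at is superseded by across(); a sketch of the equivalent call:
library(dplyr)
df %>%
  mutate(across(c(B, D), formula_1),
         across(c(A, C), formula_2))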
To give a small working example, suppose I have the following data frame:
library(dplyr)
country <- rep(c("A", "B", "C"), each = 6)
year <- rep(c(1,2,3), each = 2, times = 3)
categ <- rep(c(0,1), times = 9)
pop <- rep(c(NA, runif(n=8)), each=2)
money <- runif(18)+100
df <- data.frame(Country = country,
Year = year,
Category = categ,
Population = pop,
Money = money)
Now the data I'm actually working with has many more repetitions, namely for every country, year, and category, there are many repeated rows corresponding to various sources of money, and I want to sum these all together. However, for now it's enough just to have one row for each country, year, and category, and just trivially apply the sum() function on each row. This will still exhibit the behavior I'm trying to get rid of.
Notice that for country A in year 1, the population listed is NA. Therefore when I run
aggregate(Money ~ Country+Year+Category+Population, df, sum)
the resulting data frame has dropped the rows corresponding to country A and year 1. I'm only using the ...+Population... bit of code because I want the output data frame to retain this column.
I'm wondering how to make the aggregate() function not drop things that have NAs in the columns by which the grouping occurs--it'd be nice if, for instance, the NAs themselves could be treated as values to group by.
My attempts: I tried turning the Population column into factors, but that didn't change the behavior. I read something about the na.action argument, but neither na.action=NULL nor na.action=na.skip changed the behavior. I thought about turning all the NAs into 0s, and I can't think of what that would hurt, but it feels like a hack that might bite me later on. I'm also not sure how I would do it: when I wrote a function with is.na() in it, the if (is.na(x)) test wasn't applied in a vectorized way, and R warned that it would just use the first element of the vector. I thought about using lapply() on the column, coercing the result back to a vector, and sticking that in the column, but that also sounds hacky and needlessly round-about.
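For reference, replacing the NAs with 0s needs no if() at all; a vectorized replacement does it in one line:
df$Population[is.na(df$Population)] <- 0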
The solution here seemed to be about keeping the NA values out of the data frame in the first place, which I can't do: Aggregate raster in R with NA values
Since you already load dplyr before your data, you can use the dplyr::summarise function. group_by supports grouping on NA values.
library(dplyr)
df %>% group_by(Country,Year,Category,Population) %>%
summarise(Money = sum(Money))
# # A tibble: 18 x 5
# # Groups: Country, Year, Category [?]
# Country Year Category Population Money
# <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 A 1.00 0 NA 101
# 2 A 1.00 1.00 NA 100
# 3 A 2.00 0 0.482 101
# 4 A 2.00 1.00 0.482 101
# 5 A 3.00 0 0.600 101
# 6 A 3.00 1.00 0.600 101
# 7 B 1.00 0 0.494 101
# 8 B 1.00 1.00 0.494 101
# 9 B 2.00 0 0.186 100
# 10 B 2.00 1.00 0.186 100
# 11 B 3.00 0 0.827 101
# 12 B 3.00 1.00 0.827 101
# 13 C 1.00 0 0.668 100
# 14 C 1.00 1.00 0.668 101
# 15 C 2.00 0 0.794 100
# 16 C 2.00 1.00 0.794 100
# 17 C 3.00 0 0.108 100
# 18 C 3.00 1.00 0.108 100
Note: The OP's sample data doesn't have multiple rows for same groups. Hence, number of summarized rows will be same as actual rows.
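If you'd rather stay with base R's aggregate(), one workaround (a sketch, assuming Population may be treated as a factor for grouping) is to make NA an explicit factor level with addNA(), so the formula method no longer drops those rows:
df2 <- transform(df, Population = addNA(factor(Population)))
aggregate(Money ~ Country + Year + Category + Population, df2, sum)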
I have a dataframe with multiple columns and multiple rows, based on monthly observations over a period of roughly ten years. I want to take the sum of each column based on the observations for the previous 12 months. For example, the sum for Jan-05 is based on the observations from Jan-04 to Dec-04, the sum for Feb-05 on the observations from Feb-04 to Jan-05, and so on.
I illustrate part of my dataframe as follows:
df1
Month A B C
Jan-04 0.003 0.006 NA
Feb-04 0.003 0.002 NA
Mar-04 -0.005 -0.001 NA
Apr-04 0.000 0.000 NA
May-04 0.000 -0.002 NA
Jun-04 -0.001 -0.001 NA
Jul-04 -0.001 -0.001 NA
Aug-04 -0.010 NA NA
Sep-04 0.001 NA NA
Oct-04 0.002 NA NA
Nov-04 -0.003 NA NA
Dec-04 -0.003 NA NA
Jan-05 0.005 -0.002 NA
Feb-05 -0.0015 0.004 0.0003
Mar-05 -0.0041 0.002 0.0070
The desired resultant dataframe
Month A B C
Jan-05 -0.013 0.004 NA
Feb-05 -0.011 -0.004 NA
Mar-05 -0.0151 -0.0014 0.0003
Here is a solution in base R. First we define a function that subsets the df based on the time difference from the date of interest and computes the column sums on that subset; then we run that function for each time point of interest.
subset_last_year <- function(df, date, cols_to_sum = c("A", "B", "C")) {
  date <- as.POSIXct(date, format = "%d-%b-%y")
  df$Time_Difference <- difftime(date, df$Month_Date, units = "weeks")
  df_last_year <- df[df$Time_Difference > 0 & df$Time_Difference < 53, ]
  colSums(df_last_year[, cols_to_sum], na.rm = TRUE)
}
# as.POSIXct needs a complete date, so oddly you have to add a day
df$Month_Date = paste0("01-", df$Month)
df$Month_Date = as.POSIXct(df$Month_Date, format = "%d-%b-%y")
# not worried about performance because the data set is not that large
dates <- c("01-Jan-05", "01-Feb-05", "01-Mar-05")
res <- data.frame()
for (i in seq_along(dates)) {
  tmp <- subset_last_year(df, dates[i])
  res <- rbind(res, tmp)
}
rownames(res) <- dates
colnames(res) <- c("A", "B", "C")
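An alternative sketch using zoo's rollapplyr, assuming the rows are consecutive months in order (column names as in df1; NA observations are dropped from each window via na.rm):
library(zoo)
m <- as.matrix(df1[c("A", "B", "C")])
# 12-month sums for the window ending at each row; the first 11 rows are NA
roll12 <- rollapplyr(m, width = 12, FUN = sum, na.rm = TRUE, fill = NA)
# shift down one row so each month sums the previous 12 months
prior12 <- rbind(NA, head(roll12, -1))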