Sum of columns of dataframe based on time period in R - r

I have a dataframe with multiple columns and multiple rows, holding monthly observations over a period of 11 years. Now I want to take the sum of each column based on the observations for the previous 12 months. For example, the sum for Jan-05 is based on the observations from Jan-04 to Dec-04, the sum for Feb-05 on the observations from Feb-04 to Jan-05, and so on. My original dataframe contains the full monthly history.
I illustrate part of my dataframe as follows:
df1
Month A B C
Jan-04 0.003 0.006 NA
Feb-04 0.003 0.002 NA
Mar-04 -0.005 -0.001 NA
Apr-04 0.000 0.000 NA
May-04 0.000 -0.002 NA
Jun-04 -0.001 -0.001 NA
Jul-04 -0.001 -0.001 NA
Aug-04 -0.010 NA NA
Sep-04 0.001 NA NA
Oct-04 0.002 NA NA
Nov-04 -0.003 NA NA
Dec-04 -0.003 NA NA
Jan-05 0.005 -0.002 NA
Feb-05 -0.0015 0.004 0.0003
Mar-05 -0.0041 0.002 0.0070
The desired resultant dataframe
Month A B C
Jan-05 -0.013 0.004 NA
Feb-05 -0.011 -0.004 NA
Mar-05 -0.0151 -0.0014 0.0003

Here is a solution in base R. First we define a function that subsets the df to the 12 months preceding a given date and takes the column sums of that subset; then we run that function for each date of interest.
subset_last_year <- function(df, date, cols_to_sum = c("A", "B", "C")){
  date <- as.POSIXct(date, format = "%d-%b-%y")
  df$Time_Difference <- difftime(date, df$Month_Date, units = "weeks")
  df_last_year <- df[df$Time_Difference > 0 & df$Time_Difference < 53, ]
  tmp_col_sum <- colSums(df_last_year[ , cols_to_sum], na.rm = TRUE)
  return(tmp_col_sum)
}
# as.POSIXct needs a complete date, so prepend a day of month
df$Month_Date <- paste0("01-", df$Month)
df$Month_Date <- as.POSIXct(df$Month_Date, format = "%d-%b-%y")
# not worried about performance because the data set is not that large
dates <- c("01-Jan-05", "01-Feb-05", "01-Mar-05")
res <- data.frame()
for(i in seq_along(dates)){
  tmp <- subset_last_year(df, dates[i])
  res <- rbind(res, tmp)
}
rownames(res) <- dates
colnames(res) <- c("A", "B", "C")
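If the rows are already in chronological monthly order with no gaps, a rolling-window alternative with the zoo package is a compact sketch. Note that, as with colSums(na.rm = TRUE) above, an all-NA window sums to 0 here rather than the NA shown in the desired output:
library(zoo)  # rollapplyr() for right-aligned rolling windows
# 12-month rolling sum, shifted down one row so each month sees the
# sum of the 12 months before it
prev_12m_sum <- function(x) {
  s <- rollapplyr(x, width = 12, FUN = sum, na.rm = TRUE, fill = NA)
  c(NA, head(s, -1))
}
res2 <- data.frame(Month = df$Month, lapply(df[c("A", "B", "C")], prev_12m_sum))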

Related

R - Help to build time column from 0?

I need to create a 'time' column, starting at 0 and adding increments of 0.0005. The length of the column should depend on the length of existing columns. What I have tried so far is below.
So in my head, the below script says: create a column with 0 and 0.0005 as data points 1 and 2, cumulatively add the difference between data points 1 and 2, and repeat for the length of the specified column. This doesn't really work, hence why I am posting here. If anyone has some sage advice, it would be greatly appreciated.
df$time = c(0,0.0005, cumsum(diff(df$time [1:2], lag = 1)), length(df$other.column))
Expected outcome
time
0
0.0005
0.001
0.0015
0.002
0.0025
0.003
0.0035
0.004
0.0045
0.005
0.0055
0.006
0.0065
0.007
0.0075
0.008
0.0085
0.009
0.0095
etc
We can multiply 0.0005 by the zero-based sequence of row numbers:
df$time <- (seq_len(nrow(df)) - 1) * 0.0005
data
df <- data.frame(a = 1:10)
It sounds like you just want the following sequence:
seq(0, 0.1, by=0.0005)
You may replace the from and to values with whatever you want via:
seq(from, to, by=0.0005)
You could use seq by specifying the length.out parameter as the number of rows of the dataframe.
df <- data.frame(a = 1:10)
df$time <- seq(0, by = 0.0005, length.out = nrow(df))
df
# a time
#1 1 0.0000
#2 2 0.0005
#3 3 0.0010
#4 4 0.0015
#5 5 0.0020
#6 6 0.0025
#7 7 0.0030
#8 8 0.0035
#9 9 0.0040
#10 10 0.0045

Using regex to set a specific digit to NA? [duplicate]

Sample of df:
LASSO_deviance LASSO_AUC
68 0.999 0.999
2 1.000 1.000
39 1.000 1.005
7 1.02 1.2
I want to set cells which contain 1.000 to either NA or 0, in that order of preference.
I've tried something like: df %>% mutate_at(vars(LASSO_deviance, LASSO_AUC), funs(gsub(pattern = "1{1}[^.{1,}]", 0, x = .))) with no luck.
tt <- "LASSO_deviance LASSO_AUC
68 0.999 0.999
2 1.000 1.000
39 1.000 1.005
7 1.02 1.2"
dat <- read.table(text = tt, header = T)
No need for regex, because you can simply find where your data is equal to 1.000:
dat[dat == 1.000] <- NA # or dat[dat == 1.000] <- 0
dat
# LASSO_deviance LASSO_AUC
# 68 0.999 0.999
# 2 NA NA
# 39 NA 1.005
# 7 1.020 1.200
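For completeness, here is a sketch of the tidyverse route the question was attempting, with the superseded mutate_at()/funs() replaced by the current across()/na_if() idiom:
library(dplyr)
# na_if() replaces exact matches of 1 with NA; swap in
# if_else(.x == 1, 0, .x) if you prefer 0
dat %>%
  mutate(across(c(LASSO_deviance, LASSO_AUC), ~ na_if(.x, 1)))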

R Populate a vector by matching names to df column values

I have a named vector filled with zeros
toy1<- rep(0, length(37:45))
names(toy1) <- 37:45
I want to populate the vector with count data from a dataframe
size count
37 1.181
38 0.421
39 0.054
40 0.005
41 0.031
42 0.582
45 0.024
I need help finding a way to match the value for size to the vector name and then input the corresponding count value into that vector position
Might be as simple as:
toy1[ as.character(dat$size) ] <- dat$count
toy1
# 37 38 39 40 41 42 43 44 45
#1.181 0.421 0.054 0.005 0.031 0.582 0.000 0.000 0.024
R's indexing for assignments can have character values. If you had just tried to index with the raw column:
toy1[ dat$size ] <- dat$count
You would have gotten (as did I initially):
> toy1
37 38 39 40 41 42 43 44 45
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1.181 0.421
0.054 0.005 0.031 0.582 NA NA 0.024
That happened because the raw column was used for numeric indexing, and the vector was silently extended to accommodate indices up to 45.
With a version of the dataframe that had a number that was not in the range 37:45, I did get a warning from using match with a nomatch of 0, but I also got the expected results:
toy1[ match( as.character( dat$size), names(toy1) , nomatch=0) ] <- dat$count
#------------
Warning message:
In toy1[match(as.character(dat$size), names(toy1), nomatch = 0)] <- dat$count :
number of items to replace is not a multiple of replacement length
> toy1
37 38 39 40 41 42 43 44 45
1.181 0.421 0.054 0.005 0.031 0.582 0.000 0.000 0.000
The match function is at the core of merge, but this application would be much faster than a merge of dataframes.
Let's say your data frame is df; then you can just update the records in toy1 for the records available in your data frame:
toy1[as.character(df$size)] <- df$count
Edit: to check for matches before updating the records, first compute m, the matched indices in the size column of df:
m <- match(names(toy1), as.character(df$size))
Then, for the indices in toy1 which have a match, it can be updated as below:
toy1[which(!is.na(m))] <- df$count[m[!is.na(m)]]
PS: an efficient way would be to define toy1 as a data frame and perform an outer join on the size column.
First, let's get the data loaded in.
toy1<- rep(0, length(37:45))
names(toy1) <- 37:45
df = read.table(text="37 1.181
38 0.421
39 0.054
40 0.005
41 0.031
42 0.582
45 0.024")
names(df) = c("size","count")
Now, I present a really ugly solution. We only update toy1 where the name of toy1 appears in df$size. We return df$count by obtaining the index of the match in df. I use sapply to get a vector of indices back. On both sides we only look for places where names(toy1) appear in df$size.
toy1[names(toy1) %in% df$size] <-
  df$count[sapply(names(toy1)[names(toy1) %in% df$size],
                  function(x){which(x == df$size)})]
But, this isn't very elegant. Instead, you could turn toy1 into a data.frame.
toydf = data.frame(toy1 = toy1,name = names(toy1),stringsAsFactors = FALSE)
Now, we can use merge to get the values.
updated = merge(toydf,df,by.x = "name",by.y="size",all.x=T)
This returns a three-column data.frame. You can then extract the count column from it, replace NA with 0, and you're done.
updated$count[is.na(updated$count)] = 0
updated$count
#> [1] 1.181 0.421 0.054 0.005 0.031 0.582 0.000 0.000 0.024

R: aggregating time series groups of irregular length

I think this is a split-apply-combine problem, but with a time series twist. My data consists of irregular counts and I need to perform some summary statistics on each group of counts. Here's the data for your console:
library(xts)
date <- as.Date(c("2010-11-18", "2010-11-19", "2010-11-26", "2010-12-03", "2010-12-10",
"2010-12-17", "2010-12-24", "2010-12-31", "2011-01-07", "2011-01-14",
"2011-01-21", "2011-01-28", "2011-02-04", "2011-02-11", "2011-02-18",
"2011-02-25", "2011-03-04", "2011-03-11", "2011-03-18", "2011-03-25",
"2011-03-26", "2011-03-27"))
returns <- c(0.002,0.000,-0.009,0.030, 0.013,0.003,0.010,0.001,0.011,0.017,
-0.008,-0.005,0.027,0.014,0.010,-0.017,0.001,-0.013,0.027,-0.019,
0.000,0.001)
count <- c(NA,NA,1,1,2,2,3,4,5,6,7,7,7,7,7,NA,NA,NA,1,2,NA,NA)
maxCount <- c(NA,NA,0.030,0.030,0.030,0.030,0.030,0.030,0.030,0.030,0.030,
0.030,0.030,0.030,0.030,NA,NA,NA,0.027,0.027,NA,NA)
sumCount <- c(NA,NA,0.000,0.030,0.042,0.045,0.056,0.056,0.067,0.084,0.077,
0.071,0.098,0.112,0.123,NA,NA,NA,0.000,-0.019,NA,NA)
xtsData <- xts(cbind(returns,count,maxCount,sumCount),date)
I have no idea how to construct the max and cumSum columns, especially since each count series is of irregular length. Since I won't always know the start and end points of a count series, I'm lost trying to figure out the indices of these groups. Thanks for your help!
UPDATE: here is my for loop attempting to calculate cumSum. It's not the cumulative sum yet, just the returns it needs; I'm still unsure how to apply functions to these ranges!
xtsData <- cbind(xtsData, mySumCount = NA)
# find groups of returns
for(i in 1:nrow(xtsData)){
  if(is.na(xtsData[i, "count"]) == FALSE){
    xtsData[i, "mySumCount"] <- xtsData[i, "returns"]
  } else {
    xtsData[i, "mySumCount"] <- NA
  }
}
UPDATE 2: thank you commenters!
# report returns when not NA count
x1 <- xtsData[!is.na(xtsData$count),"returns"]
# cum sum is close, but still need to exclude the first element
# -0.009 in the first series of counts and .027 in the second series of counts
x2 <- cumsum(xtsData[!is.na(xtsData$count),"returns"])
# this output is not accurate because .03 is displayed down the entire column, not just during periods when count != NA. Is this just a rounding error?
x3 <- max(xtsData[!is.na(xtsData$count),"returns"])
SOLUTION:
# function to left-pad a vector with k zeros, keeping its length
lagpad <- function(x, k) {
  c(rep(0, k), x)[1:length(x)]
}
# group the counts
x1 <- na.omit(transform(xtsData, g = cumsum(c(0, diff(!is.na(count)) == 1))))
# cumulative sum of the count series
z1 <- transform(x1, cumsumRet = ave(returns, g, FUN =function(x) cumsum(replace(x, 1, 0))))
# max of the count series
z2 <- transform(x1, maxRet = ave(returns, g, FUN =function(x) max(lagpad(x,1))))
merge(xtsData,z1$cumsumRet,z2$maxRet)
The code shown is not consistent with the output shown in the question, and there is no explanation provided, so it's not clear what manipulations were wanted; however, the question did mention that the main problem is distinguishing the groups, so we will address that.
To do that we compute a new column g whose rows contain 1 for the first group, 2 for the second and so on. We also remove the NA rows since the g column is sufficient to distinguish groups.
The following code computes a vector the same length as count by first setting each NA position to FALSE and each non-NA position to TRUE. It then differences each position of that vector with the prior position. To do that it implicitly converts FALSE to 0 and TRUE to 1 and then performs the differencing. Next we convert this last result to a logical vector which is TRUE for each 1 component and FALSE otherwise. Since the first component of the vector that is differenced has no prior position we prepend 0 for that. The prepending operation implicitly converts the TRUE and FALSE values just generated to 1 and 0 respectively. Taking the cumsum fills in the first group with 1, the second with 2 and so on. Finally omit the NA rows:
x <- na.omit(transform(x, g = cumsum(c(0, diff(!is.na(count)) == 1))))
giving:
> x
returns count maxCount sumCount g
2010-11-26 -0.009 1 0.030 0.000 1
2010-12-03 0.030 1 0.030 0.030 1
2010-12-10 0.013 2 0.030 0.042 1
2010-12-17 0.003 2 0.030 0.045 1
2010-12-24 0.010 3 0.030 0.056 1
2010-12-31 0.001 4 0.030 0.056 1
2011-01-07 0.011 5 0.030 0.067 1
2011-01-14 0.017 6 0.030 0.084 1
2011-01-21 -0.008 7 0.030 0.077 1
2011-01-28 -0.005 7 0.030 0.071 1
2011-02-04 0.027 7 0.030 0.098 1
2011-02-11 0.014 7 0.030 0.112 1
2011-02-18 0.010 7 0.030 0.123 1
2011-03-18 0.027 1 0.027 0.000 2
2011-03-25 -0.019 2 0.027 -0.019 2
attr(,"na.action")
2010-11-18 2010-11-19 2011-02-25 2011-03-04 2011-03-11 2011-03-26 2011-03-27
1 2 16 17 18 21 22
attr(,"class")
[1] "omit"
You can now use ave to perform any calculations you like. For example to take cumulative sums of returns by group:
transform(x, cumsumRet = ave(returns, g, FUN = cumsum))
Replace cumsum with any other function that is suitable for use with ave.
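For example, a running maximum of returns within each group:
transform(x, cummaxRet = ave(returns, g, FUN = cummax))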
Ah, so "count" are the groups and you want the cumsum per group and the max per group. I think in data.table, so here is how I would do it.
library(xts)
library(data.table)
# (date, returns, and count are as defined in the question above)
DT <- data.table(date, returns, count)
DT[!is.na(count), max := max(returns), by = count]
DT[!is.na(count), cumSum := cumsum(returns), by = count]
# if you need an xts object at the end:
xtsData <- xts(cbind(DT$returns, DT$count, DT$max, DT$cumSum), DT$date)

Parsing a matrix with three columns (Date, Parameter, Result) into a matrix with a column for each parameter at a specified date in R

I have a matrix that has 3 columns and 47,772 rows. Within the rows there are 64 parameters.
Currently the data frame looks like:
SAMPLE_DATE PARAMETER RESULT
8/2/1954 Alkalinity, total as CaCO3(mg/L) 112.5
8/2/1954 Depth, Secchi disk depth(m) 2.44
8/2/1954 Nutrient-nitrogen as N(mg/L) 0.87
8/2/1954 Phosphorus as P(mg/L) 0.001
8/2/1954 Sulfate as SO4(mg/L) 11
3/7/1962 Alkalinity, total as CaCO3(mg/L) 140
3/7/1962 Alkalinity, total as CaCO3(mg/L) 320
3/7/1962 Alkalinity, total as CaCO3(mg/L) 130
3/7/1962 Ammonia-nitrogen as N(mg/L) 0.02
3/7/1962 Ammonia-nitrogen as N(mg/L) 0.26
3/7/1962 Ammonia-nitrogen as N(mg/L) 0.02
3/7/1962 Apparent color(PCU) 10
3/7/1962 Apparent color(PCU) 10
....
and I want to transform it into something that looks like:
Date Alkalinity, total as CaCO3(mg/L) Depth, Secchi disk depth(m).....etc
8/2/1954 112.5 2.44 ..... etc
note: not every date has every parameter
Any Ideas?
Here's one approach. I've added an ".id" variable since there are duplicated "SAMPLE_DATE" + "PARAMETER" combinations.
library(reshape2) # for dcast
library(splitstackshape) # for getanID
x2 <- getanID(x, id.vars = c("SAMPLE_DATE", "PARAMETER"))
dcast(x2, .id + SAMPLE_DATE ~ PARAMETER, value.var = "RESULT")
# .id SAMPLE_DATE Alkalinity, total as CaCO3(mg/L) Ammonia-nitrogen as N(mg/L)
# 1 1 3/7/1962 140.0 0.02
# 2 1 8/2/1954 112.5 NA
# 3 2 3/7/1962 320.0 0.26
# 4 3 3/7/1962 130.0 0.02
# Apparent color(PCU) Depth, Secchi disk depth(m) Nutrient-nitrogen as N(mg/L)
# 1 10 NA NA
# 2 NA 2.44 0.87
# 3 10 NA NA
# 4 NA NA NA
# Phosphorus as P(mg/L) Sulfate as SO4(mg/L)
# 1 NA NA
# 2 0.001 11
# 3 NA NA
# 4 NA NA
As above, but with the "data.table" package:
library(data.table)
packageVersion("data.table")
# [1] ‘1.8.11’
DT <- data.table(x)
DT[, .id := sequence(.N), by = list(SAMPLE_DATE, PARAMETER)]
dcast.data.table(DT, .id + SAMPLE_DATE ~ PARAMETER, value.var="RESULT")
If you don't want separate rows for duplicated combinations, you will have to aggregate the data in some way first.
This will be (sort of) a contingency table (zeros if no value, and the sum of values where category values overlap):
xtabs(RESULT~ SAMPLE_DATE+PARAMETER, data=dat)
PARAMETER
SAMPLE_DATE Alkalinity, total as CaCO3(mg/L) Ammonia-nitrogen as N(mg/L)
3/7/1962 590.000 0.300
8/2/1954 112.500 0.000
PARAMETER
SAMPLE_DATE Apparent color(PCU) Depth, Secchi disk depth(m)
3/7/1962 20.000 0.000
8/2/1954 0.000 2.440
PARAMETER
SAMPLE_DATE Nutrient-nitrogen as N(mg/L) Phosphorus as P(mg/L)
3/7/1962 0.000 0.000
8/2/1954 0.870 0.001
PARAMETER
SAMPLE_DATE Sulfate as SO4(mg/L)
3/7/1962 0.000
8/2/1954 11.000
If you want something other than the sum() of repeated categories, then the tapply function can deliver. E.g. with mean as the target function:
with( dat, tapply(RESULT, list( SAMPLE_DATE, PARAMETER), FUN=mean, na.rm=TRUE))
Alkalinity, total as CaCO3(mg/L) Ammonia-nitrogen as N(mg/L) Apparent color(PCU) Depth, Secchi disk depth(m)
3/7/1962 196.6667 0.1 10 NA
8/2/1954 112.5000 NA NA 2.44
Nutrient-nitrogen as N(mg/L) Phosphorus as P(mg/L) Sulfate as SO4(mg/L)
3/7/1962 NA NA NA
8/2/1954 0.87 0.001 11
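A more recent alternative is tidyr's pivot_wider(), sketched here assuming the sample is loaded in a data frame dat with columns SAMPLE_DATE, PARAMETER, and RESULT:
library(tidyr)
# duplicated SAMPLE_DATE/PARAMETER pairs are averaged via values_fn;
# add values_fill = 0 to get zeros instead of NA for missing pairs
pivot_wider(dat, names_from = PARAMETER, values_from = RESULT,
            values_fn = mean)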
