Grouped moving average in R

I'm trying to calculate a moving average in R over a particular field, but the moving average needs to be grouped by two or more other fields. It is intended for predictive analysis, so it also needs to be trailing.
Any group that does not have enough values to be averaged (such as student J) would ideally give either NA or its original Score value.
I've been trying rollapply and data.table and am having no luck!
I've provided the table of data and two moving averages (AVG2 with k=2 and AVG3 with k=3) to show exactly what I'm after. The moving average is on Score and the variables to group over are school, Student and area. Please help!
no school Student area Score **AVG2** **AVG3**
1 I S A 5 NA NA
2 B S A 2 NA NA
3 B S A 7 NA NA
4 B O A 3 NA NA
5 B O B 9 NA NA
6 I O A 6 NA NA
7 I O B 3 NA NA
8 I S A 7 NA NA
9 I O A 1 NA NA
10 B S A 7 4.5 NA
11 I S A 3 NA NA
12 I O A 8 3.5 NA
13 B S A 3 7 5.33
14 I O A 4 4.5 5
15 B O A 1 NA NA
16 I S A 9 5 5
17 B S A 4 5 5.67
18 B O A 6 2 NA
19 I S A 3 6 6.33
20 I O B 8 NA NA
21 B S A 3 3.5 4.67
22 I O A 4 6 4.33
23 B O A 1 3.5 3.33
24 I S A 9 6 5
25 B S A 4 3.5 3.33
26 B O A 6 3.5 2.67
27 I J A 6 NA NA
Here is the code to recreate the initial table in R:
school <- c('I','B','B','B','B','I','I','I','I','B','I','I','B','I','B','I','B','B','I','I','B','I','B','I','B','B','I')
Student <- c('S','S','S','O','O','O','O','S','O','S','S','O','S','O','O','S','S','O','S','O','S','O','O','S','S','O','J')
area <- c('A','A','A','A','B','A','B','A','A','A','A','A','A','A','A','A','A','A','A','B','A','A','A','A','A','A','A')
Score <- c(5,2,7,3,9,6,3,7,1,7,3,8,3,4,1,9,4,6,3,8,3,4,1,9,4,6,6)
DF <- data.frame(school, Student, area, Score)

You can try solving the problem using dplyr and TTR, but for student J from school I it is not possible to calculate a moving average as there is only one measurement.
AVG2 calculated with stats::filter gives the result you wanted, but I also added AVG2b calculated with TTR::SMA to show a simple moving average calculation in which the current measurement is also taken into account.
library(dplyr)
library(TTR)
df <- data.frame(school, Student, Score)
df$AVG2 <- NA
df$AVG2b <- NA
df[!(df$school=="I" & df$Student=="J"),] <- df[!(df$school=="I" & df$Student=="J"),] %>%
group_by(school, Student) %>%
mutate(AVG2 = stats::filter(Score, c(0, 0.5, 0.5), sides = 1 ), AVG2b = SMA(Score, n= 2))
> df
school Student Score AVG2 AVG2b
1 I S 5 NA NA
2 B S 2 NA NA
3 B S 7 NA 4.5
4 B O 3 NA NA
5 B O 9 NA 6.0
6 I O 6 NA NA
7 I O 3 NA 4.5
8 I S 7 NA 6.0
9 I O 1 4.5 2.0
10 B S 7 4.5 7.0
...
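To see why the weights c(0, 0.5, 0.5) leave out the current measurement, here is a minimal sketch on a single toy vector (x is made up for illustration; it holds the first five Score values of the school B / Student S group). With sides = 1, the first weight applies to the current value, the second to the previous value, and so on.

```r
x <- c(2, 7, 7, 3, 4)

# Weight 0 on the current value, 0.5 on each of the two previous values
prior_two <- as.numeric(stats::filter(x, c(0, 0.5, 0.5), sides = 1))

# Weight 0.5 on the current value and 0.5 on the previous one
current_and_prior <- as.numeric(stats::filter(x, c(0.5, 0.5), sides = 1))

prior_two          # NA NA 4.5 7.0 5.0
current_and_prior  # NA 4.5 7.0 5.0 3.5
```

The as.numeric() call drops the time-series class that stats::filter returns, which tends to be easier to store in a data frame column.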

Here is a rollapply solution. Note that it appears that you want the average of the prior two or three rows in the same group, i.e. excluding the data on the current row.
library(zoo)
roll <- function(x, n) {
if (length(x) <= n) NA
else rollapply(x, list(-seq(n)), mean, fill = NA)
}
transform(DF, AVG2 = ave(Score, school, Student, FUN = function(x) roll(x, 2)),
AVG3 = ave(Score, school, Student, FUN = function(x) roll(x, 3)))
giving:
school Student Score AVG2 AVG3
1 I S 5 NA NA
2 B S 2 NA NA
3 B S 7 NA NA
4 B O 3 NA NA
5 B O 9 NA NA
6 I O 6 NA NA
7 I O 3 NA NA
8 I S 7 NA NA
9 I O 1 4.5 NA
10 B S 7 4.5 NA
11 I S 3 6.0 NA
12 I O 8 2.0 3.333333
13 B S 3 7.0 5.333333
14 I O 4 4.5 4.000000
15 B O 1 6.0 NA
16 I S 9 5.0 5.000000
17 B S 4 5.0 5.666667
18 B O 6 5.0 4.333333
19 I S 3 6.0 6.333333
20 I O 8 6.0 4.333333
21 B S 3 3.5 4.666667
22 I O 4 6.0 6.666667
23 B O 1 3.5 5.333333
24 I S 9 6.0 5.000000
25 B S 4 3.5 3.333333
26 B O 6 3.5 2.666667
27 I J 6 NA NA
Update: Fixed roll.
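For comparison, the same idea can be sketched in base R without zoo. This is only an illustration under assumptions: trailing_mean is a hypothetical helper name, the data is a small made-up subset, and a plain loop is used for clarity rather than speed.

```r
# Mean of the n values before each position, excluding the current value.
# Positions with fewer than n earlier values stay NA.
trailing_mean <- function(x, n) {
  out <- rep(NA_real_, length(x))
  if (length(x) > n) {
    for (i in (n + 1):length(x)) {
      out[i] <- mean(x[(i - n):(i - 1)])
    }
  }
  out
}

school  <- c("I", "B", "B", "I", "B", "I", "B")
Student <- c("S", "S", "S", "S", "S", "S", "O")
Score   <- c(5, 2, 7, 7, 7, 3, 9)

# ave() applies the helper within each school/Student group and
# returns the results in the original row order
AVG2 <- ave(Score, school, Student, FUN = function(x) trailing_mean(x, 2))
AVG2  # NA NA NA NA 4.5 6.0 NA
```

Groups with too few rows (here the single B/O row) simply come back as NA.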

Here is AVG2 calculation with data.table, which is faster compared to other approaches:
library(data.table)
dt <- data.table(df)
setkey(dt, school, Student, area)
dt[, c("start", "len") := .(ifelse(.I + 1 > .I[.N], 0, .I +1), pmax(pmin(1, .I[.N] - .I -1), 0)), by = .(school, Student, area)][
, AVG2 := mean(dt$Score[start:(start+len)]), by = 1:nrow(dt)]
dt$AVG2[dt$len == 0] <- NA
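A trailing average of the two prior rows per group can also be sketched with data.table's shift(). This is an alternative illustration on made-up toy data, not the approach above, and it makes no claim about relative speed.

```r
library(data.table)

dt <- data.table(
  school  = c("I", "B", "B", "I", "B", "I", "B"),
  Student = c("S", "S", "S", "S", "S", "S", "O"),
  area    = "A",
  Score   = c(5, 2, 7, 7, 7, 3, 9)
)

# shift(Score, 1:2) returns a list with the lag-1 and lag-2 vectors.
# Adding them and dividing by 2 gives the mean of the two prior scores,
# which is NA until two prior values exist within the group.
dt[, AVG2 := Reduce(`+`, shift(Score, 1:2)) / 2,
   by = .(school, Student, area)]

dt$AVG2  # NA NA NA NA 4.5 6.0 NA
```

For a window of three, shift(Score, 1:3) divided by 3 would follow the same pattern.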

Related

R strsplit for uneven number of columns in a huge data set

I have a huge data set with about 200 columns and 25k+ rows, with the separator ';'. The columns are of an uneven number.
I read it in as a delimited txt file with df <- read.delim("~path/data.txt", sep=";", header = FALSE)
which reads nicely as a table.
My issue is that many of the rows are so long that in the txt file they often spill onto new lines, and R does not recognise that they should continue on the same row. As a result, the columns contain information that belongs elsewhere.
Each observation of data is a dbl.
I have created a new example below for ease of reading, therefore it is not possible to simply sort conditions into columns.
***EDIT: x, y and z contain spatial coordinates, but I have substituted them for their corresponding letters for ease of reading.
The data is X-profile data giving me coordinates of the centre point along a line, followed by offsets of 1m (up to 100m either side of 0, the centre line) in each column with its corresponding height ***
My data ends up looking something like this:
[c1] [c2] [c3] [c4] [c5] [c6] [c7] [c8] [c9]
[1] x y z 1 2 3 N/A N/A N/A
[2] x y z 1 2 3 4 5 6
[3] 7 8 9 10 N/A N/A N/A N/A N/A
[4] x y z 1 2 3 4 5 7
[5] 7 8 9 N/A N/A N/A N/A N/A N/A
[6] x y z 1 2 3 N/A N/A N/A
[7] x y z 1 2 3 4 5 N/A
And I'd like it to look like this:
[c1] [c2] [c3] [c4] [c5] [c6] [c7] [c8] [c9] [c10] [c11] [c12] [c13]
[1] x y z 1 2 3 N/A N/A N/A N/A N/A N/A N/A
[2] x y z 1 2 3 4 5 6 7 8 9 10
[3] x y z 1 2 3 4 5 6 7 8 9 N/A
[4] x y z 1 2 3 N/A N/A N/A N/A N/A N/A N/A
[5] x y z 1 2 3 4 5 N/A N/A N/A N/A N/A
I have tried strsplit(as.character(df), split = "\n", fixed = TRUE) and it returns an error that it is not a character string. I have tried the same function with split = "\t" and split = "\r" and it returns the same error. Each attempt takes around half an hour to process so I was also wondering if there is a more efficient way to do this.
I hope I have explained clearly my aim.
EDIT
The text file is similar to the following example:
x;y;z;1;2;3
x;y;z;1;2;3;4;5;6;
7;8;9;10
x;y;z;1;2;3;4;5;6;
7;8;9
x;y;z;1;2;3;4
x;y;z;1;2;3;4;5;6;
7;8;9;10;11;12;13;
14;15
In some cases a number is split between the previous line and that below:
E.G.
101;102;103;10
4;105;106
This layout is exactly how it is being read into R.
Use scan, which omits empty lines by default. Next, find the positions that begin with "x" using findInterval, split there and paste the pieces together. Then it is basically the usual strsplit, some length adjustments etc., and you've got it.
r <- scan('foo.txt', 'A', quiet=TRUE)
r <- split(r, findInterval(seq_len(length(r)), grep('^x', r))) |>
lapply(paste, collapse='') |>
lapply(strsplit, ';') |>
lapply(el) |>
{\(.) lapply(., `length<-`, max(lengths(.)))}() |>
do.call(what=rbind) |>
as.data.frame()
r
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18
# 1 x y z 1 2 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 2 x y z 1 2 3 4 5 6 7 8 9 10 <NA> <NA> <NA> <NA> <NA>
# 3 x y z 1 2 3 4 5 6 7 8 9 <NA> <NA> <NA> <NA> <NA> <NA>
# 4 x y z 1 2 3 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 5 x y z 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Data:
writeLines(text='x;y;z;1;2;3
x;y;z;1;2;3;4;5;6;
7;8;9;10
x;y;z;1;2;3;4;5;6;
7;8;9
x;y;z;1;2;3;4
x;y;z;1;2;3;4;5;6;
7;8;9;10;11;12;13;
14;15', 'foo.txt')
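The grouping step can also be written out more explicitly. The sketch below is an illustration only: the lines vector stands in for what readLines() would return, and it assumes every record starts with "x" as in the example (with real coordinates a different start-of-record marker would be needed). Pasting with collapse = "" also rejoins a number that was split across two lines.

```r
lines <- c("x;y;z;1;2;3",
           "x;y;z;1;2;3;4;5;6;",
           "7;8;9;10",
           "x;y;z;101;10",
           "4;105")

# A new record starts at every line beginning with "x"
grp <- cumsum(grepl("^x", lines))

# Glue the physical lines of each record back together
joined <- vapply(split(lines, grp), paste, character(1), collapse = "")

# Split into fields and pad every record with NA up to the widest one
fields <- strsplit(joined, ";", fixed = TRUE)
width  <- max(lengths(fields))
padded <- t(sapply(fields, `length<-`, width))
res    <- as.data.frame(padded, stringsAsFactors = FALSE)

dim(res)   # 3 13
res[3, 5]  # "104"
```

All columns come back as character, so numeric columns would still need converting, for example with type.convert().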
Using data.table:
library(data.table)
dt <- data.table(df)
dt[, grp := cumsum(c1 == "x")]
dt <- merge(dt[c1 == "x"], dt[c1 == 7], by = "grp", all = T)[, grp := NULL]
names(dt) <- paste0("c", 1:ncol(dt))
Resulting in:
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18
1: x y z 1 2 3 NA NA NA NA NA NA NA NA NA NA NA NA
2: x y z 1 2 3 4 5 6 7 8 9 10 NA NA NA NA NA
3: x y z 1 2 3 4 5 7 7 8 9 NA NA NA NA NA NA
4: x y z 1 2 3 NA NA NA NA NA NA NA NA NA NA NA NA
5: x y z 1 2 3 4 5 NA NA NA NA NA NA NA NA NA NA

Set values to NA in data frame if Date is outside of a given interval

I have two dataframes, df1 and df2.
df1 contains values for different products X1, X2, and so on at different times. df2 contains the true start and end date for some of the products. I want to replace the values outside of the given date intervals in df2 by NA, as shown in the final table df3.
Create df1 and df2:
df1=data.frame(matrix(NA,10,6))
df1[,1]=(c(seq(as.Date("2012-01-01"),as.Date("2012-10-01"),by="1 month")))
df1[,2]=c(1:10); df1[,3]=c(12:21); df1[,4]=c(0.5:10); df1[,5]=c(5:14); df1[,6]=c(10:19)
colnames(df1)=c("Date","X1","X2","X3","X4","X5")
df2=data.frame(matrix(data=c("X1","X2","X4","2012-02-01","2012-04-01","2012-06-01","2012-09-01","2012-06-01","2012-10-01"),3,3))
colnames(df2)=c("Name","Start","End")
Output:
> df1
Date X1 X2 X3 X4 X5
1 2012-01-01 1 12 0.5 5 10
2 2012-02-01 2 13 1.5 6 11
3 2012-03-01 3 14 2.5 7 12
4 2012-04-01 4 15 3.5 8 13
5 2012-05-01 5 16 4.5 9 14
6 2012-06-01 6 17 5.5 10 15
7 2012-07-01 7 18 6.5 11 16
8 2012-08-01 8 19 7.5 12 17
9 2012-09-01 9 20 8.5 13 18
10 2012-10-01 10 21 9.5 14 19
> df2
Name Start End
1 X1 2012-02-01 2012-09-01
2 X2 2012-04-01 2012-06-01
3 X4 2012-06-01 2012-10-01
Final output should look like this:
df3
Date X1 X2 X3 X4 X5
1 2012-01-01 NA NA 0.5 NA 10
2 2012-02-01 2 NA 1.5 NA 11
3 2012-03-01 3 NA 2.5 NA 12
4 2012-04-01 4 15 3.5 NA 13
5 2012-05-01 5 16 4.5 NA 14
6 2012-06-01 6 17 5.5 10 15
7 2012-07-01 7 NA 6.5 11 16
8 2012-08-01 8 NA 7.5 12 17
9 2012-09-01 9 NA 8.5 13 18
10 2012-10-01 NA NA 9.5 14 19
Using dplyr and tidyr...
library(tidyr)
library(dplyr)
df3 <- df1 %>% gather(key=Name,value=value,-Date) %>% #convert to long form
left_join(df2) %>% #merge in date limits
mutate(ind=(as.Date(Date)>=as.Date(Start) & as.Date(Date)<=as.Date(End))) %>% #check valid
mutate(value=replace(value,!ind,NA)) %>% #replace invalid with NA
select(Date,Name,value) %>% #remove unnecessary variables
spread(key=Name,value=value) #convert back to rectangular form
df3
Date X1 X2 X3 X4 X5
1 2012-01-01 NA NA 0.5 NA 10
2 2012-02-01 2 NA 1.5 NA 11
3 2012-03-01 3 NA 2.5 NA 12
4 2012-04-01 4 15 3.5 NA 13
5 2012-05-01 5 16 4.5 NA 14
6 2012-06-01 6 17 5.5 10 15
7 2012-07-01 7 NA 6.5 11 16
8 2012-08-01 8 NA 7.5 12 17
9 2012-09-01 9 NA 8.5 13 18
10 2012-10-01 NA NA 9.5 14 19
I am sure there is a more elegant way, but you could create a matrix of the indices that meet your criterion, where you set the elements to 1 if it is within your interval for that product and NA if it isn't. Assuming you are dealing with numerical values you can then multiply your data frame with that index matrix:
Example:
library(dplyr)
## Convert your dates to Date-objects:
df2 <- df2 %>% dplyr::mutate(Start = as.Date(Start), End = as.Date(End))
## Create a matrix of indices (TRUE/FALSE):
indMx <- lapply(names(df1)[-1], function(product){
(df1$Date >= df2$Start[df2$Name == product]) &
(df1$Date <= df2$End[df2$Name == product])
}) %>% do.call('cbind',.)
## Multiply with NA^(!indMx), which gives you NA in place of FALSE and
## 1 in place of TRUE (NA^0 is 1, while NA^1 is NA):
df1[,-1] <- df1[,-1]*NA^(!indMx)
df1
# Date X1 X2 X3
# 1 2012-01-01 NA NA NA
# 2 2012-02-01 2 NA NA
# 3 2012-03-01 3 NA NA
# 4 2012-04-01 4 15 NA
# 5 2012-05-01 5 16 NA
# 6 2012-06-01 6 17 5.5
# 7 2012-07-01 7 NA 6.5
# 8 2012-08-01 8 NA 7.5
# 9 2012-09-01 9 NA 8.5
# 10 2012-10-01 NA NA 9.5
Here is one solution with data.table. There might be a more elegant method using non-equi joins.
for(i in seq_len(nrow(df2))) df1[!(Date %between% df2[i,.(Start, End)]), df2[i, Name] := NA]
Here, you run through each row of df2, subset df1 based on dates outside of the start and end dates in the current row of df2, and then assign NA to the variable given in df2.
This returns
df1
Date X1 X2 X3
1: 2012-01-01 NA NA NA
2: 2012-02-01 2 NA NA
3: 2012-03-01 3 NA NA
4: 2012-04-01 4 15 NA
5: 2012-05-01 5 16 NA
6: 2012-06-01 6 17 5.5
7: 2012-07-01 7 NA 6.5
8: 2012-08-01 8 NA 7.5
9: 2012-09-01 9 NA 8.5
10: 2012-10-01 NA NA 9.5
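The row-by-row idea does not depend on data.table. A minimal base R sketch, using a shortened made-up version of the data (five dates, one product with an interval), might look like this:

```r
df1 <- data.frame(
  Date = seq(as.Date("2012-01-01"), by = "1 month", length.out = 5),
  X1   = 1:5,
  X2   = 11:15
)
df2 <- data.frame(Name  = "X1",
                  Start = as.Date("2012-02-01"),
                  End   = as.Date("2012-04-01"),
                  stringsAsFactors = FALSE)

df3 <- df1
for (i in seq_len(nrow(df2))) {
  # Rows whose date falls before Start or after End for this product
  outside <- df3$Date < df2$Start[i] | df3$Date > df2$End[i]
  df3[outside, df2$Name[i]] <- NA
}

df3$X1  # NA 2 3 4 NA
df3$X2  # 11 12 13 14 15 (no interval given, so left untouched)
```

Products that do not appear in df2 are never visited by the loop, so their columns keep all values.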
update
If the data is constructed as in the updated original post, run these lines first to convert the data frames to data.tables and the Name variable in df2 to a character vector (it starts out as a factor). Then the above code will work for the new dataset.
# convert data.frames to data.tables
setDT(df1)
setDT(df2)
# convert factor to character
df2[, Name := as.character(Name)]
data
library(data.table)
# read in data
df1 <- fread("Date X1 X2 X3
2012-01-01 1 12 0.5
2012-02-01 2 13 1.5
2012-03-01 3 14 2.5
2012-04-01 4 15 3.5
2012-05-01 5 16 4.5
2012-06-01 6 17 5.5
2012-07-01 7 18 6.5
2012-08-01 8 19 7.5
2012-09-01 9 20 8.5
2012-10-01 10 21 9.5")
df2 <- fread(" Name Start End
X1 2012-02-01 2012-09-01
X2 2012-04-01 2012-06-01
X3 2012-06-01 2012-10-01")
# convert to date type
df1[, Date := as.Date(Date)]
df2[, c("Start", "End") := .(as.Date(Start), as.Date(End))]

Find row of the next instance of the value in R

I have two columns Time and Event. There are two events A and B. Once an event A takes place, I want to find when the next event B occurs. Column Time_EventB is the desired output.
This is the data frame:
df <- data.frame(Event = sample(c("A", "B", ""), 20, replace = TRUE), Time = paste("t", seq(1,20)))
What is the code in R for finding the next instance of a value (B in this case)?
What is the code for once the instance of B is found, return the value of the corresponding Time Column?
The code should be something like this:
data$Time_EventB <- ifelse(data$Event == "A", <Code for returning time of next instance of B>, "")
In Excel this can be done using VLOOKUP.
Here's a simple solution:
set.seed(1)
df <- data.frame(Event = sample(c("A", "B", ""),size=20, replace=T), time = 1:20)
as <- which(df$Event == "A")
bs <- which(df$Event == "B")
next_b <- sapply(as, function(a) {
diff <- bs-a
if(all(diff < 0)) return(NA)
bs[min(diff[diff > 0]) == diff]
})
df$next_b <- NA
df$next_b[as] <- df$time[next_b]
> df
Event time next_b
1 A 1 2
2 B 2 NA
3 B 3 NA
4 4 NA
5 A 5 8
6 6 NA
7 7 NA
8 B 8 NA
9 B 9 NA
10 A 10 14
11 A 11 14
12 A 12 14
13 13 NA
14 B 14 NA
15 15 NA
16 B 16 NA
17 17 NA
18 18 NA
19 B 19 NA
20 20 NA
Here's an attempt using a "rolling join" from the data.table package:
library(data.table)
setDT(df)
df[Event=="B", .(time, nextb=time)][df, on="time", roll=-Inf][Event != "A", nextb := NA][]
# time nextb Event
# 1: 1 2 A
# 2: 2 NA B
# 3: 3 NA B
# 4: 4 NA
# 5: 5 8 A
# 6: 6 NA
# 7: 7 NA
# 8: 8 NA B
# 9: 9 NA B
#10: 10 14 A
#11: 11 14 A
#12: 12 14 A
#13: 13 NA
#14: 14 NA B
#15: 15 NA
#16: 16 NA B
#17: 17 NA
#18: 18 NA
#19: 19 NA B
#20: 20 NA
Using data as borrowed from @thc
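Another way to look up the next B, sketched here in base R with findInterval() on a made-up event vector: for each A position, findInterval() counts how many B positions lie at or before it, so adding 1 indexes the first B that comes after.

```r
event <- c("A", "B", "B", "", "A", "", "", "B", "B", "A")
time  <- seq_along(event)

a_pos <- which(event == "A")  # 1 5 10
b_pos <- which(event == "B")  # 2 3 8 9

# Index (within b_pos) of the first B after each A; an index past the
# end of b_pos yields NA, meaning no later B exists
idx <- findInterval(a_pos, b_pos) + 1

next_b <- rep(NA_integer_, length(event))
next_b[a_pos] <- time[b_pos[idx]]

next_b  # 2 NA NA NA 8 NA NA NA NA NA
```

This relies on b_pos being sorted in increasing order, which which() guarantees.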

aggregate + mean returns wrong result

Using R, I am trying to calculate groupwise means with aggregate(..., mean). The mean returned, however, is wrong.
testdata <-read.table(text="
a b c d year
2 10 1 NA 1998
1 7 NA NA 1998
4 6 NA NA 1998
2 2 NA NA 1998
4 3 2 1 1998
2 6 NA NA 1998
3 NA NA NA 1998
2 7 NA 3 1998
1 8 NA 4 1998
2 7 2 5 1998
1 NA NA 4 1998
2 5 NA 6 1998
2 4 NA NA 1998
3 11 2 7 1998
1 18 4 10 1998
3 12 7 5 1998
2 17 NA NA 1998
2 11 4 5 1998
1 3 1 1 1998
3 5 1 3 1998
",header=TRUE,sep="")
aggregate(. ~ year, testdata,
function(x) c(mean = round(mean(x, na.rm=TRUE), 2)))
colMeans(subset(testdata, year=="1998", select=d), na.rm=TRUE)
aggregate says the mean of d for group 1998 is 4.62, but it is 4.5.
Reducing the data to one column only, aggregate gets it right:
aggregate(. ~ year, testdata[4:5],
function(x) c(mean = round(mean(x, na.rm=TRUE), 2)))
What's wrong with my aggregate() + mean() function?
aggregate is taking out your rows containing NAs in any column before passing it to the mean function. Try running your aggregate call without na.rm=TRUE - it will still work.
To fix this, you need to change the default na.action in aggregate to na.pass:
aggregate(. ~ year, testdata,
function(x) c(mean = round(mean(x, na.rm=TRUE), 2)), na.action = na.pass)
year a b c d
1 1998 2.15 7.89 2.67 4.5
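The effect is easy to reproduce on a tiny made-up data frame. In this sketch, y has NAs in the first two rows, so with the default na.action those whole rows are dropped before the mean of x is taken:

```r
d <- data.frame(g = 1, x = c(1, 2, 3, 4), y = c(NA, NA, 10, 20))

# Default na.action = na.omit: rows 1 and 2 are removed for every column
default_res <- aggregate(. ~ g, d, mean)

# na.pass keeps all rows; na.rm = TRUE is then handled per column by mean()
pass_res <- aggregate(. ~ g, d, mean, na.rm = TRUE, na.action = na.pass)

default_res$x  # 3.5, the mean of 3 and 4 only
pass_res$x     # 2.5, the mean of all four values
pass_res$y     # 15
```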

Removing rows of data frame if number of NA in a column is larger than 3

I have a data frame (panel data): the Ctry column indicates the names of the countries in my data frame. If the number of NAs in any column (for example, Carx) is larger than 3, I want to drop the related country from my data frame. For example,
Country A has 2 NA
Country B has 4 NA
Country C has 3 NA
I want to drop country B from my data frame. I have a data frame like this (this is for illustration; my actual data frame is much larger):
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
B 2000 NA
B 2001 NA
B 2002 NA
B 2003 NA
B 2004 18
B 2005 16
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
I want to create a data frame like this:
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
A fairly straightforward way in base R is to use sum(is.na(.)) along with ave, to do the counting, like this:
with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x))))
# [1] 1 1 1 1 1 1 4 4 4 4 4 4 3 3 3 3 3 3
Once you have that, subsetting is easy:
mydf[with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x)))) <= 3, ]
# Ctry year Carx
# 1 A 2000 23
# 2 A 2001 18
# 3 A 2002 20
# 4 A 2003 NA
# 5 A 2004 24
# 6 A 2005 18
# 13 C 2000 NA
# 14 C 2001 NA
# 15 C 2002 24
# 16 C 2003 21
# 17 C 2004 NA
# 18 C 2005 24
You can use the by() function to group by Ctry and count the NAs in each group:
DF <- read.csv(
text='Ctry,year,Carx
A,2000,23
A,2001,18
A,2002,20
A,2003,NA
A,2004,24
A,2005,18
B,2000,NA
B,2001,NA
B,2002,NA
B,2003,NA
B,2004,18
B,2005,16
C,2000,NA
C,2001,NA
C,2002,24
C,2003,21
C,2004,NA
C,2005,24',
stringsAsFactors=F)
res <- by(data=DF$Carx,INDICES=DF$Ctry,FUN=function(x)sum(is.na(x)))
validCtry <-names(res)[res <= 3]
DF[DF$Ctry %in% validCtry, ]
# Ctry year Carx
#1 A 2000 23
#2 A 2001 18
#3 A 2002 20
#4 A 2003 NA
#5 A 2004 24
#6 A 2005 18
#13 C 2000 NA
#14 C 2001 NA
#15 C 2002 24
#16 C 2003 21
#17 C 2004 NA
#18 C 2005 24
EDIT :
if you have more columns to check, you could adapt the previous code as follows:
res <- by(data=DF,INDICES=DF$Ctry,
FUN=function(x){
return(sum(is.na(x$Carx)) <= 3 &&
sum(is.na(x$Barx)) <= 3 &&
sum(is.na(x$Tarx)) <= 3)
})
validCtry <- names(res)[res]
DF[DF$Ctry %in% validCtry, ]
where, of course, you may change the condition in FUN according to your needs.
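When many columns need checking, one condition per column gets tedious. A rough sketch, assuming the columns to check are listed in a character vector (cols is a hypothetical name, Barx is carried over from the example above, and the data is made up), could count the NAs for all of them at once with rowsum():

```r
DF <- data.frame(
  Ctry = rep(c("A", "B", "C"), each = 4),
  Carx = c(NA, 1, 2, 3,  NA, NA, NA, NA,  1, 2, 3, 4),
  Barx = c(1, 2, 3, 4,   1, 2, 3, 4,      NA, NA, NA, NA)
)
cols <- c("Carx", "Barx")

# One row per country, one column per checked variable, holding NA counts
na_counts <- rowsum(+is.na(DF[cols]), DF$Ctry)

# Keep countries where no checked column has more than 3 NAs
validCtry <- rownames(na_counts)[rowSums(na_counts > 3) == 0]
kept <- DF[DF$Ctry %in% validCtry, ]

validCtry  # "A"
```

Here B is dropped because of Carx and C because of Barx, so only country A remains.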
Since you mention that your data is "very huge" (whatever that means exactly), you could try a solution with dplyr and see if it's perhaps faster than the solutions in base R. If the other solutions are fast enough, just ignore this one.
library(dplyr)
newdf <- df %>% group_by(Ctry) %>% filter(sum(is.na(Carx)) <= 3)
