Grouping data by specific observations in R

I want to create a new variable that's derived from specific values in my existing variables. My data frame looks something like the following:
year <- c("2010", "2011", "2012", "2013", "2014", "2015")
x <- c(2980, 2955, 3110, 2962, 2566, 3788)
y <- c(2453, 2919, 2930, 2864, 2873, 3031)
df <- data.frame(year, x, y)
More specifically, I want to create a third column, z, that is the ratio of x and y. However, I don't want to create this ratio by simply dividing x by y for each individual year. Instead, I want the values in 2015 (and 2014 etc.) to be an average of this ratio in the three preceding years, i.e. 2014, 2013, and 2012.
I've looked at Wickham's dplyr package and, in particular, the group_by function, but I'm stumped because I don't want to group my data by year per se but by each year's three preceding years, as illustrated (hopefully) above.

With dplyr and library(zoo):
df_fin<- df %>% mutate( z = rollmeanr(x/y,3,na.pad=TRUE))
I think the column z is what you want but it would be good to have the desired output.

The answers that use zoo::rollmean are all on the correct track, but they have a couple of "off by one" errors in them. First, you actually want zoo::rollmeanr( ..., na.pad=TRUE ) which will correctly pad the output with NA on the left side:
> zoo::rollmeanr( df$x / df$y, 3, na.pad=TRUE )
[1] NA NA 1.0962018 1.0359948 0.9962648 1.0590378
The second "off by one" error arises from alignment of this vector with the rest of your data. From your description, you want the value for 2015 to be the average of 2014, 2013, and 2012. However, appending the vector above to your table makes the value for 2015 the average of 2015, 2014, and 2013 instead. To correct this, omit the last value in your input to the rolling average and prepend an NA to compensate:
> c( NA, zoo::rollmeanr( head(df$x / df$y,-1), 3, na.pad=TRUE ) )
[1] NA NA NA 1.0962018 1.0359948 0.9962648
Putting it all together using dplyr notation:
df %>% mutate( z = c( NA, zoo::rollmeanr( head(x/y,-1), 3, na.pad=TRUE ) ) )
year x y z
1 2010 2980 2453 NA
2 2011 2955 2919 NA
3 2012 3110 2930 NA
4 2013 2962 2864 1.0962018
5 2014 2566 2873 1.0359948
6 2015 3788 3031 0.9962648
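The same shifted window can be sketched in base R without zoo. This is only an illustrative alternative, not the answer's method: embed() lays out every run of three consecutive ratios as a matrix row, rowMeans() averages each run, and the result is shifted down one row so that each year only sees the three preceding years. The names ratio, roll3 and z_base are assumptions made for this example.

```r
year <- c("2010", "2011", "2012", "2013", "2014", "2015")
x <- c(2980, 2955, 3110, 2962, 2566, 3788)
y <- c(2453, 2919, 2930, 2864, 2873, 3031)
df <- data.frame(year, x, y)

ratio <- df$x / df$y

# Each row of embed(ratio, 3) holds three consecutive ratios,
# so rowMeans() gives the mean of every window of length 3.
roll3 <- rowMeans(embed(ratio, 3))

# Drop the window that ends in the last year and pad three NAs on the left,
# so row i holds the mean of rows i-3 .. i-1.
z_base <- c(rep(NA, 3), head(roll3, -1))
df$z <- z_base
```

The first three years have fewer than three predecessors, so they stay NA.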

df$z<-0
for (i in 4:6){
df$z[i]<-mean(df$x[(i-3):(i-1)])/mean(df$y[(i-3):(i-1)])
}
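With a loop, you can get this (note that it divides the three-year mean of x by the three-year mean of y, which differs slightly from averaging the yearly ratios):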
year x y z
1 2010 2980 2453 0.000000
2 2011 2955 2919 0.000000
3 2012 3110 2930 0.000000
4 2013 2962 2864 1.089497
5 2014 2566 2873 1.036038
6 2015 3788 3031 0.996654
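If the mean of the yearly ratios is what is wanted, a loop variant might look like the sketch below. It assumes the same df as in the question and, as a design choice of this example only, leaves the rows without three predecessors as NA rather than 0, so they cannot be mistaken for real values.

```r
year <- c("2010", "2011", "2012", "2013", "2014", "2015")
x <- c(2980, 2955, 3110, 2962, 2566, 3788)
y <- c(2453, 2919, 2930, 2864, 2873, 3031)
df <- data.frame(year, x, y)

ratio <- df$x / df$y
df$z <- NA_real_                      # undefined rows stay NA instead of 0

for (i in seq_len(nrow(df))) {
  if (i > 3) {
    # mean of the three preceding yearly ratios
    df$z[i] <- mean(ratio[(i - 3):(i - 1)])
  }
}
```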

library(zoo)
library(dplyr)
df %>% mutate(z = x/y, zz = rollmean(z, 3, fill = NA))

Related

How to find rolling mean using means previously generated using R?

Hope the community can help me since I am relatively new to R and to the StackOverflow community.
I am trying to replace a missing value of a group with the average of the 3 previous years and then use this newly generated mean to continue generating the next period missing value in R either using dplyr or data.table. My data looks something like this (desired output column rounded to 2 digits):
df <- data.frame(gvkey = c(10443, 10443, 10443, 10443, 10443, 10443, 10443, 29206, 29206, 29206, 29206, 29206), fyear = c(2005, 2006, 2007, 2008, 2009, 2010, 2011, 2017, 2018, 2019, 2020, 2021), receivables = c(543, 595, 757, NA, NA, NA, NA, 147.469, 161.422, 154.019, NA, NA), desired_output = c(543, 595, 757, 631.67, 661.22, 683.30, 658.73, 147.47, 161.42, 154.02, 154.30, 156.58))
I have attempted the following line of code, but it does not use the newly generated number:
df <- df %>% mutate(mean_rect=rollapply(receivables,3,mean,align='right',fill=NA))
Any help would be greatly appreciated!
Because your desired fill value depends on any previously created fill values, I think the only reasonable approach is a trusty for loop:
df$out <- NA
for (i in 1:nrow(df)) {
  if (!is.na(df$receivables[i])) {
    df$out[i] <- df$receivables[i]
  } else {
    df$out[i] <- mean(df$out[(i-3):(i-1)], na.rm = TRUE)
  }
}
gvkey fyear receivables desired_output out
1 10443 2005 543.000 543.00 543.0000
2 10443 2006 595.000 595.00 595.0000
3 10443 2007 757.000 757.00 757.0000
4 10443 2008 NA 631.67 631.6667
5 10443 2009 NA 661.22 661.2222
6 10443 2010 NA 683.30 683.2963
7 10443 2011 NA 658.73 658.7284
8 29206 2017 147.469 147.47 147.4690
9 29206 2018 161.422 161.42 161.4220
10 29206 2019 154.019 154.02 154.0190
11 29206 2020 NA 154.30 154.3033
12 29206 2021 NA 156.58 156.5814
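The loop above walks the whole data frame, so its three-row window could reach into the previous gvkey if a group began with missing values. A possible per-group variant is sketched below in base R. The helper name fill_with_prior_mean and its argument k are assumptions made for this illustration; ave() applies the helper separately within each gvkey.

```r
df <- data.frame(
  gvkey = c(10443, 10443, 10443, 10443, 10443, 10443, 10443,
            29206, 29206, 29206, 29206, 29206),
  receivables = c(543, 595, 757, NA, NA, NA, NA,
                  147.469, 161.422, 154.019, NA, NA)
)

# Hypothetical helper: replace each NA with the mean of up to k preceding
# values, reusing values that were filled in earlier iterations.
fill_with_prior_mean <- function(v, k = 3) {
  out <- v
  for (i in seq_along(out)) {
    if (is.na(out[i]) && i > 1) {
      window <- out[max(1, i - k):(i - 1)]
      if (any(!is.na(window))) {
        out[i] <- mean(window, na.rm = TRUE)
      }
    }
  }
  out
}

# Apply the helper within each gvkey so windows never cross groups.
df$out <- ave(df$receivables, df$gvkey, FUN = fill_with_prior_mean)
```

A missing value with no earlier observations in its group is left as NA.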

How to replace NA values with average of precedent and following values, in R

I currently have a dataset that has more or less the following characteristics:
Country <- rep(c("Honduras", "Belize"),each=6)
Year <- rep(c(2010,2011,2012,2014,2015,2016),2)
Observation <- c(2, 5,NA, NA,2,3,NA, NA,2,3,1,NA)
df <- data.frame(Country, Year, Observation)
What I would like to do is find a command/write a function that fills only the NAs for each country with:
1. If the NA Observation is for the first year (2010), fill it with the next non-NA Observation.
2. If the NA Observation is for the last year (2016), fill it with the previous available period's Observation.
3.1 If the NA Observation is for a year between the first and last, fill it with the average of the 2 closest periods.
3.2 However, if there are 2 or more consecutive NAs (let's take 2 as an example), first fill the first with the preceding Observation and the second with the same method as (3.1).
As an illustration, the previous dataset should finally be:
Observation2 <- c(2, 5, 5, 3.5 ,2,3,2, 2,2,3,1,1)
df2 <- data.frame(Country, Year, Observation2)
I hope I was sufficiently clear. It is very specific but I hope someone can help.
Feel free to ask any questions about it if you do not understand.
Input. There is some question of whether alternation of country names as mentioned in the comments under the question and shown in the Note at the end was intended but at any rate assume that each subsequence of increasing years is a separate group and group by them, grp. (If it was intended that the first 6 entries in Country be Honduras the last 6 be Belize then we could replace the group_by(...) with group_by(Country) in the code below.)
Clarification of Question. We assume that the question is asking that within group:
Leading NAs are to be replaced with the first non-NA.
Trailing NAs are to be replaced with the last non-NA.
If there is one consecutive NA surrounded by non-NAs it is replaced by the prior non-NA.
If there are two consecutive NA's then the first is replaced with the prior non-NA and the second is filled in with the average of the prior non-NA and next non-NA.
The question does not address the situation of 3 or more consecutive NAs, so perhaps this never occurs. In case it does, the code should fill in the first NA with the prior non-NA and fill in the remainder using linear interpolation.
Code. Now for each group, replace any NA with the prior value. Then use linear interpolation on what is left via na.approx using rule=2 to extend the ends. Finally only keep desired columns.
dplyr clashes. Note that lag and filter in dplyr collide in an incompatible way with the functions of the same name in base R so we exclude them and use dplyr:: prefix if we want to access them.
library(dplyr, exclude = c("lag", "filter"))
library(zoo)
df2 <- df %>%
  # group_by(Country) %>%
  group_by(grp = cumsum(c(TRUE, diff(Year) < 0))) %>%
  mutate(Observation2 = coalesce(Observation, dplyr::lag(Observation)) %>%
           na.approx(rule = 2)) %>%
  ungroup %>%
  select(Country, Year, Observation2)
identical(df2$Observation2, Observation2)
## [1] TRUE
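For readers without dplyr or zoo, the two steps (carry the prior value into an NA, then interpolate what is left and extend the ends) can be approximated in base R. This is a sketch, not the answer's code: fill_group is a hypothetical helper, it assumes position within the group is the interpolation axis, it uses the each=6 layout of Country, and it needs at least two non-NA values per group for approx() to work.

```r
Country <- rep(c("Honduras", "Belize"), each = 6)
Year <- rep(c(2010, 2011, 2012, 2014, 2015, 2016), 2)
Observation <- c(2, 5, NA, NA, 2, 3, NA, NA, 2, 3, 1, NA)
df <- data.frame(Country, Year, Observation)

# Hypothetical helper for one group.
fill_group <- function(obs) {
  n <- length(obs)
  prev <- c(NA, obs[-n])                    # previous value (base-R lag)
  step1 <- ifelse(is.na(obs), prev, obs)    # NA takes the prior value if any
  idx <- seq_along(step1)
  ok <- !is.na(step1)
  # Linear interpolation over positions; rule = 2 repeats the nearest
  # value for leading and trailing NAs.
  approx(idx[ok], step1[ok], xout = idx, rule = 2)$y
}

df$Observation2 <- ave(df$Observation, df$Country, FUN = fill_group)
```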
Note
We used this input taken from the question.
Country <- rep(c("Honduras", "Belize"),6)
Year <- rep(c(2010,2011,2012,2014,2015,2016),2)
Observation <- c(2, 5,NA, NA,2,3,NA, NA,2,3,1,NA)
df <- data.frame(Country, Year, Observation)
df
giving:
Country Year Observation
1 Honduras 2010 2
2 Belize 2011 5
3 Honduras 2012 NA
4 Belize 2014 NA
5 Honduras 2015 2
6 Belize 2016 3
7 Honduras 2010 NA
8 Belize 2011 NA
9 Honduras 2012 2
10 Belize 2014 3
11 Honduras 2015 1
12 Belize 2016 NA
Added
In a comment the poster added another example. We run it here. This is the same code incorporating the simplification to group_by discussed in the first paragraph above. (That does not change the result.)
Country <- rep(c("Honduras", "Belize"),each=6)
Year <- rep(c(2010,2011,2012,2014,2015,2016),2)
Observation <- c(2, 5, NA, NA,2,3, NA, NA,2, NA,1,NA)
df <- data.frame(Country, Year, Observation)
df2 <- df %>%
group_by(Country) %>%
mutate(Observation2 = coalesce(Observation, dplyr::lag(Observation)) %>%
na.approx(rule = 2)) %>%
ungroup %>%
select(Country, Year, Observation2)
df2
giving:
# A tibble: 12 x 3
Country Year Observation2
<chr> <dbl> <dbl>
1 Honduras 2010 2
2 Honduras 2011 5
3 Honduras 2012 5
4 Honduras 2014 3.5
5 Honduras 2015 2
6 Honduras 2016 3
7 Belize 2010 2
8 Belize 2011 2
9 Belize 2012 2
10 Belize 2014 2
11 Belize 2015 1
12 Belize 2016 1

indexing through function to gather multiple values

I want to extract a sequence of values from this dataframe all together to form a single output. I can do so individually using:
p <- function(x, i){
r<- rank_data[x][rank_data[,i] %in% 2000:2020,]
}
p(1:2, 2)
#output
jan year
235.2 2008
where the sequence of values in p() continue as:
x=c(1:2, 3:4, 5:6)
i=c(2, 4, 6)
I'm looking for a single code, where a variable x or i can be indexed into the dataframe to produce the expected output. Although, using some other iteration function like apply is welcome. I want to better understand indexing through iterative functions.
expected output:
jan year feb year2 mar year3 ...
235.2 2008 287.6 2020 187.8 2019 ...
NA NA 241.9 2002 NA NA
I've asked a similar question here, although, I'm more interested in doing this through indexing with a single iterative function. The technique provided by the author in the previous question, is very specialised, so I'm looking for something simpler to get the hang of this.
Reproducible code:
rank_data <- structure(list(jan = c(268.1, 263.1, 235.2, 223.3, 219.2, 218.3
), year = c(1928, 1948, 2008, 1877, 1995, 1990), feb = c(287.6,
241.9, 213.7, 205.1, 191.9, 191.2), year2 = c(2020, 2002, 1997,
1990, 1958, 1923), mar = c(225.3, 190.7, 187.8, 187.2, 175.9,
173.9), year3 = c(1981, 1903, 2019, 1947, 1994, 1912)), class = "data.frame", row.names = c(NA,
6L))
You should store the values of x in a list because if you store them in a vector there is no way to distinguish between two groups.
x = c(1:2, 3:4, 5:6)
x
#[1] 1 2 3 4 5 6
Storing them in a list.
x= list(1:2, 3:4, 5:6)
x
#[[1]]
#[1] 1 2
#[[2]]
#[1] 3 4
#[[3]]
#[1] 5 6
You can use Map to index rows from your dataframe.
p <- function(x, i){
r<- rank_data[x][rank_data[,i] %in% 2000:2020,]
r
}
x= list(1:2, 3:4, 5:6)
i= c(2, 4, 6)
result <- Map(p, x, i)
result
#[[1]]
# jan year
#3 235.2 2008
#[[2]]
# feb year2
#1 287.6 2020
#2 241.9 2002
#[[3]]
# mar year3
#3 187.8 2019
If you want the output in the same shape as shown, you can add another step that finds the maximum number of rows.
nr <- 1:max(sapply(result, nrow))
do.call(cbind, lapply(result, function(x) x[nr, ]))
# jan year feb year2 mar year3
#3 235.2 2008 287.6 2020 187.8 2019
#NA NA NA 241.9 2002 NA NA

Change column values based on factors of other columns

For example, if I have a data frame like this:
df <- data.frame(profit=c(10,10,10), year=c(2010,2011,2012))
profit year
10 2010
10 2011
10 2012
I want to change the value of profit according to the year. For year 2010, I multiply the profit by 3; for year 2011, I multiply the profit by 4; for year 2012, I multiply it by 5. This should give a result like this:
profit year
30 2010
40 2011
50 2012
How should I approach this? I tried:
inflationtransform <- function(k,v) {
switch(k,
2010,v<-v*3,
2011,v<-v*4,
2012,v<-v*5,
)
}
df$profit <- sapply(df$year,df$profit,inflationtransform)
But it doesn't work. Can someone tell me what to do?
For this particular example, since your factors and years are both ordered and incremented by 1, you could just subtract 2007 from the year column and multiply it by profit.
transform(df, profit = profit * (year - 2007))
# profit year
# 1 30 2010
# 2 40 2011
# 3 50 2012
Otherwise, you could use a lookup vector. This will cover all cases.
lookup <- c("2010" = 3, "2011" = 4, "2012" = 5)
transform(df, profit = profit * lookup[as.character(year)])
# profit year
# 1 30 2010
# 2 40 2011
# 3 50 2012
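When the multipliers live in their own table rather than a named vector, the same lookup can be done with match(). The sketch below is one possible variant; the data frame name factors and its mult column are assumptions made for this example.

```r
df <- data.frame(profit = c(10, 10, 10), year = c(2010, 2011, 2012))

# Hypothetical lookup table: one multiplier per year.
factors <- data.frame(year = c(2010, 2011, 2012), mult = c(3, 4, 5))

# match() returns, for each row of df, the row of factors with the same year.
idx <- match(df$year, factors$year)
df$profit <- df$profit * factors$mult[idx]
```

A year that does not appear in the lookup table yields NA in profit, which makes missing multipliers easy to spot.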
I wouldn't use switch() unless you really need to. It's not vectorized, and that's where R is most efficient. However, since you ask for it in the comments, here's one way. I find it easier to use a for() loop with switch().
for(i in seq_len(nrow(df))) {
df$profit[i] <- with(df, switch(as.character(year[i]),
"2010" = 3 * profit[i],
"2011" = 4 * profit[i],
"2012" = 5 * profit[i]
))
}

return final row of dataframe - recurring variable names

I want to return the final row for each subsection of a dataframe. I'm aware of the ddply and aggregate functions, but they are not giving the expected output in this case, as the column by which I split the data has recurring names.
For example, in df:
year <- rep(c(2011, 2012, 2013), each=12)
season <- rep(c("Spring", "Summer", "Autumn", "Winter"), each=3)
allseason <- rep(season, 3)
temp <- rnorm(36, mean = 61, sd = 10)
df <- data.frame(year, allseason, temp)
I want to return the final temp reading at the end of every season. When I run either
final1 <- aggregate(df, list(df$allseason), tail, 1)
or
final2 <- ddply(df, .(allseason), tail, 1)
I get only the final 4 seasons (i.e. those of 2013). The function seems to stop there and does not go back to previous years/seasons. My intended output is a data frame with 12 rows * 3 columns.
All help appreciated!
*I notice that in the df created here, the allseason column is designated as a factor with 4 levels, whereas this is not the case in my original dataframe.
In your ddply code, you only forgot to also group by year:
With plyr:
library(plyr)
ddply(df, .(year, allseason), tail, 1)
Or with dplyr
library(dplyr)
df %>%
group_by(year, allseason) %>%
do(tail(.,1))
Or if you want a base R alternative you can use ave:
df[with(df, ave(year, list(year, allseason), FUN = seq_along)) == 3,]
Result:
# year allseason temp
#1 2011 Autumn 63.40626
#2 2011 Spring 59.69441
#3 2011 Summer 42.33252
#4 2011 Winter 79.10926
#5 2012 Autumn 63.14974
#6 2012 Spring 60.32811
#7 2012 Summer 67.57364
#8 2012 Winter 61.39100
#9 2013 Autumn 50.30501
#10 2013 Spring 61.43044
#11 2013 Summer 55.16605
#12 2013 Winter 69.37070
Note that the output will contain the same rows in each case, only the ordering may differ.
And just to add to @beginneR's answer, your aggregate solution should look like:
aggregate(temp ~ allseason + year, data = df, tail, 1)
# or:
with(df, aggregate(temp, list(allseason, year), tail, 1))
Result:
allseason year temp
1 Autumn 2011 64.51539
2 Spring 2011 45.14341
3 Summer 2011 62.29240
4 Winter 2011 47.97461
5 Autumn 2012 43.16781
6 Spring 2012 80.02419
7 Summer 2012 72.31149
8 Winter 2012 45.58344
9 Autumn 2013 55.92607
10 Spring 2013 52.06778
11 Summer 2013 51.01308
12 Winter 2013 53.22452
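Another base R option, shown here only as a sketch, is to keep the rows that are the last occurrence of each year and season pair. duplicated() with fromLast = TRUE marks every row that has a later duplicate, so negating it leaves the final row of each pair and keeps the original row order. The set.seed() call and the name last_rows are assumptions added to make the example reproducible.

```r
set.seed(1)  # assumption: fixed seed so the random temperatures are reproducible
year <- rep(c(2011, 2012, 2013), each = 12)
season <- rep(c("Spring", "Summer", "Autumn", "Winter"), each = 3)
allseason <- rep(season, 3)
temp <- rnorm(36, mean = 61, sd = 10)
df <- data.frame(year, allseason, temp)

# Keep a row only if no later row has the same year and season.
last_rows <- df[!duplicated(df[c("year", "allseason")], fromLast = TRUE), ]
```

Unlike the hard-coded `== 3` in the ave approach, this does not depend on every season having exactly three readings.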

Resources