R - Round number to nearest unevenly spaced custom value - r

I am trying to round consecutive years to the nearest year in which a census took place. Unfortunately, in NZ the spacing between censuses is not always consistent. E.g., I want to round the years 2000 to 2020 to the nearest value of 2001, 2006, 2013, 2018. Is there a way to do this without resorting to a series of if_else or case_when statements?

You could use sapply to find, for each year, the census year with the minimum absolute difference.
Suppose your vectors were like this:
census_years <- c(2001, 2006, 2013, 2018)
all_years <- 2000:2020
Then you can do:
sapply(all_years, function(x) census_years[which.min(abs(census_years - x))])
#> [1] 2001 2001 2001 2001 2006 2006 2006 2006 2006 2006 2013 2013 2013 2013 2013
#> [16] 2013 2018 2018 2018 2018 2018
Created on 2020-12-09 by the reprex package (v0.3.0)

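We can use findInterval, taking the midpoints between consecutive census years as break points so that each year maps to the nearest census rather than the next one:
midpoints <- (head(census_year, -1) + tail(census_year, -1)) / 2
census_year[findInterval(year_in_question, midpoints) + 1]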
#[1] 2013
data
census_year <- c(2001, 2006, 2013, 2018)
year_in_question <- 2012

This does the trick by finding the smallest absolute difference between the year and the census years. Vectorizing is left as an exercise...
library(magrittr)
census_year <- c(2001, 2006, 2013, 2018)
year_in_question <- 2012
abs(census_year - year_in_question) %>% # absolute difference in years
  which.min() %>% # index of the smallest absolute difference
  census_year[.] # census year at that index
#> [1] 2013
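For the vectorizing step, one possible sketch (not taken from the answers above) builds the full matrix of absolute differences with outer and picks the closest census year for each row. The helper name nearest_census is made up for illustration. Ties would go to the earlier census, because which.min returns the first minimum.

```r
census_year <- c(2001, 2006, 2013, 2018)
all_years <- 2000:2020

# nearest_census() is a hypothetical helper name, not a base R function
nearest_census <- function(years, census) {
  # rows = years, columns = census years, cells = absolute distance
  distance <- abs(outer(years, census, "-"))
  # for each row, take the census year with the smallest distance
  census[apply(distance, 1, which.min)]
}

nearest <- nearest_census(all_years, census_year)
```

The result has one element per entry of all_years, so it can be assigned straight into a data frame column.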

Related

How to find rolling mean using means previously generated using R?

Hope the community can help me since I am relatively new to R and to the StackOverflow community.
I am trying to replace a missing value of a group with the average of the 3 previous years and then use this newly generated mean to continue generating the next period missing value in R either using dplyr or data.table. My data looks something like this (desired output column rounded to 2 digits):
df <- data.frame(gvkey = c(10443, 10443, 10443, 10443, 10443, 10443, 10443, 29206, 29206, 29206, 29206, 29206), fyear = c(2005, 2006, 2007, 2008, 2009, 2010, 2011, 2017, 2018, 2019, 2020, 2021), receivables = c(543, 595, 757, NA, NA, NA, NA, 147.469, 161.422, 154.019, NA, NA), desired_output = c(543, 595, 757, 631.67, 661.22, 683.30, 658.73, 147.47, 161.42, 154.02, 154.30, 156.58))
I have attempted the following line of code, but it does not use the newly generated number:
df <- df %>% mutate(mean_rect = rollapply(receivables, 3, mean, align = 'right', fill = NA))
Any help would be greatly appreciated!
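Because your desired fill value depends on any previously created fill values, I think the only reasonable approach is a trusty for loop. Note that it walks the rows in order and relies on each gvkey starting with three observed values, as in your sample data:
df$out <- NA_real_
for (i in seq_len(nrow(df))) {
  if (!is.na(df$receivables[i])) {
    # observed value: keep it
    df$out[i] <- df$receivables[i]
  } else {
    # missing value: mean of the three previous (possibly filled) values
    df$out[i] <- mean(df$out[(i - 3):(i - 1)], na.rm = TRUE)
  }
}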
gvkey fyear receivables desired_output out
1 10443 2005 543.000 543.00 543.0000
2 10443 2006 595.000 595.00 595.0000
3 10443 2007 757.000 757.00 757.0000
4 10443 2008 NA 631.67 631.6667
5 10443 2009 NA 661.22 661.2222
6 10443 2010 NA 683.30 683.2963
7 10443 2011 NA 658.73 658.7284
8 29206 2017 147.469 147.47 147.4690
9 29206 2018 161.422 161.42 161.4220
10 29206 2019 154.019 154.02 154.0190
11 29206 2020 NA 154.30 154.3033
12 29206 2021 NA 156.58 156.5814
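The loop above runs over the whole data frame, so a group that began with a missing value would borrow values from the previous gvkey. A hedged sketch of a per-group variant, assuming the rows are already sorted by fyear within each gvkey: the helper fill_recursive_mean and its parameter k are names invented here, and ave from base R applies it to each group separately.

```r
df <- data.frame(
  gvkey = c(rep(10443, 7), rep(29206, 5)),
  fyear = c(2005:2011, 2017:2021),
  receivables = c(543, 595, 757, NA, NA, NA, NA,
                  147.469, 161.422, 154.019, NA, NA)
)

# hypothetical helper: fill each NA with the mean of up to k previous values,
# where previous values may themselves have been filled
fill_recursive_mean <- function(x, k = 3) {
  out <- x
  for (i in seq_along(out)) {
    if (is.na(out[i]) && i > 1) {
      window <- out[max(1, i - k):(i - 1)]
      # leave the NA in place if nothing earlier is available
      if (any(!is.na(window))) out[i] <- mean(window, na.rm = TRUE)
    }
  }
  out
}

# ave() splits receivables by gvkey and applies the helper within each group
df$out <- ave(df$receivables, df$gvkey, FUN = fill_recursive_mean)
```

A leading NA in a group stays NA under this sketch, since there is nothing earlier in that group to average.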

Calculate the difference between max and min values of two vectors for nested subsets of those vectors

I have a dataframe of about 900 rows (see simplified sample below). I am trying to estimate the value of the maximum(doy) – minimum(doy) per whaleID, per year. I need to return an object (e.g. table) of the doy difference by whaleID and year. One challenge is that not every year contains two 'doy' observations. I’ve tried using “dplyr”, aggregate() and making a loop (which I am not yet competent in designing). I’d like to achieve this using Base if possible, but am all ears for any suggestions for help on this one, thank you!
whaleID<-c(31,4,5,65,31,4,4,4,31,5)
year<-c(2010, 2010, 2010, 2011, 2011, 2011, 2011, 2011, 2011, 2011)
doy<-c(65,71,88,67,77,78,81,82,88,88)
You can just use aggregate() and subtract the values from range():
whaleID<-c(31,4,5,65,31,4,4,4,31,5)
year<-c(2010, 2010, 2010, 2011, 2011, 2011, 2011, 2011, 2011, 2011)
doy<-c(65,71,88,67,77,78,81,82,88,88)
dfx <- data.frame(whaleID, year, doy)
aggregate(dfx$doy, by = list(whaleId = dfx$whaleID, year = dfx$year),
FUN = function(x) diff(range(x)))
whaleId year x
1 4 2010 0
2 5 2010 0
3 31 2010 0
4 4 2011 4
5 5 2011 0
6 31 2011 11
7 65 2011 0
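As a variation on the same answer, aggregate also has a formula interface, which keeps the original column names in the result. This is a sketch on the sample data; the object name doy_range is an arbitrary choice.

```r
whaleID <- c(31, 4, 5, 65, 31, 4, 4, 4, 31, 5)
year <- c(2010, 2010, 2010, 2011, 2011, 2011, 2011, 2011, 2011, 2011)
doy <- c(65, 71, 88, 67, 77, 78, 81, 82, 88, 88)
dfx <- data.frame(whaleID, year, doy)

# max(doy) - min(doy) for every whaleID/year combination that occurs
doy_range <- aggregate(doy ~ whaleID + year, data = dfx,
                       FUN = function(x) diff(range(x)))
```

Groups with a single observation get a difference of 0, which handles the years that do not contain two doy values.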

How to subset a data frame with an "at least" condition?

I am new to R and I'm trying to subset a data frame, but I don't know how to do it according to my needs. Specifically, I have a panel data frame ranging from 1987 to 2017, but some information I need is observed only in 2005, 2009, 2013 and 2017. As I can assume this information is constant over time, it's sufficient that an individual has been observed in at least one of these years.
How can I subset the data frame to keep those individuals across all years, conditional on their having been observed in at least one of 2005, 2009, 2013, 2017?
Thank you.
The idea is the following:
pid year
101 1984
101 1985
101 1986
101 1987
102 1984
102 1985
102 1986
102 1987
..
102 2005
102 2006
103 1990
103 1991
103 1992
103 1993
...
103 2005
What I would like is to keep all the information and years for the pids that have at least one observation in 2005, 2009, 2013 or 2017.
A guess with base R:
yearOk <- dat$year %in% c(2005, 2009, 2013, 2017) # rows observed in one of the target years
idOk <- unique(dat$pid[yearOk]) # ids that appear in at least one of these years
datOk <- dat[dat$pid %in% idOk, ] # keep every row for those ids
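Here's a way using ave from base R. Since ave returns its result as the same type as year (numeric 0/1), it is converted to logical before indexing:
yourdf[as.logical(with(yourdf, ave(year, pid, FUN = function(x) any(x %in% c(2005, 2009, 2013, 2017))))), ]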
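To see the "at least one of these years" idea run end to end, here is a small sketch on made-up data. The data frame dat and its values are invented for illustration, with the column names pid and year taken from the question.

```r
# toy panel: only pid 102 is observed in one of the target years (2005)
dat <- data.frame(
  pid = c(101, 101, 102, 102, 102, 103, 103),
  year = c(1986, 1987, 1987, 2005, 2006, 1993, 2004)
)
target_years <- c(2005, 2009, 2013, 2017)

# flag each row, then spread "any row flagged" across the whole pid
in_target <- dat$year %in% target_years
keep <- ave(in_target, dat$pid, FUN = any)

# all years are kept for every pid with at least one target-year observation
res <- dat[keep, ]
```

Passing a logical vector to ave keeps the result logical, so it can be used for row indexing directly.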

Functions on a Matrix in R

Let's say I have a dataset with a column representing the years.
Years
2007
2008
2009
2011
2015
I want to subtract each row from the row below it and save the answer to a new column. For the data above, I want a function that subtracts 2007 from 2008, giving 1, and saves this to a new column; the next would be 2009 - 2008, then 2011 - 2009. The resulting matrix should look like
Year Gap
2007 1
2008 1
2009 2
2011 4
2015 .
and so on
How can I make a function in R that will do this for me?
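One way this could be done, sketched with base R's diff: it returns the differences between consecutive elements, so padding the result with a trailing NA lines each gap up with the earlier of the two years. The name gap_df is arbitrary, and NA stands in for the "." in the last row of the desired output.

```r
years <- c(2007, 2008, 2009, 2011, 2015)

# diff() gives years[i + 1] - years[i]; the last year has no successor
gap_df <- data.frame(Year = years, Gap = c(diff(years), NA))
```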

Simple filtering in R, but with more than one value

I am well aware of how to extract some data based on a condition, but whenever I try multiple conditions, a struggle ensues. I have some data and I only want to extract certain years from the df. Here is an example df:
year value
2006 3
2007 4
2007 3
2008 5
2008 4
2008 4
2009 5
2009 9
2010 2
2010 8
2011 3
2011 8
2011 7
2012 3
2013 4
2012 6
Now let's say I just want 2008, 2009, 2010, and 2011. I try
df<-df[df$year == c("2008", "2009", "2010", "2011"),]
doesn't work, so then:
df<-df[df$year == "2008" & df$year == "2009"
& df$year == "2010" & df$year == "2011",]
No error messages, just an empty df. What am I missing?
You need to use %in% and not ==
df[df$year %in% c(2008, 2009, 2010, 2011),]
year value
4 2008 5
5 2008 4
6 2008 4
7 2009 5
8 2009 9
9 2010 2
10 2010 8
11 2011 3
12 2011 8
13 2011 7
As answered, %in% works, but so should using |. The & is for AND logic, meaning that the year would need to be equal to 2008, 2009, 2010 AND 2011 at the same time, whereas what you want is the OR operator.
df<-df[df$year == "2008" | df$year == "2009" | df$year == "2010" | df$year == "2011",]
If you don't like %in%, try the function is.element. You might find it more intuitive.
df[is.element(el = df[, "year"], table = 2008:2011), ]
Careful, though... switching el and table gives different results, and it can be confusing which way you want it. For this example, just remember that table contains the set of years that you want.
The question has been answered, but I wanted to add a comment about why your first try gives an unexpected result. This is a good example of R's vector recycling.
I'm guessing you got
year value
5 2008 4
12 2011 8
Why has R done this? What happens is R recycles the vector c("2008", "2009", "2010", "2011") like the below.
year value compare
2006 3 2008
2007 4 2009
2007 3 2010
2008 5 2011
2008 4 2008
2008 4 2009
2009 5 2010
2009 9 2011
2010 2 2008
2010 8 2009
2011 3 2010
2011 8 2011
2011 7 2008
2012 3 2009
2013 4 2010
2012 6 2011
Do you see what's about to happen? When you run
df<-df[df$year == c("2008", "2009", "2010", "2011"),]
it will return the rows where the year column and the compare column are equal. You didn't get a warning because (by chance) the length of your comparison vector (4) divides the number of rows (16) exactly, so R recycled it silently.
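The recycling can be made explicit with rep. In this sketch the year column from the question is re-entered as a plain vector, and the names wanted, recycled, hits_eq and hits_in are just illustrative.

```r
year <- c(2006, 2007, 2007, 2008, 2008, 2008, 2009, 2009,
          2010, 2010, 2011, 2011, 2011, 2012, 2013, 2012)
wanted <- c(2008, 2009, 2010, 2011)

# what == effectively compares against: wanted repeated out to 16 elements
recycled <- rep(wanted, length.out = length(year))

# positions where year happens to line up with the recycled vector
hits_eq <- which(year == wanted)
# positions of every year that belongs to the wanted set
hits_in <- which(year %in% wanted)
```

Comparing hits_eq with which(year == recycled) shows that the two are the same, while hits_in picks up every row from 2008 to 2011.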
This is essentially the same as @Metrics' answer:
subset(df, year %in% c(2008, 2009, 2010, 2011))
And if you need help with %in%, see ?match (or ?"%in%")
