How to subset a data frame with an "at least" condition? - r

I am new to R and I'm trying to subset a data frame, but I don't know how to do it for my use case. Specifically, I have a panel data frame ranging from 1987 to 2017, but some of the information I need is observed only in 2005, 2009, 2013 and 2017. As I can assume this information is constant over time, it is sufficient that an individual has been observed in at least one of these years.
How can I subset the data frame to keep all individuals across all years, conditional on having been observed in at least one of the years 2005, 2009, 2013, 2017?
Thank you.
The idea is the following:
pid year
101 1984
101 1985
101 1986
101 1987
102 1984
102 1985
102 1986
102 1987
..
102 2005
102 2006
103 1990
103 1991
103 1992
103 1993
...
103 2005
What I would like is to keep all the information and all years for the pids that have at least one observation in 2005, 2009, 2013 or 2017.

A guess with base R:
yearOk <- which(dat$year %in% c(2005, 2009, 2013, 2017)) # rows with a wanted year
idOk <- unique(dat$pid[yearOk]) # pids observed in at least one of these years
datOk <- dat[dat$pid %in% idOk, ] # keep every row of the wanted pids

Here's a way using ave from base R -
yourdf[with(yourdf, ave(year, pid, FUN = function(x) any(x %in% c(2005, 2009, 2013, 2017)))) == 1, ]
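For reference, here is a small self-contained sketch of both approaches on a toy panel. The data frame `dat` and its values are made up for illustration; only the column names `pid` and `year` come from the question.

```r
# Toy panel (hypothetical values); pid 101 is never observed in a target year
dat <- data.frame(
  pid  = c(101, 101, 102, 102, 102, 103, 103),
  year = c(1984, 1985, 1984, 2005, 2006, 1990, 2013)
)
target_years <- c(2005, 2009, 2013, 2017)

# Approach 1: collect the pids seen in a target year, then keep all their rows
ids_ok <- unique(dat$pid[dat$year %in% target_years])
kept_1 <- dat[dat$pid %in% ids_ok, ]

# Approach 2: per-pid flag with ave(); because year is numeric, ave() returns
# numeric 0/1 here, so compare with 1 to get a logical row index
flag   <- ave(dat$year, dat$pid, FUN = function(x) any(x %in% target_years))
kept_2 <- dat[flag == 1, ]
```

Both `kept_1` and `kept_2` should contain every row of pids 102 and 103, including the years outside the target set, and none of pid 101.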

Related

R - Round number to nearest unevenly spaced custom value

I am trying to round consecutive years to the nearest year in which a census took place. Unfortunately, in NZ the spacing between censuses is not always consistent. E.g. I want to round the years 2000 to 2020 to the nearest value of 2001, 2006, 2013, 2018. Is there a way to do this without resorting to a series of if_else or case_when statements?
You could use sapply to find the minimum absolute difference between the two vectors.
Suppose your vectors were like this:
census_years <- c(2001, 2006, 2013, 2018)
all_years <- 2000:2020
Then you can do:
sapply(all_years, function(x) census_years[which.min(abs(census_years - x))])
#> [1] 2001 2001 2001 2001 2006 2006 2006 2006 2006 2006 2013 2013 2013 2013 2013
#> [16] 2013 2018 2018 2018 2018 2018
Created on 2020-12-09 by the reprex package (v0.3.0)
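We can use findInterval. Note that, as written, this returns the next census year after the year in question, which is not always the nearest one (2007 would give 2013, not 2006).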
census_year[findInterval(year_in_question, census_year)+1]
#[1] 2013
data
census_year <- c(2001, 2006, 2013, 2018)
year_in_question <- 2012
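A possible variant that does return the nearest census year is to use the midpoints between consecutive census years as the breaks for findInterval. This is a sketch, not part of the original answer; the names `breaks` and `nearest` are introduced here for illustration.

```r
census_year <- c(2001, 2006, 2013, 2018)
all_years   <- 2000:2020

# Midpoints between consecutive census years act as the interval breaks
breaks <- head(census_year, -1) + diff(census_year) / 2

# findInterval() counts how many breaks are <= each year; +1 gives the index
nearest <- census_year[findInterval(all_years, breaks) + 1]
```

This is vectorized over `all_years`. A year lying exactly on a midpoint would be assigned to the later census year; with these census years no whole-number year falls on a midpoint.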
This does the trick, by finding the smallest difference between the year and the census years. Vectorizing is left as an exercise...
require(magrittr)
census_year <- c(2001, 2006, 2013, 2018)
year_in_question <- 2012
abs(census_year - year_in_question) %>% # abs diff in years
which.min() %>% # index number of the smallest abs difference
census_year[.] # use that index number
[1] 2013

Functions on a Matrix in R

Let's say I have a dataset with a column representing the years.
Years
2007
2008
2009
2011
2015
I want to subtract each row from the row below it and save the answer to a new column. For the data above, I want a function that subtracts 2007 from 2008 (the answer is 1) and saves it to a new column; the next would be 2009 - 2008, then 2011 - 2009. The resulting matrix should look like
Year Gap
2007 1
2008 1
2009 2
2011 4
2015 .
and so on
How can I make a function in R that will do this for me?
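One possible approach, sketched with base R's diff(): it returns each element minus the one before it, so padding the result with NA at the end lines each gap up with the earlier of the two years. The data frame name `df` is an assumption; the column names follow the question.

```r
df <- data.frame(Year = c(2007, 2008, 2009, 2011, 2015))

# diff() gives next-minus-current; the last row has no next row, so pad with NA
df$Gap <- c(diff(df$Year), NA)
```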

Exclude values from data.frame in R

I have the following dataframe:
Count Year
32 2018
346 2017
524 2016
533 2015
223 2014
1 2010
3 2008
1 1992
Is it possible to exclude the years 1992 and 2008? I tried different ways, but couldn't find a flexible solution.
I would like to have the same data frame without the years 1992 and 2008.
Many thanks in advance,
jeemer
library(dplyr); filter(df, Year != 1992 & Year != 2008)
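If the list of years to drop may grow, a negated %in% is a more flexible form of the same filter. A minimal base R sketch on the data from the question; the names `drop_years` and `kept` are introduced here for illustration.

```r
df <- data.frame(
  Count = c(32, 346, 524, 533, 223, 1, 3, 1),
  Year  = c(2018, 2017, 2016, 2015, 2014, 2010, 2008, 1992)
)
drop_years <- c(1992, 2008)

# Keep the rows whose Year is NOT in drop_years
kept <- df[!df$Year %in% drop_years, ]
```

Excluding further years then only means extending `drop_years`.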

Simple filtering in R, but with more than one value

I am well aware of how to extract some data based on a condition, but whenever I try multiple conditions, a struggle ensues. I have some data and I only want to extract certain years from the df. Here is an example df:
year value
2006 3
2007 4
2007 3
2008 5
2008 4
2008 4
2009 5
2009 9
2010 2
2010 8
2011 3
2011 8
2011 7
2012 3
2013 4
2012 6
Now let's say I just want 2008, 2009, 2010, and 2011. I try
df<-df[df$year == c("2008", "2009", "2010", "2011"),]
doesn't work, so then:
df<-df[df$year == "2008" & df$year == "2009"
& df$year == "2010" & df$year == "2011",]
No error messages, just an empty df. What am I missing?
You need to use %in% and not ==
df[df$year %in% c(2008, 2009, 2010, 2011),]
year value
4 2008 5
5 2008 4
6 2008 4
7 2009 5
8 2009 9
9 2010 2
10 2010 8
11 2011 3
12 2011 8
13 2011 7
As answered %in% works but so should using |. The & is for AND logic, meaning that the year would need to be equal to 2008, 2009, 2010 AND 2011 whereas what you want is the OR operator.
df<-df[df$year == "2008" | df$year == "2009" | df$year == "2010" | df$year == "2011",]
If you don't like %in%, try the function is.element. You might find it more intuitive.
df[is.element(df[, "year"], 2008:2011), ]
Careful, though... switching the two arguments gives different results, and it can be confusing which way you want it. For this example, just remember that the first argument holds the elements to test and the second holds the set of years that you want.
The question has been answered, but I wanted to add a comment about why your first try gives an unexpected result. This is a good example of R's vector recycling.
I'm guessing you got
year value
5 2008 4
12 2011 8
Why has R done this? What happens is R recycles the vector c("2008", "2009", "2010", "2011") like the below.
year value compare
2006 3 2008
2007 4 2009
2007 3 2010
2008 5 2011
2008 4 2008
2008 4 2009
2009 5 2010
2009 9 2011
2010 2 2008
2010 8 2009
2011 3 2010
2011 8 2011
2011 7 2008
2012 3 2009
2013 4 2010
2012 6 2011
Do you see what's about to happen? When you run
df<-df[df$year == c("2008", "2009", "2010", "2011"),]
it will return the rows where the year column and the compare column are equal. You didn't get a warning because (by chance) the length of your comparison vector (4) divides the number of rows (16) evenly, so R recycled it silently.
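The recycling can be made explicit with a short sketch on the example data. The names `wanted`, `compare`, `hits_eq` and `hits_in` are introduced here for illustration, and the years are stored as numbers rather than strings.

```r
df <- data.frame(
  year  = c(2006, 2007, 2007, 2008, 2008, 2008, 2009, 2009,
            2010, 2010, 2011, 2011, 2011, 2012, 2013, 2012),
  value = c(3, 4, 3, 5, 4, 4, 5, 9, 2, 8, 3, 8, 7, 3, 4, 6)
)
wanted <- c(2008, 2009, 2010, 2011)

# What == effectively compares against: wanted repeated to nrow(df)
compare <- rep(wanted, length.out = nrow(df))

hits_eq <- which(df$year == wanted)    # element-wise comparison with recycling
hits_in <- which(df$year %in% wanted)  # set membership, what was intended
```

`hits_eq` matches `which(df$year == compare)` exactly, which is why only a couple of rows survive, whereas `hits_in` picks up every row from 2008 to 2011.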
This is essentially the same as @Metrics' answer:
subset(df, year %in% c(2008, 2009, 2010, 2011))
And if you need help with %in%, see ?match

Refer to relative rows in R

I know this answer must be out there, but I can't figure out how to word the question.
I'd like to calculate the differences between values in my data.frame.
from this:
f <- data.frame(year=c(2004, 2005, 2006, 2007), value=c(8565, 8745, 8985, 8412))
year value
1 2004 8565
2 2005 8745
3 2006 8985
4 2007 8412
to this:
year value diff
1 2004 8565 NA
2 2005 8745 180
3 2006 8985 240
4 2007 8412 -573
(ie value of current year minus value of previous year)
But I don't know how to have a result in one row that is created from another row. Any help?
Thanks,
Tom
There are many different ways to do this, but here's one:
f[, "diff"] <- c(NA, diff(f$value))
More generally, if you want to refer to relative rows, you can use lag() or do it directly with indexes:
f[-1,"diff"] <- f[-1, "value"] - f[-nrow(f), "value"]
Use the diff function
f <- cbind(f, diff = c(NA, diff(f[, 2])))
If year column isn't sorted then you could use match:
f$diff <- f$value - f$value[match(f$year-1, f$year)]
