Filling missing levels - r

I have the following type of dataframe:
Country <- rep(c("USA", "AUS", "GRC"),2)
Year <- 2001:2006
Level <- c("rich","middle","poor",rep(NA,3))
df <- data.frame(Country, Year,Level)
df
Country Year Level
1 USA 2001 rich
2 AUS 2002 middle
3 GRC 2003 poor
4 USA 2004 <NA>
5 AUS 2005 <NA>
6 GRC 2006 <NA>
I want to fill the missing values in the last column (Level) with the correct label for each country.
So the expected outcome should be like this:
Country Year Level
1 USA 2001 rich
2 AUS 2002 middle
3 GRC 2003 poor
4 USA 2004 rich
5 AUS 2005 middle
6 GRC 2006 poor

In base R, you could use ave():
transform(df, Level = ave(Level, Country, FUN = na.omit))
# Country Year Level
# 1 USA 2001 rich
# 2 AUS 2002 middle
# 3 GRC 2003 poor
# 4 USA 2004 rich
# 5 AUS 2005 middle
# 6 GRC 2006 poor
Another possibility is to use a join. Here we merge the Country and Year columns with the Country/Level pairs from the NA-omitted data. The Level values are the same as above, just in a different row order, because merge() sorts by the key.
merge(df[c("Country", "Year")], na.omit(df)[c("Country", "Level")])
# Country Year Level
# 1 AUS 2002 middle
# 2 AUS 2005 middle
# 3 GRC 2003 poor
# 4 GRC 2006 poor
# 5 USA 2001 rich
# 6 USA 2004 rich
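Since merge() reorders the rows, a lookup that leaves every row where it is can be handy. The following is a minimal base R sketch, assuming each Country has exactly one non-missing Level; the names known and lookup are made up for illustration.

```r
# Rebuild the example data (stringsAsFactors = FALSE keeps Level as character)
Country <- rep(c("USA", "AUS", "GRC"), 2)
Year <- 2001:2006
Level <- c("rich", "middle", "poor", rep(NA, 3))
df <- data.frame(Country, Year, Level, stringsAsFactors = FALSE)

# Named vector mapping each Country to its known Level
# (assumption: exactly one non-NA Level per Country)
known <- na.omit(df)
lookup <- setNames(known$Level, known$Country)

# Index the lookup by Country; row order and Year stay untouched
df$Level <- unname(lookup[df$Country])
df
```

If a country could carry several different labels, the lookup would silently keep only the first match, so it is worth checking for duplicates in known$Country first.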

We can group by 'Country' and take the first non-NA value:
library(dplyr)
df %>%
group_by(Country) %>%
dplyr::mutate(Level = Level[!is.na(Level)][1])
# A tibble: 6 x 3
# Groups: Country [3]
# Country Year Level
# <fctr> <int> <fctr>
#1 USA 2001 rich
#2 AUS 2002 middle
#3 GRC 2003 poor
#4 USA 2004 rich
#5 AUS 2005 middle
#6 GRC 2006 poor
If dplyr is loaded along with plyr, it is better to call dplyr::mutate or dplyr::summarise explicitly so that the dplyr version is used. plyr has functions with the same names, and they can mask the dplyr ones when both packages are loaded, which leads to different behavior.

You can do it using data.table and zoo:
library(data.table)
library(zoo)
setDT(df)
df[, Level := na.locf(Level), by = Country]
This will give you:
Country Year Level
1: USA 2001 rich
2: AUS 2002 middle
3: GRC 2003 poor
4: USA 2004 rich
5: AUS 2005 middle
6: GRC 2006 poor

library(dplyr)
df %>%
group_by(Country) %>%
mutate(Level = replace(Level, is.na(Level), unique(na.omit(Level))))
Country Year Level
<fctr> <int> <fctr>
1 USA 2001 rich
2 AUS 2002 middle
3 GRC 2003 poor
4 USA 2004 rich
5 AUS 2005 middle
6 GRC 2006 poor
Or, more succinctly, applying @suchait's idea to use na.locf:
df %>%
group_by(Country) %>%
mutate(Level = zoo::na.locf(Level))

A solution using dplyr and tidyr.
library(dplyr)
library(tidyr)
df %>%
arrange(Country) %>%
fill(Level) %>%
arrange(Year)
# Country Year Level
# 1 USA 2001 rich
# 2 AUS 2002 middle
# 3 GRC 2003 poor
# 4 USA 2004 rich
# 5 AUS 2005 middle
# 6 GRC 2006 poor

Here is another data.table solution which updates on join using a lookup table which is created from the given dataset itself:
library(data.table)
setDT(df)[df[!is.na(Level)], on = .(Country), Level := i.Level][]
Country Year Level
1: USA 2001 rich
2: AUS 2002 middle
3: GRC 2003 poor
4: USA 2004 rich
5: AUS 2005 middle
6: GRC 2006 poor

Related

summing a column based on values in two other columns

I have a data frame that lists individual mass shootings for each state between 1991-2020. I would like to 1) sum the total victims each year for each state, and 2) sum the total number of mass shootings each state had each year.
So far, I've only managed to get a total sum of victims between 1991-2020 for each state. And I'm not even sure how I could get a column with the total incidents per year, per state. Are there any adjustments I can make to the aggregate function, or is there some other function to get the information I want?
What I have:
combined = read.csv('https://raw.githubusercontent.com/bandcar/massShootings/main/combo1991_2020_states.csv')
> head(combined)
state date year fatalities injured total_victims
3342 Alabama 04/07/2009 2009 4 0 4
3351 Alabama 03/10/2009 2009 10 6 16
3285 Alabama 01/29/2012 2012 5 0 5
135 Alabama 12/28/2013 2013 3 5 8
267 Alabama 07/06/2013 2013 0 4 4
557 Alabama 06/08/2014 2014 1 4 5
q = aggregate(total_victims ~ state,data=combined,FUN=sum)
> head(q)
state total_victims
1 Alabama 364
2 Alaska 19
3 Arizona 223
4 Arkansas 205
5 California 1816
6 Colorado 315
What I want for each state for each year:
year state total_victims total_shootings
1 2009 Alabama 20 2
2 2012 Alabama 5 1
3 2013 Alabama 12 2
4 2014 Alabama 5 1
You can use group_by in combination with summarise() from the tidyverse packages.
library(tidyverse)
combined |>
group_by(state, year) |>
summarise(total_victims = sum(total_victims),
total_shootings = n())
This is the result you get:
# A tibble: 457 x 4
# Groups: state [52]
state year total_victims total_shootings
<chr> <int> <int> <int>
1 Alabama 2009 20 2
2 Alabama 2012 5 1
3 Alabama 2013 12 2
4 Alabama 2014 10 2
5 Alabama 2015 17 4
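Since the question asks whether aggregate() itself can be adjusted, here is a base R sketch of one way to do it: add year to the right-hand side of the formula, aggregate once with sum and once with length, then merge. It uses only the six rows shown by head(combined) above, so the numbers cover that excerpt and not the full CSV; the names victims, shootings and result are illustrative.

```r
# The six rows printed by head(combined) in the question
combined <- data.frame(
  state = rep("Alabama", 6),
  year = c(2009, 2009, 2012, 2013, 2013, 2014),
  total_victims = c(4, 16, 5, 8, 4, 5)
)

# Victims per state and year
victims <- aggregate(total_victims ~ state + year, data = combined, FUN = sum)

# Incidents per state and year: count the rows in each group
shootings <- aggregate(total_victims ~ state + year, data = combined, FUN = length)
names(shootings)[3] <- "total_shootings"

# Combine both summaries into one table
result <- merge(victims, shootings, by = c("state", "year"))
result
```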

Add rows and complete dyad by group

I have a dataset in a dyadic format and sorted by group and I am trying to add an observation to each group. I need this observation to also be integrated with the other pairs. Below is a reproducible example to show what I mean. Data is a simplified version of my dataset (it contains more groups essentially).
data <- data.frame(country1 = c("BEL", "FRA", "BEL", "FRA", "AUS", "ITA"),
country2 = c("FRA", "BEL", "FRA", "BEL", "ITA", "AUS"),
year = c(2001,2001,2002,2002,2002,2002),
id = c(1,1,1,1,2,2))
> data
country1 country2 year id
1 BEL FRA 2001 1
2 FRA BEL 2001 1
3 BEL FRA 2002 1
4 FRA BEL 2002 1
5 AUS ITA 2002 2
6 ITA AUS 2002 2
I would like to add a different country to each group. For instance, say I would like to add Luxembourg to group 1 and Portugal to group 2.
This is what the output I need should look like:
> data
country1 country2 year id
1 BEL FRA 2001 1
2 FRA BEL 2001 1
3 LUX BEL 2001 1
4 LUX FRA 2001 1
5 BEL LUX 2001 1
6 FRA LUX 2001 1
7 BEL FRA 2002 1
8 FRA BEL 2002 1
9 LUX BEL 2002 1
10 LUX FRA 2002 1
11 BEL LUX 2002 1
12 FRA LUX 2002 1
13 AUS ITA 2002 2
14 ITA AUS 2002 2
15 POR AUS 2002 2
16 POR ITA 2002 2
17 AUS POR 2002 2
18 ITA POR 2002 2
I found a workaround, but I don't know how to simplify this process and automate it to some extent.
id1 <- data%>%
filter(id== 1) %>%
mutate(country3 = "LUX")
id1_1 <- id1 %>%
select(!country2) %>%
rename("country2" = "country3") %>%
distinct()
id1_2 <- id1 %>%
select(!country1) %>%
rename("country1" = "country3") %>%
distinct()
id1_2 <- id1_2 [, c(2,1,3,4)]
id1 <- rbind(id1_1, id1_2)
data<- rbind(data, id1)
This completes the dyads but it is quite tedious to do since I am trying to add about 100 countries to a hundred groups.
I can create either a vector or a data frame containing all the countries I need to add (and arrange them by group if necessary), but I just don't know how to use them to fill the main data. Thanks for any tips!
Would something like this work for you?
library(tidyverse)
data <- data.frame(country1 = c("BEL", "FRA", "BEL", "FRA", "AUS", "ITA"),
country2 = c("FRA", "BEL", "FRA", "BEL", "ITA", "AUS"),
year = c(2001,2001,2002,2002,2002,2002),
id = c(1,1,1,1,2,2))
additions <- tribble(
~id, ~country1,
1, "LUX",
2, "POR"
)
unique_combos <- data |>
distinct(id, year, country1) |>
rows_append(additions) |>
expand(year, nesting(id, country1)) |>
filter(!is.na(year))
unique_combos |>
rename(country2 = country1) |>
full_join(unique_combos) |>
filter(country1 != country2) |>
arrange(id, year, country1, country2)
#> Joining, by = c("year", "id")
#> # A tibble: 24 × 4
#> year id country2 country1
#> <dbl> <dbl> <chr> <chr>
#> 1 2001 1 FRA BEL
#> 2 2001 1 LUX BEL
#> 3 2001 1 BEL FRA
#> 4 2001 1 LUX FRA
#> 5 2001 1 BEL LUX
#> 6 2001 1 FRA LUX
#> 7 2002 1 FRA BEL
#> 8 2002 1 LUX BEL
#> 9 2002 1 BEL FRA
#> 10 2002 1 LUX FRA
#> # … with 14 more rows
Created on 2022-06-29 by the reprex package (v2.0.1)
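Note that the expand() call above crosses every year with every id, so group 2 also receives 2001 rows (24 rows instead of the 18 in the question). If each group should keep only the years it actually has, one option is to work through one id-year block at a time. This is a base R sketch under that assumption; additions is a hypothetical named vector mapping each id to the country to add.

```r
data <- data.frame(country1 = c("BEL", "FRA", "BEL", "FRA", "AUS", "ITA"),
                   country2 = c("FRA", "BEL", "FRA", "BEL", "ITA", "AUS"),
                   year = c(2001, 2001, 2002, 2002, 2002, 2002),
                   id = c(1, 1, 1, 1, 2, 2))

# Hypothetical lookup: id -> country to add to that group
additions <- c("1" = "LUX", "2" = "POR")

# One block per observed id/year combination (drop = TRUE skips unobserved ones)
blocks <- split(data, list(data$id, data$year), drop = TRUE)

completed <- do.call(rbind, lapply(blocks, function(b) {
  # Existing members of the block plus the new country for this id
  members <- union(b$country1, additions[[as.character(b$id[1])]])
  # All ordered pairs, minus self-pairs
  pairs <- expand.grid(country1 = members, country2 = members,
                       stringsAsFactors = FALSE)
  pairs <- pairs[pairs$country1 != pairs$country2, ]
  cbind(pairs, year = b$year[1], id = b$id[1])
}))
rownames(completed) <- NULL
completed
```

To add several countries to one group, additions could be a named list of character vectors instead; union() handles that unchanged.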

r data.table adjust min and max years only if each set has at least one incrementing obs

I have a data set that holds an id, location, start year, end year, age1 and age2. For each group defined by id, location, age1 and age2, I would like to create a new start and end year. For instance, I may have three entries for china encompassing age 0 - age 4. One will be 2000 - 2000, the other is 2001 - 2001, and the final is 2005-2005. Since the years are incrementing by 1 in the first two entries, I'd want their corresponding newstart and newend to be 2000-2001. The third entry would have newstart==2005 and newend==2005 as this is not part of a continuous set of years.
The data table I have resembles the following, except it has thousands of entries and many combinations:
id location start end age1 age2
1 brazil 2000 2000 0 4
1 brazil 2001 2001 0 4
1 brazil 2002 2002 0 4
2 argentina 1990 1991 1 1
2 argentina 1991 1991 2 2
2 argentina 1992 1992 2 2
2 argentina 1993 1993 2 2
3 belize 2001 2001 0.5 1
3 belize 2005 2005 1 2
I want to alter the data table so that it will look like the following
id location start end age1 age2 newstart newend
1 brazil 2000 2000 0 4 2000 2002
1 brazil 2001 2001 0 4 2000 2002
1 brazil 2002 2002 0 4 2000 2002
2 argentina 1990 1991 1 1 1991 1991
2 argentina 1991 1991 2 2 1991 1993
2 argentina 1992 1992 2 2 1991 1993
2 argentina 1993 1993 2 2 1991 1993
3 belize 2001 2001 0.5 1 2001 2001
3 belize 2005 2005 1 2 2005 2005
I have tried creating a variable that tracks the difference of the previous year and the current year using lag and then calculating the difference between these two years. I then created the newstart and newend by placing the min start and max end. I have found that this only works if there is a set of 2 in continuous years. If I have a larger set, this doesn't work as it has no way of tracking the number of obs in which the years increase by 1 for each grouping. I believe I need some type of loop.
Is there a more efficient way to accomplish this?
data.table
You tagged with data.table, so my first suggestion is this:
library(data.table)
dat[, contiguous := cumsum(c(TRUE, diff(start) != 1)), by = .(id)]
dat[, c("newstart", "newend") := .(min(start), max(end)), by = .(id, contiguous)]
dat[, contiguous := NULL]
dat
# id location start end age1 age2 newstart newend
# 1: 1 brazil 2000 2000 0.0 4 2000 2002
# 2: 1 brazil 2001 2001 0.0 4 2000 2002
# 3: 1 brazil 2002 2002 0.0 4 2000 2002
# 4: 2 argentina 1990 1991 1.0 1 1990 1993
# 5: 2 argentina 1991 1991 2.0 2 1990 1993
# 6: 2 argentina 1992 1992 2.0 2 1990 1993
# 7: 2 argentina 1993 1993 2.0 2 1990 1993
# 8: 3 belize 2001 2001 0.5 1 2001 2001
# 9: 3 belize 2005 2005 1.0 2 2005 2005
base R
If instead you really just mean data.frame, then
dat <- transform(dat, contiguous = ave(start, id, FUN = function(a) cumsum(c(TRUE, diff(a) != 1))))
dat <- transform(dat,
newstart = ave(start, id, contiguous, FUN = min),
newend = ave(end , id, contiguous, FUN = max)
)
# Warning in FUN(X[[i]], ...) :
# no non-missing arguments to min; returning Inf
# Warning in FUN(X[[i]], ...) :
# no non-missing arguments to min; returning Inf
# Warning in FUN(X[[i]], ...) :
# no non-missing arguments to max; returning -Inf
# Warning in FUN(X[[i]], ...) :
# no non-missing arguments to max; returning -Inf
dat
# id location start end age1 age2 newstart newend contiguous
# 1 1 brazil 2000 2000 0.0 4 2000 2002 1
# 2 1 brazil 2001 2001 0.0 4 2000 2002 1
# 3 1 brazil 2002 2002 0.0 4 2000 2002 1
# 4 2 argentina 1990 1991 1.0 1 1990 1993 1
# 5 2 argentina 1991 1991 2.0 2 1990 1993 1
# 6 2 argentina 1992 1992 2.0 2 1990 1993 1
# 7 2 argentina 1993 1993 2.0 2 1990 1993 1
# 8 3 belize 2001 2001 0.5 1 2001 2001 1
# 9 3 belize 2005 2005 1.0 2 2005 2005 2
dat$contiguous <- NULL
Interesting point I just learned about ave: it uses interaction(...) (all grouping variables), which is going to give all possible combinations, not just the combinations observed in the data. Because of that, the FUNction may be called with zero data. In this case, it did, giving the warnings. One could suppress this with function(a) suppressWarnings(min(a)) instead of just min.
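One way to sidestep the empty-combination calls described above is to hand ave() a single grouping factor that contains only the observed id/run pairs. The sketch below assumes that approach; the data is a small made-up example (id 4 has two separate runs of two years each), and run and key are illustrative names.

```r
# Made-up example: id 4 has two runs, 2000-2001 and 2005-2006
dat <- data.frame(
  id    = c(1, 1, 1, 3, 3, 4, 4, 4, 4),
  start = c(2000, 2001, 2002, 2001, 2005, 2000, 2001, 2005, 2006),
  end   = c(2000, 2001, 2002, 2001, 2005, 2000, 2001, 2005, 2006)
)

# Run id within each id: a new run starts whenever the step
# from the previous start year is not exactly 1
dat$run <- ave(dat$start, dat$id,
               FUN = function(a) cumsum(c(TRUE, diff(a) != 1)))

# A single factor holding only the id/run pairs that occur in the data,
# so FUN is never called on an empty group
key <- interaction(dat$id, dat$run, drop = TRUE)

dat$newstart <- ave(dat$start, key, FUN = min)
dat$newend   <- ave(dat$end, key, FUN = max)
dat
```

The cumulative sum of breaks keeps 2005 and 2006 in the same run for id 4, which a run-length id over the "step equals 1" flags would split apart.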
We could use dplyr. After grouping by 'id', take the difference between consecutive 'start' values, start a new run wherever that difference is not 1 (the cumulative sum of those breaks gives a run id), and create 'newstart' and 'newend' as the min of 'start' and the max of 'end'. The run id is computed in its own mutate() step because computations inside group_by() are done on the ungrouped data.
library(dplyr)
df1 %>%
group_by(id) %>%
mutate(grp = cumsum(c(TRUE, diff(start) != 1))) %>%
group_by(grp, .add = TRUE) %>%
mutate(newstart = min(start), newend = max(end))
Output:
# A tibble: 9 x 9
# Groups: id, grp [4]
# id location start end age1 age2 grp newstart newend
# <int> <chr> <int> <int> <dbl> <int> <int> <int> <int>
#1 1 brazil 2000 2000 0 4 1 2000 2002
#2 1 brazil 2001 2001 0 4 1 2000 2002
#3 1 brazil 2002 2002 0 4 1 2000 2002
#4 2 argentina 1990 1991 1 1 1 1990 1993
#5 2 argentina 1991 1991 2 2 1 1990 1993
#6 2 argentina 1992 1992 2 2 1 1990 1993
#7 2 argentina 1993 1993 2 2 1 1990 1993
#8 3 belize 2001 2001 0.5 1 1 2001 2001
#9 3 belize 2005 2005 1 2 2 2005 2005
Or with data.table
library(data.table)
setDT(df1)[, grp := cumsum(c(TRUE, diff(start) != 1)), by = id
][, c('newstart', 'newend') := .(min(start), max(end)), by = .(id, grp)][, grp := NULL]

Add lines with NA values

I have a data frame like this:
indx country year death value
1 1 Italy 2000 hiv 1
2 1 Italy 2001 hiv 2
3 1 Italy 2005 hiv 3
4 1 Italy 2000 cancer 4
5 1 Italy 2001 cancer 5
6 1 Italy 2002 cancer 6
7 1 Italy 2003 cancer 7
8 1 Italy 2004 cancer 8
9 1 Italy 2005 cancer 9
10 4 France 2000 hiv 10
11 4 France 2004 hiv 11
12 4 France 2005 hiv 12
13 4 France 2001 cancer 13
14 4 France 2002 cancer 14
15 4 France 2003 cancer 15
16 4 France 2004 cancer 16
17 2 Spain 2000 hiv 17
18 2 Spain 2001 hiv 18
19 2 Spain 2002 hiv 19
20 2 Spain 2003 hiv 20
21 2 Spain 2004 hiv 21
22 2 Spain 2005 hiv 22
23 2 Spain ... ... ...
indx is a value linked to the country (same country = same indx).
In this example I used only 3 countries (country) and 2 disease (death), in the original data frame are many more.
I would like to have one row for each country for each disease from 2000 to 2005.
What I would like to get is:
indx country year death value
1 1 Italy 2000 hiv 1
2 1 Italy 2001 hiv 2
3 1 Italy 2002 hiv NA
4 1 Italy 2003 hiv NA
5 1 Italy 2004 hiv NA
6 1 Italy 2005 hiv 3
7 1 Italy 2000 cancer 4
8 1 Italy 2001 cancer 5
9 1 Italy 2002 cancer 6
10 1 Italy 2003 cancer 7
11 1 Italy 2004 cancer 8
12 1 Italy 2005 cancer 9
13 4 France 2000 hiv 10
14 4 France 2001 hiv NA
15 4 France 2002 hiv NA
16 4 France 2003 hiv NA
17 4 France 2004 hiv 11
18 4 France 2005 hiv 12
19 4 France 2000 cancer NA
20 4 France 2001 cancer 13
21 4 France 2002 cancer 14
22 4 France 2003 cancer 15
23 4 France 2004 cancer 16
24 4 France 2005 cancer NA
25 2 Spain 2000 hiv 17
26 2 Spain 2001 hiv 18
27 2 Spain 2002 hiv 19
28 2 Spain 2003 hiv 20
29 2 Spain 2004 hiv 21
30 2 Spain 2005 hiv 22
31 2 Spain ... ... ...
I.e. I would like to add lines with value = NA at the missing years for each country for each disease.
For example, it lacks data of HIV in Italy between 2002 and 2004 and then I add this lines with value = NA.
How can I do that?
For a reproducible example:
indx <- c(rep(1, times=9), rep(4, times=7), rep(2, times=6))
country <- c(rep("Italy", times=9), rep("France", times=7), rep("Spain", times=6))
year <- c(2000, 2001, 2005, 2000:2005, 2000, 2004, 2005, 2001:2004, 2000:2005)
death <- c(rep("hiv", times=3), rep("cancer", times=6), rep("hiv", times=3), rep("cancer", times=4), rep("hiv", times=6))
value <- c(1:22)
dfl <- data.frame(indx, country, year, death, value)
Using base R, you could do:
# setDF(dfl) # run this first if you have a data.table
merge(expand.grid(lapply(dfl[c("country", "death", "year")], unique)), dfl, all.x = TRUE)
This first creates all combinations of the unique values in country, death, and year and then merges it to the original data, to add the values and where combinations were not in the original data, it adds NAs.
In the package tidyr, there's a special function that does this for you with a single command:
library(tidyr)
complete(dfl, country, year, death)
Here is a longer base R method. You create two new data.frames, one that contains all combinations of the country, year, and death, and a second that contains an index key.
# get data.frame with every combination of country, year, and death
dfNew <- with(df, expand.grid("country"=unique(country), "year"=unique(year),
"death"=unique(death)))
# get index key
indexKey <- unique(df[, c("indx", "country")])
# merge these together
dfNew <- merge(indexKey, dfNew, by="country")
# merge onto original data set
dfNew <- merge(df, dfNew, by=c("indx", "country", "year", "death"), all=TRUE)
This returns
dfNew
indx country year death value
1 1 Italy 2000 cancer 4
2 1 Italy 2000 hiv 1
3 1 Italy 2001 cancer 5
4 1 Italy 2001 hiv 2
5 1 Italy 2002 cancer 6
6 1 Italy 2002 hiv NA
7 1 Italy 2003 cancer 7
8 1 Italy 2003 hiv NA
9 1 Italy 2004 cancer 8
10 1 Italy 2004 hiv NA
11 1 Italy 2005 cancer 9
12 1 Italy 2005 hiv 3
13 2 Spain 2000 cancer NA
14 2 Spain 2000 hiv 17
15 2 Spain 2001 cancer NA
...
If df is a data.table, here are the corresponding lines of code:
# CJ is a cross-join
setkey(df, country, year, death)
dfNew <- df[CJ(country, year, death, unique=TRUE),
.(country, year, death, value)]
indexKey <- unique(df[, .(indx, country)])
dfNew <- merge(indexKey, dfNew, by="country")
dfNew <- merge(df, dfNew, by=c("indx", "country", "year", "death"), all=TRUE)
Note that rather than using CJ, it is also possible to use expand.grid as in the data.frame version:
dfNew <- df[, expand.grid("country"=unique(country), "year"=unique(year),
"death"=unique(death))]
tidyr::complete creates all combinations of the variables you pass it, but if two columns are redundant (here indx and country, which identify the same thing), passing both will over-expand and passing only one leaves NAs in the other. As a workaround you can use dplyr grouping (df %>% group_by(indx, country) %>% complete(death, year)) or temporarily merge the two columns into one:
library(tidyr)
# merge indx and country into a single column so they won't over-expand
df %>% unite(indx_country, indx, country) %>%
# fill in missing combinations of new column, death, and year
complete(indx_country, death, year) %>%
# separate indx and country back to how they were
separate(indx_country, c('indx', 'country'))
# Source: local data frame [36 x 5]
#
# indx country death year value
# (chr) (chr) (fctr) (int) (int)
# 1 1 Italy cancer 2000 4
# 2 1 Italy cancer 2001 5
# 3 1 Italy cancer 2002 6
# 4 1 Italy cancer 2003 7
# 5 1 Italy cancer 2004 8
# 6 1 Italy cancer 2005 9
# 7 1 Italy hiv 2000 1
# 8 1 Italy hiv 2001 2
# 9 1 Italy hiv 2002 NA
# 10 1 Italy hiv 2003 NA
# .. ... ... ... ... ...
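tidyr also offers nesting(), which tells complete() to use only the indx/country pairs that occur in the data, so the unite/separate round trip may not be needed. A short sketch on the question's dfl, assuming a tidyr version that provides nesting(); full is an illustrative name.

```r
library(tidyr)

# Reproducible data from the question
indx <- c(rep(1, times = 9), rep(4, times = 7), rep(2, times = 6))
country <- c(rep("Italy", times = 9), rep("France", times = 7), rep("Spain", times = 6))
year <- c(2000, 2001, 2005, 2000:2005, 2000, 2004, 2005, 2001:2004, 2000:2005)
death <- c(rep("hiv", times = 3), rep("cancer", times = 6), rep("hiv", times = 3),
           rep("cancer", times = 4), rep("hiv", times = 6))
value <- c(1:22)
dfl <- data.frame(indx, country, year, death, value)

# nesting() keeps indx and country together as observed pairs;
# death and year are then crossed with those pairs
full <- complete(dfl, nesting(indx, country), death, year)
full
```

This gives one row per country, disease and year seen anywhere in the data, with value set to NA for the combinations that were missing and indx filled in for every row.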

Conditional subsetting gone wrong in R

So I have this fairly basic problem with R subsetting, but because I'm a newbie I don't know how to solve it properly. Here is an example of the panel data I have:
idnr year sales space municipality pop
1 1 2004 110000 1095 136 71377
2 1 2005 110000 1095 136 71355
3 1 2006 110000 1095 136 71837
4 1 2007 120000 1095 136 72956
5 2 2004 35000 800 136 71377
6 3 2004 45000 1000 136 71377
7 3 2005 45000 1000 2584 23135
8 3 2006 45000 1000 2584 23258
9 3 2007 45000 1000 2584 23407
10 4 2005 180000 5000 2584 23254
11 4 2006 220000 5000 2584 23135
12 4 2007 250000 5000 2584 23258
So my problem is that I want to subset data using conditions for both year = 2004 AND (not or) year = 2005. However it doesn't seem to work. Code:
tab3 <- stores[stores$year==2004 & stores$year==2005, c("idnr","year")]
What I am trying to say is that I need to select entries which existed in both 2004 and 2005, because some entries existed in either 2004 or 2005 but not in both, and hence should be excluded. Using the data above as an example, this should be the output:
idnr year
1 2004
1 2005
3 2004
3 2005
Update:
I was hoping that akrun's method may work for selecting data entries, which appeared ONLY in 2005. Such that:
idnr year
4 2005
Unfortunately, it doesn't. Instead it groups both idnr's which appeared in 2004&2005 with those which appeared only in 2005. Any ideas?
Here is an option using "data.table". Convert the dataset ("df") to a "data.table" using setDT. Set the "year" column as the key (setkey(..)). Subset the rows that have 2004 or 2005 in the "year" column (J(c(2004,..))) and select the first two columns (1:2).
library(data.table) # data.table_1.9.5
DT1 <- setkey(setDT(df),year)[J(c(2004,2005)), 1:2, with=FALSE]
DT1
# idnr year
#1: 1 2004
#2: 2 2004
#3: 3 2004
#4: 1 2005
#5: 3 2005
#6: 4 2005
Update
Based on the updated expected result, we can check whether there are more than one unique "year" entries (uniqueN(year)>1) per "idnr" group, get the row index (.I) as a column ("V1") and subset the data.table "DT1".
DT1[DT1[, .I[uniqueN(year)>1], idnr]$V1,]
# idnr year
#1: 1 2004
#2: 1 2005
#3: 3 2004
#4: 3 2005
Or everything in one liner
setDT(df)[year %in% 2004:2005, if(uniqueN(year) > 1L) year, idnr]
# idnr V1
# 1: 1 2004
# 2: 1 2005
# 3: 3 2004
# 4: 3 2005
Or a base R option would be
indx <- with(df, ave(year==2004, idnr, FUN=any)& ave(year==2005,
idnr, FUN=any) & year %in% 2004:2005)
df[indx,1:2]
# idnr year
#1 1 2004
#2 1 2005
#6 3 2004
#7 3 2005
Update2
Based on the dataset and the expected result showed, we can check whether the first value of "year" is 2005 for each group "idnr". If it is TRUE, then subset the first observation (.SD[1L,..]) and select the columns that are needed.
setDT(df)[,if(year[1L]==2005) .SD[1L,1,with=FALSE], by = idnr]
# idnr year
#1: 4 2005
Or
setDT(df)[df[,.I[year[1L]==2005] , by = idnr]$V1[1L], 1:2, with=FALSE]
# idnr year
#1: 4 2005
If you want to subset with either year == 2004 or year == 2005, you need to use the | operator instead of & in your actual approach:
tab3 <- stores[stores$year == 2004 | stores$year == 2005, c("idnr", "year")]
Which results:
#> tab3
# idnr year
#1 1 2004
#2 1 2005
#5 2 2004
#6 3 2004
#7 3 2005
#10 4 2005
Or using dplyr:
library(dplyr)
tab3 <- stores %>% select(idnr, year) %>% filter(year == 2004 | year == 2005)
More concisely:
tab3 <- stores %>% select(idnr, year) %>% filter(year %in% c(2004, 2005))
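Neither & nor | applied row by row can express "this idnr appears in both years", because a single row only ever has one year. One way to think about it is in terms of sets of ids. The base R sketch below rebuilds just the idnr and year columns from the question; both, only_2005, tab_both and tab_only are illustrative names.

```r
# idnr and year columns from the example panel
stores <- data.frame(
  idnr = c(1, 1, 1, 1, 2, 3, 3, 3, 3, 4, 4, 4),
  year = c(2004:2007, 2004, 2004:2007, 2005:2007)
)

# Sets of ids observed in each year
ids_2004 <- stores$idnr[stores$year == 2004]
ids_2005 <- stores$idnr[stores$year == 2005]

# Ids present in both years, and ids present in 2005 but not 2004
both <- intersect(ids_2004, ids_2005)
only_2005 <- setdiff(ids_2005, ids_2004)

# Rows for the two years, restricted to ids seen in both
tab_both <- stores[stores$idnr %in% both & stores$year %in% 2004:2005, ]

# Rows for ids that first show up in 2005
tab_only <- stores[stores$idnr %in% only_2005 & stores$year == 2005, ]

tab_both
tab_only
```

The same two sets cover the update in the question: setdiff() isolates the entries that appeared only in 2005.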
