Counting the distinct values for each day and group and inserting the value in an array in R

I want to transform the data below to give me an association array with the count of each unique id in each group for each day. So, for example, from the data below
Year Month Day Group ID
2014 04 26 1 A
2014 04 26 1 B
2014 04 26 2 B
2014 04 26 2 C
2014 05 12 1 B
2014 05 12 2 E
2014 05 12 2 F
2014 05 12 2 G
2014 05 12 3 G
2014 05 12 3 F
2015 05 19 1 F
2015 05 19 1 D
2015 05 19 2 E
2015 05 19 2 G
2015 05 19 2 D
2015 05 19 3 A
2015 05 19 3 E
2015 05 19 3 B
I want to make an array that gives:
[1] (04/26/2014)
Grp 1 2 3
1 0 1 0
2 1 0 0
3 0 0 0
[2] (05/12/2014)
Grp 1 2 3
1 0 0 1
2 0 0 2
3 1 2 0
[3] (05/19/2015)
Grp 1 2 3
1 0 1 0
2 1 0 1
3 0 1 0
The 'Grp' label just indicates the group number. I know how to count the distinct values in the table overall, but I'm trying to use for loops to insert the appropriate count for each day: for example, counting the unique IDs that are present in both group 1 and group 2 on 04/26/2014 and inserting that number in the group 1/group 2 cell of the association matrix for that day. Any help would be appreciated.

I don't quite understand how you get the second one, but you can try this
dd <- read.table(header = TRUE, text = "Year Month Day Group ID
2014 04 26 1 A
2014 04 26 1 B
2014 04 26 2 B
2014 04 26 2 C
2014 05 12 1 B
2014 05 12 2 E
2014 05 12 2 F
2014 05 12 2 G
2014 05 12 3 G
2014 05 12 3 F
2015 05 19 1 F
2015 05 19 1 D
2015 05 19 2 E
2015 05 19 2 G
2015 05 19 2 D
2015 05 19 3 A
2015 05 19 3 E
2015 05 19 3 B")
dd <- within(dd, {
date <- as.Date(apply(dd[, 1:3], 1, paste0, collapse = '-'))
Group <- factor(Group)
Year <- Month <- Day <- NULL
})
Eg, for the first one
sp <- split(dd, dd$date)[[1]]
tbl <- table(sp$ID, sp$Group)
`diag<-`(crossprod(tbl), 0)
# 1 2 3
# 1 0 1 0
# 2 1 0 0
# 3 0 0 0
And do them all at once
lapply(split(dd, dd$date), function(x) {
cp <- crossprod(table(x$ID, x$Group))
diag(cp) <- 0
cp
})
# $`2014-04-26`
#
# 1 2 3
# 1 0 1 0
# 2 1 0 0
# 3 0 0 0
#
# $`2014-05-12`
#
# 1 2 3
# 1 0 0 0
# 2 0 0 2
# 3 0 2 0
#
# $`2015-05-19`
#
# 1 2 3
# 1 0 1 0
# 2 1 0 1
# 3 0 1 0
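Since the question asks for an array rather than a list, the per-day matrices can also be stacked into a 3-D array. This is only a sketch: shared_ids and arr are names made up here, the data is a cut-down copy of the first two days, and it assumes each ID appears at most once per group per day (otherwise the table holds counts above 1 and the cross product no longer counts distinct IDs).

```r
# Toy data: the first two days from the question
dd <- data.frame(
  date  = rep(c("2014-04-26", "2014-05-12"), times = c(4, 6)),
  Group = factor(c(1, 1, 2, 2, 1, 2, 2, 2, 3, 3), levels = 1:3),
  ID    = c("A", "B", "B", "C", "B", "E", "F", "G", "G", "F")
)

# Hypothetical helper: group-by-group matrix of shared IDs for one day
shared_ids <- function(x) {
  inc <- table(x$ID, x$Group)  # rows = IDs, columns = groups, entries 0/1
  cp <- crossprod(inc)         # t(inc) %*% inc: cell [i, j] = IDs present in both i and j
  diag(cp) <- 0                # a group is not compared with itself
  cp
}

# Stack the per-day matrices along a third dimension named by date
arr <- simplify2array(lapply(split(dd, dd$date), shared_ids))
dim(arr)
arr["1", "2", "2014-04-26"]  # IDs shared by groups 1 and 2 on the first day
```

The cross product works because each column of the incidence table is a 0/1 indicator of group membership, so the dot product of two columns counts the IDs that are in both groups.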

A possible solution with dplyr and tidyr will be as follows:
library(dplyr)
library(tidyr)
df$date <- as.Date(paste(df$Year, df$Month, df$Day, sep = '-'))
df %>%
expand(date, Group) %>%
left_join(., df) %>%
group_by(date, Group) %>%
summarise(nID = n_distinct(ID)) %>%
split(., .$date)
Resulting output:
$`2014-04-26`
Source: local data frame [3 x 3]
Groups: date [1]
date Group nID
(date) (int) (int)
1 2014-04-26 1 2
2 2014-04-26 2 2
3 2014-04-26 3 1
$`2014-05-12`
Source: local data frame [3 x 3]
Groups: date [1]
date Group nID
(date) (int) (int)
1 2014-05-12 1 1
2 2014-05-12 2 3
3 2014-05-12 3 2
$`2015-05-19`
Source: local data frame [3 x 3]
Groups: date [1]
date Group nID
(date) (int) (int)
1 2015-05-19 1 2
2 2015-05-19 2 3
3 2015-05-19 3 3

Related

Count the occurrences of accidents until the next accidents

I have the following data frame and I would like to create the "OUTPUT_COLUMN".
Explanation of columns:
ID is the identification number of the policy
ID_REG_YEAR is the identification number per Registration Year
CALENDAR_YEAR is the year in which the policy has exposure
NUMBER_OF_RENEWALS is the number of times the policy has been renewed
ACCIDENT indicates whether an accident occurred
KEY TO THE DATASET: ID_REG_YEAR and CALENDAR_YEAR
Basically, if column NUMBER_OF_RENEWALS = 0 then OUTPUT_COLUMN = 100. Any row with no earlier accident should also contain 100 (e.g. rows 13, 16, 17). If an accident occurred, I would like to count the number of renewals from that accident until the next one.
ID ID_REG_YEAR CALENDAR_YEAR NUMBER_OF_RENEWALS ACCIDENT OUTPUT_COLUMN
1 A A_2015 2015 0 YES 100
2 A A_2015 2016 0 YES 100
3 A A_2016 2016 1 YES 0
4 A A_2016 2017 1 YES 0
5 A A_2017 2017 2 NO 1
6 A A_2017 2018 2 NO 1
7 A A_2018 2018 3 NO 2
8 A A_2018 2019 3 NO 2
9 A A_2019 2019 4 YES 0
10 A A_2019 2020 4 YES 0
11 B B_2015 2015 0 NO 100
12 B B_2015 2016 0 NO 100
13 B B_2016 2016 1 NO 100
14 C C_2013 2013 0 NO 100
15 C C_2013 2014 0 NO 100
16 C C_2014 2014 1 NO 100
17 C C_2014 2015 1 NO 100
18 C C_2015 2015 2 YES 0
19 C C_2015 2016 2 YES 0
20 C C_2016 2016 3 NO 1
21 C C_2016 2017 3 NO 1
22 C C_2017 2017 4 NO 2
23 C C_2017 2018 4 NO 2
24 C C_2018 2018 5 YES 0
25 C C_2018 2019 5 YES 0
26 C C_2019 2019 6 NO 1
27 C C_2019 2020 6 NO 1
28 C C_2020 2020 7 NO 2
Here is a dplyr solution. First, obtain a separate column for the registration year, which will be used to calculate renewals since prior accident (assumes this is years since last accident). Then, create a column to contain the year of the last accident after grouping by ID. Using fill this value will be propagated. The final outcome column will be set as either 100 (if no prior accident, or NUMBER_OF_RENEWALS is zero) vs. the registration year - last accident year.
library(dplyr)
library(tidyr) # separate() and fill() come from tidyr
df %>%
separate(ID_REG_YEAR, into = c("ID_REG", "REG_YEAR"), convert = T) %>%
group_by(ID) %>%
mutate(LAST_ACCIDENT = ifelse(ACCIDENT == "YES", REG_YEAR, NA_integer_)) %>%
fill(LAST_ACCIDENT, .direction = "down") %>%
mutate(OUTPUT_COLUMN_2 = ifelse(
is.na(LAST_ACCIDENT) | NUMBER_OF_RENEWALS == 0, 100, REG_YEAR - LAST_ACCIDENT
))
Output
ID ID_REG REG_YEAR CALENDAR_YEAR NUMBER_OF_RENEWALS ACCIDENT OUTPUT_COLUMN LAST_ACCIDENT OUTPUT_COLUMN_2
<chr> <chr> <int> <int> <int> <chr> <int> <int> <dbl>
1 A A 2015 2015 0 YES 100 2015 100
2 A A 2015 2016 0 YES 100 2015 100
3 A A 2016 2016 1 YES 0 2016 0
4 A A 2016 2017 1 YES 0 2016 0
5 A A 2017 2017 2 NO 1 2016 1
6 A A 2017 2018 2 NO 1 2016 1
7 A A 2018 2018 3 NO 2 2016 2
8 A A 2018 2019 3 NO 2 2016 2
9 A A 2019 2019 4 YES 0 2019 0
10 A A 2019 2020 4 YES 0 2019 0
# … with 18 more rows
Note: If you want to use your policy number (NUMBER_OF_RENEWALS) and not go by the year, you can do something similar. Instead of adding a column with the last accident year, you can include the last accident policy. Then, your output column could reflect the policy number instead of year (to consider the possibility that one or more years could be skipped).
df %>%
separate(ID_REG_YEAR, into = c("ID_REG", "REG_YEAR"), convert = T) %>%
group_by(ID) %>%
mutate(LAST_ACCIDENT_POLICY = ifelse(ACCIDENT == "YES", NUMBER_OF_RENEWALS, NA_integer_)) %>%
fill(LAST_ACCIDENT_POLICY, .direction = "down") %>%
mutate(OUTPUT_COLUMN_2 = ifelse(
is.na(LAST_ACCIDENT_POLICY) | NUMBER_OF_RENEWALS == 0, 100, NUMBER_OF_RENEWALS - LAST_ACCIDENT_POLICY
))
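For anyone who wants to see what the fill step is doing without the tidyverse, here is a rough base R sketch of the same "carry the last accident forward" idea. The helper name last_accident_renewal and the toy data are made up for illustration, and it assumes the rows of each policy are already sorted in time.

```r
# Hypothetical helper: for each row, the renewal number at the most recent
# accident so far (NA if the policy has had no accident yet).
last_accident_renewal <- function(renewals, accident) {
  # position of the latest accident row seen so far (0 = none yet)
  pos <- cummax(ifelse(accident, seq_along(renewals), 0L))
  out <- rep(NA_real_, length(renewals))
  out[pos > 0] <- renewals[pos[pos > 0]]
  out
}

toy <- data.frame(
  ID = rep(c("A", "B"), times = c(5, 2)),
  NUMBER_OF_RENEWALS = c(0, 1, 2, 3, 4, 0, 1),
  ACCIDENT = c("YES", "YES", "NO", "NO", "YES", "NO", "NO")
)

# Apply the helper policy by policy
toy$LAST <- NA_real_
for (id in unique(toy$ID)) {
  rows <- toy$ID == id
  toy$LAST[rows] <- last_accident_renewal(toy$NUMBER_OF_RENEWALS[rows],
                                          toy$ACCIDENT[rows] == "YES")
}

# 100 when there is no earlier accident or the policy has never been renewed
toy$OUT <- ifelse(is.na(toy$LAST) | toy$NUMBER_OF_RENEWALS == 0,
                  100, toy$NUMBER_OF_RENEWALS - toy$LAST)
toy$OUT
```

Policy B never has an accident, so every row stays at 100; policy A resets to 0 at each accident and counts up in between.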

How to modify a column based on a condition in a time series?

I have data on animal territories by month (1 = January etc.) for multiple individuals:
year month terr_size id
2018 1 20 1
2018 2 30 1
2019 1 5 1
2019 2 10 1
2018 3 20 2
2018 5 25 2
2018 6 20 2
2018 7 20 2
2019 1 10 2
2019 2 5 2
2019 3 20 2
2019 4 30 2
I want to add a column that has a 1 if two consecutive months exceed some value e.g. 10. One wrinkle is that my data can run over one year for a single id.
year month terr_size id new_col
2018 1 20 1 1
2018 2 30 1 1
2019 1 5 1 0
2019 2 10 1 0
2018 3 20 2 0
2018 5 25 2 1
2018 6 20 2 1
2018 7 20 2 1
2019 1 10 2 0
2019 2 5 2 0
2019 3 20 2 1
2019 4 30 2 1
This can be expressed compactly using a single left join in a single SQL statement.
Using the input shown in the Note at the end, perform a left self join on the condition shown below (same id, and a month index 12 * year + month that differs by exactly one). Set new_col to 1 if the original row and at least one matched row both have terr_size greater than or equal to 10. If there is no matched row, coalesce sets new_col to 0.
library(sqldf)
sqldf("
select a.*,
coalesce(max(a.terr_size >= 10 and b.terr_size >= 10), 0)
new_col
from DF a
left join DF b on
a.id = b.id and
(12 * b.year + b.month = 12 * a.year + a.month + 1 or
12 * b.year + b.month = 12 * a.year + a.month - 1)
group by a.rowid")
giving:
year month terr_size id new_col
1 2018 1 20 1 1
2 2018 2 30 1 1
3 2019 1 5 1 0
4 2019 2 10 1 0
5 2018 3 20 2 0
6 2018 5 25 2 1
7 2018 6 20 2 1
8 2018 7 20 2 1
9 2019 1 10 2 0
10 2019 2 5 2 0
11 2019 3 20 2 1
12 2019 4 30 2 1
Note
The input and output shown in the question are not consistent so to be clear we assumed this:
Lines <- "year month terr_size id
2018 1 20 1
2018 2 30 1
2019 1 5 1
2019 2 10 1
2018 3 20 2
2018 5 25 2
2018 6 20 2
2018 7 20 2
2019 1 10 2
2019 2 5 2
2019 3 20 2
2019 4 30 2 "
DF <- read.table(text = Lines, header = TRUE)
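The same month-index trick can be sketched in base R without SQL. This is an illustration only: terr, key, prev and nxt are names invented here, the threshold is >= 10 as in the SQL, and it assumes at most one row per id and month.

```r
terr <- data.frame(
  year      = c(2018, 2018, 2019, 2019, 2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019),
  month     = c(1, 2, 1, 2, 3, 5, 6, 7, 1, 2, 3, 4),
  terr_size = c(20, 30, 5, 10, 20, 25, 20, 20, 10, 5, 20, 30),
  id        = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2)
)

# 12 * year + month grows by exactly 1 per calendar month,
# so December and the following January are neighbours
idx <- 12 * terr$year + terr$month
key <- paste(terr$id, idx)

# terr_size of the same id one month earlier / later (NA if that month is absent)
prev <- terr$terr_size[match(paste(terr$id, idx - 1), key)]
nxt  <- terr$terr_size[match(paste(terr$id, idx + 1), key)]

big <- terr$terr_size >= 10
terr$new_col <- as.integer(
  big & ((!is.na(prev) & prev >= 10) | (!is.na(nxt) & nxt >= 10))
)
terr$new_col
```

A quick check of the index: 12 * 2019 + 1 is one more than 12 * 2018 + 12, which is why no special case is needed for the year boundary.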
Your data:
df <- read.table(text = "year month terr_size id
2018 1 20 1
2018 2 30 1
2019 1 5 1
2019 2 10 1
2018 3 20 2
2018 2 25 2
2018 6 20 2
2018 7 20 2
2019 1 10 2
2019 2 5 2
2019 3 20 2
2019 4 30 2 ", header = TRUE)
The idea is to create a date variable first.
Then you create two copies of your data by changing the dates one month ahead and one month back.
R is efficient memory-wise for this kind of operation, so you won't have a problem.
You will just take the space for one additional column. It doesn't actually replicate the whole dataframe.
Then you can join the new columns to the original dataframe.
You then apply the condition you needed.
I created a magic_number variable for that.
At the end, I selected only the original columns plus the one you needed.
library(dplyr)
library(lubridate)
# the threshold number
magic_number <- 10
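# create date variable
df <- df %>% mutate(date = make_date(year, month))
# [p]revious month: re-date each row one month later so it lines up with the month that follows it
dfp <- df %>% transmute(id, date = date + months(1), terr_size_p = terr_size)
# [n]ext month: re-date each row one month earlier so it lines up with the month that precedes it
dfn <- df %>% transmute(id, date = date - months(1), terr_size_n = terr_size)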
# join by id and date
df <- df %>%
left_join(dfp, by = c("id", "date")) %>%
left_join(dfn, by = c("id", "date"))
# for new_col to be 1, terr_size must be over the threshold, and so must the previous or the next month
# (`|` compares row by row; any() would collapse the whole column to a single TRUE/FALSE)
df <- df %>%
mutate(new_col = as.numeric(terr_size > magic_number &
(coalesce(terr_size_p > magic_number, FALSE) | coalesce(terr_size_n > magic_number, FALSE))))
# remove variables if there is no more use for them
df <- df %>% select(-terr_size_p, -terr_size_n, -date)
df
Result:
year month terr_size id new_col
1 2018 1 20 1 1
2 2018 2 30 1 1
3 2019 1 5 1 0
4 2019 2 10 1 0
5 2018 3 20 2 1
6 2018 2 25 2 1
7 2018 6 20 2 1
8 2018 7 20 2 1
9 2019 1 10 2 0
10 2019 2 5 2 0
11 2019 3 20 2 1
12 2019 4 30 2 1
(The result is not exactly the same because your initial data and expected results do not correspond at row 5.)
This solution handles the December-January issue we talked about in the comments.
I'm not exactly sure what the rule is, because your output doesn't follow the rule you describe (e.g. lines 1 and 5 have no earlier month for comparison yet you put a 1; line 6 is separated by 2 months; you put a 1 in line 11 although line 10 was < 10).
I assumed the most complicated scenario, so you can remove the extra conditions you don't need:
Put a 1 if the territory size remained > 10 for two consecutive months including this one (or if it is the first recorded month and it is > 10) for each individual.
df <- read.table(text = "year month terr_size id
2018 1 20 1
2018 2 30 1
2019 1 5 1
2019 2 10 1
2018 3 20 2
2018 5 25 2
2018 6 20 2
2018 7 20 2
2019 1 10 2
2019 2 5 2
2019 3 20 2
2019 4 30 2", header = TRUE)
Using dplyr and lag:
library(dplyr)
df %>% arrange(id, year,month) %>%
dplyr::mutate(newcol=case_when(is.na(lag(month))==TRUE & terr_size>10~1,
lag(id)!=id & terr_size>10~1,
id==lag(id) & year-lag(year)==0 & month-lag(month)==1 & terr_size>10 & lag(terr_size)>10~1,
id==lag(id) & year-lag(year)==1 & lag(month)-month==11 & terr_size>10 & lag(terr_size)>10~1,
TRUE~0))
output:
year month terr_size id newcol
1 2018 1 20 1 1
2 2018 2 30 1 1
3 2019 1 5 1 0
4 2019 2 10 1 0
5 2018 3 20 2 1
6 2018 5 25 2 0
7 2018 6 20 2 1
8 2018 7 20 2 1
9 2019 1 10 2 0
10 2019 2 5 2 0
11 2019 3 20 2 0
12 2019 4 30 2 1

Remove IDs with fewer than 9 unique observations

I am trying to filter my data and remove IDs that have fewer than 9 unique month observations. I would also like to create a list of IDs that includes the count.
I've tried using a few different options:
library(dplyr)
count <- bind %>% group_by(IDS) %>% filter(n(data.month)>= 9) %>% ungroup()
count2 <- subset(bind, with(bind, IDS %in% names(which(table(data.month)>=9))))
Neither of these worked.
This is what my data looks like:
data.month ID
01 2
02 2
03 2
04 2
05 2
05 2
06 2
06 2
07 2
07 2
07 2
07 2
07 2
08 2
09 2
10 2
11 2
12 2
01 5
01 5
02 5
01 7
01 7
01 7
01 4
02 4
03 4
04 4
05 4
05 4
06 4
06 4
07 4
07 4
07 4
07 4
07 4
08 4
09 4
10 4
11 4
12 4
In the end, I would like this:
IDs
2
3
I would also like this
IDs Count
2 12
5 2
7 1
4 12
So far this code is the closest, but still just gives error codes:
count <- bind %>%
group_by(IDs) %>%
filter(length(unique(bind$data.month >=9)))
Error in filter_impl(.data, quo) :
Argument 2 filter condition does not evaluate to a logical vector
You can do with unique and length
library(dplyr)
df %>% group_by(ID) %>% summarise(Count=length(unique(data.month)))
# A tibble: 4 x 2
ID Count
<int> <int>
1 2 12
2 4 12
3 5 2
4 7 1
If you want to get the IDs (9 or more unique months, so the filter is >= 9)
df%>%group_by(ID)%>%summarise(Count=length(unique(data.month)))%>%filter(Count>=9)%>%select(ID)
# A tibble: 2 x 1
ID
<int>
1 2
2 4
We can use n_distinct
To remove IDs with less than 9 unique observations
library(dplyr)
df %>%
group_by(ID) %>%
filter(n_distinct(data.month) >= 9) %>%
pull(ID) %>% unique
#[1] 2 4
Or
df %>%
group_by(ID) %>%
filter(n_distinct(data.month) >= 9) %>%
distinct(ID)
# ID
# <int>
#1 2
#2 4
For unique counts of each ID
df %>%
group_by(ID) %>%
summarise(count = n_distinct(data.month))
# ID count
# <int> <int>
#1 2 12
#2 4 12
#3 5 2
#4 7 1
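If you would rather not load a package, roughly the same thing can be done in base R with tapply. This is just a sketch on a shortened made-up copy of the data (only IDs 2, 5 and 7); counts and keep are illustrative names.

```r
df <- data.frame(
  data.month = c(1:12, 7, 7, 1, 1, 2, 1, 1, 1),
  ID         = c(rep(2, 14), rep(5, 3), rep(7, 3))
)

# number of distinct months per ID
counts <- tapply(df$data.month, df$ID, function(m) length(unique(m)))
counts

# IDs with at least 9 distinct months, then the filtered data
keep <- as.numeric(names(counts)[counts >= 9])
kept <- df[df$ID %in% keep, ]
nrow(kept)
```

Here tapply plays the role of group_by + summarise, and length(unique(m)) is what n_distinct computes.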
Here is a data.table approach
library( data.table )
# IDs with 9 or more unique months (count distinct months within each ID)
DT[, .(n = uniqueN(data.month)), by = ID][n >= 9, ID]
#[1] 2 4
# Number of unique months per ID
unique(DT, by = c("data.month", "ID"))[, .(counts = .N), by = .(IDs = ID)]
# IDs counts
# 1: 2 12
# 2: 5 2
# 3: 7 1
# 4: 4 12
sample data
DT <- fread("data.month ID
01 2
02 2
03 2
04 2
05 2
05 2
06 2
06 2
07 2
07 2
07 2
07 2
07 2
08 2
09 2
10 2
11 2
12 2
01 5
01 5
02 5
01 7
01 7
01 7
01 4
02 4
03 4
04 4
05 4
05 4
06 4
06 4
07 4
07 4
07 4
07 4
07 4
08 4
09 4
10 4
11 4
12 4")

Create time axis for longitudinal data; calculations with data variables

I've got the following sample data frame. The data is in long format (longitudinal data). col1 indicates the person ID (for this sample we only have 2 people). col2 indicates the occurrence of a life event (e.g. 0 = not married, 1 = married). The status change from 0 to one actually marks the life event. col3 is 1 for each measurement occasion after the event and 0 for each measurement occasion prior to the event. The year indicates the year of assessment. The month indicates the month of assessment (02 = February).
col1 col2 col3 year month
row.name11 A 0 0 2013 02
row.name12 A 0 0 2014 02
row.name13 A 1 1 2015 02
row.name14 A 0 1 2016 02
row.name15 A 0 1 2018 02
row.name16 B 0 0 2014 02
row.name17 B 0 0 2015 02
row.name18 B 1 1 2016 02
row.name19 B 0 1 2017 04
I now wish to create an event-centered timeline. The new variable should be 0 when the event takes place (col2 == 1). It should be negative prior to the event (indicating the months until the event occurs) and positive after the event (indicating the months since the event occurred).
It should look like this (see event.time variable):
col1 col2 col3 year month event.time
row.name11 A 0 0 2013 02 -24
row.name12 A 0 0 2014 02 -12
row.name13 A 1 1 2015 02 0
row.name14 A 0 1 2016 02 12
row.name15 A 0 1 2018 02 36
row.name16 B 0 0 2014 02 -24
row.name17 B 0 0 2015 02 -12
row.name18 B 1 1 2016 02 0
row.name19 B 0 1 2017 04 14
I figured out that I should first transform my year and month variables into a date variable (using the as.Date function). However, I wasn't successful. How could I efficiently calculate the event.time variable afterwards? Maybe using col3, because this variable indicates whether a row is before or after the event?
I'm more than happy to receive any advice you may have! Thanks in advance :)
#if nchar(month) is 1 then add 0 before month, otherwise use month directly.
#1 added to make the transformation to as.Date simple
df$date<- paste0(df$year,'-',ifelse(nchar(df$month)==1,paste0(0,df$month),df$month),'-1')
df$date<- as.Date(df$date)
library(dplyr)
df %>% group_by(col1) %>%
#Get the minimum date where col2==1 in case there is more than one 1 in the same ID; event.time is then a difference in days
mutate(date_used=min(date[col2==1]), event.time=as.numeric(date - date_used))
# A tibble: 9 x 8
# Groups: col1 [2]
col1 col2 col3 year month date date_used event.time
<fct> <int> <int> <int> <int> <date> <date> <dbl>
1 A 0 0 2013 2 2013-02-01 2015-02-01 -730
2 A 0 0 2014 2 2014-02-01 2015-02-01 -365
3 A 1 1 2015 2 2015-02-01 2015-02-01 0
4 A 0 1 2016 2 2016-02-01 2015-02-01 365
5 A 0 1 2018 2 2018-02-01 2015-02-01 1096
6 B 0 0 2014 2 2014-02-01 2016-02-01 -730
7 B 0 0 2015 2 2015-02-01 2016-02-01 -365
8 B 1 1 2016 2 2016-02-01 2016-02-01 0
9 B 0 1 2017 4 2017-04-01 2016-02-01 425
Data
df <- read.table(text="
col1 col2 col3 year month
row.name11 A 0 0 2013 02
row.name12 A 0 0 2014 02
row.name13 A 1 1 2015 02
row.name14 A 0 1 2016 02
row.name15 A 0 1 2018 02
row.name16 B 0 0 2014 02
row.name17 B 0 0 2015 02
row.name18 B 1 1 2016 02
row.name19 B 0 1 2017 04
",header=T)
Here is an option using lubridate
library(tidyverse)
library(lubridate)
ym <- function(y, m) ymd(sprintf("%s-%s-01", y, m))
df %>%
group_by(col1) %>%
mutate(event.time = interval(ym(year, month)[col2 == 1], ym(year, month)) %/% months(1))
## A tibble: 9 x 6
## Groups: col1 [2]
# col1 col2 col3 year month event.time
# <fct> <int> <int> <int> <int> <dbl>
#1 A 0 0 2013 2 -24.
#2 A 0 0 2014 2 -12.
#3 A 1 1 2015 2 0.
#4 A 0 1 2016 2 12.
#5 A 0 1 2018 2 36.
#6 B 0 0 2014 2 -24.
#7 B 0 0 2015 2 -12.
#8 B 1 1 2016 2 0.
#9 B 0 1 2017 4 14.
Sample data
df <- read.table(text =
" col1 col2 col3 year month
row.name11 A 0 0 2013 02
row.name12 A 0 0 2014 02
row.name13 A 1 1 2015 02
row.name14 A 0 1 2016 02
row.name15 A 0 1 2018 02
row.name16 B 0 0 2014 02
row.name17 B 0 0 2015 02
row.name18 B 1 1 2016 02
row.name19 B 0 1 2017 04", header = T)
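If a difference in whole months is all you need, it can also be sketched in base R with plain arithmetic, no date class required. The names toy, month_idx and event_idx are invented for this illustration, and it assumes every person has at least one row with col2 == 1 (otherwise min() returns Inf with a warning).

```r
toy <- data.frame(
  col1  = rep(c("A", "B"), times = c(5, 4)),
  col2  = c(0, 0, 1, 0, 0, 0, 0, 1, 0),
  year  = c(2013, 2014, 2015, 2016, 2018, 2014, 2015, 2016, 2017),
  month = c(2, 2, 2, 2, 2, 2, 2, 2, 4)
)

# running month number: increases by 1 for every calendar month
month_idx <- 12 * toy$year + toy$month

# month number of the (first) event, repeated for every row of the same person
event_idx <- ave(ifelse(toy$col2 == 1, month_idx, NA), toy$col1,
                 FUN = function(x) min(x, na.rm = TRUE))

toy$event.time <- month_idx - event_idx
toy$event.time
```

For person B, for example, April 2017 is 12 + 2 = 14 months after February 2016.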

How to remove subjects with missing yearly observations in R?

num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
6 2 B 2011 20 -8484
7 3 C 2015 23 -120836
8 4 D 2011 3 -26523
9 4 D 2012 4 9923
10 4 D 2013 5 82432
I have data on various subjects over 5 years. I need to remove every subject that is missing any of the years from 2011 to 2015. How can I do this, so that in the given data only subject A is left?
Using data.table:
A data.table solution might look something like this:
library(data.table)
dt <- as.data.table(df)
dt[, keep := identical(unique(year), 2011:2015), by = Name ][keep == T, ][,keep := NULL]
# num Name year age X
#1: 1 A 2011 68 116292
#2: 1 A 2012 69 46132
#3: 1 A 2013 70 7042
#4: 1 A 2014 71 -100425
#5: 1 A 2015 72 6493
This is more strict in that it requires that the unique years be exactly equal to 2011:2015. If there is a 2016, for example that person would be excluded.
A less restrictive solution would be to check that 2011:2015 is in your unique years. This should work:
dt[, keep := all(2011:2015 %in% unique(year)), by = Name ][keep == T, ][,keep := NULL]
Thus, if for example, A had a 2016 year and a 2010 year it would still keep all of A. But if anyone is missing a year in 2011:2015 this would exclude them.
Using base R & aggregate:
Same option, but using aggregate from base R:
agg <- aggregate(df$year, by = list(df$Name), FUN = function(x) all(2011:2015 %in% unique(x)))
df[df$Name %in% agg[agg$x == T, 1] ,]
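A variation on the same base R idea uses ave, which returns the per-subject flag already aligned with the rows, so no lookup table is needed. This is a sketch on a reduced made-up copy of the data (Name and year only); has_all is an illustrative name.

```r
toy <- data.frame(
  Name = rep(c("A", "B", "D"), times = c(5, 1, 3)),
  year = c(2011:2015, 2011, 2011:2013)
)

# For each subject: are all of 2011..2015 present? ave() repeats the answer
# on every row of that subject (as 1/0, hence as.logical)
has_all <- as.logical(ave(toy$year, toy$Name,
                          FUN = function(y) all(2011:2015 %in% y)))

complete_only <- toy[has_all, ]
complete_only
```

As with the %in% check above, extra years outside 2011:2015 would not exclude a subject.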
Here is a slightly more straightforward tidyverse solution.
First, expand the dataframe to include all combinations of Name + year (complete() comes from tidyr, the other verbs from dplyr):
df %>% complete(Name, year)
# A tibble: 20 x 5
Name year num age X
<fctr> <int> <int> <int> <int>
1 A 2011 1 68 116292
2 A 2012 1 69 46132
3 A 2013 1 70 7042
4 A 2014 1 71 -100425
5 A 2015 1 72 6493
6 B 2011 2 20 -8484
7 B 2012 NA NA NA
8 B 2013 NA NA NA
9 B 2014 NA NA NA
10 B 2015 NA NA NA
...
Then extend the pipe to group by "Name", and filter to keep only those with 0 NA values:
df %>% complete(Name, year) %>%
group_by(Name) %>%
filter(sum(is.na(age)) == 0)
# A tibble: 5 x 5
# Groups: Name [1]
Name year num age X
<fctr> <int> <int> <int> <int>
1 A 2011 1 68 116292
2 A 2012 1 69 46132
3 A 2013 1 70 7042
4 A 2014 1 71 -100425
5 A 2015 1 72 6493
Just check which names have the right number of entries.
## Reproduce your data
df = read.table(text=" num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
6 2 B 2011 20 -8484
7 3 C 2015 23 -120836
8 4 D 2011 3 -26523
9 4 D 2012 4 9923
10 4 D 2013 5 82432",
header=TRUE)
Tab = table(df$Name)
Keepers = names(Tab)[which(Tab == 5)]
df[df$Name %in% Keepers,]
num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
Here is a somewhat different approach using tidyverse packages:
library(tidyverse)
df <- read.table(text = " num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
6 2 B 2011 20 -8484
7 3 C 2015 23 -120836
8 4 D 2011 3 -26523
9 4 D 2012 4 9923
10 4 D 2013 5 82432")
df2 <- spread(data = df, key = Name, value = year)
x <- colSums(df2[, 4:7], na.rm = TRUE) > 10000
df3 <- select(df2, num, age, X, c(4:7)[x])
df4 <- na.omit(df3)
All steps can of course be constructed as one single pipe with the %>% operator.

Resources