I have a relatively large dataset, and I want to replace an NA value (a price) for a specific year and ID with the value available in the next year for the same ID. Here is a reproducible example:
ID <- c(1,2,3,2,2,3,1,4,5,5,1,2,2)
year <- c(2000,2001,2002,2002,2003,2007,2001,2000,2005,2006,2002,2004,2005)
value <- c(1000,20000,30000,NA,40000,NA,6000,4000,NA,20000,7000,50000,60000)
data <- data.frame(ID, year, value)
ID year value
1 1 2000 1000
2 2 2001 20000
3 3 2002 30000
4 2 2002 NA
5 2 2003 40000
6 3 2007 NA
7 1 2001 6000
8 4 2000 4000
9 5 2005 NA
10 5 2006 20000
11 1 2002 7000
12 2 2004 50000
13 2 2005 60000
So, for example, for ID=2 we have the following values and years:
ID year value
2 2001 20000
2 2002 NA
2 2003 40000
2 2004 50000
2 2005 60000
So in the above case, NA should be replaced with 40000 (the value in the next year), and likewise for the other IDs.
The final result should be in this form:
ID year value
1 2000 1000
1 2001 6000
1 2002 7000
2 2001 20000
2 2002 40000
2 2003 40000
2 2004 50000
2 2005 60000
3 2007 NA
4 2000 4000
5 2005 20000
5 2006 20000
Please note that for ID=3, since there is no next year available, we want to leave it as is; that's why it remains NA.
I would appreciate it if you could suggest a solution.
Thanks
dplyr solution
library(tidyverse)
data2 <- data %>%
dplyr::group_by(ID) %>%
dplyr::arrange(year) %>%
dplyr::mutate(replaced_value = ifelse(is.na(value), lead(value), value))
print(data2)
# A tibble: 13 x 4
# Groups: ID [5]
ID year value replaced_value
<dbl> <dbl> <dbl> <dbl>
1 1 2000 1000 1000
2 4 2000 4000 4000
3 2 2001 20000 20000
4 1 2001 6000 6000
5 3 2002 30000 30000
6 2 2002 NA 40000
7 1 2002 7000 7000
8 2 2003 40000 40000
9 2 2004 50000 50000
10 5 2005 NA 20000
11 2 2005 60000 60000
12 5 2006 20000 20000
13 3 2007 NA NA
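Note that lead(value) takes the value from the next row of the same ID, whatever its year. If the replacement should come only from the year immediately after (year + 1), a minimal base R sketch could look like the following. The helper name fill_from_next_year and the column filled are made up for illustration and are not part of the answer above.

```r
# Example data from the question
data <- data.frame(
  ID    = c(1,2,3,2,2,3,1,4,5,5,1,2,2),
  year  = c(2000,2001,2002,2002,2003,2007,2001,2000,2005,2006,2002,2004,2005),
  value = c(1000,20000,30000,NA,40000,NA,6000,4000,NA,20000,7000,50000,60000)
)

# Hypothetical helper: for the rows of ONE ID, look up the value stored for year + 1
fill_from_next_year <- function(year, value) {
  nxt <- value[match(year + 1, year)]  # NA when year + 1 is absent for this ID
  ifelse(is.na(value), nxt, value)     # only NA entries are replaced
}

# Apply the helper per ID and put the pieces back in the original row order
parts  <- split(data, data$ID)
filled <- lapply(parts, function(d) fill_from_next_year(d$year, d$value))
data$filled <- unsplit(filled, data$ID)

data[order(data$ID, data$year), ]
```

Because the lookup goes through match() on the year, the rows do not need to be sorted first, and a gap in the years leaves the NA untouched.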
Try this tidyverse approach using a flag to check sequential years and fill() to complete data:
library(tidyverse)
#Data
ID <- c(1,2,3,2,2,3,1,4,5,5,1,2,2)
year <- c(2000,2001,2002,2002,2003,2007,2001,2000,2005,2006,2002,2004,2005)
value <- c(1000,20000,30000,NA,40000,NA,6000,4000,NA,20000,7000,50000,60000)
data <- data.frame(ID, year, value)
#Code
data2 <- data %>% arrange(ID,year) %>%
group_by(ID) %>%
mutate(Flag=c(1,diff(year))) %>%
fill(value,.direction = 'up') %>% # 'up' takes the next available value within each ID
mutate(value=ifelse(Flag!=1,NA,value)) %>% select(-Flag)
Output:
# A tibble: 13 x 3
# Groups: ID [5]
ID year value
<dbl> <dbl> <dbl>
1 1 2000 1000
2 1 2001 6000
3 1 2002 7000
4 2 2001 20000
5 2 2002 40000
6 2 2003 40000
7 2 2004 50000
8 2 2005 60000
9 3 2002 30000
10 3 2007 NA
11 4 2000 4000
12 5 2005 20000
13 5 2006 20000
You could do:
library(dplyr)
data %>%
group_by(ID) %>%
mutate(value = coalesce(value, as.integer(sapply(pmin(year + 1, max(year)), function(x) value[year == x])))) %>%
arrange(ID, year)
Output:
# A tibble: 13 x 3
# Groups: ID [5]
ID year value
<dbl> <dbl> <dbl>
1 1 2000 1000
2 1 2001 6000
3 1 2002 7000
4 2 2001 20000
5 2 2002 40000
6 2 2003 40000
7 2 2004 50000
8 2 2005 60000
9 3 2002 30000
10 3 2007 NA
11 4 2000 4000
12 5 2005 20000
13 5 2006 20000
Now in case you want to replace NA with any value that follows immediately - i.e. even if the year is not necessarily consecutive - you could do:
library(tidyverse)
data %>%
arrange(ID, year) %>%
group_by(ID, idx = cumsum(is.na(value))) %>%
fill(value, .direction = 'up') %>%
ungroup %>%
select(-idx)
This is much more straightforward (and likely much faster) in data.table:
library(data.table)
# ordering inside [ ] returns a sorted copy, so keep it in a new object
dt <- setDT(data)[order(ID, year)]
dt[, value := nafill(value, type = 'nocb'), by = .(ID, cumsum(is.na(value)))]
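To see what the cumsum(is.na(value)) grouping does, here is a small base R sketch on a single made-up vector (v, idx and filled are illustrative names only). Every NA starts a new block, so an NA is filled only when an observed value follows it directly.

```r
v   <- c(10, NA, NA, 40, NA, 60)   # made-up values for one ID, already sorted by year
idx <- cumsum(is.na(v))            # block ids: 0 1 2 2 3 3

# Within each block, the NA (if any) is the first element; fill it from the
# element right after it, when there is one
blocks <- split(v, idx)
filled <- unsplit(lapply(blocks, function(b) {
  if (is.na(b[1]) && length(b) > 1) b[1] <- b[2]
  b
}), idx)

filled  # 10 NA 40 40 60 60
```

The first of the two consecutive NAs stays NA because its block contains nothing else. A plain fill(value, .direction = 'up') without the extra grouping would fill both of them with 40.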
I want to know how I can backfill NA values in a panel data set.
Data set:
date firms return
1999 A NA
2000 A 5
2001 A NA
1999 B 9
2000 B NA
2001 B 10
Expected outcome:
date firms return
1999 A 5
2000 A 5
2001 A NA
1999 B 9
2000 B 10
2001 B 10
I use this formula to fill NA values with the previous value in a panel data set:
library(dplyr)
library(tidyr)
df1<-df %>% group_by(firms) %>% fill(return)
Is there an easy way like this to fill NA values with the next value in a panel data set?
You almost had it.
df <- df %>% group_by(firms) %>% fill(return, .direction="up")
df
# A tibble: 6 x 3
# Groups: firms [2]
date firms return
<int> <fct> <int>
1 1999 A 5
2 2000 A 5
3 2001 A NA
4 1999 B 9
5 2000 B 10
6 2001 B 10
I have a dataframe (df) that looks like this:
School Student Year
A 10 1999
A 10 2000
A 20 1999
A 20 2000
A 20 2001
B 10 1999
B 10 2000
And I would like to create a person ID column so that df looks like this:
ID School Student Year
1 A 10 1999
1 A 10 2000
2 A 20 1999
2 A 20 2000
2 A 20 2001
3 B 10 1999
3 B 10 2000
In other words, the ID variable indicates which person it is in the dataset, accounting for both Student number and School membership (here we have 3 students total).
I did df$ID <- df$Student and tried to increase the value by 1 whenever the c("School", "Student") combination was new. It isn't working. Help appreciated.
We can do this in base R without doing any group by operation
df$ID <- cumsum(!duplicated(df[1:2]))
df
# School Student Year ID
#1 A 10 1999 1
#2 A 10 2000 1
#3 A 20 1999 2
#4 A 20 2000 2
#5 A 20 2001 2
#6 B 10 1999 3
#7 B 10 2000 3
NOTE: This assumes the rows are sorted by 'School' and 'Student', so that each combination forms one contiguous block.
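If the ordering assumption in the note above does not hold, one possible base R sketch is to build a combined key and number it by first appearance. The shuffled example data and the key variable are made up for illustration.

```r
# Same students as in the question, but with the rows deliberately unsorted
df <- data.frame(
  School  = c("A", "B", "A", "A", "B", "A", "A"),
  Student = c(10, 10, 20, 10, 10, 20, 20),
  Year    = c(1999, 1999, 1999, 2000, 2000, 2000, 2001)
)

# One key per School/Student combination
key <- paste(df$School, df$Student, sep = "_")

# match() against the unique keys numbers each combination by first appearance
df$ID <- match(key, unique(key))
df$ID  # 1 2 3 1 2 3 3

# For comparison, the cumsum(!duplicated()) idiom on the unsorted rows
cumsum(!duplicated(df[1:2]))  # 1 2 3 3 3 3 3
```

With this approach the IDs follow the order in which combinations first appear, not the sorted order of School and Student.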
Or using tidyverse
library(dplyr)
df %>%
mutate(ID = group_indices_(df, .dots=c("School", "Student")))
# School Student Year ID
#1 A 10 1999 1
#2 A 10 2000 1
#3 A 20 1999 2
#4 A 20 2000 2
#5 A 20 2001 2
#6 B 10 1999 3
#7 B 10 2000 3
As @radek mentioned, in the recent version (dplyr_0.8.0) we get a notification that group_indices_ is deprecated; use group_indices instead:
df %>%
mutate(ID = group_indices(., School, Student))
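In dplyr 1.0.0 and later, passing columns to group_indices() is itself deprecated in favour of grouping first and calling cur_group_id(). A minimal sketch, assuming dplyr >= 1.0.0 is installed:

```r
library(dplyr)

df <- data.frame(
  School  = c("A", "A", "A", "A", "A", "B", "B"),
  Student = c(10, 10, 20, 20, 20, 10, 10),
  Year    = c(1999, 2000, 1999, 2000, 2001, 1999, 2000)
)

df <- df %>%
  group_by(School, Student) %>%
  mutate(ID = cur_group_id()) %>%  # integer id of the current group
  ungroup()

df$ID  # 1 1 2 2 2 3 3
```

Group ids follow the sorted order of the grouping columns, so the rows do not have to be sorted beforehand.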
Group by School and Student, then assign group id to ID variable.
library('data.table')
df[, ID := .GRP, by = .(School, Student)]
# School Student Year ID
# 1: A 10 1999 1
# 2: A 10 2000 1
# 3: A 20 1999 2
# 4: A 20 2000 2
# 5: A 20 2001 2
# 6: B 10 1999 3
# 7: B 10 2000 3
Data:
df <- fread('School Student Year
A 10 1999
A 10 2000
A 20 1999
A 20 2000
A 20 2001
B 10 1999
B 10 2000')
I am using a national survey to run a regression. The survey is conducted every two years; some individuals are interviewed repeatedly, while others are interviewed just once.
Now I want to turn the df into a panel (keeping only the individuals that appear more than once). The df looks like this:
year nquest nord nordp sex age
2000 10 1 1 F 40
2000 10 2 2 M 43
2000 30 1 1 M 30
2002 10 1 1 F 42
2002 10 2 2 M 45
2002 10 3 NA F 15
2002 30 1 1 M 32
2004 10 1 1 F 44
2004 10 2 2 M 47
2004 10 3 3 F 17
2004 50 1 NA M 66
where nquest is the code number of the family, nord is the code number of the individual, and nordp is the code number that the individual had in the previous survey. When a new individual is interviewed, the value in nordp is missing (R automatically inserts NA). For example, individual 3 of family 10 has nordp=NA in 2002 because it is the first time she is interviewed, while in 2004 nordp is 3 (because 3 was the number she had in 2002).
I can't use nord to filter the df because the composition of the family may change. For example, in 2002 the mother in family x has nordp=2 (meaning that in 2000 her nord was 2) and nord=2, but in the next survey her nord could be 1 (for example, if she gets divorced) while nordp is still 2.
I tried to filter using this command:
df <- df %>%
group_by(nquest, nordp) %>%
filter(n()>1)
but I don't get the right df: if more than one new individual (nordp=NA) is added to the same family, they are treated as the same person, since nordp is NA the first time.
How can I also handle the individuals that appear for the first time in a given year (nordp=NA)? I tried to create a command using age (the age in t should equal the age in t-2 plus 2; for example, if age is 20 in 2000, it is 22 in 2002), but it didn't work.
Note that the df is composed of thousands of observations, so I can't check it manually.
The final df should be:
year nquest nordp sex age
2000 10 1 F 40
2000 10 2 M 43
2000 30 1 M 30
2002 10 1 F 42
2002 10 2 M 45
2002 10 3 F 15
2002 30 1 M 32
2004 10 1 F 44
2004 10 2 M 47
2004 10 3 F 17
As you can see, only the individuals that appear more than once remain, and nquest=10 nordp=30 appears three times; with my command it appears just twice because nordp was NA in the first year.
We wish to assign unique IDs to individuals, then filter by the count of unique IDs. The main idea is to chain together the nordp and nord values within each family over years. Here's an idea inspired by Identify groups of linked episodes which chain together. First, load the igraph package, via library(igraph). Then the following function assigns IDs for a given family.
assignID <- function(d) {
fields <- names(d) # store original column names
d$nordp[is.na(d$nordp)] <- seq_len(sum(is.na(d$nordp))) + 100
d$nordp_x <- (d$year-2) * 1000 + d$nordp
d$nord_x <- d$year * 1000 + d$nord
dd <- d[, c("nordp_x", "nord_x")]
gr.test <- graph_from_data_frame(dd) # current name for graph.data.frame()
links <- data.frame(org_id = unique(unlist(dd)),
id = components(gr.test)$membership) # current name for clusters()
d <- merge(d, links, by.x = "nord_x", by.y = "org_id", all.x = TRUE)
d$uid <- d$nquest * 100 + d$id
d[, c(fields, "uid")]
}
The function can "tell", for example, that
year nordp nord
2000 1 1
2002 1 2
2004 2 3
is the same individual, by chaining together the nordp and nord over the years, and assigns the same unique ID to all 3 rows. So, for example,
assignID(subset(df, nquest == 10))
# year nquest nord nordp sex age dob uid
# 1 2000 10 1 1 F 40 1960 1001
# 2 2000 10 2 2 M 43 1957 1002
# 3 2002 10 1 1 F 42 1960 1001
# 4 2002 10 2 2 M 45 1957 1002
# 5 2002 10 3 101 F 15 1987 1003
# 6 2004 10 1 1 F 44 1960 1001
# 7 2004 10 2 2 M 47 1957 1002
# 8 2004 10 3 3 F 17 1987 1003
gives us an additional column with the uid for each individual.
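To see why the chaining works, the composite keys that assignID builds can be computed by hand for the three-row example above. This is only an illustration of the key arithmetic; the data frame d and the vector linked are made up here.

```r
# The three rows from the example: one person whose nord changes over time
d <- data.frame(
  year  = c(2000, 2002, 2004),
  nordp = c(1, 1, 2),
  nord  = c(1, 2, 3)
)

# Same key construction as inside assignID()
d$nordp_x <- (d$year - 2) * 1000 + d$nordp  # key of the person in the previous wave
d$nord_x  <- d$year * 1000 + d$nord         # key of the person in the current wave

d$nordp_x  # 1998001 2000001 2002002
d$nord_x   # 2000001 2002002 2004003

# Each row's "previous" key equals the preceding row's "current" key
linked <- d$nordp_x[-1] == d$nord_x[-nrow(d)]
linked  # TRUE TRUE
```

Each (nordp_x, nord_x) pair becomes an edge of the graph, so rows that share a key end up in the same connected component and receive the same id.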
The remaining steps are straightforward. We split the dataframe by nquest, apply assignID to each subset, and rbind the output:
dd <- do.call(rbind, by(df, df$nquest, assignID))
Then we can just group by uid and filter by count:
dd %>% group_by(uid) %>% filter(n()>1)
# Source: local data frame [10 x 8]
# Groups: uid [4]
# year nquest nord nordp sex age dob uid
# <int> <int> <int> <dbl> <fctr> <int> <int> <dbl>
# 1 2000 10 1 1 F 40 1960 1001
# 2 2000 10 2 2 M 43 1957 1002
# 3 2002 10 1 1 F 42 1960 1001
# 4 2002 10 2 2 M 45 1957 1002
# 5 2002 10 3 101 F 15 1987 1003
# 6 2004 10 1 1 F 44 1960 1001
# 7 2004 10 2 2 M 47 1957 1002
# 8 2004 10 3 3 F 17 1987 1003
# 9 2000 30 1 1 M 30 1970 3001
# 10 2002 30 1 1 M 32 1970 3001
The following is what I have:
ID Year Score
1 1999 10
1 2000 11
1 2001 14
1 2002 22
2 2000 19
2 2001 17
2 2002 22
3 1998 10
3 1999 12
The following is what I would like to do:
ID Year Score Total
1 1999 10 10
1 2000 11 21
1 2001 14 35
1 2002 22 57
2 2000 19 19
2 2001 17 36
2 2002 22 58
3 1998 10 10
3 1999 12 22
The number of years and the specific years vary for each ID.
I have a feeling that it's some advanced option in ddply, but I have not been able to find the answer. I've also tried working with for/while loops, but since these are dreadfully slow in R and my data set is large, it's not working all that well.
Thanks in advance!
You can use the cumsum function and apply it with ave to all subgroups.
transform(dat, Total = ave(Score, ID, FUN = cumsum))
ID Year Score Total
1 1 1999 10 10
2 1 2000 11 21
3 1 2001 14 35
4 1 2002 22 57
5 2 2000 19 19
6 2 2001 17 36
7 2 2002 22 58
8 3 1998 10 10
9 3 1999 12 22
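cumsum adds the scores in row order, so the running total is only meaningful if the rows are sorted by Year within each ID. A small self-contained sketch, with the question's data deliberately shuffled to show the sorting step:

```r
# Same scores as in the question, but with the rows out of order
dat <- data.frame(
  ID    = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
  Year  = c(2001, 1999, 2002, 2000, 2002, 2000, 2001, 1999, 1998),
  Score = c(14, 10, 22, 11, 22, 19, 17, 12, 10)
)

# Sort by ID and Year first, then take the running total within each ID
dat <- dat[order(dat$ID, dat$Year), ]
dat$Total <- ave(dat$Score, dat$ID, FUN = cumsum)

dat$Total  # 10 21 35 57 19 36 58 10 22
```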
If your data is large, then ddply will be slow.
data.table is the way to go.
library(data.table)
DT <- data.table(dat)
# create your desired column in `DT`
DT[, agg.Score := cumsum(Score), by = ID]