Replacing NA with the next available value within a group in R

I have a relatively large dataset, and I want to replace the NA value for the price in a specific year, for a specific ID number, with the value available in the next year for the same ID number within its group. Here is a reproducible example:
ID <- c(1,2,3,2,2,3,1,4,5,5,1,2,2)
year <- c(2000,2001,2002,2002,2003,2007,2001,2000,2005,2006,2002,2004,2005)
value <- c(1000,20000,30000,NA,40000,NA,6000,4000,NA,20000,7000,50000,60000)
data <- data.frame(ID, year, value)
ID year value
1 1 2000 1000
2 2 2001 20000
3 3 2002 30000
4 2 2002 NA
5 2 2003 40000
6 3 2007 NA
7 1 2001 6000
8 4 2000 4000
9 5 2005 NA
10 5 2006 20000
11 1 2002 7000
12 2 2004 50000
13 2 2005 60000
So, for example, for ID=2 we have the following values and years:
ID year value
2 2001 20000
2 2002 NA
2 2003 40000
2 2004 50000
2 2005 60000
So in the above case, NA should be replaced with 40000 (the value in the next year), and the same applies to the other IDs.
The final result should be in this form:
ID year value
1 2000 1000
1 2001 6000
1 2002 7000
2 2001 20000
2 2002 40000
2 2003 40000
2 2004 50000
2 2005 60000
3 2007 NA
4 2000 4000
5 2005 20000
5 2006 20000
Please note that for ID=3, since there is no next year available, we want to leave it as is; that's why it stays NA.
I would appreciate it if you could suggest a solution.
Thanks

dplyr solution
library(tidyverse)
data2 <- data %>%
  dplyr::group_by(ID) %>%
  dplyr::arrange(year) %>%
  dplyr::mutate(replaced_value = ifelse(is.na(value), lead(value), value))
print(data2)
# A tibble: 13 x 4
# Groups: ID [5]
ID year value replaced_value
<dbl> <dbl> <dbl> <dbl>
1 1 2000 1000 1000
2 4 2000 4000 4000
3 2 2001 20000 20000
4 1 2001 6000 6000
5 3 2002 30000 30000
6 2 2002 NA 40000
7 1 2002 7000 7000
8 2 2003 40000 40000
9 2 2004 50000 50000
10 5 2005 NA 20000
11 2 2005 60000 60000
12 5 2006 20000 20000
13 3 2007 NA NA
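Note that lead(value) simply takes whatever the next row within the group holds, whether or not that row is the following calendar year. If the replacement should only come from the year immediately after the missing one, a minimal sketch along the same lines (same data, same dplyr pipeline) would be:
library(dplyr)
data %>%
  arrange(ID, year) %>%
  group_by(ID) %>%
  # fill an NA only when the next row really is year + 1;
  # the last row of a group gives an NA condition, so it stays NA (the ID = 3 case)
  mutate(value = ifelse(is.na(value) & lead(year) == year + 1, lead(value), value)) %>%
  ungroup()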

Try this tidyverse approach using a flag to check sequential years and fill() to complete data:
library(tidyverse)
#Data
ID <- c(1,2,3,2,2,3,1,4,5,5,1,2,2)
year <- c(2000,2001,2002,2002,2003,2007,2001,2000,2005,2006,2002,2004,2005)
value <- c(1000,20000,30000,NA,40000,NA,6000,4000,NA,20000,7000,50000,60000)
data <- data.frame(ID, year, value)
#Code
data2 <- data %>%
  arrange(ID, year) %>%
  group_by(ID) %>%
  mutate(Flag = c(1, diff(year))) %>%
  fill(value, .direction = 'downup') %>%
  mutate(value = ifelse(Flag != 1, NA, value)) %>%
  select(-Flag)
Output:
# A tibble: 13 x 3
# Groups: ID [5]
ID year value
<dbl> <dbl> <dbl>
1 1 2000 1000
2 1 2001 6000
3 1 2002 7000
4 2 2001 20000
5 2 2002 20000
6 2 2003 40000
7 2 2004 50000
8 2 2005 60000
9 3 2002 30000
10 3 2007 NA
11 4 2000 4000
12 5 2005 20000
13 5 2006 20000

You could do:
library(dplyr)
data %>%
  group_by(ID) %>%
  mutate(value = coalesce(value, as.integer(sapply(pmin(year + 1, max(year)), function(x) value[year == x])))) %>%
  arrange(ID, year)
Output:
# A tibble: 13 x 3
# Groups: ID [5]
ID year value
<dbl> <dbl> <dbl>
1 1 2000 1000
2 1 2001 6000
3 1 2002 7000
4 2 2001 20000
5 2 2002 40000
6 2 2003 40000
7 2 2004 50000
8 2 2005 60000
9 3 2002 30000
10 3 2007 NA
11 4 2000 4000
12 5 2005 20000
13 5 2006 20000
Now in case you want to replace NA with any value that follows immediately - i.e. even if the year is not necessarily consecutive - you could do:
library(tidyverse)
data %>%
  arrange(ID, year) %>%
  group_by(ID, idx = cumsum(is.na(value))) %>%
  fill(value, .direction = 'up') %>%
  ungroup() %>%
  select(-idx)
This is much more straightforward (and likely much faster) in data.table:
library(data.table)
setDT(data)[order(ID, year), ][
, value := nafill(value, type = 'nocb'), by = .(ID, cumsum(is.na(value)))]
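One detail to keep in mind: the chain above subsets with order(ID, year), which creates a new data.table, so the filled values live in that copy rather than in data itself. A usage sketch that keeps the result (same data as above):
library(data.table)
data2 <- setDT(data)[order(ID, year)][
  # nocb = next observation carried backward; grouping on cumsum(is.na(value))
  # means each NA can only be filled from the value that follows it
  , value := nafill(value, type = 'nocb'), by = .(ID, cumsum(is.na(value)))]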

Related

Back-fill NA values in a panel data set

I want to know how I can backfill NA values in a panel data set.
Data set:
date firms return
1999 A NA
2000 A 5
2001 A NA
1999 B 9
2000 B NA
2001 B 10
Expected outcome:
date firms return
1999 A 5
2000 A 5
2001 A NA
1999 B 9
2000 B 10
2001 B 10
I use this code to fill NA values with the previous value in a panel data set:
library(dplyr)
library(tidyr)
df1<-df %>% group_by(firms) %>% fill(return)
Is there a similarly easy way to fill NA values with the next value in a panel data set?
You almost had it.
df <- df %>% group_by(firms) %>% fill(return, .direction="up")
df
# A tibble: 6 x 3
# Groups: firms [2]
date firms return
<int> <fct> <int>
1 1999 A 5
2 2000 A 5
3 2001 A NA
4 1999 B 9
5 2000 B 10
6 2001 B 10
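For larger panels, the same next-value fill can be done with data.table's nafill; a sketch assuming return is a numeric column and the rows are already ordered by date within firms:
library(data.table)
setDT(df)[, return := nafill(return, type = "nocb"), by = firms]  # nocb = next observation carried backward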

How many counts are in each type in each year, with either group_by() or split()? [duplicate]

This question already has answers here:
Count number of rows within each group
(17 answers)
Closed 4 years ago.
I have a data frame df as follows:
df
Code Time Country Type
1 n001 2000 France 1
2 n002 2001 Japan 5
3 n003 2003 USA 2
4 n004 2004 USA 2
5 n005 2004 Canada 1
6 n006 2005 Britain 2
7 n007 2005 USA 1
8 n008 2005 USA 2
9 n010 2005 USA 1
10 n011 2005 Canada 1
11 n012 2005 USA 2
12 n013 2005 USA 5
13 n014 2005 Canada 1
14 n015 2006 USA 2
15 n017 2006 Canada 1
16 n018 2006 Britain 1
17 n019 2006 Canada 1
18 n020 2006 USA 1
...
where Type is the type of news, and Time is the year when the news was published.
My aim is to count the number of each type of news each year.
I was thinking about a result like this:
...
$2005
Type: 1 Count: 4
Type: 2 Count: 3
Type: 5 Count: 1
$2006
Type: 1 Count: 4
...
I used the following code:
gp = group_by(df, Time)
summarise(gp, table(Time))
Error in summarise_impl(.data, dots) :
Evaluation error: unique() applies only to vectors.
Then I tried split(), thinking it might separate the data frame by year so I could count the number of each type per year:
split(df, 'Time')
$Time
Code Time Country Type
1 n001 2000 France 1
2 n002 2001 Japan 5
3 n003 2003 USA 2
4 n004 2004 USA 2
...
Everything is almost the same, apart from the "$Time" sign.
I was wondering what I did wrong, and how to fix it.
We can split the Type column by Time and calculate its frequency with table():
lapply(split(df$Type, df$Time), table)
#$`2000`
#1
#1
#$`2001`
#5
#1
#$`2003`
#2
#1
#$`2004`
#1 2
#1 1
#$`2005`
#1 2 5
#4 3 1
#$`2006`
#1 2
#4 1
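A two-way contingency table gives the same counts in one call; a base R sketch:
table(df$Time, df$Type)  # years as rows, news types as columns, 0 where a type is absent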
How about this?
df %>%
  group_by(Time, Type) %>%
  count() %>%
  spread(Type, n)
You could use something like this: split on Time, then group by Type and tally the result:
df %>%
  split(.$Time) %>%
  map(~ group_by(., Type) %>% tally())
......
$`2004`
# A tibble: 2 x 2
Type n
<int> <int>
1 1 1
2 2 1
$`2005`
# A tibble: 3 x 2
Type n
<int> <int>
1 1 4
2 2 3
3 5 1
$`2006`
# A tibble: 2 x 2
......
Or use summarise instead of tally if you want a column called count instead of n
df1 %>%
  split(.$Time) %>%
  map(~ group_by(., Type) %>% summarise(count = n()))
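dplyr's count() also accepts the grouping variables directly; a short sketch of the long and wide forms (the wide step reuses spread() as in the answer above):
library(dplyr)
library(tidyr)
df %>% count(Time, Type)                      # one row per year/type with column n
df %>% count(Time, Type) %>% spread(Type, n)  # wide: one row per year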

Assign unique ID based on two columns [duplicate]

This question already has answers here:
Add ID column by group [duplicate]
(4 answers)
How to create a consecutive group number
(13 answers)
Closed 5 years ago.
I have a dataframe (df) that looks like this:
School Student Year
A 10 1999
A 10 2000
A 20 1999
A 20 2000
A 20 2001
B 10 1999
B 10 2000
And I would like to create a person ID column so that df looks like this:
ID School Student Year
1 A 10 1999
1 A 10 2000
2 A 20 1999
2 A 20 2000
2 A 20 2001
3 B 10 1999
3 B 10 2000
In other words, the ID variable indicates which person it is in the dataset, accounting for both Student number and School membership (here we have 3 students total).
I did df$ID <- df$Student and tried to add 1 to the value whenever the combination c("School", "Student") was unique. It isn't working. Help appreciated.
We can do this in base R without any group-by operation:
df$ID <- cumsum(!duplicated(df[1:2]))
df
# School Student Year ID
#1 A 10 1999 1
#2 A 10 2000 1
#3 A 20 1999 2
#4 A 20 2000 2
#5 A 20 2001 2
#6 B 10 1999 3
#7 B 10 2000 3
NOTE: Assuming that 'School' and 'Student' are ordered
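If the rows are not ordered by School and Student, a base R sketch that assigns IDs by first appearance and still groups non-adjacent duplicates together:
key <- paste(df$School, df$Student, sep = "_")
df$ID <- match(key, unique(key))  # the same School/Student pair always gets the same ID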
Or using tidyverse
library(dplyr)
df %>%
  mutate(ID = group_indices_(df, .dots = c("School", "Student")))
# School Student Year ID
#1 A 10 1999 1
#2 A 10 2000 1
#3 A 20 1999 2
#4 A 20 2000 2
#5 A 20 2001 2
#6 B 10 1999 3
#7 B 10 2000 3
As #radek mentioned, in recent versions (dplyr 0.8.0) we get a notification that group_indices_ is deprecated; use group_indices instead:
df %>%
mutate(ID = group_indices(., School, Student))
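In still newer releases (dplyr >= 1.0.0), group_indices() is itself superseded; a sketch with cur_group_id(), which works inside mutate() on grouped data:
df %>%
  group_by(School, Student) %>%
  mutate(ID = cur_group_id()) %>%
  ungroup()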
Group by School and Student, then assign the group id to the ID variable:
library('data.table')
df[, ID := .GRP, by = .(School, Student)]
# School Student Year ID
# 1: A 10 1999 1
# 2: A 10 2000 1
# 3: A 20 1999 2
# 4: A 20 2000 2
# 5: A 20 2001 2
# 6: B 10 1999 3
# 7: B 10 2000 3
Data:
df <- fread('School Student Year
A 10 1999
A 10 2000
A 20 1999
A 20 2000
A 20 2001
B 10 1999
B 10 2000')

Filter a df with NA to get only individuals that appear more than once in R

I am using a national survey to run a regression: the survey is conducted every two years, and some individuals are interviewed repeatedly while others only once.
Now I want to make the df a panel (keeping only the individuals that appear more than once). The df looks like this:
year nquest nord nordp sex age
2000 10 1 1 F 40
2000 10 2 2 M 43
2000 30 1 1 M 30
2002 10 1 1 F 42
2002 10 2 2 M 45
2002 10 3 NA F 15
2002 30 1 1 M 32
2004 10 1 1 F 44
2004 10 2 2 M 47
2004 10 3 3 F 17
2004 50 1 NA M 66
where nquest is the code number of the family, nord is the code number of the individual, and nordp is the code number that the individual had in the previous survey; when a new individual is interviewed, the value in nordp is missing (R automatically inserts NA). For example, individual 3 of family 10 has nordp=NA in 2002 because it is the first time she is interviewed, while in 2004 nordp is 3 (because 3 was the number she had in 2002).
I can't use nord to filter the df because the composition of the family may change. For example, in 2002 in family x the mother has nordp=2 (meaning that in 2000 her nord was 2) and nord=2, but in the next survey her nord could be 1 (for example if she gets divorced) while nordp is still 2.
I tried to filter using this command:
df <- df %>%
  group_by(nquest, nordp) %>%
  filter(n() > 1)
but I don't get the right df, because if the same family has more than one newly inserted individual (NA), they will be considered the same person, since nordp is NA for each of them the first time.
How can I also include individuals that appear for the first time in a certain year (nordp=NA)? I tried to create a condition using age (the age in year t should equal the age in t-2 plus 2; for example, if the age is 20 in 2000, it is 22 in 2002), but it didn't work.
Consider that the df contains thousands of observations and I can't check them manually.
The final df should be:
year nquest nordp sex age
2000 10 1 F 40
2000 10 2 M 43
2000 30 1 M 30
2002 10 1 F 42
2002 10 2 M 45
2002 10 3 F 15
2002 30 1 M 32
2004 10 1 F 44
2004 10 2 M 47
2004 10 3 F 17
As you can see, there are only the individuals that appear more than once, and nquest=10 nordp=30 appears three times; with my command it appears just two times because in the first year nordp was NA.
We wish to assign unique IDs to individuals, then filter by the count of unique IDs. The main idea is to chain together the nordp and nord values within each family over the years. Here's an idea inspired by "Identify groups of linked episodes which chain together". First, load the igraph package via library(igraph). Then the following function assigns IDs for a given family.
assignID <- function(d) {
  fields <- names(d)  # store original column names
  # give first-time individuals (nordp = NA) temporary codes above 100 so they don't clash
  d$nordp[is.na(d$nordp)] <- seq_len(sum(is.na(d$nordp))) + 100
  # encode "code in the previous survey" and "code in the current survey" as node labels
  d$nordp_x <- (d$year - 2) * 1000 + d$nordp
  d$nord_x <- d$year * 1000 + d$nord
  dd <- d[, c("nordp_x", "nord_x")]
  # each row is an edge from the previous-survey node to the current-survey node;
  # connected components of this graph correspond to one individual over time
  gr.test <- graph.data.frame(dd)
  links <- data.frame(org_id = unique(unlist(dd)),
                      id = clusters(gr.test)$membership)
  d <- merge(d, links, by.x = "nord_x", by.y = "org_id", all.x = TRUE)
  d$uid <- d$nquest * 100 + d$id  # unique ID: family code combined with component id
  d[, c(fields, "uid")]
}
The function can "tell", for example, that
year nordp nord
2000 1 1
2002 1 2
2004 2 3
is the same individual, by chaining together the nordp and nord over the years, and assigns the same unique ID to all 3 rows. So, for example,
assignID(subset(df, nquest == 10))
# year nquest nord nordp sex age dob uid
# 1 2000 10 1 1 F 40 1960 1001
# 2 2000 10 2 2 M 43 1957 1002
# 3 2002 10 1 1 F 42 1960 1001
# 4 2002 10 2 2 M 45 1957 1002
# 5 2002 10 3 101 F 15 1987 1003
# 6 2004 10 1 1 F 44 1960 1001
# 7 2004 10 2 2 M 47 1957 1002
# 8 2004 10 3 3 F 17 1987 1003
gives us an additional column with the uid for each individual.
The remaining steps are straightforward. We split the dataframe by nquest, apply assignID to each subset, and rbind the output:
dd <- do.call(rbind, by(df, df$nquest, assignID))
Then we can just group by uid and filter by count:
dd %>% group_by(uid) %>% filter(n()>1)
# Source: local data frame [10 x 8]
# Groups: uid [4]
# year nquest nord nordp sex age dob uid
# <int> <int> <int> <dbl> <fctr> <int> <int> <dbl>
# 1 2000 10 1 1 F 40 1960 1001
# 2 2000 10 2 2 M 43 1957 1002
# 3 2002 10 1 1 F 42 1960 1001
# 4 2002 10 2 2 M 45 1957 1002
# 5 2002 10 3 101 F 15 1987 1003
# 6 2004 10 1 1 F 44 1960 1001
# 7 2004 10 2 2 M 47 1957 1002
# 8 2004 10 3 3 F 17 1987 1003
# 9 2000 30 1 1 M 30 1970 3001
# 10 2002 30 1 1 M 32 1970 3001

Summarizing a dataframe by date and group

I am trying to summarize a data set by a few different factors. Below is an example of my data:
household<-c("household1","household1","household1","household2","household2","household2","household3","household3","household3")
date<-c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9))
value<-c(1:9)
type<-c("income","water","energy","income","water","energy","income","water","energy")
df<-data.frame(household,date,value,type)
household date value type
1 household1 1999-05-10 100 income
2 household1 1999-05-25 200 water
3 household1 1999-10-12 300 energy
4 household2 1999-02-02 400 income
5 household2 1999-08-20 500 water
6 household2 1999-02-19 600 energy
7 household3 1999-07-01 700 income
8 household3 1999-10-13 800 water
9 household3 1999-01-01 900 energy
I want to summarize the data by month. Ideally the resulting data set would have 12 rows per household (one for each month) and a column for each category of expenditure (water, energy, income) that is a sum of that month's total.
I tried starting by adding a column with a short date, and then I was going to filter for each type and create a separate data frame for the summed data per transaction type. I was then going to merge those data frames together to have the summarized df. I attempted to summarize it using ddply, but it aggregated too much, and I can't keep the household level info.
ddply(df,.(shortdate),summarize,mean_value=mean(value))
shortdate mean_value
1 14/07 15.88235
2 14/09 5.00000
3 14/10 5.00000
4 14/11 21.81818
5 14/12 20.00000
6 15/01 10.00000
7 15/02 12.50000
8 15/04 5.00000
Any help would be much appreciated!
It sounds like what you are looking for is a pivot table. I like to use reshape::cast for these types of tables. If there is more than one value returned for a given expenditure type for a given household/year/month combination, this will sum those values. If there is only one value, it returns the value. The "sum" argument is not required but only placed there to handle exceptions. I think if your data is clean you shouldn't need this argument.
hh <- c("hh1", "hh1", "hh1", "hh2", "hh2", "hh2", "hh3", "hh3", "hh3")
date <- c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9))
value <- c(1:9)
type <- c("income", "water", "energy", "income", "water", "energy", "income", "water", "energy")
df <- data.frame(hh, date, value, type)
# Load lubridate library, add month and year
library(lubridate)
df$month <- month(df$date)
df$year <- year(df$date)
# Load reshape library, run cast from reshape, creates pivot table
library(reshape)
dfNew <- cast(df, hh+year+month~type, value = "value", sum)
> dfNew
hh year month energy income water
1 hh1 1999 4 3 0 0
2 hh1 1999 10 0 1 0
3 hh1 1999 11 0 0 2
4 hh2 1999 2 0 4 0
5 hh2 1999 3 6 0 0
6 hh2 1999 6 0 0 5
7 hh3 1999 1 9 0 0
8 hh3 1999 4 0 7 0
9 hh3 1999 8 0 0 8
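For reference, a rough tidyverse equivalent of the same monthly pivot (a sketch assuming recent dplyr and tidyr (>= 1.0, for pivot_wider()) plus lubridate):
library(dplyr)
library(tidyr)
library(lubridate)
df %>%
  mutate(year = year(date), month = month(date)) %>%
  group_by(hh, year, month, type) %>%
  summarise(value = sum(value), .groups = "drop") %>%
  pivot_wider(names_from = type, values_from = value, values_fill = 0)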
Try this:
df$ym<-zoo::as.yearmon(as.Date(df$date), "%y/%m")
library(dplyr)
df %>%
  group_by(ym, type) %>%
  summarise(mean_value = mean(value))
Source: local data frame [9 x 3]
Groups: ym [?]
ym type mean_value
<S3: yearmon> <fctr> <dbl>
1 jan 1999 income 1
2 jun 1999 energy 3
3 jul 1999 energy 6
4 jul 1999 water 2
5 ago 1999 income 4
6 set 1999 energy 9
7 set 1999 income 7
8 nov 1999 water 5
9 dez 1999 water 8
Edit: the wide format (dcast applied to the summarised result, here stored as dfr):
reshape2::dcast(dfr, ym ~ type)
ym energy income water
1 jan 1999 NA 1 NA
2 jun 1999 3 NA NA
3 jul 1999 6 NA 2
4 ago 1999 NA 4 NA
5 set 1999 9 7 NA
6 nov 1999 NA NA 5
7 dez 1999 NA NA 8
If I understood your requirement correctly (from the description in the question), this is what you are looking for:
library(dplyr)
library(tidyr)
df %>%
  mutate(date = lubridate::month(date)) %>%
  complete(household, date = 1:12) %>%
  spread(type, value) %>%
  group_by(household, date) %>%
  mutate(Total = sum(energy, income, water, na.rm = T)) %>%
  select(household, Month = date, energy:water, Total)
#Source: local data frame [36 x 6]
#Groups: household, Month [36]
#
# household Month energy income water Total
# <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 household1 1 NA NA NA 0
#2 household1 2 NA NA NA 0
#3 household1 3 NA NA 200 200
#4 household1 4 NA NA NA 0
#5 household1 5 NA NA NA 0
#6 household1 6 NA NA NA 0
#7 household1 7 NA NA NA 0
#8 household1 8 NA NA NA 0
#9 household1 9 300 NA NA 300
#10 household1 10 NA NA NA 0
# ... with 26 more rows
Note: I used the same df you provided in the question. The only change I made was the value column. Instead of 1:9, I used seq(100, 900, 100)
If I got it wrong, please let me know and I will delete my answer. I will add an explanation of what's going on if this is correct.
