I have a very large (~30M observations) dataframe in R and I am having trouble with a new column I want to create.
The data is formatted like this:
Country Year Value
1 A 2000 1
2 A 2001 NA
3 A 2002 2
4 B 2000 4
5 B 2001 NA
6 B 2002 NA
7 B 2003 3
My problem is that I would like to impute the NAs in the value column based on other values in that column. Specifically, if there is a non-NA value for the same country I would like that to replace the NA in later years, until there is another non-NA value.
The data above would therefore be transformed into this:
Country Year Value
1 A 2000 1
2 A 2001 1
3 A 2002 2
4 B 2000 4
5 B 2001 4
6 B 2002 4
7 B 2003 3
To solve this, I first tried using a loop with a lookup function and also some if_else statements, but wasn't able to get it to behave as I expected. In general, I am struggling to find an efficient solution that can perform the task on the order of minutes to hours rather than days.
Is there an easy way to do this?

Using tidyr's fill (with dplyr loaded for group_by and the pipe):
df %>%
  group_by(Country) %>%
  fill(Value)
# A tibble: 7 × 3
# Groups: Country [2]
Country Year Value
<chr> <dbl> <dbl>
1 A 2000 1
2 A 2001 1
3 A 2002 2
4 B 2000 4
5 B 2001 4
6 B 2002 4
7 B 2003 3
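The same last-observation-carried-forward idea can also be sketched in base R, which may be handy for comparing timings on the full data. This is only a sketch: `locf` is a hypothetical helper name, and it assumes the rows are already sorted by Year within each Country.

```r
# Sample data from the question
df <- data.frame(
  Country = c("A", "A", "A", "B", "B", "B", "B"),
  Year    = c(2000, 2001, 2002, 2000, 2001, 2002, 2003),
  Value   = c(1, NA, 2, 4, NA, NA, 3)
)

# Hypothetical helper: carry the last non-NA value forward.
# Leading NAs (nothing earlier to carry) stay NA.
locf <- function(x) {
  seen <- cumsum(!is.na(x))        # number of non-NA values seen so far
  c(NA, x[!is.na(x)])[seen + 1]    # slot 1 holds the NA used when seen == 0
}

# Apply the helper separately within each country
df$Value <- ave(df$Value, df$Country, FUN = locf)
df$Value
```

Whether this is faster than the grouped fill on 30M rows is something to measure rather than assume.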


`str_replace_all` numeric values in column according to named vector

I want to use a named vector to map numeric values of a data frame column.
Consider the following example:
df <- data.frame(year = seq(2000,2004,1), value = sample(11:15, replace = TRUE)) %>%
add_row(year=2005, value=1)
# year value
# 1 2000 12
# 2 2001 15
# 3 2002 11
# 4 2003 12
# 5 2004 14
# 6 2005 1
I now want to replace according to a vector, like this one
repl_vec <- c("1"="apple", "11"="radish", "12"="tomato", "13"="cucumber", "14"="eggplant", "15"="carrot")
which I do with this
df %>% mutate(val_alph = str_replace_all(value, repl_vec))
However, this gives:
# year value val_alph
# 1 2000 11 appleapple
# 2 2001 13 apple3
# 3 2002 15 apple5
# 4 2003 12 apple2
# 5 2004 14 apple4
# 6 2005 1 apple
since str_replace_all matches each pattern anywhere in the string rather than requiring the whole value to match. In the real data, the names of the named vector are also numbers (one and two digits).
I expect the output to be like this:
# year value val_alph
# 1 2000 11 radish
# 2 2001 13 cucumber
# 3 2002 15 carrot
# 4 2003 12 tomato
# 5 2004 14 eggplant
# 6 2005 1 apple
Does someone have a clever way of achieving this?
I would use base R's match instead of string matching here, since you are looking for exact whole string matches.
df %>%
mutate(value = repl_vec[match(value, names(repl_vec))])
#> year value
#> 1 2000 radish
#> 2 2001 carrot
#> 3 2002 carrot
#> 4 2003 cucumber
#> 5 2004 eggplant
#> 6 2005 apple
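The exact-match idea can also be written as a plain lookup by name. A minimal sketch, assuming the values are whole numbers so that `as.character()` produces the same strings as the vector's names; the fixed `value` column is made up here so the result is reproducible.

```r
repl_vec <- c("1" = "apple", "11" = "radish", "12" = "tomato",
              "13" = "cucumber", "14" = "eggplant", "15" = "carrot")

# Fixed values instead of sample() so the result is reproducible
df <- data.frame(year = 2000:2005, value = c(11, 13, 15, 12, 14, 1))

# Index by name: each value is converted to a string and looked up exactly.
# Values with no matching name would come back as NA.
df$val_alph <- unname(repl_vec[as.character(df$value)])
df$val_alph
```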
Is this what you want to do?
df <- data.frame(year = seq(2000,2004,1), value = sample(11:15, replace = TRUE)) %>%
add_row(year=2005, value=1)
repl_vec <- c("1"="one", "11"="eleven", "12"="twelve", "13"="thirteen", "14"="fourteen", "15"="fifteen")
names(repl_vec) <- paste0("\\b", names(repl_vec), "\\b")
df %>%
mutate(val_alph = str_replace_all(value, repl_vec))
which gives:
year value val_alph
1 2000 14 fourteen
2 2001 12 twelve
3 2002 15 fifteen
4 2003 14 fourteen
5 2004 11 eleven
6 2005 1 one
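To see why the `\\b` word boundaries matter, here is a small base R illustration using `sub()` rather than stringr. The vector `x` and the names `loose` and `strict` are made up for the example.

```r
x <- c("1", "11", "15")

# Unanchored: the pattern "1" also matches inside "11" and "15"
loose <- sub("1", "one", x)

# With word boundaries the pattern only matches a standalone "1"
strict <- sub("\\b1\\b", "one", x, perl = TRUE)

loose
strict
```

With the boundaries in place, "11" and "15" are left untouched, so a later pattern such as `\\b11\\b` can match them as a whole.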

Merging two dataframes by multiple columns without losing data [duplicate]

I am new to R and have two very large datasets I want to merge. They look as follows
ID year val1 val3
1 1 2001 2 34
2 2 2004 1 25
3 3 2003 3 36
4 4 2003 2 46
5 5 1999 1 55
6 6 2005 3 44
The second dataframe is as follows
ID year val2
1 1 2001 2
2 2 2004 1
3 3 2003 3
4 4 2002 5
5 5 1998 4
6 6 2004 6
I want the final merged set to look like this
ID year val1 val3 val2
1 1 2001 2 34 2
2 2 2004 1 25 1
3 3 2003 3 36 3
4 4 2002 NA NA 5
5 4 2003 2 46 NA
6 5 1998 NA NA 4
7 5 1999 1 55 NA
8 6 2004 NA NA 6
9 6 2005 3 44 NA
I tried merging by ID and year using the following
total <- merge(df1,df2,by=c("ID","year"))
But this only merges rows where ID and year BOTH match. I want it to work so that if the ID matches but the year doesn't, a new row is added under the same ID with the entries for year and val2, leaving val1 and val3 as NA.
I then tried merging only by ID and then removing rows if year.x != year.y, but since the datasets were too large it wasn't very efficient.
merge has an argument all that specifies whether you want to keep all rows from both the left and right side (i.e. all rows from x and all rows from y).
total <- merge(df1,df2,by=c("ID","year"), all=TRUE)
Specify all.x=TRUE and all.y=TRUE (equivalent to all=TRUE) so you keep all unique rows from both datasets.
total <- merge(df1,df2,by=c("ID","year"),all.x=TRUE,all.y=TRUE)
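For reference, here is a self-contained sketch of the full outer merge on the sample data from the question, with the data frames rebuilt by hand from the tables shown above.

```r
df1 <- data.frame(ID   = 1:6,
                  year = c(2001, 2004, 2003, 2003, 1999, 2005),
                  val1 = c(2, 1, 3, 2, 1, 3),
                  val3 = c(34, 25, 36, 46, 55, 44))
df2 <- data.frame(ID   = 1:6,
                  year = c(2001, 2004, 2003, 2002, 1998, 2004),
                  val2 = c(2, 1, 3, 5, 4, 6))

# all = TRUE keeps rows from both sides; unmatched columns become NA
total <- merge(df1, df2, by = c("ID", "year"), all = TRUE)
nrow(total)
```

The three ID/year pairs present in both inputs are merged into single rows, and the six unmatched rows are kept with NA in the columns from the other data frame.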

back fill NA values in panel data set

I want to know how I can backfill NA values in a panel data set.
Data set:
date firms return
1999 A NA
2000 A 5
2001 A NA
1999 B 9
2000 B NA
2001 B 10
Expected outcome:
date firms return
1999 A 5
2000 A 5
2001 A NA
1999 B 9
2000 B 10
2001 B 10
I use this formula to fill NA values with previous value in panel data set
df1<-df %>% group_by(firms) %>% fill(return)
Is there an easy way like this to fill NA values with the next value in a panel data set?
You almost had it.
df <- df %>% group_by(firms) %>% fill(return, .direction="up")
# A tibble: 6 x 3
# Groups: firms [2]
date firms return
<int> <fct> <int>
1 1999 A 5
2 2000 A 5
3 2001 A NA
4 1999 B 9
5 2000 B 10
6 2001 B 10
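For comparison, the same backfill can be sketched in base R without tidyr. `nocb` (next observation carried backward) is a hypothetical helper name, and the sketch assumes the rows are sorted by date within each firm.

```r
# Sample data from the question
df <- data.frame(
  date   = c(1999, 2000, 2001, 1999, 2000, 2001),
  firms  = c("A", "A", "A", "B", "B", "B"),
  return = c(NA, 5, NA, 9, NA, 10)
)

# Hypothetical helper: reverse, carry the last value forward, reverse back.
# Trailing NAs (no later value to take) stay NA.
nocb <- function(x) {
  x <- rev(x)
  seen <- cumsum(!is.na(x))             # non-NA values seen so far (reversed order)
  rev(c(NA, x[!is.na(x)])[seen + 1])    # slot 1 holds the NA used when seen == 0
}

# Apply the helper separately within each firm
df$return <- ave(df$return, df$firms, FUN = nocb)
df$return
```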

Assign unique ID based on two columns [duplicate]

I have a dataframe (df) that looks like this:
School Student Year
A 10 1999
A 10 2000
A 20 1999
A 20 2000
A 20 2001
B 10 1999
B 10 2000
And I would like to create a person ID column so that df looks like this:
ID School Student Year
1 A 10 1999
1 A 10 2000
2 A 20 1999
2 A 20 2000
2 A 20 2001
3 B 10 1999
3 B 10 2000
In other words, the ID variable indicates which person it is in the dataset, accounting for both Student number and School membership (here we have 3 students total).
I did df$ID <- df$Student and tried to request the value +1 if c("School", "Student") was unique. It isn't working. Help appreciated.
We can do this in base R without doing any group by operation
df$ID <- cumsum(!duplicated(df[1:2]))
# School Student Year ID
#1 A 10 1999 1
#2 A 10 2000 1
#3 A 20 1999 2
#4 A 20 2000 2
#5 A 20 2001 2
#6 B 10 1999 3
#7 B 10 2000 3
NOTE: Assuming that 'School' and 'Student' are ordered
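If the rows are not ordered by School and Student, one base R option is to match each row's combined key against the unique keys. A minimal sketch on made-up, deliberately unordered data; it assumes the separator "_" never occurs inside the School or Student values.

```r
# Hypothetical unordered data: the same School/Student pair recurs later
df <- data.frame(
  School  = c("B", "A", "A", "B", "A"),
  Student = c(10, 10, 20, 10, 10),
  Year    = c(1999, 1999, 1999, 2000, 2000)
)

# The cumsum approach only counts first appearances, so repeats get the wrong ID
naive <- cumsum(!duplicated(df[1:2]))

# Build one key per row and number the keys by order of first appearance
key <- paste(df$School, df$Student, sep = "_")
df$ID <- match(key, unique(key))

naive
df$ID
```

Here `naive` assigns ID 3 to the last two rows, while the key-based version gives them the IDs of their earlier occurrences.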
Or using tidyverse
df %>%
mutate(ID = group_indices_(df, .dots=c("School", "Student")))
# School Student Year ID
#1 A 10 1999 1
#2 A 10 2000 1
#3 A 20 1999 2
#4 A 20 2000 2
#5 A 20 2001 2
#6 B 10 1999 3
#7 B 10 2000 3
As @radek mentioned, in the recent version (dplyr_0.8.0) we get a notification that group_indices_ is deprecated; use group_indices instead.
df %>%
mutate(ID = group_indices(., School, Student))
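As far as I know, from dplyr 1.0.0 onward calling group_indices() with grouping columns is itself deprecated in favour of cur_group_id() inside a grouped mutate. A sketch under that assumption, using the data from the question:

```r
library(dplyr)

df <- data.frame(
  School  = c("A", "A", "A", "A", "A", "B", "B"),
  Student = c(10, 10, 20, 20, 20, 10, 10),
  Year    = c(1999, 2000, 1999, 2000, 2001, 1999, 2000)
)

df <- df %>%
  group_by(School, Student) %>%
  mutate(ID = cur_group_id()) %>%   # group number, following the sorted group keys
  ungroup()

df$ID
```

Note that the IDs follow the sorted order of the grouping keys, not the order in which the groups first appear in the data.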
Group by School and Student, then assign group id to ID variable.
df[, ID := .GRP, by = .(School, Student)]
# School Student Year ID
# 1: A 10 1999 1
# 2: A 10 2000 1
# 3: A 20 1999 2
# 4: A 20 2000 2
# 5: A 20 2001 2
# 6: B 10 1999 3
# 7: B 10 2000 3
df <- fread('School Student Year
A 10 1999
A 10 2000
A 20 1999
A 20 2000
A 20 2001
B 10 1999
B 10 2000')

expand.grid() based on values in two variables in R

I would like to expand a grid in R such that the expansion occurs for unique values of one variable but joint values for two variables. For example:
frame <- data.frame(id = seq(1:2),id2 = seq(1:2), year = c(2005, 2008))
I would like to expand the frame for each year, but such that id and id2 are considered jointly (e.g. (1,1) and (2,2)) to generate an output like:
id id2 year
1 1 2005
1 1 2006
1 1 2007
1 1 2005
2 2 2006
2 2 2007
2 2 2008
Using expand.grid(), does someone know how to do this? I have not been able to wrangle the code past looking at each id uniquely and producing a frame with all combinations given the following code:
with(frame, expand.grid(year = seq(min(year), max(year)), id = unique(id), id2 = unique(id2)))
Thanks for any and all help.
You could do this with reshape::expand.grid.df
expand.grid.df(data.frame(id=1:2,id2=1:2), data.frame(year=c(2005:2008)))
id id2 year
1 1 1 2005
2 2 2 2005
3 1 1 2006
4 2 2 2006
5 1 1 2007
6 2 2 2007
7 1 1 2008
8 2 2 2008
Here is another way using base R
indx <- diff(frame$year)+1
indx1 <- rep(1:nrow(frame), each=indx)
frame1 <- transform(frame[indx1,1:2], year=seq(frame$year[1], length.out=indx, by=1))
row.names(frame1) <- NULL
# id id2 year
#1 1 1 2005
#2 1 1 2006
#3 1 1 2007
#4 1 1 2008
#5 2 2 2005
#6 2 2 2006
#7 2 2 2007
#8 2 2 2008
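Another base R sketch that avoids the extra package: `merge()` with `by = NULL` returns the Cartesian product of its two inputs, so the unique (id, id2) pairs can be crossed with the full range of years. This assumes every pair should receive every year between the overall minimum and maximum; `pairs`, `years` and `out` are names made up for the example.

```r
frame <- data.frame(id = 1:2, id2 = 1:2, year = c(2005, 2008))

# Keep id and id2 together as one unit
pairs <- unique(frame[c("id", "id2")])

# All years spanned by the data
years <- data.frame(year = seq(min(frame$year), max(frame$year)))

# by = NULL requests the Cartesian product of the two data frames
out <- merge(pairs, years, by = NULL)

# Order by pair, then year, and reset the row names
out <- out[order(out$id, out$id2, out$year), ]
row.names(out) <- NULL
out
```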
