I've got a list containing multiple data frames with two columns (Year and Area).
The problem is that some data frames only contain information from 2002-2015 or 2003-2017, others from 2001-2018, and so on. So they differ in length.
list:
list(structure(list(Year = c(2001, 2002, 2004, 2005), Area = c(1, 2, 3, 4)),
               class = "data.frame", row.names = c(NA, -4L)),
     structure(list(Year = c(2001, 2004, 2018), Area = c(1, 2, 4)),
               class = "data.frame", row.names = c(NA, -3L)),
     structure(list(Year = c(2008, 2009, 2014, 2015, 2016), Area = c(1, 2, 3, 4, 5)),
               class = "data.frame", row.names = c(NA, -5L)))
How can I modify them all to the same length (from 2001-2018) by adding NA, or better 0, for Area wherever there is no area information for that year?
Let
A = data.frame(Year= c(2001,2002,2004,2005), Area=c(1,2,3,4))
B = data.frame(Year= c(2001,2004,2018), Area=c(1,2,4))
C = list(A, B)
Then we have
Ref = data.frame(Year = 2001:2018)
New.List = lapply(C, function(x) dplyr::left_join(Ref, x))
with the desired result
[[1]]
Year Area
1 2001 1
2 2002 2
3 2003 NA
4 2004 3
5 2005 4
6 2006 NA
7 2007 NA
8 2008 NA
9 2009 NA
10 2010 NA
11 2011 NA
12 2012 NA
13 2013 NA
14 2014 NA
15 2015 NA
16 2016 NA
17 2017 NA
18 2018 NA
[[2]]
Year Area
1 2001 1
2 2002 NA
3 2003 NA
4 2004 2
5 2005 NA
6 2006 NA
7 2007 NA
8 2008 NA
9 2009 NA
10 2010 NA
11 2011 NA
12 2012 NA
13 2013 NA
14 2014 NA
15 2015 NA
16 2016 NA
17 2017 NA
18 2018 4
To make sure that all data frames in the list share the same spelling of Year, first run
C = lapply(C, function(x) { colnames(x)[1] = "Year"; x })
provided the first column is always the Year column.
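Since you mentioned you would rather have 0 than NA for years with no area information, the NAs can be replaced right after the join; a minimal sketch on top of the same code:
New.List = lapply(C, function(x) {
  out = dplyr::left_join(Ref, x, by = "Year")
  out$Area[is.na(out$Area)] = 0   # fill years without area information with 0 instead of NA
  out
})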
It looks simple, but I couldn't find the answer online. I have panel data with city characteristics over the years 1995-2015. For some variables, I only have data for the years 2000 and 2010. Therefore, I want to create new variables where I impute the missing data for the years 1995-2004 with the 2000 values and for the years 2005-2015 with the 2010 values.
My data set looks like this example:
cities idhm year
1 B NA 1995
2 C NA 1996
3 D NA 1997
4 E NA 1998
5 F NA 1999
6 G 24599 2000
7 H NA 2001
8 I NA 2002
9 J NA 2003
10 K NA 2004
11 L NA 2005
12 M NA 2006
13 N NA 2007
14 O NA 2008
15 P NA 2009
16 Q 5598 2010
17 R NA 2011
18 S NA 2012
19 T NA 2013
20 U NA 2014
21 V NA 2015
I want to have a data set like this one:
cities idhm year newvar
1 B NA 1995 24599
2 C NA 1996 24599
3 D NA 1997 24599
4 E NA 1998 24599
5 F NA 1999 24599
6 G 24599 2000 24599
7 H NA 2001 24599
8 I NA 2002 24599
9 J NA 2003 24599
10 K NA 2004 24599
11 L NA 2005 5598
12 M NA 2006 5598
13 N NA 2007 5598
14 O NA 2008 5598
15 P NA 2009 5598
16 Q 5598 2010 5598
17 R NA 2011 5598
18 S NA 2012 5598
19 T NA 2013 5598
20 U NA 2014 5598
21 V NA 2015 5598
Any help is welcome.
I suspect your data may be larger than this example, so a more general approach is to use a rolling join. I find that easiest with data.table.
First, make a dictionary of complete data to join on.
library(data.table)
setDT(data1)
dictionary <- data1[!is.na(idhm),.(year,idhm)]
dictionary
# year idhm
#1: 2000 24599
#2: 2010 5598
Then perform the join with on = "year" and roll = "nearest".
result <- dictionary[data1,on = "year",roll="nearest"]
result[,.(cities,year,idhm)]
# cities year idhm
# 1: B 1995 24599
# 2: C 1996 24599
# 3: D 1997 24599
# 4: E 1998 24599
# 5: F 1999 24599
# 6: G 2000 24599
# 7: H 2001 24599
# 8: I 2002 24599
# 9: J 2003 24599
#10: K 2004 24599
#11: L 2005 24599
#12: M 2006 5598
#13: N 2007 5598
#14: O 2008 5598
#15: P 2009 5598
#16: Q 2010 5598
#17: R 2011 5598
#18: S 2012 5598
#19: T 2013 5598
#20: U 2014 5598
#21: V 2015 5598
# cities year idhm
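If you also want to keep the original idhm column and store the imputed value as a separate newvar column, as in your desired output, one possible follow-up is to rename after the join (data.table keeps the dictionary's rolled column as idhm and prefixes the clashing column from data1 with i.):
result <- dictionary[data1, on = "year", roll = "nearest"]
# rolled value from the dictionary -> newvar, original column from data1 -> idhm
setnames(result, c("idhm", "i.idhm"), c("newvar", "idhm"))
result[, .(cities, idhm, year, newvar)]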
Data
data1 <- structure(list(cities = structure(1:21, .Label = c("B", "C",
"D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P",
"Q", "R", "S", "T", "U", "V"), class = "factor"), idhm = c(NA,
NA, NA, NA, NA, 24599L, NA, NA, NA, NA, NA, NA, NA, NA, NA, 5598L,
NA, NA, NA, NA, NA), year = 1995:2015), class = "data.frame", row.names = c(NA,
-21L))
We can do:
df$new_var <- NA
df$new_var[df$year >= 1995 & df$year <= 2004] <- df$idhm[df$year == 2000]
df$new_var[df$year >= 2005 & df$year <= 2015] <- df$idhm[df$year == 2010]
Or using dplyr:
library(dplyr)
df %>%
  mutate(new_var = case_when(between(year, 1995, 2004) ~ idhm[year == 2000],
                             between(year, 2005, 2015) ~ idhm[year == 2010]))
# cities idhm year new_var
#1 B NA 1995 24599
#2 C NA 1996 24599
#3 D NA 1997 24599
#4 E NA 1998 24599
#5 F NA 1999 24599
#6 G 24599 2000 24599
#7 H NA 2001 24599
#8 I NA 2002 24599
#9 J NA 2003 24599
#10 K NA 2004 24599
#11 L NA 2005 5598
#12 M NA 2006 5598
#13 N NA 2007 5598
#14 O NA 2008 5598
#15 P NA 2009 5598
#16 Q 5598 2010 5598
#17 R NA 2011 5598
#18 S NA 2012 5598
#19 T NA 2013 5598
#20 U NA 2014 5598
#21 V NA 2015 5598
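If more anchor years could show up later, a more general sketch (hypothetical, base R only) is to split the years at the midpoints between the anchor years instead of hard-coding the two ranges:
anchors <- df[!is.na(df$idhm), c("year", "idhm")]
cuts <- head(anchors$year, -1) + diff(anchors$year) / 2   # midpoints between anchor years; here 2005
df$new_var <- anchors$idhm[findInterval(df$year, cuts) + 1]
With the anchors 2000 and 2010 this reproduces the hard-coded result above: 1995-2004 get the 2000 value and 2005-2015 the 2010 value.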
I have a temporal dataset; however, it is incomplete, so I cannot reconstruct the series accurately. These are the data:
df<-data.frame(year=c(2006,2007,2008,2009,2010,2011,2012,2013,2014,2015),
sample1=c("D","D","DDD","D","U","UU","UUU","U","D","DDD"),
sample2=c("U","UU","D","D","DDD","D","U","UU","UUU","U"),
sample3=c("D","DDD","D","U","UU","UUU","U","D","DDD","D"),
sample4=c("D","D","UUU","U","D","DDD","D","U","U",NA),
sample5=c(NA,"UU","D","U","UU","UUU","U","D","U",NA))
I need it to end up like this:
df2<-data.frame(year=c(2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,
2015,2016,2017,2018),
sample1=c(NA,NA,"D","D","DDD","D","U","UU","UUU","U","D","DDD",NA,NA,NA),
sample2=c("U","UU","D","D","DDD","D","U","UU","UUU","U",NA,NA,NA,NA,NA),
sample3=c(NA,NA,NA,"D","DDD","D","U","UU","UUU","U","D","DDD","D",NA,NA),
sample4=c(NA,NA,"D","D",NA,NA,NA,NA,"UUU","U","D","DDD","D","U","U"),
sample5=c(NA,"UU","D",NA,NA,NA,"U","UU","UUU","U",NA,NA,"D","U",NA))
I need all the columns aligned to the same pattern. The best result I got was with DNA alignment functions, but to find the best alignment those sometimes invert the elements, which must not happen in my case.
I have no idea how to do this.
dplyr's add_row function makes this pretty easy, once the initial dataframe exists.
library(dplyr)
df<-data.frame(year=c(2006,2007,2008,2009,2010,2011,2012,2013,2014,2015),
sample1 = c("D","D","DDD","D","U","UU","UUU","U","D","DDD"),
sample2 = c("U","UD","D","D","DDD","D","U","UU","UUU","U"),
sample3 = c("D","DDD","D","U","UU","UUU","U","D","DDD","D"),
sample4 = c("D","D","UUU","U","D","DDD","D","U","U",NA),
sample5 = c(NA,"UU","D","U","UU","UUU","U","D","U",NA))
df2 <- df %>%
add_row(year = 2016:2018)
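This only appends 2016-2018, so the 2004 and 2005 rows from the desired df2 are still missing. An alternative sketch, assuming tidyr is available, pads the whole range in one step with complete():
library(tidyr)
df2 <- df %>%
  complete(year = 2004:2018)   # adds an NA-filled row for every year of 2004-2018 not already present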
library(dplyr)
df <- data_frame(year=c(2006,2007,2008,2009,2010,2011,2012,2013,2014,2015),
sample1=c("D","D","DDD","D","U","UU","UUU","U","D","DDD"),
sample2=c("U","UD","D","D","DDD","D","U","UU","UUU","U"),
sample3=c("D","DDD","D","U","UU","UUU","U","D","DDD","D"),
sample4=c("D","D","UUU","U","D","DDD","D","U","U",NA),
sample5=c(NA,"UU","D","U","UU","UUU","U","D","U",NA)) %>%
add_row(year = c(2004, 2005), .before = 1) %>%
add_row(year = c(2016:2018))
Result:
# A tibble: 15 x 6
year sample1 sample2 sample3 sample4 sample5
<dbl> <chr> <chr> <chr> <chr> <chr>
1 2004 NA NA NA NA NA
2 2005 NA NA NA NA NA
3 2006 D U D D NA
4 2007 D UU DDD D UU
5 2008 DDD D D UUU D
6 2009 D D U U U
7 2010 U DDD UU D UU
8 2011 UU D UUU DDD UUU
9 2012 UUU U U D U
10 2013 U UU D U D
11 2014 D UUU DDD U U
12 2015 DDD U D NA NA
13 2016 NA NA NA NA NA
14 2017 NA NA NA NA NA
15 2018 NA NA NA NA NA
I have a data frame (panel data): the Ctry column indicates the name of the countries in my data frame. If the number of NAs in any column (for example Carx) is larger than 3, I want to drop the related country from my data frame. For example,
Country A has 2 NAs
Country B has 4 NAs
Country C has 3 NAs
so I want to drop country B from my data frame. I have a data frame like this (this is for illustration; my actual data frame is very large):
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
B 2000 NA
B 2001 NA
B 2002 NA
B 2003 NA
B 2004 18
B 2005 16
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
I want to create a data frame like this:
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
A fairly straightforward way in base R is to use sum(is.na(.)) along with ave, to do the counting, like this:
with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x))))
# [1] 1 1 1 1 1 1 4 4 4 4 4 4 3 3 3 3 3 3
Once you have that, subsetting is easy:
mydf[with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x)))) <= 3, ]
# Ctry year Carx
# 1 A 2000 23
# 2 A 2001 18
# 3 A 2002 20
# 4 A 2003 NA
# 5 A 2004 24
# 6 A 2005 18
# 13 C 2000 NA
# 14 C 2001 NA
# 15 C 2002 24
# 16 C 2003 21
# 17 C 2004 NA
# 18 C 2005 24
You can use the by() function to group by Ctry and count the NAs of each group:
DF <- read.csv(
text='Ctry,year,Carx
A,2000,23
A,2001,18
A,2002,20
A,2003,NA
A,2004,24
A,2005,18
B,2000,NA
B,2001,NA
B,2002,NA
B,2003,NA
B,2004,18
B,2005,16
C,2000,NA
C,2001,NA
C,2002,24
C,2003,21
C,2004,NA
C,2005,24',
stringsAsFactors=F)
res <- by(data=DF$Carx,INDICES=DF$Ctry,FUN=function(x)sum(is.na(x)))
validCtry <-names(res)[res <= 3]
DF[DF$Ctry %in% validCtry, ]
# Ctry year Carx
#1 A 2000 23
#2 A 2001 18
#3 A 2002 20
#4 A 2003 NA
#5 A 2004 24
#6 A 2005 18
#13 C 2000 NA
#14 C 2001 NA
#15 C 2002 24
#16 C 2003 21
#17 C 2004 NA
#18 C 2005 24
EDIT:
If you have more columns to check, you could adapt the previous code as follows:
res <- by(data=DF,INDICES=DF$Ctry,
FUN=function(x){
return(sum(is.na(x$Carx)) <= 3 &&
sum(is.na(x$Barx)) <= 3 &&
sum(is.na(x$Tarx)) <= 3)
})
validCtry <- names(res)[res]
DF[DF$Ctry %in% validCtry, ]
where, of course, you may change the condition in FUN according to your needs.
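If the check should cover every column instead of a fixed set of names, the same idea can be written more generally with colSums(); a sketch:
res <- by(data = DF, INDICES = DF$Ctry,
          FUN = function(x) all(colSums(is.na(x)) <= 3))   # TRUE only if no column of this country has more than 3 NAs
validCtry <- names(res)[unlist(res)]
DF[DF$Ctry %in% validCtry, ]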
Since you mention that your data is "very huge" (whatever that means exactly), you could try a solution with dplyr and see if it's perhaps faster than the solutions in base R. If the other solutions are fast enough, just ignore this one.
require(dplyr)
newdf <- df %.% group_by(Ctry) %.% filter(sum(is.na(Carx)) <= 3)