Drop subgroup of obs in dataframe if first observation of group is na - r

In R I have a dataframe df of this form:
a b year month id
1 2 2012 01 1234758
1 1 2012 02 1234758
NA 5 2011 04 1234759
5 5 2011 05 1234759
5 5 2011 06 1234759
2 2 2001 11 1234760
NA NA 2001 11 1234760
Some of the values in a and b are NA. I wish to split the dataframe by id, order each subset by year and month, and then drop the whole subset (id) if the first observation in time order has an NA in either a or b.
For the example above, the intended result is:
a b year month id
1 2 2012 01 1234758
1 1 2012 02 1234758
2 2 2001 11 1234760
NA NA 2001 11 1234760
I did it the non-vectorized way, which took forever to run, as follows:
df_summary <- as.data.frame(table(df$id), stringsAsFactors = FALSE)
df <- df[order(df$id, df$year, df$month), ]
remove <- ""
j <- 1  # row index of the first observation of the current id
l <- 0
for (i in 1:nrow(df_summary)) {
  m <- df_summary$Freq[i]  # number of rows for this id (table() names this column Freq)
  if (is.na(df$a[j]) | is.na(df$b[j])) {
    l <- l + 1
    remove[l] <- df_summary$Var1[i]  # table() stores the id in Var1
  }
  j <- j + m
}
df <- df[!(df$id %in% remove), ]
What is a faster, vectorized way, to achieve the same result?
What I tried, also to double-check my code:
dt <- setDT(df)
remove_vectorized <- dt[,list(remove_first_na=(is.na(a[1]) | is.na(b[1]))),by=id]
which suggests that I should remove ALL observations, which is patently wrong.

Here are a few possible data.table approaches.
First, fixing your attempt:
library(data.table)
setDT(df)[, if(!is.na(a[1L]) & !is.na(b[1L])) .SD, by = id]
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
Or we can generalize this (probably at the expense of speed):
setDT(df)[, if(Reduce(`&`, !is.na(.SD[1L, .(a, b)]))) .SD, by = id]
## OR maybe `setDT(df)[, if(Reduce(`&`, !sapply(.SD[1L, .(a, b)], is.na))) .SD , by = id]`
## (in order to avoid matrix conversions)
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
Another way is to combine the unique and na.omit methods (in current data.table versions the column argument of na.omit is named cols):
indx <- na.omit(unique(setDT(df), by = "id"), cols = c("a", "b"))
Then, a simple subset will do
df[id %in% indx$id]
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
Or maybe a binary join?
df[indx[, .(id)], on = "id"]
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
Or
indx <- na.omit(unique(setDT(df, key = "id"), by = "id"), cols = c("a", "b"))
df[.(indx$id)]
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
(The last two are mainly for illustration)
For more info regarding data.table, please visit Getting Started on GH
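For comparison, here is a minimal base R sketch of the same idea. It is not part of the answer above; it only assumes the example data from the question (with month stored as a number) and makes the ordering step explicit, since "first observation" only means "earliest" once the rows are sorted by id, year and month.

```r
# Example data from the question (month stored as a number here)
df <- data.frame(
  a     = c(1, 1, NA, 5, 5, 2, NA),
  b     = c(2, 1, 5, 5, 5, 2, NA),
  year  = c(2012, 2012, 2011, 2011, 2011, 2001, 2001),
  month = c(1, 2, 4, 5, 6, 11, 11),
  id    = c(1234758, 1234758, 1234759, 1234759, 1234759, 1234760, 1234760)
)

# Sort so that the first row of each id is its earliest observation
df <- df[order(df$id, df$year, df$month), ]

# TRUE on the first row of every id, FALSE elsewhere
first_row <- !duplicated(df$id)

# ids whose first row has an NA in a or b
bad_ids <- df$id[first_row & (is.na(df$a) | is.na(df$b))]

# Keep every row belonging to the remaining ids
result <- df[!(df$id %in% bad_ids), ]
result
```

Because order() is stable, rows that tie on year and month (as for id 1234760) keep their original relative order.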

Related

data.table: is it possible to merge .SD and return a new 'sub data table' by group?

I have a data table organized by id and year, with a frequency (freq) value for every year where the frequency is at least 1. The start and end year may differ for every id.
Example:
> dt <- data.table(id=c('A','A','A','A','B','B','B','B'),year=c(2010,2012,2013,2015,2006,2007,2010,2011),freq=c(2,1,4,3,1,3,5,7))
> dt
id year freq
1: A 2010 2
2: A 2012 1
3: A 2013 4
4: A 2015 3
5: B 2006 1
6: B 2007 3
7: B 2010 5
8: B 2011 7
I would like to make each time series by id complete, i.e. add rows with freq=0 for any missing year. So the result for the example above should look like this:
id year freq
A 2010 2
A 2011 0
A 2012 1
A 2013 4
A 2014 0
A 2015 3
B 2006 1
B 2007 3
B 2008 0
B 2009 0
B 2010 5
B 2011 7
I'm starting with data.table and I'm interested to see if this is doable. With plyr or dplyr I would have used a merge operation with a complete column of years for every sub dataframe by id. Is there an equivalent to this solution with data.table?
We can't use CJ-based approaches because the missing rows need to be by-id. An alternative is:
library(data.table)
dt[ dt[, .(year = do.call(seq, as.list(range(year)))), by = .(id)],
on = .(id, year)
][is.na(freq), freq := 0][]
# id year freq
# <char> <int> <num>
# 1: A 2010 2
# 2: A 2011 0
# 3: A 2012 1
# 4: A 2013 4
# 5: A 2014 0
# 6: A 2015 3
# 7: B 2006 1
# 8: B 2007 3
# 9: B 2008 0
# 10: B 2009 0
# 11: B 2010 5
# 12: B 2011 7
Another solution, perhaps more explicit than @r2evans's. First, make a table of the complete series:
years <- dt[, list(year= seq(min(year), max(year))), by= id]
years
id year
1: A 2010
2: A 2011
3: A 2012
4: A 2013
5: A 2014
6: A 2015
7: B 2006
8: B 2007
9: B 2008
10: B 2009
11: B 2010
12: B 2011
then merge and replace NAs:
full <- merge(dt, years, all.y= TRUE)
full[, freq := ifelse(is.na(freq), 0, freq)]
full
id year freq
1: A 2010 2
2: A 2011 0
3: A 2012 1
4: A 2013 4
5: A 2014 0
6: A 2015 3
7: B 2006 1
8: B 2007 3
9: B 2008 0
10: B 2009 0
11: B 2010 5
12: B 2011 7
Here is another data.table way to solve your problem:
dt[, .SD[.(min(year):max(year)), on="year"], by=id][is.na(freq), freq:=0]
# id year freq
# <char> <int> <num>
# 1: A 2010 2
# 2: A 2011 0
# 3: A 2012 1
# 4: A 2013 4
# 5: A 2014 0
# 6: A 2015 3
# 7: B 2006 1
# 8: B 2007 3
# 9: B 2008 0
# 10: B 2009 0
# 11: B 2010 5
# 12: B 2011 7
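For readers coming from base R, the same "build the full grid per id, then join" idea can be sketched without data.table. This is only an illustration under the assumption that the data is a plain data frame; the names df_freq, years_by_id and grid are made up for this example.

```r
# Same example data as above, as a plain data frame
df_freq <- data.frame(
  id   = rep(c("A", "B"), each = 4),
  year = c(2010, 2012, 2013, 2015, 2006, 2007, 2010, 2011),
  freq = c(2, 1, 4, 3, 1, 3, 5, 7),
  stringsAsFactors = FALSE
)

# One complete run of years per id, from that id's first to last year
years_by_id <- split(df_freq$year, df_freq$id)
grid <- do.call(rbind, Map(
  function(id, y) data.frame(id = id, year = seq(min(y), max(y)),
                             stringsAsFactors = FALSE),
  names(years_by_id), years_by_id
))

# Left join the observed frequencies onto the grid, then fill the gaps with 0
full <- merge(grid, df_freq, by = c("id", "year"), all.x = TRUE)
full$freq[is.na(full$freq)] <- 0
full <- full[order(full$id, full$year), ]
full
```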

Lagging single column in Time-Series

I am running R 4.0.3 and have no access to the internet.
I want to lag a single column of a multicolumn Time-Series. I wasn't able to find a satisfactory answer anywhere else.
Intuitively this makes sense to me, but it just doesn't work:
library(tsbox)
data=data.frame(Date=c('2005-01-01','2005-02-01','2005-03-01','2005-04-01','2005-05-01'),
col1 = c(1,2,3,4,5),
col2 = c(1,2,3,4,5))
data[,'Date']= as.POSIXct(data[,'Date'],format='%Y-%m-%d')
timeseries = ts_ts(ts_long(data))
timeseries[,'col1_L1'] = lag(timeseries[,'col1'],1)
What I get:
col1 col2 col1_L1
Jan 2005 1 1 1
Feb 2005 2 2 2
Mar 2005 3 3 3
Apr 2005 4 4 4
May 2005 5 5 5
What I would expect from this code:
col1 col2 col1_L1
Jan 2005 1 1 NA
Feb 2005 2 2 1
Mar 2005 3 3 2
Apr 2005 4 4 3
May 2005 5 5 4
I wasn't able to reproduce your example (likely due to the reasons pointed out in the comments) but perhaps you could use the function from this post, e.g.
data=data.frame(Date=c('2005-01-01','2005-02-01','2005-03-01','2005-04-01','2005-05-01'),
col1 = c(1,2,3,4,5),
col2 = c(1,2,3,4,5))
data[,'Date']= as.POSIXct(data[,'Date'],format='%Y-%m-%d')
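lagpad <- function(x, k) {
  if (k > 0) {
    # positive k: shift values down and pad the start with NA
    return(c(rep(NA, k), x)[1:length(x)])
  } else {
    # zero or negative k: shift values up and pad the end with NA
    return(c(x[(-k + 1):length(x)], rep(NA, -k)))
  }
}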
data$col_l1 <- lagpad(data$col2, 1)
data
#> Date col1 col2 col_l1
#> 1 2005-01-01 1 1 NA
#> 2 2005-02-01 2 2 1
#> 3 2005-03-01 3 3 2
#> 4 2005-04-01 4 4 3
#> 5 2005-05-01 5 5 4
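As background on the original attempt: stats::lag() does not move the values of a ts object, it only shifts the time index, so looking at the values alone shows no change. The sketch below, which builds the series directly with ts() instead of tsbox (an assumption made to keep it self-contained), aligns the shifted series with the original via ts.union() and trims the extra trailing period.

```r
# Monthly series starting January 2005, mirroring the question's data
x <- ts(cbind(col1 = 1:5, col2 = 1:5), start = c(2005, 1), frequency = 12)

# k = -1 moves the time index one period forward; the values are untouched
lagged <- stats::lag(x[, "col1"], -1)

# ts.union() aligns on time and pads with NA where a series has no value
out <- ts.union(x, lagged)

# The union runs to June 2005; cut it back to the original range
out <- window(out, end = c(2005, 5))
colnames(out) <- c("col1", "col2", "col1_L1")
out
```

Calling stats::lag() with the namespace prefix avoids picking up dplyr::lag() if dplyr happens to be attached.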

data.table aggregation by one column using the maximum value of another column - R

I've got a data.table DT that I would like to aggregate by one column (year) using the maximum value of another column (month). Here's a sample of my data.table.
> DT <- data.table(month = c("2016-01", "2016-02", "2016-03", "2017-01", "2017-02", "2017-03")
, col1 = c(3,5,2,8,4,9)
, year = c(2016, 2016,2016, 2017,2017,2017))
> DT
month col1 year
1: 2016-01 3 2016
2: 2016-02 5 2016
3: 2016-03 2 2016
4: 2017-01 8 2017
5: 2017-02 4 2017
6: 2017-03 9 2017
The desired output
> ## desired output
> DT
month col1 year desired_output
1: 2016-01 3 2016 2
2: 2016-02 5 2016 2
3: 2016-03 2 2016 2
4: 2017-01 8 2017 9
5: 2017-02 4 2017 9
6: 2017-03 9 2017 9
Aggregating by the column year, the desired output should be the value of col1 for the latest month. But somehow the following code doesn't work, it gives me a warning and returns NAs. What am I doing wrong?
> ## wrong output
> DT[, output := col1[which.max(month)], by = .(year)]
Warning messages:
1: In which.max(month) : NAs introduced by coercion
2: In which.max(month) : NAs introduced by coercion
> DT
month col1 year output
1: 2016-01 3 2016 NA
2: 2016-02 5 2016 NA
3: 2016-03 2 2016 NA
4: 2017-01 8 2017 NA
5: 2017-02 4 2017 NA
6: 2017-03 9 2017 NA
We get the index of the max value in 'month' by converting it to the yearmon class from zoo, and use that index to pick the corresponding value from 'col1' when creating the 'desired_output' column, grouped by 'year':
library(zoo)
library(data.table)
DT[, desired_output := col1[which.max(as.yearmon(month))], .(year)]
DT
# month col1 year desired_output
#1: 2016-01 3 2016 2
#2: 2016-02 5 2016 2
#3: 2016-03 2 2016 2
#4: 2017-01 8 2017 9
#5: 2017-02 4 2017 9
#6: 2017-03 9 2017 9
Or extract the 'month' and get the index of max value
DT[, desired_output := col1[which.max(month(as.IDate(paste0(month,
"-01"))))], .(year)]
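The warning in the question arises because which.max() coerces its argument to numeric, and strings such as "2016-01" are not numbers. As a sketch of an alternative that needs no date conversion, and assuming every value really has the zero-padded "YYYY-MM" form, the strings can be compared directly, since max() on character vectors then picks the latest month:

```r
library(data.table)

DT <- data.table(
  month = c("2016-01", "2016-02", "2016-03", "2017-01", "2017-02", "2017-03"),
  col1  = c(3, 5, 2, 8, 4, 9),
  year  = c(2016, 2016, 2016, 2017, 2017, 2017)
)

# max() works on character vectors; zero-padded "YYYY-MM" strings sort chronologically
DT[, output := col1[month == max(month)][1L], by = year]
DT
```

The trailing [1L] only matters if the latest month appears more than once within a year.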

How do you select a max of one column and not NA's in another column in R?

I'm looking for a way in R where I can select the max(col1) where col2 is not NA?
Example data frame named df1:
#df1
Year col1 col2
2016 4 NA # has NA
2016 2 NA # has NA
2016 1 3 # this is the max for 2016
2017 3 NA
2017 2 3 # this is the max for 2017
2017 1 3
2018 2 4 # this is the max for 2018
2018 1 NA
I would like the new dataset to only return
Year col1 col2
2016 1 3
2017 2 3
2018 2 4
If anyone can help, it would be much appreciated.
In base R
out <- na.omit(df1)
merge(aggregate(col1 ~ Year, out, max), out) # thanks to Rui
# Year col1 col2
#1 2016 1 3
#2 2017 2 3
#3 2018 2 4
Using dplyr:
library(dplyr)
df1 %>%
  filter(!is.na(col2)) %>%
  group_by(Year) %>%  # the column is named Year (capital Y)
  arrange(desc(col1)) %>%
  slice(1)
Using data.table:
library(data.table)
setDT(df1)
df1[!is.na(col2), .SD[which.max(col1)], by = Year]
This works in a fresh R session:
library(data.table)
dt = fread("Year col1 col2
2016 4 NA
2016 2 NA
2016 1 3
2017 3 NA
2017 2 3
2017 1 3
2018 2 4
2018 1 NA")
dt[!is.na(col2), .SD[which.max(col1)], by = Year]
# Year col1 col2
# 1: 2016 1 3
# 2: 2017 2 3
# 3: 2018 2 4

Removing rows of data frame if number of NA in a column is larger than 3

I have a data frame (panel data) in which the Ctry column holds the country names. If the number of NAs in a given column (for example, Carx) is larger than 3 for a country, I want to drop that country from my data frame. For example,
Country A has 2 NA
Country B has 4 NA
Country C has 3 NA
I want to drop country B from my data frame. I have a data frame like this (this is for illustration; my actual data frame is very large):
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
B 2000 NA
B 2001 NA
B 2002 NA
B 2003 NA
B 2004 18
B 2005 16
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
I want to create a data frame like this:
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
A fairly straightforward way in base R is to use sum(is.na(.)) along with ave, to do the counting, like this:
with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x))))
# [1] 1 1 1 1 1 1 4 4 4 4 4 4 3 3 3 3 3 3
Once you have that, subsetting is easy:
mydf[with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x)))) <= 3, ]
# Ctry year Carx
# 1 A 2000 23
# 2 A 2001 18
# 3 A 2002 20
# 4 A 2003 NA
# 5 A 2004 24
# 6 A 2005 18
# 13 C 2000 NA
# 14 C 2001 NA
# 15 C 2002 24
# 16 C 2003 21
# 17 C 2004 NA
# 18 C 2005 24
You can use the by() function to group by Ctry and count the NAs in each group:
DF <- read.csv(
text='Ctry,year,Carx
A,2000,23
A,2001,18
A,2002,20
A,2003,NA
A,2004,24
A,2005,18
B,2000,NA
B,2001,NA
B,2002,NA
B,2003,NA
B,2004,18
B,2005,16
C,2000,NA
C,2001,NA
C,2002,24
C,2003,21
C,2004,NA
C,2005,24',
stringsAsFactors=FALSE)
res <- by(data=DF$Carx,INDICES=DF$Ctry,FUN=function(x)sum(is.na(x)))
validCtry <- names(res)[res <= 3]
DF[DF$Ctry %in% validCtry, ]
# Ctry year Carx
#1 A 2000 23
#2 A 2001 18
#3 A 2002 20
#4 A 2003 NA
#5 A 2004 24
#6 A 2005 18
#13 C 2000 NA
#14 C 2001 NA
#15 C 2002 24
#16 C 2003 21
#17 C 2004 NA
#18 C 2005 24
EDIT :
if you have more columns to check, you could adapt the previous code as follows:
res <- by(data=DF,INDICES=DF$Ctry,
FUN=function(x){
return(sum(is.na(x$Carx)) <= 3 &&
sum(is.na(x$Barx)) <= 3 &&
sum(is.na(x$Tarx)) <= 3)
})
validCtry <- names(res)[res]
DF[DF$Ctry %in% validCtry, ]
where, of course, you may change the condition in FUN according to your needs.
Since you mention that your data is "very huge" (whatever that means exactly), you could try a solution with dplyr and see whether it is faster than the base R solutions. If the other solutions are fast enough, just ignore this one.
library(dplyr)
# %>% replaces the old %.% operator, which is no longer available in dplyr
newdf <- df %>% group_by(Ctry) %>% filter(sum(is.na(Carx)) <= 3)
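If the data is large, data.table may also be worth trying. The following is only a sketch, not taken from the answers above: it rebuilds the example data and keeps a country's rows only when its NA count in Carx is at most 3, returning nothing for the other groups.

```r
library(data.table)

DT <- data.table(
  Ctry = rep(c("A", "B", "C"), each = 6),
  year = rep(2000:2005, times = 3),
  Carx = c(23, 18, 20, NA, 24, 18,
           NA, NA, NA, NA, 18, 16,
           NA, NA, 24, 21, NA, 24)
)

# For each country, return its rows (.SD) only if it has at most 3 NAs in Carx;
# groups failing the condition return NULL and are dropped
kept <- DT[, if (sum(is.na(Carx)) <= 3) .SD, by = Ctry]
kept
```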
