How do you select a max of one column and not NA's in another column in R? - r

I'm looking for a way in R to select, for each Year, the row with max(col1) among the rows where col2 is not NA.
Example data frame named df1:
#df1
Year col1 col2
2016 4 NA # has NA
2016 2 NA # has NA
2016 1 3 # this is the max for 2016
2017 3 NA
2017 2 3 # this is the max for 2017
2017 1 3
2018 2 4 # this is the max for 2018
2018 1 NA
I would like the new dataset to only return
Year col1 col2
2016 1 3
2017 2 3
2018 2 4
If anyone can help, it would be much appreciated.

In base R
out <- na.omit(df1)
merge(aggregate(col1 ~ Year, out, max), out) # thanks to Rui
# Year col1 col2
#1 2016 1 3
#2 2017 2 3
#3 2018 2 4
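To run the snippet above, df1 has to exist first. The following is a minimal sketch that rebuilds the example data from the question and reaches the same result with order() and duplicated() instead of merge(); the names out and res are only illustrative.

```r
# Rebuild the example data from the question
df1 <- data.frame(
  Year = c(2016, 2016, 2016, 2017, 2017, 2017, 2018, 2018),
  col1 = c(4, 2, 1, 3, 2, 1, 2, 1),
  col2 = c(NA, NA, 3, NA, 3, 3, 4, NA)
)

# Keep only rows where col2 is present
out <- df1[!is.na(df1$col2), ]

# Sort so that the largest col1 comes first within each Year,
# then keep the first row of every Year
out <- out[order(out$Year, -out$col1), ]
res <- out[!duplicated(out$Year), ]
rownames(res) <- NULL
res
```

Unlike the merge() version, this keeps exactly one row per Year even if the maximum of col1 is tied.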

Using dplyr:
library(dplyr)
df1 %>% filter(!is.na(col2)) %>%
group_by(Year) %>%
arrange(desc(col1)) %>%
slice(1)
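With dplyr 1.0.0 or later, the arrange() plus slice(1) pair can probably be replaced by slice_max(). This is a sketch under that version assumption, with df1 rebuilt from the question so that it runs on its own:

```r
library(dplyr)

# Example data from the question
df1 <- data.frame(
  Year = c(2016, 2016, 2016, 2017, 2017, 2017, 2018, 2018),
  col1 = c(4, 2, 1, 3, 2, 1, 2, 1),
  col2 = c(NA, NA, 3, NA, 3, 3, 4, NA)
)

res <- df1 %>%
  filter(!is.na(col2)) %>%                        # drop rows with missing col2
  group_by(Year) %>%
  slice_max(col1, n = 1, with_ties = FALSE) %>%   # one row per Year, largest col1
  ungroup()
res
```

Setting with_ties = FALSE mirrors slice(1), which also returns a single row per group.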
Using data.table:
library(data.table)
setDT(df1)
df1[!is.na(col2), .SD[which.max(col1)], by = Year]
This works in a fresh R session:
library(data.table)
dt = fread("Year col1 col2
2016 4 NA
2016 2 NA
2016 1 3
2017 3 NA
2017 2 3
2017 1 3
2018 2 4
2018 1 NA")
dt[!is.na(col2), .SD[which.max(col1)], by = Year]
# Year col1 col2
# 1: 2016 1 3
# 2: 2017 2 3
# 3: 2018 2 4

Related

Lagging single column in Time-Series

I am running R 4.0.3 and have no access to the internet.
I want to lag a single column of a multicolumn Time-Series. I wasn't able to find a satisfactory answer anywhere else.
Intuitively this makes sense to me, but it just doesn't work:
library(tsbox)
data=data.frame(Date=c('2005-01-01','2005-02-01','2005-03-01','2005-04-01','2005-05-01'),
col1 = c(1,2,3,4,5),
col2 = c(1,2,3,4,5))
data[,'Date']= as.POSIXct(data[,'Date'],format='%Y-%m-%d')
timeseries = ts_ts(ts_long(data))
timeseries[,'col1_L1'] = lag(timeseries[,'col1'],1)
What I get:
col1 col2 col1_L1
Jan 2005 1 1 1
Feb 2005 2 2 2
Mar 2005 3 3 3
Apr 2005 4 4 4
May 2005 5 5 5
What I would expect from this code:
col1 col2 col1_L1
Jan 2005 1 1 NA
Feb 2005 2 2 1
Mar 2005 3 3 2
Apr 2005 4 4 3
May 2005 5 5 4
I wasn't able to reproduce your example (likely due to the reasons pointed out in the comments) but perhaps you could use the function from this post, e.g.
data=data.frame(Date=c('2005-01-01','2005-02-01','2005-03-01','2005-04-01','2005-05-01'),
col1 = c(1,2,3,4,5),
col2 = c(1,2,3,4,5))
data[,'Date']= as.POSIXct(data[,'Date'],format='%Y-%m-%d')
lagpad <- function(x, k) {
if (k>0) {
return (c(rep(NA, k), x)[1 : length(x)] )
}
else {
return (c(x[(-k+1) : length(x)], rep(NA, -k)))
}
}
data$col_l1 <- lagpad(data$col2, 1)
data
#> Date col1 col2 col_l1
#> 1 2005-01-01 1 1 NA
#> 2 2005-02-01 2 2 1
#> 3 2005-03-01 3 3 2
#> 4 2005-04-01 4 4 3
#> 5 2005-05-01 5 5 4
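As background on why the original attempt leaves the numbers unchanged: stats::lag() on a ts object shifts the time index and does not move the values, so assigning the result into a matrix column copies the same values back. The sketch below uses only base R (the names x, lagged and aligned are illustrative) and shows that the NA padding only appears once the series are aligned by time with cbind().

```r
# Monthly series starting January 2005 (same shape as col1 in the question)
x <- ts(1:5, start = c(2005, 1), frequency = 12)

# A negative k moves the series later in time; the values are untouched
lagged <- stats::lag(x, -1)
as.numeric(lagged)  # same five values as x
start(lagged)       # the series now starts one month later

# cbind() on ts objects aligns by time, which produces the leading NA
aligned <- window(cbind(x, lagged), end = c(2005, 5))
aligned
```

Calling stats::lag() explicitly avoids picking up dplyr::lag() by accident when dplyr is attached.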

Counting columns with NAs after group_by

I want to count the number of columns that have an NA value after using group_by.
Similar questions have been asked, but they count total NAs rather than columns with NA (group by counting non NA).
Data:
Spes <- "Year Spec.1 Spec.2 Spec.3 Spec.4
1 2016 5 NA NA 5
2 2016 1 NA NA 6
3 2016 6 NA NA 4
4 2018 NA 5 5 9
5 2018 NA 4 7 3
6 2018 NA 5 2 1
7 2019 6 NA NA NA
8 2019 4 NA NA NA
9 2019 3 NA NA NA"
Data <- read.table(text = Spes, header = TRUE)
Data$Year <- as.factor(Data$Year)
The desired output:
2016 2
2018 1
2019 3
I have tried a few things, this is my current best attempt. I would be keen for a dplyr solution.
> Data %>%
group_by(Year) %>%
summarise_each(colSums(is.na(Data, [2:5])))
Error: Can't create call to non-callable object
I have tried variations without much luck. Many thanks
One option could be to group_by Year, check whether there are any NA values in each column, and sum those flags for each Year.
library(dplyr)
Data %>%
group_by(Year) %>%
summarise_all(~any(is.na(.))) %>%
mutate(output = rowSums(.[-1])) %>%
select(Year, output)
# A tibble: 3 x 2
# Year output
# <fct> <dbl>
#1 2016 2
#2 2018 1
#3 2019 3
Base R translation using aggregate
rowSums(aggregate(.~Year, Data, function(x)
any(is.na(x)), na.action = "na.pass")[-1], na.rm = TRUE)
#[1] 2 1 3
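A variant of the base R idea that keeps the Year labels on the result can be sketched with split() and sapply(). The data frame is rebuilt here from the question so the block runs on its own; by_year and res are illustrative names.

```r
# Example data from the question
Data <- data.frame(
  Year   = factor(rep(c(2016, 2018, 2019), each = 3)),
  Spec.1 = c(5, 1, 6, NA, NA, NA, 6, 4, 3),
  Spec.2 = c(NA, NA, NA, 5, 4, 5, NA, NA, NA),
  Spec.3 = c(NA, NA, NA, 5, 7, 2, NA, NA, NA),
  Spec.4 = c(5, 6, 4, 9, 3, 1, NA, NA, NA)
)

# One data frame per Year, without the grouping column itself
by_year <- split(Data[-1], Data$Year)

# For each Year, count the columns that contain at least one NA
res <- sapply(by_year, function(g) sum(colSums(is.na(g)) > 0))
res
```

The result is a named vector, so each count stays attached to its Year.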

data.table aggregation by one column using the maximum value of another column - R

I've got a data.table DT that I would like to aggregate by one column (year) using the maximum value of another column (month). Here's a sample of my data.table.
> DT <- data.table(month = c("2016-01", "2016-02", "2016-03", "2017-01", "2017-02", "2017-03")
, col1 = c(3,5,2,8,4,9)
, year = c(2016, 2016,2016, 2017,2017,2017))
> DT
month col1 year
1: 2016-01 3 2016
2: 2016-02 5 2016
3: 2016-03 2 2016
4: 2017-01 8 2017
5: 2017-02 4 2017
6: 2017-03 9 2017
The desired output
> ## desired output
> DT
month col1 year desired_output
1: 2016-01 3 2016 2
2: 2016-02 5 2016 2
3: 2016-03 2 2016 2
4: 2017-01 8 2017 9
5: 2017-02 4 2017 9
6: 2017-03 9 2017 9
Aggregating by the column year, the desired output should be the value of col1 for the latest month. However, the following code doesn't work: it gives a warning and returns NAs. What am I doing wrong?
> ## wrong output
> DT[, output := col1[which.max(month)], by = .(year)]
Warning messages:
1: In which.max(month) : NAs introduced by coercion
2: In which.max(month) : NAs introduced by coercion
> DT
month col1 year output
1: 2016-01 3 2016 NA
2: 2016-02 5 2016 NA
3: 2016-03 2 2016 NA
4: 2017-01 8 2017 NA
5: 2017-02 4 2017 NA
6: 2017-03 9 2017 NA
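which.max() tries to coerce the character column 'month' to numeric, which is where the NAs come from. We get the index of the max value in 'month' by converting it to the yearmon class from zoo, and use that index to pick the corresponding value from 'col1' when creating the 'desired_output' column grouped by 'year'.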
library(zoo)
library(data.table)
DT[, desired_output := col1[which.max(as.yearmon(month))], .(year)]
DT
# month col1 year desired_output
#1: 2016-01 3 2016 2
#2: 2016-02 5 2016 2
#3: 2016-03 2 2016 2
#4: 2017-01 8 2017 9
#5: 2017-02 4 2017 9
#6: 2017-03 9 2017 9
Or extract the 'month' and get the index of max value
DT[, desired_output := col1[which.max(month(as.IDate(paste0(month,
"-01"))))], .(year)]
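If adding zoo is not an option, one possible workaround relies on the fact that strings in "YYYY-MM" form sort in chronological order, so max() can be applied to the character column directly. This is a base R sketch on a plain data frame, not the data.table syntax used above; latest_month and value_at_latest are illustrative names.

```r
# Example data from the question, as a plain data frame
DT <- data.frame(
  month = c("2016-01", "2016-02", "2016-03", "2017-01", "2017-02", "2017-03"),
  col1  = c(3, 5, 2, 8, 4, 9),
  year  = c(2016, 2016, 2016, 2017, 2017, 2017)
)

# max() compares character strings, and "YYYY-MM" sorts chronologically
latest_month <- tapply(DT$month, DT$year, max)

# Look up col1 at the latest month of each year
value_at_latest <- DT$col1[match(latest_month, DT$month)]
names(value_at_latest) <- names(latest_month)

# Broadcast the per-year value back to every row
DT$desired_output <- unname(value_at_latest[as.character(DT$year)])
DT
```

This assumes each month string occurs only once in the data, as it does in the example.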

Combine data in many rows into columns

I have data like this:
year Male
1 2011 8
2 2011 1
3 2011 4
4 2012 3
5 2012 12
6 2012 9
7 2013 4
8 2013 3
9 2013 3
and I need to group the data for the year 2011 in one column, 2012 in the next column and so on.
2011 2012 2013
1 8 3 4
2 1 12 3
3 4 9 3
How do I achieve this?
One option is unstack if the number of rows per 'year' is the same
unstack(df1, Male ~ year)
One option is to use functions from dplyr and tidyr.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(year) %>%
mutate(ID = 1:n()) %>%
spread(year, Male) %>%
select(-ID)
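spread() has since been superseded in tidyr by pivot_wider(). Assuming tidyr 1.0.0 or later, the same reshaping can be sketched as follows, with the example data rebuilt so the block runs on its own (df and wide are illustrative names):

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(
  year = rep(2011:2013, each = 3),
  Male = c(8, 1, 4, 3, 12, 9, 4, 3, 3)
)

wide <- df %>%
  group_by(year) %>%
  mutate(ID = row_number()) %>%   # position of each value within its year
  ungroup() %>%
  pivot_wider(names_from = year, values_from = Male) %>%
  select(-ID)
wide
```

The ID helper column is what lets values from different years line up row by row; years with fewer observations are padded with NA.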
1
If every year has the same number of observations, you could split the data and cbind it using base R
do.call(cbind, split(df$Male, df$year))
# 2011 2012 2013
#[1,] 8 3 4
#[2,] 1 12 3
#[3,] 4 9 3
2
If the years do not all have the same number of observations, you could use rbind.fill.matrix from plyr
df[10,] = c(2015, 5) #Add only one data for the year 2015
library(plyr)
setNames(object = data.frame(t(rbind.fill.matrix(lapply(split(df$Male, df$year), t)))),
nm = unique(df$year))
# 2011 2012 2013 2015
#1 8 3 4 5
#2 1 12 3 NA
#3 4 9 3 NA
3
Yet another way is to use dcast to convert data from long to wide format
df[10,] = c(2015, 5) #Add only one data for the year 2015
library(reshape2)
dcast(df, ave(df$Male, df$year, FUN = seq_along) ~ year, value.var = "Male")[,-1]
# 2011 2012 2013 2015
#1 8 3 4 5
#2 1 12 3 NA
#3 4 9 3 NA

Drop subgroup of obs in dataframe if first observation of group is na

In R I have a dataframe df of this form:
a b year month id
1 2 2012 01 1234758
1 1 2012 02 1234758
NA 5 2011 04 1234759
5 5 2011 05 1234759
5 5 2011 06 1234759
2 2 2001 11 1234760
NA NA 2001 11 1234760
Some of the a's and b's are NAs. I wish to subset the dataframe by id, have each subset ordered by year and month, and then drop the whole subset/id if the first observation (in order of time) of either a or b is NA.
For the example above, the intended result is:
a b year month id
1 2 2012 01 1234758
1 1 2012 02 1234758
2 2 2001 11 1234760
NA NA 2001 11 1234760
I did it the non-vectorized way, which took forever to run, as follows:
df_summary <- as.data.frame(table(df$id),stringsAsFactors=FALSE)
df <- df[order(df$id,df$year,df$month),]
remove <- ""
j <- 1
l <- 0
for(i in 1:nrow(df_summary)){
m <- df_summary$Freq[i] # number of rows for this id (table() stores counts in Freq)
if( is.na(df$a[j]) | is.na(df$b[j]) ) {
l <- l + 1
remove[l] <- df_summary$Var1[i] # the id itself is stored in Var1
}
j <- j + m
}
df <- df[!(df$id %in% remove),]
What is a faster, vectorized way, to achieve the same result?
What I tried, also to double-check my code:
dt <- setDT(df)
remove_vectorized <- dt[,list(remove_first_na=(is.na(a[1]) | is.na(b[1]))),by=id]
which suggests that I should remove ALL observations, which is patently wrong.
Here are a few possible data.table approaches.
First, fixing your attempt:
library(data.table)
setDT(df)[, if(!is.na(a[1L]) & !is.na(b[1L])) .SD, by = id]
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
Or we can generalize this (probably at the expense of speed)
setDT(df)[, if(Reduce(`&`, !is.na(.SD[1L, .(a, b)]))) .SD, by = id]
## OR maybe `setDT(df)[, if(Reduce(`&`, !sapply(.SD[1L, .(a, b)], is.na))) .SD, by = id]`
## (in order to avoid matrix conversions)
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
Another way is to combine unique and na.omit methods
indx <- na.omit(unique(setDT(df), by = "id"), cols = c("a", "b")) # na.omit() for data.table selects columns via `cols`
Then, a simple subset will do
df[id %in% indx$id]
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
Or maybe a binary join?
df[indx[, .(id)], on = "id"]
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
Or
indx <- na.omit(unique(setDT(df, key = "id"), by = "id"), cols = c("a", "b")) # recent data.table versions no longer default unique() to the key
df[.(indx$id)]
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
(The last two are mainly for illustration)
For more info regarding data.table, please visit Getting Started on GH
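Note that the data.table snippets above take the first row of each id as it appears in the data, so they rely on the rows already being in time order. For comparison, here is a base R sketch that does the ordering explicitly before checking the first row of each id. The data frame is rebuilt from the question, and first_rows, keep_ids and res are illustrative names.

```r
# Example data from the question
df <- data.frame(
  a     = c(1, 1, NA, 5, 5, 2, NA),
  b     = c(2, 1, 5, 5, 5, 2, NA),
  year  = c(2012, 2012, 2011, 2011, 2011, 2001, 2001),
  month = c(1, 2, 4, 5, 6, 11, 11),
  id    = c(1234758, 1234758, 1234759, 1234759, 1234759, 1234760, 1234760)
)

# Order by id, then by time within each id
df <- df[order(df$id, df$year, df$month), ]

# First (earliest) row of every id
first_rows <- df[!duplicated(df$id), ]

# Keep only ids whose first row has neither a nor b missing
keep_ids <- first_rows$id[!is.na(first_rows$a) & !is.na(first_rows$b)]
res <- df[df$id %in% keep_ids, ]
res
```

Every step works on whole vectors, so there is no explicit loop over ids.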
