Min and Max across multiple columns with NAs - r

For the following sample data dat, is there a way to calculate min and max while handling NAs? My input is:
dat <- read.table(text = "ID Name PM TP2 Sigma
1 Tim 1 2 3
2 Sam 0 NA 1
3 Pam 2 1 NA
4 Ali 1 0 2
NA NA NA NA NA
6 Tim 2 0 7", header = TRUE)
My required output is:
ID Name PM TP2 Sigma Min Max
1 Tim 1 2 3 1 3
2 Sam 0 NA 1 0 1
3 Pam 2 1 NA 1 2
4 Ali 1 0 2 0 2
NA NA NA NA NA NA NA
6 Tim 2 0 7 0 7
My Effort
1- I have seen similar posts, but none of them discusses the case where all entries in a row are NA, e.g., Get the min of two columns
Based on this, I have tried pmin() and pmax(), but they do not work for me.
2- Another similar question is minimum (or maximum) value of each row across multiple columns. Again, it does not need to handle NAs.
3- Lastly, this question minimum (or maximum) value of each row across multiple columns talks about NAs, but not about rows where every element is missing.
4- Also, some of the solutions require the list of columns to be included or excluded to be typed manually. My original data is quite wide, so I want an easier solution where I can specify columns by number rather than by name.
Partial Solution
I have tried the following solution but Min column ends up having Inf and the Max column ends up having -Inf.
dat$min = apply(dat[,c(3:5)], 1, min, na.rm = TRUE) # columns 3:5 are PM, TP2 and Sigma
dat$max = apply(dat[,c(3:5)], 1, max, na.rm = TRUE)
I can manually get rid of Inf by using something like:
dat$min[is.infinite(dat$min)] = NA
But I was wondering if there is a better way of achieving my desired outcome? Any advice would be greatly appreciated.
Thank you for your time.

You can use hablar's min_ and max_ functions, which return NA if all values are NA.
library(dplyr)
library(hablar)
dat %>%
rowwise() %>%
mutate(min = min_(c_across(-ID)),
max = max_(c_across(-ID)))
You can also use this with apply -
cbind(dat, t(apply(dat[-1], 1, function(x) c(min = min_(x), max = max_(x)))))
# ID PM TP2 Sigma min max
#1 1 1 2 3 1 3
#2 2 0 NA 1 0 1
#3 3 2 1 NA 1 2
#4 4 1 0 2 0 2
#5 NA NA NA NA NA NA
#6 5 2 0 7 0 7

The following solution seems to work with the transform() function:
dat <- transform(dat, min = pmin(PM, TP2, Sigma, na.rm = TRUE))
dat <- transform(dat, max = pmax(PM, TP2, Sigma, na.rm = TRUE))
Without the transform() function, the data seemed to get messed up. Also, the above command requires all column names to be written explicitly. I do not understand why a shorter version like the ones below fails:
pmin(dat[,2:4]) or
pmax(dat[,2:4])
I am posting the only solution that I could come up with, in case someone else stumbles upon a similar issue.
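On why the short form fails: pmin() and pmax() compare their arguments with one another, and a data frame passed on its own counts as a single argument, so there is nothing to compare it against. Here is a minimal sketch that unpacks the columns with do.call(), assuming the numeric columns sit at known positions. The reduced data frame and the names num_cols, row_min and row_max are illustrative only.

```r
# Sample data rebuilt from the question (Name column omitted for brevity).
dat <- data.frame(
  ID    = c(1, 2, 3, 4, NA, 6),
  PM    = c(1, 0, 2, 1, NA, 2),
  TP2   = c(2, NA, 1, 0, NA, 0),
  Sigma = c(3, 1, NA, 2, NA, 7)
)

num_cols <- 2:4  # positions of PM, TP2 and Sigma in this reduced data frame

# A data frame is a list of columns, so do.call() hands each column to
# pmin()/pmax() as its own argument; na.rm is appended as a named argument.
row_min <- do.call(pmin, c(dat[num_cols], na.rm = TRUE))
row_max <- do.call(pmax, c(dat[num_cols], na.rm = TRUE))

dat$min <- row_min
dat$max <- row_max
```

Rows that are entirely NA come back as NA rather than Inf or -Inf, and the columns are selected by number, which matters for wide data.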

I would use data.table for this task. I use rowSums to count the NAs in each row and compare that count to the total number of columns. dat.new keeps only the rows that have at least one non-NA value. Then you can use na.rm=TRUE as usual.
I hope this little code helps you.
library(data.table)
#your data
dat <- read.table(text = "ID PM TP2 Sigma
1 1 2 3
2 0 NA 1
3 2 1 NA
4 1 0 2
NA NA NA NA
5 2 0 7", header = TRUE)
#generate data.table and add id
dat <- data.table(dat)
number.cols <- dim(dat)[2] #4
dat[,id:=c(1:dim(dat)[1])]
# > dat
# ID PM TP2 Sigma id
# 1: 1 1 2 3 1
# 2: 2 0 NA 1 2
# 3: 3 2 1 NA 3
# 4: 4 1 0 2 4
# 5: NA NA NA NA 5
# 6: 5 2 0 7 6
#use new data.table to select all rows with at least one nonNA value
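dat.new <- dat[rowSums(is.na(dat))<number.cols,]
#restrict .SD to the value columns so that ID does not enter min/max
value.cols <- c("PM", "TP2", "Sigma")
dat.new[, MINv:=min(.SD, na.rm=TRUE), by=id, .SDcols=value.cols]
dat.new[, MAXv:=max(.SD, na.rm=TRUE), by=id, .SDcols=value.cols]
#if you need it merged to the old data (all.x=TRUE keeps the all-NA row)
dat <- merge(dat, dat.new[,.(id,MINv,MAXv)], by="id", all.x=TRUE)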

One way might be to use pmin and pmax with do.call:
dat$min <- do.call(pmin, c(dat[,c(3:5)], na.rm=TRUE))
dat$max <- do.call(pmax, c(dat[,c(3:5)], na.rm=TRUE))
dat
# ID Name PM TP2 Sigma min max
#1 1 Tim 1 2 3 1 3
#2 2 Sam 0 NA 1 0 1
#3 3 Pam 2 1 NA 1 2
#4 4 Ali 1 0 2 0 2
#5 NA <NA> NA NA NA NA NA
#6 6 Tim 2 0 7 0 7

Related

How to change NA into 0 based on other variable / how many times it was recorded

I am still new to R and need help. I want to change the NA value in variables x1,x2,x3 to 0 based on the value of count. Count specifies the number of observations, and the x1,x2,x3 stand for the visit to the site (or replication). The value in each 'X' variable is the number of species found. However, not all sites were visited 3 times. The variable count is telling us how many times the site was actually visited. I want to identify the actual NA and real 0 (which means no species found). I want to change the NA into 0 if the site is actually visited and keep it NA if the site is not visited. For example from the dummy data, 'zhask' site is visited 2 times, then the NA in x1 of zhask needs to be replaced with 0.
This is the dummy data:
site x1 x2 x3 count
1 miya 1 2 1 3
2 zhask NA 1 NA 2
3 balmond 3 NA 2 3
4 layla NA 1 NA 2
5 angela NA 3 NA 2
So the table needs to be changed into:
site x1 x2 x3 count
1 miya 1 2 1 3
2 zhask 0 1 NA 2
3 balmond 3 0 2 3
4 layla 0 1 NA 2
5 angela 0 3 NA 2
I've tried many things, including writing my own loop, but it is not working:
for(i in 1:nrow(df))
{
if( is.na(df$x1[i]) && (i < df$count[i]))
{df$x1[i]=0}
else
{df$x1[i]=df$x1[i]}
}
this is the script for the dummy dataframe:
x1= c(1,NA,3, NA, NA)
x2= c(2,1, NA, 1, 3)
x3 = c(1, NA, 2, NA, NA)
count=c(3,2,3,2,2)
site=c("miya", "zhask", "balmond", "layla", "angela")
df=data.frame(site,x1,x2,x3,count)
Any help will be very much appreciated!
One way would be to apply a function over all of your x columns. Here's a way to do that.
cols <- c("x1", "x2", "x3")
df[, cols] <- mapply(function(col, idx, count) {
ifelse(idx <=count & is.na(col), 0, col)
}, df[,cols], seq_along(cols), MoreArgs=list(count=df$count))
# site x1 x2 x3 count
# 1 miya 1 2 1 3
# 2 zhask 0 1 NA 2
# 3 balmond 3 0 2 3
# 4 layla 0 1 NA 2
# 5 angela 0 3 NA 2
We use mapply to iterate over the columns and the index of each column. We also pass in the count vector each time (since it's the same for all columns, it goes in the MoreArgs= parameter). Because every call returns a vector of the same length, mapply simplifies the result to a matrix with one column per input column, and we use that to replace the columns with the updated values.
If you wanted to use dplyr, that might look more like
library(dplyr)
cols <- c("x1"=1, "x2"=2, "x3"=3)
df %>%
mutate(across(starts_with("x"), ~if_else(cols[cur_column()]<=count & is.na(.x), 0, .x)))
I used the cols vector to get the index of the column which doesn't seem to be otherwise available when using across().
But a more "tidy" way to tackle this problem would be to pivot your data first to a "tidy" format. Then you can clean the data more easily and pivot back if necessary
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols=starts_with("x")) %>%
mutate(index=readr::parse_number(name)) %>%
mutate(value=if_else(index <= count & is.na(value), 0, value)) %>%
select(-index) %>%
pivot_wider(names_from=name, values_from=value)
# site count x1 x2 x3
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 miya 3 1 2 1
# 2 zhask 2 0 1 NA
# 3 balmond 3 3 0 2
# 4 layla 2 0 1 NA
# 5 angela 2 0 3 NA
Via some indexing of the columns:
vars <- c("x1","x2","x3")
df[vars][is.na(df[vars]) & (col(df[vars]) <= df$count)] <- 0
# site x1 x2 x3 count
#1 miya 1 2 1 3
#2 zhask 0 1 NA 2
#3 balmond 3 0 2 3
#4 layla 0 1 NA 2
#5 angela 0 3 NA 2
Essentially this is:
selecting the variables/columns and storing in vars
flagging the NA cells within those variables with is.na(df[vars])
col(df[vars]) returns a column number for every cell, which can be checked to see whether it is less than or equal to the df$count in each corresponding row
the values meeting both the above criteria are overwritten <- with 0
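To make the intermediate steps visible, here is a small sketch of the same indexing on a three-row subset of the dummy data. The names col_idx, visited and to_zero are introduced only for illustration.

```r
df <- data.frame(
  site  = c("miya", "zhask", "balmond"),
  x1    = c(1, NA, 3),
  x2    = c(2, 1, NA),
  x3    = c(1, NA, 2),
  count = c(3, 2, 3)
)
vars <- c("x1", "x2", "x3")

# Column position of every cell: each row of this matrix reads 1 2 3.
col_idx <- col(df[vars])

# df$count is recycled down the columns, so cell [i, j] is compared
# with count[i]: TRUE means visit j actually took place at site i.
visited <- col_idx <= df$count

# Only missing values from visits that took place are set to 0.
to_zero <- is.na(df[vars]) & visited
df[vars][to_zero] <- 0
```

For zhask (count 2) the NA in x1 becomes 0, while the NA in x3 stays NA because the third visit never happened.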
This could be yet another solution using purrr::pmap:
purrr::pmap is used for row-wise operations when applied on a data frame. It enables us to iterate over multiple arguments at the same time. So here c(...) refers to all corresponding elements of the selected variable (all except site) in each row
I think the rest of the solution is pretty clear but please let me know if I need to explain more about this.
library(dplyr)
library(purrr)
library(tidyr)
df %>%
mutate(output = pmap(df[-1], ~ {x <- head(c(...), -1)
inds <- which(is.na(x))
req <- tail(c(...), 1) - sum(!is.na(x))
x[inds[seq_len(req)]] <- 0
x})) %>%
select(site, output, count) %>%
unnest_wider(output)
# A tibble: 5 x 5
site x1 x2 x3 count
<chr> <dbl> <dbl> <dbl> <dbl>
1 miya 1 2 1 3
2 zhask 0 1 NA 2
3 balmond 3 0 2 3
4 layla 0 1 NA 2
5 angela 0 3 NA 2

create new column (with outcome min or NA) from multiple selected columns

My data has many columns and subjects, but to keep the illustration simple, let's say I have 7 subjects with 3 variables/columns called x1, x2 and x3 (values range from 1 to 3, plus NAs). In my analysis it is important that I explicitly name the columns I want to use, since I cannot just use the whole data frame (it contains more variables/columns).
data <- data.frame(id=c(1,2,3,4,5,6,7), x1=c(1,2,2,NA,3,3,1), x2=c(NA,3,1,NA,2,3,2), x3=c(NA,2,NA,NA,3,NA,1))
id x1 x2 x3
1 1 NA NA
2 2 3 2
3 2 1 NA
4 NA NA NA
5 3 2 3
6 3 3 NA
7 1 2 1
The class of x1 x2 and x3 are numeric.
Out of that, I want to create a variable/column called ‘x4’ that:
- gives me the lowest number in each row across x1, x2 and x3.
- If there is an NA in a row of x1, x2, x3, the NA shall be ignored.
- If they are however ALL NAs, I would want the outcome to be NA. (NOT Inf, which is what my code gives now)
- If there are two lowest numbers that are the same, just display either one of them. So like this:
data <- data.frame(id=c(1,2,3,4,5,6,7), x1=c(1,2,2,NA,3,3,1), x2=c(NA,3,1,NA,2,3,2), x3=c(NA,2,NA,NA,3,NA,1), x4=c(1,2,1,NA,2,3,1))
id x1 x2 x3 x4
1 1 NA NA 1
2 2 3 2 2
3 2 1 NA 1
4 NA NA NA NA
5 3 2 3 2
6 3 3 NA 3
7 1 2 1 1
I managed to find a very similar question, and I can mostly make it work: min for each row with dataframe in R
data$x4 <- apply(data[, c("x1","x2","x3")],1, FUN=min, na.rm = TRUE)
The problem I have now is that when all values are NA (id number 4), my outcome is not NA but Inf.
Question 1: How can I make it NA instead of Inf? I can of course do that afterwards like this:
is.na(data$x4) <- sapply(data$x4, is.infinite)
But I wonder if there is a nice way to do that within the previous code?
Question 2: Rather than using apply() with FUN = min, I would also like to make it work with code like the line below. Is that possible?
data$x4 <- min(data[, c("x1","x2","x3")],1 , na.rm = TRUE)
With this, x4 is 1 in every row. I guess it just shows the lowest number (1) of the whole data? I don't understand why. I am already using ',1' but it doesn't help.
I hope somebody can help me (R and Stack Overflow newbie) out, thanks!
You are looking for the pmin function, which returns the parallel (element-wise) minima of the input vectors. Below are two approaches using pmin:
data$minIget <- do.call(pmin, c(data[,-1], na.rm = TRUE)) # Approach 1: using do.call
data %>% rowwise() %>% mutate(minIget = pmin(x1, x2, x3, na.rm = TRUE)) # Approach 2: using tidyverse.
output:
# A tibble: 7 x 5
# Rowwise:
id x1 x2 x3 minIget
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 NA NA 1
2 2 2 3 2 2
3 3 2 1 NA 1
4 4 NA NA NA NA
5 5 3 2 3 2
6 6 3 3 NA 3
7 7 1 2 1 1
You can test if all are NA before you call min like:
apply(data[, c("x1","x2","x3")], 1, function(x)
if(all(is.na(x))) NA else min(x, na.rm=TRUE))
#[1] 1 2 1 NA 2 3 1
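min(data[, c("x1","x2","x3")],1 , na.rm = TRUE) gives you the minimum of 1 and all the values in data[, c("x1","x2","x3")] taken together: min() has no margin argument, so the 1 is treated as just another value to compare.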
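A short sketch of that pooling behaviour, using the x1 to x3 columns from the question (the pooled_* names are illustrative only): whatever is passed after the data frame simply joins the pool of values, and the single result is then recycled down the new column.

```r
data <- data.frame(
  x1 = c(1, 2, 2, NA, 3, 3, 1),
  x2 = c(NA, 3, 1, NA, 2, 3, 2),
  x3 = c(NA, 2, NA, NA, 3, NA, 1)
)

# min() pools every value it receives; the second argument is not a margin.
pooled_with_1   <- min(data, 1, na.rm = TRUE)    # smallest of all cells and 1
pooled_with_100 <- min(data, 100, na.rm = TRUE)  # 100 never wins, still the smallest cell
pooled_with_neg <- min(data, -5, na.rm = TRUE)   # -5 is now the smallest value

# Assigning a single number to a column repeats it in every row.
data$x4 <- pooled_with_1
```

This is why x4 ends up identical in every row; a row-wise result needs apply() over rows or pmin() over columns.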

How do I lag a data.frame?

I'd like to lag a whole data frame in R.
In Python this is very easy to do with the pandas shift() function
(ex: df.shift(1))
However, I could not find anything in R as simple as pandas' shift().
How can I do this?
> x = data.frame(a=c(1,2,3),b=c(4,5,6))
> x
a b
1 1 4
2 2 5
3 3 6
What I want is,
> lag(x,1)
>
a b
1 NA NA
2 1 4
3 2 5
Any good idea?
Pretty simple in base R:
rbind(NA, head(x, -1))
a b
1 NA NA
2 1 4
3 2 5
head with -1 drops the final row and rbind with NA as the first argument adds a row of NAs.
You can also use row indexing [, like this
x[c(NA, 1:(nrow(x)-1)),]
a b
NA NA NA
1 1 4
2 2 5
This leaves an NA in the row name of the first row. To "fix" this, you can strip the data.frame class and then reassign it:
data.frame(unclass(x[c(NA, 1:(nrow(x)-1)),]))
a b
1 NA NA
2 1 4
3 2 5
Here, you can use rep to produce the desired lags
data.frame(unclass(x[c(rep(NA, 2), 1:(nrow(x)-2)),]))
a b
1 NA NA
2 NA NA
3 1 4
and even put this into a function
myLag <- function(dat, lag) data.frame(unclass(dat[c(rep(NA, lag), 1:(nrow(dat)-lag)),]))
Give it a try
myLag(x, 2)
a b
1 NA NA
2 NA NA
3 1 4
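A hedged variant of the same row-indexing idea, for the cases where the lag is 0 or at least as large as the number of rows (where 1:(nrow(dat)-lag) would count backwards). The name shift_rows is hypothetical, and the sketch assumes a non-negative whole-number lag.

```r
# Hypothetical helper: shift all rows of a data frame down by k positions.
shift_rows <- function(dat, k = 1L) {
  n <- nrow(dat)
  k <- min(k, n)                                  # shifting past the end leaves only NA rows
  idx <- c(rep(NA_integer_, k), seq_len(n - k))   # seq_len(0) is empty, unlike 1:0
  out <- dat[idx, , drop = FALSE]                 # an NA row index yields a row of NAs
  rownames(out) <- NULL                           # replace the "NA" row names with 1..n
  out
}

x <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
shifted1 <- shift_rows(x, 1)
shifted5 <- shift_rows(x, 5)
```

Resetting the row names directly is used here in place of the unclass() step; both give a data frame with default row names.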
library(dplyr)
x %>% mutate_all(lag)
a b
1 NA NA
2 1 4
3 2 5
Just for completeness this would be analogous to how zoo implements it (but for a data.frame since the zoo lag(...) method doesn't work on data.frame objects):
lag.df <- function(x, lag) {
if (lag < 0)
rbind(NA, head(x, lag))
else
rbind(tail(x, -lag), NA)
}
and use like this:
x <- data.frame(dt=c(as.Date('2019-01-01'), as.Date('2019-01-02'), as.Date('2019-01-03')), a=c(1,2,3),b=c(4,5,6))
lag.df(x, -1)
lag.df(x, 1)
or you can just use zoo:
library(zoo)
x <- data.frame(dt=c(as.Date('2019-01-01'), as.Date('2019-01-02'), as.Date('2019-01-03')), a=c(1,2,3),b=c(4,5,6))
x.zoo <- read.zoo(x)
lag(x.zoo, -1)
lag(x.zoo, 1)

Conditional calculation of means of different columns in data.table with R

The question linked below the table discussed how to calculate the means and medians of vector t for each value of vector y (from 1 to 4) where x=1 and z=1, using the aggregate function in R.
x y z t
1 1 1 10
1 0 1 15
2 NA 1 14
2 3 0 15
2 2 1 17
2 1 NA 19
3 4 2 18
3 0 2 NA
3 2 2 45
4 3 2 NA
4 1 3 59
5 0 3 0
5 4 3 45
5 4 4 74
5 1 4 86
Multiple aggregation in R with 4 parameters
But how can I calculate (mean(y)+mean(z))/(mean(z)-mean(t)) for each value (from 1 to 5) of vector x? The calculation should skip values that are 0 or NA in any vector. For example, in vector y the 3rd value is 0, so the 3rd number in every vector (y,z,t) should not be used. And in the result the third row (for x=3) should be NA.
Here is the code for calculating the means of y, z and t; the formula (mean(y)+mean(z))/(mean(z)-mean(t)) still needs to be added:
data <- data.table(dataframe)
bar <- data[,.N,by=x]
foo <- data[ ,list(mean.y =mean(y, na.rm = T),
mean.z=mean(z, na.rm = T),
mean.t=mean(t,na.rm = T)),
by=x]
This code uses all rows to calculate the means, but for (mean(y)+mean(z))/(mean(z)-mean(t)), any row where y, z or t is zero or NA should not be used.
Update:
Oh, this can be further simplified, as data.table treats NA in i as FALSE when subsetting (with exactly such cases in mind, similar to base::subset), so those rows are dropped. So, you just have to do:
dt[y != 0 & z != 0 & t != 0,
list(ans = (mean(y) + mean(z))/(mean(z) - mean(t))), by = x]
FWIW, here's how I'd do it in data.table:
dt[(y | NA) & (z | NA) & (t | NA),
list(ans=(mean(y)+mean(z))/(mean(z)-mean(t))), by=x]
# x ans
# 1: 1 -0.22222222
# 2: 2 -0.18750000
# 3: 3 -0.16949153
# 4: 4 -0.07142857
# 5: 5 -0.10309278
Let's break it down with the general syntax: dt[i, j, by]:
In i, we filter out for your conditions using a nice little hack TRUE | NA = TRUE and FALSE | NA = NA and NA | NA = NA (you can test these out in your R session).
Since you say you need only the non-zero non-NA values, it's just a matter of |ing each column with NA - which'll return TRUE only for your condition. That settles the subset by condition part.
Then for each group in by, we aggregate according to your function, in j, to get the result.
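The logical identities behind the hack can be checked in plain R. The sketch below uses a made-up vector y and the illustrative name keep. In base R an NA in a logical index produces an NA element instead of dropping it, so which() is used here to get the same effect that data.table gives in i.

```r
TRUE | NA    # TRUE
FALSE | NA   # NA
NA | NA      # NA

y <- c(1, 0, NA, 3)

# Numbers are coerced to logical first: non-zero -> TRUE, 0 -> FALSE.
keep <- y | NA           # TRUE NA NA TRUE

# which() keeps only the positions that are TRUE, dropping NA and FALSE.
y_used <- y[which(keep)]
```

Only the non-zero, non-NA entries survive, which is exactly the filter the question asks for.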
HTH
Here's one solution:
# create your sample data frame
df <- read.table(text = " x y z t
1 1 1 10
1 0 1 15
2 NA 1 14
2 3 0 15
2 2 1 17
2 1 NA 19
3 4 2 18
3 0 2 NA
3 2 2 45
4 3 2 NA
4 1 3 59
5 0 3 0
5 4 3 45
5 4 4 74
5 1 4 86", header = TRUE)
library('dplyr')
dfmeans <- df %>%
filter(!is.na(y) & !is.na(z) & !is.na(t)) %>% # remove rows with NAs
filter(y != 0 & z != 0 & t != 0) %>% # remove rows with zeroes
group_by(x) %>%
summarize(xmeans = (mean(y) + mean(z)) / (mean(z) - mean(t)))
I'm sure there is a simpler way to remove the rows with NAs and zeroes, but it's not coming to me. Anyway, dfmeans looks like this:
# x xmeans
# 1 1 -0.22222222
# 2 2 -0.18750000
# 3 3 -0.16949153
# 4 4 -0.07142857
# 5 5 -0.10309278
And if you just want the values from xmeans use dfmeans$xmeans.

R - Comparing values in a column and creating a new column with the results of this comparison. Is there a better way than looping?

I'm a beginner in R. Although I have read a lot in manuals and here on this board, I have to ask my first question. It is similar to the question here, but not quite the same, and I don't understand the explanation there. I have a data frame with hundreds of thousands of rows and 30 columns. But for my question I created a simpler data frame that you can use:
a <- sample(c(1,3,5,9), 20, replace = TRUE)
b <- sample(c(1,NA), 20, replace = TRUE)
df <- data.frame(a,b)
Now I want to compare the values of the last column (here column b): for each row, I want to check whether its value is the same as the value in the next row. If it is the same, I want to write 0 in a new column in the same row; otherwise the new column should be 1.
Here is my code, which is not working because every row of the new column contains 0:
m<-c()
for (i in seq(along=df[,1])){
ifelse(df$b[i] == df$b[i+1],m <- 0, m <- 1)
df$mov <- m
}
The result I want looks like the example below. What's the mistake? And is there a better way than looping? Looping could be very slow for my big dataset.
a b mov
1 9 NA 0
2 1 NA 1
3 1 1 1
4 5 NA 0
5 1 NA 0
6 3 NA 0
7 3 NA 1
8 5 1 0
9 1 1 0
10 3 1 0
11 1 1 0
12 9 1 0
13 1 1 1
14 5 NA 0
15 9 NA 0
16 9 NA 0
17 9 NA 0
18 5 NA 0
19 3 NA 0
20 1 NA 0
Thank you for your help!
There are a couple things to consider in your example.
First, to avoid a loop, you can create a copy of the vector that is shifted by one position. (There are about 20 ways to do this.) Then when you test vector B vs C it will do element-by-element comparison of each position vs its neighbor.
Second, equality comparisons don't work with NA -- they always return NA. So NA == NA is not TRUE it is NA! Again, there are about 20 ways to get around this, but here I have just replaced all the NAs in the temporary vector with a placeholder that will work for the tests of equality.
Finally, you have to decide what you want to do with the last value (which doesn't have a neighbor). Here I have put 1, which is your assignment for "doesn't match its neighbor".
So, depending on the range of values possible in b, you could do
c = df$b
z = length(c)
c[is.na(c)] = 'x' # replace NA with a placeholder that allows an equality test
df$mov = c(1 * !(c[1:(z-1)] == c[2:z]), 1) # append 1 for the last value, which has no neighbor
You could do something like this to mark the ones which match
df$bnext <- c(tail(df$b,-1),NA)
df$bnextsame <- ifelse(df$bnext == df$b | (is.na(df$b) & is.na(df$bnext)),0,1)
This still leaves an NA wherever exactly one of the two values is NA, because any comparison with NA returns NA and not TRUE/FALSE. Those rows do differ from their neighbor, so you could add df[is.na(df$bnextsame),"bnextsame"] <- 1 to fix that.
You can use a "rolling equality test" with zoo's rollapply. Also, identical is preferred to == when NAs are involved.
#identical(NA, NA)
#[1] TRUE
#NA == NA
#[1] NA
library(zoo)
df$mov <- c(rollapply(df$b, width = 2,
FUN = function(x) as.numeric(!identical(x[1], x[2]))), "no_comparison")
#`!` because you want `0` as `TRUE` ;
#I added a "no_comparison" to last value as it is not compared with any one
df
# a b mov
#1 5 1 0
#2 1 1 0
#3 9 1 1
#4 5 NA 1
#5 9 1 1
#.....
#19 1 NA 0
#20 1 NA no_comparison
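For a large data frame, the same comparison can also be written without any packages or placeholder values. This is a minimal vectorised sketch, assuming that two NAs count as "the same" and that the last row, which has no neighbor, should be NA. The name changed_from_next is hypothetical.

```r
# Hypothetical helper: 0 if a value equals the next one, 1 if it differs,
# NA for the last element (it has nothing to be compared with).
changed_from_next <- function(v) {
  n <- length(v)
  if (n < 2) return(rep(NA_integer_, n))
  cur <- v[-n]   # every element except the last
  nxt <- v[-1]   # every element except the first
  one_missing <- xor(is.na(cur), is.na(nxt))               # exactly one side is NA
  both_differ <- !is.na(cur) & !is.na(nxt) & cur != nxt    # FALSE & NA is FALSE, so no NA leaks out
  c(as.integer(one_missing | both_differ), NA_integer_)
}

b <- c(1, 1, NA, NA, 1)
mov <- changed_from_next(b)   # 0 1 0 1 NA
```

On the question's data this would be used as df$mov <- changed_from_next(df$b); the column stays an integer column rather than becoming character.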
