Creating a new column based on values from multiple existing columns - r

I need to create a new column named "condition" (which is not there initially) based on the first three columns. If the value comes from cond1, the condition column should be 1; if it comes from cond2, it should be 2; and so on. Any suggestions?
cond_test = read.csv("https://www.dropbox.com/s/du76g4vlfz2uaph/cond_test.csv?dl=1")
cond_test
#> ï..cond1 cond2 cond3 condition
#> 1 2 NA NA 1
#> 2 4 NA NA 1
#> 3 NA 3 NA 2
#> 4 NA 5 NA 2
#> 5 NA 4 NA 2
#> 6 NA NA 1 3
#> 7 NA NA 4 3
#> 8 NA NA 7 3

You can use max.col to get the column index of the first non-NA value in each row.
max.col(!is.na(cond_test))
#[1] 1 1 2 2 2 3 3 3
If a row has more than one non-NA value, see the ties.method argument in ?max.col for how ties are handled (the default, "random", picks one of the tied columns at random).
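As a minimal sketch of the tie handling, assume a small made-up data frame (cond_demo is hypothetical, not the CSV from the question) in which some rows have two non-NA values and one row has none. Passing ties.method = "first" makes max.col deterministic, and rows with no value at all are set to NA explicitly, because max.col still returns a column index for them:

```r
# Hypothetical stand-in data; not the file from the question
cond_demo <- data.frame(
  cond1 = c(2, 4, NA, NA, NA),
  cond2 = c(NA, NA, 3, 5, NA),
  cond3 = c(NA, 7, NA, 1, NA)
)

# Logical matrix: TRUE where a value is present
present <- !is.na(cond_demo[c("cond1", "cond2", "cond3")])

# Index of the first present column in each row (ties resolved left to right)
cond_demo$condition <- max.col(present, ties.method = "first")

# A row with no value at all has no meaningful condition
cond_demo$condition[rowSums(present) == 0] <- NA

cond_demo$condition
#> [1]  1  1  2  2 NA
```

Restricting the matrix to the three cond columns also keeps an existing condition column from being counted as a non-NA value.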
In dplyr you can use rowwise:
library(dplyr)
cond_test %>%
  rowwise() %>%
  mutate(condition = which.max(!is.na(c_across())))

I tried the following code and it works, but more elegant solutions are welcome.
cond_test$condition = ifelse(!is.na(cond_test$ï..cond1), 1,
                             ifelse(!is.na(cond_test$cond2), 2, 3))

Related

Filter data.frame with all columns NA but keep when some are NA

I want to remove rows from a data.frame where all cols are NA. But I'd like to keep rows that have some values NA.
I know how to do this with base R but I'm trying to figure out how to make it work with tidyverse. I'm trying the across operator.
library(tidyverse)
teste <- data.frame(a = c(1,NA,3, NA), b = c(NA, NA, 3, 4), c = c(1, NA, 3, 4))
teste
#> a b c
#> 1 1 NA 1
#> 2 NA NA NA
#> 3 3 3 3
#> 4 NA 4 4
# I want to remove rows where all values are NA
# that is, remove only line 2
# here I can get the lines with all values NA
teste %>%
filter(across(a:c, is.na))
#> a b c
#> 1 NA NA NA
# If I negate the filter, it does not work
# the last line (NA, 4, 4) is missing
teste %>%
filter(!across(a:c, is.na))
#> a b c
#> 1 1 NA 1
#> 2 3 3 3
# This is what I'm expecting
# a b c
# 1 NA 1
# 3 3 3
# NA 4 4
# Using base I can do this with
teste[apply(teste, 1, function(x) sum(is.na(x))) < 3,]
#> a b c
#> 1 1 NA 1
#> 3 3 3 3
#> 4 NA 4 4
How can I do this using tidyverse?
Created on 2020-08-18 by the reprex package (v0.3.0)
We can use base R
teste[rowSums(!is.na(teste)) >0,]
# a b c
#1 1 NA 1
#3 3 3 3
#4 NA 4 4
Or using apply and any
teste[apply(!is.na(teste), 1, any),]
which can be also used within filter
teste %>%
filter(rowSums(!is.na(.)) >0)
Or using c_across from dplyr, we can directly remove the rows with all NA
library(dplyr)
teste %>%
rowwise %>%
filter(!all(is.na(c_across(everything()))))
# A tibble: 3 x 3
# Rowwise:
# a b c
# <dbl> <dbl> <dbl>
#1 1 NA 1
#2 3 3 3
#3 NA 4 4
NOTE: filter_all has been superseded (since dplyr 1.0.0)
Previously in dplyr, you could use filter_all (for all columns) or filter_at (for specific columns) together with any_vars:
library(dplyr)
teste %>% filter_all(any_vars(!is.na(.)))
However, across does not have a direct replacement for any_vars, so you can combine it with Reduce:
teste %>% filter(Reduce(`|`, across(.fns = Negate(is.na))))
# a b c
#1 1 NA 1
#2 3 3 3
#3 NA 4 4
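In dplyr 1.0.4 and later, if_any() and if_all() were added to cover what any_vars and all_vars used to do. A short sketch, assuming such a version is installed (kept is just an illustrative name):

```r
library(dplyr)

teste <- data.frame(a = c(1, NA, 3, NA), b = c(NA, NA, 3, 4), c = c(1, NA, 3, 4))

# Keep a row if at least one of its columns is not NA
kept <- teste %>%
  filter(if_any(everything(), ~ !is.na(.x)))

kept
#>    a  b c
#> 1  1 NA 1
#> 2  3  3 3
#> 3 NA  4 4
```

The equivalent "drop rows where everything is NA" form would be filter(!if_all(everything(), is.na)).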
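Using base R logical indexing on the individual columns, you can produce the same outcome. Negating the logical vector directly (rather than using -which(...)) also works when no row is entirely NA.
teste2 <- teste[!(is.na(teste$a) & is.na(teste$b) & is.na(teste$c)), ]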

R First Row Value Meets Criteria

data = data.frame(STUDENT=c(1,2,3,4,5,6,7,8),
CAT=c(NA,NA,1,2,3,NA,NA,0),
DOG=c(NA,NA,2,3,2,NA,1,NA),
MOUSE=c(2,3,NA,NA,NA,NA,NA,NA),
WANT=c(2,3,2,2,3,NA,NA,NA))
I have 'data' and wish to create the 'WANT' variable, which takes the first non-NA value in each row that is not equal to 1 or 0. The code above shows an example of what I hope to get.
We can use coalesce after changing the values 0, 1 in the selected columns to NA, then bind the column with the original dataset
library(dplyr)
data %>%
transmute(across(CAT:MOUSE, ~ replace(., . %in% 0:1, NA))) %>%
transmute(WANT2 = coalesce(!!! .)) %>%
bind_cols(data, .)
# STUDENT CAT DOG MOUSE WANT WANT2
#1 1 NA NA 2 2 2
#2 2 NA NA 3 3 3
#3 3 1 2 NA 2 2
#4 4 2 3 NA 2 2
#5 5 3 2 NA 3 3
#6 6 NA NA NA NA NA
#7 7 NA 1 NA NA NA
#8 8 0 NA NA NA NA
Or using data.table with fcoalesce. Convert the 'data.frame' to a 'data.table' (setDT(data)), specify the columns of interest in .SDcols, loop over .SD to replace the values 0 and 1 with NA, apply fcoalesce, and assign (:=) the result to a new column 'WANT2'
library(data.table)
setDT(data)[, WANT2 := do.call(fcoalesce, lapply(.SD, function(x)
replace(x, x %in% 0:1, NA))), .SDcols = CAT:MOUSE]
Or with base R, we can use a vectorized option with row/column indexing to extract the first non-NA element after replacing the values 0 and 1 with NA
m1 <- !is.na(replace(data[2:4], data[2:4] == 1|data[2:4] == 0, NA))
data$WANT2 <- data[2:4][cbind(seq_len(nrow(m1)), max.col(m1, "first"))]
# rows with no valid value fall back to column 1, so blank out a leftover 0 or 1
data$WANT2[data$WANT2 %in% 0:1] <- NA
Try this:
data$Want2 <- apply(data[,-c(1,5)],1,function(x) x[which(!is.na(x) & x!=0 & x!=1)[1]])
STUDENT CAT DOG MOUSE WANT Want2
1 1 NA NA 2 2 2
2 2 NA NA 3 3 3
3 3 1 2 NA 2 2
4 4 2 3 NA 2 2
5 5 3 2 NA 3 3
6 6 NA NA NA NA NA
7 7 NA 1 NA NA NA
8 8 0 NA NA NA NA
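A variation on the base R idea, offered as a sketch rather than as any answerer's method: mask the 0 and 1 values inside a numeric matrix first, so that rows without a valid value come out as NA without a separate cleanup step. The names vals, ok, first and want are illustrative only.

```r
data <- data.frame(STUDENT = 1:8,
                   CAT   = c(NA, NA, 1, 2, 3, NA, NA, 0),
                   DOG   = c(NA, NA, 2, 3, 2, NA, 1, NA),
                   MOUSE = c(2, 3, NA, NA, NA, NA, NA, NA))

# Work on a numeric matrix of the candidate columns
vals <- as.matrix(data[c("CAT", "DOG", "MOUSE")])

# Treat 0 and 1 as missing
vals[vals %in% 0:1] <- NA

# Column index of the first remaining value in each row
ok    <- !is.na(vals)
first <- max.col(ok, ties.method = "first")

# Pick one cell per row; rows with nothing valid pick an NA cell
want <- vals[cbind(seq_len(nrow(vals)), first)]

want
#> [1]  2  3  2  2  3 NA NA NA
```

Because the masking happens in vals itself, the fallback to column 1 for all-FALSE rows always lands on an NA.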

Remove groups which do not have non-consecutive NA values in R

I have the following Data frame
group <- c(2,2,2,2,4,4,4,4,5,5,5,5)
D <- c(NA,2,NA,NA,NA,2,3,NA,NA,NA,1,1)
df <- data.frame(group, D)
df
group D
1 2 NA
2 2 2
3 2 NA
4 2 NA
5 4 NA
6 4 2
7 4 3
8 4 NA
9 5 NA
10 5 NA
11 5 1
12 5 1
I would like to keep only groups that contain non-consecutive NA values at least once. In this case group 5 would be removed because it contains only consecutive NA values. Groups 2 and 4 remain because they contain non-consecutive NA values (NA values separated by one or more rows with a non-NA value).
Therefore the resulting data frame would look like this:
df2
group D
1 2 NA
2 2 2
3 2 NA
4 2 NA
5 4 NA
6 4 2
7 4 3
8 4 NA
Any ideas? :)
How about using difference between the index of NA-values per group?
library(dplyr)
df %>% group_by(group) %>% filter(any(diff(which(is.na(D))) > 1))
## A tibble: 8 x 2
## Groups: group [2]
# group D
# <dbl> <dbl>
#1 2. NA
#2 2. 2.
#3 2. NA
#4 2. NA
#5 4. NA
#6 4. 2.
#7 4. 3.
#8 4. NA
I'm not sure this would catch all potential edge cases but it seems to work for the given example.
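To probe the edge cases, here is a base R sketch of the same test wrapped in a small helper (has_gap is a hypothetical name). Groups with zero or one NA produce an empty diff(), and any() of an empty vector is FALSE, so such groups are dropped:

```r
# TRUE when at least two NAs in d are separated by a non-NA value
has_gap <- function(d) any(diff(which(is.na(d))) > 1)

df <- data.frame(group = c(2, 2, 2, 2, 4, 4, 4, 4, 5, 5, 5, 5),
                 D     = c(NA, 2, NA, NA, NA, 2, 3, NA, NA, NA, 1, 1))

# One flag per group, then expand the flags back to row level
flags <- vapply(split(df$D, df$group), has_gap, logical(1))
df2   <- df[flags[as.character(df$group)], ]

unique(df2$group)
#> [1] 2 4

# Edge cases: no NA at all, and a single NA
has_gap(c(1, 2, 3))
#> [1] FALSE
has_gap(c(NA, 1, 2))
#> [1] FALSE
```

Whether a group with no NA, or with a single NA, should be kept is a decision the question leaves open; this sketch removes it.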

Merging rows by a group [duplicate]

I have a data set
d1 <- data.frame(GROUP=c("A","A","A","G","G","F","F","E","T"),
                 FIRST=c(10,2,3,6,NA,NA,NA,1,NA),
                 SECOND=c(3,NA,NA,1,NA,4,2,1,NA),
                 THIRD=c(5,7,NA,NA,NA,1,NA,1,1))
d1
GROUP FIRST SECOND THIRD
1 A 10 3 5
2 A 2 NA 7
3 A 3 NA NA
4 G 6 1 NA
5 G NA NA NA
6 F NA 4 1
7 F NA 2 NA
8 E 1 1 1
9 T NA NA 1
I want to combine the data using the GROUP-column in two ways:
Mean of columns inside a group
GROUP FIRST SECOND THIRD
1 A 5 3 6
2 G 6 1 NA
3 F NA 3 1
4 E 1 1 1
5 T NA NA 1
Column-wise max value inside a group
GROUP FIRST SECOND THIRD
1 A 10 3 7
2 G 6 1 NA
3 F NA 4 1
4 E 1 1 1
5 T NA NA 1
Is there a quick way to do this or should I create a new function?
We can use aggregate from base R
aggregate(.~GROUP, d1, mean, na.rm = TRUE, na.action=NULL)
Or using dplyr
library(dplyr)
d1 %>%
group_by(GROUP) %>%
summarise_each(funs(mean=mean(., na.rm = TRUE)))
Or
d1 %>%
group_by(GROUP) %>%
summarise_each(funs(max=max(., na.rm = TRUE)))
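Note that summarise_each() and funs() are deprecated in later dplyr releases, and that for a group whose values are all NA, mean(x, na.rm = TRUE) returns NaN while max(x, na.rm = TRUE) returns -Inf with a warning. A base R sketch that returns NA for such groups instead, using a small wrapper (safe is a hypothetical helper name):

```r
d1 <- data.frame(GROUP  = c("A", "A", "A", "G", "G", "F", "F", "E", "T"),
                 FIRST  = c(10, 2, 3, 6, NA, NA, NA, 1, NA),
                 SECOND = c(3, NA, NA, 1, NA, 4, 2, 1, NA),
                 THIRD  = c(5, 7, NA, NA, NA, 1, NA, 1, 1))

# Wrap a summary function so that an all-NA input yields NA
safe <- function(f) {
  function(x) if (all(is.na(x))) NA_real_ else f(x, na.rm = TRUE)
}

# The data.frame method of aggregate keeps rows that contain NA
means <- aggregate(d1[-1], by = list(GROUP = d1$GROUP), FUN = safe(mean))
maxes <- aggregate(d1[-1], by = list(GROUP = d1$GROUP), FUN = safe(max))

means
#>   GROUP FIRST SECOND THIRD
#> 1     A     5      3     6
#> 2     E     1      1     1
#> 3     F    NA      3     1
#> 4     G     6      1    NA
#> 5     T    NA     NA     1
```

The groups come back sorted by GROUP rather than in order of first appearance, which differs from the layout shown in the question.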

row numbers for explicit rows in r

I need to get row numbers for explicit rows grouped over id. Let's say dataframe (df) looks like this:
id a b
3 2 NA
3 3 2
3 10 NA
3 21 0
3 2 NA
4 1 5
4 1 0
4 5 NA
I need to create one more column that would give row number sequence excluding the case where b == 0.
desired output:
id a b row
3 2 NA 1
3 3 2 2
3 10 NA 3
3 21 0 -
3 2 NA 4
4 1 5 1
4 1 0 -
4 5 NA 2
I used dplyr but was not able to achieve this.
My code:
df <- df %>%
group_by(id) %>%
mutate(row = row_number(id[b != 0]))
Please suggest some better way to do this.
I would propose using the data.table package for its nice capability of operating on subsets, thus avoiding inefficient operations such as ifelse or evaluating the whole data set. Also, it is better to keep your vector in a numeric class (for future operations), so NA is probably preferable to - (character). Here's a possible solution:
library(data.table)
setDT(df)[is.na(b) | b != 0, row := seq_len(.N), by = id]
# id a b row
# 1: 3 2 NA 1
# 2: 3 3 2 2
# 3: 3 10 NA 3
# 4: 3 21 0 NA
# 5: 3 2 NA 4
# 6: 4 1 5 1
# 7: 4 1 0 NA
# 8: 4 5 NA 2
The idea here is to operate only on the rows where is.na(b) | b != 0, generating a sequence over each group's size (.N) and updating row in place (using :=). All remaining rows are assigned NA by default.
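For comparison, the same result can be sketched in base R without extra packages, assuming the example data from the question. A running count of the kept rows within each id gives the sequence, and the excluded rows are then set to NA (keep is an illustrative name):

```r
df <- data.frame(id = c(3, 3, 3, 3, 3, 4, 4, 4),
                 a  = c(2, 3, 10, 21, 2, 1, 1, 5),
                 b  = c(NA, 2, NA, 0, NA, 5, 0, NA))

# Rows that should be numbered: b is NA or non-zero
keep <- is.na(df$b) | df$b != 0

# Cumulative count of kept rows within each id
df$row <- ave(as.integer(keep), df$id, FUN = cumsum)

# Rows with b == 0 get no number
df$row[!keep] <- NA

df$row
#> [1]  1  2  3 NA  4  1 NA  2
```

This evaluates the whole column, so it does not have the subset-only efficiency described above; it is only meant to show the logic.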
