subsetting !is.na for multiple conditions unexpected results - r

I am trying to remove rows from a data frame when NA appears in both of two specific columns.
Example dataframe
tmp <- data.frame(state = c(1, 1, 2, 2, 3, 3, 4, 5),
reg = c(NA, 3, 6, NA, 9, 1, NA, 7),
gas = c(NA, 5, NA, 9, 1, 3, NA, 1),
other = c(1, 2, 4, 2, 6, 8, 1, 1) )
From the table you can see that there are two rows where both "reg" and "gas" are NA:
table(tmp$reg, tmp$gas, useNA = 'always')
        1 3 5 9 <NA>
  1     0 1 0 0    0
  3     0 0 1 0    0
  6     0 0 0 0    1
  7     1 0 0 0    0
  9     1 0 0 0    0
  <NA>  0 0 0 1    2
I would like to remove these rows but retain the other NA values.
I tried this code:
tmp[!is.na(tmp$reg & tmp$gas), ]
but it removes every row that has an NA in either reg or gas, not only the rows where both are NA:
state reg gas other
2 1 3 5 2
5 3 9 1 6
6 3 1 3 8
8 5 7 1 1
This is the result that I am looking for:
state reg gas other
2 1 3 5 2
3 2 6 NA 4
4 2 NA 9 2
5 3 9 1 6
6 3 1 3 8
8 5 7 1 1
I also tried
tmp[which(!is.na(tmp$reg & tmp$gas)), ]
but that produces the same unwanted result.

I don't know why the initial approach didn't work, but I guess there is some fault in the chaining that I cannot see. Taking the opposite approach (removing the rows that fulfil the condition) seems to produce the desired output.
tmp <- data.frame(state = c(1, 1, 2, 2, 3, 3, 4, 5),
reg = c(NA, 3, 6, NA, 9, 1, NA, 7),
gas = c(NA, 5, NA, 9, 1, 3, NA, 1),
other = c(1, 2, 4, 2, 6, 8, 1, 1) )
res = tmp[-which(is.na(tmp$reg) & is.na(tmp$gas)),]
res
#> state reg gas other
#> 2 1 3 5 2
#> 3 2 6 NA 4
#> 4 2 NA 9 2
#> 5 3 9 1 6
#> 6 3 1 3 8
#> 8 5 7 1 1
Created on 2020-12-24 by the reprex package (v0.3.0)
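A short sketch of why the question's attempt drops too many rows, assuming the same tmp data as above. In tmp$reg & tmp$gas the & is evaluated before is.na() sees anything, and in R NA & TRUE is NA (nonzero numbers count as TRUE), so the combined vector is NA whenever either column is NA. Testing each column separately fixes that. Indexing with the logical vector directly also sidesteps a pitfall of -which(): when nothing matches, -which(...) is an empty index and returns zero rows. The names either_na, both_na, res2 and no_match are only illustrative.

```r
tmp <- data.frame(state = c(1, 1, 2, 2, 3, 3, 4, 5),
                  reg   = c(NA, 3, 6, NA, 9, 1, NA, 7),
                  gas   = c(NA, 5, NA, 9, 1, 3, NA, 1),
                  other = c(1, 2, 4, 2, 6, 8, 1, 1))

# The original attempt: `&` runs first, so an NA in either column gives NA
either_na <- is.na(tmp$reg & tmp$gas)

# Test each column on its own, then combine the two logical vectors
both_na <- is.na(tmp$reg) & is.na(tmp$gas)

# Logical indexing: keep every row that is not NA in both columns
res2 <- tmp[!both_na, ]
res2

# Pitfall of -which(): no match means an empty index, hence zero rows
no_match <- tmp[-which(tmp$state > 99), ]
nrow(no_match)
```

With this data res2 keeps rows 2, 3, 4, 5, 6 and 8, the same rows as the -which() version above.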

Related

Delete rows with na in certain columns, only if dummy = 0

I want to implement the following rule:
If df$dummy == 0, delete all rows with NA values in columns 2:5.
I tried
df[df$dummy==0] <- na.omit(df[2:5],)
but it does not work properly.
Can anyone help me?
It's always better to include a little reproducible example, otherwise the folks here who answer your question will need to do it for you.
Suppose your data frame looks like this:
df <- data.frame(dummy = c(0, 1, 1, 0, 0, 0, 1, 1, 0, 0),
col2 = c(1, NA, 3, 4, 5, NA, 7, 8, 9, 10),
col3 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA),
col4 = c(NA, 2, 3, 4, 5, 6, 7, 8, 9, 10),
col5 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
df
#> dummy col2 col3 col4 col5
#> 1 0 1 1 NA 1
#> 2 1 NA 2 2 2
#> 3 1 3 3 3 3
#> 4 0 4 4 4 4
#> 5 0 5 5 5 5
#> 6 0 NA 6 6 6
#> 7 1 7 7 7 7
#> 8 1 8 8 8 8
#> 9 0 9 9 9 9
#> 10 0 10 NA 10 10
Then you can filter out the rows where dummy == 0 AND any of columns 2:5 contains an NA by doing:
df[-which(df$dummy == 0 & apply(df[2:5], 1, anyNA)), ]
#> dummy col2 col3 col4 col5
#> 2 1 NA 2 2 2
#> 3 1 3 3 3 3
#> 4 0 4 4 4 4
#> 5 0 5 5 5 5
#> 7 1 7 7 7 7
#> 8 1 8 8 8 8
#> 9 0 9 9 9 9
You will see that the only NA that remains occurs in a row where dummy == 1, as expected.
Created on 2021-11-12 by the reprex package (v2.0.0)
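The same filter can also be written with a plain logical index, which does not depend on which() finding at least one match. This is a minimal base R sketch, assuming the df defined above. rowSums(is.na(...)) > 0 is used in place of apply(..., 1, anyNA), and the names has_na, to_drop and kept are illustrative.

```r
df <- data.frame(dummy = c(0, 1, 1, 0, 0, 0, 1, 1, 0, 0),
                 col2 = c(1, NA, 3, 4, 5, NA, 7, 8, 9, 10),
                 col3 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA),
                 col4 = c(NA, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                 col5 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

# TRUE for rows with at least one NA in columns 2:5
has_na <- rowSums(is.na(df[2:5])) > 0

# Rows to remove: dummy is 0 and the row has an NA
to_drop <- df$dummy == 0 & has_na

# Keep everything else; works even if to_drop is all FALSE
kept <- df[!to_drop, ]
kept
```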
Does this work?
Data:
df <- data.frame(
dummy = c(1,0,0,1,0),
c1 = c(NA, NA, 2, 3, 1),
c2 = c(NA, NA, NA, 1, 4)
)
Solution:
library(dplyr)
df %>%
filter(
!(dummy == 0 & if_any(starts_with("c"), is.na)))
dummy c1 c2
1 1 NA NA
2 1 3 1
3 0 1 4

Modify variables in longitudinal data sets (keep first appearance of values on person-level)

I have a dataframe:
i <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
t <- c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)
x <- c(0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1)
y <- c(5, 6, 7, 8, 4, 5, 6, 7, 6, 7, 8, 8)
j1 <- c(NA, NA, NA, NA, NA, 5, NA, 7, NA, NA, 8, 8)
dat <- data.frame(i, t, x, y, j1)
dat
i t x y j1
1 1 1 0 5 NA
2 1 2 0 6 NA
3 1 3 0 7 NA
4 1 4 0 8 NA
5 2 1 0 4 NA
6 2 2 1 5 5
7 2 3 0 6 NA
8 2 4 1 7 7
9 3 1 0 6 NA
10 3 2 0 7 NA
11 3 3 1 8 8
12 3 4 1 8 8
The dataframe refers to 3 persons "i" at 4 points in time "t". "j1" switches to "y" when "x" turns from 0 to 1 for a person "i". While "x" stays on 1 for a person, "j1" does not change within time (see person 3). When "x" is 0, "j1" is always NA.
Now I want to add a new variable "j2" to the dataframe which is a modification of "j1". The difference should be the following: For each person "i", there should be only one value for "j2". Namely, it should be the first value for "j1" for each person (the first change from 0 to 1 in "x").
Accordingly, the result should look like this:
dat
i t x y j1 j2
1 1 1 0 5 NA NA
2 1 2 0 6 NA NA
3 1 3 0 7 NA NA
4 1 4 0 8 NA NA
5 2 1 0 4 NA NA
6 2 2 1 5 5 5
7 2 3 0 6 NA NA
8 2 4 1 7 7 NA
9 3 1 0 6 NA NA
10 3 2 0 7 NA NA
11 3 3 1 8 8 8
12 3 4 1 8 8 NA
I appreciate suggestions on how to address this with dplyr
Somewhat more concise than the others:
library(tidyverse)
dat <- structure(list(i = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3), t = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4), x = c(0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1), y = c(5, 6, 7, 8, 4, 5, 6, 7, 6, 7, 8, 8), j1 = c(NA, NA, NA, NA, NA, 5, NA, 7, NA, NA, 8, 8)), class = "data.frame", row.names = c(NA, -12L))
dat %>%
group_by(i) %>%
mutate(j2 = ifelse(1:n() == which(x == 1)[1], y, NA)) %>%
ungroup()
#> # A tibble: 12 × 6
#> i t x y j1 j2
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 0 5 NA NA
#> 2 1 2 0 6 NA NA
#> 3 1 3 0 7 NA NA
#> 4 1 4 0 8 NA NA
#> 5 2 1 0 4 NA NA
#> 6 2 2 1 5 5 5
#> 7 2 3 0 6 NA NA
#> 8 2 4 1 7 7 NA
#> 9 3 1 0 6 NA NA
#> 10 3 2 0 7 NA NA
#> 11 3 3 1 8 8 8
#> 12 3 4 1 8 8 NA
possible solution
library(tidyverse)
i <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
t <- c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)
x <- c(0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1)
y <- c(5, 6, 7, 8, 4, 5, 6, 7, 6, 7, 8, 8)
j1 <- c(NA, NA, NA, NA, NA, 5, NA, 7, NA, NA, 8, 8)
df <- data.frame(i, t, x, y, j1)
tmp <- df %>%
filter(x == 1) %>%
group_by(i) %>%
slice(1) %>%
ungroup() %>%
rename(j2 = j1)
left_join(df, tmp)
#> Joining, by = c("i", "t", "x", "y")
#> i t x y j1 j2
#> 1 1 1 0 5 NA NA
#> 2 1 2 0 6 NA NA
#> 3 1 3 0 7 NA NA
#> 4 1 4 0 8 NA NA
#> 5 2 1 0 4 NA NA
#> 6 2 2 1 5 5 5
#> 7 2 3 0 6 NA NA
#> 8 2 4 1 7 7 NA
#> 9 3 1 0 6 NA NA
#> 10 3 2 0 7 NA NA
#> 11 3 3 1 8 8 8
#> 12 3 4 1 8 8 NA
Created on 2021-09-08 by the reprex package (v2.0.1)
Function f sets every element after the first non-NA value of vector x to NA. It is applied to j1 within each group defined by i.
f <- function(x){
ind <- which(!is.na(x))[1]
if(is.na(ind) || ind == length(x)) return(x)
x[(which.min(is.na(x))+1):length(x)] <- NA
x
}
dat %>%
group_by(i) %>%
mutate(j2 = f(j1)) %>%
ungroup()
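For readers without dplyr, the same per-person idea can be sketched in base R with ave(), which applies a function within each group and returns a vector aligned with the original rows. The helper keep_first is hypothetical (not from the answers above) and assumes that "first value" means the first non-NA j1 within each person.

```r
dat <- data.frame(
  i  = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  t  = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
  x  = c(0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1),
  y  = c(5, 6, 7, 8, 4, 5, 6, 7, 6, 7, 8, 8),
  j1 = c(NA, NA, NA, NA, NA, 5, NA, 7, NA, NA, 8, 8)
)

# Hypothetical helper: keep only the first non-NA element, NA elsewhere
keep_first <- function(v) {
  first <- which(!is.na(v))[1]          # NA if the group has no value at all
  out <- rep(NA_real_, length(v))
  if (!is.na(first)) out[first] <- v[first]
  out
}

# ave() runs keep_first separately for each person i
dat$j2 <- ave(dat$j1, dat$i, FUN = keep_first)
dat
```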
Option 1
You can use dplyr with mutate, take j1 and replace() with NA the values for which both the current and the previous (lag()) value are non-NA:
library(dplyr)
dat %>% group_by(i) %>%
mutate(j2=replace(j1, !is.na(j1) & !is.na(lag(j1)), NA))
Option 2
You can use replace() to set to NA every value in j1 except the first non-NA one. The negative index -which(!is.na(j1))[1] selects everything but that position.
dat %>% group_by(i) %>%
mutate(j2=replace(j1, -which(!is.na(j1))[1], NA))
Option 3
You can use purrr::accumulate() too. Call accumulate to compare consecutive (.x, .y) values from the j1 vector. If they are the same, the output will be NA.
library(dplyr)
dat %>% group_by(i) %>%
mutate(j2=purrr::accumulate(j1, ~ifelse(.x %in% .y, NA, .y)))
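Output (from Options 1 and 3; note that row 8 keeps j2 = 7, unlike the desired result, because the preceding j1 value in that group is NA)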
# A tibble: 12 x 6
# Groups: i [3]
i t x y j1 j2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 5 NA NA
2 1 2 0 6 NA NA
3 1 3 0 7 NA NA
4 1 4 0 8 NA NA
5 2 1 0 4 NA NA
6 2 2 1 5 5 5
7 2 3 0 6 NA NA
8 2 4 1 7 7 7
9 3 1 0 6 NA NA
10 3 2 0 7 NA NA
11 3 3 1 8 8 8
12 3 4 1 8 8 NA
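The difference between the two rules shows up on person 2, whose j1 values are separated by an NA. The sketch below is base R only and uses c(NA, head(j1, -1)) as a stand-in for dplyr::lag(). The names prev, lag_rule and first_rule are illustrative.

```r
# j1 for person 2 in the example data
j1 <- c(NA, 5, NA, 7)

# Base R stand-in for dplyr::lag(): shift the vector down by one position
prev <- c(NA, head(j1, -1))

# Rule A: blank a value only when the previous value is also non-NA
lag_rule <- replace(j1, !is.na(j1) & !is.na(prev), NA)

# Rule B: blank everything except the first non-NA value
first_rule <- replace(j1, -which(!is.na(j1))[1], NA)

lag_rule    # keeps both 5 and 7
first_rule  # keeps only 5
```

Rule B matches the result requested in the question. Rule A only coincides with it when the non-NA values within a person are consecutive.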

Clustering rows by group based on column value with conditions

A few days ago I opened this thread:
Clustering rows by group based on column value
In which we obtained this result:
df <- data.frame(ID = c(1,1,1,1,1,1,1,1,1,1,1, 1, 1,1,1,1,1),
Obs1 = c(1,1,0,1,0,1,1,0,1,0,0,0,1,1,1,1,1),
Control = c(0,3,3,1,12,1,1,1,36,13,1,1,2,24,2,2,48),
ClusterObs1 = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5))
With:
df <- df %>%
group_by(ID) %>%
mutate_at(vars(Obs1),
funs(ClusterObs1= with(rle(.), rep(cumsum(values == 1), lengths))))
Now I have to make some modifications:
If the value of 'Control' is higher than 12, and the current 'Obs1' value is 1 and equal to the previous 'Obs1' value, then 'DesiredResultClusterObs1' should increase by 1:
df <- data.frame(ID = c(1,1,1,1,1,1,1,1,1,1,1, 1, 1,1,1,1,1),
Obs1 = c(1,1,0,1,0,1,1,0,1,0,0,0,1,1,1,1,1),
Control = c(0,3,3,1,12,1,1,1,36,13,1,1,2,24,2,2,48),
ClusterObs1 = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5),
DesiredResultClusterObs1 = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 6, 6, 6, 7))
I tried adding an if_else condition with lag inside funs, but without success. Any ideas?
EDIT: How would this work for many columns?
This seems to work:
df %>%
mutate(DesiredResultClusterOrbs1 = with(rle(Control > 12 & Obs1 == 1 & lag(Obs1) == 1),
rep(cumsum(values == 1), lengths)) + ClusterObs1)
ID Obs1 Control ClusterObs1 DesiredResultClusterOrbs1
1 1 1 0 1 1
2 1 1 3 1 1
3 1 0 3 1 1
4 1 1 1 2 2
5 1 0 12 2 2
6 1 1 1 3 3
7 1 1 1 3 3
8 1 0 1 3 3
9 1 1 36 4 4
10 1 0 13 4 4
11 1 0 1 4 4
12 1 0 1 4 4
13 1 1 2 5 5
14 1 1 24 5 6
15 1 1 2 5 6
16 1 1 2 5 6
17 1 1 48 5 7
Basically, we use the rle+rep mechanic from your previous thread to create a cumulative vector from the TRUE/FALSE result of your conditions and add it to the existing ClusterObs1.
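To make the rle + rep step easier to follow, here is the same computation split into its parts. It is a base R sketch that assumes the vectors from the question and replaces dplyr's lag() with c(NA, head(Obs1, -1)). The names prev_obs, bump, runs, extra and result are illustrative.

```r
Obs1        <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1)
Control     <- c(0, 3, 3, 1, 12, 1, 1, 1, 36, 13, 1, 1, 2, 24, 2, 2, 48)
ClusterObs1 <- c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5)

# Previous Obs1 value (NA for the first row)
prev_obs <- c(NA, head(Obs1, -1))

# TRUE where a new cluster should start inside a run of 1s
bump <- Control > 12 & Obs1 == 1 & prev_obs == 1

# rle() compresses bump into runs; cumsum counts the TRUE runs so far,
# and rep() stretches that count back to one value per row
runs  <- rle(bump)
extra <- rep(cumsum(runs$values), runs$lengths)

result <- ClusterObs1 + extra
result
```

Because the count is taken over runs, several consecutive TRUE rows would raise the cluster number only once, which differs from a plain cumsum(bump).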
If you want to create multiple DesiredResultClusterOrbs, you can use mapply. Maybe there's a dplyr solution for this, but this is base R.
Data:
df <- data.frame(ID = c(1,1,1,1,1,1,1,1,1,1,1, 1, 1,1,1,1,1),
Obs1 = c(1,1,0,1,0,1,1,0,1,0,0,0,1,1,1,1,1),
Obs2 = rbinom(17, 1, .5),
Control = c(0,3,3,1,12,1,1,1,36,13,1,1,2,24,2,2,48),
ClusterObs1 = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5))
df <- df %>%
mutate_at(vars(Obs2),
funs(ClusterObs2= with(rle(.), rep(cumsum(values == 1), lengths))))
The loop:
newcols <- mapply(function(x, y){
with(rle(df$Control > 12 & x == 1 & lag(x) == 1),
rep(cumsum(values == 1), lengths)) + y
}, df[2:3], df[5:6])
This produces a matrix with the new columns, which you can then rename and cbind to your data:
colnames(newcols) <- paste0("DesiredResultClusterOrbs", 1:2)
cbind.data.frame(df, newcols)
ID Obs1 Obs2 Control ClusterObs1 ClusterObs2 DesiredResultClusterOrbs1 DesiredResultClusterOrbs2
1 1 1 1 0 1 1 1 1
2 1 1 1 3 1 1 1 1
3 1 0 0 3 1 1 1 1
4 1 1 0 1 2 1 2 1
5 1 0 0 12 2 1 2 1
6 1 1 0 1 3 1 3 1
7 1 1 1 1 3 2 3 2
8 1 0 0 1 3 2 3 2
9 1 1 1 36 4 3 4 3
10 1 0 1 13 4 3 4 4
11 1 0 0 1 4 3 4 4
12 1 0 1 1 4 4 4 5
13 1 1 1 2 5 4 5 5
14 1 1 0 24 5 4 6 5
15 1 1 1 2 5 5 6 6
16 1 1 1 2 5 5 6 6
17 1 1 1 48 5 5 7 7

showing count per category in every row of that category in a new column [duplicate]

I have the following data frame
g <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
m <- c(1, NA, NA, NA, 3, NA, 2, 1, 3, NA, 3, NA, NA, 4, NA, NA, NA, 2, 1, NA, 7, 3, NA, 1)
df <- data.frame(g, m)
I would like to show the number of non NA values per category of g (1 to 6) which I counted by:
> df %>% group_by(g) %>% summarise(non_na_count = sum(!is.na(m)))
# A tibble: 6 x 2
g non_na_count
<dbl> <int>
1 1. 1
2 2. 3
3 3. 2
4 4. 1
5 5. 2
6 6. 3
Now I would like to produce a new column, l, that shows the number of non-NA values per category in every row of that category, such that the result would be:
g m l
1 1 1 1
2 1 NA 1
3 1 NA 1
4 1 NA 1
5 2 3 3
6 2 NA 3
7 2 2 3
8 2 1 3
9 3 3 2
10 3 NA 2
11 3 3 2
12 3 NA 2
13 4 NA 1
14 4 4 1
15 4 NA 1
16 4 NA 1
17 5 NA 2
18 5 2 2
19 5 1 2
20 5 NA 2
21 6 7 3
22 6 3 3
23 6 NA 3
24 6 1 3
Does anyone know how this can be done? :)
Use mutate instead of summarise, so that the count is added as a column and every row is kept:
df %>%
group_by(g) %>%
mutate(non_na_count = sum(!is.na(m)))
You are almost there. What you need to do is collect the output of group by and add it back to the original df.
df_notna <- df %>% group_by(g) %>% summarise(non_na_count = sum(!is.na(m)))
total <- merge(df,df_notna,by="g")
Look at other ways to merge here: https://www.statmethods.net/management/merging.html
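A base R alternative, sketched under the assumption that the goal is exactly the column l shown in the question: ave() computes a per-group statistic and returns it aligned with the original rows, so no merge is needed and the row order is untouched.

```r
g <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
m <- c(1, NA, NA, NA, 3, NA, 2, 1, 3, NA, 3, NA, NA, 4, NA, NA, NA, 2, 1, NA, 7, 3, NA, 1)
df <- data.frame(g, m)

# 1 for a non-NA value, 0 for NA; summing within each g gives the count
df$l <- ave(as.numeric(!is.na(df$m)), df$g, FUN = sum)
df
```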

Create matrix of counts using two variables

I have two columns - a unique id column id and the day of travel day. My objective is to create a matrix of counts per id per day (and to include all days even if the count is zero)
> test
id day
1 3 3
2 4 4
3 1 4
4 2 3
5 2 5
6 2 4
7 1 1
8 5 4
9 1 1
10 3 2
11 2 2
12 4 2
13 2 4
14 2 5
15 4 5
16 3 4
17 5 3
18 3 2
19 5 5
20 3 4
21 1 3
22 2 3
23 2 5
24 5 2
25 3 2
The output should be the following, where rows represent id and columns represent day:
> output
1 2 3 4 5
1 2 0 1 1 0
2 0 1 2 2 3
3 0 3 1 2 0
4 0 1 0 1 1
5 0 1 1 1 1
I have tried the following with the reshape2 package
output <- reshape2::dcast(test, day ~ id, sum)
but it throws the following error:
Error in unique.default(x) : unique() applies only to vectors
Why does this happen and what would the right solution be in dplyr or using base R? Any tips would be appreciated.
Here is the data:
> dput(test)
structure(list(id = c(3, 4, 1, 2, 2, 2, 1, 5, 1, 3, 2, 4, 2,
2, 4, 3, 5, 3, 5, 3, 1, 2, 2, 5, 3), day = c(3, 4, 4, 3, 5, 4,
1, 4, 1, 2, 2, 2, 4, 5, 5, 4, 3, 2, 5, 4, 3, 3, 5, 2, 2)), .Names = c("id",
"day"), row.names = c(NA, -25L), class = "data.frame")
It is easier to see what's going on with character variables:
id <- c('a', 'a', 'b', 'f', 'b', 'a')
day <- c('x', 'x', 'x', 'y', 'z', 'x')
test <- data.frame(id, day)
output <- as.data.frame.matrix(table(test))
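This is the simplest way to do it: use the table() function, then convert the result to a data frame.
Another option is tapply(), counting the ids within each day and binding the per-day counts together as columns: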
ans <- tapply(test$id, test$day,
function(x) {
y <- table(x)
z <- rep(0, 5)
z[as.numeric(names(y))] <- y
z
} )
do.call("cbind", ans)
1 2 3 4 5
[1,] 2 0 1 1 0
[2,] 0 1 2 2 3
[3,] 0 3 1 2 0
[4,] 0 1 0 1 1
[5,] 0 1 1 1 1
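The question also asks for all days to appear even when a count is zero. table() only reports values that occur in the data, so one way to guarantee the full grid is to convert both columns to factors with explicit levels first. The sketch below assumes ids and days both range over 1:5, as in the example. The names counts, output, test_no_day1 and output2 are illustrative.

```r
test <- data.frame(
  id  = c(3, 4, 1, 2, 2, 2, 1, 5, 1, 3, 2, 4, 2,
          2, 4, 3, 5, 3, 5, 3, 1, 2, 2, 5, 3),
  day = c(3, 4, 4, 3, 5, 4, 1, 4, 1, 2, 2, 2, 4,
          5, 5, 4, 3, 2, 5, 4, 3, 3, 5, 2, 2)
)

# Explicit levels force a row/column for every id and day, observed or not
counts <- table(id  = factor(test$id,  levels = 1:5),
                day = factor(test$day, levels = 1:5))
output <- as.data.frame.matrix(counts)
output

# Even if no trip happened on day 1, the column is kept and filled with 0
test_no_day1 <- test[test$day != 1, ]
output2 <- as.data.frame.matrix(
  table(id  = factor(test_no_day1$id,  levels = 1:5),
        day = factor(test_no_day1$day, levels = 1:5))
)
output2
```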
