Delete rows with na in certain columns, only if dummy = 0 - r

I want the following behaviour:
If df$dummy == 0, delete all rows with NA values in columns 2:5.
I tried
df[df$dummy==0] <- na.omit(df[2:5],)
but it does not work properly.
Can anyone help me?

It's always better to include a little reproducible example, otherwise the folks here who answer your question will need to do it for you.
Suppose your data frame looks like this:
df <- data.frame(dummy = c(0, 1, 1, 0, 0, 0, 1, 1, 0, 0),
col2 = c(1, NA, 3, 4, 5, NA, 7, 8, 9, 10),
col3 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA),
col4 = c(NA, 2, 3, 4, 5, 6, 7, 8, 9, 10),
col5 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
df
#> dummy col2 col3 col4 col5
#> 1 0 1 1 NA 1
#> 2 1 NA 2 2 2
#> 3 1 3 3 3 3
#> 4 0 4 4 4 4
#> 5 0 5 5 5 5
#> 6 0 NA 6 6 6
#> 7 1 7 7 7 7
#> 8 1 8 8 8 8
#> 9 0 9 9 9 9
#> 10 0 10 NA 10 10
Then you can filter out the rows where dummy == 0 AND where any of columns 2:5 contains an NA by doing:
df[-which(df$dummy == 0 & apply(df[2:5], 1, anyNA)), ]
#> dummy col2 col3 col4 col5
#> 2 1 NA 2 2 2
#> 3 1 3 3 3 3
#> 4 0 4 4 4 4
#> 5 0 5 5 5 5
#> 7 1 7 7 7 7
#> 8 1 8 8 8 8
#> 9 0 9 9 9 9
You will see that the only NA that remains occurs in a row where dummy == 1, as expected.
Created on 2021-11-12 by the reprex package (v2.0.0)
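One caveat with the -which(...) idiom: if no row matches the condition, which() returns an empty vector and df[-integer(0), ] selects zero rows instead of all of them. A minimal sketch of a variant that negates the logical condition instead (the names drop and res are only illustrative, and complete.cases() is swapped in for the apply() call):

```r
df <- data.frame(dummy = c(0, 1, 1, 0, 0, 0, 1, 1, 0, 0),
                 col2 = c(1, NA, 3, 4, 5, NA, 7, 8, 9, 10),
                 col3 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA),
                 col4 = c(NA, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                 col5 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

# TRUE for rows that should be dropped: dummy is 0 and at least one of
# columns 2:5 is NA (complete.cases() is FALSE for rows containing an NA)
drop <- df$dummy == 0 & !complete.cases(df[2:5])

# Negating the logical vector is safe even when nothing matches
res <- df[!drop, ]

# The pitfall: an empty which() result drops every row
nothing_matches <- rep(FALSE, nrow(df))
n_pitfall <- nrow(df[-which(nothing_matches), ])  # 0 rows, not 10
n_safe <- nrow(df[!nothing_matches, ])            # all 10 rows
```

On the example data this keeps rows 2, 3, 4, 5, 7, 8 and 9, the same rows as the -which() version.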

Does this work?
Data:
df <- data.frame(
dummy = c(1,0,0,1,0),
c1 = c(NA, NA, 2, 3, 1),
c2 = c(NA, NA, NA, 1, 4)
)
Solution:
library(dplyr)
df %>%
filter(
!(dummy == 0 & if_any(starts_with("c"), is.na)))
dummy c1 c2
1 1 NA NA
2 1 3 1
3 0 1 4

Related

Is there a way to use the lead function to figure out the first row that meets a condition?

Hi I have a dataframe as such,
df= structure(list(a = c(1, 3, 4, 6, 3, 2, 5, 1), b = c(1, 3, 4,
2, 6, 7, 2, 6), c = c(6, 3, 6, 5, 3, 6, 5, 3), d = c(6, 2, 4,
5, 3, 7, 2, 6), e = c(1, 2, 4, 5, 6, 7, 6, 3), f = c(2, 3, 4,
2, 2, 7, 5, 2)), .Names = c("a", "b", "c", "d", "e", "f"), row.names = c(NA,
8L), class = "data.frame")
df$total = apply ( df, 1,sum )
df$row = seq ( 1, nrow ( df ))
so the dataframe looks like this.
> df
a b c d e f total row
1 1 1 6 6 1 2 17 1
2 3 3 3 2 2 3 16 2
3 4 4 6 4 4 4 26 3
4 6 2 5 5 5 2 25 4
5 3 6 3 3 6 2 23 5
6 2 7 6 7 7 7 36 6
7 5 2 5 2 6 5 25 7
8 1 6 3 6 3 2 21 8
What I want to do is find, for each row, the first subsequent row whose total is greater than the current row's total. For example, row 1 has a total of 17, and the nearest following row with a larger total is row 3 (total 26).
I could loop through each row, but that gets really messy. Is this possible?
Thanks in advance.
We can do this in 2 steps with dplyr. First we set grouping to rowwise, which applies the operation on each row (basically it makes it work like we were doing an apply loop through the rows), then we find all the rows where total is larger than that row's total. Then we drop those that come before the current row and pick the first (which is the next one):
library(dplyr)
df %>%
rowwise() %>%
mutate(nxt = list(which(.$total > total)),
nxt = nxt[nxt > row][1])
# A tibble: 8 × 9
# Rowwise:
a b c d e f total row nxt
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 1 1 6 6 1 2 17 1 3
2 3 3 3 2 2 3 16 2 3
3 4 4 6 4 4 4 26 3 6
4 6 2 5 5 5 2 25 4 6
5 3 6 3 3 6 2 23 5 6
6 2 7 6 7 7 7 36 6 NA
7 5 2 5 2 6 5 25 7 NA
8 1 6 3 6 3 2 21 8 NA
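For comparison, the same idea can be sketched in base R without rowwise(). This is only an illustration under the assumption that "greater" means strictly greater; next_larger is a hypothetical helper name, and only the total column from the question is reproduced:

```r
# Totals from the question's data frame
df <- data.frame(total = c(17, 16, 26, 25, 23, 36, 25, 21))

# For each position k, return the index of the first later element
# that is strictly greater than total[k], or NA if there is none
next_larger <- function(total) {
  idx <- seq_along(total)
  vapply(idx, function(k) {
    later <- which(total > total[k] & idx > k)
    if (length(later) > 0) later[1] else NA_integer_
  }, integer(1))
}

df$nxt <- next_larger(df$total)
df$nxt
```

This does one full scan per row, so it is quadratic in the number of rows, which is fine for small data frames like this one.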

Modify variables in longitudinal data sets (keep first appearance of values on person-level)

I have a dataframe:
i <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
t <- c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)
x <- c(0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1)
y <- c(5, 6, 7, 8, 4, 5, 6, 7, 6, 7, 8, 8)
j1 <- c(NA, NA, NA, NA, NA, 5, NA, 7, NA, NA, 8, 8)
dat <- data.frame(i, t, x, y, j1)
dat
i t x y j1
1 1 1 0 5 NA
2 1 2 0 6 NA
3 1 3 0 7 NA
4 1 4 0 8 NA
5 2 1 0 4 NA
6 2 2 1 5 5
7 2 3 0 6 NA
8 2 4 1 7 7
9 3 1 0 6 NA
10 3 2 0 7 NA
11 3 3 1 8 8
12 3 4 1 8 8
The dataframe refers to 3 persons "i" at 4 points in time "t". "j1" takes the value of "y" when "x" switches from 0 to 1 for a person "i". While "x" stays at 1 for a person, "j1" does not change over time (see person 3). When "x" is 0, "j1" is always NA.
Now I want to add a new variable "j2" to the dataframe which is a modification of "j1". The difference should be the following: For each person "i", there should be only one value for "j2". Namely, it should be the first value for "j1" for each person (the first change from 0 to 1 in "x").
Accordingly, the result should look like this:
dat
i t x y j1 j2
1 1 1 0 5 NA NA
2 1 2 0 6 NA NA
3 1 3 0 7 NA NA
4 1 4 0 8 NA NA
5 2 1 0 4 NA NA
6 2 2 1 5 5 5
7 2 3 0 6 NA NA
8 2 4 1 7 7 NA
9 3 1 0 6 NA NA
10 3 2 0 7 NA NA
11 3 3 1 8 8 8
12 3 4 1 8 8 NA
I appreciate suggestions on how to address this with dplyr
Somewhat more concise than the others:
library(tidyverse)
dat <- structure(list(i = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3), t = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4), x = c(0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1), y = c(5, 6, 7, 8, 4, 5, 6, 7, 6, 7, 8, 8), j1 = c(NA, NA, NA, NA, NA, 5, NA, 7, NA, NA, 8, 8)), class = "data.frame", row.names = c(NA, -12L))
dat %>%
group_by(i) %>%
mutate(j2 = ifelse(1:n() == which(x == 1)[1], y, NA)) %>%
ungroup()
#> # A tibble: 12 × 6
#> i t x y j1 j2
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 0 5 NA NA
#> 2 1 2 0 6 NA NA
#> 3 1 3 0 7 NA NA
#> 4 1 4 0 8 NA NA
#> 5 2 1 0 4 NA NA
#> 6 2 2 1 5 5 5
#> 7 2 3 0 6 NA NA
#> 8 2 4 1 7 7 NA
#> 9 3 1 0 6 NA NA
#> 10 3 2 0 7 NA NA
#> 11 3 3 1 8 8 8
#> 12 3 4 1 8 8 NA
Possible solution:
library(tidyverse)
i <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
t <- c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)
x <- c(0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1)
y <- c(5, 6, 7, 8, 4, 5, 6, 7, 6, 7, 8, 8)
j1 <- c(NA, NA, NA, NA, NA, 5, NA, 7, NA, NA, 8, 8)
df <- data.frame(i, t, x, y, j1)
tmp <- df %>%
filter(x == 1) %>%
group_by(i) %>%
slice(1) %>%
ungroup() %>%
rename(j2 = j1)
left_join(df, tmp)
#> Joining, by = c("i", "t", "x", "y")
#> i t x y j1 j2
#> 1 1 1 0 5 NA NA
#> 2 1 2 0 6 NA NA
#> 3 1 3 0 7 NA NA
#> 4 1 4 0 8 NA NA
#> 5 2 1 0 4 NA NA
#> 6 2 2 1 5 5 5
#> 7 2 3 0 6 NA NA
#> 8 2 4 1 7 7 NA
#> 9 3 1 0 6 NA NA
#> 10 3 2 0 7 NA NA
#> 11 3 3 1 8 8 8
#> 12 3 4 1 8 8 NA
Created on 2021-09-08 by the reprex package (v2.0.1)
Function f sets every value after the first non-NA value in vector x to NA. It is applied to j1 within each group defined by i.
f <- function(x){
ind <- which(!is.na(x))[1]
if(is.na(ind) || ind == length(x)) return(x)
x[(which.min(is.na(x))+1):length(x)] <- NA
x
}
dat %>%
group_by(i) %>%
mutate(j2 = f(j1)) %>%
ungroup()
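The same "keep only the first non-NA value per person" logic can also be sketched in base R with ave(). This is an illustrative alternative, not the approach of any of the answers here; keep_first is a hypothetical helper name:

```r
dat <- data.frame(
  i  = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  t  = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
  x  = c(0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1),
  y  = c(5, 6, 7, 8, 4, 5, 6, 7, 6, 7, 8, 8),
  j1 = c(NA, NA, NA, NA, NA, 5, NA, 7, NA, NA, 8, 8)
)

# Keep the first non-NA element of x and set every other element to NA.
# match(FALSE, is.na(x)) is the position of the first non-NA value, or NA
# if there is none; in that case the index is all NA and nothing changes.
keep_first <- function(x) {
  first <- match(FALSE, is.na(x))
  replace(x, seq_along(x) != first, NA)
}

# ave() applies keep_first separately to the j1 values of each person i
dat$j2 <- ave(dat$j1, dat$i, FUN = keep_first)
dat$j2
```

Unlike a lag-based comparison, this also blanks a later non-NA value that follows an NA, such as the 7 in row 8.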
Option1
You can use dplyr with mutate() and replace() on j1, setting to NA the values for which both the current and the previous (lag()) value are non-NA:
library(dplyr)
dat %>% group_by(i) %>%
mutate(j2=replace(j1, !is.na(j1) & !is.na(lag(j1)), NA))
Option2
You can use replace() to set to NA all values in j1 other than the first non-NA value (the negative index -which(!is.na(j1))[1] selects everything except that position).
dat %>% group_by(i) %>%
mutate(j2=replace(j1, -which(!is.na(j1))[1], NA))
Option3
You can use purrr::accumulate() too. Call accumulate() comparing consecutive (.x, .y) values from the j1 vector. If they are the same, the output will be NA.
library(dplyr)
dat %>% group_by(i) %>%
mutate(j2=purrr::accumulate(j1, ~ifelse(.x %in% .y, NA, .y)))
Output (Options 1 and 3; row 8 keeps its 7 because the value before it is NA)
# A tibble: 12 x 6
# Groups: i [3]
i t x y j1 j2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 5 NA NA
2 1 2 0 6 NA NA
3 1 3 0 7 NA NA
4 1 4 0 8 NA NA
5 2 1 0 4 NA NA
6 2 2 1 5 5 5
7 2 3 0 6 NA NA
8 2 4 1 7 7 7
9 3 1 0 6 NA NA
10 3 2 0 7 NA NA
11 3 3 1 8 8 8
12 3 4 1 8 8 NA

subsetting !is.na for multiple conditions unexpected results

I am trying to remove rows in a df when NA appear in two specific columns.
Example dataframe
tmp <- data.frame(state = c(1, 1, 2, 2, 3, 3, 4, 5),
reg = c(NA, 3, 6, NA, 9, 1, NA, 7),
gas = c(NA, 5, NA, 9, 1, 3, NA, 1),
other = c(1, 2, 4, 2, 6, 8, 1, 1) )
from the table you can see there are two rows where both "reg" and "gas" are NA
table(tmp$reg, tmp$gas, useNA = 'always')
1 3 5 9 <NA>
1 0 1 0 0 0
3 0 0 1 0 0
6 0 0 0 0 1
7 1 0 0 0 0
9 1 0 0 0 0
<NA> 0 0 0 1 2
I would like to remove these rows but retain the other NA values.
I tried this code:
tmp[!is.na(tmp$reg & tmp$gas), ]
but it removes every row with an NA in either reg or gas:
state reg gas other
2 1 3 5 2
5 3 9 1 6
6 3 1 3 8
8 5 7 1 1
This is the result that I am looking for:
state reg gas other
2 1 3 5 2
3 2 6 NA 4
4 2 NA 9 2
5 3 9 1 6
6 3 1 3 8
8 5 7 1 1
I also tried
tmp[which(!is.na(tmp$reg & tmp$gas)), ]
but that produces the same unwanted result.
I don't know why the initial approach didn't work, but I guess there is some fault in the chaining that I cannot see. Taking the opposite approach (removing the rows that fulfill the condition) seems to produce the desired output.
tmp <- data.frame(state = c(1, 1, 2, 2, 3, 3, 4, 5),
reg = c(NA, 3, 6, NA, 9, 1, NA, 7),
gas = c(NA, 5, NA, 9, 1, 3, NA, 1),
other = c(1, 2, 4, 2, 6, 8, 1, 1) )
res = tmp[-which(is.na(tmp$reg) & is.na(tmp$gas)),]
res
#> state reg gas other
#> 2 1 3 5 2
#> 3 2 6 NA 4
#> 4 2 NA 9 2
#> 5 3 9 1 6
#> 6 3 1 3 8
#> 8 5 7 1 1
Created on 2020-12-24 by the reprex package (v0.3.0)
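A likely explanation for the original attempt, offered as a sketch rather than a definitive diagnosis: tmp$reg & tmp$gas coerces both columns to logical, and NA & TRUE is NA, so is.na() flags every row where either column is NA (as long as the other value is non-zero). Testing each column separately and combining the two tests keeps the partially observed rows. The names both_na and res2 are only illustrative:

```r
tmp <- data.frame(state = c(1, 1, 2, 2, 3, 3, 4, 5),
                  reg = c(NA, 3, 6, NA, 9, 1, NA, 7),
                  gas = c(NA, 5, NA, 9, 1, 3, NA, 1),
                  other = c(1, 2, 4, 2, 6, 8, 1, 1))

# Why the original test is too broad: a single NA makes the whole
# conjunction NA when the other operand is non-zero
one_na <- is.na(NA & 5)   # TRUE

# Test each column on its own, then require both to be NA
both_na <- is.na(tmp$reg) & is.na(tmp$gas)

# Negating the logical vector avoids -which(), which would drop
# every row if no row had NA in both columns
res2 <- tmp[!both_na, ]
res2
```

Only rows 1 and 7, where both reg and gas are NA, are removed.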

Showing count per category in every row of that category in a new column [duplicate]

I have the following data frame
g <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
m <- c(1, NA, NA, NA, 3, NA, 2, 1, 3, NA, 3, NA, NA, 4, NA, NA, NA, 2, 1, NA, 7, 3, NA, 1)
df <- data.frame(g, m)
I would like to show the number of non NA values per category of g (1 to 6) which I counted by:
> df %>% group_by(g) %>% summarise(non_na_count = sum(!is.na(m)))
# A tibble: 6 x 2
g non_na_count
<dbl> <int>
1 1. 1
2 2. 3
3 3. 2
4 4. 1
5 5. 2
6 6. 3
Now I would like to produce a new column, l, that shows the number of non-NA values per category in every row, such that the result would be:
g m l
1 1 1 1
2 1 NA 1
3 1 NA 1
4 1 NA 1
5 2 3 3
6 2 NA 3
7 2 2 3
8 2 1 3
9 3 3 2
10 3 NA 2
11 3 3 2
12 3 NA 2
13 4 NA 1
14 4 4 1
15 4 NA 1
16 4 NA 1
17 5 NA 2
18 5 2 2
19 5 1 2
20 5 NA 2
21 6 7 3
22 6 3 3
23 6 NA 3
24 6 1 3
anyone know how this can be done :)?
We need mutate() instead of summarise() to create the column while keeping every row:
df %>%
group_by(g) %>%
mutate(non_na_count = sum(!is.na(m)))
You are almost there. What you need to do is collect the output of group by and add it back to the original df.
df_notna <- df %>% group_by(g) %>% summarise(non_na_count = sum(!is.na(m)))
total <- merge(df,df_notna,by="g")
Look at other ways to merge here: https://www.statmethods.net/management/merging.html
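If you prefer to stay in base R, a per-group count can be written back onto every row with ave(). A minimal sketch, assuming the column should be called l as in the question:

```r
g <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
m <- c(1, NA, NA, NA, 3, NA, 2, 1, 3, NA, 3, NA, NA, 4, NA, NA, NA, 2, 1, NA, 7, 3, NA, 1)
df <- data.frame(g, m)

# 1 where m is observed, 0 where it is NA; ave() sums these within each
# value of g and repeats the group sum on every row of that group
df$l <- ave(as.integer(!is.na(df$m)), df$g, FUN = sum)
df$l
```

This avoids the separate summarise-then-merge step because ave() returns a vector of the same length as its input.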

How to sort dataframe in descending order

I have a data.frame(v1,v2,y)
v1: 1 5 8 6 1 1 6 8
v2: 2 6 9 8 4 5 2 3
y: 1 1 2 2 3 3 4 4
and now I want it sorted by y like this:
y: 1 2 3 4 1 2 3 4
v1: 1 8 1 6 5 6 1 8
v2: 2 9 4 2 6 8 5 3
I tried:
sorted <- df[,,sort(df$y)]
but this does not work.. please help
You can try a tidyverse solution
library(tidyverse)
data.frame(y, v1, v2) %>%
group_by(y) %>%
mutate(n=1:n()) %>%
arrange(n, y) %>%
select(-n) %>%
ungroup()
# A tibble: 8 x 3
y v1 v2
<dbl> <dbl> <dbl>
1 1 1 2
2 2 8 9
3 3 1 4
4 4 6 2
5 1 5 6
6 2 6 8
7 3 1 5
8 4 8 3
data:
v1 <- c(1, 5, 8, 6, 1, 1, 6, 8)
v2<- c( 2, 6, 9, 8, 4, 5, 2, 3)
y<- c(1, 1, 2, 2, 3, 3, 4, 4 )
Idea is to add an index along y and then arrange by the index and y.
We can use ave from base R to create a sequence by 'y' group and order on it
df[order(with(df, ave(y, y, FUN = seq_along))),]
# v1 v2 y
#1 1 2 1
#3 8 9 2
#5 1 4 3
#7 6 2 4
#2 5 6 1
#4 6 8 2
#6 1 5 3
#8 8 3 4
data
df <- data.frame(v1 = c(1, 5, 8, 6, 1, 1, 6, 8),
v2 = c(2, 6, 9, 8, 4, 5, 2, 3),
y = c(1, 1, 2, 2, 3, 3, 4, 4))
You could also do alternating subset twice and rbind these together:
rbind(df[c(TRUE,FALSE),], df[c(FALSE,TRUE),])
The result:
v1 v2 y
1 1 2 1
3 8 9 2
5 1 4 3
7 6 2 4
2 5 6 1
4 6 8 2
6 1 5 3
8 8 3 4
You can use matrix() to reorder the indices of the rows:
df <- data.frame(v1 = c(1, 5, 8, 6, 1, 1, 6, 8),
v2 = c(2, 6, 9, 8, 4, 5, 2, 3),
y = c(1, 1, 2, 2, 3, 3, 4, 4))
df[c(matrix(1:nrow(df), ncol=2, byrow=TRUE)),]
# v1 v2 y
# 1 1 2 1
# 3 8 9 2
# 5 1 4 3
# 7 6 2 4
# 2 5 6 1
# 4 6 8 2
# 6 1 5 3
# 8 8 3 4
This solution relies on the order in which the elements of a matrix are stored. R, like Fortran, uses column-major order: the index of the first dimension varies fastest. In Fortran the term "leading dimension" is used for the number of values along this first dimension (for a 2-dimensional array, i.e. a matrix, it is the number of rows).
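A small sketch of what happens to the index matrix may make this concrete (idx and flat are just illustrative names):

```r
# Fill the row numbers 1..8 into a 4 x 2 matrix row by row
idx <- matrix(1:8, ncol = 2, byrow = TRUE)
idx
#      [,1] [,2]
# [1,]    1    2
# [2,]    3    4
# [3,]    5    6
# [4,]    7    8

# c() reads the matrix back column by column (column-major order),
# so the odd row numbers come first, followed by the even ones
flat <- c(idx)
flat
```

Indexing the data frame with flat therefore returns rows 1, 3, 5, 7 followed by rows 2, 4, 6, 8.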

Resources