R Removing row if three or more values are NA

I feel like I should be able to do this with filter or subset, but can't figure out how.
How do I remove a row if three or more of the cells in that row are "NA"?
So in this dataset, rows with titles 1A-C2 and 3A-C2 would be removed.
my_data <- data.frame(Title = c("1A-C2", "1D-T2", "1F-T1", "1E-C2", "3A-C2", "3F-T2"),
                      Group1 = c(NA, 10, 2, 9, NA, 4),
                      Group2 = c(1, 3, 6, 1, NA, 3),
                      Group3 = c(NA, 3, 3, 8, NA, 4),
                      Group4 = c(NA, NA, 4, 5, 1, 7),
                      Group5 = c(1, 4, 3, 3, 9, NA),
                      Group6 = c(NA, 4, 5, 6, 1, NA))
Thank you!!

With Base R,
my_data[rowSums(is.na(my_data))<3,]
gives,
  Title Group1 Group2 Group3 Group4 Group5 Group6
2 1D-T2     10      3      3     NA      4      4
3 1F-T1      2      6      3      4      3      5
4 1E-C2      9      1      8      5      3      6
6 3F-T2      4      3      4      7     NA     NA
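
To see what the one-liner is doing, it can help to split it into steps. This is only a sketch: it rebuilds the data from the question, and `na_per_row`, `keep` and `result` are illustrative names that do not appear in the original answer. Like the one-liner, it counts NAs across all columns, including Title.

```r
my_data <- data.frame(
  Title  = c("1A-C2", "1D-T2", "1F-T1", "1E-C2", "3A-C2", "3F-T2"),
  Group1 = c(NA, 10, 2, 9, NA, 4),
  Group2 = c(1, 3, 6, 1, NA, 3),
  Group3 = c(NA, 3, 3, 8, NA, 4),
  Group4 = c(NA, NA, 4, 5, 1, 7),
  Group5 = c(1, 4, 3, 3, 9, NA),
  Group6 = c(NA, 4, 5, 6, 1, NA)
)

# is.na() on a data frame returns a logical matrix of the same shape,
# so rowSums() counts the TRUE (i.e. NA) cells in each row
na_per_row <- rowSums(is.na(my_data))

# TRUE for rows with fewer than three NA cells
keep <- na_per_row < 3

# Logical row indexing keeps only the rows flagged TRUE
result <- my_data[keep, ]
result
```

If the threshold should ignore the Title column, the same steps can be applied to a subset of columns instead of the whole data frame.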

Using dplyr:
library(dplyr)
my_data %>%
  rowwise() %>%
  filter(sum(is.na(c_across(starts_with('Group')))) < 3)
#  Title Group1 Group2 Group3 Group4 Group5 Group6
#  <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#1 1D-T2     10      3      3     NA      4      4
#2 1F-T1      2      6      3      4      3      5
#3 1E-C2      9      1      8      5      3      6
#4 3F-T2      4      3      4      7     NA     NA

In base R, we can use Reduce with is.na
subset(my_data, Reduce(`+`, lapply(my_data[startsWith(names(my_data), "Group")],
                                   is.na)) < 3)
#  Title Group1 Group2 Group3 Group4 Group5 Group6
#2 1D-T2     10      3      3     NA      4      4
#3 1F-T1      2      6      3      4      3      5
#4 1E-C2      9      1      8      5      3      6
#6 3F-T2      4      3      4      7     NA     NA

Related

Is there a way to use the lead function to figure out the first row that meets a condition?

Hi I have a dataframe as such,
df= structure(list(a = c(1, 3, 4, 6, 3, 2, 5, 1), b = c(1, 3, 4,
2, 6, 7, 2, 6), c = c(6, 3, 6, 5, 3, 6, 5, 3), d = c(6, 2, 4,
5, 3, 7, 2, 6), e = c(1, 2, 4, 5, 6, 7, 6, 3), f = c(2, 3, 4,
2, 2, 7, 5, 2)), .Names = c("a", "b", "c", "d", "e", "f"), row.names = c(NA,
8L), class = "data.frame")
df$total = apply ( df, 1,sum )
df$row = seq ( 1, nrow ( df ))
so the dataframe looks like this.
> df
a b c d e f total row
1 1 1 6 6 1 2 17 1
2 3 3 3 2 2 3 16 2
3 4 4 6 4 4 4 26 3
4 6 2 5 5 5 2 25 4
5 3 6 3 3 6 2 23 5
6 2 7 6 7 7 7 36 6
7 5 2 5 2 6 5 25 7
8 1 6 3 6 3 2 21 8
What I want to do is find, for each row, the first later row whose total is greater than the current row's total. For example, row 1 has a total of 17, and the nearest later row with a total >= 17 is row 3.
I could loop through each row, but it gets really messy. Is this possible?
Thanks in advance.

We can do this in 2 steps with dplyr. First we set grouping to rowwise, which applies the operation on each row (basically it makes it work like we were doing an apply loop through the rows), then we find all the rows where total is larger than that row's total. Then we drop those that come before the current row and pick the first (which is the next one):
library(dplyr)
df %>%
rowwise() %>%
mutate(nxt = list(which(.$total > total)),
nxt = nxt[nxt > row][1])
# A tibble: 8 × 9
# Rowwise:
a b c d e f total row nxt
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 1 1 6 6 1 2 17 1 3
2 3 3 3 2 2 3 16 2 3
3 4 4 6 4 4 4 26 3 6
4 6 2 5 5 5 2 25 4 6
5 3 6 3 3 6 2 23 5 6
6 2 7 6 7 7 7 36 6 NA
7 5 2 5 2 6 5 25 7 NA
8 1 6 3 6 3 2 21 8 NA
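
For comparison, the same idea can be sketched in base R without rowwise(). This is only an illustration: it uses the totals from the question directly, and `totals` and `next_greater` are names made up for the example. It uses a strict "greater than" comparison, as the dplyr code above does.

```r
# Row totals taken from the question's data frame
totals <- c(17, 16, 26, 25, 23, 36, 25, 21)

# For each position i, find the first later position with a larger total
next_greater <- vapply(seq_along(totals), function(i) {
  later <- which(totals > totals[i])  # every row with a larger total
  later <- later[later > i]           # drop rows at or before the current one
  if (length(later) > 0) later[1] else NA_integer_
}, integer(1))

next_greater
# 3 3 6 6 6 NA NA NA
```

Changing `>` to `>=` inside which() would also match later rows with an equal total.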

Adding numbers based on a condition

my_df <- tibble(
b1 = c(2, 6, 3, 6, 4, 2, 1, 9, NA),
b2 = c(NA, 4, 6, 2, 6, 6, 1, 1, 7),
b3 = c(5, 9, 8, NA, 2, 3, 9, 5, NA),
b4 = c(NA, 6, NA, 10, 12, 8, 3, 6, 2),
b5 = c(2, 12, 1, 7, 8, 5, 5, 6, NA),
b6 = c(9, 2, 4, 6, 7, 6, 6, 7, 9),
b7 = c(1, 3, 7, 7, 4, 2, 2, 9, 5),
b8 = c(NA, 8, 4, 5, 1, 4, 1, 3, 6),
b9 = c(4, 5, 7, 9, 5, 1, 1, 2, NA),
b10 = c(14, 2, 4, 2, 1, 1, 1, 1, 5))
Hi Guys,
Hope you are all good. I have a df like this (a very big one), and I want to tell R to add 10 to the values in b1 if there is a 2 in any of b6, b7, b8 or b9.
Thanks once again in anticipation.

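We can create a logical condition in case_when by comparing columns b6:b9 with 2 and taking the row sums: if a row has at least one 2 in any of those columns, add 10 to b1, otherwise return b1 unchanged. (cur_data() is deprecated as of dplyr 1.1.0; there, pick(b6:b9) can replace select(cur_data(), b6:b9).)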
library(dplyr)
my_df <- my_df %>%
mutate(b1 = case_when(rowSums(select(cur_data(), b6:b9) == 2,
na.rm = TRUE) > 0 ~ b1 + 10, TRUE ~ b1))
-output
my_df
# A tibble: 9 x 10
b1 b2 b3 b4 b5 b6 b7 b8 b9 b10
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 NA 5 NA 2 9 1 NA 4 14
2 16 4 9 6 12 2 3 8 5 2
3 3 6 8 NA 1 4 7 4 7 4
4 6 2 NA 10 7 6 7 5 9 2
5 4 6 2 12 8 7 4 1 5 1
6 12 6 3 8 5 6 2 4 1 1
7 11 1 9 3 5 6 2 1 1 1
8 19 1 5 6 6 7 9 3 2 1
9 NA 7 NA 2 NA 9 5 6 NA 5
Or may also use if_any
my_df %>%
mutate(b1 = case_when(if_any(b6:b9, `%in%`, 2) ~ b1 + 10, TRUE ~ b1))
-output
# A tibble: 9 x 10
b1 b2 b3 b4 b5 b6 b7 b8 b9 b10
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 NA 5 NA 2 9 1 NA 4 14
2 16 4 9 6 12 2 3 8 5 2
3 3 6 8 NA 1 4 7 4 7 4
4 6 2 NA 10 7 6 7 5 9 2
5 4 6 2 12 8 7 4 1 5 1
6 12 6 3 8 5 6 2 4 1 1
7 11 1 9 3 5 6 2 1 1 1
8 19 1 5 6 6 7 9 3 2 1
9 NA 7 NA 2 NA 9 5 6 NA 5
Or the same in base R
i1 <- rowSums(my_df[6:9] == 2, na.rm = TRUE) > 0
my_df$b1[i1] <- my_df$b1[i1] + 10
Or with Reduce/lapply and %in%
i1 <- Reduce(`|`, lapply(my_df[6:9], `%in%`, 2))
my_df$b1[i1] <- my_df$b1[i1] + 10
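
The `==` versions above need `na.rm = TRUE`, while the `%in%` versions do not. A small sketch of why, using a made-up vector `x` (not part of the question's data):

```r
x <- c(9, 1, NA, 2)

eq  <- x == 2     # comparison with NA yields NA
inn <- x %in% 2   # %in% never returns NA

eq
# FALSE FALSE    NA  TRUE
inn
# FALSE FALSE FALSE  TRUE

# With no match present, the NA propagates through any() unless removed
any(c(9, NA) == 2)                # NA
any(c(9, NA) == 2, na.rm = TRUE)  # FALSE
any(c(9, NA) %in% 2)              # FALSE
```

This is why a row such as row 1 of my_df, which has an NA in b8 and no 2, is left unchanged by either approach.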

You can also use the following solution:
library(dplyr)
library(purrr)
my_df %>%
pmap_df(~ {x <- c(...)[6:9];
y <- c(...)[1]
if(any(2 %in% x[!is.na(x)])) {
y + 10
} else {
y
}
}) %>%
bind_cols(my_df[-1])
# A tibble: 9 x 10
b1 b2 b3 b4 b5 b6 b7 b8 b9 b10
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 NA 5 NA 2 9 1 NA 4 14
2 16 4 9 6 12 2 3 8 5 2
3 3 6 8 NA 1 4 7 4 7 4
4 6 2 NA 10 7 6 7 5 9 2
5 4 6 2 12 8 7 4 1 5 1
6 12 6 3 8 5 6 2 4 1 1
7 11 1 9 3 5 6 2 1 1 1
8 19 1 5 6 6 7 9 3 2 1
9 NA 7 NA 2 NA 9 5 6 NA 5
Or we can use this, thanks to a great suggestion by @akrun:
my_df %>%
mutate(b1 = ifelse(pmap_lgl(select(cur_data(), b6:b9), ~ 2 %in% c(...)), b1 + 10, b1))

Like your previous question, you can also use rowwise() here
my_df %>% rowwise() %>%
  mutate(b1 = ifelse(any(c_across(b6:b9) == 2, na.rm = TRUE), b1 + 10, b1))
# A tibble: 9 x 10
# Rowwise:
b1 b2 b3 b4 b5 b6 b7 b8 b9 b10
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 NA 5 NA 2 9 1 NA 4 14
2 16 4 9 6 12 2 3 8 5 2
3 3 6 8 NA 1 4 7 4 7 4
4 6 2 NA 10 7 6 7 5 9 2
5 4 6 2 12 8 7 4 1 5 1
6 12 6 3 8 5 6 2 4 1 1
7 11 1 9 3 5 6 2 1 1 1
8 19 1 5 6 6 7 9 3 2 1
9 NA 7 NA 2 NA 9 5 6 NA 5

Sum over a range with start and end point variables in R

I have a dataframe with the following variables:
start_point end_point variable_X
1 5 0.3757
2 7 0.4546
3 7 0.1245
4 8 0.3455
5 11 0.2399
6 12 0.0434
7 15 0.4323
... ... ...
I would like to add a fourth column that sums variable X from the start point to the end points defined in the first two columns, i.e. the entry in the first row would be the sum between 1 and 5 (inclusive): 0.3757+0.4546+0.1245+0.3455+0.2399 = 1.5402, the entry in second row would be sum between 2 and 7 (inclusive): 0.4546+0.1245+0.3455+0.2399+0.0434+0.4323 = 1.6402 and so forth.
I'm new to R, any help would be greatly appreciated.

There are probably slicker ways to do this, but here's a quick version:
df$sumX <- apply(df, 1, function(x) sum(df$variable_X[x[1]:x[2]]))
df
  start_point end_point variable_X   sumX
1           1         5     0.3757 1.5402
2           2         7     0.4546 1.6402
3           3         7     0.1245 1.1856
4           4         8     0.3455     NA
5           5        11     0.2399     NA
6           6        12     0.0434     NA
7           7        15     0.4323     NA
The last few rows are NA here because I don't have rows 8 through 15 of your data.
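
One of the "slicker ways" could be a cumulative sum, which avoids re-summing each range. This is a sketch under the assumption that start_point and end_point are row positions, as in the question; `cs` is an illustrative name, and only the seven rows shown in the question are used.

```r
df <- data.frame(
  start_point = 1:7,
  end_point   = c(5, 7, 7, 8, 11, 12, 15),
  variable_X  = c(0.3757, 0.4546, 0.1245, 0.3455, 0.2399, 0.0434, 0.4323)
)

# cs[k + 1] holds the sum of the first k values; cs[1] is 0
cs <- c(0, cumsum(df$variable_X))

# Sum of rows start..end (inclusive) = cs[end + 1] - cs[start].
# Indexing past the end of cs yields NA, so ranges that run beyond
# the available rows come out as NA, as in the apply() version.
df$sumX <- cs[df$end_point + 1] - cs[df$start_point]
df
```

On a long data frame this does one pass over variable_X instead of one sum per row.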

A solution with dplyr, using another reproducible example to address the situation with NA's in end_point as in the OP's comment (with ifelse):
# Reproducible example
mydf = data.frame(start_point = 1:9,
end_point = c(5, NA, 7, 8, 11, 12, 7, 15, NA),
variable_X = c(1, 5, 2, 3, 5, 4, 2, 1, 2))
library(dplyr)
mydf %>% rowwise() %>%
mutate(sumX = ifelse(is.na(end_point), NA, sum(mydf$variable_X[start_point:end_point])))
# start_point end_point variable_X sumX
# <int> <dbl> <dbl> <dbl>
# 1 1 5 1 16
# 2 2 NA 5 NA
# 3 3 7 2 16
# 4 4 8 3 15
# 5 5 11 5 NA
# 6 6 12 4 NA
# 7 7 7 2 2
# 8 8 15 1 NA
# 9 9 NA 2 NA

eliminating categories with a certain number of non-NA values in R

I have a data frame df which looks like this
> g <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
> m <- c(1, NA, NA, NA, 3, NA, 2, 1, 3, NA, 3, NA, NA, 4, NA, NA, NA, 2, 1, NA, 7, 3, NA, 1)
> df <- data.frame(g, m)
where g is the category (1 to 6) and m are values in that category.
I've managed to find the number of non-NA values per category with:
aggregate(m ~ g, data=df, function(x) {sum(!is.na(x))}, na.action = NULL)
g m
1 1 1
2 2 3
3 3 2
4 4 1
5 5 2
6 6 3
and would now like to eliminate the rows (categories) where the number of non-NA values is 1, and keep only those where the number of non-NA values is 2 or more.
the desired outcome would be
g m
5 2 3
6 2 NA
7 2 2
8 2 1
9 3 3
10 3 NA
11 3 3
12 3 NA
17 5 NA
18 5 2
19 5 1
20 5 NA
21 6 7
22 6 3
23 6 NA
24 6 1
Every row with g=1 or g=4 is eliminated because, as shown, there is only one non-NA value in each of those categories.
Any suggestions? :)

If you want base R, then I suggest you use your aggregation:
df2 <- aggregate(m ~ g, data=df, function(x) {sum(!is.na(x))}, na.action = NULL)
df[ ! df$g %in% df2$g[df2$m < 2], ]
# g m
# 5 2 3
# 6 2 NA
# 7 2 2
# 8 2 1
# 9 3 3
# 10 3 NA
# 11 3 3
# 12 3 NA
# 17 5 NA
# 18 5 2
# 19 5 1
# 20 5 NA
# 21 6 7
# 22 6 3
# 23 6 NA
# 24 6 1
If you want to use dplyr, perhaps
library(dplyr)
group_by(df, g) %>%
filter(sum(!is.na(m)) > 1) %>%
ungroup()
# # A tibble: 16 × 2
# g m
# <dbl> <dbl>
# 1 2 3
# 2 2 NA
# 3 2 2
# 4 2 1
# 5 3 3
# 6 3 NA
# 7 3 3
# 8 3 NA
# 9 5 NA
# 10 5 2
# 11 5 1
# 12 5 NA
# 13 6 7
# 14 6 3
# 15 6 NA
# 16 6 1

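A dplyr-based solution is also possible. Grouping by g, dropping the NA values, and keeping only groups with at least two remaining rows gives the non-NA count for each retained category (note that this returns one summary row per category, not the original rows):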
library(dplyr)
df %>% group_by(g) %>%
filter(!is.na(m)) %>%
filter(n() >=2) %>%
summarise(count = n())
#Result
# # A tibble: 4 x 2
# g count
# <dbl> <int>
# 1 2.00 3
# 2 3.00 2
# 3 5.00 2
# 4 6.00 3

showing count per category in every row of that category in a new column [duplicate]

This question already has answers here:
How to get group-level statistics while preserving the original dataframe?
(3 answers)
Closed 4 years ago.
I have the following data frame
g <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
m <- c(1, NA, NA, NA, 3, NA, 2, 1, 3, NA, 3, NA, NA, 4, NA, NA, NA, 2, 1, NA, 7, 3, NA, 1)
df <- data.frame(g, m)
I would like to show the number of non NA values per category of g (1 to 6) which I counted by:
> df %>% group_by(g) %>% summarise(non_na_count = sum(!is.na(m)))
# A tibble: 6 x 2
g non_na_count
<dbl> <int>
1 1. 1
2 2. 3
3 3. 2
4 4. 1
5 5. 2
6 6. 3
Now I would like to produce a new column, l, that shows the number of non-NA values per category in every row, such that the result would be:
g m l
1 1 1 1
2 1 NA 1
3 1 NA 1
4 1 NA 1
5 2 3 3
6 2 NA 3
7 2 2 3
8 2 1 3
9 3 3 2
10 3 NA 2
11 3 3 2
12 3 NA 2
13 4 NA 1
14 4 4 1
15 4 NA 1
16 4 NA 1
17 5 NA 2
18 5 2 2
19 5 1 2
20 5 NA 2
21 6 7 3
22 6 3 3
23 6 NA 3
24 6 1 3
anyone know how this can be done :)?

Use mutate instead of summarise: it adds the per-group count as a new column while keeping every row.
df %>%
group_by(g) %>%
mutate(non_na_count = sum(!is.na(m)))
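
A base R sketch of the same idea uses ave(), which applies a function within each group and returns a vector as long as the input. The column name `l` follows the question; the rest is only an illustration, not part of the original answers.

```r
g <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
m <- c(1, NA, NA, NA, 3, NA, 2, 1, 3, NA, 3, NA, NA, 4, NA, NA, NA, 2, 1, NA, 7, 3, NA, 1)
df <- data.frame(g, m)

# 1 for a non-NA value, 0 for NA; summing within each g gives the
# non-NA count, which ave() repeats on every row of that group
df$l <- ave(as.integer(!is.na(df$m)), df$g, FUN = sum)
df
```

The same vector could also drive a filter, for example `df[df$l > 1, ]` to keep only categories with at least two non-NA values.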

You are almost there. What you need to do is collect the output of the grouped summary and merge it back into the original df.
df_notna <- df %>% group_by(g) %>% summarise(non_na_count = sum(!is.na(m)))
total <- merge(df,df_notna,by="g")
Look at other ways to merge here: https://www.statmethods.net/management/merging.html
