I need to create a new column with the sum of the values in several other columns, but with conditions.
My data is
ID <- c("A","B","C","D","E","F")
Q1 <- c(0,1,7,9,NA,3)
Q2 <- c(0,3,2,2,NA,3)
Q3 <- c(0,0,7,9,NA,3)
dta <- data.frame(ID,Q1,Q2,Q3)
I need to sum the values across the columns only if every value is < 4. If any column holds a value > 4, the result for that row should be discarded (NA). Rows that contain only NA must be preserved.
The result should look like
Result
0
4
NA
NA
NA
9
I have tried :
library(dplyr)
dta %>% filter(Q1 < 4) %>% mutate(Result = rowSums(.[2:4]))
but then all the rows with values > 4 disappear, and I was only able to filter on one column at a time. I have also tried:
dta$Result <- ifelse(c("Q1", "Q2", "Q3") < 4, rowSums(.[2:4]), NA)
but then all my results are "na"
ID <- c("A","B","C","D","E","F")
Q1 <- c(0,1,7,9,NA,3)
Q2 <- c(0,3,2,2,NA,3)
Q3 <- c(0,0,7,9,NA,3)
dta <- data.frame(ID,Q1,Q2,Q3)
You have to swap the order of sum and ifelse: apply ifelse to the values first, then sum the result.
dta %>%
rowwise() %>%
mutate(result = sum(ifelse(c(Q1, Q2, Q3)<4, c(Q1, Q2, Q3), NA)))
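To see why the order matters, here is a minimal base-R sketch on two single rows. The names row_b and row_c are made up for illustration; the values are rows B and C from the question. ifelse turns every value that fails the test into NA, and sum then propagates that NA, so one large value blanks the whole row.

```r
# Values taken from rows B and C of the question's data
row_b <- c(1, 3, 0)
row_c <- c(7, 2, 7)

# ifelse() works element by element: values failing the test become NA
masked_b <- ifelse(row_b < 4, row_b, NA)
masked_c <- ifelse(row_c < 4, row_c, NA)

# sum() returns NA as soon as one element is NA (na.rm defaults to FALSE)
sum_b <- sum(masked_b)
sum_c <- sum(masked_c)

print(sum_b)  # 4
print(sum_c)  # NA
```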
You can use the following solution:
library(dplyr)
dta %>%
rowwise() %>%
mutate(Result = ifelse(any(c_across(Q1:Q3) > 4), NA, Reduce(`+`, c_across(Q1:Q3))))
# A tibble: 6 x 5
# Rowwise:
ID Q1 Q2 Q3 Result
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 0 0 0 0
2 B 1 3 0 4
3 C 7 2 7 NA
4 D 9 2 9 NA
5 E NA NA NA NA
6 F 3 3 3 9
I have a follow-up on this question: Sum values from rows with conditions in R
Here is my data:
ID <- c("A", "B", "C", "D", "E", "F")
Q1 <- c(0, 1, 7, 9, NA, 3)
Q2 <- c(0, 3, 2, 2, NA, 3)
Q3 <- c(0, 0, 7, 9, NA, 3)
dta <- data.frame(ID, Q1, Q2, Q3)
I need to sum every value below 7. In rows that contain values of 7 or more, I need to sum the numbers below 7 and ignore the rest. Rows with all NAs should be preserved. The result should look like this:
ProxySum
0
4
2
2
NA
9
I have tried this code based on the response from the last post:
dta2 <- dta %>%
rowwise() %>%
mutate(ProxySum = ifelse(all(c_across(Q1:Q3) < 7), Reduce(`+`, c_across(Q1:Q3)), (ifelse(any(c_across(Q1:Q3) > 7), sum(.[. < 7]), NA))))
But in the rows with numbers over 7 I end up with a sum of all the rows and columns. What am I missing?
One way to do it in base:
rowSums(dta[, 2:4] * (dta[, 2:4] < 7))
# [1] 0 4 2 2 NA 9
Adding an explanation, following @tjebo's comment:
With dta[, 2:4] < 7 you produce a logical matrix in which TRUE marks the values below 7 and FALSE marks the values of 7 or more. This takes a single line because the comparison is vectorized.
Then you multiply the original values by that logical matrix. Under the hood, R coerces logicals to numbers, so every FALSE becomes 0 and every TRUE becomes 1. Each original value is therefore multiplied by 1 if it is below 7 and by 0 otherwise.
Since NA < 7 produces NA, and multiplying by NA produces NA as well, the original NAs are preserved.
The last step is to call rowSums() on the resulting data frame, which sums the values in each row. The values of 7 or more have been turned into 0s, so they drop out of the sum.
If you want a sum for every row that has at least one non-NA value, pass na.rm = TRUE to rowSums(). In that case, however, rows containing only NAs will return 0.
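The steps above can be sketched one at a time in base R. The intermediate names (vals, keep, masked, strict, lenient) are only illustrative and do not appear in the original one-liner.

```r
dta <- data.frame(ID = LETTERS[1:6],
                  Q1 = c(0, 1, 7, 9, NA, 3),
                  Q2 = c(0, 3, 2, 2, NA, 3),
                  Q3 = c(0, 0, 7, 9, NA, 3))

vals <- dta[, 2:4]

# Step 1: logical matrix, TRUE where the value is below 7 (NA stays NA)
keep <- vals < 7

# Step 2: TRUE/FALSE are coerced to 1/0, so values of 7 or more become 0
masked <- vals * keep

# Step 3: row sums; a row with any NA yields NA
strict <- rowSums(masked)

# Variant: ignore NAs, at the cost of all-NA rows returning 0
lenient <- rowSums(masked, na.rm = TRUE)

print(unname(strict))   # 0 4 2 2 NA 9
print(unname(lenient))  # 0 4 2 2 0 9
```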
Another option making use of rowSums and dplyr::across:
ID <- LETTERS[1:6]
Q1 <- c(0,1,7,9,NA,3)
Q2 <- c(0,3,2,2,NA,3)
Q3 <- c(0,0,7,9,NA,3)
dta <- data.frame(ID,Q1,Q2,Q3)
library(dplyr)
dta %>%
mutate(ProxySum = rowSums(across(Q1:Q3, function(.x) { .x[.x >= 7] <- 0; .x })))
#> ID Q1 Q2 Q3 ProxySum
#> 1 A 0 0 0 0
#> 2 B 1 3 0 4
#> 3 C 7 2 7 2
#> 4 D 9 2 9 2
#> 5 E NA NA NA NA
#> 6 F 3 3 3 9
How about a slightly different approach - first pivot longer, then sum by condition by group, then pivot back.
In this current version, rows that contain only "some" NAs will return a value other than NA. (NA will be considered as 0). If you want to return NA for those rows, change all to any.
library(tidyverse)
ID <- c("A","B","C","D","E","F")
Q1 <- c(0,1,7,9,NA,3)
Q2 <- c(0,3,2,2,NA,3)
Q3 <- c(0,0,7,9,NA,3)
dta <- data.frame(ID,Q1,Q2,Q3)
dta %>%
pivot_longer(-ID) %>%
group_by(ID) %>%
mutate(ProxySum = ifelse(all(is.na(value)), NA, sum(value[which(value<7)]))) %>%
pivot_wider()
#> # A tibble: 6 × 5
#> # Groups: ID [6]
#> ID ProxySum Q1 Q2 Q3
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A 0 0 0 0
#> 2 B 4 1 3 0
#> 3 C 2 7 2 7
#> 4 D 2 9 2 9
#> 5 E NA NA NA NA
#> 6 F 9 3 3 3
Created on 2021-12-14 by the reprex package (v2.0.1)
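The difference between all and any in the condition above only shows up for a row that mixes NAs with real values, which the example data does not contain. A small sketch with a made-up vector (value here is hypothetical, not a row from dta):

```r
# A hypothetical row with one value below 7, one NA, and one value above 7
value <- c(2, NA, 8)

# which() drops the NA comparison, so only the 2 is summed
below7 <- sum(value[which(value < 7)])

# all(): NA is returned only when every value is missing
with_all <- ifelse(all(is.na(value)), NA, below7)

# any(): NA is returned as soon as one value is missing
with_any <- ifelse(any(is.na(value)), NA, below7)

print(with_all)  # 2
print(with_any)  # NA
```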
Update: see @tjebo's comment, which points out that my first answer is identical to stefan's solution.
Here is a non-identical solution using hablar:
library(dplyr)
library(hablar)
dta %>%
rowwise() %>%
mutate(sum = sum_(across(Q1:Q3, ~case_when(.<7 ~sum_(.)))))
First answer (possibly identical to stefan's answer).
Here is another dplyr solution:
library(dplyr)
dta %>%
mutate(across(where(is.numeric), ~ifelse(.>=7,0,.)),
sum = rowSums(across(where(is.numeric))))
ID Q1 Q2 Q3 sum
1 A 0 0 0 0
2 B 1 3 0 4
3 C 0 2 0 2
4 D 0 2 0 2
5 E NA NA NA NA
6 F 3 3 3 9
For example, I have this data frame:
Id  Age
1   14
2   28
and I want to make a long column like this:
Id  new column
1   1
2   2
    14
    28
What should I do?
We may unlist the data and pad the shorter column with NA up to the maximum length:
# The column is named Id in the question, so it is referenced as df1$Id here
lst1 <- list(df1$Id, unlist(df1))
out <- data.frame(lapply(lst1, `length<-`, max(lengths(lst1))))
names(out) <- c("id", "new_column")
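A self-contained sketch of the same padding idea, assuming the input is the two-row data frame from the question (df1 is rebuilt here because the original post does not show how it was created; id_padded and stacked are illustrative names):

```r
# Assumed input, reconstructed from the table in the question
df1 <- data.frame(Id = c(1, 2), Age = c(14, 28))

# Stack all columns into one long vector: Id values first, then Age values
stacked <- unlist(df1, use.names = FALSE)

# Extending the length of a vector pads it with NA
id_padded <- df1$Id
length(id_padded) <- length(stacked)

out <- data.frame(Id = id_padded, new_column = stacked)
print(out)
```

Assigning to length() is what does the padding; lapply with `length<-` in the answer applies the same operation to every list element at once.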
Here is another approach:
df1 <- data.frame(New_column = c(df[,"Id"], df[,"Age"]))
merge(df$Id, df1, by="row.names", all=TRUE)[,-1]
Output:
x New_column
1 1 1
2 2 2
3 NA 14
4 NA 28
An approach with dplyr
library(dplyr)
df %>%
mutate(Age = Id) %>%
bind_rows(
df %>%
mutate(Id = NA)
) %>%
rename(new_column = Age)
# A tibble: 4 x 2
Id new_column
<int> <int>
1 1 1
2 2 2
3 NA 14
4 NA 28
I want to produce output that shows my df sorted by the number of NAs in each row (as in the df_rows_sorted_by_NAs column below) but that keeps the original row name/number (df col). The combination would look like column 3 below:
# df_rows_sorted_by_NAs df desired_output
# Row 1 : 38 Row 442 : 37 Row 3112 : 38
# Row 2 : 38 Row 3112 : 38 Row 3113 : 38
# Row 3 : 37 Row 3113 : 38 Row 442 : 37
# Row 18 : 30 Row 1128 : 30 Row 1128 : 30
I get the first output with this:
# Sort df by num of NAs
df_rows_sorted_by_NAs <- df[order(rowSums(is.na(df)), decreasing = TRUE), drop = FALSE, ]
# View obs with >=30 NAs
for (row_name in row.names(df_rows_sorted_by_NAs)) {
if (rowSums(is.na(df_rows_sorted_by_NAs[row_name,])) >= 30) {
cat("Row ", row_name, ": ",
rowSums(is.na(df_rows_sorted_by_NAs[row_name,])), "\n")
}
}
I get the second output with this:
for (row_name in row.names(df)) {
if (rowSums(is.na(df[row_name,])) >= 30) {
cat("Row ", row_name, ": ", rowSums(is.na(df[row_name,])), "\n")
}
}
I tried drop = FALSE for order but got the same result. Any suggestions on how to keep the row names when I create the new df?
This seems to work for me:
a <- c(1, 2, 3)
b<- c(1, NA, 3)
c <- c(NA, NA, 3)
d <- c(1, NA, NA)
e <- c(NA, 2, 3)
df <- data.frame(a, b, c, d, e)
df
df <- df[order(rowSums(is.na(df)), decreasing = TRUE),]
df
gives
a b c d e
1 1 1 NA 1 NA
2 2 NA NA NA 2
3 3 3 3 NA 3
then
a b c d e
2 2 NA NA NA 2
1 1 1 NA 1 NA
3 3 3 3 NA 3
and then
df[rowSums(is.na(df)) >1,]
a b c d e
2 2 NA NA NA 2
1 1 1 NA 1 NA
Is the actual question how do you put "Row:" in front?
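# Compute the NA counts once and apply the same filter to names and counts,
# otherwise the two vectors have different lengths and get recycled
na_count <- rowSums(is.na(df))
keep <- na_count > 1
paste0("Row ", row.names(df)[keep], ": ", na_count[keep])
This gives you a vector of strings. You can print it vertically, but that is a different question from getting the sort done.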
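Putting the sort and the labels together, here is a minimal sketch on the same toy data. The names na_count, ord, sorted and labels are illustrative; writeLines is one way to print the vector vertically.

```r
df <- data.frame(a = c(1, 2, 3),
                 b = c(1, NA, 3),
                 c = c(NA, NA, 3),
                 d = c(1, NA, NA),
                 e = c(NA, 2, 3))

# Count NAs per row once, then reuse the same ordering for rows and counts
na_count <- rowSums(is.na(df))
ord <- order(na_count, decreasing = TRUE)

# Row indexing keeps the original row names ("2", "1", "3")
sorted <- df[ord, , drop = FALSE]

labels <- paste0("Row ", row.names(sorted), ": ", na_count[ord])
writeLines(labels)
# Row 2: 3
# Row 1: 2
# Row 3: 1
```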
The tidyverse package is good for these tasks:
library(tidyverse)
An example dataframe:
df <- tribble(
~Length, ~Width, ~Mass, ~Date,
10.3, 3.1, 0.021, "2018-11-28",
NA, 3.3, NA, "2018-11-29",
10.5, NA, 0.025, "2018-11-30"
)
With package dplyr, you can create an ID column and "number of NAs" column with row_number() and rowSums. Of course, if you already have a row ID column, then you can remove ID = row_number() from mutate:
df %>%
mutate(ID = row_number(), noNAs = rowSums(is.na(.)))
... results in ...
# A tibble: 3 x 6
Length Width Mass Date ID noNAs
<dbl> <dbl> <dbl> <chr> <int> <dbl>
1 10.3 3.1 0.021 2018-11-28 1 0
2 NA 3.3 NA 2018-11-29 2 2
3 10.5 NA 0.025 2018-11-30 3 1
... adding select by ID and noNAs, arranging by noNAs (in descending order):
df <- df %>%
mutate(ID = row_number(), noNAs = rowSums(is.na(.)))%>%
select(ID, noNAs) %>%
arrange(desc(noNAs))
... results in ...
# A tibble: 3 x 2
ID noNAs
<int> <dbl>
1 2 2
2 3 1
3 1 0
Finally, if you wanted to filter for rows with more than 30 NAs, then:
df %>% filter(noNAs > 30)
I have a survey where some questions were not answered by some participants. Here is a simplified version of my data
df <- data.frame(ID = c(12:16), Q1 = c("a","b","a","a",NA),
Q2 = c("a","a",NA,"b",NA), Q3 = c(NA,"a","a","a","b"))
df
I would like to see which ID numbers did not answer which questions. The following code is very close to the output I want but identifies the subject by row number - I would like the subject identified by ID number
table(data.frame(which(is.na(df), arr.ind=TRUE)))
Right now the output shows that rows 1, 3 and 5 did not answer at least one question, and it identifies the column with the missing value. I would like it to show me the same thing but with ID numbers 12, 14 and 16. It would be a bonus to have the column names (e.g. Q1, Q2, Q3) in the output instead of the column numbers.
We can get the names of the columns that are NA in each row using apply, collapse them into a comma-separated string, and attach that to a new data frame along with its ID.
new_df <- data.frame(ID =df$ID, ques = apply(df, 1, function(x)
paste0(names(which(is.na(x))), collapse = ",")))
new_df
# ID ques
#1 12 Q3
#2 13
#3 14 Q2
#4 15
#5 16 Q1,Q2
Similar equivalent would be
new_df <- data.frame(ID = df$ID, ques = apply(is.na(df), 1, function(x)
paste0(names(which(x)), collapse = ",")))
In base R:
res <- df[!complete.cases(df),]
res[-1] <- as.numeric(is.na(res[-1]))
res
# ID Q1 Q2 Q3
# 12 12 0 0 1
# 14 14 0 1 0
# 16 16 1 1 0
If you wish to avoid apply type operations and continue from which(..., T), you can do something like the following:
tmp <- data.frame(which(is.na(df[, 2:4]), T))
# change to character
tmp[, 2] <- paste0('Q', tmp[, 2])
# gather column numbers together for each row number
tmp_split <- split(tmp[, 2], tmp[, 1])
# preallocate new column in df
df$missing <- vector('list', nrow(df))
df$missing[as.numeric(names(tmp_split))] <- tmp_split
This produces
> df
ID Q1 Q2 Q3 missing
1 12 a a <NA> Q3
2 13 b a a NULL
3 14 a <NA> a Q2
4 15 a b a NULL
5 16 <NA> <NA> b Q1, Q2
You can convert data in long format using tidyr::gather. Filter for Answer not available. Finally, you can summarise your data using toString as:
library(tidyverse)
df %>% gather(Question, Ans, -ID) %>%
filter(is.na(Ans)) %>%
group_by(ID) %>%
summarise(NotAnswered = toString(Question))
# # A tibble: 3 x 2
# ID NotAnswered
# <int> <chr>
# 1 12 Q3
# 2 14 Q2
# 3 16 Q1, Q2
If the OP wants to include all IDs in the result, the solution can be written as:
df %>% gather(Question, Ans, -ID) %>%
group_by(ID) %>%
summarise(NoAnswered = toString(Question[is.na(Ans)])) %>%
as.data.frame()
# ID NoAnswered
# 1 12 Q3
# 2 13
# 3 14 Q2
# 4 15
# 5 16 Q1, Q2
How's this with tidyverse:
data:
library(tidyverse)
df <- data.frame(ID = c(12:16), Q1 = c("a","b","a","a",NA), Q2 = c("a","a",NA,"b",NA), Q3 = c(NA,"a","a","a","b"))
code:
x <- df %>% filter(is.na(Q1) | is.na(Q2) | is.na(Q3)) # keep only rows with at least one NA
y <- cbind(x %>% select(ID),
x %>% select(Q1, Q2, Q3) %>% sapply(., function(x) ifelse(is.na(x), 1, 0))
) # in 1/0 format
output:
x:
ID Q1 Q2 Q3
1 12 a a <NA>
2 14 a <NA> a
3 16 <NA> <NA> b
y:
ID Q1 Q2 Q3
1 12 0 0 1
2 14 0 1 0
3 16 1 1 0
My attempt is no better than any already offered, but it's a fun problem, so here's mine. Because why not?:
library( magrittr )
df$ques <- df %>%
is.na() %>%
apply( 1, function(x) {
x %>%
which() %>%
names() %>%
paste0( collapse = "," )
} )
df
# ID Q1 Q2 Q3 ques
# 1 12 a a <NA> Q3
# 2 13 b a a
# 3 14 a <NA> a Q2
# 4 15 a b a
# 5 16 <NA> <NA> b Q1,Q2
Most of the answer comes from your question:
df[which(is.na(df), arr.ind=TRUE)[,1],]
# ID Q1 Q2 Q3
# 5 16 <NA> <NA> b
# 3 14 a <NA> a
# 5.1 16 <NA> <NA> b
# 1 12 a a <NA>
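Indexing with the row numbers alone repeats ID 16 and loses the column information. A possible refinement, sketched under the assumption that the survey data frame is the one from the question, is to translate both index columns returned by which(..., arr.ind = TRUE): rows into IDs and column numbers into column names. The names idx and missing are illustrative.

```r
df <- data.frame(ID = 12:16,
                 Q1 = c("a", "b", "a", "a", NA),
                 Q2 = c("a", "a", NA, "b", NA),
                 Q3 = c(NA, "a", "a", "a", "b"),
                 stringsAsFactors = FALSE)

# One row per missing cell, with "row" and "col" index columns
idx <- which(is.na(df), arr.ind = TRUE)

# Map row numbers to IDs and column numbers to column names
missing <- data.frame(ID = df$ID[idx[, "row"]],
                      question = names(df)[idx[, "col"]],
                      stringsAsFactors = FALSE)

print(missing)
print(table(missing))
```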
df <- data.frame(category=c("cat1","cat1","cat2","cat1","cat2","cat2","cat1","cat2"),
value=c(NA,2,3,4,5,NA,7,8))
I'd like to add a new column to the above dataframe which takes the cumulative mean of the value column, not taking into account NAs. Is it possible to do this with dplyr? I've tried
df <- df %>% group_by(category) %>% mutate(new_col=cummean(value))
but cummean just doesn't know what to do with NAs.
EDIT: I do not want to count NAs as 0.
You could use ifelse to treat NAs as 0 for the cummean call:
library(dplyr)
df <- data.frame(category=c("cat1","cat1","cat2","cat1","cat2","cat2","cat1","cat2"),
value=c(NA,2,3,4,5,NA,7,8))
df %>%
group_by(category) %>%
mutate(new_col = cummean(ifelse(is.na(value), 0, value)))
Output:
# A tibble: 8 x 3
# Groups: category [2]
category value new_col
<fct> <dbl> <dbl>
1 cat1 NA 0.
2 cat1 2. 1.00
3 cat2 3. 3.00
4 cat1 4. 2.00
5 cat2 5. 4.00
6 cat2 NA 2.67
7 cat1 7. 3.25
8 cat2 8. 4.00
EDIT: Now I see this isn't the same as ignoring NAs.
Try this one instead. I group by a column which specifies if the value is NA or not, meaning cummean can run without encountering any NAs:
library(dplyr)
df <- data.frame(category=c("cat1","cat1","cat2","cat1","cat2","cat2","cat1","cat2"),
value=c(NA,2,3,4,5,NA,7,8))
df %>%
group_by(category, isna = is.na(value)) %>%
mutate(new_col = ifelse(isna, NA, cummean(value)))
Output:
# A tibble: 8 x 4
# Groups: category, isna [4]
category value isna new_col
<fct> <dbl> <lgl> <dbl>
1 cat1 NA TRUE NA
2 cat1 2. FALSE 2.00
3 cat2 3. FALSE 3.00
4 cat1 4. FALSE 3.00
5 cat2 5. FALSE 4.00
6 cat2 NA TRUE NA
7 cat1 7. FALSE 4.33
8 cat2 8. FALSE 5.33
An option is to remove the NA values before calculating cummean. With this method, rows with an NA value do not take part in the cummean calculation and get NA in the new column. I am not sure whether the OP wants to treat NA as 0 in the calculation.
df %>% mutate(rn = row_number()) %>%
filter(!is.na(value)) %>%
group_by(category) %>%
mutate(new_col = cummean(value)) %>%
ungroup() %>%
right_join(mutate(df, rn = row_number()), by="rn") %>%
select(category = category.y, value = value.y, new_col) %>%
as.data.frame()
# category value new_col
# 1 cat1 NA NA
# 2 cat1 2 2.000000
# 3 cat2 3 3.000000
# 4 cat1 4 3.000000
# 5 cat2 5 4.000000
# 6 cat2 NA NA
# 7 cat1 7 4.333333
# 8 cat2 8 5.333333
I needed something similar, but cannot replace NAs with 0. So I created this simple function, which works with dplyr. Hope this helps.
cummean.na <- function(x, na.rm = TRUE)
{
  # Example input: x = c(NA, seq(1, 10, 1)); na.rm = TRUE
  op <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    # NA positions stay NA; the others get the mean of everything seen so far
    if (!is.na(x[i])) op[i] <- mean(x[1:i], na.rm = na.rm)
  }
  return(op)
}
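For longer vectors the loop can be replaced by two cumulative sums. This is a sketch of that alternative, not the author's function: cummean_skip_na is a hypothetical name, and the grouping is done with base R's ave instead of dplyr.

```r
# Cumulative mean that skips NAs and leaves NA at the NA positions
cummean_skip_na <- function(x) {
  seen <- cumsum(!is.na(x))                # non-NA values seen so far
  total <- cumsum(ifelse(is.na(x), 0, x))  # running sum with NAs counted as 0
  out <- total / seen                      # running mean over non-NA values
  out[is.na(x)] <- NA                      # keep NA where the input was NA
  out
}

df <- data.frame(category = c("cat1", "cat1", "cat2", "cat1",
                              "cat2", "cat2", "cat1", "cat2"),
                 value = c(NA, 2, 3, 4, 5, NA, 7, 8))

# ave() applies the function within each category and keeps the row order
new_col <- ave(df$value, df$category, FUN = cummean_skip_na)
print(new_col)
```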
Custom function to calculate "cummean", ignoring NA's and carrying forward the previous cumulative mean value to the next NA value:
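cummean.na <-
function(x) {
  # Position of the latest non-NA value seen so far
  tmp_ind <- cumsum(!is.na(x))
  # Leading NAs have no earlier mean to carry forward; an index of 0 would drop them
  tmp_ind[tmp_ind == 0] <- NA
  x_nona <- x[!is.na(x)]
  # cummean() comes from dplyr
  out <- dplyr::cummean(x_nona)[tmp_ind]
  return(out)
}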
Example output:
> cummean.na(1:5)
[1] 1.0 1.5 2.0 2.5 3.0
> cummean.na(c(1, 2, 3, NA, 4, 5))
[1] 1.0 1.5 2.0 2.0 2.5 3.0