I am quite new to R. Using dplyr's filter(), I want to select the records for which at least one of a list of variables is not NA.
df %>% filter (var1 != "NA" | var2 != "NA" | var3 != "NA" )
The problem is that I have 85 such variables (ending with HR). So I have extracted them and put them in a vector.
hr_variables <- grep("HR$", names(ssc), value=TRUE)
I would like to write a loop that goes through hr_variables and applies the OR condition to each element inside filter().
Is this possible in R?
We can use base R to do this more easily
ssc[!rowSums(is.na(ssc[hr_variables])),]
# col1_HR col2_HR col3
#2 1 3 0.5365853
#3 2 4 0.4196231
Or using tidyverse
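library(tidyverse)
library(magrittr) # extract() below is magrittr's alias for `[`, not tidyr::extract()
ssc %>%
  select(all_of(hr_variables)) %>% # select_() is deprecated
  map(~is.na(.)) %>%
  reduce(`|`) %>%
  `!` %>%
  extract(ssc, ., )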
Or with complete.cases
ssc %>%
  select(all_of(hr_variables)) %>% # select_() is deprecated
  complete.cases(.) %>%
  magrittr::extract(ssc, ., )
data
set.seed(24)
ssc <- data.frame(col1_HR = c(NA, 1, 2, 3), col2_HR = c(NA, 3, 4, NA), col3 = rnorm(4))
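The question asks for an OR across the HR columns, while the answers above keep only rows where every HR column is present. A minimal sketch of both variants, assuming dplyr >= 1.0.4 (the release that introduced if_any() and if_all()) and reusing the example data:

```r
library(dplyr)

set.seed(24)
ssc <- data.frame(col1_HR = c(NA, 1, 2, 3), col2_HR = c(NA, 3, 4, NA), col3 = rnorm(4))
hr_variables <- grep("HR$", names(ssc), value = TRUE)

# OR: keep rows where at least one HR column is not NA
any_present <- ssc %>%
  filter(if_any(all_of(hr_variables), ~ !is.na(.x)))

# AND: keep rows where every HR column is not NA (same rows as the base R answer)
all_present <- ssc %>%
  filter(if_all(all_of(hr_variables), ~ !is.na(.x)))
```

On this data the OR version drops only the first row, where both HR columns are NA.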
I need to recode some columns in my data. There are 29 columns that all use the same codes.
The cells are coded with numbers, like this:
1 - Normal
2 - Altered
3 - NA
I want to write a for loop that changes all of these columns at once, turning the numeric codes (1; 2; 3) into labels (Normal; Altered; NA).
This is what I'm trying. I don't get any error message, but it isn't working:
for (i in names(df[,123:151])){
mutate(i = case_when(
i == 1 ~ 'Normal',
i == 2 ~ 'Altered',
i == 3 ~ 'NA'))
}
An easy way to do this would be to use dplyr from tidyverse.
library(tidyverse)
#make test dataframe
col1 <- c("1", "2", "3")
col2 <- c(3, 2, 2)
df <- data.frame(col1, col2)
df_recoded<-df %>%
mutate(across(.cols = everything(), ~case_when(
. == 1 ~ 'Normal',
. == 2 ~ 'Altered',
. == 3 ~ NA_character_)))
Try this:
df %>% mutate(across(.cols = all_of(names(df)[123:151]),
  .fns = ~recode(., `1` = "Normal", `2` = "Altered", `3` = "NA", .default = NA_character_)))
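As for the loop in the question: mutate() there is never given the data frame, i holds a column name (a string) rather than the column itself, and the result is never assigned back. If a loop is preferred, a minimal base R sketch looks like this. The data frame and the code_cols vector are hypothetical stand-ins for the real data and for names(df)[123:151]:

```r
# Hypothetical example data; x and y play the role of the 29 coded columns
df <- data.frame(id = 1:4, x = c(1, 2, 3, 1), y = c(3, 3, 2, 1))
code_cols <- c("x", "y")

# Position k in this vector is the label for code k (code 3 becomes a real NA)
labels <- c("Normal", "Altered", NA)

for (col in code_cols) {
  # Use each numeric code as an index into the label vector and assign back
  df[[col]] <- labels[df[[col]]]
}
```

Indexing a lookup vector by the code avoids case_when() entirely, and assigning with df[[col]] is what makes the change stick.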
Given a data frame like data:
data <- data.frame(group = rep(c('a','b'), each= 100),
value = rnorm(200))
We want to filter values for group == b using dplyr and use boxplot.stats to identify outliers:
library(dplyr)
data%>%
filter(group == 'b')%>%
summarise(out.stats = boxplot.stats(value))
This returns the error "Column out.stats must be length 1 (a summary value), not 4". Why does this not work? How do you apply functions like this inside a pipe?
The following answers both the question and the OP's last comment on it, which asks for the row numbers of the outliers.
what if we want to return the row numbers that go with
boxplot.stats()$out from the pipe? so if we did
b<-data%>%filter(group=='b') outside of the pipe, we could have used:
which(b$value %in% boxplot.stats(b$value)$out)
This is done by left_joining with the original data.
library(dplyr)
set.seed(1234)
data <- data.frame(group = rep(c('a','b'), each= 100),
value = rnorm(200))
data %>% filter(group == 'b') %>% pull(value) %>%
boxplot.stats() %>% '[['('out') %>%
data.frame() %>%
left_join(data, by = c('.' = 'value'))
# . group
#1 3.043766 b
#2 -2.732220 b
#3 -2.855759 b
With dplyr 1.0.0 or later, summarise() can also return more than one row per group (from dplyr 1.1.0 this use is deprecated in favour of reframe())
library(dplyr) # >= 1.0.0
data%>%
filter(group == 'b')%>%
summarise(out.stats = boxplot.stats(value))
# out.stats
#1 -2.4804222, -0.7546693, 0.1304050, 0.6390749, 2.2682247
#2 100
#3 -0.08980661, 0.35061653
#4 -3.014914
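The original error arises because boxplot.stats() returns a list with four components (stats, n, conf, out) rather than a single value. To get the outliers together with their original row numbers without a join, one possible sketch is to record the row number before filtering. The row column name is an assumption made for illustration:

```r
library(dplyr)

set.seed(1234)
data <- data.frame(group = rep(c('a', 'b'), each = 100),
                   value = rnorm(200))

outlier_rows <- data %>%
  mutate(row = row_number()) %>%                # remember the original position
  filter(group == 'b') %>%
  filter(value %in% boxplot.stats(value)$out)   # keep only the outliers of group b
```

Because row is computed before the first filter(), it indexes into the full data, not into the group b subset.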
How can I use mutate to achieve the below?
bd_diag_date <- df %>%
apply(1, function(dates) last(na.omit(dates))) %>%
as.data.frame() %>%
`colnames<-`("diag_date")
I tried the code below, but it didn't work. I can't work out why; it says Error: Column 'diagnosis_date' is of unsupported type symbol. Should I assume that mutate accepts any function that can be applied to a vector? If not, what kinds of operation does it accept?
bd_diag_date <- df %>%
rowwise() %>%
{mutate(., diag_date=last(na.omit(all_vars(.))))}
I also have a more general question: how can I debug this? Every time I run into this kind of problem I have to search Stack Exchange, but I feel this isn't the right way to improve my dplyr skills.
We can use pmap
library(dplyr)
library(purrr)
df %>%
mutate(diag_date = pmap(., ~ last(na.omit(c(...)))))
If the columns are numeric, we can use pmap_dbl; plain pmap returns a list column
df %>%
mutate(diag_date = pmap_dbl(., ~ last(na.omit(c(...)))))
# col1 col2 col3 diag_date
#1 1 NA 2 2
#2 NA 2 NA 2
#3 3 4 NA 4
If we need to return only a single column, use transmute
df %>%
transmute(diag_date = pmap_dbl(., ~ last(na.omit(c(...)))))
Or with group_split and map
df %>%
  group_split(grp = row_number(), .keep = FALSE) %>% # `keep` was renamed to `.keep` in dplyr 1.0.0
map_dfr(~ .x %>%
transmute(diag_date = last(na.omit(unlist(.)))))
Or using base R with max.col
df$diag_date <- df[cbind(seq_len(nrow(df)), max.col(!is.na(df), 'last'))]
data
df <- data.frame(col1 = c(1, NA, 3), col2 = c(NA, 2, 4), col3 = c(2, NA, NA))
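Another option, not taken from the answers above, is dplyr's coalesce(), which returns the first non-missing value across its arguments. Listing the columns from last to first therefore gives the last non-NA value in each row. This sketch assumes the columns share a type and are few enough to name by hand:

```r
library(dplyr)

df <- data.frame(col1 = c(1, NA, 3), col2 = c(NA, 2, 4), col3 = c(2, NA, NA))

result <- df %>%
  # coalesce() takes the first non-NA value, so pass the columns in reverse order
  mutate(diag_date = coalesce(col3, col2, col1))
```

coalesce() is vectorised over rows, so no rowwise() or pmap() call is needed.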
I have a data frame containing several forms of data, such as:
<dbl> <chr> <dttm> <chr> <chr>
0001 cccc Feb-01-18 bbbb 1ab76
0002 bbbb Apr-02-20 cccc 7we54
...
What I'm trying to do is create a new column "f" that holds, for each row, the count of specific character values (e.g., "cccc" OR "bbbb") in that row. I've tried combining the dplyr mutate function with rowSums but have not had any luck despite trying several variations.
df %>% mutate(new = rowSums(. == "cccc"))
Any guidance would be greatly appreciated. Thanks!
One option is to combine the two comparisons with | (OR)
library(dplyr)
df %>%
  mutate(f = rowSums(. == "cccc" | . == "bbbb"))
Also, this can be made more specific by checking only the columns of class character
df %>%
select_if(is.character) %>%
transmute(f = rowSums(. == "cccc" | . == "bbbb"))%>%
bind_cols(df, .)
Base R solution:
df <- data.frame(a = c("c","b"), d = c("c", "c"), e = c(1,2), stringsAsFactors = FALSE)
pattern <- "c"
df["count"] <- rowSums(apply(df, 2, function(x, s = pattern) x %in% s))
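The same count can be written in one step with across(), assuming dplyr >= 1.0.0. This is a sketch on made-up data: the column names id, b, d, e and the targets vector are hypothetical. Using %in% instead of == also means NA cells count as non-matches instead of turning the row sum into NA:

```r
library(dplyr)

# Hypothetical data loosely modelled on the question
df <- data.frame(id = c(1, 2),
                 b = c("cccc", "bbbb"),
                 d = c("bbbb", "dddd"),
                 e = c("1ab76", "7we54"),
                 stringsAsFactors = FALSE)
targets <- c("cccc", "bbbb")

df_counted <- df %>%
  # Test only the character columns, then count the TRUEs in each row
  mutate(f = rowSums(across(where(is.character), ~ .x %in% targets)))
```

Here the first row contains two target values and the second row one.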
I'm using RStudio Version 0.98.1028 on Windows. When I summarise a multi-level data frame with dplyr using sum(), I lose a row whose sum would be 0. In other words, if my original data frame was something like
group <- as.factor(rep(c('X', 'Y'), each = 1, times = 6))
type <- as.factor(rep(c('a', 'b'), each = 2, times = 3))
day <- as.factor(rep(1:3, each = 4))
df = data.frame(type = type, day = day, value = abs(rnorm(12)))
df = df[day != 1 | type != 'a',]
and I summarise it
df1 = df %>%
group_by(day, type) %>%
summarise(sum = sum(value))
then one row is missing: the combination of day = 1 and type = a, which I would like to keep (even if its sum is 0...)
Thanks in advance!
EB
You could try left_join
library(dplyr)
left_join(expand.grid(type=unique(df$type), day=unique(df$day)), df1) %>%
group_by(day, type) %>%
summarise(sum=sum(value, na.rm=TRUE))
# day type sum
#1 1 a 0.0000000
#2 1 b 0.5132914
#3 2 a 1.2482210
#4 2 b 0.9232343
#5 3 a 2.0381779
#6 3 b 0.7558351
where df1 is
df1 <- df[day != 1 | type != 'a',]
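An alternative sketch, assuming the tidyr package is available and dplyr >= 1.0.0 (for the .groups argument): summarise first, then let complete() add the missing combinations. Because day and type are factors, complete() uses all of their levels, including ones that no longer occur in the data. The seed is arbitrary and only there for reproducibility:

```r
library(dplyr)
library(tidyr)

set.seed(1)  # arbitrary seed
type <- factor(rep(c('a', 'b'), each = 2, times = 3))
day <- factor(rep(1:3, each = 4))
df <- data.frame(type = type, day = day, value = abs(rnorm(12)))
df <- df[df$day != 1 | df$type != 'a', ]   # drop the day 1 / type a rows

df1 <- df %>%
  group_by(day, type) %>%
  summarise(sum = sum(value), .groups = "drop") %>%
  # Add every missing day/type combination and fill its sum with 0
  complete(day, type, fill = list(sum = 0))
```

This keeps the summarised values untouched and only fills in the rows that summarise() could not produce.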