Given a data frame like data:
data <- data.frame(group = rep(c('a','b'), each= 100),
value = rnorm(200))
We want to filter values for group == 'b' using dplyr and use boxplot.stats to identify outliers:
library(dplyr)
data%>%
filter(group == 'b')%>%
summarise(out.stats = boxplot.stats(value))
This returns the error "Column out.stats must be length 1 (a summary value), not 4". Why does this not work, and how do you apply functions like this inside a pipe?
The following answers both the question and the OP's last comment on it, which asks for the row numbers of the outliers:
what if we want to return the row numbers that go with
boxplot.stats()$out from the pipe? so if we did
b<-data%>%filter(group=='b') outside of the pipe, we could have used:
which(b$value %in% boxplot.stats(b$value)$out)
This can be done by extracting the outlier values and left_joining them back to the original data, which returns the matching rows:
library(dplyr)
set.seed(1234)
data <- data.frame(group = rep(c('a','b'), each= 100),
value = rnorm(200))
data %>% filter(group == 'b') %>% pull(value) %>%
boxplot.stats() %>% '[['('out') %>%
data.frame() %>%
left_join(data, by = c('.' = 'value'))
# . group
#1 3.043766 b
#2 -2.732220 b
#3 -2.855759 b
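If the actual row positions are wanted rather than the matching rows, one option is to record the row number before filtering on the outlier values. This is a minimal sketch, assuming the positions should be counted within group 'b' (as in the which() call from the comment); the column name row is only illustrative.

```r
library(dplyr)

set.seed(1234)
data <- data.frame(group = rep(c("a", "b"), each = 100),
                   value = rnorm(200))

outlier_rows <- data %>%
  filter(group == "b") %>%
  mutate(row = row_number()) %>%               # position within group 'b'
  filter(value %in% boxplot.stats(value)$out)  # keep only the outliers

outlier_rows
```

Moving the mutate() call above the first filter() would give positions in the full data frame instead.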
With dplyr 1.0.0 or later, summarise() can also return more than one row per group:
library(dplyr) # >= 1.0.0
data%>%
filter(group == 'b')%>%
summarise(out.stats = boxplot.stats(value))
# out.stats
#1 -2.4804222, -0.7546693, 0.1304050, 0.6390749, 2.2682247
#2 100
#3 -0.08980661, 0.35061653
#4 -3.014914
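The original error comes from boxplot.stats() returning a list with four components (stats, n, conf, out), while older versions of summarise() expect a single value per group. A version-independent sketch is to wrap the result in list(), so that it becomes one list-column entry, and extract the component afterwards. Note that from dplyr 1.1.0 onward, multi-row results from summarise() are deprecated in favour of reframe(). The object name res is only illustrative.

```r
library(dplyr)

set.seed(1234)
data <- data.frame(group = rep(c("a", "b"), each = 100),
                   value = rnorm(200))

res <- data %>%
  filter(group == "b") %>%
  summarise(out.stats = list(boxplot.stats(value)))  # one row, one list-column

# The four components live inside the single list element
names(res$out.stats[[1]])

# Outlier values only
res$out.stats[[1]]$out
```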
I have a dataset that is organized by groups (site) and has baseline observations (trt == 0) and observations collected from a modified environment (trt == 1, although it's not experimental data which is why I'm doing this). For the trt == 1 observations, I would like to calculate the quantile of each observation within the baseline ecdf for that group (i.e. site). My instinct was to use map2_dbl() but the ecdf to compare to is within the list-column itself, not external to the data. I'm struggling to get the correct syntax (in the R tidyverse).
df <- tibble(site = rep(letters[1:4], length.out = 2000),
trt = rep(c(0, 1), each = 1000),
value = c(rnorm(n = 1000), rnorm(n = 1000, mean = 0.1)))
# calculate ecdf for baseline:
baseline <- df %>%
filter(trt == 0) %>%
group_by(site) %>%
summarize(ecdf0 = list(ecdf(value)))
# compare each trt = 1 observation to ecdf for that site:
trtQuantile <- df %>%
filter(trt == 1) %>%
inner_join(baseline, by = "site")
# what would be next line is where I'm struggling to get the correct map syntax
head(trtQuantile)
# for the first row I am aiming for the result given by:
trtQuantile$ecdf0[[1]](trtQuantile$value[[1]])
Any advice from the purrr masters is appreciated! Thanks.
You can use map2_dbl:
library(dplyr)
library(purrr)
trtQuantile %>% mutate(out = map2_dbl(ecdf0, value, ~.x(.y)))
Or mapply in base R:
trtQuantile$out <- mapply(function(x, y) x(y),trtQuantile$ecdf0,trtQuantile$value)
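If the list-column is not needed for anything else, the same quantiles can probably be computed without it by building the baseline ecdf inside a grouped mutate(). This is a sketch under that assumption; the object name direct and the seed are illustrative.

```r
library(dplyr)

set.seed(42)
df <- tibble(site = rep(letters[1:4], length.out = 2000),
             trt = rep(c(0, 1), each = 1000),
             value = c(rnorm(n = 1000), rnorm(n = 1000, mean = 0.1)))

direct <- df %>%
  group_by(site) %>%
  # ecdf() is fitted on the baseline rows of the site, then evaluated on all rows
  mutate(out = ecdf(value[trt == 0])(value)) %>%
  ungroup() %>%
  filter(trt == 1)

head(direct)
```

This avoids the join and the per-row function calls, since each site's ecdf is evaluated once on the whole vector.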
I am trying to add a column based on a condition using the mutate function in R, but keep getting an error. The code is straight from the teacher's lecture, yet an error occurs. The LineItem column is of class factor; I am not sure if that makes a difference.
Please advise on what I am missing.
Thank you,
Avi
df <- read.csv('ities_short.csv')
colSums(is.na(df))
sl <- str_length(df$LineItem)
avg <- mean(str_length(df$LineItem))
df <- df %>% mutate(LineItem_LongName = ifelse(sl > avg), 1, 0)
Error in ifelse(sl > avg) : argument "yes" is missing, with no default
You have placed the closing ')' in the wrong place. The general syntax for ifelse is:
ifelse(cond, value_if_true, value_if_false)
library(dplyr)
library(stringr)
df <- read.csv('ities_short.csv')
colSums(is.na(df))
sl <- str_length(df$LineItem)
avg <- mean(str_length(df$LineItem))
df <- df %>% mutate(LineItem_LongName = ifelse(sl > avg, 1, 0))
@Nirbhay Singh's answer is correct. However, if you compare two vectors, it's generally better to use dplyr::if_else, because it is stricter about types and lets you handle NA values explicitly through its missing argument:
df <- df %>% mutate(LineItem_LongName = if_else(sl > avg, 1, 0))
See the documentation for dplyr::if_else.
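A small sketch of the difference, using a made-up condition vector rather than the OP's data: ifelse() silently propagates NA, while if_else() lets you choose a replacement and refuses to mix incompatible types.

```r
library(dplyr)

cond <- c(TRUE, NA, FALSE)  # illustrative condition with a missing value

# Base ifelse(): the NA in the condition is passed through
base_result <- ifelse(cond, 1, 0)

# dplyr::if_else(): the missing argument says what NA should become
dplyr_result <- if_else(cond, 1, 0, missing = -1)

# Mixing a number and a string is rejected instead of silently coerced
type_error <- tryCatch(if_else(cond, 1, "no"),
                       error = function(e) "error")

base_result
dplyr_result
type_error
```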
Don't create separate objects and then use them in the data frame; keep them in the data frame itself. You can later remove the columns you don't need. Moreover, you can do this without ifelse.
library(dplyr)
library(stringr)
df %>%
mutate(temp = str_length(LineItem),
LineItem_LongName = as.integer(temp > mean(temp)))
Or in base R:
# as.character() is needed because nchar() rejects factors
df$temp <- nchar(as.character(df$LineItem))
df <- transform(df, LineItem_LongName = +(temp > mean(temp)))
My goal is to go through various signals and ignore any 1's that are not part of a series (a run of at least two 1's in a row). The data is an xts time series with 180K+ columns and 84 months. I've provided a small, simplified data set. I've used a nested for loop, but it takes far too long to finish on the entire data set. It works but is horribly inefficient.
I know there's some way to use an apply function, but I can't figure it out.
Example data:
library(xts)
mod_sig <- data.frame(a = c(0,1,0,0,0,1,1,0,0,0,1,0,1,1),
b = c(0,0,1,0,0,1,0,0,0,1,1,1,1,1),
c = c(0,1,0,1,0,1,1,1,0,0,0,1,1,0),
d = c(0,1,1,1,0,1,1,0,0,1,1,1,1,1),
e = c(0,0,0,0,0,0,0,0,0,0,1,0,0,0))
mod_sig <- xts(mod_sig, order.by = as.Date(seq(as.Date("2016-01-01"), as.Date("2017-02-01"), by = "month")))
Example code:
# fixing months where condition is only met for one month
# creating a new data frame for modified signals
Signals_Fin <- data.frame(matrix(nrow = nrow(mod_sig), ncol = ncol(mod_sig)))
colnames(Signals_Fin) <- colnames(mod_sig)
# Loop over Signals to change 1's to 0's for one month events
for(col in 1:ncol(mod_sig)) {
for(row in 1:nrow(mod_sig)) {
val <- ifelse(mod_sig[row,col] == 1,
ifelse(mod_sig[row-1,col] == 0,
ifelse(mod_sig[row+1,col] == 0,0,1),1),0)
Signals_Fin[row, col] <- val
}
}
As you can see with the loop, any 1's that aren't in a sequence are changed to 0's. I know there is a better way, so I'm hoping to improve my approach. Any insights would be greatly appreciated. Thanks!
Answer from Zack and Ryan:
Zack and Ryan were spot on with dplyr; I only made slight modifications based on what they gave and some help from a colleague.
Answer code:
mod_sig <- data.frame(a = c(0,1,0,0,0,1,1,0,0,0,1,0,1,1),
b = c(0,0,1,0,0,1,0,0,0,1,1,1,1,1),
c = c(0,1,0,1,0,1,1,1,0,0,0,1,1,0),
d = c(0,1,1,1,0,1,1,0,0,1,1,1,1,1),
e = c(0,0,0,0,0,0,0,0,0,0,1,0,0,0))
# funs() is deprecated in current dplyr; formula lambdas do the same job
Signals_fin <- mod_sig %>%
mutate_all(~ ifelse(.x == 1 & (lag(.x) == 1 | lead(.x) == 1), 1, 0)) %>%
mutate_all(~ ifelse(is.na(.x), 0, .x))
Signals_fin <- xts(Signals_fin, order.by = as.Date(seq(as.Date("2016-01-01"), as.Date("2017-02-01"), by = "month")))
Here's a stab from a dplyr perspective. I converted your row names to a column, but you can just as easily convert them back to row names with tibble::column_to_rownames():
library(dplyr)
library(tibble)
mod_sig %>%
as.data.frame() %>%
rownames_to_column('months') %>%
mutate_at(vars(-months), function(x){
if_else(x == 1 &
(lag(x, order_by = .$months) == 1 |
lead(x, order_by = .$months) == 1),
1,
0)
})
As suggested by @Ryan, his mutate_at call is more elegant. It's important that everything is already sorted, though:
mod_sig %>%
as.data.frame() %>%
rownames_to_column('months') %>%
mutate_at(vars(-months), ~ as.numeric(.x & (lag(.x) | lead(.x))))
And to build on his suggestion:
mod_sig %>%
as.data.frame() %>%
mutate_all(~ as.numeric(.x & (lag(.x) | lead(.x))))
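With 180K+ columns, it may also be worth trying a plain matrix version that shifts the whole matrix up and down by one row instead of going column by column. This is a sketch using base R only. It assumes the rows are already in time order and treats the missing neighbour at the first and last row as 0, which also avoids the NA that lag() and lead() produce at the edges. The names m, prev, nxt and out are illustrative.

```r
mod_sig <- data.frame(a = c(0,1,0,0,0,1,1,0,0,0,1,0,1,1),
                      b = c(0,0,1,0,0,1,0,0,0,1,1,1,1,1),
                      c = c(0,1,0,1,0,1,1,1,0,0,0,1,1,0),
                      d = c(0,1,1,1,0,1,1,0,0,1,1,1,1,1),
                      e = c(0,0,0,0,0,0,0,0,0,0,1,0,0,0))

m <- as.matrix(mod_sig)
n <- nrow(m)

prev <- rbind(0, m[-n, , drop = FALSE])  # value one row earlier (0 before the start)
nxt  <- rbind(m[-1, , drop = FALSE], 0)  # value one row later (0 after the end)

# Keep a 1 only if at least one neighbour is also 1
out <- m * ((prev + nxt) > 0)

out[, "a"]
```

For an xts object, the matrix can be taken with coredata() and the result wrapped back with xts() using the original index.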
I have a large dataframe and want to standardise multiple columns while conditioning the mean and the standard deviation on values. Say I have the following example data:
set.seed(123)
df = data.frame("sample" = c(rep(1:2, each = 5)),
"status" = c(0,1),
"s1" = runif(10, -1, 1),
"s2" = runif(10, -5, 5),
"s3" = runif(10, -25, 25))
and want to standardise each of s1-s3, computing the mean and standard deviation only from rows where status == 0. If I were to do this for, say, s1 only, I could do the following:
df = df %>% group_by(sample) %>%
mutate(sd_s1 = (s1 - mean(s1[status==0])) / sd(s1[status==0]))
But my problem arises when I have to perform this operation on multiple columns. I tried writing a function to include with mutate_at:
standardize <- function(x) {
return((x - mean(x[status==0]))/sd(x[status==0]))
}
df = df %>% group_by(sample) %>%
mutate_at(vars(s1:s3), standardize)
Which just creates NA values for s1-s3.
I have tried to use the answer provided in:
R - dplyr - mutate - use dynamic variable names, but cannot figure out how to do the subsetting.
Any help is greatly appreciated. Thanks!
We could just use a formula lambda (funs() is deprecated in current dplyr):
df %>%
group_by(sample) %>%
mutate_at(vars(s1:s3), ~ (.x - mean(.x[status == 0])) / sd(.x[status == 0]))
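In dplyr 1.0.0 and later, the same idea can be written with across(). The sketch below assumes that version and writes the results to new sd_* columns (an illustrative naming choice via .names) so the originals are kept. Because the lambda is evaluated inside mutate(), status refers to the column of the current group, which a standalone function defined outside the pipeline cannot see.

```r
library(dplyr)

set.seed(123)
df <- data.frame(sample = rep(1:2, each = 5),
                 status = rep(c(0, 1), times = 5),
                 s1 = runif(10, -1, 1),
                 s2 = runif(10, -5, 5),
                 s3 = runif(10, -25, 25))

standardised <- df %>%
  group_by(sample) %>%
  mutate(across(s1:s3,
                # centre and scale using the status == 0 rows of each group
                ~ (.x - mean(.x[status == 0])) / sd(.x[status == 0]),
                .names = "sd_{.col}")) %>%
  ungroup()

standardised
```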