I don't understand why I cannot calculate the mean and sd of the variable Total.
Steps I have taken:
I filtered the dataset to keep only the data from day 1 to day 7.
I summed the values of the variable "x" into a new variable "Total".
My dataset:
name day x
ab 1 3
cd 3 5
fg 7 2
ll 3 1
kk 9 0
My code:
df_changed <- df%>%
dplyr::group_by(`name`, `day` )%>%
dplyr::filter(`day`>= 1, `day`<= 7) %>%
dplyr::summarise(Total=sum(x, na.rm = TRUE))%>%
dplyr::summarise(mean = mean(Total), sd = sd(Total)) %>%
view(df_changed)
Perhaps you want to calculate the mean and SD of x rather than of Total. Try this code:
library(tidyverse)
df_changed <- df %>%
  dplyr::group_by(`name`, `day`) %>%
  dplyr::filter(`day` >= 1, `day` <= 7) %>%
  dplyr::summarise(Total = sum(x, na.rm = TRUE),
                   mean = mean(x, na.rm = TRUE),
                   sd = sd(x, na.rm = TRUE))
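If the goal is instead a single mean and SD of Total across names, the sketch below may help. It assumes dplyr 1.0 or later (for the .groups argument) and that Total should be summed per name; both are assumptions, not something the question confirms. In the sample data every name/day pair has a single row, so an sd() computed within those groups is NA; dropping the grouping before the second summarise() avoids that.

```r
library(dplyr)

# Toy data copied from the question
df <- data.frame(
  name = c("ab", "cd", "fg", "ll", "kk"),
  day  = c(1, 3, 7, 3, 9),
  x    = c(3, 5, 2, 1, 0)
)

# Step 1: keep days 1 to 7, then sum x per name (assumed grouping)
totals <- df %>%
  filter(day >= 1, day <= 7) %>%
  group_by(name) %>%
  summarise(Total = sum(x, na.rm = TRUE), .groups = "drop")

# Step 2: totals is no longer grouped, so this gives one overall row
overall <- totals %>%
  summarise(mean = mean(Total), sd = sd(Total))

overall
```

With the sample rows, day 9 is filtered out and the mean of the four remaining totals is 2.75.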
A dataframe:
mydf <- data.frame(
x = rep(letters[1:3], 4),
y = rnorm(12, 0, 3)
)
I can easily mutate a new column z that is the value of y plus or minus a random number:
mydf <- mydf %>%
mutate(z = rnorm(nrow(.), mean = 0, sd = sd(y)))
What I would like to do is create z as a random number, but set the sd to the sd of y for that letter only.
I tried:
mydf <- mydf %>%
group_by(x) %>%
mutate(z = rnorm(nrow(.), mean = 0, sd = sd(y)))
Error: Problem with `mutate()` input `z`.
x Input `z` can't be recycled to size 4.
ℹ Input `z` is `rnorm(nrow(.), mean = 0, sd = sd(y))`.
ℹ Input `z` must be size 4 or 1, not 12.
ℹ The error occurred in group 1: x = "a".
How can I add z, which is the value of y plus or minus a random number with an sd equal to that of the sd for the group as opposed to the column as a whole?
Here, nrow(.) ignores the grouping and returns the number of rows of the entire data frame (12), whereas mutate() requires the new column to have either the same length as the current group (4) or length 1. The call therefore fails unless we wrap the column in a list, which may not be what the OP wanted. The summarise() call below shows that nrow(.) returns 12 in every group:
library(dplyr)
mydf %>%
group_by(x) %>%
summarise(n = nrow(.))
# A tibble: 3 x 2
# x n
# <chr> <int>
#1 a 12 ###
#2 b 12 ###
#3 c 12 ###
We can use n(), which returns the size of the current group:
mydf %>%
group_by(x) %>%
mutate(z = rnorm(n(), mean = 0, sd = sd(y)))
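To check that the grouped version behaves as intended, here is a small sketch that also stores the group size and group SD next to z. The columns group_n and group_sd are illustrative names, and adding the noise to y (z = y + noise) is an assumption based on the question's wording.

```r
library(dplyr)

set.seed(42)  # only to make the sketch reproducible
mydf <- data.frame(
  x = rep(letters[1:3], 4),
  y = rnorm(12, 0, 3)
)

checked <- mydf %>%
  group_by(x) %>%
  mutate(
    group_n  = n(),      # rows in the current group (4 here)
    group_sd = sd(y),    # SD of y within the current group
    z = y + rnorm(n(), mean = 0, sd = sd(y))  # y plus group-scaled noise
  ) %>%
  ungroup()

checked
```

Because n() and sd(y) are both evaluated per group, each letter gets four draws scaled by its own SD.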
I have a dataframe where I would like to first group on a particular column (ID) and then remove the outliers from a particular column (Number) based on group and then calculate the mean for each group.
library(dplyr)
id<-c("A","B","C","A","B","B")
id<-as.data.frame(id)
number <-c(5,10,2,6,1000,12)
number<-as.data.frame(number)
total<-cbind(id,number)
I tried the approach below, but it is not working:
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
val <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - val)] <- NA
y[x > (qnt[2] + val)] <- NA
y
}
df2 <- total %>%
group_by(id) %>%
mutate(mean_val = remove_outliers(number)) %>%
ungroup() %>%
filter(!is.na(mean_val))
I would appreciate it if someone could help.
Input and expected O/P
There are not enough observations in your B group to treat 1000 as an outlier.
See:
remove_outliers(c(5, 1000, 12))
#[1] 5 1000 12
However, if you add one more observation, it treats 1000 as an outlier.
remove_outliers(c(5, 1000, 12, 6))
#[1] 5 NA 12 6
So, in general, something like this should give you the expected output:
library(dplyr)
total %>%
group_by(id) %>%
mutate(mean_val = remove_outliers(number)) %>%
filter(!is.na(mean_val)) %>%
mutate(mean_val = mean(mean_val)) %>%
ungroup()
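If one row per group is wanted rather than the mean repeated on every row, summarise() can replace the last mutate(). A minimal sketch, reusing the remove_outliers() function from the question and building total directly with data.frame(); the column name cleaned is illustrative, and .groups assumes dplyr 1.0 or later:

```r
library(dplyr)

# Same data as in the question, built in one step
total <- data.frame(
  id     = c("A", "B", "C", "A", "B", "B"),
  number = c(5, 10, 2, 6, 1000, 12)
)

# Outlier rule from the question: values beyond 1.5 * IQR become NA
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  val <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - val)] <- NA
  y[x > (qnt[2] + val)] <- NA
  y
}

group_means <- total %>%
  group_by(id) %>%
  mutate(cleaned = remove_outliers(number)) %>%   # NA marks an outlier
  summarise(mean_val = mean(cleaned, na.rm = TRUE), .groups = "drop")

group_means
```

With this data, group B still includes 1000 in its mean, because three observations are too few for the rule to flag it.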
There is a database covering a whole year:
Month Day Time X Y
...
3 1 0 2 4
3 1 1 4 2
3 1 2 7 3
3 1 3 8 8
3 1 4 4 6
3 1 5 1 4
3 1 6 6 6
3 1 7 7 9
...
3 2 0 5 7
3 2 1 7 2
3 2 2 9 3
...
4 1 0 2 8
...
I want to find the maximum value of X for each day and create a plot for each day, starting from the beginning of the day (Time 0) up to the time of that maximum. I tried to use a data frame but got a bit lost, and the database is quite big, so I'm not sure this is the best idea.
Any ideas how to do it?
If I understood you correctly, this should work:
Sample dataset:
set.seed(123)
df <- data.frame(Month = sample(c(1:12), 30, replace = TRUE),
Day = sample(c(1:31), 30, replace = TRUE),
Time = sample(c(1:24), 30, replace = TRUE),
x = rnorm(30, mean = 10, sd = 5),
y = rnorm(30, mean = 10, sd = 5))
Using tidyverse (ggplot and dplyr):
library(tidyverse)
df %>%
#Grouping by month and day
group_by(Month, Day) %>%
#Creating new variables for x and y - the max value, and removing values bigger than the max value.
mutate(maxX = max(x, na.rm = TRUE),
maxY = max(y, na.rm = TRUE),
plotX = ifelse(x > maxX, NA, x),
plotY = ifelse(y > maxY, NA, y)) %>%
ungroup() %>%
#Select and gather only the needed variables for the plot
select(Time, plotX, plotY) %>%
gather(plot, value, -Time) %>%
#Plot
ggplot(aes(Time, value, color = plot)) +
geom_point()
You can try a tidyverse approach. Duplicated Time values within each Month and Day are removed without any ranking (the first occurrence is kept).
library(tidyverse)
set.seed(123)
df <- data.frame(Month = sample(c(1:2), 30, replace = TRUE),
Day = sample(c(1:2), 30, replace = TRUE),
Time = sample(c(1:10), 30, replace = TRUE),
x = rnorm(30, mean = 10, sd = 5),
y = rnorm(30, mean = 10, sd = 5))
df %>%
group_by(Month, Day) %>%
filter(!duplicated(Time)) %>% # remove duplicated Time values
filter(x<=max(x) & Time <= Time[x == max(x)]) %>%
ggplot(aes(Time, x)) +
geom_line() +
geom_point(data=. %>% filter(x == max(x)))+
facet_grid(Month~Day, labeller = label_both)
Or put everything in one plot using different colors:
df %>%
group_by(Month, Day) %>%
filter(!duplicated(Time)) %>%
filter(x<=max(x) & Time <= Time[x == max(x)]) %>%
ggplot(aes(Time, x, color = interaction(Month, Day))) +
geom_line() +
geom_point(data=. %>% filter(x == max(x)))
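Time[x == max(x)] returns more than one value when the maximum is tied within a day, which makes the comparison ambiguous. A hedged alternative is which.max(), which picks the first maximum. The sketch below uses a few rows typed in from the question's table (the data frame name day_data is made up) and only does the filtering step, not the plot.

```r
library(dplyr)

# Rows for Month 3, Days 1 and 2, as listed in the question
day_data <- data.frame(
  Month = 3,
  Day   = c(rep(1, 8), rep(2, 3)),
  Time  = c(0:7, 0:2),
  X     = c(2, 4, 7, 8, 4, 1, 6, 7, 5, 7, 9)
)

up_to_max <- day_data %>%
  group_by(Month, Day) %>%
  # keep rows from Time 0 up to the time of the (first) daily maximum of X
  filter(Time <= Time[which.max(X)]) %>%
  ungroup()

up_to_max
```

The result can then be passed to ggplot() with facet_grid() or a color aesthetic, as in the answer's plots.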
I am using dplyr's summarise function. My data contain NAs, so I need to include na.rm=TRUE in each call. For example:
group <- rep(c('a', 'b'), 3)
value <- c(1:4, NA, NA)
df = data.frame(group, value)
library(dplyr)
group_by(df, group) %>% summarise(
mean = mean(value, na.rm=TRUE),
sd = sd(value, na.rm=TRUE),
min = min(value, na.rm=TRUE))
Is there a way to write the argument na.rm=TRUE only once, and not on each row?
You can use summarise_at, which lets you apply several functions to the supplied columns and set arguments that are shared among them. Since funs() is deprecated in recent dplyr versions, the functions are passed as a named list here:
df %>% group_by(group) %>%
  summarise_at("value",
               list(mean = mean, sd = sd, min = min),
               na.rm = TRUE)
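In dplyr 1.0 and later, across() is the usual replacement for summarise_at(). A minimal sketch of the same idea follows; it is one possible approach, not the only one. The helper list na_rm_stats is an illustrative name: it wraps each function once so that na.rm = TRUE does not have to be repeated.

```r
library(dplyr)

df <- data.frame(
  group = rep(c("a", "b"), 3),
  value = c(1:4, NA, NA)
)

# The statistics we want, listed once
stats <- list(mean = mean, sd = sd, min = min)

# Wrap every function so that na.rm = TRUE is supplied in a single place
na_rm_stats <- lapply(stats, function(f) function(x) f(x, na.rm = TRUE))

res <- df %>%
  group_by(group) %>%
  summarise(across(value, na_rm_stats))

res
```

With the default naming, the result columns are called value_mean, value_sd and value_min.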
If you plan to apply your functions to one column only, you can use filter(!is.na()) to drop the NA values of that variable only (i.e. NAs in other variables won't affect the process).
group <- rep(c('a', 'b'), 3)
value <- c(1:4, NA, NA)
df = data.frame(group, value)
library(dplyr)
group_by(df, group) %>%
filter(!is.na(value)) %>%
summarise(mean = mean(value),
sd = sd(value),
min = min(value))
# # A tibble: 2 x 4
# group mean sd min
# <fctr> <dbl> <dbl> <dbl>
# 1 a 2 1.414214 1
# 2 b 3 1.414214 2
I was wondering if there is a way to compute the mean excluding outliers using the dplyr package in R. I was trying to do something like this, but it did not work:
library(dplyr)
w = rep("months", 4)
value = c(1, 10, 12, 9)
df = data.frame(w, value)
output = df %>% group_by(w) %>% summarise(m = mean(value, na.rm = T, outlier = T))
So in above example, output should be 10.333 (mean of 10, 12, & 9) instead of 8 (mean of 1, 10, 12, 9)
Thanks!
One way would be something like this, using the outliers package.
library(outliers) # provides the outlier() function
library(dplyr)
df %>%
group_by(w) %>%
filter(!value %in% c(outlier(value))) %>%
summarise(m = mean(value, na.rm = TRUE))
# w m
#1 months 10.33333
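Note that outlier() returns the value farthest from the mean, so this approach drops one value per group even when nothing is extreme. As a hedged alternative that needs no extra package, the sketch below applies the common 1.5 * IQR rule instead; within_fences() is a hypothetical helper written for this example, not a dplyr function.

```r
library(dplyr)

df <- data.frame(w = rep("months", 4), value = c(1, 10, 12, 9))

# Hypothetical helper: TRUE for values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
within_fences <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  spread <- 1.5 * (q[2] - q[1])
  x >= q[1] - spread & x <= q[2] + spread
}

output <- df %>%
  group_by(w) %>%
  filter(within_fences(value)) %>%      # drops rows outside the fences
  summarise(m = mean(value, na.rm = TRUE))

output
```

For the example data, the value 1 falls below the lower fence, so the mean is taken over 10, 12 and 9.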