I want to be able to get the counts, standard deviation and mean of certain variables after grouping them. I am able to get the mean and std, but getting the counts is giving me an error. This is the following code I have:
NYC_Trees %>%
group_by(Condition) %>%
dplyr::summarise(mean = round(mean(Compensatory.Value), 2),
sd = round(sd(Compensatory.Value), 2),
count(NYC_Trees,Condition, wt = Compensatory.Value))
I get the error: cannot handle.
I want the output such as:
Condition Native N Mean Std
What am I doing wrong?
count(NYC_Trees,Condition, wt = Compensatory.Value) should be the same as NYC_Trees %>% group_by(Condition) %>% summarise(n = sum(Compensatory.Value). This clearly returns a vector and therefore the summarise function cannot handle it.
So you could just have the line n = sum(Compensatory.Value) inside the summarise:
NYC_Trees %>%
group_by(Condition) %>%
dplyr::summarise(mean = round(mean(Compensatory.Value), 2),
sd = round(sd(Compensatory.Value), 2),
n = sum(Compensatory.Value))
Is that what you are trying to do? If you just want the number of values in each group, you can use n = n() instead:
NYC_Trees %>%
group_by(Condition) %>%
dplyr::summarise(mean = round(mean(Compensatory.Value), 2),
sd = round(sd(Compensatory.Value), 2),
n = n())
Related
I have some data where I use the rsample package to create rolling windows (I use the iris data set here). The rolling_iris dataset contains a number of lists.
I would like to compute the min, max, mean and sd of each of the lists. That is in split 1 compute the min across the first 4 columns etc. I originally do this by mapping over the splits and using pivot_longer to rearrange the data then computing the statistics, finally using pivot_wider to get the data back into the original form. This is quite slow.
library(dplyr)
library(purrr)
iris
rolling_iris <- rsample::rolling_origin(iris, initial = 10, assess = 1, cumulative = FALSE, skip = 0)
rolling_iris_statistics <- map(rolling_iris$splits, ~analysis(.x) %>%
pivot_longer(cols = 1:4) %>%
mutate(
min = min(value),
max = max(value),
mean = mean(value),
sd = sd(value)
) %>%
group_by(name) %>%
mutate(rowID = row_number()) %>%
pivot_wider(names_from = name, values_from = value)
)
I would like to map over each of the lists and compute the above statistics. Then once this is done scale the analysis by the following function.
Scale_Me <- function(x){
(x - min(x)) / (max(x) - min(x))
}
Additional:
rolling_iris_analysis <- map(rolling_iris$splits, ~analysis(.x))
rolling_iris_assessment <- map(rolling_iris$splits, ~assessment(.x))
EDIT:
I managed to compute the following (I am not sure if it is "faster")
analysis <- map(rolling_iris$splits, ~analysis(.x))
map(analysis, ~select(., c(1:4)) %>% as.matrix %>% mean())
The below code subsets into each sub data frame. So, rolling_iris_dfs is a list of data frames. Then, you can iterate over each data frame and compute statistics.
rolling_iris_dfs <- map(seq(1, length(rolling_iris[[1]])), ~rolling_iris[[1]][[.x]]$data)
rolling_iris_stats <- map(rolling_iris_dfs, ~analysis(.x) %>%
pivot_longer(cols = 1:4) %>%
mutate(
min = min(value),
max = max(value),
mean = mean(value),
sd = sd(value)
) %>%
group_by(name) %>%
mutate(rowID = row_number()) %>%
pivot_wider(names_from = name, values_from = value)
)
Give a minimum example.
df <- data.frame("Treatment" = c(rep("A", 2), rep("B", 2)), "Price" = 1:4, "Cost" = 2:5)
I want to summarize the data by treatments for all the variables I have, and put them together, so I define a function to do this for each variable first, and then rbind them later on.
SummarizeFn <- function(x,y,z) {
x %>% group_by(Treatment) %>%
summarize(n = n(), Mean = mean(y), SD = sd(y)) %>%
cbind("Var" = rep(y, 3)) # add a column to show which variable those statistics belong to.
}
SumPrice <- SummarizeFn(df, df$Price, "Price")
However, R tells me that object "Price" is not found. How to solve this problem?
Also, how to make y as a character indicating the mean and sd are of price?
Price isnt a variable, you need SummarizeFn(df,df$Price) because Price is just defined in your list df
SummarizeFn <- function(x,y,z)
{
df1<-(x %>% group_by(Treatment)
%>% summarize(n = n(), Mean = mean(y), SD = sd(y))
)
df1<- df1 %>% mutate ("Var" = z)
return(df1)
}
SumPrice <- SummarizeFn(df, df$Price,"Price")
I would like to perform kmeans within groups and add to my data information about cluster number and center which an observation was assigned to (still, within groups so cluster 1 is not the same for group A and group B). I thought that I can pluck cluster assignment and centroid from kmeans and then maybe join these two with each other and finally, with original data. To do the former I wanted to add a row number to data frames with centers and then join by the number of cluster. But how can I add row number within nested data frames? The following code works well until the last, 'nested' mutate.
my_data <- data.frame(group = c(sample(c('A', 'B', 'C'), 20, replace = TRUE)), x = runif(100, 0, 10), y = runif(100, 0, 10))
my_data %>%
group_by(group) %>%
nest() %>%
mutate(km_cluster = map(data, ~kmeans(.x, 3) %>% pluck('cluster')),
km_centers = map(data, ~kmeans(.x, 3) %>% pluck('centers') %>% mutate(cluster = row_number())))
#Luke.sonnet provided an answer that works well with map, but interestingly not with map2, see below:
my_data %>%
group_by(group) %>%
nest() %>%
mutate(number = sample(3:7, 3)) %>%
mutate(km_cluster = map2(data, number, ~kmeans(.x, .y) %>% pluck('cluster')),
km_centers = map2(data, number, ~kmeans(.x, .y) %>% pluck('centers') %>% as_tibble() %>% mutate(cluster = row_number())))
Any ideas how to solve the issue in that case? And equally important, what is the cause of such behaviour?
The problem is that pluck() is returning a matrix. Cast to a tibble first and number differently.
library(tidyverse)
my_data <- data.frame(group = c(sample(c('A', 'B', 'C'), 20, replace = TRUE)), x = runif(100, 0, 10), y = runif(100, 0, 10))
my_data %>%
group_by(group) %>%
nest() %>%
mutate(number = sample(3:7, 3)) %>%
mutate(km_cluster = map2(data, number, ~kmeans(.x, .y) %>% pluck('cluster')),
km_centers = map2(data, number, ~kmeans(.x, .y) %>% pluck('centers') %>% as_tibble() %>% mutate(cluster = seq_len(nrow(.)))))
Note you can also do mutate(cluster = row_number(x)))) and this provides different numbers (note that just using row_number() uses the rows from the parent df). I think given kmeans that the matrix of centers is ordered row-wise by cluster number that the answer in the main chunk is correct.
I would like to summarise a grouped data.frame without knowing the name of the column. But what I know is, that the feature is always at position 3 (column) in this data.frame, is that possible?
df <- data_frame(date = rep(c("2017-01-01", "2017-01-02", "2017-01-03"), 2),
group = rep(c("A", "B"), 3),
temperature = runif(6, -10, 30),
percipitation = runif(6, 0,5)
)
parameter <- "perc"
df1 <- df %>%
select(date, group, starts_with(parameter)) %>%
group_by(group) %>%
summarise(
avg = mean(percipitation)
)
In this example the code works, but of course only for the parameter 'perc' and not for 'temp' or so.
avg = mean(df[[3]])
or something like this doesn't work. Any suggestions?
You could keep just the grouping variable and the third column using select(group, 3). The function summarise_all() can then be used to calculate the mean.
df %>%
select(group, 3) %>%
group_by(group) %>%
summarise_all(
funs(mean)
)
I'd like to create a function that can calculate the moving mean for a variable number of last observations and different variables. Take this as mock data:
df = expand.grid(site = factor(seq(10)),
year = 2000:2004,
day = 1:50)
df$temp = rpois(dim(df)[1], 5)
Calculating for 1 variable and a fixed number of last observations works. E.g. this calculates the average of the temperature of the last 5 days:
library(dplyr)
library(zoo)
df <- df %>%
group_by(site, year) %>%
arrange(site, year, day) %>%
mutate(almost_avg = rollmean(x = temp, 5, align = "right", fill = NA)) %>%
mutate(avg = lag(almost_avg, 1))
So far so good. Now trying to functionalize fails.
avg_last_x <- function(dataframe, column, last_x) {
dataframe <- dataframe %>%
group_by(site, year) %>%
arrange(site, year, day) %>%
mutate(almost_avg = rollmean(x = column, k = last_x, align = "right", fill = NA)) %>%
mutate(avg = lag(almost_avg, 1))
return(dataframe) }
avg_last_x(dataframe = df, column = "temp", last_x = 10)
I get this error:
Error in mutate_impl(.data, dots) : k <= n is not TRUE
I understand this is probably related to the evaluation mechanism in dplyr, but I don't get it fixed.
Thanks in advance for your help.
This should fix it.
library(lazyeval)
avg_last_x <- function(dataframe, column, last_x) {
dataframe %>%
group_by(site, year) %>%
arrange(site, year, day) %>%
mutate_(almost_avg = interp(~rollmean(x = c, k = last_x, align = "right",
fill = NA), c = as.name(column)),
avg = ~lag(almost_avg, 1))
}