I have a dataframe so when I try to calculate the mean of column A I just write
mean(df$A)
and it works fine.
But when I try to calculate the mean of only part of the data frame, I get an error saying the argument isn't a number or logical value:
df$A %>% filter(A=="some value") %>% mean(df$A)
The type of A is double. I also tried to convert it to numeric using
df$A <- as.numeric(as.character(df$A))
but it didn't work.
It would be best to provide an example of your column A.
However, just from looking at your question, the problem is in your magrittr/dplyr syntax.
base syntax:
mean(df$A[df$A == 'some value'])
dplyr with pipes:
df %>% filter(A == 2) %>% summarise(average = mean(A))
Careful with syntax and pipes, more info here.
Try df %>% filter(A == "some value") %>% summarise(mean(A)).
Note that the mean of A will simply equal that value, because the filter keeps only the rows where A matches it.
Also, mean() works fine with objects of class double.
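As a minimal sketch of the two approaches above: the data frame below is made up, and the second column B is a hypothetical addition so that the filtered mean is not trivially equal to the filter value. The key point is that filter() expects a data frame, so the pipe starts from df rather than from df$A.

```r
library(dplyr)

# Toy data: column A matches the question, column B is hypothetical.
df <- data.frame(A = c(1, 2, 2, 3), B = c(10, 20, 30, 40))

# Base R: subset the vector with a logical condition, then take the mean.
base_mean <- mean(df$B[df$A == 2])

# dplyr: pipe the whole data frame, filter the rows, then summarise.
pipe_mean <- df %>%
  filter(A == 2) %>%
  summarise(average = mean(B)) %>%
  pull(average)
```

Both base_mean and pipe_mean average the B values of the rows where A is 2.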
I'm working with a data frame and looking to calculate the mean debut age of baseball players.
I can get the answer, but I am a bit confused about why I get different outputs when doing the same thing two ways.
Firstly, when I run the below code I get the correct mean:
mean(as.numeric(players$debut_age)/365, na.rm=TRUE)
But when I reorganize this as a pipe, it instead only prints the vector of days in debut_age:
players$debut_age %>% as.numeric()/365 %>% mean(na.rm=TRUE)
I'm sure there is something simple I'm missing, but I would like to know why these don't produce the same result.
We can use divide_by
library(dplyr)
players$debut_age %>%
as.numeric() %>%
magrittr::divide_by(365) %>%
mean(na.rm = TRUE)
Or wrap the as.numeric() call and the division in a {} block, using . to refer to the piped value
players$debut_age %>%
{as.numeric(.)/365} %>%
mean(na.rm=TRUE)
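The reason the original pipe misbehaves is operator precedence: in R, %>% binds more tightly than /, so the division happens last. The sketch below uses made-up values in place of players$debut_age to show how the expression is parsed and how plain parentheses restore the intended order.

```r
library(magrittr)

# Made-up debut ages in days; a stand-in for players$debut_age.
days <- c(7300, 8030, NA)

# Parsed as (days %>% as.numeric()) / (365 %>% mean(na.rm = TRUE)),
# so the result is still a vector: every element divided by 365.
unexpected <- days %>% as.numeric() / 365 %>% mean(na.rm = TRUE)

# Parentheses force the division to happen before the final pipe step.
expected <- (days %>% as.numeric() / 365) %>% mean(na.rm = TRUE)
```

Here unexpected has one element per input value, while expected is a single mean in years.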
I need to categorize a numeric variable into quartiles and assign each quartile group its median value, using a loop (because my original dataset has lots of variables).
What I intend is to do the following manipulation over many variables:
data(iris)
iris%>%mutate(Sepal.Lengthq=as.factor(ntile(Sepal.Length,4)))%>%
group_by(Sepal.Lengthq)%>%
mutate(Sepal.Lengthq_median=median(Sepal.Length,na.rm=T))
I need a loop, so I wrote code like:
quartilization=c("Sepal.Length","Sepal.Width")
for (i in seq_along(quartilization)){
iris2=iris %>%
mutate(!!str_c(quartilization[i],"q"):=ntile(.[[quartilization[i]]],4)) %>%
group_by_at(vars(one_of(!!str_c(quartilization[i],"q")))) %>%
mutate(!!str_c(quartilization[i],"qn"):=median(.[[quartilization[i]]],na.rm=T)) %>%
ungroup()
}
However, 1) it does not return "Sepal.Lengthqn" and 2) "Sepal.Widthqn" has the same value for every sample.
I feel like the syntax for the median function is wrong, but I cannot fix it.
I would appreciate any input. Thank you.
When you use ., you refer to the entire data frame, hence you get the same value for all the rows. Use .data in median to get the data in the current group. The loop also rebuilds iris2 from iris on every iteration, so only the columns for the last variable survive.
I use map_dfc instead of a for loop because it is easier and shorter. I also use transmute instead of mutate because mutate returns all the columns every time, whereas transmute returns only the changed columns, which can then be bound to the original data frame.
library(dplyr)
library(purrr)
library(stringr)
quartilization=c("Sepal.Length","Sepal.Width")
bind_cols(iris, map_dfc(quartilization, ~{
iris %>%
group_by(!!str_c(.x,"q") := ntile(.[[.x]],4)) %>%
transmute(!!str_c(.x,"qn"):= median(.data[[.x]],na.rm=TRUE))
}))
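To illustrate the difference between . and .data inside a grouped mutate, here is a small sketch on a made-up data frame. The names toy, g, x and col are hypothetical and stand in for iris, the quartile column, and the variable name held in the loop.

```r
library(dplyr)

# Hypothetical toy data: two groups with clearly different medians.
toy <- data.frame(g = c("a", "a", "b", "b"), x = c(1, 3, 10, 20))
col <- "x"

res <- toy %>%
  group_by(g) %>%
  mutate(
    whole = median(.[[col]]),       # . is the full piped data frame, ignoring groups
    grouped = median(.data[[col]])  # .data sees only the rows of the current group
  ) %>%
  ungroup()
```

The whole column holds the overall median in every row, while grouped holds one median per group.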
I obviously get an error with the code below, but I was hoping to summarise the same column by its mean and median, and also count how many points are in each polygon, all within the same pipe. Any help would be great.
Nin_Sep_points_sf_joined <-
st_join(merged_ten_seven_shp, Nin_Sep_sf_3011) %>%
filter(!is.na(Employment_diff)) %>%
group_by(Kod) %>%
summarise(Count=mean(as.numeric(as.character(price)))), summarise(Count_tot=n()), summarise(Count=median(as.numeric(as.character(price))))
You can supply multiple arguments to summarise, separated by commas:
library(dplyr)
library(sf)  # provides st_join()
Nin_Sep_points_sf_joined <-
  st_join(merged_ten_seven_shp, Nin_Sep_sf_3011) %>%
  filter(!is.na(Employment_diff)) %>%
  group_by(Kod) %>%
  # the mean and the median need distinct output names
  summarise(Mean = mean(as.numeric(as.character(price))),
            Count_tot = n(),
            Median = median(as.numeric(as.character(price))))
Note that you can even refer to the results of previous arguments in the next argument. So you could calculate SD based on Count_tot.
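As a rough sketch of referring to earlier results inside the same summarise call, the example below computes a standard error from a group's SD and its row count. The sales data frame and the SD and SE column names are made up for illustration; only Kod, price and Count_tot come from the question.

```r
library(dplyr)

# Made-up data standing in for the joined points.
sales <- data.frame(
  Kod = c("x", "x", "x", "y", "y"),
  price = c(10, 20, 30, 5, 15)
)

stats <- sales %>%
  group_by(Kod) %>%
  summarise(
    Count_tot = n(),
    Mean = mean(price),
    SD = sd(price),
    SE = SD / sqrt(Count_tot)  # reuses the two summaries defined just above
  )
```

Because arguments are evaluated in order, SE can only use columns that appear before it in the call.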
I'm basically looking for the equivalent of the following python code in R:
df.groupby('Categorical')['Count'].count()[0]
The following is what I'm doing in R:
by(df$count,df$Categorical,sum)
This accomplishes the same thing as the first snippet, but I'd like to know how to store an indexed value in a variable in R (I'm new to R).
Based on the by code, it seems like we can use the following (assuming that 'count' is a column of 1s)
library(dplyr)
out <- df %>%
group_by(Categorical) %>%
summarise(Sum = sum(count))
If the 'count' column has other values as well, note that the Python code takes the frequency count of the 'Categorical' column. So, a similar option would be
out <- df %>%
count(Categorical) %>%
slice(1) %>%
pull(n)
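A small sketch of storing one of those counts in a variable, either by position (like the [0] in the Python snippet) or by label. The data frame contents here are made up.

```r
library(dplyr)

# Made-up data with a single categorical column.
df <- data.frame(Categorical = c("b", "a", "b", "a", "b"))

# One row per level, sorted by level, with the frequency in column n.
counts <- df %>% count(Categorical)

# By position: the count for the first level.
first_count <- counts %>% slice(1) %>% pull(n)

# By label: the count for a specific level.
b_count <- counts %>% filter(Categorical == "b") %>% pull(n)
```

Selecting by label is usually less fragile than selecting by position, since it does not depend on how the levels are sorted.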
I am trying to rewrite this expression to magrittr’s pipe operator:
print(mean(pull(df, height), na.rm=TRUE))
which returns 175.4 for my dataset.
I know that I have to start with the data frame and write df %>%, but I’m confused about how to turn the expression inside out. For example, should the na.rm=TRUE go inside mean(), pull() or print()?
UPDATE: I actually figured it out by trial and error...
> df %>%
+   pull(height) %>%
+   mean(na.rm=TRUE) %>%
+   print()
returns 175.4
It would be good practice to make a reproducible example, with dummy data like this:
height <- 1:30
weight <- 1:30
df <- data.frame(height, weight)
The pipe operator works with most of the tidyverse (not just magrittr); the functions you are piping into here, such as pull, come from dplyr. The na.rm=TRUE argument is needed for many summary functions such as mean and sd, as well as min and max, whenever the data contain NA values: without it, they return NA.
df %>% pull(height) %>% mean(na.rm=TRUE) %>% print()
Unless your data is nested, you may not even need to use pull.
df %>% summarise(mean = mean(height, na.rm=TRUE))
Also, with summarise you can store the results in another data frame rather than just printing them, and pull them out of that data frame whenever you want. The name height_summary is used here to avoid masking the base summary() function.
height_summary <- df %>% summarise(meanHt = mean(height, na.rm=TRUE), sdHt = sd(height, na.rm=TRUE))
height_summary[1]
height_summary[2]
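One detail worth knowing when reading values back out: single-bracket indexing on a data frame returns another data frame, while $ returns the bare number. A minimal sketch with made-up heights (the name height_stats is hypothetical):

```r
library(dplyr)

# Made-up heights, including a missing value.
df <- data.frame(height = c(170, 180, NA))

height_stats <- df %>%
  summarise(meanHt = mean(height, na.rm = TRUE),
            sdHt = sd(height, na.rm = TRUE))

as_column <- height_stats[1]      # still a one-column data frame
as_number <- height_stats$meanHt  # a plain numeric value
```

Use the $ form (or pull) when the value feeds into further arithmetic.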