Hello R and tidyverse wizards,
I am trying to count the rows of the starwars data set to see how many observations we have for the variables "height" and "mass".
I managed to get it with this code:
library(tidyverse)
starwars %>%
  select(height, mass) %>%
  drop_na() %>%
  summarise(across(.cols = c(height, mass),
                   list(obs = ~ n(),
                        mean = mean,
                        sd = sd))) %>%
  View()
I would like to replace obs = ~ n() with the count function, so I tried this version:
library(tidyverse)
starwars %>%
  select(height, mass) %>%
  drop_na() %>%
  summarise(across(.cols = c(height, mass),
                   list(obs = count,
                        mean = mean,
                        sd = sd))) %>%
  View()
but it was too simple to work, classic :p
I got this error message:
Error in View : Problem while computing ..1 = across(...)
And when I removed the View() call, I got a more detailed one:
Error in summarise():
! Problem while computing ..1 = across(...).
Caused by error in across():
! Problem while computing column height_obs.
Caused by error in UseMethod():
! no applicable method for 'count' applied to an object of class "c('integer', 'numeric')"
So I have two questions:
Could someone please explain why the code worked with ~ n() but not with count?
Is it possible to use the count function instead of ~ n() in this case?
Sorry if this is a dumb question, but I am just trying to understand the across and count functions by playing with them.
The function description says that "df %>% count(a, b) is roughly equivalent to df %>% group_by(a, b) %>% summarise(n = n())", so I assume that using count() within across results in something like a nested summarise call, which is why n() is used instead.
Edit: Here you find the solution in the comment by G. Grothendieck
What is the difference between n() and count() in R? When should one favour the use of either or both?
n() returns a number
count() returns a dataframe
count() takes a dataframe as its first argument. It then returns counts for columns within that dataframe, passed as additional arguments. e.g.,
library(dplyr)
count(starwars, mass, height)
When you put count() inside across(), each selected column is passed to count() on its own, as a bare vector, without a dataframe as the first argument. That is equivalent to running:
count(starwars$height)
count(starwars$mass)
Because count() expects a dataframe as the first argument, it throws an error.
n(), on the other hand, doesn't take any arguments, and simply returns the number of rows in the current group (or in the whole dataframe if it is ungrouped). You have to include the ~, as otherwise across() will try passing each column to n(), which causes an error since n() doesn't accept arguments.
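To make the difference concrete, here is a minimal sketch, assuming dplyr >= 1.0.0 and tidyr are installed. The names complete and res are only illustrative. It also shows one possible replacement for ~ n() inside across(): because each function there receives a single column vector, a vector-based counter such as length() can be passed by name.

```r
library(dplyr)
library(tidyr)  # for drop_na()

complete <- starwars %>%
  select(height, mass) %>%
  drop_na()

# count() takes a data frame and returns a data frame
# (here a one-row tibble with a single column n)
count(complete)

# n() takes no arguments and returns a single number inside summarise()
complete %>% summarise(obs = n())

# Inside across(), each function is called with one column vector at a time,
# so length() works where count() does not
res <- complete %>%
  summarise(across(c(height, mass),
                   list(obs = length, mean = mean, sd = sd)))
res
```

Since drop_na() has already removed incomplete rows, length() and n() give the same number here. On data that still contains NAs, something like ~ sum(!is.na(.x)) would count only the non-missing values of each column.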
I would like to select columns by name within sum() using the sequence operator (:).
library(dplyr)
library(tidyverse)
df=data.frame(
X=c("A","B","C"),
X.1=c(1,2,3),X.2=c(1,2,3),X.3=c(1,2,3),X.4=c(1,2,3),X.5=c(1,2,3),X.6=c(1,2,3),X.7=c(1,2,3),X.8=c(1,2,3),X.9=c(1,2,3),X.10=c(1,2,3),
X.11=c(1,2,3),X.12=c(1,2,3),X.13=c(1,2,3),X.14=c(1,2,3),X.15=c(1,2,3),X.16=c(1,2,3),X.17=c(1,2,3),X.18=c(1,2,3),X.19=c(1,2,3),X.20=c(1,2,3),
X.21=c(1,2,3),X.22=c(1,2,3),X.23=c(1,2,3),X.24=c(1,2,3),X.25=c(1,2,3),X.26=c(1,2,3),X.27=c(1,2,3),X.28=c(1,2,3),X.29=c(1,2,3),X.30=c(1,2,3),
X.31=c(1,2,3),X.32=c(1,2,3),X.33=c(1,2,3),X.34=c(1,2,3),X.35=c(1,2,3),X.36=c(1,2,3),X.37=c(1,2,3),X.38=c(1,2,3),X.39=c(1,2,3),X.40=c(1,2,3),
X.41=c(1,2,3),X.42=c(1,2,3),X.43=c(1,2,3),X.44=c(1,2,3),X.45=c(1,2,3),X.46=c(1,2,3),X.47=c(1,2,3),X.48=c(1,2,3),X.49=c(1,2,3),X.50=c(1,2,3),
X.51=c(1,2,3),X.52=c(1,2,3),X.53=c(1,2,3),X.54=c(1,2,3),X.55=c(1,2,3),X.56=c(1,2,3))
Is there a quicker way to do this? The following gives the correct result. However, for large datasets (larger than this one) it becomes very laborious, especially when pivot_wider is used and the columns are not created beforehand (as they are above).
df %>% rowwise() %>% mutate(
Result_column=case_when(
X=="A"~ sum(c(X.1,X.2,X.3,X.4,X.5)),
X=="B"~ sum(c(X.4,X.5)),
X=="C" ~ sum(c( X.3, X.4, X.5, X.6, X.7, X.8, X.9, X.10, X.11, X.12, X.13, X.14, X.15, X.16,
X.17, X.18, X.19, X.20, X.21, X.22, X.23, X.24, X.25, X.26, X.27, X.28, X.29, X.30,
X.31, X.32, X.33, X.34, X.35, X.36, X.37, X.38, X.39, X.40, X.41, X.42,X.43, X.44,
X.45, X.46, X.47, X.48, X.49, X.50, X.51, X.52, X.53, X.54, X.55, X.56)))) %>% dplyr::select(Result_column)
The following is how it would look with "select"-style syntax, which is what I would like to use. However, it does not give the correct numerical result. Using the sequence operator ":" would shorten the code by about 50 entries.
df %>% rowwise() %>% mutate(
Result_column=case_when(
X=="A"~ sum(c(X.1:X.5)),
X=="B"~ sum(c(X.4:X.5)),
X=="C" ~ sum(c(X.3:X.56)))) %>% dplyr::select(Result_column)
Below is a related question. It is not the same, though, because what I need is not every column that starts with "X" but a sequence of columns.
Using mutate rowwise over a subset of columns
EDIT:
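The code provided below by cnbrowlie is correct. It also works without rowwise(), because rowSums() already operates row by row:
df %>% mutate(
Result_column=case_when(
X=="A"~ rowSums(across(X.1:X.5)),
X=="B"~ rowSums(across(X.4:X.5)),
X=="C" ~ rowSums(across(X.3:X.56)))) %>% dplyr::select(Result_column)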
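This can be done with dplyr >= 1.0.0 using rowSums() (which computes the sum for each row across multiple columns) and across() (which superseded vars() as a way of specifying columns in a dataframe, and allows : to select a sequence of columns). Inside sum(c(X.1:X.5)), by contrast, : is base R's sequence operator applied to the cell values of X.1 and X.5, not a column selection, which is why that version returns the wrong numbers. The working version: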
df %>% rowwise() %>% mutate(
Result_column=case_when(
X=="A"~ rowSums(across(X.1:X.5)),
X=="B"~ rowSums(across(X.4:X.5)),
X=="C" ~ rowSums(across(X.3:X.56))
)
) %>% dplyr::select(Result_column)
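As a quick check of the approach, here is a minimal sketch on a smaller stand-in data frame, assuming dplyr >= 1.0.0. The names df_small and out are hypothetical, and only columns X.1 to X.6 are created. It drops rowwise(), since rowSums() is already computed per row.

```r
library(dplyr)

# Small stand-in for the wide data frame in the question
df_small <- data.frame(
  X   = c("A", "B", "C"),
  X.1 = c(1, 2, 3), X.2 = c(1, 2, 3), X.3 = c(1, 2, 3),
  X.4 = c(1, 2, 3), X.5 = c(1, 2, 3), X.6 = c(1, 2, 3)
)

out <- df_small %>%
  mutate(Result_column = case_when(
    X == "A" ~ rowSums(across(X.1:X.5)),  # 5 columns, value 1 in row A
    X == "B" ~ rowSums(across(X.4:X.5)),  # 2 columns, value 2 in row B
    X == "C" ~ rowSums(across(X.3:X.6))   # 4 columns, value 3 in row C
  )) %>%
  select(X, Result_column)

out
```

For rows A, B and C this yields 5, 4 and 12. In dplyr 1.1.0 and later, pick(X.1:X.5) can be used in place of across(X.1:X.5) when the goal is only to select columns.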
I am new to the dplyr package and am trying to use it for my visualization assignment. I am able to pipe my data to ggplot() but unable to do that with plot(). I came across this post, but the answers, including the one in the comments, didn't work for me.
Code 1:
emission <- mynei %>%
select(Emissions, year) %>%
group_by(year) %>%
summarise (total=sum(Emissions))
emission %>%
plot(year, total,.)
I get the following error:
Error in plot(year, total, emission) : object 'year' not found
Code 2:
mynei %>%
select(Emissions, year) %>%
group_by(year) %>%
summarise (total=sum(Emissions))%>%
plot(year, total, .)
This didn't work either and returned the same error.
Interestingly, the solution from the post I mentioned works for the same dataset but doesn't work out for my own data. However, I am able to create the plot using emission$year and emission$total.
Am I missing anything?
plot.default doesn't take a data argument, so your best bet is to pipe to with:
mynei %>%
select(Emissions, year) %>%
group_by(year) %>%
summarise (total=sum(Emissions))%>%
with(plot(year, total))
In case anyone missed @aosmith's comment on the question, plot.formula does have a data argument, but of course the formula is the first argument, so we need to use the . to put the data in the right place. So another option is
... %>%
plot(total ~ year, data = .)
Of course, ggplot takes data as the first argument, so to use ggplot do:
... %>%
ggplot(aes(x = year, y = total)) + geom_point()
lattice::xyplot is like plot.formula: there is a data argument, but it's not first, so:
... %>%
xyplot(total ~ year, data = .)
Just look at the documentation and make sure you use a . if data isn't the first argument. If there's no data argument at all, using with is a good work-around.
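The two base-graphics options above can be tried on a self-contained example. This is only a sketch: mynei_demo is a hypothetical stand-in for the asker's mynei data, and pdf(NULL) is there solely so that the code also runs in a non-interactive session without writing a file.

```r
library(dplyr)

# Hypothetical stand-in for the asker's `mynei` data
mynei_demo <- data.frame(
  year      = c(1999, 1999, 2002, 2002),
  Emissions = c(10, 20, 5, 15)
)

emission <- mynei_demo %>%
  select(Emissions, year) %>%
  group_by(year) %>%
  summarise(total = sum(Emissions))

pdf(NULL)  # null graphics device; drop this line in an interactive session

# plot.default has no data argument, so evaluate the call inside the data
emission %>% with(plot(year, total))

# the formula method has a data argument, but it is not the first one,
# so the piped data is placed explicitly with `.`
emission %>% plot(total ~ year, data = .)

dev.off()
```

In both calls the column names are resolved inside the data frame, which is exactly what plot(year, total, .) fails to do.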
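As an alternative, you can use the %$% operator from magrittr to access the columns of a dataframe directly. Note that library(dplyr) and library(tidyverse) only make %>% available, so magrittr has to be loaded explicitly with library(magrittr). For example: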
iris %$%
plot(Sepal.Length~Sepal.Width)
This is often useful when you need to feed the result of a dplyr chain to a base R function (such as table, lm, plot, etc). It can also be used to extract a column from a dataframe as a vector, e.g.:
iris %>% filter(Species=='virginica') %$% Sepal.Length
This is the same as:
iris %>% filter(Species=='virginica') %>% pull(Sepal.Length)
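A minimal sketch to confirm that the two forms return the same vector, assuming magrittr and dplyr are installed. The names via_exposition and via_pull are just illustrative.

```r
library(magrittr)  # provides %$%
library(dplyr)

# Extract a column with the exposition operator
via_exposition <- iris %>%
  filter(Species == "virginica") %$%
  Sepal.Length

# Extract the same column with pull()
via_pull <- iris %>%
  filter(Species == "virginica") %>%
  pull(Sepal.Length)

identical(via_exposition, via_pull)
```

pull() keeps the chain within dplyr, while %$% is the more general tool because the right-hand side can be any expression that refers to the columns.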