R: doesn't recognise column in a new table

This is part of an online course I am doing, R for data analysis.
A tibble is created using the group_by and summarise functions on the diamonds data set - the new tibble indeed exists and looks as you would expect, I checked. Now a bar plot has to be created using these summary values in the new tibble, but it gives me all sorts of errors associated with not recognising the columns.
I transformed the tibble into a data frame, and still get the same problem.
Here is the code:
diamonds_by_color <- group_by(diamonds, color)
diamonds_mp_by_color <- summarise(diamonds_by_color, mean_price = mean(price))
diamonds_mp_by_color <- as.data.frame(diamonds_mp_by_color)
colorcounts <- count(diamonds_by_color$mean_price)
colorbarplot <- barplot(diamonds_by_color$mean_price, names.arg = diamonds_by_color$color,
main = "Average price for different colour diamonds")
The error I get when running the function count is:
Error in UseMethod("summarise_") :
no applicable method for 'summarise_' applied to an object of class "NULL"
In addition: Warning message:
Unknown or uninitialised column: 'mean_price'.
It's probably something trivial but I have been reading quite a lot and tried a few things and can't figure it out. Any help will be super appreciated :)

Your diamonds_by_color never has mean_price assigned to it.
Your last two lines of code work if you reference diamonds_mp_by_color instead:
colorcounts <- count(diamonds_mp_by_color, mean_price)
barplot(diamonds_mp_by_color$mean_price,
names.arg=diamonds_mp_by_color$color,
main="Average price for different colour diamonds")
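To see why the original lines fail, it helps to check which columns each object has: group_by() only records the grouping, and mean_price first appears in the output of summarise(). A minimal sketch on a made-up data frame (toy, by_color and mp are illustrative names, not from the question):

```r
library(dplyr)

# Hypothetical stand-in for diamonds: three stones, two colours
toy <- data.frame(color = c("D", "D", "E"), price = c(100, 300, 500))

by_color <- group_by(toy, color)                      # adds grouping only
mp <- summarise(by_color, mean_price = mean(price))   # one row per colour

names(by_color)   # "color" "price"  (no mean_price here)
names(mp)         # "color" "mean_price"
```

So any later call that needs mean_price has to use the summarised object, not the grouped one.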

Here is a way to summarise the price by color using dplyr and piping straight to a barplot using ggplot2.
diamonds %>%
group_by(color) %>%
summarise(mean.price = mean(price, na.rm = TRUE)) %>%
ggplot(aes(color, mean.price)) + geom_bar(stat = 'identity')

The usual dplyr idiom is to avoid declaring a temporary result for each operation and to write one pipe instead. The %>% notation is also clearer because you don't have to keep specifying the data frame as the first argument of each operation. barplot() does not take a data frame as its first argument, so the last step is wrapped in with():
diamonds %>%
group_by(color) %>%
summarise(mean_price = mean(price)) %>%
with(barplot(mean_price, names.arg = color,
main = "Average price for different colour diamonds"))
(Something like that. You can assign the output of the pipe before the barplot if you like. I'm transiting through an airport so I can't check it in R.)

Related

I am trying to make a table with percentages after having used pivot_wider in R

I am currently trying to make a table with percentages after having used the pivot_wider function on a variable. htrisk is the data file, and menopaus and invasive are variables. Using the following code:
pivot_test <- htrisk %>%
group_by(menopaus, invasive) %>%
count(invasive, name = "n") %>%
pivot_wider(names_from = invasive, values_from = n, values_fill = 0)
pivot_test
[Image: current table with wanted changes]
I get the table above which is what I want, but I want to add two percentage columns which show the percents for let's say pre-meno/no and pre-meno/yes. Then for post-meno/no and post-meno/yes.
I have tried using the prop.table but I get the error "Error in FUN(X[[i]], ...) :
only defined on a data frame with all numeric-alike variables".
Any help or direction would be much appreciated!
With dplyr, use mutate to add new columns.
pivot_test %>%
mutate(
pct_no = No / (No + Yes),
pct_yes = 1 - pct_no
)
If you need more help, please share enough sample data in valid R syntax to make a reproducible example, e.g. dput(pivot_test[1:3, ]) for the first 3 rows.
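As a rough illustration of the mutate() approach, here is a sketch on invented counts; toy_wide and its numbers are assumptions, not the asker's htrisk data. It also shows one way prop.table() can be made to work, namely by dropping the non-numeric label column first:

```r
library(dplyr)

# Hypothetical counts in the shape pivot_wider() produces
toy_wide <- data.frame(menopaus = c("pre", "post"),
                       No = c(30, 10), Yes = c(10, 30))

# Row-wise shares added as new columns
toy_pct <- toy_wide %>%
  mutate(pct_no = No / (No + Yes),
         pct_yes = 1 - pct_no)

# prop.table() wants numeric input only, so keep just the count columns;
# margin = 1 makes each row sum to 1
row_props <- prop.table(as.matrix(toy_wide[, c("No", "Yes")]), margin = 1)
```

Multiply by 100 and round() if the table should show percentages rather than proportions.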

Counting in ggplot2

I'm new to R and looking to get some help/explanation on why my code is doing what it is doing. I've started doing the Tidy Tuesday projects to better learn R, so that is where the data is from. Tidy Tuesday information
Goal:
The end result I am looking to do is sort my bar graph by which country's runners had the most first place finishes from the data and only display the top 10.
Thought process
In my head, how this would happen would be having R add up each instance of the country and have it saved into a variable.
So my first attempt is returning this:
The top_N is something I found googling around, but if I take it out, it does look right, just not limited to the top ten.
Questions:
Am I using reorder correctly to control the order of nationalities?
What is the best way to limit which results are shown?
Where exactly in the code is it counting each nationality? I'm thinking it is in the sum, but I'm not 100% sure. Most examples I've found of this used it for numerical values, not strings, and that has me a bit confused.
library(tidyverse)
library(ggplot2)
library(readr)
library(dplyr)
ultra_rankings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-26/ultra_rankings.csv')
race <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-26/race.csv')
ultra_rankings %>%
filter(rank == '1') %>% #Only looks at rows that have a first place finish
top_n(10, nationality) %>% #I think this is what is throwing me off
ggplot(aes(x = reorder(nationality, -rank, sum), y = rank))+geom_bar(stat = "identity")+
labs(title = "First Place Rankings by Country", caption = "Data from runrepeat.com")+
theme(plot.title = element_text(hjust = .5))+ylab("Total First Place Finishes")+xlab("Runner Nationalities")
Try this:
gt <- ultra_rankings %>%
filter(rank == 1) %>%
group_by(nationality) %>%
count(nationality) %>%
arrange(-n) %>%
head(10)
Then we have to change the factor to preserve sort order
gt$nationality <- factor(gt$nationality, levels = unique(gt$nationality))
Now it can be plotted:
ggplot(data=gt,aes(x=nationality,y=n))+geom_bar(stat="identity")
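The counting in this approach happens in count(), which returns one row per nationality with the number of matching rows in a column n. A small sketch on invented rows may make that concrete (toy_rank and wins are hypothetical names, and sort = TRUE is used here in place of the separate arrange() step):

```r
library(dplyr)

# Hypothetical results: six first places and one second place
toy_rank <- data.frame(
  nationality = c("FRA", "USA", "FRA", "ESP", "FRA", "USA", "ESP"),
  rank        = c(1, 1, 1, 2, 1, 1, 1)
)

wins <- toy_rank %>%
  filter(rank == 1) %>%              # keep first-place rows only
  count(nationality, sort = TRUE) %>% # one row per country, largest n first
  head(2)                            # keep the top 2 (top 10 in the real data)

# Fix the level order so a plot keeps the sorted order
wins$nationality <- factor(wins$nationality, levels = unique(wins$nationality))
```

With the real data, head(10) in place of head(2) gives the top ten countries.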

How to Create Multiple Frequency Tables with Percentages Across Factor Variables using Purrr::map

library(tidyverse)
library(ggmosaic) # for the "happy" dataset
I feel like this should be a somewhat simple thing to achieve, but I'm having difficulty with percentages when using purrr::map together with table(). Using the "happy" dataset, I want to create a list of frequency tables for each factor variable. I would also like to have rounded percentages instead of counts, or both if possible.
I can create frequency percentages for each factor variable separately with the code below.
with(happy,round(prop.table(table(marital)),2))
However I can't seem to get the percentages to work correctly when using table() with purrr::map. The code below doesn't work...
happy%>%select_if(is.factor)%>%map(round(prop.table(table)),2)
The second method I tried was using tidyr::gather, and calculating the percentage with dplyr::mutate and then splitting the data and spreading with tidyr::spread.
TABLE<-happy%>%select_if(is.factor)%>%gather()%>%group_by(key,value)%>%summarise(count=n())%>%mutate(perc=count/sum(count))
However, since there are different factor variables, I would have to split the data by "key" before spreading using purrr::map and tidyr::spread, which came close to producing some useful output except for the repeating "key" values in the rows and the NA's.
TABLE%>%split(TABLE$key)%>%map(~spread(.x,value,perc))
So any help on how to make both of the above methods work would be greatly appreciated...
You can use an anonymous function or a formula to get your first option to work. Here's the formula option.
happy %>%
select_if(is.factor) %>%
map(~round(prop.table(table(.x)), 2))
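The same pattern can be tried without the happy dataset. The sketch below uses a made-up data frame (toy and tabs are illustrative names) with two factor columns and one numeric column that select_if() drops:

```r
library(dplyr)
library(purrr)

# Hypothetical data: two factors and one non-factor column
toy <- data.frame(a = factor(c("x", "x", "y", "y")),
                  b = factor(c("p", "q", "q", "q")),
                  n = 1:4)

tabs <- toy %>%
  select_if(is.factor) %>%                 # keep only the factor columns
  map(~ round(prop.table(table(.x)), 2))   # one proportion table per column

names(tabs)   # "a" "b"
```

Each list element is a table of rounded proportions, named after the column it was computed from.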
In your second option, removing the NA values and then removing the count variable prior to spreading helps. The order in the result has changed, however.
TABLE = happy %>%
select_if(is.factor) %>%
gather() %>%
filter(!is.na(value)) %>%
group_by(key, value) %>%
summarise(count = n()) %>%
mutate(perc = round(count/sum(count), 2), count = NULL)
TABLE %>%
split(.$key) %>%
map(~spread(.x, value, perc))

Is it possible to combine density plots of two separate variables with ggvis

I feel like I've searched everywhere for this. Essentially, I have time series data of multiple numeric variables, and I want to create one single plot that has the density functions of two or more variables on it.
So essentially I have:
df %>% ggvis(~y1) %>% layer_densities()
df %>% ggvis(~y2) %>% layer_densities()
but if I do something like:
df %>% ggvis(~y1) %>% layer_densities() %>% layer_densities(~y2)
I get the following error:
Error in eval(expr, envir, enclos) : object 'y2' not found
I feel like this shouldn't be too difficult, but I can't figure it out. I don't think I am supposed to use group_by, because these are two separate variables with no shared factors or characteristics. Any help would be appreciated.
You can work-around by reshaping your dataset so you have a grouping variable in one column and the values of both columns you want to plot in another. I do the work via melt from reshape2.
library(reshape2)
df2 = melt(df, measure.vars = c("y1", "y2"))
Once you do that you can use group_by to get a separate density layer for each variable.
df2 %>% group_by(variable) %>%
ggvis(~value) %>%
layer_densities()
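For readers without reshape2, the reshaping step can be sketched with tidyr::pivot_longer() instead. This swaps in a different function from the one in the answer, and toy_df and toy_long are invented names. The point is only to show the long shape that the grouped density layer needs:

```r
library(tidyr)

# Hypothetical wide data with the two variables to compare
toy_df <- data.frame(y1 = c(1, 2, 3), y2 = c(4, 5, 6))

# One row per (variable, value) pair, like melt() with measure.vars
toy_long <- pivot_longer(toy_df, cols = c(y1, y2),
                         names_to = "variable", values_to = "value")

names(toy_long)   # "variable" "value"
```

The resulting variable column plays the same role in group_by() as the one created by melt().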
In ggplot you can set color = "your time variable" in aes() to get this density.

Using dplyr, how to pipe or chain to plot()?

I am new to the dplyr package and am trying to use it for my visualization assignment. I am able to pipe my data to ggplot() but unable to do that with plot(). I came across this post, but the answers, including the one in the comments, didn't work for me.
Code 1:
emission <- mynei %>%
select(Emissions, year) %>%
group_by(year) %>%
summarise (total=sum(Emissions))
emission %>%
plot(year, total,.)
I get the following error:
Error in plot(year, total, emission) : object 'year' not found
Code 2:
mynei %>%
select(Emissions, year) %>%
group_by(year) %>%
summarise (total=sum(Emissions))%>%
plot(year, total, .)
This didn't work either and returned the same error.
Interestingly, the solution from the post I mentioned works for the same dataset but doesn't work out for my own data. However, I am able to create the plot using emission$year and emission$total.
Am I missing anything?
plot.default doesn't take a data argument, so your best bet is to pipe to with:
mynei %>%
select(Emissions, year) %>%
group_by(year) %>%
summarise (total=sum(Emissions))%>%
with(plot(year, total))
In case anyone missed @aosmith's comment on the question, plot.formula does have a data argument, but of course the formula is the first argument so we need to use the . to put the data in the right place. So another option is
... %>%
plot(total ~ year, data = .)
Of course, ggplot takes data as the first argument, so to use ggplot do:
... %>%
ggplot(aes(x = year, y = total)) + geom_point()
lattice::xyplot is like plot.formula: there is a data argument, but it's not first, so:
... %>%
xyplot(total ~ year, data = .)
Just look at the documentation and make sure you use a . if data isn't the first argument. If there's no data argument at all, using with is a good work-around.
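The two base-graphics options above can be tried end to end on a small invented data set. In this sketch toy_nei stands in for mynei and is an assumption, and pdf(NULL) is only there so the example draws to a null device instead of opening a window or writing a file:

```r
library(dplyr)

# Hypothetical stand-in for mynei
toy_nei <- data.frame(year = c(1999, 1999, 2002, 2002),
                      Emissions = c(1, 2, 3, 4))

totals <- toy_nei %>%
  group_by(year) %>%
  summarise(total = sum(Emissions))

pdf(NULL)                                 # null device: nothing is written
totals %>% with(plot(year, total))        # no data argument: evaluate inside with()
totals %>% plot(total ~ year, data = .)   # formula method: place the data with .
invisible(dev.off())
```

Both calls draw the same two points; they differ only in how the piped data frame reaches plot().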
As an alternative, you can use the %$% operator from magrittr to be able to access the columns of a dataframe directly. For example:
iris %$%
plot(Sepal.Length~Sepal.Width)
This is often useful when you need to feed the result of a dplyr chain to a base R function (such as table, lm, plot, etc). It can also be used to extract a column from a dataframe as a vector, e.g.:
iris %>% filter(Species=='virginica') %$% Sepal.Length
This is the same as:
iris %>% filter(Species=='virginica') %>% pull(Sepal.Length)
