I'm new to R and looking for some help/explanation on why my code is doing what it is doing. I've started doing the Tidy Tuesday projects to better learn R, so that is where the data is from (see the Tidy Tuesday information).
Goal:
The end result I want is a bar graph sorted by which country's runners had the most first-place finishes in the data, showing only the top 10.
Thought process
In my head, R would add up each instance of a country and save the total into a variable.
So my first attempt is returning this:
The top_n call is something I found googling around; if I take it out, the plot does look right, just not limited to the top ten.
Questions:
Am I using reorder correctly to control the order of nationalities?
What is the best way to limit which results are shown?
Where exactly in the code is it counting each nationality? I think it is in the sum, but I'm not 100% sure. Most examples I've found used it for numerical values, not strings, and that has me a bit confused.
library(tidyverse)
library(ggplot2)
library(readr)
library(dplyr)
ultra_rankings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-26/ultra_rankings.csv')
race <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-26/race.csv')
ultra_rankings %>%
filter(rank == '1') %>% #Only looks at rows that have a first place finish
top_n(10, nationality) %>% #I think this is what is throwing me off
ggplot(aes(x = reorder(nationality, -rank, sum), y = rank))+geom_bar(stat = "identity")+
labs(title = "First Place Rankings by Country", caption = "Data from runrepeat.com")+
theme(plot.title = element_text(hjust = .5))+ylab("Total First Place Finishes")+xlab("Runner Nationalities")
Try this:
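gt <- ultra_rankings %>%
  filter(rank == 1) %>%       # keep only first-place finishes
  count(nationality) %>%      # one row per nationality, with the count in n
  arrange(desc(n)) %>%        # most wins first
  head(10)                    # keep the top 10
Then we have to convert nationality to a factor whose levels follow the sorted order, so that ggplot keeps that order instead of sorting alphabetically: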
gt$nationality <- factor(gt$nationality, levels = unique(gt$nationality))
Now it can be plotted:
ggplot(data=gt,aes(x=nationality,y=n))+geom_bar(stat="identity")
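To see where the counting happens, here is a minimal sketch on made-up data (the toy tibble and the wins column name are assumptions, not part of the Tidy Tuesday data). In the question's code, reorder(nationality, -rank, sum) sums -rank per nationality; assuming rank is numeric, that equals minus the number of rows, because every remaining rank is 1. top_n(10, nationality), on the other hand, appears to rank rows by the nationality string itself rather than by a count. Counting explicitly before plotting makes both steps visible:

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for ultra_rankings
toy <- tibble(
  nationality = c("USA", "FRA", "USA", "GBR", "FRA", "USA", "ESP"),
  rank        = c(1, 1, 1, 1, 1, 1, 2)
)

top <- toy %>%
  filter(rank == 1) %>%                       # first-place finishes only
  count(nationality, name = "wins") %>%       # this is the counting step
  slice_max(wins, n = 2, with_ties = FALSE)   # keep the top 2 (use n = 10 on real data)

# reorder() sorts the bars by descending number of wins
p <- ggplot(top, aes(x = reorder(nationality, -wins), y = wins)) +
  geom_col()
```

Calling print(p) draws the chart. Using reorder() inside aes() is an alternative to rebuilding the factor by hand.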
Related
In my dataset I have two columns, named part_1 and part_2, that contain several numerical values.
I am required to create a graph that shows how the average varies in the two parts.
I think that the best way is to create a barplot with a bar for each part, but I'm not sure about it.
First, I created two new columns that contain the mean values for the two parts in each row:
averages <- my_data %>% mutate(avg_part1=mean(part_1,na.rm=T)) %>% mutate(avg_part2=mean(part_2,na.rm=T))
Then, I inserted the values in two new variables:
avg_part1 <- averages %>% slice(1) %>% pull(avg_part1)
avg_part2 <- averages %>% slice(1) %>% pull(avg_part2)
To create the plot I did:
to_graph<-c("First part"=avg_part1,"Second part"=avg_part2)
barplot(to_graph)
And I obtained the graph I wanted, but it's not very nice to see.
I feel like this process is too complex and I may be able to do everything in a couple lines and without creating so many new variables, do you have any suggestions?
Also, I would prefer to create the graph with ggplot because it's better to improve the design, but I don't really know how to do it.
Thanks!
Using ggplot:
library(ggplot2)
library(dplyr)
my_data %>%
  stack(select = c(part_1, part_2)) %>%    # long format: columns values and ind
  ggplot(aes(x = ind, y = values)) +
  geom_bar(stat = "summary", fun = mean)   # bar height = mean of each part
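If you would rather compute the two averages explicitly before plotting, a sketch along the following lines may help. The contents of my_data here are invented for illustration, and the part and average column names are assumptions:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for my_data
my_data <- tibble(
  part_1 = c(1, 2, 3, NA),
  part_2 = c(4, 6, 8, 10)
)

avgs <- my_data %>%
  # one row holding the mean of each column, ignoring NAs
  summarise(across(c(part_1, part_2), ~ mean(.x, na.rm = TRUE))) %>%
  # reshape to one row per part
  pivot_longer(everything(), names_to = "part", values_to = "average")

p <- ggplot(avgs, aes(x = part, y = average)) +
  geom_col()
```

This avoids the intermediate avg_part1 and avg_part2 variables, and avgs can be inspected before it is plotted.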
I have a calculation that I have to perform for 23 people (each person has a varying number of rows, so it is difficult to do in Excel). What I'd like to do is take the total time each person took to complete a test and divide it into 5 time categories (20% each) so that I can look at their reaction time in more detail.
I will just do this by hand but it will take quite a while because they have 8 sets of data each. I'm hoping someone can show me the best way to use a loop or automate this process even just a little. I have tried to understand the examples but I'm afraid I don't have the skill. So by hand I would do it like I have below where I just filter by each subject.
I started by selecting the relevant columns, then filtered by subject so that I could calculate the time they started and the time they finished and used that to create a variable (testDuration) that could be used to create the 20% proportions of RTs that I'm after. I have no idea how to get the individual subjects' test start, end, duration and timeBin sizes to appear in one column. Any help very gratefully received.
Subj1 <- rtTrialsYA_s1 %>%
select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
filter(Subject==1) %>%
summarise(
testStart =
min(RetRating.OnsetTime),
testEnd = max(RetRating.RTTime)
) %>%
mutate(
testDuration = testEnd - testStart,
timeBin =
testDuration/5
)
Subj2 <- rtTrialsYA_s1 %>%
select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
filter(Subject==2) %>%
summarise(
testStart =
min(RetRating.OnsetTime),
testEnd = max(RetRating.RTTime)
) %>%
mutate(
testDuration = testEnd - testStart,
timeBin =
testDuration/5
)
I'm not positive that I understand your code, but this function can be called for any Subject value and then return the output:
myfunction <- function(subjectNumber){
Subj <- rtTrialsYA_s1 %>%
select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
filter(Subject==subjectNumber) %>%
summarise(testStart = min(RetRating.OnsetTime), testEnd = max(RetRating.RTTime)) %>%
mutate(testDuration = testEnd -testStart) %>%
mutate(timeBin = testDuration/5)
return(Subj)
}
Subj1 <- myfunction(1)
Subj2 <- myfunction(2)
To loop through this, I'll need to know what your data and the desired output looks like.
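Without the real data, one possible way to loop is to call such a function for every distinct Subject and stack the results. The sketch below assumes the column names from the question; toy_trials and summarise_subject are hypothetical names, and the numbers are invented:

```r
library(dplyr)

# Hypothetical stand-in for rtTrialsYA_s1
toy_trials <- tibble(
  Subject             = c(1, 1, 2, 2, 2),
  RetRating.OnsetTime = c(100, 300, 50, 200, 400),
  RetRating.RTTime    = c(250, 600, 150, 350, 1050)
)

# Same calculation as in the function above, but keeping the Subject id
summarise_subject <- function(data, subjectNumber) {
  data %>%
    filter(Subject == subjectNumber) %>%
    summarise(
      Subject   = first(Subject),
      testStart = min(RetRating.OnsetTime),
      testEnd   = max(RetRating.RTTime)
    ) %>%
    mutate(testDuration = testEnd - testStart,
           timeBin      = testDuration / 5)
}

# Apply to every subject and bind the one-row results together
all_subjects <- bind_rows(
  lapply(unique(toy_trials$Subject),
         function(s) summarise_subject(toy_trials, s))
)
```

The result has one row per subject, so 23 subjects would give a 23-row table.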
I think you're missing one piece and that is simply dplyr::group_by.
You can use it as follows to break your dataset into groups, each containing the observations belonging to only one subject, and then summarise on those groups with whatever it is you want to analyze.
library(dplyr)
df <- rtTrialsYA_s1 %>%
group_by(Subject) %>%
summarise(
testStart = min(RetRating.OnsetTime),
testEnd = max(RetRating.RTTime),
testDuration = testEnd - testStart,
timeBin = testDuration/5,
.groups = "drop"
)
There is no need to do separate mutate calls in your code, btw. Also, you can continue to do column calculations right within summarise, as long as the result vectors have the same length as your aggregated columns.
And since summarise retains only the grouping columns and whatever you are defining, there is no real need to do a select statement before, either.
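If the end goal is to label each trial with the fifth of the test it falls into, the same grouping idea can be used with mutate instead of summarise. This is only a sketch under assumptions: toy_trials and the bin column are hypothetical, and binning by trial onset time is my guess at what is wanted:

```r
library(dplyr)

# Hypothetical stand-in for rtTrialsYA_s1
toy_trials <- tibble(
  Subject             = c(1, 1, 1, 2, 2),
  RetRating.OnsetTime = c(0, 40, 90, 10, 60),
  RetRating.RTTime    = c(20, 60, 100, 30, 110)
)

binned <- toy_trials %>%
  group_by(Subject) %>%
  mutate(
    testStart = min(RetRating.OnsetTime),
    timeBin   = (max(RetRating.RTTime) - testStart) / 5,
    # bin index 1..5 according to when the trial started;
    # pmin() caps the index at 5
    bin = pmin(floor((RetRating.OnsetTime - testStart) / timeBin) + 1, 5)
  ) %>%
  ungroup()
```

From there, something like group_by(Subject, bin) followed by summarise could give a reaction-time summary per bin.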
// update
You say you need all your calculated columns to appear within one single column. For that you can use tidyr::pivot_longer. Using the df we calculated above:
library(tidyr)
df_long <- df %>%
pivot_longer(-Subject)
The above takes all columns except Subject and pivots them into two columns: name, containing the former column name, and value, containing the former value.
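To make the reshaping concrete, here is a small sketch on invented numbers (df_toy is a hypothetical stand-in for the summarised df, and the measure column name is an assumption chosen instead of the default name):

```r
library(dplyr)
library(tidyr)

# Hypothetical summary table, one row per subject
df_toy <- tibble(
  Subject      = c(1, 2),
  testStart    = c(100, 50),
  testEnd      = c(600, 1050),
  testDuration = c(500, 1000),
  timeBin      = c(100, 200)
)

# Four measure columns become four rows per subject
df_long <- df_toy %>%
  pivot_longer(-Subject, names_to = "measure", values_to = "value")
```

Each subject now occupies four rows, one per calculated quantity, with all the numbers in the single value column.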
I have two problems: I would like to create a graph with multiple lines by adding the values in the columns stepwise (should kind of end up looking like multiple saturation curves). I think geom_step in the ggplot2 package should work. However, I don't know how to add the values in the columns as I go and I don't know how to add multiple lines (I will have over 100 lines) therefore both steps should be automated in some way.
The data set below is a sample of my data; it contains only the first 3 columns and the first 13 rows.
a<-c(0,1,1,1,1,1,1,0,1,0,1,0,1)
b<-c(0,1,0,0,1,0,1,0,1,0,1,0,1)
c<-c(0,1,1,0,1,0,1,1,1,1,1,1,1)
df<-data.frame(a,b,c)
Can anyone help me? I have no idea where to start.
If you're looking for cumulative sums of the data, the cumsum() function will do it for you.
It isn't completely clear to me what you're looking for, but this might take care of it:
library(dplyr)
library(tidyr)
library(ggplot2)
a<-c(0,1,1,1,1,1,1,0,1,0,1,0,1)
b<-c(0,1,0,0,1,0,1,0,1,0,1,0,1)
c<-c(0,1,1,0,1,0,1,1,1,1,1,1,1)
df2<-data.frame(a,b,c)
df3 <- df2 %>%
mutate_all(cumsum) %>%
rename_all(paste0, 'x') %>%
cbind(df2) %>%
mutate(row = row_number()) %>%
pivot_longer(ax:c)
ggplot(df3) +
geom_step(aes(x = row, y = value, color = name))
The data was reshaped to long format for ease of plotting. The original data was left in as well; those are the lines that stay near the bottom of the graph.
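If only the cumulative curves are wanted, and the number of columns is large, a variation like the following may scale better because it never names the columns individually. It is a sketch using the sample data from the question; the series and cumulative column names are assumptions:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

df <- data.frame(
  a = c(0,1,1,1,1,1,1,0,1,0,1,0,1),
  b = c(0,1,0,0,1,0,1,0,1,0,1,0,1),
  c = c(0,1,1,0,1,0,1,1,1,1,1,1,1)
)

cum_long <- df %>%
  mutate(across(everything(), cumsum)) %>%   # running total of every column
  mutate(row = row_number()) %>%             # x position for the steps
  pivot_longer(-row, names_to = "series", values_to = "cumulative")

# One step line per original column
p <- ggplot(cum_long, aes(x = row, y = cumulative, color = series)) +
  geom_step()
```

With more than 100 columns the legend will probably become unreadable, so adding theme(legend.position = "none") may be worth considering.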
This is part of an online course I am doing, R for data analysis.
A tibble is created using the group_by and summarise functions on the diamonds data set - the new tibble indeed exists and looks as you would expect, I checked. Now a bar plot has to be created using these summary values in the new tibble, but it gives me all sorts of errors associated with not recognising the columns.
I transformed the tibble into a data frame, and still get the same problem.
Here is the code:
diamonds_by_color <- group_by(diamonds, color)
diamonds_mp_by_color <- summarise(diamonds_by_color, mean_price = mean(price))
diamonds_mp_by_color <- as.data.frame(diamonds_mp_by_color)
colorcounts <- count(diamonds_by_color$mean_price)
colorbarplot <- barplot(diamonds_by_color$mean_price, names.arg = diamonds_by_color$color,
main = "Average price for different colour diamonds")
The error I get when running the function count is:
Error in UseMethod("summarise_") :
no applicable method for 'summarise_' applied to an object of class "NULL"
In addition: Warning message:
Unknown or uninitialised column: 'mean_price'.
It's probably something trivial but I have been reading quite a lot and tried a few things and can't figure it out. Any help will be super appreciated :)
Your diamonds_by_color never has mean_price assigned to it.
Your last two lines of code work if you reference diamonds_mp_by_color instead:
colorcounts <- count(diamonds_mp_by_color, mean_price)
barplot(diamonds_mp_by_color$mean_price,
names.arg=diamonds_mp_by_color$color,
main="Average price for different colour diamonds")
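A quick way to confirm which object holds the summary column is to check the column names of both. This sketch assumes the diamonds data set that ships with ggplot2:

```r
library(dplyr)
library(ggplot2)   # provides the diamonds data set

diamonds_by_color    <- group_by(diamonds, color)
diamonds_mp_by_color <- summarise(diamonds_by_color, mean_price = mean(price))

# The grouped tibble is still the raw data, so it has no mean_price column
has_col_grouped <- "mean_price" %in% names(diamonds_by_color)
# The summarised tibble has one row per color and the new column
has_col_summary <- "mean_price" %in% names(diamonds_mp_by_color)
```

Here has_col_grouped is FALSE and has_col_summary is TRUE, which is consistent with the "Unknown or uninitialised column" warning in the question.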
Here is a way to summarise the price by color using dplyr and piping straight to a barplot using ggplot2.
diamonds %>% group_by(color) %>%
summarise(mean.price=mean(price,na.rm=TRUE)) %>%
ggplot(aes(color,mean.price)) + geom_bar(stat='identity')
Best dplyr idiom is not to declare a temporary result for each operation. Just do one big pipe; also the %>% notation is clearer because you don't have to keep specifying which dataframe as the first arg in each operation:
diamonds %>%
  group_by(color) %>%
  summarise(mean_price = mean(price)) %>%   # one row per color
  # with() lets barplot() see the columns of the piped data frame
  with(barplot(mean_price, names.arg = as.character(color),
               main = "Average price for different colour diamonds"))
(Something like that. You can assign the output of the pipe before the barplot if you like. I'm transiting through an airport so I can't check it in R.)
I am a bit stuck with some code. Of course I would appreciate a piece of code which sorts my dilemma, but I am also grateful for hints of how to sort that out.
Here goes:
First of all, I installed the packages (ggplot2, lubridate, and openxlsx)
The relevant part:
I extract a file from the website of an Italian gas TSO:
Storico_G1 <- read.xlsx(xlsxFile = "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx",sheet = "Storico_G+1", startRow = 1, colNames = TRUE)
Then I created a data frame with the variables I want to keep:
Storico_G1_df <- data.frame(Storico_G1$pubblicazione, Storico_G1$IMMESSO, Storico_G1$`SBILANCIAMENTO.ATTESO.DEL.SISTEMA.(SAS)`)
Then change the time format:
Storico_G1_df$pubblicazione <- ymd_h(Storico_G1_df$Storico_G1.pubblicazione)
Now the struggle begins. I would like to chart the 2 time series with 2 different Y axes, because their ranges are very different. That is not really a problem in itself, because I can achieve it with the melt function and ggplot. However, one column contains NAs, and I don't know how to work around that. In the incomplete (SAS) column I mainly care about the data point at 16:00, so ideally I would have hourly points on one chart and only 1 data point a day (at 16:00) on the second chart. I attached an unrelated example picture of the chart style I mean. In that chart, however, both series have equally many data points, so it works fine.
Grateful for any hints.
Take care
library(lubridate)
library(ggplot2)
library(openxlsx)
library(dplyr)
#Use na.strings: it looks like missing values are coded in several ways in the dataset
storico.xl <- read.xlsx(xlsxFile = "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx",
sheet = "Storico_G+1", startRow = 1,
colNames = TRUE,
na.strings = c("NA","N.D.","N.D"))
#Select and rename the crazy column names
storico.g1 <- data.frame(storico.xl) %>%
select(pubblicazione, IMMESSO, SBILANCIAMENTO.ATTESO.DEL.SISTEMA..SAS.)
names(storico.g1) <- c("date_hour","immesso","sads")
# the date column looks like it is in a format that ymd_h can parse
storico.g1 <- storico.g1 %>% mutate(date_hour = ymd_h(date_hour))
#Not sure exactly what you want to plot, but here is each point by hour
ggplot(storico.g1, aes(x= date_hour, y = immesso)) + geom_line()
#To get one value per day, group by the calendar date of date_hour
#The count column lets you check there are 24 points per day
#Then feed the new columns into ggplot
storico.g1 %>%
  group_by(date = as.Date(date_hour)) %>%
  summarise(count = n(),
            daily.immesso = sum(immesso)) %>%
  ggplot(aes(x = date, y = daily.immesso)) + geom_line()
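For the part of the question about showing only the 16:00 value of the incomplete column, one possible approach is to filter on the hour and stack the two series before faceting. This is a sketch on invented numbers: toy, sas_16, long and the series and value column names are all hypothetical, and the column is called sas here only for the example:

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Hypothetical hourly data with gaps in the second column
toy <- tibble(
  date_hour = ymd_hm(c("2017-01-01 15:00", "2017-01-01 16:00",
                       "2017-01-02 16:00", "2017-01-02 17:00")),
  immesso   = c(10, 12, 11, 9),
  sas       = c(NA, 1.5, -0.5, NA)
)

# Keep only the 16:00 observations that actually have a value
sas_16 <- toy %>%
  filter(hour(date_hour) == 16, !is.na(sas))

# Stack both series so each gets its own panel and its own y scale
long <- bind_rows(
  toy    %>% transmute(date_hour, series = "immesso", value = immesso),
  sas_16 %>% transmute(date_hour, series = "sas",     value = sas)
)

p <- ggplot(long, aes(x = date_hour, y = value)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ series, ncol = 1, scales = "free_y")
```

Two stacked panels with free y scales sidestep the need for a secondary axis, and the panels share the same time axis even though one has far fewer points.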