I have a very big dataset that I'd like to illustrate using plotly in R.
A sample of my dataset is shown below:
> new_data_2
# Groups: newdatum [8]
date activity totaal
<date> <fct> <int>
1 2019-11-21 N11 144
2 2019-09-22 N11 129
3 2019-05-15 N22 117
4 2019-01-23 N22 12
5 2019-07-04 N22 12
6 2019-07-18 N22 12
...
For every activity I want to display the amount (totaal) per date (date) in a time series plot.
I can't get it right in R. I think I need to group by activity to display one line per activity, but I can't figure out how.
new_data_2 %>%
group_by(activity) %>%
plot_ly(x=new_data_2$newdatum) %>%
add_lines(y=~new_data_2$totaal, color = ~factor(newdatum))
It displays an empty plot, without the 'activity' values on the left side.
What I want to achieve is:
You're on the right track, but after the group_by() you need to tell R to do something to the groups.
new_data_2 %>%
group_by(activity, date) %>% # use two groupings since you want by activity & date
summarise(totaal_2 = sum(totaal))
That should get to the dataframe you're looking for. You can use ggplot & plotly on it from there.
I would recommend reshaping the data first (as above), saving it as a new object, and then graphing it. Doing it this way helps you see each step along the way. Pipes %>% are great, but can make each step difficult to see.
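The two-step approach described above (reshape and save, then plot) could be sketched as follows. This is only an illustration: the small data frame is a hypothetical stand-in for new_data_2 with invented values, base R's aggregate() is swapped in for group_by()/summarise(), and the plotting step only runs if plotly happens to be installed.

```r
# Hypothetical stand-in for new_data_2 (values invented for illustration)
new_data_2 <- data.frame(
  date = as.Date(c("2019-11-21", "2019-11-21", "2019-09-22",
                   "2019-05-15", "2019-05-15")),
  activity = factor(c("N11", "N11", "N11", "N22", "N22")),
  totaal = c(100L, 44L, 129L, 100L, 17L)
)

# Step 1: one row per activity and date, saved as its own object
totals <- aggregate(totaal ~ activity + date, data = new_data_2, FUN = sum)
totals <- totals[order(totals$activity, totals$date), ]
print(totals)

# Step 2: build the plot from the saved object (skipped without plotly)
if (requireNamespace("plotly", quietly = TRUE)) {
  p <- plotly::plot_ly(totals, x = ~date, y = ~totaal, color = ~activity,
                       type = "scatter", mode = "lines")
  # type `p` at the console to display the plot
}
```

Inspecting totals before plotting makes it easy to confirm that each activity has exactly one value per date.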
This might not be obvious at first, but the structure of your data is ideal for a plot with multiple time series. You don't even need to worry about the group_by function. Your dataset seems to have a long format, where the dates in the date column and the names in the activity column are not unique, but there is only one value per combination of activity and date.
Given the correct specifications, plot_ly() will group your data using color = ~activity, like this: p <- plot_ly(new_data_2, x = ~date, y = ~totaal, color = ~activity) %>% add_lines(). Since you haven't provided a data sample that is large enough, I'll use the built-in dataset economics_long to show you how you can do this. First of all, notice how the structure of my sampled dataset matches yours:
date variable value
1 1967-07-01 psavert 12.5
2 1967-08-01 psavert 12.5
3 1967-09-01 psavert 11.7
4 1967-10-01 psavert 12.5
5 1967-11-01 psavert 12.5
6 1967-12-01 psavert 12.1
...
Plot:
Code:
library(plotly)
library(dplyr)
# data
data("economics_long")
df <- data.frame(economics_long)
# keep only some variables that have values on a comparable level
df <- df %>% filter(!(variable %in% c('pop', 'pce', 'unemploy')))
# plotly time series
p <- plot_ly(df, x = ~date, y = ~value, color = ~variable) %>%
add_lines()
# show plot
p
here's my data:
head(df)
FY Analyte Value
<fct> <fct> <dbl>
1 2007-08 CONF(G) 634
2 2007-08 PH(G) 7.8
3 2007-08 TEMP(G) 24.8
4 2007-08 UHS(G) 2.5
5 2007-08 FC(G) 0.5
6 2007-08 CBOD(C) 1
My dataset is a long df, spanning 10 years. I want to create multiple ggplots (of each Analyte) where the x axis is FY (financial year) and the y axis is Value. Ideally the Y axis title would also change based on the variable being plotted.
I've seen a few reproducible chunks of code in my search to do this but none of them seem to apply to a long dataframe (where I want to loop through each level of the Analyte variable). I also want it to save to my working directory (possibly using the png and dev.off() functions).
Anyone know a solution?
Thanks!
Split the data for each Analyte and use map to save the plot as separate image.
library(tidyverse)
df %>%
group_split(Analyte) %>%
map(~{
analyte_name <- .$Analyte[1]
tmp <- ggplot(., aes(FY, Value)) + geom_boxplot() + ggtitle(analyte_name)
ggsave(paste0(analyte_name, '.png'), tmp)
})
I have conducted a study with triplicates (SampleID) for each sample (Sample) on different time points.
Now, I want to plot the means of the triplicates for the characteristic "Aerobic".
For example, I want to plot the development of the amount of aerobic bacteria over time. Therefore, I need to calculate the means (and the standard deviation) of the triplicates and then plot these means in the graph. I imagine using a geom_line or geom_point diagram for this.
SampleID Sample Aerobic Anaerobic Day
[Factor] [Factor] [num] [num] [num]
1 V1.1.K1 V1.1.K 0.610063430 0.05146154 1
2 V1.1.K2 V1.1.K 0.740887757 0.02115290 1
3 V1.1.K3 V1.1.K 0.683726217 0.04270182 1
4 V1.1.N1 V1.1.N 0.432019752 0.35722350 1
5 V1.1.N2 V1.1.N 0.515792694 0.41357935 1
6 V1.14.K16 V1.14.K 0.038141335 0.84496088 14
7 V1.14.K17 V1.14.K 0.042078682 0.76523093 14
8 V1.14.K18 V1.14.K 0.009594763 0.90767637 14
9 V1.14.N0 V1.14.N 0.513100502 0.10618731 14
10 V1.14.W16 V1.14.W 0.483710571 0.32765968 14
How should I do this?
I tried it with the following code
plot <- mydata %>%
group_by(Sample) %>%
mutate(Mean=mean(Aerobic)) %>%
ggplot(aes(x=Day, y=Aerobic)) +
geom_point()
If I google the questions I get only information about how to calculate the mean alone, but not to set up a new table with the means for the different variables.
Is there something like
calc_mean_by_group ??
You would help me a lot :)
Simple base-R solution for calculating the means:
tapply(X = foo$Aerobic, INDEX = foo$Sample, FUN = mean)
("foo" being the name of your data.frame)
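To get a new table holding both the mean and the standard deviation per group, something along these lines might work. It is a base-R sketch, not the only way to do it: the data frame reuses the ten rows shown in the question, and the names means, sds and summary_tab are made up for illustration.

```r
# The ten rows shown in the question (only the columns needed here)
mydata <- data.frame(
  Sample = c("V1.1.K", "V1.1.K", "V1.1.K", "V1.1.N", "V1.1.N",
             "V1.14.K", "V1.14.K", "V1.14.K", "V1.14.N", "V1.14.W"),
  Aerobic = c(0.610063430, 0.740887757, 0.683726217, 0.432019752,
              0.515792694, 0.038141335, 0.042078682, 0.009594763,
              0.513100502, 0.483710571),
  Day = c(1, 1, 1, 1, 1, 14, 14, 14, 14, 14)
)

# Mean and standard deviation of the replicates per Sample and Day
means <- aggregate(Aerobic ~ Sample + Day, data = mydata, FUN = mean)
sds   <- aggregate(Aerobic ~ Sample + Day, data = mydata, FUN = sd)

# Combine into one table with columns Aerobic_mean and Aerobic_sd
summary_tab <- merge(means, sds, by = c("Sample", "Day"),
                     suffixes = c("_mean", "_sd"))
print(summary_tab)
```

Groups with a single replicate get NA as their standard deviation. The resulting summary_tab has one row per Sample and Day, so it can be handed to ggplot with aes(x = Day, y = Aerobic_mean) and geom_point() or geom_line().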
I have a set of categorical variables listed by date. The desired outcome is a plot of counts of the categorical variables selected by a particular date range. I can produce a plot of the entire set but no variations that I have found (or people have suggested I use) produces that outcome. Date is formatted as date and libloc is a character. The end result desired is plot of the number of instructions we do in different locations by semester.
I understand this is an unimportant/uninteresting question to most of you -- but I am a 62 year old classics librarian stuck at home because of the pandemic having to learn to program so I can keep my job - so can people please be kind. I realize I am not phrasing my question the way you might want but I am doing the best I can trying to do this.
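library(ggplot2)
library(lubridate)
library(readxl) # provides read_excel()
library(dplyr) # provides %>%, select(), filter(), group_by(), summarise()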
df <- read_excel("C:/Users/12083/Desktop/instructions/datasetd.xlsx")
df %>%
select(date,Location) %>%
filter(date >= as.Date("2017-01-05") & date <= as.Date("2018-01-10"))%>%
group_by(Location) %>%
summarise(count=n())
g <- ggplot(df, aes(Location))
g + geom_bar()
Salve!
You might find that my santoku package helps. It can chop dates into intervals:
library(santoku)
library(dplyr)
df_summary <- df %>%
select(date,Location) %>%
filter(date >= as.Date("2017-01-05") & date <= as.Date("2018-01-10")) %>%
mutate(semester = chop(date, as.Date(c("2017-01-05", "2017-01-09")))) %>%
group_by(Location, semester) %>%
summarise(count=n())
Obviously you will want to pick your semester dates appropriately.
Then you can print with something like:
ggplot(df_summary, aes(semester, count)) + geom_col() + facet_wrap(vars(Location))
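If you would rather not install an extra package, base R's cut() can also bin dates into semesters. The following is a rough sketch under assumptions: the five rows, the semester boundary dates and the labels are all invented for illustration and need to be replaced with your own.

```r
# Hypothetical sample: one row per instruction session (values invented)
df <- data.frame(
  date = as.Date(c("2017-01-20", "2017-03-02", "2017-09-15",
                   "2017-10-01", "2017-11-30")),
  Location = c("Library", "Classroom", "Library", "Library", "Classroom")
)

# Assumed semester boundaries; each interval includes its start date
breaks <- as.Date(c("2017-01-05", "2017-08-15", "2018-01-10"))
df$semester <- cut(df$date, breaks = breaks,
                   labels = c("Spring 2017", "Fall 2017"), right = FALSE)

# Count sessions per Location and semester
counts <- as.data.frame(table(Location = df$Location, semester = df$semester))
print(counts)
```

The counts table has a Freq column that can be plotted the same way as count above, for instance with geom_col().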
Hope this helps:
#### Filtering using R to a specific date range ####
# From: https://stackoverflow.com/questions/62926802/filtering-using-r-to-a-specific-date-range
# First, I downloaded a sample dataset with dates and categorical data from here:
# https://vincentarelbundock.github.io/Rdatasets/datasets.html
# Specifically, I got weather.csv
setwd("F:/Home Office/R")
data = read.csv("weather.csv") # Read the data into R
head(data) # Quality control, looks good
data = data[,2:3] # For this example, I cut it to only take the relevant columns
data$date = as.Date(data$date) # This formats the date as dates for R
library(tidyverse) # This will import some functions that you need, specifically %>% and ggplot
# Step 0: look that the data makes sense to you
summary(data$date)
summary(data$city)
# Step 1: filter the right data
filtered = data %>%
filter(date > as.Date("2016-07-01") & date < as.Date("2017-07-01")) # This will only take rows between those dates
# Step 2: Plot the filtered data
# Using a bar plot:
plot = ggplot(filtered, aes(x=city, fill = city)) + geom_bar() # You don't really need the fill, but I like it
plot
# Quality control: look at the numbers before and after the filtering:
summary(data$city)
summary(filtered$city)
Outputs:
> summary(data$city)
Auckland Beijing Chicago Mumbai San Diego
731 731 731 731 731
> summary(filtered$city)
Auckland Beijing Chicago Mumbai San Diego
364 364 364 364 364
You might be able to make it more elegant... but I think it works well
EDIT TO MAKE IT INTO A LINE PLOT
This edit is following your request in the comments:
# Line plot
# The major difference between geom_bar() and geom_line() is that
# geom_line() requires both an X and Y values.
# So first I created a new data frame which has these values:
summarised.data = filtered %>%
group_by(city) %>%
tally()
# Now you can create the plot with ggplot:
# Notes:
# 1. group = 1 is necessary
# 2. I added geom_point() so that each X value gets a point. I think it's easier to read. You can remove this if you like
plot.line = ggplot(summarised.data, aes(x=city, y=n, group = 1)) + geom_line() + geom_point()
plot.line
Outputs:
> summarised.data
# A tibble: 5 x 2
city n
<fct> <int>
1 Auckland 364
2 Beijing 364
3 Chicago 364
4 Mumbai 364
5 San Diego 364
This is a new answer because the approach is different
#### Filtering using R to a specific date range ####
# From: https://stackoverflow.com/questions/62926802/filtering-using-r-to-a-specific-date-range
# First, the data I took by copy and pasting from here:
# https://stackoverflow.com/questions/63006201/mapping-sample-data-to-actual-csv-data
# and saved it as bookdata.csv with Excel
setwd("C:/Users/di58lag/Documents/scratchboard/Scratchboard")
data = read.csv("bookdata.csv") # Read the data into R
head(data) # Quality control, looks good
data$dates = as.Date(data$dates, format = "%d/%m/%Y") # This formats the date as dates for R
library(tidyverse) # This will import some functions that you need, specifically %>% and ggplot
# Step 0: look that the data makes sense to you
summary(data$dates)
summary(data$city)
# Step 1: filter the right data
start.date = as.Date("2020-01-02")
end.date = as.Date("2020-01-04")
filtered = data %>%
filter(dates >= start.date &
dates <= end.date) # This will only take rows between those dates
# Step 2: Plotting
# Now you can create the plot with ggplot:
# Notes:
# I added geom_point() so that each X value gets a point.
# I think it's easier to read. You can remove this if you like
# Also added color, because I like it, feel free to delete
Plot = ggplot(filtered, aes(x=dates, y=classes, group = city)) + geom_line(aes(linetype=city, color = city)) + geom_point(aes(color=city))
Plot
# For a clean version of the plot:
clean.plot = ggplot(filtered, aes(x=dates, y=classes, group = city)) + geom_line(aes(linetype=city))
clean.plot
Outputs:
Plot:
Clean.plot:
EDIT: ADDED A TABLE FUNCTION!
After reading your comments I think I figured out what you're trying to do.
You asked for:
"counts of location of instructors on the vertical and dates on the horizontal."
The problem is that the original data doesn't actually give you the counts, i.e. "how many times a specific location appears on a specific date".
Therefore, I had to add another line using the table function to calculate this:
data.table = as.data.frame(table(filtered))
This calculates how many times each combination of date+location appears and gives a value called "Freq".
Now you can plot this Freq as the count as follows:
# Step 1.5: Counting the values
data.table = as.data.frame(table(filtered)) # This calculates the frequency of each date+location combination
data.table = data.table %>% filter(Freq>0) # This is used to cut out any Freq=0 values (you don't want to plot cases where no event occurred)
data.table$dates = as.Date(data.table$dates) # You need to rerun as.Date() because table() turns the dates into factors
#Quality control:
dim(filtered) # Gives you the size of the dataframe before the counting
dim(data.table) # Gives the size after the counting
summary(data.table) # Will give you a summary of how many values are for each city, what is the date range and what is the Frequency range
# Now you can create the plot with ggplot:
# Notes:
# I added geom_point() so that each X value gets a point.
# I think it's easier to read. You can remove this if you like
# Also added color, because I like it, feel free to delete
Plot = ggplot(data.table, aes(x=dates, y=Freq, group = city)) + geom_line(aes(linetype=city, color = city)) + geom_point(aes(color=city))
Plot
# For a clean version of the plot:
clean.plot = ggplot(data.table, aes(x=dates, y=Freq, group = city)) + geom_line(aes(linetype=city))
clean.plot
I have a feeling it's not exactly what you wanted because the numbers are quite low (ranging between 1 and 12 counts), but this is how I understand it.
OUTPUTS:
> summary(data.table)
city dates Freq
Pocatello :56 Min. :2015-01-12 Min. :1.000
Idaho Falls:10 1st Qu.:2015-02-10 1st Qu.:1.000
Meridian : 8 Median :2015-03-04 Median :1.000
: 0 Mean :2015-03-11 Mean :1.838
8 : 0 3rd Qu.:2015-04-06 3rd Qu.:2.000
Boise : 0 Max. :2015-06-26 Max. :5.000
(Other) : 0
I have a data frame in R that lists monthly sales data by department for a store. Each record contains a month/year, a department name, and the total sales in that department for the month. I'm trying to calculate the mean sales by department, adding them to the vector avgs, but I seem to be having two problems: the total sales per department is not being computed at all (it's evaluating to zero) and avgs is being built by record instead of by department. Here's what I have:
avgs = c()
for(dept in data$departmentName){
total <- 0
for(record in data){
if(identical(data$departmentName, dept)){
total <- total + data$ownerSales[record]
}
}
avgs <- c(avgs, total/72)
}
Upon looking at avgs on completion of the loop, I find that it's returning a vector of zeroes the length of the data frame rather than a vector of 22 averages (there are 22 departments). I've been tweaking this forever and I'm sure it's a stupid mistake, but I can't figure out what it is. Any help would be appreciated.
Why not use library(dplyr)?
library(dplyr)
data(iris)
iris %>% group_by(Species) %>% # or dept
summarise(total_plength = sum(Petal.Length), # total owner sales
weird_divby72 = total_plength/72) # total/72?
# A tibble: 3 × 3
Species total_plength weird_divby72
<fctr> <dbl> <dbl>
1 setosa 73.1 1.015278
2 versicolor 213.0 2.958333
3 virginica 277.6 3.855556
Your case would probably look like this:
data %>% group_by(departmentName) %>%
summarise(total_sales = sum(ownerSales),
monthly_sales = total_sales/72)
I like dplyr for its syntax and pipeability. I think it is a huge improvement over base R for ease of data wrangling. Here is a good cheat sheet to help you get rolling: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
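As for why the original loop returns zeroes, as far as I can tell there are three things going on: for(dept in data$departmentName) visits every row rather than each department once, for(record in data) iterates over the columns of the data frame rather than its rows, and identical(data$departmentName, dept) compares the whole column with a single name, so it is FALSE and nothing is ever added to total. A corrected loop might look like the sketch below. The sample data is invented, and dividing by the number of rows per department (instead of the fixed 72) is my assumption.

```r
# Hypothetical stand-in for the sales data (values invented)
data <- data.frame(
  departmentName = rep(c("Bakery", "Dairy"), each = 3),
  ownerSales = c(10, 20, 30, 5, 5, 20)
)

avgs <- c()
for (dept in unique(data$departmentName)) {  # each department once
  in_dept <- data$departmentName == dept     # TRUE/FALSE per row
  total <- sum(data$ownerSales[in_dept])     # total sales for this department
  avgs[dept] <- total / sum(in_dept)         # divide by its number of records
}
print(avgs)

# The same result without a loop
avgs2 <- tapply(data$ownerSales, data$departmentName, mean)
print(avgs2)
```

Both avgs and avgs2 are named vectors with one entry per department.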
This seems simple, but I've tried multiple variations of matplot, ggplot2, regular old plot...I can't get any to do what I need.
I have a gigantic dataframe of years, months, and observations. I simplified it down to number of observations per month, per year, see below. I'm not sure why it read in with the "X" in front of each column heading, but if it's not going to affect the code, right now I don't care.
head(storms)
X Month X1992 X1993 X1994
1 1 1 2 1
2 2 2 4 1
3 3 3 26 10
4 4 4 47 26
5 5 5 969 615
The full (simplified) set is 10 columns of years (1992-2001), each with 12 months/rows of totals (1 storm in Jan 1992, 26 storms in March 1993...). I need simply to plot these all on an x-axis 120 months long, # of observations per month on the y-axis. It could be a line or bars or vertical lines. I've seen many ways to plot 20 lines with 12 months on the x-axis; that is not what I'm going for. I also need to label the years every 12 months, but I think I can figure that out after I get this block out of the way.
In other words (I hope this is more clear if the previous is not):
y axis: # of storms, ylim=c(0-1000)
x axis: 10 sets of months (Jan-Dec, 1992-2001, 120 months total). The only labels will be the years, every 12 months of course.
I know I'm just thinking about it wrong, could someone please set my head straight?
(first post; please also tell me if I'm not formatting or inquiring properly!)
Is this something like what you are looking for? If I am not mistaken, you need to rearrange your data frame so that it is longer rather than wider. Then you can draw a figure. Since you have 120 months, you may need to think about plot space. But at least this example lets you move forward. I hope this helps.
library(tidyr)
library(ggplot2)
# Create sample data
month <- rep(c(1:12), each = 1, times = 2)
nintytwo <- runif(24, 0, 20)
nintythree <- runif(24, 0, 20)
# Create a data frame
ana <- data.frame(month, nintytwo, nintythree)
# Make the data longer rather than wider.
bob <- gather(ana, year, value, -month)
bob$month <- as.factor(bob$month)
# Draw a figure
cathy <- ggplot(bob, aes(x= year,y = value, fill = month)) + geom_bar(stat="identity", position="dodge")
cathy
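To get the single 120-month x-axis asked for in the question, one option is to turn each year/month pair into a real date once the data is in long format. The sketch below uses only base R and a hypothetical two-year version of storms with invented counts. It also strips the "X" prefix, which read.csv adds because column names starting with a digit are not syntactically valid (check.names = TRUE is the default).

```r
# Hypothetical two-year version of storms (counts invented for illustration)
storms <- data.frame(Month = 1:12, X1992 = 1:12, X1993 = 13:24)

# Wide to long: one row per year and month
year_cols <- setdiff(names(storms), "Month")
long <- data.frame(
  Year   = rep(as.integer(sub("^X", "", year_cols)), each = nrow(storms)),
  Month  = rep(storms$Month, times = length(year_cols)),
  Storms = unlist(storms[year_cols], use.names = FALSE)
)

# First day of each month as a Date, so all months line up on one axis
long$Date <- as.Date(sprintf("%d-%02d-01", long$Year, long$Month))
long <- long[order(long$Date), ]
print(head(long))

# Possible plot calls (not run here):
# plot(long$Date, long$Storms, type = "l")
# ggplot(long, aes(Date, Storms)) + geom_line() +
#   scale_x_date(date_breaks = "1 year", date_labels = "%Y")
```

With a Date column on the x-axis, both base plot() and ggplot2 place the months in chronological order, and yearly labels come from the date scale rather than from manual tick positions.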
Here's an example using base R :
# create example data
set.seed(123)
df <- data.frame(Month=1:12)
for(y in 1992:2001){
tmp <- data.frame(X=as.integer(abs(rnorm(12,mean=2,sd=10))))
colnames(tmp) <- paste("X",y,sep="")
df <- cbind(df,tmp)
}
# reshape to long format (one column with n.of storms, and period columns)
long <- reshape(df[,-1], idvar="Month", ids=df$Month,
times=names(df[,-1]), timevar="Year",
varying = list(names(df[,-1])),
direction = "long",v.names="Storms")
# remove the "X" from the year
long$Year <- substr(long$Year,2,nchar(long$Year))
nYears <- length(unique(long$Year))
# plot the line
plot(x=1:nrow(long),y=long$Storms,type="l",
xaxt="n",main="Monthly Storms",
xlab="Period",ylab="Storms",col="RoyalBlue")
# add custom labels
axis(1,at=((1:nYears)*12)-6,labels=unique(long$Year))
# add vertical lines
abline(v=c(0.5,((1:nYears)*12)+0.5),col="Gray80",lty=2)
Result :