Creating a timeline based on years in R

I am fairly new to coding with R. I am working with relatively simple data, just a large amount of it. I am trying to create a timeline based on the years these species of whale were sighted. So I just have two variables (year and species) and 151 observations. There are a total of 25 species to plot on this timeline, and I have provided a small example of my data below.
year species
1792 Megaptera novaeangliae
1792 Physeter macrocephalus
1793 Physeter macrocephalus
1832 Physeter macrocephalus
1833 Physeter macrocephalus
I have tried creating the timeline using timelineS and timelineG as well as vistime. timelineG comes closest to what I want, but it does not seem to plot anything. The code is as follows:
timelineG(t8, start="year", end="year", names="species")
[screenshot: timelineG results]
I am just kind of stuck. I do have the month and day that the species were sighted so I can add that back if needed. Thank you in advance for any guidance.

An example with ggplot:
library(tidyverse)
df <- tibble::tribble(
  ~year, ~species,
  1792L, "Megaptera novaeangliae",
  1792L, "Physeter macrocephalus",
  1793L, "Physeter macrocephalus",
  1832L, "Physeter macrocephalus",
  1833L, "Physeter macrocephalus"
)
df %>%
  ggplot(aes(x = year, y = species)) +
  geom_point()
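If you would rather draw each species as a bar spanning its first to last sighting (more like a classic timeline), here is a minimal sketch reusing the same df; the points keep species with a single sighting year visible:
df %>%
  group_by(species) %>%
  summarise(start = min(year), end = max(year)) %>%
  ggplot(aes(y = species)) +
  geom_segment(aes(x = start, xend = end, yend = species)) + # one bar per species
  geom_point(aes(x = start)) +                               # mark first sighting
  geom_point(aes(x = end)) +                                 # mark last sighting
  labs(x = "year")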
Or with the timelineG function:
library(timelineS)
df %>%
  group_by(species) %>%
  summarise(start = min(year),
            end = max(year)) %>%
  timelineG(start = "start", end = "end", names = "species")

Or like this, using timelineS:
dta <- data.frame(species = c('Megaptera novaeangliae', 'Physeter macrocephalus',
                              'Physeter macrocephalus', 'Physeter macrocephalus',
                              'Physeter macrocephalus'),
                  year = c(1792, 1792, 1793, 1832, 1833))
#str(dta)
dta$year <- as.Date(ISOdate(dta$year, 1, 1)) # assuming beginning of year for year
# str(dta)
timelineS(dta, main = "BGoodwin's species example")

Related

Getting candlestick chart to display properly using a text / .txt file of historic stock prices in R

Hello there,
I have purchased the historic intraday prices of the S&P 500 (1 min through 1 hour) back through 2005, because most stock charting packages stop reporting intraday prices around 2016 or 2011. I have successfully imported the prices and gotten R to read only market hours, excluding premarket and aftermarket. Two problems exist. First, I need to get the chart to not show Saturday and Sunday. The bigger problem is that the plot is NOT showing candlesticks, but bars, and they are very hard to read. I have tried increasing the size via (size = 4), but the bars overlap and are still not candlesticks. How can I get these to show as proper candlesticks? Thank you.
library(quantmod)
library(tidyquant)
library(tidyverse)
library(ggplot2)
library(readr)
library(ggforce)
library(dplyr)
library(lubridate) # for hour() and minute() used below
library(hms)       # for as_hms() used below
dir <- "E:/Stock Trading/Historical Data/SPY_qjrt28"
setwd(dir)
data <- read_csv("SPY_30min.txt",
col_names = FALSE)
names(data) <- tolower(c("DateTime", "Open", "High", "Low", "Close", "Volume"))
data
#clean the data
write_rds(data, "cleaned.rds")
read_rds("cleaned.rds")
spy30m <- read_rds("cleaned.rds")
firstwave <- filter(spy30m, datetime >= as.Date('2009-03-02'), datetime <= as.Date('2009-03-19'))
# adding more time objects to the dataset
data <- data %>%
mutate(hour = hour(datetime),
minute = minute(datetime),
hms = as_hms(datetime))
# is the hour function working as expected? Yes!
data %>%
select(datetime, hour) %>%
sample_n(10)
# look at bins of observations at 30 minute intervals. Looks good!
data %>%
group_by(hms) %>%
summarise(count = n()) %>%
arrange(hms) %>%
print(n=100)
# filter the dataset to only include the times during regular market hours
data_regularmkt <- data %>%
# `filter` is the dplyr function that limits the number of observations in a data frame
# `between` function takes 3 arguments: an object/variable, a lower bound value, and upper bound value
filter(between(hms, as_hms("09:30:00"), as_hms("16:00:00")))
# look at it again
data_regularmkt %>%
group_by(hms) %>%
summarise(count = n()) %>%
arrange(hms) %>%
print(n=100)
###########
firstwave <- filter(spy30m, datetime >= as.Date('2009-03-06'), datetime <= as.Date('2009-03-19'))
ggplot(firstwave, aes(x = datetime, y = close)) +
geom_candlestick(aes(open = open, high = high, low = low, close = close, size = 3))
Say we have a data frame df with the columns date (dttm format), open, high, low, close.
To overcome the issue that non-trading hours are shown, my first idea was to use another x-axis scale. Here's with a row-index.
library(tidyverse)
library(lubridate)
library(tidyquant)
df <- df %>%
arrange(date) %>%
mutate(i = row_number())
# this is for the x-axis labels
df_x <- df %>%
group_by(d = floor_date(date, "day")) %>%
filter(date %in% c(min(date)))
df %>%
ggplot(aes(x = i)) +
geom_candlestick(aes(open = open, low = low, high = high, close = close)) +
scale_x_continuous(breaks = df_x$i,
labels = df_x$date)
The problem then is that if a contract is halted during trading hours, there will be no data for that period either, just like at night or on weekends. However, those halts are times you probably do want to show.
One could probably play with tidyr's complete() or expand() to fill in the data first and still use my solution of plotting over an index x-scale.
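For instance, a rough, untested sketch of that idea (assuming 30-minute bars and the same df with a dttm column date):
library(tidyverse)
library(hms)
library(lubridate)
df_filled <- df %>%
  # insert NA rows for every missing 30-minute stamp (halts, nights, weekends, ...)
  complete(date = seq(min(date), max(date), by = "30 min")) %>%
  # drop nights and weekends again; market holidays would still show up as gaps
  filter(between(as_hms(date), as_hms("09:30:00"), as_hms("16:00:00")),
         !wday(date) %in% c(1, 7)) %>%
  arrange(date) %>%
  mutate(i = row_number())
# a bar that is missing because of a halt now keeps its slot on the index axis
# as an NA row, so the plot shows a gap there but not overnight or on weekends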
It could be easier to use the plotly library:
library(plotly)
plt <- plot_ly(data = df, x = ~date,
               open = ~open, close = ~close,
               high = ~high, low = ~low,
               type = "candlestick")
plt
This is to hide the non-trading hours:
plt %>% layout(showlegend = F,
               xaxis = list(
                 rangebreaks = list(
                   list(bounds = list(17, 9),
                        pattern = "hour")), # hide hours outside of 9am-5pm
                 dtick = 86400000.0 / 2,
                 tickformat = "%H:%M\n%b\n%Y"))
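To also hide Saturdays and Sundays (the weekend part of your question), rangebreaks accepts day-of-week bounds as well; a sketch:
plt %>% layout(showlegend = F,
               xaxis = list(
                 rangebreaks = list(
                   list(bounds = list("sat", "mon")),              # hide weekends
                   list(bounds = list(17, 9), pattern = "hour")))) # hide non-trading hours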
More information can be found here: https://plotly.com/r/time-series/#hiding-nonbusiness-hours and https://plotly.com/r/candlestick-charts/
As for you not liking the appearance of tidyquant's geom_candlestick, I also suggest you try out Plotly.

Why is my bar chart not showing all the data

I am working on a music streaming project, and I am trying to get the top 15 global streamings in 2020 and make an interactive graph.
It successfully showed the top 15 song names as a data frame, but it failed to show as a bar graph, and I wonder where I went wrong. It did work after I flipped the bar graph to horizontal, but the data look a bit off.
It looks like this as a vertical bar graph:
The horizontal bar graph looks like this, but the data seem incorrect:
Here is the code I have:
library("dplyr")
library("ggplot2")
# load the .csv into R studio, you can do this 1 of 2 ways
#read.csv("the name of the .csv you downloaded from kaggle")
spotiify_origional <- read.csv("charts.csv")
spotiify_origional <- read.csv("https://raw.githubusercontent.com/info201a-au2022/project-group-1-section-aa/main/data/charts.csv")
View(spotiify_origional)
# filters down the data
# removes the track id, explicit, and duration columns
spotify_modify <- spotiify_origional %>%
select(name, country, date, position, streams, artists, genres = artist_genres)
#returns all the data just from 2022
#this is the data set you should use on the project
spotify_2022 <- spotify_modify %>%
filter(date >= "2022-01-01") %>%
arrange(date) %>%
group_by(date)
# use write.csv() to turn the new dataset into a .csv file
# template: write.csv(Your DataFrame, "Path to export the DataFrame\\File Name.csv", row.names = FALSE)
write.csv(spotify_2022, "/Users/oliviasapp/Documents/info201/project-group-1-section-aa/data/spotify_2022.csv" , row.names = FALSE)
# then I pushed the spotify_2022.csv to the GitHub repo
View(spotiify_origional)
spotify_2022_global <- spotify_modify %>%
filter(date >= "2022-01-01") %>%
filter(country == "global") %>%
arrange(date) %>%
group_by(streams)
View(spotify_2022_global)
top_15 <- spotify_2022_global[order(spotify_2022_global$streams, decreasing = TRUE), ]
top_15 <- top_15[1:15,]
top_15$streams <- as.numeric(top_15$streams)
View(top_15)
col_chart <- ggplot(data = top_15) +
geom_col(mapping = aes(x = name, y = streams)) +
ggtitle("Top 15 Songs Daily Streamed Globally") +
theme(plot.title = element_text(hjust = 0.5))
col_chart <- col_chart + coord_cartesian(ylim = c(999000,1000000)) + coord_flip()
col_chart
Thank you so much! Any suggestions will hugely help!
top_15 <- spotify_2022_global[order(spotify_2022_global$streams, decreasing = TRUE), ]
This code sorts in decreasing order, but the streams data here is still of character type, so numbers like 999975 will be "higher" than 1M, which is why your data looks weird. One song had two weeks just under 1M which is why it shows up with ~2M.
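A quick way to see the difference in the console:
sort(c("999975", "1000000"), decreasing = TRUE) # character sort puts "999975" first
sort(c(999975, 1000000), decreasing = TRUE)     # numeric sort puts 1000000 first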
If you use this instead, you'll get more of what you intended:
top_15 <- spotify_2022_global[order(as.numeric(spotify_2022_global$streams), decreasing = TRUE), ]
However, this is finding the highest song-weeks, not the highest songs, so in this case all 15 highest song-weeks were one song.
I'd suggest you group_by(name) and then summarize to get total streams by song, filter top 15, and then make name an ordered factor, e.g. with forcats::fct_reorder.
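A rough sketch of that approach (untested against your data; it assumes the column names from your code and dplyr >= 1.0 for slice_max()):
library(dplyr)
library(ggplot2)
library(forcats)
top_15_songs <- spotify_2022_global %>%
  ungroup() %>%
  mutate(streams = as.numeric(streams)) %>%
  group_by(name) %>%
  summarise(streams = sum(streams)) %>%   # total streams per song
  slice_max(streams, n = 15) %>%          # keep the 15 biggest songs
  mutate(name = fct_reorder(name, streams))
ggplot(top_15_songs, aes(x = streams, y = name)) +
  geom_col() +
  ggtitle("Top 15 Songs Streamed Globally in 2022") +
  theme(plot.title = element_text(hjust = 0.5))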

Further scaling plotted data by year into intervals

I am practicing visualizing data with R with a dataset on certain incidents worldwide. I created a data frame only containing the number of incidents per year with the plyr count function.
library(plyr)
df_incidents <- count(df$iyear)
names(df_incidents)[names(df_incidents) == "x"] <- "year"
names(df_incidents)[names(df_incidents) == "freq"] <- "incidents"
df_incidents
Output:
year incidents
1970 651
1971 471
1972 568
1973 473
1974 581
... all the way to 2018
I visualised the above data with ggplot(df_incidents,aes(x=year,y=incidents)) + geom_bar(stat="identity"), which returned a bar chart of incidents per year, but I am unable to further group year into intervals of 5 years.
Should I alter my ggplot statement to scale the data or further process my df_incidents data frame into distinctive groups of approx 5 years from 1970?
You can try an approach using bars with scale_x_continuous(), or using a new variable defined by the cut() function. Here are both approaches:
library(ggplot2)
library(dplyr)
set.seed(123)
#Data
df_incidents <- data.frame(year=1978:2018,
incidents=round(runif(41,500,1000),0))
#Plot option 1
ggplot(df_incidents,aes(x=year,y=incidents))+
geom_bar(stat = 'identity',color='black',fill='cyan')+
scale_x_continuous(breaks = seq(1978,2018,by=5))
Output:
And the second approach:
#Plot option 2
df_incidents %>%
mutate(Cutyear=cut(year,breaks = seq(1978,2018,by=5),include.lowest = T,right = F)) %>%
group_by(Cutyear) %>%
summarise(incidents=sum(incidents,na.rm=T)) %>%
ggplot(aes(x=Cutyear,y=incidents))+
geom_bar(stat = 'identity',color='black',fill='cyan')
Output:
You can bin your years into categories like below:
df_incidents %>%
  mutate(Binned = cut(year, breaks = seq(1970, 2020, by = 5))) %>%
  group_by(Binned) %>%
  summarize(Incidents = sum(incidents)) %>%
  ggplot(aes(x = Binned, y = Incidents)) + geom_bar(stat = "identity")

R stacked bar charts including "other" (using ggplot2)

I want to make a stacked barchart that describes abundances of taxa at two locations in three different seasons. I'm using ggplot2. Making the plot is ok, but I have 48 taxa so I end up with a lot of different colours in the bar. There are only eight taxa that occur frequently and abundantly, so I'd like to group the others into "Other" for the plot.
My data looks like this:
SampleID TransectID SampleYear Season Location Taxa1 Taxa2 Taxa3 .... Taxa48
BW15001 1 2015 fall SiteA 25 0 0 0
BW15001 2 2015 fall SiteA 32 0 0 2
BW15001 2 2015 fall SiteA 6 0 45 0
BW15001 3 2015 fall SiteA 78 1 2 0
This is what I have tried (modified from here):
y <- rowSums(invert[6:54])
x<-invert[6:54]/y
x<-invert[,order(-colSums(x))]
#Extract list of top N Taxa
N<-8
taxa_list<-colnames(x)[1:N]
#remove "__Unknown__" and add it to others
taxa_list<-taxa_list[!grepl("Unknown",taxa_list)]
N<-length(taxa_list)
#Generate a new table with everything added to Others
new_x <- data.frame(x[, colnames(x) %in% taxa_list],
                    Others = rowSums(x[, !colnames(x) %in% taxa_list]))
df <- NULL
for (i in 1:dim(new_x)[2]) {
  tmp <- data.frame(row.names = NULL, Sample = rownames(new_x),
                    Taxa = rep(colnames(new_x)[i], dim(new_x)[1]),
                    Value = new_x[, i], Type = grouping_info[, 1])
  if (i == 1) {df <- tmp} else {df <- rbind(df, tmp)}
}
To plot the graph:
colours <- c("#F0A3FF", "#0075DC", "#993F00","#4C005C","#2BCE48","#FFCC99","#808080","#94FFB5","#8F7C00","#9DCC00","#C20088","#003380","#FFA405","#FFA8BB","#426600","#FF0010","#5EF1F2","#00998F","#740AFF","#990000","#FFFF00");
library(ggplot2)
p<-ggplot(df,aes(Sample,Value,fill=Taxa))+
geom_bar(stat="identity")+
facet_grid(. ~ Type, drop=TRUE,scale="free",space="free_x")
p<-p+scale_fill_manual(values=colours[1:(N+1)])
p<-p+theme_bw()+ylab("Proportions")
p<-p+ scale_y_continuous(expand = c(0,0))+
theme(strip.background = element_rect(fill="gray85"))+
theme(panel.spacing = unit(0.3, "lines"))
p<-p+theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
p
The main problem that I would like help with today is pulling out the main taxa and lumping the rest as "Other". I think I can figure out how to group the graph by Season and Location using facet_grid() later...
Thanks!
Expanding on my comment. Take a look at the forcats package. Without a full example, it's hard to say, but the following should work:
library(tidyverse)
library(forcats)
temp <- df %>%
  gather(taxa, amount, -c(1:5))
# Reshape the data so that there is one record for each unit of amount
tidy_df <- temp[rep(rownames(temp), times = temp$amount), ]
tidy_df %>%
select(-amount) %>%
mutate(taxa = fct_lump(taxa, n = 2)) %>% # Check out this line
ggplot(., aes(x = SampleID, fill = taxa)) +
geom_bar()
You can change fct_lump(taxa, n = 2) to fct_lump(taxa, n = 8) to group the top 8 categories. Alternatively, you can use fct_lump(taxa, prop = 0.9) to lump things up by proportions.
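If you would rather not repeat rows by amount, fct_lump() also takes a weight argument w, so you could lump on the reshaped data directly and plot the amounts with geom_col(). A sketch, assuming the temp data frame from the gather() step above:
temp %>%
  mutate(taxa = fct_lump(taxa, n = 8, w = amount)) %>% # weight the lumping by amount
  group_by(SampleID, taxa) %>%
  summarise(amount = sum(amount)) %>%
  ggplot(aes(x = SampleID, y = amount, fill = taxa)) +
  geom_col()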
If you are simply going after the "presence" of the taxa in a sample (and not the value or amount), things are a bit simpler and can likely be handled in one pipe:
df %>%
gather(taxa, amount, -c(1:5)) %>%
mutate(amount = na_if(amount, 0)) %>%
na.omit() %>%
mutate(taxa = fct_lump(taxa, n = 2)) %>%
ggplot(., aes(x = SampleID, fill = taxa)) +
geom_bar()
One way of doing it:
library(plyr)
library(reshape2) # for melt()
library(ggplot2)
d <- data.frame(SampleID = rep('BW15001', 4),
                TransectID = c(1, 2, 2, 3),
                SampleYear = rep(2015, 4),
                Taxa1 = c(25, 32, 6, 78),
                Taxa2 = c(0, 0, 0, 1),
                Taxa3 = c(0, 0, 45, 3))
#Reshape the df so that all taxa columns are melted into two
d=melt(d,id=colnames(d[,1:3]))
d$variable=as.character(d$variable)
# rename all uninteresting taxa as 'other'
`%ni%` <- Negate(`%in%`) # Here I decided to select the ones to keep, but the other way around is fine as well of course
d[d$variable %ni% c('Taxa1','Taxa2'), 'variable'] <- 'Other' # here you could add a function to automatically determine which taxa you want to keep, as you already did
# aggregate all data for 'other'
d=ddply(d,colnames(d[,1:4]),summarise,value=sum(value))
#make your plot; this one is just a bad example (`Type` stands in for whatever grouping column you facet on, e.g. Season or Location)
ggplot(d, aes(SampleID, value, fill = variable)) +
  geom_bar(stat = "identity") +
  facet_grid(. ~ Type, drop = TRUE, scale = "free", space = "free_x")

box plot for multiple observations

I have multiple observations of rainfall for the same station for around 14 years. The data frame looks something like this:
df (starting from date 01/01/2000)
v1 v2 v3 v4 v5 v6 ........ v20
1 1 2 4 8 9..............
1.4 4 3.8..................
1.5 3 1.6....................
1.6 8 .....................
...
till date 31/12/2013, i.e. 5114 observations in total
where v1, v2, ..., v20 are rainfall simulations for the same point. I want to plot a box plot that represents the monthly quantiles and median when all the observations are taken together.
I can plot a box plot for single monthly values using:
df$month<-factor(month.name,levels=month.name)
library(reshape2)
df.long<-melt(df,id.vars="month")
ggplot(df.long,aes(month,value))+geom_boxplot()
but in this problem, as the data are daily and there are multiple observations, I don't know where to start.
sample data
df <- data.frame(matrix(rnorm(5114 * 100), nrow = 5114, ncol = 100))
In case you want to work with a zoo object:
library(zoo)
date <- seq(as.POSIXct("2000-01-01 00:00:00", "GMT"), as.POSIXct("2013-12-31 00:00:00", "GMT"), by = "1440 min")
If you want, you can also convert it to a zoo object:
x <- zoo(df, order.by = seq(as.POSIXct("2000-01-01 00:00:00", "GMT"), as.POSIXct("2013-12-31 00:00:00", "GMT"), by = "1440 min"))
I am not familiar with zoo, so I converted your sample to a data frame. Your idea of using melt() is the right way. Then, you need to aggregate the rain amount by month. It is worth looking up aggregate() and other options. Here, I used dplyr and tidyr to arrange the sample data; I hope this will let you move forward.
### zoo to data frame by # Joshua Ulrich
### http://stackoverflow.com/questions/14064097/r-convert-between-zoo-object-and-data-frame-results-inconsistent-for-different
zoo.to.data.frame <- function(x, index.name = "Date") {
  stopifnot(is.zoo(x))
  xn <- if (is.null(dim(x))) deparse(substitute(x)) else colnames(x)
  setNames(data.frame(index(x), x, row.names = NULL), c(index.name, xn))
}
### to data frame
foo <- zoo.to.data.frame(x)
str(foo)
library(dplyr)
library(tidyr)
library(reshape2) # for melt()
library(ggplot2)
### wide to long data frame, aggregate rain amount by Date
ana <- foo %>%
melt(., id.vars = "Date") %>%
group_by(Date) %>%
summarize(rain = sum(value))
### Aggregate rain amount by year and month
bob <- ana %>%
separate(Date, c("year", "month", "date")) %>%
group_by(year, month) %>%
summarize(rain = sum(rain))
### Drawing a ggplot figure
ggplot(data = bob, aes(x = month, y = rain)) +
geom_boxplot()
Just found out an easier way to do it; however, your answer really helped, Jazzuro.
install.packages("reshape2")
library(dplyr)
library(reshape2)
library(zoo) # for zoo() and as.yearmon()
require(ggplot2)
df <- data.frame(matrix(rnorm(5114 * 100), nrow = 5114, ncol = 100))
x <- zoo(df, order.by = seq(as.POSIXct("2000-01-01 00:00:00", "GMT"),
                            as.POSIXct("2013-12-31 00:00:00", "GMT"), by = "1440 min"))
v <- aggregate(x, as.yearmon, mean)
months<- rep(1:12,14)
lol<-data.frame(v,months)
df.m <- melt(lol, id.var = "months")
View(df.m)
p <- ggplot(df.m, aes(factor(months), value))
p + geom_boxplot(aes(fill = months))
