Why is my bar chart not showing all the data? - R

I am working on a music streaming project, and I am trying to get the top 15 global streams in 2022 and turn them into an interactive graph.
The top 15 song names show up correctly as a data frame, but the plot fails as a vertical bar graph, and I wonder where I went wrong. It did work after I flipped the bar graph to horizontal, but the data looks a bit off.
It looks like this as a vertical bar graph:
The horizontal bar graph looks like this, but the data seems incorrect:
Here is the code I have:
library("dplyr")
library("ggplot2")
# load the .csv into RStudio; you can do this one of two ways
#read.csv("the name of the .csv you downloaded from kaggle")
spotiify_origional <- read.csv("charts.csv")
spotiify_origional <- read.csv("https://raw.githubusercontent.com/info201a-au2022/project-group-1-section-aa/main/data/charts.csv")
View(spotiify_origional)
# filters down the data
# removes the track id, explicit, and duration columns
spotify_modify <- spotiify_origional %>%
select(name, country, date, position, streams, artists, genres = artist_genres)
#returns all the data just from 2022
#this is the data set you should use on the project
spotify_2022 <- spotify_modify %>%
filter(date >= "2022-01-01") %>%
arrange(date) %>%
group_by(date)
# use write.csv() to turn the new dataset into a .csv file
# template: write.csv(Your DataFrame, "Path to export the DataFrame\\File Name.csv", row.names = FALSE)
write.csv(spotify_2022, "/Users/oliviasapp/Documents/info201/project-group-1-section-aa/data/spotify_2022.csv" , row.names = FALSE)
# then I pushed the spotify_2022.csv to the GitHub repo
View(spotiify_origional)
spotify_2022_global <- spotify_modify %>%
filter(date >= "2022-01-01") %>%
filter(country == "global") %>%
arrange(date) %>%
group_by(streams)
View(spotify_2022_global)
top_15 <- spotify_2022_global[order(spotify_2022_global$streams, decreasing = TRUE), ]
top_15 <- top_15[1:15,]
top_15$streams <- as.numeric(top_15$streams)
View(top_15)
col_chart <- ggplot(data = top_15) +
geom_col(mapping = aes(x = name, y = streams)) +
ggtitle("Top 15 Songs Daily Streamed Globally") +
theme(plot.title = element_text(hjust = 0.5))
col_chart <- col_chart + coord_cartesian(ylim = c(999000,1000000)) + coord_flip()
col_chart
Thank you so much! Any suggestions would be a huge help!

top_15 <- spotify_2022_global[order(spotify_2022_global$streams, decreasing = TRUE), ]
This code sorts in decreasing order, but the streams column is still of character type at this point, so strings like "999975" sort "higher" than "1000000" (character comparison goes digit by digit), which is why your data looks weird. One song had two weeks just under 1M streams, which is why it shows up with ~2M.
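You can see the character-vs-numeric ordering in a quick demo:
# character sort: "9..." beats "1..." regardless of length
sort(c("999975", "1000000"), decreasing = TRUE)               # "999975"  "1000000"
sort(as.numeric(c("999975", "1000000")), decreasing = TRUE)   # 1000000  999975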
If you use this instead you'll get closer to what you intended:
top_15 <- spotify_2022_global[order(as.numeric(spotify_2022_global$streams), decreasing = TRUE), ]
However, this is finding the highest song-weeks, not the highest songs, so in this case all 15 highest song-weeks were one song.
I'd suggest you group_by(name) and then summarize to get total streams by song, filter top 15, and then make name an ordered factor, e.g. with forcats::fct_reorder.
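A minimal sketch of that approach, reusing the column names from your code (untested against your data):
library(forcats)
top_15 <- spotify_2022_global %>%
  group_by(name) %>%
  summarise(total_streams = sum(as.numeric(streams), na.rm = TRUE)) %>%
  slice_max(total_streams, n = 15) %>%
  mutate(name = fct_reorder(name, total_streams))  # order bars by stream count
ggplot(top_15, aes(x = name, y = total_streams)) +
  geom_col() +
  coord_flip() +
  ggtitle("Top 15 Songs Streamed Globally in 2022") +
  theme(plot.title = element_text(hjust = 0.5))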

Related

Getting candlestick chart to display properly using a text / .txt file of historic stock prices in R

Hello there,
I have purchased historic intraday prices of the S&P 500 (1 min through 1 hour) going back to 2005, because most stock charting packages stop reporting intraday prices around 2016 or 2011. I have successfully imported the prices and gotten R to read only market hours, excluding premarket and aftermarket. Two problems remain. First, I need the chart to not show Saturday and Sunday. The bigger problem is that the plot is NOT showing candlesticks but bars, and they are very hard to read. I have tried increasing the size via (size = 4), but the bars overlap and are still not candlesticks. How can I get these to show as proper candlesticks? Thank you.
library(quantmod)
library(tidyquant)
library(tidyverse)
library(lubridate) # for hour() and minute()
library(hms)       # for as_hms()
library(ggplot2)
library(readr)
library(ggforce)
library(dplyr)
dir <- "E:/Stock Trading/Historical Data/SPY_qjrt28"
setwd(dir)
data <- read_csv("SPY_30min.txt",
col_names = FALSE)
names(data) <- tolower(c("DateTime", "Open", "High", "Low", "Close", "Volume"))
data
#clean the data
write_rds(data, "cleaned.rds")
read_rds("cleaned.rds")
spy30m <- read_rds("cleaned.rds")
firstwave <- filter(spy30m, datetime >= as.Date('2009-03-02'), datetime <= as.Date('2009-03-19'))
# adding more time objects to the dataset
data <- data %>%
mutate(hour = hour(datetime),
minute = minute(datetime),
hms = as_hms(datetime))
# is the hour function working as expected? Yes!
data %>%
select(datetime, hour) %>%
sample_n(10)
# look at bins of observations at 30 minute intervals. Looks good!
data %>%
group_by(hms) %>%
summarise(count = n()) %>%
arrange(hms) %>%
print(n=100)
# filter the dataset to only include the times during regular market hours
data_regularmkt <- data %>%
# `filter` is the dplyr function that limits the number of observations in a data frame
# `between` function takes 3 arguments: an object/variable, a lower bound value, and upper bound value
filter(between(hms, as_hms("09:30:00"), as_hms("16:00:00")))
# look at it again
data_regularmkt %>%
group_by(hms) %>%
summarise(count = n()) %>%
arrange(hms) %>%
print(n=100)
###########
firstwave <- filter(spy30m, datetime >= as.Date('2009-03-06'), datetime <= as.Date('2009-03-19'))
ggplot(firstwave, aes(x = datetime, y = close)) +
geom_candlestick(aes(open = open, high = high, low = low, close = close, size = 3))
Say we have a data frame df with the columns date (dttm format), open, high, low, close.
To overcome the issue that non-trading hours are shown, my first idea was to use another x-axis scale. Here's with a row-index.
library(tidyverse)
library(lubridate)
library(tidyquant)
df <- df %>%
arrange(date) %>%
mutate(i = row_number())
# this is for the x-axis labels
df_x <- df %>%
group_by(d = floor_date(date, "day")) %>%
filter(date == min(date))
df %>%
ggplot(aes(x = i)) +
geom_candlestick(aes(open = open, low = low, high = high, close = close)) +
scale_x_continuous(breaks = df_x$i,
labels = df_x$date)
The problem then is that if a contract is halted during trading hours, there will be no data for that period either, just as for nights or weekends; however, those are times you probably do want to show.
One could probably play with tidyr's complete() or expand() to fill out the data first and still use my solution of plotting over an index x-scale.
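A rough sketch of that idea with tidyr::complete, assuming 30-minute bars (untested):
library(tidyr)
df_filled <- df %>%
  group_by(day = floor_date(date, "day")) %>%
  complete(date = seq(min(date), max(date), by = "30 min")) %>% # NA bars for halted periods
  ungroup() %>%
  arrange(date) %>%
  mutate(i = row_number())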
It could be easier to use the plotly library:
library(plotly)
plt <- plot_ly(data = df, x = ~date,
open = ~open, close = ~close,
high = ~high, low = ~low,
type="candlestick")
plt
This is to hide the non-trading hours:
plt %>% layout(showlegend = F, xaxis = list(rangebreaks=
list(
list(bounds=list(17, 9),
pattern="hour")),#hide hours outside of 9am-5pm
dtick=86400000.0/2,
tickformat="%H:%M\n%b\n%Y"))
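To also hide Saturday and Sunday (the first problem in the question), rangebreaks accept day-of-week bounds as well, per the plotly time-series docs linked below:
plt %>% layout(xaxis = list(rangebreaks = list(
    list(bounds = list("sat", "mon")),             # hide weekends
    list(bounds = list(17, 9), pattern = "hour")))) # hide non-trading hours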
More information can be found here: https://plotly.com/r/time-series/#hiding-nonbusiness-hours and https://plotly.com/r/candlestick-charts/
As for you not liking the appearance of tidyquant's geom_candlestick, I also suggest you try out Plotly.

R: Creative visualization in RStudio

I am at the final stage of a project where I have been comparing the appraisal price vs. the sold price of different properties. The complete code for data collection and tidying is below.
At this stage I am looking at different ways to visualize my data. However, I am quite new to this, so my question is whether anyone has any "new" or special ways of visualizing data that they find useful or intuitive. I have given a couple of examples of what I am able to visualize now using ggplot.
Additionally: right now my visualizations plot all 1275 observations every time. I would, however, also like to visualize the data with the mean and median of the Percentage, Sold and Tax variables, which I am most interested in; for example, the mean value of the Percentage column per year (see the sketch after the plotting code below).
Appreciate any help!
Complete code:
#Step 1: Load needed library
library(tidyverse)
library(rvest)
library(jsonlite)
library(stringi)
library(dplyr)
library(data.table)
library(ggplot2)
#Step 2: Access the URL of where the data is located
url <- "https://www.forsvarsbygg.no/ListApi/ListContent/78635/SoldEstates/0/10/"
#Step 3: Direct JSON as format of data in URL
data <- jsonlite::fromJSON(url, flatten = TRUE)
#Step 4: Access all items in API
totalItems <- data$TotalNumberOfItems
#Step 5: Summarize all data from API
allData <- paste0('https://www.forsvarsbygg.no/ListApi/ListContent/78635/SoldEstates/0/', totalItems,'/') %>%
jsonlite::fromJSON(., flatten = TRUE) %>%
.[1] %>%
as.data.frame() %>%
rename_with(~str_replace(., "ListItems.", ""), everything())
#Step 6: remove columns not needed
allData <- allData[, -c(1,4,8,9,11,12,13,14,15)]
#Step 7: remove whitespace and change to numeric in columns SoldAmount and Tax
#https://stackoverflow.com/questions/71440696/r-warning-argument-is-not-an-atomic-vector-when-attempting-to-remove-whites/71440806#71440806
allData[c("Tax", "SoldAmount")] <- lapply(allData[c("Tax", "SoldAmount")], function(z) as.numeric(gsub(" ", "", z)))
#Step 8: Remove rows where value is NA
#https://stackoverflow.com/questions/4862178/remove-rows-with-all-or-some-nas-missing-values-in-data-frame
alldata <- allData %>%
filter(across(where(is.numeric),
~ !is.na(.)))
#Step 9: Remove values below 10000 NOK in SoldAmount and Tax
alldata <- alldata %>%
filter_all(any_vars(is.numeric(.) & . > 10000))
#Step 10: Calculate percentage change between tax and sold amount and create new column with percent change
#df %>% mutate(Percentage = number/sum(number))
alldata_Percent <- alldata %>% mutate(Percentage = (SoldAmount-Tax)/Tax)
Visualization
# Plot Percentage difference based on County
ggplot(data=alldata_Percent,mapping = aes(x = Percentage, y = County)) +
geom_point(size = 1.5)
#Plot County with both Date and Percentage difference
theme_set(new = ggthemes::theme_economist())
p <- ggplot(data = alldata_Percent,
mapping = aes(x = Date, y = Percentage, colour = County)) +
geom_line(na.rm = TRUE) +
geom_point(na.rm = TRUE)
p
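For the yearly-mean idea mentioned above, a minimal sketch (assuming the Date column can be parsed by lubridate; adjust the parser to the real format):
library(lubridate)
yearly <- alldata_Percent %>%
  mutate(year = year(dmy(Date))) %>% # dmy() is a guess at the date format
  group_by(year) %>%
  summarise(mean_percentage = mean(Percentage, na.rm = TRUE),
            median_percentage = median(Percentage, na.rm = TRUE))
ggplot(yearly, aes(x = year, y = mean_percentage)) + geom_col()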

Ggplot loop over unique variables in a group

I made a loop to make a plot for every unique value of a variable within a group. To make my code reproducible I used the nycflights13 package. Unfortunately, with this example data my code gives the desired result. In my own data, however, I have flight origins that don't occur in a certain year, giving me an empty plot for that origin in that year. I would like that, within one group (in this example, year), only the origins that occur in that year are shown. Could somebody help me out?
library(nycflights13)
library(tidyverse)
plotter_de_plot<-function(origination, YEARR){
eval(substitute(origination), flights)
eval(substitute(YEARR), flights)
flights %>%
subset(year==YEARR)%>%
select(month,origin,hour,year)%>%
group_by(origin, month) %>%
mutate(AMOUNT = (sum(hour, na.rm=TRUE))) %>%
filter(!is.na(hour),
origin==origination,year==YEARR) %>%
ggplot(aes(month,AMOUNT), na.rm = TRUE)+
geom_point() +
labs(title=origination,subtitle=YEARR)
}
for (i in unique(flights$origin)){
plot(plotter_de_plot(i,2013))
}
In addition to stefan's answer, which addresses the problem perfectly, I would recommend using purrr::map instead of your for loop:
my_plots = unique(flights$origin) %>%
set_names() %>%
map(plotter_de_plot, YEARR=2013)
my_plots$EWR
my_plots$LGA
my_plots$JFK
This way, you can access each plot inside a list. Another way would be to use facets.
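If you want to write each plot in the list to disk, something like this should work (file names are made up):
purrr::iwalk(my_plots, ~ ggsave(paste0("flights_", .y, "_2013.png"), .x))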
Also, your plots are absurdly heavy (several MB each) and might take a long time to plot. That is because you are using mutate() instead of summarise(), so every row of the original data ends up in each plot.
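You can check the size difference yourself:
# mutate() keeps every flight (336,776 rows, all overplotted);
# summarise() collapses to one row per origin-month (36 rows)
flights %>% group_by(origin, month) %>% mutate(AMOUNT = sum(hour, na.rm = TRUE)) %>% nrow()
flights %>% group_by(origin, month) %>% summarise(AMOUNT = sum(hour, na.rm = TRUE)) %>% nrow()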
Here is an example with facets that took <1 sec to compute:
flights %>%
filter(year==2013)%>%
select(month, origin, hour,year)%>%
group_by(origin, month) %>%
summarise(AMOUNT = (sum(hour, na.rm=TRUE))) %>%
ggplot(aes(month,AMOUNT), na.rm = TRUE)+
geom_point() +
labs(subtitle="Year 2013") +
facet_wrap(~origin)
One option would be to break your pipeline into two parts: data wrangling and plotting. Doing so, you could check whether the filtered and aggregated dataset contains any data using e.g. nrow(d) > 0, and return NULL if it doesn't. In your for loop you could then check for NULL before plotting:
To mimic your use case I used flights$year[flights$origin == "EWR"] <- 2015 so that the example data includes an origin with no data for year 2013:
library(nycflights13)
library(tidyverse)
plotter_de_plot <- function(origination, YEARR) {
d <- flights %>%
select(month, origin, hour, year) %>%
filter(
!is.na(hour),
origin == origination, year == YEARR
) %>%
group_by(month) %>%
mutate(AMOUNT = sum(hour, na.rm = TRUE))
if (nrow(d) > 0) {
ggplot(d, aes(month, AMOUNT), na.rm = TRUE) +
geom_point() +
labs(title = origination, subtitle = YEARR)
}
}
flights$year[flights$origin == "EWR"] <- 2015
for (i in unique(flights$origin)) {
p <- plotter_de_plot(i, 2013)
if (!is.null(p)) plot(p)
}

LDA topic model plotting by year

I'm trying to plot tweet topics by year from this file:
https://www.mediafire.com/file/64lzbt46v01jbe1/cleaned.xlsx/file
It works fine to get the topics, but when I try to plot them by year I have a dimensions problem:
library(readxl)
library(tm)
library(topicmodels) # needed for LDA() and posterior()
tweets <- read_xlsx("C:/cleaned.xlsx")
mytextdata <- tweets$textdata
# Convert to tm corpus and use its API
corpus <- Corpus(VectorSource(mytextdata)) # Create corpus object
dtm <- DocumentTermMatrix(corpus)
ui = unique(dtm$i)
dtm.new = dtm[ui,]
k <- 7
ldaTopics <- LDA(dtm.new, method = "Gibbs", control=list(alpha = 0.1, seed = 77), k = k)
tmResult <- posterior(ldaTopics)
theta <- tmResult$topics
dim(theta)
theta has 4857 rows, but I have 4876 dates in my cleaned.xlsx file, and I need them to be the same to run this aggregate function:
topic_proportion_per_decade <- aggregate(theta, by = list(decade = textdata$decade), mean)
from here
https://tm4ss.github.io/docs/Tutorial_6_Topic_Models.html
I think the problem is that the cleaned.xlsx file is not clean enough, and that is why theta is missing some rows, but honestly I don't know why the rows are missing.
I also don't know how to clean the file better, if that is the problem; the file looks fine to me. There are some rows that contain only numbers or non-English words, but I prefer to keep them.
The problem is that ui = unique(dtm$i) removes several documents, namely those left with no terms at all (dtm$i holds the row indices of the non-empty entries, so subsetting by unique(dtm$i) drops empty rows). As a result your theta doesn't have the same number of rows as the data. We can solve this by keeping only the rows that are still in theta:
library("dplyr")
library("reshape2")
library("ggplot2")
tweets_clean <- tweets %>%
mutate(id = rownames(.)) %>%
filter(id %in% rownames(theta)) %>% # keep only rows still in theta
cbind(theta) %>% # now we can attach the topics to the data.frame
mutate(year = format(date, "%Y")) # make year variable
I then used dplyr functions to make the aggregation, since I think it makes for easier to read code:
tweets_clean_yearly <- tweets_clean %>%
group_by(year) %>%
summarise_at(vars(as.character(1:7)), mean) %>%
melt(id.vars = "year")
Then we can plot this:
ggplot(tweets_clean_yearly, aes(x = year, y = value, fill = variable)) +
geom_bar(stat = "identity") +
ylab("proportion")
Note: I tested if theta and tweets had really the same documents with:
tweets_clean <- tweets %>%
mutate(id = rownames(.)) %>%
filter(id %in% rownames(theta))
all.equal(tweets_clean$id, rownames(theta))

cumsum data over time by factor

I'm using the campaign contributions data from Oregon and I'm trying to make a graph that displays the cumulative amount of contributions per candidate over time. Here's what I have so far:
ggplot(aes(x = as.Date(contb_receipt_dt, "%d-%b-%y"),
y = cumsum(contb_receipt_amt)),
data = subset(oregon_data,
table(oregon_data$cand_nm)[oregon_data$cand_nm] > 1000
& as.Date(contb_receipt_dt, "%d-%b-%y") > as.Date("2015-01-01")))
+ geom_line(aes(color = cand_nm), bins = 5)
This is what it looks like:
What I would like to see is a line for each candidate that starts off at 0 and slowly goes up with each additional contribution. What should I do?
I would use dplyr to calculate the cumsum column before sending it on to ggplot. This should give you enough to get started; however, you will need to pretty it up and filter the data to get the results you are looking for:
library(dplyr)
library(ggplot2)
WashingtonData <- read.csv("P00000001-WA.csv")
# parse the dates first so arrange() sorts chronologically rather than alphabetically
WashingtonData <- WashingtonData %>%
  mutate(contb_receipt_dt = as.Date(contb_receipt_dt, "%d-%b-%y")) %>%
  arrange(contb_receipt_dt)
MyGraphData <- WashingtonData %>% group_by(cand_nm) %>% mutate(cumsum = cumsum(contb_receipt_amt))
g <- ggplot(data = MyGraphData, aes(y = cumsum, x = contb_receipt_dt, color = cand_nm)) + geom_line()
g
