Cumulative sum of data over time by factor in R

I'm using the campaign contributions data from Oregon and I'm trying to make a graph that displays the cumulative amount of contributions per candidate over time. Here's what I have so far:
ggplot(aes(x = as.Date(contb_receipt_dt, "%d-%b-%y"),
           y = cumsum(contb_receipt_amt)),
       data = subset(oregon_data,
                     table(oregon_data$cand_nm)[oregon_data$cand_nm] > 1000
                     & as.Date(contb_receipt_dt, "%d-%b-%y") > as.Date("2015-01-01"))) +
  geom_line(aes(color = cand_nm), bins = 5)
This is what it looks like:
What I would like to see is a line for each candidate that starts off at 0 and slowly goes up with each additional contribution. What should I do?

I would use dplyr to calculate the cumsum column before sending it on to ggplot. This should give you enough to get started; you will still need to pretty it up and filter the data to get the results you are looking for:
library(dplyr)
library(ggplot2)

WashingtonData <- read.csv("P00000001-WA.csv")
# parse the date so arrange() sorts chronologically rather than alphabetically
WashingtonData$contb_receipt_dt <- as.Date(WashingtonData$contb_receipt_dt, "%d-%b-%y")
WashingtonData <- WashingtonData %>% arrange(contb_receipt_dt)
MyGraphData <- WashingtonData %>% group_by(cand_nm) %>% mutate(cumsum = cumsum(contb_receipt_amt))
g <- ggplot(data = MyGraphData, aes(y = cumsum, x = contb_receipt_dt, color = cand_nm)) + geom_line()
g
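Applied to the Oregon data from the question, the same idea plus the question's two filters might look like the following sketch (the file name P00000001-OR.csv is an assumption mirroring the Washington file above):
library(dplyr)
library(ggplot2)

# assumed Oregon equivalent of the Washington file above
oregon_data <- read.csv("P00000001-OR.csv")
oregon_data$contb_receipt_dt <- as.Date(oregon_data$contb_receipt_dt, "%d-%b-%y")

graph_data <- oregon_data %>%
  filter(contb_receipt_dt > as.Date("2015-01-01")) %>%  # contributions after 2015-01-01
  group_by(cand_nm) %>%
  filter(n() > 1000) %>%                                # candidates with more than 1000 contributions
  arrange(contb_receipt_dt) %>%
  mutate(cum_amount = cumsum(contb_receipt_amt))        # running total per candidate

ggplot(graph_data, aes(x = contb_receipt_dt, y = cum_amount, color = cand_nm)) +
  geom_line()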

Related

Why is my bar chart not showing all the data?

I am working on a music streaming project, and I am trying to get the top 15 globally streamed songs in 2022 and turn them into an interactive graph.
It successfully showed the top 15 song names as a dataframe, but it failed to show as a bar graph, and I wonder where I went wrong. It did work after I flipped the bar graph to horizontal, but the data look a bit off.
It looks like this as a vertical bar graph:
The horizontal bar graph looks like this, but the data seem incorrect:
Here is the code I have:
library("dplyr")
library("ggplot2")
# load the .csv into R studio, you can do this 1 of 2 ways
#read.csv("the name of the .csv you downloaded from kaggle")
spotiify_origional <- read.csv("charts.csv")
spotiify_origional <- read.csv("https://raw.githubusercontent.com/info201a-au2022/project-group-1-section-aa/main/data/charts.csv")
View(spotiify_origional)
# filters down the data
# removes the track id, explicit, and duration columns
spotify_modify <- spotiify_origional %>%
  select(name, country, date, position, streams, artists, genres = artist_genres)
#returns all the data just from 2022
#this is the data set you should use on the project
spotify_2022 <- spotify_modify %>%
  filter(date >= "2022-01-01") %>%
  arrange(date) %>%
  group_by(date)
# use write.csv() to turn the new dataset into a .csv file:
# write.csv(Your DataFrame, "Path to export the DataFrame\\File Name.csv", row.names = FALSE)
write.csv(spotify_2022, "/Users/oliviasapp/Documents/info201/project-group-1-section-aa/data/spotify_2022.csv", row.names = FALSE)
# then I pushed the spotify_2022.csv to the GitHub repo
View(spotiify_origional)
spotify_2022_global <- spotify_modify %>%
  filter(date >= "2022-01-01") %>%
  filter(country == "global") %>%
  arrange(date) %>%
  group_by(streams)
View(spotify_2022_global)
top_15 <- spotify_2022_global[order(spotify_2022_global$streams, decreasing = TRUE), ]
top_15 <- top_15[1:15,]
top_15$streams <- as.numeric(top_15$streams)
View(top_15)
col_chart <- ggplot(data = top_15) +
  geom_col(mapping = aes(x = name, y = streams)) +
  ggtitle("Top 15 Songs Daily Streamed Globally") +
  theme(plot.title = element_text(hjust = 0.5))
col_chart <- col_chart + coord_cartesian(ylim = c(999000, 1000000)) + coord_flip()
col_chart
col_chart
Thank you so much! Any suggestions will hugely help!
top_15 <- spotify_2022_global[order(spotify_2022_global$streams, decreasing = TRUE), ]
This code sorts in decreasing order, but the streams data here is still of character type, so numbers like 999975 will be "higher" than 1M, which is why your data looks weird. One song had two weeks just under 1M which is why it shows up with ~2M.
If you use this instead you'll get more what you intended:
top_15 <- spotify_2022_global[order(as.numeric(spotify_2022_global$streams), decreasing = TRUE), ]
However, this is finding the highest song-weeks, not the highest songs, so in this case all 15 highest song-weeks were one song.
I'd suggest you group_by(name) and then summarize to get total streams by song, filter top 15, and then make name an ordered factor, e.g. with forcats::fct_reorder.
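A minimal sketch of that approach, reusing the spotify_2022_global data frame built in the question (with its name and streams columns):
library(dplyr)
library(ggplot2)
library(forcats)

top_15 <- spotify_2022_global %>%
  mutate(streams = as.numeric(streams)) %>%            # convert from character first
  group_by(name) %>%
  summarize(total_streams = sum(streams, na.rm = TRUE)) %>%
  slice_max(total_streams, n = 15) %>%
  mutate(name = fct_reorder(name, total_streams))      # order bars by total streams

ggplot(top_15, aes(x = name, y = total_streams)) +
  geom_col() +
  coord_flip() +
  ggtitle("Top 15 Songs Streamed Globally in 2022") +
  theme(plot.title = element_text(hjust = 0.5))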

R: Creative visualization in RStudio

I am at the final stages of a project where I have been comparing the appraisal price vs the sold price of different properties. The complete code for data collection and tidying is below.
At this stage I am looking at different ways to visualize my data. However, I am quite new to this, so my question is whether anyone has any "new" or special ways of visualizing data that they find useful or intuitive. I have given a couple of examples of what I am able to visualize now using ggplot.
Additionally: right now my visualizations plot all 1275 observations every time. However, I would also like to visualize the data with both the mean and the median of the Percentage, Sold and Tax variables, which I am most interested in; for example, the mean value of the Percentage column for different years (see the sketch after the code below).
Appreciate any help!
Complete code:
#Step 1: Load needed library
library(tidyverse)
library(rvest)
library(jsonlite)
library(stringi)
library(dplyr)
library(data.table)
library(ggplot2)
#Step 2: Access the URL of where the data is located
url <- "https://www.forsvarsbygg.no/ListApi/ListContent/78635/SoldEstates/0/10/"
#Step 3: Direct JSON as format of data in URL
data <- jsonlite::fromJSON(url, flatten = TRUE)
#Step 4: Access all items in API
totalItems <- data$TotalNumberOfItems
#Step 5: Summarize all data from API
allData <- paste0('https://www.forsvarsbygg.no/ListApi/ListContent/78635/SoldEstates/0/', totalItems, '/') %>%
  jsonlite::fromJSON(., flatten = TRUE) %>%
  .[1] %>%
  as.data.frame() %>%
  rename_with(~str_replace(., "ListItems.", ""), everything())
#Step 6: remove columns not needed
allData <- allData[, -c(1,4,8,9,11,12,13,14,15)]
#Step 7: remove whitespace and change to numeric in columns SoldAmount and Tax
#https://stackoverflow.com/questions/71440696/r-warning-argument-is-not-an-atomic-vector-when-attempting-to-remove-whites/71440806#71440806
allData[c("Tax", "SoldAmount")] <- lapply(allData[c("Tax", "SoldAmount")], function(z) as.numeric(gsub(" ", "", z)))
#Step 8: Remove rows where value is NA
#https://stackoverflow.com/questions/4862178/remove-rows-with-all-or-some-nas-missing-values-in-data-frame
alldata <- allData %>%
  filter(across(where(is.numeric),
                ~ !is.na(.)))
#Step 9: Remove values below 10000 NOK in SoldAmount and Tax.
alldata <- alldata %>%
  filter_all(any_vars(is.numeric(.) & . > 10000))
#Step 10: Calculate percentage change between tax and sold amount and create new column with percent change
#df %>% mutate(Percentage = number/sum(number))
alldata_Percent <- alldata %>% mutate(Percentage = (SoldAmount-Tax)/Tax)
Visualization
# Plot Percentage difference based on County
ggplot(data = alldata_Percent, mapping = aes(x = Percentage, y = County)) +
  geom_point(size = 1.5)
#Plot County with both Date and Percentage difference
theme_set(new = ggthemes::theme_economist())
p <- ggplot(data = alldata_Percent,
            mapping = aes(x = Date, y = Percentage, colour = County)) +
  geom_line(na.rm = TRUE) +
  geom_point(na.rm = TRUE)
p
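A minimal sketch of the per-year mean mentioned in the question, assuming the Date column parses with as.Date (its exact format in the API response may differ) and using the libraries already loaded above:
# average percentage difference per year
yearly_means <- alldata_Percent %>%
  mutate(Year = format(as.Date(Date), "%Y")) %>%
  group_by(Year) %>%
  summarise(MeanPercentage = mean(Percentage, na.rm = TRUE))

ggplot(yearly_means, aes(x = Year, y = MeanPercentage)) +
  geom_col()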

Plotting frequency of occurrences based on start/end times in R

I have a "trips" dataset that includes a unique trip id, and a start and end time (the specific hour and minute) of the trips. These trips were all taken on the same day. I am trying to determine the number of cars on the road at any given time and plot it as a line graph using ggplot in R. In other words, a car is "on the road" at any time in between its start and end time.
The most similar example I can find uses the following structure:
yearly_counts <- trips %>%
  count(year, trip_id)
ggplot(data = yearly_counts, mapping = aes(x = year, y = n)) +
  geom_line()
Would the best approach be to modify this structure to have a "minutesByHour_count" variable that holds a count for every minute of every hour? This seems inefficient to me, and it still doesn't solve the problem of deriving the counts from the start/end times.
Is there any easier way to do this?
Here's an example based on counting each start as an additional car, and each end as a reduction in the count:
library(tidyverse)
df %>%
  gather(type, time, c(start_hour, end_hour)) %>%
  mutate(count_chg = if_else(type == "start_hour", 1, -1)) %>%
  arrange(time) %>%
  mutate(car_count = cumsum(count_chg)) %>%
  ggplot(aes(time, car_count)) +
  geom_step()
Sample data:
df <- data.frame(
  uniqueID = 1:60,
  start_hour = seq(8, 12, length.out = 60),
  dur_hour = 0.05 * 1:60
)
df$end_hour = df$start_hour + df$dur_hour
df$dur_hour = NULL

R stacked bar charts including "other" (using ggplot2)

I want to make a stacked barchart that describes abundances of taxa at two locations in three different seasons. I'm using ggplot2. Making the plot is ok, but I have 48 taxa so I end up with a lot of different colours in the bar. There are only eight taxa that occur frequently and abundantly, so I'd like to group the others into "Other" for the plot.
My data looks like this:
SampleID TransectID SampleYear Season Location Taxa1 Taxa2 Taxa3 .... Taxa48
BW15001 1 2015 fall SiteA 25 0 0 0
BW15001 2 2015 fall SiteA 32 0 0 2
BW15001 2 2015 fall SiteA 6 0 45 0
BW15001 3 2015 fall SiteA 78 1 2 0
This is what I have tried (modified from here):
y <- rowSums(invert[6:54])
x<-invert[6:54]/y
x<-invert[,order(-colSums(x))]
#Extract list of top N Taxa
N<-8
taxa_list<-colnames(x)[1:N]
#remove "__Unknown__" and add it to others
taxa_list<-taxa_list[!grepl("Unknown",taxa_list)]
N<-length(taxa_list)
#Generate a new table with everything added to Others
new_x <- data.frame(x[, colnames(x) %in% taxa_list],
                    Others = rowSums(x[, !colnames(x) %in% taxa_list]))
df <- NULL
for (i in 1:dim(new_x)[2]){
  tmp <- data.frame(row.names = NULL, Sample = rownames(new_x),
                    Taxa = rep(colnames(new_x)[i], dim(new_x)[1]),
                    Value = new_x[, i], Type = grouping_info[, 1])
  if (i == 1) {df <- tmp} else {df <- rbind(df, tmp)}
}
To plot the graph:
colours <- c("#F0A3FF", "#0075DC", "#993F00","#4C005C","#2BCE48","#FFCC99","#808080","#94FFB5","#8F7C00","#9DCC00","#C20088","#003380","#FFA405","#FFA8BB","#426600","#FF0010","#5EF1F2","#00998F","#740AFF","#990000","#FFFF00");
library(ggplot2)
p <- ggplot(df, aes(Sample, Value, fill = Taxa)) +
  geom_bar(stat = "identity") +
  facet_grid(. ~ Type, drop = TRUE, scale = "free", space = "free_x")
p <- p + scale_fill_manual(values = colours[1:(N+1)])
p <- p + theme_bw() + ylab("Proportions")
p <- p + scale_y_continuous(expand = c(0, 0)) +
  theme(strip.background = element_rect(fill = "gray85")) +
  theme(panel.spacing = unit(0.3, "lines"))
p <- p + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
p
The main problem that I would like help with today is pulling out the main taxa and lumping the rest as "Other". I think I can figure out how to group the graph by Season and Location using facet_grid() later...
Thanks!
Expanding on my comment. Take a look at the forcats package. Without a full example, it's hard to say, but the following should work:
library(tidyverse)
library(forcats)
temp <- df %>%
  gather(taxa, amount, -c(1:5))
# Reshape the data so that there is one record per each amount
tidy_df <- temp[rep(rownames(temp), times = temp$amount), ]
tidy_df %>%
  select(-amount) %>%
  mutate(taxa = fct_lump(taxa, n = 2)) %>% # Check out this line
  ggplot(., aes(x = SampleID, fill = taxa)) +
  geom_bar()
You can change fct_lump(taxa, n = 2) to fct_lump(taxa, n = 8) to group the top 8 categories. Alternatively, you can use fct_lump(taxa, prop = 0.9) to lump things up by proportions.
If you are simply going after the "presence" of the taxa in a sample (and not the value or amount), things are a bit simpler and can likely be handled in one pipe:
df %>%
  gather(taxa, amount, -c(1:5)) %>%
  mutate(amount = na_if(amount, 0)) %>%
  na.omit() %>%
  mutate(taxa = fct_lump(taxa, n = 2)) %>%
  ggplot(., aes(x = SampleID, fill = taxa)) +
  geom_bar()
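If you later want the Season and Location split mentioned in the question, a facet layer can be added to the same pipe (assuming those columns are among the first five kept by gather), e.g.:
df %>%
  gather(taxa, amount, -c(1:5)) %>%
  mutate(amount = na_if(amount, 0)) %>%
  na.omit() %>%
  mutate(taxa = fct_lump(taxa, n = 8)) %>%           # keep the top 8 taxa, lump the rest
  ggplot(., aes(x = SampleID, fill = taxa)) +
  geom_bar() +
  facet_grid(Season ~ Location)                      # split panels by season and location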
One way of doing it:
library(plyr)
library(reshape2) # for melt()
library(ggplot2)
d = data.frame(SampleID = rep('BW15001', 4),
               TransectID = c(1, 2, 2, 3),
               SampleYear = rep(2015, 4),
               Taxa1 = c(25, 32, 6, 78),
               Taxa2 = c(0, 0, 0, 1),
               Taxa3 = c(0, 0, 45, 3))
#Reshape the df so that all taxa columns are melted into two
d=melt(d,id=colnames(d[,1:3]))
d$variable=as.character(d$variable)
# rename all uninteresting taxa as 'other'
`%ni%` <- Negate(`%in%`) # Here I decided to select the ones to keep, but the other way around is fine as well of course
d[d$variable %ni% c('Taxa1','Taxa2'),'variable']='Other' #here you could add a function to automatically determine which taxta you want to keep, as you already did
# aggregate all data for 'other'
d=ddply(d,colnames(d[,1:4]),summarise,value=sum(value))
#make your plot, this one is just a bad example
#(the facet_grid(. ~ Type, ...) line from the question is left out here because this
# example data frame has no Type column)
ggplot(d, aes(SampleID, value, fill = variable)) +
  geom_bar(stat = "identity")

Using functionals instead of for loops to identify sequential changes in a vector

My data look like this:
I want to identify which "downward trend" each observation is part of, so I can group them and do things like make this graph:
My logic for distinguishing "downward trends" is that they end when the next observation has a higher measurement.
I've written a loop to do this, but I'm wondering if there's a better way to do it with one of the apply functions or something like them.
## Create sample data
df <- data.frame(timestamp = seq(1:20),
                 measurement = seq(10, 1, by = -1))
## This is the for loop I'm hoping to improve
df$downward.trend.seq <- 0
seq <- 1
for (i in 1:nrow(df)){
  df$downward.trend.seq[i] <- seq
  if (i < nrow(df) & df$measurement[i] < df$measurement[i+1]) {
    seq <- seq + 1
  }
}
## Code for plots
library(ggplot2)
library(dplyr)
ggplot(df, aes(x = timestamp, y = measurement)) + geom_point()
ggplot(df, aes(x = timestamp, y = measurement, group = downward.trend.seq)) + geom_line(aes(color=downward.trend.seq %>% factor))
You can use which and diff to identify where the downward trend changes occur, and use cumsum to fill out the group membership.
# set up new column with all 0s
df$downward.trend.seq <- 0
# use diff to identify indices to change to 1
df$downward.trend.seq[which(c(NA, diff(df$measurement)) > 0)] <- 1
# use cumsum to fill in proper group membership
df$downward.trend.seq <- cumsum(df$downward.trend.seq)
Here is a dplyr solution
df %>% mutate(data_group = cumsum( c(0, diff(measurement)) > 0 ))
This performs the cumulative sum over a logical vector and assigns the results to data_group
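Plugging that column into the plotting code from the question might look like:
# add the group column, then colour each downward run separately
df <- df %>% mutate(data_group = cumsum(c(0, diff(measurement)) > 0))
ggplot(df, aes(x = timestamp, y = measurement, group = data_group)) +
  geom_line(aes(color = factor(data_group)))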

Resources