Histogram of Weekdays by Year R - r

I have a .csv file that I have loaded into R using the following basic command:
lace <- read.csv("lace for R.csv")
It pulls in my data just fine. Here is the str of the data:
str(lace)
'data.frame': 2054 obs. of 20 variables:
$ Admission.Day : Factor w/ 872 levels "1/1/2013","1/10/2011",..: 231 238 238 50 59 64 64 64 67 67 ...
$ Year : int 2010 2010 2010 2011 2011 2011 2011 2011 2011 2011 ...
$ Month : int 12 12 12 1 1 1 1 1 1 1 ...
$ Day : int 28 30 30 3 4 6 6 6 7 7 ...
$ DayOfWeekNumber : int 3 5 5 2 3 5 5 5 6 6 ...
$ Day.of.Week : Factor w/ 7 levels "Friday","Monday",..: 6 5 5 2 6 5 5 5 1 1 ...
What I am trying to do is create three (3) different histograms and then plot them all together on one. I want to create a histogram for each year, where the x axis or labels will be the days of the week starting with Sunday and ending on Saturday.
Firstly how would I go about creating a histogram out of Factors, which the days of the week are in?
Secondly how do I create a histogram for the days of the week for a given year?
I have tried using the following post here but cannot get it working. I use the Admission.Day as the variable and get an error message:
dat <- as.Date(lace$Admission.Day)
Error in charToDate(x) : character string is not in a standard unambiguous format
Thank you,

Expanding on the comment above: the problem seems to be with importing the dates, rather than with making the histogram. Assuming there is an Excel workbook "lace for R.xlsx" with a sheet "lace":
## Not tested...
library(XLConnect)
myData <- "lace for R.xlsx" # NOTE: need path also...
wb <- loadWorkbook(myData)
lace <- readWorksheet(wb, sheet="lace")
lace$Admission.Day <- as.Date(lace$Admission.Day)
should provide dates that work with all the R date functions. Also, the lubridate package provides a number of functions that are more intuitive to use than format(...).
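If you would rather stay with the .csv import, the error from as.Date() may simply mean the strings are not in a layout it can guess. Below is a minimal sketch, assuming Admission.Day is stored as month/day/year (which the Month and Day columns in the str() output suggest). The sample values are hypothetical stand-ins for the real column.

```r
# Hypothetical sample in the assumed month/day/year layout
admission <- factor(c("12/28/2010", "12/30/2010", "1/3/2011", "1/10/2011"))

# Convert factor -> character, then parse with an explicit format string
admission_date <- as.Date(as.character(admission), format = "%m/%d/%Y")
print(admission_date)
```

If the file actually uses day/month/year, the format string would need to be "%d/%m/%Y" instead.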
Then, as an example:
library(lubridate) # for year(...) and wday(...)
library(ggplot2)
# random dates around Jun 1, across 5 years...
set.seed(123)
lace <- data.frame(date=as.Date(rnorm(1000,sd=50)+365*(0:4),origin="2008/6/1"))
lace$year <- factor(year(lace$date))
lace$dow <- wday(lace$date, label=TRUE)
# This creates the histograms...
# (geom_bar counts a discrete x; geom_histogram expects a continuous one)
ggplot(lace) +
  geom_bar(aes(x=dow, fill=year)) + # fill color by year
  facet_grid(~year) + # facet by year
  theme(axis.text.x=element_text(angle=90)) # to rotate weekday names...
This produces one panel per year, each with the weekdays on the x axis (image not shown).
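For the "histogram out of factors" part of the question, the counts can also be tabulated in base R before any plotting. This is only a sketch: the dates are hypothetical values in the style of the question's data, and the Sunday-to-Saturday order comes from the levels given to factor().

```r
# Hypothetical admission dates, already parsed as Date
dates <- as.Date(c("2010-12-28", "2010-12-30", "2010-12-30",
                   "2011-01-03", "2011-01-04", "2011-01-06"))

day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

# POSIXlt stores the weekday as 0 (Sunday) to 6 (Saturday), independent of locale
dow <- factor(day_names[as.POSIXlt(dates)$wday + 1], levels = day_names)
year <- format(dates, "%Y")

# One row per year, one column per weekday, in Sunday..Saturday order
counts <- table(year, dow)
print(counts)

# barplot(counts, beside = TRUE, legend.text = TRUE)  # years side by side per weekday
```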

Related

R error: level sets of factors are different

I'm working on an assignment practicing Logistic Regression models. Our data is on shots made in NBA games and each row includes a column for what team the player making the shot belongs to and a column for who the home team was.
I am trying to add a column with TRUE/FALSE values based on whether or not the shot was taken by the home team, based on some example code we were provided.
df$home.advntg <- df$Team == df$Home
However I keep getting the error: "Error in Ops.factor(df$Team, df$Home) :
level sets of factors are different"
When I check the columns with str() however these are the results:
str(df$Team) : "Factor w/ 30 levels "ATL","BKN","BOS",..: 7 16 27 3 24 1 10 8 12 12 ..."
str(df$Home) : " Factor w/ 30 levels "ATL","BKN","BOS",..: 7 20 27 5 28 1 10 8 1 12 ..."
The data I'm using is a subset of a much larger dataset which covered shots made from 1997 to 2020. The code worked on the original data, so something about how I've reduced it to just the 2020 shots is probably responsible. The dates of the games are in YMD format, so to filter down to just 2020 I ran:
df0 <- read_csv("NBA Shot Locations 1997 - 2020.csv")
df0$Year <- substr(df0$"Game Date",1,4)
df <- filter(df0, Year == 2020)
df <- df[,-23]
When I run str and check the columns with the original data (for which there was no error) I get:
str(df$Team) : "Factor w/ 36 levels "ATL","BKN","BOS",..: 18 17 8 2 18 7 1 10 9 12 ..."
str(df$Home) : "Factor w/ 36 levels "ATL","BKN","BOS",..: 6 17 5 28 18 2 1 10 9 12 ..."
In both cases the Factor levels look like they're the same. I don't really understand what the numbers being returned by the str function represent.
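A minimal sketch of what those numbers are and of one possible cause of the error, assuming the two columns' level sets differ somewhere past the first three levels that str() displays. The team codes below are hypothetical.

```r
# Hypothetical data: both factors have 3 levels, but the level sets differ
team <- factor(c("ATL", "BOS", "CHI"), levels = c("ATL", "BOS", "CHI"))
home <- factor(c("ATL", "DAL", "BOS"), levels = c("ATL", "BOS", "DAL"))

# str() prints integer codes; each code is a position in levels()
codes <- as.integer(team)
print(codes)
print(levels(team)[codes])   # maps the codes back to the labels

# == on two factors stops when their level sets are not the same
result <- tryCatch(team == home, error = function(e) conditionMessage(e))
print(result)

# Comparing the labels as character strings sidesteps the level check
home_advantage <- as.character(team) == as.character(home)
print(home_advantage)
```

Comparing setdiff(levels(df$Team), levels(df$Home)) in both directions would show whether the level sets really differ in the filtered data.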

Convert comma separated decimals from character to numeric

For my exam I have to build some scatter plots in R. I created a data frame with 4 variables, and I want to add regression lines to the scatter plots.
The name of my data frame is "alle".
The variable names are: demo, tot, besch, usd
With this code I tried to fit the regression line, but got the following result:
reg1<- lm(tot~demo, data=alle)
Warning messages:
1: In model.response(mf, "numeric") :
using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : ‘-’ not meaningful for factors
Here is the structure of "alle":
str(alle)
'data.frame': 11 obs. of 4 variables:
$ demo : chr "498.300.775" "500.297.033" "502.090.235" "503.170.618" ...
$ tot : Factor w/ 11 levels "4.846.423","4.871.049",..: 1 3 4 5 2 8 7 6 10 9 ...
$ besch: Factor w/ 9 levels "68,4","68,6",..: 5 7 3 2 2 1 1 4 6 8 ...
$ usd : Factor w/ 44 levels "0,68434","0,72584",..: 26 30 29 23 28 22 24 25 15 14 ...
I tried to convert column "demo" to numeric with
alle$demo <- as.numeric(as.character(alle$demo))
It converted the column to numeric, but now the rows are full of NAs.
I think that all columns must be numeric.
How can I convert all 4 columns to numeric and finally plot the regression lines?
Data:
> head(alle,6)
demo tot besch usd
1 498.300.775 4.846.423 69,8 1,3705
2 500.297.033 4.891.934 70,3 1,4708
3 502.090.235 4.901.358 69,0 1,3948
4 503.170.618 4.906.313 68,6 1,3257
5 502.964.837 4.871.049 68,6 1,3920
6 504.047.964 5.010.371 68,4 1,2848
thanks
Try doing it in two steps. First get rid of the dots, then replace the commas by decimal points and coerce to numeric.
alle[] <- lapply(alle, function(x) gsub("\\.", "", x))
alle[] <- lapply(alle, function(x) as.numeric(sub(",", ".", x)))
Note:
The above solution is split in two for readability. The following does the same in a single lapply loop and should therefore be faster if the dataset is big. If the dataset is small to medium, the two-step solution may be preferable.
alle[] <- lapply(alle, function(x){
  as.numeric(sub(",", ".", gsub("\\.", "", x)))
})
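To see what the two substitutions do, here is a small worked sketch on the first row of head(alle) from the question. The vector name raw is only for illustration.

```r
# Values copied from the first row of head(alle) in the question
raw <- c(demo = "498.300.775", tot = "4.846.423", besch = "69,8", usd = "1,3705")

# Step 1: drop every thousands separator (the dot must be escaped in a regex)
no_thousands <- gsub("\\.", "", raw)

# Step 2: turn the decimal comma into a decimal point, then coerce
parsed <- as.numeric(sub(",", ".", no_thousands, fixed = TRUE))
names(parsed) <- names(raw)
print(parsed)
```

The order matters: replacing the comma first would turn "69,8" into "69.8", and the dot removal would then strip the new decimal point.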
With dplyr:
library(dplyr)
# Note: only besch and usd are converted here; demo and tot stay character
alle %>%
  mutate_all(as.character) %>%
  mutate_at(c("besch","usd"), function(x) as.numeric(as.character(gsub(",",".",x)))) -> alle
demo tot besch usd
1 498.300.775 4.846.423 69.8 1.3705
2 500.297.033 4.891.934 70.3 1.4708
3 502.090.235 4.901.358 69.0 1.3948
4 503.170.618 4.906.313 68.6 1.3257
5 502.964.837 4.871.049 68.6 1.3920
6 504.047.964 5.010.371 68.4 1.2848

R- using dygraph with csv

The following is my ex.csv data as read into R:
Date pr pa
1 2015-01-01 6497985 4833118
2 2015-02-01 88289 4305786
3 2015-03-01 0 1149480
4 2015-04-01 0 16706470
5 2015-05-01 0 7025197
6 2015-06-01 0 6752085
Also, here is the raw data:
Date,pr,pa
2015/1/1,6497985,4833118
2015/2/1,88289,4305786
2015/3/1,0,1149480
2015/4/1,0,16706470
2015/5/1,0,7025197
2015/6/1,0,6752085
How can I use the R package dygraphs with this data?
> str(ex)
'data.frame': 6 obs. of 3 variables:
$ Date: Factor w/ 6 levels "2015/1/1","2015/2/1",..: 1 2 3 4 5 6
$ pr : int 6497985 88289 0 0 0 0
$ pa : int 4833118 4305786 1149480 16706470 7025197 6752085
> dygraph(ex)
Error in dygraph(ex) : Unsupported type passed to argument 'data'.
Please help me. I would appreciate it a lot.
Here are the steps to get it done: First, you need to convert your strings to a Date that is understandable for R. Then convert your data to an xts time series (required by dygraphs). Then plot it with dygraphs.
library(dygraphs)
library(xts)
data <- read.csv("test.csv")
data$Date <- as.Date(data$Date) # convert to date
# keep only the numeric columns as series data; the dates become the index
time_series <- xts(data[, c("pr", "pa")], order.by = data$Date) # make xts
dygraph(time_series) # now plot
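One thing to watch for when building the time series: an xts object keeps its data in a single matrix, so a date column left in the data would likely force every value to text. The base R sketch below illustrates this with the first rows of the raw data from the question. It uses as.matrix() as a stand-in for that conversion, which is an assumption about how the coercion happens.

```r
# First three rows of the raw data shown in the question
ex <- read.csv(text = "Date,pr,pa
2015/1/1,6497985,4833118
2015/2/1,88289,4305786
2015/3/1,0,1149480", stringsAsFactors = FALSE)

# Parse the dates with an explicit format rather than relying on guessing
ex$Date <- as.Date(ex$Date, format = "%Y/%m/%d")

with_date   <- as.matrix(ex)                   # Date column included
values_only <- as.matrix(ex[, c("pr", "pa")])  # numeric columns only

print(is.character(with_date))   # the whole matrix becomes text
print(is.numeric(values_only))   # the values stay numeric
```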

GGPLOT: Printing Stacked Bar Chart & Line to File

I know that it might not look like it from this question, but I've actually been programming for over 20 years, but I'm new to R. I'm trying to move away from Excel and to automate creation of about 100 charts I currently do in Excel by hand. I've asked two previous questions about this: here and here. Those solutions work for those toy examples, but when I try the exact same code on my own full program, they behave very differently and I'm completely befuddled as to why. When I run the program below, the testplot.png file is just a plot of the line, without the stacked bar chart.
So here is my (full) code as cut down as I can make it. If anyone wants to critique my programming, go ahead. I know that the comments are light, but that's to try to shorten it for this post. Also, this does actually download the USDA PSD database which is about 20MB compressed and is 170MB uncompressed...sorry but I would love someone's help on this!
Edit, here are str() outputs of both 'full' data and 'toy' data. The toy data works, the full data doesn't.
> str(melteddata)
Classes ‘data.table’ and 'data.frame': 18 obs. of 3 variables:
$ Year : int 1 2 3 4 5 6 1 2 3 4 ...
$ variable: Factor w/ 3 levels "stocks","exports",..: 1 1 1 1 1 1 2 2 2 2 ...
$ Qty : num 2 4 3 2 4 3 4 8 6 4 ...
- attr(*, ".internal.selfref")=<externalptr>
> str(SoySUHist)
Classes ‘data.table’ and 'data.frame': 159 obs. of 3 variables:
$ Year : int 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 ...
$ variable: Factor w/ 3 levels "Stocks","DomCons",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Qty : num 0.0297 0.0356 0.0901 0.1663 0.3268 ...
- attr(*, ".internal.selfref")=<externalptr>
> str(linedata)
Classes ‘data.table’ and 'data.frame': 6 obs. of 2 variables:
$ Year: int 1 2 3 4 5 6
$ Qty : num 15 16 15 16 15 16
- attr(*, ".internal.selfref")=<externalptr>
> str(SoyProd)
Classes ‘data.table’ and 'data.frame': 53 obs. of 2 variables:
$ Year: int 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 ...
$ Qty : num 701 846 928 976 1107 ...
- attr(*, ".internal.selfref")=<externalptr>
>
library(data.table)
library(ggplot2)
library(ggthemes)
library(plyr)
toyplot <- function(plotdata,linedata){
  plotCExp <- ggplot(plotdata) +
    geom_bar(aes(x=Year,y=Qty,factor=variable,fill=variable), stat="identity") +
    geom_line(data=linedata, aes(x=Year,y=Qty)) # <---- comment out this line & the stack plot works
  ggsave(plotCExp,filename = "ggsavetest.png", width=7, height=5, units="in")
}
convertto <- function(value,crop,unit='BU'){
  if (unit=='BU' & ( crop=='WHEAT' | crop=='SOYBEANS')){
    value = value * 36.7437
  }
  return(value)
}
# =====================================
# Download Data (Warning...large download!)
# =====================================
system("curl https://apps.fas.usda.gov/psdonline/download/psd_alldata_csv.zip | funzip > DATA/psd.csv")
tmp <- fread("DATA/psd.csv")
PSD = data.table(tmp)
rm(tmp)
setkey(PSD,Country_Code,Commodity_Code,Attribute_ID)
tmp=unique(PSD[,.(Commodity_Description,Attribute_Description,Commodity_Code,Attribute_ID)])
tmp[order(Commodity_Description)]
names(PSD)[names(PSD) == "Market_Year"] = "Year"
names(PSD)[names(PSD) == "Value"] = "Qty"
PSDCmdtyAtt = unique(PSD[,.(Commodity_Code,Attribute_ID)])
# Soybean Production, Consumption, Stocks/Use
SoyStocks = PSD[list("US",2222000,176),.(Year,Qty)] # Ending Stocks
SoyExp = PSD[list("US",2222000,88),.(Year,Qty)] # Exports
SoyProd = PSD[list("US",2222000,28),.(Year,Qty)] # Total Production
SoyDmCons = PSD[list("US",2222000,125),.(Year,Qty)] # Total Dom Consumption
SoyStocks$Qty = convertto(SoyStocks$Qty,"SOYBEANS","BU")/1000
SoyExp$Qty = convertto(SoyExp$Qty,"SOYBEANS","BU")/1000
SoyProd$Qty = convertto(SoyProd$Qty,"SOYBEANS","BU")/1000
SoyDmCons$Qty = convertto(SoyDmCons$Qty,"SOYBEANS","BU")/1000
# Stocks/Use
SoySUPlot <- SoyExp
names(SoySUPlot)[names(SoySUPlot) == "Qty"] = "Exports"
SoySUPlot$DomCons = SoyDmCons$Qty
SoySUPlot$Stocks = SoyStocks$Qty
SoySUHist <- melt(SoySUPlot,id.vars="Year")
SoySUHist$Qty = SoySUHist$value/1000
SoySUHist$value <- NULL
SoySUPlot$StocksUse = 100*SoySUPlot$Stocks/(SoySUPlot$DomCons+SoySUPlot$Exports)
SoySUPlot$Production = SoyProd$Qty/1000
SoySUHist$variable <- factor(SoySUHist$variable, levels = rev(levels(SoySUHist$variable)))
SoySUHist = arrange(SoySUHist,variable)
toyplot(SoySUHist,SoyProd)
All right, I'm feeling generous. Your example code contains a lot of fluff that should not be in a minimal reproducible example and your system call is not portable, but I had a look anyway.
The good news: Your code works as expected.
Let's plot only the bars:
ggplot(SoySUHist) +
  geom_bar(aes(x=Year,y=Qty,factor=variable,fill=variable), stat="identity")
Now only the lines:
ggplot(SoySUHist) +
  geom_line(data=SoyProd, aes(x=Year,y=Qty))
Now compare the scales of the y-axes. If you plot both together, the bars get plotted, but they are so small that you can't see them. You need to rescale:
ggplot(SoySUHist) +
  geom_bar(aes(x=Year,y=Qty,factor=variable,fill=variable), stat="identity") +
  geom_line(data=SoyProd, aes(x=Year,y=Qty/1000))
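The scale mismatch can also be checked numerically before plotting. This is a rough sketch using only the first five Qty values visible in the str() output of the question, so the exact ratio for the full data will differ.

```r
# First values shown by str() in the question
bar_qty  <- c(0.0297, 0.0356, 0.0901, 0.1663, 0.3268)  # SoySUHist$Qty
line_qty <- c(701, 846, 928, 976, 1107)                # SoyProd$Qty

print(range(bar_qty))
print(range(line_qty))

# How many times larger the line data is than the bar data
scale_gap <- max(line_qty) / max(bar_qty)
print(scale_gap)

# After dividing by 1000 the line is on a scale comparable to the bars
print(range(line_qty / 1000))
```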

ggplot2 time series with an ordered factor on the x-axis

I'd be extremely grateful for your assistance with the following issue.
I wish to create a representative time series for different subjects who have undertaken a test at discrete intervals. The data frame is called Hayling.Impulsivity. Here is a sample of the data in wide format:
Subject Baseline 2-weeks 6-weeks 3-months
1 1 15 23 5 NA
2 2 15 27 3 4
3 3 5 7 0 19
4 4 1 5 2 6
5 5 3 7 18 27
6 6 0 2 19 2
I then made Subject a factor:
Hayling.Impulsivity$Subject<-factor(Hayling.Impulsivity$Subject)
I then melted the data frame into long format using the reshape package:
Long.H.I.<-melt(Hayling.Impulsivity, id.vars="Subject", variable.name="Follow Up", value.name="Hayling AB Error Score")
I then ordered the measurement variables:
Long.H.I.$"Follow Up"<-factor(Long.H.I.$"Follow Up", levels=c("Baseline", "2-weeks", "6-weeks", "3-months"), ordered=TRUE)
Here's the structure of this data frame:
'data.frame': 52 obs. of 3 variables:
$ Subject : Factor w/ 13 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Follow Up : Ord.factor w/ 4 levels "Baseline"<"2-weeks"<..: 1 1 1 1 1 1 1 1 1 1 ...
$ Hayling AB Error Score: num 15 15 5 1 3 0 3 0 0 33 ...
Now I try to construct the time series in ggplot:
ggplot(Long.H.I., aes("Follow Up", "Hayling AB Error Score", group=Subject, colour=Subject))+geom_line()
But all I get is an empty plot. I'm not permitted to post an image to show you but the x and y axes are labelled only with "Follow Up" and "Hayling AB Error Score" respectively. There are no actual scales / values / categories on either axis and no points have been plotted.
Where have I gone wrong?
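The plot is empty because aes() treats the quoted strings "Follow Up" and "Hayling AB Error Score" as constants rather than as column names. The spaces in the column names are what make the quoting necessary, and they can cause problems even if you use aes_string. You could replace the spaces with underscores and then label the x and y axes explicitly. Code could look like: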
Hayling.Impulsivity$Subject<-factor(Hayling.Impulsivity$Subject)
Long.H.I.<-melt(Hayling.Impulsivity, id.vars="Subject",
                variable.name="Follow_Up", value.name="Hayling_AB_Error_Score")
# level names must match the column names exactly, including case
Long.H.I.$Follow_Up <-factor(Long.H.I.$"Follow_Up",
                             levels=c("Baseline","2-weeks","6-weeks","3-months"), ordered=TRUE)
ggplot(Long.H.I., aes(Follow_Up, Hayling_AB_Error_Score, group=Subject, colour=Subject))+
  geom_line() +
  labs(x="Follow Up", y="Hayling AB Error Score")
