GGPLOT: Printing Stacked Bar Chart & Line to File - r

I know that it might not look like it from this question, but I've actually been programming for over 20 years, but I'm new to R. I'm trying to move away from Excel and to automate creation of about 100 charts I currently do in Excel by hand. I've asked two previous questions about this: here and here. Those solutions work for those toy examples, but when I try the exact same code on my own full program, they behave very differently and I'm completely befuddled as to why. When I run the program below, the testplot.png file is just a plot of the line, without the stacked bar chart.
So here is my (full) code as cut down as I can make it. If anyone wants to critique my programming, go ahead. I know that the comments are light, but that's to try to shorten it for this post. Also, this does actually download the USDA PSD database which is about 20MB compressed and is 170MB uncompressed...sorry but I would love someone's help on this!
Edit, here are str() outputs of both 'full' data and 'toy' data. The toy data works, the full data doesn't.
> str(melteddata)
Classes ‘data.table’ and 'data.frame': 18 obs. of 3 variables:
$ Year : int 1 2 3 4 5 6 1 2 3 4 ...
$ variable: Factor w/ 3 levels "stocks","exports",..: 1 1 1 1 1 1 2 2 2 2 ...
$ Qty : num 2 4 3 2 4 3 4 8 6 4 ...
- attr(*, ".internal.selfref")=<externalptr>
> str(SoySUHist)
Classes ‘data.table’ and 'data.frame': 159 obs. of 3 variables:
$ Year : int 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 ...
$ variable: Factor w/ 3 levels "Stocks","DomCons",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Qty : num 0.0297 0.0356 0.0901 0.1663 0.3268 ...
- attr(*, ".internal.selfref")=<externalptr>
> str(linedata)
Classes ‘data.table’ and 'data.frame': 6 obs. of 2 variables:
$ Year: int 1 2 3 4 5 6
$ Qty : num 15 16 15 16 15 16
- attr(*, ".internal.selfref")=<externalptr>
> str(SoyProd)
Classes ‘data.table’ and 'data.frame': 53 obs. of 2 variables:
$ Year: int 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 ...
$ Qty : num 701 846 928 976 1107 ...
- attr(*, ".internal.selfref")=<externalptr>
>
library(data.table)
library(ggplot2)
library(ggthemes)
library(plyr)
toyplot <- function(plotdata,linedata){
plotCExp <- ggplot(plotdata) +
geom_bar(aes(x=Year,y=Qty,factor=variable,fill=variable), stat="identity") +
geom_line(data=linedata, aes(x=Year,y=Qty)) # <---- comment out this line & the stack plot works
ggsave(plotCExp,filename = "ggsavetest.png", width=7, height=5, units="in")
}
convertto <- function(value,crop,unit='BU'){
if (unit=='BU' & ( crop=='WHEAT' | crop=='SOYBEANS')){
value = value * 36.7437
}
return(value)
}
# =====================================
# Download Data (Warning...large download!)
# =====================================
system("curl https://apps.fas.usda.gov/psdonline/download/psd_alldata_csv.zip | funzip > DATA/psd.csv")
tmp <- fread("DATA/psd.csv")
PSD = data.table(tmp)
rm(tmp)
setkey(PSD,Country_Code,Commodity_Code,Attribute_ID)
tmp=unique(PSD[,.(Commodity_Description,Attribute_Description,Commodity_Code,Attribute_ID)])
tmp[order(Commodity_Description)]
names(PSD)[names(PSD) == "Market_Year"] = "Year"
names(PSD)[names(PSD) == "Value"] = "Qty"
PSDCmdtyAtt = unique(PSD[,.(Commodity_Code,Attribute_ID)])
# Soybean Production, Consumption, Stocks/Use
SoyStocks = PSD[list("US",2222000,176),.(Year,Qty)] # Ending Stocks
SoyExp = PSD[list("US",2222000,88),.(Year,Qty)] # Exports
SoyProd = PSD[list("US",2222000,28),.(Year,Qty)] # Total Production
SoyDmCons = PSD[list("US",2222000,125),.(Year,Qty)] # Total Dom Consumption
SoyStocks$Qty = convertto(SoyStocks$Qty,"SOYBEANS","BU")/1000
SoyExp$Qty = convertto(SoyExp$Qty,"SOYBEANS","BU")/1000
SoyProd$Qty = convertto(SoyProd$Qty,"SOYBEANS","BU")/1000
SoyDmCons$Qty = convertto(SoyDmCons$Qty,"SOYBEANS","BU")/1000
# Stocks/Use
SoySUPlot <- SoyExp
names(SoySUPlot)[names(SoySUPlot) == "Qty"] = "Exports"
SoySUPlot$DomCons = SoyDmCons$Qty
SoySUPlot$Stocks = SoyStocks$Qty
SoySUHist <- melt(SoySUPlot,id.vars="Year")
SoySUHist$Qty = SoySUHist$value/1000
SoySUHist$value <- NULL
SoySUPlot$StocksUse = 100*SoySUPlot$Stocks/(SoySUPlot$DomCons+SoySUPlot$Exports)
SoySUPlot$Production = SoyProd$Qty/1000
SoySUHist$variable <- factor(SoySUHist$variable, levels = rev(levels(SoySUHist$variable)))
SoySUHist = arrange(SoySUHist,variable)
toyplot(SoySUHist,SoyProd)

All right, I'm feeling generous. Your example code contains a lot of fluff that should not be in a minimal reproducible example and your system call is not portable, but I had a look anyway.
The good news: Your code works as expected.
Let's plot only the bars:
ggplot(SoySUHist) +
geom_bar(aes(x=Year,y=Qty,factor=variable,fill=variable), stat="identity")
Now only the lines:
ggplot(SoySUHist) +
geom_line(data=SoyProd, aes(x=Year,y=Qty))
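Now compare the scales of the y-axes. In your code SoySUHist$Qty is divided by 1000 a second time after the melt, while SoyProd$Qty is not, so the line values are about 1000 times larger than the bar values. If you plot both together, the bars do get plotted, but they are so small that you can't see them. You need to rescale: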
ggplot(SoySUHist) +
geom_bar(aes(x=Year,y=Qty,factor=variable,fill=variable), stat="identity") +
geom_line(data=SoyProd, aes(x=Year,y=Qty/1000))
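A quick way to spot this kind of mismatch is to compare the value ranges of the two data sets before combining the layers. The following is only a sketch on made-up toy data (the `bars` and `line` data frames are hypothetical, not the USDA data), and it uses geom_col(), which is shorthand for geom_bar(stat = "identity"):

```r
library(ggplot2)

# Toy data, made up for illustration: the line values are far larger than the bars
bars <- data.frame(
  Year = rep(1:3, times = 2),
  variable = rep(c("Stocks", "Exports"), each = 3),
  Qty = c(0.2, 0.3, 0.25, 0.5, 0.6, 0.55)
)
line <- data.frame(Year = 1:3, Qty = c(700, 850, 900))

# Compare the ranges; a gap of several orders of magnitude hides the bars
range(bars$Qty)
range(line$Qty)

# Bring the line onto the same scale as the bars before plotting both layers
p <- ggplot(bars) +
  geom_col(aes(x = Year, y = Qty, fill = variable)) +
  geom_line(data = line, aes(x = Year, y = Qty / 1000))

# ggsave("ggsavetest.png", p, width = 7, height = 5, units = "in")
```

The ggsave() call is left commented out so the sketch does not write a file unless you want it to.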

Related

How to change order for pyramid plots with ggplot2 to dataset order?

I have a dataset with climate suitability values (0-1) for tree species for both present and future.
I would like to visualise the data in a pyramid plot with the ggplot2 package, whereas present should be displayed on the left side of the plot and future on the right side and the tree species in the according order given in my raw dataset.
b2010<-read.csv("csi_before2010_abund_order.csv",header=T,sep = ";")
str(b2010)
'data.frame': 20 obs. of 7 variables:
$ species: Factor w/ 10 levels "Acer platanoides",..: 9 9 7 7 8 8 6 6 5 5 ...
$ time : Factor w/ 2 levels "future","present": 2 1 2 1 2 1 2 1 2 1 ...
$ grid1 : num 0.6001 0.5945 0.6366 0.0424 0.6941 ...
$ grid2 : num 0.6399 0.5129 0.6981 0.0399 0.711 ...
$ grid3 : num 0.6698 0.5212 0.6863 0.0446 0.6795 ...
$ mean : num 0.6366 0.5429 0.6737 0.0423 0.6949 ...
$ group : Factor w/ 1 level "before 2010": 1 1 1 1 1 1 1 1 1 1 ...
b2010$mean = ifelse(b2010$time == "future", b2010$mean * -1,b2010$mean)
head(b2010)
species time grid1 grid2 grid3 mean group
1 Tilia europaea present 0.60009009 0.63990200 0.66975713 0.63658307 before 2010
2 Tilia europaea future 0.59452874 0.51294094 0.52115256 -0.54287408 before 2010
3 Sorbus intermedia present 0.63659602 0.69813931 0.68629903 0.67367812 before 2010
4 Sorbus intermedia future 0.04242327 0.03990654 0.04460707 -0.04231229 before 2010
5 Tilia cordata present 0.69414478 0.71097034 0.67950863 0.69487458 before 2010
6 Tilia cordata future 0.55790818 0.53918493 0.51979470 -0.53896260 before 2010
ggplot(b2010, aes(x = factor(species), y = mean, fill = time)) +
geom_bar(stat = "identity") +
facet_share(~time, dir = "h", scales = "free", reverse_num = T) +
coord_flip()
Now, future and present are in the wrong order and also the species are ordered alphabetically, even though they are clearly "factors" and should therefore be ordered according to my dataset. I would very much appreciate your help.
Thank you and kind regards
You are misunderstanding how factors work. Bars are plotted in the order as printed by levels(b2010$species). In order to change this order, you'll have to manually reorder them, i.e.
b2010$species <- factor(b2010$species,
levels = c("Sorbus intermedia", "Tilia cordata"...))
The levels can also be ordered by some statistic, e.g. mean. To do that, you would do something along the lines of
present <- b2010[b2010$time == "present", ] # one row per species
myorder <- as.character(present$species[order(present$mean)])
b2010$species <- factor(b2010$species, levels = myorder)
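If the goal is simply to keep the order in which the species appear in the raw data, the levels can be taken from unique(). This is a minimal sketch on a made-up stand-in for b2010 (the data frame `b` and its three species rows are hypothetical):

```r
# Hypothetical stand-in for b2010: species listed in the desired, non-alphabetical order
b <- data.frame(
  species = c("Tilia europaea", "Tilia europaea",
              "Sorbus intermedia", "Sorbus intermedia",
              "Acer platanoides", "Acer platanoides"),
  time = rep(c("present", "future"), times = 3),
  stringsAsFactors = TRUE
)
levels(b$species)  # alphabetical by default

# unique() keeps first-appearance order, so the levels follow the dataset
b$species <- factor(b$species, levels = unique(as.character(b$species)))

# List "present" first so it becomes the first level of time
b$time <- factor(b$time, levels = c("present", "future"))
levels(b$species)
```

With coord_flip() the first level is drawn at the bottom of the axis, so wrapping the levels in rev() may be needed if the list should read top to bottom. Whether the facet order then follows the time levels depends on the faceting function used.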

Ordering categorical variables with ggplot stacked barplots

I have a data frame tbl with 252 obs of 8 variables.
> str(CE0008)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 252 obs. of 8 variables:
$ Period start time: POSIXct, format: "2018-02-28" "2018-02-28" "2018-02-28" "2018-02-28" ...
$ WBTS name : chr "CE0008" "CE0008" "CE0008" "CE0008" ...
$ WBTS ID : num 27 27 27 27 27 27 27 27 27 27 ...
$ WCEL name : chr "CE0008U09C3" "CE0008U21B1" "CE0008U21B3" "CE0008U21C2" ...
$ WCEL ID : num 33 2 22 13 32 3 23 1 11 31 ...
$ PRACHDelayRange : num 4 4 4 4 4 4 4 4 4 4 ...
$ class : Ord.factor w/ 21 levels "Class 0"<"Class 1"<..: 1 1 1 1 1 1 1 1 1 1 ...
$ count : num 22177 37507 37580 24066 6029 ...
I wish to produce a bar plot where the x axis contains the class variable, ordered from Class 0 through to Class 20. The height of the bars is given by the count variable and the bars are coloured by WCEL name.
From the str of my data frame tbl the class variable is an ord. factor but for some reason when I plot the data the order doesn't carry over and similarly with fill of the colour by 'WCEL name'. Any pointers would be greatly appreciated.
ggplot(data = CE0008, aes(x = class, y = count, fill = 'WCEL name')) + geom_col()
CE0008$class <- factor(CE0008$class, levels = unique(as.character(CE0008$class))) # levels must be unique; this keeps first-appearance order
# your ggplot call goes here
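Two things are worth checking in the plotting call itself. First, fill = 'WCEL name' in quotes maps fill to a constant string rather than to the column; a column name containing a space needs backticks. Second, the level order can be set explicitly so that "Class 10" does not sort before "Class 2". A sketch on made-up data (the data frame `d` is a hypothetical stand-in for CE0008):

```r
library(ggplot2)

# Made-up stand-in for CE0008 with a column name that contains a space
d <- data.frame(
  class = rep(c("Class 0", "Class 1", "Class 2", "Class 10"), times = 2),
  count = c(5, 3, 2, 1, 4, 6, 1, 2),
  cell = rep(c("CE0008U09C3", "CE0008U21B1"), each = 4),
  stringsAsFactors = FALSE
)
names(d)[names(d) == "cell"] <- "WCEL name"

# Set the levels explicitly so the classes sort numerically, not alphabetically
d$class <- factor(d$class, levels = paste("Class", c(0, 1, 2, 10)), ordered = TRUE)

# Backticks refer to the column; quotes would map fill to a constant string
p <- ggplot(d, aes(x = class, y = count, fill = `WCEL name`)) +
  geom_col()
```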

Remove duplicates in R without converting to numeric

I have 2 variables in a data frame with 300 observations.
$ imagelike: int 3 27 4 5370 ...
$ user: Factor w/ 24915 levels "\"0.1gr\"","\"008bla\"", ..
I then tried to remove the duplicates, such as "- " appears 2 times:
testclean <- data1[!duplicated(data1), ]
This gives me the warning message:
In Ops.factor(left): "-"not meaningful for factors
I have then converted it to a matrix:
data2 <- data.matrix(data1)
testclean2 <- data2[!duplicated(data2), ]
This does the trick - however - it converts the userNames to a numeric.
=========================================================================
I am new but I have tried looking at previous posts on this topic (including the one below) but it did not work out:
Convert data.frame columns from factors to characters
Some sample data, from your image (please don't post images of data!):
data1 <- data.frame(imageLikeCount = c(3,27,4,4,16,103),
userName = c("\"testblabla\"", "test_00", "frenchfries", "frenchfries", "test.inc", "\"parmezan_pizza\""))
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : Factor w/ 5 levels "\"parmezan_pizza\"",..: 2 5 3 3 4 1
To fix the problem with factors as well as the embedded quotes:
data1$userName <- gsub('"', '', as.character(data1$userName))
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : chr "testblabla" "test_00" "frenchfries" "frenchfries" ...
As @DanielWinkler suggested, if you can change how the data is read in or defined, you might choose to include stringsAsFactors = FALSE (this argument is accepted by many functions, including read.csv, read.table, and most data.frame functions, including as.data.frame and rbind):
data1 <- data.frame(imageLikeCount = c(3,27,4,4,16,103),
userName = c("\"testblabla\"", "test_00", "frenchfries", "frenchfries", "test.inc", "\"parmezan_pizza\""),
stringsAsFactors = FALSE)
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : chr "\"testblabla\"" "test_00" "frenchfries" "frenchfries" ...
(Note that this still has embedded quotes, so you'll still need something like data1$userName <- gsub('"', '', data1$userName).)
Now, we have data that looks like this:
data1
# imageLikeCount userName
# 1 3 testblabla
# 2 27 test_00
# 3 4 frenchfries
# 4 4 frenchfries
# 5 16 test.inc
# 6 103 parmezan_pizza
and removing the duplicates works:
data1[! duplicated(data1), ]
# imageLikeCount userName
# 1 3 testblabla
# 2 27 test_00
# 3 4 frenchfries
# 5 16 test.inc
# 6 103 parmezan_pizza
Try
data$userName <- as.character(data$userName)
And then
data<-unique(data)
You could also pass the argument stringsAsFactors = FALSE when reading the data. This is usually a good idea.
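As a small illustration of reading the data with character columns and then dropping duplicate rows, here is a sketch that uses an inline CSV string in place of the real file (the `csv_text` contents are a shortened, quote-free version of the sample data above):

```r
# Inline CSV standing in for the real file
csv_text <- "imageLikeCount,userName
3,testblabla
27,test_00
4,frenchfries
4,frenchfries
16,test.inc"

# Keep userName as character instead of converting it to a factor
data1 <- read.csv(text = csv_text, stringsAsFactors = FALSE)
is.character(data1$userName)

# unique() on a data frame drops rows that are duplicated across all columns
deduped <- unique(data1)
nrow(deduped)
```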

R ggplot - Error stat_bin requires continuous x variable

My table is data.combined with following structure:
'data.frame': 1309 obs. of 12 variables:
$ Survived: Factor w/ 3 levels "0","1","None": 1 2 2 2 1 1 1 1 2 2 ...
$ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
$ Name : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
$ Sex : num 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : Factor w/ 929 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : Factor w/ 187 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
$ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
$ Title : Factor w/ 4 levels "Master.","Miss.",..: 3 3 2 3 3 3 3 1 3 3 ...
I want to draw a graph to reflect the relationship between Title and Survived, categorized by Pclass. I used the following code:
ggplot(data.combined[1:891,], aes(x=Title, fill = Survived)) +
geom_histogram(binwidth = 0.5) +
facet_wrap(~Pclass) +
ggtitle ("Pclass") +
xlab("Title") +
ylab("Total count") +
labs(fill = "Survived")
However this results in error: Error: StatBin requires a continuous x variable the x variable is discrete. Perhaps you want stat="count"?
If I change variable Title into numeric: data.combined$Title <- as.numeric(data.combined$Title) then the code works but the label in the graph is also numeric (below). Please tell me why it happens and how to fix it. Thanks.
Btw, I use R 3.2.3 on Mac OS X El Capitan.
Graph: Instead of Mr, Miss,Mrs the x axis shows numeric values 1,2,3,4
To sum up the answers from the comments above:
1 - Replace geom_histogram(binwidth = 0.5) with geom_bar(). Note that geom_bar() has no binwidth argument.
2 - Alternatively, use stat_count(width = 0.5) instead of geom_bar() or geom_histogram(binwidth = 0.5).
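Both options can be sketched on a small made-up data set (the data frame `d` below is hypothetical and only mimics the Title and Survived columns). The x variable stays a factor, so the axis keeps its text labels:

```r
library(ggplot2)

# Made-up data: a discrete x variable, as Title is in the question
d <- data.frame(
  Title = factor(c("Mr.", "Mrs.", "Miss.", "Mr.", "Mr.", "Master.")),
  Survived = factor(c(0, 1, 1, 0, 1, 1))
)

# Option 1: geom_bar() counts the rows in each category; width sets the bar width
p_bar <- ggplot(d, aes(x = Title, fill = Survived)) +
  geom_bar(width = 0.5)

# Option 2: stat_count() is the stat that geom_bar() uses by default
p_stat <- ggplot(d, aes(x = Title, fill = Survived)) +
  stat_count(width = 0.5)
```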
extractTitle <- function(Name) {
Name <- as.character(Name)
if (length(grep("Miss.", Name)) > 0) {
return ("Miss.")
} else if (length(grep("Master.", Name)) > 0) {
return ("Master.")
} else if (length(grep("Mrs.", Name)) > 0) {
return ("Mrs.")
} else if (length(grep("Mr.", Name)) > 0) {
return ("Mr.")
} else {
return ("Other")
}
}
titles <- NULL
for (i in 1:nrow(data.combined)){
titles <- c(titles, extractTitle(data.combined[i, "Name"]))
}
data.combined$title <- as.factor(titles)
ggplot(data.combined[1:891,], aes(x = title, fill = Survived))+
geom_bar(width = 0.5) +
facet_wrap("Pclass")+
xlab("Title")+
ylab("total count")+
labs(fill = "Survived")
As stated above, use geom_bar() instead of geom_histogram(). See the sample code below (I wanted a separate graph for each month of birth-date data):
ggplot(data = pf,aes(x=dob_day))+
geom_bar()+
scale_x_discrete(breaks = 1:31)+
facet_wrap(~dob_month,ncol = 3)
I had the same issue but none of the above solutions worked. Then I noticed that the column of the data frame I wanted to use for the histogram wasn't numeric:
df$variable<- as.numeric(as.character(df$variable))
Taken from here
I had the same error. In my original code, I read my .csv file with read_csv(). After I changed the file into .xlsx and read it with read_excel(), the code ran smoothly.

Histogram of Weekdays by Year R

I have a .csv file that I have loaded into R using the following basic command:
lace <- read.csv("lace for R.csv")
It pulls in my data just fine. Here is the str of the data:
str(lace)
'data.frame': 2054 obs. of 20 variables:
$ Admission.Day : Factor w/ 872 levels "1/1/2013","1/10/2011",..: 231 238 238 50 59 64 64 64 67 67 ...
$ Year : int 2010 2010 2010 2011 2011 2011 2011 2011 2011 2011 ...
$ Month : int 12 12 12 1 1 1 1 1 1 1 ...
$ Day : int 28 30 30 3 4 6 6 6 7 7 ...
$ DayOfWeekNumber : int 3 5 5 2 3 5 5 5 6 6 ...
$ Day.of.Week : Factor w/ 7 levels "Friday","Monday",..: 6 5 5 2 6 5 5 5 1 1 ...
What I am trying to do is create three (3) different histograms and then plot them all together on one. I want to create a histogram for each year, where the x axis or labels will be the days of the week starting with Sunday and ending on Saturday.
Firstly how would I go about creating a histogram out of Factors, which the days of the week are in?
Secondly how do I create a histogram for the days of the week for a given year?
I have tried using the following post here but cannot get it working. I use the Admission.Day as the variable and get an error message:
dat <- as.Date(lace$Admission.Day)
Error in charToDate(x) : character string is not in a standard unambiguous format
Thank you,
Expanding on the comment above: the problem seems to be with importing dates, rather than making the histogram. Assuming there is an Excel workbook "lace for R.xlsx", with a sheet "lace":
## Not tested...
library(XLConnect)
myData <- "lace for R.xlsx" # NOTE: need path also...
wb <- loadWorkbook(myData)
lace <- readWorksheet(wb, sheet="lace")
lace$Admission.Day <- as.Date(lace$Admission.Day)
should provide dates that work with all the R date functions. Also, the lubridate package provides a number of functions that are more intuitive to use than format(...).
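If you would rather stay with the CSV import, the error from as.Date() can usually be avoided by converting the factor to character and giving an explicit format. A minimal sketch on made-up values in the same style as the Admission.Day levels; it assumes the strings are month/day/year, which is not confirmed by the question:

```r
# Made-up values in the same style as the Admission.Day factor levels
admission <- factor(c("12/28/2010", "1/3/2011", "1/10/2011"))

# Assumption: month/day/year; use "%d/%m/%Y" instead if the file is day/month/year
dates <- as.Date(as.character(admission), format = "%m/%d/%Y")

# Weekday names as a factor running Sunday to Saturday
# (weekdays() returns names in the current locale; English is assumed here)
dow_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday")
dow <- factor(weekdays(dates), levels = dow_levels)
table(dow)
```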
Then, as an example:
library(lubridate) # for year(...) and wday(...)
library(ggplot2)
# random dates around Jun 1, across 5 years...
set.seed(123)
lace <- data.frame(date=as.Date(rnorm(1000,sd=50)+365*(0:4),origin="2008/6/1"))
lace$year <- factor(year(lace$date))
lace$dow <- wday(lace$date, label=T)
# This creates the per-weekday counts...
ggplot(lace) +
geom_bar(aes(x=dow, fill=year)) + # dow is a factor, so count with geom_bar; fill color by year
facet_grid(~year) + # facet by year
theme(axis.text.x=element_text(angle=90)) # to rotate weekday names...
This produces one panel per year, with the weekdays on the x axis and the number of dates on the y axis.
