R ggplot - Error: stat_bin requires continuous x variable

My table is data.combined, with the following structure:
'data.frame': 1309 obs. of 12 variables:
$ Survived: Factor w/ 3 levels "0","1","None": 1 2 2 2 1 1 1 1 2 2 ...
$ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
$ Name : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
$ Sex : num 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : Factor w/ 929 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : Factor w/ 187 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
$ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
$ Title : Factor w/ 4 levels "Master.","Miss.",..: 3 3 2 3 3 3 3 1 3 3 ...
I want to draw a graph to reflect the relationship between Title and Survived, categorized by Pclass. I used the following code:
ggplot(data.combined[1:891,], aes(x=Title, fill = Survived)) +
geom_histogram(binwidth = 0.5) +
facet_wrap(~Pclass) +
ggtitle ("Pclass") +
xlab("Title") +
ylab("Total count") +
labs(fill = "Survived")
However this results in error: Error: StatBin requires a continuous x variable the x variable is discrete. Perhaps you want stat="count"?
If I convert Title to numeric with data.combined$Title <- as.numeric(data.combined$Title), the code works, but the x-axis labels in the graph are also numeric (see below). Please tell me why this happens and how to fix it. Thanks.
By the way, I use R 3.2.3 on Mac OS X El Capitan.
Graph: instead of Mr., Miss., Mrs., the x-axis shows the numeric values 1, 2, 3, 4.

To sum up the answers from the comments above:
1 - Replace geom_histogram(binwidth = 0.5) with geom_bar(). geom_bar() has no binwidth argument, but the bar width can still be set with width.
2 - Alternatively, use stat_count(width = 0.5) in place of geom_histogram(binwidth = 0.5).
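As for why the converted axis shows 1 to 4: a factor stores integer codes plus a lookup table of labels, and as.numeric() keeps only the codes. A minimal sketch with a made-up title vector (a stand-in, not the Titanic data):

```r
# Hypothetical stand-in for data.combined$Title
title <- factor(c("Mr.", "Mrs.", "Miss.", "Mr.", "Master."))

levels(title)               # the labels, sorted alphabetically by default
codes <- as.numeric(title)  # keeps only the integer codes
codes                       # 3 4 2 3 1

# The labels can be recovered only while the factor's levels are still around
levels(title)[codes]
```

Keeping Title as a factor and switching to geom_bar() avoids the conversion altogether, so the labels stay on the axis.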

library(ggplot2)

# Return the first honorific found in a passenger name, or "Other"
extractTitle <- function(Name) {
  Name <- as.character(Name)
  # fixed = TRUE matches the dot literally instead of as a regex wildcard
  if (grepl("Miss.", Name, fixed = TRUE)) {
    return("Miss.")
  } else if (grepl("Master.", Name, fixed = TRUE)) {
    return("Master.")
  } else if (grepl("Mrs.", Name, fixed = TRUE)) {
    return("Mrs.")
  } else if (grepl("Mr.", Name, fixed = TRUE)) {
    return("Mr.")
  } else {
    return("Other")
  }
}

titles <- NULL
for (i in 1:nrow(data.combined)) {
  titles <- c(titles, extractTitle(data.combined[i, "Name"]))
}
data.combined$title <- as.factor(titles)

# Rows 1:891 are the rows used in the question
ggplot(data.combined[1:891, ], aes(x = title, fill = Survived)) +
  geom_bar(width = 0.5) +
  facet_wrap("Pclass") +
  xlab("Title") +
  ylab("total count") +
  labs(fill = "Survived")

As stated above, use geom_bar() instead of geom_histogram(). See the sample code below (I wanted a separate graph for each month of birth-date data):
ggplot(data = pf,aes(x=dob_day))+
geom_bar()+
scale_x_discrete(breaks = 1:31)+
facet_wrap(~dob_month,ncol = 3)

I had the same issue but none of the above solutions worked. Then I noticed that the column of the data frame I wanted to use for the histogram wasn't numeric:
df$variable<- as.numeric(as.character(df$variable))

I had the same error. In my original code, I read my .csv file with read_csv(). After I changed the file into .xlsx and read it with read_excel(), the code ran smoothly.

Related

Where should I reorder a bar graph so that the bar groups follow the same sequence as the data frame?

I have a dataframe like this:
> str(mydata6)
'data.frame': 6 obs. of 4 variables:
$ Comparison : Factor w/ 6 levels "Decreased_Adult",..: 5 2 6 3 4 1
$ differential_IR_number: num 446 305 965 599 1799 ...
$ Stage : Factor w/ 3 levels "AdultvsE11","E14vsE11",..: 2 2 3 3 1 1
$ Change : Factor w/ 2 levels "Decrease","Increase": 2 1 2 1 2 1
Columns 1, 3 and 4 are factors and column 2 is numeric.
I used the following code to do a bargraph:
ggplot(mydata6, aes(x=Stage, y=differential_IR_number, fill=Change)) + #don't need to use "" for x= and y, comparing to the above code
geom_bar(stat = "identity", position = "stack") + #using stack to make decrease and increase stack with each other
theme(axis.text.x = element_text(angle = 90, hjust = 1)) + #using theme function to change the labeling to be vertical
geom_text(aes(label=differential_IR_number), position=position_stack(vjust=0.5))
The result is the following:
But I want the order to be E14vsE11, E18vsE11, AdultvsE11. I tried to reorder/sort at different positions, but nothing works.
Why does it not follow the order of my data frame?
The bars follow the order of the factor's levels, not the row order of the data frame. You can set the order you want as follows:
mydata6$Stage <- factor(mydata6$Stage, levels = c("E14vsE11", "E18vsE11", "AdultvsE11"))
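If the desired order is simply the row order of the data frame, the levels can also be taken from the column itself rather than typed by hand. A sketch on a reduced, made-up copy of mydata6 (only the Stage column, with values assumed from the str() output above):

```r
# Toy stand-in for mydata6; rows are in the order the bars should appear
mydata6 <- data.frame(
  Stage = c("E14vsE11", "E14vsE11", "E18vsE11", "E18vsE11",
            "AdultvsE11", "AdultvsE11"),
  stringsAsFactors = TRUE
)
levels(mydata6$Stage)  # alphabetical, which is the order used on the axis

# unique() keeps first-appearance order, so the levels now follow the rows
row_order <- unique(as.character(mydata6$Stage))
mydata6$Stage <- factor(mydata6$Stage, levels = row_order)
levels(mydata6$Stage)
```

This only helps when the rows are already sorted the way the axis should read; otherwise the explicit levels vector is the safer choice.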

Error when using "ifelse" to create a new column of a data set based on a formula using variables within the data set

I have a data set that looks like this:
'data.frame': 25952 obs. of 12 variables:
$ Year : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
$ Date : Factor w/ 15 levels "","2016-05-26",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Location : Factor w/ 22 levels "Coquet Island",..: 3 3 3 3 3 3 3 3 3 3 ...
$ Group : Factor w/ 3 levels "vis","worm","inktvis": 1 1 1 1 1 1 1 1 1 1 ...
$ Part : Factor w/ 8 levels "kaak","otoliet",..: 3 3 3 3 3 3 3 3 3 3 ...
$ Part_type : Factor w/ 6 levels "","asteriscus",..: 4 4 4 4 4 4 4 4 4 4 ...
$ Taxon : Factor w/ 20 levels "","Ansjovis",..: 19 19 19 19 19 19 19 19 19 19 ...
$ Further_ID : Factor w/ 12 levels "","kabeljauwachtige?",..: 5 8 5 5 8 8 8 5 8 8 ...
$ Length : num 3021 2897 3203 3118 2509 ...
$ Width : num 1483 1511 1427 1387 1276 ...
$ YearLocation: chr "2016 Markenje" "2016 Markenje" "2016 Markenje" "2016 Markenje" ...
$ Family : Factor w/ 5 levels "Nereidae","Invertebrates",..: 5 5 5 5 5 5 5 5 5 5 ...
I am trying to calculate total fish length for every sample using a formula based on "taxon" and "length" values. This is an example of one of the codes I have been using (the formula varies slightly based on Taxon):
dieet.GS.EU$FL <- ifelse(dieet.GS.EU$Taxon %in% c("zandspieringachtige"), ((0.000006 * (dieet.GS.EU$Length ^ 2) + (0.0311 * dieet.GS.EU$Length) + 24.161)), dieet.GS.EU$FL)
When I try to run this I receive the error message:
Error in ans[!test & ok] <- rep(no, length.out = length(ans))[!test & :
replacement has length zero
In addition: Warning message:
In rep(no, length.out = length(ans)) :
'x' is NULL so the result will be NULL
Can anyone please help me figure out the issue here? I believe it has to do with my data set rather than the code I am running, but I am not very experienced in R so I am not sure. Any help would be greatly appreciated.
The statement
dieet.GS.EU$FL <- ifelse(dieet.GS.EU$Taxon %in% c("zandspieringachtige"),
((0.000006 * (dieet.GS.EU$Length ^ 2) +
(0.0311 * dieet.GS.EU$Length) + 24.161)),
dieet.GS.EU$FL)
refers to dieet.GS.EU$FL in the alternative, but it doesn't exist. You need to put a default value there instead, e.g.
dieet.GS.EU$FL <- ifelse(dieet.GS.EU$Taxon %in% c("zandspieringachtige"),
((0.000006 * (dieet.GS.EU$Length ^ 2) +
(0.0311 * dieet.GS.EU$Length) + 24.161)),
0)
or perhaps create the column first, and then use your original:
dieet.GS.EU$FL <- 0 # or some more complicated default
dieet.GS.EU$FL <- ifelse(dieet.GS.EU$Taxon %in% c("zandspieringachtige"),
((0.000006 * (dieet.GS.EU$Length ^ 2) +
(0.0311 * dieet.GS.EU$Length) + 24.161)),
dieet.GS.EU$FL)
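The cause can be checked on a toy data frame: $ on a missing column evaluates to NULL, and ifelse() cannot recycle a zero-length "no" argument. The sketch below uses made-up rows, and it uses NA_real_ rather than 0 as the default (an assumption; pick whatever marks "not computed yet" in your data) so that untouched taxa are not mistaken for real zero lengths.

```r
# Made-up miniature of dieet.GS.EU
d <- data.frame(Taxon  = c("zandspieringachtige", "Ansjovis"),
                Length = c(3000, 2500))

is.null(d$FL)  # TRUE: the column does not exist yet, so d$FL is NULL

# Create the column with a missing-value default before the first ifelse()
d$FL <- NA_real_
d$FL <- ifelse(d$Taxon %in% "zandspieringachtige",
               0.000006 * d$Length^2 + 0.0311 * d$Length + 24.161,
               d$FL)
d
```

Because each later ifelse() for another taxon passes d$FL through as the alternative, the values computed so far are preserved.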

How to sum up numbers in one CSV-column that belong to one factor in another column?

I am pretty new to R and have a data file that represents a budget. I want to sum up all the price tags for one purpose in the purpose column. That purpose gets automatically factored when reading in the csv. But how can I assign the right prices to a purpose with several counts in the file and sum them up?
I got the file from this link:
http://www.berlin.de/imperia/md/content/senatsverwaltungen/finanzen/haushalt/ansatzn2013.xls?download.html
I opened it in Open Office, exported the .csv-file and called it ausgaben.csv.
> ausgaben <- read.csv("ausgaben.csv")
> str(ausgaben)
'data.frame': 15895 obs. of 8 variables:
$ Bereich : Factor w/ 13 levels "(30) Senatsverwaltungen",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Einzelplan : Factor w/ 28 levels "(01) Abgeordnetenhaus",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Kapitel : Factor w/ 270 levels "(0100) Abgeordnetenhaus",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Titelart : Factor w/ 1 level "Ausgaben": 1 1 1 1 1 1 1 1 1 1 ...
$ Titel : int 41101 41103 42201 42701 42801 42811 42821 44100 44304 44379 ...
$ Titelbezeichnung: Factor w/ 1286 levels "Abdeckung von Geldverlusten",..: 57 973 182 67 262 257 95 127 136 797 ...
$ Funktion : Factor w/ 135 levels "(011) Politische Führung",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Euro : Factor w/ 2909 levels "-1.083,0","-1.295,0",..: 539 2226 1052 1167 1983 1111 1575 2749 1188 1167 ...
The column "Funktion" has 135 levels, each of which corresponds to amounts in "Euro". I want to collect all the numbers in "Euro" for each level of "Funktion" and sum them, so I get 135 Euro values and can show what is spent for what purpose in this budget.
This could be done with plyr::ddply or many other functions (ave, tapply, etc.).
I think that 'Euro' should not be a factor, but numeric - so please fix this before trying to aggregate.
Since we do not have your data here is a toy example:
set.seed(1234)
df <- data.frame(fac = sample(LETTERS[1:3], 50, replace = TRUE),
x = runif(50))
require(plyr)
ddply(df, .(fac), summarise,
sum_x = sum(x))
#   fac    sum_x
# 1   A 7.938613
# 2   B 6.692007
# 3   C 5.645078
You can read the xls file with the gdata package:
library(gdata)
ausgaben <- read.xls("ansatzn2013.xls")
Firstly, you need to transform the values in the column Ansatz.2013.inkl..Nachtrag.in.Tsd..EUR from factor to numeric:
Euro <- as.character(ausgaben$Ansatz.2013.inkl..Nachtrag.in.Tsd..EUR)
Euro <- as.numeric(gsub(",", "", Euro))  # gsub removes every thousands separator, not only the first
Then, you can calculate the sums with the aggregate function:
aggregate(Euro ~ ausgaben$Funktion, FUN = sum)
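For the CSV route in the question, str() shows Euro values such as "-1.083,0", which suggests German number formatting with "." as the thousands separator and "," as the decimal mark. A base-R sketch of the conversion and the per-purpose sum, using made-up rows (the Funktion labels and all amounts except "-1.083,0" are invented):

```r
# Made-up miniature of ausgaben as read.csv() would return it
ausgaben <- data.frame(
  Funktion = c("(011)", "(011)", "(012)"),
  Euro     = c("-1.083,0", "2.500,5", "10,0"),
  stringsAsFactors = TRUE
)

# Factor -> character, drop thousands dots, turn the decimal comma into a dot
euro_chr <- as.character(ausgaben$Euro)
euro_chr <- gsub(".", "", euro_chr, fixed = TRUE)
ausgaben$Euro <- as.numeric(sub(",", ".", euro_chr, fixed = TRUE))

# One sum per level of Funktion
sums <- aggregate(Euro ~ Funktion, data = ausgaben, FUN = sum)
sums
```

Going through as.character() first matters: calling as.numeric() directly on the factor would return the level codes, not the amounts.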

ggplot2: overlay control group line on graph panel set

I have a stacked areaplot made with ggplot2:
dists.med.areaplot<-qplot(starttime,value,fill=dists,facets=~groupname,
geom='area',data=MDist.median, stat='identity') +
labs(y='median distances', x='time(s)', fill='Distance Types')+
opts(title=subt) +
scale_fill_brewer(type='seq') +
facet_wrap(~groupname, ncol=2) + grect #grect adds the grey/white vertical bars
It looks like this:
I want to add an overlay of the profile of the control graph (bottom right) to all the graphs in the output (groupname == 'rowH' is the control).
So far my best efforts have yielded this:
cline<-geom_line(aes(x=starttime,y=value),
data=subset(dists.med,groupname=='rowH'),colour='red')
dists.med.areaplot + cline
I need the 3 red lines to be 1 red line that skims the top of the dark blue section. And I need that identical line (the rowH line) to overlay each of the panels.
The dataframe looks like this:
> str(MDist.median)
'data.frame': 2880 obs. of 6 variables:
$ groupname: Factor w/ 8 levels "rowA","rowB",..: 1 1 1 1 1 1 1 1 1 1 ...
$ fCycle : Factor w/ 6 levels "predark","Cycle 1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ fPhase : Factor w/ 2 levels "Light","Dark": 2 2 2 2 2 2 2 2 2 2 ...
$ starttime: num 0.3 60 120 180 240 300 360 420 480 540 ...
$ dists : Factor w/ 3 levels "inadist","smldist",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : num 110 117 115 113 114 ...
The red line should be calculated as the sum of the value at each starttime, where groupname='rowH'. I have tried creating cline the following ways. Each results in an error or incorrect output:
#sums the entire y for all points and makes horizontal line
cline<-geom_line(aes(x=starttime,y=sum(value)),data=subset(dists.med,groupname=='rowH'),colour='red')
#using related dataset with pre-summed y's
> cline<-geom_line(aes(x=starttime,y=tot_dist),data=subset(t.med,groupname=='rowH'))
> dists.med.areaplot + cline
Error in eval(expr, envir, enclos) : object 'dists' not found
Thoughts?
ETA:
It appears that the issue I was having with 'dists' not found has to do with the fact that the initial plot, dists.med.areaplot was created via qplot. To avoid this issue, I can't build on a qplot. This is the code for the working plot:
cline.data <- subset(
ddply(MDist.median, .(starttime, groupname), summarize, value = sum(value)),
groupname == "rowH")
cline<-geom_line(data=transform(cline.data,groupname=NULL), colour='red')
dists.med.areaplot<-ggplot(MDist.median, aes(starttime, value)) +
grect + nogrid +
geom_area(aes(fill=dists),stat='identity') +
facet_grid(~groupname)+ scale_fill_brewer(type='seq') +
facet_wrap(~groupname, ncol=2) +
cline
resulting in this graphset:
This Learning R blog post should be of some help:
http://learnr.wordpress.com/2009/12/03/ggplot2-overplotting-in-a-faceted-scatterplot/
It might be worth computing the summary outside of ggplot with plyr.
cline.data <- ddply(MDist.median, .(starttime, groupname), summarize, value = sum(value))
cline.data.subset <- subset(cline.data, groupname == "rowH")
Then add it to the plot with
last_plot() + geom_line(data = transform(cline.data.subset, groupname = NULL), color = "red")
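Dropping groupname matters because a layer whose data does not contain the faceting variable is drawn in every panel. A sketch of the data preparation on made-up numbers, done with base R aggregate() instead of plyr (a substitution for illustration, not the code from the answer):

```r
# Made-up miniature of MDist.median: 2 groups x 2 distance types x 2 times
MDist.median <- expand.grid(starttime = c(0, 60),
                            dists     = c("inadist", "smldist"),
                            groupname = c("rowA", "rowH"))
MDist.median$value <- c(1, 2, 3, 4, 10, 20, 30, 40)

# Sum over distance types within the control group: the top of its stack
control    <- subset(MDist.median, groupname == "rowH")
cline.data <- aggregate(value ~ starttime, data = control, FUN = sum)
cline.data  # no groupname column, so the line is not tied to one facet

# Plotting is skipped if ggplot2 is not installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(MDist.median, aes(starttime, value)) +
    geom_area(aes(fill = dists)) +
    geom_line(data = cline.data, colour = "red") +
    facet_wrap(~groupname, ncol = 2)
  print(p)
}
```

Here the red line carries one value per starttime (40 and 60 in the toy data), so it appears once per panel instead of once per distance type.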

stacked barchart with lattice: is my data too big?

I want a graph that looks similar to the example given in the lattice docs:
#EXAMPLE GRAPH, not my data
> barchart(yield ~ variety | site, data = barley,
+ groups = year, layout = c(1,6), stack = TRUE,
+ auto.key = list(points = FALSE, rectangles = TRUE, space = "right"),
+ ylab = "Barley Yield (bushels/acre)",
+ scales = list(x = list(rot = 45)))
I melted my data to obtain this "long" form dataframe:
> str(MDist)
'data.frame': 34560 obs. of 6 variables:
$ fCycle : Factor w/ 2 levels "Dark","Light": 2 2 2 2 2 2 2 2 2 2 ...
$ groupname: Factor w/ 8 levels "rowA","rowB",..: 1 1 1 1 1 1 1 1 1 1 ...
$ location : Factor w/ 96 levels "c1","c10","c11",..: 1 1 1 1 1 1 1 1 1 1 ...
$ timepoint: num 1 2 3 4 5 6 7 8 9 10 ...
$ variable : Factor w/ 3 levels "inadist","smldist",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : num 0 55.7 75.3 99.2 45.9 73.8 79.3 73.5 69.8 67.6 ...
I want to create a stacked barchart for each groupname and fCycle. I tried this:
barchart(value~timepoint|groupname*fCycle, data=MDist, groups=variable,stack=T)
It doesn't throw any errors, but it's still thinking after 30 minutes. Is this because it doesn't know how to deal with the 36 values that contribute to each bar? How can I make this data easier for barchart to digest?
I don't know lattice well, but could it be because your timepoint variable is numeric, not a factor?
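A sketch of that suggestion on made-up data. According to lattice's documentation for the horizontal argument, barchart treats y as the categorical variable unless x is a factor or shingle, so with a numeric timepoint every distinct value would become its own category. The aggregation step (a mean per cell) is an assumption about what each bar segment should show.

```r
set.seed(1)
# Made-up miniature of MDist: several locations contribute to each cell
MDist <- expand.grid(location  = paste0("c", 1:4),
                     timepoint = 1:3,
                     variable  = c("inadist", "smldist"),
                     groupname = c("rowA", "rowB"),
                     fCycle    = c("Dark", "Light"))
MDist$value <- runif(nrow(MDist), 0, 100)

# Collapse to one value per bar segment, then make the x variable categorical
MDist.agg <- aggregate(value ~ timepoint + variable + groupname + fCycle,
                       data = MDist, FUN = mean)
MDist.agg$timepoint <- factor(MDist.agg$timepoint)

# Plotting is skipped if lattice is not installed
if (requireNamespace("lattice", quietly = TRUE)) {
  print(lattice::barchart(value ~ timepoint | groupname * fCycle,
                          data = MDist.agg, groups = variable, stack = TRUE))
}
```

Aggregating first also cuts the number of rectangles lattice has to draw, which may help with the run time independently of the factor conversion.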
