stacked barchart with lattice: is my data too big? - r

I want a graph that looks similar to the example given in the lattice docs:
#EXAMPLE GRAPH, not my data
> barchart(yield ~ variety | site, data = barley,
+ groups = year, layout = c(1,6), stack = TRUE,
+ auto.key = list(points = FALSE, rectangles = TRUE, space = "right"),
+ ylab = "Barley Yield (bushels/acre)",
+ scales = list(x = list(rot = 45)))
I melted my data to obtain this "long" form dataframe:
> str(MDist)
'data.frame': 34560 obs. of 6 variables:
$ fCycle : Factor w/ 2 levels "Dark","Light": 2 2 2 2 2 2 2 2 2 2 ...
$ groupname: Factor w/ 8 levels "rowA","rowB",..: 1 1 1 1 1 1 1 1 1 1 ...
$ location : Factor w/ 96 levels "c1","c10","c11",..: 1 1 1 1 1 1 1 1 1 1 ...
$ timepoint: num 1 2 3 4 5 6 7 8 9 10 ...
$ variable : Factor w/ 3 levels "inadist","smldist",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : num 0 55.7 75.3 99.2 45.9 73.8 79.3 73.5 69.8 67.6 ...
I want to create a stacked barchart for each groupname and fCycle. I tried this:
barchart(value~timepoint|groupname*fCycle, data=MDist, groups=variable,stack=T)
It doesn't throw any errors, but it's still thinking after 30 minutes. Is this because it doesn't know how to deal with the 36 values that contribute to each bar? How can I make this data easier for barchart to digest?

I don't know lattice well, but could it be because your timepoint variable is numeric, not a factor?
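A minimal sketch of that suggestion, using invented data with the same column names as MDist (the values and the level names v1 to v3 are hypothetical). As far as I understand lattice's defaults, when x is not a factor barchart() treats the y variable as the categorical one, so a continuous value column can end up with thousands of levels. Converting timepoint to a factor avoids that, and summing the repeated observations first (an assumption about what each bar segment should show) keeps the number of rectangles small:

```r
library(lattice)

# Hypothetical stand-in for MDist: same column names, invented values.
set.seed(42)
toy <- expand.grid(
  timepoint = 1:10,
  variable  = c("v1", "v2", "v3"),
  groupname = c("rowA", "rowB"),
  fCycle    = c("Dark", "Light"),
  location  = c("c1", "c2", "c3")
)
toy$value <- runif(nrow(toy), 0, 100)

# Collapse the repeated observations so each bar segment is a single number.
agg <- aggregate(value ~ timepoint + variable + groupname + fCycle,
                 data = toy, FUN = sum)

# Make the x variable categorical so barchart() draws vertical stacked bars.
agg$timepoint <- factor(agg$timepoint)

p <- barchart(value ~ timepoint | groupname * fCycle, data = agg,
              groups = variable, stack = TRUE, auto.key = TRUE)
print(p)
```

Swap FUN = sum for FUN = mean if each segment should show an average rather than a total.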

Related

lattice plot error: need finite xlim values calls

Whenever I try to plot across factors, I keep getting an error.
Here is what my data looks like:
str(dataWithNoNa)
## 'data.frame': 17568 obs. of 4 variables:
## $ steps : num 1.717 0.3396 0.1321 0.1509 0.0755 ...
## $ date : Factor w/ 61 levels "2012-10-01","2012-10-02",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
## $ dayType : Factor w/ 2 levels "Weekday","Weekend": 1 1 1 1 1 1 1 1 1 1 ...
I am trying to plot using the lattice plotting system using Weekday/Weekend as a factor.
Here is what I tried:
plot(dataWithNoNa$steps~ dataWithNoNa$interval | dataWithNoNa$dayType, type="l")
Error in plot.window(...) : need finite 'xlim' values
I even checked to make sure my data had no NAs:
sum(is.na(dataWithNoNa$interval))
## [1] 0
sum(is.na(dataWithNoNa$steps))
## [1] 0
What am I doing wrong?
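Base plot() does not treat | in a formula as a conditioning operator; that syntax belongs to lattice. Try xyplot() instead:
library(lattice)
xyplot(steps ~ interval | factor(dayType), data=df)
Sample data (define df before running the call above):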
df <- data.frame(
steps=c(1.717,0.3396,0.1321,0.1509,0.0755),
interval=c(0,5,10,15,20),
dayType=c(1,1,1,2,2)
)
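Since the question asked for lines, here is a slightly fuller sketch, assuming dayType is already a labelled factor as in the original data. The sample values are hypothetical, and type = "l" and layout are ordinary xyplot() arguments:

```r
library(lattice)

# Hypothetical sample: two day types, three intervals each.
df <- data.frame(
  steps    = c(1.717, 0.3396, 0.1321, 0.1509, 0.0755, 0.5),
  interval = c(0, 5, 10, 0, 5, 10),
  dayType  = factor(c("Weekday", "Weekday", "Weekday",
                      "Weekend", "Weekend", "Weekend"))
)

# One panel per day type, stacked vertically, drawn as lines.
p <- xyplot(steps ~ interval | dayType, data = df,
            type = "l", layout = c(1, 2))
print(p)
```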

single instead of multiple boxplots with ggplot

I would like to make a boxplot for a variable (Theta..vol..) depending on two factors (Tiefe) and (Ort).
> str(data)
'data.frame': 30 obs. of 6 variables:
$ Nummer : int 1 2 3 4 5 6 7 8 9 10 ...
$ Name : int 11 12 13 14 15 16 17 18 19 20 ...
$ Ort : Factor w/ 2 levels "NNW","S": 2 2 2 2 2 2 2 2 2 2 ...
$ Tiefe : int 20 20 20 20 20 50 50 50 50 50 ...
$ Gerät : int 2 2 2 2 2 2 2 2 2 2 ...
$ Theta..vol..: num 15 16.4 14.9 16.6 10.6 22.1 17.6 10 18 20.3 ...
My code is:
ggplot(data, aes(x = Tiefe, y = Theta..vol.., fill=Ort))+geom_boxplot()
Since the variable Tiefe has 3 levels and the variable Ort has 2 levels, I want to see three pairs of boxplots (one pair for each Tiefe).
But I see just a single pair (one boxplot for each level of Ort).
What should I change to get one pair for each Tiefe? Thank you.
In your code, Tiefe is being read as an integer not a factor.
Easy fix using dplyr with ggplot2:
First I made some dummy data:
library(dplyr)
library(ggplot2)  # needed for ggplot() and geom_boxplot() below
data <- tibble(
  Ort = ifelse(runif(30) > 0.5, "NNW", "S"),
  Tiefe = rep(c(20, 50, 75), times = 10),
  Theta..vol.. = rnorm(30, 15))
Next, we modify the Tiefe column before piping into the ggplot:
data %>%
mutate(Tiefe = factor(Tiefe)) %>%
ggplot(aes(x = Tiefe, y = Theta..vol.., fill = Ort)) +
geom_boxplot()

R ggplot - Error stat_bin requires continuous x variable

My table is data.combined with following structure:
'data.frame': 1309 obs. of 12 variables:
$ Survived: Factor w/ 3 levels "0","1","None": 1 2 2 2 1 1 1 1 2 2 ...
$ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
$ Name : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
$ Sex : num 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : Factor w/ 929 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : Factor w/ 187 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
$ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
$ Title : Factor w/ 4 levels "Master.","Miss.",..: 3 3 2 3 3 3 3 1 3 3 ...
I want to draw a graph to reflect the relationship between Title and Survived, categorized by Pclass. I used the following code:
ggplot(data.combined[1:891,], aes(x=Title, fill = Survived)) +
geom_histogram(binwidth = 0.5) +
facet_wrap(~Pclass) +
ggtitle ("Pclass") +
xlab("Title") +
ylab("Total count") +
labs(fill = "Survived")
However this results in error: Error: StatBin requires a continuous x variable the x variable is discrete. Perhaps you want stat="count"?
If I convert the Title variable to numeric (data.combined$Title <- as.numeric(data.combined$Title)), the code works, but the x-axis labels in the graph are also numeric (see below). Why does this happen, and how can I fix it? Thanks.
By the way, I use R 3.2.3 on Mac OS X El Capitan.
Graph: Instead of Mr, Miss,Mrs the x axis shows numeric values 1,2,3,4
To sum up the answers from the comments above:
1 - Replace geom_histogram(binwidth = 0.5) with geom_bar(). However, this does not allow customizing binwidth.
2 - Using stat_count(width = 0.5) instead of geom_bar() or geom_histogram(binwidth = 0.5) would also solve it.
extractTitle <- function(Name) {
Name <- as.character(Name)
if (length(grep("Miss.", Name)) > 0) {
return ("Miss.")
} else if (length(grep("Master.", Name)) > 0) {
return ("Master.")
} else if (length(grep("Mrs.", Name)) > 0) {
return ("Mrs.")
} else if (length(grep("Mr.", Name)) > 0) {
return ("Mr.")
} else {
return ("Other")
}
}
titles <- NULL
for (i in 1:nrow(data.combined)){
titles <- c(titles, extractTitle(data.combined[i, "Name"]))
}
data.combined$title <- as.factor(titles)
ggplot(data.combined[1:891,], aes(x = title, fill = Survived))+
geom_bar(width = 0.5) +
facet_wrap("Pclass")+
xlab("Title")+
ylab("total count")+
labs(fill = "Survived")
As stated above, use geom_bar() instead of geom_histogram(). See the sample code below (I wanted a separate graph for each month of birth-date data):
ggplot(data = pf,aes(x=dob_day))+
geom_bar()+
scale_x_discrete(breaks = 1:31)+
facet_wrap(~dob_month,ncol = 3)
I had the same issue but none of the above solutions worked. Then I noticed that the column of the data frame I wanted to use for the histogram wasn't numeric:
df$variable<- as.numeric(as.character(df$variable))
Taken from here
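The as.character() step matters because as.numeric() on a factor returns the internal level codes rather than the labels, which is also why the x axis in the question showed 1, 2, 3, 4 after the conversion. A small base R illustration with a made-up factor:

```r
# A factor whose labels look like numbers (levels sort as "10", "20", "5").
f <- factor(c("10", "20", "20", "5"))

# as.numeric() on a factor returns the level codes, not the labels.
codes <- as.numeric(f)

# Going through as.character() first recovers the values themselves.
values <- as.numeric(as.character(f))

print(codes)   # 1 2 2 3
print(values)  # 10 20 20 5
```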
I had the same error. In my original code, I read my .csv file with read_csv(). After I changed the file into .xlsx and read it with read_excel(), the code ran smoothly.

Mean of all means of subsets of data differs from overall mean

I have a large data set which looks like so:
str(ldt)
'data.frame': 116105 obs. of 11 variables:
$ s : Factor w/ 35 levels "1","10","11",..: 1 1 1 1 1 1 1 1 1 1 ...
$ PM : Factor w/ 3 levels "C","F","NF": 3 3 3 3 3 3 3 3 3 3 ...
$ day : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ block : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 2 2 2 ...
$ item : chr "parity" "grudoitong" "gunirec" "pirul" ...
$ C : logi TRUE TRUE TRUE TRUE TRUE FALSE ...
$ S : Factor w/ 2 levels "Nonword","Word": 2 1 1 1 2 2 2 1 2 1 ...
$ R : Factor w/ 2 levels "Nonword","Word": 2 1 1 1 2 1 2 1 2 1 ...
$ RT : num 0.838 1.026 0.93 0.553 0.815 ...
When I get means by factor from this data set, and then get the mean of those means it's slightly different from the mean of the original data set. It's different again when I split it into more factors and get the mean of those means. For example:
mean(ldt$RT[ldt$C])
[1] 0.6630013
mean(tapply(ldt$RT[ldt$C],list(s=ldt$s[ldt$C], PM= ldt$PM[ldt$C]),mean))
[1] 0.6638781
mean(tapply(ldt$RT[ldt$C],list(s=ldt$s[ldt$C], day = ldt$day[ldt$C], item=ldt$S[ldt$C], PM=ldt$PM[ldt$C]),mean))
[1] 0.6648401
What on earth is causing this discrepancy? The only thing I can imagine is that the subset means are getting rounded off. Is that why the answers are different? What's the exact mechanic at work here?
Thank you
The mean of group means is generally not the same as the mean of all numbers, unless every group has the same size.
Simple example: Take the dataset
1,3,5,6,7
The mean of 1 and 3 obviously is 2, the mean of 5,6,7 is 6.
The mean of the means therefore would be 4.
However, we have 1+3+5+6+7 = 22 and 22/5 = 4.4.
Thus, the discrepancy comes from the mathematics of your calculation, not from your code.
To recover the overall mean, use a weighted mean: weight each group mean by the number of values in that group divided by the total number of observations. In our example:
2/5 * 2 + 3/5 * 6 = 4.4
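The same arithmetic can be checked in R. This is only a sketch on the toy numbers above; the group labels "a" and "b" are made up for illustration:

```r
x     <- c(1, 3, 5, 6, 7)
group <- c("a", "a", "b", "b", "b")

group_means <- tapply(x, group, mean)    # a = 2, b = 6
group_sizes <- tapply(x, group, length)  # a = 2, b = 3

# Unweighted mean of the two group means.
unweighted <- mean(group_means)

# Weighting each group mean by its group size recovers the overall mean.
weighted <- weighted.mean(group_means, w = group_sizes)

print(unweighted)  # 4
print(weighted)    # 4.4
print(mean(x))     # 4.4
```

For the ldt data, the same idea would mean passing the cell counts from tapply(..., length) as weights.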

How to sum up numbers in one CSV-column that belong to one factor in another column?

I am pretty new to R and have a data file that represents a budget. I want to sum up all the price tags for one purpose in the purpose column. That purpose gets automatically factored when reading in the csv. But how can I assign the right prices to a purpose with several counts in the file and sum them up?
I got the file from this link:
http://www.berlin.de/imperia/md/content/senatsverwaltungen/finanzen/haushalt/ansatzn2013.xls?download.html
I opened it in Open Office, exported the .csv-file and called it ausgaben.csv.
> ausgaben <- read.csv("ausgaben.csv")
> str(ausgaben)
'data.frame': 15895 obs. of 8 variables:
$ Bereich : Factor w/ 13 levels "(30) Senatsverwaltungen",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Einzelplan : Factor w/ 28 levels "(01) Abgeordnetenhaus",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Kapitel : Factor w/ 270 levels "(0100) Abgeordnetenhaus",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Titelart : Factor w/ 1 level "Ausgaben": 1 1 1 1 1 1 1 1 1 1 ...
$ Titel : int 41101 41103 42201 42701 42801 42811 42821 44100 44304 44379 ...
$ Titelbezeichnung: Factor w/ 1286 levels "Abdeckung von Geldverlusten",..: 57 973 182 67 262 257 95 127 136 797 ...
$ Funktion : Factor w/ 135 levels "(011) Politische Führung",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Euro : Factor w/ 2909 levels "-1.083,0","-1.295,0",..: 539 2226 1052 1167 1983 1111 1575 2749 1188 1167 ...
"Funktion" has 135 levels, each of which corresponds to several amounts in "Euro". I want to collect all the numbers in "Euro" for each level of "Funktion" and sum them, so that I get 135 Euro values and can show what is spent for what purpose in this budget.
This could be done with plyr::ddply or many other functions (ave, tapply, etc.).
I think that 'Euro' should not be a factor, but numeric - so please fix this before trying to aggregate.
Since we do not have your data here is a toy example:
set.seed(1234)
df <- data.frame(fac = sample(LETTERS[1:3], 50, replace = TRUE),
x = runif(50))
require(plyr)
ddply(df, .(fac), summarise,
sum_x = sum(x))
#   fac    sum_x
# 1   A 7.938613
# 2   B 6.692007
# 3   C 5.645078
You can read the xls file with the gdata package:
library(gdata)
ausgaben <- read.xls("ansatzn2013.xls")
Firstly, you need to transform the values in the column Ansatz.2013.inkl..Nachtrag.in.Tsd..EUR from factor to numeric:
Euro <- as.character(ausgaben$Ansatz.2013.inkl..Nachtrag.in.Tsd..EUR)
Euro <- as.numeric(gsub(",", "", Euro, fixed = TRUE))  # gsub removes every comma, not just the first
Then, you can calculate the sums with the aggregate function:
aggregate(Euro ~ ausgaben$Funktion, FUN = sum)
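If you work from the exported CSV instead, the str() output in the question suggests the amounts use "." for thousands and "," for decimals (for example "-1.083,0"). A hedged base R sketch of that conversion follows; the helper name to_number, the toy rows, and the shortened Funktion labels are all hypothetical:

```r
# Hypothetical rows in the format shown by str(ausgaben).
toy <- data.frame(
  Funktion = c("(011) A", "(011) A", "(012) B"),
  Euro     = c("1.083,0", "-1.295,0", "250,5"),
  stringsAsFactors = TRUE
)

to_number <- function(x) {
  x <- as.character(x)                 # use the labels, not the level codes
  x <- gsub(".", "", x, fixed = TRUE)  # drop every thousands separator
  x <- sub(",", ".", x, fixed = TRUE)  # decimal comma -> decimal point
  as.numeric(x)
}

toy$Euro <- to_number(toy$Euro)

# One sum per level of Funktion.
sums <- tapply(toy$Euro, toy$Funktion, sum)
print(sums)
```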
