Multiple lines from single column using ggplot - r

I have a data frame like the one below. I need to plot it by region, with date on the x axis and AveElapsedTime on the y axis.
>avg_data
date region AveElapsedTime
1 2012-05-19 betasol 1372
2 2012-05-22 atpTax 1652
3 2012-06-02 betasol 1630
4 2012-06-02 atpTax 1552
5 2012-06-02 Tax 1552
6 2012-06-07 betasol 1408
7 2012-06-12 betasol 1471
8 2012-06-15 betasol 1384
9 2012-06-21 betasol 1390
10 2012-06-22 atpTax 1252
11 2012-06-23 betasol 1442
If I rearrange the data by region, it looks like the table below. Nothing should be plotted for a region that has no value (NA) on a particular date.
date atpTax betasol Tax
1 2012-05-19 NA 1372 NA
2 2012-05-22 1652 NA NA
3 2012-06-02 1552 1630 1552
4 2012-06-07 NA 1408 NA
5 2012-06-12 NA 1471 NA
6 2012-06-15 NA 1384 NA
7 2012-06-21 NA 1390 NA
8 2012-06-22 1252 NA NA
9 2012-06-23 NA 1442 NA
I tried the ggplot command below, but I get a geom_path error:
ggplot(avg_data, aes(date, AveElapsedTime)) + geom_line(aes(col=region)) + opts(axis.text.x = theme_text(angle=90, hjust=1))
geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?
> str(avg_data)
'data.frame': 11 obs. of 3 variables:
$ date : Factor w/ 9 levels "2012-05-19","2012-05-22",..: 1 2 3 3 3 4 5 6 7 8 ...
$ region : Factor w/ 3 levels "atpTax","betasol",..: 2 1 2 1 3 2 2 2 2 1 ...
$ AveElapsedTime: int 1372 1652 1630 1552 1552 1408 1471 1384 1390 1252 ...
Please advise on this.

As the error message indicates, you need to specify the group. Like this:
ggplot(avg_data, aes(date, AveElapsedTime, colour=region, group=region)) +
geom_point() + geom_line()
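The str() output in the question shows that date is a factor, which is why ggplot treats every date as its own group. A minimal sketch, assuming you also want a continuous date axis: convert the column to Date first, and use theme()/element_text() in place of opts()/theme_text(), which current ggplot2 versions no longer provide. The data frame below is retyped from the question.

```r
library(ggplot2)

# Data retyped from the question; date starts out as a factor, as in str(avg_data)
avg_data <- data.frame(
  date = factor(c("2012-05-19", "2012-05-22", "2012-06-02", "2012-06-02",
                  "2012-06-02", "2012-06-07", "2012-06-12", "2012-06-15",
                  "2012-06-21", "2012-06-22", "2012-06-23")),
  region = factor(c("betasol", "atpTax", "betasol", "atpTax", "Tax", "betasol",
                    "betasol", "betasol", "betasol", "atpTax", "betasol")),
  AveElapsedTime = c(1372, 1652, 1630, 1552, 1552, 1408, 1471, 1384, 1390, 1252, 1442)
)

# Convert the factor to a real Date so the x axis is continuous
avg_data$date <- as.Date(as.character(avg_data$date))

# One line per region; rotate the x axis labels with the current theme API
p <- ggplot(avg_data, aes(date, AveElapsedTime, colour = region, group = region)) +
  geom_point() +
  geom_line() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(p)
```

Tax has only one observation, so it appears as a single point with no line.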

Related

side by side boxplot in R

I am trying to make a side-by-side box-and-whisker plot of durasec broken out by placement and media:
df <- read.csv("http://citadel.sjfc.edu/faculty/ageraci/data/dataset-nci-2012-subset1.csv")
str(df)
'data.frame': 11475 obs. of 7 variables:
$ time : int 1 1 1 1 1 1 1 1 1 1 ...
$ durasec : int 168 149 179 155 90 133 17 14 14 18 ...
$ placement: int 401 402 403 403 403 403 403 403 403 403 ...
$ format : int 8 9 8 8 9 8 12 12 12 12 ...
$ focus : int 1 1 1 1 1 1 3 3 1 1 ...
$ topic : int 5 5 5 2 2 2 26 26 11 24 ...
$ media : int 4 4 4 4 4 4 4 4 4 4 ...
favstats(~durasec | placement + media, data =df)
placement.media min Q1 median Q3 max mean sd n missing
401.4 14 120.25 164.5 197.00 754 171.39686 90.85643 446 0
402.4 9 92.00 143.0 182.00 619 157.20935 107.92586 449 0
403.4 3 23.00 54.0 141.00 807 90.18696 90.50816 4172 0
401.5 12 94.25 165.5 254.75 1136 215.05121 180.52376 742 0
402.5 7 98.50 181.0 306.00 716 211.23293 145.88735 747 0
403.5 3 34.00 96.0 173.50 1098 124.85180 112.56758 4919 0
bwplot(placement + media ~ durasec, data = df)
When I run this last piece of code, I get a box-and-whisker plot, but the y axis shows 1 through 5 instead of the combinations 401.4 through 403.5 from favstats, and the data don't appear to match the favstats output exactly.
How can I get it to display the six combinations and their data as in favstats?
You can try the following code:
library(lattice)
bwplot(durasec ~ as.factor(df$placement) | as.factor(df$media), data = df)
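If you want all six combinations on one axis, as in the favstats output, one option is to build a single factor with interaction(). In a lattice formula, the + on the left-hand side does not cross the two variables. This is only a sketch: the data frame below is simulated with made-up values (not the real CSV), and combo is a hypothetical column name.

```r
library(lattice)

# Simulated stand-in for the real data set (values are made up for illustration)
set.seed(1)
df <- data.frame(
  durasec   = rexp(600, rate = 1 / 150),
  placement = rep(c(401, 402, 403), times = 200),
  media     = rep(c(4, 5), each = 300)
)

# Cross the two grouping variables into one factor with levels like "401.4"
df$combo <- interaction(df$placement, df$media)

# One box per placement/media combination
print(bwplot(combo ~ durasec, data = df))
```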
Using ggplot:
library(ggplot2)
library(dplyr)
df <- read.csv("http://citadel.sjfc.edu/faculty/ageraci/data/dataset-nci-2012-subset1.csv")
df_fac <- df %>%
mutate_at(vars(placement:media), ~as.factor(.))
ggplot(data = df_fac) +
geom_boxplot(aes(x = durasec, y = placement, fill = media))
Created on 2020-04-06 by the reprex package (v0.3.0)

Plot multiple variables on one barplot?

I'm working with this dataset: https://archive.ics.uci.edu/ml/datasets/Wholesale+customers
I want to plot a histogram with all the variables that have the amount spent (everything but region and channel). I'd like them to be plotted by channel, which has 2 levels. I got the following code from an example on a website, but put in my variables:
category=c(rep("Fresh",2),rep("Grocery",2),rep("Milk",2),rep("Frozen",2),
rep("Detergents_Paper",2),rep("Delicassen",2))
condition=rep(c("Food Service", "Retail"))
value=abs(rnorm(12 , 0 , 15))
data=data.frame(category,condition,value)
ggplot(data, aes(fill=condition, y=value, x=category)) +
geom_bar(position="dodge", stat="identity")
This produces what I want, but it doesn't use my data. Here's the graph I get, but the values don't mean anything since they're basically random.
How do I get my data to plot like this?
By loading the tidyr package, the data can be reshaped to support the expected output.
library(ggplot2)
library(tidyr)
After reading in the data with the correct classes for the columns (factors for Channel and Region while the remaining six fields are numeric), check the data for correctness.
df <- read.csv(file = url('https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv'), colClasses = c('factor','factor','numeric','numeric','numeric','numeric','numeric','numeric'))
str(df)
'data.frame': 440 obs. of 8 variables:
$ Channel : Factor w/ 2 levels "1","2": 2 2 2 1 2 2 2 2 1 2 ...
$ Region : Factor w/ 3 levels "1","2","3": 3 3 3 3 3 3 3 3 3 3 ...
$ Fresh : num 12669 7057 6353 13265 22615 ...
$ Milk : num 9656 9810 8808 1196 5410 ...
$ Grocery : num 7561 9568 7684 4221 7198 ...
$ Frozen : num 214 1762 2405 6404 3915 ...
$ Detergents_Paper: num 2674 3293 3516 507 1777 ...
$ Delicassen : num 1338 1776 7844 1788 5185 ...
head(df)
Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 2 3 12669 9656 7561 214 2674 1338
2 2 3 7057 9810 9568 1762 3293 1776
3 2 3 6353 8808 7684 2405 3516 7844
4 1 3 13265 1196 4221 6404 507 1788
5 2 3 22615 5410 7198 3915 1777 5185
6 2 3 9413 8259 5126 666 1795 1451
The data appear to have imported correctly.
Next we use a combination of tidyr::gather and ggplot2::ggplot to produce the desired bar plot (not histogram).
df %>%
tidyr::gather(Type, Amount, -c(Channel, Region)) %>%
ggplot(aes(x=Type, y=Amount, fill=Channel, group=Channel)) +
geom_col(position = position_dodge())
tidyr::gather(Type, Amount, -c(Channel, Region)) will reshape the data set from this:
Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 2 3 12669 9656 7561 214 2674 1338
2 2 3 7057 9810 9568 1762 3293 1776
3 2 3 6353 8808 7684 2405 3516 7844
4 1 3 13265 1196 4221 6404 507 1788
5 2 3 22615 5410 7198 3915 1777 5185
6 2 3 9413 8259 5126 666 1795 1451
To a "longer" data set which now has the product type as rows:
Channel Region Type Amount
1 2 3 Fresh 12669
2 2 3 Fresh 7057
3 2 3 Fresh 6353
4 1 3 Fresh 13265
5 2 3 Fresh 22615
6 2 3 Fresh 9413
This prepares the data to be plotted using ggplot2::ggplot where the x input can be mapped to the new Type variable and the y variable to Amount.
Be sure to use group=Channel (aesthetic names are case-sensitive) and position=position_dodge() so that ggplot knows you want the bars side by side.
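In newer versions of tidyr, gather() is superseded by pivot_longer(). A rough equivalent of the reshaping step, run on just the six rows shown by head(df) so it works without downloading the file (sample_df and long are names chosen for this sketch):

```r
library(tidyr)
library(ggplot2)

# The six rows printed by head(df) above
sample_df <- data.frame(
  Channel          = factor(c(2, 2, 2, 1, 2, 2)),
  Region           = factor(c(3, 3, 3, 3, 3, 3)),
  Fresh            = c(12669, 7057, 6353, 13265, 22615, 9413),
  Milk             = c(9656, 9810, 8808, 1196, 5410, 8259),
  Grocery          = c(7561, 9568, 7684, 4221, 7198, 5126),
  Frozen           = c(214, 1762, 2405, 6404, 3915, 666),
  Detergents_Paper = c(2674, 3293, 3516, 507, 1777, 1795),
  Delicassen       = c(1338, 1776, 7844, 1788, 5185, 1451)
)

# Every column except Channel and Region becomes a (Type, Amount) pair
long <- pivot_longer(sample_df, cols = -c(Channel, Region),
                     names_to = "Type", values_to = "Amount")

# Same plot call as above, applied to the reshaped sample
p <- ggplot(long, aes(x = Type, y = Amount, fill = Channel, group = Channel)) +
  geom_col(position = position_dodge())
print(p)
```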

Not getting the correct degrees of freedom in R

I'm unsure what I'm doing wrong. This is the data that I'm using:
dtf <- read.table(text=
"Litter Treatment Tube.L
1 Control 1641
2 Control 1290
3 Control 2411
4 Control 2527
5 Control 1930
6 Control 2158
1 GH 1829
2 GH 1811
3 GH 1897
4 GH 1506
5 GH 2060
6 GH 1207
1 FSH 3395
2 FSH 3113
3 FSH 2219
4 FSH 2667
5 FSH 2210
6 FSH 2625
1 GH+FSH 1537
2 GH+FSH 1991
3 GH+FSH 3639
4 GH+FSH 2246
5 GH+FSH 1840
6 GH+FSH 2217", header=TRUE)
What I did was:
BoarsMod1 <- aov(Tube.L ~ Litter + Treatment, data=dtf)
anova(BoarsMod1)
I'm getting an incorrect number of degrees of freedom for litter. It should be 5 (as there are 6 litter blocks) but it is 1. Am I doing something wrong?
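A likely cause: read.table reads Litter as an integer, so aov fits it as a numeric covariate with a single slope (1 degree of freedom) rather than as a blocking factor. A minimal sketch of the fix, assuming Litter is meant to be a block, is to convert it to a factor before fitting:

```r
dtf <- read.table(text =
"Litter Treatment Tube.L
1 Control 1641
2 Control 1290
3 Control 2411
4 Control 2527
5 Control 1930
6 Control 2158
1 GH 1829
2 GH 1811
3 GH 1897
4 GH 1506
5 GH 2060
6 GH 1207
1 FSH 3395
2 FSH 3113
3 FSH 2219
4 FSH 2667
5 FSH 2210
6 FSH 2625
1 GH+FSH 1537
2 GH+FSH 1991
3 GH+FSH 3639
4 GH+FSH 2246
5 GH+FSH 1840
6 GH+FSH 2217", header = TRUE)

# Treat the litter number as a categorical block, not a numeric covariate
dtf$Litter <- factor(dtf$Litter)
dtf$Treatment <- factor(dtf$Treatment)

BoarsMod1 <- aov(Tube.L ~ Litter + Treatment, data = dtf)
anova_table <- anova(BoarsMod1)
print(anova_table)
```

With six litters and four treatments over 24 observations, the table should show 5 degrees of freedom for Litter, 3 for Treatment, and 15 for the residuals.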

Fill values in one vector conditioned on values in another vector

I have two data frames with different dimensions:
1:
head(x)
Year GDP_deflator
1 1825 NA
2 1826 NA
3 1827 NA
4 1828 NA
5 1829 NA
6 1829 NA
7 1830 NA
8 1830 NA
9 1830 NA
10 1831 NA
dim(x)
1733 2
2:
head(dataDef)
Year GDP_deflator
1 1825 1.788002
2 1826 1.884325
3 1827 2.016997
4 1828 1.802907
5 1829 1.781999
6 1830 1.866437
7 1831 1.960316
8 1832 2.029601
9 1833 1.880957
10 1834 1.845750
dim(dataDef)
101 2
I would like to substitute values from dataDef$GDP_deflator column into x$GDP_deflator column conditioned on Year column. In other words, I would like the answer to be:
head (x)
Year GDP_deflator
1 1825 1.788002
2 1826 1.884325
3 1827 2.016997
4 1828 1.802907
5 1829 1.781999
6 1829 1.781999
7 1830 1.866437
8 1830 1.866437
9 1830 1.866437
10 1831 1.960316
So the repeating years (i.e. 1830) get the same value, 1.866437. Any suggestions?
Best Regards
One possibility is to use match:
x$GDP_deflator <- dataDef$GDP_deflator[match(x$Year, dataDef$Year)]
You want to merge the two data.frames. It's a many-to-one merge.
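A sketch of both approaches on a few rows retyped from the question (filled_match and filled_merge are names chosen for this example). For the merge, the empty GDP_deflator column is dropped from x first so that the merged result has a single deflator column, and all.x = TRUE keeps every row of x:

```r
# First rows of x and dataDef from the question
x <- data.frame(
  Year = as.integer(c(1825, 1826, 1827, 1828, 1829, 1829, 1830, 1830, 1830, 1831)),
  GDP_deflator = NA_real_
)
dataDef <- data.frame(
  Year = 1825:1831,
  GDP_deflator = c(1.788002, 1.884325, 2.016997, 1.802907,
                   1.781999, 1.866437, 1.960316)
)

# Lookup with match: for each year in x, find its row in dataDef
filled_match <- dataDef$GDP_deflator[match(x$Year, dataDef$Year)]

# Many-to-one merge: keep only Year from x, then left-join the deflator
filled_merge <- merge(x["Year"], dataDef, by = "Year", all.x = TRUE)

print(filled_merge)
```

Note that merge() sorts the result by Year by default, so the row order matches x only when x is already sorted by Year, as it is here. The match version always preserves the original row order.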

extracting from dataframe and merge based on condition

I have a data frame like ba below.
I need to split it by region and then merge the pieces by date.
It works if I do it manually, as shown below. But when there are more than two regions, I need to extract each one with sapply and then merge the results, and I'm not sure how to do that with a loop or sapply. Please advise how I can extract by "region" and then merge dynamically, even when there are more than two regions (e.g. betasol, alpha, atpTax).
> ba
date region AveElapsedTime
1 2012-05-19 betasol 1372
2 2012-05-22 atpTax 1652
3 2012-06-02 betasol 1630
4 2012-06-02 atpTax 1552
5 2012-06-07 betasol 1408
6 2012-06-12 betasol 1471
7 2012-06-15 betasol 1384
8 2012-06-21 betasol 1390
9 2012-06-22 atpTax 1252
10 2012-06-23 betasol 1442
> dfa <- ba[ba$region == "atpTax", c("date", "AveElapsedTime")]
> dfb <- ba[ba$region == "betasol", c("date", "AveElapsedTime")]
> merge(dfa, dfb, by="date", all=TRUE)
date AveElapsedTime.x AveElapsedTime.y
1 2012-05-19 NA 1372
2 2012-05-22 1652 NA
3 2012-06-02 1552 1630
4 2012-06-07 NA 1408
5 2012-06-12 NA 1471
6 2012-06-15 NA 1384
7 2012-06-21 NA 1390
8 2012-06-22 1252 NA
9 2012-06-23 NA 1442
extractfun <- function(z, ab) {
df[z] <- ab[ab$region == z, c("date","region")]
}
sapply(unique(ba$region), FUN=extractfun, ab=avg_data)
You can do this with cast from the reshape package:
require(reshape)
cast(ba, date ~ region)
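If you would rather stay in base R and keep the extract-then-merge idea from the question, one possible sketch is to split() the data by region and fold the pieces together with Reduce() and merge(). It works for any number of regions; pieces and wide are names chosen for this example.

```r
# Data retyped from the question
ba <- data.frame(
  date = c("2012-05-19", "2012-05-22", "2012-06-02", "2012-06-02", "2012-06-07",
           "2012-06-12", "2012-06-15", "2012-06-21", "2012-06-22", "2012-06-23"),
  region = c("betasol", "atpTax", "betasol", "atpTax", "betasol",
             "betasol", "betasol", "betasol", "atpTax", "betasol"),
  AveElapsedTime = c(1372, 1652, 1630, 1552, 1408, 1471, 1384, 1390, 1252, 1442),
  stringsAsFactors = FALSE
)

# One small data frame per region, with the value column named after the region
pieces <- lapply(split(ba, ba$region), function(d) {
  out <- d[, c("date", "AveElapsedTime")]
  names(out)[2] <- d$region[1]
  out
})

# Merge all pieces on date, keeping dates that are missing in some regions
wide <- Reduce(function(a, b) merge(a, b, by = "date", all = TRUE), pieces)
print(wide)
```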
