I'm working with this dataset: https://archive.ics.uci.edu/ml/datasets/Wholesale+customers
I want to plot a histogram with all the variables that have the amount spent (everything but region and channel). I'd like them to be plotted by channel, which has 2 levels. I got the following code from an example on a website, but put in my variables:
category=c(rep("Fresh",2),rep("Grocery",2),rep("Milk",2),rep("Frozen",2),
rep("Detergents_Paper",2),rep("Delicassen",2))
condition=rep(c("Food Service", "Retail"))
value=abs(rnorm(12 , 0 , 15))
data=data.frame(category,condition,value)
ggplot(data, aes(fill=condition, y=value, x=category)) +
geom_bar(position="dodge", stat="identity")
This produces what I want, but it doesn't use my data. Here's the graph I get, but the values don't mean anything since they're basically random.
How do I get my data to plot like this?
By loading the tidyr package, the data can be reshaped to support the expected output.
library(ggplot2)
library(tidyr)
After reading in the data with the correct classes for the columns (factors for Channel and Region while the remaining six fields are numeric), check the data for correctness.
df <- read.csv(file = url('https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv'), colClasses = c('factor','factor','numeric','numeric','numeric','numeric','numeric','numeric'))
str(df)
'data.frame': 440 obs. of 8 variables:
$ Channel : Factor w/ 2 levels "1","2": 2 2 2 1 2 2 2 2 1 2 ...
$ Region : Factor w/ 3 levels "1","2","3": 3 3 3 3 3 3 3 3 3 3 ...
$ Fresh : num 12669 7057 6353 13265 22615 ...
$ Milk : num 9656 9810 8808 1196 5410 ...
$ Grocery : num 7561 9568 7684 4221 7198 ...
$ Frozen : num 214 1762 2405 6404 3915 ...
$ Detergents_Paper: num 2674 3293 3516 507 1777 ...
$ Delicassen : num 1338 1776 7844 1788 5185 ...
head(df)
Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 2 3 12669 9656 7561 214 2674 1338
2 2 3 7057 9810 9568 1762 3293 1776
3 2 3 6353 8808 7684 2405 3516 7844
4 1 3 13265 1196 4221 6404 507 1788
5 2 3 22615 5410 7198 3915 1777 5185
6 2 3 9413 8259 5126 666 1795 1451
The data appear to have imported correctly.
Next we use a combination of tidyr::gather and ggplot2::ggplot to produce the desired bar plot (not histogram).
df %>%
tidyr::gather(Type, Amount, -c(Channel, Region)) %>%
ggplot(aes(x=Type, y=Amount, fill=Channel, group=Channel)) +
geom_col(position = position_dodge())
tidyr::gather(Type, Amount, -c(Channel, Region)) will reshape the data set from this:
Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 2 3 12669 9656 7561 214 2674 1338
2 2 3 7057 9810 9568 1762 3293 1776
3 2 3 6353 8808 7684 2405 3516 7844
4 1 3 13265 1196 4221 6404 507 1788
5 2 3 22615 5410 7198 3915 1777 5185
6 2 3 9413 8259 5126 666 1795 1451
To a "longer" data set which now has the product type as rows:
Channel Region Type Amount
1 2 3 Fresh 12669
2 2 3 Fresh 7057
3 2 3 Fresh 6353
4 1 3 Fresh 13265
5 2 3 Fresh 22615
6 2 3 Fresh 9413
This prepares the data to be plotted using ggplot2::ggplot where the x input can be mapped to the new Type variable and the y variable to Amount.
Be sure to use group=Channel (aesthetic names are case-sensitive) and position=position_dodge() so that ggplot knows you want the bars side by side.
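With many customers per Channel and Type, the dodged bars within a group may be drawn on top of one another, so summarising first can make the bar heights easier to interpret. A minimal sketch, assuming totals per channel are what is wanted; the toy data frame and the names long_totals and Total are hypothetical, and pivot_longer is used as the newer equivalent of gather.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical stand-in: first six rows of the data, two spend columns only
toy <- data.frame(
  Channel = factor(c(2, 2, 2, 1, 2, 2)),
  Region  = factor(c(3, 3, 3, 3, 3, 3)),
  Fresh   = c(12669, 7057, 6353, 13265, 22615, 9413),
  Milk    = c(9656, 9810, 8808, 1196, 5410, 8259)
)

# Reshape to long form, then collapse to one row per Channel and Type
long_totals <- toy %>%
  pivot_longer(-c(Channel, Region), names_to = "Type", values_to = "Amount") %>%
  group_by(Channel, Type) %>%
  summarise(Total = sum(Amount), .groups = "drop")

# One bar per Channel within each Type
p <- ggplot(long_totals, aes(x = Type, y = Total, fill = Channel)) +
  geom_col(position = position_dodge())
```

Swapping sum() for mean() would compare average spend per customer instead of totals.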
I am trying to make a side-by-side box-and-whisker plot of durasec, broken out by placement and media.
df <- read.csv("http://citadel.sjfc.edu/faculty/ageraci/data/dataset-nci-2012-subset1.csv")
str(df)
'data.frame': 11475 obs. of 7 variables:
$ time : int 1 1 1 1 1 1 1 1 1 1 ...
$ durasec : int 168 149 179 155 90 133 17 14 14 18 ...
$ placement: int 401 402 403 403 403 403 403 403 403 403 ...
$ format : int 8 9 8 8 9 8 12 12 12 12 ...
$ focus : int 1 1 1 1 1 1 3 3 1 1 ...
$ topic : int 5 5 5 2 2 2 26 26 11 24 ...
$ media : int 4 4 4 4 4 4 4 4 4 4 ...
favstats(~durasec | placement + media, data =df)
401.4 14 120.25 164.5 197.00 754 171.39686 90.85643 446 0
402.4 9 92.00 143.0 182.00 619 157.20935 107.92586 449 0
403.4 3 23.00 54.0 141.00 807 90.18696 90.50816 4172 0
401.5 12 94.25 165.5 254.75 1136 215.05121 180.52376 742 0
402.5 7 98.50 181.0 306.00 716 211.23293 145.88735 747 0
403.5 3 34.00 96.0 173.50 1098 124.85180 112.56758 4919 0
bwplot(placement + media ~ durasec, data = df)
When I run this last piece of code it gives me a box and whisker plot but on the Y axis instead of the combinations of 401.4 through 403.5 like in the favstats, it just gives me 1 through 5 and the data doesn't appear to exactly match the favstats.
How can I get it to display the six combinations and their data like in the favstats?
You can try the following code, which draws one panel per media value with placement on the x-axis:
library(lattice)
bwplot(durasec ~ as.factor(placement) | as.factor(media), data = df)
Using ggplot:
library(ggplot2)
library(dplyr)
df <- read.csv("http://citadel.sjfc.edu/faculty/ageraci/data/dataset-nci-2012-subset1.csv")
df_fac <- df %>%
mutate_at(vars(placement:media), ~as.factor(.))
ggplot(data = df_fac) +
geom_boxplot(aes(x = durasec, y = placement, fill = media))
Created on 2020-04-06 by the reprex package (v0.3.0)
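To get the six placement.media combinations on one axis, as in the favstats output, one option is to build a single factor with interaction(). In lattice, a formula such as placement + media ~ durasec appears to be read as two separate response variables rather than as their combinations, which would explain the unexpected axis. A minimal sketch on hypothetical toy data (the values and the column name combo are illustrative only):

```r
library(lattice)

# Hypothetical toy data with the same column names as the question
toy <- data.frame(
  durasec   = c(168, 149, 179, 155, 90, 133, 17, 14, 14, 18, 120, 95),
  placement = rep(c(401, 402, 403), times = 4),
  media     = rep(c(4, 5), each = 6)
)

# One factor whose levels are the placement.media combinations
toy$combo <- interaction(toy$placement, toy$media)
levels(toy$combo)

# Horizontal box-and-whisker plot with one row per combination
bw <- bwplot(combo ~ durasec, data = toy)
```

Printing bw draws the plot; the same interaction() call can be used directly inside the formula on the full data.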
I have the data frame below, in which all the columns are factors that I want to use as numeric columns. I tried different approaches, but the values change when I try as.numeric(as.character(.))
The data comes in a semicolon separated format. A subset of data to reproduce the problem is:
rawData <- "Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
21/12/2006;11:23:00;?;?;?;?;?;?;
21/12/2006;11:24:00;?;?;?;?;?;?;
16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000
16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000
16/12/2006;17:26:00;5.374;0.498;233.290;23.000;0.000;2.000;17.000
16/12/2006;17:27:00;5.388;0.502;233.740;23.000;0.000;1.000;17.000
16/12/2006;17:28:00;3.666;0.528;235.680;15.800;0.000;1.000;17.000
16/12/2006;17:29:00;3.520;0.522;235.020;15.000;0.000;2.000;17.000
16/12/2006;17:30:00;3.702;0.520;235.090;15.800;0.000;1.000;17.000
16/12/2006;17:31:00;3.700;0.520;235.220;15.800;0.000;1.000;17.000
16/12/2006;17:32:00;3.668;0.510;233.990;15.800;0.000;1.000;17.000
"
hpc <- read.csv(text=rawData,sep=";")
str(hpc)
When run against the full data file after dropping the date and time variables, the output from str() looks like:
> str(hpc)
'data.frame': 2075259 obs. of 7 variables:
$ Global_active_power : Factor w/ 4187 levels "?","0.076","0.078",..: 2082 2654 2661 2668 1807 1734 1825 1824 1808 1805 ...
$ Global_reactive_power: Factor w/ 533 levels "?","0.000","0.046",..: 189 198 229 231 244 241 240 240 235 235 ...
$ Voltage : Factor w/ 2838 levels "?","223.200",..: 992 871 837 882 1076 1010 1017 1030 907 894 ...
$ Global_intensity : Factor w/ 222 levels "?","0.200","0.400",..: 53 81 81 81 40 36 40 40 40 40 ...
$ Sub_metering_1 : Factor w/ 89 levels "?","0.000","1.000",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Sub_metering_2 : Factor w/ 82 levels "?","0.000","1.000",..: 3 3 14 3 3 14 3 3 3 14 ...
$ Sub_metering_3 : num 17 16 17 17 17 17 17 17 17 16 ...
Can anyone help me in getting the expected output?
expected output:
> str(hpc)
'data.frame': 2075259 obs. of 7 variables:
$ Global_active_power : num "?","0.076","0.078",..: 2082 2654 2661 2668 1807 1734 1825 1824 1808 1805 ...
$ Global_reactive_power: num "?","0.000","0.046",..: 189 198 229 231 244 241 240 240 235 235 ...
$ Voltage : num "?","223.200",..: 992 871 837 882 1076 1010 1017 1030 907 894 ...
$ Global_intensity : num "?","0.200","0.400",..: 53 81 81 81 40 36 40 40 40 40 ...
$ Sub_metering_1 : num "?","0.000","1.000",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Sub_metering_2 : num "?","0.000","1.000",..: 3 3 14 3 3 14 3 3 3 14 ...
$ Sub_metering_3 : num 17 16 17 17 17 17 17 17 17 16 ...
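I'm not able to test this on your data frame, but hopefully it will work. I notice that in the output of str(hpc) not all columns are factors. mutate_if applies a function only to the columns that satisfy a predicate function, here is.factor. The "?" entries cannot be parsed as numbers, so they become NA with a coercion warning. The function is written as a formula because funs() is deprecated in recent dplyr versions.
library(dplyr)
hpc2 <- hpc %>%
mutate_if(is.factor, ~ as.numeric(as.character(.)))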
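An alternative worth considering is to stop the columns from becoming factors in the first place by declaring "?" as a missing-value marker when reading the file. A minimal sketch in base R, using a hypothetical two-column sample rather than the full data set:

```r
# Hypothetical two-column sample in the same semicolon-separated style
raw <- "a;b
?;1.000
4.216;2.000
5.360;1.000"

# Read with factors to reproduce the problem
hpc_f <- read.csv(text = raw, sep = ";", stringsAsFactors = TRUE)
codes  <- as.numeric(hpc_f$a)  # internal level codes, not the data values
values <- suppressWarnings(as.numeric(as.character(hpc_f$a)))  # values; "?" becomes NA

# Treat "?" as NA at read time, so the column is numeric from the start
hpc_clean <- read.csv(text = raw, sep = ";", na.strings = "?")
str(hpc_clean)
```

The same na.strings argument should work on the full file, assuming "?" is the only non-numeric marker in those columns.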
I'm unsure what I'm doing wrong. This is the data that I'm using:
dtf <- read.table(text=
"Litter Treatment Tube.L
1 Control 1641
2 Control 1290
3 Control 2411
4 Control 2527
5 Control 1930
6 Control 2158
1 GH 1829
2 GH 1811
3 GH 1897
4 GH 1506
5 GH 2060
6 GH 1207
1 FSH 3395
2 FSH 3113
3 FSH 2219
4 FSH 2667
5 FSH 2210
6 FSH 2625
1 GH+FSH 1537
2 GH+FSH 1991
3 GH+FSH 3639
4 GH+FSH 2246
5 GH+FSH 1840
6 GH+FSH 2217", header=TRUE)
What I did was:
BoarsMod1 <- aov(Tube.L ~ Litter + Treatment, data=dtf)
anova(BoarsMod1)
I'm getting an incorrect number of degrees of freedom for litter. It should be 5 (as there are 6 litter blocks) but it is 1. Am I doing something wrong?
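A likely cause, offered as a sketch rather than a confirmed diagnosis: read.table reads Litter as an integer, so aov fits it as a single numeric covariate (1 degree of freedom) instead of a six-level blocking factor. Converting it to a factor should give the expected 5 degrees of freedom. The data are the same as above, rebuilt compactly; the names BoarsMod2 and tab are introduced here for illustration.

```r
# Same data as in the question, rebuilt compactly
dtf <- data.frame(
  Litter    = rep(1:6, times = 4),
  Treatment = rep(c("Control", "GH", "FSH", "GH+FSH"), each = 6),
  Tube.L    = c(1641, 1290, 2411, 2527, 1930, 2158,
                1829, 1811, 1897, 1506, 2060, 1207,
                3395, 3113, 2219, 2667, 2210, 2625,
                1537, 1991, 3639, 2246, 1840, 2217)
)

# Treat Litter as a categorical block, not as a number
dtf$Litter <- factor(dtf$Litter)

BoarsMod2 <- aov(Tube.L ~ Litter + Treatment, data = dtf)
tab <- anova(BoarsMod2)
tab
```

Writing factor(Litter) directly in the model formula would have the same effect without modifying the data frame.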
I want to plot tick data on a minute-basis. My dataframe looks like the following:
> head(df)
No Date Time Close Volume Weekday
1 3361 03.12.2012 08:00:00.000 7.435 27000000 Montag
2 3362 03.12.2012 08:01:00.000 7.428 47000000 Montag
3 3363 03.12.2012 08:02:00.000 7.428 41000000 Montag
4 3364 03.12.2012 08:03:00.000 7.429 39000000 Montag
5 3365 03.12.2012 08:04:00.000 7.426 44000000 Montag
6 3366 03.12.2012 08:05:00.000 7.423 49000000 Montag
>
Now I want to plot the first 241 entries, with the correct x-axis description. Currently I use a simple 1:241 vector:
plot(c(1:241),df[1:241,4],type="l")
And I get:
When I try
plot(df[1:241,3],df[1:241,4],type="l")
this looks like:
What's wrong here? Thanks!
EDIT:
> str(df)
'data.frame': 81613 obs. of 6 variables:
$ No : int 3361 3362 3363 3364 3365 3366 3367 3368 3369 3370 ...
$ Date : Factor w/ 270 levels "01.01.2013","01.02.2013",..: 25 25 25 25 25 25 25 25 25 25 ...
$ Time : Factor w/ 600 levels "08:00:00.000",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Close : num 7.43 7.43 7.43 7.43 7.43 ...
$ Volume : int 27000000 47000000 41000000 39000000 44000000 49000000 51000000 48000000 49000000 45000000 ...
$ Weekday: Factor w/ 5 levels "Dienstag","Donnerstag",..: 5 5 5 5 5 5 5 5 5 5 ...
>
EDIT2:
Data here.
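Time is stored as a factor, so plot() treats it as a categorical variable rather than a continuous time axis. You need to convert your variables Date and Time into a date-time with something like strptime. The dates appear to be in day.month.year order (the weekday names are German), hence %d.%m.%Y:
df$DateTime = as.POSIXct(strptime(paste(as.character(df$Date), as.character(df$Time)), "%d.%m.%Y %H:%M:%S"))
plot(df$DateTime[1:241], df$Close[1:241], type="l")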
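Whether the first field is the day or the month matters once the day exceeds 12. A quick check, assuming the data use the German day.month.year order; the sample string below is hypothetical, not taken from the data:

```r
# Hypothetical timestamp whose first field cannot be a month
stamp <- "13.12.2012 08:00:00.000"

# Day-first format parses; the trailing ".000" is ignored by %S
day_first <- as.POSIXct(stamp, format = "%d.%m.%Y %H:%M:%S", tz = "UTC")

# Month-first format cannot read 13 as a month and yields NA
month_first <- as.POSIXct(stamp, format = "%m.%d.%Y %H:%M:%S", tz = "UTC")

is.na(day_first)
is.na(month_first)
```

Running sum(is.na(df$DateTime)) after the conversion is a simple way to confirm that every row parsed.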
I have a data frame like the one below. I need to plot it by region, with date on the x-axis and AveElapsedTime on the y-axis.
>avg_data
date region AveElapsedTime
1 2012-05-19 betasol 1372
2 2012-05-22 atpTax 1652
3 2012-06-02 betasol 1630
4 2012-06-02 atpTax 1552
5 2012-06-02 Tax 1552
6 2012-06-07 betasol 1408
7 2012-06-12 betasol 1471
8 2012-06-15 betasol 1384
9 2012-06-21 betasol 1390
10 2012-06-22 atpTax 1252
11 2012-06-23 betasol 1442
If I rearrange the data above by region, it looks like the table below. Nothing should be plotted where there is no value (NA) for a particular date.
date atpTax betasol Tax
1 2012-05-19 NA 1372 NA
2 2012-05-22 1652 NA NA
3 2012-06-02 1552 1630 1552
4 2012-06-07 NA 1408 NA
5 2012-06-12 NA 1471 NA
6 2012-06-15 NA 1384 NA
7 2012-06-21 NA 1390 NA
8 2012-06-22 1252 NA NA
9 2012-06-23 NA 1442 NA
I tried the ggplot command below, but I get a geom_path error.
ggplot(avg_data, aes(date, AveElapsedTime)) + geom_line(aes(col=region)) + opts(axis.text.x = theme_text(angle=90, hjust=1))
geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?
> str(avg_data)
'data.frame': 11 obs. of 3 variables:
$ date : Factor w/ 9 levels "2012-05-19","2012-05-22",..: 1 2 3 3 3 4 5 6 7 8 ...
$ region : Factor w/ 3 levels "atpTax","betasol",..: 2 1 2 1 3 2 2 2 2 1 ...
$ AveElapsedTime: int 1372 1652 1630 1552 1552 1408 1471 1384 1390 1252 ...
Please advise on this.
As the error message indicates, you need to specify the group. Like this:
ggplot(avg_data, aes(date, AveElapsedTime, colour=region, group=region)) +
geom_point() + geom_line()
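Because date is a factor, the x-axis is categorical and the dates are evenly spaced regardless of the gaps between them. A possible refinement, sketched under the assumption that a true time axis is wanted: convert date to Date first. The data frame is rebuilt from the question, and theme()/element_text() stand in for the older opts()/theme_text() calls.

```r
library(ggplot2)

# Data from the question, with date and region as factors
avg_data <- data.frame(
  date = factor(c("2012-05-19", "2012-05-22", "2012-06-02", "2012-06-02",
                  "2012-06-02", "2012-06-07", "2012-06-12", "2012-06-15",
                  "2012-06-21", "2012-06-22", "2012-06-23")),
  region = factor(c("betasol", "atpTax", "betasol", "atpTax", "Tax", "betasol",
                    "betasol", "betasol", "betasol", "atpTax", "betasol")),
  AveElapsedTime = c(1372, 1652, 1630, 1552, 1552, 1408,
                     1471, 1384, 1390, 1252, 1442)
)

# Convert the factor labels to real dates for a continuous x-axis
avg_data$date <- as.Date(as.character(avg_data$date))

p <- ggplot(avg_data, aes(date, AveElapsedTime, colour = region, group = region)) +
  geom_point() +
  geom_line() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```

The Tax region has a single observation, so it shows up only as a point; no line can be drawn for it.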