side by side boxplot in R - r

I am trying to make a side-by-side box-and-whisker plot of durasec broken out by placement and media.
df <- read.csv("http://citadel.sjfc.edu/faculty/ageraci/data/dataset-nci-2012-subset1.csv")
str(df)
'data.frame': 11475 obs. of 7 variables:
$ time : int 1 1 1 1 1 1 1 1 1 1 ...
$ durasec : int 168 149 179 155 90 133 17 14 14 18 ...
$ placement: int 401 402 403 403 403 403 403 403 403 403 ...
$ format : int 8 9 8 8 9 8 12 12 12 12 ...
$ focus : int 1 1 1 1 1 1 3 3 1 1 ...
$ topic : int 5 5 5 2 2 2 26 26 11 24 ...
$ media : int 4 4 4 4 4 4 4 4 4 4 ...
favstats(~durasec | placement + media, data =df)
      min     Q1 median     Q3  max      mean        sd    n missing
401.4  14 120.25  164.5 197.00  754 171.39686  90.85643  446       0
402.4   9  92.00  143.0 182.00  619 157.20935 107.92586  449       0
403.4   3  23.00   54.0 141.00  807  90.18696  90.50816 4172       0
401.5  12  94.25  165.5 254.75 1136 215.05121 180.52376  742       0
402.5   7  98.50  181.0 306.00  716 211.23293 145.88735  747       0
403.5   3  34.00   96.0 173.50 1098 124.85180 112.56758 4919       0
bwplot(placement + media ~ durasec, data = df)
When I run this last piece of code it gives me a box-and-whisker plot, but the y axis shows 1 through 5 instead of the six combinations 401.4 through 403.5 from favstats, and the plotted data do not appear to match the favstats output.
How can I get it to display the six combinations and their data, as in the favstats output?

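You can try the following code, which puts placement on the x axis and draws one panel per media value:
library(lattice)
# data = df already supplies the columns, so no df$ is needed inside the formula
bwplot(durasec ~ as.factor(placement) | as.factor(media), data = df)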
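
If the goal is a single axis with the six combined labels (401.4 through 403.5), one option is to build a combined factor with interaction(). The following is only a minimal sketch: it uses a made-up stand-in data frame (toy, with invented durasec values) rather than the CSV from the question, and the column name combo is hypothetical.

```r
library(lattice)

# Made-up stand-in for the question's data (values are invented)
set.seed(1)
toy <- expand.grid(placement = c(401, 402, 403), media = c(4, 5), rep = 1:20)
toy$durasec <- rexp(nrow(toy), rate = 1 / 150)

# interaction() builds one factor whose levels combine both variables,
# e.g. "401.4", "402.4", ...
toy$combo <- interaction(toy$placement, toy$media)
levels(toy$combo)

# One box per placement/media combination, durasec on the x axis
p <- bwplot(combo ~ durasec, data = toy)
print(p)
```

With the real data, the same idea would be bwplot(interaction(placement, media) ~ durasec, data = df).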

Using ggplot:
library(ggplot2)
library(dplyr)
df <- read.csv("http://citadel.sjfc.edu/faculty/ageraci/data/dataset-nci-2012-subset1.csv")
df_fac <- df %>%
mutate_at(vars(placement:media), ~as.factor(.))
ggplot(data = df_fac) +
geom_boxplot(aes(x = durasec, y = placement, fill = media))
Created on 2020-04-06 by the reprex package (v0.3.0)

Related

Calculating regression line from given quantile of values depending on category

I have a fairly large data frame (nearly 100,000 observations of about 40 variables) from which I want ggplot to draw scatterplots with lm or loess lines. However, the lines should be calculated only from a certain quantile of the variable's values within each observation date. I would also like to do the filtering or subsetting directly in ggplot, without creating a new data object or sub-data-frame in advance.
As my real data frame would be too large to post, I created a fictitious example with a data frame of 144 observations named df_Bandvals (code at the end of the post). Below are its structure, the first 25 rows, and a description of a scatterplot with a loess line based on ALL observations.
> str(df_Bandvals)
'data.frame': 144 obs. of 5 variables:
$ obsdate : int 190101 190101 190101 190101 190101 190101 190101 190101 190101 190101 ...
$ transsect : chr "A" "A" "A" "A" ...
$ PointNr : num 1 2 3 4 5 6 1 2 3 4 ...
$ depth : num 31 31 31 31 31 31 31 31 31 31 ...
$ Band12plusmin: num 169 241 229 159 221 196 188 216 233 149 ...
> df_Bandvals
obsdate transsect PointNr depth Band12plusmin
1 190101 A 1 31 169
2 190101 A 2 31 241
3 190101 A 3 31 229
4 190101 A 4 31 159
5 190101 A 5 31 221
6 190101 A 6 31 196
7 190101 B 1 31 188
8 190101 B 2 31 216
9 190101 B 3 31 233
10 190101 B 4 31 149
11 190101 B 5 31 169
12 190101 B 6 31 181
13 190102 A 1 3 356
14 190102 A 2 3 368
15 190102 A 3 3 293
16 190102 A 4 3 261
17 190102 A 5 3 313
18 190102 A 6 3 374
19 190102 B 1 3 327
20 190102 B 2 3 409
21 190102 B 3 3 369
22 190102 B 4 3 334
23 190102 B 5 3 376
24 190102 B 6 3 318
25 190103 A 1 25 183
The plot shows depth vs. Band12plusmin with an according loess-line. Point colors are assigned to the respective observation date (obsdate). Each observation date includes 12 observations.
Now, my basic question is: how do I get a loess line based only on the lower 50% of the Band12plusmin values of each observation date? In other words, referring to the plot: ggplot should use only the 6 lowest points of each color when calculating the line.
As mentioned before, I would like to do the filtering or subsetting directly in ggplot, without creating a new data object or sub-data-frame in advance.
I experimented with subsetting, but the problem is that I cannot specify one universal Band12plusmin threshold, because the 50% threshold differs for each obsdate group. I am quite new to R and ggplot, and so far I have failed to find a way to filter on a threshold that is derived separately for each group.
Can anybody help here?
Here is the code for the data frame and the plot:
obsdate<-rep(c(190101:190112),each=12)
transsect<-rep(rep(c("A","B"), each=6), 12)
PointNr<-rep(c(1,2,3,4,5,6), times=24)
depth<-rep(c(31,3,25,-9,13,18,7,-10,3,-4,11,21),each=12)
Band12<-rep(c(199,349,225,844,257,231,301,875,378,521,210,246), each=12)
set.seed(13423)
plusminRandom<-round(rnorm(144, mean=0, sd=33))
plusminRandom
Band12plusmin<-Band12+plusminRandom
df_Bandvals<-data.frame(obsdate, transsect, PointNr, depth, Band12plusmin)
str(df_Bandvals)
head(df_Bandvals, 20)
library (ggplot2)
ggplot(data=df_Bandvals, aes(x=depth, y=Band12plusmin))+
scale_x_continuous(limits = c(-15, 35))+
scale_y_continuous(limits = c(120, 960))+
geom_point(aes(color=factor(obsdate)), size=1.5)+
geom_smooth(method="loess")
You should be able to use the data argument within geom_smooth(). The pipe, group_by() and filter() come from dplyr, so load it first:
library(dplyr)
ggplot(data = df_Bandvals, aes(x = depth, y = Band12plusmin)) +
scale_x_continuous(limits = c(-15, 35)) +
scale_y_continuous(limits = c(120, 960)) +
geom_point(aes(color = factor(obsdate)), size = 1.5) +
geom_smooth(
data = df_Bandvals %>%
group_by(obsdate) %>%
filter(Band12plusmin < median(Band12plusmin)),
method = "loess"
)
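
To check which rows such a per-group filter keeps, the group medians can also be computed in base R with ave(). This is only a sketch under the question's own data-generating code (reduced to the two columns needed); the names grp_median and lower are hypothetical.

```r
# Rebuild the two relevant columns from the question's example
obsdate <- rep(190101:190112, each = 12)
Band12 <- rep(c(199, 349, 225, 844, 257, 231, 301, 875, 378, 521, 210, 246),
              each = 12)
set.seed(13423)
Band12plusmin <- Band12 + round(rnorm(144, mean = 0, sd = 33))
df_Bandvals <- data.frame(obsdate, Band12plusmin)

# ave() returns, for every row, the median of that row's obsdate group
grp_median <- ave(df_Bandvals$Band12plusmin, df_Bandvals$obsdate, FUN = median)

# Keep only rows strictly below their own group's median
lower <- df_Bandvals[df_Bandvals$Band12plusmin < grp_median, ]

# Number of retained points per observation date
table(lower$obsdate)
```

Because the comparison is strict, a group keeps fewer than 6 of its 12 points when values are tied at the median.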

Unable to designate CSV column heads "as.factor" for R -Error

I am having an issue with assigning factors to my data CSV. Here is a summary of the data frame:
'data.frame': 303 obs. of 12 variables:
$ PLOT : int 19 177 54 114 41 48 142 134 160 267 ...
$ RANGE : int 2 12 4 8 3 4 10 9 11 18 ...
$ ROW : int 4 12 9 9 11 3 7 14 10 12 ...
$ REP : int 1 1 1 1 1 1 1 1 1 1 ...
$ ENTRY : Factor w/ 184 levels "","17_YMG_0293",..: 40 40 77 82 87 88 102 103 103 6 ...
$ PLOT_ID : Factor w/ 301 levels "","18_HZG_OvOv_001",..: 20 178 55 115 42 49 143 135 161 268 ...
$ Shatter : num 9 9 9 9 9 9 9 9 9 8 ...
$ Chaff.Color : Factor w/ 4 levels "","*Blank ones are segregating in color",..: 3 4 3 4 4 4 3 4 4 3 ...
$ Heading_d.from.Jan.1: int 138 139 137 133 135 135 133 137 135 136 ...
$ Height_cm : int 74 73 77 76 74 79 78 73 76 70 ...
$ Plot.weight..kg. : num 0.26 0.18 0.19 0.14 0.33 0.19 0.13 0.11 0.24 0.18 ...
But when I try to convert a column to a factor, I get this error:
HAYSData$Rep<-as.factor(HAYSData$Rep)
Error in `$<-.data.frame`(`*tmp*`, Rep, value = integer(0)) :
replacement has 0 rows, data has 303
I get the same type of error for Entry, Range, and Rows. I am not sure why; when I look at length(Entry), for example, I get 300. I even tried converting to numeric instead of factor, but it does not help.
There are no NAs in my data, and each category is in its own column.
I don't know if something is wrong with my CSV. I have used this same script with another CSV, and this part of the script ran without issues on that data.
Can someone please help me?
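Column names in R are case-sensitive, and str() shows yours in upper case (REP, not Rep). HAYSData$Rep therefore returns NULL, which is why the replacement has 0 rows. Try: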
HAYSData$REP <- as.factor(HAYSData$REP)
HAYSData$ENTRY <- as.factor(HAYSData$ENTRY)
HAYSData$RANGE <- as.factor(HAYSData$RANGE)
HAYSData$ROW <- as.factor(HAYSData$ROW)
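
The behavior can be reproduced on a tiny made-up data frame. This is a sketch only: the three rows and the cols vector are invented for illustration, and only the column names follow the question.

```r
# Made-up miniature of the data frame, with upper-case column names
HAYSData <- data.frame(REP = c(1, 1, 2),
                       ENTRY = c("a", "b", "a"),
                       ROW = c(4, 12, 9))

# A name with the wrong case matches no column, so $ returns NULL
is.null(HAYSData$Rep)   # TRUE

# names() shows the exact spelling to use
names(HAYSData)

# Convert several columns to factors in one step
cols <- c("REP", "ENTRY", "ROW")
HAYSData[cols] <- lapply(HAYSData[cols], as.factor)
str(HAYSData)
```

Checking names(HAYSData) before assigning is a quick way to catch this kind of mismatch.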

ggplot in R- graph for data analysis [closed]

I have a large CSV file which I imported into R for some data analysis. Basically it is a file of flight delays covering a few years, and I am trying to create a graph showing the average delay per day of the week. I tried a histogram, but the resulting plot is not usable. Any ideas? Would another type of graph work better? Also, is there an easy way to compare on-time flights to delayed flights per day of the week?
file name - airline
str(airline)
'data.frame': 7009728 obs. of 29 variables:
$ Year : int 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
$ Month : int 1 1 1 1 1 1 1 1 1 1 ...
$ DayofMonth : int 3 3 3 3 3 3 3 3 3 3 ...
$ DayOfWeek : int 4 4 4 4 4 4 4 4 4 4 ...
$ DepTime : int 2003 754 628 926 1829 1940 1937 1039 617 1620 ...
$ CRSDepTime : int 1955 735 620 930 1755 1915 1830 1040 615 1620 ...
$ ArrTime : int 2211 1002 804 1054 1959 2121 2037 1132 652 1639 ...
$ CRSArrTime : int 2225 1000 750 1100 1925 2110 1940 1150 650 1655 ...
$ UniqueCarrier : Factor w/ 20 levels "9E","AA","AQ",..: 18 18 18 18 18 18 18 18 18 18 ...
$ FlightNum : int 335 3231 448 1746 3920 378 509 535 11 810 ...
$ TailNum : Factor w/ 5374 levels "","80009E","80019E",..: 3769 4129 1961 3059 2142 3852 4062 1961 3616 3324 ...
$ ActualElapsedTime: int 128 128 96 88 90 101 240 233 95 79 ...
$ CRSElapsedTime : int 150 145 90 90 90 115 250 250 95 95 ...
$ AirTime : int 116 113 76 78 77 87 230 219 70 70 ...
$ ArrDelay : int -14 2 14 -6 34 11 57 -18 2 -16 ...
$ DepDelay : int 8 19 8 -4 34 25 67 -1 2 0 ...
$ Origin : Factor w/ 303 levels "ABE","ABI","ABQ",..: 136 136 141 141 141 141 141 141 141 141 ...
$ Dest : Factor w/ 304 levels "ABE","ABI","ABQ",..: 287 287 49 49 49 151 157 157 177 177 ...
$ Distance : int 810 810 515 515 515 688 1591 1591 451 451 ...
$ TaxiIn : int 4 5 3 3 3 4 3 7 6 3 ...
$ TaxiOut : int 8 10 17 7 10 10 7 7 19 6 ...
$ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ...
$ CancellationCode : Factor w/ 5 levels "","A","B","C",..: 1 1 1 1 1 1 1 1 1 1
$ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...
$ CarrierDelay : int NA NA NA NA 2 NA 10 NA NA NA ...
$ WeatherDelay : int NA NA NA NA 0 NA 0 NA NA NA ...
$ NASDelay : int NA NA NA NA 0 NA 0 NA NA NA ...
$ SecurityDelay : int NA NA NA NA 0 NA 0 NA NA NA ...
$ LateAircraftDelay: int NA NA NA NA 32 NA 47 NA NA NA ...
my graph:
library(ggplot2)
ggplot(airline,aes(x = DayOfWeek, fill = factor(DepDelay))) +
geom_histogram(binwidth = 1) +
xlab ("Day of week") +
ylab ("Dep Delay") +
labs (fill = "Airline")
To a great extent it depends on what you want to show. I made a small example using the flights data available in the nycflights13 package. Using the code below, you can experiment with charts that meet your analytical requirements.
Code
# Libs and data -----------------------------------------------------------
library(nycflights13)
library(ggplot2)
library(ggthemes)
library(dplyr)
# Work -------------------------------------------------------------------
flights %>%
  # Create week day summary
  # (paste() coerces the integer year/month/day columns to character)
  mutate(date = as.Date(paste(year, month, day, sep = "-"))) %>%
  mutate(weekday = weekdays(date, abbreviate = FALSE)) %>%
  group_by(weekday, carrier) %>%
  # Drop rows with any missing value before averaging
  na.omit() %>%
  summarise(mean_dl = round(mean(dep_delay), 2)) %>%
  ggplot(aes(x = as.factor(weekday), y = mean_dl)) +
  geom_bar(stat = "identity") +
  facet_wrap(~carrier) +
  xlab("Day") +
  ylab("Mean Dep Delay") +
  theme_wsj() +
  theme(axis.text.x = element_text(angle = 90))
Results
For example, this could be a modest start:
[Figure: bar charts of mean departure delay by weekday, one panel per carrier]
If you want to get a better answer, I would suggest that you have a look at this discussion on producing a good R example. I would further take the liberty of suggesting that you:
Post a neat data extract that would be easy for other colleagues to work with.
Elaborate more on the problem you are facing with respect to the particular chart you want to develop.
Comparing flight delays
You can make further use of the dplyr grammar to compare on-time and delayed flights.
Code
For example, you could use the code below to count the on-time and the delayed flights for each weekday:
flights %>%
  # Create week day summary
  mutate(date = as.Date(paste(year, month, day, sep = "-"))) %>%
  mutate(weekday = weekdays(date, abbreviate = FALSE)) %>%
  # Create flag for on time / dly
  # (early departures have a negative dep_delay, so they count as on-time)
  mutate(ontime = ifelse(dep_delay <= 0, "on-time", "delayed")) %>%
  group_by(weekday, ontime) %>%
  na.omit() %>%
  summarise(count_flights = n())
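
The same two summaries can be sketched in base R using the column names from the question (DayOfWeek, DepDelay). The data frame airline_toy and its six rows are made up for illustration, and treating a delay of zero or less as on time is an assumption.

```r
# Made-up miniature of the airline data (values are invented)
airline_toy <- data.frame(
  DayOfWeek = c(1, 1, 2, 2, 3, 3),
  DepDelay  = c(10, -2, 0, 30, NA, 5)
)

# Mean departure delay per day of week; the formula interface
# drops rows with a missing DepDelay by default
mean_delay <- aggregate(DepDelay ~ DayOfWeek, data = airline_toy, FUN = mean)
mean_delay

# Flag each flight, assuming a delay of zero or less counts as on time
airline_toy$status <- ifelse(airline_toy$DepDelay <= 0, "on-time", "delayed")

# Counts of on-time and delayed flights per day of week (NAs are excluded)
counts <- table(airline_toy$DayOfWeek, airline_toy$status)
counts
```

A bar chart of mean_delay, rather than a histogram of the raw rows, is then one way to show the average delay per weekday.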

Find mean from subset of one column based on ranking in the top 50 of another column

I have a data frame that has the following columns:
> str(wbr)
'data.frame': 214 obs. of 12 variables:
$ countrycode : Factor w/ 214 levels "ABW","ADO","AFG",..: 1 2 3 4 5 6 7 8 9 10 ...
$ countryname : Factor w/ 214 levels "Afghanistan",..: 10 5 1 6 2 202 8 9 4 7 ...
$ gdp_per_capita : num 19913 35628 415 2738 4091 ...
$ literacy_female : num 96.7 NA 17.6 59.1 95.7 ...
$ literacy_male : num 96.9 NA 45.4 82.5 98 ...
$ literacy_all : num 96.8 NA 31.7 70.6 96.8 ...
$ infant_mortality : num NA 2.2 70.2 101.6 13.3 ...
$ illiteracy_female: num 3.28 NA 82.39 40.85 4.31 ...
$ illiteracy_mele : num 3.06 NA 54.58 17.53 1.99 ...
$ illiteracy_male : num 3.06 NA 54.58 17.53 1.99 ...
$ illiteracy_all : num 3.18 NA 68.26 29.42 3.15 ...
I would like to find the mean of illiteracy_all from the top 50 countries with the highest GDP.
Note that the data frame contains NA values, so to compute a mean I have to write:
mean(wbr$illiteracy_all, na.rm=TRUE)
For a reproducible example, let's take:
data.df <- data.frame(x=101:120, y=rep(c(1,2,3,NA), times=5))
So how could I average the y values for e.g. the top 5 values of x?
> data.df
x y
1 101 1
2 102 2
3 103 3
4 104 NA
5 105 1
6 106 2
7 107 3
8 108 NA
9 109 1
10 110 2
11 111 3
12 112 NA
13 113 1
14 114 2
15 115 3
16 116 NA
17 117 1
18 118 2
19 119 3
20 120 NA
Any of the following would work:
mean(data.df[rank(-data.df$x)<=5,"y"], na.rm=TRUE)
mean(data.df$y[rank(-data.df$x)<=5], na.rm=TRUE)
with(data.df, mean(y[rank(-x)<=5], na.rm=TRUE))
To unpack why this works, note first that rank gives ranks in a different order to what you might expect, 1 being the rank of the smallest number not the largest:
> rank(data.df$x)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
We can get round that by negating the input:
> rank(-data.df$x)
[1] 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
So now ranks 1 to 5 are the "top 5". If we want a vector of TRUE and FALSE to indicate the position of the top 5 we can use:
> rank(-data.df$x)<=5
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[14] FALSE FALSE TRUE TRUE TRUE TRUE TRUE
(In reality you might find you have some ties in your data set. This is only going to cause issues if the 50th position is tied. You might want to have a look at the ties.method argument for rank to see how you want to handle this.)
So let's grab the values of y in those positions:
> data.df[rank(-data.df$x)<=5,"y"]
[1] NA 1 2 3 NA
Or you could use:
> data.df$y[rank(-data.df$x)<=5]
[1] NA 1 2 3 NA
So now we know what to input into mean:
> mean(data.df[rank(-data.df$x)<=5,"y"], na.rm=TRUE)
[1] 2
Or:
> mean(data.df$y[rank(-data.df$x)<=5], na.rm=TRUE)
[1] 2
Or if you don't like repeating the name of the data frame, use with:
> with(data.df, mean(y[rank(-x)<=5], na.rm=TRUE))
[1] 2
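
An alternative sketch sorts the rows with order() and takes the first few with head(). It always returns exactly the requested number of rows, so ties at the cutoff are broken by row order rather than producing extra or missing rows; the name top5 is hypothetical.

```r
# Same reproducible example as in the question
data.df <- data.frame(x = 101:120, y = rep(c(1, 2, 3, NA), times = 5))

# Sort by x from largest to smallest, then keep the first 5 rows
top5 <- head(data.df[order(data.df$x, decreasing = TRUE), ], 5)
top5

# Average y over those rows, ignoring missing values
mean(top5$y, na.rm = TRUE)
```

For the original data, the equivalent would sort wbr by gdp_per_capita and take the first 50 rows before averaging illiteracy_all.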

Carc data from rda file to numeric matrix

I am trying to run a KDA (kernel discriminant analysis) on the carc data, but when I call X<-data.frame(scale(X)); R shows this error:
"Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric"
I tried as.numeric(as.matrix(carc)) and carc<-na.omit(carc), but neither helps.
install.packages("klaR")
install.packages("FSelector")
library(ks);library(MASS);library(klaR);library(FSelector)
attach("carc.rda")
data<-load("carc.rda")
data
carc<-na.omit(carc)
head(carc)
class(carc) # check for its class
class(as.matrix(carc)) # change class, and
as.numeric(as.matrix(carc))
XX<-carc
X<-XX[,1:12];X.class<-XX[,13];
X<-data.frame(scale(X));
fit.pc<-princomp(X,scores=TRUE);
plot(fit.pc,type="line")
X.new<-fit.pc$scores[,1:5]; X.new<-data.frame(X.new);
cfs(X.class~.,cbind(X.new,X.class))
X.new<-fit.pc$scores[,c(1,4)]; X.new<-data.frame(X.new);
fit.kda1<-Hkda(x=X.new,x.group=X.class,pilot="samse",
bw="plugin",pre="sphere")
kda.fit1 <- kda(x=X.new, x.group=X.class, Hs=fit.kda1)
Can you help me resolve this problem and run this analysis?
Added: the car data set (Chambers, Cleveland, Kleiner & Tukey 1983)
> head(carc)
P M R78 R77 H R Tr W L T D G C
AMC_Concord 4099 22 3 2 2.5 27.5 11 2930 186 40 121 3.58 US
AMC_Pacer 4749 17 3 1 3.0 25.5 11 3350 173 40 258 2.53 US
AMC_Spirit 3799 22 . . 3.0 18.5 12 2640 168 35 121 3.08 US
Audi_5000 9690 17 5 2 3.0 27.0 15 2830 189 37 131 3.20 Europe
Audi_Fox 6295 23 3 3 2.5 28.0 11 2070 174 36 97 3.70 Europe
Here is a small dataset with characteristics similar to what you describe, which reproduces this error:
"Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric"
# stringsAsFactors = TRUE is needed in R >= 4.0 to get factor columns
carc <- data.frame(type1=rep(c('1','2'), each=5),
type2=rep(c('5','6'), each=5),
x = rnorm(10,1,2)/10, y = rnorm(10),
stringsAsFactors = TRUE)
This should be similar to your data.frame
str(carc)
# 'data.frame': 10 obs. of 4 variables:
# $ type1: Factor w/ 2 levels "1","2": 1 1 1 1 1 2 2 2 2 2
# $ type2: Factor w/ 2 levels "5","6": 1 1 1 1 1 2 2 2 2 2
# $ x : num -0.1177 0.3443 0.1351 0.0443 0.4702 ...
# $ y : num -0.355 0.149 -0.208 -1.202 -1.495 ...
scale(carc)
# Similar error
# Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
Using set()
library(data.table)
DT <- data.table(carc)
cols_fix <- c("type1", "type2")
for (col in cols_fix) set(DT, j=col, value = as.numeric(as.character(DT[[col]])))
str(DT)
# Classes ‘data.table’ and 'data.frame': 10 obs. of 4 variables:
# $ type1: num 1 1 1 1 1 2 2 2 2 2
# $ type2: num 5 5 5 5 5 6 6 6 6 6
# $ x : num 0.0465 0.1712 0.1582 0.1684 0.1183 ...
# $ y : num 0.155 -0.977 -0.291 -0.766 -1.02 ...
# - attr(*, ".internal.selfref")=<externalptr>
The first column(s) of your data set may be factors. Taking the data from corrgram:
library(corrgram)
carc <- auto
str(carc)
# 'data.frame': 74 obs. of 14 variables:
# $ Model : Factor w/ 74 levels "AMC Concord ",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ Origin: Factor w/ 3 levels "A","E","J": 1 1 1 2 2 2 1 1 1 1 ...
# $ Price : int 4099 4749 3799 9690 6295 9735 4816 7827 5788 4453 ...
# $ MPG : int 22 17 22 17 23 25 20 15 18 26 ...
# $ Rep78 : num 3 3 NA 5 3 4 3 4 3 NA ...
# $ Rep77 : num 2 1 NA 2 3 4 3 4 4 NA ...
# $ Hroom : num 2.5 3 3 3 2.5 2.5 4.5 4 4 3 ...
# $ Rseat : num 27.5 25.5 18.5 27 28 26 29 31.5 30.5 24 ...
# $ Trunk : int 11 11 12 15 11 12 16 20 21 10 ...
# $ Weight: int 2930 3350 2640 2830 2070 2650 3250 4080 3670 2230 ...
# $ Length: int 186 173 168 189 174 177 196 222 218 170 ...
# $ Turn : int 40 40 35 37 36 34 40 43 43 34 ...
# $ Displa: int 121 258 121 131 97 121 196 350 231 304 ...
# $ Gratio: num 3.58 2.53 3.08 3.2 3.7 3.64 2.93 2.41 2.73 2.87 ...
So exclude them by trying this:
X<-XX[,3:14]
or this
X<-XX[,-(1:2)]
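
The head(carc) output in the question also shows "." in the R78 and R77 columns, which suggests those columns are stored as text rather than numbers. The following is a minimal sketch, assuming "." marks a missing value; carc_toy and its three rows are made up, and num_cols is a hypothetical name.

```r
# Made-up miniature with "." standing in for missing repair records
carc_toy <- data.frame(
  P   = c(4099, 4749, 3799),
  R78 = c("3", "4", "."),
  R77 = c("2", "1", "."),
  stringsAsFactors = FALSE
)

# Replace "." with NA, then convert the text columns to numeric
num_cols <- c("R78", "R77")
carc_toy[num_cols] <- lapply(carc_toy[num_cols], function(v) {
  v <- as.character(v)   # also covers columns stored as factors
  v[v == "."] <- NA
  as.numeric(v)
})
str(carc_toy)

# With every column numeric, scale() no longer raises the colMeans error
scaled <- scale(carc_toy)
```

Going through as.character() first matters for factor columns, because as.numeric() on a factor returns the internal level codes rather than the printed values.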
