ggplot in R: graph for data analysis

I have a large CSV file that I imported into R for some data analysis. It contains flight delays for a few years, and I am trying to create a graph showing the average delay per day of the week. I tried a histogram, but the resulting plot is not usable. Would another type of graph work better? Also, is there an easy way to compare on-time flights to delayed flights per day of the week?
file name - airline
str(airline)
'data.frame': 7009728 obs. of 29 variables:
$ Year : int 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
$ Month : int 1 1 1 1 1 1 1 1 1 1 ...
$ DayofMonth : int 3 3 3 3 3 3 3 3 3 3 ...
$ DayOfWeek : int 4 4 4 4 4 4 4 4 4 4 ...
$ DepTime : int 2003 754 628 926 1829 1940 1937 1039 617 1620 ...
$ CRSDepTime : int 1955 735 620 930 1755 1915 1830 1040 615 1620 ...
$ ArrTime : int 2211 1002 804 1054 1959 2121 2037 1132 652 1639 ...
$ CRSArrTime : int 2225 1000 750 1100 1925 2110 1940 1150 650 1655 ...
$ UniqueCarrier : Factor w/ 20 levels "9E","AA","AQ",..: 18 18 18 18 18 18 18 18 18 18 ...
$ FlightNum : int 335 3231 448 1746 3920 378 509 535 11 810 ...
$ TailNum : Factor w/ 5374 levels "","80009E","80019E",..: 3769 4129 1961 3059 2142 3852 4062 1961 3616 3324 ...
$ ActualElapsedTime: int 128 128 96 88 90 101 240 233 95 79 ...
$ CRSElapsedTime : int 150 145 90 90 90 115 250 250 95 95 ...
$ AirTime : int 116 113 76 78 77 87 230 219 70 70 ...
$ ArrDelay : int -14 2 14 -6 34 11 57 -18 2 -16 ...
$ DepDelay : int 8 19 8 -4 34 25 67 -1 2 0 ...
$ Origin : Factor w/ 303 levels "ABE","ABI","ABQ",..: 136 136 141 141 141 141 141 141 141 141 ...
$ Dest : Factor w/ 304 levels "ABE","ABI","ABQ",..: 287 287 49 49 49 151 157 157 177 177 ...
$ Distance : int 810 810 515 515 515 688 1591 1591 451 451 ...
$ TaxiIn : int 4 5 3 3 3 4 3 7 6 3 ...
$ TaxiOut : int 8 10 17 7 10 10 7 7 19 6 ...
$ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ...
$ CancellationCode : Factor w/ 5 levels "","A","B","C",..: 1 1 1 1 1 1 1 1 1 1
$ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...
$ CarrierDelay : int NA NA NA NA 2 NA 10 NA NA NA ...
$ WeatherDelay : int NA NA NA NA 0 NA 0 NA NA NA ...
$ NASDelay : int NA NA NA NA 0 NA 0 NA NA NA ...
$ SecurityDelay : int NA NA NA NA 0 NA 0 NA NA NA ...
$ LateAircraftDelay: int NA NA NA NA 32 NA 47 NA NA NA ...
my graph:
library(ggplot2)
ggplot(airline,aes(x = DayOfWeek, fill = factor(DepDelay))) +
geom_histogram(binwidth = 1) +
xlab ("Day of week") +
ylab ("Dep Delay") +
labs (fill = "Airline")

To a great extent it depends on what you want to show. I made a small example using the flights data available in the nycflights13 package. Using the code below, you can experiment with charts that meet your analytical requirements.
Code
# Libs and data -----------------------------------------------------------
Vectorize(require)(package = c("nycflights13", "ggplot2", "ggthemes",
"dplyr"),
character.only = TRUE)
# Work -------------------------------------------------------------------
flights %>%
# Create week day summary
mutate(across(1:3, as.character)) %>%
mutate(date = as.Date(paste(year, month, day, sep = "-"))) %>%
mutate(weekday = weekdays(date, abbreviate = FALSE)) %>%
group_by(weekday, carrier) %>%
na.omit() %>%
summarise(mean_dl = round(mean(dep_delay),2)) %>%
ggplot(aes(x = as.factor(weekday), y = mean_dl)) +
geom_bar(stat = "identity") +
facet_wrap(~carrier) +
xlab("Day") +
ylab("Mean Dep Delay") +
theme_wsj() +
theme(axis.text.x = element_text(angle = 90))
Results
For example, this could be a modest start:
If you want a better answer, I suggest having a look at this discussion on producing a good R example. I would also take the liberty of suggesting that you:
Post a neat data extract that would be easy for other colleagues to work with
Elaborate more on the problem you are facing with respect to the particular chart you want to develop.
Comparing flight delays
You can make further use of the dplyr grammar to compare on-time and delayed flights.
Code
For example, you could use the code below to count the on-time and delayed flights for each day of the week:
flights %>%
# Create week day summary
mutate(across(1:3, as.character)) %>%
mutate(date = as.Date(paste(year, month, day, sep = "-"))) %>%
mutate(weekday = weekdays(date, abbreviate = FALSE)) %>%
# Flag departures: a delay of zero or less (early departure) counts as on time
mutate(ontime = ifelse(dep_delay <= 0, "on-time", "delayed")) %>%
group_by(weekday, ontime) %>%
na.omit() %>%
summarise(count_flights = n())
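Applied to the column names in the question (DayOfWeek and DepDelay), the same idea can be sketched in base R. This is only an illustration under stated assumptions: airline_demo is synthetic data standing in for the real airline data frame, and "delayed" is assumed to mean a departure delay greater than zero.

```r
# Synthetic stand-in for the airline data frame (same column names).
set.seed(1)
airline_demo <- data.frame(
  DayOfWeek = sample(1:7, 1000, replace = TRUE),
  DepDelay  = round(rnorm(1000, mean = 8, sd = 20))
)
airline_demo$DepDelay[sample(1000, 20)] <- NA  # some flights have no recorded delay

# One row per weekday: mean delay and share of flights that left late.
# The formula interface of aggregate() drops rows with NA by default.
by_day <- aggregate(
  DepDelay ~ DayOfWeek,
  data = airline_demo,
  FUN  = function(x) c(mean_delay = mean(x), share_delayed = mean(x > 0))
)

# aggregate() returns the two statistics as a matrix column; flatten it.
by_day <- do.call(data.frame, by_day)
names(by_day) <- c("DayOfWeek", "mean_delay", "share_delayed")
by_day
```

The resulting seven-row table is small enough to plot as bars, one bar per weekday, instead of binning seven million rows in a histogram.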

Related

side by side boxplot in R

I am trying to make a side-by-side box and whisker plot of durasec broken out by placement and media
df <- read.csv("http://citadel.sjfc.edu/faculty/ageraci/data/dataset-nci-2012-subset1.csv")
str(df)
'data.frame': 11475 obs. of 7 variables:
$ time : int 1 1 1 1 1 1 1 1 1 1 ...
$ durasec : int 168 149 179 155 90 133 17 14 14 18 ...
$ placement: int 401 402 403 403 403 403 403 403 403 403 ...
$ format : int 8 9 8 8 9 8 12 12 12 12 ...
$ focus : int 1 1 1 1 1 1 3 3 1 1 ...
$ topic : int 5 5 5 2 2 2 26 26 11 24 ...
$ media : int 4 4 4 4 4 4 4 4 4 4 ...
favstats(~durasec | placement + media, data =df)
placement.media min     Q1 median     Q3  max      mean        sd    n missing
401.4            14 120.25  164.5 197.00  754 171.39686  90.85643  446       0
402.4             9  92.00  143.0 182.00  619 157.20935 107.92586  449       0
403.4             3  23.00   54.0 141.00  807  90.18696  90.50816 4172       0
401.5            12  94.25  165.5 254.75 1136 215.05121 180.52376  742       0
402.5             7  98.50  181.0 306.00  716 211.23293 145.88735  747       0
403.5             3  34.00   96.0 173.50 1098 124.85180 112.56758 4919       0
bwplot(placement + media ~ durasec, data = df)
When I run this last piece of code it gives me a box and whisker plot but on the Y axis instead of the combinations of 401.4 through 403.5 like in the favstats, it just gives me 1 through 5 and the data doesn't appear to exactly match the favstats.
How can I get it to display the six combinations and their data like in the favstats?
You can try the following code
library(lattice)
bwplot(durasec ~ as.factor(df$placement) | as.factor(df$media), data = df)
Using ggplot:
library(ggplot2)
library(dplyr)
df <- read.csv("http://citadel.sjfc.edu/faculty/ageraci/data/dataset-nci-2012-subset1.csv")
df_fac <- df %>%
mutate_at(vars(placement:media), ~as.factor(.))
ggplot(data = df_fac) +
geom_boxplot(aes(x = durasec, y = placement, fill = media))
Created on 2020-04-06 by the reprex package (v0.3.0)
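If the goal is a single axis showing the six placement/media combinations, as in the favstats output, one option is to build the combined factor explicitly with interaction(). The sketch below uses synthetic data (df_demo is made up for illustration, with the same column names as the question's data).

```r
library(lattice)

# Synthetic stand-in for the question's data frame.
set.seed(42)
df_demo <- data.frame(
  durasec   = rexp(600, rate = 1 / 120),
  placement = sample(c(401, 402, 403), 600, replace = TRUE),
  media     = sample(c(4, 5), 600, replace = TRUE)
)

# interaction() builds one factor with a level per placement/media pair.
df_demo$combo <- interaction(df_demo$placement, df_demo$media)
levels(df_demo$combo)

# One box per combination, with the combinations on the y axis.
bwplot(combo ~ durasec, data = df_demo)
```

The levels come out as 401.4, 402.4, 403.4, 401.5, 402.5, 403.5, which is the same labelling and order as the favstats rows.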

Using R - Need advice predicting next year based on irregular time-series

I need advice with regards to the following inquiry: "Based on your observations, what could you say about the load for the same months in year 2019?"
The str()/head() of the df looks like this:
'data.frame': 683 obs. of 10 variables:
$ Route : chr "A" "B" "A" "A" ...
$ FlightNumber: int 770 279 128 235 434 543 556 663 770 279 ...
$ Capacity : int 375 345 375 375 375 375 375 375 375 345 ...
$ Booked : int 379 314 374 379 373 377 379 378 379 294 ...
$ DDate : Date, format: "2018-05-01" "2018-05-01" "2018-05-02" "2018-05-03" ...
$ Year : num 2018 2018 2018 2018 2018 ...
$ Month : num 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 1 2 3 4 5 6 7 8 8 ...
$ Hour : int 12 20 12 12 12 12 12 12 12 20 ...
$ load : num 1.011 0.91 0.997 1.011 0.995 ...
Route FlightNumber Capacity Booked DDate Year Month Day Hour load(=Booked/Capacity)
1 A 770 375 379 2018-05-01 2018 5 1 12 1.0106667
2 B 279 345 314 2018-05-01 2018 5 1 20 0.9101449
3 A 128 375 374 2018-05-02 2018 5 2 12 0.9973333
4 A 235 375 379 2018-05-03 2018 5 3 12 1.0106667
5 A 434 375 373 2018-05-04 2018 5 4 12 0.9946667
6 A 543 375 377 2018-05-05 2018 5 5 12 1.0053333
If I plot the data, it looks like this: geom_point
UPDATE: I ended up doing the following:
dat_A <- test %>% select(Route, DDate, load) %>% filter(Route == "A")
ts_A <- ts(dat_A$load, start = c(2017,5), end = c(2018,11), frequency = 1*12)
forecast(ts_A, h=12) %>% plot()
Predicted outcome image
#Double checking
fit <- auto.arima(ts_A)
summary(fit)
predict <- forecast(fit,n=1)
plot(predict)
plot.ts(predict$residuals)
qqnorm(predict$residuals)
acf(predict$residuals)
Does the prediction seem sound? It looks rather flat, even though I also tried a train (1:480) / validation (481:611) split with ARIMA followed by a forecast, which gave an RMSE of 0.036...
Here is an approach you can try. First load your data; say it is called df and has the column Booked. You can then build a time series that is easy to fit a model to:
ts_data = ts(df$Booked, start = c(2017,1), end = c(2018,12), frequency = 12)
Now you can apply a time-series forecasting method to ts_data to predict the values for 2019. I am leaving the rest of the code to you. Thank you!!
To convert a vector or data.frame into a time series, you could use:
dat <- as.ts(as.matrix(dat))
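One caveat about the ts() calls above: frequency = 12 tells R that each value is one month, and when both start and end are given, ts() takes only as many values as fit between them. If the rows are individual flights, as the head() output suggests, it may be worth collapsing them to one value per month first. A minimal base R sketch, using synthetic daily data (flights_demo and its values are made up for illustration):

```r
# Synthetic daily load factors for May to November 2018.
set.seed(7)
days <- seq(as.Date("2018-05-01"), as.Date("2018-11-30"), by = "day")
flights_demo <- data.frame(
  DDate = days,
  load  = pmin(1.02, rnorm(length(days), mean = 0.97, sd = 0.03))
)

# Collapse the daily rows to one mean load per calendar month.
flights_demo$ym <- format(flights_demo$DDate, "%Y-%m")
monthly <- aggregate(load ~ ym, data = flights_demo, FUN = mean)

# Seven monthly values starting in May 2018; frequency = 12 means "monthly".
ts_monthly <- ts(monthly$load, start = c(2018, 5), frequency = 12)
ts_monthly
```

With fewer than two full years of monthly values there is little for a seasonal model to learn from, so a fairly flat forecast is not necessarily a sign of a coding error.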

How to use scale_x_discrete with intervals created by cut()

Given this:
kc$sqft_living_group <- cut(kc$sqft_living, breaks = c(0, 1000, 2000, 3000, 5000, 7000, 10000, 15000), dig.lab=5)
How do I set the limit of my ggplot2 graph?
Nothing I can find shows the syntax to set the limit for intervals.
kc %>%
filter(zipcode %in% top_10_zipcodes) %>%
group_by(sqft_living_group) %>%
summarize(Mean_Price = mean(price)) %>%
ggplot(aes(y = Mean_Price, x = sqft_living_group)) +
geom_bar(stat = "identity") +
scale_y_continuous(labels = comma) +
scale_x_discrete(limits = "(0, 1000], (1000, 12000]") # <---------- HERE
structure of data:
'data.frame': 21613 obs. of 22 variables:
$ id : num 7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
$ date : POSIXct, format: "2014-10-13" "2014-12-09" "2015-02-25" "2014-12-09" ...
$ price : num 221900 538000 180000 604000 510000 ...
$ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
$ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
$ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
$ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
$ floors : Factor w/ 6 levels "1","1.5","2",..: 1 3 1 1 1 1 3 1 1 3 ...
$ waterfront : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ view : int 0 0 0 0 0 0 0 0 0 0 ...
$ condition : int 3 3 3 5 3 3 3 3 3 3 ...
$ grade : int 7 7 6 7 8 11 7 7 7 7 ...
$ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
$ sqft_basement : int 0 400 0 910 0 1530 0 0 730 0 ...
$ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
$ yr_renovated : Factor w/ 70 levels "0","1934","1940",..: 1 46 1 1 1 1 1 1 1 1 ...
$ zipcode : Factor w/ 70 levels "98001","98002",..: 67 56 17 59 38 30 3 69 61 24 ...
$ lat : num 47.5 47.7 47.7 47.5 47.6 ...
$ long : num -122 -122 -122 -122 -122 ...
$ sqft_living15 : int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
$ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
$ sqft_living_group: Factor w/ 7 levels "(0,1000]","(1000,2000]",..: 2 3 1 2 2 5 2 2 2 2 ...
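For a factor produced by cut(), the limits of scale_x_discrete() are the level names themselves, given as a character vector with one string per interval and spelled exactly as cut() wrote them (the str() output shows "(0,1000]" with no space). A minimal sketch on synthetic data; kc_demo and its values are invented for illustration, and only the breaks are taken from the question:

```r
library(ggplot2)

# Synthetic stand-in for the kc data frame.
set.seed(3)
kc_demo <- data.frame(
  sqft_living = sample(300:14000, 500, replace = TRUE),
  price       = runif(500, 1e5, 2e6)
)
kc_demo$sqft_living_group <- cut(
  kc_demo$sqft_living,
  breaks  = c(0, 1000, 2000, 3000, 5000, 7000, 10000, 15000),
  dig.lab = 5
)

# Taking the names from levels() avoids typos in the interval labels.
wanted <- levels(kc_demo$sqft_living_group)[1:2]

mean_price <- aggregate(price ~ sqft_living_group, data = kc_demo, FUN = mean)

# Only the intervals listed in `limits` are drawn; the rest are dropped.
p <- ggplot(mean_price, aes(x = sqft_living_group, y = price)) +
  geom_col() +
  scale_x_discrete(limits = wanted)
p
```

ggplot2 warns about the rows it removes for the intervals outside the limits; filtering the data before plotting is an alternative that avoids the warning.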

leaflet, Error: cannot allocate vector of size 177.2 Mb

I have tried everything I can think of to fix this error, but I have not been able to figure it out. I am on a 32-bit machine, trying to build a choropleth. The data file is pretty basic: some municipal IDs with population figures associated with them. The shapefile is taken from here: www.ontario.ca/data/municipal-boundaries
library('tmap')
library('leaflet')
library('magrittr')
library('rio')
library('plyr')
library('scales')
library('htmlwidgets')
library('tmaptools')
setwd("C:/Users/rdhasa/desktop")
datafile <- "shapefiles2/Population - 2014.csv"
Pop2014 <- rio::import(datafile)
Pop2014$Population <- as.factor(Pop2014$Population)
str(Pop2014)
'data.frame': 454 obs. of 9 variables:
$ MUNID : int 20002 18000 18013 18001 18005 18017 18009 18039 18020 18029 ...
$ YEAR : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
$ MAH CODE : int 1106 10000 10101 10102 10401 10402 10404 10601 10602 10603 ...
$ V4 : int 1999 1800 1813 1801 1805 1817 1809 1839 1820 1829 ...
$ Municipality: chr "Toronto C" "Durham R" "Oshawa C" "Pickering C" ...
$ Tier : chr "ST" "UT" "LT" "LT" ...
$ A : int 11 11 11 11 11 11 11 11 11 11 ...
$ B : chr "a" "a" "a" "a" ...
$ Population : Factor w/ 438 levels "-","1,006","1,026",..: 160 359 117 432 86 419 97 73 179 171 ...
mnshape <- "shapefiles2/MUNICIPAL_BOUNDARY_LOWER_AND_SINGLE_TIER.shp"
mngeo2 <- read_shape(file=mnshape)
str(mngeo2@data)
'data.frame': 683 obs. of 13 variables:
$ MUNID : int 1002 1002 1002 1009 1009 1009 1016 1016 1016 1026 ...
$ MAH_CODE : int 71616 71616 71616 71618 71618 71618 71614 71614 71614 71613 ...
$ SGC_CODE : int 1005 1005 1005 1011 1011 1011 1020 1020 1020 1030 ...
$ ASSESSMENT: int 101 101 101 406 406 406 506 506 506 511 ...
$ LEGAL_NAME: Factor w/ 414 levels "CITY OF BARRIE",..: 369 369 369 370 370 370 96 96 96 334 ...
$ STATUS : Factor w/ 2 levels "LOWER TIER","SINGLE TIER": 1 1 1 1 1 1 1 1 1 1 ...
$ EXTENT : Factor w/ 3 levels "ISLANDS","LAND",..: 1 2 3 1 2 3 1 2 3 2 ...
$ MSO : Factor w/ 4 levels "CENTRAL","EASTERN",..: 2 2 2 2 2 2 2 2 2 2 ...
$ NAME_PREFI: Factor w/ 8 levels "-","CITY OF",..: 6 6 6 6 6 6 4 4 4 6 ...
$ UPPER_TIER: Factor w/ 30 levels "BRUCE","DUFFERIN",..: 27 27 27 27 27 27 27 27 27 27 ...
$ NAME : Factor w/ 413 levels "ADDINGTON HIGHLANDS",..: 339 339 339 342 342 342 337 337 337 259 ...
$ Shape_Leng: num 0.115 1.622 1.563 0.551 1.499 ...
$ Shape_Area: num 2.32e-05 6.95e-02 7.51e-03 5.63e-04 5.09e-02 ...
mnmap <- append_data(mngeo2, Pop2014, key.shp = "MUNID", key.data="MUNID")
minPct <- min(c(mnmap@data$Population))
maxPct <- max(c(mnmap@data$Population))
paletteLayers <- colorBin(palette = "RdBu", domain = c(minPct, maxPct), bins = c(0, 50000,200000 ,500000, 1000000, 2000000) , pretty=FALSE)
rm(mngeo2)
rm(Pop2014)
rm(mnshape)
rm(datafile)
rm(maxPct)
rm(minPct)
gc()
leaflet(mnmap) %>%
addProviderTiles("CartoDB.Positron") %>%
addPolygons(stroke=TRUE,
smoothFactor = 0.2,
weight = 1,
fillOpacity = .6)
Error: cannot allocate vector of size 177.2 Mb
Can I maybe save space by simplifying the shapefile? If so, how would I go about doing this efficiently?
Thanks
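Separately from the memory error, one detail in the code above may be worth checking: Population is stored as a factor whose levels are strings such as "1,006" and "-", so taking min() and max() of it is unlikely to return population figures. A minimal sketch of converting such values to numbers, assuming that "-" marks a missing figure (the values below are made up):

```r
# Made-up values in the same style as the Population column.
pop_raw <- factor(c("1,006", "2,615,060", "-", "159,458"))

# Go through character first; as.numeric() on a factor returns level codes.
pop_chr <- as.character(pop_raw)
pop_chr[pop_chr == "-"] <- NA          # assumption: "-" means "no figure"
pop_num <- as.numeric(gsub(",", "", pop_chr))

range(pop_num, na.rm = TRUE)
```

A numeric column like pop_num is also what a binned colour palette with breaks such as 50000 or 200000 would need as input.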

How to overwrite a factor in R

I have a dataset:
> k
EVTYPE FATALITIES INJURIES
198704 HEAT 583 0
862634 WIND 158 1150
68670 WIND 116 785
148852 WIND 114 597
355128 HEAT 99 0
67884 WIND 90 1228
46309 WIND 75 270
371112 HEAT 74 135
230927 HEAT 67 0
78567 WIND 57 504
The variables are as follows. As per the first answer by joran, unused levels can be dropped with droplevels, so there is no need to worry about the 898 levels. The illustrative k I'm showing is the complete dataset, obtained from k <- d1[1:10, 3:4], where d1 is the original dataset.
> str(k)
'data.frame': 10 obs. of 3 variables:
$ EVTYPE : Factor w/ 898 levels " HIGH SURF ADVISORY",..: 243 NA NA NA 243 NA NA 243 243 NA
$ FATALITIES: num 583 158 116 114 99 90 75 74 67 57
$ INJURIES : num 0 1150 785 597 0 ...
I'm trying to overwrite the WIND factor:
> k[k$EVTYPE==factor("WIND"), ]$EVTYPE <- factor("AFDAF")
> k[k$EVTYPE=="WIND", ]$EVTYPE <- factor("AFDAF")
But both commands give me error messages: level sets of factors are different or invalid factor level, NA generated.
How should I do this?
Try this instead:
k <- droplevels(d1[1:10, 3:5])
Factors (as per the documentation) are simply a vector of integer codes and then a simple vector of labels for each code. These are called the "levels". The levels are an attribute, and persist with your data even when subsetting.
This is a feature, since for many statistical procedures it is vital to keep track of all the possible values that variable could have, even if they don't appear in the actual data.
Some people find this irritating and run R using options(stringsAsFactors = FALSE).
To simply change the levels, you can do something like this:
d <- read.table(text = " EVTYPE FATALITIES INJURIES
198704 HEAT 583 0
862634 WIND 158 1150
68670 WIND 116 785
148852 WIND 114 597
355128 HEAT 99 0
67884 WIND 90 1228
46309 WIND 75 270
371112 HEAT 74 135
230927 HEAT 67 0
78567 WIND 57 504",header = TRUE,sep = "",stringsAsFactors = TRUE)
> str(d)
'data.frame': 10 obs. of 3 variables:
$ EVTYPE : Factor w/ 2 levels "HEAT","WIND": 1 2 2 2 1 2 2 1 1 2
$ FATALITIES: int 583 158 116 114 99 90 75 74 67 57
$ INJURIES : int 0 1150 785 597 0 1228 270 135 0 504
> levels(d$EVTYPE) <- c('A','B')
> str(d)
'data.frame': 10 obs. of 3 variables:
$ EVTYPE : Factor w/ 2 levels "A","B": 1 2 2 2 1 2 2 1 1 2
$ FATALITIES: int 583 158 116 114 99 90 75 74 67 57
$ INJURIES : int 0 1150 785 597 0 1228 270 135 0 504
Or to just change one:
levels(d$EVTYPE)[2] <- 'C'
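To rename a level by its name rather than by its position, which is closer to what the question attempts with "WIND", the level can be looked up first. A small sketch on a made-up data frame (k_demo is illustrative, not the original data):

```r
# Small made-up data frame with the same kind of factor column.
k_demo <- data.frame(
  EVTYPE     = factor(c("HEAT", "WIND", "WIND", "HEAT")),
  FATALITIES = c(583, 158, 116, 99)
)

# Replace the label of the level called "WIND"; rows using it follow automatically.
levels(k_demo$EVTYPE)[levels(k_demo$EVTYPE) == "WIND"] <- "AFDAF"
as.character(k_demo$EVTYPE)
```

Because only the label is changed, no new level has to be added first, which is what triggers the "invalid factor level" warning when assigning a new value directly to rows.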
