Using R - Need advice predicting next year based on an irregular time series

I need advice with regards to the following inquiry: "Based on your observations, what could you say about the load for the same months in year 2019?"
The str()/head() of the df looks like this:
'data.frame': 683 obs. of 10 variables:
$ Route : chr "A" "B" "A" "A" ...
$ FlightNumber: int 770 279 128 235 434 543 556 663 770 279 ...
$ Capacity : int 375 345 375 375 375 375 375 375 375 345 ...
$ Booked : int 379 314 374 379 373 377 379 378 379 294 ...
$ DDate : Date, format: "2018-05-01" "2018-05-01" "2018-05-02" "2018-05-03" ...
$ Year : num 2018 2018 2018 2018 2018 ...
$ Month : num 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 1 2 3 4 5 6 7 8 8 ...
$ Hour : int 12 20 12 12 12 12 12 12 12 20 ...
$ load : num 1.011 0.91 0.997 1.011 0.995 ...
  Route FlightNumber Capacity Booked      DDate Year Month Day Hour      load
1     A          770      375    379 2018-05-01 2018     5   1   12 1.0106667
2     B          279      345    314 2018-05-01 2018     5   1   20 0.9101449
3     A          128      375    374 2018-05-02 2018     5   2   12 0.9973333
4     A          235      375    379 2018-05-03 2018     5   3   12 1.0106667
5     A          434      375    373 2018-05-04 2018     5   4   12 0.9946667
6     A          543      375    377 2018-05-05 2018     5   5   12 1.0053333
(load = Booked / Capacity)
If I plot the data with geom_point, it looks like this: [plot image not included]
UPDATE: I ended up doing the following:
dat_A <- test %>% select(Route, DDate, load) %>% filter(Route == "A")
ts_A <- ts(dat_A$load, start = c(2017,5), end = c(2018,11), frequency = 1*12)
forecast(ts_A, h=12) %>% plot()
[Image: predicted outcome]
#Double checking
fit <- auto.arima(ts_A)
summary(fit)
predict <- forecast(fit,n=1)
plot(predict)
plot.ts(predict$residuals)
qqnorm(predict$residuals)
acf(predict$residuals)
Does the prediction seem sound? It looks rather flat, even though I also tried a train (rows 1:480) / validation (rows 481:611) split, fitting an ARIMA model and then forecasting, which gave an RMSE of 0.036...

Here is an approach you can try. First load your data into a data frame, say df, with the column Booked. You can then build a time-series object that the forecasting functions can fit:
ts_data = ts(df$Booked, start = c(2017,1), end = c(2018,12), frequency = 12)
Now you can apply a time-series forecasting method to ts_data to predict the values for 2019. I am leaving the rest of the code to you.
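One caveat worth illustrating: ts() takes the values in the order given and treats them as equally spaced, so per-flight rows passed with frequency = 12 would be read as if each row were one month. A minimal sketch of one way around this, assuming it is acceptable to summarise the load as a monthly mean. The data frame flights_df is a made-up stand-in, and the date range is only borrowed from the start/end values in the question's ts() call:

```r
# Hypothetical stand-in for the question's data: one row per day,
# with a departure date and a load factor (Booked / Capacity).
set.seed(1)
flights_df <- data.frame(
  DDate = seq(as.Date("2017-05-01"), as.Date("2018-11-30"), by = "day")
)
flights_df$load <- runif(nrow(flights_df), min = 0.85, max = 1.02)

# Collapse the daily rows to one mean load per calendar month.
flights_df$ym <- format(flights_df$DDate, "%Y-%m")
monthly <- aggregate(load ~ ym, data = flights_df, FUN = mean)
monthly <- monthly[order(monthly$ym), ]

# One value per month, so frequency = 12 now matches the data.
ts_monthly <- ts(monthly$load, start = c(2017, 5), frequency = 12)
print(ts_monthly)
```

With fewer than two full years of monthly values, a seasonal model has very little to estimate from, which may be one reason a forecast comes out flat.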

To convert a vector or data.frame into a time series, you could use:
dat <- as.ts(as.matrix(dat))

Related

side by side boxplot in R

I am trying to make a side-by-side box-and-whisker plot of durasec broken out by placement and media:
df <- read.csv("http://citadel.sjfc.edu/faculty/ageraci/data/dataset-nci-2012-subset1.csv")
str(df)
'data.frame': 11475 obs. of 7 variables:
$ time : int 1 1 1 1 1 1 1 1 1 1 ...
$ durasec : int 168 149 179 155 90 133 17 14 14 18 ...
$ placement: int 401 402 403 403 403 403 403 403 403 403 ...
$ format : int 8 9 8 8 9 8 12 12 12 12 ...
$ focus : int 1 1 1 1 1 1 3 3 1 1 ...
$ topic : int 5 5 5 2 2 2 26 26 11 24 ...
$ media : int 4 4 4 4 4 4 4 4 4 4 ...
favstats(~durasec | placement + media, data =df)
      min     Q1 median     Q3  max      mean        sd    n missing
401.4  14 120.25  164.5 197.00  754 171.39686  90.85643  446       0
402.4   9  92.00  143.0 182.00  619 157.20935 107.92586  449       0
403.4   3  23.00   54.0 141.00  807  90.18696  90.50816 4172       0
401.5  12  94.25  165.5 254.75 1136 215.05121 180.52376  742       0
402.5   7  98.50  181.0 306.00  716 211.23293 145.88735  747       0
403.5   3  34.00   96.0 173.50 1098 124.85180 112.56758 4919       0
bwplot(placement + media ~ durasec, data = df)
When I run this last piece of code it gives me a box and whisker plot but on the Y axis instead of the combinations of 401.4 through 403.5 like in the favstats, it just gives me 1 through 5 and the data doesn't appear to exactly match the favstats.
How can I get it to display the six combinations and their data like in the favstats?
You can try the following code, which puts placement on the axis and draws one panel per media value:
library(lattice)
bwplot(durasec ~ as.factor(placement) | as.factor(media), data = df)
Using ggplot:
library(ggplot2)
library(dplyr)
df <- read.csv("http://citadel.sjfc.edu/faculty/ageraci/data/dataset-nci-2012-subset1.csv")
df_fac <- df %>%
  mutate_at(vars(placement:media), ~as.factor(.))
ggplot(data = df_fac) +
  geom_boxplot(aes(x = durasec, y = placement, fill = media))
Created on 2020-04-06 by the reprex package (v0.3.0)
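To get the six placement.media combinations on a single axis, as in the favstats output, another option is to build the combined factor explicitly with interaction(). The sketch below uses base graphics and a made-up data frame (toy) instead of the CSV from the question, so the values are illustrative only:

```r
# Made-up data with the same column names as in the question.
set.seed(42)
toy <- data.frame(
  durasec   = rexp(600, rate = 1 / 120),
  placement = sample(c(401, 402, 403), 600, replace = TRUE),
  media     = sample(c(4, 5), 600, replace = TRUE)
)

# interaction() builds one factor whose levels are the placement.media pairs.
toy$combo <- interaction(toy$placement, toy$media)
print(levels(toy$combo))

# One horizontal box per combination.
boxplot(durasec ~ combo, data = toy, horizontal = TRUE, las = 1,
        xlab = "durasec", ylab = "")
```

The same combined factor should also work on the left-hand side of a lattice formula, for example bwplot(combo ~ durasec, data = toy).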

How to convert one of my columns from 226 FACTORS?

New to programming in R. I have a dataset in which one column is, or should be, numeric, since it holds percentage values. I need to plot that data using ggplot2, but I can't because I'm pretty new to this.
Summary:
DataSet = 245 Rows, 6 columns.
I have spent 5 hours searching for the right code, but the posts I found seem too advanced for my understanding.
'data.frame': 245 obs. of 6 variables:
$ location : Factor w/ 8 levels "site01","site02",..: 1 1 1 1 1 1 1 1 1 1 ...
$ coralType: Factor w/ 5 levels "blue corals",..: 1 1 1 1 1 1 1 1 2 2 ...
$ longitude: num 144 144 144 144 144 ...
$ latitude : num -11.8 -11.8 -11.8 -11.8 -11.8 ...
$ year : int 2010 2011 2012 2013 2014 2015 2016 2017 2011 2012 ...
$ value : Factor w/ 223 levels "10.01%","10.23%",..: 113 123 166 168 184 193 196 200 43 44 ...
See that df$value? That is my issue: I need it to be numeric so I can plot it, and right now I can't. Simply put, $value needs to be numeric. I would really appreciate it if any of you R veterans could help me out.
You need to remove the percentage symbol and save it as a numeric value.
df <- data.frame(value = paste(1:100, "%", sep = ""))
df$value <- as.numeric(sub("%", "", df$value))
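Because value is stored as a factor in the question, it may help to see why as.numeric() on its own is not enough. A small sketch with made-up percentages (the vector value below is hypothetical, not the question's data):

```r
# Made-up factor of percentage strings, like the value column in the question.
value <- factor(c("10.01%", "3.50%", "27.80%"))

# Pitfall: converting a factor directly returns the internal level codes,
# not the numbers shown in the labels.
codes <- as.integer(value)
print(codes)

# Convert to character first, strip the percent sign, then convert to numeric.
pct <- as.numeric(sub("%", "", as.character(value), fixed = TRUE))
print(pct)
```

The one-liner above gives the same result because sub() coerces its input to character before substituting.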

C50 failed in r with "c50 code called exit with value 1"

I am having an issue training C5.0 on my dataset. Before posting, I researched the similar issues and solutions other people have reported. My dataset has none of those problems, yet C5.0 still fails in R. My dataset looks like this:
'data.frame': 113967 obs. of 15 variables:
$ region : Factor w/ 51 levels "US:AK","US:AL",..: 2 3 3 4 4 4 4 5 5 5 ...
$ city : Factor w/ 6396 levels "179708","179720",..: 24 156 156 194 214 226 244 276 316 407 ...
$ dma : Factor w/ 211 levels "1","500","501",..: 24 148 148 173 173 173 189 195 204 208 ...
$ user_day : Factor w/ 7 levels "0","1","2","3",..: 6 6 6 6 6 6 6 6 6 6 ...
$ user_hour : Factor w/ 24 levels "0","1","10","11",..: 5 16 16 4 22 7 10 11 15 21 ...
$ os_extended : Factor w/ 71 levels "0","100","113",..: 55 68 68 7 29 14 14 14 29 34 ...
$ browser : Factor w/ 19 levels "0","10","11",..: 19 18 18 8 18 9 18 17 18 18 ...
$ domain : Factor w/ 2685 levels "0calc.com","100daysofrealfood.com",..: 1709 777 777 1406 727 2658 1406 1604 964 2658 ...
$ position : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 2 1 1 1 2 ...
$ placement : Factor w/ 5406 levels "10004098","10008956",..: 3331 1696 1714 3600 438 479 3598 3423 5406 479 ...
$ publisher : Factor w/ 1641 levels "1000773","1000776",..: 581 687 687 663 1369 1525 663 624 1641 1525 ...
$ seller_member_id : Factor w/ 304 levels "1001","1019",..: 19 101 101 40 19 35 40 40 75 35 ...
$ user_group : Factor w/ 1000 levels "0","1","10","100",..: 252 243 243 363 343 342 162 380 122 212 ...
$ size : Factor w/ 7 levels "160x600","300x250",..: 5 2 2 4 5 2 2 1 2 2 ...
$ predict.bid.vector.bin: Factor w/ 2 levels "(0.112,0.831]",..: 1 1 1 1 1 1 1 2 1 2 ...
As you can see, the last variable is my target variable (a factor), and every feature has more than one level. Moreover, there are no NAs in the dataset. Yet when I run C5.0, I get this error:
> library(C50)
> myC50_Tree <- C5.0(x = test_set[,-15], y = test_set$predict.bid.vector.bin)
c50 code called exit with value 1
> summary(myC50_Tree)
Call:
C5.0.default(x = test_set[, -15], y = test_set$predict.bid.vector.bin)
C5.0 [Release 2.07 GPL Edition] Fri Apr 13 14:29:54 2018
-------------------------------
*** line 6 of `undefined.names': attribute `region' has only one value `US'
Error limit exceeded
What could be the issue here?
You can reproduce a simulated version of my dataset with the following R code:
# --- Set unique feature values
region <- c("US:AL","US:AR","US:AZ","US:CA","US:CO","US:CT","US:DC","US:FL")
city <- c("179944","180802","181120","181212","181251","181315","181400","181512","181762","181842","181934","181953","182259","182295")
dma <- c('522','693','754','875','345','234')
user_day <- c('1','2','3','4','5','6')
user_hour <- c('12','11','10','9','8','7','6','5')
os_extended <- c('187','92','125','87','90')
browser <- c('8','9','18','5')
domain <- c('yahoo.com','youtube.com','mmctw.com','msn.com','frive.com','wework.com')
position <- c('0','1','2','3')
placement <- c('`234123412','34563451','235234624','46785467','234556834','85991927394')
publisher <- c('5345','57867','78034','123452','84567','245645','956752')
seller_memeber_id <- c('234','745','546','687','235')
user_group <- c('112','556','009','345','238')
size <- c('100X20','340X10','300X500','300X600')
predict.bid.vector.bin <- c('(0.831,1.55]', '(0.112,0.831]')
features <- list(region,city,dma,user_day,user_hour,os_extended,browser,domain,position,placement,publisher,seller_memeber_id,user_group,size,predict.bid.vector.bin)
# --- Sample simulated dataset
test_set <- vector()
for (feature in seq_along(features)) {
  # draw 1000 values per feature and bind them as a new column
  test_set <- cbind(test_set, sample(features[[feature]], 1000, replace = TRUE))
}
# stringsAsFactors = TRUE keeps the columns as factors on R >= 4.0 as well
test_set <- data.frame(test_set, stringsAsFactors = TRUE)
colnames(test_set) <- c('region','city','dma','user_day','user_hour',
                        'os_extended','browser','domain','position',
                        'placement','publisher','seller_memeber_id',
                        'user_group','size','predict.bid.vector.bin')
# --- check data
str(test_set)
The problem is the values of the region variable: I think C5.0 doesn't like the colons in them. I recreated your dataset with:
region <- c("AL","AR","AZ","CA","CO","CT","DC","FL")
And then it worked with no errors:
treeModel <- C5.0(x=test_set[,-15],y=test_set[,15])
treeModel
...
Evaluation on training data (1000 cases):

        Decision Tree
      ----------------
      Size      Errors

       103  220(22.0%)   <<

       (a)   (b)    <-classified as
      ----  ----
       358   122    (a): class 1
        98   422    (b): class 2

Attribute usage:
100.00% user_hour
28.30% region
27.30% dma
24.30% city
17.60% user_day
15.40% size
12.70% placement
9.10% user_group
7.90% browser
6.50% os_extended
4.70% publisher
4.40% position
3.70% domain
3.00% seller_memeber_id
I also recoded the dependent variable as 1 and 2 just in case the string with the ranges was giving it a problem, but that didn't seem to matter at all (however in the output above you'll see that it predicted to Class 1 and Class 2, and that's why).
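If the state prefix needs to stay in the data, a possible alternative to retyping the values is to replace the colon inside the factor levels. This is only a sketch: clean_levels() is a hypothetical helper, the underscore is an arbitrary replacement character, and it has not been run against C5.0 here:

```r
# Hypothetical helper: replace ":" with "_" in the levels of every factor column.
clean_levels <- function(df) {
  for (col in names(df)) {
    if (is.factor(df[[col]])) {
      levels(df[[col]]) <- gsub(":", "_", levels(df[[col]]), fixed = TRUE)
    }
  }
  df
}

# Small made-up example with the same kind of region values as the question.
demo <- data.frame(
  region = c("US:AL", "US:AR", "US:AZ", "US:AL"),
  size   = c("160x600", "300x250", "160x600", "300x250"),
  stringsAsFactors = TRUE
)
cleaned <- clean_levels(demo)
print(levels(cleaned$region))
```

Only the level labels change, so the rows and the number of levels stay the same.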

ggplot in R- graph for data analysis [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
I have a large CSV file which I imported into R for some data analysis. Basically, it is a file of flight delays over a few years, and I am trying to create a graph showing the average delay per day of the week. I thought of a histogram, but the graph it produces is not usable. Any ideas? Would another type of graph work better? Also, is there an easy way to compare on-time flights to delayed flights per day of the week?
file name - airline
str(airline)
'data.frame': 7009728 obs. of 29 variables:
$ Year : int 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
$ Month : int 1 1 1 1 1 1 1 1 1 1 ...
$ DayofMonth : int 3 3 3 3 3 3 3 3 3 3 ...
$ DayOfWeek : int 4 4 4 4 4 4 4 4 4 4 ...
$ DepTime : int 2003 754 628 926 1829 1940 1937 1039 617 1620 ...
$ CRSDepTime : int 1955 735 620 930 1755 1915 1830 1040 615 1620 ...
$ ArrTime : int 2211 1002 804 1054 1959 2121 2037 1132 652 1639 ...
$ CRSArrTime : int 2225 1000 750 1100 1925 2110 1940 1150 650 1655 ...
$ UniqueCarrier : Factor w/ 20 levels "9E","AA","AQ",..: 18 18 18 18 18 18 18 18 18 18 ...
$ FlightNum : int 335 3231 448 1746 3920 378 509 535 11 810 ...
$ TailNum : Factor w/ 5374 levels "","80009E","80019E",..: 3769 4129 1961 3059 2142 3852 4062 1961 3616 3324 ...
$ ActualElapsedTime: int 128 128 96 88 90 101 240 233 95 79 ...
$ CRSElapsedTime : int 150 145 90 90 90 115 250 250 95 95 ...
$ AirTime : int 116 113 76 78 77 87 230 219 70 70 ...
$ ArrDelay : int -14 2 14 -6 34 11 57 -18 2 -16 ...
$ DepDelay : int 8 19 8 -4 34 25 67 -1 2 0 ...
$ Origin : Factor w/ 303 levels "ABE","ABI","ABQ",..: 136 136 141 141 141 141 141 141 141 141 ...
$ Dest : Factor w/ 304 levels "ABE","ABI","ABQ",..: 287 287 49 49 49 151 157 157 177 177 ...
$ Distance : int 810 810 515 515 515 688 1591 1591 451 451 ...
$ TaxiIn : int 4 5 3 3 3 4 3 7 6 3 ...
$ TaxiOut : int 8 10 17 7 10 10 7 7 19 6 ...
$ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ...
$ CancellationCode : Factor w/ 5 levels "","A","B","C",..: 1 1 1 1 1 1 1 1 1 1
$ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...
$ CarrierDelay : int NA NA NA NA 2 NA 10 NA NA NA ...
$ WeatherDelay : int NA NA NA NA 0 NA 0 NA NA NA ...
$ NASDelay : int NA NA NA NA 0 NA 0 NA NA NA ...
$ SecurityDelay : int NA NA NA NA 0 NA 0 NA NA NA ...
$ LateAircraftDelay: int NA NA NA NA 32 NA 47 NA NA NA ...
my graph:
library(ggplot2)
ggplot(airline,aes(x = DayOfWeek, fill = factor(DepDelay))) +
geom_histogram(binwidth = 1) +
xlab ("Day of week") +
ylab ("Dep Delay") +
labs (fill = "Airline")
To a great extent it depends on what you want to show. I made a small example using the flights data available in the nycflights13 package. Using the code below, you can experiment with charts that meet your analytical requirements.
Code
# Libs and data -----------------------------------------------------------
Vectorize(require)(package = c("nycflights13", "ggplot2", "ggthemes",
"dplyr"),
character.only = TRUE)
# Work -------------------------------------------------------------------
flights %>%
  # Create week day summary
  # across() (dplyr >= 1.0) replaces the deprecated mutate_each(funs(...), 1:3)
  mutate(across(c(year, month, day), as.character)) %>%
  mutate(date = as.Date(paste(year, month, day, sep = "-"))) %>%
  mutate(weekday = weekdays(date, abbreviate = FALSE)) %>%
  group_by(weekday, carrier) %>%
  # drop rows with missing values before averaging
  na.omit() %>%
  summarise(mean_dl = round(mean(dep_delay), 2)) %>%
  ggplot(aes(x = as.factor(weekday), y = mean_dl)) +
  geom_bar(stat = "identity") +
  facet_wrap(~carrier) +
  xlab("Day") +
  ylab("Mean Dep Delay") +
  theme_wsj() +
  theme(axis.text.x = element_text(angle = 90))
Results
For example, this could be a modest start:
If you want to get a better answer, I would suggest that you have a look at this discussion on producing a good R example. I would further take the liberty of suggesting that you:
Post a neat data extract that would be easy for other colleagues to work with.
Elaborate more on the problem you are facing with respect to the particular chart you want to develop.
Comparing flight delays
You can make further use of the dplyr grammar to compare on-time and delayed flights.
Code
For example, you could use the code below to count the on-time and the delayed flights for each day:
flights %>%
  # Create week day summary
  # across() (dplyr >= 1.0) replaces the deprecated mutate_each(funs(...), 1:3)
  mutate(across(c(year, month, day), as.character)) %>%
  mutate(date = as.Date(paste(year, month, day, sep = "-"))) %>%
  mutate(weekday = weekdays(date, abbreviate = FALSE)) %>%
  # Create flag for on time / dly; early departures (negative delay) count as on-time
  mutate(ontime = ifelse(dep_delay <= 0, "on-time", "delayed")) %>%
  group_by(weekday, ontime) %>%
  na.omit() %>%
  summarise(count_flights = n())
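The same two summaries can also be sketched in base R against the column names shown in the question's str() output (DayOfWeek, DepDelay). The data frame airline_toy below is randomly generated for illustration and stands in for the real airline file; treating any positive departure delay as "delayed" is an assumption:

```r
# Made-up stand-in for the airline data, using the question's column names.
set.seed(7)
airline_toy <- data.frame(
  DayOfWeek = sample(1:7, 500, replace = TRUE),
  DepDelay  = round(rnorm(500, mean = 8, sd = 20))
)
airline_toy$DepDelay[sample(500, 20)] <- NA  # some flights have no recorded delay

# Mean departure delay per day of week; the formula interface of
# aggregate() drops rows with NA by default.
mean_delay <- aggregate(DepDelay ~ DayOfWeek, data = airline_toy, FUN = mean)

# Share of delayed flights per day (assumption: delayed means DepDelay > 0).
airline_toy$delayed <- airline_toy$DepDelay > 0
share_delayed <- aggregate(delayed ~ DayOfWeek, data = airline_toy, FUN = mean)

print(merge(mean_delay, share_delayed, by = "DayOfWeek"))

# A bar chart of seven means is usually easier to read than a histogram here.
barplot(mean_delay$DepDelay, names.arg = mean_delay$DayOfWeek,
        xlab = "Day of week", ylab = "Mean departure delay (min)")
```

Summarising first and plotting the seven resulting values avoids asking the plotting layer to bin millions of rows.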

Creating decision tree

I have a CSV file (298 rows and 24 columns) and I want to create a decision tree to predict the column "salary". I have installed the tree package and loaded it with the library function.
But when i try to create the decision tree:
model<-tree(salary~.,data)
I get the error below:
Error in tree(salary ~ ., data) :
  factor predictors must have at most 32 levels
What is wrong with that? Data is as follows:
Name bat hit homeruns runs
1 Alan Ashby 315 81 7 24
2 Alvin Davis 479 130 18 66
3 Andre Dawson 496 141 20 65
...
team position putout assists errors
1 Hou. C 632 43 10
2 Sea. 1B 880 82 14
3 Mon. RF 200 11 3
salary league87 team87
1 475 N Hou.
2 480 A Sea.
3 500 N Chi.
And this is the output of str(data):
'data.frame': 263 obs. of 24 variables:
$ Name : Factor w/ 263 levels "Al Newman","Alan Ashby",..: 2 7 8 10 6 1 13 11 9 3 ...
$ bat : int 315 479 496 321 594 185 298 323 401 574 ...
$ hit : int 81 130 141 87 169 37 73 81 92 159 ...
$ homeruns : int 7 18 20 10 4 1 0 6 17 21 ...
$ runs : int 24 66 65 39 74 23 24 26 49 107 ...
$ runs.batted : int 38 72 78 42 51 8 24 32 66 75 ...
$ walks : int 39 76 37 30 35 21 7 8 65 59 ...
$ years.in.major.leagues : int 14 3 11 2 11 2 3 2 13 10 ...
$ bats.during.career : int 3449 1624 5628 396 4408 214 509 341 5206 4631 ...
$ hits.during.career : int 835 457 1575 101 1133 42 108 86 1332 1300 ...
$ homeruns.during.career : int 69 63 225 12 19 1 0 6 253 90 ...
$ runs.during.career : int 321 224 828 48 501 30 41 32 784 702 ...
$ runs.batted.during.career: int 414 266 838 46 336 9 37 34 890 504 ...
$ walks.during.career : int 375 263 354 33 194 24 12 8 866 488 ...
$ league : Factor w/ 2 levels "A","N": 2 1 2 2 1 2 1 2 1 1 ...
$ division : Factor w/ 2 levels "E","W": 2 2 1 1 2 1 2 2 1 1 ...
$ team : Factor w/ 24 levels "Atl.","Bal.",..: 9 21 14 14 16 14 10 1 7 8 ...
$ position : Factor w/ 23 levels "1B","1O","23",..: 10 1 20 1 22 4 22 22 13 22 ...
$ putout : int 632 880 200 805 282 76 121 143 0 238 ...
$ assists : int 43 82 11 40 421 127 283 290 0 445 ...
$ errors : int 10 14 3 4 25 7 9 19 0 22 ...
$ salary : num 475 480 500 91.5 750 ...
$ league87 : Factor w/ 2 levels "A","N": 2 1 2 2 1 1 1 2 1 1 ...
$ team87 : Factor w/ 24 levels "Atl.","Bal.",..: 9 21 5 14 16 13 10 1 7 8 ...
The issue is almost certainly that you're including the Name variable in your model, as it has too many factor levels. I would also remove it from a methodological standpoint, but this probably isn't the place for that discussion. Try:
train <- data
train$Name <- NULL
model<-tree(salary~.,train)
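To check which columns would trigger the 32-level limit before fitting, you can count the levels of each factor column. A minimal sketch on a made-up data frame (players is hypothetical and only mimics the structure in the question):

```r
# Made-up data: Name has one level per row, like in the question.
set.seed(3)
players <- data.frame(
  Name   = paste("Player", 1:40),
  team   = sample(c("Atl.", "Bal.", "Hou."), 40, replace = TRUE),
  hit    = sample(50:200, 40),
  salary = runif(40, min = 70, max = 2000),
  stringsAsFactors = TRUE
)

# Flag factor columns with more than 32 levels.
too_many <- vapply(players,
                   function(col) is.factor(col) && nlevels(col) > 32,
                   logical(1))
print(names(players)[too_many])

# Keep everything else for modelling.
train <- players[, !too_many, drop = FALSE]
print(names(train))
```

Here only Name is flagged, which matches the column removed by hand above.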
It seems that your salary is a factor vector, while you are trying to perform a regression, so it should be a numeric vector. Simply convert your salary to numeric, and it should work just fine. For more details, read the library's help:
http://cran.r-project.org/web/packages/tree/tree.pdf
Usage
tree(formula, data, weights, subset, na.action = na.pass,
     control = tree.control(nobs, ...), method = "recursive.partition",
     split = c("deviance", "gini"), model = FALSE, x = FALSE, y = TRUE,
     wts = TRUE, ...)
Arguments
formula: A formula expression. The left-hand-side (response) should be either a numerical vector when a regression tree will be fitted or a factor, when a classification tree is produced. The right-hand-side should be a series of numeric or factor variables separated by +; there should be no interaction terms. Both . and - are allowed: regression trees can have offset terms.
(...)
Depending on what exactly is stored in your salary variable, the conversion can be more or less tricky, but this should generally work:
salary = as.numeric(levels(salary))[salary]
EDIT
As pointed out in the comments, the actual error refers to a predictor in the data variable rather than to salary. If that predictor really holds numerical data, it can be converted to numeric to solve the issue; if it has to stay a factor, you will need another model or fewer levels. You can also encode these factors numerically by hand (for example, by defining as many binary features as there are levels), but this can greatly enlarge your input space.
EDIT2
It seems that you first have to decide what you are trying to model. You are trying to predict salary, but based on what? Your data seems to consist of players' records, so their names are certainly the wrong type of data to use for this prediction (in particular, they are probably causing the 32 levels error). You should remove from the data variable all the columns which should not be used for building a prediction. I do not know the exact aim here (as there is no information about it in the question), so I can only guess that you are trying to predict a person's salary based on his/her stats, so you should remove from the input data: players' names, players' teams and obviously salaries (as predicting X using X is not a good idea ;)).
