I am currently working on a project and I need some help. I want to predict the length of flight delays using a statistical model. The data set does not contain the length of the flight delays, but it can be calculated from the actual and scheduled departure times: actual departure time minus scheduled departure time gives the flight delay, which is the dependent variable. I am struggling to get the explanatory (independent) variables into a useful form for regression analysis; the main problem is the time format of the first two columns when you read the table in from the csv file. I have attached the data file to the question below because I wasn't too sure how to attach my file; I'm new to this coding thing hehe. Any help will be appreciated. xx
https://drive.google.com/file/d/11BXmJCB5UGEIRmVkM-yxPb_dHeD2CgXa/view?usp=sharing
EDIT:
Firstly, thank you for all the help.
Okay, I'm going to try to ask more precise questions on this topic:
So after importing the file using:
1)
Delays <- read.table("FlightDelaysSM.csv",header =T,sep=",")
2) The main issue I am having is getting the columns schedtime and deptime into a format where I can do arithmetic calculations.
3)I tried the below
Delays[,1] - Delays[,2]
where the obvious issue arises: for example, 800 (8:00 am) - 756 (7:56 am) = 44, not the 4 minutes it should be.
4) Using the help from @Kerry Jackson (thank you, you're amazing x) I tried
DepartureTime <- strptime(formatC(Delays$deptime, width = 4, format = "d", flag = "0"), "%H%M")
ScheduleTime <- strptime(formatC(Delays$schedtime, width = 4, format = "d", flag = "0"), "%H%M")
DelayTime = DepartureTime - ScheduleTime
The values given are in seconds; I want the difference to be in minutes. How would I go about doing this? (See the sketch after point 5.)
5) I then did the following:
DelayData <- data.frame(ScheduleTime, DepartureTime, DelayTime, Delays[, 4:7])
What I obtain after making DelayData:
As you can see from the image, the column called DelayTime has the seconds units, which I don't want as stated in 4), and the date appears in the ScheduleTime and DepartureTime columns. Could I possibly get some suggestions on how to correct this?
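A possible way to address points 4) and 5) above (a sketch only, not verified against the attached file): ask difftime() for minutes explicitly instead of relying on its default units, and format() the parsed times back to "HH:MM" strings so the date that strptime() fills in disappears from the data frame.
# DepartureTime and ScheduleTime as parsed with strptime() in 4) above
DelayTime <- as.numeric(difftime(DepartureTime, ScheduleTime, units = "mins"))  # delay in whole minutes, plain numeric
DelayData <- data.frame(ScheduleTime = format(ScheduleTime, "%H:%M"),    # keep only the time of day
                        DepartureTime = format(DepartureTime, "%H:%M"),
                        DelayTime,
                        Delays[, 4:7])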
Create a new column called flight_delay:
install.packages('tidyverse')
library(tidyverse)
your_data <- your_data %>%
  mutate(flight_delay = deptime - schedtime)
Now, create a linear regression model predicting flight_delay by every other variable:
mod <- lm(flight_delay ~ ., data=your_data)
To optimize your model, use the step function:
mod <- step(mod)
Analyze results:
summary(mod)
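One caveat, not part of the original answer: if deptime and schedtime are raw hhmm integers (as in the question), deptime - schedtime reproduces the 800 - 756 = 44 problem. A minimal sketch, assuming that hhmm encoding, converts both to minutes past midnight before taking the difference:
your_data <- your_data %>%
  mutate(dep_min = (deptime %/% 100) * 60 + deptime %% 100,        # hhmm -> minutes past midnight
         sched_min = (schedtime %/% 100) * 60 + schedtime %% 100,
         flight_delay = dep_min - sched_min)                       # delay in minutes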
I started learning R three days ago, so please bear with me. If you see any flaws in my code or calculations, please call them out.
I have tried this, but get an error message every time:
table.AnnualizedReturns(Apple.Monthly.Returns[, 2:3, drop = FALSE], scale = 12,
Rf = 0, geometric = TRUE, digits = 4)
Error in checkData(R) :
The data cannot be converted into a time series. If you are trying to pass in names from a data object with one column, you should use the form 'data[rows, columns, drop = FALSE]'. Rownames should have standard date formats, such as '1985-03-15'.
As you can clearly see I have no clue what I am doing.
This is every line of code I have written this far:
Dates <- Data_Task2$`Names Date`[1801:2270]
as.numeric(Dates)
Dates <- ymd(Dates)
Monthly.Return <- Data_Task2$Returns[1801:2270]
Monthly.Return <- as.numeric(Monthly.Return)
Apple.Monthly.Returns <- data.frame(Dates, Monthly.Return)
Log.return = log(Monthly.Return + 1)
Apple.Monthly.Returns$Log.return = log(Apple.Monthly.Returns$Monthly.Return + 1)
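The error message is asking for a time-series-like object whose index (or rownames) are dates, not a plain data frame. A minimal sketch of one way to provide that, assuming the xts and PerformanceAnalytics packages are installed and that Dates parsed correctly above:
library(xts)
library(PerformanceAnalytics)
# index the two return columns by the Dates column so checkData() sees a time series
Apple.xts <- xts(as.matrix(Apple.Monthly.Returns[, c("Monthly.Return", "Log.return")]),
                 order.by = Apple.Monthly.Returns$Dates)
table.AnnualizedReturns(Apple.xts, scale = 12, Rf = 0, geometric = TRUE, digits = 4)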
You should check out the Tidyverse and specifically dplyr (https://dplyr.tidyverse.org/).
This gets you to a good starting point:
https://www.r-bloggers.com/2014/03/using-r-quickly-calculating-summary-statistics-with-dplyr/
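In the spirit of the linked post, a small sketch of dplyr-style summaries using the Apple.Monthly.Returns data frame built in the question (assuming Dates is a Date column):
library(dplyr)
Apple.Monthly.Returns %>%
  group_by(year = format(Dates, "%Y")) %>%                 # summarise the monthly returns by year
  summarise(mean_return = mean(Monthly.Return, na.rm = TRUE),
            sd_return = sd(Monthly.Return, na.rm = TRUE),
            n = n())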
I am having trouble calculating the average sentiment of each row in a relatively big dataset (N = 36,140).
My dataset contains review data from an app on the Google Play Store (each row represents one review) and I would like to calculate the sentiment of each review using the sentiment_by() function.
The problem is that this function takes a lot of time to run.
Here is the link to my dataset in .csv format:
https://drive.google.com/drive/folders/1JdMOGeN3AtfiEgXEu0rAP3XIe3Kc369O?usp=sharing
I have tried using this code:
library(sentimentr)
e_data = read.csv("15_06_2016-15_06_2020__Sygic.csv", stringsAsFactors = FALSE)
sentiment=sentiment_by(e_data$review)
Then I get the following warning message (after I cancel the process once 10+ minutes have passed):
Warning message:
Each time `sentiment_by` is run it has to do sentence boundary disambiguation when a
raw `character` vector is passed to `text.var`. This may be costly of time and
memory. It is highly recommended that the user first runs the raw `character`
vector through the `get_sentences` function.
I have also tried using the get_sentences() function first, as shown below, but sentiment_by() still takes a very long time to run:
e_sentences = e_data$review %>%
get_sentences()
e_sentiment = sentiment_by(e_sentences)
I have been using the sentiment_by() function on Google Play Store review datasets for the past month and it always calculated sentiment very quickly; the calculations only started taking this long yesterday.
Is there a way to quickly calculate sentiment for each row of a big dataset?
The algorithm used in sentiment appears to be O(N^2) once you get above 500 or so individual reviews, which is why it's suddenly taking a lot longer when you upped the size of the dataset significantly. Presumably it's comparing every pair of reviews in some way?
I glanced through the help file (?sentiment) and it doesn't seem to do anything which depends on pairs of reviews so that's a bit odd.
library(data.table)
reviews <- iconv(e_data$review, "") # I had a problem with UTF-8, you may not need this
x1 <- rbindlist(lapply(reviews[1:10],sentiment_by))
x1[,element_id:=.I]
x2 <- sentiment_by(reviews[1:10])
These produce effectively the same output, which means that the sentimentr package has a bug in it causing it to be unnecessarily slow.
One solution is just to batch the reviews. This will break the 'by' functionality in sentiment_by, but I think you should be able to group them yourself before you send them in (or after, as it doesn't seem to matter).
batch_sentiment_by <- function(reviews, batch_size = 200, ...) {
  # split the reviews into chunks of batch_size
  review_batches <- split(reviews, ceiling(seq_along(reviews) / batch_size))
  # score each chunk separately, then stack the results
  x <- rbindlist(lapply(review_batches, sentiment_by, ...))
  x[, element_id := .I]  # renumber element_id across the whole result
  x[]
}
batch_sentiment_by(reviews)
Takes about 45 seconds on my machine (and should be O(N) for bigger datasets).
I have a sample usage table with the columns 'Account', 'Asset', 'Date', and 'Asset Network Usage', containing 15 days of summarised usage data per Asset. I am trying to append forecasted usage per day over the next 15 days to the table, or at least create an output with the same table structure.
E.g.
Date (m/d/Y) Account Asset Network Usage
4/4/2019 Acct#100 AS-4310 56.5251
4/5/2019 Acct#100 AS-4310 592.1843
4/6/2019 Acct#100 AS-4310 556.1898
4/7/2019 Acct#100 AS-4310 808.2403
4/8/2019 Acct#100 AS-4310 466.118
I've been able to produce the appended table when aggregating only by Date. I want to include Date / Account / Asset; however, I'm struggling to set up an index that doesn't run into an error in the ts() time-series function.
library(forecast)
library(ggfortify)
dataset <- as.data.frame(read.csv(file = "/path/Data.csv", header = TRUE, sep = ","))
dataset <- aggregate(Network_Usage ~ Date, data = dataset, FUN = sum)
ts <- ts(dataset$Network_Usage, frequency = 15)
decom <- stl(ts,s.window = "periodic")
pred <- forecast(decom,h = 15)
fort <- fortify(pred,ts.connect= TRUE )
Any suggestions on syntax updates, or use of a different method to achieve my outcome?
I think forecast only works on objects convertible to matrices; my suggestion is to use lists, predicting the "values" while keeping the relevant information about everything else in other list elements (a rough sketch follows below).
If you provide a dput() of your dataset I can create an example for you.
Good luck.
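In the meantime, a hedged sketch of the split-and-forecast idea, assuming columns named Date, Account, Asset, and Network_Usage as in the code above, and that each group has enough observations for the chosen model (with only 15 daily points, a simple model such as ets() is more realistic than stl()):
library(forecast)
forecast_by_group <- function(dataset, h = 15) {
  groups <- split(dataset, list(dataset$Account, dataset$Asset), drop = TRUE)
  do.call(rbind, lapply(groups, function(g) {
    g <- g[order(as.Date(g$Date, "%m/%d/%Y")), ]   # make sure each series is in date order
    fit <- ets(ts(g$Network_Usage))                # exponential smoothing on the short series
    fc <- forecast(fit, h = h)
    data.frame(Date = max(as.Date(g$Date, "%m/%d/%Y")) + seq_len(h),
               Account = g$Account[1],
               Asset = g$Asset[1],
               Network_Usage = as.numeric(fc$mean))
  }))
}
forecast_by_group(dataset)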
Like this, we have 500 entries. Entries may be repeated.
The date represents the date on which that particular car part (carparts) was malfunctioning. We have to predict on which date a car part (carparts) is going to malfunction.
The code is written in R. The code to generate the table is below:
q<-c("Mercedes","Audi","Tata","Renault","Ferrari","Lamborgini")
w<-sample(q,500,replace=TRUE)
m <- c("accelerator", "gear", "coolant", "brakes", "airbags")
k <- sample(m, 500, replace=TRUE)
e <- seq(as.Date("2010/1/1"), as.Date("2011/1/1"), by="days")
l <- sample(e, 500, replace=TRUE)
test <- list(w,k, l)
t2 <- as.data.frame(test)
names(t2) <- c("carnames","carparts", "date")
t2$Diffdate<-as.numeric(t2$date-as.Date("2010-01-01"))
head(t2)
I'm preparing my data for survival analysis. In the code above I haven't included the censor and event variables (I tried a rough draft and it got messy). I just need an idea of how to include the event and censor variables along with the carparts and carnames variables. I'm getting stuck because I'm unable to frame all the variables in a single table.
Two problems that I'm facing are:
1> I can't find a way to keep the carparts, carnames, event, and censor variables in one table.
2> The event variable is always 1 in each entry, because for each entry (row) there is a breakdown/defect of a car part happening. Is it okay to have it that way? In all the examples I saw on the internet, the event variable had both ones and zeros.
Edit 1: It's not necessary to do it in R; you may write it down (draw the table including the columns present as well as the censor and event variables) on a piece of paper and attach a snapshot.
Thanks
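For what it's worth, a hedged sketch of how an event column could sit next to carnames and carparts in the same table, assuming every recorded row is an observed breakdown (event = 1) and that any parts still working at the end of the study would be added as extra rows with event = 0:
library(survival)
t2$event <- 1   # every recorded row is an observed breakdown, so event = 1 throughout
# censored rows (event = 0) would be added here for parts still working at study end, if you have them
fit <- survfit(Surv(Diffdate, event) ~ carparts, data = t2)
summary(fit)
On question 2>: an event value of 1 on every row simply means there is no censoring in the data; zeros only appear when some units have not yet failed (or are lost to follow-up) by the end of observation.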
What I have done so far:
read.csv("filename.csv", header=TRUE)
df$header1
df$header2
Now I want to calculate the rate of change: header1 is dates, header2 is prices.
I want the rate of change by date for all values, relative to the preceding date.
I want to generate two separate rate-of-change columns by performing the same operation on another file.
Once the rate of change is available for both data sets, I will look to find the net rate of change for each date.
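Looking ahead to that last step, a hedged sketch of combining the two rate series by date (df1 and df2 are hypothetical names for the two files once read in, and the "net" combination below is just a placeholder for whatever net means for your data):
rates1 <- data.frame(Date = df1$Date[-1], rate1 = 100 * diff(df1$Close) / df1$Close[-nrow(df1)])
rates2 <- data.frame(Date = df2$Date[-1], rate2 = 100 * diff(df2$Close) / df2$Close[-nrow(df2)])
net <- merge(rates1, rates2, by = "Date")       # keep only dates present in both files
net$net_rate <- net$rate1 - net$rate2           # or sum, or whichever combination you need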
Where I am now:
df<-read.csv("audusd.csv", head = TRUE)
df$Date
df$Close
rate<- 100*diff(df$Close/df[-nrow(df),]$Close
When I run this, all I get in the console is:
> rate<- 100*diff(df$Close/df[-nrow(df),]$Close
+
+
Thanks for all the help till now, please help further. :)
Also, I am a bit confused about whether I can get the results into a file, or do I have to run the code each time?
This certainly would help with my work in a big way, if I can understand and start using it.
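On the "results into a file" question: once a rate vector has been computed (as in the answer below), one hedged option is to save it alongside the dates with write.csv() and read it back later instead of recomputing each time (the file name here is just an example):
results <- data.frame(Date = df$Date[-1], rate = rate)   # rate has one fewer element than df
write.csv(results, "audusd_rate_of_change.csv", row.names = FALSE)
# later: results <- read.csv("audusd_rate_of_change.csv")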
You can also use the diff(...) function, which allows you to avoid looping through rows:
rate <- 100*diff(df$price)/df[-nrow(df),]$price
diff(...) calculates the difference row(n) - row(n-1). df[-nrow(df),] returns all rows of df except the last.
Here's a complete example with working code.
# all this just to get sample data.
# daily close for AAPL from 2013.01.01 - today
library(tseries)
library(zoo)
ts <- get.hist.quote(instrument="AAPL",
start="2013-01-01", end="2014-01-01",
quote="AdjClose", provider="yahoo", origin="1970-01-01",
compression="d", retclass="zoo")
df <- data.frame(ts)
df <- data.frame(date=as.Date(rownames(df)),price=df$AdjClose)
df <- df[!is.na(df$price),]
# calculate daily rate of change...
rate <- 100*diff(df$price)/df[-nrow(df),]$price
plot(df[-nrow(df),]$date,rate,type="l",xlab="2013",ylab="Pct. Change",main="AAPL")
Given what you said in the comments about how you would do it in Excel, here is the R equivalent.
dataset <- (1:10)^3  # sample data; in your case, one of your columns
ratedata <- NULL     # create an empty object so R can put data in it
for (i in 1:(length(dataset) - 1)) {
  # a simple for loop doing what you specified
  ratedata[i] <- (dataset[i + 1] - dataset[i]) / dataset[i] * 100
}
ratedata             # return the rate data so you can look at it
FYI: This only works if your time points are at regular intervals.