Interval classification in R

I have two dataframes as follows
str(daily)
Classes ‘grouped_df’,‘tbl_df’,‘tbl’ and 'data.frame':15264 obs.of 3 variables:
$ steps : int 0 0 0 0 0 0 0 0 0 0 ...
$ date : Date, format: "2012-10-02" "2012-10-02" "2012-10-02" ...
$ interval: int 0 5 10 15 20 25 30 35 40 45 ...
interval<-data.frame(unique(daily$interval))
str(interval)
'data.frame': 288 obs. of 1 variable:
$ unique.daily.interval.:int 0 5 10 15 20 25 30 35 40 45 50 55 100..2350 2355
Using dplyr, I intended to find the mean of daily$steps for each interval across all values of daily$date, using the following:
mutate(daily,class=cut(daily$steps,c(0,interval$unique.daily.interval.),
include.lowest = TRUE) %>%
group_by(class) %>%
summarise(Mean = mean(daily$steps)))
The code fails with the following error:
Error: 'breaks' are not unique
which I have isolated to the class = cut(...) call. I have checked the interval data frame for uniqueness; it contains only 288 values. Can someone point out what I am doing wrong? Here is a reference I used: Create class intervals in r and sum values
Here is a link to the data in question: Activity monitoring data
Thanks.
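A possible explanation, offered as a sketch rather than a confirmed diagnosis: the interval column already contains 0, so c(0, interval$unique.daily.interval.) lists 0 twice and cut() rejects the duplicated break. The example below reproduces this on a small made-up data frame (the values are invented stand-ins for daily) and then computes the per-interval mean by grouping on interval directly, which does not need cut() at all.

```r
# Toy stand-in for the asker's `daily` data (values are made up).
daily <- data.frame(
  steps    = c(0, 10, 4, 6, 2, 8),
  date     = as.Date(c("2012-10-02", "2012-10-02", "2012-10-02",
                       "2012-10-03", "2012-10-03", "2012-10-03")),
  interval = c(0L, 5L, 10L, 0L, 5L, 10L)
)

# Prepending 0 to breaks that already contain 0 creates a duplicate.
breaks <- c(0, unique(daily$interval))
has_duplicate <- anyDuplicated(breaks) > 0

# Capture the error message instead of stopping the script.
cut_result <- tryCatch(
  cut(daily$steps, breaks, include.lowest = TRUE),
  error = function(e) conditionMessage(e)
)

# Mean steps per interval, averaged across dates (base R).
interval_means <- aggregate(steps ~ interval, data = daily, FUN = mean)
```

With dplyr, the same grouping could be written as daily %>% group_by(interval) %>% summarise(Mean = mean(steps)). Note the bare steps inside summarise(): daily$steps would ignore the grouping and return the overall mean for every group.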

Related

R round correlate function from corrr package

I'm creating a correlation table using the correlate function in the corrr package. Here is my code and a screenshot of the output.
correlation_table <- corrr::correlate(salary_professor_dataset_cor_table,
method = "pearson")
correlation_table
I think this would look better and be easier to read if I could round off the values in the correlation table. I tried this code:
correlation_table <- round(corrr::correlate(salary_professor_dataset_cor_table,
method = "pearson"),2)
But I get this error:
Error in Math.data.frame(list(term = c("prof_rank_factor", "yrs.since.phd", : non-numeric variable(s) in data frame: term
The non-numeric variables part of this error message doesn't make sense to me. When I look at the structure I only see integer or numeric variable types.
'data.frame': 397 obs. of 6 variables:
$ prof_rank_factor : num 3 3 1 3 3 2 3 3 3 3 ...
$ yrs.since.phd : int 19 20 4 45 40 6 30 45 21 18 ...
$ yrs.service : int 18 16 3 39 41 6 23 45 20 18 ...
$ salary : num 139750 173200 79750 115000 141500 ...
$ sex_factor : num 1 1 1 1 1 1 1 1 1 2 ...
$ discipline_factor: num 2 2 2 2 2 2 2 2 2 2 ...
How can I clean up this correlation table with rounded values?
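The tibble returned by correlate() has a character column, term, that holds the variable names, which is why round() on the whole data frame fails. After calling correlate(), round only the numeric columns: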
library(dplyr)
corrr::correlate(salary_professor_dataset_cor_table,
                 method = "pearson") %>%
  # round every numeric column and leave the character column term untouched
  mutate(across(where(is.numeric), ~ round(.x, digits = 2)))
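To see why the original call fails and what the column-wise rounding does, here is a minimal base R sketch. It uses an invented two-variable data frame as a stand-in for the correlate() output, so the values and the shape are assumptions for illustration only.

```r
# Toy stand-in for a correlate() result: one character column plus numeric ones.
# (Values are invented for illustration.)
cor_tbl <- data.frame(
  term        = c("salary", "yrs.service"),
  salary      = c(NA, 0.3347),
  yrs.service = c(0.3347, NA),
  stringsAsFactors = FALSE
)

# round() on the whole data frame fails because `term` is not numeric.
whole_frame <- tryCatch(round(cor_tbl, 2), error = function(e) "error")

# Rounding only the numeric columns works and keeps `term` as it is.
is_num  <- vapply(cor_tbl, is.numeric, logical(1))
rounded <- cor_tbl
rounded[is_num] <- lapply(cor_tbl[is_num], round, digits = 2)
```

The across(where(is.numeric), ...) call in the answer does the same selection in dplyr.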
We can use:
options(digits=2)
correlation_table <- corrr::correlate(salary_professor_dataset_cor_table,
method = "pearson")
correlation_table

How to create independent data frames in a loop in R

Good evening everybody,
I'm stuck on how to construct a for loop. My code has no errors, but I'd like to understand how to create "independent" data frames (copies of the original with some differences).
I wrote the code step by step (it works), but I think there may be a way to make it more compact with a for loop.
x is my original data.frame
str(x)
Classes ‘data.table’ and 'data.frame': 13500 obs. of 6 variables:
$ a: int 1 56 1058 567 987 574 1001...
$ b: int 10 5 10 5 5 10 10 5 10 10 ...
$ c: int NA NA NA NA NA NA NA NA NA NA ...
$ d: int 0 0 0 0 0 0 0 0 0 0 ...
$ e: int 0 0 0 0 0 0 0 0 0 0 ...
$ f: int 22 22 22 22 22 22 22 22 22 22 ...
My first goal is to delete, for every column, any NA and "" elements. I do this with the following lines of code:
x_b<- x[!(!is.na(x$b) & x$b==""), ]
x_c<- x[!(!is.na(x$c) & x$c==""), ]
x_d<- x[!(!is.na(x$d) & x$d==""), ]
x_e<- x[!(!is.na(x$e) & x$e==""), ]
x_f<- x[!(!is.na(x$f) & x$f==""), ]
After this, the second goal is to create, for each new data frame, an ID code using the function paste0(x_b$a, x_b$f).
x_b$ID_1<-paste0(x_b$a, x_b$b)
x_c$ID_2<-paste0(x_c$a, x_c$c)
x_d$ID_3<-paste0(x_d$a, x_d$d)
x_e$ID_4<-paste0(x_e$a, x_e$e)
x_f$ID_5<-paste0(x_f$a, x_f$f)
I wrote this for loop to try to reduce the number of lines and make the code easier to read.
z<-data.frame("a", "b","c","d","e","f")
zy<-data.frame("x_b", "x_c", "x_d", "x_e", "x_f")
for(i in z) {
for (j in zy ) {
target <- paste("_",i)
x[[i]]<-(!is.na(x[[i]]) & x[[i]]=="") #with this I am able to create a column in the x data.frame,
#but if I use a new data frame the loop doesn't work.
#I don't want to overwrite the column; I'd like to create a
#separate data frame for each transformation.
#At this point of the script I should have new, separate
#data frames (x_b, x_c, x_d, x_e, x_f), but I don't know
#how to create them.
#If I had these data frames I would run this other function
#in the for loop:
zy[[ID]]<-paste0(x_b$a, "_23X")
}
}
I'd like to have this as output:
str(x_b)
Classes ‘data.table’ and 'data.frame': 13500 obs. of 7 variables:
$ a: int 1 56 1058 567 987 574 1001...
$ b: int 10 5 10 5 5 10 10 5 10 10 ...
$ c: int NA NA NA NA NA NA NA NA NA NA ...
$ d: int 0 0 0 0 0 0 0 0 0 0 ...
$ e: int 0 0 0 0 0 0 0 0 0 0 ...
$ f: int 22 22 22 22 22 22 22 22 22 22 ...
$ ID: chr "1_23X" "56_23X" "1058_23X" "567_23X" "987_23X" "574_23X" "1001_23X"...
and so on.
I think I am missing some important concept about data frames.
Where am I going wrong?
Thank you so much in advance for your support.
There is a simple way to do this with the tidyverse packages.
First goal: drop the rows that contain NA with drop_na() from tidyr:
drop_na(df)
You can also use na_if() from dplyr first if you want to convert "" to NA.
Second goal: use mutate() to create a new variable:
df <- df %>%
  mutate(id = paste0(a, "_23X"))
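For the "one data frame per column" part of the question, a common base R pattern is to collect the results in a named list instead of creating separate variables inside a loop. The sketch below assumes a small invented data frame in place of x, drops both NA and "" for the column in question (slightly stricter than the filter in the question), and uses a hypothetical ID format.

```r
# Toy stand-in for the asker's `x` (values are invented).
x <- data.frame(
  a = c(1L, 56L, 1058L),
  b = c("10", "", "5"),
  c = c(NA, "7", ""),
  stringsAsFactors = FALSE
)

cols <- c("b", "c")

# Build one filtered copy of x per column, collected in a named list.
frames <- lapply(cols, function(col) {
  keep <- !is.na(x[[col]]) & x[[col]] != ""   # drop NA and "" in this column
  out <- x[keep, , drop = FALSE]
  out$ID <- paste0(out$a, "_", out[[col]])    # hypothetical ID format
  out
})
names(frames) <- paste0("x_", cols)

frames$x_b   # the copy filtered on column b; x itself is unchanged
```

Each element (frames$x_b, frames$x_c, ...) is an independent data frame, and further steps can be applied to all of them with another lapply() call.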

High OOB error rate for random forest

I am trying to develop a model to predict the WaitingTime variable. I am running a random forest on the following dataset.
$ BookingId : Factor w/ 589855 levels "00002100-1E20-E411-BEB6-0050568C445E",..: 223781 471484 372126 141550 246376 512394 566217 38486 560536 485266 ...
$ PickupLocality : int 1 67 77 -1 33 69 67 67 67 67 ...
$ ExZone : int 0 0 0 0 1 1 0 0 0 0 ...
$ BookingSource : int 2 2 2 2 2 2 7 7 7 7 ...
$ StarCustomer : int 1 1 1 1 1 1 1 1 1 1 ...
$ PickupZone : int 24 0 0 0 6 11 0 0 0 0 ...
$ ScheduledStart_Day : int 14 20 22 24 24 24 31 31 31 31 ...
$ ScheduledStart_Month : int 6 6 6 6 6 6 7 7 7 7 ...
$ ScheduledStart_Hour : int 14 17 7 2 8 8 1 2 2 2 ...
$ ScheduledStart_Minute : int 6 0 58 55 53 54 54 0 12 19 ...
$ ScheduledStart_WeekDay: int 1 7 2 4 4 4 6 6 6 6 ...
$ Season : int 1 1 1 1 1 1 1 1 1 1 ...
$ Pax : int 1 3 2 4 2 2 2 4 1 4 ...
$ WaitingTime : int 45 10 25 5 15 25 40 15 40 30 ...
I split the dataset into training and test subsets (80%/20%) using sample() and then run a random forest excluding the BookingId factor, which is only used to validate the predictions.
set.seed(1)
index <- sample(1:nrow(data),round(0.8*nrow(data)))
train <- data[index,]
test <- data[-index,]
library(randomForest)
extractFeatures <- function(data) {
  features <- c("PickupLocality",
                "BookingSource",
                "StarCustomer",
                "ScheduledStart_Month",
                "ScheduledStart_Day",
                "ScheduledStart_WeekDay",
                "ScheduledStart_Hour",
                "Season",
                "Pax")
  fea <- data[, features]
  return(fea)
}
rf <- randomForest(extractFeatures(train), as.factor(train$WaitingTime), ntree=600, mtry=2, importance=TRUE)
The problem is that every attempt to decrease the OOB error rate and increase the accuracy has failed. The maximum accuracy I managed to achieve was ~23%.
I tried changing the number of features used, different ntree and mtry values, different training/test ratios, and also considering only data with WaitingTime <= 40. My last attempt was to follow MrFlick's suggestion and get the same sample size for all classes of my predicting variable (WaitingTime).
tempdata <- subset(tempdata, WaitingTime <= 40)
rndid <- with(tempdata, ave(tempdata$Season, tempdata$WaitingTime, FUN=function(x) {sample.int(length(x))}))
data <- tempdata[rndid<=27780,]
Do you know of any other way to reach an accuracy above 50%?
Records by WaitingTime class:
Thanks in advance!
Tuning the randomForest hyperparameters will almost certainly not improve your performance significantly.
I would suggest using a regression approach for your data. Since waiting time isn't categorical, a classification approach may not work very well: your classification model loses the ordering information that 5 < 10 < 15, etc.
The first thing to try is a simple linear regression. Bin the predicted values from the test set and recalculate the accuracy. Better? Worse? If it's better, then go ahead and try a randomForest regression model (or, as I would prefer, gradient boosted machines).
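The "regress, then bin" idea can be sketched as follows. The data here are simulated (the variable names hour, pax and wait and the 5-minute grid capped at 40 are assumptions made for the example, not the asker's data), so the resulting numbers say nothing about the real problem.

```r
set.seed(1)

# Simulated data: waiting times recorded in 5-minute steps (illustrative only).
n    <- 500
hour <- sample(0:23, n, replace = TRUE)
pax  <- sample(1:4, n, replace = TRUE)
raw  <- 10 + 0.8 * hour + 2 * pax + rnorm(n, sd = 5)
wait <- pmin(pmax(5 * round(raw / 5), 5), 40)
sim  <- data.frame(hour, pax, wait)

# 80/20 split, as in the question.
idx   <- sample(seq_len(n), round(0.8 * n))
train <- sim[idx, ]
test  <- sim[-idx, ]

# Fit a regression on the numeric target, then snap predictions to the 5-minute grid.
fit    <- lm(wait ~ hour + pax, data = train)
pred   <- predict(fit, newdata = test)
binned <- pmin(pmax(5 * round(pred / 5), 5), 40)

accuracy <- mean(binned == test$wait)     # share of exact class matches
mae      <- mean(abs(pred - test$wait))   # error in minutes, which uses the ordering
```

Reporting the mean absolute error next to the binned accuracy shows how far off the misses are, which a pure classification accuracy hides.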
Secondly, it's possible that your data is simply not predictive of the variable you're interested in. Maybe the data got corrupted somewhere upstream. A good first diagnostic is to calculate the correlation and/or mutual information of the predictors with the outcome.
Also, with so many categorical labels, 23% might actually not be that bad. The probability of correctly labeling a particular data point by random guessing is N_class/N, so the accuracy of a random-guess model is not 50%. You can calculate the adjusted Rand index to show that your model is better than random guessing.
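To put a number on "better than random", it can help to compute a few naive baselines from the class counts. The counts below are invented for a four-class example, and the three baselines are common conventions rather than something taken from the answer above.

```r
# Hypothetical class counts for a 4-class target (numbers are invented).
counts <- c(`5` = 400, `10` = 300, `15` = 200, `20` = 100)
share  <- counts / sum(counts)          # N_class / N for each class

uniform_guess  <- 1 / length(counts)    # guess each class with equal probability
majority_guess <- max(share)            # always predict the most common class
prior_guess    <- sum(share^2)          # guess in proportion to class frequency
```

A model is only worth keeping if its test accuracy clearly exceeds the strongest of these baselines, which is usually the majority-class one.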

How to merge two data frames with non overlapping dates?

I have a data set with the following variables:
steps: Number of steps taken in a 5-minute interval
date: The date on which the measurement was taken in YYYY-MM-DD format
interval: Identifier for the 5-minute interval in which measurement was taken (288 intervals per day)
The main data set:
> head(activityData, 3)
steps date interval
1 1.7169811 2012-10-01 0
2 0.3396226 2012-10-01 5
3 0.1320755 2012-10-01 10
> str(activityData)
'data.frame': 17568 obs. of 3 variables:
$ steps : num 1.717 0.3396 0.1321 0.1509 0.0755 ...
$ date : chr "2012-10-01" "2012-10-01" "2012-10-01" "2012-10-01" ...
$ interval: num 0 5 10 15 20 25 30 35 40 45 ...
The data set covers a range of two months.
I had to divide it into weekdays and weekend days. I did it with the following commands:
> dataAs.xtsWeekday <- dataAs.xts[.indexwday(dataAs.xts) %in% 1:5]
> dataAs.xtsWeekend <- dataAs.xts[.indexwday(dataAs.xts) %in% c(0, 6)]
After doing this I had to make some calculations, which failed, so I decided to export the files and read them back in.
After importing the data again, I made the calculations I wanted and tried to merge the two data sets, but did not succeed.
First data set:
> head(weekdays, 3)
X steps date interval daytype
1 1 37.3826 2012-10-01 0 weekday
2 2 37.3826 2012-10-01 5 weekday
3 3 37.3826 2012-10-01 10 weekday
> str(weekdays)
'data.frame': 12960 obs. of 5 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ steps : num 37.4 37.4 37.4 37.4 37.4 ...
$ date : chr "2012-10-01" "2012-10-01" "2012-10-01" "2012-10-01" ...
$ interval: int 0 5 10 15 20 25 30 35 40 45 ...
$ daytype : chr "weekday" "weekday" "weekday" "weekday" ...
Second data set:
> head(weekend, 3)
X steps date interval daytype
1 1 0 2012-10-06 0 weekend
2 2 0 2012-10-06 5 weekend
3 3 0 2012-10-06 10 weekend
> str(weekend)
'data.frame': 4608 obs. of 5 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ steps : num 0 0 0 0 0 0 0 0 0 0 ...
$ date : chr "2012-10-06" "2012-10-06" "2012-10-06" "2012-10-06" ...
$ interval: int 0 5 10 15 20 25 30 35 40 45 ...
$ daytype : chr "weekend" "weekend" "weekend" "weekend" ...
Now I would like to merge the 2 data sets (weekdays, weekends) by date, but the problem is that I don't have any common dates or anything else common.
The final data set should have 4 columns and 17568 observations.
The columns should be:
steps: Number of steps taken in a 5-minute interval
date: The date on which the measurement was taken in YYYY-MM-DD format
interval: Identifier for the 5-minute interval in which measurement was taken
daytype: weekend days or normal weekdays.
I tried with:
merge
join(plyr)
union
Everywhere I looked, the data sets had a common ID or a common column in both data sets, unlike in my case.
I also looked here and at many other questions, but I did not understand much, and they had nothing in common with my data set.
The other option I thought about was to add a column called "ID" to the original data set and redo everything I have done so far, which is what I'll have to do if I don't find a way around this problem.
I would like some advice on how to proceed or what to try next.
Since you mentioned that your final data set should have 17568 (=4608+12960) observations/rows, I assume you want to stack the two data.frames over each other (and possibly order them by date afterwards). This is done by using rbind().
finaldata <- rbind(weekdays, weekend)
If you want to remove column X:
finaldata$X <- NULL
To convert your date column to actual dates:
finaldata$date <- as.Date(finaldata$date, format="%Y-%m-%d")
To order the whole data by date:
finaldata <- finaldata[order(finaldata$date),]
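As an alternative to splitting, exporting and re-importing, the daytype column can be derived directly from the date, which avoids the merge problem altogether. This is a sketch on invented values standing in for activityData; it also shows that stacking the two parts with rbind() recovers every row.

```r
# Toy stand-in for activityData (values are invented).
activity <- data.frame(
  steps    = c(1.7, 0.3, 0, 0, 2.5),
  date     = c("2012-10-01", "2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"),
  interval = c(0, 5, 0, 5, 0),
  stringsAsFactors = FALSE
)

# as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday, independent of locale.
wday <- as.POSIXlt(as.Date(activity$date))$wday
activity$daytype <- ifelse(wday %in% c(0, 6), "weekend", "weekday")

# Splitting and stacking again with rbind() recovers every row.
weekday_part <- activity[activity$daytype == "weekday", ]
weekend_part <- activity[activity$daytype == "weekend", ]
stacked <- rbind(weekday_part, weekend_part)
stacked <- stacked[order(stacked$date, stacked$interval), ]
```

Because the dates are in YYYY-MM-DD format, ordering the character column gives the same result as ordering real dates.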

How to add dummy variables in R

I know there are several questions about this topic, but none of them seem to answer my specific question.
I have a dataset with five independent variables and I want to add two dummy variables to my regression in R. I have my data in Excel and importing the dataset is not a problem (I use read.csv2). However, I can't access my dummy variables, D1 and D2, although I can access all the other variables. Both dummy variables take the values 0 and 1 throughout the dataset.
I can easily see a summary of all my data, including D1 and D2 (with median, mean, etc.), and I can call each of the 5 variables separately without any problems at all, but I can't do that with D1 and D2.
> str(tilskuere)
'data.frame': 180 obs. of 7 variables:
$ ATT : int 3166 4315 7123 6575 7895 7323 3579 9571 5345 6595 ...
$ PRICE : int 80 95 120 100 105 115 80 130 105 100 ...
$ viewers: int 41000 43000 56000 66000 157000 91000 51000 30000 36000 72000 ...
$ CB1 : int 10 10 5 2 7 2 3 1 10 1 ...
$ CB2 : num 1 1 1 0 0.33 ...
$ D1 : int 0 0 0 1 0 0 0 0 0 0 ...
$ D2 : int 1 0 0 0 0 1 1 0 0 0 ...
> summary(tilskuere)
> mean(ATT)
[1] 6856.372
> mean(D1)
Error in mean(D1) : object 'D1' not found
To sum up: I can run regressions in R without D1 and D2, but I can't include them as dummy variables because R can't find them when I run the regression. R simply says "object 'D1' not found".
I hope someone can help. Thank you in advance.
Kind regards
Mikkel
I added the material from your comment to the question text and added some line breaks. It is now clear that the problem is that columns are not first-class objects in R; they have to be accessed through their data frame. Try:
mean(tilskuere$D1)
You can see what objects are in your workspace with:
ls()
You appear to have an object named ATT in your workspace as well as a length-180 column by the same name in the object named tilskuere.
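For the regression itself, the data argument of lm() lets the formula refer to the columns by name, so no standalone D1 or D2 object is needed. The sketch below uses a small invented data frame in place of tilskuere, and the model formula is only an assumption about what the regression might look like.

```r
# Toy stand-in for `tilskuere` (values are invented).
tilskuere <- data.frame(
  ATT   = c(3166, 4315, 7123, 6575, 7895, 7323),
  PRICE = c(80, 95, 120, 100, 105, 115),
  D1    = c(0, 0, 0, 1, 0, 1),
  D2    = c(1, 0, 0, 0, 1, 0)
)

mean_d1  <- mean(tilskuere$D1)          # column accessed through the data frame
mean_d1b <- with(tilskuere, mean(D1))   # same result, evaluated inside the data frame

# lm() looks the variables up in `data`, so D1 and D2 need no prefix here.
fit <- lm(ATT ~ PRICE + D1 + D2, data = tilskuere)
coef_names <- names(coef(fit))
```

Passing data = tilskuere is generally safer than attach(), because it avoids the kind of name clash described above, where a workspace object called ATT shadows the column of the same name.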
