Get data between certain timestamps in R - r

I have the following data frame:
>head(d)
Smed Time TTFB
1 186 2014-03-19 20:13:00 6
2 196 2014-03-19 21:23:00 0
3 186 2014-03-19 19:33:00 22
4 186 2014-03-19 14:43:00 134
5 186 2014-03-19 23:53:00 36
> str(d)
$ Smed : Factor w/ 71 levels "1033","1038",..: 14 16 14 14 14 14 47 14 14 68 ...
$ Time : POSIXct, format: "2014-03-19 20:13:00" "2014-03-19 20:13:00" "2014-03-19 20:13:00" "2014-03-19 20:13:00" ...
$ TTFB : int 6 0 22 134 36 96 61 37 28 19 ...
How do i do a time comparison with POSIXct time format to get data say between 18:13 and 23:20

POSIXct understands < and >, so you simply do this:
set.seed(1)
d <- data.frame(Time=as.POSIXct("2014-03-19")+24*60*60*runif(100))
index <- as.POSIXct("2014-03-19 18:00:00")<=d$Time &
d$Time<=as.POSIXct("2014-03-19 23:20:00")
d[index,]

Try something like
start <- as.POSIXct("2014-03-19 18:13:00")
end <- as.POSIXct("2014-03-19 23:10:00")
d[start < d$Time & d$Time < end, ]
Hope it helps

Related

Unable to designate CSV column heads "as.factor" for R -Error

I am having an issue with assigning factors to my data CSV. Here is a summary of the data frame:
> data.frame': 303 obs. of 12 variables:
> PLOT : int 19 177 54 114 41 48 142 134 160 267 ...
> RANGE : int 2 12 4 8 3 4 10 9 11 18 ...
> ROW : int 4 12 9 9 11 3 7 14 10 12 ...
> REP : int 1 1 1 1 1 1 1 1 1 1 ...
> ENTRY : Factor w/ 184 levels "","17_YMG_0293",..: 40 40 77 82 87 88 102 103 103 6 ...
> PLOT_ID : Factor w/ 301 levels "","18_HZG_OvOv_001",..: 20 178 55 115 42 49 143 135 161 268 ...
> Shatter : num 9 9 9 9 9 9 9 9 9 8 ...
> Chaff.Color : Factor w/ 4 levels "","*Blank ones are segregating in color",..: 3 4 3 4 4 4 3 4 4 3 ...
> Heading_d.from.Jan.1: int 138 139 137 133 135 135 133 137 135 136 ...
> Height_cm : int 74 73 77 76 74 79 78 73 76 70 ...
> Plot.weight..kg. : num 0.26 0.18 0.19 0.14 0.33 0.19 0.13 0.11 0.24 0.18 ...
But I get this error:
HAYSData$Rep<-as.factor(HAYSData$Rep)
Error in `$<-.data.frame`(`*tmp*`, Rep, value = integer(0)) :
replacement has 0 rows, data has 303
I get the same type of error for Entry, Range, and Rows. I am not sure when I look at length(Entry) for example I get 300. I even tested with changing factor to numeric but it does not help.
I don't have an NA in my data each category is its own column as well.
I don't know if something is wrong with my CSV. I have worked this same script with another CSV but no issues in the part of the script for the other data.
Can someone please help me?
It's case-sensitive, try with:
HAYSData$REP <- as.factor(HAYSData$REP)
HAYSData$ENTRY <- as.factor(HAYSData$ENTRY)
HAYSData$RANGE <- as.factor(HAYSData$RANGE)
HAYSData$ROW <- as.factor(HAYSData$ROW)

How to user NSE inside fct_reorder() in ggplot2

I would like to know how to use NSE (Non-Standard Evaluation) expression in fct_reorder() in ggplot2 to replicate charts for different data frames.
This is an example of data frame that I use to draw a chart:
travel_time_br30 travel_time_br30_int time_reduction shift not_shift total
1 0-30 0 10 2780 3268 6048
2 0-30 0 20 2779 3269 6048
3 0-30 0 30 2984 3064 6048
4 0-30 0 40 3211 2837 6048
5 30-60 30 10 2139 2007 4146
6 30-60 30 20 2159 1987 4146
7 30-60 30 30 2363 1783 4146
8 30-60 30 40 2478 1668 4146
9 60-90 60 10 764 658 1422
10 60-90 60 20 721 701 1422
11 60-90 60 30 782 640 1422
12 60-90 60 40 801 621 1422
13 90-120 90 10 296 224 520
14 90-120 90 20 302 218 520
15 90-120 90 30 317 203 520
16 90-120 90 40 314 206 520
17 120-150 120 10 12 10 22
18 120-150 120 20 10 12 22
19 120-150 120 30 10 12 22
20 120-150 120 40 13 9 22
21 150-180 150 10 35 21 56
22 150-180 150 20 40 16 56
23 150-180 150 30 40 16 56
24 150-180 150 40 35 21 56
share
1 45.96561
2 45.94907
3 49.33862
4 53.09193
5 51.59190
6 52.07429
7 56.99469
8 59.76845
9 53.72714
10 50.70323
11 54.99297
12 56.32911
13 56.92308
14 58.07692
15 60.96154
16 60.38462
17 54.54545
18 45.45455
19 45.45455
20 59.09091
21 62.50000
22 71.42857
23 71.42857
24 62.50000
These are the scripts to draw a chart from above data frame:
g.var <- "travel_time_br30"
go.var <- "travel_time_br30_int"
test %>% ggplot(.,aes_(x=as.name(x.var),y=as.name("share"),group=as.name(g.var))) +
geom_line(size=1.4, aes(
color=fct_reorder(travel_time_br30,order(travel_time_br30_int))))
As I have several data frames which has different fields such as access_time_br30, access_time_br30_int instead of travel_time_br30 and travel_time_br30_int in the data frame, I set two variables (g.var and go.var) to easily replicate multiple chars in the same scripts.
As I need to reorder the factor group numerically, in particular, changing order of travel_time_br30 by travel_time_br30_int, I am using fct_reorder function in ggplot2(., aes_(...)). However, if I use aes_ with fct_reorder() in geom_line() as shown as an example in the following script, it returns an error saying Error:fmust be a factor (or character vector).
geom_line(size=1.4, aes_(color=fct_reorder(as.name(g.var),order(as.name(go.var)))))
Fct_reorder() does not seem to have an NSE version like fct_reorder_().
Is it impossible to use both aes_ and fct_reorder() in a sequence of scripts or are there any other solutions?
Based on my novice working knowledge of tidy-eval, you could transform your factor order in mutate() before passing the data into ggplot() and acheive your result.
Sorry I couldn't easily read in your table above, because of the line return so I made a new example off of mtcars that I think captures your intent. (let me know if it doesn't)
mtcars2 <- mutate(mtcars,
gear_int = 6 - gear,
gear_intrev = rev(gear_int)) %>%
mutate_at(vars(cyl, gear), as.factor)
library(rlang)
gg_reorder <- function(data, col_var, col_order) {
eq_var <- sym(col_var) # sym is flexible and my novice preference
eq_ord <- sym(col_order)
data %>% mutate(!!quo_name(eq_var) := fct_reorder(!!eq_var, !!eq_ord) ) %>%
ggplot(aes_(~mpg, ~hp, color = eq_var)) +
geom_line()
}
And now put it to use plotting...
gg_reorder(mtcars2, "gear", "gear_int")
gg_reorder(mtcars2, "gear", "gear_intrev")
I didn't specify all of the aes_() variables as strings but you could pass those as text and use the as.name() pattern. If you want more tidy-eval patterns Edwin Thoen wrote up a bunch of common cases.

read.csv() parses numeric columns as factors [duplicate]

This question already has answers here:
How to convert a factor to integer\numeric without loss of information?
(12 answers)
Closed 5 years ago.
Have the below dataframe where all the columns are factors which I want to use them as numeric columns. I tried different ways but it is changing to different values when I try as.numeric(as.character(.))
The data comes in a semicolon separated format. A subset of data to reproduce the problem is:
rawData <- "Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
21/12/2006;11:23:00;?;?;?;?;?;?;
21/12/2006;11:24:00;?;?;?;?;?;?;
16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000
16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000
16/12/2006;17:26:00;5.374;0.498;233.290;23.000;0.000;2.000;17.000
16/12/2006;17:27:00;5.388;0.502;233.740;23.000;0.000;1.000;17.000
16/12/2006;17:28:00;3.666;0.528;235.680;15.800;0.000;1.000;17.000
16/12/2006;17:29:00;3.520;0.522;235.020;15.000;0.000;2.000;17.000
16/12/2006;17:30:00;3.702;0.520;235.090;15.800;0.000;1.000;17.000
16/12/2006;17:31:00;3.700;0.520;235.220;15.800;0.000;1.000;17.000
16/12/2006;17:32:00;3.668;0.510;233.990;15.800;0.000;1.000;17.000
"
hpc <- read.csv(text=rawData,sep=";")
str(hpc)
When run against the full data file after dropping the date and time variables, the output from str() looks like:
> str(hpc)
'data.frame': 2075259 obs. of 7 variables:
$ Global_active_power : Factor w/ 4187 levels "?","0.076","0.078",..: 2082 2654 2661 2668 1807 1734 1825 1824 1808 1805 ...
$ Global_reactive_power: Factor w/ 533 levels "?","0.000","0.046",..: 189 198 229 231 244 241 240 240 235 235 ...
$ Voltage : Factor w/ 2838 levels "?","223.200",..: 992 871 837 882 1076 1010 1017 1030 907 894 ...
$ Global_intensity : Factor w/ 222 levels "?","0.200","0.400",..: 53 81 81 81 40 36 40 40 40 40 ...
$ Sub_metering_1 : Factor w/ 89 levels "?","0.000","1.000",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Sub_metering_2 : Factor w/ 82 levels "?","0.000","1.000",..: 3 3 14 3 3 14 3 3 3 14 ...
$ Sub_metering_3 : num 17 16 17 17 17 17 17 17 17 16 ...
Can anyone help me in getting the expected output?
expected output:
> str(hpc)
'data.frame': 2075259 obs. of 7 variables:
$ Global_active_power : num "?","0.076","0.078",..: 2082 2654 2661 2668 1807 1734 1825 1824 1808 1805 ...
$ Global_reactive_power: num "?","0.000","0.046",..: 189 198 229 231 244 241 240 240 235 235 ...
$ Voltage : num "?","223.200",..: 992 871 837 882 1076 1010 1017 1030 907 894 ...
$ Global_intensity : num "?","0.200","0.400",..: 53 81 81 81 40 36 40 40 40 40 ...
$ Sub_metering_1 : num "?","0.000","1.000",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Sub_metering_2 : num "?","0.000","1.000",..: 3 3 14 3 3 14 3 3 3 14 ...
$ Sub_metering_3 : num 17 16 17 17 17 17 17 17 17 16 ...
Not able to test your data frame, but hopefully this will work. I notice that in the output of str(hpc) not all columns are factors. mutate_if can apply a function to those meet the requirement of a predictive function.
library(dplyr)
hpc2 <- hpc %>%
mutate_if(is.factor, funs(as.numeric(as.character(.))))

Creating decision tree

I have a csv file (298 rows and 24 columns) and i want to create a decision tree to predict the column "salary". I have downloaded tree package and added via library function.
But when i try to create the decision tree:
model<-tree(salary~.,data)
I get the error like below:
*Error in tree(salary ~ ., data) :
factor predictors must have at most 32 levels*
What is wrong with that? Data is as follows:
Name bat hit homeruns runs
1 Alan Ashby 315 81 7 24
2 Alvin Davis 479 130 18 66
3 Andre Dawson 496 141 20 65
...
team position putout assists errors
1 Hou. C 632 43 10
2 Sea. 1B 880 82 14
3 Mon. RF 200 11 3
salary league87 team87
1 475 N Hou.
2 480 A Sea.
3 500 N Chi.
And its the value of str(data):
'data.frame': 263 obs. of 24 variables:
$ Name : Factor w/ 263 levels "Al Newman","Alan Ashby",..: 2 7 8 10 6 1 13 11 9 3 ...
$ bat : int 315 479 496 321 594 185 298 323 401 574 ...
$ hit : int 81 130 141 87 169 37 73 81 92 159 ...
$ homeruns : int 7 18 20 10 4 1 0 6 17 21 ...
$ runs : int 24 66 65 39 74 23 24 26 49 107 ...
$ runs.batted : int 38 72 78 42 51 8 24 32 66 75 ...
$ walks : int 39 76 37 30 35 21 7 8 65 59 ...
$ years.in.major.leagues : int 14 3 11 2 11 2 3 2 13 10 ...
$ bats.during.career : int 3449 1624 5628 396 4408 214 509 341 5206 4631 ...
$ hits.during.career : int 835 457 1575 101 1133 42 108 86 1332 1300 ...
$ homeruns.during.career : int 69 63 225 12 19 1 0 6 253 90 ...
$ runs.during.career : int 321 224 828 48 501 30 41 32 784 702 ...
$ runs.batted.during.career: int 414 266 838 46 336 9 37 34 890 504 ...
$ walks.during.career : int 375 263 354 33 194 24 12 8 866 488 ...
$ league : Factor w/ 2 levels "A","N": 2 1 2 2 1 2 1 2 1 1 ...
$ division : Factor w/ 2 levels "E","W": 2 2 1 1 2 1 2 2 1 1 ...
$ team : Factor w/ 24 levels "Atl.","Bal.",..: 9 21 14 14 16 14 10 1 7
8 ...
$ position : Factor w/ 23 levels "1B","1O","23",..: 10 1 20 1 22 4 22 22 13 22 ...
$ putout : int 632 880 200 805 282 76 121 143 0 238 ...
$ assists : int 43 82 11 40 421 127 283 290 0 445 ...
$ errors : int 10 14 3 4 25 7 9 19 0 22 ...
$ salary : num 475 480 500 91.5 750 ...
$ league87 : Factor w/ 2 levels "A","N": 2 1 2 2 1 1 1 2 1 1 ...
$ team87 : Factor w/ 24 levels "Atl.","Bal.",..: 9 21 5 14 16 13 10 1 7 8 ...
The issue is almost certainly that you're including the name variable in your model, as it has too many factor levels. I would also remove it a methodological standpoint but this probably isn't the place for that discussion. Try:
train <- data
train$Name <- NULL
model<-tree(salary~.,train)
It seems that your salary is a factor vector, while you are trying to perform a regression, so it should be a numbers vector. Simply convert you salary to numeric, and it should work just fine. For more details read the library's help:
http://cran.r-project.org/web/packages/tree/tree.pdf
Usage
tree(formula, data, weights, subset, na.action = na.pass,
control = tree.control(nobs, ...), method = "recursive.partition",
split = c("deviance", "gini"), model = FALSE, x = FALSE, y = TRUE, wts
= TRUE, ...)
Arguments
formula A formula expression. The left-hand-side (response) should be either a numerical vector when a
regression tree will be fitted or a factor, when a classification tree
is produced. The right-hand-side should be a series of numeric or
factor variables separated by +; there should be no interaction terms.
Both . and - are allowed: regression trees can have offset terms.
(...)
Depending on what exactly is stored in your salary variable, the conversion can be less or more tricky, but this should generaly work:
salary = as.numeric(levels(salary))[salary]
EDIT
As pointed out in the comment, the actual error corresponds to the data variable, so if it is a numerical data, it could also be converted to numeric to solve the issue, if it has to be a factor you will need another model or reduce the number of levels. You can also convert these factors to the numerical format by hand (by for example defining as many binary features as you have levels), but this can lead to the exponential growth of your input space.
EDIT2
It seems that you have to first decide what you are trying to model. You are trying to predict salary, but based on what? It seems that your data consists of players' records, then their names are for sure wrong type of data to use for this prediction (in particular - it is probably causing the 32 levels error). You should remove all the columns from the data variable which should not be used for building a prediction. I do not know what is the exact aim here (as there is no information regarding it in the question), so I can only guess that you are trying to predict the person's salary based on his/her stats, so you should remove from the input data: players' names, players' teams and obviously salaries (as predicting X using X is not a good idea ;)).

Generate entries in time series data

I want to generate a row (with zero ammount) for each missing month (until the current) in the following dataframe. Can you please give me a hand in this? Thanks!
trans_date ammount
1 2004-12-01 2968.91
2 2005-04-01 500.62
3 2005-05-01 434.30
4 2005-06-01 549.15
5 2005-07-01 276.77
6 2005-09-01 548.64
7 2005-10-01 761.69
8 2005-11-01 636.77
9 2005-12-01 1517.58
10 2006-03-01 719.09
11 2006-04-01 1231.88
12 2006-05-01 580.46
13 2006-07-01 1468.43
14 2006-10-01 692.22
15 2006-11-01 505.81
16 2006-12-01 1589.70
17 2007-03-01 1559.82
18 2007-06-01 764.98
19 2007-07-01 964.77
20 2007-09-01 405.18
21 2007-11-01 112.42
22 2007-12-01 1134.08
23 2008-02-01 269.72
24 2008-03-01 208.96
25 2008-04-01 353.58
26 2008-05-01 756.00
27 2008-06-01 747.85
28 2008-07-01 781.62
29 2008-09-01 195.36
30 2008-10-01 424.24
31 2008-12-01 166.23
32 2009-02-01 237.11
33 2009-04-01 110.94
34 2009-07-01 191.29
35 2009-11-01 153.42
36 2009-12-01 222.87
37 2010-09-01 1259.97
38 2010-11-01 375.61
39 2010-12-01 496.48
40 2011-02-01 360.07
41 2011-03-01 324.95
42 2011-04-01 566.93
43 2011-06-01 281.19
44 2011-08-01 428.04
'data.frame': 44 obs. of 2 variables:
$ trans_date : Date, format: "2004-12-01" "2005-04-01" "2005-05-01" "2005-06-01" ...
$ ammount: num 2969 501 434 549 277 ...
you can use seq.Date and merge:
> str(df)
'data.frame': 44 obs. of 2 variables:
$ trans_date: Date, format: "2004-12-01" "2005-04-01" "2005-05-01" "2005-06-01" ...
$ ammount : num 2969 501 434 549 277 ...
> mns <- data.frame(trans_date = seq.Date(min(df$trans_date), max(df$trans_date), by = "month"))
> df2 <- merge(mns, df, all = TRUE)
> df2$ammount <- ifelse(is.na(df2$ammount), 0, df2$ammount)
> head(df2)
trans_date ammount
1 2004-12-01 2968.91
2 2005-01-01 0.00
3 2005-02-01 0.00
4 2005-03-01 0.00
5 2005-04-01 500.62
6 2005-05-01 434.30
and if you need months until current, use this:
mns <- data.frame(trans_date = seq.Date(min(df$trans_date), Sys.Date(), by = "month"))
note that it is sufficient to call simply seq instead of seq.Date if the parameters are Date class.
If you're using xts objects, you can use timeBasedSeq and merge.xts. Assuming your original data is in an object Data:
# create xts object:
# no comma on the first subset (Data['ammount']) keeps column name;
# as.Date needs a vector, so use comma (Data[,'trans_date'])
x <- xts(Data['ammount'],as.Date(Data[,'trans_date']))
# create a time-based vector from 2004-12-01 to 2011-08-01. The "m" denotes
# monthly time-steps. By default this returns a yearmon class. Use
# retclass="Date" to return a Date vector.
d <- timeBasedSeq(paste(start(x),end(x),"m",sep="/"), retclass="Date")
# merge x with an "empty" xts object, xts(,d), filling with zeros
y <- merge(x,xts(,d),fill=0)

Resources