How to convert a data frame to a time series in R

I have a data frame with a numeric year column and a numeric AvgTempZScore column. How can I convert it to a properly formatted time series?
Example:
year AvgTempZScore
<dbl> <dbl>
1 1835 0.109
2 1836 0.168
3 1837 0.177
4 1838 0.143
5 1839 0.188
6 1840 0.198
7 1841 0.200
8 1842 0.230
9 1843 0.237
10 1844 0.194
str():
tibble [179 × 2] (S3: tbl_df/tbl/data.frame)
$ year : num [1:179] 1835 1836 1837 1838 1839 ...
$ AvgTempZScore: num [1:179] 0.109 0.168 0.177 0.143 0.188 ...

With xts and lubridate:
xts::xts(x = df$AvgTempZScore,order.by = lubridate::ymd(df$year, truncated = 2L))
[,1]
1835-01-01 0.109
1836-01-01 0.168
1837-01-01 0.177
1838-01-01 0.143
1839-01-01 0.188
1840-01-01 0.198
1841-01-01 0.200
1842-01-01 0.230
1843-01-01 0.237
1844-01-01 0.194

You can also do this with the ts() function.
db=structure(list(year = c(1835,1836,1837,1838,1839,1840,1841,1842,1843,1844),
AvgTempZScore = c(0.109,0.168,0.177,0.143,0.188,0.198,0.200,0.230,0.237,0.194)),
row.names = c(1:10),
class = "data.frame")
str(db)
#'data.frame': 10 obs. of 2 variables:
# $ year : num 1835 1836 1837 1838 1839 ...
# $ AvgTempZScore: num 0.109 0.168 0.177 0.143 0.188 0.198 0.2 0.23 0.237 0.194
db = ts(db,frequency = 1,start=1835, end=1844)
str(db)
#Time-Series [1:10, 1:2] from 1835 to 1844: 1835 1836 1837 1838 1839 ...
#- attr(*, "dimnames")=List of 2
#..$ : NULL
#..$ : chr [1:2] "year" "AvgTempZScore"
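Note that ts(db, ...) above keeps year as a second data column. If a single series indexed by year is what you want, a minimal sketch is to pass only the value column and let ts() build the time index. This assumes the years are consecutive with no gaps; the data frame name df is just a stand-in for the asker's tibble.

```r
# Stand-in for the first ten rows of the asker's data.
df <- data.frame(
  year = 1835:1844,
  AvgTempZScore = c(0.109, 0.168, 0.177, 0.143, 0.188,
                    0.198, 0.200, 0.230, 0.237, 0.194)
)

# Pass only the values; the time index comes from start and frequency.
# Assumption: one observation per year, no missing years.
z <- ts(df$AvgTempZScore, start = min(df$year), frequency = 1)

time(z)   # the yearly time index that ts() generated
z         # the series itself
```

With this form, window(z, start = 1838, end = 1840) can be used to pick out a range of years.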

Related

How do I fix the units in the legend of this stat_density_2d plot?

I've managed to create my first stat_density_2d plot:
ggplot(qual_colleges_all_data_clean, aes(x = UGDS, y = MD_EARN_WNE_P10), alpha = 0.4, size = 0.5) +
geom_point() +
stat_density_2d(aes(fill = ..level..), geom="polygon") +
scale_fill_gradient(low="lightblue", high="firebrick3") +
geom_smooth(method="loess", se=TRUE, formula = y~x, color="chartreuse1", fill="darkolivegreen2") +
labs(title = "Median Earnings By UGDS") +
scale_colour_hue(h = c(0, 280)) +
xlab(label = "UGDS") +
ylab(label = "Median Earnings 10 years After Entry")
In my data UGDS is a (numeric class) column of integers.
The plot looks fine, but I don't understand why the legend units are given as, e.g., "8e-09". How do I change the legend units to something more readily comprehensible (perhaps "more dense" and "less dense")?
Here's the str():
## tibble [1,121 × 25] (S3: tbl_df/tbl/data.frame)
## $ UNITID : num [1:1121] 159009 217721 198862 176406 158802 ...
## $ INSTNM : chr [1:1121] "Grambling State University" "Benedict College" "Livingstone College" "Tougaloo College" ...
## $ SCH_DEG : num [1:1121] 3 3 3 3 3 3 3 3 3 3 ...
## $ CCBASIC : num [1:1121] 18 22 22 21 21 22 18 19 19 18 ...
## $ ADM_RATE : num [1:1121] 0.972 0.768 0.501 0.707 0.388 ...
## $ ACTCM25 : num [1:1121] 16 15 14 15 19 14 16 16 15 15 ...
## $ ACTCM75 : num [1:1121] 20 19 16 22 23 17 19 18 19 19 ...
## $ SAT_AVG : num [1:1121] 981 933 854 969 1075 ...
## $ UGDS : num [1:1121] 4153 2034 1114 703 1215 ...
## $ COSTT4_A : num [1:1121] 25754 27927 28319 22594 31625 ...
## $ TUITIONFEE_IN : num [1:1121] 7683 16600 18296 10861 19281 ...
## $ TUITIONFEE_OUT : num [1:1121] 16706 16600 18296 10861 19281 ...
## $ AVGFACSAL : num [1:1121] 6352 5808 5408 4755 6472 ...
## $ PFTFAC : num [1:1121] 0.994 0.719 0.825 0.741 0.593 ...
## $ RET_FT4 : num [1:1121] 0.744 0.556 0.562 0.73 0.669 ...
## $ MD_EARN_WNE_P10 : num [1:1121] 33961 30951 30521 33291 37014 ...
## $ PCT25_EARN_WNE_P10: num [1:1121] 19299 19230 17973 20243 23757 ...
## $ PCT75_EARN_WNE_P10: num [1:1121] 50236 43045 45132 46866 52462 ...
## $ MD_EARN_WNE_P6 : num [1:1121] 27399 23174 24630 25088 29132 ...
## $ GRADS : num [1:1121] 1079 6 NA 8 NA ...
## $ RET_FT4_POOLED : num [1:1121] 0.734 0.56 0.53 0.738 0.678 ...
## $ C100_4_POOLED : num [1:1121] 0.105 0.18 0.152 0.22 0.269 ...
## $ INEXPFTE : num [1:1121] 3905 4224 5598 4884 7815 ...
## $ C150_4_POOLED : num [1:1121] 0.34 0.271 0.239 0.412 0.454 ...
## $ GRAD_DEBT_MDN : num [1:1121] 36750 34500 32875 32100 31250 ...
Thank you, most sincerely, for any help!
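One possible approach, offered as a sketch rather than a definitive answer: the legend shows the estimated 2D density at each contour level, which is why the values are tiny numbers such as 8e-09. The fill scale accepts breaks and labels, so the two ends of the scale can be relabelled. The example below uses the built-in faithful dataset as a stand-in for the asker's data and assumes ggplot2 3.3.0 or later (for after_stat(), the replacement for ..level..) with the MASS package available for the density estimate.

```r
library(ggplot2)

# Stand-in data: the built-in faithful dataset instead of the college data.
p <- ggplot(faithful, aes(x = eruptions, y = waiting)) +
  geom_point(alpha = 0.4, size = 0.5) +
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon") +
  scale_fill_gradient(
    low = "lightblue", high = "firebrick3",
    name = "Density",
    # Put one break at each end of the scale limits and label them in words.
    breaks = function(lims) lims,
    labels = c("less dense", "more dense")
  )

p
```

If numeric labels are preferred, dropping the breaks and labels arguments and keeping only name = "Density" at least makes clear what the numbers measure.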

Multiclass classification in H2O randomForest

I'm trying to use H2O randomForest for multiclass classification in R, but when I run the code, the randomForest always comes out as a regression model, despite the target variable being a factor. I am trying to predict 'Gradient', a factor with 5 levels, from one other factor, 'Period', with 4 levels, and 21 numerical predictors.
Any help would be appreciated. Code below....
>str(df)
Class 'H2OFrame' <environment: 0x000001f6b361abe0>
- attr(*, "op")= chr ":="
- attr(*, "eval")= logi TRUE
- attr(*, "id")= chr "RTMP_sid_aecc_35"
- attr(*, "nrow")= int 63878
- attr(*, "ncol")= int 22
- attr(*, "types")=List of 22
- attr(*, "data")='data.frame': 10 obs. of 22 variables:
..$ Gradient: Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1
..$ Period : Factor w/ 4 levels "Dawn","Day","Dusk",..: 2 2 2 2 2 2 2 2 2 2
..$ AC1 : num 1792 1793 1790 1790 1797 ...
..$ AC2 : num 316 316 318 317 324 ...
..$ AC3 : num 972 972 974 975 979 ...
etc for remaining numerical predictors.
>splits <- h2o.splitFrame(df, c(0.6,0.2), seed=1234)
>train <- h2o.assign(splits[[1]], "train.hex")
>valid <- h2o.assign(splits[[2]], "valid.hex")
>test <- h2o.assign(splits[[3]], "test.hex")
>str(train)
Class 'H2OFrame' <environment: 0x000002266fac7d40>
- attr(*, "op")= chr "assign"
- attr(*, "id")= chr "train.hex"
- attr(*, "nrow")= int 38259
- attr(*, "ncol")= int 22
- attr(*, "types")=List of 22
- attr(*, "data")='data.frame': 10 obs. of 22 variables:
..$ Gradient: Factor w/ 5 levels "LB","LU","PB",..: 1 1 1 1 1 1 1 1 1 1
..$ Period : Factor w/ 4 levels "Dawn","Day","Dusk",..: 2 2 2 2 2 2 2 2 2 2
..$ AC1 : num 1793 1797 1796 1805 1803 ...
..$ AC2 : num 316 324 322 322 323 ...
..$ AC3 : num 972 979 979 988 986 ...
..$ AC4 : num 663 662 664 673 670 ...
..$ AC5 : num 828 825 824 824 825 ...
..$ AD1 : num 1.22 1.42 1.73 2.25 1.99 ...
..$ AD2 : num 1.1 1.27 1.35 1.38 1.38 ...
..$ AD3 : num 1.22 1.42 1.72 2.24 1.99 ...
..$ AD4 : num 1.87 1.53 2.07 2.03 1.78 ...
..$ AD5 : num 2.33 2.33 2.33 2.33 2.33 ...
..$ AE1 : num 0.877 0.849 0.794 0.636 0.72 ...
..$ AE2 : num 0.3687 0.2332 0.1369 0.0433 0.0546 ...
..$ AE3 : num 0.774 0.723 0.624 0.335 0.487 ...
..$ AE4 : num 0.574 0.697 0.44 0.477 0.605 ...
..$ AE5 : num 0.542 0.542 0.554 0.543 0.542 ...
..$ BI1 : num 53 71.9 64 75.4 74.6 ...
..$ BI2 : num 6.51 5.88 4.54 2.3 2.34 ...
..$ BI3 : num 22.2 26 21.5 27.9 28 ...
..$ BI4 : num 7.86 9.58 8.59 12.17 12.5 ...
..$ BI5 : num 11.3 17.9 16.4 18.1 17.5 ...
> train[1:5,] ## rows 1-5, all columns
Gradient Period AC1 AC2 AC3 AC4 AC5 AD1 AD2 AD3 AD4 AD5 AE1 AE2 AE3 AE4 AE5
1 LB Day 1792.97 316.4038 972.4288 663.2612 827.6400 1.217491 1.104860 1.217491 1.866627 2.332115 0.876794 0.368712 0.774123 0.574168 0.541993
2 LB Day 1796.78 324.3562 979.2218 662.2341 824.6436 1.421910 1.274373 1.421910 1.526506 2.331810 0.848660 0.233177 0.722544 0.696906 0.542409
3 LB Day 1796.09 321.9081 978.7464 664.1776 824.4437 1.726798 1.345030 1.721740 2.066543 2.326278 0.794230 0.136892 0.624107 0.440458 0.553766
4 LB Day 1805.14 322.0390 987.9472 673.2841 824.3146 2.248474 1.381644 2.239061 2.028538 2.331881 0.636007 0.043267 0.334964 0.477149 0.542572
5 LB Day 1803.15 323.1540 985.6376 669.7603 824.6003 1.992025 1.380468 1.992004 1.782532 2.331971 0.720153 0.054578 0.486951 0.604876 0.542420
BI1 BI2 BI3 BI4 BI5
1 53.03567 6.506536 22.23446 7.862767 11.32708
2 71.94775 5.879407 26.04130 9.579798 17.94337
3 63.98763 4.535041 21.50727 8.590985 16.38780
4 75.38319 2.301110 27.89600 12.165991 18.06316
5 74.60517 2.342853 28.02568 12.499122 17.52902
rf1 <- h2o.randomForest(
training_frame = train,
validation_frame = valid,
x=2:22,
y=1,
ntrees = 200,
stopping_rounds = 2,
score_each_iteration = T,
seed = 1000000)
>perf <- h2o.performance(rf1, valid)
>h2o.mcc(perf)
Error in h2o.metric(object, thresholds, "absolute_mcc") :
No absolute_mcc for H2OMultinomialMetrics
>h2o.accuracy(perf)
Error in h2o.metric(object, thresholds, "accuracy") :
No accuracy for H2OMultinomialMetrics
And an excerpt from the model summary:
H2OMultinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **
Training Set Metrics:
=====================
Extract training frame with `h2o.getFrame("train.hex")`
MSE: (Extract with `h2o.mse`) 0.2499334
RMSE: (Extract with `h2o.rmse`) 0.4999334
Logloss: (Extract with `h2o.logloss`) 0.9987891
Mean Per-Class Error: 0.2941914
R^2: (Extract with `h2o.r2`) 0.8683096
h2o.mcc() is specifically for binary classifiers; your factor has more than 2 levels.
You can tell that you have successfully run a multinomial classification rather than a regression because the error message says "No absolute_mcc for H2OMultinomialMetrics".
h2o.accuracy() and h2o.logloss() are available for multinomial models.
UPDATE: ...well, the docs say h2o.accuracy() is available, but a quick check on the iris dataset gives me the same error you see; it must be related to that warning in the docs (which I didn't understand).
Anyway, h2o.confusionMatrix(rf1) is likely to be more useful; the overall error shown in the bottom right is 1 - accuracy. See also h2o.confusionMatrix(rf1, valid = TRUE) and h2o.confusionMatrix(rf1, test).
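To make the "overall error is 1 - accuracy" relationship concrete, here is a small base R sketch that does not need H2O at all. The confusion matrix counts and the class names A, B and C are made up for illustration; they are not taken from the model above.

```r
# Hypothetical 3-class confusion matrix: rows are actual, columns are predicted.
cm <- matrix(
  c(50,  3,  2,
     4, 40,  6,
     1,  5, 45),
  nrow = 3, byrow = TRUE,
  dimnames = list(actual = c("A", "B", "C"), predicted = c("A", "B", "C"))
)

# Overall accuracy: correctly classified rows (the diagonal) over all rows.
accuracy <- sum(diag(cm)) / sum(cm)

# Overall error, the figure a confusion matrix reports in the bottom right.
overall_error <- 1 - accuracy

# Per-class error, and its mean (compare "Mean Per-Class Error" in the summary).
per_class_error <- 1 - diag(cm) / rowSums(cm)
mean_per_class_error <- mean(per_class_error)
```

The mean per-class error weights every class equally, so it can differ from the overall error when the classes are unbalanced.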

.csv file int variable to factor in R

I have a big .csv data set with columns $B1 through $B34. They are all numeric, which is fine, but I would like the last column, DEC, to be a factor. Its values are only 1 and 0.
How can I make the last column a factor?
mydata<-read.csv(file.choose(),header=T)
str(mydata)
'data.frame': 1024 obs. of 35 variables:
$ B1 : num 90.8 113.2 100.4 144.5 131.6 ...
$ B2 : num 0.133 0.139 0.144 0.147 0.141 ...
-----------
-----------
$ B32 : num 0.216 0.27 0.309 0.259 0.304 ...
$ B33 : num 0.526 0.407 0.286 0.129 0.37 ...
$ B34 : num 4.33 5.61 4.81 7.32 6.83 ...
$ DEC : int 1 1 1 1 1 1 1 1 1 1 ...
You can use the as.factor() function to convert any column to factor. For example:
mydata<-read.csv("data.csv") #Read the data#
mydata$DEC<-as.factor(mydata$DEC) #Convert the column to a factor
class(mydata$DEC) #Just to check that it worked#
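One thing to watch for after the conversion, shown here as a small sketch on a made-up three-row data frame (not the asker's file): calling as.integer() on a factor returns the internal level codes, not the original 0/1 values, so go through as.character() if you ever need the numbers back.

```r
# Tiny stand-in for the CSV data; values are invented for illustration.
mydata <- data.frame(B1 = c(90.8, 113.2, 100.4), DEC = c(1L, 0L, 1L))

mydata$DEC <- as.factor(mydata$DEC)
levels(mydata$DEC)                                 # "0" "1"

# Level codes, not the original values: 0 maps to code 1 and 1 maps to code 2.
codes <- as.integer(mydata$DEC)

# Recover the original 0/1 values by converting via character.
original <- as.integer(as.character(mydata$DEC))
```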

r quantregForest() error: NA's produced by integer overflow lead to an invalid argument in the rep() function

I am trying to use the quantregForest() function from the quantregForest package (which is built on the randomForest package).
I tried to train the model using:
qrf_model <- quantregForest(x=Xtrain, y=Ytrain, importance=TRUE, ntree=10)
and I get the following error message (even after reducing the number of trees from 100 to 10):
Error in rep(0, nobs * nobs * npred) : invalid 'times' argument
plus a warning:
In nobs * nobs * npred : NAs produced by integer overflow
The data frame Xtrain has 38 numeric variables, and it looks like this:
> str(Xtrain)
'data.frame': 31132 obs. of 38 variables:
$ X1 : num 301306 6431 2293 1264 32477 ...
$ X2 : num 173.2 143.5 43.4 180.6 1006.2 ...
$ X3 : num 0.1598 0.1615 0.1336 0.0953 0.1988 ...
$ X4 : num 0.662 0.25 0.71 0.709 0.671 ...
$ X5 : num 0.05873 0.0142 0 0.00154 0.09517 ...
$ X6 : num 0.01598 0 0.0023 0.00154 0.01634 ...
$ X7 : num 0.07984 0.03001 0.00845 0.04304 0.09326 ...
$ X8 : num 0.92 0.97 0.992 0.957 0.907 ...
$ X9 : num 105208 1842 830 504 11553 ...
$ X10: num 69974 1212 611 352 7080 ...
$ X11: num 0.505 0.422 0.55 0.553 0.474 ...
$ X12: num 0.488 0.401 0.536 0.541 0.45 ...
$ X13: num 0.333 0.419 0.257 0.282 0.359 ...
$ X14: num 0.187 0.234 0.172 0.207 0.234 ...
$ X15: num 0.369 0.216 0.483 0.412 0.357 ...
$ X16: num 0.0765 0.1205 0.0262 0.054 0.0624 ...
$ X17: num 2954 77 12 10 739 ...
$ X18: num 2770 43 9 21 433 119 177 122 20 17 ...
$ X19: num 3167 72 49 25 622 ...
$ X20: num 3541 57 14 24 656 ...
$ X21: num 3361 82 0 33 514 ...
$ X22: num 3929 27 10 48 682 ...
$ X23: num 3695 73 61 15 643 ...
$ X24: num 4781 52 5 14 680 ...
$ X25: num 3679 103 5 23 404 ...
$ X26: num 7716 120 55 40 895 ...
$ X27: num 11043 195 72 48 1280 ...
$ X28: num 16080 332 160 83 1684 ...
$ X29: num 12312 125 124 62 1015 ...
$ X30: num 8218 99 36 22 577 ...
$ X31: num 9957 223 146 26 532 ...
$ X32: num 0.751 0.444 0.621 0.527 0.682 ...
$ X33: num 0.01873 0 0 0.00317 0.02112 ...
$ X34: num 0.563 0.372 0.571 0.626 0.323 ...
$ X35: num 0.366 0.39 0.156 0.248 0.549 ...
$ X36: num 0.435 0.643 0.374 0.505 0.36 ...
$ X37: num 0.526 0.31 0.577 0.441 0.591 ...
$ X38: num 0.00163 0 0 0 0.00155 0.00103 0 0 0 0 ...
And the response variable Ytrain looks like this:
> str(Ytrain)
num [1:31132] 2605 56 8 16 214 ...
I checked that neither Xtrain nor Ytrain contains any NAs:
> sum(is.na(Xtrain))
[1] 0
> sum(is.na(Ytrain))
[1] 0
I am assuming that the error about the invalid "times" argument in rep(0, nobs * nobs * npred) arises because the product nobs * nobs * npred evaluates to NA due to an integer overflow.
What I do not understand is where the integer overflow comes from. None of my variables are of the integer class so what am I missing?
I examined the source code for the quantregForest() function and the source code for the method predict.imp called by the quantregForest() function.
I found that nobs stands for the number of observations. In the case above, nobs = length(Ytrain) = 31132. The variable npred stands for the number of predictors and is given by npred = ncol(Xtrain) = 38. Both npred and nobs are of class integer, and
nobs*nobs*npred = 31132*31132*38 = 36829654112.
And herein lies the root cause of the error, since:
nobs*nobs*npred = 36829654112 > 2147483647,
where 2147483647 is the maximal integer value in R. Hence the integer overflow warning and the replacement of the product nobs*nobs*npred with an NA.
The bottom line is that, to avoid the error, I will have to either use considerably fewer observations when training the model or set importance=FALSE in the quantregForest() call. The computations required to find variable importance are very memory intensive, even when using fewer than 10000 observations.
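The overflow can be reproduced in base R without quantregForest. The sketch below only mirrors the arithmetic described above; the variable names nobs and npred follow the question, and nothing here calls the package itself.

```r
nobs  <- 31132L   # number of observations, as an integer
npred <- 38L      # number of predictors, as an integer

.Machine$integer.max                             # 2147483647

# Integer arithmetic overflows and yields NA (with a warning, suppressed here).
int_product <- suppressWarnings(nobs * nobs * npred)

# Converting one operand to double avoids the overflow.
dbl_product <- as.numeric(nobs) * nobs * npred   # equals 36829654112

# rep() rejects an NA 'times' argument, which is the error seen in the question.
msg <- tryCatch(rep(0, int_product), error = function(e) conditionMessage(e))
msg
```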

Subsetting by rows to do a correlation

I created a data frame from another dataset with 332 IDs. I split the data frame by ID and would like to count the rows for each ID and then run a correlation on each group. Can someone tell me how to count the rows per ID so that I can compute a correlation for these individual groups?
jlhoward, your suggestion to add the table(dat1$ID) command worked. My other problem is that the function will not stop running:
corr<-function(directory,threshold=)
####### file location path#####
for(i in 1:332){dat<-rbind(dat,read.csv(specdata1[i]))
dat1<-dat[complete.cases(dat),]
dat2<-(split(dat1,dat1$ID))
list(dat2)
dat3<-table(dat1$ID)
for (i in dat1>=threshold){
x<-dat1$sulfate
y<-dat1$nitrate
correlations<-cor(x,y,use="pairwise.complete.obs",method="pearson")
corrs_output<-c(corrs_output,correlations)
}
I'm trying to correlate "sulfate" and "nitrate" for each ID monitor that meets a threshold. I created a list that has all the complete cases per ID monitor. I need the function to compute the correlation between "sulfate" and "nitrate" for every per-ID set whose row count is >= the threshold argument of the function. Below are the head and tail of the structure of the data.frame/list of each data set within the main data set "specdata1".
head of the entire data.frame/list of specdata1 complete cases for correlation
head(str(dat2,1))
List of 323
$ 1 :'data.frame': 117 obs. of 4 variables:
..$ Date : Factor w/ 4018 levels "2003-01-01","2003-01-02",..: 279 285 291 297 303 315 321 327 333 339 ...
..$ sulfate: num [1:117] 7.21 5.99 4.68 3.47 2.42 1.43 2.76 3.41 1.3 3.15 ...
..$ nitrate: num [1:117] 0.651 0.428 1.04 0.363 0.507 0.474 0.425 0.964 0.491 0.669 ...
..$ ID : int [1:117] 1 1 1 1 1 1 1 1 1 1 ...
tail of the entire data.frame/list of all complete cases of specdata1
tail(str(dat2,1))
$ 99 :'data.frame': 479 obs. of 4 variables:
..$ Date : Factor w/ 4018 levels "2003-01-01","2003-01-02",..: 1774 1780 1786 1804 1810 1816 1822 1840 1852 1858 ...
..$ sulfate: num [1:479] 1.51 8.2 1.48 4.75 3.47 1.19 1.77 2.27 2.06 2.11 ...
..$ nitrate: num [1:479] 0.725 1.64 1.01 6.81 0.751 1.69 2.08 0.996 0.817 0.488 ...
..$ ID : int [1:479] 99 99 99 99 99 99 99 99 99 99 ...
[list output truncated]
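A minimal sketch of the per-ID correlation described above, assuming the complete cases are already in one data frame shaped like dat1 (columns ID, sulfate, nitrate). The function name corr_by_id and the synthetic data are hypothetical and only there to make the example self-contained; reading the 332 files is left out.

```r
# Synthetic stand-in for dat1: three monitors with 5, 50 and 120 complete rows.
set.seed(1)
dat1 <- data.frame(
  ID      = rep(1:3, times = c(5, 50, 120)),
  sulfate = rnorm(175),
  nitrate = rnorm(175)
)

# Correlate sulfate and nitrate within every ID that has at least
# `threshold` complete rows. (Hypothetical helper, not the original corr().)
corr_by_id <- function(dat, threshold = 0) {
  groups <- split(dat, dat$ID)                  # one data frame per ID
  counts <- vapply(groups, nrow, integer(1))    # row count per ID, like table(dat$ID)
  keep   <- groups[counts >= threshold]         # drop IDs below the threshold
  vapply(keep, function(g) cor(g$sulfate, g$nitrate, method = "pearson"),
         numeric(1))
}

result <- corr_by_id(dat1, threshold = 10)   # named vector, one entry per kept ID
result
```

Splitting once and looping over the groups avoids recomputing the correlation on the full data frame on every iteration, and a threshold larger than every group simply returns an empty numeric vector.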
