I have a data frame with a numeric year column and a numeric average-temperature column (AvgTempZScore). How can I convert it to a time series in a suitable format?
Example:
year AvgTempZScore
<dbl> <dbl>
1 1835 0.109
2 1836 0.168
3 1837 0.177
4 1838 0.143
5 1839 0.188
6 1840 0.198
7 1841 0.200
8 1842 0.230
9 1843 0.237
10 1844 0.194
str():
tibble [179 × 2] (S3: tbl_df/tbl/data.frame)
$ year : num [1:179] 1835 1836 1837 1838 1839 ...
$ AvgTempZScore: num [1:179] 0.109 0.168 0.177 0.143 0.188 ...
With xts and lubridate:
xts::xts(x = df$AvgTempZScore, order.by = lubridate::ymd(df$year, truncated = 2L))
[,1]
1835-01-01 0.109
1836-01-01 0.168
1837-01-01 0.177
1838-01-01 0.143
1839-01-01 0.188
1840-01-01 0.198
1841-01-01 0.200
1842-01-01 0.230
1843-01-01 0.237
1844-01-01 0.194
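If you would rather not depend on lubridate for the index, a base-R sketch for building the same 1 January dates might look like the following. The data frame `df` here is a small stand-in containing only the first three rows of the example, and `df_dated` is a name introduced just for illustration.

```r
# Stand-in for the asker's data (first three rows only)
df <- data.frame(year = c(1835, 1836, 1837),
                 AvgTempZScore = c(0.109, 0.168, 0.177))

# Build a Date for 1 January of each year
dates <- as.Date(paste0(df$year, "-01-01"))

# A plain data frame keyed by Date; the same `dates` vector could be
# passed as order.by to xts::xts() instead of the lubridate call
df_dated <- data.frame(date = dates, AvgTempZScore = df$AvgTempZScore)
print(df_dated)
```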
You can also do this with the ts() function.
db=structure(list(year = c(1835,1836,1837,1838,1839,1840,1841,1842,1843,1844),
AvgTempZScore = c(0.109,0.168,0.177,0.143,0.188,0.198,0.200,0.230,0.237,0.194)),
row.names = c(1:10),
class = "data.frame")
str(db)
#'data.frame': 10 obs. of 2 variables:
# $ year : num 1835 1836 1837 1838 1839 ...
# $ AvgTempZScore: num 0.109 0.168 0.177 0.143 0.188 0.198 0.2 0.23 0.237 0.194
db = ts(db,frequency = 1,start=1835, end=1844)
str(db)
#Time-Series [1:10, 1:2] from 1835 to 1844: 1835 1836 1837 1838 1839 ...
#- attr(*, "dimnames")=List of 2
#..$ : NULL
#..$ : chr [1:2] "year" "AvgTempZScore"
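Note that calling ts() on the whole data frame keeps year as a second data column, which is why str() reports a [1:10, 1:2] matrix. If a single univariate series is what you want, a minimal sketch is to pass only the value column and let start carry the year. The name `temp_ts` is introduced here for illustration.

```r
# Same ten rows as in the example above
db <- data.frame(
  year = 1835:1844,
  AvgTempZScore = c(0.109, 0.168, 0.177, 0.143, 0.188,
                    0.198, 0.200, 0.230, 0.237, 0.194)
)

# Univariate annual series: the year lives in the time index, not in a column
temp_ts <- ts(db$AvgTempZScore, start = min(db$year), frequency = 1)

# The index can be inspected with time(), and subsets taken with window()
print(time(temp_ts))
recent <- window(temp_ts, start = 1840)
print(recent)
```

This assumes the years are consecutive with no gaps; ts() has no way to represent missing years other than NA values in the data.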
I've managed to create my first stat_density_2d plot:
ggplot(qual_colleges_all_data_clean, aes(x = UGDS, y = MD_EARN_WNE_P10), alpha = 0.4, size = 0.5) +
geom_point() +
stat_density_2d(aes(fill = ..level..), geom="polygon") +
scale_fill_gradient(low="lightblue", high="firebrick3") +
geom_smooth(method="loess", se=TRUE, formula = y~x, color="chartreuse1", fill="darkolivegreen2") +
labs(title = "Median Earnings By UGDS") +
scale_colour_hue(h = c(0, 280)) +
xlab(label = "UGDS") +
ylab(label = "Median Earnings 10 years After Entry")
In my data UGDS is a (numeric class) column of integers.
The plot looks fine, but I don't understand why the legend units are given as, e.g., "8e-09". How do I change the legend units to something more readily comprehensible (perhaps "more dense" and "less dense")?
Here's the str():
## tibble [1,121 × 25] (S3: tbl_df/tbl/data.frame)
## $ UNITID : num [1:1121] 159009 217721 198862 176406 158802 ...
## $ INSTNM : chr [1:1121] "Grambling State University" "Benedict College" "Livingstone College" "Tougaloo College" ...
## $ SCH_DEG : num [1:1121] 3 3 3 3 3 3 3 3 3 3 ...
## $ CCBASIC : num [1:1121] 18 22 22 21 21 22 18 19 19 18 ...
## $ ADM_RATE : num [1:1121] 0.972 0.768 0.501 0.707 0.388 ...
## $ ACTCM25 : num [1:1121] 16 15 14 15 19 14 16 16 15 15 ...
## $ ACTCM75 : num [1:1121] 20 19 16 22 23 17 19 18 19 19 ...
## $ SAT_AVG : num [1:1121] 981 933 854 969 1075 ...
## $ UGDS : num [1:1121] 4153 2034 1114 703 1215 ...
## $ COSTT4_A : num [1:1121] 25754 27927 28319 22594 31625 ...
## $ TUITIONFEE_IN : num [1:1121] 7683 16600 18296 10861 19281 ...
## $ TUITIONFEE_OUT : num [1:1121] 16706 16600 18296 10861 19281 ...
## $ AVGFACSAL : num [1:1121] 6352 5808 5408 4755 6472 ...
## $ PFTFAC : num [1:1121] 0.994 0.719 0.825 0.741 0.593 ...
## $ RET_FT4 : num [1:1121] 0.744 0.556 0.562 0.73 0.669 ...
## $ MD_EARN_WNE_P10 : num [1:1121] 33961 30951 30521 33291 37014 ...
## $ PCT25_EARN_WNE_P10: num [1:1121] 19299 19230 17973 20243 23757 ...
## $ PCT75_EARN_WNE_P10: num [1:1121] 50236 43045 45132 46866 52462 ...
## $ MD_EARN_WNE_P6 : num [1:1121] 27399 23174 24630 25088 29132 ...
## $ GRADS : num [1:1121] 1079 6 NA 8 NA ...
## $ RET_FT4_POOLED : num [1:1121] 0.734 0.56 0.53 0.738 0.678 ...
## $ C100_4_POOLED : num [1:1121] 0.105 0.18 0.152 0.22 0.269 ...
## $ INEXPFTE : num [1:1121] 3905 4224 5598 4884 7815 ...
## $ C150_4_POOLED : num [1:1121] 0.34 0.271 0.239 0.412 0.454 ...
## $ GRAD_DEBT_MDN : num [1:1121] 36750 34500 32875 32100 31250 ...
Thank you, most sincerely, for any help!
I'm trying to use H2O randomForest for multiclass classification in R, but when I run the code, the random forest always comes out as a regression model, despite the target variable being a factor. I am trying to predict 'Gradient', a factor with 5 levels, from one other factor, 'Period', with 4 levels, and 20 numerical predictors.
Any help would be appreciated. Code below....
>str(df)
Class 'H2OFrame' <environment: 0x000001f6b361abe0>
- attr(*, "op")= chr ":="
- attr(*, "eval")= logi TRUE
- attr(*, "id")= chr "RTMP_sid_aecc_35"
- attr(*, "nrow")= int 63878
- attr(*, "ncol")= int 22
- attr(*, "types")=List of 22
- attr(*, "data")='data.frame': 10 obs. of 22 variables:
..$ Gradient: Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1
..$ Period : Factor w/ 4 levels "Dawn","Day","Dusk",..: 2 2 2 2 2 2 2 2 2 2
..$ AC1 : num 1792 1793 1790 1790 1797 ...
..$ AC2 : num 316 316 318 317 324 ...
..$ AC3 : num 972 972 974 975 979 ...
etc for remaining numerical predictors.
>splits <- h2o.splitFrame(df, c(0.6,0.2), seed=1234)
>train <- h2o.assign(splits[[1]], "train.hex")
>valid <- h2o.assign(splits[[2]], "valid.hex")
>test <- h2o.assign(splits[[3]], "test.hex")
>str(train)
Class 'H2OFrame' <environment: 0x000002266fac7d40>
- attr(*, "op")= chr "assign"
- attr(*, "id")= chr "train.hex"
- attr(*, "nrow")= int 38259
- attr(*, "ncol")= int 22
- attr(*, "types")=List of 22
- attr(*, "data")='data.frame': 10 obs. of 22 variables:
..$ Gradient: Factor w/ 5 levels "LB","LU","PB",..: 1 1 1 1 1 1 1 1 1 1
..$ Period : Factor w/ 4 levels "Dawn","Day","Dusk",..: 2 2 2 2 2 2 2 2 2 2
..$ AC1 : num 1793 1797 1796 1805 1803 ...
..$ AC2 : num 316 324 322 322 323 ...
..$ AC3 : num 972 979 979 988 986 ...
..$ AC4 : num 663 662 664 673 670 ...
..$ AC5 : num 828 825 824 824 825 ...
..$ AD1 : num 1.22 1.42 1.73 2.25 1.99 ...
..$ AD2 : num 1.1 1.27 1.35 1.38 1.38 ...
..$ AD3 : num 1.22 1.42 1.72 2.24 1.99 ...
..$ AD4 : num 1.87 1.53 2.07 2.03 1.78 ...
..$ AD5 : num 2.33 2.33 2.33 2.33 2.33 ...
..$ AE1 : num 0.877 0.849 0.794 0.636 0.72 ...
..$ AE2 : num 0.3687 0.2332 0.1369 0.0433 0.0546 ...
..$ AE3 : num 0.774 0.723 0.624 0.335 0.487 ...
..$ AE4 : num 0.574 0.697 0.44 0.477 0.605 ...
..$ AE5 : num 0.542 0.542 0.554 0.543 0.542 ...
..$ BI1 : num 53 71.9 64 75.4 74.6 ...
..$ BI2 : num 6.51 5.88 4.54 2.3 2.34 ...
..$ BI3 : num 22.2 26 21.5 27.9 28 ...
..$ BI4 : num 7.86 9.58 8.59 12.17 12.5 ...
..$ BI5 : num 11.3 17.9 16.4 18.1 17.5 ...
> train[1:5,] ## rows 1-5, all columns
Gradient Period AC1 AC2 AC3 AC4 AC5 AD1 AD2 AD3 AD4 AD5 AE1 AE2 AE3 AE4 AE5
1 LB Day 1792.97 316.4038 972.4288 663.2612 827.6400 1.217491 1.104860 1.217491 1.866627 2.332115 0.876794 0.368712 0.774123 0.574168 0.541993
2 LB Day 1796.78 324.3562 979.2218 662.2341 824.6436 1.421910 1.274373 1.421910 1.526506 2.331810 0.848660 0.233177 0.722544 0.696906 0.542409
3 LB Day 1796.09 321.9081 978.7464 664.1776 824.4437 1.726798 1.345030 1.721740 2.066543 2.326278 0.794230 0.136892 0.624107 0.440458 0.553766
4 LB Day 1805.14 322.0390 987.9472 673.2841 824.3146 2.248474 1.381644 2.239061 2.028538 2.331881 0.636007 0.043267 0.334964 0.477149 0.542572
5 LB Day 1803.15 323.1540 985.6376 669.7603 824.6003 1.992025 1.380468 1.992004 1.782532 2.331971 0.720153 0.054578 0.486951 0.604876 0.542420
BI1 BI2 BI3 BI4 BI5
1 53.03567 6.506536 22.23446 7.862767 11.32708
2 71.94775 5.879407 26.04130 9.579798 17.94337
3 63.98763 4.535041 21.50727 8.590985 16.38780
4 75.38319 2.301110 27.89600 12.165991 18.06316
5 74.60517 2.342853 28.02568 12.499122 17.52902
rf1 <- h2o.randomForest(
training_frame = train,
validation_frame = valid,
x=2:22,
y=1,
ntrees = 200,
stopping_rounds = 2,
score_each_iteration = TRUE,
seed = 1000000)
>perf <- h2o.performance(rf1, valid)
>h2o.mcc(perf)
Error in h2o.metric(object, thresholds, "absolute_mcc") :
No absolute_mcc for H2OMultinomialMetrics
h2o.accuracy(perf)
Error in h2o.metric(object, thresholds, "accuracy") :
No accuracy for H2OMultinomialMetrics
and an excerpt from the model summary:
H2OMultinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **
Training Set Metrics:
=====================
Extract training frame with `h2o.getFrame("train.hex")`
MSE: (Extract with `h2o.mse`) 0.2499334
RMSE: (Extract with `h2o.rmse`) 0.4999334
Logloss: (Extract with `h2o.logloss`) 0.9987891
Mean Per-Class Error: 0.2941914
R^2: (Extract with `h2o.r2`) 0.8683096
MCC is defined only for binary classifiers, and your factor has more than two levels.
You can tell that you have successfully fitted a multinomial classifier, rather than a regression, because the error message says "No absolute_mcc for H2OMultinomialMetrics".
h2o.accuracy() and h2o.logloss() are available for multinomial models.
UPDATE: ...well, the docs say h2o.accuracy() is available, but a quick check on the iris dataset gives me the same error you see; it must be related to that warning in the docs (which I didn't understand).
Anyway, h2o.confusionMatrix(rf1) is likely to be more useful; the overall error shown in the bottom right is 1 - accuracy. See also h2o.confusionMatrix(rf1, valid = TRUE) and h2o.confusionMatrix(rf1, test).
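To make the "1 - accuracy" relationship concrete, here is a small sketch on a plain R matrix. The counts are made up for illustration, and the matrix is not an H2O object; it only assumes the usual layout with actual classes in rows and predicted classes in columns.

```r
# Hypothetical 3-class confusion matrix (rows = actual, columns = predicted)
cm <- matrix(c(50,  3,  2,
                4, 40,  6,
                1,  5, 39),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual = c("A", "B", "C"),
                             predicted = c("A", "B", "C")))

# Accuracy is the share of observations on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
overall_error <- 1 - accuracy

# Per-class error: share of each actual class that was misclassified
per_class_error <- 1 - diag(cm) / rowSums(cm)
mean_per_class_error <- mean(per_class_error)

print(c(accuracy = accuracy, overall_error = overall_error))
print(per_class_error)
```

The mean of the per-class errors is the same kind of quantity as the "Mean Per-Class Error" line in the model summary, which weights every class equally regardless of its size.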
I have a big .csv data set with columns $B1 through $B34. They are all numeric, which is fine, but I would like the last column, DEC, to be a factor. Its values are only 1 and 0.
How can I make the last column a factor?
mydata<-read.csv(file.choose(),header=T)
str(mydata)
'data.frame': 1024 obs. of 35 variables:
$ B1 : num 90.8 113.2 100.4 144.5 131.6 ...
$ B2 : num 0.133 0.139 0.144 0.147 0.141 ...
-----------
-----------
$ B32 : num 0.216 0.27 0.309 0.259 0.304 ...
$ B33 : num 0.526 0.407 0.286 0.129 0.37 ...
$ B34 : num 4.33 5.61 4.81 7.32 6.83 ...
$ DEC : int 1 1 1 1 1 1 1 1 1 1 ...
You can use the as.factor() function to convert any column to a factor. For example:
mydata <- read.csv("data.csv")        # read the data
mydata$DEC <- as.factor(mydata$DEC)   # convert the column to a factor
class(mydata$DEC)                     # check that it worked
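If you also want readable level names rather than "0" and "1", factor() lets you set them in the same step. The sketch below uses a tiny made-up data frame in place of the CSV, and the labels "no" and "yes" are placeholders chosen for illustration, not something implied by the data.

```r
# Small stand-in for the CSV: one numeric column and the 0/1 DEC column
mydata <- data.frame(B1 = c(90.8, 113.2, 100.4),
                     DEC = c(1L, 0L, 1L))

# Convert to a factor, fixing the level order and giving the levels names
mydata$DEC <- factor(mydata$DEC, levels = c(0, 1), labels = c("no", "yes"))

str(mydata$DEC)     # shows the factor and its two levels
table(mydata$DEC)   # counts per level
```

Stating levels explicitly also guarantees that both levels exist even if a particular subset of the data happens to contain only one of the two values.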
I am trying to use the quantregForest() function from the quantregForest package (which is built on the randomForest package.)
I tried to train the model using:
qrf_model <- quantregForest(x=Xtrain, y=Ytrain, importance=TRUE, ntree=10)
and I get the following error message (even after reducing the number of trees from 100 to 10):
Error in rep(0, nobs * nobs * npred) : invalid 'times' argument
plus a warning:
In nobs * nobs * npred : NAs produced by integer overflow
The data frame Xtrain has 38 numeric variables, and it looks like this:
> str(Xtrain)
'data.frame': 31132 obs. of 38 variables:
$ X1 : num 301306 6431 2293 1264 32477 ...
$ X2 : num 173.2 143.5 43.4 180.6 1006.2 ...
$ X3 : num 0.1598 0.1615 0.1336 0.0953 0.1988 ...
$ X4 : num 0.662 0.25 0.71 0.709 0.671 ...
$ X5 : num 0.05873 0.0142 0 0.00154 0.09517 ...
$ X6 : num 0.01598 0 0.0023 0.00154 0.01634 ...
$ X7 : num 0.07984 0.03001 0.00845 0.04304 0.09326 ...
$ X8 : num 0.92 0.97 0.992 0.957 0.907 ...
$ X9 : num 105208 1842 830 504 11553 ...
$ X10: num 69974 1212 611 352 7080 ...
$ X11: num 0.505 0.422 0.55 0.553 0.474 ...
$ X12: num 0.488 0.401 0.536 0.541 0.45 ...
$ X13: num 0.333 0.419 0.257 0.282 0.359 ...
$ X14: num 0.187 0.234 0.172 0.207 0.234 ...
$ X15: num 0.369 0.216 0.483 0.412 0.357 ...
$ X16: num 0.0765 0.1205 0.0262 0.054 0.0624 ...
$ X17: num 2954 77 12 10 739 ...
$ X18: num 2770 43 9 21 433 119 177 122 20 17 ...
$ X19: num 3167 72 49 25 622 ...
$ X20: num 3541 57 14 24 656 ...
$ X21: num 3361 82 0 33 514 ...
$ X22: num 3929 27 10 48 682 ...
$ X23: num 3695 73 61 15 643 ...
$ X24: num 4781 52 5 14 680 ...
$ X25: num 3679 103 5 23 404 ...
$ X26: num 7716 120 55 40 895 ...
$ X27: num 11043 195 72 48 1280 ...
$ X28: num 16080 332 160 83 1684 ...
$ X29: num 12312 125 124 62 1015 ...
$ X30: num 8218 99 36 22 577 ...
$ X31: num 9957 223 146 26 532 ...
$ X32: num 0.751 0.444 0.621 0.527 0.682 ...
$ X33: num 0.01873 0 0 0.00317 0.02112 ...
$ X34: num 0.563 0.372 0.571 0.626 0.323 ...
$ X35: num 0.366 0.39 0.156 0.248 0.549 ...
$ X36: num 0.435 0.643 0.374 0.505 0.36 ...
$ X37: num 0.526 0.31 0.577 0.441 0.591 ...
$ X38: num 0.00163 0 0 0 0.00155 0.00103 0 0 0 0 ...
And the response variable Ytrain looks like this:
> str(Ytrain)
num [1:31132] 2605 56 8 16 214 ...
I checked that neither Xtrain nor Ytrain contains any NAs:
> sum(is.na(Xtrain))
[1] 0
> sum(is.na(Ytrain))
[1] 0
I am assuming that the invalid 'times' argument error from rep(0, nobs * nobs * npred) arises because the product nobs * nobs * npred becomes NA after an integer overflow.
What I do not understand is where the integer overflow comes from. None of my variables are of the integer class so what am I missing?
I examined the source code for the quantregForest() function and the source code for the method predict.imp called by the quantregForest() function.
I found that nobs stands for the number of observations. In the case above nobs =length(Ytrain) = 31132 . The variable npred stands for the number of predictors. It is given by npred = ncol(Xtrain)=38. Both npred and nobs are of class integer, and
nobs*nobs*npred = 31132*31132*38 = 36829654112.
And herein lies the root cause of the error, since:
nobs*nobs*npred = 36829654112 > 2147483647,
where 2147483647 is the maximal integer value in R. Hence the integer overflow warning and the replacement of the product nobs*nobs*npred with an NA.
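The overflow can be reproduced in isolation with a few lines of base R. This is only a sketch of the arithmetic, not the package's own code; the variable names `overflowed`, `exact`, and `gib_needed` are introduced here for illustration.

```r
nobs  <- 31132L   # number of observations, stored as an integer
npred <- 38L      # number of predictors, stored as an integer

# Integer arithmetic overflows: R returns NA and issues a warning
overflowed <- suppressWarnings(nobs * nobs * npred)
print(overflowed)

# Converting one operand to double avoids the overflow
exact <- as.numeric(nobs) * nobs * npred
print(exact)
print(exact > .Machine$integer.max)

# rep(0, n) creates a double vector, so each element takes 8 bytes;
# a vector of this length would need roughly 274 GiB
gib_needed <- exact * 8 / 2^30
print(gib_needed)
```

So even if the product were computed without overflowing, allocating a vector of that length would not be feasible on ordinary hardware.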
The bottom line is that, to avoid the error, I will have to either train on considerably fewer observations or set importance=FALSE in the quantregForest() call. Computing variable importance is very memory-intensive, even with fewer than 10000 observations.
I created a data frame from another dataset with 332 IDs. I split the data frame by ID and would like to count the rows for each ID and then run a correlation function. Can someone tell me how to count the rows for each ID so that I can compute a correlation for each of these groups?
jlhoward, your suggestion to add the table(dat1$ID) command worked. My other problem is that the function will not stop running:
corr<-function(directory,threshold=)
####### file location path#####
for(i in 1:332){dat<-rbind(dat,read.csv(specdata1[i]))
dat1<-dat[complete.cases(dat),]
dat2<-(split(dat1,dat1$ID))
list(dat2)
dat3<-table(dat1$ID)
for (i in dat1>=threshold){
x<-dat1$sulfate
y<-dat1$nitrate
correlations<-cor(x,y,use="pairwise.complete.obs",method="pearson")
corrs_output<-c(corrs_output,correlations)
}
I'm trying to correlate "sulfate" and "nitrate" for each monitor ID that meets a threshold. I created a list that holds all the complete cases per monitor ID. I need the function to compute the correlation between "sulfate" and "nitrate" for every per-ID set whose number of complete cases is >= the threshold argument of the function. Below are the head and tail of the structure of the list of per-ID data frames within the main data set "specdata1".
head of entire data.frame/list of specdata1 complete cases for correlation
head(str(dat2,1))
List of 323
$ 1 :'data.frame': 117 obs. of 4 variables:
..$ Date : Factor w/ 4018 levels "2003-01-01","2003-01-02",..: 279 285 291 297 303 315 321 327 333 339 ...
..$ sulfate: num [1:117] 7.21 5.99 4.68 3.47 2.42 1.43 2.76 3.41 1.3 3.15 ...
..$ nitrate: num [1:117] 0.651 0.428 1.04 0.363 0.507 0.474 0.425 0.964 0.491 0.669 ...
..$ ID : int [1:117] 1 1 1 1 1 1 1 1 1 1 ...
tail of entire data.frame/list for all complete cases of specdata1
tail(str(dat2,1))
$ 99 :'data.frame': 479 obs. of 4 variables:
..$ Date : Factor w/ 4018 levels "2003-01-01","2003-01-02",..: 1774 1780 1786 1804 1810 1816 1822 1840 1852 1858 ...
..$ sulfate: num [1:479] 1.51 8.2 1.48 4.75 3.47 1.19 1.77 2.27 2.06 2.11 ...
..$ nitrate: num [1:479] 0.725 1.64 1.01 6.81 0.751 1.69 2.08 0.996 0.817 0.488 ...
..$ ID : int [1:479] 99 99 99 99 99 99 99 99 99 99 ...
[list output truncated]
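As far as can be told from the posted code, two things are likely to keep it from finishing: the complete-case filtering and splitting are redone inside the file-reading loop while dat keeps growing, and for (i in dat1>=threshold) iterates over every cell of a logical matrix, recomputing the same overall correlation each time. A minimal sketch of the per-ID approach is shown below. It uses synthetic data instead of the CSV files (the file path is not given in the question), and the function name `corr_by_id` is hypothetical.

```r
# Synthetic stand-in for the combined monitor data: three IDs of different sizes
set.seed(1)
dat <- data.frame(
  ID      = rep(1:3, times = c(5, 20, 40)),
  sulfate = rnorm(65),
  nitrate = rnorm(65)
)
dat$sulfate[c(2, 30)] <- NA   # introduce a couple of incomplete rows

corr_by_id <- function(dat, threshold = 0) {
  # Keep only rows with no missing values
  complete <- dat[complete.cases(dat), ]
  # One data frame per monitor ID
  groups <- split(complete, complete$ID)
  # Number of complete rows per ID (same counts as table(complete$ID))
  n_obs <- vapply(groups, nrow, integer(1))
  # Keep the IDs with at least `threshold` complete rows
  keep <- groups[n_obs >= threshold]
  # One sulfate/nitrate correlation per remaining ID
  vapply(keep, function(g) cor(g$sulfate, g$nitrate), numeric(1))
}

result <- corr_by_id(dat, threshold = 10)
print(result)
```

The result is a named numeric vector with one correlation per qualifying ID; if no ID reaches the threshold, it has length zero.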