Subsetting by rows to do a correlation - r

I created a data frame from another dataset with 332 IDs. I split the data frame by ID, and I would like to count the rows of each ID and then run a correlation function. Can someone tell me how to count the rows of each ID so that I can compute a correlation from these individual groups?
jlhoward, your suggestion to add the table(dat1$ID) command worked. My other problem is that the function will not stop running:
corr <- function(directory, threshold) {
  ## file location path
  dat <- data.frame()
  for (i in 1:332) {
    dat <- rbind(dat, read.csv(specdata1[i]))
  }
  dat1 <- dat[complete.cases(dat), ]
  dat2 <- split(dat1, dat1$ID)
  list(dat2)
  dat3 <- table(dat1$ID)
  for (i in dat1 >= threshold) {
    x <- dat1$sulfate
    y <- dat1$nitrate
    correlations <- cor(x, y, use = "pairwise.complete.obs", method = "pearson")
    corrs_output <- c(corrs_output, correlations)
  }
}
I'm trying to correlate the "sulfate" and "nitrate" of each ID monitor that meets a threshold. I created a list that has all the complete cases per ID monitor. I need the function to compute a correlation between "sulfate" and "nitrate" for every set per ID that is >= the threshold argument of the function. Below are the head and tail of the structure of the data.frame/list of each data set within the main data set "specdata1".
head of entire data.frame/list of specdata1 complete cases for correlation
head(str(dat2,1))
List of 323
$ 1 :'data.frame': 117 obs. of 4 variables:
..$ Date : Factor w/ 4018 levels "2003-01-01","2003-01-02",..: 279 285 291 297 303 315 321 327 333 339 ...
..$ sulfate: num [1:117] 7.21 5.99 4.68 3.47 2.42 1.43 2.76 3.41 1.3 3.15 ...
..$ nitrate: num [1:117] 0.651 0.428 1.04 0.363 0.507 0.474 0.425 0.964 0.491 0.669 ...
..$ ID : int [1:117] 1 1 1 1 1 1 1 1 1 1 ...
tail of entire data.frame/list for all complete cases of specdata1
tail(str(dat2,1))
$ 99 :'data.frame': 479 obs. of 4 variables:
..$ Date : Factor w/ 4018 levels "2003-01-01","2003-01-02",..: 1774 1780 1786 1804 1810 1816 1822 1840 1852 1858 ...
..$ sulfate: num [1:479] 1.51 8.2 1.48 4.75 3.47 1.19 1.77 2.27 2.06 2.11 ...
..$ nitrate: num [1:479] 0.725 1.64 1.01 6.81 0.751 1.69 2.08 0.996 0.817 0.488 ...
..$ ID : int [1:479] 99 99 99 99 99 99 99 99 99 99 ...
[list output truncated]
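The non-terminating loop in the posted code comes from "for (i in dat1 >= threshold)", which iterates once per element of a logical matrix rather than once per ID. Below is a minimal base-R sketch of the intended logic; it assumes the CSVs have already been read and row-bound into a single data frame with columns ID, sulfate, and nitrate, and the function name corr_by_id is hypothetical, not the original assignment solution:

```r
## Hedged sketch: correlate sulfate and nitrate per monitor ID, keeping
## only IDs whose number of complete cases meets the threshold.
corr_by_id <- function(dat, threshold = 0) {
  dat1 <- dat[complete.cases(dat), ]      # complete cases only
  groups <- split(dat1, dat1$ID)          # one data frame per ID
  counts <- sapply(groups, nrow)          # row count per ID, like table(dat1$ID)
  sapply(groups[counts >= threshold], function(g) {
    cor(g$sulfate, g$nitrate, method = "pearson")
  })
}

## Toy usage with made-up data: ID 1 has 5 rows, ID 2 has only 3
dat <- data.frame(ID      = rep(1:2, c(5, 3)),
                  sulfate = c(7.2, 6.0, 4.7, 3.5, 2.4, 1.5, 2.8, 3.4),
                  nitrate = c(0.65, 0.43, 1.04, 0.36, 0.51, 0.47, 0.43, 0.96))
corr_by_id(dat, threshold = 4)   # only ID 1 passes the threshold
```

The sapply over the filtered list replaces both the row counting and the faulty for-loop in one pass.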

Converting the structure of input data in R

Below is the structure of my dataframe:
'data.frame': 213 obs. of 2 variables:
$ up_entrez: Factor w/ 143 levels "101739","108077",...: 3 94 125 103 3 34 3 37 134 13 ...
$ Ratio : num 3.1 3.37 1.8 1.21 6.92 ....
and I want to convert it to something like this for the function to take it as an input:
Named num [1:12495] 4.57 4.51 4.42 4.14 3.88 ....
- attr(*, "names")= chr [1:12495] "4312" "8318" "10874" "55143" ....
How do I do that?
We can use setNames to create a named vector:
v1 <- with(df1, setNames(Ratio, up_entrez))
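As a quick illustration of the setNames() approach (toy values, not the asker's data):

```r
# Toy data frame mirroring the question's structure
df1 <- data.frame(up_entrez = c("4312", "8318", "10874"),
                  Ratio     = c(4.57, 4.51, 4.42))

# Values come from Ratio, names from up_entrez; if up_entrez is a factor,
# the names<- replacement coerces it to its character labels
v1 <- with(df1, setNames(Ratio, up_entrez))
str(v1)
# Named num [1:3] 4.57 4.51 4.42
#  - attr(*, "names")= chr [1:3] "4312" "8318" "10874"
```

This matches the "Named num" structure the downstream function expects.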

Lapply over subset of dataframes without splitting further

I am at a loss on how to do this without addressing each individual part. I have an initial timeseries dataset that I split into a list of 12 dataframes representing each month. Within each month, I want to run calculations and ggplot on each unique site without having to call each individual site. The structure currently is as follows:
$ April :'data.frame': 9360 obs. of 15 variables:
..$ site_id : int [1:9360] 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 ...
..$ UTC_date.1 : Date[1:9360], format: "2005-04-01" "2005-04-02" "2005-04-03" "2005-04-04" ...
..$ POSIXct : POSIXct[1:9360], format: "2005-04-01 06:00:00" "2005-04-02 06:00:00" "2005-04-03 06:00:00" "2005-04-04 06:00:00" ...
..$ swe_mm : num [1:9360] 45.9 44.6 43.5 42.4 41.2 ...
..$ fsca : num [1:9360] 1 1 1 1 0.997 ...
..$ snoht_m : num [1:9360] 0.303 0.239 0.21 0.186 0.165 ...
..$ swe_mm.1 : num [1:9360] 45.9 44.6 43.5 42.4 41.2 ...
..$ fsca.1 : num [1:9360] 1 1 1 1 0.997 ...
..$ snoht_m.1 : num [1:9360] 0.303 0.239 0.21 0.186 0.165 ...
..$ actSWE_mm : num [1:9360] 279 282 282 282 282 284 292 295 295 295 ...
..$ actSD_cm : num [1:9360] 79 79 NA 79 79 81 185 81 81 81 ...
..$ swe_Res_mm : num [1:9360] 233 237 238 240 241 ...
..$ snoht_Res_m : num [1:9360] 0.487 0.551 NA 0.604 0.625 ...
..$ swe_Res1_mm : num [1:9360] 233 237 238 240 241 ...
..$ snoht_Res1_m: num [1:9360] 0.487 0.551 NA 0.604 0.625 ...
I can use lapply to calculate the standardized rmse without issue if I apply it to each dataframe entirely:
stdres.fun <- function(data,x,out) {data[out] <- data[[x]] / ((sum(data[[x]]^2, na.rm = TRUE)/NROW(data))^.5); data}
monthSplit <- lapply(monthSplit, stdres.fun, x = "swe_Res_mm", out="stdSWE_res")
However, I am having trouble figuring out how to run this calculation on each unique site_id. What I mean to say is there are 32 different sites. They are the same sites in each dataframe, however I want to calculate the rmse for each site within each dataframe in the list. So if I had sites 946 and 1003, the calculation would run on each of those separately rather than together.
I'm assuming I can split the data further into different lists but I feel like this would be messier than it already is. Is there another way I can go about doing this?
We could modify the function and use tidyverse methods, grouping by site_id within each month's data frame:
library(purrr)
library(dplyr)
monthSplit2 <- map(monthSplit, ~
  .x %>%
    group_by(site_id) %>%
    mutate(stdSWE_res = swe_Res_mm / ((sum(swe_Res_mm^2,
                                           na.rm = TRUE) / n())^.5)))
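A base-R equivalent of the grouped standardization can be sketched with ave(), so each site_id is scaled by its own root-mean-square. The numbers below are made up; only the column names follow the question:

```r
# Toy data: two sites, column names as in the question
d <- data.frame(site_id    = c(1003, 1003, 946, 946),
                swe_Res_mm = c(3, 4, 6, 8))

# Per-group RMS, sqrt(sum(x^2)/n), computed within each site_id
rms <- ave(d$swe_Res_mm, d$site_id,
           FUN = function(x) sqrt(sum(x^2, na.rm = TRUE) / length(x)))
d$stdSWE_res <- d$swe_Res_mm / rms

d$stdSWE_res[1]   # 3 / sqrt((3^2 + 4^2) / 2) = 3 / sqrt(12.5)
```

ave() recycles each group's scalar RMS back over that group's rows, so no further splitting of the list is needed.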

Multiclass classification in H2O randomForest

I'm trying to use H2O randomForest for multiclass classification in R, but when I run the code, the randomForest always comes out as a regression model, despite the target variable being a factor. I am trying to predict 'Gradient', a factor with 5 levels, by one other factor, 'Period', with 4 levels, and 21 numerical predictors.
Any help would be appreciated. Code below.
>str(df)
Class 'H2OFrame' <environment: 0x000001f6b361abe0>
- attr(*, "op")= chr ":="
- attr(*, "eval")= logi TRUE
- attr(*, "id")= chr "RTMP_sid_aecc_35"
- attr(*, "nrow")= int 63878
- attr(*, "ncol")= int 22
- attr(*, "types")=List of 22
- attr(*, "data")='data.frame': 10 obs. of 22 variables:
..$ Gradient: Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1
..$ Period : Factor w/ 4 levels "Dawn","Day","Dusk",..: 2 2 2 2 2 2 2 2 2 2
..$ AC1 : num 1792 1793 1790 1790 1797 ...
..$ AC2 : num 316 316 318 317 324 ...
..$ AC3 : num 972 972 974 975 979 ...
etc for remaining numerical predictors.
>splits <- h2o.splitFrame(df, c(0.6,0.2), seed=1234)
>train <- h2o.assign(splits[[1]], "train.hex")
>valid <- h2o.assign(splits[[2]], "valid.hex")
>test <- h2o.assign(splits[[3]], "test.hex")
>str(train)
Class 'H2OFrame' <environment: 0x000002266fac7d40>
- attr(*, "op")= chr "assign"
- attr(*, "id")= chr "train.hex"
- attr(*, "nrow")= int 38259
- attr(*, "ncol")= int 22
- attr(*, "types")=List of 22
- attr(*, "data")='data.frame': 10 obs. of 22 variables:
..$ Gradient: Factor w/ 5 levels "LB","LU","PB",..: 1 1 1 1 1 1 1 1 1 1
..$ Period : Factor w/ 4 levels "Dawn","Day","Dusk",..: 2 2 2 2 2 2 2 2 2 2
..$ AC1 : num 1793 1797 1796 1805 1803 ...
..$ AC2 : num 316 324 322 322 323 ...
..$ AC3 : num 972 979 979 988 986 ...
..$ AC4 : num 663 662 664 673 670 ...
..$ AC5 : num 828 825 824 824 825 ...
..$ AD1 : num 1.22 1.42 1.73 2.25 1.99 ...
..$ AD2 : num 1.1 1.27 1.35 1.38 1.38 ...
..$ AD3 : num 1.22 1.42 1.72 2.24 1.99 ...
..$ AD4 : num 1.87 1.53 2.07 2.03 1.78 ...
..$ AD5 : num 2.33 2.33 2.33 2.33 2.33 ...
..$ AE1 : num 0.877 0.849 0.794 0.636 0.72 ...
..$ AE2 : num 0.3687 0.2332 0.1369 0.0433 0.0546 ...
..$ AE3 : num 0.774 0.723 0.624 0.335 0.487 ...
..$ AE4 : num 0.574 0.697 0.44 0.477 0.605 ...
..$ AE5 : num 0.542 0.542 0.554 0.543 0.542 ...
..$ BI1 : num 53 71.9 64 75.4 74.6 ...
..$ BI2 : num 6.51 5.88 4.54 2.3 2.34 ...
..$ BI3 : num 22.2 26 21.5 27.9 28 ...
..$ BI4 : num 7.86 9.58 8.59 12.17 12.5 ...
..$ BI5 : num 11.3 17.9 16.4 18.1 17.5 ...
> train[1:5,] ## rows 1-5, all columns
Gradient Period AC1 AC2 AC3 AC4 AC5 AD1 AD2 AD3 AD4 AD5 AE1 AE2 AE3 AE4 AE5
1 LB Day 1792.97 316.4038 972.4288 663.2612 827.6400 1.217491 1.104860 1.217491 1.866627 2.332115 0.876794 0.368712 0.774123 0.574168 0.541993
2 LB Day 1796.78 324.3562 979.2218 662.2341 824.6436 1.421910 1.274373 1.421910 1.526506 2.331810 0.848660 0.233177 0.722544 0.696906 0.542409
3 LB Day 1796.09 321.9081 978.7464 664.1776 824.4437 1.726798 1.345030 1.721740 2.066543 2.326278 0.794230 0.136892 0.624107 0.440458 0.553766
4 LB Day 1805.14 322.0390 987.9472 673.2841 824.3146 2.248474 1.381644 2.239061 2.028538 2.331881 0.636007 0.043267 0.334964 0.477149 0.542572
5 LB Day 1803.15 323.1540 985.6376 669.7603 824.6003 1.992025 1.380468 1.992004 1.782532 2.331971 0.720153 0.054578 0.486951 0.604876 0.542420
BI1 BI2 BI3 BI4 BI5
1 53.03567 6.506536 22.23446 7.862767 11.32708
2 71.94775 5.879407 26.04130 9.579798 17.94337
3 63.98763 4.535041 21.50727 8.590985 16.38780
4 75.38319 2.301110 27.89600 12.165991 18.06316
5 74.60517 2.342853 28.02568 12.499122 17.52902
rf1 <- h2o.randomForest(
  training_frame = train,
  validation_frame = valid,
  x = 2:22,
  y = 1,
  ntrees = 200,
  stopping_rounds = 2,
  score_each_iteration = TRUE,
  seed = 1000000)
>perf <- h2o.performance(rf1, valid)
>h2o.mcc(perf)
Error in h2o.metric(object, thresholds, "absolute_mcc") :
No absolute_mcc for H2OMultinomialMetrics
h2o.accuracy(perf)
Error in h2o.metric(object, thresholds, "accuracy") :
No accuracy for H2OMultinomialMetrics
and a summary from the model summary:
H2OMultinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **
Training Set Metrics:
=====================
Extract training frame with `h2o.getFrame("train.hex")`
MSE: (Extract with `h2o.mse`) 0.2499334
RMSE: (Extract with `h2o.rmse`) 0.4999334
Logloss: (Extract with `h2o.logloss`) 0.9987891
Mean Per-Class Error: 0.2941914
R^2: (Extract with `h2o.r2`) 0.8683096
mcc is specifically for binary classifiers; your factor has more than 2 levels.
You can tell you have successfully done a multinomial classification, rather than a regression, because the error message says "No absolute_mcc for H2OMultinomialMetrics".
h2o.accuracy() and h2o.logloss() are available for multinomial models.
UPDATE: ...well, the docs say h2o.accuracy() is available, but a quick check on the iris dataset gives me the same error you see; it must be related to that warning in the docs (which I didn't understand).
Anyway, more useful is likely to be h2o.confusionMatrix(rf1); the overall error shown in the bottom right is 1 - accuracy. Also try h2o.confusionMatrix(rf1, valid = TRUE) and h2o.confusionMatrix(rf1, test).

.csv file int variable to factor in R

I have a big .csv data set with columns $B1 through $B34. They are all numeric, which is fine, but I would like the last column, DEC, to be a factor; its values consist of only 1 and 0.
How can I make the last column a factor?
mydata<-read.csv(file.choose(),header=T)
str(mydata)
'data.frame': 1024 obs. of 35 variables:
$ B1 : num 90.8 113.2 100.4 144.5 131.6 ...
$ B2 : num 0.133 0.139 0.144 0.147 0.141 ...
-----------
-----------
$ B32 : num 0.216 0.27 0.309 0.259 0.304 ...
$ B33 : num 0.526 0.407 0.286 0.129 0.37 ...
$ B34 : num 4.33 5.61 4.81 7.32 6.83 ...
$ DEC : int 1 1 1 1 1 1 1 1 1 1 ...
You can use the as.factor() function to convert any column to a factor. For example:
mydata<-read.csv("data.csv") #Read the data#
mydata$DEC<-as.factor(mydata$DEC) #Convert the column to a factor
class(mydata$DEC) #Just to check that it worked#
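A self-contained illustration of the conversion, using a toy 0/1 vector rather than the asker's file:

```r
# A numeric 0/1 column, as described for DEC
dec <- c(1, 0, 1, 1, 0)
f <- as.factor(dec)

class(f)    # "factor"
levels(f)   # "0" "1"
```

Alternatively, read.csv() can be told the column types up front via its colClasses argument, passing "factor" for the DEC column.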

r quantregForest() error: NA's produced by integer overflow lead to an invalid argument in the rep() function

I am trying to use the quantregForest() function from the quantregForest package (which is built on the randomForest package).
I tried to train the model using:
qrf_model <- quantregForest(x=Xtrain, y=Ytrain, importance=TRUE, ntree=10)
and I get the following error message (even after reducing the number of trees from 100 to 10):
Error in rep(0, nobs * nobs * npred) : invalid 'times' argument
plus a warning:
In nobs * nobs * npred : NAs produced by integer overflow
The data frame Xtrain has 38 numeric variables, and it looks like this:
> str(Xtrain)
'data.frame': 31132 obs. of 38 variables:
$ X1 : num 301306 6431 2293 1264 32477 ...
$ X2 : num 173.2 143.5 43.4 180.6 1006.2 ...
$ X3 : num 0.1598 0.1615 0.1336 0.0953 0.1988 ...
$ X4 : num 0.662 0.25 0.71 0.709 0.671 ...
$ X5 : num 0.05873 0.0142 0 0.00154 0.09517 ...
$ X6 : num 0.01598 0 0.0023 0.00154 0.01634 ...
$ X7 : num 0.07984 0.03001 0.00845 0.04304 0.09326 ...
$ X8 : num 0.92 0.97 0.992 0.957 0.907 ...
$ X9 : num 105208 1842 830 504 11553 ...
$ X10: num 69974 1212 611 352 7080 ...
$ X11: num 0.505 0.422 0.55 0.553 0.474 ...
$ X12: num 0.488 0.401 0.536 0.541 0.45 ...
$ X13: num 0.333 0.419 0.257 0.282 0.359 ...
$ X14: num 0.187 0.234 0.172 0.207 0.234 ...
$ X15: num 0.369 0.216 0.483 0.412 0.357 ...
$ X16: num 0.0765 0.1205 0.0262 0.054 0.0624 ...
$ X17: num 2954 77 12 10 739 ...
$ X18: num 2770 43 9 21 433 119 177 122 20 17 ...
$ X19: num 3167 72 49 25 622 ...
$ X20: num 3541 57 14 24 656 ...
$ X21: num 3361 82 0 33 514 ...
$ X22: num 3929 27 10 48 682 ...
$ X23: num 3695 73 61 15 643 ...
$ X24: num 4781 52 5 14 680 ...
$ X25: num 3679 103 5 23 404 ...
$ X26: num 7716 120 55 40 895 ...
$ X27: num 11043 195 72 48 1280 ...
$ X28: num 16080 332 160 83 1684 ...
$ X29: num 12312 125 124 62 1015 ...
$ X30: num 8218 99 36 22 577 ...
$ X31: num 9957 223 146 26 532 ...
$ X32: num 0.751 0.444 0.621 0.527 0.682 ...
$ X33: num 0.01873 0 0 0.00317 0.02112 ...
$ X34: num 0.563 0.372 0.571 0.626 0.323 ...
$ X35: num 0.366 0.39 0.156 0.248 0.549 ...
$ X36: num 0.435 0.643 0.374 0.505 0.36 ...
$ X37: num 0.526 0.31 0.577 0.441 0.591 ...
$ X38: num 0.00163 0 0 0 0.00155 0.00103 0 0 0 0 ...
And the response variable Ytrain looks like this:
> str(Ytrain)
num [1:31132] 2605 56 8 16 214 ...
I checked that neither Xtrain or Ytrain contain any NA's by:
> sum(is.na(Xtrain))
[1] 0
> sum(is.na(Ytrain))
[1] 0
I am assuming that the error message about the invalid 'times' argument in rep(0, nobs * nobs * npred) comes from an NA value for the product nobs * nobs * npred due to an integer overflow.
What I do not understand is where the integer overflow comes from. None of my variables are of the integer class, so what am I missing?
I examined the source code for the quantregForest() function and the source code for the method predict.imp called by the quantregForest() function.
I found that nobs stands for the number of observations; in the case above, nobs = length(Ytrain) = 31132. The variable npred stands for the number of predictors; it is given by npred = ncol(Xtrain) = 38. Both nobs and npred are of class integer, and
nobs*nobs*npred = 31132*31132*38 = 36829654112.
And herein lies the root cause of the error, since:
nobs*nobs*npred = 36829654112 > 2147483647,
where 2147483647 is the maximal integer value in R. Hence the integer overflow warning and the replacement of the product nobs*nobs*npred with an NA.
The bottom line is that, in order to avoid the error message, I will have to use quite a bit fewer observations when training the model, or set importance=FALSE in the quantregForest() call. The computations required to find variable importance are very memory intensive, even when using fewer than 10000 observations.
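The overflow is easy to reproduce directly: R's integers are 32-bit regardless of the classes of the data frame columns, so it is the intermediate product inside the package code, not the user's variables, that overflows:

```r
.Machine$integer.max            # 2147483647

nobs  <- 31132L                 # number of observations (integer)
npred <- 38L                    # number of predictors (integer)

# Integer arithmetic overflows partway through and yields NA (with a warning)
prod_int <- suppressWarnings(nobs * nobs * npred)
prod_int                        # NA

# Promoting to double first avoids the overflow
prod_num <- as.numeric(nobs) * nobs * npred
prod_num                        # 36829654112
```

Note that nobs * nobs alone (969201424) still fits in an integer; the overflow happens at the final multiplication by npred.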
