I have a big .csv data set with columns B1 through B34. They are all numeric, which is fine, but I would like the last column, DEC, to be a factor. DEC contains only the values 1 and 0.
How can I convert the last column to a factor?
mydata<-read.csv(file.choose(),header=T)
str(mydata)
'data.frame': 1024 obs. of 35 variables:
$ B1 : num 90.8 113.2 100.4 144.5 131.6 ...
$ B2 : num 0.133 0.139 0.144 0.147 0.141 ...
-----------
-----------
$ B32 : num 0.216 0.27 0.309 0.259 0.304 ...
$ B33 : num 0.526 0.407 0.286 0.129 0.37 ...
$ B34 : num 4.33 5.61 4.81 7.32 6.83 ...
$ DEC : int 1 1 1 1 1 1 1 1 1 1 ...
You can use the as.factor() function to convert any column to factor. For example:
mydata <- read.csv("data.csv")        # Read the data
mydata$DEC <- as.factor(mydata$DEC)   # Convert the column to a factor
class(mydata$DEC)                     # Check that it worked
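As a small self-contained illustration of the conversion above, the sketch below uses a made-up four-row data frame in place of the CSV (the values are purely illustrative). It uses factor() with explicit levels, which behaves like as.factor() here but fixes the level order, and shows how to get the original numbers back.

```r
# Toy stand-in for the CSV; column names mirror the question, values are made up.
mydata <- data.frame(B1 = c(90.8, 113.2, 100.4, 144.5),
                     DEC = c(1L, 1L, 0L, 1L))

# Convert DEC to a factor with an explicit level order.
mydata$DEC <- factor(mydata$DEC, levels = c(0, 1))
str(mydata)
levels(mydata$DEC)

# Going back to the original numbers needs as.character() first,
# because as.integer() on a factor returns the internal level codes.
dec_numeric <- as.integer(as.character(mydata$DEC))
```

Setting the levels explicitly is optional, but it keeps the level order stable even if a subset of the data happens to contain only one of the two values.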
Suppose I would like to create a data frame in R with two variables, as in the following two examples from the projpred package.
The first example is:
projpred::df_gaussian
> str(df_gaussian)
'data.frame': 100 obs. of 2 variables:
$ x: num [1:100, 1:20] 0.274 2.245 -0.125 -0.544 -1.459 ...
$ y: num -1.275 1.843 0.459 0.564 1.873 ...
The second example is
projpred::df_binom
> str(df_binom)
'data.frame': 100 obs. of 2 variables:
$ x: num [1:100, 1:30] -0.619 1.094 -0.357 -2.469 0.567 ...
$ y: int 0 1 1 0 1 0 0 0 1 1 ...
Here 'x' is a matrix (100 x 20 in the first example, 100 x 30 in the second) and 'y' is a vector of length 100. When I try something like the following:
> x<- matrix(rnorm(49,0,1),ncol=7,nrow=7)
> x
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.7475965 0.25087585 -0.5454202 1.1080362 0.772668671 0.1541041 -0.18822798
[2,] 0.1156593 0.01525141 -1.7016563 -1.4725411 -1.103412611 -0.5244481 -1.35857198
[3,] -2.0756020 0.76945330 2.1603842 -0.7884491 -0.058697197 1.7486573 -0.22084569
[4,] -0.7190079 0.02477635 -0.1113622 0.2430216 -0.002865642 0.8650818 0.01232973
[5,] -0.9197059 0.88796240 -0.7654234 -1.3388553 -1.323093057 -0.6983747 1.20201014
[6,] 1.4298535 0.04451137 1.2678596 0.3640843 -0.046717376 -2.2444299 1.80306550
[7,] -0.3876859 0.62635356 -0.3490285 -0.9496578 1.150150174 0.4247856 -0.97021264
> y<- rnorm(7,5,1)
> y
[1] 6.456016 4.984491 5.209759 7.183303 4.461657 5.005530 5.052837
> z<-data.frame(x,y)
I get something like the output below, which is not what I want.
> z
X1 X2 X3 X4 X5 X6 X7 y
1 0.7475965 0.25087585 -0.5454202 1.1080362 0.772668671 0.1541041 -0.18822798 6.456016
2 0.1156593 0.01525141 -1.7016563 -1.4725411 -1.103412611 -0.5244481 -1.35857198 4.984491
3 -2.0756020 0.76945330 2.1603842 -0.7884491 -0.058697197 1.7486573 -0.22084569 5.209759
4 -0.7190079 0.02477635 -0.1113622 0.2430216 -0.002865642 0.8650818 0.01232973 7.183303
5 -0.9197059 0.88796240 -0.7654234 -1.3388553 -1.323093057 -0.6983747 1.20201014 4.461657
6 1.4298535 0.04451137 1.2678596 0.3640843 -0.046717376 -2.2444299 1.80306550 5.005530
7 -0.3876859 0.62635356 -0.3490285 -0.9496578 1.150150174 0.4247856 -0.97021264 5.052837
> str(z)
'data.frame': 7 obs. of 8 variables:
$ X1: num 0.748 0.116 -2.076 -0.719 -0.92 ...
$ X2: num 0.2509 0.0153 0.7695 0.0248 0.888 ...
$ X3: num -0.545 -1.702 2.16 -0.111 -0.765 ...
$ X4: num 1.108 -1.473 -0.788 0.243 -1.339 ...
$ X5: num 0.77267 -1.10341 -0.0587 -0.00287 -1.32309 ...
$ X6: num 0.154 -0.524 1.749 0.865 -0.698 ...
$ X7: num -0.1882 -1.3586 -0.2208 0.0123 1.202 ...
$ y : num 6.46 4.98 5.21 7.18 4.46 ...
Wrap the matrix with I() ("as is"); otherwise data.frame() splits the matrix into separate columns. This is documented in ?I:
In function data.frame. Protecting an object by enclosing it in I() in a call to data.frame inhibits the conversion of character vectors to factors and the dropping of names, and ensures that matrices are inserted as single columns. I can also be used to protect objects which are to be added to a data frame, or converted to a data frame via as.data.frame.
z <- data.frame(x = I(x), y)
> str(z)
'data.frame': 7 obs. of 2 variables:
$ x: 'AsIs' num [1:7, 1:7] -0.178 -1.37 -0.682 1.166 0.437 ...
$ y: num 5.12 4.58 5.41 4.91 6.43 ...
> is.matrix(z$x)
[1] TRUE
If we need to change the class from "AsIs" back to a plain matrix:
> class(z$x) <- c("matrix", "array")
> str(z)
'data.frame': 7 obs. of 2 variables:
$ x: 'matrix' num [1:7, 1:7] -0.178 -1.37 -0.682 1.166 0.437 ...
$ y: num 5.12 4.58 5.41 4.91 6.43 ...
Another option is tibble, which keeps the matrix as a single column:
library(tibble)
z1 <- tibble(x, y)
str(z1)
tibble [7 × 2] (S3: tbl_df/tbl/data.frame)
$ x: num [1:7, 1:7] -0.178 -1.37 -0.682 1.166 0.437 ...
$ y: num [1:7] 5.12 4.58 5.41 4.91 6.43 ...
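A further base R possibility, sketched here as an alternative to the I() and tibble approaches above: create the data frame from y first and then assign the matrix with `$<-`, which stores it as a single matrix column without the "AsIs" class. The object name z2 and the seed are only for this illustration.

```r
set.seed(42)
x <- matrix(rnorm(49), nrow = 7, ncol = 7)
y <- rnorm(7, mean = 5, sd = 1)

# Start from y, then add the matrix as one column via `$<-`.
z2 <- data.frame(y = y)
z2$x <- x

# Optional: put x first, as in projpred::df_gaussian.
z2 <- z2[c("x", "y")]
str(z2)
```

Column selection with `[` treats the data frame as a list, so the matrix column survives the reordering intact.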
I have a data frame with a numeric year column and a numeric average-temperature column (AvgTempZScore). How can I convert it to a time series in a suitable format?
Example:
year AvgTempZScore
<dbl> <dbl>
1 1835 0.109
2 1836 0.168
3 1837 0.177
4 1838 0.143
5 1839 0.188
6 1840 0.198
7 1841 0.200
8 1842 0.230
9 1843 0.237
10 1844 0.194
Structure:
tibble [179 × 2] (S3: tbl_df/tbl/data.frame)
$ year : num [1:179] 1835 1836 1837 1838 1839 ...
$ AvgTempZScore: num [1:179] 0.109 0.168 0.177 0.143 0.188 ...
With xts and lubridate:
xts::xts(x = df$AvgTempZScore,order.by = lubridate::ymd(df$year, truncated = 2L))
[,1]
1835-01-01 0.109
1836-01-01 0.168
1837-01-01 0.177
1838-01-01 0.143
1839-01-01 0.188
1840-01-01 0.198
1841-01-01 0.200
1842-01-01 0.230
1843-01-01 0.237
1844-01-01 0.194
The base ts() function can also be used:
db=structure(list(year = c(1835,1836,1837,1838,1839,1840,1841,1842,1843,1844),
AvgTempZScore = c(0.109,0.168,0.177,0.143,0.188,0.198,0.200,0.230,0.237,0.194)),
row.names = c(1:10),
class = "data.frame")
str(db)
#'data.frame': 10 obs. of 2 variables:
# $ year : num 1835 1836 1837 1838 1839 ...
# $ AvgTempZScore: num 0.109 0.168 0.177 0.143 0.188 0.198 0.2 0.23 0.237 0.194
db = ts(db,frequency = 1,start=1835, end=1844)
str(db)
#Time-Series [1:10, 1:2] from 1835 to 1844: 1835 1836 1837 1838 1839 ...
#- attr(*, "dimnames")=List of 2
#..$ : NULL
#..$ : chr [1:2] "year" "AvgTempZScore"
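Note that the ts() call above converts both columns, so year becomes a second series alongside AvgTempZScore. If only the temperature values should form the series, one possible variant is to pass just that column and take the start year from the data. This is a minimal sketch that assumes the years are consecutive with no gaps; temp_ts is a name chosen for this example.

```r
db <- data.frame(
  year = 1835:1844,
  AvgTempZScore = c(0.109, 0.168, 0.177, 0.143, 0.188,
                    0.198, 0.200, 0.230, 0.237, 0.194)
)

# ts() assumes equally spaced observations, so check for gaps first.
stopifnot(all(diff(db$year) == 1))

# Univariate annual series; the time index carries the year.
temp_ts <- ts(db$AvgTempZScore, start = min(db$year), frequency = 1)
temp_ts
time(temp_ts)
```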
I have the output from a coxph function, which is estimated by strata. I would like to transform this output from a list into a data frame. The code I ran for coxph is below:
k <- coxph(Surv(cum.goodp, dlq.next) ~ rpc.length + cluster(itemcode) + strata(sector), data = nr.sample)
m <- summary(survfit(k))
There are twenty different strata used to estimate the coxph. Here is the structure of the list
List of 16
$ n : int [1:20] 870 843 2278 603 6687 8618 15155 920 2598 654 ...
$ time : num [1:870] 1 2 3 4 5 6 7 8 9 10 ...
$ n.risk : num [1:870] 870 592 448 361 320 286 232 214 196 186 ...
$ n.event : num [1:870] 246 126 77 34 33 25 18 18 8 6 ...
$ n.censor : num [1:870] 32 18 10 7 1 29 0 0 2 0 ...
$ strata : Factor w/ 20 levels "sector=11","sector=21",..: 1 1 1 1 1 1 1 1 1 1 ...
$ surv : num [1:870] 0.725 0.571 0.471 0.425 0.379 ...
$ type : chr "right"
$ cumhaz : num [1:870] 0.322 0.561 0.754 0.856 0.971 ...
$ std.err : num [1:870] 0.015 0.017 0.0174 0.0174 0.0173 ...
$ upper : num [1:870] 0.755 0.605 0.506 0.46 0.414 ...
$ lower : num [1:870] 0.696 0.538 0.438 0.392 0.347 ...
$ conf.type: chr "log"
$ conf.int : num 0.95
$ call : language survfit(formula = k)
$ table : num [1:20, 1:7] 870 843 2278 603 6687 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:20] "sector=11" "sector=21" "sector=22" "sector=23" ...
.. ..$ : chr [1:7] "records" "n.max" "n.start" "events" ...
- attr(*, "class")= chr "summary.survfit"
I have done this before, but without strata. When I did not have strata I used the following approach:
col <- lapply(c(1 : 7), function(x) m[x])
tbl <- do.call(data.frame, col)
However, when I try that approach here, I get the familiar error:
cannot coerce class "c("survfit.cox", "survfit")" to a data.frame
All columns have the same name, but they are of different length. If possible, I would like to add a column to the final data frame that contains the particular strata that the results are for. Is there a way to do this? It doesn't have to be in base R. Any help would be much appreciated. Thanks so much.
This problem can be solved via the tidy function in the broom package. For the example above, the code is:
library(broom)
n <- survfit(k)
df <- tidy(n)
The tidy function produces a data frame with a variable "strata". It does not, however, provide the median and mean, but they can be estimated from the data frame df if one were so inclined. If the survfit object has multiple strata, the glance(list) cannot provide the median or mean.
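If a base R route is preferred, one option is to keep only those components of the summary list that have one entry per time point, which avoids the length mismatch caused by components such as n and table. The sketch below runs on a small hand-made list that merely imitates the structure shown in the question (all values are invented); with the real object, the same lines would be applied to m.

```r
# Toy stand-in for summary(survfit(k)); values are made up for illustration.
m <- list(
  n       = c(3L, 2L),                      # one entry per stratum, not per row
  time    = c(1, 2, 3, 1, 2),
  n.risk  = c(10, 8, 5, 6, 4),
  n.event = c(2, 3, 1, 2, 1),
  surv    = c(0.80, 0.50, 0.40, 0.67, 0.50),
  strata  = factor(c("sector=11", "sector=11", "sector=11",
                     "sector=21", "sector=21")),
  type    = "right"
)

# Keep atomic components whose length matches the number of time points.
n_rows <- length(m$time)
keep <- vapply(m, function(el) is.atomic(el) && length(el) == n_rows,
               logical(1))

# unclass() drops any S3 class so that plain list subsetting is used.
tbl <- as.data.frame(unclass(m)[keep])
str(tbl)
```

Because strata is one of the retained components, the resulting data frame already carries a column identifying the stratum of each row.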
I am trying to use the quantregForest() function from the quantregForest package (which is built on the randomForest package.)
I tried to train the model using:
qrf_model <- quantregForest(x=Xtrain, y=Ytrain, importance=TRUE, ntree=10)
and I get the following error message (even after reducing the number of trees from 100 to 10):
Error in rep(0, nobs * nobs * npred) : invalid 'times' argument
plus a warning:
In nobs * nobs * npred : NAs produced by integer overflow
The data frame Xtrain has 38 numeric variables, and it looks like this:
> str(Xtrain)
'data.frame': 31132 obs. of 38 variables:
$ X1 : num 301306 6431 2293 1264 32477 ...
$ X2 : num 173.2 143.5 43.4 180.6 1006.2 ...
$ X3 : num 0.1598 0.1615 0.1336 0.0953 0.1988 ...
$ X4 : num 0.662 0.25 0.71 0.709 0.671 ...
$ X5 : num 0.05873 0.0142 0 0.00154 0.09517 ...
$ X6 : num 0.01598 0 0.0023 0.00154 0.01634 ...
$ X7 : num 0.07984 0.03001 0.00845 0.04304 0.09326 ...
$ X8 : num 0.92 0.97 0.992 0.957 0.907 ...
$ X9 : num 105208 1842 830 504 11553 ...
$ X10: num 69974 1212 611 352 7080 ...
$ X11: num 0.505 0.422 0.55 0.553 0.474 ...
$ X12: num 0.488 0.401 0.536 0.541 0.45 ...
$ X13: num 0.333 0.419 0.257 0.282 0.359 ...
$ X14: num 0.187 0.234 0.172 0.207 0.234 ...
$ X15: num 0.369 0.216 0.483 0.412 0.357 ...
$ X16: num 0.0765 0.1205 0.0262 0.054 0.0624 ...
$ X17: num 2954 77 12 10 739 ...
$ X18: num 2770 43 9 21 433 119 177 122 20 17 ...
$ X19: num 3167 72 49 25 622 ...
$ X20: num 3541 57 14 24 656 ...
$ X21: num 3361 82 0 33 514 ...
$ X22: num 3929 27 10 48 682 ...
$ X23: num 3695 73 61 15 643 ...
$ X24: num 4781 52 5 14 680 ...
$ X25: num 3679 103 5 23 404 ...
$ X26: num 7716 120 55 40 895 ...
$ X27: num 11043 195 72 48 1280 ...
$ X28: num 16080 332 160 83 1684 ...
$ X29: num 12312 125 124 62 1015 ...
$ X30: num 8218 99 36 22 577 ...
$ X31: num 9957 223 146 26 532 ...
$ X32: num 0.751 0.444 0.621 0.527 0.682 ...
$ X33: num 0.01873 0 0 0.00317 0.02112 ...
$ X34: num 0.563 0.372 0.571 0.626 0.323 ...
$ X35: num 0.366 0.39 0.156 0.248 0.549 ...
$ X36: num 0.435 0.643 0.374 0.505 0.36 ...
$ X37: num 0.526 0.31 0.577 0.441 0.591 ...
$ X38: num 0.00163 0 0 0 0.00155 0.00103 0 0 0 0 ...
And the response variable Ytrain looks like this:
> str(Ytrain)
num [1:31132] 2605 56 8 16 214 ...
I checked that neither Xtrain nor Ytrain contains any NAs:
> sum(is.na(Xtrain))
[1] 0
> sum(is.na(Ytrain))
[1] 0
I am assuming that the error message about the invalid 'times' argument in rep(0, nobs * nobs * npred) comes from the NA value assigned to the product nobs * nobs * npred due to an integer overflow.
What I do not understand is where the integer overflow comes from. None of my variables are of the integer class so what am I missing?
I examined the source code for the quantregForest() function and the source code for the method predict.imp called by the quantregForest() function.
I found that nobs stands for the number of observations. In the case above, nobs = length(Ytrain) = 31132. The variable npred stands for the number of predictors and is given by npred = ncol(Xtrain) = 38. Both npred and nobs are of class integer, and
nobs*nobs*npred = 31132*31132*38 = 36829654112.
And herein lies the root cause of the error, since:
nobs*nobs*npred = 36829654112 > 2147483647,
where 2147483647 is the maximal integer value in R. Hence the integer overflow warning and the replacement of the product nobs*nobs*npred with an NA.
The bottom line is that, to avoid the error, I will have to either use considerably fewer observations when training the model or set importance=FALSE in the quantregForest() call. The computations required to find variable importance are very memory intensive, even when using fewer than 10000 observations.
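The overflow can be reproduced outside quantregForest with plain integer arithmetic. The following sketch uses the numbers from this question; the variable names int_product and dbl_product are only for illustration.

```r
nobs  <- 31132L   # length(Ytrain)
npred <- 38L      # ncol(Xtrain)

.Machine$integer.max   # largest representable integer in R

# Integer arithmetic: the product exceeds the integer range and becomes NA.
# Without suppressWarnings(), R emits the "NAs produced by integer overflow" warning.
int_product <- suppressWarnings(nobs * nobs * npred)
is.na(int_product)

# Converting one operand to double avoids the overflow.
dbl_product <- as.numeric(nobs) * nobs * npred
dbl_product

# Rough size in GiB of a double vector with that many elements.
dbl_product * 8 / 2^30
```

Even with double arithmetic the requested vector would be far too large for typical memory, which is consistent with reducing the number of observations or setting importance=FALSE.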
I created a data frame from another dataset with 332 IDs. I split the data frame by ID and would like to count the rows for each ID and then compute a correlation within each group. Can someone tell me how to count the rows for each ID so that I can compute the correlations for these individual groups?
jlhoward, your suggestion to add the table(dat1$ID) command worked. My other problem is that the function will not stop running:
corr<-function(directory,threshold=)
####### file location path#####
for(i in 1:332){dat<-rbind(dat,read.csv(specdata1[i]))
dat1<-dat[complete.cases(dat),]
dat2<-(split(dat1,dat1$ID))
list(dat2)
dat3<-table(dat1$ID)
for (i in dat1>=threshold){
x<-dat1$sulfate
y<-dat1$nitrate
correlations<-cor(x,y,use="pairwise.complete.obs",method="pearson")
corrs_output<-c(corrs_output,correlations)
}
I'm trying to correlate "sulfate" and "nitrate" for each monitor ID that meets a threshold. I created a list that holds all the complete cases per monitor ID. I need the function to compute the correlation between "sulfate" and "nitrate" for every per-ID set that is >= the threshold argument of the function. Below are the head and tail of the structure of the data.frame/list for each data set within the main data set "specdata1".
head of entire data.frame/list of specdata1 complete cases for correlation
head(str(dat2,1))
List of 323
$ 1 :'data.frame': 117 obs. of 4 variables:
..$ Date : Factor w/ 4018 levels "2003-01-01","2003-01-02",..: 279 285 291 297 303 315 321 327 333 339 ...
..$ sulfate: num [1:117] 7.21 5.99 4.68 3.47 2.42 1.43 2.76 3.41 1.3 3.15 ...
..$ nitrate: num [1:117] 0.651 0.428 1.04 0.363 0.507 0.474 0.425 0.964 0.491 0.669 ...
..$ ID : int [1:117] 1 1 1 1 1 1 1 1 1 1 ...
tail of entire data.frame/list for all complete cases of specdata1
tail(str(dat2,1))
$ 99 :'data.frame': 479 obs. of 4 variables:
..$ Date : Factor w/ 4018 levels "2003-01-01","2003-01-02",..: 1774 1780 1786 1804 1810 1816 1822 1840 1852 1858 ...
..$ sulfate: num [1:479] 1.51 8.2 1.48 4.75 3.47 1.19 1.77 2.27 2.06 2.11 ...
..$ nitrate: num [1:479] 0.725 1.64 1.01 6.81 0.751 1.69 2.08 0.996 0.817 0.488 ...
..$ ID : int [1:479] 99 99 99 99 99 99 99 99 99 99 ...
[list output truncated]
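One way the per-ID counting and correlation could be organised is sketched below. It assumes that the threshold refers to the number of complete cases per ID, and it runs on a small simulated data frame rather than the specdata files; the function name corr_by_id and the toy values are hypothetical. A likely reason the loop in the question runs so long is that for (i in dat1 >= threshold) iterates over every cell of a logical matrix and recomputes the same correlation on the full data each time, all inside the loop that reads the files.

```r
# Correlate sulfate and nitrate within each ID that has at least
# `threshold` complete cases. Returns a named numeric vector.
corr_by_id <- function(dat, threshold = 0) {
  complete <- dat[complete.cases(dat), ]
  counts <- table(complete$ID)                  # rows per ID
  keep_ids <- names(counts)[counts >= threshold]
  groups <- split(complete, complete$ID)[keep_ids]
  vapply(groups, function(g) cor(g$sulfate, g$nitrate), numeric(1))
}

# Toy data: three monitors with 5, 2 and 4 rows; one value set to NA.
set.seed(1)
dat <- data.frame(
  ID      = rep(1:3, times = c(5, 2, 4)),
  sulfate = rnorm(11),
  nitrate = rnorm(11)
)
dat$sulfate[2] <- NA

corrs <- corr_by_id(dat, threshold = 3)
corrs
```

With the real data, the files would be read and combined once before calling the function, rather than inside the correlation loop.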