R - Count occurrences of forward slashes in each row

I have a dataframe that looks like this:
12/04/2017 00:00:02.30,-2.31,-2.97,-0.3,-1.4
12/04/2017 00:00:02.40,-1.89,-2.94,-1.15,-1.4
12/04/2017 00:00:02.50,-1.66,-3.14,-0.06,-1.39
12/04/2017 00:00:02.60,-1.84,-3.16,0.18,-1.37
12/04/2017 00:00:02.70,-2.12/04/2017 00:00:02.80,-2,-2.56,0.17,-1.41
12/04/2017 00:00:02.90,-2.18,-2.31,0.11,-1.45
12/04/2017 00:00:03,-2.14,-2.21,-0.05,-1.45
The logger that produces the data sometimes writes the date of one line into the row of another (the 5th row in the example). I need to delete these lines in R, but I have no idea how to find and delete them in the dataframe.
My first idea was to count the forward slashes in each row, but I could not find a way to do that.
Another way might be to take the mean length of all rows and delete the rows that are longer than the mean. But again, I can't find a way to compute a mean over all characters in a row (strings and numbers).
edit: The output from str(df):
str(df)
'data.frame': 856645 obs. of 6 variables:
$ station: chr "Arof" "Arof" "Arof" "Arof" ...
$ date : Factor w/ 863989 levels "12/04/2017 00:00:01.10",..: 1 2 3 4 5 6 7 8 9 10 ...
$ u : Factor w/ 1327 levels "","0","-0.01",..: 132 84 146 136 112 120 126 33 281 240 ...
$ v : num -0.62 -0.41 -1.58 -1.65 -1.25 -1.8 -1.86 -2.46 -2.59 -2.87 ...
$ w : num 0.89 1.09 0.63 0.53 0.84 0.58 0.46 0.48 -0.16 -0.01 ...
$ temp : num -1.36 -1.41 -1.41 -1.41 -1.41 -1.41 -1.5 -1.48 -1.51 -1.46 ...
- attr(*, "na.action")=Class 'omit' Named int [1:7344] 18 113 246 378 513 643 646 778 909 1042 ...
.. ..- attr(*, "names")= chr [1:7344] "18" "113" "246" "378" ...

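Using grepl, we can search for a . followed by two digits followed by a /. The first call checks only the date column; the second applies the same pattern to every column and counts the matches in each row.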
grepl("\\.\\d{2}\\/",data$date)
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE
apply(data,1, function(x) sum(grepl("\\.\\d{2}\\/",x)))
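The original idea of counting forward slashes per row can also be sketched directly in base R. This is a minimal sketch, assuming the raw log lines are available as a character vector (here the hypothetical name `lines`, filled with the sample rows from the question) and that a valid row contains exactly two slashes, both from its single date.

```r
# Sample lines copied from the question; row 5 is the corrupted one.
lines <- c(
  "12/04/2017 00:00:02.30,-2.31,-2.97,-0.3,-1.4",
  "12/04/2017 00:00:02.40,-1.89,-2.94,-1.15,-1.4",
  "12/04/2017 00:00:02.50,-1.66,-3.14,-0.06,-1.39",
  "12/04/2017 00:00:02.60,-1.84,-3.16,0.18,-1.37",
  "12/04/2017 00:00:02.70,-2.12/04/2017 00:00:02.80,-2,-2.56,0.17,-1.41",
  "12/04/2017 00:00:02.90,-2.18,-2.31,0.11,-1.45",
  "12/04/2017 00:00:03,-2.14,-2.21,-0.05,-1.45"
)

# Count the literal "/" characters in each line.
n_slash <- lengths(regmatches(lines, gregexpr("/", lines, fixed = TRUE)))

# Assumption: a well-formed line has exactly two slashes (one date).
clean <- lines[n_slash == 2]

# Line lengths, in case a length-based check is preferred instead.
line_len <- nchar(lines)
```

If the data is already in a data frame, the same count could be applied to a single column after converting it with as.character(), since the date column shown in str(df) is a factor.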

Related

Converting the structure of input data in R

Below is the structure of my dataframe:
'data.frame': 213 obs. of 2 variables:
$ up_entrez: Factor w/ 143 levels "101739","108077",...: 3 94 125 103 3 34 3 37 134 13 ...
$ Ratio : num 3.1 3.37 1.8 1.21 6.92 ....
and I want to convert it to something like this for the function to take it as an input:
Named num [1:12495] 4.57 4.51 4.42 4.14 3.88 ....
- attr(*, "names")= chr [1:12495] "4312" "8318" "10874" "55143" ....
How do I do that?
We can use setNames to create a named vector
v1 <- with(df1, setNames(Ratio, up_entrez))
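As a small illustration of the setNames approach, the sketch below builds a toy `df1` and checks the result. The pairing of IDs and ratios is made up for the example, and the explicit as.character() call is an addition here that makes the factor-to-name conversion visible.

```r
# Toy data frame with the same two columns as in the question
# (the ID/ratio pairing is invented for illustration).
df1 <- data.frame(
  up_entrez = factor(c("101739", "108077", "101739")),
  Ratio = c(3.1, 3.37, 1.8)
)

# Named numeric vector: values from Ratio, names from the factor labels.
v1 <- with(df1, setNames(Ratio, as.character(up_entrez)))

str(v1)
```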

Lapply over subset of dataframes without splitting further

I am at a loss on how to do this without addressing each individual part. I have an initial timeseries dataset that I split into a list of 12 dataframes representing each month. Within each month, I want to run calculations and ggplot on each unique site without having to call each individual site. The structure currently is as follows:
$ April :'data.frame': 9360 obs. of 15 variables:
..$ site_id : int [1:9360] 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 ...
..$ UTC_date.1 : Date[1:9360], format: "2005-04-01" "2005-04-02" "2005-04-03" "2005-04-04" ...
..$ POSIXct : POSIXct[1:9360], format: "2005-04-01 06:00:00" "2005-04-02 06:00:00" "2005-04-03 06:00:00" "2005-04-04 06:00:00" ...
..$ swe_mm : num [1:9360] 45.9 44.6 43.5 42.4 41.2 ...
..$ fsca : num [1:9360] 1 1 1 1 0.997 ...
..$ snoht_m : num [1:9360] 0.303 0.239 0.21 0.186 0.165 ...
..$ swe_mm.1 : num [1:9360] 45.9 44.6 43.5 42.4 41.2 ...
..$ fsca.1 : num [1:9360] 1 1 1 1 0.997 ...
..$ snoht_m.1 : num [1:9360] 0.303 0.239 0.21 0.186 0.165 ...
..$ actSWE_mm : num [1:9360] 279 282 282 282 282 284 292 295 295 295 ...
..$ actSD_cm : num [1:9360] 79 79 NA 79 79 81 185 81 81 81 ...
..$ swe_Res_mm : num [1:9360] 233 237 238 240 241 ...
..$ snoht_Res_m : num [1:9360] 0.487 0.551 NA 0.604 0.625 ...
..$ swe_Res1_mm : num [1:9360] 233 237 238 240 241 ...
..$ snoht_Res1_m: num [1:9360] 0.487 0.551 NA 0.604 0.625 ...
I can use lapply to calculate the standardized rmse without issue if I apply it to each dataframe entirely:
stdres.fun <- function(data,x,out) {data[out] <- data[[x]] / ((sum(data[[x]]^2, na.rm = TRUE)/NROW(data))^.5); data}
monthSplit <- lapply(monthSplit, stdres.fun, x = "swe_Res_mm", out="stdSWE_res")
However, I am having trouble figuring out how to run this calculation on each unique site_id. What I mean to say is there are 32 different sites. They are the same sites in each dataframe, however I want to calculate the rmse for each site within each dataframe in the list. So if I had sites 946 and 1003, the calculation would run on each of those separately rather than together.
I'm assuming I can split the data further into different lists but I feel like this would be messier than it already is. Is there another way I can go about doing this?
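We could modify the function and use tidyverse methods, grouping by site_id within each data frame:
library(purrr)
library(dplyr)
# For every month, standardize swe_Res_mm separately for each site
monthSplit2 <- map(monthSplit, ~
  .x %>%
    group_by(site_id) %>%
    mutate(stdSWE_res = swe_Res_mm / ((sum(swe_Res_mm^2,
                          na.rm = TRUE) / n())^.5)) %>%
    ungroup())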
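For comparison, the same per-site calculation can be sketched in base R with ave(), which applies a function within groups and returns a vector of the original length. This is only a sketch under assumptions: the function name `std_by_site` and the toy `monthSplit` list are hypothetical, and the divisor counts all rows of a site (including NA rows), mirroring NROW(data) in the original function.

```r
# Toy stand-in for the list of monthly data frames (values are invented).
monthSplit <- list(
  April = data.frame(site_id = c(946, 946, 1003, 1003, 1003),
                     swe_Res_mm = c(3, 4, 233, NA, 238)),
  May   = data.frame(site_id = c(946, 1003),
                     swe_Res_mm = c(5, 10))
)

# Standardize column `x` by its root mean square within each site.
std_by_site <- function(data, x, out, by = "site_id") {
  data[[out]] <- ave(
    data[[x]], data[[by]],
    FUN = function(v) v / sqrt(sum(v^2, na.rm = TRUE) / length(v))
  )
  data
}

monthSplit2 <- lapply(monthSplit, std_by_site,
                      x = "swe_Res_mm", out = "stdSWE_res")
```

Because the grouping happens inside the function, the list keeps its one-data-frame-per-month shape and no further splitting is needed.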

Why is distance matrix (dist()) giving empty values for data sets having more than ~50 observations?

I have a data set for which I'm calculating its distance matrix. Below is the data, which has 251 observations.
> str(mydata)
'data.frame': 251 obs. of 7 variables:
$ BodyFat: num 12.3 6.1 25.3 10.4 28.7 20.9 19.2 12.4 4.1 11.7 ...
$ Weight : num 154 173 154 185 184 ...
$ Chest : num 93.1 93.6 95.8 101.8 97.3 ...
$ Abdomen: num 85.2 83 87.9 86.4 100 94.4 90.7 88.5 82.5 88.6 ...
$ Hip : num 94.5 98.7 99.2 101.2 101.9 ...
$ Thigh : num 59 58.7 59.6 60.1 63.2 66 58.4 60 62.9 63.1 ...
$ Biceps : num 32 30.5 28.8 32.4 32.2 35.7 31.9 30.5 35.9 35.6 ...
I normalize the data.
means = apply(mydata,2,mean)
sds = apply(mydata,2,sd)
nor = scale(mydata,center=means,scale=sds)
When I calculate the distance matrix, I see a lot of empty values, and moreover the distance appears to be measured from only 4 observations.
distance =dist(nor)
> str(distance)
'dist' num [1:31375] 1.33 2.09 1.9 3.08 3.99 ...
- attr(*, "Size")= int 251
- attr(*, "Labels")= chr [1:251] "1" "2" "3" "4" ...
- attr(*, "Diag")= logi FALSE
- attr(*, "Upper")= logi FALSE
- attr(*, "method")= chr "euclidean"
- attr(*, "call")= language dist(x = nor)
> distance # o/p omitted from this post as it has 257 observations.
1 2 3 4 5 6 7
2 1.3346445
3 2.0854437 2.5474796
4 1.8993458 1.4908813 2.5840752
5 3.0790252 3.4485667 2.2165366 2.7021809
8 9 10 11 12 13 14
2
3
4
5
15 16 17 18 19 20 21
This list goes on empty for the remaining 247 comparisons.
Now, I reduce the data set to 20 observations
Here I get a proper distance matrix.
distancetiny=dist(nor)
> str(distancetiny)
'dist' num [1:1176] 1.14 1.8 1.61 2.62 3.39 ...
- attr(*, "Size")= int 49
- attr(*, "Labels")= chr [1:49] "1" "2" "3" "4" ...
- attr(*, "Diag")= logi FALSE
- attr(*, "Upper")= logi FALSE
- attr(*, "method")= chr "euclidean"
- attr(*, "call")= language dist(x = nor)
> distancetiny
1 2 3 4 5 6 7
2 1.1380433
3 1.7990293 2.2088928
4 1.6064118 1.2871522 2.2483586
5 2.6235853 2.9669283 1.9132224 2.3256624
6 3.3898119 3.3730508 3.3718447 2.2615557 2.0094434
7 1.8947704 2.0065514 1.7685604 1.1065940 1.7387938 2.2321156
8 1.1732465 1.0663217 1.6733689 0.8873140 2.1959298 2.7939555 1.1448269
9 2.2721969 2.0545882 3.4263262 1.4058375 3.1811955 2.4011074 2.3078714
10 2.3753110 2.2424464 3.0289947 1.2808398 2.3230202 1.4242653 1.8571654
11 1.5620472 1.1878554 2.5750350 0.5718248 2.7714795 2.6314286 1.5132365
12 3.5088571 3.2484020 4.1164488 2.2723772 3.1377318 1.4795230 2.8274818
13 2.1448841 2.2679705 1.8726670 1.3494988 1.2176727 1.5544030 1.0725518
14 3.6679035 3.7459402 3.6869023 2.6677308 2.1318420 0.7347359 2.5729973
15 2.9908457 3.3312661 3.1289870 2.4340473 1.8027070 1.3626019 2.3795360
16 1.6117570 2.0283356 1.2011116 1.5961064 1.3196981 2.4456436 1.2569683
17 3.2991393 3.5991747 3.0438049 2.6066933 1.4742664 1.0945621 2.2214101
18 3.9409008 4.0726826 4.0113908 2.9250144 2.5228901 0.9087254 2.8158563
19 2.7468511 2.9495031 3.2439229 1.8312508 2.4122436 1.3932604 1.9640170
20 3.7515064 3.7021743 3.9404231 2.5813440 2.5390519 0.8352961 2.6530503
21 2.3102053 2.3878491 2.0836800 1.4328028 1.2991221 1.5287862 1.1769205
There are no empty values in the output when the number of observations is 21.
Why is this so? Does dist() stop working when the observation count goes beyond a threshold?
I'm unable to figure it out. Please help.
This seems to be a display issue related to size. When the dataset contains more than about 60-80 observations, the distance matrix is not displayed in an easily readable way (even for the initial rows). print() shows only the lower triangle of a dist object and wraps the columns into blocks, so the first rows are blank in every block after the first. The values themselves are all present; we just cannot see them in that part of the printout.
Further operations on the distance matrix (such as hierarchical agglomerative clustering) confirmed that the odd display is nothing to worry about.
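One way to confirm that nothing is missing is to convert the dist object to a full matrix and inspect any block directly. The sketch below uses simulated data with the same shape as in the question (251 rows, 7 columns); the random values and the names `x`, `d` and `m` are assumptions for illustration only.

```r
set.seed(1)

# Simulated stand-in for the normalized data: 251 observations, 7 variables.
x <- scale(matrix(rnorm(251 * 7), ncol = 7))

d <- dist(x)

# A dist object stores one value per pair: 251 * 250 / 2 entries.
length(d)
anyNA(d)

# The full symmetric matrix shows the entries that looked "empty" in the
# wrapped lower-triangle printout, e.g. rows 1-5 against columns 8-14.
m <- as.matrix(d)
m[1:5, 8:14]
```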

r quantregForest() error: NA's produced by integer overflow lead to an invalid argument in the rep() function

I am trying to use the quantregForest() function from the quantregForest package (which is built on the randomForest package.)
I tried to train the model using:
qrf_model <- quantregForest(x=Xtrain, y=Ytrain, importance=TRUE, ntree=10)
and I get the following error message (even after reducing the number of trees from 100 to 10):
Error in rep(0, nobs * nobs * npred) : invalid 'times' argument
plus a warning:
In nobs * nobs * npred : NAs produced by integer overflow
The data frame Xtrain has 38 numeric variables, and it looks like this:
> str(Xtrain)
'data.frame': 31132 obs. of 38 variables:
$ X1 : num 301306 6431 2293 1264 32477 ...
$ X2 : num 173.2 143.5 43.4 180.6 1006.2 ...
$ X3 : num 0.1598 0.1615 0.1336 0.0953 0.1988 ...
$ X4 : num 0.662 0.25 0.71 0.709 0.671 ...
$ X5 : num 0.05873 0.0142 0 0.00154 0.09517 ...
$ X6 : num 0.01598 0 0.0023 0.00154 0.01634 ...
$ X7 : num 0.07984 0.03001 0.00845 0.04304 0.09326 ...
$ X8 : num 0.92 0.97 0.992 0.957 0.907 ...
$ X9 : num 105208 1842 830 504 11553 ...
$ X10: num 69974 1212 611 352 7080 ...
$ X11: num 0.505 0.422 0.55 0.553 0.474 ...
$ X12: num 0.488 0.401 0.536 0.541 0.45 ...
$ X13: num 0.333 0.419 0.257 0.282 0.359 ...
$ X14: num 0.187 0.234 0.172 0.207 0.234 ...
$ X15: num 0.369 0.216 0.483 0.412 0.357 ...
$ X16: num 0.0765 0.1205 0.0262 0.054 0.0624 ...
$ X17: num 2954 77 12 10 739 ...
$ X18: num 2770 43 9 21 433 119 177 122 20 17 ...
$ X19: num 3167 72 49 25 622 ...
$ X20: num 3541 57 14 24 656 ...
$ X21: num 3361 82 0 33 514 ...
$ X22: num 3929 27 10 48 682 ...
$ X23: num 3695 73 61 15 643 ...
$ X24: num 4781 52 5 14 680 ...
$ X25: num 3679 103 5 23 404 ...
$ X26: num 7716 120 55 40 895 ...
$ X27: num 11043 195 72 48 1280 ...
$ X28: num 16080 332 160 83 1684 ...
$ X29: num 12312 125 124 62 1015 ...
$ X30: num 8218 99 36 22 577 ...
$ X31: num 9957 223 146 26 532 ...
$ X32: num 0.751 0.444 0.621 0.527 0.682 ...
$ X33: num 0.01873 0 0 0.00317 0.02112 ...
$ X34: num 0.563 0.372 0.571 0.626 0.323 ...
$ X35: num 0.366 0.39 0.156 0.248 0.549 ...
$ X36: num 0.435 0.643 0.374 0.505 0.36 ...
$ X37: num 0.526 0.31 0.577 0.441 0.591 ...
$ X38: num 0.00163 0 0 0 0.00155 0.00103 0 0 0 0 ...
And the response variable Ytrain looks like this:
> str(Ytrain)
num [1:31132] 2605 56 8 16 214 ...
I checked that neither Xtrain nor Ytrain contains any NAs:
> sum(is.na(Xtrain))
[1] 0
> sum(is.na(Ytrain))
[1] 0
I am assuming that the error message about the invalid "times" argument in rep(0, nobs * nobs * npred) comes from the NA value assigned to the product nobs * nobs * npred due to an integer overflow.
What I do not understand is where the integer overflow comes from. None of my variables are of the integer class so what am I missing?
I examined the source code for the quantregForest() function and the source code for the method predict.imp called by the quantregForest() function.
I found that nobs stands for the number of observations. In the case above, nobs = length(Ytrain) = 31132. The variable npred stands for the number of predictors and is given by npred = ncol(Xtrain) = 38. Both npred and nobs are of class integer, and
nobs*nobs*npred = 31132*31132*38 = 36829654112.
And herein lies the root cause of the error, since:
nobs*nobs*npred = 36829654112 > 2147483647,
where 2147483647 is the maximal integer value in R. Hence the integer overflow warning and the replacement of the product nobs*nobs*npred with an NA.
The bottom line is that, to avoid the error, I will have to use considerably fewer observations when training the model or set importance=FALSE in the call to quantregForest(). The computations required to find variable importance are very memory intensive, even when using fewer than 10000 observations.
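The overflow described above can be reproduced in isolation with the two sizes from the question. This is a minimal sketch that does not call quantregForest() at all; the variable names `prod_int` and `prod_dbl` are hypothetical, and suppressWarnings() is used only to keep the demonstration quiet.

```r
# Sizes from the question, stored as integers (as length() and ncol() return).
nobs  <- 31132L
npred <- 38L

# Integer arithmetic: the result exceeds the integer range and becomes NA
# (R would normally warn "NAs produced by integer overflow").
prod_int <- suppressWarnings(nobs * nobs * npred)

# Converting one operand to double avoids the overflow.
prod_dbl <- as.numeric(nobs) * nobs * npred

# Largest value an R integer can hold.
.Machine$integer.max
```

An NA passed as the times argument of rep() is what triggers the "invalid 'times' argument" error.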

Subsetting by rows to do a correlation

I created a data frame from another dataset with 332 IDs. I split the data frame by ID and would like to count the rows for each ID and then run a correlation function. Can someone tell me how to count the rows of each ID so that I can compute a correlation for each of these groups?
jlhoward, your suggestion to add the "table(dat1$ID)" command worked. My other problem is that the function will not stop running:
corr<-function(directory,threshold=)
####### file location path#####
for(i in 1:332){dat<-rbind(dat,read.csv(specdata1[i]))
dat1<-dat[complete.cases(dat),]
dat2<-(split(dat1,dat1$ID))
list(dat2)
dat3<-table(dat1$ID)
for (i in dat1>=threshold){
x<-dat1$sulfate
y<-dat1$nitrate
correlations<-cor(x,y,use="pairwise.complete.obs",method="pearson")
corrs_output<-c(corrs_output,correlations)
}
I'm trying to correlate "sulfate" and "nitrate" for each ID monitor that meets a threshold. I created a list that holds all the complete cases per ID monitor. I need the function to compute the correlation of "sulfate" and "nitrate" for every per-ID set that is >= the threshold argument of the function. Below are the head and tail of the structure of the data.frame/list of each data set within the main data set "specdata1".
Head of the entire data.frame/list of specdata1 complete cases for correlation:
head(str(dat2,1))
List of 323
$ 1 :'data.frame': 117 obs. of 4 variables:
..$ Date : Factor w/ 4018 levels "2003-01-01","2003-01-02",..: 279 285 291 297 303 315 321 327 333 339 ...
..$ sulfate: num [1:117] 7.21 5.99 4.68 3.47 2.42 1.43 2.76 3.41 1.3 3.15 ...
..$ nitrate: num [1:117] 0.651 0.428 1.04 0.363 0.507 0.474 0.425 0.964 0.491 0.669 ...
..$ ID : int [1:117] 1 1 1 1 1 1 1 1 1 1 ...
Tail of the entire data.frame/list of all complete cases of specdata1:
tail(str(dat2,1))
$ 99 :'data.frame': 479 obs. of 4 variables:
..$ Date : Factor w/ 4018 levels "2003-01-01","2003-01-02",..: 1774 1780 1786 1804 1810 1816 1822 1840 1852 1858 ...
..$ sulfate: num [1:479] 1.51 8.2 1.48 4.75 3.47 1.19 1.77 2.27 2.06 2.11 ...
..$ nitrate: num [1:479] 0.725 1.64 1.01 6.81 0.751 1.69 2.08 0.996 0.817 0.488 ...
..$ ID : int [1:479] 99 99 99 99 99 99 99 99 99 99 ...
[list output truncated]
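A possible shape for the per-ID step is sketched below. It is not the asker's function: it assumes the threshold refers to the number of complete rows per ID, it uses a small invented data frame in place of the combined specdata1 files, and it replaces the loop with table(), split() and vapply().

```r
set.seed(42)

# Invented stand-in for the complete-case data (dat1 in the question):
# IDs 1, 2 and 3 with 5, 2 and 6 rows respectively.
dat1 <- data.frame(
  ID      = rep(1:3, times = c(5, 2, 6)),
  sulfate = rnorm(13),
  nitrate = rnorm(13)
)
threshold <- 4  # assumed to mean: minimum number of complete rows per ID

# Rows per ID, then the IDs that meet the threshold.
counts <- table(dat1$ID)
keep   <- names(counts)[as.vector(counts) >= threshold]

# One data frame per kept ID, then one correlation per data frame.
groups <- split(dat1, dat1$ID)[keep]
corrs_output <- vapply(
  groups,
  function(g) cor(g$sulfate, g$nitrate, method = "pearson"),
  numeric(1)
)
```

Each group is processed exactly once, so the calculation finishes after one pass over the kept IDs.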
