How to deal with NaN in R?

I have two binary files with the same dimensions (corr and rmse). I want to replace all pixels in rmse with NA wherever corr is NA.
file1:
conne <- file("D:\\omplete.bin", "rb")
corr <- readBin(conne, numeric(), size = 4, n = 1440*720, signed = TRUE)
file2:
rms <- file("D:\\hgmplete.bin", "rb")
rmse <- readBin(rms, numeric(), size = 4, n = 1440*720, signed = TRUE)
I did this:
rmse[corr == NA] = NA
which did not do anything, so I tried this:
rmse[corr == NaN] = NA
which did not do anything either. Can anybody help me with this?
Head of the file corr:
> corr
[1] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

You need the logical test is.nan(); a comparison like corr == NaN always evaluates to NA, so it never selects anything. In this case:
rmse[is.nan(corr)] <- NA
should do the trick.
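For illustration, a toy sketch (made-up vector, not the question's data) of why the comparison fails while is.nan() works:
x <- c(1, NaN, 3)
x == NaN            # NA NA NA: any comparison involving NaN/NA returns NA, never TRUE
is.nan(x)           # FALSE TRUE FALSE
x[is.nan(x)] <- NA
x                   # 1 NA 3 (is.na() would also have matched the NaN)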

Related

twang - Error in Di - crossprod(WX[index, ], X[index, ]) : non-conformable arrays

I'm trying to build propensity scores with the twang package, but I keep getting this error:
Error in Di - crossprod(WX[index, ], X[index, ]) : non-conformable arrays
I'm attaching the code and the console output:
ps.TPSV.gbm = ps(Cardioversione ~ Sesso + age,
                 data = prova)
Fitting boosted model
Iter TrainDeviance ValidDeviance StepSize Improve
1 0.6590 nan 0.0100 nan
2 0.6581 nan 0.0100 nan
3 0.6572 nan 0.0100 nan
4 0.6564 nan 0.0100 nan
5 0.6556 nan 0.0100 nan
6 0.6548 nan 0.0100 nan
7 0.6540 nan 0.0100 nan
8 0.6533 nan 0.0100 nan
9 0.6526 nan 0.0100 nan
...
9900 0.4164 nan 0.0100 nan
9920 0.4161 nan 0.0100 nan
9940 0.4160 nan 0.0100 nan
9960 0.4158 nan 0.0100 nan
9980 0.4157 nan 0.0100 nan
10000 0.4155 nan 0.0100 nan
Diagnosis of unweighted analysis
Error in Di - crossprod(WX[index, ], X[index, ]) : non-conformable arrays
I honestly don't understand what the problem is: the variables are one factor (Sesso) and one numeric (age), and there are no missing values. Could anyone help me?
Thank you in advance.
I've already tried changing the variables that go into the PS, with no luck. The example code with the lalonde dataset included in twang works fine.

Object p not found when running gbm()

I am aware of the question GBM: Object 'p' not found; however, it did not contain enough information to allow an answer. I don't believe this is a duplicate, as I've followed what was indicated in that question and in the linked duplicate Error in R gbm function when cv.folds > 0, which does not describe the same error.
I have been sure to follow the recommendation of leaving out any columns that were not used in the model.
This error appears when cv.folds is greater than 0:
object 'p' not found
From what I can see, setting cv.folds to 0 does not produce meaningful output. I have attempted different distributions, fractions, numbers of trees, etc. I'm confident I've parameterized something incorrectly, but I can't for the life of me see what it is.
Model and output:
model_output <- gbm(formula = ign ~ . ,
                    distribution = "bernoulli",
                    var.monotone = rep(0, 9),
                    data = model_sample,
                    train.fraction = 0.50,
                    n.cores = 1,
                    n.trees = 150,
                    cv.folds = 1,
                    keep.data = TRUE,
                    verbose = TRUE)
Iter TrainDeviance ValidDeviance StepSize Improve
1 nan nan 0.1000 nan
2 nan nan 0.1000 nan
3 nan nan 0.1000 nan
4 nan nan 0.1000 nan
5 nan nan 0.1000 nan
6 nan nan 0.1000 nan
7 nan nan 0.1000 nan
8 nan nan 0.1000 nan
9 nan nan 0.1000 nan
10 nan nan 0.1000 nan
20 nan nan 0.1000 nan
40 nan nan 0.1000 nan
60 nan nan 0.1000 nan
80 nan nan 0.1000 nan
100 nan nan 0.1000 nan
120 nan nan 0.1000 nan
140 nan nan 0.1000 nan
150 nan nan 0.1000 nan
The minimal data that generated the error used to be here; however, once the suggestion by @StupidWolf is applied, that sample is too small. The suggestion below gets past the initial error. Subsequent errors are occurring, and solutions will be posted here upon discovery.
It's not meant to deal with the situation where someone sets cv.folds = 1. By definition, k-fold means splitting the data into k parts, training on k-1 parts and testing on the remaining part, so I am not sure what 1-fold cross-validation would even be. If you look at the code for gbm, at line 437:
if (cv.folds > 1) {
  cv.results <- gbmCrossVal(cv.folds = cv.folds, nTrain = nTrain,
  ....
  p <- cv.results$predictions
}
It makes the predictions there, and then, when it collects the results into the gbm object at line 471:
if (cv.folds > 0) {
  gbm.obj$cv.fitted <- p
}
So if cv.folds == 1, p is never calculated, but cv.folds is still > 0, hence you get the error.
Below is a reproducible example:
library(MASS)
library(gbm)  # needed for gbm()

test <- Pima.tr
test$type <- as.numeric(test$type) - 1
model_output <- gbm(type ~ . ,
                    distribution = "bernoulli",
                    var.monotone = rep(0, 7),
                    data = test,
                    train.fraction = 0.5,
                    n.cores = 1,
                    n.trees = 30,
                    cv.folds = 1,
                    keep.data = TRUE,
                    verbose = TRUE)
gives me the error object 'p' not found. Set it to cv.folds = 2 and it runs smoothly:
model_output <- gbm(type ~ . ,
                    distribution = "bernoulli",
                    var.monotone = rep(0, 7),
                    data = test,
                    train.fraction = 0.5,
                    n.cores = 1,
                    n.trees = 30,
                    cv.folds = 2,
                    keep.data = TRUE,
                    verbose = TRUE)
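As a quick sanity check (not part of the original answer), with cv.folds = 2 the cross-validated results are attached to the returned object, so the cv.folds > 0 branch quoted above has a p to assign:
head(model_output$cv.fitted)                        # cross-validated fitted values kept on the object
best_iter <- gbm.perf(model_output, method = "cv")  # choose the iteration by CV error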

How to stop printing for "ps" function in "twang" package?

The "ps" function (propensity score estimation) in "twang" package in R keeps printing its report. How can I turn that off?
I already tried to set the "print.level" argument to be 0. But it is not working for me.
D <- rbinom(100, size = 1, prob = 0.5)
X1 <- rnorm(100)
X2 <- rnorm(100)
ps(D ~ ., data = data.frame(D, X1, X2), stop.method = 'es.mean',
   estimand = "ATE", print.level = 0)
I would like the fitting process not to print anything, but it keeps giving me something like:
Fitting gbm model
Iter TrainDeviance ValidDeviance StepSize Improve
1 1.3040 nan 0.0100 nan
2 1.3012 nan 0.0100 nan
3 1.2985 nan 0.0100 nan
4 1.2959 nan 0.0100 nan
5 1.2932 nan 0.0100 nan
6 1.2907 nan 0.0100 nan
7 1.2880 nan 0.0100 nan
8 1.2855 nan 0.0100 nan
9 1.2830 nan 0.0100 nan
10 1.2804 nan 0.0100 nan
20 1.2562 nan 0.0100 nan
.....
which is annoying.
Presumably you want to capture the result in a variable; if you combine that with the verbose = FALSE parameter, it should do what you need:
res <- ps(D ~ ., data = data.frame(D, X1, X2), stop.method = 'es.mean',
          estimand = "ATE", print.level = 0, verbose = FALSE)
I haven't tested whether you still need print.level = 0.
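If anything still prints, a base-R fallback (a general pattern, not specific to twang) is to capture and discard the console output while keeping the fitted object:
invisible(capture.output(
  res <- ps(D ~ ., data = data.frame(D, X1, X2), stop.method = 'es.mean',
            estimand = "ATE", print.level = 0, verbose = FALSE)
))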

daily, monthly and annual mean

I have hourly data. I need to convert it into daily, monthly, and then annual means.
Also, some dates are missing, so I want to include those as well.
#Date
24/02/2000/05:25:00 NaN NaN NaN
26/02/2000/05:10:00 0.227 0.2002496 0.2009378
26/02/2000/06:50:00 NaN NaN NaN
27/02/2000/05:55:00 0.21 0.1687891 0.1630572
28/02/2000/05:00:00 NaN NaN 0.1265696
28/02/2000/06:35:00 0.136 0.1446176 0.1479067
29/02/2000/05:40:00 0.293 0.2279881 0.1900514
01/03/2000/04:45:00 NaN NaN NaN
01/03/2000/06:25:00 0.322 0.3068518 0.2880579
02/03/2000/05:30:00 0.332 0.2793714 0.2391622
02/03/2000/07:05:00 NaN NaN NaN
03/03/2000/06:10:00 0.335 0.2151302 0.2218139
04/03/2000/05:15:00 0.1 0.1138773 0.1168912
04/03/2000/06:55:00 NaN NaN NaN
05/03/2000/06:00:00 0.117 0.1333082 0.147145
06/03/2000/05:05:00 NaN 0.2426362 0.2401871
06/03/2000/06:40:00 NaN 0.32587 0.2845067
07/03/2000/05:45:00 0.323 0.3143821 0.3096662
08/03/2000/04:50:00 NaN NaN NaN
08/03/2000/06:30:00 0.236 0.23232 0.2300107
10/03/2000/06:20:00 0.113 0.1429935 0.1453774
11/03/2000/05:25:00 0.276 0.3238274 0.3150585
11/03/2000/07:00:00 NaN NaN NaN
12/03/2000/06:05:00 0.215 0.2537585 0.2512374
13/03/2000/05:10:00 0.163 0.2273455 0.2679352
13/03/2000/06:50:00 NaN NaN NaN
14/03/2000/05:55:00 0.09 0.1311507 0.1761056
15/03/2000/05:00:00 NaN NaN 0.1447348
15/03/2000/06:35:00 0.125 0.1232291 0.1387782
16/03/2000/05:40:00 0.019 0.06970426 0.11602
17/03/2000/04:45:00 NaN NaN NaN
17/03/2000/06:25:00 0.194 0.1964414 0.1874403
18/03/2000/05:30:00 0.263 0.2749394 0.242199
18/03/2000/07:05:00 NaN NaN NaN
19/03/2000/06:10:00 0.217 0.217737 0.2183706
20/03/2000/05:15:00 0.253 0.2307511 0.2089891
20/03/2000/06:55:00 NaN NaN NaN
21/03/2000/06:00:00 0.282 0.2413632 0.2511235
22/03/2000/05:05:00 NaN 0.382685 0.3944636
22/03/2000/06:45:00 NaN 0.2734097 0.241442
23/03/2000/05:50:00 0.347 0.3289219 0.3003848
24/03/2000/04:50:00 NaN NaN NaN
24/03/2000/06:30:00 0.18 0.1892378 0.2021516
25/03/2000/05:35:00 0.216 0.1871835 0.206762
26/03/2000/06:20:00 0.189 0.1836237 0.2116453
27/03/2000/05:25:00 0.195 0.1817446 0.1804464
27/03/2000/07:00:00 NaN NaN NaN
28/03/2000/06:05:00 0.208 0.168515 0.1819115
29/03/2000/05:10:00 0.162 0.1598227 0.1689523
29/03/2000/06:50:00 NaN NaN NaN
30/03/2000/05:55:00 0.145 0.1472181 0.1723774
31/03/2000/05:00:00 NaN NaN 0.157723
31/03/2000/06:35:00 0.226 0.2108984 0.2339231
I guess you are talking about splitting your date variable into year, month, and day, and then calculating some grouped statistic of another variable which you did not include in your example. If that is the case, you could do the following:
# load package
library(dplyr)
#Date
Date <- data.frame( Date =strptime(c("24/02/2000/05:25:00",
"26/02/2000/05:10:00",
"26/02/2000/06:50:00",
"27/02/2000/05:56:00",
"28/02/2000/05:00:00",
"28/02/2000/06:35:00",
"29/02/2000/05:40:00",
"01/03/2000/04:45:00",
"01/03/2000/06:25:00",
"02/03/2000/05:30:00",
"02/03/2000/07:05:00",
"03/03/2000/06:10:00",
"04/03/2000/05:15:00",
"04/03/2000/06:55:00",
"05/03/2000/06:00:00",
"06/03/2000/05:05:00",
"06/03/2000/06:40:00",
"07/03/2000/05:45:00",
"08/03/2000/04:50:00",
"08/03/2000/06:30:00",
"10/03/2000/06:20:00",
"11/03/2000/05:25:00",
"11/03/2000/07:00:00",
"12/03/2000/06:05:00",
"13/03/2000/05:10:00",
"13/03/2000/06:50:00",
"14/03/2000/05:55:00",
"15/03/2000/05:00:00",
"15/03/2000/06:35:00",
"16/03/2000/05:40:00",
"17/03/2000/04:45:00",
"17/03/2000/06:25:00",
"18/03/2000/05:30:00",
"18/03/2000/07:05:00",
"19/03/2000/06:10:00",
"20/03/2000/05:15:00",
"20/03/2000/06:55:00",
"21/03/2000/06:00:00",
"22/03/2000/05:05:00",
"22/03/2000/06:45:00",
"23/03/2000/05:50:00",
"24/03/2000/04:50:00",
"24/03/2000/06:30:00",
"25/03/2000/05:35:00",
"26/03/2000/06:20:00",
"27/03/2000/05:25:00",
"27/03/2000/07:00:00",
"28/03/2000/06:05:00",
"29/03/2000/05:10:00",
"29/03/2000/06:50:00",
"30/03/2000/05:55:00",
"31/03/2000/05:00:00",
"31/03/2000/06:35:00"), format = "%d/%m/%Y/%H:%M:%S"))
# Split your Date variable in days, months, and years
Date[,"Year"] <- format(Date$Date, format = "%Y")
Date[,"Month"] <- format(Date$Date, format = "%m")
Date[,"Day"] <- format(Date$Date, format = "%d")
# Make up some random variable to calculate summary statistics on
Date[,"Random"] <- sample(seq(1,7,1),size=dim(Date)[1], replace = TRUE)
# Now you can calculate grouped statistics by day, month, or year
MonthMean <- Date %>%
  group_by(Month) %>%
  select(Month, Random) %>%
  summarise(Mean = mean(Random))
# Output
# A tibble: 2 × 2
Month Mean
<chr> <dbl>
1 02 3.142857
2 03 4.217391
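The same pattern extends to daily and annual means (continuing with the answer's Date and Random columns):
DayMean <- Date %>%
  group_by(Year, Month, Day) %>%
  summarise(Mean = mean(Random))
YearMean <- Date %>%
  group_by(Year) %>%
  summarise(Mean = mean(Random))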
I have split the data into Day, Month, and Year, and then calculated the daily mean, monthly mean, and annual mean using this code:
# open the file
file1 <- read.table(file.choose(), header = TRUE)
# view the content of the file
View(file1)
# assign the date
file1$date <- as.character(file1$Date)
time <- as.Date(file1$date, "%d/%m/%Y")
# separate the day, month, year
file1[, "Year"] <- format(time, format = "%Y")
file1[, "Month"] <- format(time, format = "%m")
file1[, "Day"] <- format(time, format = "%d")
# to see the updated file
View(file1)
# averaging the daily mean, then the same month-wise
aggregate(file1[, 2:4], list(file1$Day), mean, na.rm = TRUE)
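To get all three levels from the same file1, a minimal sketch using aggregate (assuming columns 2:4 hold the measurements, as in the line above):
# daily mean: group by the full date
daily_mean <- aggregate(file1[, 2:4], list(Date = time), mean, na.rm = TRUE)
# monthly mean: group by year and month
monthly_mean <- aggregate(file1[, 2:4], list(Year = file1$Year, Month = file1$Month), mean, na.rm = TRUE)
# annual mean: group by year
annual_mean <- aggregate(file1[, 2:4], list(Year = file1$Year), mean, na.rm = TRUE)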

lapply and passing arguments

I'm trying to learn how to effectively use the apply family in R. I have the following numeric vector:
>aa
[1] 0.047619 0.000000 NaN 0.000000 0.000000 0.000000 0.000000 0.000000
[9] NaN NaN 0.000000 NaN NaN NaN NaN NaN
[17] 0.000000 0.000000 NaN NaN NaN NaN NaN NaN
[25] NaN 0.100000 0.000000 0.000000 0.000000 0.000000 1.000000 NaN
[33] NaN NaN NaN NaN NaN NaN 0.133333 NaN
[41] NaN 0.000000 0.000000 0.000000 NaN NaN NaN NaN
[49] NaN
and I'm trying to get the sample size n out of pwr.t.test with each of these values as input to the d argument.
My attempts have yielded this as the latest result, and frankly, I'm stumped:
lapply(aa, function(x) pwr.t.test(d = x, power = .8, sig.level = .05,
                                  type = "one.sample", alternative = "two.sided"))
with the following error message:
Error in uniroot(function(n) eval(p.body) - power, c(2 + 1e-10, 1e+07)) :
f() values at end points not of opposite sign
Any ideas on the right way to do this?
Short answer: The number of subjects needed is greater than the maximum that R will check for. Add some checks so that you don't run the function when d == 0 and it will work.
When d = 0, you need an infinite number of subjects to detect the difference. The error you are seeing arises because R solves for the sample size numerically: uniroot() first checks the endpoints of the interval over which the possible values of n lie (about 2 to 1e+07). Because the objective (achieved power minus requested power) has the same sign at both endpoints and is monotonic in n, R throws an error saying that the root (the value of n you are looking for) cannot be found.
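A minimal sketch of the check the answer suggests, which also skips the NaN entries (names are illustrative):
library(pwr)
ok <- !is.nan(aa) & aa != 0               # d of 0 (or NaN) has no finite sample size
res <- lapply(aa[ok], function(x)
  pwr.t.test(d = x, power = 0.8, sig.level = 0.05,
             type = "one.sample", alternative = "two.sided"))
n_needed <- sapply(res, function(r) r$n)  # required n for each usable effect size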
