lapply and passing arguments - r

I'm trying to learn how to effectively use the apply family in R. I have the following numeric vector:
> aa
[1] 0.047619 0.000000 NaN 0.000000 0.000000 0.000000 0.000000 0.000000
[9] NaN NaN 0.000000 NaN NaN NaN NaN NaN
[17] 0.000000 0.000000 NaN NaN NaN NaN NaN NaN
[25] NaN 0.100000 0.000000 0.000000 0.000000 0.000000 1.000000 NaN
[33] NaN NaN NaN NaN NaN NaN 0.133333 NaN
[41] NaN 0.000000 0.000000 0.000000 NaN NaN NaN NaN
[49] NaN
and I'm trying to get the n value out of pwr.t.test, using each of these values as input to the d argument.
My attempts have yielded this as the latest result, and frankly, I'm stumped:
> lapply(aa, function(x) pwr.t.test(d=x, power=.8, sig.level=.05, type="one.sample", alternative="two.sided"))
with the following error message:
Error in uniroot(function(n) eval(p.body) - power, c(2 + 1e-10, 1e+07)) :
f() values at end points not of opposite sign
Any ideas on the right way to do this?

Short answer: The number of subjects needed is greater than the maximum that R will check for. Add some checks so that you don't run the function when d == 0 and it will work.
When d = 0, you need an infinite number of subjects to detect the difference. The error you are seeing arises because R solves for n numerically: the algorithm (uniroot) first checks the endpoints of the interval over which the possible values of n lie (about 2 to 1e+07). Because the function being solved (achieved power minus target power) has the same sign at both endpoints and is monotonic in n, R throws an error saying that the root (the value of n you are looking for) cannot be found.
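A minimal sketch along these lines (assuming the pwr package is attached and aa is the vector above): skip the values where no finite sample size exists and collect n for the rest.
library(pwr)

# Skip d values that are NaN or 0; no finite sample size exists for them.
n_needed <- sapply(aa, function(d) {
  if (is.na(d) || d == 0) return(NA_real_)
  pwr.t.test(d = d, power = 0.8, sig.level = 0.05,
             type = "one.sample", alternative = "two.sided")$n
})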

Related

twang - Error in Di - crossprod(WX[index, ], X[index, ]) : non-conformable arrays

I'm trying to build propensity scores with the twang package, but I keep getting this error:
Error in Di - crossprod(WX[index, ], X[index, ]) : non-conformable arrays
I'm attaching the code:
> ps.TPSV.gbm = ps(Cardioversione ~ Sesso + age,
+                  data = prova)
Fitting boosted model
Iter TrainDeviance ValidDeviance StepSize Improve
1 0.6590 nan 0.0100 nan
2 0.6581 nan 0.0100 nan
3 0.6572 nan 0.0100 nan
4 0.6564 nan 0.0100 nan
5 0.6556 nan 0.0100 nan
6 0.6548 nan 0.0100 nan
7 0.6540 nan 0.0100 nan
8 0.6533 nan 0.0100 nan
9 0.6526 nan 0.0100 nan
...
9900 0.4164 nan 0.0100 nan
9920 0.4161 nan 0.0100 nan
9940 0.4160 nan 0.0100 nan
9960 0.4158 nan 0.0100 nan
9980 0.4157 nan 0.0100 nan
10000 0.4155 nan 0.0100 nan
Diagnosis of unweighted analysis
Error in Di - crossprod(WX[index, ], X[index, ]) : non-conformable arrays
I honestly don't understand what the problem is: the variables are one factor (Sesso) and one numeric (age), and there are no missing values. Could anyone help me?
Thank you in advance.
I've already tried changing the variables included in the propensity score model, but with no success. I also checked that the example code works with the lalonde dataset included in twang, and it runs fine.

A Simple Matrix Product Returns NaN in R

In R, I have a 3000x4 matrix that looks like the following:
[1,] 8.458792e-02 6.915341e-02 2.179035e-01 8.458792e-02
[2,] 1.933362e-01 2.895261e-01 2.836058e-01 1.933362e-01
[3,] 2.706257e-02 3.233158e-02 7.077421e-02 2.706257e-02
[4,] 2.621281e-01 1.448730e-01 5.983300e-01 2.621281e-01
[5,] 2.018450e-01 2.322246e-01 1.634069e-01 0.000000e+00
[6,] 2.990089e-01 4.956391e-01 3.123204e-01 2.990089e-01
[7,] 1.244709e+00 -4.636184e-01 2.340081e+00 1.244709e+00
[8,] -1.124598e+00 -1.761734e+00 2.832896e-01 0.000000e+00
[9,] 1.394569e-02 2.337716e-02 3.243019e-02 1.394569e-02
[10,] -2.134538e-01 -1.295015e-01 1.296246e-01 0.000000e+00
[ reached getOption("max.print") -- omitted 2990 rows ]
Let me call this matrix C.
The problem is that the matrix product returns NaNs:
> t(C)%*%C
NaN NaN NaN NaN
NaN NaN NaN NaN
NaN NaN NaN NaN
NaN NaN NaN NaN
But if I do the calculation using only the first 100 rows, the problem does not occur:
>t(C[1:100,])%*%C[1:100,]
27.063320 8.051414 27.027122 15.340364
8.051414 10.571046 5.213047 3.521941
27.027122 5.213047 41.211831 23.785906
15.340364 3.521941 23.785906 15.340364
So why does this happen, and how can I solve it?
(My code has more detail, of course, but the matrix product above is the only place where the problem occurs, so I don't think additional detail would help.)
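A likely explanation is that a non-finite value (NaN, NA, or Inf) is hiding in C beyond the printed rows; any such entry propagates through the whole product. A quick way to check, and to drop the affected rows if that is acceptable (a sketch, not from the original post):
which(!is.finite(C), arr.ind = TRUE)   # row/column positions of non-finite entries
sum(!is.finite(C))                     # how many there are

# If dropping the affected rows is acceptable:
C_clean <- C[rowSums(!is.finite(C)) == 0, , drop = FALSE]
crossprod(C_clean)                     # same as t(C_clean) %*% C_clean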

How can I calculate the progressive mean in a vector BUT restarting when a condition is met?

Given the following vector,
x<-c(0,0,0,5.0,5.1,5.0,6.5,6.7,6.0,0,0,0,0,3.8,4.0,3.6)
I would like to have a vector with the cumulative mean, like
cumsum(x)/seq_along(x)
but restarting the computation each time the difference between two consecutive values is greater than 1.3 or less than -1.3. My aim is to obtain a vector like
d<-c(0,0,0,5,5.05,5.03,6.5,6.6,6.37,0,0,0,0,3.8,3.9,3.8)
You can use cumsum(abs(diff(x)) > 1.3) to define groups, which are then used in aggregate to restart cumsum(x)/seq_along(x) each time the difference is greater than 1.3 or less than -1.3.
unlist(aggregate(x, list(c(0, cumsum(abs(diff(x)) > 1.3))),
function(x) cumsum(x)/seq_along(x))[,2])
# [1] 0.000000 0.000000 0.000000 5.000000 5.050000 5.033333 6.500000 6.600000
# [9] 6.400000 0.000000 0.000000 0.000000 0.000000 3.800000 3.900000 3.800000
Maybe you can try ave + findInterval like below
ave(x, findInterval(seq_along(x), which(abs(diff(x)) > 1.3) + 1),
    FUN = function(v) cumsum(v)/seq_along(v))
which gives
[1] 0.000000 0.000000 0.000000 5.000000 5.050000 5.033333 6.500000 6.600000
[9] 6.400000 0.000000 0.000000 0.000000 0.000000 3.800000 3.900000 3.800000
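To see how both answers form the groups, you can inspect the group index on its own (a small illustration, not part of the original answers):
grp <- cumsum(c(0, abs(diff(x)) > 1.3))
grp
# [1] 0 0 0 1 1 1 2 2 2 3 3 3 3 4 4 4
ave(x, grp, FUN = function(v) cumsum(v)/seq_along(v))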

How can I print variable importance with the gbm function?

I used the gbm function to implement gradient boosting, and I want to perform classification.
After fitting, I used the varImp() function to print the variable importance of the gradient boosting model.
But only 4 variables have non-zero importance, out of the 371 variables in my data. Is that right?
This is my code and result.
>asd<-read.csv("bigdatafile.csv",header=TRUE)
>asd1<-gbm(TARGET~.,n.trees=50,distribution="adaboost", verbose=TRUE,interaction.depth = 1,data=asd)
Iter TrainDeviance ValidDeviance StepSize Improve
1 0.5840 nan 0.0010 0.0011
2 0.5829 nan 0.0010 0.0011
3 0.5817 nan 0.0010 0.0011
4 0.5806 nan 0.0010 0.0011
5 0.5795 nan 0.0010 0.0011
6 0.5783 nan 0.0010 0.0011
7 0.5772 nan 0.0010 0.0011
8 0.5761 nan 0.0010 0.0011
9 0.5750 nan 0.0010 0.0011
10 0.5738 nan 0.0010 0.0011
20 0.5629 nan 0.0010 0.0011
40 0.5421 nan 0.0010 0.0010
50 0.5321 nan 0.0010 0.0010
>varImp(asd1,numTrees = 50)
Overall
CA0000801 0.00000
AS0000138 0.00000
AS0000140 0.00000
A1 0.00000
PROFILE_CODE 0.00000
A2 0.00000
CB_thinfile2 0.00000
SP_thinfile2 0.00000
thinfile1 0.00000
EW0001901 0.00000
EW0020901 0.00000
EH0001801 0.00000
BS_Seg1_Score 0.00000
BS_Seg2_Score 0.00000
LA0000106 0.00000
EW0001903 0.00000
EW0002801 0.00000
EW0002902 0.00000
EW0002903 0.00000
EW0002904 0.00000
EW0002906 0.00000
LA0300104_SP 56.19052
ASMGRD2 2486.12715
MIX_GRD 2211.03780
P71010401_1 0.00000
PS0000265 0.00000
P11021100 0.00000
PE0000123 0.00000
There are 371 variables, so I didn't list the rest above; they all have zero importance.
TARGET is the target variable and has two levels, so I used adaboost. I produced 50 trees.
Is there a mistake in my code? Only a few variables have non-zero importance.
Thank you for your reply.
You cannot use importance() or varImp() here; those are for random forests.
However, you can use summary.gbm from the gbm package.
Ex:
summary.gbm(boost_model)
The output is a table of each variable (var) with its relative influence (rel.inf).
In your code, n.trees is very low and shrinkage is very low as well (the StepSize of 0.0010 in your log). Adjust these two parameters.
n.trees is the number of trees. Increasing it reduces the error on the training set, but setting it too high may lead to over-fitting.
interaction.depth (maximum nodes per tree) is the number of splits performed on each tree, starting from a single node.
shrinkage acts as a learning rate. Shrinkage is also used in ridge regression, where it shrinks regression coefficients toward zero and thus reduces the impact of potentially unstable coefficients.
I recommend 0.1 for data sets with more than 10,000 records, and a smaller shrinkage when growing many trees.
If you use 1,000 for n.trees and 0.1 for shrinkage, you will get different (non-zero) importance values.
If you want the relative influence of each variable in the gbm, use summary.gbm() rather than varImp(). varImp() is a good function, but I recommend summary.gbm() here.
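A rough sketch of this (assuming the same data frame asd and a 0/1 TARGET as in your code): refit with more trees and a larger shrinkage, then ask gbm itself for the relative influence.
library(gbm)

# Refit with more trees and a larger learning rate (the values suggested above).
asd1 <- gbm(TARGET ~ ., data = asd, distribution = "adaboost",
            n.trees = 1000, shrinkage = 0.1, interaction.depth = 1)

# Relative influence of each variable; returns a data frame with var and rel.inf.
rel_inf <- summary(asd1, n.trees = 1000, plotit = FALSE)
head(rel_inf)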
Good luck.

How to deal with NaN in R?

I have two binary files with the same dimensions (corr and rmse). I want to do this:
replace all pixels in rmse with NA whenever corr is NA.
file1:
conne <- file("D:\\omplete.bin","rb")
corr<- readBin(conne, numeric(), size=4, n=1440*720, signed=TRUE)
file2:
rms <- file("D:\\hgmplete.bin","rb")
rmse<- readBin(rms, numeric(), size=4, n=1440*720, signed=TRUE)
I did this:
rmse[corr==NA]=NA
did not do anything, so I tried this:
rmse[corr==NaN]=NA
That did not do anything either. Can anybody help me with this?
Head of the file corr:
> corr
[1] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
You need to use the logical test is.nan(). In this case:
rmse[is.nan(corr)]=NA
should do the trick
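The reason the original attempts fail is that comparisons with NA or NaN return NA rather than TRUE/FALSE, so they cannot be used for subsetting. A small illustration (and, if corr can contain NA as well as NaN, is.na() catches both):
c(1, NaN, 3) == NaN       # NA NA NA -- never TRUE
is.nan(c(1, NaN, 3))      # FALSE TRUE FALSE
rmse[is.na(corr)] <- NA   # is.na() is TRUE for both NA and NaN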

Resources