Different results from different sample split methods? - r

I'm practicing different methods of splitting a dataset. However, the two methods give me different numbers of observations in the resulting splits. Shouldn't the observation counts from the two functions be the same, since the split ratio and the original data set are the same?
Here's my code
##split data set with caTools
library(caTools)  # provides sample.split()
library(dplyr)    # provides filter()
set.seed(144)
split.5 <- sample.split(CTR$CTR, 0.7)
ctr.test <- filter(CTR, split.5 == FALSE)
ctr.train <- filter(CTR, split.5 == TRUE)
##split data set with the base sample() function
train.id <- sample(nrow(CTR), 0.7 * nrow(CTR))
ctr_test <- CTR[-train.id, ]
ctr_train <- CTR[train.id, ]
According to the result calculated with a calculator, the observation count from sample() is correct: it equals total observations * 0.7.
Thanks a lot for the help!

Here's a modified version of your code; let me know if this isn't what you meant.
library(caTools)  # provides sample.split()
set.seed(144)
split.5 <- sample.split(chickwts$weight, 0.7)
ctr.test <- dplyr::filter(chickwts, split.5 == FALSE)
ctr.train <- dplyr::filter(chickwts, split.5 == TRUE)

set.seed(144)
train.id <- sample(nrow(chickwts), 0.7 * nrow(chickwts))
ctr_test <- chickwts[-train.id, ]
ctr_train <- chickwts[train.id, ]
And then you can see that
nrow(ctr_test) == nrow(ctr.test)
is TRUE.
I think you are asking why you don't end up with the same rows.
That is because the row indices drawn by sample() (train.id) are not the same rows as the ones flagged TRUE in split.5.
Running ?Random will get you more information on random number generation in R. I didn't look into the code of sample.split(), but the documentation doesn't say much about the details of the RNG it uses.
So if you try this
set.seed(144)
split.52 <- sample.split(chickwts$weight, 0.7)
ctr.test2 <- dplyr::filter(chickwts, split.52 == FALSE)
ctr.train2 <- dplyr::filter(chickwts, split.52 == TRUE)
ctr.test2 == ctr.test

set.seed(144)
train.id2 <- sample(nrow(chickwts), 0.7 * nrow(chickwts))
ctr_test2 <- chickwts[-train.id2, ]
ctr_train2 <- chickwts[train.id2, ]
ctr_test == ctr_test2
You can see that each method is reproducible with the same seed, as you expected, but the two methods work somewhat differently.
Update:
A closer look reveals that sample.split() uses runif(). In contrast, sample() relies on much lower-level code. This is an article that can get you started on what that means:
https://datascienceconfidential.github.io/r/2017/12/28/how-to-read-source-code-internal-r-function.html
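To see the consequence directly, here is a minimal sketch (my illustration, not from the original answer) showing that runif() and sample() consume the random number stream differently even from the same seed, which is why splits built on them cannot be expected to pick the same rows:

set.seed(144)
runif(5)   # the kind of draws sample.split() builds its TRUE/FALSE flags from

set.seed(144)
sample(5)  # a permutation produced by sample()'s lower-level routine

# Same seed, but the two functions use the stream differently,
# so the selected rows differ between the two splitting methods.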

Related

subsetting from a bigstatsr::FBM object for the LDpred2 R tutorial

I'm using LDpred2, incorporated in bigsnpr, to calculate polygenic scores with my own set of genetic data. I am following the steps in the online LDpred2 tutorial on GitHub (https://privefl.github.io/bigsnpr/articles/LDpred2.html) to use the automatic model snp_ldpred2_auto.
I cannot execute the line:
pred_auto <- big_prodMat(G, beta_auto, ind.row = ind.test, ind.col = df_beta[["_NUM_ID_"]])
I suspect this happens because the matrices are not conformable for multiplication: the number of columns in G (the FBM matrix) is not identical to the number of rows in beta_auto (an ordinary matrix). I intend to filter out variants (SNPs) from G so that the number of variants in G equals the number of variants in beta_auto.
I have never worked with matrices of class FBM.code256 before and do not know how to achieve this subsetting. Guidance is much appreciated.
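This related question carries no answer here, but as a sketch of where to look: big_prodMat() already does the column subsetting through its ind.col argument, so the usual cause of this error is that ind.col and beta_auto have mismatched lengths. Assuming df_beta[["_NUM_ID_"]] holds the column indices of the variants actually used to fit beta_auto (an assumption, following the tutorial's naming), a quick diagnostic looks like:

library(bigstatsr)

# Each row of beta_auto should correspond to one column index of G
ind.col <- df_beta[["_NUM_ID_"]]

# big_prodMat() needs length(ind.col) == nrow(beta_auto);
# if this fails, restrict df_beta/beta_auto to the same set of variants first
stopifnot(length(ind.col) == nrow(beta_auto))

pred_auto <- big_prodMat(G, beta_auto, ind.row = ind.test, ind.col = ind.col)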

How to extract aggregated imputed data from R-package 'mice'?

I have a question regarding the aggregation of imputed data as created by the R-package 'mice'.
As far as I understand it, the 'complete' command of 'mice' is applied to extract the imputed values of, e.g., the first imputation. However, when running a total of ten imputations, I am not sure which imputed values to extract. Does anyone know how to extract the (aggregated) imputed data across all imputations?
Since I would like to enter the data into MS Excel and perform further calculations in another software tool, such a command would be very helpful.
Thank you for your comments. A simple example (from 'mice' itself) can be found below:
R> library("mice")
R> nhanes
R> imp <- mice(nhanes, seed = 23109) #create imputation
R> complete(imp) #extraction of the five imputed datasets (row-stacked matrix)
How can I aggregate the five imputed data sets and extract the imputed values to Excel?
I had a similar issue.
I used the code below, which is good enough for numeric variables.
For other variable types I thought about randomly choosing one of the imputed results (because averaging can distort them).
My suggested code (for numeric variables) is:
tempData <- mice(data, m = 5, maxit = 50, meth = 'pmm', seed = 500)
completedData <- complete(tempData, 'long')  # stacks all 5 imputed datasets, with .imp and .id columns
# average each numeric variable across the imputations, per original row (.id)
a <- aggregate(completedData[, 3:6], by = list(completedData$.id), FUN = mean)
You should then join the results back onto your original data.
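A minimal sketch of that join (my illustration, assuming the .id values produced by complete() correspond to the row names of the original data):

names(a)[1] <- ".id"                          # aggregate() named the grouping column Group.1
data_ids <- data.frame(.id = rownames(data))
data_avg <- merge(data_ids, a, by = ".id", sort = FALSE)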
I think 'Hmisc' is a better package for this.
If you have already found a nicer/more elegant/built-in solution, please share it with us.
You should use complete(imp, action = "long") to get values for each imputation. If you look at ?complete, you will find
complete(x, action = 1, include = FALSE)
Arguments
x
An object of class mids as created by the function mice().
action
If action is a scalar between 1 and x$m, the function returns the data with imputation number action filled in. Thus, action=1 returns the first completed data set, action=2 returns the second completed data set, and so on. The value of action can also be one of the following strings: 'long', 'broad', 'repeated'. See 'Details' for the interpretation.
include
Flag to indicate whether the original data with the missing values should be included. This requires that action is specified as 'long', 'broad' or 'repeated'.
So, the default is to return the first imputed data set. In addition, the argument action can also be a string: 'long', 'broad', or 'repeated'. If you pass 'long', it will give you the data in long format. You can also set include = TRUE if you want the original data with its missing values included.
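Since the goal is to carry the values into MS Excel, here is a minimal sketch (my addition, reusing the example from the question) that extracts every imputation in long format and writes it to a CSV file Excel can open:

library(mice)

imp <- mice(nhanes, seed = 23109)
long <- complete(imp, action = "long", include = TRUE)  # .imp == 0 marks the original data
write.csv(long, "nhanes_imputed_long.csv", row.names = FALSE)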
OK, but then you still have to choose one imputed dataset for further analyses... I think the best option is to run the analysis on each imputed dataset and pool the results afterwards:
fit <- with(data = imp, exp = lm(bmi ~ hyp + chl))
pool(fit)
But I also assume it's not forbidden to use just one of the imputed datasets ;)

How to split data 70:30 and get a different range of data every time you split it

I'm currently using R to do feature selection through Random Forest regression. I want to split my data 70:30, which is easy enough to do. However, I want to be able to do this 10 times, each time obtaining a different set of examples from the one before.
> trainIndex<- createDataPartition(lipids$RT..seconds., p=0.7, list=F)
> lipids.train <- lipids[trainIndex, ]
> lipids.test <- lipids[-trainIndex, ]
This is what I'm doing at the moment, and it works great for splitting my data 70:30. But when I do it again, I get the same 70% of the data in my training set and the same 30% in my test set. I know this is how createDataPartition works, but is there a way of making it so that I get a different 70% of the data the next time I run it?
Thanks
In the future, please include the packages you're using, since createDataPartition is not in base R. I'm assuming you're using the caret package. If that is correct, did you find the times argument?
trainIndex<- createDataPartition(lipids$RT..seconds., p=0.7, list=F, times=10)
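With list = FALSE, the result is a matrix with one column of training-row indices per split, so a sketch of using all 10 splits looks like:

for (k in seq_len(ncol(trainIndex))) {
  lipids.train <- lipids[trainIndex[, k], ]
  lipids.test  <- lipids[-trainIndex[, k], ]
  # fit and evaluate the random forest on this split
}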
As mentioned in the comment, you can just as simply use sample:
sample(seq_along(lipids$RT..seconds.), as.integer(0.7 * nrow(lipids)))
And sample() draws from a fresh state of the random number generator each time it is run (unless you call set.seed()), so you will get a different selection each time.
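For instance, a sketch of repeating the 70:30 split 10 times, each iteration drawing a different training set:

splits <- lapply(1:10, function(i) {
  train.id <- sample(nrow(lipids), as.integer(0.7 * nrow(lipids)))
  list(train = lipids[train.id, ], test = lipids[-train.id, ])
})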
library(dplyr)

n <- as.integer(nrow(data) * 0.7)          # size of the 70% training set
data_70 <- data[sample(nrow(data), n), ]   # draw a random 70%
# the remaining 30%: anti_join() keeps rows of data not present in data_70
# (note it matches on all columns, so duplicate rows can be dropped incorrectly)
data_30 <- anti_join(data, data_70)

R: conditional expand.grid function

I would like to find all combinations of vector elements that match a specific condition. The function expand.grid returns all possible combinations without checking for a specific condition. It is possible to test for the condition after using expand.grid, but in some situations the number of possible combinations is too large to generate with expand.grid in the first place. So, is there a function that allows me to check for a condition while generating all possible combinations?
This is a simplified version of the problem:
A <- seq.int(from = 0, to = 12, by = 1) * 15
B <- seq.int(from = 0, to = 27, by = 1) * 23
C <- seq.int(from = 0, to = 18, by = 1) * 18
D <- seq.int(from = 0, to = 33, by = 1) * 10

out <- expand.grid(A, B, C, D)  # out is a data frame with dimensions 235144 x 4
idx <- which(rowSums(out) <= 400 & rowSums(out) >= 300)  # only a small fraction of 'out' is needed
results <- out[idx, ]
In a word, no. After all, if you knew a priori which combinations were desirable/undesirable, you could exclude them from the expansion, e.g. expand.grid(A[A<20],B[B<15],...) . In the general case, which I'm assuming is your real question, you have no simple way to exclude portions of the input vectors.
You might just want to write a multilevel loop which tests each combination in turn and saves or rejects it. This will be slow (again, unless you come up with some clever algorithm to predict regions which are all TRUE or FALSE). So, in the long run, you may be better off using some of the R-packages which partition large calculations (and datasets) so as to avoid exceeding your memory limits.
Now that I've said all that, someone's going to post a link to a package which does exactly that :-(
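In the meantime, here is a minimal sketch (my addition) of the multilevel-loop approach for the example above. Because all the values are non-negative, partial sums can only grow, so whole regions can be pruned without visiting them, and the full grid is never materialized:

keep <- list()
for (w in A) {
  if (w > 400) next                # prune: the sum can only grow from here
  for (x in B) {
    if (w + x > 400) next
    for (y in C) {
      if (w + x + y > 400) next
      for (z in D) {
        s <- w + x + y + z
        if (s >= 300 && s <= 400) keep[[length(keep) + 1]] <- c(w, x, y, z)
      }
    }
  }
}
results <- do.call(rbind, keep)    # one row per qualifying combination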

perform function on pairs of columns

I am trying to run some Monte Carlo simulations on animal position data. So far, I have sampled 100 X and Y coordinates, 100 times. This results in a list of 200. I then convert this list into a data frame that is more conducive to the eventual functions I want to run for each sample (kernel.area).
Now I have a data frame with 200 columns, and I would like to perform the kernel.area function using each successive pair of columns.
I can't reproduce my own data here very well, so I've tried to give a basic example just to show the structure of the data frame I'm working with. I've included the for loop I've tried so far, but I am still an R novice and would appreciate any suggestions.
# generate a data frame representing X and Y positions
df <- data.frame(x = 1:200, y = 1:200)
# 100 replications of sampling 100 "positions"
resamp <- replicate(100, df[sample(nrow(df), 100), ])
# convert to a data frame (kernel.area needs an xy data frame)
df2 <- do.call("rbind", resamp[1:2, ])
# xy positions need to be in columns for kernel.area
df3 <- t(df2)
# edit: kernel.area requires an id field, but I am only dealing with one
# individual, so I'll construct a fake one of the same length as the positions
id <- replicate(100, c("id"))
id <- data.frame(id)
Here is the structure of the for loop I've tried (edited since first post):
for (j in seq(1, ncol(df3) - 1, 2)) {
  kud <- kernel.area(df3[, j:(j+1)], id = id, kern = "bivnorm", unin = c("m"), unout = c("km2"))
  print(kud)
}
My end goal is to calculate kernel.area for each resampling event (i.e. rows 1:100 for every pair of columns up to 200), and to be able to combine the results in a data frame. However, after running the loop, I get this error message:
Error in df[, 1] : incorrect number of dimensions
Edit: I realised my id format was not the same as my data frame, so I changed it, and now I have the error:
Error in kernelUD(xy, id, h, grid, same4all, hlim, kern, extent) :
id should have the same length as xy
First, a disclaimer: I have never worked with the package adehabitat, which has a function kernel.area, which I assume you are using. Perhaps you could confirm which package contains the function in question.
I think there are a couple suggestions I can make that are independent of knowledge of the specific package, though.
The first lies in the creation of df3. This should probably be df3 <- t(df2), but this is most likely correct in your actual code and just a typo in your post.

The second suggestion has to do with the way you subset df3 in the loop. j:j+1 is just a single number, since : has a higher precedence than + (see ?Syntax for the order in which mathematical operations are conducted in R). To get the desired two columns, use j:(j+1) instead.
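A quick illustration of that precedence rule:

j <- 3
j:j+1    # parsed as (j:j) + 1, i.e. the single number 4
j:(j+1)  # the two desired indices: 3 4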
EDIT:
When loading adehabitat, I was warned to "Be careful" and to use the related new packages, among which is adehabitatHR, which also contains a kernel.area function. This function has slightly different syntax and behavior, but perhaps it would be worth examining. Using adehabitatHR (I had to install from source, since the package is not available for R 2.15.0), I was able to do the following.
library(adehabitatHR)

for (j in seq(1, ncol(df3) - 1, 2)) {
  kud <- kernelUD(SpatialPoints(df3[, j:(j+1)]), kern = "bivnorm")
  kernAr <- kernel.area(kud, unin = c("m"), unout = c("km2"))
  print(kernAr)
}

detach(package:adehabitatHR, unload = TRUE)
This prints something, and as is mentioned in a comment below, kernelUD() is called before kernel.area().
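Toward the stated end goal of combining the results in a data frame, a sketch (my addition, untested against adehabitatHR's exact return types) that collects each resampling event's areas instead of printing them:

res <- list()
for (j in seq(1, ncol(df3) - 1, 2)) {
  kud <- kernelUD(SpatialPoints(df3[, j:(j+1)]), kern = "bivnorm")
  res[[length(res) + 1]] <- kernel.area(kud, unin = c("m"), unout = c("km2"))
}
# one column per resampling event, one row per home-range percentage level
kernAr.df <- as.data.frame(do.call(cbind, res))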
