0 observations of N variables - r

I have a dataset, which looks like:
y Age Height
0 Aage Aheight
1 Bage Bheight
All variables are divided into at least two categories.
When I open dataset with the code:
DM_input = read.csv(file="C:/Users/user/Desktop/test.CSV",header = TRUE, sep = ",")
R correctly shows: 5040 observations of 11 variables.
When I try to break down dataset into test and train with the following code:
> train <- DM_input[DM_input$rand <= 0.7, c(2,3,4,5,6,7,8,9,10)]
> test <- DM_input[DM_input$rand > 0.7, c(2,3,4,5,6,7,8,9,10)]
I get 0 observations out of 11 variables, and the tables are empty.
I do not understand why that is happening, I removed special characters - it did not help.
Thanks

I think that sample.int could help you to break down dataset.
Here is a example:
data(iris)
# number of rows of dataset
size_iris <- nrow(iris)
# set the proportion of sample split to 0.7
size_sample <- floor(0.7*size_iris)
# set a reproducible random result
set.seed(2020)
# sample the dataset
mysample <- sample.int(n=size_iris, size=size_sample, replace=F)
train <- iris[mysample,]
test <- iris[-mysample,]
# checking sizes
size_iris
[1] 150
nrow(train)
[1] 105
nrow(test)
[1] 45
There is a similar question here with a lot of good answers: How to split data into training/testing sets using sample function

Related

Creating a for-loop to store LDA misclassification rates

I have a dataset of 104 samples (2 classes) and 182 variables. I am to carry out LDA on the dataset. My strategy involves first carrying out PCA in order to reduce dimensionality; this leaves me with 104 PCs. Now, what I want to do is carry out LDA on the PCs. I want to carry it out first where the number of PCs equal to 1, and store the misclassification rates into a data frame object. I will then do the same for 2, 3 and so on until ~50 PCs; the number is not important. I have created a for-loop to try solve this but I end up with a data frame where the only row is the final value I choose for my PCs. Here is the code I have so far:
# required packages
library(MASS)
library(class)
library(tidyverse)
# reading in and cleaning data
og_data <- read.csv("data.csv")
og_data <- og_data[, -1]
og_data$tumour <- unclass(as.factor(og_data$tumour))
# standardizing
st_data <- as.data.frame(cbind(og_data[, 1], scale(og_data[, -1])))
colnames(st_data)[1] <- "tumour"
# PCA for dimension reduction
k=10 # this is for the for-loop
grouping <- c(rep(1, 62), rep(2, 42)) # a vector denoting the true class of the samples
pca <- prcomp(st_data[, -1])
df_misclassification <- tibble(i=as.numeric(),
misclassification_rate_1=as.numeric(),
misclassification_rate_2=as.numeric())
for (i in k){
a <- as.data.frame(pca$x[, 1:i])
b <- lda(a, grouping=grouping, CV=TRUE)
c <- table(list(predicted=b$class, observed=grouping)) # confusion matrix
d <- t(as.data.frame(diag(c) / rowSums(c))) # misclassification rate for each class
df_misclassification <- df_misclassification %>%
add_row(i=i,
misclassification_rate_1=d[, 1],
misclassification_rate_2=d[, 2])
}
Running the above for k=10 leaves me with the following data frame:
# A tibble: 1 x 3
i misclassification_rate_1 misclassification_rate_2
<dbl> <dbl> <dbl>
1 10 0.952 0.951
I would like the table to have 10 rows, one for each number of PCs used. There is some overwriting in the for-loop but I have no idea how to fix this. Any help would be much appreciated. Thank you.
My for-loop was wrong. It should have been for (i in 1:k).

How to capture the most important variables in Bootstrapped models in R?

I have several models that I would like to compare their choices of important predictors over the same data set, Lasso being one of them. The data set I am using consists of census data with around a thousand variables that have been renamed to "x1", "x2" and so on for convenience sake (The original names are extremely long). I would like to report the top features then rename these variables with a shorter more concise name.
My attempt to solve this is by extracting the top variables in each iterated model, put it into a list, then finding the mean of the top variables in X amount of loops. However, my issue is I still find variability with the top 10 most used predictors and so I cannot manually alter the variable names as each run on the code chunk yields different results. I suspect this is because I have so many variables in my analysis and due to CV causing the creation of new models every bootstrap.
For the sake of a simple example I used mtcars and will look for the top 3 most common predictors due to only having 10 variables in this data set.
library(glmnet)
data("mtcars") # Base R Dataset
df <- mtcars
topvar <- list()
for (i in 1:100) {
# CV and Splitting
ind <- sample(nrow(df), nrow(df), replace = TRUE)
ind <- unique(ind)
train <- df[ind, ]
xtrain <- model.matrix(mpg~., train)[,-1]
ytrain <- df[ind, 1]
test <- df[-ind, ]
xtest <- model.matrix(mpg~., test)[,-1]
ytest <- df[-ind, 1]
# Create Model per Loop
model <- glmnet(xtrain, ytrain, alpha = 1, lambda = 0.2)
# Store Coeffecients per loop
coef_las <- coef(model, s = 0.2)[-1, ] # Remove intercept
# Store all nonzero Coefficients
topvar[[i]] <- coef_las[which(coef_las != 0)]
}
# Unlist
varimp <- unlist(topvar)
# Count all predictors
novar <- table(names(varimp))
# Find the mean of all variables
meanvar <- tapply(varimp, names(varimp), mean)
# Return top 3 repeated Coefs
repvar <- novar[order(novar, decreasing = TRUE)][1:3]
# Return mean of repeated Coefs
repvar.mean <- meanvar[names(repvar)]
repvar
Now if you were to rerun the code chunk above you would notice that the top 3 variables change and so if I had to rename these variables it would be difficult to do if they are not constant and changing every run. Any suggestions on how I could approach this?
You can use function set.seed() to ensure your sample will return the same sample each time. For example
set.seed(123)
When I add this to above code and then run twice, the following is returned both times:
wt carb hp
98 89 86

R: Testing each level of a factor without creating new variables

Suppose I have a data frame with a binary grouping variable and a factor. An example of such a grouping variable could specify assignment to the treatment and control conditions of an experiment. In the below, b is the grouping variable while a is an arbitrary factor variable:
a <- c("a","a","a","b","b")
b <- c(0,0,1,0,1)
df <- data.frame(a,b)
I want to complete two-sample t-tests to assess the below:
For each level of a, whether there is a difference in the mean propensity to adopt that level between the groups specified in b.
I have used the dummies package to create separate dummies for each level of the factor and then manually performed t-tests on the resulting variables:
library(dummies)
new <- dummy.data.frame(df, names = "a")
t.test(new$aa, new$b)
t.test(new$ab, new$b)
I am looking for help with the following:
Is there a way to perform this without creating a large number of dummy variables via dummy.data.frame()?
If there is not a quicker way to do it without creating a large number of dummies, is there a quicker way to complete the t-test across multiple columns?
Note
This is similar to but different from R - How to perform the same operation on multiple variables and nearly the same as this question Apply t-test on many columns in a dataframe split by factor but the solution of that question no longer works.
Here is a base R solution implementing a chi-squired test for equality of proportions, which I believe is more likely to answer whatever question you're asking of your data (see my comment above):
set.seed(1)
## generate similar but larger/more complex toy dataset
a <- sample(letters[1:4], 100, replace = T)
b <- sample(0:1, 10, replace = T)
head((df <- data.frame(a,b)))
a b
1 b 1
2 b 0
3 c 0
4 d 1
5 a 1
6 d 0
## create a set of contingency tables for proportions
## of each level of df$a to the others
cTbls <- lapply(unique(a), function(x) table(df$a==x, df$b))
## apply chi-squared test to each contingency table
results <- lapply(cTbls, prop.test, correct = FALSE)
## preserve names
names(results) <- unique(a)
## only one result displayed for sake of space:
results$b
2-sample test for equality of proportions without continuity
correction
data: X[[i]]
X-squared = 0.18382, df = 1, p-value = 0.6681
alternative hypothesis: two.sided
95 percent confidence interval:
-0.2557295 0.1638177
sample estimates:
prop 1 prop 2
0.4852941 0.5312500
Be aware, however, that is you might not want to interpret your p-values without correcting for multiple comparisons. A quick simulation demonstrates that the chance of incorrectly rejecting the null hypothesis with at least one of of your tests can be dramatically higher than 5%(!) :
set.seed(11)
sum(
replicate(1e4, {
a <- sample(letters[1:4], 100, replace = T)
b <- sample(0:1, 100, replace = T)
df <- data.frame(a,b)
cTbls <- lapply(unique(a), function(x) table(df$a==x, df$b))
results <- lapply(cTbls, prop.test, correct = FALSE)
any(lapply(results, function(x) x$p.value < .05))
})
) / 1e4
[1] 0.1642
I dont exactly understand what this is doing from a statistical standpoint, but this code generates a list where each element is the output from the t.test() you run above:
a <- c("a","a","a","b","b")
b <- c(0,0,1,0,1)
df <- data.frame(a,b)
library(dplyr)
library(tidyr)
dfNew<-df %>% group_by(a) %>% summarise(count = n()) %>% spread(a, count)
lapply(1:ncol(dfNew), function (x)
t.test(c(rep(1, dfNew[1,x]), rep(0, length(b)-dfNew[1,x])), b))
This will save you the typing of t.test(foo, bar) continuously, and also eliminates the need for dummy variables.
Edit: I dont think the above method preserves the order of the columns, only the frequency of values measured as 0 or 1. If the order is important (again, I dont know the goal of this procedure) then you can use the dummy method and lapply through the data.frame you named new.
library(dummies)
new <- dummy.data.frame(df, names = "a")
lapply(1:(ncol(new)-1), function(x)
t.test(new[,x], new[,ncol(new)]))

stratified splitting the data

I have a large data set and like to fit different logistic regression for each City, one of the column in my data. The following 70/30 split works without considering City group.
indexes <- sample(1:nrow(data), size = 0.7*nrow(data))
train <- data[indexes,]
test <- data[-indexes,]
But this does not guarantee the 70/30 split for each city.
lets say that I have City A and City B, where City A has 100 rows, and City B has 900 rows, totaling 1000 rows. Splitting the data with above code will give me 700 rows for train and 300 for test data, but it does not guarantee that i will have 70 rows for City A, and 630 rows for City B in the train data. How do i do that?
Once i have the training data split-ed to 70/30 fashion for each city,i will run logistic regression for each city ( I know how to do this once i have the train data)
Try createDataPartition from caret package. Its document states: By default, createDataPartition does a stratified random split of the data.
library(caret)
train.index <- createDataPartition(Data$Class, p = .7, list = FALSE)
train <- Data[ train.index,]
test <- Data[-train.index,]
it can also be used for stratified K-fold like:
ctrl <- trainControl(method = "repeatedcv",
repeats = 3,
...)
# when calling train, pass this train control
train(...,
trControl = ctrl,
...)
check out caret document for more details
The package splitstackshape has a nice function stratified which can do this as well, but this is a bit better than createDataPartition because it can use multiple columns to stratify at once. It can be used with one column like:
library(splitstackshape)
set.seed(42) # good idea to set the random seed for reproducibility
stratified(data, c('City'), 0.7)
Or with multiple columns:
stratified(data, c('City', 'column2'), 0.7)
The typical way is with split
lapply( split(dfrm, dfrm$City), function(dd){
indexes= sample(1:nrow(dd), size = 0.7*nrow(dd))
train= dd[indexes, ] # Notice that you may want all columns
test= dd[-indexes, ]
# analysis goes here
}
If you were to do it in steps as you attempted above it would be like this:
cities <- split(data,data$city)
idxs <- lapply(cities, function (d) {
indexes <- sample(1:nrow(d), size=0.7*nrow(d))
})
train <- data[ idxs[[1]], ] # for the first city
test <- data[ -idxs[[1]], ]
I happen to think the is the clumsy way to do it, but perhaps breaking it down into small steps will let you examine the intermediate values.
Your code works just fine as is, if City is a column, simply run training data as train[,2]. You can do this easily for each one with a lambda function
logReg<-function(ind) {
reg<-glm(train[,ind]~WHATEVER)
....
return(val) }
Then run sapply over the vector of city indexes.
Another possible way, similar to IRTFMs answer (e.g., using only base-r) is to use the following. Note that this answer returns a stratified index, which can be used like the index calculated in the question.
p <- 0.7
strats <- your_data$the_stratify_variable
rr <- split(1:length(strats), strats)
idx <- sort(as.numeric(unlist(sapply(rr, function(x) sample(x, length(x) * p)))))
train <- your_data[idx, ]
test <- your_data[-idx, ]
Example:
p <- 0.7
strats <- mtcars$cyl
rr <- split(1:length(strats), strats)
idx <- sort(as.numeric(unlist(sapply(rr, function(x) sample(x, length(x) * p)))))
train <- mtcars[idx, ]
test <- mtcars[-idx, ]
table(mtcars$cyl) / nrow(mtcars)
#> 4 6 8
#> 0.34375 0.21875 0.43750
table(train$cyl) / nrow(train)
#> 4 6 8
#> 0.35 0.20 0.45
table(test$cyl) / nrow(test)
#> 4 6 8
#> 0.3333333 0.2500000 0.4166667
We see that all datasets all (mtcars), train, and test have roughly the same class distributions!

How to view the nearest neighbors in R?

Let me start by saying I have no experience with R, KNN or data science in general. I recently found Kaggle and have been playing around with the Digit Recognition competition/tutorial.
In this tutorial they provide some sample code to get you started with a basic submission:
# makes the KNN submission
library(FNN)
train <- read.csv("c:/Development/data/digits/train.csv", header=TRUE)
test <- read.csv("c:/Development/data/digits/test.csv", header=TRUE)
labels <- train[,1]
train <- train[,-1]
results <- (0:9)[knn(train, test, labels, k = 10, algorithm="cover_tree")]
write(results, file="knn_benchmark.csv", ncolumns=1)
My questions are:
How can I view the nearest neighbors that have been selected for a
particular test row?
How can I modify which of those ten is selected
for my results?
These questions may be too broad. If so, I would welcome any links that could point me down the right road.
It is very possible that I have said something that doesn't make sense here. If this is the case, please correct me.
1) You can get the nearest neighbors of a given row like so:
k <- knn(train, test, labels, k = 10, algorithm="cover_tree")
indices <- attr(k, "nn.index")
Then if you want the indices of the 10 nearest neighbors to row 20 in the training set:
print(indices[20, ])
(You'll get the 10 nearest neighbors because you selected k=10). For example, if you run with only the first 1000 rows of the training and testing set (to make it computationally easier):
train <- read.csv("train.csv", header=TRUE)[1:1000, ]
test <- read.csv("test.csv", header=TRUE)[1:1000, ]
labels <- train[,1]
train <- train[,-1]
k <- knn(train, test, labels, k = 10, algorithm="cover_tree")
indices = attr(k, "nn.index")
print(indices[20, ])
# output:
# [1] 829 539 784 487 293 882 367 268 201 277
Those are the indices within the training set of 1000 that are closest to the 20th row of the test set.
2) It depends what you mean by "modify". For starters, you can get the indices of each of the 10 closest labels to each row like this:
closest.labels = apply(indices, 2, function(col) labels[col])
You can then see the labels of the 10 closest points to the 20th training point like this:
closest.labels[20, ]
# [1] 0 0 0 0 0 0 0 0 0 0
This indicates that all 10 of the closest points to row 20 are all in the group labeled 0. knn simply chooses the label by majority vote (with ties broken at random), but you could choose some kind of weighting scheme if you prefer.
ETA: If you're interested in weighting the closer elements more heavily in your voting scheme, note that you can also get the distances to each of the k neighbors like this:
dists = attr(k, "nn.dist")
dists[20, ]
# output:
# [1] 1238.777 1243.581 1323.538 1398.060 1503.371 1529.660 1538.128 1609.730
# [9] 1630.910 1667.014

Resources