Partitioning Data creates unexpected results - r

I am trying to partition my data to a 60% Training and 40% Test Set using the following code.
split <- sample.split(divdat, SplitRatio = 0.6)
split
train.div <- subset(divdat, split == "TRUE")
test.div <- subset(divdat, split == "FALSE")
However, this code splits my data as if it were 50/50. I have 200 observations and I get 100 observations in each set. Any ideas what I am doing wrong here?

The function sample.split does not split by row; it splits a vector of labels. Change the first argument of sample.split to the column where you store the labels, and you will get a 60/40 ratio of training/test sets, e.g.
library(caTools)
divdat <- data.frame(id = 1:10, chars = letters[1:10], labels = c("X", "Y"))
split <- sample.split(divdat$labels, SplitRatio = 0.6)
train.div <- subset(divdat, split == "TRUE")
test.div <- subset(divdat, split == "FALSE")
train.div
test.div
Output:
> train.div
id chars labels
2 2 b Y
3 3 c X
5 5 e X
6 6 f Y
9 9 i X
10 10 j Y
> test.div
id chars labels
1 1 a X
4 4 d Y
7 7 g X
8 8 h Y
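If you only need a plain row-wise 60/40 split and do not care about keeping the label proportions, a minimal base R sketch could look like the following. The 200-row data frame and its column names are made up for illustration; only sample() and ordinary indexing are used.

```r
set.seed(1)  # only to make the illustration reproducible

# Hypothetical stand-in for the asker's data: 200 observations
divdat <- data.frame(id = 1:200, value = rnorm(200))

# Draw 60% of the row numbers at random, without replacement
n_train   <- floor(0.6 * nrow(divdat))
train_idx <- sample(seq_len(nrow(divdat)), size = n_train)

train.div <- divdat[train_idx, ]   # the sampled rows
test.div  <- divdat[-train_idx, ]  # everything else

nrow(train.div)
nrow(test.div)
```

Unlike sample.split, this does not try to preserve the ratio of labels in the two sets.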

Related

Removing points in a vector based on another

I've two vectors with data x and y. Let's say the first one is the distance and the second the temperature.
How can I remove from both x and y all points whose distance is lower than a constant distance 'd' between two consecutive points (xi - xi-1)?
x <- c(1, 2, 3, 8, 12)
y <- c(10, 12, 11, 9, 12)
Remove points with a distance smaller than 5:
x = 1, 2 (out as 2-1 < 5), 3 (out as 3-1 < 5), 8, 12 (fine as it is the last point, even though 12-8 < 5)
which should give
x = (1, 8, 12)
y = (10, 9, 12)
Here is one idea assuming that your first and last elements are never removed,
v1 <- setNames(x, y)[c(TRUE, (diff(x) >= 5)[-(length(x)-1)], TRUE)]
#10 9 12
# 1 8 12
#To make it a bit more clear on how the named vector is structured (still a vector)
names(v1)
#[1] "10" "9" "12" <- Note: I get 9 whereas you get 11
unname(v1)
#[1] 1 8 12
Or you can make it a function,
rm_elements <- function(x, y, n){
v1 <- setNames(x, y)[c(TRUE, (diff(x) >= n)[-(length(x)-1)], TRUE)]
return(list(x = unname(v1), y = as.numeric(names(v1))))
}
rm_elements(x, y, 5)
#$x
#[1] 1 8 12
#$y
#[1] 10 9 12
EDIT: To accommodate your comment about having them in a data frame, we can alter the function a bit to accept a data frame (no matter how you name the variables, as long as the distance is the first column) and return a subset of that data frame, i.e.
rm_elements <- function(df, n){
v1 <- df[c(TRUE, (diff(df[[1]]) >= n)[-(nrow(df)-1)], TRUE),]
return(v1)
}
#Make a data frame from the vectors,
d1 <- data.frame(x=x, y=y)
rm_elements(d1, 5)
which gives,
x y
1 1 10
4 8 9
5 12 12
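Note that the example in the question measures the distance from the last point that was kept (3 - 1), while diff() compares each point with its immediate neighbour. The two give the same result for this data but can differ in general. If the "last kept point" rule is what is wanted, a rough sketch with an explicit loop might look like this (keep_spaced is a made-up name, and it assumes x is sorted and not empty):

```r
# Keep a point only if it is at least d away from the last point kept;
# the first and last points are always kept.
keep_spaced <- function(x, y, d) {
  keep <- logical(length(x))
  keep[1] <- TRUE
  last_kept <- x[1]
  for (i in seq_along(x)[-1]) {
    if (x[i] - last_kept >= d) {
      keep[i] <- TRUE
      last_kept <- x[i]
    }
  }
  keep[length(x)] <- TRUE  # the final point stays regardless of distance
  list(x = x[keep], y = y[keep])
}

x <- c(1, 2, 3, 8, 12)
y <- c(10, 12, 11, 9, 12)
res <- keep_spaced(x, y, 5)
res$x
# [1]  1  8 12
res$y
# [1] 10  9 12
```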

Run a separate function for each item depending on the value of another variable with dplyr

I have a dataset which contains a categorical variable. Depending on the value of this variable, I want to run a different function for each such value. All the possible functions have the same return type. I might wish to run say, sin() if category is 'A', cos() if category is 'B', and tan() if category is 'C'.
The real application for this is in simulating populations, where outcomes depend on the values of categories, but sometimes in very different ways.
Toy example
library(dplyr)
category=c('A','B','C')
N <- 100
pop <- as.data.frame(ID <- seq(1:N))
pop <- as.tbl(pop)
pop$Category <- sample(category,N,replace=TRUE)
pop$score <- runif(N)
pop
tf <- function(x,EXPR) {
switch(EXPR,
A = cos(x),
B = sin(x),
C = tan(x)
)}
pop$results <- tf(pop$Score,pop$Category)
This code fails, reasonably enough, with the error message
Error in switch(EXPR, A = cos(x), B = sin(x), C = tan(x)) : EXPR must be a length 1 vector
I have looked, carefully, at dplyr and do, and I can easily see how to run the same function for each category separately. However, I need a function which depends on the category value.
Suggestions greatly appreciated.
The rowwise function is what you need to force it to evaluate row by row:
library(dplyr)
pop <- data.frame(ID = 1:100,
category = sample(c("A", "B", "C"), 100, replace = TRUE),
score = runif(100))
# choose the function to apply from the category of a single row
exprs <- function(category, score){
if(category == "A")
ret <- sin(score)
if(category == "B")
ret <- cos(score)
if(category == "C")
ret <- tan(score)
ret }
pop %>%
rowwise() %>%
mutate(answer = exprs(category, score))
Source: local data frame [100 x 4]
Groups:
# A tibble: 100 × 4
ID category score answer
<int> <fctr> <dbl> <dbl>
1 1 C 0.5219332 0.5751317
2 2 C 0.9266336 1.3314972
3 3 B 0.2729260 0.9629863
4 4 B 0.6575110 0.7915158
5 5 B 0.0910481 0.9958580
6 6 C 0.9968752 1.5467554
7 7 A 0.3429183 0.3362369
8 8 A 0.9101669 0.7896062
9 9 B 0.9291849 0.5984872
10 10 C 0.8913347 1.2379742
# ... with 90 more rows
You can use Vectorize():
set.seed(42)
category=c('A','B','C')
N <- 10
pop <- data.frame(ID=seq(1:N), Category=sample(category,N,replace=TRUE), score=runif(N), stringsAsFactors = FALSE)
tf <- function(x, EXPR) switch(EXPR,
'A' = cos(x),
'B' = sin(x),
'C' = tan(x))
TF <- Vectorize(tf)
pop$result <- TF(pop$score, pop$Category)
or (thanks to @42 for the comment)
pop$result <- mapply(tf, pop$score, pop$Category)
The error appears because you are passing the complete vector to switch instead of one record at a time. I used lapply to call your function for each row, and it works:
library(dplyr)
category=c('A','B','C')
N <- 100
pop <- data.frame(ID = seq(1:N))
pop$Category <- sample(category,N,replace=TRUE)
pop$Category <- as.factor(pop$Category)
pop$score <- runif(N)
tf <- function(x,EXPR) {
switch(EXPR,
A = cos(x),
B = sin(x),
C = tan(x)
)}
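## call tf for every row in the data frame
## (as.character() because Category is a factor, which switch would otherwise treat as an integer)
pop$results <- lapply(seq_len(nrow(pop)), function(i) {
tf(pop$score[i], as.character(pop$Category[i]))
}) %>% unlist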
Thanks
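Another option that avoids going row by row is to keep the functions in a named list and make one vectorised call per category. This is only a sketch: the list name funs and the small 10-row data frame are made up for illustration, and the A/B/C mapping follows the tf function from the question.

```r
set.seed(42)
pop <- data.frame(ID = 1:10,
                  Category = sample(c("A", "B", "C"), 10, replace = TRUE),
                  score = runif(10),
                  stringsAsFactors = FALSE)

# One function per category value (hypothetical lookup table)
funs <- list(A = cos, B = sin, C = tan)

pop$result <- NA_real_
for (cat_name in names(funs)) {
  rows <- pop$Category == cat_name
  # apply this category's function to all of its rows at once
  pop$result[rows] <- funs[[cat_name]](pop$score[rows])
}
pop
```

With many rows and few categories this makes far fewer function calls than rowwise, Vectorize or mapply, at the cost of a short loop over the categories.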

Loop a sequence in R (standardize and winsorize dataframe)

I'm trying to loop this sequence of steps in r for a data frame.
Here is my data:
ID Height Weight
a 100 80
b 80 90
c na 70
d 120 na
....
Here is my code so far
winsorize2 <- function(x) {
Min <- which(x == min(x))
Max <- which(x == max(x))
ord <- order(x)
x[Min] <- x[ord][length(Min)+1]
x[Max] <- x[ord][length(x)-length(Max)]
x}
df<-read.csv("data.csv")
df2 <- scale(df[,-1], center = TRUE, scale = TRUE)
id<-df$Type
full<-data.frame(id,df2)
full[is.na(full)] <- 0
full[, -1] <- sapply(full[,-1], winsorize2)
What I'm trying to do is this: standardize a data frame, then winsorize the standardized data frame using the function winsorize2, i.e. replace the most extreme values with the second most extreme value. This is then repeated 10 times. How do I write a loop for this? I'm confused because in the sequence I've already replaced the NAs with 0s, so should I remove this step from the loop too?
Edit: After discussion with @ekstroem, we decided to change the code to introduce the boundaries:
df<-read.csv("data.csv")
id<-df$Type
df2<- scale(df[,-1], center = TRUE, scale = TRUE)
df2[is.na(df2)] <- 0
df2[df2<=-3] = -3
df2[df2>=3] = 3
df3<-df2 #trying to loop again
df3<- scale(df3, center = TRUE, scale = TRUE)
df3[is.na(df3)] <- 0
df3[df3<=-3] = -3
df3[df3>=3] = 3
There are some boundary issues that are not fully specified in your code, but maybe the following can be used (it uses base R and is not super efficient):
wins2 <- function(x, n=1) {
xx <- sort(unique(x))
x[x<=xx[n]] <- xx[n+1]
x[x>=xx[length(xx)-n]] <- xx[length(xx)-n]
x
}
This yields:
x <- 1:11
wins2(x,1)
[1] 2 2 3 4 5 6 7 8 9 10 10
wins2(x,3)
[1] 4 4 4 4 5 6 7 8 8 8 8
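As for repeating the steps 10 times, a for loop around the standardize-and-clip steps from the question's edit might be enough. The sketch below is an assumption about what is wanted: standardize_and_clip, times and bound are made-up names, and the small data frame only imitates the Height/Weight columns.

```r
# Repeat: standardize each column, set NAs to 0, clip to [-bound, bound]
standardize_and_clip <- function(m, times = 10, bound = 3) {
  m <- as.matrix(m)
  for (i in seq_len(times)) {
    m <- scale(m, center = TRUE, scale = TRUE)
    m[is.na(m)] <- 0       # does nothing after the first pass unless a column is constant
    m[m < -bound] <- -bound
    m[m > bound] <- bound
  }
  m
}

# Hypothetical numeric columns (the ID column is left out, as in df[,-1])
df <- data.frame(Height = c(100, 80, NA, 120, 500),
                 Weight = c(80, 90, 70, NA, 85))
result <- standardize_and_clip(df, times = 10)
result
```

Because the NAs are replaced in the first pass, keeping that line inside the loop is harmless; moving it out of the loop would be a matter of taste.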

How to split into train and test data ensuring same combinations of factors are present in both train and test?

Is there a way to split the data into train and test such that all combinations of categorical predictors in the test data are present in the training data? If it is not possible to split the data given the proportions specified for the test and train sizes, then those levels should not be included in the test data.
Say I have data like this:
SAMPLE_DF <- data.frame("FACTOR1" = c(rep(letters[1:2], 8), "g", "g", "h", "i"),
"FACTOR2" = c(rep(letters[3:5], 2,), rep("z", 3), "f"),
"response" = rnorm(10,10,1),
"node" = c(rep(c(1,2),5)))
> SAMPLE_DF
FACTOR1 FACTOR2 response node
1 a c 10.334690 1
2 b d 11.467605 2
3 a e 8.935463 1
4 b c 10.253852 2
5 a d 11.067347 1
6 b e 10.548887 2
7 a z 10.066082 1
8 b z 10.887074 2
9 a z 8.802410 1
10 b f 9.319187 2
11 a c 10.334690 1
12 b d 11.467605 2
13 a e 8.935463 1
14 b c 10.253852 2
15 a d 11.067347 1
16 b e 10.548887 2
17 g z 10.066082 1
18 g z 10.887074 2
19 h z 8.802410 1
20 i f 9.319187 2
In the test data, if there were a combination of FACTOR 1 and 2 of a c then this would also be in the train data. The same goes for all other possible combinations.
createDataPartition does this for one level, but I would like it for all levels.
You could try the following: use dplyr to remove the combinations that appear only once (and would therefore end up in only the training set or only the test set), and then use createDataPartition to make the split:
Data
SAMPLE_DF <- data.frame("FACTOR1" = rep(letters[1:2], 10),
"FACTOR2" = c(rep(letters[3:5], 2,), rep("z", 4)),
"num_pred" = rnorm(10,10,1),
"response" = rnorm(10,10,1))
Below you use dplyr to count the number of the combinations of factor1 and factor2. If any of those are 1 then you filter them out:
library(dplyr)
mydf <-
SAMPLE_DF %>%
mutate(all = paste(FACTOR1,FACTOR2)) %>%
group_by(all) %>%
summarise(total=n()) %>%
filter(total>=2)
The above only keeps combinations of factor1 and factor2 that appear at least twice.
You then keep only the rows of SAMPLE_DF whose combination was retained above:
SAMPLE_DF2 <- SAMPLE_DF[paste(SAMPLE_DF$FACTOR1,SAMPLE_DF$FACTOR2) %in% mydf$all,]
And finally you let createDataPartition do the split for you:
library(caret)
IND_TRAIN <- createDataPartition(paste(SAMPLE_DF2$FACTOR1,SAMPLE_DF2$FACTOR2))$Resample1
#train set
A <- SAMPLE_DF2[ IND_TRAIN,]
#test set
B <- SAMPLE_DF2[-IND_TRAIN,]
>identical(sort(paste(A$FACTOR1,A$FACTOR2)) , sort(paste(B$FACTOR1,B$FACTOR2)))
[1] TRUE
As you can see at the identical line, the combinations are exactly the same!
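If you would rather keep the singleton combinations in the training set instead of dropping them, the same idea can be sketched in base R by sampling within each combination and always leaving at least one row behind for training. split_by_combo, cols and p_train are made-up names for this illustration, not part of caret.

```r
# Split so that every factor combination in the test set also occurs in train.
# Combinations with a single row stay in the training set.
split_by_combo <- function(df, cols, p_train = 0.75) {
  combo <- interaction(df[cols], drop = TRUE)
  idx_by_combo <- split(seq_len(nrow(df)), combo)
  test_idx <- unlist(lapply(idx_by_combo, function(idx) {
    # never move every row of a combination to the test set
    n_test <- min(length(idx) - 1, round((1 - p_train) * length(idx)))
    if (n_test < 1) return(integer(0))
    idx[sample.int(length(idx), n_test)]
  }), use.names = FALSE)
  train_idx <- setdiff(seq_len(nrow(df)), test_idx)
  list(train = df[train_idx, , drop = FALSE],
       test  = df[test_idx, , drop = FALSE])
}

set.seed(1)
SAMPLE_DF <- data.frame(
  FACTOR1 = c(rep(letters[1:2], 8), "g", "g", "h", "i"),
  FACTOR2 = rep(c(rep(letters[3:5], 2), rep("z", 3), "f"), 2),
  response = rnorm(20, 10, 1))
parts <- split_by_combo(SAMPLE_DF, c("FACTOR1", "FACTOR2"), p_train = 0.5)

# every combination in test should also be present in train
all(paste(parts$test$FACTOR1, parts$test$FACTOR2) %in%
    paste(parts$train$FACTOR1, parts$train$FACTOR2))
```

The overall train/test proportion is only approximate here, because the rounding happens inside each combination; with very small groups the test set can end up smaller than requested.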
This is a function I have put together that splits a data frame into train and test sets (in whatever proportion the user likes) such that every factor variable (column) has the same levels present in both sets.
getTrainAndTestSamples_BalancedFactors <- function(data, percentTrain = 0.75, inSequence = F, seed = 0){
set.seed(seed) # Set Seed so that same sample can be reproduced in future also
sample <- NULL
train <- NULL
test<- NULL
listOfFactorsAndTheirLevels <- lapply(Filter(is.factor, data), summary)
factorContainingOneElement <- sapply(listOfFactorsAndTheirLevels, function(x){any(x==1)})
if (any(factorContainingOneElement))
warning("This dataframe cannot be reliably split into sets that contain Factors equally represented. At least one factor has a level that occurs only once.")
else {
# Repeat loop until all Factor variables have same levels on both train and test set
repeat{
# Now Selecting 'percentTrain' of data as sample from total 'n' rows of the data
if (inSequence)
sample <- 1:floor(percentTrain * nrow(data))
else
sample <- sample.int(n = nrow(data), size = floor(percentTrain * nrow(data)), replace = F)
train <- data[sample, ] # create train set
test <- data[-sample, ] # create test set
train_factor_only <- Filter(is.factor, train) # df containing only 'train' Factors as columns
test_factor_only <- Filter(is.factor, test) # df containing only 'test' Factors as columns
haveFactorsWithExistingLevels <- NULL
for (i in seq_len(ncol(train_factor_only))){ # for each column (i.e. factor variable)
names_train <- levels(droplevels(train_factor_only[[i]])) # levels actually present in this factor column for 'train' (summary() would also list levels with zero rows)
names_test <- levels(droplevels(test_factor_only[[i]])) # levels actually present in this factor column for 'test'
symmetric_diff <- union(setdiff(names_train,names_test), setdiff(names_test,names_train)) # get the symmetric difference between the two factor columns (from 'train' and 'test')
if (length(symmetric_diff) == 0) # if no elements in the symmetric difference set then it means that both have the same levels present at least once
haveFactorsWithExistingLevels <- c(haveFactorsWithExistingLevels, TRUE) # append logic TRUE
else # if some elements in the symmetric difference set then it means that one of the two sets (train, test) has levels the other doesn't and it will eventually flag up when using function predict()
haveFactorsWithExistingLevels <- c(haveFactorsWithExistingLevels, FALSE) # append logic FALSE
}
if(all(haveFactorsWithExistingLevels))
break # break out of the repeat loop because we found a split that has factor levels existing in both 'train' and 'test' sets, for all Factor variables
}
}
return (list( "train" = train, "test" = test))
}
Use like:
df <- getTrainAndTestSamples_BalancedFactors(some_dataframe)
df$train # this is your train set
df$test # this is your test set
...and no more annoying errors from R!
This can be definitely improved and I am looking forward to more efficient ways of doing this in the comments below, however, one can just use the code as it is.
Enjoy!
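Whichever way you split, it can be worth checking the result before calling predict(). A small helper along these lines might do; levels_covered is a made-up name, and it only looks at columns that are factors in the training set.

```r
# TRUE if every factor level that occurs in 'test' also occurs in 'train'
levels_covered <- function(train, test) {
  factor_cols <- names(Filter(is.factor, train))
  all(vapply(factor_cols, function(col) {
    all(unique(as.character(test[[col]])) %in%
        unique(as.character(train[[col]])))
  }, logical(1)))
}

train    <- data.frame(f = factor(c("a", "b", "a")), v = 1:3)
test_ok  <- data.frame(f = factor("a", levels = c("a", "b")), v = 4L)
test_bad <- data.frame(f = factor("c"), v = 5L)

levels_covered(train, test_ok)   # TRUE
levels_covered(train, test_bad)  # FALSE
```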

How to multiply across columns in dataset and save to new dataset without changing original data in R

I want to multiply all the values in columns e.g. by 5, and then save the results into a new dataset, without changing the data being read in.
Using a loop I use the following R code:
raw_data[,i]<-raw_data[,i]*5
What I want is to keep the original data as it is, raw_data, and save the multiplied data into e.g. new_data:
new_data[,i]<-raw_data[,i]*5
I get an error saying the object 'new_data' is not found.
Is there a neat way of doing this, or do you have to create the new_data object first as an empty dataset?
No need for loops here.
# a toy data frame
raw_data <- data.frame(x = 1:2, y = 3:4, z = 5:6)
# same applies if you have your data in a matrix
# raw_data <- matrix(1:6, ncol = 3)
raw_data
# x y z
# 1 1 3 5
# 2 2 4 6
new_data <- raw_data * 5
new_data
# x y z
# 1 5 15 25
# 2 10 20 30
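If the data also has non-numeric columns (an ID column, say), multiplying the whole data frame will not work as is. One possible sketch is to copy the data first and multiply only the numeric columns; the id column here is made up for illustration.

```r
# toy data with one non-numeric column
raw_data <- data.frame(id = c("a", "b"), x = 1:2, y = 3:4)

new_data <- raw_data                       # a copy; raw_data stays untouched
num_cols <- vapply(raw_data, is.numeric, logical(1))
new_data[num_cols] <- raw_data[num_cols] * 5

new_data
#   id  x  y
# 1  a  5 15
# 2  b 10 20
```

This also answers the "object 'new_data' is not found" part: the object has to exist before you can assign into its columns, and copying raw_data is the simplest way to create it.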
