I've built a predictive model that uses a large number (30 or so) of independent factor variables. As the dataset I'm using is much larger than the RAM of my machine, I have sampled it for both my training and test sets.
I am now looking to use the model to make predictions over the entire dataset. I'm pulling in the dataset 1 million rows at a time, and each time, I find new levels for some of my factor variables that were not in my training and test set, therefore preventing the model from making predictions.
As there are so many independent factor variables (and so many overall observations), correcting each case by hand is becoming a real pain.
One additional wrinkle to be aware of: there is no guarantee that the order of the variables in the overall dataframe matches the order in the training/test sets, as I do pre-processing on the data that changes their order.
As such, I'd like to write a function that:
1. Selects and sorts the columns of the new data based on the configuration of my sampled dataframe.
2. Loops through the sampled and new dataframes and designates as Other all factor levels in the new dataframe that do not exist in their corresponding column in the sample dataframe.
3. If a factor level exists in my sample but not the new dataframe, adds that level (with no observations assigned to it) to its corresponding column in the new dataframe.
I've got #1 together, but don't know the best way to do #2 and #3. If it were any other language, I'd use for loops, but I know that's frowned upon in R.
Here's a reproducible example:
sampleData <- data.frame(abacus=factor(c("a","b","a","a","a")), montreal=factor(c("f","f","f","f","a")), boston=factor(c("z","y","z","z","q")))
dataset <- data.frame(florida=factor(c("e","q","z","d","b", "a")), montreal=factor(c("f","f","f","f","a", "a")), boston=factor(c("m","y","z","z","r", "f")), abacus=factor(c("a","b","z","a","a", "g")))
sampleData
abacus montreal boston
1 a f z
2 b f y
3 a f z
4 a f z
5 a a q
dataset
florida montreal boston abacus
1 e f m a
2 q f y b
3 z f z z
4 d f z a
5 b a r a
6 a a f g
sampleData <- sampleData[, order(names(sampleData))]
dataset <- dataset[, order(names(dataset))]
dataset <- dataset[, colnames(sampleData)]
Below is what I would want dataset to look like once this function is complete (I don't really care about the final ordering of the columns in dataset; I'm just assuming it's necessary for the loop, or whatever you deem best, to work). Notice that the column dataset$florida is omitted:
dataset
montreal boston abacus
1 f Other a
2 f y b
3 f z Other
4 f z a
5 a Other a
6 a Other Other
Also note that in dataset the 'q' level for boston does not appear, although it does appear in sampleData. If 'q' were simply omitted from the factor in dataset, the levels of the two columns would differ; so in dataset we need boston to include the level q, but with no actual observations assigned to it.
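For a single column, adding such an empty level could look like the sketch below (just an illustration of the requirement, not the solution I'm after):
# add the level "q" to dataset$boston without assigning any observations to it
dataset$boston <- factor(dataset$boston, levels = union(levels(dataset$boston), "q"))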
Last, note that as I'm doing this on 30 variables at a time, I need a programmatic solution and not one that reassigns factors by using explicit column names.
This seems like it might work.
From this function, the new levels returned for the boston column are Other y z q, even though there are no values for the level q. Regarding your comment in the original question, the only way I've found to effectively apply new factor levels is also with a for loop like yours, and it has worked well for me so far.
A function, findOthers():
findOthers <- function(newData)  ## might want a second argument for sampleData
{
    ## take only those columns that are in 'sampleData'
    dset <- newData[, names(sampleData)]
    ## change the 'dset' columns to character
    dsetvals <- sapply(dset, as.character)
    ## change the 'sampleData' levels to character
    samplevs <- sapply(sampleData, function(y) as.character(levels(y)))
    ## find the unmatched elements
    others <- sapply(seq(ncol(dset)), function(i){
        !(dsetvals[, i] %in% samplevs[[i]])
    })
    ## change the unmatched elements to 'Other'
    dsetvals[others] <- "Other"
    ## create new data frame
    newDset <- data.frame(dsetvals)
    ## get the new levels for each column
    newLevs <- lapply(seq(newDset), function(i){
        unique(c(as.character(newDset[[i]]), as.character(samplevs[[i]])))
    })
    ## set the new levels for each column
    for(i in seq(newDset)) newDset[, i] <- factor(newDset[, i], newLevs[[i]])
    ## result
    newDset
}
Your sample data:
sampleData <- data.frame(abacus=factor(c("a","b","a","a","a")),
montreal=factor(c("f","f","f","f","a")),
boston=factor(c("z","y","z","z","q")))
dataset <- data.frame(florida=factor(c("e","q","z","d","b", "a")),
montreal=factor(c("f","f","f","f","a", "a")),
boston=factor(c("m","y","z","z","r", "f")),
abacus=factor(c("a","b","z","a","a", "g")))
Call findOthers() and view the result with the new factor levels:
(new <- findOthers(newData = dataset))
# abacus montreal boston
# 1 a f Other
# 2 b f y
# 3 Other f z
# 4 a f z
# 5 a a Other
# 6 Other a Other
as.list(new)
# $abacus
# [1] a b Other a a Other
# Levels: a b Other
#
# $montreal
# [1] f f f f a a
# Levels: f a
#
# $boston
# [1] Other y z z Other Other
# Levels: Other y z q ## note the new level 'q', with no value in the column
To answer just the question you asked (rather than suggest what you might do instead): here we have to make each column character, do the replacement, then re-factorise.
sampleData[] <- lapply(sampleData, as.character)                      # lapply keeps the data frame structure
sampleData[] <- lapply(sampleData, function(x) gsub("q", "Other", x)) # replace the level everywhere
sampleData[] <- lapply(sampleData, factor)                            # re-factorise
This depends on "q" only inhabiting one column. Otherwise you just have to edit each column separately to get only the changes you want:
sampleData[] <- lapply(sampleData, as.character)
sampleData$boston <- gsub("q", "Other", sampleData$boston)
sampleData[] <- lapply(sampleData, factor)
However, I think you should just filter these rows out of the train and test data; they are so few that they will make absolutely no difference to your model. Otherwise you're making things difficult for yourself.
summary(dataset)
dataset <- dataset[dataset$abacus!="z", ]
If the dataset is very, very large, you may want to do this filtering with something like the dplyr package and its filter() function.
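For example, a minimal dplyr sketch of the same row filter as above:
library(dplyr)
# keep only the rows whose abacus value is not the unseen level "z"
dataset <- dataset %>% filter(abacus != "z")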
Does this accomplish what you want?
# Select and sort the columns of dataset as in sampleData
sampleData <- sampleData[, order(names(sampleData))]
dataset <- dataset[, colnames(sampleData)]
f <- function(dataset, sampleData, col) {
    # For a given column col, assign "Other" to all factor levels
    # in dataset[col] that do not exist in sampleData[col].
    # If a factor level exists in sampleData[col] but not in dataset[col],
    # preserve it as a factor level.
    v <- factor(dataset[, col], levels = c(levels(sampleData[, col]), "Other"))
    v[is.na(v)] <- "Other"
    v
}
# Apply f to all columns of dataset
l <- lapply(colnames(dataset), function(x) f(dataset, sampleData, x))
res <- data.frame(l) # Format into a data frame
colnames(res) <- colnames(dataset) # Assign the names of dataset
dataset <- res # Assign the result to dataset
You can test as follows:
> dataset[, "boston"]
[1] Other y z z Other Other
Levels: q y z Other
> dataset[, "montreal"]
[1] f f f f a a
Levels: a f Other
> dataset[, "abacus"]
[1] a b Other a a Other
Levels: a b Other
Related
I have a dataset in which a factor variable contains 140 levels; however, I only need 80 levels, randomly selected. Is there any R function or script that can help me with this task?
You can do this in base R:
# reproducible dataset
set.seed(1)
nlevels <- 5
nkeep <- 3
string <- letters[1:5]
string <- sample(string, nlevels*2, replace = TRUE)
string <- as.factor(string)
string
[1] a d a b e c b c c a
# possible solution
keep <- sample(levels(string), nkeep)
string[string %in% keep]
[1] a a b e b a
Levels: a b c d e
Take nkeep levels at random and keep only the corresponding values. Use the function droplevels() afterwards if needed.
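For example, continuing the toy data above:
kept <- string[string %in% keep]
droplevels(kept)  # removes the now-unused levels (here c and d) from the factor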
Say that your variable is named x and is a factor with 140 levels. You could randomly select 80 levels as follows:
y = factor(x, sample(levels(x), 80))
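Note that values whose level was not among the 80 sampled become NA in y; a small illustration with toy names standing in for the real variable:
xx <- factor(letters[1:6])              # toy stand-in for the 140-level factor
yy <- factor(xx, sample(levels(xx), 3)) # keep 3 of the 6 levels
yy                                      # values in the dropped levels are now NA
yy[!is.na(yy)]                          # discard them if only the kept levels are wanted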
Is there a way to split the data into train and test such that all combinations of categorical predictors in the test data are present in the training data? If it is not possible to split the data given the proportions specified for the test and train sizes, then those levels should not be included in the test data.
Say I have data like this:
SAMPLE_DF <- data.frame("FACTOR1" = c(rep(letters[1:2], 8), "g", "g", "h", "i"),
                        "FACTOR2" = c(rep(letters[3:5], 2), rep("z", 3), "f"),
                        "response" = rnorm(10,10,1),
                        "node" = c(rep(c(1,2),5)))
> SAMPLE_DF
FACTOR1 FACTOR2 response node
1 a c 10.334690 1
2 b d 11.467605 2
3 a e 8.935463 1
4 b c 10.253852 2
5 a d 11.067347 1
6 b e 10.548887 2
7 a z 10.066082 1
8 b z 10.887074 2
9 a z 8.802410 1
10 b f 9.319187 2
11 a c 10.334690 1
12 b d 11.467605 2
13 a e 8.935463 1
14 b c 10.253852 2
15 a d 11.067347 1
16 b e 10.548887 2
17 g z 10.066082 1
18 g z 10.887074 2
19 h z 8.802410 1
20 i f 9.319187 2
If the test data contained, say, the FACTOR1/FACTOR2 combination a c, then that combination would also be present in the train data. The same goes for all other possible combinations.
createDataPartition does this for one level, but I would like it for all levels.
You could try the following: use dplyr to remove the combinations that appear only once (and therefore would end up in only the training or the test set), and then use createDataPartition to make the split:
Data
SAMPLE_DF <- data.frame("FACTOR1" = rep(letters[1:2], 10),
                        "FACTOR2" = c(rep(letters[3:5], 2), rep("z", 4)),
                        "num_pred" = rnorm(10,10,1),
                        "response" = rnorm(10,10,1))
Below, dplyr is used to count the occurrences of each combination of FACTOR1 and FACTOR2; any combination that appears only once is filtered out:
library(dplyr)
mydf <-
SAMPLE_DF %>%
mutate(all = paste(FACTOR1,FACTOR2)) %>%
group_by(all) %>%
summarise(total=n()) %>%
filter(total>=2)
The above keeps only the combinations of FACTOR1 and FACTOR2 that appear at least twice.
Next, remove from SAMPLE_DF the rows whose combination was not kept:
SAMPLE_DF2 <- SAMPLE_DF[paste(SAMPLE_DF$FACTOR1,SAMPLE_DF$FACTOR2) %in% mydf$all,]
And finally you let createDataPartition do the split for you:
library(caret)
IND_TRAIN <- createDataPartition(paste(SAMPLE_DF2$FACTOR1,SAMPLE_DF2$FACTOR2))$Resample
#train set
A <- SAMPLE_DF2[ IND_TRAIN,]
#test set
B <- SAMPLE_DF2[-IND_TRAIN,]
> identical(sort(paste(A$FACTOR1, A$FACTOR2)), sort(paste(B$FACTOR1, B$FACTOR2)))
[1] TRUE
As you can see from the identical() check, the combinations are exactly the same!
This is a function I put together that splits a dataframe into train and test sets (with a user-specified training percentage) such that every factor variable (column) has the same levels present in both sets.
getTrainAndTestSamples_BalancedFactors <- function(data, percentTrain = 0.75, inSequence = FALSE, seed = 0){
    set.seed(seed) # Set seed so that the same sample can be reproduced in future
    sample <- NULL
    train <- NULL
    test <- NULL
    listOfFactorsAndTheirLevels <- lapply(Filter(is.factor, data), summary)
    factorContainingOneElement <- sapply(listOfFactorsAndTheirLevels, function(x){any(x == 1)})
    if (any(factorContainingOneElement))
        warning("This dataframe cannot be reliably split into sets in which every factor level is represented: at least one factor has a level with only a single observation.")
    else {
        # Repeat until all factor variables have the same levels in both the train and test set
        repeat{
            # Select 'percentTrain' of the data as the sample from the total 'n' rows
            if (inSequence)
                sample <- 1:floor(percentTrain * nrow(data))
            else
                sample <- sample.int(n = nrow(data), size = floor(percentTrain * nrow(data)), replace = FALSE)
            train <- data[sample, ]  # create train set
            test <- data[-sample, ]  # create test set
            train_factor_only <- Filter(is.factor, train) # df containing only the 'train' factor columns
            test_factor_only <- Filter(is.factor, test)   # df containing only the 'test' factor columns
            haveFactorsWithExistingLevels <- NULL
            for (i in seq_len(ncol(train_factor_only))){ # for each factor column; seq_len() so every column is checked, not just the last
                names_train <- names(summary(train_factor_only[, i])) # names of all levels present in this factor column of 'train'
                names_test <- names(summary(test_factor_only[, i]))   # names of all levels present in this factor column of 'test'
                symmetric_diff <- union(setdiff(names_train, names_test), setdiff(names_test, names_train)) # symmetric difference between the two factor columns (from 'train' and 'test')
                if (length(symmetric_diff) == 0) # an empty symmetric difference means both sets have the same levels present at least once
                    haveFactorsWithExistingLevels <- c(haveFactorsWithExistingLevels, TRUE)
                else # otherwise one of the two sets (train, test) has levels the other doesn't, which will eventually trip up predict()
                    haveFactorsWithExistingLevels <- c(haveFactorsWithExistingLevels, FALSE)
            }
            if (all(haveFactorsWithExistingLevels))
                break # we found a split in which every factor level occurs in both 'train' and 'test'
        }
    }
    return(list("train" = train, "test" = test))
}
Use like:
df <- getTrainAndTestSamples_BalancedFactors(some_dataframe)
df$train # this is your train set
df$test # this is your test set
...and no more annoying errors from R!
This can definitely be improved, and I am looking forward to more efficient ways of doing this in the comments below; however, one can just use the code as it is.
Enjoy!
I have a huge data frame and I am stuck with an if-style condition. Let me first present a simple example and then lay out my problem:
z <- c(0,1,2,3,4,5)
y <- c(2,2,2,3,3,3)
a <- c(1,1,1,2,2,2)
x <- data.frame(z,y,a)
Problem: I want to sum the values of column z over the rows that share the same y and a, but only if the second row of each such group has z equal to 1.
I am sorry but I am quite new in R so not able to present any reasonable codes which I have done by my own.
Any help would be highly appreciated.
As mentioned, your problem isn't clearly stated.
Perhaps you are looking to do something like this:
x$new <- with(x, ave(z, y, a, FUN = function(k)
ifelse(k[2] == 1, sum(k), NA)))
x
# z y a new
# 1 0 2 1 3
# 2 1 2 1 3
# 3 2 2 1 3
# 4 3 3 2 NA
# 5 4 3 2 NA
# 6 5 3 2 NA
Here, I've created a new column "new" which sums the values of "z" grouped by "y" and "a", but only if the second value in the group is equal to 1.
Since you say your data frame is quite large, you might want to convert it to a data.table object using the data.table package. You will likely find that the required operations are much faster if you have a great many rows. However, constructing the code for your case is not straightforward with data.table.
If I understand what you want to do (which is not entirely clear to me), you could try the following:
library(data.table)
z <- c(0,1,2,3,4,5)
y <- c(2,2,2,3,3,3)
a <- c(1,1,1,2,2,2)
x <- data.frame(z,y,a)
xx <- as.data.table(x) # Make a data.table object
setkey(xx, z) # Make the z column a key
xx[J(1), sum(a)] # Sum all values in column a where the key z = 1
[1] 1
# Now try the other sum you mention
xx[, sum(z), by = list(z = y)] # A column sum over groups defined by z = y
z V1
1: 2 2
2: 3 3
sum(xx[, sum(z), by = list(z = y)][, V1]) # Summing over the sums for each group should do it
[1] 5
To create the sum over the column a where z = 1, I made the z column a key. The syntax xx[J(1), sum(a)] sums a where the key value (the z value) is 1.
I can create groups in the data.table with by, which is analogous to a SQL GROUP BY clause if you are familiar with SQL. However, the result is the sum of the column z for each of the groups created. This may be inefficient if you have a great many possible matching values where z = y. The outer sum adds up the per-group sums in the sub-selected V1 column of the inner result.
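If what is wanted is the grouped conditional sum described in the question (one reading of it, at least), a more direct sketch with by would be:
xt <- as.data.table(x)
# within each (y, a) group: if the second z value equals 1, give every row the group's sum of z, otherwise NA
xt[, new := if (z[2] == 1) sum(z) else NA_real_, by = list(y, a)]
xt
This reproduces the 3 / NA column from the ave() answer above.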
If you are going to use data.table in a serious way, study the informative vignettes available for that package.
M Dowle, T Short, S Lianoglou, A Srinivasan with contributions from R Saporta and E Antonyan (2014). data.table: Extensions of data.frame. R package version 1.9.2. http://CRAN.R-project.org/package=data.table
I have two data sets, one for student level data and another one for class level data. Student and class level IDs are generated as string values like:
Student data set:
student ID ->141PSDM2L,1420CHY1L,1JNLV36HH,1MNSBXUST,2K7EVS7X6,2N2SC26HL,...
class ID ->XK37HDN,XK37HDN,XK37HDN,3K3EH77,3K3EH77,2K36HN6,...
class level data set:
class ID ->XK37HDN,3K3EH77,2K36HN6,3K3LHSH,3K3LHSY,DK3EH14,DK3EH1H,DK3EH1K,...
In the student data set, each class ID is repeated once for every student in the class, but in the class level data set each class has only one row.
How can I convert those IDs into integers, for both the student level and class level data? In other words, I want to have IDs as below (or something similar):
Student data set:
student ID ->1,2,3,4,5,6,...
class ID ->1,1,1,2,2,3,...
class level data set:
class ID ->1,2,3,4,5,6,7,8,...
EDIT:
Conversion of the student level data is not difficult. The problem arises when I want to convert the class level data. Because of the repetition of class IDs in the student data set, the class IDs there take values from 1 to 1533, but applying the same conversion to the class level data produces values from 1 to 896, so I don't know whether, for example, class ID 45 in the student level data refers to the same class as class ID 45 in the class level data set.
Assuming that your studentID and classID are factors, I would use the fact that internally these are stored numerically. Hence if you can get the levels the same on both factors (i.e. in the same order, such that identical(levels(f1), levels(f2)) is TRUE), then you can simply coerce to integers.
I was thinking something along the lines of:
## dummy data first
set.seed(1)
df1 <- data.frame(f1 = sample(letters, 100, replace = TRUE),
f2 = sample(LETTERS, 100, replace = TRUE,
prob = rep(c(0.25, 0.75), length = 26)))
df2 <- with(df1, data.frame(f2 = sample(factor(unique(f2),
levels = sample(unique(f2)))),
vals = rnorm(length(unique(f2)))))
Note the levels of the factors are not identical even though there is a match between the data (given the way I generated them)
> identical(with(df1, levels(f2)), with(df2, levels(f2)))
[1] FALSE
Now make the levels identical; here I just take the union in case some values appear in one factor but not the other, and vice versa.
## make levels identical
levs <- sort(union(with(df1, levels(f2)), with(df2, levels(f2))))
df1 <- transform(df1, f2 = factor(f2, levels = levs))
df2 <- transform(df2, f2 = factor(f2, levels = levs))
> identical(with(df1, levels(f2)), with(df2, levels(f2)))
[1] TRUE
Now recode to numeric
## recode as numeric
df1b <- transform(df1, f2int = as.numeric(f2))
df2b <- transform(df2, f2int = as.numeric(f2))
> head(df1b)
f1 f2 f2int
1 g B 2
2 j D 4
3 o R 17
4 x A 1
5 f F 6
6 x J 10
> head(df2b)
f2 vals f2int
1 Z -0.17955653 23
2 U -0.10019074 20
3 N 0.71266631 13
4 J -0.07356440 10
5 B -0.03763417 2
6 X -0.68166048 22
Notice the f2int values in both tables for f2 equal to B or J.
My point in the comments about merge() was that if you want to match the tables, you can do the usual database joins using merge(). E.g.:
> head(merge(df1, df2, sort = FALSE))
f2 f1 vals
1 B g -0.03763417
2 B v -0.03763417
3 B u -0.03763417
4 B e -0.03763417
5 B w -0.03763417
6 D i -0.58889449
which would avoid the potentially error-prone step of getting the levels in order and converting to integers, if this was the ultimate aim.
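Applied to the IDs in the question, the same idea might look like this sketch (studentData, classData and classID are assumed names standing in for the real objects):
# harmonise the class ID levels across both data sets, then coerce to integer codes
levs <- sort(union(levels(studentData$classID), levels(classData$classID)))
studentData$classID_int <- as.integer(factor(studentData$classID, levels = levs))
classData$classID_int <- as.integer(factor(classData$classID, levels = levs))
# a given integer now refers to the same class in both data sets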
I use ddply to summarize a data.frame by various categories, like this:
# with both group and size being factors / categorical
split.df <- ddply(mydata,.(group,size),summarize,
sumGroupSize = sum(someValue))
This works smoothly, but often I like to calculate ratios, which implies that I need to divide by the group's total. How can I calculate such a total within the same ddply call?
Let's say I'd like to have the share of observations in group A that are in size class 1. Obviously I have to calculate the sum of all observations in size class 1 first.
Sure I could do this with two ddply calls, but using all one call would be more comfortable. Is there a way to do so?
EDIT:
I did not mean to ask an overly specific question, but I realize I was confusing people here. So here's my specific problem. In fact, I do have an example that works, but I don't consider it really nifty. Plus, it has a shortcoming that I need to overcome: it does not work correctly with apply.
library(plyr)
# make the dataset more "realistic"
mydata <- warpbreaks
names(mydata) <- c("someValue","group","size")
mydata$category <- c(1,2,3)
mydata$categoryA <- c("A","A","X","X","Z","Z")
# add some NA
mydata$category[c(8,10,19)] <- NA
mydata$categoryA[c(14,1,20)] <- NA
# someValue is summarized!
# note we have another, varying category, hence the 'a' parameter
calcShares <- function(a, data) {
    # !is.na needs to be specific!
    tempres1 <- eval(substitute(ddply(data[!is.na(a),], .(group,size,a), summarize,
                                      sumTest = sum(someValue, na.rm=T))),
                     envir=data, enclos=parent.frame())
    tempres2 <- eval(substitute(ddply(data[!is.na(a),], .(group,size), summarize,
                                      sumTestTotal = sum(someValue, na.rm=T))),
                     envir=data, enclos=parent.frame())
    res <- merge(tempres1, tempres2, by=c("group","size"))
    res$share <- res$sumTest/res$sumTestTotal
    return(res)
}
test <- calcShares(category,mydata)
test2 <- calcShares(categoryA,mydata)
head(test)
head(test2)
As you can see, I intend to run this over different categorical variables. In the example I have only two (category, categoryA), but in fact I have more, so using apply with my function would be really nice; somehow, though, it does not work correctly.
applytest <- head(apply(mydata[grep("^cat",
names(mydata),value=T)],2,calcShares,data=mydata))
... returns a warning message and a strange name (newX[, i]) for the category variable.
So how can I do THIS a) more elegantly and b) fix the apply issue?
This seems simple, so I may be missing some aspect of your question.
First, define a function that calculates the values you want inside each level of group. Then, instead of using .(group, size) to split the data.frame, use .(group), and apply the newly defined function to each of the split pieces.
library(plyr)
# Create a dataset with the names in your example
mydata <- warpbreaks
names(mydata) <- c("someValue", "group", "size")
# A function that calculates the proportional contribution of each size class
# to the sum of someValue within a level of group
getProps <- function(df) {
with(df, ave(someValue, size, FUN=sum)/sum(someValue))
}
# The call to ddply()
res <- ddply(mydata, .(group),
.fun = function(X) transform(X, PROPS=getProps(X)))
head(res, 12)
# someValue group size PROPS
# 1 26 A L 0.4785203
# 2 30 A L 0.4785203
# 3 54 A L 0.4785203
# 4 25 A L 0.4785203
# 5 70 A L 0.4785203
# 6 52 A L 0.4785203
# 7 51 A L 0.4785203
# 8 26 A L 0.4785203
# 9 67 A L 0.4785203
# 10 18 A M 0.2577566
# 11 21 A M 0.2577566
# 12 29 A M 0.2577566
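As for the apply() issue mentioned in the question: apply() coerces the data frame to a matrix and passes each column to the function as the unevaluated expression newX[, i], which is where the strange column name comes from. A sketch that loops over the category columns by name with lapply() instead, sidestepping calcShares()'s substitute() machinery (column names as in the question, plyr already loaded):
catVars <- grep("^cat", names(mydata), value = TRUE)
shareList <- lapply(catVars, function(v) {
  dat <- mydata[!is.na(mydata[[v]]), ]
  per <- ddply(dat, c("group", "size", v), summarize,
               sumTest = sum(someValue, na.rm = TRUE))
  tot <- ddply(dat, c("group", "size"), summarize,
               sumTestTotal = sum(someValue, na.rm = TRUE))
  res <- merge(per, tot, by = c("group", "size"))
  res$share <- res$sumTest / res$sumTestTotal
  res
})
names(shareList) <- catVars  # one result per category column, keeping its proper name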