I would like to keep only the top 2 factor levels by frequency and group all other levels into Other. I tried this, but it doesn't work.
df=data.frame(a=as.factor(c(rep('D',3),rep('B',5),rep('C',2))),
b=as.factor(c(rep('A',5),rep('B',5))),
c=as.factor(c(rep('A',3),rep('B',5),rep('C',2))))
myfun=function(x){
if(is.factor(x)){
levels(x)[!levels(x) %in% names(sort(table(x),decreasing = T)[1:2])]='Others'
}
}
df=as.data.frame(lapply(df, myfun))
Expected Output
a b c
D A A
D A A
D A A
B A B
B A B
B B B
B B B
B B B
others B others
others B others
This might get a bit messy, but here is one approach via base R,
fun1 <- function(x){levels(x) <-
c(names(sort(table(x), decreasing = TRUE)[1:2]),
rep('others', length(levels(x))-2));
return(x)}
However, the levels in the function above first need to be reordered by frequency. As the OP states in a comment, the correct version is:
fun1 <- function(x){ x=factor(x,
levels = names(sort(table(x), decreasing = TRUE)));
levels(x) <- c(names(sort(table(x), decreasing = TRUE)[1:2]),
rep('others', length(levels(x))-2));
return(x) }
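As a rough sketch of how this could be applied to every factor column at once, the helper below relabels all levels outside the n most frequent ones and then returns the factor, which is the step the attempt in the question leaves out. The name lump_top and its arguments n and other are made up for illustration.

```r
# Example data from the question
df <- data.frame(a = factor(c(rep("D", 3), rep("B", 5), rep("C", 2))),
                 b = factor(c(rep("A", 5), rep("B", 5))),
                 c = factor(c(rep("A", 3), rep("B", 5), rep("C", 2))))

# Hypothetical helper: keep the n most frequent levels, merge the rest
lump_top <- function(x, n = 2, other = "others") {
  if (!is.factor(x)) return(x)                 # leave non-factor columns alone
  keep <- head(names(sort(table(x), decreasing = TRUE)), n)
  levels(x)[!levels(x) %in% keep] <- other     # relabelled levels are merged
  x                                            # return the modified factor
}

df2 <- as.data.frame(lapply(df, lump_top))
as.character(df2$a)
# "D" "D" "D" "B" "B" "B" "B" "B" "others" "others"
```

Assigning the same label to several levels merges them, so no explicit reordering of the levels is needed in this variant.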
This is now easy thanks to fct_lump() from the forcats package.
library(forcats)
fct_lump(df$a, n = 2)
# [1] D D D B B B B B Other Other
# Levels: B D Other
The argument n controls the number of most common levels to be preserved, lumping together the others.
Given a factor vector:
vec <- factor(c('a','b','b','c','b','c'))
[1] a b b c b c
Levels: a b c
I would expect a new vector:
vec_new
[1] 3 1 1 2 1 2
The level with the highest frequency should be converted to the smallest integer.
Any help is appreciated, thank you
x2 <- rev(sort(table(x)))
names(x2) <- names(sort(table(x)))
levels(x) <- x2[order(names(x2))]
x
[1] 3 1 1 2 1 2
Levels: 3 1 2
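We first sort the frequency table from smallest to largest with sort(table(x)) and reverse it with rev(), so the counts run from largest to smallest. Next we give that reversed vector the names of the smallest-to-largest table, which pairs the rarest level with the largest value and the most frequent level with the smallest. Lastly, we assign the new levels in alphabetical order of the names. Note that the new labels are the counts themselves; they coincide with the ranks here only because the counts happen to be 1, 2 and 3.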
Another option courtesy of @RichardScriven:
s <- sort(table(x))
x <- factor(vec, labels = rev(s), levels = names(s))
Data
vec <- letters[c(1,2,2,3,2,3)]
x <- factor(vec)
[1] a b b c b c
Levels: a b c
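As a sketch of a variant that does not depend on the counts happening to equal the ranks, each element can be looked up in the list of levels ordered by frequency with match(). The variable names by_freq, vec2 and rank2 are illustrative only.

```r
vec <- factor(c("a", "b", "b", "c", "b", "c"))

# Levels ordered from most to least frequent
by_freq <- names(sort(table(vec), decreasing = TRUE))

# Position of each element's level in that ordering is its frequency rank
vec_new <- match(as.character(vec), by_freq)
vec_new
# 3 1 1 2 1 2

# Counts of 1, 5 and 2 (rather than 1, 2, 3) still give ranks 3, 1, 2
vec2 <- factor(c("a", "b", "b", "b", "b", "b", "c", "c"))
rank2 <- match(as.character(vec2),
               names(sort(table(vec2), decreasing = TRUE)))
rank2
# 3 1 1 1 1 1 2 2
```

Levels with tied counts get consecutive ranks in whatever order sort() leaves them, which may or may not be what is wanted.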
Just to throw in another one-liner:
as.numeric(reorder(vec, -ave(as.numeric(vec), vec, FUN = length)))
# [1] 3 1 1 2 1 2
First, you calculate the frequency of each element's level with ave, negated so that the ordering comes out right afterwards; then you reorder the factor levels with reorder. The latter computes the mean of -ave(.) for each level and sorts the levels by it in increasing order (that's why we used -ave(.)). Finally, convert the factor to numeric to obtain its integer codes.
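Splitting the one-liner into its steps may make the intermediate values easier to follow. This is only an unrolled version of the same calls; the names freq and re are illustrative.

```r
vec <- factor(c("a", "b", "b", "c", "b", "c"))

# Step 1: per-element frequency of that element's level
freq <- ave(as.numeric(vec), vec, FUN = length)
freq
# 1 3 3 2 3 2

# Step 2: reorder levels by the mean of -freq (most frequent level first)
re <- reorder(vec, -freq)
levels(re)
# "b" "c" "a"

# Step 3: the integer codes now reflect the frequency rank
as.numeric(re)
# 3 1 1 2 1 2
```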
Not sure if there's a more efficient approach, but you can find out how frequently the different levels of the factor occur with table(vec), and then manually reorder the levels with vec <- factor(vec, levels = c("b", "c", "a")). (Assigning with levels(vec) <- c("b", "c", "a") would rename the levels rather than reorder them.)
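A small sketch of the difference between the two ways of setting levels by hand: assigning to levels() relabels the existing levels and so changes the data, whereas factor(..., levels = ) keeps the values and only changes their order. The names renamed and reordered are illustrative.

```r
vec <- factor(c("a", "b", "b", "c", "b", "c"))

# levels<- relabels in place: a becomes b, b becomes c, c becomes a
renamed <- vec
levels(renamed) <- c("b", "c", "a")
as.character(renamed)
# "b" "c" "c" "a" "c" "a"

# factor(levels = ) reorders the levels and leaves the values untouched
reordered <- factor(vec, levels = c("b", "c", "a"))
as.character(reordered)
# "a" "b" "b" "c" "b" "c"
as.integer(reordered)
# 3 1 1 2 1 2
```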
In R, my data:
a <- c('1','2','3','1','1')
b <- c('3','1','2','1','2')
j <- data.frame(a,b)
rowSums(j) #error
How can I calculate the sum of each row?
In case you have real character vectors (not factors like in your example) you can use data.matrix in order to convert all the columns to numeric class
j <- data.frame(a, b, stringsAsFactors = FALSE)
rowSums(data.matrix(j))
## [1] 4 3 5 2 3
Otherwise, you will have to convert first to character and then to numeric in order to not lose information
rowSums(sapply(j, function(x) as.numeric(as.character(x))))
## [1] 4 3 5 2 3
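To illustrate why the detour through character matters, here is a small sketch with a made-up factor f whose labels do not line up with its internal codes, followed by a column-wise conversion of the question's data built with explicit factors. The names f and j_num are illustrative.

```r
# A factor stores integer codes for its levels, not the numbers on the labels
f <- factor(c("10", "20", "5", "10"))
as.numeric(f)                  # integer codes of the levels, not 10, 20, 5, 10
as.numeric(as.character(f))    # the values printed on the labels
# 10 20 5 10

# The question's data with explicit factor columns
j <- data.frame(a = factor(c("1", "2", "3", "1", "1")),
                b = factor(c("3", "1", "2", "1", "2")))

# Convert every column via character, keeping a data frame
j_num <- as.data.frame(lapply(j, function(x) as.numeric(as.character(x))))
rowSums(j_num)
# 4 3 5 2 3
```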
I've built a predictive model that uses a large number (30 or so) of independent factor variables. As the dataset I'm using is much larger than the RAM of my machine, I have sampled it for both my training and test sets.
I am now looking to use the model to make predictions over the entire dataset. I'm pulling in the dataset 1 million rows at a time, and each time, I find new levels for some of my factor variables that were not in my training and test set, therefore preventing the model from making predictions.
As there are so many independent factor variables (and so many overall observations), correcting each case by hand is becoming a real pain.
One additional wrinkle to be aware of: there is no guarantee that the order of variables in the overall dataframe and the training/test sets are the same, as I do pre-processing on the data that changes their order.
As such, I'd like to write a function that:
1. Selects and sorts the columns of the new data based on the configuration of my sampled dataframe.
2. Loops through the sampled and new dataframes and designates all factor levels in the new dataframe that do not exist in their corresponding column in the sample dataframe as Other.
3. If a factor level exists in my sample but not in the new dataframe, creates the level (with no observations assigned to it) in its corresponding column in the new dataframe.
I've got #1 together, but don't know the best way to do #2 and #3. If it were any other language, I'd use for loops, but I know that's frowned upon in R.
Here's a reproducible example:
sampleData <- data.frame(abacus=factor(c("a","b","a","a","a")), montreal=factor(c("f","f","f","f","a")), boston=factor(c("z","y","z","z","q")))
dataset <- data.frame(florida=factor(c("e","q","z","d","b", "a")), montreal=factor(c("f","f","f","f","a", "a")), boston=factor(c("m","y","z","z","r", "f")), abacus=factor(c("a","b","z","a","a", "g")))
sampleData
abacus montreal boston
1 a f z
2 b f y
3 a f z
4 a f z
5 a a q
dataset
florida montreal boston abacus
1 e f m a
2 q f y b
3 z f z z
4 d f z a
5 b a r a
6 a a f g
sampleData <- sampleData[,order(names(sampleData))]
dataset <- dataset[,order(names(dataset))]
dataset <- dataset[,colnames(sampleData)]
Below is what I would want dataset to look like once this function is complete (I don't really care about the final ordering of the columns in dataset; I'm just thinking it's necessary for the loop, or whatever you guys deem best, to work). Notice that the column dataset$florida is omitted:
dataset
montreal boston abacus
1 f Other a
2 f y b
3 f z Other
4 f z a
5 a Other a
6 a Other Other
Also note that in dataset, the 'q' level for boston does not appear, although it does appear in sampleData. Therefore, the levels will differ if we omit 'q' from the factor in dataset, meaning that in 'dataset', we need boston to include the level q, but to have no actual observations assigned to it.
Last, note that as I'm doing this on 30 variables at a time, I need a programmatic solution and not one that reassigns factors by using explicit column names.
This seems like it might work.
From this function, the new levels returned for the boston column are Other y z q, even though there are no values for the level q. Regarding your comment in the original question, the only way I've found to effectively apply new factor levels is also with a for loop like you, and it's worked well for me so far.
A function, findOthers() :
findOthers <- function(newData) ## might want a second argument for sampleData
{
## take only those columns that are in 'sampleData'
dset <- newData[, names(sampleData)]
## change the 'dset' columns to character
dsetvals <- sapply(dset, as.character)
## change the 'sampleData' levels to character (lapply always returns a list,
## even when every column has the same number of levels)
samplevs <- lapply(sampleData, function(y) as.character(levels(y)))
## find the unmatched elements
others <- sapply(seq(ncol(dset)), function(i){
!(dsetvals[,i] %in% samplevs[[i]])
})
## change the unmatched elements to 'Other'
dsetvals[others] <- "Other"
## create new data frame
newDset <- data.frame(dsetvals)
## get the new levels for each column
newLevs <- lapply(seq(newDset), function(i){
Get <- c(as.character(newDset[[i]]), as.character(samplevs[[i]]))
ul <- unique(unlist(Get))
})
## set the new levels for each column
for(i in seq(newDset)) newDset[,i] <- factor(newDset[,i], newLevs[[i]])
## result
newDset
}
Your sample data :
sampleData <- data.frame(abacus=factor(c("a","b","a","a","a")),
montreal=factor(c("f","f","f","f","a")),
boston=factor(c("z","y","z","z","q")))
dataset <- data.frame(florida=factor(c("e","q","z","d","b", "a")),
montreal=factor(c("f","f","f","f","a", "a")),
boston=factor(c("m","y","z","z","r", "f")),
abacus=factor(c("a","b","z","a","a", "g")))
Call findOthers() and view the result with the new factor levels :
(new <- findOthers(newData = dataset))
# abacus montreal boston
# 1 a f Other
# 2 b f y
# 3 Other f z
# 4 a f z
# 5 a a Other
# 6 Other a Other
as.list(new)
# $abacus
# [1] a b Other a a Other
# Levels: a b Other
#
# $montreal
# [1] f f f f a a
# Levels: f a
#
# $boston
# [1] Other y z z Other Other
# Levels: Other y z q ## note the new level 'q', with no value in the column
To answer just the question you asked (rather than suggest what you might do instead): we have to make each column character, replace the value, then re-factorise.
sampleData[] = lapply(sampleData, as.character)
sampleData[] = lapply(sampleData, function(x) gsub("q", "other", x))
sampleData[] = lapply(sampleData, as.factor)
This depends on "q" only inhabiting one column. Otherwise you just have to edit each column separately to get only the changes you want:
sampleData$boston = as.character(sampleData$boston)
sampleData$boston = gsub("q", "other", sampleData$boston)
sampleData$boston = as.factor(sampleData$boston)
However, I think you should just filter these rows out of the train and test data: they are so few that they will make absolutely no difference to your model. Otherwise you're making it difficult.
summary(dataset)
dataset <- dataset[dataset$abacus!="z", ]
If the dataset is very large and that is what is holding you back from filtering, you may want to do this with something like the dplyr package and its filter function.
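The filtering idea can be written without naming individual columns. The following base R sketch keeps only the rows of dataset whose values all occur as levels in the matching column of sampleData; it is one possible reading of the suggestion above, and the names cols, seen, keep and filtered are illustrative.

```r
sampleData <- data.frame(abacus   = factor(c("a", "b", "a", "a", "a")),
                         montreal = factor(c("f", "f", "f", "f", "a")),
                         boston   = factor(c("z", "y", "z", "z", "q")))
dataset <- data.frame(florida  = factor(c("e", "q", "z", "d", "b", "a")),
                      montreal = factor(c("f", "f", "f", "f", "a", "a")),
                      boston   = factor(c("m", "y", "z", "z", "r", "f")),
                      abacus   = factor(c("a", "b", "z", "a", "a", "g")))

cols <- names(sampleData)

# One logical vector per column: is the value a known level in the sample?
seen <- lapply(cols, function(col) dataset[[col]] %in% levels(sampleData[[col]]))

# A row is kept only if every one of its values is known
keep <- Reduce(`&`, seen)

filtered <- droplevels(dataset[keep, cols])
filtered
```

On this tiny example only rows 2 and 4 survive, so whether dropping rows is acceptable depends on how rare the unseen levels are in the real data.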
Does this accomplish what you want?
# Select and sort the columns of dataset as in sampleData
sampleData <- sampleData[, order(names(sampleData))]
dataset <- dataset[, colnames(sampleData)]
f <- function(dataset, sampleData, col) {
# For a given column col, assign "Other" to all factor levels
# in dataset[col] that do not exist in sampleData[col].
# If a factor level exists in sampleData[col] but not in dataset[col],
# preserve it as a factor level.
v <- factor(dataset[, col], levels = c(levels(sampleData[, col]), "Other"))
v[is.na(v)] <- "Other"
v
}
# Apply f to all columns of dataset
l <- lapply(colnames(dataset), function(x) f(dataset, sampleData, x))
res <- data.frame(l) # Format into a data frame
colnames(res) <- colnames(dataset) # Assign the names of dataset
dataset <- res # Assign the result to dataset
You can test as follows
> dataset[, "boston"]
[1] Other y z z Other Other
Levels: q y z Other
> dataset[, "montreal"]
[1] f f f f a a
Levels: a f Other
> dataset[, "abacus"]
[1] a b Other a a Other
Levels: a b Other
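One thing to be aware of with the approach above is that values that are already NA in dataset also end up as "Other". If that distinction matters, a variant along the following lines could be used. This is a sketch only, and the function name align_levels and its arguments are hypothetical.

```r
sampleData <- data.frame(abacus   = factor(c("a", "b", "a", "a", "a")),
                         montreal = factor(c("f", "f", "f", "f", "a")),
                         boston   = factor(c("z", "y", "z", "z", "q")))
dataset <- data.frame(florida  = factor(c("e", "q", "z", "d", "b", "a")),
                      montreal = factor(c("f", "f", "f", "f", "a", "a")),
                      boston   = factor(c("m", "y", "z", "z", "r", "f")),
                      abacus   = factor(c("a", "b", "z", "a", "a", "g")))

# Hypothetical helper: give new_data the columns and levels of ref_data
align_levels <- function(new_data, ref_data, other = "Other") {
  out <- lapply(names(ref_data), function(col) {
    ref_levels <- levels(ref_data[[col]])
    vals <- as.character(new_data[[col]])
    # unseen, non-missing values become 'other'; genuine NAs stay NA
    vals[!is.na(vals) & !vals %in% ref_levels] <- other
    factor(vals, levels = union(ref_levels, other))
  })
  names(out) <- names(ref_data)
  data.frame(out, check.names = FALSE)
}

res <- align_levels(dataset, sampleData)
levels(res$boston)
# "q" "y" "z" "Other"
```

Columns come back in the order of ref_data, so the separate sorting step is not needed with this variant.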
I have two data sets, one for student level data and another one for class level data. Student and class level IDs are generated as string values like:
Student data set:
student ID ->141PSDM2L,1420CHY1L,1JNLV36HH,1MNSBXUST,2K7EVS7X6,2N2SC26HL,...
class ID ->XK37HDN,XK37HDN,XK37HDN,3K3EH77,3K3EH77,2K36HN6,...
class level data set:
class ID ->XK37HDN,3K3EH77,2K36HN6,3K3LHSH,3K3LHSY,DK3EH14,DK3EH1H,DK3EH1K,...
In the student data set, each class ID is repeated as many times as there are students in the class, but in the class level data set we only have one code for each class.
How can I convert those IDs into integers, considering both student and class level IDs? In other words, I want to have IDs as below (or something similar):
Student data set:
student ID ->1,2,3,4,5,6,...
class ID ->1,1,1,2,2,3,...
class level data set:
class ID ->1,2,3,4,5,6,7,8,...
EDIT:
Conversion on student level data is not difficult. The problem arises when I want to convert class level data. Because of the repetition of class IDs in the student data set, class IDs take values from 1 to 1533, but applying the same conversion method to the class level data produces values from 1 to 896, so I don't know whether, for example, class ID 45 in the student level data corresponds to class ID 45 in the class level data set.
Assuming that your studentID and classID are factors, I would use the fact that internally these are stored numerically. Hence if you can get the levels the same on both factors (i.e. in same order, and such that identical(levels(f1), levels(f2)) == TRUE), then you can simply coerce to integers.
I was thinking something along the lines of:
## dummy data first
set.seed(1)
df1 <- data.frame(f1 = sample(letters, 100, replace = TRUE),
f2 = sample(LETTERS, 100, replace = TRUE,
prob = rep(c(0.25, 0.75), length = 26)),
stringsAsFactors = TRUE) # needed in R >= 4.0, where strings are no longer factors by default
df2 <- with(df1, data.frame(f2 = sample(factor(unique(f2),
levels = sample(unique(f2)))),
vals = rnorm(length(unique(f2)))))
Note the levels of the factors are not identical even though there is a match between the data (given the way I generated them)
> identical(with(df1, levels(f2)), with(df2, levels(f2)))
[1] FALSE
Now make the levels identical, here I just take the union in case there are some values in one factor and not the other, and vice versa.
## make levels identical
levs <- sort(union(with(df1, levels(f2)), with(df2, levels(f2))))
df1 <- transform(df1, f2 = factor(f2, levels = levs))
df2 <- transform(df2, f2 = factor(f2, levels = levs))
> identical(with(df1, levels(f2)), with(df2, levels(f2)))
[1] TRUE
Now recode to numeric
## recode as numeric
df1b <- transform(df1, f2int = as.numeric(f2))
df2b <- transform(df2, f2int = as.numeric(f2))
> head(df1b)
f1 f2 f2int
1 g B 2
2 j D 4
3 o R 17
4 x A 1
5 f F 6
6 x J 10
> head(df2b)
f2 vals f2int
1 Z -0.17955653 23
2 U -0.10019074 20
3 N 0.71266631 13
4 J -0.07356440 10
5 B -0.03763417 2
6 X -0.68166048 22
Notice that the f2int values agree across the two data frames where f2 equals B or J.
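Applied to IDs like those in the question, the same idea can be sketched with match() against one shared list of class IDs, so that a given class gets the same integer in both data sets. The data frame and column names below (student, class_level, class_int, student_int) are made up for illustration; the ID strings are the ones shown in the question.

```r
student <- data.frame(
  student_id = c("141PSDM2L", "1420CHY1L", "1JNLV36HH",
                 "1MNSBXUST", "2K7EVS7X6", "2N2SC26HL"),
  class_id   = c("XK37HDN", "XK37HDN", "XK37HDN",
                 "3K3EH77", "3K3EH77", "2K36HN6"),
  stringsAsFactors = FALSE)
class_level <- data.frame(
  class_id = c("XK37HDN", "3K3EH77", "2K36HN6", "3K3LHSH",
               "3K3LHSY", "DK3EH14", "DK3EH1H", "DK3EH1K"),
  stringsAsFactors = FALSE)

# One shared lookup table of class IDs, in order of first appearance
all_ids <- union(class_level$class_id, student$class_id)

# The same string maps to the same integer in both data sets
student$class_int     <- match(student$class_id, all_ids)
class_level$class_int <- match(class_level$class_id, all_ids)
student$student_int   <- match(student$student_id, unique(student$student_id))

student$class_int
# 1 1 1 2 2 3
class_level$class_int
# 1 2 3 4 5 6 7 8
```

Taking the union guards against class IDs that occur in only one of the two data sets.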
My point in the comments about merge() was if you want to match the tables, you can do the usual database joins using merge(). E.g.:
> head(merge(df1, df2, sort = FALSE))
f2 f1 vals
1 B g -0.03763417
2 B v -0.03763417
3 B u -0.03763417
4 B e -0.03763417
5 B w -0.03763417
6 D i -0.58889449
which would avoid the potentially error-prone step of getting the levels in order and converting to integers, if this was the ultimate aim.
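A minimal sketch of the merge() route on IDs like those in the question: the integer code is created once in the class level table and carried over to the students by joining on the string ID. The data frame and column names are made up for illustration, and only three of the class IDs are used to keep it short.

```r
student <- data.frame(
  student_id = c("141PSDM2L", "1420CHY1L", "1JNLV36HH",
                 "1MNSBXUST", "2K7EVS7X6", "2N2SC26HL"),
  class_id   = c("XK37HDN", "XK37HDN", "XK37HDN",
                 "3K3EH77", "3K3EH77", "2K36HN6"),
  stringsAsFactors = FALSE)
class_level <- data.frame(
  class_id = c("XK37HDN", "3K3EH77", "2K36HN6"),
  stringsAsFactors = FALSE)

# Number the classes once, in the class level table
class_level$class_int <- seq_len(nrow(class_level))

# Join on the string ID; every student row picks up its class's integer
merged <- merge(student, class_level, by = "class_id", sort = FALSE)
merged
```

Because the join is on the original strings, the integer codes never need to be aligned by level order.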