I used the "dummies" package to create 42 dummy variables for the 42 levels of a factor variable in my data-frame. Now I only want to keep the 5 dummies that represent the five most common factor levels. I used:
counts <- colSums(dummy_variables)
rank <- sort(counts)
to figure out what those levels are, but now I want to be able to reference the most common ones and keep them in my data frame. I am somewhat new to R - I just can't figure out the syntax to do this.
Pick out the five most common dummy variables, then keep only those columns.
rank <- sort(counts)[(length(counts)-4):length(counts)]
dummy_variables <- dummy_variables[names(dummy_variables) %in% names(rank)]
Or in one line as the commenter suggested,
dummy_variables[names(dummy_variables) %in% names(tail(sort(colSums(dummy_variables)),5))]
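For anyone who wants to try this without the original data, here is a small self-contained sketch. The toy factor `f`, the use of `model.matrix` in place of the "dummies" package, and keeping three columns rather than five are all assumptions made for illustration only.

```r
# Toy stand-in for the real data: a factor with six levels of known frequency.
f <- factor(rep(c("a", "b", "c", "d", "e", "f"), times = c(30, 25, 20, 15, 7, 3)))

# One indicator column per level (columns are named fa, fb, ...).
dummy_variables <- as.data.frame(model.matrix(~ f - 1))

# Count the 1s per column and take the names of the most common levels.
counts <- colSums(dummy_variables)
top <- names(sort(counts, decreasing = TRUE))[1:3]

# Keep only those columns; drop = FALSE guards the single-column case.
top_dummies <- dummy_variables[, top, drop = FALSE]
```

Selecting by name rather than by a logical vector also returns the columns in order of frequency, which may or may not be what you want.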
I need to score a big dataframe: 250M rows x ~2000 cols (1TB). A model has been fit (with the C5.0 library) on training data. The entire dataset currently won't fit in memory. Big data solutions are being investigated, but I'm wondering if I can cut up the file and run predict() in chunks.
The basic problem is that I want to translate column classes and factor levels from one dataframe to another. More detail:
PROBLEM #1: Some of the column classes don't match when being read from hadoop.
PROBLEM #2: When chunking, a chunk can contain factors with fewer levels than the training dataset (since it is only a subsample of the full set). Because of this, predict() refuses to score a set that has missing factor levels.
QUESTION: I was hoping to take the column classes and factor levels from the training set and apply them to the big scoring set, re-levelling the factors with levels(). Can the classes and factor levels of one dataframe be transferred to the same-named variables of another dataframe? I suppose this could be done with a for loop that reads the factors from one frame and applies them to the other, but that would need many if statements to cover all the classes and would get messy. Is there a way to do this with the apply functions or a simpler 'one-liner'?
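One possible way to do what the question describes is sketched below. `match_types` is a hypothetical helper, not part of any library, and it assumes every non-factor column has a basic class with a matching `as.<class>` function (numeric, integer, character, logical, Date). Values in a chunk that never appeared in training become NA.

```r
# Hypothetical helper: coerce each column of `new` to the class and
# factor levels of the same-named column in the reference frame `ref`.
match_types <- function(new, ref) {
  common <- intersect(names(new), names(ref))
  new[common] <- Map(function(x, r) {
    if (is.factor(r)) {
      # Rebuild the factor with the full set of training levels.
      factor(as.character(x), levels = levels(r), ordered = is.ordered(r))
    } else {
      # Look up as.numeric, as.integer, ... from the reference class.
      match.fun(paste0("as.", class(r)[1]))(x)
    }
  }, new[common], ref[common])
  new
}

# Toy training frame and a chunk read with the wrong classes.
train <- data.frame(g = factor(c("a", "b", "c")), n = c(1.5, 2, 3))
chunk <- data.frame(g = c("a", "a"), n = c("4", "5"), stringsAsFactors = FALSE)

fixed <- match_types(chunk, train)
```

After this, `fixed$g` carries all three training levels even though the chunk only contains "a", so a per-chunk predict() call should no longer complain about missing levels.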
This is my data: https://www.dropbox.com/s/msf0ro8saav7wbl/data1.txt?dl=0 (dataA). I want to build a frequency table for "Habitat" so that I can compute statistics such as the mean and variance, and draw plots such as a boxplot with ggplot2.
I tried the solution from the suggested duplicate, "R: How to get common counts (frequency) of levels of two factor variables by ID Variable (as new data frame)", but I don't think it solves my problem.
Here's the easiest way to get a data.frame with frequencies using table. I'm using t to transpose and as.data.frame.matrix to transform it into a data.frame.
as.data.frame.matrix(t(table(data1)))
         A B C
Adult    1 2 1
Juvenile 2 0 0
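Since the linked file may not stay available, here is a self-contained sketch that reproduces the table above. The column names `Habitat` and `Stage` and the six rows are assumptions chosen to match the printed counts, not the real data.

```r
# Hypothetical stand-in for the linked file; column names are assumptions.
data1 <- data.frame(
  Habitat = c("A", "A", "A", "B", "B", "C"),
  Stage   = c("Adult", "Juvenile", "Juvenile", "Adult", "Adult", "Adult")
)

# Wide frequency table: one row per Stage, one column per Habitat.
freq <- as.data.frame.matrix(t(table(data1)))

# Long format (Habitat, Stage, Freq), which is usually easier to feed to ggplot2.
long <- as.data.frame(table(data1))
```

The wide form is convenient for row-wise summaries such as `rowMeans(freq)`, while the long form maps directly onto plot aesthetics.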
I have 10 factor variables, and I want to get all the possible unique combinations of their levels.
My dataframe has the following data variables:
And I want to get the output formatted as below:
unique(dataframe_name)
This command returns the unique rows of the dataframe, i.e. every combination of values that actually occurs in it. To store the result:
unique_data <- unique(dataframe_name)
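Note that "all possible combinations" can be read two ways. A minimal sketch of the difference, using a toy two-factor frame `df` that is purely illustrative:

```r
# Toy frame with two factors; three rows but only two distinct combinations.
df <- data.frame(
  size  = factor(c("S", "S", "L")),
  color = factor(c("red", "red", "blue"))
)

# Combinations that actually occur in the data.
observed <- unique(df)

# Every combination of levels, whether or not it occurs in the data.
possible <- expand.grid(lapply(df, levels))
```

With 10 factors the `expand.grid` result grows as the product of the level counts, so it can get large quickly.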
My data consists of data about smartphones.
To do a random forest, I need to convert my factor Brand into a lot of dummies.
I tried this code
m <- model.matrix( ~ Brand, data = data_price)
Intercept BrandApple BrandAcer BrandAlcatel ...
1 0 0 1
1 1 0 0
...
The problem is that the original data has 2039 rows, while the output of this only has 2038.
Now I want to add the dummies to my data_price, but this doesn't work.
How can I create the dummies and add them to my data set?
Your approach using model.matrix should work fine, and we only need to figure out what happened to that missing row. I guess the issue is that there are missing values in your factor. Consider the following:
dat <- factor(mtcars$cyl)
dat2 <- dat
dat2[1] <- NA
Here, I have taken a factor, namely the number of cylinders in the mtcars dataset, and for comparison I have created a second factor where I have replaced one value with NA. Let's look at the number of rows that model.matrix will spit out in each case:
nrow(model.matrix(~dat))
[1] 32
nrow(model.matrix(~dat2))
[1] 31
You see that in the case where the factor variable had a missing value, the output of model.matrix has one row fewer, because rows with NA are dropped by default.
You can either create a separate factor level for the missing value, or drop the row with the missing value from your original data set, if that is appropriate for your application. The output of model.matrix keeps the original row names, which you can use to merge the dummies back onto the original dataframe if you want to go down that route.
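Both options can be sketched as follows, continuing with the mtcars example. The level label "missing" and the frame `df` are names made up for this illustration.

```r
dat2 <- factor(mtcars$cyl)
dat2[1] <- NA

# Option 1: turn NA into an explicit level so that no row is dropped.
lev <- as.character(dat2)
lev[is.na(lev)] <- "missing"          # "missing" is an arbitrary label
dat3 <- factor(lev)
m_all <- model.matrix(~ dat3)         # one row per observation again

# Option 2: let model.matrix drop the NA row, then merge back by row name.
df <- data.frame(id = seq_along(dat2), cyl = dat2)
m <- model.matrix(~ cyl, data = df)
merged <- merge(df, as.data.frame(m), by = "row.names", all.x = TRUE)
```

In the merged result the row that had the missing factor value is kept, with NA in the dummy columns. For the random forest in the question, option 1 is probably the less surprising choice, since the row count stays the same.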
I have a dataframe that I need to split into smaller dataframes by groups of factors so that I can paginate tables and figures.
For example, say I wanted to split the diamonds dataset into mini dataframes with 2 cut levels per dataframe. That would mean a list of 2 dataframes with 2 levels each and 1 dataframe with 1 level.
levels(diamonds$cut)
# "Fair" "Good" "Very Good" "Premium" "Ideal"
I'm trying to use split() to accomplish this. split(diamonds, diamonds$cut) splits the set into dataframes by factor level, but how would you split it up by groups of 2, 3, or n levels? Something like split(data,rep(1:round(nrow(data)/10),each=10)) works when each factor level only has one row, but I'm working with a "long" dataframe, so each level is spread out along the length of the dataframe.
This question comes close, but uses a numeric variable that I don't have.
We split the levels of the 'cut' variable using a grouping variable created with gl, and then subset 'diamonds' for each element of the resulting list using %in%.
v1 <- levels(diamonds$cut)
n <- 2
lapply(split(v1, as.numeric(gl(length(v1), n, length(v1)))),
function(x) diamonds[diamonds$cut %in% x,])
By using:
diamonds$splt <- c("B","A")[diamonds$cut %in% c("Very Good","Premium","Ideal") + 1L]
you create a new variable on which you can split the dataset in two with:
split(diamonds, diamonds$splt)
A simple solution:
df_splt<-split(diamonds,ceiling(as.numeric(diamonds$cut)/2))
Note, though, that each resulting data.frame still carries the empty levels.
> table(df_splt[[1]]$cut)

     Fair      Good Very Good   Premium     Ideal 
     1610      4906         0         0         0 
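If the empty levels get in the way (for example in table or plot legends), one way to remove them is to apply droplevels to each piece. The sketch below uses a tiny made-up frame `df` with the same cut levels instead of diamonds, so it runs without ggplot2.

```r
# Tiny stand-in for diamonds with the same five cut levels.
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
df <- data.frame(
  cut   = factor(c("Fair", "Good", "Very Good", "Premium", "Ideal", "Fair"),
                 levels = cut_levels),
  price = 1:6
)

# Group the levels in pairs via their integer codes, then split.
parts <- split(df, ceiling(as.numeric(df$cut) / 2))

# Drop the unused factor levels inside each piece.
parts <- lapply(parts, droplevels)
```

Each element of `parts` now lists only the cut levels it actually contains.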