remove factors with criteria - r

I'm dealing with the KDD 2010 data https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp
In R, how can I remove rows whose factor level has a low total number of instances?
I've tried the following.
First, I create a table for the student ID factor:
studenttable <- table(data$Anon.Student.Id)
which returns a table:
l5eh0S53tB Qwq8d0du28 tyU2s0MBzm dvG32rxRzQ i8f2gg51r5 XL0eQIoG72
      9890       7989       7665       7242       6928       6651
Then I can get a table that tells me whether there are more than 1000 data points for a given factor level:
biginstances <- studenttable>1000
Then I tried making a subset of the data based on this:
bigdata <- subset(data, (biginstances[Anon.Student.Id]))
But I get weird subsets that still have the same number of factor levels as the full set.
I'm simply interested in removing the rows whose factor level isn't well represented in the dataset.

There are probably more efficient ways to do this but this should get you what you want. I didn't use the names you used but you should be able to follow the logic just fine (hopefully!)
# Create some fake data
dat <- data.frame(id = rep(letters[1:5], 1:5), y = rnorm(15))
# tabulate the id variable
tab <- table(dat$id)
# Get the names of the ids that we care about.
# In this case the ids that occur >= 3 times
idx <- names(tab)[tab >=3]
# Only look at the data that we care about
dat[dat$id %in% idx,]

@Dason gave you some good code to work with as a starting point. I'm going to try to explain why (I think) what you tried didn't work.
biginstances <- studenttable>1000
This will create a logical vector whose length is equal to the number of unique student IDs. studenttable contained a count for each unique value of data$Anon.Student.Id. When you try to use that logical vector in subset:
bigdata <- subset(data, (biginstances[Anon.Student.Id]))
its length is almost surely much less than the number of rows in data. And since the subsetting criterion in subset is meant to identify rows of data, R's recycling rules take over and you get 'weird' looking subsets.
I would also add that taking subsets to remove rare factor levels will not change the levels attribute of the factor. In other words, you'll get a factor back with no instances of that level, but all of the original factor levels will remain in the levels attribute. For example:
> fac <- factor(rep(letters[1:3],each = 3))
> fac
[1] a a a b b b c c c
Levels: a b c
> fac[-(1:3)]
[1] b b b c c c
Levels: a b c
> droplevels(fac[-(1:3)])
[1] b b b c c c
Levels: b c
So you'll want to use droplevels if you want to ensure that those levels are really 'gone'. Also, see options(stringsAsFactors = FALSE).
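Putting the two points above together, here is a minimal sketch on made-up data (the names dat, counts, keep and the threshold of 2 are illustrative assumptions, not from the question). It looks the counts up by name, so there is exactly one logical value per row, and then drops the emptied levels:

```r
# Toy data standing in for the real data set (hypothetical values)
dat <- data.frame(id = factor(rep(c("s1", "s2", "s3"), times = c(5, 2, 1))),
                  y  = 1:8)

counts <- table(dat$id)

# Index the table by name (character), giving one count per row of dat
keep <- as.vector(counts[as.character(dat$id)]) >= 2

# Subset the rows, then remove the now-unused level "s3"
filtered <- droplevels(dat[keep, ])
levels(filtered$id)
```

Indexing with as.character() rather than the factor itself avoids relying on the table being in the same order as the factor levels.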

Another approach is to join your dataset with the table of counts.
I'll use plyr here, but it can also be done with base functions (such as merge and as.data.frame.table).
require(plyr)
set.seed(123)
Data <- data.frame(var1 = sample(LETTERS[1:5], size = 100, replace = TRUE),
var2 = 1:100)
R> table(Data$var1)
 A  B  C  D  E
19 20 21 22 18
## goal: drop rows whose category has 20 or fewer rows
mytable <- count(Data, vars = "var1")
## mytable <- as.data.frame(table(Data$var1))
R> str(mytable)
'data.frame': 5 obs. of 2 variables:
$ var1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ freq: int 19 20 21 22 18
Data <- join(Data, mytable)
## Data <- merge(Data, mytable)
R> str(Data)
'data.frame': 100 obs. of 3 variables:
$ var1: Factor w/ 5 levels "A","B","C","D",..: 3 2 3 5 3 5 5 4 3 1 ...
$ var2: int 1 2 3 4 5 6 7 8 9 10 ...
$ freq: int 21 20 21 18 21 18 18 22 21 19 ...
mysubset <- droplevels(subset(Data, freq > 20))
R> table(mysubset$var1)
 C  D
21 22
Hope this helps.
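The base-R route mentioned in the commented-out lines can be sketched as follows. This is only an illustration under the same simulated data; the threshold variable is a made-up name, and note that merge() does not preserve the original row order:

```r
set.seed(123)
Data <- data.frame(var1 = factor(sample(LETTERS[1:5], size = 100, replace = TRUE)),
                   var2 = 1:100)

# Naming the table dimension and the response gives columns "var1" and "freq"
mytable <- as.data.frame(table(var1 = Data$var1), responseName = "freq")

# Attach each category's count to every row of that category
merged <- merge(Data, mytable, by = "var1")

threshold <- 20  # hypothetical cutoff, as in the example above
mysubset <- droplevels(merged[merged$freq > threshold, ])
table(mysubset$var1)
```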

This is how I managed to do it.
I sorted the table of factor levels and their associated counts.
studenttable <- sort(studenttable, decreasing=TRUE)
Now that it's in order, we can use position ranges sensibly. So I got the number of factor levels that are represented more than 1000 times in the data.
sum(studenttable>1000)
230
sum(studenttable<1000)
344
344+230=574
Now we know the first 230 factor levels are the ones we care about. So, we can do
idx <- names(studenttable[1:230])
bigdata <- data[data$Anon.Student.Id %in% idx,]
We can verify it worked by doing
bigstudenttable <- table(bigdata$Anon.Student.Id)
to print the counts and see that all the factor levels with fewer than 1000 instances are now 0.
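Selecting by position works, but it depends on the sort and the hard-coded 230 staying in sync, and a level with exactly 1000 rows would be counted by neither of the two sums. A sketch that applies the threshold directly to each row, on made-up data (only the column name comes from the question; min_rows and n_in_group are illustrative names):

```r
# Hypothetical stand-in for the real data
data <- data.frame(Anon.Student.Id = factor(rep(c("a", "b", "c"), times = c(6, 3, 1))))
min_rows <- 3  # the question uses 1000

# ave() returns, for every row, the size of the group that row belongs to
n_in_group <- ave(seq_len(nrow(data)), data$Anon.Student.Id, FUN = length)

# Keep well-represented rows and drop the levels that are now empty
bigdata <- droplevels(data[n_in_group >= min_rows, , drop = FALSE])
table(bigdata$Anon.Student.Id)
```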

Related

R - Aggregate Function different Results When Adding new grouping column

I am an R beginner, and I am stuck and can't find a solution. Any remarks are highly appreciated. Here is the problem:
I have a dataframe df.
The columns are converted to char (Attributes) and num.
I want to reduce the dataframe by using the aggregate function (dplyr is not an option).
When I am aggregating using
df_agg <- aggregate(df["AMOUNT"], df[c("ATTRIBUTE1")], sum)
I get correct results. But I want to group by more attributes. When adding more attributes for example
df_agg <- aggregate(df["AMOUNT"], df[c("ATTRIBUTE1", "ATTRIBUTE2")], sum)
then at some point the aggregate result changes. The sum of AMOUNT is no longer equal to the result of the first aggregation (or the original dataframe).
Does anyone have an idea what causes this behavior?
My best guess is that you have missing values in some of your grouping columns. Demonstrating on the built-in mtcars data, which has no missing values, everything is fine:
sum(mtcars$mpg)
# [1] 642.9
sum(aggregate(mtcars["mpg"], mtcars[c("am")], sum)$mpg)
# [1] 642.9
sum(aggregate(mtcars["mpg"], mtcars[c("am", "cyl")], sum)$mpg)
# [1] 642.9
But if we introduce a missing value in a grouping variable, it is not included in the aggregation:
mt = mtcars
mt$cyl[1] = NA
sum(aggregate(mt["mpg"], mt[c("am", "cyl")], sum)$mpg)
# [1] 621.9
The easiest fix would be to fill in the missing values with something other than NA, perhaps the string "missing".
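As a sketch of that fix on the same mtcars example (the label "missing" is just the placeholder suggested above, and recoding turns cyl into a character column):

```r
mt <- mtcars
mt$cyl[1] <- NA

# Recode NA in the grouping column as an explicit label
mt$cyl <- ifelse(is.na(mt$cyl), "missing", as.character(mt$cyl))

agg <- aggregate(mt["mpg"], mt[c("am", "cyl")], sum)

# The group total now matches the ungrouped total again
all.equal(sum(agg$mpg), sum(mtcars$mpg))
```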
I think @Gregor has correctly pointed out that the problem could be a grouping variable containing NA. dplyr handles NA in grouping variables differently than aggregate does.
There is an alternative solution with aggregate. Note that the documentation says:
`by` a list of grouping elements, each as long as the variables in the data
frame x. The elements are coerced to factors before use.
Here is the clue: you can convert your grouping variables to factors using exclude = "", which ensures that NA is kept as a factor level.
set.seed(1)
df <- data.frame(ATTRIBUTE1 = sample(LETTERS[1:3], 10, replace = TRUE),
ATTRIBUTE2 = sample(letters[1:3], 10, replace = TRUE),
AMOUNT = 1:10)
df$ATTRIBUTE2[5] <- NA
aggregate(df["AMOUNT"], by = list(factor(df$ATTRIBUTE1,exclude = ""),
factor(df$ATTRIBUTE2, exclude="")), sum)
#   Group.1 Group.2 AMOUNT
# 1       A       a      1
# 2       B       a      2
# 3       B       b      9
# 4       C       b     10
# 5       A       c     10
# 6       B       c     11
# 7       C       c      7
# 8       A    <NA>      5
When the grouping variables are not explicitly converted to factors that include NA, the result is:
aggregate(df["AMOUNT"], df[c("ATTRIBUTE1", "ATTRIBUTE2")], sum)
#   ATTRIBUTE1 ATTRIBUTE2 AMOUNT
# 1          A          a      1
# 2          B          a      2
# 3          B          b      9
# 4          C          b     10
# 5          A          c     10
# 6          B          c     11
# 7          C          c      7

Divide many columns by another column

OK, with
A <- c(1:10)
B <- c(2:11)
C <- c(3:12)
df1 <- data.frame(A,B,C)
I do not understand this error:
df2 <- df1 / df1[,"C"]
df2 <- df1[1:3,] / df1[1:3,"C"]
a <- subset (df1, select = c(A, B))
b <- subset (df1, select = c (C))
c <- a/b
## Error in Ops.data.frame(a, b) :
## ‘/’ only defined for equally-sized data frames
seeing that both have the same number of rows:
dim(a)
dim(b)
R automatically drops dimensions (unless you explicitly specify drop=FALSE) when the use of matrix indexing results in a dimension of size 1 (i.e., one row or one column), but using subset() on a data frame always results in a data frame (even if it's only one column):
> str(b)
'data.frame': 10 obs. of 1 variable:
$ C: int 3 4 5 6 7 8 9 10 11 12
> str(df1[,"C"])
int [1:10] 3 4 5 6 7 8 9 10 11 12
So dividing by df1[,"C"] is dividing by a numeric (integer) vector rather than by a data frame. The error ‘/’ only defined for equally-sized data frames means that the two data frames should be exactly equally-sized (same number of rows and columns).
sweep(df1,df1[,"C"],MARGIN=1,"/") might be safer.
Since a and b are not equally sized (they have different numbers of columns), the division throws an error.
The following divides a by column C of b:
c <- a/b$C
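To compare the two suggestions side by side, here is a small sketch using the data from the question (res_vec and res_sweep are illustrative names). Both divide every column of a by C row by row; the sweep() call just states the row-wise intent explicitly:

```r
df1 <- data.frame(A = 1:10, B = 2:11, C = 3:12)
a <- subset(df1, select = c(A, B))
b <- subset(df1, select = C)

# b is a one-column data frame; b[[1]] (or b$C) extracts the underlying vector
res_vec <- a / b[[1]]

# sweep() over MARGIN = 1 divides row i of a by the i-th element of STATS
res_sweep <- sweep(a, MARGIN = 1, STATS = b$C, FUN = "/")

head(res_vec, 3)
```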

An Elegant way to change columns type in dataframe in R

I have a data.frame which contains columns of different types, such as integer, character, numeric, and factor.
I need to convert the integer columns to numeric for use in the next step of analysis.
Example: test.data includes 4 columns (though there are thousands in my real data set): age, gender, work.years, and name; age and work.years are integer, gender is factor, and name is character. What I need to do is change age and work.years into a numeric type. And I wrote one piece of code to do this.
test.data[sapply(test.data, is.integer)] <-lapply(test.data[sapply(test.data, is.integer)], as.numeric)
It works, but it does not look good enough. So I am wondering whether there is a more elegant way to do this. Any creative method will be appreciated.
I think elegant code is sometimes subjective. For me, this is elegant but it may be less efficient compared to the OP's code. However, as the question is about elegant code, this can be used.
test.data[] <- lapply(test.data, function(x) if(is.integer(x)) as.numeric(x) else x)
Also, another elegant option is dplyr
library(dplyr)
library(magrittr)
test.data %<>%
mutate_each(funs(if(is.integer(.)) as.numeric(.) else .))
Now very elegant in dplyr (with magrittr %<>% operator)
test.data %<>% mutate_if(is.integer,as.numeric)
It's tasks like this that I think are best accomplished with explicit loops. You don't buy anything here by replacing a straightforward for-loop with the hidden loop of a function like lapply(). Example:
## generate data
set.seed(1L);
N <- 3L; test.data <- data.frame(age=sample(20:90,N,T),gender=factor(sample(c('M','F'),N,T)),work.years=sample(1:5,N,T),name=sample(letters,N,T),stringsAsFactors=F);
test.data;
## age gender work.years name
## 1 38 F 5 b
## 2 46 M 4 f
## 3 60 F 4 e
str(test.data);
## 'data.frame': 3 obs. of 4 variables:
## $ age : int 38 46 60
## $ gender : Factor w/ 2 levels "F","M": 1 2 1
## $ work.years: int 5 4 4
## $ name : chr "b" "f" "e"
## solution
for (cn in names(test.data)[sapply(test.data,is.integer)])
test.data[[cn]] <- as.double(test.data[[cn]]);
## result
test.data;
## age gender work.years name
## 1 38 F 5 b
## 2 46 M 4 f
## 3 60 F 4 e
str(test.data);
## 'data.frame': 3 obs. of 4 variables:
## $ age : num 38 46 60
## $ gender : Factor w/ 2 levels "F","M": 1 2 1
## $ work.years: num 5 4 4
## $ name : chr "b" "f" "e"

How to re-arrange a data.frame

I am interested in re-arranging a data.frame in R. Bear with me as I stumble through a reproducible example.
I have a nominal variable which can have one of two values. Currently this nominal variable is a column. Instead I would like to have two columns, representing the two values this nominal variable can have. Here is an example data frame. s is the nominal variable with values t and c.
n <- c(1,1,2,2,3,3,4,4)
s <- c("t","c","t","c","t","c","t","c")
b <- c(11,23,6,5,12,16,41,3)
mydata <- data.frame(n, s, b)
I would rather have a data frame that looked like this
n.n <- c(1,2,3,4)
trt <- c(11,6,12,41)
cnt <- c(23,5,16,3)
new.data <- data.frame(n.n, trt, cnt)
I am sure there is a way to use mutate or possibly tidyr but I am not sure what the best route is and my data frame that I would like to re-arrange is quite large.
You want spread:
library(dplyr)
library(tidyr)
new.data <- mydata %>% spread(s,b)
  n  c  t
1 1 23 11
2 2  5  6
3 3 16 12
4 4  3 41
How about unstack(mydata, b~s):
   c  t
1 23 11
2  5  6
3 16 12
4  3 41
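If the n column should be kept and no extra packages are wanted, base R's reshape() is another option. A minimal sketch using the data from the question; the renaming to trt and cnt follows the desired layout, and wide is an illustrative name:

```r
n <- c(1, 1, 2, 2, 3, 3, 4, 4)
s <- c("t", "c", "t", "c", "t", "c", "t", "c")
b <- c(11, 23, 6, 5, 12, 16, 41, 3)
mydata <- data.frame(n, s, b)

# One row per n, one column of b per value of s (named b.t and b.c)
wide <- reshape(mydata, idvar = "n", timevar = "s", direction = "wide")

# Rename to the column names used in the desired output
names(wide)[names(wide) == "b.t"] <- "trt"
names(wide)[names(wide) == "b.c"] <- "cnt"
rownames(wide) <- NULL
wide
```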

Preserve ordered factor when using ddply

I use ddply a lot. I use ordered factors occasionally. Calling ddply on a data frame that contains an ordered factor drops any ordering in the recombined data frame.
I wrote the following wrapper for ddply that records level ordering and then re-applies it on any columns that were ordered originally:
library(plyr)
dat <- data.frame(a=runif(10),b=factor(letters[10:1],
levels=letters[10:1],ordered=TRUE),
c = rep(letters[1:2],times=5),
d = factor(rep(c('lev1','lev2'),times=5),ordered=TRUE))
#Drops ordering on b and d
dat1 <- ddply(dat,.(c),transform,log_a = log(a))
ddplyKeepOrder <- function(dat,...){
orderedCols <- colnames(dat)[sapply(dat,is.ordered)]
levs <- lapply(dat[,orderedCols,drop=FALSE],levels)
result <- ddply(.data = dat,...)
ind <- match(orderedCols,colnames(result))
levs <- levs[!is.na(ind)]
orderedCols <- orderedCols[!is.na(ind)]
ind <- ind[!is.na(ind)]
if (length(ind) > 0){
for (i in 1:length(ind)){
result[,orderedCols[i]] <- factor(result[,orderedCols[i]],
levels=levs[[i]],ordered=TRUE)
}
}
return(droplevels(result))
}
#Preserves ordering on b and d
dat2 <- ddplyKeepOrder(dat,.variables = .(c),.fun = transform,log_a = log(a))
I haven't checked this function thoroughly so there might be cases it doesn't handle. Is there a better/more complete way to handle this? I could probably remove the for loop if I thought about it a bit, I suppose.
In particular, the checking I do after the ddply call to see if there are still any of the original ordered factors present seems really ugly, but I would like the function to be able to handle cases where ddply alters which columns are present, possibly removing ordered factors.
Thoughts?
I use the code below for these types of problems ("ddply" not "ordered factor") and it seems to handle your specific example without issue (other than different row names).
> dat2 <- do.call(rbind, lapply(split(dat, dat$c), transform, log_a=log(a)))
> str(dat2)
'data.frame': 10 obs. of 5 variables:
$ a : num 0.216 0.607 0.197 0.171 0.797 ...
$ b : Ord.factor w/ 10 levels "j"<"i"<"h"<"g"<..: 1 3 5 7 9 2 4 6 8 10
$ c : Factor w/ 2 levels "a","b": 1 1 1 1 1 2 2 2 2 2
$ d : Ord.factor w/ 2 levels "lev1"<"lev2": 1 1 1 1 1 2 2 2 2 2
$ log_a: num -1.532 -0.499 -1.625 -1.767 -0.227 ...
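A self-contained sketch of the same split/apply/rbind idea, with an explicit check that the ordering survives. It uses an anonymous function instead of passing log_a through lapply, which is only a stylistic assumption; pieces is an illustrative name:

```r
set.seed(1)
dat <- data.frame(a = runif(10),
                  b = factor(letters[10:1], levels = letters[10:1], ordered = TRUE),
                  c = rep(letters[1:2], times = 5),
                  d = factor(rep(c("lev1", "lev2"), times = 5), ordered = TRUE))

# Split by c, transform each piece, and bind the pieces back together
pieces <- lapply(split(dat, dat$c), function(piece) transform(piece, log_a = log(a)))
dat2 <- do.call(rbind, pieces)
rownames(dat2) <- NULL  # discard the "a.1", "a.3", ... row names from rbind

# Both ordered factors should still be ordered, with their original levels
sapply(dat2[c("b", "d")], is.ordered)
```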
