I use ddply a lot. I use ordered factors occasionally. Calling ddply on a data frame that contains an ordered factor drops any ordering in the recombined data frame.
I wrote the following wrapper for ddply that records level ordering and then re-applies it to any columns that were ordered originally:
library(plyr)

dat <- data.frame(a = runif(10),
                  b = factor(letters[10:1], levels = letters[10:1], ordered = TRUE),
                  c = rep(letters[1:2], times = 5),
                  d = factor(rep(c('lev1','lev2'), times = 5), ordered = TRUE))
#Drops ordering on b and d
dat1 <- ddply(dat,.(c),transform,log_a = log(a))
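A quick check confirms what happens:
is.ordered(dat$b)   # TRUE
is.ordered(dat1$b)  # FALSE: ddply dropped the ordering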
ddplyKeepOrder <- function(dat, ...){
    # Record which columns are ordered factors, and their level orderings
    orderedCols <- colnames(dat)[sapply(dat, is.ordered)]
    levs <- lapply(dat[, orderedCols, drop = FALSE], levels)

    result <- ddply(.data = dat, ...)

    # Keep only the ordered columns that survived the ddply call
    ind <- match(orderedCols, colnames(result))
    levs <- levs[!is.na(ind)]
    orderedCols <- orderedCols[!is.na(ind)]
    ind <- ind[!is.na(ind)]

    # Re-apply the recorded level orderings
    if (length(ind) > 0){
        for (i in seq_along(ind)){
            result[, orderedCols[i]] <- factor(result[, orderedCols[i]],
                                               levels = levs[[i]],
                                               ordered = TRUE)
        }
    }
    return(droplevels(result))
}
#Preserves ordering on b and d
dat2 <- ddplyKeepOrder(dat,.variables = .(c),.fun = transform,log_a = log(a))
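And a quick check that the wrapper restores the ordering:
is.ordered(dat2$b)  # TRUE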
I haven't tested this function thoroughly, so there may be cases it doesn't handle. Is there a better or more complete way to do this? I could probably remove the for loop if I thought about it a bit.
In particular, the checking I do after the ddply call, to see whether any of the original ordered factors are still present, seems really ugly, but I would like the function to handle cases where ddply alters which columns are present, possibly removing ordered factors.
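For example, the loop could probably be replaced with a single Map call over the surviving columns; a sketch (no better tested than the function above, and assuming at least one ordered column survives):
# Sketch: re-apply all recorded orderings in one assignment
result[orderedCols] <- Map(function(col, lv) factor(col, levels = lv, ordered = TRUE),
                           result[orderedCols], levs)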
Thoughts?
I use the code below for these types of problems (this is about the "ddply" part, not the "ordered factor" part), and it seems to handle your specific example without issue (other than different row names).
> dat2 <- do.call(rbind, lapply(split(dat, dat$c), transform, log_a=log(a)))
> str(dat2)
'data.frame': 10 obs. of 5 variables:
$ a : num 0.216 0.607 0.197 0.171 0.797 ...
$ b : Ord.factor w/ 10 levels "j"<"i"<"h"<"g"<..: 1 3 5 7 9 2 4 6 8 10
$ c : Factor w/ 2 levels "a","b": 1 1 1 1 1 2 2 2 2 2
$ d : Ord.factor w/ 2 levels "lev1"<"lev2": 1 1 1 1 1 2 2 2 2 2
$ log_a: num -1.532 -0.499 -1.625 -1.767 -0.227 ...
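If the compound row names produced by rbind bother you, they are easy to reset:
rownames(dat2) <- NULL  # rows are renumbered 1:10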
I am using the following code, which works fine (improvement suggestions very much welcome):
library(sqldf)  # for sqldf()
library(broom)  # for tidy()

WeeklySlopes <- function(Year, Week){
    # SourceData is assumed to already exist in the global environment
    DynamicQuery <- paste('select DayOfYear, Week, Year, Close from SourceData where year =',
                          Year, 'and week =', Week, 'order by DayOfYear')
    SubData <- sqldf(DynamicQuery)
    SubData$X <- as.numeric(rownames(SubData))
    # regress Close on the within-week row index and extract the slope
    lmfit <- lm(Close ~ X, data = SubData)
    lmfit <- tidy(lmfit)
    Slope <- as.numeric(sqldf("select estimate from lmfit where term = 'X'"))
    # append a row to the data frame in the global environment
    e <- globalenv()
    e$WeeklySlopesDf[nrow(e$WeeklySlopesDf) + 1, ] <- c(Year, Week, Slope)
}
WeeklySlopesDf <- data.frame(Year = integer(), Week = integer(), Slope = double())
WeeklySlopes(2017, 15)
WeeklySlopes(2017, 14)
head(WeeklySlopesDf)
Is there really no other way to append a row to my existing data frame? I seem to need to access globalenv(). On the other hand, why can sqldf 'see' the 'global' data frame SourceData?
dfrm <- data.frame(a=1:10, b=letters[1:10]) # reproducible example
myfunc <- function(new_a=20){ g <- globalenv(); g$dfrm[3,1] <- new_a; cat(dfrm[3,1])}
myfunc()
20
dfrm
a b
1 1 a
2 2 b
3 20 c # so your strategy might work, although it's unconventional.
Now try to extend the data frame outside a function:
dfrm[11, ] <- c(a=20,b="c")
An occult disaster (conversion of numeric column to character):
str(dfrm)
'data.frame': 11 obs. of 2 variables:
$ a: chr "1" "2" "20" "4" ...
$ b: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
So use a list to avoid occult coercion:
dfrm <- data.frame(a=1:10, b=letters[1:10]) # start over
dfrm[11, ] <- list(a=20,b="c")
str(dfrm)
'data.frame': 11 obs. of 2 variables:
$ a: num 1 2 3 4 5 6 7 8 9 10 ...
$ b: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
Now try within a function:
myfunc <- function(new_a=20, new_b="ZZ"){ g <- globalenv(); g$dfrm[nrow(dfrm)+1, ] <- list(a=new_a,b=new_b)}
myfunc()
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = "ZZ") :
invalid factor level, NA generated
str(dfrm)
'data.frame': 12 obs. of 2 variables:
$ a: num 1 2 3 4 5 6 7 8 9 10 ...
$ b: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
So it succeeds, but if there are any factor columns, non-existent levels will be turned into NA values (with a warning). Your method of using named access to objects in the global environment is rather unconventional, but there is a set of tested methods that you might want to examine: look at ?R6. Other options are <<- and assign, which allow one to specify the environment in which the assignment is to occur.
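A minimal sketch of those two mechanisms (the helper names are just for illustration, and character columns are used so the factor-level issue above doesn't interfere):
dfrm <- data.frame(a = 1:3, b = letters[1:3], stringsAsFactors = FALSE)

add_row_super <- function(new_a, new_b) {
    # `<<-` searches enclosing environments, here ending at the global one
    dfrm <<- rbind(dfrm, data.frame(a = new_a, b = new_b, stringsAsFactors = FALSE))
}

add_row_assign <- function(new_a, new_b) {
    # assign() names the target environment explicitly
    assign("dfrm",
           rbind(dfrm, data.frame(a = new_a, b = new_b, stringsAsFactors = FALSE)),
           envir = globalenv())
}

add_row_super(4, "d")
add_row_assign(5, "e")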
I have a few variables (AGE, ACT_TYPE, GENDER) in my data frame. Instead of printing each of these factor variables' level distributions by hand, I used a for loop to print the distributions. However, nothing is printed. Please let me know how to resolve the issue.
> str(combin)
Classes ‘data.table’ and 'data.frame': 500000 obs. of 333 variables:
$ CUSTOMER_ID : int 385793 286891 108751 278651 23637 130723 5694 275523 163723 469852 ...
$ ACT_TYPE : Factor w/ 2 levels "CSA","SA": 1 1 1 1 1 1 2 2 2 1 ...
$ GENDER : Factor w/ 3 levels "","F","M": 3 3 3 3 3 3 3 3 3 3 ...
$ LEGAL_ENTITY : Factor w/ 7 levels "ASSOCIATION",..: 3 3 3 3 3 3 3 3 3 3 ...
combin[, prop.table(table(GENDER))]
GENDER
F M
0.000272 0.232436 0.767292
combin[, prop.table(table(ACT_TYPE))]
ACT_TYPE
CSA SA
0.710686 0.289314
If I replace the above printing with a for loop, I don't see any output.
Please let me know where I am going wrong:
for(i in names(combin)) {
    combin[, prop.table(table(names(combin)[i]))]
}
Also, please suggest how I can apply a condition in the for loop so that it prints the distribution only if the variable is a factor.
You could use purrr to loop through each column of the data frame and return a list, where each item in the list corresponds to a column and the items for factor columns are the prop.tables:
library(purrr)
library(tibble)  # for tibble(); data_frame() is its deprecated alias

# generate some random data like yours
mydf <- tibble(
    id       = sample(1:100, 10, replace = FALSE),
    ACT_TYPE = factor(sample(c("CSA", "SA"), 10, replace = TRUE)),
    GENDER   = factor(sample(c("", "F", "M"), 10, replace = TRUE))
)
# use map_if to generate prop.tables when the column is a factor
map_if(mydf, ~class(.x) == "factor", ~prop.table(table(.x)) )
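As for why the original loop printed nothing: auto-printing only happens at top level, so inside a for loop you need an explicit print(). A base-R sketch of the same idea, including the factor-only condition:
for (nm in names(combin)) {
    if (is.factor(combin[[nm]])) {
        cat(nm, "\n")
        print(prop.table(table(combin[[nm]])))
    }
}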
I have a data.frame which contains columns of different types, such as integer, character, numeric, and factor.
I need to convert the integer columns to numeric for use in the next step of analysis.
Example: test.data includes 4 columns (though there are thousands in my real data set): age, gender, work.years, and name; age and work.years are integer, gender is factor, and name is character. What I need to do is change age and work.years into a numeric type. I wrote one piece of code to do this:
test.data[sapply(test.data, is.integer)] <- lapply(test.data[sapply(test.data, is.integer)], as.numeric)
It works, but it doesn't look good enough. So I am wondering whether there is a more elegant way to do this. Any creative method will be appreciated.
I think elegant code is sometimes subjective. For me, this is elegant, though it may be less efficient than the OP's code. However, as the question is about elegant code, this can be used:
test.data[] <- lapply(test.data, function(x) if(is.integer(x)) as.numeric(x) else x)
The empty brackets in test.data[] make the assignment replace the columns while keeping the data.frame structure intact.
Also, another elegant option is dplyr
library(dplyr)
library(magrittr)
test.data %<>%
mutate_each(funs(if(is.integer(.)) as.numeric(.) else .))
Now very elegant in dplyr (with the magrittr %<>% operator); mutate_if has since superseded the deprecated mutate_each:
test.data %<>% mutate_if(is.integer,as.numeric)
It's tasks like this that I think are best accomplished with explicit loops. You don't buy anything here by replacing a straightforward for-loop with the hidden loop of a function like lapply(). Example:
## generate data
set.seed(1L);
N <- 3L;
test.data <- data.frame(age=sample(20:90,N,T),
                        gender=factor(sample(c('M','F'),N,T)),
                        work.years=sample(1:5,N,T),
                        name=sample(letters,N,T),
                        stringsAsFactors=F);
test.data;
## age gender work.years name
## 1 38 F 5 b
## 2 46 M 4 f
## 3 60 F 4 e
str(test.data);
## 'data.frame': 3 obs. of 4 variables:
## $ age : int 38 46 60
## $ gender : Factor w/ 2 levels "F","M": 1 2 1
## $ work.years: int 5 4 4
## $ name : chr "b" "f" "e"
## solution
for (cn in names(test.data)[sapply(test.data,is.integer)])
    test.data[[cn]] <- as.double(test.data[[cn]]);
## result
test.data;
## age gender work.years name
## 1 38 F 5 b
## 2 46 M 4 f
## 3 60 F 4 e
str(test.data);
## 'data.frame': 3 obs. of 4 variables:
## $ age : num 38 46 60
## $ gender : Factor w/ 2 levels "F","M": 1 2 1
## $ work.years: num 5 4 4
## $ name : chr "b" "f" "e"
I have a wide data.frame in which every column is a character vector (df1). I have a separate vector (vec1) that contains the column classes I'd like to assign to each of the columns in df1.
If I were using read.csv(), I'd set the colClasses argument equal to vec1, but there doesn't appear to be a similar option for an existing data.frame.
Any suggestions for a fast way to do this besides a loop?
I don't know if it will be of help, but I have run into the same need many times, so I have created a function in case it helps:
reclass <- function(df, vec){
    df[] <- Map(function(x, f){
        # switch below shows the accepted values in the vector;
        # you can modify it and/or add more
        f <- switch(f,
                    as.is  = 'force',
                    factor = 'as.factor',
                    num    = 'as.numeric',
                    char   = 'as.character')
        # takes the name of the function and fetches the function itself
        f <- get(f)
        # apply the function to the column
        f(x)
    },
    df,
    vec)
    df
}
It uses Map to pass a vector of classes to the data.frame; each element corresponds to the class of one column, so the length of the vector needs to equal the number of columns.
I am using switch as well to make the corresponding classes shorter to type. Use as.is to keep a column's class the same; the rest are self-explanatory, I think.
Small example:
df1 <- data.frame(1:10, letters[1:10], runif(50))  # shorter vectors are recycled to length 50
> str(df1)
'data.frame': 50 obs. of 3 variables:
$ X1.10 : int 1 2 3 4 5 6 7 8 9 10 ...
$ letters.1.10.: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ runif.50. : num 0.0969 0.1957 0.8283 0.1768 0.9821 ...
And after the function:
df1 <- reclass(df1, c('num','as.is','char'))
> str(df1)
'data.frame': 50 obs. of 3 variables:
$ X1.10 : num [1:50] 1 2 3 4 5 6 7 8 9 10 ...
$ letters.1.10.: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ runif.50. : chr [1:50] "0.0968757788650692" "0.19566105119884" "0.828283685725182" "0.176784737734124" ...
I guess Map internally is a loop but it is written in C so it should be fast enough.
Maybe you could try this function, which does the same job:
reclass <- function(df, vec_types) {
    for (i in seq_along(df)) {
        type <- vec_types[i]
        class(df[, i]) <- type
    }
    return(df)
}
And this is an example of vec_types (a vector of types):
vec_types <- c('character', rep('integer', 3), rep('character', 2))
You can test the function (reclass) with this table (df):
table <- data.frame(matrix(sample(1:10,30, replace = T), nrow = 5, ncol = 6))
str(table) # original column types
# apply the function
table <- reclass(table, vec_types)
str(table) # new column types
I'm dealing with the KDD 2010 data https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp
In R, how can I remove rows whose factor level has a low total number of instances?
I've tried the following:
Create a table for the student name factor:
studenttable <- table(data$Anon.Student.Id)
which returns a table:
l5eh0S53tB Qwq8d0du28 tyU2s0MBzm dvG32rxRzQ i8f2gg51r5 XL0eQIoG72
9890 7989 7665 7242 6928 6651
Then I can get a named logical vector that tells me whether there are more than 1000 data points for a given factor level:
biginstances <- studenttable>1000
Then I tried subsetting the data with it:
bigdata <- subset(data, (biginstances[Anon.Student.Id]))
But I get weird subsets that still have the full set's original number of factor levels.
I'm simply interested in removing the rows that have a factor that isn't well represented in the dataset.
There are probably more efficient ways to do this, but this should get you what you want. I didn't use the names you used, but you should be able to follow the logic just fine (hopefully!):
# Create some fake data
dat <- data.frame(id = rep(letters[1:5], 1:5), y = rnorm(15))
# tabulate the id variable
tab <- table(dat$id)
# Get the names of the ids that we care about.
# In this case the ids that occur >= 3 times
idx <- names(tab)[tab >=3]
# Only look at the data that we care about
dat[dat$id %in% idx,]
@Dason gave you some good code to work with as a starting point. I'm going to try to explain why (I think) what you tried didn't work.
biginstances <- studenttable>1000
This will create a logical vector whose length is equal to the number of unique student ids, since studenttable contains one count for each unique value of data$Anon.Student.Id. When you try to use that logical vector in subset:
bigdata <- subset(data, (biginstances[Anon.Student.Id]))
its length is almost surely much less than the number of rows in data, and since the subsetting criterion in subset is meant to identify rows of data, R's recycling rules take over and you get 'weird'-looking subsets.
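One sketch of a fix along the lines you attempted: index the named logical vector by the id column itself, which yields one TRUE/FALSE per row:
# each row's id looks up its own entry in the named logical vector
bigdata <- subset(data, biginstances[as.character(Anon.Student.Id)])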
I would also add that taking subsets to remove rare factor levels will not change the levels attribute of the factor. In other words, you'll get a factor back with no instances of that level, but all of the original factor levels will remain in the levels attribute. For example:
> fac <- factor(rep(letters[1:3],each = 3))
> fac
[1] a a a b b b c c c
Levels: a b c
> fac[-(1:3)]
[1] b b b c c c
Levels: a b c
> droplevels(fac[-(1:3)])
[1] b b b c c c
Levels: b c
So you'll want to use droplevels if you want to ensure that those levels are really 'gone'. Also, see options(stringsAsFactors = FALSE).
Another approach involves a join between your dataset and the table of counts.
I'll use plyr for this purpose, but it can be done using base functions (like merge and as.data.frame.table):
require(plyr)
set.seed(123)
Data <- data.frame(var1 = sample(LETTERS[1:5], size = 100, replace = TRUE),
                   var2 = 1:100)
R> table(Data$var1)
A B C D E
19 20 21 22 18
## rows with category less than 20
mytable <- count(Data, vars = "var1")
## mytable <- as.data.frame(table(Data$var1))
R> str(mytable)
'data.frame': 5 obs. of 2 variables:
$ var1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ freq: int 19 20 21 22 18
Data <- join(Data, mytable)
## Data <- merge(Data, mytable)
R> str(Data)
'data.frame': 100 obs. of 3 variables:
$ var1: Factor w/ 5 levels "A","B","C","D",..: 3 2 3 5 3 5 5 4 3 1 ...
$ var2: int 1 2 3 4 5 6 7 8 9 10 ...
$ freq: int 21 20 21 18 21 18 18 22 21 19 ...
mysubset <- droplevels(subset(Data, freq > 20))
R> table(mysubset$var1)
C D
21 22
Hope this helps.
This is how I managed to do it.
I sorted the table of factor levels and associated counts:
studenttable <- sort(studenttable, decreasing=TRUE)
Now that it's sorted, we can use index ranges sensibly. So I got the number of factor levels that are represented more than 1000 times in the data:
sum(studenttable>1000)
230
sum(studenttable<1000)
344
344 + 230 = 574, so all the levels are accounted for.
Now we know the first 230 factor levels are the ones we care about, so we can do:
idx <- names(studenttable[1:230])
bigdata <- data[data$Anon.Student.Id %in% idx,]
We can verify it worked by doing
bigstudenttable <- table(bigdata$Anon.Student.Id)
to get a printout and see that all the factor levels with fewer than 1000 instances now show 0. (Per the earlier answer, droplevels would remove those empty levels entirely.)