How does ddply handle factors as "split" variables? - r

I have a data.frame with 20 columns. The first two are factors, and the rest are numeric. I'd like to use the first two columns as split variables and then apply the mean() to the remaining columns.
This seems like a quick and easy job for ddply(), however, the results for the output data.frame are not what I am looking for. Here is a minimal example with just one column of data:
Aa <- c(rep(c("A", "a"), each = 20))
Bb <- c(rep(c("B", "b", "B", "b"), each = 10))
x <- runif(40)
df1 <- data.frame(Aa, Bb, x)
ddply(df1, .(Aa, Bb), mean)
The output is:
Aa Bb x
1 NA NA 0.5193275
2 NA NA 0.4491907
3 NA NA 0.4848128
4 NA NA 0.4717899
Warning messages:
1: In mean.default(X[[1L]], ...) :
argument is not numeric or logical: returning NA
The warning is repeated 8 times, presumably once for each call to mean(). I'm guessing this comes from trying to take the mean of a factor. I could write this as:
ddply(df1, .(Aa, Bb), function(df1) mean(df1$x))
or
ddply(df1, .(Aa, Bb), summarize, x = mean(x))
both of which do work (not giving NAs), but I would rather avoid writing out 18 such x = mean(x) statements, one for each of my numeric columns.
Is there a general solution? I'm not wedded to ddply if there is a better answer elsewhere.

Since you are reducing the number of rows, you need to use summarise:
> ddply(df1, .(Aa, Bb), summarise, mean_x =mean(x) )
Aa Bb mean_x
1 a b 0.3790675
2 a B 0.4242922
3 A b 0.5622329
4 A B 0.4574471
It's just as easy to use aggregate in this instance. Let's say you had two variables:
> aggregate(df1[-(1:2)], df1[1:2], mean)
Aa Bb x y
1 a b 0.4249121 0.4639192
2 A b 0.6127175 0.4639192
3 a B 0.4522292 0.4826715
4 A B 0.5201965 0.4826715

ddply supports negative indexing as well. Note that mean() does not work on a data.frame, so use colMeans() on the remaining columns:
ddply(df1, .(Aa, Bb), function(x) colMeans(x[-(1:2)]))
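For the original case of many numeric columns, here is a minimal base R sketch that picks the numeric columns by type rather than by position. The second column y and the names num_cols and group_means are assumptions added for illustration; they are not in the question.

```r
# Illustrative data: same layout as the question, plus a hypothetical column y
set.seed(1)
df1 <- data.frame(
  Aa = rep(c("A", "a"), each = 20),
  Bb = rep(c("B", "b", "B", "b"), each = 10),
  x  = runif(40),
  y  = runif(40)
)

# Select every numeric column, so nothing has to be listed by hand
num_cols <- names(df1)[sapply(df1, is.numeric)]

# One mean per numeric column for each Aa/Bb combination
group_means <- aggregate(df1[num_cols], by = df1[c("Aa", "Bb")], FUN = mean)
print(group_means)
```

Because the split variables are passed separately through by, mean() is never called on a factor, which is what triggered the warnings in the question.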

Related

Equivalent of row_number for columns dplyr

I am trying to apply a function to the columns of a tibble or data.frame, depending on the column index. This has come up several times, and I give just one MWE:
library(tidyverse)
test <- data.frame(a = c(1,2,3), b = c(7,8,9), c = c(3,5,6))
test <- test %>% as_tibble() %>% mutate_all( ~lead(., 2))
This leads every column by 2 (just an example). But what I want is to lead the first column by 1, the second by 2, and so on, with something like mutate_all(~lead(., col_number())).
For this little example, I know one way to do it, like:
test <- as.matrix(test)
for (i in 1:ncol(test)){ test[,i] <- lead(test[,i], i) }
There might be other ways to do it too; I haven't thought about it much (one needs to convert to a matrix first, otherwise it doesn't produce the right result, and I don't really know why).
But I'd like to do it with mutate or apply, with a general way to get the column index, so that it also works for more complex examples.
Any idea?
One option is using purrr::map2_df to sequentially lead every column based on column number.
purrr::map2_df(test, seq_along(test), dplyr::lead)
# A tibble: 3 x 3
# a b c
# <dbl> <dbl> <dbl>
#1 2 9 NA
#2 3 NA NA
#3 NA NA NA
We can also use base R Map
test[] <- Map(function(x, y) c(tail(x, -y), rep(NA, y)), test, seq_along(test))
We can use data.table shift
library(data.table)
setDT(test)[, Map(shift, .SD, n = 1:3, type = 'lead')]
# a b c
#1: 2 9 NA
#2: 3 NA NA
#3: NA NA NA
Or using purrr
library(purrr)
map2_dfr(test, 1:3, ~shift(.x, n = .y, type = 'lead'))
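The same idea can be sketched as a plain loop without any packages. The helper lead_by below is hypothetical (written for this example, not part of dplyr or data.table). Using test[[i]] rather than test[, i] guarantees that a plain vector is extracted, which is probably why the matrix conversion was needed in the question: on a tibble, test[, i] returns a one-column tibble, not a vector.

```r
test <- data.frame(a = c(1, 2, 3), b = c(7, 8, 9), c = c(3, 5, 6))

# Hypothetical helper: drop the first n values and pad the end with NA,
# keeping the original length even when n >= length(x)
lead_by <- function(x, n) {
  c(x[-seq_len(n)], rep(NA, n))[seq_along(x)]
}

# Lead column i by i positions
for (i in seq_along(test)) {
  test[[i]] <- lead_by(test[[i]], i)
}
print(test)
```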

Generate a table of values for several columns

Say I have a dataframe or datatable.
For example:
try <- data.frame(AA=c(1,2,3,1,2,3,4,5,NA),BB=c(1,2,2,NA,
2,1,2,2,NA), CC=c("A","B", NA, NA, "A","B", "A","C","B"))
setDT(try)
AA BB CC
1 1 A
2 2 B
3 2 NA
1 NA NA
2 2 A
3 1 B
4 2 A
5 2 C
NA NA B
I want to summarize the values in order to export them to an Excel file for further manipulation later.
I could create a table for each column but, in real life, some variable could have too many different values (such as the weight or DOB of people).
I can get the first six values for a single column with:
table(try$BB, useNA ="ifany")
1 2 <NA>
2 5 2
But when I try to do it automatically for all the columns at once it doesn't work as expected:
try[,lapply(.SD,function(x) table(x,useNA="ifany")[1:6] )]
because the table() command generates a 2 rows result and only one is used to create the final summary table.
What procedure do you suggest to keep that information?
For example I could try to convert that single-variable tables to something like
"1":2 "2":5 "NA":2
But I don't know how to do it. Maybe converting it to factors, maybe pasting the values.
I'm not even able to extract the rows of the table for further manipulation.
Any solution with base data.frame or data.table is welcome.
Or I could even order that table to get the most common values first.
PS: I want something like this:
AA "1":2 "2":2 "3":2 "4":1 "5":1 "NA": 1
BB "1":2 "2":5 "NA": 2
CC "A":3 "B":3 "C":1 "NA": 2
PS2:
I've tried this
try[,lapply(.SD, function(x) { tmp <- table(x,
useNA ="ifany") ; mapply(paste0, names( tmp ),
rep(":", length(tmp)), tmp )} )
]
But it's too long and it doesn't work well
AA BB CC
1:2 1:2 A:3
2:2 2:5 B:3
3:2 NA:2 C:1
4:1 1:2 NA:2
5:1 1:2 A:3
NA:1 2:5 B:3
It fills the last values with fake values.
Another option would be to interleave the names and the values.
In this example I should get:
AA BB CC
"1:2" "1:2" "A:3"
"2:2" "2:5" "B:3"
"3:2" "NA:2" "C:1"
"4:1" NA "NA:2"
"5:1" NA NA
"NA:1" NA NA
The problem is that the list is converted internally to a datatable by the command as.data.table.list() and the different size vectors are recycled instead of filled with NAs.
You can get your desired output with
library(magrittr)
tab = try %>% lapply(table, useNA = "ifany")
len = max(lengths(tab))
tab %>% lapply(
. %>%
{ paste0(names(.), ":", .) } %>%
`length<-`(len)
) %>% setDF %>% print
AA BB CC
1 1:2 1:2 A:3
2 2:2 2:5 B:3
3 3:2 NA:2 C:1
4 4:1 <NA> NA:2
5 5:1 <NA> <NA>
6 NA:1 <NA> <NA>
I haven't learned purrr, but if you like using pipes, that might offer somewhat cleaner code.
This is my data.table solution with some ideas from Frank.
siz <- 6
try[,lapply(.SD, function(x) { tmp <- table(x,
useNA ="ifany") ; tmp2 <- c(tmp[is.na(names(tmp))],
rev(sort(tmp[!is.na(names(tmp))])));
tmp3 <- mapply(paste0, names( tmp2 ),rep(":",
length(tmp2)),tmp2); length(tmp3)<-siz; tmp3})
]
It always places the NAs at the beginning and orders the other elements from the most common to the least common.
Maybe there are some simpler ways to summarize the information.
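One possibly simpler variant, as a base R sketch: sort each frequency table by count, paste value and count together, and pad with NA. The names try_df and summarise_col are assumptions made for this example (try_df avoids masking base::try). Unlike the solution above, NA is sorted by its count rather than placed first.

```r
try_df <- data.frame(
  AA = c(1, 2, 3, 1, 2, 3, 4, 5, NA),
  BB = c(1, 2, 2, NA, 2, 1, 2, 2, NA),
  CC = c("A", "B", NA, NA, "A", "B", "A", "C", "B")
)

# Hypothetical helper: "value:count" strings, most common first,
# padded with NA (or truncated) to exactly n entries
summarise_col <- function(x, n = 6) {
  tab <- sort(table(x, useNA = "ifany"), decreasing = TRUE)
  out <- paste0(names(tab), ":", tab)
  length(out) <- n
  out
}

res <- as.data.frame(lapply(try_df, summarise_col), stringsAsFactors = FALSE)
print(res)
```

Setting the length explicitly on each column is what prevents the recycling seen in the question, since every column then has the same number of entries before the data.frame is built.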

How to apply a complex function across crossed levels of factors in data.frame (in R)?

I want to apply a function across the crossed levels of factors in a data.frame, similar to what aggregate would do, but for more complex functions than aggregate can handle.
For example.
fact1=c(rep('A',6),rep('B',6))
fact2=c(rep(c(rep('C',3),rep('D',3)),2))
crit1=rnorm(12)
crit2=crit1+rnorm(12)
dat=data.frame(fact1,fact2,crit1,crit2)
target.fit = function(dat){
mod=lm(dat$crit2~dat$crit1)
return(mod$coefficients[2])
}
This code generates a data.frame dat. The goal is to apply target.fit to each of the crossed levels of fact1 and fact2 (here an lm).
It is simple to do this for functions that require only one input vector such as the mean using aggregate.
> aggregate(dat,list(fact1=fact1,fact2=fact2),mean)
fact1 fact2 fact1 fact2 crit1 crit2
1 A C NA NA -0.5875951 -0.6048572
2 B C NA NA 0.3712372 0.9135742
3 A D NA NA -1.0163750 -2.4971846
4 B D NA NA 0.3937682 0.6227697
However, aggregate does not work for multi-variate inputs.
> aggregate(dat,list(fact1=fact1,fact2=fact2),target.fit)
Error in dat$crit2 : $ operator is invalid for atomic vectors
How can I solve this programming problem?
You could use the formula method to avoid getting NA column
aggregate(.~fact1+fact2, dat, FUN=mean)
For the custom function
library(data.table)#v1.9.5+
setDT(dat)[,target.fit(.SD) ,.(fact1, fact2)]
# fact1 fact2 V1
#1: A C 1.060835
#2: A D 1.259871
#3: B C 1.451595
#4: B D 1.766432
which is the same as
setDT(dat)[, coef(lm(crit2~crit1))[2] ,.(fact1, fact2)]
# fact1 fact2 V1
#1: A C 1.060835
#2: A D 1.259871
#3: B C 1.451595
#4: B D 1.766432
Or using dplyr
library(dplyr)
dat %>%
group_by(fact1, fact2) %>%
do(data.frame(V1=target.fit(.)))
# fact1 fact2 V1
#1 A C 1.060835
#2 A D 1.259871
#3 B C 1.451595
#4 B D 1.766432
A base R option is
sapply(split(dat, as.list(dat[paste0('fact',1:2)]), drop=FALSE), target.fit)
#A.C.dat$crit1 B.C.dat$crit1 A.D.dat$crit1 B.D.dat$crit1
# 1.060835 1.451595 1.259871 1.766432
Or
by(dat, list(dat$fact1, dat$fact2), FUN=target.fit)
To get factor levels in a data.frame,
do.call(rbind,by(dat, list(dat$fact1, dat$fact2),
FUN=function(x) cbind(x[1,1:2], V1=target.fit(x))))
NOTE: Used set.seed(24) as seed for creating the dat
In the days before data.table and dplyr, the standard method was lapply(split(data,factors),func)
> lapply( split( dat, list(fact1, fact2) ), target.fit)
$A.C
dat$crit1
1.328941
$B.C
dat$crit1
0.3281161
$A.D
dat$crit1
-0.10337
$B.D
dat$crit1
2.8962
The split function on a dataframe argument returns smaller dataframes composed of subsets based on the crossed factors arguments. If you needed it to be as a vector, the sapply function could be substituted for the lapply:
> sapply( split( dat, list(fact1, fact2) ), target.fit)
A.C.dat$crit1 B.C.dat$crit1 A.D.dat$crit1 B.D.dat$crit1
1.3289409 0.3281161 -0.1033700 2.8962000
I probably would have written the function to pass the dat argument to the data argument of lm:
target.fit = function(dat){
mod=lm(crit2~crit1, data=dat)
return(mod$coefficients[2])
}
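Putting the split approach and the data argument together, here is a base R sketch that returns one row per factor combination. The names slope_fit, pieces and slopes are assumptions for illustration, and the data is regenerated here, so the numbers will not match the outputs shown above.

```r
set.seed(24)
dat <- data.frame(
  fact1 = rep(c("A", "B"), each = 6),
  fact2 = rep(rep(c("C", "D"), each = 3), 2),
  crit1 = rnorm(12)
)
dat$crit2 <- dat$crit1 + rnorm(12)

# Slope of crit2 on crit1 within one subset; the data argument keeps
# the coefficient name clean ("crit1" rather than "dat$crit1")
slope_fit <- function(d) unname(coef(lm(crit2 ~ crit1, data = d))[2])

# One small data.frame per crossed level of fact1 and fact2
pieces <- split(dat, list(dat$fact1, dat$fact2), drop = TRUE)

slopes <- data.frame(
  fact1 = sapply(pieces, function(d) as.character(d$fact1[1])),
  fact2 = sapply(pieces, function(d) as.character(d$fact2[1])),
  slope = sapply(pieces, slope_fit),
  row.names = NULL
)
print(slopes)
```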

R help on aggregation function

for my question I created a dummy data frame:
set.seed(007)
DF <- data.frame(a = rep(LETTERS[1:5], each=2), b = sample(40:49), c = sample(1:10))
DF
a b c
1 A 49 2
2 A 43 3
3 B 40 7
4 B 47 1
5 C 41 9
6 C 48 8
7 D 45 6
8 D 42 5
9 E 46 10
10 E 44 4
How can I use the aggregation function on column a so that, for instance, for "A" the following value is calculated: (49-43) / (2+3)?
I started like:
aggregate(DF, by=list(DF$a), FUN=function(x) {
...
})
The problem I have is that I do not know how to access the 4 different cells 49, 43, 2 and 3
I tried x[[1]][1] and similar stuff but don't get it working.
Inside aggregate, the function FUN is applied independently to each column of your data. Here you want to use a function that takes two columns as inputs, so a priori, you can't use aggregate for that.
Instead, you can use ddply from the plyr package:
ddply(DF, "a", summarize, res = (b[1] - b[2]) / sum(c))
# a res
# 1 A 1.2000000
# 2 B -0.8750000
# 3 C -0.4117647
# 4 D 0.2727273
# 5 E 0.1428571
When you aggregate the FUN argument can be anything you want. Keep in mind that the value passed will either be a vector (if x is one column) or a little data.frame or matrix (if x is more than one). However, aggregate doesn't let you access the columns of a multi-column argument. For example.
aggregate( . ~ a, data = DF, FUN = function(x) diff(x[,1]) / sum(x[,2]) )
That fails with an error even though I used . (which takes all of the columns of DF that I'm not using elsewhere). To see what aggregate is trying to do there look at the following.
aggregate( . ~ a, data = DF, FUN = sum )
The two columns, b, and c, were aggregated but from the first attempt we know that you can't do something that accesses each column separately. So, strictly sticking with aggregate you need two passes and three lines of code.
diffb <- aggregate( b ~ a, data = DF, FUN = function(x) -diff(x) ) # b[1] - b[2]
Y <- aggregate( c ~ a, data = DF, FUN = sum )
Y$c <- diffb$b / Y$c
Now Y contains the result you want.
The by function is simpler than aggregate and all it does is split the original data.frame using the indices and then apply the FUN function.
l <- by( data = DF, INDICES = DF$a, FUN = function(x) -diff(x$b)/sum(x$c), simplify = FALSE )
unlist(l)
You have to do a little to get the result back into a data.frame if you really want one.
data.frame(a = names(l), x = unlist(l))
Using data.table could be faster and easier.
library(data.table)
DT <- data.table(DF)
DT[, (-1*diff(b))/sum(c), by=a]
a V1
1: A 1.2000000
2: B -0.8750000
3: C -0.4117647
4: D 0.2727273
5: E 0.1428571
Using aggregate is not so good. I didn't find a better way to do it using aggregate :( but here's an attempt.
B <- aggregate(DF$b, by=list(DF$a), diff)
C <- aggregate(DF$c, by=list(DF$a), sum)
data.frame(a=B[,1], Result=(-1*B[,2])/C[,2])
a Result
1 A 1.2000000
2 B -0.8750000
3 C -0.4117647
4 D 0.2727273
5 E 0.1428571
A data.table solution - for efficiency of time and memory.
library(data.table)
DT <- as.data.table(DF)
DT[, list(calc = -diff(b) / sum(c)), by = a]
You can use the base by() function:
listOfRows <-
by(data=DF,
INDICES=DF$a,
FUN=function(x){data.frame(a=x$a[1],res=(x$b[1] - x$b[2])/(x$c[1] + x$c[2]))})
newDF <- do.call(rbind,listOfRows)
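For comparison, the same calculation can be sketched with split() and sapply() in base R. The values of DF are typed in from the printout in the question, because sample() with a given seed can give different results across R versions; the names res and out are assumptions for this example.

```r
# Data copied from the printed DF in the question
DF <- data.frame(
  a = rep(LETTERS[1:5], each = 2),
  b = c(49, 43, 40, 47, 41, 48, 45, 42, 46, 44),
  c = c(2, 3, 7, 1, 9, 8, 6, 5, 10, 4)
)

# Each group g is a small data.frame, so both columns are available at once
res <- sapply(split(DF, DF$a), function(g) (g$b[1] - g$b[2]) / sum(g$c))

out <- data.frame(a = names(res), res = unname(res))
print(out)
```

For group "A" this gives (49 - 43) / (2 + 3) = 1.2, matching the ddply and data.table results above.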

Change the class from factor to numeric of many columns in a data frame

What is the quickest/best way to change a large number of columns to numeric from factor?
I used the following code but it appears to have re-ordered my data.
> head(stats[,1:2])
rk team
1 1 Washington Capitals*
2 2 San Jose Sharks*
3 3 Chicago Blackhawks*
4 4 Phoenix Coyotes*
5 5 New Jersey Devils*
6 6 Vancouver Canucks*
for(i in c(1,3:ncol(stats))) {
stats[,i] <- as.numeric(stats[,i])
}
> head(stats[,1:2])
rk team
1 2 Washington Capitals*
2 13 San Jose Sharks*
3 24 Chicago Blackhawks*
4 26 Phoenix Coyotes*
5 27 New Jersey Devils*
6 28 Vancouver Canucks*
What is the best way, short of naming every column as in:
df$colname <- as.numeric(df$colname)
You have to be careful while changing factors to numeric. Here is a line of code that would change a set of columns from factor to numeric. I am assuming here that the columns to be changed to numeric are 1, 3, 4 and 5 respectively. You could change it accordingly
cols = c(1, 3, 4, 5);
df[,cols] = apply(df[,cols], 2, function(x) as.numeric(as.character(x)));
Further to Ramnath's answer, the behaviour you are experiencing is due to as.numeric(x) returning the internal, numeric representation of the factor x at the R level. If you want to preserve the numbers that are the levels of the factor (rather than their internal representation), you need to convert to character via as.character() first, as per Ramnath's example.
Your for loop is just as reasonable as an apply call and might be slightly more readable as to what the intention of the code is. Just change this line:
stats[,i] <- as.numeric(stats[,i])
to read
stats[,i] <- as.numeric(as.character(stats[,i]))
This is FAQ 7.10 in the R FAQ.
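A small sketch of the difference, using a made-up factor f (the values are an assumption for illustration):

```r
# A factor whose levels happen to look like numbers
f <- factor(c("10", "20", "10", "35"))

# Direct conversion returns the internal integer codes of the levels
codes <- as.numeric(f)

# Going through character recovers the numbers the levels represent
values <- as.numeric(as.character(f))

# Equivalent alternative: convert the levels once, then index by the codes
values_alt <- as.numeric(levels(f))[f]

print(codes)
print(values)
```

Here codes is 1 2 1 3 (positions in the sorted levels "10", "20", "35"), while values is 10 20 10 35. This is the same effect that turned rank 1 into 2 in the question.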
HTH
This can be done in one line, there's no need for a loop, be it a for-loop or an apply. Use unlist() instead :
# testdata
Df <- data.frame(
x = as.factor(sample(1:5,30,r=TRUE)),
y = as.factor(sample(1:5,30,r=TRUE)),
z = as.factor(sample(1:5,30,r=TRUE)),
w = as.factor(sample(1:5,30,r=TRUE))
)
##
Df[,c("y","w")] <- as.numeric(as.character(unlist(Df[,c("y","w")])))
str(Df)
Edit : for your code, this becomes :
id <- c(1,3:ncol(stats))
stats[,id] <- as.numeric(as.character(unlist(stats[,id])))
Obviously, if you have a one-column data frame and you don't want the automatic dimension reduction of R to convert it to a vector, you'll have to add the drop=FALSE argument.
I know this question is long resolved, but I recently had a similar issue and think I've found a little more elegant and functional solution, although it requires the magrittr package.
library(magrittr)
cols = c(1, 3, 4, 5)
df[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))
The %<>% operator pipes and reassigns, which is very useful for keeping data cleaning and transformation simple. Now the list apply function is much easier to read, by only specifying the function you wish to apply.
Here are some dplyr options:
# by column type:
df %>%
mutate_if(is.factor, ~as.numeric(as.character(.)))
# by specific columns:
df %>%
mutate_at(vars(x, y, z), ~as.numeric(as.character(.)))
# all columns:
df %>%
mutate_all(~as.numeric(as.character(.)))
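If you would rather not depend on dplyr, the "by column type" option can be sketched in base R roughly as follows. The example data and the name is_fac are assumptions for illustration.

```r
# Made-up data: two factor columns holding numbers, one character column
df <- data.frame(
  id    = factor(c("3", "1", "2")),
  score = factor(c("10.5", "7", "8.25")),
  label = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

# Find the factor columns, then convert only those
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], function(col) as.numeric(as.character(col)))

str(df)
```

Non-factor columns such as label are left untouched, which is the same behaviour as mutate_if(is.factor, ...).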
I think that ucfagls found why your loop is not working.
In case you still don't want use a loop here is solution with lapply:
factorToNumeric <- function(f) as.numeric(levels(f))[as.integer(f)]
cols <- c(1, 3:ncol(stats))
stats[cols] <- lapply(stats[cols], factorToNumeric)
Edit: I found a simpler solution. It seems that as.matrix converts to character, so
stats[cols] <- as.numeric(as.matrix(stats[cols]))
should do what you want.
lapply is pretty much designed for this
unfactorize<-c("colA","colB")
df[,unfactorize]<-lapply(unfactorize, function(x) as.numeric(as.character(df[,x])))
I found this function on a couple of other duplicate threads and have found it an elegant and general way to solve this problem. This thread shows up first on most searches on this topic, so I am sharing it here to save folks some time. I take no credit for this; see the original posts for details.
df <- data.frame(x = 1:10,
y = rep(1:2, 5),
k = rnorm(10, 5,2),
z = rep(c(2010, 2012, 2011, 2010, 1999), 2),
j = c(rep(c("a", "b", "c"), 3), "d"))
convert.magic <- function(obj, type){
FUN1 <- switch(type,
character = as.character,
numeric = as.numeric,
factor = as.factor)
out <- lapply(obj, FUN1)
as.data.frame(out)
}
str(df)
str(convert.magic(df, "character"))
str(convert.magic(df, "factor"))
df[, c("x", "y")] <- convert.magic(df[, c("x", "y")], "factor")
I would like to point out that if you have NAs in any column, simply using subscripts will not work. If there are NAs in the factor, you must use the apply script provided by Ramnath.
E.g.
Df <- data.frame(
x = c(NA,as.factor(sample(1:5,30,r=T))),
y = c(NA,as.factor(sample(1:5,30,r=T))),
z = c(NA,as.factor(sample(1:5,30,r=T))),
w = c(NA,as.factor(sample(1:5,30,r=T)))
)
Df[,c(1:4)] <- as.numeric(as.character(Df[,c(1:4)]))
Returns the following:
Warning message:
NAs introduced by coercion
> head(Df)
x y z w
1 NA NA NA NA
2 NA NA NA NA
3 NA NA NA NA
4 NA NA NA NA
5 NA NA NA NA
6 NA NA NA NA
But:
Df[,c(1:4)]= apply(Df[,c(1:4)], 2, function(x) as.numeric(as.character(x)))
Returns:
> head(Df)
x y z w
1 NA NA NA NA
2 2 3 4 1
3 1 5 3 4
4 2 3 4 1
5 5 3 5 5
6 4 2 4 4
You can use the unfactor() function from the "varhandle" package from CRAN:
library("varhandle")
my_iris <- data.frame(Sepal.Length = factor(iris$Sepal.Length),
sample_id = factor(1:nrow(iris)))
my_iris <- unfactor(my_iris)
I like this code because it's pretty handy:
data[] <- lapply(data, function(x) type.convert(as.character(x), as.is = TRUE)) #change all vars to their best fitting data type
It is not exactly what was asked for (convert to numeric), but in many cases even more appropriate.
Based on @SDahm's answer, this was an "optimal" solution for my tibble:
data %<>% lapply(type.convert) %>% as.data.table()
This requires data.table (for as.data.table()) and magrittr.
I tried a bunch of these on a similar problem and kept getting NAs. Base R has some really irritating coercion behaviors, which are generally fixed in Tidyverse packages. I used to avoid them because I didn't want to create dependencies, but they make life so much easier that now I don't even bother trying to figure out the Base R solution most of the time.
Here's the Tidyverse solution, which is extremely simple and elegant:
library(purrr)
mydf <- data.frame(
x1 = factor(c(3, 5, 4, 2, 1)),
x2 = factor(c("A", "C", "B", "D", "E")),
x3 = c(10, 8, 6, 4, 2))
map_df(mydf, as.numeric)
df$colname <- as.numeric(df$colname)
I tried this for changing a single column's type, and I think it is better than many other versions if you are not going to change all column types.
df$colname <- as.character(df$colname)
for the reverse.
I had problems converting all columns to numeric with an apply() call:
apply(data, 2, as.numeric)
The problem turns out to be because some of the strings had a comma in them -- e.g. "1,024.63" instead of "1024.63" -- and R does not like this way of formatting numbers. So I removed them and then ran as.numeric():
data = as.data.frame(apply(data, 2, function(x) {
y = str_replace_all(x, ",", "") #remove commas
return(as.numeric(y)) #then convert
}))
Note that this requires the stringr package to be loaded.
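The same cleanup can be sketched without stringr by using gsub() from base R. The data frame raw and its column names are made up for this example.

```r
# Made-up data with thousands separators stored as strings
raw <- data.frame(
  price = c("1,024.63", "12.5"),
  qty   = c("3", "1,000"),
  stringsAsFactors = FALSE
)

# Strip the commas from every column, then convert to numeric
clean <- as.data.frame(
  lapply(raw, function(x) as.numeric(gsub(",", "", x, fixed = TRUE)))
)
str(clean)
```

Using lapply() instead of apply() avoids the intermediate matrix. This assumes the comma is only ever a thousands separator; data that uses a decimal comma would need different handling.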
That's what worked for me. The apply() function tries to coerce df to a matrix, and it returned NAs.
numeric.df <- as.data.frame(sapply(df, as.numeric))
