Appending data frames based on a function in R

How do I append data frames one after the other to form another data frame?
Whether a data frame is included or not will be decided by a criterion.
Here is some example data:
d1 <- data.frame(MyGroups = sample(LETTERS, 100, replace = TRUE),
                 MyInt = sample(1:20, 100, replace = TRUE))
Now, how should I choose the groups (A, B, C, ...) from MyGroups whose mean of MyInt is greater than 10?
I tried the following without success. Here, I am appending the data frame to a file based on the given criterion.
require("plyr")
keepGrp <- function(df0) {
  if(max(df0$MyInt < 10)) {df0 <- NULL}
  write.csv(df0, 'mytable.txt', append = TRUE, sep = ',')
}
ddply(d1, .(MyInt), function(x) keepGrp(x))
The desired data frame should end up in the file mytable.txt.
I am quite sure there is a better way to do what I am trying to do.
I would be happy to clarify my question if needed.
I would appreciate it if someone could (1) give me feedback on improving my approach and (2) give me a solution to my problem.

If I understand your question properly, you want to calculate the mean by group and write only the groups that meet a certain threshold to a pre-existing file. If so, why not calculate all the means at once, subset on them, then write the result out? Here's a one-liner that should probably be split into multiple steps, but I think you'll get the point:
write.table(
  subset(
    ddply(d1, "MyGroups", transform, meanval = mean(MyInt)),
    meanval > 10
  ),
  "yourcsv.csv", append = TRUE, sep = ",", col.names = FALSE
)
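Split into named steps, the same pipeline reads more easily (a sketch under the same assumptions; "yourcsv.csv" is just a placeholder file name):
means <- ddply(d1, "MyGroups", transform, meanval = mean(MyInt))  # add the per-group mean to every row
keep <- subset(means, meanval > 10)                               # keep only qualifying groups
write.table(keep, "yourcsv.csv", append = TRUE, sep = ",",
            col.names = FALSE, row.names = FALSE)                 # append to the file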

It is simpler than you are making it. The function called by ddply can either return the subset of data if the criterion is met, or an empty data.frame if not.
keepGrp <- function(df0) {
  if(mean(df0$MyInt) > 10) {
    df0
  } else {
    data.frame()
  }
}
res <- ddply(d1, .(MyGroups), keepGrp)
Note that the test inside your keepGrp was wrong (it didn't test the mean of the MyInt values) and the grouping of the ddply was wrong (it should be MyGroups, not MyInt).
Checking that this is right:
> ddply(d1, .(MyGroups), summarise, ave = mean(MyInt))
MyGroups ave
1 A 14.200000
2 B 9.600000
3 C 5.600000
4 D 5.600000
5 E 8.000000
6 F 10.500000
7 G 7.333333
8 H 12.000000
9 I 7.333333
10 J 9.500000
11 K 11.000000
12 L 12.375000
13 M 13.250000
14 N 12.000000
15 O 11.666667
16 P 8.625000
17 Q 13.000000
18 R 6.000000
19 S 16.000000
20 T 12.000000
21 U 12.000000
22 V 13.250000
23 W 17.666667
24 X 9.000000
25 Y 12.400000
26 Z 13.750000
> unique(res$MyGroups)
[1] A F H K L M N O Q S T U V W Y Z
Levels: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
So the ones that appear in res are those that have the appropriate mean value for MyInt.
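As an aside, the same filtering can be done in one line of base R with no plyr dependency (my addition, not part of the original answer):
# ave() repeats each group's mean of MyInt on every row of that group,
# so this keeps exactly the rows whose group mean exceeds 10
# (rows stay in d1's original order rather than sorted by group, unlike ddply)
res2 <- d1[ave(d1$MyInt, d1$MyGroups) > 10, ]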


faster way to multiply each entry by a column

I have a data frame g with 3 columns: a, b, and X. I need to multiply X by each entry in a and add b to create a new variable. Right now I'm using a for loop
for (i in 1:N) {
  g$Eout[i] <- mean((g$a[i] * g$X + g$b[i] - (g$X)^2)^2)
}
which is really slow in R. Is there any way to do this faster?
Try this:
set.seed(2)
N <- 30
g <- data.frame(a=1:N,b=seq(1,2,length.out=N),X=seq(10,20,length.out=N))
g$new <- sapply(g$X, function(x) mean((g$a * x + g$b - x^2)^2))
head(g)
# a b X new
# 1 1 1.000000 10.00000 10735.67
# 2 2 1.034483 10.34483 11077.04
# 3 3 1.068966 10.68966 11416.58
# 4 4 1.103448 11.03448 11757.01
# 5 5 1.137931 11.37931 12101.40
# 6 6 1.172414 11.72414 12453.14
Since you want each value of X multiplied by all values of g$a (and so on), you need to resort to some vectorized goodness. (Using @thelatemail's suggested N of 3e4, this takes about 7 sec per sapply ...)
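If you want to drop the implicit loop in sapply as well, a fully vectorized sketch (my addition, not from the original answer) builds the whole N-by-N grid with outer:
# outer(g$a, g$X)[i, j] is g$a[i] * g$X[j]; adding g$b recycles column-wise,
# so b[i] lands on row i, and the byrow matrix puts X[j]^2 in column j
m <- outer(g$a, g$X) + g$b - matrix(g$X^2, nrow = nrow(g), ncol = nrow(g), byrow = TRUE)
g$new2 <- rowMeans(m^2)  # matches the sapply result, at the cost of an N x N matrix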

R expand.grid with row restrictions

I have a numeric vector x of length N and would like to create a vector of the within-set sums of all of the following sets: any possible combination of the x elements with at most M elements in each combination. I put together a slow iterative approach; what I am looking for here is a way without using any loops.
Consider the approach I have been taking, in the following example with N=5 and M=4
M <- 4
x <- 11:15
y <- as.matrix(expand.grid(rep(list(0:1), length(x))))
result <- y[rowSums(y) <= M, ] %*% x
However, as N gets large (above 22 for me), the expand.grid output becomes too big and gives an error (replace x above with x <- 11:55 to observe this). Ideally there would be an expand.grid function that permits restrictions on the rows before constructing the full matrix, which (at least for what I want) would keep the matrix size within memory limits.
Is there a way to achieve this without causing problems for large N?
Your problem has to do with the sheer number of combinations.
What you appear to be doing is listing all different combinations of 0's and 1's in a sequence the length of x.
In your example x has length 5, giving 2^5 = 32 combinations.
When x has length 22 you have 2^22 = 4194304 combinations.
Couldn't you use a binary encoding instead?
In your case that would mean
0 stands for 00000
1 stands for 00001
2 stands for 00010
3 stands for 00011
...
It will not solve your problem completely, but you should be able to get a bit further than you can now.
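A minimal sketch of that encoding idea (my addition; it assumes N is small enough that the 2^N codes fit in memory, and N <= 31 so each subset fits in one R integer):
N <- 22; M <- 4; x <- 1:N
codes <- 0:(2^N - 1)                                      # each integer encodes one subset of x
nbits <- function(v) sum(as.integer(intToBits(v))[1:N])   # subset size (popcount of the code)
small <- Filter(function(v) nbits(v) <= M, codes)         # codes for subsets with at most M elements
sums <- vapply(small, function(v) sum(x[as.logical(intToBits(v))[1:N]]), numeric(1))
This still enumerates all 2^N codes, so it is slow, but it never materializes the 2^N x N matrix that makes expand.grid fail.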
Try this:
c(0, unlist(lapply(1:M, function(k) colSums(combn(x, k)))))
It generates the same result as with your expand.grid approach, shown below for the test data.
M <- 4
x <- 11:15
# expand.grid approach
y <- as.matrix(expand.grid(rep(list(0:1), length(x))))
result <- y[rowSums(y) <= M, ] %*% x
# combn approach
result1 <- c(0, unlist(lapply(1:M, function(k) colSums(combn(x, k)))))
all(sort(result[,1]) == sort(result1))
# [1] TRUE
This should be fast (it takes 0.227577 secs on my machine, with N=22, M=4):
x <- 1:22 # N = 22
M <- 4
c(0, unlist(lapply(1:M, function(k) colSums(combn(x, k)))))
# [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 3 4 5 6 7
You may want to keep only the unique values of the sums with
unique(c(0, unlist(lapply(1:M, function(k) colSums(combn(x, k))))))

Calculating the mode or 2nd/3rd/4th most common value

Surely there has to be a function out there in some package for this?
I've searched and I've found this function to calculate the mode:
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
But I'd like a function that lets me easily calculate the 2nd/3rd/4th/nth most common value in a column of data.
Ultimately I will apply this function to a large number of dplyr::group_by()s.
Thank you for your help!
Maybe you could try
f <- function (x) with(rle(sort(x)), values[order(lengths, decreasing = TRUE)])
This gives the unique values of the vector sorted by decreasing frequency. The first is the mode, the second is the 2nd most common value, and so on.
Another method is based on table():
g <- function (x) as.numeric(names(sort(table(x), decreasing = TRUE)))
But this is not recommended, as the input vector x will be coerced to a factor first. If you have a large vector, this is very slow. Also, on exit we have to extract the character names of the table and coerce them back to numeric.
Example
set.seed(0); x <- rpois(100, 10)
f(x)
# [1] 11 12 7 9 8 13 10 14 5 15 6 2 3 16
Let's compare with the contingency table from table:
tab <- sort(table(x), decreasing = TRUE)
# 11 12 7 9 8 13 10 14 5 15 6 2 3 16
# 14 14 11 11 10 10 9 7 5 4 2 1 1 1
as.numeric(names(tab))
# [1] 11 12 7 9 8 13 10 14 5 15 6 2 3 16
So the results are the same.
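To get the nth most common value directly, just index the result of f (a trivial usage example, my addition):
f(x)[2]  # second most common value; f(x)[1] would be the mode
# [1] 12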
Here is an R function that I made (inspired by several other SO posts), which may work for your goal (and I use a local dataset on religious affiliation to illustrate it):
It's simple; only R base functions are involved: length, match, sort, tabulate, table, unique, which, as.character.
Find_Nth_Mode <- function(d, N = 2) {
  # Nth largest element of x, via a partial sort
  maxN <- function(x, N) {
    len <- length(x)
    if (N > len) {
      warning('N greater than length(x). Setting N=length(x)')
      N <- length(x)
    }
    sort(x, partial = len - N + 1)[len - N + 1]
  }
  ux <- unique(as.character(d))   # unique values
  a1 <- tabulate(match(d, ux))    # frequency of each unique value
  a2 <- maxN(a1, N)               # Nth highest frequency
  a3 <- which(a1 == a2)           # which value(s) have that frequency
  ux[a3]
}
Sample Output
> table(religion_data$relig11)
0.None 1.Protestant_Conservative 2.Protestant_Liberal 3.Catholic
34486 6134 19678 36880
4.Orthodox 5.Islam_Sunni 6.Islam_Shia 7.Hindu
20702 28170 668 4653
8.Buddhism 9.Jewish 10.Other
9983 381 6851
> Find_Nth_Mode(religion_data$relig11, 1)
[1] "3.Catholic"
> Find_Nth_Mode(religion_data$relig11, 2)
[1] "0.None"
> Find_Nth_Mode(religion_data$relig11, 3)
[1] "5.Islam_Sunni"
Reference:
I want to express my gratitude to these posts, from which I got the two functions that I integrated into one:
function to find the Nth largest value: Fastest way to find second (third...) highest/lowest value in vector or column
how to find the second largest mode value: Calculating the mode or 2nd/3rd/4th most common value

R associative memory doesn't work as expected

I am trying to use associative memory and ddply to add a column to a data frame. For example:
First, I have defined an association and a function that uses the association to calculate the product of two elements of a row (property damage and multiplier) to get the actual damage in dollars. Here, "B" means billions, "m" or "M" means millions, etc.
validMultiplierLetter <- c("B", "h", "H", "k", "K", "m", "M")
Multiplier <- c(1000000000, 100, 100, 1000, 1000, 1000000, 1000000)
names(Multiplier) <- validMultiplierLetter
The function ploss (property loss) is:
ploss <- function(pd, pm) {
  if (pm %in% validMultiplierLetter) pd * Multiplier[pm]
  else 0
}
Here is a sample data frame with columns pd (property damage) and pm (multiplier), and ddply code to create a pl (property loss) column, which is the product of the property damage and the associated multiplier value. Invalid multipliers are treated as 0 (e.g., "+").
tdf <- data.frame(pd = c(5, 10, 15, 20, 25), pm = c("B", "m", "K", "+", "h"))
tldf <- ddply(tdf, .(pd, pm), transform, pl = ploss(pd,pm))
I get the following output when I execute the code above - you can see that the right multiplier was not used for several of the rows.
> tldf
pd pm pl
1 5 B 500
2 10 m 10000
3 15 K 15000
4 20 + 0
5 25 h 2500
Strangely, though, when you pass a constant, the multiplier works correctly. But when you pass a variable (whose value is the same as the constant), for some reason you get an incorrect result.
> Multiplier["B"]
B
1e+09
> tdf$pm[1]
[1] B
Levels: + B h K m
> Multiplier[tdf$pm[1]]
h
100
Any explanation of why this happens and how to fix it is greatly appreciated. Thanks.
The problem is that tdf$pm is a factor. When presented a factor, [ will use the factor levels rather than the character values:
x <- 10:15
names(x) <- LETTERS[1:6]
x
## A B C D E F
## 10 11 12 13 14 15
x[c('A','F')] # Lookup by name
## A F
## 10 15
x[factor(c('A','F'))] # Lookup by integer
## A B
## 10 11
This is fixed by using as.character around the factor, so that a character vector is presented to [:
x[as.character(factor(c('A','F')))]
## A F
## 10 15
For your problem, you can coerce to character in the transform function:
ddply(tdf, .(pd, pm), transform, pl = ploss(pd,as.character(pm)))
## pd pm pl
## 1 5 B 5.0e+09
## 2 10 m 1.0e+07
## 3 15 K 1.5e+04
## 4 20 + 0.0e+00
## 5 25 h 2.5e+03
In addition, you could vectorize your ploss function in the obvious way and do the job directly with transform:
ploss <- function(pd, pm) {
  ifelse(pm %in% validMultiplierLetter, pd * Multiplier[pm], 0)
}
transform(tdf, pl=ploss(pd, as.character(pm)))
## pd pm pl
## 1 5 B 5.0e+09
## 2 10 m 1.0e+07
## 3 15 K 1.5e+04
## 4 20 + 0.0e+00
## 5 25 h 2.5e+03
And of course, the as.character coercion could be within the function ploss, so it isn't required in the transform call:
ploss <- function(pd, pm) {
  ifelse(pm %in% validMultiplierLetter, pd * Multiplier[as.character(pm)], 0)
}
The problem I see is that, if you're using the default R options, tdf$pm is a factor, not a character vector. You can check this with class(tdf$pm). What's happening here is that "B" is really a mask for 2 (following the order in the printout: Levels: + B h K m), so pm has the value 2 as far as [ is concerned, and Multiplier[2] is 100, as you've assigned.
When you call data.frame (or read.table) you need to add the argument stringsAsFactors = FALSE, or change the corresponding global option with the options function.
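For example (a quick check; note that since R 4.0.0 data.frame() defaults to stringsAsFactors = FALSE, so this is only needed on older versions):
tdf <- data.frame(pd = c(5, 10, 15, 20, 25),
                  pm = c("B", "m", "K", "+", "h"),
                  stringsAsFactors = FALSE)
class(tdf$pm)          # "character", so [ now looks names up correctly
Multiplier[tdf$pm[1]]
#     B
# 1e+09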

Apply over two data frames

I'm using R, and I have two data.frames, A and B. They both have 6 rows, but A has 25000 columns (genes), and B has 30 columns. I'd like to apply a function with two arguments f(x,y) where x is every column of A and y is every column of B. So far it looks like this:
i = 1
for (x in A) {
  j = 1
  for (y in B) {
    out[i, j] <- f(x, y)
    j = j + 1
  }
  i = i + 1
}
I have two issues with this: from my Python programming I think of keeping track of counters like this as crufty, and from my R programming I am nervous of for loops. However, I can't quite see how to apply apply (or even whether I should apply apply) to this problem and was hoping someone might enlighten me. I need to treat f() as atomic (it's actually cor.test()) for now.
Since you are using data frames, it might be faster to use lapply or sapply to do this (especially given the size of your data frames). For example,
x <- data.frame(col1 = c(1,2,3,4), col2 = c(5,6,7,8), col3 = c(9,10,11,12))
y <- data.frame(col1 = c(1,2,3,4), col2 = c(5,6,7,8))
bl <- lapply(x, function(u) {
  lapply(y, function(v) {
    f(u, v)  # function with a column from x and a column from y as inputs
  })
})
out <- matrix(unlist(bl), ncol = ncol(y), byrow = TRUE)
Some data
nrows <- 6
A <- data.frame(a = runif(nrows), b = runif(nrows), c = runif(nrows))
B <- data.frame(z = rnorm(nrows), y = rnorm(nrows))
The trick: remember columns with expand.grid
counter <- expand.grid(seq_along(A), seq_along(B))
f <- function(x) {
  cor.test(A[, x["Var1"]], B[, x["Var2"]])$estimate
}
Now we only need 1 call to apply.
stats <- apply(counter, 1, f)
names(stats) <- paste(names(A)[counter$Var1], names(B)[counter$Var2], sep = ",")
stats
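Because Var1 varies fastest in the expand.grid output, the flat result can also be folded back into an ncol(A) by ncol(B) matrix if that shape is more convenient (my addition, not part of the original answer):
# stats is already in column-major order for this layout
stat_mat <- matrix(stats, nrow = ncol(A), ncol = ncol(B),
                   dimnames = list(names(A), names(B)))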
Nesting the applies works, though it's not the easiest syntax.
x <- data.frame(col1 = c(1,2,3,4), col2 = c(5,6,7,8), col3 = c(9,10,11,12))
y <- data.frame(col1 = c(1,2,3,4), col2 = c(5,6,7,8))
z <- apply(x, 2, function(col, df2) {
  apply(df2, 2, function(col2, col1) {
    col2 + col1
  }, col)
}, y)
z
z
col1 col2 col3
[1,] 2 6 10
[2,] 4 8 12
[3,] 6 10 14
[4,] 8 12 16
[5,] 6 10 14
[6,] 8 12 16
[7,] 10 14 18
[8,] 12 16 20
