Using sapply and tapply - r

I struggle to correct my code. Given this data frame
S <- data.frame(Z1=c("A","A","A","D","D","A","A","A"),
Z2=c("A","A","A","D","D","C","C","D"),
K1=c(24,36,44,63,34,26,19,23),
K2=c(12,24,13,16,23,25,12,34))
I applied this transformation:
B <- sapply(1:2, function(x) {
x1 <- S[c(x, x+2)]
tapply(x1[,2], x1[,1], FUN=function(S) ceiling(median(S)))
})
colnames(B) <- c("G1","G2")
which I expected to set B to
G1 G2
C 0 19
B 0 0
A 25 13
D 49 23
but instead I get this error:
Error in `colnames<-`(`*tmp*`, value = c("G1", "G2")) :
attempt to set 'colnames' on an object with less than two dimensions

One of your problems is that R has no idea you consider Z1 and Z2 to be categorical variables that can take values A, B, C, D. The way you tell it this is with the factor type.
S <- data.frame(Z1=c("A","A","A","D","D","A","A","A"),
Z2=c("A","A","A","D","D","C","C","D"),
K1=c(24,36,44,63,34,26,19,23),
K2=c(12,24,13,16,23,25,12,34))
S$Z1 <- factor(S$Z1, levels=c("A", "B", "C", "D"))
S$Z2 <- factor(S$Z2, levels=c("A", "B", "C", "D"))
Notice how I have to explicitly spell out that all four of A, B, C, D are possible even though not all of them appear. Having done that, your transformation function produces a 2D matrix to which colnames can be applied.
B <- sapply(1:2, function(x) {
x1 <- S[c(x, x+2)]
tapply(x1[,2], x1[,1], FUN=function(S) ceiling(median(S)))
})
colnames(B) <- c("G1","G2")
However, you don't get zeroes where you expected them to be, you get NAs:
> B
G1 G2
A 25 13
B NA NA
C NA 19
D 49 23
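This is because tapply fills the cells for factor levels that have no observations with NA; it never calls your function on an empty group, and the median of an empty set would be undefined anyway. You can paper over that with is.na: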
> B[is.na(B)] <- 0
> B
G1 G2
A 25 13
B 0 0
C 0 19
D 49 23
Also, indexing with S[c(x, x+2)] is extremely brittle, because it silently depends on the column order of S, and I would not rely on it in production code if I were you. The same goes for sapply(1:2, function(x) ...) where the function operates on global variables.
You may find the reshape2 package easier to persuade to do what you want.
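As a minimal sketch of a less brittle version (the names `pairs`, `lv` and `group_median` are made up for illustration and are not part of the original code), each grouping column can be paired with its value column by name, and the data passed in explicitly instead of being read from the global environment:

```r
S <- data.frame(
  Z1 = c("A", "A", "A", "D", "D", "A", "A", "A"),
  Z2 = c("A", "A", "A", "D", "D", "C", "C", "D"),
  K1 = c(24, 36, 44, 63, 34, 26, 19, 23),
  K2 = c(12, 24, 13, 16, 23, 25, 12, 34)
)

# All categories that may occur, including ones absent from the data
lv <- c("A", "B", "C", "D")

# Hypothetical pairing of each output column with its group/value columns
pairs <- list(
  G1 = c(group = "Z1", value = "K1"),
  G2 = c(group = "Z2", value = "K2")
)

# Rounded-up median of the value column per group level; empty levels become 0
group_median <- function(p, data, levels) {
  g <- factor(data[[p[["group"]]]], levels = levels)
  out <- tapply(data[[p[["value"]]]], g, function(v) ceiling(median(v)))
  out[is.na(out)] <- 0
  out
}

# Column names come from names(pairs), row names from the factor levels
B <- sapply(pairs, group_median, data = S, levels = lv)
```

Because the pairing is by name, reordering the columns of S does not change the result.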

Related

R: How to slice a window of elements in named vector

Given the following named vector:
x <- c(54, 36, 67, 25, 76)
names(x) <- c('a', 'b', 'c', 'd', 'e')
How can one extract the elements between 'b' and 'd'? I can do that for data tables with dplyr::select(dt, b:d), but for some reason I cannot find a solution for named vectors (all the examples I find extract elements by giving all the names, not a range of names).
You could do
x[which(names(x) == "b"):which(names(x) == "d")]
#> b c d
#> 36 67 25
The problem being that there is no guarantee in a named vector that names are unique, and if there are duplicate names the entire concept becomes meaningless.
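To illustrate that caveat, here is a small sketch with a made-up vector `y` (not from the question) in which the name "b" occurs twice, so the start of the range is ambiguous:

```r
# Hypothetical vector with a duplicated name
y <- c(a = 1, b = 2, c = 3, b = 4, d = 5)

# Two candidate start positions instead of one
starts <- which(names(y) == "b")

# A quick guard to run before slicing by a name range
has_duplicates <- anyDuplicated(names(y)) > 0
```

Checking `has_duplicates` first is one way to fail early rather than slice from an arbitrary match.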
If you wanted a complete solution that allows for tidyverse-style non-standard evaluation and sensible error messages you could have
subset_named <- function(data, exp)
{
if(missing(exp)) return(data)
exp <- as.list(match.call())$exp
if(is.numeric(exp)) return(data[exp])
if(is.character(exp)) return(data[exp])
tryCatch({
ss <- suppressWarnings(eval(exp))
return(data[ss])},
error = function(e)
{
if(as.character(exp[[1]]) != ":")
stop("`exp` must be a sequence created by ':'")
n <- names(data)
first <- as.character(exp[[2]])
second <- as.character(exp[[3]])
first_match <- which(n == first)
second_match <- which(n == second)
if(length(first_match) == 0)
stop("\"", first, "\" not found in names(",
deparse(substitute(data)), ")")
if(length(second_match) == 0)
stop("\"", second, "\" not found in names(",
deparse(substitute(data)), ")")
if(length(first_match) > 1) {
warning("\"", first,
"\" found more than once. Using first occurrence only")
first_match <- first_match[1]
}
if(length(second_match) > 1) {
warning("\"", second,
"\" found more than once. Using first occurrence only")
second_match <- second_match[1]
}
return(data[first_match:second_match])
})
}
That allows the following behaviour:
subset_named(x, "b":"d")
#> b c d
#> 36 67 25
subset_named(x, b:d)
#> b c d
#> 36 67 25
subset_named(x, 1:3)
#> a b c
#> 54 36 67
subset_named(x, "e")
#> e
#> 76
subset_named(x)
#> a b c d e
#> 54 36 67 25 76
One option could be:
x[Reduce(`:`, which(names(x) %in% c("b", "d")))]
b c d
36 67 25
You can use match in base R :
x[match('b', names(x)):match('d', names(x))]
# b c d
#36 67 25
Or, if you want to use something like b:d, convert the vector into a one-row data frame so that its names become columns:
library(dplyr)
t(x) %>%
as.data.frame() %>%
select(b:d)
1) subset In base R this can be done using the select argument of subset. The only catch is that only the data.frame method of subset supports the select argument but we can convert x to a data.frame and then convert back. It also allows more complex specifications such as c(b:d, d) .
unlist(subset(data.frame(as.list(x)), select = b:d))
## b c d
## 36 67 25
2) evalq Another base R possibility is to create a list with the values 1, 2, 3, ... and the same names as x and then evaluate b:d with respect to it giving the desired indexes which can then be indexed into x. This also allows complex specifications as in (1).
x[ evalq(b:d, setNames(as.list(seq_along(x)), names(x))) ]
## b c d
## 36 67 25
We could turn this into a function like this:
sel <- function(x, select, envir = parent.frame()) {
ix <- setNames(as.list(seq_along(x)), names(x))
x[ eval(substitute(select), ix, envir) ]
}
sel(x, b:d)
sel(x, c(b:c, d))
sel(x, d:b) # reverse order
3) logical condition Again with only base R, if the names are in sorted order, as in the question, then we can check for names between the endpoints:
x[names(x) >= "b" & names(x) <= "d"]
## b c d
## 36 67 25
4) zoo If the names are in ascending order, as in the question, we could create a zoo series with those names as the times and then use window.zoo to pick out the subseries and finally convert back.
library(zoo)
coredata(window(zoo(x, names(x)), start = "b", end = "d"))
## b c d
## 36 67 25

iterating table() results into matrix/data frame

This must be simple, but I've been banging my head against it for a while. Please help. I have a large data set from which I get all kinds of information via table(). I then want to store these counts, with the rownames that were counted. For a reproducible example consider
a <- c("a", "b", "c", "d", "a", "b") # one count, occurring twice for a and
# b and once for c and d
b <- c("a", "c") # a completely different property from the dataset
# occurring once for a and c
x <- table(a)
y <- table(b) # so now x and y hold the information I seek
How can I merge/bind/whatever to get from x and y to this form:
  x y
a 2 1
b 2 0
c 1 1
d 1 0
HOWEVER, I need to use the solution to work iteratively, in a loop that takes x and y and gets the requested form above, and then gets more tables added, each hopefully adding a column. One of my many failed attempts, just to show my (probably flawed) logic, is:
member <- function (data = dfm, groupvar = 'group', analysis = kc15) {
res<-matrix(NA,ncol=length(analysis$size)+1) #preparing an object for the results
res[,1]<-table(docvars(data,groupvar)) #getting names and totals of groups
for (i in 1:length(analysis$size)) { #getting a bunch of counts that I care about
r<-table(docvars(data,groupvar)[analysis$cluster==i])
res<-cbind(res,r) #here's the problem, trying to add each new count as a column.
}
res
}
So, to sum, the reproducible example above means to replicate the first column in res and an r, and I'm seeking (I think) a correct solution instead of the cbind, which would allow adding columns of different length but similar names, as in the example above.
Please help, it's embarrassing how much time I'm wasting on this.
The following may be an option, which merges on the "row names" of the data frames, converted from the frequency tables:
df <- merge(as.data.frame(x, row.names=1, responseName ="x"),
as.data.frame(y, row.names=1, responseName ="y"),
by="row.names", all=TRUE)
df[is.na(df)] <- 0; df
Row.names x y
1 a 2 1
2 b 2 0
3 c 1 1
4 d 1 0
Then, this method can be incorporated into your real data with some modification. I've made up the data since I didn't have any to work with.
set.seed(1234)
groupvar <- sample(letters[1:4], 16, TRUE)
clusters <- 1:4
cluster <- rep(clusters, each=4)
Merge the first two tables:
res <- merge(as.data.frame(table(groupvar[cluster==1]),
row.names=1, responseName=clusters[1]),
as.data.frame(table(groupvar[cluster==2]),
row.names=1, responseName=clusters[2]),
by="row.names", all=TRUE)
Then merge the others using your for loop.
for (i in 3:length(clusters)) {
r <- table(groupvar[cluster==i])
res <- merge(res, as.data.frame(r, row.names=1, responseName = clusters[i]),
by.x="Row.names", by.y="row.names", all=TRUE)
}
res[is.na(res)] <- 0
res
Row.names X1 X2 X3 X4
1 a 1 2 0 0
2 b 1 1 2 2
3 c 0 1 1 2
4 d 2 0 1 0
Alternatively, merge the transposed tables and transpose the result back:
res <- t(merge(t(unclass(x)), t(unclass(y)), all=TRUE))
res <- `colnames<-`(res[order(rownames(res)), 2:1], c("x", "y"))
res[is.na(res)] <- 0
res
# x y
# a 2 1
# b 2 0
# c 1 1
# d 1 0
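A different technique from the merge-based answers above, sketched here under the assumption that all inputs are available up front (the names `inputs` and `all_levels` are illustrative): fix the set of categories first, so that every table() call returns counts of the same length and in the same order. Each new count can then be written into its own column inside a loop.

```r
a <- c("a", "b", "c", "d", "a", "b")
b <- c("a", "c")

# Hypothetical container for the vectors to be counted
inputs <- list(x = a, y = b)

# Every category that occurs in any input
all_levels <- sort(unique(unlist(inputs)))

# Preallocate one row per category and one column per input
res <- matrix(0L, nrow = length(all_levels), ncol = length(inputs),
              dimnames = list(all_levels, names(inputs)))

for (nm in names(inputs)) {
  # factor() with fixed levels makes table() report zero counts too
  res[, nm] <- as.vector(table(factor(inputs[[nm]], levels = all_levels)))
}
```

Absent categories come out as 0 directly, so no NA replacement step is needed.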

R associative memory doesn't work as expected

I am trying to use associative memory and ddply to add a column to a data frame. For example:
First, I have defined an association and a function that uses it to calculate the product of two elements of a row (property damage and multiplier) to get the actual damage in dollars. Here, "B" means billions, "m" or "M" means millions, etc.
validMultiplierLetter <- c("B", "h", "H", "k", "K", "m", "M")
Multiplier <- c(1000000000, 100, 100, 1000, 1000, 1000000, 1000000)
names(Multiplier) <- validMultiplierLetter
The function ploss (property loss) is:
ploss <- function(pd,pm) {
if (pm %in% validMultiplierLetter) pd*Multiplier[pm]
else 0
}
Here is a sample data frame with columns pd (property damage) and pm (multiplier), and the ddply code to create a pl (property loss) column, which is the product of property damage and the associated value of the multiplier. Invalid multipliers are equivalent to 0 (e.g., "+").
library(plyr) # provides ddply
tdf <- data.frame(pd = c(5, 10, 15, 20, 25), pm = c("B", "m", "K", "+", "h"))
tldf <- ddply(tdf, .(pd, pm), transform, pl = ploss(pd,pm))
I get the following output when I execute the code above - you can see that the right multiplier was not used for the rows.
> tldf
pd pm pl
1 5 B 500
2 10 m 10000
3 15 K 15000
4 20 + 0
5 25 h 2500
Strangely though, when you pass constant, the multiplier works correctly. But, when you pass a variable (whose value is same as the constant), for some reason you get an incorrect result.
> Multiplier["B"]
B
1e+09
> tdf$pm[1]
[1] B
Levels: + B h K m
> Multiplier[tdf$pm[1]]
h
100
Any explanation of why this happens and how to fix it is greatly appreciated. Thanks.
The problem is that tdf$pm is a factor. When presented with a factor, [ indexes by the factor's underlying integer codes rather than by its character labels:
x <- 10:15
names(x) <- LETTERS[1:6]
x
## A B C D E F
## 10 11 12 13 14 15
x[c('A','F')] # Lookup by name
## A F
## 10 15
x[factor(c('A','F'))] # Lookup by integer
## A B
## 10 11
This is fixed by using as.character around the factor, so that a character vector is presented to [:
x[as.character(factor(c('A','F')))]
## A F
## 10 15
For your problem, you can coerce to character in the transform function:
ddply(tdf, .(pd, pm), transform, pl = ploss(pd,as.character(pm)))
## pd pm pl
## 1 5 B 5.0e+09
## 2 10 m 1.0e+07
## 3 15 K 1.5e+04
## 4 20 + 0.0e+00
## 5 25 h 2.5e+03
In addition, you could vectorize your ploss function in the obvious way and do the job directly with transform:
ploss <- function(pd,pm) {
ifelse(pm %in% validMultiplierLetter, pd*Multiplier[pm], 0)
}
transform(tdf, pl=ploss(pd, as.character(pm)))
## pd pm pl
## 1 5 B 5.0e+09
## 2 10 m 1.0e+07
## 3 15 K 1.5e+04
## 4 20 + 0.0e+00
## 5 25 h 2.5e+03
And of course, the as.character coercion could be within the function ploss, so it isn't required in the transform call:
ploss <- function(pd,pm) {
ifelse(pm %in% validMultiplierLetter, pd*Multiplier[as.character(pm)], 0)
}
The problem I see is that, if you're using the default R options, tdf$pm is a factor, not a character. You can check this with class(tdf$pm). What's happening here is that "B" is really a mask for 2 (following the order in the printout: Levels: + B h K m), so pm has the value 2 as far as [ is concerned, and Multiplier[2] is 100 as you've assigned.
When you call data.frame (or read.table) you need to add the argument stringsAsFactors = FALSE, or change the corresponding global option with the options function. Since R 4.0.0, stringsAsFactors = FALSE is the default.
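The difference between the two lookups can be seen side by side in the following sketch. The factor levels are set explicitly here (an assumption made so that the result does not depend on the locale's sort order); `by_code` and `by_label` are illustrative names.

```r
Multiplier <- c(B = 1e9, h = 100, H = 100, k = 1000, K = 1000, m = 1e6, M = 1e6)

# Levels fixed to match the printout in the question: + B h K m
pm <- factor(c("B", "m", "K", "+", "h"), levels = c("+", "B", "h", "K", "m"))

# The integer codes that [ actually sees when handed the factor
codes <- as.integer(pm)

# Indexing by factor: positions 2, 5, 4, 1, 3 of Multiplier
by_code <- Multiplier[pm]

# Indexing by label: a lookup by name; "+" has no entry and yields NA
by_label <- Multiplier[as.character(pm)]
```

Only `by_label` pairs each letter with its intended multiplier; `by_code` returns whatever happens to sit at the code's position.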

R help on aggregation function

for my question I created a dummy data frame:
set.seed(007)
DF <- data.frame(a = rep(LETTERS[1:5], each=2), b = sample(40:49), c = sample(1:10))
DF
a b c
1 A 49 2
2 A 43 3
3 B 40 7
4 B 47 1
5 C 41 9
6 C 48 8
7 D 45 6
8 D 42 5
9 E 46 10
10 E 44 4
How can I use the aggregation function on column a so that, for instance, for "A" the following value is calculated: (49-43) / (2+3)?
I started like:
aggregate(DF, by=list(DF$a), FUN=function(x) {
...
})
The problem I have is that I do not know how to access the 4 different cells 49, 43, 2 and 3
I tried x[[1]][1] and similar stuff but don't get it working.
Inside aggregate, the function FUN is applied independently to each column of your data. Here you want to use a function that takes two columns as inputs, so a priori, you can't use aggregate for that.
Instead, you can use ddply from the plyr package:
ddply(DF, "a", summarize, res = (b[1] - b[2]) / sum(c))
# a res
# 1 A 1.2000000
# 2 B -0.8750000
# 3 C -0.4117647
# 4 D 0.2727273
# 5 E 0.1428571
When you aggregate, the FUN argument can be anything you want. Keep in mind that the value passed will either be a vector (if x is one column) or a little data.frame or matrix (if x is more than one). However, aggregate doesn't let you access the columns of a multi-column argument. For example:
aggregate( . ~ a, data = DF, FUN = function(x) diff(x[,1]) / sum(x[,2]) )
That fails with an error even though I used . (which takes all of the columns of DF that I'm not using elsewhere). To see what aggregate is trying to do there look at the following.
aggregate( . ~ a, data = DF, FUN = sum )
The two columns, b, and c, were aggregated but from the first attempt we know that you can't do something that accesses each column separately. So, strictly sticking with aggregate you need two passes and three lines of code.
diffb <- aggregate( b ~ a, data = DF, FUN = diff ) # diff() gives b[2] - b[1] per group
Y <- aggregate( c ~ a, data = DF, FUN = sum )
Y$c <- -diffb$b / Y$c # negate to get (b[1] - b[2]) / sum(c)
Now Y contains the result you want.
The by function is simpler than aggregate and all it does is split the original data.frame using the indices and then apply the FUN function.
l <- by( data = DF, INDICES = DF$a, FUN = function(x) -diff(x$b)/sum(x$c), simplify = FALSE ) # -diff() gives b[1] - b[2]
unlist(l)
You have to do a little to get the result back into a data.frame if you really want one.
data.frame(a = names(l), x = unlist(l))
Using data.table could be faster and easier.
library(data.table)
DT <- data.table(DF)
DT[, (-1*diff(b))/sum(c), by=a]
a V1
1: A 1.2000000
2: B -0.8750000
3: C -0.4117647
4: D 0.2727273
5: E 0.1428571
Using aggregate, not so good. I didn't find a better way to do it using aggregate :( but here's an attempt.
B <- aggregate(DF$b, by=list(DF$a), diff)
C <- aggregate(DF$c, by=list(DF$a), sum)
data.frame(a=B[,1], Result=(-1*B[,2])/C[,2])
a Result
1 A 1.2000000
2 B -0.8750000
3 C -0.4117647
4 D 0.2727273
5 E 0.1428571
A data.table solution - for efficiency of time and memory.
library(data.table)
DT <- as.data.table(DF)
DT[, list(calc = (b[1] - b[2]) / sum(c)), by = a]
You can use the base by() function:
listOfRows <-
by(data=DF,
INDICES=DF$a,
FUN=function(x){data.frame(a=x$a[1],res=(x$b[1] - x$b[2])/(x$c[1] + x$c[2]))})
newDF <- do.call(rbind,listOfRows)
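For comparison, a base R sketch using split and sapply, which hands each group to the function as a small data frame so that both columns are available at once. The data is typed in from the printout in the question rather than regenerated with sample(), since the random draws may differ between R versions; `res` and `out` are illustrative names.

```r
DF <- data.frame(
  a = rep(LETTERS[1:5], each = 2),
  b = c(49, 43, 40, 47, 41, 48, 45, 42, 46, 44),
  c = c(2, 3, 7, 1, 9, 8, 6, 5, 10, 4)
)

# One data frame per level of a; compute (b[1] - b[2]) / sum(c) in each
res <- sapply(split(DF, DF$a), function(g) (g$b[1] - g$b[2]) / sum(g$c))

# Back to a data frame, if one is wanted
out <- data.frame(a = names(res), res = unname(res))
```

For group "A" this evaluates (49 - 43) / (2 + 3), as asked in the question.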

How does ddply handle factors as "split" variables?

I have a data.frame with 20 columns. The first two are factors, and the rest are numeric. I'd like to use the first two columns as split variables and then apply the mean() to the remaining columns.
This seems like a quick and easy job for ddply(), however, the results for the output data.frame are not what I am looking for. Here is a minimal example with just one column of data:
Aa <- c(rep(c("A", "a"), each = 20))
Bb <- c(rep(c("B", "b", "B", "b"), each = 10))
x <- runif(40)
df1 <- data.frame(Aa, Bb, x)
ddply(df1, .(Aa, Bb), mean)
The output is:
Aa Bb x
1 NA NA 0.5193275
2 NA NA 0.4491907
3 NA NA 0.4848128
4 NA NA 0.4717899
Warning messages:
1: In mean.default(X[[1L]], ...) :
argument is not numeric or logical: returning NA
The warning is repeated 8 times, presumably once for each call to mean(). I'm guessing this comes from trying to take the mean of a factor. I could write this as:
ddply(df1, .(Aa, Bb), function(df1) mean(df1$x))
or
ddply(df1, .(Aa, Bb), summarize, x = mean(x))
both of which do work (not giving NAs), but I would rather avoid writing out 18 such x = mean(x) statements, one for each of my numeric columns.
Is there a general solution? I'm not wedded to ddply if there is a better answer elsewhere.
Since you are reducing the number of rows, you need to use summarise:
> ddply(df1, .(Aa, Bb), summarise, mean_x =mean(x) )
Aa Bb mean_x
1 a b 0.3790675
2 a B 0.4242922
3 A b 0.5622329
4 A B 0.4574471
It's just as easy to use aggregate in this instance. Let's say you had two variables:
> aggregate(df1[-(1:2)], df1[1:2], mean)
Aa Bb x y
1 a b 0.4249121 0.4639192
2 A b 0.6127175 0.4639192
3 a B 0.4522292 0.4826715
4 A B 0.5201965 0.4826715
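The function passed to ddply receives each group as a data frame, so negative indexing works there as well. Use colMeans rather than mean, since mean does not accept a data frame:
ddply(df1, .(Aa, Bb), function(x) colMeans(x[-(1:2)]))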
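With many numeric columns, the formula interface of aggregate avoids spelling out each column. This is a sketch with three made-up numeric columns (x, y, z) standing in for the 18 in the question; the dot on the left-hand side means "every column not used as a grouping variable".

```r
set.seed(1)
df1 <- data.frame(
  Aa = rep(c("A", "a"), each = 20),
  Bb = rep(c("B", "b", "B", "b"), each = 10),
  x = runif(40),
  y = runif(40),
  z = runif(40)
)

# Mean of every remaining column for each Aa/Bb combination
means <- aggregate(. ~ Aa + Bb, data = df1, FUN = mean)
```

Note that the formula method drops rows containing NA by default, which matters if the real data has missing values.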
