R aggregate all possible combinations incl. "don't cares"

Say we've got a dataframe with 3 columns representing 3 different cases, and each can be of state 0 or 1. A fourth column contains a measurement.
set.seed(123)
df <- data.frame(round(runif(25)),
                 round(runif(25)),
                 round(runif(25)),
                 runif(25))
colnames(df) <- c("V1", "V2", "V3", "x")
head(df)
V1 V2 V3 x
1 0 1 0 0.2201189
2 1 1 0 0.3798165
3 0 1 1 0.6127710
aggregate(df$x, by=list(df$V1, df$V2, df$V3), FUN=mean)
Group.1 Group.2 Group.3 x
1 0 0 0 0.1028646
2 1 0 0 0.5081943
3 0 1 0 0.4828984
4 1 1 0 0.5197925
5 0 0 1 0.4571073
6 1 0 1 0.3219217
7 0 1 1 0.6127710
8 1 1 1 0.6029213
The aggregate function calculates the mean for all possible combinations. However, in my research I also need to know the outcome for combinations where certain columns may have any state. For example, the mean of all observations with V1==1 & V2==1, regardless of the contents of V3. The result should look like this, with the asterisk representing "don't care":
Group.1 Group.2 Group.3 x
1 * * * 0.1234567 (this is the mean of all rows)
2 0 * * 0.1234567
3 1 * * 0.1234567
4 * 0 * 0.1224567
5 * 1 * 0.1234567
[ all other possible combinations follow, should be total of 27 rows ]
Is there an easy way to achieve this?

Here is the ldply-ddply method:
library(plyr)
ldply(list(.(V1,V2,V3),.(V1),.(V2),.()), function(y) ddply(df,y,summarise,x=mean(x)))
V1 V2 V3 x .id
1 0 0 0 0.1028646 <NA>
2 0 0 1 0.4571073 <NA>
3 0 1 0 0.4828984 <NA>
4 0 1 1 0.6127710 <NA>
5 1 0 0 0.5081943 <NA>
6 1 0 1 0.3219217 <NA>
7 1 1 0 0.5197925 <NA>
8 1 1 1 0.6029213 <NA>
9 0 NA NA 0.4436400 <NA>
10 1 NA NA 0.4639997 <NA>
11 NA 0 NA 0.4118793 <NA>
12 NA 1 NA 0.5362985 <NA>
13 NA NA NA 0.4566702 <NA>
Essentially you create a list of all the variable combinations you are interested in and iterate over it with ldply, using ddply to perform the aggregation. The magic of plyr puts it all into a compact data frame for you. All that remains is to remove the spurious .id column introduced by the grand mean (.()) and to replace the NAs in the groups with "*" if needed.
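A minimal sketch of that clean-up (the object name res is only illustrative and assumes the ldply result above was assigned to it):
res <- ldply(list(.(V1, V2, V3), .(V1), .(V2), .()),
             function(y) ddply(df, y, summarise, x = mean(x)))
res$.id <- NULL  # drop the column introduced by the grand mean
res[c("V1", "V2", "V3")] <- lapply(res[c("V1", "V2", "V3")],
                                   function(v) ifelse(is.na(v), "*", v))
res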
To get all combinations you can use combn and lapply to generate a list with the relevant combinations to plug into ldply:
all.combs <- unlist(lapply(0:3,combn,x=c("V1","V2","V3"),simplify=FALSE),recursive=FALSE)
ldply(all.combs, function(y) ddply(df,y,summarise,x=mean(x)))
.id x V1 V2 V3
1 <NA> 0.4566702 NA NA NA
2 <NA> 0.4436400 0 NA NA
3 <NA> 0.4639997 1 NA NA
4 <NA> 0.4118793 NA 0 NA
5 <NA> 0.5362985 NA 1 NA
6 <NA> 0.4738541 NA NA 0
7 <NA> 0.4380543 NA NA 1
8 <NA> 0.3862588 0 0 NA
9 <NA> 0.5153666 0 1 NA
10 <NA> 0.4235250 1 0 NA
11 <NA> 0.5530440 1 1 NA
12 <NA> 0.3878900 0 NA 0
13 <NA> 0.4882400 0 NA 1
14 <NA> 0.5120604 1 NA 0
15 <NA> 0.4022073 1 NA 1
16 <NA> 0.4502901 NA 0 0
17 <NA> 0.3820042 NA 0 1
18 <NA> 0.5013455 NA 1 0
19 <NA> 0.6062045 NA 1 1
20 <NA> 0.1028646 0 0 0
21 <NA> 0.4571073 0 0 1
22 <NA> 0.4828984 0 1 0
23 <NA> 0.6127710 0 1 1
24 <NA> 0.5081943 1 0 0
25 <NA> 0.3219217 1 0 1
26 <NA> 0.5197925 1 1 0
27 <NA> 0.6029213 1 1 1

(Nice reproducible code, btw, well-stated question.)
Perhaps the best way to attack this would be to create (and later
discard) another column indicating a grouping. Starting with your
data:
set.seed(123)
df <- data.frame(round(runif(25)),
                 round(runif(25)),
                 round(runif(25)),
                 runif(25))
colnames(df) <- c("V1", "V2", "V3", "x")
Let's first form a data.frame with all possibles, using a fourth
column to provide a unique group id.
allpossibles <- expand.grid(V1=unique(df$V1), V2=unique(df$V2), V3=unique(df$V3))
allpossibles$id <- 1:nrow(allpossibles)
head(allpossibles, n=3)
## V1 V2 V3 id
## 1 0 1 0 1
## 2 1 1 0 2
## 3 0 0 0 3
With this data.frame, change the id for rows where you have desired
commonality. For instance, the following two combinations (1,1,0) and
(1,1,1) are identical as far as you care, so set the id variable to
be the same:
subset(allpossibles, V1==1 & V2==1)
## V1 V2 V3 id
## 2 1 1 0 2
## 6 1 1 1 6
allpossibles$id[6] <- 2
From here, merge the two data.frames so that id is incorporated into
the original:
df2 <- merge(df, allpossibles, by=c('V1','V2','V3'))
head(df2, n=3)
## V1 V2 V3 x id
## 1 0 0 0 0.1028646 3
## 2 0 0 1 0.1750527 7
## 3 0 0 1 0.3435165 7
From here, it's a simple matter of aggregating the data and remerging
with allpossibles (to regain V1, V2, and V3):
df3 <- aggregate(df2$x, by=list(df2$id), FUN=mean)
colnames(df3) <- c('id','x')
(df4 <- merge(allpossibles, df3, by='id'))
## id V1 V2 V3 x
## 1 1 0 1 0 0.4828984
## 2 2 1 1 0 0.5530440
## 3 2 1 1 1 0.5530440
## 4 3 0 0 0 0.1028646
## 5 4 1 0 0 0.5081943
## 6 5 0 1 1 0.6127710
## 7 7 0 0 1 0.4571073
## 8 8 1 0 1 0.3219217
If you can accept the data with semi-duplicate rows (see rows 2 and 3
above), then just remove the $id column and have at it. If you must
unique-ify the rows, something like the following might work:
df5 <- do.call(rbind, by(df4, df4$id, function(ldf) {
  if (nrow(ldf) > 1) {
    uniqlen <- apply(ldf, 2, function(x) length(unique(x)))
    ldf[, which(uniqlen > 1)] <- NA
    ldf <- ldf[1, ]
  }
  ldf
}))
df5 <- df5[, ! 'id' == names(df5)]
df5
## V1 V2 V3 x
## 1 0 1 0 0.4828984
## 2 1 1 NA 0.5530440
## 3 0 0 0 0.1028646
## 4 1 0 0 0.5081943
## 5 0 1 1 0.6127710
## 7 0 0 1 0.4571073
## 8 1 0 1 0.3219217
(Slightly cleaner-looking code can be used if you replace
do.call(rbind, by( with ddply( using the plyr package. The
internal function and its results are the same. ddply in this case
is a little slower, but that could likely be improved with a better
internal function.)
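For reference, a rough sketch of that ddply variant (untested; it reuses df4 and the same internal function as above):
library(plyr)
df5 <- ddply(df4, "id", function(ldf) {
  if (nrow(ldf) > 1) {
    uniqlen <- apply(ldf, 2, function(x) length(unique(x)))
    ldf[, which(uniqlen > 1)] <- NA
    ldf <- ldf[1, ]
  }
  ldf
})
df5 <- df5[, !'id' == names(df5)]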

First, let me define a helper function to create all possible combinations of columns:
allcomb <- function(v, addnone = TRUE) {
  # all subsets of the columns in v, from the full set down to single columns
  x <- do.call(c, lapply(length(v):1, function(n) combn(v, n, simplify = FALSE)))
  # optionally append 0, which stands for "no grouping" (used for the grand mean)
  if (addnone) x <- c(x, 0)
  x
}
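For illustration (assuming the helper above), the three grouping columns give a list of the seven column subsets plus the 0 placeholder for the grand mean, roughly:
allcomb(c("V1", "V2", "V3"))
# [[1]] "V1" "V2" "V3"
# [[2]] "V1" "V2"
# [[3]] "V1" "V3"
# [[4]] "V2" "V3"
# [[5]] "V1"
# [[6]] "V2"
# [[7]] "V3"
# [[8]] 0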
Now we can use this to aggregate over the different subsets
v<-names(df)[1:3]
vv<-allcomb(v)
dd<-lapply(vv, function(cols) aggregate(df$x, df[, cols, drop=F], mean))
This returns a list of data.frames, one for each combination. To merge them all together, we can use rbind.fill from plyr:
library(plyr)
dd<-do.call(rbind.fill, dd)
This leaves the "any" values as NA rather than "*". If you want to turn those into asterisks (which consequently converts your group columns to strings rather than numbers), you can do
dd[1:3]<-lapply(dd[1:3], function(x) {x[is.na(x)]<-"*";x})
which finally gives
V1 V2 V3 x
1 0 0 0 0.1028646
2 1 0 0 0.5081943
3 0 1 0 0.4828984
4 1 1 0 0.5197925
5 0 0 1 0.4571073
6 1 0 1 0.3219217
7 0 1 1 0.6127710
8 1 1 1 0.6029213
9 0 0 * 0.3862588
10 1 0 * 0.4235250
11 0 1 * 0.5153666
12 1 1 * 0.5530440
13 0 * 0 0.3878900
14 1 * 0 0.5120604
15 0 * 1 0.4882400
16 1 * 1 0.4022073
17 * 0 0 0.4502901
18 * 1 0 0.5013455
19 * 0 1 0.3820042
20 * 1 1 0.6062045
21 0 * * 0.4436400
22 1 * * 0.4639997
23 * 0 * 0.4118793
24 * 1 * 0.5362985
25 * * 0 0.4738541
26 * * 1 0.4380543
27 * * * 0.4566702

Related

Drop columns with too much missingness in R

I am trying to drop some columns that have less than 5 valid values. Here is an example dataset.
df <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10),
                 i1 = c(0,1,1,1,1,0,0,1,NA,1),
                 i2 = c(1,0,0,1,0,1,1,0,0,NA),
                 i3 = c(NA,NA,NA,NA,NA,NA,NA,NA,NA,0),
                 i4 = c(NA,1,NA,NA,NA,NA,NA,NA,1,NA))
> df
id i1 i2 i3 i4
1 1 0 1 NA NA
2 2 1 0 NA 1
3 3 1 0 NA NA
4 4 1 1 NA NA
5 5 1 0 NA NA
6 6 0 1 NA NA
7 7 0 1 NA NA
8 8 1 0 NA NA
9 9 NA 0 NA 1
10 10 1 NA 0 NA
In this case, columns i3 and i4 need to be dropped from the data frame.
How can I get the desired dataset below:
> df
id i1 i2
1 1 0 1
2 2 1 0
3 3 1 0
4 4 1 1
5 5 1 0
6 6 0 1
7 7 0 1
8 8 1 0
9 9 NA 0
10 10 1 NA
You can keep cols with at least 5 non-missing values with:
df[colSums(!is.na(df)) >= 5]
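For illustration, the intermediate count of non-missing values per column for this data is:
colSums(!is.na(df))
# id i1 i2 i3 i4
# 10  9  9  1  2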
You can use discard from the purrr package:
library(purrr)
df <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10),
                 i1 = c(0,1,1,1,1,0,0,1,NA,1),
                 i2 = c(1,0,0,1,0,1,1,0,0,NA),
                 i3 = c(NA,NA,NA,NA,NA,NA,NA,NA,NA,0),
                 i4 = c(NA,1,NA,NA,NA,NA,NA,NA,1,NA))
df %>%
  discard(~ sum(!is.na(.)) < 5)
#> id i1 i2
#> 1 1 0 1
#> 2 2 1 0
#> 3 3 1 0
#> 4 4 1 1
#> 5 5 1 0
#> 6 6 0 1
#> 7 7 0 1
#> 8 8 1 0
#> 9 9 NA 0
#> 10 10 1 NA
Created on 2022-11-10 with reprex v2.0.2
While this is likely slower than base R methods (for datasets with a very large number of columns, e.g. > 1000), I generally feel the readability of the code is far superior. In addition, it makes more complicated conditions easy to express.
Using base R, another approach:
> df[, sapply(df, function(x) sum(!is.na(x))) >= 5]
id i1 i2
1 1 0 1
2 2 1 0
3 3 1 0
4 4 1 1
5 5 1 0
6 6 0 1
7 7 0 1
8 8 1 0
9 9 NA 0
10 10 1 NA
A performance comparison of the different answers given in this post:
# dplyr, purrr and tibble are needed for %>%, mutate/select, discard/map_dbl and as_tibble
library(dplyr)
library(purrr)
library(tibble)

funs = list(
  colSums = function(df){df[colSums(!is.na(df)) >= nrow/10]},
  sapply  = function(df){df[, sapply(df, function(x) sum(!is.na(x))) >= nrow/10]},
  discard = function(df){df %>% discard(~ sum(!is.na(.)) < nrow/10)},
  mutate  = function(df){df %>% mutate(across(where(~ sum(!is.na(.)) < nrow/10), ~ NULL))},
  select  = function(df){df %>% select(where(~ sum(!is.na(.)) >= nrow/10))})
ncol = 10000
nrow = 100
df = replicate(ncol, sample(c(1:9, NA), nrow, TRUE)) %>% as_tibble()
avrtime = map_dbl(funs, function(f){
  duration = c()
  for(i in 1:10){
    t1 = Sys.time()
    f(df)
    t2 = Sys.time()
    duration[i] = as.numeric(t2 - t1)
  }
  return(mean(duration))})
avrtime[order(avrtime)]
The average time taken by each (in seconds):
colSums sapply discard select mutate
0.04510500 0.04692972 0.29207475 0.29451160 0.31755514
Using select
library(dplyr)
df %>%
  select(where(~ sum(complete.cases(.x)) >= 5))
Output:
id i1 i2
1 1 0 1
2 2 1 0
3 3 1 0
4 4 1 1
5 5 1 0
6 6 0 1
7 7 0 1
8 8 1 0
9 9 NA 0
10 10 1 NA
Or in base R
Filter(\(x) sum(complete.cases(x)) >=5 , df)

Most frequent values in sliding window dataframe in R

I have the following dataframe (df):
A B T Required col (window = 3)
1 1 0 1
2 3 0 3
3 4 0 4
4 2 1 1 4
5 6 0 0 2
6 4 1 1 0
7 7 1 1 1
8 8 1 1 1
9 1 0 0 1
I would like to add the required column, as followed:
Insert in the current row the previous row value of A or B.
If, in the last 3 (window) rows, the content of column A equals column T most of the time, choose A; otherwise choose B. (There can be more columns, in which case the column that equals T most often is chosen.)
What is the most efficient way to do this for a big data.table?
I changed the column named T to be named TC to avoid confusion with T as an abbreviation for TRUE
library(tidyverse)
library(data.table)
df[, newcol := {
  equal <- A == TC
  map(1:.N, ~ if(.x <= 3) NA
              else if(sum(equal[.x - 1:3]) > 3/2) A[.x - 1]
              else B[.x - 1])
}]
df
# N A B TC newcol
# 1: 1 1 0 1 NA
# 2: 2 3 0 3 NA
# 3: 3 4 0 4 NA
# 4: 4 2 1 1 4
# 5: 5 6 0 0 2
# 6: 6 4 1 1 0
# 7: 7 7 1 1 1
# 8: 8 8 1 1 1
# 9: 9 1 0 0 1
This works too, but it's less clear, and likely less efficient
df[, newcol := shift(A == TC, 1:3) %>%
     pmap_lgl(~ sum(...) > 3/2) %>%
     ifelse(shift(A), shift(B))]
data:
df <- fread("
N A B TC
1 1 0 1
2 3 0 3
3 4 0 4
4 2 1 1
5 6 0 0
6 4 1 1
7 7 1 1
8 8 1 1
9 1 0 0
")
Probably much less efficient than the answer by Ryan, but without additional packages.
A<-c(1,3,4,2,6,4,7,8,1)
B<-c(0,0,0,1,0,1,1,1,0)
TC<-c(1,3,4,1,0,1,1,1,0)
req<-rep(NA,9)
df<-data.frame(A,B,TC,req)
window<-3
for(i in window:(length(req)-1)){
  equal <- sum(df$A[(i-window+1):i] == df$TC[(i-window+1):i])
  if(equal > window/2){
    df$req[i+1] <- df$A[i]
  } else {
    df$req[i+1] <- df$B[i]
  }
}
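For reference, the resulting column matches the expected values from the question:
df$req
# [1] NA NA NA  4  2  0  1  1  1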

Getting R Frequency counts for all possible answers

I've started with R and I'm still finding my way with syntax.
I'm looking to get the frequencies for a scaled variable which has values of 0 through 10 and NA.
Id <- c(1,2,3,4,5)
ClassA <- c(1,NA,3,1,1)
ClassB <- c(2,1,1,3,3)
R <- c(5,5,7,NA,9)
S <- c(3,7,NA,9,5)
df <- data.frame(Id,ClassA,ClassB,R,S)
library(plyr)
count(df,'R')
I get a result of
R freq
1 5 2
2 7 1
3 9 1
4 NA 1
I'm looking for a result of
R freq
1 0 0
2 1 0
3 2 0
4 3 0
5 4 0
6 5 2
7 6 0
8 7 1
9 8 0
10 9 1
11 10 0
12 NA 1
If I have the vector showing the possible results
RAnswers <- c(0,1,2,3,4,5,6,7,8,9,10,NA)
How would I apply it with the data set to get the above result?
Here's a base R solution built around table(), match(), and a simple NA replacement:
freq <- table(df$R,useNA='ifany');
freq;
##
## 5 7 9 <NA>
## 2 1 1 1
R <- c(0:10,NA);
df2 <- data.frame(R=R,freq=freq[match(R,as.integer(names(freq)))]);
df2$freq[is.na(df2$freq)] <- 0;
df2;
## R freq
## 1 0 0
## 2 1 0
## 3 2 0
## 4 3 0
## 5 4 0
## 6 5 2
## 7 6 0
## 8 7 1
## 9 8 0
## 10 9 1
## 11 10 0
## 12 NA 1
Edit: Frank has a better answer, here's how you can use table() on a factor to get the required output:
setNames(nm=c('R','freq'),data.frame(table(factor(df$R,levels=RAnswers,exclude=NULL))));
## R freq
## 1 0 0
## 2 1 0
## 3 2 0
## 4 3 0
## 5 4 0
## 6 5 2
## 7 6 0
## 8 7 1
## 9 8 0
## 10 9 1
## 11 10 0
## 12 <NA> 1
This kind of task is easily done with the dplyr package. To keep the unused values of R, you have to define R as a factor and use tidyr's complete() function:
library(dplyr)
library(tidyr)
df %>%
  mutate(R = factor(R, levels = 0:10)) %>%
  group_by(R) %>%
  summarise(freq = n()) %>%
  complete(R, fill = list(freq = 0))

expand data.frame to long format and increment value

I would like to convert my data from a short format to a long format and I imagine there is a simple way to do it (possibly with reshape2, plyr, dplyr, etc?).
For example, I have:
foo <- data.frame(id = 1:5,
                  y = c(0, 1, 0, 1, 0),
                  time = c(2, 3, 4, 2, 3))
id y time
1 0 2
2 1 3
3 0 4
4 1 2
5 0 3
I would like to expand/copy each row n times, where n is that row's value in the "time" column. However, I would also like the variable "time" to be incremented from 1 to n. That is, I would like to produce:
id y time
1 0 1
1 0 2
2 1 1
2 1 2
2 1 3
3 0 1
3 0 2
3 0 3
3 0 4
4 1 1
4 1 2
5 0 1
5 0 2
5 0 3
As a bonus, I would also like to do a sort of incrementing of the variable "y" where, for those ids with y = 1, y is set to 0 until the largest value of "time". That is, I would like to produce:
id y time
1 0 1
1 0 2
2 0 1
2 0 2
2 1 3
3 0 1
3 0 2
3 0 3
3 0 4
4 0 1
4 1 2
5 0 1
5 0 2
5 0 3
This seems like something that dplyr might already do, but I just don't know where to look. Regardless, any solution that avoids loops is helpful.
You can create a new data frame with the proper id and time columns for the long format, then merge that with the original. This leaves NA for the unmatched values, which can then be substituted with 0:
merge(foo,
      with(foo,
           data.frame(id = rep(id, time), time = sequence(time))),
      all.y = TRUE)
## id time y
## 1 1 1 NA
## 2 1 2 0
## 3 2 1 NA
## 4 2 2 NA
## 5 2 3 1
## 6 3 1 NA
## 7 3 2 NA
## 8 3 3 NA
## 9 3 4 0
## 10 4 1 NA
## 11 4 2 1
## 12 5 1 NA
## 13 5 2 NA
## 14 5 3 0
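A possible final step for the substitution mentioned above (a sketch; the name out is only illustrative):
out <- merge(foo,
             with(foo, data.frame(id = rep(id, time), time = sequence(time))),
             all.y = TRUE)
out$y[is.na(out$y)] <- 0  # unmatched rows get y = 0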
A similar merge works for the first expansion. Merge foo without the time column with the same created data frame as above:
merge(foo[c('id','y')],
      with(foo,
           data.frame(id = rep(id, time), time = sequence(time))))
## id y time
## 1 1 0 1
## 2 1 0 2
## 3 2 1 1
## 4 2 1 2
## 5 2 1 3
## 6 3 0 1
## 7 3 0 2
## 8 3 0 3
## 9 3 0 4
## 10 4 1 1
## 11 4 1 2
## 12 5 0 1
## 13 5 0 2
## 14 5 0 3
It's not necessary to specify all (or all.y) in the latter expression because there are multiple time values for each matching id value, and these are expanded. In the prior case, the time values were matched from both data frames, and without specifying all (or all.y) you would get your original data back.
The initial expansion can be achieved with:
newdat <- transform(
  foo[rep(rownames(foo), foo$time), ],
  time = sequence(foo$time)
)
# id y time
#1 1 0 1
#1.1 1 0 2
#2 2 1 1
#2.1 2 1 2
#2.2 2 1 3
# etc
To get the complete solution, including the bonus part, then do:
newdat$y[-cumsum(foo$time)] <- 0
# id y time
#1 1 0 1
#1.1 1 0 2
#2 2 0 1
#2.1 2 0 2
#2.2 2 1 3
#etc
If you were really excitable, you could do it all in one step using within:
within(
  foo[rep(rownames(foo), foo$time), ],
  {
    time <- sequence(foo$time)
    y[-cumsum(foo$time)] <- 0
  }
)
If you're willing to go with "data.table", you can try:
library(data.table)
fooDT <- as.data.table(foo)
fooDT[, list(time = sequence(time)), by = list(id, y)]
# id y time
# 1: 1 0 1
# 2: 1 0 2
# 3: 2 1 1
# 4: 2 1 2
# 5: 2 1 3
# 6: 3 0 1
# 7: 3 0 2
# 8: 3 0 3
# 9: 3 0 4
# 10: 4 1 1
# 11: 4 1 2
# 12: 5 0 1
# 13: 5 0 2
# 14: 5 0 3
And, for the bonus question:
fooDT[, list(time = sequence(time)),
      by = list(id, y)][, y := {y[1:(.N-1)] <- 0; y},
                        by = id][]
# id y time
# 1: 1 0 1
# 2: 1 0 2
# 3: 2 0 1
# 4: 2 0 2
# 5: 2 1 3
# 6: 3 0 1
# 7: 3 0 2
# 8: 3 0 3
# 9: 3 0 4
# 10: 4 0 1
# 11: 4 1 2
# 12: 5 0 1
# 13: 5 0 2
# 14: 5 0 3
For the bonus question, alternatively:
fooDT[, list(time = seq_len(time)), by = list(id, y)][
  y == 1, y := c(rep.int(0, .N-1L), 1), by = id][]
With dplyr (and magrittr for nice legibility):
library(magrittr)
library(dplyr)
foo[rep(1:nrow(foo), foo$time), ] %>%
  group_by(id) %>%
  mutate(y = y * (!duplicated(y, fromLast = TRUE)),  # keep y = 1 only on the last row of ids that had y = 1
         time = 1:n())
Hope it helps

Table by row with R

I would like to tabulate by row within a data frame. I can obtain adequate results using table within apply in the following example:
df.1 <- read.table(text = '
state county city year1 year2 year3 year4 year5
1 2 4 0 0 0 1 2
2 5 3 10 20 10 NA 10
2 7 1 200 200 NA NA 200
3 1 1 NA NA NA NA NA
', na.strings = "NA", header=TRUE)
tdf <- t(df.1)
apply(tdf[4:nrow(tdf),1:nrow(df.1)], 2, function(x) {table(x, useNA = "ifany")})
Here are the results:
[[1]]
x
0 1 2
3 1 1
[[2]]
x
10 20 <NA>
3 1 1
[[3]]
x
200 <NA>
3 2
[[4]]
x
<NA>
5
However, in the following example each row consists of a single value.
df.2 <- read.table(text = '
state county city year1 year2 year3 year4 year5
1 2 4 0 0 0 0 0
2 5 3 1 1 1 1 1
2 7 1 2 2 2 2 2
3 1 1 NA NA NA NA NA
', na.strings = "NA", header=TRUE)
tdf.2 <- t(df.2)
apply(tdf.2[4:nrow(tdf.2),1:nrow(df.2)], 2, function(x) {table(x, useNA = "ifany")})
The output I obtain is:
# [1] 5 5 5 5
As such, I cannot tell from this output that the first 5 is for 0, the second 5 is for 1, the third 5 is for 2 and the last 5 is for NA. Is there a way I can have R return the value represented by each 5 in the second example?
You can use lapply to systematically output a list. You would have to loop over the row indices:
sub.df <- as.matrix(df.2[grepl("year", names(df.2))])
lapply(seq_len(nrow(sub.df)),
       function(i) table(sub.df[i, ], useNA = "ifany"))
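For df.2 this returns a list of named one-value tables, so you can tell which value each count of 5 refers to (printed roughly):
# [[1]]
# 0
# 5
#
# [[2]]
# 1
# 5
#
# [[3]]
# 2
# 5
#
# [[4]]
# <NA>
#    5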
Protect the result by wrapping with list:
apply(tdf.2[4:nrow(tdf.2), 1:nrow(df.2)], 2,
      function(x) { list(table(x, useNA = "ifany")) })
Here's a table solution:
table(
rep(rownames(df.1),5),
unlist(df.1[,4:8]),
useNA="ifany")
This gives
0 1 2 10 20 200 <NA>
1 3 1 1 0 0 0 0
2 0 0 0 3 1 0 1
3 0 0 0 0 0 3 2
4 0 0 0 0 0 0 5
...and for your df.2:
0 1 2 <NA>
1 5 0 0 0
2 0 5 0 0
3 0 0 5 0
4 0 0 0 5
Well, this is a solution unless you really like having a list of tables for some reason.
I think the problem is stated in apply's help:
... If n equals 1, apply returns a vector if MARGIN has length 1 and
an array of dimension dim(X)[MARGIN] otherwise ...
The inconsistency of the return values of base R's apply family is the reason why I shifted completely to plyr's **ply functions. So this works as desired:
library(plyr)
alply( df.2[ 4:8 ], 1, function(x) table( unlist(x), useNA = "ifany" ) )
