I have a list of named vectors (see below, and at the end for the dput version) that I would like to "merge" together into a matrix, filling in zeros wherever a vector doesn't contain a name (a character in this case). This doesn't seem that hard, but I haven't found a working base solution to the problem. I thought about using match, but that seems costly in time when I'm sure there's a fancy way to use do.call and rbind together.
List of Named Vectors:
$greg
e i k l
1 2 1 1
$sam
! c e i t
1 1 1 2 1
$teacher
? c i k l
1 1 1 1 1
Final Desired Output
! ? c e i k l t
greg 0 0 0 1 2 1 1 0
sam 1 0 1 1 2 0 0 1
teacher 0 1 1 0 1 1 1 0
This is likely the output most solutions will give; filling the NAs with 0 is easy:
! ? c e i k l t
greg NA NA NA 1 2 1 1 NA
sam 1 NA 1 1 2 NA NA 1
teacher NA 1 1 NA 1 1 1 NA
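As a minimal sketch of the "filling NAs with 0" step mentioned above, logical indexing with is.na() is enough. The small matrix m here is made up for illustration and is not the asker's data:

```r
# Hypothetical 2 x 2 matrix with missing entries (illustration only)
m <- matrix(c(NA, 1, 2, NA), nrow = 2,
            dimnames = list(c("greg", "sam"), c("!", "e")))
# Replace every NA with 0 in place
m[is.na(m)] <- 0
m
```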
Sample Data
L2 <- structure(list(greg = structure(c(1L, 2L, 1L, 1L), .Dim = 4L, .Dimnames = structure(list(
c("e", "i", "k", "l")), .Names = ""), class = "table"), sam = structure(c(1L,
1L, 1L, 2L, 1L), .Dim = 5L, .Dimnames = structure(list(c("!",
"c", "e", "i", "t")), .Names = ""), class = "table"), teacher = structure(c(1L,
1L, 1L, 1L, 1L), .Dim = 5L, .Dimnames = structure(list(c("?",
"c", "i", "k", "l")), .Names = ""), class = "table")), .Names = c("greg",
"sam", "teacher"))
Here's a fairly straightforward base solution:
# first determine all possible column names
cols <- sort(unique(unlist(lapply(L2,names), use.names=FALSE)))
# initialize the output
out <- matrix(0, length(L2), length(cols), dimnames=list(names(L2),cols))
# loop over list and fill in the matrix
for(i in seq_along(L2)) {
out[names(L2)[i], names(L2[[i]])] <- L2[[i]]
}
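The loop can also be replaced by a single assignment using a two-column index matrix, one (row, column) pair per stored value. This is only a sketch of that alternative, not part of the original answer; it assumes plain named vectors (the list L below is a stand-in for L2, whose elements are 1-d tables):

```r
# Plain named vectors standing in for the tables in L2 (assumed sample data)
L <- list(greg    = c(e = 1, i = 2, k = 1, l = 1),
          sam     = c("!" = 1, c = 1, e = 1, i = 2, t = 1),
          teacher = c("?" = 1, c = 1, i = 1, k = 1, l = 1))
all_names <- unlist(lapply(L, names), use.names = FALSE)
cols <- sort(unique(all_names))
out <- matrix(0, length(L), length(cols), dimnames = list(names(L), cols))
# Row index repeated once per value, column index looked up by name
idx <- cbind(rep(seq_along(L), lengths(L)), match(all_names, cols))
out[idx] <- unlist(L, use.names = FALSE)
out
```

Whether this beats the loop for a given input would need to be benchmarked; it is shown only because it avoids the explicit iteration.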
UPDATE with benchmarks:
f1 <- function(L2) {
cols <- sort(unique(unlist(lapply(L2,names), use.names=FALSE)))
out <- matrix(0, length(L2), length(cols), dimnames=list(names(L2),cols))
for(i in seq_along(L2)) out[names(L2)[i], names(L2[[i]])] <- L2[[i]]
out
}
f2 <- function(L2) {
L.names <- sort(unique(unlist(sapply(L2, names))))
L3 <- t(sapply(L2, function(x) x[L.names]))
colnames(L3) <- L.names
L3[is.na(L3)] <- 0
L3
}
f3 <- function(L2) {
m <- do.call(rbind, lapply(L2, as.data.frame))
m$row <- sub("[.].*", "", rownames(m))
m$Var1 <- factor(as.character(m$Var1))
xtabs(Freq ~ row + Var1, m)
}
library(rbenchmark)
benchmark(f1(L2), f2(L2), f3(L2), order="relative")[,1:5]
# test replications elapsed relative user.self
# 1 f1(L2) 100 0.022 1.000 0.020
# 2 f2(L2) 100 0.051 2.318 0.052
# 3 f3(L2) 100 0.788 35.818 0.760
set.seed(21)
L <- replicate(676, {n=sample(10,1); l=sample(26,n);
setNames(sample(6,n,TRUE), letters[l])}, simplify=FALSE)
names(L) <- levels(interaction(letters,LETTERS))
benchmark(f1(L), f2(L), order="relative")[,1:5]
# test replications elapsed relative user.self
# 1 f1(L) 100 1.84 1.000 1.828
# 2 f2(L) 100 4.24 2.304 4.220
I think something like this:
names <- sort(unique(unlist(lapply(L2, names), use.names=FALSE)))
L3 <- t(vapply(L2, function(x) x[names], FUN.VALUE=numeric(length(names))))
colnames(L3) <- names
L3[is.na(L3)] <- 0
reshape2 Solution. This can be readily done with the reshape2 package by melting the list into long form and then using dcast to reshape it back into wide form:
> library(reshape2)
> m <- melt(L2)
> m$Var.1 <- factor(as.character(m$Var.1)) # optional - if columns should be sorted
> dcast(m, L1 ~ Var.1, fill = 0)
L1 ! ? c e i k l t
1 greg 0 0 0 1 2 1 1 0
2 sam 1 0 1 1 2 0 0 1
3 teacher 0 1 1 0 1 1 1 0
Base Solution. And here is a corresponding base solution where the first two lines perform the melt, the next line ensures the columns will be sorted and the last line reshapes from long to wide:
> m <- do.call(rbind, lapply(L2, as.data.frame))
> m$row <- sub("[.].*", "", rownames(m))
> m$Var1 <- factor(as.character(m$Var1))
> xtabs(Freq ~ row + Var1, m)
Var1
row ! ? c e i k l t
greg 0 0 0 1 2 1 1 0
sam 1 0 1 1 2 0 0 1
teacher 0 1 1 0 1 1 1 0
EDIT: Added a base solution and modified the sort line.
While typing this I thought of this solution but wonder if there's a more efficient one:
chars <- sort(unique(unlist(lapply(L2, names))))
L3 <- lapply(L2, function(x){
nots <- chars[!chars %in% names(x)]
new <- rev(c(x, rep(0, length(nots))))
names(new)[1:length(nots)] <- nots
new[order(names(new))]
})
do.call(rbind, L3)
Yielding:
! ? c e i k l t
greg 0 0 0 1 2 1 1 0
sam 1 0 1 1 2 0 0 1
teacher 0 1 1 0 1 1 1 0
This question already has an answer here:
R: Why am I not getting type or class "factor" after converting columns to factor?
Assume the following data.frame with columns of ordered factors:
dat0 <- data.frame(X1 = 1:5, X2 = 1:5, X3 = c(1,1:4), X4 = c(2,2:5))
dat <- data.frame(lapply(dat0, factor, ordered=TRUE, levels=1:5, labels=letters[1:5]))
I want to create a nice looking table that compiles how many a:e are in each column of dat (including any 0 counts). The function table() is an obvious choice.
My "clean" attempt at making this table does not work. See below:
The table() function works as expected (i.e., includes all 5 factor choices -- even if one or more has a 0 count) when applied to individual columns:
table(dat[,1])
a b c d e
1 1 1 1 1
table(dat[,3])
a b c d e
2 1 1 1 0
# note: a 0 is reported for any factor level that does not occur
However, when I try to use an apply() function on the data.frame to include all column counts into one table, I get wonky resulting formatting:
apply(dat, 2, table)
$X1
a b c d e
1 1 1 1 1
$X2
a b c d e
1 1 1 1 1
$X3
a b c d
2 1 1 1
$X4
b c d e
2 1 1 1
I can demonstrate the cause of the issue by including only the columns of my data.frame in which every factor level occurs at least once (i.e., I get my desired formatting once any column with a 0 count for some level is removed):
apply(dat[1:2], 2, table) # only including columns of dat with all 5 letters (i.e., no 0 counts)
X1 X2
a 1 1
b 1 1
c 1 1
d 1 1
e 1 1
Question: Is there a simple workaround/solution here when using table() or am I going to have to find a different approach?
Note: I know I could simply cbind() the individual table results, but that's very tedious in my actual more complex data set.
We may use table in sapply.
sapply(dat, table)
# X1 X2 X3 X4
# a 1 1 2 0
# b 1 1 1 2
# c 1 1 1 1
# d 1 1 1 1
# e 1 1 0 1
Or vapply, which is faster, but we need to know the number of levels in advance so that we can supply a result template of that length.
vapply(dat, table, numeric(nlevels(unlist(dat))))
# X1 X2 X3 X4
# a 1 1 2 0
# b 1 1 1 2
# c 1 1 1 1
# d 1 1 1 1
# e 1 1 0 1
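To illustrate why vapply needs the result length up front, here is a small sketch on made-up data (the two-column data frame below is an assumption, not the dat from the question). The third argument, FUN.VALUE, is a template vector; a template of the wrong length raises an error instead of returning a differently shaped result:

```r
# Hypothetical data: two factor columns sharing the levels a, b, c
dat_small <- data.frame(X1 = factor(c("a", "b", "c"), levels = c("a", "b", "c")),
                        X3 = factor(c("a", "a", "b"), levels = c("a", "b", "c")))
# Template of length 3 matches the three levels counted by table()
res <- vapply(dat_small, table, FUN.VALUE = numeric(3L))
# A length-1 template does not match, so vapply stops with an error
bad <- tryCatch(vapply(dat_small, table, FUN.VALUE = numeric(1L)),
                error = function(e) "error")
res
```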
If we don't urgently need the row names, we may use tabulate.
sapply(dat, tabulate, nlevels(unlist(dat)))
# X1 X2 X3 X4
# [1,] 1 1 2 0
# [2,] 1 1 1 2
# [3,] 1 1 1 1
# [4,] 1 1 1 1
# [5,] 1 1 0 1
If we know the number of levels beforehand, we may simplify these to vapply(dat, table, numeric(5L)) and sapply(dat, tabulate, 5L), which also gives a gain in speed.
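If the row names are wanted after all, they can be put back from the factor levels. A minimal sketch on a single made-up factor f (an assumption for illustration):

```r
# Hypothetical factor with an unused level "c"
f <- factor(c("a", "a", "b"), levels = c("a", "b", "c"))
# tabulate() counts the integer codes; nbins keeps the empty last level
counts <- tabulate(f, nbins = nlevels(f))
# Restore the labels that tabulate() drops
names(counts) <- levels(f)
counts
```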
Here comes the benchmark
set.seed(42)
DAT <- dat[sample(nrow(dat),1e5, replace=TRUE), ]
r <- matrix(, 5L, dim(DAT)[2])
microbenchmark::microbenchmark(
t(data.frame(do.call(rbind,lapply(DAT, table)))),
sapply(DAT, table),
vapply(DAT, table, numeric(5L)),
vapply(DAT, table, numeric(nlevels(unlist(dat)))),
sapply(DAT, tabulate, 5L),
sapply(DAT, tabulate, nlevels(unlist(dat))),
`for`={for (j in seq_along(DAT)) r[, j] <- tabulate(DAT[, j], 5L)}
)
Unit: microseconds
expr min lq mean median uq max neval cld
t(data.frame(do.call(rbind, lapply(DAT, table)))) 9960.629 10101.4820 11662.6014 10221.6970 14459.0215 17422.732 100 c
sapply(DAT, table) 9690.340 9822.2150 11721.6487 9934.2045 14128.6330 19107.070 100 c
vapply(DAT, table, numeric(5L)) 9630.185 9729.9155 11313.4803 9816.3260 14017.8180 22655.129 100 c
vapply(DAT, table, numeric(nlevels(unlist(dat)))) 9753.252 9890.5700 11309.0461 9976.4840 14110.4775 17906.082 100 c
sapply(DAT, tabulate, 5L) 725.613 742.7820 778.6458 785.3595 807.1935 916.700 100 a
sapply(DAT, tabulate, nlevels(unlist(dat))) 848.600 891.1135 936.7825 939.8245 967.2390 1114.601 100 a
for 3580.538 3846.5700 4059.3048 3922.1300 3981.4300 19752.024 100 b
Data:
dat <- structure(list(X1 = structure(1:5, levels = c("a", "b", "c",
"d", "e"), class = c("ordered", "factor")), X2 = structure(1:5, levels = c("a",
"b", "c", "d", "e"), class = c("ordered", "factor")), X3 = structure(c(1L,
1L, 2L, 3L, 4L), levels = c("a", "b", "c", "d", "e"), class = c("ordered",
"factor")), X4 = structure(c(2L, 2L, 3L, 4L, 5L), levels = c("a",
"b", "c", "d", "e"), class = c("ordered", "factor"))), class = "data.frame", row.names = c(NA,
-5L))
Solution:
Use lapply rather than apply, as explained in ZheyuanLi's linked answer and his comment.
Summary: The problem with apply is that it first coerces the data frame to a character matrix; table then re-factors those characters, so unused levels are not preserved. lapply instead works on the columns as they are and returns a list.
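The coercion can be seen directly with as.matrix, which is what apply uses on a data frame. A small sketch on a made-up one-column data frame (not the dat above):

```r
# Hypothetical column with an unused level "c"
d <- data.frame(X3 = factor(c("a", "a", "b"), levels = c("a", "b", "c")))
m <- as.matrix(d)             # character matrix; the factor levels are gone
lost <- table(m[, "X3"])      # only the values that actually occur
kept <- table(d$X3)           # all three levels, with a 0 for "c"
lost
kept
```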
Use a combination of data.frame, do.call, rbind, and t (transpose) to get the data into the desired data.frame format:
t(data.frame(do.call(rbind,lapply(dat, table))))
X1 X2 X3 X4
a 1 1 2 0
b 1 1 1 2
c 1 1 1 1
d 1 1 1 1
e 1 1 0 1
Or:
As ZheyuanLi pointed out, one can simply use sapply(dat, table).
Also thanks jay.sf for showing how vapply works.
I have a data.table with a list column "c":
df <- data.table(a = 1:3, c = list(1L, 1:2, 1:3))
df
a c
1: 1 1
2: 2 1,2
3: 3 1,2,3
I want to create separate columns for the values in "c".
I create a set of new columns F_1, F_2, F_3:
mmax <- max(df$a)
flux <- paste("F", 1:mmax, sep = "_")
df[, (flux) := 0]
df
a c F_1 F_2 F_3
1: 1 1 0 0 0
2: 2 1,2 0 0 0
3: 3 1,2,3 0 0 0
I want to dispatch values in "c" to columns F_1, F_2, F_3 like this:
df
a c F_1 F_2 F_3
1: 1 1 1 0 0
2: 2 1,2 1 2 0
3: 3 1,2,3 1 2 3
What I have tried:
comp_vect <- function(vec, mmax){
vec <- vec %>% unlist()
n <- length(vec)
answr <- c(vec, rep(0, l = mmax -n))
}
df[ , ..flux := mapply(comp_vect, c, mmax)]
The expected data.table is :
> df
a c F_1 F_2 F_3
1: 1 1 1 0 0
2: 2 1,2 1 2 0
3: 3 1,2,3 1 2 3
I followed a radically different approach. I rbinded the list column and then dcasted it, obtaining the desired result. Last part is to set the names.
library(data.table)
df <- data.table(a = 1:3, d = list(1L, c(1L, 2L), c(1L, 2L, 3L)))
df2 <- df[, rbind(d), by = a][, dcast(.SD, a ~ V1, fill = 0)]
setnames(df2, 2:4, flux)[]
a F_1 F_2 F_3
1: 1 1 0 0
2: 2 1 2 0
3: 3 1 2 3
where flux is the variable of names that you defined in your question.
Please note that I avoided using the column name c, as it may be confused with the function c().
Solution:
for(idx in seq(max(sapply(df$c, length)))){ # maximum number of values according to all the elements of the list
set(x = df,
i = NULL,
j = paste0("F_",idx), # column's name
value = sapply(df$c, function(x){
if(is.na(x[idx])){
return(0) # 0 instead of NA
} else {
return(x[idx])
}
})
)
}
Explanation:
We can extract the values from a list like this:
sapply(df$c, function(ll) return(ll[1])) # first value
[1] 1 1 1
sapply(df$c, function(ll) return(ll[2])) # second value
[1] NA 2 2
sapply(df$c, function(ll) return(ll[3])) # third value
[1] NA NA 3
We see that if there is no value, we have a NA.
We need an iterator to extract all values at the position idx. For that, we'll find the number of values in each element of df$c (the list) and keep the maximum.
max(sapply(df$c, length))
[1] 3
If we want zeros instead of NAs, we need to create a function in the sapply to convert them :
vec <- c(NA, 5, 1, NA)
sapply(vec, function(x) if(is.na(x)) return(0) else return(x))
[1] 0 5 1 0
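The same idea can be sketched without data.table or an element-wise function: pad every list element to a common length (assigning to length() fills with NA), then replace the NAs in one vectorized step. The names lst and padded are made up for this illustration:

```r
# The list column from the question, as a plain list
lst <- list(1L, 1:2, 1:3)
mmax <- max(lengths(lst))
# Extend each element to length mmax; the new positions become NA
padded <- t(vapply(lst, function(x) { length(x) <- mmax; x }, integer(mmax)))
# Vectorized NA -> 0 replacement, no per-element function needed
padded[is.na(padded)] <- 0L
colnames(padded) <- paste("F", seq_len(mmax), sep = "_")
padded
```

The resulting matrix could then be bound to the original table; how best to do that in data.table is left to the answers above.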
I have a binary data set which looks like
a b c d
r1 1 1 0 0
r2 0 1 1 0
r3 1 0 0 1
And a vector
V <- c("a", "c")
I want a command that looks up these column names and changes values in those columns, for example changing 1 to A. So the output would be:
a b c d
r1 A 1 0 0
r2 0 1 A 0
r3 A 0 0 1
Here is a vectorized way to do it,
df[names(df) %in% V] <- replace(df[names(df) %in% V], df[names(df) %in% V] == 1, 'A')
#or avoid calling the %in% part 3 times by assigning it, i.e.
i1 <- names(df) %in% V
df[i1] <- replace(df[i1], df[i1] == 1, 'A')
#or a more simplified syntax, courtesy of @Cath,
df[, V][df[, V]==1] <- "A"
which gives,
a b c d
r1 A 1 0 0
r2 0 1 A 0
r3 A 0 0 1
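One consequence worth keeping in mind is that putting 'A' into an integer column turns the whole column into character, while untouched columns keep their type. A small sketch using the data from this question and an lapply-based variant of the replacement (the variant is an illustration, not one of the forms above):

```r
df <- data.frame(a = c(1L, 0L, 1L), b = c(1L, 1L, 0L),
                 c = c(0L, 1L, 0L), d = c(0L, 0L, 1L),
                 row.names = c("r1", "r2", "r3"))
V <- c("a", "c")
# Replace 1 with "A" in the selected columns only
df[V] <- lapply(df[V], function(x) replace(x, x == 1L, "A"))
# Selected columns are now character; the others are still integer
sapply(df, class)
```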
A solution with dplyr:
library(dplyr)
V <- c("a", "c")
df %>%
mutate_at(V, ~replace(.x, .x == 1, 'A'))
# a b c d
# r1 A 1 0 0
# r2 0 1 A 0
# r3 A 0 0 1
mutate_at takes a data.frame and a vector of column names and applies the specified function to each of those columns.
DATA
df <- structure(list(a = c(1L, 0L, 1L), b = c(1L, 1L, 0L),
c = c(0L, 1L, 0L), d = c(0L, 0L, 1L)),
.Names = c("a", "b", "c", "d"),
class = "data.frame", row.names = c("r1", "r2", "r3"))
If the left hand side (LHS) and right hand side (RHS) are of the same type, then data.table can be used to update only the selected "cells" in place, i.e., without copying the whole column:
library(data.table)
setDT(df)
for (s in V) df[get(s) == 1L, (s) := 99L] # replacement value is of type integer
df[]
a b c d
1: 99 1 0 0
2: 0 1 99 0
3: 99 0 0 1
To verify that only selected rows in each column are updated, we can check the addresses of each column before and after the update using:
df[, lapply(.SD, address), .SDcols = V]
(In addition, the verbose mode can be switched on by options(datatable.verbose = TRUE).)
In case LHS and RHS are of different type, a type conversion is required anyway. Therefore, the whole column needs to be replaced:
df[, (V) := lapply(.SD, function(x) replace(x, x == 1L, "A")), .SDcols = V]
df
a b c d
1: A 1 0 0
2: 0 1 A 0
3: A 0 0 1
Using address() shows that each of the affected columns has been copied. But only the affected columns are copied, the other columns haven't been touched. This is different to the other answers posted so far where the whole data frame is copied.
I have a data frame which looks something like this:
id val
1 a
1 b
2 a
2 c
2 d
3 a
3 a
Think of each row as a label, val, that was given to some observation with an id.
What I ultimately want to get to is a "co-occurence" matrix that looks something like this where I get a count of how many times each letter appears within the same id with each other letter:
a b c d
a 1 1 1 1
b 1 0 0 0
c 1 0 0 1
d 1 0 1 0
I've been wracking my brain looking for ways to do this, but have come up empty so far. Any hints? Preferably using tidyverse tools, but open to other options as well at this point.
EDIT: the solutions to the question linked as a possible duplicate do not work in this case. I'm not sure why, but I suspect it has to do with that question having a data frame with 3 columns.
Here's a solution in base R. Not quite elegant but seems to work
temp = data.frame(do.call(cbind, lapply(split(df, df$id), function(a)
combn(a$val, 2))), stringsAsFactors = FALSE)
sapply(sort(unique(df$val)), function(rows)
sapply(sort(unique(df$val)), function(cols)
sum(sapply(temp, function(x)
identical(sort(x), sort(c(rows, cols)))))))
# a b c d
#a 1 1 1 1
#b 1 0 0 0
#c 1 0 0 1
#d 1 0 1 0
OR with igraph
temp = t(do.call(cbind, lapply(split(df, df$id), function(a) combn(a$val, 2))))
library(igraph)
as.matrix(get.adjacency(graph(temp, directed = FALSE)))
# a c b d
#a 1 1 1 1
#c 1 0 0 1
#b 1 0 0 0
#d 1 1 0 0
DATA
df = structure(list(id = c(1L, 1L, 2L, 2L, 2L, 3L, 3L),
val = c("a", "b", "a", "c", "d", "a", "a")),
.Names = c("id", "val"),
class = "data.frame",
row.names = c(NA, -7L))
A solution with dplyr + purrr:
library(dplyr)
library(purrr)
df %>%
split(.$id) %>%
map_dfr(function(x){
t(combn(x$val, 2)) %>%
data.frame(stringsAsFactors = FALSE)
}) %>%
mutate_all(~ factor(.x, levels = c("a", "b", "c", "d"))) %>%  # formula form; funs() is deprecated
table() %>%
pmax(., t(.))
Result:
X2
X1 a b c d
a 1 1 1 1
b 1 0 0 0
c 1 0 0 1
d 1 0 1 0
Notes:
I first split the df by id, then used map_dfr from purrr to map the combn function to each id group.
combn finds all combinations of elements within a vector (length(vec) choose 2) and returns a matrix.
_dfr at the end of map_dfr means that the result will be a dataframe by row binding each element of the list. So this is effectively do.call(rbind, lapply()).
mutate_all makes sure that table retains all the levels needed even if a letter does not exist in a column.
Finally, since after the table step, an upper triangular matrix is produced, I fed that matrix and its transpose into pmax
pmax finds the parallel maxima from the two inputs and returns a symmetric matrix as desired.
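The steps in these notes can also be sketched in base R alone. This is an illustrative translation under the same assumptions as the answer (in particular, every id has at least two rows, otherwise combn fails); the names lev, pairs, counts and co are made up here:

```r
# Sample data from the question
df <- data.frame(id = c(1L, 1L, 2L, 2L, 2L, 3L, 3L),
                 val = c("a", "b", "a", "c", "d", "a", "a"),
                 stringsAsFactors = FALSE)
lev <- sort(unique(df$val))
# All within-id pairs, one pair per row
pairs <- do.call(rbind, lapply(split(df$val, df$id),
                               function(v) t(combn(v, 2))))
# Shared factor levels keep the table square, including empty rows/columns
counts <- unclass(table(factor(pairs[, 1], lev), factor(pairs[, 2], lev)))
# Element-wise maximum of the matrix and its transpose makes it symmetric
co <- pmax(counts, t(counts))
co
```

Note that pmax takes the larger of the two mirrored cells rather than their sum, which matches the answer above but matters if a pair can occur in both orders.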
Data:
df = read.table(text= "id val
1 a
1 b
2 a
2 c
2 d
3 a
3 a", header = TRUE, stringsAsFactors = FALSE)
I would like to check whether a value is the same as in the previous row (for more than one variable, and also against a list of values). How do I write code for this? I read that the 'apply' functions can be used.
I searched for this topic here before posting and found somewhat similar questions, but not an exact match. I'm quite new to R.
Here is my sample table: (Flag needs to be done based on conditions)
Ticket No V1 V2 Flag
Tkt10256 1 X 0
Tkt10257 1 aa 0
Tkt10257 2 bb 1
Tkt10257 3 x 0
Tkt10260 1 cc 0
Tkt10260 2 aa 1
Tkt10262 3 bb 0
I have to set Flag based on the conditions below (mark as 1 only if all of them are satisfied):
Variable 2 should be one of the following 4 names (aa, bb, cc, dd)
Variable 1 should be different from the previous row
Ticket number has to be the same as the previous row
Thanks in advance for the help !
An approach without looping:
indx1 <- with(df, V2 %in% paste0(letters[1:4], letters[1:4]) )
indx2 <- with(df, c(TRUE,V1[-1]!=V1[-length(V1)]))
indx3 <- with(df, c(FALSE,Ticket.No[-1]==Ticket.No[-nrow(df)]))
df$Flag <- (indx1 & indx2 & indx3)+0
df$Flag
#[1] 0 0 1 0 0 1 0
data
df <- structure(list(Ticket.No = c("Tkt10256", "Tkt10257", "Tkt10257",
"Tkt10257", "Tkt10260", "Tkt10260", "Tkt10262"), V1 = c(1L, 1L,
2L, 3L, 1L, 2L, 3L), V2 = c("X", "aa", "bb", "x", "cc", "aa",
"bb"), Flag = c(0L, 0L, 1L, 1L, 0L, 1L, 0L)), .Names = c("Ticket.No",
"V1", "V2", "Flag"), class = "data.frame", row.names = c(NA,
-7L))
One more:
Check this on your larger data. I'm not exactly sure if duplicated is the right function to use there. If the numbers in the TicketNo column are increasing (i.e. the Xs in TktXXXXX), then it should work fine.
> dat2 <- dat[dat$V2 %in% c("aa", "bb", "cc", "dd"),]
> rn <- rownames(dat2)[duplicated(dat2[[1]]) & !c(FALSE, diff(dat2[[2]]) == 0)]
> dat$Flag <- (rownames(dat) %in% rn)+0
> dat
# TicketNo V1 V2 Flag
# 1 Tkt10256 1 X 0
# 2 Tkt10257 1 aa 0
# 3 Tkt10257 2 bb 1
# 4 Tkt10257 3 x 0
# 5 Tkt10260 1 cc 0
# 6 Tkt10260 2 aa 1
# 7 Tkt10262 3 bb 0
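The caveat about duplicated can be made concrete: duplicated flags any value that has appeared anywhere earlier, whereas the question asks for equality with the immediately preceding row. The two agree only when equal tickets are contiguous. A tiny sketch with made-up ticket numbers:

```r
# Hypothetical tickets where "T1" reappears after a different ticket
tickets <- c("T1", "T2", "T1")
# TRUE for any repeat of an earlier value, adjacent or not
dup <- duplicated(tickets)
# TRUE only when the value equals the one in the previous row
same_as_prev <- c(FALSE, tickets[-1] == tickets[-length(tickets)])
dup
same_as_prev
```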
A variation on @Akrun's answer:
with(df,
V2 %in% c("aa","bb","cc","dd") &
c(FALSE,diff(V1) != 0) &
c(FALSE,head(Ticket.No, -1)) == Ticket.No
) + 0
#[1] 0 0 1 0 0 1 0
Try:
for(i in 2:nrow(ddf)){
ddf$Flag[i] = ifelse( ddf$V2[i] %in% c('aa', 'bb', 'cc', 'dd')
&& ddf$V1[i] != ddf$V1[(i-1)]
&& ddf$TicketNo[i] == ddf$TicketNo[(i-1)]
,1,0)
}
ddf
TicketNo V1 V2 Flag
1 Tkt10256 1 X 0
2 Tkt10257 1 aa 0
3 Tkt10257 2 bb 1
4 Tkt10257 3 x 0
5 Tkt10260 1 cc 0
6 Tkt10260 2 aa 1
7 Tkt10262 3 bb 0