There is a similar question about combining vectors with different lengths here, but all answers (except @Ronak Shah's answer) lose the names/colnames.
My problem is that I need to keep the column names, which seems to be possible using the rowr package and cbind.fill.
I would like to stay in base R or use stringi, and the output should remain a matrix.
Test data:
inp <- list(structure(c("1", "2"), .Dim = 2:1, .Dimnames = list(NULL,"D1")),
structure(c("3", "4", "5"), .Dim = c(3L, 1L), .Dimnames = list(NULL, "D2")))
I know that I could get the column names beforehand and then reassign them after creating the matrix, like:
## Using stringi
colnam <- unlist(lapply(inp, colnames))
out <- stringi::stri_list2matrix(inp)
colnames(out) <- colnam
out
## Using base-R
colnam <- unlist(lapply(inp, colnames))
max_length <- max(lengths(inp))
nm_filled <- lapply(inp, function(x) {
ans <- rep(NA, length.out = max_length)
ans[seq_along(x)] <- x
ans
})
out <- do.call(cbind, nm_filled)
colnames(out) <- colnam
out
Are there other options that keep the column names?
Since stringi is ok for you to use, you can use the function stri_list2matrix(), i.e.
setNames(as.data.frame(stringi::stri_list2matrix(inp)), sapply(inp, colnames))
# D1 D2
#1 1 3
#2 2 4
#3 <NA> 5
Here is a slightly more concise base R variation
len <- max(lengths(inp))
nms <- sapply(inp, colnames)
do.call(cbind, setNames(lapply(inp, function(x)
replace(rep(NA, len), 1:length(x), x)), nms))
# D1 D2
#[1,] "1" "3"
#[2,] "2" "4"
#[3,] NA "5"
Not sure if this constitutes a sufficiently different solution from what you've already posted. Will remove if deemed too similar.
Update
Or how about a merge?
Reduce(
function(x, y) merge(x, y, all = TRUE, by = 0),
lapply(inp, as.data.frame))[, -1]
# D1 D2
#1 1 3
#2 2 4
#3 <NA> 5
The idea here is to convert the list entries to data.frames and then merge them by row name by setting by = 0 (thanks @Henrik). Note that this will return a data.frame rather than a matrix.
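Since the question asks for a matrix, the merged result can be converted back with as.matrix(). A minimal sketch, assuming the two-element inp from the question; it only covers the two-element case.

```r
# Test data from the question
inp <- list(
  structure(c("1", "2"), .Dim = 2:1, .Dimnames = list(NULL, "D1")),
  structure(c("3", "4", "5"), .Dim = c(3L, 1L), .Dimnames = list(NULL, "D2"))
)

# Merge on row names (by = 0); all = TRUE keeps rows missing from either side
merged <- Reduce(
  function(x, y) merge(x, y, all = TRUE, by = 0),
  lapply(inp, as.data.frame, stringsAsFactors = FALSE)
)

# Drop the Row.names helper column and convert back to a character matrix
out <- as.matrix(merged[, -1])
out
```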
Here is an option using base R:
do.call(cbind,
lapply(inp, function(i){
x <- data.frame(i, stringsAsFactors = FALSE)
as.matrix( x[ seq(max(lengths(inp))), , drop = FALSE ] )
# if the matrices have more than one column, use:
#as.matrix( x[ seq(max(sapply(inp, nrow))), , drop = FALSE ] )
}
))
# D1 D2
# 1 "1" "3"
# 2 "2" "4"
# NA NA "5"
The idea is to make all matrices have the same number of rows. When we subset a data frame by row index, rows that do not exist are returned as NA; we then convert back to a matrix and cbind.
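A variation on the same idea that skips the data.frame round trip: pad each matrix with a block of NA rows and then cbind. This is only a sketch; cbind_fill is a made-up helper name, not a base R function.

```r
# Hypothetical helper: cbind matrices that have different numbers of rows
cbind_fill <- function(mats) {
  n <- max(vapply(mats, nrow, integer(1)))
  padded <- lapply(mats, function(m) {
    # Stack an NA block under the matrix; column names come from m
    rbind(m, matrix(NA, nrow = n - nrow(m), ncol = ncol(m)))
  })
  do.call(cbind, padded)
}

# Test data from the question
inp <- list(
  structure(c("1", "2"), .Dim = 2:1, .Dimnames = list(NULL, "D1")),
  structure(c("3", "4", "5"), .Dim = c(3L, 1L), .Dimnames = list(NULL, "D2"))
)

out <- cbind_fill(inp)
out
```

Because it works on nrow() rather than lengths(), the same helper should also cope with list entries that have more than one column.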
I have two list objects. l1 contains information that has been read in through path files. l2 is a list of values that have similar name components as those in l1. I have assigned attributes to both list based on the names of the elements in the list. I would like to reach my expected results using the attributes that I have assigned to my list.
For example: I would like to apply a function mean() between the elements with the attribute id that are "2013_mean" in l1 to those with the attribute year that are also "2013" in l2. I would like to do the similar thing with those when the attribute for year is "2016".
# File List
oldl1 <- list(2,3,4,5)
names(oldl1) <- c("C:/Users/2013_mean.csv",
"C:/Users/2013_median.csv",
"C:/Users/2016_mean.csv",
"C:/Users/2016_median.csv"
)
newl1 <- list(2,3,4,5,8,9)
names(newl1) <- c("C:/Users/2013_mean.csv",
"C:/Users/2013_median.csv",
"C:/Users/2016_mean.csv",
"C:/Users/2016_median.csv",
"C:/Users/2017_mean.csv",
"C:/Users/2017_median.csv"
)
l1 <- oldl1  # assumption: l1 refers to the shorter list defined above
attributes(l1) <- data.frame(id = sub("\\.csv", "", basename(names(l1))),
year = trimws(basename(names(l1)), whitespace = "_.*"))
# Other List
l2 <- list(8,9,10,15,1)
names(l2) <- c("2013_A",
"2013_B",
"2013_C",
"2016_D",
"2016_E")
attributes(l2) <- data.frame(year = trimws(names(l2), whitespace = "_.*"))
expected <- list(mean(c(l1[[1]], l2[[1]])),
mean(c(l1[[1]], l2[[2]])),
mean(c(l1[[1]], l2[[3]])),
mean(c(l1[[3]], l2[[4]])),
mean(c(l1[[3]], l2[[5]]))
)
We may use the attributes to split and match and get the mean
yrs <- intersect(attr(l1, "year"), attr(l2, "year"))
i1 <- grepl("mean", attr(l1, "id"))
i12 <- attr(l1, "year") %in% yrs
i1 <- i1 & i12
i2 <- attr(l2, "year") %in% yrs
l2new <- l2[i2]
l1new <- l1[i1]
attr(l1new, "year") <- attr(l1, "year")[i1]
out <- do.call(c, Map(function(x, y) lapply(x, function(z)
mean(c(z, y))), split(l2new, attr(l2, 'year')[i2]), l1new))
names(out) <- NULL
-checking with OP's expected
> identical(out, expected)
[1] TRUE
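A more explicit sketch of the same lookup, assuming every year in l2 has exactly one "mean" entry in l1. The attributes are set directly here rather than parsed from file names, and the intermediate names (is_mean, l1_mean, idx) are only for illustration.

```r
# Minimal versions of the two lists, with the attributes assigned by hand
l1 <- list(2, 3, 4, 5)
attr(l1, "id")   <- c("2013_mean", "2013_median", "2016_mean", "2016_median")
attr(l1, "year") <- c("2013", "2013", "2016", "2016")

l2 <- list(8, 9, 10, 15, 1)
attr(l2, "year") <- c("2013", "2013", "2013", "2016", "2016")

# Keep only the "mean" entries of l1, together with their years
is_mean    <- grepl("mean", attr(l1, "id"), fixed = TRUE)
l1_mean    <- l1[is_mean]
mean_years <- attr(l1, "year")[is_mean]

# For each element of l2, find the l1 "mean" entry with the same year
idx <- match(attr(l2, "year"), mean_years)

# Pair them up and average each pair
out <- unname(Map(function(x, y) mean(c(x, y)), l1_mean[idx], l2))
out
```

If a year in l2 had no match, idx would contain NA and that element would need separate handling.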
Or another option is to convert the list with attributes to a data.frame, do a merge and use rowMeans and then convert to list with as.list
as.list(rowMeans(merge(transform(data.frame(attributes(l2)),
l2 = unlist(l2)),
subset(transform(data.frame(attributes(l1)), l1 = unlist(l1)),
grepl("mean", id), select = c(year, l1)), all.x = TRUE)[-1]))
-output
[[1]]
[1] 5
[[2]]
[1] 5.5
[[3]]
[1] 6
[[4]]
[1] 9.5
[[5]]
[1] 2.5
I have my variables named in little-endian fashion, separated by periods.
I'd like to create index variables for each different level and get summary output for the variables at each level, but I'm getting stuck at the first step trying to break apart my variables and put them in a table to start working with them:
Variable naming convention:
Environment.Construct.Subconstruct_1.subconstruct_i.#.Short_Name
Example:
n <- 6
dat <- data.frame(
ph1.career_interest.delight.1.Friendly=sample(1:5, n, replace=TRUE),
ph1.career_interest.delight.2.Advantagious=sample(1:5, n, replace=TRUE),
ph1.career_interest.philosophy.1.Meaningful_Difference=sample(1:5, n, replace=TRUE),
ph1.career_interest.philosophy.2.Enable_Work=sample(1:5, n, replace=TRUE)
)
# create list of variable names
names <- as.list(colnames( dat ))
## Try to create a hierarchy of variables: Step 1: Create matrix
heir <- as.matrix(strsplit(names,".", fixed = TRUE))
I've gone through a couple iterations but it still returns an error:
Error in strsplit(names, ".", fixed = TRUE) : non-character argument
Instead of wrapping with as.list, directly use the colnames because according to ?strsplit, the input x would be
x - character vector, each element of which is to be split. Other inputs, including a factor, will give an error.
Thus, if it is a list, it is not the expected input class for strsplit
nm1 <- colnames(dat)
strsplit(nm1, ".", fixed = TRUE)
#[[1]]
#[1] "ph1" "career_interest" "delight" "1" "Friendly"
#[[2]]
#[1] "ph1" "career_interest" "delight" "2" "Advantagious"
#[[3]]
#[1] "ph1" "career_interest" "philosophy" "1" "Meaningful_Difference"
#[[4]]
#[1] "ph1" "career_interest" "philosophy" "2" "Enable_Work"
The output is a list of vectors. The expected output format is not clear from the OP's post. If we need a matrix or data.frame, we can rbind those list elements (assuming they have the same length)
m1 <- do.call(rbind, strsplit(nm1, ".", fixed = TRUE))
returns a matrix
Or can convert to data.frame with rbind.data.frame
NOTE: names is a function name. It is better not to assign object names with function names
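As a possible next step toward the per-level summaries mentioned in the question, the split names can be put in a data.frame with one column per level. This is a sketch only: the level labels (environment, construct, ...) are my reading of the naming convention, and it assumes every name has exactly five parts.

```r
# Column names from the example data
nm1 <- c("ph1.career_interest.delight.1.Friendly",
         "ph1.career_interest.delight.2.Advantagious",
         "ph1.career_interest.philosophy.1.Meaningful_Difference",
         "ph1.career_interest.philosophy.2.Enable_Work")

# One row per variable, one column per level of the hierarchy
parts <- do.call(rbind, strsplit(nm1, ".", fixed = TRUE))
hier <- as.data.frame(parts, stringsAsFactors = FALSE)
names(hier) <- c("environment", "construct", "subconstruct", "item", "short_name")

# Group the original variable names by subconstruct
groups <- split(nm1, hier$subconstruct)
groups
```

Each element of groups can then be used to select columns of dat, e.g. dat[groups$delight].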
Update
If the lengths are not the same, an option is to pad NA at the end for those elements with less length
lst1 <- strsplit(nm1, ".", fixed = TRUE)
lst1[[1]] <- lst1[[1]][1:3] # making lengths different
mx <- max(lengths(lst1))
do.call(rbind, lapply(lst1, `length<-`, mx))
# [,1] [,2] [,3] [,4] [,5]
#[1,] "ph1" "career_interest" "delight" NA NA
#[2,] "ph1" "career_interest" "delight" "2" "Advantagious"
#[3,] "ph1" "career_interest" "philosophy" "1" "Meaningful_Difference"
#[4,] "ph1" "career_interest" "philosophy" "2" "Enable_Work"
You can count the number of '.' in the column names to work out how many new columns to create. We can then use tidyr::separate to split the names into n new columns on '.'.
#Changing 1st column name to make length unequal
names(dat)[1] <- 'ph1.career_interest.delight.1'
#Number of new columns to be created
n <- max(stringr::str_count(names(dat), '\\.')) + 1
tidyr::separate(data.frame(name = names(dat)), name,
paste0('col', seq_len(n)), sep = '\\.', fill = 'right')
# col1 col2 col3 col4 col5
#1 ph1 career_interest delight 1 <NA>
#2 ph1 career_interest delight 2 Advantagious
#3 ph1 career_interest philosophy 1 Meaningful_Difference
#4 ph1 career_interest philosophy 2 Enable_Work
I am trying to figure out how to convert a data.frame to a list of lists. Suppose I had (feel free to modify this if you need to capture more attributes for later):
v <- list(
row1=list(col1 = as.Date("2011-01-23"), col2="A"),
row2=list(col1 = as.Date("2012-03-03"), col2="B"))
d <- do.call(rbind, lapply(v, as.data.frame))
d$col3 <- 2
How do I get d back to a list of lists (similar to v)? The end result should be equivalent to the result of:
vr <- list(
row1=list(col1 = as.Date("2011-01-23"), col2="A", col3=2),
row2=list(col1 = as.Date("2012-03-03"), col2="B", col3=2))
You can do
out <- lapply(split(d, rownames(d)), as.list)
out
#$row1
#$row1$col1
#[1] "2011-01-23"
#$row1$col2
#[1] "A"
#$row1$col3
#[1] 2
#$row2
#$row2$col1
#[1] "2012-03-03"
#$row2$col2
#[1] "B"
#$row2$col3
#[1] 2
If you add stringsAsFactors = FALSE when creating d, i.e.
d <- do.call(rbind, lapply(v, as.data.frame, stringsAsFactors = FALSE))
d$col3 <- 2
then
identical(out, vr)
returns TRUE.
You have to convert each row to a list before using it as an element of the main list. Note that apply() coerces d to a matrix first, so with mixed column types the dates and numbers come back as character strings. I hope the code below helps:
apply(d,MARGIN = 1, FUN=function(x){as.list(x)})
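If the column types matter, a row-wise lapply over the row indices is one way to keep them, and it also keeps the rows in their original order. A minimal sketch, with d rebuilt directly instead of via rbind:

```r
# Same content as d in the question, built in one step
d <- data.frame(
  col1 = as.Date(c("2011-01-23", "2012-03-03")),
  col2 = c("A", "B"),
  col3 = 2,
  row.names = c("row1", "row2"),
  stringsAsFactors = FALSE
)

# Take one row at a time as a one-row data.frame, then turn it into a list
out <- lapply(seq_len(nrow(d)), function(i) as.list(d[i, , drop = FALSE]))
names(out) <- rownames(d)

str(out)
```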
Using R 3.1.0
a = as.data.frame(do.call(cbind, lapply(1:100, function(x) { c(1,2,3)})))
b = unstack(stack(a))
# Returns FALSE
all(colnames(a) == colnames(b))
The documentation on stack/unstack says unstacking should "reverse this [stack] operation". Am I missing something? Why do I need to re-order the columns of b?
The last few lines of the stack (see utils:::stack.data.frame) function create a data.frame with two columns, "values" and "ind". The "ind" column is created with the code:
ind = factor(rep.int(names(x), lapply(x, length)))
But, look at how factor works in general (pay attention to the order of the "Levels"):
factor(c(1, 2, 3, 10, 4))
# [1] 1 2 3 10 4
# Levels: 1 2 3 4 10
factor(paste0("A", c(1, 2, 3, 10, 4)))
# [1] A1 A2 A3 A10 A4
# Levels: A1 A10 A2 A3 A4
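For comparison, a short sketch showing that passing the levels explicitly keeps the order of first appearance instead of the sorted order:

```r
x <- paste0("A", c(1, 2, 3, 10, 4))

# Default: levels are the sorted unique values
default_levels <- levels(factor(x))

# Explicit levels: order of first appearance is preserved
ordered_levels <- levels(factor(x, levels = unique(x)))

default_levels
ordered_levels
```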
If the functionality you describe is important for your analysis, you might do better modifying a version of stack.data.frame to capture the order of the data.frame names during the factoring process, like this:
Stack <- function (x, select, ...)
{
if (!missing(select)) {
nl <- as.list(1L:ncol(x))
names(nl) <- names(x)
vars <- eval(substitute(select), nl, parent.frame())
x <- x[, vars, drop = FALSE]
}
keep <- unlist(lapply(x, is.vector))
if (!sum(keep))
stop("no vector columns were selected")
if (!all(keep))
warning("non-vector columns will be ignored")
x <- x[, keep, drop = FALSE]
data.frame(values = unlist(unname(x)),
# REMOVE THIS --> ind = factor(rep.int(names(x), lapply(x, length))),
# AND ADD THIS:
ind = factor(rep.int(names(x), lapply(x, length)), unique(names(x))),
stringsAsFactors = FALSE)
}
Testing, one, two, three...
## Not using identical here because
## the factor levels are different
all.equal(Stack(a), stack(a))
# [1] TRUE
identical(unstack(Stack(a)), a)
# [1] TRUE
You'll never get me to defend the R documentation...
stack(...) creates a new data frame with two columns, values and ind. The latter has the column names from the original table, as a factor, ordered alphabetically. unstack(...) uses that factor to (re-) create columns of the new data frame. So the phrase "Unstacking reverses this operation" should be interpreted loosely...
To get the result you want, you need to reorder the factor ind, as follows:
a <- as.data.frame(do.call(cbind, lapply(1:100, function(x) { c(1,2,3)})))
c <- stack(a)
c$ind <- factor(c$ind, levels=colnames(a))
d <- unstack(c)
identical(a,d)
# [1] TRUE
I have a data frame where some consecutive columns have the same name. I need to find these, add their values row by row, drop one column and replace the other with the sum.
I need to do this without knowing in advance which names are duplicated, possibly by comparing each column name with the following one to see if there's a match.
Can someone help?
Thanks in advance.
> dfrm <- data.frame(a = 1:10, b= 1:10, cc= 1:10, dd=1:10, ee=1:10)
> names(dfrm) <- c("a", "a", "b", "b", "b")
> sapply(unique(names(dfrm)[duplicated(names(dfrm))]),
function(x) Reduce("+", dfrm[ , grep(x, names(dfrm))]) )
a b
[1,] 2 3
[2,] 4 6
[3,] 6 9
[4,] 8 12
[5,] 10 15
[6,] 12 18
[7,] 14 21
[8,] 16 24
[9,] 18 27
[10,] 20 30
EDIT 2: Using rowSums allows simplification of the first sapply argument to just unique(names(dfrm)), at the expense of needing to remember to include drop=FALSE in "[":
sapply(unique(names(dfrm)),
function(x) rowSums( dfrm[ , grep(x, names(dfrm)), drop=FALSE]) )
To deal with NAs:
sapply(unique(names(dfrm)),
function(x) apply(dfrm[grep(x, names(dfrm))], 1,
function(y) if ( all(is.na(y)) ) {NA} else { sum(y, na.rm=TRUE) }
) )
(Edit note: addressed Tommy's counter-example by putting unique around the names(.)[.] construction.)
The erroneous code was:
sapply(names(dfrm)[unique(duplicated(names(dfrm)))],
function(x) Reduce("+", dfrm[ , grep(x, names(dfrm))]) )
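One caveat with the grep-based versions above: grep treats the name as a regular expression and also matches partial names, so a column called "ab" would be added into the sum for "a". A sketch with exact name comparison instead, using a small made-up data frame:

```r
# Made-up example where one name is a prefix of another
dfrm <- data.frame(1:3, 1:3, 1:3)
names(dfrm) <- c("a", "ab", "a")

# Compare names exactly; single-bracket indexing keeps a data.frame for rowSums
res <- sapply(unique(names(dfrm)),
              function(x) rowSums(dfrm[names(dfrm) == x]))
res
```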
Here is my one liner
# transpose data frame, sum by group = rowname, transpose back.
t(rowsum(t(dfrm), group = rownames(t(dfrm))))
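Note that rowsum sorts the groups by default, so the summed columns may come back in a different order than in the original data frame. A sketch of the same one-liner with reorder = FALSE, using an example where the names are not already in alphabetical order:

```r
dfrm <- data.frame(a = 1:10, b = 1:10, cc = 1:10, dd = 1:10, ee = 1:10)
names(dfrm) <- c("b", "b", "a", "a", "a")

# Transpose, sum rows that share a name, transpose back;
# reorder = FALSE keeps the groups in order of first appearance
summed <- t(rowsum(t(as.matrix(dfrm)), group = names(dfrm), reorder = FALSE))
summed
```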
One way is to identify duplicates using (surprise) the duplicated function, and then loop through them to calculate the sums. Here is an example:
dat.dup <- data.frame(x=1:10, x=1:10, x=1:10, y=1:10, y=1:10, z=1:10, check.names=FALSE)
dups <- unique(names(dat.dup)[duplicated(names(dat.dup))])
for (i in dups) {
dat.dup[[i]] <- rowSums(dat.dup[names(dat.dup) == i])
}
dat <- dat.dup[!duplicated(names(dat.dup))]
Some sample data.
dfr <- data.frame(
foo = rnorm(20),
bar = 1:20,
bar = runif(20),
check.names = FALSE
)
Method: Loop over unique column names; if there is only one column of that name, then selecting all columns with that name will return a vector, but if there are duplicates it will be a data frame. Use rowSums to sum over rows. (Duh. EDIT: Not quite as 'duh' as previously thought!) lapply returns a list, which we need to reform into a data frame, and finally we fix the names. EDIT: sapply avoids the need for the last step.
unique_col_names <- unique(colnames(dfr))
new_dfr <- sapply(unique_col_names, function(name)
{
subs <- dfr[, colnames(dfr) == name]
if(is.data.frame(subs))
rowSums(subs)
else
subs
})