How to automate hierarchical grouping of variables based on variable name - r

I have my variables named in little-endian fashion, separated by periods.
I'd like to create index variables for each different level and get summary output for the variables at each level, but I'm getting stuck at the first step trying to break apart my variables and put them in a table to start working with them:
Variable naming convention:
Environment.Construct.Subconstruct_1.subconstruct_i.#.Short_Name
Example:
n <- 6
dat <- data.frame(
ph1.career_interest.delight.1.Friendly=sample(1:5, n, replace=TRUE),
ph1.career_interest.delight.2.Advantagious=sample(1:5, n, replace=TRUE),
ph1.career_interest.philosophy.1.Meaningful_Difference=sample(1:5, n, replace=TRUE),
ph1.career_interest.philosophy.2.Enable_Work=sample(1:5, n, replace=TRUE)
)
# create list of variable names
names <- as.list(colnames( dat ))
## Try to create a heirarchy of variables: Step 1: Create matrix
heir <- as.matrix(strsplit(names,".", fixed = TRUE))
I've gone through a couple iterations but it still returns an error:
Error in strsplit(names, ".", fixed = TRUE) : non-character argument

Instead of wrapping the names with as.list, pass colnames(dat) directly. According to ?strsplit, the input x must be
x - character vector, each element of which is to be split. Other inputs, including a factor, will give an error.
A list is therefore not the input class strsplit expects, which is why the call fails with "non-character argument".
nm1 <- colnames(dat)
strsplit(nm1, ".", fixed = TRUE)
#[[1]]
#[1] "ph1" "career_interest" "delight" "1" "Friendly"
#[[2]]
#[1] "ph1" "career_interest" "delight" "2" "Advantagious"
#[[3]]
#[1] "ph1" "career_interest" "philosophy" "1" "Meaningful_Difference"
#[[4]]
#[1] "ph1" "career_interest" "philosophy" "2" "Enable_Work"
The output is a list of vectors. The expected output format is not clear from the OP's post. If we need a matrix or data.frame, we can rbind those list elements (assuming they all have the same length)
m1 <- do.call(rbind, strsplit(nm1, ".", fixed = TRUE))
This returns a character matrix with one row per variable name and one column per level.
Or we can convert to a data.frame with rbind.data.frame
NOTE: names is a function name. It is better not to use function names as object names.
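To connect this back to the original goal of summarising variables at each level, here is a minimal sketch, assuming every name has the same five levels. The object name lookup and the level column names (environment, construct, subconstruct, item, short_name) are my own labels for illustration, not something from the OP's post.

```r
set.seed(1)
n <- 6
dat <- data.frame(
  ph1.career_interest.delight.1.Friendly = sample(1:5, n, replace = TRUE),
  ph1.career_interest.delight.2.Advantagious = sample(1:5, n, replace = TRUE),
  ph1.career_interest.philosophy.1.Meaningful_Difference = sample(1:5, n, replace = TRUE),
  ph1.career_interest.philosophy.2.Enable_Work = sample(1:5, n, replace = TRUE)
)

# Split the names into a matrix, then turn it into a lookup table
parts <- do.call(rbind, strsplit(colnames(dat), ".", fixed = TRUE))
lookup <- as.data.frame(parts, stringsAsFactors = FALSE)
# Hypothetical level labels, assumed for this example
names(lookup) <- c("environment", "construct", "subconstruct", "item", "short_name")
lookup$variable <- colnames(dat)

# Row means of the variables belonging to each subconstruct
sub_means <- sapply(
  split(lookup$variable, lookup$subconstruct),
  function(v) rowMeans(dat[, v, drop = FALSE])
)
sub_means
```

The same split/sapply pattern should work for any other level, e.g. by splitting on lookup$construct instead.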
Update
If the lengths are not the same, an option is to pad the shorter elements with NA at the end
lst1 <- strsplit(nm1, ".", fixed = TRUE)
lst1[[1]] <- lst1[[1]][1:3] # making lengths different
mx <- max(lengths(lst1))
do.call(rbind, lapply(lst1, `length<-`, mx))
# [,1] [,2] [,3] [,4] [,5]
#[1,] "ph1" "career_interest" "delight" NA NA
#[2,] "ph1" "career_interest" "delight" "2" "Advantagious"
#[3,] "ph1" "career_interest" "philosophy" "1" "Meaningful_Difference"
#[4,] "ph1" "career_interest" "philosophy" "2" "Enable_Work"

You can count the number of '.' in the column names to determine how many new columns to create. We can then use tidyr::separate to split the names into n new columns on '.'.
#Changing 1st column name to make length unequal
names(dat)[1] <- 'ph1.career_interest.delight.1'
#Number of new columns to be created
n <- max(stringr::str_count(names(dat), '\\.')) + 1
tidyr::separate(data.frame(name = names(dat)), name,
paste0('col', seq_len(n)), sep = '\\.', fill = 'right')
# col1 col2 col3 col4 col5
#1 ph1 career_interest delight 1 <NA>
#2 ph1 career_interest delight 2 Advantagious
#3 ph1 career_interest philosophy 1 Meaningful_Difference
#4 ph1 career_interest philosophy 2 Enable_Work

Related

Can I use the unlist function in a dataframe?

I was working with a list containing the words from a text and the tags classifying them. I was supposed to restore an old letter, and to do this i needed to extract only the words in a vector, so instead of using sapply, i did this:
words <- unlist(data.frame(letter)[1,], use.names = FALSE)
It appeared to work, but the auxiliary professor said that doing this was a problem, since you can only use unlist in lists, so I fixed it, but in the end the results were the same.
PS: I know that using sapply is more efficient, i just didn't remember the function, I'm just curious to know if you can use unlist in other objects
As @Gregor notes, data.frames are lists. Consider the following example:
df <- data.frame(Col1 = LETTERS[1:5], Col2 = 1:5, stringsAsFactors = FALSE)
is.list(df)
#[1] TRUE
Therefore, you can use lapply on a data.frame to perform column-wise operations:
lapply(df,paste0, collapse = "")
#$Col1
#[1] "ABCDE"
#$Col2
#[1] "12345"
You have to be careful, however, when subsetting a data.frame, as you may not get a list depending on the method you use.
df["Col2"]
# Col2
#1 1
#2 2
#3 3
#4 4
#5 5
is.list(df["Col2"])
#[1] TRUE
df[,"Col2"]
#[1] 1 2 3 4 5
is.list(df[,"Col2"])
#[1] FALSE
is.list(df[["Col2"]])
#[1] FALSE
is.list(df$Col2)
#[1] FALSE
is.list(subset(df,select = Col2))
#[1] TRUE
To my knowledge, however, subsetting whole rows always returns a list.
df[1,]
# Col1 Col2
#1 A 1
is.list(df[1,])
#[1] TRUE
is.list(subset(df,1:5 == 1))
#[1] TRUE
We can use the dput function to view a text representation of the underlying structure of a single row:
dput(df[1,])
#structure(list(Col1 = "A", Col2 = 1L), row.names = 1L, class = "data.frame")
As we can see, even the single row is clearly a list. Therefore, we can reasonably unlist that row just as we would any list that is not also a data.frame.
unlist(df[1,], use.names = FALSE)
#[1] "A" "1"
unlist(list(Col1 = "A", Col2 = 1L), use.names = FALSE)
#[1] "A" "1"
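One caveat that the output above already hints at: unlist returns an atomic vector, so a row with mixed column types is coerced to a common type (here character). A small sketch of the difference; the data frame num_df is made up for this example.

```r
df <- data.frame(Col1 = LETTERS[1:5], Col2 = 1:5, stringsAsFactors = FALSE)

# Mixed character/integer row: everything becomes character
mixed_row <- unlist(df[1, ], use.names = FALSE)
class(mixed_row)

# All-numeric row (hypothetical data): the result stays numeric
num_df <- data.frame(x = 1:3, y = c(0.5, 1.5, 2.5))
numeric_row <- unlist(num_df[1, ], use.names = FALSE)
class(numeric_row)
```

So unlisting a row works fine when the columns share a type, as in the word-extraction case from the question, but it silently changes types otherwise.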

Combine matrices of different length and keep column names

There is a similar question about combining vectors with different lengths here, but all answers (except @Ronak Shah's answer) lose the names/colnames.
My problem is that I need to keep the column names, which seems to be possible using the rowr package and cbind.fill.
I would like to stay in base-R or use stringi, and the output should remain a matrix.
Test data:
inp <- list(structure(c("1", "2"), .Dim = 2:1, .Dimnames = list(NULL,"D1")),
structure(c("3", "4", "5"), .Dim = c(3L, 1L), .Dimnames = list(NULL, "D2")))
I know that I could get the column names beforehand and then reassign them after creating the matrix, like:
## Using stringi
library(stringi)
colnam <- unlist(lapply(inp, colnames))
out <- stri_list2matrix(inp)
colnames(out) <- colnam
out
## Using base-R
colnam <- unlist(lapply(inp, colnames))
max_length <- max(lengths(inp))
nm_filled <- lapply(inp, function(x) {
ans <- rep(NA, length = max_length)
ans[1:length(x)]<- x
ans
})
out <- do.call(cbind, nm_filled)
colnames(out) <- colnam
out
Are there other options that keep the column names?
Since stringi is ok for you to use, you can use the function stri_list2matrix(), i.e.
setNames(as.data.frame(stringi::stri_list2matrix(inp)), sapply(inp, colnames))
# D1 D2
#1 1 3
#2 2 4
#3 <NA> 5
Here is a slightly more concise base R variation
len <- max(lengths(inp))
nms <- sapply(inp, colnames)
do.call(cbind, setNames(lapply(inp, function(x)
replace(rep(NA, len), 1:length(x), x)), nms))
# D1 D2
#[1,] "1" "3"
#[2,] "2" "4"
#[3,] NA "5"
Not sure if this constitutes a sufficiently different solution from what you've already posted. Will remove if deemed too similar.
Update
Or how about a merge?
Reduce(
function(x, y) merge(x, y, all = TRUE, by = 0),
lapply(inp, as.data.frame))[, -1]
# D1 D2
#1 1 3
#2 2 4
#3 <NA> 5
The idea here is to convert the list entries to data.frames and then merge them by row name by setting by = 0 (thanks @Henrik). Note that this will return a data.frame rather than a matrix.
Here is a base R approach:
do.call(cbind,
lapply(inp, function(i){
x <- data.frame(i, stringsAsFactors = FALSE)
as.matrix( x[ seq(max(lengths(inp))), , drop = FALSE ] )
#if the matrices have more than 1 column, use:
#as.matrix( x[ seq(max(sapply(inp, nrow))), , drop = FALSE ] )
}
))
# D1 D2
# 1 "1" "3"
# 2 "2" "4"
# NA NA "5"
The idea is to give all matrices the same number of rows. When we subset a data frame by index, rows that do not exist are returned as NA; we then convert back to a matrix and cbind.
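For completeness, the padding idea can also be wrapped in a small base R helper that keeps the column names and returns a matrix. This is only a sketch: the function name cbind_fill_base is made up, and it assumes every list element is a character matrix with column names, as in the test data.

```r
inp <- list(
  structure(c("1", "2"), .Dim = 2:1, .Dimnames = list(NULL, "D1")),
  structure(c("3", "4", "5"), .Dim = c(3L, 1L), .Dimnames = list(NULL, "D2"))
)

# Hypothetical helper: pad each matrix with NA rows, then bind columns
cbind_fill_base <- function(mats) {
  n_rows <- max(vapply(mats, nrow, integer(1)))
  padded <- lapply(mats, function(m) {
    # Assumes character matrices; NA_character_ keeps the type stable
    filled <- matrix(NA_character_, nrow = n_rows, ncol = ncol(m),
                     dimnames = list(NULL, colnames(m)))
    filled[seq_len(nrow(m)), ] <- m
    filled
  })
  do.call(cbind, padded)
}

out <- cbind_fill_base(inp)
out
```

Because each padded piece carries its own colnames, cbind keeps them and no reassignment is needed afterwards.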

Select column name based on data frame content R

I want to build a matrix or data frame by choosing the names of the columns where the element in the data frame is not NA. For example, suppose I have:
zz <- data.frame(a = c(1, NA, 3, 5),
b = c(NA, 5, 4, NA),
c = c(5, 6, NA, 8))
which gives:
a b c
1 1 NA 5
2 NA 5 6
3 3 4 NA
4 5 NA 8
I want to recognize each NA and build a new matrix or df that looks like:
a c
b c
a b
a c
There will be the same number of NAs in each row of the input matrix/df. I can't seem to get the right code to do this. Suggestions appreciated!
library(dplyr)
library(tidyr)
zz %>%
mutate(k = row_number()) %>%
gather(column, value, a, b, c) %>%
filter(!is.na(value)) %>%
group_by(k) %>%
summarise(temp_var = paste(column, collapse = " ")) %>%
separate(temp_var, into = c("var1", "var2"))
# A tibble: 4 × 3
k var1 var2
* <int> <chr> <chr>
1 1 a c
2 2 b c
3 3 a b
4 4 a c
Here's a possible vectorized base R approach
indx <- which(!is.na(zz), arr.ind = TRUE)
matrix(names(zz)[indx[order(indx[, "row"]), "col"]], ncol = 2, byrow = TRUE)
# [,1] [,2]
#[1,] "a" "c"
#[2,] "b" "c"
#[3,] "a" "b"
#[4,] "a" "c"
This finds the non-NA indices, sorts them by row, and then subsets the names of your zz data set according to the sorted index. You can wrap it in as.data.frame if you prefer that over a matrix.
EDIT: transpose the data frame once before processing, so there is no need to transpose twice inside the loop as in the first version.
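A row-wise variant of the same idea, as a rough sketch: it relies on the OP's statement that every row has the same number of NAs. If that does not hold, apply returns a list instead of a matrix and the t() call will not give the intended result.

```r
zz <- data.frame(a = c(1, NA, 3, 5),
                 b = c(NA, 5, 4, NA),
                 c = c(5, 6, NA, 8))

# For each row, keep the names of the non-NA columns;
# apply returns one column per row, so transpose at the end
res <- t(apply(zz, 1, function(r) names(zz)[!is.na(r)]))
res
```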
cols <- names(zz)
for (column in cols) {
zz[[column]] <- ifelse(is.na(zz[[column]]), NA, column)
}
t_zz <- t(zz)
cols <- vector("list", length = ncol(t_zz))
for (i in 1:ncol(t_zz)) {
cols[[i]] <- na.omit(t_zz[, i])
}
new_dt <- as.data.frame(t(do.call("cbind", cols)))
The tricky part here is that your goal actually changes the data frame structure, so "remove the NAs in each row" has to be built row by row into a new data frame, since each column in a result row could come from a different column of the original data frame.
zz[1, ] is a one-row data frame; use t to convert it into a vector so we can apply na.omit, then transpose back to a row.
I used two for loops, but for loops are not necessarily bad in R. The first one is vectorized within each column. The second one needs to be done row by row anyway.
EDIT: growing objects performs very badly in R. I knew I could use rbindlist from data.table, which takes a list of data frames, but the OP doesn't want new packages. My first attempt just used rbind, which cannot take a list as input. Later I found that an alternative is do.call. It's still slower than rbindlist, though.
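The point about growing objects can be sketched like this: fill a preallocated list inside the loop and bind everything once at the end with do.call, rather than calling rbind on every iteration. This is a simplified illustration that works on names directly, not the exact code from the edit history.

```r
zz <- data.frame(a = c(1, NA, 3, 5),
                 b = c(NA, 5, 4, NA),
                 c = c(5, 6, NA, 8))

# Preallocate one slot per row instead of growing a result object
rows <- vector("list", nrow(zz))
for (i in seq_len(nrow(zz))) {
  keep <- !is.na(unlist(zz[i, ]))
  rows[[i]] <- names(zz)[keep]
}

# Bind all rows in a single call
result <- do.call(rbind, rows)
result
```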

Get only the structure(row names & column name) of data set in R

Consider a data frame with row names and column names:
> data <- data.frame(a=1:3,b=2:4,c=3:5,row.names=c("x","y","z"))
> data
a b c
x 1 2 3
y 2 3 4
z 3 4 5
I just want to display the row names and column names of data like:
a b c
x
y
z
Perhaps you need
data[] <- ''
data
# a b c
#x
#y
#z
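Note that the assignment above overwrites the values in data. If the original values are still needed, a sketch that blanks a copy instead (the name blank is arbitrary):

```r
data <- data.frame(a = 1:3, b = 2:4, c = 3:5, row.names = c("x", "y", "z"))

# Work on a copy so the original values survive
blank <- data
blank[] <- ""
blank

# The original is untouched
data
```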
If we need only the names, then dimnames is an option which return the row names and column names in a list.
dimnames(data)
#[[1]]
#[1] "x" "y" "z"
#[[2]]
#[1] "a" "b" "c"
Or maybe
m1 <- matrix("", ncol = ncol(data), nrow = nrow(data),
dimnames = list(rownames(data), colnames(data)) )
If you want to see the column names in your dataset just use this
print(names(dataset_name))
For its structure,
str(dataset_name)

how do I search for columns with same name, add the column values and replace these columns with same name by their sum? Using R

I have a data frame where some consecutive columns have the same name. I need to search for these, add their values for each row, drop one column and replace the other with their sum,
without knowing in advance which names are duplicated. This possibly means comparing each column name with the following one to see if there's a match.
Can someone help?
Thanks in advance.
> dfrm <- data.frame(a = 1:10, b= 1:10, cc= 1:10, dd=1:10, ee=1:10)
> names(dfrm) <- c("a", "a", "b", "b", "b")
> sapply(unique(names(dfrm)[duplicated(names(dfrm))]),
function(x) Reduce("+", dfrm[ , grep(x, names(dfrm))]) )
a b
[1,] 2 3
[2,] 4 6
[3,] 6 9
[4,] 8 12
[5,] 10 15
[6,] 12 18
[7,] 14 21
[8,] 16 24
[9,] 18 27
[10,] 20 30
EDIT 2: Using rowSums allows simplifying the first sapply argument to just unique(names(dfrm)), at the expense of needing to remember to include drop=FALSE in "[":
sapply(unique(names(dfrm)),
function(x) rowSums( dfrm[ , grep(x, names(dfrm)), drop=FALSE]) )
To deal with NA's:
sapply(unique(names(dfrm)),
function(x) apply(dfrm[grep(x, names(dfrm))], 1,
function(y) if ( all(is.na(y)) ) {NA} else { sum(y, na.rm=TRUE) }
) )
(Edit note: addressed Tommy's counter-example by putting unique around the names(.)[.] construction.)
The erroneous code was:
sapply(names(dfrm)[unique(duplicated(names(dfrm)))],
function(x) Reduce("+", dfrm[ , grep(x, names(dfrm))]) )
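One caveat with the grep-based snippets in this answer: grep does regular-expression matching, so a column named a would also pick up a column named ab. A sketch using exact name comparison instead; the data frame here is made up to show the difference.

```r
# Hypothetical data: "a" appears twice, "ab" once
dfrm <- data.frame(a = 1:3, ab = 4:6, a2 = 7:9)
names(dfrm) <- c("a", "ab", "a")

# Compare names exactly rather than by pattern
sums <- sapply(unique(names(dfrm)),
               function(x) rowSums(dfrm[, names(dfrm) == x, drop = FALSE]))
sums
```

With grep("a", names(dfrm)) the ab column would have been added into the a total as well.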
Here is my one liner
# transpose data frame, sum by group = rowname, transpose back.
t(rowsum(t(dfrm), group = rownames(t(dfrm))))
One way is to identify duplicates using (surprise) the duplicated function, and then loop through them to calculate the sums. Here is an example:
dat.dup <- data.frame(x=1:10, x=1:10, x=1:10, y=1:10, y=1:10, z=1:10, check.names=FALSE)
dups <- unique(names(dat.dup)[duplicated(names(dat.dup))])
for (i in dups) {
dat.dup[[i]] <- rowSums(dat.dup[names(dat.dup) == i])
}
dat <- dat.dup[!duplicated(names(dat.dup))]
Some sample data.
dfr <- data.frame(
foo = rnorm(20),
bar = 1:20,
bar = runif(20),
check.names = FALSE
)
Method: Loop over the unique column names; if there is only one column of that name, then selecting all columns with that name will return a vector, but if there are duplicates it will instead be a data frame. Use rowSums to sum over rows. (Duh. EDIT: Not quite as 'duh' as previously thought!) lapply returns a list, which we need to reform into a data frame, and finally we fix the names. EDIT: sapply avoids the need for the last step.
unique_col_names <- unique(colnames(dfr))
new_dfr <- sapply(unique_col_names, function(name)
{
subs <- dfr[, colnames(dfr) == name]
if(is.data.frame(subs))
rowSums(subs)
else
subs
})
