Combining identical columns, concatenating the column names in R

I have a matrix for a minimal example:
data <- c(1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4)
Matrix = matrix(data, nrow = 3, ncol=4)
colnames(Matrix) <-c("4","3","7","100")
rownames(Matrix) <-c("bob","foo","bar")
> Matrix
    4 3 7 100
bob 1 1 2   3
foo 1 1 2   4
bar 1 1 3   4
I want to combine any columns that are identical apart from their names, and update the colnames so that I know which original columns were identical. I have tried using loops to find the duplicates, but I can't get the name-combining part to work.
The expected result would be something like the following matrix:
> Matrix
    4-3 7 100
bob   1 2   3
foo   1 2   4
bar   1 3   4

Here is another base R option
do.call(
  cbind,
  Map(
    function(x) `colnames<-`(Matrix[, (nm <- names(x))[1], drop = FALSE], paste0(nm, collapse = "-")),
    split(u <- unlist(Map(toString, as.data.frame(Matrix))), u)
  )
)
which gives
    4-3 7 100
bob   1 2   3
foo   1 2   4
bar   1 3   4

We could split the columns into a list keyed on the pasted values of each column, take the first column of each group, paste the group's column names together, and cbind the results:
do.call(cbind, lapply(unname(split.default(as.data.frame(Matrix),
    apply(Matrix, 2, paste, collapse = ''))),
  function(x) matrix(x[, 1],
    dimnames = list(NULL, paste(colnames(x), collapse = '-')))))
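As a complement to the answers above, here is a minimal base R sketch of the same idea that also keeps the row names. The names key, grp, new_names and result are illustrative only. It assumes that pasting a column's values with a tab separator gives a key that uniquely identifies the column's contents, which holds for simple numeric data like this example.

```r
Matrix <- matrix(c(1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4), nrow = 3,
                 dimnames = list(c("bob", "foo", "bar"),
                                 c("4", "3", "7", "100")))

# One string key per column; identical columns get identical keys
key <- apply(Matrix, 2, paste, collapse = "\t")

# Group id per column, numbered in order of first appearance
grp <- match(key, unique(key))

# Join the names of all columns that share a group
new_names <- unname(vapply(split(colnames(Matrix), grp),
                           paste, character(1), collapse = "-"))

# Keep the first column of each group and rename
result <- Matrix[, !duplicated(key), drop = FALSE]
colnames(result) <- new_names
result
```

Because match() numbers the groups in order of first appearance and !duplicated() keeps the first column of each group, the joined names line up with the retained columns.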

Return maximum of conditionally selected pairs from a vector in R

Reproducible example:
set.seed(1)
A <- round(runif(12, min = 1, max = 5))
> A
[1] 1 2 2 4 3 4 3 4 5 3 4 5
expectedResult <- c(max(A[1], A[4]), max(A[2], A[5]), max(A[3], A[6]), max(A[7], A[10]), max(A[8], A[11]), max(A[9], A[12]))
> expectedResult
[1] 4 3 4 3 4 5
Each A needs to be considered as a collection of segments of 6 elements. For example, A here has 2 segments, A[1:6] and A[7:12]. For each segment, the first 3 elements are compared with the next 3 elements. Therefore I need to take max(A[1], A[4]), max(A[2], A[5]), max(A[3], A[6]), max(A[7], A[10]), max(A[8], A[11]), max(A[9], A[12]).
My original vector has far more elements than this example, so I need a simpler approach than writing out each pair. Speed also matters for the original calculation, so I am looking for a fast solution as well.

We could write a function that splits the vector into chunks of n elements, loops over the resulting list, reshapes each chunk into a matrix with nrow = 2 (filled by row, so the first half of the chunk is one row and the second half the other), transposes it and converts it to a data.frame so that pmax can take the elementwise max of the two halves, and finally unlists the result:
f1 <- function(vec, n) {
  lst1 <- split(vec, as.integer(gl(length(vec), n, length(vec))))
  unname(unlist(lapply(lst1, function(x)
    do.call(pmax, as.data.frame(t(matrix(x, nrow = 2, byrow = TRUE)))))))
}
-output
> f1(A, 6)
[1] 4 3 4 3 4 5
If the length is not a multiple of 3 or 6, another option is to do a group-by operation with tapply after splitting:
unname(unlist(lapply(split(A, as.integer(gl(length(A), 6,
length(A)))), function(x) tapply(x, (seq_along(x)-1) %% 3 + 1, FUN = max))))
[1] 4 3 4 3 4 5
data
A <- c(1, 2, 2, 4, 3, 4, 3, 4, 5, 3, 4, 5)

Another option in base R:
a <- 6
unlist(tapply(A, gl(length(A)/a, a),
function(x) pmax(head(x, a/2), tail(x, a/2))),, FALSE)
[1] 4 3 4 3 4 5
or even
a <- 6
unlist(tapply(A, gl(length(A)/a, a),
function(x) do.call(pmax, data.frame(matrix(x, ncol = 2)))),, FALSE)
[1] 4 3 4 3 4 5

You can reshape the vector to a 3d array, split by column, and take the parallel max. This should be pretty efficient as far as base R goes.
do.call(pmax.int, asplit(`dim<-`(A, c(3,2,2)), 2))
[1] 4 3 4 3 4 5
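The array answer above hard-codes the dimensions c(3, 2, 2) for this particular A. A hedged generalization might look like the sketch below. The function name pair_max and its checks are my own, and it assumes the segment length n is even and divides length(vec) exactly.

```r
# Hypothetical helper: elementwise max of the two halves of each
# length-n segment, done with one array reshape and no explicit loop
pair_max <- function(vec, n) {
  stopifnot(n %% 2 == 0, length(vec) %% n == 0)
  half <- n / 2
  # dims: position within half, which half (1 or 2), segment index
  arr <- array(vec, dim = c(half, 2, length(vec) / n))
  # compare first halves with second halves across all segments at once
  as.vector(pmax(arr[, 1, ], arr[, 2, ]))
}

A <- c(1, 2, 2, 4, 3, 4, 3, 4, 5, 3, 4, 5)
pair_max(A, 6)
```

Since R fills arrays column-major, arr[, 1, k] is the first half of segment k and arr[, 2, k] the second half, so a single pmax call covers every segment.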

Count equal elements in R

I have the following x:
x<-c( 1, 2 , 3 , 1 , 4 , 5 , 6 , 2 , 3 , 2 , 3 , 8 )
How can I count the equal elements in x? I want the returned result to be 3.
Explanation: there are 3 values (1, 2, 3) that appear at least twice.
With x[i]==1 there are 2 elements, count=1
With x[i]==2 there are 3 elements, count=2
With x[i]==3 there are 3 elements, count=3
I want the result to be count=3.
Thank you very much!

So if I understand correctly, you want to count how many distinct numbers are repeated in the vector.
One way to do it would be to construct a table from the vector, and see how many elements have a count higher than one:
x <- c(1, 2, 3, 1, 4, 5, 6, 2, 3, 2, 3, 8)
tab <- table(x)
sum(tab > 1)
#> [1] 3
Created on 2020-11-26 by the reprex package (v0.3.0)

Here are a couple of base R options:
Using rle:
sum(with(rle(sort(x)), lengths > 1))
#[1] 3
With tapply:
sum(tapply(x, x, length) > 1)

library(tibble)
library(dplyr)
x<-c( 1, 2 , 3 , 1 , 4 , 5 , 6 , 2 , 3 , 2 , 3 , 8 )
tibble::tibble(x) %>% dplyr::count(x)
Edit:
As I was kindly asked to add some comments I will gladly do so.
Here I employed the tibble package to create a data.frame (or tibble, rather) and the dplyr package for counting.
tibble::tibble(x)
turns the vector x into a tibble (a type of data.frame) for further analysis. Unfortunately, the only variable of the tibble x is also called x. Sorry for that!
%>%
The pipe operator takes the value to its left (here, the newly created tibble called x) and provides it as input for the subsequent command:
dplyr::count(x)
Here, we use dplyr to count() the variable x from the tibble x (again, sorry for that). The result shows how many times each value occurred:
      x     n
  <dbl> <int>
1     1     2
2     2     3
3     3     3
4     4     1
5     5     1
6     6     1
7     8     1
where the first column (1 to 7) is simply the row numbers, x holds the values provided in the original question, and n counts how often each value occurred.

length(unique(x[duplicated(x)]))
# 3
Data
x <- c(1, 2, 3, 1, 4, 5, 6, 2, 3, 2, 3, 8)
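The duplicated() one-liner above comes without explanation, so here is a step-by-step sketch of what it does. The intermediate names is_repeat, repeats and n_repeated are illustrative only.

```r
x <- c(1, 2, 3, 1, 4, 5, 6, 2, 3, 2, 3, 8)

# TRUE for every element that has already appeared earlier in x
is_repeat <- duplicated(x)

# The repeated occurrences themselves; a value seen three times
# shows up here twice
repeats <- x[is_repeat]

# Collapse to distinct values and count them
n_repeated <- length(unique(repeats))
n_repeated
```

The unique() step matters because 2 and 3 each occur three times; without it the count would be 5 rather than the number of distinct repeated values.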

Unlist nested list by name

I import a nested list of unknown length (here 2) and unknown names (here iter1 and iter2) and get the names of the list:
iter1 <- list(1, 2, 3, 4)
iter2 <- list(1, 2, 3, 4)
nested_list <- list(iter1 = iter1, iter2 = iter2)
names <- names(nested_list)
The next thing I want to do is actually this:
unlist <- data.frame(x=unlist(nested_list$iter1))
But due to the fact I don't know the names beforehand I want to do something like this:
unlist <- data.frame(x=unlist(nested_list$names[1]))
Which is certainly not working. There is no error, but the created list is empty.
In the end I want to do something like this:
for(i in 1:length(nested_list)) {
unlist <- data.frame(x=unlist(nested_list$names[i]))
print(unlist)
}

Using Map, avoiding the names vector.
data.frame(Map(unlist, nested_list)[1])
#   iter1
# 1     1
# 2     2
# 3     3
# 4     4
Or, in order to give column names with mapply:
data.frame(x=mapply(unlist, nested_list)[,1])
#   x
# 1 1
# 2 2
# 3 3
# 4 4
The 1 in brackets selects the first list element; use 2 for the second one, and so on.
Data
nested_list <- list(iter1 = list(1, 2, 3, 4), iter2 = list(1, 2, 3, 4))

Maybe you can try the code below
unlist <- data.frame(x=unlist(nested_list[names[1]]))
such that
       x
iter11 1
iter12 2
iter13 3
iter14 4
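For context on why the original attempt returns nothing: the $ operator does not evaluate its right-hand side, so nested_list$names looks for an element literally called "names", which does not exist and yields NULL. A minimal sketch of the loop from the question using [[ instead is shown below. The names list_names and frames are illustrative (list_names is used rather than names to avoid masking the base function).

```r
nested_list <- list(iter1 = list(1, 2, 3, 4), iter2 = list(1, 2, 3, 4))
list_names <- names(nested_list)

# `$` takes its argument literally, so this is NULL
literal_lookup <- nested_list$list_names

# `[[` evaluates the name, so each inner list can be reached in a loop
frames <- lapply(list_names, function(nm) {
  data.frame(x = unlist(nested_list[[nm]]))
})
names(frames) <- list_names

frames$iter1
```

Using [[ rather than [ also means unlist() sees the inner list directly, so the row names are plain 1 to 4 instead of iter11, iter12 and so on.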

I am not sure I understand the result you intended; could you clarify it if needed?
iter1 <- list(1, 2, 3, 4)
iter2 <- list(1, 2, 3, 4)
nested_list <- list(iter1 = iter1, iter2 = iter2)
names <- names(nested_list)
cbind.data.frame(lapply(nested_list, unlist))
#>   iter1 iter2
#> 1     1     1
#> 2     2     2
#> 3     3     3
#> 4     4     4

Subsetting of data.frames with variable name vs. column number

I am fairly new to R and I have run into a problem with subsetting data frames a number of times. I have found a fix but would just like to understand what I am missing.
Here is an exemplary bit of code, where I don't understand the functional difference.
Example data frame:
df <- data.frame(V1 = c(1:10), V2 = c(rep(1, times = 10)))
this produces an "undefined columns selected" error:
df1 <- df[df$V1 < 5, df$V2]
but this works:
df2 <- df[df$V1 < 5, 2]
I don't understand why, when referring to the column by its name via $V2, I do not receive the same result as when referring to the same column by its number.
This is a really basic question, I am aware, but I would just like to get my head around it.
Thanks and also sorry if formatting is off or anything (first time posting..),
Christoph

df[df$V1 < 5, df$V2] doesn't give an "undefined columns selected" error.
df[df$V1 < 5, df$V2]
#  V1 V1.1 V1.2 V1.3 V1.4 V1.5 V1.6 V1.7 V1.8 V1.9
#1  1    1    1    1    1    1    1    1    1    1
#2  2    2    2    2    2    2    2    2    2    2
#3  3    3    3    3    3    3    3    3    3    3
#4  4    4    4    4    4    4    4    4    4    4
df$V2 contains only the value 1, and your data frame does have a first column, so this selects the first column length(df$V2) times. Because columns with the same name are discouraged, R appends the suffixes .1, .2, and so on to the repeated names.
This is the same as doing
df[df$V1 < 5, c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)]
It would give an "undefined columns selected" error if you selected columns that are not present in the data.
df[df$V1 < 5, c(1, 3)]
Error in `[.data.frame`(df, df$V1 < 5, c(1, 3)) :
  undefined columns selected
There are different ways in which you can access the data:
By column name, which is
df[df$V1 < 5, "V2"]
#[1] 1 1 1 1
Or
df$V2[df$V1 < 5]
and by column position.
df[df$V1 < 5, 2]
#[1] 1 1 1 1
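To make the distinction concrete, the sketch below puts the access styles side by side. The variable names are illustrative. The key point is that the column slot of [ expects positions, names, or a logical vector, whereas df$V2 supplies the column's values, which R then interprets as positions.

```r
df <- data.frame(V1 = 1:10, V2 = rep(1, times = 10))

# Column selected by name and by position give the same vector
by_name <- df[df$V1 < 5, "V2"]
by_pos  <- df[df$V1 < 5, 2]

# Looking the position up from the name, if a number is needed
by_lookup <- df[df$V1 < 5, which(names(df) == "V2")]

# drop = FALSE keeps the result as a one-column data frame
kept_df <- df[df$V1 < 5, "V2", drop = FALSE]

# Passing the values of V2 (all 1s) selects column 1 ten times
by_values <- df[df$V1 < 5, df$V2]
ncol(by_values)
```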

Extend/expand data frame with column of lists each into a row

I have a data frame of the following type:
df <- data.frame("col1" = c(1,2,3,4))
df$col2 <- list(list(1,1,1),list(2,2,2),list(3,3,3),list(4,4,4))
df$col3 <- list(c(1,1,1),c(2,2,2),c(3,3,3),c(4,4,4))
df
And get:
  col1    col2    col3
1    1 1, 1, 1 1, 1, 1
2    2 2, 2, 2 2, 2, 2
3    3 3, 3, 3 3, 3, 3
4    4 4, 4, 4 4, 4, 4
Now I would like to manipulate this data frame to get something like:
  col1 col3
1    1    1
     1    1
     1    1
2    2    2
     2    2
     2    2
3    3    3
     3    3
     3    3
...
I can do this with a simple loop: for each row I convert the list into a data frame, then use rbind to append the data frames into a single one.
My question is: how do I do this with a vectorized function?
I have tried apply, sapply, mapply and Reduce, but with no success. apply was the only one that actually executed, but it produced incorrect results (I got only the first element of each list).

We can remove the first column (df[-1]), loop over the other columns, unlist each one, and then convert the resulting list to a data.frame
lst <- lapply(df[-1], unlist)
dfN <- data.frame(lst)
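The answer above drops col1, while the desired output in the question keeps it alongside col3. One possible way to retain it, sketched under the assumption that col1 should be repeated once per element of the corresponding col3 entry, is to combine rep() with lengths(). The name long is illustrative.

```r
df <- data.frame(col1 = c(1, 2, 3, 4))
df$col3 <- list(c(1, 1, 1), c(2, 2, 2), c(3, 3, 3), c(4, 4, 4))

# Repeat each col1 value as many times as its col3 entry has elements,
# then flatten col3 so the two vectors line up row by row
long <- data.frame(
  col1 = rep(df$col1, times = lengths(df$col3)),
  col3 = unlist(df$col3)
)

head(long)
```

lengths() returns the size of each list element, so this also copes with entries of unequal length.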