Merging a list of lists into comma-separated values in R

I have a data set like this.
Reproducible data:
test <- list(c("1"),c("2"),c("3"),c(c("a"),c("b")),c("d"))
Desired output is:
1
2
3
a,b
d
I have tried
output <- do.call(rbind, test)

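You can use toString to collapse every element in the list. Note that toString separates elements with ", " (comma plus space), so the fourth row reads "a, b" rather than "a,b".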
out <- data.frame(result = sapply(test, toString))
out
# result
#1 1
#2 2
#3 3
#4 a, b
#5 d
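If the output must match "a,b" exactly (no space after the comma), a minimal base R sketch is to swap toString for paste with an explicit collapse argument. The names res and out are only illustrative.

```r
# Same data as in the question; c(c("a"), c("b")) is just c("a", "b")
test <- list(c("1"), c("2"), c("3"), c("a", "b"), c("d"))

# paste(..., collapse = ",") joins each element without a trailing space;
# vapply guarantees one character string per list element
res <- vapply(test, paste, character(1), collapse = ",")

out <- data.frame(result = res)
out
```

vapply is used here instead of sapply only because it fails loudly if an element does not collapse to a single string.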

We can use tidyverse methods
Loop over the list with map
Paste the elements using str_c
Return a single-column dataset by using the _dfr variant of map (map_dfr)
library(dplyr)
library(purrr)
library(stringr)
map_dfr(test, ~ tibble(col1 = str_c(.x, collapse=",")))
# A tibble: 5 x 1
col1
<chr>
1 1
2 2
3 3
4 a,b
5 d

Actually you are already close to your goal. Maybe the code below can help you
> do.call(rbind, Map(toString, test))
[,1]
[1,] "1"
[2,] "2"
[3,] "3"
[4,] "a, b"
[5,] "d"

Maybe this could help too, besides all the very good answers you received:
test |>
lapply(\(x) paste(x, collapse = ",")) |>
cbind()
[,1]
[1,] "1"
[2,] "2"
[3,] "3"
[4,] "a,b"
[5,] "d"

Related

How to sum values of columns if they have some row values in common in R? [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
I have a very big data frame. Let df below represent it:
df <-as.data.frame(rbind(c("a",1,1,1),c("a",1,1,1),c("a",1,1,1),c("b",2,2,2),c("b",2,2,2),c("b",2,2,2)))
[,1] [,2] [,3] [,4]
[1,] "a" "1" "1" "1"
[2,] "a" "1" "1" "1"
[3,] "a" "1" "1" "1"
[4,] "b" "2" "2" "2"
[5,] "b" "2" "2" "2"
[6,] "b" "2" "2" "2"
I want to create a dataframe like the one below out of it:
[,1] [,2] [,3] [,4]
[1,] "a" "3" "3" "3"
[2,] "b" "6" "6" "6"
I see several similar posts here, but the answers, although very useful, need a vector of all possible values in the first column and so on. My problem is that my dataset has about 3000 rows.
How can I get the result in R?
We could use group_by and summarise after using type.convert(as.is=TRUE):
library(dplyr)
df %>%
type.convert(as.is=TRUE) %>%
group_by(V1) %>%
summarise(across(V2:V4, sum))
V1 V2 V3 V4
<chr> <int> <int> <int>
1 a 3 3 3
2 b 6 6 6
We can use aggregate
df <- type.convert(df, as.is = TRUE)
aggregate(.~ V1, df, FUN = sum)
-output
V1 V2 V3 V4
1 a 3 3 3
2 b 6 6 6
NOTE: The OP created the data.frame from a matrix, and a matrix can hold only a single class, so every column ends up as text. Thus, do the type conversion first.
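To make the note above concrete, here is a small sketch that checks the column classes after type.convert and then aggregates. The variable names col_classes and res are only illustrative.

```r
# Data frame built from a character matrix, as in the question
df <- as.data.frame(rbind(c("a", 1, 1, 1), c("a", 1, 1, 1), c("a", 1, 1, 1),
                          c("b", 2, 2, 2), c("b", 2, 2, 2), c("b", 2, 2, 2)))

# Convert each column to its natural type; as.is = TRUE keeps V1 as character
df <- type.convert(df, as.is = TRUE)

# V1 is character, V2..V4 are now integer, so sum() works on them
col_classes <- vapply(df, class, character(1))
col_classes

# Sum every remaining column within each level of V1
res <- aggregate(. ~ V1, data = df, FUN = sum)
res
```

No vector of possible group values is needed; aggregate finds the groups from V1 itself, so the same call should work unchanged on 3000 rows.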
Another aggregate option
> aggregate(. ~ V1, df, function(x) sum(as.numeric(x)))
V1 V2 V3 V4
1 a 3 3 3
2 b 6 6 6

Find first occurrence of a character in column of a data frame in R

Struggling with string handling in R...
I've got a column of strings in an R data frame. Each one contains the "=" character once and only once. I'd like to know the position of the "=" character in each element of the column, as a step to splitting the column into two separate columns (one for the bit before the "=" and one for the bit after the "="). Can anyone help please? I'm sure it's simple but I'm struggling to find the answer.
For example, if I have:
x <- data.frame(string = c("aa=1", "aa=2", "aa=3", "b=1", "b=2", "abc=5"))
I'd like a bit of code to return
(3, 3, 3, 2, 2, 4)
Thank you.
To get the position of "=" you can use the regexpr function:
regexpr("=", x$string)
#[1] 3 3 3 2 2 4
#attr(,"match.length")
#[1] 1 1 1 1 1 1
#attr(,"useBytes")
#[1] TRUE
However, as @Michael stated, if your goal is to split the string you can use strsplit:
strsplit(x$string, "=")
#[[1]]
#[1] "aa" "1"
#
#[[2]]
#[1] "aa" "2"
#
#[[3]]
#[1] "aa" "3"
#
#[[4]]
#[1] "b" "1"
#
#[[5]]
#[1] "b" "2"
#
#[[6]]
#[1] "abc" "5"
Or combine it with do.call and rbind to create a two-column matrix, which can then be turned into a new data frame:
do.call(rbind, strsplit(x$string, "="))
# [,1] [,2]
#[1,] "aa" "1"
#[2,] "aa" "2"
#[3,] "aa" "3"
#[4,] "b" "1"
#[5,] "b" "2"
#[6,] "abc" "5"
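For completeness, the positions returned by regexpr can also drive the split directly with substr. This is only a sketch; the names s, pos, before and after are illustrative, and as.character is there in case the column is a factor (the default in R versions before 4.0).

```r
x <- data.frame(string = c("aa=1", "aa=2", "aa=3", "b=1", "b=2", "abc=5"))

# Work on a plain character vector
s <- as.character(x$string)

# Position of the first "=" in each string; fixed = TRUE treats "=" literally
pos <- as.integer(regexpr("=", s, fixed = TRUE))

# Everything before and after the "="
before <- substr(s, 1, pos - 1)
after  <- substr(s, pos + 1, nchar(s))

data.frame(before = before, after = after)
```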
Here's a way to do it:
library(stringr)
str_locate(x$string, "=")[,1]
You can use gregexpr:
unlist(lapply(gregexpr(pattern = '=', x$string), min))
[1] 3 3 3 2 2 4
In Base R you can do:
as.numeric(lapply(strsplit(as.character(x$string), ""), function(x) which(x == "=")))
[1] 3 3 3 2 2 4
Here is another solution to obtain a two column dataframe, the first containing the characters before = and the second one containing the characters after =. You can do that without obtaining the positions of the = character.
library(stringr)
t(as.data.frame(strsplit(x$string, "=")))
# [,1] [,2]
#c..aa....1.. "aa" "1"
#c..aa....2.. "aa" "2"
#c..aa....3.. "aa" "3"
#c..b....1.. "b" "1"
#c..b....2.. "b" "2"
#c..abc....5.. "abc" "5"
Some may find this more readable
library(tidyverse)
x %>%
mutate(
number = string %>% str_extract('[:digit:]+'),
text = string %>% str_extract('[:alpha:]+')
) %>%
as_tibble()
# A tibble: 6 x 3
string number text
<fct> <chr> <chr>
1 aa=1 1 aa
2 aa=2 2 aa
3 aa=3 3 aa
4 b=1 1 b
5 b=2 2 b
6 abc=5 5 abc

cbind coerces a data frame to matrix

I'm having trouble when using cbind. Prior to using cbind the object is a data.frame of two character vectors.
After I add a column using cbind, the data.frame object changes class to matrix. I've tried as.vector, declaring h as an empty character vector, etc. but couldn't fix it. Thank you for any suggestions and help.
output <- data.frame(h = character(), st = character()) ## empty dataframe
st <- state.abb
h <- (rep("a", 50))
output <- cbind(output$h, h) ## output changes to matrix class here
output <- cbind(output, st) ## adding a second column
I guess you may not need cbind().
output <- data.frame(state = state.abb, h = rep("a", 50))
head(output)
state h
1 AL a
2 AK a
3 AZ a
4 AR a
5 CA a
6 CO a
@Ken I'm not sure what you actually want to obtain, but it may be easier if the variables are kept in a list. Below is an example.
state <- state.abb
h <- rep("a", 50)
lst <- list(state = state, h = h)
mat <- as.matrix(do.call(cbind, lst))
head(mat)
state h
[1,] "AL" "a"
[2,] "AK" "a"
[3,] "AZ" "a"
[4,] "AR" "a"
[5,] "CA" "a"
[6,] "CO" "a"
df <- as.data.frame(do.call(cbind, lst))
head(df)
state h
1 AL a
2 AK a
3 AZ a
4 AR a
5 CA a
6 CO a
As a complement of info, notice that you could use single bracket notation to make it work with something close to your original code:
Data:
output <- data.frame(h = letters[1:5],st = letters[6:10])
h2 <- (rep("a", 5))
This won't work:
cbind(output$h, h2)
# h2
# [1,] "1" "a"
# [2,] "2" "a"
# [3,] "3" "a"
# [4,] "4" "a"
# [5,] "5" "a"
class(cbind(output$h, h2)) # matrix
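It's a matrix, and the factor has been coerced to its integer codes (this happens in R versions before 4.0, where data.frame() converts strings to factors by default; in R 4.0 and later the letters are kept, but the result is still a matrix).
This will work: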
cbind(output["h"], h2)
# h h2
# 1 a a
# 2 b a
# 3 c a
# 4 d a
# 5 e a
class(cbind(output["h"], h2)) # data.frame
Note that with double brackets (output[["h"]]) you'll have the same inadequate result as when using the dollar notation.
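The difference can be checked directly. The sketch below sets stringsAsFactors = FALSE explicitly so that it behaves the same on old and new R versions; the names m and d are illustrative.

```r
output <- data.frame(h = letters[1:5], st = letters[6:10],
                     stringsAsFactors = FALSE)
h2 <- rep("a", 5)

# output$h is a plain vector, so cbind() of two vectors builds a matrix
m <- cbind(output$h, h2)
is.matrix(m)

# output["h"] is still a data frame, so cbind() returns a data frame
d <- cbind(output["h"], h2)
is.data.frame(d)

# Plain assignment avoids cbind() altogether
output$h2 <- h2
names(output)
```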

R: transposing and splitting a row with a delimiter.

I have a table
rawData <- as.data.frame(matrix(c(1,2,3,4,5,6,"a,b,c","d,e","f"),nrow=3,ncol=3))
1 4 a,b,c
2 5 d,e
3 6 f
I would like to convert to
1 2 3
4 5 6
a d f
b e
c
So far I can transpose and split the third column; however, I'm lost as to how to reconstruct a new table with the format outlined above.
new = t(rawData)
for (e in 1:ncol(new)){
s<-strsplit(new[3:3,e], split=",")
print(s)
}
I tried creating new vectors for each iteration but I'm not sure how to efficiently put each one back into a dataframe. Would be grateful for any help. thanks!
You can use stri_list2matrix from the stringi package:
library(stringi)
rawData <- as.data.frame(matrix(c(1,2,3,4,5,6,"a,b,c","d,e","f"),nrow=3,ncol=3),stringsAsFactors = FALSE)
d1 <- t(rawData[,1:2])
rownames(d1) <- NULL
d2 <- stri_list2matrix(strsplit(rawData$V3,split=','))
rbind(d1,d2)
# [,1] [,2] [,3]
# [1,] "1" "2" "3"
# [2,] "4" "5" "6"
# [3,] "a" "d" "f"
# [4,] "b" "e" NA
# [5,] "c" NA NA
You can also use cSplit from my "splitstackshape" package.
By default, it just creates additional columns after splitting the input:
library(splitstackshape)
cSplit(rawData, "V3")
# V1 V2 V3_1 V3_2 V3_3
# 1: 1 4 a b c
# 2: 2 5 d e NA
# 3: 3 6 f NA NA
You can just transpose that to get your desired output.
t(cSplit(rawData, "V3"))
# [,1] [,2] [,3]
# V1 "1" "2" "3"
# V2 "4" "5" "6"
# V3_1 "a" "d" "f"
# V3_2 "b" "e" NA
# V3_3 "c" NA NA
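If installing an extra package is not an option, the same padding can be sketched in base R. Indexing a vector past its end returns NA, which fills the short columns. The names parts, n, d1, d2 and res are illustrative.

```r
rawData <- as.data.frame(
  matrix(c(1, 2, 3, 4, 5, 6, "a,b,c", "d,e", "f"), nrow = 3, ncol = 3),
  stringsAsFactors = FALSE
)

# Split the third column into a list of character vectors
parts <- strsplit(rawData$V3, ",", fixed = TRUE)
n <- max(lengths(parts))

# Transpose the first two columns and drop the dimnames
d1 <- unname(t(as.matrix(rawData[, 1:2])))

# Each split vector becomes one column, padded with NA up to length n
d2 <- sapply(parts, `[`, seq_len(n))

res <- rbind(d1, d2)
res
```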

count of records within levels of a factor

I am trying to populate a field in a table (or create a separate vector altogether, whichever is easier) with consecutive numbers from 1 to n, where n is the total number of records that share the same factor level, and then back to 1 for the next level, etc. That is, for a table like this
data<-matrix(c(rep('A',4),rep('B',3),rep('C',4),rep('D',2)),ncol=1)
the result should be a new column (e.g. "sample") as follows:
sample<-c(1,2,3,4,1,2,3,1,2,3,4,1,2)
You can get it as follows, using ave:
data <- data.frame(data)
new <- ave(rep(1,nrow(data)),data$data,FUN=cumsum)
all.equal(new,sample) # check if it's right.
You can use rle function together with lapply :
sample <- unlist(lapply(rle(data[,1])$lengths,FUN=function(x){1:x}))
data <- cbind(data,sample)
Or even better, you can combine rle and sequence in the following one-liner (thanks to @Arun's suggestion):
data <- cbind(data,sequence(rle(data[,1])$lengths))
> data
[,1] [,2]
[1,] "A" "1"
[2,] "A" "2"
[3,] "A" "3"
[4,] "A" "4"
[5,] "B" "1"
[6,] "B" "2"
[7,] "B" "3"
[8,] "C" "1"
[9,] "C" "2"
[10,] "C" "3"
[11,] "C" "4"
[12,] "D" "1"
[13,] "D" "2"
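One caveat worth illustrating: rle counts consecutive runs, so the rle-based approach assumes that equal values sit next to each other. A sketch comparing it with ave, which groups by value regardless of order, follows. The vectors g and g2 and the result names are illustrative.

```r
g <- c(rep("A", 4), rep("B", 3), rep("C", 4), rep("D", 2))

# Counter within each group, independent of row order
by_group <- ave(seq_along(g), g, FUN = seq_along)

# Counter within each consecutive run
by_run <- sequence(rle(g)$lengths)

# On sorted data the two agree
all(by_group == by_run)

# On unsorted data they differ: the second "A" starts a new run
g2 <- c("A", "B", "A")
unsorted_group <- ave(seq_along(g2), g2, FUN = seq_along)  # 1 1 2
unsorted_run   <- sequence(rle(g2)$lengths)                # 1 1 1
```

Which of the two is wanted depends on whether a repeated level further down the table should continue the earlier count or restart it.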
There are lots of different ways of achieving this, but I prefer to use ddply() from plyr because the logic seems very consistent to me. I think it makes more sense to be working with a data.frame (your title talks about levels of a factor):
dat <- data.frame(ID = c(rep('A',4),rep('B',3),rep('C',4),rep('D',2)))
library(plyr)
ddply(dat, .(ID), summarise, sample = 1:length(ID))
# ID sample
# 1 A 1
# 2 A 2
# 3 A 3
# 4 A 4
# 5 B 1
# 6 B 2
# 7 B 3
# 8 C 1
# 9 C 2
# 10 C 3
# 11 C 4
# 12 D 1
# 13 D 2
My answer:
sample <- unlist(lapply(levels(factor(data)), function(x)seq_len(sum(factor(data)==x))))
factors <- unique(data)
f1 <- length(which(data == factors[1]))
...
fn <- length(which(data == factors[length(factors)]))
You can use a for loop or 'apply' family to speed that part up.
Then,
sample <- c(1:f1, 1:f2, ..., 1:fn)
Once again you can use a for loop for that part. Here is the full script you can use:
data<-matrix(c(rep('A',4),rep('B',3),rep('C',4),rep('D',2)),ncol=1)
factors <- unique(data)
f <- c()
for(i in 1:length(factors)) {
f[i] <- length(which(data == factors[i]))
}
sample <- c()
for(i in 1:length(f)) {
sample <- c(sample, 1:f[i])
}
> sample
[1] 1 2 3 4 1 2 3 1 2 3 4 1 2
