How to convert a list of vectors of different lengths to a data frame in R

I have a list containing three vectors of different lengths; the elements within each vector are unique.
data <- list(ARG=letters[1:8],BRZ=c("a","b","c","f","h","g","l","m","n"),US=c("u","b","c","e","h","f","q","a","n","t"))
I would like to convert this list to a data frame by merging the vectors together. The expected result is shown below (a similar output is fine too). Thank you for your help.
ID ARG BRZ US
a    1   1  1
b    1   1  1
c    1   1  1
d    1
e    1      1
f    1   1  1
g    1   1
h    1   1  1
l        1
m        1
n        1  1
q           1
t           1
u           1

We use mtabulate and transpose the output
library(qdapTools)
t(mtabulate(data))
Or, in base R, stack the list into a two-column data.frame and apply table:
table(stack(data))
This assumes that there are no duplicates within any vector. If there are duplicates, we can compare the counts with 0 to get a logical matrix and coerce it to binary:
+(table(stack(data)) >0)
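As a minimal sketch of how the base R approach above could be turned into a data frame shaped like the expected output, the following uses the `data` list from the question. The names `long`, `presence`, `wide` and `result` are illustrative, and the cells for absent letters are 0 rather than blank.

```r
data <- list(
  ARG = letters[1:8],
  BRZ = c("a", "b", "c", "f", "h", "g", "l", "m", "n"),
  US  = c("u", "b", "c", "e", "h", "f", "q", "a", "n", "t")
)

# Long format: one row per (values, ind) pair
long <- stack(data)

# Cross-tabulate and reduce counts to 0/1 presence flags
presence <- +(table(long$values, long$ind) > 0)

# Convert the table to a data frame and move the row names into an ID column
wide <- as.data.frame.matrix(presence)
result <- cbind(ID = rownames(wide), wide)
rownames(result) <- NULL
result
```

If blank cells are preferred over zeros, the 0 entries could be replaced afterwards, at the cost of turning the columns into character vectors.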

Related

Filtering of dataframe columns displaying counterintuitive behavior (R)

Take as an example the dataframe below. I need to change the dataframe by keeping only the columns that are in the filter objects.
test <- data.frame(A = c(1,6,1,2,3) , B = c(1,2,1,1,2), C = c(1,7,6,4,1), D = c(1,1,1,1,1))
filter <- c("A", "B", "C", "D")
filter2 <- c("A","B","D")
To do that I'm using this piece of code:
`%ni%` <- Negate(`%in%`)
test <- test[,-which(names(test) %ni% filter2)]
If I use the filter2 object I get what is expected:
A B D
1 1 1 1
2 6 2 1
3 1 1 1
4 2 1 1
5 3 2 1
However, if I use the filter object, I get a dataframe with zero columns:
data frame with 0 columns and 5 rows
I expected to get an untouched dataframe, since filter contains all of the columns of test. Why does this happen, and how can I write more reliable code that does not return empty dataframes in these situations?
Use ! instead of -
test[,!(names(test) %ni% filter2)]
test[,!(names(test) %ni% filter)]
Wrapping the condition in which and negating it with - works only when which returns at least one index:
> which(names(test) %ni% filter2)
[1] 3
> which(names(test) %ni% filter)
integer(0)
Negating integer(0) with - changes nothing: the result is still integer(0), which selects no columns instead of dropping none
> -which(names(test) %ni% filter)
integer(0)
> -which(names(test) %ni% filter2)
[1] -3
Thus,
> test[integer(0)]
data frame with 0 columns and 5 rows
I think you can simplify the column selection by subsetting the dataframe with a character vector of column names.
test[filter]
# A B C D
#1 1 1 1 1
#2 6 2 7 1
#3 1 1 6 1
#4 2 1 4 1
#5 3 2 1 1
test[filter2]
# A B D
#1 1 1 1
#2 6 2 1
#3 1 1 1
#4 2 1 1
#5 3 2 1
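To make the two suggestions above concrete, here is a small sketch using the `test`, `filter` and `filter2` objects from the question. The variable names `keep_all`, `keep_some` and `keep_safe`, and the non-existent column name "Z", are hypothetical and only serve as illustration.

```r
test <- data.frame(A = c(1, 6, 1, 2, 3), B = c(1, 2, 1, 1, 2),
                   C = c(1, 7, 6, 4, 1), D = c(1, 1, 1, 1, 1))
filter  <- c("A", "B", "C", "D")
filter2 <- c("A", "B", "D")

# Logical indexing: an all-TRUE vector keeps every column,
# and drop = FALSE keeps the result a data frame even for one column
keep_all  <- test[, names(test) %in% filter,  drop = FALSE]
keep_some <- test[, names(test) %in% filter2, drop = FALSE]

# Character indexing fails on unknown names, so intersect() can be used
# when the filter may contain names that are not in the data frame
# ("Z" is a made-up name for this example)
keep_safe <- test[intersect(names(test), c(filter2, "Z"))]
```

Both forms avoid the integer(0) pitfall because neither relies on negative indices.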

count characters based on the order they appear

How does one count the characters in a single string based on the order in which they appear? Below is a minimal example:
x <- "abbccdddaab"
My first thought was this, but it counts the characters irrespective of order:
table(unlist(strsplit(x, "\\b")))
a b c d
3 3 2 3
But the desired output is:
a b c d a b
1 2 2 3 2 1
I would imagine the solution would require a for loop?
We can use rle instead of table. rle checks whether adjacent elements are the same and returns a list with the values and lengths of each run
out <- rle(strsplit(x, "\\b")[[1]])
setNames(out$lengths, out$values)
# a b c d a b
# 1 2 2 3 2 1
Using data.table::rleid:
x <- "abbccdddaab"
tmp <- strsplit(x, "\\b")[[1]]
table(data.table::rleid(tmp))
#1 2 3 4 5 6
#1 2 2 3 2 1

Threshold values row-wise in a dataframe

Consider an example data frame:
A B C v
5 4 2 3
7 1 3 5
1 2 1 1
I want to set each element of a row to 1 if it is greater than or equal to that row's v, and to 0 otherwise. The example data frame would result in the following:
A B C v
1 1 0 3
1 0 0 5
1 1 1 1
How can I do this efficiently? The number of columns will be much higher, and I would like a solution that does not require me to specify the names of the columns individually, and will apply it to all of them (except v) instead.
My solution with a for loop is way too slow.
We can create a logical matrix and coerce it to binary (v is the fourth column, so -4 selects every other column)
df1[-4] <- +(df1[-4] >= df1$v)
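The one-liner above relies on v being the fourth column. A possible variation, sketched here on the example data, selects the columns by name instead, so it would not depend on the position of v; `cols` is an illustrative name.

```r
df1 <- data.frame(A = c(5, 7, 1), B = c(4, 1, 2), C = c(2, 3, 1), v = c(3, 5, 1))

# Every column except the threshold column
cols <- setdiff(names(df1), "v")

# The comparison recycles v down each column, so each row is compared
# against its own threshold; unary + coerces TRUE/FALSE to 1/0
df1[cols] <- +(df1[cols] >= df1$v)
df1
```

Because the comparison is vectorised over the whole block of columns, no loop over rows or columns is required.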

Frequency of Characters in Strings as columns in data frame using R

I have a data frame initial of the following format
> head(initial)
Strings
1 A,A,B,C
2 A,B,C
3 A,A,A,A,A,B
4 A,A,B,C
5 A,B,C
6 A,A,A,A,A,B
and the data frame I want is final
> head(final)
Strings A B C
1 A,A,B,C 2 1 1
2 A,B,C 1 1 1
3 A,A,A,A,A,B 5 1 0
4 A,A,B,C 2 1 1
5 A,B,C 1 1 1
6 A,A,A,A,A,B 5 1 0
The following code generates both data frames with a large number of rows:
initial<-data.frame(Strings=rep(c("A,A,B,C","A,B,C","A,A,A,A,A,B"),100))
final<-data.frame(Strings=rep(c("A,A,B,C","A,B,C","A,A,A,A,A,B"),100),A=rep(c(2,1,5),100),B=rep(c(1,1,1),100),C=rep(c(1,1,0),100))
What is the fastest way to achieve this? Any help will be greatly appreciated.
We can use base R methods for this task. We split the 'Strings' column (strsplit(...)), set the names of the output list with the sequence of rows, stack to convert to data.frame with key/value columns, get the frequency with table, convert to 'data.frame' and cbind with the original dataset.
cbind(df1, as.data.frame.matrix(
  table(
    stack(
      setNames(
        strsplit(as.character(df1$Strings), ','), 1:nrow(df1))
    )[2:1])))
# Strings A B C D
#1 A,B,C,D 1 1 1 1
#2 A,B,B,D,D,D 1 2 0 3
#3 A,A,A,A,B,C,D,D 4 1 1 2
Or we can use mtabulate after splitting the column.
library(qdapTools)
cbind(df1, mtabulate(strsplit(as.character(df1$Strings), ',')))
# Strings A B C D
#1 A,B,C,D 1 1 1 1
#2 A,B,B,D,D,D 1 2 0 3
#3 A,A,A,A,B,C,D,D 4 1 1 2
Update
For the new dataset 'initial', the second method works. If we need to use the first method with the correct order, convert to factor class with levels specified as the unique elements of 'ind'.
df1 <- stack(setNames(strsplit(as.character(initial$Strings), ','),
                      seq_len(nrow(initial))))
df1$ind <- factor(df1$ind, levels=unique(df1$ind))
cbind(initial, as.data.frame.matrix(table(df1[2:1])))
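As an alternative base R sketch (a different technique from the stack/table answer above, not a claim about which is fastest), each split string can be counted with tabulate against a fixed set of letters. The names `parts`, `lv`, `counts` and `result` are illustrative.

```r
initial <- data.frame(Strings = rep(c("A,A,B,C", "A,B,C", "A,A,A,A,A,B"), 100))

# Split each string on commas
parts <- strsplit(as.character(initial$Strings), ",", fixed = TRUE)

# All letters that occur anywhere, in sorted order
lv <- sort(unique(unlist(parts)))

# Count each letter per row; tabulate() returns 0 for letters that are absent
counts <- t(vapply(
  parts,
  function(p) tabulate(match(p, lv), nbins = length(lv)),
  integer(length(lv))
))
colnames(counts) <- lv

result <- cbind(initial, counts)
head(result)
```

Since the rows are processed in their original order, no reordering of factor levels is needed here.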

Adding values if characters match in list

I'm trying to sum the occurrences of every letter across the character strings in a list, but if I do the following, I get an error:
table(simplify2array(as.vector(x)))
Error in base::table(...) : attempt to make a table with >= 2^31 elements
So I did the following and made a table for each character string.
x <- lapply(x, table)
head(x)
[[1]]
E F G H L N P Q R S Y
1 2 1 2 3 1 1 3 3 2 1
[[2]]
A C D G I K L N P R V
1 1 2 1 1 3 2 4 3 1 1
How can I now add up these values per letter across all list elements? Each element can contain different letters.
Maybe you could use:
x_v <- unlist(x)
table(x_v)
If this doesn't work, the aggregate() function could help you.
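If only the per-element tables are available, one way to add them up per letter is sketched below. The list `x` here is a made-up stand-in for the asker's data (which is not shown in the question), and `tabs`, `counts` and `totals` are illustrative names.

```r
# Hypothetical stand-in for the asker's list of character vectors
x <- list(c("E", "F", "F", "G"), c("A", "G", "G", "F"))

# One table per list element, as in the question
tabs <- lapply(x, table)

# Flatten the tables into one named integer vector (names are the letters)
counts <- unlist(lapply(tabs, function(tb) setNames(as.vector(tb), names(tb))))

# Sum the counts that share the same letter
totals <- tapply(counts, names(counts), sum)
totals
```

For data that fits in memory as a single vector, this should give the same totals as calling table on the unlisted original list.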
