Can I use the unlist function in a dataframe? - r

I was working with a list containing the words from a text and the tags classifying them. I was supposed to restore an old letter, and to do this I needed to extract only the words into a vector, so instead of using sapply, I did this:
words <- unlist(data.frame(letter)[1,], use.names = FALSE)
It appeared to work, but the auxiliary professor said this was a problem because unlist can only be used on lists. I changed it, but in the end the results were the same.
PS: I know that using sapply is more efficient; I just didn't remember the function. I'm just curious whether unlist can be used on other objects.

As @Gregor notes, data.frames are lists. Consider the following example:
df <- data.frame(Col1 = LETTERS[1:5], Col2 = 1:5, stringsAsFactors = FALSE)
is.list(df)
#[1] TRUE
Therefore, you can use lapply on a data.frame to perform column-wise operations:
lapply(df,paste0, collapse = "")
#$Col1
#[1] "ABCDE"
#$Col2
#[1] "12345"
You have to be careful, however, when subsetting a data.frame, as you may not get a list depending on the method you use.
df["Col2"]
# Col2
#1 1
#2 2
#3 3
#4 4
#5 5
is.list(df["Col2"])
#[1] TRUE
df[,"Col2"]
#[1] 1 2 3 4 5
is.list(df[,"Col2"])
#[1] FALSE
is.list(df[["Col2"]])
#[1] FALSE
is.list(df$Col2)
#[1] FALSE
is.list(subset(df,select = Col2))
#[1] TRUE
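If a single-column selection has to remain a list (that is, a one-column data.frame), the drop argument of [ can be set explicitly. A minimal sketch, rebuilding the same df so that it runs on its own:

```r
df <- data.frame(Col1 = LETTERS[1:5], Col2 = 1:5, stringsAsFactors = FALSE)

# Matrix-style indexing simplifies a single column to a plain vector by default
as_vector <- df[, "Col2"]
is.list(as_vector)
#[1] FALSE

# drop = FALSE keeps the data.frame (and therefore list) structure
as_frame <- df[, "Col2", drop = FALSE]
is.list(as_frame)
#[1] TRUE
```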
To my knowledge, however, subsetting whole rows always returns a list.
df[1,]
# Col1 Col2
#1 A 1
is.list(df[1,])
#[1] TRUE
is.list(subset(df,1:5 == 1))
#[1] TRUE
We can use the dput function to view a text representation of the underlying structure of a single row:
dput(df[1,])
#structure(list(Col1 = "A", Col2 = 1L), row.names = 1L, class = "data.frame")
As we can see, even the single row is clearly a list. Therefore, we can reasonably unlist that row just as we would any list that is not also a data.frame.
unlist(df[1,], use.names = FALSE)
#[1] "A" "1"
unlist(list(Col1 = "A", Col2 = 1L), use.names = FALSE)
#[1] "A" "1"
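One thing to keep in mind is that unlist returns an atomic vector, so a row that mixes column types is coerced to a single common type. A small sketch with the same df, showing that the integer from Col2 comes back as a character string, whereas extracting the column directly keeps its type:

```r
df <- data.frame(Col1 = LETTERS[1:5], Col2 = 1:5, stringsAsFactors = FALSE)

# Mixing a character and an integer column forces everything to character
row_values <- unlist(df[1, ], use.names = FALSE)
class(row_values)
#[1] "character"

# Extracting a single column keeps its original type
class(df[["Col2"]])
#[1] "integer"
```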

Related

Return a list from a data.table

How can I return a list from a data.table? The problem is that whenever I return a list in the j part, it is transformed into data.table format by design.
Assume that I want to return a list whose elements are named(!) length 1 vectors, i.e.:
expected_result <- list(A = c(a = 1), B = c(b = 2), C = c(c = 3))
This does the trick, but it feels hackish to wrap the result in two lists just to extract the first element again.
So what would be the canonical way?
library(data.table)
library(magrittr)
d <- data.table(id = LETTERS[1:3],
nm = letters[1:3],
val = 1:3)
d[, .(.(setNames(val, nm) %>%
split(seq_along(.)) %>%
setNames(id)))]$V1[[1]] %>%
all.equal(expected_result)
# [1] TRUE
Edit
@akrun gave a nice non-data.table solution, but I would rather find a solution which uses [.data.table syntax.
The easiest approach would be to use with: split the named vector (created from the 'val' and 'nm' columns) by 'id'.
with(d, split(setNames(val, nm), id))
Output:
#$A
#a
#1
#$B
#b
#2
#$C
#c
#3
Or, if we want to create the list within data.table, it is a bit convoluted, as data.table/data.frame columns are list elements of equal length. So we may need to either strip the class attributes or wrap the result in a nested list and then flatten the list elements:
do.call(c, d[, .(list(split(setNames(val, nm), id)))]$V1)
#$A
#a
#1
#$B
#b
#2
#$C
#c
#3
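For comparison, here is a base R sketch that builds the same structure with Map instead of split. It uses a plain data.frame as a stand-in for d (an assumption, so that the snippet runs without data.table installed); the same with(...) expression should work unchanged on the data.table from the question.

```r
# Stand-in for the data.table 'd' from the question (assumption: plain data.frame)
d <- data.frame(id = LETTERS[1:3], nm = letters[1:3], val = 1:3,
                stringsAsFactors = FALSE)
expected_result <- list(A = c(a = 1), B = c(b = 2), C = c(c = 3))

# Map pairs each value with its name, giving a list of named length-1 vectors;
# the outer setNames labels the list elements with 'id'
res <- with(d, setNames(Map(setNames, val, nm), id))
all.equal(res, expected_result)
#[1] TRUE
```

Unlike split, which orders the result by the sorted levels of id, this keeps the elements in the original row order.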

How to automate hierarchical grouping of variables based on variable name

I have my variables named in little-endian fashion, separated by periods.
I'd like to create index variables for each different level and get summary output for the variables at each level, but I'm getting stuck at the first step trying to break apart my variables and put them in a table to start working with them:
Variable naming convention:
Environment.Construct.Subconstruct_1.subconstruct_i.#.Short_Name
Example:
n <- 6
dat <- data.frame(
ph1.career_interest.delight.1.Friendly=sample(1:5, n, replace=TRUE),
ph1.career_interest.delight.2.Advantagious=sample(1:5, n, replace=TRUE),
ph1.career_interest.philosophy.1.Meaningful_Difference=sample(1:5, n, replace=TRUE),
ph1.career_interest.philosophy.2.Enable_Work=sample(1:5, n, replace=TRUE)
)
# create list of variable names
names <- as.list(colnames( dat ))
## Try to create a heirarchy of variables: Step 1: Create matrix
heir <- as.matrix(strsplit(names,".", fixed = TRUE))
I've gone through a couple iterations but it still returns an error:
Error in strsplit(names, ".", fixed = TRUE) : non-character argument
Instead of wrapping with as.list, use the colnames directly, because according to ?strsplit the input x should be:
x - character vector, each element of which is to be split. Other inputs, including a factor, will give an error.
Thus a list is not the expected input class for strsplit.
nm1 <- colnames(dat)
strsplit(nm1, ".", fixed = TRUE)
#[[1]]
#[1] "ph1" "career_interest" "delight" "1" "Friendly"
#[[2]]
#[1] "ph1" "career_interest" "delight" "2" "Advantagious"
#[[3]]
#[1] "ph1" "career_interest" "philosophy" "1" "Meaningful_Difference"
#[[4]]
#[1] "ph1" "career_interest" "philosophy" "2" "Enable_Work"
The output is a list of vectors. The expected output format is not clear from the OP's post. If we need a matrix or data.frame, we can rbind those list elements (assuming they have the same length):
m1 <- do.call(rbind, strsplit(nm1, ".", fixed = TRUE))
This returns a matrix.
Alternatively, it can be converted to a data.frame with rbind.data.frame.
NOTE: names is the name of a base R function. It is better not to reuse function names as object names.
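As a rough sketch of the data.frame route, the matrix can be passed to as.data.frame (used here in place of rbind.data.frame) and then used to group the original variable names by one level of the hierarchy. The column labels environment, construct, and so on are hypothetical names chosen for illustration, not taken from the question.

```r
nm1 <- c("ph1.career_interest.delight.1.Friendly",
         "ph1.career_interest.delight.2.Advantagious",
         "ph1.career_interest.philosophy.1.Meaningful_Difference",
         "ph1.career_interest.philosophy.2.Enable_Work")

# One row per variable, one column per level of the name
m1 <- do.call(rbind, strsplit(nm1, ".", fixed = TRUE))

# Hypothetical level labels, chosen only for this example
levels_df <- as.data.frame(m1, stringsAsFactors = FALSE)
names(levels_df) <- c("environment", "construct", "subconstruct",
                      "item", "short_name")
levels_df$variable <- nm1

# Group the original variable names by the third level of the hierarchy
by_subconstruct <- split(levels_df$variable, levels_df$subconstruct)
names(by_subconstruct)
#[1] "delight"    "philosophy"
```

Each element of by_subconstruct can then be used to select the matching columns of dat, for example dat[by_subconstruct$delight].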
Update
If the lengths are not the same, an option is to pad the shorter elements with NA at the end:
lst1 <- strsplit(nm1, ".", fixed = TRUE)
lst1[[1]] <- lst1[[1]][1:3] # making lengths different
mx <- max(lengths(lst1))
do.call(rbind, lapply(lst1, `length<-`, mx))
# [,1] [,2] [,3] [,4] [,5]
#[1,] "ph1" "career_interest" "delight" NA NA
#[2,] "ph1" "career_interest" "delight" "2" "Advantagious"
#[3,] "ph1" "career_interest" "philosophy" "1" "Meaningful_Difference"
#[4,] "ph1" "career_interest" "philosophy" "2" "Enable_Work"
You can count the number of '.' characters in the column names to determine how many new columns to create. We can then use tidyr::separate to divide the names into n new columns, splitting on '.':
#Changing 1st column name to make length unequal
names(dat)[1] <- 'ph1.career_interest.delight.1'
#Number of new columns to be created
n <- max(stringr::str_count(names(dat), '\\.')) + 1
tidyr::separate(data.frame(name = names(dat)), name,
paste0('col', seq_len(n)), sep = '\\.', fill = 'right')
# col1 col2 col3 col4 col5
#1 ph1 career_interest delight 1 <NA>
#2 ph1 career_interest delight 2 Advantagious
#3 ph1 career_interest philosophy 1 Meaningful_Difference
#4 ph1 career_interest philosophy 2 Enable_Work

How to index a dataframe that contains lists/vectors as values

Indexing a dataframe works for single values but not for values that are list elements or vectors.
I have two lists of the genes that I need to match up. In each list, the genes are named as different gene aliases. I need to query a large list of genes in order to filter out any genes that are not shared between the two datasets. To do this, I created a dataframe that contains all genes from both lists. Each value in the dataframe is either a single string or a vector of multiple strings (aliases). A separate column assigns each group of aliases a unique number, which I am using to match the two lists. For each gene I need to check if it is present in the dataframe. But I cannot index the vector values. See below:
df <- data.frame("col1"=I(list(c("MALAT1","FTK2","CAS9"),
"MS4A6A",
c("LACT1","FLEE6","LOC98"))),
"col2"=I(list(c("CASS4","MS4A2","NME"),
"PLD3",
"ADAM4")))
"MALAT1" %in% df$col1
[1] FALSE
"MS4A6A" %in% df$col1
[1] TRUE
As it is a list, we can unlist
"MALAT1" %in% unlist(df$col1)
#[1] TRUE
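The reason the second one returns TRUE is that the second list element has length 1, while the element containing "MALAT1" does not. %in% converts each list element to a single character string before comparing, so a multi-element vector never equals "MALAT1".
Testing:
If we change the single-element list entry to "MALAT1":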
df$col1[2] <- "MALAT1"
"MALAT1" %in% df$col1
#[1] TRUE
Generally, when we have a list, if we want to test on each element
lapply(df$col1, `%in%`, x = "LACT1")
#[[1]]
#[1] FALSE
#[[2]]
#[1] FALSE
#[[3]]
#[1] TRUE
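To get one logical value per row rather than a list, the same test can be wrapped in vapply. This is a minimal sketch that rebuilds only col1 from the question; has_gene is a name introduced here for illustration.

```r
df <- data.frame(col1 = I(list(c("MALAT1", "FTK2", "CAS9"),
                               "MS4A6A",
                               c("LACT1", "FLEE6", "LOC98"))))

# One TRUE/FALSE per row: does this row's alias vector contain the gene?
has_gene <- vapply(df$col1, function(aliases) "LACT1" %in% aliases, logical(1))
has_gene
#[1] FALSE FALSE  TRUE

# Row index of the alias group that contains the gene
which(has_gene)
#[1] 3
```

The logical vector can be used directly for filtering, for example df[has_gene, , drop = FALSE].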
Here is another workaround, which plays a trick on df by flattening the lists in columns via rapply + toString
df[] <- rapply(df, toString, how = "unlist")
such that
> df
col1 col2
1 MALAT1, FTK2, CAS9 CASS4, MS4A2, NME
2 MS4A6A PLD3
3 LACT1, FLEE6, LOC98 ADAM4
and then you can use grepl to check if the objective can be found in the column via, e.g.,
> grepl("LACT1", df$col1, fixed = TRUE)
[1] FALSE FALSE TRUE
> grepl("NME", df$col2, fixed = TRUE)
[1] TRUE FALSE FALSE
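Note that grepl with fixed = TRUE matches substrings, so a query that is a prefix of another alias also counts as a hit. The sketch below uses a hypothetical query "MS4A" to show the difference between substring matching and exact matching on the split aliases; flat stands in for the flattened col1.

```r
# Flattened alias strings, as produced by toString in the approach above
flat <- c("MALAT1, FTK2, CAS9", "MS4A6A", "LACT1, FLEE6, LOC98")

# Substring match: "MS4A" is found inside "MS4A6A"
substring_hit <- grepl("MS4A", flat, fixed = TRUE)
substring_hit
#[1] FALSE  TRUE FALSE

# Exact match: split back into aliases and compare whole strings
aliases <- strsplit(flat, ", ", fixed = TRUE)
exact_hit <- vapply(aliases, function(a) "MS4A" %in% a, logical(1))
exact_hit
#[1] FALSE FALSE FALSE
```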
You were almost there. Just wrap unlist() around the list you have your da

Check if string contains anything other than items in vector [R]

I have a dataframe containing a column of strings. I want to check whether any of the elements in each string match any of the elements in one or more predefined vectors, and then return a new logical column. This is easily accomplished using grepl().
However (and this is the part I need help with), I also want to check whether the strings contain any elements other than those contained in the keyword vectors.
Example data:
matchvector1 <- c("Apple","Banana","Orange")
matchvector2 <- c("Strawberry","Kiwi","Grapefruit")
id <- c(1,2,3)
string_column <- c(paste0(c("Apple","Banana"),collapse=", "), paste0(c("Strawberry","Kiwi"), collapse = ", "), paste0(c("Apple","Pineapple"), collapse = ", "))
df <- data.frame(id, string_column)
df$string_column <- as.character(df$string_column)
matches_vector1 <- grepl(paste(matchvector1, collapse = "|"), df$string_column)
matches_vector2 <- grepl(paste(matchvector2, collapse = "|"), df$string_column)
The output should look something like:
matches_vector1: TRUE FALSE TRUE
matches_vector2: FALSE TRUE FALSE
unmatched_words: FALSE FALSE TRUE
I'm stuck on this last part. Is there an easy way to match on anything except something in a list of keywords using grepl() (or another function)? I suspect it will involve using negative lookaround somehow but the few existing threads on this didn't seem to answer my question.
One option is to split 'string_column' into rows with separate_rows, group by 'id', and check whether any element of 'string_column' is not %in% the concatenated vectors:
library(dplyr)
library(tidyr)
df %>%
separate_rows(string_column) %>%
group_by(id) %>%
summarise(unmatched = any(!string_column %in% c(matchvector1, matchvector2)) )
# A tibble: 3 x 2
# id unmatched
#* <dbl> <lgl>
#1 1 FALSE
#2 2 FALSE
#3 3 TRUE
or in base R
lengths(sapply(strsplit(df$string_column, ",\\s*"),
setdiff, c(matchvector1, matchvector2))) > 0
#[1] FALSE FALSE TRUE
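If the unmatched words themselves are of interest, the base R approach can be split into steps. The sketch below uses lapply instead of sapply so the result is always a list, whatever the number of leftovers per string; tokens, keywords and unmatched_words are names introduced here for illustration.

```r
matchvector1 <- c("Apple", "Banana", "Orange")
matchvector2 <- c("Strawberry", "Kiwi", "Grapefruit")
strings <- c("Apple, Banana", "Strawberry, Kiwi", "Apple, Pineapple")

# Split each string into its individual words
tokens <- strsplit(strings, ",\\s*")
keywords <- c(matchvector1, matchvector2)

# Words in each string that appear in neither keyword vector
unmatched_words <- lapply(tokens, setdiff, keywords)
lengths(unmatched_words) > 0
#[1] FALSE FALSE  TRUE

# The same tokens give exact (whole-word) matches for each keyword vector
matches_vector1 <- vapply(tokens, function(w) any(w %in% matchvector1), logical(1))
matches_vector1
#[1]  TRUE FALSE  TRUE
```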

Extract string from a cell and put it in a new data frame R

In an R project, I want to extract strings from a data frame in which a column looks like
"A|B|C"
"B|Z"
"I|P"
...
I want a new data frame with the columns A B C Z I P.
I thought of doing it with a for loop and gsub, but it is not easy because the pattern has to handle the | character, and I am not sure it is the best or most elegant way to do this kind of task.
With a combination of strsplit, unlist and unique you can do:
#Steps:
#1) split each element of column with separator as "|"
#2) combine output for all items with unlist
#3) retain unique elements of those
vec = c("A|B|C","B|Z","I|P")
newDF = data.frame(newCol = unique(unlist(lapply(vec,function(x) unlist(strsplit(x,"[|]")) ))),
stringsAsFactors = FALSE)
newDF$newCol
#[1] "A" "B" "C" "Z" "I" "P"
Starting with the dataframe df, with base R we can try the following:
data.frame(col=unique(unlist(strsplit(as.character(df$col), split='\\|'))))
# col
#1 A
#2 B
#3 C
#4 Z
#5 I
#6 P
or with dplyr and tidyr (unnest comes from tidyr):
library(dplyr)
library(tidyr)
df %>%
mutate(col = strsplit(col, "\\|")) %>%
unnest(col) %>% unique
# col
# (chr)
#1 A
#2 B
#3 C
#4 Z
#5 I
#6 P
data
df <- data.frame(col=c("A|B|C",
"B|Z",
"I|P"), stringsAsFactors = FALSE)
If you want them to be the names of the columns, try this:
symbols <- unique(unlist(strsplit(as.character(df$col), split='\\|')))
df <- data.frame(matrix(vector(), 0, length(symbols),
dimnames=list(c(), symbols)), stringsAsFactors=F)
df
#[1] A B C Z I P
#<0 rows> (or 0-length row.names)
We can use cSplit
library(splitstackshape)
unique(cSplit(df1, "V1", "|", "long"), by = "V1")
data
df1 <- data.frame(V1 = c("A|B|C","B|Z","I|P"))
The scan function with the text parameter input appears suited for this task:
st <- c("A|B|C","B|Z","I|P")
scan(text=st, what="", sep="|")
Read 7 items
[1] "A" "B" "C" "B" "Z" "I" "P"
It wasn't clear to me from your problem description or example how you wanted this to be aligned with the original 3 row dataframe.
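One possible reading of the alignment, offered only as an assumption about the desired output, is an indicator table with one row per original string and one logical column per symbol. A minimal base R sketch:

```r
vec <- c("A|B|C", "B|Z", "I|P")

# Split each string on the literal "|" and collect the distinct symbols
parts <- strsplit(vec, "|", fixed = TRUE)
symbols <- unique(unlist(parts))

# For each row, flag which symbols it contains; vapply returns one column
# per row, so transpose to get one row per original string
indicator <- t(vapply(parts, function(p) symbols %in% p,
                      logical(length(symbols))))
colnames(indicator) <- symbols
result <- as.data.frame(indicator)

names(result)
#[1] "A" "B" "C" "Z" "I" "P"
result$B
#[1]  TRUE  TRUE FALSE
```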
