Convert a data frame into a list in R

I would like to convert a data frame into a list.
The input is shown in Table 1 and the desired output in Table 2.
When the list is opened from the Environment pane in R, it should show:
Name - the names clus1, clus2, ...
Type - should contain values from column V1
Value - list of length 3
Table 1
V1 V2 V3
clus1 10 a d
clus2 20 b e
clus3 5 c f
Table 2
$`clus1`
[1] "a" "d"
$`clus2`
[1] "b" "e"
$`clus3`
[1] "c" "f"

t1 = read.table(text = " V1 V2 V3
clus1 10 a d
clus2 20 b e
clus3 5 c ''", header = T)
result = split(t1[, 2:3], f = row.names(t1))
result = lapply(result, function(x) {
x = as.character(unname(unlist(x)))
x[x != '']})
result
# $clus1
# [1] "a" "d"
#
# $clus2
# [1] "b" "e"
#
# $clus3
# [1] "c"
In this particular case, we can go a bit more directly if we convert to matrix first:
r2 = split(as.matrix(t1[, 2:3]), f = row.names(t1))
r2 = lapply(r2, function(x) x[x != ''])
# same result
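One thing to keep in mind with split(): the groups come back in the order of the factor levels, which for a character f means sorted order rather than row order. Below is a minimal sketch, assuming the row order of the data frame should be preserved. The reordered sample data and the names t2, f and r3 are made up for illustration.

```r
# Sample data with row names that are NOT in sorted order (illustrative only)
t2 <- read.table(text = "V1 V2 V3
clus3 10 a d
clus1 20 b e
clus2 5 c ''", header = TRUE)

# Fix the level order to the row order so split() does not re-sort the groups
f <- factor(row.names(t2), levels = row.names(t2))

r3 <- split(as.matrix(t2[, 2:3]), f = f)
r3 <- lapply(r3, function(x) x[x != ""])  # drop blank entries

names(r3)
# [1] "clus3" "clus1" "clus2"
```

With a plain character f the names would come back as clus1, clus2, clus3 instead.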

You might think of this as a reshaping task in order to scale it for multiple columns, i.e. create a column of values rather than tracking throughout that you're working with columns V2 and V3. That way, you can do it in one pass with some basic tidyverse functions. This also lets you easily filter the data before making the list, based on removing blanks or any other condition, again without specifying columns.
library(dplyr)
# thanks @Gregor for filling in the data
tibble::rownames_to_column(t1, var = "clust") %>%
select(-V1) %>%
tidyr::gather(key, value, -clust) %>%
filter(value != "") %>%
split(.$clust) %>%
purrr::map("value")
#> $clus1
#> [1] "a" "d"
#>
#> $clus2
#> [1] "b" "e"
#>
#> $clus3
#> [1] "c"

Related

Maintaining order of extracted patterns from strings in R

I'm trying to extract a pattern from a string, but am having difficulty maintaining the order. Consider:
library(stringr)
string <- "A A A A C A B A"
extract <- c("B","C")
str_extract_all(string,extract)
[[1]]
[1] "B"
[[2]]
[1] "C"
The output of this is a list; is it possible to return a vector that maintains the original ordering, i.e. that "C" precedes "B" in the string? I've tried many options of gsub with no luck. Thanks.
Try using the following regexp:
str_extract_all(string,"[BC]")
## [[1]]
## [1] "C" "B"
or more generally:
str_extract_all(string, paste(extract, collapse = "|"))
string <- "A A A A C A B A B"
extract <- c("B","C")
inds = unlist(sapply(extract, function(p){
as.numeric(gregexpr(p, string)[[1]])
}))
sort(inds[inds > 0])
# C B1 B2
# 9 13 17
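If a plain character vector of the matches is wanted rather than their positions, a base R sketch along the same lines as the combined pattern might look like this. It assumes the elements of extract are safe to use as regular expressions; the names pattern and matches are only for illustration.

```r
string <- "A A A A C A B A B"
extract <- c("B", "C")

# Build one alternation pattern so matches are found in string order
pattern <- paste(extract, collapse = "|")

# gregexpr() locates every match, regmatches() pulls out the matched text
matches <- unlist(regmatches(string, gregexpr(pattern, string)))
matches
# [1] "C" "B" "B"
```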

Storing unique values of each column (of a df) in list

It is straightforward to obtain the unique values of a column using unique. However, I am looking to do the same for multiple columns in a data frame and store the results in a list, all using base R. Importantly, I do not need combinations, simply the unique values of each individual column. I currently have the below:
# dummy data
df = data.frame(a = LETTERS[1:4], b = 1:4)
# for loop
cols = names(df)
unique_values_by_col = list()
for (i in cols)
{
x = unique(i)
unique_values_by_col[[i]] = x
}
The problem comes when displaying unique_values_by_col as it shows as empty. I believe the problem is i is being passed to the loop as a text not a variable.
Any help would be greatly appreciated. Thank you.
Why not avoid the for loop altogether using lapply:
lapply(df, unique)
Resulting in:
> $a
> [1] A B C D
> Levels: A B C D
> $b
> [1] 1 2 3 4
Alternatively, apply can run a function over the rows or columns of a data frame:
apply(df,2,unique)
result:
> apply(df,2,unique)
a b
[1,] "A" "1"
[2,] "B" "2"
[3,] "C" "3"
[4,] "D" "4"
That said, if you want a list, lapply is probably the better choice, since it always returns one.
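One further difference between the two is worth a sketch: apply() converts the data frame to a matrix first, so with mixed column types every value becomes a character string, while lapply() keeps each column's own type. The sample data here is made up so that the two columns have different numbers of unique values.

```r
# Illustrative data: one character column, one numeric column
df2 <- data.frame(a = LETTERS[1:4], b = c(1, 1, 2, 2))

by_lapply <- lapply(df2, unique)    # works column by column, types preserved
by_apply  <- apply(df2, 2, unique)  # goes through as.matrix(), so all character

class(by_lapply$b)
# [1] "numeric"
class(by_apply$b)
# [1] "character"
```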
Your for loop is almost right, just needs one fix to work:
# for loop
cols = names(df)
unique_values_by_col = list()
for (i in cols) {
x = unique(df[[i]])
unique_values_by_col[[i]] = x
}
unique_values_by_col
# $a
# [1] A B C D
# Levels: A B C D
#
# $b
# [1] 1 2 3 4
i is just a character string, the name of a column within df, so unique(i) returns that name rather than the column's unique values.
Anyhow, the most standard way for this task is lapply() as shown by demirev.
Could this be what you're trying to do?
Map(unique,df)
Result:
$a
[1] A B C D
Levels: A B C D
$b
[1] 1 2 3 4

Get a single value from a data frame in R

Say I have a data frame df such as :
col1 col2
x1 y1
x2 y2
with arbitrary values in each "cell".
How do I get a single value for a given cell ?
For instance to get the value of the cell in the first row and second column, doing this :
df[1,2]
works with numeric values, but with strings it returns the levels as well.
What is the proper way of getting a single value (for instance for use in a condition for a subset of another data frame) ?
EDIT
More details about what I need this for. Say I need to use values from df to subset another data frame df2 :
subset(df2, (id == SomeCommand(df[1,1])) & (name == SomeCommand(df[1,2])))
Is there any such "SomeCommand" that would reliably return a single value (w/o levels) of the appropriate type regardless of the type of the columns in df ?
R will go out of its way to try to figure out what you want. If you coerce to character, it should work. Here's a quick example.
> xy <- data.frame(a = c(0.1, 0.2, 0.3), b = factor(1:3), c = letters[1:3])
>
> xy$a == 0.1
[1] TRUE FALSE FALSE
> xy$a == "0.1"
[1] TRUE FALSE FALSE
> xy$b == "2"
[1] FALSE TRUE FALSE
> xy$b == 2
[1] FALSE TRUE FALSE
> xy$c == "a"
[1] TRUE FALSE FALSE
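To make the "coerce to character" idea concrete, here is a minimal sketch of a helper that returns one cell without factor levels. The name cell_value and the sample data are hypothetical, not part of any package.

```r
# Illustrative data: col2 is explicitly a factor
df <- data.frame(col1 = c("x1", "x2"), col2 = factor(c("y1", "y2")))

# Hypothetical helper: pick column j, then element i, and drop factor levels
cell_value <- function(data, i, j) {
  v <- data[[j]][i]
  if (is.factor(v)) as.character(v) else v
}

cell_value(df, 1, 2)
# [1] "y1"
```

Numeric and character columns pass through unchanged, so the result can be used directly in a comparison such as id == cell_value(df, 1, 1).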
A common application is to obtain a particular value of one variable in a data-frame given the value of one or more other column variables in the same record. For this the "filter" command can be used. It may look clunky but it works well for a large data-frame.
library(dplyr)
df
rnames col1 col2 col3
1 row1 1 3 a
2 row2 2 6 b
3 row3 3 9 c
4 row4 4 12 d
5 row5 5 15 e
To find the value of col1 given col3 = 'c'
a <- filter(df, col3=='c') # can specify multiple known column values
a #produces a data-frame with the record(s)
rnames col1 col2 col3
1 row3 3 9 c # which contains Col1 = 3
class(a)
[1] "data.frame"
But you can get the value of col1 in one line
b <- filter(df, col3=='c')$col1
b
[1] 3
class(b)
[1] "numeric"
For a result with multiple values
c <- filter(df, col1 > 3)$col3
c
[1] "d" "e" # a vector of length > 1 if there is more than one result
class(c)
[1] "character"
One way that works is to define the colClasses of your data frame when reading it in, for example:
my_table = read.table("myfile.txt", sep=" ", colClasses = c("character", "character", "numeric"))

Shuffling a vector - all possible outcomes of sample()?

I have a vector with five items.
my_vec <- c("a","b","a","c","d")
If I want to re-arrange those values into a new vector (shuffle), I could use sample():
shuffled_vec <- sample(my_vec)
Easy - but the sample() function only gives me one possible shuffle. What if I want to know all possible shuffling combinations? The various "combn" functions don't seem to help, and expand.grid() gives me every possible combination with replacement, when I need it without replacement. What's the most efficient way to do this?
Note that my vector contains the value "a" twice, so every shuffled vector returned should also contain "a" twice.
I think permn from the combinat package does what you want
library(combinat)
permn(my_vec)
A smaller example
> x
[1] "a" "a" "b"
> permn(x)
[[1]]
[1] "a" "a" "b"
[[2]]
[1] "a" "b" "a"
[[3]]
[1] "b" "a" "a"
[[4]]
[1] "b" "a" "a"
[[5]]
[1] "a" "b" "a"
[[6]]
[1] "a" "a" "b"
If the duplicates are a problem you could do something similar to this to get rid of duplicates
strsplit(unique(sapply(permn(my_vec), paste, collapse = ",")), ",")
Or probably a better approach to removing duplicates...
dat <- do.call(rbind, permn(my_vec))
dat[!duplicated(dat), ]
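If installing a package is not an option, the distinct orderings can also be generated directly, so duplicates never need to be removed afterwards. This is a minimal base R sketch; the function name unique_perms is made up for illustration, and it is not tuned for long vectors.

```r
# Recursively build all distinct orderings of x (a multiset permutation).
# At each position only the DISTINCT remaining values are tried, which is
# what prevents duplicate results when x contains repeated values.
unique_perms <- function(x) {
  if (length(x) <= 1) return(list(x))
  out <- list()
  for (v in unique(x)) {
    rest <- x[-match(v, x)]  # remove one occurrence of v
    for (p in unique_perms(rest)) {
      out[[length(out) + 1]] <- c(v, p)
    }
  }
  out
}

my_vec <- c("a", "b", "a", "c", "d")
res <- unique_perms(my_vec)
length(res)
# [1] 60
```

The count is 5!/2! = 60 because the two "a" values are interchangeable.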
Noting that your data is effectively 5 levels from 1-5, encoded as "a", "b", "a", "c", and "d", I went looking for ways to get the permutations of the numbers 1-5 and then remap those to the levels you use.
Let's start with the input data:
my_vec <- c("a","b","a","c","d") # the character
my_vec_ind <- seq(1,length(my_vec),1) # their identifier
To get the permutations, I applied the function given at Generating all distinct permutations of a list in R:
permutations <- function(n){
  if(n==1){
    return(matrix(1))
  } else {
    sp <- permutations(n-1)
    p <- nrow(sp)
    A <- matrix(nrow=n*p,ncol=n)
    for(i in 1:n){
      A[(i-1)*p+1:p,] <- cbind(i,sp+(sp>=i))
    }
    return(A)
  }
}
First, create a data.frame with the permutations:
tmp <- data.frame(permutations(length(my_vec)))
You now have a data frame tmp of 120 rows, where each row is a unique permutation of the numbers, 1-5:
>tmp
X1 X2 X3 X4 X5
1 1 2 3 4 5
2 1 2 3 5 4
3 1 2 4 3 5
...
119 5 4 3 1 2
120 5 4 3 2 1
Now you need to remap them to the strings you had. You can remap them using a variation on the theme of gsub(), proposed here: R: replace characters using gsub, how to create a function?
gsub2 <- function(pattern, replacement, x, ...) {
for(i in 1:length(pattern))
x <- gsub(pattern[i], replacement[i], x, ...)
x
}
gsub() won't work because you have more than one value in the replacement array.
You also need a function you can call using lapply() to use the gsub2() function on every element of your tmp data.frame.
remap <- function(x,
old,
new){
return(gsub2(pattern = old,
replacement = new,
fixed = TRUE,
x = as.character(x)))
}
Almost there. We do the mapping like this:
shuffled_vec <- as.data.frame(lapply(tmp,
remap,
old = as.character(my_vec_ind),
new = my_vec))
which can be simplified to...
shuffled_vec <- as.data.frame(lapply(data.frame(permutations(length(my_vec))),
remap,
old = as.character(my_vec_ind),
new = my_vec))
.. should you feel the need.
That gives you your required answer:
> shuffled_vec
X1 X2 X3 X4 X5
1 a b a c d
2 a b a d c
3 a b c a d
...
119 d c a a b
120 d c a b a
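As a possible shortcut for the remapping step, a matrix of index permutations can be turned into values by plain vector indexing, without any string replacement. The sketch below uses a tiny hand-written index matrix (idx) and vector (vec) purely for illustration, rather than the full 120-row table.

```r
vec <- c("a", "b", "a")

# Two index permutations, one per row (written by hand for this example)
idx <- rbind(c(1, 2, 3),
             c(3, 1, 2))

# vec[idx] looks up every index; matrix() restores the row/column layout
remapped <- matrix(vec[idx], nrow = nrow(idx))
remapped
#      [,1] [,2] [,3]
# [1,] "a"  "b"  "a"
# [2,] "a"  "a"  "b"

# unique() on a matrix drops repeated rows, if duplicates are unwanted
unique(remapped)
```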
Looking at a previous question (R: generate all permutations of vector without duplicated elements), I can see that the gtools package has a function for this. I couldn't however get this to work directly on your vector as such:
library(gtools)
permutations(n = 5, r = 5, v = my_vec)
#Error in permutations(n = 5, r = 5, v = my_vec) :
# too few different elements
You can adapt it however like so:
apply(permutations(n = 5, r = 5), 1, function(x) my_vec[x])
# [,1] [,2] [,3] [,4]
#[1,] "a" "a" "a" "a" ...
#[2,] "b" "b" "b" "b" ...
#[3,] "a" "a" "c" "c" ...
#[4,] "c" "d" "a" "d" ...
#[5,] "d" "c" "d" "a" ...

R - preserve order when using matching operators (%in%)

I am using matching operators to grab values that appear in a matrix from a separate data frame. However, the resulting matrix has the values in the order they appear in the data frame, not in the original matrix. Is there any way to preserve the order of the original matrix using the matching operator?
Here is a quick example:
vec=c("b","a","c"); vec
df=data.frame(row.names=letters[1:5],values=1:5); df
df[rownames(df) %in% vec,1]
This produces > [1] 1 2 3 which is the order "a" "b" "c" appears in the data frame. However, I would like to generate >[1] 2 1 3 which is the order they appear in the original vector.
Thanks!
Use match.
df[match(vec, rownames(df)), ]
# [1] 2 1 3
Be aware that if you have duplicate values in either vec or rownames(df), match may not behave as expected.
Edit:
I just realized that row name indexing will solve your issue a bit more simply and elegantly:
df[vec, ]
# [1] 2 1 3
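To illustrate the caveat about duplicates and missing values, here is a small sketch. The vector vec2 is made up: it contains a name that is not in the data frame and a repeated name.

```r
df <- data.frame(row.names = letters[1:5], values = 1:5)
vec2 <- c("b", "z", "a", "b")  # "z" has no matching row, "b" appears twice

idx <- match(vec2, rownames(df))
idx
# [1]  2 NA  1  2

# Drop the NA before indexing, otherwise an NA shows up in the result
df[idx[!is.na(idx)], 1]
# [1] 2 1 2
```

Note also that match only ever returns the first matching position, so duplicated row names in the lookup table would be ignored silently.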
Use match, and drop the NA values for elements of vec that have no match among the row names:
Filter(function(x) !is.na(x), match(vec, rownames(df)))
Since row name indexing also works on vectors, we can take this one step further and define:
'%ino%' <- function(x, table) {
xSeq <- seq(along = x)
names(xSeq) <- x
Out <- xSeq[as.character(table)]
Out[!is.na(Out)]
}
We now have the desired result:
df[rownames(df) %ino% vec, 1]
[1] 2 1 3
Inside the function, names() does an auto convert to character and table is changed with as.character(), so this also works correctly when the inputs to %ino% are numbers:
LETTERS[1:26 %in% 4:1]
[1] "A" "B" "C" "D"
LETTERS[1:26 %ino% 4:1]
[1] "D" "C" "B" "A"
Following %in%, missing values are removed:
LETTERS[1:26 %in% 3:-5]
[1] "A" "B" "C"
LETTERS[1:26 %ino% 3:-5]
[1] "C" "B" "A"
With %in%, the logical vector is recycled along the dimension of the object being subsetted; this is not the case with %ino%:
data.frame(letters, LETTERS)[1:5 %in% 3:-5,]
letters LETTERS
1 a A
2 b B
3 c C
6 f F
7 g G
8 h H
11 k K
12 l L
13 m M
16 p P
17 q Q
18 r R
21 u U
22 v V
23 w W
26 z Z
data.frame(letters, LETTERS)[1:5 %ino% 3:-5,]
letters LETTERS
3 c C
2 b B
1 a A
