How to read in a list from a file in R?

I have a list written to a file, "file.txt", created by sink(). The file contains a single list of numbers only, which looks like this:
[[1]]
[1] 1 2
[[2]]
[1] 1 2 3
How can I read the data back in as a list from such a file?
EDIT:
I'm going to try reading it as a string, then use a regex to remove each '[[*]]' and replace each '[*]' with a special symbol, say '#'. Then I would take every substring between the '#' symbols, split it into a vector, and put it into an empty list.

Something like this should do the trick. (The exact details may vary, but at least this will give you some ideas to work with.)
l <- readLines("file.txt")
l2 <- gsub("\\[{2}\\d+\\]{2}", "#", l) # Replace [[*]] with '#'
l3 <- gsub("\\[\\d+\\]\\s", "", l2)[-1] # Remove all [*]
l4 <- paste(l3, collapse=" ") # Paste together into one string
l5 <- strsplit(l4, "#")[[1]] # Break into list
lapply(l5, function(X) scan(text = X, quiet = TRUE)) # Use scan to convert to numeric
# [[1]]
# [1] 1 2
#
# [[2]]
# [1] 1 2 3
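A self-contained way to check this kind of parsing is to write a list to a temporary file and read it back. The sketch below is a variation on the approach above, not the answer's exact code: the helper name read_sunk_list is hypothetical, capture.output() stands in for sink() to produce the same printed text, and the file is assumed to hold an unnamed list of non-empty numeric vectors.

```r
# Hypothetical helper: parse the printed form of an unnamed list of
# numeric vectors (as produced by print() under sink()) back into a list.
read_sunk_list <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(trimws(lines))]               # drop blank lines
  is_header <- grepl("^\\[\\[\\d+\\]\\]$", lines)     # lines such as [[1]]
  group <- cumsum(is_header)                          # element each line belongs to
  # Strip the leading index marker ([1], [13], ...) from the value lines
  body <- sub("^\\s*\\[\\d+\\]\\s*", "", lines[!is_header])
  pieces <- split(body, group[!is_header])
  # Join wrapped lines of one element and convert them to numeric
  unname(lapply(pieces, function(p) {
    scan(text = paste(p, collapse = " "), quiet = TRUE)
  }))
}

# Write a small list to a temporary file in printed form, then read it back
original <- list(c(1, 2), c(1, 2, 3))
tmp <- tempfile(fileext = ".txt")
writeLines(capture.output(print(original)), tmp)
result <- read_sunk_list(tmp)
```

If the code that writes the file can be changed, dput() with dget(), or saveRDS() with readRDS(), round-trip a list without any text parsing.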

Related

Efficient way of assigning attribute table names based on a partial list element name

I have a list with path names as the element names in list l1.
# File List
l1 <- list(2,3,4,5)
names(l1) <- c("C:/Users/2013_mean.csv",
"C:/Users/2013_median.csv",
"C:/Users/2015_mean.csv",
"C:/Users/2015_median.csv")
I would like to assign an attribute table to the list, similar to the following, but in a more efficient manner. I would like to extract only a portion of each path name in l1 and assign it to the respective component. For example:
I would like to "grab" the name "2013_mean" from "C:/Users/2013_mean.csv" in l1 and assign it to that element in the attribute table. Is there a more efficient way of doing this?
attributes(l1) <- data.frame(id = c("2013_mean", "2013_median", "2015_mean", "2015_median"))
attributes(l1)
We can use basename on the names of the list to extract the substring:
attributes(l1) <- data.frame(id = sub("\\.csv$", "", basename(names(l1))))
Output:
> l1
[[1]]
[1] 2
[[2]]
[1] 3
[[3]]
[1] 4
[[4]]
[1] 5
attr(,"id")
[1] "2013_mean" "2013_median" "2015_mean" "2015_median"
Another option is basename combined with tools::file_path_sans_ext:
tools::file_path_sans_ext(basename(names(l1)))
[1] "2013_mean" "2013_median" "2015_mean" "2015_median"
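Note that attributes(l1) <- ... replaces every existing attribute, which is why the path names no longer appear in the printed list above. If the names should be kept, one option is to set a single attribute with attr() instead. A minimal sketch, reusing l1 from the question (the attribute name "id" follows the answer; the variable ids is introduced here only for illustration):

```r
# Rebuild the example list from the question
l1 <- list(2, 3, 4, 5)
names(l1) <- c("C:/Users/2013_mean.csv",
               "C:/Users/2013_median.csv",
               "C:/Users/2015_mean.csv",
               "C:/Users/2015_median.csv")

# Strip the directory and the extension from each path name
ids <- tools::file_path_sans_ext(basename(names(l1)))

# Set only the "id" attribute; the existing names attribute is left alone
attr(l1, "id") <- ids

attr(l1, "id")
names(l1)
```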

Read txt file into list where each list element is delimited by row ending with colon

I've got the following .txt structure
test <- "A n/a:
4001
Exam date:
2020-01-01 15:38
Pos (deg):
18.19
18.37"
I'd like to read this into a list, where each list element is given the name of the row ending with a colon, and the values are given by the following rows. (see: expected output).
Challenges
The number of rows (the length of each list element) can differ. There can be special characters (e.g., "A n/a") and there is the date time value which contains a pesky colon.
My problem
My current solution (see below) is unsafe, because I cannot be sure that I have a full list of all expected elements - the file might contain unexpected list elements which I would then not capture, or worse, they would mess up the entire data.
What I tried
I tried reading the txt as JSON with jsonlite::fromJSON, because the structure somewhat resembled it, but this gave an error about an unexpected character.
I tried to read into a single string and split, but this leaves me, again, with all values in a single list element:
readr::read_file(test)
strsplit(test, split = ":\n")
My current approach is to read this in with read.csv2, build a lookup pattern from the (expected) row names, create a grouping vector for splitting, and use the first element of each resulting list element for naming.
myfile <- read.csv2(text = test,
header = FALSE)
lu <- paste(c("A n", "date", "Pos"), collapse = "|")
ls_file <- split(myfile$V1, cumsum(grepl(lu, myfile$V1, ignore.case = TRUE)))
names(ls_file) <- unlist(lapply(ls_file, function(x) x[1]))
ls_file <- lapply(ls_file, function(x) x[-1])
## expected output is a named list
## The spaces and backticks below do not really bother me,
## but I would get rid of them in a next step.
ls_file
#> $`A n/a:`
#> [1] " 4001"
#>
#> $`Exam date:`
#> [1] " 2020-01-01 15:38"
#>
#> $`Pos (deg):`
#> [1] "18.19" "18.37"
Assuming the name of each element ends with :, then we can:
res <- readLines(textConnection(test))
res <- split(res, cumsum(endsWith(res, ':')))
res <- setNames(lapply(res, `[`, -1), sapply(res, `[`, 1))
# > res
# $`A n/a:`
# [1] " 4001"
#
# $`Exam date:`
# [1] " 2020-01-01 15:38"
#
# $`Pos (deg):`
# [1] "18.19" "18.37"
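The clean-up step mentioned in the question (removing the stray spaces and the trailing colons) can be folded into the same pipeline. The following is a sketch under the same assumption as above, namely that every name line ends with a colon and no value line does; the variable names are illustrative, and the values stay character strings.

```r
test <- "A n/a:
 4001
Exam date:
 2020-01-01 15:38
Pos (deg):
18.19
18.37"

lines <- strsplit(test, "\n", fixed = TRUE)[[1]]
is_name <- endsWith(lines, ":")           # TRUE for the rows that name an element
groups <- split(lines, cumsum(is_name))   # one group per name row

parsed <- setNames(
  lapply(groups, function(g) trimws(g[-1])),             # values, whitespace trimmed
  sub(":$", "", vapply(groups, `[`, character(1), 1))    # names, trailing colon removed
)
```

The date-time value is not mistaken for a name because its colon sits in the middle of the line, not at the end.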

String matching within a list of lists [duplicate]

I have a list like this:
map_tmp <- list("ABC",
c("EGF", "HIJ"),
c("KML", "ABC-IOP"),
"SIN",
"KMLLL")
> grep("ABC", map_tmp)
[1] 1 3
> grep("^ABC$", map_tmp)
[1] 1 # by using regex, I get the index of "ABC" in the list
> grep("^KML$", map_tmp)
[1] 5 # I wanted 3, but I got 5. Claiming the end of a string by "$" didn't help in this case.
> grep("^HIJ$", map_tmp)
integer(0) # the regex do not return to me the index of a string inside the vector
How can I get the index of a string (exact match) in the list?
I'm ok not to use grep. Is there any way to get the index of a certain string (exact match) in the list? Thanks!
Using lapply:
which(lengths(lapply(map_tmp, function(x) grep("^HIJ$", x))) != 0)
The lapply call returns, for each element of the list, the positions of the matches inside it (integer(0) if there is no match). Taking lengths() of that result and wrapping the != 0 comparison in which() gives the indices of the list elements in which your string occurs.
Use either mapply or Map with str_detect to find the position. I have run it only for the string "KML"; you can run it the same way for the others.
First, pad the list elements to equal length so that they can be processed easily:
library(stringr)
# Pad every element with NA up to the length of the longest one
map_tmp_1 <- lapply(map_tmp, `length<-`, max(lengths(map_tmp)))
val <- t(mapply(str_detect,map_tmp_1,"^KML$"))
> which(val[,1] == T)
[1] 3
> which(val[,2] == T)
integer(0)
For the string "ABC":
val <- t(mapply(str_detect,map_tmp_1,"ABC"))
> which(val[,1] == T)
[1] 1
> which(val[,2] == T)
[1] 3
I had the same question. I cannot explain why grep works on a list with a plain character pattern but not with a regex. Anyway, the best way I found to match a character string in base R is:
map_tmp <- list("ABC",
c("EGF", "HIJ"),
c("KML", "ABC-IOP"),
"SIN",
"KMLLL")
sapply( map_tmp , match , 'ABC' )
It returns a list with the same structure as the input, holding 1 where an element matches and NA where it does not:
[[1]]
[1] 1
[[2]]
[1] NA NA
[[3]]
[1] NA NA
[[4]]
[1] NA
[[5]]
[1] NA
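On why grep behaves oddly here: grep coerces its input with as.character(), and for a list that turns a multi-element vector into its deparsed form, such as c("KML", "ABC-IOP"), so an anchored pattern like ^KML$ can never match it. Testing each element separately avoids the coercion. A minimal sketch using %in% for exact matching (the helper name find_exact is hypothetical):

```r
map_tmp <- list("ABC",
                c("EGF", "HIJ"),
                c("KML", "ABC-IOP"),
                "SIN",
                "KMLLL")

# Hypothetical helper: indices of the list elements that contain `target`
# as an exact (whole-string) match
find_exact <- function(lst, target) {
  which(vapply(lst, function(x) target %in% x, logical(1)))
}

find_exact(map_tmp, "ABC")  # 1
find_exact(map_tmp, "KML")  # 3
find_exact(map_tmp, "HIJ")  # 2
```

Because %in% compares whole strings, "KMLLL" and "ABC-IOP" are not reported as matches for "KML" or "ABC".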

Parsing String and splitting it in R

I have a regex problem handling strings in R.
I have a data structure provided by the RNAfold software that looks like this:
"....(((..((((((((.(((((((((((.........))))))))))).))))))))..))).."
This is a typical secondary structure for miRNAs, but I also have other sequences that are not miRNAs, which look somewhat like this:
...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...))))))).......
This second sequence has two hairpin loops, one at the beginning and another one in the middle, whereas the first sequence just has one hairpin loop in the middle.
Dots (".") represent nucleotides that are not paired, while "(" represent nucleotides that are paired with their counterparts, represented as ")".
I want to split this string so that I can get the stems in the structure.
The output I would like to obtain is:
Input:
[1] "....(((..((((((((.(((((((((((.........))))))))))).))))))))..))).."
Output:
[1] "....(((..((((((((.(((((((((((........."
[2] "))))))))))).))))))))..))).."
That way I can count the number of split strings, and hence the number of stems.
The result for the second sequence would be:
Input:
[1] ...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...))))))).......
Output:
[1] "...((((....."
[2] "))))...........(((((((...((..(((..((((...((((((....."
[3] ")))).))...)))).))).))...)))))))......."
So in essence, what I want is to parse the strings so that they are split when a ")" symbol is found, conserving all the symbols of the string.
I have tried using strsplit() and some regex variations, but I haven't been able to find the trick...
Any help?
Thanks
You could use a lookahead to find an opening parenthesis that is immediately followed by a run of dots ending in a closing parenthesis.
x <- c("....(((..((((((((.(((((((((((..))))))))))).))))))))..)))..",
"...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...))))))).......")
strsplit(x, "\\((?=(\\.+\\)))", perl = TRUE)
# [[1]]
# [1] "....(((..((((((((.((((((((((" "..))))))))))).))))))))..))).."
#
# [[2]]
# [1] "...(((" ".....))))...........(((((((...((..(((..((((...((((("
# [3] ".....)))).))...)))).))).))...)))))))......."
If you are looking to count characters, it might be more convenient to do this:
x <- "...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...)))))))......."
with(rle(strsplit(x, "")[[1]]), setNames(lengths, values))
## . ( . ) . ( . ( . ( . ( . ( . ) . ) . ) . ) . ) . ) .
## 3 4 5 4 11 7 3 2 2 3 2 4 3 6 5 4 1 2 3 4 1 3 1 2 3 7 7
You can get the output you specified using DavidArenburg's logic with a twist. David uses a lookahead regex to find the ( that precedes a run of N dots followed by ), where N can be any number. A variable-length lookbehind (where the pattern contains an unspecified number of a character) would be ideal here, but it is not allowed. The trick is to reverse the string and use a variable-length lookahead on it, which then behaves much like a variable-length lookbehind would on the original.
Data
S <- c("....(((..((((((((.(((((((((((.........))))))))))).))))))))..)))..", "...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...))))))).......")
Functions
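# Reverse the characters of a single string
reverse_string <- function(S) {
  paste(rev(unlist(strsplit(S, ""))), collapse = "")
}
# Split the reversed string, then restore the order of the pieces and their characters
myfun <- function(S) {
  rev_S <- reverse_string(S) # named rev_S because T would mask the shorthand for TRUE
  result <- unlist(strsplit(rev_S, "\\)(?=(\\.+\\())", perl = TRUE))
  unname(rev(sapply(result, reverse_string)))
}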
Result
lapply(S, myfun)
# [[1]]
# [1] "....(((..((((((((.(((((((((((........."
# [2] ")))))))))).))))))))..))).."
# [[2]]
# [1] "...((((....."
# [2] ")))...........(((((((...((..(((..((((...((((((....."
# [3] "))).))...)))).))).))...)))))))......."
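Both strsplit approaches consume the character they split on, so each piece after a split is one parenthesis short of the requested output. One way to conserve every symbol is to insert a marker between the loop's dots and the first closing parenthesis, then split on the marker. This is a sketch, assuming "#" never occurs in the structure strings; the variable names are illustrative.

```r
S <- c("....(((..((((((((.(((((((((((.........))))))))))).))))))))..)))..",
       "...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...))))))).......")

# Put a "#" between a run of dots that follows "(" and the ")" that closes it
marked <- gsub("(\\(\\.+)(\\))", "\\1#\\2", S)

# Split on the marker only; no original character is removed
pieces <- strsplit(marked, "#", fixed = TRUE)

# Count hairpin loops directly: "(" followed by dots followed by ")"
n_hairpins <- lengths(regmatches(S, gregexpr("\\(\\.+\\)", S)))
```

Pasting the pieces of each element back together reproduces the input, and the number of pieces is one more than the number of hairpin loops.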
