'dictionary' list to data.table columns - r

I am converting output from an API call to a bibliography database, which returns content in RIS form. I would then like to get a data.table object with a row for each database item and a column for each field of the RIS output.
I will explain more about RIS later, but I am stuck on the following:
I would like to get a data.table using something like:
PubDB <- as.data.table(list(TY = "txtTY",TI = "txtTI"))
which returns:
PubDB
      TY    TI
1: txtTY txtTI
However, what I have is a string (actually a vector of strings returned from the API call; PubStr is one element):
PubStr
## [1] "TY = \"txtTY\",TI = \"txtTI\" "
How can I convert this string to the list needed inside the as.data.table command above?
More specifically: following the first steps of my code (resp <- GET(url), then rawToChar(resp$content), then some string manipulation and as.data.table()), I have a data.table with one row per publication and a single column called PubStr that holds a string like the one above. How can I convert this string into many columns, one per field, for each row of the data.table? Note: some rows have more or fewer fields than others.

I am not sure about the RIS format, but if the fields in these strings are separated by commas, and within each field the column name and value are separated by an equals sign, then here is a quick and dirty function that uses base R and data.table:
library(data.table)

RIS_parser_fn <- function(x) {
  # split each string on "," then on "=", stripping non-word characters,
  # to get one header matrix and one value matrix per input string
  string_parse_list <- lapply(
    lapply(x, function(i) tstrsplit(i, ",")),
    function(j) lapply(tstrsplit(j, "="),
                       function(k) t(gsub("\\W", "", k))))
  # rbind each header/value pair, promote the first row to column names,
  # then combine all strings, filling missing fields with NA
  datatable_format <- rbindlist(
    lapply(
      lapply(string_parse_list,
             function(i) data.table(Reduce("rbind", i))),
      function(j) setnames(j, unlist(j[1, ]))[-1]),
    fill = TRUE)
  return(datatable_format)
}
The first statement creates a list of lists. The outer list has one element per string in the input vector. Each inner list holds exactly two matrices whose number of columns equals the number of fields in that string (determined by the ',' separator): the first matrix contains the column headers (the text before each '=') and the second contains the values they are equal to. The final gsub simply strips any special characters remaining in the matrices; you may need to modify this if you want non-alphanumeric characters to survive in the values. There were not any in your example.
The second statement converts these lists into a single data.table object. Reduce rbinds each two-matrix pair, which is then converted to a data.table, so there is now one data.table per initial string element. The "j" lapply sets the column names from the first row and then removes that row from the data.table. The final rbindlist call combines the data.tables, which may have varying numbers of columns; fill = TRUE allows them to be combined, and NA is assigned to cells whose row does not have that particular field.
I added a second string element with one more field to test the code:
PubStr<-c("TY = \"txtTY1\",TI = \"txtTI1\"","TY = \"txtTY2\",TI = \"txtTI2\" ,TF = \"txtTF2\"")
RIS_parser_fn(PubStr)
Returns this:
       TY     TI     TF
1: txtTY1 txtTI1   <NA>
2: txtTY2 txtTI2 txtTF2
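For comparison, here is a more compact sketch under the same assumptions (fields separated by commas, each field a name = value pair with the value in quotes); it builds a named list per string and lets rbindlist line the fields up:
library(data.table)

parse_one <- function(s) {
  # split "TY = \"txtTY1\",TI = \"txtTI1\"" into name/value pairs
  fields <- strsplit(s, ",", fixed = TRUE)[[1]]
  kv     <- strsplit(fields, "=", fixed = TRUE)
  vals   <- trimws(gsub('"', "", vapply(kv, `[`, character(1), 2)))
  names(vals) <- trimws(vapply(kv, `[`, character(1), 1))
  as.list(vals)
}

rbindlist(lapply(PubStr, parse_one), fill = TRUE)
#        TY     TI     TF
# 1: txtTY1 txtTI1   <NA>
# 2: txtTY2 txtTI2 txtTF2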
Hopefully this will help you out and/or stimulate some ideas for more efficient code. Best of luck!

Related

Data table subsetting in r by concatenating string variables

I have a data table that I am trying to subset by creating a list of variable names by pasting together some string vectors in the j argument of the data table, but I'm running into difficulty.
I have a character vector called foos (for this example foos <- c('FOO0','FOO1','FOO2')) and another vector I created with c(). I wanted to subset my data table by doing dt[,paste0(foos, c('VAR0','VAR1','VAR2'))], but that didn't work as expected. When I print what paste0(foos, c('VAR0','VAR1','VAR2')) returns, it is
[1] "FOO0VAR0" "FOO1VAR1" "FOO2VAR2"
so it seems this approach concatenates the two vectors element by element rather than combining the vectors themselves (which is a bit surprising to me; I'd expect to have to lapply to get paste applied over the elements of a vector). Swapping the order of the c() and paste0 calls didn't work either. I also tried to do
dt[,c(foos,c('VAR0','VAR1','VAR2'))] but that also doesn't work.
Is there a way to subset by a created concatenation of two string vectors in the jth column of a data table in R?
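One possible way to get there, sketched on hypothetical data: a character vector of column names built in j has to be passed with with = FALSE or the .. prefix, and outer() gives every pairwise concatenation if that, rather than the element-wise pairing paste0() produces, is what is wanted.
library(data.table)

# hypothetical data.table whose columns follow the pasted naming scheme
dt <- data.table(FOO0VAR0 = 1, FOO1VAR1 = 2, FOO2VAR2 = 3, OTHER = 4)
foos <- c('FOO0', 'FOO1', 'FOO2')
vars <- c('VAR0', 'VAR1', 'VAR2')

# element-wise pairing, as paste0() returns it, selected via the .. prefix
cols <- paste0(foos, vars)
dt[, ..cols]

# the same selection spelled with with = FALSE
dt[, cols, with = FALSE]

# every pairwise combination, in case that is what is actually needed
all_combos <- as.vector(outer(foos, vars, paste0))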

R code to extract two columns (one of which has multiple text strings which need to be parsed) into a named list of character vectors

I currently have a situation where I have a dataframe in which I need to convert two of the columns to a specified format. Example of the data in each column:
Column 1: Some_text_String
Column 2:
GO:0048046^cellular_component^apoplast`GO:0005618^cellular_component^cell wall`GO:0005576^cellular_component^extracellular region`GO:0099503^cellular_component^secretory vesicle`GO:0004252^molecular_function^serine-type endopeptidase activity`GO:0080001^biological_process^mucilage extrusion from seed coat`GO:0048359^biological_process^mucilage metabolic process involved in seed coat development`GO:0010214^biological_process^seed coat development
So I have two problems. First, I need to parse the second column so that only the GO:XXXXXXXX text is included. A partial solution is stringr::str_extract(mydataframe[1,2], ".{0,8}GO.{0,8}"), but this only captures the first term.
Secondly, the final output needs to be a named list of character vectors, with the list names being the first column and each element of the list being a character vector. This is taken directly from the vignette of the R package I'm trying to use (topGO):
The object returned by readMappings is a named list of character
vectors. The list names give the genes identifiers. Each element of
the list is a character vector and contains the GO identifiers
annotated to the specific gene
I know this is simple but I'm just getting stuck trying to use apply or some other solution and my brain is on strike.
Reprex:
myvector1 <- c("Some_text_String")
myvector2 <- c("GO:0048046^cellular_component^apoplast`GO:0005618^cellular_component^cell wall`")
mydataframe <- data.frame(myvector1,myvector2)
# parse myvector2 to remove everything except GO terms.
# This code only gets the first term, but I need all of them as a vector
stringr::str_extract(mydataframe [1,2], ".{0,8}GO.{0,8}")
# At this point the desired result is named list of character vectors, with the list names being the first column and each element of the list being a character vector.
You can use str_extract_all to extract all the values that satisfy the pattern and use setNames to get a named list.
library(stringr)
setNames(str_extract_all(mydataframe [1,2], "GO.{0,8}"), mydataframe$myvector1)
#$Some_text_String
#[1] "GO:0048046" "GO:0005618"

R: How to use setdiff on two string vectors by only comparing the first 3 tab delimited items in each string? without using qdap

I've previously asked this question and the answer I received worked: R: How to use setdiff on two string vectors by only comparing the first 3 tab delimited items in each string?
However, qdap requires rJava and a correctly configured system (cannot load R package qdap), so now I am re-asking the question, wondering whether there is a way to do this without using qdap. I will repeat the question below:
I am trying to figure out a way in R to take the difference of two string vectors, but based only on the first 3 tab-delimited columns in each string. For example, here are list1 and list2:
list1:
"1\t1113200\t1118399\t1\t1101465\t1120176\tENSRNOG00000040300\tRaet1l\t0\n"
"1\t1180200\t1187599\t1\t1177682\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n"
"1\t1180200\t1187599\t1\t1177632\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n"
list2:
"1\t1113200\t1118399\t1\t1101465\t1120176\tENSRNOG00000040300\tRaet1l\t0\n"
"1\t1180200\t1187599\t1\t1177682\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n"
I want to do setdiff(list2, list1), so that I just get everything in list2 that is non-existent in list1; however, I want to do it based on just the first 3 tab-delimited fields. So in list1 I would only consider:
"1\t1113200\t1118399"
from the first entry. However, I still want the full string returned; I only want to compare using the first 3 columns. I am having trouble figuring out how to do this, and any help would be appreciated. I've already looked at several SO posts and none of them seemed to help.
Looks like you just need to extract up to the third tab character (to get the first three columns) from list1 and compare that to the same in list2?
There are quite a few ways to do this in base R, here's one using regular expressions to extract the first three tabs:
# first, let's get the first 3 columns of `list1` (get up to the third tab)
m = regexec("^(?:[^\t]+\t){3}", list1)
# you'll see it's a list with the first 3 columns of each element of `list1`
first3.list1 = unlist(regmatches(list1, m))
Now we have the first three columns we can match against list2. You can extract the first three columns of list2 similarly and use %in% like the answer to your previous question now. (setdiff will only return the non-matching first 3 columns, while using %in% can be used to index the original list2 to extract the entire original string)
m = regexec("^(?:[^\t]+\t){3}", list2)
first3.list2 = unlist(regmatches(list2, m))
list2[!(first3.list2 %in% first3.list1)]
(It seems for the example you provided, there are no lines in list2 whose first 3 columns are not in the first 3 columns of list1).
Other approaches include using strsplit or read.delim to split your dataframe into columns, then using paste to paste the first 3 back together, and then proceeding similarly.
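A minimal sketch of that strsplit alternative (same idea, no regular expressions): split each line on tabs, paste the first three fields back together, and index list2 with %in% so the full strings are returned.
# paste the first three tab-delimited fields of each string back together
first3 <- function(x) {
  sapply(strsplit(x, "\t", fixed = TRUE),
         function(f) paste(f[1:3], collapse = "\t"))
}

# full strings from list2 whose first three columns do not appear in list1
list2[!(first3(list2) %in% first3(list1))]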

Count of Comma separated values in r

I have a column named subcat_id in which the values are stored as comma separated lists. I need to count the number of values and store the counts in a new column. The lists also have Null values that I want to get rid of.
I would like to store the counts in the n column.
We can try
nchar(gsub('[^,]+', '',
           gsub(',(?=,)|(^,|,$)', '',
                gsub('(Null){1,}', '', df1$subcat_id),
                perl = TRUE))) + 1L
#[1] 6 4
Or
library(stringr)
str_count(df1$subcat_id, '[0-9.]+')
#[1] 6 4
data
df1 <- data.frame(subcat_id = c('1,2,3,15,16,78', '1,2,3,15,Null,Null'),
                  stringsAsFactors = FALSE)
You can do
sapply(strsplit(subcat_id,","),FUN=function(x){length(x[x!="Null"])})
strsplit(subcat_id,",") will return a list of each item in subcat_id split on commas. sapply will apply the specified function to each item in this list and return us a vector of the results.
Finally, the function that we apply will take just the non-null entries in each list item and count the resulting sublist.
For example, if we have
subcat_id <- c("1,2,3","23,Null,4")
Then running the above code returns c(3,4) which you can assign to your column.
If running this from a dataframe, it is possible that the character column has been interpreted as a factor, in which case the error non-character argument will be thrown. To fix this, we need to force interpretation as a character vector with the as.character function, changing the command to
sapply(strsplit(as.character(frame$subcat_id),","),FUN=function(x){length(x[x!="Null"])})
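Putting it together, a minimal sketch (reusing the df1 defined above) that stores the counts in a new column n:
# count the non-"Null" entries of each comma-separated list
df1$n <- sapply(strsplit(as.character(df1$subcat_id), ","),
                function(x) sum(x != "Null"))
df1
#            subcat_id n
# 1     1,2,3,15,16,78 6
# 2 1,2,3,15,Null,Null 4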

extract data from a list without using loop in R

I have a vector v with row positions:
v<-c(10,3,100,50,...)
With those positions I want to extract elements of a list, keeping one column fixed; for example, let's suppose my column number is 2, so I am doing:
data<-c()
data<-c(list1[[v]][[2]])
list1 has the data in the following format:
[[34]]
[1] "200_s_at" "483" "1933" "3664"
So, for example, I want to extract from row 342 the value 1910 only (column 2), and do the same with the other rows,
but I get an error when I do that. Is it possible to do it directly, or should I use a loop that reads the positions in v one by one and fills the data vector, like:
# algorithm
for (i in seq_along(v)) {
  pos <- v[i]
  data[i] <- list1[[pos]][[2]]
}
Thanks
You can use sapply as below:
sapply(list1[v], `[`, 2)
However, depending on your data, you might get an unexpected output, as explained in Why is `vapply` safer than `sapply`?. For example, what if some of your list items have length < 2? What if some of the list items are not vectors but data.frames? Also, the output class may differ based on the class of your list elements (logical, integer, numeric, character). If for example, you expect that all your list items are character vectors of length >= 2, then it is safer to do:
vapply(list1[v], `[`, character(1), 2)
where vapply will double check your assumptions for you, and error out if it finds a problem.
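As a small self-contained illustration (list1 and v here are hypothetical, shaped like the data in the question):
# hypothetical data shaped like the question's list1
list1 <- list(c("200_s_at", "483", "1933", "3664"),
              c("201_s_at", "512", "1910", "4001"),
              c("202_s_at", "77",  "805",  "1234"))
v <- c(3, 1)

sapply(list1[v], `[`, 2)
# [1] "77"  "483"

vapply(list1[v], `[`, character(1), 2)
# [1] "77"  "483"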
