R - Irregular metadata; create df from large single column

The title doesn't really do my question justice, because there are probably a few ways to skin this cat. But I picked one approach and went with it. This is what I'm working with:
I've pulled all the metadata for a particular study in the NCBI database using the "Send to:" option on their interface and downloading a .txt file.
In total, I have ~23k samples, each with up to 609 unique questions and answers from a questionnaire totaling 8M+ obs of 1 variable when read as a .csv. To my dismay, the metadata are irregular. Some samples have 140 associated key/value pairs. Others have 492. I've included a header of a sample below.
1: qiita_sid_10317:10317.BLANK1.6H.GUELPH
Identifiers: BioSample: SAMEA4790059; SRA: ERS2609990
Organism: metagenome
Attributes:
/Alias="qiita_sid_10317:10317.BLANK1.6H.GUELPH"
/description="American Gut control"
/ENA checklist="ERC000011"
/INSDC center alias="UCSDMI"
/INSDC center name="University of California San Diego Microbiome Initiative"
/INSDC first public="2018-07-13T17:03:10Z"
/INSDC last update="2018-07-13T14:50:03Z"
/INSDC status="public"
/SRA accession="ERS2609990"
I've tried (including but not limited to):
Read .txt file (adding a delimiter hasn't made a difference, am I missing something here?)
I've tried reading the data using various delimiters
I've even removed the header data in Sublime Text, leaving only "Attributes:" and the "/"-delimited key/value pairs in order to mess with the column that way
I've split the column and found all unique values in col1, with the idea of maybe building a df from scratch, etc.
Can't seem to get past the cleaning steps:
library(splitstackshape)  # for cSplit
samples <- read.csv("~/biosample_result_full.txt")
samples_split <- cSplit(samples, splitCols = samples$Colname, sep = "=")
samples_split$Attributes_1 <- gsub(" ", "_", samples_split$Attributes_1)
questions <- unique(samples_split$Attributes_1)
Ideally, each sample and associated metadata would be transformed into rows, with each "Attribute"/question as the column name.
Any help is greatly appreciated.

I see that the website you've linked to allows the option to export the data to XML. I strongly suggest doing so; R can handle/parse XML files very efficiently.
When I download the first three results from that site to a file biosample_result.xml, it's easy to process using the xml2 package:
library( xml2 )
library( magrittr )

doc <- read_xml( "./biosample_result.xml" )

# get all BioSample nodes
BioSample.Nodes <- xml_find_all( doc, "//BioSample" )

# build a data.frame
data.frame(
  sample_name = xml_find_first( BioSample.Nodes, ".//Id[@db='SRA']" ) %>% xml_text(),
  stringsAsFactors = FALSE )

#   sample_name
# 1  ERS2609990
# 2  ERS2609989
# 3  ERS2609988
So if you can use the XML, you will just have to use the right xpath-syntax to get the data/nodes you need, into the columns you want...
In the example above, I extracted (from each BioSample node) the first Id node whose db attribute equals SRA, and stored the result in the column sample_name.
Still assuming you can use the xml data.
If you are looking to get all attributes into one df, you need the functions from purrr, so just load the entire tidyverse:
library( tidyverse )
df <- xml_find_all( doc, "//BioSample" ) %>%
  map_df( ~{
    set_names(
      xml_find_all( .x, ".//Attribute" ) %>% xml_text(),
      xml_find_all( .x, ".//Attribute" ) %>% xml_attr( "attribute_name" )
    ) %>%
      as.list() %>%
      flatten_df()
  })
This will result in a df with one row per BioSample and one column per attribute.
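If you also want the sample id next to the attributes, here is a rough sketch (untested against your full export, and assuming the attribute names are unique within each sample) that combines both per BioSample node:

# combine the SRA id and the attributes, one row per BioSample node
# (uses xml2 and the tidyverse loaded above)
df_all <- xml_find_all( doc, "//BioSample" ) %>%
  map_df( ~{
    attrs <- xml_find_all( .x, ".//Attribute" )
    c(
      sample_name = xml_find_first( .x, ".//Id[@db='SRA']" ) %>% xml_text(),
      set_names( xml_text( attrs ), xml_attr( attrs, "attribute_name" ) )
    ) %>%
      as.list() %>%
      flatten_df()
  })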

Related

Merge two dataframes without common column if order is known

TLDR; I've gotten myself into some trouble. I have a dataframe that I exported into different output files based on column ID. Now, I want to merge data from those files back into the original dataframe without a common index, knowing only that their order was preserved within the ID grouping.
I'm starting with a large dataframe (n > 30,000). One of my columns contains text documents in many different languages (text). Another column (en) indicates which language the document is.
MWE:
df <- data.frame(text = c('this is a test', 'questo è un test', 'cest un test', 'another test', 'un autre essai'),
en = c("en", "it", "fr", "en", "fr"))
I wanted to translate them all into English, but needed to do it manually without a Google Translate API. First I did the following to obtain a list of all unique non-English entries (note that my real data uses lang_1/lang_2 for the language columns, whereas the MWE above just has en):
noneng <- subset(df, lang_2!="en")
noneng <- subset(noneng, !is.na(text))
noneng <- noneng[!duplicated(noneng[,c('text')]),]
I then grouped elements of this list by language group and exported them.
# export to sep files
write_mult = function(x) {
  write.table(x, paste0("translate/untranslated_", unique(x$lang_1), ".txt"))
  return(x)
}

noneng %>%
  group_by(lang_1) %>%
  do(write_mult(.))
I end up with a bunch of delimited .txt files, one for each language group ID. Within each .txt is the complete list of unique rows for that language group.
untranslated_en.txt (2 entries)
untranslated_it.txt (2 entries)
untranslated_fr.txt (1 entry)
...
I painstakingly ran these through Google Translate and now have English language versions for each row of each .txt. I was planning on replacing the untranslated versions in the original dataframe with its translation.
What I (foolishly) didn't think about in advance was how I was going to match the translated versions back to the untranslated ones in original dataframe without having given them an index/key of some kind.
There must be a way out of this. For example,
I know the original order of noneng before I exported it
group_by and unique must have predictable output based on the order they're given.
So it seems like it should be possible to recover the order of the .txt files even without an index column.
If I knew that, then I could merge them based on order, even without an index. But having split the output into multiple .txts has thrown me off, and I'm totally stumped about how one would work backwards from this in R.
I hope I'm explaining this clearly. I'd be grateful for any help.
Perhaps something like this:
library(dplyr)
library(readr)   # for read_csv

# helper function to add row numbers per language group
add_row_in_lang <- function(df) {
  df %>% group_by(en) %>% mutate(row = row_number()) %>% ungroup()
}

# add row numbers, join to table with combined translations
df %>%
  add_row_in_lang() %>%
  left_join(
    list.files(path = "translated_csvs", full.names = TRUE) %>%
      purrr::map_dfr( read_csv ) %>%
      add_row_in_lang(),
    by = c("en", "row")
  )
Note how I've specified to join just on en and row, since it sounds like in your case those will be the keys.
I'm not sure about the format of your CSVs, so I can't verify or know what tweaks might be required, but if you had the data frames you could do the same thing:
df_fr <- data.frame(text = c('cest un test', 'un autre essai'),
                    en = c("fr", "fr"), translation = c("This is a test", "Another test"))
df_it <- data.frame(text = c('questo è un test'),
                    en = c("it"), translation = c("This is a test"))
df_en <- data.frame(text = c('this is a test', 'another test'),
                    en = c("en"), translation = c("This is a test", "another test"))

df %>%
  add_row_in_lang() %>%
  left_join(
    bind_rows(df_fr, df_it, df_en) %>%
      add_row_in_lang(),
    by = c("en", "row")
  )
# A tibble: 5 × 5
  text.x           en      row text.y           translation
  <chr>            <chr> <int> <chr>            <chr>
1 this is a test   en        1 this is a test   This is a test
2 questo è un test it        1 questo è un test This is a test
3 cest un test     fr        1 cest un test     This is a test
4 another test     en        2 another test     another test
5 un autre essai   fr        2 un autre essai   Another test
Note how in my example here, both tables have the original text -- but I specified that the join only use en and the row column that we added. Since in my example both files have a text column, it outputs as text.x (from the first table) and text.y (from the 2nd one).
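If the duplicated text columns bother you, one option (a small sketch based on the example frames above) is to keep just the original text and the translation after the join:

df %>%
  add_row_in_lang() %>%
  left_join(
    bind_rows(df_fr, df_it, df_en) %>% add_row_in_lang(),
    by = c("en", "row")
  ) %>%
  select(text = text.x, en, translation)   # drop the duplicated text.y column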
Not sure what caused me to overthink this. I came up with a rough but straightforward fix. In case it's useful for anyone else, this is what I did (not as clean as Jon's answer but maybe more comprehensible for newbies).
This step is optional, but first I gave the dataframe an index post hoc and ran the export procedure again. I sampled the untranslated entries to note their indices. Then, I ordered the dataframe by group ID to confirm that the indices do match up (as expected).
noneng <- noneng[order(noneng$lang_1),]
I noted the group order in the sorted dataframe (alphabetical), imported each of the .txts, and used bind_rows to create a new dataframe noneng_translated of equal length. Sort that dataframe by ID. Finally, cbind(noneng, noneng_translated). Simple enough.
en <- data.frame(lapply(en, as.character))
it <- data.frame(lapply(it, as.character))
fr <- data.frame(lapply(fr, as.character))
...
newdf <- bind_rows(en,it,fr,...)
cbind(noneng, newdf)
Uglier and riskier than merging on an index would have been, but it did the trick. Curious how others would do this. I'll leave it up in case others in the future make similar mistakes.

How to select for certain data in a .txt file

I have a .txt import file from a weather station using some pretty advanced code, and I need to sort based on one area of content within each line. Here's a few lines:
13:30:00.587: <- $GPGGA,183000.30,4415.6243,N,08823.9769,W,1,7,1.7,225.5,M,-33.4,M,,*68
13:30:00.683: <- $GPGLL,4415.6243,N,08823.9769,W,183000.40,A,A*72
13:30:00.779: <- $GPVTG,159.6,T,163.2,M,0.1,N,0.1,K,A*2E
I basically need to be able to group together all lines with a $GPGGA, and do the same for $GPGLL, $GPVTG, and I believe 6 other types of entries that repeat. group_by() does not work, nor do select() or sort(), for obvious reasons. The formatting here is clearly not in any organized table format, which makes this very difficult for me. How do I do this?
Here's the code I used to import the original file (I replaced my actual username with <my username>):
filefolder <- "C:\\Users\\<my username>\\Downloads\\"
Weather_data <- paste(filefolder, "Jul_13_2021_Weatherstation_Test_File.txt", sep = "")
Weather_data <- read.delim(Weather_data)
And here's what I have so far in my attempt:
Screenshot of what I have so far: https://i.stack.imgur.com/FSlzf.png
As you say there is no organisation in the table. I would suggest doing something with regular expressions:
df <- data.frame(text = c("13:30:00.587: <- $GPGGA,183000.30,4415.6243,N,08823.9769,W,1,7,1.7,225.5,M,-33.4,M,,*68",
"13:30:00.683: <- $GPGLL,4415.6243,N,08823.9769,W,183000.40,A,A*72",
"13:30:00.779: <- $GPVTG,159.6,T,163.2,M,0.1,N,0.1,K,A*2E"))
library(dplyr)
df %>%
mutate(Entry = gsub(".*\\$([A-Z]+),.*", "\\1", text)) %>%
group_by(Entry)
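If you then want one data frame per sentence type (all the $GPGGA lines together, all the $GPGLL lines together, and so on), one possibility, sketched on the same small example, is to split on that new Entry column:

# named list of data frames, one per entry type
entries <- df %>%
  mutate(Entry = gsub(".*\\$([A-Z]+),.*", "\\1", text)) %>%
  split(.$Entry)

entries$GPGGA   # data frame of all $GPGGA lines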

Web Scraping Education Data in R

Was presented a problem at work and am trying to think / work my way through it. However, I am very new at web scraping, and need some help, or just good starting points, on web scraping.
I have a website from the education commission.
http://ecs.force.com/mbdata/mbprofgroupall?Rep=DEA
This site contains 50 tables, one for each state, with two columns in a question / answer format. My first attempt has been this...
library(tidyverse)
library(httr)
library(XML)
tibble(url = "http://ecs.force.com/mbdata/mbprofgroupall?Rep=DEA") %>%
  mutate(get_data = map(.x = url,
                        ~GET(.x))) %>%
  mutate(list_data = map(.x = get_data,
                         ~readHTMLTable(doc = content(.x, "text")))) %>%
  pull(list_data)
My first thought was to create multiple dataframes, one for each state, in a list format.
This idea does not seem to have worked as anticipated. I was expecting a list of 50, but it seems to be a single response rather than 50. It appears that this one response read every line but did not differentiate one table from the next. Confused on next steps, anyone with any ideas? Web scraping is odd to me.
My second attempt was to copy and paste the table into R as a tribble, one state at a time. This sort of worked, but not every column is formatted the same way. I attempted to use tidyr::separate() to break up the columns by "\t", and that worked for some columns, but not all.
Any help on this problem, or even just where to look to learn more about web scraping, would be very helpful. This did not seem all that difficult at first, but it seems like there are a couple of things I am missing. Maybe rvest? I have never used it, but know it is common in web scraping.
Thanks in advance!
As you already guessed, rvest is a very good choice for web scraping. Using rvest you can get the table from your desired website in just two steps. With some additional data wrangling this can be transformed into a nice data frame.
library(rvest)
#> Loading required package: xml2
library(tidyverse)
html <- read_html("http://ecs.force.com/mbdata/mbprofgroupall?Rep=DEA")
df <- html %>%
  html_table(fill = TRUE, header = FALSE) %>%
  .[[1]] %>%
  # Remove empty rows and rows containing the table header
  filter(!(X1 == "" & X2 == ""), !(grepl("^Dual", X1) & grepl("^Dual", X2))) %>%
  # Create state column
  mutate(is_state = X1 == X2, state = ifelse(is_state, X1, NA_character_)) %>%
  fill(state) %>%
  filter(!is_state) %>%
  select(-is_state)
head(df, 2)
#> X1
#> 1 Statewide policy in place
#> 2 Definition or title of program
#> X2
#> 1 Yes
#> 2 Dual Enrollment – Postsecondary Institutions. High school students are allowed to take college courses for credit either at a high school or on a college campus.
#> state
#> 1 Alabama
#> 2 Alabama
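Since you originally wanted the question/answer pairs per state, one further optional step, sketched here under the assumption that each question appears at most once per state, is to spread the questions into columns with tidyr, giving one row per state:

# one row per state, one column per question
df_wide <- df %>%
  tidyr::pivot_wider(names_from = X1, values_from = X2)

head(df_wide, 2)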

extracting list-in-a-list-in-a-list to build dataframe in R

I am trying to build a data frame with book id, title, author, rating, collection, start and finish date from the LibraryThing api with my personal data. I am able to get a nested list fairly easily, and I have figured out how to build a data frame with everything but the dates (perhaps in not the best way but it works). My issue is with the dates.
The list I'm working with normally has 20 elements, but it adds the startfinishdates element only if I added dates to the book in my account. This is causing two issues:
If it was always there, I could extract it like everything else and it would have NA most of the time, and I could use cbind to get it lined up correctly with the other information
When I extract it using the name, and get an object with less elements, I don't have a way to join it back to everything else (it doesn't have the book id)
Ultimately, I want to build this data frame and an answer that tells me how to pull out the book id and associate it with each startfinishdate so I can join on book id is acceptable. I would just add that to the code I have.
I'm also open to learning a better approach from the jump and re-designing the entire thing as I have not worked with lists much in R and what I put together was after much trial and error. I do want to use R though, as ultimately I am going to use this to create an R Markdown page for my web site (for instance, a plot that shows finish dates of books).
You can run the code below and get the data (no api key required).
library(jsonlite)
library(tidyverse)
library(assertr)
data<-fromJSON("http://www.librarything.com/api_getdata.php?userid=cau83&key=392812157&max=450&showCollections=1&responseType=json&showDates=1")
books.lst<-data$books
# create df from json
create.df <- function(item){
  df <- map_df(.x = books.lst, ~.x[[item]])
  df2 <- t(df)
  return(df2)
}
ids<-create.df(1)
titles<-create.df(2)
ratings<-create.df(12)
authors<-create.df(4)
#need to get the book id when i build the date df's
startdates.df<-map_df(.x=books.lst,~.x$startfinishdates) %>% select(started_stamp,started_date)
finishdates.df<-map_df(.x=books.lst,~.x$startfinishdates) %>% select(finished_stamp,finished_date)
collections.df<-map_df(.x=books.lst,~.x$collections)
#from assertr: will create a vector of same length as df with all values concatenated
collections.v<-col_concat(collections.df, sep = ", ")
#assemble df
books.df<-as.data.frame(cbind(ids,titles,authors,ratings,collections.v))
names(books.df)<-c("ID","Title","Author","Rating","Collections")
books.df<-books.df %>% mutate(ID=as.character(ID),Title=as.character(Title),Author=as.character(Author),
Rating=as.character(Rating),Collections=as.character(Collections))
This approach is outside the tidyverse meta-package. Using base R you can make it work with the following code.
Map will apply the user-defined function to each element of data$books (provided as the second argument) and extract the required fields for your data frame. Reduce will take all the individual data frames and merge (or reduce) them into a single data frame, booksdf.
library(jsonlite)
data<-fromJSON("http://www.librarything.com/api_getdata.php?userid=cau83&key=392812157&max=450&showCollections=1&responseType=json&showDates=1")
booksdf = Reduce(function(x, y) { rbind(x, y) },
  Map(function(x) {
    lenofelements = length(x)
    if (lenofelements > 20) {
      if (!is.null(x$startfinishdates$started_date)) {
        started_date = x$startfinishdates$started_date
      } else {
        started_date = NA
      }
      if (!is.null(x$startfinishdates$started_stamp)) {
        started_stamp = x$startfinishdates$started_stamp
      } else {
        started_stamp = NA
      }
      if (!is.null(x$startfinishdates$finished_date)) {
        finished_date = x$startfinishdates$finished_date
      } else {
        finished_date = NA
      }
      if (!is.null(x$startfinishdates$finished_stamp)) {
        finished_stamp = x$startfinishdates$finished_stamp
      } else {
        finished_stamp = NA
      }
    } else {
      started_stamp = NA
      started_date = NA
      finished_stamp = NA
      finished_date = NA
    }
    book_id = x$book_id
    title = x$title
    author = x$author_fl
    rating = x$rating
    collections = paste(unlist(x$collections), collapse = ",")
    return(data.frame(ID = book_id, Title = title, Author = author, Rating = rating,
                      Collections = collections, Started_date = started_date, Started_stamp = started_stamp,
                      Finished_date = finished_date, Finished_stamp = finished_stamp))
  }, data$books))
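If you would rather stay closer to the tidyverse style of your original code, here is a rough, untested sketch of the same idea with purrr. It keeps the book id next to the dates so the result can be joined back to books.df on ID, and it assumes each book has at most one start/finish pair:

library(purrr)
library(dplyr)

dates.df <- map_dfr(books.lst, function(x) {
  sfd <- x$startfinishdates   # NULL when no dates were recorded for the book
  tibble(
    ID            = as.character(x$book_id),
    started_date  = if (!is.null(sfd$started_date))  as.character(sfd$started_date)[1]  else NA_character_,
    finished_date = if (!is.null(sfd$finished_date)) as.character(sfd$finished_date)[1] else NA_character_
  )
})

books.df %>% left_join(dates.df, by = "ID")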

Extracting data from a specific position of a PDF?

I'm attempting to extract data from a pdf, which can be located at https://www.dol.gov/ui/data.pdf. The data I'm interested in are on page 4 of the PDF and are the 3 observations of the Initial Claims (NSA), the 3 observations of the Insured Unemployment (NSA), and the most recent week used covered employment (footnote 2).
I've read the PDF into R using pdftools, but the text output which is generated is quite ugly (kind of to be expected - due to the nature of PDFs). Is there any way I can extract specific data from this text output? I believe the data will always be in the same place in the output, which is helpful.
The output I'm looking at can be seen with the following script:
library(pdftools)
download.file("https://www.dol.gov/ui/data.pdf", "data.pdf", mode="wb")
uidata <- pdf_text("data.pdf")
uidata[4]
I've searched people with similar questions and fiddled around with scan() and grep(), but can't seem to figure out a way to isolate and extract the data I need from the text output. Thanks in advance if anyone stumbles upon this and can point me in the right direction - if not I'll be trying to figure this out!
With grep and a little regex, you can get everything you need into a usable structure:
library(magrittr)

x <- pdftools::pdf_text('https://www.dol.gov/ui/data.pdf')
x2 <- readLines(textConnection(x[4]))
r <- grep('WEEK ENDING', x2)

l <- lapply(seq_along(r), function(i){
  x2[r[i]:(na.omit(c(r[i + 1], grep('FOOTNOTE', x2)))[1] - 1)] %>%
    trimws() %>%
    gsub('\\s{2,}', ';', .) %>%
    paste(collapse = '\n') %>%
    read.csv2(text = ., dec = '.')
})
from_footnote <- as.numeric(gsub('^2|\\D', '', x2[grep('2\\.', x2)]))
l[[1]][3,]
#> WEEK.ENDING December.17 December.10 Change
#> Initial Claims (NSA) 315,613 305,333 +10,280 352,534
#> December.3
#> Initial Claims (NSA) 319,641
from_footnote
#> [1] 138322138
You'll still need to parse the numbers, but at least it's usable.
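For example, a small helper that strips the thousands separators and the leading "+" before coercing to numeric might look like this (the column selection is based on the output above, so adjust as needed):

# drop the label column, remove "," and "+", convert to numeric
parse_counts <- function(x) as.numeric(gsub("[+,]", "", x))
initial_claims <- parse_counts(unlist(l[[1]][3, -1]))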
