Extract list based on string with tabulizer package - r

Extracting the quarterly income statement with the tabulizer package and converting it to tabular form.
# 2017 Q3 Report
telia_url = "http://www.teliacompany.com/globalassets/telia-
company/documents/reports/2017/q3/telia-company-q3-2017-en"
telialists = extract_tables(telia_url)
teliatest1 = as.data.frame(telialists[22])
#2009 Q3#
telia_url2009 = "http://www.teliacompany.com/globalassets/telia-company/documents/reports/2009/q3/teliasonera-q3-2009-report-en.pdf"
telialists2009 = extract_tables(telia_url2009)
teliatest2 = as.data.frame(telialists2009[9])
I am interested only in the Condensed Consolidated Statements of Comprehensive Income table. This heading is exact or very similar across all historical reports.
Above, for the 2017 report, list element 22 was the correct table. However, since the 2009 report had a different layout, element 9 was the correct one for that particular report.
What would be a clever solution to make this extraction dynamic, depending on where the string (or a substring of it), "Condensed Consolidated Statements of Comprehensive Income", is located?
Perhaps using the tm package to find the relative position?
Thanks

You could use pdftools to find the page you're interested in.
For instance a function like this one should do the job:
get_table <- function(url) {
  txt <- pdftools::pdf_text(url)
  p <- grep("condensed consolidated statements.{0,10}comprehensive income",
            txt,
            ignore.case = TRUE)[1]
  L <- tabulizer::extract_tables(url, pages = p)
  i <- which.max(lengths(L))
  data.frame(L[[i]])
}
The first step is to read all the pages in the character vector txt. Then grep allows you to find the first page looking like the one you want (I inserted .{0,10} to allow a maximum of ten characters like spaces or newlines in the middle of the title).
Using tabulizer, you can extract the list L of all tables located on this page, which should be much faster than extracting all the tables of the document, as you did. Your table is probably the biggest on that page, hence the which.max.
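For example, applied to the two reports above (just a sketch: the largest table on the matched page is assumed to be the income statement, which you should verify for each report):
income_2017 <- get_table(telia_url)
income_2009 <- get_table(telia_url2009)
str(income_2017)
str(income_2009)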

Related

Is it possible to analyze items' performance from different exams using the same questions but with a different order in each version?

We produced five exams2nops exams using the same groups of items, with randomized order. All of them were schoice items. As such, we obtained five different *.rds files, each of which will be used with the corresponding scanned exams. I noticed that those *.rds files to be used in nops_eval contain information about the *.rmd file that was used to produce each exam question. E.g.:
However, after producing the nops_eval.csv that information is lost.
I would like to merge all five nops_eval.csv files using the *.rmd information to match each question, since the same question number (e.g., exercise 22) can be generated from different *.rmd files. The same 22 *.rmd files were used in all exams (all have the same 22 questions, just in different orders).
I would like to obtain a data frame with the merged CSVs so that I can conduct Item Response Theory and Rasch modeling analyses.
Yes, you can merge the information from the CSV files by reordering them based on the file/name information from the RDS files. Below I illustrate how to do this using the check.* columns from the CSV files, which are typically closest to what you need for an IRT analysis.
First, you read the CSV and RDS from the first version of the exam:
eval1 <- read.csv2("/path/to/first/nops_eval.csv", dec = ".")
metainfo1 <- readRDS("/path/to/first/exam.rds")
Then you extract only the check.* columns and use the exercise file names as column names.
eval1 <- eval1[, paste0("check.", 1:length(metainfo1[[1]]))]
names(eval1) <- sapply(metainfo1[[1]], function(x) x$metainfo$file)
I'm using $file here because it is always unique across exercises. If $name is also unique in your case and has the better labels, you can also use that instead.
Then you do the same for the second version of the exam:
eval2 <- read.csv2("/path/to/second/nops_eval.csv", dec = ".")
metainfo2 <- readRDS("/path/to/second/exam.rds")
eval2 <- eval2[, paste0("check.", 1:length(metainfo2[[1]]))]
names(eval2) <- sapply(metainfo2[[1]], function(x) x$metainfo$file)
If the same exercises have been used in the construction of the two versions, the column names of eval1 and eval2 are the same, just in a different order. Then you can simply do
eval2 <- eval2[, names(eval1)]
to reorder the columns of eval2 to match those of eval1. Subsequently, you can do:
eval <- rbind(eval1, eval2)
If you have more than two versions of the exam, you just iterate the same code and rbind() everything together in the end.
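For instance, a simple loop over all five versions might look like this (the file paths below are placeholders, not your actual paths):
eval_all <- NULL
for (i in 1:5) {
  evali <- read.csv2(sprintf("/path/to/version%i/nops_eval.csv", i), dec = ".")
  metai <- readRDS(sprintf("/path/to/version%i/exam.rds", i))
  evali <- evali[, paste0("check.", 1:length(metai[[1]]))]
  names(evali) <- sapply(metai[[1]], function(x) x$metainfo$file)
  if (is.null(eval_all)) {
    eval_all <- evali
  } else {
    eval_all <- rbind(eval_all, evali[, names(eval_all)])  # reorder columns to match
  }
}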
Similar code can also be used if the exercises only partially overlap between the versions of the exam. In that case I first construct a large enough NA matrix with the merged exercise file names and then insert the results:
n1 <- nrow(eval1)
n2 <- nrow(eval2)
nam <- unique(c(names(eval1), names(eval2)))
eval <- matrix(NA, nrow = n1 + n2, ncol = length(nam))
colnames(eval) <- nam
eval[1:n1, names(eval1)] <- as.matrix(eval1)
eval[(n1 + 1):(n1 + n2), names(eval2)] <- as.matrix(eval2)
Again you would need to iterate suitably to merge more than two versions.
In either case, the resulting eval could then be processed further to become the IRT matrix for subsequent analysis.
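For example, if you want to go on to a binary response matrix for a Rasch model, a minimal sketch could be (assuming the check.* columns hold the points per item, which you should verify for your setup):
irt <- apply(eval, 2, as.numeric)
irt <- 1L * (irt > 0)   # any credit counts as solved; adjust if you use partial credit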

extracting list-in-a-list-in-a-list to build dataframe in R

I am trying to build a data frame with book id, title, author, rating, collection, start and finish date from the LibraryThing api with my personal data. I am able to get a nested list fairly easily, and I have figured out how to build a data frame with everything but the dates (perhaps in not the best way but it works). My issue is with the dates.
The list I'm working with normally has 20 elements, but it adds the startfinishdates element only if I added dates to the book in my account. This is causing two issues:
If it was always there, I could extract it like everything else and it would be NA most of the time, and I could use cbind to line it up correctly with the other information.
When I extract it using the name, I get an object with fewer elements and have no way to join it back to everything else (it doesn't carry the book id).
Ultimately, I want to build this data frame; an answer that tells me how to pull out the book id and associate it with each startfinishdates entry, so I can join on book id, would be acceptable. I would just add that to the code I have.
I'm also open to learning a better approach from the jump and re-designing the entire thing as I have not worked with lists much in R and what I put together was after much trial and error. I do want to use R though, as ultimately I am going to use this to create an R Markdown page for my web site (for instance, a plot that shows finish dates of books).
You can run the code below and get the data (no api key required).
library(jsonlite)
library(tidyverse)
library(assertr)
data<-fromJSON("http://www.librarything.com/api_getdata.php?userid=cau83&key=392812157&max=450&showCollections=1&responseType=json&showDates=1")
books.lst<-data$books
#create df from json
create.df <- function(item){
  df <- map_df(.x = books.lst, ~.x[[item]])
  df2 <- t(df)
  return(df2)
}
ids<-create.df(1)
titles<-create.df(2)
ratings<-create.df(12)
authors<-create.df(4)
#need to get the book id when i build the date df's
startdates.df<-map_df(.x=books.lst,~.x$startfinishdates) %>% select(started_stamp,started_date)
finishdates.df<-map_df(.x=books.lst,~.x$startfinishdates) %>% select(finished_stamp,finished_date)
collections.df<-map_df(.x=books.lst,~.x$collections)
#from assertr: will create a vector of same length as df with all values concatenated
collections.v<-col_concat(collections.df, sep = ", ")
#assemble df
books.df<-as.data.frame(cbind(ids,titles,authors,ratings,collections.v))
names(books.df)<-c("ID","Title","Author","Rating","Collections")
books.df<-books.df %>% mutate(ID=as.character(ID),Title=as.character(Title),Author=as.character(Author),
Rating=as.character(Rating),Collections=as.character(Collections))
This approach is outside the tidyverse meta-package; using base R you can make it work with the following code.
Map applies the user-defined function to each element of data$books (provided as the second argument) and extracts the fields required for your data frame. Reduce then takes all the individual data frames and merges (reduces) them into a single data frame, booksdf.
library(jsonlite)
data<-fromJSON("http://www.librarything.com/api_getdata.php?userid=cau83&key=392812157&max=450&showCollections=1&responseType=json&showDates=1")
booksdf = Reduce(function(x, y){rbind(x, y)},
  Map(function(x){
    lenofelements = length(x)
    # startfinishdates is only present when dates were added to the book
    if(lenofelements > 20){
      if(!is.null(x$startfinishdates$started_date)){
        started_date = x$startfinishdates$started_date
      }else{
        started_date = NA
      }
      if(!is.null(x$startfinishdates$started_stamp)){
        started_stamp = x$startfinishdates$started_stamp
      }else{
        started_stamp = NA
      }
      if(!is.null(x$startfinishdates$finished_date)){
        finished_date = x$startfinishdates$finished_date
      }else{
        finished_date = NA
      }
      if(!is.null(x$startfinishdates$finished_stamp)){
        finished_stamp = x$startfinishdates$finished_stamp
      }else{
        finished_stamp = NA
      }
    }else{
      started_stamp = NA
      started_date = NA
      finished_stamp = NA
      finished_date = NA
    }
    book_id = x$book_id
    title = x$title
    author = x$author_fl
    rating = x$rating
    collections = paste(unlist(x$collections), collapse = ",")
    return(data.frame(ID = book_id, Title = title, Author = author, Rating = rating,
                      Collections = collections, Started_date = started_date, Started_stamp = started_stamp,
                      Finished_date = finished_date, Finished_stamp = finished_stamp))
  }, data$books))
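As a small follow-up toward the plot of finish dates mentioned in the question, assuming the *_date fields come back as "YYYY-MM-DD" strings (an assumption worth checking against your own data):
finished <- as.Date(as.character(booksdf$Finished_date))
finished <- finished[!is.na(finished)]   # drop books without a finish date
hist(finished, breaks = "months", main = "Books finished per month", xlab = "")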

R – reading in and then extracting the same cell from a list of binary img files

I’m trying to extract the values of the same [i,j] cell from ~14000 img files. I’ve set up a working function that did this for smaller batches where it was reasonable to put the files in my directory, but now that I’m ready to look at the larger dataset I’m stuck. The img files are organized by year, with 365 separate files for each of 38 winters. Each winter has its own folder (WS1978_1979data, WS1979_1980data, etc.), and each day has its own file containing snow depth data for a large satellite grid in the Arctic (ssmi_n_snowdepth_5day_1978307.img, ssmi_n_snowdepth_5day_1978308.img, etc.) starting October 1 and going through September 30 of the following year. My ultimate hope (at least for this stage) is to create a vector of 365 snow depths for the cell of interest and to do this for each year in the dataset.
I can specify the appropriate file path to generate a list of the files I want for a given year, but when I then use my function to extract the particular cell I want, it looks for the files in the working directory rather than in that folder, which fails. Can you help me out? I feel like I must be missing something simple, but I haven't been able to find what I need.
Example of making a list of all the files for the winter of 1979-1980:
w1979s1980 <- as.vector(list.files(path="SnowDepth/WS1979_1980data", pattern=".img"))
Function to extract the snow depth from a given cell for all the files in that list:
cell.depthKotz <- function(depthfile){
  depth.val <- c()
  for(i in 1:length(depthfile)) {
    depth.mat <- matrix(readBin(depthfile[i], what="integer", n=136192, size=2, endian="little"),
                        nrow=448, ncol=304, byrow=TRUE)
    depth.val[i] <- depth.mat[187,65]
    depth.val[depth.val == 110] <- NA
    depth.val[depth.val == 120] <- NA
    depth.val[depth.val == 130] <- NA
    depth.val[depth.val == 140] <- NA
    depth.val[depth.val == 150] <- NA
    depth.val[depth.val == 160] <- NA
  }
  return(depth.val)
}
And then probably save this as a vector when I run the function for a given year:
Sdepths1978.1979 <- as.vector(cell.depthKotz(w1979s1980))
I should add that I’m very new to all this as far as even knowing how to phrase what I’m asking for, so let me know if I need to edit the title/question or add more detail. I’m not concerned about runtime if you see that sort of inefficiency in the functions above, but if there are obvious changes that would mean less repetitive/manual effort from me and more automated effort from R feel free to say so. Thanks for your help!
There is a recursive flag in the list.files function.
files <- list.files(path = "src", pattern = "\\.jpg$", recursive = TRUE)
If you make the path point to the parent directory and add the recursive = TRUE flag, you should be good.
Optionally, you can change the pattern to end with $, stating that file names must end with this pattern. In the rare case where the directory contains another file named, say, someinfo.img.txt, it would then be ignored.
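Combining that with full.names = TRUE (so the returned paths include the folders and readBin() can find the files regardless of the working directory), a rough sketch might be (the parent folder name SnowDepth is taken from the question; adjust as needed):
# All .img files under the parent folder, with full paths
files <- list.files("SnowDepth", pattern = "\\.img$", recursive = TRUE, full.names = TRUE)
# Split into one vector per winter folder and extract the cell of interest for each winter
files_by_winter <- split(files, basename(dirname(files)))
depths_by_winter <- lapply(files_by_winter, cell.depthKotz)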

Retaining unique identifiers (e.g., record id) when using tm functions - doesn't work with lots of data?

I am working with unstructured text (Facebook) data and am pre-processing it (e.g., stripping punctuation, removing stop words, stemming). I need to retain the record (i.e., Facebook post) ids while pre-processing. I have a solution that works on a subset of the data but fails with all the data (N = 127K posts). I have tried chunking the data, and that doesn't work either. I think it has something to do with my work-around of relying on row names. For example, it appears to work with the first ~15K posts, but when I keep subsetting, it fails. I realize my code is less than elegant, so I am happy to learn better or completely different solutions; all I care about is keeping the IDs when I go to a VCorpus and then back again. I'm new to the tm package and the readTabular function in particular. (Note: I ran tolower() and removeWords() before making the VCorpus, as I originally thought that was part of the issue.)
Working code is below:
Sample data
fb = data.frame(RecordContent = c("I'm dating a celebrity! Skip to 2:02 if you, like me, don't care about the game.",
"Photo fails of this morning. Really Joe?",
"This piece has been almost two years in the making. Finally finished! I'm antsy for October to come around... >:)"),
FromRecordId = c(682245468452447, 737891849554475, 453178808037464),
stringsAsFactors = F)
Remove punctuation & make lower case
fb$RC = tolower(gsub("[[:punct:]]", "", fb$RecordContent))
fb$RC2 = removeWords(fb$RC, stopwords("english"))
Step 1: Create special reader function to retain record IDs
myReader = readTabular(mapping=list(content="RC2", id="FromRecordId"))
Step 2: Make my corpus. Read in the data using DataframeSource and the custom reader function where each FB post is a "document"
corpus.test = VCorpus(DataframeSource(fb), readerControl=list(reader=myReader))
Step 3: Clean and stem
corpus.test2 = corpus.test %>%
tm_map(removeNumbers) %>%
tm_map(stripWhitespace) %>%
tm_map(stemDocument, language = "english") %>%
as.VCorpus()
Step 4: Make the corpus back into a character vector. The row names are now the IDs
fb2 = data.frame(unlist(sapply(corpus.test2, `[`, "content")), stringsAsFactors = F)
Step 5: Make new ID variable for later merge, name vars, and prep for merge back onto original dataset
fb2$ID = row.names(fb2)
fb2$RC.ID = gsub(".content", "", fb2$ID)
colnames(fb2)[1] = "RC.stem"
fb3 = select(fb2, RC.ID, RC.stem)
row.names(fb3) = NULL
I think the ids are being stored and retained by default by the tm package. You can fetch them all (in a vectorized manner) with
meta(corpus.test, "id")
$`682245468452447`
[1] "682245468452447"
$`737891849554475`
[1] "737891849554475"
$`453178808037464`
[1] "453178808037464"
I'd recommend reading the documentation of the tm::meta() function, but it's not very good.
You can also add arbitrary metadata (as key-value pairs) to each collection item in the corpus, as well as collection-level metadata.
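For example (a small sketch; the tag names here are arbitrary choices, not anything tm requires):
meta(corpus.test[[1]], "note") <- "flagged for review"                # document-level metadata
meta(corpus.test[[1]])                                                # inspect that document's metadata
meta(corpus.test, "source", type = "corpus") <- "Facebook export"     # corpus-level metadata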

How to ignore comma in text para when saving in .csv format?

I am trying to extract data from NCBI using different functions in the rentrez package. However, I have an issue: extract_from_esummary() returns a matrix in which the text of a column is split across adjacent columns when saved to a .csv file (as shown in the image), because "," is recognized as a delimiter.
library (rentrez)
PM.ID <- c("25979833", "25667274","23792568","22435913")
p.data <- entrez_summary(db = "pubmed", id = PM.ID )
pubrecord.table <- extract_from_esummary(esummaries = p.data,
                                         elements = c("uid", "title", "fulljournalname", "pubtype"))
From the image example above, in the column for PMID 25979833 the journal name splits and extends into the next column: "European journal of cancer (Oxford" sits in one column and "England : 1990)" in the next. When I did a dput(pubrecord.table), I understood that the split happens because the parts are separated by a comma ",". How can I make R understand that "European journal of cancer (Oxford, England : 1990)" belongs in the same column? I have a similar issue with the title and pubtype fields, where the long text has a comma in the middle and R breaks it according to the csv format. How can I clean the data so that it ends up in the appropriate columns?
I thought this looked like a bug in extract_from_esummary. I searched the package's issues on GitHub for "comma" and found this one, which says:
This is not really a problem with rentrez, just a property of NCBI records and R objects.
In this case, the pubtype field is variably-sized.
When you try to write the matrix, the vectors are represented as you would type them in (c(..., ...)), which adds commas that break the csv format.
In this case, you can collapse the vectors and unlist each matrix row so that they can be written out.
The issue page has code examples as well.
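In the spirit of that suggestion, a minimal sketch that flattens each cell before writing (the variable names and separator are my own choices; if your matrix is oriented the other way round, drop the t()):
pubrecord.flat <- apply(pubrecord.table, c(1, 2),
                        function(cell) paste(unlist(cell), collapse = "; "))
write.csv(t(pubrecord.flat), "pubrecords.csv")   # one row per PubMed record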
