Creating columns from a scraped PDF with cuts on spaces - r

I'm trying to create a data frame from the following PDF:
library(tabulizer)
url <- "https://doccs.ny.gov/system/files/documents/2020/06/doccs-covid-19-confirmed-by-facility-6.30.2020.pdf"
tab1 <- extract_tables(url)
However, when I call tab1 it only has one column:
[,1]
[1,] "NYS DOCCS INCARCERATED INDIVIDUALS COVID-19 REPORT BY REPORTED FACILITY"
[2,] "AS OF JUNE 29, 2020 AT 3:00 PM"
[3,] "POSITIVE CASE STATUS OTHER TESTS"
[4,] "TOTAL"
[5,] "FACILITY RECOVERED DECEASED POSITIVE PENDING NEGATIVE"
[6,] "TOTAL 495 16 519 97 805"
[7,] "ADIRONDACK 0 0 0 75 0"
[8,] "ALBION 0 0 0 0 2"
[9,] "ALTONA 0 0 0 0 1"
I would like to extract what should be the individual columns into a data frame. For example, for row 7 I would extract its contents into the following columns: Facility ("Adirondack"), Recovered (0), Deceased (0), Positive (0), Pending (75), Negative (0). I'm thinking that the most efficient way to do this would be to make cuts in tab1 based on spaces, but this doesn't work, since some of the facility names contain multiple words, so a cut on spaces would split them incorrectly. Does anyone have an idea for a solution? Thanks for the help!

Here is how I would handle this using the "lattice" method of table extraction from the tabulizer package.
#install.packages("tidyverse")
library(tidyverse)
#install.packages("janitor")
library(janitor)
#install.packages("tabulizer")
library(tabulizer)
url <- "https://doccs.ny.gov/system/files/documents/2020/06/doccs-covid-19-confirmed-by-facility-6.30.2020.pdf"
tab1 <- tabulizer::extract_tables(url, method = "lattice") %>%
  as.data.frame() %>%                      # list of one matrix -> data frame
  dplyr::slice(-1, -2) %>%                 # drop the two title rows
  janitor::row_to_names(row_number = 1)    # promote the header row to column names
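After row_to_names() every column is still character. A short follow-up sketch (assuming the extraction succeeded and the header row came through as FACILITY, RECOVERED, and so on) converts the count columns to numeric:
tab1 <- tab1 %>%
  dplyr::mutate(dplyr::across(-1, as.numeric))  # every column except the facility name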

Here is a workaround:
library(tabulizer)
url <- "https://doccs.ny.gov/system/files/documents/2020/06/doccs-covid-19-confirmed-by-facility-6.30.2020.pdf"
tab1 <- extract_tables(url)
# keep only the data rows (row 6 onwards)
plouf <- tab1[[1]][6:dim(tab1[[1]])[1],]
# glue multi-word facility names together with "_" so spaces only separate columns
plouf <- gsub("([A-Z]+) ([A-Z]+)","\\1_\\2",plouf)
# collapse the rows into one text block and parse it with spaces as separators
df <- read.table(text = paste0(t(plouf) ,collapse = "\n\r"),sep = " ")
# take the column names from the header row (row 5)
names(df) <- strsplit(tab1[[1]][5,]," ")[[1]]
FACILITY RECOVERED DECEASED POSITIVE PENDING NEGATIVE
1 TOTAL 495 16 519 97 805
2 ADIRONDACK 0 0 0 75 0
3 ALBION 0 0 0 0 2
4 ALTONA 0 0 0 0 1
5 ATTICA 2 0 2 1 7
6 AUBURN 0 0 0 0 10
7 BARE_HILL 0 0 0 0 6
8 BEDFORD_HILLS 43 1 44 5 53
9 CAPE_VINCENT 0 0 0 0 0
10 CAYUGA 0 0 0 2 1
11 CLINTON 1 0 1 0 25
12 COLLINS 1 0 1 0 13
13 COXSACKIE 1 0 1 0 57
14 DOWNSTATE 1 0 1 0 12
15 EASTERN 17 1 20 0 17
16 EDGECOMBE 0 0 0 0 0
17 ELMIRA 0 0 0 1 20
18 FISHKILL 78 5 83 4 98
19 FIVE_POINTS 0 0 0 0 4
20 FRANKLIN 1 0 1 0 24
I take the table after the title, then remove the spaces inside the facility names with gsub (I actually replace them with _, so you can change them back to spaces afterwards if you want; you could also use str_replace from stringr instead of gsub).
I then use read.table, forcing an end of line after each row. I assign the column names afterwards, because otherwise they would also get changed by the gsub and read.table would not parse them properly.
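If you prefer the original spellings, a one-line follow-up (run after the code above) undoes the underscore substitution in the first column:
# put the spaces back in the multi-word facility names
df$FACILITY <- gsub("_", " ", df$FACILITY)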

Related

R matrix - how to convert data from data frame columns into 1/0 matrix

I'm preparing a master's degree project and am stuck on basic data manipulation. I'm importing several datasets from the Prestashop database into R; one of them is a data frame with cart IDs and the products included in each cart (see below).
What I want to do is create a matrix that reflects the same data in the simplest possible form; here's a draft of the desired look:
Any hints on how the code should look? Thank you in advance for any help!
EDIT:
Code sample (dataframe):
x <- data.frame(order_id   = c("12", "13", "13", "13", "14", "14", "15", "16"),
                product_id = c("123", "123", "378", "367", "832", "900", NA, "378"))
SOLUTION:
xtabs is good, but when it comes to NA values it skips those rows in the result. There's an option to force addNA = TRUE, but that adds an NA 'column' and counts each NA as 1 (see below).
y <- xtabs(formula = ~., data = x)
Output - example 1 (addNA=FALSE):
product_id
order_id 123 367 378 832 900
12 1 0 0 0 0
13 1 1 1 0 0
14 0 0 0 1 1
16 0 0 1 0 0
Output - example 2 (addNA=TRUE):
product_id
order_id 123 367 378 832 900 <NA>
12 1 0 0 0 0 0
13 1 1 1 0 0 0
14 0 0 0 1 1 0
15 0 0 0 0 0 1
16 0 0 1 0 0 0
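Note that xtabs counts occurrences rather than setting flags, so if a cart ever contained the same product twice the cell would read 2. If a strict 1/0 matrix is needed, the counts in y above can be clamped; a small sketch:
y01 <- 1 * (y > 0)  # any positive count becomes 1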
The igraph approach seems to be more accurate.
You are looking to create an adjacency matrix from a bipartite network for which you have the edge list. You can use the igraph package directly to build the adjacency matrix from that edge list and then simplify it.
From x:
order_id product_id
1 12 123
2 13 123
3 13 378
4 13 367
5 14 832
6 14 900
7 15 <NA>
8 16 378
graph_from_dataframe <- igraph::graph.data.frame(x)
adjacency_matrix <- igraph::get.adjacency(graph_from_dataframe, sparse = FALSE)
# keep only order rows and product columns, removing the redundant entries
adjacency_matrix <- adjacency_matrix[rownames(adjacency_matrix) %in% x$order_id,
                                     colnames(adjacency_matrix) %in% x$product_id]
123 378 367 832 900
12 1 0 0 0 0
13 1 1 1 0 0
14 0 0 0 1 1
15 0 0 0 0 0
16 0 1 0 0 0
More resources on this SO question and this RPubs blog post.
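As a side note, graph.data.frame() and get.adjacency() are the older igraph names; a sketch with the current equivalents (assuming a reasonably recent igraph) looks like this. Dropping the NA row first avoids creating an NA vertex, at the cost of also dropping order 15 from the matrix, unlike the output above:
library(igraph)
x_clean <- x[!is.na(x$product_id), ]          # remove rows with a missing product
g <- graph_from_data_frame(x_clean)
adj <- as_adjacency_matrix(g, sparse = FALSE)
adj <- adj[rownames(adj) %in% x_clean$order_id,
           colnames(adj) %in% x_clean$product_id]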

R inspect() function, from tm package, only returns 10 outputs when using dictionary terms

I have 70 PDFs of scientific papers that I'm trying to narrow down by looking for specific terms within them, using the dictionary function of inspect(), which is part of the tm package. My PDFs are stored in a VCorpus object. Here's an example of what my code looks like using the crude dataset and common terms that would show up in (probably) every example paper in crude:
library(tm)
output.matrix <- inspect(DocumentTermMatrix(crude,
                                            list(dictionary = c("i", "and",
                                                                "all", "of",
                                                                "the", "if",
                                                                "i'm", "looking",
                                                                "for", "but", "because", "has",
                                                                "it", "was"))))
output <- data.frame(output.matrix)
output <- data.frame(output.matrix)
This search only ever returns 10 papers into output.matrix. The outcome given is:
Docs all and because but for has i i'm the was
144 0 9 0 5 5 2 0 0 17 1
236 0 7 4 2 4 5 0 0 15 7
237 1 11 1 3 3 2 0 0 30 2
246 0 9 0 0 6 1 0 0 18 2
248 1 6 1 1 2 0 0 0 27 4
273 0 5 2 2 4 1 0 0 21 1
368 0 1 0 1 0 0 0 0 11 2
489 0 5 0 0 4 0 0 0 8 0
502 0 6 0 1 5 0 0 0 13 0
704 0 5 1 0 3 2 0 0 21 0
For my actual dataset of 70 papers, I know there should be more than 10: as I add more PDFs to my VCorpus that I know contain at least one of my search terms, I still only get 10 rows in the output. I want the result to be a list, like the one shown, that gives every paper from the VCorpus containing a term, not just what I assume is the first 10.
Using R version 4.0.2, macOS High Sierra 10.13.6
You are misinterpreting what inspect does. For a document term matrix it shows only the first 10 rows and columns. inspect should only be used to check whether your corpus or document term matrix looks as you expect, never for transforming data into a data.frame. If you want the contents of the document term matrix in a data.frame, the following piece of code does this, using your example code and removing all rows and columns that have no value for any document or term.
# do not use inspect as this will give a wrong result!
output.matrix <- DocumentTermMatrix(crude,
                                    list(dictionary = c("i", "and",
                                                        "all", "of",
                                                        "the", "if",
                                                        "i'm", "looking",
                                                        "for", "but", "because", "has",
                                                        "it", "was")))
# remove rows and columns that are 0 staying inside a sparse matrix for speed
out <- output.matrix[slam::row_sums(output.matrix) > 0,
slam::col_sums(output.matrix) > 0]
# transform to data.frame
out_df <- data.frame(docs = row.names(out), as.matrix(out), row.names = NULL)
out_df
docs all and because but for. has the was
1 127 0 1 0 0 2 0 5 1
2 144 0 9 0 5 5 2 17 1
3 191 0 0 0 0 2 0 4 0
4 194 1 1 0 0 2 0 4 1
5 211 0 2 0 0 2 0 8 0
6 236 0 7 4 2 4 5 15 7
7 237 1 11 1 3 3 2 30 2
8 242 0 3 0 1 1 1 6 1
9 246 0 9 0 0 6 1 18 2
10 248 1 6 1 1 2 0 27 4
11 273 0 5 2 2 4 1 21 1
12 349 0 2 0 0 0 0 5 0
13 352 0 3 0 0 0 0 7 1
14 353 0 1 0 0 2 1 4 3
15 368 0 1 0 1 0 0 11 2
16 489 0 5 0 0 4 0 8 0
17 502 0 6 0 1 5 0 13 0
18 543 0 0 0 0 3 0 5 1
19 704 0 5 1 0 3 2 21 0
20 708 0 0 0 0 0 0 0 1
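A quick sanity check that nothing was dropped: dim() reports the full size of the document term matrix regardless of what inspect() prints, so it can be compared with the filtered data.frame:
dim(output.matrix)  # all documents x all dictionary terms
nrow(out_df)        # only the documents that matched at least one term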

How do I create a new column in R that is 1 if a certain value in another column is an outlier?

I want to create a new column that is 1 if the value of a particular column is an outlier. Otherwise, the value should be 0.
An example would be the following:
outlier <- c(rnorm(10,0,5),40,-60,rnorm(10,0,5))
V1
1 -6.273411
2 -6.576979
3 9.256693
4 -2.448468
5 -7.386433
6 -8.922403
7 -1.339524
8 -2.136594
9 -2.271990
10 -6.066499
11 40.000000
12 -60.000000
13 6.697281
14 -3.212984
15 6.950176
16 -7.054237
17 11.820208
18 -1.836457
19 -1.341675
20 -3.271044
21 -10.260103
22 8.239565
So observations 11 and 12 are clearly outliers:
boxplot.stats(outlier)$out
[1] 40 -60
What I want to achieve is the following:
V1 V2
1 -6.273411 0
2 -6.576979 0
3 9.256693 0
4 -2.448468 0
5 -7.386433 0
6 -8.922403 0
7 -1.339524 0
8 -2.136594 0
9 -2.271990 0
10 -6.066499 0
11 40.000000 1
12 -60.000000 1
13 6.697281 0
14 -3.212984 0
15 6.950176 0
16 -7.054237 0
17 11.820208 0
18 -1.836457 0
19 -1.341675 0
20 -3.271044 0
21 -10.260103 0
22 8.239565 0
Is there any elegant way to do this?
Thanks!
Keep in mind there is no universal, agreed-upon definition of an "outlier" that covers all cases. By default, boxplot flags values more than 1.5 times the inter-quartile range away from the .25 and .75 quartiles. You can write your own function, which gives you complete control over the definition. For example:
is_outlier <- function(x) {
  iqr <- IQR(x)                    # inter-quartile range
  q <- quantile(x, c(.25, .75))    # lower and upper quartiles
  x < q[1] - 1.5*iqr | x > q[2] + 1.5*iqr
}
You can use it with your data like
is_outlier(outlier)
which returns TRUE/FALSE values. You can convert these to 1/0 with as.numeric(is_outlier(outlier)), or with is_outlier(outlier) + 0 if that's really needed.
We can use %in% to get a logical vector and coerce it to binary with as.integer or unary +:
+(outlier %in% boxplot.stats(outlier)$out)
#[1] 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
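To get exactly the two-column shape shown in the question, either approach can be wrapped into a data frame; a minimal sketch:
df <- data.frame(V1 = outlier)
df$V2 <- +(df$V1 %in% boxplot.stats(df$V1)$out)  # 1 for outliers, 0 otherwise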

Get the number of character vector elements in a corpus

My goal is to use R for lexicon-based sentiment analysis!
I have two character vectors: one with positive words and one with negative words.
e.g.
pos <- c("good", "accomplished", "won", "happy")
neg <- c("bad", "loss", "damaged", "sued", "disaster")
I now have a corpus of thousands of news articles, and I want to know for each article how many elements of my vectors pos and neg appear in it.
e.g. (I'm not sure how the Corpus function works here, but you get the idea: there are two articles in my corpus)
mycorpus <- Corpus("The CEO is happy that they finally won the case.", "The disaster caused a huge loss.")
I want to get something like this:
article 1: 2 elements of pos and 0 elements of neg
article 2: 0 elements of pos, 2 elements of neg
Another good thing would be if I could get the following for each article:
(number of pos words - number of neg words)/(number of total words in article)
Thank you very much!
EDIT:
@Victorp: this doesn't seem to work.
The matrix I get looks good:
mytdm[1:6,1:10]
Docs
Terms 1 2 3 4 5 6 7 8 9 10
aaron 0 0 0 0 0 1 0 0 0 0
abandon 1 1 0 0 0 0 0 0 0 0
abandoned 0 0 0 3 0 0 0 0 0 0
abbey 0 0 0 0 0 0 0 0 0 0
abbott 0 0 0 0 0 0 0 0 0 0
abbotts 0 0 1 0 0 0 0 0 0 0
But when I run your command I get zero for every document!
colSums(mytdm[rownames(mytdm) %in% pos, ])
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Why is that?
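One way to diagnose this is to check whether any row names of mytdm literally match the words in pos; if the corpus was stemmed or tokenized differently, the overlap can be empty even though the words occur in the text. A hedged diagnostic sketch:
intersect(rownames(mytdm), pos)  # character(0) would explain the zeros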
Hello, you can use a TermDocumentMatrix for that:
mycorpus <- Corpus(VectorSource(c("The CEO is happy that they finally won the case.", "The disaster caused a huge loss.")))
mytdm <- TermDocumentMatrix(mycorpus, control=list(removePunctuation=TRUE))
mytdm <- as.matrix(mytdm)
# Positive words
colSums(mytdm[rownames(mytdm) %in% pos, ])
1 2
2 0
# Negative words
colSums(mytdm[rownames(mytdm) %in% neg, ])
1 2
0 2
# Total number of words per documents
colSums(mytdm)
1 2
9 5
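The per-article score the question asks for, (pos - neg) / total, then follows from the same colSums; a sketch building on the code above (drop = FALSE guards against a single matching row collapsing to a vector):
pos_counts <- colSums(mytdm[rownames(mytdm) %in% pos, , drop = FALSE])
neg_counts <- colSums(mytdm[rownames(mytdm) %in% neg, , drop = FALSE])
(pos_counts - neg_counts) / colSums(mytdm)  # score per document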
Here's another approach:
## pos <- c("good", "accomplished", "won", "happy")
## neg <- c("bad", "loss", "damaged", "sued", "disaster")
##
## mycorpus <- Corpus(VectorSource(
## list("The CEO is happy that they finally won the case.",
## "The disaster caused a huge loss.")))
library(qdap)
with(tm_corpus2df(mycorpus), termco(text, docs, list(pos=pos, neg=neg)))
## docs word.count pos neg
## 1 1 10 2(20.00%) 0
## 2 2 6 0 2(33.33%)

Operation on dataframe factor columns

I don't want to perform the operation in a loop. My data looks like this:
dfU[4:7]
vNeg neg pos vPos
1 0 35 28 0
2 0 42 26 0
3 0 77 59 0
4 0 14 24 0
5 0 35 45 0
6 0 17 12 0
7 0 31 23 0
8 0 64 52 1
9 0 15 17 0
10 0 21 29 0
When I perform the operation like this I get a wrong result, which may be just because of the conversion. I also tried with() and transform(), but I get the error "not meaningful for factors":
b<-as.numeric(((as.numeric(dfU[,4])*-5)+(as.numeric(dfU[,5])*-2)+(as.numeric(dfU[,6])*2)+(as.numeric(dfU[,7])*5)))
b
[1] -14 -32 -16 18 8 -8 -18 -7 6 14 24 -9 0
The wrong result may be just because of this, when I convert the integer-coded factor to numeric:
typeof(dfU[,4])
[1] "integer"
as.numeric(dfU[,4])
[1] 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1
k<-transform(dfU, (vNeg*(-5))+(neg*(-2))+(pos*2)+(vPos*5))
not meaningful for factors
I want the 8th column of the data frame to be this score, and I want to avoid a loop. Is there a better way to perform operations on columns? Any help in this direction is appreciated, thanks.
The best would be to avoid having the 4th column as a factor in the first place, if that is not what you want.
Still, a workaround is to use as.numeric(as.character()). Assuming "a" is your 4th column, your situation is this:
> a <- as.factor(c(rep(0,7),1,rep(0,2)))
> a
[1] 0 0 0 0 0 0 0 1 0 0
Levels: 0 1
> as.numeric(a)
[1] 1 1 1 1 1 1 1 2 1 1
And the workaround does:
> as.numeric(as.character(a))
[1] 0 0 0 0 0 0 0 1 0 0
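Applied to the question's data, the whole score column can then be built vectorised, with no loop; a sketch assuming columns 4:7 are the factor columns shown:
# convert the factor columns to the numbers they display, not their level codes
dfU[4:7] <- lapply(dfU[4:7], function(col) as.numeric(as.character(col)))
# weighted score computed on whole columns at once
dfU$score <- dfU$vNeg * (-5) + dfU$neg * (-2) + dfU$pos * 2 + dfU$vPos * 5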
