I am trying to run a script located here. Its setup looks like this:
readlines <- function(...) {
  lapply(list(...), readline)
}
input = readlines(
"Please Input Your Census API Key (Get a Free Census Api Key Here: <https://api.census.gov/data/key_signup.html>): ",
"Enter the State(s) you would like to use separated by a comma (i.e. Oregon, Washington) or enter USA if you want to calculate SVI for all 50 states: ",
"What Year would you like to calculate? (2009-2019 Are Available): ",
"Where would you like to save these files? Please type or copy and paste a complete file path: "
)
#**Install and load required packages**
API.key = input[[1]]
States = as.list(unlist(strsplit(input[[2]], split=",")))
Year = as.integer(input[[3]])
dir.create(paste0(gsub("\\\\", "/", input[[4]]), "/Social_Vulnerability"))
setwd(paste0(gsub("\\\\", "/", input[[4]]), "/Social_Vulnerability"))
My understanding is that I should replace the values after input = readlines( with my local info, so I rewrote it as:
readlines <- function(...) {
  lapply(list(...), readline)
}
input = readlines(
"a952c5c861faf0ec64e05348e67xxxxxxxxxx", # "Please Input Your Census API Key (Get a Free Census Api Key Here: <https://api.census.gov/data/key_signup.html>): "
"USA", # "Enter the State(s) you would like to use separated by a comma (i.e. Oregon, Washington) or enter USA if you want to calculate SVI for all 50 states: "
"2009", # "What Year would you like to calculate? (2009-2019 Are Available): "
"C:/Users/FirstNameLastName/Data/Derived" #Where would you like to save these files? Please type or copy and paste a complete file path: "
)
#**Install and load required packages**
API.key = input[[1]]
States = as.list(unlist(strsplit(input[[2]], split=",")))
Year = as.integer(input[[3]])
dir.create(paste0(gsub("\\\\", "/", input[[4]]), "/Social_Vulnerability"))
setwd(paste0(gsub("\\\\", "/", input[[4]]), "/Social_Vulnerability"))
I assumed it was supposed to work by having the readlines bit pull in that info and drop it into the input[[1]] through input[[4]] spots, so that API.key = input[[1]] and so on. But what input actually holds after running the top bit is:
input
[[1]]
[1] "View(input)"
[[2]]
[1] "View(input)"
[[3]]
[1] "## R"
[[4]]
[1] "readlines <- function(...) {"
Which does not seem right at all. Any advice on where I am going wrong?
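For context on the behavior: the strings passed to readlines() are only prompts, and readline() waits for whatever arrives at the console next, so when the whole block is pasted or sourced in one go it swallows the following lines of code as its "answers", which is why input ended up holding stray lines like "View(input)". Below is a minimal sketch of one way to skip the prompts entirely and assign the values directly; it reuses the variable names the script expects (out_dir is my own placeholder) and is only an illustration, not the script author's fix.
# Sketch: bypass the interactive prompts and set the values the script needs directly
API.key <- "a952c5c861faf0ec64e05348e67xxxxxxxxxx"        # your Census API key
States  <- as.list(unlist(strsplit("USA", split = ",")))  # or e.g. "Oregon, Washington"
Year    <- 2009L
out_dir <- "C:/Users/FirstNameLastName/Data/Derived"      # already uses forward slashes, so no gsub() needed
dir.create(file.path(out_dir, "Social_Vulnerability"))
setwd(file.path(out_dir, "Social_Vulnerability"))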
#Read state of union file
speech<-readLines("stateoftheunion1790-2012.txt")
head(speech)
What does this code below do after it reads the file? I was told it will give a list where each entry is the text between consecutive ***'s, but what does that mean?
x <- grep("^\\*{3}", speech)
list.speeches <- list()
for (i in 1:length(x)) {
  if (i == 1) {
    list.speeches[[i]] <- paste(speech[1:x[1]], collapse = " ")
  } else {
    list.speeches[[i]] <- paste(speech[x[i-1]:x[i]], collapse = " ")
  }
}
It looks like you're new to SO; welcome to the community! As @Allan Cameron pointed out, whenever you ask questions, especially if you want great answers quickly, it's best to make your question reproducible. This includes sample data, like the output from dput() or reprex::reprex(). Check it out: making R reproducible questions.
I've detailed each part of the code with coding comments. Feel free to ask questions if you've got any.
speech <- readLines("https://raw.githubusercontent.com/jdwilson4/Intro-Data-Science/master/Data/stateoftheunion1790-2012.txt")
head(speech) # print the first 6 rows captured in the object speech
# [1] "The Project Gutenberg EBook of Complete State of the Union Addresses,"
# [2] "from 1790 to the Present"
# [3] ""
# [4] "Character set encoding: UTF8"
# [5] ""
# [6] "The addresses are separated by three asterisks"
x <- grep("^\\*{3}", speech)
# searches speech char vector for indices coinciding with strings of 3 asterisks ***
list.speeches <- list() # create a list to store the results
for (i in 1:length(x)) {  # for each index that coincided with three asterisks
  if (i == 1) {           # if it's the first set of asterisks ***
    # capture all vector elements up to the first set of 3 asterisks:
    # the file information and who gave each of the speeches
    list.speeches[[i]] <- paste(speech[1:x[1]], collapse = " ")
  } else {
    # capture the lines between each pair of consecutive *** indices,
    # i.e. all rows of each speech (currently separated by ***),
    # placing each complete speech in a different list position
    list.speeches[[i]] <- paste(speech[x[i-1]:x[i]], collapse = " ")
  }
}
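A quick sanity check after the loop runs, just to see what ended up in the list (my addition, not part of the original answer):
length(list.speeches)                # number of chunks created
substr(list.speeches[[2]], 1, 100)   # first 100 characters of the second chunk, roughly the first address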
I have to do topic modeling in R on pieces of text containing emojis. Using the replace_emoji() and replace_emoticon() functions lets me analyze them, but there is a problem with the results.
A red heart emoji is translated as "red heart ufef". These words are then treated separately during the analysis and compromise the results.
Terms like "heart" can have very different meanings, as can be seen by comparing "red heart ufef" and "broken heart".
The function replace_emoji_identifier() doesn't help either, as the identifiers make an analysis hard.
Dummy data set, reproducible by using dput() (including the forcing-to-lowercase step):
Emoji_struct <- c(
list(content = "🔥🔥 wow", "😮 look at that", "😤this makes me angry😤", "😍❤\ufe0f, i love it!"),
list(content = "😍😍", "😊 thanks for helping", "😢 oh no, why? 😢", "careful, challenging ❌❌❌")
)
Current coding (data_orig is a list of several files):
library(textclean)
#The rest should be standard r packages for pre-processing
#pre-processing:
data <- gsub("'", "", data)
data <- replace_contraction(data)
data <- replace_emoji(data) # replace emoji with words
data <- replace_emoticon(data) # replace emoticon with words
data <- replace_hash(data, replacement = "")
data <- replace_word_elongation(data)
data <- gsub("[[:punct:]]", " ", data) #replace punctuation with space
data <- gsub("[[:cntrl:]]", " ", data)
data <- gsub("[[:digit:]]", "", data) #remove digits
data <- gsub("^[[:space:]]+", "", data) #remove whitespace at beginning of documents
data <- gsub("[[:space:]]+$", "", data) #remove whitespace at end of documents
data <- stripWhitespace(data)
Desired output:
[1] list(content = c("fire fire wow",
"facewithopenmouth look at that",
"facewithsteamfromnose this makes me angry facewithsteamfromnose",
"smilingfacewithhearteyes redheart \ufe0f, i love it!"),
content = c("smilingfacewithhearteyes smilingfacewithhearteyes",
"smilingfacewithsmilingeyes thanks for helping",
"cryingface oh no, why? cryingface",
"careful, challenging crossmark crossmark crossmark"))
Any ideas? Lower cases would work, too.
Best regards. Stay safe. Stay healthy.
Answer
Replace the default conversion table in replace_emoji with a version where the spaces and punctuation are removed:
hash2 <- lexicon::hash_emojis
hash2$y <- gsub("[[:space:]]|[[:punct:]]", "", hash2$y)
replace_emoji(Emoji_struct[,1], emoji_dt = hash2)
Example
Single character string:
replace_emoji("wow!😮 that is cool!", emoji_dt = hash2)
#[1] "wow! facewithopenmouth that is cool!"
Character vector:
replace_emoji(c("1: 😊", "2: 😍"), emoji_dt = hash2)
#[1] "1: smilingfacewithsmilingeyes "
#[2] "2: smilingfacewithhearteyes "
List:
list("list_element_1: 🔥", "list_element_2: ❌") %>%
lapply(replace_emoji, emoji_dt = hash2)
#[[1]]
#[1] "list_element_1: fire "
#
#[[2]]
#[1] "list_element_2: crossmark "
Rationale
To convert emojis to text, replace_emoji uses lexicon::hash_emojis as a conversion table (a hash table):
head(lexicon::hash_emojis)
# x y
#1: <e2><86><95> up-down arrow
#2: <e2><86><99> down-left arrow
#3: <e2><86><a9> right arrow curving left
#4: <e2><86><aa> left arrow curving right
#5: <e2><8c><9a> watch
#6: <e2><8c><9b> hourglass done
This is an object of class data.table. We can simply modify the y column of this hash table so that all the spaces and punctuation are removed. Note that this also allows you to add new byte representations (as in the x column) with an accompanying string.
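For example, appending a custom mapping of your own might look like the sketch below (my illustration; the byte sequence shown corresponds to the unicorn face emoji and is not necessarily missing from the shipped table):
library(data.table)
hash2 <- rbind(
  hash2,
  data.table(x = "<f0><9f><a6><84>", y = "unicornface")  # example byte sequence and replacement word
)
setkey(hash2, x)  # keep the table keyed on x (assuming the original lexicon::hash_emojis is keyed this way)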
I need to extract specific parts of a large corpus of PDF documents. The PDFs are large and messy reports containing all kinds of numeric, alphabetic and other information. The files are of different lengths but have unified content and sections across them. The documents have a Table of Contents with the section names in it. For example:
Table of Contents:
Item 1. Business 1
Item 1A. Risk Factors 2
Item 1B. Unresolved Staff Comments 5
Item 2. Properties 10
Item N........
..........text I do not care about...........
Item 1A. Risk Factors
.....text I am interested in getting.......
(section ends)
Item 1B. Unresolved Staff Comments
..........text I do not care about...........
I have no problem reading them in and analyzing them as a whole but I need to pull out only the text between "Item 1A. Risk Factors" and "Item 1B. Unresolved Staff Comments".
I used the pdftools, tm, quanteda and readtext packages.
This is the part of the code I use to read in my docs. I created a directory called "PDF" where I placed my PDFs, and another directory where R will place the converted ".txt" files.
pdf_directory <- paste0(getwd(), "/PDF")
txt_directory <- paste0(getwd(), "/Texts")
Then I create a list of files using the list.files() function.
files <- list.files(pdf_directory, pattern = ".pdf", recursive = FALSE,
full.names = TRUE)
files
After that, I go on to create a function that extracts file names.
extract <- function(filename) {
  print(filename)
  try({
    text <- pdf_text(filename)
  })
  f <- gsub("(.*)/([^/]*).pdf", "\\2", filename)
  write(text, file.path(txt_directory, paste0(f, ".txt")))
}
for (file in files) {
  extract(file)
}
After this step, I get stuck and do not know how to proceed. I am not sure whether I should try to extract the section of interest while reading the data in; if so, I suppose I would have to wrestle with the chunk where I create the function, i.e. f <- gsub("(.*)/([^/]*).pdf", "\\2", filename)? I apologize for such questions, but I am teaching myself.
I also tried the following code on just one file instead of the whole corpus:
start <- grep("^\\*\\*\\* ITEM 1A. RISK FACTORS", text_df$text) + 1
stop <- grep("^ITEM 1B. UNRESOLVED STAFF COMMENTS", text_df$text) - 1
lines <- raw[start:stop]
scd <- paste0(".*",start,"(.*)","\n",stop,".*")
gsub(scd,"\\1", name_of_file)
but it did not help me in any way.
I don't really see why you would write the files into txt first, so I did it all in one go.
What threw me off a little is that your patterns have lots of extra spaces; you can match those with the regular expression \\s+ instead.
library(stringr)
files <- c("https://corporate.exxonmobil.com/-/media/Global/Files/investor-relations/investor-relations-publications-archive/ExxonMobil-2016-Form-10-K.pdf",
"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf")
relevant_l <- lapply(files, function(file) {
  # print status message
  message("processing: ", basename(file))
  lines <- unlist(stringr::str_split(pdftools::pdf_text(file), "\n"))
  start <- stringr::str_which(lines, "ITEM 1A.\\s+RISK FACTORS")
  end <- stringr::str_which(lines, "ITEM 1B.\\s+UNRESOLVED STAFF COMMENTS")
  # cover a few different outcomes depending on what was found
  if (length(start) == 1 & length(end) == 1) {
    relevant <- lines[start:end]
  } else if (length(start) == 0 | length(end) == 0) {
    relevant <- "Pattern not found"
  } else {
    relevant <- "Problems found"
  }
  return(relevant)
})
#> processing: ExxonMobil-2016-Form-10-K.pdf
#> processing: dummy.pdf
names(relevant_l) <- basename(files)
sapply(relevant_l, head)
#> $`ExxonMobil-2016-Form-10-K.pdf`
#> [1] "ITEM 1A. RISK FACTORS\r"
#> [2] "ExxonMobil’s financial and operating results are subject to a variety of risks inherent in the global oil, gas, and petrochemical\r"
#> [3] "businesses. Many of these risk factors are not within the Company’s control and could adversely affect our business, our financial\r"
#> [4] "and operating results, or our financial condition. These risk factors include:\r"
#> [5] "Supply and Demand\r"
#> [6] "The oil, gas, and petrochemical businesses are fundamentally commodity businesses. This means ExxonMobil’s operations and\r"
#>
#> $dummy.pdf
#> [1] "Pattern not found"
I would return the results as a list and then use the original file names to name the list elements. Let me know if you have questions. I use the stringr package since it's fast and consistent in dealing with strings, but str_which and grep work in pretty much the same way here.
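If a single string per document is more convenient for downstream analysis, one way to flatten the list (my addition, not part of the approach above):
risk_sections <- vapply(relevant_l, paste, character(1), collapse = " ")  # one string per file, named by file name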
So I'm diving into yet another language (R), and need to be able to look at individual items in a dataframe(?). I've tried a number of ways to access this, but so far am confused by what R wants me to do to get this out. Current code:
empStatistics <- read.csv("C:/temp/empstats.csv", header = TRUE, row.names = NULL, encoding = "UTF-8", sep = ",", dec = ".", quote = "\"", comment.char = "")
attach(empStatistics)
library(svDialogs)
Search_Item <- dlgInput("Enter a Category", "")$res
if (!length(Search_Item)) {
  cat("You didn't pick anything!?")
} else {
  Category <- empStatistics[Search_Item]
}
Employee_Name <- dlgInput("Enter a Player", "")$res
if (!length(Employee_Name)) {
  cat("No Person Selected!\n")
} else {
  cat(empStatistics[Employee_Name, Search_Item])
}
and the sample of my csv file:
Name,Age,Salary,Department
Frank,25,40000,IT
Joe,24,40000,Sales
Mary,34,56000,HR
June,39,70000,CEO
Charles,60,120000,Janitor
From the languages I'm used to, I would have expected the brackets to work, but that obviously isn't the case here. So I tried looking for other solutions, including separating each variable into its own set of brackets, trying to figure out how to use subset() (failed there; not sure it's applicable), trying to find out how to get the column and row indexes, and a few other things I'm not sure I can describe.
How can I enter values into variables, and then use that to get the individual pieces of data (ex, enter "Frank" for the name and "Age" for the search item and get back 25 or "June" for the name and "Department" for the search item to get back "CEO")?
If you would like to access it like that, you can do:
Search_Item <- "Salary"
Employee_Name <- "Frank"
empStatistics <- read.csv("empstats.csv",header = TRUE, row.names = 1)
empStatistics[Employee_Name,Search_Item]
[1] 40000
R data.frames don't have a named index by default (the first example works because row.names = 1 turns the Name column into row names). The other thing you can try is:
empStatistics <- read.csv("empstats.csv",header = TRUE)
empStatistics[match(Employee_Name,empStatistics$Name),Search_Item]
[1] 40000
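One caveat (my addition, not from the answer above): match() returns NA when the name isn't found, so in the dialog-driven version it may be worth guarding the lookup, for example:
row_idx <- match(Employee_Name, empStatistics$Name)
if (is.na(row_idx)) {
  message("No employee named ", Employee_Name, " found.")
} else {
  empStatistics[row_idx, Search_Item]
}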
I'm trying to lemmatize a corpus of documents in R with the wordnet library. This is the code:
corpus.documents <- Corpus(VectorSource(vector.documents))
corpus.documents <- tm_map(corpus.documents, removePunctuation)
library(wordnet)
lapply(corpus.documents, function(x){
  x.filter <- getTermFilter("ContainsFilter", x, TRUE)
  terms <- getIndexTerms("NOUN", 1, x.filter)
  sapply(terms, getLemma)
})
but when I run this, I get this error:
Error in .jnew(paste("com.nexagis.jawbone.filter", type, sep = "."), word, :
java.lang.NoSuchMethodError: <init>
and this is the call stack:
5 stop(structure(list(message = "java.lang.NoSuchMethodError: <init>",
call = .jnew(paste("com.nexagis.jawbone.filter", type, sep = "."),
word, ignoreCase), jobj = <S4 object of class structure("jobjRef", package
="rJava")>), .Names = c("message",
"call", "jobj"), class = c("NoSuchMethodError", "IncompatibleClassChangeError", ...
4 .jnew(paste("com.nexagis.jawbone.filter", type, sep = "."), word,
ignoreCase)
3 getTermFilter("ContainsFilter", x, TRUE)
2 FUN(X[[1L]], ...)
1 lapply(corpus.documents, function(x) {
x.filter <- getTermFilter("ContainsFilter", x, TRUE)
terms <- getIndexTerms("NOUN", 1, x.filter)
sapply(terms, getLemma) ...
what's wrong?
So this does not address your use of wordnet, but does provide an option for lemmatizing that might work for you (and is better, IMO...). This uses the MorphAdorner API developed at Northwestern University. You can find detailed documentation here. In the code below I'm using their Adorner for Plain Text API.
# MorphAdorner (Northwestern University) web service
adorn <- function(text) {
  require(httr)
  require(XML)
  url <- "http://devadorner.northwestern.edu/maserver/partofspeechtagger"
  response <- GET(url, query = list(text = text, media = "xml",
                                    xmlOutputType = "outputPlainXML",
                                    corpusConfig = "ncf", # Nineteenth Century Fiction
                                    includeInputText = "false", outputReg = "true"))
  doc <- content(response, type = "text/xml")
  words <- doc["//adornedWord"]
  xmlToDataFrame(doc, nodes = words)
}
library(tm)
vector.documents <- c("Here is some text.",
"This might possibly be some additional text, but then again, maybe not...",
"This is an abstruse grammatical construction having as it's sole intention the demonstration of MorhAdorner's capability.")
corpus.documents <- Corpus(VectorSource(vector.documents))
lapply(corpus.documents,function(x) adorn(as.character(x)))
# [[1]]
# token spelling standardSpelling lemmata partsOfSpeech
# 1 Here Here Here here av
# 2 is is is be vbz
# 3 some some some some d
# 4 text text text text n1
# 5 . . . . .
# ...
I'm just showing the lemmatization of the first "document". partsOfSpeech follows the NUPOS convention.
This answers your question, but does not really solve your problem. There is another answer above that attempts to provide an alternative approach.
There are several issues with the way you are using the wordnet package, described below, but the bottom line is that even after addressing these, I could not get wordnet to produce anything but gibberish.
First: You can't just install the wordnet package in R, you have to install Wordnet on your computer, or at least download the dictionaries. Then, before you use the package, you need to run initDict("path to wordnet dictionaries").
Second: It looks like getTermFilter(...) expects a character argument for x. The way you have it set up, you are passing an object of type PlainTextDocument. So you need to use as.character(x) to convert that to its contained text, or you get the java error in your question.
Third: It looks like getTermFilter(...) expects single words (or phrases). For instance, if you pass "This is a phrase" to getTermFilter(...) it will look up "This is a phrase" in the dictionary. It will not find it of course, so getIndexTerms(...) returns NULL and getLemma(...) fails... So you have to parse the text of your PlainTextDocument into words first.
Finally, I'm not sure it's a good idea to remove punctuation. For instance "it's" will be converted to "its" but these are different words with different meanings, and they lemmatize differently.
Rolling all this up:
library(tm)
vector.documents <- c("This is a line of text.", "This is another one.")
corpus.documents <- Corpus(VectorSource(vector.documents))
corpus.documents <- tm_map(corpus.documents, removePunctuation)
library(wordnet)
initDict("C:/Program Files (x86)/WordNet/2.1/dict")
lapply(corpus.documents, function(x){
  sapply(unlist(strsplit(as.character(x), "[[:space:]]+")), function(word) {
    x.filter <- getTermFilter("StartsWithFilter", word, TRUE)
    terms <- getIndexTerms("NOUN", 1, x.filter)
    if(!is.null(terms)) sapply(terms, getLemma)
  })
})
# [[1]]
# This is a line of text
# "thistle" "isaac" "a" "line" "off-axis reflector" "text"
As you can see, the output is still gibberish. "This" is lemmatized as "thistle" and so on. It may be that I have the dictionaries configured improperly, so you might have better luck. If you are committed to wordnet, for some reason, I suggest you contact the package authors.
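One possible contributor (my speculation, not verified against a working dictionary): "StartsWithFilter" matches any dictionary noun beginning with the word, which would explain "This" coming back as "thistle". The wordnet package also documents an "ExactMatchFilter" type, so a variant like the sketch below might restrict matches to the word itself:
# Sketch only: same loop as above, but with an exact-match filter so "This"
# cannot pick up "thistle"; assumes the setup (library(wordnet), initDict, corpus) from the block above.
lapply(corpus.documents, function(x){
  sapply(unlist(strsplit(as.character(x), "[[:space:]]+")), function(word) {
    x.filter <- getTermFilter("ExactMatchFilter", word, TRUE)
    terms <- getIndexTerms("NOUN", 1, x.filter)
    if(!is.null(terms)) sapply(terms, getLemma)
  })
})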