How do I use the which function to search my dataframe?

I have a bunch of PDFs that I would like to search through in order to quickly locate tables and graphs relevant to my research.
#I load the following libraries
library(pdfsearch)
library(tm)
library(pdftools)
#I assign the directory of my PDF files to the path where they are located
directory <- '/References'
#and then I search the directory for the keywords "table", "graph", and "chart"
txt <- keyword_directory(directory,
                         keyword = c('table', 'graph', 'chart'),
                         split_pdf = TRUE,
                         remove_hyphen = TRUE,
                         full_names = TRUE)
#Up to this point everything works fine. I get a nice data.frame called "txt"
#with 1356 observations in 7 columns. However, when I try to search the
#data.frame, I start running into trouble.
#I start with "hunter", a term that I know resides in the token_text column
txt[which(txt$token_text == 'hunter'), ]
#executing this code produces the following message
[1] ID pdf_name keyword page_num line_num line_text token_text
<0 rows> (or 0-length row.names)
Am I using the right tool to search through my data.frame? Is there an easier way to cross-reference this data? Is there a package out there somewhere that is designed to help one crawl through a mountain of PDFs? Thanks for your time!

The expression you give to which (here, txt$token_text == 'hunter') returns TRUE or FALSE for every value it is applied to (e.g. all values in a dataframe's column), and which then converts those logicals into row indices. You can subset a dataframe by supplying either the TRUE/FALSE values or the row indices for the rows you want to keep / discard.
Combining this you get:
txt[which(txt$token_text == 'hunter'), ] which you did, and got no rows returned. As was pointed out in the comments, == is an exact match, and you may have no values that are exactly 'hunter'.
To get TRUE/FALSE based on partial matches or a regular expression, use the grepl function instead:
txt[grepl("hunter", txt$token_text, ignore.case=TRUE), ]
For easier understanding I prefer doing this with the dplyr package:
library(dplyr)
txt %>% filter(grepl("hunter",token_text, ignore.case=TRUE))
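To see the difference on a toy vector (hypothetical data, purely for illustration):
x <- c("hunter", "Hunter-gatherer", "deer")
x == "hunter"                          # TRUE FALSE FALSE -- exact matches only
grepl("hunter", x, ignore.case = TRUE) # TRUE TRUE  FALSE -- substring matches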

Related

Excluding words in sentimentr

How do you drop multiple terms from the sentimentr dictionary?
For example, the words "please" and "advise" are associated with positive sentiment, but I do not want those particular words to influence my analysis.
I've figured out a way, with the following script, to exclude one word, but I need to exclude many more:
mysentiment <- lexicon::hash_sentiment_jockers_rinker[x != "please"]
mytext <- c(
  'Hello, We are looking to purchase this material for a part we will be making, but your site doesnt state that this is RoHS complaint. Is it possible that its just not listed as such online, but it actually is RoHS complaint? Please advise. '
)
sentiment_by(mytext, polarity_dt = mysentiment)
extract_sentiment_terms(mytext, polarity_dt = mysentiment)
You can subset the mysentiment data.table. Just create a vector of the words you don't want included and use it to subset.
mysentiment <- lexicon::hash_sentiment_jockers_rinker
words_to_exclude <- c("please", "advise")
mysentiment <- mysentiment[!x %in% words_to_exclude]
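For context, the lexicon hash tables are data.tables whose columns are x (the word) and y (the polarity score), which is why the bare x works inside the subset. A quick sanity check, assuming both words actually exist in the lexicon:
str(lexicon::hash_sentiment_jockers_rinker) # data.table with columns x and y
nrow(lexicon::hash_sentiment_jockers_rinker) - nrow(mysentiment) # should be 2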

Retrieve synonyms of words using wordnet for R

I'm currently working with wordnet in R (I'm using RStudio for Windows (64bit)) and created a data.frame containing synset_offset, ss_type and word from the data.x files (where x is noun, adj, etc) of the wordnet database.
A sample can be created like this:
wnet <- data.frame(
  "synset_offset" = c(02370954, 02371120, 02371337),
  "ss_type" = c("VERB", "VERB", "VERB"),
  "word" = c("fill", "depute", "substitute")
)
My issue happens when using the wordnet package to get the list of synonyms that I'd like to add as an additional column.
library(wordnet)
wnet$synonyms <- synonyms(wnet$word,wnet$ss_type)
I receive the following error.
Error in .jnew(paste("com.nexagis.jawbone.filter", type, sep = "."), word, :
java.lang.NoSuchMethodError: <init>
If I apply the function with defined values, it works.
> synonyms("fill","VERB")
[1] "fill" "fill up" "fulfil" "fulfill" "make full" "meet" "occupy" "replete" "sate" "satiate" "satisfy"
[12] "take"
Any suggestions to solve my issue are welcome.
I can't install the wordnet package on my computer for some reason, but it seems you're passing vector arguments to the synonyms function, which expects single values. You should be able to solve it with apply:
syn_list <- apply(wnet, 1, function(row) synonyms(row["word"], row["ss_type"]))
This will return the output of the synonyms function for each row of the wnet data.frame.
It's not clear what you want to do with:
wnet$synonyms <- synonyms(wnet$word,wnet$ss_type)
as for each row you will have a vector of synonyms, which won't fit in the 3 rows of your data.frame.
Maybe something like this will work for you:
wnet$synonyms <- sapply(syn_list, paste, collapse=", ")
EDIT - Here is a working solution to the problem above.
wnet$synset <- mapply(synonyms, as.character(wnet$word), as.character(wnet$ss_type))
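Putting the pieces together, a sketch based on the two approaches above, assuming the wordnet Java backend is configured:
library(wordnet)
# one character vector of synonyms per row; SIMPLIFY = FALSE is a defensive
# tweak so the result is always a list, even if every vector has the same length
syn_list <- mapply(synonyms, as.character(wnet$word), as.character(wnet$ss_type),
                   SIMPLIFY = FALSE)
# collapse each vector into a single string so it fits in one column
wnet$synonyms <- sapply(syn_list, paste, collapse = ", ")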

Dynamic variable in grepl()

This is the continuation of the following thread:
Creating Binary Identifiers Based On Condition Of Word Combinations For Filter
Expected output is the same as per the said thread.
I am now writing a function that can take dynamic names as variables.
This is the code that I am aiming at, if I am to run it manually:
df <- df %>% group_by(id, date) %>% mutate(flag1 = if(eval(parse(text=conditions))) grepl(pattern, item_name2) else FALSE)
To make it take into consideration dynamic variable names, I have been doing the code this way:
groupcolumns <- c(id, date)
# where id and date will be entered into the function as character strings by the user
variable <- list(~if(eval(parse(text=conditions))) grepl(pattern, item) else FALSE)
# converting to formula to use with dynamically generated column names
# "conditons" being the following character vector, which I can automatically generate:
conditons <- "any(grepl("Alpha", Item)) & any(grepl("Bravo", Item))"
This becomes:
df <- df %>% group_by_(.dots = groupcolumns) %>% mutate_(.dots = setNames(variable, flags[1]))
# where flags[1] is a predefined vector of columns names that I have created
flags <- paste("flag", seq(1:100), sep = "")
The problem is that I am unable to tell the grepl function to pick up "item" dynamically. If I refer to it as "df$item" and do eval(parse(text="df$item")), the intention of piping fails: I am doing a group_by_, and it results in an error (naturally). The same applies to the conditions that I set.
Does a way exists for me to tell grepl to use a dynamic variable name?
Thanks a lot (especially to akrun)!
edit 1:
I tried the following, and now there is no problem passing the name of the item into grepl:
variable <- list(~if(eval(parse(text=conditions))) grepl(pattern, as.name(item)) else FALSE)
However, piping still does not work: the output of as.name(item) is treated as an object, and no such object exists in the environment.
edit 2:
trying do() in dplyr:
variable <- list(~if(eval(parse(text=conditions))) grepl(pattern, .$deparse(as.name(item))) else FALSE)
df <- df %>% group_by_(.dots = groupcolumns) %>% do_(.dots = setNames(variable, flags[1]))
which throws me the error:
Error: object 'Item' not found
If I understand your question correctly, you want to be able to dynamically input both patterns and the object to be searched by these patterns in grepl? The best solution for you will depend entirely on how you choose to store the patterns and how you choose to store the objects to be searched. I have a few ideas that should help you though.
For dynamic patterns, try collapsing a vector of patterns with the paste function. This will allow you to search for many different patterns at once.
grepl(paste(your.pattern.list, collapse="|"), item)
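A toy illustration, reusing the Alpha/Bravo patterns from the question (hypothetical inputs):
your.pattern.list <- c("Alpha", "Bravo")
item <- c("Alpha one", "Charlie two", "Bravo three")
grepl(paste(your.pattern.list, collapse="|"), item)
# [1]  TRUE FALSE  TRUE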
Let's say you want to set up a scenario where you are storing many patterns of interest in a directory, perhaps collected automatically from a server or from some other output. If the patterns live in separate files, you can build the pattern lists like this:
#set working directory
setwd("/path/to/files/i/want")
#make a list of all files in this directory
inFilePaths = list.files(path=".", pattern=glob2rx("*"), full.names=TRUE)
#read_csv() used below comes from the readr package
library(readr)
#perform a function for each file in the list
for (inFilePath in inFilePaths)
{
  #grepl function goes here
  #if each file in the folder is a table/matrix/dataframe of patterns, try this
  inFileData = read_csv(inFilePath)
  vectorData = as.vector(inFileData$ColumnOfPatterns)
  grepl(paste(vectorData, collapse="|"), item)
}
For dynamically specifying the item, you can use an almost identical framework:
#set working directory
setwd("/path/to/files/i/want")
#make a list of all files in this directory
inFilePaths = list.files(path=".", pattern=glob2rx("*"), full.names=TRUE)
#perform a function for each file in the list
for (inFilePath in inFilePaths)
{
  #grepl function goes here
  #if each file in the folder is a table/matrix/dataframe of data to be searched, try this
  inFileData = read_csv(inFilePath)
  grepl(pattern, inFileData$ColumnToBeSearched)
}
If this is too far off from what you envisioned, please update your question with details about how the data you are using is stored.
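As an aside, in dplyr 0.7 and later the underscore verbs (group_by_, mutate_) used above are deprecated in favor of tidy evaluation, which handles dynamic column names more directly. A minimal sketch, assuming df, id, date, and item_name2 as in the question, with the column to search supplied as a string:
library(dplyr)
library(rlang)
col_to_search <- "item_name2" # user-supplied column name (hypothetical input)
pattern <- "Alpha"            # user-supplied pattern (hypothetical input)
df <- df %>%
  group_by(id, date) %>%
  mutate(flag1 = grepl(pattern, !!sym(col_to_search))) %>%
  ungroup()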

R: subset() function altered character data into strange code

I read some data into R with read.xlsx() from the openxlsx package; here's my code for reading the data:
data_all = read.xlsx(xlsxFile = paste0(path, EoLfileName), sheet = 1, detectDates = T, skipEmptyRows = F)
Now, when I access one name cell in my data, it prints the name in characters:
> data_all[1,'name']
[1] "76-ES+ADVIP-20G"
Now, let's say I want to subset out some rows based on a condition on another column:
data_sub = subset(data_all, !is.na(data_all$amount))
However, if I then print this subsetted data, I get:
> data_sub[1,'name']
[1] "A94198.10"
I've also tried subsetting using the following method:
data_sub = data_all[!is.na(data_all$amount),]
but I get the same thing: the expected output of "76-ES+ADVIP-20G" is turned into "A94198.10".
I've checked many times with mode() and str() for data_all$name and data_sub$name; both return character, so they are in the correct format.
Here's a link to sample data to play with:
https://drive.google.com/file/d/0BwIbultIWxeVY1VtdDU5NFp1Tkk/view?usp=sharing
Please, please help me! I am quite stuck, and I don't see other posts with a similar problem.
Why is this happening? Subsetting shouldn't change data formatting, correct?
Thank you in advance for your help!
Additional note (if it's helpful):
When I tried to debug, I noticed that when viewing data_all in RStudio, if I copy and paste the name "76-ES+ADVIP-20G" into the filter bar, it cannot find it; I have to type "76-ES", and as soon as I type the next character, "+", the RStudio data view filter says "no matching records found".
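One hedged diagnostic, not from the original thread: because subset() removes rows, row 1 of data_sub is generally a different record than row 1 of data_all, so it is worth comparing the same underlying row directly:
# index in data_all of the first row that survives the subset
first_kept <- which(!is.na(data_all$amount))[1]
data_all[first_kept, 'name'] # compare this with data_sub[1, 'name']
# and inspect the raw bytes in case invisible characters are confusing RStudio's filter
charToRaw(data_all[1, 'name'])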

How to force R to read in numerical order?

I have several files in one folder that I want to rename. I noticed that R reads them in alphabetical order, so I used the mixedsort command and it worked, but when I checked the results I found that the files were read in a different order, not numerically. The files are named Daily_NPP1.bin up to Daily_NPP365.bin.
a <- list.files("C:\\New folder (6)", "*.bin", full.names = TRUE)
k <- mixedsort(a) #### mixedsort() comes from the gtools package
b <- sprintf("C:carbonflux\\Daily_Rh%d.bin", seq(k))
file.rename(a, b)
How do I force R to read in numerical order?
If renaming is all you want to do, you could just do the following, regardless of sorting.
b <- sub("^.*?([0-9]+)\\.bin$", "C:\\\\carbonflux\\\\Daily_Rh\\1.bin", a)
file.rename(a, b)
The first argument to sub extracts the numbers at the end of the file names, and the second pastes them into the new file-name template (at the position of \\1). All the \\\\ are needed to escape the backslashes properly.
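For example, on a single toy path (hypothetical input):
a <- "C:\\New folder (6)\\Daily_NPP12.bin"
sub("^.*?([0-9]+)\\.bin$", "C:\\\\carbonflux\\\\Daily_Rh\\1.bin", a)
# [1] "C:\\carbonflux\\Daily_Rh12.bin"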
Here is a way to order the vector without renaming the files:
# Replication of data:
a <- sort(paste0("Daily_NPP",1:365,".bin"))
# Extract numbers and order:
a <- a[order(as.numeric(gsub("[^0-9]","",a)))]
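A quick before/after check on the replicated data (the alphabetical output shown assumes a C-locale sort):
a <- sort(paste0("Daily_NPP", 1:365, ".bin"))
head(a, 3) # "Daily_NPP1.bin" "Daily_NPP10.bin" "Daily_NPP100.bin" -- alphabetical
a <- a[order(as.numeric(gsub("[^0-9]", "", a)))]
head(a, 3) # "Daily_NPP1.bin" "Daily_NPP2.bin"  "Daily_NPP3.bin"   -- numeric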
