How to specify a text column when reading a CSV file? - r

I read a csv file into a data frame df. Here is the relevant part of str(df):
$ an_id : int 4840 41981 40482 37473 33278 29083 30940 29374 24023 23922 ...
The an_id column is read as an integer, and the following converts it to character (chr):
df$an_id <- paste0("doc_", df$an_id)
However, when I execute this command I receive an error:
toks <- corpus(df, docid_field = "an_id") %>%
  tokens()

Error in corpus.data.frame(df, docid_field = "an_id") :
  column name text not found
Is there a different way to read the file, or a way to tell corpus() which column holds the text?
If I save the data below to a csv file, read the file back in, and run the command, it works properly:
dtext <- data.frame(
  id = c(1, 2, 3, 4),
  text = c(
    "here",
    "This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. This document outlines how the dataset was gathered, and how to use the files provided.",
    "The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). We also include an additional 50,000 unlabeled documents for unsupervised learning.",
    "There are two top-level directories [train/, test/] corresponding to the training and test sets. Each contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt] where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [test/pos/200_8.txt] is the text for a positive-labeled test set example with unique id 200 and star rating 8/10 from IMDb. The [train/unsup/] directory has 0 for all ratings because the ratings are omitted for this portion of the dataset."
  ),
  stringsAsFactors = FALSE
)

As mentioned in the comments by @Nathalie, the following does the trick if the data is in a data.frame: docid_field refers to the document id column, and text_field should refer to the column that contains the text.
toks <- corpus(df,
               docid_field = "an_id",
               text_field = "text") %>%
  tokens()
str(toks)
List of 4
$ doc_1: chr "here"
$ doc_2: chr [1:39] "This" "dataset" "contains" "movie" ...
$ doc_3: chr [1:36] "The" "core" "dataset" "contains" ...
$ doc_4: chr [1:105] "There" "are" "two" "top-level" ...
- attr(*, "types")= chr [1:102] "here" "This" "dataset" "contains" ...
- attr(*, "padding")= logi FALSE
- attr(*, "class")= chr "tokens"
- attr(*, "what")= chr "word"
- attr(*, "ngrams")= int 1
- attr(*, "skip")= int 0
- attr(*, "concatenator")= chr "_"
- attr(*, "docvars")='data.frame': 4 obs. of 0 variables
The data used:
df <- structure(list(an_id = c("doc_1", "doc_2", "doc_3", "doc_4"),
text = c("here", "This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. This document outlines how the dataset was gathered, and how to use the files provided.",
"The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). We also include an additional 50,000 unlabeled documents for unsupervised learning.",
"There are two top-level directories [train/, test/] corresponding to the training and test sets. Each contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt] where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [test/pos/200_8.txt] is the text for a positive-labeled test set example with unique id 200 and star rating 8/10 from IMDb. The [train/unsup/] directory has 0 for all ratings because the ratings are omitted for this portion of the dataset."
)), row.names = c(NA, -4L), class = "data.frame")

Related

Google Places API and R -- calling 2nd column in a data frame returns six separate columns

I'm trying to store the results of a data frame I retrieved from a list via the Google Places API. My call to the API...
library(googleway)
HAVE_PLACES <- google_places(search_string = "grocery store",
                             location = c(35.4168, -80.5883),
                             radius = 10000, key = key)
...returns a list object HAVE_PLACES:
The third object in this list - results - is a data frame with one observation for each location retrieved in the API call. When I call View(HAVE_PLACES$results), I get what looks like a set of vectors - as I expect when looking at a data frame...
...But it looks like the data frame includes data frames:
WHAT IS GOING ON HERE?
More specifically:
How can a data frame contain data frames, and why does View() show the nested data frames as it would vectors?
When working with data of this type, where you want the columns you're seeing in View() to simply be vectors - for manipulation and exporting purposes - are there any best practices? I'm about to convert each vector of this alleged data frame called geometry into separate objects, and cbind() the results to the HAVE_PLACES$results. But this feels insane.
Akrun is right (as usual!). A data.frame can have lists as 'columns'. This is normal behaviour.
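A tiny self-contained demonstration (with made-up coordinates) shows how a data.frame can hold another data.frame as a column:
# Build a data.frame, then assign another data.frame as one of its columns
inner <- data.frame(lat = c(35.4, 35.6), lng = c(-80.5, -80.6))
outer <- data.frame(name = c("a", "b"))
outer$geometry <- inner
str(outer)  # 'geometry' appears as a nested data.frame, just like in results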
Your question is really a more general one about how to extract nested list data in R, using Google's API response as an example. Given you're using googleway (I'm the author of the package), I'm answering it in the context of Google's response. However, there are numerous other answers and examples online about how to work with lists in R.
Explanation
You're seeing the nested lists in your results because the data returned from Google's API is actually JSON. The google_places() function 'simplifies' this to a data.frame using jsonlite::fromJSON() internally.
If you set simplify = FALSE in the function call, you can see the raw JSON output:
library(googleway)
set_key("GOOGLE_API_KEY")
HAVE_PLACES_JSON <- google_places(search_string = "grocery store",
                                  location = c(35.4168, -80.5883),
                                  radius = 10000,
                                  simplify = FALSE)
## run this to view the JSON.
jsonlite::prettify(paste0(HAVE_PLACES_JSON))
You'll see the JSON can contain many nested objects. When converted to an R data.frame, these nested objects are returned as list columns.
If you're not familiar with JSON it may be worth a bit of research to see what it's all about.
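As an aside, jsonlite itself has a helper for the common case of nested data.frame columns; a sketch (it flattens nested data frames such as geometry, though genuine list columns such as types remain lists):
# jsonlite::flatten() promotes nested data.frame columns to top-level
# columns with compound names, e.g. geometry.location.lat
flat <- jsonlite::flatten(HAVE_PLACES$results)
str(flat, max.level = 1)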
Extracting Data
I've written some functions to extract useful pieces of information from the API responses, which may be of help here:
locations <- place_location(HAVE_PLACES)
head(locations)
# lat lng
# 1 35.38690 -80.55993
# 2 35.42111 -80.57277
# 3 35.37006 -80.66360
# 4 35.39793 -80.60813
# 5 35.44328 -80.62367
# 6 35.37034 -80.54748
placenames <- place_name(HAVE_PLACES)
head(placenames)
# "Food Lion" "Food Lion" "Food Lion" "Food Lion" "Food Lion" "Food Lion"
However, note that you will still get some list objects returned because, in this case, a 'location' can have many 'types':
placetypes <- place_type(HAVE_PLACES)
str(placetypes)
# List of 20
# $ : chr [1:5] "grocery_or_supermarket" "store" "food" "point_of_interest" ...
# $ : chr [1:5] "grocery_or_supermarket" "store" "food" "point_of_interest" ...
# $ : chr [1:5] "grocery_or_supermarket" "store" "food" "point_of_interest" ...
# $ : chr [1:5] "grocery_or_supermarket" "store" "food" "point_of_interest" ...
Summary
With Google's API responses you will have to extract the specific data elements you want and construct them into your required object:
df <- cbind(
place_name(HAVE_PLACES)
, place_location(HAVE_PLACES)
, place_type(HAVE_PLACES)[[1]] ## only selecting the 1st 'type'
)
head(df)
# place_name(HAVE_PLACES) lat lng place_type(HAVE_PLACES)[[1]]
# 1 Food Lion 35.38690 -80.55993 grocery_or_supermarket
# 2 Food Lion 35.42111 -80.57277 store
# 3 Food Lion 35.37006 -80.66360 food
# 4 Food Lion 35.39793 -80.60813 point_of_interest
# 5 Food Lion 35.44328 -80.62367 establishment
# 6 Food Lion 35.37034 -80.54748 grocery_or_supermarket
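Note that in the output above, cbind() has recycled the five types of the first place down the rows. If instead you want each row to carry that place's own first type, one sketch using the same accessors:
# Take the first type of every place so rows align one-to-one with places
first_type <- sapply(place_type(HAVE_PLACES), `[`, 1)
df <- cbind(place_name(HAVE_PLACES),
            place_location(HAVE_PLACES),
            first_type)
head(df)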

Extract metadata with R

Good day
I am a newbie to Stackoverflow:)
I am trying my hand with programming with R and found this platform a great source of help.
I have developed some code by leveraging Stack Overflow, but now I am failing to read the metadata from this htm file. Please download the file directly before using it in R.
setwd("~/NLP")
library(tm)
library(rvest)
library(tm.plugin.factiva)
file <-read_html("facts.htm")
source <- FactivaSource(file)
corpus <- Corpus(source, readerControl = list(language = NA))
# See the contents of the documents
inspect(corpus)
head(corpus)
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 3
# See the metadata associated with the third article
meta(corpus[[3]])
author : character(0)
datetimestamp: 2017-08-31
description : character(0)
heading : Rain, Rain, Rain
id : TIMEUK-170830-e
language : en
origin : thetimes.co.uk
edition : character(0)
section : Comment
subject : c("Hurricanes/Typhoons", "Storms", "Political/General News", "Disasters/Accidents", "Natural Disasters/Catastrophes", "Risk News", "Weather")
coverage : c("United States", "North America")
company : character(0)
industry : character(0)
infocode : character(0)
infodesc : character(0)
wordcount : 333
publisher : News UK & Ireland Limited
rights : © Times Newspapers Limited 2017
How can I save the metadata elements (SE, HD, AU, ... PUB) - all 18 of them - column-wise in a data frame, or write them to Excel, for each document in the corpus?
Example of output:
SE HD AU ...
Doc 1
2
3
Thank you for your help
The simplest way I know of to do it is:
Make a data frame from each of the three lists in your corpus:
one<-data.frame(unlist(meta(corpus[[1]])))
two<-data.frame(unlist(meta(corpus[[2]])))
three<-data.frame(unlist(meta(corpus[[3]])))
Then you will want to merge them into a single data frame. For the first two this is easy: merging by "row.names" joins on the (non-variable) row names. For the second merge, however, you need to merge on the column now named "Row.names", so you must first turn the third data frame's row names into a proper first column. Using setDT() (from data.table) does this without copying the data; it simply makes R treat the row names as the first column:
library(data.table)  # provides setDT()
setDT(three, keep.rownames = TRUE)[]
colnames(three)[1] <- "Row.names"
Then simply merge the first and second data frames into a variable named meta, and merge meta with three by "Row.names" (the new name of the first column):
meta <- merge(one, two, by="row.names", all=TRUE)
meta <- merge(meta, three, by = "Row.names", all=TRUE)
Your data will look like this:
Row.names unlist.meta.corpus..1.... unlist.meta.corpus..2.... unlist.meta.corpus..3....
1 author Jenni Russell <NA> <NA>
2 coverage1 United States North Korea United States
3 coverage2 North America United States North America
4 coverage3 <NA> Japan <NA>
5 coverage4 <NA> Pyongyang <NA>
6 coverage5 <NA> Asia Pacific <NA>
Those NA values are there because not all of the sub-lists had values for all of the observations.
By using the all=TRUE on both merges, you preserve all of the fields, with and without data, which makes it easy to work with moving forward.
If you look at this PDF from CRAN, the Details section on page two shows you how to access the content and metadata. From there it is simply a matter of unlisting to move them into data frames.
If you get lost, send a comment and I will do what I can to help you out!
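For what it's worth, the per-document steps above can also be sketched in one pass (assuming you are happy to collect whatever fields each document has; documents missing a field get NA, just as in the merges above):
# One column per document, one row per metadata field
meta_list <- lapply(seq_along(corpus), function(i) unlist(meta(corpus[[i]])))
fields <- unique(unlist(lapply(meta_list, names)))
meta_df <- data.frame(Row.names = fields,
                      sapply(meta_list, function(m) m[fields]))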
EDIT BY REQUEST:
To write this to Excel is not super difficult, because the data is already "square" in a uniform data frame. You would just install the xlsx and xlsxjars packages and then use the following function:
library(xlsx)
write.xlsx(meta, file = "meta.xlsx",  # "meta.xlsx" is an illustrative path
           sheetName = "Sheet1",
           col.names = TRUE, row.names = TRUE, append = FALSE, showNA = TRUE)
You can find more information in the xlsx package documentation on CRAN; page 38 gives more detail.
And if you want to save the content rather than the metadata, change meta() to content() in the lines that extract the data from the corpus and build the initial data frames. The entire process is otherwise the same.

How to save a data frame in R

According to the answer to this question, you can save a data frame "foo" in R with the save() function as follows:
save(foo,file="data.Rda")
Here is data frame "df":
> str(df)
'data.frame': 1254 obs. of 2 variables
$ text : chr "RT #SchmittySays: I love this 1st grade #science teacher from #Duluth http://t.co/HWDYFnIyqV #NSTA15 #AlbertEinstein #inspirat"| __truncated__ "RT #KVernonBHS: #smrtgrls would love Stellar Girls. Empowering female scientists rocks! #NSTA15 http://t.co/1ZU0yjVF67" "RT #leducmills: Leaving #SXSWedu to go straight to #NSTA15. There should be some sort of arbitrary conference-hopper social med"| __truncated__ "RT #KRScienceLady: Congrats to a wonderful colleague who helped #ngss Bcome reality, Stephen Pruitt, Distinguished Service to "| __truncated__ ...
$ group: Factor w/ 2 levels "narst","nsta": 2 2 2 2 2 2 2 2 2 2 ...
It seems to save fine:
> save(df, file = "~/downloads/df.Rda")
But it turns out that only the name of the saved object comes back:
> df1 <- load("~/downloads/df.Rda")
> str(df1)
chr "df"
I tried the saveRDS() function suggested in another answer to the same question referenced above which worked fine, but I'd like to know why save() isn't working.
You might want to take a look at this question here: R data formats: RData, Rda, Rds etc.
When loading an .rda object, you load all of the saved objects, with their original names, into the global environment. You can't assign an object to a new name using load() as you tried to do.
If you want to save objects that can be loaded with different names later, then you should use the .rds format (saveRDS and readRDS). If you want to save more than one object in a .rds file, the simplest solution is to put all of them on a list and save only the list. If after reading the .rds you want to put the objects of the list in the global environment, you can use list2env.
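A minimal sketch of the difference (filenames are illustrative):
df <- data.frame(x = 1:3)

# save()/load(): the object reappears under its original name;
# load() itself returns only a character vector of the restored names
save(df, file = "df.Rda")
df1 <- load("df.Rda")
str(df1)   # chr "df" -- but 'df' itself is back in the global environment

# saveRDS()/readRDS(): the object is returned, so bind it to any name
saveRDS(df, "df.rds")
df2 <- readRDS("df.rds")
str(df2)   # 'data.frame': 3 obs. of 1 variable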

text searching in R

I am trying to build a query for the R package RISmed, which searches for and downloads journal article information from the PubMed database. I want to search for two words that always appear together, for example:
query= "gene sequencing"
search<-EUtilsSummary(query,type="esearch",db = "pubmed",mindate=2014, maxdate=2014, retmax=20)
With the command above, the query matches gene and sequencing separately - any text in which both words occur anywhere is captured. I want to search in such a way that "gene sequencing" is treated as a phrase, with the two words always together. How can I write that query? Would anyone please help me?
Thanks in advance !
I would try this:
query <- '"gene sequencing"[Title/Abstract]'
The PubMed search engine does accept quoted strings; you just need to know how to preserve them within R. Surrounding single quotes are one method; backslash-escaped quotes are another. Notice that the returned value from my experiment with your code shows that escape-backslashing is how the implementers of that package do it:
> str(search)
Formal class 'EUtilsSummary' [package "RISmed"] with 6 slots
..# db : chr "pubmed"
..# count : num 542
..# retmax : num 20
..# retstart : num 0
..# PMID : chr [1:20] "25548628" "25543043" "25542841" "25540641" ...
..# querytranslation: chr "\"gene sequencing\"[Title/Abstract] AND 2014[EDAT] : 2014[EDAT]"

Exporting data from UScensus2010

I need to extract Census 2010 data for CDPs (nationwide) and figured it would be simplest to use the UScensus2010 package to extract the key tables I need, one state at a time, and then export the resulting data. Towards this end, I ran the following code:
library(UScensus2010)
install.cdp("osx")
demographics(dem="P0030001", state="oh",level="cdp")
p3.race <- demographics(dem=c("P0010001", "P0030001", "P0030002", "P0030003", "P0030004", "P0030005", "P0030006"), state="oh",level="cdp")
Note that for p3.race the dimnames are:
..$ : chr [1:1204] "Hamler village" "Florida village" "Deshler village" "New Bavaria village" ...
..$ : chr [1:7] "P0010001" "P0030001" "P0030002" "P0030003" ...
So far so good. Now, what would be the best way to export p3.race (as *.csv *.dta, or any other format) such that the CDP names show up as column A and P001001:P0030006 show up as Columns B:H?
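A minimal sketch for the export (assuming, as the dimnames above suggest, that p3.race is a matrix with the CDP names as its rownames; the filename is illustrative):
# Promote the rownames (CDP names) to column A; the seven P-variables
# then land in columns B:H
out <- data.frame(CDP = rownames(p3.race), p3.race, row.names = NULL)
write.csv(out, file = "oh_cdp_race.csv", row.names = FALSE)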
