I scraped some TripAdvisor content (id, quote, rating, complete review) into a CSV file and tried to filter out the documents with a 5* rating, but it does not seem to work.
> x <- read.csv("test.csv", header = TRUE, stringsAsFactors = FALSE)
> (corp <- VCorpus(DataframeSource(x),
+ readerControl = list(language = "eng")))
I get the following:
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 50
Now, on filtering, it shows that there are 0 documents with a rating of 5*, and that can't be right.
> idx <- meta(corp, "rating") == '5'
> corp[idx]
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 0
Did I overlook anything when creating the corpus?
str() output, as requested:
'data.frame': 682 obs. of 6 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ id : chr "rn360260358" "rn359340351" "rn356397660" "rn355961772" ...
$ quote : chr "Nice but not unique " "Beautiful scenery of German forest with a lake" "Beautiful Lake and Amazing Mountain Views" "Beautiful!" ...
$ rating : chr "3" "5" "5" "5" ...
$ date : chr "Reviewed 5 March 2016" "Reviewed 29 February 2016" "Reviewed 27 February 2016" ...
$ reviewnospace: chr "We visited the lake with our daughters in March. All s...
Your data import method simply does not pass the metadata: DataframeSource(x) treats every column of x as document text.
Moreover, whatever the method, there is no easy, automatic way to attach a batch of metadata in tm. Instead, we can use VectorSource(x$reviewnospace) (assuming this is the column holding the text) and assign the metadata in a second step. Your indexing then works as expected.
library(tm)
# use VectorSource to import data
corp <- VCorpus(VectorSource(x$reviewnospace), readerControl = list(language = "eng"))
# assign metadata
meta(corp, tag = "rating") <- x$rating
idx <- meta(corp, "rating") == '5'
corp[idx]
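Side note, hedged: in newer versions of tm (roughly 0.7 and later, per its documentation), DataframeSource() expects a data frame with doc_id and text columns and keeps every remaining column as document-level metadata, so a one-step import can also work. A minimal sketch with made-up data:

```r
library(tm)

# Toy stand-in for the scraped reviews (made-up values)
df <- data.frame(doc_id = c("rn1", "rn2"),
                 text   = c("Nice but not unique", "Beautiful lake"),
                 rating = c("3", "5"),
                 stringsAsFactors = FALSE)

corp2 <- VCorpus(DataframeSource(df))  # "rating" becomes document metadata
corp2[meta(corp2, "rating") == "5"]    # filtering now works directly
```

If this errors on your tm version, the VectorSource() route above still applies.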
I have the following code:
url <- "https://lebensmittel-naehrstoffe.de/calciumhaltige-lebensmittel/"
page <- read_html(url) #Creates an html document from URL
Ca <- html_table(page, fill = TRUE, dec = ",") #Parses tables into data frames
Ca <- data.frame(Ca)
But the last column of my data.frame, Ca[,4], contains values with both "." and "," (it is a German table, so the decimal mark is ","), and in R it always comes out as character. I have already tried gsub and as.numeric, but it always failed. Please note: I already set dec = ",".
Could someone help me? If possible, the solution should work on many data.frames (or HTML imports, or whatever), because I have many such tables...
Thank you very much!
You can use readr::parse_number:
Ca <- html_table(page, fill = TRUE, dec = ",")[[1]]
Ca$`Calciumgehalt in mg` <- readr::parse_number(Ca$`Calciumgehalt in mg`, locale = readr::locale(decimal_mark = ",", grouping_mark = "."))
str(Ca)
# 'data.frame': 82 obs. of 4 variables:
# $ Lebensmittel : chr "Basilikum, getrocknet" "Majoran, getrocknet" "Thymian, getrocknet" "Selleriesamen" ...
# $ Kategorie : chr "Gewürze" "Gewürze" "Gewürze" "Gewürze" ...
# $ Mengenangabe : chr "je 100 Gramm" "je 100 Gramm" "je 100 Gramm" "je 100 Gramm" ...
# $ Calciumgehalt.in.mg: num 2240 1990 1890 1767 1597 ...
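If you would rather not depend on readr, the same conversion can be done in base R with two gsub() passes (a sketch; the helper name de_num is mine):

```r
# Base-R fallback: drop the thousands separator ".", turn the decimal
# comma into a point, then coerce to numeric.
de_num <- function(x) as.numeric(gsub(",", ".", gsub(".", "", x, fixed = TRUE)))

de_num(c("1.234,5", "82", "0,75"))
# c(1234.5, 82, 0.75)
```

Being a plain function, it is easy to lapply() over the character columns of many scraped tables.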
OK, I know this is not a reproducible example: I only managed to get this error with this specific data.table, which is almost 1 GB, so I don't know how to send it to you. Anyway, I am completely lost... If someone knows what is happening here, please tell me.
I have the original data.table and some other ones obtained by just changing the skip argument.
> original <- fread('json.csv')
> skip100 <- fread('json.csv', skip = 100, sep = ',')
> skip1000 <- fread('json.csv', skip = 1000, sep = ',')
> skip10000 <- fread('json.csv', skip = 10000, sep = ',')
> str(original)
Classes ‘data.table’ and 'data.frame': 29315 obs. of 7 variables:
$ id : chr "0015023cc06b5362d332b3baf348d11567ca2fbb" "004f0f8bb66cf446678dc13cf2701feec4f36d76" "00d16927588fb04d4be0e6b269fc02f0d3c2aa7b" "0139ea4ca580af99b602c6435368e7fdbefacb03" ...
$ title : chr "The RNA pseudoknots in foot-and-mouth disease virus are dispensable for genome replication but essential for th"| __truncated__ "Healthcare-resource-adjusted vulnerabilities towards the 2019-nCoV epidemic across China" "Real-time, MinION-based, amplicon sequencing for lineage typing of infectious bronchitis virus from upper respiratory samples" "A Combined Evidence Approach to Prioritize Nipah Virus Inhibitors" ...
$ authors : chr "Joseph C Ward,Lidia Lasecka-Dykes,Chris Neil,Oluwapelumi Adeyemi,Sarah , Gold,Niall Mclean,Caroline Wrig"| __truncated__ "Hanchu Zhou,Jiannan Yang,Kaicheng Tang,â\200 ,Qingpeng Zhang,Zhidong Cao,Dirk Pfeiffer,Daniel Dajun Zeng" "Salman L Butt,Eric C Erwood,Jian Zhang,Holly S Sellers,Kelsey Young,Kevin K Lahmers,James B Stanton" "Nishi Kumari,Ayush Upadhyay,Kishan Kalia,Rakesh Kumar,Kanika Tuteja,Rani Paul,Eugenia Covernton,Tina Sh"| __truncated__ ...
$ institution: chr "" "City University of Hong Kong,City University of Hong Kong,City University of Hong Kong,NA,City University of Ho"| __truncated__ "University of Georgia,University of Georgia,University of Georgia,University of Georgia,University of Georgia,V"| __truncated__ "Panjab University,Delhi University,D.A.V. College,CSIR-Institute of Microbial Technology,Panjab University,Univ"| __truncated__ ...
$ country : chr "" "China,China,China,NA,China,China,China,China" "USA,USA,USA,USA,USA,USA,USA" "India,India,India,India,India,India,France,India,NA,India" ...
$ abstract : chr "word count: 194 22 Text word count: 5168 23 24 25 author/funder. All rights reserved. No reuse allowed without "| __truncated__ "" "Infectious bronchitis (IB) causes significant economic losses in the global poultry industry. Control of infect"| __truncated__ "Nipah Virus (NiV) came into limelight recently due to an outbreak in Kerala, India. NiV causes severe disease a"| __truncated__ ...
$ body_text : chr "VP3, and VP0 (which is further processed to VP2 and VP4 during virus assembly) (6). The P2 64 and P3 regions en"| __truncated__ "The 2019-nCoV epidemic has spread across China and 24 other countries 1-3 as of February 8, 2020 . The mass qua"| __truncated__ "Infectious bronchitis (IB), which is caused by infectious bronchitis virus (IBV), is one of the most important "| __truncated__ "Nipah is an infectious negative-sense single-stranded RNA virus which belongs to the genus henipavirus and fami"| __truncated__ ...
- attr(*, ".internal.selfref")=<externalptr>
The number of observations is consistent for skip = 100 and skip = 10000, but not for skip = 1000, as shown below.
> nrow(original)
[1] 29315
> nrow(skip100)
[1] 29215
> nrow(skip1000)
[1] 28316
> nrow(skip10000)
[1] 19315
What is happening?
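One plausible culprit (an assumption, since the 1 GB file cannot be checked here): skip counts physical lines, not records. If any quoted field contains an embedded newline, some skip values land inside that field, and every later row boundary shifts. A small base-R illustration of the effect:

```r
# A 3-record CSV where one quoted field value ("x\ny") spans two physical lines
tmp <- tempfile(fileext = ".csv")
writeLines(c('a,b', '1,"x', 'y"', '2,z', '3,w'), tmp)

nrow(read.csv(tmp))                            # 3: the quoted field is rejoined
nrow(read.csv(tmp, skip = 2, header = FALSE))  # not the 2 records actually left
```

Comparing an id column of skip1000 against tail(original, nrow(skip1000)) would show whether the extra row in your data comes from such a split record.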
# parse PubMed data
library(XML) # xpath
library(rentrez) # entrez_fetch
pmids <- c("25506969","25032371","24983039","24983034","24983032","24983031","26386083",
"26273372","26066373","25837167","25466451","25013473","23733758")
# Above IDs are mix of Books and journal articles
# ID# 23733758 is a journal article and has no abstract
data.pubmed <- entrez_fetch(db = "pubmed", id = pmids, rettype = "xml",
parsed = TRUE)
abstracts <- xpathApply(data.pubmed, "//Abstract", xmlValue)
names(abstracts) <- pmids
It works well if every record has an abstract. However, when a PMID (#23733758) has no PubMed abstract (or is a book article, or something else), it gets skipped, resulting in the error 'names' attribute [5] must be the same length as the vector [4].
Q: How can I pass multiple paths/nodes so that I can extract journal articles, books, or reviews?
UPDATE: hrbrmstr's solution addresses the NA. But can xpathApply take multiple nodes, like c(//Abstract, //ReviewArticle, etc.)?
You have to attack it one tag element up:
abstracts <- xpathApply(data.pubmed, "//PubmedArticle//Article", function(x) {
val <- xpathSApply(x, "./Abstract", xmlValue)
if (length(val)==0) val <- NA_character_
val
})
names(abstracts) <- pmids
str(abstracts)
List of 5
## $ 24019382: chr "Adenocarcinoma of the lung, a leading cause of cancer death, frequently displays mutational activation of the KRAS proto-oncoge"| __truncated__
## $ 23927882: chr "Mutations in components of the mitogen-activated protein kinase (MAPK) cascade may be a new candidate for target for lung cance"| __truncated__
## $ 23825589: chr "Aberrant activation of MAP kinase signaling pathway and loss of tumor suppressor LKB1 have been implicated in lung cancer devel"| __truncated__
## $ 23792568: chr "Sorafenib, the first agent developed to target BRAF mutant melanoma, is a multi-kinase inhibitor that was approved by the FDA f"| __truncated__
## $ 23733758: chr NA
Per your comment with an alternate way to do this:
str(xpathApply(data.pubmed, '//PubmedArticle//Article', function(x) {
xmlValue(xmlChildren(x)$Abstract)
}))
## List of 5
## $ : chr "Adenocarcinoma of the lung, a leading cause of cancer death, frequently displays mutational activation of the KRAS proto-oncoge"| __truncated__
## $ : chr "Mutations in components of the mitogen-activated protein kinase (MAPK) cascade may be a new candidate for target for lung cance"| __truncated__
## $ : chr "Aberrant activation of MAP kinase signaling pathway and loss of tumor suppressor LKB1 have been implicated in lung cancer devel"| __truncated__
## $ : chr "Sorafenib, the first agent developed to target BRAF mutant melanoma, is a multi-kinase inhibitor that was approved by the FDA f"| __truncated__
## $ : chr NA
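As to the UPDATE: xpathApply() takes a single XPath expression rather than a vector, but XPath itself has a union operator, |, so several node types can be matched in one call (the ReviewArticle node below is illustrative, not a real PubMed tag):

```r
library(XML)

doc <- xmlParse("<set><Abstract>a1</Abstract><ReviewArticle>r1</ReviewArticle></set>")

# One query, two node types, results in document order
xpathSApply(doc, "//Abstract | //ReviewArticle", xmlValue)
# [1] "a1" "r1"
```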
I am trying to import a text file into R, and put it into a data frame, along with other data.
My delimiter is "|" and a sample of my data is here :
|Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD,
very light load and had 3 seats to myself. A very enthusiastic and friendly crew as usual on this transpacific
route that I take several times a year. Arrived 20 min ahead of schedule. The expected high level of service from
our flag carrier, Air Canada. Altitude Elite member.
|We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited
staffing in Toronto our flight was excellent. Due to the rush in Toronto one of our carry ones was placed to go in
the cargo hold. When we arrived in Winnipeg it stayed in Toronto, they were most helpful and kind at the Winnipeg
airport, and we received 3 phone calls the following day in regards to the misplaced bag and it was delivered to
our home. We are very thankful and more than appreciative of the service we received what a great end to a
wonderful holiday.
|Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which
had no storage whatsoever, and not even any room under the seats. Ridiculous. Crew were poor, not friendly. One
older male member of staff was quite attitudinal, acting as though he was doing everyone a huge favour by serving
them. A reasonable dinner but breakfast was a measly piece of banana loaf. That's it! The worst airline breakfast
I have had.
As you can see, there are many "|" delimiters, but when I imported the data into R it split only once, instead of about 152 times (see the str() output below).
How do I get each individual piece of text into a different column of the data frame? I would like a data frame of length 152, not 2.
EDIT: The code lines are:
myData <- read.table("C:/Users/Norbert/Desktop/research/Important files/Airline Reviews/Reviews/air_can_Review.txt", sep="|",quote=NULL, comment='',fill = TRUE, header=FALSE)
length(myData)
[1] 2
class(myData)
[1] "data.frame"
str(myData)
'data.frame': 1244 obs. of 2 variables:
$ V1: Factor w/ 1093 levels "","'delayed' on departure (I reference flights between March 2014 and January 2015 in this regard: Denver, SFO,",..: 210 367 698 853 1 344 483 87 757 52 ...
$ V2: Factor w/ 154 levels ""," hotel","5/9/2014, LHR to Vancouver, AC855. 23/9/2014, Vancouver to LHR, AC854. For Economy the leg room was OK compared to",..: 1 1 1 1 78 1 1 1 1 1 ...
myDataFrame <- data.frame(text = myData, otherVar2 = 1, otherVar2 = "blue", stringsAsFactors = FALSE)
str(myDataFrame)
'data.frame': 531 obs. of 3 variables:
$ text : chr "BRU-YUL, May 26th, A330-300. Departed on-time, landed 30 minutes late due to strong winds, nice flight, food" "excellent, cabin-crew smiling and attentive except for one old lady throwing meal trays like boomerangs. Seat-" "pitch was very generous, comfortable seat, IFE a bit outdated but selection was Okay. Air Canadas problem is\nthat the new pro"| __truncated__ "" ...
$ otherVar2 : num 1 1 1 1 1 1 1 1 1 1 ...
$ otherVar2.1: chr "blue" "blue" "blue" "blue" ...
length(myDataFrame)
[1] 3
A better way to read in the text is with scan(); then put it into a data frame along with your other variables (here I just made some up). Note that I took your text from above and pasted it into a file called sample.txt, after removing the leading "|".
myData <- scan("sample.txt", what = "character", sep = "|")
myDataFrame <- data.frame(text = myData, otherVar1 = 1, otherVar2 = "blue",
                          stringsAsFactors = FALSE)
str(myDataFrame)
## 'data.frame': 3 obs. of 3 variables:
##  $ text     : chr "Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD, very light loa"| __truncated__ "We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited staffing in Toront"| __truncated__ "Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which had no storage "| __truncated__
##  $ otherVar1: num 1 1 1
##  $ otherVar2: chr "blue" "blue" "blue"
The otherVar1 and otherVar2 columns are just placeholders for your own variables, as you said you wanted a data.frame with other variables. I chose a numeric variable and a text variable, and by specifying a single value, each gets recycled for all observations in the dataset (in the example, 3).
I realize that your question asks how to get each text in a different column, but that is not a good way to use a data.frame, since data.frames are designed to hold variables in columns. (With one text per column, you cannot add other variables.)
If you really want to do that, you have to coerce the data after transposing it, as follows:
myDataFrame <- as.data.frame(t(data.frame(text = myData, stringsAsFactors = FALSE)), stringsAsFactors = FALSE)
str(myDataFrame)
## 'data.frame': 1 obs. of 3 variables:
## $ V1: chr "Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD, very light loa"| __truncated__
## $ V2: chr "We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited staffing in Toront"| __truncated__
## $ V3: chr "Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which had no storage "| __truncated__
length(myDataFrame)
## [1] 3
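For completeness, a strsplit() route gives the same one-review-per-row result without scan() (the sample file is recreated inline here so the snippet is self-contained):

```r
# Stand-in for sample.txt: three "|"-prefixed reviews, one spanning two lines
tmp <- tempfile()
writeLines(c("|Painless check-in.",
             "|We recently returned from Dublin",
             "to Toronto.",
             "|Flew Toronto to Heathrow."), tmp)

txt    <- paste(readLines(tmp), collapse = "\n")
pieces <- strsplit(txt, "|", fixed = TRUE)[[1]]
pieces <- trimws(pieces[nzchar(trimws(pieces))])  # drop the empty lead piece

length(pieces)  # 3
```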
"Measly banana loaf"? Definitely economy class.
I have read an ASCII (.spe) file into R. The file contains one column of, mostly, integers. However, R is interpreting these integers incorrectly, probably because I am not specifying the correct format. The file was generated by Ortec's Maestro software. Here is the code:
library(SDMTools)
strontium<-read.table("C:/Users/Hal 2/Desktop/beta_spec/strontium 90 spectrum.spe",header=F,skip=2)
str_spc<-vector(mode="numeric")
for (i in 1:2037)
{
str_spc[i]<-as.numeric(strontium$V1[i+13])
}
Here, for example, strontium$V1[14] has the value 0, but R reads it as 10. I think I may have to convert the data to some other format, but I'm not sure, and I'm probably googling the wrong search terms.
Here are the first few lines from the file:
$SPEC_ID:
No sample description was entered.
$SPEC_REM:
DET# 1
DETDESC# MCB 129
AP# Maestro Version 6.08
$DATE_MEA:
10/14/2014 15:13:16
$MEAS_TIM:
1516 1540
$DATA:
0 2047
Here is a link to the file: https://www.dropbox.com/sh/y5x68jen487qnmt/AABBZyC6iXBY3e6XH0XZzc5ba?dl=0
Any help appreciated.
I saw that someone had made a parser for SPE spectra files in Python, and I can't let that stand without there being at least a minimally functioning R version. So here's one that parses some of the fields but gets you your data:
library(stringr)
library(gdata)
library(lubridate)
read.spe <- function(file) {
tmp <- readLines(file)
tmp <- paste(tmp, collapse="\n")
records <- strsplit(tmp, "\\$")[[1]]
records <- records[records!=""]
spe <- list()
spe[["SPEC_ID"]] <- str_match(records[which(startsWith(records, "SPEC_ID"))],
"^SPEC_ID:[[:space:]]*([[:print:]]+)[[:space:]]+")[2]
spe[["SPEC_REM"]] <- strsplit(str_match(records[which(startsWith(records, "SPEC_REM"))],
"^SPEC_REM:[[:space:]]*(.*)")[2], "\n")
spe[["DATE_MEA"]] <- mdy_hms(str_match(records[which(startsWith(records, "DATE_MEA"))],
"^DATE_MEA:[[:space:]]*(.*)[[:space:]]$")[2])
spe[["MEAS_TIM"]] <- strsplit(str_match(records[which(startsWith(records, "MEAS_TIM"))],
"^MEAS_TIM:[[:space:]]*(.*)[[:space:]]$")[2], "\n")[[1]]
spe[["ROI"]] <- str_match(records[which(startsWith(records, "ROI"))],
"^ROI:[[:space:]]*(.*)[[:space:]]$")[2]
spe[["PRESETS"]] <- strsplit(str_match(records[which(startsWith(records, "PRESETS"))],
"^PRESETS:[[:space:]]*(.*)[[:space:]]$")[2], "\n")[[1]]
spe[["ENER_FIT"]] <- strsplit(str_match(records[which(startsWith(records, "ENER_FIT"))],
"^ENER_FIT:[[:space:]]*(.*)[[:space:]]$")[2], "\n")[[1]]
spe[["MCA_CAL"]] <- strsplit(str_match(records[which(startsWith(records, "MCA_CAL"))],
"^MCA_CAL:[[:space:]]*(.*)[[:space:]]$")[2], "\n")[[1]]
spe[["SHAPE_CAL"]] <- str_match(records[which(startsWith(records, "SHAPE_CAL"))],
"^SHAPE_CAL:[[:space:]]*(.*)[[:space:]]*$")[2]
spe_dat <- strsplit(str_match(records[which(startsWith(records, "DATA"))],
"^DATA:[[:space:]]*(.*)[[:space:]]$")[2], "\n")[[1]]
spe[["SPE_DAT"]] <- as.numeric(gsub("[[:space:]]", "", spe_dat)[-1])
return(spe)
}
dat <- read.spe("strontium 90 spectrum.Spe")
str(dat)
## List of 10
## $ SPEC_ID : chr "No sample description was entered."
## $ SPEC_REM :List of 1
## ..$ : chr [1:3] "DET# 1" "DETDESC# MCB 129" "AP# Maestro Version 6.08"
## $ DATE_MEA : POSIXct[1:1], format: "2014-10-14 15:13:16"
## $ MEAS_TIM : chr "1516 1540"
## $ ROI : chr "0"
## $ PRESETS : chr [1:3] "None" "0" "0"
## $ ENER_FIT : chr "0.000000 0.002529"
## $ MCA_CAL : chr [1:2] "3" "0.000000E+000 2.529013E-003 0.000000E+000 keV"
## $ SHAPE_CAL: chr "3\n3.100262E+001 0.000000E+000 0.000000E+000"
## $ SPE_DAT : num [1:2048] 0 0 0 0 0 0 0 0 0 0 ...
head(dat$SPE_DAT)
## [1] 0 0 0 0 0 0
It needs some polish and there's absolutely no error checking (e.g. for missing fields), but I had no time today to deal with that. I'll finish the parsing and put a minimal package wrapper around it over the next couple of days.