fread skip argument not consistent in R

OK, I know this is not a reproducible example: I only managed to get this error with this specific data.table, which is almost 1 GB, so I don't know how to send it to you. Anyway, I am completely lost. If someone knows what is happening here, please tell me.
I have the original data.table and several others obtained by changing only the skip argument.
> original <- fread('json.csv')
> skip100 <- fread('json.csv', skip = 100, sep = ',')
> skip1000 <- fread('json.csv', skip = 1000, sep = ',')
> skip10000 <- fread('json.csv', skip = 10000, sep = ',')
> str(original)
Classes ‘data.table’ and 'data.frame': 29315 obs. of 7 variables:
$ id : chr "0015023cc06b5362d332b3baf348d11567ca2fbb" "004f0f8bb66cf446678dc13cf2701feec4f36d76" "00d16927588fb04d4be0e6b269fc02f0d3c2aa7b" "0139ea4ca580af99b602c6435368e7fdbefacb03" ...
$ title : chr "The RNA pseudoknots in foot-and-mouth disease virus are dispensable for genome replication but essential for th"| __truncated__ "Healthcare-resource-adjusted vulnerabilities towards the 2019-nCoV epidemic across China" "Real-time, MinION-based, amplicon sequencing for lineage typing of infectious bronchitis virus from upper respiratory samples" "A Combined Evidence Approach to Prioritize Nipah Virus Inhibitors" ...
$ authors : chr "Joseph C Ward,Lidia Lasecka-Dykes,Chris Neil,Oluwapelumi Adeyemi,Sarah , Gold,Niall Mclean,Caroline Wrig"| __truncated__ "Hanchu Zhou,Jiannan Yang,Kaicheng Tang,†,Qingpeng Zhang,Zhidong Cao,Dirk Pfeiffer,Daniel Dajun Zeng" "Salman L Butt,Eric C Erwood,Jian Zhang,Holly S Sellers,Kelsey Young,Kevin K Lahmers,James B Stanton" "Nishi Kumari,Ayush Upadhyay,Kishan Kalia,Rakesh Kumar,Kanika Tuteja,Rani Paul,Eugenia Covernton,Tina Sh"| __truncated__ ...
$ institution: chr "" "City University of Hong Kong,City University of Hong Kong,City University of Hong Kong,NA,City University of Ho"| __truncated__ "University of Georgia,University of Georgia,University of Georgia,University of Georgia,University of Georgia,V"| __truncated__ "Panjab University,Delhi University,D.A.V. College,CSIR-Institute of Microbial Technology,Panjab University,Univ"| __truncated__ ...
$ country : chr "" "China,China,China,NA,China,China,China,China" "USA,USA,USA,USA,USA,USA,USA" "India,India,India,India,India,India,France,India,NA,India" ...
$ abstract : chr "word count: 194 22 Text word count: 5168 23 24 25 author/funder. All rights reserved. No reuse allowed without "| __truncated__ "" "Infectious bronchitis (IB) causes significant economic losses in the global poultry industry. Control of infect"| __truncated__ "Nipah Virus (NiV) came into limelight recently due to an outbreak in Kerala, India. NiV causes severe disease a"| __truncated__ ...
$ body_text : chr "VP3, and VP0 (which is further processed to VP2 and VP4 during virus assembly) (6). The P2 64 and P3 regions en"| __truncated__ "The 2019-nCoV epidemic has spread across China and 24 other countries 1-3 as of February 8, 2020 . The mass qua"| __truncated__ "Infectious bronchitis (IB), which is caused by infectious bronchitis virus (IBV), is one of the most important "| __truncated__ "Nipah is an infectious negative-sense single-stranded RNA virus which belongs to the genus henipavirus and fami"| __truncated__ ...
- attr(*, ".internal.selfref")=<externalptr>
The number of observations is consistent (nrow(original) minus the skip value) for skip = 100 and skip = 10000, but not for skip = 1000, as shown below.
> nrow(original)
[1] 29315
> nrow(skip100)
[1] 29215
> nrow(skip1000)
[1] 28316
> nrow(skip10000)
[1] 19315
What is happening?
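One thing I considered: since the file was built from JSON, fields can contain embedded newlines, and skip counts physical lines rather than parsed rows. A minimal sketch with made-up data (not my real file) showing how the two counts diverge:
library(data.table)
tmp <- tempfile(fileext = ".csv")
# one quoted field spans two physical lines but forms a single row
writeLines(c('id,text',
             '1,"a one-line field"',
             '2,"a field broken',
             'across two lines"',
             '3,"another field"'), tmp)
length(readLines(tmp))  # 5 physical lines
nrow(fread(tmp))        # but only 3 parsed rows
So skip = k is not guaranteed to drop exactly k rows, which may be why skip = 1000 is off by one.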

Related

Using read_html in R to get Russell 3000 holdings?

I was wondering if there is a way to automatically pull the Russell 3000 holdings from the iShares website in R using rvest's read_html function?
url: https://www.ishares.com/us/products/239714/ishares-russell-3000-etf
(all holdings in the table at the bottom, not just the top 10)
So far I have had to copy and paste into an Excel document, save as a CSV, and use read_csv to create a tibble in R of the ticker, company name, and sector.
I have used read_html to pull the S&P 500 holdings from Wikipedia, but can't seem to figure out the path I need to give R to automatically pull from the iShares website (and there aren't other reputable websites I've found with all ~3000 holdings). Here is the code used for the S&P 500:
read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")%>%
html_node("table.wikitable")%>%
html_table()%>%
select('Symbol','Security','GICS Sector','GICS Sub Industry')%>%
as_tibble()
First post, sorry if it is hard to follow...
Any help would be much appreciated
Michael
IMPORTANT
According to the Terms & Conditions listed on BlackRock's website (here):
Use any robot, spider, intelligent agent, other automatic device, or manual process to search, monitor or copy this Website or the reports, data, information, content, software, products services, or other materials on, generated by or obtained from this Website, whether through links or otherwise (collectively, "Materials"), without BlackRock's permission, provided that generally available third-party web browsers may be used without such permission;
I suggest you make sure you are abiding by those terms before using their data. For educational purposes, here is how the data would be obtained:
First you need to get to the actual data (not the interactive JavaScript). How familiar are you with the developer tools in your browser? If you navigate through the website and track the network traffic, you will notice a large AJAX request:
https://www.ishares.com/us/products/239714/ishares-russell-3000-etf/1467271812596.ajax?tab=all&fileType=json
This is the data you need (all holdings). After locating it, the rest is just cleaning the data. Example:
library(jsonlite)
# Locate the raw data by searching the Network traffic:
url <- "https://www.ishares.com/us/products/239714/ishares-russell-3000-etf/1467271812596.ajax?tab=all&fileType=json"
# pull the data in via fromJSON
x <- jsonlite::fromJSON(url, flatten = TRUE)
## Large list (10.4 Mb)
# use a combination of `lapply` and `rapply` to unlist, structuring the results as one large list
y <- lapply(rapply(x, enquote, how = "unlist"), eval)
## Large list (50677 elements, 6.9 Mb)
y1 <- y[1:15]
> str(y1)
List of 15
$ aaData1 : chr "MSFT"
$ aaData2 : chr "MICROSOFT CORP"
$ aaData3 : chr "Equity"
$ aaData.display: chr "2.95"
$ aaData.raw : num 2.95
$ aaData.display: chr "109.41"
$ aaData.raw : num 109
$ aaData.display: chr "2,615,449.00"
$ aaData.raw : int 2615449
$ aaData.display: chr "$286,156,275.09"
$ aaData.raw : num 2.86e+08
$ aaData.display: chr "286,156,275.09"
$ aaData.raw : num 2.86e+08
$ aaData14 : chr "Information Technology"
$ aaData15 : chr "2588173"
Updated: in case you are unable to clean the data, here you are:
testdf<- data.frame(matrix(unlist(y), nrow=50677, byrow=T),stringsAsFactors=FALSE)
#Where we want to break the DF at (every nth row)
breaks <- 17
#number of rows in full DF
nbr.row <- nrow(testdf)
repeats<- rep(1:ceiling(nbr.row/breaks),each=breaks)[1:nbr.row]
#split DF from clean-up
newDF <- split(testdf,repeats)
Result:
> str(head(newDF))
List of 6
$ 1:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "MSFT" "MICROSOFT CORP" "Equity" "2.95" ...
$ 2:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "AAPL" "APPLE INC" "Equity" "2.89" ...
$ 3:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "AMZN" "AMAZON COM INC" "Equity" "2.34" ...
$ 4:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "BRKB" "BERKSHIRE HATHAWAY INC CLASS B" "Equity" "1.42" ...
$ 5:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "FB" "FACEBOOK CLASS A INC" "Equity" "1.35" ...
$ 6:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "JNJ" "JOHNSON & JOHNSON" "Equity" "1.29" ...

R - analysing tripadvisor content with tm

I scraped some TripAdvisor content (id, quote, rating, complete review) into a CSV file and tried to filter for the documents with just a 5* rating, but it does not seem to work.
> x <- read.csv ("test.csv", header = TRUE, stringsAsFactors = FALSE)
> (corp <- VCorpus(DataframeSource (x),
+ readerControl = list(language = "eng")))
I get the following:
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 50
Now, on filtering, it shows that there are 0 documents with a rating of 5*, and that can't be right.
> idx <- meta(corp, "rating") == '5'
> corp [idx]
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 0
Did I overlook anything on creating the corpus?
text output as requested
'data.frame': 682 obs. of 6 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ id : chr "rn360260358" "rn359340351" "rn356397660" "rn355961772" ...
$ quote : chr "Nice but not unique " "Beautiful scenery of German forest with a lake" "Beautiful Lake and Amazing Mountain Views" "Beautiful!" ...
$ rating : chr "3" "5" "5" "5" ...
$ date : chr "Reviewed 5 March 2016" "Reviewed 29 February 2016" "Reviewed 27 February 2016" ...
$ reviewnospace: chr "We visited the lake with our daughters in March. All s...
Your data import method simply does not pass the metadata: DataframeSource(x) passes all variables of x as the document text.
Moreover, whatever the method, there is no easy, automatic way to add a bunch of metadata in tm. Instead, we can use VectorSource(x$reviewnospace) (assuming this is the column holding the text) and assign the metadata in a second step. Your indexing then works as expected.
library(tm)
# use VectorSource to import data
corp <- VCorpus(VectorSource(x$reviewnospace), readerControl = list(language = "eng"))
# assign metadata
meta(corp, tag = "rating") <- x$rating
idx <- meta(corp, "rating") == '5'
corp[idx]
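As an aside, in newer tm versions (0.7+) the original approach can work too: DataframeSource() expects a doc_id column and a text column, and every remaining column is attached as document-level metadata automatically. A sketch under that assumption (column names taken from your str() output):
x2 <- data.frame(doc_id = x$id, text = x$reviewnospace,
                 rating = x$rating, date = x$date,
                 stringsAsFactors = FALSE)
corp2 <- VCorpus(DataframeSource(x2))
corp2[meta(corp2, "rating") == "5"]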

xpathApply: How to pass multiple paths or nodes?

# parse PubMed data
library(XML) # xpath
library(rentrez) # entrez_fetch
pmids <- c("25506969","25032371","24983039","24983034","24983032","24983031","26386083",
"26273372","26066373","25837167","25466451","25013473","23733758")
# Above IDs are mix of Books and journal articles
# ID# 23733758 is an journal article and has No abstract
data.pubmed <- entrez_fetch(db = "pubmed", id = pmids, rettype = "xml",
parsed = TRUE)
abstracts <- xpathApply(data.pubmed, "//Abstract", xmlValue)
names(abstracts) <- pmids
It works well if every record has an abstract. However, when there is a PMID (#23733758) without a PubMed abstract (or a book article, or something else), it gets skipped, resulting in the error: 'names' attribute [5] must be the same length as the vector [4]
Q: How to pass multiple paths/nodes so that, I can extract journal article, Books or Reviews ?
UPDATE: hrbrmstr's solution helps to address the NA. But can xpathApply take multiple nodes, like c("//Abstract", "//ReviewArticle", etc.)?
You have to attack it one tag element up:
abstracts <- xpathApply(data.pubmed, "//PubmedArticle//Article", function(x) {
val <- xpathSApply(x, "./Abstract", xmlValue)
if (length(val)==0) val <- NA_character_
val
})
names(abstracts) <- pmids
str(abstracts)
## List of 5
## $ 24019382: chr "Adenocarcinoma of the lung, a leading cause of cancer death, frequently displays mutational activation of the KRAS proto-oncoge"| __truncated__
## $ 23927882: chr "Mutations in components of the mitogen-activated protein kinase (MAPK) cascade may be a new candidate for target for lung cance"| __truncated__
## $ 23825589: chr "Aberrant activation of MAP kinase signaling pathway and loss of tumor suppressor LKB1 have been implicated in lung cancer devel"| __truncated__
## $ 23792568: chr "Sorafenib, the first agent developed to target BRAF mutant melanoma, is a multi-kinase inhibitor that was approved by the FDA f"| __truncated__
## $ 23733758: chr NA
Per your comment, here's an alternate way to do this:
str(xpathApply(data.pubmed, '//PubmedArticle//Article', function(x) {
xmlValue(xmlChildren(x)$Abstract)
}))
## List of 5
## $ : chr "Adenocarcinoma of the lung, a leading cause of cancer death, frequently displays mutational activation of the KRAS proto-oncoge"| __truncated__
## $ : chr "Mutations in components of the mitogen-activated protein kinase (MAPK) cascade may be a new candidate for target for lung cance"| __truncated__
## $ : chr "Aberrant activation of MAP kinase signaling pathway and loss of tumor suppressor LKB1 have been implicated in lung cancer devel"| __truncated__
## $ : chr "Sorafenib, the first agent developed to target BRAF mutant melanoma, is a multi-kinase inhibitor that was approved by the FDA f"| __truncated__
## $ : chr NA
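On the update question: xpathApply() takes a single XPath string rather than a vector, but XPath itself supports unions via the | operator, so several node types can be matched in one query. A sketch reusing the node names from your question (whether //ReviewArticle actually occurs in your records is an assumption to verify):
mixed <- xpathApply(data.pubmed, "//Abstract | //ReviewArticle", xmlValue)
str(mixed)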

Data import delimiter issue in R

I am trying to import a text file into R and put it into a data frame, along with other data.
My delimiter is "|", and a sample of my data is here:
|Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD,
very light load and had 3 seats to myself. A very enthusiastic and friendly crew as usual on this transpacific
route that I take several times a year. Arrived 20 min ahead of schedule. The expected high level of service from
our flag carrier, Air Canada. Altitude Elite member.
|We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited
staffing in Toronto our flight was excellent. Due to the rush in Toronto one of our carry ones was placed to go in
the cargo hold. When we arrived in Winnipeg it stayed in Toronto, they were most helpful and kind at the Winnipeg
airport, and we received 3 phone calls the following day in regards to the misplaced bag and it was delivered to
our home. We are very thankful and more than appreciative of the service we received what a great end to a
wonderful holiday.
|Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which
had no storage whatsoever, and not even any room under the seats. Ridiculous. Crew were poor, not friendly. One
older male member of staff was quite attitudinal, acting as though he was doing everyone a huge favour by serving
them. A reasonable dinner but breakfast was a measly piece of banana loaf. That's it! The worst airline breakfast
I have had.
As you can see, there are many "|" characters, but when I imported the data into R, it only separated the text once, instead of about 152 times (see the output in the EDIT below).
How do I get each individual piece of text in a different column inside the data frame? I would like a data frame of length 152, not 2.
EDIT: The code lines are:
myData <- read.table("C:/Users/Norbert/Desktop/research/Important files/Airline Reviews/Reviews/air_can_Review.txt", sep="|",quote=NULL, comment='',fill = TRUE, header=FALSE)
length(myData)
[1] 2
class(myData)
[1] "data.frame"
str(myData)
'data.frame': 1244 obs. of 2 variables:
$ V1: Factor w/ 1093 levels "","'delayed' on departure (I reference flights between March 2014 and January 2015 in this regard: Denver, SFO,",..: 210 367 698 853 1 344 483 87 757 52 ...
$ V2: Factor w/ 154 levels ""," hotel","5/9/2014, LHR to Vancouver, AC855. 23/9/2014, Vancouver to LHR, AC854. For Economy the leg room was OK compared to",..: 1 1 1 1 78 1 1 1 1 1 ...
myDataFrame <- data.frame(text = myData, otherVar2 = 1, otherVar2 = "blue", stringsAsFactors = FALSE)
str(myDataFrame)
'data.frame': 531 obs. of 3 variables:
$ text : chr "BRU-YUL, May 26th, A330-300. Departed on-time, landed 30 minutes late due to strong winds, nice flight, food" "excellent, cabin-crew smiling and attentive except for one old lady throwing meal trays like boomerangs. Seat-" "pitch was very generous, comfortable seat, IFE a bit outdated but selection was Okay. Air Canadas problem is\nthat the new pro"| __truncated__ "" ...
$ otherVar2 : num 1 1 1 1 1 1 1 1 1 1 ...
$ otherVar2.1: chr "blue" "blue" "blue" "blue" ...
length(myDataFrame)
[1] 3
A better way to read in the text is with scan(), then put it into a data frame with your other variables (here I just made some up). Note that I took your text above and pasted it into a file called sample.txt, after removing the leading "|".
myData <- scan("sample.txt", what = "character", sep = "|")
myDataFrame <- data.frame(text = myData, otherVar2 = 1, otherVar2 = "blue",
stringsAsFactors = FALSE)
str(myDataFrame)
## 'data.frame': 3 obs. of 3 variables:
## $ text : chr "Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD, very light loa"| __truncated__ "We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited staffing in Toront"| __truncated__ "Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which had no storage "| __truncated__
## $ otherVar1 : num 1 1 1
## $ otherVar2 : chr "blue" "blue" "blue"
otherVar1 and otherVar2 are just placeholders for your own variables, since you said you wanted a data.frame with other variables. I chose a numeric variable and a text variable; by specifying a single value, it gets recycled for every observation in the dataset (here, 3).
I realize that your question asks how to get each text in a different column, but that is not a good way to use a data.frame, since data.frames are designed to hold variables in columns. (With one text per column, you cannot add other variables.)
If you really want to do that, you have to coerce the data after transposing it, as follows:
myDataFrame <- as.data.frame(t(data.frame(text = myData, stringsAsFactors = FALSE)), stringsAsFactors = FALSE)
str(myDataFrame)
## 'data.frame': 1 obs. of 3 variables:
## $ V1: chr "Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD, very light loa"| __truncated__
## $ V2: chr "We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited staffing in Toront"| __truncated__
## $ V3: chr "Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which had no storage "| __truncated__
length(myDataFrame)
## [1] 3
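If the reviews really do span several physical lines in the file (as in the sample above), here is a sketch that tolerates that: read all the lines, collapse them into one string, then split on "|".
raw <- readLines("sample.txt")
reviews <- trimws(strsplit(paste(raw, collapse = " "), "|", fixed = TRUE)[[1]])
reviews <- reviews[reviews != ""]
length(reviews)  # one element per review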
"Measly banana loaf"? Definitely economy class.

Select by Date or Sort by date on GoogleNewsSource R

I am using the R package tm.plugin.webmining. Using the function GoogleNewsSource(), I would like to query the news sorted by date and also from a specific date. Is there any parameter to query the news of a specific date?
library(tm)
library(tm.plugin.webmining)
searchTerm <- "Data Mining"
corpusGoog <- WebCorpus(GoogleNewsSource(params=list(hl="en", q=searchTerm,
ie="utf-8", num=10, output="rss" )))
headers <- meta(corpusGoog,tag="datetimestamp")
If you're looking for a data frame-like structure, this is how you'd go about creating it (note: not all fields are extracted from the corpus):
library(dplyr)
make_row <- function(elem) {
data.frame(timestamp=elem[[2]]$datetimestamp,
heading=elem[[2]]$heading,
description=elem[[2]]$description,
content=elem$content,
stringsAsFactors=FALSE)
}
dat <- bind_rows(lapply(corpusGoog, make_row))
str(dat)
## Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 10 obs. of 4 variables:
## $ timestamp : POSIXct, format: "2015-02-03 13:08:16" "2015-01-11 23:37:45" ...
## $ heading : chr "A guide to data mining with Hadoop - Information Age" "Barack Obama to seek limits on student data mining - Politico" "Is data mining riddled with risk or a natural hazard of the internet? - INTHEBLACK" "Why an obscure British data-mining company is worth $3 billion - Quartz" ...
## $ description: chr "Information AgeA guide to data mining with HadoopInformation AgeWith the advent of the Internet of Things and the transition fr"| __truncated__ "PoliticoBarack Obama to seek limits on student data miningPoliticoPresident Barack Obama on Monday is expected to call for toug"| __truncated__ "INTHEBLACKIs data mining riddled with risk or a natural hazard of the internet?INTHEBLACKData mining is now viewed as a serious"| __truncated__ "QuartzWhy an obscure British data-mining company is worth $3 billionQuartzTesco, the troubled British retail group, is starting"| __truncated__ ...
## $ content : chr "A guide to data mining with Hadoop\nHow businesses can realise and capitalise on the opportunities that Hadoop offers\nPosted b"| __truncated__ "By Stephanie Simon\n1/11/15 6:32 PM EST\nPresident Barack Obama on Monday is expected to call for tough legislation to protect "| __truncated__ "By Adam Courtenay\nData mining is now viewed as a serious security threat, but with all the hype, s"| __truncated__ "How We Buy\nJanuary 12, 2015\nTesco, the troubled British retail group, is starting over. After an accounting scandal , a serie"| __truncated__ ...
Then, you can do anything you want. For example:
dat %>%
arrange(timestamp) %>%
select(heading) %>%
head
## Source: local data frame [6 x 1]
##
## heading
## 1 The potential of fighting corruption through data mining - Transparency International (pre
## 2 Barack Obama to seek limits on student data mining - Politico
## 3 Why an obscure British data-mining company is worth $3 billion - Quartz
## 4 Parks and Rec Recap: Treat Yo Self to Some Data Mining - Indianapolis Monthly
## 5 Fraud and data mining in Vancouver…just Outside the Lines - Vancouver Sun (blog)
## 6 'Parks and Rec' Data-Mining Episode Was Eerily True To Life - MediaPost Communications
If you want/need something else, you need to be clearer in your question.
I was looking at the Google query string and noticed they pass startdate and enddate tags in the query if you click dates on the right-hand side of the page.
You can use the same tag names, and your results will be confined to within the start and end dates.
GoogleNewsSource(query, params = list(hl = "en", q = query, ie = "utf-8",
                                      start = 0, num = 25, output = "rss",
                                      startdate = '2015-10-26', enddate = '2015-10-28'))
