I am a Data Science student writing my thesis using product review data. However, the data comes packed in a .gz file.
The downloaded file is named 'xxx.json.gz', and its properties say the file type is gz Archive (.gz), which opens with 7-Zip File Manager.
I found the following code:
z <- gzfile("xxx.json.gz")
data = read.csv(z)
But the object 'data' is now a list, all columns are factors, and the column with the review text is not right at all. I think the read.csv() part is wrong, since it's supposed to be a JSON file.
Does anyone have a solution? I also have the URL of the data if that's better to use: http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Electronics_5.json.gz
Loading it at the moment: I have 5,152,500 records so far; it is probably the review text that is slowing it down.
library(jsonlite)
happy_data <- stream_in(
  gzcon(
    url("http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Electronics_5.json.gz")
  )
)
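If you prefer to read the copy you already downloaded instead of streaming it over the network, a minimal sketch (assuming xxx.json.gz is newline-delimited JSON, one review per line, which is how these Amazon review dumps are distributed) would be:

library(jsonlite)

# gzfile() decompresses on the fly; stream_in() parses one JSON record per line
# into a data frame, so the review text stays in a single character column.
reviews <- stream_in(gzfile("xxx.json.gz"))
str(reviews)  # inspect the columns, including the review text

The same approach is what the answer above uses for the URL version; gzcon(url(...)) simply does the decompression over the connection instead of from disk.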
In my job I have to perform some analytics on data shared by an external organisation through user access granted on a web portal. Various reports are available there, which I can view and download in many formats. Two of these formats are very useful, namely MS Excel and 'XML file with report data'. The Excel file is normally heavily formatted (with sub-totals, merged cells, etc.) to suit the needs of Excel users, and converting it to a data frame/table is normally a big hassle. I therefore prefer to download the 'xml' file, parse it, save it as CSV, and then carry out my analysis in R.
However, whenever I try to parse the xml file directly into R (to avoid the intervening convert-to-csv step) I never succeed. So far I have tried the XML and xml2 libraries in R, but to no avail.
Recently I tried this code.
library("XML")
library("methods")
setwd("C:\\Users\\Administrator\\Desktop\\")
res <- xmlParse("Skil.xml")
> res <- xmlParse("Skil.xml")
xmlns: URI RptSancDig_VoucherCompilationSheet is not absolute
rootnode <- xmlRoot(res)
rootsize <- xmlSize(rootnode)
> rootsize
[1] 2
xmldataframe <- xmlToDataFrame("Skil.xml")
> xmldataframe <- xmlToDataFrame("Skil.xml")
xmlns: URI RptSancDig_VoucherCompilationSheet is not absolute
> xmldataframe
Textbox24 Textbox63 DDOName_Collection
1 <NA> <NA> <NA>
2
Just to mention: Skil.xml is about 12.1 MB and is parsed successfully in Excel.
I have also tried the read_xml() function from xml2, but to no avail.
I would have happily shared a sample file, but I am unable to do so. Moreover, I am also unable to generate a sample file in that kind of xml format.
Can someone help?
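Not something that can be verified without the file, but since the warning complains about a relative default namespace (typical of the 'XML file with report data' that report portals export), one thing worth trying is xml2 with the namespace stripped before querying. The element name Detail and the attribute-based layout below are guesses based on the output shown above and would need to be adjusted to the real structure:

library(xml2)

doc <- read_xml("Skil.xml")
xml_ns_strip(doc)  # drop the relative default namespace that breaks XPath lookups

# Assumed structure: repeated row elements whose values are stored as attributes
rows <- xml_find_all(doc, ".//Detail")   # replace Detail with the actual row element
df <- do.call(rbind, lapply(rows, function(r)
  as.data.frame(as.list(xml_attrs(r)), stringsAsFactors = FALSE)))
head(df)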
I am learning Python (using 3.5). I realize I will probably take a bit of heat for posting my question. Here goes: I have literally reviewed several hundred posts, help docs, etc., all in an attempt to construct the code I need. No luck thus far. I hope someone can help me. I have a set of URLs, say 18 or more. Only 2 are illustrated here:
[1] "http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/111915.html"
[2] "http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/092215.htm"
I need to scrape all the data (text) behind each URL and write it out to individual text files (one for each URL) for future topic-model analysis. Right now, I pull in the URLs through R using rvest. I then take each URL (one at a time, by code) into Python and do the following:
from urllib.request import urlopen
from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen('http://www.senate.mo.gov/media/14info/chappelle-nadal/Columns/012314-Condensed.html').read())
txt = soup.find('div', {'class': 'body'})
print(soup.get_text())
#print(soup.prettify()) not much help
#store the info in an object, then write out the object
test = soup.get_text()
#below does write a file
#how to take my BS object and get it in
open_file = open('23Jan2014cplNadal1.txt', 'w')
open_file.write(test)
open_file.close()
The above gets me partway to my target. It leaves me just a little clean-up on the text, but that's okay. The problem is that it is labor intensive.
Is there a way to:
1. Write a clean text file (without invisibles, etc.) out from R with all the listed URLs? (A short sketch for this is shown below.)
2. For Python 3.5: take all the URLs, once they are in a clean single file (the clean text file, one URL per line), and have some iterative process retrieve the text behind each URL and write out a text file for each URL's data (text) to a location on my hard drive?
I have to do this process for approximately 1000 state-level senators. Any help or direction is greatly appreciated.
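For the first part, assuming urls is the character vector of links already pulled in through rvest, one minimal sketch is just writeLines(), which writes plain text with one element per line and no quoting or row names:

# urls as pulled in through rvest (two of the links shown above)
urls <- c("http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/111915.html",
          "http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/092215.htm")

writeLines(urls, "url_list.txt")  # one URL per line, ready to feed to a loop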
Edit to original: Thank you so much, all. To N. Velasquez: I tried the following:
urls <- c("http://www.senate.mo.gov/media/14info/Chappelle-Nadal/releases/120114.html",
          "http://www.senate.mo.gov/media/14info/Chappelle-Nadal/releases/110614.htm")

for (url in urls) {
  download.file(url, destfile = basename(url), method = "curl", mode = "w", extra = "-k")
}
HTML files are then written out to my working directory. However, is there a way to write out text files instead of HTML files? I've read the download.file documentation and can't seem to figure out a way to push out individual text files. Regarding the suggestion of a for loop: is what I illustrate above what you mean for me to attempt? Thank you!
The answer to 1 is: sure!
The following code will loop through the list of URLs and export atomic TXT files, as per your request.
Note that through rvest and html_node() you could get a much more structured dataset, with recurring parts of the html stored separately (header, office info, main body, URL, etc.).
library(rvest)

urls <- c("http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/111915.html",
          "http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/092215.htm")

ht <- list()
for (i in 1:length(urls)) {
  # grab the main content node of each page and keep only its text
  ht[[i]] <- html_text(html_node(read_html(urls[i]), xpath = '//*[@id="mainContent"]'), trim = TRUE)
  ht[[i]] <- gsub("[\r\n]", "", ht[[i]])
  writeLines(ht[[i]], paste("DOC_", i, ".txt", sep = ""))
}
Look for the DOC_1.txt and DOC_2.txt in your working directory.
This is my first time working with XML data, and I'd appreciate any help/advice you can offer!
I'm working on pulling some data that is stored on AWS in a collection of XML files. I have an index file that contains a list of the ~200,000 URLs where the XML files are hosted. I'm currently using the XML package in R to loop through each URL and pull the data from the node I'm interested in. This works fine, but with so many URLs the loop takes around 12 hours to finish.
Here's a simplified version of my code. The index file contains the list of URLs. The parsed XML files aren't very large (stored as dat in this example; R tells me they're 432 bytes). I've put NodeOfInterest in as a placeholder for the spot where I'd normally list the XML tag I'd like to pull data from.
library(XML)

compiled_data <- data.frame()  # master table that the loop appends to

for (i in 1:200000) {
  url <- paste('http://s3.amazonaws.com/', index[i, 9], '_public.xml', sep = "")  ## create URL based off of index file
  dat <- xmlTreeParse(url, useInternalNodes = TRUE)    ## load entire XML file
  nodes <- getNodeSet(dat, "//x:NodeOfInterest", "x")  ## find nodes for the tag I'm interested in
  if (length(nodes) > 0 && exists("dat")) {
    dat2 <- xmlToDataFrame(nodes)                ## create data table from the nodes
    compiled_data <- rbind(compiled_data, dat2)  ## append to the master file
    rm(dat2)
  }
  print(i)
}
It seems like there must be a more efficient way to pull this data. I think the longest step (by far) is loading the XML into memory, but I haven't found anything that suggests another option. Any advice?
Thanks in advance!
If parsing the XML into a tree (in xmlTreeParse) is your bottleneck, maybe use a streaming interface like SAX, which will allow you to process only those elements that are useful for your application. I haven't used it, but the package xml2 is built on top of libxml2, which provides SAX ability.
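For what it's worth, the XML package the question already uses also has a SAX-style entry point, xmlEventParse(). I can't test it against these files, but a rough sketch of collecting the text inside each node of interest for a single file (keeping NodeOfInterest as the placeholder tag name) might look like:

library(XML)

values <- character()   # accumulates the text of every matching element
inside <- FALSE

handlers <- list(
  startElement = function(name, attrs, ...) {
    if (name == "NodeOfInterest") inside <<- TRUE
  },
  text = function(content, ...) {
    if (inside && nzchar(content)) values[length(values) + 1] <<- content
  },
  endElement = function(name, ...) {
    if (name == "NodeOfInterest") inside <<- FALSE
  }
)

# "one_public.xml" is a placeholder for one of the files listed in the index
xmlEventParse("one_public.xml", handlers = handlers)

Since SAX never builds the full tree, memory use stays flat; whether it actually beats xmlTreeParse here depends on how much of the 12 hours is parsing rather than the 200,000 HTTP requests themselves.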
Thank you in advance for your help. I am using R to analyse some data that is initially created in Matlab. I am using the package R.matlab, and it is fantastic for one file, but I am struggling to import multiple files.
The working script for a single file is as follows...
install.packages("R.matlab")
library(R.matlab)

x <- "folder_of_files"
path <- system.file("/home/ashley/Desktop/Save/2D Stream", package = "R.matlab")
pathname <- file.path(x, "Test0000.mat")
data1 <- readMat(pathname)
And this works fantastically. The format of my files is 'Name_0000.mat', where the name is constant between files and the 4 digits increase, but not necessarily by 1.
My attempt to load multiple files at once was along these lines...
for (i in 1:length(temp))
data1<-list()
{data1[[i]] <- readMat((get(paste(temp[i]))))}
And also in multiple other ways that included and excluded path and pathname from the loop, all of which give me the same error:
Error in get(paste(temp[i])) :
object 'Test0825.mat' not found
Where 0825 is my final file name. If you change the length of the loop, the error always names the final file.
I think the issue is that when it pastes the name, R looks for an object with that name, which does not yet exist, so I need to have the pasted text in quotation marks, yet I don't know how to do that.
Sorry this was such a long post. Many thanks.
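For what it's worth, a minimal sketch of one way around this (assuming the files all sit in 'folder_of_files' and follow the 'Test####.mat' pattern) is to let list.files() build the vector of paths and read each one with readMat(); no get() is needed, because the names are just character strings rather than objects:

library(R.matlab)

# full.names = TRUE returns complete paths so readMat() can open them directly
files <- list.files("folder_of_files", pattern = "^Test\\d{4}\\.mat$", full.names = TRUE)

# Read every file into one list and name the elements after the files
data_list <- lapply(files, readMat)
names(data_list) <- basename(files)

data_list[["Test0000.mat"]] then holds the same structure that data1 held in the single-file case.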
I am using QGIS software. I would like to show the value of each raster cell as a label.
My idea (I don't know of any plugin or built-in QGIS functionality that makes this easier) is to export the raster using gdal2xyz.py into a coordinates-value format and then save it as a vector file (GML or shapefile). For this second task, I tried to use
gdal_polygonize.py:

gdal_polygonize.py rainfXYZ.txt rainf.shp
Creating output rainf.shp of format GML.
0...10...20...30...40...50...60...70...80...90...100 - done.
Unfortunately I am unable to load the created file (even if I change the extension to .gml).
The ogr2ogr tool doesn't even recognize this format.
Yes, sorry, I forgot to add that information.
In general, after preparing the CSV file (using gdal2xyz.py with the -csv option),
I need to add one line at the beginning of it:
"Longitude,Latitude,Value" (without the quotes)
Then I need to create a VRT file which contains:

<OGRVRTDataSource>
  <OGRVRTLayer name="Shapefile_name">
    <SrcDataSource>Shapefile_name.csv</SrcDataSource>
    <GeometryType>wkbPoint</GeometryType>
    <GeometryField encoding="PointFromColumns" x="Longitude" y="Latitude"/>
  </OGRVRTLayer>
</OGRVRTDataSource>
Then I run the command "ogr2ogr -select Value Shapefile_name.shp Shapefile_name.vrt". I got the file evap_OBC.shp and two other associated files.
For the sake of archive completeness, this question has also been asked on the GDAL mailing list in the thread "save raster as point-vector file". It seems Chaitanya provided a solution for it.