xml created from dataframe in R does not open in explorer - r

I have just learned about XML and want to turn a dataframe in R into XML so that it can be used for corpus analysis. I am not using the conventional tagging labels because of the nature of the data. The output does look right in the R console, but when I open the .xml file in Explorer to check whether the XML is right, nothing shows up... Does that mean my XML is not right?
Here is my R code:
comments$text<-gsub("&","&amp",comments$text)
comments$text<-gsub("<","&lt",comments$text)
comments$text<-gsub(">","&gt",comments$text)
comments$text<-gsub("\n"," <n/> ",comments$text)
top= newXMLNode("CWII Step 1_1")
o=newXMLNode("conversation",parent=top)
kids = lapply(comments$text[comments$parent_group==comments$parent_group[3]],
function(x)
newXMLNode("u",x,))
addChildren(o,kids)
saveXML(top, file="CWII Step 1_1.xml")
This is what is shown in the R console:
<CWII Step 1_1>
<conversation>
<u>Hello!!! I am a person from India !!!!</u>
<u>Hi . When I think of your country India I i.</u>
</conversation>
</CWII Step 1_1>
It seems like XML to me, but unfortunately nothing shows up when I open the .xml file in Explorer, although it displays fine in Notepad with all the tags.
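Two details in the code above are worth checking. The entity replacements are missing their trailing semicolons (they must be &amp;, &lt;, &gt;), and XML element names may not contain spaces, so <CWII Step 1_1> is not a well-formed name; either problem is enough for a browser to refuse to render the file. A minimal base-R sketch of the escaping step, semicolons included:

```r
# Escape the three XML metacharacters. "&" must be replaced first so the
# ampersands introduced by the other replacements are not escaped again.
escape_xml <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  x
}
escape_xml("1 < 2 & 3 > 2")
# "1 &lt; 2 &amp; 3 &gt; 2"
```

A root element name without spaces, e.g. newXMLNode("CWII_Step_1_1"), should then let the browser display the document.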

Related

How to import an ASCII text file into R - the NIBRS

Currently, I am trying to import the National Incident-Based Reporting System (NIBRS) 2019 data into R. The data comes in an ASCII text format, and so far I've tried readr::read_tsv and readr::read_fwf. However, I can't seem to import the data correctly: read_tsv produces only one column, while read_fwf needs column-position arguments that I don't know how to work out from the text file.
Here is the link to the NIBRS. I used the Master File Downloads to download the zipped file for the NIBRS in 2019.
My overall goal is to have a typical dataframe/tibble for this data set with column names being the type of crime, and the rows being the number of incidents.
I have seen a few other examples of importing this data through this help page, but their copies of the data only cover up to 2015 (my data needs to range from 2015 to 2019).
Use read.fwf(). Column widths are listed here
We can use read_fwf() with col_positions, which expects a fwf_widths() or fwf_positions() specification rather than a bare vector:
library(readr)
read_fwf(filename, col_positions = fwf_widths(c(2, 5, 10)))
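For comparison, here is a self-contained sketch with base R's read.fwf() on a mock file; the widths and column names are purely illustrative, and the real layout has to come from the NIBRS codebook:

```r
# Parse a tiny mock fixed-width file; widths and names are made up for this demo.
f <- tempfile(fileext = ".txt")
writeLines(c("01AB12345",
             "02CD67890"), f)
d <- read.fwf(f, widths = c(2, 2, 5),
              col.names = c("segment", "code", "incident"),
              colClasses = "character")
d$incident
# "12345" "67890"
```

A negative entry in widths skips characters you do not need, e.g. widths = c(2, -2, 5).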

How can I open a .json.gz format file in R?

I am a Data Science student writing my thesis using product review data. However, this is packed in a .gz file.
The file name when downloaded is 'xxx.json.gz' and when I look into the properties it says the type of file is gz Archive (.gz), Opens with 7-Zip File Manager.
I found the following code:
z <- gzfile("xxx.json.gz")
data = read.csv(z)
But the object 'data' is now a list. All columns are factors, and the column with the review text is not right at all. I think the read.csv() part is wrong, since it's supposed to be a JSON file.
Does anyone have a solution? I also have the URL of the data if that's better to use: http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Electronics_5.json.gz
Loading it at the moment, I got 5,152,500 records so far; it is probably the review text that is slowing it down.
library(jsonlite)
happy_data <- stream_in(
  gzcon(
    url("http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Electronics_5.json.gz")
  )
)
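stream_in() expects newline-delimited JSON, one record per line. A base-R sketch that checks this on a small mock file first (substitute your local xxx.json.gz for the temp file):

```r
# Build a tiny mock .json.gz, then peek at its first lines through gzfile();
# each line should be a complete JSON object for stream_in() to work.
f <- tempfile(fileext = ".json.gz")
con <- gzfile(f, "w")
writeLines(c('{"overall": 5, "reviewText": "great"}',
             '{"overall": 1, "reviewText": "broke"}'), con)
close(con)
first_lines <- readLines(gzfile(f), n = 2)
startsWith(first_lines, "{")
# TRUE TRUE
```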

automatic update of filename for read.csv in r

I run a monthly data import process in R, using something similar to this:
Data <- read.csv("c:/Data/March 2018 Data.csv")
However, I want to fully automate the process and, hence, find a way to change the date of the file being uploaded, in this case 'March 2018', using a variable from a lookup table. This lookup table is changed every month externally and the Date variable, which indicates the month of production, is updated during this.
I've tried the paste() function, but didn't get very far:
Data <- read.csv(paste("C:/Data Folder",Date,"Data.csv"))
It keeps saying "No such file or directory" (Error in file). I've checked that the file names and path are fine. The only issue I'm detecting is that the path in the error appears like this:
'c:/Data folder/ March 2018 Data.csv'
I'm not sure if that extra space is the issue.
Any ideas?
Thanks to both bobbel and jalazbe for this solution.
I used paste0(), which adds no separator, so the slash after the folder and the space before "Data.csv" have to be written explicitly:
Data <- read.csv(paste0("C:/Data Folder/", Date, " Data.csv"))
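file.path() makes the separators explicit instead of relying on paste() defaults; a small sketch assuming the folder and file pattern from the question:

```r
# Build "C:/Data Folder/March 2018 Data.csv" piece by piece; in the real
# process, Date would come from the lookup table.
Date <- "March 2018"
path <- file.path("C:/Data Folder", paste0(Date, " Data.csv"))
path
# "C:/Data Folder/March 2018 Data.csv"
```

file.exists(path) is a cheap sanity check before calling read.csv(path).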

PowerBI: How to save result of R script?

Is it possible to implement the following scenario in Power BI Desktop?
Load data from Excel file to several tables
Make calculation with R script from several data sources
Store results of calculation to new table in Power BI (.pbix)
The idea is to use Power BI Desktop to solve a "transportation problem" with linear programming in R. Before the solver runs, we need to transform data from several sources. I'm new to Power BI. I see that it is possible to apply R scripts for loading, transforming and visualizing data, but I need a way to save the results of a calculation for subsequent visualization by the regular means of Power BI. Is that possible?
As I mentioned in my comment, this post would have solved most of your challenges. That approach replaces one of the tables with a new one after the R script, but you're specifically asking to produce a new table, presumably leaving the input tables untouched. I've recently written a post where you can do this using Python in the Power Query Editor. The only difference in your case would be the R script itself.
Here's how I would do it with an R script:
Data samples:
Table1
Date,Value1
2018-10-12,1
2018-10-13,2
2018-10-14,3
2018-10-15,4
2018-10-16,5
Table2
Date,Value2
2018-10-12,10
2018-10-13,11
2018-10-14,12
2018-10-15,13
2018-10-16,14
Power Query Editor:
With these tables loaded either from Excel or CSV files, you've got this setup in the Power Query Editor:
Now you can follow these steps to get a new table using an R script:
1. Change the data type of the Date column to Text.
2. Click Enter Data and click OK to get an empty table named Table3 by default.
3. Select the Transform tab and click Run R Script to open the Run R Script editor.
4. Leave it empty and click OK.
5. Remove = R.Execute("# 'dataset' holds the input data for this script",[dataset=#"Changed Type"]) from the Formula Bar and insert this: = R.Execute("# R Script:",[df1=Table1, df2=Table2]).
6. If you're prompted to do so, click Edit Permission and Run.
7. Click the gear symbol next to Run R Script under APPLIED STEPS and insert the following snippet:
R script:
df3 <- merge(x = df1, y = df2, by = "Date", all.x = TRUE)
df3$Value3 <- df3$Value1 + df3$Value2
This snippet produces a new dataframe, df3, by joining df1 and df2 on Date and adding a new column, Value3. This is a very simple setup, but now you can do pretty much anything by swapping out the join and calculation methods.
8. Click Home > Close & Apply to get back to Power BI Desktop. (Consider changing the data type of the Date column in Table3 from Text to Date first, depending on how you'd like your tables, charts and slicers to behave.)
9. Insert a simple table to make sure everything went smoothly.
I hope this was exactly what you were looking for. Let me know if not and I'll take another look at it.
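The step 7 snippet can also be checked outside Power BI. A self-contained run on the sample tables (dates kept as text, as after step 1):

```r
# Reproduce Table1 and Table2, left-join them on Date, then add Value3.
df1 <- data.frame(Date = c("2018-10-12", "2018-10-13", "2018-10-14",
                           "2018-10-15", "2018-10-16"),
                  Value1 = 1:5)
df2 <- data.frame(Date = df1$Date, Value2 = 10:14)
df3 <- merge(x = df1, y = df2, by = "Date", all.x = TRUE)
df3$Value3 <- df3$Value1 + df3$Value2
df3$Value3
# 11 13 15 17 19
```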

Parsing XML in R always returns an XML declaration error

I am new to XML.
I downloaded an XML file called ipg140722 from Google (http://www.google.com/googlebooks/uspto-patents-grants-text.html). I am using Windows 8.1 and R 3.1.1.
library(XML)
url <- "E:\\clouddownload\\R-download\\ipg140722.xml"
indata <- xmlTreeParse(url)
error: 1: XML declaration allowed only at the start of the document
2: Extra content at the end of the document
What is the problem?
Note: This post is edited from the original version.
The object lesson here is that just because a file has an xml extension does not mean it is well formed XML.
If @MartinMorgan is correct about the file, Google seems to have taken all the patents approved during the week of 2014-07-22 (last week), converted them to XML, strung them together into a single text file, and given that an xml extension. Clearly this is not well-formed XML. So the challenge is to deconstruct that file. Here is a way to do it in R.
lines <- readLines("ipg140722.xml")
start <- grep('<?xml version="1.0" encoding="UTF-8"?>',lines,fixed=T)
end <- c(start[-1]-1,length(lines))
library(XML)
get.xml <- function(i) {
  txt <- paste(lines[start[i]:end[i]], collapse = "\n")
  # print(i)
  xmlTreeParse(txt, asText = TRUE)
  # return(i)
}
docs <- lapply(1:10,get.xml)
class(docs[[1]])
# [1] "XMLInternalDocument" "XMLAbstractDocument"
So now docs is a list of parsed XML documents. These can be accessed individually as, e.g., docs[[1]], or collectively using something like the code below, which extracts the invention title from each document.
sapply(docs,function(doc) xmlValue(doc["//invention-title"][[1]]))
# [1] "Phallus retention harness" "Dress/coat"
# [3] "Shirt" "Shirt"
# [5] "Sandal" "Shoe"
# [7] "Footwear" "Flexible athletic shoe sole"
# [9] "Shoe outsole with a surface ornamentation contrast" "Shoe sole"
And no, I did not make up the name of the first patent.
Response to OP's comment
My original post, which detected the start of a new document using:
start <- grep("xml version",lines,fixed=T)
was too naive: it turns out the phrase "xml version" appears in the text of some of the patents, so this was breaking (some of) the documents prematurely, resulting in malformed XML. The code above fixes that problem. If you uncomment the two lines in the function get.xml(...) and run the code above with
docs <- lapply(1:length(start),get.xml)
you will see that all 6961 documents parse correctly.
But there is another problem: the parsed XML is very large, so if you leave these lines as comments and try to parse the full set, you run out of memory about half way through (or I did, on an 8GB system). There are two ways to work around this. The first is to do the parsing in blocks (say 2000 documents at a time). The second is to extract whatever information you need for your CSV file in get.xml(...) and discard the parsed document at each step.
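The first workaround can be sketched with base R alone; the blocking logic is independent of the XML parsing, and process_block below is a placeholder for the parse-and-extract step (lapply(idx, get.xml) plus whatever fields you pull out):

```r
# Split 6961 document indices into blocks of at most 2000 and process each
# block in turn, so only one block's worth of parsed XML is alive at a time.
block_size <- 2000
n_docs <- 6961
blocks <- split(seq_len(n_docs), ceiling(seq_len(n_docs) / block_size))
process_block <- function(idx) length(idx)  # placeholder: parse + extract here
results <- vapply(blocks, process_block, integer(1))
unname(results)
# 2000 2000 2000 961
```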
