I'm new to R and have a JSON file, scraped in the format shown below, that I'm hoping to convert to an R data frame.
The sample entries below (see the update) show where the data was scraped incorrectly: no commas were inserted to separate records. I've tried reading the data in with scan and splitting it into a list (to then read into a data frame) with this code:
indices <- grep(":[{", x, fixed = TRUE)
n <- length(indices)
l <- vector("list", n)
for (i in 1:n) {
  ps <- substr(x, indices[[i]], indices[i + 1])  # slice from the start of entry i to the start of entry i + 1
  l[[i]] <- ps
}
But I am getting empty strings and NaN values. I've tried parsing with jsonlite, tidyjson, and rjson, without any luck (which makes sense, since the JSON is malformed). This article seems to match my JSON's structure, but the solution isn't working because of the missing commas. How would I insert a comma before every instance of {"entries":[ in R when the file is read in as one string?
UPDATE: first, second and third entries
{"entries":[{"url":"/leonardomso/playground","name":"playground","lang":"TypeScript","desc":"Playground using React, Emotion, Relay, GraphQL, MongoDB.","stars":5,"forks":"2","updated":"2021-03-24T09:35:44Z","info":["react","reactjs","graphql","typescript","hooks","apollo","boilerplate","!DOCTYPE html \"\""],"repo_url":"/leonardomso?tab=repositories"}
{"entries":[{"url":"/leonardomso/playground","name":"playground","lang":"TypeScript","desc":"Playground using React, Emotion, Relay, GraphQL, MongoDB.","stars":5,"forks":"2","updated":"2021-03-24T09:35:44Z","info":["react","reactjs","graphql","typescript","hooks","apollo","boilerplate","!DOCTYPE html \"\""],"repo_url":"/leonardomso?tab=repositories"}
{"entries":[{"url":"/shiffman/Presentation-Manager","name":"Presentation-Manager","lang":"JavaScript","desc":"Simple web app to manage student presentation schedule.","stars":17,"forks":"15","updated":"2021-01-19T15:28:55Z","info":[]},{"desc":"","stars":null,"forks":"","info":[]},{"url":"/shiffman/A2Z-F20","name":"A2Z-F20","lang":"JavaScript","desc":"ITP Course Programming from A to Z Fall 2020","stars":40,"forks":"31","updated":"2020-12-21T13:52:58Z","info":[]},{"desc":"","stars":null,"forks":"","info":[]},{"desc":"","stars":null,"forks":"","info":[]},{"url":"/shiffman/RunwayML-Object-Detection","name":"RunwayML-Object-Detection","lang":"JavaScript","desc":"Object detection model with RunwayML, node.js, and p5.js","stars":16,"forks":"2","updated":"2020-11-15T23:36:36Z","info":[]},{"url":"/shiffman/ShapeClassifierCNN","name":"ShapeClassifierCNN","lang":"JavaScript","desc":"test code for new tutorial","stars":11,"forks":"1","updated":"2020-11-06T15:02:26Z","info":[]},{"url":"/shiffman/Bot-Code-of-Conduct","name":"Bot-Code-of-Conduct","desc":"Code of Conduct to guide ethical bot making practices","stars":15,"forks":"1","updated":"2020-10-15T18:30:26Z","info":[]},{"url":"/shiffman/Twitter-Bot-A2Z","name":"Twitter-Bot-A2Z","lang":"JavaScript","desc":"New twitter bot examples","stars":26,"forks":"2","updated":"2020-10-13T16:17:45Z","info":["hacktoberfest","!DOCTYPE html \"\""],"repo_url":"/shiffman?tab=repositories"}
You can use
gsub('}{"entries":[', '},{"entries":[', x, fixed=TRUE)
So, this is a plain replacement of every }{"entries":[ with },{"entries":[.
Note the fixed=TRUE argument, which disables regex interpretation of the pattern; the braces and brackets would otherwise be treated as regex metacharacters.
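Putting it together, here is a minimal sketch of the whole repair, assuming the file has been read into a single string and that each record is otherwise valid JSON (the two-record string below is a made-up stand-in for the real data):
library(jsonlite)
# Stand-in for the scraped file read in as one string,
# e.g. x <- paste(readLines("scraped.json"), collapse = "")
x <- '{"entries":[{"name":"a"}]}{"entries":[{"name":"b"}]}'
# Insert the missing commas between consecutive records
fixed <- gsub('}{"entries":[', '},{"entries":[', x, fixed = TRUE)
# Wrap the records in [ ] so they form a JSON array, then parse
fromJSON(paste0("[", fixed, "]"))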
I am using OpenCPU and R to create a web API that takes in some inputs and returns a topoJSON file from a database, as well as some other information. OpenCPU automatically pushes the output through toJSON, which results in JSON output that has quoted JSON in it (i.e., the topoJSON). This is obviously not ideal, especially since it then gets incredibly cluttered with backslash-escaped quotes (\"). I tried using fromJSON to convert it to an R object, which could then be converted back (which is incredibly inefficient), but it returns a slightly different syntax and the result is that it doesn't work.
I feel like there should be some way to convert the string to some other type of object that results in toJSON calling a different handler that tells it to just leave it alone, but I can't figure out how to do that.
> s <- '{"type":"Topology","objects":{"map": "0"}}'
> fromJSON(s)
$type
[1] "Topology"
$objects
$objects$map
[1] "0"
> toJSON(fromJSON(s))
{"type":["Topology"],"objects":{"map":["0"]}}
That's just the beginning of the file (I replaced the actual map with "0"), and as you can see, brackets appeared around "Topology" and "0". Alternatively, if I just keep it as a string, I end up with this mess:
> toJSON(s)
["{\"type\":\"Topology\",\"objects\":{\"0000595ab81ec4f34__csv\": \"0\"}}"]
Is there any way to fix this so that I just get the verbatim string, without the wrapping quotes and backslashes?
EDIT: Note that because I'm using OpenCPU, the output needs to come from toJSON (so no other function can be used, unfortunately), and I can't do any post-processing.
To me it seems you just want the values rather than vectors. Set auto_unbox=TRUE to turn length-one vectors into scalar values:
toJSON(fromJSON(s), auto_unbox = TRUE)
# {"type":"Topology","objects":{"map":"0"}}
That does print without escaping for me (using jsonlite_1.5). Maybe you are using an older version of jsonlite. You can also get around that by using cat() to print the result. You won't see the slashes when you do that.
cat(toJSON(fromJSON(s), auto_unbox = TRUE))
You can manually unbox the relevant entries:
library(jsonlite)
s <- '{"type":"Topology","objects":{"map": "0"}}'
j <- fromJSON(s)
j$type <- unbox(j$type)
j$objects$map <- unbox(j$objects$map)
toJSON(j)
# {"type":"Topology","objects":{"map":"0"}}
I'm using base::paste in a for loop:
for (k in 1:length(summary$pro)) {
  if (k == 1)
    mp <- summary$pro[k]
  else
    mp <- paste(mp, summary$pro[k], sep = ",")
}
mp comes out as one big string, where the elements are separated by commas.
For example mp is "1,2,3,4,5,6"
Then, I want to put mp in a file, where each of its elements is added to a separate column in the same row. My code for this is:
write.table(mp, file = recompdatafile, sep = ",")
However, mp just appears in the CSV as one big string as opposed to being divided up. How can I achieve my desired format?
FYI
I've also tried converting mp to a list and strsplit()-ing it, neither of which has worked.
Once I've added summary$pro to the file, how can I also add summary$me (which has the same format), in one row with multiple columns?
Thanks,
n.i.
If you want to write something to a file, write.table() isn't the only way. If you want to avoid headers and quotes and such, you can use the more direct cat. For example
cat(summary$pro, sep=",", file="filename.txt")
will write out the vector of values from summary$pro separated by commas more directly. You don't need to build a string first. (And building a string one element at a time as you did above is a bad practice anyway. Most functions in R can operate on an entire vector at a time, including paste).
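For instance, the whole loop above collapses to a single vectorized call, and the follow-up about summary$me can be handled the same way. A sketch, assuming summary$pro and summary$me are plain vectors of equal length:
# One vectorized call instead of the loop
mp <- paste(summary$pro, collapse = ",")
# Write pro and me as two rows, one element per column,
# without quotes, row names, or a header
write.table(rbind(summary$pro, summary$me), file = recompdatafile,
            sep = ",", row.names = FALSE, col.names = FALSE, quote = FALSE)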
I am trying to create a function that will calculate the frequency count of keywords using the TM package. The function works fine if the text passed to readline is free-form text without newlines. The problem is that when I paste a bunch of text copied from a spreadsheet, readline treats each spreadsheet row as a new line and stops at the first one.
keyword <- function() {
  x <- readline(as.character('Input text here: '))
  x <- Corpus(VectorSource(x))
  ...
  tdm <- TermDocumentMatrix(x)
  ...
  tdm
}
Here's the full code: https://github.com/CSCDataAnalytics/PM-Analysis/blob/master/Keyword.R
How can I prevent this from happening, or at least treat all the rows pasted from the spreadsheet as a single vector?
If I'm understanding you correctly, the problem is when the user pastes the text from another application: the newline is causing R to stop accepting the subsequent lines.
One technique (fragile as it may be) is to look for a specific sentinel line, such as an empty line "" or a period ".". It's a little fragile because you need (1) assurance that the data will "never" include that as a whole line, and (2) a sentinel the user can easily append when done.
Try:
endofinput <- ""
totalstr <- ""
while (endofinput != (x <- readline('prompt (empty string when done): ')))
  totalstr <- paste(totalstr, x)
In this case, the empty string is the catch, and when the while loop is done, totalstr contains all input separated by a space (this can be changed in the paste function).
NB: one problem with this technique is that it is "growing" the string totalstr, which will eventually incur performance penalties (depending on the size of the input data): on every loop iteration, more memory is allocated and the entire string is copied, plus the new line of text. There are more verbose ways to side-step this problem (e.g., pre-allocate a vector larger than your anticipated input data), but if you aren't anticipating 1000s of lines then you may be able to accept this naive programming for simplicity.
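For reference, a rough sketch of that pre-allocation idea (the capacity of 1000 is an arbitrary assumption):
lines <- character(1000)  # pre-allocated capacity; R grows it only if exceeded
i <- 0
while (nzchar(x <- readline('prompt (empty string when done): '))) {
  i <- i + 1
  lines[i] <- x
}
totalstr <- paste(lines[seq_len(i)], collapse = " ")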
Another option would be to have the user save the data to a text file and use file.choose() and readLines() to get your data.
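That route might look like this (a sketch; the user picks the saved file interactively, and all its rows are treated as one document):
txt <- paste(readLines(file.choose()), collapse = " ")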
Try collapsing the data into a single string after using readline
x <- paste(readline(as.character('Input text here: ')), collapse=' ')
Setting:
I have (simple) .csv and .dat files created from laboratory devices and other programs, storing information on measurements or calculations. I have found solutions for other languages, but not for R.
Problem:
Using R, I am trying to extract values to quickly display results without opening the created files. Here I have two typical settings:
a) I need to read a priori unknown values after known key words
b) I need to read lines after known key words or lines
I can't make functions such as scan() and grep() work.
c) Finally, I would like to loop over dozens of files in a folder and produce a summary (to make the picture complete: I will manage this part myself)
I would appreciate any form of help.
OK, it works for the key value (although it's perhaps not very nice):
variable <- scan("file.csv", what = character(), sep = "")
returns a character vector of everything.
variable[grep("keyword", variable) + 2]  # +2 as the actual value is stored two places ahead
returns the sought values as characters.
as.numeric(sapply(variable, gsub, pattern = ",", replacement = "."))
For completeness: the data had to be converted to numbers, and the comma-versus-period decimal separator problem needed to be solved.
In one line:
data <- as.numeric(sapply(variable[grep("Ks_Boden", variable) + 2], gsub, pattern = ",", replacement = "."))
Perseverance is not too bad of an asset ;-)
The rest isn't finished yet; I will post once it is.
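In the meantime, an untested sketch for setting b), reading whole lines that follow a known marker line ("file.dat" and the marker text are made-up placeholders):
lines <- readLines("file.dat")
# Assumes the marker occurs exactly once in the file
start <- grep("known key words", lines, fixed = TRUE) + 1
lines[start:(start + 4)]  # e.g. the five lines following the marker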
I'm all new to scraping and I'm trying to understand xpath using R. My objective is to create a vector of people from this website. I'm able to do it using:
r <- htmlTreeParse(e)  # e is the page source from getURL
g.k <- r[[3]][[1]][[2]][[3]][[2]][[2]][[2]][[1]][[4]]
l <- g.k[names(g.k) == "text"]
u <- ldply(l, function(x) {
  w <- xmlValue(x)
  return(w)
})
However this is cumbersome, and I'd prefer to use xpath. How do I go about referencing the path detailed above? Is there a function for this, or can I submit my path somehow referenced as above?
I've come to
xpathApply( htmlTreeParse(e, useInt=T), "//body//text//div//div//p//text()", function(k) xmlValue(k))->kk
But this leaves me a lot of cleaning up to do and I assume it can be done better.
Regards,
//M
EDIT: Sorry for the lack of clarity, but I'm all new to this and rather confused. The XML document is unfortunately too large to paste here. I guess my question is whether there is some easy way to find the names of these nodes and the structure of the document, besides using view source? I've come a little closer to what I'd like:
getNodeSet(htmlTreeParse(e, useInt=T), "//p")[[5]]->e2
gives me the list of what I want, although still as XML with br tags. I thought running
xpathApply(e2, "//text()", function(k) xmlValue(k))->kk
would provide a list that could later be unlisted. However, it provides a list with more garbage than e2 displays.
Is there a way to do this directly:
xpathApply(htmlTreeParse(e, useInt=T), "//p[5]//text()", function(k) xmlValue(k))->kk
Link to the web page below; I'm trying to get the names, and only the names, from the page.
getURL("http://legeforeningen.no/id/1712")
I ended up with
xml = htmlTreeParse("http://legeforeningen.no/id/1712", useInternalNodes=TRUE)
(no need for RCurl) and then
sub(",.*$", "", unlist(xpathApply(xml, "//p[4]/text()", xmlValue)))
(subset in xpath) which leaves a final line that is not a name. One could do the text processing in XML, too, but then one would iterate at the R level.
n <- xpathApply(xml, "count(//p[4]/text())") - 1L
sapply(seq_len(n), function(i) {
  xpathApply(xml, sprintf('substring-before(//p[4]/text()[%d], ",")', i))
})
Unfortunately, this does not pick up names that do not contain a comma.
Use a mixture of xpath and string manipulation.
# Retrieve and parse the page.
library(XML)
library(RCurl)
page <- getURL("http://legeforeningen.no/id/1712")
parsed <- htmlTreeParse(page, useInternalNodes = TRUE)
Inspecting the parsed variable which contains the page's source tells us that instead of sensibly using a list tag (like <ul>), the author just put a paragraph (<p>) of text split with line breaks (<br />). We use xpath to retrieve the <p> elements.
# Inspection tells us we want the fifth paragraph.
name_nodes <- xpathApply(parsed, "//p")[[5]]
Now we convert to character, split on the <br> tags and remove empty lines.
all_names <- as(name_nodes, "character")
all_names <- gsub("</?p>", "", all_names)
all_names <- strsplit(all_names, "<br />")[[1]]
all_names <- all_names[nzchar(all_names)]
all_names
Optionally, separate the names of people and their locations.
strsplit(all_names, ", ")
Or more prettily with stringr.
str_split_fixed(all_names, ", ", 2)
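For reference, str_split_fixed() returns a character matrix with one row per name; the entries below are made-up stand-ins, since the real page content isn't reproduced here:
library(stringr)
str_split_fixed(c("Nordmann Ola, Oslo", "Hansen Kari, Bergen"), ", ", 2)
#      [,1]           [,2]
# [1,] "Nordmann Ola" "Oslo"
# [2,] "Hansen Kari"  "Bergen"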