I'm new to R. I have a JSON file, scraped in the following format, containing data I'm hoping to convert to an R data frame:
The data was scraped incorrectly, so no commas were inserted to separate entries. I've tried reading the data in with scan and splitting it into a list (to then read into a data frame) with this code:
indices <- grep(":[{", x, fixed = TRUE)
n <- length(indices)
l <- vector("list", n)
for (i in 1:n) {
  ps <- substr(x, indices[[i]], indices[i + 1])  ## where i is whatever your Ps is
  l[[i]] <- ps
}
But I am getting empty strings and NA values. I've tried parsing with jsonlite, tidyjson, and rjson, without any luck (which makes sense, since the JSON is malformed). This article seems to match my JSON's structure, but the solution isn't working because of the missing commas. How would I insert a comma before every instance of "{"entries":[" in R when the file is read in as one string?
UPDATE: first, second and third entries
{"entries":[{"url":"/leonardomso/playground","name":"playground","lang":"TypeScript","desc":"Playground using React, Emotion, Relay, GraphQL, MongoDB.","stars":5,"forks":"2","updated":"2021-03-24T09:35:44Z","info":["react","reactjs","graphql","typescript","hooks","apollo","boilerplate","!DOCTYPE html \"\""],"repo_url":"/leonardomso?tab=repositories"}
{"entries":[{"url":"/leonardomso/playground","name":"playground","lang":"TypeScript","desc":"Playground using React, Emotion, Relay, GraphQL, MongoDB.","stars":5,"forks":"2","updated":"2021-03-24T09:35:44Z","info":["react","reactjs","graphql","typescript","hooks","apollo","boilerplate","!DOCTYPE html \"\""],"repo_url":"/leonardomso?tab=repositories"}
{"entries":[{"url":"/shiffman/Presentation-Manager","name":"Presentation-Manager","lang":"JavaScript","desc":"Simple web app to manage student presentation schedule.","stars":17,"forks":"15","updated":"2021-01-19T15:28:55Z","info":[]},{"desc":"","stars":null,"forks":"","info":[]},{"url":"/shiffman/A2Z-F20","name":"A2Z-F20","lang":"JavaScript","desc":"ITP Course Programming from A to Z Fall 2020","stars":40,"forks":"31","updated":"2020-12-21T13:52:58Z","info":[]},{"desc":"","stars":null,"forks":"","info":[]},{"desc":"","stars":null,"forks":"","info":[]},{"url":"/shiffman/RunwayML-Object-Detection","name":"RunwayML-Object-Detection","lang":"JavaScript","desc":"Object detection model with RunwayML, node.js, and p5.js","stars":16,"forks":"2","updated":"2020-11-15T23:36:36Z","info":[]},{"url":"/shiffman/ShapeClassifierCNN","name":"ShapeClassifierCNN","lang":"JavaScript","desc":"test code for new tutorial","stars":11,"forks":"1","updated":"2020-11-06T15:02:26Z","info":[]},{"url":"/shiffman/Bot-Code-of-Conduct","name":"Bot-Code-of-Conduct","desc":"Code of Conduct to guide ethical bot making practices","stars":15,"forks":"1","updated":"2020-10-15T18:30:26Z","info":[]},{"url":"/shiffman/Twitter-Bot-A2Z","name":"Twitter-Bot-A2Z","lang":"JavaScript","desc":"New twitter bot examples","stars":26,"forks":"2","updated":"2020-10-13T16:17:45Z","info":["hacktoberfest","!DOCTYPE html \"\""],"repo_url":"/shiffman?tab=repositories"}
You can use
gsub('}{"entries":[', '},{"entries":[', x, fixed=TRUE)
So this is a plain-text replacement of every }{"entries":[ with },{"entries":[.
Note the fixed=TRUE argument, which disables regex interpretation of the pattern.
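Putting it together, here is a minimal sketch (with a toy two-record string standing in for the scraped file) that applies the replacement, wraps the result in [ ] so the whole thing becomes one valid JSON array, and parses it with jsonlite:

```r
library(jsonlite)

# Toy stand-in for the scraped file read in as one string
x <- '{"entries":[{"url":"/a","stars":5}]}{"entries":[{"url":"/b","stars":7}]}'

# Insert the missing commas between the back-to-back records
fixed <- gsub('}{"entries":[', '},{"entries":[', x, fixed = TRUE)

# Wrap in [ ] to form a valid JSON array, then parse to a data frame
df <- fromJSON(paste0("[", fixed, "]"))
str(df)
```

With your real file, `x` would come from something like `paste(readLines("file.json"), collapse = "")`.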
Let's say I have a list in R that contains other lists, and I want to remove certain characters (commas in the following example) from the elements of each list.
my.list <- list(c("hello , world ", "hello world,,," ),c("123,456", "1,234"))
The following gets the job done:
gsub(",", "", my.list[[1]])
gsub(",", "", my.list[[2]])
But how do I do this more efficiently, since my actual list is long? I tried the following, but it gives me strange results:
lapply(my.list, function(x) gsub(",","",my.list))
any help? thx
You might be able to use the function filter_element from the textclean package.
I would suggest sanitizing your data before combining it into a list. Once the data has been sanitized, then start entering it into lists.
You can check out the documentation for filter_element on page 10 of the following PDF:
https://cran.r-project.org/web/packages/textclean/textclean.pdf
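For reference, the base-R one-liner the question was reaching for also works, once the anonymous function passes `x` (each sub-vector) rather than `my.list` into gsub:

```r
my.list <- list(c("hello , world ", "hello world,,,"), c("123,456", "1,234"))

# Pass x (the current sub-vector) to gsub, not the whole list
cleaned <- lapply(my.list, function(x) gsub(",", "", x, fixed = TRUE))
cleaned[[2]]
#> [1] "123456" "1234"
```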
I have the following code:
webpage <- "https://www.dotabuff.com/heroes"
heroes <- read_html(webpage) %>%
html_nodes("div.name") %>%
html_text()
heroes <- sapply(gsub(" ","-",heroes), tolower)
It pulls a list of names from this website. When I run this code, it correctly parses all the hero names as lowercase and with words separated by hyphens.
When I run this code:
cat(webpage,heroes[i],sep="/")
with i being the object from the vector that I want to return (I intend to use it in a for loop), it will correctly return the webpage as I expect. However, when I do
var <- cat(webpage,heroes[i],sep="/")
it tells me that var is NULL and does not have a value. It also will not assign that value to anything in the for loop, presenting it as NULL.
I've also tried
var <- toString(cat(webpage,heroes[i],sep="/"))
but that didn't work either (same issue)
What am I missing here?
I'm running this all in https://rstudio.cloud, for context. Is it something with the environment? I would have thought this would be simple.
I think you need to use paste instead of cat. The cat function is intended for printing, whereas the paste function is for concatenating strings:
var <- paste(webpage, heroes[i], sep="/")
As you can see in the help page ?cat, cat does not return anything (NULL), so the behavior is as documented.
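A minimal sketch of the difference, with a hard-coded hero slug standing in for `heroes[i]`:

```r
webpage <- "https://www.dotabuff.com/heroes"
hero <- "anti-mage"  # stand-in for heroes[i]

# paste returns the concatenated string, so it can be assigned
var <- paste(webpage, hero, sep = "/")
var
#> [1] "https://www.dotabuff.com/heroes/anti-mage"

# cat prints to the console but returns NULL invisibly
cat(webpage, hero, sep = "/")
```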
I'm using base::paste in a for loop:
for (k in 1:length(summary$pro)) {
  if (k == 1) {
    mp <- summary$pro[k]
  } else {
    mp <- paste(mp, summary$pro[k], sep = ",")
  }
}
mp comes out as one big string, where the elements are separated by commas.
For example mp is "1,2,3,4,5,6"
Then, I want to put mp in a file, where each of its elements is added to a separate column in the same row. My code for this is:
write.table(mp, file = recompdatafile, sep = ",")
However, mp just appears in the CSV as one big string as opposed to being divided up. How can I achieve my desired format?
FYI
I've also tried converting mp to a list, and strsplit()-ing it, neither of which have worked.
Once I've added summary$pro to the file, how can I also add summary$me (which has the same format), in one row with multiple columns?
Thanks,
n.i.
If you want to write something to a file, write.table() isn't the only way. If you want to avoid headers and quotes and such, you can use the more direct cat. For example
cat(summary$pro, sep=",", file="filename.txt")
will write out the vector of values from summary$pro separated by commas more directly. You don't need to build a string first. (And building a string one element at a time as you did above is a bad practice anyway. Most functions in R can operate on an entire vector at a time, including paste).
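A minimal sketch with made-up vectors standing in for `summary$pro` and `summary$me`, writing each as one comma-separated row of the same file:

```r
pro <- 1:6   # stand-in for summary$pro
me  <- 7:12  # stand-in for summary$me

# First row: the pro values, comma-separated, then a newline
cat(pro, sep = ",", file = "out.csv")
cat("\n", file = "out.csv", append = TRUE)

# Second row: append the me values the same way
cat(me, sep = ",", file = "out.csv", append = TRUE)
cat("\n", file = "out.csv", append = TRUE)

readLines("out.csv")
#> [1] "1,2,3,4,5,6"    "7,8,9,10,11,12"
```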
Using the XML library, I have parsed a web page
basicInfo <- htmlParse(myURL, isURL = TRUE)
the relevant section of which is
<div class="col-left"><h1 class="tourney-name">Price Cutter Charity Championship Pres'd by Dr Pep</h1><img class="tour-logo" alt="Nationwide Tour" src="http://a.espncdn.com/i/golf/leaderboard11/logo-nationwide-tour.png"/></div>
I can manage to extract the tournament name
tourney <- xpathSApply(basicInfo, "//*/div[@class='col-left']", xmlValue)
but also wish to know the tour it is from using the alt tag. In this case I want to get the result "Nationwide Tour"
TIA and apologies for scrolling required
Don't know R but I'm pretty good with XPath
Try this:
tourney_name <- xpathSApply(basicInfo, "//*/div[@class='col-left']/h1/text()", xmlValue)
tourney_loc <- xpathSApply(basicInfo, "//*/div[@class='col-left']/img/@alt", xmlValue)
Note the use of "@" to extract attributes and text() to extract text nodes (it looks like R did this automatically); my revised tourney_name XPath should do the same thing, but it makes clearer which part is being extracted.
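A self-contained sketch against the HTML fragment from the question (parsing the fragment directly rather than the live URL), also showing xmlGetAttr as an alternative way to read the attribute off the matched node:

```r
library(XML)

html <- '<div class="col-left"><h1 class="tourney-name">Price Cutter Charity Championship</h1><img class="tour-logo" alt="Nationwide Tour" src="logo.png"/></div>'
basicInfo <- htmlParse(html)

# text() pulls just the heading text
tourney_name <- xpathSApply(basicInfo, "//div[@class='col-left']/h1/text()", xmlValue)

# xmlGetAttr reads an attribute from each matched img node
tourney_loc <- xpathSApply(basicInfo, "//div[@class='col-left']/img", xmlGetAttr, "alt")

tourney_name
#> [1] "Price Cutter Charity Championship"
tourney_loc
#> [1] "Nationwide Tour"
```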