I am trying to learn how to collect data from the web into R. There's a website from the Brazilian Ministry of Health, a public portal, that shares the country's case numbers for the disease.
COVIDBRASIL
So, on this page I am interested in the graph that displays the daily reporting of cases here in Brazil. Using the inspector in Google Chrome I can see the JSON file feeding the data to this chart; my question is how I could fetch this file automatically with R. When I try to open the JSON in a new tab, outside the inspector's "Response" tab, I get an "Unauthorized" message. Is there any way of doing this, or would I have to manually copy the JSON from the inspector and update my R script every time?
In my case, I am interested in the "PortalDias" response. Thank you.
URL PORTAL DIAS
You need to set some headers to avoid the "Unauthorized" message. I copied them from the 'Headers' section in the browser's 'Network' tab.
library(curl)
library(jsonlite)
url <- "https://xx9p7hp1p7.execute-api.us-east-1.amazonaws.com/prod/PortalDias"
h <- new_handle()
handle_setheaders(h, Host = "xx9p7hp1p7.execute-api.us-east-1.amazonaws.com",
`Accept-Encoding` = "gzip, deflate, br",
`X-Parse-Application-Id` = "unAFkcaNDeXajurGB7LChj8SgQYS2ptm")
fromJSON(rawToChar(curl_fetch_memory(url, handle = h)$content))
# $results
# objectId label createdAt updatedAt qtd_confirmado qtd_obito
# 1 6vr9rUPbd4 26/02 2020-03-25T16:25:53.970Z 2020-03-25T22:25:42.967Z 1 123
# 2 FUNHS00sng 27/02 2020-03-25T16:27:34.040Z 2020-03-25T22:25:55.169Z 0 34
# 3 t4qW51clpj 28/02 2020-03-25T19:08:36.689Z 2020-03-25T22:26:02.427Z 0 35
# ...
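If you prefer httr, here is a minimal sketch of the same request; it reuses the X-Parse-Application-Id value shown above (which may change over time), and httr/curl add the Host and Accept-Encoding headers automatically:
library(httr)
library(jsonlite)
url <- "https://xx9p7hp1p7.execute-api.us-east-1.amazonaws.com/prod/PortalDias"
# Header value copied from the browser's Network tab, as above
res <- GET(url, add_headers(`X-Parse-Application-Id` = "unAFkcaNDeXajurGB7LChj8SgQYS2ptm"))
dias <- fromJSON(content(res, as = "text", encoding = "UTF-8"))$results
head(dias)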
I am able to fetch a table from the default instance database. All I need to do is copy that fetched data into a named instance database using RODBC. Any help is appreciated. Thanks in advance.
> library("RODBC")
> odbcChannel <- odbcConnect("SasDatabase")
> odbcClose(odbcChannel)
> odbcChannel <- odbcConnect("SasDatabase")
> sqlFetch(odbcChannel, "PR0__LOG1")
Fetched Data
              DateTime Temp1 Temp2 PK_identity
1 2018-08-27 09:59:00 51 151 1
2 2018-08-27 10:00:00 11 11 2
3 2018-08-27 10:01:00 71 71 3
4 2018-08-27 10:02:00 31 131 4
Closing Conn
odbcClose(odbcChannel)
I want to copy this fetched data into another instance database.
Your question is not really clear, but it sounds like you want to upload the fetched data to a second database. If so, use RODBC or a similar package (depending on the database type) to connect to the second database, and then use a DBI function to upload. Some examples would be:
DBI::dbAppendTable()
or
DBI::dbSendQuery()
Any answer would need more detail about the second database instance to be more specific, but the sketch below shows the general pattern.
An excellent resource for R and databases is https://db.rstudio.com/
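As a minimal sketch of that pattern, assuming the named instance is reachable through an ODBC DSN (the DSN name "SasDatabaseNamed" below is a placeholder for your own configuration) and that the target table already exists with matching columns:
library(RODBC)
library(DBI)
library(odbc)
# Fetch from the default instance, as in the question
src <- odbcConnect("SasDatabase")
dat <- sqlFetch(src, "PR0__LOG1")
odbcClose(src)
# Connect to the named instance through DBI/odbc
dest <- dbConnect(odbc::odbc(), dsn = "SasDatabaseNamed")
# Append the fetched rows to an existing table with the same columns
dbAppendTable(dest, "PR0__LOG1", dat)
# Or, if the table does not exist yet:
# dbWriteTable(dest, "PR0__LOG1", dat)
dbDisconnect(dest)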
I'm trying to use libxively to update my feed, but it frequently seems to do nothing. I've got a basic call:
{
    xi_datastream_t& ds = mXIFeed.datastreams[2];
    ::xi_str_copy_untiln(ds.datastream_id, sizeof (ds.datastream_id), "cc-output-power", '\0');
    xi_datapoint_t& dp = ds.datapoints[0];
    ds.datapoint_count = 1;
    ::xi_set_value_f32(&dp, mChargeController->outputPower());
}
const xi_context_t* ctx = ::xi_nob_feed_update(mXIContext, &mXIFeed);
It logs the following:
[io/posix/posix_io_layer.c:182 (posix_io_layer_init)] [posix_io_layer_init]
[io/posix/posix_io_layer.c:191 (posix_io_layer_init)] Creating socket...
[io/posix/posix_io_layer.c:202 (posix_io_layer_init)] Socket creation [ok]
Once or twice I saw my Xively developer page show a GET feed, but otherwise, nothing seems to get written. Any suggestions on what I should look at?
I tried to rebuild the library using blocking calls (would be nice if nob didn't mean no blocking calls), but I couldn't figure out how to build it.
Thanks!
EDIT:
I was able to build a synchronous version of the library, and that seems to work. Can anyone verify that the async version works? Is there more to it than simply calling xi_nob_feed_update()?
EDIT 2:
I tried running the async example, but I'm doing something wrong, as it always complains of no data received:
$ bin/asynch_feed_update <my key> <my feed ID> example 1 example 4 example 20 example 58 example 11 example 17
example: 1 7
example: 4 7
example: 20 7
example: 58 7
example: 11 7
example: 17 7
[io/posix_asynch/posix_asynch_io_layer.c:165 (posix_asynch_io_layer_init)] [posix_io_layer_init]
[io/posix_asynch/posix_asynch_io_layer.c:174 (posix_asynch_io_layer_init)] Creating socket...
[io/posix_asynch/posix_asynch_io_layer.c:185 (posix_asynch_io_layer_init)] Setting socket non blocking behaviour...
[io/posix_asynch/posix_asynch_io_layer.c:203 (posix_asynch_io_layer_init)] Socket creation [ok]
No data within five seconds.
The asynchronous version should work. xi_nob_feed_update() is the right function for making a feed update request.
You have to call process_xively_nob_step() in a loop just after select().
In general, you should follow the asynchronous example.
Hello, I need to test whether the input is valid for a calculator application:
3 * 9
100 + 20
200 - 99
Here's my regex:
(d).(*|+|-).(d)
but it doesn't seem to work. How can I validate the input and extract the decimal values? Thanks.
Assuming ASP uses the standard regex notation, this should be:
\d+\s[*+-]\s\d+
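As a quick check of that pattern (shown here in R, since the rest of this thread is R-oriented; the backslashes are doubled for an R string and anchors are added so the whole line must match):
# Pattern from above, anchored so the whole line must match
pattern <- "^\\d+\\s[*+-]\\s\\d+$"
inputs <- c("3 * 9", "100 + 20", "200 - 99", "2 ** 3")
grepl(pattern, inputs)
# [1]  TRUE  TRUE  TRUE FALSE
# Pull out the two operands of a valid expression
regmatches("100 + 20", gregexpr("\\d+", "100 + 20"))[[1]]
# [1] "100" "20"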
I have a (somewhat complex) web scraping challenge that I wish to accomplish and would love some direction (to whatever level you feel like sharing). Here goes:
I would like to go through all the "species pages" present in this link:
http://gtrnadb.ucsc.edu/
So for each of them I will go to:
The species page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/)
And then to the "Secondary Structures" page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/Aero_pern-structs.html)
Inside that link I wish to scrape the data on the page so that I will have a long list containing this data (for example):
chr.trna3 (1-77) Length: 77 bp
Type: Ala Anticodon: CGC at 35-37 (35-37) Score: 93.45
Seq: GGGCCGGTAGCTCAGCCtGGAAGAGCGCCGCCCTCGCACGGCGGAGGcCCCGGGTTCAAATCCCGGCCGGTCCACCA
Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<....
where each line will have its own list (inside the list for each "trna", inside the list for each animal).
I remember coming across the packages RCurl and XML (in R) that could allow for such a task, but I don't know how to use them. So what I would love to have is:
1. Some suggestions on how to build such code.
2. A recommendation for how to learn the knowledge needed to perform such a task.
Thanks for any help,
Tal
Tal,
You could use R and the XML package to do this, but (damn) that is some poorly formed HTML you are trying to parse. In fact, in most cases you would want to be using the readHTMLTable() function, which is covered in this previous thread.
Given this ugly HTML, however, we will have to use the RCurl package to pull the raw HTML and create some custom functions to parse it. This problem has two components:
Get all of the genome URLs from the base webpage (http://gtrnadb.ucsc.edu/) using the getURLContent() function in the RCurl package and some regex magic :-)
Then take that list of URLs, scrape the data you are looking for, and stick it into a data.frame.
So, here goes...
library(RCurl)
### 1) First task is to get all of the web links we will need ##
base_url<-"http://gtrnadb.ucsc.edu/"
base_html<-getURLContent(base_url)[[1]]
links<-strsplit(base_html,"a href=")[[1]]
get_data_url<-function(s) {
  u_split1<-strsplit(s,"/")[[1]][1]
  u_split2<-strsplit(u_split1,'\\"')[[1]][2]
  ifelse(grep("[[:upper:]]",u_split2)==1 & length(strsplit(u_split2,"#")[[1]])<2,return(u_split2),return(NA))
}
# Extract only those element that are relevant
genomes<-unlist(lapply(links,get_data_url))
genomes<-genomes[which(is.na(genomes)==FALSE)]
### 2) Now, scrape the genome data from all of those URLS ###
# This requires two complementary functions that are designed specifically
# for the UCSC website. The first parses the data from a -structs.html page
# and the second collects that data in to a multi-dimensional list
parse_genomes<-function(g) {
  g_split1<-strsplit(g,"\n")[[1]]
  g_split1<-g_split1[2:5]
  # Pull all of the data and stick it in a list
  g_split2<-strsplit(g_split1[1],"\t")[[1]]
  ID<-g_split2[1] # Sequence ID
  LEN<-strsplit(g_split2[2],": ")[[1]][2] # Length
  g_split3<-strsplit(g_split1[2],"\t")[[1]]
  TYPE<-strsplit(g_split3[1],": ")[[1]][2] # Type
  AC<-strsplit(g_split3[2],": ")[[1]][2] # Anticodon
  SEQ<-strsplit(g_split1[3],": ")[[1]][2] # Sequence
  STR<-strsplit(g_split1[4],": ")[[1]][2] # Secondary structure string
  return(c(ID,LEN,TYPE,AC,SEQ,STR))
}
# This will be a high dimensional list with all of the data, you can then manipulate as you like
get_structs<-function(u) {
  struct_url<-paste(base_url,u,"/",u,"-structs.html",sep="")
  raw_data<-getURLContent(struct_url)
  s_split1<-strsplit(raw_data,"<PRE>")[[1]]
  all_data<-s_split1[seq(3,length(s_split1))]
  data_list<-lapply(all_data,parse_genomes)
  for (d in 1:length(data_list)) {data_list[[d]]<-append(data_list[[d]],u)}
  return(data_list)
}
# Collect data, manipulate, and create data frame (with slight cleaning)
genomes_list<-lapply(genomes[1:2],get_structs) # Limit to the first two genomes (Bdist & Spurp), a full scrape will take a LONG time
genomes_rows<-unlist(genomes_list,recursive=FALSE) # The recursive=FALSE saves a lot of work, now we can just do a straightforward manipulation
genome_data<-t(sapply(genomes_rows,rbind))
colnames(genome_data)<-c("ID","LEN","TYPE","AC","SEQ","STR","NAME")
genome_data<-as.data.frame(genome_data)
genome_data<-subset(genome_data,ID!="</PRE>") # Some malformed web pages produce bad rows, but we can remove them
head(genome_data)
The resulting data frame contains seven columns related to each genome entry: ID, length, type, anticodon, sequence, structure string, and name. The name column contains the base genome, which was my best guess for organizing the data. Here is what it looks like:
head(genome_data)
ID LEN TYPE AC SEQ
1 Scaffold17302.trna1 (1426-1498) 73 bp Ala AGC at 34-36 (1459-1461) AGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTTTCCA
2 Scaffold20851.trna5 (43038-43110) 73 bp Ala AGC at 34-36 (43071-43073) AGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTCTCCA
3 Scaffold20851.trna8 (45975-46047) 73 bp Ala AGC at 34-36 (46008-46010) TGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTCTCCA
4 Scaffold17302.trna2 (2514-2586) 73 bp Ala AGC at 34-36 (2547-2549) GGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACAGGGATCGATGCCCGGGTTCTCCA
5 Scaffold51754.trna5 (253637-253565) 73 bp Ala AGC at 34-36 (253604-253602) CGGGGGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTCCTCCA
6 Scaffold17302.trna4 (6027-6099) 73 bp Ala AGC at 34-36 (6060-6062) GGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGAGTTCTCCA
STR NAME
1 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
2 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
3 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
4 >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>.>>>.......<<<.<<<<<<<<. Spurp
5 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
6 >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<......>>>>.......<<<<.<<<<<<<. Spurp
I hope this helps, and thanks for the fun little Sunday afternoon R challenge!
Just tried it using Mozenda (http://www.mozenda.com). After roughly 10 minutes I had an agent that could scrape the data as you describe. You may be able to get all of this data just using their free trial. Coding is fun, if you have time, but it looks like you may already have a solution coded for you. Nice job, Drew.
Interesting problem, and I agree that R is cool, but somehow I find R to be a bit cumbersome in this respect. I prefer to get the data into an intermediate plain-text form first, so I can verify that it is correct at every step... If the data is already in its final form, or you need to upload it somewhere, RCurl is very useful.
The simplest approach, in my opinion, would be to (on Linux/Unix/Mac, or in Cygwin) just mirror the entire http://gtrnadb.ucsc.edu/ site using wget, take the *-structs.html files, extract the data you want with sed or awk, and format it for reading into R.
I'm sure there would be lots of other ways also.