With the following code I import JSON data from two different URLs and then combine them into a single dataframe:
library(jsonlite)
library(dplyr)
url1 <- 'https://c.stockcharts.com/j-sum/sum?cmd=perf&group=SECTOR_DJUSOI'
document1<-na.omit(select(fromJSON(url1),Ticker=sym,Name=name,SCTR=sctr,Capital=univ))
document1$SubSector="DJUSOI"
url2 <- 'https://c.stockcharts.com/j-sum/sum?cmd=perf&group=SECTOR_DJUSOL'
document2<-na.omit(select(fromJSON(url2),Ticker=sym,Name=name,SCTR=sctr,Capital=univ))
document2$SubSector="DJUSOL"
#Combined file
USA<-rbind.data.frame(document1,document2)
My problem is that I need to import data from more than 100 different URLs, so I assume I should use a loop. The only thing that changes in each URL is the sector name (after the underscore): DJUSOI, DJUSOL, etc.
Could somebody let me know how to do it?
Try this loop. You can add the sectors you want to the sectors vector:
library(jsonlite)
library(dplyr)
sectors <- c('DJUSOI', 'DJUSOL')
documents <- data.frame()
for (sector in sectors){
  url <- paste0('https://c.stockcharts.com/j-sum/sum?cmd=perf&group=SECTOR_', sector)
  current <- fromJSON(url) %>%
    select(Ticker = sym, Name = name, SCTR = sctr, Capital = univ) %>%
    na.omit() %>%
    mutate(SubSector = sector)
  documents <- bind_rows(documents, current)
}
Since the only thing that really changes is the sector name, I think this should work for you:
library(jsonlite)
library(dplyr)
DownloadSubSec = function(sector) {
  url = 'https://c.stockcharts.com/j-sum/sum?cmd=perf&group=SECTOR_'
  url = paste0(url, sector)
  doc = na.omit(select(fromJSON(url), Ticker = sym, Name = name,
                       SCTR = sctr, Capital = univ))
  doc$SubSector = sector
  return(doc)
}
sector_names = c('DJUSOI', 'DJUSOL')
usa = sector_names %>%
  lapply(DownloadSubSec) %>%
  bind_rows()
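If you prefer purrr over base lapply, the same helper also drops straight into map_dfr(), which row-binds the results for you; this is just an equivalent sketch of the call above:
library(purrr)
usa <- map_dfr(sector_names, DownloadSubSec) # same result as lapply() + bind_rows()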
I have been trying to scrape data from a real estate site and arrange it so that it can easily be filtered and checked in a spreadsheet. I'm actually a little embarrassed that I can't move this R code forward.
Now that I have all the links to the listings, I can't work out how to loop through the previously compiled dataframe and get the details from all the URLs.
Could you please help me with this? Thanks a lot.
# Load the required packages
library(rvest)
library(magrittr)  # for the '%>%' pipe operator
library(RSelenium) # to get the fully rendered html of the dynamic pages
library(xml2)
complete <- data.frame()
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
URL.base <- "https://www.sreality.cz/hledani/prodej/byty?strana="
#"https://www.sreality.cz/hledani/prodej/byty/praha?strana="
#"https://www.sreality.cz/hledani/prodej/byty/praha?stari=dnes&strana="
#"https://www.sreality.cz/hledani/prodej/byty/praha?stari=tyden&strana="
for (i in 1:10000) {
  # Specify the url of the page to be scraped
  main_link <- paste0(URL.base, i)
  # go to the website
  remDr$navigate(main_link)
  # get the page source and save it as an html object with rvest
  main_page <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
  # get the data
  name <- html_nodes(main_page, css = ".name.ng-binding") %>% html_text()
  locality <- html_nodes(main_page, css = ".locality.ng-binding") %>% html_text()
  norm_price <- html_nodes(main_page, css = ".norm-price.ng-binding") %>% html_text()
  sreality_url <- main_page %>% html_nodes(".title") %>% html_attr("href")
  sreality_url2 <- sreality_url[c(4:24)]
  name2 <- name[c(4:24)]
  record <- data.frame(cbind(name2, locality, norm_price, sreality_url2))
  complete <- rbind(complete, record)
}
# Write CSV in R
write.csv(complete, file = "MyData.csv")
I would do this differently:
I would create a function, say scraper, that groups together all the scraping functions you have already defined. Then I would build a list of all the possible links with str_c (say 30 of them) and apply the function with a simple lapply call. That said, I would not use RSelenium. (Libraries: rvest, stringr, tibble, dplyr.)
url = 'https://www.sreality.cz/hledani/prodej/byty?strana='
This is the base URL; from here you can build the URL strings for all the pages you are interested in (1 to however many), and likewise for all the other possible URLs (Praha, Olomouc, Ostrava, etc.).
main_page = read_html('https://www.sreality.cz/hledani/prodej/byty?strana=')
Here you create all the links according to the number of pages you want:
list.of.pages = str_c(url, 1:30)
Then define one function per piece of data you are interested in; this keeps things precise and makes debugging (and checking data quality) easier. (I assume your CSS selectors are right, otherwise you will get empty objects.)
For names:
name = function(url) {
  data = html_nodes(url, css = ".name.ng-binding") %>%
    html_text()
  return(data)
}
For locality:
locality = function(url) {
  data = html_nodes(url, css = ".locality.ng-binding") %>%
    html_text()
  return(data)
}
For norm price:
normprice = function(url) {
  data = html_nodes(url, css = ".norm-price.ng-binding") %>%
    html_text()
  return(data)
}
For hrefs:
sreality_url = function(url) {
  data = html_nodes(url, css = ".title") %>%
    html_attr("href")
  return(data)
}
Those are the individual functions (I didn't test the CSS selectors, and they don't look quite right to me, but this gives you the framework to work from). After that, combine them into a tibble:
get.data.table = function(html){
  name      = name(html)
  locality  = locality(html)
  normprice = normprice(html)
  hrefs     = sreality_url(html)
  combine   = tibble(adtext = name,
                     loc    = locality,
                     price  = normprice,
                     URL    = hrefs)
  combine = combine %>%
    select(adtext, loc, price, URL)
  return(combine)
}
Then the final scraper:
scrape.all = function(urls){
  urls %>%
    lapply(read_html) %>%       # parse each page before extracting
    lapply(get.data.table) %>%
    bind_rows() %>%
    write.csv(file = 'MyData.csv')
}
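A quick usage sketch, reusing the list.of.pages built above (the CSV name comes from the function itself):
scrape.all(list.of.pages)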
I want to download all the files named "listings.csv.gz" that refer to US cities from http://insideairbnb.com/get-the-data.html. I can do it by writing out each link, but is it possible to do it in a loop?
In the end I'll keep only a few columns from each file and merge them into one file.
Since the problem was solved thanks to @CodeNoob, I'd like to share how it all worked out:
library(rvest)
library(dplyr)
library(purrr)
page <- read_html("http://insideairbnb.com/get-the-data.html")
# Get all hrefs (i.e. all links present on the website)
links <- page %>%
html_nodes("a") %>%
html_attr("href")
# Filter for listings.csv.gz, USA cities, data for March 2019
wanted <- grep('listings.csv.gz', links)
USA <- grep('united-states', links)
wanted.USA = wanted[wanted %in% USA]
wanted.links <- links[wanted.USA]
wanted.links = grep('2019-03', wanted.links, value = TRUE)
wanted.cols = c("host_is_superhost", "summary", "host_identity_verified", "street",
"city", "property_type", "room_type", "bathrooms",
"bedrooms", "beds", "price", "security_deposit", "cleaning_fee",
"guests_included", "number_of_reviews", "instant_bookable",
"host_response_rate", "host_neighbourhood",
"review_scores_rating", "review_scores_accuracy","review_scores_cleanliness",
"review_scores_checkin" ,"review_scores_communication",
"review_scores_location", "review_scores_value", "space",
"description", "host_id", "state", "latitude", "longitude")
read.gz.url <- function(link) {
  con <- gzcon(url(link))
  df <- read.csv(textConnection(readLines(con)))
  close(con)
  df <- df %>%
    select(wanted.cols) %>%
    mutate(source.url = link)
  df
}
all.df = list()
for (i in seq_along(wanted.links)) {
  all.df[[i]] = read.gz.url(wanted.links[i])
}
all.df = map(all.df, as_tibble)
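To finish the merge mentioned above, the list of tibbles can then be collapsed into one dataframe; a small sketch (the output file name here is just an example):
all.listings <- dplyr::bind_rows(all.df)                                # one row per listing, all cities together
write.csv(all.listings, "USA_listings_2019-03.csv", row.names = FALSE)  # example file name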
You can actually extract all links, filter for the ones containing listings.csv.gz and then download these in a loop:
library(rvest)
library(dplyr)
# Get all download links
page <- read_html("http://insideairbnb.com/get-the-data.html")
# Get all hrefs (i.e. all links present on the website)
links <- page %>%
html_nodes("a") %>%
html_attr("href")
# Filter for listings.csv.gz
wanted <- grep('listings.csv.gz', links)
wanted.links <- links[wanted]
for (link in wanted.links) {
  con <- gzcon(url(link))
  txt <- readLines(con)
  df <- read.csv(textConnection(txt))
  close(con)
  # Do what you want with df here
}
Example: Download and combine the files
To get the result you want, I would suggest writing a download function that filters for the columns you want and then combining the results into a single dataframe, for example something like this:
read.gz.url <- function(link) {
  con <- gzcon(url(link))
  df <- read.csv(textConnection(readLines(con)))
  close(con)
  df <- df %>%
    select(c('calculated_host_listings_count_shared_rooms', 'cancellation_policy')) %>% # random columns I chose
    mutate(source.url = link) # you may need to remember the origin of each row
  df
}
all.df <- do.call('rbind', lapply(head(wanted.links,2), read.gz.url))
Note that I only tested this on the first two files, since they are pretty large.
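Once you are happy with the output, running it over every link is just the same call without head() (expect it to take a while, given the file sizes):
all.df <- do.call('rbind', lapply(wanted.links, read.gz.url))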
I need to run a script for each station (I was replacing the station numbers one by one in the script), but there are more than 100 stations.
I thought a loop in the script could save me time. I've never written a loop before, so I don't know if it's possible to do what I want. I've tried the code below, but it doesn't work.
Just a bit of my df08 data (txt):
RowNum,date,code,gauging_station,precp
1,01/01/2008 01:00,1586,315,0.4
2,01/01/2008 01:00,10990,16589,0.2
3,01/01/2008 01:00,17221,30523,0.6
4,01/01/2008 01:00,34592,17344,0
5,01/01/2008 01:00,38131,373,0
6,01/01/2008 01:00,44287,370,0
7,01/01/2008 01:00,53903,17314,0.4
8,01/01/2008 01:00,56005,16596,0
9,01/01/2008 01:00,56349,342,0
10,01/01/2008 01:00,57294,346,0
11,01/01/2008 01:00,64423,533,0
12,01/01/2008 01:00,75266,513,0
13,01/01/2008 01:00,96514,19187,0
Code:
station <- sample(50:150,53,replace=F)
for(i in station)
{
df08_1 <- filter(df08, V7==station [i])
colnames(df08_1) <- c("Date","gauging_station", "code", "precp")
df08_1 <- unique(df08_1)
final <- df08_1 %>%
group_by(Date=floor_date(Date, "1 hour"), gauging_station, code) %>%
summarize(precp=sum(precp))
write.csv(final,file="../station [i].csv", row.names = FALSE)
}
If you're not averse to using some tidyverse packages, I think you could simplify this a bit:
Updated with your new sample data - this runs ok on my computer:
Code:
library(dplyr)
dat %>%
  select(-RowNum) %>%
  distinct() %>%
  group_by(date_hour = lubridate::floor_date(date, 'hour'), gauging_station, code) %>%
  summarize(precp = sum(precp)) %>%
  split(.$gauging_station) %>%
  purrr::map(~write.csv(.x,
                        file = paste0('../', .x$gauging_station, '.csv'),
                        row.names = FALSE))
Data:
dat <- data.table::fread("RowNum,date,code,gauging_station,precp
1,01/01/2008 01:00,1586,315,0.4
2,01/01/2008 01:00,10990,16589,0.2
3,01/01/2008 01:00,17221,30523,0.6
4,01/01/2008 01:00,34592,17344,0
5,01/01/2008 01:00,38131,373,0
6,01/01/2008 01:00,44287,370,0
7,01/01/2008 01:00,53903,17314,0.4
8,01/01/2008 01:00,56005,16596,0
9,01/01/2008 01:00,56349,342,0
10,01/01/2008 01:00,57294,346,0
11,01/01/2008 01:00,64423,533,0
12,01/01/2008 01:00,75266,513,0
13,01/01/2008 01:00,96514,19187,0") %>%
mutate(date = as.POSIXct(date, format = '%m/%d/%Y %H:%M'))
I can't comment for lack of reputation, but if the code works when you replace station [i] with an actual station number, it sounds like each station is part of, and has to be extracted from, the df08 object (dataframe).
If I understand you correctly, I would do this as follows:
stations <- c(1:100) # put your station IDs into a vector
for (i in stations) { # run the script for each entry in the vector
  # assuming that 'V7' is the name of the (unnamed) seventh column of df08,
  # it could work like this:
  df08_1 <- filter(df08, df08$V7 == i) # if your station names are strings like
                                       # 'station 1', use paste("station", i) instead
  colnames(df08_1) <- c("Date", "gauging_station", "code", "precp")
  df08_1 <- unique(df08_1)
  final <- df08_1 %>%
    group_by(Date = floor_date(Date, "1 hour"), gauging_station, code) %>%
    summarize(precp = sum(precp)) # floor_date() comes from the lubridate package
  write.csv(final, file = paste0("../station", i, ".csv"), row.names = FALSE)
  # this automatically generates the file names; modify the string to whatever you want
}
If this and all of the other examples don't work, could you provide us with some dummy data to work with, just to see what the df08 dataframe looks like? And also what the floor_date() function does?
I'm new-ish to R and am having some trouble iterating through values.
For context: I have data on 60 people over time, and each person has his/her own dataset in a folder (I received the data with id #s 00:59). For each person, there are 2 values I need - time of response and picture response given (a number 1 - 16). I need to convert this data from wide to long format for each person, and then eventually append all of the datasets together.
My problem is that I'm having trouble writing a loop that will do this for each person (i.e. each dataset). Here's the code I have so far:
pam[x] <- fromJSON(file = "PAM_u[x].json")
pam[x]df <- as.data.frame(pam[x])
#Creating long dataframe for times
pam[x]_long_times <- gather(
select(pam[x]df, starts_with("resp")),
key = "time",
value = "resp_times"
)
#Creating long dataframe for pic_nums (affect response)
pam[x]_long_pics <- gather(
select(pam[x]df, starts_with("pic")),
key = "picture",
value = "pic_num"
)
#Combining the two long dataframes so that I have one df per person
pam[x]_long_fin <- bind_cols(pam[x]_long_times, pam[x]_long_pics) %>%
select(resp_times, pic_num) %>%
add_column(id = [x], .before = 1)
If you replace [x] in the above code with a person's id# (e.g. 00), the code will run and will give me the dataframe I want for that person. Any advice on how to do this so I can get all 60 people done?
Thanks!
EDIT
So, using library(jsonlite) rather than library(rjson) read the files in the format I needed without having to do all of the manipulation. Thanks all for the responses; the solution was apparently much easier than I'd thought.
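For reference, a minimal sketch of what that looks like for a single file (file name taken from the snippet above with id 00 substituted; this assumes each JSON file is an array of records):
library(jsonlite)
pam00 <- fromJSON("PAM_u00.json") # jsonlite simplifies an array of records straight to a data frame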
I don't know the structure of your JSON files. If your working directory is not the same folder as the JSON files, try this:
library(jsonlite)
# setup - read files
json_folder <- "U:/test/" # adjust your folder here
files <- list.files(path = json_folder, pattern = "\\.json$", full.names = TRUE)
# import data
pam <- NULL
pam_df <- NULL
for (i in seq_along(files)) {
  pam[[i]] <- fromJSON(files[i])     # jsonlite::fromJSON takes the file path directly
  pam_df[[i]] <- as.data.frame(pam[[i]])
}
Here you first list all the JSON files in the folder, which gives you a vector of length 60 (the file paths).
Then you sequence along that vector and read all the files.
I assume that at the end you can do bind_rows(), or add your own code inside the for loop. But remember to initialise the data frames to NULL before the loop starts, e.g. pam_long_pics <- NULL.
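For example, a minimal sketch of that final bind_rows() step, assuming each element of pam_df ends up with the same columns:
library(dplyr)
pam_all <- bind_rows(pam_df, .id = "id") # .id records which file each row came from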
Hope that helps. Let me know.
Something along these lines could work:
#library("tidyverse")
#library("jsonlite")
file_list <- list.files(pattern = "*.json", full.names = TRUE)
Data_raw <- tibble(File_name = file_list) %>%
  mutate(File_contents = map(File_name, fromJSON)) %>% # This should result in a nested tibble
  mutate(File_contents = map(File_contents, as_tibble))
Data_raw %>%
  mutate(Long_times = map(File_contents, ~ gather(.x, key = "time", value = "resp_times", starts_with("resp"))),
         Long_pics = map(File_contents, ~ gather(.x, key = "picture", value = "pic_num", starts_with("pic")))) %>%
  unnest(Long_times, Long_pics) %>%
  select(File_name, resp_times, pic_num)
EDIT: you may or may not need to include as_tibble() after reading in the JSON files, depending on what your data looks like.
I'm trying to pull some data from an API and throw it all into a single data frame. I'm trying to put a variable into the URL I'm pulling from and then loop it to pull data for 54 keys. Here's what I have so far, with notes.
library("jsonlite")
library("httr")
library("lubridate")
options(stringsAsFactors = FALSE)
url <- "http://api.kuroganehammer.com"
### This gets me a list of 58 observations, I want to use this list to
### pull data for each using an API
raw.characters <- GET(url = url, path = "api/characters")
## Convert the results from unicode to a JSON
text.raw.characters <- rawToChar(raw.characters$content)
## Convert the JSON into an R object. Check the class of the object after
## it's retrieved and reformat appropriately
characters <- fromJSON(text.raw.characters)
class(characters)
## This pulls data for an individual character. I want to get one of
## these for all 58 characters by looping this and replacing the 1 in the
## URL path for every number through 58.
raw.bayonetta <- GET(url = url, path = "api/characters/1/detailedmoves")
text.raw.bayonetta <- rawToChar(raw.bayonetta$content)
bayonetta <- fromJSON(text.raw.bayonetta)
## This is the function I tried to create, but I get a lexical error when
## I call it, and I have no idea how to loop it.
move.pull <- function(x) {
char.x <- x
raw.x <- GET(url = url, path = cat("api/characters/",char.x,"/detailedmoves", sep = ""))
text.raw.x <- rawToChar(raw.x$content)
char.moves.x <- fromJSON(text.raw.x)
char.moves.x$id <- x
return(char.moves.x)
}
The first part of this:
library(jsonlite)
library(httr)
library(lubridate)
library(tidyverse)
base_url <- "http://api.kuroganehammer.com"
res <- GET(url = base_url, path = "api/characters")
content(res, as = "text", encoding = "UTF-8") %>%
  fromJSON(flatten = TRUE) %>%
  as_tibble() -> chars
Gets you a data frame of the characters.
This:
pb <- progress_estimated(length(chars$id))
map_df(chars$id, ~{
  pb$tick()$print()
  Sys.sleep(sample(seq(0.5, 2.5, 0.5), 1)) # be kind to the free API
  res <- GET(url = base_url, path = sprintf("api/characters/%s/detailedmoves", .x))
  content(res, as = "text", encoding = "UTF-8") %>%
    fromJSON(flatten = TRUE) %>%
    as_tibble()
}, .id = "id") -> moves
Gets you a data frame of all the "moves" and adds the "id" for the character. You get a progress bar for free, too.
You can then either left_join() the two as needed, or group and nest the moves data into a separate list column. If you want that structure to begin with, you can use map() instead of map_df().
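For example, a rough sketch of the join step; note this assumes the .id column is a positional index (which it is for an unnamed input), so it gets remapped to the real character ids first:
moves %>%
  mutate(id = chars$id[as.integer(id)]) %>% # translate the positional .id back to the real id
  left_join(chars, by = "id") -> full_moves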
Leave in the time pause code. It's a free API and you should likely increase the pause times to avoid DoS'ing their site.