Python 3:
I need to print the name, email, city, and phone for all users in a JSON file.
I am just learning Python, so I don't know what code to use.
I can fetch the file, but I don't know what to do to print the correct info.
#Imported functions
import requests
import json
#Using the following API endpoint:
#https://jsonplaceholder.typicode.com/users
#Use the GET method of the requests library to fetch the data and JSON-decode the response.
r = requests.get('https://jsonplaceholder.typicode.com/users')
data = r.json()
print(r)
print()
print(data)
I want a nicely formatted list of the name, email, city, phone for all users.
Thanks for your help!
import requests
import json
r = requests.get('https://jsonplaceholder.typicode.com/users')
data = r.json()
for row in data:
    print("Name: {}\nEmail: {}\nCity: {}\nPhone: {}\n".format(row['name'], row['email'], row['address']['city'], row['phone']))
# alternative to the line above
# print("Name: {name}\nEmail: {email}\nCity: {address[city]}\nPhone: {phone}\n".format_map(row))
Short explanation: data contains a list of the entries in the JSON file you are requesting. In this case there are 10 entries, so data will have 10 items.
for row in data:
    print(...)
will loop through data (the list with 10 entries), binding each entry to row in turn. Each row is then printed in a certain format: not the whole row, but certain fields in it, which you access by their key, in this case ['name'] and so on.
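The same loop can also be written with f-strings (Python 3.6+), which some find easier to read than str.format. A minimal sketch, using one made-up record shaped like an entry from the /users endpoint:

```python
# A sample record shaped like one entry from the jsonplaceholder /users endpoint
row = {
    "name": "Leanne Graham",
    "email": "Sincere@april.biz",
    "address": {"city": "Gwenborough"},
    "phone": "1-770-736-8031",
}

# f-strings embed the lookups directly in the string literal
line = (
    f"Name: {row['name']}\n"
    f"Email: {row['email']}\n"
    f"City: {row['address']['city']}\n"
    f"Phone: {row['phone']}\n"
)
print(line)
```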
I'm new to R and have a JSON file, containing data I'm hoping to convert to an R dataframe, that's been scraped in the following format:
The data was scraped incorrectly: no commas were inserted to separate entries. I've tried reading the data in with scan and separating it into a list (to then read into a dataframe) with this code:
indices <- grep(":[{",x, fixed=TRUE)
n <- length(indices)
l <- vector("list", n);
for(i in 1:n) {
    ps <- substr(x, indices[[i]], indices[i+1])
    l[[i]] <- ps
}
But I'm getting empty strings and NaN values. I've tried parsing with jsonlite, tidyjson, and rjson, without any luck (which makes sense, since the JSON is malformed). This article seems to match my JSON's structure, but the solution isn't working because of the missing commas. How would I insert a comma before every instance of {"entries":[ in R when the file is read in as one string?
UPDATE: first, second and third entries
{"entries":[{"url":"/leonardomso/playground","name":"playground","lang":"TypeScript","desc":"Playground using React, Emotion, Relay, GraphQL, MongoDB.","stars":5,"forks":"2","updated":"2021-03-24T09:35:44Z","info":["react","reactjs","graphql","typescript","hooks","apollo","boilerplate","!DOCTYPE html \"\""],"repo_url":"/leonardomso?tab=repositories"}
{"entries":[{"url":"/leonardomso/playground","name":"playground","lang":"TypeScript","desc":"Playground using React, Emotion, Relay, GraphQL, MongoDB.","stars":5,"forks":"2","updated":"2021-03-24T09:35:44Z","info":["react","reactjs","graphql","typescript","hooks","apollo","boilerplate","!DOCTYPE html \"\""],"repo_url":"/leonardomso?tab=repositories"}
{"entries":[{"url":"/shiffman/Presentation-Manager","name":"Presentation-Manager","lang":"JavaScript","desc":"Simple web app to manage student presentation schedule.","stars":17,"forks":"15","updated":"2021-01-19T15:28:55Z","info":[]},{"desc":"","stars":null,"forks":"","info":[]},{"url":"/shiffman/A2Z-F20","name":"A2Z-F20","lang":"JavaScript","desc":"ITP Course Programming from A to Z Fall 2020","stars":40,"forks":"31","updated":"2020-12-21T13:52:58Z","info":[]},{"desc":"","stars":null,"forks":"","info":[]},{"desc":"","stars":null,"forks":"","info":[]},{"url":"/shiffman/RunwayML-Object-Detection","name":"RunwayML-Object-Detection","lang":"JavaScript","desc":"Object detection model with RunwayML, node.js, and p5.js","stars":16,"forks":"2","updated":"2020-11-15T23:36:36Z","info":[]},{"url":"/shiffman/ShapeClassifierCNN","name":"ShapeClassifierCNN","lang":"JavaScript","desc":"test code for new tutorial","stars":11,"forks":"1","updated":"2020-11-06T15:02:26Z","info":[]},{"url":"/shiffman/Bot-Code-of-Conduct","name":"Bot-Code-of-Conduct","desc":"Code of Conduct to guide ethical bot making practices","stars":15,"forks":"1","updated":"2020-10-15T18:30:26Z","info":[]},{"url":"/shiffman/Twitter-Bot-A2Z","name":"Twitter-Bot-A2Z","lang":"JavaScript","desc":"New twitter bot examples","stars":26,"forks":"2","updated":"2020-10-13T16:17:45Z","info":["hacktoberfest","!DOCTYPE html \"\""],"repo_url":"/shiffman?tab=repositories"}
You can use
gsub('}{"entries":[', '},{"entries":[', x, fixed=TRUE)
So, this is a plain replacement of every }{"entries":[ with },{"entries":[, which inserts the missing comma between entries.
Note the fixed=TRUE parameter, which disables regex parsing of the pattern string.
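The same repair can be sketched in Python (with a made-up two-entry string standing in for the scraped file): a plain, non-regex replacement inserts the missing comma, and wrapping the result in [] makes the whole thing parse as a JSON array.

```python
import json

# A tiny stand-in for the malformed scrape: two concatenated objects with no
# separating comma (the real file has many more entries and fields)
x = '{"entries":[{"name":"playground"}]}{"entries":[{"name":"A2Z-F20"}]}'

# Plain string replacement (no regex), same idea as the R gsub(..., fixed=TRUE),
# then wrap in [] so the result is one valid JSON array
fixed = "[" + x.replace('}{"entries":[', '},{"entries":[') + "]"
parsed = json.loads(fixed)
print([d["entries"][0]["name"] for d in parsed])
```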
I am trying to scrape the country names from the table on the website https://www.theguardian.com/world/2020/oct/25/covid-world-map-countries-most-coronavirus-cases-deaths as a list. But when I print it out, it just gives me an empty list instead of a list containing the country names. Could anybody explain why I am getting this? The code is below:
import requests
from bs4 import BeautifulSoup
webpage = requests.get("https://www.theguardian.com/world/2020/oct/25/covid-world-map-countries-most-coronavirus-cases-deaths")
soup = BeautifulSoup(webpage.content, "html.parser")
countries = soup.find_all("div", attrs={"class": 'gv-cell gv-country-name'})
print(countries)
list_of_countries = []
for country in countries:
    list_of_countries.append(country.get_text())
print(list_of_countries)
This is the output I am getting:
[]
[]
Also, not only here: I was getting the same result (an empty list) when I was trying to scrape a product's information from Amazon's website.
The list is dynamically retrieved from another endpoint, which you can find in the browser's network tab and which returns JSON. Something like the following should work:
import requests
r = requests.get('https://interactive.guim.co.uk/2020/coronavirus-central-data/latest.json').json() #may need to add headers
countries = [i['attributes']['Country_Region'] for i in r['features']]
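The extraction step can be tried offline as well. A minimal sketch, with a shortened stand-in payload whose structure is assumed from the list comprehension above (the real endpoint returns one feature per country):

```python
# Stand-in for the JSON returned by the Guardian's data endpoint
# (structure assumed: a "features" list of objects with an "attributes" dict)
payload = {
    "features": [
        {"attributes": {"Country_Region": "US", "Confirmed": 1}},
        {"attributes": {"Country_Region": "India", "Confirmed": 2}},
    ]
}

# Same extraction as the list comprehension in the answer
countries = [i['attributes']['Country_Region'] for i in payload['features']]
print(countries)
```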
With some help, I am able to extract the landing/main image of a URL. However, I would like to be able to extract the subsequent images as well.
require(rvest)
url <- "https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-spons&keywords=lunch+bag&psc=1"
webpage <- read_html(url)
r <- webpage %>%
html_nodes("#landingImage") %>%
html_attr("data-a-dynamic-image")
imglink <- strsplit(r, '"')[[1]][2]
print(imglink)
This gives the correct output for the main image. However, I would like to extract the links when I roll-over to the other images of the same product. Essentially, I would like the output to have the following links:
https://images-na.ssl-images-amazon.com/images/I/81bF%2Ba21WLL.UY500.jpg
https://images-na.ssl-images-amazon.com/images/I/81HVwttGJAL.UY500.jpg
https://images-na.ssl-images-amazon.com/images/I/81Z1wxLn-uL.UY500.jpg
https://images-na.ssl-images-amazon.com/images/I/91iKg%2BKqKML.UY500.jpg
https://images-na.ssl-images-amazon.com/images/I/91zhpH7%2B8gL.UY500.jpg
Many thanks
As requested, a Python script is at the bottom. To make this applicable across languages, the answer is in two parts: 1) a high-level pseudocode description of steps which can be carried out in R, Python, and many other languages; 2) a Python example.
An R script to obtain the string shown at the end (steps 1-3 of the process) follows as well.
1) Process:
Obtain the HTML via a GET request.
Regex out a substring from one of the script tags; this substring is in fact what jQuery on the page uses to provide the image links from JSON.
The regex pattern is
jQuery\.parseJSON\(\'(.*)\'\);
The explanation is:
Basically, the contained JSON object is gathered starting at the { before "dataInJson" and ending before the characters '). That extracts the JSON object as a string. The first capturing group (.*) gathers everything between the start string and the end string (excluding both).
The first match is the only one wanted, so of the matches returned, the first must be extracted. This is then handled with a JSON-parsing library that can take a string and return a JSON object.
That JSON object is looped over, accessing it by the key colorImages (in Python the structure is a dictionary; R will be slightly different) to get the colours of the product, which are in turn used to access the actual URLs themselves.
(Screenshots showed the colour keys and, nested one level down, the image URLs.)
2) Those steps shown in Python
import requests #library to handle xhr GET
import re #library to handle regex
import json
headers = {'User-Agent' : 'Mozilla/5.0', 'Referer':'https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc='}
r = requests.get('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc=1', headers = headers)
p1 = re.compile(r'jQuery\.parseJSON\(\'(.*)\'\);')
data = p1.findall(r.text)[0]
json_source = json.loads(data)
for colour in json_source['colorImages']:
    for image in json_source['colorImages'][colour]:
        print(image['large'])
Output:
All the links for the product in all colours; large image links only (so the URLs appear slightly different and more numerous, but they are the same images).
R script to regex out required string and generate JSON:
library(rvest)
library(jsonlite)
library(stringr)
con <- url('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc=1', "rb")
page = read_html(con)
page %>%
html_nodes(xpath=".//script[contains(., 'colorImages')]")%>%
html_text() %>% as.character %>% str_match(.,"jQuery\\.parseJSON\\(\\'(.*)\\'\\);") -> res
json = fromJSON(res[,2][2])
They've updated the page, so now just use:
Python:
import requests #library to handle xhr GET
import re #library to handle regex
headers = {'User-Agent' : 'Mozilla/5.0', 'Referer':'https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc='}
r = requests.get('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc=1', headers = headers)
p1 = re.compile(r'"large":"(.*?)"')
links = p1.findall(r.text)
print(links)
R:
library(rvest)
library(stringr)
con <- url('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc=1', "rb")
page = read_html(con)
res <- page %>%
html_nodes(xpath=".//script[contains(., 'var data')]")%>%
html_text() %>% as.character %>%
str_match_all(.,'"large":"(.*?)"')
print(res[[1]][,2])
I have a set of JSON data files that look like this:
[
  {
    "client": "toys",
    "filename": "toy1.csv",
    "file_row_number": 1,
    "secondary_db_index": "4050",
    "processed_timestamp": 1535004075,
    "processed_datetime": "2018-08-23T06:01:15+0000",
    "entity_id": "4050",
    "entity_name": "4050",
    "is_emailable": false,
    "is_txtable": false,
    "is_loadable": false
  }
]
I have created a Glue Crawler with the following custom classifier JSON path:
$[*]
Glue returns the correct schema with the columns correctly identified.
However, when I query the data in Athena, all the data lands in the first column and the rest of the columns are empty.
How can I get the data to spread according to their columns?
image of Athena query
Thank you!
It is an issue connected to Hive. I suggest two approaches. First, you can create a new table in Athena with a struct data type, like this:
CREATE EXTERNAL TABLE `example`(
`row` struct<client:string,filename:string,file_row_number:int,secondary_db_index:string,processed_timestamp:int,processed_datetime:string,entity_id:string,entity_name:string,is_emailable:boolean,is_txtable:boolean,is_loadable:boolean> COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://example'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='example',
'averageRecordSize'='271',
'classification'='json',
'compressionType'='none',
'jsonPath'='$[*]',
'objectCount'='1',
'recordCount'='1',
'sizeKey'='271',
'transient_lastDdlTime'='1535533583',
'typeOfData'='file')
And then you can run the query as follows:
SELECT row.client, row.filename, row.file_row_number FROM "example"
Second, you can re-design your JSON file as below and then run the Crawler again. In this example I used the single-JSON-record-per-line format (note: no comma between records, since each line must be a standalone JSON object):
{"client":"toys","filename":"toy1.csv","file_row_number":1,"secondary_db_index":"4050","processed_timestamp":1535004075,"processed_datetime":"2018-08-23T06:01:15+0000","entity_id":"4050","entity_name":"4050","is_emailable":false,"is_txtable":false,"is_loadable":false}
{"client":"toys2","filename":"toy2.csv","file_row_number":1,"secondary_db_index":"4050","processed_timestamp":1535004075,"processed_datetime":"2018-08-23T06:01:15+0000","entity_id":"4050","entity_name":"4050","is_emailable":false,"is_txtable":false,"is_loadable":false}
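The conversion from a JSON array file to that one-record-per-line layout can be sketched in Python (using a shortened two-record stand-in for the real file):

```python
import json

# Stand-in for the original file: a JSON array of records, as in the question
raw = '[{"client":"toys","filename":"toy1.csv"},{"client":"toys2","filename":"toy2.csv"}]'

# Emit one compact JSON object per line, with no trailing commas --
# the layout Glue/Athena expect for JSON input
records = json.loads(raw)
json_lines = "\n".join(json.dumps(r, separators=(",", ":")) for r in records)
print(json_lines)
```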
I'm attempting to scrape a page that has about 10 columns using Ruby and Nokogiri. Most of the columns are pretty straightforward, having unique class names. However, some of them have ids that seem to have long number strings appended to what would be the standard class name.
For example, gametimes are all picked up with .eventLine-time, team names with .team-name, but this particular one has, for example:
<div class="eventLine-book-value" id="eventLineOpener-118079-19-1522-1">-3 -120</div>
.eventLine-book-value is not specific to this column, so it's not useful. The 13 digits are different for every game, and trying something like:
def nodes_by_selector(filename, selector)
  file = open(filename)
  doc = Nokogiri::HTML(file)
  doc.css(^selector)
end
has left me with errors. I've seen ^ and ~ used in other languages, but I'm new to this, and I have tried searching for ways to pick up all data under id=eventLineOpener-XXXX to no avail.
To pick up all data under id=eventLineOpener-XXXX, you need to pass 'div[id*=eventLineOpener]' as the selector:
def nodes_by_selector(filename, selector)
  file = open(filename)
  doc = Nokogiri::HTML(file)
  doc.css(selector) # e.g. doc.css('div[id*=eventLineOpener]')
end
The above method will return an array of Nokogiri::XML::Element objects whose id matches eventLineOpener-XXXX.
Further, to extract the content of each of these Nokogiri::XML::Element objects, iterate over them and call the text method on each. For example:
doc.css('div[id*=eventLineOpener]')[0].text
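For comparison, the same substring-on-id idea can be sketched in Python with only the standard library (in BeautifulSoup the closer one-liner would be soup.select('div[id*=eventLineOpener]'); the HTML below is a made-up two-row stand-in for the scraped page):

```python
from html.parser import HTMLParser

# Collect the text of every div whose id starts with "eventLineOpener-",
# mirroring what the Nokogiri selector div[id*=eventLineOpener] matches
class EventLineParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.capture = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("id", "").startswith("eventLineOpener-"):
            self.capture = True

    def handle_data(self, data):
        if self.capture:
            self.values.append(data)
            self.capture = False

# Minimal HTML standing in for the real page
html = (
    '<div class="eventLine-book-value" id="eventLineOpener-118079-19-1522-1">-3 -120</div>'
    '<div class="eventLine-book-value" id="eventLineOpener-118080-19-1522-1">+3 +100</div>'
    '<div id="other">ignore</div>'
)

parser = EventLineParser()
parser.feed(html)
print(parser.values)
```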