Web Crawling: get shares of a YouTube video from the statistics tab - web-scraping

Does anybody know a way to get the share counts of YouTube videos (not mine)? I would like to store them in a DB. It does not work with the YouTube API. Another problem is that not every YouTube video has the statistics tab.
So far I have tried the YouTube API, the jsoup HTML parser (the div showing the shares wasn't in the response, although it is shown via Inspect in Firefox, for example) and the import.io demo, which worked but is definitely too expensive.

The best way is to look at the network logs; in this case they show a POST to:
https://www.youtube.com/insight_ajax?action_get_statistics_and_data=1&v=$video_id
It sends an XSRF token in the body, which is available in the original HTML of the video page https://www.youtube.com/watch?v=$video_id in a JavaScript object like:
yt.setConfig({
'XSRF_TOKEN': "QUFFLUhqbnNvZUx4THR3eV80dHlacV9tRkRxc2NwSjlXQXxBQ3Jtc0ttd0JLWENnMjdYNE5IRWhibE9ZdDJTSk1aMktxTDR5d3JjSnkzVUtQWVcwdnp3X0tSOXEtM3hZdzVFdjNPeGpPRGtLVU5pVXV0SmtfdWJSUHNqTVg2WXBndjZpa3d6U25ja2FTelBBVWRlT0lZZkRDaDV6SU94VWE3cnpERHhWNVlUYWdyRjFqN1hvc0VLRmVwcEY3ZWdJMWgyUmc=",
'XSRF_FIELD_NAME': "session_token",
'XSRF_REDIRECT_TOKEN': "VlhMkn6F56dGGYcm4Rg7jCZR0vJ8MTQ5ODA1NzIwMkAxNDk3OTcwODAy"
});
It also needs some cookies that are set by this same video page.
Using Python, with beautifulsoup & python-requests:
import requests
from bs4 import BeautifulSoup
import re

s = requests.Session()
video_id = "CPkU0dF4JKo"

# Load the video page to get the session cookies and the XSRF token embedded in the HTML.
r = s.get('https://www.youtube.com/watch?v={}'.format(video_id))
xsrf_token = re.search(r"'XSRF_TOKEN'\s*:\s*\"(.*)\"", r.text, re.IGNORECASE).group(1)

# Post the token as session_token to the statistics endpoint.
r = s.post(
    'https://www.youtube.com/insight_ajax?action_get_statistics_and_data=1&v={}'.format(video_id),
    data={
        'session_token': xsrf_token
    }
)

# The response is XML; the metrics live in divs with class "bragbar-metric" inside
# the html_content element. Drop the non-ASCII separators and keep the leading number.
metrics = [
    int(t.text.encode('ascii', 'ignore').decode().split(' ', 1)[0])
    for t in BeautifulSoup(r.content, "lxml").find('html_content').find("tr").findAll("div", {"class": "bragbar-metric"})
]
print(metrics)
Using bash, with curl, sed, pup & xml_grep:
The following bash script will:
request the video page https://www.youtube.com/watch?v=$video_id with curl
store the cookies in a file called cookie.txt
extract the XSRF_TOKEN (sent as session_token in the next request) with sed
request the video statistics page https://www.youtube.com/insight_ajax?action_get_statistics_and_data=1&v=$video_id with curl, sending the previously stored cookies
parse the XML result and extract the CDATA part with xml_grep
parse the HTML with pup, extracting the text of the divs with class bragbar-metric via its text{} filter
use sed to remove unicode characters and the trailing metric labels
The script:
video_id=CPkU0dF4JKo
session_token=$(curl -s -c cookie.txt "https://www.youtube.com/watch?v=$video_id" | \
sed -rn "s/.*'XSRF_TOKEN'\s*:\s*\"(.*)\".*/\1/p")
curl -s -b cookie.txt -d "session_token=$session_token" \
"https://www.youtube.com/insight_ajax?action_get_statistics_and_data=1&v=$video_id" | \
xml_grep --text_only 'html_content' | \
pup 'div table tr .bragbar-metric text{}' | \
sed 's/\xc2\x91\|\xc2\x92\|\xc2\xa0\|\xe2\x80\x8e//' | \
sed 's/\s.*$//'
It gives the number of views, time watched, subscriptions, and shares:
120862
454
18
213
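Since the original goal is to store these numbers in a DB, here is a minimal bash follow-up sketch (the sqlite3 CLI, the script name yt_metrics.sh, the database file and the table schema are all assumptions made for illustration):
video_id=CPkU0dF4JKo
# Read the four numbers printed by the pipeline above (saved here as the
# hypothetical script yt_metrics.sh) into one shell variable each.
read -r views time_watched subscriptions shares <<< "$(./yt_metrics.sh | tr '\n' ' ')"
# Create the table on first run, then append one row per crawl.
sqlite3 yt_stats.db "CREATE TABLE IF NOT EXISTS video_stats(
    video_id TEXT, views INTEGER, time_watched INTEGER,
    subscriptions INTEGER, shares INTEGER);"
sqlite3 yt_stats.db "INSERT INTO video_stats VALUES(
    '$video_id', $views, $time_watched, $subscriptions, $shares);"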

Related

Firebase BigQuery migration bash error

I am using Google's standard workflow to run the migration from the old dataset to the new dataset (Migration steps). I filled in the missing values, such as the Property ID, BigQuery ID, etc. When I ran the bash script, the following error occurred:
Migrating mindfulness.com_mindfulness_ANDROID.app_events_20180515
--allow_large_results --append_table --batch --debug_mode --destination_table=analytics_171690789.events_20180515 --noflatten_results --nouse_legacy_sql --parameter=firebase_app_id::1:437512149764:android:0dfd4ab1e9926c7c --parameter=date::20180515 --parameter=platform::ANDROID#platform --project_id=mindfulness --use_gce_service_account
FATAL Flags positioning error: Flag '--project_id=mindfulness' appears after final command line argument. Please reposition the flag.
Run 'bq help' to get help.
I couldn't find a solution on Stack Overflow or Google. Does anyone have an idea how to solve this?
My migration.sh script (with small modifications to the IDs to stay anonymous):
# Analytics Property ID for the Project. Find this in Analytics Settings in Firebase
PROPERTY_ID=171230123
# Bigquery Export Project
BQ_PROJECT_ID="mindfulness" #(e.g., "firebase-public-project")
# Firebase App ID for the app.
FIREBASE_APP_ID="1:123412149764:android:0dfd4ab1e1234c7c" #(e.g., "1:300830567303:ios:
# Dataset to import from.
BQ_DATASET="com_mindfulness_ANDROID" #(e.g., "com_firebase_demo_IOS")
# Platform
PLATFORM="ANDROID"#"platform of the app. ANDROID or IOS"
# Date range for which you want to run migration, [START_DATE,END_DATE]
START_DATE=20180515
END_DATE=20180517
# Do not modify the script below, unless you know what you are doing :)
startdate=$(date -d"$START_DATE" +%Y%m%d) || exit -1
enddate=$(date -d"$END_DATE" +%Y%m%d) || exit -1
# Iterate through the dates.
DATE="$startdate"
while [ "$DATE" -le "$enddate" ]; do
# BQ table constructed from above params.
BQ_TABLE="$BQ_PROJECT_ID.$BQ_DATASET.app_events_$DATE"
echo "Migrating $BQ_TABLE"
cat migration_script.sql | sed -e "s/SCRIPT_GENERATED_TABLE_NAME/$BQ_TABLE/g" | bq query \
--debug_mode \
--allow_large_results \
--noflatten_results \
--use_legacy_sql=False \
--destination_table analytics_$PROPERTY_ID.events_$DATE \
--batch \
--append_table \
--parameter=firebase_app_id::$FIREBASE_APP_ID \
--parameter=date::$DATE \
--parameter=platform::$PLATFORM \
--project_id=$BQ_PROJECT_ID
temp=$(date -I -d "$DATE + 1 day")
DATE=$(date -d "$temp" +%Y%m%d)
done
exit
# END OF SCRIPT
If you look at the output of your script, it contains this bit of text, right before the flag that's out of order:
--parameter=platform::ANDROID#platform --project_id=mindfulness
I'm pretty sure you want your platform to be ANDROID, not ANDROID#platform.
I suspect you can fix this just by putting a space between the end of the string and the inline comment, so that you have something like this:
PLATFORM="ANDROID" #"platform of the app. ANDROID or IOS"
Although, to be safe, you might want to remove the inline comments at the end of each line entirely.
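To see why, here is a small illustrative snippet (not part of the migration script) showing how bash treats the two versions:
# Without a space, the quoted comment text is glued onto the value:
PLATFORM="ANDROID"#"platform of the app. ANDROID or IOS"
echo "$PLATFORM"
# -> ANDROID#platform of the app. ANDROID or IOS
# Because --parameter=platform::$PLATFORM is expanded unquoted, bq then receives
# --parameter=platform::ANDROID#platform plus several stray words, which pushes
# --project_id after the last positional argument.

# With a space, the comment is a real comment and the value is just ANDROID:
PLATFORM="ANDROID" # platform of the app. ANDROID or IOS
echo "$PLATFORM"
# -> ANDROID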

How to get website information with unix

Using unix commands, how would I be able to take website information and place it inside a variable?
I have been practicing with curl -sS, which strips out the download progress output and just prints the downloaded data (or any error) to the console. If there is another method, I would be glad to hear it.
But so far I have a website and I want to get certain information out of it, so I am using curl and cut like so:
curl -sS "https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_judo?action=raw | cut -c"19-"
How would I put this into a variable? My attempts have not been successful so far.
Wrap any command in $(...) to capture the output in the shell, which you could then assign to a variable (or do anything else you want with it):
var=$(curl -sS "https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_judo?action=raw" | cut -c19-)
(Note that the closing double quote belongs right after the URL, before the pipe; otherwise the pipe and the cut command become part of the URL argument.)
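A small usage sketch of the same idea (URL and offset taken from the question):
page=$(curl -sS "https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_judo?action=raw" | cut -c19-)
# The captured data is now ordinary shell text: check its size, preview it,
# or feed it to other tools.
echo "${#page} characters captured"
printf '%s\n' "$page" | head -n 3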

Obtain filename from url in R

I have a URL like http://example.com/files/01234 that, when I click it in the browser, downloads a zip file named something like file-08.zip.
With wget I can download using the real file name by running
wget --content-disposition http://example.com/files/01234
Functions such as basename do not work in this case, for example:
> basename("http://example.com/files/01234")
[1] "01234"
I'd like to obtain just the filename from the URL in R and create a tibble with the zip file names. It doesn't matter whether it uses packages or a system(...) call. Any ideas? What I'd like to obtain is something like:
url | file
--------------------------------------------
http://example.com/files/01234 | file-08.zip
http://example.com/files/03210 | file-09.zip
...
Using the httr library, you can make a HEAD call and then parse the content-disposition header. For example:
library(httr)
hh <- HEAD("https://example.com/01234567")
get_disposition_filename <- function(x) {
sub(".*filename=", "", headers(x)$`content-disposition`)
}
get_disposition_filename(hh)
This function doesn't check that the header actually exists, so it's not very robust, but it should work in the case where the server returns an alternate name for the downloaded file.
With @Sathish's contribution:
When the URL string doesn't contain the name of the file to download, a valid solution is:
system("curl -IXGET -r 0-10 https://example.com/01234567 | grep attachment | sed 's/^.\\+filename=//'")
The idea is to read only the first 10 bytes of the zip instead of the full file before obtaining the file name; it will return file-789456.zip, or whatever the real zip name behind that URL is.
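The same idea also works as a plain shell one-liner outside R; a sketch using only a HEAD request (so nothing is downloaded), assuming the server sends a Content-Disposition header with an unquoted filename= value:
url="https://example.com/01234567"
# -s: silent, -I: HEAD request only. Strip the carriage returns that HTTP
# headers carry, then print whatever follows "filename=".
curl -sI "$url" | tr -d '\r' | sed -n 's/.*filename=//p'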

Getting a list of files on a web server

All,
I would like to get a list of files from a server with the full URL intact. For example, I would like to get all the TIFFs from here:
http://hyperquad.telascience.org/naipsource/Texas/20100801/*
I can download all the .tif files with wget, but what I am looking for is just the full URL to each file, like this:
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_04_2_20100430.tif
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_04_3_20100424.tif
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_04_4_20100430.tif
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_05_1_20100430.tif
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_05_2_20100430.tif
Any thoughts on how to get all these files into a list using something like curl or wget?
Adam
You'd need the server to be willing to give you a page with a listing on it. This would normally be an index.html, or you can just ask for the directory itself:
http://hyperquad.telascience.org/naipsource/Texas/20100801/
It looks like you're in luck in this case, so, at the risk of upsetting the webmaster, the solution would be to use wget's recursive option (see the sketch below). Specify a maximum recursion depth of 1 to keep it constrained to that single directory.
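A sketch of that wget invocation (the exact log lines wget prints can vary between versions, so the final grep may need adjusting):
# --spider: check links without saving them; -r -l1: recurse one level only;
# -np: never ascend to the parent directory; -A: only consider .tif links.
wget --spider -r -l1 -np -A '*.tif' \
    "http://hyperquad.telascience.org/naipsource/Texas/20100801/" 2>&1 | \
    grep -o 'http://[^ ]*\.tif' | sort -u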
I would use the lynx text-mode web browser to get the list of links, plus the grep and awk shell tools to filter the results, like this:
lynx -dump -listonly <URL> | grep http | grep <regexp> | awk '{print $2}'
...where:
URL is the start URL, in your case http://hyperquad.telascience.org/naipsource/Texas/20100801/
regexp is the regular expression that selects only the files that interest you, in your case \.tif$
A complete example command line to get links to TIF files from this SO page:
lynx -dump -listonly http://stackoverflow.com/questions/6989681/getting-a-list-of-files-on-a-web-server | grep http | grep '\.tif$' | awk '{print $2}'
...now returns:
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_04_2_20100430.tif
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_04_4_20100430.tif
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_05_2_20100430.tif
If you wget http://hyperquad.telascience.org/naipsource/Texas/20100801/, the HTML that is returned contains the list of files. If you don't need this to be general, you could use regexes to extract the links. If you need something more robust, you can use an HTML parser (e.g. BeautifulSoup), and programmatically extract the links on the page (from the actual HTML structure).
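For the quick regex route, a bash sketch (assuming the listing uses double-quoted, relative href attributes, which is what directory indexes typically emit):
base="http://hyperquad.telascience.org/naipsource/Texas/20100801/"
# Fetch the listing once, keep every double-quoted href ending in .tif,
# strip the attribute syntax, and prefix the directory URL to get full links.
curl -s "$base" | grep -o 'href="[^"]*\.tif"' | cut -d'"' -f2 | sed "s|^|$base|"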
WinSCP has a Find window that lets you search for all files in a directory and its subdirectories on your own site. Afterwards you can select all of the results and copy them, which gives you the links to all the files as text. You need the username and password to connect over FTP:
https://winscp.net/eng/download.php
I have a client-server system that retrieves the file names from an assigned folder in the app server's folder, then displays thumbnails in the client.
CLIENT: (slThumbnailNames is a string list)
slThumbnailNames := funGetThumbnailNames(sThumbNailPath);

function TfMFFBClient.funGetThumbnailNames(sThumbnailPath: string): TStringList;
var
  slThisStringList: TStringList;
begin
  slThisStringList := TStringList.Create;
  dmMFFBClient.tcpMFFBClient.SendCmd('GetThumbnailNames,' + sThumbnailPath, 700);
  dmMFFBClient.tcpMFFBClient.IOHandler.Capture(slThisStringList);
  Result := slThisStringList;
end;

=== on the server side ===
A TIdCmdTCPServer has a command handler GetThumbnailNames (a command handler is a procedure).
Hints: sMFFBServerPictures is generated in the OnCreate method of the app server;
sThumbnailDir is passed to the app server from the client.

procedure TfMFFBServer.MFFBCmdTCPServercmdGetThumbnailNames(
  ASender: TIdCommand);
var
  sRec: TSearchRec;
  sThumbnailDir: string;
  iNumFiles: Integer;
begin
  try
    ASender.Response.Clear;
    sThumbnailDir := ASender.Params[0];
    iNumFiles := FindFirst(sMFFBServerPictures + sThumbnailDir + '*_t.jpg', faAnyFile, sRec);
    if iNumFiles = 0 then
      try
        while iNumFiles = 0 do
        begin
          // skip directories, return only the thumbnail file names
          if (sRec.Attr and faDirectory <> faDirectory) then
            ASender.Response.Add(sRec.Name);
          iNumFiles := FindNext(sRec);
        end;
      finally
        FindClose(sRec)
      end
    else
      ASender.Response.Add('NO THUMBNAILS');
  except
    on e: Exception do
    begin
      MessageDlg('Error in procedure TfMFFBServer.MFFBCmdTCPServercmdGetThumbnailNames' + #13 +
        'Error msg: ' + e.Message, mtError, [mbOK], 0);
    end;
  end;
end;

HOWTO extract specific text from HTML/XML, using cURL and HTTP

I successfully downloaded a file from a remote server using cURL and HTTP, but the file includes all the HTML code.
Is there a function in cURL so that I can extract the values I want?
For example, I am getting:
...
<body>
Hello,Manu
</body>
...
But I only want Hello,Manu.
Thanks in advance,
Manu
Try using DOMDocument or any other HTML/XML parser.
$doc = new DOMDocument();
$doc->loadHTML($html_content); // result from curl
$xpath = new DOMXPath($doc);
echo $xpath->query('//body')->item(0)->nodeValue;
Alternatively, on the command line you can use:
curl 'http://.................' | xpath '//body'
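If that xpath command-line helper isn't available, xmllint from libxml2 can do the same thing; a sketch with a placeholder URL:
# --html tolerates real-world markup; string(//body) prints just the text content.
curl -s 'http://example.com/page.html' | xmllint --html --xpath 'string(//body)' - 2>/dev/null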
