Why does my import urlopen from urllib.request not go through? - web-scraping

I have the code below, which fetches a page that I eventually want to scrape and analyse.
It has been running for almost an hour and doesn't seem to retrieve anything from this site.
import bs4 as bs
from urllib.request import urlopen as ureq
my_url2 = 'https://www.dreamteamfc.com/g/#tournament/stats-centre-stats'
ureq(my_url2)

The data you're looking for is loaded from another URL via Ajax, so BeautifulSoup doesn't see it in the page's HTML. Also, consider using the requests module for fetching pages and JSON data; it handles compression, redirects, etc. automatically.
To load the data, use the following example:
import json
import requests
url = "https://nuk-data.s3.eu-west-1.amazonaws.com/json/players_tournament.json"
data = requests.get(url).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
# print some data to screen:
for player in data:
    print(
        "{:<15} {:<15} {}".format(
            player["first_name"], player["last_name"], player["cost"]
        )
    )
Prints:
Cristiano Ronaldo 7000000
Goran Pandev 1000000
David Marshall 2000000
Jesús Navas 3000000
Kasper Schmeichel 3000000
Sergio Ramos 5000000
Raúl Albiol 2000000
Giorgio Chiellini 3500000
...and so on.
EDIT: To load the data into a DataFrame, you can use pd.json_normalize:
import json
import requests
import pandas as pd
url = "https://nuk-data.s3.eu-west-1.amazonaws.com/json/players_tournament.json"
data = requests.get(url).json()
df = pd.json_normalize(data)
print(df)
df.to_csv("data.csv", index=None)
Prints:
id first_name last_name squad_id cost status positions locked injury_type injury_duration suspension_length cname stats.round_rank stats.season_rank stats.games_played stats.total_points stats.avg_points stats.high_score stats.low_score stats.last_3_avg stats.last_5_avg stats.selections stats.owned_by stats.MIN stats.SMR stats.SMB stats.GS stats.ASS stats.YC stats.RC stats.PM stats.PS stats.CS stats.GC stats.star_man_awards stats.7_plus_ratings stats.goals stats.assists stats.cards stats.clean_sheets tournament_stats.star_man_awards tournament_stats.7_plus_ratings tournament_stats.goals tournament_stats.assists tournament_stats.cards tournament_stats.clean_sheets
0 14937 Cristiano Ronaldo 359 7000000 playing [4] 0 None None None None 0 0 9 0 0 0 0 0 0 22760 41.3 764 0 0 15 0 1 0 0 0 7 0 0 0 0 0 0 0 0 0 0 0 0 0
1 15061 Goran Pandev 504 1000000 playing [4] 0 None None None None 0 0 0 0 0 0 0 0 0 50 0.1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 15144 David Marshall 115 2000000 playing [1] 0 None None None None 0 0 0 0 0 0 0 0 0 166 0.3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 17740 Jesús Navas 118 3000000 playing [3] 0 None None None None 0 0 0 0 0 0 0 0 0 154 0.3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 17745 Kasper Schmeichel 369 3000000 playing [1] 0 None None None None 0 0 9 0 0 0 0 0 0 3261 5.9 810 0 0 0 0 1 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0
5 17861 Sergio Ramos 118 5000000 playing [2] 0 None None None None 0 0 9 0 0 0 0 0 0 14647 26.6 712 0 0 1 0 1 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0
...and so on.
The script also saves the data to data.csv.
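The dotted column names in the output above (for example stats.games_played) come from flattening nested JSON objects. The following is only a rough sketch of that idea in plain Python, not pandas' actual implementation; the `flatten` helper and the sample record are hypothetical and merely shaped like one entry of the players JSON.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects and prefix their keys.
            flat.update(flatten(value, full_key, sep))
        else:
            # Scalars and lists are kept as they are.
            flat[full_key] = value
    return flat


# Hypothetical record, shaped like one entry of the players JSON.
player = {
    "first_name": "Cristiano",
    "last_name": "Ronaldo",
    "cost": 7000000,
    "positions": [4],
    "stats": {"games_played": 9, "owned_by": 41.3},
}

row = flatten(player)
print(row)
```

Lists such as positions are left untouched, which matches the [4] values visible in the printed DataFrame.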

Related

Turn a long data structure to a wide matrix structure

I do have the following data structure...
ID value
1 1 1
2 1 63
3 1 2
4 1 58
5 2 3
6 2 4
7 3 34
8 3 25
Now I want to turn it into a kind of dyadic data structure: every pair of IDs that share a value should have a relationship.
I tried several options, and
df_wide <- dcast(df, ID ~ value)
... have brought me a long way down the road...
ID 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 39 40
1 1001 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1006 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1007 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 2 0 0
4 1011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 1018 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 1020 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
7 1030 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0
8 1036 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
My main problem now is turning it into a proper matrix so that I can build an igraph object from it.
df_wide_matrix <- data.matrix(df_wide)
df_aus_wide_g <- graph.edgelist(df_wide_matrix ,directed = TRUE)
... doesn't get me there.
I also tried to transform it into an adjacency matrix:
df_wide_matrix <- get.adjacency(graph.edgelist(as.matrix(df_wide), directed=FALSE))
... but it didn't work either
If you want to create an edge between all IDs that share a value, try something like this instead. First, merge the data frame with itself on value. Then drop the value column and remove all (undirected) edges that are mirrored duplicates or self-loops (an ID paired with itself). Finally, convert the result to a two-column matrix and create the graph from that edge list.
res <- merge(df, df, by='value', all=FALSE)[,c('ID.x','ID.y')]
res <- res[res$ID.x<res$ID.y,]
resg <- graph.edgelist(as.matrix(res))
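The same self-join idea can be sketched outside R. The Python version below is only an illustration under stated assumptions: `shared_value_edges` is a hypothetical helper, and the sample rows are made up so that some IDs actually share a value. Unlike the merge above, it also collapses pairs that share more than one value into a single edge.

```python
from collections import defaultdict
from itertools import combinations


def shared_value_edges(rows):
    """Return sorted undirected edges between IDs that share a value."""
    ids_by_value = defaultdict(set)
    for node_id, value in rows:
        ids_by_value[value].add(node_id)

    edges = set()
    for ids in ids_by_value.values():
        # combinations() over sorted IDs yields each pair once with a < b,
        # so there are no self-loops and no mirrored duplicates.
        for a, b in combinations(sorted(ids), 2):
            edges.add((a, b))
    return sorted(edges)


# Hypothetical (ID, value) rows.
rows = [(1, 1), (1, 63), (2, 63), (2, 4), (3, 4), (3, 1)]
edges = shared_value_edges(rows)
print(edges)  # [(1, 2), (1, 3), (2, 3)]
```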

zoo's NA handling methods in r

I am experimenting with the different imputation methods in zoo.
So far I have tried na.locf, na.approx, and na.spline on my dataset. However, when I tried the same dataset with na.StructTS, which uses a seasonal Kalman filter, it returned the following error:
Error in StructTS(y) : 'x' must be numeric
Did I miss something? Any help is appreciated.
UPD1
my code:
empty <-zoo(order.by=seq.Date(head(index(df1.zoo),1),tail(index(df1.zoo),1),by="days"))
merged<-na.StructTS(merge(df1.zoo,empty))
here is df1.zoo:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
2012-01-01 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 42
2012-01-02 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 57
2012-01-03 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 51
2012-01-04 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 41
2012-01-05 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 56
2012-01-06 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 55
here is empty:

Generating confusion matrix with HResults HTK Tool for handwriting recognition ICFHR’s example

I am studying how the HTK tools work for handwriting recognition. Following the ICFHR-2010 tutorial, I ran the examples for the "Spanish-Numbers" corpus and obtained the resulting HMMs (files stored in the folder hmm and listed in HMMsList) and res32.mlf, which contains the recognition results produced by HVite. I also have the master label file SamplesRef.mlf.
Now I want to see the recognition statistics, i.e. to learn how the HResults tool works.
When I run HResults as
HResults -I SamplesRef.mlf HMMsList res32.mlf
I see
====================== HTK Results Analysis =======================
Date: Tue Mar 31 15:21:11 2015
Ref : SamplesRef.mlf
Rec : res32.mlf
------------------------ Overall Results --------------------------
SENT: %Correct=0.00 [H=0, S=2, N=2]
WORD: %Corr=77.78, Acc=77.78 [H=7, D=0, S=2, I=0, N=9]
===================================================================
But if I add the -p option to get a confusion matrix, I see the following error message:
~/icfhr$ HResults -p -I SamplesRef.mlf HMMsList res32.mlf
ERROR [+3331] Index: Label millones not in list[0 of 19]
FATAL ERROR - Terminating program HResults
I understand that the message means there is no HMM named "millones", and I found that the samples in my res32.mlf look like:
"'*'/210341.rec"
mil
seiscientos
cincuenta
y
siete
millones
.
If I use a text editor to change res32.mlf into res33.mlf with content like:
"'*'/210341.rec"
m
i
l
s
e
i
s
c
i
... and so on.
and use samples.mlf (instead of SamplesRef.mlf), which looks like this inside:
"*/210341.lab"
m
i
l
#
q
u
i
n
i
e
n
t
o
s
#
c
... and so on.
I get the desired result:
~/icfhr$ HResults -p -I samples.mlf HMMsList res33.mlf
====================== HTK Results Analysis =======================
Date: Tue Mar 31 15:35:42 2015
Ref : samples.mlf
Rec : res33.mlf
------------------------ Overall Results --------------------------
SENT: %Correct=0.00 [H=0, S=2, N=2]
WORD: %Corr=79.63, Acc=77.78 [H=43, D=5, S=6, I=1, N=54]
------------------------ Confusion Matrix -------------------------
a c d e i l m n o s t u v y Del [ %c / %e]
# 0 0 0 0 0 1 1 0 0 0 0 0 0 0 5 [ 0.0/3.7]
a 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0
d 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
e 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0
i 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0
l 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0
m 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0
n 0 1 0 0 0 0 0 6 0 0 0 0 0 0 0 [85.7/1.9]
o 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0
q 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 [ 0.0/1.9]
s 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0
t 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0
u 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 [50.0/1.9]
v 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
y 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 [50.0/1.9]
Ins 0 0 0 0 0 0 0 0 0 1 0 0 0 0
===================================================================
So, the main question is:
What is the simplest way (without text editor) to make mlf-files adapted for making confusion matrix?
(I suppose I miss some option of some HTK tool… but which tool and which option?)
Any useful ideas would be highly appreciated.
To use the -p option, you need to provide the list of class labels, not the list of your HMMs. For example, if you're trying to recognize the words Yes, No and Never, then your "HMMsList" file should be written as:
Yes
No
Never
regardless of the HMMs that actually make up the words.
In other words, your "HMMsList" file should really be a "LabelsList".
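One way to build such a labels list without a text editor is to collect the distinct labels from an existing MLF. The Python sketch below is not an HTK tool. The helper name `labels_from_mlf` and the sample text are hypothetical, and it assumes each label line is either a bare label or has the form "start end label [score]".

```python
def labels_from_mlf(text):
    """Collect the distinct labels found in the text of an HTK master label file."""
    labels = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, the MLF header, file-name patterns and end-of-entry dots.
        if not line or line == "." or line.startswith("#!MLF!#") or line.startswith('"'):
            continue
        fields = line.split()
        if len(fields) >= 3 and fields[0].isdigit() and fields[1].isdigit():
            labels.add(fields[2])  # "start end label [score]"
        else:
            labels.add(fields[0])  # bare label
    return sorted(labels)


# Hypothetical MLF content, modelled on the res32.mlf excerpt in the question.
sample = """#!MLF!#
"*/210341.rec"
mil
seiscientos
cincuenta
y
siete
millones
.
"*/210342.rec"
0 1200000 dos -512.25
mil
.
"""

labels = labels_from_mlf(sample)
print("\n".join(labels))
```

Writing the printed lines to a file (for example LabelsList) would give a list that could be passed to HResults in place of HMMsList.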

read.table line 15 does not contain 23 elements - R

Here is the code I used:
d = read.table("Movies.txt",
sep="\t",
col.names=c( "id", "name", "date", "link", "c1", "c2", "c3","c4", "c5", "c6","c7", "c8", "c9","c10", "c11", "c12","c13", "c14", "c15","c16", "c17", "c18", "c19"),
fill=FALSE,
strip.white=TRUE)
and here is the text file:
1 Toy Story (1995) 01-Jan-95 http://us.imdb.com/M/title-exact?Toy%20Story%20(1995) 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
2 GoldenEye (1995) 01-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(1995) 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
3 Four Rooms (1995) 01-Jan-95 http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
4 Get Shorty (1995) 01-Jan-95 http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995) 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0
5 Copycat (1995) 01-Jan-95 http://us.imdb.com/M/title-exact?Copycat%20(1995) 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0
6 Shanghai Triad (Yao a yao yao dao waipo qiao) (1995) 01-Jan-95 http://us.imdb.com/Title?Yao+a+yao+yao+dao+waipo+qiao+(1995) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
7 Twelve Monkeys (1995) 01-Jan-95 http://us.imdb.com/M/title-exact?Twelve%20Monkeys%20(1995) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
8 Babe (1995) 01-Jan-95 http://us.imdb.com/M/title-exact?Babe%20(1995) 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0
9 Dead Man Walking (1995) 01-Jan-95 http://us.imdb.com/M/title-exact?Dead%20Man%20Walking%20(1995) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
10 Richard III (1995) 22-Jan-96 http://us.imdb.com/M/title-exact?Richard%20III%20(1995) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
11 Seven (Se7en) (1995) 01-Jan-95 http://us.imdb.com/M/title-exact?Se7en%20(1995) 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0
12 "Usual Suspects, The (1995)" 14-Aug-95 "http://us.imdb.com/M/title-exact?Usual%20Suspects,%20The%20(1995)" 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0
13 Mighty Aphrodite (1995) 30-Oct-95 http://us.imdb.com/M/title-exact?Mighty%20Aphrodite%20(1995) 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
14 "Postino, Il (1994)" 01-Jan-94 "http://us.imdb.com/M/title-exact?Postino,%20Il%20(1994)" 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0
15 Mr. Holland's Opus (1995) 29-Jan-96 http://us.imdb.com/M/title-exact?Mr.%20Holland's%20Opus%20(1995) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
16 French Twist (Gazon maudit) (1995) 01-Jan-95 http://us.imdb.com/M/title-exact?Gazon%20maudit%20(1995) 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
17 From Dusk Till Dawn (1996) 05-Feb-96 http://us.imdb.com/M/title-exact?From%20Dusk%20Till%20Dawn%20(1996) 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0
18 "White Balloon, The (1995)" 01-Jan-95 http://us.imdb.com/M/title-exact?Badkonake%20Sefid%20(1995) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
19 Antonia's Line (1995) 01-Jan-95 http://us.imdb.com/M/title-exact?Antonia%20(1995) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
20 Angels and Insects (1995) 01-Jan-95 http://us.imdb.com/M/title-exact?Angels%20and%20Insects%20(1995) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0
It is clear that line 15 has 23 elements. It is even clearer in a text editor, where the tabs are exactly where they should be. Why would I be getting this error message?
Disable quoting by adding the parameter quote = "" to read.table().
I believe the issue is caused by the various single quotes (') and double quotes (") throughout the file.
See ?read.table for more information. Per the documentation, you can also see ?scan for how quotes embedded in quotes are handled.
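A quick way to check this diagnosis outside R is to count tab-separated fields per line with quote handling turned off. The sketch below uses Python's csv module. The two rows are modelled on lines 14 and 15 of the file, with the 19 flag columns simplified to zeros as an assumption.

```python
import csv
import io

# 19 flag columns, simplified to zeros for illustration.
zeros = ["0"] * 19
rows_in = [
    ["14", '"Postino, Il (1994)"', "01-Jan-94",
     '"http://us.imdb.com/M/title-exact?Postino,%20Il%20(1994)"'] + zeros,
    ["15", "Mr. Holland's Opus (1995)", "29-Jan-96",
     "http://us.imdb.com/M/title-exact?Mr.%20Holland's%20Opus%20(1995)"] + zeros,
]
text = "\n".join("\t".join(fields) for fields in rows_in) + "\n"

# QUOTE_NONE treats ' and " as ordinary characters.
reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
counts = [len(row) for row in reader]
print(counts)  # [23, 23]
```

If every line reports 23 fields once quotes are ignored, the file's tab structure is fine and the quote characters are the likely culprit.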

How do you silently save an inspect object in R's tm package?

When I save the inspect() object in R's tm package it prints to screen. It does save the data that I want in the data.frame, but I have thousands of documents to analyze and the printing to screen is eating up my memory.
library(tm)
data("crude")
matrix <- TermDocumentMatrix(crude,control=list(removePunctuation = TRUE,
stopwords=TRUE))
out= data.frame(inspect(matrix))
I have tried every trick that I can think of. capture.output() changes the object (not the desired effect), as does sink(). dev.off() does not work. invisible() does nothing. suppressWarnings(), suppressMessages(), and try() unsurprisingly do nothing. There are no silent or quiet options in the inspect command.
The closest that I can get is
out= capture.output(inspect(matrix))
out= data.frame(out)
which notably does not give the same data.frame, but could pretty easily be made to if I need to go down this route. Any other (less hacky) suggestions would be helpful. Thanks.
Windows 7
64-bit R 3.0.1
tm package is the most recent version (0.5-9.1).
Assign inside the capture then:
capture.output(out <- data.frame(inspect(matrix))) -> .null # discarding this
But really, inspect is for visual inspection, so maybe try
as.data.frame(as.matrix(matrix))
instead (btw matrix is a very unfortunate name for a variable, as that's a base function).
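The "assign inside the capture" pattern is not specific to R. As a rough Python analogue, the sketch below uses a hypothetical `inspect` function that stands in for any function that prints as a side effect and returns its data. The printed text goes into a buffer while the assignment still takes effect.

```python
import contextlib
import io


def inspect(data):
    """Hypothetical stand-in: prints a summary and returns the data."""
    print("noisy summary of", len(data), "items")
    return data


buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    # The assignment happens inside the capture, so nothing reaches the screen.
    out = inspect([1, 2, 3])

print(out)
```

The captured text stays available through `buffer.getvalue()`, or can simply be discarded.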
Using this input (variable name changed from your question, since a variable named "matrix" can be confusing):
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude,control=list(removePunctuation = TRUE,
stopwords=TRUE))
Then this will avoid printing to the screen:
m <- as.matrix(tdm)
and then I would personally do something like
require(data.table)
data.table(m, keep.rownames=TRUE)
# rn 127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 502 543 704 708
# 1: 100000 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
# 2: 108 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
# 3: 111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
# 4: 115 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
# 5: 12217 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
# ---
# 996: yesterday 0 0 0 0 0 0 0 3 0 0 1 0 0 0 0 0 0 0 0 0
# 997: yesterdays 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
# 998: york 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0
# 999: zero 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0
# 1000: zone 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0

Resources