Scrapy XPath returns empty list - web-scraping

It works if the XPath uses the contains() function:
response.xpath('//table[contains(@class, "wikitable sortable")]')
However, it returns an empty list with the code below:
response.xpath('//table[@class="wikitable sortable jquery-tablesorter"]')
Any explanation of why it returns an empty list?
For more information: I'm trying to extract the territory rankings table from https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population as practice.

The class "jquery-tablesorter" is added to the table by JavaScript after the page loads, so it is not present in the raw HTML that Scrapy downloads; an exact @class match against all three classes therefore finds nothing, while contains(@class, "wikitable sortable") matches the server-rendered classes and works.
If you just want the data, you can extract the territory rankings table easily using only pandas, as follows:
Code:
import pandas as pd

# read_html returns a list of every table whose attributes match attrs
dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population',
                   attrs={'class': 'wikitable sortable'})
df = dfs[0]  # .to_csv('d.csv')
print(df)
Output:
Rank State or territory ... % of the total U.S. pop.[d] % of Elec. Coll.
'20 '10 State or territory ... 2010 Ch.2010–2020 % of Elec. Coll.
0 1.0 1.0 California ... 11.91% –0.11% 10.04%
1 2.0 2.0 Texas ... 8.04% 0.66% 7.43%
2 3.0 4.0 Florida ... 6.01% 0.42% 5.58%
3 4.0 3.0 New York ... 6.19% –0.17% 5.20%
4 5.0 6.0 Pennsylvania ... 4.06% –0.18% 3.53%
5 6.0 5.0 Illinois ... 4.10% –0.28% 3.53%
6 7.0 7.0 Ohio ... 3.69% –0.17% 3.16%
7 8.0 9.0 Georgia ... 3.10% 0.10% 2.97%
8 9.0 10.0 North Carolina ... 3.05% 0.07% 2.97%
9 10.0 8.0 Michigan ... 3.16% –0.15% 2.79%
10 11.0 11.0 New Jersey ... 2.81% –0.04% 2.60%
11 12.0 12.0 Virginia ... 2.56% 0.02% 2.42%
12 13.0 13.0 Washington ... 2.15% 0.15% 2.23%
13 14.0 16.0 Arizona ... 2.04% 0.09% 2.04%
14 15.0 14.0 Massachusetts ... 2.09% 0.00% 2.04%
15 16.0 17.0 Tennessee ... 2.03% 0.03% 2.04%
16 17.0 15.0 Indiana ... 2.07% –0.05% 2.04%
17 18.0 19.0 Maryland ... 1.85% –0.00% 1.86%
18 19.0 18.0 Missouri ... 1.91% –0.08% 1.86%
19 20.0 20.0 Wisconsin ... 1.82% –0.06% 1.86%
20 21.0 22.0 Colorado ... 1.61% 0.12% 1.86%
21 22.0 21.0 Minnesota ... 1.70% 0.01% 1.86%
22 23.0 24.0 South Carolina ... 1.48% 0.05% 1.67%
23 24.0 23.0 Alabama ... 1.53% –0.03% 1.67%
24 25.0 25.0 Louisiana ... 1.45% –0.06% 1.49%
25 26.0 26.0 Kentucky ... 1.39% –0.04% 1.49%
26 27.0 27.0 Oregon ... 1.22% 0.04% 1.49%
27 28.0 28.0 Oklahoma ... 1.20% –0.02% 1.30%
28 29.0 30.0 Connecticut ... 1.14% –0.07% 1.30%
29 30.0 29.0 Puerto Rico ... 1.19% –0.21% —
30 31.0 35.0 Utah ... 0.88% 0.09% 1.12%
31 32.0 31.0 Iowa ... 0.97% –0.02% 1.12%
32 33.0 36.0 Nevada ... 0.86% 0.06% 1.12%
33 34.0 33.0 Arkansas ... 0.93% –0.03% 1.12%
34 35.0 32.0 Mississippi ... 0.95% –0.06% 1.12%
35 36.0 34.0 Kansas ... 0.91% –0.04% 1.12%
36 37.0 37.0 New Mexico ... 0.66% –0.03% 0.93%
37 38.0 39.0 Nebraska ... 0.58% 0.00% 0.93%
38 39.0 40.0 Idaho ... 0.50% 0.05% 0.74%
39 40.0 38.0 West Virginia ... 0.59% –0.06% 0.74%
40 41.0 41.0 Hawaii ... 0.43% 0.00% 0.74%
41 42.0 43.0 New Hampshire ... 0.42% –0.01% 0.74%
42 43.0 42.0 Maine ... 0.42% –0.02% 0.74%
43 44.0 44.0 Rhode Island ... 0.34% –0.01% 0.74%
44 45.0 45.0 Montana ... 0.32% 0.01% 0.74%
45 46.0 46.0 Delaware ... 0.29% 0.01% 0.56%
46 47.0 47.0 South Dakota ... 0.26% 0.00% 0.56%
47 48.0 49.0 North Dakota ... 0.21% 0.02% 0.56%
48 49.0 48.0 Alaska ... 0.23% –0.01% 0.56%
49 50.0 51.0 District of Columbia ... 0.19% 0.01% 0.56%
50 51.0 50.0 Vermont ... 0.20% –0.01% 0.56%
51 52.0 52.0 Wyoming ... 0.18% –0.01% 0.56%
52 53.0 53.0 Guam[8] ... 0.05% –0.00% —
53 54.0 54.0 U.S. Virgin Islands[9] ... 0.03% –0.00% —
54 55.0 55.0 American Samoa[10] ... 0.02% –0.00% —
55 56.0 56.0 Northern Mariana Islands[11] ... 0.02% –0.00% —
56 NaN NaN Contiguous United States ... 98.03% 0.23% 98.70%
57 NaN NaN The fifty states ... 98.50% 0.21% 99.44%
58 NaN NaN The fifty states and D.C. ... 98.69% 0.22% 100.00%
59 NaN NaN Total United States ... — — —
[60 rows x 16 columns]
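For comparison, the same table can be pulled in R with rvest. This is a hedged sketch, not part of the original answer; the CSS selector is an assumption that mirrors the contains() XPath above:
library(rvest)

url <- "https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
# Match only the classes present in the downloaded HTML; "jquery-tablesorter"
# is added later by JavaScript, so it must not appear in the selector.
tbl <- read_html(url) %>%
  html_node("table.wikitable.sortable") %>%
  html_table()
head(tbl)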

Related

How to sample data non-randomly

I have a weather dataset; my data is date-dependent.
I want to predict the temperature from 07 May 2008 until 18 May 2008 (which is maybe a total of 10-15 observations); my data size is around 200.
I will be using decision tree/RF and SVM & NN to make my prediction.
I've never handled data like this, so I'm not sure how to sample non-random data.
I want to split the data into 80% training data and 20% test data, but I want to take the data in the original order, not randomly. Is that possible?
install.packages("rattle")
install.packages("RGtk2")
library("rattle")
seed <- 42
set.seed(seed)
fname <- system.file("csv", "weather.csv", package = "rattle")
dataset <- read.csv(fname, encoding = "UTF-8")
dataset <- dataset[1:200,]
dataset <- dataset[order(dataset$Date),]
set.seed(321)
sample_data = sample(nrow(dataset), nrow(dataset)*.8)
test<-dataset[sample_data,] # 30%
train<-dataset[-sample_data,] # 80%
Output:
> head(dataset)
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed
1 2007-11-01 Canberra 8.0 24.3 0.0 3.4 6.3 NW 30
2 2007-11-02 Canberra 14.0 26.9 3.6 4.4 9.7 ENE 39
3 2007-11-03 Canberra 13.7 23.4 3.6 5.8 3.3 NW 85
4 2007-11-04 Canberra 13.3 15.5 39.8 7.2 9.1 NW 54
5 2007-11-05 Canberra 7.6 16.1 2.8 5.6 10.6 SSE 50
6 2007-11-06 Canberra 6.2 16.9 0.0 5.8 8.2 SE 44
WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am
1 SW NW 6 20 68 29 1019.7
2 E W 4 17 80 36 1012.4
3 N NNE 6 6 82 69 1009.5
4 WNW W 30 24 62 56 1005.5
5 SSE ESE 20 28 68 49 1018.3
6 SE E 20 24 70 57 1023.8
Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
1 1015.0 7 7 14.4 23.6 No 3.6 Yes
2 1008.4 5 3 17.5 25.7 Yes 3.6 Yes
3 1007.2 8 7 15.4 20.2 Yes 39.8 Yes
4 1007.0 2 7 13.5 14.1 Yes 2.8 Yes
5 1018.5 7 7 11.1 15.4 Yes 0.0 No
6 1021.7 7 5 10.9 14.8 No 0.2 No
> head(test)
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed
182 2008-04-30 Canberra -1.8 14.8 0.0 1.4 7.0 N 28
77 2008-01-16 Canberra 17.9 33.2 0.0 10.4 8.4 N 59
88 2008-01-27 Canberra 13.2 31.3 0.0 6.6 11.6 WSW 46
58 2007-12-28 Canberra 15.1 28.3 14.4 8.8 13.2 NNW 28
96 2008-02-04 Canberra 18.2 22.6 1.8 8.0 0.0 ENE 33
126 2008-03-05 Canberra 12.0 27.6 0.0 6.0 11.0 E 46
WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am
182 E N 2 19 80 40 1024.2
77 N NNE 15 20 58 62 1008.5
88 N WNW 4 26 71 28 1013.1
58 NNW NW 6 13 73 44 1016.8
96 SSE ENE 7 13 92 76 1014.4
126 SSE WSW 7 6 69 35 1025.5
Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
182 1020.5 1 7 5.3 13.9 No 0.0 No
77 1006.1 6 7 24.5 23.5 No 4.8 Yes
88 1009.5 1 4 19.7 30.7 No 0.0 No
58 1013.4 1 5 18.3 27.4 Yes 0.0 No
96 1011.5 8 8 18.5 22.1 Yes 9.0 Yes
126 1022.2 1 1 15.7 26.2 No 0.0 No
> head(train)
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed
7 2007-11-07 Canberra 6.1 18.2 0.2 4.2 8.4 SE 43
9 2007-11-09 Canberra 8.8 19.5 0.0 4.0 4.1 S 48
11 2007-11-11 Canberra 9.1 25.2 0.0 4.2 11.9 N 30
16 2007-11-16 Canberra 12.4 32.1 0.0 8.4 11.1 E 46
22 2007-11-22 Canberra 16.4 19.4 0.4 9.2 0.0 E 26
25 2007-11-25 Canberra 15.4 28.4 0.0 4.4 8.1 ENE 33
WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am
7 SE ESE 19 26 63 47 1024.6
9 E ENE 19 17 70 48 1026.1
11 SE NW 6 9 74 34 1024.4
16 SE WSW 7 9 70 22 1017.9
22 ENE E 6 11 88 72 1010.7
25 SSE NE 9 15 85 31 1022.4
Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
7 1022.2 4 6 12.4 17.3 No 0.0 No
9 1022.7 7 7 14.1 18.9 No 16.2 Yes
11 1021.1 1 2 14.6 24.0 No 0.2 No
16 1012.8 0 3 19.1 30.7 No 0.0 No
22 1008.9 8 8 16.5 18.3 No 25.8 Yes
25 1018.6 8 2 16.8 27.3 No 0.0 No
I use mtcars as an example. An option for non-randomly splitting your data into train and test sets is to first compute a sample size based on the number of rows in your data. After that you can use split() to divide the data exactly at 80% of the rows, using the following code:
smp_size <- floor(0.80 * nrow(mtcars))
split <- split(mtcars, rep(1:2, each = smp_size))
With the following code you can turn the split in train and test:
train <- split$`1`
test <- split$`2`
Let's check the number of rows:
> nrow(train)
[1] 25
> nrow(test)
[1] 7
Now the data is split into train and test without losing its order.
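Applied to the weather data from the question, the same idea can be written with plain index-based subsetting. A minimal sketch, assuming dataset is already sorted by Date as in the question's code:
smp_size <- floor(0.80 * nrow(dataset))
train <- dataset[seq_len(smp_size), ]   # first 80% of rows, kept in date order
test  <- dataset[-seq_len(smp_size), ]  # remaining 20%, also in date order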

Delete Several Lines in txt file with conditional in R

I have a problem: I want to delete several lines in a txt file and then convert it to CSV with R, because I just want to get the data out of the txt.
My code can't delete properly, because it also deletes the lines which contain the date of the data.
Here is the code I used:
setwd("D:/tugasmaritim/")
FILES <- list.files( pattern = ".txt")
for (i in 1:length(FILES)) {
l <- readLines(FILES[i],skip=4)
l2 <- l[-sapply(grep("</PRE><H3>", l), function(x) seq(x, x + 30))]
l3 <- l2[-sapply(grep("<P>Description", l2), function(x) seq(x, x + 29))]
l4 <- l3[-sapply(grep("<HTML>", l3), function(x) seq(x, x + 3))]
write.csv(l4,row.names=FALSE,file=paste0("D:/tugasmaritim/",sub(".txt","",FILES[i]),".csv"))
}
My data looks like this:
<HTML>
<TITLE>University of Wyoming - Radiosonde Data</TITLE>
<LINK REL="StyleSheet" HREF="/resources/select.css" TYPE="text/css">
<BODY BGCOLOR="white">
<H2>96749 WIII Jakarta Observations at 00Z 02 Oct 1995</H2>
<PRE>
-----------------------------------------------------------------------------
PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
hPa m C C % g/kg deg knot K K K
-----------------------------------------------------------------------------
1011.0 8 23.2 22.5 96 17.30 0 0 295.4 345.3 298.5
1000.0 98 23.6 22.4 93 17.39 105 8 296.8 347.1 299.8
977.3 300 24.6 22.1 86 17.49 105 8 299.7 351.0 302.8
976.0 311 24.6 22.1 86 17.50 104 8 299.8 351.2 303.0
950.0 548 23.0 22.0 94 17.87 88 12 300.5 353.2 303.7
944.4 600 22.6 21.8 95 17.73 85 13 300.6 352.9 303.8
925.0 781 21.2 21.0 99 17.25 90 20 301.0 351.9 304.1
918.0 847 20.6 20.6 100 16.95 90 23 301.0 351.0 304.1
912.4 900 20.4 18.6 89 15.00 90 26 301.4 345.7 304.1
897.0 1047 20.0 13.0 64 10.60 90 26 302.4 334.1 304.3
881.2 1200 19.4 11.4 60 9.70 90 26 303.3 332.5 305.1
850.0 1510 18.2 8.2 52 8.09 95 18 305.2 329.9 306.7
845.0 1560 18.0 7.0 49 7.49 91 17 305.5 328.4 306.9
810.0 1920 15.0 9.0 67 8.97 60 11 306.0 333.4 307.7
792.9 2100 14.3 3.1 47 6.06 45 8 307.1 325.9 308.2
765.1 2400 13.1 -6.8 24 3.01 40 8 309.0 318.7 309.5
746.0 2612 12.2 -13.8 15 1.77 38 10 310.3 316.2 310.6
712.0 3000 10.3 -15.0 15 1.69 35 13 312.3 318.1 312.6
700.0 3141 9.6 -15.4 16 1.66 35 13 313.1 318.7 313.4
653.0 3714 6.6 -16.4 18 1.63 32 12 316.0 321.6 316.3
631.0 3995 4.8 -2.2 60 5.19 31 11 317.0 333.9 318.0
615.3 4200 3.1 -3.9 60 4.70 30 11 317.4 332.8 318.3
601.0 4391 1.6 -5.4 60 4.28 20 8 317.8 331.9 318.6
592.9 4500 0.6 -12.0 38 2.59 15 6 317.9 326.6 318.4
588.0 4567 0.0 -16.0 29 1.88 11 6 317.9 324.4 318.3
571.0 4800 -1.2 -18.9 25 1.51 355 5 319.1 324.4 319.4
549.8 5100 -2.8 -22.8 20 1.12 45 6 320.7 324.8 321.0
513.0 5649 -5.7 -29.7 13 0.64 125 10 323.6 326.0 323.8
500.0 5850 -5.1 -30.1 12 0.63 155 11 326.8 329.1 326.9
494.0 5945 -4.9 -29.9 12 0.65 146 11 328.1 330.6 328.3
471.7 6300 -7.4 -32.0 12 0.56 110 13 329.3 331.5 329.4
453.7 6600 -9.6 -33.8 12 0.49 100 14 330.3 332.2 330.4
400.0 7570 -16.5 -39.5 12 0.31 105 14 333.5 334.7 333.5
398.0 7607 -16.9 -39.9 12 0.30 104 14 333.4 334.6 333.5
371.9 8100 -20.4 -42.6 12 0.24 95 16 335.4 336.3 335.4
300.0 9660 -31.3 -51.3 12 0.11 115 18 341.1 341.6 341.2
269.0 10420 -36.3 -55.3 12 0.08 79 20 344.7 345.0 344.7
265.9 10500 -36.9 75 20 344.9 344.9
250.0 10920 -40.3 80 28 346.0 346.0
243.4 11100 -41.8 85 37 346.4 346.4
222.5 11700 -46.9 75 14 347.6 347.6
214.0 11960 -49.1 68 16 348.1 348.1
200.0 12400 -52.7 55 20 349.1 349.1
156.0 13953 -66.1 55 25 352.1 352.1
152.3 14100 -67.2 55 26 352.6 352.6
150.0 14190 -67.9 55 26 352.9 352.9
144.7 14400 -69.6 60 26 353.6 353.6
137.5 14700 -72.0 60 39 354.6 354.6
130.7 15000 -74.3 50 28 355.6 355.6
124.2 15300 -76.7 40 36 356.5 356.5
118.0 15600 -79.1 50 48 357.4 357.4
116.0 15698 -79.9 45 44 357.6 357.6
112.0 15900 -79.1 45 26 362.6 362.6
106.3 16200 -78.0 35 24 370.2 370.2
100.0 16550 -76.7 35 24 379.3 379.3
</PRE><H3>Station information and sounding indices</H3><PRE>
Station identifier: WIII
Station number: 96749
Observation time: 951002/0000
Station latitude: -6.11
Station longitude: 106.65
Station elevation: 8.0
Showalter index: 6.30
Lifted index: -1.91
LIFT computed using virtual temperature: -2.80
SWEAT index: 145.41
K index: 6.50
Cross totals index: 13.30
Vertical totals index: 23.30
Totals totals index: 36.60
Convective Available Potential Energy: 799.02
CAPE using virtual temperature: 1070.13
Convective Inhibition: -26.70
CINS using virtual temperature: -12.88
Equilibrum Level: 202.64
Equilibrum Level using virtual temperature: 202.60
Level of Free Convection: 828.70
LFCT using virtual temperature: 909.19
Bulk Richardson Number: 210.78
Bulk Richardson Number using CAPV: 282.30
Temp [K] of the Lifted Condensation Level: 294.96
Pres [hPa] of the Lifted Condensation Level: 958.67
Mean mixed layer potential temperature: 298.56
Mean mixed layer mixing ratio: 17.50
1000 hPa to 500 hPa thickness: 5752.00
Precipitable water [mm] for entire sounding: 36.31
</PRE>
<H2>96749 WIII Jakarta Observations at 00Z 03 Oct 1995</H2>
<PRE>
-----------------------------------------------------------------------------
PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
hPa m C C % g/kg deg knot K K K
-----------------------------------------------------------------------------
1012.0 8 23.6 22.9 96 17.72 140 2 295.7 346.9 298.9
1000.0 107 24.0 21.6 86 16.54 135 3 297.1 345.2 300.1
990.0 195 24.4 20.3 78 15.39 128 4 298.4 343.4 301.2
945.4 600 22.9 20.2 85 16.00 95 7 300.9 348.0 303.7
925.0 791 22.2 20.1 88 16.29 100 6 302.0 350.3 304.9
913.5 900 21.9 18.2 80 14.63 105 6 302.8 346.3 305.4
911.0 924 21.8 17.8 78 14.28 108 6 302.9 345.4 305.5
850.0 1522 17.4 16.7 96 14.28 175 6 304.4 347.1 307.0
836.0 1665 16.4 16.4 100 14.24 157 7 304.8 347.5 307.4
811.0 1925 15.0 14.7 98 13.14 123 8 305.9 345.6 308.3
795.0 2095 14.2 7.2 63 8.08 101 9 306.8 331.6 308.3
794.5 2100 14.2 7.2 63 8.05 100 9 306.8 331.5 308.3
745.0 2642 10.4 2.4 58 6.14 64 11 308.4 327.6 309.6
736.0 2744 11.0 0.0 47 5.23 57 11 310.2 326.7 311.1
713.8 3000 9.2 5.0 75 7.70 40 12 310.9 335.0 312.4
711.0 3033 9.0 5.6 79 8.08 40 12 311.0 336.2 312.6
700.0 3163 8.6 1.6 61 6.18 40 12 312.0 331.5 313.1
688.5 3300 8.3 -6.0 36 3.57 60 12 313.1 324.8 313.8
678.0 3427 8.0 -13.0 21 2.08 70 12 314.2 321.2 314.6
642.0 3874 5.0 -2.0 61 5.17 108 11 315.7 332.4 316.7
633.0 3989 4.4 -11.6 30 2.50 117 10 316.3 324.7 316.8
616.6 4200 3.1 -14.1 27 2.09 135 10 317.1 324.3 317.6
580.0 4694 0.0 -20.0 21 1.36 164 13 319.1 323.9 319.4
572.3 4800 -0.4 -20.7 20 1.29 170 14 319.9 324.5 320.1
510.8 5700 -4.0 -26.6 15 0.86 80 10 326.1 329.2 326.2
500.0 5870 -4.7 -27.7 15 0.79 80 10 327.2 330.2 327.4
497.0 5917 -4.9 -27.9 15 0.78 71 13 327.6 330.5 327.7
491.7 6000 -5.5 -28.3 15 0.76 55 19 327.9 330.7 328.0
473.0 6300 -7.6 -29.9 15 0.68 55 16 328.9 331.4 329.0
436.0 6930 -12.1 -33.1 16 0.54 77 17 330.9 333.0 331.0
400.0 7580 -17.9 -37.9 16 0.37 100 19 331.6 333.1 331.7
388.3 7800 -19.9 -39.9 15 0.31 105 20 331.8 333.1 331.9
386.0 7844 -20.3 -40.3 15 0.30 103 20 331.9 333.1 331.9
372.0 8117 -18.3 -38.3 16 0.38 91 23 338.1 339.6 338.1
343.6 8700 -22.1 -41.4 16 0.30 65 29 340.7 342.0 340.8
329.0 9018 -24.1 -43.1 16 0.26 73 27 342.2 343.2 342.2
300.0 9680 -29.9 -44.9 22 0.23 90 22 343.1 344.1 343.2
278.6 10200 -34.3 85 37 344.1 344.1
266.9 10500 -36.8 60 32 344.7 344.7
255.8 10800 -39.4 65 27 345.2 345.2
250.0 10960 -40.7 65 27 345.4 345.4
204.0 12300 -51.8 55 23 348.6 348.6
200.0 12430 -52.9 55 23 348.8 348.8
194.6 12600 -55.0 60 23 348.1 348.1
160.7 13800 -70.1 35 39 342.4 342.4
153.2 14100 -73.9 35 41 340.6 340.6
150.0 14230 -75.5 35 41 339.9 339.9
131.5 15000 -76.3 50 53 351.6 351.6
124.9 15300 -76.6 50 57 356.2 356.2
122.0 15436 -76.7 57 45 358.3 358.3
118.6 15600 -77.3 65 31 360.2 360.2
115.0 15779 -77.9 65 31 362.2 362.2
112.6 15900 -77.7 85 17 364.8 364.8
107.0 16200 -77.2 130 10 371.2 371.2
100.0 16590 -76.5 120 18 379.7 379.7
</PRE><H3>Station information and sounding indices</H3><PRE>
Station identifier: WIII
Station number: 96749
Observation time: 951003/0000
Station latitude: -6.11
Station longitude: 106.65
Station elevation: 8.0
Showalter index: -0.58
Lifted index: 0.17
LIFT computed using virtual temperature: -0.57
SWEAT index: 222.41
K index: 31.80
Cross totals index: 21.40
Vertical totals index: 22.10
Totals totals index: 43.50
Convective Available Potential Energy: 268.43
CAPE using virtual temperature: 431.38
Convective Inhibition: -84.04
CINS using virtual temperature: -81.56
Equilibrum Level: 141.42
Equilibrum Level using virtual temperature: 141.35
Level of Free Convection: 784.91
LFCT using virtual temperature: 804.89
Bulk Richardson Number: 221.19
Bulk Richardson Number using CAPV: 355.46
Temp [K] of the Lifted Condensation Level: 293.21
Pres [hPa] of the Lifted Condensation Level: 940.03
Mean mixed layer potential temperature: 298.46
Mean mixed layer mixing ratio: 16.01
1000 hPa to 500 hPa thickness: 5763.00
Precipitable water [mm] for entire sounding: 44.54
Here is my data, and this is what I want to get (the original question linked a sample data file and an example of the desired output).
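Since the fixed-offset deletions are what clip the date lines, a more robust approach is to keep only the wanted lines instead of deleting blocks around markers. This is a hedged sketch, not from the original thread, and it assumes the goal is to keep the observation-date headers plus the numeric sounding rows:
setwd("D:/tugasmaritim/")
FILES <- list.files(pattern = "\\.txt$")
for (f in FILES) {
  l <- readLines(f)
  is_date <- grepl("Observations at", l)                     # e.g. <H2>... 00Z 02 Oct 1995</H2>
  is_data <- grepl("^\\s*[0-9]+(\\.[0-9]+)?\\s+-?[0-9]", l)  # numeric sounding rows
  out <- gsub("<[^>]+>", "", l[is_date | is_data])           # strip leftover HTML tags
  write.csv(out, row.names = FALSE, file = sub("\\.txt$", ".csv", f))
}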

Tabulize function in R

I want to extract the table on page 112 of this PDF document:
http://publications.credit-suisse.com/tasks/render/file/index.cfm?fileid=432759CA-0A73-57F6-04C67EF7EE506040
# report 2017
url_location <-"http://publications.credit-suisse.com/tasks/render/file/index.cfm?fileid=432759CA-0A73-57F6-04C67EF7EE506040"
out <- extract_tables(url_location, pages = 112)
I have tried using these tutorials (link1, link2) about the 'tabulizer' package, but I largely failed. There are some difficult aspects that I am not experienced enough to handle in R.
Can someone suggest something and help me with that?
Installation
devtools::install_github("ropensci/tabulizer")
# load package
library(tabulizer)
Java deps — while getting easier to deal with — aren't necessary when the tables are this clean. Just a bit of string wrangling will get you what you need:
library(pdftools)
library(stringi)
library(tidyverse)
# read it with pdftools
book <- pdf_text("global-wealth-databook.pdf")
# go to the page
lines <- stri_split_lines(book[[113]])[[1]]
# remove footer
lines <- discard(lines, stri_detect_fixed, "Credit Suisse")
# find line before start of table
start <- last(which(stri_detect_regex(lines, "^[[:space:]]+")))+1
# find line after table
end <- last(which(lines == ""))-1
# smush into something read.[table|csv] can read
tab <- paste0(stri_replace_all_regex(lines[start:end], "[[:space:]][[:space:]]+", "\t"), collapse="\n")
# read it
read.csv(text=tab, header=FALSE, sep="\t", stringsAsFactors = FALSE)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 1 Egypt 56,036 3,168 324 98.1 1.7 0.2 0.0 100.0 91.7
## 2 El Salvador 3,957 14,443 6,906 66.0 32.8 1.2 0.0 100.0 65.7
## 3 Equatorial Guinea 670 8,044 2,616 87.0 12.2 0.7 0.1 100.0 77.3
## 4 Eritrea 2,401 3,607 2,036 94.5 5.4 0.1 100.0 57.1 NA
## 5 Estonia 1,040 43,158 27,522 22.5 72.2 5.1 0.2 100.0 56.4
## 6 Ethiopia 49,168 153 103 100.0 0.0 100.0 43.4 NA NA
## 7 Fiji 568 6,309 3,059 85.0 14.6 0.4 0.0 100.0 68.2
## 8 Finland 4,312 159,098 57,850 30.8 33.8 33.5 1.9 100.0 76.7
## 9 France 49,239 263,399 119,720 25.3 21.4 49.3 4.0 100.0 70.2
## 10 Gabon 1,098 15,168 7,367 62.0 36.5 1.5 0.0 100.0 68.4
## 11 Gambia 904 898 347 99.2 0.7 0.0 100.0 72.4 NA
## 12 Georgia 2,950 19,430 9,874 50.7 47.6 1.6 0.1 100.0 66.8
## 13 Germany 67,244 203,946 47,091 29.5 33.7 33.9 2.9 100.0 79.1
## 14 Ghana 14,574 809 411 99.5 0.5 0.0 100.0 66.1 NA
## 15 Greece 9,020 111,684 54,665 20.7 52.9 25.4 1.0 100.0 67.7
## 16 Grenada 70 17,523 4,625 74.0 24.3 1.5 0.2 100.0 81.5
## 17 Guinea 5,896 814 374 99.4 0.6 0.0 100.0 69.7 NA
## 18 Guinea-Bissau 884 477 243 99.8 0.2 100.0 65.6 NA NA
## 19 Guyana 467 5,345 2,510 89.0 10.7 0.3 0.0 100.0 67.2
## 20 Haiti 6,172 2,879 894 96.2 3.6 0.2 0.0 100.0 76.9
## 21 Hong Kong 6,172 193,248 46,079 26.3 50.9 20.9 1.9 100.0 85.1
## 22 Hungary 7,846 39,813 30,111 11.8 83.4 4.8 0.0 100.0 45.3
## 23 Iceland 245 587,649 444,999 13.0 72.0 15.0 100.0 46.7 NA
## 24 India 834,608 5,976 1,295 92.3 7.2 0.5 0.0 100.0 83.0
## 25 Indonesia 167,559 11,001 1,914 81.9 17.0 1.1 0.1 100.0 83.7
## 26 Iran 56,306 3,831 1,856 94.1 5.7 0.2 0.0 100.0 67.3
## 27 Ireland 3,434 248,466 84,592 31.2 22.7 42.3 3.6 100.0 81.3
## 28 Israel 5,315 198,406 78,244 22.3 38.7 36.7 2.3 100.0 74.2
## 29 Italy 48,544 223,572 124,636 21.3 22.0 54.1 2.7 100.0 66.0
## 30 Jamaica 1,962 9,485 3,717 79.0 20.2 0.8 0.0 100.0 74.3
## 31 Japan 105,228 225,057 123,724 7.9 35.7 53.9 2.6 100.0 60.9
## 32 Jordan 5,212 13,099 6,014 65.7 33.1 1.2 0.0 100.0 76.1
## 33 Kazakhstan 12,011 4,441 334 97.6 2.1 0.3 0.0 100.0 92.6
## 34 Kenya 23,732 1,809 662 97.4 2.5 0.1 0.0 100.0 77.2
## 35 Korea 41,007 160,609 67,934 20.0 40.5 37.8 1.7 100.0 70.0
## 36 Kuwait 2,996 97,304 37,788 30.3 48.3 20.4 1.0 100.0 76.9
## 37 Kyrgyzstan 3,611 4,689 2,472 92.7 7.0 0.2 0.0 100.0 62.9
## 38 Laos 3,849 5,662 1,382 94.6 4.7 0.7 0.0 100.0 84.9
## 39 Latvia 1,577 27,631 17,828 29.0 68.6 2.2 0.1 100.0 53.6
## 40 Lebanon 4,085 24,161 6,452 69.0 28.5 2.3 0.2 100.0 82.0
## 41 Lesotho 1,184 3,163 945 95.9 3.8 0.3 0.0 100.0 79.8
## 42 Liberia 2,211 2,193 959 97.3 2.6 0.1 0.0 100.0 71.6
## 43 Libya 4,007 45,103 24,510 29.6 61.1 9.2 0.2 100.0 59.9
## 44 Lithuania 2,316 27,507 17,931 27.3 70.4 2.1 0.1 100.0 51.6
## 45 Luxembourg 450 313,687 167,664 17.0 20.0 58.8 4.2 100.0 68.1
## 46 Macedonia 1,607 9,044 5,698 77.0 22.5 0.5 0.0 100.0 56.4
UPDATE
This is more generic but you'll still have to do some manual cleanup. I think you would even if you used Tabula.
library(pdftools)
library(stringi)
library(tidyverse)
# read it with pdftools
book <- pdf_text("~/Downloads/global-wealth-databook.pdf")
transcribe_page <- function(book, pg) {
  # go to the page
  lines <- stri_split_lines(book[[pg]])[[1]]
  # remove footer
  lines <- discard(lines, stri_detect_fixed, "Credit Suisse")
  # find line before start of table
  start <- last(which(stri_detect_regex(lines, "^[[:space:]]+"))) + 1
  # find line after table
  end <- last(which(lines == "")) - 1
  # get the target rows
  rows <- lines[start:end]
  # map out where data values are
  stri_replace_first_regex(rows, "([[:alpha:]]) ([[:alpha:]])", "$1_$2") %>%
    stri_replace_all_regex("[^[:blank:]]", "X") %>%
    map(~rle(strsplit(.x, "")[[1]])) -> pos
  # compute the number of data fields
  nfields <- ceiling(max(map_int(pos, ~length(.x$lengths))) / 2)
  # do our best to get them into columns
  data_frame(rec = rows) %>%
    separate(rec, into = sprintf("X%s", 1:nfields), sep = "[[:space:]]{2,}", fill = "left") %>%
    print(n = length(rows))
}
transcribe_page(book, 112)
transcribe_page(book, 113)
transcribe_page(book, 114)
transcribe_page(book, 115)
Take a look at the outputs for ^^. They aren't in terrible shape and some of the cleanup can be programmatic.

How to scrape tables inside a comment tag in html with R?

I am trying to scrape http://www.basketball-reference.com/teams/CHI/2015.html using rvest. I used SelectorGadget and found the selector to be #advanced for the table I want. However, I noticed rvest wasn't picking it up. Looking at the page source, I noticed that the tables are inside an HTML comment tag <!--.
What is the best way to get the tables from inside the comment tags? Thanks!
Edit: I am trying to pull the 'Advanced' table: http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none
You can use the XPath comment() function to select comment nodes, then reparse their contents as HTML:
library(rvest)
# scrape page
h <- read_html('http://www.basketball-reference.com/teams/CHI/2015.html')
df <- h %>% html_nodes(xpath = '//comment()') %>% # select comment nodes
html_text() %>% # extract comment text
paste(collapse = '') %>% # collapse to a single string
read_html() %>% # reparse to HTML
html_node('table#advanced') %>% # select the desired table
html_table() %>% # parse table
.[colSums(is.na(.)) < nrow(.)] # get rid of spacer columns
df[, 1:15]
## Rk Player Age G MP PER TS% 3PAr FTr ORB% DRB% TRB% AST% STL% BLK%
## 1 1 Pau Gasol 34 78 2681 22.7 0.550 0.023 0.317 9.2 27.6 18.6 14.4 0.5 4.0
## 2 2 Jimmy Butler 25 65 2513 21.3 0.583 0.212 0.508 5.1 11.2 8.2 14.4 2.3 1.0
## 3 3 Joakim Noah 29 67 2049 15.3 0.482 0.005 0.407 11.9 22.1 17.1 23.0 1.2 2.6
## 4 4 Aaron Brooks 30 82 1885 14.4 0.534 0.383 0.213 1.9 7.5 4.8 24.2 1.5 0.6
## 5 5 Mike Dunleavy 34 63 1838 11.6 0.573 0.547 0.181 1.7 12.7 7.3 9.7 1.1 0.8
## 6 6 Taj Gibson 29 62 1692 16.1 0.545 0.000 0.364 10.7 14.6 12.7 6.9 1.1 3.2
## 7 7 Nikola Mirotic 23 82 1654 17.9 0.556 0.502 0.455 4.3 21.8 13.3 9.7 1.7 2.4
## 8 8 Kirk Hinrich 34 66 1610 6.8 0.468 0.441 0.131 1.4 6.6 4.1 13.8 1.5 0.6
## 9 9 Derrick Rose 26 51 1530 15.9 0.493 0.325 0.224 2.6 8.7 5.7 30.7 1.2 0.8
## 10 10 Tony Snell 23 72 1412 10.2 0.550 0.531 0.148 2.5 10.9 6.8 6.8 1.2 0.6
## 11 11 E'Twaun Moore 25 56 504 10.3 0.504 0.273 0.144 2.7 7.1 5.0 10.4 2.1 0.9
## 12 12 Doug McDermott 23 36 321 6.1 0.480 0.383 0.140 2.1 12.2 7.3 3.0 0.6 0.2
## 13 13 Nazr Mohammed 37 23 128 8.7 0.431 0.000 0.100 9.6 22.3 16.1 3.6 1.6 2.8
## 14 14 Cameron Bairstow 24 18 64 2.1 0.309 0.000 0.357 10.5 3.3 6.8 2.2 1.6 1.1
OK, got it.
library(stringi)
library(knitr)
library(rvest)
any_version_html <- function(x){
  XML::htmlParse(x)
}
a <- 'http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none'
b <- readLines(a)
c <- paste0(b, collapse = "")
d <- as.character(unlist(stri_extract_all_regex(c, '<table(.*?)/table>', omit_no_match = T, simplify = T)))
e <- html_table(any_version_html(d))
> kable(summary(e),'rst')
====== ========== ====
Length Class Mode
====== ========== ====
9 data.frame list
2 data.frame list
24 data.frame list
21 data.frame list
28 data.frame list
28 data.frame list
27 data.frame list
30 data.frame list
27 data.frame list
27 data.frame list
28 data.frame list
28 data.frame list
27 data.frame list
30 data.frame list
27 data.frame list
27 data.frame list
3 data.frame list
====== ========== ====
kable(e[[1]],'rst')
=== ================ === ==== === ================== === === =================================
No. Player Pos Ht Wt Birth Date  Exp College
=== ================ === ==== === ================== === === =================================
41 Cameron Bairstow PF 6-9 250 December 7, 1990 au R University of New Mexico
0 Aaron Brooks PG 6-0 161 January 14, 1985 us 6 University of Oregon
21 Jimmy Butler SG 6-7 220 September 14, 1989 us 3 Marquette University
34 Mike Dunleavy SF 6-9 230 September 15, 1980 us 12 Duke University
16 Pau Gasol PF 7-0 250 July 6, 1980 es 13
22 Taj Gibson PF 6-9 225 June 24, 1985 us 5 University of Southern California
12 Kirk Hinrich SG 6-4 190 January 2, 1981 us 11 University of Kansas
3 Doug McDermott SF 6-8 225 January 3, 1992 us R Creighton University
## Realized we should index with some names... but this is somewhat cheating, as we know the start and end indexes for the table titles. I prefer to parse in the dark.
# Names are in h2-tags
e_names <- as.character(unlist(stri_extract_all_regex(c, '<h2(.*?)/h2>', simplify = T)))
e_names <- gsub("<(.*?)>","",e_names[grep('Roster',e_names):grep('Salaries',e_names)])
names(e) <- e_names
kable(head(e$Salaries), 'rst')
=== ============== ===========
Rk Player Salary
=== ============== ===========
1 Derrick Rose $18,862,875
2 Carlos Boozer $13,550,000
3 Joakim Noah $12,200,000
4 Taj Gibson $8,000,000
5 Pau Gasol $7,128,000
6 Nikola Mirotic $5,305,000
=== ============== ===========

Can't remove a row from a matrix in R

I'm trying to remove an outlier from a data matrix. The original matrix is called Westdata and I want to remove row 51.
I've tried the following line of code, but it doesn't remove the outlier and the new matrix is identical to the old one.
Westdata.Outlier<-Westdata[-51,]
Westdata.Outlier
State Region Pay Spend Area
20 Mont. MN 22.5 3.95 West
21 Wyo. MN 27.2 5.44 West
22 N.Mex. MN 22.6 3.40 West
23 Utah MN 22.3 2.30 West
24 Wash. PA 26.0 3.71 West
25 Calif. PA 29.1 3.61 West
26 Hawaii PA 25.8 3.77 West
46 Idaho MN 21.0 2.51 West
47 Colo. MN 25.9 4.04 West
48 Ariz. MN 26.6 2.83 West
49 Nev. MN 25.6 2.93 West
50 Oreg. PA 25.8 4.12 West
51 Alaska PA 41.5 8.35 West
Any suggestions?
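A likely explanation, hedged since no accepted answer is included here: the labels printed on the left are the original row names from the full dataset, not positions, and Westdata itself has only 13 rows. A negative index past the end of a data frame (here -51) removes nothing, which is why the result is identical. Dropping the row by its name should work:
# Drop the row whose *name* is "51" (the Alaska outlier),
# rather than the 51st row by position.
Westdata.Outlier <- Westdata[rownames(Westdata) != "51", ]
Westdata.Outlier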
