I'm learning to scrape data and am using Transfermarkt for practice, but today I ran into two problems. I picked the CSS selectors with SelectorGadget. My code is this:
library(rvest)
url <- "https://www.transfermarkt.es/fc-granada/startseite/verein/16795"
webpage <- read_html(url)
players_html <- html_nodes(webpage,"#yw1 .tooltipstered")
players <- html_text(players_html)
players
valores_html <- html_nodes(webpage,'.rechts.hauptlink')
valores <- html_text(valores_html)
valores
valores <- gsub(" miles €","000", valores)
valores <- gsub(" mill. €","0000", valores)
valores
valores <- gsub(",","",valores)
valores <- gsub(" ","", valores)
valores
The first problem is selecting the players. This is the output:
> players_html <- html_nodes(webpage,"#yw1 .tooltipstered")
> players <- html_text(players_html)
> players
character(0)
I think the problem is in the CSS selector, but it is the one SelectorGadget shows me when I select the players, so I don't know how to solve this.
The other problem occurs when selecting their market values. gsub doesn't remove some trailing whitespace, which prevents converting the strings to numbers. This is the output:
> valores_html <- html_nodes(webpage,'.rechts.hauptlink')
> valores <- html_text(valores_html)
> valores
 [1] "700 miles € "  "300 miles € "  "800 miles € "  "500 miles € "  "300 miles € "
 [6] "300 miles € "  "1,00 mill. € " "300 miles € "  "1,20 mill. € " "500 miles € "
[11] "1,70 mill. € " "1,50 mill. € " "1,00 mill. € " "800 miles € "  "800 miles € "
[16] "300 miles € "  "2,00 mill. € " "800 miles € "  "700 miles € "  "400 miles € "
[21] "700 miles € "  "1,00 mill. € " "800 miles € "
> valores <- gsub(" miles €","000", valores)
> valores <- gsub(" mill. €","0000", valores)
> valores
 [1] "700000 "   "300000 "   "800000 "   "500000 "   "300000 "   "300000 "   "1,000000 "
 [8] "300000 "   "1,200000 " "500000 "   "1,700000 " "1,500000 " "1,000000 " "800000 "
[15] "800000 "   "300000 "   "2,000000 " "800000 "   "700000 "   "400000 "   "700000 "
[22] "1,000000 " "800000 "
> valores <- gsub(",","",valores)
> valores <- gsub(" ","", valores)
> valores
 [1] "700000 "  "300000 "  "800000 "  "500000 "  "300000 "  "300000 "  "1000000 " "300000 "
 [9] "1200000 " "500000 "  "1700000 " "1500000 " "1000000 " "800000 "  "800000 "  "300000 "
[17] "2000000 " "800000 "  "700000 "  "400000 "  "700000 "  "1000000 " "800000 "
Basically, the last gsub, which is meant to remove the trailing whitespace, does nothing in this case. Could someone give me a hand with these two problems?
PS: I'm using Transfermarkt in Spanish.
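As for the gsub problem, we can replace the last two substitutions with a single one that removes every non-digit character: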
valores <- html_text(valores_html)
valores <- gsub(" miles €", "000", valores)
valores <- gsub(" mill. €", "0000", valores)
valores <- gsub("\\D", "", valores)
valores
# [1] "700000" "300000" "800000" "500000" "300000" "300000" "1000000" "300000" "1200000"
# [10] "500000" "1700000" "1500000" "1000000" "800000" "800000" "300000" "2000000" "800000"
# [19] "700000" "400000" "700000" "1000000" "800000"
where \\D matches any character other than a digit, so the comma and the stubborn trailing whitespace are removed in one step.
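A likely reason the original gsub(" ", "", valores) had no effect is that the trailing character is a non-breaking space (U+00A0) rather than an ordinary space. The question does not confirm this, so treat it as an assumption. The sketch below builds two sample strings with an explicit "\u00a0" to show the difference:

```r
# Sample values with an ASSUMED trailing non-breaking space (U+00A0);
# "\u20ac" is the euro sign.
valores <- c("700 miles \u20ac\u00a0", "1,20 mill. \u20ac\u00a0")

valores <- gsub(" miles \u20ac", "000", valores)
valores <- gsub(" mill\\. \u20ac", "0000", valores)

# An ordinary space in the pattern does not match U+00A0,
# so the trailing character is still there afterwards.
with_space_pattern <- gsub(" ", "", valores)
endsWith(with_space_pattern, "\u00a0")

# "\\D" drops every non-digit, including the comma and the non-breaking space.
digits_only <- gsub("\\D", "", valores)
valores_num <- as.numeric(digits_only)
valores_num
```

If only the whitespace should go and the other characters should stay, gsub("\u00a0", "", valores, fixed = TRUE) targets that one character.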
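For the player names: the tooltipstered class is probably added by JavaScript after the page loads in the browser, so it would be missing from the HTML that read_html() downloads (an assumption; the page source would confirm it). A selector that relies only on the static markup works: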
players_html <- html_nodes(webpage,"#yw1 span.hide-for-small a.spielprofil_tooltip")
players <- html_text(players_html)
players
# [1] "Rui Silva" "Aarón Escandell" "Bernardo Cruz"
# [4] "José Antonio Martínez" "Germán Sánchez" "Pablo Vázquez"
# [7] "Álex Martínez" "Adrián Castellano" "Víctor Díaz"
# [10] "Quini" "Nicolás Aguirre" "Fede San Emeterio"
# [13] "Ángel Montoro" "Fran Rico" "Alberto Martín"
# [16] "José Antonio González" "Alejandro Pozo" "Antonio Puertas"
# [19] "Fede Vico" "Daniel Ojeda" "Álvaro Vadillo"
# [22] "Adrián Ramos" "Rodri"
In this way we also get only one set of (full) names. Using, e.g., "#yw1 a.spielprofil_tooltip" would also return their short versions.
I have been looking around for a few hours now and have not been able to remove the blank strings (" ") from the character vector below.
c("Final", "A", "7.43", "8.50", "15.93", "2.00",
"1.00", "0.30", "0.37", " 7.43", " 8.50", "0.50", "0.67", " ",
" ", " ", " ", " ", " ", " ", "B", "7.00", "3.77", "10.77",
" 7.00", "1.67", "3.77", " ", " ", " ", " ", " ", " ", " ", " ",
I have many more of these empty values in this dataset and just want to get rid of them before organizing them as a data frame like
Final
A B
7.43 7.43
8.50 8.50
15.93 0.50
2.00 0.67
1.00
0.30
Thanks,
You can use the base grep with value = TRUE. That searches the character vector for a given regex pattern and returns all values where that pattern is found.
You can think about the logic of your pattern a couple ways. One might be to think of it as keeping values with a "word" character, which are letters, numbers, or underscores.
x <- c("Final", "A", "7.43", "8.50", "15.93", "2.00", "1.00", "0.30", "0.37", " 7.43", " 8.50", "0.50", "0.67", " ", " ", " ", " ", " ", " ", " ", "B", "7.00", "3.77", "10.77", " 7.00", "1.67", "3.77", " ", " ", " ", " ", " ", " ", " ", " ")
grep("\\w", x, value = TRUE)
#> [1] "Final" "A" "7.43" "8.50" "15.93" "2.00" "1.00" "0.30"
#> [9] "0.37" " 7.43" " 8.50" "0.50" "0.67" "B" "7.00" "3.77"
#> [17] "10.77" " 7.00" "1.67" "3.77"
Another way is to find values with a character that isn't a space (\\S is the negation of \\s):
grep("\\S", x, value = TRUE)
#> [1] "Final" "A" "7.43" "8.50" "15.93" "2.00" "1.00" "0.30"
#> [9] "0.37" " 7.43" " 8.50" "0.50" "0.67" "B" "7.00" "3.77"
#> [17] "10.77" " 7.00" "1.67" "3.77"
Created on 2018-12-10 by the reprex package (v0.2.1)
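If the leading spaces in values such as " 7.43" should go as well, one option is to trim first and then keep the non-empty strings. This is only a sketch with base R, run on a shortened version of the vector from the question:

```r
# Shortened sample of the vector from the question.
x <- c("Final", "A", "7.43", " 7.43", " ", "  ", "B", " 7.00", " ")

# trimws() strips leading and trailing whitespace;
# nzchar() is TRUE for every string that is not empty.
trimmed <- trimws(x)
cleaned <- trimmed[nzchar(trimmed)]
cleaned
```

Unlike the grep versions, this also turns " 7.43" into "7.43", which matters if the values are later passed through a lookup or compared as strings.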
I learned about the stringr library and tried to use it to remove whitespace from a string, but I don't understand why it does not remove it when I index a string from a vector.
ex is a vector with strings similar to " 1950"
> library(stringr)
> str_replace_all(ex[1], fixed(" "), "")
[1] " 1950"
> str_replace_all(" 1950", fixed(" "), "")
[1] "1950"
> str(" 1950")
chr " 1950"
> str(ex[1])
chr " 1950"
I wanted to write a loop to remove the whitespaces, but I don't understand why stringr does not work when I use ex[1]
Following is dput(ex)
c(" 1950", " 1951", " 1952", " 1953", " 1954",
" 1955", " 1956", " 1957", " 1958", " 1959", " 1960",
" 1961", " 1962", " 1963", " 1964", " 1965", " 1966",
" 1967", " 1968", " 1969", " 1970", " 1971", " 1972",
" 1973", " 1974", " 1975", " 1976", " 1977", " 1978",
" 1979", " 1980", " 1981", " 1982", " 1983", " 1984",
" 1985", " 1986", " 1987", " 1988", " 1989", " 1990",
" 1991", " 1992", " 1993", " 1994", " 1995", " 1996",
" 1997", " 1998", " 1999", " 2000", " 2001", " 2002",
" 2003", " 2004", " 2005", " 2006", " 2007", " 2008",
" 2009", " 2010", " 2011", " 2012", " 2013", " 2014",
" 2015", " 2016", " 2017", " ", "Provided by : All China Marketing Resarch"
)
What library can I use in a for loop to remove whitespaces?
It is easier with trimws from base R
trimws(ex)
The OP's issue is not reproducible with str_replace_all on the posted data:
stringr::str_replace_all(ex[1], fixed(" "), "")
#[1] "1950"
str_trim from library(stringr) would also work, e.g.
trimmed <- str_trim(your_string, side = "left")
For a right trim, set side to "right"; for both sides, use "both" (the default). You also don't need a loop in this case: a vector of strings can be passed directly to str_trim.
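One possible explanation for the original symptom, which the posted dput output cannot confirm, is that the first character of each string is a non-breaking space (U+00A0) that looks like a space but is not matched by " ". Under that assumption, a sketch that normalises the character first and then trims the whole vector without a loop could look like this (the sample vector is made up for illustration):

```r
# Hypothetical sample: an ASSUMED non-breaking space (U+00A0) in front of the year.
ex <- c("\u00a0 1950", " 1951 ", "\u00a0")

# Turn non-breaking spaces into ordinary spaces, then trim both ends.
normalised <- gsub("\u00a0", " ", ex, fixed = TRUE)
trimmed <- trimws(normalised)
trimmed

# Entries that contained only whitespace are now empty strings and can be dropped.
years <- as.integer(trimmed[nzchar(trimmed)])
years
```

Running utf8ToInt(substr(ex[1], 1, 1)) on the real data shows which character is actually there: 32 is an ordinary space, 160 a non-breaking one.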
I've tried:
i <- as.numeric(as.character(Impress))
i <- as.numeric(as.character(levels(Impress)))
i <- as.numeric(paste(Impress))
I always get:
Warning message:
NAs introduced by coercion
> i
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
This is the data I want to be numeric:
> Impress
[1] 24,085,563.00 35,962,587.00 31,714,513.00 28,206,422.00 40,161,010.00 36,292,929.00 31,545,482.00
[8] 28,213,878.00 35,799,224.00 32,400,885.00 28,496,459.00 37,456,344.00 38,108,667.00 33,407,771.00
[15] 32,540,479.00 30,692,707.00 22,873,000.00 21,329,146.00 28,921,953.00 30,471,519.00 28,601,289.00
[22] 27,450,630.00 26,708,790.00 19,825,041.00 18,844,169.00 29,592,039.00 31,012,594.00 28,792,531.00
[29] 28,578,028.00 24,913,985.00
30 Levels: 18,844,169.00 19,825,041.00 21,329,146.00 22,873,000.00 24,085,563.00 24,913,985.00 ... 40,161,010.00
> paste(Impress)
[1] " 24,085,563.00 " " 35,962,587.00 " " 31,714,513.00 " " 28,206,422.00 " " 40,161,010.00 " " 36,292,929.00 " " 31,545,482.00 "
[8] " 28,213,878.00 " " 35,799,224.00 " " 32,400,885.00 " " 28,496,459.00 " " 37,456,344.00 " " 38,108,667.00 " " 33,407,771.00 "
[15] " 32,540,479.00 " " 30,692,707.00 " " 22,873,000.00 " " 21,329,146.00 " " 28,921,953.00 " " 30,471,519.00 " " 28,601,289.00 "
[22] " 27,450,630.00 " " 26,708,790.00 " " 19,825,041.00 " " 18,844,169.00 " " 29,592,039.00 " " 31,012,594.00 " " 28,792,531.00 "
[29] " 28,578,028.00 " " 24,913,985.00 "
and when I do i <- as.numeric(Impress), it returns the wrong values.
Thanks!
As far as the computer is concerned, a comma is not part of a number, so any string containing one cannot be converted to numeric, even if to a human these look like perfectly acceptable numbers.
Get rid of the , and then it will work, e.g. using gsub()
i <- as.numeric(gsub(",", "", as.character(Impress)))
E.g.
Impress <- c("24,085,563.00", "35,962,587.00", "31,714,513.00", "28,206,422.00")
gsub(",", "", as.character(Impress))
i <- as.numeric(gsub(",", "", as.character(Impress)))
i
R> gsub(",", "", as.character(Impress))
[1] "24085563.00" "35962587.00" "31714513.00" "28206422.00"
R> i
[1] 24085563 35962587 31714513 28206422
R> is.numeric(i)
[1] TRUE
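The same idea applied to a factor with padded values like the one in the question, as a small sketch (the three values are taken from the question, the rest of the setup is assumed). It also shows why as.numeric(Impress) gave "wrong values": on a factor it returns the internal integer codes, not the labels.

```r
# A small factor with padded, comma-separated labels as in the question.
Impress <- factor(c(" 24,085,563.00 ", " 35,962,587.00 ", " 18,844,169.00 "))

# as.numeric() on a factor returns the level codes, not the labels.
codes <- as.numeric(Impress)
codes

# Convert to character, drop the padding and the commas, then convert.
i <- as.numeric(gsub(",", "", trimws(as.character(Impress))))
i
```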
Because the data has commas, R cannot convert it to numeric. You have to remove the commas with gsub() first and then convert:
i <- as.numeric(gsub(",", "", as.character(Impress)))