rvest wikipedia scraping cells with links produce duplicates - web-scraping

I am scraping an HTML table from wikipedia using rvest.
It appears that every time there is a cell that includes a link, I get an extra copy of the name along with the actual text in my R dataframe after running html_table().
Here is an example.
males_raw <- read_html("https://en.wikipedia.org/wiki/List_of_Academy_Award_Best_Actor_winners_by_age")
males <- males_raw %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/table') %>%
  html_table()
males <- males[[1]]
Which produces a dataframe in which the names of the actors are repeated:
dplyr::glimpse(males)
Observations: 91
Variables: 9
$ `#` <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "1...
$ Actor <chr> "Jannings, EmilEmil Jannings", "Baxter, WarnerWarner Baxter...
$ Film <chr> "The Way of All Flesh and The Last CommandThe Last Command,...
$ `Date of birth` <chr> "1884-07-23(1884-07-23)July 23, 1884", "1889-03-29(1889-03-...
$ `Date of award` <chr> "May 16, 1929 (1929-05-16)", "April 3, 1930 (1930-04-03)", ...
$ `Age upon\nreceiving award` <chr> "44-2977004163670000000000♠44 years, 297 days", "41-0057004...
$ `Date of death` <chr> "1950-01-02(1950-01-02)January 2, 1950", "1951-05-07(1951-0...
$ Lifespan <chr> "23,903 days (7004239030000000000♠65 years, 163 days)", "22...
$ Notes <chr> "Held record as oldest winner for 2 award ceremonies (from ...
I would prefer to just have the name, e.g., "Jannings, Emil" instead of "Jannings, EmilEmil Jannings"
Thanks!

The problem here is that the table cells contain span elements which are not displayed (styled with display:none). html_table() converts their contents into text and concatenates it with the visible text in the td elements. This happens in other columns too.
<td>
<span style="display:none">Jannings, Emil</span>
Emil Jannings
</td>
One way around this is to remove the span nodes:
# xml_remove() comes from the xml2 package
spans <- males_raw %>%
  html_nodes(xpath = "//*/tr/td/span")
xml_remove(spans)

males_raw %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/table') %>%
  html_table() %>%
  .[[1]] %>%
  glimpse()
Observations: 91
Variables: 9
$ `#` <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13"...
$ Actor <chr> "Emil Jannings", "Warner Baxter", "George Arliss", "Lionel Barrymor...
$ Film <chr> "The Last Command,The Way of All Flesh", "In Old Arizona", "Disrael...
$ `Date of birth` <chr> "July 23, 1884", "March 29, 1889", "April 10, 1868", "April 28, 187...
$ `Date of award` <chr> "May 16, 1929", "April 3, 1930", "November 5, 1930", "November 10, ...
$ `Age upon\nreceiving award` <chr> "44 years, 297 days", "41 years, 5 days", "62 years, 209 days", "53...
$ `Date of death` <chr> "January 2, 1950", "May 7, 1951", "February 5, 1946", "November 15,...
$ Lifespan <chr> "23,903 days (65 years, 163 days)", "22,683 days (62 years, 39 days...
$ Notes <chr> "Held record as oldest winner for 2 award ceremonies (from the 1st ...
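Note that the XPath above removes every span inside a td; on pages where visible content also lives in spans, that would delete real data. A narrower variant (a sketch, shown here on a minimal inline reproduction of Wikipedia's markup rather than the live page) targets only spans hidden with display:none:

```r
library(rvest)
library(xml2)

# Minimal reproduction of Wikipedia's markup: a cell containing a
# hidden sort-key span followed by the visible text.
doc <- read_html(
  '<table>
     <tr><th>Actor</th></tr>
     <tr><td><span style="display:none">Jannings, Emil</span>Emil Jannings</td></tr>
   </table>'
)

# Remove only the spans hidden via inline CSS, leaving visible spans intact.
xml_remove(html_nodes(doc, css = "td span[style*='display:none']"))

doc %>%
  html_node("table") %>%
  html_table() %>%
  .[["Actor"]]
#> [1] "Emil Jannings"
```

The attribute-substring selector [style*='display:none'] assumes the hiding is done with inline styles, as it is in these Wikipedia tables.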


Complex Long-Wide Dataset to Long Dataset in R

I have a complex dataset that looks like this:
df1 <- tibble::tribble(
  ~"Canada > London", ~"", ~"Notes", ~"United Kingdom > London", ~"", ~"",
  "Restaurant", "Price", "Range", "Restaurant", "Price", "Range",
  "Fried beef", "27", "25-30", "Fried beef", "29", "25 - 35",
  "Fried potato", "5", "3 - 8", "Fried potato", "8", "3 - 8",
  "Bar", "Price", "Range", "Price", "Range", "",
  "Beer Lager", "5", "4 - 8", "Beer Lager", "6", "4 - 8",
  "Beer Dark", "4", "3 - 7", "Beer Dark", "5", "3 - 7"
)
It is long in parameters (like Beer Lager, Beer Dark, ....) and wide by the data input (many wide elements like Canada > London, or United Kingdom > London).
The desired output would be two datasets that should look like this:
The first dataset (the Values):
The second dataset (the Ranges):
Any suggestions would be much appreciated :)
Your data is neither wide nor long but is a messy data table which needs some cleaning to convert it to tidy data. Afterwards you could get your desired tables using tidyr::pivot_wider:
library(dplyr)
library(tidyr)
library(purrr)
tidy_data <- function(.data, cols) {
  .data <- .data[cols]
  place <- names(.data)[[1]]
  .data |>
    rename(product = 1, price = 2, range = 3) |>
    filter(!price %in% c("Price", "Range")) |>
    mutate(place = place)
}

df1_tidy <- purrr::map_dfr(list(1:3, 4:6), tidy_data, .data = df1)

df1_tidy |>
  select(place, product, price) |>
  pivot_wider(names_from = product, values_from = price)
#> # A tibble: 2 × 5
#> place `Fried beef` `Fried potato` `Beer Lager` `Beer Dark`
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Canada > London 27 5 5 4
#> 2 United Kingdom > London 29 8 6 5
df1_tidy |>
  select(place, product, range) |>
  pivot_wider(names_from = product, values_from = range, names_glue = "{product} Range")
#> # A tibble: 2 × 5
#> place `Fried beef Range` Fried potato Rang…¹ Beer …² Beer …³
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Canada > London 25-30 3 - 8 4 - 8 3 - 7
#> 2 United Kingdom > London 25 - 35 3 - 8 4 - 8 3 - 7
#> # … with abbreviated variable names ¹​`Fried potato Range`, ²​`Beer Lager Range`,
#> # ³​`Beer Dark Range`
I agree with @stefan. You actually have 4 tables, or 2 depending on how you look at it. Here is an implementation of two functions that start the cleaning and formatting process. The first splits the dfs by row and the second splits them by column. After that it is easier to format, clean, and merge the dfs into one.
library(tidyverse)
df0 = tibble::tribble(
  ~"Canada > London", ~"", ~"Notes", ~"United Kingdom > London", ~"", ~"",
  "Restaurant", "Price", "Range", "Restaurant", "Price", "Range",
  "Fried beef", "27", "25-30", "Fried beef", "29", "25 - 35",
  "Fried potato", "5", "3 - 8", "Fried potato", "8", "3 - 8",
  "Bar", "Price", "Range", "Price", "Range", "",
  "Beer Lager", "5", "4 - 8", "Beer Lager", "6", "4 - 8",
  "Beer Dark", "4", "3 - 7", "Beer Dark", "5", "3 - 7"
)
split_rows = function(df) {
  # row indices where sub-dfs start within the original df
  df_breaks = which(df[, 2] == "Price")
  # list to populate in loop with sub-dfs
  df_list = list()
  for (i in 1:length(df_breaks)) {
    # get start of sub-df
    start = df_breaks[i]
    # get end of sub-df
    if (i == length(df_breaks)) {
      end = nrow(df) # if it's the last, set it to the last row of the original df
    } else {
      end = df_breaks[i + 1] - 1 # else, set it to the next start - 1
    }
    # subset df
    df_temp = df[start:end, ]
    # first row as header
    colnames(df_temp) = df_temp[1, ]
    df_temp = df_temp[-1, ]
    # append to df_list
    df_list = append(df_list, list(df_temp))
  }
  return(df_list)
}

split_cols = function(df_list, second_df_col_start = 4) {
  df_list = lapply(df_list, function(df) {
    df1 = df[, 1:(second_df_col_start - 1)]
    df2 = df[, second_df_col_start:ncol(df)]
    return(list(df1, df2))
  })
  return(df_list)
}

output = split_rows(df0) %>%
  split_cols()
output:
[[1]]
[[1]][[1]]
# A tibble: 2 × 3
Restaurant Price Range
<chr> <chr> <chr>
1 Fried beef 27 25-30
2 Fried potato 5 3 - 8
[[1]][[2]]
# A tibble: 2 × 3
Restaurant Price Range
<chr> <chr> <chr>
1 Fried beef 29 25 - 35
2 Fried potato 8 3 - 8
[[2]]
[[2]][[1]]
# A tibble: 2 × 3
Bar Price Range
<chr> <chr> <chr>
1 Beer Lager 5 4 - 8
2 Beer Dark 4 3 - 7
[[2]][[2]]
# A tibble: 2 × 3
Price Range ``
<chr> <chr> <chr>
1 Beer Lager 6 4 - 8
2 Beer Dark 5 3 - 7
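To sketch the "format, clean, and merge" step the answer mentions, one possible follow-up (the places labels are assumptions taken from the original wide headers, and the normalised column names product/price/range are my own choices):

```r
library(dplyr)
library(purrr)

# Stack the four sub-tables produced by split_rows()/split_cols() into one
# tidy dataframe. Sub-table j in each pair belongs to the j-th place.
places <- c("Canada > London", "United Kingdom > London")

tidy_all <- map_dfr(output, function(pair) {
  map_dfr(seq_along(pair), function(j) {
    df <- pair[[j]]
    names(df) <- c("product", "price", "range")  # normalise the headers
    mutate(df, place = places[j])
  })
})
```

From tidy_all the two wide tables can then be produced with pivot_wider() as in the first answer.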

Merging two matrices with merge.Matrix does not return the desired output

I have two matrices provided below:
cf = structure(c("7", "7", "7", "7", "7", "7", "7", "1", "1", "1",
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",
"1", "1", "1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2",
"2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2",
"2", "2", "2", "2", "2", "2", "3", "3", "3", "3", "3", "3", "3",
"3", "3", "3", "3", "3", "3", "3", "3", "3", "17", "18", "19",
"20", "21", "22", "23", "0", "1", "2", "3", "4", "5", "6", "7",
"8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18",
"19", "20", "21", "22", "23", "0", "1", "2", "3", "4", "5", "6",
"7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17",
"18", "19", "20", "21", "22", "23", "0", "1", "2", "3", "4",
"5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15"), .Dim = c(71L,
2L), .Dimnames = list(NULL, c("d", "h")))
hour_df <- data.frame(
  day = as.character(rep(c(1, 2, 3, 4, 5, 6, 7), each = 24)),
  hours = as.character(rep(0:23, times = 7)),
  period = rep(c(rep("night", times = 8), rep("day", times = 12), rep("night", times = 4)), times = 7),
  tariff_label = rep(c(rep("special feed", times = 8), rep("normal feed", times = 12), rep("special feed", times = 4)), times = 7),
  week_period = c(rep("weekend", times = 32), rep("weekday", times = 108), rep("weekend", times = 28))
)
hour_df$tariff_label[hour_df$day %in% c("7", "1")] <- "special feed"
hour_df <- as.matrix(hour_df)
I want to merge these matrices on the two common columns in each matrix, i.e. by.x = c("d","h"), by.y = c("day","hours").
If I use the base function merge() I get my desired output, which looks like this:
merge(cf, hour_df, by.x = c("d", "h"), by.y = c("day", "hours"))
d h period tariff_label week_period
1 1 0 night special feed weekend
2 1 1 night special feed weekend
3 1 10 day special feed weekend
4 1 11 day special feed weekend
5 1 12 day special feed weekend
6 1 13 day special feed weekend
7 1 14 day special feed weekend
8 1 15 day special feed weekend
9 1 16 day special feed weekend
10 1 17 day special feed weekend
11 1 18 day special feed weekend
12 1 19 day special feed weekend
13 1 2 night special feed weekend
14 1 20 night special feed weekend
15 1 21 night special feed weekend
16 1 22 night special feed weekend
17 1 23 night special feed weekend
18 1 3 night special feed weekend
19 1 4 night special feed weekend
20 1 5 night special feed weekend
21 1 6 night special feed weekend
22 1 7 night special feed weekend
23 1 8 day special feed weekend
24 1 9 day special feed weekend
25 2 0 night special feed weekend
26 2 1 night special feed weekend
27 2 10 day normal feed weekday
28 2 11 day normal feed weekday
29 2 12 day normal feed weekday
30 2 13 day normal feed weekday
31 2 14 day normal feed weekday
32 2 15 day normal feed weekday
33 2 16 day normal feed weekday
34 2 17 day normal feed weekday
35 2 18 day normal feed weekday
36 2 19 day normal feed weekday
37 2 2 night special feed weekend
38 2 20 night special feed weekday
39 2 21 night special feed weekday
40 2 22 night special feed weekday
41 2 23 night special feed weekday
42 2 3 night special feed weekend
43 2 4 night special feed weekend
44 2 5 night special feed weekend
45 2 6 night special feed weekend
46 2 7 night special feed weekend
47 2 8 day normal feed weekday
48 2 9 day normal feed weekday
49 3 0 night special feed weekday
50 3 1 night special feed weekday
51 3 10 day normal feed weekday
52 3 11 day normal feed weekday
53 3 12 day normal feed weekday
54 3 13 day normal feed weekday
55 3 14 day normal feed weekday
56 3 15 day normal feed weekday
57 3 2 night special feed weekday
58 3 3 night special feed weekday
59 3 4 night special feed weekday
60 3 5 night special feed weekday
61 3 6 night special feed weekday
62 3 7 night special feed weekday
63 3 8 day normal feed weekday
64 3 9 day normal feed weekday
65 7 17 day special feed weekend
66 7 18 day special feed weekend
67 7 19 day special feed weekend
68 7 20 night special feed weekend
69 7 21 night special feed weekend
70 7 22 night special feed weekend
71 7 23 night special feed weekend
As you can see above, I have 71 rows. I wanted to see if there is a faster function for merging matrices. I saw online that there is a function called merge.Matrix() which should be faster than base merge(). However, when I tried it, I got a completely different result.
library(Matrix.utils)
merge.Matrix(cf,hour_df, by.x = c("d","h"), by.y = c("day","hours"))
d h day hours period tariff_label week_period
"7" "17" "1" "2" "night" "special feed" "weekend"
"7" "19" "1" "0" "night" "special feed" "weekend"
"7" "18" "1" "2" "night" "special feed" "weekend"
"7" "19" "1" "1" "night" "special feed" "weekend"
I tried to find out online how it is used, but information on this function seems scarce. I also checked the vignette. Can someone tell me what I am doing wrong, or whether there is a better function than this?
Please Note
I am already aware of dplyr joins and data.table. It is important that both of the matrices stay matrices and that they are not changed into some other format. In reality, my code is performing a join from a list that contains thousands of matrices and therefore needs to be quick.
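For what it's worth, one base-R approach that keeps both inputs as character matrices (a sketch, not the merge.Matrix API) is to build a composite day/hour key per row and look rows up with match():

```r
# Build a composite "day|hour" key for each row of both matrices, then use
# match() to find, for every row of cf, the matching row of hour_df.
# This assumes each day/hour pair occurs at most once in hour_df, which
# holds here because hour_df has exactly one row per day and hour.
key_cf   <- paste(cf[, "d"], cf[, "h"], sep = "|")
key_hour <- paste(hour_df[, "day"], hour_df[, "hours"], sep = "|")

idx    <- match(key_cf, key_hour)
joined <- cbind(cf, hour_df[idx, c("period", "tariff_label", "week_period"), drop = FALSE])
```

Unlike merge(), this keeps the row order of cf, and the result is still a character matrix, so it can be applied over a list of thousands of matrices without any conversion.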

Summarising duplicates in dataframe in R [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 2 years ago.
I have a dataframe with the following data:
# sample data
Date <- c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02", "2020-01-02", "2020-01-02")
Salesperson <- c("Sales1", "Sales1", "Sales1", "Sales2", "Sales2", "Sales1", "Sales1", "Sales2", "Sales2")
Clothing <- c("5", "2", "8", "3", "3", "4", "7", "3", "4")
Electronics <- c("6", "9", "1", "2", "1", "2", "2", "1", "2")
data <- data.frame(Date, Salesperson, Clothing, Electronics, stringsAsFactors = FALSE)
data$Date <- as.Date(data$Date, "%Y-%m-%d")
There are rows in the df where a salesperson has recorded their sales multiple times for the same date rather than adding them up.
The result I want is shown by the dataframe below:
Date <- c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02")
Salesperson <- c("Sales1", "Sales2", "Sales1", "Sales2")
Clothing <- c("15", "6", "11", "7")
Electronics <- c("16", "3", "4", "3")
data1 <- data.frame(Date, Salesperson, Clothing, Electronics, stringsAsFactors = FALSE)
Does anyone know how to achieve this result?
To summarise your data, you need the numbers to be passed as numbers, not strings. Note that I added as.numeric() around your Clothing and Electronics vectors:
Clothing <- as.numeric(c("5", "2", "8", "3", "3", "4", "7", "3", "4"))
Electronics <- as.numeric(c("6", "9", "1", "2", "1", "2", "2", "1", "2"))
Now, to summarise using the sum, try:
library(dplyr)
data %>%
  group_by(Date, Salesperson) %>%
  summarise(sum_cloth = sum(Clothing), sum_elec = sum(Electronics))
# Groups: Date [2]
Date Salesperson sum_cloth sum_elec
<chr> <chr> <dbl> <dbl>
1 2020-01-01 Sales1 15 16
2 2020-01-01 Sales2 6 3
3 2020-01-02 Sales1 11 4
4 2020-01-02 Sales2 7 3
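If you would rather not rebuild the input vectors, the conversion can also happen inside the pipeline; a sketch using dplyr::across() (available in dplyr >= 1.0.0):

```r
library(dplyr)

# Convert the character columns to numeric on the fly, then sum per group.
data %>%
  mutate(across(c(Clothing, Electronics), as.numeric)) %>%
  group_by(Date, Salesperson) %>%
  summarise(across(c(Clothing, Electronics), sum), .groups = "drop")
```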

Is there an R function or snippet of code to identify when two cells (columns) in a row are duplicates?

I have a dataframe in R with roughly 1,700 observations. I intersected GPS points with polygons, and want to determine whether multiple IDs enter the same polygon in the same 12-hour period (6 pm to 6 am). Here is the head of my dataframe.
ID date time DOP datetime p pid1 Long Lat
289 Friday, September 1, 2017 1:15:29 AM 4.2 2017-09-01 01:15:29 <NA> 2 763692.8 3617676
289 Friday, September 1, 2017 4:15:15 AM 1.4 2017-09-01 04:15:15 <NA> 2 763674.5 3617692
299 Friday, September 1, 2017 5:00:16 AM 3.6 2017-09-01 05:00:16 <NA> 2 764427.2 3616750
13 Friday, September 1, 2017 5:15:25 AM 2.8 2017-09-01 05:15:25 <NA> 1 767800.5 3613057
299 Friday, September 1, 2017 5:15:29 AM 1.6 2017-09-01 05:15:29 <NA> 2 764420.7 3616746
299 Friday, September 1, 2017 5:30:08 AM 1.4 2017-09-01 05:30:08 <NA> 2 764420.7 3616747
You can see that for Friday, September 1, 2017, both ID numbers 289 and 299 were within pid1 #2 (pid1 #2 refers to polygon #2) at one point (roughly 45 minutes apart). I'd like to have some function or script to run through my dataset and identify instances where this occurs. That way I can identify which IDs are in which pid1 during specific times (within the 12-hour window), to ultimately have a dataset that shows how many times multiple IDs interact within a specific polygon.
Here is a sample dataset using dput for the first 5 lines of my dataset:
structure(list(X = c("388933", "387022", "507722", "941954",
"506441"), ID = structure(c(12L, 12L, 15L, 1L, 15L), .Label = c("13",
"17", "97", "100", "253", "255", "256", "259", "263", "272",
"281", "289", "294", "297", "299", "329", "337", "339", "344",
"347"), class = "factor"), date = c("Friday, September 1, 2017",
"Friday, September 1, 2017", "Friday, September 1, 2017", "Friday, September 1, 2017",
"Friday, September 1, 2017"), time = c("1:15:29 AM", "4:15:15 AM",
"5:00:16 AM", "5:15:25 AM", "5:15:29 AM"), DOP = c(4.2, 1.4,
3.6, 2.8, 1.6), datetime = structure(c(1504246529, 1504257315,
1504260016, 1504260925, 1504260929), class = c("POSIXct", "POSIXt"
), tzone = "CST6CDT"), p = c(NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_), pid1 = c("2", "2", "2", "1", "2"
), Long = c(763692.811797531, 763674.546077539, 764427.163679506,
767800.455784065, 764420.684442097), Lat = c(3617675.85664874,
3617692.02070415, 3616749.72487458, 3613057.33334349, 3616746.22303673
)), row.names = c("224811", "223697", "277383", "525686", "276768"
), class = "data.frame")
EDIT: I am editing this to show the way that I figured out to make this work.
uni <- unique(df[, c("ID", "date", "pid1")])
df2 <- aggregate(ID ~ pid1 + date, data = uni, length)
This was able to create a dataframe with the number of unique IDs per pid1 per day.
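Building on that edit, the polygon/day combinations visited by more than one ID can then be pulled out with a simple filter (a sketch; multi is a hypothetical name):

```r
uni <- unique(df[, c("ID", "date", "pid1")])
df2 <- aggregate(ID ~ pid1 + date, data = uni, length)

# keep only the polygon/date combinations visited by more than one unique ID
multi <- df2[df2$ID > 1, ]
```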
Thank you
Major EDIT
Give it a try now on your real data. I think I was letting the fact that ID was a factor get in the way. If we agree this is giving you the right ideas, we can close out the basics and perhaps start a new question on your 12-hour blocks.
library(dplyr)
library(tidyr)
set.seed(2020)
ID <- sample(10:200, replace = TRUE, size = 1000)
PID <- sample(1:31, replace = TRUE, size = 1000)
Date <- sample(c("Friday, September 1, 2017",
                 "Saturday, September 2, 2017",
                 "Sunday, September 3, 2017",
                 "Monday, September 4, 2017",
                 "Tuesday, September 5, 2017",
                 "Wednesday, September 6, 2017",
                 "Thursday, September 7, 2017",
                 "Friday, September 8, 2017"),
               replace = TRUE,
               size = 1000)
play <- data.frame(ID, Date, PID)
play$ID <- factor(play$ID)

ids_by_pid <- play %>%
  mutate(ID = as.integer(as.character(ID))) %>%
  arrange(Date, PID, ID) %>%
  tidyr::pivot_wider(id_cols = Date,
                     values_from = ID,
                     names_from = PID,
                     names_prefix = "pid",
                     names_sort = TRUE,
                     values_fn = list)
Here's the code I use to compare IDs in a PID for a particular day
play %>%
  filter(Date == "Saturday, September 2, 2017", PID == "1") %>%
  pull(ID)
#> [1] 40 30 89 133 36
#> 189 Levels: 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 ... 200
ids_by_pid %>%
  filter(Date == "Saturday, September 2, 2017") %>%
  select(pid1) %>%
  pull() %>%
  unlist()
#> [1] 30 36 40 89 133

conditionally aggregate columns using tidyverse for large time series dataset

After looking at a few other asked questions, and reading a few guides, I'm not able to find a suitable solution to my specific problem. Here's an example of the data to begin:
data <- data.frame(
  Date = sample(c("1993-07-05", "1993-07-05", "1993-07-05", "1993-08-30", "1993-08-30", "1993-08-30", "1993-08-30", "1993-09-04", "1993-09-04")),
  Site = sample(c("1", "1", "1", "1", "1", "1", "1", "1", "1")),
  Station = sample(c("1", "2", "3", "1", "2", "3", "4", "1", "2")),
  Oxygen = sample(c("0.9", "0.4", "4.2", "5.6", "7.3", "4.3", "9.5", "5.3", "0.3"))
)
I want to average all the oxygen values for the stations that are nested within a site that corresponds to a date. My dataset has a couple of thousand rows, and like in the example, there are an uneven number of stations, and the dates are uneven in length.
The output I'm looking for are columns like, "Date -> Site -> Average Oxygen", foregoing the need for a station column altogether in the new version of the time series.
Any help would be greatly appreciated!
After grouping by 'Site' and 'Date', get the mean of 'Oxygen' (after converting it to numeric, since it is a factor column):
library(tidyverse)
data %>%
  group_by(Site, Date) %>%
  summarise(AverageOxygen = mean(as.numeric(as.character(Oxygen))))
# A tibble: 3 x 3
# Groups: Site [1]
# Site Date AverageOxygen
# <fct> <fct> <dbl>
#1 1 1993-07-05 3.97
#2 1 1993-08-30 5.2
#3 1 1993-09-04 2.55
Try:
library(hablar)
library(tidyverse)
data %>%
  retype() %>%
  group_by(Site, Date) %>%
  summarize(AverageOxygen = mean(Oxygen))
which gives you:
# A tibble: 3 x 3
# Groups: Site [?]
Site Date AverageOxygen
<int> <date> <dbl>
1 1 1993-07-05 4.7
2 1 1993-08-30 3.55
3 1 1993-09-04 4.75
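For completeness, the same aggregation is available in base R without extra packages (a sketch; it assumes Oxygen is stored as character or factor, as in the sample data):

```r
# Base-R equivalent: convert Oxygen to numeric, then average per Site and Date.
data$Oxygen <- as.numeric(as.character(data$Oxygen))
aggregate(Oxygen ~ Site + Date, data = data, FUN = mean)
```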
