Web-scraping table with merged row entries in R - r

I'm trying to scrape data-tables from a website
https://newsroom.spotify.com/2020-03-09/36-new-artists-around-the-world-that-are-on-spotifys-radar/
The issue is that the first column entry is merged across multiple rows while the second column has discrete entries:
The data table which is being scraped llos something like:
Here different entries in the second column has been merges in a single entry using \n
Now I want to shift the merged data to different rows and need some help with the same.
The code for webscraping is
library(rvest
#Spotify's list of new artists to look out for
upcoming_artists <- "https://newsroom.spotify.com/2020-03-09/36-new-artists-around-the-world-that-are-on-spotifys-radar/"
upcoming_artists <- read_html(upcoming_artists)
upcoming_artists <- html_table(upcoming_artists)
The erroneous data frame looks something like:
list(structure(list(X1 = c("United States", "United Kingdom",
"Brazil", "Mexico", "Argentina", "Colombia", "Panama", "Spain",
"Australia", "France", "UAE & Lebanon", "South Africa", "Philippines",
"Indonesia", "Taiwan", "Austria", "Germany", "Netherlands", "Japan\n*RADAR locally titled Early Noise",
"India"), X2 = c("Alaina Castillo", "Young T + Bugsey", "Agnes Nunes",
"Silvana Estrada", "Romeo El Santo", "Ela Minus", "Boza", "DORA \nAleesha\nMaría José Llergo\nGuitarricadelafuente\nParanoid 1966",
"merci, mercy", "Lous and the Yakuza \nYuzmv\nPhilippine\nHervé",
"Hollaphonic x Xriss", "Elaine", "SB19\nAugust Wahh", "Mahen\nMonica Karina",
"張若凡RuoFan", "AVEC \nMy Ugly Clementine", "badmómzjay",
"RIMON \nJeangu Macrooy", "Fujii Kaze\nVaundy\nRina Sawayama",
"Mali\nWhen Chai Met Toast\nTaba Chake")), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame")))

Use separate_rows from package tidyr to separate a given column into rows.
I have changed the scrapping a bit.
suppressPackageStartupMessages({
library(rvest)
library(dplyr)
})
#Spotify's list of new artists to look out for
upcoming_artists_link <- "https://newsroom.spotify.com/2020-03-09/36-new-artists-around-the-world-that-are-on-spotifys-radar/"
upcoming_artists <- read_html(upcoming_artists_link)
upcoming_artists %>%
html_elements("tbody") %>%
html_table() %>%
`[[`(1) %>%
tidyr::separate_rows(X2, sep = "\n")
#> # A tibble: 35 × 2
#> X1 X2
#> <chr> <chr>
#> 1 United States Alaina Castillo
#> 2 United Kingdom Young T + Bugsey
#> 3 Brazil Agnes Nunes
#> 4 Mexico Silvana Estrada
#> 5 Argentina Romeo El Santo
#> 6 Colombia Ela Minus
#> 7 Panama Boza
#> 8 Spain DORA 
#> 9 Spain Aleesha
#> 10 Spain María José Llergo
#> # … with 25 more rows
Created on 2022-10-03 with reprex v2.0.2

Related

Create several columns from a complex column in R

Imagine dataset:
df1 <- tibble::tribble(~City, ~Population,
"United Kingdom > Leeds", 1500000,
"Spain > Las Palmas de Gran Canaria", 200000,
"Canada > Nanaimo, BC", 150000,
"Canada > Montreal", 250000,
"United States > Minneapolis, MN", 700000,
"United States > Milwaukee, WI", NA,
"United States > Milwaukee", 400000)
The same dataset for visual representation:
I would like to:
Split column City into three columns: City, Country, State (if available, NA otherwise)
Check that Milwaukee has data in state and population (the NA for Milwaukee should have a value of 400000 and then split [City-State-Country] :).
Could you, please, suggest the easiest method to do so :)
Here's another solution with extract to do the extraction of Country, City, and State in a single go with State extracted by an optional capture group (the remainder of the task is done as by #Allen's code):
library(tidyr)
library(dplyr)
df1 %>%
extract(City,
into = c("Country", "City", "State"),
regex = "([^>]+) > ([^,]+),? ?([A-Z]+)?"
) %>%
# as by #Allen Cameron:
group_by(Country, City) %>%
summarize(State = ifelse(all(is.na(State)), NA, State[!is.na(State)]),
Population = Population[!is.na(Population)])
You can use separate twice to get the country and state, then group_by Country and City to summarize away the NA values where appropriate:
library(tidyverse)
df1 %>%
separate(City, sep = " > ", into = c("Country", "City")) %>%
separate(City, sep = ', ', into = c('City', 'State')) %>%
group_by(Country, City) %>%
summarize(State = ifelse(all(is.na(State)), NA, State[!is.na(State)]),
Population = Population[!is.na(Population)])
#> # A tibble: 6 x 4
#> # Groups: Country [4]
#> Country City State Population
#> <chr> <chr> <chr> <dbl>
#> 1 Canada Montreal <NA> 250000
#> 2 Canada Nanaimo BC 150000
#> 3 Spain Las Palmas de Gran Canaria <NA> 200000
#> 4 United Kingdom Leeds <NA> 1500000
#> 5 United States Milwaukee WI 400000
#> 6 United States Minneapolis MN 700000

How to modify data frame in R based on one unique column

I have a data frame that looks like this.
Data
Denmark MG301
Denmark MG302
Australia MG301
Australia MG302
Sweden MG100
Sweden MG120
I need to make a new data frame based on unique values of 2nd columns while removing repeating values in Denmark. And results should look like this
Data
Australia MG301
Australia MG302
Sweden MG100
Sweden MG120
Regards
Update after clarification:
This code keeps all distinct values in column2:
distinct(df, code, .keep_all = TRUE)
Output:
1 Denmark MG301
2 Australia MG302
3 Sweden MG100
4 Sweden MG120
First answer:
I am not quite sure. But it gives the desired output:
df %>%
filter(country != "Denmark")
Output:
country code
<chr> <chr>
1 Australia MG301
2 Australia MG302
3 Sweden MG100
4 Sweden MG120
data:
df<- tribble(
~country, ~code,
"Denmark", "MG301",
"Denmark", "MG301",
"Australia", "MG301",
"Australia", "MG302",
"Sweden", "MG100",
"Sweden", "MG120")
In base R, the following code removes all rows with "Denmark" in the first column and all duplicated 2nd column by groups of 1st column.
i <- df1$V1 != "Denmark"
j <- as.logical(ave(df1$V2, df1$V1, FUN = duplicated))
df1[i & !j, ]
# V1 V2
#3 Australia MG301
#4 Australia MG302
#5 Sweden MG100
#6 Sweden MG120
Do you want just distinct ? then this may help
df <- data.frame(A = c("denmark", "denmark", "Australia", "Australia", "Sweden", "Sweden"), B = c("MG301","MG302","MG301","MG302","MG100","MG100"))
df %>% distinct()
A B
1 denmark MG301
2 denmark MG302
3 Australia MG301
4 Australia MG302
5 Sweden MG100
Or you want this ?
df %>%
group_by(B) %>%
dplyr::summarise(A = first(A))
B A
* <chr> <chr>
1 MG100 Sweden
2 MG301 denmark
3 MG302 denmark
Use duplicated with a ! bang operator to remove duplicated rows among that column.
To show a rather complicated case, I am adding one row in Denmark which is not duplicated and hence should not be filtered out.
df<- tribble(
~country, ~code,
"Denmark", "MG301",
"Denmark", "MG302",
'Denmark', "MG303",
"Australia", "MG301",
"Australia", "MG302",
"Sweden", "MG100",
"Sweden", "MG120")
# A tibble: 7 x 2
country code
<chr> <chr>
1 Denmark MG301
2 Denmark MG302
3 Denmark MG303
4 Australia MG301
5 Australia MG302
6 Sweden MG100
7 Sweden MG120
df %>%
mutate(d = duplicated(code)) %>%
group_by(code) %>%
mutate(d = sum(d)) %>% ungroup() %>%
filter(!(d > 0 & country == 'Denmark'))
# A tibble: 5 x 3
country code d
<chr> <chr> <int>
1 Denmark MG303 0
2 Australia MG301 1
3 Australia MG302 1
4 Sweden MG100 0
5 Sweden MG120 0

Trying to find values within excel cell based on given pairs in R df

I am using this excel sheet that I have currently read into R: https://www.knomad.org/sites/default/files/2018-04/bilateralmigrationmatrix20170_Apr2018.xlsx
dput(head(remittance, 5))
The output is:
structure(list(`Remittance-receiving country (across) - Remittance-sending country (down)` = c("Australia",
"Brazil", "Canada"), Brazil = c("27.868809286999106", "0", "31.284184411144214"
), Canada = c("46.827693406219382", "1.5806325278762619", "0"
), `Czech Republic` = c("104.79905129342241", "3.0488843262423089",
"176.79676736179096"), Finland = c("26.823089572300752", "1.3451674211686246",
"37.781150857376964"), France = c("424.37048861305249", "123.9763417712491",
"1296.7352242506483"), Germany = c("556.4140279523856", "66.518143815367239",
"809.9621650533453"), Hungary = c("200.08597014449356", "11.953328254521287",
"436.0811601171776"), Indonesia = c("172.0021287331823", "1.3701340430259537",
"33.545925908780198"), Italy = c("733.51652291459231", "116.74264895322995",
"1072.1119887588022"), `Korea, Rep.` = c("259.97044386689589",
"20.467939414361016", "326.94157937864327"), Netherlands = c("133.48932759488602",
"4.7378343766684532", "181.28828076733771"), Philippines = c("1002.3593555086774",
"1.5863355979877207", "2369.5223195675494"), Poland = c("109.73486651698796",
"5.8313637459523129", "341.10408952685464"), `Russian Federation` = c("19.082541158574934",
"1.0136604494838692", "58.760989426089431"), `Saudi Arabia` = c("13.578431465294949",
"0.32506772760873404", "15.511213677040857"), Sweden = c("91.887827513176489",
"5.1132733094740352", "65.860232580192786"), Thailand = c("383.08245004577498",
"2.7410805494977684", "79.370683058792849"), `United Kingdom` = c("1084.0742194994727",
"4.2050614573174592", "568.62605950140266"), `United States` = c("188.06242727403128",
"49.814372612310521", "661.98049661387927"), WORLD = c("5578.0296723604206",
"422.37127035334271", "8563.264510816849")), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
I currently have a dataframe of two columns "Source" and "Destination" where each row is a pair of countries which I created by doing:
countries = c("Australia","Brazil", "Canada", "Czech Republic", "Germany", "Finland", "United Kingdom", "Italy", "Poland", "Russian Federation", "Sweden", "United States", "Philippines", "France", "Netherlands", "Hungary", "Saudi Arabia", "Thailand", "Korea, Rep.", "Indonesia")
pairs = t(combn(countries, 2))
I would like to use each pair to extract its corresponding value from the excel sheet above. (In the Excel sheet "Source" is the first column of countries-down and "Destination is the first row countries-across)
For example a sample of the df that I have looks as follows (it currently contains 190 pairs):
pairs = data.frame(Source = c("Australia", "Australia", "Australia"), Destination = c("Brazil", "Canada", "Czech Republic"))
Where the first pair in my df is (Australia, Brazil) which corresponds to a value of 27.868809286999106 from the excel sheet that I reproduced above. Is there a built-in R function that would match the pairs from my df to extract its corresponding value? Thanks
Perhaps what you need is dplyr::pivot_longer?
library(dplyr)
colnames(remittance)[1] <- 'source'
remittance %>% pivot_longer(-source, names_to = 'destination')
#----
# A tibble: 60 x 3
source destination value
<chr> <chr> <chr>
1 Australia Brazil 27.868809286999106
2 Australia Canada 46.827693406219382
3 Australia Czech Republic 104.79905129342241
4 Australia Finland 26.823089572300752
Note remittance is the dataframe in the OP dput.
Probably you are interested in keeping the flexibility of your nice combn approach.
To loop over your pairs data frame (it's actually a matrix though) you may use apply with MARGIN=1 for row-wise. In the FUN= argument we create data frames of one row each with source corresponding to column 1 of pairs and destination to column 2. The distance (or whatever this value is) we get by subsetting at the corresponding rows and columns of remittance (for brevity I shortend to rem).
Since we will get a list of single-line data frames, we want to rbind, and because we have multiple objects we need do.call.
res <- do.call(rbind,
apply(pairs, MARGIN=1, FUN=function(x)
data.frame(source=x[1], destination=x[2],
dist=as.integer(rem[rem[, 1] == x[1], rem[1, ] == x[2]])))
)
Since the .xlsx has zeros where actually should be NAs we should declare them as such in the result.
res[res == 0] <- NA
Result
head(res, 25)
# source destination dist
# 1 Australia Brazil 721
# 2 Australia Canada 24721
# 3 Australia Czech Republic 1074
# 4 Australia Germany 13938
# 5 Australia Finland 1121
# 6 Australia United Kingdom 135000
# 7 Australia Italy 19350
# 8 Australia Poland 974
# 9 Australia Russian Federation 543
# 10 Australia Sweden 3988
# 11 Australia United States 93179
# 12 Australia Philippines 4118
# 13 Australia France 8475
# 14 Australia Netherlands 10697
# 15 Australia Hungary 997
# 16 Australia Saudi Arabia NA
# 17 Australia Thailand 11298
# 18 Australia Korea, Rep. 5381
# 19 Australia Indonesia 11094
# 20 Brazil Canada 26647
# 21 Brazil Czech Republic 742
# 22 Brazil Germany 44000
# 23 Brazil Finland 1378
# 24 Brazil United Kingdom 55772
# 25 Brazil Italy 104779
Data:
u <- "https://www.knomad.org/sites/default/files/2018-04/bilateralmigrationmatrix20170_Apr2018.xlsx"
rem <- openxlsx::read.xlsx(u)
countries <- c("Australia", "Brazil", "Canada", "Czech Republic", "Germany",
"Finland", "United Kingdom", "Italy", "Poland", "Russian Federation",
"Sweden", "United States", "Philippines", "France", "Netherlands",
"Hungary", "Saudi Arabia", "Thailand", "Korea, Rep.", "Indonesia")
pairs <- t(combn(countries, 2))

R: Count frequency of values in nested list with sub-elements

I have a nested list that contains country names.
I want to count the frequency of the countries, whereby +1 is added with each mention in a sub-list (regardless of how often the country is mentioned in that sub-list).
For instance, if I have this list:
[[1]]
[1] "Austria" "Austria" "Austria"
[[2]]
[1] "Austria" "Sweden"
[[3]]
[1] "Austria" "Austria" "Sweden" "Sweden" "Sweden" "Sweden"
[[4]]
[1] "Austria" "Austria" "Austria"
[[5]]
[1] "Austria" "Japan"
... then I would like the result to be like this:
country freq
====================
Austria 5
Sweden 2
Japan 1
I have tried various ways with lapply, unlist, table, etc. but nothing worked the way I would need it. I would appreciate your help!
One way with lapply(), unlist() and table():
count <- table(unlist(lapply(lst, unique)))
count
# Austria Japan Sweden
# 5 1 2
as.data.frame(count)
# Var1 Freq
# 1 Austria 5
# 2 Japan 1
# 3 Sweden 2
Reproducible data (please provide yourself next time):
lst <- list(
c('Austria', 'Austria', 'Austria'),
c("Austria", "Sweden"),
c("Austria", "Austria", "Sweden", "Sweden", "Sweden", "Sweden"),
c("Austria", "Austria", "Austria"),
c("Austria", "Japan")
)
Here is another base R option
colSums(
do.call(
rbind,
lapply(
lst,
function(x) table(factor(x, levels = unique(unlist(lst)))) > 0
)
)
)
which gives
Austria Sweden Japan
5 2 1
One way would be to get data in dataframe format and count unique elements where each country occurs.
library(dplyr)
tibble::enframe(lst) %>%
tidyr::unnest(value) %>%
group_by(value) %>%
summarise(freq = n_distinct(name))
# value freq
# <chr> <int>
#1 Austria 5
#2 Japan 1
#3 Sweden 2
data
lst <- list(c('Austria', 'Austria', 'Austria'), c("Austria", "Sweden"),
c("Austria", "Austria", "Sweden", "Sweden", "Sweden", "Sweden"),
c("Austria", "Austria", "Austria"), c("Austria", "Japan" ))
An option is also to stack into a two column data.frame, then take the unique and apply the table
table(unique(stack(setNames(lst, seq_along(lst))))$values)
# Austria Japan Sweden
# 5 1 2

Changing spelling for multiple words at a time in R/replacing many words at once

I have a dataset (survey) and a column of birth_country, where people have written their country of birth. An example of it:
1 america
2 usa
3 american
4 us of a
5 united states
6 england
7 english
8 great britain
9 uk
10 united kingdom
how I would like it to look:
1 america
2 america
3 america
4 america
5 america
6 uk
7 uk
8 uk
9 uk
10 uk
I have tried using str_replace to manually insert the different spellings, to replace them with 'america' but when I look at my dataset, nothing has changed
e.g.
survey <- structure(list(birth_country = c("america", "usa", "american", "us of a", "united states", "england", "english", "great britain", "uk", "united kingdom")), row.names = c(NA, -10L), class = "data.frame")
survey$birth_country <- str_replace(survey$birth_country, ' "united state"|"united statea"|"united states of america"', "america")
thank you in advance
Come up with some patterns that only match for each country and basically loop over what you are already doing (you can change the replacement below with your favorite function)
survey <- structure(list(birth_country = c("america", "usa", "american", "us of a", "united states", "england", "english", "great britain", "uk", "united kingdom")), row.names = c(NA, -10L), class = "data.frame")
## use a _named_ list of regular expressions
## the name will be the replacement string
l <- list(
america = 'amer|us|states',
uk = 'eng|brit|king|uk',
'another country' = 'ano|an co',
chaz = 'chaz|chop'
)
f <- function(x, list) {
for (ii in seq_along(list)) {
x[grepl(list[[ii]], x, ignore.case = TRUE)] <- names(list)[ii]
}
x
}
## test it
f(survey$birth_country, l)
# [1] "america" "america" "america" "america" "america" "uk" "uk" "uk" "uk" "uk"
within(survey, {
clean <- f(birth_country, l)
})
# birth_country clean
# 1 america america
# 2 usa america
# 3 american america
# 4 us of a america
# 5 united states america
# 6 england uk
# 7 english uk
# 8 great britain uk
# 9 uk uk
# 10 united kingdom uk
Note that 1) if you don't give a pattern that matches, nothing will change, but 2) if you give a pattern that matches both countries (e.g., "united"), the first in the list will be used (unless the replacement itself is also matched)
Looks like the problem is in how you specified your regular expression. Try this (updated based on #Gabriella 's comment, and another tidyverse approach, similar to #MarBIo ):
library(tidyverse)
survey <- survey %>%
mutate(birth_country = if_else(
str_detect(birth_country,
"(united state)|(united statea)|(united states of america)"), #If your regular expression matches any in birth_country
"america", #Change it to "america"
birth_country #Otherwise, keep as is.
) #end of if_else
) #end of mutate
Other people are suggesting you come up with a more complex regular expression, which you can certainly do as well. Consecutive "or" (i.e. "|") statements in your regular expression works though.
In case you allow tidyverse`s mutate you can do:
library(tidyverse)
survey <- structure(list(birth_country = c("america", "usa", "american", "us of a", "united states", "england", "english", "great britain", "uk", "united kingdom")), row.names = c(NA, -10L), class = "data.frame")
americas <- c("america", "usa", "american", "us of a", "united states")
englands <- c("england", "english", "great britain")
survey %>%
mutate(birth_country = ifelse(birth_country %in% americas, 'america', 'UK'))
#> birth_country
#> 1 america
#> 2 america
#> 3 america
#> 4 america
#> 5 america
#> 6 UK
#> 7 UK
#> 8 UK
#> 9 UK
#> 10 UK

Resources