R: Count frequency of values in nested list with sub-elements - r

I have a nested list that contains country names.
I want to count the frequency of the countries, whereby +1 is added with each mention in a sub-list (regardless of how often the country is mentioned in that sub-list).
For instance, if I have this list:
[[1]]
[1] "Austria" "Austria" "Austria"
[[2]]
[1] "Austria" "Sweden"
[[3]]
[1] "Austria" "Austria" "Sweden" "Sweden" "Sweden" "Sweden"
[[4]]
[1] "Austria" "Austria" "Austria"
[[5]]
[1] "Austria" "Japan"
... then I would like the result to be like this:
country freq
====================
Austria 5
Sweden 2
Japan 1
I have tried various ways with lapply, unlist, table, etc. but nothing worked the way I would need it. I would appreciate your help!

One way with lapply(), unlist() and table():
count <- table(unlist(lapply(lst, unique)))
count
# Austria Japan Sweden
# 5 1 2
as.data.frame(count)
# Var1 Freq
# 1 Austria 5
# 2 Japan 1
# 3 Sweden 2
Reproducible data (please provide yourself next time):
lst <- list(
c('Austria', 'Austria', 'Austria'),
c("Austria", "Sweden"),
c("Austria", "Austria", "Sweden", "Sweden", "Sweden", "Sweden"),
c("Austria", "Austria", "Austria"),
c("Austria", "Japan")
)

Here is another base R option
colSums(
do.call(
rbind,
lapply(
lst,
function(x) table(factor(x, levels = unique(unlist(lst)))) > 0
)
)
)
which gives
Austria Sweden Japan
5 2 1

One way would be to get data in dataframe format and count unique elements where each country occurs.
library(dplyr)
tibble::enframe(lst) %>%
tidyr::unnest(value) %>%
group_by(value) %>%
summarise(freq = n_distinct(name))
# value freq
# <chr> <int>
#1 Austria 5
#2 Japan 1
#3 Sweden 2
data
lst <- list(c('Austria', 'Austria', 'Austria'), c("Austria", "Sweden"),
c("Austria", "Austria", "Sweden", "Sweden", "Sweden", "Sweden"),
c("Austria", "Austria", "Austria"), c("Austria", "Japan" ))

An option is also to stack into a two column data.frame, then take the unique and apply the table
table(unique(stack(setNames(lst, seq_along(lst))))$values)
# Austria Japan Sweden
# 5 1 2

Related

Web-scraping table with merged row entries in R

I'm trying to scrape data-tables from a website
https://newsroom.spotify.com/2020-03-09/36-new-artists-around-the-world-that-are-on-spotifys-radar/
The issue is that the first column entry is merged across multiple rows while the second column has discrete entries:
The data table which is being scraped llos something like:
Here different entries in the second column has been merges in a single entry using \n
Now I want to shift the merged data to different rows and need some help with the same.
The code for webscraping is
library(rvest
#Spotify's list of new artists to look out for
upcoming_artists <- "https://newsroom.spotify.com/2020-03-09/36-new-artists-around-the-world-that-are-on-spotifys-radar/"
upcoming_artists <- read_html(upcoming_artists)
upcoming_artists <- html_table(upcoming_artists)
The erroneous data frame looks something like:
list(structure(list(X1 = c("United States", "United Kingdom",
"Brazil", "Mexico", "Argentina", "Colombia", "Panama", "Spain",
"Australia", "France", "UAE & Lebanon", "South Africa", "Philippines",
"Indonesia", "Taiwan", "Austria", "Germany", "Netherlands", "Japan\n*RADAR locally titled Early Noise",
"India"), X2 = c("Alaina Castillo", "Young T + Bugsey", "Agnes Nunes",
"Silvana Estrada", "Romeo El Santo", "Ela Minus", "Boza", "DORA \nAleesha\nMaría José Llergo\nGuitarricadelafuente\nParanoid 1966",
"merci, mercy", "Lous and the Yakuza \nYuzmv\nPhilippine\nHervé",
"Hollaphonic x Xriss", "Elaine", "SB19\nAugust Wahh", "Mahen\nMonica Karina",
"張若凡RuoFan", "AVEC \nMy Ugly Clementine", "badmómzjay",
"RIMON \nJeangu Macrooy", "Fujii Kaze\nVaundy\nRina Sawayama",
"Mali\nWhen Chai Met Toast\nTaba Chake")), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame")))
Use separate_rows from package tidyr to separate a given column into rows.
I have changed the scrapping a bit.
suppressPackageStartupMessages({
library(rvest)
library(dplyr)
})
#Spotify's list of new artists to look out for
upcoming_artists_link <- "https://newsroom.spotify.com/2020-03-09/36-new-artists-around-the-world-that-are-on-spotifys-radar/"
upcoming_artists <- read_html(upcoming_artists_link)
upcoming_artists %>%
html_elements("tbody") %>%
html_table() %>%
`[[`(1) %>%
tidyr::separate_rows(X2, sep = "\n")
#> # A tibble: 35 × 2
#> X1 X2
#> <chr> <chr>
#> 1 United States Alaina Castillo
#> 2 United Kingdom Young T + Bugsey
#> 3 Brazil Agnes Nunes
#> 4 Mexico Silvana Estrada
#> 5 Argentina Romeo El Santo
#> 6 Colombia Ela Minus
#> 7 Panama Boza
#> 8 Spain DORA 
#> 9 Spain Aleesha
#> 10 Spain María José Llergo
#> # … with 25 more rows
Created on 2022-10-03 with reprex v2.0.2

How to modify data frame in R based on one unique column

I have a data frame that looks like this.
Data
Denmark MG301
Denmark MG302
Australia MG301
Australia MG302
Sweden MG100
Sweden MG120
I need to make a new data frame based on unique values of 2nd columns while removing repeating values in Denmark. And results should look like this
Data
Australia MG301
Australia MG302
Sweden MG100
Sweden MG120
Regards
Update after clarification:
This code keeps all distinct values in column2:
distinct(df, code, .keep_all = TRUE)
Output:
1 Denmark MG301
2 Australia MG302
3 Sweden MG100
4 Sweden MG120
First answer:
I am not quite sure. But it gives the desired output:
df %>%
filter(country != "Denmark")
Output:
country code
<chr> <chr>
1 Australia MG301
2 Australia MG302
3 Sweden MG100
4 Sweden MG120
data:
df<- tribble(
~country, ~code,
"Denmark", "MG301",
"Denmark", "MG301",
"Australia", "MG301",
"Australia", "MG302",
"Sweden", "MG100",
"Sweden", "MG120")
In base R, the following code removes all rows with "Denmark" in the first column and all duplicated 2nd column by groups of 1st column.
i <- df1$V1 != "Denmark"
j <- as.logical(ave(df1$V2, df1$V1, FUN = duplicated))
df1[i & !j, ]
# V1 V2
#3 Australia MG301
#4 Australia MG302
#5 Sweden MG100
#6 Sweden MG120
Do you want just distinct ? then this may help
df <- data.frame(A = c("denmark", "denmark", "Australia", "Australia", "Sweden", "Sweden"), B = c("MG301","MG302","MG301","MG302","MG100","MG100"))
df %>% distinct()
A B
1 denmark MG301
2 denmark MG302
3 Australia MG301
4 Australia MG302
5 Sweden MG100
Or you want this ?
df %>%
group_by(B) %>%
dplyr::summarise(A = first(A))
B A
* <chr> <chr>
1 MG100 Sweden
2 MG301 denmark
3 MG302 denmark
Use duplicated with a ! bang operator to remove duplicated rows among that column.
To show a rather complicated case, I am adding one row in Denmark which is not duplicated and hence should not be filtered out.
df<- tribble(
~country, ~code,
"Denmark", "MG301",
"Denmark", "MG302",
'Denmark', "MG303",
"Australia", "MG301",
"Australia", "MG302",
"Sweden", "MG100",
"Sweden", "MG120")
# A tibble: 7 x 2
country code
<chr> <chr>
1 Denmark MG301
2 Denmark MG302
3 Denmark MG303
4 Australia MG301
5 Australia MG302
6 Sweden MG100
7 Sweden MG120
df %>%
mutate(d = duplicated(code)) %>%
group_by(code) %>%
mutate(d = sum(d)) %>% ungroup() %>%
filter(!(d > 0 & country == 'Denmark'))
# A tibble: 5 x 3
country code d
<chr> <chr> <int>
1 Denmark MG303 0
2 Australia MG301 1
3 Australia MG302 1
4 Sweden MG100 0
5 Sweden MG120 0

Trying to find values within excel cell based on given pairs in R df

I am using this excel sheet that I have currently read into R: https://www.knomad.org/sites/default/files/2018-04/bilateralmigrationmatrix20170_Apr2018.xlsx
dput(head(remittance, 5))
The output is:
structure(list(`Remittance-receiving country (across) - Remittance-sending country (down)` = c("Australia",
"Brazil", "Canada"), Brazil = c("27.868809286999106", "0", "31.284184411144214"
), Canada = c("46.827693406219382", "1.5806325278762619", "0"
), `Czech Republic` = c("104.79905129342241", "3.0488843262423089",
"176.79676736179096"), Finland = c("26.823089572300752", "1.3451674211686246",
"37.781150857376964"), France = c("424.37048861305249", "123.9763417712491",
"1296.7352242506483"), Germany = c("556.4140279523856", "66.518143815367239",
"809.9621650533453"), Hungary = c("200.08597014449356", "11.953328254521287",
"436.0811601171776"), Indonesia = c("172.0021287331823", "1.3701340430259537",
"33.545925908780198"), Italy = c("733.51652291459231", "116.74264895322995",
"1072.1119887588022"), `Korea, Rep.` = c("259.97044386689589",
"20.467939414361016", "326.94157937864327"), Netherlands = c("133.48932759488602",
"4.7378343766684532", "181.28828076733771"), Philippines = c("1002.3593555086774",
"1.5863355979877207", "2369.5223195675494"), Poland = c("109.73486651698796",
"5.8313637459523129", "341.10408952685464"), `Russian Federation` = c("19.082541158574934",
"1.0136604494838692", "58.760989426089431"), `Saudi Arabia` = c("13.578431465294949",
"0.32506772760873404", "15.511213677040857"), Sweden = c("91.887827513176489",
"5.1132733094740352", "65.860232580192786"), Thailand = c("383.08245004577498",
"2.7410805494977684", "79.370683058792849"), `United Kingdom` = c("1084.0742194994727",
"4.2050614573174592", "568.62605950140266"), `United States` = c("188.06242727403128",
"49.814372612310521", "661.98049661387927"), WORLD = c("5578.0296723604206",
"422.37127035334271", "8563.264510816849")), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
I currently have a dataframe of two columns "Source" and "Destination" where each row is a pair of countries which I created by doing:
countries = c("Australia","Brazil", "Canada", "Czech Republic", "Germany", "Finland", "United Kingdom", "Italy", "Poland", "Russian Federation", "Sweden", "United States", "Philippines", "France", "Netherlands", "Hungary", "Saudi Arabia", "Thailand", "Korea, Rep.", "Indonesia")
pairs = t(combn(countries, 2))
I would like to use each pair to extract its corresponding value from the excel sheet above. (In the Excel sheet "Source" is the first column of countries-down and "Destination is the first row countries-across)
For example a sample of the df that I have looks as follows (it currently contains 190 pairs):
pairs = data.frame(Source = c("Australia", "Australia", "Australia"), Destination = c("Brazil", "Canada", "Czech Republic"))
Where the first pair in my df is (Australia, Brazil) which corresponds to a value of 27.868809286999106 from the excel sheet that I reproduced above. Is there a built-in R function that would match the pairs from my df to extract its corresponding value? Thanks
Perhaps what you need is dplyr::pivot_longer?
library(dplyr)
colnames(remittance)[1] <- 'source'
remittance %>% pivot_longer(-source, names_to = 'destination')
#----
# A tibble: 60 x 3
source destination value
<chr> <chr> <chr>
1 Australia Brazil 27.868809286999106
2 Australia Canada 46.827693406219382
3 Australia Czech Republic 104.79905129342241
4 Australia Finland 26.823089572300752
Note remittance is the dataframe in the OP dput.
Probably you are interested in keeping the flexibility of your nice combn approach.
To loop over your pairs data frame (it's actually a matrix though) you may use apply with MARGIN=1 for row-wise. In the FUN= argument we create data frames of one row each with source corresponding to column 1 of pairs and destination to column 2. The distance (or whatever this value is) we get by subsetting at the corresponding rows and columns of remittance (for brevity I shortend to rem).
Since we will get a list of single-line data frames, we want to rbind, and because we have multiple objects we need do.call.
res <- do.call(rbind,
apply(pairs, MARGIN=1, FUN=function(x)
data.frame(source=x[1], destination=x[2],
dist=as.integer(rem[rem[, 1] == x[1], rem[1, ] == x[2]])))
)
Since the .xlsx has zeros where actually should be NAs we should declare them as such in the result.
res[res == 0] <- NA
Result
head(res, 25)
# source destination dist
# 1 Australia Brazil 721
# 2 Australia Canada 24721
# 3 Australia Czech Republic 1074
# 4 Australia Germany 13938
# 5 Australia Finland 1121
# 6 Australia United Kingdom 135000
# 7 Australia Italy 19350
# 8 Australia Poland 974
# 9 Australia Russian Federation 543
# 10 Australia Sweden 3988
# 11 Australia United States 93179
# 12 Australia Philippines 4118
# 13 Australia France 8475
# 14 Australia Netherlands 10697
# 15 Australia Hungary 997
# 16 Australia Saudi Arabia NA
# 17 Australia Thailand 11298
# 18 Australia Korea, Rep. 5381
# 19 Australia Indonesia 11094
# 20 Brazil Canada 26647
# 21 Brazil Czech Republic 742
# 22 Brazil Germany 44000
# 23 Brazil Finland 1378
# 24 Brazil United Kingdom 55772
# 25 Brazil Italy 104779
Data:
u <- "https://www.knomad.org/sites/default/files/2018-04/bilateralmigrationmatrix20170_Apr2018.xlsx"
rem <- openxlsx::read.xlsx(u)
countries <- c("Australia", "Brazil", "Canada", "Czech Republic", "Germany",
"Finland", "United Kingdom", "Italy", "Poland", "Russian Federation",
"Sweden", "United States", "Philippines", "France", "Netherlands",
"Hungary", "Saudi Arabia", "Thailand", "Korea, Rep.", "Indonesia")
pairs <- t(combn(countries, 2))

Changing spelling for multiple words at a time in R/replacing many words at once

I have a dataset (survey) and a column of birth_country, where people have written their country of birth. An example of it:
1 america
2 usa
3 american
4 us of a
5 united states
6 england
7 english
8 great britain
9 uk
10 united kingdom
how I would like it to look:
1 america
2 america
3 america
4 america
5 america
6 uk
7 uk
8 uk
9 uk
10 uk
I have tried using str_replace to manually insert the different spellings, to replace them with 'america' but when I look at my dataset, nothing has changed
e.g.
survey <- structure(list(birth_country = c("america", "usa", "american", "us of a", "united states", "england", "english", "great britain", "uk", "united kingdom")), row.names = c(NA, -10L), class = "data.frame")
survey$birth_country <- str_replace(survey$birth_country, ' "united state"|"united statea"|"united states of america"', "america")
thank you in advance
Come up with some patterns that only match for each country and basically loop over what you are already doing (you can change the replacement below with your favorite function)
survey <- structure(list(birth_country = c("america", "usa", "american", "us of a", "united states", "england", "english", "great britain", "uk", "united kingdom")), row.names = c(NA, -10L), class = "data.frame")
## use a _named_ list of regular expressions
## the name will be the replacement string
l <- list(
america = 'amer|us|states',
uk = 'eng|brit|king|uk',
'another country' = 'ano|an co',
chaz = 'chaz|chop'
)
f <- function(x, list) {
for (ii in seq_along(list)) {
x[grepl(list[[ii]], x, ignore.case = TRUE)] <- names(list)[ii]
}
x
}
## test it
f(survey$birth_country, l)
# [1] "america" "america" "america" "america" "america" "uk" "uk" "uk" "uk" "uk"
within(survey, {
clean <- f(birth_country, l)
})
# birth_country clean
# 1 america america
# 2 usa america
# 3 american america
# 4 us of a america
# 5 united states america
# 6 england uk
# 7 english uk
# 8 great britain uk
# 9 uk uk
# 10 united kingdom uk
Note that 1) if you don't give a pattern that matches, nothing will change, but 2) if you give a pattern that matches both countries (e.g., "united"), the first in the list will be used (unless the replacement itself is also matched)
Looks like the problem is in how you specified your regular expression. Try this (updated based on #Gabriella 's comment, and another tidyverse approach, similar to #MarBIo ):
library(tidyverse)
survey <- survey %>%
mutate(birth_country = if_else(
str_detect(birth_country,
"(united state)|(united statea)|(united states of america)"), #If your regular expression matches any in birth_country
"america", #Change it to "america"
birth_country #Otherwise, keep as is.
) #end of if_else
) #end of mutate
Other people are suggesting you come up with a more complex regular expression, which you can certainly do as well. Consecutive "or" (i.e. "|") statements in your regular expression works though.
In case you allow tidyverse`s mutate you can do:
library(tidyverse)
survey <- structure(list(birth_country = c("america", "usa", "american", "us of a", "united states", "england", "english", "great britain", "uk", "united kingdom")), row.names = c(NA, -10L), class = "data.frame")
americas <- c("america", "usa", "american", "us of a", "united states")
englands <- c("england", "english", "great britain")
survey %>%
mutate(birth_country = ifelse(birth_country %in% americas, 'america', 'UK'))
#> birth_country
#> 1 america
#> 2 america
#> 3 america
#> 4 america
#> 5 america
#> 6 UK
#> 7 UK
#> 8 UK
#> 9 UK
#> 10 UK

Find levels of a factors that appear more than once

I have this dataframe:
data <- data.frame(countries=c(rep('UK', 5),
rep('Netherlands 1a', 5),
rep('Netherlands', 5),
rep('USA', 5),
rep('spain', 5),
rep('Spain', 5),
rep('Spain 1a', 5),
rep('spain 1a', 5)),
var=rnorm(40))
countries var
1 UK 0.506232270
2 UK 0.976348808
3 UK -0.752151769
4 UK 1.137267199
5 UK -0.363406715
6 Netherlands 1a -0.800835463
7 Netherlands 1a 1.767724231
8 Netherlands 1a 0.810757929
9 Netherlands 1a -1.188975114
10 Netherlands 1a -0.763144245
11 Netherlands 0.428511920
12 Netherlands 0.835184425
13 Netherlands -0.198316780
14 Netherlands 1.108191193
15 Netherlands 0.946819500
16 USA 0.226786121
17 USA -0.466886468
18 USA -2.217910876
19 USA -0.003472937
20 USA -0.784264921
21 spain -1.418014562
22 spain 1.002412706
23 spain 0.472621627
24 spain -1.378960222
25 spain -0.197020702
26 Spain 1.197971896
27 Spain 1.227648883
28 Spain -0.253083684
29 Spain -0.076562960
30 Spain 0.338882352
31 Spain 1a 0.074459521
32 Spain 1a -1.136391220
33 Spain 1a -1.648418916
34 Spain 1a 0.277264011
35 Spain 1a -0.568411569
36 spain 1a 0.250151646
37 spain 1a -1.527885883
38 spain 1a -0.452190849
39 spain 1a 0.454168927
40 spain 1a 0.889401396
I want to be able to find levels of countries that appear in different forms more than once. Forms that levels of countries might appear in are:
lowercase, for example "spain"
titlecase, for example "Spain"
lowercase with a different word attached, for example "spain 1a"
titlecase with a different word attached, for example "Spain 1a"
So I need to function to return a vector listing levels countries that appear more than once. In data, the vector that should be returned is:
"Netherlands 1a", "Netherlands", "spain", "Spain", "spain 1a", "Spain 1a"
Is it possible to make a function that would return this vector?
A quick solution that should meet all requirements (assuming that the country name is always the first element of your data$country entries):
# Country substrings
country.substr <- sapply(strsplit(tolower(levels(data$countries)), " "), "[[", 1)
# Duplicated country substrings
country.substr.dupl <- duplicated(country.substr)
# Display all country levels that appear in different forms
do.call("c", lapply(unique(country.substr[country.substr.dupl]), function(i) {
levels(data$countries)[grep(i, tolower(levels(data$countries)))]
}))
[1] "Netherlands" "Netherlands 1a" "spain" "Spain" "spain 1a" "Spain 1a"
Update:
Assuming that the country name is not always to be found at the first position, you need to apply a different approach that I took from here. Note that I slightly modified your sample data to clarify what I'm doing:
data <- data.frame(countries=c(rep('United Kingdom', 5),
rep('united kingdom', 5),
rep('Netherlands', 5),
rep('Netherlands 1a', 5),
rep('1a Netherlands', 5),
rep('USA', 5),
rep('spain', 5),
rep('Spain', 5),
rep('Spain 1a', 5),
rep('spain 1a', 5)),
var=rnorm(50))
Now let's identify all country substrings that do NOT contain any numerics. The subsequent steps remain the same. Is that what you need?
# Remove mixed numeric/alphabetic parts from country names
country.substr <- lapply(strsplit(tolower(levels(data$countries)), " "), function(i) {
# Identify, paste and return alphabetic-only components
tmp <- grep("^[[:alpha:]]*$", i)
if (length(tmp) == 1)
return(i[tmp])
else
return(paste(i[tmp], collapse = " "))
})
# Identify douplicated country names
country.substr.dupl <- duplicated(country.substr)
# Display all country levels that appear in different forms
do.call("c", lapply(unique(country.substr[country.substr.dupl]), function(i) {
levels(data$countries)[grep(i, tolower(levels(data$countries)))]
}))
[1] "1a Netherlands" "Netherlands" "Netherlands 1a" "spain" "Spain" "spain 1a" "Spain 1a" "united kingdom" "United Kingdom"
Why not use grep? The ignore.case argument is just what you need here.
> uch <- unique(as.character(data$countries))
> found <- sapply(seq(uch), function(i){
if(!grepl("\\s|[0-9]", uch[i]))
grep(uch[i], uch, ignore.case = TRUE, value = TRUE)
})
> ff <- found[sapply(found, function(x) length(x) > 1)]
> unique(unlist(ff))
# [1] "Netherlands 1a" "Netherlands" "spain"
# [4] "Spain" "Spain 1a" "spain 1a"
Here's my logic: Take the unique factor levels of the column as a character vector. Then, compare it with itself, looking only at those levels that do not contain a space or a digit. grep will catch those, but the other way around is a bit more tough. Then, we just find the unique matches. So here's a function and a test run,
find.matches <- function(column)
{
uch <- unique(as.character(column))
found <- sapply(seq(uch), function(i){
if(!grepl("\\s|[0-9]", uch[i]))
grep(uch[i], uch, ignore.case = TRUE, value = TRUE)
})
ff <- found[sapply(found, function(x) length(x) > 1)]
unique(unlist(ff))
}
> dat <- data.frame(x = c("a", "a1", "a 1b", "c", "d"),
y = c("fac", "tor", "fac 1a", "tor1a", "fac"))
> sapply(dat, find.matches)
# $x
# [1] "a" "a1" "a 1b"
#
# $y
# [1] "fac" "fac 1a" "tor" "tor1a"

Resources