R: Count overall unique objects in column of lists - r

Ok, so here is my scenario: I have a dataset with a column composed of lists of words (keyword tags for YT videos, where each row is video data).
What I want to do is do a complete count of all unique object instances within these lists, for the entire column. So basically what I want in the end is a table with two fields: keyword, count.
If I just do a simple dplyr query, then it counts the list itself as a unique object. While this is also interesting, this is not what I want.
So this is the above dplyr query that I want to utilize further, but not sure how to nest unique instances within the unique lists:
vid_tag_freq = df %>%
count(tags)
To further clarify:
With a dataset like:
Tags
1 ['Dog', 'Cat', 'Mouse', 'Fish']
2 ['Cat', 'Fish']
3 ['Cat', 'Fish']
I am now getting:
Tags Count
1 ['Dog', 'Cat', 'Mouse', 'Fish'] 1
2 ['Cat', 'Fish'] 2
What I actually want:
Tags Count
1 'Cat' 3
2 'Fish' 3
3 'Dog' 1
4 'Mouse' 1
I hope that explains it lol
EDIT: This is what my data looks like, guess most are lists of lists? Maybe I should clean up [0]s as null?
[1] "[['Flood (Disaster Type)', 'Burlington (City/Town/Village)', 'Ontario (City/Town/Village)']]"
[2] "[0]"
[3] "[0]"
[4] "[['Rocket (Product Category)', 'Interview (TV Genre)', 'Canadian Broadcasting Corporation (TV Network)', 'Israel (Country)', 'Gaza War (Military Conflict)']]"
[5] "[0]"
[6] "[['Iraq (Country)', 'Military (Film Genre)', 'United States Of America (Country)']]"
[7] "[['Ebola (Disease Or Medical Condition)', 'Chair', 'Margaret Chan (Physician)', 'WHO']]"
[8] "[['CBC Television (TV Network)', 'CBC News (Website Owner)', 'Canadian Broadcasting Corporation (TV Network)']]"
[9] "[['Rob Ford (Politician)', 'the fifth estate', 'CBC Television (TV Network)', 'Bill Blair', 'Gillian Findlay', 'Documentary (TV Genre)']]"
[10] "[['B.C.', 'Dog Walking (Profession)', 'dogs', 'dog walker', 'death', 'dead']]"
[11] "[['Suicide Of Amanda Todd (Event)', 'Amanda Todd', 'cyberbullying', 'CBC Television (TV Network)', 'the fifth estate', 'Mark Kelley', 'cappers', 'Documentary (TV Genre)']]"
[12] "[['National Hockey League (Sports Association)', 'Climate Change (Website Category)', 'Hockey (Sport)', 'greenhouse gas', 'emissions']]"
[13] "[['Rob Ford (Politician)', 'bomb threat', 'Toronto (City/Town/Village)', 'City Hall (Building)']]"
[14] "[['Blue Jays', 'Ashes', 'friends']]"
[15] "[['Robin Williams (Celebrity)', 'Peter Gzowski']]"

It would help if you could dput() some of the data for a working example. Going off the idea that you have a list column, here are a couple of general solutions you may be able to work with:
df <- tibble::tibble(
x = replicate(10, sample(state.name, sample(5:10, 1), TRUE), simplify = FALSE)
)
df
#> # A tibble: 10 × 1
#> x
#> <list>
#> 1 <chr [7]>
#> 2 <chr [7]>
#> 3 <chr [8]>
#> 4 <chr [6]>
#> 5 <chr [8]>
#> 6 <chr [8]>
#> 7 <chr [8]>
#> 8 <chr [6]>
#> 9 <chr [5]>
#> 10 <chr [10]>
# dplyr in a dataframe
df |>
tidyr::unnest(x) |>
dplyr::count(x)
#> # A tibble: 36 × 2
#> x n
#> <chr> <int>
#> 1 Alabama 1
#> 2 Alaska 1
#> 3 Arkansas 4
#> 4 California 3
#> 5 Colorado 5
#> 6 Connecticut 1
#> 7 Delaware 3
#> 8 Florida 1
#> 9 Georgia 3
#> 10 Hawaii 2
#> # … with 26 more rows
# vctrs
vctrs::vec_count(unlist(df$x))
#> key count
#> 1 Colorado 5
#> 2 Louisiana 5
#> 3 North Dakota 4
#> 4 Mississippi 4
#> 5 Arkansas 4
#> 6 Delaware 3
#> 7 Vermont 3
#> 8 Minnesota 3
#> 9 Utah 3
#> 10 California 3
#> 11 Georgia 3
#> 12 Indiana 2
#> 13 Missouri 2
#> 14 New Hampshire 2
#> 15 Maryland 2
#> 16 Nebraska 2
#> 17 Hawaii 2
#> 18 New Jersey 2
#> 19 Oklahoma 2
#> 20 Massachusetts 1
#> 21 Illinois 1
#> 22 Texas 1
#> 23 Connecticut 1
#> 24 Rhode Island 1
#> 25 Michigan 1
#> 26 New York 1
#> 27 Ohio 1
#> 28 Nevada 1
#> 29 Florida 1
#> 30 Montana 1
#> 31 Wisconsin 1
#> 32 Alabama 1
#> 33 Alaska 1
#> 34 North Carolina 1
#> 35 Washington 1
#> 36 Kansas 1
Created on 2022-10-07 with reprex v2.0.2
Edit
If you list is actually a character vector, you'll need to do some string parsing.
# "list" but are actually strings
x <- c(
"[['Flood (Disaster Type)', 'Burlington (City/Town/Village)', 'Ontario (City/Town/Village)']]",
"[0]",
"[0]",
"[['Rocket (Product Category)', 'Interview (TV Genre)', 'Canadian Broadcasting Corporation (TV Network)', 'Israel (Country)', 'Gaza War (Military Conflict)']]",
"[0]",
"[['Iraq (Country)', 'Military (Film Genre)', 'United States Of America (Country)']]",
"[['Ebola (Disease Or Medical Condition)', 'Chair', 'Margaret Chan (Physician)', 'WHO']]",
"[['CBC Television (TV Network)', 'CBC News (Website Owner)', 'Canadian Broadcasting Corporation (TV Network)']]",
"[['Rob Ford (Politician)', 'the fifth estate', 'CBC Television (TV Network)', 'Bill Blair', 'Gillian Findlay', 'Documentary (TV Genre)']]",
"[['B.C.', 'Dog Walking (Profession)', 'dogs', 'dog walker', 'death', 'dead']]",
"[['Suicide Of Amanda Todd (Event)', 'Amanda Todd', 'cyberbullying', 'CBC Television (TV Network)', 'the fifth estate', 'Mark Kelley', 'cappers', 'Documentary (TV Genre)']]",
"[['National Hockey League (Sports Association)', 'Climate Change (Website Category)', 'Hockey (Sport)', 'greenhouse gas', 'emissions']]",
"[['Rob Ford (Politician)', 'bomb threat', 'Toronto (City/Town/Village)', 'City Hall (Building)']]",
"[['Blue Jays', 'Ashes', 'friends']]",
"[['Robin Williams (Celebrity)', 'Peter Gzowski']]"
)
# assing to a data.frame
df <- data.frame(x = x)
df |>
dplyr::mutate(
# remove square brackets at beginning or end
x = gsub("^\\[{1,2}|\\]{1,2}$", "", x),
# separate the strings into an actual list
x = strsplit(x, "',\\s|,\\s'")
) |>
# unnuest the list column so they appear as individual rows
tidyr::unnest(x) |>
# some extract cleaning to string out the '
dplyr::mutate(x = gsub("^'|'$", "", x)) |>
# count the individual elements
dplyr::count(x, sort = TRUE)
#> # A tibble: 47 × 2
#> x n
#> <chr> <int>
#> 1 0 3
#> 2 CBC Television (TV Network) 3
#> 3 Canadian Broadcasting Corporation (TV Network) 2
#> 4 Documentary (TV Genre) 2
#> 5 Rob Ford (Politician) 2
#> 6 the fifth estate 2
#> 7 Amanda Todd 1
#> 8 Ashes 1
#> 9 B.C. 1
#> 10 Bill Blair 1
#> # … with 37 more rows
# same result just working with the vector
x |>
gsub("^\\[{1,2}|\\]{1,2}$", "", x = _) |>
strsplit("',\\s|,\\s'") |>
unlist() |>
gsub("^'|'$", "", x = _) |>
vctrs::vec_count() # or table()
#> key count
#> 1 CBC Television (TV Network) 3
#> 2 0 3
#> 3 Rob Ford (Politician) 2
#> 4 the fifth estate 2
#> 5 Documentary (TV Genre) 2
#> 6 Canadian Broadcasting Corporation (TV Network) 2
#> 7 City Hall (Building) 1
#> 8 United States Of America (Country) 1
#> 9 Mark Kelley 1
#> 10 Israel (Country) 1
#> 11 Bill Blair 1
#> 12 Interview (TV Genre) 1
#> 13 Blue Jays 1
#> 14 Hockey (Sport) 1
#> 15 friends 1
#> 16 Peter Gzowski 1
#> 17 Suicide Of Amanda Todd (Event) 1
#> 18 greenhouse gas 1
#> 19 Dog Walking (Profession) 1
#> 20 Flood (Disaster Type) 1
#> 21 National Hockey League (Sports Association) 1
#> 22 Amanda Todd 1
#> 23 Chair 1
#> 24 dog walker 1
#> 25 bomb threat 1
#> 26 dogs 1
#> 27 Climate Change (Website Category) 1
#> 28 Robin Williams (Celebrity) 1
#> 29 Margaret Chan (Physician) 1
#> 30 cyberbullying 1
#> 31 Ashes 1
#> 32 Ontario (City/Town/Village) 1
#> 33 Iraq (Country) 1
#> 34 WHO 1
#> 35 cappers 1
#> 36 Gillian Findlay 1
#> 37 Military (Film Genre) 1
#> 38 CBC News (Website Owner) 1
#> 39 B.C. 1
#> 40 Ebola (Disease Or Medical Condition) 1
#> 41 Toronto (City/Town/Village) 1
#> 42 death 1
#> 43 emissions 1
#> 44 Rocket (Product Category) 1
#> 45 Gaza War (Military Conflict) 1
#> 46 dead 1
#> 47 Burlington (City/Town/Village) 1
Created on 2022-10-08 with reprex v2.0.2

It looks like you need unnest_longer():
library(dplyr)
library(tidyr)
df <- tibble(
Tags = list(
list('Dog', 'Cat', 'Mouse', 'Fish'),
list('Cat', 'Fish'),
list('Cat', 'Fish')
)
)
df %>%
tidyr::unnest_longer(Tags) %>%
count(Tags) %>%
arrange(desc(n))
#> # A tibble: 4 × 2
#> Tags n
#> <chr> <int>
#> 1 Cat 3
#> 2 Fish 3
#> 3 Dog 1
#> 4 Mouse 1

Related

Performing Record Linkage in R

I have the following dataset in R:
address = c( "44 Ocean Road Atlanta Georgia", "882 4N Road River NY, NY 12345", "882 - River Road NY, ZIP 12345", "123 Fake Road Boston Drive Boston", "123 Fake - Rd Boston 56789", "3665 Apt 5 Moon Crs", "3665 Unit Moon Crescent", "NO ADDRESS PROVIDED", "31 Silver Way Road", "1800 Orleans St, Baltimore, MD 21287, United States",
"1799 Orlans Street, Maryland , USA")
name = c("Pancake House of America" ,"ABC Center Building", "Cent. Bldg ABC", "BD Home 25 New", "Boarding Direct 25", "Pine Recreational Center", "Pine Rec. cntR", "Boston Swimming Complex", "boston gym center", "mas hospital" , "Massachusetts Hospital" )
blocking_var = c(1, 1,1,1, 1, 2,2,2,2,3,3)
my_data = data.frame(address, name, blocking_var)
The data looks something like this:
> my_data
address name blocking_var
1 44 Ocean Road Atlanta Georgia Pancake House of America 1
2 882 4N Road River NY, NY 12345 ABC Center Building 1
3 882 - River Road NY, ZIP 12345 Cent. Bldg ABC 1
4 123 Fake Road Boston Drive Boston BD Home 25 New 1
5 123 Fake - Rd Boston 56789 Boarding Direct 25 1
6 3665 Apt 5 Moon Crs Pine Recreational Center 2
7 3665 Unit Moon Crescent Pine Rec. cntR 2
8 NO ADDRESS PROVIDED Boston Swimming Complex 2
9 31 Silver Way Road boston gym center 2
10 1800 Orleans St, Baltimore, MD 21287, United States mas hospital 3
11 1799 Orlans Street, Maryland , USA Massachusetts Hospital 3
I am trying to follow this R tutorial (https://cran.r-project.org/web/packages/RecordLinkage/vignettes/WeightBased.pdf) and learn how to remove duplicates based on fuzzy conditions. The goal (within each "block") is to keep all unique records - and for fuzzy duplicates, only keep one occurrence of the duplicate.
I tried the following code:
library(RecordLinkage)
pairs=compare.dedup(my_data, blockfld=3)
But when I inspect the results, everything is NA - given these results, I think I am doing something wrong and there does not seem to be any point in continuing until this error is resolved.
Can someone please show me how I can resolve this problem and continue on with the tutorial?
In the end, I am looking for something like this:
address name blocking_var
1 44 Ocean Road Atlanta Georgia Pancake House of America 1
2 882 4N Road River NY, NY 12345 ABC Center Building 1
4 123 Fake Road Boston Drive Boston BD Home 25 New 1
6 3665 Apt 5 Moon Crs Pine Recreational Center 2
9 31 Silver Way Road boston gym center 2
10 1800 Orleans St, Baltimore, MD 21287, United States mas hospital 3
Thank you!
You forgot to enable the string comparison on columns (strcmp parameter):
address = c(
"44 Ocean Road Atlanta Georgia", "882 4N Road River NY, NY 12345", "882 - River Road NY, ZIP 12345", "123 Fake Road Boston Drive Boston", "123 Fake - Rd Boston 56789", "3665 Apt 5 Moon Crs", "3665 Unit Moon Crescent", "NO ADDRESS PROVIDED", "31 Silver Way Road", "1800 Orleans St, Baltimore, MD 21287, United States",
"1799 Orlans Street, Maryland , USA")
name = c("Pancake House of America" ,"ABC Center Building", "Cent. Bldg ABC", "BD Home 25 New", "Boarding Direct 25", "Pine Recreational Center", "Pine Rec. cntR", "Boston Swimming Complex", "boston gym center", "mas hospital" , "Massachusetts Hospital" )
blocking_var = c(1, 1,1,1, 1, 2,2,2,2,3,3)
my_data = data.frame(address, name, blocking_var)
library(RecordLinkage)
pairs <- compare.dedup(my_data, blockfld=3, strcmp = c("address", "name"))
pairs
#> $data
#> address name
#> 1 44 Ocean Road Atlanta Georgia Pancake House of America
#> 2 882 4N Road River NY, NY 12345 ABC Center Building
#> 3 882 - River Road NY, ZIP 12345 Cent. Bldg ABC
#> 4 123 Fake Road Boston Drive Boston BD Home 25 New
#> 5 123 Fake - Rd Boston 56789 Boarding Direct 25
#> 6 3665 Apt 5 Moon Crs Pine Recreational Center
#> 7 3665 Unit Moon Crescent Pine Rec. cntR
#> 8 NO ADDRESS PROVIDED Boston Swimming Complex
#> 9 31 Silver Way Road boston gym center
#> 10 1800 Orleans St, Baltimore, MD 21287, United States mas hospital
#> 11 1799 Orlans Street, Maryland , USA Massachusetts Hospital
#> blocking_var
#> 1 1
#> 2 1
#> 3 1
#> 4 1
#> 5 1
#> 6 2
#> 7 2
#> 8 2
#> 9 2
#> 10 3
#> 11 3
#>
#> $pairs
#> id1 id2 address name blocking_var is_match
#> 1 1 2 0.4657088 0.5014620 1 NA
#> 2 1 3 0.4256705 0.4551587 1 NA
#> 3 1 4 0.5924184 0.4543651 1 NA
#> 4 1 5 0.5139994 0.4768519 1 NA
#> 5 2 3 0.9082051 0.5802005 1 NA
#> 6 2 4 0.5112554 0.4734336 1 NA
#> 7 2 5 0.5094017 0.5467836 1 NA
#> 8 3 4 0.4767677 0.4404762 1 NA
#> 9 3 5 0.5418803 0.4761905 1 NA
#> 10 4 5 0.8550583 0.6672619 1 NA
#> 11 6 7 0.8749962 0.8306277 1 NA
#> 12 6 8 0.4385965 0.5243193 1 NA
#> 13 6 9 0.5622807 0.5502822 1 NA
#> 14 7 8 0.3974066 0.5075914 1 NA
#> 15 7 9 0.5626812 0.5896359 1 NA
#> 16 8 9 0.3942495 0.6478338 1 NA
#> 17 10 11 0.6939076 0.6843434 1 NA
#>
#> $frequencies
#> address name blocking_var
#> 0.09090909 0.09090909 0.33333333
#>
#> $type
#> [1] "deduplication"
#>
#> attr(,"class")
#> [1] "RecLinkData"
It then goes like this, using e.g. the EpiLink algorithm:
# Compute EpiLink weights
pairs_w <- epiWeights(pairs)
# Explore the pairs and their weight to find a good cutoff
getPairs(pairs_w, min.weight=0.6, max.weight=0.8)
#> id address
#> 1 2 882 4N Road River NY, NY 12345
#> 2 3 882 - River Road NY, ZIP 12345
#> 3
#> 4 10 1800 Orleans St, Baltimore, MD 21287, United States
#> 5 11 1799 Orlans Street, Maryland , USA
#> 6
#> 7 7 3665 Unit Moon Crescent
#> 8 9 31 Silver Way Road
#> 9
#> 10 6 3665 Apt 5 Moon Crs
#> 11 9 31 Silver Way Road
#> 12
#> 13 2 882 4N Road River NY, NY 12345
#> 14 5 123 Fake - Rd Boston 56789
#> 15
#> 16 1 44 Ocean Road Atlanta Georgia
#> 17 4 123 Fake Road Boston Drive Boston
#> 18
#> 19 8 NO ADDRESS PROVIDED
#> 20 9 31 Silver Way Road
#> 21
#> 22 3 882 - River Road NY, ZIP 12345
#> 23 5 123 Fake - Rd Boston 56789
#> 24
#> name blocking_var Weight
#> 1 ABC Center Building 1
#> 2 Cent. Bldg ABC 1 0.7916856
#> 3
#> 4 mas hospital 3
#> 5 Massachusetts Hospital 3 0.7468321
#> 6
#> 7 Pine Rec. cntR 2
#> 8 boston gym center 2 0.6548348
#> 9
#> 10 Pine Recreational Center 2
#> 11 boston gym center 2 0.6386475
#> 12
#> 13 ABC Center Building 1
#> 14 Boarding Direct 25 1 0.6156913
#> 15
#> 16 Pancake House of America 1
#> 17 BD Home 25 New 1 0.6118630
#> 18
#> 19 Boston Swimming Complex 2
#> 20 boston gym center 2 0.6099491
#> 21
#> 22 Cent. Bldg ABC 1
#> 23 Boarding Direct 25 1 0.6001716
#> 24
I chose > 0.7 to classify as link, < 0.6 to classify as a non-link.
Matches in-between are labelled as "possible".
pairs_class <- epiClassify(pairs_w, threshold.upper = 0.7, threshold.lower = 0.6)
summary(pairs_class)
#>
#> Deduplication Data Set
#>
#> 11 records
#> 17 record pairs
#>
#> 0 matches
#> 0 non-matches
#> 17 pairs with unknown status
#>
#>
#> Weight distribution:
#>
#> [0.5,0.55] (0.55,0.6] (0.6,0.65] (0.65,0.7] (0.7,0.75] (0.75,0.8] (0.8,0.85]
#> 1 6 5 1 1 1 1
#> (0.85,0.9]
#> 1
#>
#> 4 links detected
#> 6 possible links detected
#> 7 non-links detected
#>
#> Classification table:
#>
#> classification
#> true status N P L
#> <NA> 7 6 4
And the results:
# detected links, possible matches, non-links
getPairs(pairs_class, show = "links")
#> id address
#> 1 6 3665 Apt 5 Moon Crs
#> 2 7 3665 Unit Moon Crescent
#> 3
#> 4 4 123 Fake Road Boston Drive Boston
#> 5 5 123 Fake - Rd Boston 56789
#> 6
#> 7 2 882 4N Road River NY, NY 12345
#> 8 3 882 - River Road NY, ZIP 12345
#> 9
#> 10 10 1800 Orleans St, Baltimore, MD 21287, United States
#> 11 11 1799 Orlans Street, Maryland , USA
#> 12
#> name blocking_var Weight
#> 1 Pine Recreational Center 2
#> 2 Pine Rec. cntR 2 0.8801340
#> 3
#> 4 BD Home 25 New 1
#> 5 Boarding Direct 25 1 0.8054952
#> 6
#> 7 ABC Center Building 1
#> 8 Cent. Bldg ABC 1 0.7916856
#> 9
#> 10 mas hospital 3
#> 11 Massachusetts Hospital 3 0.7468321
#> 12
getPairs(pairs_class, show = "possible")
#> id address name blocking_var
#> 1 7 3665 Unit Moon Crescent Pine Rec. cntR 2
#> 2 9 31 Silver Way Road boston gym center 2
#> 3
#> 4 6 3665 Apt 5 Moon Crs Pine Recreational Center 2
#> 5 9 31 Silver Way Road boston gym center 2
#> 6
#> 7 2 882 4N Road River NY, NY 12345 ABC Center Building 1
#> 8 5 123 Fake - Rd Boston 56789 Boarding Direct 25 1
#> 9
#> 10 1 44 Ocean Road Atlanta Georgia Pancake House of America 1
#> 11 4 123 Fake Road Boston Drive Boston BD Home 25 New 1
#> 12
#> 13 8 NO ADDRESS PROVIDED Boston Swimming Complex 2
#> 14 9 31 Silver Way Road boston gym center 2
#> 15
#> 16 3 882 - River Road NY, ZIP 12345 Cent. Bldg ABC 1
#> 17 5 123 Fake - Rd Boston 56789 Boarding Direct 25 1
#> 18
#> Weight
#> 1
#> 2 0.6548348
#> 3
#> 4
#> 5 0.6386475
#> 6
#> 7
#> 8 0.6156913
#> 9
#> 10
#> 11 0.6118630
#> 12
#> 13
#> 14 0.6099491
#> 15
#> 16
#> 17 0.6001716
#> 18
getPairs(pairs_class, show = "nonlinks")
#> id address name blocking_var
#> 1 1 44 Ocean Road Atlanta Georgia Pancake House of America 1
#> 2 5 123 Fake - Rd Boston 56789 Boarding Direct 25 1
#> 3
#> 4 2 882 4N Road River NY, NY 12345 ABC Center Building 1
#> 5 4 123 Fake Road Boston Drive Boston BD Home 25 New 1
#> 6
#> 7 1 44 Ocean Road Atlanta Georgia Pancake House of America 1
#> 8 2 882 4N Road River NY, NY 12345 ABC Center Building 1
#> 9
#> 10 6 3665 Apt 5 Moon Crs Pine Recreational Center 2
#> 11 8 NO ADDRESS PROVIDED Boston Swimming Complex 2
#> 12
#> 13 3 882 - River Road NY, ZIP 12345 Cent. Bldg ABC 1
#> 14 4 123 Fake Road Boston Drive Boston BD Home 25 New 1
#> 15
#> 16 7 3665 Unit Moon Crescent Pine Rec. cntR 2
#> 17 8 NO ADDRESS PROVIDED Boston Swimming Complex 2
#> 18
#> 19 1 44 Ocean Road Atlanta Georgia Pancake House of America 1
#> 20 3 882 - River Road NY, ZIP 12345 Cent. Bldg ABC 1
#> 21
#> Weight
#> 1
#> 2 0.5890881
#> 3
#> 4
#> 5 0.5865789
#> 6
#> 7
#> 8 0.5794458
#> 9
#> 10
#> 11 0.5777132
#> 12
#> 13
#> 14 0.5591162
#> 15
#> 16
#> 17 0.5541298
#> 18
#> 19
#> 20 0.5442886
#> 21
Created on 2022-11-17 with reprex v2.0.2

Unique values in R using dplyr

starwars %>%
group_by(species,sex) %>%
summarise() %>%
select(unique.species=species, unique.sex=sex)
How to get unique values from 2 columns("species","sex") all together? I wrote the code above but i'm not sure it's right. Thank you
library(tidyverse)
starwars |>
select(species, sex) |>
distinct()
#> # A tibble: 41 × 2
#> species sex
#> <chr> <chr>
#> 1 Human male
#> 2 Droid none
#> 3 Human female
#> 4 Wookiee male
#> 5 Rodian male
#> 6 Hutt hermaphroditic
#> 7 Yoda's species male
#> 8 Trandoshan male
#> 9 Mon Calamari male
#> 10 Ewok male
#> # … with 31 more rows
Created on 2022-04-25 by the reprex package (v2.0.1)
library(tidyverse)
starwars %>%
expand(nesting(species, sex))
#> # A tibble: 41 × 2
#> species sex
#> <chr> <chr>
#> 1 Aleena male
#> 2 Besalisk male
#> 3 Cerean male
#> 4 Chagrian male
#> 5 Clawdite female
#> 6 Droid none
#> 7 Dug male
#> 8 Ewok male
#> 9 Geonosian male
#> 10 Gungan male
#> # … with 31 more rows
Created on 2022-04-25 by the reprex package (v2.0.1)
There are multiple options. You can use the following code:
unique(starwars[c("species", "sex")])
Output:
species sex
<chr> <chr>
1 Human male
2 Droid none
3 Human female
4 Wookiee male
5 Rodian male
6 Hutt hermaphroditic
7 Yoda's species male
8 Trandoshan male
9 Mon Calamari male
10 Ewok male
# … with 31 more rows

`dplyr::select` without reordering columns

I am looking for an easy, concise way to use dplyr::select without rearranging columns.
Consider this dataset:
library(tidyverse)
head(msleep)
#> # A tibble: 6 × 11
#> name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Cheetah Acin… carni Carn… lc 12.1 NA NA 11.9
#> 2 Owl mo… Aotus omni Prim… <NA> 17 1.8 NA 7
#> 3 Mounta… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6
#> 4 Greate… Blar… omni Sori… lc 14.9 2.3 0.133 9.1
#> 5 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
#> 6 Three-… Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6
#> # … with 2 more variables: brainwt <dbl>, bodywt <dbl>
If I select vore, genus and name, the resulting dataframe is arranged in the order in which the columns were provided.
msleep %>% select(vore, genus, name)
#> # A tibble: 83 × 3
#> vore genus name
#> <chr> <chr> <chr>
#> 1 carni Acinonyx Cheetah
#> 2 omni Aotus Owl monkey
#> 3 herbi Aplodontia Mountain beaver
#> 4 omni Blarina Greater short-tailed shrew
#> 5 herbi Bos Cow
#> 6 herbi Bradypus Three-toed sloth
#> 7 carni Callorhinus Northern fur seal
#> 8 <NA> Calomys Vesper mouse
#> 9 carni Canis Dog
#> 10 herbi Capreolus Roe deer
#> # … with 73 more rows
I would instead like to leave them in their default order: name, genus, then vore.
I have a solution (see below), but I do not like it because it is quite wordy, and not completely “tidyverse-esque”.
(I am teaching an intro to tidyverse course, and would like something that would not intimidate beginners.)
msleep %>%
select(all_of(names(msleep)[names(msleep) %in% c("vore", "genus", "name")]))
#> # A tibble: 83 × 3
#> name genus vore
#> <chr> <chr> <chr>
#> 1 Cheetah Acinonyx carni
#> 2 Owl monkey Aotus omni
#> 3 Mountain beaver Aplodontia herbi
#> 4 Greater short-tailed shrew Blarina omni
#> 5 Cow Bos herbi
#> 6 Three-toed sloth Bradypus herbi
#> 7 Northern fur seal Callorhinus carni
#> 8 Vesper mouse Calomys <NA>
#> 9 Dog Canis carni
#> 10 Roe deer Capreolus herbi
#> # … with 73 more rows
Is there such a thing? Thank you!
For context: In reality, we have a data frame with about 400 columns, from which we are selecting ~10-20 at a time to work with. The order of the columns in the original data frame is meaningful, but we don't want to have to labor over listing them in their correct order in the select statements. A very specific need, I'll admit.
Created on 2021-12-22 by the reprex package (v2.0.1)
We could use match with sort
library(dplyr)
msleep %>%
select(sort(match(c("vore", "genus", "name"), names(.))))
EDIT: Based on the OP's comments
Update:
In case of providing a vector we could do as akrun suggests in the comments:
nm1 <- c("vore", "genus", "name"); pattern <- str_c(nm1, collapse="|")
Original answer:
You could first define a string with the search items
and then use matches
pattern <- c("vore|genus|name")
select(msleep, matches(pattern))
name genus vore
<chr> <chr> <chr>
1 Cheetah Acinonyx carni
2 Owl monkey Aotus omni
3 Mountain beaver Aplodontia herbi
4 Greater short-tailed shrew Blarina omni
5 Cow Bos herbi
6 Three-toed sloth Bradypus herbi
7 Northern fur seal Callorhinus carni
8 Vesper mouse Calomys NA
9 Dog Canis carni
10 Roe deer Capreolus herbi
You can use the power of eval_select() to create a function to select and sort the columns.
library(dplyr)
select_in_order <- function(data, ...) {
ordered_cols <- sort(tidyselect::eval_select(expr(c(...)), data))
select(data, ordered_cols)
}
So now this will do what you are asking. The benefit is that it will be "full feature" to what you are used to being able to enter into a select() statement.
# library(ggplot2) # msleep is in ggplot2
msleep %>%
select_in_order(vore, genus, name)
# this will work as well
msleep %>%
select_in_order(starts_with("sleep"), vore, name:genus)
EDIT
As another option, simply use relocate() after your select() statement. This alternative approach accomplishes your end goal of keeping the columns in order in a way that is easy to understand by a beginner.
msleep %>%
select(vore, genus, name) %>%
relocate(any_of(names(msleep)))

R POS tagging and tokenizing in one go

I have a text as below.
Section <- c("If an infusion reaction occurs, interrupt the infusion.")
df <- data.frame(Section)
When I tokenize using tidytext and the code below,
AA <- df %>%
mutate(tokens = str_extract_all(df$Section, "([^\\s]+)"),
locations = str_locate_all(df$Section, "([^\\s]+)"),
locations = map(locations, as.data.frame)) %>%
select(-Section) %>%
unnest(tokens, locations)
It gives me the tokens, the start and end position. How do I obtain the POS tags while unnesting at the same time. Something as below (the POStags may not be correct in the image below)
You can use the package udpipe to get your POS data. Udpipe automatically tokenizes punctuation.
Section <- c("If an infusion reaction occurs, interrupt the infusion.")
df <- data.frame(Section, stringAsFactors = FALSE)
library(udpipe)
library(dplyr)
udmodel <- udpipe_download_model(language = "english")
udmodel <- udpipe_load_model(file = udmodel$file_model)
x <- udpipe_annotate(udmodel,
df$Section)
x <- as.data.frame(x)
x %>% select(token, upos)
token upos
1 If SCONJ
2 an DET
3 infusion NOUN
4 reaction NOUN
5 occurs NOUN
6 , PUNCT
7 interrupt VERB
8 the DET
9 infusion NOUN
10 . PUNCT
Now to combine this the result of a previous question you asked. I took one of the answers.
library(stringr)
library(purrr)
library(tidyr)
df %>% mutate(
tokens = str_extract_all(Section, "\\w+|[[:punct:]]"),
locations = str_locate_all(Section, "\\w+|[[:punct:]]"),
locations = map(locations, as.data.frame)) %>%
select(-Section) %>%
unnest(tokens, locations) %>%
mutate(POS = purrr::map_chr(tokens, function(x) as.data.frame(udpipe_annotate(udmodel, x = x, tokenizer = "vertical"))$upos))
tokens start end upos
1 If 1 2 SCONJ
2 an 4 5 DET
3 infusion 7 14 NOUN
4 reaction 16 23 NOUN
5 occurs 25 30 NOUN
6 , 31 31 PUNCT
7 interrupt 33 41 VERB
8 the 43 45 DET
9 infusion 47 54 NOUN
10 . 55 55 PUNCT
edit: better solution
But the best solution would be to start from udpipe and then do the rest. Note that I am using stringi instead of stringr package. stringr is based on stringi, but stringi has more options.
x <- udpipe_annotate(udmodel, x = df$Section)
x %>%
as_data_frame %>%
select(token, POSTag = upos) %>% # select needed columns
# add start/end locations
mutate(locations = map(token, function(x) data.frame(stringi::stri_locate(df$Section, fixed = x)))) %>%
unnest
# A tibble: 10 x 4
token POSTag start end
<chr> <chr> <int> <int>
1 If SCONJ 1 2
2 an DET 4 5
3 infusion NOUN 7 14
4 reaction NOUN 16 23
5 occurs NOUN 25 30
6 , PUNCT 31 31
7 interrupt VERB 33 41
8 the DET 43 45
9 infusion NOUN 7 14
10 . PUNCT 55 55
FYI. Since udpipe version 0.7 on CRAN, you can just do as follows.
library(udpipe)
x <- data.frame(doc_id = c("doc1", "doc2"),
text = c("If an infusion reaction occurs, interrupt the infusion.",
"Houston we have a problem"))
x <- udpipe(x, "english")
x
which gives you (notice the start/end as well as the token/upos/xpos which you are looking for):
doc_id paragraph_id sentence_id start end term_id token_id token lemma upos xpos feats head_token_id dep_rel deps misc
doc1 1 1 1 2 1 1 If if SCONJ IN <NA> 7 mark <NA> <NA>
doc1 1 1 4 5 2 2 an a DET DT Definite=Ind|PronType=Art 5 det <NA> <NA>
doc1 1 1 7 14 3 3 infusion infusion NOUN NN Number=Sing 4 compound <NA> <NA>
doc1 1 1 16 23 4 4 reaction reaction NOUN NN Number=Sing 5 compound <NA> <NA>
doc1 1 1 25 30 5 5 occurs occur NOUN NNS Number=Plur 7 nsubj <NA> SpaceAfter=No
doc1 1 1 31 31 6 6 , , PUNCT , <NA> 7 punct <NA> <NA>
doc1 1 1 33 41 7 7 interrupt interrupt VERB VB Mood=Imp|VerbForm=Fin 0 root <NA> <NA>
doc1 1 1 43 45 8 8 the the DET DT Definite=Def|PronType=Art 9 det <NA> <NA>
doc1 1 1 47 54 9 9 infusion infusion NOUN NN Number=Sing 7 obj <NA> SpaceAfter=No
doc1 1 1 55 55 10 10 . . PUNCT . <NA> 7 punct <NA> SpacesAfter=\\n
doc2 1 1 1 7 1 1 Houston Houston PROPN NNP Number=Sing 0 root <NA> <NA>
doc2 1 1 9 10 2 2 we we PRON PRP Case=Nom|Number=Plur|Person=1|PronType=Prs 3 nsubj <NA> <NA>
doc2 1 1 12 15 3 3 have have VERB VBP Mood=Ind|Tense=Pres|VerbForm=Fin 1 parataxis <NA> <NA>
doc2 1 1 17 17 4 4 a a DET DT Definite=Ind|PronType=Art 5 det <NA> <NA>
doc2 1 1 19 25 5 5 problem problem NOUN NN Number=Sing 3 obj <NA> SpacesAfter=\\n
Like the previous answerer, I think that udpipe is likely the easiest way to go for POS tagging. My favorite way to interact with udpipe is via the cleanNLP package. After the initializing function is called, it is just two lines of code to get the udpipe output.
library(tidyverse)
library(cleanNLP)
cnlp_init_udpipe()
#> Loading required namespace: udpipe
df <- data_frame(id = 1,
text = c("If an infusion reaction occurs, interrupt the infusion."))
cnlp_annotate(df) %>%
cnlp_get_tif()
#> # A tibble: 10 x 19
#> id sid tid word lemma upos pos cid pid definite mood
#> <chr> <int> <int> <chr> <chr> <chr> <chr> <dbl> <int> <chr> <chr>
#> 1 1 1 1 If if SCONJ IN 0 1 <NA> <NA>
#> 2 1 1 2 an a DET DT 3 1 Ind <NA>
#> 3 1 1 3 infu… infu… NOUN NN 6 1 <NA> <NA>
#> 4 1 1 4 reac… reac… NOUN NN 15 1 <NA> <NA>
#> 5 1 1 5 occu… occur NOUN NNS 24 1 <NA> <NA>
#> 6 1 1 6 , , PUNCT , 30 1 <NA> <NA>
#> 7 1 1 7 inte… inte… VERB VB 32 1 <NA> Imp
#> 8 1 1 8 the the DET DT 42 1 Def <NA>
#> 9 1 1 9 infu… infu… NOUN NN 46 1 <NA> <NA>
#> 10 1 1 10 . . PUNCT . 54 1 <NA> <NA>
#> # ... with 8 more variables: number <chr>, pron_type <chr>,
#> # verb_form <chr>, source <int>, relation <chr>, word_source <chr>,
#> # lemma_source <chr>, spaces <dbl>
Created on 2018-08-15 by the reprex package (v0.2.0).

compute deflation factor to index wages, by CPI, in panel data

I'm struggling to understand exactly how to compute a deflation factor for wages in a panel based on inflation.
I've teh R example below to help me illustrate the issue.
In Wooldridge (2009:452) Introductory Econometrics, 5th ed., he creates a deflation factor by dividing 107.6 by 65.2, i.e. 107.6/65.2 ≈ 1.65, but I can't figure out to to apply this to my own panel data. Wooldridge only mentions the deflation factor in passing.
Say I have a mini panel with two people, Jane and Tom, staring from 2006/2009 and running until 2015 with their yearly wage,
# install.packages(c("dplyr"), dependencies = TRUE)
library(dplyr)
set.seed(2)
tbl <- tibble(id = rep(c('Jane', 'Tom'), c(7, 10)),
yr = c(2009:2015, 2006:2015),
wg = c(rnorm(7, mean=5.1*10^4, sd=9), rnorm(10, 4*10^4, 12))
); tbl
#> A tibble: 17 x 3
#> id yr wg
#> <chr> <int> <dbl>
#> 1 Jane 2009 50991.93
#> 2 Jane 2010 51001.66
#> 3 Jane 2011 51014.29
#> 4 Jane 2012 50989.83
#> 5 Jane 2013 50999.28
#> 6 Jane 2014 51001.19
#> 7 Jane 2015 51006.37
#> 8 Tom 2006 39997.12
#> 9 Tom 2007 40023.81
#> 10 Tom 2008 39998.33
#> 11 Tom 2009 40005.01
#> 12 Tom 2010 40011.78
#> 13 Tom 2011 39995.29
#> 14 Tom 2012 39987.52
#> 15 Tom 2013 40021.39
#> 16 Tom 2014 39972.27
#> 17 Tom 2015 40010.54
I now get the consumer price index (CPI) (using this answer)
# install.packages(c("Quandl"), dependencies = TRUE)
CPI00to16 <- Quandl::Quandl("FRED/CPIAUCSL", collapse="annual",
start_date="2000-01-01", end_date="2016-01-01")
as_tibble(CPI00to16)
#> # A tibble: 17 x 2
#> Date Value
#> <date> <dbl>
#> 1 2016-12-31 238.106
#> 2 2015-12-31 237.846
#> 3 2014-12-31 236.290
#> 4 2013-12-31 234.723
#> 5 2012-12-31 231.221
#> 6 2011-12-31 227.223
#> 7 2010-12-31 220.472
#> 8 2009-12-31 217.347
#> 9 2008-12-31 211.398
#> 10 2007-12-31 211.445
#> 11 2006-12-31 203.100
#> 12 2005-12-31 198.100
#> 13 2004-12-31 191.700
#> 14 2003-12-31 185.500
#> 15 2002-12-31 181.800
#> 16 2001-12-31 177.400
#> 17 2000-12-31 174.600
my question is how do I deflate Jane and Tom's wages cf. Wooldridge 2009 selecting 2015 as the baseline year?
update; following MrSmithGoesToWashington’s comment below.
CPI00to16$yr <- as.numeric(format(CPI00to16$Date,'%Y'))
CPI00to16 <- mutate(CPI00to16, deflation_factor = CPI00to16[2,2]/Value)
df <- tbl %>% inner_join(as_tibble(CPI00to16[,3:4]), by = "yr")
df <- mutate(df, wg_defl = deflation_factor*wg, wg_diff = wg_defl-wg)
df
#> # A tibble: 17 x 6
#> id yr wg deflation_factor wg_defl wg_diff
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Jane 2009 50991.93 1.094315 55801.21 4809.2844
#> 2 Jane 2010 51001.66 1.078804 55020.78 4019.1176
#> 3 Jane 2011 51014.29 1.046751 53399.28 2384.9910
#> 4 Jane 2012 50989.83 1.028652 52450.80 1460.9728
#> 5 Jane 2013 50999.28 1.013305 51677.83 678.5477
#> 6 Jane 2014 51001.19 1.006585 51337.04 335.8494
#> 7 Jane 2015 51006.37 1.000000 51006.37 0.0000
#> 8 Tom 2006 39997.12 1.171078 46839.76 6842.6394
#> 9 Tom 2007 40023.81 1.124860 45021.18 4997.3691
#> 10 Tom 2008 39998.33 1.125110 45002.53 5004.1909
#> 11 Tom 2009 40005.01 1.094315 43778.07 3773.0575
#> 12 Tom 2010 40011.78 1.078804 43164.86 3153.0747
#> 13 Tom 2011 39995.29 1.046751 41865.12 1869.8369
#> 14 Tom 2012 39987.52 1.028652 41133.26 1145.7322
#> 15 Tom 2013 40021.39 1.013305 40553.87 532.4863
#> 16 Tom 2014 39972.27 1.006585 40235.49 263.2225
#> 17 Tom 2015 40010.54 1.000000 40010.54 0.0000

Resources