How to ensure authors have a unique name - r

I extracted publication data from Microsoft Academic. Unfortunately, some authors have different versions of their names, e.g.
names <- data.frame(publication_id = c("1", "2", "3", "4", "5"), author = c("D Smith", "D J Smith", "David Smith", "Enrique Salvador", "E J Salvador"), affiliation = c("UCT", "UCT", "UCT", "UTAS", "UTAS"))
I want authors to have a unique name. Using the above example, I want to get a result that looks like this:
names <- data.frame(publication_id = c("1", "2", "3", "4", "5"), author = c("D Smith", "D Smith", "D Smith", "E Salvador", "E Salvador"), affiliation = c("UCT", "UCT", "UCT", "UTAS", "UTAS"))
I am dealing with 1000s of author names so using something like:
mutate(author = case_when(author == "D J Smith" ~ "D Smith", author == "David Smith" ~ "D Smith", ...))
is impractical. I would appreciate any ideas/solutions. Thanks in advance.

I think the approach most likely to work is to combine the first letter of the string with the last name:
names <- data.frame(publication_id = c("1", "2", "3", "4", "5"), author = c("D Smith", "D J Smith", "David Smith", "Enrique Salvador", "E J Salvador"), affiliation = c("UCT", "UCT", "UCT", "UTAS", "UTAS"))
names$new_author <- paste0(substr(names$author, 1, 1),
" ",
gsub('.*\\s(\\w*)$', '\\1', names$author))
names
# publication_id author affiliation new_author
# 1 1 D Smith UCT D Smith
# 2 2 D J Smith UCT D Smith
# 3 3 David Smith UCT D Smith
# 4 4 Enrique Salvador UTAS E Salvador
# 5 5 E J Salvador UTAS E Salvador
That said, it assumes that there is only one D Smith at UCT, which may be an untenable assumption.
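One way to gauge how risky that assumption is: count how many distinct full first names collapse into each initial-plus-surname key within an affiliation. This is a minimal sketch in base R. The data frame `authors`, the extra author "Daniel Smith", and the names `first`, `full`, `variants` and `ambiguous` are all made up for illustration.

```r
# Hypothetical data: "Daniel Smith" is added so that two people share one key
authors <- data.frame(
  author = c("D Smith", "D J Smith", "David Smith", "Daniel Smith", "E J Salvador"),
  affiliation = c("UCT", "UCT", "UCT", "UCT", "UTAS")
)

# Same key as above: first initial plus the last word of the name
authors$new_author <- paste0(substr(authors$author, 1, 1), " ",
                             gsub('.*\\s(\\w*)$', '\\1', authors$author))

# First word of each name, and whether it is a full name rather than an initial
first <- sub("\\s.*$", "", authors$author)
full  <- nchar(first) > 1

# Number of distinct full first names per key and affiliation
key <- paste(authors$new_author, authors$affiliation, sep = " @ ")
variants <- tapply(first[full], key[full], function(x) length(unique(x)))

# Keys with more than one distinct full first name probably cover several people
ambiguous <- names(variants)[variants > 1]
ambiguous
```

Keys flagged this way are candidates for manual review. Names that appear only as initials cannot be checked by this approach.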

Related

Running a string against multiple match dataframes

I have a dataset of text strings that look something like this:
strings <- structure(list(string = c("Jennifer Rae Hancock Brown", "Lisa Smith Houston Blogger",
"Tina Fay Las Cruces", "\t\nJamie Tucker Style Expert", "Jessica Wright Htx Satx",
"Julie Green Lifestyle Blogger", "Mike S Thomas Football Player",
"Tiny Fitness Houston Studio")), class = "data.frame", row.names = c(NA,
-8L))
I am trying to evaluate matches in those strings against two different datasets called firstname and lastname that look as such:
firstname <- structure(list(firstnames = c("Jennifer", "Lisa", "Tina", "Jamie",
"Jessica", "Julie", "Mike", "George")), class = "data.frame", row.names = c(NA,
-8L))
lastname <- structure(list(lastnames = c("Hancock", "Smith", "Houston", "Fay",
"Tucker", "Wright", "Green", "Thomas")), class = "data.frame", row.names = c(NA,
-8L))
The first thing I would like to do is remove everything after the first three words in each string, so "Jennifer Rae Hancock Brown" would become "Jennifer Rae Hancock" and "Lisa Smith Houston Blogger" would become "Lisa Smith Houston".
After that, I want to evaluate the first word of each string to see if it matches anything in the firstname dataframe. If it does match, the result goes into a new column in the final table called firstname. If it doesn't match, the result is simply "N/A".
After that, I'd like to then evaluate the remaining words against the lastname dataframe. There can be multiple matches (As seen in the "Lisa Smith Houston" example) and if that's the case, both results will be stored in the final dataframe.
The final dataframe should look like this:
final <- structure(list(string = c("Jennifer Rae Hancock Brown", "Lisa Smith Houston Blogger",
"Lisa Smith Houston Blogger", "Tina Fay Las Cruces", "\t\nJamie Tucker Style Expert",
"Jessica Wright Htx Satx", "Julie Green Lifestyle Blogger", "Mike S Thomas Football Player",
"Tiny George Fitness Houston Studio"), firstname = c("Jennifer",
"Lisa", "Lisa", "Tina", "Jamie", "Jessica", "Julie", "Mike",
"N/A"), lastname = c("Hancock", "Smith", "Houston", "Fay", "Tucker",
"Wright", "Green", "Thomas", "N/A")), class = "data.frame", row.names = c(NA,
-9L))
What would be the most effective way to go about doing this?
We can use str_extract_all on 'string2' (the first three words of each string), with the firstnames and lastnames vectors each collapsed into a single pattern using | (OR) as the delimiter. This returns list columns of matches, which unnest then expands into one row per match:
library(dplyr)
library(stringr)
library(tidyr)
strings %>%
  # keep only the first three whitespace-separated words
  mutate(string2 = str_extract(trimws(string), "^\\S+\\s+\\S+\\s+\\S+"),
         firstname = str_extract_all(string2,
                                     str_c(firstname$firstnames, collapse = "|")),
         lastname = str_extract_all(string2,
                                    str_c(lastname$lastnames, collapse = "|"))) %>%
  # expand the list columns to one row per match, keeping rows with no match
  unnest(where(is.list), keep_empty = TRUE) %>%
  select(-string2) %>%
  # set the last name to NA when no first name was found
  mutate(lastname = case_when(complete.cases(firstname) ~ lastname))
Output:
# A tibble: 9 × 3
string firstname lastname
<chr> <chr> <chr>
1 "Jennifer Rae Hancock Brown" Jennifer Hancock
2 "Lisa Smith Houston Blogger" Lisa Smith
3 "Lisa Smith Houston Blogger" Lisa Houston
4 "Tina Fay Las Cruces" Tina Fay
5 "\t\nJamie Tucker Style Expert" Jamie Tucker
6 "Jessica Wright Htx Satx" Jessica Wright
7 "Julie Green Lifestyle Blogger" Julie Green
8 "Mike S Thomas Football Player" Mike Thomas
9 "Tiny Fitness Houston Studio" <NA> <NA>
The OP's expected output:
> final
string firstname lastname
1 Jennifer Rae Hancock Brown Jennifer Hancock
2 Lisa Smith Houston Blogger Lisa Smith
3 Lisa Smith Houston Blogger Lisa Houston
4 Tina Fay Las Cruces Tina Fay
5 \t\nJamie Tucker Style Expert Jamie Tucker
6 Jessica Wright Htx Satx Jessica Wright
7 Julie Green Lifestyle Blogger Julie Green
8 Mike S Thomas Football Player Mike Thomas
9 Tiny George Fitness Houston Studio N/A N/A
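The first two steps of the question (keep the first three words, then test the first word against the first-name list) can also be sketched in base R. This is only an illustration on a shortened copy of the data. `words`, `first_three` and `first_word` are made-up names, and the last-name step is left to the pipeline above.

```r
# Shortened, illustrative copy of the data from the question
strings <- data.frame(string = c("Jennifer Rae Hancock Brown",
                                 "\t\nJamie Tucker Style Expert",
                                 "Tiny Fitness Houston Studio"))
firstnames <- c("Jennifer", "Lisa", "Tina", "Jamie")

# Trim leading/trailing whitespace, then split on runs of whitespace
words <- strsplit(trimws(strings$string), "\\s+")

# Keep at most the first three words of each string
first_three <- vapply(words, function(w) paste(head(w, 3), collapse = " "),
                      character(1))

# The first word counts as a first name only if it is in the lookup vector
first_word <- vapply(words, `[`, character(1), 1)
strings$firstname <- ifelse(first_word %in% firstnames, first_word, NA_character_)
strings
```

Unlike the regex alternation, `%in%` matches whole words only, so a lookup name that happens to be a substring of a longer word is not picked up.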

How can I find the most common sequences in my data using R?

I'm trying to figure out how I can use the rollapply function (from the zoo package) to find the most common sequences of strings within a dataset, but I also need to group by certain variables (e.g. date, row, etc.).
Before I go any further, it's worth noting that this query builds on a question that I previously posted here : How can I find most common sequences (of strings) in my data using Tableau?
The solution offered there works really well, but I now want to apply it to a different dataset which provides some new challenges! Here's an example of the data that I'm working with in this new dataset:
structure(list(Title = c("Dragons' Den", "One Hot Summer", "Keeping Faith",
"Cuckoo", "Match of the Day", "Sportscene", "Sportscene", "The Irish League Show",
"Match of the Day", "EastEnders", "Dragons' Den", "Fake or Fortune?",
"Asian Provocateur", "In The Flesh", "Two Pints of Lager and a Packet of Crisps",
"Travels in Trumpland with Ed Balls", "Hidden", "Train Surfing Wars: A Matter of Life and Death",
"Bollywood: The World's Biggest Film Industry", "One Hot Summer",
"Asian Provocateur", "In The Flesh", "Two Pints of Lager and a Packet of Crisps",
"Travels in Trumpland with Ed Balls", "EastEnders", "Match of the Day",
"Dragons' Den", "The Next Step", "Doctor Who Series 11 Trailer",
"Doctor Who", "Doctor Who", "Doctor Who", "Picnic at Hanging Rock",
"Sylvia", "Keeping Faith", "Cardinal: Blackfly Season", "Picnic at Hanging Rock",
"Age Before Beauty", "One Hot Summer", "Stewart Lee's Comedy Vehicle",
"Asian Provocateur", "In The Flesh", "Two Pints of Lager and a Packet of Crisps",
"Travels in Trumpland with Ed Balls", "EastEnders", "Age Before Beauty",
"Holby City", "Who Do You Think You Are?", "Louis Theroux: Dark States",
"Louis Theroux: Dark States", "Louis Theroux", "Louis Theroux's Weird Weekends",
"Picnic at Hanging Rock", "Sylvia", "Keeping Faith", "Cardinal: Blackfly Season"
), Programme_Genre = c("Entertainment", "Documentary", "Drama",
"New SeriesComedy", "Sport", "Sport", "Sport", "Sport", "Sport",
"Drama", "Entertainment", "Documentary", "Comedy", "Drama", "Comedy",
"Documentary", "Crime Drama", "Documentary", "Documentary", "Documentary",
"Comedy", "Drama", "Comedy", "Documentary", "Drama", "Sport",
"Entertainment", "CBBC", "Sci-Fi", "Sci-Fi", "Sci-Fi", "Sci-Fi",
"Drama", "Film", "Drama", "Crime Drama", "On Now", "Drama", "Documentary",
"Comedy", "Comedy", "Drama", "Comedy", "Documentary", "Drama",
"Drama", "Drama", "History", "Documentary", "Documentary", "Documentary",
"Archive", "Drama", "Film", "Drama", "Crime Drama"), Programme_Category = c("Featured",
"Featured", "Featured", "Featured", "This Weekend's Football",
"This Weekend's Football", "This Weekend's Football", "This Weekend's Football",
"Most Popular", "Most Popular", "Most Popular", "Most Popular",
"Box Sets", "Box Sets", "Box Sets", "Box Sets", "Featured", "Featured",
"Featured", "Featured", "Box Sets", "Box Sets", "Box Sets", "Box Sets",
"Most Popular", "Most Popular", "Most Popular", "Most Popular",
"Doctor Who S1-S10", "Doctor Who S1-S10", "Doctor Who S1-S10",
"Doctor Who S1-S10", "Drama", "Drama", "Drama", "Drama", "Featured",
"Featured", "Featured", "Featured", "Box Sets", "Box Sets", "Box Sets",
"Box Sets", "Most Popular", "Most Popular", "Most Popular", "Most Popular",
"Louis Theroux", "Louis Theroux", "Louis Theroux", "Louis Theroux",
"Drama", "Drama", "Drama", "Drama"), date = c("13/08/2018", "13/08/2018",
"13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018",
"13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018",
"13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", "14/08/2018",
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018",
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018",
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018",
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "15/08/2018",
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018",
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018",
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018",
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018"), column = c("1",
"2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "2",
"3", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3",
"4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4",
"1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1",
"2", "3", "4"), row = c("1", "1", "1", "1", "2", "2", "2", "2",
"3", "3", "3", "3", "4", "4", "4", "4", "1", "1", "1", "1", "2",
"2", "2", "2", "3", "3", "3", "3", "4", "4", "4", "4", "5", "5",
"5", "5", "1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3",
"3", "4", "4", "4", "4", "5", "5", "5", "5")), row.names = c(NA,
-56L), class = "data.frame")
Apologies but I'm not quite sure about best practice for sharing data. Hope the above works. It should look something like this:
  Title             Programme_Genre    Programme_Category  date        column  row
1 Dragons Den       Entertainment      Featured            13/08/2018  1       1
2 One Hot Summer    Documentary        Featured            13/08/2018  2       1
3 Keeping Faith     Drama              Featured            13/08/2018  3       1
4 Cuckoo            New Series Comedy  Featured            13/08/2018  4       1
5 Match of the Day  Sport              This Weekends...    13/08/2018  1       2
6 Sportscene        Sport              This Weekends...    13/08/2018  2       2
What I want to do is use the rollapply function in a similar way to what was suggested in my previous question (see link above), but only looking for sequences that appear on the same date and across a certain range of columns. For example, I want to know what the most common sequence of genres ("Programme_Genre") is, but I only want the rollapply function to do this across columns 1-4 for each row on each date. I'm sure I'm not explaining this very well (I don't come from a data science background, in case you hadn't guessed), so I'm more than happy to elaborate if necessary. Thanks in advance!
With tidyverse, zoo and lubridate, try:
library(tidyverse)
library(zoo)
library(lubridate)
df %>%
  mutate(date = lubridate::dmy(date)) %>% # Optional. Parses date as Date class, which makes sorting easier.
  filter(as.integer(column) <= 4) %>% # Step 1. Exclude observations with `column` values above 4 (`column` is stored as character).
  group_split(row, date) %>% # Step 2. Split the DF into smaller DFs, one per row and date group.
  # Step 3 (below). Apply the solution to the previous question to each group, get a DF, and attach the group's date and row to each observation.
  map_df(.x = .,
         .f = ~(rollapply(data = .x$Programme_Genre, 3, c) %>%
                  as_tibble() %>%
                  mutate(date = unique(.x$date), row = unique(.x$row)))) %>%
  group_by_all() %>%
  tally() %>%
  arrange(date, row, n)
# A tibble: 26 x 6
# Groups: V1, V2, V3, date [26]
V1 V2 V3 date row n
<chr> <chr> <chr> <date> <chr> <int>
1 Documentary Drama New SeriesComedy 2018-08-13 1 1
2 Entertainment Documentary Drama 2018-08-13 1 1
3 Sport Sport Sport 2018-08-13 2 2
4 Drama Entertainment Documentary 2018-08-13 3 1
5 Sport Drama Entertainment 2018-08-13 3 1
6 Comedy Drama Comedy 2018-08-13 4 1
7 Drama Comedy Documentary 2018-08-13 4 1
8 Crime Drama Documentary Documentary 2018-08-14 1 1
9 Documentary Documentary Documentary 2018-08-14 1 1
10 Comedy Drama Comedy 2018-08-14 2 1
# ... with 16 more rows
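To make the windowing step concrete, here is a base R sketch of what a width-3 rolling window does within one group, and how the windows can then be counted across groups. It uses a small made-up data frame and a hypothetical helper `windows()`. It is an illustration of the idea, not a replacement for the rollapply pipeline above.

```r
# Made-up example: two rows of a page on one date, four columns each
df <- data.frame(
  genre = c("Sport", "Sport", "Sport", "Sport",
            "Drama", "Comedy", "Drama", "Comedy"),
  date  = "13/08/2018",
  row   = rep(c("1", "2"), each = 4)
)

# Hypothetical helper: every run of k consecutive values, pasted into one label
windows <- function(x, k = 3) {
  if (length(x) < k) return(character(0))
  starts <- seq_len(length(x) - k + 1)
  vapply(starts, function(i) paste(x[i:(i + k - 1)], collapse = " > "),
         character(1))
}

# Each date/row group is windowed separately, so no window crosses a boundary
groups <- split(df$genre, list(df$date, df$row), drop = TRUE)
counts <- sort(table(unlist(lapply(groups, windows))), decreasing = TRUE)
counts
```

Splitting before windowing is the design choice that matters here: a single rolling window over the whole column would also produce sequences that straddle two rows or two dates.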
Here too, I suggest a strategy similar to the one in the linked question.
First, load the libraries:
library(tidyverse)
library(runner)
The strategy for, say, n = 3:
n <- 3
data %>%
  group_by(date) %>%
  mutate(l_seq = runner(x = Programme_Genre,
                        k = n,
                        function(x) ifelse(length(x) == n, list(x), list(rep(NA, n))))) %>%
  ungroup() %>%
  group_split(date) %>%
  map_df(., ~ map_df(.x$l_seq, ~setNames(.x, paste0('Col', seq_len(n)))) %>%
           mutate(date = .x$date) %>%
           na.omit() %>%
           group_by_all() %>%
           summarise(m = n(), .groups = 'drop') %>%
           filter(m == max(m) & m > 1))
# A tibble: 2 x 5
Col1 Col2 Col3 date m
<chr> <chr> <chr> <chr> <int>
1 Sport Sport Sport 13/08/2018 3
2 Sci-Fi Sci-Fi Sci-Fi 14/08/2018 2
Here m is the number of times the most frequent sequence occurs on that particular date.
With n = 4, the same code gives the following result:
# A tibble: 1 x 6
Col1 Col2 Col3 Col4 date m
<chr> <chr> <chr> <chr> <chr> <int>
1 Sport Sport Sport Sport 13/08/2018 2
For n = 5, no sequence occurs more than once in the sample.

How to convert character matrix to numeric, keeping first column as row name: R

I have the matrix below, and the apply call changes the row names to numbers.
This is the matrix:
treatmenta treatmentb
John Smith NA " 2"
John Doe "16" "11"
Mary Johnson " 3" " 1"
and this is the code: as.matrix(apply(y, 2, as.numeric))
The result is this, but I want the row names to be the people's names:
treatmenta treatmentb
[1,] NA 2
[2,] 16 11
[3,] 3 1
Converting to data.table also does not work. How do I do this?
Here is code to reproduce data:
name <- c("John Smith", "John Doe", "Mary Johnson")
treatmenta <- c("NA", "16", "3")
treatmentb <- c("2", "11", "1")
y <- data.frame(name, treatmenta, treatmentb)
rownames(y) <- y[,1]
y[,1] <- NULL
We can do
y <- `dimnames<-`(`dim<-`(as.numeric(y), dim(y)), dimnames(y))
y
# treatmenta treatmentb
#John Smith NA 2
#John Doe 16 11
#Mary Johnson 3 1
Or a compact option is
class(y) <- "numeric"
Data:
y <- structure(c(NA, "16", " 3", " 2", "11", " 1"), .Dim = c(3L, 2L
), .Dimnames = list(c("John Smith", "John Doe", "Mary Johnson"
), c("treatmenta", "treatmentb")))
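Another option along the same lines is `storage.mode<-`, which changes the underlying type while leaving attributes such as dim and dimnames in place. A small sketch on the same matrix:

```r
y <- structure(c(NA, "16", " 3", " 2", "11", " 1"), .Dim = c(3L, 2L),
               .Dimnames = list(c("John Smith", "John Doe", "Mary Johnson"),
                                c("treatmenta", "treatmentb")))

# Convert character to double in place; row and column names are kept
storage.mode(y) <- "double"
y
```

The padded strings such as " 3" convert cleanly because as.numeric-style coercion ignores surrounding whitespace.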
You are going from a more general data structure (a data frame) to a matrix (a vector with a dim attribute). A matrix can hold only one type, so as.matrix, or any base method that converts your data to a matrix, coerces every column to a common type: either everything becomes character, or everything becomes numeric with the name column turned into NAs (depending on how you call as.matrix).
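A quick way to see that coercion is to call as.matrix on a data frame that mixes a character column with a numeric one. A minimal sketch with made-up values (`df`, `m` and `m2` are illustrative names):

```r
df <- data.frame(name = c("John Smith", "John Doe"), treatmenta = c(NA, 16))

# A matrix holds a single type, so the numeric column is turned into text
m <- as.matrix(df)
typeof(m)

# Moving the names into the row names first leaves an all-numeric matrix
rownames(df) <- df$name
m2 <- as.matrix(df["treatmenta"])
typeof(m2)
```

This is also why keeping the names as row names, as the question does, is the step that allows the values themselves to stay numeric.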
Having said that, if for some reason you still need a matrix, build it from numeric vectors and set the row names afterwards, which is easier to read:
treatmenta <- c("1", "16", "3")
treatmentb <- c("2", "11", "1")
# convert each column to numeric and bind the columns into a matrix
y <- cbind(treatmenta = as.numeric(treatmenta),
           treatmentb = as.numeric(treatmentb))
# y is now a numeric matrix, not a data frame
#> is.matrix(y)
#[1] TRUE
name <- c("John Smith", "John Doe", "Mary Johnson")
rownames(y) <- name
# treatmenta treatmentb
# John Smith 1 2
# John Doe 16 11
# Mary Johnson 3 1

Tidy data Melt and Cast

In Wickham's Tidy Data paper (PDF), there is an example of going from messy to tidy data.
I wonder where the code is.
For example, what code is used to go from
Table 1: Typical presentation dataset.
to
Table 3: The same data as in Table 1 but with variables in columns and observations in rows.
Perhaps melt or cast, but from http://www.statmethods.net/management/reshape.html I can't see how.
(Note to self: Need it for GDPpercapita...)
The answer depends on the structure of your data. In the paper you linked to, Hadley was writing about the "reshape" and "reshape2" packages.
It's ambiguous what the data structure is in "Table 1". Judging by the description, it would sound like a matrix with named dimnames (like I show in mymat). In that case, a simple melt would work:
library(reshape2)
melt(mymat)
# Var1 Var2 value
# 1 John Smith treatmenta —
# 2 Jane Doe treatmenta 16
# 3 Mary Johnson treatmenta 3
# 4 John Smith treatmentb 2
# 5 Jane Doe treatmentb 11
# 6 Mary Johnson treatmentb 1
If it were not a matrix, but a data.frame with row.names, you can still use the matrix method by using something like melt(as.matrix(mymat)).
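For a numeric matrix, base R can produce the same long layout without any package: as.data.frame(as.table(x)) returns one row per cell. This is a hedged sketch that uses a numeric copy of the sample matrix with NA in place of the dash. The names `nummat` and `long` are made up here.

```r
nummat <- matrix(c(NA, 16, 3, 2, 11, 1), nrow = 3,
                 dimnames = list(c("John Smith", "Jane Doe", "Mary Johnson"),
                                 c("treatmenta", "treatmentb")))

# One row per (row name, column name) pair; the cell value goes into `value`
long <- as.data.frame(as.table(nummat), responseName = "value")
long
```

The result has the columns Var1, Var2 and value, in the same column-major order that melt uses for a matrix.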
If, on the other hand, the "names" are a column in a data.frame (as they are in the "tidyr" vignette), you need to specify either the id.vars or the measure.vars so that melt knows how to treat the columns:
melt(mydf, id.vars = "name")
# name variable value
# 1 John Smith treatmenta —
# 2 Jane Doe treatmenta 16
# 3 Mary Johnson treatmenta 3
# 4 John Smith treatmentb 2
# 5 Jane Doe treatmentb 11
# 6 Mary Johnson treatmentb 1
The new kid on the block is "tidyr". The "tidyr" package works with data.frames because it is often used in conjunction with dplyr. I won't reproduce the code for "tidyr" here, because that is sufficiently covered in the vignette.
Sample data:
mymat <- structure(c("—", "16", "3", " 2", "11", " 1"), .Dim = c(3L,
2L), .Dimnames = list(c("John Smith", "Jane Doe", "Mary Johnson"
), c("treatmenta", "treatmentb")))
mydf <- structure(list(name = structure(c(2L, 1L, 3L), .Label = c("Jane Doe",
"John Smith", "Mary Johnson"), class = "factor"), treatmenta = c("—",
"16", "3"), treatmentb = c(2L, 11L, 1L)), .Names = c("name",
"treatmenta", "treatmentb"), row.names = c(NA, 3L), class = "data.frame")

R reshape cast error

I am trying to follow this example for the reshape package but am getting an error:
smithsm <- melt(smiths)
smithsm
subject variable value
1 John Smith time 1.00
2 Mary Smith time 1.00
3 John Smith age 33.00
4 Mary Smith age NA
5 John Smith weight 90.00
6 Mary Smith weight NA
7 John Smith height 1.87
8 Mary Smith height 1.54
cast(smithsm, time + subject ~ variable)
This gives the error "Error: Casting formula contains variables not found in molten data: time". Does anyone know what is causing this error? The above is taken word for word from an example.
Thanks!
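The smithsm dataset doesn't have a time column: after melt, time is one of the values in the variable column, so it cannot be used on the left-hand side of the formula. It is not clear what the expected wide form is. Perhaps this helps: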
library(reshape2)
dcast(smithsm, subject~variable, value.var='value')
# subject age height time weight
#1 John Smith 33 1.87 1 90
#2 Mary Smith NA 1.54 1 NA
Data:
smithsm <- structure(list(subject = c("John Smith", "Mary Smith", "John Smith",
"Mary Smith", "John Smith", "Mary Smith", "John Smith", "Mary Smith"
), variable = c("time", "time", "age", "age", "weight", "weight",
"height", "height"), value = c(1, 1, 33, NA, 90, NA, 1.87, 1.54
)), .Names = c("subject", "variable", "value"), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8"))
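If installing reshape2 is not an option, base R's reshape() can produce a comparable wide table from the same molten data. This is a sketch under the assumption that one row per subject is the desired result. The name `wide` is illustrative, and the data frame is rebuilt here so the block runs on its own.

```r
smithsm <- data.frame(
  subject  = rep(c("John Smith", "Mary Smith"), times = 4),
  variable = rep(c("time", "age", "weight", "height"), each = 2),
  value    = c(1, 1, 33, NA, 90, NA, 1.87, 1.54)
)

# One row per subject, one column per distinct entry of `variable`
wide <- reshape(smithsm, idvar = "subject", timevar = "variable",
                direction = "wide")
wide
```

The value columns come out prefixed (value.time, value.age, and so on) and in order of first appearance, rather than alphabetically as in the dcast output.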

Resources