Using R to determine author position in journals?

I'm looking to analyze the order of authors in academic papers, and have a dataset of journals, authors, publication titles, publication dates, etc. that I'm working with. The data comes with each publication title as a row, and the author(s) of the piece listed in a semi-colon-delimited list. For example:
authors, pubtitle, title, date
Name 1; Name 2; Name 3, Journal Title, Article Title, 2018
Name 1; Name 2, Journal Title, Article Title, 2019
Name 1; Name 2; Name 3; Name 4; Name 5, Journal Title, Article Title, 2018
I've come up with a pretty inefficient way to determine author order, but I'm wondering about suggestions to improve this. Right now, the general workflow looks like this:
library(dplyr)
library(tidyr)
library(stringr)

data_name_listed <- readxl::read_xlsx("data-raw/data.xlsx")
data_name_listed <- data_name_listed %>%
  rename(author = "Author",
         title = "Title",
         pubtitle = "Publication Title",
         publisher = "Publisher",
         date = "Date")
# Select just the author column
data_name_order <- data_name_listed %>% select(author)
data_name_order$author <- str_trim(data_name_order$author)
# Separate the names into columns according to the order in which they
# appear in the semicolon-delimited list. This is really inelegant.
data_name_order <- data_name_order %>%
  separate(col = author, into = as.character(1:35), sep = ";")
# Gather the data into a tidy df
data_name_order <- data_name_order %>%
  gather(position, name)
# Strip a trailing initial from names (e.g. "Smith J." -> "Smith")
data_name_order$name <- gsub("(.*)\\s+[A-Z]\\.?$", "\\1", data_name_order$name)
# Get rid of missing data
data_name_order <- data_name_order %>% drop_na()
# Convert position to numeric
data_name_order$position <- as.numeric(data_name_order$position)
# Ensure no whitespace
data_name_order$name <- str_trim(data_name_order$name)
# Then merge this data with tidy journal data
# ... code ...
In particular, the separate() call is messy, even though it seems to achieve what I hoped it would. I'd love any advice on making this cleaner and more reproducible/applicable to other datasets. Thanks!

Here's a suggestion without separate:
library(dplyr)
library(tidyr)
x %>%
  select(authors) %>%
  transmute(
    id = row_number(),
    author = strsplit(authors, ";")
  ) %>%
  unnest(author) %>%
  group_by(id) %>%
  mutate(
    position = row_number(),
    author = trimws(author)
  ) %>%
  ungroup()
# # A tibble: 10 x 3
# id author position
# <int> <chr> <int>
# 1 1 Name 1 1
# 2 1 Name 2 2
# 3 1 Name 3 3
# 4 2 Name 1 1
# 5 2 Name 2 2
# 6 3 Name 1 1
# 7 3 Name 2 2
# 8 3 Name 3 3
# 9 3 Name 4 4
# 10 3 Name 5 5
The id column uniquely identifies each publication, so that row_number() within each group gives an author's position; it also lets you re-merge the authors back with the original data. If there is a better column that uniquely identifies each row/publication, use that instead. If you have no better fields, it is best to add the id before you start this process, to ensure the original data and this lengthened data share identical ids, perhaps with:
x <- mutate(x, id = row_number())
# or with base
x$id <- seq_len(nrow(x))
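To re-merge the author positions with the publication-level fields, a join on id is enough. A minimal sketch, assuming the lengthened frame above was saved as authors_long (a hypothetical name):
x <- mutate(x, id = row_number())
# authors_long is the id/author/position frame from the pipeline above
left_join(authors_long, select(x, id, pubtitle, title, date), by = "id")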
Data:
x <- read.csv(header=TRUE, stringsAsFactors=FALSE, text="
authors, pubtitle, title, date
Name 1; Name 2; Name 3, Journal Title, Article Title, 2018
Name 1; Name 2, Journal Title, Article Title, 2019
Name 1; Name 2; Name 3; Name 4; Name 5, Journal Title, Article Title, 2018")
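As an aside, on newer tidyr, separate_rows() collapses the split-and-unnest into a single step; a sketch of the same idea:
x %>%
  mutate(id = row_number()) %>%
  separate_rows(authors, sep = ";") %>%
  group_by(id) %>%
  mutate(position = row_number(),
         authors = trimws(authors)) %>%
  ungroup()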

Related

Group strings that have the same words but in a different order

I have an example concatenated text field (please see sample data below) that is created from two or three different fields; however, there is no guarantee that the order of the words will be the same. I would like to create a new dataset where fields with the same words, regardless of order, are collapsed. However, since I do not know in advance which words will be concatenated together, the code will have to recognize that all words in both strings match.
Code for example data:
var1 <- c("BLUE|RED", "RED|BLUE", "WHITE|BLACK|ORANGE", "BLACK|WHITE|ORANGE")
freq <- c(1, 1, 1, 1)
have <- data.frame(var1, freq)
Have:
var1 freq
BLUE|RED 1
RED|BLUE 1
WHITE|BLACK|ORANGE 1
BLACK|WHITE|ORANGE 1
How can I collapse the data into what I want below?
color freq
BLUE|RED 2
WHITE|BLACK|ORANGE 2
The idea is to sort the words within each string so that order no longer matters, then tabulate the normalised strings:
data.frame(table(sapply(strsplit(have$var1, '\\|'),
                        function(x) paste(sort(x), collapse = '|'))))
Var1 Freq
1 BLACK|ORANGE|WHITE 2
2 BLUE|RED 2
In the world of piping (the native pipe and \(x) lambdas require R >= 4.1):
have$var1 |>
  strsplit('\\|') |>
  sapply(\(x) paste0(sort(x), collapse = "|")) |>
  table() |>
  data.frame()
Here is a tidyverse approach:
library(dplyr)
library(tidyr)
have %>%
  group_by(id = row_number()) %>%
  separate_rows(var1) %>%
  arrange(var1, .by_group = TRUE) %>%
  mutate(var1 = paste(var1, collapse = "|")) %>%
  slice(1) %>%
  ungroup() %>%
  count(var1, name = "freq")
var1 freq
<chr> <int>
1 BLACK|ORANGE|WHITE 2
2 BLUE|RED 2
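A more compact variant does the sort-and-paste in a single mutate(); a sketch using purrr (the helper column name key is arbitrary):
library(dplyr)
library(purrr)
have %>%
  mutate(key = map_chr(strsplit(as.character(var1), "\\|"),
                       ~ paste(sort(.x), collapse = "|"))) %>%
  count(key, name = "freq")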

Subset the Original dataframe with different combinations of 2 factor variables

I have a dataset with 11 columns and 18350 observations, including the variables company and region. There are 9 companies spread across several regions, and not all companies are present in all regions. I want to create a separate dataframe for each combination of company and region that occurs. You can see it like this:
company0-region1,
company0-region10,
company0-region7,
company1-region5,
company2-region0,
company3-region2,
company4-region3,
company5-region7,
company6-region6,
company8-region9,
company9-region8
Thus I need 11 different dataframes in R; no other combinations are possible.
Any other approach would be highly appreciated.
Thanks in advance!
I used the split() function to get a list:
p <- split(tsog1, list(tsog1$company), drop = TRUE)
Now I have a list of dataframes, but I can't convert each element of that list into an individual dataframe.
I tried using loops too, but couldn't get a uniquely named dataframe out of them:
v <- c(1:9)
p <- levels(tsog1$company)
for (x in v) {
  x.tsog1 <- subset(tsog1, tsog1$company == p[x])
}
You can create a column for the region company combination and split by that column.
For example:
library(tidyverse)
# Create a df with 9 regions, 6 companies, and some dummy observations (3 per case)
df <- expand.grid(region = 0:8, company = 0:5, dummy = 1:3) %>%
  mutate(x = round(rnorm(54 * 3), 2)) %>%
  select(-dummy) %>%
  as_tibble()
# Create the column to split by, and split.
df %>%
  mutate(region_company = paste(region, company, sep = '_')) %>%
  split(., .$region_company)
Now, what to do once you have the list of data frames depends on your next steps. If, for example, you want to save them all, you can use walk() or lapply().
For saving:
df_list <- df %>%
  mutate(region_company = paste(region, company, sep = '_')) %>%
  split(., .$region_company)
iwalk(df_list, function(df, nm) {
  write_csv(df, paste0(nm, '.csv'))
})
Or if you simply want to access one of them:
> df_list$`0_4`
# A tibble: 3 x 4
region company x region_company
<int> <int> <dbl> <chr>
1 0 4 0.54 0_4
2 0 4 1.61 0_4
3 0 4 0.16 0_4
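And if you really do need each element as a separately named dataframe in the global environment, base R's list2env() can do that; a sketch (keeping the list is usually the better workflow):
list2env(df_list, envir = globalenv())
# each element, e.g. `0_4`, is now a standalone dataframe; access
# non-syntactic names with backticks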

The most frequent words in a column of a dataframe

I have a 3-column dataframe where the 3rd (last) column contains a text body, something like one sentence. Additionally, I have a vector of words.
How can I elegantly compute the following:
find the 15 most frequent words (with their number of occurrences) across the whole
3rd column which also occur in the vector mentioned above?
The sentence can look like:
I like dogs and my father like cats
vector <- c("dogs", "like")
Here, the most frequent words are dogs and like.
You can try this:
library(tidytext)
library(tidyverse)
df %>% # your data
  unnest_tokens(word, text) %>% # clean the data a bit and split the phrases into words
  group_by(word) %>% # group by word
  summarise(Freq = n()) %>% # count them
  arrange(-Freq) %>% # order decreasing
  top_n(2) # here the top 2; you can use 15
Result:
# A tibble: 8 x 2
word Freq
<chr> <int>
1 dogs 3
2 i 2
If you already have the text split into words, you can skip the second line.
With data:
df <- data.frame(
  id = c(1, 2, 3),
  group = c(1, 1, 1),
  text = c("I like dogs", "I don't hate dogs", "dogs are the best"),
  stringsAsFactors = FALSE)
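Note that the snippet above counts every token, not just those in the vector. To restrict the count to the supplied vector, as the question asks, you could filter before counting. A sketch, assuming the vector is named keep_words:
library(tidytext)
library(dplyr)
keep_words <- c("dogs", "like")
df %>%
  unnest_tokens(word, text) %>%      # one row per word
  filter(word %in% keep_words) %>%   # keep only words from the vector
  count(word, sort = TRUE, name = "Freq") %>%
  top_n(15)                          # the 15 most frequent of those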

Duplicated rows when aggregating data in dplyr

I'm trying to create a set of cross-linguistic data by joining three datasets together with dplyr. Two of the datasets are 'dictionaries' of sorts - they are word lists that I want to attach to speakers. There are 15 speakers, and so a number of repetitions throughout the data, while each word appears only once in each of the dictionaries.
When I join two of them using left_join(), I get replicated cells. I know I can remove the duplicated rows, but I sense that there must be something simple that I'm doing wrong to create this issue.
Example data is as follows:
French <- c("un", "deux", "trois", "chien")
English <- c("one", "two", "three", "dog")
type <- c("number", "number", "number", "animal")
speaker <- c(1, 1, 1, 4)
df.fr = data.frame(speaker, French)
df.en = data.frame(speaker, English)
df.type = data.frame(English, type)
I want to create a new dataset, new.df, by joining df.en and df.fr by speaker, and then joining that to df.type by English.
Preferably I would use dplyr to do this. When I do the following, I get duplicated rows:
library(dplyr)
new.data <- df.fr %>% left_join(df.en)
which generates
speaker French English
1 1 un one
2 1 un two
3 1 un three
4 1 deux one
5 1 deux two
6 1 deux three
7 1 trois one
8 1 trois two
9 1 trois three
10 4 chien dog
When really I just want it to join 'un' to 'one', 'deux' to 'two', etc:
speaker French English type
1 1 un one number
2 1 deux two number
3 1 trois three number
4 4 chien dog animal
Aside from cbinding the three datasets, you can create a unique id for each speaker for both df.fr and df.en and join on speaker + id:
library(dplyr)
df.fr %>%
  group_by(speaker) %>%
  mutate(id = 1:n()) %>%
  left_join(df.en %>% group_by(speaker) %>% mutate(id = 1:n()),
            by = c("speaker", "id")) %>%
  left_join(df.type) %>%
  select(-id)
If you have more than two language datasets, you can also write a more general solution using map and reduce from purrr:
library(purrr)
list(df.fr, df.en) %>%
  map(~ group_by(., speaker) %>% mutate(id = 1:n())) %>%
  reduce(left_join, by = c("speaker", "id")) %>%
  left_join(df.type) %>%
  select(-id)
Result:
# A tibble: 4 x 4
# Groups: speaker [2]
speaker French English type
<dbl> <fctr> <fctr> <fctr>
1 1 un one number
2 1 deux two number
3 1 trois three number
4 4 chien dog animal
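The duplication in the original attempt happens because joining on speaker alone is a many-to-many match: every speaker-1 row of df.fr pairs with every speaker-1 row of df.en, and the id column is what breaks those ties. As an aside, recent dplyr (>= 1.1.0) flags such joins; a sketch of opting in explicitly:
library(dplyr)
# silences the many-to-many warning by declaring the match on purpose
df.fr %>% left_join(df.en, by = "speaker", relationship = "many-to-many")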

R: How to best extract two XML attributes from a node?

The following code extracts one attribute (or all) from an XML file:
library(xml2)
library(magrittr)
library(readr)
library(tibble)
library(knitr)
fname <- 'https://raw.githubusercontent.com/wardblonde/ODM-to-i2b2/master/odm/examples/CDISC_ODM_example_3.xml'
x <- read_xml(fname)
xpath <- "//d1:ItemDef"
itemsNames <- x %>% xml_find_all(xpath, ns = xml_ns(x)) %>% xml_attr('Name')
items <- x %>% xml_find_all(xpath, ns = xml_ns(x))
Item looks like this:
<ItemDef OID="IT.ABNORM" Name="Normal/Abnormal/Not Done" DataType="integer" Length="1" ...
Sample file can be viewed here: https://raw.githubusercontent.com/wardblonde/ODM-to-i2b2/master/odm/examples/CDISC_ODM_example_3.xml
Using pipes and xml_attr(), what is the best way to extract both the Name and DataType attributes and rbind them together?
Ideally it would be a single line of efficient piped code. I can extract the names and types separately and build data.frame(name = names, type = types), but that seems neither the best nor the most modern approach.
The result should be a tibble with columns name and data type.
library(purrr)
library(dplyr)
map(items, xml_attrs) %>%
  map_df(as.list) %>%
  select(Name, DataType)
## # A tibble: 94 × 2
## Name DataType
## <chr> <chr>
## 1 Normal/Abnormal/Not Done integer
## 2 Actions taken re study drug text
## 3 Actions taken, other text
## 4 Stop Day - Enter Two Digits 01-31 text
## 5 Derived Stop Date text
## 6 Stop Month - Enter Two Digits 01-12 text
## 7 Stop Year - Enter Four Digit Year text
## 8 Outcome text
## 9 Relationship to study drug text
## 10 Severity text
## # ... with 84 more rows
One "base" version:
lapply(items, xml_attrs) %>%
  lapply(function(x) as.data.frame(as.list(x))[, c("Name", "DataType")]) %>%
  do.call(rbind, .) %>%
  tbl_df()
NOTE: an issue with ^^ is that if Name or DataType is missing then you're SOL. You can mitigate that with:
lapply(items, xml_attrs) %>%
  lapply(function(x) as.data.frame(as.list(x))[, c("Name", "DataType")]) %>%
  data.table::rbindlist(fill = TRUE) %>%
  tbl_df()
or:
lapply(items, xml_attrs) %>%
  lapply(function(x) as.data.frame(as.list(x))[, c("Name", "DataType")]) %>%
  bind_rows() %>%
  tbl_df()
if you don't like purrr.
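Since xml_attr() is vectorised over a nodeset and returns NA for a missing attribute, a compact alternative that sidesteps the missing-attribute issue is (a sketch):
library(xml2)
library(tibble)
tibble(name = xml_attr(items, "Name"),
       type = xml_attr(items, "DataType"))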
