Duplicated rows when joining data in dplyr - r

I'm trying to create a set of cross-linguistic data by joining three datasets together with dplyr. Two of the datasets are 'dictionaries' of sorts: word lists that I want to attach to speakers. There are 15 speakers, so speakers repeat throughout the data, while each word appears only once in each dictionary.
When I join two of them using left_join(), I get replicated rows. I know I can remove the duplicates afterwards, but I sense that there must be something simple that I'm doing wrong to create this issue.
Example data is as follows:
French <- c("un", "deux", "trois", "chien")
English <- c("one", "two", "three", "dog")
type <- c("number", "number", "number", "animal")
speaker <- c(1, 1, 1, 4)
df.fr = data.frame(speaker, French)
df.en = data.frame(speaker, English)
df.type = data.frame(English, type)
I want to create a new dataset, new.df, by joining df.en and df.fr by speaker, and then joining that to df.type by English.
Preferably I would use dplyr to do this. When I do the following, I get duplicated rows:
new.data <- df.fr %>% left_join(df.en)
which generates
speaker French English
1 1 un one
2 1 un two
3 1 un three
4 1 deux one
5 1 deux two
6 1 deux three
7 1 trois one
8 1 trois two
9 1 trois three
10 4 chien dog
When really I just want it to join 'un' to 'one', 'deux' to 'two', etc:
speaker French English type
1 1 un one number
2 1 deux two number
3 1 trois three number
4 4 chien dog animal

Aside from cbinding the three datasets, you can create a row id within each speaker for both df.fr and df.en and join on speaker + id. (The duplicates arise because left_join() matches on speaker alone, and speaker 1 has three rows in each table, so every French row pairs with every English row for that speaker.)
library(dplyr)
df.fr %>%
  group_by(speaker) %>%
  mutate(id = 1:n()) %>%
  left_join(df.en %>% group_by(speaker) %>% mutate(id = 1:n()),
            by = c("speaker", "id")) %>%
  left_join(df.type) %>%
  select(-id)
If you have more than two language datasets, you can also write a more general solution using map and reduce from purrr:
library(purrr)
list(df.fr, df.en) %>%
  map(~ group_by(., speaker) %>% mutate(id = 1:n())) %>%
  reduce(left_join, by = c("speaker", "id")) %>%
  left_join(df.type) %>%
  select(-id)
Result:
# A tibble: 4 x 4
# Groups: speaker [2]
speaker French English type
<dbl> <fctr> <fctr> <fctr>
1 1 un one number
2 1 deux two number
3 1 trois three number
4 4 chien dog animal
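As an aside, if the rows of df.fr and df.en are guaranteed to line up one-to-one, the cbind route mentioned above is the minimal baseline. A sketch, assuming identical row order (which the id-based join does not require):
library(dplyr)
# Rows must already be aligned: word i of df.fr corresponds to word i of df.en
bind_cols(df.fr, df.en["English"]) %>%
  left_join(df.type, by = "English")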


How to append text to a column based on conditions?

I have an empty column designated for categorising entries in my data frame. Categories are not exclusive, i.e. one entry can have multiple categories.
animals categories
1 monkey
2 humpback whale
3 river trout
4 seagull
The categories column should be filled based on the animal's properties, which I know from vectors like the ones below. The elements in the vectors aren't necessarily a perfect match for the entries.
mammals <- c("whale", "monkey", "dog")
swimming <- c("whale", "trout", "dolphin")
How do I get the following result, ideally without looping?
animals categories
1 monkey mammal
2 humpback whale mammal,swimming
3 river trout swimming
4 seagull
This may be done with fuzzyjoin after creating a key/value dataset: lst() from dplyr returns a named list, which enframe() converts to a two-column dataset; unnest() the list column, group by 'animals', paste the 'categories' into a single string, and then do a regex join (regex_left_join) with the original dataset.
library(fuzzyjoin)
library(dplyr)
library(tidyr)
library(tibble)
keydat <- lst(mammals, swimming) %>%
  enframe(name = 'categories', value = 'animals') %>%
  unnest(animals) %>%
  group_by(animals) %>%
  summarise(categories = toString(categories))
regex_left_join(df1, keydat, by = 'animals', ignore_case = TRUE) %>%
  transmute(animals = animals.x, categories)
# A tibble: 4 × 2
animals categories
<chr> <chr>
1 monkey mammals
2 humpback whale mammals, swimming
3 river trout swimming
4 seagull <NA>
Data
df1 <- tibble(animals = c('monkey', 'humpback whale', 'river trout', 'seagull'))
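For reference, the intermediate keydat built above collapses to one row per animal word (including 'dolphin', which never matches the data):
# A tibble: 5 × 2
  animals categories
  <chr>   <chr>
1 dog     mammals
2 dolphin swimming
3 monkey  mammals
4 trout   swimming
5 whale   mammals, swimming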
A base R option using stack + aggregate + grepl
# Build a lookup table: stack() turns the named list into a two-column
# data frame (values = animal word, ind = category), and aggregate()
# collapses the categories for each word into a single string
lut <- aggregate(
  . ~ values,
  type.convert(
    stack(list(mammals = mammals, swimming = swimming)),
    as.is = TRUE
  ),
  toString
)
# Logical matrix: does each lookup word occur in each animals entry?
p <- sapply(
  lut$values,
  grepl,
  x = df$animals
)
# For each row, recover the index of the matching lookup word (NA if none)
# and use it to pull the collapsed category string
df$categories <- lut$ind[replace(rowSums(p * col(p)), rowSums(p) == 0, NA)]
which gives
> df
animals categories
1 monkey mammals
2 humpback whale mammals, swimming
3 river trout swimming
4 seagull <NA>
Data
df <- data.frame(animals = c("monkey", "humpback whale", "river trout", "seagull"))

Subset the original dataframe with different combinations of 2 factor variables

I have a dataset with 11 columns and 18,350 observations, including the variables company and region. There are 9 companies (company-0) spread across 5 regions (region-0 to region-5), and not all companies are present in all regions. I want to create a separate dataframe for each combination of company and region that occurs, like this:
company0-region1,
company0-region10,
company0-region7,
company1-region5,
company2-region0,
company3-region2,
company4-region3,
company5-region7,
company6-region6,
company8-region9,
company9-region8
Thus I need 11 different dataframes in R; no other combinations are possible.
Any other approach would be highly appreciated.
Thanks in advance.
I used the split function to get a list:
p <- split(tsog1, list(tsog1$company), drop = TRUE)
Now I have a list of dataframes, and I can't convert each element of that list into an individual dataframe.
I tried using loops too, but can't get uniquely named dataframes.
v <- c(1:9)
p <- levels(tsog1$company)
for (x in v) {
  x.tsog1 <- subset(tsog1, tsog1$company == p[x])
}
You can create a column for the region-company combination and split by that column.
For example:
library(tidyverse)
# Create a df with 9 regions, 6 companies, and some dummy observations (3 per case)
df <- expand.grid(region = 0:8, company = 0:5, dummy = 1:3) %>%
  mutate(x = round(rnorm(54 * 3), 2)) %>%
  select(-dummy) %>%
  as_tibble()
# Create the column to split by, and split
df %>%
  mutate(region_company = paste(region, company, sep = '_')) %>%
  split(., .$region_company)
Now, what to do once you have the list of data frames depends on your next steps. If you want to, for example, save them, you can use walk or lapply.
For saving:
df_list <- df %>%
  mutate(region_company = paste(region, company, sep = '_')) %>%
  split(., .$region_company)
iwalk(df_list, function(df, nm) {
  write_csv(df, paste0(nm, '.csv'))
})
Or if you simply want to access one of them:
> df_list$`0_4`
# A tibble: 3 x 4
region company x region_company
<int> <int> <dbl> <chr>
1 0 4 0.54 0_4
2 0 4 1.61 0_4
3 0 4 0.16 0_4
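If you really do want each combination as its own named object in the workspace, as the loop in the question attempts, one option is list2env(); a minimal sketch:
# Copies every element of df_list into the global environment,
# each named after its list name (e.g. an object called `0_4`)
list2env(df_list, envir = .GlobalEnv)
Note that names like 0_4 are not syntactically valid, so you would access those objects with backticks; keeping the list is usually the more convenient choice.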

How to transpose character data for unique IDs

I'm trying to use a sum function to count the number of interactions for unique IDs.
So I have something like this:
Client ID
JOE12_EMI
ABC12_CANC
ABC12_EMI
ABC12_RENE
and so on...
It'll also have a column next to it that counts how many times each unique ID repeats.
Frequency
1
2
2
1
Is there a way that I can have all the activity types (EMI, TELI, PFL) summed for each ID and then placed into new columns?
I've tried to transpose the data by separating the actual ID from the activity type, but this doesn't return the sums. I'm not sure if that's the best approach, or whether I should transpose the data to wide format and then do another sum, but I am unsure how to go about it. Thank you for any help.
separate(frequency, id, c("id", "act_code") )
nd <- melt(frequency, id=(c("id")))
Try this:
library(dplyr)
data <- data.frame(Client_ID = c("JOE12_EMI",
                                 "ABC12_CANC",
                                 "ABC12_EMI",
                                 "ABC12_RENE"),
                   frequency = c(1, 2, 2, 1))
client_and_id <- as.data.frame(do.call(rbind, strsplit(as.character(data$Client_ID), "_")))
names(client_and_id) <- c("client", "id")
data <- cbind(data, client_and_id)
data_sum <- data %>% group_by(id) %>% mutate(sum_freq = sum(frequency))
The output
> data_sum
# A tibble: 4 x 5
# Groups: id [3]
Client_ID frequency client id sum_freq
<fct> <dbl> <fct> <fct> <dbl>
1 JOE12_EMI 1 JOE12 EMI 3
2 ABC12_CANC 2 ABC12 CANC 2
3 ABC12_EMI 2 ABC12 EMI 3
4 ABC12_RENE 1 ABC12 RENE 1
You can also display the output by ID:
distinct(data_sum %>% dplyr::select(id, sum_freq))
# A tibble: 3 x 2
# Groups: id [3]
id sum_freq
<fct> <dbl>
1 EMI 3
2 CANC 2
3 RENE 1
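A more direct way to get the per-id totals (a sketch equivalent to the mutate() + distinct() combination above) is to summarise instead:
data %>%
  group_by(id) %>%
  summarise(sum_freq = sum(frequency))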
You're on the right track; I think the only thing you need is a group_by. Something like this:
library(dplyr)
library(tidyr)
df <- data.frame(ClientID = c("JOE12_EMI",
                              "ABC12_CANC",
                              "ABC12_EMI",
                              "ABC12_RENE"))
df %>%
  separate(ClientID, into = c("id", "act_code"), sep = "_") %>%
  group_by(id) %>%
  mutate(frequency = n()) %>%
  ungroup() %>%
  group_by(id, act_code) %>%
  mutate(act_frequency = n()) %>%
  ungroup() %>%
  spread(act_code, act_frequency)
(This does the sum by user and the pivot by activity type separately; it's possible to calculate the sum by user after pivoting, but this way is easier for me to read.)
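For the sample df above, this should produce roughly the following (one row per id, one column per activity type; exact row and column order may vary):
# A tibble: 2 x 5
  id    frequency  CANC   EMI  RENE
  <chr>     <int> <int> <int> <int>
1 ABC12         3     1     1     1
2 JOE12         1    NA     1    NA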

Using R to determine author position in journals?

I'm looking to analyze the order of authors in academic papers, and have a dataset of journals, authors, publication titles, publication dates, etc. that I'm working with. The data comes with each publication title as a row, and the author(s) of the piece listed in a semi-colon-delimited list. For example:
authors, pubtitle, title, date
Name 1; Name 2; Name 3, Journal Title, Article Title, 2018
Name 1; Name 2, Journal Title, Article Title, 2019
Name 1; Name 2; Name 3; Name 4; Name 5, Journal Title, Article Title, 2018
I've come up with a pretty inefficient way to determine author order, but I'm wondering about suggestions to improve this. Right now, the general workflow looks like this:
data_name_listed <- readxl::read_xlsx("data-raw/data.xlsx")
data_name_listed <- data_name_listed %>%
  rename(author = "Author",
         title = "Title",
         pubtitle = "Publication Title",
         publisher = "Publisher",
         date = "Date")
# Select just the author column
data_name_order <- data_name_listed %>% select(author)
data_name_order$author <- str_trim(data_name_order$author)
# Separate lists of names into columns according to the order they appear
# in the semicolon-delimited list. This is really inelegant.
data_name_order <- data_name_order %>%
  separate(col = author, into = c("1","2","3","4","5","6","7","8","9","10","11",
                                  "12","13","14","15","16","17","18","19","20",
                                  "21","22","23","24","25","26","27","28","29",
                                  "30","31","32","33","34","35"), sep = ";")
# Gather the data into a tidy df
data_name_order <- data_name_order %>%
  gather(position, name)
# Clean up special characters in names
data_name_order$name <- gsub("(.*)\\s+[A-Z]\\.?$", "\\1", data_name_order$name)
# Get rid of missing data
data_name_order <- data_name_order %>% drop_na()
# Convert position number to numeric
data_name_order$position <- as.numeric(data_name_order$position)
# Ensure no whitespace
data_name_order$name <- str_trim(data_name_order$name)
# Then merge this data with tidy journal data
# ... code ...
In particular, the separate() function is particularly messy, even though it seems to achieve what I hoped it would. I'd love any advice to make this a bit more clean and more reproducible/applicable to other datasets. Thanks!
Here's a suggestion without separate:
library(dplyr)
library(tidyr)
x %>%
  select(authors) %>%
  transmute(
    id = row_number(),
    author = strsplit(authors, ";")
  ) %>%
  unnest() %>%
  group_by(id) %>%
  mutate(
    position = row_number(),
    author = trimws(author)
  ) %>%
  ungroup()
# # A tibble: 10 x 3
# id author position
# <int> <chr> <int>
# 1 1 Name 1 1
# 2 1 Name 2 2
# 3 1 Name 3 3
# 4 2 Name 1 1
# 5 2 Name 2 2
# 6 3 Name 1 1
# 7 3 Name 2 2
# 8 3 Name 3 3
# 9 3 Name 4 4
# 10 3 Name 5 5
The id column is introduced to keep track of which publication each author came from once unnest() lengthens the frame. It also (for your case) serves as a way to re-merge authors back with the original data. If there is a better column that uniquely identifies each row/publication, use that instead. If you have no better fields, it might be better to add the id before you start this process, to ensure the original data and this lengthened data have identical ids, perhaps with:
x <- mutate(x, id = row_number())
# or with base
x$id <- seq_len(nrow(x))
Data:
x <- read.csv(header=TRUE, stringsAsFactors=FALSE, text="
authors, pubtitle, title, date
Name 1; Name 2; Name 3, Journal Title, Article Title, 2018
Name 1; Name 2, Journal Title, Article Title, 2019
Name 1; Name 2; Name 3; Name 4; Name 5, Journal Title, Article Title, 2018")
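A more compact variant of the same idea (a sketch, assuming the same x as above) uses tidyr::separate_rows to split the semicolon list directly into rows:
library(dplyr)
library(tidyr)
x %>%
  mutate(id = row_number()) %>%
  # One row per author per publication
  separate_rows(authors, sep = ";") %>%
  group_by(id) %>%
  mutate(author = trimws(authors), position = row_number()) %>%
  ungroup() %>%
  select(id, author, position)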

Parsing a Hierarchy in a String Value

I am trying to create an edge list from a single character vector. My list to be processed is over 93k elements long, but as an example I will provide a small excerpt.
The character strings are part of the ICD10 code hierarchy, and the parent-child relationships exist within the string itself: a single string, "A0101", has "A010" as its parent.
It would look like this:
A00
A000
A001
A009
A01
A010
A0100
A0101
A02
A03
etc.
My vector does not contain any other data except the strings, but I basically need to convert
dat <- c("A00", "A000", "A001", "A009", "A01", "A010", "A0100", "A0101", "A02")
into an edge list formatted as follows...
# (A00, A000)
# (A00, A001)
# (A00, A009)
# (A01, A010)
# (A010, A0100)
# (A010, A0101)
I am fairly certain there are more efficient ways to accomplish this, but this excerpt of code should download the ICD10-CM data from the icd.data package, use the children detection from the icd package, and then make extensive use of the tidyverse to return an edge list. I had to get a bit creative to connect the "top" of the hierarchies, since the data do not include the chapters and sub-chapters of ICD10 as individual 2- or 1-digit codes.
Basically, sub-chapters become 2-digit codes, chapters become 1-digit codes, and then there is a root node to connect everything at the top.
library(icd.data)
icd10 <- icd10cm2016
library(icd)
code_children <- lapply(icd10$code, children)
code_vec <- sapply(code_children, paste, collapse = ",")
code_df <- as.data.frame(code_vec, stringsAsFactors = F)
library(dplyr);library(stringr);library(tidyr)
code_df_new <- code_df %>%
  mutate(parent = sapply(strsplit(code_vec, ","), "[", 1)) %>%
  separate(code_vec,
           paste("code", 1:max(str_count(code_df$code_vec, ",")), sep = "."),
           ",", extra = "merge")
library(reshape2)
edgelist <- melt(code_df_new, id = "parent") %>%
  filter(!is.na(value)) %>%
  select(parent, child = value) %>%
  arrange(parent)
edgelist <- subset(edgelist, edgelist$parent != edgelist$child)
edgelist <- subset(edgelist, nchar(edgelist$child) == nchar(edgelist$parent) + 1)
subchaps <- icd10 %>%
  select(three_digit, sub_chapter, chapter) %>%
  mutate(two_digit = substr(three_digit, 1, 2)) %>%
  select(parent = two_digit, child = three_digit) %>%
  distinct()
chaps <- icd10 %>%
  select(three_digit, sub_chapter, chapter) %>%
  mutate(
    two_digit = substr(three_digit, 1, 2),
    one_digit = substr(three_digit, 1, 1)) %>%
  select(parent = one_digit, child = two_digit) %>%
  distinct()
root <- icd10 %>%
  select(three_digit) %>%
  mutate(parent = "root", child = substr(three_digit, 1, 1)) %>%
  select(parent, child) %>%
  distinct()
edgelist_final <- edgelist %>%
  bind_rows(list(chaps, subchaps, root)) %>%
  arrange(parent)
If anybody has any tips or methods to improve the efficiency of this code I am all ears. (eyes?)
On the assumption that the length of the node names in ICD10 fully defines the order (with shorter ones being parents), here's an approach that connects each node with its immediate parent, if available.
While I think the logic is legible here, I'd be curious to see what a more streamlined solution would look like.
# Step 1 - prep (stringr is needed for str_length, and the pipe below
# requires dplyr to be loaded before `fake` is built)
library(dplyr); library(tidyr); library(stringr)
# Some longer fake data to prove that it works acceptably
# with 93k rows (took a few seconds). These are just
# numbers of different lengths, converted to characters, but they
# should suffice if the assumption about length = order is correct.
set.seed(42)
fake <- runif(93000, 0, 500) %>%
  magrittr::raise_to_power(3) %>%
  as.integer() %>%
  as.character()
fake_2 <- fake %>%
  as_data_frame() %>%
  mutate(row = row_number()) %>%
  # Step 2 - widen by level and fill in all parent nodes
  mutate(level = str_length(value)) %>%
  spread(level, value) %>%
  fill(everything()) %>%
  # Step 3 - Get two highest non-NA nodes
  gather(level, code, -row) %>%
  arrange(row, level) %>%
  filter(!is.na(code)) %>%
  group_by(row) %>%
  top_n(2, wt = level) %>%
  # Step 4 - Spread once more to get pairs
  mutate(pos = row_number()) %>%
  ungroup() %>%
  select(-level) %>%
  spread(pos, code)
Output on OP data
# A tibble: 9 x 3
row `1` `2`
<int> <chr> <chr>
1 1 A00 NA
2 2 A00 A000
3 3 A00 A001
4 4 A00 A009
5 5 A01 A009
6 6 A01 A010
7 7 A010 A0100
8 8 A010 A0101
9 9 A010 A0101
Output on 93k fake data
> head(fake, 10)
[1] "55174190" "50801321" "46771275" "6480673"
[5] "20447474" "879955" "4365410" "11434009"
[9] "5002257" "9200296"
> head(fake_2, 10)
# A tibble: 10 x 3
row `1` `2`
<int> <chr> <chr>
1 1 55174190 NA
2 2 50801321 NA
3 3 46771275 NA
4 4 6480673 46771275
5 5 6480673 20447474
6 6 6480673 20447474
7 7 4365410 20447474
8 8 4365410 11434009
9 9 5002257 11434009
10 10 9200296 11434009
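For the small excerpt in the question, a minimal base R sketch is also possible, assuming a node's parent is its longest proper prefix that also appears in the vector:
dat <- c("A00", "A000", "A001", "A009", "A01", "A010", "A0100", "A0101", "A02")
# Find the longest proper prefix of `code` that is itself a known code
parent_of <- function(code, codes) {
  for (n in rev(seq_len(nchar(code) - 1))) {
    cand <- substr(code, 1, n)
    if (cand %in% codes) return(cand)
  }
  NA_character_
}
parents <- vapply(dat, parent_of, character(1), codes = dat)
edges <- data.frame(parent = parents, child = dat)
# Drop the roots, which have no parent in the vector
edges[!is.na(edges$parent), ]
This reproduces the edge list shown in the question; whether it is fast enough for 93k codes would need testing.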
