I have the following data.frame:
id name altNames
1001 Joan character(0)
1002 Jane c("Janie", "Janet", "Jan")
1003 John Jon
1004 Bill Will
1005 Tom character(0)
The column altNames could be empty (i.e. character(0)), have just one name, or a list of names. What I want is a data.frame (or a list) where each entry from name and/or altNames appears just once along with the corresponding id, like this:
id name
1001 Joan
1002 Jane
1002 Janie
1002 Janet
1002 Jan
1003 John
1003 Jon
1004 Bill
1004 Will
1005 Tom
What's the most efficient way of doing it? Even better if dplyr is utilized.
Thanks
Edit: Here's the data:
df <- data_frame(
  id = c("1001", "1002", "1003", "1004", "1005"),
  name = c("Joan", "Jane", "John", "Bill", "Tom"),
  altNames = list(character(0), c("Janie", "Janet", "Jan"), "Jon", "Will", character(0))
)
Here's a possible data.table approach
library(data.table)
setDT(df)[, .(name = c(name, unlist(altNames))), by = id]
# id name
# 1: 1001 Joan
# 2: 1002 Jane
# 3: 1002 Janie
# 4: 1002 Janet
# 5: 1002 Jan
# 6: 1003 John
# 7: 1003 Jon
# 8: 1004 Bill
# 9: 1004 Will
# 10: 1005 Tom
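If an altNames entry can repeat the main name and you want each entry to appear only once per id (as the question asks), wrapping the concatenation in unique() is a small tweak on the same idea:
setDT(df)[, .(name = unique(c(name, unlist(altNames)))), by = id]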
A base R version (using the df added by @rawr):
with(df, {
  ns <- mapply(c, name, altNames)
  data.frame(id = rep(id, times = lengths(ns)), name = unlist(ns), row.names = NULL)
})
# id name
#1 1001 Joan
#2 1002 Jane
#3 1002 Janie
#4 1002 Janet
#5 1002 Jan
#6 1003 John
#7 1003 Jon
#8 1004 Bill
#9 1004 Will
#10 1005 Tom
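One caveat: mapply() simplifies its result, so if every row happened to produce the same number of names you would get a matrix instead of a list and the rep()/unlist() step would break. Map() (or SIMPLIFY = FALSE) avoids that edge case; a minimal sketch of the same approach:
with(df, {
  ns <- Map(c, name, altNames)  # Map() never simplifies, so ns is always a list
  data.frame(id = rep(id, times = lengths(ns)), name = unlist(ns), row.names = NULL)
})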
Here's a full dplyr + tidyr solution, the way I'd tackle it:
library(dplyr)
library(tidyr)
df <- data_frame(
  id = c("1001", "1002", "1003", "1004", "1005"),
  name = c("Joan", "Jane", "John", "Bill", "Tom"),
  altNames = list(character(0), c("Janie", "Janet", "Jan"), "Jon", "Will", character(0))
)
# Need some way to concatenate a list of vectors with a vector
# in a "rowwise" way
vector_c <- function(...) {
  Map(c, ...)
}
df %>%
  mutate(
    names = vector_c(name, altNames),
    altNames = NULL,
    name = NULL
  ) %>%
  unnest(names)
#> Source: local data frame [10 x 2]
#>
#> id names
#> 1 1001 Joan
#> 2 1002 Jane
#> 3 1002 Janie
#> 4 1002 Janet
#> 5 1002 Jan
#> 6 1003 John
#> 7 1003 Jon
#> 8 1004 Bill
#> 9 1004 Will
#> 10 1005 Tom
Most of the hard work is done by tidyr::unnest(): it's designed to take a data frame with a list-column and unnest it, repeating the other columns as needed.
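With more recent tidyr (>= 1.0, where unnest() takes an explicit cols argument) you can skip the helper and build the combined list-column inline; a sketch under that assumption:
df %>%
  mutate(name = Map(c, name, altNames)) %>%  # combine both columns rowwise into one list-column
  select(id, name) %>%
  unnest(cols = name)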
Using tidyr, after cleaning the data with data.table:
First, fix the data:
library(data.table)
dat <- setDT(df)
dat$altNames[sapply(dat$altNames, length) == 0] <- NA
Now unnest from tidyr and some dplyr:
library(dplyr)
library(tidyr)
dat %>% unnest(altNames) %>%
  group_by(id) %>%
  do(data.frame(V1 = unique(c(.[["name"]], .[["altNames"]]))))
id V1
1 1001 Joan
2 1001 NA
3 1002 Jane
4 1002 Janie
5 1002 Janet
6 1002 Jan
7 1003 John
8 1003 Jon
9 1004 Bill
10 1004 Will
11 1005 Tom
12 1005 NA
It has the NAs, but they are easily removed with %>% na.omit().
I think data.table is the winner on this one.
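If you want to check that claim on your own data rather than take it on faith, here is a rough timing harness; it assumes the microbenchmark package, and the result will depend on the size of df:
library(data.table)
library(microbenchmark)
microbenchmark(
  data.table = as.data.table(df)[, .(name = c(name, unlist(altNames))), by = id],
  base = with(df, {
    ns <- Map(c, name, altNames)
    data.frame(id = rep(id, times = lengths(ns)), name = unlist(ns), row.names = NULL)
  }),
  times = 100
)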
Related
Two big real-life tables to join up, but here's a little reprex:
I've got a table of small strings and I want to left join on a second table, with the join being based on whether or not these small strings can be found inside the bigger strings on the second table.
df_1 <- data.frame(index = 1:5,
                   keyword = c("john", "ella", "mil", "nin", "billi"))
df_2 <- data.frame(index_2 = 1001:1008,
                   name = c("John Coltrane", "Ella Fitzgerald", "Miles Davis", "Billie Holliday",
                            "Nina Simone", "Bob Smith", "John Brown", "Tony Montana"))
df_results_i_want <- data.frame(index = c(1, 1:5),
                                keyword = c("john", "john", "ella", "mil", "nin", "billi"),
                                index_2 = c(1001, 1007, 1002, 1003, 1005, 1004),
                                name = c("John Coltrane", "John Brown", "Ella Fitzgerald",
                                         "Miles Davis", "Nina Simone", "Billie Holliday"))
Seems like a str_detect() call and a left_join() call might be part of the solution - ie I'm hoping for something like:
library(tidyverse)
df_results <- df_1 |> left_join(df_2, join_by(blah blah str_detect() blah blah))
I'm using dplyr 1.1 so I can use join_by(), but I'm not sure of the correct way to get what I need - can anyone help please?
I suppose I could do a simple cross join using tidyr::crossing() and then do the str_detect() stuff afterwards (and filter out things that don't match)
df_results <- df_1 |>
  crossing(df_2) |>
  mutate(match = str_detect(name, fixed(keyword, ignore_case = TRUE))) |>
  filter(match) |>
  select(-match)
but in my real life example, the cross join would produce an absolutely enormous table that would overwhelm my PC.
Thank you.
You can try fuzzyjoin::regex_join():
library(fuzzyjoin)
regex_join(df_2, df_1, by = c("name" = "keyword"), ignore_case = TRUE)
Output:
index.x name index.y keyword
1 1001 John Coltrane 1 john
2 1002 Ella Fitzgerald 2 ella
3 1003 Miles Davis 3 mil
4 1004 Billie Holliday 5 billi
5 1005 Nina Simone 4 nin
6 1007 John Brown 1 john
join_by() does not support this kind of inexact join (only inequality conditions), but you can use fuzzyjoin:
library(dplyr)
library(stringr)
library(fuzzyjoin)
df_2 %>%
  mutate(name = tolower(name)) %>%
  fuzzy_left_join(df_1, ., by = c(keyword = "name"),
                  match_fun = \(x, y) str_detect(y, x))
index keyword index_2 name
1 1 john 1001 john coltrane
2 1 john 1007 john brown
3 2 ella 1002 ella fitzgerald
4 3 mil 1003 miles davis
5 4 nin 1005 nina simone
6 5 billi 1004 billie holliday
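If you would rather not lowercase the names first, stringr can handle case-insensitivity in the pattern itself; a sketch of the same join, assuming stringr is loaded:
fuzzy_left_join(df_1, df_2, by = c(keyword = "name"),
                match_fun = \(x, y) str_detect(y, regex(x, ignore_case = TRUE)))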
We can use SQL to do that.
library(sqldf)
sqldf("select * from [df_1] A
left join [df_2] B on B.name like '%' || A.keyword || '%'")
giving:
index keyword index_2 name
1 1 john 1001 John Coltrane
2 1 john 1007 John Brown
3 2 ella 1002 Ella Fitzgerald
4 3 mil 1003 Miles Davis
5 4 nin 1005 Nina Simone
6 5 billi 1004 Billie Holliday
It can be placed in a pipeline like this:
library(magrittr)
library(sqldf)
df_1 %>%
{ sqldf("select * from [.] A
left join [df_2] B on B.name like '%' || A.keyword || '%'")
}
I have survey data showing which people work in certain groups. When I ask the leader of each group to list those who work for them, I get a single row containing all of the team members. What I need is to reshape the data into multiple rows, one per member, with their group information.
I don't know where to start.
This is what my data frame looks like,
LeaderName <- c('John','Jane','Louis','Carl')
Group <- c('3','1','4','2')
Member1 <- c('Lucy','Stephanie','Chris','Leslie')
Member1ID <- c('1','2','3','4')
Member2 <- c('Earl','Carlos','Devon','Francis')
Member2ID <- c('5','6','7','8')
Member3 <- c('Luther','Peter','','Severus')
Member3ID <- c('9','10','','11')
GroupInfo <- data.frame(LeaderName, Group, Member1, Member1ID, Member2 ,Member2ID, Member3, Member3ID)
This is what I would like it to look like:
LeaderName_ <- c('John','Jane','Louis','Carl','John','Jane','Louis','Carl','John','Jane','','Carl')
Group_ <- c('3','1','4','2','3','1','4','2','3','1','','2')
Member <- c('Lucy','Stephanie','Chris','Leslie','Earl','Carlos','Devon','Francis','Luther','Peter','','Severus')
MemberID <- c('1','2','3','4','5','6','7','8','9','10','','11')
ActualGroupInfor <- data.frame(LeaderName_,Group_,Member,MemberID)
An option would be melt from data.table, specifying the column name patterns in the measure.vars argument:
library(data.table)
melt(setDT(GroupInfo),
     measure.vars = patterns("^Member\\d+$", "^Member\\d+ID$"),
     value.name = c("Member", "MemberID"))[, variable := NULL][]
# LeaderName Group Member MemberID
# 1: John 3 Lucy 1
# 2: Jane 1 Stephanie 2
# 3: Louis 4 Chris 3
# 4: Carl 2 Leslie 4
# 5: John 3 Earl 5
# 6: Jane 1 Carlos 6
# 7: Louis 4 Devon 7
# 8: Carl 2 Francis 8
# 9: John 3 Luther 9
#10: Jane 1 Peter 10
#11: Louis 4
#12: Carl 2 Severus 11
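For comparison, tidyr can pair the columns with pivot_longer() and its ".value" sentinel; a sketch assuming tidyr >= 1.0 and character (not factor) columns, where the rename_with() calls only give the paired columns a common separator to split on:
library(dplyr)
library(tidyr)
GroupInfo %>%
  rename_with(~ sub("^(Member\\d+)$", "\\1_Member", .x)) %>%
  rename_with(~ sub("^(Member\\d+)ID$", "\\1_MemberID", .x)) %>%
  pivot_longer(-c(LeaderName, Group),
               names_to = c("set", ".value"),
               names_sep = "_") %>%
  select(-set)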
Here is a solution in base R:
reshape(
  data = GroupInfo,
  idvar = c("LeaderName", "Group"),
  varying = list(
    Member = which(names(GroupInfo) %in% grep("^Member[0-9]$", names(GroupInfo), value = TRUE)),
    MemberID = which(names(GroupInfo) %in% grep("^Member[0-9]ID", names(GroupInfo), value = TRUE))),
  direction = "long",
  v.names = c("Member", "MemberID"),
  sep = "_")[, -3]
#> LeaderName Group Member MemberID
#> John.3.1 John 3 Lucy 1
#> Jane.1.1 Jane 1 Stephanie 2
#> Louis.4.1 Louis 4 Chris 3
#> Carl.2.1 Carl 2 Leslie 4
#> John.3.2 John 3 Earl 5
#> Jane.1.2 Jane 1 Carlos 6
#> Louis.4.2 Louis 4 Devon 7
#> Carl.2.2 Carl 2 Francis 8
#> John.3.3 John 3 Luther 9
#> Jane.1.3 Jane 1 Peter 10
#> Louis.4.3 Louis 4
#> Carl.2.3 Carl 2 Severus 11
Created on 2019-05-23 by the reprex package (v0.2.1)
I'm having trouble reshaping my dataframe.
Here's an example:
id institution name1 id1 name2 id2
1 usp Miles Davis 123 Arturo Sandoval 111
2 unb Chet Baker 321 Clifford Brown 121
3 usp Wayne Shorter 222 Hermeto Pascoal 322
4 Puc-rio John Coltrane 333 Charlie Parker 112
I need to keep the id and institution columns and gather the other ones like this:
id institution name_all id_all
1 usp Miles Davis 123
1 usp Arturo Sandoval 111
2 unb Chet Baker 321
2 unb Clifford Brown 121
3 usp Wayne Shorter 222
3 usp Hermeto Pascoal 322
4 Puc-rio John Coltrane 333
4 Puc-rio Charlie Parker 112
I'm using the gather function from tidyr:
df %>%
gather(name_all, id_all, -id, -institution)
but it comes out like this:
id institution name id
1 usp name1 Miles Davis
1 usp id1 123
2 unb name1 Chet Baker
2 unb id2 121
Any ideas on how to pair those values? I have more than 5 columns to gather, and I think I'm missing an argument to specify which of them are paired. I hope I've made myself clear.
For a tidyverse solution, you can:
library(dplyr)
library(tidyr)
df %>%
  gather(ColType, ColValue, -id, -institution) %>%
  mutate(id_number = gsub("^(\\D*)(\\d*)$", "\\2", ColType, ignore.case = TRUE, perl = TRUE),
         # append "_all" so the spread columns don't collide with the existing id column
         ColType = paste0(gsub("^(\\D*)(\\d*)$", "\\1", ColType, ignore.case = TRUE, perl = TRUE), "_all")) %>%
  spread(ColType, ColValue) %>%
  select(-id_number)
I'm sure that there is a more elegant solution, but you can try:
library(dplyr)
library(tidyr)
library(readr)  # parse_number() comes from readr
df %>%
  gather(var, name_all, -matches("id|institution")) %>%
  gather(var2, val, -c(id, institution, var, name_all)) %>%
  mutate(id_all = ifelse(parse_number(var) == parse_number(var2), val, NA)) %>%
  na.omit() %>%
  select(-var, -var2, -val) %>%
  arrange(id)
id institution name_all id_all
1 1 usp Miles_Davis 123
2 1 usp Arturo_Sandoval 111
3 2 unb Chet_Baker 321
4 2 unb Clifford_Brown 121
5 3 usp Wayne_Shorter 222
6 3 usp Hermeto_Pascoal 322
7 4 Puc-rio John_Coltrane 333
8 4 Puc-rio Charlie_Parker 112
First, it transforms the data from wide to long, excluding the variables named id or institution. Second, it performs a second wide-to-long transformation so that all the numbered "id" variables and their values become separate rows. Third, it checks whether the "name" variable has the same number as the "id" variable; if so, it assigns the appropriate value, otherwise NA. Finally, it removes the rows with NAs and the redundant variables, and arranges the data.
Sample data:
df <- read.table(text = "
id institution name1 id1 name2 id2
1 usp Miles_Davis 123 Arturo_Sandoval 111
2 unb Chet_Baker 321 Clifford_Brown 121
3 usp Wayne_Shorter 222 Hermeto_Pascoal 322
4 Puc-rio John_Coltrane 333 Charlie_Parker 112", header = TRUE, stringsAsFactors = FALSE)
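For completeness, current tidyr can pair the columns in a single step with pivot_longer() and its ".value" sentinel. A sketch, assuming tidyr >= 1.0; the paired columns are renamed first so the output names match the question and don't collide with the existing id column:
library(dplyr)
library(tidyr)
df %>%
  rename_with(~ sub("^name(\\d+)$", "name_all\\1", .x)) %>%
  rename_with(~ sub("^id(\\d+)$", "id_all\\1", .x)) %>%
  pivot_longer(-c(id, institution),
               names_to = c(".value", "pair"),
               names_pattern = "(name_all|id_all)(\\d+)") %>%
  select(-pair)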
This is what my dataframe looks like:
df <- read.table(text='
Name ActivityType ActivityDate
John Email 2014-01-01
John Webinar 2014-01-05
John Webinar 2014-01-20
John Email 2014-04-20
Tom Email 2014-01-01
Tom Webinar 2014-01-05
Tom Webinar 2014-01-20
Tom Email 2014-04-20
', header=T, row.names = NULL)
I have this vector x which contains different dates:
x <- c("2014-01-03", "2014-01-25", "2015-05-27")
I want to insert rows into my original dataframe so that these dates from the x vector are incorporated. This is what the output should look like:
Name ActivityType ActivityDate
John Email 2014-01-01
John NA 2014-01-03
John Webinar 2014-01-05
John Webinar 2014-01-20
John NA 2014-01-25
John Email 2014-04-20
John NA 2015-05-27
Tom Email 2014-01-01
Tom NA 2014-01-03
Tom Webinar 2014-01-05
Tom Webinar 2014-01-20
Tom NA 2014-01-25
Tom Email 2014-04-20
Tom NA 2015-05-27
Sincerely appreciate your help!
It looks like you've added each of the 'new' dates against each of the people, correct?
In that case you can turn your x into a data.frame and merge/join it on:
## original dataframe
df <- data.frame(Name = c(rep("John", 4), rep("Tom", 4)),
                 ActivityType = c("Email", "Webinar", "Webinar", "Email", "Email", "Webinar", "Webinar", "Email"),
                 ActivityDate = c("2014-01-01", "2014-01-05", "2014-01-20", "2014-04-20",
                                  "2014-01-01", "2014-01-05", "2014-01-20", "2014-04-20"))
## Turning x into a dataframe.
x <- data.frame(ActivityDate = rep(c("2014-01-03","2014-01-25","2015-05-27"), 2),
Name = rep(c("John","Tom"), 3))
merge(df, x, by = c("Name", "ActivityDate"), all = TRUE)
# Name ActivityDate ActivityType
# 1 John 2014-01-01 Email
# 2 John 2014-01-03 <NA>
# 3 John 2014-01-05 Webinar
# 4 John 2014-01-20 Webinar
# 5 John 2014-01-25 <NA>
# 6 John 2014-04-20 Email
# 7 John 2015-05-27 <NA>
# 8 Tom 2014-01-01 Email
# 9 Tom 2014-01-03 <NA>
# 10 Tom 2014-01-05 Webinar
# 11 Tom 2014-01-20 Webinar
# 12 Tom 2014-01-25 <NA>
# 13 Tom 2014-04-20 Email
# 14 Tom 2015-05-27 <NA>
Update
As you are having memory issues, you can use data.table like this:
library(data.table)
dt <- as.data.table(df)
x_dt <- as.data.table(x)
merge(dt, x_dt, by=c("Name","ActivityDate"), all=T)
or, if you're not looking to merge, you can rbind them using data.table's rbindlist:
rbindlist(list(dt, x_dt), fill = TRUE)  ## fill = TRUE sets 'ActivityType' to NA for the rows coming from x
Update 2
To generate your x with 16000 unique names (I've used numbers here, but the principle is the same) and 31 dates:
ActivityDates <- seq(as.Date("2014-01-01"), as.Date("2014-01-31"), by=1)
Names <- seq(1,16000)
x <- data.frame(Name = rep(Names, length(ActivityDates)),
                ActivityDate = rep(ActivityDates, length(Names)))
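Note that the rep() pairing above only covers every name/date combination because 16000 and 31 share no common factor; expand.grid() builds all combinations regardless of the vector lengths, so it is the safer way to construct x:
x <- expand.grid(Name = Names, ActivityDate = ActivityDates)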
1) expand.grid Using expand.grid, create a data frame adds with the rows to be added, then use rbind to combine df and adds, converting the ActivityDate column to "Date" class. Then sort. No packages are used.
adds <- expand.grid(Name = unique(df$Name), ActivityType = NA, ActivityDate = x)
both <- transform(rbind(df, adds), ActivityDate = as.Date(ActivityDate))
o <- with(both, order(Name, ActivityDate))
both[o, ]
giving:
Name ActivityType ActivityDate
1 John Email 2014-01-01
9 John <NA> 2014-01-03
2 John Webinar 2014-01-05
3 John Webinar 2014-01-20
11 John <NA> 2014-01-25
4 John Email 2014-04-20
13 John <NA> 2015-05-27
5 Tom Email 2014-01-01
10 Tom <NA> 2014-01-03
6 Tom Webinar 2014-01-05
7 Tom Webinar 2014-01-20
12 Tom <NA> 2014-01-25
8 Tom Email 2014-04-20
14 Tom <NA> 2015-05-27
2) sqldf This uploads adds and df to an SQLite database which it creates on the fly, then performs the SQL query and downloads the result. The computation occurs outside of R, so it might work with your large data.
adds <- data.frame(Name = NA, ActivityDate = x)
library(sqldf)
sqldf("select *
from (select *
from df
union
select a.Name, NULL ActivityType, ActivityDate
from (select distinct Name from df) a
cross join adds b
) order by 1, 3"
)
giving:
Name ActivityType ActivityDate
1 John Email 2014-01-01
2 John <NA> 2014-01-03
3 John Webinar 2014-01-05
4 John Webinar 2014-01-20
5 John <NA> 2014-01-25
6 John Email 2014-04-20
7 John <NA> 2015-05-27
8 Tom Email 2014-01-01
9 Tom <NA> 2014-01-03
10 Tom Webinar 2014-01-05
11 Tom Webinar 2014-01-20
12 Tom <NA> 2014-01-25
13 Tom Email 2014-04-20
14 Tom <NA> 2015-05-27