I want to group my data by one column and paste the character strings from a different column into a single row. Suppose, for example, I have a data.frame A:
library(dplyr)
A <- data.frame(student = rep(c("John Smith", "Jane Smith"), 3),
variable1 = rep(c("Var1", "Var1", "Var2"), 2))
A <- arrange(A, student)
student variable1
1 Jane Smith Var1
2 Jane Smith Var1
3 Jane Smith Var2
4 John Smith Var1
5 John Smith Var2
6 John Smith Var1
But, I need to transform data.frame A into data.frame B, grouped by the student variable and pasting any variations from variable1 together:
B <- data.frame(student = c("John Smith", "Jane Smith"),
variable1 = c(paste("Var1", "Var2", sep = ","),
paste("Var1", "Var2", sep = ",")))
student variable1
1 John Smith Var1,Var2
2 Jane Smith Var1,Var2
I've attempted numerous group_by and mutate clauses from the dplyr package but haven't found success.
You can use the data.table package to do this easily, and quickly if you set student to be your key:
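library(data.table)
A <- data.table(A)
setkey(A, student)  # keying by student sorts the table and speeds up grouping
# collapse the distinct values of variable1 within each student
B <- A[, .(variable1 = paste(unique(variable1), collapse = ",")), by = student]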
I believe you can use the aggregate function to do what you are looking for.
Is this what you are trying to do?
df=unique(A)
agg=aggregate(df$variable1, list(df$student), paste, collapse=",")
> agg
Group.1 x
1 Jane Smith Var1,Var2
2 John Smith Var1,Var2
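Since the question asks about dplyr, the same idea can probably be expressed with group_by and summarise. The sketch below is only an illustration: it rebuilds the toy data from the question, and setting stringsAsFactors = FALSE is an assumption made so the strings stay character vectors.

```r
library(dplyr)

# Same toy data as in the question
A <- data.frame(
  student   = rep(c("John Smith", "Jane Smith"), 3),
  variable1 = rep(c("Var1", "Var1", "Var2"), 2),
  stringsAsFactors = FALSE
)

# One row per student; the distinct values of variable1 joined with commas
B <- A %>%
  group_by(student) %>%
  summarise(variable1 = paste(unique(variable1), collapse = ","))

B
```

Using summarise rather than mutate is what reduces each group to a single row; mutate would keep all six rows and repeat the pasted string on each.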
I'm working on matching last names between two tables. However, there are some variations that I have to take into account. For instance, I found that "Smith" in db1 can potentially have other forms in db2:
Smith, Smith-Whatever, Smith Jr., Smith Sr., Smith III (any Roman numeral)
Lower/uppercase is also an issue.
I'm trying to implement this logic in dplyr. I found the %ilike% operator in data.table, which seems to work kind of like the SQL equivalent. I can use it like this:
match <- db2 %>%
dplyr::filter(last_name %ilike% "^smith$" | last_name %ilike% "^smith-" | last_name %ilike% "^smith .r" | last_name %ilike% "^smith [ivx]")
Of course the strings wouldn't be hardcoded but rather obtained by iterating through db1. Either way, this is unwieldy.
Hence my question:
Is there a way to combine the functionality of %ilike% with something like %in% - by specifying a vector of regexes the 'ilikes' of which I would match against? Is there a smarter way of doing this?
You can combine the pattern with |. You may use grepl (or str_detect if you are using stringr).
library(dplyr)
# anchored at the start so that e.g. "Blacksmith" is not matched
db2 %>% filter(grepl("^smith($|-| .r| [ivx])", last_name, ignore.case = TRUE))
# last_name
#1 Smith
#2 Smith-Whatever
#3 Smith Jr.
#4 Smith Sr.
#5 Smith III
If you want to construct the pattern dynamically, you can do:
pat <- c('smith', 'smith-', 'smith .r', 'smith [ivx]')
db2 %>% dplyr::filter(grepl(paste0(pat, collapse = "|"), last_name, ignore.case = TRUE))
Also, would it be enough to keep every row that contains 'smith' anywhere in the name?
db2 %>% filter(grepl('smith', last_name, ignore.case = TRUE))
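If the surnames come from db1, one possible way to build a single anchored pattern for all of them is sketched below. The helper make_surname_pattern, the vector db1_names and the sample rows in db2 are hypothetical and only serve the illustration; the suffix alternatives are the ones listed in the question, and the sketch assumes the surnames contain no regex metacharacters.

```r
# Hypothetical helper: one anchored regex covering every surname in the vector.
# Assumes the surnames contain no regex metacharacters.
make_surname_pattern <- function(surnames) {
  suffix <- "($|-| .r| [ivx])"  # exact match, hyphenated, Jr./Sr., Roman numeral
  paste0("^(", paste(surnames, collapse = "|"), ")", suffix)
}

db1_names <- c("Smith", "Jones")  # stand-in for the last names taken from db1
db2 <- data.frame(
  last_name = c("Smith", "SMITH-Whatever", "Smith Jr.", "Smith III",
                "Blacksmith", "Smithson", "jones Sr."),
  stringsAsFactors = FALSE
)

pat <- make_surname_pattern(db1_names)
matched <- db2[grepl(pat, db2$last_name, ignore.case = TRUE), , drop = FALSE]
matched$last_name
```

Because the pattern is anchored with ^ and requires one of the listed suffixes, "Blacksmith" and "Smithson" are left out, which an unanchored 'smith' would not do.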
Using RonakShah's pat and my db2 below, ...
Filter
You might try an any operator to iterate through each pattern:
db2 %>%
filter(rowSums(sapply(pat, grepl, name)) > 0)
# name
# 1 smith
# 2 smith-something
And since data.table::%ilike% and data.table::%like% are really using grepl under the hood, this is about the same thing.
Merge/join
If your patterns are in a new frame, you can join them in with:
patdf <- data.frame(ptn = pat, num = seq_along(pat))
patdf
# ptn num
# 1 smith 1
# 2 smith- 2
# 3 smith .r 3
# 4 smith [ivx] 4
fuzzyjoin::regex_left_join(db2, patdf, by = c("name" = "ptn"))
# name ptn num
# 1 smith smith 1
# 2 jones <NA> NA
# 3 hubert <NA> NA
# 4 smith-something smith 1
# 5 smith-something smith- 2
Granted, this is multiplying rows, since it matches multiple times. This can be reduced. Let's assume your original data has a unique id field:
db2$id <- 10L + seq_len(nrow(db2))
fuzzyjoin::regex_left_join(db2, patdf, by = c("name" = "ptn")) %>%
filter(!is.na(ptn)) %>%
group_by(id) %>%
slice_min(num) %>%
ungroup()
# # A tibble: 2 x 4
# name id ptn num
# <chr> <int> <chr> <int>
# 1 smith 11 smith 1
# 2 smith-something 14 smith 1
Data
db2 <- structure(list(name = c("smith", "jones", "hubert", "smith-something")), class = "data.frame", row.names = c(NA, -4L))
I have a very large file where I have merged two customer databases. The key is the ID. Where the customer name did not match it shows an NA. I need to accomplish a simple if/then statement where if there is "NA" in column NAME_1 the DESIRED OUTCOME NAME is what is in NAME_2, else use what is in NAME_1
I attempted the following code but get errors
df <- df %>% if (df$NAME_1 == "NA") rename(df$NAME_1 == df$NAME_2)
Simply done with
df$NAME_1[is.na(df$NAME_1)] <- df$NAME_2[is.na(df$NAME_1)]
This is just subsetting the values in each of the vectors to elements in positions where it is NA in NAME_1
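To make the difference from the attempt in the question concrete, here is a small sketch with made-up rows (the three-row df is an assumption for illustration only). It shows why comparing against the string "NA" does not find missing values, while is.na() gives a usable logical index.

```r
df <- data.frame(
  NAME_1 = c(NA, "STEVE", NA),
  NAME_2 = c("MARY", "STEVE", "JAN"),
  stringsAsFactors = FALSE
)

# Comparing with the string "NA" yields NA for missing entries, never TRUE
compared <- df$NAME_1 == "NA"

# is.na() returns TRUE exactly where NAME_1 is missing
missing <- is.na(df$NAME_1)

# Fill the missing positions of NAME_1 from the same positions of NAME_2
df$NAME_1[missing] <- df$NAME_2[missing]
df$NAME_1
```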
I'd prefer to do this with a data.table rather than a data.frame. After converting with setDT(df), you can update NAME_1 by reference:
df[is.na(NAME_1), NAME_1 := NAME_2]
Using ifelse from base R,
df$'DESIRED OUTCOME NAME' = ifelse(is.na(df$NAME_1), df$NAME_2, df$NAME_1)
A solution with dummy data according to the info you supplied:
df <- data.frame(ID = c(1,6,3,5,6,2,8,9),
NAME_1 = c(NA, "STEVE", NA, "JULIE", "BOB", "AMY", NA, "BRUCE"),
NAME_2 = c("MARY", "STEVE", "JAN", "JULIE", "BOB", "AMY", "FRANK", "BRUCE"))
library(dplyr)
df %>%
dplyr::mutate(NEW_COLUMN = ifelse(is.na(NAME_1), NAME_2, NAME_1))
In this particular case, the simplest solution would be to use dplyr::coalesce
library(dplyr)
df %>% mutate(`DESIRED OUTCOME NAME` = coalesce(NAME_1, NAME_2))
#OR
df$`DESIRED OUTCOME NAME` <- coalesce(df$NAME_1, df$NAME_2)
ID NAME_1 NAME_2 DESIRED OUTCOME NAME
1 1 <NA> MARY MARY
2 6 STEVE STEVE STEVE
3 3 <NA> JAN JAN
4 5 JULIE JULIE JULIE
5 6 BOB BOB BOB
6 2 AMY AMY AMY
7 8 <NA> FRANK FRANK
8 9 BRUCE BRUCE BRUCE
I'm attempting to create a variable in one long dataset (df1) where the value in each row needs to be based off of matching some conditions in another long dataset (df2). The conditions are:
- match on "name"
- the value for df1 should consider observations for that person that occurred before the observation in df1.
- Then I need the number of rows within that subset that meet a third condition (in the data below called "condition")
I've already tried running a for loop (I know, not preferred in R) to write it for each row in 1:nrow(df1), but I keep running into an issue that in my actual data, df1 and df2 are not the same length or a multiple.
I've also tried writing a function and applying it to df1. I tried applying it using apply, but I can't accept two dataframes in the apply syntax. I tried giving it a list of dataframes and using lapply, but it returns back null values.
Here is some generic data that fits the format of the data I'm working with.
df1 <- data.frame(
name = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
date_b = sample(seq(as.Date('2014/01/01'), as.Date('2019/10/01'), by="day"), 4))
df2 <- data.frame(
name = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
date_a = sample(seq(as.Date('2014/01/01'), as.Date('2019/10/01'), by="day"), 4),
condition = c("A", "B", "C", "A")
)
I know the way to get the number of rows could look something like this:
num_conditions <- nrow(df2[which(df1$name == df2$name & df2$date_a < df1$date_b & df2$condition == "A"), ])
What I would like to see in df1 is a column called "num_conditions" that shows the number of observations in df2 for that person that occurred before date_b in df1 and met condition "A".
df1 should look like this:
name date_b num_conditions
John Smith 10/1/15 1
John Smith 11/15/16 0
John Smith 9/19/19 0
I'm sure there are better ways to approach this, including data.table, but here is one using dplyr:
library(dplyr)
set.seed(12)
df2 %>%
filter(condition == "A") %>%
right_join(df1, by = "name") %>%
group_by(name, date_b) %>%
filter(date_a < date_b) %>%
mutate(num_conditions = n()) %>%
right_join(df1, by = c("name", "date_b")) %>%
mutate(num_conditions = coalesce(num_conditions, 0L)) %>%
select(-c(date_a, condition)) %>%
distinct()
# A tibble: 4 x 3
# Groups: name, date_b [4]
name date_b num_conditions
<fct> <date> <int>
1 John Smith 2016-10-13 2
2 John Smith 2015-11-10 2
3 Jane Smith 2016-07-18 1
4 Jane Smith 2018-03-13 1
R> df1
name date_b
1 John Smith 2016-10-13
2 John Smith 2015-11-10
3 Jane Smith 2016-07-18
4 Jane Smith 2018-03-13
R> df2
name date_a condition
1 John Smith 2015-04-16 A
2 John Smith 2014-09-27 A
3 Jane Smith 2017-04-25 C
4 Jane Smith 2015-08-20 A
Maybe the following is what the question is asking for.
library(tidyverse)
df1 %>%
left_join(df2 %>% filter(condition == 'A'), by = 'name') %>%
filter(date_a < date_b) %>%
group_by(name) %>%
mutate(num_conditions = n()) %>%
select(-date_a, -condition) %>%
full_join(df1) %>%
mutate(num_conditions = ifelse(is.na(num_conditions), 0, num_conditions))
#Joining, by = c("name", "date_b")
## A tibble: 4 x 3
## Groups: name [2]
# name date_b num_conditions
# <fct> <date> <dbl>
#1 John Smith 2019-05-07 2
#2 John Smith 2019-02-05 2
#3 Jane Smith 2016-05-03 0
#4 Jane Smith 2018-06-23 0
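For comparison, a row-by-row count in base R is sketched below. It does not require df1 and df2 to have the same number of rows, which was the problem with the loop described in the question. The dates are fixed, made-up values (an assumption, so that the result is reproducible), and the conditions follow the df2 printed in the first answer.

```r
df1 <- data.frame(
  name   = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
  date_b = as.Date(c("2016-10-13", "2015-01-10", "2016-07-18", "2015-03-13")),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  name      = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
  date_a    = as.Date(c("2015-04-16", "2014-09-27", "2017-04-25", "2015-08-20")),
  condition = c("A", "A", "C", "A"),
  stringsAsFactors = FALSE
)

# For each row of df1, count the df2 rows with the same name,
# an earlier date, and condition "A"
df1$num_conditions <- vapply(
  seq_len(nrow(df1)),
  function(i) {
    sum(df2$name == df1$name[i] &
          df2$date_a < df1$date_b[i] &
          df2$condition == "A")
  },
  integer(1)
)
df1
```

Each df1 row is compared against all of df2, so the count is specific to that row's date_b rather than shared by every row with the same name. On large data a non-equi join in data.table would likely scale better.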
I have data that looks like the following data frame, but every combination has about ten fields, starting with name1, adress1, city1, etc.
id name1 adress1 name2 adress2 name3 adress3
1 1 John street a Burt street d chris street 1
2 2 Jack street b Ben street e connor street 2
3 3 Joey <NA> Bob street f <NA> <NA>
Now I would like to rearrange this data so it is a bit more useful and it should look like so, but with the information from which entry it came from:
id origin names adresses
1 1 1 John street a
2 2 1 Jack street b
3 3 1 Joey <NA>
4 1 2 Burt street d
5 2 2 Ben street e
6 3 2 Bob street f
7 1 3 chris street 1
8 2 3 connor street 2
Using tidyr I can get a long format, but then I have a key column that contains all the variable names, name1, name2, name3, street1, etc.
I also tried using separate dataframes, one for each combination, e.g. one dataframe for the names, one for the streets, etc. But then joining everything back together results in the wrong records, because you can only join on id, and in a long format this id is replicated. I have also been looking into reshape2, but that results in the same issue.
All the conversions of wide to long I have seen are when you have one column you want to convert to. I'm looking for the end result in 10 columns, or as in the example 2 columns.
Is there a function that I'm overlooking?
#code to generate the dataframes:
df <- data.frame(id = c(1,2,3),
name1 = c("John", "Jack", "Joey"),
adress1 = c("street a", "street b", NA),
name2 = c("Burt", "Ben", "Bob"),
adress2 = c("street d", "street e", "street f"),
name3 = c("chris", "connor", NA),
adress3 = c("street 1", "street 2", NA),
stringsAsFactors = FALSE)
expecteddf <- data.frame(id = c(1,2,3,1,2,3,1,2),
origin = c(rep(1, 3), rep(2, 3), rep(3, 2)),
names = c("John", "Jack", "Joey", "Burt", "Ben", "Bob", "chris", "connor"),
adresses = c("street a", "street b", NA, "street d", "street e", "street f", "street 1", "street 2"),
stringsAsFactors = FALSE
)
We could use melt from data.table (v1.9.5+, the devel version when this was written), which can take multiple patterns for the measure columns.
We convert the 'data.frame' to 'data.table' (setDT(df)), melt, and specify the regex in the patterns of measure argument. Remove the rows that are NA for the 'names' and 'address' column.
library(data.table)#v1.9.5+
dM <- melt(setDT(df), measure=patterns(c('^name', '^adress')),
value.name=c('names', 'address') )
dM[!(is.na(names) & is.na(address))]
# id variable names address
#1: 1 1 John street a
#2: 2 1 Jack street b
#3: 3 1 Joey NA
#4: 1 2 Burt street d
#5: 2 2 Ben street e
#6: 3 2 Bob street f
#7: 1 3 chris street 1
#8: 2 3 connor street 2
Or we can use reshape from base R.
dM2 <- reshape(df, idvar='id', varying=list(grep('name', names(df)),
grep('adress', names(df))), v.names=c('names', 'adresses'),
timevar='origin', direction='long')
The NA rows can be removed as in the data.table solution by using standard 'data.frame' indexing after we create the logical index with is.na.
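Another option, assuming tidyr 1.0.0 or later is available, is pivot_longer with the special ".value" name, which keeps one output column per field instead of a single key column. This is only a sketch on the data from the question; the names_pattern regex assumes every wide column is made of lowercase letters followed by digits.

```r
library(tidyr)

df <- data.frame(id = c(1, 2, 3),
                 name1 = c("John", "Jack", "Joey"),
                 adress1 = c("street a", "street b", NA),
                 name2 = c("Burt", "Ben", "Bob"),
                 adress2 = c("street d", "street e", "street f"),
                 name3 = c("chris", "connor", NA),
                 adress3 = c("street 1", "street 2", NA),
                 stringsAsFactors = FALSE)

# ".value" turns the letter part (name, adress) into output columns;
# the digit part becomes the origin column
long <- pivot_longer(
  df,
  cols = -id,
  names_to = c(".value", "origin"),
  names_pattern = "([a-z]+)([0-9]+)"
)

# Drop rows where every value column is missing, then sort as in expecteddf
long <- long[!(is.na(long$name) & is.na(long$adress)), ]
long <- long[order(long$origin, long$id), ]
long
```

With ten fields per combination the call stays the same, since every prefix matched by the regex becomes its own column. Note that origin comes back as a character column and the value columns keep the prefixes name and adress rather than names and adresses.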
Assume the data frame looks like this:
Name Score
1 John 10
2 John 2
3 James 5
I would like to compute the mean of all the Score values that belong to the name John.
You can easily perform a mean of every person's score with aggregate:
> aggregate(Score ~ Name, data=d, FUN=mean)
Name Score
1 James 5
2 John 6
Using dplyr:
For each name:
library(dplyr)
df %>%
group_by(Name) %>%
summarise(Value=mean(Score))
Name Value
1 James 5
2 John 6
Filtering:
filter(df, Name=="John") %>%
group_by(Name) %>%
summarise(Value=mean(Score))
Name Value
1 John 6
Using sqldf:
library(sqldf)
sqldf("SELECT Name, avg(Score) AS Score
FROM df GROUP BY Name")
Name Score
1 James 5
2 John 6
Filtering:
sqldf("SELECT Name, avg(Score) AS Score
FROM df
WHERE Name LIKE 'John'")
Name Score
1 John 6
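If only John's mean is needed and no extra package is wanted, plain base R subsetting is probably enough. The sketch below rebuilds the three-row example from the question; the variable names john_mean and by_name are made up for illustration.

```r
df <- data.frame(Name  = c("John", "John", "James"),
                 Score = c(10, 2, 5),
                 stringsAsFactors = FALSE)

# Mean of the scores in the rows where Name is "John"
john_mean <- mean(df$Score[df$Name == "John"])
john_mean

# Mean per name, returned as a named vector
by_name <- tapply(df$Score, df$Name, mean)
by_name
```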