Use merge with one data frame in R [duplicate]

This question already has answers here:
Transpose / reshape dataframe without "timevar" from long to wide format
(9 answers)
Closed 4 years ago.
I have one data frame in R with duplicate indexes stored in the first column.
df <- data.frame("Index" = c(1,2,1), "Age" = c("Jane Doe","John Doe","Jane
Doe"), "Address" = c("123 Fake Street","780 York Street","456 Elm
Street"),"Telephone" = c("xxx-xxx-xxxx","zzz-zzz-zzzz","yyy-yyy-yyyy"))
Index Age      Address         Telephone
1     Jane Doe 123 Fake Street xxx-xxx-xxxx
2     John Doe 780 York Street zzz-zzz-zzzz
1     Jane Doe 456 Elm Street  yyy-yyy-yyyy
I would like to combine the above data frame to look like:
Index Age      Address         Telephone    Address2       Telephone2
1     Jane Doe 123 Fake Street xxx-xxx-xxxx 456 Elm Street yyy-yyy-yyyy
2     John Doe 780 York Street zzz-zzz-zzzz NA             NA
Can I use "merge" on the same data frame or is their another command in R that would accomplish this task? Thank you.

With tidyverse:
library(tidyverse)
df %>%
  group_by(Age) %>%
  summarize_at(vars(Telephone, Address), paste, collapse = "|") %>%
  separate(Address, into = c("Address1", "Address2"), sep = "\\|") %>%
  separate(Telephone, into = c("Telephone1", "Telephone2"), sep = "\\|")
# # A tibble: 2 x 5
#   Age      Telephone1   Telephone2   Address1        Address2
#   <fct>    <chr>        <chr>        <chr>           <chr>
# 1 Jane Doe xxx-xxx-xxxx yyy-yyy-yyyy 123 Fake Street 456 Elm Street
# 2 John Doe zzz-zzz-zzzz <NA>         780 York Street <NA>
To be more general, we can nest the values using summarize and list, then reformat the content to unnest it with the right format:
df %>%
  group_by(Age) %>%
  summarize_at(vars(Telephone, Address),
               ~ lst(setNames(invoke(tibble, .), seq_along(.)))) %>%
  unnest(.sep = "")
# # A tibble: 2 x 5
#   Age      Telephone1   Telephone2   Address1        Address2
#   <fct>    <fct>        <fct>        <fct>           <fct>
# 1 Jane Doe xxx-xxx-xxxx yyy-yyy-yyyy 123 Fake Street 456 Elm Street
# 2 John Doe zzz-zzz-zzzz <NA>         780 York Street <NA>
The function inside summarize is a bit scary, but you can wrap it under a friendlier name if you want to reuse it (I added a names parameter just in case):
nest2row <- function(x, names = seq_along(x))
  lst(setNames(invoke(tibble, x), names[seq_along(x)]))
df %>%
  group_by(Age) %>%
  summarize_at(vars(Telephone, Address), nest2row) %>%
  unnest(.sep = "")
And this, I suppose, would be the recommended tidy way:
df %>%
  group_by(Age) %>%
  mutate(id = row_number()) %>%
  gather(key, value, Address, Telephone) %>%
  unite(key, key, id, sep = "") %>%
  spread(key, value)
# # A tibble: 2 x 6
# # Groups:   Age [2]
#   Index Age      Address1        Address2       Telephone1   Telephone2
#   <dbl> <fct>    <chr>           <chr>          <chr>        <chr>
# 1     1 Jane Doe 123 Fake Street 456 Elm Street xxx-xxx-xxxx yyy-yyy-yyyy
# 2     2 John Doe 780 York Street <NA>           zzz-zzz-zzzz <NA>
With my second solution you keep your factors, and you avoid the awkward coercion of different variable types into the same column that the idiomatic way entails.
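On tidyr 1.0 or later, the same reshape can also be written with pivot_wider(); a minimal sketch, column names assumed from the df in the question:
df %>%
  group_by(Age) %>%
  mutate(id = row_number()) %>%
  pivot_wider(names_from = id, values_from = c(Address, Telephone),
              names_sep = "")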

Try something like this:
df <- data.frame("Index" = c(1, 2, 1),
                 "Age" = c("Jane Doe", "John Doe", "Jane Doe"),
                 "Address" = c("123 Fake Street", "780 York Street", "456 Elm Street"),
                 "Telephone" = c("xxx-xxx-xxxx", "zzz-zzz-zzzz", "yyy-yyy-yyyy"),
                 stringsAsFactors = FALSE)
# Build a unique person key from Index and Age
df$unindex <- paste(df$Index, df$Age)
sapply(unique(df$unindex), function(li) {  # e.g. li = "1 Jane Doe"
  dft <- df[li == df$unindex, 3:4]
  # One row: keep as is; several rows: flatten them into a single vector
  if (nrow(dft) == 1) dft else c(t(dft))
})

Related

Count string length using external table

Suppose you have a table of data:
library(tibble)
df <- tibble(person = c("Alice", "Bob", "Mary"),
             colour = c("Red", "Green", "Blue"),
             city = c("London", "Paris", "New York"))
# A tibble: 3 x 3
  person colour city
  <chr>  <chr>  <chr>
1 Alice  Red    London
2 Bob    Green  Paris
3 Mary   Blue   New York
And a second table which contains the field names and the maximum string length of each field:
len <- tibble(field_name = c("person", "colour", "city"),
              field_length = c(12, 4, 6))
# A tibble: 3 x 2
  field_name field_length
  <chr>             <dbl>
1 person               12
2 colour                4
3 city                  6
How can I check, for each field in len, whether each string in df is less than or equal to len$field_length characters, returning the rows which fail the test?
As an example:
Row 1 in df would pass because:
'Alice' is <= 12 characters long,
'Red' is <= 4 characters long, and
'London' is <= 6 characters long.
However, Row 2 would fail because:
'Green' is > 4 characters long,
and Row 3 would fail because:
'New York' is > 6 characters long.
Thus the returned data frame should contain only Rows 2 and 3 of the original df.
A dplyr solution with c_across():
library(dplyr)
df %>%
  rowwise() %>%
  filter(any(nchar(c_across(everything())) > len$field_length)) %>%
  ungroup()
# # A tibble: 2 x 3
#   person colour city
#   <chr>  <chr>  <chr>
# 1 Bob    Green  Paris
# 2 Mary   Blue   New York
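Note that c_across(everything()) relies on df's columns being in the same order as len$field_name; selecting the columns explicitly in len's order makes the comparison robust. A minimal sketch:
df %>%
  rowwise() %>%
  filter(any(nchar(c_across(all_of(len$field_name))) > len$field_length)) %>%
  ungroup()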
Using base R with mapply:
df[rowSums(mapply(function(x, y) nchar(x) > y, df, len$field_length)) > 0, ]
# A tibble: 2 x 3
#   person colour city
#   <chr>  <chr>  <chr>
# 1 Bob    Green  Paris
# 2 Mary   Blue   New York
If the column names in df are not in the same order as len$field_name, use df[len$field_name] in mapply, as sketched below.
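For example (a sketch that assumes every field in len exists in df):
df[rowSums(mapply(function(x, y) nchar(x) > y,
                  df[len$field_name], len$field_length)) > 0, ]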
In tidyverse, we can get the data in long format, join it with len by column name, keep the rows which fail, and reshape back to wide format.
library(dplyr)
library(tidyr)
df %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row) %>%
  left_join(len, by = c('name' = 'field_name')) %>%
  group_by(row) %>%
  filter(any(nchar(value) > field_length)) %>%
  dplyr::select(-field_length) %>%
  pivot_wider()
It's easier to solve your problem in terms of two matrices. First, the length of each of your entries:
nchar(as.matrix(df))
     person colour city
[1,]      5      3    6
[2,]      3      5    5
[3,]      4      4    8
And a corresponding matrix of allowed lengths (note the t(): replicate() fills by column, so we transpose so that the limits line up with df's columns):
allowed = t(replicate(nrow(df), len$field_length[match(colnames(df), len$field_name)]))
allowed
     [,1] [,2] [,3]
[1,]   12    4    6
[2,]   12    4    6
[3,]   12    4    6
Then compare the two matrices elementwise, and keep only the rows where at least one entry is too long (i.e. where the row mean of the logical matrix is positive):
df[rowMeans(nchar(as.matrix(df)) > allowed) > 0, ]
# A tibble: 2 x 3
  person colour city
  <chr>  <chr>  <chr>
1 Bob    Green  Paris
2 Mary   Blue   New York
If your two data frames are already in the same column order, as in your example, you can skip the match() and sweep the limits across the columns directly (thanks to @zx8754 for pointing out the simplification). A bare nchar(as.matrix(df)) > len$field_length would recycle the limits down the columns rather than across them, hence the sweep():
df[rowMeans(sweep(nchar(as.matrix(df)), 2, len$field_length, ">")) > 0, ]
# A tibble: 2 x 3
  person colour city
  <chr>  <chr>  <chr>
1 Bob    Green  Paris
2 Mary   Blue   New York
Pivot df into the same format as len and join the two. After this, it is trivial to compare each string to the field_length.
library(tidyverse)
test_result_df <- df %>%
  mutate(id = row_number()) %>%
  pivot_longer(-id, names_to = 'field_name') %>%
  left_join(len, by = 'field_name') %>%
  mutate(test_passed = str_length(value) <= field_length) %>%
  group_by(id) %>%
  summarise(all_passed = all(test_passed))
df[!test_result_df$all_passed, ]
# A tibble: 2 x 3
  person colour city
  <chr>  <chr>  <chr>
1 Bob    Green  Paris
2 Mary   Blue   New York

Creating a variable based off of matching conditions in two datasets

I'm attempting to create a variable in one long dataset (df1) where the value in each row is based on matching conditions in another long dataset (df2). The conditions are:
- match on "name";
- only consider observations in df2 for that person that occurred before the observation in df1;
- count the rows within that subset that meet a third condition (called "condition" in the data below).
I've already tried a for loop over 1:nrow(df1) (I know, not preferred in R), but I keep running into the issue that in my actual data df1 and df2 do not have the same number of rows, nor is one a multiple of the other.
I've also tried writing a function and applying it to df1. With apply I can't pass two data frames in the syntax, and giving lapply a list of data frames just returns NULL values.
Here is some generic data that fits the format of the data I'm working with.
df1 <- data.frame(
  name = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
  date_b = sample(seq(as.Date('2014/01/01'), as.Date('2019/10/01'), by = "day"), 4))
df2 <- data.frame(
  name = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
  date_a = sample(seq(as.Date('2014/01/01'), as.Date('2019/10/01'), by = "day"), 4),
  condition = c("A", "B", "C", "A")
)
I know the way to get the number of rows could look something like this:
num_conditions <- nrow(df2[which(df1$name == df2$name & df2$date_a < df1$date_b & df2$condition == "A"), ])
What I would like to see in df1 is a column called "num_conditions" showing the number of observations in df2 for that person that occurred before date_b in df1 and met condition "A".
df1 should look like this:
name date_b num_conditions
John Smith 10/1/15 1
John Smith 11/15/16 0
John Smith 9/19/19 0
I'm sure there are better ways to approach this, including data.table (a sketch of that route follows the output below), but here is one using dplyr:
library(dplyr)
set.seed(12)
df2 %>%
  filter(condition == "A") %>%
  right_join(df1, by = "name") %>%
  group_by(name, date_b) %>%
  filter(date_a < date_b) %>%
  mutate(num_conditions = n()) %>%
  right_join(df1, by = c("name", "date_b")) %>%
  mutate(num_conditions = coalesce(num_conditions, 0L)) %>%
  select(-c(date_a, condition)) %>%
  distinct()
# A tibble: 4 x 3
# Groups:   name, date_b [4]
  name       date_b     num_conditions
  <fct>      <date>              <int>
1 John Smith 2016-10-13              2
2 John Smith 2015-11-10              2
3 Jane Smith 2016-07-18              1
4 Jane Smith 2018-03-13              1
R> df1
        name     date_b
1 John Smith 2016-10-13
2 John Smith 2015-11-10
3 Jane Smith 2016-07-18
4 Jane Smith 2018-03-13
R> df2
        name     date_a condition
1 John Smith 2015-04-16         A
2 John Smith 2014-09-27         A
3 Jane Smith 2017-04-25         C
4 Jane Smith 2015-08-20         A
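As for data.table, a minimal sketch of the same per-row count using a non-equi join, assuming the df1/df2 shown above (.N is 0 for df1 rows with no match):
library(data.table)
setDT(df1); setDT(df2)
# For each df1 row, count df2 rows with the same name, an earlier date_a,
# and condition "A"
df1[, num_conditions := df2[condition == "A"][df1, on = .(name, date_a < date_b),
                                              .N, by = .EACHI]$N]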
Maybe the following is what the question is asking for.
library(tidyverse)
df1 %>%
  left_join(df2 %>% filter(condition == 'A'), by = 'name') %>%
  filter(date_a < date_b) %>%
  group_by(name) %>%
  mutate(num_conditions = n()) %>%
  select(-date_a, -condition) %>%
  full_join(df1) %>%
  mutate(num_conditions = ifelse(is.na(num_conditions), 0, num_conditions))
#Joining, by = c("name", "date_b")
## A tibble: 4 x 3
## Groups:   name [2]
#  name       date_b     num_conditions
#  <fct>      <date>              <dbl>
#1 John Smith 2019-05-07              2
#2 John Smith 2019-02-05              2
#3 Jane Smith 2016-05-03              0
#4 Jane Smith 2018-06-23              0
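Note that grouping by name alone gives every row of a person the same count; to count per df1 row, group by both name and date_b. A sketch of that variant (.groups requires dplyr >= 1.0):
df1 %>%
  left_join(df2 %>% filter(condition == 'A'), by = 'name') %>%
  group_by(name, date_b) %>%
  # na.rm drops the all-NA case where a person has no condition-"A" rows
  summarise(num_conditions = sum(date_a < date_b, na.rm = TRUE), .groups = "drop")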

R JSON to tibble

I have the following data passed back from an API and I cannot change its structure. I would like to convert the following JSON into a tibble.
data <- '{ "ids":{
"00000012664":{
"state":"Indiana",
"version":"10",
"external_ids":[
{
"db":"POL",
"db_id":"18935"
},
{
"db":"CIT",
"db_id":"1100882"
}
],
"id":"00000012520",
"name":"Joe Smith",
"aliases":[
"John Smith",
"Bill Smith"
]
},
"00000103162":{
"state":"Kentucky",
"external_ids":[
{
"db":"POL",
"db_id":"69131"
},
{
"db":"CIT",
"db_id":"1098802"
}
],
"id":"00000003119",
"name":"Sue Smith",
"WIP":98203059
} ,
"0000019223":{
"state":"Ohio",
"external_ids":[
{
"db":"POL",
"db_id":"69134"
},
{
"db":"JT",
"db_id":"615234"
}
],
"id":"0000019223",
"name":"Larry Smith",
"WIP":76532172,
"aliases":[
"Test 1",
"Test 2",
"Test 3",
"Test 4"
],
"insured":1
} } }'
Please Note: This is a small subset of the data; the real payload could have thousands of "ids".
I've tried jsonlite and tidyjson with a combination of purrr.
The following gives me a tibble, but I cannot figure out how to get aliases back.
obj <- jsonlite::fromJSON(data, simplifyDataFrame = TRUE, flatten = FALSE)
obj$ids %>% {
  data_frame(id = purrr::map_chr(., 'id'),
             state = purrr::map_chr(., 'state', ''),
             WIP = purrr::map_chr(., 'WIP', .default = ''),
             #aliases = purrr::map(purrr::map_chr(., 'aliases', .default = ''), my_fun)
  )
}
I cannot figure it out with tidyjson either:
data %>% enter_object(ids) %>% gather_object %>% spread_all
What I would like back is a tibble with the following fields (regardless of whether they are in the JSON or not):
id
name
state
version
aliases -> as a comma-separated string
WIP
BONUS: ;-)
Can I get external_ids as a string as well?
Instead of extracting each element with multiple calls to map, an option is to convert each element to a tibble with as_tibble and select the columns of interest; then, grouped by 'id', collapse the 'aliases' into a single string and keep the distinct rows by 'id':
library(tibble)
library(dplyr)
library(purrr)
library(stringr)
map_dfr(obj$ids, ~ as_tibble(.x) %>%
          select(id, one_of("name", "state", "version", "aliases", "WIP"))) %>%
  group_by(id) %>%
  mutate(aliases = toString(unique(aliases))) %>%
  distinct(id, .keep_all = TRUE)
# A tibble: 2 x 6
# Groups:   id [2]
#   id          name      state    version aliases                     WIP
#   <chr>       <chr>     <chr>    <chr>   <chr>                     <int>
# 1 00000012520 Joe Smith Indiana  10      John Smith, Bill Smith       NA
# 2 00000003119 Sue Smith Kentucky <NA>    NA                     98203059
If we also need the 'external_ids' (which is a data.frame)
map_dfr(obj$ids, ~ as_tibble(.x) %>%
          mutate(external_ids = reduce(external_ids, str_c, sep = " "))) %>%
  group_by(id) %>%
  mutate_at(vars(aliases, external_ids), ~ toString(unique(.))) %>%
  ungroup %>%
  distinct(id, .keep_all = TRUE)
# A tibble: 2 x 7
# state version external_ids id name aliases WIP
# <chr> <chr> <chr> <chr> <chr> <chr> <int>
#1 Indiana 10 POL 18935, CIT 1100882 00000012520 Joe Smith John Smith, Bill Smith NA
#2 Kentucky <NA> POL 69131, CIT 1098802 00000003119 Sue Smith NA 98203059
Update
For the new data, we can use
obj$ids %>%
  map_dfr(~ map_df(.x, reduce, str_c, collapse = ", ", sep = " "))
# A tibble: 3 x 8
# state version external_ids id name aliases WIP insured
# <chr> <chr> <chr> <chr> <chr> <chr> <int> <int>
#1 Indiana 10 POL 18935, CIT 1100882 00000012520 Joe Smith John Smith Bill Smith NA NA
#2 Kentucky <NA> POL 69131, CIT 1098802 00000003119 Sue Smith <NA> 98203059 NA
#3 Ohio <NA> POL 69134, JT 615234 0000019223 Larry Smith Test 1 Test 2 Test 3 Test 4 76532172 1
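If strictly comma-separated fields are preferred (as the question asks), a variant of the same idea collapses every field with toString() instead; a sketch on the same obj (note that unlist() flattens external_ids column-wise, so its db values come before the db_ids):
obj$ids %>%
  map_dfr(~ map(.x, unlist) %>%  # flatten external_ids to a plain vector
            map(toString) %>%    # collapse every field to one string
            as_tibble())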

Combine rows based on multiple columns and keep all unique values

I have a dataset with user information. For a specific user I often have multiple rows with more or less complete information. I want to summarize all rows that belong to a customer on the basis of First_Name, Last_Name and Street, keeping the information from all other columns; if there are two unique observations for a specific column, I want to collapse them with ",".
This is what the df looks like:
First_Name Last_Name Street Column1 Colum2 Colum_n
Mike Smith X abc ab a
Mike Smith X abc ad b
John Smith Y xyz xy n
John Smith Y xyz xm NA
My desired output would be
First_Name Last_Name Street Column1 Colum2 Colum_n
Mike Smith X abc ab,ad a,b
John Smith Y xyz xy,xm n
I would like to use dplyr and tried something like
df %>%
  group_by(First_Name, Last_Name, Street) %>%
  summarise_all(funs())
The problem with that approach is that I only had the option of using something like the mean or the first occurring value for a column, which would mean losing values. What I would like are columns with all unique values, without NAs.
You can write your own summarization function like
concat_unique <- function(x) paste(unique(x), collapse = ',')
and then apply it using
summarize_all(concat_unique)
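Putting it together, and also dropping NAs as the question requires; a sketch using the df from the question:
library(dplyr)
# Collapse the unique non-NA values of a column into one string
concat_unique <- function(x) paste(unique(x[!is.na(x)]), collapse = ",")
df %>%
  group_by(First_Name, Last_Name, Street) %>%
  summarize_all(concat_unique)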
A solution using tidyverse.
library(tidyverse)
dat2 <- dat %>%
  group_by(First_Name, Last_Name, Street) %>%
  # Replace NA with ""
  mutate_all(funs(replace(., is.na(.), ""))) %>%
  # Combine all strings
  summarize_all(funs(toString(unique(.)))) %>%
  # Strip the trailing ", " left behind by the empty strings
  mutate_all(funs(str_replace(., ", $", ""))) %>%
  ungroup()
dat2
# # A tibble: 2 x 6
#   First_Name Last_Name Street Column1 Colum2 Colum_n
#   <chr>      <chr>     <chr>  <chr>   <chr>  <chr>
# 1 John       Smith     Y      xyz     xy, xm n
# 2 Mike       Smith     X      abc     ab, ad a, b
After seeing the other answers, I realized that we don't have to deal with NA and "," as strings. The following is more efficient.
dat2 <- dat %>%
  group_by(First_Name, Last_Name, Street) %>%
  # Combine all strings
  summarize_all(funs(toString(unique(.[!is.na(.)])))) %>%
  ungroup()
dat2
# # A tibble: 2 x 6
#   First_Name Last_Name Street Column1 Colum2 Colum_n
#   <chr>      <chr>     <chr>  <chr>   <chr>  <chr>
# 1 John       Smith     Y      xyz     xy, xm n
# 2 Mike       Smith     X      abc     ab, ad a, b
DATA
dat <- read.table(text = 'First_Name Last_Name Street Column1 Colum2 Colum_n
Mike Smith X abc ab a
Mike Smith X abc ad b
John Smith Y xyz xy n
John Smith Y xyz xm NA',
header = TRUE, stringsAsFactors = FALSE)
Using tidyverse:
df %>%
  group_by(First_Name, Last_Name, Street) %>%
  summarise_all(funs(paste0(unique(.[!is.na(.)]), collapse = ",")))
  First_Name Last_Name Street Column1 Colum2 Colum_n
  <fct>      <fct>     <fct>  <chr>   <chr>  <chr>
1 John       Smith     Y      xyz     xy,xm  n
2 Mike       Smith     X      abc     ab,ad  a,b
First, it groups by "First_Name", "Last_Name" and "Street". Then it takes all the unique non-NA values and collapses them into one string.
If you want to keep them as a vector, instead of converting them to a single character string, you can do
library(dplyr)
df %>%
  group_by(First_Name, Last_Name, Street) %>%
  summarise_all(~ list(unique(.[!is.na(.)]))) %>%
  print.data.frame
#   First_Name Last_Name Street Column1 Colum2 Colum_n
# 1       John     Smith      Y     xyz xy, xm       n
# 2       Mike     Smith      X     abc ab, ad    a, b
or with data.table
library(data.table)
setDT(df)
df[, lapply(.SD, function(x) .(unique(x[!is.na(x)]))),
   by = .(First_Name, Last_Name, Street)]
#    First_Name Last_Name Street Column1 Colum2 Colum_n
# 1:       Mike     Smith      X     abc  ab,ad     a,b
# 2:       John     Smith      Y     xyz  xy,xm       n

How to return values from group_by in R dplyr?

Good morning,
I've got a two-column dataset which I'd like to spread to more columns based on a group_by in dplyr, but I'm not sure how.
My data looks like:
Person Case
John A
John B
Bill C
David F
I'd like to be able to transform it to the following structure:
Person Case_1 Case_2 ... Case_n
John A B
Bill C NA
David F NA
My original thought was along the lines of:
data %>%
  group_by(Person) %>%
  spread()
Error: Please supply column name
What's the easiest, or most R-like way to achieve this?
You should first add a case id to the dataset, which can be done with a combination of group_by and mutate:
library(dplyr)
library(tidyr)
dat = data.frame(Person = c('John', 'John', 'Bill', 'David'),
                 Case = c('A', 'B', 'C', 'F'))
dat = dat %>% group_by(Person) %>% mutate(id = sprintf('Case_%d', row_number()))
dat %>% head()
# A tibble: 4 × 3
  Person   Case     id
  <fctr> <fctr>  <chr>
1   John      A Case_1
2   John      B Case_2
3   Bill      C Case_1
4  David      F Case_1
Now you can use spread to transform the data:
dat %>% spread(Person, Case)
# A tibble: 2 × 4
      id   Bill  David   John
*  <chr> <fctr> <fctr> <fctr>
1 Case_1      C      F      A
2 Case_2     NA     NA      B
You can get the structure you listed above using:
res = dat %>% spread(Person, Case) %>% select(-id) %>% t() %>% as.data.frame()
names(res) = unique(dat$id)
res
      Case_1 Case_2
Bill       C   <NA>
David      F   <NA>
John       A      B
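On tidyr 1.0 or later, pivot_wider() reaches the requested layout in one step; a minimal sketch using the dat defined above (the Case_1/Case_2 names come from the desired output):
dat %>%
  group_by(Person) %>%
  mutate(id = sprintf('Case_%d', row_number())) %>%
  ungroup() %>%
  pivot_wider(names_from = id, values_from = Case)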
