Merge rows containing similar strings using dplyr - r

I have a table containing the following data:
df <- tibble(
dose = seq(10, 50, 10),
date = c("2007-12-15", "2007-10-13","2007-10-13","2007-09-30","2007-09-30"),
response = c(45, 67, 66, 54, 55),
name = c("Peter,Martin", "Gale,Rebecca", "Rebecca,Gale", "Jonathan,Smith", "Smith,Jonathan")
)
The table:
# A tibble: 5 x 4
dose date response name
<dbl> <chr> <dbl> <chr>
1 10 2007-12-15 45 Peter,Martin
2 20 2007-10-13 67 Gale,Rebecca
3 30 2007-10-13 66 Rebecca,Gale
4 40 2007-09-30 54 Jonathan,Smith
5 50 2007-09-30 55 Smith,Jonathan
One of the columns called name either has a string "FirstName,LastName" or "LastName,FirstName". I wish to merge the rows that contain the same names if they are ordered either way. For example, the rows containing Rebecca,Gale and Gale,Rebecca should merge.
While merging, I wish to get the sums of the columns dose and response and want to keep the first of the date and name entries.
Expected outcome:
# A tibble: 3 x 4
dose date response name
<dbl> <chr> <dbl> <chr>
1 10 2007-12-15 45 Peter,Martin
2 50 2007-10-13 133 Gale,Rebecca
3 90 2007-09-30 109 Jonathan,Smith
Please note that I always want to merge using the name column and not the date column because even if the example contains the same dates, my bigger table has different dates for the same name.

Here is one idea.
library(tidyverse)
df2 <- df %>%
mutate(date = as.Date(date)) %>%
mutate(name = map_chr(name, ~toString(sort(str_split(.x, ",")[[1]])))) %>%
group_by(name) %>%
summarize(dose = sum(dose),
response = sum(response),
date = first(date)) %>%
select(names(df)) %>%
ungroup()
df2
# # A tibble: 3 x 4
# dose date response name
# <dbl> <date> <dbl> <chr>
# 1 50 2007-10-13 133 Gale, Rebecca
# 2 90 2007-09-30 109 Jonathan, Smith
# 3 10 2007-12-15 45 Martin, Peter

Related

Group all rows that meet a certain condition

I have the following dataframe df1:
company_location count
<chr> <int>
1 DE 28
2 JP 6
3 GB 47
4 HN 1
5 US 355
6 HU 1
I want to get to df2:
company_location count
<chr> <int>
1 DE 28
2 GB 47
3 US 355
4 OTHER 8
df2 is the same as df1 but sums together all the columns with count<10 and aggregates them in a row called OTHER
Does something like this exist: A group_by() function that only groups all the rows that match a particular condition into one group and leaves all the other rows in groups only containing them alone?
This is what fct_lump_min is for - it's a function from forcats, which is part of the tidyverse.
library(tidyverse)
df %>%
group_by(company_location = fct_lump_min(company_location, 10, count)) %>%
summarise(count = sum(count))
#> # A tibble: 4 x 2
#> company_location count
#> <fct> <int>
#> 1 DE 28
#> 2 GB 47
#> 3 US 355
#> 4 Other 8
Make a temporary variable regrouping company_location based on count, then summarise:
library(dplyr)
df1 %>%
group_by(company_location = replace(company_location, count < 10, 'OTHER')) %>%
summarise(count = sum(count))
# company_location count
# <chr> <int>
#1 DE 28
#2 GB 47
#3 OTHER 8
#4 US 355

Summarizing a difficult dataset

I have a dataset that is basically formatted backwards from how I need it to perform a specific analysis. It represents entities and the articles they are found in, represented by id numbers (see below. Column headings [article 1, 2, 3, etc.] are just the 1st, 2nd, 3rd articles they appear in. The index in the cell is the substantive part). What I'd like to get is a count of how many entities appear in each article, which I think I could do with something like dplyr's group_by and summarise, but I can't find anywhere where you can apply it to a range of columns (there are actually 97 article columns in the dataset).
entity
article 1
article 2
article 3
Bach
51
72
122
Mozart
2
83
95
Two specific transformations that would be useful for me are
The number of entities in each article calculated as the count of the times each unique ID appears in an entity row. eg:
id
count
51
5424
72
1001
122
4000
The entities in each article. eg:
id
entity 1
entity 2
entity 3
51
Bach
Mozart
etc
72
Mozart
Liszt
etc
All this should be possible from this dataset, I just can't figure out how to get it into a workable format. Thanks for your help!
For number 1, you can pivot to long format, then get the counts for each unique id for each entity using tally.
library(tidyverse)
df %>%
pivot_longer(-entity) %>%
group_by(entity, value) %>%
tally()
# A tibble: 6 × 3
# Groups: entity [2]
entity value n
<chr> <dbl> <int>
1 Bach 51 1
2 Bach 72 2
3 Bach 122 1
4 Mozart 2 1
5 Mozart 83 2
6 Mozart 95 1
It is a little unclear exactly what you are looking for, as the output seems different than what you describe. So, if you just want the total counts for each unique id, then you could drop entity in the group_by statement.
df %>%
pivot_longer(-entity) %>%
group_by(value) %>%
tally()
# A tibble: 6 × 2
value n
<dbl> <int>
1 2 1
2 51 1
3 72 2
4 83 2
5 95 1
6 122 1
For number 2, you could do something like this:
df %>%
pivot_longer(-entity) %>%
group_by(value) %>%
mutate(name = paste0("entity " , 1:n())) %>%
pivot_wider(names_from = "name", values_from = "entity")
# A tibble: 6 × 3
# Groups: value [6]
value `entity 1` `entity 2`
<dbl> <chr> <chr>
1 51 Bach NA
2 72 Bach Bach
3 122 Bach NA
4 2 Mozart NA
5 83 Mozart Mozart
6 95 Mozart NA
Data
df <- structure(
list(
entity = c("Bach", "Mozart"),
article.1 = c(51, 2),
article.2 = c(72, 83),
article.3 = c(122, 95),
article.4 = c(72, 83)
),
class = "data.frame",
row.names = c(NA,-2L)
)

fuzzyjoin based on relative difference

I have understood that fuzzyjoin::difference will join two tables based on absolute difference between columns. Is there an R function that will join tables based on relative/percentage differences? I could do so using a full_join() + filter() but I suspect there is a more straightforward way.
Minimal example as follows:
library(tidyverse)
library(fuzzyjoin)
df_1 <- tibble(id = c("wombat", "jerry", "akow"), scores = c(10, 50, 75))
df_2 <- tibble(id= c("wombat", "jerry", "akow"), scores = c(14, 45, 82))
# joining based on absolute difference
difference_full_join(df_1, df_2,
by=c("scores"),
max_dist= 5,
distance_col = "abs_diff" )
# A tibble: 4 x 5
id.x scores.x id.y scores.y abs_diff
<chr> <dbl> <chr> <dbl> <dbl>
1 wombat 10 wombat 14 4
2 jerry 50 jerry 45 5
3 akow 75 NA NA NA
4 NA NA akow 82 NA
## joining based on relative difference (setting 10% as a threshold)
full_join(df_1, df_2, "id") %>%
dplyr::filter( (abs(scores.x - scores.y)/scores.x) <=0.10)
# A tibble: 2 x 3
id scores.x scores.y
<chr> <dbl> <dbl>
1 jerry 50 45
2 akow 75 82

How to summarize `Number of days since first date` and `Number of days seen` by ID and for a large data frame

The dataframe df1 summarizes detections of individuals (ID) through the time (Date). As a short example:
df1<- data.frame(ID= c(1,2,1,2,1,2,1,2,1,2),
Date= ymd(c("2016-08-21","2016-08-24","2016-08-23","2016-08-29","2016-08-27","2016-09-02","2016-09-01","2016-09-09","2016-09-01","2016-09-10")))
df1
ID Date
1 1 2016-08-21
2 2 2016-08-24
3 1 2016-08-23
4 2 2016-08-29
5 1 2016-08-27
6 2 2016-09-02
7 1 2016-09-01
8 2 2016-09-09
9 1 2016-09-01
10 2 2016-09-10
I want to summarize either the Number of days since the first detection of the individual (Ndays) and Number of days that the individual has been detected since the first time it was detected (Ndifdays).
Additionally, I would like to include in this summary table a variable called Prop that simply divides Ndifdays between Ndays.
The summary table that I would expect would be this:
> Result
ID Ndays Ndifdays Prop
1 1 11 4 0.360 # Between 21st Aug and 01st Sept there is 11 days.
2 2 17 5 0.294 # Between 24th Aug and 10st Sept there is 17 days.
Does anyone know how to do it?
You could achieve using various summarising functions in dplyr
library(dplyr)
df1 %>%
group_by(ID) %>%
summarise(Ndays = as.integer(max(Date) - min(Date)),
Ndifdays = n_distinct(Date),
Prop = Ndifdays/Ndays)
# ID Ndays Ndifdays Prop
# <dbl> <int> <int> <dbl>
#1 1 11 4 0.364
#2 2 17 5 0.294
The data.table version of this would be
library(data.table)
df12 <- setDT(df1)[, .(Ndays = as.integer(max(Date) - min(Date)),
Ndifdays = uniqueN(Date)), by = ID]
df12$Prop <- df12$Ndifdays/df12$Ndays
and base R with aggregate
df12 <- aggregate(Date~ID, df1, function(x) c(max(x) - min(x), length(unique(x))))
df12$Prop <- df1$Ndifdays/df1$Ndays
After grouping by 'ID', get the diff or range of 'Date' to create 'Ndays', and then get the unique number of 'Date' with n_distinct, divide by the number of distinct by the Ndays to get the 'Prop'
library(dplyr)
df1 %>%
group_by(ID) %>%
summarise(Ndays = as.integer(diff(range(Date))),
Ndifdays = n_distinct(Date),
Prop = Ndifdays/Ndays)
# A tibble: 2 x 4
# ID Ndays Ndifdays Prop
# <dbl> <int> <int> <dbl>
#1 1 11 4 0.364
#2 2 17 5 0.294

Merging multiple rows into single row

I've some problems with my data frame in R.
My data frame looks something like this:
ID TIME DAY URL_NAME VALUE TIME_SPEND
1 12:15 Monday HOME 4 30
1 13:15 Tuesday CUSTOMERS 5 21
1 15:00 Thursday PLANTS 8 8
1 16:21 Friday MANAGEMENT 1 6
....
So, I want to write the rows, containing the same "ID" into one single row.
Looking something like this:
ID TIME DAY URL_NAME VALUE TIME_SPEND TIME1 DAY1 URL_NAME1 VALUE1 TIME_SPEND1 TIME2 DAY2 URL_NAME2 VALUE2 TIME_SPEND2 TIME3 DAY3 URL_NAME3 VALUE3 TIME_SPEND3
1 12:15 Monday HOME 4 30 13:15 Tuesday CUSTOMERS 5 21 15:00 Thursday PLANTS 8 8 16:21 Friday MANAGEMENT 1 6
My second problem is, that there are about 1.500.00 unique IDs and i would like to do this for the whole data frame.
I did not find any solution fitting to my problem.
I would be happy about any solutions or links to handle my problem.
I'd recommend using dcast from the "data.table" package, which would allow you to reshape multiple measure variables at once.
Example:
library(data.table)
as.data.table(mydf)[, dcast(.SD, ID ~ rowid(ID), value.var = names(mydf)[-1])]
# ID TIME_1 TIME_2 TIME_3 DAY_1 DAY_2 DAY_3 URL_NAME_1 URL_NAME_2 URL_NAME_3 VALUE_1 VALUE_2
# 1: 1 12:15 13:15 15:00 Monday Tuesday Thursday HOME CUSTOMERS PLANTS 4 5
# 2: 2 14:15 10:19 NA Tuesday Monday NA CUSTOMERS CUSTOMERS NA 2 9
# VALUE_3 TIME_SPEND_1 TIME_SPEND_2 TIME_SPEND_3
# 1: 8 30 19 40
# 2: NA 21 8 NA
Here's the sample data used:
mydf <- data.frame(
ID = c(1, 1, 1, 2, 2),
TIME = c("12:15", "13:15", "15:00", "14:15", "10:19"),
DAY = c("Monday", "Tuesday", "Thursday", "Tuesday", "Monday"),
URL_NAME = c("HOME", "CUSTOMERS", "PLANTS", "CUSTOMERS", "CUSTOMERS"),
VALUE = c(4, 5, 8, 2, 9),
TIME_SPEND = c(30, 19, 40, 21, 8)
)
mydf
# ID TIME DAY URL_NAME VALUE TIME_SPEND
# 1 1 12:15 Monday HOME 4 30
# 2 1 13:15 Tuesday CUSTOMERS 5 19
# 3 1 15:00 Thursday PLANTS 8 40
# 4 2 14:15 Tuesday CUSTOMERS 2 21
# 5 2 10:19 Monday CUSTOMERS 9 8
Try this tidyverse solution which will produce an output close to what you want. You can group by TIME then create a sequential id that will identify the future columns. After that reshape to long (pivot_longer()) combine the variable name with the id and then reshape to wide (pivot_wider()). Here's the code where I have used a dataset of my own,
df1 <- data.frame(Components = c(rep("ABC",5),rep("BCD",5)),
Size = c(sample(1:100,5),sample(45:100,5)),
Age = c(sample(1:100,5),sample(45:100,5)))
For the above-generated data set, the following code piece is the solution:
library(tidyverse)
#Code
newdf <- df1 %>% group_by(Components) %>% mutate(id=row_number()) %>%
pivot_longer(-c(Components,id)) %>%
mutate(name=paste0(name,'.',id)) %>% select(-id) %>%
pivot_wider(names_from = name,values_from=value)
OUTPUT would look like:
# A tibble: 2 x 11
# Groups: Components [2]
Components Size.1 Age.1 Size.2 Age.2 Size.3 Age.3 Size.4 Age.4 Size.5 Age.5
<fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 ABC 23 94 52 89 15 25 76 38 33 99
2 BCD 59 62 55 81 81 61 80 83 97 68
ALTERNATIVE SOLUTION:
We could use unite to unite the columns and then use pivot_wider
library(dplyr)
library(tidyr)
library(data.table)
df1 %>%
mutate(rn = rowid(Components)) %>%
pivot_longer(cols = Size:Age) %>%
unite(name, name, rn, sep=".") %>%
pivot_wider(names_from = name, values_from = value)
OUTPUT would look like:
# A tibble: 2 x 11
# Components Size.1 Age.1 Size.2 Age.2 Size.3 Age.3 Size.4 Age.4 Size.5 Age.5
# <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#1 ABC 11 16 79 57 70 2 80 6 91 24
#2 BCD 67 81 63 77 48 73 52 100 49 76
Both the solutions were #Duck's Duck's Profile URL and #akrun's Akrun's Profile URL brainchild. Thanks a tonne to them.

Resources