Merge rows and keep values based on another column - r

I've got data from a number of surveys. Each survey can be sent multiple times with updated values. For each survey/row in the dataset there is a date when the survey was submitted (created). I'd like to merge the rows for each survey, keeping the date from the first submission but all other data from the last submission.
A simple example:
#>   survey    created var1 var2
#> 1     s1 2020-01-01   10   30
#> 2     s2 2020-01-02   10   90
#> 3     s2 2020-01-03   20   20
#> 4     s3 2020-01-01   45    5
#> 5     s3 2020-01-02   50   50
#> 6     s3 2020-01-03   30   10
Desired result:
#>   survey    created var1 var2
#> 1     s1 2020-01-01   10   30
#> 2     s2 2020-01-02   20   20
#> 3     s3 2020-01-01   30   10
Example data:
df <- data.frame(survey = c("s1", "s2", "s2", "s3", "s3", "s3"),
                 created = as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-03",
                                        "2020-01-01", "2020-01-02", "2020-01-03"),
                                      format = "%Y-%m-%d", tz = "GMT"),
                 var1 = c(10, 10, 20, 45, 50, 30),
                 var2 = c(30, 90, 20, 5, 50, 10),
                 stringsAsFactors = FALSE)
I've tried group_by with summarize in different ways but can't make it work. Any help would be highly appreciated!

After grouping by 'survey', change 'created' to the first (or min) value of 'created', then slice the last row with slice(n()).
library(dplyr)
df %>%
  group_by(survey) %>%
  mutate(created = as.Date(first(created))) %>%
  slice(n())
# A tibble: 3 x 4
# Groups:   survey [3]
#   survey created     var1  var2
#   <chr>  <date>     <dbl> <dbl>
# 1 s1     2020-01-01    10    30
# 2 s2     2020-01-02    20    20
# 3 s3     2020-01-01    30    10
Or using base R (note that first() belongs to dplyr; with base R only, use an anonymous function instead):
transform(df, created = ave(created, survey, FUN = function(x) x[1]))[!duplicated(df$survey, fromLast = TRUE), ]

After replacing 'created' with the first date in each group, we can take the last values of all the columns.
library(dplyr)
df %>%
  group_by(survey) %>%
  mutate(created = as.Date(first(created))) %>%
  summarise(across(created:var2, last))
# In older dplyr versions use `summarise_at`:
# summarise_at(vars(created:var2), last)
# A tibble: 3 x 4
#   survey created     var1  var2
#   <chr>  <date>     <dbl> <dbl>
# 1 s1     2020-01-01    10    30
# 2 s2     2020-01-02    20    20
# 3 s3     2020-01-01    30    10
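For completeness, here is a rough data.table sketch of the same idea (my own addition, assuming the df defined above): keep the first 'created' and the last value of the remaining columns per survey.
library(data.table)
# first 'created', last of the other columns, per survey
as.data.table(df)[, .(created = as.Date(first(created)),
                      var1    = last(var1),
                      var2    = last(var2)),
                  by = survey]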

Related

How to calculate difference in values grouped by 2 separate variables in R

Let's say we have a team variable, but we also have a time period 1 and a time period 2 variable, and a numeric grade 1-10. I want to mutate and add a variable that calculates the difference from time period 1 to time period 2.
How do I do this?
Visually the table looks like this (image not included in this excerpt).
There is a neat function in the data.table package called dcast() that allows you to transform your data from long to wide. In this case, you can use the Period variable to create two new columns, Period 1 and Period 2, where the values are the Grades.
library(data.table)
> data <- data.table(
+   Team = c("Team 1", "Team 1", "Team 2", "Team 2", "Team 3", "Team 3"),
+   Period = c("Period 1", "Period 2", "Period 1", "Period 2", "Period 1", "Period 2"),
+   Grade = c(75, 87, 42, 35, 10, 95))
> data
     Team   Period Grade
1: Team 1 Period 1    75
2: Team 1 Period 2    87
3: Team 2 Period 1    42
4: Team 2 Period 2    35
5: Team 3 Period 1    10
6: Team 3 Period 2    95
> data2 <- dcast(
+   data = data,
+   Team ~ Period,
+   value.var = "Grade")
> data2
     Team Period 1 Period 2
1: Team 1       75       87
2: Team 2       42       35
3: Team 3       10       95
> data2 <- data2[, Difference := `Period 2` - `Period 1`]
> data2
     Team Period 1 Period 2 Difference
1: Team 1       75       87         12
2: Team 2       42       35         -7
3: Team 3       10       95         85
In tidyverse syntax, we would use pivot_wider and mutate:
library(tidyverse)
df %>%
  pivot_wider(names_from = `Time Period`, values_from = Grade) %>%
  mutate(difference = P2 - P1)
#> # A tibble: 3 x 4
#>   Team      P1    P2 difference
#>   <chr>  <dbl> <dbl>      <dbl>
#> 1 Team 1    75    87         12
#> 2 Team 2    42    35         -7
#> 3 Team 3    10    95         85
Created on 2022-08-29 with reprex v2.0.2
Data used
df <- data.frame(Team = paste("Team", rep(1:3, each = 2)),
                 `Time Period` = rep(c("P1", "P2"), 3),
                 Grade = c(75, 87, 42, 35, 10, 95),
                 check.names = FALSE)
df
#>     Team Time Period Grade
#> 1 Team 1          P1    75
#> 2 Team 1          P2    87
#> 3 Team 2          P1    42
#> 4 Team 2          P2    35
#> 5 Team 3          P1    10
#> 6 Team 3          P2    95
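If the wide table itself isn't needed, the difference can also be computed directly in long format. A minimal sketch (my own addition), assuming each Team has exactly one P1 and one P2 row in the df defined above:
library(dplyr)
df %>%
  group_by(Team) %>%
  summarise(difference = Grade[`Time Period` == "P2"] - Grade[`Time Period` == "P1"],
            .groups = "drop")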

Mark dates within several date ranges

I'm trying to mark all dates which fall within any of several date ranges stored in a different table.
The events table among other variables contains start_date and end_date of events:
events <- tibble(
  name = c("Event A", "Event B"),
  start_date = as.Date(c("2021-10-17", "2021-02-19")),
  end_date = as.Date(c("2021-10-19", "2021-02-10"))
)
The date_info table contains date, statistic and value information in the long format for all days of the year:
date_info <- tibble(
  date = as.Date(c("2021-10-16", "2021-10-16", "2021-10-17", "2021-10-17")),
  statistic = c("var1", "var2", "var1", "var2"),
  value = c(10, 54, 23, 34)
)
I need to make a new column in date_info to mark dates which fall within any date range of events.
I've tried the approach below, but it works only if there is one event in events
library(tidyverse)
date_info %>%
  mutate(in_range = if_else(date < events$start_date | date > events$end_date, FALSE, TRUE))
I thought about creating a date_range vector in events so that the code below could be used to mark the dates:
library(tidyverse)
date_info %>%
  mutate(in_range = if_else(date %in% events$date_range, TRUE, FALSE))
However, I'm not sure that this is the best approach. Additionally, I'm not sure how to get such a date range, as seq() works on a single start/end date pair rather than on vectors.
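For reference, that date_range idea can be sketched by enumerating every day covered by an event (my own sketch, assuming day-level dates; an event whose start_date is after its end_date, like Event B in the example data, is skipped):
library(dplyr)
library(purrr)
# build one vector with every date covered by any event, then test membership
all_event_dates <- events %>%
  filter(start_date <= end_date) %>%                      # ignore reversed ranges
  mutate(range = map2(start_date, end_date, seq, by = "day")) %>%
  pull(range) %>%
  reduce(c, .init = as.Date(character(0)))

date_info %>%
  mutate(in_range = date %in% all_event_dates)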
This can be done as a range-based or non-equi join. Unfortunately, dplyr alone cannot do it, but one of the following should work fine.
The code below assigns the particular events$name to each row, not just an "in range" indicator. It's not hard to simplify that with in_range = !is.na(name) or similar.
fuzzyjoin
# library(fuzzyjoin)
date_info %>%
  fuzzyjoin::fuzzy_left_join(events,
                             by = c(date = "start_date", date = "end_date"),
                             match_fun = list(`>=`, `<=`))
# # A tibble: 4 x 6
#   date       statistic value name    start_date end_date
#   <date>     <chr>     <dbl> <chr>   <date>     <date>
# 1 2021-10-16 var1         10 NA      NA         NA
# 2 2021-10-16 var2         54 NA      NA         NA
# 3 2021-10-17 var1         23 Event A 2021-10-17 2021-10-19
# 4 2021-10-17 var2         34 Event A 2021-10-17 2021-10-19
sqldf
# library(sqldf)
sqldf::sqldf("
  select t1.*, t2.name
  from date_info t1
  left join events t2 on t1.date between t2.start_date and t2.end_date")
#         date statistic value    name
# 1 2021-10-16      var1    10    <NA>
# 2 2021-10-16      var2    54    <NA>
# 3 2021-10-17      var1    23 Event A
# 4 2021-10-17      var2    34 Event A
data.table
library(data.table)
date_info_DT <- as.data.table(date_info)
events_DT <- as.data.table(events)
date_info_DT[events_DT, name := i.name,
             on = .(date >= start_date, date <= end_date)][]
#          date statistic value    name
#        <Date>    <char> <num>  <char>
# 1: 2021-10-16      var1    10    <NA>
# 2: 2021-10-16      var2    54    <NA>
# 3: 2021-10-17      var1    23 Event A
# 4: 2021-10-17      var2    34 Event A
(There's also data.table::foverlaps, which requires the second data.table to be keyed.)
Another option, a bit simpler (not requiring class-changes):
date_info %>%
  mutate(in_range = data.table::inrange(date, events$start_date, events$end_date))
# # A tibble: 4 x 4
#   date       statistic value in_range
#   <date>     <chr>     <dbl> <lgl>
# 1 2021-10-16 var1         10 FALSE
# 2 2021-10-16 var2         54 FALSE
# 3 2021-10-17 var1         23 TRUE
# 4 2021-10-17 var2         34 TRUE
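Note: since dplyr 1.1.0, join_by() supports non-equi joins, so this can now be done in dplyr alone. A sketch with the same semantics as the joins above (assuming dplyr >= 1.1.0):
library(dplyr)
date_info %>%
  left_join(events, by = join_by(between(date, start_date, end_date))) %>%
  mutate(in_range = !is.na(name))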
Here's a solution using map from the purrr package that should work. It could be more concise but I made it very explicit so it's not overwhelming if you're not familiar with the syntax.
date_info |>
  mutate(
    in_range_n = map_dbl(date, .f = function(date) {
      filter(events, start_date <= date, end_date >= date) |>
        nrow()
    }),
    in_range = in_range_n > 0
  ) |>
  select(-in_range_n)
Output:
# A tibble: 4 x 4
  date       statistic value in_range
  <date>     <chr>     <dbl> <lgl>
1 2021-10-16 var1         10 FALSE
2 2021-10-16 var2         54 FALSE
3 2021-10-17 var1         23 TRUE
4 2021-10-17 var2         34 TRUE
Let me know if I misunderstood the problem!
Using base R:
date_info$in_range <- sapply(date_info$date, function(date) {
  any(date >= events$start_date & date <= events$end_date)
})
gives
  date       statistic value in_range
  <date>     <chr>     <dbl> <lgl>
1 2021-10-16 var1         10 FALSE
2 2021-10-16 var2         54 FALSE
3 2021-10-17 var1         23 TRUE
4 2021-10-17 var2         34 TRUE
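The sapply() loop can also be vectorised with outer(); a base R sketch of the same test (my own addition):
# TRUE where a date falls inside at least one event range
hit <- outer(date_info$date, events$start_date, ">=") &
       outer(date_info$date, events$end_date, "<=")
date_info$in_range <- rowSums(hit) > 0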

fuzzyjoin based on relative difference

I have understood that the fuzzyjoin difference_*_join functions will join two tables based on the absolute difference between columns. Is there an R function that will join tables based on relative/percentage differences? I could do so using a full_join() + filter(), but I suspect there is a more straightforward way.
Minimal example as follows:
library(tidyverse)
library(fuzzyjoin)
df_1 <- tibble(id = c("wombat", "jerry", "akow"), scores = c(10, 50, 75))
df_2 <- tibble(id= c("wombat", "jerry", "akow"), scores = c(14, 45, 82))
# joining based on absolute difference
difference_full_join(df_1, df_2,
                     by = c("scores"),
                     max_dist = 5,
                     distance_col = "abs_diff")
# A tibble: 4 x 5
  id.x   scores.x id.y   scores.y abs_diff
  <chr>     <dbl> <chr>     <dbl>    <dbl>
1 wombat       10 wombat       14        4
2 jerry        50 jerry        45        5
3 akow         75 NA           NA       NA
4 NA           NA akow         82       NA
## joining based on relative difference (setting 10% as a threshold)
full_join(df_1, df_2, "id") %>%
  dplyr::filter((abs(scores.x - scores.y) / scores.x) <= 0.10)
# A tibble: 2 x 3
  id    scores.x scores.y
  <chr>    <dbl>    <dbl>
1 jerry       50       45
2 akow        75       82
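One way to stay within fuzzyjoin is fuzzy_full_join() with a custom match_fun that tests the relative difference. I'm not aware of a dedicated relative-difference join, so treat this as a sketch (here using 10% of the left-hand score as the threshold):
library(fuzzyjoin)
# match_fun receives the paired column vectors and must return a logical vector
rel_match <- function(x, y) abs(x - y) / x <= 0.10

fuzzy_full_join(df_1, df_2,
                by = "scores",
                match_fun = rel_match)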

Merge rows containing similar strings using dplyr

I have a table containing the following data:
df <- tibble(
  dose = seq(10, 50, 10),
  date = c("2007-12-15", "2007-10-13", "2007-10-13", "2007-09-30", "2007-09-30"),
  response = c(45, 67, 66, 54, 55),
  name = c("Peter,Martin", "Gale,Rebecca", "Rebecca,Gale", "Jonathan,Smith", "Smith,Jonathan")
)
The table:
# A tibble: 5 x 4
   dose date       response name
  <dbl> <chr>         <dbl> <chr>
1    10 2007-12-15       45 Peter,Martin
2    20 2007-10-13       67 Gale,Rebecca
3    30 2007-10-13       66 Rebecca,Gale
4    40 2007-09-30       54 Jonathan,Smith
5    50 2007-09-30       55 Smith,Jonathan
One of the columns called name either has a string "FirstName,LastName" or "LastName,FirstName". I wish to merge the rows that contain the same names if they are ordered either way. For example, the rows containing Rebecca,Gale and Gale,Rebecca should merge.
While merging, I wish to get the sums of the columns dose and response and want to keep the first of the date and name entries.
Expected outcome:
# A tibble: 3 x 4
   dose date       response name
  <dbl> <chr>         <dbl> <chr>
1    10 2007-12-15       45 Peter,Martin
2    50 2007-10-13      133 Gale,Rebecca
3    90 2007-09-30      109 Jonathan,Smith
Please note that I always want to merge using the name column and not the date column because even if the example contains the same dates, my bigger table has different dates for the same name.
Here is one idea.
library(tidyverse)
df2 <- df %>%
  mutate(date = as.Date(date)) %>%
  mutate(name = map_chr(name, ~ toString(sort(str_split(.x, ",")[[1]])))) %>%
  group_by(name) %>%
  summarize(dose = sum(dose),
            response = sum(response),
            date = first(date)) %>%
  select(names(df)) %>%
  ungroup()
df2
# # A tibble: 3 x 4
#    dose date       response name
#   <dbl> <date>        <dbl> <chr>
# 1    50 2007-10-13      133 Gale, Rebecca
# 2    90 2007-09-30      109 Jonathan, Smith
# 3    10 2007-12-15       45 Martin, Peter
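Note that toString() inserts a comma plus space, which is why the names above print as "Gale, Rebecca" rather than the "Gale,Rebecca" form in the expected outcome. A small variation (my own sketch) groups on a sorted key but keeps the first-seen spelling of the name:
library(tidyverse)
df %>%
  mutate(key = map_chr(str_split(name, ","),
                       ~ paste(sort(.x), collapse = ","))) %>%
  group_by(key) %>%
  summarize(dose = sum(dose),
            response = sum(response),
            date = first(date),
            name = first(name),
            .groups = "drop") %>%
  select(dose, date, response, name)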

Merging multiple rows into single row

I've some problems with my data frame in R.
My data frame looks something like this:
ID  TIME   DAY       URL_NAME    VALUE  TIME_SPEND
1   12:15  Monday    HOME        4      30
1   13:15  Tuesday   CUSTOMERS   5      21
1   15:00  Thursday  PLANTS      8      8
1   16:21  Friday    MANAGEMENT  1      6
....
So, I want to write the rows, containing the same "ID" into one single row.
Looking something like this:
ID TIME DAY URL_NAME VALUE TIME_SPEND TIME1 DAY1 URL_NAME1 VALUE1 TIME_SPEND1 TIME2 DAY2 URL_NAME2 VALUE2 TIME_SPEND2 TIME3 DAY3 URL_NAME3 VALUE3 TIME_SPEND3
1 12:15 Monday HOME 4 30 13:15 Tuesday CUSTOMERS 5 21 15:00 Thursday PLANTS 8 8 16:21 Friday MANAGEMENT 1 6
My second problem is that there are about 1,500,000 unique IDs and I would like to do this for the whole data frame.
I did not find any solution fitting my problem.
I would be happy about any solutions or links that handle my problem.
I'd recommend using dcast from the "data.table" package, which would allow you to reshape multiple measure variables at once.
Example:
library(data.table)
as.data.table(mydf)[, dcast(.SD, ID ~ rowid(ID), value.var = names(mydf)[-1])]
#    ID TIME_1 TIME_2 TIME_3   DAY_1   DAY_2    DAY_3 URL_NAME_1 URL_NAME_2 URL_NAME_3 VALUE_1 VALUE_2
# 1:  1  12:15  13:15  15:00  Monday Tuesday Thursday       HOME  CUSTOMERS     PLANTS       4       5
# 2:  2  14:15  10:19     NA Tuesday  Monday       NA  CUSTOMERS  CUSTOMERS         NA       2       9
#    VALUE_3 TIME_SPEND_1 TIME_SPEND_2 TIME_SPEND_3
# 1:       8           30           19           40
# 2:      NA           21            8           NA
Here's the sample data used:
mydf <- data.frame(
  ID = c(1, 1, 1, 2, 2),
  TIME = c("12:15", "13:15", "15:00", "14:15", "10:19"),
  DAY = c("Monday", "Tuesday", "Thursday", "Tuesday", "Monday"),
  URL_NAME = c("HOME", "CUSTOMERS", "PLANTS", "CUSTOMERS", "CUSTOMERS"),
  VALUE = c(4, 5, 8, 2, 9),
  TIME_SPEND = c(30, 19, 40, 21, 8)
)
mydf
#   ID  TIME      DAY  URL_NAME VALUE TIME_SPEND
# 1  1 12:15   Monday      HOME     4         30
# 2  1 13:15  Tuesday CUSTOMERS     5         19
# 3  1 15:00 Thursday    PLANTS     8         40
# 4  2 14:15  Tuesday CUSTOMERS     2         21
# 5  2 10:19   Monday CUSTOMERS     9          8
Try this tidyverse solution, which will produce an output close to what you want. Group by the ID column (Components in the example below), then create a sequential id that will identify the future columns. After that, reshape to long (pivot_longer()), combine the variable name with the id, and then reshape to wide (pivot_wider()). Here's the code, where I have used a dataset of my own:
df1 <- data.frame(Components = c(rep("ABC", 5), rep("BCD", 5)),
                  Size = c(sample(1:100, 5), sample(45:100, 5)),
                  Age = c(sample(1:100, 5), sample(45:100, 5)))
For the above-generated data set, the following code piece is the solution:
library(tidyverse)
#Code
newdf <- df1 %>%
  group_by(Components) %>%
  mutate(id = row_number()) %>%
  pivot_longer(-c(Components, id)) %>%
  mutate(name = paste0(name, '.', id)) %>%
  select(-id) %>%
  pivot_wider(names_from = name, values_from = value)
OUTPUT would look like:
# A tibble: 2 x 11
# Groups:   Components [2]
  Components Size.1 Age.1 Size.2 Age.2 Size.3 Age.3 Size.4 Age.4 Size.5 Age.5
  <fct>       <int> <int>  <int> <int>  <int> <int>  <int> <int>  <int> <int>
1 ABC            23    94     52    89     15    25     76    38     33    99
2 BCD            59    62     55    81     81    61     80    83     97    68
ALTERNATIVE SOLUTION:
We could use unite to unite the columns and then use pivot_wider
library(dplyr)
library(tidyr)
library(data.table)
df1 %>%
  mutate(rn = rowid(Components)) %>%
  pivot_longer(cols = Size:Age) %>%
  unite(name, name, rn, sep = ".") %>%
  pivot_wider(names_from = name, values_from = value)
OUTPUT would look like:
# A tibble: 2 x 11
#   Components Size.1 Age.1 Size.2 Age.2 Size.3 Age.3 Size.4 Age.4 Size.5 Age.5
#   <chr>       <int> <int>  <int> <int>  <int> <int>  <int> <int>  <int> <int>
# 1 ABC            11    16     79    57     70     2     80     6     91    24
# 2 BCD            67    81     63    77     48    73     52   100     49    76
Both solutions were the brainchild of @Duck and @akrun. Thanks a tonne to them.
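For what it's worth, the same recipe applied to the mydf sample data from the first answer would look roughly like this (a sketch; every value becomes character because the reshaped columns have mixed types):
library(dplyr)
library(tidyr)
mydf %>%
  group_by(ID) %>%
  mutate(rn = row_number()) %>%   # sequence number within each ID
  ungroup() %>%
  pivot_longer(cols = -c(ID, rn),
               values_transform = list(value = as.character)) %>%
  unite(name, name, rn, sep = "_") %>%
  pivot_wider(names_from = name, values_from = value)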
