Merging multiple rows into single row - r

I've some problems with my data frame in R.
My data frame looks something like this:
ID TIME DAY URL_NAME VALUE TIME_SPEND
1 12:15 Monday HOME 4 30
1 13:15 Tuesday CUSTOMERS 5 21
1 15:00 Thursday PLANTS 8 8
1 16:21 Friday MANAGEMENT 1 6
....
So, I want to combine all rows that share the same "ID" into one single row, looking something like this:
ID TIME DAY URL_NAME VALUE TIME_SPEND TIME1 DAY1 URL_NAME1 VALUE1 TIME_SPEND1 TIME2 DAY2 URL_NAME2 VALUE2 TIME_SPEND2 TIME3 DAY3 URL_NAME3 VALUE3 TIME_SPEND3
1 12:15 Monday HOME 4 30 13:15 Tuesday CUSTOMERS 5 21 15:00 Thursday PLANTS 8 8 16:21 Friday MANAGEMENT 1 6
My second problem is that there are about 1.500.00 unique IDs, and I would like to do this for the whole data frame.
I did not find any solution that fits my problem.
I would be happy about any solutions or links that help me handle it.

I'd recommend using dcast from the "data.table" package, which would allow you to reshape multiple measure variables at once.
Example:
library(data.table)
as.data.table(mydf)[, dcast(.SD, ID ~ rowid(ID), value.var = names(mydf)[-1])]
# ID TIME_1 TIME_2 TIME_3 DAY_1 DAY_2 DAY_3 URL_NAME_1 URL_NAME_2 URL_NAME_3 VALUE_1 VALUE_2
# 1: 1 12:15 13:15 15:00 Monday Tuesday Thursday HOME CUSTOMERS PLANTS 4 5
# 2: 2 14:15 10:19 NA Tuesday Monday NA CUSTOMERS CUSTOMERS NA 2 9
# VALUE_3 TIME_SPEND_1 TIME_SPEND_2 TIME_SPEND_3
# 1: 8 30 19 40
# 2: NA 21 8 NA
Here's the sample data used:
mydf <- data.frame(
ID = c(1, 1, 1, 2, 2),
TIME = c("12:15", "13:15", "15:00", "14:15", "10:19"),
DAY = c("Monday", "Tuesday", "Thursday", "Tuesday", "Monday"),
URL_NAME = c("HOME", "CUSTOMERS", "PLANTS", "CUSTOMERS", "CUSTOMERS"),
VALUE = c(4, 5, 8, 2, 9),
TIME_SPEND = c(30, 19, 40, 21, 8)
)
mydf
# ID TIME DAY URL_NAME VALUE TIME_SPEND
# 1 1 12:15 Monday HOME 4 30
# 2 1 13:15 Tuesday CUSTOMERS 5 19
# 3 1 15:00 Thursday PLANTS 8 40
# 4 2 14:15 Tuesday CUSTOMERS 2 21
# 5 2 10:19 Monday CUSTOMERS 9 8
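If data.table is not an option, roughly the same reshape can be sketched in base R with reshape(). This is only a sketch on a cut-down version of the sample data (two measure columns instead of five), and the helper column name seq is my own choice, not something from the answer above.

```r
# Cut-down sample data: ID plus two measure columns.
mydf <- data.frame(
  ID = c(1, 1, 1, 2, 2),
  TIME = c("12:15", "13:15", "15:00", "14:15", "10:19"),
  VALUE = c(4, 5, 8, 2, 9),
  stringsAsFactors = FALSE
)

# Number the rows within each ID (1, 2, 3, ...); this plays the role of rowid(ID).
mydf$seq <- ave(mydf$ID, mydf$ID, FUN = seq_along)

# Spread every measure column across the within-ID sequence number.
wide <- reshape(mydf, idvar = "ID", timevar = "seq",
                direction = "wide", sep = "_")
wide
```

IDs with fewer rows than the longest ID get NA in the trailing columns, the same as in the dcast output.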

Try this tidyverse solution, which will produce an output close to what you want. You can group by the ID column, then create a sequential id that will identify the future columns. After that, reshape to long (pivot_longer()), combine the variable name with the id, and then reshape to wide (pivot_wider()). Here's the code, where I have used a dataset of my own (Components plays the role of ID):
df1 <- data.frame(Components = c(rep("ABC",5),rep("BCD",5)),
Size = c(sample(1:100,5),sample(45:100,5)),
Age = c(sample(1:100,5),sample(45:100,5)))
For the above-generated data set, the following code piece is the solution:
library(tidyverse)
#Code
newdf <- df1 %>% group_by(Components) %>% mutate(id=row_number()) %>%
pivot_longer(-c(Components,id)) %>%
mutate(name=paste0(name,'.',id)) %>% select(-id) %>%
pivot_wider(names_from = name,values_from=value)
OUTPUT would look like:
# A tibble: 2 x 11
# Groups: Components [2]
Components Size.1 Age.1 Size.2 Age.2 Size.3 Age.3 Size.4 Age.4 Size.5 Age.5
<fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 ABC 23 94 52 89 15 25 76 38 33 99
2 BCD 59 62 55 81 81 61 80 83 97 68
ALTERNATIVE SOLUTION:
We could use unite to unite the columns and then use pivot_wider
library(dplyr)
library(tidyr)
library(data.table)
df1 %>%
mutate(rn = rowid(Components)) %>%
pivot_longer(cols = Size:Age) %>%
unite(name, name, rn, sep=".") %>%
pivot_wider(names_from = name, values_from = value)
OUTPUT would look like:
# A tibble: 2 x 11
# Components Size.1 Age.1 Size.2 Age.2 Size.3 Age.3 Size.4 Age.4 Size.5 Age.5
# <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#1 ABC 11 16 79 57 70 2 80 6 91 24
#2 BCD 67 81 63 77 48 73 52 100 49 76
Both solutions are the brainchild of Duck and akrun. Thanks a tonne to them.
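One thing to watch when applying the pivot_longer() / pivot_wider() approach to the data from the question: the measure columns there mix character and numeric types, and pivot_longer() refuses to stack those into one value column. A minimal sketch of one way around it, assuming it is acceptable to get every reshaped column back as character (the cut-down mydf below is my own shortened version of the sample data):

```r
library(dplyr)
library(tidyr)

# Cut-down sample data with one character and one numeric measure column.
mydf <- data.frame(
  ID = c(1, 1, 1, 2, 2),
  TIME = c("12:15", "13:15", "15:00", "14:15", "10:19"),
  VALUE = c(4, 5, 8, 2, 9),
  stringsAsFactors = FALSE
)

wide <- mydf %>%
  mutate(across(-ID, as.character)) %>%   # give all measure columns one common type
  group_by(ID) %>%
  mutate(id = row_number()) %>%           # sequence number within each ID
  ungroup() %>%
  pivot_longer(-c(ID, id)) %>%
  pivot_wider(names_from = c(name, id), names_sep = "_", values_from = value)
wide
```

The numeric columns can be converted back afterwards, for example with as.numeric() on the VALUE_ columns.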

Related

How to calculate difference in values grouped by 2 separate variables in R

Let's say we have a team variable, but we also have a time period 1 and a time period 2 variable, and a numeric grade 1-10. I want to mutate and add a variable that calculates the difference from time period 1 to time period 2.
How do I do this?
Visually the table looks like this:
[image of the table omitted: one row per team and time period, each with a grade]
There is a neat function in the data.table package called dcast( ) that allows you to transform your data from long to wide. In this case, you can use the Period variable to create 2 new columns, Period 1 and Period 2, where the values are the Grades.
library(data.table)
> data <- data.table(
+ Team = c("Team 1","Team 1","Team 2","Team 2","Team 3","Team 3"),
+ Period = c("Period 1","Period 2","Period 1","Period 2","Period 1","Period 2"),
+ Grade = c(75,87,42,35,10,95))
> data
Team Period Grade
1: Team 1 Period 1 75
2: Team 1 Period 2 87
3: Team 2 Period 1 42
4: Team 2 Period 2 35
5: Team 3 Period 1 10
6: Team 3 Period 2 95
> data2 <- dcast(
+ data = data,
+ Team ~ Period,
+ value.var = "Grade")
> data2
Team Period 1 Period 2
1: Team 1 75 87
2: Team 2 42 35
3: Team 3 10 95
> data2 <- data2[,Difference := `Period 2` - `Period 1`]
> data2
Team Period 1 Period 2 Difference
1: Team 1 75 87 12
2: Team 2 42 35 -7
3: Team 3 10 95 85
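For comparison, the same long-to-wide step can be sketched in base R by splitting the data by period and merging the halves back together on Team. This is just a sketch and assumes exactly one row per team and period; the column suffixes _P1 and _P2 are names I made up.

```r
data <- data.frame(
  Team = rep(c("Team 1", "Team 2", "Team 3"), each = 2),
  Period = rep(c("Period 1", "Period 2"), times = 3),
  Grade = c(75, 87, 42, 35, 10, 95),
  stringsAsFactors = FALSE
)

# One data frame per period, keeping only the key and the value.
p1 <- data[data$Period == "Period 1", c("Team", "Grade")]
p2 <- data[data$Period == "Period 2", c("Team", "Grade")]

# Put the two periods side by side and subtract.
wide <- merge(p1, p2, by = "Team", suffixes = c("_P1", "_P2"))
wide$Difference <- wide$Grade_P2 - wide$Grade_P1
wide
```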
In tidyverse syntax, we would use pivot_wider and mutate:
library(tidyverse)
df %>%
pivot_wider(names_from = `Time Period`, values_from = Grade) %>%
mutate(difference = P2 - P1)
#> # A tibble: 3 x 4
#> Team P1 P2 difference
#> <chr> <dbl> <dbl> <dbl>
#> 1 Team 1 75 87 12
#> 2 Team 2 42 35 -7
#> 3 Team 3 10 95 85
Created on 2022-08-29 with reprex v2.0.2
Data used
df <- data.frame(Team = paste("Team", rep(1:3, each = 2)),
`Time Period` = rep(c("P1", "P2"), 3),
Grade = c(75, 87, 42, 35, 10, 95),
check.names = FALSE)
df
#> Team Time Period Grade
#> 1 Team 1 P1 75
#> 2 Team 1 P2 87
#> 3 Team 2 P1 42
#> 4 Team 2 P2 35
#> 5 Team 3 P1 10
#> 6 Team 3 P2 95

How to calculate sum on unique values in R

So here's the data:
DF1
ID DOW
1 Monday
1 Monday
1 Tuesday
2 Tuesday
2 Wednesday
3 Friday
3 Monday
3 Tuesday
I would like to join the following dictionary.
DF2
ID DOW Hours
1 Monday 20
1 Tuesday 21
2 Tuesday 30
2 Wednesday 25
3 Friday 24
3 Monday 42
3 Tuesday 54
My goal is to get the total count of entries on each day as well as the hours worked on that day. But if a value on the list exists twice, it is not counted twice. (That's the hard part.)
Here's my attempt in R:
df3 <- df1 %>%
  left_join(df2, by = c("DOW", "ID"))
df3 %>%
  group_by(ID) %>%
  summarize(count = n(),
            sum = sum(Employee_Hrs)) %>%
  mutate(injRate = count/sum)
This does not work: although it successfully counts the total number of entries for each ID, it sums Employee_Hrs every time, even when an entry appears multiple times...
End product should be:
ID count sum
1 3 41
2 2 55
3 3 120
Again: take the count, but sum the hours without double counting.
Here is a base R option using merge + aggregate
u <- merge(df1, df2, by = c("ID", "DOW"))
res <- setNames(
merge(aggregate(DOW ~ ID, u, length),
aggregate(Hours ~ ID, unique(u), sum),
by = "ID"
),
c("ID", "Count", "Sum")
)
which gives
> res
ID Count Sum
1 1 3 41
2 2 2 55
3 3 3 120
An option with data.table
library(data.table)
setDT(df1)[df2, .(Count = .N, Hours), on = .(ID), by = .EACHI][,
.(Sum = sum(Hours)), .(ID, Count)]
# ID Count Sum
#1: 1 3 41
#2: 2 2 55
#3: 3 3 120
Another approach is to summarize the tables prior to joining them.
textFile1 <- "ID DOW
1 Monday
1 Monday
1 Tuesday
2 Tuesday
2 Wednesday
3 Friday
3 Monday
3 Tuesday"
textFile2 <- "ID DOW Hours
1 Monday 20
1 Tuesday 21
2 Tuesday 30
2 Wednesday 25
3 Friday 24
3 Monday 42
3 Tuesday 54"
library(dplyr)
df1 <- read.table(text = textFile1, header = TRUE)
df2 <- read.table(text = textFile2, header = TRUE)
# count the entries per ID in df1
counts <- df1 %>%
  group_by(ID) %>%
  summarise(count = n())
# sum the hours per ID in df2, then attach the counts
df2 %>%
  group_by(ID) %>%
  summarize(sum = sum(Hours)) %>%
  left_join(counts, by = "ID") %>%
  mutate(injRate = count/sum)
...and the output:
# A tibble: 3 x 4
ID sum count injRate
<int> <int> <int> <dbl>
1 1 41 3 0.0732
2 2 55 2 0.0364
3 3 120 3 0.025
Try this solution where you compute the number of counts and then you filter to obtain final summary:
library(tidyverse)
#Data
df3 <- df1 %>%
left_join(df2, by = c("DOW" ,"ID"))
#Code
df3 %>%
group_by(ID) %>%
mutate(count=n()) %>%
filter(!duplicated(DOW)) %>%
summarise(count=unique(count),Sum=sum(Hours))
Output:
# A tibble: 3 x 3
ID count Sum
<int> <int> <int>
1 1 3 41
2 2 2 55
3 3 3 120
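To see why the deduplication step matters, here is a small base R sketch that compares the naive sum over the joined table with the sum over its unique rows. The data are the ones from the question; the variable names naive and dedup are mine.

```r
df1 <- data.frame(
  ID = c(1, 1, 1, 2, 2, 3, 3, 3),
  DOW = c("Monday", "Monday", "Tuesday", "Tuesday", "Wednesday",
          "Friday", "Monday", "Tuesday"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  ID = c(1, 1, 2, 2, 3, 3, 3),
  DOW = c("Monday", "Tuesday", "Tuesday", "Wednesday",
          "Friday", "Monday", "Tuesday"),
  Hours = c(20, 21, 30, 25, 24, 42, 54),
  stringsAsFactors = FALSE
)

joined <- merge(df1, df2, by = c("ID", "DOW"))

# Naive: the duplicated (1, Monday) row contributes its 20 hours twice.
naive <- tapply(joined$Hours, joined$ID, sum)

# Deduplicated: drop repeated (ID, DOW, Hours) rows before summing.
u <- unique(joined)
dedup <- tapply(u$Hours, u$ID, sum)

rbind(naive, dedup)
```

Only ID 1 differs between the two rows, because it is the only ID with a repeated day.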

Merge rows containing similar strings using dplyr

I have a table containing the following data:
df <- tibble(
dose = seq(10, 50, 10),
date = c("2007-12-15", "2007-10-13","2007-10-13","2007-09-30","2007-09-30"),
response = c(45, 67, 66, 54, 55),
name = c("Peter,Martin", "Gale,Rebecca", "Rebecca,Gale", "Jonathan,Smith", "Smith,Jonathan")
)
The table:
# A tibble: 5 x 4
dose date response name
<dbl> <chr> <dbl> <chr>
1 10 2007-12-15 45 Peter,Martin
2 20 2007-10-13 67 Gale,Rebecca
3 30 2007-10-13 66 Rebecca,Gale
4 40 2007-09-30 54 Jonathan,Smith
5 50 2007-09-30 55 Smith,Jonathan
One of the columns called name either has a string "FirstName,LastName" or "LastName,FirstName". I wish to merge the rows that contain the same names if they are ordered either way. For example, the rows containing Rebecca,Gale and Gale,Rebecca should merge.
While merging, I wish to get the sums of the columns dose and response and want to keep the first of the date and name entries.
Expected outcome:
# A tibble: 3 x 4
dose date response name
<dbl> <chr> <dbl> <chr>
1 10 2007-12-15 45 Peter,Martin
2 50 2007-10-13 133 Gale,Rebecca
3 90 2007-09-30 109 Jonathan,Smith
Please note that I always want to merge using the name column and not the date column because even if the example contains the same dates, my bigger table has different dates for the same name.
Here is one idea.
library(tidyverse)
df2 <- df %>%
mutate(date = as.Date(date)) %>%
mutate(name = map_chr(name, ~toString(sort(str_split(.x, ",")[[1]])))) %>%
group_by(name) %>%
summarize(dose = sum(dose),
response = sum(response),
date = first(date)) %>%
select(names(df)) %>%
ungroup()
df2
# # A tibble: 3 x 4
# dose date response name
# <dbl> <date> <dbl> <chr>
# 1 50 2007-10-13 133 Gale, Rebecca
# 2 90 2007-09-30 109 Jonathan, Smith
# 3 10 2007-12-15 45 Martin, Peter
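Note that the output above replaces each name with its sorted form, whereas the question asked to keep the first spelling. A base R sketch of the same idea that keeps the sorted form only as a grouping key might look like this (name_key and key are hypothetical names of mine, and the date column is left out to keep it short):

```r
df <- data.frame(
  dose = seq(10, 50, 10),
  response = c(45, 67, 66, 54, 55),
  name = c("Peter,Martin", "Gale,Rebecca", "Rebecca,Gale",
           "Jonathan,Smith", "Smith,Jonathan"),
  stringsAsFactors = FALSE
)

# Order-independent key: split on the comma, sort the parts, glue them back.
name_key <- function(x) {
  vapply(strsplit(x, ",", fixed = TRUE),
         function(parts) paste(sort(parts), collapse = ","),
         character(1))
}

df$key <- name_key(df$name)

# Sum dose and response within each key.
res <- aggregate(cbind(dose, response) ~ key, data = df, FUN = sum)

# match() returns the first row carrying each key, so this keeps the first spelling.
res$name <- df$name[match(res$key, df$key)]
res
```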

How to summarize `Number of days since first date` and `Number of days seen` by ID and for a large data frame

The dataframe df1 summarizes detections of individuals (ID) through the time (Date). As a short example:
library(lubridate)
df1 <- data.frame(ID = c(1,2,1,2,1,2,1,2,1,2),
Date = ymd(c("2016-08-21","2016-08-24","2016-08-23","2016-08-29","2016-08-27","2016-09-02","2016-09-01","2016-09-09","2016-09-01","2016-09-10")))
df1
ID Date
1 1 2016-08-21
2 2 2016-08-24
3 1 2016-08-23
4 2 2016-08-29
5 1 2016-08-27
6 2 2016-09-02
7 1 2016-09-01
8 2 2016-09-09
9 1 2016-09-01
10 2 2016-09-10
I want to summarize both the number of days since the first detection of the individual (Ndays) and the number of distinct days on which the individual has been detected since the first time it was detected (Ndifdays).
Additionally, I would like to include in this summary table a variable called Prop that simply divides Ndifdays by Ndays.
The summary table that I would expect would be this:
> Result
ID Ndays Ndifdays Prop
1 1 11 4 0.364 # Between 21st Aug and 1st Sept there are 11 days.
2 2 17 5 0.294 # Between 24th Aug and 10th Sept there are 17 days.
Does anyone know how to do it?
You could achieve this using various summarising functions in dplyr:
library(dplyr)
df1 %>%
group_by(ID) %>%
summarise(Ndays = as.integer(max(Date) - min(Date)),
Ndifdays = n_distinct(Date),
Prop = Ndifdays/Ndays)
# ID Ndays Ndifdays Prop
# <dbl> <int> <int> <dbl>
#1 1 11 4 0.364
#2 2 17 5 0.294
The data.table version of this would be
library(data.table)
df12 <- setDT(df1)[, .(Ndays = as.integer(max(Date) - min(Date)),
Ndifdays = uniqueN(Date)), by = ID]
df12$Prop <- df12$Ndifdays/df12$Ndays
and base R with aggregate
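df12 <- aggregate(Date ~ ID, df1, function(x) c(Ndays = as.integer(max(x) - min(x)), Ndifdays = length(unique(x))))
# aggregate() returns both results in one matrix column; flatten it into Date.Ndays and Date.Ndifdays
df12 <- do.call(data.frame, df12)
df12$Prop <- df12$Date.Ndifdays/df12$Date.Ndays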
After grouping by 'ID', take the diff of the range of 'Date' to create 'Ndays', get the number of unique 'Date' values with n_distinct to create 'Ndifdays', and then divide 'Ndifdays' by 'Ndays' to get 'Prop'.
library(dplyr)
df1 %>%
group_by(ID) %>%
summarise(Ndays = as.integer(diff(range(Date))),
Ndifdays = n_distinct(Date),
Prop = Ndifdays/Ndays)
# A tibble: 2 x 4
# ID Ndays Ndifdays Prop
# <dbl> <int> <int> <dbl>
#1 1 11 4 0.364
#2 2 17 5 0.294
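As a quick check of the arithmetic for ID 1, using only base R and the five dates that ID has in the sample data:

```r
# The five detection dates of ID 1 (one date appears twice).
dates <- as.Date(c("2016-08-21", "2016-08-23", "2016-08-27",
                   "2016-09-01", "2016-09-01"))

ndays <- as.integer(max(dates) - min(dates))   # days between first and last detection
ndifdays <- length(unique(dates))              # distinct detection days
prop <- ndifdays / ndays

c(Ndays = ndays, Ndifdays = ndifdays, Prop = round(prop, 3))
```

This gives 11 days, 4 distinct days and a proportion of 4/11, which rounds to 0.364.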

Combine rows with consecutive dates into single row with start and end dates

I have a dataframe of events that looks something like this:
EVENT DATE LONG LAT TYPE
1 1/1/2000 23 45 A
2 2/1/2000 23 45 B
3 3/1/2000 23 45 B
3 5/2/2000 22 56 A
4 6/2/2000 19 21 A
I'd like to collapse this so that any events that occur on consecutive days at the same location (as defined by LONG, LAT) are collapsed into a single event with a START and END date and a concatenated column of the TYPES involved.
Thus the above table would become:
EVENT START-DATE END-DATE LONG LAT TYPE
1 1/1/2000 3/1/2000 23 45 ABB
2 5/2/2000 5/2/2000 22 56 A
3 6/2/2000 6/2/2000 19 21 A
Any advice on how to best approach this would be greatly appreciated.
Here's a modified version of Ronak Shah's solution, taking non-consecutive events at the same location as separate event periods.
# expanded data sample
df <- data.frame(
DATE = as.Date(c("2000-01-01", "2000-01-02", "2000-01-03", "2000-01-05",
"2000-02-05", "2000-02-06", "2000-02-07"), format = "%Y-%m-%d"),
LONG = c(23, 23, 23, 23, 22, 19, 22),
LAT = c(45, 45, 45, 45, 56, 21, 56),
TYPE = c("A", "B", "B", "A", "A", "B", "A")
)
library(dplyr)
df %>%
group_by(LONG, LAT) %>%
arrange(DATE) %>%
mutate(DATE.diff = c(1, diff(DATE))) %>%
mutate(PERIOD = cumsum(DATE.diff != 1)) %>%
ungroup() %>%
group_by(LONG, LAT, PERIOD) %>%
summarise(START_DATE = min(DATE),
END_DATE = max(DATE),
TYPE = paste(TYPE, collapse = "")) %>%
ungroup()
# A tibble: 5 x 6
LONG LAT PERIOD START_DATE END_DATE TYPE
<dbl> <dbl> <int> <date> <date> <chr>
1 19 21 0 2000-02-06 2000-02-06 B
2 22 56 0 2000-02-05 2000-02-05 A
3 22 56 1 2000-02-07 2000-02-07 A
4 23 45 0 2000-01-01 2000-01-03 ABB
5 23 45 1 2000-01-05 2000-01-05 A
Edit to add explanation for what's going on with the "PERIOD" variable.
For simplicity, let's consider some sequential consecutive & non-consecutive events at the same location, so we can skip the group_by(LONG, LAT) & arrange(DATE) steps:
# sample dataset of 10 events at the same location.
# first 3 are on consecutive days, next 2 are on consecutive days,
# next 4 are on consecutive days, & last 1 is on its own.
df2 <- data.frame(
DATE = as.Date(c("2001-01-01", "2001-01-02", "2001-01-03",
"2001-01-05", "2001-01-06",
"2001-02-01", "2001-02-02", "2001-02-03", "2001-02-04",
"2001-04-01"), format = "%Y-%m-%d"),
LONG = rep(23, 10),
LAT = rep(45, 10),
TYPE = LETTERS[1:10]
)
As an intermediate step, we create some helper variables:
"DATE.diff" counts the difference between the current row's date & the previous row's date. Since the first row has no date before "2001-01-01", we default the difference to 1.
"non.consecutive" indicates whether the calculated date difference is not 1 (TRUE, i.e. not consecutive from the previous day) or 1 (FALSE, i.e. consecutive from the previous day). If you need to account for same-day events at the same location in the dataset, you can change the calculation from DATE.diff != 1 to DATE.diff > 1 here.
"PERIOD" keeps track of the number of TRUE results in the "non.consecutive" variable. Starting from the first row, every time a row is non-consecutive from the previous row, "PERIOD" increments by 1.
As a result of the helper variables, "PERIOD" takes on a different value for each group of consecutive dates.
df2.intermediate <- df2 %>%
mutate(DATE.diff = c(1, diff(DATE))) %>%
mutate(non.consecutive = DATE.diff != 1) %>%
mutate(PERIOD = cumsum(non.consecutive))
> df2.intermediate
DATE LONG LAT TYPE DATE.diff non.consecutive PERIOD
1 2001-01-01 23 45 A 1 FALSE 0
2 2001-01-02 23 45 B 1 FALSE 0
3 2001-01-03 23 45 C 1 FALSE 0
4 2001-01-05 23 45 D 2 TRUE 1
5 2001-01-06 23 45 E 1 FALSE 1
6 2001-02-01 23 45 F 26 TRUE 2
7 2001-02-02 23 45 G 1 FALSE 2
8 2001-02-03 23 45 H 1 FALSE 2
9 2001-02-04 23 45 I 1 FALSE 2
10 2001-04-01 23 45 J 56 TRUE 3
We can then treat "PERIOD" as a grouping variable in order to find the start / end date & events within each period:
df2.intermediate %>%
group_by(PERIOD) %>%
summarise(START_DATE = min(DATE),
END_DATE = max(DATE),
TYPE = paste(TYPE, collapse = "")) %>%
ungroup()
# A tibble: 4 x 4
PERIOD START_DATE END_DATE TYPE
<int> <date> <date> <chr>
1 0 2001-01-01 2001-01-03 ABC
2 1 2001-01-05 2001-01-06 DE
3 2 2001-02-01 2001-02-04 FGHI
4 3 2001-04-01 2001-04-01 J
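The cumsum trick itself does not depend on dplyr. Here is a minimal base R sketch of it on a plain vector of sorted dates at a single location; the names gap, period and periods are mine.

```r
# Six sorted dates: a run of three, a run of two, and a single day.
dates <- as.Date(c("2001-01-01", "2001-01-02", "2001-01-03",
                   "2001-01-05", "2001-01-06",
                   "2001-02-01"))

# Gap in days to the previous date; the first date gets 1 by convention.
gap <- c(1, as.integer(diff(dates)))

# Each gap other than 1 starts a new period.
period <- cumsum(gap != 1)

# One row per period with its first and last date.
periods <- do.call(rbind, lapply(split(dates, period), function(d) {
  data.frame(START_DATE = min(d), END_DATE = max(d))
}))
periods
```

The dates must already be sorted for this to work, which is why the grouped version above calls arrange(DATE) first.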
With dplyr, we can group by LAT and LONG and select the maximum and minimum DATE for each group and paste the TYPE column together.
library(dplyr)
df %>%
group_by(LONG, LAT) %>%
summarise(start_date = min(as.Date(DATE, "%d/%m/%Y")),
end_date = max(as.Date(DATE, "%d/%m/%Y")),
type = paste0(TYPE, collapse = ""))
# LONG LAT start_date end_date type
# <int> <int> <date> <date> <chr>
#1 19 21 2000-02-06 2000-02-06 A
#2 22 56 2000-02-05 2000-02-05 A
#3 23 45 2000-01-01 2000-01-03 ABB

Resources