Aggregating and ranking of groups in R

I have a dataset in this format in R:
+----------+-------+-----------+
| Person   | Group | Timestamp |
+----------+-------+-----------+
| Person A | X     | 12:00 PM  |
| Person A | X     | 12:01 PM  |
| Person A | X     | 12:03 PM  |
| Person A | Y     | 12:10 PM  |
| Person A | Y     | 12:11 PM  |
| Person A | Y     | 12:12 PM  |
| Person A | X     | 12:20 PM  |
| Person A | X     | 12:21 PM  |
| Person A | X     | 12:22 PM  |
| …        |       |           |
+----------+-------+-----------+
I need to convert this into this format:
+----------+-------+---------+
| Person   | Group | Ranking |
+----------+-------+---------+
| Person A | X     | 1       |
| Person A | Y     | 2       |
| Person A | X     | 3       |
| …        |       |         |
+----------+-------+---------+
(Collapse each run of consecutive identical entries into a single row. The same group can repeat after another group, as in the example above, where the groups run X > Y > X.)
I have hundreds of Persons and ~20 million records. I tried a for loop, but it takes far too long. Is there a faster way to achieve this?
Any help is appreciated. Thanks in advance.
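The core operation being asked for is run-length encoding: collapse each run of repeats to one value, then number the runs. A minimal base-R sketch on just the Group values from the example:
# rle() collapses consecutive duplicates; numbering its values gives the ranking
r <- rle(c("X", "X", "X", "Y", "Y", "Y", "X", "X", "X"))
data.frame(Group = r$values, Ranking = seq_along(r$values))
#   Group Ranking
# 1     X       1
# 2     Y       2
# 3     X       3
The answers below wrap this same idea up per Person.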

Here's a data.table solution; it should be quite fast.
library(data.table)

# rleid() assigns a run id to each stretch of identical Groups within a Person;
# keeping the first row of each run collapses the consecutive duplicates
dt[, .(Ranking = rleid(Group), Group), by = .(Person)][, .SD[1], by = .(Ranking, Person)]
#      Person Ranking Group
# 1: Person A       1     X
# 2: Person A       2     Y
# 3: Person A       3     X
(The original version didn't calculate the rleid for each Person separately; edited to fix.)
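To see why the per-Person grouping matters, here is a small made-up check with a hypothetical second person: without by = .(Person), a run that continues across a person boundary would share one id.
dt2 <- data.table(Person = c("A", "A", "B"), Group = c("X", "X", "X"))
dt2[, rleid(Group)]                       # 1 1 1 -- B's run is merged into A's
dt2[, .(r = rleid(Group)), by = Person]   # restarts at 1 for each Person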
A different method. Not sure if this will be any faster, but we can conceptualize the problem as keeping rows where either the Person or the Group differs from the previous row, then numbering the kept rows within each Person:
# keep a row when it starts a new run (first row, or Person/Group changed)
dt[is.na(shift(Person)) | shift(Person) != Person | shift(Group) != Group,
   .(Person, Group)][, Ranking := 1:.N, by = .(Person)][]
#      Person Group Ranking
# 1: Person A     X       1
# 2: Person A     Y       2
# 3: Person A     X       3
Using this data:
dt = fread("Person   | Group | Timestamp
Person A | X     | 12:00 PM
Person A | X     | 12:01 PM
Person A | X     | 12:03 PM
Person A | Y     | 12:10 PM
Person A | Y     | 12:11 PM
Person A | Y     | 12:12 PM
Person A | X     | 12:20 PM
Person A | X     | 12:21 PM
Person A | X     | 12:22 PM", sep = "|")
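At the ~20-million-row scale it may also be worth a single pass, computing one rleid() over both columns; a sketch, assuming the rows are already sorted by Person and Timestamp:
# one run id per (Person, Group) stretch; collapse each run to its first row
res <- dt[, .(Person = Person[1], Group = Group[1]),
          by = .(rid = rleid(Person, Group))]
# rowid() then numbers the collapsed runs within each Person
res[, Ranking := rowid(Person)][, rid := NULL][]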

Here is a tidyverse solution that ensures the timestamps are sorted in ascending order within Person before returning the rankings.
library(tidyverse)

# collapse each run of identical Groups to a single row, numbered in order
get_ranking <- function(data) {
  grps <- rle(data$Group)$values
  data.frame(Group = grps, Ranking = seq_along(grps))
}

dat %>%
  group_by(Person) %>%
  arrange(Timestamp) %>%
  group_modify(~ get_ranking(.x))
Using this data:
dat <- data.frame(Person = 'Person A',
                  Group = rep(c('X', 'Y', 'X'), each = 3),
                  Timestamp = as.POSIXct('2010-01-01 12:00 PM') + (1:9) * 60,
                  stringsAsFactors = FALSE)
To produce this output:
# A tibble: 3 x 3
# Groups:   Person [1]
  Person   Group Ranking
  <chr>    <fct>   <int>
1 Person A X           1
2 Person A Y           2
3 Person A X           3

library(dplyr)
library(tidyr)

d %>%
  group_by(Person) %>%
  # TRUE marks the first row of each run of identical Groups
  mutate(Ranking = sequence(rle(Group)$lengths) == 1) %>%
  filter(Ranking) %>%
  # number the surviving first-of-run rows within each Person
  mutate(Ranking = cumsum(Ranking)) %>%
  ungroup() %>%
  select(-Timestamp)
# A tibble: 3 x 3
#   Person   Group Ranking
#   <chr>    <chr>   <int>
# 1 Person A X           1
# 2 Person A Y           2
# 3 Person A X           3
In base R:
do.call(rbind, lapply(split(d, d$Person), function(x) {
  data.frame(Person = x$Person[1],
             with(rle(x$Group),
                  data.frame(Group = values,
                             Ranking = seq_along(values))))
}))
DATA
d = structure(list(Person = c("Person A", "Person A", "Person A",
                              "Person A", "Person A", "Person A",
                              "Person A", "Person A", "Person A"),
                   Group = c("X", "X", "X", "Y", "Y", "Y", "X", "X", "X"),
                   Timestamp = c("12:00 PM", "12:01 PM", "12:03 PM", "12:10 PM",
                                 "12:11 PM", "12:12 PM", "12:20 PM", "12:21 PM",
                                 "12:22 PM")),
              class = "data.frame", row.names = c(NA, -9L))

Related

How to group data by time in R and count frequency

I want to group each time into a period of the day:
Morning: 00:00:00 - 09:59:59
Afternoon: 10:00:00 - 17:59:59
Evening: 18:00:00 - 23:59:59
This is the input data:
| Date     | Time     |
|----------|----------|
| 21/10/20 | 03:49:19 |
| 21/10/20 | 05:39:23 |
| 21/10/20 | 09:23:10 |
| 21/10/20 | 14:38:50 |
| 21/10/20 | 17:17:48 |
| 21/10/20 | 21:23:45 |
| 21/10/20 | 21:49:32 |
The output data should be:
| Period    | Count |
|-----------|-------|
| Morning   | 3     |
| Afternoon | 2     |
| Evening   | 2     |
You could use hms and case_when:
data <- read.table(text = '
Date       Time
"21/10/20" "03:49:19"
"21/10/20" "05:39:23"
"21/10/20" "09:23:10"
"21/10/20" "14:38:50"
"21/10/20" "17:17:48"
"21/10/20" "21:23:45"
"21/10/20" "21:49:32"', header = TRUE)
library(hms)
library(dplyr)

data %>%
  mutate(period = case_when(as_hms(Time) < as_hms('10:00:00') ~ 'Morning',
                            as_hms(Time) < as_hms('18:00:00') ~ 'Afternoon',
                            TRUE ~ 'Evening')) %>%
  group_by(Date, period) %>%
  summarize(count = n()) %>%
  ungroup()
#> # A tibble: 3 x 3
#>   Date     period    count
#>   <chr>    <chr>     <int>
#> 1 21/10/20 Afternoon     2
#> 2 21/10/20 Evening       2
#> 3 21/10/20 Morning       3
library(lubridate)

data %>%
  group_by(period = case_when(hms(Time) < hours(10) ~ 'Morning',
                              hms(Time) < hours(18) ~ 'Afternoon',
                              TRUE ~ 'Evening')) %>%
  summarise(Count = n())
# A tibble: 3 x 2
#   period    Count
#   <chr>     <int>
# 1 Afternoon     2
# 2 Evening       2
# 3 Morning       3
Base R:
data$Hour <- as.integer(substr(data$Time, 1, 2))
result <- stack(with(data, table(ifelse(Hour < 10, 'Morning',
                                        ifelse(Hour < 18, 'Afternoon', 'Evening')))))
result
#   values       ind
# 1      2 Afternoon
# 2      2   Evening
# 3      3   Morning
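A variant of the same base-R idea (a sketch) uses cut() to map the hour breaks to labels directly:
data$Hour <- as.integer(substr(data$Time, 1, 2))
# intervals (-1,9], (9,17], (17,23] give hours 0-9, 10-17, 18-23
data$Period <- cut(data$Hour, breaks = c(-1, 9, 17, 23),
                   labels = c('Morning', 'Afternoon', 'Evening'))
as.data.frame(table(Period = data$Period))
#      Period Freq
# 1   Morning    3
# 2 Afternoon    2
# 3   Evening    2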

parse values based on groups in R

I have a very large dataset and a sample of that looks something like the one below:
| Id | Name | Start_Date | End_Date |
|----|---------|------------|------------|
| 10 | Mark | 4/2/1999 | 7/5/2018 |
| 10 | | 1/1/2000 | 9/24/2018 |
| 25 | | 5/3/1968 | 6/3/2000 |
| 25 | | 6/6/2009 | 4/23/2010 |
| 25 | Anthony | 2/20/2010 | 7/21/2016 |
| 25 | | 9/12/2014 | 11/26/2019 |
I need to parse the names from Name column based on their Id such that the output table looks like:
| Id | Name | Start_Date | End_Date |
|----|---------|------------|------------|
| 10 | Mark | 4/2/1999 | 7/5/2018 |
| 10 | Mark | 1/1/2000 | 9/24/2018 |
| 25 | Anthony | 5/3/1968 | 6/3/2000 |
| 25 | Anthony | 6/6/2009 | 4/23/2010 |
| 25 | Anthony | 2/20/2010 | 7/21/2016 |
| 25 | Anthony | 9/12/2014 | 11/26/2019 |
How can I achieve the output shown above? I went through the substitute and parse functions, but couldn't see how they apply to this problem.
My dataset would be:
df = data.frame(Id = c("10", "10", "25", "25", "25", "25"),
                Name = c("Mark", "", "", "", "Anthony", ""),
                Start_Date = c("4/2/1999", "1/1/2000", "5/3/1968", "6/6/2009", "2/20/2010", "9/12/2014"),
                End_Date = c("7/5/2018", "9/24/2018", "6/3/2000", "4/23/2010", "7/21/2016", "11/26/2019"))
We can change the blanks ("") to NA and use fill to replace the NA elements with the adjacent non-NA element, filling down and then up:
library(dplyr)
library(tidyr)

df1 %>%
  mutate(Name = na_if(Name, "")) %>%
  group_by(Id) %>%
  fill(Name, .direction = "down") %>%
  fill(Name, .direction = "up")
# A tibble: 6 x 4
# Groups:   Id [2]
#   Id    Name    Start_Date End_Date
#   <chr> <chr>   <chr>      <chr>
# 1 10    Mark    4/2/1999   7/5/2018
# 2 10    Mark    1/1/2000   9/24/2018
# 3 25    Anthony 5/3/1968   6/3/2000
# 4 25    Anthony 6/6/2009   4/23/2010
# 5 25    Anthony 2/20/2010  7/21/2016
# 6 25    Anthony 9/12/2014  11/26/2019
In the devel version of tidyr (0.8.3.9000), this can be done in a single fill statement, as .direction = "downup" is also an option:
df1 %>%
  mutate(Name = na_if(Name, "")) %>%
  group_by(Id) %>%
  fill(Name, .direction = "downup")
Or another option is to group by 'Id' and mutate 'Name' to the first non-blank element:
df1 %>%
  group_by(Id) %>%
  mutate(Name = first(Name[Name != ""]))
# A tibble: 6 x 4
# Groups:   Id [2]
#   Id    Name    Start_Date End_Date
#   <chr> <chr>   <chr>      <chr>
# 1 10    Mark    4/2/1999   7/5/2018
# 2 10    Mark    1/1/2000   9/24/2018
# 3 25    Anthony 5/3/1968   6/3/2000
# 4 25    Anthony 6/6/2009   4/23/2010
# 5 25    Anthony 2/20/2010  7/21/2016
# 6 25    Anthony 9/12/2014  11/26/2019
data
df1 <- structure(list(Id = c("10", "10", "25", "25", "25", "25"), Name = c("Mark",
"", "", "", "Anthony", ""), Start_Date = c("4/2/1999", "1/1/2000",
"5/3/1968", "6/6/2009", "2/20/2010", "9/12/2014"), End_Date = c("7/5/2018",
"9/24/2018", "6/3/2000", "4/23/2010", "7/21/2016", "11/26/2019"
)), class = "data.frame", row.names = c(NA, -6L))
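For very large data, a data.table sketch of the same first-non-blank idea (assuming, as in the question, at most one distinct non-blank Name per Id):
library(data.table)
# the length-1 first non-blank value is recycled across each Id group
setDT(df1)[, Name := Name[nzchar(Name)][1], by = Id][]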
Using DF, defined reproducibly in the Note at the end, replace each zero-length element of Name with NA and then use na.omit to get the unique non-NA value with which to fill. We have assumed that there is only one non-NA per Id, which is the case in the question; if not, we could replace na.omit with function(x) unique(na.omit(x)), assuming that the non-NAs are all the same within each Id. No packages are used.
transform(DF, Name = ave(replace(Name, !nzchar(Name), NA), Id, FUN = na.omit))
giving:
  Id    Name Start_Date   End_Date
1 10    Mark   4/2/1999   7/5/2018
2 10    Mark   1/1/2000  9/24/2018
3 25 Anthony   5/3/1968   6/3/2000
4 25 Anthony   6/6/2009  4/23/2010
5 25 Anthony  2/20/2010  7/21/2016
6 25 Anthony  9/12/2014 11/26/2019
na.strings
We can simplify this slightly if we make sure that the zero-length elements of Name are NA in the first place. We replace the read.table line in the Note with the first line below; then it is just a matter of applying na.omit within ave.
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, sep = "|",
strip.white = TRUE, na.strings = "")
transform(DF, Name = ave(Name, Id, FUN = na.omit))
Note
The input in reproducible form:
Lines <- "
Id | Name | Start_Date | End_Date
10 | Mark | 4/2/1999 | 7/5/2018
10 | | 1/1/2000 | 9/24/2018
25 | | 5/3/1968 | 6/3/2000
25 | | 6/6/2009 | 4/23/2010
25 | Anthony | 2/20/2010 | 7/21/2016
25 | | 9/12/2014 | 11/26/2019"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, sep = "|", strip.white = TRUE)

How can I take two columns in R and flatten them like the below?

Before
+---------+-----------------------------------+
| Word    | Tags                              |
+---------+-----------------------------------+
| morning | #sunrise #droplets #waterdroplets |
| morning | #sky #ocean #droplets             |
+---------+-----------------------------------+
After
+---------+---------------+
| Word    | Tags          |
+---------+---------------+
| morning | sunrise       |
| morning | droplets      |
| morning | waterdroplets |
| morning | sky           |
| morning | ocean         |
| morning | droplets      |
+---------+---------------+
Notice how I want to keep droplets appearing twice. This table is very big (over 5M rows), so an efficient method would be very helpful. Thanks!
We can use separate_rows from tidyr.
library(dplyr)
library(tidyr)

dat <- tribble(
  ~Word,     ~Tags,
  "morning", "#sunrise #droplets #waterdroplets",
  "morning", "#sky #ocean #droplets"
)

dat2 <- dat %>%
  separate_rows(Tags, sep = " #") %>%
  mutate(Tags = gsub("#", "", Tags))
dat2
# # A tibble: 6 x 2
#   Word    Tags
#   <chr>   <chr>
# 1 morning sunrise
# 2 morning droplets
# 3 morning waterdroplets
# 4 morning sky
# 5 morning ocean
# 6 morning droplets
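Given the 5M-row concern, a data.table sketch of the same unnesting may be faster: split on the hash, replicate each row once per piece, then drop the empty leading piece:
library(data.table)
setDT(dat)
pieces <- strsplit(dat$Tags, "#", fixed = TRUE)
out <- dat[rep(seq_len(.N), lengths(pieces))]   # one row per split piece
out[, Tags := trimws(unlist(pieces))]
out <- out[nzchar(Tags)]                        # drop the empty piece before the first "#"
out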

Spreading data over a date range from a column (R)

I have a set of survey data, where each survey covers multiple days. Here is an example of what the data looks like in the current form:
| Survey | Dates | Result |
|--------|--------------|--------|
| A | 11/30 - 12/1 | 33% |
| B | 12/2 - 12/4 | 26% |
| C | 12/4 - 12/5 | 39% |
This example can be made with the following:
frame <- data.frame(Survey = c('A', 'B', 'C'),
                    Dates = c('11/30 - 12/1', '12/2 - 12/4', '12/4 - 12/5'),
                    Result = c('33%', '26%', '39%'))
What I would like to do is make a column for each date, and if the date is within the range of the survey, to put the result in the cell. It would look something like this:
| Survey | 11/30 | 12/1 | 12/2 | 12/3 | 12/4 | 12/5 |
|--------|-------|------|------|------|------|------|
| A | 33% | 33% | | | | |
| B | | | 26% | 26% | 26% | |
| C | | | | | 39% | 39% |
Any help would be appreciated.
Here's an idea:
library(dplyr)
library(tidyr)

frame %>%
  separate_rows(Dates, sep = " - ") %>%
  mutate(Dates = as.Date(Dates, format = "%m/%d")) %>%
  group_by(Survey) %>%
  complete(Dates = seq(min(Dates), max(Dates), 1)) %>%
  fill(Result) %>%
  spread(Dates, Result)
Which gives:
#  Survey `2017-11-30` `2017-12-01` `2017-12-02` `2017-12-03` `2017-12-04` `2017-12-05`
#* <fctr>       <fctr>       <fctr>       <fctr>       <fctr>       <fctr>       <fctr>
#1      A          33%          33%           NA           NA           NA           NA
#2      B           NA           NA          26%          26%          26%           NA
#3      C           NA           NA           NA           NA          39%          39%
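The 2017 prefix in those column names is just as.Date() filling in the current year when the format string omits %Y; if a specific year matters, pin it explicitly (a sketch):
as.Date(paste0("2017/", c("11/30", "12/1")), format = "%Y/%m/%d")
# [1] "2017-11-30" "2017-12-01"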
A tidyverse solution but it requires that you play with the Dates column a bit:
# install.packages('tidyverse')
library(tidyverse)

dframe <- data.frame(Survey = c('A', 'B', 'C'),
                     Dates = c('11/30 - 12/1', '12/2 - 12/4', '12/4 - 12/5'),
                     Result = c('33%', '26%', '39%'), stringsAsFactors = FALSE)

# expand each range into the full sequence of dates it covers,
# then re-collapse to a single " - "-separated string
dframe$Dates <- lapply(strsplit(dframe$Dates, split = " - "), function(x) {
  x <- strptime(x, "%m/%d")
  x <- seq(min(x), max(x), '1 day')
  paste0(strftime(x, "%m/%d"), collapse = " - ")
})

dframe %>%
  separate_rows(Dates, sep = " - ") %>%
  spread(Dates, Result)
Should get:
Survey 11/30 12/01 12/02 12/03 12/04 12/05
     A   33%   33%  <NA>  <NA>  <NA>  <NA>
     B  <NA>  <NA>   26%   26%   26%  <NA>
     C  <NA>  <NA>  <NA>  <NA>   39%   39%
I hope this helps.

Concatenate rows in R depending on specific row value range

I have two data frames:
df
set.seed(10)
df <- data.frame(Name = c("Bob", "John", "Jane", "John", "Bob", "Jane", "Jane"),
                 Date = as.Date(c("2014-06-04", "2013-12-04", "2013-11-04", "2013-12-06",
                                  "2014-01-09", "2014-03-21", "2014-09-24")),
                 Degrees = rnorm(7, mean = 32, sd = 32))
Name | Date       | Degrees
Bob  | 2014-06-04 | 50.599877
John | 2013-12-04 | 44.103919
Jane | 2013-11-04 |  6.117422
John | 2013-12-06 | 30.826633
Bob  | 2014-01-09 | 59.425444
Jane | 2014-03-21 | 62.473418
Jane | 2014-09-24 | 11.341562
df2
df2 <- data.frame(Name = c("Bob", "John", "Jane"),
                  Date = as.Date(c("2014-03-01", "2014-01-20", "2014-06-07")),
                  Weather = c("Good weather", "Bad weather", "Good weather"))
Name | Date       | Weather
Bob  | 2014-03-01 | Good weather
John | 2014-01-20 | Bad weather
Jane | 2014-06-07 | Good weather
I would like to extract the following:
Name | Date       | Weather      | Degrees (until this Date) | Other measures
Bob  | 2014-03-01 | Good weather | 59.425444                 | 50.599877
John | 2014-01-20 | Bad weather  | 44.103919, 30.826633      |
Jane | 2014-06-07 | Good weather | 6.117422, 62.473418       | 11.341562
This is a merge of df and df2 in which:
"Degrees (until this Date)" concatenates the values of df$Degrees dated up to df2$Date;
"Other measures" holds whatever values of df$Degrees fall after df2$Date.
Another alternative:
# a grouping variable to use for identical splitting
nms = unique(c(as.character(df$Name), as.character(df2$Name)))

# split data
dates = split(df$Date, factor(df$Name, nms))
degrees = split(df$Degrees, factor(df$Name, nms))
thresholds = split(df2$Date, factor(df2$Name, nms))

# Map the condition over persons
res = do.call(rbind.data.frame,
              Map(function(date, thres, deg)
                    tapply(deg, factor(date <= thres, c(TRUE, FALSE)),
                           paste0, collapse = ", "),
                  dates, thresholds, degrees))

# bind with df2
cbind(df2, setNames(res[match(row.names(res), df2$Name), ],
                    c("Degrees", "Other")))
#      Name       Date      Weather                             Degrees             Other
# Bob   Bob 2014-03-01 Good weather                    41.4254440501603  32.5998774701384
# John John 2014-01-20  Bad weather     26.10391865379, 12.826633094921              <NA>
# Jane Jane 2014-06-07 Good weather -11.8825775975204, 44.4734176224054 -6.65843761374357
Here's one approach:
library(dplyr)
library(tidyr)
library(magrittr)

res <-
  left_join(df, df2 %>% select(Name, Date, Weather), by = "Name") %>%
  mutate(paste = factor(Date.x <= Date.y, labels = c("before", "other"))) %>%
  group_by(Name, paste) %>%
  mutate(Degrees = paste(Degrees, collapse = ", ")) %>%
  distinct() %>%
  spread(paste, Degrees) %>%
  group_by(Name, Date.y, Weather) %>%
  summarise(other = other[1], before = before[2]) %>%
  set_names(c("Name", "Date", "Weather", "Degrees (until this Date)", "Other measures"))
res[is.na(res)] <- ""
res
#   Name       Date      Weather           Degrees (until this Date)    Other measures
# 1  Bob 2014-03-01 Good weather                    41.4254440501603  32.5998774701384
# 2 Jane 2014-06-07 Good weather -11.8825775975204, 44.4734176224054 -6.65843761374357
# 3 John 2014-01-20  Bad weather     26.10391865379, 12.826633094921
There may be room for improvements, but anyway.
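One such simplification (a sketch; it assumes dplyr >= 1.0 for the .groups argument, and it returns "" rather than NA for empty groups) folds everything into a single join-then-summarise pipeline:
library(dplyr)

# suffix keeps df2's Date as "Date" and renames df's measurement date to "Date.m"
df %>%
  inner_join(df2, by = "Name", suffix = c(".m", "")) %>%
  group_by(Name, Date, Weather) %>%
  summarise(
    `Degrees (until this Date)` = paste(Degrees[Date.m <= Date], collapse = ", "),
    `Other measures`            = paste(Degrees[Date.m >  Date], collapse = ", "),
    .groups = "drop"
  )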
