Subsetting Across Data Frames Using dplyr in R

Thanks in advance for looking.
I have a data frame of Events (EV):
Event_ID | Person_ID | Start_Period | End_Period | Event_Type
------------------------------------------------------------
A | Person1 | 1 | 9 | Assessment
B | Person1 | 2 | 9 | Activity
C | Person1 | 3 | 6 | Assessment
D | Person2 | 3 | 6 | Activity
E | Person3 | 7 | 13 | Assessment
And I have a data frame of Person-Periods (PP):
Person_ID | Period
----------------------
Person1 | 1
Person1 | 2
Person1 | 3
Person2 | 1
Person2 | 2
Person2 | 3
Person3 | 1
Person3 | 2
Person3 | 3
I want to find out, for each period, how many activities or assessments were ongoing during that period. For example, if an event for Person1 in EV had a start period of 5 and an end period of 10, then that event should show up in periods 5, 6, 7, 8, 9 and 10 in PP. The result would look like this:
Person_ID | Period | ActivitiesFreq | AssessmentsFreq
----------------------------------------------
Person1 | 1 | 0 | 1
Person1 | 2 | 1 | 1
Person1 | 3 | 1 | 2
Person2 | 1 | 0 | 0
Person2 | 2 | 0 | 0
Person2 | 3 | 1 | 0
Person3 | 1 | 0 | 0
Person3 | 2 | 0 | 0
Person3 | 3 | 0 | 0
At the moment I'm using a for loop, which is slow. And I'm resisting a join because the full dataset has hundreds of thousands of rows. I've tried using mutate from the dplyr package:
mutate(PP, sum(EV$Person_ID == Person_ID, EV$Start_Period <= Period, EV$End_Period >= Period))
but I get the following warnings:
Warning messages:
1: In mutate_impl(.data, dots) :
is.na() applied to non-(list or vector) of type 'NULL'
2: In mutate_impl(.data, dots) :
longer object length is not a multiple of shorter object length
3: In mutate_impl(.data, dots) :
longer object length is not a multiple of shorter object length
I'm open to using other packages; I think I don't quite understand something about the way mutate works.

Here's a solution using data.table v1.9.5 (current devel version). I'm using it for the new on= feature that allows joins without having to set keys:
require(data.table) # v1.9.5+
# df1 = EV (events), df2 = PP (person-periods)
ans = setDT(df2)[df1, .(Period, Event_Type,
                        isBetween = Period %between% c(Start_Period, End_Period)),
                 by = .EACHI, on = "Person_ID", nomatch = 0L]
dcast(ans, Person_ID + Period ~ Event_Type, fun.aggregate = sum)
# Using 'isBetween' as value column. Use 'value.var' to override
# Person_ID Period Activity Assessment
# 1: Person1 1 0 1
# 2: Person1 2 1 1
# 3: Person1 3 1 2
# 4: Person2 1 0 0
# 5: Person2 2 0 0
# 6: Person2 3 1 0
# 7: Person3 1 0 0
# 8: Person3 2 0 0
# 9: Person3 3 0 0
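The answer refers to the tables as df1 and df2; assuming df1 = EV and df2 = PP from the question (that mapping is my inference), here is a reproducible setup sketch:
# Toy data from the question; df1 = EV (events), df2 = PP (person-periods)
df1 <- data.frame(
  Event_ID     = c("A", "B", "C", "D", "E"),
  Person_ID    = c("Person1", "Person1", "Person1", "Person2", "Person3"),
  Start_Period = c(1, 2, 3, 3, 7),
  End_Period   = c(9, 9, 6, 6, 13),
  Event_Type   = c("Assessment", "Activity", "Assessment", "Activity", "Assessment")
)
df2 <- data.frame(
  Person_ID = rep(c("Person1", "Person2", "Person3"), each = 3),
  Period    = rep(1:3, times = 3)
)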
How it works:
setDT() converts a data.frame to data.table in-place (by reference).
setDT(df2)[df1, on = "Person_ID"] performs a join operation on column Person_ID. For each row in df1, the corresponding matching rows in df2 are computed, and all columns corresponding to those matching rows are extracted.
setDT(df2)[df1, on = "Person_ID", nomatch = 0L], as you might have guessed, only returns matching rows, leaving out those rows of df1 whose Person_ID has no match in df2.
The by = .EACHI part is a quite useful and powerful argument: it evaluates the expression we provide in j, the second argument within [], once for each row of df1.
For example, consider the 2nd row of df1. Joining on Person_ID, it matches rows 1, 2 and 3 of df2, and by = .EACHI will evaluate the expression provided within .(), which returns Period = 1,2,3, Event_Type = "Activity" and isBetween = FALSE,TRUE,TRUE. Event_Type is recycled to the length of the longest vector (= 3).
Essentially, we are joining and computing at the same time. This is a feature (perhaps unique to data.table) where joins are considered extensions of subset operations. Since we can compute while subsetting and grouping, we can do exactly the same while joining as well. This is both fast and memory efficient, as the entire join doesn't have to be materialised.
To understand it better, try computing what the j expression will return for the last row (event E: Person3, periods 7 to 13): Period = 1,2,3, Event_Type = "Assessment" (recycled) and isBetween = FALSE,FALSE,FALSE, since none of Person3's periods fall within [7, 13].
Then have a look at ans, and the result should be obvious.
Then we have one last step to do, and that is to count the number of Activity and Assessment events for each (Person_ID, Period) and have them as separate columns. This can be done in one step using the dcast function.
The formula says that for each (Person_ID, Period), we'd like to sum() the values of isBetween, as a separate column, for each unique value of Event_Type.
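As the message in the dcast output notes, isBetween was guessed as the value column; you can pass value.var explicitly to silence it (same result):
dcast(ans, Person_ID + Period ~ Event_Type,
      value.var = "isBetween", fun.aggregate = sum)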

I haven't come up with a way to do this without joining the datasets. Here is a dplyr-based solution that uses left_join to join them first (I took only the three columns from EV needed for the task).
Once the datasets are joined, you can group by Person_ID and calculate the cumulative sum of the two types of events. I threw in an arrange in case the real dataset isn't ordered by Period within Person_ID, and removed the Event_Type column within mutate. Note that End_Period is dropped, so this counts events from their Start_Period onward; that matches the expected output here because no event ends within the observed periods.
library(dplyr)
PP %>%
  left_join(., select(EV, -Event_ID, -End_Period),
            by = c("Person_ID", "Period" = "Start_Period")) %>%
  group_by(Person_ID) %>%
  arrange(Period) %>%
  mutate(ActivitiesFreq = cumsum(Event_Type == "Activity" & !is.na(Event_Type)),
         AssessmentFreq = cumsum(Event_Type == "Assessment" & !is.na(Event_Type)),
         Event_Type = NULL)
Source: local data frame [9 x 4]
Groups: Person_ID
Person_ID Period ActivitiesFreq AssessmentFreq
1 Person1 1 0 1
2 Person1 2 1 1
3 Person1 3 1 2
4 Person2 1 0 0
5 Person2 2 0 0
6 Person2 3 1 0
7 Person3 1 0 0
8 Person3 2 0 0
9 Person3 3 0 0

Here is a potential solution (a sketch follows below):
Left join PP and EV (dplyr::left_join) on Person_ID and Period.
Group by person and period: dplyr::group_by(Person_ID, Period).
Count the number of values using dplyr::summarise().
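A minimal sketch of that outline, using the question's PP and EV. One caveat: an equi-join on both Person_ID and Period would only match events at their Start_Period, so this sketch joins on Person_ID alone and applies the interval test inside summarise (the ActivitiesFreq/AssessmentsFreq names come from the question; the rest is my assumption):
library(dplyr)
# Join person-periods to events, then count events whose
# [Start_Period, End_Period] interval covers each Period.
result <- PP %>%
  left_join(EV, by = "Person_ID") %>%
  group_by(Person_ID, Period) %>%
  summarise(ActivitiesFreq  = sum(Event_Type == "Activity" &
                                    Period >= Start_Period & Period <= End_Period,
                                  na.rm = TRUE),
            AssessmentsFreq = sum(Event_Type == "Assessment" &
                                    Period >= Start_Period & Period <= End_Period,
                                  na.rm = TRUE))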

Related

Count of how many times a unique value appears across all columns and rows of a dataset?

I have unique IDs in rows where columns are the IDs of their 'sent' friends. To get a count of 'received' friends, I need to get a count for how many times an ID appears across all columns and rows of the dataset. This is easy in R, but I'd like to stay with Stata for this project.
ID | F1_ID | F2_ID | F3_ID | ID_mentions
-----------------------------------------
1 | 2 | 3 | 4 | 4
2 | 4 | 1 | . | 4
3 | 1 | 2 | . | 3
4 | 2 | 1 | 3 | 3
Toy data above. Here, there are four mentions of ID #1, three mentions of ID #4, etc.
I want to generate a variable containing the count of how many times each ID value in the first column is mentioned in any column of the data set. This is illustrated in the ID_mentions column.
It turns out I wrote something in this territory. You would need to install it with ssc install tab_chi:
* Example generated by -dataex-. For more info, type help dataex
clear
input byte(id f1_id f2_id f3_id)
1 2 3 4
2 4 1 .
3 1 2 .
4 2 1 3
end
tabm *id
           |              values
  variable |      1      2      3      4 |  Total
-----------+-----------------------------+-------
        ID |      1      1      1      1 |      4
     F1_ID |      1      2      0      1 |      4
     F2_ID |      2      1      1      0 |      4
     F3_ID |      0      0      1      1 |      2
-----------+-----------------------------+-------
     Total |      4      4      3      3 |     14
EDIT: To count all mentions:
gen mentions = .
quietly forval i = 1/`=_N' {
    egen work = anycount(*id), value(`=id[`i']')
    su work, meanonly
    replace mentions = r(sum) in `i'
    drop work
}
list

Reshaping data with R [duplicate]

This question already has answers here: How to reshape data from long to wide format (14 answers). Closed 6 years ago.
I have my data in the below table structure:
Person ID | Role | Role Count
-----------------------------
1 | A | 24
1 | B | 3
2 | A | 15
2 | B | 4
2 | C | 7
I would like to reshape this so that there is one row for each Person ID, a column for each distinct role (e.g. A, B, C), and the Role Count for each person as the values. Using the above data, the output would be:
Person ID | Role A | Role B | Role C
-------------------------------------
1 | 24 | 3 | 0
2 | 15 | 4 | 7
Coming from a Java background I would take an iterative approach to this:
Find all distinct values for Role
Create a new table with a column for PersonID and each of the distinct roles
Iterate through the first table, get role counts for each Person ID and Role combination and insert results into new table.
Is there another way of doing this in R without iterating through the first table?
Thanks
Try:
library(tidyr)
df %>% spread(Role, `Role Count`, fill = 0)
Here fill = 0 fills the missing Person 1 / Role C combination with 0, as in your expected output. To make the column names exactly as per your example:
df2 <- df %>% spread(Role, `Role Count`, fill = 0)
names(df2)[-1] <- paste('Role', names(df2)[-1])
Try this:
library(reshape2)
df <- dcast(df, PersonID ~ Role, value.var = 'RoleCount')
df[is.na(df)] <- 0  # dcast leaves missing combinations as NA
names(df)[-1] <- paste('Role', names(df)[-1])
df
PersonID Role A Role B Role C
1 1 24 3 0
2 2 15 4 7
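Note that this answer assumes space-free names (PersonID, RoleCount). If your columns keep the question's original names, one safe route is to rename first; a sketch under that assumption:
library(reshape2)
# Rename the space-containing columns before casting
names(df)[names(df) == "Person ID"]  <- "PersonID"
names(df)[names(df) == "Role Count"] <- "RoleCount"
df_wide <- dcast(df, PersonID ~ Role, value.var = "RoleCount")
df_wide[is.na(df_wide)] <- 0  # fill missing combinations with 0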
With spread from tidyr, sep = " " builds the wide column names ("Role A", "Role B", ...) directly:
library(tidyr)
spread(data, Role, `Role Count`, fill = 0, sep = " ")

How to vectorize this R function when elements depend on other elements in dataframe

Consider this dataframe :
col1 | col2
1 | 1
1 | 2
1 | 3
2 | 4
2 | 5
2 | 6
I want to add a new column, say col3, to the dataframe, with the following definition: the ith element col3[i] is the mean of all values col2[j] over all j such that col1[j] == col1[i] and j != i.
The for loop for it goes like this :
for (i in 1:length(data$col2)) {
  sum <- 0
  count <- 0
  for (j in 1:length(data$col1)) {
    if (data$col1[j] == data$col1[i] && i != j) {
      sum <- sum + data$col2[j]
      count <- count + 1
    }
  }
  data$col3[i] <- sum / count
}
The final table is :
col1 | col2 | col3
1 | 1 | 2.5
1 | 2 | 2
1 | 3 | 1.5
2 | 4 | 5.5
2 | 5 | 5
2 | 6 | 4.5
I could use an apply function, but that would take pretty much as long as the for loop, right? Any help with a vectorized version of this loop is appreciated.
You can use dplyr. The leave-one-out mean for each row is the group sum minus the row's own value, divided by the group size minus one:
library(dplyr)
dat %>%
  group_by(col1) %>%
  mutate(col3 = (sum(col2) - col2) / (n() - 1))
Source: local data frame [6 x 3]
Groups: col1 [2]
col1 col2 col3
(int) (int) (dbl)
1 1 1 2.5
2 1 2 2.0
3 1 3 1.5
4 2 4 5.5
5 2 5 5.0
6 2 6 4.5
This can be done with ave from base R:
df1$col3 <- with(df1, ave(col2, col1,
                          FUN = function(x) (sum(x) - x) / (length(x) - 1)))
Or using data.table
library(data.table)
setDT(df1)[, col3 := (sum(col2)-col2)/(.N-1) , col1]

R - subsetting rows from a data frame for column values within a vector

I have a data.frame, df, as below:
id | name | value
1 | team1 | 3
1 | team2 | 1
2 | team1 | 1
2 | team2 | 4
3 | team1 | 0
3 | team2 | 6
4 | team1 | 1
4 | team2 | 2
5 | team1 | 3
5 | team2 | 0
How do we subset the data frame to get the rows for all values of id from 2:4?
We can do it conditionally, like df[df$id >= 2 & df$id <= 4, ]. But is there a way to directly use a vector of integer ranges like ids <- c(2:4) to subset a data frame?
One way to do this is df[df$id >= min(ids) & df$id <= max(ids), ].
Is there a more elegant R way of doing this?
The most typical way is mentioned already (a sketch of it follows below), but here are variations using match
with(df, df[match(id, 2:4, F) > 0, ])
or, similarly
with(df, df[is.element(id, 2:4), ])
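For completeness, the "most typical way" alluded to above is the %in% idiom, which is what the is.element() call wraps; a quick sketch:
ids <- 2:4
df[df$id %in% ids, ]     # typical idiom; %in% is implemented via match()
subset(df, id %in% ids)  # the same selection using subset()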

Calculate transaction per ID by date

How do I count the total number of transactions by id and by date?
Sample data :
f <- data.frame(
  id = c("A", "A", "A", "A", "C", "C", "D", "D", "E"),
  start_date = c("6/3/2012", "7/3/2012", "7/3/2012", "8/3/2012", "5/3/2012",
                 "6/3/2012", "6/3/2012", "6/3/2012", "5/3/2012")
)
Expected output:
id | count
A | 3
C | 2
D | 1
E | 1
Logic:
A has 6 March, 7 March and 8 March, so its count is 3.
C has 5 March and 6 March, so its count is 2.
And so on...
I tried the following code, but I think it only counts the number of times each id occurs in the data.
library(lubridate)
f$date <- mdy(f$start_date)
f1 <- f[order(f$id, f$date), ]
How can I extend this code to get my desired outcome?
[Note: the actual data is huge, so optimization needs to be considered.]
Thanks in advance.
I'm getting a different answer; this counts every row, including repeated dates:
with(f, tapply(start_date, id, length))
A C D E
4 2 2 1
You can try the following: f[!duplicated(f), ] removes duplicate rows from f, and then aggregate does the aggregation using the length function, i.e. it gives the count of distinct start_date values for each id.
aggregate(start_date ~ id, f[!duplicated(f), ], length)
## id start_date
## 1 A 3
## 2 C 2
## 3 D 1
## 4 E 1
Not sure what format you want the results in, but
rowSums(with(f, table(id, start_date)>0))
will return a named vector with the count of distinct days for each ID.
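For comparison, a dplyr sketch (my addition, not part of the original answers) that counts distinct transaction dates per id and reproduces the expected output:
library(dplyr)
# One row per id with the number of unique start_date values
f %>%
  group_by(id) %>%
  summarise(count = n_distinct(start_date))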
