Combine similar consecutive observations into one observation in R

I have a data set like this
date ID key value
05 1 3 2
05 1 3 5
05 1 3 1
05 1 5 2
05 1 7 3
05 1 7 3
05 1 3 4
05 2 9 8
I need the output to look like this
date ID key value
05 1 3 8
05 1 5 2
05 1 7 6
05 1 3 4
05 2 9 8
As you can see, if consecutive rows have the same date, ID, and key, I want to combine those observations and add their values together. I need this to happen only when the events are consecutive.
Is it possible to do this in R?
If yes, can anyone please tell me how to do it?
Thanks

Use rle to look for runs of consecutive values:
# your data
df <- read.table(text="date ID key value
05 1 3 2
05 1 3 5
05 1 3 1
05 1 5 2
05 1 7 3
05 1 7 3
05 1 3 4
05 2 9 8", header = TRUE)
# get consecutive runs - add a grouping variable
r <- with(df, rle(paste(date, ID, key)))
df$grps <- rep(seq_along(r$lengths), r$lengths)
# aggregate values
a <- aggregate(value ~ date + ID + key + grps, data = df , sum)
# remove the grouping variable
a$grps <- NULL
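For comparison, the same run-wise grouping can be sketched with dplyr: a cumsum() over the points where (date, ID, key) changes builds the run id. This is an alternative sketch, not part of the original answer.

```r
library(dplyr)

df <- read.table(text = "date ID key value
05 1 3 2
05 1 3 5
05 1 3 1
05 1 5 2
05 1 7 3
05 1 7 3
05 1 3 4
05 2 9 8", header = TRUE)

res <- df %>%
  # new run whenever the date/ID/key combination differs from the previous row
  mutate(grp = cumsum(paste(date, ID, key) != lag(paste(date, ID, key), default = ""))) %>%
  group_by(grp, date, ID, key) %>%
  summarise(value = sum(value), .groups = "drop") %>%
  select(-grp)
res
# value column: 8 2 6 4 8
```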


SAS/SQL group by and keeping all rows

I have a table like this, observing the behavior of some accounts over time; here, two accounts with acc_id 1 and 22:
acc_id date mob
1 Dec 13 -1
1 Jan 14 0
1 Feb 14 1
1 Mar 14 2
22 Mar 14 10
22 Apr 14 11
22 May 14 12
I would like to create a column orig_date that would be equal to date if mob=0 and to minimum date by acc_id group if there is no mob=0 for that acc_id.
Therefore the expected output is:
acc_id date mob orig_date
1 Dec 13 -1 Jan 14
1 Jan 14 0 Jan 14
1 Feb 14 1 Jan 14
1 Mar 14 2 Jan 14
22 Mar 14 10 Mar 14
22 Apr 14 11 Mar 14
22 May 14 12 Mar 14
The second account does not have mob=0 observation, therefore orig_date is set to min(date) by group.
Is there some way to achieve this in SAS, preferably in a single PROC SQL step?
Seems pretty simple. Just calculate the min date in two ways and use coalesce() to pick the one you want.
First, let's turn your printout into an actual dataset:
data have ;
input acc_id date :anydtdte. mob ;
format date date9.;
cards;
1 Dec13 -1
1 Jan14 0
1 Feb14 1
1 Mar14 2
22 Mar14 10
22 Apr14 11
22 May14 12
;
To find the DATE when MOB=0, use a CASE expression. PROC SQL will automatically remerge the MIN() aggregate results calculated at the ACC_ID level back onto all of the detail rows.
proc sql ;
create table want as
select *
, coalesce( min(case when mob=0 then date else . end)
, min(date)
) as orig_date format=date9.
from have
group by acc_id
order by acc_id, date
;
quit;
Result:
Obs acc_id date mob orig_date
1 1 01DEC2013 -1 01JAN2014
2 1 01JAN2014 0 01JAN2014
3 1 01FEB2014 1 01JAN2014
4 1 01MAR2014 2 01JAN2014
5 22 01MAR2014 10 01MAR2014
6 22 01APR2014 11 01MAR2014
7 22 01MAY2014 12 01MAR2014
Here is a DATA step approach:
data have;
input acc_id date $ mob;
datalines;
1 Dec13 -1
1 Jan14 0
1 Feb14 1
1 Mar14 2
22 Mar14 10
22 Apr14 11
22 May14 12
;
data want;
do until (last.acc_id);
set have;
by acc_id;
if first.acc_id then orig_date=date;
if mob=0 then orig_date=date;
end;
do until (last.acc_id);
set have;
by acc_id;
output;
end;
run;

Subsetting data based on multiple stratified fields and criteria

My data frame has multiple factors. I would like to subset the data in a way that excludes only data that belongs to a specific factor level within another factor level.
I've tried the following two approaches, but only one worked, and I'm not sure why. I would appreciate it if someone could explain.
This is a simplified example, where f1 and f2 are the factors:
df = data.frame(f1 = c(rep(2019,4),rep(2018,4),rep(2017,4)),
f2 = rep(1:4,3), data = c(0:11))
print (df)
Output:
f1 f2 data
1 2019 1 0
2 2019 2 1
3 2019 3 2
4 2019 4 3
5 2018 1 4
6 2018 2 5
7 2018 3 6
8 2018 4 7
9 2017 1 8
10 2017 2 9
11 2017 3 10
12 2017 4 11
In this case I want to drop only the rows where f2 is 1 and f1 is 2019, keeping all other data.
Method 1:
subs.df = subset (df, f1 != 2019 & f2 != 1)
print (subs.df)
f1 f2 data
6 2018 2 5
7 2018 3 6
8 2018 4 7
10 2017 2 9
11 2017 3 10
12 2017 4 11
Method 2:
subs.df = subset (df, !(f1 %in% 2019 & f2 %in% 1))
print (subs.df)
f1 f2 data
2 2019 2 1
3 2019 3 2
4 2019 4 3
5 2018 1 4
6 2018 2 5
7 2018 3 6
8 2018 4 7
9 2017 1 8
10 2017 2 9
11 2017 3 10
12 2017 4 11
WORKED!
Why doesn't method 1 work but method 2 does?
What are the differences?
This is a logic issue: by De Morgan's law, the negation of (A and B) is (not A) or (not B).
You just have to replace & with | (or):
subs.df = subset (df, f1 != 2019 | f2 != 1)
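As a quick sanity check, here is a minimal sketch using the same example data, showing that the corrected condition and Method 2 select exactly the same rows:

```r
df <- data.frame(f1 = c(rep(2019, 4), rep(2018, 4), rep(2017, 4)),
                 f2 = rep(1:4, 3), data = 0:11)

# De Morgan: !(A & B) is the same as (!A | !B)
m1 <- subset(df, f1 != 2019 | f2 != 1)
m2 <- subset(df, !(f1 %in% 2019 & f2 %in% 1))
identical(m1, m2)
# [1] TRUE
```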

Truncating a dataframe according to count of vector elements

I have a dataframe df, containing three vectors:
subject condition value
01 A 12
01 A 6
01 B 10
01 B 2
02 A 5
02 A 11
02 B 3
02 B 5
02 B 9
...
There are four observations (and hence four rows) for subject 01, with two observations corresponding to condition A and two corresponding to condition B. Let's say that due to a technical error, there are three condition B observations for subject 02.
My question is this: how can I truncate df to ensure that each condition only has two observations for each individual subject (hence removing the erroneous third row where condition==B for subject 02)?
Thanks in advance for any assistance!
Here's a dplyr solution -
df %>%
group_by(subject, condition) %>%
filter(row_number() < 3) %>%
ungroup()
# A tibble: 8 x 3
subject condition value
<chr> <chr> <dbl>
1 01 A 12
2 01 A 6
3 01 B 10
4 01 B 2
5 02 A 5
6 02 A 11
7 02 B 3
8 02 B 5
For each subject/condition pair create a sequence number seq for its rows and then only keep those rows whose sequence number is less than 3.
subset(transform(DF, seq = ave(value, subject, condition, FUN = seq_along)), seq < 3)
giving:
subject condition value seq
1 01 A 12 1
2 01 A 6 2
3 01 B 10 1
4 01 B 2 2
5 02 A 5 1
6 02 A 11 2
7 02 B 3 1
8 02 B 5 2
Note
The input in reproducible form is assumed to be:
Lines <- "subject condition value
01 A 12
01 A 6
01 B 10
01 B 2
02 A 5
02 A 11
02 B 3
02 B 5
02 B 9"
DF <- read.table(text = Lines, header = TRUE, strip.white = TRUE,
colClasses = c("character", "character", "numeric"))

Add a column that sum the number of sessions per user in R [duplicate]

This question already has answers here:
Add count of unique / distinct values by group to the original data
(3 answers)
Closed 6 years ago.
I am starting to data-mine a mobile application,
and I have a database that looks like this:
Database
UserId Hour Date
01 18 01.01.2016
01 18 01.01.2016
01 14 02.01.2016
01 14 03.01.2016
02 21 03.01.2016
02 08 05.01.2016
02 08 05.01.2016
03 23 05.01.2016
I would like to add a new column to this database that counts the number of different days each user has been using the application.
In this database, for example, UserId 01 has been on the platform on three different days.
Expected data outcomes like this:
Database
UserId Hour Date NumDates
01 18 01.01.2016 3
01 18 01.01.2016 3
01 14 02.01.2016 3
01 14 03.01.2016 3
02 21 03.01.2016 2
02 08 05.01.2016 2
02 08 05.01.2016 2
03 23 05.01.2016 1
So far I have used this command:
Database["NumDates"] <- Database %>% group_by(UserId) %>% summarise(NumDates = length(unique(Date)))
But it tells me that it is creating only 5000 lines (the number of different users in my database) when I need 600,000+ (the number of sessions in my database).
If somebody could help me with this, it will be greatly appreciated!
We can use uniqueN from data.table
library(data.table)
setDT(Database)[, NumDates := uniqueN(Date) , by = UserId]
Database
# UserId Hour Date NumDates
#1: 1 18 01.01.2016 3
#2: 1 18 01.01.2016 3
#3: 1 14 02.01.2016 3
#4: 1 14 03.01.2016 3
#5: 2 21 03.01.2016 2
#6: 2 8 05.01.2016 2
#7: 2 8 05.01.2016 2
#8: 3 23 05.01.2016 1
You don't want summarise here but mutate. summarise will give you one row per distinct value of the column you grouped by, while mutate will just add another column, preserving the existing ones.
You could use n_distinct in dplyr:
library("dplyr")
database <- data.frame(UserId = c(1,1,1,1,2,2,2,3), Hour = c(18,18,14,14,21,8,8,23), Date = c("01.01.2016","01.01.2016","02.01.2016","03.01.2016","03.01.2016","05.01.2016","05.01.2016","05.01.2016"))
database %>% group_by(UserId) %>% mutate(NumDates = n_distinct(Date))
The result is as follows:
UserId Hour Date NumDates
(dbl) (dbl) (fctr) (int)
1 1 18 01.01.2016 3
2 1 18 01.01.2016 3
3 1 14 02.01.2016 3
4 1 14 03.01.2016 3
5 2 21 03.01.2016 2
6 2 8 05.01.2016 2
7 2 8 05.01.2016 2
8 3 23 05.01.2016 1

Working with dataframes from unique function

I was wondering how I could go about changing some data like this, from two data frames I created:
Variable Freq
01 3
02 2
03 4
04 5
and
Variable Freq
M 10
to
01 3
02 2
03 4
04 5
M 10
The code I am using to get those two tables is:
y = as.data.frame(length(unique(index_visit$PatientID)))
x = as.data.frame(table(index_visit$ProcedureID))
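One way to get the stacked result is to give both data frames the same column names and append them with rbind(). A minimal sketch, recreating the two tables by hand since index_visit is not available here:

```r
# stand-ins for the two frequency tables shown above
x <- data.frame(Variable = c("01", "02", "03", "04"), Freq = c(3, 2, 4, 5))
y <- data.frame(Variable = "M", Freq = 10)

# rbind() requires matching column names; rename y's columns first if they differ:
# names(y) <- names(x)
combined <- rbind(x, y)
combined
# 5 rows: 01, 02, 03, 04, M
```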
