I'm trying to keep/split groups in a data frame which meet a condition for a specific row; the data.frame looks like:
COW PARITY DFC ABCS
1 1 1 0.5
1 1 2 1
1 1 3 0.25
1 2 1 -0.3
1 2 2 0.5
I would like to create groups sharing the same value of COW and PARITY for which
ABCS > 0 when DFC == 1.
I tried group_by() + filter(), but I'm unable to split the data correctly.
You could try logical subsetting, like this:
df1[df1$COW==df1$PARITY & df1$ABCS>0 & df1$DFC==1,]
# COW PARITY DFC ABCS
#1 1 1 1 0.5
This considers three conditions, connected with a logical AND (&). First, the value of COW and PARITY should be equal, then the value of ABCS should be greater than 0, and finally the value of DFC should be equal to one. In the example posted above, only one observation (row) fulfills these three conditions.
edit
Following the suggestion by @docendodiscimus, the command can be shortened and made more legible by using with(), for instance like this:
df1[with(df1,COW==PARITY & ABCS>0 & DFC==1),]
data
df1 <- structure(list(COW = c(1L, 1L, 1L, 1L, 1L),
PARITY = c(1L, 1L, 1L, 2L, 2L), DFC = c(1L, 2L, 3L, 1L, 2L),
ABCS = c(0.5, 1, 0.25, -0.3, 0.5)),
.Names = c("COW", "PARITY", "DFC", "ABCS"),
class = "data.frame", row.names = c(NA,-5L))
I have some data that I am trying to group by consecutive values in R. This solution is similar to what I am looking for; however, my data is structured like this:
line_num
1
2
3
1
2
1
2
3
4
What I want to do is start a new group each time line_num returns to 1, such that I get groups like this:
line_num group_num
1 1
2 1
3 1
1 2
2 2
1 3
2 3
3 3
4 3
Any ideas on the best way to accomplish this using dplyr or base R?
Thanks!
We could use cumsum() on a logical vector: line_num == 1 is TRUE exactly where a new run starts, and the cumulative sum of those TRUEs gives the group number.
library(dplyr)
df2 <- df1 %>%
  mutate(group_num = cumsum(line_num == 1))
or with base R
df1$group_num <- cumsum(df1$line_num == 1)
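Either way, the result for the data below is:
#  line_num group_num
#1        1         1
#2        2         1
#3        3         1
#4        1         2
#5        2         2
#6        1         3
#7        2         3
#8        3         3
#9        4         3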
data
df1 <- structure(list(line_num = c(1L, 2L, 3L, 1L, 2L, 1L, 2L, 3L, 4L
)), class = "data.frame", row.names = c(NA, -9L))
I'm currently working in data.table in R with the following data set:
id age_start age_end cases
1 2 2 1000
1 3 3 500
1 4 4 300
1 2 4 1800
2 2 2 8000
2 3 3 200
2 4 4 100
In the given data set I only want the values of cases where age_start == 2 and age_end == 4.
For each id that does not already have a row with age_start == 2 and age_end == 4, I need to sum or aggregate rows to create one: the cases of the age_start == 2 & age_end == 2, age_start == 3 & age_end == 3, and age_start == 4 & age_end == 4 rows are summed into one new row with age_start == 2 and age_end == 4.
After these are summed up into one row, I want to drop the rows I used to make the new age_start == 2 and age_end == 4 row (i.e. the age pairs 2-2, 3-3, and 4-4), as they are no longer needed.
Ideally the data set would look like this when I finish these steps:
id age_start age_end cases
1 2 4 1800
2 2 4 8300
Any suggestions on how to accomplish this in data.table are greatly appreciated!
You can use an equi-join for the first requirement (ids that already have a single 2-4 row) and a non-equi join for the second (ids whose rows need to be aggregated):
library(data.table)
x <- as.data.table(df1) # df1 as defined in the data section below

m_equi = x[.(id = unique(x$id), age_dn = 2, age_up = 4),
  on=.(id, age_start = age_dn, age_end = age_up),
  nomatch=0
]
rest = x[!m_equi, on=.(id)] # anti-join: keep only ids not matched above
m_nonequi = rest[.(id = unique(rest$id), age_dn = 2, age_up = 4),
  on=.(id, age_start >= age_dn, age_end <= age_up),
  .(cases = sum(cases)), by=.EACHI
]
res = rbind(m_equi, m_nonequi)
id age_start age_end cases
1: 1 2 4 1800
2: 2 2 4 8300
How it works:
x[i] uses values in i to look up rows and columns in x according to rules specified in on=.
nomatch=0 means unmatched rows of i in x[i] are dropped, so m_equi only ends up with id=1.
x[!m_equi, on=.(id)] (rest above) is an anti-join that skips id=1, since we already matched it in the equi-join.
by=.EACHI groups by each row of i in x[i] for the purpose of doing the aggregation.
An alternative would be to anti-join on the rows with start 2 and end 4 so that all groups need to be aggregated (similar to @akrun's answer below), though I guess that would be less efficient.
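For instance, a minimal sketch of that alternative, reusing x from above (a not-join drops the rows that already span ages 2 to 4, and everything that remains is summed per id):
m_alt = x[!.(age_dn = 2, age_up = 4),
  on=.(age_start = age_dn, age_end = age_up)][
  , .(age_start = 2, age_end = 4, cases = sum(cases)), by=id]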
We can specify the logical condition in i, group by 'id', and get the sum of 'cases' while adding 'age_start' and 'age_end' as the constants 2 and 4:
library(data.table)
as.data.table(df1)[age_start != 2 | age_end != 4,
  .(age_start = 2, age_end = 4, cases = sum(cases)), id]
# id age_start age_end cases
#1: 1 2 4 1800
#2: 2 2 4 8300
data
df1 <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), age_start = c(2L,
3L, 4L, 2L, 2L, 3L, 4L), age_end = c(2L, 3L, 4L, 4L, 2L, 3L,
4L), cases = c(1000L, 500L, 300L, 1800L, 8000L, 200L, 100L)),
class = "data.frame", row.names = c(NA,
-7L))
I have two data frames: one with ID, DATE, and the name of the drug (DRUG); another has ID and the date of an event, date.event. I want to add two columns:
1. Expected column prev_drug: how can I count the number of different drugs prior to the current date? For example, for ID == 1, prev_drug for row 4 is 2, because two drugs (A, B) different from drug C occur prior to the DATE of row 4.
2. Expected column event.30d.prior: for each ID and each DATE in the first data frame, how many events happened during the 30 days prior to the DATE? E.g. for row 2, the event for ID == 1 happened on 1/20/2001, which falls into the 30 days prior to 2/1/2001.
ID DATE DRUG prev_drug event.30d.prior
1 1/1/2001 A 0 0
1 2/1/2001 A 0 1
1 3/15/2001 B 1 0
1 4/20/2001 C 2 1
1 5/29/2001 A 2 0
1 5/2/2001 B 2 0
2 3/2/2001 A 0 1
2 3/23/2001 C 1 1
2 4/4/2001 D 2 0
2 5/5/2001 B 3 0
ID date.event
1 1/20/2001
1 4/11/2001
2 3/1/2001
Here is a solution in base R with some dplyr methods. It is not the cleanest solution, but it should solve your problem.
df<-structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
DATE = structure(c(11323, 11354, 11396, 11432, 11471, 11444,
11383, 11404, 11416, 11447), class = "Date"), DRUG = structure(c(1L,
1L, 2L, 3L, 1L, 2L, 1L, 3L, 4L, 2L), .Label = c("A", "B",
"C", "D"), class = "factor")), row.names = c(NA, -10L), class = "data.frame")
#Note DATE was converted to a Date object with the following line
#df$DATE<-as.Date(df$DATE, "%m/%d/%Y")
date.event<-read.table(header=TRUE, text="ID date.event
1 1/20/2001
1 4/11/2001
2 3/1/2001")
date.event$date.event<-as.Date(date.event$date.event, "%m/%d/%Y")
library(dplyr)
# calculate prev_drug by counting the unique drugs seen so far, minus one for the current drug
df <- df %>% group_by(ID) %>% mutate(prev_drug = cumsum(!duplicated(DRUG)) - 1)
# loop through the rows, filtering the events by each row's ID
event.30d.prior <- sapply(1:nrow(df), function(i){
  events <- date.event[date.event$ID == df$ID[i], "date.event"]
  sum(between(events, df$DATE[i] - 30, df$DATE[i]))
})
finalanswer <- cbind(df, event.30d.prior) # sapply already returns a vector, so no unlist() is needed
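For what it's worth, the same loop can also be written with vapply(), which adds a type check; the explicit comparison below reproduces the inclusive window of dplyr::between():
event.30d.prior <- vapply(seq_len(nrow(df)), function(i) {
  events <- date.event$date.event[date.event$ID == df$ID[i]] # this patient's events
  sum(events >= df$DATE[i] - 30 & events <= df$DATE[i])      # events in the 30 days up to DATE
}, numeric(1))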
I have 100 simulated data sets; a single set is shown below as an example.
pid time status
1 2 1
1 6 0
1 4 1
2 3 0
2 1 1
2 7 1
3 8 1
3 11 1
3 2 0
pid denotes the patient id; each patient has three records in the time and status columns.
I want to write R code that deletes any row with status 0 if that row is not the first observation of a given patient, and that keeps a row with status 0 when it is the first observation, deleting the remaining rows with status 1 that follow this 0 for that patient. The output should look like:
pid time status
1 2 1
1 4 1
2 3 0
3 8 1
3 11 1
As there are 100 simulated data sets, the positions of the 0s and 1s in the status column are not the same across data sets. Could anyone help with R code that can perform this task?
Thank you in advance.
The dplyr package can help. I added records to your data example to include multiple 0 values for a pid.
Group by pid; with the function first() you can hold the first value of status, and due to the group_by this value is available for all the records per pid. Then just filter: keep the row when the first record is 0 and row_number() == 1 (just in case there are more records with 0, see pid 4), or, when the first record has status 1, keep all the records with status 1.
df %>%
  group_by(pid) %>%
  filter((first(status) == 0 & row_number() == 1) | (first(status) == 1 & status == 1))
# A tibble: 6 x 3
# Groups: pid [4]
pid time status
<int> <int> <int>
1 1 2 1
2 1 4 1
3 2 3 0
4 3 8 1
5 3 11 1
6 4 3 0
data:
df <-
structure(
list(
pid = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L),
time = c(2L, 6L, 4L, 3L, 1L, 7L, 8L, 11L, 2L, 3L, 6L, 8L),
status = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L)
),
.Names = c("pid", "time", "status"),
class = "data.frame",
row.names = c(NA,-12L)
)
This question is more appropriate on https://stackoverflow.com.
Here is an attempt using tapply() (it's a little verbose):
dat <- structure(list(pid = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
time = c(2L, 6L, 4L, 3L, 1L, 7L, 8L, 11L, 2L),
status = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L)),
.Names = c("pid", "time", "status"), class = "data.frame",
row.names = c(NA, -9L))
ind <- unlist(tapply(dat$status, dat$pid, function(x) {
# browser()
y <- rep(FALSE, length(x))
if (x[1] == 1) {
y[x != 0] <- TRUE
} else {
y[1] <- TRUE
}
y
}))
dat[ind, ]
#> pid time status
#> 1 1 2 1
#> 3 1 4 1
#> 4 2 3 0
#> 7 3 8 1
#> 8 3 11 1
ind is a vector of TRUEs and FALSEs, which will indicate whether a row of dat should be kept or not according to your rules.
I use tapply(X, INDEX, FUN) to apply a function to subsets of a vector (here X = dat$status), which are defined by a grouping factor (here INDEX = dat$pid).
Here, I used an anonymous function (i.e., FUN = function(x){}) to do something with each subset of X.
In particular, I first define y, which I will return later, to be a vector of FALSEs.
If the first status is 1 for a subgroup, I turn all elements that are non-zero (i.e., y[x != 0]) into TRUE.
Otherwise, I turn only the first element (i.e., y[1]) into TRUE.
You may uncomment the browser() statement and see at the console what the function does by typing n (for next) or x or y (to see what they are).
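If a shorter base R version is of interest, the same per-patient rule can also be expressed with ave() (a sketch on the same dat; it reproduces the output above):
keep <- as.logical(ave(dat$status, dat$pid, FUN = function(x) {
  # first status is 1: keep the status-1 rows; otherwise keep only the first row
  if (x[1] == 1) x == 1 else seq_along(x) == 1
}))
dat[keep, ]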
I'm trying to determine the number of unique customers per week per store.
I have a piece of code that accomplishes this task but the tabulation is not what I am looking for.
I have the following table:
store week customer_ID
1 1 1
1 1 1
1 1 2
1 2 1
1 2 2
1 2 3
2 1 1
2 1 1
2 1 2
2 2 2
2 2 3
2 2 3
So every week I need to count how many unique customers there were. Say, for example, customer 1 visited in week 1 and then revisited in week 2; the revisit would not count as a unique visit. If that same customer visited store 2 in week 1 or any other week, that would count as a unique visit for store 2.
The outcome would look like the following:
store week unique_customers
1 1 2
1 2 1
2 1 2
2 2 1
I used the following, but it's not correct:
agg <- aggregate(data=df, customer_ID ~ week + store, function(x) length(unique(x)))
structure(list(store = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L), week = c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L,
2L, 2L), customer_ID = c(1L, 1L, 2L, 1L, 2L, 3L, 1L, 1L, 2L,
2L, 3L, 3L)), .Names = c("store", "week", "customer_ID"), class = "data.frame", row.names = c(NA,
-12L))
Here is a base R method. The idea is to split the data into a list of data.frames, one for each store. Assuming observations are ordered by week, drop the duplicated observations of customer ID; each subset data.frame is then aggregated with your function, and do.call with rbind puts the results into a single data.frame:
do.call(rbind, lapply(split(df, df$store),
function(i) aggregate(data=i[!duplicated(i$customer_ID),],
customer_ID ~ week+store, length)))
week store customer_ID
1.1 1 1 2
1.2 2 1 1
2.1 1 2 2
2.2 2 2 1
To make sure that your data.frame is ordered properly prior to attempting this, you could use order():
df <- df[order(df$store, df$week), ]
In case it is of interest, I put together a data.table solution as well.
library(data.table)
setDT(df)
df[df[, !duplicated(customer_ID), by=store]$V1,
.(newCust=length(customer_ID)), by=.(store, week)]
store week newCust
1: 1 1 2
2: 1 2 1
3: 2 1 2
4: 2 2 1
This method uses the logical vector df[, !duplicated(customer_ID), by=store]$V1 to subset the data to the first occurrence of each customer ID within each store, and then counts those first-time customers by store-week.
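For completeness, a dplyr sketch along the same lines (again assuming rows are ordered by week within each store; a store-week with no new customers would simply not appear in the result):
library(dplyr)
df %>%
  group_by(store) %>%
  filter(!duplicated(customer_ID)) %>% # first occurrence of each customer per store
  count(week)                          # new customers per store-week
The n column then holds the new-customer counts, 2 and 1 across weeks 1 and 2 for each store, matching the outputs above.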