I have patient data that looks like this:
ID DATE DUR
82 29/08/2014 10.32
82 29/08/2014 0.32
82 12/09/2014 13.35
82 12/09/2014 0.16
82 12/09/2014 0.24
82 12/09/2014 0.31
82 22/12/2014 100.39
82 22/12/2014 0.1
219 31/11/2012 -300.32
219 31/11/2012 0.23
219 12/01/2013 80.20
219 12/01/2013 0.20
The first column is a patient ID, the second is a date and time (the time is not displayed but is stored), and the third is the duration difference (which I have been using to distinguish separate admissions). Each row is a check-up on the patient, but they may have returned at a later date (not within the same time frame).
Basically, I want to label each patient's ID so that on a second admission it becomes "82a", on a third "82b", and so on. It doesn't have to be alphabetic; any such indicator would do. Some patients have as many as 50 separate admissions. After this I want it to look something like:
ID DATE DUR
82 29/08/2014 10.32
82 29/08/2014 0.32
82a 12/09/2014 13.35
82a 12/09/2014 0.16
82a 12/09/2014 0.24
82a 12/09/2014 0.31
82b 22/12/2014 100.39
82b 22/12/2014 0.1
219 31/11/2012 -300.32
219 31/11/2012 0.23
219a 12/01/2013 80.20
219a 12/01/2013 0.20
I have been working in Excel for the time being and at first had used
=IF(AND(ABS(C3)>1,A3=A2),1,0)
Just to indicate when an ID is repeated on a new admission date. Then I did this again to flag the 3rd admission, and began drawing out columns for the 4th, 5th, and 6th, planning to merge them. This is simply not an efficient solution, especially with a large data set. I am familiar with R and think it might be a better tool for this manipulation, but I am stuck on how to do this for the entire data set and keep adding a new indicator every time the same patient is admitted again. I am not even sure how to express it in pseudo-code. Perhaps something like this:
Pseudo-Code
-> Run through ID Column
-> IF Dur is > 1 (it will always be > 1 for a new admission)
ANDIF ID already exists above with DUR > 1 = a, or if DUR > 1 TWICE for
same ID = b, or if DUR > THREE TIMES = c, and so on....
Any help would be great.
In R, you have a lot of options. Your data has issues, however; since November only has 30 days, converting the DATE column to an actual date format will introduce NAs. (You could, of course, just leave it as character, but date formats are easier to work with.)
With dplyr:
library(dplyr)
df %>% mutate(DATE = as.Date(DATE, '%d/%m/%Y')) %>% # parse date data
group_by(ID) %>% # group data by ID
mutate(visit = as.integer(factor(DATE))) # make an integer factor of DATE
# Source: local data frame [12 x 4]
# Groups: ID [2]
#
# ID DATE DUR visit
# (int) (date) (dbl) (int)
# 1 82 2014-08-29 10.32 1
# 2 82 2014-08-29 0.32 1
# 3 82 2014-09-12 13.35 2
# 4 82 2014-09-12 0.16 2
# 5 82 2014-09-12 0.24 2
# 6 82 2014-09-12 0.31 2
# 7 82 2014-12-22 100.39 3
# 8 82 2014-12-22 0.10 3
# 9 219 <NA> -300.32 NA
# 10 219 <NA> 0.23 NA
# 11 219 2013-01-12 80.20 1
# 12 219 2013-01-12 0.20 1
Base R has a lot of options, including ave and tapply, but to keep it simple so you can see what happens step-by-step in a split-apply-combine model: split by the grouping variable, lapply across the list, and use do.call(rbind, ...) to reassemble:
df$DATE <- as.Date(df$DATE, '%d/%m/%Y')
df <- do.call(rbind, lapply(split(df, df$ID),
function(x){data.frame(x,
visit = as.integer(factor(x$DATE)))}))
rownames(df) <- NULL # delete useless rownames
df
# ID DATE DUR visit
# 1 82 2014-08-29 10.32 1
# 2 82 2014-08-29 0.32 1
# 3 82 2014-09-12 13.35 2
# 4 82 2014-09-12 0.16 2
# 5 82 2014-09-12 0.24 2
# 6 82 2014-09-12 0.31 2
# 7 82 2014-12-22 100.39 3
# 8 82 2014-12-22 0.10 3
# 9 219 <NA> -300.32 NA
# 10 219 <NA> 0.23 NA
# 11 219 2013-01-12 80.20 1
# 12 219 2013-01-12 0.20 1
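If you specifically want the "82a"/"82b"-style labels rather than the integer visit column, you can index into letters; a minimal sketch (my own addition, not from either answer above; note letters only covers 26 repeat admissions, so with up to 50 you might prefer something like sprintf("%s-%02d", ID, visit)):

```r
# First admission keeps the bare ID; later admissions append "a", "b", ...
visit <- c(1, 1, 2, 2, 3, 1, 2)           # visit index per row
id    <- c(82, 82, 82, 82, 82, 219, 219)  # matching patient IDs
labels <- paste0(id, c("", letters)[visit])
labels
# "82"  "82"  "82a" "82a" "82b" "219" "219a"
```

The trick is that c("", letters) maps visit 1 to an empty suffix and visits 2, 3, ... to "a", "b", ....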
I am using Rstudio (version .99.903), have a PC (windows 8). I have a follow up question from yesterday as the problem became more complicated. Here is what the data looks like:
Number Trial ID Open date Enrollment rate
420 NCT00091442 9 1/28/2005 0.2
1476 NCT00301457 26 2/22/2008 1
10559 NCT01307397 34 7/28/2011 0.6
6794 NCT00948675 53 5/12/2010 0
6451 NCT00917384 53 8/17/2010 0.3
8754 NCT01168973 53 1/19/2011 0.2
8578 NCT01140347 53 12/30/2011 2.4
11655 NCT01358877 53 4/2/2012 0.3
428 NCT00091442 55 9/7/2005 0.1
112 NCT00065325 62 10/15/2003 0.2
477 NCT00091442 62 11/11/2005 0.1
16277 NCT01843374 62 12/16/2013 0.2
17386 NCT01905657 62 1/8/2014 0.6
411 NCT00091442 66 1/12/2005 0
What I need to do is compare the enrollment rate of the most current date within a given ID to the average of those values that are up to one year prior to it. For instance, for ID 53, the date of 1/19/2011 has an enrollment rate of 0.2 and I would want to compare this against the average of 8/17/2010 and 5/12/2010 enrollment rates (e.g., 0.15).
If there are no other dates within the ID prior to the current one, then the comparison should not be made. For instance, for ID 26, there would be no comparison. Similarly, for ID 53, there would be no comparison for 5/12/2010.
When I say "compare" I am not doing any analysis or visualization. I simply want a new column that takes the average value of those enrollment rates up to one year prior to the current one (I will be plotting them and percentile ranking them later). There are >20,000 data points. Any help would be much appreciated.
A verbose but potentially high-performance way of doing this: no giant for loops over all the rows of the data frame. The two sapply loops operate only on a numeric vector, which should be relatively quick regardless of your row count. But I'm sure someone will waltz in with a trivial dplyr solution soon enough.
This approach assumes your data is sorted first by ID and then by Opendate. If it is not sorted, sort it first.
# Find indices where the same ID is above and below it
A = which(unlist(sapply(X = rle(df$ID)$lengths,
FUN = function(x) {if(x == 1) return(F)
if(x == 2) return(c(F,F))
if(x >= 3) return(c(F,rep(T, x-2),F))})))
# Make sure Opendate is Date class so differences come out in days
df$Opendate = as.Date(df$Opendate, '%m/%d/%Y')
# Store the date vector; should speed up the code a tiny bit
V_opendate = df$Opendate
# Further filter on A, where the date difference < 365 days
B = A[sapply(A, function(x) (abs(V_opendate[x]-V_opendate[x-1]) < 365) & (abs(V_opendate[x]-V_opendate[x+1]) < 365))]
# Return actual indices of rows - 1, rows +1
C = sapply(B, function(x) c(x-1, x+1), simplify = F)
# Actually take the mean of these cases
D = sapply(C, function(x) mean(df[x,]$Enrollmentrate))
# Create new column Rate and fill it with the values of D.
df[B,"Rate"] = D
Number Trial ID Opendate Enrollmentrate Rate
1 420 NCT00091442 9 2005-01-28 0.2 NA
2 1476 NCT00301457 26 2008-02-22 1.0 NA
3 10559 NCT01307397 34 2011-07-28 0.6 NA
4 6794 NCT00948675 53 2010-05-12 0.0 NA
5 6451 NCT00917384 53 2010-08-17 0.3 0.10
6 8754 NCT01168973 53 2011-01-19 0.2 1.35
7 8578 NCT01140347 53 2011-12-30 2.4 0.25
8 11655 NCT01358877 53 2012-04-02 0.3 NA
9 428 NCT00091442 55 2005-09-07 0.1 NA
10 112 NCT00065325 62 2003-10-15 0.2 NA
11 477 NCT00091442 62 2005-11-11 0.1 NA
12 16277 NCT01843374 62 2013-12-16 0.2 NA
13 17386 NCT01905657 62 2014-01-08 0.6 NA
14 411 NCT00091442 66 2005-01-12 0.0 NA
The relevant rows are calculated. You can do your comparison with the newly created Rate column.
You might have to change the code a little, since I removed the spaces in the column names:
df = read.table(text = " Number Trial ID Opendate Enrollmentrate
420 NCT00091442 9 1/28/2005 0.2
1476 NCT00301457 26 2/22/2008 1
10559 NCT01307397 34 7/28/2011 0.6
6794 NCT00948675 53 5/12/2010 0
6451 NCT00917384 53 8/17/2010 0.3
8754 NCT01168973 53 1/19/2011 0.2
8578 NCT01140347 53 12/30/2011 2.4
11655 NCT01358877 53 4/2/2012 0.3
428 NCT00091442 55 9/7/2005 0.1
112 NCT00065325 62 10/15/2003 0.2
477 NCT00091442 62 11/11/2005 0.1
16277 NCT01843374 62 12/16/2013 0.2
17386 NCT01905657 62 1/8/2014 0.6
411 NCT00091442 66 1/12/2005 0", header = T)
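For what it's worth, here is the sort of dplyr approach hinted at above: a hedged sketch that, within each ID, averages the enrollment rates whose open dates fall strictly within the 365 days before each row's date (the column name priorYearMean and the inline test frame are my own, not from the question):

```r
library(dplyr)

# Minimal reconstruction of the example data (ID 53 rows only)
df <- data.frame(
  ID = c(53, 53, 53, 53, 53),
  Opendate = as.Date(c("2010-05-12", "2010-08-17", "2011-01-19",
                       "2011-12-30", "2012-04-02")),
  Enrollmentrate = c(0, 0.3, 0.2, 2.4, 0.3)
)

result <- df %>%
  group_by(ID) %>%
  arrange(Opendate, .by_group = TRUE) %>%
  mutate(priorYearMean = sapply(seq_along(Opendate), function(i) {
    # Rows strictly before this date but no more than 365 days earlier
    inWindow <- Opendate < Opendate[i] & Opendate >= Opendate[i] - 365
    if (any(inWindow)) mean(Enrollmentrate[inWindow]) else NA_real_
  })) %>%
  ungroup()

result$priorYearMean
# NA 0.00 0.15 0.20 2.40
```

For ID 53's 2011-01-19 row this gives 0.15, the average of the 2010-05-12 and 2010-08-17 rates, matching the comparison the question asks for; rows with no prior dates in the window stay NA.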
I have a data.frame named sampleframe where I have stored all the table values. Inside sampleframe I have columns id, month, sold.
id month SMarch SJanFeb churn
101 1 0.00 0.00 1
101 2 0.00 0.00 1
101 3 0.00 0.00 1
108 2 0.00 6.00 1
103 2 0.00 10.00 1
160 1 0.00 2.00 1
160 2 0.00 3.00 1
160 3 0.50 0.00 0
164 1 0.00 3.00 1
164 2 0.00 6.00 1
I would like to calculate the average sold over the last three months, by ID. If it is month 3, it should consider the average sold over the last two months for that ID; if it is month 2, the average over the last one month; and so on for all months.
I have used ifelse and the mean function to attempt this, but some rows are missed when I try to apply it to all months.
The query I used:
sampleframe$Churn <- ifelse(sampleframe$Month==4|sampleframe$Month==5|sampleframe$Month==6, ifelse(sampleframe$Sold<0.7*mean(sampleframe$Sold[sampleframe$ID[sampleframe$Month==-1&sampleframe$Month==-2&sampleframe$Month==-3]]),1,0),0)
To clarify the logic: the query should compare the current sold value against 70% of the average of the previous months' sold values, and return 1 if the current value is higher, else 0.
It's not clear what the expected output should be. Based on the description of calculating the average 'sold' over each 3 months, grouped by 'id', we can use roll_mean from library(RcppRoll). We convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouped by 'id': if the number of rows is greater than 1, we take the roll_mean with n = 3, concatenated with the running averages for the first k-1 observations; otherwise (a single observation) we return the value itself.
library(RcppRoll)
library(data.table)
k <- 3
setDT(df1)[, soldAvg := if(.N>1) c(cumsum(sold[1:(k-1)])/1:(k-1),
roll_mean(sold,n=k, align='right')) else as.numeric(sold), id]
df1
# id month sold soldAvg
#1: 101 1 124 124.0000
#2: 101 2 211 167.5000
#3: 104 3 332 332.0000
#4: 105 4 124 124.0000
#5: 101 5 211 182.0000
#6: 101 6 332 251.3333
#7: 101 7 124 222.3333
#8: 101 8 211 222.3333
#9: 101 9 332 222.3333
#10: 102 10 124 124.0000
#11: 102 12 211 167.5000
#12: 104 3 332 332.0000
#13: 105 4 124 124.0000
#14: 102 5 211 182.0000
#15: 102 6 332 251.3333
#16: 106 7 124 124.0000
#17: 107 8 211 211.0000
#18: 102 9 332 291.6667
#19: 103 11 124 124.0000
#20: 103 2 211 167.5000
#21: 108 3 332 332.0000
#22: 108 4 124 228.0000
#23: 109 5 211 211.0000
#24: 103 6 332 222.3333
#25: 104 7 124 262.6667
#26: 105 8 211 153.0000
#27: 103 10 332 291.6667
The above question can also be solved with library(dplyr), using this query to get the output:
resultData <- group_by(data, KId) %>%
arrange(sales_month) %>%
mutate(monthMinus1Qty = lag(quantity_sold,1), monthMinus2Qty = lag(quantity_sold, 2)) %>%
group_by(KId, sales_month) %>%
mutate(previous2MonthsQty = sum(monthMinus1Qty, monthMinus2Qty, na.rm = TRUE)) %>%
mutate(result = ifelse(quantity_sold/previous2MonthsQty >= 0.6,0,1)) %>%
select(KId,sales_month, quantity_sold, result)
See the linked answer for the full solution and output.
I have a data frame such as this:
> bp
Source: local data frame [6 x 4]
date amount accountId type
1 2015-06-11 101.2 1 a
2 2015-06-18 101.2 1 a
3 2015-06-24 101.2 1 b
4 2015-06-11 294.0 2 a
5 2015-06-18 48.0 2 a
6 2015-06-26 10.0 2 b
It has 3.4 million rows of data:
> nrow(bp)
[1] 3391874
>
I am trying to compute lagged differences of time in days as follows using dplyr:
bp <- bp %>% group_by(accountId) %>%
mutate(diff = as.numeric(date - lag(date)))
On my MacBook with 8GB of memory, R crashes. On a 64GB Linux server the code takes forever. Any ideas for fixing this?
No idea what has gone wrong over your way, but with date as a proper Date object, everything goes very quickly over here:
Recreate some data:
dat <- read.table(text=" date amount accountId type
1 2015-06-11 101.2 1 a
2 2015-06-18 101.2 1 a
3 2015-06-24 101.2 1 b
4 2015-06-11 294.0 2 a
5 2015-06-18 48.0 2 a
6 2015-06-26 10.0 2 b",header=TRUE)
dat$date <- as.Date(dat$date)
Then run some analyses on 3.4M rows, 1000 groups:
set.seed(1)
dat2 <- dat[sample(rownames(dat),3.4e6,replace=TRUE),]
dat2$accountId <- sample(1:1000,3.4e6,replace=TRUE)
nrow(dat2)
#[1] 3400000
length(unique(dat2$accountId))
#[1] 1000
system.time({
dat2 <- dat2 %>% group_by(accountId) %>%
mutate(diff = as.numeric(date - lag(date)))
})
# user system elapsed
# 0.38 0.03 0.40
head(dat2[dat2$accountId==46,])
#Source: local data frame [6 x 6]
#Groups: accountId
#
# date amount accountId type diff
#1 2015-06-24 101.2 46 b NA
#2 2015-06-18 48.0 46 a -6
#3 2015-06-11 294.0 46 a -13
#4 2015-06-18 101.2 46 a 7
#5 2015-06-26 10.0 46 b 2
#6 2015-06-11 294.0 46 a 0
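If dplyr still struggles at 3.4M rows, a data.table version may be worth a try, since := adds the column by reference without copying the whole frame; a sketch on a toy version of the data (shift() is data.table's lag):

```r
library(data.table)

# Small reproducible frame; date must be Date class for day arithmetic
dt <- data.table(
  date = as.Date(c("2015-06-11", "2015-06-18", "2015-06-24",
                   "2015-06-11", "2015-06-18", "2015-06-26")),
  accountId = c(1, 1, 1, 2, 2, 2)
)

# Lagged difference in days within each account, computed by reference
dt[, diff := as.numeric(date - shift(date)), by = accountId]
dt$diff
# NA 7 6 NA 7 8
```

The by-reference update avoids the extra copies that can push an 8GB machine into swapping.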
I am using a Kaggle data set for bike sharing. I would like to write a script that compares my predicted values to the training data set, comparing the means by month for each year.
The training data set, I call df looks like this:
datetime count
1 2011-01-01 00:00:00 16
2 2011-01-11 01:00:00 40
3 2011-02-01 02:00:00 32
4 2011-02-11 03:00:00 13
5 2011-03-21 04:00:00 1
6 2011-03-11 05:00:00 1
My predicted values, I call sub look like this:
datetime count
1 2011-01-01 00:00:00 42
2 2011-01-11 01:00:00 33
3 2011-02-01 02:00:00 33
4 2011-02-11 05:00:00 36
5 2011-03-21 06:00:00 57
6 2011-03-11 07:00:00 129
I have isolated the month and year using the lubridate package, then concatenated month and year as a new column. I split on the new column, then used lapply to find the mean.
library(lubridate)
df$monyear <- interaction(
month(ymd_hms(df$datetime)),
year(ymd_hms(df$datetime)),
sep="-")
s<-split(df,df$monyear)
x <-lapply(s,function(x) colMeans(x[,c("count", "count")],na.rm=TRUE))
But this gives me the average for each month-year combination nested in a list so it is not easy to compare. What I would like instead is :
year-month train-mean sub-mean diff
1 2011-01 28 37.5 9.5
2 2011-02 22.5 34.5 12
3 2011-03 1 93 92
Is there a better way to do this?
Something like this. For each of your data sets:
library(dplyr)
dftrain %>% group_by(monyear) %>% summarize(mc=mean(count)) -> xtrain
dftest %>% group_by(monyear) %>% summarize(mc=mean(count)) -> xtest
merged <- merge(xtrain, xtest, by="monyear")
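To also get the diff column from the desired output, one more step after the merge does it; a sketch with toy monthly summaries standing in for xtrain/xtest (the mc_train/mc_sub names are my own choice via merge's suffixes argument):

```r
# Toy monthly means standing in for the train/prediction summaries
xtrain <- data.frame(monyear = c("1-2011", "2-2011", "3-2011"),
                     mc = c(28, 22.5, 1))
xtest  <- data.frame(monyear = c("1-2011", "2-2011", "3-2011"),
                     mc = c(37.5, 34.5, 93))

# suffixes disambiguates the two 'mc' columns after merging
merged <- merge(xtrain, xtest, by = "monyear",
                suffixes = c("_train", "_sub"))
merged$diff <- merged$mc_sub - merged$mc_train
merged$diff
# 9.5 12.0 92.0
```

This reproduces the 9.5 / 12 / 92 differences in the question's desired output.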
I have data a data frame in the following structure
transaction | customer | week | amount
12551 | ieeamo | 32 | €23.54
12553 | ieeamo | 33 | €17.00
I would like to get it in the following structure (for all weeks)
week | customer | activity last week | activity 2 weeks ago
32 | ieeamo | €0.00 | €0.00
33 | ieeamo | €23.54 | €0.00
34 | ieeamo | €17.00 | €23.54
35 | ieeamo | €0.00 | €17.00
Essentially, I am trying to convert transactional data to relative data.
My thought is that the best way to do this is to use loops to generate many data frames and rbind them all at the end. However, this approach does not seem efficient, and I'm not sure it will scale to the data I am using.
Is there a more proper solution?
Rbinding is a bad idea for this, since each rbind creates a new copy of the data frame in memory. We can get to the answer more quickly with a mostly vectorized approach, using loops only to make code more concise. Props to the OP for recognizing the inefficiency and searching for a solution.
Note: The following solution will work for any number of customers, but would require minor modification to work with more lag columns.
Setup: First we need to generate some data to work with. I'm going to use two different customers with a few weeks of transactional data each, like so:
data <- read.table(text="
transaction customer week amount
12551 cOne 32 1.32
12552 cOne 34 1.34
12553 cTwo 34 2.34
12554 cTwo 35 2.35
12555 cOne 36 1.36
12556 cTwo 37 1.37
", header=TRUE)
Step 1: Calculate some variables and initialize new data frame. To make the programming really easy, we first want to know two things: how many customers and how many weeks? We calculate those answers like so:
customer_list <- unique(data$customer)
# cOne cTwo
week_span <- min(data$week):max(data$week)
# 32 33 34 35 36 37
Next, we need to initialize the new data frame based on the variables we just calculated. In this new data frame, we need an entry for every week, not just the weeks in the data. This is where our 'week_span' variable comes in useful.
new_data <- data.frame(
week=sort(rep(week_span,length(customer_list))),
customer=customer_list,
activity_last_week=NA,
activity_2_weeks_ago=NA)
# week customer activity_last_week activity_2_weeks_ago
# 1 32 cOne NA NA
# 2 32 cTwo NA NA
# 3 33 cOne NA NA
# 4 33 cTwo NA NA
# 5 34 cOne NA NA
# 6 34 cTwo NA NA
# 7 35 cOne NA NA
# 8 35 cTwo NA NA
# 9 36 cOne NA NA
# 10 36 cTwo NA NA
# 11 37 cOne NA NA
# 12 37 cTwo NA NA
You'll notice we repeat the week list for each customer and sort it, so we get a list resembling 1,1,2,2,3,3,4,4...n,n with a number of repetitions equal to the number of customers in the data. This makes it so we can specify the 'customer' data as just the list of customers, since the list will repeat to fill up the space. The lag columns are left as NA for now.
Step 2: Fill in the lag values. Now, things are pretty simple. We just need to grab the subset of rows for each customer and find out if there were any transactions for each week. We do this by using the 'match' function to pull out values for every week. Where data does not exist, we'll get an NA value and need to replace those with zeros (assuming no activity means a zero transaction). Then, for the lag columns, we just offset the values with NA depending on the number of weeks we are lagging.
# Loop through the customers.
for (i in 1:length(customer_list)){
# Select the next customer's data.
subset <- data[data$customer==customer_list[i],]
# Extract the data values for each week.
subset_amounts <- subset$amount[match(week_span, subset$week)]
# Replace NA with zero.
subset_amounts <- ifelse(is.na(subset_amounts),0,subset_amounts)
# Loop through the lag columns.
for (lag in 1:2){
# Write in the data values with the appropriate
# number of offsets according to the lag.
# Truncate the extra values.
new_data[new_data$customer==customer_list[i], (2+lag)] <- c(rep(NA,lag), subset_amounts[1:(length(subset_amounts)-lag)])
}
}
# week customer activity_last_week activity_2_weeks_ago
# 1 32 cOne NA NA
# 2 32 cTwo NA NA
# 3 33 cOne 1.32 NA
# 4 33 cTwo 0.00 NA
# 5 34 cOne 0.00 1.32
# 6 34 cTwo 0.00 0.00
# 7 35 cOne 1.34 0.00
# 8 35 cTwo 2.34 0.00
# 9 36 cOne 0.00 1.34
# 10 36 cTwo 2.35 2.34
# 11 37 cOne 1.36 0.00
# 12 37 cTwo 0.00 2.35
In other situations... If you have a series of ordered time data where no rows are missing, this sort of task becomes incredibly simple with the 'embed' function. Let's say we have some data that looks like this:
data <- data.frame(week=1:20, value=1:20+(1:20/100))
# week value
# 1 1 1.01
# 2 2 2.02
# 3 3 3.03
# 4 4 4.04
# 5 5 5.05
# 6 6 6.06
# 7 7 7.07
# 8 8 8.08
# 9 9 9.09
# 10 10 10.10
# 11 11 11.11
# 12 12 12.12
# 13 13 13.13
# 14 14 14.14
# 15 15 15.15
# 16 16 16.16
# 17 17 17.17
# 18 18 18.18
# 19 19 19.19
# 20 20 20.20
We could make a lagged data set in no time, like so:
new_data <- data.frame(week=data$week[3:20], embed(data$value,3))
names(new_data)[2:4] <- c("this_week", "last_week", "2_weeks_ago")
# week this_week last_week 2_weeks_ago
# 1 3 3.03 2.02 1.01
# 2 4 4.04 3.03 2.02
# 3 5 5.05 4.04 3.03
# 4 6 6.06 5.05 4.04
# 5 7 7.07 6.06 5.05
# 6 8 8.08 7.07 6.06
# 7 9 9.09 8.08 7.07
# 8 10 10.10 9.09 8.08
# 9 11 11.11 10.10 9.09
# 10 12 12.12 11.11 10.10
# 11 13 13.13 12.12 11.11
# 12 14 14.14 13.13 12.12
# 13 15 15.15 14.14 13.13
# 14 16 16.16 15.15 14.14
# 15 17 17.17 16.16 15.15
# 16 18 18.18 17.17 16.16
# 17 19 19.19 18.18 17.17
# 18 20 20.20 19.19 18.18