library(data.table)
family_id <- c(1, 2, 3)
age_mother <- c(30, 27, 29)
dob_child1 <- c("1998-11-12", "1999-12-12", "1996-04-12") ## birthday of child one
dob_child2 <- c(NA, "1997-09-09", NA) ## NA if there is no child
dob_child3 <- c(NA, "1999-09-01", "1996-09-09")
DT <- data.table(family_id, age_mother, dob_child1, dob_child2, dob_child3)
Now that I have DT, how can I use this table to find out how many children each family has, using syntax like this:
DT[, apply.., keyby = family_id] ## this code is wrong
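One possible data.table idiom, as a sketch (assuming the three dob_child* columns above): count the non-NA birth dates per row, with .SD restricted to those columns.
DT[, total_child := rowSums(!is.na(.SD)),
   .SDcols = c("dob_child1", "dob_child2", "dob_child3")]
DT[, .(total_child), keyby = family_id]  # one row per family, keyed as requested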
This may also work:
> DT$total_child <- as.vector(rowSums(!is.na(DT[, c("dob_child1",
"dob_child2", "dob_child3")])))
> DT
family_id age_mother dob_child1 dob_child2 dob_child3 total_child
1 1 30 1998-11-12 <NA> <NA> 1
2 2 27 1999-12-12 1997-09-09 1999-09-01 3
3 3 29 1996-04-12 <NA> 1996-09-09 2
You can use the sqldf package to run a SQL query in R.
I reproduced your DT:
family_id <- c(1, 2, 3)
age_mother <- c(30, 27, 29)
dob_child1 <- c("1998-11-12", "1999-12-12", "1996-04-12") ## birthday of child one
dob_child2 <- c(NA, "1997-09-09", NA) ## NA if there is no child
dob_child3 <- c(NA, "1999-09-01", "1996-09-09")
DT <- data.table(family_id, age_mother, dob_child1, dob_child2, dob_child3)
library(sqldf)
sqldf('select (count(dob_child1) + count(dob_child2) + count(dob_child3)) as total_child,
       family_id from DT group by family_id')
The result is the following:
total_child family_id
1 1 1
2 3 2
3 2 3
Is this what you were looking for?
I am attempting to join together two data frames. One contains records of when certain events happened. The other contains daily information on values that occurred for a given organization.
My current challenge is how to fully join the information from the "when certain events happened" data frame into the records data frame. Most of dplyr's joins appear to match only a single line. I need to fully spread out the record information based on start and end dates.
In other words, I need to spread information from one line into many lines, while simultaneously joining to the daily data table. It is important that I do this in R, because the alternative is quite a bit of filtering and dragging in Excel (the information covers thousands of rows).
Below is a representation of the daily data table
value year month day org link
12 1 1 1 AA AA-1-1
45 1 1 2 AA AA-1-2
31 1 1 3 AA AA-1-3
10 1 1 4 AA AA-1-4
Below is a representation of the records table
year month day org link end_link event event_info
1 1 2 AA AA-1-1-2 AA-1-1-3 Buy Yes
1 2 7 BB BB-1-2-7 BB-1-2-10 Sell Yes
And finally, here is what I am aiming for in the end:
value month day org link event event_info
12 1 1 AA AA-1-1-1
45 1 2 AA AA-1-1-2 Buy Yes
31 1 3 AA AA-1-1-3 Buy Yes
10 1 4 AA AA-1-1-4
Is there any way to accomplish this in R? I have tried using dplyr joins, but I am usually only able to join on the initial link.
Edit: The second "end" link refers to an end date. In the records table this is all in one line, while the second data frame has daily information.
Edit: Below I have put together a cleaner look at my real data. The first image shows the DAILY DATA, the second the RECORDS OF EVENTS, and the third what I would like to see (ideally).
[Image: daily data, which will have multiple orgs present]
[Image: records data; note org id AA and the audience]
[Image: ideal combined data]
We first have to build some dates, in order to create the date sequences that we'll unnest to get a long version of df2, which we then right-join onto df1:
library(tidyverse)
df2 %>%
  separate(link, c("org1", "year1", "month1", "day1")) %>%
  separate(end_link, c("org2", "year2", "month2", "day2")) %>%
  rowwise() %>%
  transmute(org, event, event_info, date = list(
    as.Date(paste0(year1, "-", month1, "-", day1)):as.Date(paste0(year2, "-", month2, "-", day2)))) %>%
  ungroup() %>%                 # drop the rowwise grouping before unnesting
  unnest(cols = date) %>%
  right_join(df1 %>% mutate(date = as.numeric(as.Date(paste0(year, "-", month, "-", day))))) %>%
  select(value, month, day, org, link, event, event_info)
# # A tibble: 4 x 7
# value month day org link event event_info
# <int> <int> <int> <chr> <chr> <chr> <chr>
# 1 12 1 1 AA AA-1-1 <NA> <NA>
# 2 45 1 2 AA AA-1-2 Buy Yes
# 3 31 1 3 AA AA-1-3 Buy Yes
# 4 10 1 4 AA AA-1-4 <NA> <NA>
data
df1 <- read.table(text="value year month day org link
12 1 1 1 AA AA-1-1
45 1 1 2 AA AA-1-2
31 1 1 3 AA AA-1-3
10 1 1 4 AA AA-1-4", header=TRUE, stringsAsFactors=FALSE)
df2 <- read.table(text="year month day org link end_link event event_info
1 1 2 AA AA-1-1-2 AA-1-1-3 Buy Yes
1 2 7 BB BB-1-2-7 BB-1-2-10 Sell Yes", header=TRUE, stringsAsFactors=FALSE)
I would use the data.table package; for me it is the best R package for data analysis. I hope I have properly understood the problem; let me know if it does not work.
The first part creates the data set. (I created the two data.table objects in two different ways just to show both alternatives; you could also read your data directly from Excel, .txt, .csv or similar.)
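For example, reading from a .csv would be a one-liner with data.table's fread() (the file name below is hypothetical; data.table is loaded just below):
Daily_dt <- fread("daily_data.csv")  # hypothetical file; fread also reads .txt and other delimited formats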
library(data.table)
value<-c(12,45,31,10)
year<-c(1,1,1,1)
month<-c(1,1,1,1)
day<-c(1,2,3,4)
org<-c("AA","AA","AA","AA")
link<-c("AA-1-1","AA-1-2","AA-1-3","AA-1-4")
Daily_dt<-data.table(value, year,month,day,org,link)
Records_dt<-data.table(year=c(1,1),month=c(1,1),day=c(2,3),org=c("AA","BB"),link=c("AA-1-1-2","BB-1-2-7"),end_link=c("AA-1-1-3","BB-1-2-10"),
event=c("Buy","Buy"),event_info=c("Yes","Yes"))
Daily_dt[,Date:=as.Date(paste(year,"-",month,"-",day,sep=""))]
To achieve what you want, you need these lines:
# stack the start links and the end links into one long table of events
Records_dt <- rbind(Records_dt[, c("org","link","event","event_info")],
                    Records_dt[, list(org, link=end_link, event, event_info)])
# parse the date parts back out of the link strings (dropping the org prefix)
Record_Dates <- as.data.table(tstrsplit(Records_dt$link, "-")[-1])
Record_Dates[, Dates := as.Date(paste(V1, "-", V2, "-", V3, sep=""))]
Records_dt[, Date := Record_Dates$Dates]
# key both tables on Date and join the event info onto the daily table
setkey(Records_dt, Date)
setkey(Daily_dt, Date)
Records_dt <- Records_dt[, c("Date","event","event_info")][Daily_dt, ]
Records_dt <- Records_dt[, c("value","month","day","org","link","event","event_info")]
and this is the result
> Records_dt
value month day org link event event_info
1: 12 1 1 AA AA-1-1 NA NA
2: 45 1 2 AA AA-1-2 Buy Yes
3: 31 1 3 AA AA-1-3 Buy Yes
4: 10 1 4 AA AA-1-4 NA NA
If your input data had more than one event on the same day (with or without the same org), for example something like:
> Records_dt
year month day org link end_link event event_info
1: 1 1 2 AA AA-1-1-2 AA-1-1-3 Buy Yes
2: 1 1 3 BB BB-1-2-7 BB-1-2-10 Buy Yes
3: 1 1 2 AA AA-1-1-2 AA-1-1-3 Buy Yes
4: 1 1 3 AA AA-1-2-7 AA-1-2-10 Buy Yes
some tweaks may be required; a possible one is sketched below, but I am not sure whether you need this, so I did not build it in.
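A possible tweak, as a sketch (assuming exact duplicate rows should simply collapse, and that several distinct events per day are all legitimate matches): de-duplicate the stacked Records_dt before the setkey/join step, and let the join return every remaining match explicitly.
Records_dt <- unique(Records_dt)  # drop exact duplicate event rows
Records_dt <- Records_dt[, c("Date","event","event_info")][Daily_dt, allow.cartesian=TRUE]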
One data table (let's call it A) contains the ID numbers:
ID
3
5
12
8
...
and another table (let's call it B) contains the lower bound, the upper bound, and the name for that ID range.
ID_lower ID_upper Name
1 4 James
5 7 Arthur
8 11 Jacob
12 13 Sarah
so based on table B, given an ID from table A, we can find the matching name by finding the row in table B such that
ID_lower <= ID <= ID_upper
and I want to create a table of ID and Name, so in the above example it would be
ID Name
3 James
5 Arthur
12 Sarah
8 Jacob
... ...
I used a for loop, so that for each row of A, I looked for the row in B where ID is between ID_lower and ID_upper, and joined the name from there.
However, this method was a bit slow. Is there a faster way of doing it in R?
Using the new non-equi joins feature in the current development version of data.table, this is straightforward:
require(data.table) # v1.9.7+
dt2[dt1, .(ID, Name), on=.(ID_lower <= ID, ID_upper >= ID)]
See the installation instructions for the devel version here.
where,
dt1=fread('ID
3
5
12
8')
dt2 = fread('ID_lower ID_upper Name
1 4 James
5 7 Arthur
8 11 Jacob
12 13 Sarah')
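For this sample data, the join should produce one name per row of dt1 (worked out by hand from the ranges):
   ID   Name
1:  3  James
2:  5 Arthur
3: 12  Sarah
4:  8  Jacob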
You can make a look-up table with your second data.frame (B):
lu <- do.call(rbind,
apply(B,1,function(x)
data.frame(ID=c(x[1]:x[2]),Name=x[3], row.names = NULL)))
then you query it with your first data.frame (A):
A$Name <- lu[A$ID, "Name"]
(This positional lookup works because the ranges in B cover every integer ID starting at 1, so lu's row number equals the ID; for ranges with gaps you would use match(A$ID, lu$ID) instead.)
You can try this data.table solution:
data.table::setDT(B)[, .(Name, ID = Map(`:`, ID_lower, ID_upper))][
  , .(ID = unlist(ID)), by = .(Name)][ID %in% A$ID]
Name ID
1: James 3
2: Arthur 5
3: Sarah 12
4: Jacob 8
I believe findInterval() on ID_lower might be the ideal approach here:
A[, Name := B[findInterval(ID, ID_lower), Name]]
A
##    ID   Name
## 1:  3  James
## 2:  5 Arthur
## 3: 12  Sarah
## 4:  8  Jacob
This will only be correct if (1) B is sorted by ID_lower and (2) all values in A$ID are covered by the ranges in B.
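If B might not already be sorted, a one-line guard (a sketch, assuming B is a data.table, as the syntax above implies):
setorder(B, ID_lower)  # findInterval() requires non-decreasing breakpoints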
Okay, so I have two different data frames (df1 and df2) which, to simplify, have an ID, a date, and the score on a test. In each data frame each person (ID) has taken the test on multiple dates. Comparing the two data frames, some people are listed in df1 but not in df2, and vice versa, but some are listed in both, and those can overlap in different ways.
I want to combine all the data into one frame, but the tricky part is that if an ID's scores from df1 and df2 are within 7 days of each other (I can do this with a subtracted-dates column), I want to combine those into one row.
In essence, for every ID there will be one row with both scores, written separately, if the tests were taken within 7 days of each other; if not, there will be two separate rows, one with the score from df1 and one from df2, along with all the other scores that might not be listed in both.
EX:
df1
ID Date1(yyyymmdd) Score1
1 20140512 50
1 20140501 30
1 20140703 50
1 20140805 20
3 20140522 70
3 20140530 10
df2
ID Date2(yyyymmdd) Score2
1 20140530 40
1 20140622 20
1 20140702 10
1 20140820 60
2 20140522 30
2 20140530 80
Wanted_df
ID Date1(yyyymmdd) Score1 Date2(yyyymmdd) Score2
1 20140512 50
1 20140501 30
1 20140703 50 20140702 10
1 20140805 20
1 20140530 40
1 20140622 20
1 20140820 60
3 20140522 70
3 20140530 10
2 20140522 30
2 20140530 80
Alright. I feel bad about the bogus outer join answer (which may be possible in a library I don't know about, but there are advantages to using an RDBMS sometimes...), so here is a hacky workaround. It assumes that all the joins will be at most one-to-one, which you've said is OK.
# ensure the date columns are date type
df1$Date1 <- as.Date(as.character(df1$Date1), format="%Y%m%d")
df2$Date2 <- as.Date(as.character(df2$Date2), format="%Y%m%d")
# ensure the dfs are sorted
df1 <- df1[order(df1$ID, df1$Date1),]
df2 <- df2[order(df2$ID, df2$Date2),]
# initialize the output df3, which starts as everything from df1 and NA from df2;
# Date2 is initialized as a Date so the later assignments keep the Date class
df3 <- cbind(df1, Date2=as.Date(NA), Score2=NA)
library(plyr) #for rbind.fill
for (j in 1:nrow(df2)){
# see if there are any rows of test1 you could join test2 to
join_rows <- which(df3[,"ID"]==df2[j,"ID"] & abs(df3[,"Date1"]-df2[j,"Date2"])<7 )
# if so, join it to the first one (see discussion)
if(length(join_rows)>0){
df3[min(join_rows),"Date2"] <- df2[j,"Date2"]
df3[min(join_rows),"Score2"] <- df2[j,"Score2"]
} # if not, add a new row of just the test2
else df3 <- rbind.fill(df3,df2[j,])
}
df3 <- df3[order(df3$ID,df3$Date1,df3$Date2),]
row.names(df3)<-NULL # i hate these
df3
# ID Date1 Score1 Date2 Score2
# 1 1 2014-05-01 30 <NA> NA
# 2 1 2014-05-12 50 <NA> NA
# 3 1 2014-07-03 50 2014-07-02 10
# 4 1 2014-08-05 20 <NA> NA
# 5 1 <NA> NA 2014-05-30 40
# 6 1 <NA> NA 2014-06-22 20
# 7 1 <NA> NA 2014-08-20 60
# 8 2 <NA> NA 2014-05-22 30
# 9 2 <NA> NA 2014-05-30 80
# 10 3 2014-05-22 70 <NA> NA
# 11 3 2014-05-30 10 <NA> NA
I couldn't get the rows in the same sort order as yours, but they look the same.
Short explanation: For each row in df2, see if there's a row in df1 you can "join" it to. If not, stick it at the bottom of the table. In the initialization and rbinding, you'll see some hacky ways of assigning blank rows or columns as placeholders.
Why this is a bad hacky workaround: for large data sets, the rbinding of df3 to itself will consume more and more memory. The loop is definitely not optimal, and its search does not exploit the fact that the tables are sorted. If by some chance a test were taken twice within a week, you would see some unexpected behavior (duplicates from df2, etc.). A loop-free alternative is sketched below.
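For larger data, a possible alternative is a data.table non-equi join (v1.9.8+), sketched here under the assumption that Date1 and Date2 have already been converted to Date class as at the top of this answer; note it returns every match, not just the first, so the one-to-one assumption still matters:
library(data.table)   # v1.9.8+ for non-equi joins
dt1 <- as.data.table(df1); dt2 <- as.data.table(df2)
dt2[, `:=`(win_lo = Date2 - 7, win_hi = Date2 + 7)]  # 7-day window around each df2 date
# keep every df1 row, attaching the df2 columns where Date1 falls inside the window
matched <- dt2[dt1, .(ID = i.ID, Date1 = i.Date1, Score1 = i.Score1,
                      Date2 = x.Date2, Score2 = x.Score2),
               on = .(ID, win_lo <= Date1, win_hi >= Date1)]
# append the df2 rows that matched nothing in df1
unmatched <- dt2[!dt1, on = .(ID, win_lo <= Date1, win_hi >= Date1)]
rbind(matched, unmatched[, .(ID, Date1 = as.Date(NA), Score1 = NA, Date2, Score2)])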
Use an outer join with an absolute-value limit on the date difference. (A outer join B keeps all rows of A and B.) For example:
library(sqldf)
sqldf("select a.*, b.* from df1 a outer join df2 b on a.ID = b.ID and abs(a.Date1 - b.Date2) <= 7")
Note that your date variables will have to be true dates. If they are currently characters or integers, you need to do something like df1$Date1 <- as.Date(as.character(df1$Date1), format="%Y%m%d") etc. Be aware that SQLite, sqldf's default backend, does not support a full outer join, which is why the other answer calls this approach bogus.
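One way to emulate a full outer join in SQLite is to union two left joins, as a sketch:
sqldf("select a.*, b.* from df1 a
       left join df2 b on a.ID = b.ID and abs(a.Date1 - b.Date2) <= 7
       union all
       select a.*, b.* from df2 b
       left join df1 a on a.ID = b.ID and abs(a.Date1 - b.Date2) <= 7
       where a.ID is null")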
Hello, I have a dataset with multiple patients, each with multiple observations.
I want to select the earliest observation for each patient.
Example:
Patient ID Tender Swollen pt_visit
101 1 10 6
101 6 12 12
101 4 3 18
102 9 5 18
102 3 6 24
103 5 2 12
103 2 1 18
103 8 0 24
The pt_visit variable is the number of months the patient had been in the study at the time of the observation. What I need is the earliest observation for each patient ID, i.e. the row with the lowest number of months in the pt_visit column.
My desired results:
Patient ID Tender Swollen pt_visit
101 1 10 6
102 9 5 18
103 5 2 12
Assuming your data frame is called df, use the ddply function in the plyr package:
require(plyr)
firstObs <- ddply(df, "PatientID", function(x) x[x$pt_visit == min(x$pt_visit), ])
I would use the data.table package:
Data <- data.table(Data)
setkey(Data, Patient_ID, pt_visit)  # sort by patient, then by visit month
Data[, .SD[1], by=Patient_ID]       # first row per patient = earliest visit
Assuming that the Patient ID column is actually named Patient_ID, here are a few approaches. DF is assumed to be the name of the input data frame:
sqldf
library(sqldf)
sqldf("select Patient_ID, Tender, Swollen, min(pt_visit) pt_visit
from DF
group by Patient_ID")
or
sqldf("select *, min(pt_visit) pt_visit from DF group by Patient_ID")[-ncol(DF)]
Note: The above two alternatives use an extension to SQL found only in SQLite, so be sure you are using the SQLite backend. (SQLite is the default backend for sqldf unless RH2, RPostgreSQL or RMySQL is loaded.)
subset/ave
subset(DF, ave(pt_visit, Patient_ID, FUN = rank) == 1)
Note: This makes use of the fact that there are no duplicate pt_visit values within the same Patient_ID. If there were, we would need to specify the ties.method= argument to rank, as sketched below.
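For instance, a sketch of that tweak:
subset(DF, ave(pt_visit, Patient_ID,
       FUN = function(x) rank(x, ties.method = "first")) == 1)  # keeps exactly one row per patient even on ties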
I almost think there should be a subset parameter named "by" that would do the same as it does in data.table. This is a base solution:
do.call(rbind, lapply( split(dfr, dfr$PatientID),
function(x) x[which.min(x$pt_visit),] ) )
PatientID Tender Swollen pt_visit
101 101 1 10 6
102 102 9 5 18
103 103 5 2 12
I guess you can see why #hadley built 'plyr'.
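For the record, a later dplyr equivalent, as a sketch (assuming the same dfr; slice_min() needs dplyr 1.0+):
library(dplyr)
dfr %>% group_by(PatientID) %>% slice_min(pt_visit, n = 1, with_ties = FALSE)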
I have the following table, which comprises some key columns: customer ID | order ID | product ID | Quantity | Amount | Order Date.
All this data is in LONG format, in that you get multiple line items for one Customer ID.
I can get the first date and last date using R DateDiff, but converting the file to WIDE format using plyr still leaves me with the same problem of multiple orders per customer, just with fewer rows and more columns.
Is there an R function that extends R DateDiff to work out the time interval between purchases by Customer ID? That is, the time between order 1 and 2, order 2 and 3, and so on, assuming these orders exist.
CID Order.Date Order.DateMY Order.No_ Amount Quantity Category.Name Locality
1 26/02/13 Feb-13 zzzzz 1 r MOSMAN
1 26/05/13 May-13 qqqqq 1 x CHULLORA
1 28/05/13 May-13 wwwww 1 r MOSMAN
1 28/05/13 May-13 wwwww 1 x MOSMAN
2 19/08/13 Aug-13 wwwwww 1 o OAKLEIGH SOUTH
3 3/01/13 Jan-13 wwwwww 1 x CURRENCY CREEK
4 28/08/13 Aug-13 eeeeeee 1 t BRISBANE
4 10/09/13 Sep-13 rrrrrrrrr 1 y BRISBANE
4 25/09/13 Sep-13 tttttttt 2 e BRISBANE
It is not clear what you want to do, since you don't give the expected result, but I guess you want the intervals between consecutive orders.
library(data.table)
DT <- as.data.table(DF)
DT[, list(Order.Date,
diff = c(0,diff(sort(as.Date(Order.Date,'%d/%m/%y')))) ),CID]
CID Order.Date diff
1: 1 26/02/13 0
2: 1 26/05/13 89
3: 1 28/05/13 2
4: 1 28/05/13 0
5: 2 19/08/13 0
6: 3 3/01/13 0
7: 4 28/08/13 0
8: 4 10/09/13 13
9: 4 25/09/13 15
Split the data frame and find the intervals for each Customer ID.
df <- data.frame(customerID=as.factor(c(rep("A",3),rep("B",4))),
OrderDate=as.Date(c("2013-07-01","2013-07-02","2013-07-03","2013-06-01","2013-06-02",
"2013-06-03","2013-07-01")))
dfs <- split(df,df$customerID)
lapply(dfs, function(x) diff(x$OrderDate))
Or use plyr
library(plyr)
dfs <- dlply(df,.(customerID),function(x)return(diff(x$OrderDate)))
I know this question is very old, but I just figured out another way to do it and wanted to record it:
> library(dplyr)
> library(lubridate)
> df %>% group_by(customerID) %>%
mutate(SinceLast=(interval(ymd(lag(OrderDate)),ymd(OrderDate)))/86400)
# A tibble: 7 x 3
# Groups: customerID [2]
customerID OrderDate SinceLast
<fct> <date> <dbl>
1 A 2013-07-01 NA
2 A 2013-07-02 1.
3 A 2013-07-03 1.
4 B 2013-06-01 NA
5 B 2013-06-02 1.
6 B 2013-06-03 1.
7 B 2013-07-01 28.
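Since OrderDate is already a Date, a sketch of the same computation without lubridate (differencing Dates directly yields days):
library(dplyr)
df %>% group_by(customerID) %>%
  mutate(SinceLast = as.numeric(OrderDate - lag(OrderDate)))  # difftime in days, as a plain number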