How to create a data frame subset with the one observation per patient that has the lowest value on a variable - r

Hello, I have a dataset with multiple patients, each with multiple observations.
I want to select the earliest observation for each patient.
Example:
Patient ID Tender Swollen pt_visit
101 1 10 6
101 6 12 12
101 4 3 18
102 9 5 18
102 3 6 24
103 5 2 12
103 2 1 18
103 8 0 24
The pt_visit variable is the number of months the patient had been in the study at the time of the observation. What I need is the earliest observation for each patient ID, i.e. the row with the lowest value in the pt_visit column.
My desired results:
Patient ID Tender Swollen pt_visit
101 1 10 6
102 9 5 18
103 5 2 12
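
For reference, a minimal sketch constructing the example data (the answers below variously assume the first column is named PatientID or Patient_ID rather than containing a space):
df <- data.frame(
  PatientID = c(101, 101, 101, 102, 102, 103, 103, 103),
  Tender    = c(1, 6, 4, 9, 3, 5, 2, 8),
  Swollen   = c(10, 12, 3, 5, 6, 2, 1, 0),
  pt_visit  = c(6, 12, 18, 18, 24, 12, 18, 24)
)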

Assuming your data frame is called df, use the ddply function in the plyr package:
require(plyr)
firstObs <- ddply(df, "PatientID", function(x) x[x$pt_visit == min(x$pt_visit), ])

I would use the data.table package:
library(data.table)
Data <- data.table(Data)
# sort by patient, then by visit month
setkey(Data, Patient_ID, pt_visit)
# take the first (earliest) row within each patient
Data[, .SD[1], by = Patient_ID]
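For completeness, a modern dplyr equivalent, as a sketch (it assumes dplyr 1.0+, where slice_min is available, and a column named PatientID):
library(dplyr)
firstObs <- df %>%
  group_by(PatientID) %>%
  # keep the row with the smallest pt_visit per patient; drop ties
  slice_min(pt_visit, n = 1, with_ties = FALSE) %>%
  ungroup()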

Assuming that the Patient ID column is actually named Patient_ID, here are a few approaches. DF is assumed to be the name of the input data frame:
sqldf
library(sqldf)
sqldf("select Patient_ID, Tender, Swollen, min(pt_visit) pt_visit
from DF
group by Patient_ID")
or
sqldf("select *, min(pt_visit) pt_visit from DF group by Patient_ID")[-ncol(DF)]
Note: The above two alternatives use an extension to SQL found only in SQLite, so be sure you are using the SQLite backend. (SQLite is the default backend for sqldf unless RH2, RPostgreSQL, or RMySQL is loaded.)
subset/ave
subset(DF, ave(pt_visit, Patient_ID, FUN = rank) == 1)
Note: This makes use of the fact that there are no duplicate pt_visit values within the same Patient_ID. If there were, we would need to specify the ties.method= argument of rank.
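For example, if duplicate pt_visit values were possible, a sketch that breaks ties in order of appearance so exactly one row per patient is kept:
subset(DF, ave(pt_visit, Patient_ID,
               FUN = function(x) rank(x, ties.method = "first")) == 1)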

I almost think there should be a subset parameter named "by" that would do the same as it does in data.table. Here is a base solution:
do.call(rbind, lapply(split(dfr, dfr$PatientID),
                      function(x) x[which.min(x$pt_visit), ]))
PatientID Tender Swollen pt_visit
101 101 1 10 6
102 102 9 5 18
103 103 5 2 12
I guess you can see why @hadley built plyr.


subset data based on condition in r [duplicate]

I want to select those households where every member's age is greater than 20, in R.
household Members_age
100 75
100 74
100 30
101 20
101 50
101 60
102 35
102 40
102 5
Here, two households satisfy the condition: households 100 and 101.
How can I do this in R?
What I did is the following, but it's not working:
sqldf("select household,Members_age from data group by household having Members_age > 20")
household Members_age
100 75
102 35
Please suggest a solution. Here is a sample dataset:
library(dplyr)
library(sqldf)
data <- data.frame(household = c(100,100,100,101,101,101,102,102,102),
                   Members_age = c(75,74,30,20,50,60,35,40,5))
You can use ave.
data[ave(data$Members_age, data$household, FUN=min) > 20,]
# household Members_age
#1 100 75
#2 100 74
#3 100 30
Or, to get only the qualifying household IDs:
unique(data$household[ave(data$Members_age, data$household, FUN=min) > 20])
#[1] 100
I understand SQL's HAVING clause, but your request "all members' ages are greater than 20" does not match your sqldf output. Without an aggregate function, HAVING effectively tests a single arbitrary row per household, which is why we see 102 (and shouldn't, since it has a 5-year-old member) and don't see 101 (which shouldn't appear anyway under a strict "greater than 20", since one member is exactly 20).
To implement your logic, I suggest changing your sqldf code to the following:
sqldf("select household,Members_age from data group by household having min(Members_age) > 20")
# household Members_age
# 1 100 30
which is effectively the SQL analog of GKi's ave answer.
An alternative:
library(dplyr)
data %>%
  group_by(household) %>%
  filter(all(Members_age > 20)) %>%
  ungroup()
# # A tibble: 3 x 2
# household Members_age
# <dbl> <dbl>
# 1 100 75
# 2 100 74
# 3 100 30
and if you just need one row per household, then add %>% distinct(household) or perhaps %>% distinct(household, .keep_all = TRUE).
But for base R, I think nothing is likely to be better than GKi's use of ave.

How to add rows to dataframe R with rbind

I know this is a classic question, and there are similar ones in the archive, but I feel the answers do not really apply to this case. Basically, I want to take one data frame (covid cases in Berlin per district), calculate the sum of each column, and create a new data frame with one column for the name of the district and another for the total number of cases. So I wrote:
covid_bln <- read.csv('https://www.berlin.de/lageso/gesundheit/infektionsepidemiologie-infektionsschutz/corona/tabelle-bezirke-gesamtuebersicht/index.php/index/all.csv?q=', sep=';')
c_tot <- data.frame('district' = c(), 'number' = c())
for (n in colnames(covid_bln[3:14])){
  x <- data.frame('district' = c(n), 'number' = c(sum(covid_bln$n)))
  c_tot <- rbind(c_tot, x)
  next
}
print(c_tot)
This works properly for the names, but every row gets the number of cases for the 8th district rather than for its own district. If you have any suggestions, even involving other functions, that would be great. Thank you
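For what it's worth, the likely bug: covid_bln$n does not look up the loop variable n, because $ takes a literal column name and partially matches it, so it always matches the single column starting with "n" (neukoelln, the 8th district). A minimal sketch of the corrected loop, indexing with [[ ]] so the value of n is used:
for (n in colnames(covid_bln[3:14])){
  # covid_bln[[n]] selects the column whose name is stored in n
  x <- data.frame(district = n, number = sum(covid_bln[[n]]))
  c_tot <- rbind(c_tot, x)
}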
Here's a base R solution:
number <- colSums(covid_bln[3:14])
district <- names(covid_bln[3:14])
c_tot <- cbind.data.frame(district, number)
# If you don't want rownames:
rownames(c_tot) <- NULL
This gives us:
district number
1 mitte 16030
2 friedrichshain_kreuzberg 10679
3 pankow 10849
4 charlottenburg_wilmersdorf 10664
5 spandau 9450
6 steglitz_zehlendorf 9218
7 tempelhof_schoeneberg 12624
8 neukoelln 14922
9 treptow_koepenick 6760
10 marzahn_hellersdorf 6960
11 lichtenberg 7601
12 reinickendorf 9752
Here is a solution using the tidyverse. The final result is ordered alphabetically by district:
library(dplyr)
library(tidyr)
c_tot <- covid_bln %>%
  select(mitte:reinickendorf) %>%
  gather(district, number, mitte:reinickendorf) %>%
  group_by(district) %>%
  summarise(number = sum(number))
The result is:
# A tibble: 12 x 2
district number
* <chr> <int>
1 charlottenburg_wilmersdorf 10736
2 friedrichshain_kreuzberg 10698
3 lichtenberg 7644
4 marzahn_hellersdorf 7000
5 mitte 16064
6 neukoelln 14982
7 pankow 10885
8 reinickendorf 9784
9 spandau 9486
10 steglitz_zehlendorf 9236
11 tempelhof_schoeneberg 12656
12 treptow_koepenick 6788
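Note that gather() is superseded in current tidyr; a sketch of the same pipeline with pivot_longer() (assuming tidyr 1.0+):
library(dplyr)
library(tidyr)
c_tot <- covid_bln %>%
  pivot_longer(mitte:reinickendorf,
               names_to = "district", values_to = "number") %>%
  group_by(district) %>%
  summarise(number = sum(number))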

How to select random rows from R data frame to include all distinct values of two columns

I want to select a random sample of rows from a large R data frame df (around 10 million rows) in such a way that all distinct values of two columns are included in the resulting sample. df looks like:
StoreID WEEK Units Value ProdID
2001 1 1 3.5 20702
2001 2 2 3 20705
2002 32 3 6 23568
2002 35 5 15 24025
2003 1 2 10 21253
I have the following numbers of unique values in the respective columns: StoreID: 1433 and WEEK: 52. When I generate a random sample of rows from df, I must have at least one row for each StoreID and each WEEK value.
I used the function sample_frac in dplyr in various trials but that does not ensure that all distinct values of StoreID and WEEK are included at least once in the resulting sample. How can I achieve what I want?
It sounds like you need to group the desired columns before sampling rows. The last line will return one random row for each unique storeID-week pairing.
library(dplyr)
df <- data.frame(storeid = sample(2000:2010, 1000, replace = TRUE),
                 week = sample(1:52, 1000, replace = TRUE),
                 value = runif(1000))
# count the duplicated storeid-week pairs
df %>% count(storeid, week) %>% filter(n > 1)
# take one random row per storeid-week pair
df %>% group_by(storeid, week) %>% sample_n(1)
# A tibble: 468 x 3
# Groups: storeid, week [468]
storeid week value
<int> <int> <dbl>
1 2000 1 0.824
2 2000 2 0.0987
3 2000 6 0.916
4 2000 8 0.289
5 2000 9 0.610
6 2000 11 0.0807
7 2000 12 0.592
8 2000 13 0.849
9 2000 14 0.0181
10 2000 16 0.182
# ... with 458 more rows
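In current dplyr, sample_n() is superseded by slice_sample(); an equivalent sketch (assuming dplyr 1.0+):
df %>% group_by(storeid, week) %>% slice_sample(n = 1) %>% ungroup()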
I'm not sure if I have read the problem correctly; I would have tried the following, using the sample function.
Assuming your dataframe is called MyDataFrame and is two dimensional, I would have done it like this.
RandomizedDF <- MyDataFrame[sample(nrow(MyDataFrame), nrow(MyDataFrame), replace = FALSE), ]
Let me know if this is what you wanted, or if you need something else.

Turning one row into multiple rows in r [duplicate]

In R, I have data where each person has multiple session dates and scores on some tests, all in one row. I would like to change it so that each person has multiple rows, each containing the person's info, one session date, and the corresponding test scores. Also, each person may have completed a different number of sessions.
Ex:
ID Name Session1Date Score Score Session2Date Score Score
23 sjfd 20150904 2 3 20150908 5 7
28 addf 20150905 3 4 20150910 6 8
To:
ID Name SessionDate Score Score
23 sjfd 20150904 2 3
23 sjfd 20150908 5 7
28 addf 20150905 3 4
28 addf 20150910 6 8
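For reference, a minimal sketch constructing this input; the names Score.1, Score.2, Score.3 are an assumption based on how R deduplicates the repeated Score headers on import, and the answers below rely on them:
df1 <- data.frame(ID = c(23, 28), Name = c("sjfd", "addf"),
                  Session1Date = c(20150904, 20150905),
                  Score = c(2, 3), Score.1 = c(3, 4),
                  Session2Date = c(20150908, 20150910),
                  Score.2 = c(5, 6), Score.3 = c(7, 8))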
You can use melt from data.table v1.9.5+, which can take multiple 'measure' columns as a list of patterns:
library(data.table) # v1.9.5+
melt(setDT(df1), measure = patterns("Date$", "Score(\\.2)*$", "Score\\.[13]"))
# ID Name variable value1 value2 value3
#1: 23 sjfd 1 20150904 2 3
#2: 28 addf 1 20150905 3 4
#3: 23 sjfd 2 20150908 5 7
#4: 28 addf 2 20150910 6 8
Or, using reshape from base R, we can specify the direction as 'long' and varying as a list of column indices:
res <- reshape(df1, idvar = c('ID', 'Name'),
               varying = list(c(3,6), c(4,7), c(5,8)), direction = 'long')
res
# ID Name time Session1Date Score Score.1
#23.sjfd.1 23 sjfd 1 20150904 2 3
#28.addf.1 28 addf 1 20150905 3 4
#23.sjfd.2 23 sjfd 2 20150908 5 7
#28.addf.2 28 addf 2 20150910 6 8
If needed, the rownames can be changed
row.names(res) <- NULL
Update
If the columns follow a specific order, i.e. the 3rd grouped with the 6th, the 4th with the 7th, and the 5th with the 8th, we can create a matrix of column indices and then split it to get the list for the varying argument in reshape:
m1 <- matrix(3:8,ncol=2)
lst <- split(m1, row(m1))
reshape(df1, idvar=c('ID', 'Name'), varying=lst, direction='long')
If your data frame is named data, use this:
data1 <- data[1:5]
data2 <- data[c(1,2,6,7,8)]
newdata <- rbind(data1,data2)
This works for the example you've given, but you will have to make the column names of data1 and data2 match (e.g. rename Session1Date and Session2Date to a common SessionDate), since rbind requires matching names.

Merging overlapping dataframes in R

Okay, so I have two different data frames (df1 and df2) which, to simplify, each have an ID, a date, and the score on a test. In each data frame, a person (ID) may have taken the test on multiple dates. Between the two data frames, some people are listed in df1 but not in df2 and vice versa, while some are listed in both, with dates that can overlap in different ways.
I want to combine all the data into one frame, but the tricky part is: if a row from df1 and a row from df2 have the same ID and dates within 7 days of each other (I can check this with a subtracted-dates column), I want to combine them into a single row.
In essence, for every ID there will be one row holding both scores when the tests were taken within 7 days of each other; otherwise there will be two separate rows, one with the score from df1 and one with the score from df2, along with all the other scores that are not listed in both.
EX:
df1
ID Date1(yyyymmdd) Score1
1 20140512 50
1 20140501 30
1 20140703 50
1 20140805 20
3 20140522 70
3 20140530 10
df2
ID Date2(yyyymmdd) Score2
1 20140530 40
1 20140622 20
1 20140702 10
1 20140820 60
2 20140522 30
2 20140530 80
Wanted_df
ID Date1(yyyymmdd) Score1 Date2(yyyymmdd) Score2
1 20140512 50
1 20140501 30
1 20140703 50 20140702 10
1 20140805 20
1 20140530 40
1 20140622 20
1 20140820 60
3 20140522 70
3 20140530 10
2 20140522 30
2 20140530 80
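For reference, a minimal sketch constructing the example data (dates kept as integers here; both answers below convert them to Date objects):
df1 <- data.frame(ID = c(1, 1, 1, 1, 3, 3),
                  Date1 = c(20140512, 20140501, 20140703,
                            20140805, 20140522, 20140530),
                  Score1 = c(50, 30, 50, 20, 70, 10))
df2 <- data.frame(ID = c(1, 1, 1, 1, 2, 2),
                  Date2 = c(20140530, 20140622, 20140702,
                            20140820, 20140522, 20140530),
                  Score2 = c(40, 20, 10, 60, 30, 80))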
Alright. I feel bad about the bogus outer join answer (which may be possible in a library I don't know about, but there are advantages to using an RDBMS sometimes...), so here is a hacky workaround. It assumes that all the joins will be at most one-to-one, which you've said is OK.
# ensure the date columns are date type
df1$Date1 <- as.Date(as.character(df1$Date1), format="%Y%m%d")
df2$Date2 <- as.Date(as.character(df2$Date2), format="%Y%m%d")
# ensure the dfs are sorted
df1 <- df1[order(df1$ID, df1$Date1),]
df2 <- df2[order(df2$ID, df2$Date2),]
# initialize the output df3, which starts as everything from df1 and NA from df2
df3 <- cbind(df1,Date2=NA, Score2=NA)
library(plyr) #for rbind.fill
for (j in 1:nrow(df2)) {
  # see if there are any rows of test1 you could join test2 to
  join_rows <- which(df3[, "ID"] == df2[j, "ID"] &
                       abs(df3[, "Date1"] - df2[j, "Date2"]) < 7)
  # if so, join it to the first one (see discussion)
  if (length(join_rows) > 0) {
    df3[min(join_rows), "Date2"] <- df2[j, "Date2"]
    df3[min(join_rows), "Score2"] <- df2[j, "Score2"]
  } else {
    # if not, add a new row with just the test2 values
    df3 <- rbind.fill(df3, df2[j, ])
  }
}
df3 <- df3[order(df3$ID,df3$Date1,df3$Date2),]
row.names(df3) <- NULL # I hate these
df3
# ID Date1 Score1 Date2 Score2
# 1 1 2014-05-01 30 <NA> NA
# 2 1 2014-05-12 50 <NA> NA
# 3 1 2014-07-03 50 2014-07-02 10
# 4 1 2014-08-05 20 <NA> NA
# 5 1 <NA> NA 2014-05-30 40
# 6 1 <NA> NA 2014-06-22 20
# 7 1 <NA> NA 2014-08-20 60
# 8 2 <NA> NA 2014-05-22 30
# 9 2 <NA> NA 2014-05-30 80
# 10 3 2014-05-22 70 <NA> NA
# 11 3 2014-05-30 10 <NA> NA
I couldn't get the rows in the same sort order as yours, but they look the same.
Short explanation: For each row in df2, see if there's a row in df1 you can "join" it to. If not, stick it at the bottom of the table. In the initialization and rbinding, you'll see some hacky ways of assigning blank rows or columns as placeholders.
Why this is a bad hacky workaround: for large data sets, the repeated rbinding onto df3 will consume more and more memory. The loop is definitely not optimal, and its search does not exploit the fact that the tables are sorted. If by some chance a test were taken twice within a week, you would see some unexpected behavior (duplicates from df2, etc.).
Use an outer join with an absolute value limit on the date difference. (A outer join B keeps all rows of A and B.) For example:
library(sqldf)
sqldf("select a.*, b.* from df1 a outer join df2 b on a.ID = b.ID and abs(a.Date1 - b.Date2) <=7")
Note that your date variables will have to be true dates. If they are currently characters or integers, you need to do something like df1$Date1 <- as.Date(as.character(df1$Date1), format="%Y%m%d") etc. Also be aware that plain OUTER JOIN is not accepted by SQLite, the default sqldf backend (older SQLite versions support only LEFT OUTER JOIN), which appears to be why the other answer refers to this as the "bogus outer join" answer.
