I am using RStudio (version 0.99.903) on a Windows 8 PC. This is a follow-up to a question I asked yesterday, as the problem has become more complicated. Here is what the data looks like:
Number Trial ID Open date Enrollment rate
420 NCT00091442 9 1/28/2005 0.2
1476 NCT00301457 26 2/22/2008 1
10559 NCT01307397 34 7/28/2011 0.6
6794 NCT00948675 53 5/12/2010 0
6451 NCT00917384 53 8/17/2010 0.3
8754 NCT01168973 53 1/19/2011 0.2
8578 NCT01140347 53 12/30/2011 2.4
11655 NCT01358877 53 4/2/2012 0.3
428 NCT00091442 55 9/7/2005 0.1
112 NCT00065325 62 10/15/2003 0.2
477 NCT00091442 62 11/11/2005 0.1
16277 NCT01843374 62 12/16/2013 0.2
17386 NCT01905657 62 1/8/2014 0.6
411 NCT00091442 66 1/12/2005 0
What I need to do is compare the enrollment rate at a given date within an ID to the average of the enrollment rates at dates up to one year prior to it within the same ID. For instance, for ID 53, the date of 1/19/2011 has an enrollment rate of 0.2, and I would want to compare this against the average of the 8/17/2010 and 5/12/2010 enrollment rates (i.e., 0.15).
If there are no other dates within the ID prior to the current one, then the comparison should not be made. For instance, for ID 26, there would be no comparison. Similarly, for ID 53, there would be no comparison for 5/12/2010.
When I say "compare" I am not doing any analysis or visualization. I simply want a new column that takes the average value of those enrollment rates up to one year prior to the current one (I will be plotting them and percentile ranking them later). There are >20,000 data points. Any help would be much appreciated.
Verbose but possibly high performance way of doing this. No giant for loops looping over all the rows of the data frame. The two sapply loops only operate on a big numeric vector, which should be relatively quick regardless of your data row count. But I'm sure someone will waltz in with a trivial dplyr solution soon enough.
This approach assumes that your data is sorted first by ID, then by Opendate, and that Opendate is stored as a Date. If the data are not sorted, you need to sort them first.
# Flag rows that have a row with the same ID both directly above and directly below
A = which(unlist(sapply(X = rle(df$ID)$lengths,
                        FUN = function(x) {if(x == 1) return(FALSE)
                                           if(x == 2) return(c(FALSE, FALSE))
                                           if(x >= 3) return(c(FALSE, rep(TRUE, x-2), FALSE))})))
# Store the date vector once; should speed up the code a tiny bit
V_opendate = df$Opendate
# Further filter A: keep rows whose two neighbours are both less than 365 days away
B = A[sapply(A, function(x) (abs(V_opendate[x]-V_opendate[x-1]) < 365) & (abs(V_opendate[x]-V_opendate[x+1]) < 365))]
# Indices of the neighbouring rows (row - 1, row + 1)
C = sapply(B, function(x) c(x-1, x+1), simplify = FALSE)
# Take the mean of the neighbours' enrollment rates
D = sapply(C, function(x) mean(df$Enrollmentrate[x]))
# Create new column Rate and fill it with the values of D. You can do the comparison from here.
df[B,"Rate"] = D
Number Trial ID Opendate Enrollmentrate Rate
1 420 NCT00091442 9 2005-01-28 0.2 NA
2 1476 NCT00301457 26 2008-02-22 1.0 NA
3 10559 NCT01307397 34 2011-07-28 0.6 NA
4 6794 NCT00948675 53 2010-05-12 0.0 NA
5 6451 NCT00917384 53 2010-08-17 0.3 0.10
6 8754 NCT01168973 53 2011-01-19 0.2 1.35
7 8578 NCT01140347 53 2011-12-30 2.4 0.25
8 11655 NCT01358877 53 2012-04-02 0.3 NA
9 428 NCT00091442 55 2005-09-07 0.1 NA
10 112 NCT00065325 62 2003-10-15 0.2 NA
11 477 NCT00091442 62 2005-11-11 0.1 NA
12 16277 NCT01843374 62 2013-12-16 0.2 NA
13 17386 NCT01905657 62 2014-01-08 0.6 NA
14 411 NCT00091442 66 2005-01-12 0.0 NA
The relevant rows are calculated. You can do your comparison with the newly created Rate column.
You might have to change the code a little, since I removed the spaces from the column names:
df = read.table(text = " Number Trial ID Opendate Enrollmentrate
420 NCT00091442 9 1/28/2005 0.2
1476 NCT00301457 26 2/22/2008 1
10559 NCT01307397 34 7/28/2011 0.6
6794 NCT00948675 53 5/12/2010 0
6451 NCT00917384 53 8/17/2010 0.3
8754 NCT01168973 53 1/19/2011 0.2
8578 NCT01140347 53 12/30/2011 2.4
11655 NCT01358877 53 4/2/2012 0.3
428 NCT00091442 55 9/7/2005 0.1
112 NCT00065325 62 10/15/2003 0.2
477 NCT00091442 62 11/11/2005 0.1
16277 NCT01843374 62 12/16/2013 0.2
17386 NCT01905657 62 1/8/2014 0.6
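411 NCT00091442 66 1/12/2005 0", header = T)
# Convert Opendate to Date so that the date differences used above can be computed
df$Opendate = as.Date(df$Opendate, format = "%m/%d/%Y")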
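The Rate column above averages the rows directly before and after each row. For the window described in the question (all earlier dates within the same ID that fall at most 365 days before the current one), a minimal base R sketch could look like the following. The column name PriorMean and the 365-day cutoff are assumptions made for illustration, not part of the original answer.

```r
df <- read.table(text = "Number Trial ID Opendate Enrollmentrate
420 NCT00091442 9 1/28/2005 0.2
1476 NCT00301457 26 2/22/2008 1
10559 NCT01307397 34 7/28/2011 0.6
6794 NCT00948675 53 5/12/2010 0
6451 NCT00917384 53 8/17/2010 0.3
8754 NCT01168973 53 1/19/2011 0.2
8578 NCT01140347 53 12/30/2011 2.4
11655 NCT01358877 53 4/2/2012 0.3
428 NCT00091442 55 9/7/2005 0.1
112 NCT00065325 62 10/15/2003 0.2
477 NCT00091442 62 11/11/2005 0.1
16277 NCT01843374 62 12/16/2013 0.2
17386 NCT01905657 62 1/8/2014 0.6
411 NCT00091442 66 1/12/2005 0", header = TRUE, stringsAsFactors = FALSE)
df$Opendate <- as.Date(df$Opendate, format = "%m/%d/%Y")

# PriorMean is a hypothetical column name chosen for this sketch
df$PriorMean <- NA_real_

# Work on one ID at a time so each row is only compared within its own group
for (rows in split(seq_len(nrow(df)), df$ID)) {
  d <- df$Opendate[rows]
  r <- df$Enrollmentrate[rows]
  df$PriorMean[rows] <- vapply(seq_along(rows), function(i) {
    gap <- as.numeric(d[i] - d)          # days from every row to row i
    prior <- gap > 0 & gap <= 365        # strictly earlier, at most one year back
    if (any(prior)) mean(r[prior]) else NA_real_
  }, numeric(1))
}

df[df$ID == 53, c("Opendate", "Enrollmentrate", "PriorMean")]
```

Rows with no earlier date inside the window keep NA, so no comparison is made for them. Because the comparison is done within each ID, the cost grows with the size of the largest group rather than with the square of the total row count.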
I have two tables coming from devices that gather data at different sampling frequencies. One device samples every 30 seconds; the other samples roughly every 30 seconds and sometimes drops measurements (an example sequence might be 31, 61, 95, 151; notice how it missed the sample around 120). My original data.frame contains a datetime instead of the number of seconds, but the toy data should work to illustrate.
q1 <-
read.table(text="
A 0 1.1
A 30 1.2
A 90 1.3
A 120 1.4
B 15 -5
B 45 -3
B 75 -3.5
C 10 0
C 40 -1.4
C 70 -1")
q2 <-
read.table(text="
A 10 10.1
A 40 10.2
A 110 10.4
B 30 -50
B 90 -30
C 5 0
C 35 -10.4
C 76 -10")
names(q1) <- c("key","datetime","x")
names(q2) <- c("key","timepoint","y")
# create a joint_time to keep the originals in place
q1$joint_time <- q1$datetime
q2$joint_time <- q2$timepoint
If I try to join by nearest, I get
# set the keys
data.table::setkey(data.table::setDT(q1), key, joint_time)
data.table::setkey(data.table::setDT(q2), key, joint_time)
q2[q1, roll="nearest"]
Notice the duplicated timepoint and y values in rows 4 and 6.
key timepoint y joint_time datetime x
1: A 10 10.1 0 0 1.1
2: A 40 10.2 30 30 1.2
3: A 110 10.4 90 90 1.3
4: A 110 10.4 120 120 1.4
5: B 30 -50.0 15 15 -5.0
6: B 30 -50.0 45 45 -3.0
7: B 90 -30.0 75 75 -3.5
8: C 5 0.0 10 10 0.0
9: C 35 -10.4 40 40 -1.4
10: C 76 -10.0 70 70 -1.0
My ideal output would join by nearest but fill with NA instead of duplicate on y values.
key timepoint y joint_time datetime x
1: A 10 10.1 0 0 1.1
2: A 40 10.2 30 30 1.2
3: A 110 10.4 90 90 1.3
4: A NA NA 120 120 1.4
5: B 30 -50.0 15 15 -5.0
6: B NA NA 45 45 -3.0
7: B 90 -30.0 75 75 -3.5
8: C 5 0.0 10 10 0.0
9: C 35 -10.4 40 40 -1.4
10: C 76 -10.0 70 70 -1.0
I'm fine with doing the join first and then finding the duplicates and changing them to NA. I will later try to interpolate the y variable there. Not sure if there's a direct way to do the join and fill with NA or if it has to be done a posteriori.
Here's what I ended up doing. I don't think it's awesome, but as far as I can see it works as expected.
q1$joint_time <- q1$datetime
q2$joint_time <- q2$timepoint
# create a sample id using the key since the data is grouped
q2$sample_id <- paste0(q2$key, as.character(1:nrow(q2)))
# Join
res <- q2[q1, roll="nearest"]
# fill with NAs (needs dplyr for %>%, mutate_at and vars)
library(dplyr)
res %>% mutate_at(vars(y,timepoint), ~ifelse(duplicated(sample_id), NA, .))
Which produces
key timepoint y joint_time sample_id datetime x
1: A 10 10.1 0 A1 0 1.1
2: A 40 10.2 30 A2 30 1.2
3: A 110 10.4 90 A3 90 1.3
4: A NA NA 120 A3 120 1.4
5: B 30 -50.0 15 B4 15 -5.0
6: B NA NA 45 B4 45 -3.0
7: B 90 -30.0 75 B5 75 -3.5
8: C 5 0.0 10 C6 10 0.0
9: C 35 -10.4 40 C7 40 -1.4
10: C 76 -10.0 70 C8 70 -1.0
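The NA-filling step on its own can also be sketched in base R, without the helper sample_id column. The snippet below assumes the join has already been done and rebuilds the first seven rows of the joined result by hand as a plain data.frame; it then blanks every repeated (key, timepoint) match.

```r
# Hand-built stand-in for the first rows of the joined result shown above
res <- data.frame(
  key       = c("A", "A", "A", "A", "B", "B", "B"),
  timepoint = c(10, 40, 110, 110, 30, 30, 90),
  y         = c(10.1, 10.2, 10.4, 10.4, -50, -50, -30),
  datetime  = c(0, 30, 90, 120, 15, 45, 75),
  stringsAsFactors = FALSE
)

# A row is a duplicate if the same q2 sample (key + timepoint) was already matched
dup <- duplicated(res[, c("key", "timepoint")])

# Blank the q2 columns on the repeated matches only
res[dup, c("timepoint", "y")] <- NA
res
```

As with the sample_id approach, this keeps the first match for each q2 sample, which is not necessarily the closest one in time.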
I have the following dataframe I'm calling "test" and I am trying to run a Bartlett's test and a Kruskal-Wallis test for each "metab" vs the "diagnosis"
> test
Index tube.label age gender diagnosis metab1 metab2 metab3 metab4 metab5 metab6
1 200 73 Male Cancer 6 1.5 2 5 8 1.5
2 201 71 Male Healthy 6 1.5 2 11.5 50 1.5
4 202 76 Male Adenoma 2 1.5 2 5 8 1.5
7 203 58 Female Cancer 2 1.5 2 1.5 2.5 1.5
9 204 73 Male Cancer 2 1.5 2 1.5 8 1.5
12 205 72 Male Healthy 6 1.5 17.8272 13.5 184.2 4.5
13 206 46 Female Cancer 30.0530 1.5 2 21.2 16.6 4.5
14 207 38 Female Healthy 6 1.5 2 12.494 31.59 1.5
15 208 60 Male Cancer 6 1.5 2 13.2 53.2 4.5
16 209 72 Female Cancer 6 1.5 2 1.5 8 1.5
17 210 72 Male Adenoma 6 1.5 2 22.829 102.44 9.069
18 211 52 Male Cancer 6 1.5 2 1.5 8 1.5
19 212 64 Male Healthy 6 1.5 2 1.5 8 1.5
20 213 68 Male Cancer 6 1.5 2 26.685 40.9 4.5
21 214 60 Male Healthy 24.902 1.5 42.443 22.942 498.5 4.5
23 215 70 Female Healthy 6 1.5 2 1.5 19.908 4.5
24 216 42 Female Healthy 6 1.5 2 1.5 17.7 1.5
25 217 72 Male Inflammation 6 1.5 2 1.5 8 1.5
26 218 71 Male Healthy 51 1.5 2 41.062 182.2 11.340
27 219 51 Female Inflammation 2 1.5 2 1.5 8 1.5
I can run them individually and it gives me the proper value:
bartlett.test(metab1 ~ diagnosis, data = test)
Bartlett test of homogeneity of variances
data: metab1 by diagnosis
Bartlett's K-squared = 5.1526, df = 3, p-value = 0.161
kruskal.test(metab1 ~ diagnosis, data = test)
Kruskal-Wallis rank sum test
data: metab1 by diagnosis
Kruskal-Wallis chi-squared = 4.3475, df = 3, p-value = 0.2263
However when I try to run a for loop (I have more than 100 of them to run) I keep getting the following error:
Bartlett error:
testcols <- colnames(test[6:ncol(test)])
for (met in testcols){
bartlett.test(met ~ diagnosis, data = test)
}
>Error in model.frame.default(formula = met ~ diagnosis, data = test) :
variable lengths differ (found for 'diagnosis')
Kruskal-Wallis error:
for(met in testcols){
kruskal.test(met ~ diagnosis,data = test)
}
>Error in model.frame.default(formula = met ~ diagnosis, data = test) :
variable lengths differ (found for 'diagnosis')
Should I be using something else? Thank you for the help!
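Inside the loop, met is a character string, so met ~ diagnosis looks for a variable literally named met instead of the column it names. Build the formula from the column name with reformulate instead: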
cols <- names(test)[6:ncol(test)]
all_test <- lapply(cols, function(x)
bartlett.test(reformulate("diagnosis", x), data = test))
You can do the same with kruskal.test.
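As a self-contained illustration of the same idea, the sketch below runs both tests over several columns and collects the p-values. It uses the built-in iris data as a stand-in for test, with Species playing the role of diagnosis; those substitutions are assumptions made only so the example runs as-is.

```r
# iris stands in for `test`; Species plays the role of `diagnosis`
cols <- names(iris)[1:4]

p_values <- sapply(cols, function(v) {
  f <- reformulate("Species", response = v)   # e.g. Sepal.Length ~ Species
  c(bartlett = bartlett.test(f, data = iris)$p.value,
    kruskal  = kruskal.test(f, data = iris)$p.value)
})

# One column per tested variable, one row per test
p_values
```

Returning the p-values from the function (rather than calling the tests inside a bare for loop) also avoids the results being silently discarded, since a for loop does not print or store them.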
I have patient data that looks like this:
ID DATE DUR
82 29/08/2014 10.32
82 29/08/2014 0.32
82 12/09/2014 13.35
82 12/09/2014 0.16
82 12/09/2014 0.24
82 12/09/2014 0.31
82 22/12/2014 100.39
82 22/12/2014 0.1
219 31/11/2012 -300.32
219 31/11/2012 0.23
219 12/01/2013 80.20
219 12/01/2013 0.20
The first column is a patient ID, the second is a date and time (the time is not displayed but is in there), and the third is the duration difference (which I have been using to distinguish separate admissions of a patient). Each row is a check-up on the patient, but the patient may have come back at a later date (not within the same time frame).
Basically, what I want to do is label each patient's ID so that when they are admitted a second time their ID becomes "82a", a third time "82b", and so on. It wouldn't have to be alphabetic; it could be any such indicator. Some patients have as many as 50 different admissions (admissions on separate occasions). Afterwards, I want the data to look something like this:
ID DATE DUR
82 29/08/2014 10.32
82 29/08/2014 0.32
82a 12/09/2014 13.35
82a 12/09/2014 0.16
82a 12/09/2014 0.24
82a 12/09/2014 0.31
82b 22/12/2014 100.39
82b 22/12/2014 0.1
219 31/11/2012 -300.32
219 31/11/2012 0.23
219a 12/01/2013 80.20
219a 12/01/2013 0.20
I have been working in Excel for the time being and at first had used
=IF(AND(ABS(C3)>1,A3=A2),1,0)
just to indicate when an ID is repeated on a new admission date. I then did this again to indicate the 3rd admission, began adding columns for the 4th, 5th and 6th, and planned on merging them. This is simply not an efficient solution, especially with a large data set. I am familiar with R and think it might be a better tool for this manipulation, but I am stuck on how to do this for the entire data set and to keep adding a new "indicator" every time the same patient is admitted again. I am not even sure how to express it in pseudo-code. Perhaps something like this:
Pseudo-Code
-> Run through ID Column
-> IF Dur is > 1 (it will always be > 1 for a new admission)
ANDIF ID already exists above with DUR > 1 = a, or if DUR > 1 TWICE for
same ID = b, or if DUR > THREE TIMES = c, and so on....
Any help would be great
In R, you have a lot of options. Your data has issues, however; since November only has 30 days, converting the DATE column to an actual date format will introduce NAs. (You could, of course, just leave it as character, but date formats are easier to work with.)
With dplyr:
library(dplyr)
df %>% mutate(DATE = as.Date(DATE, '%d/%m/%Y')) %>% # parse date data
group_by(ID) %>% # group data by ID
mutate(visit = as.integer(factor(DATE))) # make an integer factor of DATE
# Source: local data frame [12 x 4]
# Groups: ID [2]
#
# ID DATE DUR visit
# (int) (date) (dbl) (int)
# 1 82 2014-08-29 10.32 1
# 2 82 2014-08-29 0.32 1
# 3 82 2014-09-12 13.35 2
# 4 82 2014-09-12 0.16 2
# 5 82 2014-09-12 0.24 2
# 6 82 2014-09-12 0.31 2
# 7 82 2014-12-22 100.39 3
# 8 82 2014-12-22 0.10 3
# 9 219 <NA> -300.32 NA
# 10 219 <NA> 0.23 NA
# 11 219 2013-01-12 80.20 1
# 12 219 2013-01-12 0.20 1
Base R has a lot of options, including ave and tapply, but to keep it simple so you can see what happens step by step in a split-apply-combine model: split by the grouping variable, lapply across the list, and use do.call(rbind, ...) to reassemble:
df$DATE <- as.Date(df$DATE, '%d/%m/%Y')
df <- do.call(rbind, lapply(split(df, df$ID),
function(x){data.frame(x,
visit = as.integer(factor(x$DATE)))}))
rownames(df) <- NULL # delete useless rownames
df
# ID DATE DUR visit
# 1 82 2014-08-29 10.32 1
# 2 82 2014-08-29 0.32 1
# 3 82 2014-09-12 13.35 2
# 4 82 2014-09-12 0.16 2
# 5 82 2014-09-12 0.24 2
# 6 82 2014-09-12 0.31 2
# 7 82 2014-12-22 100.39 3
# 8 82 2014-12-22 0.10 3
# 9 219 <NA> -300.32 NA
# 10 219 <NA> 0.23 NA
# 11 219 2013-01-12 80.20 1
# 12 219 2013-01-12 0.20 1
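To get from a visit number to the kind of suffixed ID the question asks for, one possible sketch in base R is shown below. It appends a numeric suffix instead of a letter, since letters would run out for patients with around 50 admissions. The column name ID2 and the "_" separator are assumptions for illustration, and the two rows with the invalid date 31/11/2012 are left out of the sample data.

```r
df <- data.frame(
  ID   = c(82, 82, 82, 82, 82, 82, 82, 82, 219, 219),
  DATE = c("29/08/2014", "29/08/2014", "12/09/2014", "12/09/2014",
           "12/09/2014", "12/09/2014", "22/12/2014", "22/12/2014",
           "12/01/2013", "12/01/2013"),
  stringsAsFactors = FALSE
)
df$DATE <- as.Date(df$DATE, "%d/%m/%Y")

# Visit number within each ID: position of the date among that ID's sorted unique dates
df$visit <- ave(as.numeric(df$DATE), df$ID,
                FUN = function(d) match(d, sort(unique(d))))

# ID2 (hypothetical name): plain ID for the first admission, ID_2, ID_3, ... afterwards
df$ID2 <- ifelse(df$visit == 1,
                 as.character(df$ID),
                 paste0(df$ID, "_", df$visit))
df
```

Keeping visit as its own column as well as the combined ID2 makes it easy to filter or group by admission number later.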
I have a very large data set, structured as the sample below.
I have been trying to use the na.spline function in order to
1) identify the "fips" category with missing Yield.
2) if fewer than 3 Yield values are NA per fips (here 1-3), the spline function should kick in and fill in the NA.
3) If 3 or more Yields are NA for a "fips" the code should remove the entire "fips" subset, in this case fips 2 should be removed.
My code so far:
finX <- dataset
finxx <- transform(subset(finX, ave(na.spline(finX$Yield), fips, FUN=sum)<2))
#or
finxx <- transform(subset(finX, ave(is.na(finX$Yield), fips, FUN=sum)<2))
Year fips Max Min Rain Yield
1980 1 24.7 0.0 71 37
1981 1 22.8 0.0 62 40
1982 1 22.6 0.0 47 37
1983 1 24.2 0.0 51 39
1984 1 23.8 0.0 61 47
1985 1 25.1 0.0 67 43
1980 2 24.8 0.0 72 34
1981 2 23.2 0.4 54 **NA**
1982 2 25.3 0.1 83 55
1983 2 23.0 0.0 68 **NA**
1984 2 22.4 0.7 70 **NA**
1985 2 24.6 0.0 47 31
1980 3 25.5 0.0 51 31
1981 3 25.5 0.0 51 31
1982 3 25.5 0.0 51 31
1983 3 25.5 0.0 51 **NA**
1984 3 25.5 0.0 51 31
...
Currently, the code above either does not fill in all the NAs in the final product or simply produces no result at all.
Any guidance would be very useful, thank you.
Yield needs to be converted from character to numeric or NA. Then use by to divide finX into separate data frames by fips value. For each data frame with fewer than 3 NAs, do the spline interpolation; those with 3 or more are returned as NULL. Finally, combine the list of returned data frames into a single data frame. The code would look like:
library(zoo)
# convert finX$Yield values from character to either numeric or NA
finX$Yield <- sapply(finX$Yield, function(x) if(x =="**NA**") NA_real_ else as.numeric(x))
# use spline interpolation on fips sets with less than 3 NA's
finxx <- by(finX, finX$fips, function(x) if(sum(is.na(x$Yield)) < 3) transform(x, Yield=na.spline(object=Yield, x=Year)) )
# combine results into a single data frame
finxx <- do.call(rbind, finxx)
Alternatively after the conversion to numeric values, you could use ave on the Yield column where spline interpolation returns values on fips sets with less than 3 NA's and all NA's on any other sets. All rows with any NA's in the final result would then be deleted. Code is as follows:
finxx2 <- transform(finX, Yield=ave(Yield, fips, FUN=function(x) if(sum(is.na(x)) < 3) na.spline(object=x) else NA))
finxx2 <- na.omit(finxx2)
Both versions give the same result for the sample data but the first version using by allows you to work with a full data frame for each fips set rather than with just Yield. In this case, this allowed Year to be specified for the x values in the spline interpolation so any data set with a missing Year would still give the correct interpolation. The ave version would get an incorrect answer. So the by version seems more robust.
There's also the dplyr version which is very much like the by version above and gives the same answer as the base R versions. If you're OK with working with dplyr, this is probably the most straightforward and robust approach.
library(dplyr)
finxx3 <- finX %>% group_by(fips) %>%
filter(sum(is.na(Yield)) < 3) %>%
mutate(Yield=na.spline(object=Yield, x=Year))
The first version returns
Year fips Max Min Rain Yield
1.1 1980 1 24.7 0 71 37
1.2 1981 1 22.8 0 62 40
1.3 1982 1 22.6 0 47 37
1.4 1983 1 24.2 0 51 39
1.5 1984 1 23.8 0 61 47
1.6 1985 1 25.1 0 67 43
3.13 1980 3 25.5 0 51 31
3.14 1981 3 25.5 0 51 31
3.15 1982 3 25.5 0 51 31
3.16 1983 3 25.5 0 51 31
3.17 1984 3 25.5 0 51 31
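For readers without zoo installed, the drop-or-interpolate logic can be sketched with base R alone. The example below swaps in stats::spline as a stand-in for na.spline, so its interpolated values are only assumed to be comparable, and fill_group is a hypothetical helper name. The sample data is the Year, fips and Yield columns from the question, with the missing values entered as real NAs.

```r
finX <- data.frame(
  Year  = c(1980:1985, 1980:1985, 1980:1984),
  fips  = rep(c(1, 2, 3), c(6, 6, 5)),
  Yield = c(37, 40, 37, 39, 47, 43,
            34, NA, 55, NA, NA, 31,
            31, 31, 31, NA, 31)
)

# fill_group is a hypothetical helper: drop or interpolate one fips group
fill_group <- function(g) {
  miss <- is.na(g$Yield)
  if (sum(miss) >= 3) return(NULL)   # 3 or more missing yields: drop the whole group
  if (any(miss)) {
    # Cubic spline through the known points, evaluated at the missing years
    g$Yield[miss] <- spline(g$Year[!miss], g$Yield[!miss],
                            xout = g$Year[miss])$y
  }
  g
}

groups <- lapply(split(finX, finX$fips), fill_group)
finxx  <- do.call(rbind, Filter(Negate(is.null), groups))
rownames(finxx) <- NULL
finxx
```

Like the by version, this passes Year as the x values, so a group with a missing year is still interpolated at the right positions.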
I am looking at a data frame and trying to subset rows that have the same pressure value for 5 or more rows, or equivalently delete rows that do not have at least 5 duplicate pressure values:
File Turbidity Pressure
1 3.2 46
2 3.4 46
3 5.4 46
4 3.2 46
5 3.1 46
6 2.3 46
7 2.3 45
8 4.5 45
9 2.3 45
10 3.2 44
11 4.5 44
12 6.5 43
13 3.2 42
14 3.1 41
15 1.2 41
16 2.3 41
17 2.4 41
18 2.1 41
19 1.4 41
25 1.3 41
So basically trying to keep rows that have a pressure of 46 and 41 and delete rows in-between. This is a small portion of my dataset and just need code that will basically keep rows with 5 or more duplicate pressure values and delete others.
Try
library(dplyr)
df %>% group_by(Pressure) %>% filter(n() >= 5)
Which gives:
#Source: local data frame [13 x 3]
#Groups: Pressure
#
# File Turbidity Pressure
#1 1 3.2 46
#2 2 3.4 46
#3 3 5.4 46
#4 4 3.2 46
#5 5 3.1 46
#6 6 2.3 46
#7 14 3.1 41
#8 15 1.2 41
#9 16 2.3 41
#10 17 2.4 41
#11 18 2.1 41
#12 19 1.4 41
#13 25 1.3 41
Here's a data.table solution (relies crucially on Pressure not repeating itself later on):
library(data.table)
setDT(df)[,if(.N>=5) .SD,by=Pressure]
Addendum:
If you expect Pressure values to repeat later on, e.g.:
df<-data.frame(File=c(1:19,25:28),
Pressure=rep(c(46:41,46),c(6,3,2,1,1,7,3)))
Then you'll need to use rleid in order to keep only groups of at least 5 in a row (no gaps):
setDT(df)[,ct:=rleid(Pressure)][,if (.N>=5) .SD,by=ct]
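The same consecutive-run filter can be sketched in base R with rle, which avoids the data.table dependency. This is a minimal sketch on the addendum's example data; keep and df_runs are names chosen here for illustration.

```r
df <- data.frame(File = c(1:19, 25:28),
                 Pressure = rep(c(46:41, 46), c(6, 3, 2, 1, 1, 7, 3)))

# Run-length encode Pressure: one entry per uninterrupted run of equal values
runs <- rle(df$Pressure)

# Expand "is this run at least 5 long?" back to one flag per row
keep <- rep(runs$lengths >= 5, runs$lengths)

df_runs <- df[keep, ]
df_runs
```

Here the trailing run of three 46s is dropped even though 46 appears nine times in total, because only uninterrupted runs are counted.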
Here is a solution using base R:
df <- data.frame(id=1:10, Pressure=c(rep(1,5),6:10))
p.counts <- table(df[,"Pressure"])
good.pressures <- as.numeric(names(p.counts))[p.counts>=5]
df.sub <- df[df[,"Pressure"]%in%good.pressures,]
Note that I'm using df as an example data set, so you can delete that first line of code and replace all instances of df with the name of your data.frame.