Insert incremental time between two rows determined by column entry - r

Sorry me again. I will keep on trying but I want help in case I can't figure out within the next hour.
My data looks like this:
B<-data.frame(ID=c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2),EVID=c(1,1,1,0,1,2,2,1,1,1,2,2,1,1,1),VALUE=seq(15))
B$TIME<-c(Sys.time()+6*3600*(seq_len(nrow(B))-1))
Actually the time is more variable, and each ID may have multiple EVID of 2.
I wanted to add one hour increments between the times for EVID=2 for as many hours as they are apart, i.e., for each pair of EVID=2, I add one hour until the time is within one hour to the second EVID=2 in the pair, so I can get something like this:
(value and ID are just duplicate previous rows)
ID EVID VALUE TIME
1 1 1 1 2013-05-31 07:51:09
2 1 1 2 2013-05-31 13:51:09
3 1 1 3 2013-05-31 19:51:09
4 1 0 4 2013-06-01 01:51:09
5 1 1 5 2013-06-01 07:51:09
6 1 2 6 2013-06-01 13:51:09
6 1 2 6 2013-06-01 14:51:09
6 1 2 6 2013-06-01 15:51:09
6 1 2 6 2013-06-01 16:51:09
6 1 2 6 2013-06-01 17:51:09
6 1 2 6 2013-06-01 18:51:09
7 1 2 7 2013-06-01 19:51:09
8 1 1 8 2013-06-02 01:51:09
9 2 1 9 2013-06-02 07:51:09
10 2 1 10 2013-06-02 13:51:09
11 2 2 11 2013-06-02 19:51:09
11 2 2 11 2013-06-02 20:51:09
11 2 2 11 2013-06-02 21:51:09
11 2 2 11 2013-06-02 22:51:09
11 2 2 11 2013-06-02 23:51:09
11 2 2 11 2013-06-02 0:51:09
12 2 2 12 2013-06-03 01:51:09
13 2 1 13 2013-06-03 07:51:09
14 2 1 14 2013-06-03 13:51:09
15 2 1 15 2013-06-03 19:51:09
Below is my brainstorm/attempt:
library(data.table)
BDT <- data.table(row=1:nrow(B), B, key="ID")
BDT[,list(row,EVID,c(EVID)==2)]
attach(B)
newB<-BDT[c(EVID)==2,list(row=row+1,ID=ID,EVID=EVID,VALUE=VALUE,TIME=head(TIME+3600,-1))]
finalB<-rbind(BDT,newB)[order(EVID,decreasing=TRUE)][order(row)][,-1,with=FALSE]
However, this adds one row of Time+1 hour to each EVID=2 which is not what I desired.
The next thing I tried duplicates every row after the first which is not what I wanted, but has the advantage of sparing my from typing out all the names of the columns (I have about 32)
newB<-B[c(1,rep(2:nrow(B),each=2)),]
## My wild guess -- as.numeric(head(TIME))-as.numeric(tail(TIME)))/3600 doesn't work. I know it says that from row 2 to last row, repeat each row twice
newB[c(FALSE,TRUE),"EVID"]<-2
newB[c(FALSE,TRUE),"TIME"]<-newB[c(FALSE,TRUE),"TIME"]+3600
Thank you for any feedback.
=================================================================
eddie's code works well with my example, which I thought was a good representation but my actual data keep getting
error in seq.int(...) wrong sign in 'by' argument
(...) varies depending on what I was trying
I have a relatively large data, the column that I use as the ID as in the example is in the middle of the data table; I see even from my small sample data if I place the ID along with the other names in the list, R will recognize item 2 as having n+1 columns than item 1 in the rbind. But if I don't include it in the list so that I may use the by=ID, R complains that names are in different order. If a do not list one of the unimportant columns in the beginning of the data, R says item 2 has n-1 columns compared to item 1!
I thought that perhaps my error comes from my time being not exactly hours apart, but by test runs I see that small differences are tolerated, and rounding, either to hour or doing integers, doesn't help.
I tried using length.out, ignoring the warning
Warning message: In .rbind.data.table(...) : Argument 2 has names in
a different order. Columns will be bound by name for consistency with
base. Alternatively, you can drop names (by using an unnamed list) and
the columns will then be joined by position. Or, set use.names=FALSE.
But then the code does not add to between the 2's except at the end, where it adds too many!
What am I doing wrong? I've been pulling all-nighter for this :(
OK so when I rearrange the original data I can get rid of the warnings. However, the insertions are still happening at the end of the data only and they were too many.

This should work:
library(data.table)
dt = data.table(B)
dt[, TIME := as.POSIXct(TIME)]
rbind(dt, dt[EVID == 2,
list(EVID=EVID[1],
VALUE=VALUE[1],
TIME=seq.POSIXt(TIME[1], TIME[2], "hour")),
by = ID])[!duplicated(paste(ID,EVID,TIME))][order(ID, TIME)]

Related

Is there an R function to redefine a variable so I can use the spread function?

I'm new with R and I have the following problem. Maybe it's a really easy question, but I don't know the terms to search for an answer.
My problem:
I have several persons, each person is assigned a studynumber (SN). And each SN has one or more tests being performed, the test can have multiple results.
My data is long at the moment, but I need it to be wide (one row for each SN).
For example:
What I have:
SN testnumbers result
1 1 1234 6
2 1 1234 9
3 2 4567 6
4 3 5678 9
5 3 8790 9
What I want:
SN test1result1 test1result2 test2result1
1 1 6 6 NA
2 2 6 NA NA
3 3 9 NA 9
So I need to renumber the testnumbers into test 1 etc for each SN, in order to use the spread function, I think. But I don't know how.
I did manage to renumber testnumber into a list of 1 till the last unique testnumber, but still the wide dataframe looks awful.

Sum variables conditionally with loop in r

I realize this is a topic that's covered somewhat well but I couldn't find anything that approaches this specific concern:
I have a df with 800 columns, 10 iterations of 80 columns (each column represents an item) - Each column is named something like: 1_BL_PRE.1 1_FU_PRE.1 1_BL_PRE.1 1_BL_POST.1
Where the first '1' indicates the item number and the second '1' indicates the iteration number.
What I'm trying to figure out is how to get the sums of specific groups of items from all 10 iterations.
As a short example let's say I want to take the 1st and 3rd item of BL_PRE and get the sum of all 10 iterations for those 2 items - how would I do this?
subject 1_BL_PRE.1 2_BL_PRE.1 3_BL_PRE.1 1_BL_PRE.2 2_BL_PRE.2
1 40002 3 4 3 1 2
2 40004 1 2 3 4 4
3 40006 4 3 3 3 1
4 40008 2 3 1 2 3
5 40009 3 4 1 2 3
Expected output (where A represents the sum of 1_BL_PRE.1, 3_BL_PRE.1, 1_BL_PRE.2 and so on):
subject BL_PRE_A
1 40002 12
2 40004 14
3 40006 15
4 40008 20
5 40009 12
My hunch is the solution is related to a for-loop or lappy (and I'm not familiar at all with either). I'm trying to work with apply(finaldata,1,function(x) {sum(x ...)}) but I haven't been able to figure out the conditional statement for the function of sum.
If there's an implementation with plyr I'd be really curious to see what that looks like. (and if there's a thread that answers this, apologies and just re-direct!)
**Edited to include small example + code I'm trying to get to work
Thanks!

Identifying maximum number and longest set of time intervals

Say I have data that look like this:
level start end
1 1 133.631 825.141
2 2 133.631 155.953
3 3 146.844 155.953
4 2 293.754 302.196
5 3 293.754 302.196
6 4 293.754 301.428
7 2 326.253 343.436
8 3 326.253 343.436
9 4 333.827 343.436
10 2 578.066 611.766
11 3 578.066 611.766
12 4 578.066 587.876
13 4 598.052 611.766
14 2 811.228 825.141
15 3 811.228 825.141
or this:
level start end
1 1 3.60353 1112.62000
2 2 3.60353 20.35330
3 3 3.60353 8.77526
4 2 72.03720 143.60700
5 3 73.50530 101.13200
6 4 73.50530 81.64660
7 4 92.19030 101.13200
8 3 121.28500 143.60700
9 4 121.28500 128.25900
10 2 167.19700 185.04800
11 3 167.19700 183.44600
12 4 167.19700 182.84600
13 2 398.12300 418.64300
14 3 398.12300 418.64300
15 2 445.83600 454.54500
16 2 776.59400 798.34800
17 3 776.59400 796.64700
18 4 776.59400 795.91300
19 2 906.68800 915.89700
20 3 906.68800 915.89700
21 2 1099.44000 1112.62000
22 3 1099.44000 1112.62000
23 4 1100.14000 1112.62000
They produce the following graphs:
As you can see there are several time intervals at different levels. The level-1 interval always spans the entire duration of the time of interest. Levels 2+ have time intervals that are shorter.
What I would like to do is select the maximum number of non-overlapping time intervals covering each period that contain the maximum number of total time within them. I have marked in pink which ones those would be.
For small dataframes it is possible to brute force this, but obviously there should be some more logical way of doing this. I'm interested in hearing some ideas about what I should try.
EDIT:
I think one thing that could help here is the column 'level'. The results come from Kleinberg's burst detection algorithm (package 'bursts'). You will note that the levels are hierarchically organized. Levels of the same number cannot overlap. However levels successively increasing e.g. 2,3,4 in successive rows can overlap.
In essence, I think the problem could be shortened to this. Take the levels produced, but remove level 1. This would be the vector for the 2nd example:
2 3 2 3 4 4 3 4 2 3 4 2 3 2 2 3 4 2 3 2 3 4
Then, look at the 2s... if there are fewer than or only one '3' then that 2 is the longest interval. But if there are two or more 3's between successive 2's, then those 3s should be counted. Do this iteratively for each level. I think that should work...?
e.g.
vec<-df$level %>% as.vector() %>% .[-1]
vec
#[1] 2 3 2 3 4 4 3 4 2 3 4 2 3 2 2 3 4 2 3 2 3 4
max(vec) #4
vec3<-vec #need to find two or more 4's between 3s
vec3[vec3==3]<-NA
names(vec3)<-cumsum(is.na(vec3))
0 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 8 8
2 NA 2 NA 4 4 NA 4 2 NA 4 2 NA 2 2 NA 4 2 NA 2 NA 4
vec3.res<-which(table(vec3,names(vec3))["4",]>1)
which(names(vec3)==names(vec3.res) & vec3==4) #5 6
The above identifies rows 5 and 6 (which equate to rows 6 and 7 in original df) as having two fours that lie between 3's. Perhaps something using this sort of approach might work?
OK here is a stab using your second data set to test. This might not be correct in all cases!!
library(data.table)
dat <- fread("data.csv")
dat[,use:="maybe"]
make.pass <- function(dat,low,high,the.level,use) {
check <- dat[(use!="no" & level > the.level)]
check[,contained.by.above:=(low<=start & end<=high)]
check[,consecutive.contained.by.above:=
(contained.by.above &
!is.na(shift(contained.by.above,1)) &
shift(contained.by.above,1)),by=level]
if(!any(check[,consecutive.contained.by.above])) {
#Cause a side effect where we've learned we don't care:
dat[check[(contained.by.above),rownum],use:="no"]
print(check)
return("yes")
} else {
return("no")
}
}
dat[,rownum:=.I]
dat[level==1,use:=make.pass(dat,start,end,level,use),by=rownum]
dat
dat[use=="maybe" & level==2,use:=make.pass(dat,start,end,level,use),by=rownum]
dat
dat[use=="maybe" & level==3,use:=make.pass(dat,start,end,level,use),by=rownum]
dat
#Finally correct for last level
dat[use=="maybe" & level==4,use:="yes"]
I wrote these last steps out so you can trace in your own interactive session to see what's happening (see the print to get an idea) but you can remove the print and also condense the last steps into something like lapply(1:dat[,max(level)-1], function(the.level) dat[use=="maybe" & level==the.level,use:=make.pass......]) In response to your comment if there are an arbitrary number of levels you will definitely want to use this formalism, and follow it with a final call to dat[use=="maybe" & level==max(level),use:="yes"].
Output:
> dat
level start end use rownum
1: 1 3.60353 1112.62000 no 1
2: 2 3.60353 20.35330 yes 2
3: 3 3.60353 8.77526 no 3
4: 2 72.03720 143.60700 no 4
5: 3 73.50530 101.13200 no 5
6: 4 73.50530 81.64660 yes 6
7: 4 92.19030 101.13200 yes 7
8: 3 121.28500 143.60700 yes 8
9: 4 121.28500 128.25900 no 9
10: 2 167.19700 185.04800 yes 10
11: 3 167.19700 183.44600 no 11
12: 4 167.19700 182.84600 no 12
13: 2 398.12300 418.64300 yes 13
14: 3 398.12300 418.64300 no 14
15: 2 445.83600 454.54500 yes 15
16: 2 776.59400 798.34800 yes 16
17: 3 776.59400 796.64700 no 17
18: 4 776.59400 795.91300 no 18
19: 2 906.68800 915.89700 yes 19
20: 3 906.68800 915.89700 no 20
21: 2 1099.44000 1112.62000 yes 21
22: 3 1099.44000 1112.62000 no 22
23: 4 1100.14000 1112.62000 no 23
level start end use rownum
On the off chance this is correct, the algorithm can roughly be described as follows:
Mark all the intervals as possible.
Start with a given level. Pick a particular interval (by=rownum) say called X. With X in mind, subset a copy of the data to all higher-level intervals.
Mark any of these that are contained in X as "contained in X".
If consecutive intervals at the same level are contained in X, X is no good b/c it wastes intervals. In this case label X's "use" variable as "no" so we'll never think about X again. [Note: if it's possible that non-consecutive intervals are contained in X, or that containing multiple intervals across levels could ruin X's viability, then this logic might need to be changed to count contained intervals instead of finding consecutive ones. I didn't think about this at all, but it's just occurring to me now, so use at your own risk.]
On the other hand, if X passed the test, then we've already established it's good. Mark it as a "yes." But importantly, we also have to mark any single interval contained in X as "no," or else when we iterate the step it will forget that it was contained inside a good interval and mark itself as "yes" as well. This is the side effect step.
Now, iterate, ignoring any results that we've already determined.
Finally any "maybe"s leftover at the highest level are automatically in.
Let me know what you think of this--this is a rough draft and some aspects might not be correct.

Adding all values of a variable in R [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 7 years ago.
I don't know how to word the title exactly, so I will just do my best to explain below... Sorry in advance for the .csv format.
I have the following example dataset:
print(data)
ID Tag Flowers
1 1 6871 1
2 2 6750 1
3 3 6859 1
4 4 6767 1
5 5 6747 1
6 6 6261 1
7 7 6750 1
8 8 6767 1
9 9 6812 1
10 10 6746 1
11 11 6496 4
12 12 6497 1
13 13 6495 4
14 14 6481 1
15 15 6485 1
Notice that in Lines 2 and 7, the tag 6750 appears twice. I observed one flower on plant number 6750 on two separate days, equaling two flowers in its lifetime. Basically, I want to add every flower that occurs for tag 6750, tag 6767, etc throughout ~100 rows. Each tag appears more than once, usually around 4 or 5 times.
I feel like I need to apply the unlist function here, but I'm a little bit lost as to how I should do so.
Without any extra packages, you can use function aggregate():
res<-aggregate(data$Flowers, list(data$Tag), sum)
This calculates a sum of the values in Flowers column for every value in the Tag column.

Short(er) notation of selecting a part of a data.frame or other objects in R

I always get angry at my R code when I have to process dataframes, i.e. filtering out certain rows. The code gets very illegible as I tend to choose meaningful, but long, names for my objects. An example:
all.mutations.extra.large.name <- read.delim(filename)
head(all.mutations.extra.large.name)
id gene pos aa consequence V
ENSG00000105732 ZN574_HUMAN 81 x/N missense_variant 3
ENSG00000125879 OTOR_HUMAN 7 V/3 missense_variant 2
ENSG00000129194 SOX15_HUMAN 20 N/T missense_variant 3
ENSG00000099204 ABLM1_HUMAN 33 H/R missense_variant 2
ENSG00000103335 PIEZ1_HUMAN 11 Q/R missense_variant 3
ENSG00000171533 MAP6_HUMAN 39 A/G missense_variant 3
all.mutations.extra.large.name <- all.mutations.extra.large.name[which(all.mutations.extra.large.name$gene == ZN574_HUMAN)]
So in order to kick out all other lines in which I am not interested I need to reference 3 times the object all.mutations.extra.large.name. And reating this kind of step for different columns makes the code really difficult to understand.
Therefore my question: Is there a way to filter out rows by a criterion without referencing the object 3 times. Something like this would be beautiful: myobj[,gene=="ZN574_HUMAN"]
You can use subset for that:
subset(all.mutations.extra.large.name, gene == "ZN574_HUMAN")
Several options:
all.mutations.extra.large.name <- data.frame(a=1:5, b=2:6)
within(all.mutations.extra.large.name, a[a < 3] <- 0)
a b
1 0 2
2 0 3
3 3 4
4 4 5
5 5 6
transform(all.mutations.extra.large.name, b = b^2)
a b
1 1 4
2 2 9
3 3 16
4 4 25
5 5 36
Also check ?attach if you would like to avoid repetitive typing like all.mutations.extra.large.name$foo.

Resources