I am using R and RStudio.
I have the following data frame:
ID TIME DV GpH SIpH GTT SITT
1 0 0 1.4 7.1 1.8 3.5
1 1 0.5 1.4 7.1 1.8 3.5
1 2 2 1.4 7.1 1.8 3.5
1 3 5 1.4 7.1 1.8 3.5
2 0 0 1.5 7.5 0.7 2.5
2 1 0.5 1.5 7.5 0.7 2.5
2 2 2 1.5 7.5 0.7 2.5
2 3 5 1.5 7.5 0.7 2.5
I want to add a pH column to the data frame such that:
1) If TIME is less than GTT for a subject, then pH = GpH for that subject.
2) If TIME is greater than GTT and less than GTT + SITT, then pH = SIpH for that subject.
3) If TIME is greater than GTT + SITT, then pH = 6.
How can I do this quickly in R?
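You can try this, assuming that your data frame is stored as df1. The comparisons are strict, so a row where TIME exactly equals GTT or GTT + SITT matches no condition and gets pH 0; no such row occurs in the sample data: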
df1$pH <- with(df1, (TIME < GTT) * GpH + (TIME > GTT & TIME < (GTT + SITT)) * SIpH + (TIME > (GTT + SITT)) * 6)
#> df1
# ID TIME DV GpH SIpH GTT SITT pH
#1 1 0 0.0 1.4 7.1 1.8 3.5 1.4
#2 1 1 0.5 1.4 7.1 1.8 3.5 1.4
#3 1 2 2.0 1.4 7.1 1.8 3.5 7.1
#4 1 3 5.0 1.4 7.1 1.8 3.5 7.1
#5 2 0 0.0 1.5 7.5 0.7 2.5 1.5
#6 2 1 0.5 1.5 7.5 0.7 2.5 7.5
#7 2 2 2.0 1.5 7.5 0.7 2.5 7.5
#8 2 3 5.0 1.5 7.5 0.7 2.5 7.5
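If ties can occur in the real data, one option is to make the intervals half-open. The sketch below is only an illustration: the three-row data frame is hypothetical, and assigning a tie to the later phase (SIpH at TIME == GTT, 6 at TIME == GTT + SITT) is an assumption the question does not settle.

```r
# Hypothetical rows using the question's column names; the middle row has TIME == GTT
df1 <- data.frame(
  ID   = c(1, 1, 1),
  TIME = c(1, 1.8, 6),
  GpH  = 1.4, SIpH = 7.1, GTT = 1.8, SITT = 3.5
)

# Strict comparisons: the tie row matches no condition, so its pH is 0
strict <- with(df1, (TIME < GTT) * GpH +
                    (TIME > GTT & TIME < GTT + SITT) * SIpH +
                    (TIME > GTT + SITT) * 6)

# Half-open intervals [0, GTT), [GTT, GTT + SITT), [GTT + SITT, Inf)
# (assumption: a tie belongs to the later phase)
closed <- with(df1, (TIME < GTT) * GpH +
                    (TIME >= GTT & TIME < GTT + SITT) * SIpH +
                    (TIME >= GTT + SITT) * 6)

strict
closed
```

With the half-open version every row falls into exactly one interval, so the three terms never overlap and never all vanish.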
You can try a nesting of ifelse:
"Creating" the data:
data <- read.csv(head=TRUE, text =
"ID,TIME,DV,GpH,SIpH,GTT,SITT
1,0,0,1.4,7.1,1.8,3.5
1,1,0.5,1.4,7.1,1.8,3.5
1,2,2,1.4,7.1,1.8,3.5
1,3,5,1.4,7.1,1.8,3.5
2,0,0,1.5,7.5,0.7,2.5
2,1,0.5,1.5,7.5,0.7,2.5
2,2,2,1.5,7.5,0.7,2.5
2,3,5,1.5,7.5,0.7,2.5")
Adding ph
data$ph <- ifelse(
  data$TIME < data$GTT,
  data$GpH,
  ifelse(
    data$TIME > data$GTT & data$TIME < data$GTT + data$SITT,
    data$SIpH,
    6
  )
)
Printing the result
data
ID TIME DV GpH SIpH GTT SITT ph
1 1 0 0.0 1.4 7.1 1.8 3.5 1.4
2 1 1 0.5 1.4 7.1 1.8 3.5 1.4
3 1 2 2.0 1.4 7.1 1.8 3.5 7.1
4 1 3 5.0 1.4 7.1 1.8 3.5 7.1
5 2 0 0.0 1.5 7.5 0.7 2.5 1.5
6 2 1 0.5 1.5 7.5 0.7 2.5 7.5
7 2 2 2.0 1.5 7.5 0.7 2.5 7.5
8 2 3 5.0 1.5 7.5 0.7 2.5 7.5
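The same three-way rule can also be written with dplyr's case_when, which some find easier to read than nested ifelse calls. This is a sketch, not a replacement for the answer above: the data frame is rebuilt from the question's values (DV omitted), and because case_when takes the first matching condition, a row with TIME equal to GTT would get SIpH here rather than 6.

```r
library(dplyr)

# Sample data from the question (DV column left out for brevity)
data <- data.frame(
  ID   = rep(1:2, each = 4),
  TIME = rep(0:3, times = 2),
  GpH  = rep(c(1.4, 1.5), each = 4),
  SIpH = rep(c(7.1, 7.5), each = 4),
  GTT  = rep(c(1.8, 0.7), each = 4),
  SITT = rep(c(3.5, 2.5), each = 4)
)

# Conditions are checked in order; the first one that is TRUE wins
data <- data %>%
  mutate(ph = case_when(
    TIME < GTT        ~ GpH,
    TIME < GTT + SITT ~ SIpH,
    TRUE              ~ 6
  ))

data$ph
```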
I have a data.table with millions of rows in the following format.
There are multi-year results for each ID, but I only know the day of the year, which runs from 1 to 365 or 366. I don't know the month or the year, but I do know the date of the first row (e.g. 1995/1/1).
ID DAY ATRR1 ATRR2
1 1 0.2 0.4
2 1 1.2 0.5
3 1 0.8 1.4
1 2 1.3 1.5
2 2 2.3 0.3
3 2 1.7 1.3
1 3 1.5 1.4
2 3 2.1 1.3
3 3 1.2 0.3
...
1 365 1.5 1.4
2 365 2.1 1.3
3 365 1.2 0.3
1 1 1.5 1.4
2 1 2.1 1.3
3 1 1.2 0.3
1 2 1.3 1.5
2 2 2.3 0.3
3 2 1.7 1.3
...
I would like to add a DATE column adding one day at each change in the DAY column, so the result would be:
ID DAY ATRR1 ATRR2 DATE
1 1 0.2 0.4 1995/1/1
2 1 1.2 0.5 1995/1/1
3 1 0.8 1.4 1995/1/1
1 2 1.3 1.5 1995/1/2
2 2 2.3 0.3 1995/1/2
3 2 1.7 1.3 1995/1/2
1 3 1.5 1.4 1995/1/3
2 3 2.1 1.3 1995/1/3
3 3 1.2 0.3 1995/1/3
...
1 365 1.5 1.4 1995/12/31
2 365 2.1 1.3 1995/12/31
3 365 1.2 0.3 1995/12/31
1 1 1.5 1.4 1996/1/1
2 1 2.1 1.3 1996/1/1
3 1 1.2 0.3 1996/1/1
1 2 1.3 1.5 1996/1/2
2 2 2.3 0.3 1996/1/2
3 2 1.7 1.3 1996/1/2
...
How would it be possible to do that?
You can simply do this, where x is a running count of days since the start (built below as seq):
as.Date(x, origin="1994-12-31")
My assumption here is that your dates have no gaps and are arranged as described in the question; otherwise this will produce wrong results.
Sample data:
df <- data.frame(Day = rep(c(1:365,1:2),each=3))
Create a run index with rleid (run-length encoding id), which increases by one each time Day changes:
df$seq <- data.table::rleid(df$Day)
df$date <- as.Date(df$seq, origin="1994-12-31") #final answer
tail(df,8)
Let me know if this is what you expect.
Sample Output:
> tail(df,8)
Day seq date
1094 365 365 1995-12-31
1095 365 365 1995-12-31
1096 1 366 1996-01-01
1097 1 366 1996-01-01
1098 1 366 1996-01-01
1099 2 367 1996-01-02
1100 2 367 1996-01-02
1101 2 367 1996-01-02
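For readers who prefer not to load data.table just for rleid, the run index can be sketched in base R with diff and cumsum. The sample data is the same made-up layout as above (three rows per day), and the same no-gaps assumption applies.

```r
# Same hypothetical sample as above: three IDs per day, one full year plus two days
df <- data.frame(Day = rep(c(1:365, 1:2), each = 3))

# Run index: starts at 1 and increases by 1 whenever Day differs from the previous row
run_id <- cumsum(c(TRUE, diff(df$Day) != 0))

# The first run corresponds to the known start date, 1995-01-01
df$date <- as.Date("1995-01-01") + (run_id - 1)

tail(df, 8)
```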
Date gaps are no problem for this solution:
library(data.table)
library(lubridate)
library(magrittr)
read.table(text = "
ID DAY ATRR1 ATRR2
1 1 0.2 0.4
2 1 1.2 0.5
3 1 0.8 1.4
1 2 1.3 1.5
2 2 2.3 0.3
3 2 1.7 1.3
1 3 1.5 1.4
2 3 2.1 1.3
3 3 1.2 0.3
1 365 1.5 1.4
2 365 2.1 1.3
3 365 1.2 0.3
1 1 1.5 1.4
2 1 2.1 1.3
3 1 1.2 0.3
1 2 1.3 1.5
2 2 2.3 0.3
3 2 1.7 1.3", header = T) %>% setDT -> x
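# Build each date as 1 January of the row's year plus DAY - 1, so leap years stay aligned
x[, date := {
  year_idx   = cumsum(c(0, diff(DAY)) < 0)   # +1 each time DAY wraps back to a smaller value
  year_start = as.Date("1995-01-01") + years(year_idx)
  year_start + DAY - 1
}]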
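The wrap-counting idea can also be sketched in base R. The DAY vector below is hypothetical and deliberately has gaps; the sketch assumes that every year contributes at least one row, since a fully missing year cannot be detected from DAY alone. Building each date from 1 January of its own year means day 60 lands on 1 March in 1995 but on 29 February in 1996.

```r
# Hypothetical input with gaps: days 59, 60 and 365 of 1995, then days 59 and 60 of 1996
DAY <- c(59, 60, 365, 59, 60)

# Year offset: increases by 1 each time DAY drops below the previous value
year_offset <- cumsum(c(0, diff(DAY)) < 0)

# 1 January of each row's year, then add the day of the year
year_start <- as.Date(paste0(1995 + year_offset, "-01-01"))
date <- year_start + DAY - 1

format(date)
```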
I have a distance matrix like this:
1 2 3 4 5
A 0.1 0.2 0.3 0.5 0.6
B 0.7 0.8 0.9 1 1.1
C 1.2 1.3 1.4 1.5 1.6
D 1.7 1.8 1.9 2 2.1
E 2.2 2.3 2.4 2.5 2.6
and now I want to create a lower triangular matrix like this:
1 2 3 4 5 A B C D E
1 0
2 0.1 0
3 0.2 0.1 0
4 0.4 0.3 0.2 0
5 0.5 0.4 0.3 0.1 0
A 0.1 0.2 0.3 0.5 0.6 0
B 0.7 0.8 0.9 1 1.1 0.6 0
C 1.2 1.3 1.4 1.5 1.6 1.1 0.5 0
D 1.7 1.8 1.9 2 2.1 1.6 1 0.5 0
E 2.2 2.3 2.4 2.5 2.6 2.1 1.5 1 0.5 0
To get the distance between 1 and 2, I subtracted the value for 1 from the value for 2 in the first table (0.2 - 0.1 = 0.1), and did the same for the remaining entries; I am not sure whether this is correct. From those values I built the lower triangular matrix. I tried this in R:
x <- read.csv("AD2.csv", head = FALSE, sep = ",")
b<-lower.tri(b, diag = FALSE)
but the output contains only TRUE and FALSE rather than a distance matrix.
Can anyone help me solve this problem? Here is a link to my example data.
You can use dist to compute the two square sub-matrices: distances among the columns (from the first row) and distances among the rows (from the first column). Use cbind to build the top and bottom halves, rbind to stack them, and then set the upper triangle to NA to get the desired layout.
mat <- rbind(
  cbind(as.matrix(dist(tbl[1,])), t(tbl)),   # t(tbl) gives this block the A-E column names
  cbind(tbl, as.matrix(dist(tbl[,1])))
)
mat[upper.tri(mat, diag=FALSE)] <- NA
mat
Hope it helps.
data:
tbl <- as.matrix(read.table(text="1 2 3 4 5
A 0.1 0.2 0.3 0.5 0.6
B 0.7 0.8 0.9 1 1.1
C 1.2 1.3 1.4 1.5 1.6
D 1.7 1.8 1.9 2 2.1
E 2.2 2.3 2.4 2.5 2.6", header=TRUE, check.names=FALSE, row.names=1))
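On the TRUE/FALSE output in the question: lower.tri and upper.tri return a logical mask rather than the values themselves, so the mask has to be used as an index. A small sketch on a made-up three-element vector (the names a, b, c are placeholders):

```r
# Pairwise absolute differences of a small named vector, as a full square matrix
d <- as.matrix(dist(c(a = 0.1, b = 0.2, c = 0.5)))

# lower.tri()/upper.tri() only say WHERE the triangle is (a logical matrix)
mask <- upper.tri(d, diag = FALSE)

# Use the mask as an index to blank out the upper triangle
d[mask] <- NA
d
```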
Here is my repeated measurements dataframe
subject StartTime_month StopTime_month ...
1 0.0 0.5
1 0.5 1.0
1 1.0 3.0
1 3.0 6.0
1 6.0 9.6
1 9.6 12.1
2 0.0 0.5
2 0.5 1.0
2 1.0 1.9
2 1.9 3.2
2 3.2 6.2
2 6.2 8.2
I would like to select, for each subject, the first row with StopTime_month > 6.0.
We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(df1), group by 'subject', get the row index (.I) of the first row where 'StopTime_month' is greater than 6, and use those indices to subset the rows:
library(data.table)
setDT(df1)[df1[, .I[which(StopTime_month > 6)[1]], by = subject]$V1]
# subject StartTime_month StopTime_month
#1: 1 6.0 9.6
#2: 2 3.2 6.2
If instead we need all the rows up to and including the first instance of 'StopTime_month' greater than 6:
setDT(df1)[, .SD[cumsum(StopTime_month > 6)<2], by = subject]
# subject StartTime_month StopTime_month
# 1: 1 0.0 0.5
# 2: 1 0.5 1.0
# 3: 1 1.0 3.0
# 4: 1 3.0 6.0
# 5: 1 6.0 9.6
# 6: 2 0.0 0.5
# 7: 2 0.5 1.0
# 8: 2 1.0 1.9
# 9: 2 1.9 3.2
#10: 2 3.2 6.2
Or using dplyr
library(dplyr)
df1 %>%
  filter(StopTime_month > 6) %>%
  group_by(subject) %>%
  slice(1L)
# subject StartTime_month StopTime_month
# <int> <dbl> <dbl>
#1 1 6.0 9.6
#2 2 3.2 6.2
With base R aggregate
aggregate(.~subject, df[df$StopTime_month > 6, ], function(x) x[1])
# subject StartTime_month StopTime_month
#1 1 6.0 9.6
#2 2 3.2 6.2
A base R solution:
For subject 1:
df[df$subject==1 & df$StopTime_month > 6,][1,]
For subject 2:
df[df$subject==2 & df$StopTime_month > 6,][1,]
(where df is your dataframe)
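To avoid writing one line per subject, the filter can be combined with duplicated, which flags every repeat of a subject after its first occurrence. A sketch using the question's values, assuming the rows are already sorted by time within each subject:

```r
# Data from the question
df <- data.frame(
  subject         = rep(1:2, each = 6),
  StartTime_month = c(0, 0.5, 1, 3, 6, 9.6, 0, 0.5, 1, 1.9, 3.2, 6.2),
  StopTime_month  = c(0.5, 1, 3, 6, 9.6, 12.1, 0.5, 1, 1.9, 3.2, 6.2, 8.2)
)

# Keep only rows past month 6, then keep the first such row of each subject
late  <- df[df$StopTime_month > 6, ]
first <- late[!duplicated(late$subject), ]
first
```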
I am trying to aggregate (sum) values across months and hours and re-arrange the summed data so that hour and month are on different "axes". I would like the hour to be column headers and the month to be row headers with summed values in each cell. Here's what I mean, through a dummy data example (obviously 12 months are present and 24 hours in the real data):
Month <- c(1,1,2,2,3,3,3,4,4,4,5,5,5,5,6,7,8,9,10,11,12)
Hour <- c(4,1,3,2,5,5,1,4,3,6,0,0,2,3,1,2,3,4,5,6,2)
Value <- c(0.1,0.4,0.02,0.1,0.1,0.2,0.02,0.01,0.01,0.02,0.1,0.3,0.2,0.1,0.2, 0.1,0.3,0.1,0.01,0.01,0.1)
z <- data.frame(Month, Hour, Value)
head(z)
Month Hour Value
1 4 0.10
1 1 0.40
2 3 0.02
2 2 0.10
3 5 0.10
3 5 0.20
My desired output, Hour = column headers (there will be 24 total, this just shows first 6 hours), Month = row headers (there will be 12 total)
z
0 1 2 3 4 5 6
1 0.3 0.2 0.1 0.7 0.1 1.1 0.7
2 0.1 0.1 0.8 1.7 0.2 0.1 0.6
3 0.2 0.7 0.1 0.4 2.1 1.3 0.1
4 0.1 0.2 0.2 0.1 3.1 0.1 0.7
5 0.7 0.8 1.2 0.2 0.4 0.1 0.2
6 0.5 0.2 3.0 0.8 0.2 5.1 1.2
7 0.5 0.2 3.0 0.8 0.2 5.1 1.2
8 0.5 0.2 3.0 0.8 0.2 5.1 1.2
9 0.5 0.2 3.0 0.8 0.2 5.1 1.2
10 0.5 0.2 3.0 0.8 0.2 5.1 1.2
11 0.5 0.2 3.0 0.8 0.2 5.1 1.2
12 0.5 0.2 3.0 0.8 0.2 5.1 1.2
We can use xtabs to create a contingency table whose cells hold the summed Value for each Month/Hour combination:
xtabs(Value ~ Month + Hour, data = z)
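As a quick check on the dummy data from the question, the sketch below builds the table and converts it to a plain data frame. Combinations that never occur in the data come out as 0; the names tab and wide are just placeholders.

```r
# Dummy data from the question
Month <- c(1,1,2,2,3,3,3,4,4,4,5,5,5,5,6,7,8,9,10,11,12)
Hour  <- c(4,1,3,2,5,5,1,4,3,6,0,0,2,3,1,2,3,4,5,6,2)
Value <- c(0.1,0.4,0.02,0.1,0.1,0.2,0.02,0.01,0.01,0.02,0.1,0.3,0.2,0.1,0.2,
           0.1,0.3,0.1,0.01,0.01,0.1)
z <- data.frame(Month, Hour, Value)

# Rows = Month, columns = Hour, cells = sum of Value
tab <- xtabs(Value ~ Month + Hour, data = z)

# Month 3 has two rows at Hour 5 (0.1 and 0.2), which are summed
tab["3", "5"]

# Convert to an ordinary data frame with one column per hour
wide <- as.data.frame.matrix(tab)
dim(wide)
```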
I have a data frame like this:
A B value
1 1 0.123
2 1 0.213
3 1 0.543
1 2 0.313
2 2 0.123
3 2 0.412
I want to write a function that shifts the rows of this data frame by a given number of positions. For example,
if the shift value is 1, the data frame becomes:
A B value
3 2 0.412
1 1 0.123
2 1 0.213
3 1 0.543
1 2 0.313
2 2 0.123
etc...
The function should look like this:
shift<-function(dataframe,shiftvalue)
Is there a simple way to do this in R without a lot of loops?
You can do it many ways, but one way is to use head and tail:
df <- data.frame(a=1:10, b = 11:20)
shift <- function(d, k) rbind( tail(d,k), head(d,-k), deparse.level = 0 )
> shift(df,3)
a b
8 8 18
9 9 19
10 10 20
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
I prefer plain old modulo ;-)
shift<-function(df,offset) df[((1:nrow(df))-1-offset)%%nrow(df)+1,]
It is pretty straightforward; the only quirk is R's 1-based indexing, which is why 1 is subtracted before the modulo and added back afterwards. It also works for offsets like 0, -7 or 7*nrow(df).
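A short check of the modulo version on the data frame from the question, written out a little more verbosely; the behaviour is intended to be the same as the one-liner, with drop = FALSE added so a single-column data frame stays a data frame.

```r
# Modulo-based circular shift of data frame rows
shift <- function(df, offset) {
  n <- nrow(df)
  # Convert to 0-based positions, move by offset, wrap with %%, convert back to 1-based
  df[(seq_len(n) - 1 - offset) %% n + 1, , drop = FALSE]
}

# Data frame from the question
df <- data.frame(
  A     = c(1, 2, 3, 1, 2, 3),
  B     = c(1, 1, 1, 2, 2, 2),
  value = c(0.123, 0.213, 0.543, 0.313, 0.123, 0.412)
)

rownames(shift(df, 1))    # last row moves to the top
rownames(shift(df, -1))   # first row moves to the bottom
```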
Here is my implementation:
> shift <- function(df, sv = 1) df[c((sv+1):nrow(df), 1:sv),]
> head(shift(iris, 3))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
> tail(shift(iris, 3))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
Updated (the version above moves the first rows to the end, the opposite of what the question asks for):
> shift <- function(df, sv = 1) df[c((nrow(df)-sv+1):nrow(df), 1:(nrow(df)-sv)),]
> head(shift(iris, 3))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
> tail(shift(iris, 3))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
142 6.9 3.1 5.1 2.3 virginica
143 5.8 2.7 5.1 1.9 virginica
144 6.8 3.2 5.9 2.3 virginica
145 6.7 3.3 5.7 2.5 virginica
146 6.7 3.0 5.2 2.3 virginica
147 6.3 2.5 5.0 1.9 virginica
There's a shift function in taRifx that works on vectors. Applying it directly to a data frame coerces all columns to character if any are character, so we'll use a trick from plyr. I'll likely write a data.frame method for it soon:
dd <- data.frame(b = seq(4),
x = c("A", "D", "A", "C"), y = c('a','b','c','d'),
z = c(1, 1, 1, 2),stringsAsFactors=FALSE)
> dd
b x y z
1 1 A a 1
2 2 D b 1
3 3 A c 1
4 4 C d 2
library(taRifx)
library(plyr)
shift.data.frame <- colwise(shift)
> shift.data.frame(dd)
b x y z
1 2 D b 1
2 3 A c 1
3 4 C d 2
4 1 A a 1
> shift(dd,n=-1)
b x y z
1 4 C d 2
2 1 A a 1
3 2 D b 1
4 3 A c 1
> shift(dd,n=-1,wrap=FALSE)
b x y z
1 1 A a 1
2 2 D b 1
3 3 A c 1
> shift(dd,n=-1,wrap=FALSE,pad=TRUE)
b x y z
1 NA <NA> <NA> NA
2 1 A a 1
3 2 D b 1
4 3 A c 1
The advantage of shift is that it takes a bunch of options:
n can be positive or negative to wrap from left/right
wrap can be turned on or off
If wrap is turned off, pad can be turned on to pad with NAs so that the vector keeps its length
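To make the wrap and pad behaviour concrete, here is a base R sketch of a vector shift with similar semantics. shift_vec is a hypothetical helper written for illustration, not the taRifx implementation; it follows the sign convention seen in the output above (positive n moves the first elements to the end) and always pads with NA when wrap is off.

```r
# Hypothetical helper: shift a vector by n positions, wrapping or padding with NA
shift_vec <- function(x, n = 1, wrap = TRUE) {
  len <- length(x)
  if (len == 0) return(x)
  if (wrap) {
    # Circular shift via modulo on 0-based positions
    return(x[(seq_len(len) - 1 + n) %% len + 1])
  }
  # No wrap: positions that fall outside 1..len become NA
  idx <- seq_len(len) + n
  idx[idx < 1 | idx > len] <- NA
  x[idx]
}

shift_vec(1:4, 1)                  # first element moves to the end
shift_vec(1:4, -1, wrap = FALSE)   # NA is padded at the front
shift_vec(c("a", "b", "c"), -1)    # last element moves to the front
```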
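dplyr's lag() (and lead()) shift a vector without wrapping and pad with NA instead; see https://dplyr.tidyverse.org/reference/lead-lag.html
library(dplyr)  # needed: stats::lag() in base R behaves differently
lag(1:5, n = 1)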
#> [1] NA 1 2 3 4
lag(1:5, n = 2)
#> [1] NA NA 1 2 3