I have a data frame such as this:
> bp
Source: local data frame [6 x 4]
date amount accountId type
1 2015-06-11 101.2 1 a
2 2015-06-18 101.2 1 a
3 2015-06-24 101.2 1 b
4 2015-06-11 294.0 2 a
5 2015-06-18 48.0 2 a
6 2015-06-26 10.0 2 b
It has 3.4 million rows of data:
> nrow(bp)
[1] 3391874
I am trying to compute lagged differences of time in days as follows using dplyr:
bp <- bp %>% group_by(accountId) %>%
mutate(diff = as.numeric(date - lag(date)))
On my 8 GB MacBook, R crashes. On a 64 GB Linux server the code takes forever. Any ideas on how to fix this?
No idea what has gone wrong over your way, but with date as a proper Date object, everything goes very quickly over here:
Recreate some data:
dat <- read.table(text=" date amount accountId type
1 2015-06-11 101.2 1 a
2 2015-06-18 101.2 1 a
3 2015-06-24 101.2 1 b
4 2015-06-11 294.0 2 a
5 2015-06-18 48.0 2 a
6 2015-06-26 10.0 2 b",header=TRUE)
dat$date <- as.Date(dat$date)
Then run some analyses on 3.4M rows, 1000 groups:
set.seed(1)
dat2 <- dat[sample(rownames(dat),3.4e6,replace=TRUE),]
dat2$accountId <- sample(1:1000,3.4e6,replace=TRUE)
nrow(dat2)
#[1] 3400000
length(unique(dat2$accountId))
#[1] 1000
system.time({
dat2 <- dat2 %>% group_by(accountId) %>%
mutate(diff = as.numeric(date - lag(date)))
})
# user system elapsed
# 0.38 0.03 0.40
head(dat2[dat2$accountId==46,])
#Source: local data frame [6 x 6]
#Groups: accountId
#
# date amount accountId type diff
#1 2015-06-24 101.2 46 b NA
#2 2015-06-18 48.0 46 a -6
#3 2015-06-11 294.0 46 a -13
#4 2015-06-18 101.2 46 a 7
#5 2015-06-26 10.0 46 b 2
#6 2015-06-11 294.0 46 a 0
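For what it's worth, if dplyr were still too slow at this scale, the same grouped lag can be written in data.table; a minimal sketch, assuming date is already a proper Date column:

library(data.table)
setDT(dat2)                    # convert to data.table by reference (no copy)
dat2[, diff := as.numeric(date - shift(date)), by = accountId]

shift() is data.table's lag/lead helper (type = "lag" by default), and := adds the column in place without copying the 3.4M rows.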
I have data like this:
Name Group Heath BP PM
QW DE23 20 60 10
We Fw34 0.5 42 2.5
Sd Kl78 0.4 0.1 0.5
Op Ss14 43 45 96
I need to remove every row whose values are all less than 1.8.
I used the following commands:
data[colSums(data)>=1.8]
data[,colSums(data)>=1.8, drop=FALSE]
subset(data, select=colSums(data) >=1.8)
But I got an error like this: "Error in colSums(data) : 'x' must be numeric"
Expected output:
Name Group Heath BP PM
QW DE23 20 60 10
We Fw34 0.5 42 2.5
Op Ss14 43 45 96
You can use the following to select rows whose sum is >= 1.8:
data[rowSums(data[-1:-2])>=1.8,]
# Name Group Heath BP PM
#1 QW DE23 20.0 60 10.0
#2 We Fw34 0.5 42 2.5
#4 Op Ss14 43.0 45 96.0
or where any element in the row is >=1.8:
data[rowSums(data[-1:-2]>=1.8)>0,]
# Name Group Heath BP PM
#1 QW DE23 20.0 60 10.0
#2 We Fw34 0.5 42 2.5
#4 Op Ss14 43.0 45 96.0
data[-1:-2] selects the numeric columns (everything except the first two).
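If the non-numeric columns are not always the first two, you can pick the numeric columns by type rather than by position; a small base R sketch:

num <- sapply(data, is.numeric)           # which columns are numeric
data[rowSums(data[num] >= 1.8) > 0, ]     # keep rows with any numeric value >= 1.8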
Here is a tidyverse solution:
library(tidyverse)
df <- tibble::tribble(
~Name,~Group,~Heath,~BP,~PM,
"QW", "DE23",20,60,10,
"We", "Fw34",0.5,42,2.5,
"Sd", "Kl78",0.4,0.1,0.5,
"Op", "Ss14",43,45,96
)
df %>%
filter_if(is.numeric,any_vars(.>=1.8))
#> # A tibble: 3 x 5
#> Name Group Heath BP PM
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 QW DE23 20 60 10
#> 2 We Fw34 0.5 42 2.5
#> 3 Op Ss14 43 45 96
Created on 2020-12-07 by the reprex package (v0.3.0)
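Note that filter_if() has since been superseded; with dplyr 1.0.4 or later, a sketch of the equivalent uses if_any() with where():

df %>%
  filter(if_any(where(is.numeric), ~ .x >= 1.8))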
The easiest way is to use the filter() function from the dplyr package in combination with select_if() to automatically detect the numeric columns:
library(dplyr)
df <- data.frame(Name = c("QW", "We", "Sd", "Op"),
Group = c("DE23", "Fw34", "Kl78", "Ss14"),
Heath = c(20, 0.5, 0.4, 43),
BP = c(60, 42, 0.1, 45),
PM = c(10, 2.5, 0.5, 96))
df %>% filter(rowSums(select_if(., is.numeric)) >= 1.8)
Name Group Heath BP PM
1 QW DE23 20.0 60 10.0
2 We Fw34 0.5 42 2.5
3 Op Ss14 43.0 45 96.0
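select_if() is likewise superseded in recent dplyr; a sketch of the same row-sum filter with across() and where() (dplyr >= 1.0.0):

df %>% filter(rowSums(across(where(is.numeric))) >= 1.8)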
An option with Reduce() from base R, which ORs together the columnwise >= 1.8 comparisons:
df[Reduce(`|`, lapply(df[-(1:2)], `>=`, 1.8)),]
# Name Group Heath BP PM
#1 QW DE23 20.0 60 10.0
#2 We Fw34 0.5 42 2.5
#4 Op Ss14 43.0 45 96.0
The data frame df1 summarizes water temperature at different depths (T5m, T15m, T25m, T35m) for every hour (Datetime). An example data frame:
df1<- data.frame(Datetime=c("2016-08-12 12:00:00","2016-08-12 13:00:00","2016-08-12 14:00:00","2016-08-12 15:00:00","2016-08-13 12:00:00","2016-08-13 13:00:00","2016-08-13 14:00:00","2016-08-13 15:00:00"),
T5m= c(10,20,20,10,10,20,20,10),
T15m=c(10,20,10,20,10,20,10,20),
T25m=c(20,20,20,30,20,20,20,30),
T35m=c(20,20,10,10,20,20,10,10))
df1$Datetime<- as.POSIXct(df1$Datetime, format="%Y-%m-%d %H")
df1
Datetime T5m T15m T25m T35m
1 2016-08-12 12:00:00 10 10 20 20
2 2016-08-12 13:00:00 20 20 20 20
3 2016-08-12 14:00:00 20 10 20 10
4 2016-08-12 15:00:00 10 20 30 10
5 2016-08-13 12:00:00 10 10 20 20
6 2016-08-13 13:00:00 20 20 20 20
7 2016-08-13 14:00:00 20 10 20 10
8 2016-08-13 15:00:00 10 20 30 10
I would like to create a new data frame df2 containing, for each day, the average water temperature and its standard error, both for each depth interval and for the whole water column. I would expect something like this (I did the calculations by hand, so there might be some mistakes):
> df2
Date meanT5m meanT15m meanT25m meanT35m meanTotal seT5m seT15m seT25m seT35m seTotal
1 2016-08-12 15 15 22.5 15 16.875 2.88 2.88 2.5 2.88 1.29
2 2016-08-13 15 15 22.5 15 16.875 2.88 2.88 2.5 2.88 1.29
I am especially interested in knowing how to do it with data.table since I will work with huge data.frames and I think data.table is quite efficient.
For calculating the standard error I know the function std.error() from the package plotrix.
Update based on #chinsoon's comment
First transform your data frame into a data table:
library(data.table)
setDT(df1)
Create a total column:
df1[, total := rowSums(.SD), .SDcols = grep("T[0-9]+m", names(df1))][]
# Datetime T5m T15m T25m T35m total
# 1: 2016-08-12 12:00:00 10 10 20 20 60
# 2: 2016-08-12 13:00:00 20 20 20 20 80
# 3: 2016-08-12 14:00:00 20 10 20 10 60
# 4: 2016-08-12 15:00:00 10 20 30 10 70
# 5: 2016-08-13 12:00:00 10 10 20 20 60
# 6: 2016-08-13 13:00:00 20 20 20 20 80
# 7: 2016-08-13 14:00:00 20 10 20 10 60
# 8: 2016-08-13 15:00:00 10 20 30 10 70
Apply the functions per day:
library(lubridate)
(df3 <- df1[, as.list(unlist(lapply(.SD, function (x)
c(mean = mean(x), sem = sd(x) / sqrt(length(x)))))),
day(Datetime)])
# day T5m.mean T5m.sem T15m.mean T15m.sem T25m.mean T25m.sem T35m.mean
# 1: 12 15 2.886751 15 2.886751 22.5 2.5 15
# 2: 13 15 2.886751 15 2.886751 22.5 2.5 15
# T35m.sem total.mean total.sem
# 1: 2.886751 67.5 4.787136
# 2: 2.886751 67.5 4.787136
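One caveat: day() extracts only the day of the month, so for data spanning more than one month the grouping above would merge, say, August 12 with September 12. A sketch that groups by the calendar date instead (and needs no lubridate):

(df3 <- df1[, as.list(unlist(lapply(.SD, function (x)
  c(mean = mean(x), sem = sd(x) / sqrt(length(x)))))),
  by = .(Date = as.IDate(Datetime))])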
Here is one way using dplyr and tidyr, calculated in two parts:
library(dplyr)
library(tidyr)
df2 <- df1 %>%
mutate(Datetime = as.Date(Datetime)) %>%
gather(key, value, -Datetime) %>%
group_by(Datetime, key) %>%
summarise(se = plotrix::std.error(value),
mean = mean(value)) %>%
gather(total, value, -key, -Datetime)
bind_rows(df2, df2 %>%
group_by(Datetime, total) %>%
summarise(value = sum(value)) %>%
mutate(key = paste("total", c("mean", "se"), sep = "_"))) %>%
unite(key, key, total) %>%
spread(key, value)
# A tibble: 2 x 11
# Groups: Datetime [2]
# Datetime T15m_mean T15m_se T25m_mean T25m_se T35m_mean
# <date> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 2016-08-12 15 2.89 22.5 2.5 15
#2 2016-08-13 15 2.89 22.5 2.5 15
# … with 5 more variables: T35m_se <dbl>, T5m_mean <dbl>,
# T5m_se <dbl>, total_mean_mean <dbl>, total_se_se <dbl>
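Since tidyr 1.0.0, gather() and spread() are superseded by pivot_longer() and pivot_wider(); a minimal sketch of the per-depth summary with the newer verbs (the whole-column total would be bound on as above):

library(dplyr)
library(tidyr)
df1 %>%
  mutate(Date = as.Date(Datetime)) %>%
  pivot_longer(starts_with("T"), names_to = "depth") %>%
  group_by(Date, depth) %>%
  summarise(mean = mean(value),
            se = sd(value) / sqrt(n()),
            .groups = "drop") %>%
  pivot_wider(names_from = depth, values_from = c(mean, se))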
I’m a newbie in R.
I have two datasets, A and B.
A <- data.table::fread(
"
V1 DATE ID
1 7/16/11 a
2 2/18/09 b
3 3/25/08 c
")
B <- data.table::fread(
"
V1 DATE ID Value
1 2013-06-13 a 109
2 2017-08-22 a 86
3 2017-09-15 a 88
4 2008-11-05 a 78
5 2009-02-17 a 74
6 2009-03-09 a 84
7 2009-03-17 a 81
8 2009-04-14 a 57
9 2009-04-21 a 65
10 2009-05-12 a 54
11 2009-06-08 a 54
12 2009-08-27 a 68
13 2011-08-26 b 199
14 2011-12-07 b 174
15 2012-01-31 b 66
16 2012-02-15 b 58
17 2012-04-17 b 59
18 2012-12-21 b 78
19 2013-01-14 b 91
20 2014-03-12 b 74
21 2014-08-28 b 98
22 2014-10-18 b 112
23 2010-12-15 b 36
24 2011-08-26 b 199
25 2011-12-07 b 174
26 2012-01-31 b 66
27 2012-02-15 b 58
28 2012-04-17 b 59
29 2015-05-08 c 105
30 2006-03-27 c 69
31 2007-03-12 c 104
32 2007-11-09 c 63
33 2008-03-25 c 239
34 2008-04-04 c 446
35 2008-04-09 c 354
36 2008-04-10 c 365
37 2008-04-11 c 366
38 2008-04-18 c 273
39 2008-04-28 c 271
40 2008-05-06 c 262
41 2008-05-19 c 72
42 2008-05-24 c 86
43 2008-06-20 c 47
44 2008-07-10 c 46
45 2008-08-06 c 55
46 2008-09-01 c 58
47 2008-09-29 c 56
48 2008-10-30 c 53
49 2008-12-09 c 71
50 2008-12-18 c 63
51 2009-01-14 c 60
52 2009-02-21 c 58
53 2009-03-28 c 54
54 2009-04-29 c 56
55 2009-04-30 c 59
56 2009-06-23 c 64
57 2009-07-24 c 69
58 2009-08-17 c 73
59 2009-10-04 c 127
60 2009-11-26 c 289
61 2009-12-02 c 277
62 2009-12-08 c 230
")
I have been trying for weeks to use R to:
find the value in B where ID == A$ID and B$DATE is the closest date before, or the same date as, A$DATE;
The expected result is: ID=c, DATE=2008-03-25, Value=239
find the value in B where ID == A$ID and B$DATE is 14 days after A$DATE; if there is no date exactly 14 days after, find the closest later date's value (15, 16, or 17 days after A$DATE, and so on).
The expected result is: ID=c, DATE=2008-04-09, Value=354
Both questions can be answered using a rolling join from data.table.
However, there are two important steps in preparing the data.
The date strings need to be converted to class IDate (or Date) to allow for date arithmetic. (IDate uses an integer representation to save memory).
The data frames need to be coerced to data.table to enable the enhanced syntax. setDT() coerces a data frame or tibble to a data.table by reference, i.e., without copying.
BTW: The sample datasets provided by the OP were already data.tables as the OP had used the data.table::fread() function.
Data preparation:
library(data.table)
setDT(A)[, DATE := as.IDate(DATE, "%m/%d/%y")]
setDT(B)[, DATE := as.IDate(DATE)]
Now, we can apply the rolling join:
B[A, on = .(ID, DATE), roll = +Inf, .(ID, DATE, Value)]
ID DATE Value
1: a 2011-07-16 68
2: b 2009-02-18 NA
3: c 2008-03-25 239
The result can be verified by printing B in proper order with B[order(ID, DATE)]. The earliest date for ID == "b" in B is 2010-12-15, so there is no date in B on or before 2009-02-18.
Please, note that the value in the DATE column is the reference date A$DATE, not the matching B$DATE.
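If you would rather see the matching B$DATE, refer to it with the x. prefix (data.table exposes the joined tables' columns as x.* and i.*), just as the second query below does:

B[A, on = .(ID, DATE), roll = +Inf, .(ID, DATE = x.DATE, Value)]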
Edit after clarification of the expected result by the OP:
Also the second question can be solved by a rolling join but the code requires three modifications:
The reference dates A$DATE need to be shifted by 14 days later.
We need a backward rolling join because the OP wants to find the closest date in B on or after the shifted reference date.
According to OP's expected result the result should contain the matching B$DATE.
With the additional requirements we get:
B[A[, .(ID, DATE = DATE + 14)], on = .(ID, DATE), roll = -Inf, .(ID, DATE = x.DATE, Value)]
ID DATE Value
1: a 2013-06-13 109
2: b 2010-12-15 36
3: c 2008-04-09 354
A solution using dplyr:
q1 and q2 correspond to your two questions.
library(dplyr)
A$DATE <- as.Date(A$DATE,format = "%m/%d/%y")
B$DATE <- as.Date(B$DATE)
BA <- left_join(B,A, by= c("ID"="ID"))
q1 <- BA %>%
  filter(ID %in% A$ID) %>%
  filter(DATE.x <= DATE.y) %>%          # on or before the reference date
  group_by(ID) %>%
  arrange(desc(DATE.x)) %>%
  slice(1)
q2 <- BA %>%
  filter(ID %in% A$ID) %>%
  filter(as.numeric(DATE.x) - as.numeric(DATE.y) >= 14) %>%  # at least 14 days later
  group_by(ID) %>%
  arrange(DATE.x) %>%
  slice(1)
q1
#> # A tibble: 2 x 6
#> # Groups:   ID [2]
#>    V1.x DATE.x     ID    Value  V1.y DATE.y
#>   <int> <date>     <chr> <int> <int> <date>
#> 1    12 2009-08-27 a        68     1 2011-07-16
#> 2    33 2008-03-25 c       239     3 2008-03-25
q2
#> # A tibble: 3 x 6
#> # Groups:   ID [3]
#>    V1.x DATE.x     ID    Value  V1.y DATE.y
#>   <int> <date>     <chr> <int> <int> <date>
#> 1     1 2013-06-13 a       109     1 2011-07-16
#> 2    23 2010-12-15 b        36     2 2009-02-18
#> 3    35 2008-04-09 c       354     3 2008-03-25
No row is returned for ID b in q1 because B has no date on or before 2009-02-18.
So far I have not been able to find a suitable solution to my problem on Stack Overflow.
I would like to use dplyr to subtract a control value from my data. I need to subtract the control from data measured on the same date only. There are several dates contained within my data frame, and each date contains a different amount of data.
My data looks something like the table below, where 'F' rows are the samples which need modifying and 'AC' rows are the controls which will be subtracted.
Sample Tissue Date Result1 Result2
1 F 10-Jul 210 56.0
2 F 10-Jul 527 427.0
3 F 10-Jul 557 69.0
4 F 10-Jul 684 344.0
5 F 10-Jul 650 10.0
6 AC 10-Jul 200 10.0
7 F 12-Jul 676 65.0
8 F 12-Jul 520 70.0
9 F 12-Jul 595 730.0
10 AC 12-Jul 100 5.0
I imagine I need to use:
myData <- myData2 %>%
group_by(Date) %>%
From there I'm a bit confused; I've tried:
mutate(Result1 = Result1 - subset(myData$Result1, myData$Tissue=="AC"))
but with no real success. I imagine there's a simple solution out there, for which I would be very grateful!
And thus I would end up with data looking something like this:
Sample Tissue Date Result1 Result2
1 F 10-Jul 10 46.0
2 F 10-Jul 327 417.0
3 F 10-Jul 357 59.0
4 F 10-Jul 484 334.0
5 F 10-Jul 450 0.0
6 AC 10-Jul 200 10.0
7 F 12-Jul 576 60.0
8 F 12-Jul 420 65.0
9 F 12-Jul 495 725.0
10 AC 12-Jul 100 5.0
It would be useful if the function could calculate the difference for two results or more at once. Thanks in advance!
Edit:
I think I've found a solution with this code:
myData2 <- myData %>%
group_by(Date) %>%
mutate_at(vars(3:4),funs(.-.[Tissue=="AC"]))
Does my logic work here? Also, why do I need to take 1 off my column numbers to use the vars() function?
I seem to have solved it using this code:
myData2 <- myData %>%
group_by(Date) %>%
mutate_at(vars(3:4),funs(.-.[Tissue=="AC"]))
I liked the simplicity of this solution, but many thanks to the other respondents for taking the time to help me out. (As to the off-by-one: mutate_at() omits the grouping column, Date, when counting positions in vars(), so 3:4 refers to Result1 and Result2.)
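For reference, mutate_at() and funs() are superseded in dplyr 1.0.0 and later; a sketch of the same computation with across() (like the mutate_at() version, this also zeroes the AC rows themselves):

myData2 <- myData %>%
  group_by(Date) %>%
  mutate(across(Result1:Result2, ~ .x - .x[Tissue == "AC"]))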
df = read.table(text = "
Sample Tissue Date Result1 Result2
1 F 10-Jul 210 56.0
2 F 10-Jul 527 427.0
3 F 10-Jul 557 69.0
4 F 10-Jul 684 344.0
5 F 10-Jul 650 10.0
6 AC 10-Jul 200 10.0
7 F 12-Jul 676 65.0
8 F 12-Jul 520 70.0
9 F 12-Jul 595 730.0
10 AC 12-Jul 100 5.0
", stringsAsFactors=F, header=T)
library(dplyr)
df %>%
group_by(Date) %>% # for each date
mutate(control1 = Result1[Tissue == "AC"], # calculate control values
control2 = Result2[Tissue == "AC"]) %>%
ungroup() %>% # forget about the grouping
mutate(Result1 = ifelse(Tissue == "F", Result1 - control1, Result1), # update result values only for rows with tissue = F
Result2 = ifelse(Tissue == "F", Result2 - control2, Result2)) %>%
select(Sample:Result2) # select columns of interest
# # A tibble: 10 x 5
# Sample Tissue Date Result1 Result2
# <int> <chr> <chr> <int> <dbl>
# 1 1 F 10-Jul 10 46
# 2 2 F 10-Jul 327 417
# 3 3 F 10-Jul 357 59
# 4 4 F 10-Jul 484 334
# 5 5 F 10-Jul 450 0
# 6 6 AC 10-Jul 200 10
# 7 7 F 12-Jul 576 60
# 8 8 F 12-Jul 420 65
# 9 9 F 12-Jul 495 725
# 10 10 AC 12-Jul 100 5
The control columns above are only there to help you understand the process; you can instead do it directly:
df %>%
group_by(Date) %>%
mutate(Result1 = ifelse(Tissue == "F", Result1 - Result1[Tissue == "AC"], Result1),
Result2 = ifelse(Tissue == "F", Result2 - Result2[Tissue == "AC"], Result2)) %>%
ungroup()
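Both versions assume exactly one "AC" row per Date; if a date could carry several controls, a hedged variant is to subtract their mean instead:

df %>%
  group_by(Date) %>%
  mutate(Result1 = ifelse(Tissue == "F", Result1 - mean(Result1[Tissue == "AC"]), Result1),
         Result2 = ifelse(Tissue == "F", Result2 - mean(Result2[Tissue == "AC"]), Result2)) %>%
  ungroup()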
I have patient data that looks like this:
ID DATE DUR
82 29/08/2014 10.32
82 29/08/2014 0.32
82 12/09/2014 13.35
82 12/09/2014 0.16
82 12/09/2014 0.24
82 12/09/2014 0.31
82 22/12/2014 100.39
82 22/12/2014 0.1
219 31/11/2012 -300.32
219 31/11/2012 0.23
219 12/01/2013 80.20
219 12/01/2013 0.20
The first column is a patient ID, the second is a date and time (the time is not shown here but is in the data), and the third is just the duration difference (which I have been using to determine different admittances of patients). Each row is a check-up on the patient, but they may have come back at a later date (not within the same time frame).
Basically, what I want is to categorize each patient's number so that on their second admission their ID becomes "82a", on their third "82b", and so on. It wouldn't have to be alphabetic; it could be any such indicator. Some patients have as many as 50 different admissions (separate-occasion admissions). So after this I want it to look something like:
ID DATE DUR
82 29/08/2014 10.32
82 29/08/2014 0.32
82a 12/09/2014 13.35
82a 12/09/2014 0.16
82a 12/09/2014 0.24
82a 12/09/2014 0.31
82b 22/12/2014 100.39
82b 22/12/2014 0.1
219 31/11/2012 -300.32
219 31/11/2012 0.23
219a 12/01/2013 80.20
219a 12/01/2013 0.20
I have been working in Excel for the time being and at first used
=IF(AND(ABS(C3)>1,A3=A2),1,0)
just to indicate when an ID is repeated on a new admission date. Then I did this again to indicate the 3rd admission and began drawing out columns for the 4th, 5th, and 6th, planning to merge them. This is simply not an efficient solution, especially with a large data set. I am familiar with R and think it might be a better way to do this manipulation, but I am stuck on how to do this for the entire data set and to continually add a new "indicator" every time the same patient is admitted again. I am not even sure exactly how to tell the computer what to do, even in pseudocode. Perhaps something like this:
Pseudo-Code
-> Run through ID Column
-> IF Dur is > 1 (it will always be > 1 for a new admission)
ANDIF ID already exists above with DUR > 1 = a, or if DUR > 1 TWICE for
same ID = b, or if DUR > THREE TIMES = c, and so on....
Any help would be great
In R, you have a lot of options. Your data has issues, however; since November only has 30 days, converting the DATE column to an actual date format will introduce NAs. (You could, of course, just leave it as character, but date formats are easier to work with.)
With dplyr:
library(dplyr)
df %>% mutate(DATE = as.Date(DATE, '%d/%m/%Y')) %>% # parse date data
group_by(ID) %>% # group data by ID
mutate(visit = as.integer(factor(DATE))) # make an integer factor of DATE
# Source: local data frame [12 x 4]
# Groups: ID [2]
#
# ID DATE DUR visit
# (int) (date) (dbl) (int)
# 1 82 2014-08-29 10.32 1
# 2 82 2014-08-29 0.32 1
# 3 82 2014-09-12 13.35 2
# 4 82 2014-09-12 0.16 2
# 5 82 2014-09-12 0.24 2
# 6 82 2014-09-12 0.31 2
# 7 82 2014-12-22 100.39 3
# 8 82 2014-12-22 0.10 3
# 9 219 <NA> -300.32 NA
# 10 219 <NA> 0.23 NA
# 11 219 2013-01-12 80.20 1
# 12 219 2013-01-12 0.20 1
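If you specifically want the lettered IDs from your example, one sketch indexes into R's built-in letters vector (the ID2 column name is mine; this assumes at most 27 visits per patient and non-NA dates):

df %>%
  mutate(DATE = as.Date(DATE, '%d/%m/%Y')) %>%
  group_by(ID) %>%
  mutate(visit = as.integer(factor(DATE)),
         ID2 = paste0(ID, c("", letters)[visit]))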
Base R has a lot of options, including ave and tapply, but to keep it simple so you can see what happens step by step in a split-apply-combine model: split by the grouping variable, lapply across the list, and use do.call(rbind, ...) to reassemble:
df$DATE <- as.Date(df$DATE, '%d/%m/%Y')
df <- do.call(rbind, lapply(split(df, df$ID),
function(x){data.frame(x,
visit = as.integer(factor(x$DATE)))}))
rownames(df) <- NULL # delete useless rownames
df
# ID DATE DUR visit
# 1 82 2014-08-29 10.32 1
# 2 82 2014-08-29 0.32 1
# 3 82 2014-09-12 13.35 2
# 4 82 2014-09-12 0.16 2
# 5 82 2014-09-12 0.24 2
# 6 82 2014-09-12 0.31 2
# 7 82 2014-12-22 100.39 3
# 8 82 2014-12-22 0.10 3
# 9 219 <NA> -300.32 NA
# 10 219 <NA> 0.23 NA
# 11 219 2013-01-12 80.20 1
# 12 219 2013-01-12 0.20 1
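For completeness, the ave() route mentioned above condenses the visit computation to one line (a sketch, applied after the date conversion; NA dates stay NA):

df$visit <- as.integer(ave(as.numeric(df$DATE), df$ID,
                           FUN = function(x) as.integer(factor(x))))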