I want to calculate the time difference between two events. The first five columns give the date-time of the incident; the remaining five columns give the date-time of death.
dat <- read.table(header=TRUE, text="
YEAR MONTH DAY HOUR MINUTE D.YEAR D.MONTH D.DAY D.HOUR D.MINUTE
2013 1 6 0 55 2013 1 6 0 56
2013 2 3 21 24 2013 2 4 23 14
2013 1 6 11 45 2013 1 6 12 29
2013 3 6 12 25 2013 3 6 23 55
2013 4 6 18 28 2013 5 3 11 18
2013 4 8 14 31 2013 4 8 14 32")
dat
YEAR MONTH DAY HOUR MINUTE D.YEAR D.MONTH D.DAY D.HOUR D.MINUTE
2013 1 6 0 55 2013 1 6 0 56
2013 2 3 21 24 2013 2 4 23 14
2013 1 6 11 45 2013 1 6 12 29
2013 3 6 12 25 2013 3 6 23 55
2013 4 6 18 28 2013 5 3 11 18
2013 4 8 14 31 2013 4 8 14 32
I want to calculate the time difference in minutes. The following code is not getting me anywhere. A timestamp should look like 2013-04-06 04:08.
library(lubridate)
dat$tstamp1 <- mdy(paste(dat$YEAR, dat$MONTH, dat$DAY, dat$HOUR, dat$MINUTE,sep = "-"))
dat$tstamp2 <- mdy(paste(dat$D.YEAR, dat$D.MONTH, dat$D.DAY, dat$D.HOUR, dat$D.MINUTE, sep = "-"))
dat$diff <- dat$tstamp2 - dat$tstamp1 ### want the difference in minutes
In order to parse a date/time string of the "-"-separated format you're creating, you'll need to give a custom format, and pass it to parse_date_time. For example:
parse_date_time(paste(dat$D.YEAR, dat$D.MONTH, dat$D.DAY, dat$D.HOUR, dat$D.MINUTE, sep = "-"),
"%Y-%m-%d-%H-%M")
Your new code would therefore look like:
library(lubridate)
dat$tstamp1 <- parse_date_time(paste(dat$YEAR, dat$MONTH, dat$DAY, dat$HOUR, dat$MINUTE, sep = "-"),
"%Y-%m-%d-%H-%M")
dat$tstamp2 <- parse_date_time(paste(dat$D.YEAR, dat$D.MONTH, dat$D.DAY, dat$D.HOUR, dat$D.MINUTE, sep = "-"),
"%Y-%m-%d-%H-%M")
Then the following will get you the time difference in minutes (ask difftime for minutes explicitly; the units of a plain subtraction are chosen automatically based on the size of the gap):
dat$diff <- as.numeric(difftime(dat$tstamp2, dat$tstamp1, units = "mins"))
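If you'd rather skip the string round-trip entirely, base R's ISOdatetime() builds POSIXct timestamps straight from the numeric columns. A sketch reusing the first two rows of dat (assuming UTC, since the question gives no time zone):

```r
dat <- read.table(header = TRUE, text = "
YEAR MONTH DAY HOUR MINUTE D.YEAR D.MONTH D.DAY D.HOUR D.MINUTE
2013 1 6 0 55 2013 1 6 0 56
2013 2 3 21 24 2013 2 4 23 14")

# Build POSIXct timestamps directly from the numeric columns (seconds = 0)
dat$tstamp1 <- ISOdatetime(dat$YEAR, dat$MONTH, dat$DAY,
                           dat$HOUR, dat$MINUTE, 0, tz = "UTC")
dat$tstamp2 <- ISOdatetime(dat$D.YEAR, dat$D.MONTH, dat$D.DAY,
                           dat$D.HOUR, dat$D.MINUTE, 0, tz = "UTC")

# Request minutes explicitly; a plain subtraction picks its units automatically
dat$diff <- as.numeric(difftime(dat$tstamp2, dat$tstamp1, units = "mins"))
dat$diff
#> [1]    1 1550
```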
You can also try this with base R's strptime (no lubridate needed):
dat$tstamp1 <- strptime(paste(dat$YEAR, dat$MONTH, dat$DAY, dat$HOUR, dat$MINUTE, sep = "-"), "%Y-%m-%d-%H-%M")
dat$tstamp2 <- strptime(paste(dat$D.YEAR, dat$D.MONTH, dat$D.DAY, dat$D.HOUR, dat$D.MINUTE, sep = "-"),"%Y-%m-%d-%H-%M")
dat$diff <- difftime(as.POSIXct(dat$tstamp2), as.POSIXct(dat$tstamp1), units = "mins")
Using strptime is faster and a bit safer against unexpected data; see ?strptime for the details.
I have this data.frame called EXAMPLE, with 4 variables:
date <- c(2010, 2011, 2012, 2013)
new_york <- c(10,20,22,28)
berlin <- c(0,51,45,12)
tokyo <- c(2,15,20,13)
EXAMPLE <- data.frame(date, new_york, berlin, tokyo)
I want to identify, for each column, the position at which the running sum first reaches at least 50, and also to store that sum. For example, in the column new_york the cumulative sum reaches 52 at row 3.
I was thinking of something like the code below, but it didn't work:
x <- 1
while(sum(EXAMPLE$berlin[1:x]) <= 50) {
  a <- x
}
I appreciate if someone can help.
out <- lapply(EXAMPLE[,-1], cumsum)
names(out) <- paste0(names(out), "_cumulative")
options(width=123, length=99999)
cbind(EXAMPLE, out)
# date new_york berlin tokyo new_york_cumulative berlin_cumulative tokyo_cumulative
# 1 2010 10 0 2 10 0 2
# 2 2011 20 51 15 30 51 17
# 3 2012 22 45 20 52 96 37
# 4 2013 28 12 13 80 108 50
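If you also want the answer to the literal question (the first row at which each running total reaches 50, plus the total at that row), you can combine cumsum with which on top of the cumulative columns; a sketch:

```r
date <- c(2010, 2011, 2012, 2013)
new_york <- c(10, 20, 22, 28)
berlin <- c(0, 51, 45, 12)
tokyo <- c(2, 15, 20, 13)
EXAMPLE <- data.frame(date, new_york, berlin, tokyo)

first_cross <- sapply(EXAMPLE[-1], function(v) {
  cs <- cumsum(v)
  i <- which(cs >= 50)[1]   # first row reaching at least 50 (NA if never)
  c(row = i, total = cs[i])
})
first_cross
#>       new_york berlin tokyo
#> row          3      2     4
#> total       52     51    50
```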
Here's the equivalent tidy version of @r2evans' answer...
library(dplyr)
EXAMPLE %>%
mutate(across(new_york:tokyo,
cumsum,
.names = "cumsum_{.col}")
)
#> date new_york berlin tokyo cumsum_new_york cumsum_berlin cumsum_tokyo
#> 1 2010 10 0 2 10 0 2
#> 2 2011 20 51 15 30 51 17
#> 3 2012 22 45 20 52 96 37
#> 4 2013 28 12 13 80 108 50
Wasn't sure of the best way to word this, but I'd like to multiply/divide two columns by each other, lagged by one row (in my dataset this means var_y times var_x from the previous row).
The end result should be an additional column, with one NA value (for the first row, which has no previous year).
I'm having trouble indexing it, but I think it would go something along these lines...
e.g.
df <- data_frame(year = c(2010:2020), var_x = c(20:30), var_y = c(2:12))
#not correct
diff <- df[,2, 2:ncol(df)-1] * df[,3, 1:ncol(df)]
dplyr would look something like...
df %>%
mutate(forecast = (var_x * ncol(var_y)-1))
incorrect result:
# A tibble: 11 x 4
year var_x var_y forecast
<int> <int> <int> <int>
1 2010 20 2 40
2 2011 21 3 63
3 2012 22 4 88
4 2013 23 5 115
5 2014 24 6 144
6 2015 25 7 175
7 2016 26 8 208
8 2017 27 9 243
9 2018 28 10 280
10 2019 29 11 319
11 2020 30 12 360
Error in mutate_impl(.data, dots) :
Column `forecast` must be length 11 (the number of rows) or one, not 0
Thanks, your guidance is appreciated.
From recommended comment above:
df %>%
mutate(forecast = var_y * lag(var_x))
# A tibble: 11 x 4
year var_x var_y forecast
<int> <int> <int> <int>
1 2010 20 2 NA
2 2011 21 3 60
3 2012 22 4 84
4 2013 23 5 110
5 2014 24 6 138
6 2015 25 7 168
7 2016 26 8 200
8 2017 27 9 234
9 2018 28 10 270
10 2019 29 11 308
11 2020 30 12 348
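For reference, the same lagged product works in base R without dplyr; prepending NA and dropping the last element shifts var_x down one row (a sketch with the df from the question):

```r
df <- data.frame(year = 2010:2020, var_x = 20:30, var_y = 2:12)

# c(NA, head(var_x, -1)) is var_x shifted down one row, i.e. a lag of 1
df$forecast <- df$var_y * c(NA, head(df$var_x, -1))
head(df$forecast, 3)
#> [1] NA 60 84
```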
I have the following data frame, from which I would like to remove duplicate observations based on three criteria: the same x, the same y, and z >= 60.
df <- data.frame(x=c(1,1,2,2,3,3,4,4),
y=c(2011,2012,2011,2011,2013,2014,2011,2012),
z=c(15,15,60,60,15,15,30,15))
> df
x y z
1 1 2011 15
2 1 2012 15
3 2 2011 60
4 2 2011 60
5 3 2013 15
6 3 2014 15
7 4 2011 30
8 4 2012 15
The data frame I'm looking for is thus (which one of the x=2 observations is removed doesn't matter):
> df1
x y z
1 1 2011 15
2 1 2012 15
3 2 2011 60
4 3 2013 15
5 3 2014 15
6 4 2011 30
7 4 2012 15
My first thoughts included using unique or duplicated, but I cannot seem to understand how to implement them in practice.
This should do the trick. Look for duplicated x and y entries where z is also greater than or equal to 60:
df[!(duplicated(df[,1:2]) & df$z >= 60), ]
# x y z
#1 1 2011 15
#2 1 2012 15
#3 2 2011 60
#5 3 2013 15
#6 3 2014 15
#7 4 2011 30
#8 4 2012 15
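The same condition also translates directly to dplyr, if you prefer that style (a sketch; duplicated behaves identically on the selected columns):

```r
library(dplyr)

df <- data.frame(x = c(1, 1, 2, 2, 3, 3, 4, 4),
                 y = c(2011, 2012, 2011, 2011, 2013, 2014, 2011, 2012),
                 z = c(15, 15, 60, 60, 15, 15, 30, 15))

# Keep rows unless they repeat an earlier (x, y) pair AND have z >= 60
df1 <- df %>% filter(!(duplicated(cbind(x, y)) & z >= 60))
nrow(df1)
#> [1] 7
```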
Closed 9 years ago.
What finally worked was:
a <- cast(we, year ~ region, mean, value='response')
I only have one observation per region and site, though, so mean is just a workaround. I couldn't get c to work as a function.
Output for the suggested answer (by Justin):
DT
     response year
  1:       15 2000
  2:        6 2000
  3:       23 2000
  4:       23 2000
 ---
794:        3 2010
795:        5 2010
796:        1 2010
Update: the desired output should look like:
Year x1 x2 x3 x4
2000  4  5 16 22
2001  6 11  2 18
2002  1  0 21 10
...
I am struggling to find a way to transpose my data based on factor levels. I have data with 2 columns, a factor and a response. I have many rows for each factor, so I want to transpose the table such that each factor is on one row, with the different responses as a column in that row. I cannot seem to subset within a loop based on levels of that factor. I would appreciate any insight.
example of data:
response year
       5 2001
      10 2001
       8 2001
       1 2002
       7 2010
levels(data$year)
[1] "2000" "2001" "2002" "2003" "2004" "2005" ...
w <- matrix(0, 54, 15)
for(i in 1:levels(data$year)){
  w[i] <- levels(data$year) == i
}
This syntax is obviously not correct, but it is the idea of what I'm trying to accomplish.
Thank you.
Using the data.table package this is trivial:
library(data.table)
DT <- data.table(data)
DT[, as.list(response), by=year]
However, this will fall apart if you have different numbers of observations per year. Instead:
DT[, list(values = list(response)), by=year]
Or using base R:
tapply(data$response, data$year, c)
Here's another way, using aggregate:
> set.seed(1)
> data <- data.frame(year = rep(2000:2010, each=10), value = sample(3:30, 110, TRUE))
> aggregate(value~year, data=data, FUN=c)
year value.1 value.2 value.3 value.4 value.5 value.6 value.7 value.8 value.9 value.10
1 2000 10 13 19 28 8 28 29 21 20 4
2 2001 8 7 22 13 24 16 23 30 13 24
3 2002 29 8 21 6 10 13 3 13 27 12
4 2003 16 19 16 8 26 21 25 6 23 14
5 2004 25 21 24 18 17 25 3 16 23 22
6 2005 16 27 15 9 4 5 11 17 21 14
7 2006 28 11 15 12 21 10 16 24 5 27
8 2007 12 26 12 12 16 27 27 13 24 29
9 2008 15 22 14 12 24 8 22 6 9 7
10 2009 9 4 20 27 24 25 15 14 25 19
11 2010 21 12 10 30 20 8 6 16 28 19
If I had a different number of responses per year, I would probably approach this by first making a new variable to represent the response number within each year, and then casting that dataset wide using dcast. By default dcast fills in missing values with NA, although you can change that if needed.
set.seed(1)
data = data.frame(year = c(rep(2000:2010, each=10), 2011), value = sample(3:30, 111, TRUE))
require(reshape2)
require(plyr)
# Create a new variable representing the number of responses per year and add to dataset
dat2 = ddply(data, .(year), transform,
response = interaction("x", 1:length(value), sep = ""))
dcast(dat2, year ~ response, value.var = "value")
I want to reshape a wide format dataset that has multiple tests which are measured at 3 time points:
ID Test Year Fall Spring Winter
1 1 2008 15 16 19
1 1 2009 12 13 27
1 2 2008 22 22 24
1 2 2009 10 14 20
2 1 2008 12 13 25
2 1 2009 16 14 21
2 2 2008 13 11 29
2 2 2009 23 20 26
3 1 2008 11 12 22
3 1 2009 13 11 27
3 2 2008 17 12 23
3 2 2009 14 9 31
into a data set that separates the tests by column but converts the measurement time into long format, for each of the new columns like this:
ID Year Time Test1 Test2
1 2008 Fall 15 22
1 2008 Spring 16 22
1 2008 Winter 19 24
1 2009 Fall 12 10
1 2009 Spring 13 14
1 2009 Winter 27 20
2 2008 Fall 12 13
2 2008 Spring 13 11
2 2008 Winter 25 29
2 2009 Fall 16 23
2 2009 Spring 14 20
2 2009 Winter 21 26
3 2008 Fall 11 17
3 2008 Spring 12 12
3 2008 Winter 22 23
3 2009 Fall 13 14
3 2009 Spring 11 9
3 2009 Winter 27 31
I have unsuccessfully tried to use reshape and melt. Existing posts address transforming to a single-column outcome.
Using reshape2:
# Thanks to Ista for helping with direct naming using "variable.name"
df.m <- melt(df, id.var = c("ID", "Test", "Year"), variable.name = "Time")
df.m <- transform(df.m, Test = paste0("Test", Test))
dcast(df.m, ID + Year + Time ~ Test, value.var = "value")
Update: Using data.table melt/cast from versions >= 1.9.0:
data.table from version 1.9.0 imports the reshape2 package and implements fast melt and dcast methods in C for data.tables. A comparison of speed on bigger data is shown below.
For more info, see the data.table NEWS file.
require(data.table) ## ver. >=1.9.0
require(reshape2)
dt <- as.data.table(df, key=c("ID", "Test", "Year"))
dt.m <- melt(dt, id.var = c("ID", "Test", "Year"), variable.name = "Time")
dt.m[, Test := paste0("Test", Test)]
dcast.data.table(dt.m, ID + Year + Time ~ Test, value.var = "value")
At the moment, you'll have to write dcast.data.table explicitly, as it's not an S3 generic in reshape2 yet.
Benchmarking on bigger data:
# generate data:
set.seed(45L)
DT <- data.table(ID = sample(1e2, 1e7, TRUE),
Test = sample(1e3, 1e7, TRUE),
Year = sample(2008:2014, 1e7,TRUE),
Fall = sample(50, 1e7, TRUE),
Spring = sample(50, 1e7,TRUE),
Winter = sample(50, 1e7, TRUE))
DF <- as.data.frame(DT)
reshape2 timings:
reshape2_melt <- function(df) {
df.m <- melt(df, id.var = c("ID", "Test", "Year"), variable.name = "Time")
}
# min. of three consecutive runs
system.time(df.m <- reshape2_melt(DF))
# user system elapsed
# 43.319 4.909 48.932
df.m <- transform(df.m, Test = paste0("Test", Test))
reshape2_cast <- function(df) {
dcast(df.m, ID + Year + Time ~ Test, value.var = "value")
}
# min. of three consecutive runs
system.time(reshape2_cast(df.m))
# user system elapsed
# 57.728 9.712 69.573
data.table timings:
DT_melt <- function(dt) {
dt.m <- melt(dt, id.var = c("ID", "Test", "Year"), variable.name = "Time")
}
# min. of three consecutive runs
system.time(dt.m <- DT_melt(DT))
# user system elapsed
# 0.276 0.001 0.279
dt.m[, Test := paste0("Test", Test)]
DT_cast <- function(dt) {
dcast.data.table(dt.m, ID + Year + Time ~ Test, value.var = "value")
}
# min. of three consecutive runs
system.time(DT_cast(dt.m))
# user system elapsed
# 12.732 0.825 14.006
melt.data.table is ~175x faster than reshape2::melt, and dcast.data.table is ~5x faster than reshape2::dcast.
Sticking with base R, this is another good candidate for the "stack + reshape" routine. Assuming our dataset is called "mydf":
mydf.temp <- data.frame(mydf[1:3], stack(mydf[4:6]))
mydf2 <- reshape(mydf.temp, direction = "wide",
idvar=c("ID", "Year", "ind"),
timevar="Test")
names(mydf2) <- c("ID", "Year", "Time", "Test1", "Test2")
mydf2
# ID Year Time Test1 Test2
# 1 1 2008 Fall 15 22
# 2 1 2009 Fall 12 10
# 5 2 2008 Fall 12 13
# 6 2 2009 Fall 16 23
# 9 3 2008 Fall 11 17
# 10 3 2009 Fall 13 14
# 13 1 2008 Spring 16 22
# 14 1 2009 Spring 13 14
# 17 2 2008 Spring 13 11
# 18 2 2009 Spring 14 20
# 21 3 2008 Spring 12 12
# 22 3 2009 Spring 11 9
# 25 1 2008 Winter 19 24
# 26 1 2009 Winter 27 20
# 29 2 2008 Winter 25 29
# 30 2 2009 Winter 21 26
# 33 3 2008 Winter 22 23
# 34 3 2009 Winter 27 31
An alternative method using the base reshape function is below. Though this requires calling reshape twice, there might be a simpler way.
Assuming your dataset is called df1
tmp <- reshape(df1,idvar=c("ID","Year"),timevar="Test",direction="wide")
result <- reshape(
tmp,
idvar=c("ID","Year"),
varying=list(3:5,6:8),
v.names=c("Test1","Test2"),
times=c("Fall","Spring","Winter"),
direction="long"
)
Which gives:
> result
ID Year time Test1 Test2
1.2008.Fall 1 2008 Fall 15 22
1.2009.Fall 1 2009 Fall 12 10
2.2008.Fall 2 2008 Fall 12 13
2.2009.Fall 2 2009 Fall 16 23
3.2008.Fall 3 2008 Fall 11 17
3.2009.Fall 3 2009 Fall 13 14
1.2008.Spring 1 2008 Spring 16 22
1.2009.Spring 1 2009 Spring 13 14
2.2008.Spring 2 2008 Spring 13 11
2.2009.Spring 2 2009 Spring 14 20
3.2008.Spring 3 2008 Spring 12 12
3.2009.Spring 3 2009 Spring 11 9
1.2008.Winter 1 2008 Winter 19 24
1.2009.Winter 1 2009 Winter 27 20
2.2008.Winter 2 2008 Winter 25 29
2.2009.Winter 2 2009 Winter 21 26
3.2008.Winter 3 2008 Winter 22 23
3.2009.Winter 3 2009 Winter 27 31
tidyverse/tidyr solution:
library(dplyr)
library(tidyr)
df %>%
gather("Time", "Value", Fall, Spring, Winter) %>%
spread(Test, Value, sep = "")
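Note that gather and spread are superseded as of tidyr 1.0.0; the pivot_longer / pivot_wider equivalent looks like this (a sketch, shown with a cut-down df of the first two rows per test):

```r
library(tidyr)

df <- data.frame(ID = c(1, 1, 1, 1),
                 Test = c(1, 1, 2, 2),
                 Year = c(2008, 2009, 2008, 2009),
                 Fall = c(15, 12, 22, 10),
                 Spring = c(16, 13, 22, 14),
                 Winter = c(19, 27, 24, 20))

# Melt the three season columns long, then spread Test back out wide
out <- df %>%
  pivot_longer(c(Fall, Spring, Winter), names_to = "Time", values_to = "Value") %>%
  pivot_wider(names_from = Test, values_from = Value, names_prefix = "Test")
```

The names_prefix argument gives the Test1/Test2 column names directly, so no paste0 step is needed.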