Problem: I need to add values from one data frame to another depending on the time window in which each row occurs.
I have one dataframe with a list of singular events like this:
Ind Date Time Event
1 FAU 15/11/2016 06:40:43 A
2 POR 15/11/2016 12:26:51 V
3 POR 15/11/2016 14:52:53 B
4 MAM 20/11/2016 08:12:19 G
5 SUR 03/12/2016 13:51:18 A
6 SUR 14/12/2016 07:47:06 V
And a second data frame with ongoing, continuous events like this:
Date Time Event
1 15/11/2016 06:56:48 1
2 15/11/2016 06:59:40 2
3 15/11/2016 07:27:36 3
4 15/11/2016 07:29:10 4
5 15/11/2016 07:34:51 5
6 15/11/2016 07:35:10 6
7 15/11/2016 07:37:19 7
8 15/11/2016 07:39:55 8
9 15/11/2016 07:51:59 9
10 15/11/2016 08:00:13 10
11 15/11/2016 08:08:01 11
12 15/11/2016 08:13:21 12
13 15/11/2016 08:16:21 13
14 15/11/2016 12:14:48 14
15 15/11/2016 12:16:58 15
16 15/11/2016 12:51:22 16
17 15/11/2016 12:52:09 17
18 15/11/2016 13:26:29 18
19 15/11/2016 13:26:55 19
20 15/11/2016 13:34:14 20
21 15/11/2016 13:50:41 21
22 15/11/2016 13:53:25 22
23 15/11/2016 14:15:17 23
24 15/11/2016 14:54:49 24
Question: how can I combine these so that, for each singular event, we can see during which continuous event it occurred? For example, something like this:
Ind Date Time Eventx Eventy
1 FAU 15/11/2016 06:40:43 A 1
2 POR 15/11/2016 12:26:51 V 15
3 POR 15/11/2016 14:52:53 B 23
Many thanks
Here is a data.table solution. The only caveat is that I had to move the start of the first event in the second data frame to an earlier time, since it began after the first event of the first data frame.
You'll need the additional packages data.table and lubridate.
library(data.table)
library(lubridate)
dt1 <- data.table(df1)
dt2 <- data.table(df2)
dt1[, Date.Time := as.POSIXct(strptime(paste(Date, Time, sep = " "), "%d/%m/%Y %H:%M:%S"))]
dt2[, Date.Time := as.POSIXct(strptime(paste(Date, Time, sep = " "), "%d/%m/%Y %H:%M:%S"))]
# Create the start and end time columns in the second data.table
dt2[, `:=`(Start.Time = Date.Time
, End.Time = shift(Date.Time, n = 1L, fill = NA, type = "lead"))]
# Change the start date to an earlier one
dt2[Event == 1,`:=`(Start.Time = Start.Time - days(1)) ]
# Non-equi join on multiple conditions, then select the relevant columns
dt2[dt1, on=.(Start.Time < Date.Time
, End.Time > Date.Time)
, nomatch = 0L][,.(Ind
, Date
, Time
, Eventx = i.Event
, Eventy = Event)]
# Output of the last merge
Ind Date Time Eventx Eventy
1: FAU 15/11/2016 06:56:48 A 1
2: POR 15/11/2016 12:16:58 V 15
3: POR 15/11/2016 14:15:17 B 23
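For comparison, the same window lookup can be sketched in base R with findInterval(), assuming the continuous events' start times are sorted (they are here). The vectors below are hand-built from the example; the result indexes into the starts vector, with 0 meaning the event precedes the first window:

```r
# findInterval(x, vec) counts how many elements of vec are <= each x,
# i.e. the index of the window each event falls into (0 = before the first).
starts <- as.POSIXct(c("2016-11-15 06:56:48", "2016-11-15 12:16:58",
                       "2016-11-15 14:15:17"))
events <- as.POSIXct(c("2016-11-15 06:40:43", "2016-11-15 12:26:51",
                       "2016-11-15 14:52:53"))
findInterval(events, starts)
#> [1] 0 2 3
```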
This should work (at least it does on your example):
df1 <- structure(list(Ind = c("FAU", "POR", "POR", "MAM", "SUR", "SUR"
), Date = c("15/11/2016", "15/11/2016", "15/11/2016", "20/11/2016",
"03/12/2016", "14/12/2016"), Time = c("06:40:43", "12:26:51",
"14:52:53", "08:12:19", "13:51:18", "07:47:06"), Event = c("A",
"V", "B", "G", "A", "V")), .Names = c("Ind", "Date", "Time",
"Event"), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6"))
df2 <- structure(list(Date = c("15/11/2016", "15/11/2016", "15/11/2016",
"15/11/2016", "15/11/2016", "15/11/2016", "15/11/2016", "15/11/2016",
"15/11/2016", "15/11/2016", "15/11/2016", "15/11/2016", "15/11/2016",
"15/11/2016", "15/11/2016", "15/11/2016", "15/11/2016", "15/11/2016",
"15/11/2016", "15/11/2016", "15/11/2016", "15/11/2016", "15/11/2016",
"15/11/2016"), Time = c("06:56:48", "06:59:40", "07:27:36", "07:29:10",
"07:34:51", "07:35:10", "07:37:19", "07:39:55", "07:51:59", "08:00:13",
"08:08:01", "08:13:21", "08:16:21", "12:14:48", "12:16:58", "12:51:22",
"12:52:09", "13:26:29", "13:26:55", "13:34:14", "13:50:41", "13:53:25",
"14:15:17", "14:54:49"), Event = 1:24), .Names = c("Date", "Time",
"Event"), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15",
"16", "17", "18", "19", "20", "21", "22", "23", "24"))
Create as.POSIXct variables:
df1$datetime <- as.POSIXct(strptime(paste(df1$Date, df1$Time, sep = " "), "%d/%m/%Y %H:%M:%S"))
df2$datetime <- as.POSIXct(strptime(paste(df2$Date, df2$Time, sep = " "), "%d/%m/%Y %H:%M:%S"))
Initialise a new count variable for df1:
df1$count <- NA
Now we loop over the rows of df1 and count the occurrences in df2 with the same Date that fall before each event's time:
for(i in 1:nrow(df1)){
df1$count[i] <- sum(df2$datetime[df2$Date == df1$Date[i]] < df1$datetime[i])
}
Result:
> df1
Ind Date Time Event datetime count
1 FAU 15/11/2016 06:40:43 A 2016-11-15 06:40:43 0
2 POR 15/11/2016 12:26:51 V 2016-11-15 12:26:51 15
3 POR 15/11/2016 14:52:53 B 2016-11-15 14:52:53 23
4 MAM 20/11/2016 08:12:19 G 2016-11-20 08:12:19 0
5 SUR 03/12/2016 13:51:18 A 2016-12-03 13:51:18 0
6 SUR 14/12/2016 07:47:06 V 2016-12-14 07:47:06 0
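The same per-row count can also be written without an explicit loop via mapply(); here is a self-contained sketch on a two-row toy version of the data (the anonymous function is just the body of the loop above):

```r
# Toy data: one event with two earlier df2 rows on the same date, one with none
df1 <- data.frame(Date = c("15/11/2016", "20/11/2016"))
df1$datetime <- as.POSIXct(c("2016-11-15 12:26:51", "2016-11-20 08:12:19"))
df2 <- data.frame(Date = c("15/11/2016", "15/11/2016"))
df2$datetime <- as.POSIXct(c("2016-11-15 06:56:48", "2016-11-15 12:16:58"))

# For each singular event, count df2 events on the same Date that precede it
df1$count <- mapply(function(d, t) sum(df2$datetime[df2$Date == d] < t),
                    df1$Date, df1$datetime)
df1$count
```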
Is there a quick way to replace variable names with the content of the first row of a tibble?
So turning something like this:
Subject Q1 Q2 Q3
Subject age gender cue
429753 24 1 man
b952x8 23 2 mushroom
264062 19 1 night
53082m 35 1 moon
Into this:
Subject age gender cue
429753 24 1 man
b952x8 23 2 mushroom
264062 19 1 night
53082m 35 1 moon
My dataset has over 100 variables so I'm looking for a way that doesn't involve typing out each old and new variable name.
A possible solution:
df <- structure(list(Subject = c("Subject", "429753", "b952x8", "264062",
"53082m"), Q1 = c("age", "24", "23", "19", "35"), Q2 = c("gender",
"1", "2", "1", "1"), Q3 = c("cue", "man", "mushroom", "night",
"moon")), row.names = c(NA, -5L), class = "data.frame")
names(df) <- df[1,]
df <- df[-1,]
df
#> Subject age gender cue
#> 2 429753 24 1 man
#> 3 b952x8 23 2 mushroom
#> 4 264062 19 1 night
#> 5 53082m 35 1 moon
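If you don't mind a package dependency, the janitor package wraps these two steps in a single helper, row_to_names() (a sketch on a small hand-built frame; assumes janitor is installed):

```r
library(janitor)

df <- data.frame(Subject = c("Subject", "429753", "b952x8"),
                 Q1 = c("age", "24", "23"))
row_to_names(df, row_number = 1)  # promotes row 1 to names and drops it
```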
I'm doing a project analysing time series data: Apple stock prices from 2018-01-01 to 2019-12-31. From the dataset I selected two columns, "Date" and "Adj.Close". I have attached a small dataset below. (Alternatively, you can download the data directly from Yahoo Finance; there is a download link under the blue "Apply" button.)
I tested the dataset with adf.test(); it is not stationary. Now I would like to try another approach: chunk the dataset into 24 periods (months), then compare the means and variances of the chunks. I tried chunker() but it did not seem to work. How should I do it? Thank you!
Here is a shorter version of the dataset:
Date Adj.Close
1 2018-01-02 41.38024
2 2018-01-03 41.37303
3 2018-01-04 41.56522
4 2018-01-05 42.03845
5 2018-01-08 41.88231
6 2018-01-09 41.87751
7 2018-01-10 41.86789
8 2018-01-11 42.10571
9 2018-01-12 42.54050
10 2018-01-16 42.32431
11 2018-01-17 43.02335
12 2018-01-18 43.06179
13 2018-01-19 42.86961
14 2018-01-22 42.51889
15 2018-01-23 42.52850
16 2018-01-24 41.85107
17 2018-01-25 41.10399
18 2018-01-26 41.20008
19 2018-01-29 40.34730
20 2018-01-30 40.10948
21 2018-01-31 40.21999
22 2018-02-01 40.30407
23 2018-02-02 38.55526
24 2018-02-05 37.59198
You could split the dataset and use map to run calculations on every chunk:
library(purrr)
library(dplyr)
df <- structure(list(Date = structure(c(17533, 17534, 17535, 17536,
17539, 17540, 17541, 17542, 17543, 17547, 17548, 17549, 17550,
17553, 17554, 17555, 17556, 17557, 17560, 17561, 17562, 17563,
17564, 17567), class = "Date"), Adj.Close = c(41.38024, 41.37303,
41.56522, 42.03845, 41.88231, 41.87751, 41.86789, 42.10571, 42.5405,
42.32431, 43.02335, 43.06179, 42.86961, 42.51889, 42.5285, 41.85107,
41.10399, 41.20008, 40.3473, 40.10948, 40.21999, 40.30407, 38.55526,
37.59198)), row.names = c("1", "2", "3", "4", "5", "6", "7",
"8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18",
"19", "20", "21", "22", "23", "24"), class = "data.frame")
# As an example : split every 10 rows
df %>% split((seq(nrow(df))-1) %/% 10) %>%
map(~{list(startDate = min(.x$Date),
avg = mean(.x$Adj.Close),
sd = sd(.x$Adj.Close))}) %>% bind_rows
#> # A tibble: 3 x 3
#> startDate avg sd
#> <date> <dbl> <dbl>
#> 1 2018-01-02 41.9 0.382
#> 2 2018-01-17 41.9 1.11
#> 3 2018-01-31 39.2 1.32
We can also do this with group_by/summarise
library(dplyr)
df %>%
group_by(grp = as.integer(gl(n(), 10, n()))) %>%
summarise(startDate = min(Date),
avg = mean(Adj.Close), sd = sd(Adj.Close), .groups = 'drop') %>%
select(-grp)
Output:
# A tibble: 3 x 3
# startDate avg sd
# <date> <dbl> <dbl>
#1 2018-01-02 41.9 0.382
#2 2018-01-17 41.9 1.11
#3 2018-01-31 39.2 1.32
You probably want to check the stationarity of returns rather than raw prices. Using the data in the Note at the end, convert it to class "zoo", compute the returns, aggregate by year/month computing mean and sd statistics, and plot. If you prefer year/quarter, replace as.yearmon with as.yearqtr.
library(zoo)
aapl <- read.zoo(aapl.df)
aapl.ret <- diff(aapl, arith = FALSE) - 1
stats <- function(x) c(mean = mean(x), sd = sd(x))
aapl.ret.stats <- aggregate(aapl.ret, as.yearmon, stats)
plot(aapl.ret.stats, main = "AAPL Adj Returns")
To use a chunk of arbitrary length, here 10, we can use rollapplyr:
na.omit(rollapplyr(drop(aapl.ret), 10, by = 10, stats))
Yahoo data
There isn't enough data in the question to really show the above, but using quantmod we can download a longer series and perform the same operations, giving the plot shown after the code. We also show some tests that could be run on the data.
library(quantmod)
getSymbols("AAPL")
aapl.ad <- Ad(AAPL)
aapl.ad.ret <- diff(aapl.ad, arith = FALSE) - 1
stats <- function(x) c(mean = mean(x), sd = sd(x))
aapl.ret.stats <- aggregate(aapl.ad.ret, as.yearmon, stats)
# plot shown after code
plot(aapl.ret.stats, main = "AAPL Adj Returns")
# some additional things to try -- output not shown
aapl.ad.ret.na <- na.omit(aapl.ad.ret)
acf(aapl.ad.ret.na)
Box.test(aapl.ad.ret.na)
library(tseries)
adf.test(aapl.ad.ret.na)
kpss.test(aapl.ad.ret.na, null = "Level")
kpss.test(aapl.ad.ret.na, null = "Trend")
Note
The input in reproducible form:
Lines <- " Date Adj.Close
1 2018-01-02 41.38024
2 2018-01-03 41.37303
3 2018-01-04 41.56522
4 2018-01-05 42.03845
5 2018-01-08 41.88231
6 2018-01-09 41.87751
7 2018-01-10 41.86789
8 2018-01-11 42.10571
9 2018-01-12 42.54050
10 2018-01-16 42.32431
11 2018-01-17 43.02335
12 2018-01-18 43.06179
13 2018-01-19 42.86961
14 2018-01-22 42.51889
15 2018-01-23 42.52850
16 2018-01-24 41.85107
17 2018-01-25 41.10399
18 2018-01-26 41.20008
19 2018-01-29 40.34730
20 2018-01-30 40.10948
21 2018-01-31 40.21999
22 2018-02-01 40.30407
23 2018-02-02 38.55526
24 2018-02-05 37.59198"
aapl.df <- read.table(text = Lines)
I have data like this
Time chamber
9 1
10 2
11 3
12 4
13 5
14 6
15 7
16 8
17 9
18 10
19 11
20 12
21 1
22 2
23 3
24 4
I want to create a new column using conditions on another existing column (chamber).
It should look something like this
Time chamber treatment
9 1 c2t2
10 2 c2t2
11 3 c0t0r
12 4 c2t2r
13 5 c2t2r
14 6 c0t0
15 7 c0t0r
16 8 c0t0r
17 9 c2t2
18 10 c2t2r
19 11 c0t0
20 12 c0t0
21 1 c2t2
22 2 c2t2
23 3 c0t0r
24 4 c2t2r
For chambers 1, 2, 9: treatment is c2t2.
For chambers 3, 7, 8: treatment is c0t0r.
For chambers 4, 5, 10: treatment is c2t2r.
For chambers 6, 11, 12: treatment is c0t0.
I have also made a lookup table, but I don't know how to use it:
lookup_table <- data.frame(row.names = c("1", "2", "3","4", "5", "6","7", "8", "9","10", "11", "12"),
new_col = c("C2T2", "C2T2", "C0T0R","C2T2R", "C2T2R", "C0T0","C0T0R", "C0T0R", "C2T2","C2T2R", "C0T0", "C0T0"),
stringsAsFactors = FALSE)
Assuming "dt" is your dataframe name, then you can use dplyr with case_when
library(tidyverse)
dt %>%
mutate(newcol = case_when(dt$chamber %in% c(1, 2, 9) ~ "c2t2",
dt$chamber %in% c(3, 7, 8) ~ "c0t0r",
dt$chamber %in% c(4, 5, 10) ~ "c2t2r",
dt$chamber %in% c(6, 11, 12) ~ "c0t0"))
Output:
Time chamber newcol
1 9 1 c2t2
2 10 2 c2t2
3 11 3 c0t0r
4 12 4 c2t2r
5 13 5 c2t2r
6 14 6 c0t0
7 15 7 c0t0r
8 16 8 c0t0r
9 17 9 c2t2
10 18 10 c2t2r
11 19 11 c0t0
12 20 12 c0t0
13 21 1 c2t2
14 22 2 c2t2
15 23 3 c0t0r
16 24 4 c2t2r
You can merge your df with the lookup_table. For combining different data.frames, merge() is a solid base-R choice, though there are many other ways and specialised packages you can use for the same purpose.
You need to specify which column to use as the matching column, and that you want to keep all records in df:
merge(df, lookup_table, all.x = TRUE, by.x = "chamber", by.y = "row.names")
Data:
df <- structure(list(Time = 9:24, chamber = c(1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L)),
.Names = c("Time", "chamber"), class = "data.frame",
row.names = c(NA, -16L))
lookup_table <- structure(list(new_col = c("C2T2", "C2T2", "C0T0R", "C2T2R",
"C2T2R", "C0T0", "C0T0R", "C0T0R",
"C2T2", "C2T2R", "C0T0", "C0T0")),
.Names = "new_col",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"), class = "data.frame")
When using the plyr package to summarise my data, it seems impossible to use the nlevels() function.
The structure of my data set is as follows:
>aer <- read.xlsx("XXXX.xlsx", sheetIndex=1)
>aer$ID <- as.factor(aer$ID)
>aer$description <- as.factor(aer$description)
>head(aer)
ID SOC start end days count severity relation
1 1 410 2015-04-21 2015-04-28 7 1 1 3
2 1 500 2015-01-30 2015-05-04 94 1 1 3
3 1 600 2014-11-25 2014-11-29 4 1 1 3
4 1 600 2015-01-02 2015-01-07 5 1 1 3
5 1 600 2015-01-26 2015-03-02 35 1 1 3
6 1 600 2015-04-14 2015-04-17 3 1 1 3
> dput(head(aer,4))
structure(list(ID = structure(c(1L, 1L, 1L, 1L), .Label = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "12", "13", "14",
"15"), class = "factor"), SOC = c(410, 500, 600, 600),
start = structure(c(16546, 16465, 16399, 16437), class = "Date"),
end = structure(c(16553, 16559, 16403, 16442), class = "Date"),
days = c(7, 94, 4, 5), count = c(1, 1, 1, 1), severity = c(1,
1, 1, 1), relation = c(3, 3, 3, 3)), .Names = c("ID", "SOC",
"start", "end", "days", "count", "severity", "relation"),
row.names = c(NA, 4L), class = "data.frame")
What I would like to know is how many levels of the "ID" variable exist in each section of the data created when dividing the data set by the variable "SOC". I want to summarise this information together with some other variables in a new data set. Therefore, I would like to use the plyr package like so:
summaer2 <- ddply(aer, c("SOC"), summarise,
participants = nlevels(ID),
events = sum(count),
min_duration = min(days),
max_duration = max(days),
max_severity = max(severity))
This returns the following error:
Error in Summary.factor(c(4L, 5L, 11L, 11L, 14L, 14L), na.rm = FALSE) :
‘max’ not meaningful for factors
Could someone give me advice on how to reach my goal? Or what I'm doing wrong?
Many thanks in advance!
Update:
Substituting nlevels(ID) with length(unique(ID)) seems to give me the desired output:
> head(summaer2)
SOC participants events min_duration max_duration max_severity
1 100 4 7 1 62 2
2 410 9 16 1 41 2
3 431 2 2 109 132 1
4 500 5 9 23 125 2
5 600 8 19 1 35 1
6 1040 1 1 98 98 2
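For reference, the reason nlevels(ID) misbehaves inside the groups is that subsetting a factor keeps its full level set; length(unique()) counts only the values actually present, and droplevels() is the other common fix. A minimal illustration:

```r
f <- factor(c("a", "b", "c"))
sub <- f[1:2]
nlevels(sub)              # 3: subsetting retains all original levels
length(unique(sub))       # 2: values actually present
nlevels(droplevels(sub))  # 2: equivalent fix
```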
I am looking to do something in R that seems similar to what I would use the reshape package for, but not quite. I am looking to move some rows of a data frame into columns but not all. For example, my data frame looks something like:
v1, v2, v3
info, time, 12:00
info, day, Monday
info, temperature, 70
data, 1, 2
data, 2, 2
data, 3, 1
data, 4, 1
data, 5, 3
I would like to transform it into something like:
v1, v2, v3, info_time, info_day, info_temperature
data, 1, 2, 12:00, Monday, 70
data, 2, 2, 12:00, Monday, 70
data, 3, 1, 12:00, Monday, 70
data, 4, 1, 12:00, Monday, 70
data, 5, 3, 12:00, Monday, 70
Is there an easy way to do this? Does the reshape package help here?
Thank you in advance for all your help!
Vincent
Try
library(reshape2)
indx <- df$v1=='data'
res <- cbind(df[indx,],dcast(df[!indx,],v1~v2, value.var='v3'))[,-4]
row.names(res) <- NULL
colnames(res)[4:6] <- paste('info', colnames(res)[4:6], sep="_")
res
# v1 v2 v3 info_day info_temperature info_time
#1 data 1 2 Monday 70 12:00
#2 data 2 2 Monday 70 12:00
#3 data 3 1 Monday 70 12:00
#4 data 4 1 Monday 70 12:00
#5 data 5 3 Monday 70 12:00
Or use dplyr/tidyr
library(dplyr)
library(tidyr)
cbind(df[indx,],
unite(df[!indx,], Var, v1, v2) %>%
mutate(id=1) %>%
spread(Var, v3)%>%
select(-id))
Or using base R
cbind(df[indx,],
reshape(transform(df[!indx,], v2= paste(v1, v2, sep="_")),
idvar='v1', timevar='v2', direction='wide')[,-1])
data
df <- structure(list(v1 = c("info", "info", "info", "data", "data",
"data", "data", "data"), v2 = c("time", "day", "temperature",
"1", "2", "3", "4", "5"), v3 = c("12:00", "Monday", "70", "2",
"2", "1", "1", "3")), .Names = c("v1", "v2", "v3"), class = "data.frame",
row.names = c(NA, -8L))
A solution without external packages (using the df structure from akrun's answer):
df1 <- cbind(df[4:8,1:3],apply(df[1:3,3,drop=FALSE],1,function(x) rep(x,nrow(df)-3)))
colnames(df1)[4:6] <- paste("info",df[1:3,2], sep = "_")
df1
> df1
v1 v2 v3 info_time info_day info_temperature
4 data 1 2 12:00 Monday 70
5 data 2 2 12:00 Monday 70
6 data 3 1 12:00 Monday 70
7 data 4 1 12:00 Monday 70
8 data 5 3 12:00 Monday 70
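With current tidyr (1.0+), spread() and unite() are superseded by pivot_wider(); the same reshape can be sketched as follows, assuming the df from akrun's answer:

```r
library(dplyr)
library(tidyr)

info <- df %>%
  filter(v1 == "info") %>%
  transmute(name = paste("info", v2, sep = "_"), v3) %>%
  pivot_wider(names_from = name, values_from = v3)

# cbind recycles the single info row across all data rows
cbind(df[df$v1 == "data", ], info)
```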