Spread column values in R

Hi, I would like to reshape my data frame profile_table_long, which holds 24 hourly values per day for 50 companies over 2 years.
Data - date, from 2015-01-01 to 2016-12-31
name - firm name, 1:50
hour - hour, 1:24 (with an additional 2a between 2 and 3)
load - the measured variable
x <- NULL
x$Data <- rep(seq(as.Date("2015/1/1"), as.Date("2016/12/31"), "days"), length.out=913750)
x$Name <- rep(rep(1:50, each=731), length.out=913750)
x$hour <- rep(rep(c(1, 2, "2a", 3:24), each=36550),length.out=913750)
x$load <- sample(2000:2500, 913750, replace=T)
x <- data.frame(x)
Data name hour load
1 2015-01-01 1 1 8837.050
2 2015-01-01 1 2 6990.952
3 2015-01-01 1 2a 8394.421
4 2015-01-01 1 3 8267.276
5 2015-01-01 1 4 8324.069
6 2015-01-01 1 5 8644.901
7 2015-01-01 1 6 8720.878
8 2015-01-01 1 7 9213.204
9 2015-01-01 1 8 9601.976
10 2015-01-01 1 9 8549.170
11 2015-01-01 1 10 9379.324
12 2015-01-01 1 11 9370.418
13 2015-01-01 1 12 7159.201
14 2015-01-01 1 13 8497.344
15 2015-01-01 1 14 6419.835
16 2015-01-01 1 15 9354.910
17 2015-01-01 1 16 9320.462
18 2015-01-01 1 17 9263.098
19 2015-01-01 1 18 9167.991
20 2015-01-01 1 19 9004.010
21 2015-01-01 1 20 9134.466
22 2015-01-01 1 21 7631.472
23 2015-01-01 1 22 6492.074
24 2015-01-01 1 23 6888.025
25 2015-01-01 1 24 8821.283
26 2015-01-02 1 1 8902.135
I would like to make it look like this:
data hour name1 name2 .... name49 name50
2015-01-01 1 load load .... load load
2015-01-01 2 load load .... load load
.....
2015-01-01 24 load load .... load load
2015-01-02 1 load load .... load load
.....
2016-12-31 24 load load .... load load
I tried spread() from the tidyr package: profile_table_tidy <- spread(profile_table_long, name, load), but I am getting the error Error: Duplicate identifiers for rows.

This method uses the reshape2 package:
library("reshape2")
profile_table_wide = dcast(data = profile_table_long,
                           formula = Data + hour ~ name,
                           value.var = "load")
You might also want to choose a value for the fill argument. Good luck!
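If you want to stay in tidyr, newer versions (1.0+) supersede spread() with pivot_wider(), which warns about duplicated (Data, hour, name) combinations (and lets you collapse them via values_fn) instead of failing outright. A minimal sketch, assuming the column names from the question:
library(tidyr)
profile_table_wide <- pivot_wider(profile_table_long,
                                  names_from = name,
                                  values_from = load)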

Related

Calculate number of frost change days (NFCD) from hourly weather data in R

I have to calculate the number of frost change days (NFCD) on a weekly basis.
That means the number of days on which minimum and maximum temperature cross 0°C (minimum below, maximum above).
Let's say I work with the years 1957-1980 with hourly temperatures.
Example data (couple of rows look like):
Date Time (UTC) temperature
1957-07-01 00:00:00 5
1957-07-01 03:00:00 6.2
1957-07-01 05:00:00 9
1957-07-01 06:00:00 10
1957-07-01 07:00:00 10
1957-07-01 08:00:00 14
1957-07-01 09:00:00 13.2
1957-07-01 10:00:00 15
1957-07-01 11:00:00 15
1957-07-01 12:00:00 16.3
1957-07-01 13:00:00 15.8
Expected data:
year month week NFCD
1957 7 1 1
1957 7 2 5
Using a small toy example, with one group ("B") that crosses 0°C and one ("A") that does not:
dat <- data.frame(date=c(rep("A",5),rep("B",5)), time=rep(1:5, times=2), temp=c(1:5,-2,1:4))
dat
# date time temp
# 1 A 1 1
# 2 A 2 2
# 3 A 3 3
# 4 A 4 4
# 5 A 5 5
# 6 B 1 -2
# 7 B 2 1
# 8 B 3 2
# 9 B 4 3
# 10 B 5 4
aggregate(temp ~ date, data = dat, FUN = function(z) min(z) <= 0 && max(z) > 0)
# date temp
# 1 A FALSE
# 2 B TRUE
(then rename temp to NFCD)
Using the data from r2evans's answer, you can also use tidyverse logic:
library(tidyverse)
dat %>%
  group_by(date) %>%
  summarize(NFCD = min(temp) < 0 & max(temp) > 0)
which gives:
# A tibble: 2 x 2
date NFCD
<chr> <lgl>
1 A FALSE
2 B TRUE
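Neither snippet produces the weekly roll-up from the expected output. A sketch of the remaining step, assuming a data frame weather with a POSIXct column Date and a numeric column temperature, and guessing that "week" means week-of-month:
library(dplyr)
library(lubridate)
# flag each calendar day on which the temperature crosses 0°C
daily <- weather %>%
  mutate(d = as.Date(Date)) %>%
  group_by(d) %>%
  summarize(fcd = min(temperature) < 0 & max(temperature) > 0)
# count flagged days per year/month/week-of-month
daily %>%
  mutate(year = year(d), month = month(d), week = ceiling(mday(d) / 7)) %>%
  group_by(year, month, week) %>%
  summarize(NFCD = sum(fcd))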

Summations by conditions on another row dealing with time

I am looking to run a cumulative sum at every row, over values that occur in two columns before and after that point. In this case I have volumes for 2 incident types at every minute over two days. I want to create a column that, for each row, adds up all the incidents that occurred before and after it, by type. SUMIF from Excel comes to mind, but I'm not sure how to port that over to R.
EDIT: ADDED set.seed and easier numbers
I have the following data set:
library(data.table)
set.seed(42)
master_min =
  setDT(
    data.frame(master_min = seq(
      from = as.POSIXct("2016-1-1 0:00", tz = "America/New_York"),
      to = as.POSIXct("2016-1-2 23:00", tz = "America/New_York"),
      by = "min"
    ))
  )
incident1 = round(runif(2821, min = 0, max = 10))
incident2 = round(runif(2821, min = 0, max = 10))
master_min = head(cbind(master_min, incident1, incident2), 5)
How do I essentially compute the following logic:
for each row, sum all the incident1 values that occurred before that row's timestamp and all the incident2 values that occurred after it? A data.table solution would be great, or failing that dplyr, as I am working with a large dataset. Below is a before and after for the data:
BEFORE:
master_min incident1 incident2
1: 2016-01-01 00:00:00 9 6
2: 2016-01-01 00:01:00 9 5
3: 2016-01-01 00:02:00 3 5
4: 2016-01-01 00:03:00 8 6
5: 2016-01-01 00:04:00 6 9
AFTER THE CALCULATION:
master_min incident1 incident2 new_column
1: 2016-01-01 00:00:00 9 6 25
2: 2016-01-01 00:01:00 9 5 29
3: 2016-01-01 00:02:00 3 5 33
4: 2016-01-01 00:03:00 8 6 30
5: 2016-01-01 00:04:00 6 9 29
If I understand correctly:
# Cumsum of incident1, without current row:
master_min$sum1 <- cumsum(master_min$incident1) - master_min$incident1
# Reverse cumsum of incident2, without current row:
master_min$sum2 <- rev(cumsum(rev(master_min$incident2))) - master_min$incident2
# Your new column:
master_min$new_column <- master_min$sum1 + master_min$sum2
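Since master_min is a data.table, the same logic can also be written as a single := update; a sketch, keeping the exclude-the-current-row convention:
master_min[, new_column := (cumsum(incident1) - incident1) +
                           (sum(incident2) - cumsum(incident2))]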
Update:
The following two lines can also do the job (note that sum1 here includes the current row's incident1, unlike the answer above):
master_min$sum1 <- cumsum(master_min$incident1)
master_min$sum2 <- sum(master_min$incident2) - cumsum(master_min$incident2)
I rewrote the example a bit to show a more comprehensive structure:
library(data.table)
master_min <-
  setDT(
    data.frame(master_min = seq(
      from = as.POSIXct("2016-1-1 0:00", tz = "America/New_York"),
      to = as.POSIXct("2016-1-1 0:09", tz = "America/New_York"),
      by = "min"
    ))
  )
set.seed(2)
incident1 = as.integer(runif(10, min = 0, max = 10))
incident2 = as.integer(runif(10, min = 0, max = 10))
master_min = cbind(master_min, incident1, incident2)
Now master_min looks like this
> master_min
master_min incident1 incident2
1: 2016-01-01 00:00:00 1 5
2: 2016-01-01 00:01:00 7 2
3: 2016-01-01 00:02:00 5 7
4: 2016-01-01 00:03:00 1 1
5: 2016-01-01 00:04:00 9 4
6: 2016-01-01 00:05:00 9 8
7: 2016-01-01 00:06:00 1 9
8: 2016-01-01 00:07:00 8 2
9: 2016-01-01 00:08:00 4 4
10: 2016-01-01 00:09:00 5 0
Apply transformations
master_min$sum1 <- cumsum(master_min$incident1)
master_min$sum2 <- sum(master_min$incident2) - cumsum(master_min$incident2)
Results
> master_min
master_min incident1 incident2 sum1 sum2
1: 2016-01-01 00:00:00 1 5 1 37
2: 2016-01-01 00:01:00 7 2 8 35
3: 2016-01-01 00:02:00 5 7 13 28
4: 2016-01-01 00:03:00 1 1 14 27
5: 2016-01-01 00:04:00 9 4 23 23
6: 2016-01-01 00:05:00 9 8 32 15
7: 2016-01-01 00:06:00 1 9 33 6
8: 2016-01-01 00:07:00 8 2 41 4
9: 2016-01-01 00:08:00 4 4 45 0
10: 2016-01-01 00:09:00 5 0 50 0
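To recover the OP's new_column from these two helper columns, subtract the current row's incident1 back out (sum2 already excludes the current row):
master_min$new_column <- master_min$sum1 - master_min$incident1 + master_min$sum2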

How to recreate the table by key?

This might be a very easy question, but I am a complete beginner with R.
I have a data.table with many rows, two of whose columns can be set as the key. I want to reshape the table by key.
For example, take the simple data below. In this case the key is ID and Act, which gives a total of 4 groups.
ID ValueDate Act Volume
1 2015-01-01 EUR 21
1 2015-02-01 EUR 22
1 2015-01-01 MAD 12
1 2015-02-01 MAD 11
2 2015-01-01 EUR 5
2 2015-02-01 EUR 7
3 2015-01-01 EUR 4
3 2015-02-01 EUR 2
3 2015-03-01 EUR 6
Here is the code to generate the test data:
library(data.table)
dd <- data.table(ID = c(1,1,1,1,2,2,3,3,3),
                 ValueDate = c("2015-01-01", "2015-02-01", "2015-01-01", "2015-02-01", "2015-01-01",
                               "2015-02-01", "2015-01-01", "2015-02-01", "2015-03-01"),
                 Act = c("EUR","EUR","MAD","MAD","EUR","EUR","EUR","EUR","EUR"),
                 Volume = c(21,22,12,11,5,7,4,2,6))
After the change, each column should represent a specific group defined by the key (ID and Act).
Below is the expected result:
ValueDate ID1_EUR ID1_MAD ID2_EUR ID3_EUR
2015-01-01 21 12 5 4
2015-02-01 22 11 7 2
2015-03-01 NA NA NA 6
Thanks a lot!
What you are trying to do is not recreating the data.table, but reshaping it from a long format to a wide format. You can use dcast for this:
dcast(dd, ValueDate ~ ID + Act, value.var = "Volume")
which gives:
ValueDate 1_EUR 1_MAD 2_EUR 3_EUR
1: 2015-01-01 21 12 5 4
2: 2015-02-01 22 11 7 2
3: 2015-03-01 NA NA NA 6
If you want the numbers in the resulting columns to be preceded with ID, then you can use:
dcast(dd, ValueDate ~ paste0("ID",ID) + Act, value.var = "Volume")
which gives:
ValueDate ID1_EUR ID1_MAD ID2_EUR ID3_EUR
1: 2015-01-01 21 12 5 4
2: 2015-02-01 22 11 7 2
3: 2015-03-01 NA NA NA 6
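For reference, a tidyr (1.0+) sketch equivalent to the second dcast call, using multiple names_from columns with a prefix:
library(tidyr)
pivot_wider(dd, id_cols = ValueDate,
            names_from = c(ID, Act), names_prefix = "ID",
            values_from = Volume)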

How to calculate the difference between rows by key?

I have a data.table with about 1000 rows, two of whose columns are set as the key. I would like to create a new variable named Difference that contains the difference between consecutive numeric rows within each group defined by the key.
For example, in the simple data below, ID and Act are set as the key:
ID ValueDate Act Volume
1 2015-01-01 EUR 21
1 2015-02-01 EUR 22
1 2015-01-01 MAD 12
1 2015-02-01 MAD 11
2 2015-01-01 EUR 5
2 2015-02-01 EUR 7
3 2015-01-01 EUR 4
3 2015-02-01 EUR 2
3 2015-03-01 EUR 6
What I would like is to add a new column holding the difference between consecutive rows (ordered by date) within each group; note that for the first row of each group, the value of Difference is 0.
ID ValueDate Act Volume Difference
1 2015-01-01 EUR 21 0
1 2015-02-01 EUR 22 1
1 2015-01-01 MAD 12 0
1 2015-02-01 MAD 11 -1
2 2015-01-01 EUR 5 0
2 2015-02-01 EUR 7 2
3 2015-01-01 EUR 4 0
3 2015-02-01 EUR 2 -2
3 2015-03-01 EUR 6 4
Here is the code to generate the test data:
library(data.table)
dd <- data.table(ID = c(1,1,1,1,2,2,3,3,3),
                 ValueDate = c("2015-01-01", "2015-02-01", "2015-01-01", "2015-02-01", "2015-01-01",
                               "2015-02-01", "2015-01-01", "2015-02-01", "2015-03-01"),
                 Act = c("EUR","EUR","MAD","MAD","EUR","EUR","EUR","EUR","EUR"),
                 Volume = c(21,22,12,11,5,7,4,2,6))
Set the key for the table:
setkey(dd, ID, Act)
To view the data:
> dd
ID ValueDate Act Volume
1 1 2015-01-01 EUR 21
2 1 2015-02-01 EUR 22
3 1 2015-01-01 MAD 12
4 1 2015-02-01 MAD 11
5 2 2015-01-01 EUR 5
6 2 2015-02-01 EUR 7
7 3 2015-01-01 EUR 4
8 3 2015-02-01 EUR 2
9 3 2015-03-01 EUR 6
So, can we use the aggregate function to calculate the difference, or the .SD ("subset of data") approach? I don't know how to do the calculation of the difference between two rows by group; note that the number of rows may differ between groups. What I have tried before is a for(i in 0:x) loop to recalculate the difference, but I don't think that is a good method :(
If you want to use your key explicitly, you can pass key(dd) to the by argument:
dd[, Difference := c(0L, diff(Volume)), by = key(dd)]
dd
# ID ValueDate Act Volume Difference
# 1: 1 2015-01-01 EUR 21 0
# 2: 1 2015-02-01 EUR 22 1
# 3: 1 2015-01-01 MAD 12 0
# 4: 1 2015-02-01 MAD 11 -1
# 5: 2 2015-01-01 EUR 5 0
# 6: 2 2015-02-01 EUR 7 2
# 7: 3 2015-01-01 EUR 4 0
# 8: 3 2015-02-01 EUR 2 -2
# 9: 3 2015-03-01 EUR 6 4
Or, using data.table v1.9.6+, you can also utilize the shift function:
dd[, Difference := Volume - shift(Volume, fill = Volume[1L]), by = key(dd)]
We can use dplyr. After grouping by 'ID' and 'Act', we create the 'Difference' column as the difference between 'Volume' and the lag of that column.
library(dplyr)
dd %>%
  group_by(ID, Act) %>%
  mutate(Difference = Volume - lag(Volume))
EDIT: As mentioned by @DavidArenburg, replacing lag(Volume) with lag(Volume, default = Volume[1L]) gives 0 instead of NA for the first element in each group.
Or with ave from base R, we can take the diff and concatenate it with 0 so that the lengths are the same (diff returns a vector whose length is one less than that of the original vector):
with(dd, ave(Volume, ID, Act, FUN = function(x) c(0, diff(x))))
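One caveat that applies to all of these approaches: they assume the rows are already ordered by ValueDate within each group (here setkey(dd, ID, Act) sorts by the key but keeps the original within-group order, so the example works). If that is not guaranteed, sort first, e.g.:
setorder(dd, ID, Act, ValueDate)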

Finding average of values in the past 2 minutes in a data.table

I am trying to find the average of values that fall within a certain time frame in the same data.table and save it to a new column.
Below is a sample data set.
Update: I changed the dataset to represent the discontinuous timeline in my original data.
> x
ts value avg
1: 2015-01-01 00:00:23 9 0
2: 2015-01-01 00:01:56 11 0
3: 2015-01-01 00:02:03 18 0
4: 2015-01-01 00:03:16 1 0
5: 2015-01-01 00:05:19 6 0
6: 2015-01-01 00:05:54 16 0
7: 2015-01-01 00:06:27 13 0
8: 2015-01-01 00:06:50 7 0
9: 2015-01-01 00:08:41 12 0
10: 2015-01-01 00:09:08 17 0
11: 2015-01-01 00:09:28 8 0
12: 2015-01-01 00:10:56 5 0
13: 2015-01-01 00:11:44 10 0
14: 2015-01-01 00:12:23 20 0
15: 2015-01-01 00:12:28 2 0
16: 2015-01-01 00:12:37 15 0
17: 2015-01-01 00:12:42 4 0
18: 2015-01-01 00:12:48 19 0
19: 2015-01-01 00:13:41 3 0
20: 2015-01-01 00:16:04 14 0
My code assigns the value 10.5 to all rows, and I do not get the expected results. Here is my code:
require(lubridate)
x[, avg := x[ts>=ts-minutes(2) & ts<=ts , mean(value)], verbose=TRUE ]
Update:
I want the results to be as below:
ts value avg
1 01-01-2015 00:00:23 9 0
2 01-01-2015 00:01:56 11 9
3 01-01-2015 00:02:03 18 10
4 01-01-2015 00:03:16 1 14.5
5 01-01-2015 00:05:19 6 0
6 01-01-2015 00:05:54 16 6
7 01-01-2015 00:06:27 13 11
8 01-01-2015 00:06:50 7 11.66666667
9 01-01-2015 00:08:41 12 7
10 01-01-2015 00:09:08 17 12
11 01-01-2015 00:09:28 8 14.5
12 01-01-2015 00:10:56 5 12.5
13 01-01-2015 00:11:44 10 5
14 01-01-2015 00:12:23 20 7.5
15 01-01-2015 00:12:28 2 11.66666667
16 01-01-2015 00:12:37 15 9.25
17 01-01-2015 00:12:42 4 10.4
18 01-01-2015 00:12:48 19 9.333333333
19 01-01-2015 00:13:41 3 11.666667
20 01-01-2015 00:16:04 14 0
I want to do this on a larger data set, and also compute min and max values in separate columns (here I have shown only the average function). Any help would be great.
Update:
Below is the reproducible code.
# reproducible code
library(data.table)
library(lubridate)
ts <- seq(from = ISOdatetime(2015,1,1,0,0,0, tz = "GMT"), to = ISOdatetime(2015,1,1,0,0,19, tz = "GMT"), by = "sec")
set.seed(2)
ts <- ts + seconds(round(runif(20, 0, 1000), 0))
value <- 1:20
avg <- 0
x <- data.table(ts, value, avg)
setkey(x, ts)
x
Solution
Thanks to @Saksham for pointing me towards the apply functions. Here is the solution that I have come up with:
find <- function(y){
  mean(x[ts >= y - minutes(2) & ts < y, value])
}
x$avg <- mapply(find, x[, ts])
> x
ts value avg
1: 2015-01-01 00:00:23 9 NaN
2: 2015-01-01 00:01:56 11 9.000000
3: 2015-01-01 00:02:03 18 10.000000
4: 2015-01-01 00:03:16 1 14.500000
5: 2015-01-01 00:05:19 6 NaN
6: 2015-01-01 00:05:54 16 6.000000
7: 2015-01-01 00:06:27 13 11.000000
8: 2015-01-01 00:06:50 7 11.666667
9: 2015-01-01 00:08:41 12 7.000000
10: 2015-01-01 00:09:08 17 12.000000
11: 2015-01-01 00:09:28 8 14.500000
12: 2015-01-01 00:10:56 5 12.500000
13: 2015-01-01 00:11:44 10 5.000000
14: 2015-01-01 00:12:23 20 7.500000
15: 2015-01-01 00:12:28 2 11.666667
16: 2015-01-01 00:12:37 15 9.250000
17: 2015-01-01 00:12:42 4 10.400000
18: 2015-01-01 00:12:48 19 9.333333
19: 2015-01-01 00:13:41 3 11.666667
20: 2015-01-01 00:16:04 14 NaN
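For larger data, this mapply() loop scans the whole table once per row. A sketch of a non-equi self-join alternative (data.table 1.9.8+), using the x defined above; note that empty windows come back as NA rather than NaN:
x[, win_start := ts - minutes(2)]
x[, win_end := ts]
x[, avg := x[x, on = .(ts >= win_start, ts < win_end),
             mean(value), by = .EACHI]$V1]
x[, c("win_start", "win_end") := NULL]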
Will this do?
ts[, avg] <- ts[, val] - 0.5
Logically, looking at your expected result, it is doing the same thing. You can edit your expected result to make it clearer if I interpreted it wrong.
EDIT:
This base R approach should do the trick. As I am not familiar with manipulating time in R, I am assuming that date arithmetic works the same way as it does in most languages:
interval <- minutes(2)  # assuming this is how we define 2 minutes
x$avg <- sapply(x$ts, function(y){
  # average of the values in the 2 minutes strictly before y
  mean(x$value[x$ts >= y - interval & x$ts < y])
})
