Suppose I have a data frame (df) like this:
Date x
20.01.2016 34
21.01.2016 28
22.01.2016 NA
23.01.2016 NA
24.01.2016 56
25.01.2016 NA
26.01.2016 28
I want to add a column (z) to this data frame:
Date x z
20.01.2016 34 -
21.01.2016 28 1
22.01.2016 NA NA
23.01.2016 NA NA
24.01.2016 56 3
25.01.2016 NA NA
26.01.2016 28 2
where z is the number of days between the row's date and the closest previous date on which x is not NA.
For example, for 24.01.2016 the closest previous date with a non-NA x is 21.01.2016, so the difference between the two dates is 3 days.
How can I do this in R?
I would be very glad for any help. Thanks a lot.
Assuming that your date variable is of class Date (i.e. df$Date <- as.Date(df$Date, format = '%d.%m.%Y')), then:
df$z[!is.na(df$x)] <- c(NA, diff(df$Date[!is.na(df$x)]))
df
# Date x z
#1 2016-01-20 34 NA
#2 2016-01-21 28 1
#3 2016-01-22 NA NA
#4 2016-01-23 NA NA
#5 2016-01-24 56 3
#6 2016-01-25 NA NA
#7 2016-01-26 28 2
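For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea in base R. It rebuilds the example data from the question; the helper name has_x is only for illustration, and as.numeric() is used so that z is stored as a plain number of days.

```r
# Rebuild the example data from the question
df <- data.frame(
  Date = c("20.01.2016", "21.01.2016", "22.01.2016", "23.01.2016",
           "24.01.2016", "25.01.2016", "26.01.2016"),
  x = c(34, 28, NA, NA, 56, NA, 28)
)
df$Date <- as.Date(df$Date, format = "%d.%m.%Y")

# Rows where x is present (illustrative helper name)
has_x <- !is.na(df$x)

# Start with NA everywhere, then fill the non-NA rows with the gap in days
# to the previous non-NA row; the first such row has no predecessor
df$z <- NA_real_
df$z[has_x] <- c(NA, as.numeric(diff(df$Date[has_x])))
df
```

Because diff() is taken only over the rows where x is present, the rows with NA in x never take part in the calculation and simply keep NA in z.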
We can use data.table
library(data.table)
setDT(df)[, Date := as.IDate(Date, "%d.%m.%Y")][!is.na(x), z := Date - shift(Date)]
df
# Date x z
#1: 2016-01-20 34 NA
#2: 2016-01-21 28 1
#3: 2016-01-22 NA NA
#4: 2016-01-23 NA NA
#5: 2016-01-24 56 3
#6: 2016-01-25 NA NA
#7: 2016-01-26 28 2
I have a dataset that looks like this, showing which campaigns are active for each household, with the start and end dates of each campaign:
campaign_id household_id campaign_type start_date end_date
1 26 1 Type B 2016-12-28 2017-02-19
2 8 1 Type A 2017-05-08 2017-06-25
3 12 1 Type B 2017-07-12 2017-08-13
4 13 1 Type A 2017-08-08 2017-09-24
5 18 1 Type A 2017-10-30 2017-12-24
6 20 1 Type C 2017-11-27 2018-02-05
7 22 1 Type B 2017-12-06 2018-01-07
8 23 1 Type B 2017-12-28 2018-02-04
I also created a new data frame with the structure below, which should show which campaigns are active for a given household on a given date (there is one column per campaign number; I have omitted the rest here):
household_id date campaign1 campaign2 campaign3 campaign4
1 1 2016-11-14 NA NA NA NA
2 1 2016-12-06 NA NA NA NA
3 1 2016-12-28 NA NA NA NA
4 1 2017-02-08 NA NA NA NA
5 1 2017-03-03 NA NA NA NA
6 1 2017-03-08 NA NA NA NA
7 1 2017-03-13 NA NA NA NA
8 1 2017-03-29 NA NA NA NA
9 1 2017-04-03 NA NA NA NA
10 1 2017-04-19 NA NA NA NA
What I want to do is mark the campaigns that are active on each date in the rows of the second data frame. For example, if household_id 1 has campaign 2 running on 2016-11-14 but no other campaigns, the row will look like this:
household_id date campaign1 campaign2 campaign3 campaign4
1 1 2016-11-14 0 1 0 0
How can I build this? Should I loop over the first data frame and assign into the second one on each iteration, or is there a better and faster way? Thanks in advance.
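One possible way to avoid an explicit row-by-row loop is sketched below in base R. The two small data frames are hypothetical stand-ins for the ones in the question (the names campaigns, visits, flags and result are made up for illustration), and the sketch assumes both date columns are already of class Date.

```r
# Hypothetical miniature versions of the two data frames in the question
campaigns <- data.frame(
  campaign_id  = c(26, 8, 12),
  household_id = c(1, 1, 1),
  start_date   = as.Date(c("2016-12-28", "2017-05-08", "2017-07-12")),
  end_date     = as.Date(c("2017-02-19", "2017-06-25", "2017-08-13"))
)
visits <- data.frame(
  household_id = c(1, 1, 1),
  date         = as.Date(c("2016-11-14", "2016-12-28", "2017-02-08"))
)

# For each campaign, build a 0/1 vector over all visit rows:
# 1 if the row belongs to the same household and the date is in range
flags <- vapply(seq_len(nrow(campaigns)), function(j) {
  as.integer(
    visits$household_id == campaigns$household_id[j] &
      visits$date >= campaigns$start_date[j] &
      visits$date <= campaigns$end_date[j]
  )
}, FUN.VALUE = integer(nrow(visits)))

# One indicator column per campaign, named after its id
colnames(flags) <- paste0("campaign", campaigns$campaign_id)
result <- cbind(visits, flags)
result
```

This still iterates over campaigns, but each step is a vectorised comparison over all rows of the second data frame, which is usually much faster than assigning cell by cell.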
I've got a dataset that looks like this -
dataset = data.frame(Site=c(rep('A',3),rep('B',3),rep('C',3)),MonthYear = c(rep(c('May 19','Apr 19','Mar 19'),3)),Date=c(rep(c('2019-05-31','2019-04-30','2019-03-31'),3)),Measure=c(rep(c('Service','Speed','Efficiency'),3)),Score=runif(9,0,1))
My objective is to transform that dataset by using the spread function.
However after doing so, I'd like the spread columns to be ordered based on the Date column (ascending order).
This would mean that the spread columns are in the following order: Mar 19, Apr 19, May 19
Here's my attempt -
library(dplyr)
library(tidyr)
final = dataset %>% spread(MonthYear,Score)
My attempt results in the spread columns being arranged in alphabetical order rather than chronological order.
Thanks in advance for your inputs
Order the appropriate factor levels and you're done.
library(tidyr)
dataset = data.frame(Site=c(rep('A',3),rep('B',3),rep('C',3)),MonthYear = c(rep(c('May 19','Apr 19','Mar 19'),3)),Date=c(rep(c('2019-05-31','2019-04-30','2019-03-31'),3)),Measure=c(rep(c('Service','Speed','Efficiency'),3)),Score=runif(9,0,1))
dataset$MonthYear <- factor(dataset$MonthYear, levels = c("Mar 19", "Apr 19", "May 19"))
spread(dataset, key = MonthYear, value = Score)
Site Date Measure Mar 19 Apr 19 May 19
1 A 2019-03-31 Efficiency 0.09789678 NA NA
2 A 2019-04-30 Speed NA 0.4645101 NA
3 A 2019-05-31 Service NA NA 0.89602042
4 B 2019-03-31 Efficiency 0.59516115 NA NA
5 B 2019-04-30 Speed NA 0.5208239 NA
6 B 2019-05-31 Service NA NA 0.45334636
7 C 2019-03-31 Efficiency 0.93941294 NA NA
8 C 2019-04-30 Speed NA 0.5439323 NA
9 C 2019-05-31 Service NA NA 0.07971263
The only issue is that dataset$MonthYear is a factor and is not ordered in the way you like.
#Find Order by Date column
dLvl <- unique(dataset$MonthYear[order(dataset$Date)])
levels(dataset$MonthYear)
#[1] "Apr 19" "Mar 19" "May 19"
dataset$MonthYear <- factor(dataset$MonthYear, levels = dLvl)
levels(dataset$MonthYear)
#[1] "Mar 19" "Apr 19" "May 19"
final = dataset %>% spread(MonthYear,Score)
final
# Site Date Measure Mar 19 Apr 19 May 19
#1 A 2019-03-31 Efficiency 0.9928678 NA NA
#2 A 2019-04-30 Speed NA 0.1457551 NA
#3 A 2019-05-31 Service NA NA 0.6047312
#4 B 2019-03-31 Efficiency 0.4419907 NA NA
#5 B 2019-04-30 Speed NA 0.5799068 NA
If you convert them to dates you can order the columns based on the order of those dates
df <-
dataset %>%
spread(MonthYear,Score)
col_dts <- as.Date(paste0('01', names(df)), format = '%d%b %y')
df <- df[order(!is.na(col_dts), col_dts)]
df
# Site Date Measure Mar 19 Apr 19 May 19
# 1 A 2019-03-31 Efficiency 0.76653679 NA NA
# 2 A 2019-04-30 Speed NA 0.0416291 NA
# 3 A 2019-05-31 Service NA NA 0.3885358
# 4 B 2019-03-31 Efficiency 0.02538343 NA NA
# 5 B 2019-04-30 Speed NA 0.7264234 NA
# 6 B 2019-05-31 Service NA NA 0.5128166
# 7 C 2019-03-31 Efficiency 0.50107038 NA NA
# 8 C 2019-04-30 Speed NA 0.9013112 NA
# 9 C 2019-05-31 Service NA NA 0.3678922
Or you could change the factor levels according to the order of the date values
new_levels <-
with(dataset, {
mons <- unique(MonthYear)
ord <- order(as.Date(paste0('01', mons), format = '%d%b %y'))
mons[ord]})
dataset$MonthYear <- factor(dataset$MonthYear, levels = new_levels)
dataset %>%
spread(MonthYear,Score)
# Site Date Measure Mar 19 Apr 19 May 19
# 1 A 2019-03-31 Efficiency 0.76653679 NA NA
# 2 A 2019-04-30 Speed NA 0.0416291 NA
# 3 A 2019-05-31 Service NA NA 0.3885358
# 4 B 2019-03-31 Efficiency 0.02538343 NA NA
# 5 B 2019-04-30 Speed NA 0.7264234 NA
# 6 B 2019-05-31 Service NA NA 0.5128166
# 7 C 2019-03-31 Efficiency 0.50107038 NA NA
# 8 C 2019-04-30 Speed NA 0.9013112 NA
# 9 C 2019-05-31 Service NA NA 0.3678922
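You can also use reorder with dcast (spread expects a bare column name as its key, so an expression such as reorder(...) does not work there)
library(data.table)
# order the MonthYear values by their date before casting to wide form
dcast(as.data.table(dataset),
Site + Date + Measure ~ reorder(MonthYear, as.numeric(as.Date(Date))),
value.var = 'Score')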
# Site Date Measure Mar 19 Apr 19 May 19
# 1 A 2019-03-31 Efficiency 0.76653679 NA NA
# 2 A 2019-04-30 Speed NA 0.0416291 NA
# 3 A 2019-05-31 Service NA NA 0.3885358
# 4 B 2019-03-31 Efficiency 0.02538343 NA NA
# 5 B 2019-04-30 Speed NA 0.7264234 NA
# 6 B 2019-05-31 Service NA NA 0.5128166
# 7 C 2019-03-31 Efficiency 0.50107038 NA NA
# 8 C 2019-04-30 Speed NA 0.9013112 NA
# 9 C 2019-05-31 Service NA NA 0.3678922
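As a further option, here is a small sketch using pivot_wider() from tidyr, which creates the new columns in order of first appearance, so sorting the rows by Date beforehand is enough. It assumes a tidyr version that provides pivot_wider(); the seed is only there to make the made-up scores repeatable.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # made-up scores, only so the example is repeatable
dataset <- data.frame(
  Site      = rep(c("A", "B", "C"), each = 3),
  MonthYear = rep(c("May 19", "Apr 19", "Mar 19"), 3),
  Date      = rep(c("2019-05-31", "2019-04-30", "2019-03-31"), 3),
  Measure   = rep(c("Service", "Speed", "Efficiency"), 3),
  Score     = runif(9, 0, 1)
)

# Sort rows chronologically first; the new columns then appear
# in the order their MonthYear values are first seen
final <- dataset %>%
  arrange(Date) %>%
  pivot_wider(names_from = MonthYear, values_from = Score)

names(final)
```

The Date column here holds ISO-formatted strings, which happen to sort chronologically as text; with other date formats it would be safer to convert with as.Date() before arranging.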
So here's my problem: I have about 40 datasets, all CSV files that contain only two columns, (a) Date and (b) Price (in each dataset the price column is named after its country). I used the merge function as follows to consolidate all the data into a single dataset with one date column and several price columns:
merged <- Reduce(function(x, y) merge(x, y, by="Date", all=TRUE), list(a,b,c,d,e,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,aa,ab,ac,ad,ae,af,ag,ah,ai,aj,ak,al,am,an))
What happened is that the Date column now has, for instance, three rows for the same date, with the corresponding country values split across them, e.g.:
# Date India China South Korea
# 01-Jan-2000 5445 NA 4445 NA
# 01-Jan-2000 NA 1234 NA NA
# 01-Jan-2000 NA NA NA 5678
What I actually want is:
# 01-Jan-2000 5445 1234 4445 5678
I don't know how to get this, as the other questions on this topic ask for a summation of values, which I clearly do not need. This is a simplified example. Unfortunately I have daily data from January 2000 to November 2016 for about 43 countries, all messed up. Any help would be appreciated.
I would append all the data frames using rbind and reshape the result with spread(), because the result of merging depends on the data frame you start with.
Reproducible example:
library(dplyr)
library(tidyr)
a <- data.frame(date = Sys.Date()-1:10, cntry = "China", price=round(rnorm(10,20,5),2))
b <- data.frame(date = Sys.Date()-6:15, cntry = "Netherlands", price=round(rnorm(10,50,10),2))
c <- data.frame(date = Sys.Date()-11:20, cntry = "USA", price=round(rnorm(10,70,25),2))
all <- do.call(rbind, list(a,b,c))
all %>% group_by(date) %>% spread(cntry, price)
results in:
date China Netherlands USA
* <date> <dbl> <dbl> <dbl>
1 2016-11-29 NA NA 78.75
2 2016-11-30 NA NA 66.22
3 2016-12-01 NA NA 86.04
4 2016-12-02 NA NA 17.07
5 2016-12-03 NA NA 75.72
6 2016-12-04 NA 46.90 39.57
7 2016-12-05 NA 51.80 65.11
8 2016-12-06 NA 57.50 96.36
9 2016-12-07 NA 46.42 46.93
10 2016-12-08 NA 45.71 57.63
11 2016-12-09 15.41 60.09 NA
12 2016-12-10 16.66 60.07 NA
13 2016-12-11 23.72 66.21 NA
14 2016-12-12 19.82 45.46 NA
15 2016-12-13 14.22 45.07 NA
16 2016-12-14 27.26 NA NA
17 2016-12-15 20.08 NA NA
18 2016-12-16 15.79 NA NA
19 2016-12-17 17.66 NA NA
20 2016-12-18 26.77 NA NA
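If the merged table already exists and only the duplicated dates need to be collapsed, a base R sketch along the following lines might also work. The small merged data frame and the helper first_non_na are hypothetical, made up to mirror the shape of the example in the question, and the sketch assumes each country has at most one non-NA value per date.

```r
# Hypothetical merged result with one date split over three rows
merged <- data.frame(
  Date  = c("01-Jan-2000", "01-Jan-2000", "01-Jan-2000"),
  India = c(5445, NA, NA),
  China = c(NA, 1234, NA),
  Korea = c(NA, NA, 5678)
)

# Illustrative helper: first non-NA value of a vector, or NA if there is none
first_non_na <- function(v) {
  v <- v[!is.na(v)]
  if (length(v) > 0) v[1] else NA_real_
}

# Collapse to one row per date; na.pass keeps rows that contain NAs
collapsed <- aggregate(. ~ Date, data = merged, FUN = first_non_na,
                       na.action = na.pass)
collapsed
```

No summation is involved: each cell of the result is simply the one value that was present for that date and country.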
I have the data shown below. Each column from X1.07m to X11.82m represents a depth, and the values are the corresponding temperatures. I would like to reduce each row to 8 values (8 distinct water depths) by averaging. For example, row 1 of my data runs from column X1.07m to X2.82m (X2.82m because all values beyond that point are NA). I would like to create a separate data frame with the datetime and 8 layer columns (layer1, layer2, layer3, layer4, layer5, layer6, layer7, layer8). Layer1 should start at 1.07m and layer8 should correspond to the deepest non-missing value.
Data: The dput of data can be found on https://dl.dropboxusercontent.com/u/9267938/rcode.R
> head(data.frame(mytest))
datetime Year Month Day Hour Minute Second X1.07m X1.32m X1.57m X1.82m X2.07m X2.32m X2.57m X2.82m X3.07m
1 2014-08-03 12:40:00 2014 8 3 12 40 0 -0.079553637 -0.018856349 -0.022559778 -0.0278269427 -0.019816260 -0.01304108 -0.003394041 -0.010720688 NA
2 2014-08-03 12:50:00 2014 8 3 12 50 0 -0.001409806 0.006434559 0.013885671 0.0033940409 0.009665614 0.01176982 0.011130125 0.019991707 0.02997477
3 2014-08-03 13:00:00 2014 8 3 13 0 0 -0.006942835 -0.011130125 0.010715907 -0.0058745801 -0.005716650 0.01534520 0.030355206 0.024851408 0.04862646
4 2014-08-03 13:10:00 2014 8 3 13 10 0 -0.020586547 0.002935416 -0.016304143 -0.0001326389 -0.003896694 0.00361282 0.004723244 0.013947785 0.03787721
5 2014-08-03 13:20:00 2014 8 3 13 20 0 -0.028394300 -0.023132719 -0.001721911 -0.0139650391 -0.038460075 0.01749898 0.008466864 0.003630492 0.01442467
6 2014-08-03 13:30:00 2014 8 3 13 30 0 -0.034646511 -0.006791177 0.004064423 -0.0038792422 -0.015942808 -0.02029747 -0.014287663 0.007956902 0.01786172
X3.32m X3.57m X3.82m X4.07m X4.32m X4.57m X4.82m X5.07m X5.32m X5.57m X5.82m X6.07m X6.32m X6.57m X6.82m
1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 0.05094966 0.04699597 0.032100892 0.02650842 0.045689389 0.0169759192 -0.006879327 -0.0187681077 -0.030404344 -0.04405705 -0.04501967 NA NA NA NA
3 0.04500833 0.01713256 0.006450535 0.02870071 0.019079580 0.0009741734 -0.024666588 -0.0409943643 -0.030201313 -0.03873463 -0.02893064 NA NA NA NA
4 0.03971244 0.05723497 0.039496306 0.03799276 0.012742073 0.0024111385 -0.023706420 -0.0188563490 -0.033791404 -0.04162619 -0.02979164 -0.045051204 NA NA NA
5 0.03269076 0.05125416 0.054766084 0.03625076 0.005988487 0.0020217180 -0.007510352 -0.0069913419 -0.006656083 -0.01630414 -0.01403812 -0.001580609 NA NA NA
6 0.01913708 0.03932811 0.048955209 0.04764632 0.037480601 0.0205218532 0.004171715 0.0009371753 -0.002468609 -0.04511612 -0.01263816 0.035861544 NA NA NA
X7.07m X7.32m X7.57m X7.82m X8.07m X8.32m X8.57m X8.82m X9.07m X9.32m X9.57m X9.82m X10.07m X10.32m X10.57m X10.82m X11.07m X11.32m X11.57m X11.82m
1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Sometimes a row will have 20, 22 or 25 data points, so the function should account for that and still divide each row into 8 values.
The rcode.R file linked above contains the dput of mytest. It was too big to post here, so I used an external link.
Info added
Each row can have a different number of data points. The aim is to convert them into 8 columns of data using averaging or linear interpolation.
Taking the question as a desire to collapse the values to means of eight equally spaced depths, dplyr and tidyr take us where we need to go:
library(dplyr)
library(tidyr)
mytest %>%
# melt to long form
gather(depth, value, -datetime:-Second, na.rm = TRUE) %>%
# clean depth to number (extract_numeric() is deprecated in newer tidyr, which suggests readr::parse_number() instead)
mutate(depth = extract_numeric(depth)) %>%
# group so cut levels are for each datetime
group_by(datetime) %>%
# group to keep columns; cut depth into 8 levels per group
group_by(datetime, levels = cut(depth, 8, paste0('level', 1:8))) %>%
# collapse groups by taking the mean
summarise(value = mean(value)) %>%
# re-spread new levels to wide form
spread(levels, value) %>%
# re-add other time columns dropped by summarise
inner_join(mytest %>% select(datetime:Second), .)
# Source: local data frame [20 x 15]
#
# datetime Year Month Day Hour Minute Second level1 level2
# (time) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
# 1 2014-08-03 12:40:00 2014 8 3 12 40 0 -0.079553637 -0.0188563490
# 2 2014-08-03 12:50:00 2014 8 3 12 50 0 0.006303474 0.0065298277
# 3 2014-08-03 13:00:00 2014 8 3 13 0 0 -0.002452351 -0.0057956151
# 4 2014-08-03 13:10:00 2014 8 3 13 10 0 -0.011318424 -0.0001388374
# 5 2014-08-03 13:20:00 2014 8 3 13 20 0 -0.017749644 -0.0116420430
# 6 2014-08-03 13:30:00 2014 8 3 13 30 0 -0.012457755 -0.0133731725
# 7 2014-08-03 13:40:00 2014 8 3 13 40 0 -0.020440875 -0.0253538846
# 8 2014-08-03 13:50:00 2014 8 3 13 50 0 -0.058681338 -0.0177194127
# 9 2014-08-03 14:00:00 2014 8 3 14 0 0 -0.037929680 -0.0211918383
# 10 2014-08-03 14:10:00 2014 8 3 14 10 0 -0.027045726 -0.0147261076
# 11 2014-08-03 14:20:00 2014 8 3 14 20 0 -0.048997399 -0.0290804019
# 12 2014-08-03 14:30:00 2014 8 3 14 30 0 -0.059110466 -0.0370898043
# 13 2014-08-03 14:40:00 2014 8 3 14 40 0 -0.067156867 -0.0138750287
# 14 2014-08-03 14:50:00 2014 8 3 14 50 0 -0.049762164 -0.0280648246
# 15 2014-08-03 15:00:00 2014 8 3 15 0 0 -0.028033559 -0.0245379952
# 16 2014-08-03 15:10:00 2014 8 3 15 10 0 -0.044087211 -0.0107995239
# 17 2014-08-03 15:20:00 2014 8 3 15 20 0 -0.028761973 -0.0113161242
# 18 2014-08-03 15:30:00 2014 8 3 15 30 0 -0.013476051 -0.0142316424
# 19 2014-08-03 15:40:00 2014 8 3 15 40 0 -0.012799297 -0.0135366710
# 20 2014-08-03 15:50:00 2014 8 3 15 50 0 -0.012238548 -0.0180806876
# Variables not shown: level3 (dbl), level4 (dbl), level5 (dbl), level6 (dbl), level7 (dbl),
# level8 (dbl)
Note that you should check that these data make sense in context; you've lost your depth data by scaling them.
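To see what the cut() step does for a single row, here is a small base R sketch. The depths follow the 0.25 m spacing of the column names, but the temperatures are invented purely for illustration.

```r
# Made-up profile: 20 depths from 1.07 m in 0.25 m steps, invented temperatures
depth <- seq(1.07, by = 0.25, length.out = 20)
temp  <- seq(-0.05, 0.05, length.out = 20)

# Split this row's depth range into 8 equal-width bins
layer <- cut(depth, breaks = 8, labels = paste0("level", 1:8))

# Average the temperatures that fall into each bin
layer_means <- tapply(temp, layer, mean)

table(layer)   # how many depths ended up in each level
layer_means
```

Since the bins are recomputed from each row's own depth range, level8 of a shallow profile and level8 of a deep profile cover different physical depths, which is the caveat noted above.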
I have a table like the following:
days Debit loaddate
1 23/01/2014 138470289.4 23/01/2014
2 24/01/2014 NA NA
3 25/01/2014 NA NA
4 26/01/2014 NA NA
5 27/01/2014 NA NA
There is one row for each day, and in the loaddate column another date appears after a run of NAs:
28 19/02/2014 NA NA
29 20/02/2014 NA NA
30 21/02/2014 NA NA
31 22/02/2014 9090967.9 22/02/2014
32 23/02/2014 NA NA
33 24/02/2014 308083.5 24/02/2014
I would like to replace each NA in the loaddate column with the previous date in loaddate.
I tried:
for(i in 1:nrow(data3))
{ if (!is.na(data3[i,'Debit']))
{data3[i,'loaddate1']<-as.Date(data3[i,'loaddate'], format='%Y-%m-%d')}
else {data3[i,'loaddate1']<-data3[i-1,'loaddate1']}
}
But I got the wrong format:
> head(data3)
days Debit loaddate loaddate1
1 2014-01-23 138470289 2014-01-23 16093
2 2014-01-24 NA <NA> 16093
3 2014-01-25 NA <NA> 16093
4 2014-01-26 NA <NA> 16093
5 2014-01-27 NA <NA> 16093
6 2014-01-28 NA <NA> 16093
I need to keep the date format as well. If I do:
for(i in 1:nrow(data3))
{ if (!is.na(data3[i,'Debit']))
{data3[i,'loaddate1']<-as.Date(data3[i,'loaddate'], format='%Y-%m-%d')}
else {data3[i,'loaddate1']<-as.Date(data3[i-1,'loaddate1'], format='%Y-%m-%d')}
}
I get the wrong result (with NAs):
days Debit loaddate loaddate1
1 2014-01-23 138470289 2014-01-23 16093
2 2014-01-24 NA <NA> <NA>
3 2014-01-25 NA <NA> <NA>
4 2014-01-26 NA <NA> <NA>
5 2014-01-27 NA <NA> <NA>
6 2014-01-28 NA <NA> <NA>
How can I get the right result in the right format?
Also, is there a better way to do this replacement, i.e. without a loop?
Thanks.
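Try zoo::na.locf and make sure to use the appropriate date format. Passing na.rm = FALSE keeps any leading NAs, so the result has the same length as the column:
library(zoo)
data3$loaddate <- as.Date(na.locf(data3$loaddate, na.rm = FALSE), format = '%d/%m/%Y')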
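If you would rather not depend on zoo, the same last-observation-carried-forward idea can be sketched in base R. The helper fill_forward below is a made-up name, not part of any package, and the small loaddate vector only imitates the column in the question. As a side note, the 16093 seen in the loop output is the date stored as a day count since 1970-01-01, which is what shows up when the Date class is lost during element-wise assignment.

```r
# Illustrative helper: carry the last non-NA value forward
fill_forward <- function(x) {
  # position of the most recent non-NA element for every index (0 if none yet)
  idx <- cummax(ifelse(is.na(x), 0L, seq_along(x)))
  idx[idx == 0L] <- NA          # leading NAs stay NA
  x[idx]                        # subsetting keeps the Date class
}

# Made-up column shaped like loaddate in the question
loaddate <- as.Date(c("23/01/2014", NA, NA, "22/02/2014", NA, "24/02/2014"),
                    format = "%d/%m/%Y")

filled <- fill_forward(loaddate)
filled
```

Because the whole vector is indexed in one step, the result stays a Date vector and no loop over rows is needed.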