R spread data while preserving order - r

I've got a dataset that looks like this -
dataset = data.frame(Site=c(rep('A',3),rep('B',3),rep('C',3)),MonthYear = c(rep(c('May 19','Apr 19','Mar 19'),3)),Date=c(rep(c('2019-05-31','2019-04-30','2019-03-31'),3)),Measure=c(rep(c('Service','Speed','Efficiency'),3)),Score=runif(9,0,1))
My objective is to transform that dataset by using the spread function.
However after doing so, I'd like the spread columns to be ordered based on the Date column (ascending order).
This would mean that the spread columns are in the following order: Mar 19, Apr 19, May 19
Here's my attempt -
library(dplyr)
library(tidyr)
final = dataset %>% spread(MonthYear,Score)
My attempt results in the spread columns being arranged in alphabetical order. And not in chronological order.
Thanks in advance for your inputs

Order the appropriate factor levels and you're done.
library(tidyr)
dataset = data.frame(Site=c(rep('A',3),rep('B',3),rep('C',3)),MonthYear = c(rep(c('May 19','Apr 19','Mar 19'),3)),Date=c(rep(c('2019-05-31','2019-04-30','2019-03-31'),3)),Measure=c(rep(c('Service','Speed','Efficiency'),3)),Score=runif(9,0,1))
dataset$MonthYear <- factor(dataset$MonthYear, levels = c("Mar 19", "Apr 19", "May 19"))
spread(dataset, key = MonthYear, value = Score)
Site Date Measure Mar 19 Apr 19 May 19
1 A 2019-03-31 Efficiency 0.09789678 NA NA
2 A 2019-04-30 Speed NA 0.4645101 NA
3 A 2019-05-31 Service NA NA 0.89602042
4 B 2019-03-31 Efficiency 0.59516115 NA NA
5 B 2019-04-30 Speed NA 0.5208239 NA
6 B 2019-05-31 Service NA NA 0.45334636
7 C 2019-03-31 Efficiency 0.93941294 NA NA
8 C 2019-04-30 Speed NA 0.5439323 NA
9 C 2019-05-31 Service NA NA 0.07971263

The only issue is that dataset$MonthYear is a factor and is not ordered in the way you like.
#Find Order by Date column
dLvl <- unique(dataset$MonthYear[order(dataset$Date)])
levels(dataset$MonthYear)
#[1] "Apr 19" "Mar 19" "May 19"
dataset$MonthYear <- factor(dataset$MonthYear, levels = dLvl)
levels(dataset$MonthYear)
#[1] "Mar 19" "Apr 19" "May 19"
final = dataset %>% spread(MonthYear,Score)
final
# Site Date Measure Mar 19 Apr 19 May 19
#1 A 2019-03-31 Efficiency 0.9928678 NA NA
#2 A 2019-04-30 Speed NA 0.1457551 NA
#3 A 2019-05-31 Service NA NA 0.6047312
#4 B 2019-03-31 Efficiency 0.4419907 NA NA
#5 B 2019-04-30 Speed NA 0.5799068 NA

If you convert them to dates you can order the columns based on the order of those dates
df <-
dataset %>%
spread(MonthYear,Score)
col_dts <- as.Date(paste0('01', names(df)), format = '%d%b %y')
df <- df[order(!is.na(col_dts), col_dts)]
df
# Site Date Measure Mar 19 Apr 19 May 19
# 1 A 2019-03-31 Efficiency 0.76653679 NA NA
# 2 A 2019-04-30 Speed NA 0.0416291 NA
# 3 A 2019-05-31 Service NA NA 0.3885358
# 4 B 2019-03-31 Efficiency 0.02538343 NA NA
# 5 B 2019-04-30 Speed NA 0.7264234 NA
# 6 B 2019-05-31 Service NA NA 0.5128166
# 7 C 2019-03-31 Efficiency 0.50107038 NA NA
# 8 C 2019-04-30 Speed NA 0.9013112 NA
# 9 C 2019-05-31 Service NA NA 0.3678922
Or you could change the factor levels according to the order of the date values
new_levels <-
with(dataset, {
mons <- unique(MonthYear)
ord <- order(as.Date(paste0('01', mons), format = '%d%b %y'))
mons[ord]})
dataset$MonthYear <- factor(dataset$MonthYear, levels = new_levels)
dataset %>%
spread(MonthYear,Score)
# Site Date Measure Mar 19 Apr 19 May 19
# 1 A 2019-03-31 Efficiency 0.76653679 NA NA
# 2 A 2019-04-30 Speed NA 0.0416291 NA
# 3 A 2019-05-31 Service NA NA 0.3885358
# 4 B 2019-03-31 Efficiency 0.02538343 NA NA
# 5 B 2019-04-30 Speed NA 0.7264234 NA
# 6 B 2019-05-31 Service NA NA 0.5128166
# 7 C 2019-03-31 Efficiency 0.50107038 NA NA
# 8 C 2019-04-30 Speed NA 0.9013112 NA
# 9 C 2019-05-31 Service NA NA 0.3678922
You can also use reorder with dcast (not sure why it doesn't work with spread)
library(data.table)
dataset %>%
dcast(Site + Date + Measure ~ reorder(MonthYear, -order(Date)),
value.var = 'Score')
# Site Date Measure Mar 19 Apr 19 May 19
# 1 A 2019-03-31 Efficiency 0.76653679 NA NA
# 2 A 2019-04-30 Speed NA 0.0416291 NA
# 3 A 2019-05-31 Service NA NA 0.3885358
# 4 B 2019-03-31 Efficiency 0.02538343 NA NA
# 5 B 2019-04-30 Speed NA 0.7264234 NA
# 6 B 2019-05-31 Service NA NA 0.5128166
# 7 C 2019-03-31 Efficiency 0.50107038 NA NA
# 8 C 2019-04-30 Speed NA 0.9013112 NA
# 9 C 2019-05-31 Service NA NA 0.3678922

Related

Day difference between rows related to other row that is not NA

Let I have such a data frame(df):
Date x
20.01.2016 34
21.01.2016 28
22.01.2016 NA
23.01.2016 NA
24.01.2016 56
25.01.2016 NA
26.01.2016 28
I want to add such a column(z) to this data frame
Date x z
20.01.2016 34 -
21.01.2016 28 1
22.01.2016 NA NA
23.01.2016 NA NA
24.01.2016 56 3
25.01.2016 NA NA
26.01.2016 28 2
where z shows the day difference between the related row's date and closest previous date (where x is not NA).
For example for the date 24.01.2016 the closest previous date is 21.01.2016 where x is not NA. So the day difference of these two dates is 3.
How can I do this using R?
I will be very glad for any help. Thanks a lot.
Cinsidering that your date variable is as.Date,(i.e. df$Date <- as.Date(df$Date, format = '%d.%m.%Y')) then,
df$z[!is.na(df$x)] <- c(NA, diff.difftime(df$Date[!is.na(df$x)]))
df
# Date x z
#1 2016-01-20 34 NA
#2 2016-01-21 28 1
#3 2016-01-22 NA NA
#4 2016-01-23 NA NA
#5 2016-01-24 56 3
#6 2016-01-25 NA NA
#7 2016-01-26 28 2
We can use data.table
library(data.table)
setDT(df)[, Date := as.IDate(Date, "%d.%m.%Y")][!is.na(x), z := Date - shift(Date)]
df
# Date x z
#1: 2016-01-20 34 NA
#2: 2016-01-21 28 1
#3: 2016-01-22 NA NA
#4: 2016-01-23 NA NA
#5: 2016-01-24 56 3
#6: 2016-01-25 NA NA
#7: 2016-01-26 28 2

Duplicate rows while using Merge function in R - but I dont want the sum

So here's my problem, I have about 40 datasets, all csv files that contain only two columns, (a) Date and (b) Price (for each dataset the price column is named as its country).. I used the merge function as follows to consolidate all data into a single dataset with one date column and several price columns. This was the function I used:
merged <- Reduce(function(x, y) merge(x, y, by="Date", all=TRUE), list(a,b,c,d,e,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,aa,ab,ac,ad,ae,af,ag,ah,ai,aj,ak,al,am,an))
What has happened is I have for instance in date column, 3 values for same date but the corresponding country values are split. e.g.:
# Date India China South Korea
# 01-Jan-2000 5445 NA 4445 NA
# 01-Jan-2000 NA 1234 NA NA
# 01-Jan-2000 NA NA NA 5678
I actually want
# 01-Jan-2000 5445 1234 4445 5678
I dont know how to get this, as the other questions related to this topic ask for summation of values which I clearly do not need. This is a simple example. Unfortunately I have daily data from Jan 2000 to November 2016 for about 43 countries, all messed up. Any help to solve this would be appreciated.
I would append all dataframes using rbind and reshape the result with spread(). As merging depends on the dataframe you start with.
Reproducable example:
library(dplyr)
a <- data.frame(date = Sys.Date()-1:10, cntry = "China", price=round(rnorm(10,20,5),2))
b <- data.frame(date = Sys.Date()-6:15, cntry = "Netherlands", price=round(rnorm(10,50,10),2))
c <- data.frame(date = Sys.Date()-11:20, cntry = "USA", price=round(rnorm(10,70,25),2))
all <- do.call(rbind, list(a,b,c))
all %>% group_by(date) %>% spread(cntry, price)
results in:
date China Netherlands USA
* <date> <dbl> <dbl> <dbl>
1 2016-11-29 NA NA 78.75
2 2016-11-30 NA NA 66.22
3 2016-12-01 NA NA 86.04
4 2016-12-02 NA NA 17.07
5 2016-12-03 NA NA 75.72
6 2016-12-04 NA 46.90 39.57
7 2016-12-05 NA 51.80 65.11
8 2016-12-06 NA 57.50 96.36
9 2016-12-07 NA 46.42 46.93
10 2016-12-08 NA 45.71 57.63
11 2016-12-09 15.41 60.09 NA
12 2016-12-10 16.66 60.07 NA
13 2016-12-11 23.72 66.21 NA
14 2016-12-12 19.82 45.46 NA
15 2016-12-13 14.22 45.07 NA
16 2016-12-14 27.26 NA NA
17 2016-12-15 20.08 NA NA
18 2016-12-16 15.79 NA NA
19 2016-12-17 17.66 NA NA
20 2016-12-18 26.77 NA NA

Dividing the data in multiple columns to 8 values logically in R

I have the data as following. Each column starting from 1.07m to 11.82m represents the depth and the values corresponds to the temperature. I am interested in reducing the datasets into 8 sets (8 distinct water depths). While doing so I would like to use the averaging method. For example in row1 of my data starts from column x1.07m to x2.82m (x2.82m because all the values beyond that point are NA). I would like to create a separate data frame with data and 8 columns (layer1, layer2, layer3, layer4, layer5, layer6, layer7, layer8). Layer1 value should start from 1.07m and the Layer8 should correspond to the maximum non-zero value.
Data: The dput of data can be found on https://dl.dropboxusercontent.com/u/9267938/rcode.R
> head(data.frame(mytest))
datetime Year Month Day Hour Minute Second X1.07m X1.32m X1.57m X1.82m X2.07m X2.32m X2.57m X2.82m X3.07m
1 2014-08-03 12:40:00 2014 8 3 12 40 0 -0.079553637 -0.018856349 -0.022559778 -0.0278269427 -0.019816260 -0.01304108 -0.003394041 -0.010720688 NA
2 2014-08-03 12:50:00 2014 8 3 12 50 0 -0.001409806 0.006434559 0.013885671 0.0033940409 0.009665614 0.01176982 0.011130125 0.019991707 0.02997477
3 2014-08-03 13:00:00 2014 8 3 13 0 0 -0.006942835 -0.011130125 0.010715907 -0.0058745801 -0.005716650 0.01534520 0.030355206 0.024851408 0.04862646
4 2014-08-03 13:10:00 2014 8 3 13 10 0 -0.020586547 0.002935416 -0.016304143 -0.0001326389 -0.003896694 0.00361282 0.004723244 0.013947785 0.03787721
5 2014-08-03 13:20:00 2014 8 3 13 20 0 -0.028394300 -0.023132719 -0.001721911 -0.0139650391 -0.038460075 0.01749898 0.008466864 0.003630492 0.01442467
6 2014-08-03 13:30:00 2014 8 3 13 30 0 -0.034646511 -0.006791177 0.004064423 -0.0038792422 -0.015942808 -0.02029747 -0.014287663 0.007956902 0.01786172
X3.32m X3.57m X3.82m X4.07m X4.32m X4.57m X4.82m X5.07m X5.32m X5.57m X5.82m X6.07m X6.32m X6.57m X6.82m
1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 0.05094966 0.04699597 0.032100892 0.02650842 0.045689389 0.0169759192 -0.006879327 -0.0187681077 -0.030404344 -0.04405705 -0.04501967 NA NA NA NA
3 0.04500833 0.01713256 0.006450535 0.02870071 0.019079580 0.0009741734 -0.024666588 -0.0409943643 -0.030201313 -0.03873463 -0.02893064 NA NA NA NA
4 0.03971244 0.05723497 0.039496306 0.03799276 0.012742073 0.0024111385 -0.023706420 -0.0188563490 -0.033791404 -0.04162619 -0.02979164 -0.045051204 NA NA NA
5 0.03269076 0.05125416 0.054766084 0.03625076 0.005988487 0.0020217180 -0.007510352 -0.0069913419 -0.006656083 -0.01630414 -0.01403812 -0.001580609 NA NA NA
6 0.01913708 0.03932811 0.048955209 0.04764632 0.037480601 0.0205218532 0.004171715 0.0009371753 -0.002468609 -0.04511612 -0.01263816 0.035861544 NA NA NA
X7.07m X7.32m X7.57m X7.82m X8.07m X8.32m X8.57m X8.82m X9.07m X9.32m X9.57m X9.82m X10.07m X10.32m X10.57m X10.82m X11.07m X11.32m X11.57m X11.82m
1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Sometimes the data points will be 20, 22, 25 points so the function should be written such that it would try to account those information and divide into 8 data values for each rows.
Rcode.R linked to dropbox has the code that has dput of mytest. It was pretty big to be posted here. So I posted a external link.
Info added
Each row would have different number of data. The motive is to convert them into 8 columns of data using averaging or linear interpolation.
Taking the question as a desire to collapse the values to means of eight equally spaced depths, dplyr and tidyr take us where we need to go:
library(dplyr)
library(tidyr)
mytest %>%
# melt to long form
gather(depth, value, -datetime:-Second, na.rm = TRUE) %>%
# clean depth to number
mutate(depth = extract_numeric(depth)) %>%
# group so cut levels are for each datetime
group_by(datetime) %>%
# group to keep columns; cut depth into 8 levels per group
group_by(datetime, levels = cut(depth, 8, paste0('level', 1:8))) %>%
# collapse groups by taking the mean
summarise(value = mean(value)) %>%
# re-spread new levels to wide form
spread(levels, value) %>%
# re-add other time columns dropped by summarise
inner_join(mytest %>% select(datetime:Second), .)
# Source: local data frame [20 x 15]
#
# datetime Year Month Day Hour Minute Second level1 level2
# (time) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
# 1 2014-08-03 12:40:00 2014 8 3 12 40 0 -0.079553637 -0.0188563490
# 2 2014-08-03 12:50:00 2014 8 3 12 50 0 0.006303474 0.0065298277
# 3 2014-08-03 13:00:00 2014 8 3 13 0 0 -0.002452351 -0.0057956151
# 4 2014-08-03 13:10:00 2014 8 3 13 10 0 -0.011318424 -0.0001388374
# 5 2014-08-03 13:20:00 2014 8 3 13 20 0 -0.017749644 -0.0116420430
# 6 2014-08-03 13:30:00 2014 8 3 13 30 0 -0.012457755 -0.0133731725
# 7 2014-08-03 13:40:00 2014 8 3 13 40 0 -0.020440875 -0.0253538846
# 8 2014-08-03 13:50:00 2014 8 3 13 50 0 -0.058681338 -0.0177194127
# 9 2014-08-03 14:00:00 2014 8 3 14 0 0 -0.037929680 -0.0211918383
# 10 2014-08-03 14:10:00 2014 8 3 14 10 0 -0.027045726 -0.0147261076
# 11 2014-08-03 14:20:00 2014 8 3 14 20 0 -0.048997399 -0.0290804019
# 12 2014-08-03 14:30:00 2014 8 3 14 30 0 -0.059110466 -0.0370898043
# 13 2014-08-03 14:40:00 2014 8 3 14 40 0 -0.067156867 -0.0138750287
# 14 2014-08-03 14:50:00 2014 8 3 14 50 0 -0.049762164 -0.0280648246
# 15 2014-08-03 15:00:00 2014 8 3 15 0 0 -0.028033559 -0.0245379952
# 16 2014-08-03 15:10:00 2014 8 3 15 10 0 -0.044087211 -0.0107995239
# 17 2014-08-03 15:20:00 2014 8 3 15 20 0 -0.028761973 -0.0113161242
# 18 2014-08-03 15:30:00 2014 8 3 15 30 0 -0.013476051 -0.0142316424
# 19 2014-08-03 15:40:00 2014 8 3 15 40 0 -0.012799297 -0.0135366710
# 20 2014-08-03 15:50:00 2014 8 3 15 50 0 -0.012238548 -0.0180806876
# Variables not shown: level3 (dbl), level4 (dbl), level5 (dbl), level6 (dbl), level7 (dbl),
# level8 (dbl)
Note that you should check that these data make sense in context; you've lost your depth data by scaling them.

Daily averages of all data frame variables including NA values with aggregate function

I want to calculate daily means of all variables in my dataframe which includes NA values. All my databases have a value every 30min, so I´m very interested in using the timestamp with aggregate function to obtain daily, weekly, monthly... aggregated data.
My dataframe is 37795 rows x 54 variables. I´ve tried two ways to do that, first option does not give me daily means cause I obtained too high values (not logical). Second option gives me almost all NA values. I do not what to do.
I write my dataframe head and code below.
head(data)
timestamp day month year.x hour minute doy.x rn_1_1_1 ppfd_1_1_1
1 2013-07-06 00:00:00 6 7 2013 0 0 187.000 -84.37381 0.754
2 2013-07-06 00:30:00 6 7 2013 0 30 187.020 -84.07990 0.808
3 2013-07-06 01:00:00 6 7 2013 1 0 187.041 -82.19991 0.808
4 2013-07-06 01:30:00 6 7 2013 1 30 187.062 -81.12341 0.831
5 2013-07-06 02:00:00 6 7 2013 2 0 187.083 -79.57474 0.708
6 2013-07-06 02:30:00 6 7 2013 2 30 187.104 -77.72460 0.639
ppfdr_1_1_1 p_rain_1_1_1 swc_1_1_1 swc_2_1_1 swc_3_1_1 air_pressure air_pressure.1
1 0.624 0 0.07230304 0.09577876 0.134602791 101212.4165 1012.124165
2 0.587 0 0.07233134 0.09569421 0.134479816 101181.8094 1011.818094
3 0.713 0 0.07242914 0.09566160 0.134203719 101166.0948 1011.660948
4 0.72 0 0.07252077 0.09563419 0.134149141 101144.6151 1011.446151
5 0.564 0 0.07261925 0.09560297 0.134095791 101144.8662 1011.448662
6 0.706 0 0.07271843 0.09557789 0.134037119 101144.5084 1011.445084
u_rot v_rot w_rot wind_speed u. h_scr_qc01_man
1 5.546047919 1.42E-14 4.76E-16 5.546047919 0.426515403 -28.07603618
2 5.122724997 6.94E-15 -8.00E-16 5.122724997 0.408213459 -34.39110979
3 5.248639421 4.56E-15 7.28E-17 5.248639421 0.393959075 -33.29033501
4 4.845257286 2.81E-14 -1.33E-17 4.845257286 0.365475898 -32.62427147
5 4.486426895 1.39E-14 -4.43E-16 4.486426895 0.335905384 -33.80219189
6 4.109603841 7.08E-15 -9.76E-16 4.109603841 0.312610588 -35.77289349
fco2_scr_qc01_man le_scr_qc01_man fco2_scr_qc0 fco2_scr_qc0_man date year.y time
1 -0.306504951 NA NA NA 06-jul-13 2013 0:00
2 -0.206266524 NA -0.206266524 -0.206266524 06-jul-13 2013 0:30
3 -0.268508139 NA -0.268508139 -0.268508139 06-jul-13 2013 1:00
4 -0.203804516 0.426531598 -0.203804516 -0.203804516 06-jul-13 2013 1:30
5 -0.217438742 -0.358248118 -0.217438742 -0.217438742 06-jul-13 2013 2:00
6 -0.193778528 2.571063044 -0.193778528 -0.193778528 06-jul-13 2013 2:30
doy_ent doy.y doy_cum doy_cum_ent mes nrecord bat panel_temp vwc_0.1
1 187 187.0000 187.0000 187 7 24 12.57 22.93 0.06284828
2 187 187.0208 187.0208 187 7 25 12.56 22.85 0.06267169
3 187 187.0417 187.0417 187 7 26 12.55 22.58 0.06261738
4 187 187.0625 187.0625 187 7 27 12.54 22.3 0.06247716
5 187 187.0833 187.0833 187 7 28 12.53 22.01 0.06249525
6 187 187.1042 187.1042 187 7 29 12.52 21.82 0.06236862
vwc_0.5 vwc_1.5 temp_0.1 temp_0.5 temp_1.5 tempsd_0.1 tempsd_0.5 tempsd_1.5
1 0.07569027 0.1007845 30.9 28.96 25.14 0.372 0.961 0.767
2 0.07569027 0.1007743 30.8 28.85 24.99 0.181 1.361 1.087
3 0.07568554 0.1008558 30.53 28.8 25.03 0.98 1.476 0.351
4 0.07559577 0.1008507 30.52 29.09 25.11 0.186 0.229 0.556
5 0.07559577 0.1007743 30.11 29.09 24.87 1.331 0.191 0.954
6 0.07556271 0.1007285 30.15 29.33 25.04 1.447 1.078 0.2
pair pair_avg CO2_0.1 CO2_0.5 CO2_1.5 DCO2_0.1 DCO2_0.5
1 101.2124 101.2118 1161.592832 3275.1134 4888.231603 -24.67422109 34.88538221
2 101.1818 101.2131 1168.144925 3338.24016 4941.418642 6.55209301 63.12675931
3 101.1661 101.2090 1201.049131 3435.235974 5012.525851 32.90420541 96.9958144
4 101.1446 101.2007 1268.613941 3556.723878 5092.96558 67.56481067 121.4879035
5 101.1449 101.1906 1364.315214 3680.188043 5164.795759 95.7012722 123.464165
6 101.1445 101.1805 1472.975286 3808.988677 5236.40855 108.6600723 128.8006346
DCO2_1.5
1 31.30293041
2 53.18703947
3 71.10720845
4 80.43972916
5 71.83017884
6 71.61279156
## Daily avg - OPTION 1
data$timestamp <- as.POSIXct(data$timestamp, format = "%d/%m/%Y %H:%M",tz ="GMT")
> dates <- format(data$timestamp,"%Y/%m/%d",tz = "GMT")
> datadates <- cbind(data,dates)
> dailydata_avg <- aggregate(. ~ dates, datadates, FUN=mean, na.rm=TRUE, na.action = "na.pass")
head(dailydata_avg)
dates timestamp day month year.x hour minute doy.x rn_1_1_1 ppfd_1_1_1
1 2013/07/06 1373111100 6 7 2013 11.5 15 187.489 159.7788 3580.562
2 2013/07/07 1373197500 7 7 2013 11.5 15 188.489 154.0925 3506.688
3 2013/07/08 1373283900 8 7 2013 11.5 15 189.489 152.5259 3460.667
4 2013/07/09 1373370300 9 7 2013 11.5 15 190.489 131.1619 2965.250
5 2013/07/10 1373456700 10 7 2013 11.5 15 191.489 136.7853 3171.958
6 2013/07/11 1373543100 11 7 2013 11.5 15 192.489 145.2757 3282.167
ppfdr_1_1_1 p_rain_1_1_1 swc_1_1_1 swc_2_1_1 swc_3_1_1 air_pressure air_pressure.1
1 2552.396 1.0000 0.07095847 0.09606378 18341.81 25940.167 25940.167
2 2532.542 1.0000 0.06994341 0.09502167 18065.98 24891.000 24891.000
3 2523.562 1.0000 0.06860553 0.09379282 17777.02 23107.271 23107.271
4 2336.000 1.0000 0.06717054 0.09268716 17526.50 19309.500 19309.500
5 2607.229 1.0625 0.06620048 0.09166904 17275.56 8385.646 8385.646
6 2484.521 1.0000 0.06562964 0.09083684 17028.94 3535.438 3535.438
u_rot v_rot w_rot wind_speed u. h_scr_qc01_man fco2_scr_qc01_man
1 32167.83 2215.875 2041.354 32167.83 28531.44 18197.75 15365.65
2 30878.27 1911.312 1939.917 30878.27 26929.62 17605.52 14955.56
3 26052.96 2261.417 2116.458 26052.96 23305.83 19167.98 18399.33
4 17284.04 1987.438 2139.083 17284.04 17704.35 20349.92 18137.65
5 12028.06 2053.812 1960.417 12028.06 15670.00 21997.83 21120.19
6 15607.50 1997.417 1907.646 15607.50 15384.56 18000.94 18810.62
le_scr_qc01_man fco2_scr_qc0 fco2_scr_qc0_man date year.y time doy_ent doy.y
1 17409.67 13032.10 13027.90 137 2013 44.5 187 187.4896
2 15524.38 12077.17 12072.92 163 2013 44.5 188 188.4896
3 16407.71 14775.94 14770.56 189 2013 44.5 189 189.4896
4 16788.04 15024.79 15019.02 215 2013 44.5 190 190.4896
5 17955.58 17737.25 17730.75 241 2013 44.5 191 191.4896
6 14610.02 16605.48 16599.33 267 2013 44.5 192 192.4896
doy_cum doy_cum_ent mes nrecord bat panel_temp vwc_0.1 vwc_0.5 vwc_1.5
1 187.4896 187.5 7 28966.375 111.5208 1836.250 4638.833 4594.396 37.35417
2 188.4896 188.5 7 20801.417 111.7292 1900.812 4656.875 4392.979 26.68750
3 189.4896 189.5 7 4394.500 110.6042 1934.792 4675.604 4238.229 65.20833
4 190.4896 190.5 7 9467.708 104.0000 2090.896 4776.521 4178.729 54.12500
5 191.4896 191.5 7 14796.375 109.7500 2145.875 4907.292 4161.312 108.39583
6 192.4896 192.5 7 20127.958 109.3125 1934.375 4876.021 4123.458 143.10417
temp_0.1 temp_0.5 temp_1.5 tempsd_0.1 tempsd_0.5 tempsd_1.5 pair pair_avg CO2_0.1
1 2018.438 1565.812 797.8750 470.8125 474.3958 508.8333 101.1268 101.1323 10400.27
2 1998.438 1574.000 783.1875 478.3333 460.4583 566.0208 101.0764 101.0789 11292.75
3 1994.833 1568.104 780.2083 463.8125 453.1667 488.5625 100.9967 101.0036 13288.25
4 2042.625 1564.875 780.1667 465.0000 599.2708 437.6042 100.8520 100.8665 16156.60
5 2114.708 1576.729 780.5000 471.5833 406.5417 484.6875 100.4828 100.5169 18656.50
6 2124.604 1591.125 781.8125 516.7500 530.3333 510.7500 100.3025 100.2947 14586.60
CO2_0.5 CO2_1.5 DCO2_0.1 DCO2_0.5 DCO2_1.5
1 26360.38 34371.31 19795.81 20637.94 27123.92
2 26939.60 34558.17 18838.38 20464.56 20452.58
3 27603.06 34608.31 17413.15 19998.02 22754.85
4 28572.69 34678.38 19294.62 21894.92 18379.62
5 28983.29 34644.15 20251.17 20409.58 22077.40
6 28236.12 34736.67 17031.02 18852.04 19684.69`
## Daily avg - OPTION 2
data$timestamp <- as.POSIXct(data$timestamp, format = "%d/%m/%Y %H:%M",tz ="GMT")
datatime <- data$timestamp
dailydata_avg <- aggregate( data,
by = list('DATES'= format(datatime,'%Y%m%d' )),
FUN = mean, na.rm=T)
I obtain this console message:
1: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
2: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
3: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
head(dailydata_avg)
DATES timestamp day month year.x hour minute doy.x rn_1_1_1 ppfd_1_1_1
1 20130706 2013-07-06 13:45:00 6 7 2013 11.5 15 187.489 159.7788 NA
2 20130707 2013-07-07 13:45:00 7 7 2013 11.5 15 188.489 154.0925 NA
3 20130708 2013-07-08 13:45:00 8 7 2013 11.5 15 189.489 152.5259 NA
4 20130709 2013-07-09 13:45:00 9 7 2013 11.5 15 190.489 131.1619 NA
5 20130710 2013-07-10 13:45:00 10 7 2013 11.5 15 191.489 136.7853 NA
6 20130711 2013-07-11 13:45:00 11 7 2013 11.5 15 192.489 145.2757 NA
ppfdr_1_1_1 p_rain_1_1_1 swc_1_1_1 swc_2_1_1 swc_3_1_1 air_pressure air_pressure.1
1 NA NA 0.07095847 0.09606378 NA NA NA
2 NA NA 0.06994341 0.09502167 NA NA NA
3 NA NA 0.06860553 0.09379282 NA NA NA
4 NA NA 0.06717054 0.09268716 NA NA NA
5 NA NA 0.06620048 0.09166904 NA NA NA
6 NA NA 0.06562964 0.09083684 NA NA NA
u_rot v_rot w_rot wind_speed u. h_scr_qc01_man fco2_scr_qc01_man le_scr_qc01_man
1 NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA
fco2_scr_qc0 fco2_scr_qc0_man date year.y time doy_ent doy.y doy_cum doy_cum_ent
1 NA NA NA 2013 NA 187 187.4896 187.4896 187.5
2 NA NA NA 2013 NA 188 188.4896 188.4896 188.5
3 NA NA NA 2013 NA 189 189.4896 189.4896 189.5
4 NA NA NA 2013 NA 190 190.4896 190.4896 190.5
5 NA NA NA 2013 NA 191 191.4896 191.4896 191.5
6 NA NA NA 2013 NA 192 192.4896 192.4896 192.5
mes nrecord bat panel_temp vwc_0.1 vwc_0.5 vwc_1.5 temp_0.1 temp_0.5 temp_1.5
1 7 NA NA NA NA NA NA NA NA NA
2 7 NA NA NA NA NA NA NA NA NA
3 7 NA NA NA NA NA NA NA NA NA
4 7 NA NA NA NA NA NA NA NA NA
5 7 NA NA NA NA NA NA NA NA NA
6 7 NA NA NA NA NA NA NA NA NA
tempsd_0.1 tempsd_0.5 tempsd_1.5 pair pair_avg CO2_0.1 CO2_0.5 CO2_1.5 DCO2_0.1
1 NA NA NA 101.1268 101.1323 NA NA NA NA
2 NA NA NA 101.0764 101.0789 NA NA NA NA
3 NA NA NA 100.9967 101.0036 NA NA NA NA
4 NA NA NA 100.8520 100.8665 NA NA NA NA
5 NA NA NA 100.4828 100.5169 NA NA NA NA
6 NA NA NA 100.3025 100.2947 NA NA NA NA
DCO2_0.5 DCO2_1.5
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 NA NA
How could I do it?
Thanks!!
I didn't use the aggregate function, I used the tapply one.
This is the code, that deals with NA's, I came up with:
# create a sequence of DateTime with half-hourly data
DateTime <- seq.POSIXt(from = as.POSIXct("2015-05-01 00:00:00", tz = "Etc/GMT+12"),
to = as.POSIXct("2015-05-30 23:59:00", tz = "Etc/GMT+12"), by = 1800)
# create some dummy data of the same length as DateTime vector
aa <- runif(1440, 5.0, 7.5)
bb <- NA
df <- data.frame(DateTime, aa, bb)
# replace a cell with NA in the "a" column
df[19,2] <- NA # dataframe = df, row = 19, column = 2
# create DateHour column to use later
df$DateHour <- paste(format(df$DateTime, "%Y/%m/%d"), format(df$DateTime, "%H"), sep = " ")
View(df)
# Hourly means
# Calculate hourly mean values
aa.HourlyMean <- tapply(df$aa, df$DateHour, mean, na.rm = TRUE)
# convert the vector to dataframe
aa.HourlyMean <- data.frame(aa.HourlyMean)
# Extract the DateHour column from the "aa" dataframe
aa.HourlyMean$DateHour <- row.names(aa.HourlyMean);
# Delete rownames of "aa" dataframe
row.names(aa.HourlyMean) <- NULL
# Create a tidy DateTime column
aa.HourlyMean$DateTime <- as.POSIXct(aa.HourlyMean$DateHour, "%Y/%m/%d %H", tz = "Etc/GMT+12")
# change to a tidy dataframe
aa.HourlyMean <- aa.HourlyMean[,c(3,2,1)]
# You can delete any column (for example, DateHour) by
# aa.HourlyMean$Date <- NULL
# You can rename a column with "plyr" package by
# rename(aa.HourlyMean)[3] <- "NewColumnName"
# View the hourly mean of the "aa" dataframe
View(aa.HourlyMean)
# You can do the same with the "bb" vector
bb.HourlyMean <- tapply(df$bb, df$DateHour, mean, na.rm = TRUE)
bb.HourlyMean <- data.frame(bb.HourlyMean)
# View the hourly mean of the "bb" vector
View(bb.HourlyMean)
# /Hourly means
You then can combine in one dataframe the aa.HourlyMean and bb.HourlyMean vectors.
# Daily means
df$Date <- format(df$DateTime, "%Y/%m/%d")
aa.DailyMean <- tapply(df$aa, df$Days, mean, na.rm = TRUE)
aa.DailyMean <- data.frame(aa.DailyMean)
aa.DailyMean$Date <- row.names(aa.DailyMean); row.names(aa.DailyMean) <- NULL
aa.DailyMean <- aa.DailyMean[,c(2,1)]
View(aa.DailyMean)
# /Daily means
# Weekly means
df$YearWeek <- paste(format(df$DateTime, "%Y"), strftime(DateTime, format = "%W"), sep = " ")
aa.WeeklyMean <- tapply(df$aa, df$YearWeek, mean, na.rm = TRUE)
aa.WeeklyMean <- data.frame(aa.WeeklyMean)
aa.WeeklyMean$YearWeek <- row.names(aa.WeeklyMean); row.names(aa.WeeklyMean) <- NULL
aa.WeeklyMean <- aa.WeeklyMean[,c(2,1)]
View(aa.WeeklyMean)
# /Weekly means
I created the mean values for hourly, daily and weekly observations but you get the idea how to create the monthly, yearly, ... ones.

R update values based on event based on multiple columns

I recently asked a similar question here for having all "Activities" in one column. The solution provided worked very well. Now I would like to change something in the design to be able to record more detailed information. The table shows information recorded from different fields over several years. All activities on the fields are recored by date. Now I would like to add a "Season" column that groups all values belonging to a harvest season. As harvest season I define the time in between two harvest events. (see table at the bottom on how the result should look like). The problem here is that seeding is sometimes done in the previous year (e.g.2012) but fields are harvested in 2013. All events would need to be grouped as 2013.
What would I need to change if I start recording more information and give all "Activities" a separate column? I tried:
library(data.table)
DF <- read.table(text="ID|Field|Date |Tillage|Seeding|Fertilizer|Spraying|Harvest
1|A |2012/08/01|Plough |NA|NA|NA|NA
2|A |2012/08/24|NA |Wheat|NA|NA|NA
3|A |2013/03/05|NA |NA|NA|ProduktA|NA
4|A |2013/03/05|NA|NA|TypeB|NA|NA
5|A |2013/07/25|NA |NA|NA|NA|9t
6|B |2012/09/01|Plough |NA|NA|NA|NA
7|B |2012/09/05|NA |Barley|NA|NA|NA
8|B |2013/04/05|NA |NA|NA|ProductB|NA
9|B |2013/07/28|NA |NA|NA|NA|10t
10|B |2010/08/24|Cultivator |NA|NA|NA|NA
11|B |2010/09/29|NA |NA|NA|NA|NA
12|B |2011/05/01|NA|NA|TypeB|NA|NA
13|B |2011/07/12|NA |NA|NA|NA|6t
14|A |2011/09/01|NA |Barley|NA|NA|NA
15|A |2011/10/10|NA |NA|NA|ProductC|NA
16|A |2012/04/10|NA|NA|TypeA|NA|NA
17|A |2012/08/02|NA |NA|NA|NA|7t|",
sep="|", header=TRUE, stringsAsFactors=FALSE)
DT <- data.table(DF)
DT[, Harvest:=gsub(" ", "", Harvest, fixed=TRUE)]
DT[, Date:=as.POSIXct(Date)]
setkeyv(DT, c("Field", "Date"))
DT[, Season:=cumsum(c("", !is.na(head(Harvest, -1)))), by=Field]
DT[, Season:=max(year(Date)), by=list(Field, Season)]
However, that seems not to work. The result should look like this with a "season" column at the end that indicates the season:
ID|Field|Date |Tillage|Seeding|Fertilizer|Spraying|Harvest|Season
1|A |2012/08/01|Plough |NA|NA|NA|NA|2013
2|A |2012/08/24|NA |Wheat|NA|NA|NA|2013
3|A |2013/03/05|NA |NA|NA|ProduktA|NA|2013
4|A |2013/03/05|NA|NA|TypeB|NA|NA|2013
5|A |2013/07/25|NA |NA|NA|NA|9t|2013
6|B |2012/09/01|Plough |NA|NA|NA|NA|2013
7|B |2012/09/05|NA |Barley|NA|NA|NA|2013
8|B |2013/04/05|NA |NA|NA|ProductB|NA|2013
9|B |2013/07/28|NA |NA|NA|NA|10t|2013
10|B |2010/08/24|Cultivator |NA|NA|NA|NA|2011
11|B |2010/09/29|NA |NA|NA|NA|NA|2011
12|B |2011/05/01|NA|NA|TypeB|NA|NA|2011
13|B |2011/07/12|NA |NA|NA|NA|6t|2011
14|A |2011/09/01|NA |Barley|NA|NA|NA|2012
15|A |2011/10/10|NA |NA|NA|ProductC|NA|2012
16|A |2012/04/10|NA|NA|TypeA|NA|NA|2012
17|A |2012/08/02|NA |NA|NA|NA|7t||2012
The only difference to OP's other question is that there are some additional columns and that the condition for extracting the harvest dates for applying my rolling join answer needs to be amended:
library(data.table)
setDT(DF)[!is.na(Harvest), .(Field, Date, Season = year(Date))][
DF, on = .(Field, Date), roll = -Inf]
Field Date Season ID Tillage Seeding Fertilizer Spraying Harvest
1: A 2012/08/01 2012 1 Plough NA NA NA NA
2: A 2012/08/24 2013 2 NA Wheat NA NA NA
3: A 2013/03/05 2013 3 NA NA NA ProduktA NA
4: A 2013/03/05 2013 4 NA NA TypeB NA NA
5: A 2013/07/25 2013 5 NA NA NA NA 9t
6: B 2012/09/01 2013 6 Plough NA NA NA NA
7: B 2012/09/05 2013 7 NA Barley NA NA NA
8: B 2013/04/05 2013 8 NA NA NA ProductB NA
9: B 2013/07/28 2013 9 NA NA NA NA 10t
10: B 2010/08/24 2011 10 Cultivator NA NA NA NA
11: B 2010/09/29 2011 11 NA NA NA NA NA
12: B 2011/05/01 2011 12 NA NA TypeB NA NA
13: B 2011/07/12 2011 13 NA NA NA NA 6t
14: A 2011/09/01 2012 14 NA Barley NA NA NA
15: A 2011/10/10 2012 15 NA NA NA ProductC NA
16: A 2012/04/10 2012 16 NA NA TypeA NA NA
17: A 2012/08/02 2012 17 NA NA NA NA 7t
Note that the rolling join has exhibited a flaw in the sample dataset. Row 1 shows Season 2012 although the subsequent harvest (according to OP's ID) should be in 2013. Reason is that the dates for tillage and harvest are intermixed for field A. Tillage date of Field A in row 1 is 2012/08/01 while the harvest date of the same field in row 17 is 2012/08/02, one day after tillage.
In case the column order is important, the setcolorder() function can be used to order the columns in place, i.e., without copying:
result <- setDT(DF)[!is.na(Harvest), .(Field, Date, Season = year(Date))][
DF, on = .(Field, Date), roll = -Inf]
setcolorder(result, c(names(DF), "Season"))[]
ID Field Date Tillage Seeding Fertilizer Spraying Harvest Season
1: 1 A 2012/08/01 Plough NA NA NA NA 2012
2: 2 A 2012/08/24 NA Wheat NA NA NA 2013
3: 3 A 2013/03/05 NA NA NA ProduktA NA 2013
4: 4 A 2013/03/05 NA NA TypeB NA NA 2013
5: 5 A 2013/07/25 NA NA NA NA 9t 2013
6: 6 B 2012/09/01 Plough NA NA NA NA 2013
7: 7 B 2012/09/05 NA Barley NA NA NA 2013
8: 8 B 2013/04/05 NA NA NA ProductB NA 2013
9: 9 B 2013/07/28 NA NA NA NA 10t 2013
10: 10 B 2010/08/24 Cultivator NA NA NA NA 2011
11: 11 B 2010/09/29 NA NA NA NA NA 2011
12: 12 B 2011/05/01 NA NA TypeB NA NA 2011
13: 13 B 2011/07/12 NA NA NA NA 6t 2011
14: 14 A 2011/09/01 NA Barley NA NA NA 2012
15: 15 A 2011/10/10 NA NA NA ProductC NA 2012
16: 16 A 2012/04/10 NA NA TypeA NA NA 2012
17: 17 A 2012/08/02 NA NA NA NA 7t 2012
Data
library(data.table)
DF <- fread(
"ID|Field|Date |Tillage|Seeding|Fertilizer|Spraying|Harvest
1|A |2012/08/01|Plough |NA|NA|NA|NA
2|A |2012/08/24|NA |Wheat|NA|NA|NA
3|A |2013/03/05|NA |NA|NA|ProduktA|NA
4|A |2013/03/05|NA|NA|TypeB|NA|NA
5|A |2013/07/25|NA |NA|NA|NA|9t
6|B |2012/09/01|Plough |NA|NA|NA|NA
7|B |2012/09/05|NA |Barley|NA|NA|NA
8|B |2013/04/05|NA |NA|NA|ProductB|NA
9|B |2013/07/28|NA |NA|NA|NA|10t
10|B |2010/08/24|Cultivator |NA|NA|NA|NA
11|B |2010/09/29|NA |NA|NA|NA|NA
12|B |2011/05/01|NA|NA|TypeB|NA|NA
13|B |2011/07/12|NA |NA|NA|NA|6t
14|A |2011/09/01|NA |Barley|NA|NA|NA
15|A |2011/10/10|NA |NA|NA|ProductC|NA
16|A |2012/04/10|NA|NA|TypeA|NA|NA
17|A |2012/08/02|NA |NA|NA|NA|7t")

Resources