Dividing data in multiple columns into 8 values logically in R

I have the following data. Each column from 1.07m to 11.82m represents a depth, and the values correspond to the temperature at that depth. I am interested in reducing the dataset to 8 values per row (8 distinct water depths) by averaging. For example, row 1 of my data runs from column X1.07m to X2.82m (X2.82m because all values beyond that point are NA). I would like to create a separate data frame with the datetime and 8 columns (layer1, layer2, layer3, layer4, layer5, layer6, layer7, layer8). The layer1 value should start at 1.07m and layer8 should correspond to the deepest non-NA reading.
Data: the dput of the data can be found at https://dl.dropboxusercontent.com/u/9267938/rcode.R
> head(data.frame(mytest))
datetime Year Month Day Hour Minute Second X1.07m X1.32m X1.57m X1.82m X2.07m X2.32m X2.57m X2.82m X3.07m
1 2014-08-03 12:40:00 2014 8 3 12 40 0 -0.079553637 -0.018856349 -0.022559778 -0.0278269427 -0.019816260 -0.01304108 -0.003394041 -0.010720688 NA
2 2014-08-03 12:50:00 2014 8 3 12 50 0 -0.001409806 0.006434559 0.013885671 0.0033940409 0.009665614 0.01176982 0.011130125 0.019991707 0.02997477
3 2014-08-03 13:00:00 2014 8 3 13 0 0 -0.006942835 -0.011130125 0.010715907 -0.0058745801 -0.005716650 0.01534520 0.030355206 0.024851408 0.04862646
4 2014-08-03 13:10:00 2014 8 3 13 10 0 -0.020586547 0.002935416 -0.016304143 -0.0001326389 -0.003896694 0.00361282 0.004723244 0.013947785 0.03787721
5 2014-08-03 13:20:00 2014 8 3 13 20 0 -0.028394300 -0.023132719 -0.001721911 -0.0139650391 -0.038460075 0.01749898 0.008466864 0.003630492 0.01442467
6 2014-08-03 13:30:00 2014 8 3 13 30 0 -0.034646511 -0.006791177 0.004064423 -0.0038792422 -0.015942808 -0.02029747 -0.014287663 0.007956902 0.01786172
X3.32m X3.57m X3.82m X4.07m X4.32m X4.57m X4.82m X5.07m X5.32m X5.57m X5.82m X6.07m X6.32m X6.57m X6.82m
1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 0.05094966 0.04699597 0.032100892 0.02650842 0.045689389 0.0169759192 -0.006879327 -0.0187681077 -0.030404344 -0.04405705 -0.04501967 NA NA NA NA
3 0.04500833 0.01713256 0.006450535 0.02870071 0.019079580 0.0009741734 -0.024666588 -0.0409943643 -0.030201313 -0.03873463 -0.02893064 NA NA NA NA
4 0.03971244 0.05723497 0.039496306 0.03799276 0.012742073 0.0024111385 -0.023706420 -0.0188563490 -0.033791404 -0.04162619 -0.02979164 -0.045051204 NA NA NA
5 0.03269076 0.05125416 0.054766084 0.03625076 0.005988487 0.0020217180 -0.007510352 -0.0069913419 -0.006656083 -0.01630414 -0.01403812 -0.001580609 NA NA NA
6 0.01913708 0.03932811 0.048955209 0.04764632 0.037480601 0.0205218532 0.004171715 0.0009371753 -0.002468609 -0.04511612 -0.01263816 0.035861544 NA NA NA
X7.07m X7.32m X7.57m X7.82m X8.07m X8.32m X8.57m X8.82m X9.07m X9.32m X9.57m X9.82m X10.07m X10.32m X10.57m X10.82m X11.07m X11.32m X11.57m X11.82m
1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Sometimes a row will have 20, 22, or 25 data points, so the function should account for this and still reduce each row to 8 values.
The rcode.R file linked above contains the dput of mytest; it was too big to post here, so I posted an external link.
Info added
Each row has a different number of data points. The goal is to convert them into 8 columns of data using averaging or linear interpolation.

Taking the question as a desire to collapse the values to means of eight equally spaced depths, dplyr and tidyr take us where we need to go:
library(dplyr)
library(tidyr)
mytest %>%
# melt to long form
gather(depth, value, -datetime:-Second, na.rm = TRUE) %>%
# clean depth to number
mutate(depth = extract_numeric(depth)) %>%
# group so cut levels are for each datetime
group_by(datetime) %>%
# group to keep columns; cut depth into 8 levels per group
group_by(datetime, levels = cut(depth, 8, paste0('level', 1:8))) %>%
# collapse groups by taking the mean
summarise(value = mean(value)) %>%
# re-spread new levels to wide form
spread(levels, value) %>%
# re-add other time columns dropped by summarise
inner_join(mytest %>% select(datetime:Second), .)
# Source: local data frame [20 x 15]
#
# datetime Year Month Day Hour Minute Second level1 level2
# (time) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
# 1 2014-08-03 12:40:00 2014 8 3 12 40 0 -0.079553637 -0.0188563490
# 2 2014-08-03 12:50:00 2014 8 3 12 50 0 0.006303474 0.0065298277
# 3 2014-08-03 13:00:00 2014 8 3 13 0 0 -0.002452351 -0.0057956151
# 4 2014-08-03 13:10:00 2014 8 3 13 10 0 -0.011318424 -0.0001388374
# 5 2014-08-03 13:20:00 2014 8 3 13 20 0 -0.017749644 -0.0116420430
# 6 2014-08-03 13:30:00 2014 8 3 13 30 0 -0.012457755 -0.0133731725
# 7 2014-08-03 13:40:00 2014 8 3 13 40 0 -0.020440875 -0.0253538846
# 8 2014-08-03 13:50:00 2014 8 3 13 50 0 -0.058681338 -0.0177194127
# 9 2014-08-03 14:00:00 2014 8 3 14 0 0 -0.037929680 -0.0211918383
# 10 2014-08-03 14:10:00 2014 8 3 14 10 0 -0.027045726 -0.0147261076
# 11 2014-08-03 14:20:00 2014 8 3 14 20 0 -0.048997399 -0.0290804019
# 12 2014-08-03 14:30:00 2014 8 3 14 30 0 -0.059110466 -0.0370898043
# 13 2014-08-03 14:40:00 2014 8 3 14 40 0 -0.067156867 -0.0138750287
# 14 2014-08-03 14:50:00 2014 8 3 14 50 0 -0.049762164 -0.0280648246
# 15 2014-08-03 15:00:00 2014 8 3 15 0 0 -0.028033559 -0.0245379952
# 16 2014-08-03 15:10:00 2014 8 3 15 10 0 -0.044087211 -0.0107995239
# 17 2014-08-03 15:20:00 2014 8 3 15 20 0 -0.028761973 -0.0113161242
# 18 2014-08-03 15:30:00 2014 8 3 15 30 0 -0.013476051 -0.0142316424
# 19 2014-08-03 15:40:00 2014 8 3 15 40 0 -0.012799297 -0.0135366710
# 20 2014-08-03 15:50:00 2014 8 3 15 50 0 -0.012238548 -0.0180806876
# Variables not shown: level3 (dbl), level4 (dbl), level5 (dbl), level6 (dbl), level7 (dbl),
# level8 (dbl)
Note that you should check that these data make sense in context; you've lost your depth data by scaling them.
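For reference, gather()/spread() and extract_numeric() have since been superseded in tidyr. A roughly equivalent sketch with the current verbs (pivot_longer()/pivot_wider() plus readr::parse_number(); untested against the linked dput) would be:
library(dplyr)
library(tidyr)
mytest %>%
  # melt to long form, dropping the NA depths
  pivot_longer(-(datetime:Second), names_to = "depth", values_to = "value",
               values_drop_na = TRUE) %>%
  # clean depth to a number
  mutate(depth = readr::parse_number(depth)) %>%
  # cut depth into 8 levels within each datetime
  group_by(datetime) %>%
  mutate(level = cut(depth, 8, paste0("level", 1:8))) %>%
  # collapse each level to its mean
  group_by(datetime, level) %>%
  summarise(value = mean(value), .groups = "drop") %>%
  # re-spread the new levels to wide form
  pivot_wider(names_from = level, values_from = value) %>%
  # re-add the time columns dropped by summarise
  inner_join(mytest %>% select(datetime:Second), ., by = "datetime")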

Related

How to drop rows containing NA in specified columns?

I have a dataframe like this
dep_delay temp humid wind_dir precip pressure date
16983 3 68.00 53.06 NA 0 1020.8 2013-05-07
26477 42 NA 64.93 360 0 NA 2013-03-07
...
29299 -1 NA NA NA NA NA 2013-12-31
29300 33 NA NA NA NA NA 2013-12-31
I want to drop only the rows like 29299 and 29300, which contain 5 NAs from temp to pressure (these are consecutive columns), and keep the rows like 16983 and 26477.
desired result:
dep_delay temp humid wind_dir precip pressure date
16983 3 68.00 53.06 NA 0 1020.8 2013-05-07
26477 42 NA 64.93 360 0 NA 2013-03-07
In other words, the problem is how to remove only the rows that have NAs in all five of those columns.
Apparently this is not the right way to do it:
df <- df[!is.na(df$temp:df$pressure),]
Updated based on Yacine Jajji's comment.
Count the NAs per row across the five columns that may contain them (dep_delay and date should never be NA, so they are excluded from the check); if all five are NA, the row is dropped. See the code below:
df <- read.table( text = "dep_delay temp humid wind_dir precip pressure date
16983 3 68.00 53.06 NA 0 1020.8 2013-05-07
26477 42 NA 64.93 360 0 NA 2013-03-07
29299 -1 NA NA NA NA NA 2013-12-31
29300 33 NA NA NA NA NA 2013-12-31")
cols_to_remove <- c("temp", "humid", "wind_dir", "precip", "pressure")
df[rowSums(is.na(df[, cols_to_remove])) != length(cols_to_remove), ]
Output:
dep_delay temp humid wind_dir precip pressure date
16983 3 68 53.06 NA 0 1020.8 2013-05-07
26477 42 NA 64.93 360 0 NA 2013-03-07
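If you prefer to stay in dplyr, an equivalent filter can be written with if_all() (a sketch assuming dplyr >= 1.0.4, where if_all() was introduced):
library(dplyr)
# keep the rows where the five weather columns are not all NA
df %>% filter(!if_all(all_of(cols_to_remove), is.na))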

Fill in blanks from the previous cell multiplied by the current cell in a different column in R

I have the below data:
year<-c(2015:2030)
actual<-c(NA,NA,NA,3170.620936,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA)
delta<-c(0.276674282,
0.23515258,
0.133083622,
0.236098022,
0.399974342,
0.385942573,
0.165095681,
0.163945346,
0.155695778,
0.147270755,
0.146505261,
0.133997582,
0.123100693,
0.119131947,
0.115589755,
0.103675414)
df<-cbind.data.frame(year,actual,delta)
df
year actual delta
1 2015 NA 0.2766743
2 2016 NA 0.2351526
3 2017 NA 0.1330836
4 2018 3170.621 0.2360980
5 2019 NA 0.3999743
6 2020 NA 0.3859426
7 2021 NA 0.1650957
8 2022 NA 0.1639453
9 2023 NA 0.1556958
10 2024 NA 0.1472708
11 2025 NA 0.1465053
12 2026 NA 0.1339976
13 2027 NA 0.1231007
14 2028 NA 0.1191319
15 2029 NA 0.1155898
16 2030 NA 0.1036754
What I am trying to do is fill each NA after the last valid data point with the previous (filled) value multiplied by the current delta. So, in this case, I want to multiply "actual" in 2018 by "delta" in 2019, fill in the 2019 value for "actual", and continue down the series. I have tried the code below with no success:
df$actual_filled<-df$actual
df
library(dplyr)
df<-df%>%
mutate( actual_filled=lag(actual_filled,1)*delta)
df
year actual delta actual_filled
1 2015 NA 0.2766743 NA
2 2016 NA 0.2351526 NA
3 2017 NA 0.1330836 NA
4 2018 3170.621 0.2360980 NA
5 2019 NA 0.3999743 1268.167
6 2020 NA 0.3859426 NA
7 2021 NA 0.1650957 NA
8 2022 NA 0.1639453 NA
9 2023 NA 0.1556958 NA
10 2024 NA 0.1472708 NA
11 2025 NA 0.1465053 NA
12 2026 NA 0.1339976 NA
13 2027 NA 0.1231007 NA
14 2028 NA 0.1191319 NA
15 2029 NA 0.1155898 NA
16 2030 NA 0.1036754 NA
As you can see, the filling process stops after 2019. I thought it would populate the values to the end of the series, but the code acts as if it were reading from "actual" rather than "actual_filled". Could someone tell me what I am doing wrong and how I can fix this?
Here's a solution that works via a loop. (The mutate() attempt fails because it computes the whole column in one vectorized pass, so lag(actual_filled) only ever sees the original column with its single non-NA value, never the values filled earlier in the same call.)
df$actual_filled<-df$actual
for (row in 2:nrow(df)) {
if(!is.na(df$actual_filled[row-1])) {
df$actual_filled[row] <- df$delta[row] * df$actual_filled[row-1]
}
}
I'm new to R so it may not be the best solution!
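A vectorized alternative uses cumprod(): each later value is the anchor times the running product of the deltas after it. A sketch, assuming (as in the sample data) a single non-NA anchor that is not in the last row:
df$actual_filled <- df$actual
i <- which(!is.na(df$actual))   # position of the single anchor value
idx <- seq(i + 1, nrow(df))     # rows after the anchor
df$actual_filled[idx] <- df$actual[i] * cumprod(df$delta[idx])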

Reconstruct dataframe with dates as date intervals in R

I have a dataset that basically looks like this, giving which campaigns are active for each household, with the start and end dates of the respective campaigns:
campaign_id household_id campaign_type start_date end_date
1 26 1 Type B 2016-12-28 2017-02-19
2 8 1 Type A 2017-05-08 2017-06-25
3 12 1 Type B 2017-07-12 2017-08-13
4 13 1 Type A 2017-08-08 2017-09-24
5 18 1 Type A 2017-10-30 2017-12-24
6 20 1 Type C 2017-11-27 2018-02-05
7 22 1 Type B 2017-12-06 2018-01-07
8 23 1 Type B 2017-12-28 2018-02-04
And I create a new data frame with the structure below, which will show which campaigns are active for a given household at a given time (all the campaign numbers are columns; I have omitted the rest here):
household_id date campaign1 campaign2 campaign3 campaign4
1 1 2016-11-14 NA NA NA NA
2 1 2016-12-06 NA NA NA NA
3 1 2016-12-28 NA NA NA NA
4 1 2017-02-08 NA NA NA NA
5 1 2017-03-03 NA NA NA NA
6 1 2017-03-08 NA NA NA NA
7 1 2017-03-13 NA NA NA NA
8 1 2017-03-29 NA NA NA NA
9 1 2017-04-03 NA NA NA NA
10 1 2017-04-19 NA NA NA NA
What I want to do is mark the campaigns active on the given dates in the rows of the second data frame. For example, if household_id 1 has campaign 2 running on 2016-11-14 but no other campaigns, the row will look like this:
household_id date campaign1 campaign2 campaign3 campaign4
1 1 2016-11-14 0 1 0 0
How can I manage this construction? Should I use for loops over the initial data frame and assign to the second one in each iteration, or is there a better and faster way? Thanks in advance.
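A join-based sketch (not from the original post; campaigns stands for the first table and panel for the household_id/date rows of the second) avoids the loops with a data.table non-equi join:
library(data.table)
setDT(campaigns); setDT(panel)
campaigns[, `:=`(start_date = as.Date(start_date), end_date = as.Date(end_date))]
panel[, date := as.Date(date)]
# for each (household_id, date), attach every campaign whose
# [start_date, end_date] interval covers the date
active <- campaigns[panel,
                    on = .(household_id, start_date <= date, end_date >= date),
                    .(household_id, date = start_date, campaign_id = x.campaign_id)]
active[, campaign := paste0("campaign", campaign_id)]
# one indicator column per campaign; dates with no active campaign
# end up in a campaignNA column that can be dropped afterwards
result <- dcast(active, household_id + date ~ campaign,
                fun.aggregate = length, value.var = "campaign_id", fill = 0)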

Daily averages of all data frame variables including NA values with aggregate function

I want to calculate daily means of all variables in my data frame, which includes NA values. All my datasets have a value every 30 min, so I'm very interested in using the timestamp with the aggregate function to obtain daily, weekly, monthly... aggregated data.
My data frame is 37795 rows x 54 variables. I've tried two ways to do this: the first option does not give me daily means, because the values I obtain are far too high (not plausible); the second option gives me almost all NA values. I do not know what to do.
My data frame head and code are below.
head(data)
timestamp day month year.x hour minute doy.x rn_1_1_1 ppfd_1_1_1
1 2013-07-06 00:00:00 6 7 2013 0 0 187.000 -84.37381 0.754
2 2013-07-06 00:30:00 6 7 2013 0 30 187.020 -84.07990 0.808
3 2013-07-06 01:00:00 6 7 2013 1 0 187.041 -82.19991 0.808
4 2013-07-06 01:30:00 6 7 2013 1 30 187.062 -81.12341 0.831
5 2013-07-06 02:00:00 6 7 2013 2 0 187.083 -79.57474 0.708
6 2013-07-06 02:30:00 6 7 2013 2 30 187.104 -77.72460 0.639
ppfdr_1_1_1 p_rain_1_1_1 swc_1_1_1 swc_2_1_1 swc_3_1_1 air_pressure air_pressure.1
1 0.624 0 0.07230304 0.09577876 0.134602791 101212.4165 1012.124165
2 0.587 0 0.07233134 0.09569421 0.134479816 101181.8094 1011.818094
3 0.713 0 0.07242914 0.09566160 0.134203719 101166.0948 1011.660948
4 0.72 0 0.07252077 0.09563419 0.134149141 101144.6151 1011.446151
5 0.564 0 0.07261925 0.09560297 0.134095791 101144.8662 1011.448662
6 0.706 0 0.07271843 0.09557789 0.134037119 101144.5084 1011.445084
u_rot v_rot w_rot wind_speed u. h_scr_qc01_man
1 5.546047919 1.42E-14 4.76E-16 5.546047919 0.426515403 -28.07603618
2 5.122724997 6.94E-15 -8.00E-16 5.122724997 0.408213459 -34.39110979
3 5.248639421 4.56E-15 7.28E-17 5.248639421 0.393959075 -33.29033501
4 4.845257286 2.81E-14 -1.33E-17 4.845257286 0.365475898 -32.62427147
5 4.486426895 1.39E-14 -4.43E-16 4.486426895 0.335905384 -33.80219189
6 4.109603841 7.08E-15 -9.76E-16 4.109603841 0.312610588 -35.77289349
fco2_scr_qc01_man le_scr_qc01_man fco2_scr_qc0 fco2_scr_qc0_man date year.y time
1 -0.306504951 NA NA NA 06-jul-13 2013 0:00
2 -0.206266524 NA -0.206266524 -0.206266524 06-jul-13 2013 0:30
3 -0.268508139 NA -0.268508139 -0.268508139 06-jul-13 2013 1:00
4 -0.203804516 0.426531598 -0.203804516 -0.203804516 06-jul-13 2013 1:30
5 -0.217438742 -0.358248118 -0.217438742 -0.217438742 06-jul-13 2013 2:00
6 -0.193778528 2.571063044 -0.193778528 -0.193778528 06-jul-13 2013 2:30
doy_ent doy.y doy_cum doy_cum_ent mes nrecord bat panel_temp vwc_0.1
1 187 187.0000 187.0000 187 7 24 12.57 22.93 0.06284828
2 187 187.0208 187.0208 187 7 25 12.56 22.85 0.06267169
3 187 187.0417 187.0417 187 7 26 12.55 22.58 0.06261738
4 187 187.0625 187.0625 187 7 27 12.54 22.3 0.06247716
5 187 187.0833 187.0833 187 7 28 12.53 22.01 0.06249525
6 187 187.1042 187.1042 187 7 29 12.52 21.82 0.06236862
vwc_0.5 vwc_1.5 temp_0.1 temp_0.5 temp_1.5 tempsd_0.1 tempsd_0.5 tempsd_1.5
1 0.07569027 0.1007845 30.9 28.96 25.14 0.372 0.961 0.767
2 0.07569027 0.1007743 30.8 28.85 24.99 0.181 1.361 1.087
3 0.07568554 0.1008558 30.53 28.8 25.03 0.98 1.476 0.351
4 0.07559577 0.1008507 30.52 29.09 25.11 0.186 0.229 0.556
5 0.07559577 0.1007743 30.11 29.09 24.87 1.331 0.191 0.954
6 0.07556271 0.1007285 30.15 29.33 25.04 1.447 1.078 0.2
pair pair_avg CO2_0.1 CO2_0.5 CO2_1.5 DCO2_0.1 DCO2_0.5
1 101.2124 101.2118 1161.592832 3275.1134 4888.231603 -24.67422109 34.88538221
2 101.1818 101.2131 1168.144925 3338.24016 4941.418642 6.55209301 63.12675931
3 101.1661 101.2090 1201.049131 3435.235974 5012.525851 32.90420541 96.9958144
4 101.1446 101.2007 1268.613941 3556.723878 5092.96558 67.56481067 121.4879035
5 101.1449 101.1906 1364.315214 3680.188043 5164.795759 95.7012722 123.464165
6 101.1445 101.1805 1472.975286 3808.988677 5236.40855 108.6600723 128.8006346
DCO2_1.5
1 31.30293041
2 53.18703947
3 71.10720845
4 80.43972916
5 71.83017884
6 71.61279156
## Daily avg - OPTION 1
data$timestamp <- as.POSIXct(data$timestamp, format = "%d/%m/%Y %H:%M",tz ="GMT")
dates <- format(data$timestamp,"%Y/%m/%d",tz = "GMT")
datadates <- cbind(data,dates)
dailydata_avg <- aggregate(. ~ dates, datadates, FUN=mean, na.rm=TRUE, na.action = "na.pass")
head(dailydata_avg)
dates timestamp day month year.x hour minute doy.x rn_1_1_1 ppfd_1_1_1
1 2013/07/06 1373111100 6 7 2013 11.5 15 187.489 159.7788 3580.562
2 2013/07/07 1373197500 7 7 2013 11.5 15 188.489 154.0925 3506.688
3 2013/07/08 1373283900 8 7 2013 11.5 15 189.489 152.5259 3460.667
4 2013/07/09 1373370300 9 7 2013 11.5 15 190.489 131.1619 2965.250
5 2013/07/10 1373456700 10 7 2013 11.5 15 191.489 136.7853 3171.958
6 2013/07/11 1373543100 11 7 2013 11.5 15 192.489 145.2757 3282.167
ppfdr_1_1_1 p_rain_1_1_1 swc_1_1_1 swc_2_1_1 swc_3_1_1 air_pressure air_pressure.1
1 2552.396 1.0000 0.07095847 0.09606378 18341.81 25940.167 25940.167
2 2532.542 1.0000 0.06994341 0.09502167 18065.98 24891.000 24891.000
3 2523.562 1.0000 0.06860553 0.09379282 17777.02 23107.271 23107.271
4 2336.000 1.0000 0.06717054 0.09268716 17526.50 19309.500 19309.500
5 2607.229 1.0625 0.06620048 0.09166904 17275.56 8385.646 8385.646
6 2484.521 1.0000 0.06562964 0.09083684 17028.94 3535.438 3535.438
u_rot v_rot w_rot wind_speed u. h_scr_qc01_man fco2_scr_qc01_man
1 32167.83 2215.875 2041.354 32167.83 28531.44 18197.75 15365.65
2 30878.27 1911.312 1939.917 30878.27 26929.62 17605.52 14955.56
3 26052.96 2261.417 2116.458 26052.96 23305.83 19167.98 18399.33
4 17284.04 1987.438 2139.083 17284.04 17704.35 20349.92 18137.65
5 12028.06 2053.812 1960.417 12028.06 15670.00 21997.83 21120.19
6 15607.50 1997.417 1907.646 15607.50 15384.56 18000.94 18810.62
le_scr_qc01_man fco2_scr_qc0 fco2_scr_qc0_man date year.y time doy_ent doy.y
1 17409.67 13032.10 13027.90 137 2013 44.5 187 187.4896
2 15524.38 12077.17 12072.92 163 2013 44.5 188 188.4896
3 16407.71 14775.94 14770.56 189 2013 44.5 189 189.4896
4 16788.04 15024.79 15019.02 215 2013 44.5 190 190.4896
5 17955.58 17737.25 17730.75 241 2013 44.5 191 191.4896
6 14610.02 16605.48 16599.33 267 2013 44.5 192 192.4896
doy_cum doy_cum_ent mes nrecord bat panel_temp vwc_0.1 vwc_0.5 vwc_1.5
1 187.4896 187.5 7 28966.375 111.5208 1836.250 4638.833 4594.396 37.35417
2 188.4896 188.5 7 20801.417 111.7292 1900.812 4656.875 4392.979 26.68750
3 189.4896 189.5 7 4394.500 110.6042 1934.792 4675.604 4238.229 65.20833
4 190.4896 190.5 7 9467.708 104.0000 2090.896 4776.521 4178.729 54.12500
5 191.4896 191.5 7 14796.375 109.7500 2145.875 4907.292 4161.312 108.39583
6 192.4896 192.5 7 20127.958 109.3125 1934.375 4876.021 4123.458 143.10417
temp_0.1 temp_0.5 temp_1.5 tempsd_0.1 tempsd_0.5 tempsd_1.5 pair pair_avg CO2_0.1
1 2018.438 1565.812 797.8750 470.8125 474.3958 508.8333 101.1268 101.1323 10400.27
2 1998.438 1574.000 783.1875 478.3333 460.4583 566.0208 101.0764 101.0789 11292.75
3 1994.833 1568.104 780.2083 463.8125 453.1667 488.5625 100.9967 101.0036 13288.25
4 2042.625 1564.875 780.1667 465.0000 599.2708 437.6042 100.8520 100.8665 16156.60
5 2114.708 1576.729 780.5000 471.5833 406.5417 484.6875 100.4828 100.5169 18656.50
6 2124.604 1591.125 781.8125 516.7500 530.3333 510.7500 100.3025 100.2947 14586.60
CO2_0.5 CO2_1.5 DCO2_0.1 DCO2_0.5 DCO2_1.5
1 26360.38 34371.31 19795.81 20637.94 27123.92
2 26939.60 34558.17 18838.38 20464.56 20452.58
3 27603.06 34608.31 17413.15 19998.02 22754.85
4 28572.69 34678.38 19294.62 21894.92 18379.62
5 28983.29 34644.15 20251.17 20409.58 22077.40
6 28236.12 34736.67 17031.02 18852.04 19684.69
## Daily avg - OPTION 2
data$timestamp <- as.POSIXct(data$timestamp, format = "%d/%m/%Y %H:%M",tz ="GMT")
datatime <- data$timestamp
dailydata_avg <- aggregate( data,
by = list('DATES'= format(datatime,'%Y%m%d' )),
FUN = mean, na.rm=T)
I obtain these console messages:
1: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
2: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
3: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
head(dailydata_avg)
DATES timestamp day month year.x hour minute doy.x rn_1_1_1 ppfd_1_1_1
1 20130706 2013-07-06 13:45:00 6 7 2013 11.5 15 187.489 159.7788 NA
2 20130707 2013-07-07 13:45:00 7 7 2013 11.5 15 188.489 154.0925 NA
3 20130708 2013-07-08 13:45:00 8 7 2013 11.5 15 189.489 152.5259 NA
4 20130709 2013-07-09 13:45:00 9 7 2013 11.5 15 190.489 131.1619 NA
5 20130710 2013-07-10 13:45:00 10 7 2013 11.5 15 191.489 136.7853 NA
6 20130711 2013-07-11 13:45:00 11 7 2013 11.5 15 192.489 145.2757 NA
ppfdr_1_1_1 p_rain_1_1_1 swc_1_1_1 swc_2_1_1 swc_3_1_1 air_pressure air_pressure.1
1 NA NA 0.07095847 0.09606378 NA NA NA
2 NA NA 0.06994341 0.09502167 NA NA NA
3 NA NA 0.06860553 0.09379282 NA NA NA
4 NA NA 0.06717054 0.09268716 NA NA NA
5 NA NA 0.06620048 0.09166904 NA NA NA
6 NA NA 0.06562964 0.09083684 NA NA NA
u_rot v_rot w_rot wind_speed u. h_scr_qc01_man fco2_scr_qc01_man le_scr_qc01_man
1 NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA
fco2_scr_qc0 fco2_scr_qc0_man date year.y time doy_ent doy.y doy_cum doy_cum_ent
1 NA NA NA 2013 NA 187 187.4896 187.4896 187.5
2 NA NA NA 2013 NA 188 188.4896 188.4896 188.5
3 NA NA NA 2013 NA 189 189.4896 189.4896 189.5
4 NA NA NA 2013 NA 190 190.4896 190.4896 190.5
5 NA NA NA 2013 NA 191 191.4896 191.4896 191.5
6 NA NA NA 2013 NA 192 192.4896 192.4896 192.5
mes nrecord bat panel_temp vwc_0.1 vwc_0.5 vwc_1.5 temp_0.1 temp_0.5 temp_1.5
1 7 NA NA NA NA NA NA NA NA NA
2 7 NA NA NA NA NA NA NA NA NA
3 7 NA NA NA NA NA NA NA NA NA
4 7 NA NA NA NA NA NA NA NA NA
5 7 NA NA NA NA NA NA NA NA NA
6 7 NA NA NA NA NA NA NA NA NA
tempsd_0.1 tempsd_0.5 tempsd_1.5 pair pair_avg CO2_0.1 CO2_0.5 CO2_1.5 DCO2_0.1
1 NA NA NA 101.1268 101.1323 NA NA NA NA
2 NA NA NA 101.0764 101.0789 NA NA NA NA
3 NA NA NA 100.9967 101.0036 NA NA NA NA
4 NA NA NA 100.8520 100.8665 NA NA NA NA
5 NA NA NA 100.4828 100.5169 NA NA NA NA
6 NA NA NA 100.3025 100.2947 NA NA NA NA
DCO2_0.5 DCO2_1.5
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 NA NA
How could I do it?
Thanks!!
I didn't use the aggregate function; I used tapply instead.
This is the code I came up with; it deals with the NAs:
# create a sequence of DateTime with half-hourly data
DateTime <- seq.POSIXt(from = as.POSIXct("2015-05-01 00:00:00", tz = "Etc/GMT+12"),
to = as.POSIXct("2015-05-30 23:59:00", tz = "Etc/GMT+12"), by = 1800)
# create some dummy data of the same length as DateTime vector
aa <- runif(1440, 5.0, 7.5)
bb <- NA
df <- data.frame(DateTime, aa, bb)
# replace a cell with NA in the "a" column
df[19,2] <- NA # dataframe = df, row = 19, column = 2
# create DateHour column to use later
df$DateHour <- paste(format(df$DateTime, "%Y/%m/%d"), format(df$DateTime, "%H"), sep = " ")
View(df)
# Hourly means
# Calculate hourly mean values
aa.HourlyMean <- tapply(df$aa, df$DateHour, mean, na.rm = TRUE)
# convert the vector to dataframe
aa.HourlyMean <- data.frame(aa.HourlyMean)
# Extract the DateHour column from the "aa" dataframe
aa.HourlyMean$DateHour <- row.names(aa.HourlyMean);
# Delete rownames of "aa" dataframe
row.names(aa.HourlyMean) <- NULL
# Create a tidy DateTime column
aa.HourlyMean$DateTime <- as.POSIXct(aa.HourlyMean$DateHour, "%Y/%m/%d %H", tz = "Etc/GMT+12")
# change to a tidy dataframe
aa.HourlyMean <- aa.HourlyMean[,c(3,2,1)]
# You can delete any column (for example, DateHour) by
# aa.HourlyMean$Date <- NULL
# You can rename a column with "plyr" package by
# rename(aa.HourlyMean)[3] <- "NewColumnName"
# View the hourly mean of the "aa" dataframe
View(aa.HourlyMean)
# You can do the same with the "bb" vector
bb.HourlyMean <- tapply(df$bb, df$DateHour, mean, na.rm = TRUE)
bb.HourlyMean <- data.frame(bb.HourlyMean)
# View the hourly mean of the "bb" vector
View(bb.HourlyMean)
# /Hourly means
You can then combine the aa.HourlyMean and bb.HourlyMean results into one data frame.
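For example, since both tapply calls aggregate over the same df$DateHour keys and therefore return rows in the same order, a simple cbind is enough (a sketch, not in the original answer):
hourly <- cbind(aa.HourlyMean, bb = bb.HourlyMean$bb.HourlyMean)
View(hourly)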
# Daily means
df$Date <- format(df$DateTime, "%Y/%m/%d")
aa.DailyMean <- tapply(df$aa, df$Date, mean, na.rm = TRUE)
aa.DailyMean <- data.frame(aa.DailyMean)
aa.DailyMean$Date <- row.names(aa.DailyMean); row.names(aa.DailyMean) <- NULL
aa.DailyMean <- aa.DailyMean[,c(2,1)]
View(aa.DailyMean)
# /Daily means
# Weekly means
df$YearWeek <- paste(format(df$DateTime, "%Y"), strftime(df$DateTime, format = "%W"), sep = " ")
aa.WeeklyMean <- tapply(df$aa, df$YearWeek, mean, na.rm = TRUE)
aa.WeeklyMean <- data.frame(aa.WeeklyMean)
aa.WeeklyMean$YearWeek <- row.names(aa.WeeklyMean); row.names(aa.WeeklyMean) <- NULL
aa.WeeklyMean <- aa.WeeklyMean[,c(2,1)]
View(aa.WeeklyMean)
# /Weekly means
I created the mean values for hourly, daily and weekly observations, but you get the idea of how to create the monthly, yearly, ... ones.
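For instance, a monthly version following exactly the same pattern (a sketch along the lines of the code above):
# Monthly means
df$YearMonth <- format(df$DateTime, "%Y/%m")
aa.MonthlyMean <- tapply(df$aa, df$YearMonth, mean, na.rm = TRUE)
aa.MonthlyMean <- data.frame(aa.MonthlyMean)
aa.MonthlyMean$YearMonth <- row.names(aa.MonthlyMean); row.names(aa.MonthlyMean) <- NULL
aa.MonthlyMean <- aa.MonthlyMean[, c(2, 1)]
View(aa.MonthlyMean)
# /Monthly means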

R update values based on event based on multiple columns

I recently asked a similar question here with all "Activities" in one column. The solution provided worked very well. Now I would like to change the design to record more detailed information. The table shows information recorded from different fields over several years; all activities on the fields are recorded by date. Now I would like to add a "Season" column that groups all values belonging to a harvest season, where a harvest season is the time between two harvest events (see the table at the bottom for how the result should look). The problem is that seeding is sometimes done in the previous year (e.g. 2012) while the field is harvested in 2013; all such events need to be grouped as 2013.
What would I need to change if I start recording more information and give each "Activity" a separate column? I tried:
library(data.table)
DF <- read.table(text="ID|Field|Date |Tillage|Seeding|Fertilizer|Spraying|Harvest
1|A |2012/08/01|Plough |NA|NA|NA|NA
2|A |2012/08/24|NA |Wheat|NA|NA|NA
3|A |2013/03/05|NA |NA|NA|ProduktA|NA
4|A |2013/03/05|NA|NA|TypeB|NA|NA
5|A |2013/07/25|NA |NA|NA|NA|9t
6|B |2012/09/01|Plough |NA|NA|NA|NA
7|B |2012/09/05|NA |Barley|NA|NA|NA
8|B |2013/04/05|NA |NA|NA|ProductB|NA
9|B |2013/07/28|NA |NA|NA|NA|10t
10|B |2010/08/24|Cultivator |NA|NA|NA|NA
11|B |2010/09/29|NA |NA|NA|NA|NA
12|B |2011/05/01|NA|NA|TypeB|NA|NA
13|B |2011/07/12|NA |NA|NA|NA|6t
14|A |2011/09/01|NA |Barley|NA|NA|NA
15|A |2011/10/10|NA |NA|NA|ProductC|NA
16|A |2012/04/10|NA|NA|TypeA|NA|NA
17|A |2012/08/02|NA |NA|NA|NA|7t|",
sep="|", header=TRUE, stringsAsFactors=FALSE)
DT <- data.table(DF)
DT[, Harvest:=gsub(" ", "", Harvest, fixed=TRUE)]
DT[, Date:=as.POSIXct(Date)]
setkeyv(DT, c("Field", "Date"))
DT[, Season:=cumsum(c("", !is.na(head(Harvest, -1)))), by=Field]
DT[, Season:=max(year(Date)), by=list(Field, Season)]
However, that seems not to work. The result should look like this with a "season" column at the end that indicates the season:
ID|Field|Date |Tillage|Seeding|Fertilizer|Spraying|Harvest|Season
1|A |2012/08/01|Plough |NA|NA|NA|NA|2013
2|A |2012/08/24|NA |Wheat|NA|NA|NA|2013
3|A |2013/03/05|NA |NA|NA|ProduktA|NA|2013
4|A |2013/03/05|NA|NA|TypeB|NA|NA|2013
5|A |2013/07/25|NA |NA|NA|NA|9t|2013
6|B |2012/09/01|Plough |NA|NA|NA|NA|2013
7|B |2012/09/05|NA |Barley|NA|NA|NA|2013
8|B |2013/04/05|NA |NA|NA|ProductB|NA|2013
9|B |2013/07/28|NA |NA|NA|NA|10t|2013
10|B |2010/08/24|Cultivator |NA|NA|NA|NA|2011
11|B |2010/09/29|NA |NA|NA|NA|NA|2011
12|B |2011/05/01|NA|NA|TypeB|NA|NA|2011
13|B |2011/07/12|NA |NA|NA|NA|6t|2011
14|A |2011/09/01|NA |Barley|NA|NA|NA|2012
15|A |2011/10/10|NA |NA|NA|ProductC|NA|2012
16|A |2012/04/10|NA|NA|TypeA|NA|NA|2012
17|A |2012/08/02|NA |NA|NA|NA|7t|2012
The only difference from the OP's other question is that there are some additional columns, and that the condition for extracting the harvest dates in my rolling-join answer needs to be amended:
library(data.table)
setDT(DF)[!is.na(Harvest), .(Field, Date, Season = year(Date))][
DF, on = .(Field, Date), roll = -Inf]
Field Date Season ID Tillage Seeding Fertilizer Spraying Harvest
1: A 2012/08/01 2012 1 Plough NA NA NA NA
2: A 2012/08/24 2013 2 NA Wheat NA NA NA
3: A 2013/03/05 2013 3 NA NA NA ProduktA NA
4: A 2013/03/05 2013 4 NA NA TypeB NA NA
5: A 2013/07/25 2013 5 NA NA NA NA 9t
6: B 2012/09/01 2013 6 Plough NA NA NA NA
7: B 2012/09/05 2013 7 NA Barley NA NA NA
8: B 2013/04/05 2013 8 NA NA NA ProductB NA
9: B 2013/07/28 2013 9 NA NA NA NA 10t
10: B 2010/08/24 2011 10 Cultivator NA NA NA NA
11: B 2010/09/29 2011 11 NA NA NA NA NA
12: B 2011/05/01 2011 12 NA NA TypeB NA NA
13: B 2011/07/12 2011 13 NA NA NA NA 6t
14: A 2011/09/01 2012 14 NA Barley NA NA NA
15: A 2011/10/10 2012 15 NA NA NA ProductC NA
16: A 2012/04/10 2012 16 NA NA TypeA NA NA
17: A 2012/08/02 2012 17 NA NA NA NA 7t
Note that the rolling join exposes a flaw in the sample dataset: row 1 shows Season 2012, although the subsequent harvest (going by the OP's IDs) should be in 2013. The reason is that the tillage and harvest dates are intermixed for field A: the tillage date in row 1 is 2012/08/01, while the harvest date of the same field in row 17 is 2012/08/02, one day after tillage.
In case the column order is important, the setcolorder() function can be used to order the columns in place, i.e., without copying:
result <- setDT(DF)[!is.na(Harvest), .(Field, Date, Season = year(Date))][
DF, on = .(Field, Date), roll = -Inf]
setcolorder(result, c(names(DF), "Season"))[]
ID Field Date Tillage Seeding Fertilizer Spraying Harvest Season
1: 1 A 2012/08/01 Plough NA NA NA NA 2012
2: 2 A 2012/08/24 NA Wheat NA NA NA 2013
3: 3 A 2013/03/05 NA NA NA ProduktA NA 2013
4: 4 A 2013/03/05 NA NA TypeB NA NA 2013
5: 5 A 2013/07/25 NA NA NA NA 9t 2013
6: 6 B 2012/09/01 Plough NA NA NA NA 2013
7: 7 B 2012/09/05 NA Barley NA NA NA 2013
8: 8 B 2013/04/05 NA NA NA ProductB NA 2013
9: 9 B 2013/07/28 NA NA NA NA 10t 2013
10: 10 B 2010/08/24 Cultivator NA NA NA NA 2011
11: 11 B 2010/09/29 NA NA NA NA NA 2011
12: 12 B 2011/05/01 NA NA TypeB NA NA 2011
13: 13 B 2011/07/12 NA NA NA NA 6t 2011
14: 14 A 2011/09/01 NA Barley NA NA NA 2012
15: 15 A 2011/10/10 NA NA NA ProductC NA 2012
16: 16 A 2012/04/10 NA NA TypeA NA NA 2012
17: 17 A 2012/08/02 NA NA NA NA 7t 2012
Data
library(data.table)
DF <- fread(
"ID|Field|Date |Tillage|Seeding|Fertilizer|Spraying|Harvest
1|A |2012/08/01|Plough |NA|NA|NA|NA
2|A |2012/08/24|NA |Wheat|NA|NA|NA
3|A |2013/03/05|NA |NA|NA|ProduktA|NA
4|A |2013/03/05|NA|NA|TypeB|NA|NA
5|A |2013/07/25|NA |NA|NA|NA|9t
6|B |2012/09/01|Plough |NA|NA|NA|NA
7|B |2012/09/05|NA |Barley|NA|NA|NA
8|B |2013/04/05|NA |NA|NA|ProductB|NA
9|B |2013/07/28|NA |NA|NA|NA|10t
10|B |2010/08/24|Cultivator |NA|NA|NA|NA
11|B |2010/09/29|NA |NA|NA|NA|NA
12|B |2011/05/01|NA|NA|TypeB|NA|NA
13|B |2011/07/12|NA |NA|NA|NA|6t
14|A |2011/09/01|NA |Barley|NA|NA|NA
15|A |2011/10/10|NA |NA|NA|ProductC|NA
16|A |2012/04/10|NA|NA|TypeA|NA|NA
17|A |2012/08/02|NA |NA|NA|NA|7t")
