How to drop rows containing NA in specified columns? - r

I have a dataframe like this
dep_delay temp humid wind_dir precip pressure date
16983 3 68.00 53.06 NA 0 1020.8 2013-05-07
26477 42 NA 64.93 360 0 NA 2013-03-07
...
29299 -1 NA NA NA NA NA 2013-12-31
29300 33 NA NA NA NA NA 2013-12-31
I want to drop only the rows like 29299 and 29300, which contain 5 NAs from temp to pressure (these are consecutive columns), and keep the rows like 16983 and 26477.
desired result:
dep_delay temp humid wind_dir precip pressure date
16983 3 68.00 53.06 NA 0 1020.8 2013-05-07
26477 42 NA 64.93 360 0 NA 2013-03-07
In other words, the problem is how to remove only the rows where there are at least 5 NAs in a row.
apparently this is not the right way to do it:
df <- df[!is.na(df$temp:df$pressure),]

Updated based on Yacine Jajji comment.
You can do this with rowSums() in base R; no extra package is needed. List the columns to check, count the NAs in those columns for each row, and drop the rows where that count equals the number of columns checked (5 in your case). See the code below:
df <- read.table( text = "dep_delay temp humid wind_dir precip pressure date
16983 3 68.00 53.06 NA 0 1020.8 2013-05-07
26477 42 NA 64.93 360 0 NA 2013-03-07
29299 -1 NA NA NA NA NA 2013-12-31
29300 33 NA NA NA NA NA 2013-12-31")
# columns that must not all be NA at the same time
cols_to_check <- c("temp", "humid", "wind_dir", "precip", "pressure")
# keep the rows where at least one of the checked columns is not NA
df[rowSums(is.na(df[, cols_to_check])) != length(cols_to_check), ]
Output:
dep_delay temp humid wind_dir precip pressure date
16983 3 68 53.06 NA 0 1020.8 2013-05-07
26477 42 NA 64.93 360 0 NA 2013-03-07
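The same filter can be wrapped in a small helper so that the column list is passed in once. This is a minimal base R sketch; drop_if_all_na is a hypothetical name chosen for illustration, not a function from any package.

```r
# Sample data from the question; the first field of each data row becomes the row name
df <- read.table(text = "dep_delay temp humid wind_dir precip pressure date
16983 3 68.00 53.06 NA 0 1020.8 2013-05-07
26477 42 NA 64.93 360 0 NA 2013-03-07
29299 -1 NA NA NA NA NA 2013-12-31
29300 33 NA NA NA NA NA 2013-12-31")

# Hypothetical helper: drop rows in which every column named in `cols` is NA
drop_if_all_na <- function(data, cols) {
  all_missing <- rowSums(is.na(data[, cols, drop = FALSE])) == length(cols)
  data[!all_missing, , drop = FALSE]
}

kept <- drop_if_all_na(df, c("temp", "humid", "wind_dir", "precip", "pressure"))
print(kept)
```

Rows with only some NAs in the checked columns (such as 16983 and 26477) are kept, because the comparison is against the full number of checked columns.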

Related

Find the first rows in a data frame which meet a dynamic condition

Here's some sample code:
library(quantmod)
library(dplyr)
stock.prices <- getSymbols(Symbols = 'AAPL', from = '2017-08-08', to = '2017-08-17', env = NULL)[,c(2,4)]
stock.dividends <- getDividends(Symbol = 'AAPL', from = '2017-08-08', to = '2017-08-17')
summary <- merge(stock.prices, stock.dividends)
summary <- data.frame(date=index(summary), coredata(summary))
summary <- mutate(summary, buy.price = ifelse(is.na(AAPL.div), NA, lag(AAPL.Close, 1)))
summary
It produces this data:
date AAPL.High AAPL.Close AAPL.div lag.buy.price
1 2017-08-08 161.83 160.08 NA NA
2 2017-08-09 161.27 161.06 NA NA
3 2017-08-10 160.00 155.32 0.63 161.06
4 2017-08-11 158.57 157.48 NA NA
5 2017-08-14 160.21 159.85 NA NA
6 2017-08-15 162.20 161.60 NA NA
7 2017-08-16 162.51 160.95 NA NA
I would like to append a column like so:
date AAPL.High AAPL.Close AAPL.div lag.buy.price sell.date
1 2017-08-08 161.83 160.08 NA NA NA
2 2017-08-09 161.27 161.06 NA NA NA
3 2017-08-10 160.00 155.32 0.63 161.06 2017-08-15
4 2017-08-11 158.57 157.48 NA NA NA
5 2017-08-14 160.21 159.85 NA NA NA
6 2017-08-15 162.20 161.60 NA NA NA
7 2017-08-16 162.51 160.95 NA NA NA
This finds the first date that I can sell to break even...I buy stock on 2017-08-09 to be eligible for the dividend the following day. I pay 161.06 per share. Having received the dividend, I'd now like to sell at >= 161.06. 2017-08-15 is the first day that I can do this.
I can run a for-loop to achieve this but it seems rather crude and inefficient.
Is there a way to produce the 'sell.date' column using dplyr?
This should get you there:
library(quantmod)
library(tidyverse)
stock.prices <- getSymbols(Symbols = 'AAPL', from = '2017-08-08', to = '2017-08-17', env = NULL)[,c(2,4)]
stock.dividends <- getDividends(Symbol = 'AAPL', from = '2017-08-08', to = '2017-08-17')
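merged <- merge(stock.prices, stock.dividends)
# keep the xts index as a character date column, then switch to a tibble
summary <- data.frame(date = as.character(index(merged)), coredata(merged),
                      stringsAsFactors = FALSE) %>%
  as_tibble() %>%
  mutate(buy.price = ifelse(is.na(AAPL.div), NA, dplyr::lag(AAPL.Close, 1)))
new_summary <- summary %>%
  mutate(rowname = row_number(),
         sell.date = map2_chr(rowname, buy.price, function(row, buy){
           if(is.na(buy)){
             NA_character_
           }else{
             # later rows whose high is at or above the buy price
             data <- summary %>%
               mutate(rowname = row_number()) %>%
               filter(AAPL.High >= buy, rowname > row)
             # ISO dates stored as text sort chronologically, so min() is the earliest
             if(nrow(data) == 0) NA_character_ else min(data$date)
           }
         }))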
First, append the row numbers to the data frame. Then use purrr::map2_chr to iterate over the data (I changed your library(dplyr) to library(tidyverse) to get purrr). The map2 family takes two vector inputs (here two columns of your data frame, which I took the liberty of switching to a tibble) and runs a function over each pair of values. The anonymous function filters the summary tibble for rows after the input row whose high is at or above the buy price, then returns the earliest date meeting that criterion.
I also made some changes to your data setup so that it uses a pipe chain and a more tidy type of structure.
Hope this helps!
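For comparison, the same lookup (first later row whose high reaches the buy price) can be sketched in base R without quantmod or purrr. The data below is typed in by hand from the table in the question, and first_sell_row is a hypothetical helper name; treat this as an illustration under those assumptions rather than a drop-in replacement.

```r
# Prices copied from the question's table
prices <- data.frame(
  date = as.Date(c("2017-08-08", "2017-08-09", "2017-08-10", "2017-08-11",
                   "2017-08-14", "2017-08-15", "2017-08-16")),
  AAPL.High = c(161.83, 161.27, 160.00, 158.57, 160.21, 162.20, 162.51),
  buy.price = c(NA, NA, 161.06, NA, NA, NA, NA)
)

# Hypothetical helper: index of the first row after i whose high is >= buy[i]
first_sell_row <- function(i, high, buy) {
  if (is.na(buy[i])) return(NA_integer_)
  later <- which(seq_along(high) > i & high >= buy[i])
  if (length(later) == 0) NA_integer_ else later[1]
}

sell_row <- vapply(seq_len(nrow(prices)), first_sell_row, integer(1),
                   high = prices$AAPL.High, buy = prices$buy.price)

# Indexing a Date vector with NA yields NA, so the Date class is preserved
prices$sell.date <- prices$date[sell_row]
print(prices)
```

Indexing the date column by row position keeps sell.date as a Date, which avoids the class loss that ifelse() can cause.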
df[is.na(df$AAPL.div),'AAPL.div'] <- 0
sell.date <-
with(df, {
bought <- date > as.Date('2017-08-09')
date[which.max(bought & (AAPL.Close + cumsum(AAPL.div*bought)) > 161.06)]})
sell.date
#[1] "2017-08-15"
To add this as a column
df$sell.date <- ifelse(is.na(df$lag.buy.price), NA, sell.date)
df
# date AAPL.High AAPL.Close AAPL.div lag.buy.price sell.date
# 1: 2017-08-08 161.83 160.08 0.00 NA <NA>
# 2: 2017-08-09 161.27 161.06 0.00 NA <NA>
# 3: 2017-08-10 160.00 155.32 0.63 161.06 2017-08-15
# 4: 2017-08-11 158.57 157.48 0.00 NA <NA>
# 5: 2017-08-14 160.21 159.85 0.00 NA <NA>
# 6: 2017-08-15 162.20 161.60 0.00 NA <NA>
# 7: 2017-08-16 162.51 160.95 0.00 NA <NA>
data used
library(data.table)
df <- fread("
a date AAPL.High AAPL.Close AAPL.div lag.buy.price
1 2017-08-08 161.83 160.08 NA NA
2 2017-08-09 161.27 161.06 NA NA
3 2017-08-10 160.00 155.32 0.63 161.06
4 2017-08-11 158.57 157.48 NA NA
5 2017-08-14 160.21 159.85 NA NA
6 2017-08-15 162.20 161.60 NA NA
7 2017-08-16 162.51 160.95 NA NA
")[, -1]
This solution is not entirely without a for loop, but I guess you meant a loop that compares each value (that part is vectorized here). The loop is needed in case you observe more than one dividend:
summary$sell.date <- as.Date(rep(NA, nrow(summary)))
# rows where a dividend was paid, i.e. where a buy price exists
buy.rows <- which(!is.na(summary$buy.price))
for (i in buy.rows) {
  # first row from the dividend date onward whose high exceeds the buy price
  hit <- which(seq_len(nrow(summary)) >= i & summary$AAPL.High > summary$buy.price[i])
  summary$sell.date[i] <- summary$date[hit[1]]
}
it produces the following result:
date AAPL.High AAPL.Close AAPL.div buy.price sell.date
1 2017-08-08 161.83 160.08 NA NA <NA>
2 2017-08-09 161.27 161.06 NA NA <NA>
3 2017-08-10 160.00 155.32 0.63 161.06 2017-08-15
4 2017-08-11 158.57 157.48 NA NA <NA>
5 2017-08-14 160.21 159.85 NA NA <NA>
6 2017-08-15 162.20 161.60 NA NA <NA>
7 2017-08-16 162.51 160.95 NA NA <NA>

Duplicate rows while using Merge function in R - but I dont want the sum

So here's my problem, I have about 40 datasets, all csv files that contain only two columns, (a) Date and (b) Price (for each dataset the price column is named as its country).. I used the merge function as follows to consolidate all data into a single dataset with one date column and several price columns. This was the function I used:
merged <- Reduce(function(x, y) merge(x, y, by="Date", all=TRUE), list(a,b,c,d,e,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,aa,ab,ac,ad,ae,af,ag,ah,ai,aj,ak,al,am,an))
What has happened is I have for instance in date column, 3 values for same date but the corresponding country values are split. e.g.:
# Date India China South Korea
# 01-Jan-2000 5445 NA 4445 NA
# 01-Jan-2000 NA 1234 NA NA
# 01-Jan-2000 NA NA NA 5678
I actually want
# 01-Jan-2000 5445 1234 4445 5678
I don't know how to get this, as the other questions on this topic ask for a summation of values, which I clearly do not need. This is a simple example. Unfortunately I have daily data from Jan 2000 to November 2016 for about 43 countries, all messed up. Any help to solve this would be appreciated.
I would append all the data frames with rbind and reshape the result with spread(), because the result of merging depends on the data frame you start with.
Reproducible example:
library(dplyr)
library(tidyr) # provides spread()
a <- data.frame(date = Sys.Date()-1:10, cntry = "China", price=round(rnorm(10,20,5),2))
b <- data.frame(date = Sys.Date()-6:15, cntry = "Netherlands", price=round(rnorm(10,50,10),2))
c <- data.frame(date = Sys.Date()-11:20, cntry = "USA", price=round(rnorm(10,70,25),2))
all <- do.call(rbind, list(a,b,c))
all %>% group_by(date) %>% spread(cntry, price)
results in:
date China Netherlands USA
* <date> <dbl> <dbl> <dbl>
1 2016-11-29 NA NA 78.75
2 2016-11-30 NA NA 66.22
3 2016-12-01 NA NA 86.04
4 2016-12-02 NA NA 17.07
5 2016-12-03 NA NA 75.72
6 2016-12-04 NA 46.90 39.57
7 2016-12-05 NA 51.80 65.11
8 2016-12-06 NA 57.50 96.36
9 2016-12-07 NA 46.42 46.93
10 2016-12-08 NA 45.71 57.63
11 2016-12-09 15.41 60.09 NA
12 2016-12-10 16.66 60.07 NA
13 2016-12-11 23.72 66.21 NA
14 2016-12-12 19.82 45.46 NA
15 2016-12-13 14.22 45.07 NA
16 2016-12-14 27.26 NA NA
17 2016-12-15 20.08 NA NA
18 2016-12-16 15.79 NA NA
19 2016-12-17 17.66 NA NA
20 2016-12-18 26.77 NA NA
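If the merged data frame already exists and rebuilding it from the source files is inconvenient, the split rows can also be collapsed after the fact. The following is a sketch of that alternative, not the rbind/spread approach above: it assumes the Date values of the split rows are identical strings, uses made-up column names, and first_non_na is a hypothetical helper.

```r
# Toy version of the merged result, with one date split across three rows
merged <- data.frame(
  Date  = c("01-Jan-2000", "01-Jan-2000", "01-Jan-2000", "02-Jan-2000"),
  India = c(5445, NA, NA, 5450),
  China = c(NA, 1234, NA, NA),
  Korea = c(NA, NA, 5678, 5680)
)

# Hypothetical helper: first non-missing value, or NA when the group has none
first_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_real_ else x[1]
}

# na.pass keeps rows containing NA so that FUN sees every value in the group
collapsed <- aggregate(. ~ Date, data = merged, FUN = first_non_na,
                       na.action = na.pass)
print(collapsed)
```

Taking the first non-missing value rather than a sum means that no prices are added together; if a date has two different non-missing prices for the same country, only the first is kept, so that case would need to be checked separately.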

Dividing the data in multiple columns to 8 values logically in R

I have the following data. Each column from 1.07m to 11.82m represents a depth, and the values correspond to the temperature. I want to reduce the data to 8 sets (8 distinct water depths) by averaging. For example, row 1 of my data runs from column X1.07m to X2.82m (X2.82m because all the values beyond that point are NA). I would like to create a separate data frame with data and 8 columns (layer1, layer2, layer3, layer4, layer5, layer6, layer7, layer8). The layer1 value should start from 1.07m and layer8 should correspond to the maximum non-zero value.
Data: The dput of data can be found on https://dl.dropboxusercontent.com/u/9267938/rcode.R
> head(data.frame(mytest))
datetime Year Month Day Hour Minute Second X1.07m X1.32m X1.57m X1.82m X2.07m X2.32m X2.57m X2.82m X3.07m
1 2014-08-03 12:40:00 2014 8 3 12 40 0 -0.079553637 -0.018856349 -0.022559778 -0.0278269427 -0.019816260 -0.01304108 -0.003394041 -0.010720688 NA
2 2014-08-03 12:50:00 2014 8 3 12 50 0 -0.001409806 0.006434559 0.013885671 0.0033940409 0.009665614 0.01176982 0.011130125 0.019991707 0.02997477
3 2014-08-03 13:00:00 2014 8 3 13 0 0 -0.006942835 -0.011130125 0.010715907 -0.0058745801 -0.005716650 0.01534520 0.030355206 0.024851408 0.04862646
4 2014-08-03 13:10:00 2014 8 3 13 10 0 -0.020586547 0.002935416 -0.016304143 -0.0001326389 -0.003896694 0.00361282 0.004723244 0.013947785 0.03787721
5 2014-08-03 13:20:00 2014 8 3 13 20 0 -0.028394300 -0.023132719 -0.001721911 -0.0139650391 -0.038460075 0.01749898 0.008466864 0.003630492 0.01442467
6 2014-08-03 13:30:00 2014 8 3 13 30 0 -0.034646511 -0.006791177 0.004064423 -0.0038792422 -0.015942808 -0.02029747 -0.014287663 0.007956902 0.01786172
X3.32m X3.57m X3.82m X4.07m X4.32m X4.57m X4.82m X5.07m X5.32m X5.57m X5.82m X6.07m X6.32m X6.57m X6.82m
1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 0.05094966 0.04699597 0.032100892 0.02650842 0.045689389 0.0169759192 -0.006879327 -0.0187681077 -0.030404344 -0.04405705 -0.04501967 NA NA NA NA
3 0.04500833 0.01713256 0.006450535 0.02870071 0.019079580 0.0009741734 -0.024666588 -0.0409943643 -0.030201313 -0.03873463 -0.02893064 NA NA NA NA
4 0.03971244 0.05723497 0.039496306 0.03799276 0.012742073 0.0024111385 -0.023706420 -0.0188563490 -0.033791404 -0.04162619 -0.02979164 -0.045051204 NA NA NA
5 0.03269076 0.05125416 0.054766084 0.03625076 0.005988487 0.0020217180 -0.007510352 -0.0069913419 -0.006656083 -0.01630414 -0.01403812 -0.001580609 NA NA NA
6 0.01913708 0.03932811 0.048955209 0.04764632 0.037480601 0.0205218532 0.004171715 0.0009371753 -0.002468609 -0.04511612 -0.01263816 0.035861544 NA NA NA
X7.07m X7.32m X7.57m X7.82m X8.07m X8.32m X8.57m X8.82m X9.07m X9.32m X9.57m X9.82m X10.07m X10.32m X10.57m X10.82m X11.07m X11.32m X11.57m X11.82m
1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Sometimes the data points will be 20, 22, 25 points so the function should be written such that it would try to account those information and divide into 8 data values for each rows.
Rcode.R linked to dropbox has the code that has dput of mytest. It was pretty big to be posted here. So I posted a external link.
Info added
Each row would have different number of data. The motive is to convert them into 8 columns of data using averaging or linear interpolation.
Taking the question as a desire to collapse the values to means of eight equally spaced depths, dplyr and tidyr take us where we need to go:
library(dplyr)
library(tidyr)
mytest %>%
# melt to long form
gather(depth, value, -datetime:-Second, na.rm = TRUE) %>%
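# clean depth to number (strip everything except digits and the decimal point)
mutate(depth = as.numeric(gsub("[^0-9.]", "", depth))) %>%
# group so cut levels are computed for each datetime
group_by(datetime) %>%
# cut depth into 8 levels per group, in a separate mutate() so it runs per group
mutate(levels = cut(depth, 8, paste0('level', 1:8))) %>%
# group to keep columns
group_by(datetime, levels) %>%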
# collapse groups by taking the mean
summarise(value = mean(value)) %>%
# re-spread new levels to wide form
spread(levels, value) %>%
# re-add other time columns dropped by summarise
inner_join(mytest %>% select(datetime:Second), .)
# Source: local data frame [20 x 15]
#
# datetime Year Month Day Hour Minute Second level1 level2
# (time) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
# 1 2014-08-03 12:40:00 2014 8 3 12 40 0 -0.079553637 -0.0188563490
# 2 2014-08-03 12:50:00 2014 8 3 12 50 0 0.006303474 0.0065298277
# 3 2014-08-03 13:00:00 2014 8 3 13 0 0 -0.002452351 -0.0057956151
# 4 2014-08-03 13:10:00 2014 8 3 13 10 0 -0.011318424 -0.0001388374
# 5 2014-08-03 13:20:00 2014 8 3 13 20 0 -0.017749644 -0.0116420430
# 6 2014-08-03 13:30:00 2014 8 3 13 30 0 -0.012457755 -0.0133731725
# 7 2014-08-03 13:40:00 2014 8 3 13 40 0 -0.020440875 -0.0253538846
# 8 2014-08-03 13:50:00 2014 8 3 13 50 0 -0.058681338 -0.0177194127
# 9 2014-08-03 14:00:00 2014 8 3 14 0 0 -0.037929680 -0.0211918383
# 10 2014-08-03 14:10:00 2014 8 3 14 10 0 -0.027045726 -0.0147261076
# 11 2014-08-03 14:20:00 2014 8 3 14 20 0 -0.048997399 -0.0290804019
# 12 2014-08-03 14:30:00 2014 8 3 14 30 0 -0.059110466 -0.0370898043
# 13 2014-08-03 14:40:00 2014 8 3 14 40 0 -0.067156867 -0.0138750287
# 14 2014-08-03 14:50:00 2014 8 3 14 50 0 -0.049762164 -0.0280648246
# 15 2014-08-03 15:00:00 2014 8 3 15 0 0 -0.028033559 -0.0245379952
# 16 2014-08-03 15:10:00 2014 8 3 15 10 0 -0.044087211 -0.0107995239
# 17 2014-08-03 15:20:00 2014 8 3 15 20 0 -0.028761973 -0.0113161242
# 18 2014-08-03 15:30:00 2014 8 3 15 30 0 -0.013476051 -0.0142316424
# 19 2014-08-03 15:40:00 2014 8 3 15 40 0 -0.012799297 -0.0135366710
# 20 2014-08-03 15:50:00 2014 8 3 15 50 0 -0.012238548 -0.0180806876
# Variables not shown: level3 (dbl), level4 (dbl), level5 (dbl), level6 (dbl), level7 (dbl),
# level8 (dbl)
Note that you should check that these data make sense in context; you've lost your depth data by scaling them.
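To make the binning step concrete, here is a minimal base R sketch for a single profile (one row of the data). The depths and values are made up for illustration and bin_profile is a hypothetical helper; it drops the NA readings first, so the 8 levels span only the depths that were actually measured.

```r
# Hypothetical helper: average the non-NA values of one profile into n depth bins
bin_profile <- function(depth, value, n = 8) {
  keep <- !is.na(value)
  # cut() splits the observed depth range into n equal-width intervals
  level <- cut(depth[keep], breaks = n, labels = paste0("level", seq_len(n)))
  tapply(value[keep], level, mean)
}

# Ten depths at 0.25 m spacing; the two deepest readings are missing
depth <- 1.07 + 0.25 * (0:9)
value <- c(1, 2, 3, 4, 5, 6, 7, 8, NA, NA)

binned <- bin_profile(depth, value)
print(binned)
```

With exactly eight readings each value lands in its own level; with more readings several are averaged per level, and with fewer some levels come back as NA.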

Daily averages of all data frame variables including NA values with aggregate function

I want to calculate daily means of all variables in my dataframe, which includes NA values. All my databases have a value every 30 min, so I'm very interested in using the timestamp with the aggregate function to obtain daily, weekly, monthly... aggregated data.
My dataframe is 37795 rows x 54 variables. I've tried two ways to do that. The first option does not give me daily means, because the values I obtain are too high (not logical). The second option gives me almost all NA values. I do not know what to do.
I include the head of my dataframe and my code below.
head(data)
timestamp day month year.x hour minute doy.x rn_1_1_1 ppfd_1_1_1
1 2013-07-06 00:00:00 6 7 2013 0 0 187.000 -84.37381 0.754
2 2013-07-06 00:30:00 6 7 2013 0 30 187.020 -84.07990 0.808
3 2013-07-06 01:00:00 6 7 2013 1 0 187.041 -82.19991 0.808
4 2013-07-06 01:30:00 6 7 2013 1 30 187.062 -81.12341 0.831
5 2013-07-06 02:00:00 6 7 2013 2 0 187.083 -79.57474 0.708
6 2013-07-06 02:30:00 6 7 2013 2 30 187.104 -77.72460 0.639
ppfdr_1_1_1 p_rain_1_1_1 swc_1_1_1 swc_2_1_1 swc_3_1_1 air_pressure air_pressure.1
1 0.624 0 0.07230304 0.09577876 0.134602791 101212.4165 1012.124165
2 0.587 0 0.07233134 0.09569421 0.134479816 101181.8094 1011.818094
3 0.713 0 0.07242914 0.09566160 0.134203719 101166.0948 1011.660948
4 0.72 0 0.07252077 0.09563419 0.134149141 101144.6151 1011.446151
5 0.564 0 0.07261925 0.09560297 0.134095791 101144.8662 1011.448662
6 0.706 0 0.07271843 0.09557789 0.134037119 101144.5084 1011.445084
u_rot v_rot w_rot wind_speed u. h_scr_qc01_man
1 5.546047919 1.42E-14 4.76E-16 5.546047919 0.426515403 -28.07603618
2 5.122724997 6.94E-15 -8.00E-16 5.122724997 0.408213459 -34.39110979
3 5.248639421 4.56E-15 7.28E-17 5.248639421 0.393959075 -33.29033501
4 4.845257286 2.81E-14 -1.33E-17 4.845257286 0.365475898 -32.62427147
5 4.486426895 1.39E-14 -4.43E-16 4.486426895 0.335905384 -33.80219189
6 4.109603841 7.08E-15 -9.76E-16 4.109603841 0.312610588 -35.77289349
fco2_scr_qc01_man le_scr_qc01_man fco2_scr_qc0 fco2_scr_qc0_man date year.y time
1 -0.306504951 NA NA NA 06-jul-13 2013 0:00
2 -0.206266524 NA -0.206266524 -0.206266524 06-jul-13 2013 0:30
3 -0.268508139 NA -0.268508139 -0.268508139 06-jul-13 2013 1:00
4 -0.203804516 0.426531598 -0.203804516 -0.203804516 06-jul-13 2013 1:30
5 -0.217438742 -0.358248118 -0.217438742 -0.217438742 06-jul-13 2013 2:00
6 -0.193778528 2.571063044 -0.193778528 -0.193778528 06-jul-13 2013 2:30
doy_ent doy.y doy_cum doy_cum_ent mes nrecord bat panel_temp vwc_0.1
1 187 187.0000 187.0000 187 7 24 12.57 22.93 0.06284828
2 187 187.0208 187.0208 187 7 25 12.56 22.85 0.06267169
3 187 187.0417 187.0417 187 7 26 12.55 22.58 0.06261738
4 187 187.0625 187.0625 187 7 27 12.54 22.3 0.06247716
5 187 187.0833 187.0833 187 7 28 12.53 22.01 0.06249525
6 187 187.1042 187.1042 187 7 29 12.52 21.82 0.06236862
vwc_0.5 vwc_1.5 temp_0.1 temp_0.5 temp_1.5 tempsd_0.1 tempsd_0.5 tempsd_1.5
1 0.07569027 0.1007845 30.9 28.96 25.14 0.372 0.961 0.767
2 0.07569027 0.1007743 30.8 28.85 24.99 0.181 1.361 1.087
3 0.07568554 0.1008558 30.53 28.8 25.03 0.98 1.476 0.351
4 0.07559577 0.1008507 30.52 29.09 25.11 0.186 0.229 0.556
5 0.07559577 0.1007743 30.11 29.09 24.87 1.331 0.191 0.954
6 0.07556271 0.1007285 30.15 29.33 25.04 1.447 1.078 0.2
pair pair_avg CO2_0.1 CO2_0.5 CO2_1.5 DCO2_0.1 DCO2_0.5
1 101.2124 101.2118 1161.592832 3275.1134 4888.231603 -24.67422109 34.88538221
2 101.1818 101.2131 1168.144925 3338.24016 4941.418642 6.55209301 63.12675931
3 101.1661 101.2090 1201.049131 3435.235974 5012.525851 32.90420541 96.9958144
4 101.1446 101.2007 1268.613941 3556.723878 5092.96558 67.56481067 121.4879035
5 101.1449 101.1906 1364.315214 3680.188043 5164.795759 95.7012722 123.464165
6 101.1445 101.1805 1472.975286 3808.988677 5236.40855 108.6600723 128.8006346
DCO2_1.5
1 31.30293041
2 53.18703947
3 71.10720845
4 80.43972916
5 71.83017884
6 71.61279156
## Daily avg - OPTION 1
data$timestamp <- as.POSIXct(data$timestamp, format = "%d/%m/%Y %H:%M",tz ="GMT")
dates <- format(data$timestamp,"%Y/%m/%d",tz = "GMT")
datadates <- cbind(data,dates)
dailydata_avg <- aggregate(. ~ dates, datadates, FUN=mean, na.rm=TRUE, na.action = "na.pass")
head(dailydata_avg)
dates timestamp day month year.x hour minute doy.x rn_1_1_1 ppfd_1_1_1
1 2013/07/06 1373111100 6 7 2013 11.5 15 187.489 159.7788 3580.562
2 2013/07/07 1373197500 7 7 2013 11.5 15 188.489 154.0925 3506.688
3 2013/07/08 1373283900 8 7 2013 11.5 15 189.489 152.5259 3460.667
4 2013/07/09 1373370300 9 7 2013 11.5 15 190.489 131.1619 2965.250
5 2013/07/10 1373456700 10 7 2013 11.5 15 191.489 136.7853 3171.958
6 2013/07/11 1373543100 11 7 2013 11.5 15 192.489 145.2757 3282.167
ppfdr_1_1_1 p_rain_1_1_1 swc_1_1_1 swc_2_1_1 swc_3_1_1 air_pressure air_pressure.1
1 2552.396 1.0000 0.07095847 0.09606378 18341.81 25940.167 25940.167
2 2532.542 1.0000 0.06994341 0.09502167 18065.98 24891.000 24891.000
3 2523.562 1.0000 0.06860553 0.09379282 17777.02 23107.271 23107.271
4 2336.000 1.0000 0.06717054 0.09268716 17526.50 19309.500 19309.500
5 2607.229 1.0625 0.06620048 0.09166904 17275.56 8385.646 8385.646
6 2484.521 1.0000 0.06562964 0.09083684 17028.94 3535.438 3535.438
u_rot v_rot w_rot wind_speed u. h_scr_qc01_man fco2_scr_qc01_man
1 32167.83 2215.875 2041.354 32167.83 28531.44 18197.75 15365.65
2 30878.27 1911.312 1939.917 30878.27 26929.62 17605.52 14955.56
3 26052.96 2261.417 2116.458 26052.96 23305.83 19167.98 18399.33
4 17284.04 1987.438 2139.083 17284.04 17704.35 20349.92 18137.65
5 12028.06 2053.812 1960.417 12028.06 15670.00 21997.83 21120.19
6 15607.50 1997.417 1907.646 15607.50 15384.56 18000.94 18810.62
le_scr_qc01_man fco2_scr_qc0 fco2_scr_qc0_man date year.y time doy_ent doy.y
1 17409.67 13032.10 13027.90 137 2013 44.5 187 187.4896
2 15524.38 12077.17 12072.92 163 2013 44.5 188 188.4896
3 16407.71 14775.94 14770.56 189 2013 44.5 189 189.4896
4 16788.04 15024.79 15019.02 215 2013 44.5 190 190.4896
5 17955.58 17737.25 17730.75 241 2013 44.5 191 191.4896
6 14610.02 16605.48 16599.33 267 2013 44.5 192 192.4896
doy_cum doy_cum_ent mes nrecord bat panel_temp vwc_0.1 vwc_0.5 vwc_1.5
1 187.4896 187.5 7 28966.375 111.5208 1836.250 4638.833 4594.396 37.35417
2 188.4896 188.5 7 20801.417 111.7292 1900.812 4656.875 4392.979 26.68750
3 189.4896 189.5 7 4394.500 110.6042 1934.792 4675.604 4238.229 65.20833
4 190.4896 190.5 7 9467.708 104.0000 2090.896 4776.521 4178.729 54.12500
5 191.4896 191.5 7 14796.375 109.7500 2145.875 4907.292 4161.312 108.39583
6 192.4896 192.5 7 20127.958 109.3125 1934.375 4876.021 4123.458 143.10417
temp_0.1 temp_0.5 temp_1.5 tempsd_0.1 tempsd_0.5 tempsd_1.5 pair pair_avg CO2_0.1
1 2018.438 1565.812 797.8750 470.8125 474.3958 508.8333 101.1268 101.1323 10400.27
2 1998.438 1574.000 783.1875 478.3333 460.4583 566.0208 101.0764 101.0789 11292.75
3 1994.833 1568.104 780.2083 463.8125 453.1667 488.5625 100.9967 101.0036 13288.25
4 2042.625 1564.875 780.1667 465.0000 599.2708 437.6042 100.8520 100.8665 16156.60
5 2114.708 1576.729 780.5000 471.5833 406.5417 484.6875 100.4828 100.5169 18656.50
6 2124.604 1591.125 781.8125 516.7500 530.3333 510.7500 100.3025 100.2947 14586.60
CO2_0.5 CO2_1.5 DCO2_0.1 DCO2_0.5 DCO2_1.5
1 26360.38 34371.31 19795.81 20637.94 27123.92
2 26939.60 34558.17 18838.38 20464.56 20452.58
3 27603.06 34608.31 17413.15 19998.02 22754.85
4 28572.69 34678.38 19294.62 21894.92 18379.62
5 28983.29 34644.15 20251.17 20409.58 22077.40
6 28236.12 34736.67 17031.02 18852.04 19684.69
## Daily avg - OPTION 2
data$timestamp <- as.POSIXct(data$timestamp, format = "%d/%m/%Y %H:%M",tz ="GMT")
datatime <- data$timestamp
dailydata_avg <- aggregate( data,
by = list('DATES'= format(datatime,'%Y%m%d' )),
FUN = mean, na.rm=T)
I obtain this console message:
1: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
2: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
3: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
head(dailydata_avg)
DATES timestamp day month year.x hour minute doy.x rn_1_1_1 ppfd_1_1_1
1 20130706 2013-07-06 13:45:00 6 7 2013 11.5 15 187.489 159.7788 NA
2 20130707 2013-07-07 13:45:00 7 7 2013 11.5 15 188.489 154.0925 NA
3 20130708 2013-07-08 13:45:00 8 7 2013 11.5 15 189.489 152.5259 NA
4 20130709 2013-07-09 13:45:00 9 7 2013 11.5 15 190.489 131.1619 NA
5 20130710 2013-07-10 13:45:00 10 7 2013 11.5 15 191.489 136.7853 NA
6 20130711 2013-07-11 13:45:00 11 7 2013 11.5 15 192.489 145.2757 NA
ppfdr_1_1_1 p_rain_1_1_1 swc_1_1_1 swc_2_1_1 swc_3_1_1 air_pressure air_pressure.1
1 NA NA 0.07095847 0.09606378 NA NA NA
2 NA NA 0.06994341 0.09502167 NA NA NA
3 NA NA 0.06860553 0.09379282 NA NA NA
4 NA NA 0.06717054 0.09268716 NA NA NA
5 NA NA 0.06620048 0.09166904 NA NA NA
6 NA NA 0.06562964 0.09083684 NA NA NA
u_rot v_rot w_rot wind_speed u. h_scr_qc01_man fco2_scr_qc01_man le_scr_qc01_man
1 NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA
fco2_scr_qc0 fco2_scr_qc0_man date year.y time doy_ent doy.y doy_cum doy_cum_ent
1 NA NA NA 2013 NA 187 187.4896 187.4896 187.5
2 NA NA NA 2013 NA 188 188.4896 188.4896 188.5
3 NA NA NA 2013 NA 189 189.4896 189.4896 189.5
4 NA NA NA 2013 NA 190 190.4896 190.4896 190.5
5 NA NA NA 2013 NA 191 191.4896 191.4896 191.5
6 NA NA NA 2013 NA 192 192.4896 192.4896 192.5
mes nrecord bat panel_temp vwc_0.1 vwc_0.5 vwc_1.5 temp_0.1 temp_0.5 temp_1.5
1 7 NA NA NA NA NA NA NA NA NA
2 7 NA NA NA NA NA NA NA NA NA
3 7 NA NA NA NA NA NA NA NA NA
4 7 NA NA NA NA NA NA NA NA NA
5 7 NA NA NA NA NA NA NA NA NA
6 7 NA NA NA NA NA NA NA NA NA
tempsd_0.1 tempsd_0.5 tempsd_1.5 pair pair_avg CO2_0.1 CO2_0.5 CO2_1.5 DCO2_0.1
1 NA NA NA 101.1268 101.1323 NA NA NA NA
2 NA NA NA 101.0764 101.0789 NA NA NA NA
3 NA NA NA 100.9967 101.0036 NA NA NA NA
4 NA NA NA 100.8520 100.8665 NA NA NA NA
5 NA NA NA 100.4828 100.5169 NA NA NA NA
6 NA NA NA 100.3025 100.2947 NA NA NA NA
DCO2_0.5 DCO2_1.5
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 NA NA
How could I do it?
Thanks!!
I didn't use the aggregate function; I used tapply instead.
This is the code I came up with, which deals with NAs:
# create a sequence of DateTime with half-hourly data
DateTime <- seq.POSIXt(from = as.POSIXct("2015-05-01 00:00:00", tz = "Etc/GMT+12"),
to = as.POSIXct("2015-05-30 23:59:00", tz = "Etc/GMT+12"), by = 1800)
# create some dummy data of the same length as the DateTime vector
aa <- runif(1440, 5.0, 7.5)
# "bb" is a column made entirely of NA (recycled to the same length)
bb <- NA
df <- data.frame(DateTime, aa, bb)
# replace a cell with NA in the "aa" column
df[19,2] <- NA # dataframe = df, row = 19, column = 2
# create DateHour column to use later
df$DateHour <- paste(format(df$DateTime, "%Y/%m/%d"), format(df$DateTime, "%H"), sep = " ")
View(df)
# Hourly means
# Calculate hourly mean values
aa.HourlyMean <- tapply(df$aa, df$DateHour, mean, na.rm = TRUE)
# convert the vector to dataframe
aa.HourlyMean <- data.frame(aa.HourlyMean)
# Extract the DateHour column from the "aa" dataframe
aa.HourlyMean$DateHour <- row.names(aa.HourlyMean);
# Delete rownames of "aa" dataframe
row.names(aa.HourlyMean) <- NULL
# Create a tidy DateTime column
aa.HourlyMean$DateTime <- as.POSIXct(aa.HourlyMean$DateHour, format = "%Y/%m/%d %H", tz = "Etc/GMT+12")
# change to a tidy dataframe
aa.HourlyMean <- aa.HourlyMean[,c(3,2,1)]
# You can delete any column (for example, DateHour) by
# aa.HourlyMean$DateHour <- NULL
# You can rename a column in base R by
# names(aa.HourlyMean)[3] <- "NewColumnName"
# View the hourly mean of the "aa" dataframe
View(aa.HourlyMean)
# You can do the same with the "bb" vector
bb.HourlyMean <- tapply(df$bb, df$DateHour, mean, na.rm = TRUE)
bb.HourlyMean <- data.frame(bb.HourlyMean)
# View the hourly mean of the "bb" vector
View(bb.HourlyMean)
# /Hourly means
You can then combine aa.HourlyMean and bb.HourlyMean into one dataframe.
# Daily means
df$Date <- format(df$DateTime, "%Y/%m/%d")
aa.DailyMean <- tapply(df$aa, df$Date, mean, na.rm = TRUE)
aa.DailyMean <- data.frame(aa.DailyMean)
aa.DailyMean$Date <- row.names(aa.DailyMean); row.names(aa.DailyMean) <- NULL
aa.DailyMean <- aa.DailyMean[,c(2,1)]
View(aa.DailyMean)
# /Daily means
# Weekly means
df$YearWeek <- paste(format(df$DateTime, "%Y"), strftime(df$DateTime, format = "%W"), sep = " ")
aa.WeeklyMean <- tapply(df$aa, df$YearWeek, mean, na.rm = TRUE)
aa.WeeklyMean <- data.frame(aa.WeeklyMean)
aa.WeeklyMean$YearWeek <- row.names(aa.WeeklyMean); row.names(aa.WeeklyMean) <- NULL
aa.WeeklyMean <- aa.WeeklyMean[,c(2,1)]
View(aa.WeeklyMean)
# /Weekly means
I created the mean values for hourly, daily and weekly observations but you get the idea how to create the monthly, yearly, ... ones.
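The per-column tapply pattern can also be applied to every numeric column in one call with aggregate. The sketch below uses made-up data and column names (temp, flux, label); it restricts the aggregation to numeric columns, which is one plausible way to avoid the "argument is not numeric or logical" warnings shown in the question, though it has not been checked against the asker's actual data.

```r
# Two days of half-hourly dummy observations
timestamp <- seq(as.POSIXct("2013-07-06 00:00:00", tz = "GMT"),
                 by = "30 min", length.out = 96)
obs <- data.frame(timestamp,
                  temp = rep(c(10, 20), each = 48),
                  flux = seq_len(96),
                  label = "site_a")
obs$temp[5] <- NA  # one missing reading

# Keep only numeric columns; timestamps and text columns are skipped
is_num <- vapply(obs, is.numeric, logical(1))

# Group by calendar day and average, ignoring NA values within each day
day <- format(obs$timestamp, "%Y-%m-%d", tz = "GMT")
daily <- aggregate(obs[is_num], by = list(date = day), FUN = mean, na.rm = TRUE)
print(daily)
```

Changing the format string (for example to "%Y-%m" or "%Y %W") gives monthly or weekly means with the same call.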

R Fill cells with previous data

I have a table like the following:
days Debit loaddate
1 23/01/2014 138470289.4 23/01/2014
2 24/01/2014 NA NA
3 25/01/2014 NA NA
4 26/01/2014 NA NA
5 27/01/2014 NA NA
one row for each day and then in the columns loaddate after a few NA another date appears:
28 19/02/2014 NA NA
29 20/02/2014 NA NA
30 21/02/2014 NA NA
31 22/02/2014 9090967.9 22/02/2014
32 23/02/2014 NA NA
33 24/02/2014 308083.5 24/02/2014
I would like to replace each NA in loaddate column with the previous date in loaddate.
I tried:
for(i in 1:nrow(data3))
{ if (!is.na(data3[i,'Debit']))
{data3[i,'loaddate1']<-as.Date(data3[i,'loaddate'], format='%Y-%m-%d')}
else {data3[i,'loaddate1']<-data3[i-1,'loaddate1']}
}
But I got the wrong format:
> head(data3)
days Debit loaddate loaddate1
1 2014-01-23 138470289 2014-01-23 16093
2 2014-01-24 NA <NA> 16093
3 2014-01-25 NA <NA> 16093
4 2014-01-26 NA <NA> 16093
5 2014-01-27 NA <NA> 16093
6 2014-01-28 NA <NA> 16093
I need to get the date format also. If I do:
for(i in 1:nrow(data3))
{ if (!is.na(data3[i,'Debit']))
{data3[i,'loaddate1']<-as.Date(data3[i,'loaddate'], format='%Y-%m-%d')}
else {data3[i,'loaddate1']<-as.Date(data3[i-1,'loaddate1'], format='%Y-%m-%d')}
}
I got the wrong result (with NA).
days Debit loaddate loaddate1
1 2014-01-23 138470289 2014-01-23 16093
2 2014-01-24 NA <NA> <NA>
3 2014-01-25 NA <NA> <NA>
4 2014-01-26 NA <NA> <NA>
5 2014-01-27 NA <NA> <NA>
6 2014-01-28 NA <NA> <NA>
How can I get the right result and with the right format?
Also, Is there a better way to do this replacement? I mean without a loop.
Thanks.
Try zoo::na.locf and make sure to use the appropriate date format:
library(zoo)
data3$loaddate <- as.Date(na.locf(data3$loaddate, na.rm = FALSE), format='%d/%m/%Y')
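If zoo is not available, the same carry-forward idea can be sketched in base R. fill_forward is a hypothetical helper written for this illustration, and the dates are a shortened version of the question's loaddate column. Because it only subsets the original vector, the Date class is kept and no numbers such as 16093 appear.

```r
# Hypothetical helper: replace each NA with the most recent non-NA value
fill_forward <- function(x) {
  seen <- cumsum(!is.na(x))  # position of the latest non-NA value so far
  seen[seen == 0] <- NA      # nothing to carry forward before the first value
  x[!is.na(x)][seen]
}

loaddate <- as.Date(c("23/01/2014", NA, NA, "22/02/2014", NA, "24/02/2014"),
                    format = "%d/%m/%Y")
filled <- fill_forward(loaddate)
print(filled)
```

Leading NAs, if any, are left as NA rather than dropped, so the result always has the same length as the input.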
