R Loop by date calculations and put into a new dataframe/matrix - r

I have a database with 7,994,625 obs of 42 variables. It's basically water quality parameters taken from multiple stations every 15 minutes for 1 to 12 years depending on stations...
here is the head of dataframe:
STATION DATE Time SONDE Layer TOTAL_DEPTH TOTAL_DEPTH_A BATT BATT_A WTEMP WTEMP_A SPCOND SPCOND_A
1 CCM0069 2001-05-01 09:45:52 AMY BS NA NND 11.6 <NA> 19.32 <NA> 0.387 <NA>
2 CCM0069 2001-05-01 10:00:52 AMY BS NA NND 11.5 <NA> 19.51 <NA> 0.399 <NA>
3 CCM0069 2001-05-01 10:15:52 AMY BS NA NND 11.5 <NA> 19.49 <NA> 0.407 <NA>
4 CCM0069 2001-05-01 10:30:52 AMY BS NA NND 11.5 <NA> 19.34 <NA> 0.428 <NA>
5 CCM0069 2001-05-01 10:45:52 AMY BS NA NND 11.5 <NA> 19.42 <NA> 0.444 <NA>
6 CCM0069 2001-05-01 11:00:52 AMY BS NA NND 11.5 <NA> 19.31 <NA> 0.460 <NA>
SALINITY SALINITY_A DO_SAT DO_SAT_A DO DO_A PH PH_A TURB_NTU TURB_NTU_A FLUOR FLUOR_A TCHL_PRE_CAL
1 0.19 <NA> 97.8 <NA> 9.01 <NA> 7.24 <NA> 19.5 <NA> 9.6 <NA> 63.4
2 0.19 <NA> 99.7 <NA> 9.14 <NA> 7.26 <NA> 21.1 <NA> 9.5 <NA> 63.2
3 0.20 <NA> 99.3 <NA> 9.11 <NA> 7.23 <NA> 19.2 <NA> 9.7 <NA> 64.3
4 0.21 <NA> 98.4 <NA> 9.05 <NA> 7.23 <NA> 20.0 <NA> 10.2 <NA> 67.6
5 0.21 <NA> 99.2 <NA> 9.12 <NA> 7.23 <NA> 21.2 <NA> 10.4 <NA> 68.7
6 0.22 <NA> 98.7 <NA> 9.09 <NA> 7.23 <NA> 18.3 <NA> 11.0 <NA> 72.5
TCHL_PRE_CAL_A CHLA CHLA_A COMMENTS month year day
1 <NA> <NA> <NA> <NA> May 2001 1
2 <NA> <NA> <NA> <NA> May 2001 1
3 <NA> <NA> <NA> <NA> May 2001 1
4 <NA> <NA> <NA> <NA> May 2001 1
5 <NA> <NA> <NA> <NA> May 2001 1
6 <NA> <NA> <NA> <NA> May 2001 1
I have been all through the R help sites and found similar questions, but when I tried to adapt them to my dataframe, no dice.
I'm trying to loop by date and calculate the total number of DO observations, the number of times DO falls below 5 mg/l, and then the % failure rate at 5 mg/l. I can do this over entire datasets and subset each station and date individually just fine, but I need to do this in a loop and put the results in a new dataframe with other parameter calculations... I guess I just need a head start.
Here is what little I have figured out (or not).
x <- levels(sub$DATE)
for(i in 1:length(x)){
x$c<-(sum(!is.na(x$DO)))/4 # number of DO measurements and put into hours(every 15 mins)
x$dur<-(sum(x$DO<= 5))/4 # number of DO measurement under 5 mg/l and put into hours
x$fail<-(x$dur/x$c)*100 # failure rate at station and day
}
I get error codes about atomic vectors
What I eventually want is this
station date c dur fail
HGD2115 5/1/2001 24 5 20.83333333
HGD2115 5/2/2001 22 20 90.90909091
HGD2115 5/3/2001 24 12 50
JLD5564 5/1/2001 20 6 30
JLD5564 5/2/2001 12 2 16.66666667
JLD5564 5/3/2001 23 5 21.73913043
there are more calculations I need to do and add to the new dataframe such as the monthly min max and mean of salinity, temperature, etc... hopefully I won't have to come back for help with that. I just need some advice and push in right direction.
and eventually I will get really wild by throwing out days with not enough DO measurements!

This seems like what you are asking (??)
# create sample dataset - you have this already
# 100 stations, 10 days, 15-minute intervals = 100*10*24*4
library(stringr) # for str_pad(...) in example only - you don't need this
set.seed(1) # for reproducible example...
data <- data.frame(STATION=paste0("CMM",str_pad(rep(1:100,each=4*24*10),3,pad="0")),
DATE = as.POSIXct("2001-05-01")+seq(0,15*60*24*1000,len=4*24*1000),
DO = rpois(4*24*1000,5))
# you start here
result <- aggregate(DO~as.Date(DATE)+STATION,data,function(x) {
count <- sum(!is.na(x))
fail <- sum(x[!is.na(x)]<5)
pct.fail <- 100*fail/count
c(count,fail,pct.fail)
})
result <- data.frame(result[,1:2],result[,3])
colnames(result) <- c("DATE","STATION","COUNT","FAIL","PCT.FAIL")
head(result)
# DATE STATION COUNT FAIL PCT.FAIL
# 1 2001-05-01 CMM001 320 147 45.93750
# 2 2001-05-02 CMM001 384 163 42.44792
# 3 2001-05-03 CMM001 256 119 46.48438
# 4 2001-05-03 CMM002 128 61 47.65625
# 5 2001-05-04 CMM002 384 191 49.73958
# 6 2001-05-05 CMM002 384 168 43.75000
This uses the so-called formula interface to aggregate(...) to subset data by date (using as.Date(DATE)) and STATION. For every subgroup, the column DO is passed to the function, which calculates count, fail, and pct.fail as you did.
When the function in aggregate(...) returns a vector, as this one does, the result is a data frame with 3 columns, one for date, one for station, and one containing the vector of results. But you want these in separate columns (so, 5 columns total in your case). The line:
result <- data.frame(result[,1:2],result[,3])
does this.
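As a small illustration of the point above, here is a sketch on made-up toy data (the data frame `toy` and its values are assumptions, not the real sonde data). It shows that the third column returned by aggregate(...) is a matrix, and that do.call(data.frame, ...) is another common way to spread it into ordinary columns:

```r
# Toy data: 2 stations, 2 days, two DO readings per station-day (values are made up)
toy <- data.frame(
  STATION = rep(c("A", "B"), each = 4),
  DATE    = as.Date(rep(c("2001-05-01", "2001-05-01", "2001-05-02", "2001-05-02"), 2)),
  DO      = c(4, 6, 3, 7, 5.5, 2, 8, 9)
)

# Same pattern as above: the function returns a named vector per group
res <- aggregate(DO ~ DATE + STATION, toy, function(x) {
  c(COUNT = length(x), FAIL = sum(x < 5), PCT.FAIL = 100 * mean(x < 5))
})

ncol(res)          # DATE, STATION, and one matrix column called DO
is.matrix(res$DO)

# do.call(data.frame, ...) expands the matrix column into separate columns
flat <- do.call(data.frame, res)
names(flat)
```

The flattened columns get a `DO.` prefix, so you may still want to rename them with colnames(...) as in the answer.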

Here is a slight variation using the aggregate solution. Instead of having the relational operator inside the aggregate function, a second data set is made consisting only of the data that satisfies the requirement (DO < 5).
set.seed(5)
samp_times<- seq(as.POSIXct("2014-06-01 00:00:00", tz = "UTC"),
as.POSIXct("2014-12-31 23:45:00", tz = "UTC"),
by = 60*15)
ntimes=length(samp_times)
nSta<-15
sta<-character(nSta) # pre-allocate station names; vector() does not accept mode="any"
for (iSta in seq(1,nSta)) {
sta[iSta] <- paste(paste(sample(letters,3), collapse = ''), sample(1000:9999, 1), sep="")
}
df<-data.frame(DATETIME=rep(rep(samp_times,each=nSta)), STATION=sta, DO=runif(ntimes*nSta,.1,10))
df$DATE<-strftime(df$DATETIME, format="%Y-%m-%d", tz="UTC") # format in UTC, the zone samp_times was built in
df$TIME<-strftime(df$DATETIME, format="%H:%M:%S", tz="UTC")
head(df,20)
do_small = 5
agr_1 <- aggregate(df$DO,list(station=df$STATION,date=df$DATE),length)
dfSmall <- df[df$DO<=do_small,]
agr_2 <- aggregate(dfSmall$DO,list(station=dfSmall$STATION,date=dfSmall$DATE),length)
names(agr_1)[3]="nDO"
names(agr_2)[3]="nDO_Small"
agr <- merge(agr_1,agr_2)
agr$pcnt_DO_SMALL <- agr$nDO_Small / agr$nDO * 100
head(agr)
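One thing to watch with this variation: merge(...) keeps only station/date pairs present in both tables, so a day with no readings at or below the threshold disappears from the result instead of showing 0%. A minimal sketch on made-up data (the data frame `toy` is an assumption for illustration) that keeps those days:

```r
# Hypothetical toy data: station "B" never drops to 5 or below on this day
toy <- data.frame(
  STATION = c("A", "A", "A", "B", "B"),
  DATE    = "2014-06-01",
  DO      = c(3, 6, 4, 7, 8)
)

n_all   <- aggregate(DO ~ STATION + DATE, toy, length)
n_small <- aggregate(DO ~ STATION + DATE, toy[toy$DO <= 5, ], length)
names(n_all)[3]   <- "nDO"
names(n_small)[3] <- "nDO_Small"

# all.x = TRUE keeps every row of n_all; unmatched counts come back as NA
agr <- merge(n_all, n_small, all.x = TRUE)
agr$nDO_Small[is.na(agr$nDO_Small)] <- 0
agr$pcnt_DO_SMALL <- 100 * agr$nDO_Small / agr$nDO
agr
```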

Related

New value with one similar column and on different column in R

I need to mutate a new value: "new_value" based on the same ID "ï..record_id". I need all with the same ID to have the same value in "date_eortc".
My data1 looks likes:
data1 %>%
select( ï..record_id, dato1, galbeta_date, date_eortc)
> ï..record_id dato1 galbeta_date date_eortc
1 1 <NA> <NA> <NA>
2 1 <NA> <NA> <NA>
3 1 <NA> 2018-01-16 <NA>
.....
99 10 2018-02-07 <NA> 2017-12-27
100 10 <NA> <NA> <NA>
101 10 <NA> <NA> <NA>
102 10 <NA> 2017-12-19 <NA>
103 10 <NA> 2017-12-26 <NA>
104 10 <NA> 2017-12-29 <NA>
105 10 <NA> 2018-01-02 <NA>
106 10 <NA> <NA> <NA>
107 10 <NA> <NA> <NA>
108 11 <NA> <NA> <NA>
In this case, for all rows with "ï..record_id"=10, date_eortc should be "2017-12-27".
So it would looks like:
ï..record_id dato1 galbeta_date date_eortc
99 10 2018-02-07 <NA> 2017-12-27
100 10 <NA> <NA> 2017-12-27
101 10 <NA> <NA> 2017-12-27
102 10 <NA> 2017-12-19 2017-12-27
103 10 <NA> 2017-12-26 2017-12-27
104 10 <NA> 2017-12-29 2017-12-27
105 10 <NA> 2018-01-02 2017-12-27
106 10 <NA> <NA> 2017-12-27
107 10 <NA> <NA> 2017-12-27
108 11 <NA> <NA> <NA>
I have tried to make an ifelse statement, but it's not the right one...
data2 <- data1 %>%
mutate(new_value= ifelse(ï..record_id == ï..record_id , date_eortc, NA))
I hope it makes sense.
Thank you for your time,
Julie
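We could group_by the ï..record_id and fill the NA elements in 'date_eortc' with the non-NA value from the same group. Setting .direction = "downup" fills in both directions, so it also works when the non-NA value is not in the group's first row.
library(dplyr)
library(tidyr)
data1 %>%
group_by(ï..record_id) %>%
fill(date_eortc, .direction = "downup") %>%
ungroup()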
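If you prefer to avoid extra packages, the same idea can be sketched in base R with ave(). This is only an illustration under assumptions: the data frame `d`, the shortened column name `record_id`, and the helper `first_non_na` are made up, and the dates are kept as character strings for simplicity.

```r
# Toy data; column names are simplified stand-ins for the ones in the question
d <- data.frame(
  record_id  = c(1, 1, 10, 10, 10, 11),
  date_eortc = c(NA, NA, "2017-12-27", NA, NA, NA),
  stringsAsFactors = FALSE
)

# First non-missing value of x; NA when the whole group is missing
first_non_na <- function(x) x[!is.na(x)][1]

# ave() applies the function per record_id and recycles the result over the group
d$date_eortc <- ave(d$date_eortc, d$record_id, FUN = first_non_na)
d
```

Unlike fill(), this copies the first non-missing value to every row of the group, which is the same thing when each ID has only one date, as in the example.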

Quarterly year-to-year changes

I have a quarterly time series. I am trying to apply a function which is supposed to calculate the year-to-year growth and year-to-year difference and multiply a variable by (-1).
I already used a similar function for calculating quarter-to-quarter changes and it worked.
I modified this function for yoy changes, but it does not have any effect on my data frame, and no error popped up.
Do you have any suggestions on how to modify the function, or how to apply the yoy change function to a time series?
Here is the code:
Date <- c("2004-01-01","2004-04-01", "2004-07-01","2004-10-01","2005-01-01","2005-04-01","2005-07-01","2005-10-01","2006-01-01","2006-04-01","2006-07-01","2006-10-01","2007-01-01","2007-04-01","2007-07-01","2007-10-01")
B1 <- c(3189.30,3482.05,3792.03,4128.66,4443.62,4876.54,5393.01,5885.01,6360.00,6930.00,7430.00,7901.00,8279.00,8867.00,9439.00,10101.00)
B2 <- c(7939.97,7950.58,7834.06,7746.23,7760.59,8209.00,8583.05,8930.74,9424.00,9992.00,10041.00,10900.00,11149.00,12022.00,12662.00,13470.00)
B3 <- as.numeric(c("","","","",140.20,140.30,147.30,151.20,159.60,165.60,173.20,177.30,185.30,199.30,217.10,234.90))
B4 <- as.numeric(c("","","","",-3.50,-14.60,-11.60,-10.20,-3.10,-16.00,-4.90,-17.60,-5.30,-10.90,-12.80,-8.40))
df <- data.frame(Date,B1,B2,B3,B4)
The code will produce following data frame:
Date B1 B2 B3 B4
1 2004-01-01 3189.30 7939.97 NA NA
2 2004-04-01 3482.05 7950.58 NA NA
3 2004-07-01 3792.03 7834.06 NA NA
4 2004-10-01 4128.66 7746.23 NA NA
5 2005-01-01 4443.62 7760.59 140.2 -3.5
6 2005-04-01 4876.54 8209.00 140.3 -14.6
7 2005-07-01 5393.01 8583.05 147.3 -11.6
8 2005-10-01 5885.01 8930.74 151.2 -10.2
9 2006-01-01 6360.00 9424.00 159.6 -3.1
10 2006-04-01 6930.00 9992.00 165.6 -16.0
11 2006-07-01 7430.00 10041.00 173.2 -4.9
12 2006-10-01 7901.00 10900.00 177.3 -17.6
13 2007-01-01 8279.00 11149.00 185.3 -5.3
14 2007-04-01 8867.00 12022.00 199.3 -10.9
15 2007-07-01 9439.00 12662.00 217.1 -12.8
16 2007-10-01 10101.00 13470.00 234.9 -8.4
And I want to apply following changes on the variables:
# yoy absolute difference change
abs.diff = c("B1","B2")
# yoy percentage change
percent.change = c("B3")
# make the variable negative
negative = c("B4")
This is the function that I am trying to use for my data frame.
transformation = function(D,abs.diff,percent.change,negative)
{
TT <- dim(D)[1]
DData <- D[-1,]
nms <- c()
for (i in c(2:dim(D)[2])) {
# yoy absolute difference change
if (names(D)[i] %in% abs.diff)
{ DData[,i] = (D[5:TT,i]-D[1:(TT-4),i])
names(DData)[i] = paste('a',names(D)[i],sep='') }
# yoy percent. change
if (names(D)[i] %in% percent.change)
{ DData[,i] = 100*(D[5:TT,i]-D[1:(TT-4),i])/D[1:(TT-4),i]
names(DData)[i] = paste('p',names(D)[i],sep='') }
#CA.deficit
if (names(D)[i] %in% negative)
{ DData[,i] = (-1)*D[1:TT,i] }
}
return(DData)
}
This is what I would like to get :
Date pB1 pB2 aB3 B4
1 2004-01-01 NA NA NA NA
2 2004-04-01 NA NA NA NA
3 2004-07-01 NA NA NA NA
4 2004-10-01 NA NA NA NA
5 2005-01-01 39.33 -2.26 NA 3.5
6 2005-04-01 40.05 3.25 NA 14.6
7 2005-07-01 42.22 9.56 NA 11.6
8 2005-10-01 42.54 15.29 11.0 10.2
9 2006-01-01 43.13 21.43 19.3 3.1
10 2006-04-01 42.11 21.72 18.3 16.0
11 2006-07-01 37.77 16.99 22.0 4.9
12 2006-10-01 34.26 22.05 17.7 17.6
13 2007-01-01 30.17 18.3 19.7 5.3
14 2007-04-01 27.95 20.32 26.1 10.9
15 2007-07-01 27.04 26.1 39.8 12.8
16 2007-10-01 27.84 23.58 49.6 8.4
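Group by quarter (the month, i.e. characters 6 and 7 of Date) using ave() and do the necessary calculation within each group. With sapply() we can loop over the columns. Note that f reads the global vector Date and applies the year-on-year percentage change to every column.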
f <- function(x) {
g <- substr(Date, 6, 7)
l <- length(unique(g))
o <- ave(x, g, FUN=function(x) 100/x * c(x[-1], NA) - 100)
c(rep(NA, l), head(o, -4))
}
cbind(df[1], sapply(df[-1], f))
# Date B1 B2 B3 B4
# 1 2004-01-01 NA NA NA NA
# 2 2004-04-01 NA NA NA NA
# 3 2004-07-01 NA NA NA NA
# 4 2004-10-01 NA NA NA NA
# 5 2005-01-01 39.32901 -2.259202 NA NA
# 6 2005-04-01 40.04796 3.250329 NA NA
# 7 2005-07-01 42.21960 9.560688 NA NA
# 8 2005-10-01 42.54044 15.291439 NA NA
# 9 2006-01-01 43.12655 21.434066 13.83738 -11.428571
# 10 2006-04-01 42.10895 21.720063 18.03279 9.589041
# 11 2006-07-01 37.77093 16.986386 17.58316 -57.758621
# 12 2006-10-01 34.25636 22.050356 17.26190 72.549020
# 13 2007-01-01 30.17296 18.304329 16.10276 70.967742
# 14 2007-04-01 27.95094 20.316253 20.35024 -31.875000
# 15 2007-07-01 27.03903 26.102978 25.34642 161.224490
# 16 2007-10-01 27.84458 23.577982 32.48731 -52.272727
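To apply a different transformation to each set of columns, as the question asks, one option is a plain lag of four rows. The following is a sketch only: `lag_k` and `yoy` are hypothetical helper names, the data is a shortened copy of the question's, and it assumes the rows are sorted by date with no missing quarters.

```r
# First eight quarters of B1 and B4 from the question
df <- data.frame(
  Date = c("2004-01-01", "2004-04-01", "2004-07-01", "2004-10-01",
           "2005-01-01", "2005-04-01", "2005-07-01", "2005-10-01"),
  B1 = c(3189.30, 3482.05, 3792.03, 4128.66, 4443.62, 4876.54, 5393.01, 5885.01),
  B4 = c(NA, NA, NA, NA, -3.5, -14.6, -11.6, -10.2)
)

# Value from k rows earlier, padded with NA at the top
lag_k <- function(x, k = 4) c(rep(NA, k), head(x, -k))

# Adds a<name> / p<name> columns and flips the sign of the `negative` columns
yoy <- function(D, abs.diff = character(), percent.change = character(),
                negative = character(), k = 4) {
  for (v in abs.diff)       D[[paste0("a", v)]] <- D[[v]] - lag_k(D[[v]], k)
  for (v in percent.change) D[[paste0("p", v)]] <- 100 * (D[[v]] / lag_k(D[[v]], k) - 1)
  for (v in negative)       D[[v]] <- -D[[v]]
  D
}

out <- yoy(df, abs.diff = "B1", percent.change = "B1", negative = "B4")
out
```

Keeping all rows and padding with NA avoids the row-count mismatch that comes from dropping rows before assigning the lagged differences.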

Create a conditional column based on another table

I have two data frames, Table1 and Table2.
Table1:
code
CM171
CM114
CM129
CM131
CM154
CM197
CM42
CM54
CM55
Table2:
code;y;diff_y
CM60;1060;2.9
CM55;255;0.7
CM54;1182;3.2
CM53;1046;2.9
CM47;589;1.6
CM42;992;2.7
CM39;1596;4.4
CM36;1113;3
CM34;1975;5.4
CM226;155;0.4
CM224;46;0.1
CM212;43;0.1
CM197;726;2
CM154;1122;3.1
CM150;206;0.6
CM144;620;1.7
CM132;8;0
CM131;618;1.7
CM129;479;1.3
CM121;634;1.7
CM114;15;0
CM109;1050;2.9
CM107;1165;3.2
CM103;194;0.5
I want to add a column to Table2 based on the values in Table1. I tried to do this using dplyr:
result <-Table2 %>%
mutate (fbp = case_when(
code == Table1$code ~"y",))
But this only works for a few rows. Does anyone know why it doesn't add all rows? The values are not repeated.
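Try this. The == operator compares the two columns element by element (recycling the shorter one), so it only matches when a code happens to sit at the same position in both tables. Instead you can use %in% to test each code against all values in Table1$code. Here is the code: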
#Code
result <-Table2 %>%
mutate (fbp = case_when(
code %in% Table1$code ~"y",))
Output:
code y diff_y fbp
1 CM60 1060 2.9 <NA>
2 CM55 255 0.7 y
3 CM54 1182 3.2 y
4 CM53 1046 2.9 <NA>
5 CM47 589 1.6 <NA>
6 CM42 992 2.7 y
7 CM39 1596 4.4 <NA>
8 CM36 1113 3.0 <NA>
9 CM34 1975 5.4 <NA>
10 CM226 155 0.4 <NA>
11 CM224 46 0.1 <NA>
12 CM212 43 0.1 <NA>
13 CM197 726 2.0 y
14 CM154 1122 3.1 y
15 CM150 206 0.6 <NA>
16 CM144 620 1.7 <NA>
17 CM132 8 0.0 <NA>
18 CM131 618 1.7 y
19 CM129 479 1.3 y
20 CM121 634 1.7 <NA>
21 CM114 15 0.0 y
22 CM109 1050 2.9 <NA>
23 CM107 1165 3.2 <NA>
24 CM103 194 0.5 <NA>
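The difference between the two operators is easy to see on two short vectors. This is a minimal sketch; `table1_code` and `table2_code` are made-up stand-ins for the code columns of the two tables:

```r
table1_code <- c("CM171", "CM114", "CM129")
table2_code <- c("CM60", "CM114", "CM54", "CM129", "CM171", "CM53")

# == recycles table1_code along table2_code and compares position by position
table2_code == table1_code

# %in% asks, for each element of table2_code, whether it occurs anywhere in table1_code
table2_code %in% table1_code

# Base R equivalent of the case_when() call above
fbp <- ifelse(table2_code %in% table1_code, "y", NA)
fbp
```

Here == finds only the one code that happens to line up by position, while %in% finds all three shared codes.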

reshape untidy data frame, spreading rows to columns names [duplicate]

Have searched the threads but can't understand a solution that will solve the problem with the data frame that I have.
My current data frame (df):
# A tibble: 8 x 29
`Athlete` Monday...2 Tuesday...3 Wednesday...4 Thursday...5 Friday...6 Saturday...7 Sunday...8
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Date 29/06/2020 30/06/2020 43837.0 43868.0 43897.0 43928.0 43958.0
2 HR 47.0 54.0 51.0 56.0 59.0 NA NA
3 HRV 171.0 91.0 127.0 99.0 77.0 NA NA
4 Sleep Duration 9.11 7.12 8.59 7.15 8.32 NA NA
5 Sleep Efficien~ 92.0 94.0 89.0 90.0 90.0 NA NA
6 Recovery Score 98.0 66.0 96.0 72.0 46.0 NA NA
7 Life Stress NO NO NO NO NO NA NA
8 Sick NO NO NO NO NO NA NA
I have tried to use spread and pivot_wider, but I know they would require additional functions to get the desired output, which is beyond my level of understanding in R.
Do I need to u
Desired output:
Date HR HRV Sleep Duration Sleep Efficiency Recovery Score Life Stress Sick
29/06/2020 47.0 171.0 9.11
30/06/2020 54.0 91.0 7.12
43837.0 51.0 127.0 8.59
43868.0 56.0 99.0 7.15
43897.0 59.0 77.0 8.32
43928.0 NA NA NA
43958.0 NA NA NA
etc.
Thank you
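In base R you can do the following (df[[1]] extracts the first column as a plain vector, which also works when df is a tibble):
type.convert(setNames(data.frame(t(df[-1]), row.names = NULL), df[[1]]), as.is = TRUE)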
Date HR HRV Sleep Duration Sleep Efficien~ Recovery Score Life Stress Sick
1 29/06/2020 47 171 9.11 92 98 NO NO
2 30/06/2020 54 91 7.12 94 66 NO NO
3 43837.0 51 127 8.59 89 96 NO NO
4 43868.0 56 99 7.15 90 72 NO NO
5 43897.0 59 77 8.32 90 46 NO NO
6 43928 NA NA NA NA NA <NA> <NA>
7 43958 NA NA NA NA NA <NA> <NA>
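The one-liner is easier to follow when broken into steps. Below is a sketch on a made-up three-row version of the data (the data frame `wide` and its day columns are assumptions for illustration):

```r
wide <- data.frame(
  Athlete = c("Date", "HR", "Sick"),
  Monday  = c("29/06/2020", "47.0", "NO"),
  Tuesday = c("30/06/2020", "54.0", "NO"),
  stringsAsFactors = FALSE
)

# 1. Transpose the value columns: one row per day, one column per measure
m <- t(wide[-1])

# 2. Back to a data frame, dropping the day names as row names
long <- data.frame(m, row.names = NULL, stringsAsFactors = FALSE)

# 3. The old first column supplies the new headers
names(long) <- wide[[1]]

# 4. Everything is still character; let R guess proper column types
long <- type.convert(long, as.is = TRUE)
str(long)
```

After step 4, HR is numeric while Date and Sick stay character, so the dates would still need parsing separately.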

sort column by variable name then loop in each variable

I am an R noob :) and this is my first post.
I have a dataset of 4k entries (data) describing mortality rates (data$mortality) by US state (data$state).
I want to loop through the mortality rates by state name
for instance loop through all mortality rates in "AK"
something like this:
tbl <- table (data$State) ## table with frequency for entries at each state
How can I loop through all the occurrences of each state?
I don't want to specify the state name. I want to sort all states then loop through them by name:
"AK", "AL" etc...
for instance, my table would be:
State mortality
AL 14.3
AL 18.5
AL 18.1
AL NA
AL NA
AK NA
AK 17.7
AK 18
AK 15.9
AK NA
AK 19.6
AK 17.3
AZ 15
AZ 17.1
AZ 17.1
AZ NA
AZ 16.4
AZ 15.2
AZ 16.7
I can then loop through all rates in "AL" and rank them then choose a hospital name associated with each ranked mortality rate in "AL"
I can write a piece of code for each state at a time but imagine doing that for all states!
Here's a data.table solution, as suggested in a comment:
require(data.table)
DT <- data.table(hospID=1:nrow(data),data)
DT[,r:=rank(mortality,na.last='keep'),by=State]
Then run DT to see the result:
hospID State mortality r
1: 1 AL 14.3 1.0
2: 2 AL 18.5 3.0
3: 3 AL 18.1 2.0
4: 4 AL NA NA
5: 5 AL NA NA
6: 6 AK NA NA
7: 7 AK 17.7 3.0
8: 8 AK 18.0 4.0
9: 9 AK 15.9 1.0
10: 10 AK NA NA
11: 11 AK 19.6 5.0
12: 12 AK 17.3 2.0
13: 13 AZ 15.0 1.0
14: 14 AZ 17.1 5.5
15: 15 AZ 17.1 5.5
16: 16 AZ NA NA
17: 17 AZ 16.4 3.0
18: 18 AZ 15.2 2.0
Look at ?rank to see different ways of handling ties and NA values.
If you want to sort on the rank, you can do that with DT[order(State,r)]. The data.table package also allows for a key -- a vector of columns on which the data.table is sorted automatically. There are other benefits to setting a key as well that you can read about in a data.table tutorial or the FAQ.
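For comparison, the same grouping can be sketched in base R. The data frame below is a made-up subset of the question's table. split() gives one data frame per state, ordered by state name, which is what an explicit loop over states would iterate on, and ave() computes the within-state rank without a loop:

```r
data <- data.frame(
  State     = c("AL", "AL", "AL", "AK", "AK", "AK", "AZ", "AZ"),
  mortality = c(14.3, 18.5, NA, 17.7, 15.9, NA, 15.0, 17.1)
)

# One data frame per state, ordered alphabetically by state name
by_state <- split(data, data$State)
names(by_state)

# Example of per-state work: lowest non-missing mortality rate in each state
best <- sapply(by_state, function(d) min(d$mortality, na.rm = TRUE))
best

# Rank within state, keeping NA for missing rates
data$r <- ave(data$mortality, data$State,
              FUN = function(x) rank(x, na.last = "keep"))
data
```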
To sort by col 'a':
x = data.frame(a = sample(LETTERS, 10), b = runif(10))
x = x[order(x[, 'a']), ]
print(x)
4 B 0.8030872
9 C 0.3754850
7 D 0.8670409
5 G 0.1278583
3 J 0.9161972
6 N 0.7159080
8 R 0.5340525
2 S 0.2903496
10 T 0.5466612
1 V 0.9187505
