I have a large data frame with meteorological conditions at different locations (column radar_id), times (column date), and heights (column hgt).
I need to interpolate each parameter (temp, u, v, ...) to a specific height (500 m above the ground at each radar, given in the altitude_500 column), separately for each location (radar_id) and date.
I tried approx inside dplyr pipes and also tried splitting the data frame, but neither worked for me.
Example of part of my data frame:
head(example)
radar_id date temp u v hgt W wind_ang temp_diff tw altitude_500
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Dagan 2014-03-02 18.8 -6.00 4.80 77 7.68 129. 5. -3.33 547
2 Dagan 2014-03-02 17.6 -2.40 9.30 742 9.60 166. 6 -9.20 547
3 Dagan 2014-03-02 16.2 3.10 15.4 1463 15.7 -169. 5.80 -10.4 547
4 Dagan 2014-03-03 16.2 0.900 -0.500 96 1.03 -60.9 -2.6 -0.971 547
5 Dagan 2014-03-03 13.0 3.10 -0.500 754 3.14 -80.8 -4.6 -2.39 547
6 Dagan 2014-03-03 10.8 8.10 4.10 1462 9.08 -117. -5.30 -5.01 547
I want to get, for each parameter, a column with the y values from approx (the x values are the heights in hgt), evaluated at a specific height (given by the altitude_500 column), after the data frame is grouped by radar_id and date.
Here's a dplyr solution. First, I define the data.
# Data
df <- read.table(text = "radar_id date temp u v hgt W wind_ang temp_diff tw altitude_500
1 Dagan 2014-03-02 18.8 -6.00 4.80 77 7.68 129. 5. -3.33 547
2 Dagan 2014-03-02 17.6 -2.40 9.30 742 9.60 166. 6 -9.20 547
3 Dagan 2014-03-02 16.2 3.10 15.4 1463 15.7 -169. 5.80 -10.4 547
4 Dagan 2014-03-03 16.2 0.900 -0.500 96 1.03 -60.9 -2.6 -0.971 547
5 Dagan 2014-03-03 13.0 3.10 -0.500 754 3.14 -80.8 -4.6 -2.39 547
6 Dagan 2014-03-03 10.8 8.10 4.10 1462 9.08 -117. -5.30 -5.01 547")
Then, I load the dplyr package.
# Load library
library(dplyr)
Finally, I group by both radar_id and date and perform a linear interpolation using approx to get the value at altitude_500 m for each column (except the grouping variables and hgt).
# Group, then interpolate every non-grouping column (except hgt) at altitude_500
df %>%
  group_by(radar_id, date) %>%
  summarise_at(vars(-hgt), ~ approx(hgt, ., xout = first(altitude_500))$y)
#> # A tibble: 2 x 10
#> # Groups: radar_id [1]
#> radar_id date temp u v W wind_ang temp_diff tw
#> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Dagan 2014~ 18.0 -3.46 7.98 9.04 155. 5.71 -7.48
#> 2 Dagan 2014~ 14.0 2.41 -0.5 2.48 -74.5 -3.97 -1.94
#> # ... with 1 more variable: altitude_500 <dbl>
Created on 2019-08-21 by the reprex package (v0.3.0)
This assumes that there is only one value of altitude_500 for each radar_id/date pair.
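If you are on dplyr 1.0 or later, where summarise_at() is superseded, an equivalent formulation (a sketch, not re-run here) uses across():

library(dplyr)

df %>%
  group_by(radar_id, date) %>%
  summarise(
    across(-hgt, ~ approx(hgt, .x, xout = first(altitude_500))$y),
    .groups = "drop"
  )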
I am trying to simulate a dataset for a linear regression in a bit of Bayesian stats.
Obviously the overall formula is
Y = A + BX
I have simulated a variety of values of A and B using
A <- rnorm(10, 0, 1)
B <- rnorm(10, 0, 1)
# 10 random draws from a normal distribution for each of A and B
I set up a list of possible values of X:
stuff <- tibble(x = seq(130, 170, 10)) %>%
  # table of possible values of X between 130 and 170 in intervals of 10
  mutate(Y = A + B * x)
  # new column: A plus B times each value of X
This works fine when I have only one value in A and B (i.e. if I do A <- rnorm(1,0,1)),
but obviously it doesn't work when the length of A and B is greater than 1.
What I am trying to figure out how to do is something like
mutate(Y[i] = A[i] + B[i]*x)
resulting in 10 new columns, Y1 through Y10.
Any suggestions welcomed.
Here's how I would do what I think you want. I'd start long and then convert to wide...
library(tidyverse)
set.seed(123)
df <- tibble() %>%
  expand(
    nesting(                # one (A, B) pair per simulation ID
      ID = 1:10,
      A = rnorm(10, 0, 1),
      B = rnorm(10, 0, 1)
    ),
    X = seq(130, 170, 10)   # crossed with every value of X
  ) %>%
  mutate(Y = A + B * X)
df
# A tibble: 50 × 5
ID A B X Y
<int> <dbl> <dbl> <dbl> <dbl>
1 1 -1.07 0.426 130 54.4
2 1 -1.07 0.426 140 58.6
3 1 -1.07 0.426 150 62.9
4 1 -1.07 0.426 160 67.2
5 1 -1.07 0.426 170 71.4
6 2 -0.218 -0.295 130 -38.6
7 2 -0.218 -0.295 140 -41.5
8 2 -0.218 -0.295 150 -44.5
9 2 -0.218 -0.295 160 -47.4
10 2 -0.218 -0.295 170 -50.4
# … with 40 more rows
Now, pivot to wide...
df %>%
pivot_wider(
names_from=ID,
values_from=Y,
names_prefix="Y",
id_cols=X
)
# A tibble: 5 × 11
X Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9 Y10
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 130 54.4 -38.6 115. 113. 106. 87.8 72.8 -7.90 -40.9 -48.2
2 140 58.6 -41.5 124. 122. 114. 94.7 78.4 -8.51 -44.0 -52.0
3 150 62.9 -44.5 133. 131. 123. 102. 83.9 -9.13 -47.0 -55.8
4 160 67.2 -47.4 142. 140. 131. 108. 89.5 -9.75 -50.1 -59.6
5 170 71.4 -50.4 151. 149. 139. 115. 95.0 -10.4 -53.2 -63.4
At this point you've lost A & B, because you'd need another 10 columns to store the original A's and another 10 to store the original B's.
Personally, I'd probably stick with the long format, because that's most likely going to make your future workflow easier. And I get to keep the A's and B's.
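As an aside, here's a minimal sketch (assuming the long df built above) of why long format pays off: plotting one regression line per simulated (A, B) pair is a one-liner.

library(ggplot2)

# One line per simulated (A, B) pair, straight from the long data
ggplot(df, aes(x = X, y = Y, group = ID, colour = factor(ID))) +
  geom_line() +
  labs(colour = "ID")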
I am running this code in order to run a bound test on stock data.
Everything works until I run ardlBoundOrders, where I get the following error: Error in match.arg(method) : 'arg' must be of length 1
Where does this error come from? Is it possible it comes from the merged dataset (since the code runs without any problem when I only use the Excel-imported dataset)? How can I fix it?
Thanks for your help!
Here is the script:
library(quantmod)
library(ggplot2)
library(plotly)
library(dLagM)
tickers <- c("DIS", "GILD", "AMZN", "AAPL")
stocks <- getSymbols(tickers,
                     from = "1994-01-01",
                     to = "2022-02-01",
                     periodicity = "monthly",
                     src = "yahoo")

DISclose  <- DIS[, 4]   # column 4 is the close
GILDclose <- GILD[, 4]
AMZNclose <- AMZN[, 4]
AAPLclose <- AAPL[, 4]
newdata <- merge(DATA, DISclose)
formula <- DIS.Close ~ USDEUR+CPI+CONSCONF+FEDFUNDS+HOUST+UNRATE+INDPRO+VIX+SPY+CLI
ARDLfit <- ardlDlm(formula = formula, data = newdata, p = 10, q = 10)
summary(ARDLfit)
orders3 <- ardlBoundOrders(data = newdata, formula = formula,
                           ic = "BIC", max.p = 2, max.q = 2)
p <- data.frame(orders3$q, orders3$p) + 1
Boundtest <- ardlBound(data = DATA, formula = formula2, p = p, ECM = TRUE)
par(mfrow=c(1,1))
disney<-Boundtest[["ECM"]][["EC.t"]]
plot(disney, type="l")
Update:
I think I found something.
When I merge my data, each row of the stock series gets allocated to every row of DATA (a cross join), so the result is squared in size. An example makes this more explicit.
Here is the variable DATA:
> DATA
# A tibble: 337 × 12
Date VIX USDEUR CPI CONSCONF FEDFUNDS HOUST SPY INDPRO UNRATE
<dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1994-01-01 00:00:00 10.6 0.897 146. 101. 3.05 1272 28.8 67.1 6.6
2 1994-02-01 00:00:00 14.9 0.895 147. 101. 3.25 1337 28.0 67.1 6.6
3 1994-03-01 00:00:00 20.5 0.876 147. 101. 3.34 1564 26.7 67.8 6.5
4 1994-04-01 00:00:00 13.8 0.877 147. 101. 3.56 1465 27.1 68.2 6.4
5 1994-05-01 00:00:00 13.0 0.859 148. 101. 4.01 1526 27.6 68.5 6.1
6 1994-06-01 00:00:00 15.0 0.846 148. 101. 4.25 1409 26.7 69.0 6.1
7 1994-07-01 00:00:00 11.1 0.818 148. 101. 4.26 1439 27.8 69.1 6.1
8 1994-08-01 00:00:00 12.0 0.818 149 101. 4.47 1450 28.8 69.5 6
9 1994-09-01 00:00:00 14.3 0.810 149. 101. 4.73 1474 27.9 69.7 5.9
10 1994-10-01 00:00:00 14.6 0.793 149. 101. 4.76 1450 28.9 70.3 5.8
# … with 327 more rows, and 2 more variables: CLI <dbl>, SPYr <dbl>
Here is the merged variable newdata (last three columns shown):
CLI SPYr DIS.Close
1 100.52128 0.0000000000 15.53738
2 100.70483 -0.0291642024 15.53738
3 100.83927 -0.0473966064 15.53738
4 100.92260 0.0170457821 15.53738
5 100.95804 0.0159393078 15.53738
6 100.95186 -0.0293319435 15.53738
7 100.91774 0.0391511218 15.53738
8 100.86948 0.0381206253 15.53738
9 100.80795 -0.0311470101 15.53738
10 100.72614 0.0346814791 15.53738
11 100.60322 -0.0398155024 15.53738
12 100.42905 -0.0006857954 15.53738
13 100.19862 0.0418493643 15.53738
In fact, each row of DATA is paired with the first row of DISclose, then the 2nd, the 3rd, and so on, so my dataset goes from x rows to x^2 rows.
I did some research to fix this problem; it seems I should match both datasets with by = "matchingIDinbothdataset", but I do not have a matching ID. Is there a solution?
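One idea I have found but not yet verified: give both tables an explicit Date key and join on that, instead of letting merge() cross-join (untested sketch, the column handling is my guess):

library(dplyr)

# Turn the xts close series into a data frame keyed by Date,
# then join on Date instead of cross-joining
DISclose_df <- data.frame(Date = as.Date(zoo::index(DISclose)),
                          zoo::coredata(DISclose))

newdata <- DATA %>%
  mutate(Date = as.Date(Date)) %>%
  left_join(DISclose_df, by = "Date")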
Thank you in advance.
I used the code below to add a regression line to a boxplot.
boxplot(yield~Year, data=dfreg.raw,
ylab = 'Yield (bushels/acre)',
col = 'orange')
yield.year <- lm(yield~Year, data = dfreg.raw)
abline(reg = yield.year)
However, the regression line did not show up in the plot.
My data looks like this. It's panel data, which might be causing problems with the regression line.
> head(dfreg.raw)
# A tibble: 6 x 15
index Year yield State.Code harv frez_j dd_j cupc_j sm7_j fitted_j max_spring_j sp_spring_j
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 16001 1984 105 16 7200 330. 2438. 7.32 53.4 49.1 19.7 0.863
2 16001 1985 96.8 16 8200 413. 2407. 5.71 52.5 48.4 23.9 -0.391
3 16001 1986 94.9 16 7400 476. 2638. 8.34 52.5 48.4 23.4 -0.122
4 16001 1987 106. 16 9700 154. 2838. 5.44 54.4 49.9 25.6 -0.485
5 16001 1988 89.6 16 7600 184. 2944. 3.28 54.5 50.0 23.9 0.115
6 16001 1989 96.4 16 7300 383. 2766. 5.91 52.6 48.4 23.5 -1.02
# … with 3 more variables: pc_spring_j <dbl>, lt <dbl>, qt <dbl>
Does anyone have an idea what's going on?
In the boxplot, the boxes are drawn at x positions 1 through the number of levels of the x variable, not at the actual Year values, so the abline doesn't line up with them. You can try something like the following.
First simulate a dataset:
dfreg.raw <- data.frame(
  yield = rpois(100, lambda = rep(seq(60, 100, by = 10), each = 20)),
  Year = rep(1995:1999, each = 20)
)
Then plot:
boxplot(yield~Year, data=dfreg.raw,
ylab = 'Yield (bushels/acre)',
col = 'orange')
yield.year <- lm(yield~Year, data = dfreg.raw)
Get a unique ascending vector of Years, predict at each, and draw the line at the box positions 1..length(X):
X <- sort(unique(dfreg.raw$Year))
lines(x = 1:length(X),
      y = predict(yield.year, data.frame(Year = X)),
      col = "blue", lty = 8)
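Alternatively, a sketch (assuming the Years plot in sorted order): refit the model against the box positions 1..k so that abline() itself lines up.

# Box i sits at x = i, so regress on the position index instead of Year
dfreg.raw$pos <- as.numeric(factor(dfreg.raw$Year))
yield.pos <- lm(yield ~ pos, data = dfreg.raw)
abline(reg = yield.pos, col = "blue")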
I have grouped data frames (in my case, three data frames grouped together). I want to find the intersection of all three data frames based on a value in a column.
I have been playing around with the dplyr intersect function, but I don't see how to use it with my grouped data frames. I want to find all rows within all three data frames that have the same Start.Coord value.
Here is one failed attempt with the resulting error message:
SameWithinTreatment <- SorbitolGroup %>% group_by(Sample) %>% intersect(Start.Coord)
Error in intersect_data_frame(x, y) : object 'Start.Coord' not found
Obviously I need to pass intersect() another parameter. I gather intersect() isn't really the function I need, but it seems there must be a way to do this.
I have done a lot of searching but everything I find only works with 2 data frames.
Here is some example data from my grouped data frames. There is one row with a Start.Coord value common to all three: the row with 8805 as the Start.Coord.
Start.Coord Stop.Coord Sample Coverage normalized.coverage Average.Normalized.Coverage SD.of.Normalized.Coverage TwoSD
<int> <int> <chr> <int> <dbl> <dbl> <dbl> <dbl>
1 1019 1023 X1.combined 19 18 9.91 3.98 7.95
2 1510 1514 X1.combined 19 18 9.91 3.98 7.95
3 1514 1518 X1.combined 19 18 9.91 3.98 7.95
4 1520 1524 X1.combined 19 18 9.91 3.98 7.95
5 8805 8809 X1.combined 19 18 9.91 3.98 7.95
6 48185 48189 X1.combined 19 18 9.91 3.98 7.95
Start.Coord Stop.Coord Sample Coverage normalized.coverage Average.Normalized.Coverage SD.of.Normalized.Coverage TwoSD
<int> <int> <chr> <int> <dbl> <dbl> <dbl> <dbl>
1 8805 8809 X2 167 166 122. 21.7 43.4
2 11874 11878 X2 169 168 122. 21.7 43.4
3 12042 12046 X2 169 168 122. 21.7 43.4
4 18321 18325 X2 175 174 122. 21.7 43.4
5 25187 25191 X2 167 166 122. 21.7 43.4
6 25308 25312 X2 194 193 122. 21.7 43.4
Start.Coord Stop.Coord Sample Coverage normalized.coverage Average.Normalized.Coverage SD.of.Normalized.Coverage TwoSD
<int> <int> <chr> <int> <dbl> <dbl> <dbl> <dbl>
1 8805 8809 X3 132 131 94.4 16.7 33.5
2 10340 10344 X3 135 134 94.4 16.7 33.5
3 11874 11878 X3 141 140 94.4 16.7 33.5
4 12042 12046 X3 137 136 94.4 16.7 33.5
5 18209 18213 X3 133 132 94.4 16.7 33.5
6 18218 18222 X3 143 142 94.4 16.7 33.5
So I would like to get back a new data frame that looks like this:
Start.Coord Stop.Coord Sample Coverage normalized.coverage Average.Normalized.Coverage SD.of.Normalized.Coverage TwoSD
8805 8809 X1.combined 19 18 9.91 3.98 7.95
8805 8809 X2 167 166 122. 21.7 43.4
8805 8809 X3 132 131 94.4 16.7 33.5
Is there a way to accomplish this?
If your 3 data frames have the same column names, use rbind to combine them:
SorbitolGroup <- rbind(df1, df2, df3)
Then add Start.Coord to the group_by:
SorbitolGroup %>% group_by(Sample, Start.Coord)
If you want to count the number of observations in each group:
SorbitolGroup %>% group_by(Sample, Start.Coord) %>% tally()
It sounds like you need to use filter(), in addition to what @W148SMH suggested.
library(dplyr)

a <- data.frame(sample = 'a', value = sample(1:10, 10, TRUE))
b <- data.frame(sample = 'b', value = sample(1:10, 10, TRUE))
c <- data.frame(sample = 'c', value = sample(1:10, 10, TRUE))
df <- rbind(a, b, c)
summary(df)

df %>% filter(value == 9)
df_new <- df %>% filter(value == 9)  # new data frame including all cases with value == 9

df %>% count(sample, value)
df %>% group_by(sample, value) %>%
  summarise(...)  # summarise other variables at each level of sample and value
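For the original question specifically, here's a sketch (assuming the three frames are stacked into SorbitolGroup as above) that keeps only the Start.Coord values present in all three samples:

library(dplyr)

SorbitolGroup <- rbind(df1, df2, df3)   # stack the three frames

SorbitolGroup %>%
  group_by(Start.Coord) %>%
  filter(n_distinct(Sample) == 3) %>%   # keep coords that occur in every sample
  ungroup()

On the example data this returns the three rows with Start.Coord 8805, one per sample.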
I have a data frame with multiple columns, ordered by a time column with a timestamp every second. I want to search the data frame for 1-minute periods that have limited variation in another variable.
For example, I want every minute in the data frame where the TWS (true wind speed) varies by no more than 5 knots. These 1-minute periods should not overlap.
Once we have the 1-minute sections, I'd like another data frame with each minute of data averaged into one row.
Here is the head of the data:
Date Time Lat Lon AWA AWS TWA TWS
1 19/10/2018 2019-02-11 12:06:16 35.8952 14.5 -99.7 8.42 -99.7 8.42
2 19/10/2018 2019-02-11 12:06:17 35.8952 14.5 -99.1 8.24 -99.1 8.24
3 19/10/2018 2019-02-11 12:06:18 35.8952 14.5 -99.2 7.34 -99.2 7.34
4 19/10/2018 2019-02-11 12:06:19 35.8952 14.5 -99.6 6.87 -99.6 6.87
5 19/10/2018 2019-02-11 12:06:20 35.8952 14.5 -101.1 8.85 -101.1 8.85
6 19/10/2018 2019-02-11 12:06:21 35.8952 14.5 -101.6 9.39 -101.6 9.39
library(dplyr)
library(lubridate)

df %>%
  mutate(Date = as.Date(Date, format = "%d/%m/%Y"), Time = ymd_hms(Time)) %>%
  group_by(gr = minute(Time)) %>%                  # group by the minute of each timestamp
  mutate(flag = max(TWS, na.rm = TRUE) - min(TWS, na.rm = TRUE)) %>%  # TWS range per minute
  filter(flag < 5) %>%                             # keep minutes varying less than 5 knots
  mutate_all(mean, na.rm = TRUE) %>%               # average every column within the minute
  distinct()
# A tibble: 1 x 10
# Groups: gr [1]
Date Time Lat Lon AWA AWS TWA TWS gr flag
<date> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
1 2018-10-19 2019-02-11 12:06:17 35.9 14.5 -99.3 8. -99.3 8. 6 1.08
For variation between consecutive elements in each group, we can use dplyr::lag:
... mutate(flag = TWS - lag(TWS, default = first(TWS))) %>%
  filter(all(abs(flag) < 5)) %>%
  mutate_all(mean, na.rm = TRUE) %>%
  distinct()
Data
df <- read.table(text = "
Date Time Lat Lon AWA AWS TWA TWS
1 '19/10/2018' '2019-02-11 12:06:16' 35.8952 14.5 -99.7 8.42 -99.7 8.42
2 '19/10/2018' '2019-02-11 12:06:17' 35.8952 14.5 -99.1 8.24 -99.1 8.24
3 '19/10/2018' '2019-02-11 12:06:18' 35.8952 14.5 -99.2 7.34 -99.2 7.34
4 '19/10/2018' '2019-02-11 12:07:19' 35.8952 14.5 -99.6 6.87 -99.6 6.87
5 '19/10/2018' '2019-02-11 12:07:20' 35.8952 14.5 -101.1 8.85 -101.1 8.85
6 '19/10/2018' '2019-02-11 12:07:21' 35.8952 14.5 -101.6 9.39 -101.6 16.39
", header=TRUE)