I have downloaded a table of stream diversion data ("df_downloaded"). The column names of this table are mostly the ID numbers of the gauging stations.
I want to conditionally replace the ID numbers used as column names with the station names, which will make the data more readable when I share the results. I created a table ("stationIDs") with the ID numbers and station names to use as a reference for renaming the columns of "df_downloaded".
I can replace the column names individually, but I want to write a loop of some kind that works through all of the columns of "df_downloaded" and renames only the columns that appear in the reference data frame "stationIDs".
An example of what I'm trying to do is below.
Downloaded Data ("df_downloaded")
A portion of the downloaded data is similar to this:
df_downloaded <- data.frame(Var1 = seq(as.Date("2012-01-01"), as.Date("2012-12-01"), by = "month"),
                            Var2 = sample(50:150, 12, replace = TRUE),
                            Var3 = sample(10:100, 12, replace = TRUE),
                            Var4 = sample(15:45, 12, replace = TRUE),
                            Var5 = sample(50:200, 12, replace = TRUE),
                            Var6 = sample(15:100, 12, replace = TRUE),
                            Var7 = c(rep(0, 3), rep(13, 6), rep(0, 3)),
                            Var8 = rep(5, 12))
colnames(df_downloaded) <- c("Diversion.Date", "360410059", "360410060",
                             "360410209", "361000655", "361000656", "Irrigation", "Seep")
df_downloaded # not run
#
# Diversion.Date 360410059 360410060 360410209 361000655 361000656 Irrigation Seep
# 1 2012-01-01 93 57 28 101 16 0 5
# 2 2012-02-01 102 68 19 124 98 0 5
# 3 2012-03-01 124 93 36 109 56 0 5
# 4 2012-04-01 94 96 23 54 87 13 5
# 5 2012-05-01 83 70 43 119 15 13 5
# 6 2012-06-01 78 63 45 195 15 13 5
# 7 2012-07-01 86 77 20 130 63 13 5
# 8 2012-08-01 118 29 27 118 57 13 5
# 9 2012-09-01 142 18 45 116 27 13 5
# 10 2012-10-01 74 68 34 182 79 0 5
# 11 2012-11-01 106 48 27 95 74 0 5
# 12 2012-12-01 91 41 20 179 55 0 5
Reference Table ("stationIDs")
stationIDs <- data.frame(ID = c("360410059", "360410060", "360410209", "361000655", "361000656"),
Names = c("RimView", "IPCO", "WMA.Ditch", "RV.Bypass", "LowerFalls"))
stationIDs # not run
#
# ID Names
# 1 360410059 RimView
# 2 360410060 IPCO
# 3 360410209 WMA.Ditch
# 4 361000655 RV.Bypass
# 5 361000656 LowerFalls
I can replace the column names in "df_downloaded" using individual statements. I show the first three iterations below.
After three iterations, "RimView", "IPCO", and "WMA.Ditch" have replaced their respective gauge ID numbers.
names(df_downloaded) <- gsub(stationIDs$ID[1], stationIDs$Names[1], names(df_downloaded))
# head(df_downloaded)
# Diversion.Date RimView 360410060 360410209 361000655 361000656 Irrigation Seep
# 1 2012-01-01 93 57 28 101 16 0 5
# 2 2012-02-01 102 68 19 124 98 0 5
# 3 2012-03-01 124 93 36 109 56 0 5
# 4 2012-04-01 94 96 23 54 87 13 5
# 5 2012-05-01 83 70 43 119 15 13 5
# 6 2012-06-01 78 63 45 195 15 13 5
names(df_downloaded) <- gsub(stationIDs$ID[2], stationIDs$Names[2], names(df_downloaded))
# head(df_downloaded)
# Diversion.Date RimView IPCO 360410209 361000655 361000656 Irrigation Seep
# 1 2012-01-01 93 57 28 101 16 0 5
# 2 2012-02-01 102 68 19 124 98 0 5
# 3 2012-03-01 124 93 36 109 56 0 5
# 4 2012-04-01 94 96 23 54 87 13 5
# 5 2012-05-01 83 70 43 119 15 13 5
# 6 2012-06-01 78 63 45 195 15 13 5
names(df_downloaded) <- gsub(stationIDs$ID[3], stationIDs$Names[3], names(df_downloaded))
# head(df_downloaded)
# Diversion.Date RimView IPCO WMA.Ditch 361000655 361000656 Irrigation Seep
# 1 2012-01-01 93 57 28 101 16 0 5
# 2 2012-02-01 102 68 19 124 98 0 5
# 3 2012-03-01 124 93 36 109 56 0 5
# 4 2012-04-01 94 96 23 54 87 13 5
# 5 2012-05-01 83 70 43 119 15 13 5
# 6 2012-06-01 78 63 45 195 15 13 5
If I try to do the renaming using a for loop, I end up with NAs for column names.
for (i in seq_along(names(df_downloaded))) {
  names(df_downloaded) <- gsub(stationIDs$ID[i], stationIDs$Names[i], names(df_downloaded))
}
# head(df_downloaded)
# NA NA NA NA NA NA NA NA
# 1 2012-01-01 93 57 28 101 16 0 5
# 2 2012-02-01 102 68 19 124 98 0 5
# 3 2012-03-01 124 93 36 109 56 0 5
# 4 2012-04-01 94 96 23 54 87 13 5
# 5 2012-05-01 83 70 43 119 15 13 5
# 6 2012-06-01 78 63 45 195 15 13 5
I really want to be able to change the names with a for loop or something similar, because the number of stations that I download data from changes depending on the years that I am analyzing.
Thanks for taking time to look at my question.
We can use match():
#Convert factor columns to character
stationIDs[] <- lapply(stationIDs, as.character)
#Match names of df_downloaded with stationIDs$ID
inds <- match(names(df_downloaded), stationIDs$ID)
#Replace the matched name with corresponding Names from stationIDs
names(df_downloaded)[which(!is.na(inds))] <- stationIDs$Names[inds[!is.na(inds)]]
df_downloaded
# Diversion.Date RimView IPCO WMA.Ditch RV.Bypass LowerFalls Irrigation Seep
#1 2012-01-01 142 14 41 200 79 0 5
#2 2012-02-01 97 100 35 176 22 0 5
#3 2012-03-01 85 59 26 88 71 0 5
#4 2012-04-01 68 49 34 63 15 13 5
#5 2012-05-01 62 58 44 87 16 13 5
#6 2012-06-01 70 59 33 145 87 13 5
#7 2012-07-01 112 65 25 52 64 13 5
#8 2012-08-01 75 12 27 103 19 13 5
#9 2012-09-01 73 65 36 172 68 13 5
#10 2012-10-01 87 35 27 146 42 0 5
#11 2012-11-01 122 17 33 183 32 0 5
#12 2012-12-01 108 65 15 120 99 0 5
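For what it's worth, the NAs from the original loop most likely come from the loop index: it runs over the eight column names of df_downloaded while stationIDs has only five rows, so stationIDs$ID[6:8] is NA and gsub() with an NA pattern turns every name into NA. Looping over the rows of the reference table instead should behave like the manual iterations above (a sketch):
# iterate over the rows of stationIDs, not the columns of df_downloaded,
# so the index never runs past the five reference rows
for (i in seq_len(nrow(stationIDs))) {
  names(df_downloaded) <- gsub(stationIDs$ID[i], stationIDs$Names[i],
                               names(df_downloaded), fixed = TRUE)
}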
You can do this with dplyr and tidyr. You basically want to make your data long so that the IDs end up in a column, join that column with your reference table of IDs and names, and then make your data wide again.
library(dplyr)
library(tidyr)

df_downloaded %>%
  gather(ID, value, -Diversion.Date, -Irrigation, -Seep) %>%
  left_join(., stationIDs) %>%
  dplyr::select(-ID) %>%
  spread(Names, value)
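With more recent tidyr releases, the same reshaping can be written with pivot_longer()/pivot_wider() instead of gather()/spread() (a sketch using the same column names; it assumes stationIDs$ID is stored as character, as in the first answer):
df_downloaded %>%
  pivot_longer(-c(Diversion.Date, Irrigation, Seep),
               names_to = "ID", values_to = "value") %>%
  left_join(stationIDs, by = "ID") %>%
  dplyr::select(-ID) %>%
  pivot_wider(names_from = Names, values_from = value)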
I am currently stuck trying to add some custom seasonality with the Prophet package from Facebook in a grouped, piped operation in R.
Here's a mockup of my current data:
ds <- as.Date(c('2016-11-01','2016-11-02','2016-11-03','2016-11-04',
'2016-11-05','2016-11-06','2016-11-07','2016-11-08',
'2016-11-09','2016-11-10','2016-11-11','2016-11-12',
'2016-11-13','2016-11-14','2016-11-15','2016-11-16',
'2016-11-17','2016-11-18','2016-11-19','2016-11-20',
'2016-11-21','2016-11-22','2016-11-23','2016-11-24',
'2016-11-25','2016-11-26','2016-11-27','2016-11-28',
'2016-11-29','2016-11-30',
'2016-11-01','2016-11-02','2016-11-03','2016-11-04',
'2016-11-05','2016-11-06','2016-11-07','2016-11-08',
'2016-11-09','2016-11-10','2016-11-11','2016-11-12',
'2016-11-13','2016-11-14','2016-11-15','2016-11-16',
'2016-11-17','2016-11-18','2016-11-19','2016-11-20',
'2016-11-21','2016-11-22','2016-11-23','2016-11-24',
'2016-11-25','2016-11-26','2016-11-27','2016-11-28',
'2016-11-29','2016-11-30'))
y <- c(15,17,18,19,20,54,67,23,12,34,12,78,34,12,3,45,67,89,12,111,123,112,14,566,345,123,567,56,87,90,
45,23,12,10,21,34,12,45,12,44,87,45,32,67,1,57,87,99,33,234,456,123,89,333,411,232,455,55,90,21)
y <- as.numeric(y)
some_group <- rep(c("A", "B"), each = 30)
df <- data.frame(ds, some_group, y)
With df looking something like this:
ds some_group y
1 2016-11-01 A 15
2 2016-11-02 A 17
3 2016-11-03 A 18
4 2016-11-04 A 19
5 2016-11-05 A 20
6 2016-11-06 A 54
7 2016-11-07 A 67
8 2016-11-08 A 23
9 2016-11-09 A 12
10 2016-11-10 A 34
11 2016-11-11 A 12
12 2016-11-12 A 78
13 2016-11-13 A 34
14 2016-11-14 A 12
15 2016-11-15 A 3
16 2016-11-16 A 45
17 2016-11-17 A 67
18 2016-11-18 A 89
19 2016-11-19 A 12
20 2016-11-20 A 111
21 2016-11-21 A 123
22 2016-11-22 A 112
23 2016-11-23 A 14
24 2016-11-24 A 566
25 2016-11-25 A 345
26 2016-11-26 A 123
27 2016-11-27 A 567
28 2016-11-28 A 56
29 2016-11-29 A 87
30 2016-11-30 A 90
31 2016-11-01 B 45
32 2016-11-02 B 23
33 2016-11-03 B 12
34 2016-11-04 B 10
35 2016-11-05 B 21
36 2016-11-06 B 34
37 2016-11-07 B 12
38 2016-11-08 B 45
39 2016-11-09 B 12
40 2016-11-10 B 44
41 2016-11-11 B 87
42 2016-11-12 B 45
43 2016-11-13 B 32
44 2016-11-14 B 67
45 2016-11-15 B 1
46 2016-11-16 B 57
47 2016-11-17 B 87
48 2016-11-18 B 99
49 2016-11-19 B 33
50 2016-11-20 B 234
51 2016-11-21 B 456
52 2016-11-22 B 123
53 2016-11-23 B 89
54 2016-11-24 B 333
55 2016-11-25 B 411
56 2016-11-26 B 232
57 2016-11-27 B 455
58 2016-11-28 B 55
59 2016-11-29 B 90
60 2016-11-30 B 21
My current implementation in R works. However, I am missing the part of the code where I add custom seasonality, as specified in the project's documentation.
Here's my current implementation:
df2 <- df %>%
  group_by(some_group) %>%
  do(predict(prophet(.,
                     holidays = custom_events,
                     yearly.seasonality = FALSE,
                     weekly.seasonality = FALSE,
                     daily.seasonality = FALSE), # missing custom seasonality
             make_future_dataframe(prophet(.,
                                           holidays = custom_events,
                                           yearly.seasonality = FALSE,
                                           weekly.seasonality = FALSE,
                                           daily.seasonality = FALSE),
                                   periods = 30 * 7))) %>%
  select(ds, some_group, yhat)
How can I add custom seasonality to this pipe operation?
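One approach consistent with the prophet R interface is to create the model without fitting it, attach the custom seasonality with add_seasonality(), fit it with fit.prophet(), and wrap the whole thing in a helper that runs once per group. The sketch below assumes the custom_events holidays object from the question exists; the "monthly" name, period, and Fourier order are placeholder values:
library(dplyr)
library(prophet)

fit_one_group <- function(d) {
  m <- prophet(holidays = custom_events,
               yearly.seasonality = FALSE,
               weekly.seasonality = FALSE,
               daily.seasonality = FALSE,
               fit = FALSE)                 # build the model object without fitting yet
  m <- add_seasonality(m, name = "monthly", period = 30.5, fourier.order = 5)
  m <- fit.prophet(m, d)
  predict(m, make_future_dataframe(m, periods = 30 * 7))
}

df2 <- df %>%
  group_by(some_group) %>%
  do(fit_one_group(.)) %>%
  select(ds, some_group, yhat)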
I have two data tables. The first table is a matrix with coordinates and precipitation. It consists of four columns: latitude, longitude, precipitation, and day of monitoring. An example of the table is:
latitude_1 longitude_1 precipitation day_mon
54.17 62.15 5 34
69.61 48.65 3 62
73.48 90.16 7 96
66.92 90.27 7 19
56.19 96.46 9 25
72.23 74.18 5 81
88.00 95.20 7 97
92.44 44.41 6 18
95.83 52.91 9 88
99.68 96.23 8 6
81.91 48.32 8 96
54.66 52.70 0 62
95.31 91.82 2 84
60.32 96.25 9 71
97.39 47.91 7 76
65.21 44.63 9 3
The second table consists of 5 columns: station number, latitude, longitude, day when monitoring began, and day when monitoring ended. It looks like:
station latitude_2 longitude_2 day_begin day_end
15 50.00 93.22 34 46
11 86.58 85.29 15 47
14 93.17 63.17 31 97
10 88.56 61.28 15 78
13 45.29 77.10 24 79
6 69.73 99.52 13 73
4 45.60 77.36 28 95
13 92.88 62.38 9 51
1 65.10 64.13 7 69
10 60.57 86.77 34 64
3 53.62 60.76 23 96
16 87.82 59.41 38 47
1 47.83 95.89 21 52
11 75.42 46.20 38 87
3 55.71 55.26 2 73
16 71.65 96.15 36 93
I want to sum the precipitation from the first table, subject to two conditions:
day_begin < day_mon < day_end: the day of monitoring (day_mon from the first table) should be later than the day monitoring began and earlier than the day it ended (from the second table).
Only sum precipitation from the closest point: the distance between the point of monitoring (coordinates longitude_1 and latitude_1) and the station (coordinates longitude_2 and latitude_2) should be the minimum. The distance is calculated by the formula:
R = 6400 * arccos(sin(latitude_1) * sin(latitude_2) + cos(latitude_1) * cos(latitude_2) * cos(longitude_1 - longitude_2))
In the end, I want to get the results as a table:
station latitude_2 longitude_2 day_begin day_end Sum
15 50 93.22 34 46 188
11 86.58 85.29 15 47 100
14 93.17 63.17 31 97 116
10 88.56 61.28 15 78 182
13 45.29 77.1 24 79 136
6 69.73 99.52 13 73 126
4 45.6 77.36 28 95 108
13 92.88 62.38 9 51 192
1 65.1 64.13 7 69 125
10 60.57 86.77 34 64 172
3 53.62 60.76 23 96 193
16 87.82 59.41 38 47 183
1 47.83 95.89 21 52 104
11 75.42 46.2 38 87 151
3 55.71 55.26 2 73 111
16 71.65 96.15 36 93 146
I know how to calculate it in C++. What function should I use in R?
Thank you for your help!
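As a building block for the R version, the distance formula above can be written as a small base-R helper (a sketch; the great_circle_km name is just illustrative, and the coordinates are converted from degrees to radians because R's sin(), cos(), and acos() work in radians):
# great-circle distance in km between two points given in decimal degrees
great_circle_km <- function(lat1, lon1, lat2, lon2, radius = 6400) {
  to_rad <- pi / 180
  lat1 <- lat1 * to_rad; lon1 <- lon1 * to_rad
  lat2 <- lat2 * to_rad; lon2 <- lon2 * to_rad
  # clamp the cosine term to [-1, 1] to guard against floating-point error
  x <- sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * cos(lon1 - lon2)
  radius * acos(pmin(1, pmax(-1, x)))
}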
I'm not sure if I solved your problem correctly... but here it goes.
I used a data.table approach.
library( tidyverse )
library( data.table )
#step 1. join days as periods
#create a point id and dummy period columns (day_begin = day_end = day_mon) in dt1
dt1[, point_id := .I]
dt1[, day_begin := day_mon]
dt1[, day_end := day_mon]
setkey(dt2, day_begin, day_end)
#overlap join finding all stations for each point that overlap periods
dt <- foverlaps( dt1, dt2, type = "within" )
#step 2. calculate the distance station for each point based on TS-privided formula
dt[, distance := 6400 * acos( sin( latitude_1 ) * sin( latitude_2 ) + cos( latitude_1 ) * cos( latitude_2 ) ) * cos( longitude_1 - longitude_2 ) ]
#step 3. for each point_id, keep the station with the minimal (absolute) distance
dt[ , .SD[which.min( abs( distance ) )], by = point_id ]
# point_id station latitude_2 longitude_2 day_begin day_end latitude_1 longitude_1 precipitation day_mon i.day_begin i.day_end distance
# 1: 1 1 47.83 95.89 21 52 54.17 62.15 5 34 34 34 -248.72398
# 2: 2 6 69.73 99.52 13 73 69.61 48.65 3 62 62 62 631.89228
# 3: 3 14 93.17 63.17 31 97 73.48 90.16 7 96 96 96 -1519.84886
# 4: 4 11 86.58 85.29 15 47 66.92 90.27 7 19 19 19 1371.54757
# 5: 5 11 86.58 85.29 15 47 56.19 96.46 9 25 25 25 1139.46849
# 6: 6 14 93.17 63.17 31 97 72.23 74.18 5 81 81 81 192.99264
# 7: 7 14 93.17 63.17 31 97 88.00 95.20 7 97 97 97 5822.81529
# 8: 8 3 55.71 55.26 2 73 92.44 44.41 6 18 18 18 -899.71206
# 9: 9 3 53.62 60.76 23 96 95.83 52.91 9 88 88 88 45.16237
# 10: 10 3 55.71 55.26 2 73 99.68 96.23 8 6 6 6 -78.04484
# 11: 11 14 93.17 63.17 31 97 81.91 48.32 8 96 96 96 -5467.77459
# 12: 12 3 53.62 60.76 23 96 54.66 52.70 0 62 62 62 -1361.57863
# 13: 13 11 75.42 46.20 38 87 95.31 91.82 2 84 84 84 -445.18765
# 14: 14 14 93.17 63.17 31 97 60.32 96.25 9 71 71 71 -854.86321
# 15: 15 3 53.62 60.76 23 96 97.39 47.91 7 76 76 76 1304.41634
# 16: 16 3 55.71 55.26 2 73 65.21 44.63 9 3 3 3 -7015.57516
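If the per-station totals shown in the question's expected output are the end goal, one more aggregation step along these lines could follow (a sketch, reusing the columns shown above):
# keep the nearest station per point, then total precipitation per station record
nearest <- dt[ , .SD[which.min( abs( distance ) )], by = point_id ]
nearest[ , .(Sum = sum( precipitation )),
         by = .(station, latitude_2, longitude_2, day_begin, day_end) ]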
Sample data
dt1 <- read.table( text = "latitude_1 longitude_1 precipitation day_mon
54.17 62.15 5 34
69.61 48.65 3 62
73.48 90.16 7 96
66.92 90.27 7 19
56.19 96.46 9 25
72.23 74.18 5 81
88.00 95.20 7 97
92.44 44.41 6 18
95.83 52.91 9 88
99.68 96.23 8 6
81.91 48.32 8 96
54.66 52.70 0 62
95.31 91.82 2 84
60.32 96.25 9 71
97.39 47.91 7 76
65.21 44.63 9 3", header = TRUE ) %>%
setDT()
dt2 <- read.table( text = "station latitude_2 longitude_2 day_begin day_end
15 50.00 93.22 34 46
11 86.58 85.29 15 47
14 93.17 63.17 31 97
10 88.56 61.28 15 78
13 45.29 77.10 24 79
6 69.73 99.52 13 73
4 45.60 77.36 28 95
13 92.88 62.38 9 51
1 65.10 64.13 7 69
10 60.57 86.77 34 64
3 53.62 60.76 23 96
16 87.82 59.41 38 47
1 47.83 95.89 21 52
11 75.42 46.20 38 87
3 55.71 55.26 2 73
16 71.65 96.15 36 93", header = TRUE ) %>%
setDT()
I have a sample input file DF1 with many rows and columns containing missing data, and I want to impute the missing data from a different data frame DF2, generate several data frames as shown in the output, and save them as data frames. Can anyone help me solve this?
Input:
DF1:
GM A B C D E
1 22 34 56 345 76
2 34 44 777 67 NA
3 45 76 77 NA NA
4 56 88 NA NA NA
5 36 NA NA NA NA
DF2
V1 V2 V3
1 11 21
2 12 22
3 13 23
4 14 24
5 15 25
6 16 26
7 17 27
8 18 28
9 19 29
10 20 30
Output:
OutputV1:
GM A B C D E
1 22 34 56 345 76
2 34 44 777 67 1
3 45 76 77 2 3
4 56 88 4 5 6
5 36 7 8 9 10
OutputV2
GM A B C D E
1 22 34 56 345 76
2 34 44 777 67 11
3 45 76 77 12 13
4 56 88 14 15 16
5 36 17 18 19 20
OutputV3:
GM A B C D E
1 22 34 56 345 76
2 34 44 777 67 21
3 45 76 77 22 23
4 56 88 24 25 26
5 36 27 28 29 30
It would be great if someone could help me solve this, as there are many variables in DF2 and many data frames need to be generated depending on the number of variables.
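For a reproducible setup, the example tables above can be rebuilt roughly like this (a sketch, using the lowercase df1/df2 names that the code below works with):
df1 <- data.frame(GM = 1:5,
                  A = c(22, 34, 45, 56, 36),
                  B = c(34, 44, 76, 88, NA),
                  C = c(56, 777, 77, NA, NA),
                  D = c(345, 67, NA, NA, NA),
                  E = c(76, NA, NA, NA, NA))
df2 <- data.frame(V1 = 1:10, V2 = 11:20, V3 = 21:30)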
You can transpose DF1, fill the missing values, and then transpose it back:
t_df <- t(df1)
t_df[is.na(t_df)] <- df2$V1
as.data.frame(t(t_df))
# GM A B C D E
#1 1 22 34 56 345 76
#2 2 34 44 777 67 1
#3 3 45 76 77 2 3
#4 4 56 88 4 5 6
#5 5 36 7 8 9 10
This works best if all columns have the same data type; otherwise the data types may get mixed up by the transpose. Wrapping the same steps in a small helper function makes it easy to reuse with each column of df2:
impute_by_row <- function(df, values) {
t_df <- t(df)
t_df[is.na(t_df)] <- values
as.data.frame(t(t_df))
}
impute_by_row(df1, df2$V1)
# GM A B C D E
#1 1 22 34 56 345 76
#2 2 34 44 777 67 1
#3 3 45 76 77 2 3
#4 4 56 88 4 5 6
#5 5 36 7 8 9 10
impute_by_row(df1, df2$V2)
# GM A B C D E
#1 1 22 34 56 345 76
#2 2 34 44 777 67 11
#3 3 45 76 77 12 13
#4 4 56 88 14 15 16
#5 5 36 17 18 19 20
impute_by_row(df1, df2$V3)
# GM A B C D E
#1 1 22 34 56 345 76
#2 2 34 44 777 67 21
#3 3 45 76 77 22 23
#4 4 56 88 24 25 26
#5 5 36 27 28 29 30
Apply the function to all columns of df2:
lapply(df2, function(v) impute_by_row(df1, v))
$V1
GM A B C D E
1 1 22 34 56 345 76
2 2 34 44 777 67 1
3 3 45 76 77 2 3
4 4 56 88 4 5 6
5 5 36 7 8 9 10
$V2
GM A B C D E
1 1 22 34 56 345 76
2 2 34 44 777 67 11
3 3 45 76 77 12 13
4 4 56 88 14 15 16
5 5 36 17 18 19 20
$V3
GM A B C D E
1 1 22 34 56 345 76
2 2 34 44 777 67 21
3 3 45 76 77 22 23
4 4 56 88 24 25 26
5 5 36 27 28 29 30
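If separate OutputV1/OutputV2/OutputV3 data frames are wanted in the workspace rather than a list, one option is to name the list elements and release them with list2env() (a sketch):
imputed <- lapply(df2, function(v) impute_by_row(df1, v))
names(imputed) <- paste0("Output", names(df2))
# creates OutputV1, OutputV2, OutputV3 in the global environment
list2env(imputed, envir = .GlobalEnv)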