R: sum columns with two conditions - r

I have two data tables. The first table is matrix with coordinates and precipitation. It consists of four columns with latitude, longitude, precipitation and day of monitoring. The example of table is:
latitude_1 longitude_1 precipitation day_mon
54.17 62.15 5 34
69.61 48.65 3 62
73.48 90.16 7 96
66.92 90.27 7 19
56.19 96.46 9 25
72.23 74.18 5 81
88.00 95.20 7 97
92.44 44.41 6 18
95.83 52.91 9 88
99.68 96.23 8 6
81.91 48.32 8 96
54.66 52.70 0 62
95.31 91.82 2 84
60.32 96.25 9 71
97.39 47.91 7 76
65.21 44.63 9 3
The second table consists of 5 columns : station number, longitude, latitude, day when monitoring began, day when monitoring ends. It looks like:
station latitude_2 longitude_2 day_begin day_end
15 50.00 93.22 34 46
11 86.58 85.29 15 47
14 93.17 63.17 31 97
10 88.56 61.28 15 78
13 45.29 77.10 24 79
6 69.73 99.52 13 73
4 45.60 77.36 28 95
13 92.88 62.38 9 51
1 65.10 64.13 7 69
10 60.57 86.77 34 64
3 53.62 60.76 23 96
16 87.82 59.41 38 47
1 47.83 95.89 21 52
11 75.42 46.20 38 87
3 55.71 55.26 2 73
16 71.65 96.15 36 93
I want to sum precipitations from 1 table. But I have two conditions:
day_begin< day_mon< day_end. Day of monitoring(day_mon from 1 table) should be less than day of end and more than day of begin (from 2 table)
Sum precipitation from the point which is closer than others. distance between point of monitoring (coordinates consists
longitude_1 and latitude_1) and station (coordinates consists
longitude_2 and latitude_2) should be minimum. The distance is calculated by the formula :
R = 6400*arccos(sin(latitude_1)*sin(latitude_2)+cos(latitude_1)*cos(latitude_2))*cos(longitude_1-longitude_2))
At last I want to get results as table :
station latitude_2 longitude_2 day_begin day_end Sum
15 50 93.22 34 46 188
11 86.58 85.29 15 47 100
14 93.17 63.17 31 97 116
10 88.56 61.28 15 78 182
13 45.29 77.1 24 79 136
6 69.73 99.52 13 73 126
4 45.6 77.36 28 95 108
13 92.88 62.38 9 51 192
1 65.1 64.13 7 69 125
10 60.57 86.77 34 64 172
3 53.62 60.76 23 96 193
16 87.82 59.41 38 47 183
1 47.83 95.89 21 52 104
11 75.42 46.2 38 87 151
3 55.71 55.26 2 73 111
16 71.65 96.15 36 93 146
I know how to calculate it in C++. What function should I use in R?
Thank you for your help!

I'm not sure if I solved your problem correctly... but here it comes..
I used a data.table approach.
library( tidyverse )
library( data.table )
#step 1. join days as periods
#create a dummy variables to create a virtual period in dt1
dt1[, point_id := .I]
dt1[, day_begin := day_mon]
dt1[, day_end := day_mon]
setkey(dt2, day_begin, day_end)
#overlap join finding all stations for each point that overlap periods
dt <- foverlaps( dt1, dt2, type = "within" )
#step 2. calculate the distance station for each point based on TS-privided formula
dt[, distance := 6400 * acos( sin( latitude_1 ) * sin( latitude_2 ) + cos( latitude_1 ) * cos( latitude_2 ) ) * cos( longitude_1 - longitude_2 ) ]
#step 3. filter (absolute) minimal distance based on point_id
dt[ , .SD[which.min( abs( distance ) )], by = point_id ]
# point_id station latitude_2 longitude_2 day_begin day_end latitude_1 longitude_1 precipitation day_mon i.day_begin i.day_end distance
# 1: 1 1 47.83 95.89 21 52 54.17 62.15 5 34 34 34 -248.72398
# 2: 2 6 69.73 99.52 13 73 69.61 48.65 3 62 62 62 631.89228
# 3: 3 14 93.17 63.17 31 97 73.48 90.16 7 96 96 96 -1519.84886
# 4: 4 11 86.58 85.29 15 47 66.92 90.27 7 19 19 19 1371.54757
# 5: 5 11 86.58 85.29 15 47 56.19 96.46 9 25 25 25 1139.46849
# 6: 6 14 93.17 63.17 31 97 72.23 74.18 5 81 81 81 192.99264
# 7: 7 14 93.17 63.17 31 97 88.00 95.20 7 97 97 97 5822.81529
# 8: 8 3 55.71 55.26 2 73 92.44 44.41 6 18 18 18 -899.71206
# 9: 9 3 53.62 60.76 23 96 95.83 52.91 9 88 88 88 45.16237
# 10: 10 3 55.71 55.26 2 73 99.68 96.23 8 6 6 6 -78.04484
# 11: 11 14 93.17 63.17 31 97 81.91 48.32 8 96 96 96 -5467.77459
# 12: 12 3 53.62 60.76 23 96 54.66 52.70 0 62 62 62 -1361.57863
# 13: 13 11 75.42 46.20 38 87 95.31 91.82 2 84 84 84 -445.18765
# 14: 14 14 93.17 63.17 31 97 60.32 96.25 9 71 71 71 -854.86321
# 15: 15 3 53.62 60.76 23 96 97.39 47.91 7 76 76 76 1304.41634
# 16: 16 3 55.71 55.26 2 73 65.21 44.63 9 3 3 3 -7015.57516
Sample data
dt1 <- read.table( text = "latitude_1 longitude_1 precipitation day_mon
54.17 62.15 5 34
69.61 48.65 3 62
73.48 90.16 7 96
66.92 90.27 7 19
56.19 96.46 9 25
72.23 74.18 5 81
88.00 95.20 7 97
92.44 44.41 6 18
95.83 52.91 9 88
99.68 96.23 8 6
81.91 48.32 8 96
54.66 52.70 0 62
95.31 91.82 2 84
60.32 96.25 9 71
97.39 47.91 7 76
65.21 44.63 9 3", header = TRUE ) %>%
setDT()
dt2 <- read.table( text = "station latitude_2 longitude_2 day_begin day_end
15 50.00 93.22 34 46
11 86.58 85.29 15 47
14 93.17 63.17 31 97
10 88.56 61.28 15 78
13 45.29 77.10 24 79
6 69.73 99.52 13 73
4 45.60 77.36 28 95
13 92.88 62.38 9 51
1 65.10 64.13 7 69
10 60.57 86.77 34 64
3 53.62 60.76 23 96
16 87.82 59.41 38 47
1 47.83 95.89 21 52
11 75.42 46.20 38 87
3 55.71 55.26 2 73
16 71.65 96.15 36 93", header = TRUE ) %>%
setDT()

Related

R: order() not working as expected outside of RStudio

I have a character vector (consisting of randomly arranged numbers or letters) that I want to use to order a dataframe:
vals = as.numeric(dict$keys)
## ONE
vals = order(vals)
## TWO
dict = dict[vals,]
At ONE:
> vals
[1] 1 1 1 1 1 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5
[26] 6 7 7 7 7 8 8 8 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 10 10
[51] 10 10 10 11 11 11 11 11 12 12 12 12 12 12 12 12 12 13 13 13 14 14 15 15 15
[76] 15 16 16 16 16 16 16 16 16 16 16 17 17 17 17 17 18 18 18 18 18 18 18 18 18
[101] 18 19 19 19 19 19 19 20 20 20 20 20 20 20 20 20 20 21 21 21 21 21 21 21 21
[126] 22 22 22 22 22 22 22 22 22 22 22 22 23
At TWO:
> vals
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
[55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
[73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
[91] 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
[109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
[127] 127 128 129 130 131 132 133 134 135 136 137 138
When I execute this snippet in RStudio in Windows, it orders the dataframe dict fine. Numbers are ordered first, then letters are at the end (this is what I want).
However, in a linux remote desktop where I execute with > Rscript , this snippet doesn't work and the dataframe remains how it was before these lines are executed.
I fixed this by defining stringsAsFactors = F for all uses of data.frame in the script as Henrik suggested. The issue lied in the different versions of R I was using on the two systems.

Extract the columns in a data frame when colnames match characters in a vector

I have a vector with characters matching some colnames in a data frame. I want to extract/subset the columns that match the vector v. I could I do that? Imagine a big data frame!
a <- sample(100,10)
b <- sample(100,10)
c <- sample(100,10)
d <- sample(100,10)
df <- data.frame(a,b,c,d)
df
a b c d
1 91 17 93 53
2 9 94 65 55
3 11 58 38 13
4 100 77 98 45
5 69 9 61 2
6 15 50 44 14
7 58 55 88 85
8 78 45 33 51
9 94 3 89 62
10 7 12 90 44
v <- c("a","c")
wanted output:
a c
1 91 93
2 9 65
3 11 38
4 100 98
5 69 61
6 15 44
7 58 88
8 78 33
9 94 89
10 7 90
>
We can use select
library(dplyr)
df %>%
select(all_of(v))
-output
# a c
#1 26 92
#2 34 15
#3 15 80
#4 4 88
#5 55 69
#6 96 78
#7 63 2
#8 69 62
#9 12 100
#10 16 22
> df
a b c d
1 38 28 68 88
2 18 21 99 40
3 30 33 20 91
4 85 64 88 33
5 9 46 82 51
6 59 86 40 80
7 80 81 46 49
8 57 61 83 37
9 64 6 15 27
10 72 13 37 68
> v <- c("a","c")
> df[v]
a c
1 38 68
2 18 99
3 30 20
4 85 88
5 9 82
6 59 40
7 80 46
8 57 83
9 64 15
10 72 37
>

Conditionally replace column names in a dataframe based on values in another dataframe

I have downloaded a table of stream diversion data ("df_download"). The column names of this table are primarily taken from the ID numbers of the gauging stations.
I want to conditionally replace the ID numbers that have been used for column names with text for the station names, which will help make the data more readable when I'm sharing the results. I created a table ("stationIDs") with the ID numbers and station names to use as a reference for changing the column names of "df_download".
I can replace the column names individually, but I want to write a loop of some kind that will address all of the columns of "df_download" and change the names of the columns referenced in the dataframe "stationIDs".
An example of what I'm trying to do is below.
Downloaded Data ("df_download")
A portion of the downloaded data is similar to this:
df_downloaded <- data.frame(Var1 = seq(as.Date("2012-01-01"),as.Date("2012-12-01"), by="month"),
Var2 = sample(50:150,12, replace =TRUE),
Var3 = sample(10:100,12, replace =TRUE),
Var4 = sample(15:45,12, replace =TRUE),
Var5 = sample(50:200,12, replace =TRUE),
Var6 = sample(15:100,12, replace =TRUE),
Var7 = c(rep(0,3),rep(13,6),rep(0,3)),
Var8 = rep(5,12))
colnames(df_downloaded) <- c("Diversion.Date","360410059","360410060",
"360410209","361000655","361000656","Irrigation","Seep")
df_download # not run
#
# Diversion.Date 360410059 360410060 360410209 361000655 361000656 Irrigation Seep
# 1 2012-01-01 93 57 28 101 16 0 5
# 2 2012-02-01 102 68 19 124 98 0 5
# 3 2012-03-01 124 93 36 109 56 0 5
# 4 2012-04-01 94 96 23 54 87 13 5
# 5 2012-05-01 83 70 43 119 15 13 5
# 6 2012-06-01 78 63 45 195 15 13 5
# 7 2012-07-01 86 77 20 130 63 13 5
# 8 2012-08-01 118 29 27 118 57 13 5
# 9 2012-09-01 142 18 45 116 27 13 5
# 10 2012-10-01 74 68 34 182 79 0 5
# 11 2012-11-01 106 48 27 95 74 0 5
# 12 2012-12-01 91 41 20 179 55 0 5
Reference Table ("stationIDs")
stationIDs <- data.frame(ID = c("360410059", "360410060", "360410209", "361000655", "361000656"),
Names = c("RimView", "IPCO", "WMA.Ditch", "RV.Bypass", "LowerFalls"))
stationIDs # not run
#
# ID Names
# 1 360410059 RimView
# 2 360410060 IPCO
# 3 360410209 WMA.Ditch
# 4 361000655 RV.Bypass
# 5 361000656 LowerFalls
I can replace the column names in "df_downloaded" using individual statements. I show the first three iterations below.
After three iterations "RimValley", "IPCO", and "WMA.Ditch" have replaced their respective gauge ID numbers.
names(df_downloaded) <- gsub(stationIDs$ID[1],stationIDs$Name[1],names(df_downloaded))
# head(df_downloaded)
# Diversion.Date RimView 360410060 360410209 361000655 361000656 Irrigation Seep
# 1 2012-01-01 93 57 28 101 16 0 5
# 2 2012-02-01 102 68 19 124 98 0 5
# 3 2012-03-01 124 93 36 109 56 0 5
# 4 2012-04-01 94 96 23 54 87 13 5
# 5 2012-05-01 83 70 43 119 15 13 5
# 6 2012-06-01 78 63 45 195 15 13 5
names(df_downloaded) <- gsub(stationIDs$ID[2],stationIDs$Name[2],names(df_downloaded))
# head(df_downloaded)
# Diversion.Date RimView IPCO 360410209 361000655 361000656 Irrigation Seep
# 1 2012-01-01 93 57 28 101 16 0 5
# 2 2012-02-01 102 68 19 124 98 0 5
# 3 2012-03-01 124 93 36 109 56 0 5
# 4 2012-04-01 94 96 23 54 87 13 5
# 5 2012-05-01 83 70 43 119 15 13 5
# 6 2012-06-01 78 63 45 195 15 13 5
names(df_downloaded) <- gsub(stationIDs$ID[3],stationIDs$Name[3],names(df_downloaded))
# head(df_downloaded)
# Diversion.Date RimView IPCO WMA.Ditch 361000655 361000656 Irrigation Seep
# 1 2012-01-01 93 57 28 101 16 0 5
# 2 2012-02-01 102 68 19 124 98 0 5
# 3 2012-03-01 124 93 36 109 56 0 5
# 4 2012-04-01 94 96 23 54 87 13 5
# 5 2012-05-01 83 70 43 119 15 13 5
# 6 2012-06-01 78 63 45 195 15 13 5
If I try to do the renaming using a for loop, I end up with NAs for column names.
for(i in seq_along(names(df_downloaded))){
names(df_downloaded) <- gsub(stationIDs$ID[i],stationIDs$Name[i],names(df_downloaded))
}
# head(df_downloaded)
# NA NA NA NA NA NA NA NA
# 1 2012-01-01 93 57 28 101 16 0 5
# 2 2012-02-01 102 68 19 124 98 0 5
# 3 2012-03-01 124 93 36 109 56 0 5
# 4 2012-04-01 94 96 23 54 87 13 5
# 5 2012-05-01 83 70 43 119 15 13 5
# 6 2012-06-01 78 63 45 195 15 13 5
I really want to be able to change the names with a for loop or something similar, because because the number of stations that I download data from changes depending on the years that I am analyzing.
Thanks for taking time to look at my question.
We can use match
#Convert factor columns to character
stationIDs[] <- lapply(stationIDs, as.character)
#Match names of df_downloaded with stationIDs$ID
inds <- match(names(df_downloaded), stationIDs$ID)
#Replace the matched name with corresponding Names from stationIDs
names(df_downloaded)[which(!is.na(inds))] <- stationIDs$Names[inds[!is.na(inds)]]
df_downloaded
# Diversion.Date RimView IPCO WMA.Ditch RV.Bypass LowerFalls Irrigation Seep
#1 2012-01-01 142 14 41 200 79 0 5
#2 2012-02-01 97 100 35 176 22 0 5
#3 2012-03-01 85 59 26 88 71 0 5
#4 2012-04-01 68 49 34 63 15 13 5
#5 2012-05-01 62 58 44 87 16 13 5
#6 2012-06-01 70 59 33 145 87 13 5
#7 2012-07-01 112 65 25 52 64 13 5
#8 2012-08-01 75 12 27 103 19 13 5
#9 2012-09-01 73 65 36 172 68 13 5
#10 2012-10-01 87 35 27 146 42 0 5
#11 2012-11-01 122 17 33 183 32 0 5
#12 2012-12-01 108 65 15 120 99 0 5
You can do this dplyr and tidyr. You basically want to make your data long so that the IDs are in a column so that you can do a join on this with your reference of IDs to names. Then you can make your data wide again.
df_downloaded %>%
gather(ID, value, -Diversion.Date, -Irrigation, -Seep) %>%
left_join(., stationIDs) %>%
dplyr::select(-ID) %>%
spread(Names, value)

Filter using paste and name in dplyr

Sample data
df <- data.frame(loc.id = rep(1:5, each = 6), day = sample(1:365,30),
ref.day1 = rep(c(20,30,50,80,90), each = 6),
ref.day2 = rep(c(10,28,33,49,67), each = 6),
ref.day3 = rep(c(31,49,65,55,42), each = 6))
For each loc.id, if I want to keep days that are >= then ref.day1, I do this:
df %>% group_by(loc.id) %>% dplyr::filter(day >= ref.day1)
I want to make 3 data frames, each whose rows are filtered by ref.day1, ref.day2,ref.day3 respectively
I tried this:
col.names <- c("ref.day1","ref.day2","ref.day3")
temp.list <- list()
for(cl in seq_along(col.names)){
col.sub <- col.names[cl]
columns <- c("loc.id","day",col.sub)
df.sub <- df[,columns]
temp.dat <- df.sub %>% group_by(loc.id) %>% dplyr::filter(day >= paste0(col.sub)) # this line does not work
temp.list[[cl]] <- temp.dat
}
final.dat <- rbindlist(temp.list)
I was wondering how to refer to columns by names and paste function in dplyr in order to filter it out.
The reason why your original code doesn't work is that your col.names are strings, but dplyr function uses non-standard evaluation which doesn't accept strings. So you need to convert the string into variables.rlang::sym() can do that.
Also, you can use map function in purrr package, which is much more compact:
library(dplyr)
library(purrr)
col_names <- c("ref.day1","ref.day2","ref.day3")
map(col_names,~ df %>% dplyr::filter(day >= UQ(rlang::sym(.x))))
#it will return you a list of dataframes
By the way I removed group_by() because they don't seem to be useful.
Returned result:
[[1]]
loc.id day ref.day1 ref.day2 ref.day3
1 1 362 20 10 31
2 1 69 20 10 31
3 1 65 20 10 31
4 1 88 20 10 31
5 1 142 20 10 31
6 2 355 30 28 49
7 2 255 30 28 49
8 2 136 30 28 49
9 2 156 30 28 49
10 2 194 30 28 49
11 2 204 30 28 49
12 3 129 50 33 65
13 3 254 50 33 65
14 3 279 50 33 65
15 3 201 50 33 65
16 3 282 50 33 65
17 4 351 80 49 55
18 4 114 80 49 55
19 4 338 80 49 55
20 4 283 80 49 55
21 5 199 90 67 42
22 5 141 90 67 42
23 5 241 90 67 42
24 5 187 90 67 42
[[2]]
loc.id day ref.day1 ref.day2 ref.day3
1 1 16 20 10 31
2 1 362 20 10 31
3 1 69 20 10 31
4 1 65 20 10 31
5 1 88 20 10 31
6 1 142 20 10 31
7 2 355 30 28 49
8 2 255 30 28 49
9 2 136 30 28 49
10 2 156 30 28 49
11 2 194 30 28 49
12 2 204 30 28 49
13 3 129 50 33 65
14 3 254 50 33 65
15 3 279 50 33 65
16 3 201 50 33 65
17 3 282 50 33 65
18 4 351 80 49 55
19 4 114 80 49 55
20 4 338 80 49 55
21 4 283 80 49 55
22 4 79 80 49 55
23 5 199 90 67 42
24 5 67 90 67 42
25 5 141 90 67 42
26 5 241 90 67 42
27 5 187 90 67 42
[[3]]
loc.id day ref.day1 ref.day2 ref.day3
1 1 362 20 10 31
2 1 69 20 10 31
3 1 65 20 10 31
4 1 88 20 10 31
5 1 142 20 10 31
6 2 355 30 28 49
7 2 255 30 28 49
8 2 136 30 28 49
9 2 156 30 28 49
10 2 194 30 28 49
11 2 204 30 28 49
12 3 129 50 33 65
13 3 254 50 33 65
14 3 279 50 33 65
15 3 201 50 33 65
16 3 282 50 33 65
17 4 351 80 49 55
18 4 114 80 49 55
19 4 338 80 49 55
20 4 283 80 49 55
21 4 79 80 49 55
22 5 199 90 67 42
23 5 67 90 67 42
24 5 141 90 67 42
25 5 241 90 67 42
26 5 187 90 67 42
You may also want to check these:
https://dplyr.tidyverse.org/articles/programming.html
Use variable names in functions of dplyr

Using barplot in R doesn't not match the data?

I want to use barplot (or any other better options) to plot the following data:
action_number times
1 1 13408
2 2 5550
3 3 2757
4 4 1782
5 5 1114
6 6 847
7 7 582
8 8 410
9 9 306
10 10 278
11 11 212
12 12 165
13 13 139
14 14 112
15 15 106
16 16 82
17 17 64
18 18 61
19 19 69
20 20 47
21 21 31
22 22 40
23 23 34
24 24 31
25 25 28
26 26 26
27 27 21
28 28 16
29 29 14
30 30 16
31 31 11
32 32 10
33 33 11
34 34 10
35 35 4
36 36 6
37 37 5
38 38 8
39 39 6
40 40 3
41 41 6
42 42 8
43 43 3
44 44 3
45 45 7
46 46 8
47 47 4
48 48 4
49 49 1
50 50 4
51 51 2
52 52 4
53 53 3
54 54 1
55 55 2
56 56 1
57 58 2
58 59 4
59 60 1
60 62 2
61 63 1
62 66 1
63 67 4
64 68 2
65 69 1
66 70 1
67 71 1
68 73 1
69 74 1
70 77 1
71 79 1
72 80 1
73 82 1
74 92 2
75 97 1
76 98 1
77 103 1
78 106 1
79 114 1
80 118 1
81 128 1
82 142 1
83 148 1
84 153 1
85 155 1
86 166 1
87 183 1
88 218 1
89 224 1
90 298 1
91 536 1
I am using the following, but it does not match the data correctly:
mp <- barplot(data$times,axes=FALSE,ylim=c(0,13408))
axis(1,at=data$action_number,labels=data$action_number)
#??? Should I use at=data$action_number to at=data$times
axis(2,seq(0,91,3),c(0:30))
![enter image description here][1]
Problems:
- the x-axis does not have 536, it only goes to 224
- the Y axis only shows one number
Can you please give me advice and if I should use any package?
still, unclear but may be something like this
barplot(data$times, xlab=data$action_number)
mp <- barplot(data$times,axes=FALSE,ylim=c(0,13408))
axis(1,at=seq(1,91,10),labels=data$action_number[seq(1,91,10)])
axis(2,seq(0,13408,500),seq(0,13408,500))

Resources