Taylor Diagrams by Group in R (openair)

I'm trying to create a Taylor diagram to show agreement between observations and model output. The openair package lets you differentiate by group, which I would like to do for each site.
This is the code that I'm using:
TaylorDiagram(month_join, obs = "temp", mod = "temp_surf", group = "dataset_id", normalise = TRUE, cex = 1)
The observation variable is temp, the model variable is temp_surf, and the site variable I want to split into groups is dataset_id.
When I do this, even though there are 17 different datasets, they are binned into four groups. I can't find any help online about this. The function documentation says that for the group argument, "The total number of models compared will be equal to the number of unique values of group". I have 17 unique values in the grouping variable, but they are automatically binned into 4.
[Figure: Taylor diagram showing 4 groups instead of 17]
[Edit: first 20 rows of data from month_join]
# A tibble: 20 × 14
# Groups: dataset_id [2]
dataset_id month temp_surf temp_mid temp_bot ph_surf ph_mid ph_bot do_surf do_mid do_bot temp ph do
<dbl> <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 1 13.4 13.3 13.2 8.01 7.99 7.97 244. 232. 220. 13.3 8.00 NaN
2 3 2 13.3 13.2 13.0 8.01 7.98 7.96 245. 232. 218. 12.5 7.99 NaN
3 3 3 12.9 12.7 12.5 7.97 7.94 7.91 233. 216. 199. 12.7 8.04 NaN
4 3 4 12.6 12.4 12.2 7.93 7.91 7.89 223. 207. 190. NaN NaN NaN
5 3 5 12.9 12.7 12.4 7.93 7.91 7.89 223. 208. 193. NaN NaN NaN
6 3 6 13.5 13.2 12.9 7.94 7.92 7.90 226. 212. 198. 15.1 8.04 NaN
7 3 7 14.3 13.9 13.5 7.97 7.95 7.94 236. 224. 212. 16.0 8.09 NaN
8 3 8 14.4 14.1 13.8 7.98 7.97 7.95 238. 228. 217. 16.6 8.06 NaN
9 3 9 14.8 14.5 14.1 8.00 7.99 7.97 244. 235. 227. 16.7 8.05 NaN
10 3 10 14.8 14.4 14.1 8.00 7.98 7.96 243. 233. 222. 16.2 8.05 NaN
11 3 11 14.3 14.0 13.7 7.99 7.96 7.94 237. 224. 211. 15.5 8.05 NaN
12 3 12 13.6 13.4 13.3 7.99 7.97 7.94 237. 225. 213. 14.4 8.05 NaN
13 6 1 14.3 9.48 4.70 8.07 7.84 7.62 261. 143. 24.7 13.6 NaN NaN
14 6 2 14.2 9.42 4.68 8.07 7.84 7.62 264. 144. 24.4 13.5 NaN NaN
15 6 3 14.5 9.61 4.67 8.07 7.84 7.61 266. 145. 24.2 14.0 NaN NaN
16 6 4 15.0 9.86 4.68 8.06 7.84 7.61 264. 144. 24.0 14.3 NaN NaN
17 6 5 16.0 10.4 4.68 8.05 7.83 7.61 262. 143. 24.0 16.4 NaN NaN
18 6 6 17.3 11.0 4.68 8.04 7.83 7.61 257. 141. 23.9 17.6 NaN NaN
19 6 7 18.8 11.7 4.71 8.03 7.82 7.61 251. 138. 24.2 19.3 NaN NaN
20 6 8 19.2 12.0 4.76 8.03 7.82 7.61 248. 136. 24.7 NA NA NA
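A likely cause (an assumption, not confirmed above): the printout shows dataset_id as <dbl>, and openair's cutData() splits numeric grouping variables into four quantile bins by default, which would explain 17 datasets collapsing into 4 groups. A minimal sketch of the workaround, assuming month_join as printed above, is to convert dataset_id to a factor so each value is kept as a discrete group:
# Treat dataset_id as discrete rather than continuous (sketch; columns as printed above)
month_join$dataset_id <- factor(month_join$dataset_id)
TaylorDiagram(month_join, obs = "temp", mod = "temp_surf",
              group = "dataset_id", normalise = TRUE, cex = 1)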

Related

dplyr - programming dynamic variable & function name - ascending & descending

I am trying to find a way to shorten my code using dynamic variable names and functions for ascending and descending order. I can use desc for descending, but I couldn't find anything equivalent for ascending. Below is a reproducible example to demonstrate my problem.
Here is the sample dataset
library(dplyr)
set.seed(100)
data <- tibble(a = runif(20, min = 0, max = 100),
               b = runif(20, min = 0, max = 100),
               c = runif(20, min = 0, max = 100))
Dynamically passing variable with percent rank in ascending order
current_var <- "a" # dynamic variable name
data %>%
  mutate("percent_rank_{current_var}" := percent_rank(!!sym(current_var)))
#> # A tibble: 20 × 4
#> a b c percent_rank_a
#> <dbl> <dbl> <dbl> <dbl>
#> 1 30.8 53.6 33.1 0.263
#> 2 25.8 71.1 86.5 0.158
#> 3 55.2 53.8 77.8 0.684
#> 4 5.64 74.9 82.7 0
#> 5 46.9 42.0 60.3 0.526
#> 6 48.4 17.1 49.1 0.579
#> 7 81.2 77.0 78.0 0.947
#> 8 37.0 88.2 88.4 0.421
#> 9 54.7 54.9 20.8 0.632
#> 10 17.0 27.8 30.7 0.0526
#> 11 62.5 48.8 33.1 0.737
#> 12 88.2 92.9 19.9 1
#> 13 28.0 34.9 23.6 0.211
#> 14 39.8 95.4 27.5 0.474
#> 15 76.3 69.5 59.1 0.895
#> 16 66.9 88.9 25.3 0.789
#> 17 20.5 18.0 12.3 0.105
#> 18 35.8 62.9 23.0 0.316
#> 19 35.9 99.0 59.8 0.368
#> 20 69.0 13.0 21.1 0.842
Dynamically passing variable with percent rank in descending order
data %>%
  mutate("percent_rank_{current_var}" := percent_rank(desc(!!sym(current_var))))
#> # A tibble: 20 × 4
#> a b c percent_rank_a
#> <dbl> <dbl> <dbl> <dbl>
#> 1 30.8 53.6 33.1 0.737
#> 2 25.8 71.1 86.5 0.842
#> 3 55.2 53.8 77.8 0.316
#> 4 5.64 74.9 82.7 1
#> 5 46.9 42.0 60.3 0.474
#> 6 48.4 17.1 49.1 0.421
#> 7 81.2 77.0 78.0 0.0526
#> 8 37.0 88.2 88.4 0.579
#> 9 54.7 54.9 20.8 0.368
#> 10 17.0 27.8 30.7 0.947
#> 11 62.5 48.8 33.1 0.263
#> 12 88.2 92.9 19.9 0
#> 13 28.0 34.9 23.6 0.789
#> 14 39.8 95.4 27.5 0.526
#> 15 76.3 69.5 59.1 0.105
#> 16 66.9 88.9 25.3 0.211
#> 17 20.5 18.0 12.3 0.895
#> 18 35.8 62.9 23.0 0.684
#> 19 35.9 99.0 59.8 0.632
#> 20 69.0 13.0 21.1 0.158
How can I combine both into one statement? I can do this for descending order with desc, but I couldn't find an explicit function for ascending order.
rank_function <- desc # dynamic function for ranking
data %>%
  mutate("percent_rank_{current_var}" := percent_rank(rank_function(!!sym(current_var))))
#> # A tibble: 20 × 4
#> a b c percent_rank_a
#> <dbl> <dbl> <dbl> <dbl>
#> 1 30.8 53.6 33.1 0.737
#> 2 25.8 71.1 86.5 0.842
#> 3 55.2 53.8 77.8 0.316
#> 4 5.64 74.9 82.7 1
#> 5 46.9 42.0 60.3 0.474
#> 6 48.4 17.1 49.1 0.421
#> 7 81.2 77.0 78.0 0.0526
#> 8 37.0 88.2 88.4 0.579
#> 9 54.7 54.9 20.8 0.368
#> 10 17.0 27.8 30.7 0.947
#> 11 62.5 48.8 33.1 0.263
#> 12 88.2 92.9 19.9 0
#> 13 28.0 34.9 23.6 0.789
#> 14 39.8 95.4 27.5 0.526
#> 15 76.3 69.5 59.1 0.105
#> 16 66.9 88.9 25.3 0.211
#> 17 20.5 18.0 12.3 0.895
#> 18 35.8 62.9 23.0 0.684
#> 19 35.9 99.0 59.8 0.632
#> 20 69.0 13.0 21.1 0.158
Created on 2022-08-17 by the reprex package (v2.0.1)
You could define a function that simply returns its input:
rank_function <- function(x) x
Actually, this function is already defined in base R: identity.
rank_function <- identity
Also, you can explore the source code of desc:
desc
function (x) -xtfrm(x)
Apparently desc is just the negation of xtfrm, so you can use xtfrm for ascending ordering.
rank_function <- xtfrm
In the help document of xtfrm(x):
A generic auxiliary function that produces a numeric vector which will sort in the same order as x.
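Putting it together, a short sketch (reusing data and current_var from the question): set rank_function to identity for ascending order or desc for descending order, and the same mutate() call works for both.
rank_function <- identity   # ascending; use `rank_function <- desc` for descending
data %>%
  mutate("percent_rank_{current_var}" := percent_rank(rank_function(!!sym(current_var))))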

Function for finding Temperature at Dissolved Oxygen of 3 (TDO3) value across a whole year

I am looking to calculate the TDO3 value at every date during the year 2020. I have interpolated data sets of both temperature and dissolved oxygen in 0.25 meter increments from 1m - 22m below the surface between the dates of Jan-1-2020 and Dec-31-2020.
TDO3 is the temperature at the depth where dissolved oxygen is 3 mg/L. Below is a snippet of the merged data set.
> print(do_temp, n=85)
# A tibble: 31,110 x 4
date depth mean_temp mean_do
<date> <dbl> <dbl> <dbl>
1 2020-01-01 1 2.12 11.6
2 2020-01-01 1.25 2.19 11.5
3 2020-01-01 1.5 2.27 11.4
4 2020-01-01 1.75 2.34 11.3
5 2020-01-01 2 2.42 11.2
6 2020-01-01 2.25 2.40 11.2
7 2020-01-01 2.5 2.39 11.1
8 2020-01-01 2.75 2.38 11.1
9 2020-01-01 3 2.37 11.0
10 2020-01-01 3.25 2.41 11.0
11 2020-01-01 3.5 2.46 11.0
12 2020-01-01 3.75 2.50 10.9
13 2020-01-01 4 2.55 10.9
14 2020-01-01 4.25 2.54 10.9
15 2020-01-01 4.5 2.53 10.9
16 2020-01-01 4.75 2.52 11.0
17 2020-01-01 5 2.51 11.0
18 2020-01-01 5.25 2.50 11.0
19 2020-01-01 5.5 2.49 11.0
20 2020-01-01 5.75 2.49 11.1
21 2020-01-01 6 2.48 11.1
22 2020-01-01 6.25 2.49 10.9
23 2020-01-01 6.5 2.51 10.8
24 2020-01-01 6.75 2.52 10.7
25 2020-01-01 7 2.54 10.5
26 2020-01-01 7.25 2.55 10.4
27 2020-01-01 7.5 2.57 10.2
28 2020-01-01 7.75 2.58 10.1
29 2020-01-01 8 2.60 9.95
30 2020-01-01 8.25 2.63 10.1
31 2020-01-01 8.5 2.65 10.2
32 2020-01-01 8.75 2.68 10.3
33 2020-01-01 9 2.71 10.5
34 2020-01-01 9.25 2.69 10.6
35 2020-01-01 9.5 2.67 10.7
36 2020-01-01 9.75 2.65 10.9
37 2020-01-01 10 2.63 11.0
38 2020-01-01 10.2 2.65 10.8
39 2020-01-01 10.5 2.67 10.6
40 2020-01-01 10.8 2.69 10.3
41 2020-01-01 11 2.72 10.1
42 2020-01-01 11.2 2.75 9.89
43 2020-01-01 11.5 2.78 9.67
44 2020-01-01 11.8 2.81 9.44
45 2020-01-01 12 2.84 9.22
46 2020-01-01 12.2 2.83 9.39
47 2020-01-01 12.5 2.81 9.56
48 2020-01-01 12.8 2.80 9.74
49 2020-01-01 13 2.79 9.91
50 2020-01-01 13.2 2.80 10.1
51 2020-01-01 13.5 2.81 10.3
52 2020-01-01 13.8 2.82 10.4
53 2020-01-01 14 2.83 10.6
54 2020-01-01 14.2 2.86 10.5
55 2020-01-01 14.5 2.88 10.4
56 2020-01-01 14.8 2.91 10.2
57 2020-01-01 15 2.94 10.1
58 2020-01-01 15.2 2.95 10.0
59 2020-01-01 15.5 2.96 9.88
60 2020-01-01 15.8 2.97 9.76
61 2020-01-01 16 2.98 9.65
62 2020-01-01 16.2 2.99 9.53
63 2020-01-01 16.5 3.00 9.41
64 2020-01-01 16.8 3.01 9.30
65 2020-01-01 17 3.03 9.18
66 2020-01-01 17.2 3.05 9.06
67 2020-01-01 17.5 3.07 8.95
68 2020-01-01 17.8 3.09 8.83
69 2020-01-01 18 3.11 8.71
70 2020-01-01 18.2 3.13 8.47
71 2020-01-01 18.5 3.14 8.23
72 2020-01-01 18.8 3.16 7.98
73 2020-01-01 19 3.18 7.74
74 2020-01-01 19.2 3.18 7.50
75 2020-01-01 19.5 3.18 7.25
76 2020-01-01 19.8 3.18 7.01
77 2020-01-01 20 3.18 6.77
78 2020-01-01 20.2 3.18 5.94
79 2020-01-01 20.5 3.18 5.10
80 2020-01-01 20.8 3.18 4.27
81 2020-01-01 21 3.18 3.43
82 2020-01-01 21.2 3.22 2.60
83 2020-01-01 21.5 3.25 1.77
84 2020-01-01 21.8 3.29 0.934
85 2020-01-01 22 3.32 0.100
# ... with 31,025 more rows
https://github.com/TRobin82/WaterQuality
The above link will get you to the raw data.
What I am looking for is a data frame that looks like this, but with 366 rows, one for each date during the year.
> TDO3
dates tdo3
1 2020-1-1 3.183500
2 2020-2-1 3.341188
3 2020-3-1 3.338625
4 2020-4-1 3.437000
5 2020-5-1 4.453310
6 2020-6-1 5.887560
7 2020-7-1 6.673700
8 2020-8-1 7.825672
9 2020-9-1 8.861190
10 2020-10-1 11.007972
11 2020-11-1 7.136880
12 2020-12-1 2.752500
However, a DO value of exactly 3 mg/L does not occur in the interpolated DO data, so I need the function to find the closest value to 3 without going below it, then match the depth of that value against the temperature data frame to assign the proper temperature at that depth.
I am assuming the best route is a for-loop, but I am not sold on the proper way to go about it.
Here's one way of doing it with tidyverse-style functions. Note that this code is reproducible: anyone can run it and should get the same answer. It's great that you showed us your data, but it's even better to post the output of dput(), because then people can load the data and start helping you immediately.
This code does the following:
Loads the data from the link you provided (there were several data files, so I had to guess which one you meant).
Groups the observations by date.
Puts the observations in increasing order of mean_do.
Removes rows with values of mean_do that are strictly less than 3.
Takes the first ordered observation for each date (this will be the one with the lowest value of mean_do that is greater than or equal to 3).
Renames the column mean_temp to tdo3, since it's the temperature for that date at which the dissolved oxygen level was closest to 3 mg/L.
library(tidyverse)

do_temp <- read_csv("https://raw.githubusercontent.com/TRobin82/WaterQuality/main/DateDepthTempDo.csv") %>%
  select(-X1)

do_temp %>%
  group_by(date) %>%
  arrange(mean_do) %>%
  filter(mean_do >= 3) %>%   # keep values at or above 3 mg/L (closest to 3 without going below)
  slice_head(n = 1) %>%
  rename(tdo3 = mean_temp) %>%
  select(date, tdo3)
Here are the results. They're a bit different from the ones you posted, so I'm not sure if I've misunderstood you or if those were just illustrative and not real results.
# A tibble: 366 x 2
# Groups: date [366]
date tdo3
<date> <dbl>
1 2020-01-01 3.18
2 2020-01-02 3.18
3 2020-01-03 3.19
4 2020-01-04 3.21
5 2020-01-05 3.21
6 2020-01-06 3.21
7 2020-01-07 3.24
8 2020-01-08 3.28
9 2020-01-09 3.27
10 2020-01-10 3.28
# ... with 356 more rows
Let me know if you were looking for something else.
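If useful, a slightly more compact variant of the same idea (a sketch, assuming the same do_temp columns) uses slice_min() instead of arrange() plus slice_head():
do_temp %>%
  filter(mean_do >= 3) %>%                           # closest to 3 mg/L without going below
  group_by(date) %>%
  slice_min(mean_do, n = 1, with_ties = FALSE) %>%   # lowest qualifying DO per date
  ungroup() %>%
  select(date, tdo3 = mean_temp)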

Locating duplicated entries in a column of a dataframe?

In rows 11:13 and 14:16, it can be seen that there are duplicate entries in the column 'C2_xsampa' for 'm:' and 'n:'. Each value in 'C2_xsampa' should have two levels, Singleton or Geminate, but that is not the case for 'm:' and 'n:'. This yields wrong mean values for the numeric columns.
My question is: how do I find which rows are duplicated? I have manually checked the parent dataset from which the mean values are obtained, and everything looks fine there.
Earlier, I was using subset() to rectify the 'real' errors in the entries.
Data:
C2_xsampa Consonant Speaker C1.dn C2.dn V1.dn V2.dn total.dn
1 "d_d" Singleton 8.5 11.9 7.82 13.0 7.65 40.3
2 "d_d:" Geminate 9 11.6 11.9 11.4 7.46 42.3
3 "dZ" Singleton 8.31 7.79 7.47 14.9 9.81 40.0
4 "dZ:" Geminate 8.08 7.72 13.4 12.8 9.61 43.6
5 "g" Singleton 9 12.1 11.3 11.9 8.56 43.9
6 "g:" Geminate 8.69 11.3 11.1 12.7 10.2 45.3
7 "k" Singleton 9.5 12.3 14.4 9.71 6.97 43.4
8 "k:" Geminate 9 14.7 16.1 10.1 7.37 48.2
9 "l" Singleton 8.69 11.9 6.33 11.5 10.2 40.0
10 "l:" Geminate 8.81 11.3 10.0 10.0 11.5 42.8
11 "m" Singleton 8.36 13.6 9.11 11.1 9.20 43.0
12 "m:" Geminate 8.85 13.7 10.9 9.95 8.42 43.0
13 "m: " Geminate 14 14.6 12.4 5.66 5.01 37.7
14 "n" Singleton 8 15.1 4.44 11.6 8.99 40.2
15 "n:" Geminate 8.21 21.4 10.1 10.2 9.32 51.0
16 "n: " Geminate 11.3 32.0 10.4 8.09 7.94 58.5
17 "p" Singleton 8.4 11.2 11.9 7.98 6.53 37.7
18 "p:" Geminate 8.81 13.2 12.7 8.57 11.3 45.8
19 "t`" Singleton 9 12.9 10.5 8.69 9.20 41.3
20 "t`:" Geminate 9 13.1 13.1 8.39 10.6 45.2
Thanks.
You could check that the values for the two columns are unique throughout the dataset. In pandas that would be:
df = df.drop_duplicates(subset=['C2_xsampa', 'Consonant'])
To get the rows that are duplicated instead, use df[df.duplicated(subset=['C2_xsampa', 'Consonant'], keep=False)].
Edit: I just saw the r language tag. I believe distinct(select(df, C2_xsampa, Consonant)) will do.
It seems there are unnecessary symbols and spaces in some of the values of C2_xsampa. Here is a suggestion using {tidyverse}. First, it removes the symbols/spaces, and then it identifies rows duplicated by C2_xsampa and Consonant. You can filter the duplicated rows using the dup column.
library(tidyverse)

dat1 <- dat %>%
  mutate(C2_xsampa = str_trim(C2_xsampa)) %>%
  group_by(C2_xsampa, Consonant) %>%
  mutate(dup = n()) %>%
  ungroup()
dat1
# # A tibble: 20 x 9
# C2_xsampa Consonant Speaker C1.dn C2.dn V1.dn V2.dn total.dn dup
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 d_d Singleton 8.5 11.9 7.82 13 7.65 40.3 1
# 2 d_d: Geminate 9 11.6 11.9 11.4 7.46 42.3 1
# 3 dZ Singleton 8.31 7.79 7.47 14.9 9.81 40 1
# 4 dZ: Geminate 8.08 7.72 13.4 12.8 9.61 43.6 1
# 5 g Singleton 9 12.1 11.3 11.9 8.56 43.9 1
# 6 g: Geminate 8.69 11.3 11.1 12.7 10.2 45.3 1
# 7 k Singleton 9.5 12.3 14.4 9.71 6.97 43.4 1
# 8 k: Geminate 9 14.7 16.1 10.1 7.37 48.2 1
# 9 l Singleton 8.69 11.9 6.33 11.5 10.2 40 1
# 10 l: Geminate 8.81 11.3 10 10 11.5 42.8 1
# 11 m Singleton 8.36 13.6 9.11 11.1 9.2 43 1
# 12 m: Geminate 8.85 13.7 10.9 9.95 8.42 43 2
# 13 m: Geminate 14 14.6 12.4 5.66 5.01 37.7 2
# 14 n Singleton 8 15.1 4.44 11.6 8.99 40.2 1
# 15 n: Geminate 8.21 21.4 10.1 10.2 9.32 51 2
# 16 n: Geminate 11.3 32 10.4 8.09 7.94 58.5 2
# 17 p Singleton 8.4 11.2 11.9 7.98 6.53 37.7 1
# 18 p: Geminate 8.81 13.2 12.7 8.57 11.3 45.8 1
# 19 t` Singleton 9 12.9 10.5 8.69 9.2 41.3 1
# 20 t`: Geminate 9 13.1 13.1 8.39 10.6 45.2 1
Here is the code for the dataset:
dat <- read.table(
text = '
C2_xsampa Consonant Speaker C1.dn C2.dn V1.dn V2.dn total.dn
1 "d_d" Singleton 8.5 11.9 7.82 13.0 7.65 40.3
2 "d_d:" Geminate 9 11.6 11.9 11.4 7.46 42.3
3 "dZ" Singleton 8.31 7.79 7.47 14.9 9.81 40.0
4 "dZ:" Geminate 8.08 7.72 13.4 12.8 9.61 43.6
5 "g" Singleton 9 12.1 11.3 11.9 8.56 43.9
6 "g:" Geminate 8.69 11.3 11.1 12.7 10.2 45.3
7 "k" Singleton 9.5 12.3 14.4 9.71 6.97 43.4
8 "k:" Geminate 9 14.7 16.1 10.1 7.37 48.2
9 "l" Singleton 8.69 11.9 6.33 11.5 10.2 40.0
10 "l:" Geminate 8.81 11.3 10.0 10.0 11.5 42.8
11 "m" Singleton 8.36 13.6 9.11 11.1 9.20 43.0
12 "m:" Geminate 8.85 13.7 10.9 9.95 8.42 43.0
13 "m: " Geminate 14 14.6 12.4 5.66 5.01 37.7
14 "n" Singleton 8 15.1 4.44 11.6 8.99 40.2
15 "n:" Geminate 8.21 21.4 10.1 10.2 9.32 51.0
16 "n: " Geminate 11.3 32.0 10.4 8.09 7.94 58.5
17 "p" Singleton 8.4 11.2 11.9 7.98 6.53 37.7
18 "p:" Geminate 8.81 13.2 12.7 8.57 11.3 45.8
19 "t`" Singleton 9 12.9 10.5 8.69 9.20 41.3
20 "t`:" Geminate 9 13.1 13.1 8.39 10.6 45.2',
header = TRUE
)
My favorite approach for this is:
subset(dat, duplicated(C2_xsampa) | duplicated(C2_xsampa, fromLast = TRUE))

R data.table, select columns with no NA

I have a table of stock prices here:
https://drive.google.com/file/d/1S666wiCzf-8MfgugN3IZOqCiM7tNPFh9/view?usp=sharing
Some columns have NA's because the company does not exist (until later dates), or the company folded.
What I want to do is select the columns that have no NAs. I use data.table because it is faster. Here is my working code:
library(data.table)
library(magrittr)  # provides %>% and not()

example <- fread(file = "example.csv", key = "date")
example_select <- example[,
  lapply(.SD, function(x) not(sum(is.na(x) > 0)))
] %>%
  as.logical(.)
example[, ..example_select]
Is there better (fewer lines of) code to do the same? Thank you!
Try:
example[,lapply(.SD, function(x) {if(anyNA(x)) {NULL} else {x}} )]
There are lots of ways you could do this. Here's how I usually do it - a data.table approach without lapply:
example[, .SD, .SDcols = colSums(is.na(example)) == 0]
An answer using tidyverse packages
library(readr)
library(dplyr)
library(purrr)
data <- read_csv("~/Downloads/example.csv")
map2_dfc(data, names(data), .f = function(x, y) {
  column <- tibble("{y}" := x)
  if (any(is.na(column)))
    return(NULL)
  else
    return(column)
})
Output
# A tibble: 5,076 x 11
date ACU ACY AE AEF AIM AIRI AMS APT ARMP ASXC
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2001-01-02 2.75 4.75 14.4 8.44 2376 250 2.5 1.06 490000 179.
2 2001-01-03 2.75 4.5 14.5 9 2409 250 2.5 1.12 472500 193.
3 2001-01-04 2.75 4.5 14.1 8.88 2508 250 2.5 1.06 542500 301.
4 2001-01-05 2.38 4.5 14.1 8.88 2475 250 2.25 1.12 586250 301.
5 2001-01-08 2.56 4.75 14.3 8.75 2376 250 2.38 1.06 638750 276.
6 2001-01-09 2.56 4.75 14.3 8.88 2409 250 2.38 1.06 568750 264.
7 2001-01-10 2.56 5.5 14.5 8.69 2310 300 2.12 1.12 586250 274.
8 2001-01-11 2.69 5.25 14.4 8.69 2310 300 2.25 1.19 564375 333.
9 2001-01-12 2.75 4.81 14.6 8.75 2541 275 2 1.38 564375 370.
10 2001-01-16 2.75 4.88 14.9 8.94 2772 300 2.12 1.62 595000 358.
# … with 5,066 more rows
Using Filter :
library(data.table)
Filter(function(x) all(!is.na(x)), fread('example.csv'))
# date ACU ACY AE AEF AIM AIRI AMS APT
# 1: 2001-01-02 2.75 4.75 14.4 8.44 2376.00 250.00 2.50 1.06
# 2: 2001-01-03 2.75 4.50 14.5 9.00 2409.00 250.00 2.50 1.12
# 3: 2001-01-04 2.75 4.50 14.1 8.88 2508.00 250.00 2.50 1.06
# 4: 2001-01-05 2.38 4.50 14.1 8.88 2475.00 250.00 2.25 1.12
# 5: 2001-01-08 2.56 4.75 14.3 8.75 2376.00 250.00 2.38 1.06
# ---
#5072: 2021-03-02 36.95 10.59 28.1 8.77 2.34 1.61 2.48 14.33
#5073: 2021-03-03 38.40 10.00 30.1 8.78 2.26 1.57 2.47 12.92
#5074: 2021-03-04 37.90 8.03 30.8 8.63 2.09 1.44 2.27 12.44
#5075: 2021-03-05 35.68 8.13 31.5 8.70 2.05 1.48 2.35 12.45
#5076: 2021-03-08 37.87 8.22 31.9 8.59 2.01 1.52 2.47 12.15
# ARMP ASXC
# 1: 4.90e+05 178.75
# 2: 4.72e+05 192.97
# 3: 5.42e+05 300.62
# 4: 5.86e+05 300.62
# 5: 6.39e+05 276.25
# ---
#5072: 5.67e+00 3.92
#5073: 5.58e+00 4.54
#5074: 5.15e+00 4.08
#5075: 4.49e+00 3.81
#5076: 4.73e+00 4.15

How to match IDs across 2 data frames and run operations in an R loop?

I have 2 data frames, the sampling ("samp") and the coordinates ("coor").
The "samp" data frame:
Plot X Y H L
1 6.4 0.6 3.654 0.023
1 19.1 9.3 4.998 0.023
1 2.4 4.2 5.568 0.024
1 16.1 16.7 5.32 0.074
1 10.8 15.8 6.58 0.026
1 1 16 4.968 0.023
1 9.4 12.4 6.804 0.078
2 3.6 0.4 4.3 0.038
3 12.2 19.9 7.29 0.028
3 2 18.2 7.752 0.028
3 6.5 19.9 7.2 0.028
3 3.7 13.8 5.88 0.042
3 4.9 10.3 9.234 0.061
3 3.7 13.8 5.88 0.042
3 4.9 10.3 9.234 0.061
4 16.3 2.4 5.18 0.02
4 15.7 9.8 10.92 0.096
4 6 12.6 6.96 0.16
5 19.4 16.4 8.2 0.092
10 4.8 5.16 7.38 1.08
11 14.7 16.2 16.44 0.89
11 19 19 10.2 0.047
12 10.8 2.7 19.227 1.2
14 0.6 6.4 12.792 0.108
14 4.6 1.9 12.3 0.122
15 12.2 18 9.6 0.034
16 13 18.3 4.55 0.021
The "coor" data frame:
Plot X Y
1 356154.007 501363.546
2 356154.797 501345.977
3 356174.697 501336.114
4 356226.469 501336.816
5 356255.24 501352.714
10 356529.313 501292.4
11 356334.895 501320.725
12 356593.271 501255.297
14 356350.029 501314.385
15 356358.81 501285.955
16 356637.29 501227.297
17 356652.157 501263.238
18 356691.68 501262.403
19 356755.386 501242.501
20 356813.735 501210.59
22 356980.118 501178.974
23 357044.996 501168.859
24 357133.365 501158.418
25 357146.781 501158.866
26 357172.485 501161.646
I wish to use a for loop to register the "samp" data frame with the GPS coordinates from the "coor" data frame -- e.g. the "new_x" variable is the sum of "X" from "samp" and "X" from "coor" for rows with the same "Plot" ID.
This is what I tried, but it is not working.
for (i in 1:nrow(samp)) {
  if (samp$Plot[i] == coor$Plot[i]) {
    samp$new_x[i] <- coor$X[i] + samp$X[i]
  } else {
    samp$new_x[i] <- samp$X[i]
  }
}
The final output I wish to have is the "samp" data frame with a proper coordinate variable ("new_x") added. It should look like this:
Plot X Y H L new_x
1 6.4 0.6 3.654 0.023 356160.407
1 19.1 9.3 4.998 0.023 356173.107
1 2.4 4.2 5.568 0.024 356156.407
1 16.1 16.7 5.32 0.074 356170.107
1 10.8 15.8 6.58 0.026 356164.807
1 1 16 4.968 0.023 356155.007
1 9.4 12.4 6.804 0.078 356163.407
2 3.6 0.4 4.3 0.038 356158.397
3 12.2 19.9 7.29 0.028 356186.897
3 2 18.2 7.752 0.028 356176.697
3 6.5 19.9 7.2 0.028 356181.197
3 3.7 13.8 5.88 0.042 356178.397
3 4.9 10.3 9.234 0.061 356179.597
3 3.7 13.8 5.88 0.042 356178.397
3 4.9 10.3 9.234 0.061 356179.597
4 16.3 2.4 5.18 0.02 356242.769
4 15.7 9.8 10.92 0.096 356242.169
4 6 12.6 6.96 0.16 356232.469
5 19.4 16.4 8.2 0.092 356274.64
10 4.8 5.16 7.38 1.08 356534.113
11 14.7 16.2 16.44 0.89 356349.595
11 19 19 10.2 0.047 356353.895
Any suggestion will be appreciated. Thanks.
You could merge the two datasets and create a new column by summing the X.x and X.y variables.
res <- transform(merge(samp, coor, by = 'Plot'), new_x = X.x + X.y)[, -c(6:7)]
colnames(res) <- colnames(out)  # `out` is the expected result shown above
all.equal(res[1:22, ], out, check.attributes = FALSE)
#[1] TRUE
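For reference, a dplyr sketch of the same merge-and-sum (assuming the samp and coor data frames shown above):
library(dplyr)
samp %>%
  left_join(coor, by = "Plot", suffix = c("", ".coor")) %>%   # keeps samp's X and Y names
  mutate(new_x = X + X.coor) %>%
  select(-X.coor, -Y.coor)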
