How can I extract specific data points from a wide-formatted text file in R?

I have datasheets with multiple measurements that look like the following:
FILE DATE TIME LOC QUAD LAI SEL DIFN MTA SEM SMP
20 20210805 08:38:32 H 1161 2.80 0.68 0.145 49. 8. 4
ANGLES 7.000 23.00 38.00 53.00 68.00
CNTCT# 1.969 1.517 0.981 1.579 1.386
STDDEV 1.632 1.051 0.596 0.904 0.379
DISTS 1.008 1.087 1.270 1.662 2.670
GAPS 0.137 0.192 0.288 0.073 0.025
A 1 08:38:40 31.66 33.63 34.59 39.13 55.86
1 2 08:38:40 -5.0e-006
B 3 08:38:48 25.74 20.71 15.03 2.584 1.716
B 4 08:38:55 0.344 1.107 2.730 0.285 0.265
B 5 08:39:02 3.211 5.105 13.01 4.828 1.943
B 6 08:39:10 8.423 22.91 48.77 16.34 3.572
B 7 08:39:19 12.58 14.90 18.34 18.26 4.125
I would like to read the entire datasheet and extract only the values for 'QUAD' and 'LAI'. For example, for the data above I would extract only a QUAD of 1161 and an LAI of 2.80.
In the past the datasheets were formatted as long data, and I was able to use the following code:
library(stringr)
QUAD <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^QUAD).*$")))
LAI <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^LAI).*$")))
data_extract <- data.frame(
QUAD = QUAD[!is.na(QUAD)],
LAI = LAI[!is.na(LAI)]
)
data_extract
Unfortunately, this does not work because of the wide formatting in the current datasheet. Any help would be hugely appreciated. Thanks in advance for your time.
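One possible approach (a sketch, not tested against the full files; the file name "datasheet.txt" is a placeholder): since QUAD and LAI appear in the header line starting with FILE and their values sit on the line directly below it, both lines can be split on whitespace and matched by position.
library(stringr)
data <- readLines("datasheet.txt")
header_idx <- which(str_detect(data, "^FILE\\b"))  # header line(s) of each entry
value_idx <- header_idx + 1                        # the values sit on the next line
extract_row <- function(h, v) {
  keys <- str_split(str_squish(data[h]), " ")[[1]]
  vals <- str_split(str_squish(data[v]), " ")[[1]]
  data.frame(QUAD = as.numeric(vals[keys == "QUAD"]),
             LAI = as.numeric(vals[keys == "LAI"]))
}
data_extract <- do.call(rbind, Map(extract_row, header_idx, value_idx))
data_extract
#   QUAD LAI
# 1 1161 2.8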

Related

Calculate weighted average of four nearest grid points

I have a data frame that looks like this:
Teff logg M_div_H U B V R I J H K L Lprime M
1: 2000 4.00 -0.1 -13.443 -11.390 -7.895 -4.464 -1.831 1.666 3.511 2.701 4.345 4.765 5.680
2: 2000 4.50 -0.1 -13.402 -11.416 -7.896 -4.454 -1.794 1.664 3.503 2.728 4.352 4.772 5.687
3: 2000 5.00 -0.1 -13.358 -11.428 -7.888 -4.431 -1.738 1.664 3.488 2.753 4.361 4.779 5.685
4: 2000 5.50 -0.1 -13.220 -11.079 -7.377 -4.136 -1.483 1.656 3.418 2.759 4.355 4.753 5.638
5: 2200 3.50 -0.1 -11.866 -9.557 -6.378 -3.612 -1.185 1.892 3.294 2.608 3.929 4.289 4.842
6: 2200 4.50 -0.1 -11.845 -9.643 -6.348 -3.589 -1.132 1.874 3.310 2.648 3.947 4.305 4.939
7: 2200 5.50 -0.1 -11.655 -9.615 -6.279 -3.508 -0.997 1.886 3.279 2.709 3.964 4.314 4.928
8: 2500 -1.02 -0.1 -7.410 -7.624 -6.204 -3.854 -1.533 1.884 3.320 2.873 3.598 3.964 5.579
9: 2500 -0.70 -0.1 -7.008 -7.222 -5.818 -3.618 -1.338 1.905 3.266 2.868 3.502 3.877 5.417
10: 2500 -0.29 -0.1 -6.526 -6.740 -5.357 -3.421 -1.215 1.927 3.216 2.870 3.396 3.781 5.247
11: 2500 5.50 -0.1 -9.518 -7.575 -5.010 -2.756 -0.511 1.959 3.057 2.642 3.472 3.756 4.265
12: 2800 -1.02 -0.1 -7.479 -7.386 -5.941 -3.716 -1.432 1.824 3.259 2.812 3.567 3.784 5.333
13: 2800 -0.70 -0.1 -7.125 -7.032 -5.596 -3.477 -1.231 1.822 3.218 2.813 3.479 3.717 5.229
14: 2800 -0.29 -0.1 -6.673 -6.580 -5.154 -3.166 -0.974 1.816 3.163 2.812 3.364 3.628 5.093
15: 2800 3.50 -0.1 -8.113 -6.258 -4.103 -2.209 -0.360 1.957 2.872 2.517 3.219 3.427 4.026
16: 2800 4.00 -0.1 -7.992 -6.099 -3.937 -2.076 -0.230 1.907 2.869 2.480 3.227 3.424 4.075
17: 2800 4.50 -0.1 -7.815 -6.051 -4.067 -2.176 -0.228 1.920 2.877 2.503 3.212 3.428 4.000
18: 2800 5.00 -0.1 -7.746 -6.018 -4.031 -2.144 -0.176 1.907 2.883 2.512 3.216 3.430 4.023
19: 3000 -0.70 -0.1 -7.396 -6.995 -5.605 -3.554 -1.293 1.787 3.172 2.759 3.474 3.588 5.052
20: 3000 -0.29 -0.1 -6.966 -6.565 -5.179 -3.249 -1.035 1.772 3.136 2.764 3.388 3.533 4.978
Notice, for example, how every V value has a unique Teff, logg combination.
Now, let's say I have two values:
input_Teff = 2300
input_log_g = 3.86
If we imagine all the (Teff, logg) combinations as grid points, for a given input point, I would like to find the four points closest to that input point.
point 1 (Teff1, logg1)
point 2 (Teff2, logg2)
point 3 (Teff3, logg3)
point 4 (Teff4, logg4)
then, calculate "distances" between my input point and those points through Pythagoras (in this example, four distances),
(Teff1, logg_1) -> d1
(Teff2, logg_2) -> d2
(Teff3, logg_3) -> d3
(Teff4, logg_4) -> d4 # NOTE: Teff and logg are different scales
next, get in this example, the V values in the rows of these points,
(Teff1, logg_1) -> V1
(Teff2, logg_2) -> V2
(Teff3, logg_3) -> V3
(Teff4, logg_4) -> V4
And finally do a weighted average calculation
V = (d1V1+d2V2+d3V3+d4V4)/(d1+d2+d3+d4)
What would be a good way to do this in R?
Edit: https://www.dropbox.com/s/prbceabxmd25etx/lcb98cor.dat?dl=0
Let's say your data frame is named teff_df, and let's use a Pythagoras-style calculation for the distances (update: find the nearest distances between the input and every (Teff, logg) grid point). Note: I use min-max normalization to rescale the Teff and logg values first:
# rescale a numeric vector to the 0-1 range
min_max_norm <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
# Pythagoras-style value: a^2 + b^2
phytagoras <- function(a, b) {
  return(a**2 + b**2)
}
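For example, rescaling a small vector shows what min_max_norm() does:
min_max_norm(c(2000, 2500, 3000))
# [1] 0.0 0.5 1.0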
First, rescale the Teff and logg values, including the input point, to a 0-1 range:
teff_with_input <- c(teff_df$Teff, 2300)  # append the input Teff
logg_with_input <- c(teff_df$logg, 3.86)  # append the input logg
teff_rescale <- min_max_norm(teff_with_input)
logg_rescale <- min_max_norm(logg_with_input)
teff_input_rescale <- teff_rescale[length(teff_rescale)]  # rescaled input point
logg_input_rescale <- logg_rescale[length(logg_rescale)]
teff_rescale <- teff_rescale[-length(teff_rescale)]  # drop the input again
logg_rescale <- logg_rescale[-length(logg_rescale)]
Second, calculate the difference between the input and every grid point, both already transformed with phytagoras():
input_distance <- phytagoras(teff_input_rescale, logg_input_rescale)
my_phyta_dist <- phytagoras(teff_rescale, logg_rescale)
my_v_point <- teff_df[, which(colnames(teff_df) == "V")]
diff_input <- my_phyta_dist - input_distance
teff_df[order(abs(diff_input))[1:4], ]  # displays which rows are nearest by this distance measure
Third, extract the 4 nearest points, i.e. the rows with the smallest differences between the input and the (Teff, logg) combination grid:
nearest_4rows <- as.numeric(rownames(teff_df[order(abs(diff_input))[1:4],]))
nearest_4phyta_dist <- my_phyta_dist[nearest_4rows]
nearest_4v_point <- my_v_point[nearest_4rows]
Finally, apply the weighted-average formula
V = (d1V1+d2V2+d3V3+d4V4)/(d1+d2+d3+d4)
product_dist <- nearest_4phyta_dist * nearest_4v_point
weighted_average <- sum(product_dist) / sum(nearest_4phyta_dist)
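An alternative sketch, not part of the answer above: reuse the rescaled vectors already computed, take the straight Euclidean distance between the rescaled input and every rescaled grid point, and plug the four nearest rows into the question's weighted-average formula. Note the formula as written weights by distance rather than inverse distance.
d <- sqrt((teff_rescale - teff_input_rescale)^2 +
          (logg_rescale - logg_input_rescale)^2)  # point-to-point distances
nearest <- order(d)[1:4]                          # rows of the 4 nearest grid points
V_weighted <- sum(d[nearest] * teff_df$V[nearest]) / sum(d[nearest])
V_weighted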

How could I use R to pull a few select lines out of a large text file?

I am fairly new to Stack Overflow but did not find this by searching. Please let me know if this question should not be asked here.
I have a very large text file. It has 16 entries and each entry looks like this:
AI_File 10
Version
Date 20200708 08:18:41
Prompt1 LOC
Resp1 H****
Prompt2 QUAD
Resp2 1012
TransComp c-p-s
Model Horizontal
### Computed Results
LAI 4.36
SEL 0.47
ACF 0.879
DIFN 0.031
MTA 40.
SEM 1.
SMP 5
### Ring Summary
MASK 1 1 1 1 1
ANGLES 7.000 23.00 38.00 53.00 68.00
AVGTRANS 0.038 0.044 0.055 0.054 0.030
ACFS 0.916 0.959 0.856 0.844 0.872
CNTCT# 3.539 2.992 2.666 2.076 1.499
STDDEV 0.826 0.523 0.816 0.730 0.354
DISTS 1.008 1.087 1.270 1.662 2.670
GAPS 0.028 0.039 0.034 0.032 0.018
### Contributing Sensors
### Observations
A 1 20200708 08:19:12 x 31.42 38.30 40.61 48.69 60.28
L 2 20200708 08:19:12 1 5.0e-006
B 3 20200708 08:19:21 x 2.279 2.103 1.408 5.027 1.084
B 4 20200708 08:19:31 x 1.054 0.528 0.344 0.400 0.379
B 5 20200708 08:19:39 x 0.446 1.255 2.948 3.828 1.202
B 6 20200708 08:19:47 x 1.937 2.613 5.909 3.665 5.964
B 7 20200708 08:19:55 x 0.265 1.957 0.580 0.311 0.551
Almost all of this is junk information, and I am looking to run some code for the whole file that will only give me the lines for "Resp2" and "LAI" for all 16 of the entries. Is a task like this doable in R? If so, how would I do it?
Thanks very much for any help and please let me know if there's any more information I can give to clear anything up.
I've saved your file as a text file and read in the lines. Then you can use regex to extract the desired rows. However, I feel that my approach is rather clumsy; I bet there are more elegant ways (maybe also with (unix) command-line tools).
data <- readLines("testfile.txt")
library(stringr)
resp2 <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^Resp2).*$")))
lai <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^LAI).*$")))
data_extract <- data.frame(
resp2 = resp2[!is.na(resp2)],
lai = lai[!is.na(lai)]
)
data_extract
resp2 lai
1 1012 4.36
A solution based on the tidyverse can look as follows.
library(dplyr)
library(vroom)
library(stringr)
library(tibble)
library(tidyr)
vroom_lines('data') %>%
enframe() %>%
filter(str_detect(value, 'Resp2|LAI')) %>%
transmute(value = str_squish(value)) %>%
separate(value, into = c('name', 'value'), sep = ' ')
# name value
# <chr> <chr>
# 1 Resp2 1012
# 2 LAI 4.36

CAPdiscrim error: numeric 'envir' arg not of length one

I'm trying to apply a CAP function to chemical data collected in different years.
I have a chemical data set:
head(ISPA_data)
SrCa MgCa MnCa RbCa CuCa ZnCa BaCa PbCa NaCa LiCa CoCa NiCa
1 5178 25.101 9.334 0.166 4.869 8.379 34.846 0.194 5464 0.313 2.510 25.181
2 6017 22.922 7.185 0.166 4.685 8.720 24.659 0.154 4600 0.300 2.475 25.060
3 5628 26.232 6.248 0.179 4.628 10.157 23.942 0.166 5378 0.300 2.529 25.252
4 4769 35.598 7.683 0.131 4.370 8.735 50.068 0.180 5938 0.568 2.159 21.645
5 5330 28.284 6.828 0.130 5.370 12.742 34.257 0.220 5614 0.397 2.275 23.852
6 5786 24.603 4.797 0.156 5.317 13.331 66.896 0.117 5001 0.423 2.298 24.361
and an environmental dataset:
head(ISPA.env)
Year OM Code Location
<dbl> <chr> <chr> <chr>
1 1975 0.04349 CSP75_25 CSP
2 1975 0.0433 CSP75_28 CSP
3 1975 0.04553 CSP75_31 CSP
4 1975 0.0439 CSP75_33 CSP
5 1975 0.02998 CSP75_37 CSP
6 1975 0.0246 CSP75_39 CSP
When performing CAPdiscrim,
Ordination.model1 <- CAPdiscrim(ISPA_data~Year,
ISPA.env,
dist="euclidean",
axes=4,
m=0,
add=FALSE,
permutations=999)
this error occurs:
Error in eval(predvars, data, env) :
numeric 'envir' arg not of length one
In addition: Warning message:
In cmdscale(distmatrix, k = nrow(x) - 1, eig = T, add = add) :
only 13 of the first 19 eigenvalues are > 0
All data has the same length.
Can anyone help me? Thanks!
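One hedged first thing to check (a guess, not a confirmed fix): CAPdiscrim discriminates between groups, so the right-hand side of the formula is normally a grouping factor, while Year above is numeric (<dbl>). Converting it before the call may be worth a try:
ISPA.env$Year <- factor(ISPA.env$Year)  # grouping variable as a factor (assumption, not a confirmed fix)
Ordination.model1 <- CAPdiscrim(ISPA_data ~ Year, data = ISPA.env,
                                dist = "euclidean", axes = 4, m = 0,
                                add = FALSE, permutations = 999)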

Issues with merging multiple dataframes (with varying rows) using a common column

Suppose I have many data frames that have varying numbers of rows but share a common Date column, e.g.:
DF1:
Date Index Change
05-04-17 29911.55 0
03-04-17 29910.22 0.0098
31-03-17 29620.5 -0.0009
30-03-17 29647.42 0.0039
29-03-17 29531.43 0.0041
28-03-17 29409.52 0.0059
27-03-17 29237.15 -0.0063
24-03-17 29421.4 0.003
And
DF2:
Date NG NG_Change
05-04-17 213.8 0.0047
04-04-17 212.8 0.0421
03-04-17 204.2 -0.0078
31-03-17 205.8 -0.0068
30-03-17 207.2 -0.0166
29-03-17 210.7 0.0483
28-03-17 201 0.005
27-03-17 200 -0.0015
24-03-17 200.3 0.0137
And another one:
DF3:
Date TI_Price TI_Change
05-04-17 51.39 0.0071
04-04-17 51.03 0.0157
03-04-17 50.24 -0.0071
31-03-17 50.6 0.005
30-03-17 50.35 0.017
29-03-17 49.51 0.0236
28-03-17 48.37 0.0134
I want to combine them, using the Date column as the common variable, so that the final data frame contains only those rows whose dates are common to all of them. Such as:
Date Index Change NG NG_Change TI_Price TI_Change
05-04-17 29911.55 0 213.8 0.0047 51.39 0.0071
03-04-17 29910.22 0.0098 204.2 -0.0078 50.24 -0.0071
31-03-17 29620.5 -0.0009 205.8 -0.0068 50.6 0.005
30-03-17 29647.42 0.0039 207.2 -0.0166 50.35 0.017
29-03-17 29531.43 0.0041 210.7 0.0483 49.51 0.0236
28-03-17 29409.52 0.0059 201 0.005 48.37 0.0134
I am just wondering if there is any method to merge them in one go, rather than using the merge() function, which takes DF1 and DF2 at a time, merges them, and then merges the result with DF3.
What I used and tweaked around (to no avail):
myfulldata = merge(DF1, DF2, all.x=T)
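One common approach (a sketch, assuming all the data frames share the Date column) is to put them in a list and fold merge() over it with Reduce(); merge() defaults to an inner join, so only dates present in every data frame survive:
df_list <- list(DF1, DF2, DF3)
myfulldata <- Reduce(function(x, y) merge(x, y, by = "Date"), df_list)
myfulldata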

Add.times in R relsurv

I am new to coding and to R. I am currently working with the relsurv package, and I would like to calculate the relative survival at certain time points.
I am using the following to assess RS at five years:
rcurve2 <- rs.surv(Surv(time_days17 / 365.241, event_17) ~ 1 +
                     ratetable(age = age_diagnosis * 365.241, sex = sex,
                               year = year_diagnosis_days),
                   data = survdata, ratetable = swepop,
                   method = "ederer1", conf.int = 0.95, type = "kaplan-meier",
                   add.times = 5 * 365.241)
summary(rcurve2)
However, I get the same result in my summary output regardless of what number I put after add.times, i.e. for all event/censoring points (see below):
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0.205 177 1 0.9944 0.00562 0.9834 1.005
0.627 176 1 0.9888 0.00792 0.9734 1.004
0.742 175 1 0.9831 0.00968 0.9644 1.002
0.827 174 1 0.9775 0.01114 0.9559 1.000
0.849 173 1 0.9718 0.01242 0.9478 0.996
0.947 172 1 0.9662 0.01356 0.9400 0.993
...cont.
I am clearly not getting it right! Would be grateful for your help!
A very good question!
When adding "imaginary" times using add.times, they are automatically censored and won't show up with the summary() function. To see your added times, set censored = TRUE:
summary(rcurve2, censored = TRUE)
You should now find your added time in the list that follows.
EXAMPLE
Using built in data with the relsurv package
data(slopop)
data(rdata)
#note the last argument add.times=1000
rcurve2 <- rs.surv(Surv(time,cens)~sex+ratetable(age=age*365.241, sex=sex,
year=year), ratetable=slopop, data=rdata, add.times = 1000)
When using summary(rcurve2), the time 1000 won't show up:
>summary(rcurve2)
[...]
973 200 1 0.792 0.03081 0.734 0.855
994 199 1 0.790 0.03103 0.732 0.854
1002 198 1 0.783 0.03183 0.723 0.848
[...]
BUT using summary(rcurve2, censored=TRUE) it will!
>summary(rcurve2, censored=TRUE)
[...]
973 200 1 0.792 0.03081 0.734 0.855
994 199 1 0.790 0.03103 0.732 0.854
1000 198 0 0.791 0.03106 0.732 0.854
1002 198 1 0.783 0.03183 0.723 0.848
[...]
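If the aim is simply to read off the relative survival at exactly five years, and assuming the object returned by rs.surv() can be summarised like an ordinary survfit object, the times argument of summary() may also help:
summary(rcurve2, times = 5 * 365.241)  # estimate at five years (assumption as noted above)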
