I have datasheets with multiple measurements that look like the following:
FILE DATE TIME LOC QUAD LAI SEL DIFN MTA SEM SMP
20 20210805 08:38:32 H 1161 2.80 0.68 0.145 49. 8. 4
ANGLES 7.000 23.00 38.00 53.00 68.00
CNTCT# 1.969 1.517 0.981 1.579 1.386
STDDEV 1.632 1.051 0.596 0.904 0.379
DISTS 1.008 1.087 1.270 1.662 2.670
GAPS 0.137 0.192 0.288 0.073 0.025
A 1 08:38:40 31.66 33.63 34.59 39.13 55.86
1 2 08:38:40 -5.0e-006
B 3 08:38:48 25.74 20.71 15.03 2.584 1.716
B 4 08:38:55 0.344 1.107 2.730 0.285 0.265
B 5 08:39:02 3.211 5.105 13.01 4.828 1.943
B 6 08:39:10 8.423 22.91 48.77 16.34 3.572
B 7 08:39:19 12.58 14.90 18.34 18.26 4.125
I would like to read the entire datasheet and extract the values for 'QUAD' and 'LAI' only. For example, for the data above I would only be extracting a QUAD of 1161 and an LAI of 2.80.
In the past the datasheets were formatted as long data, and I was able to use the following code:
library(stringr)
QUAD <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^QUAD).*$")))
LAI <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^LAI).*$")))
data_extract <- data.frame(
QUAD = QUAD[!is.na(QUAD)],
LAI = LAI[!is.na(LAI)]
)
data_extract
Unfortunately, this does not work because of the wide formatting in the current datasheet. Any help would be hugely appreciated. Thanks in advance for your time.
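In the wide format, QUAD and LAI are columns of the first data row rather than row labels, so instead of line-anchored regexes you can match the value line against the header line position by position. A base-R sketch, assuming each entry's header line starts with FILE and its values sit on the very next line (for multiple entries, loop over header_idx):

```r
# Toy stand-in for readLines("datasheet.txt"): the header row plus its value row
data <- c(
  "FILE DATE TIME LOC QUAD LAI SEL DIFN MTA SEM SMP",
  "20 20210805 08:38:32 H 1161 2.80 0.68 0.145 49. 8. 4"
)

# collapse runs of whitespace and split a line into its fields
split_row <- function(line) strsplit(trimws(gsub("\\s+", " ", line)), " ")[[1]]

header_idx <- grep("^FILE\\b", data)        # locate the header row of each entry
header <- split_row(data[header_idx[1]])
values <- split_row(data[header_idx[1] + 1])

data_extract <- data.frame(
  QUAD = as.numeric(values[header == "QUAD"]),
  LAI  = as.numeric(values[header == "LAI"])
)
data_extract
```

Matching by column name rather than position makes the extraction robust if extra columns are added to the header later.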
I have a data frame that looks like this:
Teff logg M_div_H U B V R I J H K L Lprime M
1: 2000 4.00 -0.1 -13.443 -11.390 -7.895 -4.464 -1.831 1.666 3.511 2.701 4.345 4.765 5.680
2: 2000 4.50 -0.1 -13.402 -11.416 -7.896 -4.454 -1.794 1.664 3.503 2.728 4.352 4.772 5.687
3: 2000 5.00 -0.1 -13.358 -11.428 -7.888 -4.431 -1.738 1.664 3.488 2.753 4.361 4.779 5.685
4: 2000 5.50 -0.1 -13.220 -11.079 -7.377 -4.136 -1.483 1.656 3.418 2.759 4.355 4.753 5.638
5: 2200 3.50 -0.1 -11.866 -9.557 -6.378 -3.612 -1.185 1.892 3.294 2.608 3.929 4.289 4.842
6: 2200 4.50 -0.1 -11.845 -9.643 -6.348 -3.589 -1.132 1.874 3.310 2.648 3.947 4.305 4.939
7: 2200 5.50 -0.1 -11.655 -9.615 -6.279 -3.508 -0.997 1.886 3.279 2.709 3.964 4.314 4.928
8: 2500 -1.02 -0.1 -7.410 -7.624 -6.204 -3.854 -1.533 1.884 3.320 2.873 3.598 3.964 5.579
9: 2500 -0.70 -0.1 -7.008 -7.222 -5.818 -3.618 -1.338 1.905 3.266 2.868 3.502 3.877 5.417
10: 2500 -0.29 -0.1 -6.526 -6.740 -5.357 -3.421 -1.215 1.927 3.216 2.870 3.396 3.781 5.247
11: 2500 5.50 -0.1 -9.518 -7.575 -5.010 -2.756 -0.511 1.959 3.057 2.642 3.472 3.756 4.265
12: 2800 -1.02 -0.1 -7.479 -7.386 -5.941 -3.716 -1.432 1.824 3.259 2.812 3.567 3.784 5.333
13: 2800 -0.70 -0.1 -7.125 -7.032 -5.596 -3.477 -1.231 1.822 3.218 2.813 3.479 3.717 5.229
14: 2800 -0.29 -0.1 -6.673 -6.580 -5.154 -3.166 -0.974 1.816 3.163 2.812 3.364 3.628 5.093
15: 2800 3.50 -0.1 -8.113 -6.258 -4.103 -2.209 -0.360 1.957 2.872 2.517 3.219 3.427 4.026
16: 2800 4.00 -0.1 -7.992 -6.099 -3.937 -2.076 -0.230 1.907 2.869 2.480 3.227 3.424 4.075
17: 2800 4.50 -0.1 -7.815 -6.051 -4.067 -2.176 -0.228 1.920 2.877 2.503 3.212 3.428 4.000
18: 2800 5.00 -0.1 -7.746 -6.018 -4.031 -2.144 -0.176 1.907 2.883 2.512 3.216 3.430 4.023
19: 3000 -0.70 -0.1 -7.396 -6.995 -5.605 -3.554 -1.293 1.787 3.172 2.759 3.474 3.588 5.052
20: 3000 -0.29 -0.1 -6.966 -6.565 -5.179 -3.249 -1.035 1.772 3.136 2.764 3.388 3.533 4.978
Notice, for example, how every V value has a unique Teff, logg combination.
Now, let's say I have two values:
input_Teff = 2300
input_log_g = 3.86
If we imagine all the (Teff, logg) combinations as grid points, for a given input point, I would like to find the four points closest to that input point.
point 1 (Teff1, logg1)
point 2 (Teff2, logg2)
point 3 (Teff3, logg3)
point 4 (Teff4, logg4)
then, calculate "distances" between my input point and those points through Pythagoras (in this example, four distances),
(Teff1, logg1) -> d1
(Teff2, logg2) -> d2
(Teff3, logg3) -> d3
(Teff4, logg4) -> d4 # NOTE: Teff and logg are on different scales
next, get the V values in the rows of these points,
(Teff1, logg1) -> V1
(Teff2, logg2) -> V2
(Teff3, logg3) -> V3
(Teff4, logg4) -> V4
And finally do a weighted average calculation
V = (d1V1+d2V2+d3V3+d4V4)/(d1+d2+d3+d4)
What would be a good way to do this in R?
Edit: https://www.dropbox.com/s/prbceabxmd25etx/lcb98cor.dat?dl=0
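The steps above can be sketched directly in base R. The data frame below is a hand-copied subset of the table as stand-in data; also note that this sketch weights by inverse distance (1/d), the usual convention so that nearer points count more, whereas the formula in the question weights by d itself (swap the weights if you really want the latter):

```r
# A hand-copied sample of the grid (Teff, logg, V) from the table above
df <- data.frame(
  Teff = c(2000, 2000, 2200, 2200, 2500, 2800),
  logg = c(4.00, 4.50, 3.50, 4.50, 5.50, 3.50),
  V    = c(-7.895, -7.896, -6.378, -6.348, -5.010, -4.103)
)
input_Teff <- 2300
input_logg <- 3.86

# Rescale both axes to [0, 1] so Teff (thousands) and logg (units) are comparable
rng01 <- function(x) (x - min(x)) / (max(x) - min(x))
t_sc <- rng01(c(df$Teff, input_Teff))
g_sc <- rng01(c(df$logg, input_logg))
n <- nrow(df)

# Euclidean distance from the input point (last element) to every grid point
d <- sqrt((t_sc[1:n] - t_sc[n + 1])^2 + (g_sc[1:n] - g_sc[n + 1])^2)

nearest <- order(d)[1:4]        # indices of the four closest grid points
w <- 1 / d[nearest]             # inverse-distance weights: nearer points count more
V_interp <- sum(w * df$V[nearest]) / sum(w)
```

For a regular grid, bilinear interpolation over the four surrounding corners (rather than the four nearest points) would be the more principled choice, but the nearest-point scheme above matches the question as posed.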
Let's say your data frame is named teff_df, and let's assume this is the way to calculate Pythagoras distances (update: find the nearest distance between the input and every (Teff, logg) grid point). Note: I use min-max normalization to rescale the Teff and logg values:
min_max_norm <- function(x) {
  # rescale a vector linearly to the range [0, 1]
  (x - min(x)) / (max(x) - min(x))
}
phytagoras <- function(a, b) {
  # squared distance from the origin: a^2 + b^2 (no square root taken)
  a^2 + b^2
}
First, rescale the Teff and logg values of the input together with all available grid points into the range 0 to 1:
teff_with_input <- c(teff_df$Teff, 2300)
logg_with_input <- c(teff_df$logg, 3.86)
teff_rescale <- min_max_norm(teff_with_input)
logg_rescale <- min_max_norm(logg_with_input)
teff_input_rescale <- teff_rescale[length(teff_rescale)]
logg_input_rescale <- logg_rescale[length(logg_rescale)]
teff_rescale <- teff_rescale[-length(teff_rescale)]
logg_rescale <- logg_rescale[-length(logg_rescale)]
Second, calculate the differences between the input and all grid points, each transformed to its phytagoras value:
input_distance <- phytagoras(teff_input_rescale, logg_input_rescale)
my_phyta_dist <- phytagoras(teff_rescale, logg_rescale)
my_v_point <- teff_df[,which(colnames(teff_df)=="V")]
diff_input <- my_phyta_dist - input_distance
teff_df[order(abs(diff_input))[1:4],] # displays the four rows nearest by this distance calculation
Third, extract the 4 nearest points, i.e. the rows of the (Teff, logg) grid whose phytagoras values differ least from the input's:
nearest_4rows <- as.numeric(rownames(teff_df[order(abs(diff_input))[1:4],]))
nearest_4phyta_dist <- my_phyta_dist[nearest_4rows]
nearest_4v_point <- my_v_point[nearest_4rows]
and then finally, apply the weighted-average formula
V = (d1V1+d2V2+d3V3+d4V4)/(d1+d2+d3+d4)
product_dist <- nearest_4phyta_dist * nearest_4v_point
weighted_average <- sum(product_dist) / sum(nearest_4phyta_dist)
I am fairly new to Stack Overflow and did not find this in the search engine. Please let me know if this question should not be asked here.
I have a very large text file. It has 16 entries and each entry looks like this:
AI_File 10
Version
Date 20200708 08:18:41
Prompt1 LOC
Resp1 H****
Prompt2 QUAD
Resp2 1012
TransComp c-p-s
Model Horizontal
### Computed Results
LAI 4.36
SEL 0.47
ACF 0.879
DIFN 0.031
MTA 40.
SEM 1.
SMP 5
### Ring Summary
MASK 1 1 1 1 1
ANGLES 7.000 23.00 38.00 53.00 68.00
AVGTRANS 0.038 0.044 0.055 0.054 0.030
ACFS 0.916 0.959 0.856 0.844 0.872
CNTCT# 3.539 2.992 2.666 2.076 1.499
STDDEV 0.826 0.523 0.816 0.730 0.354
DISTS 1.008 1.087 1.270 1.662 2.670
GAPS 0.028 0.039 0.034 0.032 0.018
### Contributing Sensors
### Observations
A 1 20200708 08:19:12 x 31.42 38.30 40.61 48.69 60.28
L 2 20200708 08:19:12 1 5.0e-006
B 3 20200708 08:19:21 x 2.279 2.103 1.408 5.027 1.084
B 4 20200708 08:19:31 x 1.054 0.528 0.344 0.400 0.379
B 5 20200708 08:19:39 x 0.446 1.255 2.948 3.828 1.202
B 6 20200708 08:19:47 x 1.937 2.613 5.909 3.665 5.964
B 7 20200708 08:19:55 x 0.265 1.957 0.580 0.311 0.551
Almost all of this is junk information, and I am looking to run some code for the whole file that will only give me the lines for "Resp2" and "LAI" for all 16 of the entries. Is a task like this doable in R? If so, how would I do it?
Thanks very much for any help and please let me know if there's any more information I can give to clear anything up.
I've saved your data as a text file and read in the lines. Then you can use a regex to extract the desired rows. However, I feel that my approach is rather clumsy; I bet there are more elegant ways (maybe also with (unix) command-line tools).
data <- readLines("testfile.txt")
library(stringr)
resp2 <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^Resp2).*$")))
lai <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^LAI).*$")))
data_extract <- data.frame(
resp2 = resp2[!is.na(resp2)],
lai = lai[!is.na(lai)]
)
data_extract
resp2 lai
1 1012 4.36
A solution based in the tidyverse can look as follows.
library(dplyr)
library(vroom)
library(stringr)
library(tibble)
library(tidyr)
vroom_lines('data') %>%
enframe() %>%
filter(str_detect(value, 'Resp2|LAI')) %>%
transmute(value = str_squish(value)) %>%
separate(value, into = c('name', 'value'), sep = ' ')
# name value
# <chr> <chr>
# 1 Resp2 1012
# 2 LAI 4.36
I'm trying to apply a CAP function to chemical data collected in different years.
I have a data archive:
head(ISPA_data)
SrCa MgCa MnCa RbCa CuCa ZnCa BaCa PbCa NaCa LiCa CoCa NiCa
1 5178 25.101 9.334 0.166 4.869 8.379 34.846 0.194 5464 0.313 2.510 25.181
2 6017 22.922 7.185 0.166 4.685 8.720 24.659 0.154 4600 0.300 2.475 25.060
3 5628 26.232 6.248 0.179 4.628 10.157 23.942 0.166 5378 0.300 2.529 25.252
4 4769 35.598 7.683 0.131 4.370 8.735 50.068 0.180 5938 0.568 2.159 21.645
5 5330 28.284 6.828 0.130 5.370 12.742 34.257 0.220 5614 0.397 2.275 23.852
6 5786 24.603 4.797 0.156 5.317 13.331 66.896 0.117 5001 0.423 2.298 24.361
and an environmental dataset:
head(ISPA.env)
Year OM Code Location
<dbl> <chr> <chr> <chr>
1 1975 0.04349 CSP75_25 CSP
2 1975 0.0433 CSP75_28 CSP
3 1975 0.04553 CSP75_31 CSP
4 1975 0.0439 CSP75_33 CSP
5 1975 0.02998 CSP75_37 CSP
6 1975 0.0246 CSP75_39 CSP
When performing CAPdiscrim,
Ordination.model1 <- CAPdiscrim(ISPA_data~Year,
ISPA.env,
dist="euclidean",
axes=4,
m=0,
add=FALSE,
permutations=999)
this Error occurs:
Error in eval(predvars, data, env) :
numeric 'envir' arg not of length one
Besides: Warning message:
In cmdscale(distmatrix, k = nrow(x) - 1, eig = T, add = add) :
only 13 of the first 19 eigenvalues are > 0
All data has the same length.
Can anyone help me? Thanks!
Suppose I have many data frames with varying numbers of rows but a Date column common to all of them, e.g.:
DF1:
Date Index Change
05-04-17 29911.55 0
03-04-17 29910.22 0.0098
31-03-17 29620.5 -0.0009
30-03-17 29647.42 0.0039
29-03-17 29531.43 0.0041
28-03-17 29409.52 0.0059
27-03-17 29237.15 -0.0063
24-03-17 29421.4 0.003
And
DF2:
Date NG NG_Change
05-04-17 213.8 0.0047
04-04-17 212.8 0.0421
03-04-17 204.2 -0.0078
31-03-17 205.8 -0.0068
30-03-17 207.2 -0.0166
29-03-17 210.7 0.0483
28-03-17 201 0.005
27-03-17 200 -0.0015
24-03-17 200.3 0.0137
And another one:
DF3:
Date TI_Price TI_Change
05-04-17 51.39 0.0071
04-04-17 51.03 0.0157
03-04-17 50.24 -0.0071
31-03-17 50.6 0.005
30-03-17 50.35 0.017
29-03-17 49.51 0.0236
28-03-17 48.37 0.0134
I want to combine them, using the Date column as the common variable, so that the final data frame contains only those rows whose dates appear in all of them, such as:
Date Index Change NG NG_Change TI_Price TI_Change
05-04-17 29911.55 0 213.8 0.0047 51.39 0.0071
03-04-17 29910.22 0.0098 204.2 -0.0078 50.24 -0.0071
31-03-17 29620.5 -0.0009 205.8 -0.0068 50.6 0.005
30-03-17 29647.42 0.0039 207.2 -0.0166 50.35 0.017
29-03-17 29531.43 0.0041 210.7 0.0483 49.51 0.0236
28-03-17 29409.52 0.0059 201 0.005 48.37 0.0134
I am just wondering if there is any method so that I could merge them in one go, rather than with the merge() function, which takes only two data frames at a time (merge DF1 and DF2, then merge the result with DF3).
What I used and tweaked around with (to no avail):
myfulldata = merge(DF1, DF2, all.x=T)
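The standard one-call approach is to put all the data frames in a list and fold merge() over it with Reduce(). A sketch using a few rows copied from the tables above:

```r
DF1 <- data.frame(Date   = c("05-04-17", "03-04-17", "31-03-17"),
                  Index  = c(29911.55, 29910.22, 29620.5),
                  Change = c(0, 0.0098, -0.0009))
DF2 <- data.frame(Date      = c("05-04-17", "04-04-17", "03-04-17", "31-03-17"),
                  NG        = c(213.8, 212.8, 204.2, 205.8),
                  NG_Change = c(0.0047, 0.0421, -0.0078, -0.0068))
DF3 <- data.frame(Date      = c("05-04-17", "04-04-17", "03-04-17", "31-03-17"),
                  TI_Price  = c(51.39, 51.03, 50.24, 50.6),
                  TI_Change = c(0.0071, 0.0157, -0.0071, 0.005))

# Fold merge() over the list: inner-joins all frames on Date in one go,
# so only dates present in every frame survive
myfulldata <- Reduce(function(x, y) merge(x, y, by = "Date"), list(DF1, DF2, DF3))
```

Here 04-04-17 drops out because it is missing from DF1; pass all = TRUE inside the anonymous function if you want a full outer join instead.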
I am new to coding and to R. I am currently working with the package relsurv, and I would like to calculate the relative survival at certain time points.
I am using the following to assess RS at five years:
rcurve2 <- rs.surv(Surv(time_days17/365.241,event_17)~1+
ratetable(age = age_diagnosis*365.241, sex = sex,
year = year_diagnosis_days), data = survdata, ratetable = swepop,
method="ederer1",conf.int=0.95,type="kaplan-meier",
add.times = 5*365.241)
summary(rcurve2)
However, I get the same result in my summary output regardless of what number I put after add.times, i.e. all event/censoring points (see below):
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0.205 177 1 0.9944 0.00562 0.9834 1.005
0.627 176 1 0.9888 0.00792 0.9734 1.004
0.742 175 1 0.9831 0.00968 0.9644 1.002
0.827 174 1 0.9775 0.01114 0.9559 1.000
0.849 173 1 0.9718 0.01242 0.9478 0.996
0.947 172 1 0.9662 0.01356 0.9400 0.993
...cont.
I am clearly not getting it right! Would be grateful for your help!
A very good question!
When adding "imaginary" times using add.times, they are automatically censored and won't show up in the summary() function's output. To see your added times, set censored = TRUE:
summary(rcurve2, censored = TRUE)
You should now find your added time in the list that follows.
EXAMPLE
Using built in data with the relsurv package
data(slopop)
data(rdata)
#note the last argument add.times=1000
rcurve2 <- rs.surv(Surv(time,cens)~sex+ratetable(age=age*365.241, sex=sex,
year=year), ratetable=slopop, data=rdata, add.times = 1000)
When using summary(rcurve2), the time 1000 won't show up:
>summary(rcurve2)
[...]
973 200 1 0.792 0.03081 0.734 0.855
994 199 1 0.790 0.03103 0.732 0.854
1002 198 1 0.783 0.03183 0.723 0.848
[...]
BUT using summary(rcurve2, censored=TRUE) it will!
>summary(rcurve2, censored=TRUE)
[...]
973 200 1 0.792 0.03081 0.734 0.855
994 199 1 0.790 0.03103 0.732 0.854
1000 198 0 0.791 0.03106 0.732 0.854
1002 198 1 0.783 0.03183 0.723 0.848
[...]