I have a data frame that looks like this:
Teff logg M_div_H U B V R I J H K L Lprime M
1: 2000 4.00 -0.1 -13.443 -11.390 -7.895 -4.464 -1.831 1.666 3.511 2.701 4.345 4.765 5.680
2: 2000 4.50 -0.1 -13.402 -11.416 -7.896 -4.454 -1.794 1.664 3.503 2.728 4.352 4.772 5.687
3: 2000 5.00 -0.1 -13.358 -11.428 -7.888 -4.431 -1.738 1.664 3.488 2.753 4.361 4.779 5.685
4: 2000 5.50 -0.1 -13.220 -11.079 -7.377 -4.136 -1.483 1.656 3.418 2.759 4.355 4.753 5.638
5: 2200 3.50 -0.1 -11.866 -9.557 -6.378 -3.612 -1.185 1.892 3.294 2.608 3.929 4.289 4.842
6: 2200 4.50 -0.1 -11.845 -9.643 -6.348 -3.589 -1.132 1.874 3.310 2.648 3.947 4.305 4.939
7: 2200 5.50 -0.1 -11.655 -9.615 -6.279 -3.508 -0.997 1.886 3.279 2.709 3.964 4.314 4.928
8: 2500 -1.02 -0.1 -7.410 -7.624 -6.204 -3.854 -1.533 1.884 3.320 2.873 3.598 3.964 5.579
9: 2500 -0.70 -0.1 -7.008 -7.222 -5.818 -3.618 -1.338 1.905 3.266 2.868 3.502 3.877 5.417
10: 2500 -0.29 -0.1 -6.526 -6.740 -5.357 -3.421 -1.215 1.927 3.216 2.870 3.396 3.781 5.247
11: 2500 5.50 -0.1 -9.518 -7.575 -5.010 -2.756 -0.511 1.959 3.057 2.642 3.472 3.756 4.265
12: 2800 -1.02 -0.1 -7.479 -7.386 -5.941 -3.716 -1.432 1.824 3.259 2.812 3.567 3.784 5.333
13: 2800 -0.70 -0.1 -7.125 -7.032 -5.596 -3.477 -1.231 1.822 3.218 2.813 3.479 3.717 5.229
14: 2800 -0.29 -0.1 -6.673 -6.580 -5.154 -3.166 -0.974 1.816 3.163 2.812 3.364 3.628 5.093
15: 2800 3.50 -0.1 -8.113 -6.258 -4.103 -2.209 -0.360 1.957 2.872 2.517 3.219 3.427 4.026
16: 2800 4.00 -0.1 -7.992 -6.099 -3.937 -2.076 -0.230 1.907 2.869 2.480 3.227 3.424 4.075
17: 2800 4.50 -0.1 -7.815 -6.051 -4.067 -2.176 -0.228 1.920 2.877 2.503 3.212 3.428 4.000
18: 2800 5.00 -0.1 -7.746 -6.018 -4.031 -2.144 -0.176 1.907 2.883 2.512 3.216 3.430 4.023
19: 3000 -0.70 -0.1 -7.396 -6.995 -5.605 -3.554 -1.293 1.787 3.172 2.759 3.474 3.588 5.052
20: 3000 -0.29 -0.1 -6.966 -6.565 -5.179 -3.249 -1.035 1.772 3.136 2.764 3.388 3.533 4.978
Notice, for example, how each V value corresponds to a unique (Teff, logg) combination.
Now, let's say I have two values:
input_Teff = 2300
input_log_g = 3.86
If we imagine all the (Teff, logg) combinations as grid points, for a given input point, I would like to find the four points closest to that input point.
point 1 (Teff1, logg1)
point 2 (Teff2, logg2)
point 3 (Teff3, logg3)
point 4 (Teff4, logg4)
then, calculate "distances" between my input point and the other points through Pythagoras (in this examples, three distances),
(Teff1, logg_1) -> d1
(Teff2, logg_2) -> d2
(Teff3, logg_3) -> d3
(Teff4, logg_4) -> d4 # NOTE: Teff and logg are different scales
Next, get the V values in the rows of these points:
(Teff1, logg1) -> V1
(Teff2, logg2) -> V2
(Teff3, logg3) -> V3
(Teff4, logg4) -> V4
And finally do a weighted average calculation:
V = (d1·V1 + d2·V2 + d3·V3 + d4·V4) / (d1 + d2 + d3 + d4)
What would be a good way to do this in R?
Edit: https://www.dropbox.com/s/prbceabxmd25etx/lcb98cor.dat?dl=0
Let's say your data frame is named teff_df, and let's calculate the Pythagorean distances this way (update: find the nearest distances between the input and every (Teff, logg) grid point). Note: I use min-max normalization to rescale the Teff and logg values:
min_max_norm <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
pythagoras <- function(a, b) {
  sqrt(a^2 + b^2)  # Euclidean length of the (a, b) difference vector
}
First, rescale the Teff and logg values of the input together with all available grid points into the 0-1 range:
teff_with_input <- c(teff_df$Teff, 2300)
logg_with_input <- c(teff_df$logg, 3.86)
teff_rescale <- min_max_norm(teff_with_input)
logg_rescale <- min_max_norm(logg_with_input)
# separate the rescaled input point (last element) from the grid points
teff_input_rescale <- teff_rescale[length(teff_rescale)]
logg_input_rescale <- logg_rescale[length(logg_rescale)]
teff_rescale <- teff_rescale[-length(teff_rescale)]
logg_rescale <- logg_rescale[-length(logg_rescale)]
Second, calculate the Pythagorean distance between the input and every grid point:
my_phyta_dist <- pythagoras(teff_rescale - teff_input_rescale,
                            logg_rescale - logg_input_rescale)
my_v_point <- teff_df$V
teff_df[order(my_phyta_dist)[1:4], ] # displays the four nearest rows by Pythagorean distance
Third, extract the 4 nearest points, i.e. the rows with the smallest distances to the input over the (Teff, logg) grid:
nearest_4rows <- order(my_phyta_dist)[1:4]
nearest_4phyta_dist <- my_phyta_dist[nearest_4rows]
nearest_4v_point <- my_v_point[nearest_4rows]
and then, finally, apply the weighted-average formula V = (d1·V1 + d2·V2 + d3·V3 + d4·V4) / (d1 + d2 + d3 + d4):
product_dist <- nearest_4phyta_dist * nearest_4v_point
weighted_average <- sum(product_dist) / sum(nearest_4phyta_dist)
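For what it's worth, the whole lookup can also be sketched with the FNN package, which returns the k nearest neighbours and their distances in one call. A minimal sketch, assuming FNN is installed and reusing the rescaled vectors from the steps above:
library(FNN)
# grid points and query point in rescaled (Teff, logg) coordinates
grid <- cbind(teff_rescale, logg_rescale)
query <- cbind(teff_input_rescale, logg_input_rescale)
nn <- get.knnx(grid, query, k = 4) # nn$nn.index holds row indices, nn$nn.dist the distances
d <- nn$nn.dist[1, ]
v <- teff_df$V[nn$nn.index[1, ]]
weighted_average <- sum(d * v) / sum(d)
One design note: the formula as stated gives more weight to more distant neighbours; if nearer neighbours should dominate instead, the usual choice is inverse-distance weighting, i.e. sum(v / d) / sum(1 / d).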
I have datasheets with multiple measurements that look like the following:
FILE DATE TIME LOC QUAD LAI SEL DIFN MTA SEM SMP
20 20210805 08:38:32 H 1161 2.80 0.68 0.145 49. 8. 4
ANGLES 7.000 23.00 38.00 53.00 68.00
CNTCT# 1.969 1.517 0.981 1.579 1.386
STDDEV 1.632 1.051 0.596 0.904 0.379
DISTS 1.008 1.087 1.270 1.662 2.670
GAPS 0.137 0.192 0.288 0.073 0.025
A 1 08:38:40 31.66 33.63 34.59 39.13 55.86
1 2 08:38:40 -5.0e-006
B 3 08:38:48 25.74 20.71 15.03 2.584 1.716
B 4 08:38:55 0.344 1.107 2.730 0.285 0.265
B 5 08:39:02 3.211 5.105 13.01 4.828 1.943
B 6 08:39:10 8.423 22.91 48.77 16.34 3.572
B 7 08:39:19 12.58 14.90 18.34 18.26 4.125
I would like to read the entire datasheet and extract the values for 'QUAD' and 'LAI' only. For example, for the data above I would only be extracting a QUAD of 1161 and an LAI of 2.80.
In the past the datasheets were formatted as long data, and I was able to use the following code:
library(stringr)
QUAD <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^QUAD).*$")))
LAI <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^LAI).*$")))
data_extract <- data.frame(
QUAD = QUAD[!is.na(QUAD)],
LAI = LAI[!is.na(LAI)]
)
data_extract
Unfortunately, this does not work because of the wide formatting in the current datasheet. Any help would be hugely appreciated. Thanks in advance for your time.
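One possible approach, sketched under two assumptions: every record's header line starts with "FILE", and the QUAD and LAI values sit on the line directly below it in the same column positions ("datasheet.txt" is a hypothetical file name):
library(stringr)
lines <- readLines("datasheet.txt")
header_idx <- which(str_detect(lines, "^FILE")) # one hit per record
extract_one <- function(i) {
  header <- str_split(str_squish(lines[i]), " ")[[1]]
  values <- str_split(str_squish(lines[i + 1]), " ")[[1]]
  data.frame(QUAD = as.numeric(values[match("QUAD", header)]),
             LAI = as.numeric(values[match("LAI", header)]))
}
data_extract <- do.call(rbind, lapply(header_idx, extract_one))
data_extract
For the sheet above this would return QUAD = 1161 and LAI = 2.80.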
I'm trying to create multiple columns in a data.table in one command, since the logic is simple. I have a column of starting values a0 and need to create a time evolution by simply adding a constant to each successive column.
Here is a reproducible example:
library(data.table)
dt <- data.table(a0 = c(0.3, 0.34, 0.45, 0.6, 0.37, 0.444))
dt[, paste0('a', 1:5) := a0 + 1:5 / 4]
I would expect this to produce columns a1, a2, a3, a4, a5 by adding successive multiples of 1/4 to a0, but instead I get a warning and an incorrect result:
longer object length is not a multiple of shorter object length
dt
a0 a1 a2 a3 a4 a5
1: 0.300 0.550 0.550 0.550 0.550 0.550
2: 0.340 0.840 0.840 0.840 0.840 0.840
3: 0.450 1.200 1.200 1.200 1.200 1.200
4: 0.600 1.600 1.600 1.600 1.600 1.600
5: 0.370 1.620 1.620 1.620 1.620 1.620
6: 0.444 0.694 0.694 0.694 0.694 0.694
It looks like R is calculating along the wrong dimension. I tried wrapping the right-hand side in a list, dt[, paste0('a', 1:5) := list(a0 + 1:5 / 4)], but without luck.
You get the warning because length(dt$a0) is 6 whereas length(1:5) is 5.
dt$a0 + 1:5
#[1] 1.300 2.340 3.450 4.600 5.370 1.444
Warning message:
In dt$a0 + 1:5 :
longer object length is not a multiple of shorter object length
Here the first value of 1:5 is recycled and added to dt$a0[6].
You cannot build several columns from a single vector like that: when := creates multiple columns at once, the right-hand side must be a list with one element per column, otherwise the same vector is reused for every column. To add new columns based on the a0 values you can do something like
library(data.table)
n <- 5
dt[, paste0('a', seq_len(n)) := lapply(seq_len(n)/4, function(x) x + a0)]
dt
# a0 a1 a2 a3 a4 a5
#1: 0.300 0.550 0.800 1.050 1.300 1.550
#2: 0.340 0.590 0.840 1.090 1.340 1.590
#3: 0.450 0.700 0.950 1.200 1.450 1.700
#4: 0.600 0.850 1.100 1.350 1.600 1.850
#5: 0.370 0.620 0.870 1.120 1.370 1.620
#6: 0.444 0.694 0.944 1.194 1.444 1.694
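If you prefer, the same columns can be built one at a time with data.table::set(), which writes by reference and sidesteps the list-shaped right-hand side that := expects; a sketch:
for (i in seq_len(n)) {
  set(dt, j = paste0("a", i), value = dt$a0 + i / 4) # one new column per iteration
}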
Here are two data frames, data1 and data2:
data1
id A B C D E F G
1 id1 1.00 0.31 -3.20 2.50 3.1 -0.300 -0.214
2 id2 0.40 -2.30 0.24 -1.47 3.2 0.152 -0.140
3 id3 1.30 -3.20 2.00 -0.62 2.3 0.460 1.320
4 id4 -0.71 0.98 2.10 1.20 -1.5 0.870 -1.550
5 id5 2.10 -1.57 0.24 1.70 -1.2 -1.300 1.980
data2
factor constant
1 A -0.321
2 B 1.732
3 C 1.230
4 D 3.200
5 E -0.980
6 F -1.400
7 G -0.300
In reality, data1 is a large dataset with ids up to id1000 and factors up to Z.
data2 also has the factors from A to Z and the corresponding constant for each.
I want to multiply each factor value in data1 by the constant in data2 corresponding to that factor, for all factors. Then I want to store the sum of those products in a new variable 'total' in data1.
For example, the 'total' of 'id1' = (A value 1.00 (data1) × A constant -0.321 (data2)) + (B 0.31 × 1.732) + (C -3.20 × 1.230) + (D 2.50 × 3.200) + (E 3.1 × -0.980) + (F -0.300 × -1.400) + (G -0.214 × -0.300)
If the column names in data1 and the rows in data2 are in the same order, you can do:
t(t(data1[-1]) * data2$constant)
# A B C D E F G
#1 -0.32100 0.53692 -3.9360 8.000 -3.038 0.4200 0.0642
#2 -0.12840 -3.98360 0.2952 -4.704 -3.136 -0.2128 0.0420
#3 -0.41730 -5.54240 2.4600 -1.984 -2.254 -0.6440 -0.3960
#4 0.22791 1.69736 2.5830 3.840 1.470 -1.2180 0.4650
#5 -0.67410 -2.71924 0.2952 5.440 1.176 1.8200 -0.5940
Or if you need the totals:
res = t(t(data1[-1]) * data2$constant)
res = cbind(res, total = rowSums(res))
res
# A B C D E F G total
#1 -0.32100 0.53692 -3.9360 8.000 -3.038 0.4200 0.0642 1.72612
#2 -0.12840 -3.98360 0.2952 -4.704 -3.136 -0.2128 0.0420 -11.82760
#3 -0.41730 -5.54240 2.4600 -1.984 -2.254 -0.6440 -0.3960 -8.77770
#4 0.22791 1.69736 2.5830 3.840 1.470 -1.2180 0.4650 9.06527
#5 -0.67410 -2.71924 0.2952 5.440 1.176 1.8200 -0.5940 4.74386
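As a side note, the totals alone are just a matrix-vector product; a sketch, again assuming the rows of data2 are ordered like the columns of data1:
data1$total <- drop(as.matrix(data1[-1]) %*% data2$constant) # drop() turns the 1-column matrix into a plain vector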
I need to generate bins from a data.frame based on the values of one column. I have tried the function "cut".
For example: I want to create bins of air temperature values in the column "AirTDay" in a data frame:
AirTDay (°C)
8.16
10.88
5.28
19.82
23.62
13.14
28.84
32.21
17.44
31.21
I need the bin intervals to include all values in a range of 2 degrees centigrade from an initial value (i.e. 8-9.99, 10-11.99, 12-13.99, ...), to be labelled with a mid value of the range (i.e. 8.5, 10.5, 12.5, ...), and to respect blank cells, returning NA in the bins column.
The output should look like:
Air_T (°C) TBins
8.16 8.5
10.88 10.5
5.28 NA
NA
19.82 20.5
23.62 24.5
13.14 14.5
NA
NA
28.84 28.5
32.21 32.5
17.44 18.5
31.21 32.5
I've gotten as far as:
setwd('C:/Users/xxx')
temp_data <- read.csv("temperature.csv", sep = ",", header = TRUE)
TAir <- temp_data$AirTDay
Tmin <- round(min(TAir, na.rm = TRUE), digits = 0) # start at the minimum value
Tmax <- round(max(TAir, na.rm = TRUE), digits = 0)
int <- 2 # bin ranges 2 degrees
mean_int <- int/2
int_range <- seq(Tmin, Tmax + int, int) # generate bin sequence
bin_label <- seq(Tmin + mean_int, Tmax + mean_int, int) # generate labels
temp_data$TBins <- cut(TAir, breaks = int_range, ordered_result = FALSE, labels = bin_label)
The output table looks correct, but for some reason it shows an additional sequential column, shifts the column names, and collapses all values, eliminating the blank cells. Something like this:
Air_T (°C) TBins
1 8.16 8.5
2 10.88 10.5
3 5.28 NA
4 19.82 20.5
5 23.62 24.5
6 13.14 14.5
7 28.84 28.5
8 32.21 32.5
9 17.44 18.5
10 31.21 32.5
Any ideas on where I am failing and how to solve it?
v <- ceiling(max(dat$V1, na.rm = TRUE))
breaks <- seq(8, v, 2)
labels <- seq(8.5, length.out = length(breaks) - 1, by = 2)
transform(dat, Tbins = cut(V1, breaks, labels))
V1 Tbins
1 8.16 8.5
2 10.88 10.5
3 5.28 <NA>
4 NA <NA>
5 19.82 18.5
6 23.62 22.5
7 13.14 12.5
8 NA <NA>
9 NA <NA>
10 28.84 28.5
11 32.21 <NA>
12 17.44 16.5
13 31.21 30.5
This result follows the logic given: we have
paste(seq(8, v - 2, 2), seq(9.99, v, by = 2), sep = "-")
[1] "8-9.99" "10-11.99" "12-13.99" "14-15.99" "16-17.99" "18-19.99" "20-21.99"
[8] "22-23.99" "24-25.99" "26-27.99" "28-29.99" "30-31.99"
From this we can tell that 19.82 lies in 18-19.99 and is therefore assigned the value 18.5, just as 10.88 lies in 10-11.99 and is assigned the value 10.5.
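As a side note, because the bins form a fixed 2-degree grid aligned on even numbers, the same labels can be computed arithmetically; a sketch, where the >= 8 mask mirrors the first break used above and NAs propagate through blank cells automatically (unlike cut(), this also labels values above the last break, e.g. 32.21 becomes 32.5):
dat$Tbins <- ifelse(dat$V1 >= 8, floor(dat$V1 / 2) * 2 + 0.5, NA) # lower bin edge + 0.5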
I am working on a MARS model using the earth package in R. My dataset (CE.Rda) consists of one dependent variable (D9_RTO_avg) and 10 potential predictors (NDVI_l1, NDVI_f0, NDVI_f1, NDVI_f2, NDVI_f3, LST_l1, LST_f0, LST_f1, LST_f2, LST_f3). Below is the head of my dataset:
D9_RTO_avg NDVI_l1 NDVI_f0 NDVI_f1 NDVI_f2 NDVI_f3 LST_l1 LST_f0 LST_f1 LST_f2 LST_f3
2 1.866667 0.3082 0.3290 0.4785 0.4330 0.5844 38.25 30.87 31 21.23 17.92
3 2.000000 0.2164 0.2119 0.2334 0.2539 0.4686 35.7 29.7 28.35 21.67 17.71
4 1.200000 0.2324 0.2503 0.2640 0.2697 0.4726 40.13 33.3 28.95 22.81 16.29
5 1.600000 0.1865 0.2070 0.2104 0.2164 0.3911 43.26 35.79 30.22 23.07 17.88
6 1.800000 0.2757 0.3123 0.3462 0.3778 0.5482 43.99 36.06 30.26 21.36 17.93
7 2.700000 0.2265 0.2654 0.3174 0.2741 0.3590 41.61 35.4 27.51 23.55 18.88
After creating my earth model as follows
mymodel.mod <- earth(D9_RTO_avg ~ ., data=CE, nk=10)
I print the summary of the resulting model by typing
print(summary(mymodel.mod, digits=2, style="pmax"))
and I obtain the following output
D9_RTO_avg =
4.1
+ 38 * LST_f128.68
+ 6.3 * LST_f216.41
- 2.9 * pmax(0, 0.66 - NDVI_l1)
- 2.3 * pmax(0, NDVI_f3 - 0.23)
Selected 5 of 7 terms, and 4 of 13169 predictors
Termination condition: Reached nk 10
Importance: LST_f128.68, NDVI_l1, NDVI_f3, LST_f216.41, NDVI_f0-unused, NDVI_f1-unused, NDVI_f2-unused, ...
Number of terms at each degree of interaction: 1 4 (additive model)
GCV 2 RSS 4046 GRSq 0.29 RSq 0.29
My question is: why is earth identifying 13169 predictors when there are actually only 10? It seems that MARS is treating individual observations of the candidate predictors as predictors in their own right. How can I prevent MARS from doing this?
Thanks for your help
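One thing worth checking, offered as a guess rather than a confirmed diagnosis: terms like LST_f128.68 look like a column name fused with a value (LST_f1, level 28.68), which is consistent with the predictor columns having been read in as factors or characters, so that the formula interface expands every level into its own dummy predictor. A sketch of the check and fix under that assumption:
str(CE) # inspect the column classes first
CE[] <- lapply(CE, function(x) as.numeric(as.character(x))) # factor/character -> numeric
mymodel.mod <- earth(D9_RTO_avg ~ ., data = CE, nk = 10)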