Assigning a value when it falls between two bounds, across data frames in R

I have two data frames, one with a series of random values of length >n, call it:
df.my_data
I also have a second data frame, call it:
df.regions
df.regions consists of three columns: the first with a variable set of numbers 1 through n, the second with a corresponding lower bound, and the third with a corresponding upper bound. Call these
regions$location
regions$lower
regions$upper
I would like to assign the number in the first column of df.regions, regions$location, to a new column in df.my_data, based on whether the value in df.my_data falls between a given pair of lower and upper bounds in df.regions.
Let me know if I can clarify in any way.

If I understand correctly (and assuming that the regions' lower and upper bounds exhaust the range of values you need to classify, and that no value falls exactly on a shared boundary), then this should be an analogous example:
library(dplyr)
library(purrr)
set.seed(1)
x = tibble(value=abs(rnorm(10, 0, 5)))
bounds = tibble(lower = c(0:6), upper = c(1:6, Inf), class = letters[1:7])
x$class <- bounds[map_int(x$value, function(z) {
  which(map_lgl(seq_len(nrow(bounds)),
                ~ between(z, bounds$lower[.x], bounds$upper[.x])))
}), 3]
x
#> # A tibble: 10 x 2
#> value class$class
#> <dbl> <chr>
#> 1 3.13 d
#> 2 0.918 a
#> 3 4.18 e
#> 4 7.98 g
#> 5 1.65 b
#> 6 4.10 e
#> 7 2.44 c
#> 8 3.69 d
#> 9 2.88 c
#> 10 1.53 b
Created on 2019-11-24 by the reprex package (v0.3.0)
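As a side note, a vectorised alternative (a sketch under the same assumptions, not part of the reprex above): because the lower bounds are already sorted, findInterval() can locate each value's interval in a single call, and indexing with $ also yields a plain character column rather than the tibble-column (class$class) seen above.
# index of the interval each value falls in (bounds$lower is sorted ascending)
idx <- findInterval(x$value, bounds$lower)
x$class <- bounds$class[idx]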

Related

Acoustic complexity index time series output

I have a wav file and I would like to calculate the Acoustic Complexity Index at each second and receive a time series output.
I understand how to modify other settings within a function like seewave::ACI() but I am unable to find out how to output a time series data frame where each row is one second of time with the corresponding ACI value.
For a reproducible example, this audio file is 20 seconds, so I'd like the output to have 20 rows, with each row containing the ACI for that one second of audio.
library(soundecology)
data(tropicalsound)
acoustic_complexity(tropicalsound)
In fact, I'd like to achieve this for a few other indices as well, for example:
soundecology::ndsi(tropicalsound)
soundecology::acoustic_evenness(tropicalsound)
You can subset your wav file according to the samples it contains. Since the sampling frequency can be obtained from the wav object, we can get one-second subsets of the file and perform our calculations on each. Note that you have to set the cluster size to 1 second, since the default is 5 seconds.
library(soundecology)
data(tropicalsound)
f <- tropicalsound@samp.rate  # samples per second
starts <- head(seq(0, length(tropicalsound), f), -1)  # first sample of each one-second window
aci <- sapply(starts, function(i) {
  aci <- acoustic_complexity(tropicalsound[i + seq(f)], j = 1)
  aci$AciTotAll_left
})
nds <- sapply(starts, function(i) {
  nds <- ndsi(tropicalsound[i + seq(f)])
  nds$ndsi_left
})
aei <- sapply(starts, function(i) {
  aei <- acoustic_evenness(tropicalsound[i + seq(f)])
  aei$aei_left
})
This allows us to create a second-by-second data frame representing a time series of each measure:
data.frame(time = 0:19, aci, nds, aei)
#> time aci nds aei
#> 1 0 152.0586 0.7752307 0.438022
#> 2 1 168.2281 0.4171902 0.459380
#> 3 2 149.2796 0.9366220 0.516602
#> 4 3 176.8324 0.8856127 0.485036
#> 5 4 162.4237 0.8848515 0.483414
#> 6 5 161.1535 0.8327568 0.511922
#> 7 6 163.8071 0.7532586 0.549262
#> 8 7 156.4818 0.7706808 0.436910
#> 9 8 156.1037 0.7520663 0.489253
#> 10 9 160.5316 0.7077717 0.491418
#> 11 10 157.4274 0.8320380 0.457856
#> 12 11 169.8831 0.8396483 0.456514
#> 13 12 165.4426 0.6871337 0.456985
#> 14 13 165.1630 0.7655454 0.497621
#> 15 14 154.9258 0.8083035 0.489896
#> 16 15 162.8614 0.7745876 0.458035
#> 17 16 148.6004 0.1393345 0.443370
#> 18 17 144.6733 0.8189469 0.458309
#> 19 18 156.3466 0.6067827 0.455578
#> 20 19 158.3413 0.7175293 0.477261
Note that this is simply a demonstration of how to achieve the desired output; you would need to check the literature to determine whether it is appropriate to use these measures over such short time periods.
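If you want to apply this pattern to further indices without repeating the loop, one option is a small wrapper (a hedged sketch; index_per_second is a hypothetical helper name, and it assumes each soundecology function returns a list from which the relevant left-channel element can be extracted by name, as in the calls above):
# hypothetical helper: compute any per-second index on a Wave object
index_per_second <- function(wave, fn, field, ...) {
  f <- wave@samp.rate
  starts <- head(seq(0, length(wave), f), -1)
  sapply(starts, function(i) fn(wave[i + seq(f)], ...)[[field]])
}
aci <- index_per_second(tropicalsound, acoustic_complexity, "AciTotAll_left", j = 1)
nds <- index_per_second(tropicalsound, ndsi, "ndsi_left")
aei <- index_per_second(tropicalsound, acoustic_evenness, "aei_left")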

How to randomize an ascending list of values into two similar groups in R

I want to randomize an ascending list of values into two similar groups in R.
By similar I mean two statistically similar groups, where the mean time (performance) is the same (the lower the time, the better). I would like evenly distributed groups, each containing fast and slow skiers.
I will have a pre-test and want to randomize some alpine ski athletes based on their performance.
The dataset will look like this (this is a test dataset; the real one, with n = 40, I will get at the pre-test):
# A tibble: 4 × 5
# Groups: BIB. [4]
BIB. `11` `99` `77` performance
<int> <dbl> <dbl> <dbl> <dbl>
1 1 14.2 NA NA NA
2 2 14.4 15.0 NA -0.600
3 3 14.3 14.6 NA -0.310
4 77 NA 12.9 61.4 NA
Can anyone help me?
My approach might be to sample it randomly some number of times (100?), then evaluate the group means in each, and pick the split with the smallest difference.
Here is how you would do that.
library(dplyr)
library(purrr)

# invent some initial data
(hiris <- head(iris, n = 20) |> select(perftime = Sepal.Length) |> mutate(id = row_number()))
# make 100 scrambled datasets, then evaluate them for the closest means
set.seed(42)
(nr <- nrow(hiris))
possible_sets <- map(1:100,
                     ~ slice_sample(.data = hiris, n = nr, replace = FALSE) |>
                         mutate(group = 1 * (row_number() <= nr / 2)))  # first half group 1, second half group 0
(evaluations <- map_dbl(possible_sets, ~ {
  step1 <- group_by(.x, group) |> summarise(m = mean(perftime))
  abs(step1$m[[1]] - step1$m[[2]])  # absolute difference of the two group means
}))
(set_to_choose <- which.min(evaluations))
# to see the evaluation
plot(seq_along(evaluations), evaluations)
points(x = set_to_choose,
       y = evaluations[set_to_choose],
       col = "red", pch = 15)
# to use the 'best' set
(chosen_set <- possible_sets[[set_to_choose]])
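A deterministic alternative, not part of the answer above but a common trick for exactly this balancing problem (a sketch using the same hiris data): sort by performance and alternate group assignment down the list, so each group gets an even spread of fast and slow skiers by construction.
# sort by performance, then alternate groups 1, 2, 1, 2, ...
balanced <- hiris |>
  arrange(perftime) |>
  mutate(group = rep_len(c(1, 2), n()))
balanced |> group_by(group) |> summarise(m = mean(perftime))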

How to create percentiles in R using dplyr with data frame?

I am looking to create an additional column named "percentile", based on SOLD_QUOTES, and I do not want to use a window function; the percentile should be based on the entire dataset. The data is currently in descending order by SOLD_QUOTES, so ideally the first row should be the 99.99th percentile, with percentiles cascading down the table.
Expected output: (screenshot in the original question)
Maybe something like,
library(dplyr)
df <- tibble(sold_quotes = sample(1e6, 1e3, replace = TRUE))
pctiles <- seq(0, 1, 0.001)
df %>%
  arrange(desc(sold_quotes)) %>%
  mutate(percentile = cut(sold_quotes,
                          quantile(sold_quotes, probs = pctiles),
                          labels = pctiles[2:length(pctiles)] * 100))
#> # A tibble: 1,000 x 2
#> sold_quotes percentile
#> <int> <fct>
#> 1 999562 100
#> 2 996533 99.9
#> 3 996260 99.8
#> 4 995499 99.7
#> 5 994984 99.6
#> 6 994937 99.5
#> 7 994130 99.4
#> 8 993001 99.3
#> 9 992902 99.2
#> 10 990298 99.1
#> # … with 990 more rows
The percentile calculation doesn't depend on rearranging sold_quotes in descending order; you'll get the correct result without it. I was just mirroring your example.
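If all you need is each row's empirical percentile, a simpler sketch (my suggestion, not part of the answer above) uses dplyr's percent_rank(), which avoids the quantile()/cut() machinery:
# fraction of values below each row's value, scaled to 0-100
df %>%
  mutate(percentile = percent_rank(sold_quotes) * 100) %>%
  arrange(desc(sold_quotes))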

How to total rows with only certain columns

So I have a data set based on HR training data, which asks tech and common questions.
The rows represent an employee and the columns represent the score they got on each question. The columns also include demographic data. I only want to see the row total of the tech and common questions though and not include the demographic data.
techs<-grep("^T",rownames(dat))
commons<-grep("^C",rownames(dat))
I used this to try to group the columns together but when I do:
total<-rowsum(commons,techs)
and try to put it in a linear regression:
Mod1Train<-lm(total~.,data=dat[Train,])
it says that there are different variable lengths.
I'm a super newbie to R, so sorry in advance if I'm really off.
In the future it would be ever so helpful if you provided a sample of your data. It's hard for us to help when we're guessing about that. Please see this link: https://stackoverflow.com/help/minimal-reproducible-example.
Having said that (LOL) and realizing you're new, I'll take a guess...
Let's make pretend data that I imagine is a smaller imaginary version of yours...
set.seed(2020)
emplid <- 1:10
gender <- sample(c("Male", "Female"), size = 10, replace = TRUE)
Tech1 <- sample(10:20, size = 10, replace = TRUE)
Tech2 <- sample(10:20, size = 10, replace = TRUE)
Tech3 <- sample(10:20, size = 10, replace = TRUE)
Common1 <- sample(10:20, size = 10, replace = TRUE)
Common2 <- sample(10:20, size = 10, replace = TRUE)
Common3 <- sample(10:20, size = 10, replace = TRUE)
Kathryn <- data.frame(emplid, gender, Tech1, Tech2, Tech3, Common1, Common2, Common3)
Kathryn
#> emplid gender Tech1 Tech2 Tech3 Common1 Common2 Common3
#> 1 1 Female 10 17 15 18 17 15
#> 2 2 Female 17 13 11 20 11 13
#> 3 3 Male 17 11 19 18 10 12
#> 4 4 Female 19 16 15 14 15 16
#> 5 5 Female 11 13 20 20 16 13
#> 6 6 Male 15 11 17 19 17 13
#> 7 7 Male 11 13 11 15 14 11
#> 8 8 Female 12 14 10 11 17 19
#> 9 9 Female 11 13 15 18 11 10
#> 10 10 Female 17 20 12 12 14 15
If you're new, you may want to invest some time in learning the tidyverse, which could make this simple, as in Efficiently sum across multiple columns in R; see the sketch just below.
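For instance, a tidyverse sketch of the same row totals (an aside of mine, assuming the Tech/Common column names used in this example):
library(dplyr)
# same totals as the base-R version below, written with across()
Kathryn %>%
  mutate(TechScores   = rowSums(across(starts_with("Tech"))),
         CommonScores = rowSums(across(starts_with("Common"))))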
Per your note in the comments, you have a pattern we can match for summing questions. You were close with your attempt at grep but we want the values back so we need value = TRUE which we'll store and make use of.
techqs <- grep(x = names(Kathryn), pattern = "^Tech", value = TRUE)
commonqs <- grep(x = names(Kathryn), pattern = "^Common", value = TRUE)
Kathryn$TechScores <- rowSums(Kathryn[,techqs])
Kathryn$CommonScores <- rowSums(Kathryn[,commonqs])
### Commented out how to do it manually.
# Kathryn$TechScores <- rowSums(Kathryn[,c("TQ1", "TQ2", "TQ3")])
# Kathryn$CommonScores <- rowSums(Kathryn[,c("CQ1", "CQ2", "CQ3")])
Kathryn$TotalScore <- Kathryn$TechScores + Kathryn$CommonScores
Now to regress, which is where the statistical problem comes in. Are you really trying to predict the total score from its components? That's not hard in R, but it leads to silly answers.
Kathryn_model <- lm(formula = TotalScore ~ TechScores + CommonScores, data = Kathryn)
summary(Kathryn_model)
#> Warning in summary.lm(Kathryn_model): essentially perfect fit: summary may be
#> unreliable
#>
#> Call:
#> lm(formula = TotalScore ~ TechScores + CommonScores, data = Kathryn)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.165e-14 -1.905e-15 9.290e-16 8.590e-15 1.183e-14
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 8.089e-14 6.345e-14 1.275e+00 0.243
#> TechScores 1.000e+00 9.344e-16 1.070e+15 <2e-16 ***
#> CommonScores 1.000e+00 1.130e-15 8.853e+14 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.43e-14 on 7 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 9.875e+29 on 2 and 7 DF, p-value: < 2.2e-16
I don't understand your code and what you're searching for.
rowsum() doesn't make "a row total"; quite the contrary, it adds rows together (within groups) and returns a matrix, not a vector. Is that what you want?
Otherwise, maybe you're looking for rowSums(), which computes the total of every row of a matrix.
(By the way, if you need it, the matrix product is %*% in R.)
Are you sure you have understood lm?
In lm, there should be something like
lm(y ~ x, data = adataframe)
"adataframe" is the optional data frame or matrix-like object in which lm looks for both the response and the input variable, named "y" and "x" here. If the column names are not found in data, they are looked up in the global environment. It is often better, however, to have such a matrix-like object, to avoid common errors.
So if you want to use lm, maybe you should first obtain two vectors, one for x and one for y, put them in a data frame with two columns (x and y), and call the code above, if I have understood correctly.
Note: if you want to remove the constant, then use
lm(y ~ x + 0, data = adataframe)
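A minimal sketch of that pattern, with hypothetical x and y vectors:
# two vectors gathered into a data frame, then passed to lm
adataframe <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 7.8))
lm(y ~ x, data = adataframe)      # with intercept
lm(y ~ x + 0, data = adataframe)  # intercept removed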

approx() without duplicates?

I am using approx() to interpolate values.
x <- 1:20
y <- c(3,8,2,6,8,2,4,7,9,9,1,3,1,9,6,2,8,7,6,2)
df <- cbind.data.frame(x,y)
> df
x y
1 1 3
2 2 8
3 3 2
4 4 6
5 5 8
6 6 2
7 7 4
8 8 7
9 9 9
10 10 9
11 11 1
12 12 3
13 13 1
14 14 9
15 15 6
16 16 2
17 17 8
18 18 7
19 19 6
20 20 2
interpolated <- approx(x=df$x, y=df$y, method="linear", n=5)
gets me this:
interpolated
$x
[1] 1.00 5.75 10.50 15.25 20.00
$y
[1] 3.0 3.5 5.0 5.0 2.0
Now, the first and last values are duplicates of my real data. Is there any way to prevent this, or is it something I don't understand properly about approx()?
You may want to specify xout to avoid this. For instance, if you want to always exclude the first and the last points, here's how you can do that:
specify_xout <- function(x, n) {
  seq(from = min(x), to = max(x), length.out = n + 2)[-c(1, n + 2)]
}
plot(df$x, df$y)
points(approx(df$x, df$y, xout=specify_xout(df$x, 5)), pch = "*", col = "red")
This does not prevent interpolating at an existing point somewhere in the middle, which is exactly what can happen in the plot produced above.
approx will fit through all your original data points if you give it a chance (change n=5 to xout=df$x to see this). Interpolation is the process of generating values for y given unobserved values of x, but the results should agree with your data wherever the values of x have been previously observed.
The method="linear" setup 'draws' linear segments joining up your original coordinates exactly (and so will return the y values you put in for integer x). You only observe 'new' y values because n=5 means that, for points other than the beginning and end, x is not an integer (and therefore not one of your input values), and so gets interpolated.
If you want observed values not to be exactly reproduced, then maybe add some noise via rnorm()?
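A quick check of that claim:
# interpolating at the observed x values reproduces y exactly
all.equal(approx(df$x, df$y, xout = df$x)$y, df$y)
#> [1] TRUE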
