I have a set of audio clips from different sites and files that I have manually extracted and a set that have been extracted automatically. I want to identify any clips in the automatically extracted set at the same site and file that overlap at all with those in the manually extracted set.
For example:
library(tidyverse)  # dplyr, plus purrr::set_names() and tidyr::drop_na() used below
set.seed(123)
group <- as.factor(round(runif(100, 1, 10),0))
file <- sample(LETTERS, 100, TRUE)
start <- round(runif(100, 0, 100), 2)
end <- round(start + (runif(100, 0, 1)), 2)
auto <- bind_cols(group, file, start, end) %>%
set_names(., nm = c("group", "file", "start", "end"))
man <- bind_rows(sample_n(auto, 10) %>%
mutate(start = round(start - runif(10, 0, 1), 2),
end = round(end - runif(10,0, 0.5), 2)),
sample_n(auto, 10) %>%
mutate(start = round(start + runif(10, 0, 1), 2),
end = round(start + runif(10,0, 0.5), 2)))
I have been trying to use dplyr::between like this:
res <- auto %>%
left_join(., man, by = c("group", "file")) %>%
drop_na() %>%
mutate(pos = (between(start.x, start.y, end.y) | between(end.x, start.y, end.y)))
But it just returns FALSE for every row. The code doesn't throw an error, so I am guessing there is an issue in the way I have written it.
If you split the OR in the last mutate into separate columns to check whether either between call is the issue, like this:
res <- auto %>%
left_join(., man, by = c("group", "file")) %>%
drop_na() %>%
mutate(x = between(start.x, start.y, end.y),
y = between(end.x, start.y, end.y),
pos = x | y)
you can see that both return FALSE.
It seems like it should be a relatively simple thing, but I am struggling to figure it out.
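One thing worth noting about the attempt above: the two between() calls only test whether an endpoint of the automatic clip falls inside the manual clip, so an automatic clip that fully contains a manual one would be missed even if the calls behaved as intended. A minimal sketch of a direct interval-overlap test is shown below, in base R. The helper name intervals_overlap and the toy data frames auto_clips and man_clips are made up for illustration, and treating clips that merely touch at an endpoint as overlapping is an assumption.

```r
# Two intervals [start_a, end_a] and [start_b, end_b] overlap when each one
# starts before (or exactly when) the other one ends. Fully vectorised.
intervals_overlap <- function(start_a, end_a, start_b, end_b) {
  start_a <= end_b & end_a >= start_b
}

# Hypothetical toy data (not the simulated data from the question)
auto_clips <- data.frame(group = c("1", "1", "2"),
                         file  = c("A", "A", "B"),
                         start = c(0, 5, 1),
                         end   = c(1, 6, 2))
man_clips <- data.frame(group = c("1", "2"),
                        file  = c("A", "B"),
                        start = c(0.5, 3),
                        end   = c(1.5, 4))

# Pair every automatic clip with every manual clip from the same group and file
pairs <- merge(auto_clips, man_clips, by = c("group", "file"),
               suffixes = c(".x", ".y"))

# Row-wise overlap flag
pairs$pos <- intervals_overlap(pairs$start.x, pairs$end.x,
                               pairs$start.y, pairs$end.y)
pairs
```

The same comparison can be dropped into the mutate() call of the original pipeline, since it only uses start.x, end.x, start.y and end.y.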
I am working with the R programming language.
Suppose there is a classroom of students. Each student flips the same coin many times (the students don't all flip the coin the same number of times). Here is a simulated dataset to represent this example:
library(tidyverse)
library(dplyr)
set.seed(123)
ids = 1:100
student_id = sample(ids, 1000, replace = TRUE)
coin_result = sample(c("H", "T"), 1000, replace = TRUE)
my_data = data.frame(student_id, coin_result)
my_data = my_data[order(my_data$student_id),]
I want to count the number of "3 Flip Sequences" recorded by each student (e.g. Student 1 got HHHTH: HHH 1 time, HHT 1 time, HTH 1 time),
and the probability of the 3rd flip given the previous 2 flips (e.g. in general, over all students, the probability of an H following HH was 0.54).
Here is some R code that performs these tasks:
results = my_data %>%
group_by(student_id) %>%
summarize(Sequence = str_c(coin_result, lead(coin_result), lead(coin_result, 2)), .groups = 'drop') %>%
filter(!is.na(Sequence)) %>%
count(Sequence)
final = results %>%
mutate(two_seq = substr(Sequence, 1, 2)) %>%
group_by(two_seq) %>%
mutate(third = substr(Sequence, 3, 3)) %>%
group_by(two_seq, third) %>%
summarize(sums = sum(n)) %>%
mutate(prob = sums / sum(sums))
My Question: Suppose I want to now extend this problem to "4 Flip Sequences" (e.g. probability of H given HHH) - I can manually extend this code:
results = my_data %>%
group_by(student_id) %>%
summarize(Sequence = str_c(coin_result, lead(coin_result), lead(coin_result, 2), lead(coin_result, 3)), .groups = 'drop') %>%
filter(!is.na(Sequence)) %>%
count(Sequence)
final = results %>%
mutate(three_seq = substr(Sequence, 1, 3)) %>%
group_by(three_seq) %>%
mutate(fourth = substr(Sequence, 4, 4)) %>%
group_by(three_seq, fourth) %>%
summarize(sums = sum(n)) %>%
mutate(prob = sums / sum(sums))
Is it possible to convert the above code into a function such that I can repeat this for arbitrary combinations? For example:
results <- function(i) {return(my_data %>%
group_by(student_id) %>%
summarize(Sequence = str_c(coin_result, lead(coin_result), lead(coin_result, i+1), lead(coin_result, i+2) .....### insert code here ####), .groups = 'drop') %>%
filter(!is.na(Sequence)) %>%
count(Sequence))}
final <- function(i) {
return(results %>%
mutate(three_seq = substr(Sequence, 1, i)) %>%
group_by(three_seq) %>%
mutate(fourth = substr(Sequence, i+1, i+1)) %>%
group_by(three_seq, fourth) %>%
summarize(sums = sum(n)) %>%
mutate(prob = sums / sum(sums)))
}
I am not sure exactly how I would do this, since the body of the first function would need to change dynamically depending on the value of "i".
Can someone please show me how to do this?
Thanks!
Here's a way you can do it in base R:
# Returns a vector of 0's and 1's, bit more efficient than sample
tosses <- floor(runif(1e3, 0, 2))
count_seqs <- function(x, seq_length) {
vec_length <- length(x)
rolling_window_indices <- rep(1:seq_length, vec_length - seq_length + 1) +
rep(0:(vec_length - seq_length), each = seq_length)
mat <- matrix(x[rolling_window_indices], nrow = seq_length)
sequences <- apply(mat, 2, paste0, collapse = "")
table(sequences)
}
count_seqs(tosses, 3)
Notice I didn't include any ids in the code above. The reason is that, if all students have the same probability of tossing heads or tails, we can treat them as independent (or, more precisely, treat the design as ignorable). However, it's easy to expand the code for situations where the tosses are not independent, e.g. where each participant has a different probability of tossing heads:
ids <- floor(runif(1e3, 1, 101))
probs <- runif(1e2, 0, 1)
tosses_by_id <- lapply(ids, function(i) rbinom(10, 1, probs[i]))
lapply(tosses_by_id, function(x) count_seqs(x, 3))
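Returning to the data frame layout from the question (a student_id column and a coin_result column, already ordered within each student), the rolling-window idea can be wrapped so that windows never span two students and the conditional probability of the last flip is computed for any sequence length. This is only a sketch in base R: the names flip_windows and sequence_probs and the small flips data frame are hypothetical, and it assumes n is at least 2.

```r
# All length-n windows of a character vector, pasted into strings
flip_windows <- function(x, n) {
  if (length(x) < n) return(character(0))
  starts <- seq_len(length(x) - n + 1)
  vapply(starts, function(s) paste(x[s:(s + n - 1)], collapse = ""),
         character(1))
}

# Count n-flip sequences within each student, pool the counts, and compute
# P(last flip | first n - 1 flips)
sequence_probs <- function(data, n) {
  per_student <- split(as.character(data$coin_result), data$student_id)
  seqs <- unlist(lapply(per_student, flip_windows, n = n), use.names = FALSE)
  counts <- as.data.frame(table(Sequence = seqs), stringsAsFactors = FALSE)
  counts$prefix <- substr(counts$Sequence, 1, n - 1)
  counts$last   <- substr(counts$Sequence, n, n)
  # Divide each count by the total count of its prefix
  counts$prob   <- counts$Freq / ave(counts$Freq, counts$prefix, FUN = sum)
  counts
}

# Hypothetical toy data: student 1 flips HHHTH, student 2 flips TT
flips <- data.frame(student_id  = c(1, 1, 1, 1, 1, 2, 2),
                    coin_result = c("H", "H", "H", "T", "H", "T", "T"))
probs3 <- sequence_probs(flips, 3)
probs3
```

Calling sequence_probs(my_data, 4) would then give the 4-flip version without editing the function body.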
I'm looking to run a function on each group of a dataset, and bind the output to the existing set inside the tidyverse environment. After the example set, I've added how I do it right now, which requires splitting the set and running lapply (I want to move everything towards the tidyverse).
library(TTR)
library(dplyr)  # for %>%, select() and bind_rows() used below
test = data.frame('high'=rnorm(100,10,0.1),'low'=rnorm(100,0,0.1), 'close'=rnorm(100,5,0.1))
stoch(test,
nFastK = 14, nFastD = 3, nSlowD = 3,
maType=list(list(SMA), list(SMA), list(SMA)),
bounded = TRUE,
smooth = 1)
Here is how it used to be done with lists:
get_stoch = function(dat_) {
stochs = stoch(dat_ %>% select(-ticker), nFastK = 14, nFastD = 3, nSlowD = 3,
maType=list(list(SMA), list(SMA), list(SMA)),
bounded = TRUE, smooth = 1)
cbind(dat_, stochs)  # return the input columns with the stoch columns appended
}
test = data.frame('ticker'=c(rep('A',50),rep('B',50)),
'high'=rnorm(100,10,0.1),'low'=rnorm(100,0,0.1), 'close'=rnorm(100,5,0.1)) %>%
split(.,.$ticker) %>%
lapply(.,get_stoch) %>%
bind_rows
If you want to translate your code to the tidyverse, you can use the following (df here is the data frame with the ticker column, before splitting):
library(dplyr)
library(purrr)
df %>% group_split(ticker) %>% map_dfr(get_stoch)
You can use plyr::ddply to run a split-apply-bind method in tidyverse-like language:
df <- data.frame(ticker = c(rep('A', 50), rep('B', 50)),
high = rnorm(100, 10, 0.1),
low = rnorm(100, 0, 0.1),
close = rnorm(100, 5, 0.1))
test1 <- df %>%
split(.,.$ticker) %>%
lapply(.,get_stoch) %>%
bind_rows
test2 <- df %>%
plyr::ddply("ticker", get_stoch)  # plyr:: prefix because plyr is not attached
identical(test1, test2)
#> [1] TRUE
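Another option that stays inside dplyr is group_by() followed by group_modify(), which applies a function to each group and binds the results. The sketch below uses a hypothetical stand-in function, add_range_pct, instead of get_stoch so that it runs without TTR; the prices data frame is also made up. One caveat: inside group_modify() the grouping column is not part of .x, so the select(-ticker) step in get_stoch would likely need to be dropped before using it this way.

```r
library(dplyr)

# Hypothetical stand-in for get_stoch(): appends one derived column per group
add_range_pct <- function(dat_) {
  dat_$range_pct <- (dat_$close - dat_$low) / (dat_$high - dat_$low)
  dat_
}

# Hypothetical toy data
prices <- data.frame(ticker = rep(c("A", "B"), each = 3),
                     high   = c(10, 11, 12, 20, 21, 22),
                     low    = c(0, 1, 2, 10, 11, 12),
                     close  = c(5, 6, 7, 15, 16, 17))

out <- prices %>%
  group_by(ticker) %>%
  group_modify(~ add_range_pct(.x)) %>%  # .x holds one group, without ticker
  ungroup()
out
```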
I have a dataframe of 96074 obs. of 31 variables.
The first two variables are the id and the date, then I have 9 columns of measurements (three different KPIs, each over three different time windows), then various technical and geographical variables.
df <- data.frame(
id = rep(1:3, 3),
time = rep(as.Date('2009-01-01') + 0:2, each = 3),
sum_d_1day_old = rnorm(9, 2, 1),
sum_i_1day_old = rnorm(9, 2, 1),
per_i_d_1day_old = rnorm(9, 0, 1),
sum_d_5days_old = rnorm(9, 0, 1),
sum_i_5days_old = rnorm(9, 0, 1),
per_i_d_5days_old = rnorm(9, 0, 1),
sum_d_15days_old = rnorm(9, 0, 1),
sum_i_15days_old = rnorm(9, 0, 1),
per_i_d_15days_old = rnorm(9, 0, 1)
)
I want to transform the data from wide to long so that I can make graphs with ggplot, using facets for example.
If I had a df with just one KPI and its three time windows, I would have no problem using gather:
plotdf <- df %>%
gather(sum_d, value,
c(sum_d_1day_old, sum_d_5days_old, sum_d_15days_old),
factor_key = TRUE)
But having three different variables trips me up.
I would like to have this output:
plotdf <- data.frame(
id = rep(1:3, 3),
time = rep(as.Date('2009-01-01') + 0:2, each = 3),
sum_d = rep(c("sum_d_1day_old", "sum_d_5days_old", "sum_d_15days_old"), 3),
values_sum_d = rnorm(9, 2, 1),
sum_i = rep(c("sum_i_1day_old", "sum_i_5days_old", "sum_i_15days_old"), 3),
values_sum_i = rnorm(9, 2, 1),
per_i_d = rep(c("per_i_d_1day_old", "per_i_d_5days_old", "per_i_d_15days_old"), 3),
values_per_i_d = rnorm(9, 2, 1)
)
with id, sum_d, sum_i and per_i_d of class factor, time of class Date, and the values of class numeric (I should add that I don't have negative measurements in these variables).
What I've tried to do:
plotdf <- gather(df, key, value, sum_d_1day_old:per_i_d_15days_old, factor_key = TRUE)
This gathers all of the variables into a single column.
plotdf$KPI <- paste(sapply(strsplit(as.character(plotdf$key), "_"), "[[", 1),
sapply(strsplit(as.character(plotdf$key), "_"), "[[", 2), sep = "_")
This creates a new column with the name of the KPI, without the time specification.
plotdf %>% unite(value2, key, value) %>%
#creating a new variable with the full name of the KPI attaching the value at the end
mutate(i = row_number()) %>% spread(KPI, value2) %>% select(-i)
#spreading
But spread creates rows with NAs.
To replace them, at first I used
group_by(id, date) %>%
fill(c(sum_d, sum_i, per_i_d), .direction = "down") %>%
fill(c(sum_d, sum_i, per_i_d), .direction = "up") %>%
But the problem is that there are already some measurements with NAs in the original df in the variable per_i_d (44 in total), so I lose that information.
I thought that I could replace the NAs in the original df with a dummy value and then replace the NAs back, but then I thought that there could be a more efficient solution for all of my problem.
After replacing the NAs, my idea was to use slice(1) to select only the first row of each id/date pair, then do some manipulation with separate/unite to get the output I wanted.
I actually did that, but then I remembered I had those aforementioned NAs in the original df.
df %>%
gather(key,value,-id,-time) %>%
mutate(type = str_extract(key,'[a-z]+_[a-z]'),
age = str_extract(key, '[0-9]+[a-z]+_[a-z]+')) %>%
select(-key) %>%
spread(type,value)
gives
  id       time        age      per_i      sum_d       sum_i
1  1 2009-01-01 15days_old  0.8132301  0.8888928 0.077532040
2  1 2009-01-01   1day_old -2.0993199  2.8817133 3.047894196
3  1 2009-01-01  5days_old -0.4626151 -1.0002926 0.327102000
4  1 2009-01-02 15days_old  0.4089618 -1.6868523 0.866412133
5  1 2009-01-02   1day_old  0.8181313  3.7118065 3.701018419
...
EDIT:
adding non-value columns to the dataframe:
df %>%
gather(key,value,-id,-time) %>%
mutate(type = str_extract(key,'[a-z]+_[a-z]'),
age = str_extract(key, '[0-9]+[a-z]+_[a-z]+'),
info = paste(age,type,sep = "_")) %>%
select(-key) %>%
gather(key,value,-id,-time,-age,-type) %>%
unite(dummy,type,key) %>%
spread(dummy,value)
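The gather/spread pipeline above can probably also be written in one step with tidyr::pivot_longer() (available in tidyr 1.0.0 and later), using the special ".value" name so that each KPI keeps its own column. This is a sketch on a small made-up data frame called wide, not the real data; the regular expression assumes every measurement column ends in a suffix such as _1day_old or _15days_old. NAs that are already present in the measurements are kept, because no filling is involved. Note that the result has the same shape as the answer's output (one age column, one column per KPI) rather than the exact layout requested in the question.

```r
library(tidyr)

# Hypothetical toy data with one pre-existing NA measurement
wide <- data.frame(id   = 1:2,
                   time = as.Date("2009-01-01"),
                   sum_d_1day_old    = c(1, 2),
                   sum_d_5days_old   = c(3, 4),
                   per_i_d_1day_old  = c(NA, 0.5),
                   per_i_d_5days_old = c(0.7, 0.8))

long <- pivot_longer(
  wide,
  cols = -c(id, time),
  # first capture group becomes the column name, second becomes `age`
  names_to = c(".value", "age"),
  names_pattern = "^(.*)_(\\d+days?_old)$"
)
long
```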
This is a continuation from the previous question:
Apply function over every entry one table to every entry of another
I have the following tables loss.tib and bandstib and function bandedlossfn:
library(tidyverse)
set.seed(1)
n <- 5
loss.tib <- tibble(lossid = seq(n),
loss = rbeta(n, 1, 10) * 100)
bandstib <- tibble(bandid = seq(4),
start = seq(0, 75, by = 25),
end = seq(25, 100, by = 25))
bandedlossfn <- function(loss, start, end) {
pmin(end - start, pmax(0, loss - start))
}
It is possible to apply this function over loss.tib using bandstib as arguments:
loss.tib %>%
mutate(
result = map(
loss, ~ tibble(result = bandedlossfn(.x, bandstib$start,
bandstib$end))
)
) %>% unnest
However, I would like to add an index within map as follows:
loss.tib %>%
mutate(
result = map(
loss, ~ tibble(result = bandedlossfn(.x, bandstib$start,
bandstib$end)) %>%
mutate(bandid2 = row_number())
)
) %>% unnest
But it does not seem to work as intended.
I also want to add filter(!near(result,0)) within the map function too for efficient memory management.
The result I'm expecting is:
lossid       loss bandid     result
     1 21.6691088      1 21.6691088
     2  6.9390647      1  6.9390647
     3  0.5822383      1  0.5822383
     4  5.5671643      1  5.5671643
     5 27.8237244      1 25.0000000
     5 27.8237244      2  2.8237244
Thank you.
Here is one possibility.
First nest bandstib and bind it to loss.tib. This way the band id stays attached to your calculations:
bandstib <- tibble(bandid = seq(4),
start = seq(0, 75, by = 25),
end = seq(25, 100, by = 25)) %>%
nest(data = everything())  # tidyr >= 1.0.0 syntax; older versions used nest(.key = "data")
set.seed(1)
n <- 5
result <- tibble(loss = rbeta(n, 1, 10) * 100) %>%
bind_cols(., slice(bandstib, rep(1, n))) %>%
mutate(result = map2(loss, data, ~bandedlossfn(.x, .y$start, .y$end))) %>%
unnest(cols = c(data, result))  # name the list-columns explicitly (tidyr >= 1.0.0)
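An alternative that avoids nesting altogether is to build every loss/band combination with tidyr::crossing() and apply bandedlossfn once over the resulting columns, which carries bandid along and allows the near-zero rows to be filtered out afterwards. This is a sketch with made-up loss values rather than the rbeta draws from the question. Note that it materialises all combinations before filtering, so it does not address the memory concern for very large inputs.

```r
library(dplyr)
library(tidyr)

bandedlossfn <- function(loss, start, end) {
  pmin(end - start, pmax(0, loss - start))
}

# Hypothetical toy losses; the bands are the same as in the question
loss.tib <- tibble(lossid = 1:3, loss = c(10, 27.5, 60))
bandstib <- tibble(bandid = 1:4,
                   start  = seq(0, 75, by = 25),
                   end    = seq(25, 100, by = 25))

banded <- crossing(loss.tib, bandstib) %>%          # every loss with every band
  mutate(result = bandedlossfn(loss, start, end)) %>%
  filter(!near(result, 0)) %>%                      # drop bands the loss never reaches
  select(lossid, loss, bandid, result)
banded
```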
I have data structured like this
set.seed(2)
require(tidyverse)
data <- data.frame("TIME" = c(sample(1:20, 20, replace = F), 21:30),  # seq(21:30) would return 1:10
"ID" = c(rep("A", 10), rep("B", 10), rep("C", 10)),
"LOC" = c(sample(c("X", "Y"), 20, replace = T), c("X", rep("Y", 9))))
and I'm trying to use dplyr to create a variable indicating whether a given ID's change in location (LOC) over time is permanent (PERMANENT = 1) or not (PERMANENT = 0). I can't use first() or last() because they ignore the values in between. For instance, if an ID goes from X to Y and back to X again, as A and B do, the indicator should be 0 for every row of A and B in the data. C, however, starts at X and stays at Y for all later observations.
I tried to use indexing in a mutate function, but something isn't working.
data %>%
arrange(ID, TIME) %>%
group_by(ID)%>%
mutate(LOC = as.character(LOC),
PERMANENT = ifelse(last(LOC) != "X" & any(LOC[2:length(ID) -1]) != "X"), 1, 0)
As I said, the output should indicate that C moved permanently, while A and B bounced around in the data set.
The expected output is what you get if you run the following code:
data$PERMANENT<-ifelse(data$ID%in%c("A","B"),0,1)
Here's a go at it with dplyr. I am assuming an ID is permanent if its location hasn't changed since its second observed time period.
set.seed(2)
data <- data.frame("TIME" = c(sample(1:20, 20, replace = F), 21:30),  # seq(21:30) would return 1:10
"ID" = c(rep("A", 10), rep("B", 10), rep("C", 10)),
"LOC" = c(sample(c("X", "Y"), 20, replace = T), c("X", rep("Y", 9))))
data %>% arrange(ID, TIME) %>%
group_by(ID) %>%
mutate(timeObs = row_number(), SecondLoc = LOC[timeObs == 2], Change = LOC != SecondLoc) %>%
filter(timeObs > 1) %>%
summarize(Permanent = sum(Change) == 0 ) %>%
right_join(data, by = 'ID')
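A different way to express the same idea is to count runs of identical locations with rle(): an ID that moved exactly once and never moved back has exactly two runs. The sketch below is in base R, with a hypothetical helper is_permanent_move and a small made-up data frame moves. The definition of "permanent" used here (exactly two runs after sorting by TIME) is an assumption and differs slightly from the second-observation rule above.

```r
# TRUE when the location changes exactly once and never changes again
is_permanent_move <- function(loc) {
  runs <- rle(as.character(loc))
  length(runs$values) == 2
}

# Hypothetical toy data: A bounces between X and Y, C moves once and stays
moves <- data.frame(TIME = c(1:4, 1:4),
                    ID   = rep(c("A", "C"), each = 4),
                    LOC  = c("X", "Y", "X", "Y",
                             "X", "Y", "Y", "Y"))

# Runs only make sense on time-ordered data
moves <- moves[order(moves$ID, moves$TIME), ]

# One logical per ID, then spread back onto every row of that ID
perm_by_id <- tapply(as.character(moves$LOC), moves$ID, is_permanent_move)
moves$PERMANENT <- as.integer(perm_by_id[as.character(moves$ID)])
moves
```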