Search for value within a range of values in two separate vectors - r

This is my first time posting to Stack Exchange, so my apologies in advance for any mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times, that represent possible 'real' detections
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see if the values in my 'possible' data frame lie between the start and stop values. If so, I'd like to create a third column in "possible" called "between" and a column in the "truth" data frame called "match". For every value from possible that falls between a start and stop I'd like a 1, otherwise a 0. For all of the rows in "truth" that find a match I'd like a 1, otherwise a 0.
Neither ID nor SNR is important. I'm not looking to match on ID. Instead I want to run through the data frame entirely. Output should look something like:
ID Times Between
1 32239.76 0
2 32241.14 1
3 68138.72 0
4 111233.93 0
5 128395.28 0
6 146180.31 1
7 188433.35 0
8 198714.7 0
Alternatively, knowing if any of my 'possible' time values fall within 2 seconds of start or end times would also do the trick (also with 1/0 outputs)
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.

I think this can be conceptualised as a rolling join in data.table. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
library(data.table)
setDT(truth)
setDT(possible)
melt(truth, measure.vars=c("start","stop"), value.name="times")[
possible, on="times", roll=TRUE
][, .(id=i.id, truthid=id, times, status=factor(variable, labels=c("in","out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
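For readers without data.table, the same idea can be sketched in base R with findInterval. This is only a sketch under two assumptions: the intervals in truth are sorted and do not overlap. The names bounds and idx are mine, not part of the answer above. A time exactly equal to a stop value is reported as "out" here.

```r
truth <- data.frame(id = 1:5,
                    start = c(1, 7, 12, 17, 22),
                    stop  = c(5, 10, 15, 20, 26))
possible <- data.frame(id = 1:4, times = c(3, 11, 13, 28))

# Interleave the boundaries: start1, stop1, start2, stop2, ...
bounds <- as.vector(rbind(truth$start, truth$stop))

# Index of the last boundary at or before each time (0 if before the first start)
idx <- findInterval(possible$times, bounds)

# An odd index means the last boundary passed was a start, so the time is inside
possible$status <- ifelse(idx %% 2 == 1, "in", "out")

# Row of truth that owns that boundary; NA when the time precedes every interval
possible$truthid <- ifelse(idx == 0, NA, truth$id[pmax(ceiling(idx / 2), 1)])
possible
```

On the simplified example this gives the same in/out/in/out pattern and the same truthid values (1, 2, 3, 5) as the rolling join.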

I'll post a solution that I'm pretty sure works like you want it to in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data - next time please provide this from your own data set in your post using the function dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
#generate random test data
set.seed(7)
truth <- data.frame(c(1:100),
c(sample(5:20, size = 100, replace = T)),
c(sample(21:50, size = 100, replace = T)))
possible <- data.frame(c(sample(1:15, size = 15, replace = F)))
colnames(possible) <- "Times"
After getting sample data to work with, the following solution provides what I believe you are asking for. This should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
#need the %between% operator
library(data.table)
#initialize vectors - 0 or false by default
truth.match <- c(rep(0, times = nrow(truth)))
possible.between <- c(rep(0, times = nrow(possible)))
#iterate through 'possible' dataframe
for (i in 1:nrow(possible)){
#get boolean vector to show if any of the 'truth' rows are a 'match'
match.vec <- apply(truth[, 2:3],
MARGIN = 1,
FUN = function(x) {possible$Times[i] %between% x})
#if any are true then update the match and between vectors
if(any(match.vec)){
truth.match[match.vec] <- 1
possible.between[i] <- 1
}
}
#i think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
#similarly; betweenAny
possible$betweenAny <- possible.between
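The loop above can also be written without iteration. The following is a minimal base R sketch using the data from the question, not a drop-in replacement: hit and tol are names I introduced, and tol is a hypothetical tolerance in seconds for the "within 2 seconds" variant (set to 0 for a strict between test). It builds a full times-by-intervals matrix, so it may use a lot of memory on large inputs.

```r
truth <- data.frame(ID = 1:5,
                    Start = c(213466, 32238, 218934, 222774, 68137),
                    Stop  = c(213468, 32240, 218936, 222776, 68139))
possible <- data.frame(ID = 1:8,
                       Times = c(32239.76, 32241.14, 68138.72, 111233.93,
                                 128395.28, 146180.31, 188433.35, 198714.7))

tol <- 0  # assumption: widen each interval by this many seconds, e.g. 2

# hit[i, j] is TRUE when possible$Times[i] falls inside the interval of truth row j
hit <- outer(possible$Times, truth$Start - tol, ">=") &
       outer(possible$Times, truth$Stop + tol, "<=")

possible$between <- as.integer(rowSums(hit) > 0)  # time matched at least one interval
truth$match      <- as.integer(colSums(hit) > 0)  # interval caught at least one time
```

With tol set to 0 this flags the first and third times (32239.76 and 68138.72), which is what the Start and Stop values imply, even though the illustrative table in the question marks different rows.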

Related

Multiply various subsets of a data frame by different elements of a vector R

I have a data frame:
df<-data.frame(id=rep(1:10,each=10),
Room1=rnorm(100,0.4,0.5),
Room2=rnorm(100,0.3,0.5),
Room3=rnorm(100,0.7,0.5))
And a vector:
vals <- sample(7:100, 10)
I want to multiply cols Room1, Room2 and Room3 by a different element of the vector for every unique ID number and output a new data frame (df2).
I managed to multiply each column per id by EVERY element of the vector using the following:
samp_func <- function(x) {
x*vals[i]
}
for (i in vals) {
df2 <- df %>% mutate_at(c("Room1", "Room2", "Room3"), samp_func)
}
But in the resulting data frame (df2), each Room column is multiplied by the same element of the vector (vals) for all of the ids. What I want is each Room column (per id) multiplied by a different element of vals. Sorry in advance if this is not clear; I am a beginner and still getting to grips with the terminology.
Thanks!
EDIT: The desired output should look like the below, where the columns for each ID have been multiplied by a different element of the vector vals.
id Room1 Room2 Room3
1 1 24.674826880 60.1942571 46.81276141
2 1 21.970270107 46.0461779 35.09928150
3 1 26.282357614 -3.5098880 38.68400541
4 1 29.614182061 -39.3025587 25.09146592
5 1 33.030886472 46.0354881 42.68209027
6 1 41.362699668 -23.6624632 26.93845129
7 1 5.429031042 26.7657577 37.49086963
8 1 18.733422977 -42.0620572 23.48992138
9 1 -17.144070723 9.9627315 55.43999326
10 1 45.392182468 20.3959968 -16.52166621
11 2 30.687978299 -11.7194020 27.67351631
12 2 -4.559185345 94.9256561 9.26738357
13 2 86.165076849 -1.2821515 29.36949423
14 2 -12.546711562 47.1763755 152.67588456
15 2 18.285856423 60.5679496 113.85971720
16 2 72.074929648 47.6509398 139.69051486
17 2 -12.332519694 67.8890324 20.73189965
18 2 80.889634991 69.5703581 98.84404415
19 2 87.991093995 -20.7918559 106.13610773
20 2 -2.685594148 71.0611693 47.40278949
21 3 4.764445589 -7.6155681 12.56546664
22 3 -1.293867841 -1.1092243 13.30775785
23 3 16.114831628 -5.4750642 8.58762550
24 3 -0.309470950 7.0656088 10.07624289
25 3 11.225609780 4.2121241 16.59168866
26 3 -3.762529113 6.4369973 15.82362705
27 3 -5.103277731 0.9215625 18.20823042
28 3 -10.623165177 -5.2896293 33.13656839
29 3 -0.002517872 5.0861361 -0.01966699
30 3 -2.183752881 24.4644310 13.55572730
This should solve your problem. You can build a lookup table that pairs each id with one element of vals, join it to df on id, and then use mutate to rescale the Room columns.
Also, in the future I'd recommend setting a seed when asking questions with random data as it's easier for someone to replicate your output.
library(dplyr)
set.seed(0)
df<-data.frame(id=rep(1:10,each=10),
Room1=rnorm(100,0.4,0.5),
Room2=rnorm(100,0.3,0.5),
Room3=rnorm(100,0.7,0.5))
vals <- sample(7:100, 10)
# one row per id, so the join does not duplicate rows of df
other_df <- data.frame(id = 1:10,
val = vals)
df2 <- inner_join(other_df, df, by = "id")
df2 <- df2 %>%
mutate(Room1 = Room1*val,
Room2 = Room2*val,
Room3 = Room3*val)
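If you would rather avoid the join, here is a rough base R sketch of the same per-id scaling. It assumes the i-th element of vals belongs to the i-th distinct id; the names ids, rooms and multiplier are mine.

```r
set.seed(0)
df <- data.frame(id = rep(1:10, each = 10),
                 Room1 = rnorm(100, 0.4, 0.5),
                 Room2 = rnorm(100, 0.3, 0.5),
                 Room3 = rnorm(100, 0.7, 0.5))
vals <- sample(7:100, 10)

ids   <- unique(df$id)
rooms <- c("Room1", "Room2", "Room3")

# One multiplier per row: the element of vals that belongs to that row's id
multiplier <- vals[match(df$id, ids)]

df2 <- df
# A vector as long as nrow(df) is applied down every selected column
df2[rooms] <- df[rooms] * multiplier
head(df2)
```

Using match rather than vals[df$id] keeps the lookup working when the ids are not simply 1 to 10.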

Creating sublists from one bigger list

I am writing my Thesis in R and I would like, if possible, some help in a problem that I have.
I have a table called tkalp with 2 columns and 3001 rows. After a 'subset' command that I wrote, the table now contains 1084 rows and is called kp. Some values of kp are:
As you can see, some values in column V1 are consecutive with step = 2 and some are not.
So my difficulty is:
1. I would like to 'break' this big list/table into smaller lists/tables that contain only continuous numbers. For this difficulty, I tried to implement it with these commands but it didn't go as planned:
for (n in 1:nrow(kp)) {
kp1 <- subset(kp, kp[n+1,1] - kp[n,1])==2)
}
2. After completing this task I would like to keep only the sublists that contain more than 10 rows.
Any idea or help is more than welcome! Thank you very much
EDIT
I have uploaded a picture of my table and I have separated the numbers that I want to be contained in different tables. And I would like to do that for all the original table.
blue is one smaller table than the original
black another
yellow another
red another
And after I create all those smaller tables I would like to keep only the tables that contain more than 10 numbers. For example I don't want to keep the yellow table since it contains only 4 numbers.
Thank you again
What about
df <- data.frame(V1=c(1,3,5,10,12,14, 20, 22), V2=runif(8))
df$diff <- c(2,diff(df$V1))
df$numSubset <- cumsum(df$diff != 2) + 1
iter <- seq(max(df$numSubset))
listOfSubsets <- purrr::map(iter, function(i) dplyr::filter(df, numSubset == i))
Then you loop through the list and select only those you want. Btw purrr also provides a means to filter the list you get without looping. Check the documentation of purrr.
With base R
kp=data.frame(V1=c(seq(8628,8618,by=-2),seq(8576,8566,by=-2),78,76),V2=runif(14))
kp$diffV1=c(-2,diff(kp$V1))
kp$group=cumsum(ifelse(kp$diffV1/-2==1,0,1))+1
lkp=split(kp,kp$group)
# > kp
# V1 V2 diffV1 group
# 1 8628 0.74304325 -2 1
# 2 8626 0.84658101 -2 1
# 3 8624 0.74540089 -2 1
# 4 8622 0.83551473 -2 1
# 5 8620 0.63605222 -2 1
# 6 8618 0.92702915 -2 1
# 7 8576 0.81978587 -42 2
# 8 8574 0.01661538 -2 2
# 9 8572 0.52313859 -2 2
# 10 8570 0.39997951 -2 2
# 11 8568 0.61444445 -2 2
# 12 8566 0.23570017 -2 2
# 13 78 0.58397923 -8488 3
# 14 76 0.03634809 -2 3
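The second part of the question, keeping only the sublists with more than 10 rows, can be sketched in base R with Filter. The toy data below is made up for illustration (one run of 12 rows and one of 4), and the names group, runs and long_runs are mine. If you already use purrr, purrr::keep does the same job as Filter.

```r
# Hypothetical data: a descending run of 12 values, then a run of 4
kp <- data.frame(V1 = c(seq(100, 78, by = -2), seq(50, 44, by = -2)),
                 V2 = seq_len(16))

# Start a new group whenever the step from the previous row is not -2
group <- cumsum(c(TRUE, diff(kp$V1) != -2))

# One data frame per run of consecutive values
runs <- split(kp, group)

# Keep only the runs with more than 10 rows
long_runs <- Filter(function(d) nrow(d) > 10, runs)
length(long_runs)
```

For ascending data, compare against 2 instead of -2, or use abs(diff(kp$V1)) != 2 to accept both directions.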

R loop through multiple sub groups with using functions

Hi I am trying to learn how to loop through multiple groups within a data frame and apply certain arithmetic operations. I do not have a programming background and am struggling to loop through the multiple conditions.
My data looks like the following:
Event = c(1,1,1,1,1,2,2,2,2,2)
Indiv1=c(4,5,6,11,45,66,8,9,32,45)
Indiv2=c(7,81,91,67,12,34,56,78,90,12)
Category=c(1,1,2,2,2,1,2,2,1,1)
Play_together=c(1,0,1,1,1,1,1,1,0,1)
Money=c(23,11,78,-9,-12,345,09,43,21,90)
z = data.frame(Event,Indiv1,Indiv2,Category,Play_together,Money)
What I would like to do is to look through each event and each category and take the average value of Money in cases where Play_together == 1. When Play_together==0, then I would like to apply Money/100.
I understand that the loop would look something like the following:
for i in 1:nrow(z){
#loop for event{
#loop for Category{
#Define avg or division function
}
}
}
However, I cannot seem to implement this using a nested loop. I saw another post (link: apply function for each subgroup) which uses dplyr package. I was wondering if someone could help me to implement this without using any packages (I know this might take longer as compared to using R packages). I am trying to learn R and this is the first time I am working with nested loops.
The final output will look like this:
where for event 1, the following holds:
a) For category 1:
Play_together ==1 in row 1; we take the avg of Money value and hence final output = 23/1= 23
Play_together==0 in row 2; we take Money/100= 0.11
b) For category 2:
Play_together == 1 for all observations. We take avg Money for all three observations.
This holds similarly for Event 2. In my actual dataset, I have event = 600 and number of category ranging from 1 - 10. Some events may have only 1 category and a maximum of 10 categories. So any function needs to be extremely flexible. The total number of observations in my dataset is around 1.5 million so any changes in the looping process to reduce the time taken to carry out the operation is going to be extremely helpful (Although at this stage my priority is the looping process itself).
Once again it would be a great help if you can show me how to use nested looping and explain the steps in brief. Much appreciated.
Will something like this do?
I know it uses dplyr, but that package is made for this kind of job ;-)
Event = c(1,1,1,1,1,2,2,2,2,2)
Indiv1=c(4,5,6,11,45,66,8,9,32,45)
Indiv2=c(7,81,91,67,12,34,56,78,90,12)
Category=c(1,1,2,2,2,1,2,2,1,1)
Play_together=c(1,0,1,1,1,1,1,1,0,1)
Money=c(23,11,78,-9,-12,345,09,43,21,90)
z = data.frame(Event,Indiv1,Indiv2,Category,Play_together,Money)
library(dplyr)
df_temp <- z %>%
group_by( Event, Category, Play_together ) %>%
summarise( money_mean = mean( Money ) ) %>%
mutate( final_output = ifelse( Play_together == 0, money_mean / 100, money_mean )) %>%
select( -money_mean )
df <- z %>%
left_join(df_temp, by = c("Event", "Category", "Play_together" )) %>%
arrange(Event, Category)
Consider base R's by, the object-oriented wrapper to tapply designed to subset dataframes by factor(s) but unlike split can pass subsets into a defined function. Then, run conditional logic with ifelse for Final_Output field. Finally, stack all subsetted dataframes for final object.
# LIST OF DATAFRAMES
by_list <- by(z, z[c("Event", "Category")], function(sub) {
tmp <- subset(sub, Play_together==1)
sub$Final_Output <- ifelse(sub$Play_together == 1, mean(tmp$Money), sub$Money/100)
return(sub)
})
# APPEND ALL DATAFRAMES
final_df <- do.call(rbind, by_list)
row.names(final_df) <- NULL
final_df
# Event Indiv1 Indiv2 Category Play_together Money Final_Output
# 1 1 4 7 1 1 23 23.00
# 2 1 5 81 1 0 11 0.11
# 3 2 66 34 1 1 345 217.50
# 4 2 32 90 1 0 21 0.21
# 5 2 45 12 1 1 90 217.50
# 6 1 6 91 2 1 78 19.00
# 7 1 11 67 2 1 -9 19.00
# 8 1 45 12 2 1 -12 19.00
# 9 2 8 56 2 1 9 26.00
# 10 2 9 78 2 1 43 26.00
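Since the question asks specifically for nested loops without packages, here is one possible sketch. It is written for clarity rather than speed and will likely be slower than by or dplyr on 1.5 million rows. The loop variables ev and ct and the masks in_group, played and solo are names I chose.

```r
z <- data.frame(Event         = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                Category      = c(1, 1, 2, 2, 2, 1, 2, 2, 1, 1),
                Play_together = c(1, 0, 1, 1, 1, 1, 1, 1, 0, 1),
                Money         = c(23, 11, 78, -9, -12, 345, 9, 43, 21, 90))
z$Final_Output <- NA_real_

for (ev in unique(z$Event)) {                      # outer loop: one event at a time
  for (ct in unique(z$Category[z$Event == ev])) {  # inner loop: categories seen in this event
    in_group <- z$Event == ev & z$Category == ct

    # Rows that played together share the group's average Money
    played <- in_group & z$Play_together == 1
    z$Final_Output[played] <- mean(z$Money[played])

    # Rows that did not play together get their own Money / 100
    solo <- in_group & z$Play_together == 0
    z$Final_Output[solo] <- z$Money[solo] / 100
  }
}
z
```

Looping over unique values instead of over rows means the inner body runs once per event and category pair, however many rows each pair has.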

If (condition), add 1 to previous value, else, subtract 1

I'm tracking Meals and satiety in a dataframe. I would like to have R add 1 to the previous value in the satiety column when a meal is eaten, and subtract 1 when no meal is eaten (meal=NA).
I'm trying to accomplish this with a for loop nested in an ifelse statement but it is not working.
My current attempt:
ifelse(Meals=="NA",for (i in 1:length(Day$Fullness)){
print(Day$Fullness[[i]]-1+i)}, for (i in 1:length(Day$Fullness)){
print(Day$Fullness[[i]]+1+i)}
Error: Error in ans[test & ok] <- rep(yes, length.out = length(ans))
[test & ok] :
replacement has length zero
In addition: Warning message:
In rep(yes, length.out = length(ans)) :
'x' is NULL so the result will be NULL
I'm not sure how to create a table on here but I will do my best to make sense.
Time: 9:30 AM 10:00 AM 10:30 AM ETC
Meals: NA NA Breakfast NA NA Snack NA NA NA ETC
Satiety: Range from 0-10.
My current satiety data is just a vector I created, but I would like it to start at 0 and increase by 1 after every meal, while decreasing by 1 after every 30 minute timeframe where there is no meal(where meal= NA).
I'm sure there is a much better way to do this.
Thank you.
Here's some sample data and a potential solution.
set.seed(123)
meals <- sample(c(1, 1, 1, NA), 20, replace = TRUE)
df <- data.frame(meals = meals)
head(df)
# meals
# 1 1
# 2 NA
# 3 1
# 4 NA
# 5 NA
# 6 1
df$meals[is.na(df$meals)] <- -1
df$satiety <- cumsum(df$meals)
head(df)
# meals satiety
# 1 1 1
# 2 -1 0
# 3 1 1
# 4 -1 0
# 5 -1 -1
# 6 1 0
tail(df)
# meals satiety
# 15 1 5
# 16 -1 4
# 17 1 5
# 18 1 6
# 19 1 7
# 20 -1 6
I would suggest not coding the absence of a meal (or a skipped meal) as NA, which means "I don't know". If you're using NA to mean the meal was skipped, then you do actually know and you should give it something that represents a skipped meal. Here, since your model interprets a skipped meal as having a negative impact on satiety (not a neutral impact), -1 actually makes quite a lot of sense. If that's how you use it in your model, then code it that way.
A couple of things here.
Unless the data includes the string "NA", you should use the command is.na(x) to check if a value or values are NA. It's hard to tell however without sample data.
Generally speaking, in R you will want to use vectorised solutions. In many cases, if you're using a for loop, it's incorrect.
You've stated that "Meals" is in a dataframe. As such, you will need to refer to Meals as a subset of that data frame. For example, if the data frame is data, then the expression should be data$Meals.
Summarising all of this, I'd probably do something similar to the following:
Day$Meals.na <- is.na(Day$Meals)
print(Day$Fullness + (-1)^Day$Meals.na)
This uses a nice trick: in arithmetic, TRUE and FALSE are treated as 1 and 0 respectively, so (-1)^Day$Meals.na is -1 where the meal is NA and 1 otherwise.
Hopefully this helps. If not, we'd really need sample data and expected outputs to be able to be of more use.
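As a rough illustration of the trick on made-up data, and combining it with cumsum to get a running satiety that starts from 0 (the meals vector and the names step and satiety are assumptions of mine, not the asker's data):

```r
# Hypothetical half-hour slots; NA means no meal was eaten in that slot
meals <- c(NA, NA, "Breakfast", NA, NA, "Snack")

# is.na() gives TRUE/FALSE, and (-1)^TRUE is -1 while (-1)^FALSE is 1
step <- (-1)^is.na(meals)

# Running total of the +1/-1 steps
satiety <- cumsum(step)
data.frame(meals, step, satiety)
```

This sketch does not keep satiety inside the 0 to 10 range mentioned in the question; that would need an extra clamping step.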

select records according to the difference between records R

I hope someone could suggest me something for this "problem", because I really don't know how to proceed...
Well, my data are like this
data<-data.frame(site=c(rep("A",3),rep("B",3),rep("C",3)),time=c(100,180,245,5,55,130,70,120,160))
where time is in minute.
I want to select only the records, for each site, for which the difference is more than 60, so the output should be Like this:
out<-data[c(1:4,6,7,9),]
What I have tried so far. Well,to get the difference I use this:
difference<-stack(tapply(data$time,data$site,diff))
but then, no idea how to pick up those records which satisfied my condition...
If there is already a similar question, although I've searched for a while, I apologize for this.
To make things clear, as probably the definition of difference was not so unambiguous, I need to select all the records (for each site) which are separated at least by 60 minutes, so not only those that are strictly subsequent in time.
Specifically,
> out
site time
1 A 100#included because difference between 2 and 1 is>60
2 A 180#included because difference between 3 and 2 is>60
3 A 245#included because separated by 60 minutes before record#2
4 B 5#included because difference between 6 and 4 is>60
6 B 130#included because separated by 60 minutes before record#4
7 C 70#included because difference between 9 and 7 is>60
9 C 160#included because separated by 60 minutes before record#7
May be to solve the "problem", it could be useful to consider the results of the difference, something like this:
> difference
values ind
1 80 A#include record 1 and 2
2 65 A#include record 2 and 3
3 50 B#include only record 4
4 75 B#include record 6 because there are(50+75)>60 m from r#4
5 50 C#include only record 7
6 40 C#include record 9 because there are (50+40)>60 m from r#7
Thanks for the help.
data[ave(data$time, data$site, FUN = function(x){c(61, diff(x)) > 60}) == 1, ]
# site time
# 1 A 100
# 2 A 180
# 3 A 245
# 4 B 5
# 6 B 130
# 7 C 70
Edit following updated question:
keep <- as.logical(ave(data$time, data$site, FUN = function(x){
c(TRUE, cumsum(diff(x)) > 60)
}))
data[keep, ]
# site time
# 1 A 100
# 2 A 180
# 3 A 245
# 4 B 5
# 6 B 130
# 7 C 70
# 9 C 160
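Note that cumsum(diff(x)) measures the distance from the first record of each site, not from the last record that was kept. If what you need is "more than 60 minutes after the previously kept record", a small loop is one way to sketch it. This assumes time is sorted in ascending order within each site; keep_spaced and gap are names I made up.

```r
data <- data.frame(site = c(rep("A", 3), rep("B", 3), rep("C", 3)),
                   time = c(100, 180, 245, 5, 55, 130, 70, 120, 160))

# TRUE for each time that is more than `gap` after the last kept time
keep_spaced <- function(x, gap = 60) {
  keep <- logical(length(x))
  last <- -Inf                 # nothing kept yet, so the first record always passes
  for (i in seq_along(x)) {
    if (x[i] - last > gap) {
      keep[i] <- TRUE
      last <- x[i]             # later records are measured from this one
    }
  }
  keep
}

# ave() applies the function per site and returns 0/1 in the original row order
keep <- as.logical(ave(data$time, data$site, FUN = keep_spaced))
data[keep, ]
```

On the example data this selects the same rows as the cumsum version (1, 2, 3, 4, 6, 7, 9); the two can differ on longer series where several records are dropped in a row.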
#Calculate the differences
data$diff <- unlist(by(data$time, data$site,function(x)c(NA,diff(x))))
#subset data
data[is.na(data$diff) | data$diff > 60,]
Using plyr:
ddply(data,.(site),function(x)x[c(TRUE , diff(x$time) >60),])
