R - add column based on intervals in separate data frame


I have the following data frames:
DF <- data.frame(Time=c(1:20))
StartEnd <- data.frame(Start=c(2,6,14,19), End=c(4,10,17,20))
I want to add a column "Activity" to DF that flags whether the value in the Time column lies within one of the intervals specified in the StartEnd data frame.
I came up with the following:
mapply(FUN = function(Start,End) ifelse(DF$Time >= Start & DF$Time <= End, 1, 0),
Start=StartEnd$Start, End=StartEnd$End)
This doesn't give me the output I want: it returns a matrix with four columns (one per interval), whereas I would like a single vector that I can add to DF.
I guess the solution is easy but I'm not seeing it :) Thank you in advance.
EDIT: I'm sure I can use a loop but I'm wondering if there are more elegant solutions.

You can achieve this with
DF$Activity <- sapply(DF$Time, function(x) {
ifelse(sum(ifelse(x >= StartEnd$Start & x <= StartEnd$End, 1, 0)), 1, 0)
})
I hope this helps!
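The matrix from the question's mapply call can also be collapsed directly, since each of its columns answers "is Time inside interval k?". A minimal base R sketch, assuming the DF and StartEnd objects from the question (the helper name hits is made up for illustration):

```r
DF <- data.frame(Time = 1:20)
StartEnd <- data.frame(Start = c(2, 6, 14, 19), End = c(4, 10, 17, 20))

# One logical column per interval: TRUE where Time falls inside that interval
hits <- mapply(function(Start, End) DF$Time >= Start & DF$Time <= End,
               StartEnd$Start, StartEnd$End)

# A row is "active" if it falls inside at least one interval
DF$Activity <- as.integer(rowSums(hits) > 0)
```

This keeps the vectorised comparison from the question and only adds the row-wise reduction that was missing.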

If you're using the tidyverse, I think a good way to go would be with purrr::map2:
library(dplyr)
library(purrr)
# generate a sequence (n, n + 1, etc.) for each StartEnd row
# (map functions return a list; purrr::flatten_int or unlist can
# squash this down to a vector!)
activity_times = map2(StartEnd$Start, StartEnd$End, seq) %>% flatten_int
# then get a new DF column that is TRUE if Time is in activity_times
DF %>% mutate(active = Time %in% activity_times)

Related

Comparing each row of one dataframe with a row in another dataframe using R

I'm relatively new to R and I have looked for an answer for my problem but didn't find one. I want to compare two dataframes.
library(dplyr)
library(gtools)
v1 <- LETTERS[1:10]
combinations_from_4_letters <- (as.data.frame(combinations(n = 10, r = 4, v = v1),
stringsAsFactors = FALSE))
combinations_from_4_letters$group <- rep(1:15, each = 14)
combinations_from_2_letters <- (as.data.frame(combinations(n = 10, r = 2, v = v1),
stringsAsFactors = FALSE))
Dataframe 'combinations_from_4_letters' contains all combinations that can be made from 10 letters without repetitions and permutations. The combinations are binned into groups from 1-15. I want to find out how often pairs of the 10 letters (saved in dataframe 'combinations_from_2_letters') are found in each group (basically a frequency table). I started doing a complicated loop looping through both dataframes but I think there must be a more 'R' solution to it, similar to comparing a dataframe and a vector like:
combinations_from_4_letters %in% combinations_from_2_letters[i,]
Thank you in advance for your help!

I recommend an approach like the following:
# adding dummy column for a complete cross-join
combinations_from_4_letters = combinations_from_4_letters %>%
mutate(ones = 1)
combinations_from_2_letters = combinations_from_2_letters %>%
mutate(ones = 1)
joined = combinations_from_2_letters %>%
inner_join(combinations_from_4_letters, by = "ones") %>%
# comparison goes here
mutate(within = ifelse(comb2 %in% comb4, 1, 0)) %>%
group_by(comb2) %>%
summarise(freq = sum(within))
You'll probably need to modify this so that it matches your exact column names and comparison condition (comb2 and comb4 are placeholders for the pair and the four-letter combination).
Key ideas:
adding filler column so we have a complete cross-join
mutate a new indicator column for whether the two letter pair is within the four letter pair
sum indicators on the two letter pair
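The key ideas above can be sketched in base R as follows. This is only an illustration under stated assumptions: base combn stands in for gtools::combinations to avoid the dependency, and the column names p1, p2, f1 to f4 and has_pair are made up for the example.

```r
v1 <- LETTERS[1:10]

# All 4-letter and 2-letter combinations, one combination per row
fours <- as.data.frame(t(combn(v1, 4)), stringsAsFactors = FALSE)
names(fours) <- c("f1", "f2", "f3", "f4")
fours$group <- rep(1:15, each = 14)

pairs <- as.data.frame(t(combn(v1, 2)), stringsAsFactors = FALSE)
names(pairs) <- c("p1", "p2")

# Complete cross-join: every pair against every 4-letter combination
crossed <- merge(pairs, fours, by = NULL)

# Indicator: both letters of the pair occur in the 4-letter combination
four_mat <- as.matrix(crossed[, c("f1", "f2", "f3", "f4")])
crossed$has_pair <- rowSums(four_mat == crossed$p1) > 0 &
  rowSums(four_mat == crossed$p2) > 0

# Sum the indicator per pair and group to get the frequency table
freq <- aggregate(has_pair ~ p1 + p2 + group, data = crossed, FUN = sum)
```

Here merge(..., by = NULL) produces the Cartesian product, so no dummy "ones" column is needed.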

R Get percentage difference from previous row

I have a dataset with 4 columns: the first is a date and the other 3 are numeric. I am trying to get the % difference from the previous row for those 3 numeric columns. I know there are already some posts about this kind of question, suggesting df %>% mutate_each(funs(. - lag(.))) %>% na.omit(), but most of them do not handle the date column: I want the date to stay unchanged, and I need the % difference rather than the plain difference.
Here is the dataset:
date=c('2018-01-01', '2018-02-01', '2018-03-01')
a=c(1,3,2)
b=c(89,56,47)
c=c(1872,7222,2930)
x=data.frame(date,a,b,c)
I wish to have the final dataset like this:
a=c(NA, 2, -0.333)
b=c(NA, -0.371, -0.161)
c=c(NA,2.857, -0.594)
x=data.frame(date,a,b,c)
which means for col A, 2=3/1-1, -0.333=2/3-1
for col B, -0.371=56/89-1 etc
Thank you so much for your help!

A solution using package data.table:
library(data.table)
x = as.data.table(x)
cols = c("a", "b", "c")
x[,(paste0(cols, "_pctChange")) := lapply(.SD, function(col){
(col-shift(col,1,type = "lag"))/shift(col,1,type = "lag")
}), .SDcols=cols]

The quantmod package has a very useful function for exactly this, called Delt().
All you would need to do is the following:
library(quantmod)
x[-1] <- sapply(x[-1], Delt)
I'm not sure how familiar you are with sapply, but if you wanted to access Delt()'s parameters to tweak your calculation, you could try something like:
x[-1] <- sapply(x[-1], function(x) { Delt(x, k=2) })
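For comparison, the same calculation can be written in base R without extra packages. A minimal sketch, assuming the example data from the question; the helper pct_change and the vector num_cols are hypothetical names introduced here:

```r
date <- c("2018-01-01", "2018-02-01", "2018-03-01")
x <- data.frame(date,
                a = c(1, 3, 2),
                b = c(89, 56, 47),
                c = c(1872, 7222, 2930))

# Relative change versus the previous row; the first row has no predecessor
pct_change <- function(v) c(NA, v[-1] / v[-length(v)] - 1)

# Apply only to the numeric columns so the date column stays unchanged
num_cols <- c("a", "b", "c")
x[num_cols] <- lapply(x[num_cols], pct_change)
```

Selecting the columns by name rather than with x[-1] keeps the sketch working if the date column is not the first one.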

R function or loop that could go through a binary variable (1 and 0) in a dataframe and returns a third variable (y) value from a different column

I need some help. I am trying to build a function or a loop in R that goes through a binary variable (1 and 0) in a data frame so that every time a 1 is followed by a 0, I can save the value of a third variable (y) from the row where the 0 occurs. I tried a couple of options based on previous posts, but nothing gives me anything close to that.
My data looks a bit like that:
ID <- rep(1001, 5)
variable <- c(1, 1, 0, 1, 0)
y <- c(10, 20, 30, 40, 50)
df <- cbind(ID, variable, y)
In this case, for example, the answer would give me a vector with the y values 30 and 50. Sorry if someone already has answered that, I could not find something similar. Thanks a lot!

Here's a 'vectorial' solution. Basically, I paste together variable in position i and i+1. Then I check to see if the combination is "10". The position you want is actually the next one (i.e. i+1), so we add 1.
df <- data.frame(ID, variable, y)
idx <- which(paste0(df$variable[-nrow(df)], df$variable[-1]) == "10") + 1
df$y[idx]

Here is an approach with tidyverse:
library(tidyverse)
df %>%
as_tibble() %>%
mutate(y1 = ifelse(lag(variable) == 1 & variable == 0, y, NA)) %>%
pull(y1)
#output
[1] NA NA 30 NA 50
and in base R:
ifelse(c(NA, df[-nrow(df),2]) == 1 & df[, 2] == 0, df[, 3], NA)
If the lag of variable is 1 and the variable is 0, then return y, else return NA.
If you would like to remove the NAs, wrap it in na.omit.
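As a small illustration of the na.omit step, here is a sketch that assumes the matrix df built with cbind in the question; the names res and y_hits are made up for the example:

```r
ID <- rep(1001, 5)
variable <- c(1, 1, 0, 1, 0)
y <- c(10, 20, 30, 40, 50)
df <- cbind(ID, variable, y)

# y where the previous row is 1 and the current row is 0, NA elsewhere
res <- ifelse(c(NA, df[-nrow(df), 2]) == 1 & df[, 2] == 0, df[, 3], NA)

# na.omit drops the NAs; as.vector strips the bookkeeping attributes it adds
y_hits <- as.vector(na.omit(res))
```

For the example data, y_hits holds the two values the question asks for, 30 and 50.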

Time difference between dates of adjacent columns in R

I'm trying to adapt the answer to my previous question (Difference between dates in many columns in R). I've realised I only want the time difference between a given column and the column immediately to its left. Example for clarification:
df <- data.frame(
Group=c("A","B"),
ID=c(1,2),
Date1=as.POSIXct(c('2016-04-25 09:15:29','2016-04-25 09:15:29')),
Date2=as.POSIXct(c('2016-04-25 14:01:19','2016-04-25 14:01:19')),
Date3=as.POSIXct(c('2016-04-26 13:28:19','2016-04-26 13:28:19')),
stringsAsFactors=F
)
My desired output is Date2-Date1 and Date3-Date2. And this of course would extend for many columns i.e. Date4-Date3 etc. But I do not need Date3-Date1. To clarify, how can I automate this for many columns
df$Date2_Date1 <- difftime(df$Date2,df$Date1, units = c("hours"))
df$Date3_Date2 <- difftime(df$Date3,df$Date2, units = c("hours"))
Thanks to @bgoldst for the original answer. I think I just need to adapt cmb below to have the correct sequence:
cmb <- combn(seq_len(ncol(df)-1L)+1L,2L);
res <- abs(apply(cmb,2L,function(x) difftime(df[[x[1L]]],df[[x[2L]]],units='hours')));
colnames(res) <- apply(cmb,2L,function(x,cns) paste0(cns[x[1L]],'_',cns[x[2L]]),names(df))
Thanks

Given your example, this should do the trick:
df <- data.frame(
Group=c("A","B"),
ID=c(1,2),
Date1=as.POSIXct(c('2016-04-25 09:15:29','2016-04-25 09:15:29')),
Date2=as.POSIXct(c('2016-04-25 14:01:19','2016-04-25 14:01:19')),
Date3=as.POSIXct(c('2016-04-26 13:28:19','2016-04-26 13:28:19')),
stringsAsFactors=F
)
mapply(difftime, df[, 4:5], df[, 3:4], units = "hours")
> Date2 Date3
> [1,] 4.763889 23.45
> [2,] 4.763889 23.45
In my call mapply applies function difftime to the two arrays provided, so it starts with df[, 4] - df[, 3], then df[, 5] - df[, 4]. You of course have to change this with the column numbers for your dates, and make sure they are ordered in the right way.
Good luck!

You could use Non-Standard Evaluation:
First you create a character vector with the names of the columns containing the dates. So let's say all the columns starting with 'Date':
dates = names(df)[grepl("^Date", names(df))]
We create a list of formulas that dynamically calculate the difference between two adjacent columns:
all_operations = lapply(seq_len(length(dates) - 1), function(i){
as.formula(paste("~difftime(", dates[i + 1], ",", dates[i],", units = c('hours'))"))
})
this will create the formulas:
[[1]]: ~difftime(Date2, Date1, units = c("hours"))
[[2]]: ~difftime(Date3, Date2, units = c("hours"))
Then you can use dplyr's NSE mutate_ to apply the dynamic formulas generated above (note that mutate_ is deprecated in newer dplyr releases):
df %>%
mutate_(.dots = setNames(all_operations, paste0("Diff", seq_len(length(dates) - 1))))
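The same adjacent-column idea can also be written as a plain base R loop over the date column names. A minimal sketch, assuming the example data from the question; the tz = "UTC" argument and the output naming (Date2_Date1, Date3_Date2) are choices made here for reproducibility and to mirror the question:

```r
df <- data.frame(
  Group = c("A", "B"),
  ID = c(1, 2),
  Date1 = as.POSIXct(c("2016-04-25 09:15:29", "2016-04-25 09:15:29"), tz = "UTC"),
  Date2 = as.POSIXct(c("2016-04-25 14:01:19", "2016-04-25 14:01:19"), tz = "UTC"),
  Date3 = as.POSIXct(c("2016-04-26 13:28:19", "2016-04-26 13:28:19"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Names of the date columns, in their left-to-right order
dates <- grep("^Date", names(df), value = TRUE)

# For each date column, subtract the column immediately to its left
for (i in seq_len(length(dates) - 1)) {
  new_name <- paste0(dates[i + 1], "_", dates[i])
  df[[new_name]] <- difftime(df[[dates[i + 1]]], df[[dates[i]]], units = "hours")
}
```

Because the loop works on column names, it extends to Date4, Date5 and so on without changes, provided the columns are already in chronological order.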

Drop columns per row based on a separate column value

Given a dummy data frame that looks like this:
Data1<-rnorm(20, mean=20)
Data2<-rnorm(20, mean=21)
Data3<-rnorm(20, mean=22)
Data4<-rnorm(20, mean=19)
Data5<-rnorm(20, mean=20)
Data6<-rnorm(20, mean=23)
Data7<-rnorm(20, mean=21)
Data8<-rnorm(20, mean=25)
Index<-rnorm(20,mean=5)
DF<-data.frame(Data1,Data2,Data3,Data4,Data5,Data6,Data7,Data8,Index)
What I'd like to do is remove (make NA) certain columns per row based on the Index column. I took the long way and did this to give you an idea of what I'm trying to do:
DF[DF$Index>5.0,8]<-NA
DF[DF$Index>=4.5 & DF$Index<=5.0,7:8]<-NA
DF[DF$Index>=4.0 & DF$Index<=4.5,6:8]<-NA
DF[DF$Index>=3.5 & DF$Index<=4.0,5:8]<-NA
DF[DF$Index>=3.0 & DF$Index<=3.5,4:8]<-NA
DF[DF$Index>=2.5 & DF$Index<=3.0,3:8]<-NA
DF[DF$Index>=2.0 & DF$Index<=2.5,2:8]<-NA
DF[DF$Index<=2.0,1:8]<-NA
This works fine as is, but is not very adaptable. If the number of columns change, or I need to tweak the conditional statements, it's a pain to rewrite the entire code (the actual data set is much larger).
What I would like to do is be able to define a few variables, and then run some sort of loop or apply to do exactly what the lines of code above do.
As an example, in order to replicate my long code, something along the lines of this kind of logic:
NumCol<-8
Max<-5
Min<-2.0
if index > Max, then drop NumCol
if index >= (Max-0.5) & <=Max, then drop NumCol:(NumCol -1)
repeat until reach Min
I don't know if that's the most logical line of reasoning in R, and I'm pretty bad with Looping and apply, so I'm open to any line of thought that can replicate the above long lines of code with the ability to adjust the above variables.

If you don't mind changing your data.frame to a matrix, here is a solution that uses indexing by a matrix. The building of the two-column matrix of indices to drop is a nice review of the apply family of functions:
Seq <- seq(Min, Max, by = 0.5)
col.idx <- lapply(findInterval(DF$Index, Seq) + 1, seq, to = NumCol)
row.idx <- mapply(rep, seq_along(col.idx), sapply(col.idx, length))
drop.idx <- as.matrix(data.frame(unlist(row.idx), unlist(col.idx)))
M <- as.matrix(DF)
M[drop.idx] <- NA
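A variant of the same findInterval idea that keeps the data frame is sketched below. It is an illustration under assumptions: the toy data (all ones, five hand-picked Index values) replaces the random data from the question, left.open = TRUE is used so that boundary values behave like the question's <= conditions, and the names breaks, first_drop, data_cols and cells are invented for the example.

```r
NumCol <- 8
Max <- 5
Min <- 2

# Toy data: 8 data columns of ones plus an Index column
DF <- as.data.frame(matrix(1, nrow = 5, ncol = NumCol))
names(DF) <- paste0("Data", seq_len(NumCol))
DF$Index <- c(1.0, 2.2, 3.7, 4.9, 5.5)

# First column to blank out in each row: 1 for Index <= Min, NumCol for Index > Max
breaks <- seq(Min, Max, by = 0.5)
first_drop <- findInterval(DF$Index, breaks, left.open = TRUE) + 1

# Blank every cell whose column number is at or beyond that row's first_drop
data_cols <- seq_len(NumCol)
cells <- as.matrix(DF[data_cols])
cells[col(cells) >= first_drop] <- NA
DF[data_cols] <- cells
```

Changing NumCol, Min, Max or the step in seq is then enough to adapt the rule, which is the flexibility the question asks for.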

Here is a memory-efficient (but I can't claim elegant) data.table solution.
It uses the very useful function findInterval to replace your less than / greater than conditions.
library(data.table)
DT <- data.table(DF)
# create an index column in which 1:8 represents your greater than / less than conditions
DT[,IND := findInterval(Index, c(-Inf, seq(2,5,by =0.5 ), Inf))]
# the columns you want to change
changing <- names(DT)[1:8]
setkey(DT, IND)
# loop through the indexes and alter by reference
for(.ind in DT[,unique(IND)]){
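# the columns you want to change: IND = 1 (Index < 2) blanks all 8 columns,
# IND = 8 (Index >= 5) blanks only the last one
.which <- tail(changing, length(changing) + 1 - .ind)
# create a call to `:=`(a = as(NA, class(a)), b = as(NA, class(b)))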
pairlist <- mapply(sprintf, .which, .which, MoreArgs = list(fmt = '%s = as(NA,class(%s))'))
char_exp <- sprintf('`:=`( %s )',paste(pairlist, collapse = ','))
.e <- parse(text = char_exp)
DT[J(.ind), eval(.e)]
}
