How to create a column based on two conditions from other data frame? - r

I'm trying to create a column that identifies if the row meets two conditions. For example, I have a table similar to this:
dat <- data.frame(Date = c(rep(c("2019-01-01", "2019-02-01","2019-03-01", "2019-04-01"), 4)),
                  Rep = c(rep("Mike", 4), rep("Tasha", 4), rep("Dane", 4), rep("Trish", 4)),
                  Manager = c(rep("Amber", 2), rep("Michelle", 2), rep("Debbie", 4), rep("Brian", 4), rep("Tim", 3), "Trevor"),
                  Sales = floor(runif(16, min = 0, max = 10)))
dat
Date Rep Manager Sales
1 2019-01-01 Mike Amber 6
2 2019-02-01 Mike Amber 3
3 2019-03-01 Mike Michelle 9
4 2019-04-01 Mike Michelle 2
5 2019-01-01 Tasha Debbie 9
6 2019-02-01 Tasha Debbie 6
7 2019-03-01 Tasha Debbie 0
8 2019-04-01 Tasha Debbie 4
9 2019-01-01 Dane Brian 3
10 2019-02-01 Dane Brian 6
11 2019-03-01 Dane Brian 6
12 2019-04-01 Dane Brian 1
13 2019-01-01 Trish Tim 6
14 2019-02-01 Trish Tim 7
15 2019-03-01 Trish Tim 6
16 2019-04-01 Trish Trevor 1
Of the Reps that have switched managers, I would like to identify whether each manager is the Rep's first or second manager with respect to the date. The ideal output would look something like:
Date Rep Manager Sales New_Column
1 2019-01-01 Mike Amber 6 1
2 2019-02-01 Mike Amber 3 1
3 2019-03-01 Mike Michelle 9 2
4 2019-04-01 Mike Michelle 2 2
5 2019-01-01 Trish Tim 6 1
6 2019-02-01 Trish Tim 7 1
7 2019-03-01 Trish Tim 6 1
8 2019-04-01 Trish Trevor 1 2
I have tried a few things but they're not quite working out. I have created two separate data frames where one consists of the first instance of that Rep and associated manager (df1) and the other one consists of the last instance of that rep and associated manager (df2). The code that I have tried that has gotten the closest is:
dat$New_Column <- ifelse(dat$Rep %in% df1$Rep & dat$Manager %in% df1$Manager, 1,
ifelse(dat$Rep %in% df2$Rep & dat$Manager %in% df2$Manager, 2, NA))
However this reads as two separate conditions, rather than having a condition of a condition (i.e. If Mike exists in the first instance and Amber exists in the first instance assign 1 rather than If Mike exists with the manager Amber in the first instance assign 1). Any help would be really appreciated. Thank you!

One option is to group by 'Rep', keep only the groups where the number of unique 'Manager' values is 2, and then add a column by matching each 'Manager' against the group's unique 'Manager' values, which gives the order in which each manager first appears:
library(dplyr)
dat %>%
group_by(Rep) %>%
filter(n_distinct(Manager) == 2) %>%
mutate(New_Column = match(Manager, unique(Manager)))
# A tibble: 8 x 5
# Groups: Rep [2]
# Date Rep Manager Sales New_Column
# <chr> <chr> <chr> <int> <int>
#1 2019-01-01 Mike Amber 6 1
#2 2019-02-01 Mike Amber 3 1
#3 2019-03-01 Mike Michelle 9 2
#4 2019-04-01 Mike Michelle 2 2
#5 2019-01-01 Trish Tim 6 1
#6 2019-02-01 Trish Tim 7 1
#7 2019-03-01 Trish Tim 6 1
#8 2019-04-01 Trish Trevor 1 2
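The `ifelse()` attempt in the question tests Rep and Manager separately, so any Rep in df1 combined with any Manager in df1 counts as a match. A minimal base R sketch of one way to test the pair instead is to paste both columns into a single key. How df1 and df2 are built here is an assumption (first and last row per Rep, as the question describes), and `make_key` is a hypothetical helper name.

```r
# Toy data mirroring the question (Sales is left out; it plays no role here)
dat <- data.frame(
  Date    = rep(c("2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01"), 4),
  Rep     = rep(c("Mike", "Tasha", "Dane", "Trish"), each = 4),
  Manager = c(rep("Amber", 2), rep("Michelle", 2), rep("Debbie", 4),
              rep("Brian", 4), rep("Tim", 3), "Trevor"),
  stringsAsFactors = FALSE
)

# Assumed construction of df1/df2: first and last row per Rep
df1 <- dat[!duplicated(dat$Rep), c("Rep", "Manager")]
df2 <- dat[!duplicated(dat$Rep, fromLast = TRUE), c("Rep", "Manager")]

# One key per (Rep, Manager) pair, so both values must match in the same row
make_key <- function(d) paste(d$Rep, d$Manager, sep = "|")

dat$New_Column <- ifelse(make_key(dat) %in% make_key(df1), 1,
                  ifelse(make_key(dat) %in% make_key(df2), 2, NA))

# Keep only the Reps with more than one distinct Manager
n_mgr <- tapply(dat$Manager, dat$Rep, function(m) length(unique(m)))
res <- dat[as.vector(n_mgr[dat$Rep] > 1), ]
res
```

This only distinguishes a first and a last manager. A Rep with three managers, or one who returns to an earlier manager, would need the `match(Manager, unique(Manager))` approach from the answer above.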

Related

Referring to the row above when using mutate() in R

I want to create a new variable in a dataframe that refers to the value of the same new variable in the row above. Here's an example of what I want to do:
A horse is in a field divided into four zones. The horse is wearing a beacon that signals every minute, and the signal is picked up by one of four sensors, one for each zone. The field has a fence that runs most of the way down the middle, such that the horse can pass easily between zones 2 and 3, but to get between zones 1 and 4 it has to go via 2 and 3. The horse cannot jump over the fence.
|________________|
| |
sensor 2 | X | | sensor 3
| | |
| | |
| | |
sensor 1 | Y| | sensor 4
| | |
|----------------|
In the schematic above, if the horse is at position X, it will be picked up by sensor 2. If the horse is near the middle fence at position Y, however, it may be picked up by either sensor 1 or sensor 4, the ranges of which overlap slightly.
In the toy example below, I have a dataframe where I have location data each minute for 20 minutes. In most cases, the horse moves one zone at a time, but in several instances, it switches back and forth between zone 1 and 4. This should be impossible: the horse cannot jump the fence, and neither can it run around in the space of a minute.
I therefore want to calculate a new variable in the dataset that provides the "true" location of the animal, accounting for the impossibility of travelling between 1 and 4.
Here's the data:
library(tidyverse)
library(reshape2)
example <- data.frame(time = seq(as.POSIXct("2022-01-01 09:00:00"),
as.POSIXct("2022-01-01 09:20:00"),
by="1 mins"),
location = c(1,1,1,1,2,3,3,3,4,4,4,3,3,2,1,1,4,1,4,1,4))
example
Create two new variables: "prevloc" is where the animal was in the previous minute, and "diffloc" is the absolute difference between the animal's current and previous location.
example <- example %>% mutate(prevloc = lag(location),
diffloc = abs(location - prevloc))
example
Next, just change the first value of "diffloc" from NA to zero:
example <- example %>% mutate(diffloc = ifelse(is.na(diffloc), 0, diffloc))
example
Now we have a dataframe where diffloc is either 0 (animal didn't move), 1 (animal moved one zone), or 3 (animal apparently moved from zone 1 to zone 4 or vice versa). Where diffloc = 3, I want to create a "true" location taking account of the fact that such a change in location is impossible.
In my example, the animal went from zone 1 -> 4 -> 1 -> 4 -> 1 -> 4. Based on the fact that the animal started in zone 1, my assumption is that the animal just stayed in zone 1 the whole time.
My attempt to solve this is below, but it doesn't work:
example <- example %>%
mutate(returnloc = ifelse(diffloc < 3, location, lag(returnloc)))
I wonder whether anyone can help me to solve this? I've been trying for a couple of days and haven't even got close...
Best wishes,
Adam
One possible solution, when diffloc == 3, is to look at the most recent location that is neither 1 nor 4. If it is 2, the horse must be in zone 1 afterwards; if it is 3, the horse must be in zone 4.
example %>%
mutate(trueloc = case_when(diffloc == 3 & sapply(seq(row_number()), \(i) tail(location[1:i][!location %in% c(1, 4)], 1) == 2) ~ 1,
diffloc == 3 & sapply(seq(row_number()), \(i) tail(location[1:i][!location %in% c(1, 4)], 1) == 3) ~ 4,
TRUE ~ location))
time location prevloc diffloc trueloc
1 2022-01-01 09:00:00 1 NA 0 1
2 2022-01-01 09:01:00 1 1 0 1
3 2022-01-01 09:02:00 1 1 0 1
4 2022-01-01 09:03:00 1 1 0 1
5 2022-01-01 09:04:00 2 1 1 2
6 2022-01-01 09:05:00 3 2 1 3
7 2022-01-01 09:06:00 3 3 0 3
8 2022-01-01 09:07:00 3 3 0 3
9 2022-01-01 09:08:00 4 3 1 4
10 2022-01-01 09:09:00 4 4 0 4
11 2022-01-01 09:10:00 4 4 0 4
12 2022-01-01 09:11:00 3 4 1 3
13 2022-01-01 09:12:00 3 3 0 3
14 2022-01-01 09:13:00 2 3 1 2
15 2022-01-01 09:14:00 1 2 1 1
16 2022-01-01 09:15:00 1 1 0 1
17 2022-01-01 09:16:00 4 1 3 1
18 2022-01-01 09:17:00 1 4 3 1
19 2022-01-01 09:18:00 4 1 3 1
20 2022-01-01 09:19:00 1 4 3 1
21 2022-01-01 09:20:00 4 1 3 1
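The same idea (the last zone 2 or 3 reading decides which side of the fence the horse is on) can also be sketched without `sapply()`, using `cummax()` to carry the index of the last middle-zone reading forward. This is a base R sketch under one assumption that differs from the answer above: it corrects every zone 1 or 4 reading, not only those where diffloc == 3. The names `idx`, `last_mid` and `trueloc` are illustrative.

```r
location <- c(1,1,1,1,2,3,3,3,4,4,4,3,3,2,1,1,4,1,4,1,4)

# Index of the most recent reading in a middle zone (2 or 3); 0 if none yet
idx <- cummax(ifelse(location %in% c(2, 3), seq_along(location), 0))

# Zone at that index; the leading NA absorbs the idx == 0 case
last_mid <- c(NA, location)[idx + 1]

# A zone 1/4 reading is replaced by the outer zone next to the last middle zone:
# zone 1 if the horse was last seen in 2, zone 4 if it was last seen in 3.
# Readings before any middle-zone reading are left as they are.
trueloc <- ifelse(location %in% c(1, 4) & !is.na(last_mid),
                  ifelse(last_mid == 2, 1, 4),
                  location)
trueloc
```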
Here is an approach using a function containing a for-loop.
You cannot rely on diff, because this will not pick up sequences of (wrong) zone 4's.
c(1,1,4,4,4,1,1,1) should be converted to c(1,1,1,1,1,1,1,1) if I understand your question correctly.
So, you need to iterate (I think).
library(data.table)
# custom sample data set
example <- data.frame(time = seq(as.POSIXct("2022-01-01 09:00:00"),
as.POSIXct("2022-01-01 09:20:00"),
by="1 mins"),
location = c(1,1,1,1,2,3,3,3,4,4,4,3,3,2,1,1,4,4,4,1,4))
# Make it a data.table, make sure the time is ordered
setDT(example, key = "time")
# function
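fixLocations <- function(x) {
  # walk through the readings; seq_along(x)[-1] is empty for a single reading,
  # whereas 2:length(x) would count backwards (2, 1) in that case
  for (i in seq_along(x)[-1]) {
    # a jump of more than one zone is impossible, so keep the previous (corrected) zone
    if (abs(x[i] - x[i - 1]) > 1) x[i] <- x[i - 1]
  }
  return(x)
}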
NB: this function only works if the location in the first row is correct. If the data starts with (wrong) zone 4's, it will go awry.
example[, locationNew := fixLocations(location)][]
# time location locationNew
# 1: 2022-01-01 09:00:00 1 1
# 2: 2022-01-01 09:01:00 1 1
# 3: 2022-01-01 09:02:00 1 1
# 4: 2022-01-01 09:03:00 1 1
# 5: 2022-01-01 09:04:00 2 2
# 6: 2022-01-01 09:05:00 3 3
# 7: 2022-01-01 09:06:00 3 3
# 8: 2022-01-01 09:07:00 3 3
# 9: 2022-01-01 09:08:00 4 4
#10: 2022-01-01 09:09:00 4 4
#11: 2022-01-01 09:10:00 4 4
#12: 2022-01-01 09:11:00 3 3
#13: 2022-01-01 09:12:00 3 3
#14: 2022-01-01 09:13:00 2 2
#15: 2022-01-01 09:14:00 1 1
#16: 2022-01-01 09:15:00 1 1
#17: 2022-01-01 09:16:00 4 1
#18: 2022-01-01 09:17:00 4 1
#19: 2022-01-01 09:18:00 4 1
#20: 2022-01-01 09:19:00 1 1
#21: 2022-01-01 09:20:00 4 1
# time location locationNew
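The `mutate()` attempt in the question fails because `returnloc` does not exist yet when `lag(returnloc)` is evaluated, and `mutate()` computes a whole column in one vectorised step rather than row by row. A running value that depends on its own previous value is a fold, which base R's `Reduce()` with `accumulate = TRUE` can express. The sketch below applies the same rule as `fixLocations()` above; `fix_locations` is a hypothetical name, and like the loop it assumes the first reading is correct.

```r
fix_locations <- function(x) {
  # prev is the already-corrected previous location, cur is the raw current reading
  Reduce(function(prev, cur) if (abs(cur - prev) > 1) prev else cur,
         x, accumulate = TRUE)
}

location <- c(1,1,1,1,2,3,3,3,4,4,4,3,3,2,1,1,4,4,4,1,4)
fix_locations(location)
```

Inside a dplyr pipeline this could be used as `mutate(trueloc = fix_locations(location))`.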

FOR Loop with multiple parameters from a dataframe in R [closed]

I would like to know whether it is possible to build a FOR loop in R which would change multiple parameters at every run.
I have parameter dataframe [df_params] which looks like this:
group person date_from date_to
1 Mike 2020-10-01 12:00:00 2020-10-01 13:00:00
2 Mike 2020-10-04 09:00:00 2020-10-07 17:00:00
3 Dave 2020-10-07 12:00:00 2020-10-07 13:00:00
4 Dave 2020-10-09 09:00:00 2020-10-11 17:00:00
I would like to loop over a larger dataframe [df] and get only the rows matching parameters of individual rows in the "df_params" dataframe.
The large dataframe [df] looks like this:
person datetime books tasks done
Mike 2020-10-01 12:15:00 5 7 2
Mike 2020-10-01 12:17:00 5 7 3
Mike 2020-10-01 18:00:00 5 7 4
Mike 2020-10-02 12:00:00 5 5 0
Mike 2020-10-04 09:08:00 5 3 3
Mike 2020-10-09 12:00:00 5 7 1
Dave 2020-10-07 12:22:00 7 5 1
Dave 2020-10-08 02:34:00 7 5 2
Dave 2020-10-09 07:00:00 7 3 3
Dave 2020-10-09 08:00:00 7 8 5
Dave 2020-10-09 09:48:00 7 7 2
Nick 2020-10-01 13:00:00 3 7 3
Nick 2020-10-02 12:58:00 3 3 2
Nick 2020-10-03 10:02:00 3 7 1
The desired result would look like this:
person datetime books tasks done group
Mike 2020-10-01 12:15:00 5 7 2 1
Mike 2020-10-01 12:17:00 5 7 3 1
Mike 2020-10-04 09:08:00 5 3 3 2
Dave 2020-10-07 12:22:00 7 5 1 3
Dave 2020-10-09 09:48:00 7 7 2 4
Is something like this possible in R?
Thank you very much for any suggestions.
This might be a slightly expensive solution if your datasets are very large, but it outputs the desired result.
I don't know if your date variables are already in date format; below I convert them with the lubridate package just in case they aren't.
Also, I create the variable date_interval that will be used later for a filtering condition.
library(dplyr)
library(lubridate)
# convert to date format
df_params <- df_params %>%
mutate(
date_from = ymd_hms(date_from),
date_to = ymd_hms(date_to),
# create interval
date_interval = interval(date_from, date_to)
)
df <- df %>%
mutate(datetime = ymd_hms(datetime))
After this manipulation step, I use a left_join on the person name. This pairs every row of df with every interval defined for that person, which produces a larger dataframe (this is why I said the operation might be a little expensive). I then filter only the rows where datetime is within the above-mentioned interval.
left_join(df, df_params, by = "person") %>%
filter(datetime %within% date_interval) %>%
select(person:group)
# person datetime books tasks done group
# 1 Mike 2020-10-01 12:15:00 5 7 2 1
# 2 Mike 2020-10-01 12:17:00 5 7 3 1
# 3 Mike 2020-10-04 09:08:00 5 3 3 2
# 4 Dave 2020-10-07 12:22:00 7 5 1 3
# 5 Dave 2020-10-09 09:48:00 7 7 2 4
Starting data
df_params <- read.table(text="
group person date_from date_to
1 Mike 2020-10-01T12:00:00 2020-10-01T13:00:00
2 Mike 2020-10-04T09:00:00 2020-10-07T17:00:00
3 Dave 2020-10-07T12:00:00 2020-10-07T13:00:00
4 Dave 2020-10-09T09:00:00 2020-10-11T17:00:00", header=T)
df <- read.table(text="
person datetime books tasks done
Mike 2020-10-01T12:15:00 5 7 2
Mike 2020-10-01T12:17:00 5 7 3
Mike 2020-10-01T18:00:00 5 7 4
Mike 2020-10-02T12:00:00 5 5 0
Mike 2020-10-04T09:08:00 5 3 3
Mike 2020-10-09T12:00:00 5 7 1
Dave 2020-10-07T12:22:00 7 5 1
Dave 2020-10-08T02:34:00 7 5 2
Dave 2020-10-09T07:00:00 7 3 3
Dave 2020-10-09T08:00:00 7 8 5
Dave 2020-10-09T09:48:00 7 7 2
Nick 2020-10-01T13:00:00 3 7 3
Nick 2020-10-02T12:58:00 3 3 2
Nick 2020-10-03T10:02:00 3 7 1 ", header=T)
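For readers without lubridate, the same join-then-filter idea can be sketched in base R with `merge()` and plain comparisons on POSIXct values. This is an illustration under assumptions: the books, tasks and done columns are left out for brevity, both interval ends are treated as inclusive, and all times are read as UTC.

```r
df_params <- data.frame(
  group     = 1:4,
  person    = c("Mike", "Mike", "Dave", "Dave"),
  date_from = as.POSIXct(c("2020-10-01 12:00:00", "2020-10-04 09:00:00",
                           "2020-10-07 12:00:00", "2020-10-09 09:00:00"), tz = "UTC"),
  date_to   = as.POSIXct(c("2020-10-01 13:00:00", "2020-10-07 17:00:00",
                           "2020-10-07 13:00:00", "2020-10-11 17:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

df <- data.frame(
  person   = c(rep("Mike", 6), rep("Dave", 5), rep("Nick", 3)),
  datetime = as.POSIXct(c("2020-10-01 12:15:00", "2020-10-01 12:17:00",
                          "2020-10-01 18:00:00", "2020-10-02 12:00:00",
                          "2020-10-04 09:08:00", "2020-10-09 12:00:00",
                          "2020-10-07 12:22:00", "2020-10-08 02:34:00",
                          "2020-10-09 07:00:00", "2020-10-09 08:00:00",
                          "2020-10-09 09:48:00", "2020-10-01 13:00:00",
                          "2020-10-02 12:58:00", "2020-10-03 10:02:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Pair every row of df with every interval of the same person
joined <- merge(df, df_params, by = "person")

# Keep the pairs where the timestamp falls inside the interval
hit <- joined$datetime >= joined$date_from & joined$datetime <= joined$date_to
res <- joined[hit, c("person", "datetime", "group")]
res <- res[order(res$group, res$datetime), ]
res
```

`merge()` performs an inner join by default, so people without any interval (Nick here) drop out at the join rather than at the filter.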

Calculate number of days passed since First event R

I would like to calculate the number of days which have passed since the first event. There are different groups, so each group's starting date for an event is different, and I want to calculate each group's number of days passed since its own first event.
names = c('Ben',"Ben","Ben","Ben","Ben","Ben" ,'Dan',"Dan","Dan","Dan", 'Peter',"Peter","Peter","Peter","Peter","Peter","Peter",'Betty',"Betty","Betty",'Betty', "Betty")
dates = c('2000-02-01','2000-02-02',"2000-02-03","2000-02-04",'2000-02-05','2000-02-05', '2000-01-11','2000-01-12',"2000-01-13",'2000-01-14',
'2000-09-10','2000-09-11',"2000-09-12",'2000-09-13','2000-09-14','2000-09-15','2000-09-16','2000-11-13','2000-11-14', "2000-11-15",'2000-11-16','2000-11-17')
events = c(0,0,1,4,5,11,0,0,2,6,0,0,1,2,3,4,5,0,0,1,2,3)
newd = data.frame(names,dates,events)
newd
so the data frame looks like this:
> newd
names dates events
1 Ben 2000-02-01 0
2 Ben 2000-02-02 0
3 Ben 2000-02-03 1
4 Ben 2000-02-04 4
5 Ben 2000-02-05 5
6 Ben 2000-02-05 11
7 Dan 2000-01-11 0
8 Dan 2000-01-12 0
9 Dan 2000-01-13 2
10 Dan 2000-01-14 6
11 Peter 2000-09-10 0
12 Peter 2000-09-11 0
13 Peter 2000-09-12 1
14 Peter 2000-09-13 2
15 Peter 2000-09-14 3
16 Peter 2000-09-15 4
17 Peter 2000-09-16 5
18 Betty 2000-11-13 0
19 Betty 2000-11-14 0
20 Betty 2000-11-15 1
21 Betty 2000-11-16 2
22 Betty 2000-11-17 3
This is just an example I am using; the 'events' are not in a specific order and are totally random, and there are also many other dates with an event of 0. So I would like to only start counting days where event > 0.
So if there's a 0 at 'event', then there should also be 0 days counted.
Convert the dates to actual Date objects; you can then subtract the minimum date within each name.
newd$dates <- as.Date(newd$dates)
library(dplyr)
newd %>% group_by(names) %>% mutate(events = as.integer(dates - min(dates)))
# names dates events
# <chr> <date> <int>
# 1 Ben 2000-02-02 0
# 2 Ben 2000-02-03 1
# 3 Ben 2000-02-04 2
# 4 Ben 2000-02-05 3
# 5 Ben 2000-02-05 3
# 6 Dan 2000-01-12 0
# 7 Dan 2000-01-13 1
# 8 Dan 2000-01-14 2
# 9 Peter 2000-09-11 0
#10 Peter 2000-09-12 1
#11 Peter 2000-09-13 2
#12 Peter 2000-09-14 3
#13 Peter 2000-09-15 4
#14 Peter 2000-09-16 5
#15 Betty 2000-11-14 0
#16 Betty 2000-11-15 1
#17 Betty 2000-11-16 2
#18 Betty 2000-11-17 3
In base R :
newd$events <- with(newd, dates - ave(dates, names, FUN = min))
and data.table :
library(data.table)
setDT(newd)[, events := dates - min(dates), names]
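The snippets above count from each name's earliest date. If the count should instead start at the first date where events > 0, with 0 days on rows where events is 0 (one reading of the question, so treat this as an assumption), a base R sketch could look like this. Only Ben and Dan from the example are used, and `day_num`, `event_day`, `first_event` and `days_since` are illustrative names.

```r
newd <- data.frame(
  names  = c(rep("Ben", 6), rep("Dan", 4)),
  dates  = as.Date(c("2000-02-01", "2000-02-02", "2000-02-03", "2000-02-04",
                     "2000-02-05", "2000-02-05",
                     "2000-01-11", "2000-01-12", "2000-01-13", "2000-01-14")),
  events = c(0, 0, 1, 4, 5, 11, 0, 0, 2, 6),
  stringsAsFactors = FALSE
)

# Dates as day numbers so that ifelse() and ave() work on plain numerics
day_num <- as.numeric(newd$dates)

# Day number on rows with an event, NA elsewhere
event_day <- ifelse(newd$events > 0, day_num, NA)

# Earliest event day per name, repeated on every row of that name
first_event <- ave(event_day, newd$names, FUN = function(d) min(d, na.rm = TRUE))

# Days since the first event; rows without an event get 0
newd$days_since <- ifelse(newd$events > 0, day_num - first_event, 0)
newd
```

A name with no event at all would make `min()` return `Inf` with a warning; its rows still end up as 0 because of the outer `ifelse()`.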

How to iterate rows between start_date and end_date in R

I have a dataframe that looks like this:
And here is the output I'm hoping for.
This should work. The key is to use uncount from the tidyr package (loaded with the tidyverse). Then you need to do some date arithmetic. There are some tricky issues in calculating the difference in months, and the day-based estimate I propose here may not be the best way to do it, but you get the idea.
library(tidyverse)
library(lubridate)
df = tibble(name = c('Alice', 'Bob', 'Caroline'),
start_date = as.Date(c('2019-01-01','2018-03-01','2019-06-01')),
end_date = as.Date(c('2019-07-01','2019-05-01','2019-09-01')))
# # A tibble: 3 x 3
# name start_date end_date
# <chr> <date> <date>
# 1 Alice 2019-01-01 2019-07-01
# 2 Bob 2018-03-01 2019-05-01
# 3 Caroline 2019-06-01 2019-09-01
df %>% mutate(tenure_in_month = as.integer(difftime(end_date, start_date, units = "days")/365*12+2))%>%
uncount(tenure_in_month)%>%
group_by(name)%>%
mutate(iteratedDate = start_date %m+% months(row_number()-1))%>%
select(name,iteratedDate)
# A tibble: 28 x 2
# Groups: name [3]
name iteratedDate
<chr> <date>
1 Alice 2019-01-01
2 Alice 2019-02-01
3 Alice 2019-03-01
4 Alice 2019-04-01
5 Alice 2019-05-01
6 Alice 2019-06-01
7 Alice 2019-07-01
8 Bob 2018-03-01
9 Bob 2018-04-01
10 Bob 2018-05-01
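The row count above is estimated from a difference in days, which is only approximate. A sketch that avoids the estimate is to let `seq()` step by calendar month from start_date to end_date for each row. It is shown here in base R and assumes every date falls on the first of a month, as in the example (month steps from the 29th to the 31st can roll over into the next month).

```r
df <- data.frame(
  name       = c("Alice", "Bob", "Caroline"),
  start_date = as.Date(c("2019-01-01", "2018-03-01", "2019-06-01")),
  end_date   = as.Date(c("2019-07-01", "2019-05-01", "2019-09-01")),
  stringsAsFactors = FALSE
)

# One small data frame per input row, holding that row's monthly sequence
rows <- lapply(seq_len(nrow(df)), function(i) {
  data.frame(name = df$name[i],
             iteratedDate = seq(df$start_date[i], df$end_date[i], by = "month"),
             stringsAsFactors = FALSE)
})
out <- do.call(rbind, rows)

# Number of generated months per person
table(out$name)
```

Each sequence stops at end_date itself, giving 7, 15 and 4 rows for Alice, Bob and Caroline.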
I would use the seq function for this problem.
library(data.table)
library(lubridate)
# data
original_data <- data.table(
CustomerName = c('Ben','Julie','Angelo','Carlo'),
StartDate = c(ymd(20190101),ymd(20180103),ymd(20190106),ymd(20170108)),
EndDate = c(ymd(20190107),ymd(20190105),ymd(20190109),ymd(20180112))
)
# CustomerName StartDate EndDate
#1: Ben 2019-01-01 2019-01-07
#2: Julie 2018-01-03 2019-01-05
#3: Angelo 2019-01-06 2019-01-09
#4: Carlo 2017-01-08 2018-01-12
# no pipe needed: neither data.table nor lubridate provides %>%
finish_data <- original_data[, .(IteratedDate = seq(from = StartDate,
to = EndDate, by = 'day')), by = .(CustomerName)]
# CustomerName IteratedDate
#1: Ben 2019-01-01
#2: Ben 2019-01-02
#3: Ben 2019-01-03
#4: Ben 2019-01-04
#5: Ben 2019-01-05
#6: Ben 2019-01-06
#7: Ben 2019-01-07
#8: Julie 2018-01-03
#9: Julie 2018-01-04

In R: Duplicate rows except for the first row based on condition

I have a data.table dt:
library(data.table)
names <- c("john","mary","mary","mary","mary","mary","mary","tom","tom","tom","mary","john","john","john","tom","tom")
dates <- c(as.Date("2010-06-01"),as.Date("2010-06-01"),as.Date("2010-06-05"),as.Date("2010-06-09"),as.Date("2010-06-13"),as.Date("2010-06-17"),as.Date("2010-06-21"),as.Date("2010-07-09"),as.Date("2010-07-13"),as.Date("2010-07-17"),as.Date("2010-06-01"),as.Date("2010-08-01"),as.Date("2010-08-05"),as.Date("2010-08-09"),as.Date("2010-09-03"),as.Date("2010-09-04"))
shifts_missed <- c(2,11,11,11,11,11,11,6,6,6,1,5,5,5,0,2)
shift <- c("Day","Night","Night","Night","Night","Night","Night","Day","Day","Day","Day","Night","Night","Night","Night","Day")
df <- data.frame(names=names, dates=dates, shifts_missed=shifts_missed, shift=shift)
dt <- as.data.table(df)
names dates shifts_missed shift
john 2010-06-01 2 Day
mary 2010-06-01 11 Night
mary 2010-06-05 11 Night
mary 2010-06-09 11 Night
mary 2010-06-13 11 Night
mary 2010-06-17 11 Night
mary 2010-06-21 11 Night
tom 2010-07-09 6 Day
tom 2010-07-13 6 Day
tom 2010-07-17 6 Day
mary 2010-06-01 1 Day
john 2010-08-01 5 Night
john 2010-08-05 5 Night
john 2010-08-09 5 Night
tom 2010-09-03 0 Night
tom 2010-09-04 2 Day
Ultimately, what I want is to get the following:
names dates shifts_missed shift count
john 2010-06-01 2 Day 1
mary 2010-06-01 11 Night 1
mary 2010-06-05 11 Night 1
mary 2010-06-09 11 Night 1
mary 2010-06-13 11 Night 1
mary 2010-06-17 11 Night 1
mary 2010-06-21 11 Night 1
tom 2010-07-09 6 Day 1
tom 2010-07-13 6 Day 1
tom 2010-07-17 6 Day 1
mary 2010-06-01 1 Day 1
john 2010-08-01 5 Night 1
john 2010-08-05 5 Night 1
john 2010-08-09 5 Night 1
tom 2010-09-03 0 Night 0
tom 2010-09-04 2 Day 1
john 2010-06-01 2 Night 1
mary 2010-06-05 11 Day 1
mary 2010-06-09 11 Day 1
mary 2010-06-13 11 Day 1
mary 2010-06-17 11 Day 1
mary 2010-06-21 11 Day 1
tom 2010-07-09 6 Night 1
tom 2010-07-13 6 Night 1
tom 2010-07-17 6 Night 1
john 2010-08-05 5 Day 1
john 2010-08-09 5 Day 1
tom 2010-09-04 2 Night 1
As you can see, the second half of the data is almost a duplicate of the first half. However, if shifts_missed = 0, it should not be duplicated, and if shifts_missed is odd, the first row should not be duplicated but the remaining rows should. It should then add a 1 in the count column for all except when shifts_missed = 0.
I've seen some answers that speak about !duplicate or unique, but these values in shifts_missed are not unique. I'm sure this isn't overly complicated and is probably a multi-step process, but I can't figure out how to isolate the first rows of the odd shifts_missed column.
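# flag the rows to duplicate within each (names, shift) group:
# every row if the group's first shifts_missed is even, otherwise all but the first row
dt[, is.in := if (shifts_missed[1] %% 2 == 0) TRUE else c(FALSE, rep(TRUE, .N - 1)),
   by = .(names, shift)]
# append the flagged rows, skipping those with shifts_missed == 0
rbind(dt, dt[is.in & shifts_missed != 0])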
Adding the extra column part should be obvious.
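For completeness, the whole sequence (flag, duplicate, swap the shift on the copies, add count) might be sketched in base R as below. Swapping Day and Night on the duplicated rows is inferred from the desired output in the question rather than stated in the answer, so treat it as an assumption. The dates column is omitted because it plays no role in the rule, and `pos`, `first_sm` and `dup_me` are illustrative names.

```r
df <- data.frame(
  names         = c("john", rep("mary", 6), rep("tom", 3), "mary",
                    rep("john", 3), "tom", "tom"),
  shifts_missed = c(2, rep(11, 6), rep(6, 3), 1, rep(5, 3), 0, 2),
  shift         = c("Day", rep("Night", 6), rep("Day", 3), "Day",
                    rep("Night", 3), "Night", "Day"),
  stringsAsFactors = FALSE
)

# Position of each row within its (names, shift) group,
# and the first shifts_missed value of that group
pos      <- ave(seq_len(nrow(df)), df$names, df$shift, FUN = seq_along)
first_sm <- ave(df$shifts_missed, df$names, df$shift, FUN = function(v) v[1])

# Same rule as is.in in the answer, combined with the shifts_missed != 0 filter
dup_me <- df$shifts_missed != 0 & (first_sm %% 2 == 0 | pos > 1)

# Copies get the opposite shift (assumed from the desired output)
dup <- df[dup_me, ]
dup$shift <- ifelse(dup$shift == "Day", "Night", "Day")

# Stack originals and copies; count is 1 except where shifts_missed is 0
out <- rbind(df, dup)
out$count <- as.integer(out$shifts_missed != 0)
out
```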
