R: Creating new dataframe based on multiple conditions on existing dataframe

I need to create a new dataframe using multiple conditions on an existing dataframe.
I tried using dplyr functions, summarise in particular, for the multiple conditions, but failed because the dataset size decreases once the conditions are applied.
For explanation, below is a simple sample of what I am trying to achieve.
df <- data.frame(User = c("Newton", "Newton", "Newton", "Newton", "Newton"),
                 Location = c("A", "A", "B", "A", "B"),
                 Movement = c(10, 10, 20, 20, 30),
                 Unit = c(-2, 2, 2, -2, -1),
                 Time = c("4-20-2019", "4-20-2019", "4-21-2019", "4-21-2019",
                          "4-23-2019"))
dfNew <- data.frame(User = c("Newton", "Newton", "Newton"),
                    FromLocation = c("A", "A", "B"),
                    ToLocation = c("A", "B", "B"),
                    Movement = c(10, 20, 30),
                    Units = c(2, 2, -1))
The conditions used to construct dfNew are as follows:
Looking at the first line of df:
a) if movement is 10 and unit is negative - ignore this line
Looking at the second line of df:
a) if movement is 10 and unit is positive - FromLocation and ToLocation are both A, and Units is taken from df, which is 2
Looking at the third line of df:
a) if movement is 20 and unit is positive - ToLocation (B) and Units (2) have to be taken from this line, and FromLocation has to be taken from the next line
Looking at the fourth line of df:
a) if movement is 20 and unit is negative - FromLocation (A) for the previous line of dfNew has to be taken from this line
Looking at the fifth line of df:
a) if movement is 30, then ToLocation and FromLocation will both be B and the units will be the same as in df, which is -1
Another pattern that could be useful is that each movement occurs on the same day/time. Also, please note that the example is for only 1 user; I have more than 2,000 users to whom similar conditions have to be applied.
As I said, I tried using dplyr and summarise to encode all these conditions, but since the resulting dataset has a different size I couldn't find a way to make it work.
Appreciate any advice, thank you!

It sounds like dplyr::group_by and case_when might suffice, but I'm not sure these are the right interpretations of the "rules" for your table.
library(dplyr)
df %>%
  group_by(User) %>%
  mutate(FromLocation = case_when(Movement == 10 & Unit < 0 ~ "DROP",
                                  Movement == 10 & Unit > 0 ~ Location,
                                  Movement == 20 & Unit < 0 ~ lag(Location),
                                  Movement == 20 & Unit > 0 ~ lead(Location),
                                  Movement == 30 ~ "B",
                                  TRUE ~ "not specified in rules"),
         ToLocation = case_when(Movement == 10 & Unit < 0 ~ "DROP",
                                Movement == 10 & Unit > 0 ~ Location,
                                Movement == 20 & Unit < 0 ~ lag(Location), # not given in the rules
                                Movement == 20 & Unit > 0 ~ Location,
                                Movement == 30 ~ "B",
                                TRUE ~ "not specified in rules")) %>%
  ungroup() %>%
  filter(FromLocation != "DROP") %>%
  select(User, FromLocation, ToLocation, Movement, Unit)
Results
# A tibble: 4 x 5
  User   FromLocation ToLocation Movement  Unit
  <chr>  <chr>        <chr>         <dbl> <dbl>
1 Newton A            A                10     2
2 Newton A            B                20     2
3 Newton B            B                20    -2
4 Newton B            B                30    -1
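Note that this intermediate result still contains two rows for the 20-movement (one per sign of Unit), whereas dfNew has a single row for it. Assuming the negative-unit row of such a pair exists only to supply the Location consumed by lead()/lag() above, one extra filter on the pipeline's result (stored here under the hypothetical name dfAnnotated) would reproduce the 3-row dfNew; a sketch only:
library(dplyr)
# dfAnnotated: hypothetical name for the result of the mutate()/filter()
# pipeline above, before the final select()
dfNew_candidate <- dfAnnotated %>%
  filter(!(Movement == 20 & Unit < 0))  # drop the consumed negative-unit row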

Related

Dplyr: Group by and then return top n based on different conditions

I have been trying to group my data (by Loan Number) and then return one row per group based on either the highest or lowest value of a certain column (here, the Filter column), with the choice depending on differing conditions. I realize I cannot use ifelse to do what I want, but other examples have used if and else (hence my attempt). I have had all manner of errors along the way. Any help would be appreciated, along with clarification of the problems.
Example data
Loan_Number <- c(100, 100, 100, 100, 200, 200, 200, 200, 300, 300, 300, 300)
Principal_Remaining <- c(50, 50, 50, 50, 5, 5, 0, 0, 10, 10, 10, 10)
Principal_In_Arrears <- c(50, 50, 50, 50, 0, 0, 0, 0, 0, 10, 10, 10)
Write_off_Number <- c(10, 10, 10, 10, 0, 0, 0, 0, 0, 0, 0, 0)
Filter <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
outcome <- as.data.frame(cbind(Loan_Number, Principal_In_Arrears, Principal_Remaining, Write_off_Number, Filter))
My last attempt at the code was
hope <- outcome %>%
  group_by(Loan_Number) %>%
  if (Principal_Remaining == 0) top_n(-1, wt = Filter) else
    if (Principal_In_Arrears == 0) top_n(-1, wt = Filter) else
      if (Write_off_Number >= 0) top_n(1, wt = Filter) else top_n(-1, wt = Filter)
The idea being that if there is no principal left then I want a certain value and if there is principal left I have to check whether the loan is in arrears or has been written off.
NB To confirm the exact requirement: I need to avoid considering the rows that do not meet the condition. For example, for loan 200 the record returned should be row 7 (the lowest month where the principal is 0); the first answer did not do that. Also, loan 300 should return row 10 (the condition should be != 0 and the min of Filter), the first month it goes into arrears. Loan 100 should just return row 1.
You can use case_when with slice to select the row for each Loan_Number.
library(dplyr)
outcome %>%
  group_by(Loan_Number) %>%
  slice(case_when(any(Principal_Remaining == 0) ~ which.max(Filter),
                  any(Principal_In_Arrears == 0) ~ which.min(Filter),
                  any(Write_off_Number >= 0) ~ which.max(Filter),
                  TRUE ~ which.min(Filter))) %>%
  ungroup()
#  Loan_Number Principal_In_Arrears Principal_Remaining Write_off_Number Filter
#        <dbl>                <dbl>               <dbl>            <dbl>  <dbl>
#1         100                   50                  50               10      4
#2         200                    0                   0                0      8
#3         300                   10                  10                0     12
At this point this works, but I am not 100% sure it will continue to work as other combinations develop:
hope <- outcome %>%
  group_by(Banjo_Loan_No) %>%
  dplyr::slice(case_when(any(Principal_Remaining == 0) ~ which.min(abs(filter * Principal_Remaining)),
                         any(Principal_in_Arrears == 0) ~ which.max(abs(filter * Principal_in_Arrears > 0)),
                         any(Write_Off_Date != "1016-01-01") ~ which.max(filter),
                         TRUE ~ which.min(filter)))
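To encode the NB requirements more literally (first fully repaid month for loan 200, first month in arrears for loan 300, earliest row otherwise), one possibility is to let which.max pick the first row matching each condition; a sketch against the example outcome data above (using the original column names, not the Banjo_Loan_No variant):
library(dplyr)
# which.max() on a logical vector returns the index of its first TRUE,
# i.e. the first row of the group that meets the condition
outcome %>%
  group_by(Loan_Number) %>%
  slice(case_when(any(Principal_Remaining == 0) ~ which.max(Principal_Remaining == 0),
                  any(Principal_In_Arrears > 0) ~ which.max(Principal_In_Arrears > 0),
                  TRUE ~ which.min(Filter))) %>%
  ungroup()
On the sample data this returns Filter 1 for loan 100, 7 for loan 200, and 10 for loan 300, matching the NB.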

row index of "looked at" row case_when in R

I'm currently struggling with a coding task concerning the use of a case_when statement in R.
In essence, I would like to use the row index of the row case_when is currently looking at in the assignment part.
A short explanation of the data: I have a large data.frame with a date column, a geo-layer column and some numeric columns with numbers for the calculations.
The data.frame doesn't have any sorting, and not all geo layers are necessarily present at every point in time. Sadly, I can't provide a real data set due to legal issues.
The task at hand is to compute, on the one hand, simple mathematical operations within the same point in time and, on the other hand, operations across different points in time for the same geo layer and numeric value.
The mathematical operations vary, as does the interval between the time points.
For instance, I need to calculate a rate of change against the value of the last quarter and of the last year:
((current_value - last_quarter_value) / current_value)*100
This is how I'd like to code it.
library(tidyverse)
test_dataframe <- data.frame(
  times = c(rep(as.Date("2021-03-01"), 2), rep(as.Date("2020-12-01"), 2)),
  geo_layer = rep(c("001001001", "001001002"), 2),
  numeric_value_a = 1:4,
  numeric_value_b = 4:1,
  numeric_value_c = c(1, NA, 3, 1)
)
check_comparison_times <- unique(test_dataframe$times)
test_dataframe <- test_dataframe %>%
  mutate(
    normal_calculation = case_when(
      !is.na(numeric_value_c) ~ (numeric_value_a + numeric_value_b) / numeric_value_c,
      TRUE ~ Inf
    ),
    time_comparison = case_when(
      is.na(numeric_value_c) ~ Inf,
      (times - months(3)) %in% check_comparison_times ~
        test_dataframe[
          which(
            test_dataframe[, "times"] ==
              (test_dataframe[row_index_of_current_looked_at_row, "times"] - months(3)) &
            test_dataframe[, "geo_layer"] ==
              test_dataframe[row_index_of_current_looked_at_row, "geo_layer"]
          ),
          "numeric_value_c"] - test_dataframe[row_index_of_current_looked_at_row, "numeric_value_c"],
      TRUE ~ -Inf
    )
  )
With this desired outcome:
       times geo_layer numeric_value_a numeric_value_b numeric_value_c normal_calculation time_comparison
1 2021-03-01 001001001               1               4               1           5.000000               2
2 2021-03-01 001001002               2               3              NA                Inf             Inf
3 2020-12-01 001001001               3               2               3           1.666667            -Inf
4 2020-12-01 001001002               4               1               1           5.000000            -Inf
Currently I solve the problem with a triple loop in which I first pair the values by time, then by geo_layer, and then execute the mathematical operation.
Since my data set is much, much larger than this, that solution is very inefficient.
Thanks for your help.
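One way to avoid needing the current row index at all is to shift each row's date back a quarter and self-join on (times, geo_layer), so every row is paired with its own geo layer three months earlier. A minimal sketch, assuming lubridate's %m-% for the month arithmetic and the test_dataframe defined above:
library(dplyr)
library(lubridate)

test_dataframe %>%
  mutate(prev_quarter = times %m-% months(3)) %>%   # date of the comparison quarter
  left_join(test_dataframe %>%
              select(times, geo_layer, prev_value_c = numeric_value_c),
            by = c("prev_quarter" = "times", "geo_layer")) %>%
  mutate(time_comparison = case_when(
    is.na(numeric_value_c) ~ Inf,     # current value missing
    is.na(prev_value_c)    ~ -Inf,    # no row one quarter earlier
    TRUE                   ~ prev_value_c - numeric_value_c
  )) %>%
  select(-prev_quarter, -prev_value_c)
On the four-row example this reproduces the time_comparison column of the desired outcome (2, Inf, -Inf, -Inf), and because the pairing is a single join it avoids the triple loop entirely.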

Calculating response latency of the eye

I would like to calculate the response latency of the eye. I want to do this by measuring the time difference between onscreen appearance of a target and the onset of the fast eye movement in response.
Below is a picture of a single-trial example. The purple line is the period where the target appears on screen. The top trace shows the position data of the Y coordinate of the eye; the bottom trace shows the velocity.
As you can see here, the fast downward eye movement, with a high velocity, is a saccade.
To give you an idea of how my data looks, I made a dummy data.frame. The block column represents the blocks you can also see in the figure. Ignore trial.block for now. saccade is a column telling you whether the data is a S (saccade) or F (fixation).
Any idea how to calculate the time between Iview at the onset of the target and the onset of the first saccade for each individual trial?
Thanks a lot!
library(dplyr)
N = 500
G.df <- data.frame(Iview = seq(N * 2),
                   cue.condition = rep(c("spatial", "non-spatial"), each = N),
                   block = rep(c("fixation.1", "fixation.2", "target.1", "target.2"), each = N/2),
                   trial.block = rep(1:4, each = N/2),
                   trial.number = rep(1:50, each = 10),
                   saccade = sample(c("S", "F"), size = 100, replace = TRUE))
I'm not sure if I understood your request correctly. The time between the first occurrence of block == 'target.1' and the first occurrence of block == 'target.1' & saccade == 'S' for each trial could be calculated like this:
G.df %>%
  group_by(trial.number) %>%
  summarise(time_between = Iview[block == "target.1" & saccade == "S"][1] -
                           Iview[block == "target.1"][1])
# A tibble: 50 x 2
   trial.number time_between
          <int>        <int>
 1            1            2
 2            2            1
 3            3            0
 4            4            0
 5            5            1
 6            6            1
 7            7            1
 8            8            1
 9            9            0
10           10            0
# ... with 40 more rows
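Note that saccade is filled by sample() without a fixed seed, so the exact time_between values above will differ from run to run; setting a seed before building G.df makes the dummy data, and hence the summary, reproducible:
set.seed(1)  # arbitrary value; only pins down the sampled saccade column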

How can I create new column in data frame by aggregating rows?

I have a large (~200k rows) dataframe that is structured like this:
df <- data.frame(c(1, 1, 1, 1, 1),
                 c('blue', 'blue', 'blue', 'blue', 'blue'),
                 c('m', 'm', 'm', 'm', 'm'),
                 c(2016, 2016, 2016, 2016, 2016),
                 c(3, 4, 5, 6, 7),
                 c(10, 20, 30, 40, 50))
colnames(df) <- c('id', 'color', 'size', 'year', 'week', 'revenue')
Let's say it is currently week 7, and I want to compare the trailing 4 week average of revenue to the current week's revenue. What I would like to do is create a new column for that average when all of the identifiers match.
df_new <- data.frame(1, 'blue', 'm', 2016, 7, 50, 25)
colnames(df_new) <- c('id', 'color', 'size', 'year', 'week', 'revenue', 't4ave')
How can I accomplish this efficiently? Thank you for the help
Good question. For loops are pretty inefficient, but since you do have to check the conditions of prior entries, this is the only solution I can think of (mind you, I'm also an intermediate at R):
df$diff <- NA  # initialise the result column
for (i in 1:nrow(df))
{
  # condition for all identifiers to match the previous 4 entries
  if ((i > 4) && all(df$id[i] == df$id[(i-4):(i-1)])
      && all(df$color[i] == df$color[(i-4):(i-1)])
      && all(df$size[i] == df$size[(i-4):(i-1)])
      && all(df$year[i] == df$year[(i-4):(i-1)]))
  {
    # average of the previous 4 entries' revenues
    avg <- mean(df$revenue[(i-4):(i-1)])
    # difference between this entry's revenue and the trailing average
    df$diff[i] <- df$revenue[i] - avg
  }
}
This code will probably take forever, but it should work. If this is a one time thing for when the code needs to run, then it should be okay. Otherwise, hopefully others will be able to advise.
A solution using dplyr and zoo. The idea is to group by the variables that are the same, such as id, color, size, and year. After that, use rollmean to calculate the rolling mean of revenue, with na.pad = TRUE and align = "right" to make sure the calculation covers the most recent weeks. Finally, use lag to "shift" the results down one row to fit your needs.
library(dplyr)
library(zoo)
df2 <- df %>%
  group_by(id, color, size, year) %>%
  mutate(t4ave = rollmean(revenue, 4, na.pad = TRUE, align = "right")) %>%
  mutate(t4ave = lag(t4ave))
df2
# A tibble: 5 x 7
# Groups:   id, color, size, year [1]
     id  color   size  year  week revenue t4ave
  <dbl> <fctr> <fctr> <dbl> <dbl>   <dbl> <dbl>
1     1   blue      m  2016     3      10    NA
2     1   blue      m  2016     4      20    NA
3     1   blue      m  2016     5      30    NA
4     1   blue      m  2016     6      40    NA
5     1   blue      m  2016     7      50    25
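The same trailing 4-week average can also be expressed without zoo, as an explicit mean of four lagged copies of revenue; a small dplyr-only sketch of that alternative:
library(dplyr)

df2b <- df %>%
  group_by(id, color, size, year) %>%
  mutate(t4ave = (lag(revenue, 1) + lag(revenue, 2) +
                  lag(revenue, 3) + lag(revenue, 4)) / 4) %>%  # NA until 4 prior weeks exist
  ungroup()
For week 7 this gives (10 + 20 + 30 + 40) / 4 = 25, matching the t4ave column above.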

How to subset data for a specific column with ddply?

I would like to know if there is a simple way to achieve what I describe below using ddply. My data frame describes an experiment with two conditions. Participants had to select between options A and B, and we recorded how long they took to decide, and whether their responses were accurate or not.
I use ddply to create averages by condition. The column nAccurate summarizes the number of accurate responses in each condition. I also want to know how much time they took to decide and express it in the column RT. However, I want to calculate average response times only when participants got the response right (i.e. Accuracy==1). Currently, the code below can only calculate average reaction times for all responses (accurate and inaccurate ones). Is there a simple way to modify it to get average response times computed only in accurate trials?
See sample code below and thanks!
library(plyr)
# Create sample data frame.
Condition = c(rep(1,6), rep(2,6)) #two conditions
Response = c("A","A","A","A","B","A","B","B","B","B","A","A") #whether option "A" or "B" was selected
Accuracy = rep(c(1,1,0),4) #whether the response was accurate or not
RT = c(110,133,121,122,145,166,178,433,300,340,250,674) #response times
df = data.frame(Condition,Response, Accuracy,RT)
head(df)
  Condition Response Accuracy  RT
1         1        A        1 110
2         1        A        1 133
3         1        A        0 121
4         1        A        1 122
5         1        B        1 145
6         1        A        0 166
# Calculate averages.
avg <- ddply(df, .(Condition), summarise,
             N = length(Response),
             nAccurate = sum(Accuracy),
             RT = mean(RT))
# The problem: response times are calculated over all trials. I would like
# to calculate mean response times *for accurate responses only*.
avg
  Condition N nAccurate       RT
1         1 6         4 132.8333
2         2 6         4 362.5000
With plyr, you can do it as follows:
ddply(df,
      .(Condition), summarise,
      N = length(Response),
      nAccurate = sum(Accuracy),
      RT = mean(RT[Accuracy == 1]))
this gives:
   Condition N nAccurate     RT
1:         1 6         4 127.50
2:         2 6         4 300.25
If you use data.table, then this is an alternative way:
library(data.table)
setDT(df)[, .(N = .N,
              nAccurate = sum(Accuracy),
              RT = mean(RT[Accuracy == 1])),
          by = Condition]
Using dplyr package:
library(dplyr)
df %>%
  group_by(Condition) %>%
  summarise(N = n(),
            nAccurate = sum(Accuracy),
            RT = mean(RT[Accuracy == 1]))
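One caveat with mean(RT[Accuracy == 1]): if a condition happened to contain no accurate trials, the subset would be empty and mean() would return NaN. A guarded variant (a sketch, not part of the original answers):
library(dplyr)

df %>%
  group_by(Condition) %>%
  summarise(N = n(),
            nAccurate = sum(Accuracy),
            # return NA instead of NaN when a condition has no accurate trials
            RT = if (any(Accuracy == 1)) mean(RT[Accuracy == 1]) else NA_real_)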
