I have a dataframe from a psychology experiment with the time since the beginning of the experiment for each subject, and what I want is to derive from that the time since the beginning of each trial for each subject. To do so I'm basically just subtracting the minimum time value for each trial/subject from all the values for that same trial/subject.
I'm currently doing it with two for loops; I was just wondering if there's a way to vectorise it. What I have at the moment:
for (s in 1:max(df$Subject)) {
  subject <- df[df$Subject == s, ]
  for (t in 1:max(subject$TrialId)) {
    trial <- subject[subject$TrialId == t, ]
    start_offset <- min(trial$timestamp)
    df$timestamp[df$Subject == s & df$TrialId == t] <-
      df$timestamp[df$Subject == s & df$TrialId == t] - start_offset
  }
}
And what I would like is something like
df$timestamp <- df$timestamp - min_per_trial_per_subject(df$timestamp)
With dplyr
library(dplyr)
df %>%
  group_by(Subject, TrialId) %>%
  mutate(modified_timestamp = timestamp - min(timestamp))
Should work. If it doesn't, please share a reproducible example so we can test.
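If you'd rather avoid the dplyr dependency, a base R sketch using ave() should do the same thing in place (assuming Subject, TrialId, and timestamp are the actual column names, as in your loop):
# subtract the per-Subject, per-TrialId minimum from every timestamp
df$timestamp <- df$timestamp - ave(df$timestamp, df$Subject, df$TrialId, FUN = min)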
The dataset is 1 column with thousands of rows that contain a date like "2021-09-23T06:38:53.458Z".
With the following code I am able to subset the rows from yesterday:
rows_from_yesterday <- df[df$timestamp %like% "2021-09-24", ]
It works like a charm! I would now like to automate the process because I am not able to update the match criteria each day. How would one approach this? Any tips or suggestions?
Just to be clear: I would like the "2021-09-24" to be automatically updated to "2021-09-25" when it is tomorrow. I have tried the following:
rows_from_yesterday <- df[df$timestamp %like% as.character(Sys.Date()-1), ]
This was sadly without success.
If I understood correctly, you want to filter the observations from yesterday, right? If so, here is a solution:
library(dplyr)
library(lubridate)

x <- "2021-09-23T06:38:53.458Z"
df <- tibble(timestamp = x)

df %>%
  mutate(timestamp = ymd_hms(timestamp)) %>%
  # filter dates equal to yesterday
  filter(as_date(timestamp) == (today() - days(1)))
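If you want to stay closer to your original subsetting approach and skip lubridate, a minimal base R sketch (assuming the timestamps are ISO 8601 strings exactly as shown, with the date in the first 10 characters) would be:
# compare the "YYYY-MM-DD" prefix of each timestamp to yesterday's date
rows_from_yesterday <- df[as.Date(substr(df$timestamp, 1, 10)) == Sys.Date() - 1, ]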
I would like to create lagged values for multiple columns in R.
First, I used a function to create lead/lag like this:
mleadlag <- function(x, n, ts_id) {
  pos <- match(as.numeric(ts_id) + n, as.numeric(ts_id))
  x[pos]
}
Second, I would like to apply this function to several columns in R. firm.characteristics is a list of the columns for which I would like to compute lagged values.
library(dplyr)
firm.characteristics <- colnames(df)[4:6]

for (i in 1:length(firm.characteristics)) {
  df <- df %>%
    group_by(company) %>%
    mutate(!!paste0("lag_", i) := mleadlag(df[[i]], -1, fye)) %>%
    ungroup()
}
However, I didn't get the correct values. The output for all companies in year t is the last row in year t-1. It didn't group by company and compute the lagged values.
Can anyone help me see what is wrong in the loop? Or what should I do to get the correct lagged values?
Thank you so much for your help.
A reproducible sample could look like this:
set.seed(42)  ## for sake of reproducibility
n <- 6
dat <- data.frame(company = 1:n,
                  fye = 2009,
                  x = rnorm(n),
                  y = rnorm(n),
                  z = rnorm(n),
                  k = rnorm(n),
                  m = rnorm(n))
dat2 <- data.frame(company = 1:n,
                   fye = 2010,
                   x = rnorm(n),
                   y = rnorm(n),
                   z = rnorm(n),
                   k = rnorm(n),
                   m = rnorm(n))
dat3 <- data.frame(company = 1:n,
                   fye = 2011,
                   x = rnorm(n),
                   y = rnorm(n),
                   z = rnorm(n),
                   k = rnorm(n),
                   m = rnorm(n))
df <- rbind(dat, dat2, dat3)
I would try to stay away from loops in the tidyverse. Many operations that would traditionally require loops already exist in the tidyverse and are very fast, which makes for more efficient and, in my opinion, more intuitive code. This is a great use case for dplyr's across() functionality. I first changed the df to a tibble.
df %>%
  as_tibble() %>%
  group_by(company) %>%
  mutate(
    # all_of() selects the columns named in the firm.characteristics vector
    across(all_of(firm.characteristics), ~ lag(., 1L))
  ) %>%
  ungroup()
This generates the required lagged values. For more information see dplyr's across documentation.
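If you also want to keep the original columns and add lag_-prefixed copies, as your loop seemed to intend, across() has a .names argument for that. A sketch under the same assumptions as above:
df %>%
  as_tibble() %>%
  group_by(company) %>%
  mutate(across(all_of(firm.characteristics), ~ lag(., 1L),
                .names = "lag_{.col}")) %>%  # adds lag_y, lag_z, lag_k with the sample data above
  ungroup()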
Since I have to read over 3 GB of data, I would like to improve my code by replacing the two for loops and the if statement with an apply function.
Below is a reproducible example of my code. The overall purpose (in this example) is to count the number of positive and negative values in the "c" column for each combination of a and b values. In my real case I have over 150 files to read.
# Example of initial data set
df1 <- data.frame(a = rep(c(1:5), times = 3), b = rep(c(1:3), each = 5), c = rnorm(15))
# Another dataframe to keep track of "c" counts, one row per a/b combination
dfOcc <- data.frame(a = rep(c(1:5), times = 3), b = rep(c(1:3), each = 5),
                    positive = 0, negative = 0)
So far I did this code, which works but is really slow:
for (i in 1:nrow(df1)) {
  x <- df1[i, "a"]
  y <- df1[i, "b"]
  if (df1[i, "c"] >= 0) {
    dfOcc[which(dfOcc$a == x & dfOcc$b == y), "positive"] <-
      dfOcc[which(dfOcc$a == x & dfOcc$b == y), "positive"] + 1
  } else {
    dfOcc[which(dfOcc$a == x & dfOcc$b == y), "negative"] <-
      dfOcc[which(dfOcc$a == x & dfOcc$b == y), "negative"] + 1
  }
}
I am unsure whether the code is slow due to the size of the files (260k rows each) or due to the for-loop?
So far I managed to improve it in this way:
dfOcc[which(dfOcc$a == df1$a & dfOcc$b == df1$b), "positive"] <-
  apply(df1, 1, function(x) ifelse(x["c"] > 0, 1, 0))
This works fine in this example but not in my real case:
It only keeps count of the positive c values, and running the code twice (once for positive, once for negative) might be counterproductive.
My original datasets are 260k rows while my "tracer" is 10k rows (the initial dataset repeats the a and b values with other c values).
Any tip on how to improve those two points would be greatly appreciated!
I think you can simply count and spread the data. This will be easier and will work on any group and dataset. You can change group_by(a) to group_by(a, b) if you want to count by both the a and b columns (see the variant after the code).
library(dplyr)
library(tidyr)

df1 %>%
  group_by(a) %>%
  mutate(sign = ifelse(c > 0, "Positive", "Negative")) %>%
  count(sign) %>%
  spread(sign, n)
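Since the question counts per a/b pair, here is a sketch of the same idea grouped by both columns, using pivot_wider(), the newer replacement for spread() (this assumes the df1 example above):
df1 %>%
  mutate(sign = ifelse(c > 0, "Positive", "Negative")) %>%
  count(a, b, sign) %>%              # one row per a/b/sign combination
  pivot_wider(names_from = sign,
              values_from = n,
              values_fill = 0)       # one column per sign, 0 where absent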
The data.table package might help you do this in one line.
library(data.table)
df1 <- data.table(data.frame(a = rep(c(1:5), times = 3), b = rep(c(1:3), each = 5), c = rnorm(15)))
posneg <- c("positive", "negative")  # list of columns needed
df1[, (posneg) := list(ifelse(c > 0, 1, 0), ifelse(c < 0, 1, 0))]  # use list to combine the 2 ifelse conditions
For more information, try
?data.table
If you really want the positive/negative counts to be in a separate dataframe:
dfOcc <- df1[,c("a", "positive","negative")]
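Note that the := approach above flags each row individually; if what you actually need is one count per a/b pair (my reading of the question), an aggregation sketch in data.table could look like:
# count positive and negative c values per a/b combination
dfOcc <- df1[, .(positive = sum(c > 0), negative = sum(c < 0)), by = .(a, b)]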
I'm working with a dataframe (in R) that contains observations of animals in the wild (recording time/date, location, and species identification). I want to remove rows that contain a certain species if there are fewer than x observations of it in the whole dataframe. As of now, I managed to get it to work with the following code, but I know there must be a more elegant and efficient way to do it.
namelist <- names(table(ind.data$Species))

for (i in 1:length(namelist)) {
  if (table(ind.data$Species)[namelist[i]] <= 2) {
    while (namelist[i] %in% ind.data$Species) {
      j <- match(namelist[i], ind.data$Species)
      ind.data <- ind.data[-j, ]
    }
  }
}
The namelist vector contains all the species names in the data frame ind.data, and the if statement checks whether the frequency of the ith name on the list is at most x (2 in this example).
I'm fully aware that this is not a very clean way to do it; I just threw it together at the end of the day yesterday to see if it would work. Now I'm looking for a better way to do it, or at least for how I could refine it.
You can do this with the dplyr package:
library(dplyr)
new.ind.data <- ind.data %>%
group_by(Species) %>%
filter(n() > 2) %>%
ungroup()
An alternative using built-in functions is to use ave():
# use seq_along() so ave() returns numeric counts even when Species is character
group_sizes <- ave(seq_along(ind.data$Species), ind.data$Species, FUN = length)
new.ind.data <- ind.data[group_sizes > 2, ]
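If you'd rather build on the table() call you already have, a short base R sketch along the same lines:
# count observations per species, then keep rows whose species appears more than twice
species_counts <- table(ind.data$Species)
new.ind.data <- ind.data[ind.data$Species %in% names(species_counts)[species_counts > 2], ]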
We can use data.table
library(data.table)
setDT(ind.data)[, .SD[.N > 2], Species]
I'm trying to subset some data and am stuck on the last part of cleaning.
What I need to do is calculate, for each individual (indivID), the number of observations in each month (June, July, and August), turn that into a percentage of the days in that month without missing data, and then keep only the observations that are over 75%.
I was able to create a nested for loop, but it took probably 6 hours to process today. I would like to be able to take advantage of parallel computing by using ddply, or another function, but am very lost.
Here's the data (Note this is a very small subset that only includes individuals from 1:5):
https://www.dropbox.com/s/fmk8900622klsgt/data.csv?dl=0
And here's the for loop:
epa.d <- read.csv("/.../data.csv")

# Function for loops
days <- function(month) {
  if (month == 06) return(as.numeric(30))
  if (month == 07) return(as.numeric(31))
  if (month == 08) return(as.numeric(31))
}
# Subset data for 75% in June, July, and August
for (i in unique(epa.d$indivID)) {
  for (j in unique(epa.d$year)) {
    for (k in unique(epa.d$month)) {
      monthsum <- sum(epa.d$indivID == i & epa.d$year == j & epa.d$month == k)
      monthperc <- (monthsum / days(k)) * 100
      if (monthperc < 75) {
        epa.d <- epa.d[!(epa.d$indivID == i & epa.d$year == j), ]
      }
    }
  }
}
If I understand you correctly, you want to keep daily observations for each combination of indivID-month-year in which at least 75% of days have ozone measurements. Here's a way to do it that should be pretty fast:
library(dplyr)
# For each indivID, calculate percent of days in each month with
# ozone observations, and keep those with pctCoverage >= 0.75
epa.d_75 = epa.d %>%
  group_by(indivID, year, month) %>%
  summarise(count = n()) %>%
  mutate(pctCoverage = ifelse(month == 6, count/30, count/31)) %>%
  filter(pctCoverage >= 0.75)
We now have a data frame epa.d_75 that has one row for each indivID-month-year with at least 75% coverage. Next, we'll merge the daily data into this data frame, resulting in one row for each daily observation for each unique indivID-month-year.
# Merge in daily data for each combination of indivID-month-year that meets
# the 75% coverage criterion
epa.d_75 = merge(epa.d_75, epa.d, by = c("indivID", "month", "year"),
                 all.x = TRUE)
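A related option (a sketch, not what the answer above used): if you only need the daily rows and not the coverage columns, dplyr's semi_join() keeps every row of epa.d whose indivID-month-year combination appears in epa.d_75. The name epa.d_daily_75 is just illustrative:
epa.d_daily_75 = epa.d %>%
  semi_join(epa.d_75, by = c("indivID", "month", "year"))  # filters epa.d, keeps only its own columns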
Update: To answer the questions in the comments:
Can you explain what the %>% is doing, and if possible a breakdown of how you logically thought about this.
The %>% is a "chaining" operator that allows you to chain functions one after the other without having to store the result of the previous function before running the next one. Take a look at the dplyr Vignette to learn more about how to use it. Here's how the logic works in this case:
group_by splits the data set by the grouping variables, then runs the next functions separately on each group. In this case, summarise counts the number of rows in the data frame for each unique combination of indivID, month, and year, then mutate adds a column with the fractional coverage for that indivID for that month and year. filter then gets rid of any combination of indivID, month, and year with less than 75% coverage. You can stop the chain at any point to see what it's doing. For example, run the following code to see what epa.d_75 looks like before the filtering operation:
epa.d_75 = epa.d %>%
  group_by(indivID, year, month) %>%
  summarise(count = n()) %>%
  mutate(pctCoverage = ifelse(month == 6, count/30, count/31))
Why the hell is this so much faster than running for loops? I don't know the answer in detail, but dplyr does most of its magic in C code under the hood, which is faster than native R. Hopefully someone else can give a more precise and detailed answer.
Another option would be to use data.table (similar to #eipi10's dplyr method), which would be very fast.
library(data.table)
epa.d_75 <- setDT(epa.d)[, list(pctCoverage = ifelse(month == 6, .N/30, .N/31)),
                         by = list(indivID, year, month)][pctCoverage >= 0.75]
epa.d_75New = merge(epa.d_75, epa.d, by = c("indivID", "month", "year"),
                    all.x = TRUE)
data
epa.d <- read.csv('data.csv', row.names=1)