I would like to know if there is a simple way to achieve what I describe below using ddply. My data frame describes an experiment with two conditions. Participants had to select between options A and B, and we recorded how long they took to decide, and whether their responses were accurate or not.
I use ddply to create averages by condition. The column nAccurate summarizes the number of accurate responses in each condition. I also want to know how long participants took to decide, expressed in the column RT. However, I want to calculate average response times only when participants got the response right (i.e. Accuracy==1). Currently, the code below calculates average reaction times over all responses (accurate and inaccurate ones). Is there a simple way to modify it so that average response times are computed over accurate trials only?
See sample code below and thanks!
library(plyr)
# Create sample data frame.
Condition = c(rep(1,6), rep(2,6)) #two conditions
Response = c("A","A","A","A","B","A","B","B","B","B","A","A") #whether option "A" or "B" was selected
Accuracy = rep(c(1,1,0),4) #whether the response was accurate or not
RT = c(110,133,121,122,145,166,178,433,300,340,250,674) #response times
df = data.frame(Condition,Response, Accuracy,RT)
head(df)
Condition Response Accuracy RT
1 1 A 1 110
2 1 A 1 133
3 1 A 0 121
4 1 A 1 122
5 1 B 1 145
6 1 A 0 166
# Calculate averages.
avg <- ddply(df, .(Condition), summarise,
             N = length(Response),
             nAccurate = sum(Accuracy),
             RT = mean(RT))
# The problem: response times are calculated over all trials. I would like
# to calculate mean response times *for accurate responses only*.
avg
Condition N nAccurate RT
1 6 4 132.8333
2 6 4 362.5000
With plyr, you can do it as follows:
ddply(df,
      .(Condition), summarise,
      N = length(Response),
      nAccurate = sum(Accuracy),
      RT = mean(RT[Accuracy==1]))
This gives:
  Condition N nAccurate     RT
1         1 6         4 127.50
2         2 6         4 300.25
If you use data.table, then this is an alternative way:
library(data.table)
setDT(df)[, .(N = .N,
              nAccurate = sum(Accuracy),
              RT = mean(RT[Accuracy==1])),
          by = Condition]
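Note that setDT() converts df to a data.table by reference, so after this call df itself is a data.table; use as.data.table(df) instead if you want to keep the original data.frame untouched.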
Using dplyr package:
library(dplyr)
df %>%
  group_by(Condition) %>%
  summarise(N = n(),
            nAccurate = sum(Accuracy),
            RT = mean(RT[Accuracy == 1]))
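For completeness, a hedged base R sketch of the same summary using only aggregate and merge (assuming the df defined above; the cbind(N = 1, ...) idiom counts rows per group):
# Trials per condition (N) and number of accurate trials (nAccurate)
counts <- aggregate(cbind(N = 1, nAccurate = Accuracy) ~ Condition,
                    data = df, FUN = sum)
# Mean RT over accurate trials only
rt_acc <- aggregate(RT ~ Condition, data = subset(df, Accuracy == 1),
                    FUN = mean)
merge(counts, rt_acc, by = "Condition")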
Related
I am very new to R and am struggling with this concept. I have a data frame that looks like this:
[screenshot of the data frame omitted]
I have used summary(FoodFacilityInspections$DateRecent) to get the observations for each "date" listed. I have 3932 observations, though, and wanted to get a summary of:
Dates with the most observations and the percentage for that
Percentage of observations for the Date Recent category
I have tried:
count(FoodFacilityInspections$DateRecent)
Error in UseMethod("count") :
  no applicable method for 'count' applied to an object of class "factor"
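As an aside, that error is what dplyr's count() throws when handed a bare vector: count() is a generic that expects a data frame. Assuming the FoodFacilityInspections data frame from the screenshot, something along these lines should work instead:
library(dplyr)
# count() takes a data frame plus a column; sort = TRUE puts the most
# frequent dates first
FoodFacilityInspections %>% count(DateRecent, sort = TRUE)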
Using built-in data, as you did not provide example data:
library(data.table)
dtcars <- data.table(mtcars, keep.rownames = TRUE)
Solution
dtcars[, .("count" = .N, "percent" = .N / dtcars[, .N] * 100),
       by = cyl]
You can use the table function to find out which date occurs the most. Then you can loop through each item in the table (date in your case) and divide it by the total number of rows like this (also using the mtcars dataset):
table(mtcars$cyl)
percent <- c()
for (i in 1:length(table(mtcars$cyl))){
  percent[i] <- table(mtcars$cyl)[i]/nrow(mtcars) * 100
}
output <- cbind(table(mtcars$cyl), percent)
output
percent
4 11 34.375
6 7 21.875
8 14 43.750
A one-liner using table and proportions inside a within call:
within(as.data.frame.table(with(mtcars, table(cyl))), Pc <- proportions(Freq)*100)
# cyl Freq Pc
# 1 4 11 34.375
# 2 6 7 21.875
# 3 8 14 43.750
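Note that proportions() is a fairly recent base R alias for prop.table() (added around R 4.0); on older installations, Pc <- prop.table(Freq)*100 is the equivalent.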
An updated solution with total, percent and cumulative percent columns, based on your data:
library(dplyr)
data<-data.frame("ScoreRecent"=c(100,100,100,100,100,100,100,100,100),
"DateRecent"=c("7/23/2021", "7/8/2021","5/25/2021","5/19/2021","5/20/2021","5/13/2021","5/17/2021","5/18/2021","5/18/2021"),
"Facility_Type_Description"=c("Retail Food Stores", "Retail Food Stores","Food Service Establishment","Food Service Establishment","Food Service Establishment","Food Service Establishment","Food Service Establishment","Food Service Establishment","Food Service Establishment"),
"Premise_zip"=c(40207,40207,40207,40206,40207,40206,40207,40206,40206),
"Opening_Date"=c("6/27/1988","6/29/1988","10/20/2009","2/28/1989","10/20/2009","10/20/2009","10/20/2009","10/20/2009", "10/20/2009"))
tab <- function(dataset, var){
  dataset %>%
    group_by({{var}}) %>%
    summarise(n = n()) %>%
    mutate(total = cumsum(n),                            # running total of counts
           percent = n / sum(n) * 100,                   # share of all rows
           cumulativepercent = cumsum(n / sum(n) * 100)) # running share
}
tab(data, Facility_Type_Description)
Facility_Type_Description n total percent cumulativepercent
<chr> <int> <int> <dbl> <dbl>
1 Food Service Establishment 7 7 77.8 77.8
2 Retail Food Stores 2 9 22.2 100
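Applied to the asker's data, the call would presumably be tab(FoodFacilityInspections, DateRecent), assuming the frame and column names from the screenshot.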
I have a df like this:
> df<-data.frame(Client.code =
c(100451,100451,100523,100523,100523,100525),dayref = c(24,30,15,13,17,5))
> df
Client.code dayref
1 100451 24
2 100451 30
3 100523 15
4 100523 13
5 100523 17
6 100525 5
It represents a one-year distribution of payment periods from issue.
Using the data above, and given a df2 like this:
Client.Code Days
1 100451 16
1 100523 16
1 100460 35
Since I have enough data for reasonable quantile probability calculations, I would like to know how to build a loop that assigns to every row of days in this df2 a quantile according to the distribution in the first df.
We can use data.table
library(data.table)
setDT(df)[, .(Quantile = quantile(dayref)), Client.code]
Or with tidyverse
library(dplyr)
library(tidyr)
df %>%
  group_by(Client.code) %>%
  summarise(Quantile = list(quantile(dayref))) %>%
  unnest(cols = Quantile)
tapply(df$dayref, df$Client.code, quantile)
You can request specific percentiles by passing a vector of probabilities:
tapply(df$dayref, df$Client.code, quantile, 1:19/20)
If positional matching causes trouble, name the argument explicitly:
tapply(df$dayref, df$Client.code, quantile, probs = 1:19/20)
And you can add na.rm = TRUE as another argument if you might have NAs
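None of the snippets above quite completes the original task of scoring each df2 row against its client's distribution. Here is a hedged base R sketch using ecdf (assuming df as defined above, and a hypothetical df2 built from the question's second table, where the column is spelled Client.Code):
# Hypothetical df2 from the question's second table
df2 <- data.frame(Client.Code = c(100451, 100523, 100460),
                  Days = c(16, 16, 35))
# For each row, evaluate the empirical CDF of that client's dayref values
# at Days; the result is the quantile position in [0, 1], or NA when the
# client has no payment history in df (e.g. 100460)
df2$Quantile <- mapply(function(code, days) {
  x <- df$dayref[df$Client.code == code]
  if (length(x)) ecdf(x)(days) else NA_real_
}, df2$Client.Code, df2$Days)
df2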
I am trying to calculate family sizes from a data frame which also contains two types of events: family members who died, and those who left the family. I would like to take these two parameters into account in order to compute the actual family size.
Here is a reproducible example of my problem, with only 3 families:
family <- factor(rep(c("001","002","003"), c(10,8,15)), levels=c("001","002","003"), labels=c("001","002","003"), ordered=TRUE)
dead <- c(0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0)
left <- c(0,0,0,0,0,1,0,0,0,1,1,0,0,0,1,1,0,0,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0)
DF <- data.frame(family, dead, left) ; DF
I could count N = the total number of family members in each family, in a second data frame DF2, simply by using table():
DF2 <- with(DF, data.frame(table(family)))
colnames(DF2)[2] <- "N" ; DF2
family N
1 001 10
2 002 8
3 003 15
But I cannot find a proper way to get the actual number of people (for example, by creating a new variable N2 in DF2), calculated by subtracting from N the number of members who died or left the family. I suppose I have to relate the two data frames DF and DF2 in some way. I have looked for other related questions on this site but could not find the right answer...
If anyone has a good idea, it would be great!
Thank you in advance,
Deni
Logic: first we want to group_by(family) and then calculate two numbers: i) the total number of observations in each group, and ii) that total minus sum(dead) + sum(left).
In the dplyr package, n() gives the total number of observations in each group.
In data.table, .N does the same job.
library(dplyr)
DF %>% group_by(family) %>% summarise( total = n(), current = n()-sum(dead,left, na.rm = TRUE))
# family total current
# (fctr) (int) (dbl)
#1 001 10 6
#2 002 8 4
#3 003 15 7
library(data.table)
# setDT() converts DF from a data.frame to a data.table by reference;
# if DF is already a data.table, it is not needed.
setDT(DF)[, .(total = .N, current = .N - sum(dead, left, na.rm = TRUE)), by = family]
# family total current
#1: 001 10 6
#2: 002 8 4
#3: 003 15 7
Here is a base R option
do.call(data.frame, aggregate(dl ~ family, transform(DF, dl = dead + left),
        FUN = function(x) c(total = length(x), current = length(x) - sum(x))))
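(The do.call(data.frame, ...) wrapper is needed because when FUN returns a vector, aggregate stores the result as a single matrix column; the wrapper splits it into ordinary total and current columns.)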
Or a modified version is
transform(aggregate(. ~ family,
                    transform(DF, total = 1, current = dead + left)[c(1, 4:5)],
                    FUN = sum),
          current = total - current)
# family total current
#1 001 10 6
#2 002 8 4
#3 003 15 7
I finally found another solution which works fine (from another post), allowing me to compute everything within the original DF table. It uses the ddply function:
library(plyr)
DF <- ddply(DF, .(family), transform, total = length(family))
DF <- ddply(DF, .(family), transform, actual = length(family) - sum(dead == 1) - sum(left == 1))
DF
Thanks a lot to everyone who helped! Deni
This is the first time that I ask a question on stack overflow. I have tried searching for the answer but I cannot find exactly what I am looking for. I hope someone can help.
I have a huge data set of 20416 observation. Basically, I have 83 subjects and for each subject I have several observations. However, the number of observations per subject is not the same (e.g. subject 1 has 256 observations, while subject 2 has only 64 observations).
I want to add an extra column containing the mean of the observations for each subject (the observations are reading times (RT)).
I tried with the aggregate function:
aggregate(RT ~ su, data, mean)
This formula returns the correct mean per subject. But then I cannot simply do the following:
data$mean <- aggregate(RT ~ su, data, mean)
as R returns this error:
Error in `$<-.data.frame`(`*tmp*`, "mean", value = list(su = 1:83, RT = c(378.1328125,  :
  replacement has 83 rows, data has 20416
I understand that the formula lacks a command specifying that the mean for each subject has to be repeated for all the subject's rows (e.g. if subject 1 has 256 rows, the mean for subject 1 has to be repeated for 256 rows, if subject 2 has 64 rows, the mean for subject 2 has to be repeated for 64 rows and so forth).
How can I achieve this in R?
The data.table syntax lends itself well to this kind of problem:
Dt[, Mean := mean(Value), by = "ID"][]
# ID Value Mean
# 1: a 0.05881156 0.004426491
# 2: a -0.04995858 0.004426491
# 3: b 0.64054432 0.038809830
# 4: b -0.56292466 0.038809830
# 5: c 0.44254622 0.099747707
# 6: c -0.10771992 0.099747707
# 7: c -0.03558318 0.099747707
# 8: d 0.56727423 0.532377247
# 9: d -0.60962095 0.532377247
# 10: d 1.13808538 0.532377247
# 11: d 1.03377033 0.532377247
# 12: e 1.38789640 0.568760936
# 13: e -0.57420308 0.568760936
# 14: e 0.89258949 0.568760936
As we are applying a grouped operation (by = "ID"), data.table will automatically replicate each group's mean(Value) the appropriate number of times (avoiding the error you ran into above).
Data:
Dt <- data.table::data.table(
  ID = sample(letters[1:5], size = 14, replace = TRUE),
  Value = rnorm(14))[order(ID)]
Staying in Base R, ave is intended for this use:
data$mean = with(data, ave(x = RT, su, FUN = mean))
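Because ave() returns a vector as long as its input, each subject's mean is automatically repeated across all of that subject's rows, which is exactly the recycling the aggregate() approach above was missing.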
Simply merge your aggregated means with the full data frame, joined by the subject:
aggdf <- aggregate(RT ~ su, data, mean)
names(aggdf)[2] <- "MeanOfRT"
data <- merge(data, aggdf, by = "su")
Another convenient way of handling this without generating extra data objects is to use group_by from the dplyr package:
# Generating some data
data <- data.table::data.table(
  su = sample(letters[1:5], size = 14, replace = TRUE),
  RT = rnorm(14))[order(su)]
# Performing
library(dplyr)
data %>% group_by(su) %>%
  mutate(Mean = mean(RT)) %>%
  ungroup()
Source: local data table [14 x 3]
su RT Mean
1 a -1.62841746 0.2096967
2 a 0.07286149 0.2096967
3 a 0.02429030 0.2096967
4 a 0.98882343 0.2096967
5 a 0.95407214 0.2096967
6 a 1.18823435 0.2096967
7 a -0.13198711 0.2096967
8 b -0.34897914 0.1469982
9 b 0.64297557 0.1469982
10 c -0.58995261 -0.5899526
11 d -0.95995198 0.3067978
12 d 1.57354754 0.3067978
13 e 0.43071258 0.2462978
14 e 0.06188307 0.2462978
GIVEN DATA
I have 6 columns of data of vehicle trajectory (observation of vehicles' change in position, velocity, etc over time) a part of which is shown below:
Vehicle ID Frame ID Global X Vehicle class Vehicle velocity Lane
1 177 6451181 2 24.99 5
1 178 6451182 2 24.95 5
1 179 6451184 2 24.91 5
1 180 6451186 2 24.90 5
1 181 6451187 2 24.96 5
1 182 6451189 2 25.08 5
Vehicle ID is the identification of individual vehicles, e.g. vehicle 1, vehicle 2, etc. It is repeated in the column for each frame in which the vehicle was observed. Please note that each frame is 0.1 seconds long, so 10 frames make 1 second. The frame IDs are in the Frame ID column. Vehicle class is the type of vehicle (1 = motorcycle, 2 = car, 3 = truck). The Vehicle velocity column represents the instantaneous speed of the vehicle in that instant of time, i.e. in a frame. Lane represents the number or ID of the lane in which the vehicle is present in a particular frame.
WHAT I NEED TO FIND
The data I have is for 15 minutes period. The minimum frame ID is 5 and maximum frame ID is 9952. I need to find the total number of vehicles in every 30 seconds time period. This means that starting from the first 30 seconds (frame ID 5 to frame ID 305), I need to know the unique vehicle IDs observed. Also, for these 30 seconds period, I need to find the average velocity of each vehicle class. This means that e.g. for cars I need to find the average of all velocities of those vehicles whose vehicle class is 2.
I need to find this for all 30-second time periods, i.e. 5-305, 305-605, 605-905, ..., 9605-9905. The output should be tables for cars, trucks and motorcycles like this:
Time Slots Total Cars Average Velocity
5-305 xx xx
305-605 xx xx
. . .
. . .
9605-9905 xx xx
WHAT I HAVE TRIED SO FAR
# Finding the minimum and maximum Frame ID for creating 30-seconds time slots
minfid <- min(data$'Frame ID') # this was 5
maxfid <- max(data$'Frame ID') # this was 9952
for (i in 'Frame ID'==5:'Frame ID'==305) {
table ('Vehicle ID')
mean('Vehicle Velocity', 'Vehicle class'==2)
} #For cars in first 30 seconds
I can't generate the required output and I don't know how can I do this for all 30 second periods. Please help.
It's a bit tough to make sure code is completely correct with your data since there is only one vehicle in the sample you show. That said, this is a typical split-apply-combine type analysis you can execute easily with the data.table package:
library(data.table)
dt <- data.table(df) # I just did a `read.table` on the text you posted
dt[, frame.group:=cut(Frame_ID, seq(5, 9905, by=300), include.lowest=T)]
Here, I just converted your data into a data.table (df was a direct import of your data posted above), and then created 300-frame buckets using cut. Then, you just let data.table do the work. In the first expression we calculate total unique vehicles per frame.group:
dt[, list(tot.vehic=length(unique(Vehicle_ID))), by=frame.group]
# frame.group tot.vehic
# 1: [5,305] 1
Now we group by frame.group and Vehicle_class to get average speed and count for those combinations:
dt[, list(tot.vehic=length(unique(Vehicle_ID)), mean.speed=mean(Vehicle_velocity)), by=list(frame.group, Vehicle_class)]
# frame.group Vehicle_class tot.vehic mean.speed
# 1: [5,305] 2 1 24.965
Again, a bit silly when we only have one vehicle, but this should work for your data set.
EDIT: to show that it works:
library(data.table)
set.seed(101)
dt <- data.table(
  Frame_ID = sample(5:9905, 50000, rep = T),
  Vehicle_ID = sample(1:400, 50000, rep = T),
  Vehicle_velocity = runif(50000, 25, 100)
)
dt[, frame.group:=cut(Frame_ID, seq(5, 9905, by=300), include.lowest=T)]
dt[, Vehicle_class:=Vehicle_ID %% 3]
head(
  dt[order(frame.group, Vehicle_class),
     list(tot.vehic = length(unique(Vehicle_ID)), mean.speed = mean(Vehicle_velocity)),
     by = list(frame.group, Vehicle_class)]
)
# frame.group Vehicle_class tot.vehic mean.speed
# 1: [5,305] 0 130 63.34589
# 2: [5,305] 1 131 61.84366
# 3: [5,305] 2 129 64.13968
# 4: (305,605] 0 132 61.85548
# 5: (305,605] 1 132 64.76820
# 6: (305,605] 2 133 61.57129
Maybe it's your data?
Here is a plyr version:
data$timeSlot <- cut(data$FrameID,
                     breaks = seq(5, 9905, by = 300),
                     dig.lab = 5,
                     include.lowest = TRUE)
# split & combine
library(plyr)
data.sum1 <- ddply(.data = data,
                   .variables = c("timeSlot"),
                   .fun = summarise,
                   totalCars = length(unique(VehicleID)),
                   AverageVelocity = mean(velocity)
)
# include VehicleClass
data.sum2 <- ddply(.data = data,
                   .variables = c("timeSlot", "VehicleClass"),
                   .fun = summarise,
                   totalCars = length(unique(VehicleID)),
                   AverageVelocity = mean(velocity)
)
The column names like FrameID would have to be edited to match the ones you use:
data <- read.table(sep = "", header = TRUE, text = "
VehicleID FrameID GlobalX VehicleClass velocity Lane
1 177 6451181 2 24.99 5
1 178 6451182 2 24.95 5
1 179 6451184 2 24.91 5
1 180 6451186 2 24.90 5
1 181 6451187 2 24.96 5
1 182 6451189 2 25.08 5")
data.sum1
# timeSlot totalCars AverageVelocity
# 1 [5,305] 1 24.965
data.sum2
# timeSlot VehicleClass totalCars AverageVelocity
# 1 [5,305] 2 1 24.965
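For symmetry with the other threads above, here is a hedged dplyr sketch of the same grouped summary (assuming the cleaned column names from the read.table call):
library(dplyr)
data %>%
  mutate(timeSlot = cut(FrameID, breaks = seq(5, 9905, by = 300),
                        dig.lab = 5, include.lowest = TRUE)) %>%
  group_by(timeSlot, VehicleClass) %>%
  summarise(totalCars = n_distinct(VehicleID),   # unique vehicles per slot & class
            AverageVelocity = mean(velocity),
            .groups = "drop")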