Counting Attempts of an event in R

I'm relatively new to R and learning. I have the following data frame, data:
ID grade Test_Date
1 56 01-25-2012
1 63 02-21-2016
1 73 02-31-2016
2 41 12-23-2015
2 76 01-07-2016
3 66 02-08-2016
I am looking to count the number of people (in this case only two unique individuals) who passed their tests after multiple attempts (passing is defined as a grade of 65 or over). So the final product would return a list of unique IDs who had multiple attempts before their test score hit 65. This would inform me that approx. 66% of the clients in this data frame require multiple test sessions before getting a passing grade.
Below is my idea or concept, more or less; I've framed it as an if statement:
If ID appears twice
count how often it appears, until TEST GRADE >= 65
ifelse(duplicated(data$ID), count(ID), NA)
I'm struggling with the second piece, where I want to say: count the occurrences of the ID until grade >= 65.
The other option I see is some sort of loop. Below is my attempt:
for (i in data$ID) {
  duplicated(data$ID)
  count(data$ID)
  # here is where something would say: until grade >= 65
}
Again, the struggle is how to tell R to stop counting when the grade hits 65.
Appreciate the help!

You can use data.table:
library(data.table)
dt <- fread(" ID grade Test_Date
1 56 01-25-2012
1 63 02-21-2016
1 73 02-31-2016
2 41 12-23-2015
2 76 01-07-2016
3 66 02-08-2016")
# count the number of attempts per ID, then keep only the rows with a passing grade
dt <- dt[, N:=.N, by=ID][grade>=65]
# proportion of successful test takers who tried more than once
length(dt[N>1]$ID)/length(dt$ID)
[1] 0.6666667
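A slightly more compact way to get the same proportion from the filtered table is to take the mean of the logical condition directly (a small variation, not part of the original answer):
dt[, mean(N > 1)]
# [1] 0.6666667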

Another option, though the other two work just fine:
library(dplyr)
dat2 <- dat %>%
  group_by(ID) %>%
  summarize(
    multiattempts = n() > 1 & any(grade < 65),
    maxgrade = max(grade)
  )
dat2
# Source: local data frame [3 x 3]
# ID multiattempts maxgrade
# <int> <lgl> <int>
# 1 1 TRUE 73
# 2 2 TRUE 76
# 3 3 FALSE 66
sum(dat2$multiattempts) / nrow(dat2)
# [1] 0.6666667
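If the actual IDs are wanted (the question asks for the list of unique IDs that needed several attempts), they can be pulled straight from this summary:
dat2$ID[dat2$multiattempts]
# [1] 1 2
This answer refers to the question's data as dat; a minimal way to construct it, mirroring the read.table call in the answer below (a sketch, not part of the original answer):
dat <- read.table(header=TRUE, text="ID grade Test_Date
1 56 01-25-2012
1 63 02-21-2016
1 73 02-31-2016
2 41 12-23-2015
2 76 01-07-2016
3 66 02-08-2016")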

Here is a method using the aggregate function and subsetting that returns the maximum score for testers who took the test more than once, starting from their second test.
multiTestMax <- aggregate(grade~ID, data=df[duplicated(df$ID),], FUN=max)
multiTestMax
ID grade
1 1 73
2 2 76
To get the number of rows, you can use nrow:
nrow(multiTestMax)
2
or the proportion of all test takers:
nrow(multiTestMax) / length(unique(df$ID))
data
df <- read.table(header=T, text="ID grade Test_Date
1 56 01-25-2012
1 63 02-21-2016
1 73 02-31-2016
2 41 12-23-2015
2 76 01-07-2016
3 66 02-08-2016")


read.csv - to separate information stored in .csv based on the presence or absence of a duplicate value

First of all - apologies, I'm new to all of this, so I may write things in a confusing way.
I have multiple .csv files that I need to read, and to save a lot of time I am looking to find an automated way of doing this.
I am looking to read different rows of the .csv and store the information as two separate files, based on the information stored in the last column.
My data are specifically areas and slices of a 3D image, which I will use to compile volumes. If two rows have the same "slice" then I need to separate them, as the area found in row 1 corresponds to a different structure from the one with an area in row 2 on the same slice.
Eg:
Row,area,slice
1,50,180
2,52,180
3,49,181
4,53,181
5,65,182
6,60,183
So structure 1 has an area at slice 180 (area = 50) and at slice 181 (area = 49), whereas structure 2 has an area at each slice from 180 to 183.
I want to be able to store one set of rows (in the original post, rows 2, 4, 5 and 6 were marked in bold) in one .csv, and all the other data in another .csv.
There may be .csv files with more or fewer overlapping slice values, adding complexity to this.
Thank you for the help, please let me know if I need to clarify anything.
Use duplicated:
dat <- read.csv(text="
Row,area,slice
1,50,180
2,52,180
3,49,181
4,53,181
5,65,182
6,60,183")
dat[duplicated(dat$slice),]
# Row area slice
# 2 2 52 180
# 4 4 53 181
dat[!duplicated(dat$slice),]
# Row area slice
# 1 1 50 180
# 3 3 49 181
# 5 5 65 182
# 6 6 60 183
(Whether you write each of these last two frames to files or store them for later use is up to you.)
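If the two subsets should actually end up in files, a minimal sketch (the file names are only placeholders):
# write each subset to its own .csv file
write.csv(dat[!duplicated(dat$slice), ], "structure1.csv", row.names = FALSE)
write.csv(dat[duplicated(dat$slice), ], "structure2.csv", row.names = FALSE)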
duplicated normally returns TRUE for the second and subsequent occurrences of the field(s). Your logic of 2, 4, 5, 6 is more along the lines of "last of the dupes" or "no dupes", which is a little different.
library(dplyr)
dat %>%
  group_by(slice) %>%
  slice(-n()) %>%
  ungroup()
# # A tibble: 2 x 3
# Row area slice
# <int> <int> <int>
# 1 1 50 180
# 2 3 49 181
dat %>%
  group_by(slice) %>%
  slice(n()) %>%
  ungroup()
# # A tibble: 4 x 3
# Row area slice
# <int> <int> <int>
# 1 2 52 180
# 2 4 53 181
# 3 5 65 182
# 4 6 60 183
Similarly, with data.table:
library(data.table)
as.data.table(dat)[, .SD[.N,], by = .(slice)]
# slice Row area
# 1: 180 2 52
# 2: 181 4 53
# 3: 182 5 65
# 4: 183 6 60
as.data.table(dat)[, .SD[-.N,], by = .(slice)]
# slice Row area
# 1: 180 1 50
# 2: 181 3 49

Match column and rows then replace

I have to analyze data from an economic experiment.
My database is composed of 14,976 observations of 212 variables. Within this database we also have other information such as the profit, total profit, the treatments and other variables.
You can see that I have two types:
Type 1 is for sellers
Type 2 is for buyers
For some variables, results were put in the buyers' (Type 2) rows and not in the sellers' ones (which is a completely arbitrary choice). However, I would like to analyze, for instance, the gender of sellers who overcharged. So I need to manipulate my database and I don't know how to do this.
Here is part of the database:
ID Gender Period Matching group Group Type Overcharging ...
654 1 1 73 1 1 NA
654 1 2 73 1 1 NA
654 1 3 73 1 1 NA
654 1 4 73 1 1 NA
435 1 1 73 2 1 NA
435 1 2 73 2 1 NA
435 1 3 73 2 1 NA
435 1 4 73 2 1 NA
708 0 1 73 1 2 1
708 0 2 73 1 2 0
708 0 3 73 1 2 0
708 0 4 73 1 2 1
546 1 1 73 2 2 0
546 1 2 73 2 2 0
546 1 3 73 2 2 1
546 1 4 73 2 2 0
To do what I'd like, I have enough information (only one seller was matched with one buyer at period x, in group x, matching group x, and with treatment x...).
To give you an example: in matching group 73 we know that at period 1 subject 708 was overcharged (the one in group 1). As I know that this man belongs to group 1 and matching group 73, I am able to identify the seller who overcharged him at period 1: subject 654, with gender = 1.
So, I would like to put the buyers' overcharging values (and some others) on the sellers' rows (Type == 1) so that I can analyze sellers' behavior at the right period, for the right group and the right matching group.
I have a long way of doing it with data.frames. If you are looking to code in R long term, I would suggest checking out either (i) the dplyr package, part of the tidyverse suite, or (ii) the data.table package. The first one has the most popular syntax and is tied together nicely with a bunch of useful packages. The second is harder to learn but quicker. For data of your size, though, the difference is negligible.
In base data.frames, here is something I hope matches your request. Let me know if I've mistaken anything, or been unclear.
# sellers data eg
dt1 <- data.frame(Period = 1:4, MatchGroup = 73, Group = 1, Type = 1,
Overcharging = NA)
# buyers data eg
dt2 <- data.frame(Period = 1:4, MatchGroup = 73, Group = 1, Type = 2,
Overcharging = c(1,0,0,1))
# make my current data view
dt <- rbind(dt1, dt2)
dt[]
# split in to two data frames, on the Type column:
dt_split <- split(dt, dt$Type)
dt_split
# move out of list
dt_suffix <- seq_along(dt_split)
dt_names <- sprintf("dt%s", dt_suffix)
for(name in dt_names){
assign(name, dt_split[match(name, dt_names)][[1]])
}
dt1[]
dt2[]
# define the columns in which to match up the buyer to seller
merge_cols <- c("Period", "MatchGroup", "Group")
# define the columns you want to merge, that you know are NA
na_cols <- c("Overcharging")
# now use merge operation, and filter dt2, to pull in only columns you want
# I suggest dropping the na_cols first in dt1, as otherwise it will create two
# columns post-merge: Overcharging, i.Overcharging
dt1 <- dt1[,setdiff(names(dt1), na_cols)]
dt1_new <- merge(dt1,
dt2[, c(merge_cols, na_cols)], # filter dt2
by = merge_cols, # columns to match on
all.x = TRUE) # dt1 is x, dt2 is y. Want to keep all of dt1
# if you want to bind them back together, ensure the column order matches, and
# bind e.g.
dt1_new <- dt1_new[, names(dt2)]
dt_final <- rbind(dt1_new, dt2)
dt_final[]
My line of thinking is to split the buyers and sellers into two separate data frames, identify how they join, and migrate the data you need from the buyers to the sellers. Then finally bring them back together if so desired.
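Since dplyr was mentioned above, here is a sketch of the same join using it (assuming dt1 and dt2 as first constructed at the top of this answer; left_join plays the role of merge with all.x = TRUE):
library(dplyr)
# drop the all-NA Overcharging column from the sellers, then pull it in from the buyers
dt1_new <- dt1 %>%
  select(-Overcharging) %>%
  left_join(select(dt2, Period, MatchGroup, Group, Overcharging),
            by = c("Period", "MatchGroup", "Group"))
# reorder columns and stack the two frames back together
dt_final <- bind_rows(dt1_new[, names(dt2)], dt2)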

How to find the group of rows of a data frame where an error occurs

I have a two-column data frame containing thousands of IDs, where each ID has hundreds of data rows; in other words, a data frame of about 6 million rows. I am grouping this data frame by ID (using either dplyr or data.table) and performing a "tso" (outlier detection) function on the grouped data frame. The problem is that after hours of computation it returns an error related to the ARIMA specification of one of the IDs. The question is: how can I identify the ID (or the row number) for which my function returns the error? (If I can detect it, I can then remove that ID from the data frame.)
I tried to manually run my function on subgroups of this data frame, but I cannot reach the erroneous ID because there are thousands of IDs; it would take me weeks to find it this way.
library(data.table)
library(tsoutliers)  # provides tso()

outlier.detection <- function(x, iter) {
  y <- as.ts(x)
  out2 <- tso(y, maxit.iloop = iter, tsmethod = "auto.arima",
              remove.method = "bottom-up", cval = 3)
  y[out2$outliers$ind] <- NA
  return(y)
}

df <- data.table(outlying1); setkey(df, id)
test <- df[, list(new.weight = outlier.detection(weight, iter = 1)), by = id]
The above function finds the anomalies and replaces them with NAs. Here is an example:
ID weight
1 a 50
2 a 50
3 a 51
4 a 51.5
5 a 52
6 b 80
7 b 81
8 b 81.5
9 b 90
10 b 82
it will look like the following,
ID weight
1 a 50
2 a 50
3 a 51
4 a 51.5
5 a 52
6 b 80
7 b 81
8 b 81.5
9 b NA
10 b 82
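One common way to locate the offending group is to wrap the per-group call in tryCatch, so the failing ID is reported instead of aborting the whole run. A sketch (not from the original thread), assuming the df and outlier.detection objects defined above:
library(data.table)
safe.detection <- function(x, id, iter) {
  tryCatch(
    outlier.detection(x, iter = iter),
    error = function(e) {
      # report which ID failed, then return NAs so the rest of the job keeps running
      message("tso failed for ID ", id, ": ", conditionMessage(e))
      rep(NA_real_, length(x))
    }
  )
}
test <- df[, list(new.weight = safe.detection(weight, .BY$id, iter = 1)), by = id]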

replacing for loops in a function with vector calculations to speed up R

Say I have some data in data frame d1, that describes how frequently different sample individuals eat different foods, and a final column describing whether or not those foods are cool to eat. The data are structured like this.
OTU.ID<- c('pizza','taco','pizza.taco','dirt')
s1<-c(5,20,14,70)
s2<-c(99,2,29,5)
s3<-c(44,44,33,22)
cool<-c(1,1,1,0)
d1<-data.frame(OTU.ID,s1,s2,s3,cool)
print(d1)
OTU.ID s1 s2 s3 cool
1 pizza 5 99 44 1
2 taco 20 2 44 1
3 pizza.taco 14 29 33 1
4 dirt 70 5 22 0
I have written a function that computes, for each sample s1:s3, the number of cool foods that were consumed and the total number of foods that were consumed. It runs as a for loop over each sample column (which is extremely slow).
cool.food.abundance <- function(food.table){
  samps <- colnames(food.table)
  # remove column names that are not sample names
  samps <- samps[!samps %in% c("OTU.ID", "cool")]
  # create output vectors for the for loop
  id <- c()
  cool.foods <- c()
  all.foods <- c()
  # run a loop that stores output ids and results as vectors
  for(i in 1:length(samps)){
    x <- samps[i]
    y1 <- sum(food.table[samps[i]] * food.table$cool)
    y2 <- sum(food.table[samps[i]])
    id <- c(id, x)
    cool.foods <- c(cool.foods, y1)
    all.foods <- c(all.foods, y2)
  }
  # save results as a data frame and return the data frame object
  results <- data.frame(id, cool.foods, all.foods)
  return(results)
}
So, if you run this function, you will get a new table of sample IDs, the number of cool foods that sample ate, and the total number of foods that sample ate.
cool.food.abundance(d1)
id cool.foods all.foods
1 s1 39 109
2 s2 130 135
3 s3 121 143
How can I replace this for loop with vectorized calculations to speed it up? I would really like the function to be able to operate on data frames loaded with the fread function from the data.table package.
You can try
library(data.table)#v1.9.5+
dcast(melt(setDT(d1), id.var=c('OTU.ID', 'cool'))[,
sum(value) ,.(cool, variable)], variable~c('notcool.foods',
'cool.foods')[cool+1L], value.var='V1')[,
all.foods:= cool.foods+notcool.foods][, notcool.foods:=NULL]
# variable cool.foods all.foods
#1: s1 39 109
#2: s2 130 135
#3: s3 121 143
Or, instead of using dcast, we can summarise the result (as in @jeremycg's post), as there are only two groups:
melt(setDT(d1), id.var=c('OTU.ID', 'cool'), variable.name='id')[,
list(all.foods=sum(value), cool.foods=sum(value[cool==1])) , id]
# id all.foods cool.foods
#1: s1 109 39
#2: s2 135 130
#3: s3 143 121
Or you can use base R
nm1 <- paste0('s', 1:3)
res <- t(addmargins(rowsum(as.matrix(d1[nm1]), group=d1$cool),1)[-1,])
colnames(res) <- c('cool.foods', 'all.foods')
res
# cool.foods all.foods
#s1 39 109
#s2 130 135
#s3 121 143
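Since the question specifically asks about tables loaded with fread, note that fread already returns a data.table, so the melt/summarise version above applies directly. A sketch, where the file name is only illustrative:
library(data.table)
d1 <- fread("foods.csv")  # hypothetical file with columns OTU.ID, s1:s3, cool
melt(d1, id.vars = c("OTU.ID", "cool"), variable.name = "id")[,
  list(all.foods = sum(value), cool.foods = sum(value[cool == 1])), by = id]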
Here's how I would do it, with reshape2 and dplyr:
library(reshape2)
library(dplyr)
d1 <- melt(d1, id = c("OTU.ID", "cool"))
d1 %>% group_by(variable) %>%
summarise(all.foods = sum(value), cool.foods = sum(value[cool == 1]))

Create dataframe containing only matching data from 2 dataframes in R

I've seen several posts on similar topics to this, but I can't seem to make them work for my needs. I have 2 data frames, df1 and df2. df1 is quite large, df2 is small.
df1
Chr start end Count
1 0 50 20
1 51 100 40
2 0 50 100
2 51 100 30
2 101 150 7
df2
Chr coord Name
1 25 X
2 75 Y
What I would like is to return only the rows where Chr matches exactly (df1$Chr == df2$Chr) and where df2$coord falls in the range of df1 start and end (df2$coord >= df1$start & df2$coord <= df1$end).
The end result (ideally) should look like this:
Chr start end Count coord Name
1 0 50 20 25 X
2 51 100 30 75 Y
I know this is probably a basic problem but any help would be greatly appreciated.
This linked question by thelatemail gives the solution: Comparing multiple columns in different data sets to find values within range R. That question is somewhat muddled and unclear.
This question is a duplicate of that one, but it is clearer and much more readable.
x <- merge(df1, df2)
with(x, x[coord >= start & coord <= end,])
## Chr start end Count coord Name
## 1 1 0 50 20 25 X
## 4 2 51 100 30 75 Y
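For larger tables, a data.table non-equi join gives the same result without building the full merge first. A sketch, assuming the same df1 and df2 (requires data.table 1.9.8 or later):
library(data.table)
setDT(df1); setDT(df2)
# for each df2 row, find the df1 interval on the same Chr that contains coord
df1[df2,
    .(Chr, start = x.start, end = x.end, Count, coord = i.coord, Name),
    on = .(Chr, start <= coord, end >= coord), nomatch = 0L]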
