I have a dataframe containing a set of parts and test results. The parts are tested on 3 sites (North, Centre and South). Sometimes those parts are re-tested. I want to eventually create some charts that compare the results from the first time that a part was tested with the second (or third, etc.) time that it was tested, e.g. to look at tester repeatability.
As an example, I've come up with the code below. I've explicitly removed the "Experiment" column from the morley dataset, as this is the column I'm effectively trying to recreate. The code works; however, it seems there must be a more elegant way to approach this problem. Any thoughts?
Edit - I realise that the example given was overly simplistic for my actual needs (I was trying to generate a reproducible example as easily as possible).
New example:
part <- as.factor(c("A","A","A","B","B","B","A","A","A","C","C","C"))
site <- as.factor(c("N","C","S","C","N","S","N","C","S","N","S","C"))
result <- c(17,20,25,51,50,49,43,45,47,52,51,56)
data <- data.frame(part, site, result)
data$index <- 1
# keep bumping the index on duplicated part/site/index rows until none remain
repeat {
  if (!anyDuplicated(data[, c("part","site","index")])) {
    break
  }
  data$index <- ifelse(duplicated(data[, c("part","site","index")]),
                       data$index + 1, data$index)
}
data
part site result index
1 A N 17 1
2 A C 20 1
3 A S 25 1
4 B C 51 1
5 B N 50 1
6 B S 49 1
7 A N 43 2
8 A C 45 2
9 A S 47 2
10 C N 52 1
11 C S 51 1
12 C C 56 1
Old example:
# Generate a trial data frame from the morley dataset
df <- morley[, c(2, 3)]
# Create the index column and initialise it to 1
df$index <- 1
# Loop through the data frame looking for duplicate pairs of
# Run and index, and increment the index if it's a duplicate
repeat {
  if (!anyDuplicated(df[, c(1, 3)])) {
    break
  }
  df$index <- ifelse(duplicated(df[, c(1, 3)]), df$index + 1, df$index)
}
# Check - The below vector should all be true
df$index==morley$Expt
We may use diff and cumsum on the 'Run' column to get the expected output. This method does not create a column of 1s (the 'index' column), and it assumes that the sequence in 'Run' is ordered as shown in the OP's example.
indx <- cumsum(c(TRUE,diff(df$Run)<0))
identical(indx, morley$Expt)
#[1] TRUE
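To see why this works, here is a quick look at the intermediate values (in morley, Run counts 1 to 20 within each experiment, so a negative difference marks the start of a new experiment):
head(diff(df$Run))
#[1] 1 1 1 1 1 1
diff(df$Run)[19:21]
#[1]   1 -19   1
head(cumsum(c(TRUE, diff(df$Run) < 0)))
#[1] 1 1 1 1 1 1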
Or we can use ave
indx2 <- with(df, ave(Run, Run, FUN=seq_along))
identical(indx2, morley$Expt)
#[1] TRUE
Update
Using the new example
with(data, ave(seq_along(part), part, site, FUN=seq_along))
#[1] 1 1 1 1 1 1 2 2 2 1 1 1
Or we can use getanID from the splitstackshape package:
library(splitstackshape)
getanID(data, c('part', 'site'))[]
I think this is a job for make.unique, with some manipulation.
index <- 1L + as.integer(sub("\\d+(\\.)?","",make.unique(as.character(morley$Run))))
index <- ifelse(is.na(index),1L,index)
identical(index,morley$Expt)
[1] TRUE
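For reference, the sub() call works off the suffixes that make.unique() appends to repeated values; a small illustration:
make.unique(as.character(c(1, 1, 1, 2)))
#[1] "1"   "1.1" "1.2" "2"
Stripping the leading digits (and the optional dot) leaves "", "1", "2", which as.integer() turns into NA, 1, 2; the +1L and the ifelse() then convert these to 1, 2, 3.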
Details of your actual data.frame may matter. However, a couple of options working with your example:
#this works if each group starts with 1:
df$index<-cumsum(df$Run==1)
#this is maybe more general, with data.table
require(data.table)
dt<-as.data.table(df)
dt[,index:=seq_along(Speed),by=Run]
Related
I'm new and learning R. I'm trying to ask a question that I don't know the words for.
Suppose I have a data frame such that:
df <- data.frame(ID=c("A","A","A","B","B","B","C","C","C"),
                 Week=c(1,2,3,1,2,3,1,2,3),
                 Variable=c(30,25,27,42,44,45,30,50,19))
ID Week Variable
1 A 1 30
2 A 2 25
3 A 3 27
4 B 1 42
5 B 2 44
6 B 3 45
7 C 1 30
8 C 2 50
9 C 3 19
How can I find what is the average Variable at Week 2 for all ID that had Variable = 30 at Week 1?
For example, I would like the output in this example to = 37.5
This might be easier to read/see.
library(tidyverse)
df %>%
  spread(Week, Variable) %>%
  filter(`1` == 30) %>%
  with(mean(`2`))
[1] 37.5
I think tidyverse code is easier to understand because you can read it left to right, like you would any non-code text. And the pipe (%>%) makes the order of operations easier to see: there are no nested parentheses to parse.
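For reference, the intermediate wide table that spread() produces looks roughly like this (in current tidyr, pivot_wider(names_from = Week, values_from = Variable) replaces spread()):
df %>% spread(Week, Variable)
#  ID  1  2  3
#1  A 30 25 27
#2  B 42 44 45
#3  C 30 50 19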
Step 1: Obtain the IDs which had Variable=30 in Week1
res <- subset(df, Variable==30 & Week==1, ID)
The output is:
> res
ID
1 A
7 C
Step 2:
Get all their variables at week 2:
dt <- subset(df, ID %in% as.vector(unlist(res)) & Week==2, select=c(ID, Variable))
The output is:
ID Variable
2 A 25
8 C 50
Step 3: Get the mean:
mean(dt$Variable)
The final output is:
37.5
In step 2 we have ID %in% as.vector(unlist(res)). So what does it mean?
The %in% operator returns TRUE for each element on its left-hand side that is found in the right-hand side vector. For example, run the sample below:
a<- 1:10
b<-c(0,4,6,8,16)
b %in% a
and the result is:
FALSE TRUE TRUE TRUE FALSE
So, the %in% operator returns a Boolean value for each element of b: TRUE if that element exists in a, otherwise FALSE. As you can see, 0 and 16 give FALSE.
The point is that the right-hand side should be a vector, but res is a data.frame, so I first need to unlist it and then treat it as a vector (as.vector).
In conclusion, ID %in% as.vector(unlist(res)) checks whether each ID exists in res or not.
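As a side note, since res has a single column, the check can be written a little more directly by pulling that column out as a vector with $, which avoids the unlist()/as.vector() step:
# equivalent to step 2, using the ID column of res directly
dt <- subset(df, ID %in% res$ID & Week==2, select=c(ID, Variable))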
First we need the IDs which have Variable=30 AND Week=1, and then from those IDs extract the rows with Week=2 and take avg(Variable).
Base R Solution:
mean(df[df$ID %in% (df[df$Week==1 & df$Variable==30,1]) & df$Week==2,3])
Output:
[1] 37.5
OR (another approach)
Using sqldf:
library(sqldf)
sqldf("select avg(Variable) from df where ID IN (select ID from df where variable=30 AND week=1) AND Week=2")
Output:
avg(Variable)
1 37.5
I have a long vector of patient statuses in R that is chronologically sorted, and a label of associated patient IDs. This vector is a column of a data frame. I would like to label consecutive rows of data for which the patient status is the same. If the status changes and then reverts to its original value, that counts as three separate events. This is different from most situations I have searched, where duplicated or match would suffice.
An example would be along the lines of:
s <- c(0,0,0,1,1,1,0,0,2,1,1,0,0)
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2)
and the desired output would be
flag <- c(1,1,1,2,2,2,3,1,2,3,3,4,4)
or
flag <- c(1,1,1,2,2,2,3,4,5,6,6,7,7)
One inelegant approach would be to generate the sequence:
unlist(tapply(s, id, function(x) cumsum(c(T, x[-1] != rev(rev(x)[-1])))))
Is there a better way?
I think you could use rleid from data.table for this:
library(data.table)
rleid(s,id)
Output:
1 1 1 2 2 2 3 4 5 6 6 7 7
Or for the first sequence:
data.table(s,id)[,rleid(s),id]$V1
Output:
1 1 1 2 2 2 3 1 2 3 3 4 4
Run Length Encoding - rle()
tapply(s, id, function(x) {
  v <- rle(x)$lengths
  rep(seq_along(v), v)
})
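tapply() returns the per-id results as a list here; wrapping the call in unlist() (with unname() to drop the names tapply adds) gives a flat vector matching the first desired output:
unname(unlist(tapply(s, id, function(x) {
  v <- rle(x)$lengths
  rep(seq_along(v), v)
})))
#[1] 1 1 1 2 2 2 3 1 2 3 3 4 4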
I want to reorder rows in the following data frame if the Sequence is 2,1,3. However, this should only apply in instances where the Project ID is the same. So this logic should reorder the rows in Project 123, but should not affect Projects 124 or 125.
Here is the data:
Data <- data.frame(Project=c(123,123,123,124,125,125),
                   Value=c(1,4,7,3,8,9),
                   Sequence=c(2,1,3,2,1,3))
This is the result I'm looking for:
Result <- data.frame(Project=c(123,123,123,124,125,125),
                     Value=c(4,1,7,3,8,9),
                     Sequence=c(1,2,3,2,1,3))
Not a really smart piece of work, but it should do the job - try this...
This function sorts the data rows of the specified sequence, if it appears within a project. (You could also change the sequence.)
reorder <- function(data, sequence=c(2,1,3)){
  seq_len <- length(sequence)
  for (i in 1:(nrow(data)-seq_len+1)){
    # check whether the specified sequence starts at row i
    seq_check <- identical(sequence, data$Sequence[i:(i+seq_len-1)])
    if (seq_check) {
      # check that the Project is the same over the whole sequence
      pro_check <- identical(rep(data$Project[i], seq_len), data$Project[i:(i+seq_len-1)])
      if (pro_check){
        exchange <- data[i:(i+seq_len-1),]
        for (j in 1:seq_len){
          data[(i+j-1),] <- exchange[exchange$Sequence==j,]
        }
      }
    }
  }
  data
}
running:
Data_reordered <- reorder(Data)
results in:
> Data_reordered
Project Value Sequence
1 123 4 1
2 123 1 2
3 123 7 3
4 124 3 2
5 125 8 1
6 125 9 3
and:
> identical(Result, Data_reordered)
[1] TRUE
I hope that is your requested solution :-)
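If the actual requirement is simply to sort by Sequence within each Project (rather than to touch only the exact 2,1,3 pattern), a much shorter base R option would be the following; note this assumes any out-of-order sequence should be sorted, which may be more than the question asks for:
Data_sorted <- Data[order(Data$Project, Data$Sequence), ]
For the example data this gives the same values as Result, with only the row names differing.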
I am trying to run simulation scenarios which in turn should provide me with the best scenario for a given date, back-tested over a couple of months. A specific scenario has 4 input variables, each of which can be in 5 states (5^4 = 625 combinations). The flow of the model is as follows:
Simulate 625 scenarios to get each of their profit
Rank each of the scenarios according to their profit
Repeat the process through a 1-day expanding window for the last 2 months starting on the 1st Dec 2015 - creating a time series of ranks for each of the 625 scenarios
The unfortunate result of this is 5 nested for loops, which can take an extremely long time to run. I had a look at the foreach package, but I am concerned about how the combining of the outputs will work in my scenario.
The current code that I am using works as follows. First I create the possible states of each of the inputs, along with the date window:
a<-seq(as.Date("2015-12-01", "%Y-%m-%d"),as.Date(Sys.Date()-1, "%Y-%m-%d"),by="day")
#input variables
b<-seq(1,5,1)
c<-seq(1,5,1)
d<-seq(1,5,1)
e<-seq(1,5,1)
set.seed(3142)
tot_results<-NULL
Next the nested for loops proceed to run through the simulations for me.
for(i in 1:length(a))
{
  cat(paste0("\n", "Current estimation date: ", a[i]), "; iteration:", i, "\n")
  #subset data for backtesting
  dataset_calc <- dataset[which(dataset$Date <= a[i]), ]
  p <- 1
  #one row per scenario: the scenario's ID and its profit
  results <- data.frame(ID=rep(NA, 625), profit=rep(NA, 625))
  for(j in 1:length(b))
  {
    for(k in 1:length(c))
    {
      for(l in 1:length(d))
      {
        for(m in 1:length(e))
        {
          if(i==1)
          {
            #create a unique ID to merge onto later
            unique_ID <- paste0(replicate(1, paste(sample(LETTERS, 5, replace=TRUE), collapse="")),
                                round(runif(n=1, min=1, max=1000000)))
          }
          #Run profit calculation
          post_sim_results <- profit_calc(dataset_calc, param1=e[m], param2=d[l], param3=c[k], param4=b[j])
          #Extract the final profit amount
          profit <- round(post_sim_results[nrow(post_sim_results), ], 2)
          results[p, ] <- list(unique_ID, profit)
          p <- p + 1
        }
      }
    }
  }
  #extract the ranks for all scenarios
  rank <- rank(results$profit)
  #bind the ranks for the expanding window
  if(i==1)
  {
    tot_results <- data.frame(ID=results[, 1], rank)
  }else{
    tot_results <- cbind(tot_results, rank)
  }
  suppressMessages(gc())
}
My biggest concern is the binding of the results given that the outer loop's actions are dependent on the output of the inner loops.
Any advice on how to proceed would be greatly appreciated.
So I think that you can vectorize most of this, which should give a big reduction in run time.
Currently, you use for-loops (5, to be exact) to create every combination of values, and then run the values one by one through profit_calc (a function that is not specified). Ideally, you'd just take all possible combinations in one go and push them through profit_calc in one single operation.
-- Rationale --
a <- 1:10
b <- 1:10
d <- rep(NA,10)
for (i in seq(a)) d[i] <- a[i] * b[i]
d
# [1] 1 4 9 16 25 36 49 64 81 100
Since * also works on vectors, we can rewrite this to:
a <- 1:10
b <- 1:10
d <- a*b
d
# [1] 1 4 9 16 25 36 49 64 81 100
While it may save us only one line of code, it actually reduces the problem from 10 steps to 1 step.
-- Application --
So how does that apply to your code? Well, given that we can vectorize profit_calc, you can basically generate a data frame where each row is every possible combination of your parameters. We can do this with expand.grid:
foo <- expand.grid(b,c,d,e)
head(foo)
# Var1 Var2 Var3 Var4
# 1 1 1 1 1
# 2 2 1 1 1
# 3 3 1 1 1
# 4 4 1 1 1
# 5 5 1 1 1
# 6 1 2 1 1
Let's say we have a formula... (a - b) * (c + d)... Then it would work like:
bar <- (foo[,1] - foo[,2]) * (foo[,3] + foo[,4])
head(bar)
# [1] 0 2 4 6 8 -2
So basically, try to find a way to replace for-loops with vectorized options. If you cannot vectorize something, try looking into apply instead, as that can also save you some time in most cases. If your code is running too slowly, you'd ideally first see if you can write a more efficient script. Also, you may be interested in the microbenchmark package, or ?system.time.
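As a minimal illustration of how microbenchmark compares two approaches (the functions below are toy stand-ins, not the actual profit_calc):
library(microbenchmark)
slow <- function() { out <- numeric(1000); for (i in 1:1000) out[i] <- i * 2; out }
fast <- function() (1:1000) * 2
microbenchmark(loop = slow(), vectorized = fast(), times = 100L)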
I am trying to build a for loop which will step through each site, for that site calculate frequencies of a response, and put those results in a new data frame. Then after the loop I want to be able to combine all of the site data frames so it will look something like:
Site Genus Freq
1 A 50
1 B 30
1 C 20
2 A 70
2 B 10
2 C 20
But to do this I need my names (of vectors, data frames) to change each time through the loop. I think I can do this using the SiteNum variable, but how do I insert it into new variable names? The way I tried (below) treats SiteNum as part of the string rather than inserting its value into the name.
I feel like what I want to use is a placeholder %, but I don't know how to do that with variable names.
SiteNum <- 1
for (Site in CoralSites){
  Csub_SiteNum <- subset(dfrmC, Site==CoralSites[SiteNum])
  CGrfreq_SiteNum <- numeric(length(CoralGenera))
  for (Genus in CoralGenera){
    CGrfreq_SiteNum[GenusNum] <- mean(dfrmC$Genus == CoralGenera[GenusNum])*100
    GenusNum <- GenusNum + 1
  }
  names(CGrfreq_SiteNum) <- c(CoralGenera)
  Site_SiteNum <- c(Site)
  CG_SiteNum <- data.frame(CoralGenera,CGrfreq_SiteNum,Site_SiteNum)
  SiteNum <- SiteNum + 1
}
Your question as stated asks how you can create a bunch of variables, e.g. CGrfreq_1, CGrfreq_2, ..., where the name of the variable indicates the site number that it corresponds to (1, 2, ...). While you can do such a thing with functions like assign (a sketch of that pattern follows the list below), it is not good practice for a few reasons:
It makes the code that generates the variables more complicated, because it will be littered with calls to assign, get and paste0.
It makes your data more difficult to manipulate afterwards -- you'll need to (either manually or programmatically) identify all the variables of a certain type, grab their values with get or mget, and then do something with them.
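For completeness, the discouraged pattern looks roughly like this (a sketch using the object names from the question, which are assumed to exist):
for (SiteNum in seq_along(CoralSites)) {
  # build the name as a string, then create the variable with assign()
  assign(paste0("CG_", SiteNum), subset(dfrmC, Site == CoralSites[SiteNum]))
}
# retrieving the variables later requires get()/mget()
all_CG <- mget(paste0("CG_", seq_along(CoralSites)))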
Instead, you'll find it easier to work with other R functions that will perform the aggregation for you. In this case you're looking to generate, for each Site/Genus pairing, the percentage of data points at that site with that genus value. This can be done in a few lines of code with the aggregate function:
# Sample data:
(dat <- data.frame(Site=c(rep(1, 5), rep(2, 5)), Genus=c(rep("A", 3), rep("B", 6), "A")))
# Site Genus
# 1 1 A
# 2 1 A
# 3 1 A
# 4 1 B
# 5 1 B
# 6 2 B
# 7 2 B
# 8 2 B
# 9 2 B
# 10 2 A
# Calculate frequencies
dat$Freq <- 1
res <- aggregate(Freq~Genus+Site, data=dat, sum)
res$Freq <- 100 * res$Freq / table(dat$Site)[as.character(res$Site)]
res
# Genus Site Freq
# 1 A 1 60
# 2 B 1 40
# 3 A 2 20
# 4 B 2 80
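For comparison, roughly the same percentages can also be obtained with prop.table() on a contingency table; a sketch (the columns come out named Var1/Var2/Freq and the rows are ordered differently):
res2 <- as.data.frame(prop.table(table(dat$Site, dat$Genus), margin = 1) * 100)
res2
#  Var1 Var2 Freq
#1    1    A   60
#2    2    A   20
#3    1    B   40
#4    2    B   80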