Removing rows from a data frame until a condition is met

Removing rows from a data frame until a condition is met - r

I have a function, remove_fun, that removes rows from a data frame based on some conditions (this function is too verbose to include, so here's a simplified example:).
Let's say I have a data frame called block_2, with two columns:
Treatment seq
1 29
1 23
3 60
1 6
2 41
1 5
2 44
For the sake of this example, let's say my function removes 1 row from block_2 at a time based on the highest value of seq in block_2$seq. This function works well when I run it once, i.e. remove_fun(block_2) would return the following output:
Treatment seq
1 29
1 23
1 6
2 41
1 5
2 44
However, what I'm not figuring out is how to repeatedly implement my remove_fun until I reduce block_2 to a certain dimension.
My idea is to do something like this:
while (dim(block_2_df)[1]>1)#The number of rows of block_2_df{
remove_fun(block_2_df)
}
This would theoretically reduce block_2_df until only the observation corresponding to the lowest seq number remains.
However, this doesn't work. I think my problem relates to me not knowing how to use my 'updated' block_2_df iteratively. What I'd like to accomplish is some code that does something like this:
new_df_1<-remove_fun(block_2)
new_df_2<-remove_fun(new_df_1)
new_df_3<-remove_fun(new_df_2)
etc...
I'm not necessarily looking for an exact solution to this problem (as I didn't provide remove_fun), but I'd appreciate some insight re: a general approach to the problem.
Edit: here's my actual code with some example data:
#Start from a block of 10*6 balls, with lambda*(wj) balls of each class
#Allocation ratios
class_1<-"a"
class_2<-"b"
class_3<-"c"
ratio_a<-3
ratio_b<-2
ratio_c<-1
#Min_set
min_set<-c(rep(class_1,ratio_a),rep(class_2,ratio_b),rep(class_3,ratio_c))
min_set_num<-ifelse(min_set=='a',1,ifelse(min_set=='b',2,3))
table_key <- table(min_set_num)
#Number of min_sets
lamb<-10
#Active urn
block_1<-matrix(0,lamb,length(min_set))
for (i in 1:lamb){
block_1[i,]<-min_set
}
#Turn classes into a vector
block_1<-as.vector(block_1)
block_1<-ifelse(block_1=='a',1,ifelse(block_1=='b',2,3))
#Turn into a df w/ identifying numbers:
block_1_df<-data.frame(block_1,seq(1:length(block_1)))
#Enumerate all sampling outcome permutations
library('dplyr')
#Create inactive urn
#Sample from block_1 until min_set is achieved, store in block_2#####
#Random sample :
block_2<-sample(block_1,length(block_1),replace=F)
block_2_df<-block_1_df[sample(nrow(block_1_df), length(block_1)), ]
colnames(block_2_df)<-c('Treatment','seq')
#Generally:####
remove_fun<-function(dat){
#For df
min_set_obs_mat<-matrix(0,length(block_1),2)
min_set_obs_df<-as.data.frame(min_set_obs_mat)
colnames(min_set_obs_df)<-c('Treatment','seq')
for (i in 1:length(block_1)){
if ((sum(min_set_obs_df[,1]==1)<3) || (sum(min_set_obs_df[,1]==2)<2) || (sum(min_set_obs_df[,1]==3)<1)){
min_set_obs_df[i,]<-dat[i,]
}
}
#Get rid of empty rows in df:
min_set_obs_df<-min_set_obs_df%>%filter(Treatment>0)
#Return the sampled 'balls' which satisfy the minimum set into block_2_df (randomized block_!), ####
#keeping the 'extra' balls in a new df: extra_df:####
#Question: does the order of returning matter?####
#Identify min_set
outcome_df<-min_set_obs_df %>% group_by(Treatment) %>% do({
head(., coalesce(table_key[as.character(.$Treatment[1])], 0L))
})
#This removes extra observations 'chronologically'
#Identify extra balls
#Extra_df is the 'inactive' urn####
extra_df<-min_set_obs_df%>%filter(!(min_set_obs_df$seq%in%outcome_df$seq))
#Question: is the number of pts equal to the block size? (lambda*W)?######
#Return min_df back to block_2_df, remove extra_df from block_2_df:
dat<-dat%>%filter(!(seq%in%extra_df$seq))
return(dat)
}

Your while-loop doesn't redefine block2_df. This should work:
while (dim(block_2_df)[1]>1) {
block_2_df <- remove_fun(block_2_df)
}

If all you need is a way to subset the data frame...
df <- data.frame(Treatment = c(1, 1, 3, 1, 2, 1, 2),
seq = c(29, 23, 60, 6, 41, 5, 44))
df
Treatment seq
1 1 29
2 1 23
3 3 60
4 1 6
5 2 41
6 1 5
7 2 44
# Decide how many rows you want in output
n <- 6
# Find the top "n" values in the seq variable
head(sort(df$seq), n)
[1] 5 6 23 29 41 44
# Use them in the subset criteria
df[df$seq %in% head(sort(df$seq), n), ]
Treatment seq
1 1 29
2 1 23
4 1 6
5 2 41
6 1 5
7 2 44

Related

Is there any way to make a data frame using the last values of a repeated loop?

I'd like to make a data frame using only the last computed values from a Repeat loop.
For the repeat and sample functions, I'm using this data. The numbers in Prob column are the probabilities of each number to occur.
enter image description here
b <- 1
repeat {
c <- sample(a$Plus, size=1, prob=(a$Prob))
cat(b, '\t', c, '\n')
b <- b + 1
if (c >= 10) {
{
break
}
}
}
#I'm interested in the result greater than 10 only
If I run the code above, then it will compute something like
1 4
2 8
3 13
If I run this again, it will compute different results like..
1 9
2 3
3 7
4 3
5 11
What I'd like to do is to make a data frame using only the last outputs of each loop.
For example, using the computed data above, I'd like to make a frame that looks like
Trial Result
3 13
5 11
Is there any way to repeat this loop the number of times I want to and make a data frame using only the last outputs of each repeated function?

You can use a user defined function to do this. Since you haven't given your dataframe a, I've defined it as follows:
library(tidyverse)
a <- tibble(
Plus = 1:15,
Prob = seq(from = 15, to = 1, by = -1)
)
The following function does the same thing as your repeat loop, but stores the relevant results in a tibble. I've left your variable b out of this because as far as I can see, it doesn't contribute to your desired output.
samplefun <- function(a) {
c <- sample(a$Plus, size=length(a$Plus), prob=a$Prob)
res <- tibble(
Trial = which(c >= 10)[1],
Result = c[which(c >= 10)[1]]
)
return(res)
}
Then use map_dfr to return as many samples as you like:
nsamples <- 5
map_dfr(1:nsamples, ~ samplefun(a))
Output:
# A tibble: 5 x 2
Trial Result
<int> <int>
1 4 11
2 6 14
3 5 11
4 2 10
5 4 15

'Random' Sorting with a condition in R for Psychology Research

I have Valence Category for word stimuli in my psychology experiment.
1 = Negative, 2 = Neutral, 3 = Positive
I need to sort the thousands of stimuli with a pseudo-randomised condition.
Val_Category cannot have more than 2 of the same valence stimuli in a row i.e. no more than 2x negative stimuli in a row.
for example - 2, 2, 2 = not acceptable
2, 2, 1 = ok
I can't sequence the data i.e. decide the whole experiment will be 1,3,2,3,1,3,2,3,2,2,1 because I'm not allowed to have a pattern.
I tried various packages like dylpr, sample, order, sort and nothing so far solves the problem.

I think there's a thousand ways to do this, none of which are probably very pretty. I wrote a small function that takes care of the ordering. It's a bit hacky, but it appeared to work for what I tried.
To explain what I did, the function works as follows:
Take the vector of valences and samples from it.
If sequences are found that are larger than the desired length, then, (for each such sequence), take the last value of that sequence at places it "somewhere else".
Check if the problem is solved. If so, return the reordered vector. If not, then go back to 2.
# some vector of valences
val <- rep(1:3,each=50)
pseudoRandomize <- function(x, n){
# take an initial sample
out <- sample(val)
# check if the sample is "bad" (containing sequences longer than n)
bad.seq <- any(rle(out)$lengths > n)
# length of the whole sample
l0 <- length(out)
while(bad.seq){
# get lengths of all subsequences
l1 <- rle(out)$lengths
# find the bad ones
ind <- l1 > n
# take the last value of each bad sequence, and...
for(i in cumsum(l1)[ind]){
# take it out of the original sample
tmp <- out[-i]
# pick new position at random
pos <- sample(2:(l0-2),1)
# put the value back into the sample at the new position
out <- c(tmp[1:(pos-1)],out[i],tmp[pos:(l0-1)])
}
# check if bad sequences (still) exist
# if TRUE, then 'while' continues; if FALSE, then it doesn't
bad.seq <- any(rle(out)$lengths > n)
}
# return the reordered sequence
out
}
Example:
The function may be used on a vector with or without names. If the vector was named, then these names will still be present on the pseudo-randomized vector.
# simple unnamed vector
val <- rep(1:3,each=5)
pseudoRandomize(val, 2)
# gives:
# [1] 1 3 2 1 2 3 3 2 1 2 1 3 3 1 2
# when names assigned to the vector
names(val) <- 1:length(val)
pseudoRandomize(val, 2)
# gives (first row shows the names):
# 1 13 9 7 3 11 15 8 10 5 12 14 6 4 2
# 1 3 2 2 1 3 3 2 2 1 3 3 2 1 1
This property can be used for randomizing a whole data frame. To achieve that, the "valence" vector is taken out of the data frame, and names are assigned to it either by row index (1:nrow(dat)) or by row names (rownames(dat)).
# reorder a data.frame using a named vector
dat <- data.frame(val=rep(1:3,each=5), stim=rep(letters[1:5],3))
val <- dat$val
names(val) <- 1:nrow(dat)
new.val <- pseudoRandomize(val, 2)
new.dat <- dat[as.integer(names(new.val)),]
# gives:
# val stim
# 5 1 e
# 2 1 b
# 9 2 d
# 6 2 a
# 3 1 c
# 15 3 e
# ...

I believe this loop will set the Valence Category's appropriately. I've called the valence categories treat.
#Generate example data
s1 = data.frame(id=c(1:10),treat=NA)
#Setting the first two rows
s1[1,"treat"] <- sample(1:3,1)
s1[2,"treat"] <- sample(1:3,1)
#Looping through the remainder of the rows
for (i in 3:length(s1$id))
{
s1[i,"treat"] <- sample(1:3,1)
#Check if the treat value is equal to the previous two values.
if (s1[i,"treat"]==s1[i-1,"treat"] & s1[i-1,"treat"]==s1[i-2,"treat"])
#If so draw one of the values not equal to that value
{
a = 1:3
remove <- s1[i,"treat"]
a=a[!a==remove]
s1[i,"treat"] <- sample(a,1)
}
}
This solution is not particularly elegant. There may be a much faster way to accomplish this by sorting several columns or something.

Create a new data frame based on another dataframe

I am trying to use a huge dataframe (180000 x 400) to calculate another one that would be much smaller.
I have the following dataframe
df1=data.frame(LOCAT=c(1,2,3,4,5,6),START=c(120,345,765,1045,1347,1879),END=c(150,390,802,1120,1436,1935),CODE1=c(1,1,0,1,0,0),CODE2=c(1,0,0,0,-1,-1))
df1
LOCAT START END CODE1 CODE2
1 1 120 150 1 1
2 2 345 390 1 0
3 3 765 802 0 0
4 4 1045 1120 1 0
5 5 1347 1436 0 -1
6 6 1879 1935 0 -1
This is a sample dataframe. The rows continue until 180000 and the columns are over 400.
What I need to do is create a new dataframe based on each column that tells me the size of each continues "1" or "-1" and returns it with the location, size and value.
Something like this for CODE1:
LOCAT SIZE VALUE
1 1 to 2 270 POS
2 4 to 4 75 POS
And like this for CODE2:
LOCAT SIZE VALUE
1 1 to 1 30 POS
2 5 to 6 588 NEG
Unfortunately I still didn't figure out how to do this. I have been trying several lines of code to develop a function to do this automatically but start to get lost or stuck in loops and it seems that nothing works.
Any help would be appreciated.
Thanks in advance

Below is code that gives you the answer in the exact format that you wanted, except I split your "LOCAT" column into two columns entitled "Starts" and "Stops". This code will work for your entire data frame, no need to replicate it manually for each CODE (CODE1, CODE2, etc).
It assumes that the only non-CODE column have the names "LOCAT" "START" and "END".
# need package "plyr"
library("plyr")
# test2 is the example data frame that you gave in the question
test2 <- data.frame(
"LOCAT"=1:6,
"START"=c(120,345,765, 1045, 1347, 1879),
"END"=c(150,390,803,1120,1436, 1935),
"CODE1"=c(1,1,0,1,0,0),
"CODE2"=c(1,0,0,0,-1,-1)
)
codeNames <- names(test2)[!names(test2)%in%c("LOCAT","START","END")] # the names of columns that correspond to different codes
test3 <- reshape(test2, varying=codeNames, direction="long", v.names="CodeValue", timevar="Code") # reshape so the different codes are variables grouped into the same column
test4 <- test3[,!names(test3)%in%"id"] #remove the "id" column
sss <- function(x){ # sss gives the starting points, stopping points, and sizes (sss) in a data frame
rleX <- rle(x[,"CodeValue"]) # rle() to get the size of consecutive values
stops <- cumsum(rleX$lengths) # cumulative sum to get the end-points for the indices (the second value in your LOCAT column)
starts <- c(1, head(stops,-1)+1) # the starts are the first value in your LOCAT column
ssX0 <- data.frame("Value"=rleX$values, "Starts"=starts, "Stops"=stops) #the starts and stops from X (ss from X)
ssX <- ssX0[ssX0[,"Value"]!=0,] # remove the rows the correspond to CODE_ values that are 0 (not POS or NEG)
# The next 3 lines calculate the equivalent of your SIZE column
sizeX1 <- x[ssX[,"Starts"],"START"]
sizeX2 <- x[ssX[,"Stops"],"END"]
sizeX <- sizeX2 - sizeX1
sssX <- data.frame(ssX, "Size"=sizeX) # Combine the Size to the ssX (start stop of X) data frame
return(sssX) #Added in EDIT
}
answer0 <- ddply(.data=test4, .variables="Code", .fun=sss) # use the function ddply() in the package "plyr" (apply the function to each CODE, why we reshaped)
answer <- answer0 # duplicate the original, new version will be reformatted
answer[,"Value"] <- c("NEG",NA,"POS")[answer0[,"Value"]+2] # reformat slightly so that we have POS/NEG instead of 1/-1
Hopefully this helps, good luck!

Use run-length encoding to determine groups where CODE1 takes the same value.
rle_of_CODE1 <- rle(df1$CODE1)
For convenience, find the points where the value is non-zero, and the lenghts of the corresponding blocks.
CODE1_is_nonzero <- rle_of_CODE1$values != 0
n <- rle_of_CODE1$lengths[CODE1_is_nonzero]
Ignore the parts of df1 where CODE1 is zero.
df1_with_nonzero_CODE1 <- subset(df1, CODE1 != 0)
Define a group based on the contiguous blocks we found with rle.
df1_with_nonzero_CODE1$GROUP <- rep(seq_along(n), times = n)
Use ddply to get summary stats for each group.
summarised_by_CODE1 <- ddply(
df1_with_nonzero_CODE1,
.(GROUP),
summarise,
MinOfLOCAT = min(LOCAT),
MaxOfLOCAT = max(LOCAT),
SIZE = max(END) - min(START)
)
summarised_by_CODE1$VALUE <- ifelse(
rle_of_CODE1$values[CODE1_is_nonzero] == 1,
"POS",
"NEG"
)
summarised_by_CODE1
## GROUP MinOfLOCAT MaxOfLOCAT SIZE VALUE
## 1 1 1 2 270 POS
## 2 3 4 4 75 POS
Now repeat with CODE2.

How to count and test for the sum and repeat the action

I need to test the value of'peso'(see replication code below) for each factor. Whether a factor reaches 50% of the overall sum for 'peso', the values of each factor should be paste into a new object 'results', otherwise, R should evaluate which factor has the lowest aggregated value for 'peso', and consider the factor in the next column for aggregate 'peso' again. Basically, this process replace the lowest scored factor for the next factor. The process should repeat till a factor cross the 50% threshold. So my question is, where do I start?
set.seed(51)
Data <- sapply(1:100, function(x) sample(1:10, size=5))
Data <- data.frame(t(Data))
names(Data) <- letters[1:5]
Data$peso <- sample(0:3.5, 100, rep=TRUE)
It should be like
If your first two rows are:
a b c d e peso
8 2 3 7 9 1
8 3 4 5 7 3
9 7 4 10 1 2
10 3 4 5 7 3
What would you like for the total?
Totals_08 = 4
Totals_09 = 2
Totals_10 = 3
etc?
So, factor 8 got the greater share 4/(4+2+3) = 0.4444444, but not reached 50% threshold in the round a. Therefore, I need something more: repeat the aggregation but considering now the factor 7 in the column 'b' instead of factors 9 in the column 'a', since it got the lowest aggregated value in the first round.

It's unclear if you have your list of factors already or not. If you do not have it, and are taking it from the data set, you can grab it in a few different ways:
# Get a list of all the factors
myFactors <- levels(Data[[1]]) # If actual factors.
myFactors <- sort(unique(unlist(Data))) # Otherwise use similar to this line
Then to calculate the Totals per factor, you can do the following
Totals <-
colSums(sapply(myFactors, function(fctr)
# calculate totals per fctr
as.integer(Data$peso) * rowSums(fctr == subset(Data, select= -peso))
))
names(Totals) <- myFactors
Which gives
Totals
# 1 2 3 4 5 6 7 8 9 10
# 132 153 142 122 103 135 118 144 148 128
Next:
I'm not sure if afterwards, you want to compare to the sum of peso or the sum of the totals. Here are both options, broken down into steps:
# Calculate the total of all the Totals:
TotalSum <- sum(Totals)
# See percentage for each:
Totals / TotalSum
Totals / sum(as.integer(Data$peso))
# See which, if any, is greater than 50%
Totals / TotalSum > 0.50
Totals / sum(as.integer(Data$peso)) > 0.50
# Using Which to identify the ones you are looking for
which(Totals / TotalSum > 0.50)
which(Totals / sum(as.integer(Data$peso)) > 0.50)
Note on your sampling for Peso
You took a sample of 0:3.5, however, the x:y sequence only gives integers.
If you want fractions, you can either use seq() or you can take a larger sequence and then divide appropriately:
option1 <- (0:7) / 2
option2 <- seq(from=0, to=3.5, by=0.5)
If you want whole integers from 0:3 and also the value 3.5, then use c()
option3 <- c(0:3, 3.5)

Comparing two columns: logical- is value from column 1 also in column 2?

I'm pretty confused on how to go about this. Say I have two columns in a dataframe. One column a numerical series in order (x), the other specifying some value from the first, or -1 (y). These are results from a matching experiment, where the goal is to see if multiple photos are taken of the same individual. In the example below, there 10 photos, but 6 are unique individuals. In the y column, the corresponding x is reported if there is a match. y is -1 for no match (might as well be NAs). If there is more than 2 photos per individual, the match # will be the most recent record (photo 1, 5 and 7 are the same individual below). The group is the time period the photo was take (no matches within a group!). Hopefully I've got this example right:
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(-1,-1,-1,-1,1,-1,1,-1,2,4)
group <- c(1,1,1,2,2,2,3,3,3,3)
DF <- data.frame(x,y,group)
I would like to create a new variable to name the unique individuals, and have a final dataset with a single row per individual (i.e. only have 6 rows instead of 10), that also includes the group information. I.e. if an individual is in all three groups, there could be a value of "111" or if just in the first and last group it would be "101". Any tips?
Thanks for asking about the resulting dataset. I realized my group explanation was bad based on the actual numbers I gave, so I changed the results slightly. Bonus would also be nice to have, but not critical.
name <- c(1,2,3,4,6,8)
group_history <- as.character(c('111','101','100','011','010','001'))
bonus <- as.character(c('1,5,7','2,9','3','4,10','6','8'))
results_I_want <- data.frame(name,group_history,bonus)
My word, more mistakes fixed above...

Using the (updated) example you gave
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(-1,-1,-1,-1,1,-1,1,-1,3,4)
group <- c(1,1,1,2,2,2,3,3,3,3)
DF <- data.frame(x,y,group)
Use the x and y to create a mapping from higher numbers to lower numbers that are the same person. Note that names is a string, despite it be a string of digits.
bottom.df <- DF[DF$y==-1,]
mapdown.df <- DF[DF$y!=-1,]
mapdown <- c(mapdown.df$y, bottom.df$x)
names(mapdown) <- c(mapdown.df$x, bottom.df$x)
We don't know how many times it might take to get everything down to the lowest number, so have to use a while loop.
oldx <- DF$x
newx <- mapdown[as.character(oldx)]
while(any(oldx != newx)) {
oldx = newx
newx = mapdown[as.character(oldx)]
}
The result is the group it belongs to, names by the lowest number of that set.
DF$id <- unname(newx)
Getting the group membership is harder. Using reshape2 to convert this into wide format (one column per group) where the column is "1" if there was something in that one and "0" if not.
library("reshape2")
wide <- dcast(DF, id~group, value.var="id",
fun.aggregate=function(x){if(length(x)>0){"1"}else{"0"}})
Finally, paste these "0"/"1" memberships together to get the grouping variable you described.
wide$grouping = apply(wide[,-1], 1, paste, collapse="")
The result:
> wide
id 1 2 3 grouping
1 1 1 1 1 111
2 2 1 0 0 100
3 3 1 0 1 101
4 4 0 1 1 011
5 6 0 1 0 010
6 8 0 0 1 001
No "bonus" yet.
EDIT:
To get the bonus information, it helps to redo the mapping to keep everything. If you have a lot of cases, this could be slow.
Replace the oldx/newx part with:
iterx <- matrix(DF$x, ncol=1)
iterx <- cbind(iterx, mapdown[as.character(iterx[,1])])
while(any(iterx[,ncol(iterx)]!=iterx[,ncol(iterx)-1])) {
iterx <- cbind(iterx, mapdown[as.character(iterx[,ncol(iterx)])])
}
DF$id <- iterx[,ncol(iterx)]
To generate the bonus data, then you can use
bonus <- tapply(iterx[,1], iterx[,ncol(iterx)], paste, collapse=",")
wide$bonus <- bonus[as.character(wide$id)]
Which gives:
> wide
id 1 2 3 grouping bonus
1 1 1 1 1 111 1,5,7
2 2 1 0 0 100 2
3 3 1 0 1 101 3,9
4 4 0 1 1 011 4,10
5 6 0 1 0 010 6
6 8 0 0 1 001 8
Note this isn't same as your example output, but I don't think your example output is right (how can you have a grouping_history of "000"?)
EDIT:
Now it agrees.

Another solution for bonus variable
f_bonus <- function(data=df){
data_a <- subset(data,y== -1,select=x)
data_a$pos <- seq(nrow(data_a))
data_b <- subset(df,y!= -1,select=c(x,y))
data_b$pos <- match(data_b$y, data_a$x)
data_t <- rbind(data_a,data_b[-2])
data_t <- with(data_t,tapply(x,pos,paste,sep="",collapse=","))
return(data_t)
}

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Removing rows from a data frame until a condition is met - r

Your while-loop doesn't redefine block2_df. This should work: while (dim(block_2_df)[1]>1) { block_2_df <- remove_fun(block_2_df) }

Related

Is there any way to make a data frame using the last values of a repeated loop?

'Random' Sorting with a condition in R for Psychology Research

Create a new data frame based on another dataframe

How to count and test for the sum and repeat the action

Comparing two columns: logical- is value from column 1 also in column 2?

Categories

Resources