Loop over data frame comparing pairs - r

I have created the following dataframe:
set.seed(42)
df1 = data.frame(pair = rep(c(1:26),2), size = rnorm(52,5.4,1.89))
It represents random pairs of individuals of a certain size, as assigned by the 'pair' column.
The random distribution (5.4, 1.89) is based on observed data from the group that I sampled in my study (N=26 pairs).
I now want to ask a very basic question that I am unable to code my way to:
Imagine a horizontal line at the mean (5.4), severing the population in two:
What proportion of individuals are paired with another individual from the same side of the line? i.e. is there a tendency for small to be with small and big to be with big?
I want to compare the proportion I observed with the proportion generated from 'asking' the above question a lot of times (e.g. 1000 repetitions).
In my study 18/26 individuals were together with a similar sized partner, and so I want to ask 'out of 1000 repetitions, how many times was the proportion of similar individuals equal to or greater than 18/26?' This will be my 'p-value'.
I have no clue how to code this, but in my head it goes like this:
For each value in column 'size': when pair values are equal, do this:
is the larger individual equal to or bigger than 5.4? is the smaller
individual equal to or bigger than 5.4?
if so, return a "yes"
OR
is the larger individual equal to or smaller than 5.4? is the smaller
individual equal to or smaller than 5.4?
if so, return a "yes"
if none of the above are true, return a 0
provide an output of the proportion of yes and no. store this in a data.frame repeat
this process 1000 times, adding all the outputs to the mentioned data
frame:
run1 24/26
run2 4/26
...
run999 13/26
I really hope someone can show me the start to this, or the relevant code/arguments/structure.

Is this what you want?
#Create empty output, for 10 iterations
same_group_list = replicate(10,0)
diff_group_list = replicate(10,0)
for (j in 1:10){ #For 10 iterations
df1 = data.frame(pair = rep(c(1:26),2), size = rnorm(52,5.4,1.89))
#Sort by 'pair'
df1 = df1[with(df1, order(pair)), ]
#Assign a group based on if 'size' is > or < than mean(size)
for (i in 1:nrow(df1)){
if (df1$size[i] <= mean(df1$size)){ #Use 5.4 explicitly instead of mean(df1$size) if you want
df1$Group[i] = -1
} else {
df1$Group[i] = 1
}
}
df1$Group = as.numeric(df1$Group) #Convert to numeric
output2 = tapply(df1$Group, df1$pair, mean) #Carry out groupwise mean
diff_group_list[j] = sum(output2 == 0) #A mean of 0 means pair grouped with another group
same_group_list[j] = length(output2) - diff_group_list[j] #Everything else is the same group
}
output = data.frame("Same group out of 26" = same_group_list, "Different group out of 26" = diff_group_list)
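For a single simulated data set in the same long format (one row per individual), the grouping step can also be done without the inner loop. This is a minimal sketch, assuming the fixed cutoff of 5.4 from the question rather than mean(df1$size); the names above, n_above and same_side are illustrative only.

```r
set.seed(42)
df1 <- data.frame(pair = rep(1:26, 2), size = rnorm(52, 5.4, 1.89))

# TRUE when an individual is above the cutoff
above <- df1$size > 5.4

# Number of individuals above the cutoff in each pair: 0, 1 or 2
n_above <- tapply(above, df1$pair, sum)

# A pair is on the same side when both (2) or neither (0) are above the cutoff
same_side <- n_above != 1
sum(same_side)  # number of same-side pairs out of 26
```
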

I created a data frame with the pairs side by side and checked which sizes were higher than 5.4, then compared within pairs. The pairs with both sizes higher than 5.4 were counted, and the count was divided by 26.
The data frame proportions shows the proportion for each run.
proportions <- data.frame(run = (1:1000), prop = rep(NA,1000))
for (i in 1:1000) {
df = data.frame(pair = c(1:26),
size1 = rnorm(26,5.4,1.89),
size2 = rnorm(26,5.4,1.89)
)
greaterPairs <- sum(df[,2] > 5.4 & df[,3]>5.4)
proportions[i,2] = greaterPairs/26
}
head(proportions)
I did not keep the proportion in the string format "18/26" because later, if you want to count how many of them satisfy some condition, you would have to do it visually, one by one. So, for example, if you want to know how many of them are greater than or equal to 18/26:
sum(proportions$prop >= (18/26))
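The code above counts only pairs where both sizes are above 5.4. The question also treats pairs where both are below 5.4 as being on the same side. A minimal sketch of that variant follows, assuming the same normal distribution and 1000 repetitions; the names same_side_count and p_value are illustrative. Counts are compared instead of proportions to avoid floating-point equality issues.

```r
set.seed(42)
n_pairs <- 26
n_reps  <- 1000
cutoff  <- 5.4

# For each repetition, count pairs whose two members fall on the same side
same_side_count <- replicate(n_reps, {
  size1 <- rnorm(n_pairs, 5.4, 1.89)
  size2 <- rnorm(n_pairs, 5.4, 1.89)
  sum((size1 > cutoff) == (size2 > cutoff))
})

# Share of repetitions with at least 18 same-side pairs (the observed count)
p_value <- mean(same_side_count >= 18)
p_value
```
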

Related

Subsampling from a set with the assumption that each member would be picked at least one time in r

I need code or an idea for the following case: we have a dataset of 1000 rows, and I want to draw subsamples of 800 rows multiple times (I don't know how many times I should repeat).
How should I control that all members would be picked at least in one run? I need the code in r.
To make the question clearer, let's define the row names as:
rownames(dataset) = A,B,C,D,E,F,G,H,J,I
if I subsample 3 times:
A,B,C,D,E,F,G,H
D,E,A,B,H,J,F,C
F,H,E,A,B,C,D,J
The I is not in any of the subsample sets. I would like to subsample 90 or 80 percent of the data many times, but I expect every row to be chosen in at least one of the subsample sets. In the above example, the element I should be picked in at least one of the subsamples.
One way to do this is random sampling without replacement to designate a set of "forced" random picks, in other words have a single guaranteed appearance of each row, and decide ahead of time which subsample that guaranteed appearance will be in. Then, randomly sample the rest of the subsample.
num_rows = 1000
num_subsamples = 1000
subsample_size = 900
full_index = 1:num_rows
dat = data.frame(i = full_index)
# Randomly assign guaranteed subsamples
# Make sure that we don't accidentally assign more than the subsample size
# If we're subsampling 90% of the data, it'll take at most a few tries
biggest_guaranteed_subsample = num_rows
while (biggest_guaranteed_subsample > subsample_size) {
# Assign the subsample that the row is guaranteed to appear in
dat$guarantee = sample(1:num_subsamples, num_rows, replace = TRUE)
# Find the subsample with the most guaranteed slots taken
biggest_guaranteed_subsample = max(table(dat$guarantee))
}
# Assign subsamples
for (ss in 1:num_subsamples) {
# Pick out any rows guaranteed a slot in that subsample
my_sub = dat[dat$guarantee == ss, 'i']
# And randomly select the rest
my_sub = c(my_sub, sample(full_index[!(full_index %in% my_sub)],
subsample_size - length(my_sub),
replace = FALSE))
# Do your subsample calculation here
}
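To check that the approach covers every row, here is a smaller self-contained sketch using the 10-row, 3-subsample example from the question. The sizes and the names guarantee, subsamples and covered are assumptions for illustration. Indexing with sample.int avoids the surprise of sample() treating a single number as 1:n.

```r
set.seed(1)
num_rows <- 10
num_subsamples <- 3
subsample_size <- 8

# Assign each row one guaranteed subsample; retry if any subsample is overfull
repeat {
  guarantee <- sample(num_subsamples, num_rows, replace = TRUE)
  if (max(tabulate(guarantee, nbins = num_subsamples)) <= subsample_size) break
}

subsamples <- lapply(seq_len(num_subsamples), function(ss) {
  forced <- which(guarantee == ss)              # rows guaranteed to appear here
  rest   <- setdiff(seq_len(num_rows), forced)  # rows still available
  extra  <- rest[sample.int(length(rest), subsample_size - length(forced))]
  c(forced, extra)
})

# Every row index should appear in at least one subsample
covered <- sort(unique(unlist(subsamples)))
all(covered == seq_len(num_rows))
```
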

Is there a way to use the which argument in for loop in R?

I'm trying to write a for loop to find the highest 100 (as an example) variables in each index and reassign them all the 100th highest value. The for loop is starting at the max for the index and testing to see if the number of cases matching the max value exceeds the threshold. If less than 100 cases match the maximum, the maximum variable is reduced by 1 and run again. If 100 or more cases match, the max is adjusted back to the previous value and the cases are assigned that value.
I'm trying to actually use this on a data set to adjust the values for top and bottom 0.075% to a new max and min respectively without crossing over the 0.75% threshold. My actual data has over 400k cases and 170 features that I'm trying to use this on.
I don't need to fix this if there is a better way to do what I described above.
df should have: a = values from 0-100 and 101 for 100 cases, b = values from 100-200 and 201 for 100 cases, c = values from 200-300 and 301 for 100 cases.
I tried to use length(which(df[i])) in the if and else statements and thought that assigning it to a variable might help, but it didn't.
a=c(0:200)
b=c(100:300)
c=c(200:400)
df <- data.frame(a, b, c)
for (i in 1:length(df)){
max_count <- length(which(df[i]))
maximum <-max(df[i])
if (((max_count > maximum) < 100) == FALSE){
maximum <- maximum -1
}
else if (((max_count > maximum) >= 100) == TRUE){
df[i](which(df[i] > maximum +1)) <- maximum +1
}
}
>>> Error in which(df[i]) : argument to 'which' is not logical
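The error arises because which() expects a logical vector, while df[i] is a one-column data frame of numbers. A comparison such as df[[i]] > some_value would give which() something it can use. One loop-free way to get the result described above is to find the 100th highest value in each column and cap everything above it. This is only a sketch; cap_top_n is a hypothetical helper, not an existing function.

```r
# Cap the n highest values of x at the n-th highest value
cap_top_n <- function(x, n = 100) {
  cap <- sort(x, decreasing = TRUE)[n]
  pmin(x, cap)
}

df <- data.frame(a = 0:200, b = 100:300, c = 200:400)
capped <- as.data.frame(lapply(df, cap_top_n, n = 100))

max(capped$a)         # 101
sum(capped$a == 101)  # 100
```

The same idea works for the lower tail with sort(x)[n] and pmax(). For percentage-based cutoffs, quantile() could supply the cap instead of a fixed count.
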

How to add grouping variable to data set that will classify both an observation and its N neighbors based on some condition

I am having some trouble coming up with a solution that properly handles classifying a variable number of neighbors for any given observation in a data frame based on some condition. I would like to be able to add a simple, binary indicator variable to a data frame that will equal 1 if the condition is satisfied, and 0 if it is not.
Where I am getting stuck is I am unsure how to iteratively check the condition against neighboring observations only, in either direction (i.e., to check if out of 4 neighboring observations in a given column in my data frame, that at least 3 out of 4 of them contain the same value). I have tried first creating another indicator variable indicating if the condition is satisfied or not (1 or 0 = yes or no). Then, I tried setting up a series of ifelse() statements within a loop to try to assign the proper categorization of the observation where the initial condition is satisfied, +/- 2 observations in either direction. However, when I inspect the dataframe after running the loop, only the observation itself (not its neighbors) where the condition is satisfied is receiving the value, rather than all neighboring observations also receiving the value. Here is my code:
#sample data
sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
sample_dat$violate <- NULL
for(i in 1:nrow(dat_date_ord)){
sample_dat$violate[i] <- ifelse(sample_dat$initial_ind[i]==1 &
((sample_dat$initial_ind[i-2]==1 |
sample_dat$initial_ind[i-1]==1) &
(sample_dat$initial_ind[i+2]==1 |
sample_dat$initial_ind[i+1]==1)),
"trending",
"non-trending"
)
}
This loop correctly identifies one of the four points that needs to be labelled "trending", but it does not also assign "trending" to the correct neighbors. In other words, I expect the output to be "trending" for observations 7-10, since 3/4 observations in that group of 4 all have a value of 1 in the initial indicator column. I feel like there might be an easier way to accomplish this - but what I need to ensure is that my code is robust enough to identify and assign observations to a group regardless of whether I want 3/4 to indicate a group, 5/6, 2/5, etc.
Thank you for any and all advice.
You can use the rollapply function from the zoo package to apply a function to set intervals in your data. The question then becomes about creating a function that satisfies your needs. I'm not sure if I've understood correctly, but it seems you want a function that checks if the condition is true for at least 3/5 of the observation plus its four closest neighbors. In this case just adding the 1s up and checking if they're above 2 works.
library(zoo)
sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
trend_test = function(x){
ifelse(sum(x) > 2, "trending", "non-trending")
}
sample_dat$violate_new = rollapply(sample_dat$initial_ind, FUN = trend_test, width = 5, fill = NA)
Edit: If you want a function that checks if the observation and the next 3 observations have at least 3 1s, you can do something very similar, just by changing the align argument on rollapply:
trend_test_2 = function(x){
ifelse(sum(x) > 2, "trending", "non-trending")
}
sample_dat$violate_new = rollapply(sample_dat$initial_ind, FUN = trend_test_2, width = 4,
fill = NA, align = "left")
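If every observation inside a qualifying window should be labelled, as the question asks for observations 7-10, a base R sketch is to slide a window over the vector and flag all of its members whenever the count is met. flag_windows, width and min_hits are illustrative names, and the 3-out-of-4 rule is taken from the question.

```r
# Flag every position that belongs to at least one window of `width`
# consecutive observations containing at least `min_hits` ones
flag_windows <- function(x, width = 4, min_hits = 3) {
  n <- length(x)
  flagged <- rep(FALSE, n)
  if (n < width) return(flagged)
  for (start in seq_len(n - width + 1)) {
    idx <- start:(start + width - 1)
    if (sum(x[idx]) >= min_hits) flagged[idx] <- TRUE
  }
  flagged
}

sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
sample_dat$violate <- ifelse(flag_windows(sample_dat$initial_ind),
                             "trending", "non-trending")
which(sample_dat$violate == "trending")  # 7 8 9 10
```

Changing width and min_hits covers the other rules mentioned in the question, such as 5 out of 6 or 2 out of 5.
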

How to select a specific amount of rows before and after predefined values

I am trying to select relevant rows from a large time-series data set. The tricky bit is, that the needed rows are before and after certain values in a column.
# example data
x <- rnorm(100)
y <- rep(0,100)
y[c(13,44,80)] <- 1
y[c(20,34,92)] <- 2
df <- data.frame(x,y)
In this case the critical values are 1 and 2 in the df$y column. If, e.g., I want to select 2 rows before and 4 after df$y==1 I can do:
ones<-which(df$y==1)
selection <- NULL
for (i in ones) {
jj <- (i-2):(i+4)
selection <- c(selection,jj)
}
df$selection <- 0
df$selection[selection] <- 1
This, arguably, scales poorly for more values. For df$y==2 I would have to repeat with:
twos<-which(df$y==2)
selection <- NULL
for (i in twos) {
jj <- (i-2):(i+4)
selection <- c(selection,jj)
}
df$selection[selection] <- 2
Ideal scenario would be a function doing something similar to this imaginary function selector(data=df$y, values=c(1,2), before=2, after=5, afterafter = FALSE, beforebefore=FALSE), where values is fed with the critical values, before with the amount of rows to select before and correspondingly after.
Whereas, afterafter would allow for the possibility to go from certain rows until certain rows after the value, e.g. after=5,afterafter=10 (same but going into the other direction with beforebefore).
Any tips and suggestions are very welcome!
Thanks!
This is easy enough with rep and its each argument.
df$y[rep(which(df$y == 2), each=7L) + -2:4] <- 2
Here, rep repeats the row indices that match your criterion 7 times each (two before, the value, and four after; the L indicates that the argument should be an integer). Add values -2 through 4 to get these indices. Now, replace.
Note that for some comparisons, == will not be adequate due to numerical precision. See the SO post why are these numbers not equal for a detailed discussion of this topic. In these cases, you could use something like
which(abs(df$y - 2) < 0.001)
or whatever precision measure will work for your problem.
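The same rep idea can be wrapped into a small function that handles several critical values and writes to a separate selection vector, as in the question. This is a sketch only: select_around is a hypothetical name, and indices that fall outside the data are dropped.

```r
# Mark `before` rows before and `after` rows after each occurrence of
# every value in `values`; unmarked rows stay 0
select_around <- function(y, values, before = 2, after = 4) {
  selection <- rep(0, length(y))
  offsets <- -before:after
  for (v in values) {
    hits <- which(y == v)
    idx <- rep(hits, each = length(offsets)) + offsets
    idx <- idx[idx >= 1 & idx <= length(y)]  # drop out-of-range indices
    selection[idx] <- v
  }
  selection
}

y <- rep(0, 100)
y[c(13, 44, 80)] <- 1
y[c(20, 34, 92)] <- 2
df <- data.frame(x = rnorm(100), y = y)
df$selection <- select_around(df$y, values = c(1, 2), before = 2, after = 4)
table(df$selection)
```

If windows for different values overlap, the value processed later overwrites the earlier one.
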

Finding Index at which the Measure First equals or Exceeds 70% of Max and Falls Below 70% of Max

To start, I have a data table (RRLong) containing subject ID numbers (n = 15), the type of trial that subjects were exposed to (10 trial types, within-subject factor), and the block of test sessions (5 blocks, within-subject factor). On each trial, for each block, there were 80 sequential bins for which responding was measured. The last column contains the response rate (RR) for a given bin.
ID TrialType Block Bin RR
1 E.Cue 1 1 0.369047619
1 E.Cue 1 2 0.447916667
1 E.Cue 1 3 0.435185185
When RR is plotted as a function of bin, the data approximates the shape of a Gaussian distribution.
From these data, I need to calculate the following measures, for each subject on each trial type on each block (thus, 750 values based on the number of subjects, blocks, and trial types):
PR: The maximum response rate
PT: The bin at which the maximum response rate was located
Initial: The bin at which response rate first equals or exceeds 70% of the maximum response rate.
Final: The bin at which response rate first equals or falls below 70% of the maximum response rate; this value must be later than the value of PT.
I have managed to extract the first two measures using the following code using dplyr:
MolarMeasures <- RRLong %>%
group_by(ID,TrialType,Block) %>%
slice(which.max(RR)) %>%
select(PT = Bin, PR = RR)
However, I am at a loss for how to calculate the last two measures. I would appreciate any insight/advice. Please let me know if any additional information is needed.
You haven't provided enough data to test things out, but this should work.
library(dplyr)
MolarMeasures <- RRLong %>%
group_by(ID,TrialType,Block) %>%
summarize(
PR = max(RR),
PT = Bin[which.max(RR)],
Initial = Bin[which.max(RR >= 0.7 * PR)],
Final = which.max((RR <= 0.7 * PR) & cumsum(RR >= PR)),
Final = ifelse(Final == 1, NA, Bin[Final])
)
Explanation: PR and PT are straightforward. For the last two we take advantage of the fact that TRUE/FALSE can be treated as 1/0, and which.max will return the index of the first maximum (first TRUE). For Final we also use cumsum to make sure that the max has been met at least once. The cumsum result will be 0/FALSE before the maximum and (not 0)/TRUE afterwards.
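The which.max and cumsum tricks can be seen on a small made-up vector for a single group. The values in rr are invented for illustration and are not from the RRLong data.

```r
rr <- c(0.1, 0.5, 0.8, 1.0, 0.9, 0.6, 0.2)  # made-up response rates, bins 1..7
pr <- max(rr)

# First bin at or above 70% of the maximum
initial <- which.max(rr >= 0.7 * pr)

# TRUE from the peak onwards, FALSE before it
after_peak <- cumsum(rr >= pr) > 0

# First bin at or below 70% of the maximum, but only after the peak
final <- which.max(rr <= 0.7 * pr & after_peak)

c(initial = initial, final = final)  # initial = 3, final = 6
```

Note that which.max returns 1 when no element is TRUE, which is why the answer converts a Final of 1 to NA.
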

Resources