R Selecting a Row plus the next 5 rows - r

In my dataframe I would like to select a row based on some logic, and then return a dataframe with the selected row PLUS the next 'N' rows.
So, I have this: (a generic example)
workingRows <- myData[which(myData$Column1 >= myData$Column2 & myData$Column3 <= myData$Column4), ]
Which returns me the correct "starting values". How can I get the "next" 5 values based on each of the starting values?

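We can use rep to build the indices of each matching row plus the next 5 rows, sort them, wrap them in unique to drop duplicates from overlapping windows, and then subset 'myData'.
i1 <- which(myData$Column1 >= myData$Column2 & myData$Column3 <= myData$Column4)
i2 <- unique(sort(i1 + rep(0:5, each = length(i1))))
# drop indices past the last row, which would otherwise come back as NA rows
myData[i2[i2 <= nrow(myData)], ]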
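To make the index arithmetic easier to check, here is a minimal sketch on toy data. The helper name rows_plus_next, its arguments, and the toy data frame are assumptions made for illustration and are not part of the original answer.

```r
# hypothetical helper: rows where `hit` is TRUE plus the n rows after each of them
rows_plus_next <- function(data, hit, n = 5) {
  i1 <- which(hit)
  # for each starting index build i, i+1, ..., i+n, then sort and de-duplicate
  idx <- unique(sort(rep(i1, each = n + 1) + 0:n))
  # keep only indices that exist in the data frame
  data[idx[idx <= nrow(data)], , drop = FALSE]
}

toy <- data.frame(id = 1:12, a = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0))
res <- rows_plus_next(toy, toy$a == 1, n = 2)
res$id
```

With matches at rows 2 and 11 and n = 2, the window after row 11 is cut off at the last row of the data frame instead of producing an NA row.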

Related

R function to subset dataframe so that non-adjacent values in a column differ by >= X (starting with the first value)

I am looking for a function that iterates through the rows of a given column ("pos" for position, ascending) in a dataframe and keeps only those rows whose values are at least, say, 10 apart, starting with the first row. Thus it would start with the first row (and store it), carry on until it finds a row with a value at least 10 higher than the first, store that row, and then start again from that value, looking for the next row at least 10 higher.
So far I have an R for loop that successfully finds adjacent rows at least X values apart, but it does not have the capability of looking any further than one row down, nor of stopping once it has found the given row and starting again from there.
Here is the function I have:
# example data frame
df <- data.frame(x=c(1:1000), pos=sort(sample(1:10000, 1000)))
# prep function (this only checks row above)
library(dplyr)
pos.apart.subset <- function(df, pos.diff) {
# create new dfs to store output
new.df <- list()
new.df1 <- data.frame()
# iterate through each row of df
for (i in 1:nrow(df)) {
# if the value of next row is higher or equal than value or row i+posdiff, keep
# if not ascending, keep
# if first row, keep
if(isTRUE(df$pos[i+1] >= df$pos[i]+pos.diff | df$pos[i+1] < df$pos[i] | i==1 )) {
# add rows that meet conditions to list
new.df[[i]] <- df[i,] }
}
# bind all rows that met conditions
new.df1 <- bind_rows(new.df)
return(new.df1)}
# test run for pos column adjacent values to be at least 10 apart
df1 <- pos.apart.subset(df, 10); head(df1)
Happy to do this in awk or any other language. Many thanks.
It seems I misunderstood the question earlier. Since we don't want the difference between consecutive rows but the difference from the last selected row, you can try:
nrows <- 1
previous_match <- 1
for(i in 2:nrow(df)) {
if(df$pos[i] - df$pos[previous_match] > 10) {
nrows <- c(nrows, i)
previous_match <- i
}
}
and then subset the selected rows :
df[nrows, ]
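The same greedy scan can be wrapped in a function that returns a logical vector, so the index vector does not have to grow inside the loop. This is only a sketch: the name keep_spaced and the toy data are assumptions, and it uses >= because the question asks for values "at least" 10 apart, whereas the loop above uses >.

```r
# hypothetical helper: TRUE for rows whose pos is at least min_gap above the last kept row
keep_spaced <- function(pos, min_gap) {
  keep <- logical(length(pos))
  if (length(pos) == 0) return(keep)
  keep[1] <- TRUE          # the first row is always kept
  last <- pos[1]           # value of the most recently kept row
  for (i in seq_along(pos)[-1]) {
    if (pos[i] - last >= min_gap) {
      keep[i] <- TRUE
      last <- pos[i]
    }
  }
  keep
}

toy <- data.frame(x = 1:7, pos = c(1, 4, 11, 15, 20, 21, 35))
kept <- toy[keep_spaced(toy$pos, 10), ]
kept$x
```

On the toy data, row 5 (pos 20) is skipped because it is only 9 above the last kept value 11, while row 6 (pos 21) is kept.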
Earlier answer
We can use diff to get the difference between consecutive rows and select the rows whose difference from the previous row is greater than 10.
head(subset(df, c(TRUE, diff(pos) > 10)))
# x pos
#1 1 1
#2 2 31
#6 6 71
#9 9 134
#10 10 151
#13 13 185
The first TRUE selects the first row by default.
In dplyr, we can use lag to get the value from the previous row. lag returns NA for the first row, so that row is kept explicitly:
library(dplyr)
df %>% filter(row_number() == 1 | pos - lag(pos) > 10)

How to drop a buffer of rows in a data frame around rows of a certain condition

I am trying to remove rows in a data frame that are within x rows after rows meeting a certain condition.
I have a data frame with a response variable, a measurement type that represents the condition, and time. Here's a mock data set:
data <- data.frame(rlnorm(45,0,1),
c(rep(1,15),rep(2,15),rep(1,15)),
seq(
from=as.POSIXct("2012-1-1 0:00", tz="EST"),
to=as.POSIXct("2012-1-1 0:44", tz="EST"),
by="min"))
names(data) <- c('Variable','Type','Time')
In this mock case, I want to delete the first 5 rows in condition 1 after condition 2 occurs.
The way I thought about solving this problem was to generate a separate vector that determines the distance that each observation that is a 1 is from the last 2. Here's the code I wrote:
dist = vector()
for(i in 1:nrow(data)) {
if(data$Type[i] != 1) dist[i] <- 0
else {
position = i
tempcount = 0
while(position > 0 && data$Type[position] == 1){
position = position - 1
tempcount = tempcount + 1
}
dist[i] = tempcount
}
}
This code will do the trick, but it's extremely inefficient. I was wondering if anyone had some cleverer, faster solutions.
If I understand you correctly, this should do the trick:
criteria1 = which(data$Type[2:nrow(data)] == 2 & data$Type[2:nrow(data)] != data$Type[1:(nrow(data)-1)]) + 1
criteria2 = as.vector(sapply(criteria1,function(x) seq(x,x+5)))
data[-criteria2,]
How it works:
criteria1 contains indices where Type==2, but the previous row is not the same type. The strange-looking subsets like 2:nrow(data) are there because we want to compare each row to the previous row, but the first row has no previous row. Therefore we add +1 at the end.
criteria2 contains sequences starting at each index in criteria1 and running to that index + 5.
The third line performs the subset.
This might need small modification, I wasn't exactly clear what criteria 1 and criteria 2 were from your code. Let me know if this works or you need any more advice!
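For the case described in the question (dropping the first 5 rows of Type 1 that follow a run of Type 2), a loop-free sketch can be built on rle. This is an assumption about what the question intends, not the answer's method; the variable names are illustrative, and the Time column is left out of the mock data for brevity.

```r
# mock data shaped like the question's example, without the Time column
data <- data.frame(Variable = rlnorm(45, 0, 1),
                   Type = c(rep(1, 15), rep(2, 15), rep(1, 15)))

runs <- rle(data$Type)
# position of each row within its run of identical Type values
pos_in_run <- sequence(runs$lengths)
# Type of the run that precedes each row's run (NA for the first run)
prev_type <- rep(c(NA, head(runs$values, -1)), runs$lengths)
# rows to drop: Type 1, directly after a Type 2 run, within the first 5 rows of the run
drop <- data$Type == 1 & !is.na(prev_type) & prev_type == 2 & pos_in_run <= 5
result <- data[!drop, ]
```

Here rows 31 to 35 are removed, and the buffer length 5 can be swapped for any other value.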

R restructuring long to wide using for loop

Right now, this working loop is pasting the supervisor's scores for a given year into new columns (supervisor.score1:supervisor.score4) on the appropriate employee row. It achieves this by looking at an employee row i and finding the employee row(s) for the supervisor and year listed in row i. Then it takes the first row on that list of matching rows, and pastes the scores (score1:score 4) from that row into supervisor.score1:supervisor.score4 for the corresponding employee row i.
employeeID year supervisor score1:score 4 supervisor.score1:supervisor.score4
for (i in (1:nrow(data))){
matchvector <- which(data[,1] == data[i,3] & data[,2] == data[i,2])
if (length(matchvector) > 0) {
case <- matchvector[1]
data[i, namevector] <- data[case, supervisor.score1:supervisor.score4]}
if (length(matchvector) < 1){
data[i, supervisor.score1:supervisor.score4] <- NA}
}
Is there a way to convert this loop into a function that can be called with apply?
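One possible way to avoid the loop altogether, rather than moving it into apply, is a vectorized lookup with match. The sketch below is only an illustration: the toy data, the reduction to two score columns, and the names score_cols and sup_cols are assumptions, since the question does not show the real data.

```r
# hypothetical data following the column layout described in the question
data <- data.frame(
  employeeID = c(1, 2, 3, 1, 2, 3),
  year       = c(2010, 2010, 2010, 2011, 2011, 2011),
  supervisor = c(3, 3, NA, 2, NA, 2),
  score1     = c(5, 6, 7, 8, 9, 10),
  score2     = c(1, 2, 3, 4, 5, 6)
)
score_cols <- c("score1", "score2")
sup_cols <- paste0("supervisor.", score_cols)

# key for each employee-year row, and key of the supervisor's row in the same year
own_key <- paste(data$employeeID, data$year)
sup_key <- paste(data$supervisor, data$year)
# match() gives the first matching row for each key, or NA when there is none
case <- match(sup_key, own_key)
# rows indexed by NA come back as NA, so employees without a match get NA scores
data[sup_cols] <- data[case, score_cols]
```

Like the loop, this takes the first matching row when a supervisor has several rows in the same year.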

Randomly return a row number for a subset in the data frame

I'd like to be able to randomly return a row number from a data set, where the rows are a subset of the data. For example with the dataframe
x.f<-data.frame(
G = c("M","M","M","M","M","M","F","F","F","F","F","F"),
A = c("1","2","3","1","2","3","1","2","3","1","2","3"),
E = c("W","W","W","B","B","B","W","W","W","B","B","B"))
I'd like to, say, randomly give me a row number where G=="M" and A=="3", so the answer will be row 3 or row 6. The number returned must be the position in the original data frame. Whilst this example is nicely structured (each possible combination appears once only), in reality there will not be such a structure, eg the combination (M,2,W) will be randomly distributed throughout the data frame and can occur more or fewer times than other combinations.
Using the answer of Sourabh and sample you can try:
# create a function using the sample function, which selects one value by chance
foo <- function(G, A, data){
sample(which(data$G == G & data$A == A), 1)
}
foo("M", 3, x.f)
3
To check that both rows are drawn about equally often, run the function 1000 times, for instance in a loop:
res <- NULL
for(i in 1:1000){
res[i] <- foo("M", 3, x.f)
}
hist(res)
The two rows appear to be drawn about equally often.
Please try this: which(((x.f$G == "M") & (x.f$A == 3)))
Or maybe this :
row.names(subset(x.f, x.f$G == "M" & x.f$A == 3))
[1] "3" "6"
Either of the other answers will give you a list of rows meeting your condition, but will not select one row randomly. For a full answer:
sample(which(x.f$G == "M" & x.f$A == 3),1)
or
sample(row.names(subset(x.f, x.f$G == "M" & x.f$A == 3)),1)
or
sample(row.names(x.f[x.f$G=="M" & x.f$A==3,]),1)
All of these will work. There are probably two or three other ways to generate a list of row indices or names matching a set of criteria.
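One caveat with sample(which(...), 1): when exactly one row matches, sample receives a single number n and draws from 1:n instead of returning n. The sketch below sidesteps this by sampling a position within the index vector. The helper name pick_row and the small single-match data frame are assumptions for illustration.

```r
x.f <- data.frame(
  G = c("M","M","M","M","M","M","F","F","F","F","F","F"),
  A = c("1","2","3","1","2","3","1","2","3","1","2","3"),
  E = c("W","W","W","B","B","B","W","W","W","B","B","B"))

# hypothetical helper: one random row number among the matches, NA if nothing matches
pick_row <- function(G, A, data) {
  idx <- which(data$G == G & data$A == A)
  if (length(idx) == 0) return(NA_integer_)
  # sample a position in idx, so a single match is returned as-is
  idx[sample.int(length(idx), 1)]
}

pick_row("M", "3", x.f)

# single-match case: only row 3 qualifies, so 3 is always returned
single <- data.frame(G = c("M", "F", "F"), A = c("1", "1", "2"))
pick_row("F", "2", single)
```

The returned number is always a position in the original data frame, as the question requires.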

Selectively replacing columns in R with their delta values

I've got data being read into a data frame in R, by column. Some of the columns will increase in value; for those columns only, I want to replace each value (n) with its difference from the previous value in that column. For example, looking at an individual column, I want
c(1,2,5,7,8)
to be replaced by
c(1,3,2,1)
which are the differences between successive elements.
However, it's getting really late in the day, and I think my brain has just stopped working. Here's my code at present
col1 <- c(1,2,3,4,NA,2,3,1) # This column rises and falls, so we want to ignore it
col2 <- c(1,2,3,5,NA,5,6,7) # Note: this column always rises in value, so we want to replace it with deltas
col3 <- c(5,4,6,7,NA,9,3,5) # This column rises and falls, so we want to ignore it
d <- cbind(col1, col2, col3)
d
fix_data <- function(data) {
# Iterate through each column...
for (column in data[,1:dim(data)[2]]) {
lastvalue <- 0
# Now walk through each value in the column,
# checking to see if the column consistently rises in value
for (value in column) {
if (is.na(value) == FALSE) { # Need to ignore NAs
if (value >= lastvalue) {
alwaysIncrementing <- TRUE
} else {
alwaysIncrementing <- FALSE
break
}
}
}
if (alwaysIncrementing) {
print(paste("Column", column, "always increments"))
}
# If a column is always incrementing, alwaysIncrementing will now be TRUE
# In this case, I want to replace each element in the column with the delta between successive
# elements. The size of the column shrinks by 1 in doing this, so just prepend a copy of
# the 1st element to the start of the list to ensure the column length remains the same
if (alwaysIncrementing) {
print(paste("This is an incrementing column:", colnames(column)))
column <- c(column[1], diff(column, lag=1))
}
}
data
}
fix_data(d)
d
If you copy/paste this code into RGui, you'll see that it doesn't do anything to the supplied data frame.
Besides losing my mind, what am I doing wrong??
Thanks in advance
Without addressing the code in any detail: you're assigning values to column, which is a local variable within the loop (i.e. there is no relationship between column and data in that context). You need to assign those values to the corresponding column of data.
Also, data is local to your function, so you need to assign the function's return value back to d after running it.
Incidentally, you can use diff to check whether a column ever decreases, rather than looping over every value:
idx <- apply(d, 2, function(x) !any(diff(x[!is.na(x)]) < 0))
d[,idx] <- blah
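Putting those points together, a corrected fix_data might look like the sketch below. It loops over column indices so the assignment goes into data itself, and the caller stores the returned matrix. Keeping the first element in front of the differences follows the comments in the question's code; treating "increasing" as "never decreasing" is an assumption.

```r
d <- cbind(col1 = c(1, 2, 3, 4, NA, 2, 3, 1),
           col2 = c(1, 2, 3, 5, NA, 5, 6, 7),
           col3 = c(5, 4, 6, 7, NA, 9, 3, 5))

fix_data <- function(data) {
  for (j in seq_len(ncol(data))) {
    x <- data[, j]
    # the column never decreases if no difference between successive non-NA values is negative
    if (!any(diff(x[!is.na(x)]) < 0)) {
      # write the deltas back into data, keeping the first value so the length is unchanged
      data[, j] <- c(x[1], diff(x))
    }
  }
  data
}

# the result has to be assigned; the function cannot modify d in place
d <- fix_data(d)
d
```

Only col2 is replaced. The differences on either side of its NA are themselves NA, because diff is taken over the raw column.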
diff calculates the difference between consecutive values in a vector. You can apply it to each column in a dataframe using, e.g.
dfr <- data.frame(x = c(1,2,5,7,8), y = (1:5)^2)
as.data.frame(lapply(dfr, diff))
x y
1 1 3
2 3 5
3 2 7
4 1 9
EDIT: I just noticed a few more things. You are using a matrix, not a data frame (as you stated in the question). For your matrix 'd', you can use
d_diff <- apply(d, 2, diff)
#Find columns that are (strictly) increasing
incr <- apply(d_diff, 2, function(x) all(x > 0, na.rm=TRUE))
#Replace values in the appropriate columns
d[2:nrow(d),incr] <- d_diff[,incr]
