I have a dataframe with a fixed number of non-numeric columns and an arbitrary number of numeric columns, like this:
s <- data.frame(A=c("a","b","c"),B=c(1,2,3), C=c(24,15,2))
I also have two vectors, each with length equal to the number of numeric columns, defining the min and max values for every column.
min <- c(2,10)
max <- c(3,30)
I want to subset the dataframe to all the rows that have column B between 2 and 3 and column C between 10 and 30. Like this:
s <- s[s$B >= min[1] & s$B <= max[1] & s$C >= min[2] & s$C <= max[2],]
To subset the dataframe for an arbitrary number of numeric columns I currently use a for statement:
for(i in 1:length(min))
  s <- s[s[,i+1] >= min[i] & s[,i+1] <= max[i],]
This does the job, but it is very slow. I have around 20 columns and 150K rows in the data frame.
Is there a better way?
Generically, like this?
s <- data.frame(A=sample(letters,100,T),B=sample(1:4,100,T), C=sample(2:40,100,T))
# larger dataframe
min <- c(2,10)
max <- c(3,30)
filt <- rowSums(
  sapply(1:length(min), function(x) {   # for each item in min (or max)
    s[, x+1] >= min[x] & s[, x+1] <= max[x]  # create a T/F vector
  })
) == length(min)  # TRUE for rows where all criteria are met
s[filt,]  # this applies your filter to s
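If building the full T/F matrix with sapply ever becomes a concern, a sketch of an equivalent approach that ANDs the per-column comparisons together with Reduce (same result, one logical vector per column instead of a matrix):

```r
s <- data.frame(A = sample(letters, 100, TRUE),
                B = sample(1:4, 100, TRUE),
                C = sample(2:40, 100, TRUE))
min <- c(2, 10)
max <- c(3, 30)

# Map() yields one logical vector per numeric column;
# Reduce(`&`, ...) combines them elementwise into a single filter
filt <- Reduce(`&`, Map(function(i) s[, i + 1] >= min[i] & s[, i + 1] <= max[i],
                        seq_along(min)))
s[filt, ]
```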
I am looking for a function that iterates through the rows of a given column ("pos" for position, ascending) in a dataframe, and only keeps rows whose values are at least, say, 10 apart, starting with the first row. Thus it would start with the first row (and store it), then carry on until it finds a row with a value at least 10 higher than the first, store that row, and then start again from that value, looking for the next row more than 10 away.
So far I have an R for loop that successfully finds adjacent rows at least X apart, but it can neither look further than one row down, nor stop once it has found a matching row and start again from there.
Here is the function I have:
# example data frame
df <- data.frame(x=c(1:1000), pos=sort(sample(1:10000, 1000)))
# prep function (this only checks row above)
library(dplyr)
pos.apart.subset <- function(df, pos.diff) {
  # create new objects to store output
  new.df <- list()
  new.df1 <- data.frame()
  # iterate through each row of df
  for (i in 1:nrow(df)) {
    # keep the row if the next value is at least pos.diff higher,
    # if the values are not ascending, or if it is the first row
    if (isTRUE(df$pos[i+1] >= df$pos[i] + pos.diff | df$pos[i+1] < df$pos[i] | i == 1)) {
      # add rows that meet the conditions to the list
      new.df[[i]] <- df[i,]
    }
  }
  # bind all rows that met the conditions
  new.df1 <- bind_rows(new.df)
  return(new.df1)
}
# test run for pos column adjacent values to be at least 10 apart
df1 <- pos.apart.subset(df, 10); head(df1)
Happy to do this in awk or any other language. Many thanks.
It seems I misunderstood the question earlier; since we don't want the difference between consecutive rows, you can try:
nrows <- 1
previous_match <- 1
for(i in 2:nrow(df)) {
  if(df$pos[i] - df$pos[previous_match] > 10) {
    nrows <- c(nrows, i)
    previous_match <- i
  }
}
and then subset the selected rows:
df[nrows, ]
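The same greedy scan can also be written without an explicit loop; a sketch using Reduce(..., accumulate = TRUE), assuming pos is sorted ascending with distinct values, as in the example data:

```r
df <- data.frame(x = 1:1000, pos = sort(sample(1:10000, 1000)))

# picks[i] carries the most recently kept "pos" value forward;
# a row is selected exactly where picks takes a new value
picks <- Reduce(function(last, current) if (current - last > 10) current else last,
                df$pos, accumulate = TRUE)
df[!duplicated(picks), ]
```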
Earlier answer
We can use diff to get the difference between consecutive rows and select the rows which have a difference greater than 10.
head(subset(df, c(TRUE, diff(pos) > 10)))
# x pos
#1 1 1
#2 2 31
#6 6 71
#9 9 134
#10 10 151
#13 13 185
The leading TRUE selects the first row by default.
In dplyr, we can use lag to get the value from the previous row:
library(dplyr)
df %>% filter(pos - lag(pos, default = -Inf) > 10)
Consider the data.table, exampleDT,
library(data.table)
set.seed(7)
exampleDT = data.table(colA = rnorm(10, 15, 5),
                       colB = runif(10, 100, 150),
                       targetA = rnorm(10, 12, 2),
                       targetB = rnorm(10, 8, 4))
If I want to calculate the mean of all elements in column targetA, for example, that are below some threshold -- say, 10 -- I can do the following:
examp_threshold = 10
exampleDT[targetA<examp_threshold,mean(targetA)]
# [1] 9.224007566814299
And if I want to calculate the mean of all elements in columns targetA and targetB, for example, I can do the following:
target_cols = names(exampleDT)[which(names(exampleDT) %like% "target")]
exampleDT[,lapply(.SD,mean),.SDcols=target_cols]
# targetA targetB
# 1: 12.60101574551183 7.585007905896557
But I don't know how to combine the two; that is, to calculate the mean of all elements in all columns containing a specified string ("target", in this case) that are below some specified threshold (10, here). This was my first guess, but it was unsuccessful:
exampleDT[.SD<examp_threshold,lapply(.SD,mean),.SDcols=target_cols]
#Empty data.table (0 rows) of 2 cols: targetA,targetB
You need to subset in the j expression, like so:
exampleDT[, lapply(.SD, function(x) mean(x[x < examp_threshold])), .SDcols = target_cols]
# targetA targetB
#1: 9.224008 6.66624
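If your data.table version supports it (patterns() inside .SDcols was added in data.table 1.12.0), the column selection can also be done inline, without building target_cols first:

```r
library(data.table)
set.seed(7)
exampleDT <- data.table(colA = rnorm(10, 15, 5),
                        colB = runif(10, 100, 150),
                        targetA = rnorm(10, 12, 2),
                        targetB = rnorm(10, 8, 4))
examp_threshold <- 10

# patterns() selects the .SD columns by regex, replacing the %like% step
exampleDT[, lapply(.SD, function(x) mean(x[x < examp_threshold])),
          .SDcols = patterns("target")]
#    targetA targetB
# 1: 9.224008 6.66624
```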
In my dataframe I would like to select a row based on some logic, and then return a dataframe with the selected row PLUS the next 'N' rows.
So, I have this: (a generic example)
workingRows <- myData[which(myData$Column1 >= myData$Column2 & myData$Column3 <= myData$Column4), ]
Which returns me the correct "starting values". How can I get the "next" 5 values based on each of the starting values?
We can use rep to get the next 5 row indices for each starting value, sort them, wrap the result in unique in case overlapping windows produce duplicates, and subset 'myData':
i1 <- which(myData$Column1 >= myData$Column2 & myData$Column3 <= myData$Column4)
myData[unique(sort(i1 + rep(0:5, each = length(i1)))),]
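An equivalent way to expand the starting indices is outer(); sketched here with made-up columns, since the original data isn't shown. The extra guard also protects against windows that run past the last row:

```r
set.seed(1)
myData <- data.frame(Column1 = rnorm(50), Column2 = rnorm(50),
                     Column3 = rnorm(50), Column4 = rnorm(50))
i1 <- which(myData$Column1 >= myData$Column2 & myData$Column3 <= myData$Column4)

# outer() builds the full start-by-offset grid in one call
idx <- unique(sort(c(outer(i1, 0:5, `+`))))
idx <- idx[idx <= nrow(myData)]  # drop indices past the last row
myData[idx, ]
```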
I have a data frame in R consisting in 5 columns and 30000 rows. One column, called "pos", has this kind of values sorted in ascending order:
pos
785989
888659
918573
949608
990417
I would like to remove all rows where the difference between a value "x" in "pos" (in row "n") and the anterior value in row "n-1", or the difference between the posterior value in row "n+1" and "x", is greater than, let's say, 100000. E.g.: in the input example, 888659 - 785989 = 102670 > 100000, therefore the rows containing the values 888659 and 785989 should be removed.
Thanks for your help!
One solution is to create a user function that takes the diff of a vector and checks a conditional gap provided by the user:
diff_set <- function(x, gap) {
  ind <- c(FALSE, diff(x) > gap)
  if(sum(ind) == 0) return(!ind)
  subst <- x[-unique(c(which(ind), which(ind) - 1))]
  x %in% subst
}
df1[diff_set(df1$x, 1e5),]
x y
3 918573 C
4 949608 D
5 990417 E
Data
x <- scan(text="785989
888659
918573
949608
990417")
df1 <- data.frame(x, y=LETTERS[1:5])
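The same keep/drop rule can also be expressed directly on the two neighbour differences; a sketch assuming x is sorted ascending, as in the question:

```r
x <- c(785989, 888659, 918573, 949608, 990417)
gap <- 1e5

d <- diff(x) <= gap              # TRUE where the gap to the next value is acceptable
keep <- c(TRUE, d) & c(d, TRUE)  # both the anterior and posterior gap must pass
x[keep]
# [1] 918573 949608 990417
```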
I have a dataset with 40 columns and 100,000 rows. Because the number of rows is too big, I want to delete some of them: rows 10,000-20,000, 30,000-40,000 and 60,000-70,000, so that I end up with a dataset of 40 columns and 70,000 rows. The first column is an ID (called ItemID) that starts at 1 and ends at 100,000. Can someone please help me?
I tried this to delete the rows from 10,000 to 20,000, but it's not working (let's say the data set is called "Data"):
Data <- Data[Data$ItemID>10000 && Data$ItemID<20000]
Several ways of doing this. Does something like this suit your needs?
dat <- data.frame(ItemID=1:100, x=rnorm(100))
# via row numbers
ind <- c(10:20,30:40,60:70)
dat <- dat[-ind,]
# via logical vector
ind <- with(dat, { (ItemID >= 10 & ItemID <= 20) |
                   (ItemID >= 30 & ItemID <= 40) |
                   (ItemID >= 60 & ItemID <= 70) })
dat2 <- dat[!ind,]
To take it to the scale of your data set, just adjust ind according to the size of your data (multiplying the indices might do).
I think you should be able to do
data <- data[-(10000:20000),]
and then remove the other rows in a similar manner.
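Scaled to the actual ranges, the logical-vector approach looks like this (a sketch on simulated data; the name Data and the inclusive endpoints are taken from the question):

```r
# simulated stand-in for the real data; the extra 39 columns don't affect the index logic
Data <- data.frame(ItemID = 1:100000, x = rnorm(100000))

drop <- with(Data, (ItemID >= 10000 & ItemID <= 20000) |
                   (ItemID >= 30000 & ItemID <= 40000) |
                   (ItemID >= 60000 & ItemID <= 70000))
Data2 <- Data[!drop, ]
nrow(Data2)
# [1] 69997
```

Note that with inclusive endpoints each range drops 10,001 rows, leaving 69,997 rows rather than 70,000; use a strict inequality on one side of each range (e.g. ItemID > 10000) if you need exactly 70,000.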