I have a data.frame as given below. I want to get the index/row number where (b-a)>8, but I want the comparison to start after row 7, not from row 1. I have written code that returns the row number where b-a>8 is satisfied, but it checks from row 1. How do I check it from row 7 onwards?
a <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)
b <- c(2,12,4,5,2,5,8,5,7,19,6,7,4,23,1,2)
df <- data.frame(a,b)
which((df$b-df$a)>8)[1]
Desired output: row number 10, not 2.
One option is to apply the offset when subsetting both vectors:
which((df$b[7:nrow(df)] - df$a[7:nrow(df)]) > 8)
#[1] 4 8
These indices are relative to the subset, so add 6 to recover the original row numbers (10 and 14).
This is just index arithmetic: drop the first seven rows, then add 7 back to the index.
(which(with(df[-(1:7),],b-a>8))+7)[1]
[1] 10
(a<-which((df$b-df$a)>8))[a>7][1]
[1] 10
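Another way to express the same idea (a small sketch, equivalent to adding the offset afterwards) is to build the row restriction directly into the logical test:
which((df$b - df$a) > 8 & seq_along(df$a) > 7)[1]
[1] 10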
So I have a data frame with baskets of products from purchases of individuals. A row stands for one individual's basket of products. I want to remove all the rows (baskets) that contain a product (expressed as an integer) listed in a vector named products.to.delete. Here is a small image of what the data set looks like.
Next to that, I have a vector containing a large number of values that must be deleted. I would like to delete all the rows that contain a value from this vector.
Here is some code to make it reproducible:
dataframe <- as.data.frame( matrix(data = sample(10000,1000,replace = TRUE),20,50))
products.to.delete <- sample(10000,200,replace = FALSE)
Thank you in advance for helping me out!
If your data is data, and your vector of target values is vals, you could do this:
data[apply(data,1,\(r) !any(r %in% vals)),]
That is, within each row of data (i.e. apply(data, 1, ...)), check whether any of the values appear in vals. Negate the result with ! to get a logical vector that selects the remaining rows.
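Applied to the objects from the question (a sketch reusing the question's dataframe and products.to.delete; the output is not shown since the data are random), that would be:
# keep only the baskets (rows) that contain none of the products to delete
dataframe[apply(dataframe, 1, \(r) !any(r %in% products.to.delete)), ]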
For your next questions, please create reproducible examples such as the one below.
What you're after is called filtering, and it can be done in base R as follows.
First, create an object, called for example myfilter, which is a logical vector with the same length as the number of rows in your data.frame.
mydat <- data.frame("col1"=1:5, "col2"=letters[1:5])
col1 col2
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
myfilter <- mydat$col2 %in% c("a", "c")
[1] TRUE FALSE TRUE FALSE FALSE
mydat[myfilter,]
col1 col2
1 1 a
3 3 c
Then simply include this object inside the brackets [ ] when subsetting, as above. R will keep the rows where the values are TRUE.
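For the basket problem in the question, the same %in% idea works once the filter is built per row; one possible sketch (assuming the dataframe and products.to.delete objects defined in the question):
# for each column, flag cells that match a product to delete,
# then keep only rows with zero flagged cells
myfilter <- rowSums(sapply(dataframe, `%in%`, products.to.delete)) == 0
dataframe[myfilter, ]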
I'm trying to extract a set of genes (row names) from my large data set based on another data matrix that contains a list of my genes of interest. I've read that I should use filtering and the %in% command, but I am unsure how to write it properly.
example:
my large database:
Gene  Week1  Week2  Week3
A     20     14     5
B     5      10     15
C     2      4      6
D     20     18     19
my small database:
Gene
A
C
D
And I want my result to be:
Gene  Week1  Week2  Week3
A     20     14     5
C     2      4      6
D     20     18     19
Could anybody please help out? I'd really appreciate it and my apologies for the rather simple question :)
Using logical row indexes:
large_database[large_database$Gene %in% unique(small_data_base$Gene), ]
Explanation:
large_database$Gene %in% unique(small_data_base$Gene)
This checks, for each entry (i.e. row) in large_database$Gene, whether it appears in unique(small_data_base$Gene), i.e. the list of unique values in the Gene column of small_data_base, and returns a logical vector (a vector of TRUE and FALSE).
We can then use this vector as a row 'index' to select only the rows where the vector is TRUE (i.e. where the value of large_database$Gene is in unique(small_data_base$Gene)).
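As a reproducible sketch, recreating the example tables from the question under the object names used above:
large_database <- data.frame(Gene  = c("A", "B", "C", "D"),
                             Week1 = c(20, 5, 2, 20),
                             Week2 = c(14, 10, 4, 18),
                             Week3 = c(5, 15, 6, 19))
small_data_base <- data.frame(Gene = c("A", "C", "D"))
large_database[large_database$Gene %in% unique(small_data_base$Gene), ]
#   Gene Week1 Week2 Week3
# 1    A    20    14     5
# 3    C     2     4     6
# 4    D    20    18    19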
In a dataframe of 4 columns, I'm looking for an elegant way to get 3 lists that contain the names from column 1, depending on whether the row maximum for that name lies in column 2, 3, or 4.
the first column contains parameter names,
column 2 the Shapiro test outcome on the raw data of parameter x,
column 3 the Shapiro test outcome of the log10-transformed data for parameter x,
column 4 the Shapiro test outcome of a custom transformation given by the user for parameter x.
if this is the data:
Parameter xval xlog10val xcustomval
1 FWS.Range 0.62233371 0.9741614 0.9619065
2 FL.Red.Range 0.48195980 0.9855781 0.9643206
3 FL.Orange.Range 0.43338087 0.9727243 0.8239867
4 FL.Yellow.Range 0.53554943 0.9022795 0.9223407
5 FL.Red.Gradient 0.35194524 0.9905047 0.5718224
6 SWS.Range 0.46932823 0.9487955 0.9825318
7 SWS.Length 0.02927791 0.4565962 0.7309313
8 FWS.Fill.factor 0.93764311 0.8039806 0.0000000
9 FL.Red.Total 0.22437754 0.9655873 0.9923307
QUESTION: how do I get a list of all parameter names where xlog10val is the highest of the three columns (xval, xlog10val, xcustomval)?
Detailed explanation (feel free to skip):
List 1, the rows where xval is the highest value, should look like this: 'FWS.Fill.factor', since that is the only row where xval has the highest score.
list 2 is the list of all rows where xlog10val is the maximum value, and thus should contain the names of parameters where xlog10val is the maximum of that row:
'FWS.Range', 'FL.Red.Range', 'FL.Orange.Range',
'FL.Red.Gradient', 'FWS.Fill.factor'
and list 3 the rest of the names
I tried something like
df$Parameter[which(df$xval == max(df[ ,2:4]))]
but this gives integer(0) results.
EDIT
to clarify:
Let's start by looking at column 2 (xval).
PER row I need to test whether xval is the maximum of the 3 columns: xval, xlog10val, xcustomval.
If this is the case, add the parameter in THAT row to the xval_is_the_max_of_3_columns list.
Then we do the same PER row for xlog10val: IF xlog10val in row i is the max of columns 2:4, add the name of that ROW to the xlog10val_is_the_max_of_3_columns list.
To make the DF:
df <- data.frame(Parameter = c('FWS.Range', 'FL.Red.Range', 'FL.Orange.Range', 'FL.Yellow.Range', 'FL.Red.Gradient','SWS.Range','SWS.Length','FWS.Fill.factor','FL.Red.Total'),
xval = c(0.622333705577588,0.481959800402278,0.433380866119736,0.535549430820635,0.351945244290616,0.469328232931424,0.0292779051823701,0.93764311477813,0.224377540663707),
xlog10val = c( 0.974161367853916,0.985578135386898,0.97272429360688,0.902279501804112,0.990504657326703,0.94879549470406,0.45659620937997,0.803980592920426,0.965587334461157),
xcustomval = c(0.961906534164457,0.964320569400919,0.823986745004031,0.922340716468745,0.571822393107348,0.982531798077881,0.73093132928955,0,0.992330722386105))
We can use max.col to get the index of the maximum value for each row, and with that we split the 'Parameter' column into groups.
i1 <- max.col(df[-1], 'first')
split(df$Parameter, i1)
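For the df built above, the result should look roughly like this (group 1 corresponds to xval, group 2 to xlog10val, group 3 to xcustomval):
$`1`
[1] "FWS.Fill.factor"

$`2`
[1] "FWS.Range"       "FL.Red.Range"    "FL.Orange.Range" "FL.Red.Gradient"

$`3`
[1] "FL.Yellow.Range" "SWS.Range"       "SWS.Length"      "FL.Red.Total"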
EDIT: Based on the discussion with #Mark
I'm not sure exactly how you're selecting the parameters for lists two and three; however, you can try something like this as well:
df$Parameter <- as.character(df$Parameter)
par.xval.max <- df[which.max(df$xval), "Parameter"]
par.col3.gt.max <- df[df$xlog10val > max(df$xval), "Parameter"]
par.rem <- df$Parameter[! df$Parameter %in% c(par.xval.max, par.col3.gt.max)]
In this case, the values from column three are greater than max(df$xval), and the remaining parameters are taken by negative selection using %in%.
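With the df from the question, those three objects should come out roughly as:
par.xval.max
[1] "FWS.Fill.factor"
par.col3.gt.max
[1] "FWS.Range"       "FL.Red.Range"    "FL.Orange.Range" "FL.Red.Gradient" "SWS.Range"       "FL.Red.Total"
par.rem
[1] "FL.Yellow.Range" "SWS.Length"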
I don't know how to explain this briefly, but I'll try my best:
I have the following example data:
Data<-data.frame(A=c(1,2,3,5,8,9,10),B=c(5.3,9.2,5,8,10,9.5,4),C=c(1:7))
and an index
Ind<-data.frame(I=c(5,6,2,4,1,3,7))
The values in Ind correspond to the C column in Data. Now I want to start with the first Ind value and find the corresponding row in the Data data.frame (via column C). From that row I want to go up and down and find values in column A that are within a tolerance range of 1. I want to write these values into a result dataframe, add a group id column, and delete them from the dataframe Data (where I found them). Then I start with the next entry in the index dataframe Ind, and so on until the data.frame Data is empty. I know how to match my Ind with column C of my Data, and how to write, delete, and do the other steps in a for loop, but I don't know the main point, which is my question here:
When I have found my row in the Data, how can I look up matching values of column A within the tolerance range above and below that entry, to get my group id?
what I want to get is this result:
A   B    C  Group
1   5.3  1  2
2   9.2  2  2
3   5    3  2
5   8    4  3
8   10   5  1
9   9.5  6  1
10  4    7  4
Maybe somebody could help me with the critical point in my question, or even show how to solve this issue in a fast way.
Many thanks!
Generally: avoid deleting or growing a data frame row by row inside a loop. R's memory management means that every time you add or delete a row, another copy of the data frame is made. Garbage collection will eventually discard the "old" copies of the data frame, but garbage can quickly accumulate and reduce performance. Instead, add a logical column to the Data data frame, and set "extracted" rows to TRUE. So like this:
Data$extracted <- rep(FALSE,nrow(Data))
As for your problem: I get a different set of grouping numbers, but the groups are identical.
There might be a more elegant way to do this, but this will get it done.
# store results in a separate list
res <- list()
group.counter <- 1

# loop until they're all done.
for (idx in Ind$I) {
  # skip this iteration if idx is NA.
  if (is.na(idx)) {
    next
  }
  # dat.rows is a logical vector which shows the rows where
  # "A" meets the tolerance requirement.
  # specify the tolerance here.
  mytol <- 1
  # the next statement only works for integer compare.
  # also not covered: what if multiple values of C
  # match idx? do we loop over each corresponding value of A,
  # i.e. loop over each value of 'target'?
  target <- Data$A[Data$C == idx]
  # use the magic of vectorized logical compare.
  dat.rows <-
    ((Data$A - target) >= -mytol) &
    ((Data$A - target) <=  mytol) &
    (!Data$extracted)
  # if dat.rows is all false, then nothing met the criteria;
  # skip the rest of the loop.
  if (!any(dat.rows)) {
    next
  }
  # copy the rows to the result list.
  res[[length(res) + 1]] <- data.frame(
    A = Data[dat.rows, "A"],
    B = Data[dat.rows, "B"],
    C = Data[dat.rows, "C"],
    Group = group.counter  # this value will be recycled to match the length of A, B, C.
  )
  # flag the extraction.
  Data$extracted[dat.rows] <- TRUE
  # increment the group counter.
  group.counter <- group.counter + 1
}

# now make a data.frame from the results.
# this is the last step in how we avoid
# "growing" a data.frame inside a loop.
resData <- do.call(rbind, res)
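With the example Data and Ind from the question, resData should come out something like this (the group numbers differ from the desired output, but the groupings are the same):
   A    B C Group
1  8 10.0 5     1
2  9  9.5 6     1
3 10  4.0 7     2
4  1  5.3 1     3
5  2  9.2 2     3
6  3  5.0 3     3
7  5  8.0 4     4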
Suppose I have a matrix called M,
"Date" "X" "Y"
1991 T 10
1992 T 5
1993 F 2
1994 F 1
1995 T 7
where Date is a character value, X is Boolean, and Y is numeric. Also, assume that the total number of rows is 50, each filled with values like those shown above.
My initial selection criterion is for the second column to be TRUE. Thus,
initial_row<-M[M[,2]==T,]
I am looking for a way to extract 10 rows (or any constant number) following the initial row, regardless of their values in any column. Basically, I'm trying to mine all the rows that follow each initial extraction, then move on to the next row that meets the initial selection criterion.
This code should extract the rows following the initial rows and put the batches of rows into a list:
const <- 10
lapply(which(M$X), function(n){
  indices <- n + 1:const                  # 0:const to include initial row
  indices <- indices[indices <= nrow(M)]  # exclude out of bounds values
  M[indices, ]
})
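A quick way to try this out with made-up data (a sketch; note that M$X requires M to be a data.frame rather than a matrix, so it is built as one here):
set.seed(1)
M <- data.frame(Date = as.character(1991:2040),
                X    = sample(c(TRUE, FALSE), 50, replace = TRUE),
                Y    = sample(1:20, 50, replace = TRUE))
const <- 10
batches <- lapply(which(M$X), function(n){
  indices <- n + 1:const
  indices <- indices[indices <= nrow(M)]
  M[indices, ]
})
length(batches)   # one list element per TRUE row in X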