How to delete row n and n+1 in a dataframe? - r

I have several dataframes in which I want to delete each row that matches a certain string. I used the following code to do it:
df[!(regexpr("abc", df$V4) ==1),]
How can I also delete the row that follows, e.g. if I delete row n as matched by the code above, how can I additionally delete row n+1?
My first try was to simply find out the indices of the desired rows, but that won't work, as I need to delete rows in different dataframes which are of different lengths. So the indices vary.
Thanks!

I would suggest pulling out the logical vector and manipulating it directly. Suppose we have the vector:
x = c(5,0,1, 4, 3)
and we want to do:
x[x > 3]
First, note that:
R> (s_n = x>3)
[1] TRUE FALSE FALSE TRUE FALSE
So
R> (s_n1 = as.logical(s_n + c(F, s_n[1:(length(s_n)-1)])))
[1] TRUE TRUE FALSE TRUE TRUE
Hence,
x[s_n1]
gives you what you want.
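For concreteness, the selection would then be:
R> x[s_n1]
[1] 5 0 4 3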
In your particular example, s_n marks the rows you want to keep, so a row should survive only if both it and the row before it are keepers. Something like:
s_n = !(regexpr("abc", df$V4) == 1)
s_n1 = s_n & c(TRUE, s_n[1:(length(s_n) - 1)])
df[s_n1, ]
should work.
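As a minimal sketch on a made-up data frame (the column values below are invented purely to illustrate the pattern):
df <- data.frame(id = 1:4,
                 V4 = c("xyz", "abc1", "next", "xyz"),
                 stringsAsFactors = FALSE)
s_n  <- !(regexpr("abc", df$V4) == 1)            # FALSE where "abc" starts the string (row 2)
s_n1 <- s_n & c(TRUE, s_n[1:(length(s_n) - 1)])  # also FALSE for the row that follows (row 3)
df[s_n1, ]                                       # rows 2 and 3 are dropped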

Use which() on your logical expression and then you can just add 1 to the result.
sel <- which(grepl("abc", df$V4))
sel <- c(sel, sel+1)
df[-sel,]
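One caveat: if nothing matches, sel is integer(0), and df[-integer(0), ] returns zero rows rather than the whole data frame, so a guard may be worth adding (a small sketch, assuming the same df and pattern):
sel <- which(grepl("abc", df$V4))
sel <- c(sel, sel + 1)
if (length(sel) > 0) df <- df[-sel, ]   # only drop rows when there is something to drop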

idx <- which(regexpr("abc", df$V4) == 1)
df[-c(idx, idx + 1), ]

Related

cleaning up if/else r function for previous row value reference

I have a function that is currently working, but I think there may be a better way for it to work without having to manipulate the data so much beforehand. Basically, I am returning a simple TRUE or FALSE indicating whether a value in my column is greater than both of the two values before it and both of the two values after it.
y1 #a single vector column of values
y2 <- c() # blank vector to hold the results
for (i in 3:length(y1)){ # for every number starting at 3 (because for 2 and 1 you can't go back two)
  if(y1[i] > y1[i-1] && y1[i] > y1[i-2] && y1[i] > y1[i+1] && y1[i] > y1[i+2]){ # if the number is greater than the 2 before and the 2 after...
    y2[i] <- 'TRUE' # if it is true, write TRUE; y2[i] saves the result in the blank vector
  } else {
    y2[i] <- 'FALSE' # opposite here
  }
  print(y2[i])
}
This works okay, but as you see I have to start at 3 in my for loop, because otherwise I get an error: for the first and second values, as well as the last two, the [i-1], [i-2], [i+1] and [i+2] references don't exist. If I do for (i in 1:length(y1)) it will not work, and I also have to pad the dataset with two zeros so that the last TRUE/FALSE values can be "computed" without an error.
Is there any way to clean up the actual function so that I don't have to manipulate the data beforehand? Essentially have the function give me a null just for the first two and last two values in my data?
Another approach is using lag and lead from dplyr:
library(dplyr)
v2 <- (lag(v1, 2) < v1 & lag(v1, 1) < v1 & lead(v1, 1) < v1 & lead(v1, 2) < v1)
Output & data:
v1 <- c(1,2,3,2,1,1,1,1,1,2,1,1)
v2
[1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
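If you also want NA (rather than FALSE) for the first two and last two positions, as the question asked, one option is simply to blank them out afterwards (assuming the v2 computed above):
v2[c(1:2, (length(v2) - 1):length(v2))] <- NA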
My first quick tip would be to look into the lead and lag functions of dplyr.
See, for example, the dplyr documentation or Hadley Wickham's R for Data Science.
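A tiny illustration of what those two helpers do (the values here are made up):
library(dplyr)
x <- c(10, 20, 30)
lag(x)   # NA 10 20  (each element shifted one position later)
lead(x)  # 20 30 NA  (each element shifted one position earlier)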
Hope this helps!

ifelse function on a vector

I am using the ifelse function in order to obtain either the vector itself, if all of its values are NA, or a vector with all the values that are not equal to "NA_NA". In my example, I would like to obtain this result
[1] "14_mter" "78_ONHY"
but I am obtaining this
[1] "14_mter"
my example:
vect=c("NA_NA", "14_mter", "78_ONHY")
out=ifelse(all(is.na(vect)), vect, vect[which(vect!="NA_NA")])
What is wrong in this function ?
ifelse is vectorized and its result is as long as the test argument. all(is.na(vect)) is always just length one, hence the result. A regular if/else clause is fine here.
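A quick illustration of that length rule:
ifelse(TRUE, 1:3, 4:6)
#> [1] 1   # the result has the length of the test (1), not of yes/no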
vect <- c("NA_NA", "14_mter", "78_ONHY")
if (all(is.na(vect))) {
out <- vect
} else {
out <- vect[vect != "NA_NA"]
}
out
#> [1] "14_mter" "78_ONHY"
Additional note: there is no need for the which() here.
The ifelse help file, referring to its three arguments test, yes and no, says:
ifelse returns a value with the same shape as test which is filled
with elements selected from either yes or no depending on whether the
element of test is TRUE or FALSE.
So if the test has a length of 1, which is the case for the code in the question, then the result will also have length 1. Instead, try one of these.
1) Use if instead of ifelse. if returns the value of the chosen leg so just assign that to out.
out <- if (all(is.na(vect))) vect else vect[which(vect != "NA_NA")]
2) The collapse package has an allNA function so a variation on (1) is:
library(collapse)
out <- if (allNA(vect)) vect else vect[which(vect != "NA_NA")]
3) Although not recommended, if you really wanted to use ifelse, it could be done by wrapping each leg in list(...) so that the condition and the two legs all have the same length, i.e. 1.
out <- ifelse(all(is.na(vect)), list(vect), list(vect[which(vect != "NA_NA")])) |>
unlist()
If the NA value is always the string NA_NA, this works:
grep("NA_NA", vect, value = TRUE, invert = TRUE)
[1] "14_mter" "78_ONHY"
The pattern matches the NA_NA value, and the invert = TRUE argument negates the match(es), producing the unmatched values.
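An equivalent spelling with grepl, in case a logical mask is preferred (same vect as below):
vect[!grepl("NA_NA", vect)]
#> [1] "14_mter" "78_ONHY"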
Data:
vect=c("NA_NA", "14_mter", "78_ONHY")

Searching strings to ignore multiple matches

I have a dataframe with columns names that look like this:
d=c("Q.40a-some Text", "Q.40b-some Text", "Q.44a-some Text", "Q.44b-some Text" "Q.44c-some Text" "Q.44d-some Text" ,"Q.4a-some Text", "Q.4b-some Text")
I would like to identify the columns which begin with Q.4 and ignore the Q.40, Q.44.
Identifying Q.44 or Q.40, for example, is easy: I pass "^Q.44" or "^Q.40" as input to my function. But this does not work for identifying Q.4, simply because all the names begin with Q.4. Can someone help me with this?
UPDATE
The result I want to pass it, to my function that takes inputs as follows:
multichoice <- function(data, question.prefix){
  index <- grep(question.prefix, names(data)) # identifies the indices of the columns belonging to the question
  cases <- length(index) # the number of possible options / columns
  # Identify the range of possible answers for each question:
  # Step 1. Search for the min in each col and across each col choose the min
  # Step 2. Search for the max in each col and across each col choose the max
  mn <- min(data[, index[1:cases]], na.rm = T)
  mx <- max(data[, index[1:cases]], na.rm = T)
  d <- colSums(data[, index] != 0, na.rm = TRUE) # the number of elements in each column that are different from zero
  vec <- matrix(, nrow = length(mn:mx), ncol = cases)
  for(j in 1:cases){
    for(i in mn:mx){
      vec[i, j] <- sum(data[, index[j]] == i, na.rm = TRUE) / d[j] # stores the relative responses for option j for the answer that is i
    }
  }
  vec1 <- as.data.frame(vec)
  names(vec1) <- names(data[index])
  vec1 <- t(vec1)
  return(vec1)
}
And the way I use my function is this:
q4 <-multichoice(df2,"^Q.4")
Where by "^Q.4" I intend to identify the columns for Q.4, and df2 is my dataframe.
Here is a method using grep:
To return the indices
grep("^Q\\.4[^0-9]", d)
Or the column names themselves:
grep("^Q\\.4[^0-9]", d, value=T)
This works because [^0-9] matches any character that is not a digit, so we match Q.4 literally and then require the next character to be a non-digit.
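With the d from the question, these calls would be expected to return:
grep("^Q\\.4[^0-9]", d)
#> [1] 7 8
grep("^Q\\.4[^0-9]", d, value = TRUE)
#> [1] "Q.4a-some Text" "Q.4b-some Text"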
I believe what you want in the mn statement in your function is
mn <- min(sapply(data[,index], min, na.rm=T), na.rm=T)
sapply moves through the columns selected by index (the result of grep) and finds the minimum of each with min. Then, the outer min is applied across those column minimums.
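Presumably the mx line in the function would get the analogous treatment (this mirrors the pattern above; it is an assumption rather than part of the original answer):
mx <- max(sapply(data[, index], max, na.rm = TRUE), na.rm = TRUE)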
We can use stringr,
library(stringr)
str_extract(d, 'Q.[0-9]+') == 'Q.4'
#[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
#or
d[str_extract(d, 'Q.[0-9]+') == 'Q.4']
#[1] "Q.4a-some Text" "Q.4b-some Text"
If the format is always the same (i.e. Q.[0-9]...) then we can use gsub
gsub('\\D', '', d) == 4
#[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
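One caveat on the gsub route, assuming the trailing text never contains digits; the string below is invented to show what happens otherwise, since \\D strips every non-digit in the whole name:
gsub('\\D', '', "Q.4a-some Text 2")
#> [1] "42"   # a stray digit in the label leaks into the comparison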

Count number of rows matching a criteria

I am looking for a command in R which is equivalent of this SQL statement. I want this to be a very simple basic solution without using complex functions OR dplyr type of packages.
Select count(*) as number_of_states
from myTable
where sCode = "CA"
So essentially I would be counting the number of rows matching my where condition.
I have imported a csv file into mydata as a data frame. So far I have tried these, to no avail.
nrow(mydata$sCode == "CA") ## ==>> returns NULL
sum(mydata[mydata$sCode == 'CA',], na.rm=T) ## ==>> gives Error in FUN(X[[1L]], ...) : only defined on a data frame with all numeric variables
sum(subset(mydata, sCode='CA', select=c(sCode)), na.rm=T) ## ==>> FUN(X[[1L]], ...) : only defined on a data frame with all numeric variables
sum(mydata$sCode == "CA", na.rm=T) ## ==>> returns count of all rows in the entire data set, which is not the correct result.
and some variations of the above samples. Any help would be appreciated! Thanks.
mydata$sCode == "CA" will return a logical vector, with a TRUE value everywhere the condition is met. To illustrate:
> mydata = data.frame(sCode = c("CA", "CA", "AC"))
> mydata$sCode == "CA"
[1] TRUE TRUE FALSE
There are a couple of ways to deal with this:
sum(mydata$sCode == "CA"), as suggested in the comments; because
TRUE is interpreted as 1 and FALSE as 0, this should return the
number of TRUE values in your vector.
length(which(mydata$sCode == "CA")); the which() function
returns a vector of the indices where the condition is met, the
length of which is the count of "CA".
Edit to expand upon what's happening in #2:
> which(mydata$sCode == "CA")
[1] 1 2
which() returns a vector identifying each position where the condition is met (in this case, rows 1 and 2 of the dataframe). The length() of this vector is the number of occurrences.
sum is used to add elements; nrow is used to count the number of rows in a rectangular array (typically a matrix or data.frame); length is used to count the number of elements in a vector. You need to apply these functions correctly.
Let's assume your data is a data frame named "dat". Correct solutions:
nrow(dat[dat$sCode == "CA",])
length(dat$sCode[dat$sCode == "CA"])
sum(dat$sCode == "CA")
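As a quick check on a toy data frame (the val column is invented, and included only so that row subsetting keeps a data frame):
dat <- data.frame(sCode = c("CA", "CA", "AC"), val = 1:3)
nrow(dat[dat$sCode == "CA", ])       # 2
length(dat$sCode[dat$sCode == "CA"]) # 2
sum(dat$sCode == "CA")               # 2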
mydata$sCode is a vector, which is why nrow returns NULL.
mydata[mydata$sCode == 'CA',] returns a data.frame of the rows where sCode == 'CA'. sCode contains character values, which is why sum gives you the error.
In subset(mydata, sCode='CA', select=c(sCode)) you should use sCode=='CA' instead of sCode='CA'. Then subset returns a one-column data frame of the rows where sCode equals CA, so you can count its rows:
nrow(subset(na.omit(mydata), sCode == 'CA', select = c(sCode)))
Or you can try this: sum(na.omit(mydata$sCode) == "CA")
With the dplyr package, use
nrow(filter(mydata, sCode == "CA"))
All the solutions provided here gave me the same error as multi-sam reported, but this one worked.
Just give it a try using subset:
nrow(subset(data,condition))
Example
nrow(subset(myData,sCode == "CA"))
To get the number of observations, counting the number of rows of your dataset is the more valid approach:
nrow(dat[dat$sCode == "CA",])
The grep command can be used:
CA = mydata[grep("CA", mydata$sCode), ]
nrow(CA)
Call nrow passing as argument the name of the dataset:
nrow(dataset)
I'm using this short function to make it easier using dplyr:
library(dplyr)
countc <- function(.data, ..., preserve = FALSE){
return(nrow(filter(.data, ..., .preserve = preserve)))
}
With this you can just use it like filter. For example:
countc(data, active == TRUE)
[1] 42

Vector-version / Vectorizing a for which equals loop in R

I have a vector of values, call it X, and a data frame, call it dat.fram. I want to run something like "grep" or "which" to find all the indices of dat.fram[,3] which match each of the elements of X.
This is the very inefficient for loop I have below. Notice that there are many observations in X and each member of "match.ind" can have zero or more matches. Also, dat.fram has over 1 million observations. Is there any way to use a vector function in R to make this process more efficient?
Ultimately, I need a list since I will pass the list to another function that will retrieve the appropriate values from dat.fram.
Code:
match.ind=list()
for(i in 1:150000){
match.ind[[i]]=which(dat.fram[,3]==X[i])
}
UPDATE:
Ok, wow, I just found an awesome way of doing this... it's really slick. Wondering if it's useful in other contexts...?!
### define v as a sample column of data - you should define v to be
### the column in the data frame you mentioned (dat.fram[,3])
v = sample(1:150000, 1500000, rep=TRUE)
### now here's the trick: concatenate the indices for each possible value of v,
### to form mybiglist - the rownames of mybiglist give you the possible values
### of v, and the values in mybiglist give you the index points
mybiglist = tapply(seq_along(v),v,c)
### now you just want the parts of this that intersect with X... again I'll
### generate a random X but use whatever X you need to
X = sample(1:200000, 150000)
mylist = mybiglist[which(names(mybiglist)%in%X)]
And that's it! As a check, let's look at the first 3 rows of mylist:
> mylist[1:3]
$`1`
[1] 401143 494448 703954 757808 1364904 1485811
$`2`
[1] 230769 332970 389601 582724 804046 997184 1080412 1169588 1310105
$`4`
[1] 149021 282361 289661 456147 774672 944760 969734 1043875 1226377
There's a gap at 3, as 3 doesn't appear in X (even though it occurs in v). And the
numbers listed against 4 are the index points in v where 4 appears:
> which(X==3)
integer(0)
> which(v==3)
[1] 102194 424873 468660 593570 713547 769309 786156 828021 870796
[10]  883932 1036943 1246745 1381907 1437148
> which(v==4)
[1] 149021 282361 289661 456147 774672 944760 969734 1043875 1226377
Finally, it's worth noting that values that appear in X but not in v won't have an entry in the list, but this is presumably what you want anyway as they're NULL!
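For what it's worth, split() appears to build the same value-to-indices list as the tapply() trick, so an equivalent formulation (same v and X as above) might be:
mybiglist2 <- split(seq_along(v), v)
mylist2 <- mybiglist2[names(mybiglist2) %in% X]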
Extra note: You can use the code below to create an NA entry for each member of X not in v...
blanks = sort(setdiff(X,names(mylist)))
mylist_extras = rep(list(NA),length(blanks))
names(mylist_extras) = blanks
mylist_all = c(mylist,mylist_extras)
mylist_all = mylist_all[order(as.numeric(names(mylist_all)))]
Fairly self-explanatory: mylist_extras is a list with all the additional list stuff you need (the names are the values of X not featuring in names(mylist), and the actual entries in the list are simply NA). The final two lines firstly merge mylist and mylist_extras, and then perform a reordering so that the names in mylist_all are in numeric order. These names should then match exactly the (unique) values in the vector X.
Cheers! :)
ORIGINAL POST BELOW... superseded by the above, obviously!
Here's a toy example with tapply that might well run significantly quicker... I made X and d relatively small so you could see what's going on:
X = 3:7
n = 100
d = data.frame(a = sample(1:10,n,rep=TRUE), b = sample(1:10,n,rep=TRUE),
c = sample(1:10,n,rep=TRUE), stringsAsFactors = FALSE)
tapply(X,X,function(x) {which(d[,3]==x)})
