Keep rows if several columns (with similar names) follow a condition - r

I want to check whether the values of several columns (which have similar names) satisfy a condition. What I've tried is this:
filter(df.w.meth.mean, cov.CD34.1 > 4 & cov.CD34.2 > 4 & cov.CD34.4 >4 & cov.CD34.5 >4 & cov.CD34.6 > 4)
How can I simplify this?
I was thinking of using grep to keep the columns that match the 'cov' pattern, but it's not working.
Can you help?

Using dplyr::filter_at() you can do:
library(dplyr)
df.w.meth.mean %>%
  filter_at(vars(starts_with("cov.CD34")), ~ . > 4)
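In newer versions of dplyr (1.0+), filter_at() is superseded; a minimal sketch of the same filter using if_all(), assuming the same "cov.CD34" column prefix:
library(dplyr)
df.w.meth.mean %>%
  filter(if_all(starts_with("cov.CD34"), ~ .x > 4))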

In base R, we can use grep to find the columns whose names start with "cov". We then subset those columns and keep the rows where all of their values are greater than 4.
cols <- grep("^cov", names(df.w.meth.mean))
df.w.meth.mean[rowSums(df.w.meth.mean[cols] > 4) == length(cols),]
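A minimal reproducible sketch with made-up data (the values and the "other" column below are hypothetical, just to show what the base R line returns; the dplyr call above keeps the same row):
df.w.meth.mean <- data.frame(
  cov.CD34.1 = c(5, 3, 6),
  cov.CD34.2 = c(7, 8, 2),
  other      = c(1, 1, 1)
)
cols <- grep("^cov", names(df.w.meth.mean))
df.w.meth.mean[rowSums(df.w.meth.mean[cols] > 4) == length(cols), ]
#   cov.CD34.1 cov.CD34.2 other
# 1          5          7     1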

Related

How to pull out columns in r based on various criteria

I have a huge data set in R (1 million+ rows) and 51 columns. One of my columns is "StateFIPS", another is "CountyFIPS", and another is "event type". The rest I do not care about.
Is there an easy way to take that data frame and pull out all the rows where "StateFIPS" = 3 AND "CountyFIPS" = 4 AND "event type" = Tornado, and put all those rows into a new data frame?
Thanks!
We can use subset
df2 <- subset(df1, StateFIPS == 3 & CountyFIPS == 4 & `event type` == "Tornado")
It is quite easy. This should do it (supposing your data.frame is named "data_set"); note the conditions are combined with & since all three must hold:
new_data <- data_set[(data_set$CountyFIPS == 4) &
                     (data_set$event_type == 'Tornado') &
                     (data_set$StateFIPS == 3), ]
Sure, you can use the which() command, see https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/which
You can then use any logical conditions (and combine them with & (and) and | (or)).
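A minimal sketch of that approach, assuming the data frame is called data_set and the third column is literally named "event type" (with a space):
rows <- which(data_set$StateFIPS == 3 &
              data_set$CountyFIPS == 4 &
              data_set[["event type"]] == "Tornado")
new_data <- data_set[rows, ]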

R update data.frame column based on multiple conditions, problem of using ifelse

The following code does not work as written, but it expresses my main purpose.
if(df$col_1 > 2 & df$col_1 > 3) {df$col_4 = value_1}
Then I tried ifelse
df$col_4 = ifelse(df$col_1 > 2 & df$col_1 > 3, value_1, 0)
However, the problem with using ifelse is that the original value of df$col_4 will be overwritten with zero wherever (df$col_1 > 2 & df$col_1 > 3) evaluates to FALSE.
The original value of df$col_4 should be kept where (df$col_1 > 2 & df$col_1 > 3) == FALSE.
I'd also rather not use a nested ifelse, because that gets messy and hard to read.
Is there a way, like an SQL UPDATE, to change the value only where the multiple conditions are TRUE?
As commented, you could use:
df$col_4 = ifelse(df$col_1 > 2 & df$col_1 > 3, value_1, df$col_4)
One potential issue with that is that you are updating df$col_4 on the fly, which could make it harder to trace errors or wrong behaviour. I'd suggest you store the results in a new column (it could even live outside of df if you don't want too many new columns). I would also add a vector df$condition <- df$col_1 > 2 & df$col_1 > 3. That way, you can check at a glance that the results are what you want.
df$condition <- df$col_1 > 2 & df$col_1 > 3
df$col_5 <- ifelse(df$condition, value_1, df$col_4)
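If you want SQL-UPDATE-like behaviour directly (only the rows matching the condition are touched, everything else keeps its value), logical subassignment in base R is a minimal sketch; value_1 is the placeholder from the question:
# update col_4 only where the condition holds; other rows are left untouched
df$col_4[df$col_1 > 2 & df$col_1 > 3] <- value_1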

R Building a function with a condition as variable

I have a data.frame consisting of two columns, time and value. Let's say:
> s
          time      value
1  -1.40325749 -0.5282231
2  -0.32640410 -1.8719568
3  -0.26288196 -0.9861694
4  -0.19906006 -0.8487832
5  -0.18720951 -0.2248195
6  -0.14219086  0.3387807
7  -0.05981503  1.3872106
8   0.37187516  2.0057095
9   0.42432858  2.6805815
10  1.19915563  1.9988563
I want to build a function which will filter this data according to a specific condition. Here is my code:
select <- function(object, cond) {
  subset(object, eval(deparse(substitute(cond))))
}
If I now use my new function as follows:
select(s, value < 0)
I would like to see only the rows where value < 0, e.g.:
         time      value
1 -1.40325749 -0.5282231
2 -0.32640410 -1.8719568
3 -0.26288196 -0.9861694
4 -0.19906006 -0.8487832
5 -0.18720951 -0.2248195
However, after running this code, I get an error saying that 'subset' must be logical. I have tried everything I know to make "value < 0" visible as an expression to R. Does anyone know how to fix it?
Error in subset.data.frame(object, eval(deparse(substitute(cond)))) :
'subset' must be logical
Try any of these:
select <- subset
select <- function(...) subset(...)
select <- function(data, cond) eval.parent(substitute(subset(data, cond)))
select <- function(data, cond) {
  mc <- match.call()                            # the captured call: select(data = ..., cond = ...)
  mc[[1L]] <- quote(subset)                     # call subset instead of select
  m <- match(c("data", "cond"), names(mc), 0L)
  names(mc)[m] <- c("x", "subset")              # rename the arguments to subset()'s x and subset
  eval.parent(mc)                               # evaluate the rewritten call in the caller's frame
}
and then, using the built-in BOD data frame:
select(BOD, Time > 3)
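As a quick sanity check against the question's own data frame s, the eval.parent/substitute version should behave exactly like a direct subset call:
select <- function(data, cond) eval.parent(substitute(subset(data, cond)))
identical(select(s, value < 0), subset(s, value < 0))  # expected to be TRUE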
I also like dplyr for this, to filter rows by condition:
library(dplyr)
s <- s %>%
  filter(value < 0)

How to filter a data frame depending on how many chars in each row [duplicate]

I have a data frame m and I want to remove all the rows where the f_name column has an entry longer than 3 characters. I assume I can use something similar to
m <- m[-grep("nchar(m$f_name)>3", m$f_name]
To reword your question slightly, you want to retain rows where entries in f_name have length of 3 or less. So how about:
subset(m, nchar(as.character(f_name)) <= 3)
Try this:
m[!nchar(as.character(m$f_name)) > 3, ]
For those looking for a tidyverse approach, you can use dplyr::filter to keep the rows where f_name has at most 3 characters:
m %>% dplyr::filter(nchar(f_name) <= 3)
The obligatory data.table solution:
library(data.table)
setDT(m)
m[nchar(f_name) <= 3]
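Since the question mentions grep, a regex-based variant is also possible; a minimal sketch using grepl, where the pattern matches entries of four or more characters and those rows are then dropped:
# keep rows whose f_name is at most 3 characters long
m[!grepl("^.{4,}$", m$f_name), ]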

How to select rows from data.frame with 2 conditions

I have an aggregated table:
> aggdata[1:4,]
  Group.1 Group.2         x
1       4    0.05 0.9214660
2       6    0.05 0.9315789
3       8    0.05 0.9526316
4      10    0.05 0.9684211
How can I select the x value when I have values for Group.1 and Group.2?
I tried:
aggdata[aggdata[,"Group.1"]==l && aggdata[,"Group.2"]==lambda,"x"]
but that returns all x's.
More info:
I want to use this like this:
table = data.frame();
for(l in unique(aggdata[,"Group.1"])) {
  for(lambda in unique(aggdata[,"Group.2"])) {
    table[l,lambda] = aggdata[aggdata[,"Group.1"]==l & aggdata[,"Group.2"]==lambda,"x"]
  }
}
Any suggestions that are even easier and give this result are appreciated!
The easiest solution is to change "&&" to "&" in your code.
> aggdata[aggdata[,"Group.1"]==6 & aggdata[,"Group.2"]==0.05,"x"]
[1] 0.9315789
My preferred solution would be to use subset():
> subset(aggdata, Group.1==6 & Group.2==0.05)$x
[1] 0.9315789
Use & not &&. The latter only evaluates the first element of each vector.
Update: to answer the second part, use the reshape package. Something like this will do it:
tablex <- recast(aggdata, Group.1 ~ variable * Group.2, id.var=1:2)
# Now add useful column and row names
colnames(tablex) <- gsub("x_","",colnames(tablex))
rownames(tablex) <- tablex[,1]
# Finally remove the redundant first column
tablex <- tablex[,-1]
Someone with more experience using reshape may have a simpler solution.
Note: Don't use table as a variable name as it conflicts with the table() function.
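A base R alternative for the reshaping step, assuming each Group.1/Group.2 combination occurs at most once in aggdata (xtabs would otherwise sum the x values within a cell):
# cross-tabulate x by Group.1 (rows) and Group.2 (columns)
tablex <- xtabs(x ~ Group.1 + Group.2, data = aggdata)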
There is a really helpful document on subsetting R data frames at:
http://www.ats.ucla.edu/stat/r/modules/subsetting.htm
Here is the relevant excerpt:
Subsetting rows using multiple conditional statements: There is no limit to how many logical statements may be combined to achieve the subsetting that is desired. The data frame x.sub1 contains only the observations for which the values of the variable y is greater than 2 and for which the variable V1 is greater than 0.6.
x.sub1 <- subset(x.df, y > 2 & V1 > 0.6)
