R Building a function with a condition as variable - r

I have a data.frame consisting of two columns, time and value. Let's say:
> s
> time value
>1 -1.40325749 -0.5282231
>2 -0.32640410 -1.8719568
>3 -0.26288196 -0.9861694
>4 -0.19906006 -0.8487832
>5 -0.18720951 -0.2248195
>6 -0.14219086 0.3387807
>7 -0.05981503 1.3872106
>8 0.37187516 2.0057095
>9 0.42432858 2.6805815
>10 1.19915563 1.9988563
I want to build a function which will filter this data, according to the specific condition. Here is my code:
> select<-function(object,cond)
{
subset(object,eval(deparse(substitute(cond))))
}
If I use now my new function as follows:
>select(s,value<0)
I would like to see only rows, where value is < 0. E.g.
> s
> time value
>1 -1.40325749 -0.5282231
>2 -0.32640410 -1.8719568
>3 -0.26288196 -0.9861694
>4 -0.19906006 -0.8487832
>5 -0.18720951 -0.2248195
However, after running this code, I have an error, that subset must be logical. I tried everything I know about to make the "value<0" visible as expression for R. Does anyone know how to fix it?
Error in subset.data.frame(object, eval(deparse(substitute(cond)))) :
'subset' must be logical
Regards
Michal

Try any of these:
select <- subset
select <- function(...) subset(...)
select <- function(data, cond) eval.parent(substitute(subset(data, cond)))
select <- function(data, cond) {
mc <- match.call()
mc[[1L]] <- quote(subset)
m <- match(c("data", "cond"), names(mc), 0L)
names(mc)[m] <- c("x", "subset")
eval.parent(mc)
}
and then using the builtin BOD data.frame
select(BOD, Time > 3)
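To see why the original attempt fails: deparse() turns the captured expression into a character string, and eval() of a string simply returns that string, so subset() receives text instead of a logical vector. Below is a minimal sketch of this, assuming a small made-up data frame; show_deparse and select2 are hypothetical names used only for illustration.

```r
# Made-up data with the same columns as in the question
s <- data.frame(time  = c(-1.4, -0.3, 0.4, 1.2),
                value = c(-0.5, -1.9, 2.7, 2.0))

# deparse() yields a string, and evaluating a string returns it unchanged
show_deparse <- function(cond) eval(deparse(substitute(cond)))
as_text <- show_deparse(value < 0)   # a character string, not a logical vector

# Hypothetical alternative: capture the expression unevaluated, then
# evaluate it with the data frame's columns in scope
select2 <- function(object, cond) {
  expr <- substitute(cond)
  keep <- eval(expr, object, parent.frame())
  object[keep & !is.na(keep), , drop = FALSE]
}

res <- select2(s, value < 0)
print(res)
```

Passing parent.frame() as the enclosure lets the condition also refer to variables defined where select2 is called.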

I also like dplyr for this:
library(dplyr)
s <- s %>%
filter(value < 0)
to filter rows by condition

Related

How to exclude values from random selection

This is independent but related to this question Randomly take wrong answers of a quiz question from dataframe column, instead of doing by hand
Using the mtcars dataset I now have managed to randomly select one value from a certain column: In this example from cyl column.
mtcars[sample(1:nrow(mtcars), 1),2]
This code will give randomly
[1] 6 or [1] 8 or ...
Now I want to exclude one certain value to be chosen, in this example say cyl==8.
I would store the value in a vector like:
not_select <- 8
mtcars[sample(1:nrow(mtcars), 1),2]
My question: How can I integrate not_select in mtcars[sample(1:nrow(mtcars), 1),2]
Expected Output: The random sample should not include 8
UPDATE:
e.g. the output should be:
6 or 4
UPDATE II due to unclear situation:
I want to select from column cyl one value randomly. This value should not be for example 8. So the value will be 4 or 6.
Explanation: 8 is the correct answer. And I am constructing randomly false answers with the other values (e.g. 4 and 6) from cyl column.
Perhaps, another way -
tmp <- mtcars[, 2]
sample(tmp[tmp != not_select], 1)
The above gives the probability of selecting each value based on their occurrence in the dataset. If you want the probability to be equal irrespective of how many times they occur you may only consider unique values.
tmp <- unique(mtcars[, 2])
sample(tmp[tmp != not_select], 1)
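One caveat with the snippets above: when x is a single number, sample(x, 1) draws from 1:x rather than returning x, which matters if only one candidate is left after the exclusion. A minimal sketch that sidesteps this by sampling an index instead; candidates and draw are names chosen here for illustration.

```r
not_select <- 8
cyl <- mtcars$cyl

# Candidate wrong answers: every distinct value except the excluded one
candidates <- unique(cyl[cyl != not_select])

# Index into the vector so a single remaining candidate is returned as-is
draw <- candidates[sample.int(length(candidates), 1)]
print(draw)
```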
Couldn't you just add a filtering condition based on not_select?
mtcars[sample(1:nrow(mtcars), 1) & mtcars$cyl != not_select, 2]
Update: how about:
not_select <- 8
draw_cyl <- sample(unique(mtcars$cyl[mtcars$cyl != not_select]), 1)
mtcars %>%
filter(cyl == draw_cyl) %>%
slice_sample(n = 1) %>%
pull(cyl)
Or as suggested by TarJae themselves (so I can't take any credit for it!):
mtcars[sample(which (mtcars[,2] != not_select), 1), 2]
This recursive function will run on itself again if the output matches not_selected.
exclude_not_selected <- function(not_selected) {
value <- mtcars[sample(1:nrow(mtcars), 1),2]
if (value == not_selected) {
exclude_not_selected(not_selected)
} else {
return(value)
}
}
exclude_not_selected(8)
[1] 4

Subsetting a df by 2, 3 or more conditions in R [duplicate]

I am looking for a command in R which is the equivalent of this SQL statement. I want this to be a very simple, basic solution without using complex functions or dplyr-type packages.
Select count(*) as number_of_states
from myTable
where sCode = "CA"
so essentially I would be counting number of rows matching my where condition.
I have imported a csv file into mydata as a data frame. So far I have tried these to no avail.
nrow(mydata$sCode == "CA") ## ==>> returns NULL
sum(mydata[mydata$sCode == 'CA',], na.rm=T) ## ==>> gives Error in FUN(X[[1L]], ...) : only defined on a data frame with all numeric variables
sum(subset(mydata, sCode='CA', select=c(sCode)), na.rm=T) ## ==>> FUN(X[[1L]], ...) : only defined on a data frame with all numeric variables
sum(mydata$sCode == "CA", na.rm=T) ## ==>> returns count of all rows in the entire data set, which is not the correct result.
and some variations of the above samples. Any help would be appreciated! Thanks.
mydata$sCode == "CA" will return a logical vector, with a TRUE value everywhere that the condition is met. To illustrate:
> mydata = data.frame(sCode = c("CA", "CA", "AC"))
> mydata$sCode == "CA"
[1] TRUE TRUE FALSE
There are a couple of ways to deal with this:
1. sum(mydata$sCode == "CA"), as suggested in the comments; because TRUE is interpreted as 1 and FALSE as 0, this should return the number of TRUE values in your vector.
2. length(which(mydata$sCode == "CA")); the which() function returns a vector of the indices where the condition is met, the length of which is the count of "CA".
Edit to expand upon what's happening in #2:
> which(mydata$sCode == "CA")
[1] 1 2
which() returns a vector identifying each element where the condition is met (in this case, rows 1 and 2 of the dataframe). The length() of this vector is the number of occurrences.
sum is used to add elements; nrow is used to count the number of rows in a rectangular array (typically a matrix or data.frame); length is used to count the number of elements in a vector. You need to apply these functions correctly.
Let's assume your data is a data frame named "dat". Correct solutions:
nrow(dat[dat$sCode == "CA",])
length(dat$sCode[dat$sCode == "CA"])
sum(dat$sCode == "CA")
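The three forms can disagree when sCode contains missing values, because a logical index with NA produces an all-NA row in the subset. A small sketch on made-up data (the values are assumptions for illustration only):

```r
# Made-up data with one missing state code
dat <- data.frame(sCode = c("CA", "NY", NA, "CA"), stringsAsFactors = FALSE)

# sum() with na.rm = TRUE ignores the NA comparison result
n_sum <- sum(dat$sCode == "CA", na.rm = TRUE)

# which() drops NA, so only the true matches are kept
n_which <- nrow(dat[which(dat$sCode == "CA"), , drop = FALSE])

# Plain logical indexing keeps an extra all-NA row for the NA comparison
n_naive <- nrow(dat[dat$sCode == "CA", , drop = FALSE])

print(c(n_sum = n_sum, n_which = n_which, n_naive = n_naive))
```

Here n_sum and n_which are both 2, while n_naive is 3 because of the extra NA row.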
mydata$sCode is a vector, which is why nrow returns NULL.
mydata[mydata$sCode == 'CA',] returns a data.frame of the rows where sCode == 'CA'. It contains character columns such as sCode, which is why sum gives you the error.
In subset(mydata, sCode='CA', select=c(sCode)), you should use sCode=='CA' instead of sCode='CA'. subset then returns a one-column data frame of the rows where sCode equals CA, so count its rows with nrow (length would give the number of columns):
nrow(subset(na.omit(mydata), sCode=='CA', select=c(sCode)))
Or you can try this: sum(na.omit(mydata$sCode) == "CA")
Just give a try using subset
nrow(subset(data,condition))
Example
nrow(subset(myData,sCode == "CA"))
With the dplyr package, use
nrow(filter(mydata, sCode == "CA"))
All the solutions provided here gave me same error as multi-sam but that one worked.
To get the number of observations, counting the rows of the filtered dataset is more appropriate:
nrow(dat[dat$sCode == "CA",])
grep command can be used
CA = mydata[grep("CA", mydata$sCode), ]
nrow(CA)
Call nrow passing as argument the name of the dataset:
nrow(dataset)
I'm using this short function to make it easier using dplyr:
countc <- function(.data, ..., preserve = FALSE){
return(nrow(filter(.data, ..., .preserve = preserve)))
}
With this you can just use it like filter. For example:
countc(data, active == TRUE)
[1] 42

Looping through rows in an R data frame?

I'm working with multiple big data frames in R and I'm trying to write functions that can modify each of them (given a set of common parameters). One function is giving me trouble (shown below).
RawData <- function(x)
{
for(i in 1:nrow(x))
{
if(grep(".DERIVED", x[i,]) >= 1)
{
x <- x[-i,]
}
}
for(i in 1:ncol(x))
{
if(is.numeric(x[,i]) != TRUE)
{
x <- x[,-i]
}
}
return(x)
}
The objective of this function is twofold: first, to remove any rows that contain a ".DERIVED" string in any one of their cells (using grep), and second, to remove any columns that are non-numeric (using is.numeric). I get an error on the following condition:
if(grep(".DERIVED", x[i,]) >= 1)
The error states the "argument is of zero length", which I believe is usually associated with NULL values in a vector. However, I've used is.null on the entire data frame that is giving me errors, and it confirmed that there are no null values in the DF. I'm sure I'm missing something relatively simple here. Any advice would be greatly appreciated.
If you can use non-base-R functions, this should address your issue. df is the data.frame in question here. It will also be faster than looping over rows (generally not advised if avoidable).
library(dplyr)
library(stringr)
df %>%
filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
select_if(is.numeric)
You can make it a function just as you would anything else:
mattsFunction <- function(dat){
dat %>%
filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
select_if(is.numeric)
}
You should probably give it a better name, though.
The error is from the line
if(grep(".DERIVED", x[i,]) >= 1)
When grep doesn't find the term ".DERIVED", it returns integer(0), so your inequality doesn't return TRUE or FALSE, but rather logical(0). The error is telling you that the if statement cannot evaluate a zero-length condition.
A simple example:
if(grep(".DERIVED", "1234.DERIVEDabcdefg") >= 1) {print("it works")} # Works nicely, since the inequality can be evaluated
if(grep(".DERIVED", "1234abcdefg") >= 1) {print("no dice")} # Error: argument is of length zero, because grep() returns integer(0)
You can replace that line with if(length(grep(".DERIVED", x[i,])) != 0)
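A quick sketch of the difference, using the two strings from the example; fixed = TRUE is an added assumption here so that the dot is matched literally rather than as a regex wildcard.

```r
hit  <- grep(".DERIVED", "1234.DERIVEDabcdefg", fixed = TRUE)  # index of the match
miss <- grep(".DERIVED", "1234abcdefg", fixed = TRUE)          # zero-length result

# length() and grepl() always give a single TRUE/FALSE that if() can use
found_hit  <- length(hit) > 0
found_miss <- any(grepl(".DERIVED", "1234abcdefg", fixed = TRUE))

print(c(found_hit = found_hit, found_miss = found_miss))
```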
There's something else you haven't noticed yet: you're removing rows and columns inside the loops that iterate over them. Say you remove the 5th column; the next loop iteration (when i = 6) will be handling what was originally the 7th column! (This will end in an error along the lines of Error in `[.data.frame`(x, , i) : undefined columns selected.)
I prefer using dplyr, but if you need to use base R functions there are ways to do this without if statements.
Notice that you should consider using the regex version of "\\.DERIVED" and not ".DERIVED" which would mean "any character followed by DERIVED".
I don't have example data or output, so here's my best go...
# Made up data
test <- data.frame(a = c("data","data.DERIVED","data","data","data.DERIVED"),
b = (c(1,2,3,4,5)),
c = c("A","B","C","D","E"),
d = c(2,5,6,8,9),
stringsAsFactors = FALSE)
# Note: The following code assumes that the column class is numeric because the
# example code provided assumed that the column class was numeric. It will not
# detect columns that hold numbers stored as character strings.
# Using the base subset command
test2 <- subset(test,
subset = !grepl("\\.DERIVED",test$a),
select = sapply(test,is.numeric))
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
# Trying to use []. Note: If only 1 column is numeric this will return a vector
# instead of a data.frame
test2 <- test[!grepl("\\.DERIVED",test$a),]
test2 <- test2[,sapply(test,is.numeric)]
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
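The examples above test only column a. If ".DERIVED" may appear in any cell, as the question describes, one possible base R sketch is to test every column and combine the results. clean_raw is a hypothetical name, the data is made up, and fixed = TRUE is an assumption that a literal match is wanted.

```r
# Hypothetical helper: drop rows with ".DERIVED" in any cell, keep numeric columns
clean_raw <- function(x) {
  # One logical vector per column, TRUE where the cell contains ".DERIVED"
  hits <- lapply(x, function(col) grepl(".DERIVED", as.character(col), fixed = TRUE))
  drop_row <- Reduce(`|`, hits)                  # TRUE if any column matched
  keep_col <- vapply(x, is.numeric, logical(1))  # TRUE for numeric columns
  x[!drop_row, keep_col, drop = FALSE]           # drop = FALSE keeps a data frame
}

# Made-up data: row 2 matches in column a, row 3 matches in column c
test <- data.frame(a = c("data", "data.DERIVED", "data"),
                   b = c(1, 2, 3),
                   c = c("A", "B", "C.DERIVED"),
                   d = c(2, 5, 6),
                   stringsAsFactors = FALSE)

out <- clean_raw(test)
print(out)
```

Because both filters are computed before anything is removed, no index shifts while rows or columns are being dropped.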

function to subtract each column from one specific column in r

I want to subtract each column from a column called df$Means in R. I want to do this as a function but I'm not sure how to iterate through each of the columns; each iteration relies on one column being subtracted from df$Means and then there is a load of downstream code that uses the output. I have simplified the code here, as this is the bit that's giving me trouble. So far I have:
CopyNumberLoop <- function (i) {df$ZScore <- (df[3:5]-df$Means)/(df$sd)
}
apply(df[3:50], 2, CopyNumberLoop)
but I'm not sure how to make sure that the operation is done on one column at a time. I don't think df[3:5] is correct?
I have been asked to produce a reproducible example so all the code I want is here:
df <- read.delim(file.choose(),header=TRUE)
#Take the control samples and average each row for three columns excluding the first two columns- add the per row means to the data frame
df$Means <- rowMeans(df[,30:32])
RowVar <- function(x) {rowSums((x - rowMeans(x))^2)/(dim(x)[2] - 1)}
df$sd=sqrt(RowVar(df[,c(30:32)]))
#Get a Z score by dividing the test sample count at each locus by the average for the control samples and divide everything by the st dev for controls at each locus.
{
df$ZScore <- (df[,35]-df$Means)/(df$sd)
######################################### QUARTILE FILTER ###########################################################
alpha=1.5
numberofControls = 3
UL = median(df$ZScore, na.rm = TRUE) + alpha*IQR(df$ZScore, na.rm = TRUE)
LL = median(df$ZScore, na.rm = TRUE) - alpha*IQR(df$ZScore, na.rm = TRUE)
#Copy the Z score if the score is > or < a certain number, i.e. LL or UL.
Zoutliers <- which(df$ZScore > UL | df$ZScore < LL)
df$Zoutliers <- ifelse(df$ZScore > UL |df$ZScore <LL ,1,-1)
tempout = ifelse(df$ZScore[Zoutliers] > UL,1,-1)
######################################### Three neighbour Isolation filter ##############################################################################
finalSeb=c()
for(i in 2:(length(Zoutliers)-1)){
j=Zoutliers[i]
if(sum(ifelse((j-1) == Zoutliers,1,0)) > 0 & tempout[i] == tempout[i-1] & sum(ifelse((j+1) == Zoutliers,1,0)) > 0 & tempout[i] == tempout[i+1]){
finalSeb = c(finalSeb,i)
}
}
finalset_row_number = Zoutliers[finalSeb]
#View(finalset_row_number)
p_seq = rep(0,nrow(df))
for(i in 1:length(finalset_row_number)){
p_seq[(finalset_row_number[i]-1):(finalset_row_number[i]+1)] = median(df$ZScore[(finalset_row_number[i]-1):(finalset_row_number[i]+1)])
}
nrow(as.data.frame(finalset_row_number))
}
For each column between 3 and 50 I'd like to generate a nrow(as.data.frame(finalset_row_number)) and keep it in another dataframe. Admittedly my code is a mess because I don't know how to create the function that will allow me to apply this to each column.
Your code isn’t using the parameter i at all. In fact, i is the current column, so that’s what you should use:
result = apply(df[, 3 : 50], 2, function (col) col - df$Means)
Or you can subtract the means directly:
result = df[, 3 : 50] - df$Means
This will return a new data frame consisting of columns 3–50 from df, with df$Means subtracted from each in turn (the apply version returns a matrix instead). Or, if you want to calculate Z scores as your code seems to do:
result = (df[, 3 : 50] - df$Means) / df$sd
It appeared that you wanted the Z-scores assigned back into the original dataframe as named columns. If you want to loop over columns, it would be just as economical to use lapply or sapply. The receiving function will accept each column in turn and match it to the first parameter. Any other arguments offered after the receiving function will get matched by name or position to any other symbol/names in the parameter list. You do not do any assignment to 'df' inside the function:
CopyNumberLoop <- function (col) { (col - df$Means) / df$sd
}
df[, paste0('ZScore' , 3:50)] <- # assignment done outside the loop
lapply(df[3:50], CopyNumberLoop) # result is a list
# but the `[.data.frame<-` method will accept a list.
Using apply coerces to a matrix, which may have undesirable effects if a column is not numeric (say factor or date-time). It's better to get into the habit of using lapply when working on ranges of columns in dataframes.
If you want to assign the result of this operation to a new dataframe, then the lapply(.) result would need to be wrapped in as.data.frame and then column names could be assigned. Same effort would need to be done to a result from apply(.).
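A minimal sketch of that pattern on made-up data, ending with one count per sample column as the question asks. The column names, the 1.5 cut-off and the simple absolute-value test are assumptions for illustration; they stand in for the quartile and neighbour filters in the question, which are not reproduced here.

```r
# Made-up counts: two test samples (s1, s2) and three controls (c1..c3)
df <- data.frame(id = c("locus1", "locus2"),
                 s1 = c(12, 20), s2 = c(10, 17),
                 c1 = c(9, 19), c2 = c(10, 20), c3 = c(11, 21))
controls <- c("c1", "c2", "c3")
samples  <- c("s1", "s2")

# Per-row mean and standard deviation of the controls
df$Means <- rowMeans(df[, controls])
df$sd    <- apply(df[, controls], 1, sd)

# One Z-score column per sample column, collected into a new data frame
z <- as.data.frame(lapply(df[samples], function(col) (col - df$Means) / df$sd))
names(z) <- paste0("ZScore_", samples)

# One summary number per column (placeholder rule: |Z| above 1.5)
n_extreme <- vapply(z, function(col) sum(abs(col) > 1.5), integer(1))
print(z)
print(n_extreme)
```

The function passed to vapply is where the per-column downstream code would go, as long as it returns a single number.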

adding a column based on other values

I have a dataframe with millions of rows and three columns labeled Keywords, Impressions, Clicks. I'd like to add a column with values depending on the evaluation of this function:
isType <- function(Impressions, Clicks)
{
if (Impressions >= 1 & Clicks >= 1){return("HasClicks")} else if (Impressions >=1 & Clicks == 0){return("NoClicks")} else {return("ZeroImp")}
}
So far so good. I then try this to create the column, but 1) it takes forever and 2) it marks all the rows as "HasClicks", even the ones where it shouldn't.
# Creates a dataframe
Type <- data.frame()
# Loops until last row and store it in data.frame
for (i in c(1:dim(Mydf)[1])) {Type <- rbind(Type,isType(Mydf$Impressions[i], Mydf$Clicks[i]))}
# Add the column to Mydf
Mydf <- transform(Mydf, Type = Type)
input data:
Keywords,Impressions,Clicks
"Hello",0,0
"World",1,0
"R",34,23
Wanted output:
Keywords,Impressions,Clicks,Type
"Hello",0,0,"ZeroImp"
"World",1,0,"NoClicks"
"R",34,23,"HasClicks"
Building on Joshua's solution, I find it cleaner to generate Type in a single shot (note however that this presumes Clicks >= 0...)
Mydf$Type = ifelse(Mydf$Impressions >= 1,
ifelse(Mydf$Clicks >= 1, 'HasClicks', 'NoClicks'), 'ZeroImp')
First, the if/else block in your function will return the warning:
Warning message:
In if (1:2 > 2:3) TRUE else FALSE :
the condition has length > 1 and only the first element will be used
which explains why all the rows are the same.
Second, you should allocate your data.frame and fill in the elements rather than repeatedly combining objects together. I imagine this is causing your long run-times.
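If the row-by-row function has to stay, the allocation advice might look like the sketch below: the result vector is created once at full length and filled in place, and the scalar operator && is used inside if(). This is an illustration on the question's sample input, not a replacement for the vectorised ifelse approach.

```r
# Scalar version of the question's function, using && for single values
isType <- function(impressions, clicks) {
  if (impressions >= 1 && clicks >= 1) {
    "HasClicks"
  } else if (impressions >= 1 && clicks == 0) {
    "NoClicks"
  } else {
    "ZeroImp"
  }
}

# Sample input from the question
Mydf <- data.frame(Keywords    = c("Hello", "World", "R"),
                   Impressions = c(0, 1, 34),
                   Clicks      = c(0, 0, 23))

# Allocate the result once, then fill it, instead of growing it with rbind()
Type <- character(nrow(Mydf))
for (i in seq_len(nrow(Mydf))) {
  Type[i] <- isType(Mydf$Impressions[i], Mydf$Clicks[i])
}
Mydf$Type <- Type
print(Mydf)
```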
EDIT: My shared code. I'd love for someone to provide a more elegant solution.
Mydf <- data.frame(
Keywords = sample(c("Hello","World","R"),20,TRUE),
Impressions = sample(0:3,20,TRUE),
Clicks = sample(0:3,20,TRUE) )
Mydf$Type <- "ZeroImp"
Mydf$Type <- ifelse(Mydf$Impressions >= 1 & Mydf$Clicks >= 1,
"HasClicks", Mydf$Type)
Mydf$Type <- ifelse(Mydf$Impressions >= 1 & Mydf$Clicks == 0,
"NoClicks", Mydf$Type)
This is a case where arithmetic can be cleaner and most likely faster than nested ifelse statements.
Again building on Joshua's solution:
Mydf$Type <- factor(with(Mydf, (Impressions >= 1) * (1 + (Clicks >= 1)) + 1),
levels=1:3, labels=c("ZeroImp","NoClicks","HasClicks"))
