I am writing an xor function for a class, so although any recommendations on currently existing xor functions would be nice, I have to write my own. I have searched online, but have not been able to find any solution so far. I also realize my coding style may be sub-optimal. All criticisms will be welcomed.
I am writing a function that returns an element-wise TRUE iff exactly one condition is true. Conditions are given as strings; otherwise they throw an error due to unexpected symbols (e.g. >). I would like to output a list of the pairwise elements of a and b for which my xor function is true.
The problem is that, while I can create a logical vector of xor T/F based on the conditions, I cannot access the objects directly to subset them. It is the conditions that are function arguments, not the objects themselves.
'%xor%' <- function(condition_a, condition_b) {
# Perform an element-wise "exclusive or" on the conditions being true.
if (length(eval(parse(text= condition_a))) != length(eval(parse(text= condition_b))))
stop("Objects are not of equal length.") # Objects must be equal length to proceed
logical_a <- eval(parse(text= condition_a)) # Evaluate and store each logical condition
logical_b <- eval(parse(text= condition_b))
xor_vector <- logical_a + logical_b == 1 # Only one condition may be true.
xor_indices <- which(xor_vector == TRUE) # Store a vector which gives the indices of the elements which satisfy the xor condition.
# Somehow access the objects in the condition strings
list(a = a[xor_indices], b = b[xor_indices]) # Desired output
}
# Example:
a <- 1:10
b <- 4:13
"a < 5" %xor% "b > 4"
Desired output:
$a
[1] 1 5 6 7 8 9 10
$b
[1] 4 8 9 10 11 12 13
I have thought about doing a combination of ls() and grep() to find existing object names in the conditions, but this would run into problems if the objects in the conditions were not initialized. For example, if someone tried to run "c(1:10) < 5" %xor% "c(4:13) > 4".
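One possible direction for the name-lookup idea, sketched under assumptions: base R's all.vars() returns the variable names an expression refers to, so the objects can be fetched without combining ls() and grep(). The helper name xor_subset and its env argument are hypothetical, and the sketch assumes every variable named in the conditions already exists and has the same length as the evaluated conditions. For a condition such as "c(1:10) < 5", all.vars() finds no variable names, so the returned list would simply be empty.

```r
# Hypothetical helper (not from the original post): evaluates two condition
# strings and returns the referenced objects subset to the xor positions.
xor_subset <- function(condition_a, condition_b, env = parent.frame()) {
  expr_a <- parse(text = condition_a)
  expr_b <- parse(text = condition_b)
  logical_a <- eval(expr_a, env)
  logical_b <- eval(expr_b, env)
  if (length(logical_a) != length(logical_b))
    stop("Conditions are not of equal length.")
  # Exactly one of the two conditions is TRUE where the logicals differ
  keep <- which(logical_a != logical_b)
  # all.vars() lists the variable names used in each expression
  vars <- unique(c(all.vars(expr_a), all.vars(expr_b)))
  lapply(mget(vars, envir = env, inherits = TRUE), function(obj) obj[keep])
}

a <- 1:10
b <- 4:13
res <- xor_subset("a < 5", "b > 4")
res
```

With the example data this returns a list with elements a and b, matching the desired output shown above.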
Related
I'm working with multiple big data frames in R and I'm trying to write functions that can modify each of them (given a set of common parameters). One function is giving me trouble (shown below).
RawData <- function(x)
{
for(i in 1:nrow(x))
{
if(grep(".DERIVED", x[i,]) >= 1)
{
x <- x[-i,]
}
}
for(i in 1:ncol(x))
{
if(is.numeric(x[,i]) != TRUE)
{
x <- x[,-i]
}
}
return(x)
}
The objective of this function is twofold: first, to remove any rows that contain a ".DERIVED" string in any one of their cells (using grep), and second, to remove any columns that are non-numeric (using is.numeric). I get an error on the following condition:
if(grep(".DERIVED", x[i,]) >= 1)
The error states "argument is of length zero", which I believe is usually associated with NULL values in a vector. However, I've used is.null on the entire data frame that is giving me errors, and it confirmed that there are no null values in the DF. I'm sure I'm missing something relatively simple here. Any advice would be greatly appreciated.
If you can use non-base-R functions, this should address your issue. df is the data.frame in question here. It will also be faster than looping over rows (generally not advised if avoidable).
library(dplyr)
library(stringr)
df %>%
filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
select_if(is.numeric)
You can make it a function just as you would anything else:
mattsFunction <- function(dat){
dat %>%
filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
select_if(is.numeric)
}
You should probably give it a better name, though.
The error is from the line
if(grep(".DERIVED", x[i,]) >= 1)
When grep doesn't find the term ".DERIVED", it returns integer(0), a vector of zero length. Your inequality then doesn't return TRUE or FALSE, but rather logical(0). The error is telling you that the if statement cannot evaluate a zero-length condition.
A simple example:
if(grep(".DERIVED", "1234.DERIVEDabcdefg") >= 1) {print("it works")} # Works nicely, since the inequality can be evaluated
if(grep(".DERIVED", "1234abcdefg") >= 1) {print("no dice")} # Error: argument is of length zero
You can replace that line with if(length(grep(".DERIVED", x[i,])) != 0)
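A small sketch of what happens on each side of that comparison, using the same two strings as above. The variable names hit and miss are only for illustration.

```r
hit  <- grep(".DERIVED", "1234.DERIVEDabcdefg")  # index of the matching element
miss <- grep(".DERIVED", "1234abcdefg")          # no match

length(miss)       # 0: grep() returned integer(0)
length(miss >= 1)  # 0: the comparison gives logical(0), which if() rejects

# Two tests that always yield a single TRUE/FALSE:
has_match_len   <- length(miss) != 0
has_match_grepl <- any(grepl(".DERIVED", "1234abcdefg"))
```

Both has_match_len and has_match_grepl are FALSE here, so either form is safe to use inside if().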
There's something else you haven't noticed yet: you're removing rows/columns inside a loop. Say you remove the 5th column; the next loop iteration (when i = 6) will then be handling what was originally the 7th column, and the original 6th column is never checked! (This will end in an error along the lines of Error in `[.data.frame`(x, , i) : undefined columns selected.)
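One way to sidestep the shifting-index problem, as a minimal base R sketch: work out which rows and columns to keep first, then subset once. The function name raw_data_sketch and the toy data are made up for illustration, and fixed = TRUE is used so that ".DERIVED" is matched literally.

```r
# Hypothetical rewrite: build the keep/drop masks first, subset a single time.
raw_data_sketch <- function(x) {
  # TRUE for every row in which any cell contains the literal text ".DERIVED"
  has_derived <- apply(x, 1, function(row) any(grepl(".DERIVED", row, fixed = TRUE)))
  # TRUE for every numeric column
  is_num <- vapply(x, is.numeric, logical(1))
  x[!has_derived, is_num, drop = FALSE]
}

toy <- data.frame(a = c("data", "data.DERIVED", "data", "data", "data.DERIVED"),
                  b = c(1, 2, 3, 4, 5),
                  c = c("A", "B", "C", "D", "E"),
                  d = c(2, 5, 6, 8, 9),
                  stringsAsFactors = FALSE)
cleaned <- raw_data_sketch(toy)
cleaned
```

Because nothing is deleted while iterating, no index ever points at a row or column that has moved.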
I prefer using dplyr, but if you need to use base R functions there are ways to do this without if statements.
Note that you should use the escaped regex "\\.DERIVED" rather than ".DERIVED": an unescaped "." matches any character, so ".DERIVED" means "any character followed by DERIVED".
I don't have example data or output, so here's my best go...
# Made up data
test <- data.frame(a = c("data","data.DERIVED","data","data","data.DERIVED"),
b = (c(1,2,3,4,5)),
c = c("A","B","C","D","E"),
d = c(2,5,6,8,9),
stringsAsFactors = FALSE)
# Note: The following code assumes that the column class is numeric because the
# example code provided assumed that the column class was numeric. It will not
# detect a character column whose values happen to contain only digits.
# Using the base subset command
test2 <- subset(test,
subset = !grepl("\\.DERIVED",test$a),
select = sapply(test,is.numeric))
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
# Trying to use []. Note: If only 1 column is numeric this will return a vector
# instead of a data.frame
test2 <- test[!grepl("\\.DERIVED",test$a),]
test2 <- test2[,sapply(test,is.numeric)]
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
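The vector-instead-of-data.frame caveat noted in the comment above can be avoided with drop = FALSE. A short sketch with made-up data (one_num is a hypothetical data frame with a single numeric column):

```r
one_num <- data.frame(a = c("x", "y"), b = c(1, 2), stringsAsFactors = FALSE)
is_num <- sapply(one_num, is.numeric)

as_vector <- one_num[, is_num]                # collapses to a plain numeric vector
as_frame  <- one_num[, is_num, drop = FALSE]  # stays a one-column data.frame

class(as_vector)
class(as_frame)
```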
Say I have a data frame df and want to subset it based on the value of column a.
df <- data.frame(a = 1:4, b = 5:8)
df
Is it necessary to include a which function in the brackets or can I just include the logical test?
df[df$a == "2",]
# a b
#2 2 6
df[which(df$a == "2"),]
# a b
#2 2 6
It seems to work the same either way... I was getting some strange results in a large data frame (i.e., getting empty rows returned as well as the correct ones) but once I cleaned the environment and reran my script it worked fine.
df$a == "2" returns a logical vector, while which(df$a=="2") returns indices. If there are missing values in the vector, the first approach will include them in the returned value, but which will exclude them.
For example:
x=c(1,NA,2,10)
x[x==2]
[1] NA 2
x[which(x==2)]
[1] 2
x==2
[1] FALSE NA TRUE FALSE
which(x==2)
[1] 3
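The same behaviour on a data frame probably explains the empty rows mentioned in the question. A sketch with made-up data (df_na is hypothetical and contains one missing value in column a):

```r
df_na <- data.frame(a = c(1, NA, 2, 4), b = 5:8)

with_logical <- df_na[df_na$a == 2, ]         # the NA in the test yields an all-NA row
with_which   <- df_na[which(df_na$a == 2), ]  # which() drops the NA position

nrow(with_logical)  # 2
nrow(with_which)    # 1
```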
I have a vector named "vec". I label the elements from "a" to "m"
vec <- c(1,1,1,2,2,2,2,2,2,4,4,4,4)
names(vec) <- c("a","b","c","d","e","f","g","h","i","j","k","l","m")
Then I split the vec according to the sequences.
split_vec <- split(vec, vec)
Now when I type
split_vec$"1" I get the first element of the list.
Instead of typing the specific name "1", I want to get the values with something
such as
split_vec$vec[1]
But the above doesn't work. Is there a way to do that?
You can do
split_vec[[as.character(vec[1])]]
# a b c
# 1 1 1
Notice that you need as.character: a bare numeric value from vec[i] is treated as a position rather than a name, so split_vec[[vec[10]]] asks for the 4th element of a three-element list and fails, where you would expect the third element (the one named "4").
split_vec[[vec[10]]]
# Error in split_vec[[vec[10]]] : subscript out of bounds
split_vec[[as.character(vec[10])]]
# j k l m
# 4 4 4 4
But in general, it's best to avoid such names that begin with numerics because, obviously, it's quite awkward and can cause trouble.
I am trying to obtain the sum of values of a column (B) based on the interval between two values on another column (A) in a "reference" dataframe (df):
A <- seq(1:10)
B <- c(4,3,5,7,5,7,4,7,3,7)
df <- data.frame(A,B)
I have found two ways of doing this:
y <- sum(subset(df, A < 3 & A >= 1, select = "B"))
> y
[1] 7
and
z <- with(df,sum(df[A<3 & A>=1,"B"]))
> z
[1] 7
However, I would like to do this based on two vectors of values stored in another data frame
C <- c(3,7,7)
D <- c(1,1,5)
df2 <- data.frame(C,D)
to obtain a column of y values for each pair of C and D values.
I have created a function:
myfn <- function(c,d) {
y <-sum(subset(df, A < c & A >= d, select = "B"))
return(y)
}
Which works fine with numbers
myfn(3,1)
[1] 7
but not with vectors.
myfn(c=C,d=D)
[1] 19
Warning messages:
1: In A < c :
longer object length is not a multiple of shorter object length
2: In A >= d :
longer object length is not a multiple of shorter object length
> myfn(df2$C,df2$D)
[1] 19
Warning messages:
1: In A < c :
longer object length is not a multiple of shorter object length
2: In A >= d :
longer object length is not a multiple of shorter object length
Does anyone have any suggestion about how I could calculate such interval for sequence of values?
Try:
mapply(myfn, C, D)
# [1] 7 31 12
The problem is that your function is not vectorized over its arguments. You can see that because it returns sum(...), and sum collapses its whole input to a single value rather than producing one result per element of c and d.
Beyond that, if you look at myfn, the expression A < c & A >= d doesn't make sense when c and d have more than one value. There, you are comparing each value in df to the corresponding value in your C and D vectors (so first value to first, second to second, etc.), instead of comparing all the values in df to each value in C and D in turn.
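To make the element-by-element comparison concrete, here is a small sketch showing that A < C compares against C recycled to the length of A. The name recycled_C is only for illustration.

```r
A <- 1:10
C <- c(3, 7, 7)

# What R effectively compares against: C repeated until it is as long as A
recycled_C <- rep(C, length.out = length(A))
recycled_C

# The warning is suppressed here only to compare the two results quietly
same <- identical(suppressWarnings(A < C), A < recycled_C)
same
```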
By using mapply, I'm basically looping over your function, passing a single pair of values from C and D as arguments each time.
Fortunately, in your case the length of C and D (3) does not divide the number of rows in df (10), so you actually got a warning. If the lengths had been the same you would have gotten no warning and a single-value answer, instead of the three you are presumably looking for.
There are better ways to do this, but the mapply approach is pretty trivial here and works with your code pretty much as is.
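One possible alternative, offered as a sketch rather than a recommendation: outer() can build a rows-by-intervals logical matrix in one step, and colSums() then adds up B within each interval. The data are the same as in the question; in_interval and sums are names introduced here.

```r
df  <- data.frame(A = 1:10, B = c(4, 3, 5, 7, 5, 7, 4, 7, 3, 7))
df2 <- data.frame(C = c(3, 7, 7), D = c(1, 1, 5))

# in_interval[i, j] is TRUE when df$A[i] lies in [D[j], C[j])
in_interval <- outer(df$A, seq_len(nrow(df2)),
                     function(a, j) a < df2$C[j] & a >= df2$D[j])

# Multiply each column by B (recycled down the rows), then sum per interval
sums <- colSums(in_interval * df$B)
sums
```

This trades the explicit loop for a 10 x 3 matrix, which is fine at this size but grows with rows times intervals.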
Another way...
is.between <- function(x,vec){
return(x>=min(vec) & x<max(vec))
}
apply(df2,1,function(x){sum(df[is.between(df$A,x),]$B)})
# [1] 7 31 12
I would like to make the code more efficient.
The example creates a vector (called 'new_vector'). The values of this 'new_vector' are changed based on if/else-conditions that refer to the values of three other vectors of the same length.
If the conditions are fulfilled, the corresponding elements of the 'new_vector' are updated using values from one of the other vectors (in the example elements of M_date are written into new_vector).
Here is the example code:
new_vector<-c(9,9,9)
S_date<-c(1,1,as.Date('2010/08/01'))
V_date<-c(1,as.Date('2010/09/01'),1)
M_date<-c(2,as.Date('2010/07/01'),1)
for (i in 1:3) {
if ( (S_date[i]==1) & (V_date[i]==1 | M_date[i] < V_date[i]) ) {
new_vector[i]<-M_date[i]
}
}
The result of the example is:
> new_vector
[1] 2 14791 9
The example is simplified and in reality the vectors are larger and there are additional if/else-conditions.
How can I avoid the loop and use implicit methods for vector operations instead?
If you write the expression without the [i] bits you get a vector True/False result:
> S_date==1 & (V_date==1 | M_date < V_date)
[1] TRUE TRUE FALSE
Assign that to a vector, and replace the elements of new_vector selected by that result:
> result = S_date==1 & (V_date==1 | M_date < V_date)
> new_vector[result]=M_date[result]
> new_vector
[1] 2 14791 9
It's a fairly general pattern: compute a boolean vector, then replace those matching values with the corresponding values from another vector.
It works because the FALSE value in the third element of result means that new_vector[3] doesn't get touched.
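The same pattern extends to several branches by computing one mask per branch. The sketch below uses plain numbers instead of dates, and the second condition (cond2) is hypothetical, since the question does not show the additional if/else conditions.

```r
new_vector <- c(9, 9, 9)
S <- c(1, 1, 3)
V <- c(1, 5, 1)
M <- c(2, 4, 1)

cond1 <- S == 1 & (V == 1 | M < V)  # the condition from the question
cond2 <- !cond1 & S > 1             # hypothetical "else if" branch

new_vector[cond1] <- M[cond1]       # first branch takes values from M
new_vector[cond2] <- V[cond2]       # second branch takes values from V
new_vector
```

Writing each later mask as "not any earlier mask, and its own test" keeps the branches mutually exclusive, mirroring if/else if.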
Use ifelse instead of if:
new_vector<-c(9,9,9)
S_date<-c(1,1,as.Date('2010/08/01'))
V_date<-c(1,as.Date('2010/09/01'),1)
M_date<-c(2,as.Date('2010/07/01'),1)
vec <- ifelse((S_date==1) & (V_date==1 | M_date < V_date), M_date, new_vector)
vec
#[1] 2 14791 9
HTH