Tables and bins from two vectors in R

As an exercise, I was given two samples, u and v, drawn from the same seed, and asked to show how many of the values that are in v but not in u fall into the bins [1,50] and [51,100]. I was then asked to add a line of code to confirm my answer using a relational operator (like >) and sum().
I solved the first part:
table(findInterval(setdiff(v,u),c(50)))
But for the second part, I don't really get what I need to do; any help is appreciated!
Example:
set.seed(1201)
u = sample(100,100,replace=TRUE)
v = sample(100,100,replace=TRUE)
table(findInterval(setdiff(v,u),c(50)))
Output:
0 1
12 12

To use a relational operator and sum(), create a logical vector and sum it; each TRUE counts as 1:
i1 <- v[!v %in% u] > 50
sum(i1)
sum(!i1)
Note: If the OP intended to use only unique values (as in setdiff), then take the unique values first:
i1 <- unique(v[!v %in% u]) > 50
out1 <- sum(i1)
out2 <- sum(!i1)
Checking against the output of table (the table lists the lower bin first, so the counts are compared in the order out2, out1):
tbl1 <- table(findInterval(setdiff(v,u),c(50)))
all.equal(as.numeric(tbl1), c(out2, out1), check.attributes = FALSE)
#[1] TRUE

Since you are cutting the range at a single break point, you can verify your answer using > directly.
This is your code:
set.seed(1201)
u = sample(100,100,replace=TRUE)
v = sample(100,100,replace=TRUE)
table(findInterval(setdiff(v,u),50))
#0 1
#9 9
Without findInterval
table(setdiff(v,u) > 50)
#FALSE TRUE
# 9 9
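One caveat when comparing the two approaches: by default findInterval(x, 50) puts the value 50 itself into the upper interval, whereas x > 50 puts it into the lower one, so the two counts only agree when 50 is not among the values. A minimal sketch of a check that matches the bins [1,50] and [51,100] exactly, assuming the same u and v as above (the names d, bins, low and high are illustrative and not from the original posts):

```r
set.seed(1201)
u <- sample(100, 100, replace = TRUE)
v <- sample(100, 100, replace = TRUE)

d <- setdiff(v, u)                              # unique values in v but not in u

# left.open = TRUE makes the intervals (-Inf, 50] and (50, Inf),
# so 50 is counted in the lower bin, as [1,50] requires
bins <- findInterval(d, 50, left.open = TRUE)   # 0 for d <= 50, 1 for d > 50

low  <- sum(d <= 50)                            # count in [1,50]
high <- sum(d > 50)                             # count in [51,100]
c(low = low, high = high)
```

The counts themselves depend on the R version, because the default sampling algorithm behind sample() changed in R 3.6.0, so no expected output is shown here.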

Related

Looping through rows in an R data frame?

I'm working with multiple big data frames in R and I'm trying to write functions that can modify each of them (given a set of common parameters). One function is giving me trouble (shown below).
RawData <- function(x)
{
  for(i in 1:nrow(x))
  {
    if(grep(".DERIVED", x[i,]) >= 1)
    {
      x <- x[-i,]
    }
  }
  for(i in 1:ncol(x))
  {
    if(is.numeric(x[,i]) != TRUE)
    {
      x <- x[,-i]
    }
  }
  return(x)
}
The objective of this function is twofold: first, to remove any rows that contain a ".DERIVED" string in any one of their cells (using grep), and second, to remove any columns that are non-numeric (using is.numeric). I get an error on the following condition:
if(grep(".DERIVED", x[i,]) >= 1)
The error states the "argument is of zero length", which I believe is usually associated with NULL values in a vector. However, I've used is.null on the entire data frame that is giving me errors, and it confirmed that there are no null values in the DF. I'm sure I'm missing something relatively simple here. Any advice would be greatly appreciated.
If you can use non-base-R functions, this should address your issue. df is the data.frame in question here. It will also be faster than looping over rows (generally not advised if avoidable).
library(dplyr)
library(stringr)
df %>%
  # filter_all() needs its predicate wrapped in all_vars() or any_vars()
  filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
  select_if(is.numeric)
You can make it a function just as you would anything else:
mattsFunction <- function(dat){
  dat %>%
    filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
    select_if(is.numeric)
}
You should probably give it a better name, though.
The error is from the line
if(grep(".DERIVED", x[i,]) >= 1)
When grep doesn't find the term ".DERIVED", it returns a zero-length integer vector, integer(0). The inequality integer(0) >= 1 then evaluates to logical(0) rather than TRUE or FALSE, and the error is telling you that if cannot work with a zero-length condition.
A simple example:
if(grep(".DERIVED", "1234.DERIVEDabcdefg") >= 1) {print("it works")} # Works nicely, since the inequality can be evaluated
if(grep(".DERIVED", "1234abcdefg") >= 1) {print("no dice")} # Error: argument is of length zero
You can replace that line with if(length(grep(".DERIVED", x[i,])) != 0)
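To make the zero-length behaviour concrete, here is a small sketch of what grep returns in both cases, and of grepl with any() as an alternative condition. The vector row_cells is a hypothetical stand-in for one row of the data frame, and fixed = TRUE is used so that the dot is matched literally:

```r
hit  <- grep(".DERIVED", "1234.DERIVEDabcdefg", fixed = TRUE)
miss <- grep(".DERIVED", "1234abcdefg", fixed = TRUE)

length(hit)    # 1: one element matched
length(miss)   # 0: nothing matched
miss >= 1      # logical(0), which if() cannot use as a condition

# grepl() returns TRUE/FALSE for every element, so any() always yields
# a single TRUE or FALSE that if() can work with
row_cells <- c("abc", "data.DERIVED", "42")   # hypothetical row contents
any(grepl(".DERIVED", row_cells, fixed = TRUE))
```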
There's something else you haven't noticed yet, which is that you're removing rows/columns inside the loop that iterates over them. Say you remove the 5th column: the next loop iteration (when i = 6) will be handling what was originally the 7th column, so the original 6th column is never checked! (This will end in an error along the lines of Error in `[.data.frame`(x, , i) : undefined columns selected.)
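The index shift is easier to see on a plain vector. The following is only an illustrative sketch with made-up values (x and y are not from the question); the bounds check on i is there so the loop itself does not fail once the vector has shrunk:

```r
x <- c(1, 2, 2, 3)

# Deleting while looping: after the first 2 is dropped, the second 2 slides
# into the position the loop has just visited and is never checked
y <- x
for (i in seq_along(x)) {
  if (i <= length(y) && y[i] == 2) y <- y[-i]
}
y            # one 2 is still there

# Deciding what to keep first and subsetting once avoids the problem
x[x != 2]
```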
I prefer using dplyr, but if you need to use base R functions, there are ways to do this without if statements.
Note that you should use the escaped regex "\\.DERIVED" rather than ".DERIVED"; in a regular expression an unescaped "." matches any character, so ".DERIVED" means "any character followed by DERIVED".
I don't have example data or output, so here's my best go...
# Made up data
test <- data.frame(a = c("data","data.DERIVED","data","data","data.DERIVED"),
b = (c(1,2,3,4,5)),
c = c("A","B","C","D","E"),
d = c(2,5,6,8,9),
stringsAsFactors = FALSE)
# Note: The following code assumes that the column class is numeric because the
# example code provided assumed that the column class was numeric. It will not
# detect a character column whose values happen to be strings of digits.
# Using the base subset command
test2 <- subset(test,
subset = !grepl("\\.DERIVED",test$a),
select = sapply(test,is.numeric))
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
# Trying to use []. Note: If only 1 column is numeric this will return a vector
# instead of a data.frame
test2 <- test[!grepl("\\.DERIVED",test$a),]
test2 <- test2[,sapply(test,is.numeric)]
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
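The base R examples above only look for ".DERIVED" in column a, while the question asks about a match in any cell. A possible sketch that checks every column is shown below; the function name raw_data and the use of Reduce are my own choices rather than anything from the original posts, and the test data repeats the made-up data frame from this answer:

```r
raw_data <- function(x) {
  # One logical vector per column: does the cell contain the literal ".DERIVED"?
  hits <- lapply(x, function(col) grepl(".DERIVED", col, fixed = TRUE))
  # Combine across columns: TRUE if any cell in the row matched
  has_derived <- Reduce(`|`, hits)
  x <- x[!has_derived, , drop = FALSE]
  # Keep numeric columns; drop = FALSE keeps a data.frame even for one column
  x[, vapply(x, is.numeric, logical(1)), drop = FALSE]
}

# Made-up data, as above
test <- data.frame(a = c("data", "data.DERIVED", "data", "data", "data.DERIVED"),
                   b = c(1, 2, 3, 4, 5),
                   c = c("A", "B", "C", "D", "E"),
                   d = c(2, 5, 6, 8, 9),
                   stringsAsFactors = FALSE)

out <- raw_data(test)
out
```

Because the rows and columns to keep are computed before anything is removed, there is no index shift, and no if statement that could receive a zero-length condition.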

Return attributes of pmax function output

I have the following numeric vectors x and y
x <- c(a=1,b=2,c=3)
y <- c(d=2,e=1,f=4)
I want to find the parallel maximum of each elements in the vectors, so I used:
> pmax(x,y)
a b c
2 2 4
The output has the right values; however, it returns the wrong names. The documentation for pmax mentions that it returns the attributes of the first argument, hence the a b c. Is there a way of getting the names of the maximum values? The desired output is as follows:
d b f
2 2 4
One option is to use max.col to find the index of the maximum value in each row. For that, we create a matrix by cbinding the vectors ('xy') and another from their names ('nmxy'). We then build a row/column index ('ij'), use it to subset the elements of 'xy', and set the names from 'nmxy'. Note that max.col breaks ties at random by default; pass ties.method = "first" if you need a deterministic choice.
xy <- cbind(x,y)
nmxy <- cbind(names(x), names(y))
ij <- cbind(1:nrow(xy), max.col(xy))
setNames(xy[ij], nmxy[ij])
# d b f
# 2 2 4
Let
r <- pmax(x,y)
Then simply rename the elements whose maximum came from y:
names(r)[y == r] <- names(y)[y == r]
If you want to be fancy, you can mask pmax with a wrapper function that produces the desired output.
old.pmax = pmax
pmax <- function(x,y){
r <- old.pmax(x,y)
names(r)[y == r] <- names(y)[y == r]
return(r)
}
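A variation on the same idea picks the names up front with ifelse instead of renaming afterwards. This is only a sketch; the names from_y, nm and res are illustrative, and unlike the y == r approach above it keeps x's name when the two values are tied:

```r
x <- c(a = 1, b = 2, c = 3)
y <- c(d = 2, e = 1, f = 4)

from_y <- unname(y > x)                        # TRUE where y holds the strict maximum
nm     <- ifelse(from_y, names(y), names(x))   # on ties, x's name is kept
res    <- setNames(pmax(x, y), nm)
res
# d b f
# 2 2 4
```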

How to check whether a variable is numeric for a vector in R?

I have two questions.
for (k in 1:iterations) {
corr <- cor(df2_prod[,k], df2_qa[,k])
ifelse(is.numeric(corr), next,
ifelse((all(df2_prod[,k] == df2_qa[,k])) ), (corr <- 1), (corr <- 0))
correlation[k,] <- rbind(names(df2_prod[k]), corr)
}
This is my requirement: I want to calculate the correlation for variables in a loop using the code corr <- cor(df2_prod[,k], df2_qa[,k]). If I receive a numeric correlation value, I have to keep the value as it is.
Sometimes, if two columns have the same values, I receive "NA" as output for the vector "corr".
x y
1 1
1 1
1 1
1 1
1 1
corr
[,1]
[1,] NA
I am trying to handle in such a way that if "NA" is received, i will replace the values with "1" or "0".
My questions are:
When I check the class of "corr" vector, I am getting it as "matrix". I want to check whether that is a number or not. Is there any other way other than checking is.numeric(corr)
> class(corr)
[1] "matrix"
I want to check if two columns has same value or not. Something like the code below. If it returns true, I want to proceed. But the way I have put the code in the loop is wrong. Could you please help me how this can be improved:
((all(df2_prod[,k] == df2_qa[,k]))
Is there any effective way to do this?
I sincerely apologize to readers for the poorly framed question / logic. If you can give me pointers that would improve the code, I would be really thankful.
1. You basically want to avoid NAs, right? So you could check the result with is.na().
a <- rep(1, 5)
b <- rep(1, 5)
if(is.na(cor(a, b))) cor.value <- 1
2. You could count how many times an element of a is equal to the corresponding element of b with sum(a==b), and check whether this count is equal to the number of elements in a (or b), i.e. length(a):
if(sum(a==b) == length(a)) cor.value <- 1
An example to explain how the cor function works:
set.seed(123)
df1 <- data.frame(v1=1:10, v2=rnorm(10), v3=rnorm(10), v4=rnorm(10))
df2 <- data.frame(w1=rnorm(10), w2=1:10, w3=rnorm(10))
Here, the first variable of df1 is equal to the second variable of df2. Function cor directly applied on the first 3 variables of each data.frame gives:
cor(df1[, 1:3], df2[, 1:3])
# w1 w2 w3
#v1 -0.4603659 1.0000000 0.1078796
#v2 0.6730196 -0.2602059 -0.3486367
#v3 0.2713188 -0.3749826 -0.2520174
As you can notice, the correlation coefficient between w2 and v1 is 1, not NA.
So, in your case, cor(df2_prod[, 1:k], df2_qa[, 1:k]) should provide you the desired output.
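Putting the pieces together: cor returns NA (with a warning) when one of its inputs has zero standard deviation, i.e. when a column is constant, not merely because the two columns are equal. A minimal sketch of the fallback the question asks for follows; safe_cor, prod and qa are hypothetical names, with prod and qa standing in for df2_prod and df2_qa:

```r
safe_cor <- function(a, b) {
  # cor() warns and returns NA when either input has zero standard deviation
  r <- suppressWarnings(cor(a, b))
  if (is.na(r)) {
    # Fallback requested in the question: 1 if the columns are identical, else 0
    if (isTRUE(all(a == b))) 1 else 0
  } else {
    r
  }
}

# Hypothetical stand-ins for df2_prod and df2_qa
prod <- data.frame(p = c(1, 1, 1, 1, 1), q = 1:5, s = c(2, 2, 2, 2, 2))
qa   <- data.frame(p = c(1, 1, 1, 1, 1), q = 1:5, s = c(2, 2, 3, 2, 2))

# Pair the columns up by position; the result is a named numeric vector
res <- mapply(safe_cor, prod, qa)
res
```

Here p falls back to 1 (constant and identical), q is an ordinary correlation of 1, and s falls back to 0 (one column is constant but the two differ). Note that cor on two whole data frames, as above, returns every cross-pair; the column-by-column values are the diagonal of that matrix.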

Multiple one-to-many matching between vectors in R

I want to update a dataframe with values from a table of new values where there is a one-to-many relationship between the dataframe and table of new values. This code illustrates the intent:
df = data.frame(x=rep(letters[1:4],5,rep=T), y=1:20)
and new values..
eds = data.frame(x=c('c','d'), val=c(101, 102))
For a one-to-one relationship the following should work:
df$x[match(eds$x, df$x)] = eds$x[match(df$x, eds$x)]
But match only works with first match, so this throws the error number of items to replace is not a multiple of replacement length. Grateful for any tips on the most efficient way to approach this. I'm guessing some sapply wrapper but I can't think of the method.
Thanks in advance.
tmp <- eds$val[match(df$x, eds$x)] # Matching indices (with NAs for no match)
df$y <- ifelse(is.na(tmp), df$y, tmp) # Values at matches (leaving alone for NAs)
head(df, 5)
# x y
# 1 a 1
# 2 b 2
# 3 c 101
# 4 d 102
# 5 a 5
Note that this is not a very robust solution. It depends on your exact data structure here (the repeating 'c', 'd' pattern), but it works for this case:
df[df[["x"]] %in% eds[["x"]], "y"] = eds[[2]]
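For comparison, here is a sketch of a lookup that does not depend on the order or repetition pattern of the rows. It uses a named vector as the lookup table; the names lookup and hit are illustrative, and stringsAsFactors = FALSE is set explicitly so that x is a character column on any R version:

```r
df  <- data.frame(x = rep(letters[1:4], 5), y = 1:20, stringsAsFactors = FALSE)
eds <- data.frame(x = c("c", "d"), val = c(101, 102), stringsAsFactors = FALSE)

lookup <- setNames(eds$val, eds$x)       # named vector: c = 101, d = 102
hit    <- df$x %in% names(lookup)        # rows that have a replacement value

# Index the lookup by name, one value per matching row
df$y[hit] <- unname(lookup[df$x[hit]])
head(df, 5)
```

Every "c" row receives 101 and every "d" row receives 102, however the rows are ordered, and rows without a match keep their original y.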

Sum object in a column between an interval defined by another dataframe

I am trying to obtain the sum of values of a column (B) based on the interval between two values on another column (A) in a "reference" dataframe (df):
A <- seq(1:10)
B <- c(4,3,5,7,5,7,4,7,3,7)
df <- data.frame(A,B)
I have found two ways of doing this:
y <- sum(subset(df, A < 3 & A >= 1, select = "B"))
> y
[1] 7
and
z <- with(df,sum(df[A<3 & A>=1,"B"]))
> z
[1] 7
However, I would like to do this based on two vectors of values stored in another dataframe
C <- c(3,7,7)
D <- c(1,1,5)
df2 <- data.frame(C,D)
to obtain a column of y values for each pair of C and D values.
I have created a function:
myfn <- function(c,d) {
y <-sum(subset(df, A < c & A >= d, select = "B"))
return(y)
}
Which works fine with numbers
myfn(3,1)
[1] 7
but not with vectors.
myfn(c=C,d=D)
[1] 19
Warning messages:
1: In A < a :
longer object length is not a multiple of shorter object length
2: In A >= b :
longer object length is not a multiple of shorter object length
> myfn(df2$C,df2$D)
[1] 19
Warning messages:
1: In A < a :
longer object length is not a multiple of shorter object length
2: In A >= b :
longer object length is not a multiple of shorter object length
>
Does anyone have any suggestion about how I could calculate such interval for sequence of values?
Try:
mapply(myfn, C, D)
# [1] 7 31 12
The problem is that your function is not naturally vectorized. You can see that because your return value is the result of sum, and sum always collapses its input to a single value.
Beyond that, if you look at myfn, the expression A < c & A >= d doesn't do what you want when c and d have more than one value. There, you are comparing each value in df to the corresponding value in your C and D vectors (so first value to first, second to second, etc., with recycling), instead of comparing all the values in df to each value in C and D in turn.
By using mapply, I'm basically looping over your function, passing it a single value from C and a single value from D at a time.
Fortunately, in your case C and D have a different number of elements than df, and one that does not divide it evenly, so you actually got a warning. If they were the same length, you would not have gotten a warning, and you would have gotten a single value instead of the three you are presumably looking for.
There are better ways to do this, but the mapply approach is pretty trivial here and works with your code pretty much as is.
Another way...
is.between <- function(x,vec){
return(x>=min(vec) & x<max(vec))
}
apply(df2,1,function(x){sum(df[is.between(df$A,x),]$B)})
# [1] 7 31 12
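If the goal is a new column in df2, the same loop can be written so that the function itself accepts vectors. This is a sketch under the question's definitions of df and df2; the function name interval_sum and its argument names are hypothetical:

```r
A <- 1:10
B <- c(4, 3, 5, 7, 5, 7, 4, 7, 3, 7)
df  <- data.frame(A, B)
df2 <- data.frame(C = c(3, 7, 7), D = c(1, 1, 5))

interval_sum <- function(upper, lower, data) {
  stopifnot(length(upper) == length(lower))
  # One sum per (lower, upper) pair: rows with lower <= A < upper
  vapply(seq_along(upper), function(i) {
    sum(data$B[data$A >= lower[i] & data$A < upper[i]])
  }, numeric(1))
}

df2$y <- interval_sum(df2$C, df2$D, df)
df2$y
# [1]  7 31 12
```

vapply is used instead of sapply so that the result is guaranteed to be a numeric vector with one element per interval.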
