I am trying to obtain the sum of values of a column (B) based on the interval between two values on another column (A) in a "reference" dataframe (df):
A <- seq(1:10)
B <- c(4,3,5,7,5,7,4,7,3,7)
df <- data.frame(A,B)
I have found two ways of doing this:
y <- sum(subset(df, A < 3 & A >= 1, select = "B"))
> y
[1] 7
and
z <- with(df,sum(df[A<3 & A>=1,"B"]))
> z
[1] 7
However, I would like to do this based on a two vectors of values stored on another dataframe
C <- c(3,7,7)
D <- c(1,1,5)
df2 <- data.frame(C,D)
to obtain a column of y values for each pair of C and D values.
I have created a function:
myfn <- function(c,d) {
y <-sum(subset(df, A < c & A >= d, select = "B"))
return(y)
}
Which works fine with numbers
myfn(3,1)
[1] 7
but not with vectors.
myfn(c=C,d=D)
[1] 19
Warning messages:
1: In A < a :
longer object length is not a multiple of shorter object length
2: In A >= b :
longer object length is not a multiple of shorter object length
> myfn(df2$C,df2$D)
[1] 19
Warning messages:
1: In A < a :
longer object length is not a multiple of shorter object length
2: In A >= b :
longer object length is not a multiple of shorter object length
>
Does anyone have any suggestion about how I could calculate such interval for sequence of values?
Try:
mapply(myfn, C, D)
# [1] 7 31 12
The problem is that your function is not naturally vectorized. You can see that because your return value is a sum of the inputs, and sum is not a vectorized operation.
Beyond that, if you look at myfn, the expression A < c & A >= d doesn't make sense when c and d have more than one value. There, you are comparing each value in df to the corresponding value in your C and D vectors (so first value to first, second to second, etc.), instead of comparing all the values in df to each value in C and D in turn.
By using mapply, I'm basically looping through your function with as arguments a single value from C and D at a time.
Fortunately in your case it turns out that C,D have different number of elements than df, so you actually got a warning. If they were the same length you would not have gotten a warning and you would have gotten a single value answer, instead of the three you are presumably looking for.
There are better ways to do this, but the mapply approach is pretty trivial here and works with your code pretty much as is.
Another way...
is.between <- function(x,vec){
return(x>=min(vec) & x<max(vec))
}
apply(df2,1,function(x){sum(df[is.between(df$A,x),]$B)})
# [1] 7 31 12
Related
As an exercise I was given two samples from a seed called u and v and asked to show how many values are in v but not in u fell into the bins [1,50] and [51,100]. Then I am asked to add a line of code in to confirm my answer using a relational operator (like >) and sum().
I solved the first part:
table(findInterval(setdiff(v,u),c(50))
But for the second part, i don't really get what I need to do; any help is appreciated!
Example:
set.seed(1201)
u = sample(100,100,replace=TRUE)
v = sample(100,100,replace=TRUE)
table(findInterval(setdiff(v,u),c(50)))
Output:
0 1
12 12
If we want to use comparative operators and sum, create a logical vector and get the sum of logical vector
i1 <- v[!v %in% u] > 50
sum(i1)
sum(!i1)
Note: If the OP intended to use only unique values (as in setdiff), then get the unique
i1 <- unique(v[!v %in% u]) > 50
out1 <- sum(i1)
out2 <- sum(!i1)
-checking with the output of table
tbl1 <- table(findInterval(setdiff(v,u),c(50)))
all.equal(as.numeric(tbl1), c(out1, out2), check.attributes = FALSE)
#[1] TRUE
Since there is only one number that you are cutting the intervals in, you can verify your answer using > directly.
This is your code
set.seed(1201)
u = sample(100,100,replace=TRUE)
v = sample(100,100,replace=TRUE)
table(findInterval(setdiff(v,u),50))
#0 1
#9 9
Without findInterval
table(setdiff(v,u) > 50)
#FALSE TRUE
# 9 9
I'm working with multiple big data frames in R and I'm trying to write functions that can modify each of them (given a set of common parameters). One function is giving me trouble (shown below).
RawData <- function(x)
{
for(i in 1:nrow(x))
{
if(grep(".DERIVED", x[i,]) >= 1)
{
x <- x[-i,]
}
}
for(i in 1:ncol(x))
{
if(is.numeric(x[,i]) != TRUE)
{
x <- x[,-i]
}
}
return(x)
}
The objective of this function is twofold: first, to remove any rows that contain a ".DERIVED" string in any one of their cells (using grep), and second, to remove any columns that are non-numeric (using is.numeric). I get an error on the following condition:
if(grep(".DERIVED", x[i,]) >= 1)
The error states the "argument is of zero length", which I believe is usually associated with NULL values in a vector. However, I've used is.null on the entire data frame that is giving me errors, and it confirmed that there are no null values in the DF. I'm sure I'm missing something relatively simple here. Any advice would be greatly appreciated.
If you can use non-base-R functions, this should address your issue. df is the data.frame in question here. It will also be faster than looping over rows (generally not advised if avoidable).
library(dplyr)
library(stringr)
df %>%
filter_all(!str_detect(., '\\.DERIVED')) %>%
select_if(is.numeric)
You can make it a function just as you would anything else:
mattsFunction <- function(dat){
dat %>%
filter_all(!str_detect(., '\\.DERIVED')) %>%
select_if(is.numeric)
}
you should probably give it a better name though
The error is from the line
if(grep(".DERIVED", x[i,]) >= 1)
When grep doesn't find the term ".DERIVED", it returns something of zero length, your inequality doesn't return TRUE or FALSE, but rather returns logical(0). The error is telling you that the if statement cannot evaluate whether logical(0) >= 1
A simple example:
if(grep(".DERIVED", "1234.DERIVEDabcdefg") >= 1) {print("it works")} # Works nicely, since the inequality can be evaluated
if(grep(".DERIVED", "1234abcdefg") > 1) {print("no dice")}
You can replace that line with if(length(grep(".DERIVED", x[i,])) != 0)
There's something else you haven't noticed yet, which is that you're removing rows/columns in a loop. Say you remove the 5th column, the next loop iteration (when i = 6) will be handling what was the 7th row! (this will end in an error along the lines of Error in[.data.frame(x, , i) : undefined columns selected)
I prefer using dplyr, but if you need to use base R functions there are ways to to this without if statements.
Notice that you should consider using the regex version of "\\.DERIVED" and not ".DERIVED" which would mean "any character followed by DERIVED".
I don't have example data or output, so here's my best go...
# Made up data
test <- data.frame(a = c("data","data.DERIVED","data","data","data.DERIVED"),
b = (c(1,2,3,4,5)),
c = c("A","B","C","D","E"),
d = c(2,5,6,8,9),
stringsAsFactors = FALSE)
# Note: The following code assumes that the column class is numeric because the
# example code provided assumed that the column class was numeric. This will not
# detects if the column is full of a string of character values of only numbers.
# Using the base subset command
test2 <- subset(test,
subset = !grepl("\\.DERIVED",test$a),
select = sapply(test,is.numeric))
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
# Trying to use []. Note: If only 1 column is numeric this will return a vector
# instead of a data.frame
test2 <- test[!grepl("\\.DERIVED",test$a),]
test2 <- test2[,sapply(test,is.numeric)]
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
I've figured out if I use as.character(df[x,y]) or as.<whatever>df[x,y] I can get/coerce what I need, every time from my data frames
What I cant seem to find/figure out is why. Details below.
When I access df[1,1] (or anything in column 1) I get
df[1,1]
[1] a
Levels: a b c
but when I access 1,3 it works fine
> df[1,3]
[1] 10
but then when I use as.character() it works.
> as.character(df[1,1])
[1] "a"
The data frame was built using this line
df = data.frame(names = c("a","b","c"), size = c(1,2,3),num = c(10,20,30) )
> df
names size num
1 a 1 10
2 b 2 20
3 c 3 30
But in this data frame
imp2met = read.csv('tomet.csv', header = TRUE, sep=",",dec='.')
> imp2met
unit mult ret
1 (yd) 0.9100 (m)
2 (in) 2.5200 (cm)
3 .....
I get these results for 1,3
> imp2met[1,3]
[1] (m)
Levels: (c) (cm) (cm^2) ....
>
> as.character(imp2met[1,3])
[1] "(m)"
So why the "random" results? Why do I need as.<whatever>() but only some of the time?
data.frame default is to convert character vectors to factors. You can change this with the argument stringsAsFactors=FALSE
Also, when you subset a dataframe using [, you can add the drop=FALSE argument to simplify the results in some cases.
I have the following numeric vectors x and y
x <- c(a=1,b=2,c=3)
y <- c(d=2,e=1,f=4)
I want to find the parallel maximum of each elements in the vectors, so I used:
> pmax(x,y)
a b c
2 2 4
The output has the right values, however, it returns the wrong names. The documentation for pmax mentions that it returns the attributes of the first argument, hence the a b c. Is there a way of getting the names of the maximum values? The desired output is as follow:
d b f
2 2 4
One option would be using max.col for finding the index of the maximum value per each row. For that, we need to create a matrix/data.frame by cbinding the vectors ('xy') and its names ('nmxy'). Create a row/column index ('ij') and subset the elements of 'xy' and set the names from 'nmxy'.
xy <- cbind(x,y)
nmxy <- cbind(names(x), names(y))
ij <- cbind(1:nrow(xy), max.col(xy))
setNames(xy[ij], nmxy[ij])
# d b f
# 2 2 4
Let
r <- pmax(x,y)
Simply add after the function a rename command
names(r)[y == r] <- names(y)[y == r]
If you want to be fancy, you can overload the pmax function to have the desired output.
old.pmax = pmax
pmax <- function(x,y){
r <- old.pmax(x,y)
names(r)[y == r] <- names(y)[y == r]
return(r)
}
I have a data frame in R consisting in 5 columns and 30000 rows. One column, called "pos", has this kind of values sorted in ascending order:
pos
785989
888659
918573
949608
990417
I would like to remove all rows where the difference between a "x" value in "pos" (in a "n" row) and the anterior value in the "n-1" row or the difference between the posterior value in "n+1" row and "x" is greater than, lets say, 100000. Eg: in the input example, 888659-785989 = 102670 > 100000, therefore rows containing 888659 and 785989 values should be removed.
Thanks for your help !
One solution is to create a user function that takes the diff of a vector and checks a conditional gap provided by the user:
diff_set <- function(x, gap) {
ind <- c(F, diff(x) > gap)
if(sum(ind) == 0) return(!ind)
subst <- x[-unique(c(which(ind), which(ind)-1))]
x %in% subst
}
df1[diff_set(df1$x, 1e5),]
x y
3 918573 C
4 949608 D
5 990417 E
Data
x <- scan(text="785989
888659
918573
949608
990417")
df1 <- data.frame(x, y=LETTERS[1:5])