Sum of data frame's rows in range defined by columns - r

I have an integer-based data frame with positional coordinates in one column and a variable in the second. The coordinates range from 1 to 10 million and the variable from 0 to 950. I'm interested in returning the sum of the variable over ranges defined in a separate data frame containing the start and end points of each desired range.
To make things a bit easier to compute I've shortened the example:
Data:
a = seq(1,5)
b = c(0,0,1,0,2)
df1 <- data.frame(a, b)
c = c(1,1,2,2,3)
d = c(3,4,3,5,4)
df2 <- data.frame(c,d)
df1:
  a b
1 1 0
2 2 0
3 3 1
4 4 0
5 5 2
df2:
  c d
1 1 3
2 1 4
3 2 3
4 2 5
5 3 4
magic
output:
1
1
1
3
1
Here "magic" pulls the start and end positions from columns 1 and 2 of df2 and, for each pair, sums the corresponding rows of df1's second column.
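For reference, a literal base R translation of that "magic" (a sketch only; the data.table and mapply answers below are better starting points for large data):
# for each (c, d) pair in df2, sum b over the rows of df1 whose a falls in [c, d]
sapply(seq_len(nrow(df2)), function(i) {
  sum(df1$b[df1$a >= df2$c[i] & df1$a <= df2$d[i]])
})
# [1] 1 1 1 3 1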

Edit: @Frank's data.table solution is short and fast:
library(data.table)
setDT(df1); setDT(df2)
df2[, s := df1[df2, on=.(a >= c, a <= d), sum(b), by=.EACHI]$V1]
# output
c d s
1: 1 3 1
2: 1 4 1
3: 2 3 1
4: 2 5 3
5: 3 4 1
Another way (may be slower but works):
library(data.table)
setDT(df1)
setDT(df2)
## magic function
get_magic <- function(x) {
  spell <- c()
  one <- unlist(x[1])
  two <- unlist(x[2])
  a <- df1[between(a, one, two), sum(b)]
  spell <- append(spell, a)
  return(spell)
}
# apply get_magic to each row of df2
d <- apply(df2, 1, get_magic)
print(d)
# output
[1] 1 1 1 3 1

One possible solution is to use mapply. I have used a custom function here, but you could also write it inline within the mapply call.
row_sum <- function(x, y){
  sum(df1[x:y, 2])
}
mapply(row_sum, df2$c, df2$d)
#Result
#[1] 1 1 1 3 1
Data
a = seq(1,5)
b = c(0,0,1,0,2)
df1 <- data.frame(a, b)
c = c(1,1,2,2,3)
d = c(3,4,3,5,4)
df2 <- data.frame(c,d)
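Given the scale mentioned in the question (positions up to 10 million), a prefix-sum lookup is another option worth sketching. This is not from the answers above and assumes df1$a is the complete, sorted sequence 1..N with no gaps:
cs <- c(0, cumsum(df1$b))    # prefix sums of b, with a leading 0
cs[df2$d + 1] - cs[df2$c]    # sum of b over a in [c, d], one value per row of df2
# [1] 1 1 1 3 1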

Related

How do I create new variable names using argument in my function in R?

Here is a dataset:
require(data.table)
df <- data.table(a = c(1, 2, 3),
                 b = c(4, 5, 6))
a b
1: 1 4
2: 2 5
3: 3 6
I would like to create several column names with my function.
Here is an example function.
f_test <- function(x){
  variableName1 <- eval(paste0("variableName1_", x))
  variableName2 <- eval(paste0("variableName2_", x))
  print(variableName1)
  #setNames(variableName)
  df_1a <- df[, `:=` (variableName1 = a * b * 1,
                      variableName2 = a * b * 2)]
}
For example, this is the expected outcome from f_test("AAA")
a b variableName1_AAA variableName2_AAA
1: 1 4 4 8
2: 2 5 10 20
3: 3 6 18 36
However, the function outcome is not 'variableName1_AAA', but 'variableName1'.
How do I assign the name based on the string argument in the function? I need to assign the name character to use in the future function work.
We can use paste0 directly to create the column names, since the input is a string. The assignment (:=) can then be done by concatenating the column-name objects on the lhs of :=.
f_test <- function(x){
  variableName1 <- paste0("variableName1_", x)
  variableName2 <- paste0("variableName2_", x)
  df_1a <- copy(df)
  df_1a[, c(variableName1, variableName2) := .(a * b * 1,
                                               a * b * 2)][]
}
Testing:
f_test("AAA")
# a b variableName1_AAA variableName2_AAA
#1: 1 4 4 8
#2: 2 5 10 20
#3: 3 6 18 36
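A small usage note: because the function operates on copy(df), the original df is left untouched; drop the copy() call if you want the new columns added to df by reference.
res <- f_test("AAA")
names(df)    # unchanged (assuming a fresh df), since f_test works on a copy
names(res)   # "a" "b" "variableName1_AAA" "variableName2_AAA"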

Perform a function on a dataframe across variable number of columns after removing zeros

I'm trying to create a function that takes another function as an argument and applies it across a variable number of columns, after removing zeros. I'm not too comfortable with ellipses yet, and I'm guessing that's where the problem is arising. As written, the function pools all of the values from the specified columns, summarizes them with the selected function, and then mutates that single value into every row. Instead, I'd like the function to operate across each row (e.g. like rowMeans).
Example:
# Setup dataframe
a <- 1:5
b <- c(0, 4, 3, 0, 1)
c <- c(5:1)
d <- c(2, 0, 1, 0, 4)
df <- data.frame(a, b, c, d)
FUNexcludeZero <- function(function_name, ...){
  # Match function name
  FUN <- match.fun(function_name)
  # get all the values - I'm sure this is the problem, need to somehow turn it back into a df?
  vals <- unlist(list(...))
  # Remove 0's and perform function
  valsNo0 <- vals[vals != 0]
  compiledVals <- FUN(valsNo0)
  return(compiledVals)
}
df %>%
  mutate(foo = FUNexcludeZero(function_name = 'sd', a, b))
a b c d foo
1 1 0 5 2 1.457738
2 2 4 4 0 1.457738
3 3 3 3 1 1.457738
4 4 0 2 0 1.457738
5 5 1 1 4 1.457738
df %>%
  mutate(foo = FUNexcludeZero(function_name = 'min', a, b))
a b c d foo
1 1 0 5 2 1
2 2 4 4 0 1
3 3 3 3 1 1
4 4 0 2 0 1
5 5 1 1 4 1
# Try row-function (same error occurs with rowMeans)
df %>%
  mutate(foo = FUNexcludeZero(function_name = 'pmin', a, b))
Error in mutate_impl(.data, dots) :
Column `foo` must be length 5 (the number of rows) or one, not 8
For function_name = 'sd' the column should be c(NA, 1.41, 0, NA, 2.828) and the min and pmin should be c(1, 2, 3, 4, 1). I'm 100% sure the error has something to do with the list/unlist, but any other way I try it I end up with an error.
I am not sure if this is exactly what you want. You need to perform a row-wise operation on the two vectors, so I used the apply function. This should work for any number of equal-length vectors.
# Setup dataframe
a <- 1:5
b <- c(0, 4, 3, 0, 1)
c <- c(5:1)
d <- c(2, 0, 1, 0, 4)
#df <- data.frame(a, b, c, d) #not used
FUNexcludeZero <- function(function_name, ...){
  # Match function name
  FUN <- match.fun(function_name)
  # combine the vectors into a matrix
  mat <- cbind(...)
  # drop the zeros from each row and apply the function to what remains
  compiledVals <- apply(mat, 1, function(x) {
    x <- x[x != 0]
    FUN(x)
  })
  return(compiledVals)
}
FUNexcludeZero(function_name = 'sd', a, b)
#[1] NA 1.414214 0.000000 NA 2.828427
FUNexcludeZero(function_name = 'min', a, b)
#[1] 1 2 3 4 1
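Since this version returns one value per row, it also drops straight into the mutate() call from the question (assuming dplyr is loaded and df is rebuilt from the vectors above):
library(dplyr)
df <- data.frame(a, b, c, d)
df %>%
  mutate(foo = FUNexcludeZero(function_name = 'sd', a, b))
#   a b c d      foo
# 1 1 0 5 2       NA
# 2 2 4 4 0 1.414214
# 3 3 3 3 1 0.000000
# 4 4 0 2 0       NA
# 5 5 1 1 4 2.828427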

Arrange list of vectors

I have a list of vectors and another vector. I would like to arrange the list of vectors according to the values of the other vector.
a <- c(1, 2)
b <- c(1, 4)
c <- c(1, 1)
x <- list(a, b, c) # list of vector
v <- c(3, 2, 5) # other vector
Here I want to arrange x according to v, so the desired output will be:
2 b
3 a
5 c
Here is an option with stack and arrange (set_names() is available in purrr/rlang):
library(dplyr)
v %>%
  set_names(letters[1:3]) %>%
  stack %>%
  arrange(values)
# values ind
#1 2 b
#2 3 a
#3 5 c
Another base R option: bind v together with the names of x into a data frame, then order the rows by v.
It will look something like:
cbind(as.data.frame(v), col = names(x))[order(v),]
# v col
#2 2 b
#1 3 a
#3 5 c
Data:
a <- c(1, 2)
b <- c(1, 4)
c <- c(1, 1)
x <- list(a=a, b=b, c=c) # list of vector
v <- c(3, 2, 5) # other vector
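If the goal is to reorder the list itself, rather than just tabulate its names against v, subsetting by order(v) does it:
x[order(v)]    # elements in the order b, a, c (increasing v)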

Create a vector of counts

I wanted to create a vector of counts if possible.
For example: I have a vector
x <- c(3, 0, 2, 0, 0)
How can I create a frequency vector for all integers between 0 and 3? Ideally I wanted to get a vector like this:
> 3 0 1 1
which gives me the counts of 0, 1, 2, and 3 respectively.
Much appreciated!
You can do
table(factor(x, levels=0:3))
Simply using table(x) is not enough.
Or with tabulate, which is faster:
tabulate(factor(x, levels = min(x):max(x)))
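If the counts should always cover 0 through 3 regardless of what happens to appear in x, fix the levels explicitly instead of relying on min(x):max(x):
tabulate(factor(x, levels = 0:3))
# [1] 3 0 1 1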
You can do this using rle (I made this in minutes, so sorry if it's not optimized enough).
x = c(3, 0, 2, 0, 0)
r = rle(x)
f = function(x) sum(r$lengths[r$values == x])
s = sapply(FUN = f, X = as.list(0:3))
data.frame(x = 0:3, freq = s)
#> data.frame(x = 0:3, freq = s)
# x freq
#1 0 3
#2 1 0
#3 2 1
#4 3 1
You can just use table():
a <- table(x)
a
x
#0 2 3
#3 1 1
Then you can subset it:
a[names(a)==0]
#0
#3
Or convert it into a data.frame if you're more comfortable working with that:
u<-as.data.frame(table(x))
u
# x Freq
#1 0 3
#2 2 1
#3 3 1
Edit 1:
For levels:
a<- as.data.frame(table(factor(x, levels=0:3)))
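If you want the bare count vector (the 3 0 1 1 asked for) rather than a table or data frame, drop the names with as.vector:
as.vector(table(factor(x, levels = 0:3)))
# [1] 3 0 1 1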

Groupby bins and aggregate in R

I have data like (a,b,c)
a b c
1 2 1
2 3 1
9 2 2
1 6 2
where the range of 'a' is divided into n (say 3) equal parts and an aggregate function (say max) is computed on the 'b' values, also grouped by 'c'.
So the output looks like
a_bin b_m(c=1) b_m(c=2)
1-3 3 6
4-6 NaN NaN
7-9 NaN 2
This is an M x N table, where M is the number of 'a' bins and N is the number of unique 'c' values (or the full range).
How do I approach this? Is there an R package that can help?
A combination of aggregate, cut and reshape seems to work
df <- data.frame(a = c(1, 2, 9, 1),
                 b = c(2, 3, 2, 6),
                 c = c(1, 1, 2, 2))
breaks <- c(0, 3, 6, 9)
# Aggregate data
ag <- aggregate(df$b, FUN = max,
                by = list(a = cut(df$a, breaks, include.lowest = TRUE), c = df$c))
# Reshape data
res <- reshape(ag, idvar = "a", timevar = "c", direction = "wide")
There would be easier ways.
If your dataset is dat
res <- sapply(split(dat[, -3], dat$c), function(x) {
  a_bin <- with(x, cut(a, breaks = c(1, 3, 6, 9), include.lowest = TRUE,
                       labels = c("1-3", "4-6", "7-9")))
  c(by(x$b, a_bin, FUN = max))
})
res1 <- setNames(data.frame(row.names(res), res),
                 c("a_bin", "b_m(c=1)", "b_m(c=2)"))
row.names(res1) <- 1:nrow(res1)
res1
a_bin b_m(c=1) b_m(c=2)
1 1-3 3 6
2 4-6 NA NA
3 7-9 NA 2
I would use a combination of data.table and reshape2, both of which are optimized for speed (no loops from the apply family).
The output won't return the unused bins.
v <- c(1, 4, 7, 10)                  # bin breaks
temp <- df                           # assuming df is the example data built in the first answer
temp$int <- findInterval(temp$a, v)
library(data.table)
temp <- setDT(temp)[, list(b_m = max(b)), by = c("c", "int")]
library(reshape2)
temp <- dcast.data.table(temp, int ~ c, value.var = "b_m")
## colnames(temp) <- c("a_bin", "b_m(c=1)", "b_m(c=2)") # Optional for prettier table
## temp$a_bin<- c("1-3", "7-9") # Optional for prettier table
## a_bin b_m(c=1) b_m(c=2)
## 1 1-3 3 6
## 2 7-9 NA 2
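If the empty bins matter, one option (a sketch building on the code above, not part of the original answer) is to join the cast result back onto the full set of bins:
all_bins <- data.table(int = seq_len(length(v) - 1))   # bins 1, 2, 3
temp[all_bins, on = "int"]                             # empty bins come back as NA rows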
