Assign results of apply to multiple columns of data frame - r

I would like to process all rows of a data frame df by applying a function f to every row. Because f returns a numeric vector with two elements, I would like to assign the individual elements to new columns in df.
Sample df, a trivial function f returning two elements, and my attempt using apply:
df <- data.frame(a = 1:3, b = 3:5)
f <- function (a, b) {
c(a + b, a * b)
}
df[, c('apb', 'amb')] <- apply(df, 1, function(x) f(a = x[1], b = x[2]))
This does not work; the results are filled in column by column:
> df
a b apb amb
1 1 3 4 8
2 2 4 3 8
3 3 5 6 15

You could also use Reduce instead of apply. With two columns, Reduce(f, df) amounts to a single vectorized call f(df$a, df$b), so f is not called once per row. You just need to slightly modify your function to use cbind instead of c:
f <- function (a, b) {
cbind(a + b, a * b) # modified to use `cbind` instead of `c`
}
df[c('apb', 'amb')] <- Reduce(f, df)
df
# a b apb amb
# 1 1 3 4 3
# 2 2 4 6 8
# 3 3 5 8 15
Note: This only works cleanly if you have exactly two columns (as in your example). If your data set has more columns, run this on a subset containing just those two.
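A minimal sketch of the subsetting mentioned in the note, assuming a hypothetical third column named other that should not take part in the calculation:

```r
# 'other' is a made-up extra column, added only for illustration
df <- data.frame(a = 1:3, b = 3:5, other = c("x", "y", "z"))

f <- function(a, b) {
  cbind(a + b, a * b)
}

# Pass only the two relevant columns; Reduce then makes one call f(df$a, df$b)
df[c("apb", "amb")] <- Reduce(f, df[c("a", "b")])
df
```

The extra column is left untouched because it never reaches f.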

You need to transpose the apply result to get what you want:
df[, c('apb', 'amb')] <- t(apply(df, 1, function(x) f(a = x[1], b = x[2])))
> df
a b apb amb
1 1 3 4 3
2 2 4 6 8
3 3 5 8 15
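To see why the transpose is needed, it can help to inspect the shape of what apply returns. A small sketch using the question's df and f:

```r
df <- data.frame(a = 1:3, b = 3:5)
f <- function(a, b) c(a + b, a * b)

res <- apply(df, 1, function(x) f(a = x[1], b = x[2]))
dim(res)  # 2 3: one row per element returned by f, one column per row of df

# After transposing there is one row per row of df, so the new columns line up
df[, c("apb", "amb")] <- t(res)
df
```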

Related

R multiple regular expressions, dataframe column names

I have a dataframe data with a lot of columns in the form of
...v1...min ...v1...max ...v2...min ...v2...max
1 a a a a
2 b b b b
3 c c c c
where in place of ... there could be any expression.
I would like to create a function createData that takes three arguments:
X: a dataframe,
cols: a vector containing the first part of the column name, e.g. c("v1", "v2")
fun: a vector containing the second part of the column name, e.g. c("min"), or c("max", "min")
and returns filtered dataframe, so - for example:
createData(X, c("v1"), NULL) would return this kind of dataframe:
...v1...min ...v1...max
1 a a
2 b b
3 c c
while createData(X, c("v1", "v2"), c("min")) would give me
...v1...min ...v2...min
1 a a
2 b b
3 c c
At this point I decided I need to use something like select(contains()) from the dplyr package.
createData <- function(data, fun, cols)
{
X %>% select(contains())
return(X)
}
What I struggle with is:
how to filter columns whose names contain two (or maybe more?) strings, e.g. both v1 and min? I tried data[grepl(".*(v1*min|min*v1).*", colnames(data), ignore.case=TRUE)] but it doesn't seem to work, and my expressions aren't fixed; they depend on the vector I pass,
how to filter multiple columns with different names, e.g. c("v1", "v2"), passed in a vector, and how to combine that with the first question?
I don't really need to stick with dplyr package, it was just for the sake of the example. Thanks!
EDIT:
A reproducible example:
data = data.frame(AXv1c2min = c(1,2,3),
subv1trwmax = c(4,5,6),
ss25v2xxmin = c(7,8,9),
cwfv2urttmmax = c(10,11,12))
If you pass a vector to contains, it acts like an OR, while chaining multiple select calls narrows the result, acting like an AND. So for your example data:
We can filter for (v1 OR v2) AND min like this:
library(tidyverse)
data %>%
select(contains(c('v1','v2'))) %>%
select(contains('min'))
AXv1c2min ss25v2xxmin
1 1 7
2 2 8
3 3 9
So as a function where either argument is optional:
createData <- function(data, fun=NULL, cols=NULL) {
if (!is.null(fun)) data <- select(data, contains(fun))
if (!is.null(cols)) data <- select(data, contains(cols))
return(data)
}
A series of examples:
createData(data, cols=c('v1', 'v2'), fun='min')
AXv1c2min ss25v2xxmin
1 1 7
2 2 8
3 3 9
createData(data, cols=c('v1'))
AXv1c2min subv1trwmax
1 1 4
2 2 5
3 3 6
createData(data, fun=c('min'))
AXv1c2min ss25v2xxmin
1 1 7
2 2 8
3 3 9
createData(data, cols=c('v1'), fun=c('min', 'max'))
AXv1c2min subv1trwmax
1 1 4
2 2 5
3 3 6
createData(data, cols=c('v1'), fun=c('max'))
subv1trwmax
1 4
2 5
3 6
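Since the question says dplyr is not a requirement, here is a rough base R sketch of the same OR-within, AND-between logic using grepl. The name createDataBase is made up for illustration. Two assumptions to be aware of: the strings are used as regular expression fragments, so special characters would need escaping, and grepl is case-sensitive by default whereas contains ignores case by default.

```r
data <- data.frame(AXv1c2min = c(1, 2, 3),
                   subv1trwmax = c(4, 5, 6),
                   ss25v2xxmin = c(7, 8, 9),
                   cwfv2urttmmax = c(10, 11, 12))

# Hypothetical base R counterpart of createData
createDataBase <- function(data, cols = NULL, fun = NULL) {
  keep <- rep(TRUE, ncol(data))
  # Within one argument, any of the strings may match (OR)
  if (!is.null(cols)) keep <- keep & grepl(paste(cols, collapse = "|"), names(data))
  # Across arguments, both conditions must hold (AND)
  if (!is.null(fun)) keep <- keep & grepl(paste(fun, collapse = "|"), names(data))
  data[, keep, drop = FALSE]
}

createDataBase(data, cols = c("v1", "v2"), fun = "min")
createDataBase(data, cols = "v1", fun = "max")
```

drop = FALSE keeps the result a data frame even when only one column matches.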

Applying one function to multiple columns with a vector of values for parameter using data table

I would like to apply one function to multiple columns, using a vector of different values for one parameter.
I have some data:
library(data.table)
df1 <- as.data.table(data.frame(a = c(1,2,3),
b = c(4,5,6)))
cols <- c('a', 'b')
n <- 1:2
And I want to create columns that add n to a and b. Output would look like this:
a b a+1 b+1 a+2 b+2
1: 1 4 2 5 3 6
2: 2 5 3 6 4 7
3: 3 6 4 7 5 8
This post details how to apply one function to multiple columns, which I understand how to do:
df1[,paste0(cols,'+1'):= lapply(.SD, function(x) x + 1), .SDcols = cols]
What I don't know is how to apply the same function to multiple columns, substituting n for 1.
We can also use outer
cbind(df1, df1[, lapply(.SD, function(x) outer(x, n, `+`))])
Or another option is
nm1 <- paste0(cols, "+", rep(n, each = length(cols)))
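# flatten the per-n results into one list of columns, in the same order as nm1
df1[, (nm1) := unlist(lapply(n, function(k) .SD + k), recursive = FALSE), .SDcols = cols]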
Here is a base R solution
df1out <- cbind(df1,do.call(cbind,lapply(c(1,2), function(k) setNames(df1+k,paste0(names(df1),"+",k)))))
such that
> df1out
a b a+1 b+1 a+2 b+2
1: 1 4 2 5 3 6
2: 2 5 3 6 4 7
3: 3 6 4 7 5 8
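A plain loop over n that reuses the := pattern from the question is another way to get the same columns. A sketch, assuming df1 holds only the original columns a and b when it starts:

```r
library(data.table)

df1 <- data.table(a = c(1, 2, 3), b = c(4, 5, 6))
cols <- c("a", "b")
n <- 1:2

for (k in n) {
  new_cols <- paste0(cols, "+", k)
  # .SDcols restricts .SD to a and b, so columns added earlier are not reused
  df1[, (new_cols) := lapply(.SD, function(x) x + k), .SDcols = cols]
}
df1
```

Because := modifies df1 by reference, no reassignment is needed inside the loop.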

Speed up data.frame rearrangement

I have a data frame with coordinates ("start","end") and labels ("group"):
a <- data.frame(start=1:4, end=3:6, group=c("A","B","C","D"))
a
start end group
1 1 3 A
2 2 4 B
3 3 5 C
4 4 6 D
I want to create a new data frame in which labels are assigned to every element of the sequence on the range of coordinates:
V1 V2
1 1 A
2 2 A
3 3 A
4 2 B
5 3 B
6 4 B
7 3 C
8 4 C
9 5 C
10 4 D
11 5 D
12 6 D
The following code works but it is extremely slow with wide ranges:
df<-data.frame()
for(i in 1:dim(a)[1]){
s<-seq(a[i,1],a[i,2])
df<-rbind(df,data.frame(s,rep(a[i,3],length(s))))
}
colnames(df)<-c("V1","V2")
How can I speed this up?
You can try data.table
library(data.table)
setDT(a)[, start:end, by = group]
which gives
group V1
1: A 1
2: A 2
3: A 3
4: B 2
5: B 3
6: B 4
7: C 3
8: C 4
9: C 5
10: D 4
11: D 5
12: D 6
Obviously this would only work if you have one row per group, which it seems you have here.
If you want a very fast solution in base R, you can manually create the data.frame in two steps:
Use mapply to create a list of your ranges from "start" to "end".
Use rep + lengths to repeat the "groups" column to the expected number of rows.
The base R approach shared here won't depend on having only one row per group.
Try:
temp <- mapply(":", a[["start"]], a[["end"]], SIMPLIFY = FALSE)
data.frame(group = rep(a[["group"]], lengths(temp)),
values = unlist(temp, use.names = FALSE))
If you're doing this a lot, just put it in a function:
myFun <- function(indf) {
temp <- mapply(":", indf[["start"]], indf[["end"]], SIMPLIFY = FALSE)
data.frame(group = rep(indf[["group"]], lengths(temp)),
values = unlist(temp, use.names = FALSE))
}
If you want some larger sample data to try it with, you can use the following:
set.seed(1)
a <- data.frame(start=1:4, end=sample(5:10, 4, TRUE), group=c("A","B","C","D"))
x <- do.call(rbind, replicate(1000, a, FALSE))
y <- do.call(rbind, replicate(100, x, FALSE))
Note that this does seem to slow down as the number of different unique values in "group" increases.
(In other words, the "data.table" approach will make the most sense in general. I'm just sharing a possible base R alternative that should be considerably faster than your existing approach.)
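Another base R possibility, not part of the answers above, is sequence() with its from argument, which builds all the ranges in one vectorized call. This is only a sketch and assumes R 4.0.0 or later (where from was added) and that end is never smaller than start:

```r
a <- data.frame(start = 1:4, end = 3:6, group = c("A", "B", "C", "D"))

# Number of values in each start:end range
len <- a$end - a$start + 1L

out <- data.frame(V1 = sequence(len, from = a$start),
                  V2 = rep(a$group, len))
out
```

Like the mapply version, this does not depend on having only one row per group.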

How to create a table in R that includes column totals

I'm somewhat new to R programming and am in need of assistance.
I'm looking to take the sum of 4 columns in a dataframe and list these totals in a simple table.
Essentially, I want the sum of each of 4 columns (A, B, C, D) listed in a table (column 1: A, B, C, D; column 2: the sum of each column), something along the lines of:
A = 3
B = 4
C = 4
D = 3
Does anyone know how to get this output? Also, the less "manual" the response, the better (i.e. trying to avoid having to input several lines of code to get this output if possible).
Thank you.
If your data looks like this:
a <- c(1:4)
b <- c(2:5)
c <- c(3:6)
d <- c(4:7)
df <- data.frame(a,b,c,d)
a b c d
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
Use
> res <- sapply(df,sum)
to get
a b c d
10 14 18 22
To apply the function only to numeric columns, try
> res <- colSums(df[sapply(df,is.numeric)])
There is colSums:
colSums(Filter(is.numeric, df))
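If the totals are wanted as a two-column table, as described in the question, the named vector from colSums can be turned into a data frame. A small sketch; the column names column and total are arbitrary choices:

```r
df <- data.frame(a = 1:4, b = 2:5, c = 3:6, d = 4:7)

totals <- colSums(df[sapply(df, is.numeric)])

# One row per original column: its name and its sum
res <- data.frame(column = names(totals), total = unname(totals))
res
```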

Using merge command in r for merging depending upon column values

So, I have several dataframes like this
1 2 a
2 3 b
3 4 c
4 5 d
3 5 e
......
1 2 j
2 3 i
3 4 t
3 5 r
.......
2 3 t
2 4 g
6 7 i
8 9 t
......
What I want is to merge all of these files into one, showing the third-column values for each pair of values in columns 1 and 2, with 0 where that pair is not present in a file.
So the output for this would be (with three files here; there are more):
1 2 aj0
2 3 bit
3 4 ct0
4 5 d00
3 5 er0
6 7 00i
8 9 00t
......
What I did was read all my .txt files into a single list.
Then,
L <- lapply(seq_along(L), function(i) {
L[[i]][, paste0('DF', i)] <- 1
L[[i]]
})
This will indicate the presence of a value when we merge them.
I don't know how to proceed further. Any inputs will be great. Thanks!
Here is one way to do it with Reduce
# function to generate dummy data
gen_data<- function(){
data.frame(
x = 1:3,
y = 2:4,
z = sample(LETTERS, 3, replace = TRUE)
)
}
# generate list of data frames to merge
L <- lapply(1:3, function(x) gen_data())
# function to merge by x and y and concatenate z
f <- function(x, y){
d <- merge(x, y, by = c('x', 'y'), all = TRUE)
# set merged column to zero if no match is found
d[['z.x']] = ifelse(is.na(d[['z.x']]), 0, d[['z.x']])
d[['z.y']] = ifelse(is.na(d[['z.y']]), 0, d[['z.y']])
d$z <- paste0(d[['z.x']], d[['z.y']])
d['z.x'] <- d['z.y'] <- NULL
return(d)
}
# merge data frames
Reduce(f, L)
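One thing to watch with pasting inside each merge step: when a pair first appears in a later data frame, the already combined string is missing for it, so only a single 0 is added (for example 0i rather than 00i). A variant that may be closer to the output in the question keeps one value column per data frame and pastes only at the end. This is a sketch under assumptions: the column names x, y, z and the generated names z1, z2, ... are illustrative, and the data is typed in from the question's example.

```r
L <- list(
  data.frame(x = c(1, 2, 3, 4, 3), y = c(2, 3, 4, 5, 5),
             z = c("a", "b", "c", "d", "e"), stringsAsFactors = FALSE),
  data.frame(x = c(1, 2, 3, 3), y = c(2, 3, 4, 5),
             z = c("j", "i", "t", "r"), stringsAsFactors = FALSE),
  data.frame(x = c(2, 2, 6, 8), y = c(3, 4, 7, 9),
             z = c("t", "g", "i", "t"), stringsAsFactors = FALSE)
)

# Give each value column a unique name so the merges never collide
for (i in seq_along(L)) names(L[[i]])[3] <- paste0("z", i)

merged <- Reduce(function(x, y) merge(x, y, by = c("x", "y"), all = TRUE), L)

# Fill the gaps with "0", then concatenate position by position
zcols <- paste0("z", seq_along(L))
merged[zcols][is.na(merged[zcols])] <- "0"
merged$z <- do.call(paste0, unname(as.list(merged[zcols])))

res <- merged[c("x", "y", "z")]
res
```

The pair (2, 4), which occurs only in the third data frame, also shows up here as 00g.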
