I have a dataframe and I would like to use a custom function to add multiple new columns to that dataframe. These new columns will be some function of an existing column, but they require the use of a custom function.
I am currently trying to have my custom function return the results in a list, which I then parse into separate columns. This sometimes works by returning a vector of lists, but sometimes this returns a matrix, in which case I get an error like
Error in $<-.data.frame(*tmp*, "z", value = list(1, 2, 2, 3, 3, 4)) : replacement has 2 rows, data has 3
Below is a sample of what I am trying to do.
sample_func <- function(number)
{
list(w = number + 1, u = number + 2)
}
data = data.frame(x = c(1,2,3), y= c(5,6,7))
data$z = sapply(c(1,2,3),sample_func)
data$w = sapply(data$z,"[[","w")
data$u = sapply(data$z,"[[","u")
The function sapply automatically simplifies the result. In this case, you obtain a matrix. You can avoid this behaviour with the argument simplify = FALSE. But it's easier to use lapply because this function doesn't try to simplify the result.
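As a minimal sketch of that difference (re-declaring the question's sample_func so the snippet runs on its own), the shape of each result can be inspected directly. The matrix has 2 rows, which is consistent with the "replacement has 2 rows, data has 3" error in the question.

```r
# Same function as in the question, repeated so the snippet is self-contained
sample_func <- function(number) {
  list(w = number + 1, u = number + 2)
}

# Default behaviour: the three 2-element lists are simplified into a matrix
simplified <- sapply(c(1, 2, 3), sample_func)
is.matrix(simplified)   # TRUE
dim(simplified)         # 2 rows (w, u) by 3 columns

# With simplify = FALSE the result stays a plain list with one element per input
unsimplified <- sapply(c(1, 2, 3), sample_func, simplify = FALSE)
is.matrix(unsimplified) # FALSE
length(unsimplified)    # 3
```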
The command
tmp <- lapply(c(1,2,3), sample_func)
returns a list of lists:
[[1]]
[[1]]$w
[1] 2
[[1]]$u
[1] 3
[[2]]
[[2]]$w
[1] 3
[[2]]$u
[1] 4
[[3]]
[[3]]$w
[1] 4
[[3]]$u
[1] 5
You can use the following command to add the new columns to your data frame:
cbind(data, do.call(rbind, tmp))
# x y w u
# 1 1 5 2 3
# 2 2 6 3 4
# 3 3 7 4 5
Update to address comment:
If possible, you can modify the function to return a data frame.
sample_func <- function(number)
{
data.frame(w = number + 1, u = number + 2)
}
tmp <- lapply(c(1,2,3), sample_func)
cbind(data, do.call(rbind, tmp))
The result will be a data frame with numeric columns.
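A small sketch comparing the two variants may help show why the column types differ. The names list_func, df_func, from_lists, from_frames and vectorised are only illustrative and do not come from the original post. The last line relies on the assumption that the custom function is vectorised, as this arithmetic example happens to be; in that case lapply is not needed at all.

```r
data <- data.frame(x = c(1, 2, 3), y = c(5, 6, 7))

# Illustrative names for the two versions of the function
list_func <- function(number) list(w = number + 1, u = number + 2)
df_func   <- function(number) data.frame(w = number + 1, u = number + 2)

# rbind-ing lists gives a matrix of mode list, so w and u become list columns
from_lists <- cbind(data, do.call(rbind, lapply(data$x, list_func)))

# rbind-ing one-row data frames keeps w and u as ordinary numeric columns
from_frames <- cbind(data, do.call(rbind, lapply(data$x, df_func)))

sapply(from_lists, class)
sapply(from_frames, class)

# Assumption: the function is vectorised, so it can take the whole column at once
vectorised <- cbind(data, df_func(data$x))
```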
This is related to R: use the newly generated data in the previous row
I realized the actual problem I was faced with is a bit more complicated than the example I gave in the thread above - it seems I have to pass 3 arguments to the recursive calculation to achieve what I want. Thus, accumulate2 or reduce may not work. So I open a new question here to avoid possible confusion.
I have the following dataset grouped by ID:
ID <- c(1, 2, 2, 3, 3, 3)
pw <- c(1:6)
add <- c(1, 2, 3, 5, 7, 8)
x <- c(1, 2, NA, 4, NA, NA)
df <- data.frame(ID, pw, add, x)
df
ID pw add x
1 1 1 1 1
2 2 2 2 2
3 2 3 3 NA
4 3 4 5 4
5 3 5 7 NA
6 3 6 8 NA
Within each group, for column x, I want to keep the value of the first row as it is and fill in the remaining rows with the lagged value raised to the power stored in pw, plus the value stored in add. The lagged value has to be updated as I proceed, so each row uses the newly computed value from the row above. So I would like to have:
ID pw add x
1 1 1 1 1
2 2 2 2 2
3 2 3 3 2^3 + 3
4 3 4 5 4
5 3 5 7 4^5 + 7
6 3 6 8 (4^5 + 7)^6 + 8
I have to apply this calculation to a large dataset, so it would be perfect if there is a fast way to do this!
If we want to use accumulate2, we need to specify the arguments correctly: it takes two input vectors ('pw' and 'add') and an initial value (.init), which here is the first value of 'x'. Because the calculation is grouped by 'ID', do the grouping before calling accumulate2. Inside the lambda, ..1, ..2 and ..3 refer to the accumulated value, the current 'pw' and the current 'add', in that order, and the recursive function is built from them.
library(dplyr)
library(purrr)
out <- df %>%
group_by(ID) %>%
mutate(x1 = accumulate2(pw[-1], add[-1], ~ ..1^..2 + ..3,
.init = first(x)) %>%
flatten_dbl ) %>%
ungroup
out$x1
#[1] 1 2 11
#[4] 4 1031 1201024845477409792
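One caveat about the last value, sketched below: it is larger than 2^53, so a double can no longer represent every integer at that size and the printed digits are only approximate. The variable name big is illustrative. If exact results matter for large exponents, an arbitrary-precision package would be needed, which is outside what this answer covers.

```r
big <- 1031^6

# Doubles represent every integer exactly only up to 2^53
big > 2^53        # TRUE

# At this magnitude adding 8 is below the spacing between adjacent doubles
(big + 8) == big  # TRUE
```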
With more than 3 arguments, a for loop would be better
# // initialize an empty vector
out <- c()
# // loop over the `unique` ID
for(id in unique(df$ID)) {
# // create a temporary subset of data based on that id
tmp_df <- subset(df, ID == id)
# // initialize a temporary storage output
tmp_out <- numeric(nrow(tmp_df))
# // initialize first value with the first element of x
tmp_out[1] <- tmp_df$x[1]
# // if the number of rows is greater than 1
if(nrow(tmp_df) > 1) {
# // loop over the rows
for(i in 2:nrow(tmp_df)) {
#// do the recursive calculation and update
tmp_out[i] <- tmp_out[i - 1]^ tmp_df$pw[i] + tmp_df$add[i]
}
}
out <- c(out, tmp_out)
}
out
#[1] 1 2 11
#[4] 4 1031 1201024845477409792
In base R we could use the following solution for more than two arguments.
In this solution I first subset the original data set by ID value.
Then I select the row indices through seq_len(nrow(tmp))[-1], omitting the first row since its value is supplied by init.
In the anonymous function passed to Reduce, the b argument is the accumulated (previous) value, starting from init, and c is the current element of the vector being iterated over, which here is the row number.
So in every iteration the previous value (starting from init) is raised to the power of the current value of pw, and the current value of add is added to it.
cbind(df[-length(df)], unlist(lapply(unique(df$ID), function(a) {
tmp <- subset(df, df$ID == a)
Reduce(function(b, c) {
b ^ tmp$pw[c] + tmp$add[c]
}, init = tmp$x[1],
seq_len(nrow(tmp))[-1], accumulate = TRUE)
}))) |> setNames(c(names(df)))
ID pw add x
1 1 1 1 1.000000e+00
2 2 2 2 2.000000e+00
3 2 3 3 1.100000e+01
4 3 4 5 4.000000e+00
5 3 5 7 1.031000e+03
6 3 6 8 1.201025e+18
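To make the role of the accumulator and the row index concrete, here is a minimal sketch of the same Reduce call on the ID == 3 group only. The vectors are copied from the question's data, and the names prev, i and steps are illustrative.

```r
# Values of the ID == 3 group from the question
pw   <- c(4, 5, 6)
add  <- c(5, 7, 8)
init <- 4            # first x value of the group

# prev is the accumulated value, i is the row index within the group
steps <- Reduce(
  f = function(prev, i) prev^pw[i] + add[i],
  x = seq_along(pw)[-1],   # rows 2 and 3; row 1 is covered by init
  init = init,
  accumulate = TRUE
)

steps  # 4, then 4^5 + 7 = 1031, then 1031^6 + 8
```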
Data
structure(list(ID = c(1, 2, 2, 3, 3, 3), pw = 1:6, add = c(1,
2, 3, 5, 7, 8), x = c(1, 2, NA, 4, NA, NA)), class = "data.frame", row.names = c(NA,
-6L))
Base R, not using Reduce() but rather a while() Loop:
# Split-apply-combine while loop: res => data.frame
res <- do.call(rbind, lapply(with(df, split(df, ID)), function(y){
# While there are any NAs in x:
while(any(is.na(y$x))){
# Store the index of the first NA value: idx => integer scalar
idx <- with(y, head(which(is.na(x)), 1))
# Calculate x at that index using the business rule provided:
# x => numeric vector
y$x[idx] <- with(y, x[(idx-1)] ** pw[idx] + add[idx])
}
# Explicitly define the return object: y => GlobalEnv
y
}
)
)
OR recursive function:
# Recursive function: estimation_func => function()
estimation_func <- function(value_vec, exponent_vec, add_vec){
# Specify the termination condition; when all elements
# of value_vec are no longer NA:
if(all(!(is.na(value_vec)))){
# Return value_vec: numeric vector => GlobalEnv
return(value_vec)
# Otherwise recursively apply the below:
}else{
# Store the index of the first na value: idx => integer vector
idx <- Position(is.na, value_vec)
# Calculate the value of the value_vec at that index;
# using the provided business logic: value_vec => numeric vector
value_vec[idx] <- (value_vec[(idx-1)] ** exponent_vec[idx]) + add_vec[idx]
# Recursively apply function: function => Local Env
return(estimation_func(value_vec, exponent_vec, add_vec))
}
}
# Split data.frame into a list on ID;
# Overwrite x values, applying recursive function;
# Combine list into a data.frame
# res => data.frame
res <- data.frame(
do.call(
rbind,
Map(function(y){y$x <- estimation_func(y$x, y$pw, y$add); y}, split(df, df$ID))
), row.names = NULL
)
I am a beginner in coding in general. I am trying to calculate two parameters from a data frame named a in R. For row i and column j, I am interested in finding:
B = (sum of all values in column j) - a[i,j]
C = (sum of all values in row i) - a[i,j]
For i=1 , j=2, I'm writing:
A = a[1,2]
B = (colSums(a[1:nrow(a),1],na.rm = FALSE, dims = 1) - A)
C = (rowSums(a[1,1:ncol(a)],na.rm = FALSE, dims = 1) - A)
C seems to give the correct answer. However, B gives an error:
Error in base::colSums(x, na.rm = na.rm, dims = dims, ...) :
'x' must be an array of at least two dimensions
I have read other threads as well but couldn't find my answer. Do you have any suggestions?
The problem is due to the command a[1:nrow(a),1]. This command selects all rows of the first column of data frame a but returns the result as a vector (not a data frame). The function colSums does not work with one-dimensional objects (like vectors).
As a side note: You don't need 1:nrow(a) to select all rows. The same is easier to achieve with an empty argument before the comma: a[ , 1].
An example data frame:
dat <- data.frame(a = 1:3, b = 4:6)
# a b
# 1 1 4
# 2 2 5
# 3 3 6
If you select one column, the result is converted into a vector automatically.
dat[ , 1]
# [1] 1 2 3
If you specify drop = FALSE, a one-column data frame is returned.
dat[ , 1, drop = FALSE]
# a
# 1 1
# 2 2
# 3 3
This one-column data frame is a two-dimensional object and can therefore be used with colSums.
colSums(dat[ , 1, drop = FALSE])
# a
# 6
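Going back to the quantities in the question, B and C can also be computed for every cell at once instead of one [i, j] pair at a time. This is only a sketch on a hypothetical data frame (the columns p and q are made up), and it assumes all columns are numeric so that as.matrix gives a numeric matrix.

```r
a <- data.frame(p = c(1, 2, 3), q = c(4, 5, 6))  # hypothetical example data
m <- as.matrix(a)

# B[i, j] = (sum of column j) - a[i, j]
# byrow = TRUE repeats the vector of column sums in every row
B <- matrix(colSums(m), nrow = nrow(m), ncol = ncol(m), byrow = TRUE) - m

# C[i, j] = (sum of row i) - a[i, j]
# rowSums(m) has one value per row and is recycled down each column
C <- rowSums(m) - m

B[1, 2]  # 15 - 4 = 11
C[1, 2]  # (1 + 4) - 4 = 1
```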
I'm getting myself all tied in knots trying to understand what's going on with the code below. I'm trying to create a vector for each row in a data.frame and then append it to the original. I expected the code below to return a list of arrays. Instead it appears to return a list of lists, where the inner list contains the array? How can I get what I want: a new column appended, with each element being an array?
df <- mtcars
library(foreach)
library(iterators)
df$x = foreach (row = iter(df, by='row')) %do% {
profile <- as.numeric(row[,c('mpg', 'cyl', 'disp')])
return(profile)
}
I'm expecting the result:
df[1,]$x == as.numeric(df[1,c('mpg', 'cyl', 'disp')])
instead I get
df[1,]$x[1] == as.numeric(df[1,c('mpg', 'cyl', 'disp')])
(where I'm using == to represent both collections are the same, I realize R probably doesn't implement a list equality operator this way)
By default, foreach returns a list with one element per iteration, so each vector you return ends up wrapped inside a list element. This is why you end up with the 'wrong' output. You can change this by using the .combine option in the foreach loop. If I understand you correctly, you wish to append row by row. This can be achieved by specifying .combine = 'rbind', which uses the familiar rbind function to combine the outputs of each loop iteration. If the order is irrelevant, you can also specify .inorder = FALSE, which may speed up the code. (TRUE is the default, so if the order is relevant, you don't need to bother.)
So try using foreach (row = iter(df, by='row'), .combine='rbind') %do% ... instead and see if it does the job.
This problem is not caused by foreach. You want to assign a vector to a single cell (element) of a data frame rather than to a whole column, and a cell can only hold a multi-element vector if the column is a list. That is why the vector has to be wrapped in a list.
For example.
df1 <- data.frame(x1=1:4, x2=letters[1:4], stringsAsFactors = FALSE)
df1$x1[1] <- 5:8
# Warning message:
# In df1$x1[1] <- 5:8 :
# number of items to replace is not a multiple of replacement length
df1
# x1 x2
# 1 5 a
# 2 2 b
# 3 3 c
# 4 4 d
df1$x1[1] <- list(5:8)
df1
# x1 x2
# 1 5, 6, 7, 8 a
# 2 2 b
# 3 3 c
# 4 4 d
df1$x1[1]
# [[1]]
# [1] 5 6 7 8
df1$x1[[1]]
# [1] 5 6 7 8
Actually, you should use [[ instead of [.
df[1, ]$x[[1]] == as.numeric(df[1,c('mpg', 'cyl', 'disp')])
# [1] TRUE TRUE TRUE
This is because list[1] is still a list, while list[[1]] extracts the first element of the list. See the example below.
lst1 <- list(x1=1:4, x2=letters[1:5])
lst1[1]
# $x1
# [1] 1 2 3 4
lst1[[1]]
# [1] 1 2 3 4
In addition, you can use:
df$x[[1]]
# [1] 21 6 160
instead of:
df[1, ]$x[[1]]
# [1] 21 6 160
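For comparison, a list column of this kind can also be built without foreach. The following is a sketch in base R using lapply, not the approach from the question, and the helper name cols is illustrative.

```r
df <- mtcars
cols <- c("mpg", "cyl", "disp")

# One numeric vector per row, collected in a list with nrow(df) elements;
# assigning that list with $<- creates a list column
df$x <- lapply(seq_len(nrow(df)), function(i) as.numeric(df[i, cols]))

# [[ extracts the vector stored in the first cell
df$x[[1]]
identical(df$x[[1]], as.numeric(df[1, cols]))
```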
I am new to R and I am having trouble with an operation that I did all the time in Python.
I have two data frames (database and creditIDs), and what I want to do is compare one column in database with one column in creditIDs. More specifically, if a value in database[,5] does not exist in creditIDs[,1], I want to delete that entire row in database.
Here is the code:
for (i in 1:lengthColumns){
if (!(database$credit_id[i] %in% creditosVencidos)){
database[i,]<-database[-i,]
}
}
But I keep on getting this error:
50: In `[<-.data.frame`(`*tmp*`, i, , value = structure(list( ... :
replacement element 50 has 9696 rows to replace 1 rows
Could someone explain why this is happening? Thanks!
The which() function returns the indices at which a logical vector is TRUE, much like numpy.where() in Python. Using $ after a data frame with a column name gives you that column as a vector; alternatively you could use d[,column_number].
In this example I'm creating an x and y column which share the first five values, and use which() to slice the dataframe on their by-row equality:
L3 <- LETTERS[1:3]
fac <- sample(L3, 10, replace = TRUE)
(d <- data.frame(x = rep(1:5, 2), y = 1:10, fac = fac))
d = d[which(d$x == d$y),]
d
x y fac
1 1 1 A
2 2 2 B
3 3 3 C
4 4 4 B
5 5 5 B
You will need to adjust this for your column names/numbers.
# Create two example data.frames
creditID <- data.frame(ID = c("896-19", "895-8", "899-1", "899-5"))
database <- data.frame(ID = c("896-19", "camel", "899-1", "goat", "899-1"))
# Method 1
database[database$ID %in% creditID$ID, ]
# Method 2 (subset() function)
database <- subset(database, ID %in% creditID$ID)
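One difference between wrapping a comparison in which() and indexing with the logical vector directly shows up when the comparison can produce NA. The sketch below uses made-up data to illustrate it. Note that %in% never returns NA, so Method 1 and Method 2 above are not affected by this.

```r
# Hypothetical data with a missing value in x
d <- data.frame(x = c(1, 2, NA, 4), y = c(1, 5, 3, 4))

d$x == d$y  # TRUE FALSE NA TRUE

# Plain logical indexing keeps the NA position as a row filled with NA
by_logical <- d[d$x == d$y, ]

# which() returns only the positions that are TRUE, so the NA is dropped
by_which <- d[which(d$x == d$y), ]

nrow(by_logical)  # 3
nrow(by_which)    # 2
```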
Given this data.frame
x y z
1 1 3 5
2 2 4 6
I'd like to add the values of columns x and z plus a coefficient of 10, for every row in dat.
The intended result is this
x y z result
1 1 3 5 16 #(1+5+10)
2 2 4 6 18 #(2+6+10)
But why doesn't this code produce the desired result?
dat <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
Coeff <- 10
# Function
process.xz <- function(v1,v2,cf) {
return(v1+v2+cf)
}
# It breaks here
sm <- apply(dat[,c('x','z')], 1, process.xz(dat$x,dat$y,Coeff ))
# Later I'd do this:
# cbind(dat,sm);
I wouldn't use an apply here. Since the addition + operator is vectorized, you can get the sum using
> process.xz(dat$x, dat$z, Coeff)
[1] 16 18
To write this in your data.frame, don't use cbind, just assign it directly:
dat$result <- process.xz(dat$x, dat$z, Coeff)
The reason it fails is that apply doesn't work like that: you must pass the function itself (not the result of calling it), followed by any additional parameters. Each row of the data frame is then passed, as a single vector, as the first argument to that function.
dat <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
Coeff <- 10
# Function
process.xz <- function(x,cf) {
return(x[1]+x[2]+cf)
}
sm <- apply(dat[,c('x','z')], 1, process.xz,cf=Coeff)
I completely agree that there's no point in using apply here though - but it's good to understand anyway.
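As a short sketch tying the two answers together, the row-wise apply call and a fully vectorised alternative give the same numbers. The use of rowSums here is a suggestion rather than something from the original answers, and the names via_apply and via_rowsums are illustrative.

```r
dat <- data.frame(x = c(1, 2), y = c(3, 4), z = c(5, 6))
Coeff <- 10

# apply() hands each row to the function as one vector; cf is passed through by name
via_apply <- apply(dat[, c("x", "z")], 1, function(row, cf) sum(row) + cf, cf = Coeff)

# Vectorised alternative: sum the selected columns row by row, then add the coefficient
via_rowsums <- rowSums(dat[, c("x", "z")]) + Coeff

unname(via_apply)    # 16 18
unname(via_rowsums)  # 16 18
```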