I have a dataframe:
Start <- data.frame("Number" = 2,"Square" = 4,"Cube" = 8)
A Vector of inputs:
Numbers <- c(3,5)
I want to apply the function SquareCube to each element of Numbers and fill the dataframe with the results:
SquareCube <- function(x){ df <- c(x^2,x^3)
df}
Desired Output:
Filled <- data.frame("Number" = c(2,3,5),"Square" = c(4,9,25),"Cube" = c(8,27,125))
Note: I have already searched for this topic, but in my case the length of the vector Numbers can vary. My intent is to fill the dataframe with the results of the function.
Thanks
If I am reading your question right, you may just be having issues with structure that do.call may be able to help with. I also redefined the function slightly to accommodate the naming:
Start <- data.frame("Number" = 2,"Square" = 4, "Cube" = 8)
Number <- c(3,5)
Define your function:
SquareCube <- function(x){ list(Number=x,Square=x^2,Cube=x^3) }
Then construct the data frame with desired end results:
> rbind(Start, data.frame( do.call(cbind, SquareCube(Number)) ))
Number Square Cube
1 2 4 8
2 3 9 27
3 5 25 125
You can also make a wrapper function and just hand it the Start data and the original Number list that you want to process, which will yield a data frame:
> makeResults <- function(a, b) { rbind(a, data.frame(do.call(cbind,SquareCube(b)))) }
> makeResults(Start, Number)
Number Square Cube
1 2 4 8
2 3 9 27
3 5 25 125
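If you would rather keep the original SquareCube from the question (the version that returns a plain vector), a minimal base R sketch along these lines should also work for a Numbers vector of any length. It assumes the function always returns exactly two values, square first and cube second:

```r
Start <- data.frame(Number = 2, Square = 4, Cube = 8)
Numbers <- c(3, 5)

# Original function from the question: returns c(square, cube)
SquareCube <- function(x) c(x^2, x^3)

# vapply returns a 2 x length(Numbers) matrix, one column per input
res <- vapply(Numbers, SquareCube, numeric(2))

# Row 1 holds the squares, row 2 the cubes
New <- data.frame(Number = Numbers, Square = res[1, ], Cube = res[2, ])
Filled <- rbind(Start, New)
Filled
```

vapply is used instead of sapply so that the result is always a matrix, even when Numbers has a single element.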
The outer() function produces a matrix with exactly the values you want. You can then just convert it to a data frame and rename the columns.
(Filled <- outer(
c(2, 3, 5),
1:3,
FUN = "^"
))
#> [,1] [,2] [,3]
#> [1,] 2 4 8
#> [2,] 3 9 27
#> [3,] 5 25 125
For this matrix, you can use any function you know to
change the class
change the column names
Here, for instance, dplyr::rename():
library(tidyverse)
Filled %>%
as_tibble() %>% # make data frame
rename(Number = V1, Square = V2, Cube = V3) # rename column names
#> # A tibble: 3 x 3
#> Number Square Cube
#> <dbl> <dbl> <dbl>
#> 1 2 4 8
#> 2 3 9 27
#> 3 5 25 125
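If you do not want to load the tidyverse just for this, the same two steps (change the class, change the column names) can be sketched in base R. This is only an illustration of the idea above, not part of the original answer:

```r
# outer() builds the matrix of powers: each number raised to 1, 2 and 3
m <- outer(c(2, 3, 5), 1:3, FUN = "^")

# Change the class, then change the column names
Filled <- as.data.frame(m)
names(Filled) <- c("Number", "Square", "Cube")
Filled
```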
I have a dataframe with a variable number of columns. I want to pass some of those columns as individual arguments to a custom function.
For instance, let's say I have the dataframe df and I want to pass columns a and b as individual arguments to my custom function. The issue is that the list of column names of interest changes depending on the outcome of another operation and could be of any length.
df <- tibble(a = c(1:3), b = c(4:6), c=(7:9), d=c(10:12))
custom_function <- function(...){ do something }
custom_function(df$a, df$b)
I haven't found a clean way to achieve this. Any help would be great.
UPDATE
For better clarity, I need to add that the challenge is that the list of columns of interest is retrieved from another variable, for instance col_names <- c("a","b")
We can capture the ... as a list and then apply the function with do.call
custom_function <- function(fn, data = NULL, ...) {
args <- list(...)
if(length(args) == 1 && is.character(args[[1]]) && !is.null(data)) {
args <- data[args[[1]]]
}
do.call(fn, args)
}
custom_function(pmax, df$a, df$b)
[1] 4 5 6
and we can pass ridge (from the survival package)
> custom_function(ridge, df$a, df$b)
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
attr(,"class")
[1] "coxph.penalty"
...
> custom_function(ridge, data = df, c("a", "b"))
a b
[1,] 1 4
[2,] 2 5
[3,] 3 6
attr(,"class")
[1] "coxph.penalty"
attr(,"pfun")
...
> custom_function(ridge, data = df, col_names)
a b
[1,] 1 4
[2,] 2 5
[3,] 3 6
attr(,"class")
...
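The core of the idea, stripped down to base R, can be sketched as follows. The helper name call_with_columns is hypothetical (it is not from the answer above), and a plain data.frame stands in for the tibble so the snippet has no dependencies:

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)
col_names <- c("a", "b")  # in practice, produced by some earlier step

# Hypothetical helper: select columns by name, then spread them
# out as separate positional arguments to fn via do.call
call_with_columns <- function(fn, data, cols) {
  # unname() so column names cannot clash with fn's own argument names
  args <- unname(as.list(data[cols]))
  do.call(fn, args)
}

res <- call_with_columns(pmax, df, col_names)
res

# The same call works for any number of columns
total <- call_with_columns(sum, df, c("a", "b", "c"))
total
```

Dropping the names is a design choice: it keeps the columns positional, which matches calling custom_function(df$a, df$b) by hand.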
If your outcome is a data.frame, one possibility is to use the tidyverse curly-curly operator, {{ }}. This operator lets an argument passed to our function refer to a column inside a dataframe.
Data
df <- tibble(a = c(1:3), b = c(4:6), c=(7:9), d=c(10:12))
Example
library(dplyr)
operations <- function(df,col1,col2){
df %>%
summarise(
n = n(),
addition = {{col1}} + {{col2}},
subtraction = {{col1}} - {{col2}},
multiplication = {{col1}} * {{col2}},
division = {{col1}} / {{col2}}
)
}
Output
operations(df,a,b)
# A tibble: 3 x 5
n addition subtraction multiplication division
<int> <int> <int> <int> <dbl>
1 3 5 -3 4 0.25
2 3 7 -3 10 0.4
3 3 9 -3 18 0.5
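Note that curly-curly expects bare column names. Since the question's update says the names arrive as strings (for instance from col_names), here is a hedged base R sketch of the same calculation that takes the names as character values instead. The function name operations_chr is made up for illustration:

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# Hypothetical string-based variant: col1 and col2 are column names
operations_chr <- function(data, col1, col2) {
  x <- data[[col1]]
  y <- data[[col2]]
  data.frame(
    n              = nrow(data),
    addition       = x + y,
    subtraction    = x - y,
    multiplication = x * y,
    division       = x / y
  )
}

col_names <- c("a", "b")
out <- operations_chr(df, col_names[1], col_names[2])
out
```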
I have a tibble in which one column is a list containing 2x2 matrices. I want to be able to select a specific element from the matrices across all rows in the tibble. I am able to select a specific element from one tibble row using indexing:
t1 <- tibble(x = 1:2, y = 1, z = x ^ 2 + y)
rM1 <- matrix(c(2,3,1,4), nrow=2, ncol=2, byrow = TRUE)
rM2 <- matrix(c(10,19,9,15), nrow=2, ncol=2, byrow = TRUE)
t1$my.lists <- list(rM1,rM2)
t1[[4]][[2]][[2,2]]
[1] 15
However when I try to access that specific element across multiple rows I get an error:
t1[[4]][1:2][[2,2]]
Error in t1[[4]][1:2][[2, 2]] : incorrect number of subscripts
I have also tried using piping and functions such as slice but still haven't been able to achieve the desired result. In this example I expect a return of:
[1] 4 15
where 4 is the 2x2 element from rM1 and 15 is the 2x2 element from rM2. Of course I could write a loop to achieve this but I assume there is also a more direct way to do this.
We can use sapply to loop over the list column number 4, and extract the elements based on row/column index
sapply(t1[[4]], function(x) x[2, 2])
#[1] 4 15
Or with map
library(dplyr)
library(purrr)
t1 %>%
mutate(new = map_dbl(my.lists, ~ .x[2, 2]))
# A tibble: 2 x 5
# x y z my.lists new
# <int> <dbl> <dbl> <list> <dbl>
#1 1 1 2 <dbl[,2] [2 × 2]> 4
#2 2 1 5 <dbl[,2] [2 × 2]> 15
The OP's code didn't work because the expression below returns a list
t1[[4]][1:2]
#[[1]]
# [,1] [,2]
#[1,] 2 3
#[2,] 1 4
#[[2]]
# [,1] [,2]
#[1,] 10 19
#[2,] 9 15
and the row/column indexing can be done by selecting each list element one by one or using a loop
t1[[4]][1:2][[2]][2,2]
#[1] 15
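A type-safe variant of the sapply call is vapply, which checks that every matrix yields exactly one number. The sketch below uses a plain data.frame with a list column in place of the tibble so that it runs without extra packages; that substitution is an assumption made for the example:

```r
rM1 <- matrix(c(2, 3, 1, 4), nrow = 2, ncol = 2, byrow = TRUE)
rM2 <- matrix(c(10, 19, 9, 15), nrow = 2, ncol = 2, byrow = TRUE)

t1 <- data.frame(x = 1:2, y = 1)
t1$my.lists <- list(rM1, rM2)  # list column holding one matrix per row

# Extract element [2, 2] from every matrix in the list column
elems <- vapply(t1$my.lists, function(m) m[2, 2], numeric(1))
elems
```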
I have a data frame in R that I want to aggregate. The summary function that I want to apply to each subset is a custom function that takes several variables (columns) as input, and returns a vector or list of variable length. As an output, I would like to have a data frame with a column of the grouping variable, and a single other column containing the output vector (of varying length).
To give a mock example, suppose I have the following dataframe:
df <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
> df
particle time state energy
1 X 1 A 9
2 X 2 A 8
3 X 3 B 7
4 X 4 C 5
5 X 5 A 0
6 Y 1 A 1
7 Y 2 B 7
8 Y 3 B 7
9 Z 1 B 3
10 Z 2 C 9
11 Z 3 A 5
12 Z 4 A 6
I would like to obtain for each particle a list of the energy they had every time they changed state. The output I'm looking for is something like this:
>
particle energy
1 X c(9,7,5,0)
2 Y c(1,7)
3 Z c(3,9,5)
To do so, I would define a function like the following:
myfun <- function(state, energy){
tempstate <- state[1]
energyvec <- energy[1]
for(i in 2:length(state)){
if(state[i] != tempstate){
energyvec <- c(energyvec, energy[i])
tempstate <- state[i]
}
}
return(energyvec)
}
And try to pass it to aggregate somehow
The two data structures I tried for this are data.frame and data.table.
In data.frame, using a custom function that returns a vector seems to give the correct output format I am looking for, that is where the output column is really a list, and each row contains a list with the output of the function. However, I can't seem to pass several columns to the function when aggregating this way.
With a data.table, the aggregation is easier to do when considering a function of several variables. However, I can't seem to obtain the output I'm looking for. Indeed,
dt <- data.table(df)
dt[,myfun(state, energy), by= particle]
only returns the first element of energyvec (instead of a vector), and
dt <- data.table(df)
dt[,as.list(myfun(state, energy)), by= particle]
doesn't work as the outputs don't all have the same length.
Is there an alternative way to go to accomplish this?
Thank you very much in advance for all your help!
Here's a tidyverse approach:
library(tidyverse)
df <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
# Hard-code energy to make this reproducible
df$energy <- c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6)
df %>%
group_by(particle) %>%
mutate(
changed_state = coalesce(state != lag(state, 1), TRUE)
) %>%
filter(changed_state) %>%
summarise(
string = toString(energy)
)
#> # A tibble: 3 x 2
#> particle string
#> <fct> <chr>
#> 1 X 9, 7, 5, 0
#> 2 Y 1, 7
#> 3 Z 3, 9, 5
I'd run each line of the pipe individually to see what it does. Basically, we create a changed_state variable by checking whether the current state differs from the previous state, lag(state, 1); coalesce() turns the NA in each group's first row into TRUE. Since we only care about rows where the state changed, we filter where this is TRUE (a more verbose line would be filter(changed_state == TRUE)). The toString function collapses the rows of energy as desired, and we are already "grouped" by particle.
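If you need the energies as an actual list column rather than a string, and want to pass two columns to one function per group as the question asks, a base R sketch with split and Map might look like this. The helper energy_at_changes is a hypothetical vectorised stand-in for the question's myfun, not the original function:

```r
df <- data.frame(
  particle = c(rep("X", 5), rep("Y", 3), rep("Z", 4)),
  state    = c("A", "A", "B", "C", "A", "A", "B", "B", "B", "C", "A", "A"),
  energy   = c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the first energy, plus every energy
# recorded at a row whose state differs from the previous row
energy_at_changes <- function(state, energy) {
  changed <- c(TRUE, state[-1] != state[-length(state)])
  energy[changed]
}

# Split both columns by particle and feed them pairwise to the helper
by_particle <- Map(energy_at_changes,
                   split(df$state, df$particle),
                   split(df$energy, df$particle))

# One row per particle, with a list column of varying-length vectors
out <- data.frame(particle = names(by_particle), stringsAsFactors = FALSE)
out$energy <- unname(by_particle)
out
```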
data.table approach
sample data
#stolen from JasonAizkalns's answer
df <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
df$energy <- c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6)
code
library( data.table )
#create data.table
dt <- as.data.table(df)
#use `uniqlist` to get rownumbers where the value of `state` changes,
# then get these rows into a subset
result <- dt[ data.table:::uniqlist(dt[, c("particle", "state")]), ]
#split the resulting `energy`-column by the contents of the `particle`-column
l <- split( result$energy, result$particle)
# $X
# [1] 9 7 5 0
#
# $Y
# [1] 1 7
#
# $Z
# [1] 3 9 5
#create final output
data.table( particle = names(l), energy = l )
# particle energy
# 1: X 9,7,5,0
# 2: Y 1,7
# 3: Z 3,9,5
Another possible data.table approach
library(data.table)
setDT(DF)[, .(energy=.(.SD[, first(energy), by=.(rleid(state))]$V1)), by=.(particle)]
output:
particle energy
1: X 9,4,6,9
2: Y 2,9
3: Z 7,6,1
data:
set.seed(0L)
DF <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
DF
# particle time state energy
# 1 X 1 A 9
# 2 X 2 A 3
# 3 X 3 B 4
# 4 X 4 C 6
# 5 X 5 A 9
# 6 Y 1 A 2
# 7 Y 2 B 9
# 8 Y 3 B 9
# 9 Z 1 B 7
# 10 Z 2 C 6
# 11 Z 3 A 1
# 12 Z 4 A 2
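To see what the rleid/first combination is doing, it may help to look at a single particle with base R's rle. This is a rough illustration of the run-length idea only, using particle X from the data above:

```r
# State and energy for particle X, taken from the DF printed above
state  <- c("A", "A", "B", "C", "A")
energy <- c(9, 3, 4, 6, 9)

# rle() finds runs of identical consecutive states
r <- rle(state)

# Index of the first row of each run
starts <- cumsum(c(1, head(r$lengths, -1)))

# Energy at the start of each run, i.e. each time the state changes
first_energy <- energy[starts]
first_energy
```

The result matches the X row of the output shown earlier (9, 4, 6, 9).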
I am trying to polish my R skills and have sort of hit my limit.
The issue I am trying to solve is as follows.
Suppose my dataframe is as below:
n = c(2, 15, 31 , 33)
n2 = c( 10 , 9, 10 , 40)
n3 = c( 11 , 10 , 11 , 42)
df = data.frame(n , n2 , n3)
> df
n n2 n3
1 2 10 11
2 15 9 10
3 31 10 11
4 33 40 42
If I would like to go through each row, take a pair of values from it (e.g. 2 and 10), and then go through the rest of the rows to find the same pair, printing the common pairs and their number of occurrences, how can I do that?
In the above example, the only pair that repeats is 10 and 11, at rows 1 and 3.
So far I have thought about the pseudo code as follows
for(each row in the dataframe)
{
for (each of the values in the row)
{
for every pair
}
find a repeated pair
if found store in a dataframe
}
and to generate the pairs using the combn function.
But I am a little lost on how to iterate over the dataframe rows.
Please help.
Thanks a lot!
I think this is what you want. Instead of thinking about selecting every combination of two values for each row, we'll get every combination of two column numbers - which will be the same for every row. Then we use plyr::count as a convenience function to count rows with the same values for an entire data frame at once. This way we can loop over the combinations of column indices rather than over rows. I use apply, but you could write it as a for loop instead.
pairs = combn(ncol(df), m = 2)
result = apply(pairs, MAR = 2, FUN = function(p) {
plyr::count(df[p])
})
names(result) = apply(pairs, MAR = 2, FUN = paste, collapse = "_")
The result is a list where each item is a data frame with two value columns and a freq column giving the number of rows in the original data in which each value pair occurred.
result
# $`1_2`
# n n2 freq
# 1 2 10 1
# 2 15 9 1
# 3 31 10 1
# 4 33 40 1
#
# $`1_3`
# n n3 freq
# 1 2 11 1
# 2 15 10 1
# 3 31 11 1
# 4 33 42 1
#
# $`2_3`
# n2 n3 freq
# 1 9 10 1
# 2 10 11 2
# 3 40 42 1
If you want to omit the values that aren't repeated, we can just subset them out:
lapply(result, subset, freq > 1)
# $`1_2`
# [1] n n2 freq
# <0 rows> (or 0-length row.names)
#
# $`1_3`
# [1] n n3 freq
# <0 rows> (or 0-length row.names)
#
# $`2_3`
# n2 n3 freq
# 2 10 11 2
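If plyr is not available, the counting step can be approximated in base R with aggregate. This is a sketch under the same assumption as above (pairs are formed from column positions, so a pair only counts as repeated when it appears in the same two columns); count_pair is a hypothetical helper name:

```r
df <- data.frame(n  = c(2, 15, 31, 33),
                 n2 = c(10, 9, 10, 40),
                 n3 = c(11, 10, 11, 42))

# Hypothetical helper: count how often each value pair occurs
# in the two columns selected by the index vector p
count_pair <- function(p) {
  pair <- df[p]
  pair$freq <- 1
  aggregate(freq ~ ., data = pair, FUN = sum)
}

# One data frame of counts per combination of two columns
counts <- combn(ncol(df), 2, FUN = count_pair, simplify = FALSE)
names(counts) <- combn(names(df), 2, FUN = paste, collapse = "_")

# Keep only the pairs that occur more than once
repeated <- lapply(counts, function(x) x[x$freq > 1, , drop = FALSE])
repeated
```

Here the list is named by column names (for example n2_n3) rather than by column indices.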
Slightly different method
n = c(2, 15, 31 , 15) # changed the dataset to have some common pairs in n and n2 too
n2 = c( 10 , 9, 10 , 9)
n3 = c( 11 , 10 , 11 , 42)
df = data.frame(n , n2 , n3)
library(dplyr)
library(rlang)
library(utils)
cols<-colnames(df) # define the columns that you want to do the pair checking for
combinations<- as.data.frame(combn(cols,2),stringsAsFactors = FALSE)
# picks up all combinations of columns
#iterates over each pair of columns
all_combs<- lapply(names(combinations), function(x){
df %>%
group_by(!! sym( combinations[[x]][1]),!! sym( combinations[[x]][2])) %>%
filter(n()>1) # groups by the two columns and keeps only the pairs that occur more than once. You can add a distinct() call below if you
#don't want them repeated
})
all_combs_df <- do.call("rbind", all_combs)# all_combs is in a list format, use rbind to convert into a dataframe
all_combs_df
the output is this
n n2 n3
<dbl> <dbl> <dbl>
1 15. 9. 10.
2 15. 9. 42.
3 2. 10. 11.
4 31. 10. 11.
I would like to do calculations across columns in my data, by row. The calculations are "moving" in that I would like to know the difference between two numbers in column 1 and 2, then columns 3 and 4, and so on. I have looked at "loops" and "rollapply" functions, but could not figure this out. Below are three options of what was attempted. Only the third option gives me the result I am after, but it is very lengthy code and also does not allow for automation (the input data will be a much larger matrix, so typing out the calculation for each row won't work).
Please advise how to make this code shorter and/or suggest any other packages/functions to check out that will do the job. THANK YOU!
MY TEST SCRIPT IN R + errors/results
Sample data set
a<- c(1,2,3, 4, 5)
b<- c(1,2,3, 4, 5)
c<- c(1,2,3, 4, 5)
test.data <- data.frame(cbind(a,b*2,c*10))
names(test.data) <- c("a", "b", "c")
Sample of calculations attempted:
OPTION 1
require(zoo)
rollapply(test.data, 2, diff, fill = NA, align = "right", by.column=FALSE)
RESULT 1 (not what we're after. What we need is at the bottom of Option 3)
# a b c
#[1,] NA NA NA
#[2,] 1 2 10
#[3,] 1 2 10
#[4,] 1 2 10
#[5,] 1 2 10
OPTION 2:
results <- for (i in 1:length(nrow(test.data))) {
diff(as.numeric(test.data[i,]), lag=1)
print(results)}
RESULT 2: (again not what we're after)
# NULL
OPTION 3: works, but long way, so would like to simplify code and make generic for any length of observations in my dataframe and any number of columns (i.e. more than 3). I would like to "automate" the steps below, if know number of observations (i.e. rows).
row1=diff(as.numeric(test.data[1,]), lag=1)
row2=diff(as.numeric(test.data[2,]), lag=1)
row3=diff(as.numeric(test.data[3,]), lag=1)
row4=diff(as.numeric(test.data[4,]), lag=1)
row5=diff(as.numeric(test.data[5,]), lag=1)
results.OK=cbind.data.frame(row1, row2, row3, row4, row5)
transpose.results.OK=data.frame(t(as.matrix(results.OK)))
names(transpose.results.OK)=c("diff.ab", "diff.bc")
Final.data = transpose.results.OK
print(Final.data)
RESULT 3: (THIS IS WHAT I WOULD LIKE TO GET, "row1" can be "obs1" etc)
# diff.ab diff.bc
#row1 1 8
#row2 2 16
#row3 3 24
#row4 4 32
#row5 5 40
THE END
Here are the 3 options redone plus a 4th option:
# 1
library(zoo)
d <- t(rollapplyr(t(test.data), 2, diff, by.column = FALSE))
# 2
d <- test.data[-1]
for (i in 1:nrow(test.data)) d[i, ] <- diff(unlist(test.data[i, ]))
# 3
d <- t(diff(t(test.data)))
# 4 - also this works
nc <- ncol(test.data)
d <- test.data[-1] - test.data[-nc]
For any of them to set the names:
colnames(d) <- paste0("diff.", head(names(test.data), -1), colnames(d))
(2) and (4) give this data.frame and (1) and (3) give the corresponding matrix:
> d
diff.ab diff.bc
1 1 8
2 2 16
3 3 24
4 4 32
5 5 40
Use as.matrix or as.data.frame if you want the other.
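The question's wording also mentions differences between columns 1 and 2, then columns 3 and 4, i.e. non-overlapping pairs. If that is what is needed (an assumption, since the desired output shows consecutive differences), option 4 can be adapted as in the sketch below. The four-column data frame wide is made up for the example:

```r
# Made-up data with an even number of columns
wide <- data.frame(a = 1:5, b = 2 * (1:5), c = 10 * (1:5), d = 20 * (1:5))

# Positions of the first and second column of each pair
odd  <- seq(1, ncol(wide) - 1, by = 2)
even <- odd + 1

# Column 2 minus column 1, column 4 minus column 3, and so on
d_pairs <- wide[even] - wide[odd]
names(d_pairs) <- paste0("diff.", names(wide)[odd], names(wide)[even])
d_pairs
```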
An apply-based solution that uses diff row-wise can be written as:
# Result
res <- t(apply(test.data, 1, diff)) #One can change it to data.frame
# Name of the columns
colnames(res) <- paste0("diff.", head(names(test.data), -1),
tail(names(test.data), -1))
res
# diff.ab diff.bc
# [1,] 1 8
# [2,] 2 16
# [3,] 3 24
# [4,] 4 32
# [5,] 5 40