Simple join calculation of dataframe in R - r

I have a data frame a, in which A, B, C, ... are separate entries:
Source Target N
A B 100
A D 200
I have another data frame b holding the entries' attributes:
Name Rate1 Rate2
A 0.1 0.2
B 0.2 0.3
I want to add a new column Flow to a, calculated row by row as (in pseudocode) Flow = a$N * b[Name == a$Source]$Rate1, that is, N multiplied by the Rate1 of the row in b whose Name matches Source. I tried using apply by row, but it felt slow. Is there a faster way?

I don't know what you tried with apply, but here is an answer using merge and transform:
transform(merge(a,b,by.x = 'Source',by.y ='Name'),flow = N*Rate1)
Source Target N Rate1 Rate2 flow
1 A B 100 0.1 0.2 10
2 A D 200 0.1 0.2 20

Here's a fairly expressive solution, fairly similar to the code you tried:
> a$Flow <- a$N*b$Rate1[ match(a$Source, b$Name) ]
> a
Source Target N Flow
1 A B 100 10
2 A D 200 20
The match function is the basis for merge and %in%. It is particularly useful for constructing index vectors to pick from alternatives.
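As a minimal sketch of how match builds the index vector, and how the result differs from merge when a Source has no row in b, consider the following. The extra rows for B and D are made up for illustration and are not part of the question's data.

```r
# Illustrative data: the last two rows of `a` are assumptions added for the demo.
a <- data.frame(Source = c("A", "A", "B", "D"),
                Target = c("B", "D", "A", "A"),
                N      = c(100, 200, 300, 400))
b <- data.frame(Name  = c("A", "B"),
                Rate1 = c(0.1, 0.2),
                Rate2 = c(0.2, 0.3))

# match() returns, for each Source, the row position of that name in b
# (NA when the name is absent from b).
idx <- match(a$Source, b$Name)

# Indexing b$Rate1 by idx lines the rates up with the rows of a.
flow <- a$N * b$Rate1[idx]

# merge() performs an inner join by default, so the unmatched "D" row is dropped;
# all.x = TRUE keeps it, with NA in the columns coming from b.
inner <- merge(a, b, by.x = "Source", by.y = "Name")
left  <- merge(a, b, by.x = "Source", by.y = "Name", all.x = TRUE)
```

The match version keeps the original row order and row count of a, which is usually what you want when adding a column in place.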

Related

How to use Loops or Lapply to generate 100 new variables from a matrix?

So, I have a data frame containing 100 different variables. Now, I want to create 100 new variables, one corresponding to each variable in the original data frame. I have been trying loops and lapply, but haven't had much luck so far.
Here is a snapshot of what the data frame looks like (suppose it is named er):
a b c d
1 2 3 4
5 6 7 8
9 0 1 2
and using each of these 4 variables I have to create a new variable, for a total of 4 new variables. For example, a1 = 0.5 + a, b1 = 0.5 + b, and so on.
I am trying the following two approaches:
for (i in 1:ncol(er)) {
[[i]] <- 0.5 + [[i]]
}
and alternatively, I am trying lapply as follows:
dep <- lapply(er, function(x) {
x<-0.5+er
}
But neither of them works. Can anyone tell me what's wrong with this code, or suggest an efficient way to do this? I have shown just 4 variables here for demonstration; I have around 100 of them.
You could directly add 0.5 (or any number) to the dataframe.
er[paste0(names(er), '1')] <- er + 0.5
er
# a b c d a1 b1 c1 d1
#1 1 2 3 4 1.5 2.5 3.5 4.5
#2 5 6 7 8 5.5 6.5 7.5 8.5
#3 9 0 1 2 9.5 0.5 1.5 2.5
Ronak's answer provides the most efficient way of solving your problem. I'll focus on why your attempts didn't work.
er <- data.frame(a = c(1, 5, 9), b = c(2, 6, 0), c = c(3, 7, 1), d = c(4, 8, 2))
A. for loop:
for (i in 1:ncol(er)) {
[[i]] <- 0.5 + [[i]]
}
Think about how R interprets each iteration of your loop. It goes from 1 to the number of columns of er, substituting each value for the placeholder i, so on the first iteration it will do:
[[1]] <- 0.5 + [[1]]
Which doesn't make sense because you're not indicating what you are indexing at all. Instead, what you would want is:
for (i in 1:ncol(er)) {
er[[i]] <- 0.5 + er[[i]]
}
Here, each iteration will mean "assign to the ith column of er, the ith column of er + 0.5". If you want to further add that you want to create new variables, you would do the following (which is somewhat similar to Ronak's answer, just less efficient):
for (i in 1:ncol(er)) {
er[[paste0(names(er)[i], "1")]] <- 0.5 + er[[i]]
}
As a side note, it is preferred to use seq_along(er) instead of 1:ncol(er).
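A small sketch of why seq_along is the safer choice: with a zero-column data frame (a made-up edge case, not the question's data), 1:ncol(x) counts down from 1 to 0 instead of producing an empty sequence, so the loop body would run twice with invalid indices.

```r
# A data frame with no columns, used only to illustrate the edge case.
empty <- data.frame()

# 1:0 counts DOWN, so this gives two indices even though there are no columns.
bad_index <- 1:ncol(empty)

# seq_along() returns an empty integer vector, so a loop over it never runs.
good_index <- seq_along(empty)

iterations <- 0
for (i in seq_along(empty)) {
  iterations <- iterations + 1
}
```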
B. lapply:
dep <- lapply(er, function(x) {
x<-0.5+er
}
When writing a function, you need to specify what it should return when it is called. Your version computes 0.5 + er (the whole data frame) rather than 0.5 + x (the current column), and the lapply call is missing its closing parenthesis. Here, function(x) { x + 0.5 } is sufficient to indicate that you want to return the variable + 0.5. Since lapply() returns a list (the function's name is short for "list apply"), you'll want to use as.data.frame():
as.data.frame(lapply(er, function(x) { x + 0.5 }))
However, this doesn't change the variable names, so you have to rename them in a separate step:
dep <- as.data.frame(lapply(er, function(x) { x + 0.5 }))
names(dep) <- paste0(names(dep), "1")
cbind(er, dep)
a b c d a1 b1 c1 d1
1 1 2 3 4 1.5 2.5 3.5 4.5
2 5 6 7 8 5.5 6.5 7.5 8.5
3 9 0 1 2 9.5 0.5 1.5 2.5
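One compact base R alternative, offered only as a sketch, is to rename with setNames() in the same expression that builds the new columns. It uses the same er data frame as above and relies on the fact that arithmetic on an all-numeric data frame is applied column by column.

```r
er <- data.frame(a = c(1, 5, 9), b = c(2, 6, 0), c = c(3, 7, 1), d = c(4, 8, 2))

# er + 0.5 adds 0.5 to every column; setNames() gives the copies their new names.
dep <- setNames(er + 0.5, paste0(names(er), "1"))

# Bind the original and the derived columns side by side.
result <- cbind(er, dep)
```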
C. Another way would be using dplyr syntax, which is more elegant and readable:
library(dplyr)
mutate(er, across(everything(), ~ . + 0.5, .names = "{.col}1"))

Can I do on-the-fly calculation on a column using data.table in r?

Hi, I am wondering if anyone could show me how to calculate a value in column C based on the previous value(s) in column C and on another column (B in the example below), and save the calculated value as the new current value in column C.
For example, suppose I first initialize the column C to 1s and the calculation I want to implement is C(1) = 1 + B(1)*0.1*1 and C(2) = C(1) + B(2)*0.1*C(1).
test=data.table(A=1:5,B=c(1,2,1,2,1),C=1)
test
A B C
1: 1 1 1
2: 2 2 1
3: 3 1 1
4: 4 2 1
5: 5 1 1
What I want is:
test
A B C
1: 1 1 1.1
2: 2 2 1.32
3: 3 1 1.452
4: 4 2 1.7424
5: 5 1 1.91664
I could achieve what I want with a for loop or apply(), but I really want to know if this is doable using just data.table, to get some speed-up.
Edit:
As pointed out by Frank in the comments below,
test[, C := cumprod(1 + .1*B)]
will do, since C(i) = C(i-1) + 0.1*B(i)*C(i-1) = C(i-1) * (1 + 0.1*B(i)), so the recursion collapses into a cumulative product. What if I want to supply a more complex custom function?
Many thanks in advance!
Using the formula as presented we have:
test[, C := Reduce(function(c, b) c + .1 * b * c, B, init = 1, accumulate = TRUE)[-1] ]
Of course, as pointed out already it simplifies in this particular case since we can write the body of the function as c * ( 1 + .1 * b) which implies a cumulative product of the parenthesized portion.
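To make the equivalence concrete, here is a sketch in base R on a plain vector (no data.table needed). It checks that the Reduce recursion matches the cumulative product, and then shows a recursion that does not reduce to cumprod. The cap of 1.5 in the second step function is a made-up rule for illustration, not something from the question.

```r
B <- c(1, 2, 1, 2, 1)

# The recursion from the question: next = previous + 0.1 * b * previous.
step <- function(prev, b) prev + 0.1 * b * prev

# accumulate = TRUE keeps every intermediate value; [-1] drops the initial 1.
compound <- Reduce(step, B, accumulate = TRUE, 1)[-1]

# The closed form pointed out in the thread.
closed_form <- cumprod(1 + 0.1 * B)

# A hypothetical "more complex" rule: same growth, but capped at 1.5.
# This has no simple cumprod equivalent, so Reduce (or a loop) is needed.
capped_step <- function(prev, b) min(prev + 0.1 * b * prev, 1.5)
capped <- Reduce(capped_step, B, accumulate = TRUE, 1)[-1]
```

Inside data.table the same call can sit on the right-hand side of :=, as in the answer above.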
It seems you need to apply the function cumulatively
library(data.table)
library(zoo)
test=data.table(A=1:5,B=c(1,2,1,2,1),C=1)
z <- function(b){1+b*0.1}
test[,C:=cumprod(rollapply(B, width=1, FUN=z))]
But I agree that there's really no need to bring in zoo here. Frank's solution is more elegant and concise.
test[,C:=cumprod(1 + .1*B)]
I don't believe there is a similar data.table function, but it seems like accumulate from purrr is what you want. Simple example below, but the input could be rows of a data.table also.
library(purrr)
accumulate(1:4, function(x, y){2*x + y})
# [1] 1 4 11 26

Solve a linear equation on every row in datatable

I did some linear regression and I want to forecast the moment of exceeding a certain value.
This means I have three columns:
a= slope
b = intercept
c = target value
On every row I want to calculate
solve(a,(c-b))
How do I do this in an efficient way, without using a loop (it is an extensive dataset)?
So you basically want to solve the equation
c = a*x + b
for x for each row? That has the pretty simple solution of
x = (c-b)/a
which is a vectorized operation in R. No loop necessary
dd <- data.frame(
a = 1:5,
b = -2:2,
c = 10:14
)
transform(dd, solution=(c-b)/a)
# a b c solution
# 1 1 -2 10 12.0
# 2 2 -1 11 6.0
# 3 3 0 12 4.0
# 4 4 1 13 3.0
# 5 5 2 14 2.4
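One case the vectorized formula does not handle gracefully is a zero slope, where the division yields Inf or NaN. A possible guard, sketched here on made-up data, is to return NA for such rows. Whether NA is the right answer for a flat regression line is an assumption that depends on your use case.

```r
# Made-up rows; the second one has a zero slope.
dd <- data.frame(
  a = c(2, 0, 4),
  b = c(1, 1, 1),
  c = c(5, 5, 9)
)

# Plain division gives Inf for the zero-slope row.
raw <- (dd$c - dd$b) / dd$a

# ifelse() is vectorized, so this is still loop-free.
dd$solution <- ifelse(dd$a == 0, NA_real_, (dd$c - dd$b) / dd$a)
```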
In addition to the responses above, you could also use the mutate function from the tidyverse, like so:
library(magrittr)
library(tidyverse)
# solve() expects a square matrix, so it cannot be applied to whole columns;
# the closed-form solution (c - b) / a is vectorized and works row by row
dataframe %<>% mutate(prediction = (c - b) / a)
In this example we are assuming the columns a, b, and c are in a table called dataframe. We then use the %<>% operator from the magrittr library, which pipes dataframe into the function that follows and assigns the result back to dataframe.
Here is a simple way using the Vectorize function (d is the data frame holding the columns a, b, and c):
solve_vec <- Vectorize(solve)
solve_vec(d$a, d$c - d$b)
# [1] 12.0 6.0 4.0 3.0 2.4

recursive data subset based on column attributes in R

I have a data frame with 10K rows and 6 columns. The first two columns are factors.
A B C D E F
A1 B1 0.1 0.2 0.3 0.4
A2 B2 .........................
A1 B3 .........................
A1 B1 0.3 ...................
Now I want to generate models (using my function F) based on different subsets of the data (different rows), that is, different combinations of the values of A and B.
In the example above, I should call my function F 6 times, once for each element of the Cartesian product
(A1,A2) x (B1,B2,B3). How can I do this efficiently in R without an explicit loop?
To avoid confusion: applying F to the (A1,B1) combination, for example, means using rows 1 and 4 and columns 3 to 6.
The other combinations are handled similarly.
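Try the following, which assumes A and B are factors whose levels are named A1, A2, ... and B1, B2, ...:
nA <- nlevels(df$A)
nB <- nlevels(df$B)
lapply(seq_len(nA * nB) - 1, function(x)
  myFunction(df[df$A == paste0("A", 1 + floor(x / nB)) &
                df$B == paste0("B", 1 + (x %% nB)), 3:6]))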
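An alternative that does not depend on how the levels are named is to let split() build the subsets, sketched below. The data values other than the first row are made up, and fit_subset is a hypothetical stand-in for the asker's model function F.

```r
# Toy data in the shape described in the question; most values are invented.
df <- data.frame(
  A = factor(c("A1", "A2", "A1", "A1")),
  B = factor(c("B1", "B2", "B3", "B1")),
  C = c(0.1, 0.5, 0.2, 0.3),
  D = c(0.2, 0.6, 0.3, 0.4),
  E = c(0.3, 0.7, 0.4, 0.5),
  F = c(0.4, 0.8, 0.5, 0.6)
)

# Hypothetical stand-in for the real model function: reports the subset size.
fit_subset <- function(d) list(n = nrow(d))

# Splitting on list(A, B) creates one piece per element of the Cartesian
# product of the levels, including combinations with no rows.
groups <- split(df[3:6], list(df$A, df$B))

# One call of the model function per combination, without an explicit loop.
results <- lapply(groups, fit_subset)
```

Passing drop = TRUE to split() would skip the empty combinations, which may be preferable if the model function cannot handle zero rows.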

Multiple plyr functions and operations in one statement?

I have a dataset as follows:
i,o,c
A,4,USA
B,3,CAN
A,5,USA
C,4,MEX
C,1,USA
A,3,CAN
I want to reform this dataset into a form as follows:
i,u,o,c
A,3,4,2
B,1,3,1
C,2,2.5,2
Here, u is the number of rows for each value of i in the dataset, o = (sum of o) / u, and c is the number of unique countries.
I can get u with the following statement and by using plyr:
count(df1,vars="i")
I can also get some of the other variables by using the insights learned from my previous question. I can achieve my intended results laboriously, by saving to multiple data frames and then joining them together, but I wonder if there is a one-line solution, or simply a better way of doing this than my current long-winded way.
Thanks !
I don't understand how this is different from your earlier question. The approach is the same:
library(plyr)
ddply(mydf, .(i), summarise,
u = length(i),
o = mean(o),
c = length(unique(c)))
# i u o c
# 1 A 3 4.0 2
# 2 B 1 3.0 1
# 3 C 2 2.5 2
If you prefer a data.table solution:
> library(data.table)
> DT <- data.table(mydf)
> DT[, list(u = .N, o = mean(o), c = length(unique(c))), by = "i"]
i u o c
1: A 3 4.0 2
2: B 1 3.0 1
3: C 2 2.5 2
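For comparison, the same split-apply-combine pattern can be sketched in base R without any packages. The data frame is rebuilt from the question's listing, and summarise_group is a hypothetical helper name introduced here.

```r
mydf <- data.frame(
  i = c("A", "B", "A", "C", "C", "A"),
  o = c(4, 3, 5, 4, 1, 3),
  c = c("USA", "CAN", "USA", "MEX", "USA", "CAN")
)

# Hypothetical helper: collapse one group's rows into a single summary row.
summarise_group <- function(d) {
  data.frame(i = d$i[1],
             u = nrow(d),               # number of rows in the group
             o = mean(d$o),             # sum of o divided by u
             c = length(unique(d$c)))   # number of distinct countries
}

# Split by i, summarise each piece, and stack the one-row results.
result <- do.call(rbind, lapply(split(mydf, mydf$i), summarise_group))
rownames(result) <- NULL
```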

Resources