Solve a linear equation on every row in a data.table - R

I did some linear regression and I want to forecast the moment of exceeding a certain value.
This means I have three columns:
a = slope
b = intercept
c = target value
On every row I want to calculate
solve(a,(c-b))
How do I do this efficiently, without using a loop (it is a large dataset)?

So you basically want to solve the equation
c = a*x + b
for x for each row? That has the pretty simple solution of
x = (c-b)/a
which is a vectorized operation in R, so no loop is necessary:
dd <- data.frame(
a = 1:5,
b = -2:2,
c = 10:14
)
transform(dd, solution=(c-b)/a)
# a b c solution
# 1 1 -2 10 12.0
# 2 2 -1 11 6.0
# 3 3 0 12 4.0
# 4 4 1 13 3.0
# 5 5 2 14 2.4
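Since the title mentions data.table, the same vectorized arithmetic works directly inside := as well (a minimal sketch, reusing the example data above and assuming your real table has columns a, b, and c):
library(data.table)
dt <- as.data.table(dd)        # example data from above
dt[, solution := (c - b) / a]  # still fully vectorized, no loop
dt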

In addition to the answers above, you could also use the mutate() function from the tidyverse, like so:
library(magrittr)
library(tidyverse)
dataframe %<>% mutate(prediction = (c - b) / a)
In this example we assume the columns a, b, and c are in a data frame called dataframe. The %<>% operator from the magrittr library then says "pipe dataframe into the expression that follows and assign the result back to dataframe".
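Because %<>% both pipes and reassigns, the line above is equivalent to the more explicit form:
dataframe <- dataframe %>% mutate(prediction = (c - b) / a)  # same result, without the assignment pipe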

Here is a simple way using the Vectorize function:
solve_vec <- Vectorize(solve)
solve_vec(dd$a, dd$c - dd$b)
# [1] 12.0 6.0 4.0 3.0 2.4
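Under the hood Vectorize() builds its wrapper with mapply(), so an equivalent call on the same example data is shown below. Note that each element still goes through a full 1x1 solve(), so the plain (c - b)/a arithmetic above will be faster on a large dataset.
mapply(solve, dd$a, dd$c - dd$b)
# [1] 12.0 6.0 4.0 3.0 2.4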

Related

How to use Loops or Lapply to generate 100 new variables from a matrix?

So, I have a data frame containing 100 different variables. Now, I want to create 100 new variables corresponding to each of the variables in the original data frame. I have been trying loops and lapply to work this out, but haven't had much luck so far.
Here is a snapshot of what the data frame looks like (suppose my data frame is named er):
a b c d
1 2 3 4
5 6 7 8
9 0 1 2
Using each of these 4 variables I have to create a new variable, so 4 new variables in total. The new variables should be, say, a1 = 0.5 + a, b1 = 0.5 + b, and so on.
I am trying the following two approaches:
for (i in 1:ncol(er)) {
[[i]] <- 0.5 + [[i]]
}
and alternatively, I am trying lapply as follows:
dep <- lapply(er, function(x) {
x<-0.5+er
}
But neither of them is working. Can anyone tell me what's wrong with this code, or suggest an efficient way to do it? I have shown just 4 variables here for demonstration; I have around 100 of them.
You could directly add 0.5 (or any number) to the dataframe.
er[paste0(names(er), '1')] <- er + 0.5
er
# a b c d a1 b1 c1 d1
#1 1 2 3 4 1.5 2.5 3.5 4.5
#2 5 6 7 8 5.5 6.5 7.5 8.5
#3 9 0 1 2 9.5 0.5 1.5 2.5
Ronak's answer provides the most efficient way of solving your problem. I'll focus on why your attempts didn't work.
er <- data.frame(a = c(1, 5, 9), b = c(2, 6, 0), c = c(3, 7, 1), d = c(4, 8, 2))
A. for loop:
for (i in 1:ncol(er)) {
[[i]] <- 0.5 + [[i]]
}
Think about how R interprets each element of your loop. It will go from 1 to however many columns er has, using i as the placeholder, so on the first iteration it will do:
[[1]] <- 0.5 + [[1]]
Which doesn't make sense because you're not indicating what you are indexing at all. Instead, what you would want is:
for (i in 1:ncol(er)) {
er[[i]] <- 0.5 + er[[i]]
}
Here, each iteration will mean "assign to the ith column of er, the ith column of er + 0.5". If you want to further add that you want to create new variables, you would do the following (which is somewhat similar to Ronak's answer, just less efficient):
for (i in 1:ncol(er)) {
er[[paste0(names(er)[i], "1")]] <- 0.5 + er[[i]]
}
As a side note, it is preferred to use seq_along(er) instead of 1:ncol(er), since seq_along() correctly returns an empty sequence when er has zero columns, whereas 1:ncol(er) would give c(1, 0) and still run the loop.
B. lapply:
dep <- lapply(er, function(x) {
x<-0.5+er
}
When creating a function, you need to specify what you want it to return. Here, function(x) { x + 0.5 } is sufficient to indicate that you want to return the variable + 0.5. Since lapply() returns a list (the function's name is short for "list apply"), you'll want to wrap it in as.data.frame():
as.data.frame(lapply(er, function(x) { x + 0.5 }))
However, this doesn't change the variable names, and there's no easy way to do that inside the lapply() call itself:
dep <- as.data.frame(lapply(er, function(x) { x + 0.5 }))
names(dep) <- paste0(names(dep), "1")
cbind(er, dep)
a b c d a1 b1 c1 d1
1 1 2 3 4 1.5 2.5 3.5 4.5
2 5 6 7 8 5.5 6.5 7.5 8.5
3 9 0 1 2 9.5 0.5 1.5 2.5
C. Another way would be using dplyr syntax, which is more elegant and readable:
library(dplyr)
mutate(er, across(everything(), ~ . + 0.5, .names = "{.col}1"))

Can I do on-the-fly calculation on a column using data.table in r?

Hi, I am wondering if anyone can show me how to calculate each value in a column C based on the previous value(s) of C and another column B, saving the calculated value as the new current value of C.
For example, suppose I first initialize column C to 1s and the calculation I want to implement is C[1] = 1 + B[1]*0.1*1 and C[2] = C[1] + B[2]*0.1*C[1].
test=data.table(A=1:5,B=c(1,2,1,2,1),C=1)
test
A B C
1: 1 1 1
2: 2 2 1
3: 3 1 1
4: 4 2 1
5: 5 1 1
What I want is:
test
A B C
1: 1 1 1.1
2: 2 2 1.32
3: 3 1 1.452
4: 4 2 1.7424
5: 5 1 1.91664
I could achieve what I want with for loop or apply() but I really want to know if this is doable just using data.table and get some speed up.
Edit:
As pointed out by Frank in the comments below,
test[, C := cumprod(1 + .1*B)]
will do since multiplication is distributive. What if I want to supply a more complex custom function?
Many thanks in advance!
Using the formula as presented we have:
test[, C := Reduce(function(c, b) c + .1 * b * c, B, init = 1, accumulate = TRUE)[-1] ]
Of course, as pointed out already it simplifies in this particular case since we can write the body of the function as c * ( 1 + .1 * b) which implies a cumulative product of the parenthesized portion.
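To address the edit about a more complex custom function: the same Reduce() pattern handles any recurrence that depends only on the previous value of C and the current B. For example, with a made-up, non-distributive update chosen purely for illustration:
test[, C := Reduce(function(c, b) sqrt(c) + 0.1 * b, B, init = 1, accumulate = TRUE)[-1] ]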
It seems you need to apply the function cumulatively
library(data.table)
library(zoo)
test=data.table(A=1:5,B=c(1,2,1,2,1),C=1)
z <- function(b){1+b*0.1}
test[,C:=cumprod(rollapply(B, width=1, FUN=z))]
But I agree that there's really no need to bring zoo here. Frank's solution is more elegant and concise.
test[,C:=cumprod(1 + .1*B)]
I don't believe there is a similar data.table function, but it seems like accumulate from purrr is what you want. A simple example is below, but the input could just as well be a column of your data.table.
library(purrr)
accumulate(1:4, function(x, y){2*x + y})
# [1] 1 4 11 26
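Applied to the test data from the question, accumulate() reproduces the desired column (a sketch; .init supplies the starting value of 1 and [-1] drops it from the result):
library(purrr)
test[, C := accumulate(B, function(c, b) c + 0.1 * b * c, .init = 1)[-1]]
# C becomes 1.1 1.32 1.452 1.7424 1.91664, matching the desired output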

Using sum(x:y) to create a new variable/vector from existing values in R

I am working in R with a data frame d:
ID <- c("A","A","A","B","B")
eventcounter <- c(1,2,3,1,2)
numberofevents <- c(3,3,3,2,2)
d <- data.frame(ID, eventcounter, numberofevents)
> d
ID eventcounter numberofevents
1 A 1 3
2 A 2 3
3 A 3 3
4 B 1 2
5 B 2 2
where numberofevents is the highest value in the eventcounter for each ID.
Currently, I am trying to create an additional vector z <- c(6,6,6,3,3).
If the numberofevents == 3, it is supposed to calculate sum(1:3), equally to 3 + 2 + 1 = 6.
If the numberofevents == 2, it is supposed to calculate sum(1:2) equally to 2 + 1 = 3.
Working with a large dataset, I thought it might be convenient to create this additional vector
using the sum function in R, d$z <- sum(1:d$numberofevents), i.e.
sum(1:3) # for the rows 1-3
and
sum(1:2) # for the rows 4-5.
However, I always get this warning:
Numerical expression has x elements: only the first is used.
You can try ave
d$z <- with(d, ave(eventcounter, ID, FUN=sum))
Or using data.table
library(data.table)
setDT(d)[,z:=sum(eventcounter), ID][]
Try using the apply family of functions in R, for example sapply:
sapply(d$numberofevents, function(x) sum(1:x))
It works for me.
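Since sum(1:n) is just the nth triangular number n*(n+1)/2, you can also skip the per-element function call entirely with a fully vectorized one-liner:
d$z <- d$numberofevents * (d$numberofevents + 1) / 2
# 6 6 6 3 3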

In R how to make two columns an ID and get a frequency histogram for each ID

Example Dataset:
A 2 1.5
A 2 1.5
B 3 2.0
B 3 2.5
B 3 2.6
C 4 3.2
C 4 3.5
So here I would like to create 3 frequency histograms based on the first two columns, i.e. A2, B3, and C4. I am new to R, so any help would be greatly appreciated. Should I flatten out the data so it's like this:
A 2 1.5 1.5
B 3 2.0 2.5 2.6 etc...
Thank you
Here's an alternative solution based on the by() function, which is just a wrapper for the tapply() that Jilber suggested. You might find the ex variable useful:
set.seed(1)
dat <- data.frame(First = LETTERS[1:3], Second = 1:2, Num = rnorm(60))
# Extract third column per each unique combination of columns 'First' and 'Second'
ex <- by(dat, INDICES =
# Create names like A.1, A.2, ...
apply(dat[,c("First","Second")], MARGIN=1, FUN=function(z) paste(z, collapse=".")),
# Extract third column per each unique combination
FUN=function(x) x[,3])
# Draw histograms
par(mfrow=c(3,2))
for(i in 1:length(ex)){
hist(ex[[i]], main=names(ex)[i], xlim=extendrange(unlist(ex)))
}
Assuming your dataset is called x and the columns are a, b, and c respectively, I think this command should do the trick:
library(lattice)
histogram(~c|a+b,x)
Notice that this requires you to have the lattice package installed.
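If you prefer ggplot2, faceting gives a similar set of per-group histograms (a sketch, assuming the same x with columns a, b, and c):
library(ggplot2)
ggplot(x, aes(c)) +
  geom_histogram(bins = 10) +        # bin count chosen arbitrarily
  facet_wrap(~ interaction(a, b))    # one panel per A.2, B.3, C.4 combination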

Multiple plyr functions and operations in one statement?

I have a dataset as follows:
i,o,c
A,4,USA
B,3,CAN
A,5,USA
C,4,MEX
C,1,USA
A,3,CAN
I want to reform this dataset into a form as follows:
i,u,o,c
A,3,4,2
B,1,3,1
C,2,2.5,1
Here, u represents the number of occurrences of each value of i in the dataset, o = (sum of o) / u, and c is the number of unique countries.
I can get u with the following statement and by using plyr:
count(df1,vars="i")
I can also get some of the other variables by using the insights learned from my previous question. By laboriously saving to multiple data frames and then joining them together I can achieve my intended result, but I wonder if there is a one-line solution, or simply a better way of doing this than my current long-winded approach.
Thanks !
I don't understand how this is different from your earlier question. The approach is the same:
library(plyr)
ddply(mydf, .(i), summarise,
u = length(i),
o = mean(o),
c = length(unique(c)))
# i u o c
# 1 A 3 4.0 2
# 2 B 1 3.0 1
# 3 C 2 2.5 2
If you prefer a data.table solution:
> library(data.table)
> DT <- data.table(mydf)
> DT[, list(u = .N, o = mean(o), c = length(unique(c))), by = "i"]
i u o c
1: A 3 4.0 2
2: B 1 3.0 1
3: C 2 2.5 2
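For completeness, the dplyr equivalent, where n() plays the role of .N and n_distinct() replaces length(unique(...)):
library(dplyr)
mydf %>%
  group_by(i) %>%
  summarise(u = n(), o = mean(o), c = n_distinct(c))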
