How to use a vector in lag operation in R

In R, how can we pass a vector instead of a single value to the lag function? For example, in Lag(x, k=2), instead of 2 I want to use a vector, because I want to lag each row by a different value. One row could have a lag of 3, while another could have a lag of 0, and so on.
Example:
a #lags d
1 0 1
2 1 1
4 2 1
3 0 3
1 1 3

I think you may need to write your own function for this task. I wrote one that I think will do what you need, or at least point you in the right direction:
x1 <- c(75,98,65,45,78,94,123,54) # a fake data set for us to lag
y1 <- c(2,3,1,4,1,2,3,5) # vector of values to lag by
# the function below takes the data, x, and lags each element by the matching entry of y
dynlag <- function(x, y) {
  idx <- seq_along(x) - y # position each element is taken from
  idx[idx < 1] <- NA # lags that reach before the first element give NA
  x[idx]
}
# test out the function
dynlag(x1, y1)
Hope this helps. :)

Here is a solution using index arithmetic:
D <- read.table(header=TRUE, text=
'a lags d
1 0 1
2 1 1
4 2 1
3 0 3
1 1 3')
i <- seq_along(D$a)
erg <- D$a[i - D$lags]
all.equal(erg, D$d)

Related

How to combine several binary variables into a new categorical variable

I am trying to combine several binary variables into one categorical variable. I have ten such variables, each describing a task of a job.
Data looks something like this:
Personal_Help <- c(1,1,2,1,2,1)
PR <- c(2,1,1,2,1,2)
Fundraising <- c(1,2,1,2,2,1)
# etc.
My goal is to combine them into one variable, where the value 1 (=Yes) of each binary variable will be a separate level of the categorical variable.
To illustrate what I imagine (wrong code obviously):
If Personal_Help = 1 -> Jobcontent = 1
If PR = 1 -> Jobcontent = 2
If Fundraising = 1 -> Jobcontent = 3
etc.
Thank you very much in advance!
Edit:
Thanks for your answers, and apologies for my late reply. I think more context from my side is needed. The goal of combining the binary variables into a categorical variable is to show them in one graphic (using ggplot). The graphic should display how many respondents report the above-mentioned tasks as part of their work.
If you're interested only in the first occurrence of 1 among your variables:
df <- data.frame(t(data.frame(Personal_Help, PR,Fundraising)))
result <- sapply(df, function(x) which(x==1)[1])
X1 X2 X3 X4 X5 X6
1 1 2 1 2 1
Of course, this will depend on what you want to do when multiple values are 1 as asked in comments.
Since there are three different variables, and each variable can take either of 2 values, there are 2^3 = 8 possible unique combinations of the three variables, each of which should have a unique number associated.
One way to do this is to imagine each column as being a digit in a three digit binary number. If we subtract 1 from each column, we get a 1 for "no" and a 0 for "yes". This means that our eight possible unique values, and the binary numbers associated with each would be:
binary decimal
0 0 0 = 0
0 0 1 = 1
0 1 0 = 2
0 1 1 = 3
1 0 0 = 4
1 0 1 = 5
1 1 0 = 6
1 1 1 = 7
This system will work for any number of columns, and can be achieved as follows:
Personal_Help <- c(1,1,2,1,2,1)
PR <- c(2,1,1,2,1,2)
Fundraising <- c(1,2,1,2,2,1)
df <- data.frame(Personal_Help, PR, Fundraising)
New_var <- 0
for(i in seq_along(df)) New_var <- New_var + (2^(i - 1)) * (df[[i]] - 1)
df$New_var <- New_var
The end result would then be:
df
#> Personal_Help PR Fundraising New_var
#> 1 1 2 1 2
#> 2 1 1 2 4
#> 3 2 1 1 1
#> 4 1 2 2 6
#> 5 2 1 2 5
#> 6 1 2 1 2
In your actual data, there will be 1024 possible combinations of tasks, so this will generate numbers for New_var between 0 and 1023. Because of how it is generated, you can actually use this single number to reverse engineer the entire row as long as you know the original column order.
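As a sketch of the reverse step mentioned above, each column can be recovered from New_var by integer division and remainder. This assumes the same column order and the 1 = yes / 2 = no coding. The helper name decode_row_code is hypothetical.

```r
# Hypothetical helper: rebuild the original 1/2 columns from the combined code.
# Column i contributed 2^(i - 1) * (value - 1), so its bit is
# (code %/% 2^(i - 1)) %% 2, and adding 1 restores the 1/2 coding.
decode_row_code <- function(code, col_names) {
  cols <- lapply(seq_along(col_names), function(i) (code %/% 2^(i - 1)) %% 2 + 1)
  names(cols) <- col_names
  as.data.frame(cols)
}

New_var <- c(2, 4, 1, 6, 5, 2)  # the codes from the result above
decoded <- decode_row_code(New_var, c("Personal_Help", "PR", "Fundraising"))
decoded
```

Applied to the six codes above, this should give back the three original columns.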
As @ulfelder commented, you need to clarify how you want to handle cases where more than one column is 1.
Assuming you want to use the first column equal to 1, you can use which.min(), applied by row:
data <- data.frame(Personal_Help, PR, Fundraising)
data$Jobcontent <- apply(data, MARGIN = 1, which.min)
Result:
Personal_Help PR Fundraising Jobcontent
1 1 2 1 1
2 1 1 2 1
3 2 1 1 2
4 1 2 2 1
5 2 1 2 2
6 1 2 1 1
If you’d like Jobcontent to include the name of each job, you can index into names(data):
data$Jobcontent <- names(data)[apply(data, MARGIN = 1, which.min)]
Result:
Personal_Help PR Fundraising Jobcontent
1 1 2 1 Personal_Help
2 1 1 2 Personal_Help
3 2 1 1 PR
4 1 2 2 Personal_Help
5 2 1 2 PR
6 1 2 1 Personal_Help
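One caveat with which.min(): if a row contains no 1 at all (every task answered with 2), it still returns 1, because every column ties for the minimum. A minimal sketch of guarding against that, assuming such rows can occur in the real data and that NA is an acceptable value for them (the data below is made up for illustration):

```r
# Made-up rows: the second respondent reports none of the tasks
tasks <- data.frame(
  Personal_Help = c(1, 2, 2),
  PR            = c(2, 2, 1),
  Fundraising   = c(1, 2, 1)
)

# which.min() alone would return 1 for the all-2 row
jobcontent <- apply(tasks, MARGIN = 1, which.min)

# Overwrite rows without any 1 (= yes) with NA
jobcontent[rowSums(tasks == 1) == 0] <- NA
jobcontent
```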
max.col may help here:
Jobcontent <- max.col(-data.frame(Personal_Help, PR, Fundraising), "first")
Jobcontent
#> [1] 1 1 2 1 2 1
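Given the edit to the question (the aim is a plot of how many respondents report each task), a single categorical variable may not be needed, since one respondent can report several tasks. A sketch that counts the yes answers per task in base R; the names tasks, yes_counts and plot_data are illustrative:

```r
Personal_Help <- c(1, 1, 2, 1, 2, 1)
PR <- c(2, 1, 1, 2, 1, 2)
Fundraising <- c(1, 2, 1, 2, 2, 1)
tasks <- data.frame(Personal_Help, PR, Fundraising)

# Number of respondents answering 1 (= yes) for each task
yes_counts <- colSums(tasks == 1)

# Long format: one row per task, which is the shape a bar chart expects
plot_data <- data.frame(task = names(yes_counts), n = as.vector(yes_counts))
plot_data

# With ggplot2 loaded, something along these lines should draw the chart:
# ggplot(plot_data, aes(task, n)) + geom_col()
```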

How to iterate over a column and assign value based on another column without a for loop?

I am trying to efficiently assign a value to a column, based on another column, but without a for loop as this takes too long.
I'm doing something like this: If the reference column value is greater than a certain random number, I assign 1 to the new column. Otherwise, assign 0. Can't figure out the best way to do this without a loop. I tried dplyr and case_when, but that wasn't iterating over each row.
Thanks!
for (i in 1:nrow(data)) {
  if (data$value[i] > runif(1, 0, 1.7)) {
    temp$newValue[i] <- 1
  } else {
    temp$newValue[i] <- 0
  }
}
I'm not sure I understood you correctly; please comment if I have misunderstood. Here is how I use A to generate C:
c0 <- data.frame(A = c(1,4,6,3,7,3), B = c(2,8,2,4,9,4))
# one random threshold per row, as in the loop from the question
c0$C <- ifelse(c0$A > runif(nrow(c0), 0, 1.7), 1, 0)
c0
A B C
1 2 0
4 8 1
6 2 1
3 4 1
7 9 1
3 4 1
Does this solve your problem?
DATA:
set.seed(1)
df <- data.frame(
refcol = rnorm(10)
)
randvalue <- 0
SOLUTION:
df$newcol <- ifelse(df$refcol > randvalue, 1, 0)
RESULT:
df
refcol newcol
1 0.2352207 1
2 -0.3307359 0
3 -0.3116238 0
4 -2.3023457 0
5 -0.1708760 0
6 0.1402782 1
7 -1.4974267 0
8 -1.0101884 0
9 -0.9484756 0
10 -0.4939622 0
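The loop in the question draws a fresh random number for every row. A vectorized comparison keeps that behaviour only if it draws one number per row as well; a single runif(1, ...) inside ifelse() or case_when() is recycled across all rows. A minimal sketch comparing both versions, using a made-up value column and a fixed seed so the two can be checked against each other:

```r
data <- data.frame(value = c(0.2, 0.9, 1.5, 1.8, 0.5))  # made-up reference column

# Loop version, as in the question: one fresh draw per row
set.seed(1)
loop_result <- numeric(nrow(data))
for (i in seq_len(nrow(data))) {
  loop_result[i] <- if (data$value[i] > runif(1, 0, 1.7)) 1 else 0
}

# Vectorized version: draw all thresholds at once, then compare element-wise
set.seed(1)
data$newValue <- as.numeric(data$value > runif(nrow(data), 0, 1.7))

# Same seed, same sequence of draws, so the results should match
identical(loop_result, data$newValue)
```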

Binning two vectors of different ranges using R

I'm trying to assess the performance of a simple prediction model in R by binning the prediction results into defined intervals and then comparing them with the corresponding (binned) actual values.
I have two vectors actual and predicted as shown:
> actual <- c(0,2,0,0,41,1,3,5,2,0,0,0,0,0,6,1,0,0,15,1)
> predicted <- c(3.38,98.01,3.08,4.89,31.46,3.88,4.75,4.64,3.11,3.15,3.42,10.42,3.18,5.73,4.20,3.34,3.95,5.94,3.99)
I need to perform binning here. First off, the values of 'actual' are factorized/discretized into different levels, say:
0-5: Level 1 ** 6-10: Level 2 ** ... ** 41-45: Level 9
Now I have to bin the values of 'predicted' into the same buckets.
I tried to achieve this using the cut() function in R:
binCount <- 5
binActual <- cut(actual,labels=1:binCount,breaks=binCount)
binPred <- cut(predicted,labels=1:binCount,breaks=binCount)
However, the second element of predicted (98.01) is labelled as 5, although it does not actually fall in the desired interval.
I feel that using a different binCount for predicted will not help. Can anyone please suggest a solution for this?
I'm not 100% sure what you want to do.
However, from what I understand, you want to return, for each element of each vector, the class it falls in, given one set of classes that covers every value from both vectors, actual and predicted.
If that is what you want, note that your script first creates classes from the values of actual and classifies that vector with them.
It then creates a new set of classes for the vector predicted, so the classification is no longer the same.
Assuming I understood what you want to do, I would rather write:
actual <- c(0,2,0,0,41,1,3,5,2,0,0,0,0,0,6,1,0,0,15,1)
predicted <- c(3.38,98.01,3.08,4.89,31.46,3.88,4.75,4.64,3.11,3.15,3.42,10.42,3.18,5.73,4.20,3.34,3.95,5.94,3.99)
temporary <- c(actual, predicted)
maxi <- max(temporary)
mini <- min(temporary)
binCount <- 5
# binCount break points over the common range, so binCount - 1 bins
s <- seq(mini, maxi, length.out = binCount)
binActual <- cut(actual, breaks = s, include.lowest = TRUE, labels = 1:(length(s)-1))
binPred <- cut(predicted, breaks = s, include.lowest = TRUE, labels = 1:(length(s)-1))
It gives:
> binActual
[1] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Levels: 1 2 3 4
> binPred
[1] 1 4 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Levels: 1 2 3 4
I'm not sure it is what you're looking for, so let me know, I might be able to help you.
Best wishes.
Is this what you want?
intervals <- cbind(seq(0, 40, length = 9), seq(5, 45, length = 9))
cutFixed <- function(x, intervals) {
sapply(x, function(x) ifelse(x < min(intervals) | x >= max(intervals), NA, which(x >= intervals[,1] & x < intervals[,2])))
}
This gives the following result
> cutFixed(actual, intervals)
[1] 1 1 1 1 9 1 1 2 1 1 1 1 1 1 2 1 1 1 4 1
> cutFixed(predicted, intervals)
[1] 1 NA 1 1 7 1 1 1 1 1 1 3 1 2 1 1 1 2 1
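cut() itself can also be given the fixed break points directly, so that both vectors share the same buckets. The sketch below assumes the question's scheme means the intervals [0,5], (5,10], ..., (40,45], so that 5 lands in level 1 and 6 in level 2. That differs at the boundaries from cutFixed() above, which uses left-closed intervals. The helper name bin_fixed is hypothetical.

```r
actual <- c(0,2,0,0,41,1,3,5,2,0,0,0,0,0,6,1,0,0,15,1)
predicted <- c(3.38,98.01,3.08,4.89,31.46,3.88,4.75,4.64,3.11,3.15,3.42,10.42,3.18,5.73,4.20,3.34,3.95,5.94,3.99)

# Ten break points 0, 5, ..., 45 define nine right-closed bins
breaks <- seq(0, 45, by = 5)

# Hypothetical helper: same breaks for every vector; values outside 0..45 become NA
bin_fixed <- function(x) {
  cut(x, breaks = breaks, labels = seq_len(length(breaks) - 1), include.lowest = TRUE)
}

binActual <- bin_fixed(actual)
binPred <- bin_fixed(predicted)
binPred[2]  # 98.01 lies outside every bucket, so it is NA
```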

R ffdfdply reset cumsum using data.table

Sorry for asking a basic question. I am using the ff package and used read.csv.ffdf to import the data. I have more than 50 million rows in Excel, and I want to compute a cumulative sum on one of the columns and reset it whenever it finds a 0. I have the code below to generate the cumulative series, but I don't know how to access the current row.
idx <- ffdforder(i[c("a","c","b")])
ordered_i <- i[idx, ]
ordered_i$key_a_c_d <- ikey(ordered_i[c("a", "c","d")])
cumsum_i <- ffdfdply(ordered_i, split=as.character(ordered_i$key_a_c_d), FUN= function(x) {
  x <- as.data.table(x)
  if(x[**current row**, d]==0)
  {
    result <- x[,cumsum_a_c_d :=0]
  }
  else
  {
    result <- x[, cumsum_a_c_d := cumsum(d), by = list(key_a_c_d)]
  }
  as.data.frame(result)
}, trace=T)
I am using the data.table package to compute the cumulative sum. How can I access the current row in a data.table so that I can compare it with 0 and reset the cumsum? I need the expected output shown below, where Result is the cumulative sum of column d.
a b c d Result
1 1 1 1 1
1 4 1 0 0
1 6 1 1 1
1 2 1 1 2
1 5 1 0 0
1 3 1 1 1
Thanks
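The reset logic does not need access to the current row. A minimal sketch on a plain in-memory vector: every 0 starts a new run, and the cumulative sum is taken within each run. It uses the d column from the expected output above and does not cover the ff chunking. Whether it can be used unchanged inside the ffdfdply function depends on a chunk never splitting a run, which is an assumption here.

```r
d <- c(1, 0, 1, 1, 0, 1)  # column d from the expected output

# Every 0 opens a new run: run ids are 0 1 1 1 2 2
run_id <- cumsum(d == 0)

# Cumulative sum restarted within each run
result <- ave(d, run_id, FUN = cumsum)
result  # 1 0 1 2 0 1
```

Inside the question's data.table call, the same grouping could presumably be expressed by adding cumsum(d == 0) to the by list; this is untested against ff data.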

Probably "apply"-function-related

I am creating a d-dimensional hypercube representing [0,1]^d using the following code, which was kindly suggested by another user on this forum.
## generation of the d-dimensional hypercube
cube <- do.call(expand.grid,replicate(d, seq_len(mesh)/mesh, simplify=FALSE))
Let's say I have a function, say
foo <- function(u) prod(u)
that I want to apply to every point of the hypercube created above. Is there a nice way to do so without looping over the rows? I tried various apply functions, but without success.
Thanks.
First of all, a function for giving you the coordinates of the vertices:
hypercube <- function(d, coord = c(0, 1))
do.call(expand.grid, replicate(d, coord, simplify = FALSE))
For example, using d = 3:
cube <- hypercube(d = 3)
cube
# Var1 Var2 Var3
# 1 0 0 0
# 2 1 0 0
# 3 0 1 0
# 4 1 1 0
# 5 0 0 1
# 6 1 0 1
# 7 0 1 1
# 8 1 1 1
Then, to run your foo function on every vertex of the hypercube, use apply:
apply(cube, 1, foo)
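The same call should also work on the mesh grid from the question, not only on the vertices. A small sketch with d = 2 and mesh = 2, both chosen here only to keep the grid tiny. For a product specifically, multiplying the columns element-wise avoids calling foo once per row.

```r
d <- 2
mesh <- 2
cube <- do.call(expand.grid, replicate(d, seq_len(mesh) / mesh, simplify = FALSE))
foo <- function(u) prod(u)

# Row-wise application: foo is called once per grid point
vals <- apply(cube, 1, foo)
vals  # 0.25 0.50 0.50 1.00

# Alternative for a product: multiply the columns element-wise
vals_fast <- Reduce(`*`, cube)
```

The apply() version works for any foo that takes one point as a vector. The Reduce() shortcut only applies because foo happens to be a product.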
