Fill in N lags based on a variable in R data frame

I have two columns in my data frame, value and num_leads. I'd like to create a third column that stores the value's value from n rows below - where n is whatever number is stored in num_leads. Here's an example:
df1 <- data.frame(value = c(1:5),
                  num_leads = c(2, 3, 1, 1, 0))
Desired output:
value num_leads result
1 1 2 3
2 2 3 5
3 3 1 4
4 4 1 5
5 5 0 5
I have tried using the lead function in dplyr but unfortunately it seems all the leads must have the same number.

Use indexing:
with(df1, value[seq_along(value) + num_leads])
Here seq_along(value) gives the row number, and adding num_leads to it gives the index of the value to pull out.

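This is what I came up with:
df1$result <- df1$value[seq_len(nrow(df1)) + df1$num_leads]
Indexing with df1$value + df1$num_leads gives the same result here, but only because value happens to equal the row number in this example.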
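To illustrate the row-number approach on data where value does not coincide with the row number, here is a minimal sketch. The data frames df2 and df3 and their values are made up for illustration. A lead that points past the last row yields NA, because indexing a vector beyond its length returns NA in R.

```r
# Hypothetical data: value no longer equals the row number
df2 <- data.frame(value = c(10, 20, 30, 40, 50),
                  num_leads = c(2, 3, 1, 1, 0))

# Row number plus the per-row lead gives the index to read from
idx <- seq_len(nrow(df2)) + df2$num_leads
df2$result <- df2$value[idx]

# Hypothetical data with a lead that points past the last row (row 2)
df3 <- data.frame(value = c(10, 20, 30),
                  num_leads = c(1, 5, 0))
df3$result <- df3$value[seq_len(nrow(df3)) + df3$num_leads]

print(df2)
print(df3)
```

If NA is not the desired behaviour for out-of-range leads, the index could be capped with pmin(idx, nrow(df2)) before subsetting.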

Related

how to create a row that is calculated from another row automatically like how we do it in excel?

Does anyone know how to have a row in R that is calculated from another row automatically? For example,
let's say in Excel I want to make a row C, which is made up of (B2/B1),
e.g. C1 = B2/B1
C2 = B3/B2
...
Cn = B(n+1)/Bn
In Excel, we only need to do one calculation and then drag it down. How do we do it in R?

In R you work with columns as vectors so the operations are vectorized. The calculations as described could be implemented by the following commands, given a data.frame df (i.e. a table) and the respective column names as mentioned:
df["C1"] <- df["B2"]/df["B1"]
df["C2"] <- df["B3"]/df["B2"]
In R you usually would name the columns according to the content they hold. With that, you refer to the columns by their name, although you can also address the first column as df[, 1], the first row as df[1, ] and so on.
EDIT 1:
There are multiple ways - and certainly some more elegant ways to get it done - but for understanding I kept it in simple base R:
Example dataset for demonstration:
df <- data.frame("B1" = c(1, 2, 3),
                 "B2" = c(2, 4, 6),
                 "B3" = c(4, 8, 12))
Column calculation:
# Parentheses matter: 1:ncol(df)-1 would evaluate to 0:(ncol(df)-1)
for (i in 1:(ncol(df) - 1)) {
  col_name <- paste0("C", i)
  df[col_name] <- df[, i + 1] / df[, i]
}
Output:
B1 B2 B3 C1 C2
1 1 2 4 2 2
2 2 4 8 2 2
3 3 6 12 2 2
So you iterate over the available columns B1/B2/B3, dynamically create a column name in every iteration based on the number of the current iteration, and then calculate the respective column contents.
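The same column ratios can also be computed without an explicit loop. The following is only a sketch of one alternative, using base R's Map() on the example data above; the names n and ratios are illustrative.

```r
df <- data.frame(B1 = c(1, 2, 3),
                 B2 = c(2, 4, 6),
                 B3 = c(4, 8, 12))

n <- ncol(df)

# Pair each column with the one before it and divide: B2/B1, B3/B2
ratios <- Map(`/`, df[-1], df[-n])

# Attach the results as new columns C1, C2, ...
df[paste0("C", seq_along(ratios))] <- unname(ratios)

print(df)
```

Computing n before adding columns keeps the new C columns out of the calculation.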
EDIT 2:
Rowwise, as you actually meant it apparently, works similarly:
a <- c(10,15,20, 1)
df <- data.frame(a)
for (i in 1:nrow(df)) {
  df$b[i] <- df$a[i + 1] / df$a[i]
}
Output:
a b
1 10 1.500000
2 15 1.333333
3 20 0.050000
4 1 NA

You can do this just using vectors, without a for loop.
a <- c(10,15,20, 1)
df <- data.frame(a)
df$b <- c(df$a[-1], 0) / df$a
print(df)
a b
1 10 1.500000
2 15 1.333333
3 20 0.050000
4 1 0.000000
Explanation:
In the example data, df$a is the vector 10 15 20 1.
df$a[-1] is the same vector with its first element removed, 15 20 1.
Then c() is used to add a new element to the end so that the vector has the same length as before:
c(df$a[-1],0) which is 15 20 1 0
What we want for column b is this vector divided by the original df$a.
So:
df$b <- c(df$a[-1], 0) / df$a
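If the last row should be NA (there is no next value to divide) rather than 0, one option is to pad with NA instead. A minimal sketch on the same example data; next_a is just an illustrative name.

```r
a <- c(10, 15, 20, 1)
df <- data.frame(a)

# Shift the column up by one and pad with NA: the last row has no next value
next_a <- c(df$a[-1], NA)

# Element-wise division of the shifted vector by the original one
df$b <- next_a / df$a

print(df)
```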

Function to recode multiple variables conditional on other variables

I have a dataset with multiple variables. Each question has the actual survey answer and three other characteristics, so there are four variables for each question. I want to specify that if Q135_L == 1, Q135_RT is left as it is; otherwise it is coded as NA. I can do that with an ifelse statement.
df$Q135_RT <- ifelse(df$Q135_L == 1, df$Q135_RT, NA)
However, I have hundreds of variables and the names are not related. For example, the names include Q135, SG1_1 and so on. How can I specify for the whole dataset that if a variable ending in _L equals 1, the same variable ending in _RT should remain as it is, and otherwise the variable ending in _RT should be coded as NA?
I tried this, but it only returns NAs:
ifelse(grepl("//b_L" ==1, df), "//b_RT" , NA)

If I understand your problem correctly, you have a data frame whose columns represent survey question variables. Each column name contains two identifiers: a survey question number (134, 135, etc.) and a variable suffix (L, RT, etc.). Because you provided no reproducible example, I made a simplified example of your data frame:
set.seed(5)
DF <- data.frame(array(sample(1:4, 24, replace = TRUE), c(4,6)))
colnames(DF) <- c("Q134_L","Q135_L", "Q134_RT", "Q135_RT", "Q_L1", "Q134_S")
DF
# Q134_L Q135_L Q134_RT Q135_RT Q_L1 Q134_S
# 1 2 3 2 3 1 1
# 2 3 1 3 2 4 4
# 3 1 1 3 2 4 3
# 4 3 1 3 3 2 1
What you want is that if Q135_L == 1, leave Q135_RT as it is, otherwise code it as NA. Here is a function that implements this recoding logic:
recode <- function(yourdf, questnums) {
  charnum <- as.character(questnums)
  for (k in seq_along(charnum)) {
    # Columns for question k whose names end in _L and in _RT
    col_end_L_k <- yourdf[grepl("_L\\b", colnames(yourdf)) &
                            grepl(charnum[k], colnames(yourdf))]
    col_end_R_k <- yourdf[grepl("_RT\\b", colnames(yourdf)) &
                            grepl(charnum[k], colnames(yourdf))]
    row_is_1 <- which(col_end_L_k == 1)
    # Logical index: -row_is_1 would select no rows when row_is_1 is empty
    not_1 <- !(seq_len(nrow(yourdf)) %in% row_is_1)
    col_end_R_k[not_1, ] <- NA
    yourdf[, colnames(col_end_R_k)] <- col_end_R_k
  }
  return(yourdf)
}
This function takes a data frame and a vector of question numbers, and then returns the data frame that has been recoded.
What this function does:
Loops over each question number using for.
Uses grepl to identify any column whose name contains the selected number and ends in _L.
Does the same for column names ending in _RT.
Uses which to find the rows in which the _L column equals 1.
Keeps the values of the _RT column with the same question number in those rows, and changes the values in all other rows to NA.
The result:
recode(DF, 134:135)
# Q134_L Q135_L Q134_RT Q135_RT Q_L1 Q134_S
# 1 2 3 NA NA 1 1
# 2 3 1 NA 2 4 4
# 3 1 1 3 2 4 3
# 4 3 1 NA 3 2 1
Note that the Q_L1 column is not affected because _L in this column is not located at the end of the column name.
As for how to define questnums, the question numbers, you just need to create a numeric vector. Examples:
Your questnums are 1 to 200. Then use 1:200 or seq(200), so recode(DF, 1:200).
Your questnums are 1, 3, 134, 135. Then, use recode(DF, c(1, 3, 134, 135)).
You can also assign the question numbers to an object first, such as n = c(25, 135, 145), and then use it: recode(DF, n)
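Because grepl() matches the question number anywhere in the column name (for example, "1" also matches Q134_L), and because names such as SG1_1 have no plain question number, it may be simpler to pair columns by name instead. The following is only a sketch under the assumption that every flag column ends in _L and its partner has the same stem ending in _RT. The function recode_by_suffix, its arguments, and the survey data are hypothetical and not part of the answer above.

```r
# Hypothetical helper: for every column ending in flag_suffix, find the column
# with the same stem ending in target_suffix and blank it where the flag != 1
recode_by_suffix <- function(dat, flag_suffix = "_L", target_suffix = "_RT") {
  flag_cols <- grep(paste0(flag_suffix, "$"), names(dat), value = TRUE)
  for (flag in flag_cols) {
    stem <- sub(paste0(flag_suffix, "$"), "", flag)
    target <- paste0(stem, target_suffix)
    if (target %in% names(dat)) {
      # Keep a value only when the flag is present and equal to 1
      keep <- !is.na(dat[[flag]]) & dat[[flag]] == 1
      dat[[target]][!keep] <- NA
    }
  }
  dat
}

# Made-up example data with unrelated question names
survey <- data.frame(Q135_L    = c(3, 1, 1, 1),
                     Q135_RT   = c(3, 2, 2, 3),
                     SG1_1_L   = c(1, 2, 1, 2),
                     SG1_1_RT  = c(10, 20, 30, 40),
                     Q_L1      = c(1, 4, 4, 2))

out <- recode_by_suffix(survey)
print(out)
```

The suffixes are used as regular expressions, so a suffix containing special characters such as a dot would need escaping.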

Looping through a column to make a new table in R

I want to make a table called Count_Table and in it, I'd like to count the number of 0s, 1s, and 5s when column "num" == 1, 2, 3, 4, etc.
For example, the code below will count the 0s, 1s, and 5s in column "Visited5" when "num" == "1". This is great, but I need to do this 34 more times since "num" goes from 1 to 35.
Count_Table <- table(SASS_data[num == "1"]$Visited5)
I am new to R and I don't know how to add 1 to the "num" and loop it until 35 so that the Count_Table includes the counts of 0,1,5 for all nums that exist (1-35). I am sorry if this is confusing and thank you for your help.

lapply will generate a list of tables that span the columns of a dataframe. E.g.,
tablist <- lapply(mtcars, table)
If your dataframe contains columns you want to exclude, you can do that by restricting the dataframe. E.g.,
tablist2 <- lapply(mtcars[, c(2, 4, 7)], table)

Answer
table() works on multiple dimensions, so just pass both num and Visited5 as arguments. This also works if not all unique values of Visited5 are present in every level of num; those cells will simply be set to 0.
Example
SASS_data <- data.frame(
  num = rep(1:5, each = 5),
  Visited5 = sample(1:3, 25, replace = TRUE)
)
table(SASS_data$num, SASS_data$Visited5)
# 1 2 3
# 1 2 1 2
# 2 1 3 1
# 3 1 1 3
# 4 2 0 3
# 5 2 2 1
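For a reproducible illustration with the values 0, 1 and 5 from the question, here is a small sketch on made-up data. It assumes SASS_data is (or can be treated as) a plain data frame; the numbers are invented for the example.

```r
# Made-up data: three levels of num, Visited5 takes the values 0, 1 and 5
SASS_data <- data.frame(
  num      = c(1, 1, 1, 2, 2, 3),
  Visited5 = c(0, 1, 5, 0, 0, 5)
)

# One row per num, one column per Visited5 value; missing combinations are 0
Count_Table <- table(num = SASS_data$num, Visited5 = SASS_data$Visited5)
print(Count_Table)

# Counts of 0, 1 and 5 for num == 1 only
print(Count_Table["1", ])

# Long format, one row per (num, Visited5) combination
counts_long <- as.data.frame(Count_Table)
print(counts_long)
```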

How to create a subset by using another subset as condition?

I want to create a subset using another subset as a condition. I can't show my actual data, but I can show an example that deals with the core of my problem.
For example, I have 10 subjects with 10 observations each. So an example of my data would be to create a simple data frame using this:
ID <- rep(1:10, each = 10)
x <- rnorm(100)
y <- rnorm(100)
df <- data.frame(ID,x,y)
Which creates:
ID x y
1 1 0.08146318 0.26682668
2 1 -0.18236757 -1.01868755
3 1 -0.96322876 0.09565239
4 1 -0.64841436 0.09202456
5 1 -1.15244873 -0.38668929
6 1 0.28748521 -0.80816416
7 1 -0.64243912 0.69403155
8 1 0.84882350 -1.48618271
9 1 -1.56619331 -1.30379070
10 1 -0.29069417 1.47436411
11 2 -0.77974847 1.25704185
12 2 -1.54139896 1.25146126
13 2 -0.76082748 0.22607239
14 2 -0.07839719 1.94448322
15 2 -1.53020374 -2.08779769
etc.
Some of these subjects were positive for an event (for example subject 3, 5 and 7), so I have created a subset for that using:
event_pos <- subset(df, ID %in% c("3","5","7"))
Now, I also want to create a subset for the subjects who were negative for an event. I could use something like this:
event_neg <- subset(df, ID %in% c("1","2","4","6","8","9","10"))
The problem is, my data set is too large to specify all the individuals of the negative group. Is there a way to use my subset event_pos to get all the subjects with negative events in one subset?
TL;DR
Can I get a subset_2 by removing the subset_1 from the data frame?

You can use:
ind_list <- c("3","5","7")
event_neg <- subset(df, (ID %in% ind_list) == FALSE)
or
event_neg <- subset(df, !(ID %in% ind_list))
Hope that helps.
Gottaviannoni
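To build the negative subset directly from event_pos, as the question asks, the IDs can be taken from that subset instead of being typed out again. A minimal sketch on simulated data; set.seed(1) is only there to make the example repeatable.

```r
set.seed(1)
df <- data.frame(ID = rep(1:10, each = 10),
                 x = rnorm(100),
                 y = rnorm(100))

# Subjects that were positive for the event
event_pos <- subset(df, ID %in% c(3, 5, 7))

# Everything whose ID does not appear in event_pos
event_neg <- subset(df, !(ID %in% unique(event_pos$ID)))

print(unique(event_neg$ID))
```

Together the two subsets cover every row of df exactly once.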

Using sum(x:y) to create a new variable/vector from existing values in R

I am working in R with a data frame d:
ID <- c("A","A","A","B","B")
eventcounter <- c(1,2,3,1,2)
numberofevents <- c(3,3,3,2,2)
d <- data.frame(ID, eventcounter, numberofevents)
> d
ID eventcounter numberofevents
1 A 1 3
2 A 2 3
3 A 3 3
4 B 1 2
5 B 2 2
where numberofevents is the highest value in the eventcounter for each ID.
Currently, I am trying to create an additional vector z <- c(6,6,6,3,3).
If numberofevents == 3, it is supposed to calculate sum(1:3), which equals 3 + 2 + 1 = 6.
If numberofevents == 2, it is supposed to calculate sum(1:2), which equals 2 + 1 = 3.
Working with a large set of data, I thought it might be convenient to create this additional vector
by using the sum function in R d$z<-sum(1:d$numberofevents), i.e.
sum(1:3) # for the rows 1-3
and
sum(1:2) # for the rows 4-5.
However, I always get this warning:
Numerical expression has x elements: only the first is used.

You can try ave
d$z <- with(d, ave(eventcounter, ID, FUN=sum))
Or using data.table
library(data.table)
setDT(d)[,z:=sum(eventcounter), ID][]

Try using the apply, sapply or lapply functions in R.
sapply(numberofevents, function(x) sum(1:x))
It works for me.
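The warning in the question arises because the : operator uses only the first element of d$numberofevents. Since sum(1:n) equals n * (n + 1) / 2, the column can also be computed in one vectorized step. A minimal sketch on the data from the question; z_check is just an illustrative name for a row-by-row cross-check.

```r
d <- data.frame(ID = c("A", "A", "A", "B", "B"),
                eventcounter = c(1, 2, 3, 1, 2),
                numberofevents = c(3, 3, 3, 2, 2))

# sum(1:n) equals n * (n + 1) / 2, which works on the whole column at once
d$z <- d$numberofevents * (d$numberofevents + 1) / 2

# Same result, computing sum(1:n) one row at a time
z_check <- vapply(d$numberofevents, function(n) sum(seq_len(n)), numeric(1))

print(d)
```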
