I have 5 repeated measures called pub1:pub5, each taking a value from 1 to 4. Each was measured at a different age, age1:age5. That is, pub1 was measured at age1, ..., and pub5 at age5.
I would like to create a new variable age_pb2 that shows the age at which a value of 2 first occurred in pub. For example, for individual x, age_pb2 will equal age3 if the first time a value of 2 is scored is in pub3.
I have tried modifying previous code but not had much luck.
library(tidyverse)
#Example data
N <- 2000
data <- data.frame(id = 1:2000,age1 = rnorm(N,6:8),age2 = rnorm(N,7:9),age3 = rnorm(N,8:10),
age4 = rnorm(N,9:11),age5 = rnorm(N,10:12),pub1 = rnorm(N,1:2),pub2 = rnorm(N,1:2),
pub3 = rnorm(N,1:2),pub4 = rnorm(N,1:2),pub5 = rnorm(N,1:2))
data <- data %>% mutate(across(starts_with("pub"), ~ round(replace(.x, .x < 0, NA), 0)))
#New variable showing first age at getting a score of 2 (doesn't work)
i1 <- grepl('^pub', names(data)) # index for pub columns
i2 <- grepl('^age', names(data)) # index for age columns
data[paste0("age_pb2")] <- lapply(2, function(i) {
j1 <- max.col(data[i1] == i, 'first')
j2 <- rowSums(data[i1] == i) == 0
data[i2][cbind(seq_len(nrow(data)), j1 *(NA^j2))]
})
set.seed(1)
N <- 2000
data <- data.frame(id = 1:2000,age1 = rnorm(N,6:8),age2 = rnorm(N,7:9),age3 = rnorm(N,8:10),
age4 = rnorm(N,9:11),age5 = rnorm(N,10:12),pub1 = rnorm(N,1:2),pub2 = rnorm(N,1:2),
pub3 = rnorm(N,1:2),pub4 = rnorm(N,1:2),pub5 = rnorm(N,1:2)) %>%
mutate(across(starts_with("pub"), ~ round(replace(.x, .x < 0, NA), 0))) %>%
mutate(age_pb2 = eval(parse(text = paste0("age", which.min(apply(select(., starts_with("pub")), 2, function(x) which(x == 2)[1]))))))
How it works: apply() runs over the pub columns, and which(x == 2)[1] returns, for each column, the first row in which a 2 occurs. which.min() then picks the column whose first 2 occurs in the earliest row. That column index (shared by pub and age) is pasted onto "age", and eval(parse(text = ...)) turns the resulting name into the column that mutate() assigns.
E.g. here after apply you get
[pub1 = 2, pub2 = 1, pub3 = 2, pub4 = 4, pub5 = 2]
which is the row of the first occurrence of 2 in each column. The earliest occurrence (which.min) is in the second pub column, so the index is 2. This is pasted onto "age" and evaluated with eval(parse()) inside mutate().
EDIT
It is probably more convenient to do it in a for loop for all age_pbi, or there is an easy solution in dplyr that I am not aware of.
for (i in 1:5) {
index <- which.min(apply(select(data, starts_with("pub")), 2, function(x) which(x == i)[1]))
data[ ,paste0("age_pb", i)] <- data[ ,paste0("age", index)]
}
Note however, that which.min takes the first minimum. E.g. pub1 and pub2 both have a 1 in the first row, so the above approach assigns age1 to age_pb1 whereas it could be age2 as well. I don't know what you want to do with this, so can't say what is a better option.
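If the goal is one age per individual, as the question describes ("for individual x"), a row-wise lookup may be closer to what is wanted, since which.min() above selects a single age column for the whole data frame. The following is a minimal base R sketch on a small made-up data frame; the values and the three-wave layout are assumptions for illustration only. It treats NA scores as "not a 2" and returns NA when an individual never scores 2.

```r
# Hypothetical toy data: 4 individuals, 3 waves
data <- data.frame(
  id   = 1:4,
  age1 = c(6.1, 7.0, 8.2, 6.5),
  age2 = c(7.3, 8.1, 9.0, 7.4),
  age3 = c(8.2, 9.3, 10.1, 8.6),
  pub1 = c(1, 2, NA, 1),
  pub2 = c(2, 2, 1, 1),
  pub3 = c(3, 1, 2, 1)
)

pub_cols <- grep("^pub", names(data))   # positions of the pub columns
age_cols <- grep("^age", names(data))   # positions of the age columns

# Logical matrix: TRUE where the score equals 2; NA scores count as FALSE
is_two <- data[pub_cols] == 2
is_two[is.na(is_two)] <- FALSE

# Column position of the first TRUE in each row
first_col <- max.col(is_two + 0L, ties.method = "first")
first_col[rowSums(is_two) == 0] <- NA   # no 2 in this row at all

# Pick, row by row, the age from the matching wave
ages <- as.matrix(data[age_cols])
data$age_pb2 <- ages[cbind(seq_len(nrow(data)), first_col)]
```

The same steps could be wrapped in a loop over the scores 1 to 4 to produce the other age_pb columns.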
I'm working on a dataset in R and want to create a new variable based on the values of variable dx1. Here's my code.
Data1$AMI <- Data1$dx1 %in% c("I21.0", "I21.1", "I21.2", "I21.3",
"I21.4", "I21.9", "I21.A")
My question is how to assign the value of AMI based on several variables, say dx1 to dx25. In this dataset, dx1 is the primary diagnosis, dx2 the second diagnosis, and so on. If any of them contains one of the specific diagnosis codes ("I21.0", "I21.1", "I21.2", "I21.3", "I21.4", "I21.9", "I21.A"), AMI should be assigned the value "1".
That is, if dx1 %in% c("I21.0", "I21.1", "I21.2") or dx2 %in% c("I21.0", "I21.1", "I21.2") or dx3 %in% c("I21.0", "I21.1", "I21.2"), we want the AMI column to show "1".
I may have misunderstood your question; is this what you want to do?
# Load libraries
library(tidyverse)
# Create fake data
dx <- list()
for (i in 1:25){
dx[[i]] <- c(paste("I",
round(rnorm(n = 50, mean = 21, sd = 5), 1),
sep = ""))
}
name_list <- paste("dx", 1:25, sep = "")
Data1 <- as.data.frame(dx, col.names = name_list)
# Create a variable called "AMI" to count the occurrences of values:
# "I21.0","I21.1","I21.2","I21.3","I21.4","I21.9"
Data2 <- Data1 %>%
mutate(AMI = rowSums(
sapply(select(., starts_with("dx")),
function(x) grepl(pattern = paste(c("I21.0","I21.1","I21.2","I21.3","I21.4","I21.9"),
collapse = "|"), x)
))
)
Data2
Edit
Here is how to get the new variable "AMI" to show the value "1" for any row when one or more variables from the list 'dx1:dx25' has a value from the list '"I21.0","I21.1","I21.2","I21.3","I21.4","I21.9"':
Data2 <- Data1 %>%
mutate(AMI = ifelse(rowSums(
sapply(select(., starts_with("dx")),
function(x) grepl(pattern = paste(c("I21.0","I21.1","I21.2","I21.3","I21.4","I21.9"),
collapse = "|"), x)
)) > 0, 1, 0)
)
Data2
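Because grepl() treats the pattern as a regular expression, the unescaped dot in "I21.0" matches any character, and a match anywhere inside a longer code also counts. If exact code matches are wanted, as in the %in% test from the question, a sketch along these lines may be safer. The toy data and the three dx columns are made up for illustration; the code list is the one from the question, including "I21.A".

```r
# Diagnosis codes from the question
codes <- c("I21.0", "I21.1", "I21.2", "I21.3", "I21.4", "I21.9", "I21.A")

# Hypothetical toy data with three diagnosis columns
Data1 <- data.frame(
  dx1 = c("I21.0", "J18.9", "I10",   "I21"),
  dx2 = c("E11.9", "I21.A", "I2100", NA),
  dx3 = c("I10",   "I10",   "K21.9", "E11.9"),
  stringsAsFactors = FALSE
)

# All columns named dx followed by digits
dx_cols <- grep("^dx[0-9]+$", names(Data1), value = TRUE)

# One logical column per dx variable: TRUE where the code is an exact match
hits <- sapply(Data1[dx_cols], function(x) x %in% codes)

# 1 if any dx column matches in that row, otherwise 0
Data1$AMI <- as.integer(rowSums(hits) > 0)
```

In this example the third row stays 0 even though "I2100" would match the regular expression "I21.0".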
The problem is the following:
I have a data frame that I need to update inside a loop. The simple data frame has 4 columns: an identifier and three numeric columns. Here is the simple data frame at the initial step,
res_df <- data.frame(id = c("X", "Y", "Z"),
count = NA,
total = NA,
value = NA)
At every iteration a new data frame is generated with the same identifier and the same numeric columns.
For instance,
loop_df <- data.frame(id = c("X", "Z"),
count = c(1, 0),
total = c(20, 0),
value = c(0.05, 0))
I need to fill res_df with information from loop_df in the following way:
the row in loop_df with id "X" has to be inserted into the corresponding row of res_df, and so on;
the column count has to be filled by summing the values in res_df and the newest values in loop_df (essentially sum(res_df$count, loop_df$count) by id);
the column total has to be filled in the same way as the column count (i.e. a simple sum of the values by id);
the column value has to be filled by averaging the values in res_df and the newest values in loop_df (essentially mean(res_df$value, loop_df$value) by id).
Here is how the result should be after the first run:
res_df
id count total value
X 1 20 0.05
Y NA NA NA
Z 0 0 0
Now suppose the second iteration of the loop produces the following loop_df:
loop_df <- data.frame(id = c("X", "Y"),
count = c(1, 0),
total = c(50, 0),
value = c(2.35, 0))
Then, the res_df has to be updated as follows
res_df
id count total value
X 2 70 1.2
Y 0 0 0
Z 0 0 0
Update: Solution
library(dplyr)
res_df <- arrange(res_df, id)
loop_df <- arrange(loop_df, id)  # sort in place so its rows line up with the matching rows of res_df
ids <- loop_df$id
res_df[res_df$id %in% ids,] <- res_df[res_df$id %in% ids,] %>%
mutate(count = case_when(is.na(count) ~ loop_df$count,
TRUE ~ count + loop_df$count),
total = case_when(is.na(total) ~ loop_df$total,
TRUE ~ total + loop_df$total),
value = case_when(is.na(value) ~ loop_df$value,
TRUE ~ ewise_mean(value, loop_df$value, zero.rm = TRUE))  # ewise_mean() is not defined in this post
)
However, I am still looking for a solution which is highly efficient.
I'd really appreciate your help and thoughts about that.
I didn't quite catch why you need the for loop to generate those loop_df. You might want to consider using lapply to get a list of loop_df and then use the following to get your desired result:
rbindlist(dfl)[, .(count=sum(count), total=sum(total), value=mean(value)), id]
output:
id count total value
1: X 2 70 1.2
2: Z 0 0 0.0
3: Y 0 0 0.0
data:
library(data.table)
dfl <- list(
setDT(data.frame(id = c("X", "Z"),
count = c(1, 0),
total = c(20, 0),
value = c(0.05, 0))),
setDT(data.frame(id = c("X", "Y"),
count = c(1, 0),
total = c(50, 0),
value = c(2.35, 0)))
)
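If the updates really do have to happen one iteration at a time, a base R sketch using match() could look like the following. The function name update_res is hypothetical, and the averaging rule is an assumption taken from the expected output in the question: the stored value and the new value are averaged pairwise, and an NA in res_df is simply replaced by the new value.

```r
# Hypothetical helper: fold one loop_df into res_df, matching rows by id
update_res <- function(res_df, loop_df) {
  idx <- match(loop_df$id, res_df$id)   # rows of res_df to update

  # count and total: running sums, starting from the new value if still NA
  for (col in c("count", "total")) {
    old <- res_df[[col]][idx]
    res_df[[col]][idx] <- ifelse(is.na(old), loop_df[[col]], old + loop_df[[col]])
  }

  # value: mean of the stored value and the new value (assumed rule)
  old <- res_df$value[idx]
  res_df$value[idx] <- ifelse(is.na(old), loop_df$value, (old + loop_df$value) / 2)

  res_df
}

res_df <- data.frame(id = c("X", "Y", "Z"),
                     count = NA_real_, total = NA_real_, value = NA_real_,
                     stringsAsFactors = FALSE)

# First and second iteration from the question
res_df <- update_res(res_df, data.frame(id = c("X", "Z"), count = c(1, 0),
                                        total = c(20, 0), value = c(0.05, 0),
                                        stringsAsFactors = FALSE))
res_df <- update_res(res_df, data.frame(id = c("X", "Y"), count = c(1, 0),
                                        total = c(50, 0), value = c(2.35, 0),
                                        stringsAsFactors = FALSE))
```

Note that a pairwise mean applied repeatedly is not the same as the overall mean once there are more than two iterations, so the two approaches can give different value columns on longer runs.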
Here is the dataframe
sampledf = data.frame(timeinterval = c(1:120), hour = c(rep(NA, times = 85), 1, rep(NA, times = 5), 1, rep(NA, times = 4),1, rep(NA, times = 4), 1, rep(NA, times = 18)))
I want to replace the NAs in column hour so that rows 86 to 92 (inclusive) and rows 97 to 102 (inclusive) are all 1.
Here is what I've tried so far:
1. Getting the list of rownames with value 1 in hour column
2. Looping through (This is what is not working!)
ones = which(sampledf$hour == 1)
n = (length(ones)+1)/2
chunk <- function(ones,n) split(ones, cut(seq_along(ones), n, labels = FALSE))
y = chunk(ones,n)
for (i in y) {
sampledf$Hour[c(y$i[1]:y$i[2])] == 1
}
Help me out, I'm new to R.
In Python we have the ffill method for this; what is the equivalent here?
Thanks!
library(dplyr)  # provides between()
sampledf$hour[between(sampledf$timeinterval,86,92) | between(sampledf$timeinterval,97,102)]<-1
Basically you subset sampledf's hour column to those cases where timeinterval is between 86 and 92 or (|) between 97 and 102, and assign 1 to all those cases.
If you want to assign 1 to all timeintervals in the given ranges:
sampledf$hour[sampledf$timeinterval %in% c(86:92,97:102)] <- 1
If you want to assign 1 to cases based on the rownumbers of your data:
sampledf$hour[c(86:92,97:102)] <- 1
If you want to add a cumulated sum to your values as in your comment, you can just use the cumsum() function and do:
sampledf$hour[which(sampledf$hour == 1)] <- cumsum(sampledf$hour[which(sampledf$hour == 1)])
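If the ranges should not be hard-coded, one option is to treat the existing 1s as start and end markers and fill between consecutive pairs. This sketch assumes the markers come in pairs (start, end, start, end, ...), which holds for the example data but is an assumption about the real data. A plain forward fill (for example tidyr::fill() or zoo::na.locf()) would also fill rows 93 to 96 and everything after 102, so it is not a drop-in equivalent here.

```r
# Example data from the question: markers at rows 86, 92, 97 and 102
sampledf <- data.frame(
  timeinterval = 1:120,
  hour = c(rep(NA, 85), 1, rep(NA, 5), 1, rep(NA, 4), 1, rep(NA, 4), 1, rep(NA, 18))
)

ones   <- which(sampledf$hour == 1)   # row numbers of the markers
starts <- ones[c(TRUE, FALSE)]        # 1st, 3rd, ... marker
ends   <- ones[c(FALSE, TRUE)]        # 2nd, 4th, ... marker
stopifnot(length(starts) == length(ends))   # markers must pair up

# Fill every row from each start marker to its end marker
for (k in seq_along(starts)) {
  sampledf$hour[starts[k]:ends[k]] <- 1
}

filled_rows <- which(sampledf$hour == 1)
```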
I'm trying to count the rows of a data frame in which column 1 equals "vsrv11" and column 3, which is a date, has the year 2017.
So I wrote this code. The counter increments inside the if statement, but on every iteration of the loop the counter becomes 0 again.
count <- 0
funcao.teste <- function (x) {
if (x[1] == "vsrv11" && substring(x[3],0,4) == "2017") {
count <<- count + 1
}
}
apply(vpnsessions, 1, funcao.teste, count)
Generally, I'd advise against using global variables and also, you could check this with simple filtering.
df <- data.frame(x = sample(c("vsrv11", rnorm(10)), 100, replace = TRUE),
y = rnorm(100),
z = as.character(sample(c(2017, 2018), 100, replace = TRUE)))
nrow(df[df[, 1] == "vsrv11" & grepl("2017", df[, 3]), ])
or just
sum(df[, 1] == "vsrv11" & grepl("2017", df[, 3]))
In the tidyverse you can perform such an operation using dplyr::count:
# Sample data
vpnsessions <- data.frame(
srv = "vsrv11",
id = c(rep("2017_abc", 10), rep("2018_def", 8)),
stringsAsFactors = FALSE)
library(dplyr)
count(vpnsessions, year = substr(id, 1, 4))
## A tibble: 2 x 2
# year n
# <chr> <int>
#1 2017 10
#2 2018 8
Note how count() counts the rows for each value of year, i.e. the first four characters of id. It's easy to extract the relevant rows from the resulting data.frame/tibble.
To nitpick, in R indexing starts with 1 not with 0, so substring(..., 0, 4) from your code should be substring(..., 1, 4).
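Applied to the original condition (column 1 equals "vsrv11" and column 3 starts with 2017), a vectorized count needs no counter at all. This is a minimal sketch on made-up data; the column names server, user and start are hypothetical, and only the column positions follow the question.

```r
# Hypothetical sessions table: server in column 1, date string in column 3
vpnsessions <- data.frame(
  server = c("vsrv11", "vsrv11", "vsrv12", "vsrv11"),
  user   = c("a", "b", "c", "d"),
  start  = c("2017-01-05", "2018-03-01", "2017-06-30", "2017-12-31"),
  stringsAsFactors = FALSE
)

# TRUE for rows where both conditions hold
is_match <- vpnsessions[[1]] == "vsrv11" &
  substr(vpnsessions[[3]], 1, 4) == "2017"

# Summing a logical vector counts the TRUE values
n_matches <- sum(is_match)
```

Comparing the first four characters with substr() is slightly stricter than grepl("2017", ...), which would also match "2017" appearing elsewhere in the string.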
The data I have contain three variables with three unique IDs and each has multiple records. See below
ID <- c(rep(1,7), rep(2,6), rep(3,5))
t <- c(seq(1,7), seq(1,6), seq(1,5))
y <- c(rep(6,7), rep(1,6), rep(6,5))
z <- c(5,0,0,0,1,0,0,0,0,1,0,0,0,4,2,1,0,1)
dat1 <- data.frame(ID, t, y, z)
I need to create a new column (let's call it updated_y0) with the following rules:
for each ID i = 1,2,3 and each record j, the updated_y0(i,1) (i.e., the first record for each ID ordered by t) = y(i,1).
updated_y0(i,j) with j>1 (i.e., started from the second record) = updated_y0(i,j-1) - z(i,j-1) (the difference of previous rows)
For example, for ID=1,
updated_y0(1,1) = y(1,1) = 6,
updated_y0(1,2) = updated_y0(1,1) - z(1,1) = 6-5 = 1,
updated_y0(1,3) = updated_y0(1,2) - z(1,2) = 1-0 = 1...
The new data (dat2) is
ID <- c(rep(1,7), rep(2,6), rep(3,5))
t <- c(seq(1,7), seq(1,6), seq(1,5))
y <- c(rep(6,7), rep(1,6), rep(6,5))
z <- c(5,0,0,0,1,0,0,0,0,1,0,0,0,4,2,1,0,1)
updated_y0 <- c(6,1,1,1,1,0,0,1,1,1,0,0,0,6,2,0,-1,-1)
dat2 <- data.frame(ID, t, y, z, updated_y0)
This should work, although I do hate using for loops. First we identify all first records for each ID (all others will be marked NA):
library(dplyr)
dat2 <- dat1 %>%
group_by(ID) %>%
mutate(updated_y0 = ifelse(t == 1,
y,
NA))
Now we use a for loop to replace just the NAs:
for(i in 1:nrow(dat2)){
dat2$updated_y0[i] <- ifelse(is.na(dat2$updated_y0[i]),
dat2$updated_y0[i-1] - dat2$z[i-1],
dat2$updated_y0[i])
}
dat2
For the lagged y - z variant, the dplyr version is fairly straightforward:
dat1 %>%
group_by(ID) %>%
mutate(updated_y0 = ifelse(t == 1,
y,
lag(y - z)))
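The ifelse gives the current y value if it is the first record for the ID (t == 1). Otherwise it calculates y - z from the row above it (dplyr::lag). Note that lag(y - z) uses the previous row's original y and z, so it does not carry the running updated_y0 forward the way the recursive rule in the question does.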
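The recursion in the question can also be unrolled: updated_y0(i,j) equals y(i,1) minus the sum of z(i,1), ..., z(i,j-1), which is a lagged cumulative sum within each ID. The following base R sketch uses ave() for the grouping and assumes the rows should be ordered by t within ID. Applying the stated rule literally gives -1 for the last record of ID 3, since z(3,4) is 0.

```r
# Data from the question
ID <- c(rep(1, 7), rep(2, 6), rep(3, 5))
t  <- c(seq(1, 7), seq(1, 6), seq(1, 5))
y  <- c(rep(6, 7), rep(1, 6), rep(6, 5))
z  <- c(5, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 4, 2, 1, 0, 1)
dat1 <- data.frame(ID, t, y, z)

# Make sure records are in time order within each ID
dat1 <- dat1[order(dat1$ID, dat1$t), ]

# y of the first record, repeated over each ID's rows
first_y <- ave(dat1$y, dat1$ID, FUN = function(v) v[1])

# Sum of z over all earlier records of the same ID (0 for the first record)
z_before <- ave(dat1$z, dat1$ID, FUN = function(v) cumsum(c(0, head(v, -1))))

dat1$updated_y0 <- first_y - z_before
```

With dplyr, the same idea can be written inside group_by(ID) as first(y) - lag(cumsum(z), default = 0).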