The data I have contain three variables (t, y, z) for three unique IDs, and each ID has multiple records. See below:
ID <- c(rep(1,7), rep(2,6), rep(3,5))
t <- c(seq(1,7), seq(1,6), seq(1,5))
y <- c(rep(6,7), rep(1,6), rep(6,5))
z <- c(5,0,0,0,1,0,0,0,0,1,0,0,0,4,2,1,0,1)
dat1 <- data.frame(ID, t, y, z)
I need to create a new column (let's call it updated_y0) with the following rules:
For each ID i = 1, 2, 3 and each record j (ordered by t), the first record is updated_y0(i,1) = y(i,1).
From the second record onward (j > 1), updated_y0(i,j) = updated_y0(i,j-1) - z(i,j-1), i.e. the previous row's updated_y0 minus the previous row's z.
For example, for ID=1,
updated_y0(1,1) = y(1,1) = 6,
updated_y0(1,2) = updated_y0(1,1) - z(1,1) = 6-5 = 1,
updated_y0(1,3) = updated_y0(1,2) - z(1,2) = 1-0 = 1...
The new data (dat2) is
ID <- c(rep(1,7), rep(2,6), rep(3,5))
t <- c(seq(1,7), seq(1,6), seq(1,5))
y <- c(rep(6,7), rep(1,6), rep(6,5))
z <- c(5,0,0,0,1,0,0,0,0,1,0,0,0,4,2,1,0,1)
updated_y0 <- c(6,1,1,1,1,0,0,1,1,1,0,0,0,6,2,0,-1,-1)
dat2 <- data.frame(ID, t, y, z, updated_y0)
This should work, although I do hate using for loops. First we identify all first records for each ID (all others will be marked NA):
library(dplyr)
dat2 <- dat1 %>%
group_by(ID) %>%
mutate(updated_y0 = ifelse(t == 1,
y,
NA))
Now we use a for loop to replace just the NAs:
for(i in 1:nrow(dat2)){
dat2$updated_y0[i] <- ifelse(is.na(dat2$updated_y0[i]),
dat2$updated_y0[i-1] - dat2$z[i-1],
dat2$updated_y0[i])
}
dat2
For the simpler variant that just lags y - z, a dplyr solution is fairly straightforward:
dat1 %>%
group_by(ID) %>%
mutate(updated_y0 = ifelse(t == 1,
y,
lag(y - z)))
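The ifelse returns the current y value for the first record of each ID (t == 1). For any other record, it calculates y - z from the row above it (dplyr::lag). Note that this uses the previous row's y, not the previous row's updated_y0, so it does not carry the running value forward from row to row.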
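If the recursive rule from the question is what is needed (each row starts from the previous updated_y0), it can be unrolled: updated_y0(i,j) equals y(i,1) minus the sum of z over all earlier rows of the same ID. Below is a minimal base R sketch of that idea, not taken from the answers above. The helper names y_first and z_before are made up for illustration, and it assumes the rows can be sorted by ID and t.

```r
# Example data from the question
ID <- c(rep(1, 7), rep(2, 6), rep(3, 5))
t  <- c(seq(1, 7), seq(1, 6), seq(1, 5))
y  <- c(rep(6, 7), rep(1, 6), rep(6, 5))
z  <- c(5, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 4, 2, 1, 0, 1)
dat1 <- data.frame(ID, t, y, z)

# Sort so that the first row of each ID is its earliest record
dat1 <- dat1[order(dat1$ID, dat1$t), ]

# Starting value per ID: y of the first record (hypothetical helper name)
y_first <- ave(dat1$y, dat1$ID, FUN = function(v) v[1])

# Sum of z over all previous rows within the ID, 0 for the first row
# (hypothetical helper name)
z_before <- ave(dat1$z, dat1$ID, FUN = function(v) cumsum(v) - v)

dat1$updated_y0 <- y_first - z_before
dat1
```

Because the recursion only ever subtracts z, a cumulative sum gives the same values as the loop without iterating over rows.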
I'm working on a dataset in R and want to create a new variable based on the values of variable dx1. Here's my code.
Data1$AMI <- Data1$dx1 %in% c("I21.0", "I21.1", "I21.2", "I21.3",
"I21.4", "I21.9", "I21.A")
My question is how to assign the value of AMI based on several variables, say dx1 to dx25. In this dataset, dx1 is the primary diagnosis, dx2 the second diagnosis, and so on. If any of them contains one of the specific diagnosis codes ("I21.0", "I21.1", "I21.2", "I21.3", "I21.4", "I21.9", "I21.A"), AMI should be assigned the value "1".
For example, if dx1 %in% c("I21.0", "I21.1", "I21.2") or dx2 %in% c("I21.0", "I21.1", "I21.2") or dx3 %in% c("I21.0", "I21.1", "I21.2"), we want the AMI column to show "1".
I may have misunderstood your question; is this what you want to do?
# Load libraries
library(tidyverse)
# Create fake data
dx <- list()
for (i in 1:25){
dx[[i]] <- c(paste("I",
round(rnorm(n = 50, mean = 21, sd = 5), 1),
sep = ""))
}
name_list <- paste("dx", 1:25, sep = "")
Data1 <- as.data.frame(dx, col.names = name_list)
# Create a variable called "AMI" to count the occurrences of values:
# "I21.0","I21.1","I21.2","I21.3","I21.4","I21.9"
Data2 <- Data1 %>%
mutate(AMI = rowSums(
sapply(select(., starts_with("dx")),
function(x) grepl(pattern = paste(c("I21.0","I21.1","I21.2","I21.3","I21.4","I21.9"),
collapse = "|"), x)
))
)
Data2
Edit
Here is how to get the new variable "AMI" to show the value "1" for any row when one or more variables from the list 'dx1:dx25' has a value from the list '"I21.0","I21.1","I21.2","I21.3","I21.4","I21.9"':
Data2 <- Data1 %>%
mutate(AMI = ifelse(rowSums(
sapply(select(., starts_with("dx")),
function(x) grepl(pattern = paste(c("I21.0","I21.1","I21.2","I21.3","I21.4","I21.9"),
collapse = "|"), x)
)) > 0, 1, 0)
)
Data2
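The grepl approach above treats each code as a regular expression, so "." matches any character and longer strings that merely contain a code also count as hits. If exact matches are wanted instead, %in% is an alternative. The following is a minimal base R sketch under that assumption; the small data set and the names ami_codes, dx_cols and hits are made up for illustration.

```r
# Diagnosis codes that define AMI (taken from the question)
ami_codes <- c("I21.0", "I21.1", "I21.2", "I21.3", "I21.4", "I21.9", "I21.A")

# Small made-up data set with three diagnosis columns (illustration only)
Data1 <- data.frame(
  dx1 = c("I21.0", "J18.9", "E11.9", "I10"),
  dx2 = c("E11.9", "I21.A", "I10",   "I21"),
  dx3 = c("I10",   "E11.9", "J18.9", "I21.05"),
  stringsAsFactors = FALSE
)

# All columns named dx followed by digits
dx_cols <- grep("^dx[0-9]+$", names(Data1), value = TRUE)

# Logical matrix: is each cell exactly one of the AMI codes?
hits <- sapply(Data1[dx_cols], function(x) x %in% ami_codes)

# 1 if any diagnosis column in the row matches, otherwise 0
Data1$AMI <- as.integer(rowSums(hits) > 0)
Data1
```

In the last row, "I21" and "I21.05" are not in the code list, so AMI stays 0 there; whether such sub-codes should count depends on the coding scheme in use.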
First Data frame 'total_coming_in' column names: 'LocationID','PartNumber',"Quantity"
Second Data frame 'total_going_out' column names: 'LocationID','PartNumber',"Quantity"
I want output as 'total_data' column names: 'LocationID','PartNumber',"Quantity_subtract" where
Quantity_subtract = total_coming_in$Quantity - total_going_out$Quantity grouped for each 'LocationID','PartNumber'
I tried this:
matchingCols <- c('LocationID','PartNumber')
mergingCols <- names(coming_in)[3]
total_coming_in[total_going_out,on=matchingCols,
lapply(
setNames(mergingCols),
function(x) get(x) - get(paste0("i.", x))
),
nomatch=0L,
by=.EACHI
]
Using data.table, as you seem to want to, I would first cleanly merge the two tables and then do the subtraction only on the rows where it makes sense (i.e. rows in total_coming_in that have matching values in total_going_out and vice versa):
library(data.table)
M <- merge(total_coming_in, total_going_out, by = c('LocationID','PartNumber'))
# i.e. all.x = FALSE, all.y = FALSE,
# thereby eliminating rows in x without matching row in y and vice-versa
M[ , Quantity_subtract := Quantity.x - Quantity.y,
by = c('LocationID','PartNumber')]
For completeness: your question could also be read as treating a missing quantity as 0, i.e. rows of total_coming_in without a match in total_going_out (and vice versa) are kept and the missing side counts as 0. In that case you could do:
M <- merge(total_coming_in, total_going_out, all = TRUE, by = c('LocationID','PartNumber'))
# i.e. all.x = TRUE, all.y = TRUE,
# thereby allowing rows in x without matching row in y and vice-versa
M[is.na(Quantity.x), Quantity.x := 0]
M[is.na(Quantity.y), Quantity.y := 0]
M[ , Quantity_subtract := Quantity.x - Quantity.y,
by = c('LocationID','PartNumber')]
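To see what the full merge with zero-filling does, here is a minimal sketch on a tiny made-up data set. It uses base R's merge on plain data frames rather than data.table so that it runs without extra packages; the quantities and the name keys are assumptions for illustration only.

```r
# Made-up example tables (illustration only)
total_coming_in <- data.frame(
  LocationID = c(1, 1, 2),
  PartNumber = c("A", "B", "A"),
  Quantity   = c(10, 5, 7)
)
total_going_out <- data.frame(
  LocationID = c(1, 2, 3),
  PartNumber = c("A", "A", "C"),
  Quantity   = c(4, 7, 2)
)
keys <- c("LocationID", "PartNumber")

# Full outer merge: keep key combinations that appear in either table
M <- merge(total_coming_in, total_going_out, by = keys, all = TRUE)

# A key missing on one side is treated as a quantity of 0
M$Quantity.x[is.na(M$Quantity.x)] <- 0
M$Quantity.y[is.na(M$Quantity.y)] <- 0

M$Quantity_subtract <- M$Quantity.x - M$Quantity.y
total_data <- M[, c(keys, "Quantity_subtract")]
total_data
```

Part B at location 1 only comes in, so its difference is its incoming quantity; part C at location 3 only goes out, so its difference is negative. With the inner merge (all = FALSE) both of those rows would be dropped.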
So you want to have a column that gives you the difference of total_coming_in and total_going_out for each combination of PartNumber and LocationID, correct?
If so, the following will do:
library(dplyr)
matchingCols <- c("LocationID", "PartNumber")
total_data <- full_join(total_coming_in, total_going_out, by=matchingCols)
total_data <- mutate(total_data, Quantity_subtract = Quantity.x - Quantity.y)
total_data <- select(total_data, -Quantity.x, -Quantity.y) #if you want to get rid of these columns
I used this example data:
total_coming_in <- list(LocationID = round(runif(26, 1000, 9000)),
PartNumber = paste(runif(26, 10000, 20000), LETTERS, sep="-"),
Quantity = round(runif(26, 2, 4))
) %>% as_tibble()
random_integers <- sample(1:26,26,FALSE)
total_going_out <- list(LocationID = total_coming_in$LocationID[random_integers],
PartNumber = total_coming_in$PartNumber[random_integers],
Quantity = round(runif(26, 1, 3))
) %>% as_tibble()
I have multiple data frames, each an individual sequence with the same columns. I need to delete all the rows after a negative value is encountered in the column "OnsetTime": not the row with the negative value itself, but the rows after it. All sequences have 16 rows in total.
I think this must be possible with a loop, but I have no experience with loops in R, and I have 499 data frames. I am currently deleting the rows one sequence at a time, like this:
sequence_6 <- sequence_6[-c(11:16), ]
sequence_7 <- sequence_7[-c(11:16), ]
sequence_9 <- sequence_9[-c(6:16), ]
Is there a faster way of doing this? An example of a sequence can be seen here example sequence
Regarding this example, I want to delete rows 7 to 16.
Data
Since the odd web configuration at work prevents me from accessing your data, I created three dataframes based on random numbers
set.seed(123); data_1 <- data.frame( value = runif(25, min = -0.1) )
set.seed(234); data_2 <- data.frame( value = runif(20, min = -0.1) )
set.seed(345); data_3 <- data.frame( value = runif(30, min = -0.1) )
First, you could create a list containing all your dataframes:
list_df <- list(data_1, data_2, data_3)
Now you can go through this list with a for loop. Since there are several steps, I find it convenient to use the package dplyr because it allows for a more readable notation:
library(dplyr)
for (i in seq_along(list_df)) {
  min_row <-
    list_df[[i]] %>%
    mutate(id = row_number()) %>%       # add a column with the row number
    filter(value < 0) %>%               # keep only the rows with negative values
    summarise(first_neg = min(id)) %>%  # row number of the first negative value
    pull(first_neg)                     # extract it as a scalar (not a data frame)
  # min() of an empty set is Inf (with a warning), so a data frame
  # without any negative value is left unchanged
  if (is.finite(min_row)) {
    list_df[[i]] <- list_df[[i]] %>% slice(1:min_row) # keep rows 1 to min_row
  }
}
Hope it helps!
We can get the datasets into a list, assuming that the object names start with 'sequence' followed by a _ and one or more digits. Then use lapply to loop over the list and subset the rows based on the condition:
lst1 <- lapply(mget(ls(pattern="^sequence_\\d+$")), function(x) {
i1 <- Reduce(`|`, lapply(x, `<`, 0))
#or use rowSums
#i1 <- rowSums(x < 0) > 0
i2 <- which(i1)[1]
# keep the whole data frame if no negative value was found
if (is.na(i2)) x else x[seq_len(i2), ]
}
)
data
set.seed(42)
sequence_6 <- as.data.frame(matrix(sample(-1:10, 16 *5, replace = TRUE), nrow = 16))
sequence_7 <- as.data.frame(matrix(sample(-2:10, 16 *5, replace = TRUE), nrow = 16))
sequence_9 <- as.data.frame(matrix(sample(-2:10, 16 *5, replace = TRUE), nrow = 16))
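If only the "OnsetTime" column from the question should be checked, the same idea can be wrapped in a small helper. This is a minimal base R sketch; the function name trim_after_negative and the short example sequences are made up for illustration (the real sequences have 16 rows and more columns).

```r
# Keep rows up to and including the first negative value in `col`.
# A data frame without any negative value is returned unchanged.
# (trim_after_negative is a hypothetical helper, not from the answers above.)
trim_after_negative <- function(df, col = "OnsetTime") {
  first_neg <- which(df[[col]] < 0)[1]
  if (is.na(first_neg)) df else df[seq_len(first_neg), , drop = FALSE]
}

# Made-up sequences for illustration
seqs <- list(
  sequence_6 = data.frame(OnsetTime = c(3, 8, -1, 5, 9)),
  sequence_7 = data.frame(OnsetTime = c(2, 4, 6, 8, 10)),
  sequence_9 = data.frame(OnsetTime = c(-1, 4, 6, 8, 10))
)

# Apply the helper to every data frame in the list
trimmed <- lapply(seqs, trim_after_negative)
sapply(trimmed, nrow)
```

Keeping the data frames in a named list, rather than as 499 separate objects, means one lapply call handles all of them.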
I have 5 repeated measures called pub1:pub5, each taking a value of 1 to 4. Each was measured at a different age, age1:age5. That is, pub1 was measured at age1, pub2 at age2, and so on up to pub5 at age5.
I would like to create a new variable age_pb2 that shows the age at which a value of 2 first occurred in pub. For example, for individual x, age_pb2 will equal age3 if the first time a value of 2 is scored is in pub3
I have tried modifying previous code but not had much luck.
library(tidyverse)
#Example data
N <- 2000
data <- data.frame(id = 1:2000,age1 = rnorm(N,6:8),age2 = rnorm(N,7:9),age3 = rnorm(N,8:10),
age4 = rnorm(N,9:11),age5 = rnorm(N,10:12),pub1 = rnorm(N,1:2),pub2 = rnorm(N,1:2),
pub3 = rnorm(N,1:2),pub4 = rnorm(N,1:2),pub5 = rnorm(N,1:2))
# funs() is deprecated in current dplyr; across() with a lambda does the same job
data <- data %>% mutate(across(starts_with("pub"), ~ round(replace(.x, .x < 0, NA), 0)))
#New variable showing first age at getting a score of 2 (doesn't work)
i1 <- grepl('^pub', names(data)) # index for pub columns
i2 <- grepl('^age', names(data)) # index for age columns
data[paste0("age_pb2")] <- lapply(2, function(i) {
j1 <- max.col(data[i1] == i, 'first')
j2 <- rowSums(data[i1] == i) == 0
data[i2][cbind(seq_len(nrow(data)), j1 *(NA^j2))]
})
set.seed(1)
N <- 2000
data <- data.frame(id = 1:2000,age1 = rnorm(N,6:8),age2 = rnorm(N,7:9),age3 = rnorm(N,8:10),
age4 = rnorm(N,9:11),age5 = rnorm(N,10:12),pub1 = rnorm(N,1:2),pub2 = rnorm(N,1:2),
pub3 = rnorm(N,1:2),pub4 = rnorm(N,1:2),pub5 = rnorm(N,1:2)) %>%
mutate(across(starts_with("pub"), ~ round(replace(.x, .x < 0, NA), 0))) %>%
mutate(age_pb2 = eval(parse(text = paste0("age", which.min(apply(select(., starts_with("pub")), 2, function(x) which(x == 2)[1]))))))
How it works: apply() goes over the pub columns, and which(x == 2)[1] returns the first matching row in each column. which.min() then gives the index of the column whose first match comes earliest, and that index is pasted onto "age" and evaluated (using eval(parse(text = variable name))) to select the corresponding age column.
E.g. here, after apply, you get
[pub1 = 2, pub2 = 1, pub3 = 2, pub4 = 4, pub5 = 2]
which is the row of the first occurrence of 2 in each column. The earliest occurrence (which.min) is in the second pub column, so the index is 2. This is pasted with "age" and evaluated inside mutate, so age_pb2 takes the values of age2.
EDIT
It is probably more convenient to do it in a for loop for all age_pbi, or there is an easy solution in dplyr that I am not aware of.
for (i in 1:5) {
index <- which.min(apply(select(data, starts_with("pub")), 2, function(x) which(x == i)[1]))
data[ ,paste0("age_pb", i)] <- data[ ,paste0("age", index)]
}
Note however, that which.min takes the first minimum. E.g. pub1 and pub2 both have a 1 in the first row, so the above approach assigns age1 to age_pb1 whereas it could be age2 as well. I don't know what you want to do with this, so can't say what is a better option.
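The code above picks one age column for the whole data set. If the goal is instead a per-individual answer, as the question describes (each row gets the age at which that row's own pub score first equals 2), a row-wise lookup is one option. Below is a minimal base R sketch on made-up data with three occasions instead of five; the names pub, age, is2 and first2 are assumptions for illustration.

```r
# Tiny made-up example: 4 individuals, 3 measurement occasions
data <- data.frame(
  id   = 1:4,
  age1 = c(6.1, 7.0, 8.2, 6.5),
  age2 = c(7.3, 8.1, 9.0, 7.4),
  age3 = c(8.2, 9.4, 10.1, 8.8),
  pub1 = c(1, 2, NA, 1),
  pub2 = c(2, 2, 1, 1),
  pub3 = c(2, 3, 2, NA)
)

pub <- as.matrix(data[grepl("^pub", names(data))])
age <- as.matrix(data[grepl("^age", names(data))])

# TRUE where the score equals 2; missing scores count as "not 2"
is2 <- !is.na(pub) & pub == 2

# Column of the first TRUE in each row
first2 <- max.col(is2 * 1, ties.method = "first")
# Rows that never score 2 get NA instead of a column index
first2[rowSums(is2) == 0] <- NA

# For each row, take the age from the matching occasion
# (an NA in the index matrix yields NA in the result)
data$age_pb2 <- age[cbind(seq_len(nrow(data)), first2)]
data
```

Handling NA explicitly before calling max.col and rowSums matters here, because the pub columns in the example data contain missing values.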
I have the following two data frames:
letters <- LETTERS[seq(from = 1, to = 5)]
values <- rnorm(5, mean = 50)
df1 <- data.frame(letters, values)
category <- sample(LETTERS[1:5], 20, replace = TRUE)
numbers <- rnorm(20, mean = 100)
df2 <- data.frame(category, numbers)
I want to create a new column in df2 that takes the value in df2$numbers and subtracts the value in df1$values based on the matching letter.
In other words, if the value for "C" in df1 is 49.2, I want to subtract 49.2 from every row in df2$numbers where df2$category equals "C". Hope that makes sense. Thanks for the help!
With dplyr:
library(dplyr)
df <- full_join(df1, df2, by = c('letters' = 'category')) %>%
  mutate(diff = numbers - values)
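A base R alternative that adds the column directly to df2, keeping its rows and their order, is to look up each category with match(). This is a minimal sketch on small made-up data (the value 49.2 for "C" follows the example in the question; the other numbers and the name idx are assumptions).

```r
# Made-up lookup table and data (illustration only)
df1 <- data.frame(letters = c("A", "B", "C"), values = c(50, 51, 49.2))
df2 <- data.frame(category = c("C", "A", "C", "B"),
                  numbers  = c(100, 101, 99.2, 100))

# For each row of df2, the position of its category in df1$letters
idx <- match(df2$category, df1$letters)

# Subtract the matching df1 value from each df2 number
df2$diff <- df2$numbers - df1$values[idx]
df2
```

A category in df2 that is absent from df1 would give idx = NA and therefore diff = NA for that row.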