I have a data frame that includes many variables. Here is a shortened version of what I have so far:
n_20010_0_0 <- c(1,2,3,4)
n_20010_0_1 <- c(0, -2, NA, 4)
n_20010_0_2 <- c(3, 0, -7, 2)
x <- data.frame (n_20010_0_0, n_20010_0_1, n_20010_0_2)
I created a new variable that returns whether or not there is a 1 within the list of variables:
MotherIllness0 <- paste("n_20010_0_", 0:2, sep = "")
x$MotherCAD_0_0 <- apply(x, 1, function(x) as.integer(any(x[MotherIllness0] == 1, na.rm = TRUE)))
I would like to keep the NAs as 0's, but I would also like to recode it so that if there is a -7 the new value is NA.
This is what I've tried and it doesn't work:
x$MotherCAD_0_0[MotherIllness0 == -7] <- NA
You don't need to index by MotherIllness0 inside the function if you subset the columns before calling apply; the argument 1 in apply already makes it work row by row.
Here's a line of code that does both things you want.
MotherIllness0 <- paste("n_20010_0_", 0:2, sep = "")
x$MotherCAD_0_0 <- apply(x[, MotherIllness0], 1, function(x) ifelse(any(x == -7, na.rm = TRUE), NA,
as.integer(any(x == 1, na.rm = TRUE))))
I assumed that a row with both 1s and -7s should have NA for the new variable. If not, then this should work:
x$MotherCAD_0_0 <- apply(x[, MotherIllness0], 1, function(x) ifelse(any(x == 1, na.rm = TRUE), 1,
ifelse(any(x == -7, na.rm = TRUE), NA, 0)))
Note that with the example you have above, these two lines should produce the same outcome.
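If the data frame is large, the same rule can also be written without apply. The following is only a sketch of the first variant (any -7 in a row gives NA), swapping in rowSums on logical comparisons; the helper names vals, has_one and has_neg7 are made up for illustration.

```r
x <- data.frame(
  n_20010_0_0 = c(1, 2, 3, 4),
  n_20010_0_1 = c(0, -2, NA, 4),
  n_20010_0_2 = c(3, 0, -7, 2)
)
MotherIllness0 <- paste0("n_20010_0_", 0:2)

# Comparing a data frame with a number gives a logical matrix
vals <- x[MotherIllness0]

# Row-wise checks; na.rm = TRUE means an NA cell counts as "no match"
has_one  <- unname(rowSums(vals == 1,  na.rm = TRUE) > 0)
has_neg7 <- unname(rowSums(vals == -7, na.rm = TRUE) > 0)

# Rows containing -7 become NA, all other rows become 0 or 1
x$MotherCAD_0_0 <- ifelse(has_neg7, NA_integer_, as.integer(has_one))
x
```

For the second variant (a 1 wins over a -7), the last line could instead be ifelse(has_one, 1L, ifelse(has_neg7, NA_integer_, 0L)).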
Here's another way to do it, without using any if-else logic:
# Here's your dataset, with a row including both 1 and -7 added:
x <- data.frame (n_20010_0_0 = c(1, 2, 3, 4, 1),
n_20010_0_1 = c(0, -2, NA, 4, 0) ,
n_20010_0_2 = c(3, 0, -7, 2, -7)
)
# Your original function:
MotherIllness0 <- paste("n_20010_0_", 0:2, sep = "")
x$MotherCAD_0_0 <- apply(x, MARGIN = 1, FUN = function(x) {
as.integer(
any(x[MotherIllness0] == 1, na.rm = TRUE)
)
})
# A version without if-else; note that it returns 0 (not NA) for any row containing -7:
x$test <- apply(x, MARGIN = 1, FUN = function(row) {
as.integer(
any(row[MotherIllness0] == 1, na.rm = TRUE) &
!any(row[MotherIllness0] == -7, na.rm = TRUE)
)
})
A couple of notes: the name of x in an anonymous function like function(x) can be anything, and you'll save yourself a lot of confusion by calling it what it is (I named it row above).
It's also unlikely that you actually need to convert your result column to integer - logical columns are easier to interpret, and they work the same as 0-1 columns for just about everything (e.g., TRUE + FALSE equals 1).
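As a small illustration of that last point, here is a sketch with a made-up logical vector (flags is a hypothetical name, not a column from the question):

```r
# A hypothetical logical column with one missing value
flags <- c(TRUE, FALSE, TRUE, NA)

# Arithmetic treats TRUE as 1 and FALSE as 0
n_true     <- sum(flags, na.rm = TRUE)   # number of TRUE values
share_true <- mean(flags, na.rm = TRUE)  # share of TRUE among non-missing values
one        <- TRUE + FALSE

# Converting explicitly is still possible whenever integers are really needed
as_int <- as.integer(flags)
```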
In the below reproducible code, the custom balTransit() function correctly populates a values transition table using a for-loop, while the custom balTransit_1() function is supposed to do the same using lapply() but it doesn't work. What am I doing wrong in my implementation of lapply()? Run the code and you'll see results of:
balTransit (correct results):
> test
   X1 X0 X2
X1  0  0  3
X0  0 50  0
X2  5  0  0
balTransit_1 (incorrect, all 0's):
> test_1
   X1 X0 X2
X1  0  0  0
X0  0  0  0
X2  0  0  0
Enhanced explanation:
My main objective here is to learn how to use the apply() family of functions, for their perceived benefits. I’ve been going through simple tutorials. A secondary objective is the generation of a transition matrix from a base data frame. Once I figure this out with lapply() (or another apply() function that is most suitable), I’m going to run the various options (for-loop(), data.table(), lapply(), etc.) against the actual data set of 2.5m rows for speed testing.
What I’m doing is creating a transition matrix (technically here a data frame) showing the flow of values (balances) from one “Flags” category to another “Flags” category, over the periods specified by the user. So, in my “for-loop” reproducible example which works correctly, the user has specified a “From” period of 1 and a “To” period of 3. The transition matrix is then generated as shown in the image now posted at the bottom.
A related post yesterday, How to convert a for-loop to lapply function for parallel testing purposes?, addresses this issue for transition counts. This post addresses transition values.
Reproducible code:
# Set up data frame:
data <-
data.frame(
ID = c(1,1,1,2,2,2,3,3,3),
Period = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
Values = c(5, 10, 15, 50, 2, 4, 3, 6, 9),
Flags = c("X1","X0","X2","X0","X2","X0", "X2","X1","X1")
)
# Function to set-up base transition table:
transMat <- function(data){
DF <- data.frame(matrix(0, ncol=length(unique(data$Flags)), nrow=length(unique(data$Flags))))
row.names(DF) <- unique(data$Flags)
names(DF) <- unique(data$Flags)
return(DF)
}
# Function to populate cells of transition table, using for-loop:
balTransit <- function(data, from=1, to=3){
DF <- transMat(data)
for (i in unique(data$ID)){
id_from <- as.character(data$Flags[(data$ID == i & data$Period == from)])
id_to <- as.character(data$Flags[data$ID == i & data$Period == to])
column <- which(names(DF) == id_from)
row <- which(row.names(DF) == id_to)
val <- (data$Values[(data$ID == i & data$Period == from)])
DF[row, column] <- val + DF[row,column]
}
return(DF)
}
# Function to populate cells of transition table, using lapply:
balTransit_1 <- function(data, from=1, to=3){
DF_1 <- transMat(data)
lapply(seq_along(unique(data$ID)), function(i){
id_from <- as.character(data$Flags[(data$ID == i & data$Period == from)])
id_to <- as.character(data$Flags[data$ID == i & data$Period == to])
column <- which(names(DF_1) == id_from)
row <- which(row.names(DF_1) == id_to)
val <- (data$Values[(data$ID == i & data$Period == from)])
DF_1[row, column] <- DF_1[row, column] + val
})
return(DF_1)
}
# Run the 2 functions:
test <- balTransit(data,1,3)
test
test_1 <- balTransit_1(data,1,3)
test_1
To make your lapply code work just replace <- with <<-:
DF_1[row, column] <<- DF_1[row, column] + val
Please see ?assignOps for more info.
However, again, I wouldn't recommend lapply in this case (<<- should be avoided in general).
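If the goal is mainly to practice lapply, one way to avoid <<- is to let the anonymous function return a value instead of modifying DF_1. The following is only a sketch under that assumption: each ID yields a one-row data frame, and xtabs then sums the values into the table. The names balTransit_2, pieces and moves are made up for illustration.

```r
data <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Period = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  Values = c(5, 10, 15, 50, 2, 4, 3, 6, 9),
  Flags = c("X1", "X0", "X2", "X0", "X2", "X0", "X2", "X1", "X1")
)

balTransit_2 <- function(data, from = 1, to = 3) {
  flags <- as.character(unique(data$Flags))
  # One small data frame per ID: end flag, start flag, and starting value
  pieces <- lapply(unique(data$ID), function(i) {
    start <- data[data$ID == i & data$Period == from, ]
    end   <- data[data$ID == i & data$Period == to, ]
    data.frame(
      to    = factor(as.character(end$Flags), levels = flags),
      from  = factor(as.character(start$Flags), levels = flags),
      value = start$Values
    )
  })
  moves <- do.call(rbind, pieces)
  # Rows are the "to" flag, columns the "from" flag; unused pairs stay 0
  xtabs(value ~ to + from, data = moves)
}

test_2 <- balTransit_2(data, 1, 3)
test_2
```

The result is a table rather than a data frame; as.data.frame.matrix(test_2) converts it if the data frame layout is needed.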
Here is a data.table approach:
library(data.table)
DT <- setDT(data.frame(
ID = c(1,1,1,2,2,2,3,3,3),
Period = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
Values = c(5, 10, 15, 50, 2, 4, 3, 6, 9),
Flags = c("X1","X0","X2","X0","X2","X0", "X2","X1","X1")
))
unique_flags <- unique(DT$Flags)
all_flags <- setDT(expand.grid(list(first_flag = unique_flags, last_flag = unique_flags)))
resultDT <- dcast(
data = DT[, .(first_flag = first(Flags), last_flag = last(Flags), first_value = first(Values)), by = ID][
all_flags, on = c("first_flag", "last_flag")],
last_flag ~ first_flag,
fun.aggregate = sum,
value.var = "first_value"
)
for (col_i in seq_len(ncol(resultDT))){
set(resultDT, which(is.na(resultDT[[col_i]])), col_i, 0)
}
print(resultDT)
Result:
   last_flag X0 X1 X2
1:        X0 50  0  0
2:        X1  0  0  3
3:        X2  0  5  0
# step by step ------------------------------------------------------------
library(data.table)
DT <- setDT(data.frame(
ID = c(1,1,1,2,2,2,3,3,3,4,4,4),
Period = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3),
Values = c(5, 10, 15, 50, 2, 4, 3, 6, 9, 3, 6, 9),
Flags = c("X1","X0","X2","X0","X2","X0", "X2","X1","X1", "X2","X1","X1")
))
unique_flags <- unique(DT$Flags)
all_flags <- setDT(expand.grid(list(first_flag = unique_flags, last_flag = unique_flags)))
resultDT <- DT[, .(first_flag = first(Flags), last_flag = last(Flags), first_value = first(Values)), by = ID] # find relevant flags
resultDT <- resultDT[all_flags, on = c("first_flag", "last_flag")] # merge all combinations
resultDT <- dcast(resultDT, last_flag ~ first_flag, fun.aggregate = sum, value.var = "first_value") # dcast
for (col_i in seq_len(ncol(resultDT))){
set(resultDT, which(is.na(resultDT[[col_i]])), col_i, 0)
}
print(resultDT)
I need to calculate the sum of some variables with imputed values. I did this with complete --> as.mids --> with --> do.call
I needed to do the same thing but in a survey context. Therefore, I did: update --> with --> MIcombine
The means of the variables calculated both ways do not match. Which one is correct?
You may check this different behavior in this toy database:
library(tidyverse)
library(mice)
library(mitools)
library(survey)
mydata <- structure(list(dis1 = c(NA, NA, 1, 0, 0, 1, 1, 1, 1, 0),
dis2 = c(0, 1, 0, 1, NA, 1, 1, 1, 1, 0),
dis3 = c(1, 1, 0, 0, NA, 1, 1, 1, 1, 0),
sex = c(0,0,0,1,0,1,1,1,1,0),
clus = c(1,1,1,1,1,2,2,2,2,2)),
row.names = c(NA, 10L),
class = c("tbl_df", "tbl", "data.frame") )
imp <- mice::mice(mydata, m = 5, seed = 237856)
# calculating numenf with mice::complete
long <- mice::complete(imp, action = "long", include = TRUE)
long$numenf <- long$dis1 + long$dis2 + long$dis3
imp2 <- mice::as.mids(long)
res <- with(imp2, mean(numenf))
do.call(mean, res$analyses) # mean = 2.1
#calculating numenf with update (from survey)
imp1 <- mice::complete(imp)
imp2 <- mice::complete(imp, 2)
imp3 <- mice::complete(imp, 3)
imp4 <- mice::complete(imp, 4)
imp5 <- mice::complete(imp, 5)
listimp <- mitools::imputationList(list(imp1, imp2, imp3, imp4, imp5))
clus <- survey::svydesign(id = ~clus, data = listimp)
clus <- stats::update(clus, numenf = dis1 + dis2 + dis3)
res <- with(clus, survey::svymean(~numenf))
summary(mitools::MIcombine(res)) # mean = 1.98
Answer
Replace do.call(mean, res$analyses) with mean(unlist(res$analyses)).
Rationale
In the first code snippet, res$analyses is a list. When you pass it to do.call, each list element becomes a separate argument, so you are essentially calling:
mean(res$analyses[[1]], res$analyses[[2]], res$analyses[[3]], res$analyses[[4]], res$analyses[[5]])
mean averages only the vector passed as its first argument; the remaining values are matched to its other arguments (trim, na.rm, ...) rather than being averaged (see ?mean). Hence, you just get 2.1 back, since that is the value of the first analysis.
We can make a vector out of the list by using unlist(res$analyses). Then, we can just feed it to mean as an argument:
mean(unlist(res$analyses))
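The difference can be reproduced without mice. This is a sketch with a made-up list of three numbers standing in for res$analyses (the values are hypothetical, not taken from the imputed data):

```r
# Hypothetical stand-in for res$analyses: one mean per imputed data set
analyses <- list(2.1, 2.0, 1.9)

# do.call spreads the list over mean's arguments, so only the first
# element is treated as the data to average
first_only <- do.call(mean, analyses)

# Flattening the list first averages all of the values
pooled <- mean(unlist(analyses))

c(first_only = first_only, pooled = pooled)
```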
How can I assign a value into a matrix based on a vector of column indices? A working example is:
# Input:
r <- c(2, 1, 3)
m <- matrix(rep(0, 9), nrow = 3)
# Desired output
result <- matrix(c(0, 1, 0,
1, 0, 0,
0, 0, 1), nrow = 3)
result
# I tried this notation, but it does not work:
sapply(1:3, function(x)m[x, r[x]] <- 1)
We can use a two-column matrix of row/column indices to assign:
m[cbind(seq_len(nrow(m)), r)] <- 1
Or using replace, which returns a modified copy rather than changing m in place:
replace(m, cbind(seq_len(nrow(m)), r), 1)
I have two matrices:
df_A = matrix(, nrow = 5, ncol = 3)
df_A[,1] = c(0, 0, 1, -1, 1)
df_A[,2] = c(0, 1, -1, 0, -1)
df_A[,3] = c(1, 0, -1, 1, 1)
df_B = matrix(, nrow = 5, ncol = 3)
df_B[,1] = c(1, -1, 0, 0, 1)
I want to simulate columns 2 and 3 for df_B based on a few conditions. If the value of df_A is zero, the value for df_B does not change. For example, the first two values for df_B should not change for the first iteration because the first two values for df_A are zero. If the value of df_A is one or negative one, then the respective value for df_B will take on that value given a certain probability (20% in this example). For example, if df_A is negative one and df_B is zero (or one), the respective value for df_B will become negative one 20% of the time.
I know the following is incorrect but here is what I have so far:
belief_change = function(x){
if (df_A[x] = -1 & df_B[x] != -1 & sample(1:2, 1, prob = c(0.2, 0.8) = 1))
df_B[x+1] = df_A[x]
else
df_B[x+1] = df_B[x]
if (df_A[x] = 1 & df_B[x] != 1 & sample(1:2, 1, prob = c(0.2, 0.8) = 1))
df_B[x+1] = df_A[x]
else
df_B[x+1] = df_B[x]
if (df_A[x] = 0)
df_B[x+1] = df_B[x]
}
I'm using the sample function here to help generate a probability. I also need to put this into a for-loop eventually.
There are several errors in your code. First, the equality operator is ==, not =. Second, you put = 1 inside the call to sample instead of comparing its result after the closing parenthesis. Third, if you really want to change df_B outside the function by calling it, you need the superassignment operator <<-, not = or <-. You can read more about this here.
Here is a version of your code that works; see if it does what you want it to do.
belief_change <- function(x) {
  # The branches are chained with else if so that a later check cannot
  # overwrite a value assigned by an earlier one
  if (df_A[x] == -1 && df_B[x] != -1 && sample(1:2, 1, prob = c(0.2, 0.8)) == 1) {
    df_B[x+1] <<- df_A[x]
  } else if (df_A[x] == 1 && df_B[x] != 1 && sample(1:2, 1, prob = c(0.2, 0.8)) == 1) {
    df_B[x+1] <<- df_A[x]
  } else {
    # Covers df_A[x] == 0 and the 80% of cases where no change happens
    df_B[x+1] <<- df_B[x]
  }
}
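The update rule can also be applied to a whole column at once, which avoids <<- and the explicit function. This is only a sketch, assuming that column j + 1 of df_B is derived from column j of df_A and df_B, and that runif(n) < 0.2 is an acceptable way to get a 20% chance; p_adopt, differs and adopt are made-up names.

```r
set.seed(42)  # only to make the sketch reproducible

df_A <- cbind(c(0, 0, 1, -1, 1), c(0, 1, -1, 0, -1), c(1, 0, -1, 1, 1))
df_B <- matrix(NA_real_, nrow = 5, ncol = 3)
df_B[, 1] <- c(1, -1, 0, 0, 1)

p_adopt <- 0.2  # probability of taking over df_A's value

for (j in seq_len(ncol(df_B) - 1)) {
  # A change is possible only where df_A is non-zero and differs from df_B
  differs <- df_A[, j] != 0 & df_A[, j] != df_B[, j]
  # Each candidate row adopts df_A's value with probability p_adopt
  adopt <- differs & runif(nrow(df_B)) < p_adopt
  df_B[, j + 1] <- ifelse(adopt, df_A[, j], df_B[, j])
}
df_B
```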
I have two vectors, which I would like to combine in one data frame. The values of one vector, value, need to be split across two columns. The second vector, nc, gives the number of values for each observation. If nc is 1, only one value is given in value (which goes into val1), and 999 is to be written in the second column (val2).
What is an R-ish way to split the vector value and populate the two columns of df? I suspect I'm missing something very obvious, but I can't proceed at the moment. Many thanks!
set.seed(123)
nc <- sample(1:2, 10, replace = TRUE)
value <- sample(1:6, sum(nc), replace = TRUE)
# result by hand
df <- data.frame(nc = nc,
val1 = c(6, 3, 4, 1, 2, 2, 6, 5, 6, 5),
val2 = c(999, 5, 999, 6, 1, 999, 6, 4, 4, 999))
Here's an approach based on this answer:
set.seed(123)
nc <- sample(1:2, 10, replace = TRUE)
value <- sample(1:6, sum(nc), replace = TRUE)
splitUsing <- function(x, pos) {
unname(split(x, cumsum(seq_along(x) %in% cumsum(replace(pos, 1, pos[1] + 1)))))
}
combineValues <- function(vals, nums) {
mydf <- data.frame(cbind(nums, do.call(rbind, splitUsing(vals, nums))))
mydf$V3[mydf$nums == 1] <- 999
return(mydf)
}
df <- combineValues(value, nc)
I think this is what you are looking for. I'm not sure it is the fastest way, but it should do the trick.
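# Start from an empty result so the loop does not depend on a pre-built df
df <- data.frame(nc = nc, val1 = NA, val2 = NA)
count <- 0  # position in value of the last element used so far
for (i in seq_along(nc)) {
  count <- count + nc[i]
  if (nc[i] == 1) {
    df$val1[i] <- value[count]
    df$val2[i] <- 999
  } else {
    df$val1[i] <- value[count - 1]
    df$val2[i] <- value[count]
  }
}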
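The same bookkeeping can be done without a loop by computing where each observation starts and ends in value. This is a sketch, assuming nc only ever contains 1 or 2; starts and ends are made-up names.

```r
set.seed(123)
nc <- sample(1:2, 10, replace = TRUE)
value <- sample(1:6, sum(nc), replace = TRUE)

# Last and first position in value belonging to each observation
ends   <- cumsum(nc)
starts <- ends - nc + 1

df <- data.frame(
  nc   = nc,
  val1 = value[starts],
  # With nc == 2 the second value sits at the end position; otherwise use 999
  val2 = ifelse(nc == 2, value[ends], 999)
)
df
```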