I'm doing a simulation study and I'm having trouble generating data that meet certain conditions.
My first simulated data look like this:
A1 A2
1 0.8 6
2 0.5 3
3 0.9 2
...
1000
This is how I generated A1 & A2
set.seed(47)
df <- data.frame(A1 = rnorm(1000, mean=0.7, sd=0.1), A2 = rnorm(1000, mean=4, sd=1))
df
In tabular format, this is how the conditional statement looks where 0=fail and 1=pass and the output in the table is the probability of getting a 1 for A3.
      A1=0  A1=1
A2=0  0.1   0.3
A2=1  0.9   0.7
Here is the explanation in words:
I want to generate a third column (A3) based on conditional probabilities of the first two columns. These are the conditions I want to apply:
If A1>=0.7 (pass) & A2>=0.8 (pass) --> A3=1 with a 70% probability (implying a 30% probability of zero)
If A1>=0.7 (pass) & A2<0.8 (fail) --> A3=1 with a 30% probability
If A1<0.7 (fail) & A2>=0.8 (pass) --> A3=1 with a 90% probability
If A1<0.7 (fail) & A2<0.8 (fail) --> A3=1 with a 10% probability
I hope my logic makes sense. Please let me know if I need more data or words to better explain. Thank you.
You could use a little trick here of converting logical vectors to integers then counting in binary.
If you do the logical test df$A1 >= 0.7 you get a vector of TRUE and FALSE values. If instead you do as.numeric(df$A1 >= 0.7) you get the equivalent vector of 1s and 0s. The trick is to do this for both variables, but multiply the second vector by 2. Now if you add both vectors together, you will get a number between 0 and 3 that corresponds to your truth table:
A1 pass, A2 pass = 3
A1 fail, A2 pass = 2
A1 pass, A2 fail = 1
A1 fail, A2 fail = 0.
Note that if we add one to these numbers, we get a value between one and four. We can therefore use them as indexes of our probability vector:
probs <- c(0.1, 0.3, 0.9, 0.7)[1 + (df$A1 >= 0.7) + 2*(df$A2 >= 0.8)]
That means we can generate the random binary numbers using rbinom like so:
df$A3 <- rbinom(1000, 1, probs)
Resulting in:
head(df)
#> A1 A2 A3
#> 1 0.8994696 5.345481 1
#> 2 0.7711143 3.662635 1
#> 3 0.7185405 3.125840 1
#> 4 0.6718235 3.914527 0
#> 5 0.7108776 3.366858 1
#> 6 0.5914263 2.082173 0
Created on 2022-09-30 with reprex v2.0.2
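As a cross-check on the indexing trick, here is a minimal sketch of the same lookup written with a small 2 x 2 probability matrix instead of a binary-coded index. The names p_table, a1_pass and a2_pass are introduced here for illustration only and are not part of the original answer.

```r
set.seed(47)
df <- data.frame(A1 = rnorm(1000, mean = 0.7, sd = 0.1),
                 A2 = rnorm(1000, mean = 4, sd = 1))

# P(A3 = 1): rows index A2 (fail, pass), columns index A1 (fail, pass).
# matrix() fills column by column, so the first column is A1 = fail.
p_table <- matrix(c(0.1, 0.9, 0.3, 0.7), nrow = 2,
                  dimnames = list(A2 = c("fail", "pass"),
                                  A1 = c("fail", "pass")))

a1_pass <- df$A1 >= 0.7
a2_pass <- df$A2 >= 0.8

# Look up one probability per row via a two-column (row, column) index matrix
probs <- p_table[cbind(a2_pass + 1, a1_pass + 1)]
df$A3 <- rbinom(nrow(df), size = 1, prob = probs)

# Empirical share of 1s in each cell that actually occurs in the data
tapply(df$A3, list(A2 = a2_pass, A1 = a1_pass), mean)
```

With A2 drawn from a normal distribution with mean 4 and sd 1, almost every row passes the A2 >= 0.8 test, so in practice only the 0.9 and 0.7 cells are likely to be populated.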
This is a follow-up to a question I previously asked (Replace only certain values in column based on multiple conditions). For context I'm including some of the same information.
I have a large dataframe that contains many columns, but the relevant ones are: ID (this is number assigned to subject), Time (time at which this subject's measurement was taken) and Concentration. A very simplified example would be:
df <- data.frame( ID=c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3),
Concentration=c("XXX",0.3,0.7,0.6,"XXX","XXX",0.8,0.3,"XXX","XXX",
"XXX",0.6,0.1,0.1,"XXX"),
Time=c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5))
I would like to replace only the "XXX" values in column Concentration based on the following conditions:
when the value in column Time is less than or equal to timeX, "XXX" should be replaced with 0
when the value in column Time is greater than timeX, "XXX" should be replaced with the word "Missing", unless two consecutive "XXX" values appear for a single subject (ID) for Time>timeX; in that case the first consecutive "XXX" should be replaced with 0.05 and the second consecutive "XXX" (or all the following "XXX" values if there are more) should be replaced with the word "Missing".
It's very important that the IDs are kept separate here, because there could be "XXX" as the final Concentration of one ID and as the first Concentration of the next ID, and I do not want that to be read as two consecutive "XXX" values for a single ID.
The solution I have, for when we assume timeX=3 is:
require(tidyverse)
df <- tibble(df) %>%
  mutate(Concentration = as.character(Concentration),
         Concentration_Original = Concentration) %>%
  mutate(Concentration = ifelse(Concentration == 'XXX' & Time <= 3, "0", Concentration)) %>%
  group_by(ID) %>%
  mutate(Concentration = ifelse(Concentration == 'XXX' & Concentration == lead(Concentration),
                                "0.05", ifelse(Concentration == 'XXX',
                                               "Missing", Concentration))) %>%
  replace_na(list(Concentration = "Missing")) %>% ungroup()
To make the code more flexible and more importantly so that it doesn't require the user to manually check what the time cut off point should be and then manually insert it, I've been trying to make the code more automatic.
I would like to replace Time <= 3 with the following condition for timeX:
timeX is the value in column Time for that specific subject ID at which the value in column Concentration is the highest. So basically the condition should be that timeX is the time at which the concentration achieves its maximum value.
For example: For ID 1 in my df, the highest concentration would be 0.7 and that concentration is achieved at Time = 3 so the value 3 should be inserted as timeX value.
Here are some thoughts/suggestions that might be helpful.
First, if you wish to look at the maximum value for Concentration, I would not have this column be of character type. Instead, I would make it numeric and use NA for missing values. The first mutate sets that up.
After grouping, you can use mutate and case_when for your various situations. You can access the Time of maximum concentration through:
Time[which(Concentration == max(Concentration, na.rm = TRUE))]
(removing the missing values).
If the Concentration is missing and Time is less than the Time of maximum concentration, then change it to 0.
In the second case, if the lead (the subsequent row) is also missing, then change it to .05.
Otherwise, do not change Concentration.
Depending on further analyses and presentation, you can use "Missing" as a text label for missing data.
Edit: Based on the OP's comment, it appears that only the first "XXX" after the max time should be replaced with .05 for concentration, while all the following "XXX" values after that should stay missing. To achieve this, add:
!is.na(lag(Concentration, default = 0))
as a condition for determining if value should be .05. The logic is: if the previous row's value is not NA, but the following value is NA, after the max time, then change to .05.
Here is the modified code:
library(tidyverse)
df %>%
  mutate(Concentration = ifelse(Concentration == "XXX", NA_character_, Concentration),
         Concentration = as.numeric(Concentration)) %>%
  group_by(ID) %>%
  mutate(Concentration_New = case_when(
    is.na(Concentration) & Time < first(Time[which(Concentration == max(Concentration, na.rm = TRUE))]) ~ 0,
    is.na(Concentration) & Time > last(Time[which(Concentration == max(Concentration, na.rm = TRUE))]) &
      is.na(lead(Concentration, default = 0)) & !is.na(lag(Concentration, default = 0)) ~ .05,
    TRUE ~ Concentration
  ))
Output
ID Concentration Time Concentration_New
<dbl> <dbl> <dbl> <dbl>
1 1 NA 1 0
2 1 0.3 2 0.3
3 1 0.7 3 0.7
4 1 0.6 4 0.6
5 1 NA 5 NA
6 2 NA 1 0
7 2 0.8 2 0.8
8 2 0.3 3 0.3
9 2 NA 4 0.05
10 2 NA 5 NA
11 3 NA 1 0
12 3 0.6 2 0.6
13 3 0.1 3 0.1
14 3 0.1 4 0.1
15 3 NA 5 NA
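To see what the per-ID cut-off looks like on its own, here is a small base R sketch that computes the time of maximum concentration for each ID. It assumes Concentration has already been converted to numeric with NA for "XXX"; the names time_of_max and timeX are illustrative. Note that which.max() skips missing values and returns the first maximum if there are ties.

```r
# Example data with "XXX" already recoded as NA (assumption for this sketch)
df <- data.frame(
  ID = rep(1:3, each = 5),
  Concentration = c(NA, 0.3, 0.7, 0.6, NA,
                    NA, 0.8, 0.3, NA, NA,
                    NA, 0.6, 0.1, 0.1, NA),
  Time = rep(1:5, times = 3)
)

# One value per ID: the Time at which Concentration is highest
time_of_max <- sapply(split(df, df$ID),
                      function(d) d$Time[which.max(d$Concentration)])

# Attach the cut-off to every row of the matching ID
df$timeX <- unname(time_of_max[as.character(df$ID)])
time_of_max
```

For the example data this gives a cut-off of 3 for ID 1 and 2 for IDs 2 and 3, which can then replace the hard-coded Time <= 3 comparison.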
I'd like to use a uniform distribution to randomly assign the value 1 or 2 in five groups (that is, generate five sets of uniform random draws), with each group containing 10 samples.
I tried to write:
for(i in 1:5){
rf <- runif(10)
result[rf<=0.5]=1
result[rf>0.5]=2
}
However this will replace the previously assigned values when the loop goes on.
The code produces only 10 results:
1 2 1 2 2 1 1 1 2 1
But I want a total of 50 randomized values:
1 2 1 2 ...... 2 1 1
How to do this? Thank you
Since you are drawing random numbers from the same distribution every time, it is simpler to generate all 50 numbers at once and assign the values with the ifelse function.
Try this:
a <- ifelse(runif(50) <= 0.5, 1, 2)
dim(a) <- c(10,5) # optional: reshape into a 10 x 5 matrix, one column per group
To add to Gregor Thomas' advice to use sample: you can also convert the stream into a matrix of 5 columns (groups) of 10.
nums <- sample(1:2, 50, replace = TRUE)
groups <- matrix(nums, ncol = 5)
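If the loop form is preferred, a minimal sketch of a fix is to preallocate a matrix and write each group into its own column, so later iterations no longer overwrite earlier ones. The names n_groups and n_per_group and the seed are assumptions made for this example.

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible

n_groups <- 5
n_per_group <- 10

# One column per group; filled in by the loop below
result <- matrix(NA_integer_, nrow = n_per_group, ncol = n_groups)

for (i in seq_len(n_groups)) {
  rf <- runif(n_per_group)
  # Column i holds the 10 values for group i
  result[, i] <- ifelse(rf <= 0.5, 1L, 2L)
}

result
```

as.vector(result) gives the same 50 values as one flat vector if the matrix layout is not needed.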
How can I combine several variables into one in R? For example, I have 3 arrays of a variable A called A1.1, A1.2, A1.3, and I want to compute a single variable "A" from them. How do I do that?
A1.1 <- c(1,1,1,0,0,0)
A1.2 <- c(1,0,0,1,1,1)
A1.3 <- c(0,1,1,1,1,1)
The output should look like this. In SPSS we do this with Compute Variable.
A <- c(1,1,1,1,1,1)
In R you can use simple math on arrays, for example:
A1.1 <- c(1,0,1,0,0,0)
A1.2 <- c(1,0,0,1,1,1)
A1.3 <- c(0,0,1,1,1,1)
A1 <- 1*((A1.1 + A1.2 + A1.3)>0)
> A1
[1] 1 0 1 1 1 1
In R you can use the any() function inside of apply() to make this check. For example:
a1 <- c(1,0,0,0,1,1)
a2 <- c(0,1,0,0,0,1)
a3 <- c(0,1,1,0,1,1)
a <- apply(data.frame(a1,a2,a3), 1, function(x) ifelse(any(x == 1),1,0))
And then as output:
> a
[1] 1 1 1 0 1 1
In SPSS you can take a similar approach:
COMPUTE a = ANY(1, a1 TO a3) .
EXE .
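Back in R, two more vectorised ways to get the same "any of them is 1" result are sketched below, using the vectors from the question. This assumes the inputs only contain 0 and 1; A_alt is a name introduced here for comparison.

```r
A1.1 <- c(1, 1, 1, 0, 0, 0)
A1.2 <- c(1, 0, 0, 1, 1, 1)
A1.3 <- c(0, 1, 1, 1, 1, 1)

# Element-wise maximum: 1 wherever at least one input is 1
A <- pmax(A1.1, A1.2, A1.3)

# Same idea via row sums of the three columns
A_alt <- as.integer(rowSums(cbind(A1.1, A1.2, A1.3)) > 0)

A
A_alt
```

Both avoid the row-by-row apply() call, which can matter for long vectors.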
I have a longitudinal dataset with many missing values that I would like automatically imputed in R based on the 'last observed value' carried forward, and the 'next observed value' carried backwards. Similar questions have been asked previously, but I would like to add specific conditions for imputation based on the length of the gaps.
The following data frame (wide format) demonstrates the issue:
miss.df <- data.frame(id = c('A','B','C','D','E'),
w1 = c(1,1,2,2,1),
w2 = c(1,NA,NA,2,NA),
w3 = c(NA,NA,NA,NA,2),
w4 = c(1,NA,NA,NA,NA),
w5 = c(1,2,NA,1,3),
w6 = c(1,2,1,NA,NA))
As so:
id w1 w2 w3 w4 w5 w6
1 A 1 1 NA 1 1 1
2 B 1 NA NA NA 2 2
3 C 2 NA NA NA NA 1
4 D 2 2 NA NA 1 NA
5 E 1 NA 2 NA 3 NA
Please note that the data is in wide format, so w1 is the first wave, etc. The first wave is complete with no missings. The values are the numeric values for a categorical variable (political party preference). There is no order to the categories. This data frame therefore consists of information on only one variable, on five individuals across six waves.
The conditions I would like are as follows:
1. If the gap consists of only one missing, carry last observed value forward, including cases where the gap is in the final wave.
2. If the gap is an even number of missings (id = C, for instance), then carry forward and carry back so that the values 'meet in the middle'. As such, it is assumed that the individual transitioned (i.e. changed category) half-way through.
3. If the gap is an odd number of missings (id = B, for instance), then carry forward and carry back to meet in the middle, as point 2, but the exact middle value is imputed as the carry forward value.
If one was to run a loop with the above conditions, the data frame would look like this:
id w1 w2 w3 w4 w5 w6
1 A 1 1 1 1 1 1
2 B 1 1 1 2 2 2
3 C 2 2 2 1 1 1
4 D 2 2 2 1 1 1
5 E 1 1 2 2 3 3
Thanks in advance.
Hmm. Tricky. And I don't know of any useful R generic for filling in NAs. In the end I thought the easiest way was a good old for loop. The logic is to fill in one from the left, then one from the right, and to repeat this until everything is filled in. Not very R at all - it could practically be C code - but should be fine unless you have a zillion rows.
fill_in_old_skool <- function (r) {
  while (anyNA(r)) {
    for (idx in seq_along(r)) {
      val <- r[idx]
      if (is.na(r[idx]) && idx > 1) r[idx] <- lastval
      lastval <- val
    }
    for (idx in rev(seq_along(r))) {
      val <- r[idx]
      if (is.na(r[idx]) && idx < length(r)) r[idx] <- lastval
      lastval <- val
    }
  }
  r
}
miss.df[,-1] <- t(apply(miss.df[,-1], 1, fill_in_old_skool))
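For comparison, here is a sketch of a different way to apply the same three rules: find each run of NAs with rle() and fill it in one step, with the first half (rounded up) taken from the left neighbour and the rest from the right. The function name fill_meet_in_middle is hypothetical. The handling of leading gaps and of trailing gaps longer than one wave is an assumption, since the question does not specify it.

```r
miss.df <- data.frame(id = c('A','B','C','D','E'),
                      w1 = c(1,1,2,2,1),
                      w2 = c(1,NA,NA,2,NA),
                      w3 = c(NA,NA,NA,NA,2),
                      w4 = c(1,NA,NA,NA,NA),
                      w5 = c(1,2,NA,1,3),
                      w6 = c(1,2,1,NA,NA))

fill_meet_in_middle <- function(r) {
  runs <- rle(is.na(r))
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  for (i in which(runs$values)) {          # loop over the NA runs only
    s <- starts[i]; e <- ends[i]; n <- runs$lengths[i]
    left <- if (s > 1) r[s - 1] else NA
    # Assumption: a trailing gap has no right neighbour, so carry forward
    right <- if (e < length(r)) r[e + 1] else left
    # Assumption: a leading gap has no left neighbour, so carry backward
    if (is.na(left)) left <- right
    n_fwd <- ceiling(n / 2)                # odd gaps: middle value goes forward
    r[s:e] <- rep(c(left, right), times = c(n_fwd, n - n_fwd))
  }
  r
}

filled <- t(apply(miss.df[, -1], 1, fill_meet_in_middle))
filled
```

On the example data this reproduces the target table from the question, and unlike a while (anyNA(r)) loop it terminates even when a row is entirely NA (the row is simply left as NA).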
The imputeTS package has a function that is very similar to what you want to do.
The function is called na_ma, for example na_ma(x, k = 2, weighting = "simple").
Missing Value Imputation by Weighted Moving Average
Basically what it does for you is:
If you input a time series x, it replaces each missing value with the average of the observations in a window around it (k values on each side).
Not exactly what you described, but I think it might resemble the idea behind your proposed procedure.
I have a data set with an arbitrary number of rows and two columns: a and b. I would like to count how often each value of a occurs for a specific value of b. Given the data set below, I would want a1 = 2, a2 = 1 for a set value of b1.
a b
1 1
1 1
2 1
2 2
3 2
3 2
Code that I've tried and works:
data[a == 1 & b == 1, list(b = length(b))]
Code that doesn't work:
data[a == c(1,2) & b == 1, list(b = length(b))]
How can I get all values of a for a set b value?
Expected data output:
b1
a1 2
a2 1
a3 0
etc.
Code that works thanks to akrun:
library(data.table)
table(as.integer(data$a),data$b=='b1')[,2]
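Make sure [,2] selects the column for your 'b' value: here it is the TRUE column, which comes second because FALSE sorts before TRUE.
Also, as.integer() makes the rows of the table sort in numeric order.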
We can use table
table(within(data, b<-b==1))[,'TRUE', drop=FALSE]
EDIT: Included @Frank's suggestion.
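A small base R sketch of the same count, assuming the six-row example from the question: converting a to a factor with all its levels before tabulating keeps the values of a that never occur with the chosen b, so they show up as 0. It also illustrates why a == c(1,2) did not work: == recycles the shorter vector element by element, whereas %in% tests membership.

```r
data <- data.frame(a = c(1, 1, 2, 2, 3, 3),
                   b = c(1, 1, 1, 2, 2, 2))

# Count each value of a among the rows where b == 1,
# keeping every level of a so absent values are reported as 0
counts <- table(factor(data$a[data$b == 1], levels = sort(unique(data$a))))
counts

# Membership test for several values of a at once
sum(data$a %in% c(1, 2) & data$b == 1)
```

The first result has one entry per value of a (2, 1 and 0 for the example data); the second gives the total number of rows with a in {1, 2} and b equal to 1.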