Create a boolean column based on data in another column in R

I have a data set and would like to do two things:
Set certain row values in Col A to 0 based on values in Col B
Create a new column with values of either 0 or 1 based on the edited values in Col A
My current approach is shown below; the issue is that I occasionally get an error:
Error in `[<-.data.frame`(`*tmp*`, "OCS_dose", value = 0) :
replacement has 1 row, data has 0
This happens because the numbers I am generating are randomly selected, and on certain trials there are no rows to update in Col A based on the numbers in Col B.
Here is an example of my code that causes the error:
pbo_IFNlow_data[pbo_IFNlow_data$OCS_status == 0,]['OCS_dose'] <- 0
OCS_status is either a 0 or 1 that is generated using:
pbo_OCS_status_low <- sample(c(0, 1), replace = TRUE,
                             size = pbo_n_IFNlow, prob = c(1 - 0.863, 0.863))
Therefore, on occasion I have no 0s. In my mind, R should then simply not try to update anything.
Is there a better way to do what I am trying to do?
Here is a more complete segment of my code:
pbo_OCS_status_low <- sample(c(0,1), replace = TRUE, size = pbo_n_IFNlow, prob=c(1-.863, 0.863)) #on OCS = 1
#OCS dose
pbo_OCS_dose_low <- rtruncnorm(pbo_n_IFNlow, a=0, b=Inf, mean=12.8, sd=8.1)
#IFN boolean flag
pbo_IFN_low <- rep(0, pbo_n_IFNlow)
#SLEDAI score
pbo_SLEDAI_low <- rtruncnorm(pbo_n_IFNlow, a=0, b=Inf, mean=11.1, sd=4.4)
#Response criteria met for SRI score reduction
pbo_SRI_low <- sample(c(0,1), replace = TRUE, size = pbo_n_IFNlow, prob=c(1-0.423, 0.423))
pbo_IFNlow_data <- cbind(IFN_status=pbo_IFN_low,
                         OCS_status=pbo_OCS_status_low,
                         OCS_dose=pbo_OCS_dose_low,
                         SLEDAI=pbo_SLEDAI_low,
                         SRI_response=pbo_SRI_low)
pbo_IFNlow_data <- data.frame(pbo_IFNlow_data)
#set those off OCS to 0
pbo_IFNlow_data[pbo_IFNlow_data$OCS_status == 0,]['OCS_dose'] <- 0
#stratification factor for OCS dosage
pbo_IFNlow_data$OCS_lessthan10 <- "temp"
pbo_IFNlow_data[pbo_IFNlow_data$OCS_dose < 10, ]['OCS_lessthan10'] <- 1
pbo_IFNlow_data[pbo_IFNlow_data$OCS_dose >= 10, ]['OCS_lessthan10'] <- 0
#stratification factor for SLE score
pbo_IFNlow_data$SLE_lessthan10 <- "temp"
pbo_IFNlow_data[pbo_IFNlow_data$SLEDAI < 10, ]['SLE_lessthan10'] <- 1
pbo_IFNlow_data[pbo_IFNlow_data$SLEDAI >= 10, ]['SLE_lessthan10'] <- 0

It would be easier if we had a minimal reproducible example, but if I understand your question correctly, you may want to try an ifelse statement in R:
df <- data.frame(colA = seq(1, 10), colB = seq(11, 20))
# Set certain row values in Col A to 0 based on values in Col B
df$colA <- ifelse(df$colB > 15, 0, df$colB)
# Create a new column with values of either 0
# or 1 based on the edited values in Col A
df$colC <- ifelse(df$colA == 0, 1, 0)
print(df)
##    colA colB colC
## 1    11   11    0
## 2    12   12    0
## 3    13   13    0
## 4    14   14    0
## 5    15   15    0
## 6     0   16    1
## 7     0   17    1
## 8     0   18    1
## 9     0   19    1
## 10    0   20    1
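Applied to the columns in your posted code (the column names below are taken from that segment), the same idea would look roughly like this; treat it as a sketch rather than a tested drop-in:
# set OCS_dose to 0 for rows that are off OCS
pbo_IFNlow_data$OCS_dose <- ifelse(pbo_IFNlow_data$OCS_status == 0, 0,
                                   pbo_IFNlow_data$OCS_dose)
# stratification factors as 0/1 flags
pbo_IFNlow_data$OCS_lessthan10 <- ifelse(pbo_IFNlow_data$OCS_dose < 10, 1, 0)
pbo_IFNlow_data$SLE_lessthan10 <- ifelse(pbo_IFNlow_data$SLEDAI < 10, 1, 0)
Alternatively, indexed assignment on a single column avoids the original error, because assigning to a zero-length index is simply a no-op:
pbo_IFNlow_data$OCS_dose[pbo_IFNlow_data$OCS_status == 0] <- 0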

Related

R - Replace sequence values by its maximum in a column

I have a small problem and hope someone can help me.
I have a dataframe like this:
df <- data.frame(foo = 1:20, bar = c(0,0,1,0,0,0,1,2,0,0,1,2,3,0,0,0,1,2,3,4))
and want to have a result like this:
df_result <- data.frame(foo = 1:20, bar = c(0,0,1,0,0,0,2,2,0,0,3,3,3,0,0,0,4,4,4,4))
How do I do this without using a while loop?
Using ave in base R :
with(df, as.integer(bar > 0) * (ave(bar, cumsum(bar == 0), FUN = max)))
#[1] 0 0 1 0 0 0 2 2 0 0 3 3 3 0 0 0 4 4 4 4
where cumsum(bar == 0) is used to create the groups, ave is used to calculate the max within each group, and as.integer(bar > 0) keeps the values that are 0 as 0.
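To make the grouping step concrete, here is a small sketch of the intermediate values (using the df defined above; groups and group_max are just illustrative helper names):
groups <- cumsum(df$bar == 0)                 # a new group starts at every 0
group_max <- ave(df$bar, groups, FUN = max)   # max of bar within each group
as.integer(df$bar > 0) * group_max            # zero out the positions that were 0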

Find what values were changed to after normalization

How can I see the original values post normalization? Or change them in the final output?
I want to change my final output back to their original values, or at least close to them, considering that I aggregate and take the mean.
I have a dataset that has 10 columns and 5,000 rows. After cleaning up the data and selecting which columns and rows I want, I run a normalization code.
Then I run a kmeans and get my output. How can I see what the values were changed to after normalization? For example, if I have Region 1, 2, 3, 4, and 5, and post-normalization it changes to 0.00, 0.25, 0.5, 0.75, and 1, is there a way to change them back to the original values in the kmeans output?
Here is the code I am currently running:
normalize = function(X) {
  return(abs((X - min(X))) / (max(X) - min(X)))
}
df_age_norm = as.data.frame(lapply(df_age, normalize))
clusters = kmeans(df_age_norm, 9)[['cluster']]
df_age_norm$clusters = clusters
df_age_norm = aggregate(df_age_norm[, 1:4], list(df_age_norm$clusters),
                        FUN = mean)
Head of dataset before normalization
Age HHIncome Region MaritalStatus group
18 11000 5 0 1
18 11000 5 1 1
18 12000 2 0 1
18 12000 4 0 1
18 13000 1 0 1
Head of dataset after normalization
Age HHIncome Region MaritalStatus group
0 0.001879699 1.00 0 0
0 0.001879699 1.00 1 0
0 0.002819549 0.25 0 0
0 0.002819549 0.75 0 0
0 0.003759398 0.00 0 0
This solution is inspired by the base R function scale, which centers and scales a vector by subtracting the mean value and dividing by the standard deviation of the vector x. These two values, mean(x) and sd(x), are returned as attributes.
x <- -4:5
y <- scale(x)
attributes(y)
#$dim
#[1] 10 1
#
#$`scaled:center`
#[1] 0.5
#
#$`scaled:scale`
#[1] 3.02765
I have therefore rewritten the function normalize to also set and return min(X) and max(X) as attributes; they are used later to denormalize.
normalize <- function(X, na.rm = FALSE) {
  if (na.rm) X <- X[!is.na(X)]
  Min <- min(X)
  Max <- max(X)
  Y <- X - Min
  if (Min != Max) Y <- Y / (Max - Min)
  attr(Y, "scaled:min") <- Min
  attr(Y, "scaled:max") <- Max
  Y
}
denormalize <- function(X) {
  Min <- attr(X, "scaled:min")
  Max <- attr(X, "scaled:max")
  attr(X, "scaled:min") <- NULL
  attr(X, "scaled:max") <- NULL
  Y <- if (Min != Max) X * (Max - Min) else X
  Y <- Y + Min
  Y
}
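A quick sanity check (my addition, using a constant column like Age in the example data) exercises the Min == Max branch:
x <- c(18, 18, 18, 18, 18)
y <- normalize(x)    # all zeros, with scaled:min = scaled:max = 18
denormalize(y)       # recovers 18 18 18 18 18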
df_age_norm <- as.data.frame(lapply(df_age, normalize))
df_age_2 <- as.data.frame(lapply(df_age_norm, denormalize))
df_age_2
# Age HHIncome Region MaritalStatus group
#1 18 11000 5 0 1
#2 18 11000 5 1 1
#3 18 12000 2 0 1
#4 18 12000 4 0 1
#5 18 13000 1 0 1
Data.
df_age <- read.table(text = "
Age HHIncome Region MaritalStatus group
18 11000 5 0 1
18 11000 5 1 1
18 12000 2 0 1
18 12000 4 0 1
18 13000 1 0 1
", header = TRUE)

R: Get x neighbours from a dataframe based on column value

I wrote a function that takes a number and returns the x rows whose numerical value is closest to that input number.
For example this is the dataframe:
test.data <- data.frame(
  number = c(0, 1, 3, 4, 6, 2, 7, 1, 3, 3, 4, 0, 1, 6),
  letter = letters[1:14]
)
Then I wrote this function to give me the neighbors:
library(dplyr)
get.closest <- function(input.number, n.closest, data) {
  data %>%
    mutate(abs.score.dif = abs(input.number - number),
           rel.score.dif = input.number - number) %>%
    arrange(abs.score.dif) %>%
    head(n.closest)
}
So for example get.closest(6, 3, test.data) will give me:
  number letter abs.score.dif rel.score.dif
1      6      e             0             0
2      6      n             0             0
3      7      g             1            -1
However, I have to do this for more than 20,000 numbers, and my data frame has around 20,000 rows as well, which makes this really slow. How can this be done faster?
N = 6
n = 3
df_out = transform(test.data[head(order(abs(N - test.data$number)), n), ],
                   abs.diff = abs(N - number),
                   rel.diff = N - number)
df_out
#   number letter abs.diff rel.diff
#5       6      e        0        0
#14      6      n        0        0
#7       7      g        1       -1
This seems to be fast with the following data:
# DATA
set.seed(42)
test.data = data.frame(number = sample(0:10, 200000, TRUE),
                       letter = sample(letters, 200000, TRUE))
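If the lookup has to be repeated for many input numbers, the same base R idea can be wrapped in a function and applied over all of them. This is only a sketch; get_closest_base and queries are illustrative names, not part of the original answer.
get_closest_base <- function(N, n, data) {
  idx <- head(order(abs(N - data$number)), n)
  transform(data[idx, ], abs.diff = abs(N - number), rel.diff = N - number)
}
queries <- c(6, 0, 9)   # hypothetical input numbers
results <- lapply(queries, get_closest_base, n = 3, data = test.data)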

Make a vector with counts of rows that meet criteria

I want to make a vector that contains the number of rows that meet my criteria:
leftE0 <- c(900,2000,4000,9000,15000,30000,53000,100000,160000)
rightE0 <- c(2000,4000,9000,15000,30000,53000,100000,160000,300000)
sum(datap$CF > 0 & (datap$E0.keV > leftE0[1]) & (datap$E0.keV < rightE0[1]), na.rm=TRUE)
I don't understand how to vectorise this action.
Use cut and table:
#some example data
set.seed(42)
datap <- data.frame(CF = rnorm(100), E0.keV = exp(runif(100, 0, log(4e6))))
breaks <- c(-Inf, 900,2000,4000,9000,15000,30000,53000,100000,160000, 300000, Inf)
table(cut(datap$E0.keV, breaks), datap$CF > 0)
#                   FALSE TRUE
#   (-Inf,900]         21   32
#   (900,2e+03]         6    3
#   (2e+03,4e+03]       3    3
#   (4e+03,9e+03]       6    0
#   (9e+03,1.5e+04]     1    1
#   (1.5e+04,3e+04]     0    1
#   (3e+04,5.3e+04]     1    0
#   (5.3e+04,1e+05]     2    0
#   (1e+05,1.6e+05]     1    0
#   (1.6e+05,3e+05]     2    1
#   (3e+05, Inf]        3   13
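If you want exactly the vector described in the question (one count per leftE0/rightE0 interval, restricted to CF > 0), the original sum() call can also be vectorised directly with mapply. This is a sketch added for illustration, not part of the original answer:
counts <- mapply(function(lo, hi) sum(datap$CF > 0 & datap$E0.keV > lo & datap$E0.keV < hi,
                                      na.rm = TRUE),
                 leftE0, rightE0)
counts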

R: Find the Variance of all Non-Zero Elements in Each Row

I have a dataframe d like this:
ID Value1 Value2 Value3
1 20 25 0
2 2 0 0
3 15 32 16
4 0 0 0
What I would like to do is calculate the variance for each person (ID), based only on the non-zero values, and return NA where this is not possible.
So, for instance, in this example the variance for ID 1 would be var(20, 25); for ID 2 it would return NA, because you can't calculate a variance on just one entry; for ID 3 the variance would be var(15, 32, 16); and for ID 4 it would again return NA, because there are no numbers at all to calculate a variance on.
How would I go about this? I currently have the following (incomplete) code, but this might not be the best way to go about it:
len = nrow(d)
variances = numeric(len)
for (i in 1:len) {
  # get all nonzero values in the ith row of the data into a vector nonzerodat here
  currentvar = var(nonzerodat)
  variances[i] = currentvar
}
Note this is a toy example, but the dataset I'm actually working with has over 40 different columns of values to calculate variance on, so something that easily scales would be great.
Data <- data.frame(ID = 1:4, Value1=c(20,2,15,0), Value2=c(25,0,32,0), Value3=c(0,0,16,0))
var_nonzero <- function(x) var(x[!x == 0])
apply(Data[, -1], 1, var_nonzero)
[1] 12.5 NA 91.0 NA
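As a small usage note (my addition, not part of the original answer), the per-row result can be attached straight back onto the data frame:
Data$variance <- apply(Data[, -1], 1, var_nonzero)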
This seems overwrought, but it works, and it gives you back an object with the ids attached to the statistics:
library(reshape2)
library(dplyr)
variances <- df %>%
  melt(., id.var = "id") %>%
  group_by(id) %>%
  summarise(variance = var(value[value != 0]))
Here's the toy data I used to test it:
df <- data.frame(id = seq(4), X1 = c(3, 0, 1, 7), X2 = c(10, 5, 0, 0), X3 = c(4, 6, 0, 0))
> df
id X1 X2 X3
1 1 3 10 4
2 2 0 5 6
3 3 1 0 0
4 4 7 0 0
And here's the result:
id variance
1 1 14.33333
2 2 0.50000
3 3 NA
4 4 NA
