Conditional if/else statement in R

I am learning to improve my coding in R. I have this code:
data$score[testA == 1] <- testA_score
data$score[testB==1] <- testB_score
So basically I have four columns that I want to combine into one: testA=1 indicates that the student took version A of the test and testA_score is their score; testB=1 indicates that the student took version B of the test and testB_score is their score. I want to combine this information into a new column, score.
Also, suppose I had testA, testB through testH. All values are 0 or 1. How can I make a new column test_complete which is 1 if any of the tests is 1?
Basically, as a former Stata user I am looking for the R equivalents of the commands egen rowtotal and egen rowfirst. Thanks so much.

You can take the maximum across all the test columns: since the values are only 0 or 1, the row maximum equals 1 whenever at least one test was completed.
testA <- c(1,0, 0, 1,0,0,0)
testB <- c(0, 1,0, 0, 1,0,1)
testC <- c(0, 0, 0,1, 0, 1, 0)
df <- as.data.frame(cbind(testA, testB, testC))
df$completed <- apply(df[, 1:3], 1, max)

So if I understand correctly, taking the maximum value by row should give what you need:
binary <- c(0,1)
df <- data.frame(
  score1 = sample(binary, 20, replace = TRUE),
  score2 = sample(binary, 20, replace = TRUE),
  score3 = sample(binary, 20, replace = TRUE)
)
df$passed <- apply(df, 1, max)
head(df)
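The row maximum covers test_complete. For the combined score column (roughly what egen rowfirst does in Stata), a minimal sketch could look like the following. The data frame here is made up for illustration, and the rule "use version A's score if A was taken, otherwise B's" is an assumption about what the asker wants:

```r
# Made-up example data; the column names follow the question.
data <- data.frame(
  testA       = c(1, 0, 1, 0),
  testB       = c(0, 1, 0, 0),
  testA_score = c(80, NA, 65, NA),
  testB_score = c(NA, 72, NA, NA)
)

# Take the score of whichever version was taken; NA if neither was.
data$score <- ifelse(data$testA == 1, data$testA_score,
                     ifelse(data$testB == 1, data$testB_score, NA_real_))

# rowSums() is the closest analogue of egen rowtotal; comparing it to 0
# gives the "any test taken" flag.
test_cols <- c("testA", "testB")
data$test_complete <- as.integer(rowSums(data[test_cols]) > 0)

data
```

With testA through testH, only test_cols would need to list the extra column names.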

Related

Iterate until every row of column satisfies condition

I need to adjust one variable until it satisfies the condition that none of its rows are higher than one specific value. Here is some context:
I have 2 vectors: 'a' and 'b'
I normalize a and b to calculate their ratio 'c' (a_norm/b_norm)
Every row of 'c' must not be higher than a constant 'd'. Any 'c' row that is higher than d should be transformed into d.
After all 'c' rows that need it are adjusted (let's call this new column c_adjusted), I must recalculate a_norm (c_adjusted*b). Note that the result is no longer normalized, so let's call it a_adjusted.
I normalize a_adjusted to estimate the new a_norm (a_adjusted_norm = a_adjusted/sum(a_adjusted)*100).
I calculate c again to check whether all rows satisfy the condition after the adjustment. If any is still higher than d, I have to repeat the process until the condition is satisfied. At the end I would like the final a_adjusted_norm as the result.
Does anybody know how to achieve this? Here is a reproducible example:
library(dplyr)  # needed for %>% and mutate()
set.seed(8)
# create data frame
a <- runif(100, min = 0, max = 10)
b <- runif(100, min = 0, max = 10)
a_norm <- a / sum(a) * 100
b_norm <- b / sum(b) * 100
c <- a_norm / b_norm
c_cap <- 1  # c must not be higher than c_cap
df <- data.frame(a_norm, b_norm, c)
df <- df %>%
  mutate(c_adjusted = ifelse(c >= c_cap, c_cap, c),                # cap the c rows that are higher than c_cap
         a_adjusted = c_adjusted * b_norm,                         # adjusted a from adjusted c
         a_adjusted_norm = a_adjusted / sum(a_adjusted) * 100)     # normalize adjusted a
# We calculate c again to see if it matches the condition
df <- df %>%
  mutate(c = a_adjusted_norm / b_norm)  # check whether c satisfies the condition after adjusting
# If any row of c is still higher than the cap, I must adjust it again and repeat the process until all rows match the condition
Thanks in advance!
Generally you can do:
a <- runif(10, min = 0, max = 10)
b <- runif(10, min = 0, max = 10)
a_norm <- a/sum(a)*100
b_norm <- b/sum(b)*100
cap <- 1
c <- a_norm / b_norm
while (max(c) > cap) {
  c[c > cap] <- cap
  a_adjusted <- c * b_norm
  a_adjusted_norm <- a_adjusted / sum(a_adjusted) * 100
  c <- a_adjusted_norm / b_norm
}
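However, this never seems to finish: while your approach shrinks the values above 1 back to 1, the renormalization at the same time pushes the other values up, so values at or just below 1 end up above 1 again. Because a_norm and b_norm both sum to 100, the b_norm-weighted mean of c is always 1, so with a cap of 1 the condition can only hold when every c is exactly 1. In practice the loop did not end (at least I stopped it manually after some time).
So you probably need to adjust the formula used to recalculate your c values!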
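If the cap is allowed to be larger than 1, the same loop does settle down. Below is a minimal sketch with a tolerance and an iteration limit as safeguards. The input vectors, cap = 3, tol and max_iter are all illustrative assumptions, not values from the question:

```r
# Illustrative inputs (not the question's data).
a <- c(1, 2, 3, 4)
b <- c(4, 3, 2, 1)
a_norm <- a / sum(a) * 100
b_norm <- b / sum(b) * 100

cap      <- 3      # assumed cap, deliberately larger than 1
tol      <- 1e-8   # assumed numerical tolerance
max_iter <- 1000   # safety limit so the loop cannot run forever

ratio <- a_norm / b_norm
a_adjusted_norm <- a_norm
iter <- 0
while (max(ratio) > cap + tol && iter < max_iter) {
  ratio <- pmin(ratio, cap)                                # cap the ratios
  a_adjusted <- ratio * b_norm                             # back to the a scale
  a_adjusted_norm <- a_adjusted / sum(a_adjusted) * 100    # renormalize to 100
  ratio <- a_adjusted_norm / b_norm                        # recompute the ratio
  iter <- iter + 1
}

a_adjusted_norm
```

The iteration limit matters: with a cap of exactly 1, as in the question, the loop would typically only stop because max_iter is reached.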

T-Test For Genes using Apply Function in Dataframe

I’m trying to run a t.test on two data frames.
The data frames (which I carved out from a larger data.frame) have the data I need in rows 1:143. I’ve already created sub-variables as I needed to calculate rowMeans.
> c.mRNA<-rowMeans(c007[1:143,(4:9)])
> h.mRNA<-rowMeans(c007[1:143,(10:15)])
I’m simply trying to run a t.test for each row, and then plot the p-values as histograms. This is what I thought would work…
Pvals<-apply(mRNA143.data,1,function(x) {t.test(x[c.mRNA],x[h.mRNA])$p.value})
But I keep getting an error?
Error in t.test.default(x[c.mRNA], x[h.mRNA]) :
not enough 'x' observations
I’ve got something off in my syntax and cannot figure it out for the life of me!
EDIT: I've created a data.frame so it's now just two columns, I need a p-value for each row. Below is a sample of my data...
c.mRNA h.mRNA
1 8.224342 8.520142
2 9.096665 11.762597
3 10.698863 10.815275
4 10.666233 10.972130
5 12.043525 12.140297
I tried this...
pvals=apply(mRNA143.data,1,function(x) {t.test(mRNA143.data[,1],mRNA143.data[, 2])$p.value})
But I can tell from my plot that I'm off (the plots are in a straight line).
A reproducible example would go a long way. In preparing it, you might have realized that you are trying to subset columns based on mean, which doesn't make sense, really.
What you want to do is go through rows of your data, subset columns belonging to a certain group, repeat for the second group and pass that to t.test function.
This is how I would do it.
group1 <- matrix(rnorm(50, mean = 0, sd = 2), ncol = 5)
group2 <- matrix(rnorm(50, mean = 5, sd = 2), ncol = 5)
xy <- cbind(group1, group2)
# this is just a visualization of the test you're performing
plot(0, 0, xlim = c(-5, 11), ylim = c(0, 0.25), type = "n")
curve(dnorm(x, mean = 5, sd = 2), add = TRUE)
curve(dnorm(x, mean = 0, sd = 2), add = TRUE)
out <- apply(xy, MARGIN = 1, FUN = function(x) {
  # x is a vector, e.g. xy[i, ] or xy[1, ]
  t.test(x = x[1:5], y = x[6:10])$p.value
})
out
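Applied to the layout described in the question (columns 4:9 for one group, 10:15 for the other), the same idea might look like the sketch below. The c007 built here is a random stand-in, since the real data is not shown; only the column positions come from the question:

```r
set.seed(1)
# Hypothetical stand-in for c007: 143 rows and 15 numeric columns.
c007 <- data.frame(matrix(rnorm(143 * 15, mean = 10), nrow = 143))

# Keep only the 12 measurement columns, then test the first 6 against the last 6
# within each row. Note that the raw values are passed to t.test, not the row means.
pvals <- apply(c007[1:143, 4:15], 1, function(x) {
  t.test(x = x[1:6], y = x[7:12])$p.value
})

hist(pvals, breaks = 20, main = "Per-row t-test p-values", xlab = "p-value")
```

A t-test needs several observations per group, which is why a data frame holding only the two rowMeans columns cannot give a p-value per row.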

Using loop to add columns in R

My data set currently looks like this:
Contract number FA NAAR q
CM300 9746 47000 0.5010
UL350 80000 0 0.01234
RAD3421 50000 10000 0.9431
I would like to add a column with a randomly generated number (called trial) between 0-1 for each row, and compare this number to the value in column q with another column saying 'l' if q < trial, and 'd' if q > trial.
This is my code which accomplishes this task one time.
trial <- runif(3, min = 0, max = 1)
data2 <- mutate(data, trial)
data2 <- mutate(data2, qresult = ifelse(q <= trial, 'l', 'd'))
My struggle is to get this to repeat for several trials, adding new columns onto the table with each repetition. I have tried several types of loops and looked through several questions, and cannot seem to figure it out. I am fairly new to R, so any help would be appreciated!
You may want to approach this using:
df <- data.frame(contract = c("CM300", "UL350", "RAD3421"),
                 FA = c(9746, 80000, 50000),
                 NAAR = c(47000, 0, 10000),
                 q = c(0.5010, 0.01234, 0.9431))
trialmax <- 10
for (i in 1:trialmax) {
  trial <- runif(3, min = 0, max = 1)
  df[, paste0("trial", i)] <- trial
  df[, paste0("qresult", i)] <- ifelse(trial >= df$q, "l", "d")
}
Here I assumed you want 10 trials, but you can change trialmax to whatever you want.
I'd keep things in a separate matrix for efficiency, only binding them on at the end. In fact, using vector recycling, this can be done very efficiently:
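n_trials = 20
# one column per trial; data$q is recycled down each column, so rows line up
trials = matrix(runif(n_trials * nrow(data)), ncol = n_trials)
# FALSE + 1 = 1 -> "d", TRUE + 1 = 2 -> "l" (q < trial), as in the question
q_result = matrix(c("d", "l")[(trials > data$q) + 1], ncol = n_trials)
colnames(trials) = paste0("trial", seq_len(n_trials))
colnames(q_result) = paste0("qresult", seq_len(n_trials))
data = cbind(data, trials, q_result)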
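To see why recycling lines up here, a tiny check on made-up numbers may help. It assumes, as in the question, that 'l' means q < trial:

```r
q <- c(0.2, 0.8)                                     # one q per row
trials <- matrix(c(0.5, 0.5, 0.1, 0.9), nrow = 2)    # one column per trial

# q is recycled down each column, so row i is always compared with q[i].
is_l <- trials > q

# Logical + 1 turns FALSE/TRUE into the indices 1/2 of the label vector.
labels <- matrix(c("d", "l")[is_l + 1], ncol = ncol(trials))
labels
```

Row 1 (q = 0.2) gets "l" for the 0.5 draw and "d" for the 0.1 draw; row 2 (q = 0.8) gets "d" for 0.5 and "l" for 0.9.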

Calculate a Lagged column on itself

I'm certain there is an easier way to accomplish this. I have the following dataframe:
B <- c(1, 1, 1, 0, 1, 2, 2, 0, 0, 0)
A <- c(1:10)
df <- as.data.frame(cbind(A,B))
What I would like to do is add a third column (C) that takes the value of column B, unless column B is 0, in which case it applies the relative change in column A to the previous result of column C.
Here is what I did:
library(Hmisc)
df$New <- ifelse(df$B!=0, df$B, df$A/Lag(df$A, shift=1)*Lag(df$B, shift=1))
df$New2 <- ifelse(df$New !=0, df$New, df$A/Lag(df$A, shift=1)*Lag(df$New, shift=1))
df$New3 <- ifelse(df$New2 !=0, df$New2, df$A/Lag(df$A, shift=1)*Lag(df$New2, shift=1))
df$C <- pmax(df$New, df$New2, df$New3)
df<- df[c(1,2,6)]
Essentially, each value of the column depends on the previously calculated result, so maybe sapply, but I'm not sure.
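Since each value depends on the one before it, a plain for loop is one straightforward option. This is a sketch of one reading of the rule: C[i] is B[i] when B[i] is not 0, otherwise C[i - 1] * A[i] / A[i - 1]. It assumes the first value of B is not 0:

```r
B <- c(1, 1, 1, 0, 1, 2, 2, 0, 0, 0)
A <- 1:10
df <- data.frame(A, B)

# Start from B, then overwrite the rows where B is 0, walking top to bottom
# so that C[i - 1] is already final when row i is computed.
df$C <- df$B
for (i in seq_len(nrow(df))[-1]) {
  if (df$B[i] == 0) {
    df$C[i] <- df$C[i - 1] * df$A[i] / df$A[i - 1]
  }
}

df
```

For this data the last three rows carry the value 2 forward as 2 * 8/7, 2 * 9/7 and 2 * 10/7, without needing one helper column per consecutive zero.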

variable limit to define values

I have a simple question, so let's take some basic data:
a <- rnorm(100, mean=1, sd = 0.1)
b <- rnorm(100, mean=5, sd = 2)
c <- data.frame(a,b)
Now I want to redefine c$b such that if a value is below a limit, it takes a new value that the user defines manually, and if it is at or above this limit, it keeps its previous value:
c$b <- with(c, ifelse(b < 2, 1, # leave as existing value #))
So when b < 2, we want to assign a value of 1; otherwise we use the existing value.
If we are using ifelse, try
c$b <- with(c, ifelse(b < 2, 1, b))
This doesn't even require ifelse. We can get the logical index of values less than 2 in the 'b' column (c$b <2) and assign those values to 1.
c$b[c$b<2] <- 1
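If the limit and the replacement value should both be chosen by the user, they can be made arguments of a small function. A minimal sketch; replace_below is a hypothetical helper name, not a base R function:

```r
# Replace every element of x that is below `limit` with `value`.
replace_below <- function(x, limit, value) {
  x[x < limit] <- value
  x
}

b <- c(0.5, 2, 3.7, 1.9)
result <- replace_below(b, limit = 2, value = 1)
result
```

On a data frame this would be used as c$b <- replace_below(c$b, limit = 2, value = 1).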
