Double filtering with for loops - r

I’m trying to essentially double filter this “championships” dataset for each element of the “questions” column and then for the elements of the “correct” column (either 1 or 0). I have tried to do this with the code below:
unique_questions <- unique(championships$question)
question_ <- numeric(length(unique_questions))
for(i in unique_questions) {{
question_[i] <- championships %>% filter(question == i)
}
correct_ <- length(championships$correct)
for(j in unique_correct) {
correct_[j] <- championships %>% filter(correct == j)
}
print(question_[i], correct_[j])
}
This doesn’t seem to be working and I have a feeling the problem has something to do with the placement of brackets or the choice of functions (in particular, the numeric(length()) function) within the for loops. If this code worked properly, I would hope to have elements of a form similar to qc_ij with i drawn from the “unique_questions” category and j drawn from the “unique_correct” category. There are twelve questions and two choices of “correct” (1 and 0) so I would hope there would be 24 objects of type qc_ij. If someone sees where this code doesn’t work, can you help me fix it?

You don't really need to store the unique questions and answers for each step and you can filter a data.frame simply by selecting rows that satisfy a certain condition:
for(q in unique(championships$question)){
d = championships[championships$question==q,] ## Filter for q
for (ci in unique(d$correct)) { ## Only loop over the possible answers for THIS question
d2 = d[d$correct==ci,] ## Filter for ci
## Just print it to see whether it worked
print(paste(q,ci))
print(d2)
}
}
So you get, one after another, every subset of the championships table for each question and each value of correct that occurs for that question.
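If the goal is to keep the subsets rather than just print them, base R's split() can build them all in one call. The following is a minimal sketch on an invented toy data frame; the toy values and the list name qc are assumptions for illustration, not part of the original post.

```r
# Toy stand-in for the real data (values are invented for illustration)
championships <- data.frame(
  question = rep(c("q1", "q2", "q3"), each = 4),
  correct  = c(1, 0, 1, 1,  0, 0, 1, 0,  1, 1, 1, 1)
)

# One data frame per (question, correct) pair;
# drop = TRUE omits combinations that never occur in the data
qc <- split(championships,
            list(championships$question, championships$correct),
            drop = TRUE)

names(qc)      # element names have the form "<question>.<correct>"
qc[["q1.1"]]   # rows where question == "q1" and correct == 1
```

Keeping the pieces in a named list avoids creating 24 separate objects and makes it easy to loop over them later with lapply().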

Related

Speed up R's grep during an if conditional %in% operation

I'm in need of some R for-loop and grep optimisation assistance.
I have a data.frame made up of columns of different data types. 42 of these columns have the name "treatmentmedication_code_#", where # is a number 1 to 42.
There is a lot of code so a reproducible example is quite tricky. As a compromise, the following code is the precise operation I need to optimise.
for(i in 1:nTreatments) {
...lots of code...
controlsDrugStatusDF <- cbind(controlsTreatmentDF, Drug=0)
for(n in 1:nControls) {
if(treatment %in% controlsDrugStatusDF[n,grep(pattern="^treatmentmedication_code*",x=colnames(controlsDrugStatusDF))]) {
controlsDrugStatusDF$Drug[n] <- 1
} else {
controlsDrugStatusDF$Drug[n] <- 0
}
}
}
treatment is some coded medication e.g., 145374524. The condition inside the if statement is very slow. It checks to see whether the treatment value is present in any one of those columns defined by the grep for the row n. To make matters worse, this is done for every treatment, thus the i for-loop.
Short of launching multiple processes or massacring my data.frames into lots of separate matrices then pasting them together and converting them back into a data.frame, are there any notable improvements one could make on the if statement?
As part of the optimization, the grep that selects the columns can be done once, outside the loop. It is not clear what treatments looks like; assuming it is a vector of values, we can use
nm1 <- grep("^treatmentmedication_code*",
            colnames(controlsDrugStatusDF), value = TRUE)
nm2 <- paste0("Drug", seq_along(nm1))
controlsDrugStatusDF[nm2] <- lapply(controlsDrugStatusDF[nm1],
function(x)
+(x %in% treatments))
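For the single-treatment case in the question, which wants one Drug flag per row, the inner loop over rows can be replaced by a comparison over all medication columns at once. This is a hedged sketch on invented data with only three medication columns; the values and the helper names med_cols and hits are assumptions.

```r
# Toy data: 4 controls, 3 medication columns (values invented)
controlsDrugStatusDF <- data.frame(
  id = 1:4,
  treatmentmedication_code_1 = c(145374524, 11, 22, NA),
  treatmentmedication_code_2 = c(33, 145374524, NA, 44),
  treatmentmedication_code_3 = c(NA, 55, 66, 77)
)
treatment <- 145374524

# Find the medication columns once, outside any loop
med_cols <- grep("^treatmentmedication_code",
                 colnames(controlsDrugStatusDF), value = TRUE)

# Logical matrix: TRUE wherever a cell equals the treatment
hits <- controlsDrugStatusDF[med_cols] == treatment

# A row gets Drug = 1 if any of its medication columns matched;
# NA cells are treated as non-matches via na.rm = TRUE
controlsDrugStatusDF$Drug <- as.integer(rowSums(hits, na.rm = TRUE) > 0)
```

Inside the outer loop over treatments, only the last two statements would need to be repeated; the grep result stays the same.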

How to pass a column name in a for loop concatenating i with a string?

I need to subset a data frame in several others based in the values of several columns of the original data frame.
Here's my for loop:
for (i in 1:qtde_erros_esti){
temp_esti <- erro_esti[(paste0("erro_esti$" , "erro", i) == "1"),]
assign(paste0("erro", i,"_esti"), temp_esti)
rm(temp_esti)
}
The last piece of the puzzle for me is to pass the name of the column whose value I must check (the first line in the for loop).
I'm trying to pass it with the function paste0, but the result of the function is a string that will never be equal to "1", hence never getting any data.
How can I pass the column names (erro_esti$erro1, erro_esti$erro2, and so on...) in this case?
Observation: I'm aware that this may not be the best approach using R, but I'm a noobie, coming from SAS, so I have limited knowledge.
Secondary question: is the way that I formulated the question (topic title) good? Accepting criticism on that too, please, aiming to improve future questions.
Thanks in advance to anyone who takes the time to read this.
We can use [[ instead of $ to subset the column dynamically
erro_esti[[paste0("erro", i)]]
Full code:
for(i in seq_len(qtde_erros_esti)) {
temp_esti <- erro_esti[erro_esti[[paste0("erro", i)]] == 1,]
assign(paste0("erro", i,"_esti"), temp_esti)
rm(temp_esti)
}
You are probably making this more complicated than it needs to be. Consider this approach:
for (i in 1:qtde_erros_esti){
column.name <- paste0("erro", i)
column.data <- erro_esti[, column.name ]
## do things with the column.data vector here
}
Now you can do what needs to be done with the data from column i, using the column.data variable.
If you just want to work with every column of your data.frame, also consider this further simplified pattern:
for( column.data in erro_esti ) {
## work with column.data here
}
You can just iterate over the columns of erro_esti directly, no need to use a counter, unless you need that counter for something else.
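As an alternative to creating one object per column with assign(), the subsets can be collected in a named list. The sketch below uses an invented erro_esti with two error columns; the data and the names col_names and subsets are assumptions for illustration.

```r
# Toy data with two error-flag columns (values invented)
erro_esti <- data.frame(
  id    = 1:5,
  erro1 = c(1, 0, 1, 0, 0),
  erro2 = c(0, 0, 1, 1, 0)
)
qtde_erros_esti <- 2

# Build the column names once: "erro1", "erro2", ...
col_names <- paste0("erro", seq_len(qtde_erros_esti))

# One subset per column, keeping the rows where that column equals 1;
# [[ looks the column up by its name stored in nm
subsets <- lapply(col_names, function(nm) erro_esti[erro_esti[[nm]] == 1, ])
names(subsets) <- paste0(col_names, "_esti")

subsets$erro1_esti   # rows with erro1 == 1
```

A list keeps the results together, so later steps can iterate over subsets instead of reconstructing object names with paste0() and get().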

Assignment in an if statement over data frame?

I hope someone could take a look at the if statement below and tell me how I should change it to get the results I want.
Essentially, I want the code to (1) run through (iterate over) every row in the data frame beh_data, and (2) if the character in the "Cue" column is identical to that in the "face1" column, I want to (3) take the value from the "Enc_trials.thisRepN" column, and (4) assign it to the "scr_of_trial" column. If they are not the same, I want to assign an NA to the "scr_of_trial" column.
Currently, the code runs, but assigns NA to every row in the "scr_of_trial" column.
Can anyone tell me why?
Here is the code:
j <- 1
i = as.character(beh_data$Cue[1:1])
for (x in 1:NROW(beh_data$Cue)) {
if (beh_data$Cue[j] == beh_data$face1[j]) {
beh_data$scr_of_trial[j] <- beh_data$Enc_trials.thisRepN[j]
j <- j + 1
i = as.character(beh_data$Cue[1:1+j])
}
else {
beh_data$scr_of_trial[j] <- NA
j <- j + 1
i = as.character(beh_data$Cue[1:1+j])
next
}
}
Shift your thinking to whole-vectors-at-a-time.
A few techniques:
ifelse; while it works fine here, realize that ifelse has issues with class.
beh_data$scr_of_trial <- ifelse(beh_data$Cue == beh_data$face1,
beh_data$Enc_trials.thisRepN, NA_character_)
replace; similar functionality, no class problem:
replace(beh_data$Enc_trials.thisRepN, beh_data$Cue != beh_data$face1, NA_character_)
Use what I call an "indicator variable":
ind <- beh_data$Cue == beh_data$face1
beh_data$scr_of_trial <- NA_character_
beh_data$scr_of_trial[ind] <- beh_data$Enc_trials.thisRepN[ind]
No for loops, just whole vectors at a time.
When reasonable, I tend to use class-specific NA types like NA_character_; while base R's functions will happily up-convert for you to whatever class you have, many other dialects within R (e.g., dplyr, data.table) are less permissive. It's a little declarative programming, a little style, perhaps a little snobbery, I don't know ...
(This is all untested on actual data.)
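To make the indicator-variable technique concrete, here is a small runnable sketch. The data are invented, and it assumes Enc_trials.thisRepN is an integer column (hence NA_integer_); the original post does not state the column type, so the NA type should be adjusted to match the real data.

```r
# Toy data (values invented; Enc_trials.thisRepN assumed to be integer)
beh_data <- data.frame(
  Cue   = c("a", "b", "c", "d"),
  face1 = c("a", "x", "c", "y"),
  Enc_trials.thisRepN = c(0L, 1L, 2L, 3L),
  stringsAsFactors = FALSE
)

# TRUE for rows where Cue and face1 agree
ind <- beh_data$Cue == beh_data$face1

# Start with NA everywhere, then fill in only the matching rows;
# both sides are indexed by ind so the lengths line up
beh_data$scr_of_trial <- NA_integer_
beh_data$scr_of_trial[ind] <- beh_data$Enc_trials.thisRepN[ind]
```

If Cue and face1 are factors with different level sets, comparing them with == fails, so converting both with as.character() first is a reasonable precaution.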

Can't loop through the values of a subset of vectors in R

I apologize if the question is really basic, I'm still a complete newbie with R.
In my data set, observations come from people who were asked how satisfied they were on a scale from 1 to 10, each month during a period of 6 months.
There was no obligation to participate, so sometimes they answer, sometimes they don't.
I am trying to build a variable that counts how many times they answered the question. I consider that they answer it if the answer is >0.
So first I selected the relevant variables from my dataset and stored them into a separate dataframe (don't pay attention to the "average" in the name, for the purpose of the question just consider it's their single answer for the month):
monthly_sats <- select (donnees, average_satisfaction_march, average_satisfaction_april,
average_satisfaction_may, average_satisfaction_june,
average_satisfaction_july, average_satisfaction_august)
Then, I created a variable where I would store how many times (so, how many months) each person answered, and I initialized it to 0.
donnees$monthly_sat_count <- 0
So far so good. Then, I wrote the following:
for (i in monthly_sats) {
for(j in i) {
if (j > 0) {
donnees$monthly_sat_count <- donnees$monthly_sat_count + 1
}
}
}
Here is what I meant:
for each variable in the monthly_sats data frame
for each value in these variables
if that value is greater than 0, increase the monthly_sat_count variable from the "donnees" data set by 1.
I expected that, for each line in my dataset, monthly_sat_counts would tell how many of these variables were greater than 0.
And the result is that every single line of monthly_sat_counts is equal to 365, and I have no idea why.
Note that I also tried subsetting instead of selecting, and the result is exactly the same. Here is the code:
for (i in donnees[c("average_satisfaction_march", "average_satisfaction_april",
"average_satisfaction_may", "average_satisfaction_june",
"average_satisfaction_july", "average_satisfaction_august")]) {
for(j in i) {
if (j > 0) {
donnees$monthly_sat_count <- donnees$monthly_sat_count + 1
}
}
}
And if I remove the second for loop, simply looping through the list of vectors with the code below, then monthly_sat_count is always equal to 0:
for (i in donnees[c("average_satisfaction_march", "average_satisfaction_april",
"average_satisfaction_may", "average_satisfaction_june",
"average_satisfaction_july", "average_satisfaction_august")]) {
if (i > 0) {
donnees$monthly_sat_count <- donnees$monthly_sat_count + 1
}
}
I have no idea why it does that, and I don't even know where to begin in debugging because I still have trouble understanding R. My only programming background was a little C# some time ago.
Anyway, if someone could explain to me why it doesn't work and show me a better way of doing it, it would really make my day!
set.seed(123)
df <- as.data.frame(matrix(sample(c(0:10), 60, TRUE), ncol = 6))
colnames(df) <- wrapr::qc(average_satisfaction_march, average_satisfaction_april,
average_satisfaction_may, average_satisfaction_june,
average_satisfaction_july, average_satisfaction_august)
df$donnees <- c(1:10)
df <- df[,c(7,1:6)]
df$timesanswered <- apply(df[,2:7], 1 , function(x) {length(x[x>0])})
I first created some sample data. The last line counts, for each row (each "donnee" in the sample data), the number of months in which satisfaction is greater than zero. From your description I assumed that there are no missing values and that a zero is recorded when a person did not answer the question. Is that right?
You could replace the 2 and 7 with the column numbers of average_satisfaction_march and average_satisfaction_august, respectively. There is no need to create a separate data frame to do this.
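As for why the original loop gives the same number on every line: inside the loop, donnees$monthly_sat_count + 1 adds 1 to the whole column each time any single value is positive, so every row ends up holding the total number of positive cells (presumably the 365). A vectorized per-row count can be sketched with rowSums(); the toy data and the names sat_cols and total_positive below are assumptions for illustration.

```r
# Toy data: 3 people, 3 months (values invented)
donnees <- data.frame(
  average_satisfaction_march = c(7, 0, 3),
  average_satisfaction_april = c(0, 0, 9),
  average_satisfaction_may   = c(5, 2, 10)
)

# Select the monthly columns by name pattern
sat_cols <- grep("^average_satisfaction_", names(donnees), value = TRUE)

# What the nested loop effectively computes: one grand total,
# written into every row of monthly_sat_count
total_positive <- sum(donnees[sat_cols] > 0)

# What was intended: the number of positive answers within each row
donnees$monthly_sat_count <- rowSums(donnees[sat_cols] > 0)
```

If unanswered months are stored as NA rather than 0, adding na.rm = TRUE to rowSums() keeps them from turning the count into NA.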

Using mapply() in R over rows, vs. columns

I deal with a great deal of survey data and the like in my work, and I often have to make various scoring programs that process data on a row-by-row level. For instance, I am dealing with a table right now that contains 12 columns with subscale scores from a psychometric instrument. These will be converted to normalized scores using tables provided by the instrument's creator. Seems straightforward so far.
However, there are four tables - the instrument is scored differently depending on gender and age range. So, for instance, a 14-year-old female and a 10-year-old male get different normalization tables. All of the normalization data is stored in an R data frame.
What I would like to do is write a function which can be applied over rows, which returns a vector looked up from the normalization data. So, something vaguely like this:
converter <- function(rawscores,gender,age) {
if(gender=="Male") {
if(8 <= age & age <= 11) {convertvec <- c(1:12)}
if(12 <= age & age <= 14) {convertvec <- c(13:24)}
}
else if(gender=="Female") {
if(8 <= age & age <= 11) {convertvec <- c(25:36)}
if(12 <= age & age <= 14) {convertvec <- c(37:48)}
}
converted_scores <- rep(0,12)
for(z in 1:12) {
converted_scores[z] <- conversion_table[(unlist(rawscores)+1)[z],
convertvec[z]]
}
rm(z)
return(converted_scores)
}
EDITED: I updated this with the code I actually got to work yesterday. This version returns a simple vector with the scores. Here's how I then implemented it.
mydata[,21:32] <- 0
for(x in 1:dim(mydata)[1]) {
tscc_scores[x,21:32] <- converter(mydata[x,7:18],
mydata[x,"gender"],
mydata[x,"age"])
}
This works, but like I said, I'm given to understand that it is bad practice?
Side note: the reason rawscores+1 is there is that the data frame has a score of zero in the first index.
Fundamentally, the function doesn't seem very complicated, and I know I could just implement it using a loop where I would do for(x in 1:number_of_records), but my understanding is that doing so is poor practice. I had hoped to simply use apply() to do this, like as follows:
apply(X=mydata[,1:12],MARGIN=1,
FUN=converter,gender=mydata[,"gender"],age=mydata[,"age"])
Unfortunately, R doesn't seem to approve of this approach, as it does not iterate through the vectors passed to subsequent arguments, but rather tries to take them as the argument as a whole. The solution would appear to be mapply(), but I can't figure out if there's a way to use mapply() over rows, instead of columns.
So, I guess my questions are threefold. One, is there a way to use mapply() over rows? Two, is there a way to make apply() iterate over arguments? And three, is there a better option out there? I've seen and heard a lot about the plyr package, but I didn't want to jump to that before I fully investigated the options present in Base R.
You could rewrite 'converter' so that it takes vectors of gender, age, and a row index which you then use to do lookups and assignments to converted_scores using a conversion array and a data array that is just the numeric score columns. There is an additional problem with using apply since it will convert all its x arguments to "character" class because of the gender class being "character". It wasn't clear whether your code normdf[ rawscores+1, convertvec] was supposed to be an array extraction or a function call.
Untested in absence of working example (with normdf, mydata):
# rows: Male 8-11, Female 8-11, Male 12-14, Female 12-14 (the genders alternate)
cvt.arr <- matrix(1:48, nrow=4, byrow=TRUE)[c(1,3,2,4), ]
converter <- function(idx,gender,age) {
  gidx <- match(gender, c("Male", "Female") )
  aidx <- findInterval(age, c(8,12,15) )
  # the factor 2 is the number of gender categories
  ag.idx <- gidx + 2*(aidx-1)
  cvt <- cvt.arr[ ag.idx, ]
  # score columns 7:18 are taken from the question's loop; adjust as needed
  rawscores <- unlist(mydata[idx, 7:18])
  # matrix indexing: one (row, column) pair per subscale
  normdf[cbind(rawscores+1, cvt)]
}
cvt.scores <- t(mapply(converter, 1:NROW(mydata), mydata$gender, mydata$age))
I'd advise against applying this stuff by row, but would rather apply this by column. The reason is that there are only 12 columns, but there might be many rows.
The following piece of code works for me. There might be better ways, but it might be interesting for you nevertheless.
offset <- with(mydata, 24*(gender == "Female") + 12*(age >= 12))
idxs <- expand.grid(row = 1:nrow(mydata), col = 1:12)
idxs$off <- idxs$col + offset
idxs$val <- as.numeric(mydata[as.matrix(idxs[c("row", "col")])]) + 1
idxs$norm <- normdf[as.matrix(idxs[c("val", "off")])]
converted <- mydata
converted[,1:12] <- matrix(idxs$norm, ncol=12)
The tricky part here is the idxs data frame, which combines all the rest. It has the following columns:
row and col: position in the original data
off: column in normdf, based on gender and age
val: row in normdf, based on original value + 1
norm: corresponding normalized value
I'll post this here with this first thought, and see whether I can come up with a better answer, either based on joran's comment, or using a three- or four-dimensional array for normdf. Not sure yet.
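The lookup idea behind this answer (indexing normdf with a two-column matrix of row and column positions) can be tried on a scaled-down example. The sketch below is only an illustration under stated assumptions: it uses 2 subscales instead of 12, raw scores from 0 to 2, and an invented normdf; the name n_sub and all values are hypothetical.

```r
# Toy setup: 2 subscales instead of 12, raw scores 0..2 (all values invented)
n_sub  <- 2
# rows: raw score + 1; columns: 4 group blocks of n_sub columns each,
# ordered Male 8-11, Male 12-14, Female 8-11, Female 12-14
normdf <- matrix(101:124, nrow = 3, ncol = 4 * n_sub)
mydata <- data.frame(
  s1 = c(0, 2, 1),
  s2 = c(1, 1, 0),
  gender = c("Male", "Female", "Male"),
  age = c(9, 13, 12)
)

# Number of normdf columns to skip for each person's group
offset <- with(mydata, 2 * n_sub * (gender == "Female") + n_sub * (age >= 12))

raw  <- as.matrix(mydata[, 1:n_sub])
rows <- raw + 1                              # score 0 lives in row 1
cols <- outer(offset, seq_len(n_sub), `+`)   # same shape as raw

# Indexing with a two-column matrix picks one cell per (row, col) pair
converted <- matrix(normdf[cbind(c(rows), c(cols))], ncol = n_sub)
```

With the real data, n_sub would be 12 and the offset expression reduces to the 24 and 12 multipliers used in the answer.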

Resources