Bug in my for-loop to iterate over data frame - r

I am working on a data frame and have extracted one of the columns, which holds hour data from 0 to 23. I am adding one more column for the part of the day, based on the hour. I ran the for loop below but I am getting an error. Can somebody tell me what is wrong with the syntax and how to correct it?
for(i in data$Requesthours) {
if(data$Requesthours>=0 & data$Requesthours<3) {
data$Partoftheday <- "Midnight"
} else if(data$Requesthours>=3 & data$Requesthours<6) {
data$Partoftheday <- "Early Morning"
} else if(data$Requesthours>=6 & data$Requesthours<12) {
data$Partoftheday <- "Morning"
} else if(data$Requesthours>=12 & data$Requesthours<16) {
data$Partoftheday <- "Afternoon"
} else if(data$Requesthours>=16 & data$Requesthours<20) {
data$Partoftheday <- "Evening"
} else if(data$Requesthours>=20 & data$Requesthours<=23) {
data$Partoftheday <- "Night"
}
}

Still waiting for you to post your bug, but here's an R coding tip which will reduce this to a one-liner (and bypass your bug). Also it'll be way faster (it's vectorized, unlike your for-loop and if-else-ladder).
data$Partoftheday <- as.character(
  cut(data$Requesthours,
      breaks=c(0,3,6,12,16,20,24),
      right=FALSE,  # left-closed intervals [0,3), [3,6), ... to match the >= / < tests
      labels=c('Midnight', 'Early Morning', 'Morning', 'Afternoon', 'Evening', 'Night')
  )
)
# see Notes on cut() at bottom to explain this
Now back to your bug: you're confused about how to iterate over a column in R. for(i in data$Requesthours) iterates over the data values in the column, not over row indices, so you're confusing indices with data values. You also make i the loop variable but never refer to i anywhere inside the loop; you refer back to data$Requesthours, which is an entire column, not a single value (how do the loop contents know which value you're referring to? They don't). You could use an ugly explicit index loop like for (i in 1:nrow(data)) ... or for (i in seq_len(nrow(data))) ... and then access data$Requesthours[i], but please don't. Because...
One of the big idioms to learn in R: when you find yourself writing a for-loop to iterate over a dataframe or a df column, stop and think (or research) whether there is a vectorized function in R that does what you want. cut, ifelse, diff, ... are all vectorized, as are all the arithmetic and logical operators, and summaries such as sum, mean, max and sd take a whole vector at once. (Plain if is not vectorized; it expects a single TRUE/FALSE.) 'Vectorized' means you can feed them an entire (column) vector as input, and they produce an entire (column) vector as output which you can directly assign to your new column. Very simple, very fast, very powerful. Generally beats the pants off for-loops. Please read R-intro.html, esp. Section 2 about vector assignment.
And if you can't find or write a vectorized fn, there's also the *apply family of functions apply, sapply, lapply, ... to apply any arbitrary function you want to a list/vector/dataframe/df column.
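As a rough illustration of the *apply route, here is a minimal sketch that wraps the if-else ladder in a scalar function and maps it over a vector with vapply(). The helper name part_of_day and the toy hours vector are made up for this example; they are not from the question.

```r
# Hypothetical scalar helper: takes ONE hour (0-23) and returns ONE label.
part_of_day <- function(hr) {
  if (hr < 3) {
    "Midnight"
  } else if (hr < 6) {
    "Early Morning"
  } else if (hr < 12) {
    "Morning"
  } else if (hr < 16) {
    "Afternoon"
  } else if (hr < 20) {
    "Evening"
  } else {
    "Night"
  }
}

# Toy input (assumed): a handful of hours, including the boundaries.
hours <- c(0, 2, 3, 11, 12, 19, 23)

# vapply() calls the helper once per element and checks that each
# result is a single character string.
parts <- vapply(hours, part_of_day, character(1))
print(parts)
```

This is still slower than a truly vectorized call, but the loop bookkeeping is gone and each call sees exactly one value.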
Notes on cut()
cut(data, breaks, labels, ...) is a function where data is your input vector (e.g. your selected column data$Requesthours), breaks is a vector of integer or numeric cut points, and labels is a vector to name the output. labels has one element fewer than breaks, since the 7 break points delimit 6 ranges.
We want the output vector to be string, not categorical, hence we apply as.character() to the output from cut()
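Since your first if-else comparison is (hr>=0 & hr<3), the intervals have to be closed on the left: use breaks starting at 0 together with right=FALSE, which gives [0,3), [3,6), ..., [20,24). With the default right=TRUE the intervals are (a,b], so hr==3 would wrongly be 'Midnight', hr==6 would be 'Early Morning', etc., and hr==0 would give NA unless the lowest break is moved down to -1. (There is a parameter include.lowest=TRUE/FALSE, but it only affects the outermost interval, so it cannot fix the boundaries at 3, 6, 12, ...)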
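To see how the interval boundaries behave, here is a small check on all 24 hours, assuming left-closed intervals (right=FALSE) with breaks starting at 0. The vector names hours and part are only for this sketch.

```r
hours <- 0:23

part <- as.character(
  cut(hours,
      breaks = c(0, 3, 6, 12, 16, 20, 24),
      right  = FALSE,   # intervals are [0,3), [3,6), ..., [20,24)
      labels = c("Midnight", "Early Morning", "Morning",
                 "Afternoon", "Evening", "Night"))
)

# Look at the boundary hours, where off-by-one mistakes would show up.
boundary <- c(0, 2, 3, 5, 6, 12, 16, 20, 23)
print(data.frame(hour = boundary, part = part[boundary + 1]))
```

Every hour from 0 to 23 gets a label, and each boundary hour lands in the range that the original >= / < comparisons assign it to.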

if(data$Requesthours>=0 & data$Requesthours<3) (and other similar ifs) make no sense since data$Requesthours is a vector. You should try either of the following:
Solution 1:
for(i in seq(length(data$Requesthours))) {
if(data$Requesthours[i]>=0 & data$Requesthours[i]<3)
data$Partoftheday[i] <- "Midnight"
....
}
This solution is slow like hell and really ugly, but it would work.
Solution 2:
data$Partoftheday[data$Requesthours>=0 & data$Requesthours<3] <- "Midnight"
...
Solution 3 = what was proposed by smci
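Solution 2 needs one assignment per range. A compact variant of the same idea, sketched here with base R's findInterval() (a different function from the ones used in the answers above), looks up the range number of each hour and uses it to index a vector of labels. The toy data frame requests and the names lower_bounds and part_labels are assumptions made for the example.

```r
# Toy data (assumed): one row per request, hour in 0-23.
requests <- data.frame(
  Requesthours = c(0, 2, 3, 5, 6, 11, 12, 15, 16, 19, 20, 23)
)

# Lower bound of each range, in increasing order, and the matching labels.
lower_bounds <- c(0, 3, 6, 12, 16, 20)
part_labels  <- c("Midnight", "Early Morning", "Morning",
                  "Afternoon", "Evening", "Night")

# findInterval() returns, for each hour, the index i such that
# lower_bounds[i] <= hour < lower_bounds[i + 1].
requests$Partoftheday <- part_labels[
  findInterval(requests$Requesthours, lower_bounds)
]
print(requests)
```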

Related

Speed up R's grep during an if conditional %in% operation

I'm in need of some R for-loop and grep optimisation assistance.
I have a data.frame made up of columns of different data types. 42 of these columns have the name "treatmentmedication_code_#", where # is a number 1 to 42.
There is a lot of code so a reproducible example is quite tricky. As a compromise, the following code is the precise operation I need to optimise.
for(i in 1:nTreatments) {
...lots of code...
controlsDrugStatusDF <- cbind(controlsTreatmentDF, Drug=0)
for(n in 1:nControls) {
if(treatment %in% controlsDrugStatusDF[n,grep(pattern="^treatmentmedication_code*",x=colnames(controlsDrugStatusDF))]) {
controlsDrugStatusDF$Drug[n] <- 1
} else {
controlsDrugStatusDF$Drug[n] <- 0
}
}
}
treatment is some coded medication e.g., 145374524. The condition inside the if statement is very slow. It checks to see whether the treatment value is present in any one of those columns defined by the grep for the row n. To make matters worse, this is done for every treatment, thus the i for-loop.
Short of launching multiple processes or massacring my data.frames into lots of separate matrices then pasting them together and converting them back into a data.frame, are there any notable improvements one could make on the if statement?
As part of the optimization, the grep for selecting the columns can be done once, outside the loop. It is not clear from the question how the treatments are stored, so assume treatments is a vector of values. Then we can use
nm1 <- grep("^treatmentmedication_code*",
            colnames(controlsDrugStatusDF), value = TRUE)
nm2 <- paste0("Drug", seq_along(nm1))
controlsDrugStatusDF[nm2] <- lapply(controlsDrugStatusDF[nm1],
                                    function(x)
                                      +(x %in% treatments))
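If what is wanted is the question's single Drug column (1 when a given treatment appears in any of the code columns of that row), the inner loop can be replaced by one comparison over the whole block of columns. This is only a sketch on made-up data: the toy data frame, its two code columns and the treatment value are assumptions.

```r
# Toy stand-in for the real data (assumed): two code columns instead of 42.
controlsDrugStatusDF <- data.frame(
  id = 1:4,
  treatmentmedication_code_1 = c(145374524, 111, NA, 222),
  treatmentmedication_code_2 = c(333, 145374524, 444, NA)
)
treatment <- 145374524

# Select the code columns once, outside any loop.
code_cols <- grep("^treatmentmedication_code",
                  colnames(controlsDrugStatusDF), value = TRUE)

# Comparing the selected columns to a scalar gives a logical matrix;
# rowSums() then counts the matches in each row (NA cells are ignored).
hits <- controlsDrugStatusDF[code_cols] == treatment
controlsDrugStatusDF$Drug <- as.integer(rowSums(hits, na.rm = TRUE) > 0)
print(controlsDrugStatusDF)
```

The outer loop over treatments can stay as it is; only the row-by-row work disappears.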

How to pass a column name in a for loop concatenating i with a string?

I need to subset a data frame in several others based in the values of several columns of the original data frame.
Here's my for loop:
for (i in 1:qtde_erros_esti){
temp_esti <- erro_esti[(paste0("erro_esti$" , "erro", i) == "1"),]
assign(paste0("erro", i,"_esti"), temp_esti)
rm(temp_esti)
}
The last piece of the puzzle for me is to pass the column name which value I must check (1st line in the for loop).
I'm trying to pass it with the function paste0, but the result of the function is a string that will never be equal to "1", hence never getting any data.
How can I pass the column names (erro_esti$erro1, erro_esti$erro2, and so on...) in this case?
Observation: I'm aware that this may not be the best approach using R, but I'm a noobie, coming from SAS, so I have limited knowledge.
Secondary question: is the way that I formulated the question (topic title) good? Accepting criticism on that too, please, aiming to improve future questions.
Thanks in advance for anyone who take some time to read this.
We can use [[ instead of $ to subset the column dynamically
erro_esti[[paste0("erro", i)]]
-full code
for(i in seq_len(qtde_erros_esti)) {
temp_esti <- erro_esti[erro_esti[[paste0("erro", i)]] == 1,]
assign(paste0("erro", i,"_esti"), temp_esti)
rm(temp_esti)
}
You are probably making this more complicated than it needs to be. Consider this approach:
for (i in 1:qtde_erros_esti){
column.name <- paste0("erro", i)
column.data <- erro_esti[, column.name ]
## do things with the column.data vector here
}
Now you can do what needs to be done with the data from column i, using the column.data variable.
If you just want to work with every column of your data.frame, also consider this further simplified pattern:
for( column.data in erro_esti ) {
## work with column.data here
}
You can just iterate over the columns of erro_esti directly, no need to use a counter, unless you need that counter for something else.
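As an alternative to creating one variable per subset with assign(), the subsets can be collected in a named list. A minimal sketch, assuming a toy erro_esti with two flag columns (the data and the names col_names and subsets are invented for the example):

```r
# Toy data (assumed): flag columns erro1, erro2 hold 1 where the error occurs.
erro_esti <- data.frame(
  id    = 1:5,
  erro1 = c(1, 0, 1, 0, NA),
  erro2 = c(0, 0, 1, 1, 1)
)
qtde_erros_esti <- 2

col_names <- paste0("erro", seq_len(qtde_erros_esti))

# One data frame per flag column; which() drops rows where the flag is NA.
subsets <- lapply(col_names, function(cn) {
  erro_esti[which(erro_esti[[cn]] == 1), ]
})
names(subsets) <- paste0(col_names, "_esti")

print(subsets$erro1_esti)
print(subsets[["erro2_esti"]])
```

Keeping the pieces in one list makes it easy to loop over them later with lapply(), instead of rebuilding variable names with paste0() and get().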

Assignment in an if statement over data frame?

I hope someone could take a look at the if statement below and tell me how I should change it to get the results I want.
Essentially, I want the code to (1) run through (iterate over) every row in the data frame beh_data, and (2) if the character in the "Cue" column is identical to that in the "face1" column, I want to (3) take the value from the "Enc_trials.thisRepN" column, and (4) assign it to the "scr_of_trial" column. If they are not the same, I want to assign an NA to the "scr_of_trial" column.
Currently, the code runs, but assigns NA to every row in the "scr_of_trial" column.
Can anyone tell me why?
Here is the code:
j <- 1
i = as.character(beh_data$Cue[1:1])
for (x in 1:NROW(beh_data$Cue)) {
if (beh_data$Cue[j] == beh_data$face1[j]) {
beh_data$scr_of_trial[j] <- beh_data$Enc_trials.thisRepN[j]
j <- j + 1
i = as.character(beh_data$Cue[1:1+j])
}
else {
beh_data$scr_of_trial[j] <- NA
j <- j + 1
i = as.character(beh_data$Cue[1:1+j])
next
}
}
Shift your thinking to whole-vectors-at-a-time.
A few techniques:
ifelse; while it works fine here, realize that ifelse has issues with class.
beh_data$scr_of_trial <- ifelse(beh_data$Cue == beh_data$face1,
beh_data$Enc_trials.thisRepN, NA_character_)
replace; similar functionality, no class problem:
replace(beh_data$Enc_trials.thisRepN, beh_data$Cue != beh_data$face1, NA_character_)
Use what I call an "indicator variable":
ind <- beh_data$Cue == beh_data$face1
beh_data$scr_of_trial <- NA_character_
beh_data$scr_of_trial[ind] <- beh_data$Enc_trials.thisRepN[ind]
No for loops, just whole vectors at a time.
When reasonable, I tend to use class-specific NA types like NA_character_; while base R's functions will happily up-convert for you to whatever class you have, many other dialects within R (e.g., dplyr, data.table) are less permissive. It's a little declarative programming, a little style, perhaps a little snobbery, I don't know ...
(This is all untested on actual data.)
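To make the three techniques concrete, here is a small sketch on invented data. It assumes Enc_trials.thisRepN is numeric, so it uses NA_real_ as the class-specific NA; with a character column NA_character_ would be the matching choice.

```r
# Toy data (assumed): rows 1 and 3 have Cue equal to face1.
beh_data <- data.frame(
  Cue   = c("a", "b", "c", "d"),
  face1 = c("a", "x", "c", "y"),
  Enc_trials.thisRepN = c(0, 1, 2, 3),
  stringsAsFactors = FALSE
)

same <- beh_data$Cue == beh_data$face1

# 1. ifelse: pick the value where the test is TRUE, NA otherwise.
via_ifelse <- ifelse(same, beh_data$Enc_trials.thisRepN, NA_real_)

# 2. replace: start from the full column and blank out the mismatches.
via_replace <- replace(beh_data$Enc_trials.thisRepN, !same, NA_real_)

# 3. indicator variable: start from all-NA and fill in the matches.
via_index <- rep(NA_real_, nrow(beh_data))
via_index[same] <- beh_data$Enc_trials.thisRepN[same]

beh_data$scr_of_trial <- via_ifelse
print(beh_data)
```

All three vectors come out the same here; the choice is mostly about readability and how strictly you want to control the class of the result.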

Double "for loops" in a dataframe in R

I need to do a quality control in a dataset with more than 3000 variables (columns). However, I only want to apply some conditions in a couple of them. A first step would be to replace outliers by NA. I want to replace the observations that are greater or smaller than 3 standard deviations from the mean by NA. I got it, doing column by column:
height = ifelse(abs(height-mean(height,na.rm=TRUE)) <
3*sd(height,na.rm=TRUE),height,NA)
And I also want to create other variables based on different columns. For example:
data$CGmark = ifelse(!is.na(data$mark) & !is.na(data$height) ,
paste(data$age, data$mark,sep=""),NA)
An example of my dataset would be:
name = factor(c("A","B","C","D","E","F","G","H","H"))
height = c(120,NA,150,170,NA,146,132,210,NA)
age = c(10,20,0,30,40,50,60,NA,130)
mark = c(100,0.5,100,50,90,100,NA,50,210)
data = data.frame(name=name,mark=mark,age=age,height=height)
data
I have tried this (for one condition):
d1=names(data)
list = c("age","height","mark")
ntraits=length(list)
nrows=dim(data)[1]
for(i in 1:ntraits){
a=list[i]
b=which(d1==a)
d2=data[,b]
for (j in 1:nrows){
d2[j] = ifelse(abs(d2[j]-mean(d2,na.rm=TRUE)) < 3*sd(d2,na.rm=TRUE),d2[j],NA)
}
}
Someone told me that I am not storing d2. How can I create for loops to apply the conditions I want? I know that there are similar questions, but I didn't get it yet. Thanks in advance.
You pretty much wrote the answer in your first line. You're overthinking this one.
First, it's good practice to encapsulate this kind of operation in a function. Yes, function dispatch is a tiny bit slower than otherwise, but the code is often easier to read and debug. Same goes for assigning "helper" variables like mean_x: the cost of assigning the variable is very, very small and absolutely not worth worrying about.
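NA_outside_3s <- function(x) {
  mean_x <- mean(x, na.rm=TRUE)  # na.rm so existing NAs don't turn the mean into NA
  sd_x <- sd(x, na.rm=TRUE)
  x_outside_3s <- abs(x - mean_x) >= 3 * sd_x  # TRUE for values 3 sd or more from the mean
  x[x_outside_3s] <- NA # no need for ifelse here
  x
}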
Of course, you can choose any function name you want. More descriptive is better.
Then if you want to apply the function to every column, just loop over the columns. That function NA_outside_3s is already vectorized, i.e. it takes a numeric vector as an argument and returns a vector of the same length.
cols_to_loop_over <- seq_len(ncol(my_data)) # or, some subset of (numeric) columns.
for (j in cols_to_loop_over) {
  my_data[, j] <- NA_outside_3s(my_data[, j])
}
I'm not sure why you wrote your code the way you did (and it took me a minute to even understand what you were trying to do), but looping over columns is usually straightforward.
In my comment I said not to worry about efficiency, but once you understand how the loop works, you should rewrite it using lapply:
my_data[cols_to_loop_over] <- lapply(my_data[cols_to_loop_over], NA_outside_3s)
Once you know how the apply family of functions works, they are very easy to read if written properly. And yes, they are somewhat faster than looping, but not as much as they used to be. It's more a matter of style and readability.
Also: do NOT name a variable list! This masks the function list, which is an R built-in function and a fairly important one at that. You also shouldn't generally name variables data because there is also a data function for loading built-in data sets.
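Here is a small end-to-end sketch of the lapply pattern. It restates a version of NA_outside_3s that blanks values at least 3 standard deviations from the mean, and runs it on invented data. The toy data has 21 rows on purpose: with only a handful of observations, as in the question's 9-row example, no single value can lie 3 sample standard deviations from the mean. The names my_data and numeric_cols are assumptions.

```r
# Blank out values that are 3 or more sample sd away from the mean.
NA_outside_3s <- function(x) {
  mean_x <- mean(x, na.rm = TRUE)
  sd_x   <- sd(x, na.rm = TRUE)
  x[abs(x - mean_x) >= 3 * sd_x] <- NA
  x
}

# Toy data (assumed): one obvious outlier in height, one missing age.
my_data <- data.frame(
  name   = factor(rep("A", 21)),
  height = c(rep(150, 20), 1000),
  age    = c(seq(20, 39), NA)
)

# Only touch the numeric columns; the factor column is left alone.
numeric_cols <- names(my_data)[vapply(my_data, is.numeric, logical(1))]
my_data[numeric_cols] <- lapply(my_data[numeric_cols], NA_outside_3s)

print(tail(my_data, 3))
```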

R rstats How to replace a single character in one place in a string

I'm trying to write a function that will take a string and replace one character with another, but I want it to return every permutation of replacing that character. I'd like to replace every i with an l but I don't want to do it globally like in gsub and I don't want to do just the first one like in sub. I think an example illustrates it best. If I pass in the name keviin (with two i's):
thisFunction("keviin")
[1] kevlin keviln kevlln
So I get back replacing the first i, the second i and then both i's. This sounds like a job for recursion, but first I need to figure out how to replace just the first i. Then I could pass the resulting string to the function to get the next permutation.
Anybody got an idea to give me a push? I've tried doing this but it didn't work for me:
> substr("keviin",4,4) <- "l"
Error in substr("keviin", 4, 4) <- "l" :
target of assignment expands to non-language object
From @CarlWitthoft's idea, how about this:
thisFunction<-function(x) {
xsplit<-strsplit(x,"")[[1]]
ipos<-as.vector(gregexpr("i",x)[[1]])
if (length(ipos)==1) {
if (ipos<0) return(x) else {
substring(x,ipos,ipos)<-"l"
return(x)
}
}
combos<-unlist(lapply(seq_along(ipos),combn,x=ipos,simplify=FALSE),recursive=FALSE)
ret<-t(vapply(combos,function(x) {xsplit[x]<-"l";xsplit},character(length(xsplit))))
do.call(function(...) paste(...,sep=""),as.data.frame(ret))
}
thisFunction("keviin")
#[1] "kevlin" "keviln" "kevlln"
How about a combination of regex and sampling from a vector?
kevsplit<-unlist(strsplit('keviin',''))
the_eyes <-which( grepl('i',kevsplit))
kevsplit[sample(the_eyes,1)] <-"L"
newkev<-paste(kevsplit,collapse='')
That will randomly swap out one of the "i"s. To swap out all possible permutations,
do something like
for(j in seq_along(the_eyes) ) {
  # calculate all combinations of the_eyes taken j at a time
  # swap those selected positions in kevsplit and save in some list
}
I'm too lazy to write out that last bit :-)
EDIT: to clarify, aside from pasting things back together again, your problem is basically:
For a vector of type c(0,0,0,0,0,....) (replacing your "i" with 0 or logical FALSE), how many ways can you replace 1 or more values with a "TRUE" (or 1) ? That's a standard problem in introductory combinatorics -- and happily enough for us computer weenies, turns out to be counting in binary!
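The outline above can be filled in along these lines. This is one possible sketch, not the answerer's code: the function name replace_combinations and its arguments are invented here, and it uses combn() to enumerate the subsets rather than literally counting in binary.

```r
# Return every string obtained by replacing one or more occurrences
# of `from` in `word` with `to`.
replace_combinations <- function(word, from = "i", to = "l") {
  chars <- strsplit(word, "")[[1]]
  pos   <- which(chars == from)
  if (length(pos) == 0) return(character(0))

  out <- character(0)
  for (j in seq_along(pos)) {
    # combn(n, j) enumerates index sets into pos; passing the count n
    # (not pos itself) avoids combn's special handling of a single number.
    for (idx in combn(length(pos), j, simplify = FALSE)) {
      new_chars <- chars
      new_chars[pos[idx]] <- to
      out <- c(out, paste(new_chars, collapse = ""))
    }
  }
  out
}

print(replace_combinations("keviin"))
```

For a word with n occurrences of the character this produces 2^n - 1 strings, which matches the binary-counting view: every non-zero n-bit pattern picks one set of positions to swap.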
This works with objects but not with pure strings in quotes, because substr(x, 4, 4) <- 'l' is a replacement call and needs a variable to assign the result back into:
thisFunction <- function(x){
  substr(x,4,4) <- 'l'
  return(x)
}
thisFunction('keviin')
#[1] "kevlin"
works.

Resources