I know this is not the most efficient way to achieve my goal; however, I am using it as a teaching moment (i.e., to show that you can nest an if/else statement within a for loop). Specifically, I have a nominal variable that is currently stored as numeric codes. I want to combine if/else with a for loop to reassign these numbers to their respective categories (class character). I have tried this in multiple ways; my current code is as follows:
# Take the original data and separate out the variable of interest
oasis_CDR <- oasis_final %>% select('CDR')
# transpose this data
oasis_CDR <- t(oasis_CDR)
# create the for loop
for(i in seq_along(oasis_CDR)){
if(i == 0.0){
oasis_CDR[1, i] <- "Normal"
} else if(i == 0.5) {
oasis_CDR[1 ,i] <- "Very Mild Dementia"
} else if(i == 1.0){
oasis_CDR[1 ,i] <- "Mild Dementia"
} else if(i == 2.0){
oasis_CDR[1 ,i] <- "Moderate Dementia"
} else if(i == 3.0){
oasis_CDR[1 ,i] <- "Severe Dementia"
} else{
oasis_CDR[1 ,i] <- "NA"
}
}
When I look at oasis_CDR, it returns 'NA' for all observations.
If I replace 'i' with 'CDR' in each 'if' statement, it returns only 'Normal'.
Is there a way to make the reassignments match what is in the data?
If you have a different value to assign to every number, you can use dplyr::recode. This assumes oasis_CDR is still a data frame with a CDR column (that is, before the t() call):
library(dplyr)
oasis_CDR <- oasis_CDR %>%
  mutate(new_col = recode(CDR, `0` = 'Normal',
                          `0.5` = 'Very Mild Dementia',
                          `1` = 'Mild Dementia',
                          `2` = 'Moderate Dementia',
                          `3` = 'Severe Dementia',
                          .default = NA_character_))
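If dplyr is not wanted, the same mapping can probably be done in base R with a lookup table and match(). This is only a sketch: the vector cdr is made-up example data standing in for the CDR column, and the labels follow the mapping given in the question.

```r
# Hypothetical example scores; the real data would be the CDR column
cdr <- c(0, 0.5, 1, 2, 3, 0.5, NA, 4)

# Lookup table: codes and their labels, in matching order
codes  <- c(0, 0.5, 1, 2, 3)
labels <- c("Normal", "Very Mild Dementia", "Mild Dementia",
            "Moderate Dementia", "Severe Dementia")

# match() gives the position of each score in `codes` (NA if absent),
# so unknown or missing scores end up as NA_character_
cdr_label <- labels[match(cdr, codes)]
cdr_label
```

Because match() returns NA for anything not in codes, no separate default branch is needed.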
Check what seq_along(oasis_CDR) actually returns! Those will be your i values.
My guess is that you do not really want to compare 0.0, 0.5, 1 and 2 against the indices 1 up to > 220, do you?
And if you really want to work through this with a for loop rather than by indexing the vector,
isn't it more likely that you want to achieve something like this:
oasis_CDR$result <- NA_character_
j <- 1
for (i in oasis_CDR$CDR) {
if (i == ...) oasis_CDR$result[j] <- 'Normal'
...
j <- j + 1
}
That gets the job done, but IMHO it is not (very) nice code in R (or in any similar language).
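For the teaching purpose described in the question, a complete runnable version of such a loop might look like the sketch below. The vector cdr is invented example data, and the label mapping is taken from the question; the key point is that the comparison uses the value cdr[j], not the loop index.

```r
# Hypothetical example data standing in for the CDR column
cdr <- c(0, 0.5, 1, 2, 3, 0.5)

# Pre-allocate the result as character
result <- rep(NA_character_, length(cdr))

for (j in seq_along(cdr)) {
  value <- cdr[j]                 # compare the VALUE, not the index j
  if (is.na(value)) {
    result[j] <- NA_character_    # guard: if (NA == 0) would raise an error
  } else if (value == 0) {
    result[j] <- "Normal"
  } else if (value == 0.5) {
    result[j] <- "Very Mild Dementia"
  } else if (value == 1) {
    result[j] <- "Mild Dementia"
  } else if (value == 2) {
    result[j] <- "Moderate Dementia"
  } else if (value == 3) {
    result[j] <- "Severe Dementia"
  } else {
    result[j] <- NA_character_    # a real missing value, not the string "NA"
  }
}
result
```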
I am trying to create a function for gender distribution. Is there a way to define a letter as something other than as.factor? I would like to operate func(F) instead of func("F"). Or should I go numeric: func(0), func(1), func(2)?
I also finished off the statement with an else branch that is meant to run when the argument is left blank, but it does not. If I whittle the function down so that it has no if statement, calling it without an argument works fine:
genderDist <- function(){
cat("Female:", sum(voterData$GENDER == "F"))
}
Thanks in advance! Cheers!
Full Statement:
genderDist <- function(x){
if (x == "F"){
cat("Female:", sum(voterData$GENDER == "F"))
}
else if (x == "M"){
cat("Male:", sum(voterData$GENDER == "M"))
}
else if(x == "U"){
cat("Unknown:", sum(voterData$GENDER == ""))
}
else{
cat("Female:", sum(voterData$GENDER == "F"))
cat("Male:", sum(voterData$GENDER == "M"))
cat("Unknown:", sum(voterData$GENDER == ""))
}
}
Desired results:
genderDist(F) gives count of Females
genderDist(M) gives count of Males
genderDist(U) gives count of Unknown
genderDist() gives count for all the above
There are several possibilities for coding gender besides factor:
1. As character, not as factor. You will still have to call your function like func("F").
2. As numeric, which you already thought of yourself. The disadvantage is that it may be unclear whether 1 is male or female.
3. The best option, IMHO, would be to go logical. Name your column "male" and use TRUE, FALSE, and NA for unknown. A logical also works well in your if statement: start with if (is.na(male)) ... else if (male) ...
EDIT
But to achieve your desired outcome, the coding of gender is not the issue. I would take this approach:
# First, define variables Fe, Ma and Un
# WARNING: do NOT use 'F' as a variable name, because F is shorthand for FALSE!
Fe <- "F"
Ma <- "M"
Un <- "U"
# now define a lookup data frame for convenience
LT <- data.frame(code = c(Fe,Ma,Un), name = c("Female","Male","Unknown"), stringsAsFactors = FALSE)
# then define your function; no if/else is needed
genderDist <- function(x){
cat(LT[LT$code == x,"name"], sum(voterData$GENDER == x))
}
Introduce some fake data:
voterData <- data.frame(GENDER= c("F","F","F","M","M","U"))
Then run function:
> genderDist(Fe)
Female 3
> genderDist(Ma)
Male 2
> genderDist(Un)
Unknown 1
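The function above does not yet cover the "no argument" case from the question. One possible way to handle it is missing(), which is TRUE when an argument was not supplied. The sketch below is an assumption about what the asker wants, uses the same fake data, and replaces the lookup data frame with a named vector for brevity. The codes still have to be passed as quoted strings.

```r
# Fake data, as in the answer above
voterData <- data.frame(GENDER = c("F", "F", "F", "M", "M", "U"),
                        stringsAsFactors = FALSE)

# Lookup from code to display name (named character vector)
LT <- c(F = "Female", M = "Male", U = "Unknown")

genderDist <- function(x) {
  # With no argument, report every known code; otherwise only the given one(s)
  codes <- if (missing(x)) names(LT) else x
  counts <- vapply(codes, function(code) sum(voterData$GENDER == code),
                   integer(1))
  writeLines(paste0(LT[codes], ": ", counts))
  invisible(counts)   # return the counts without printing them a second time
}

genderDist("F")   # one category
genderDist()      # all categories
```

A default such as function(x = NULL) combined with is.null(x) would work just as well; missing() simply keeps the signature from the question unchanged.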
I am very new to R but I am interested in learning more and improving.
I have a dataset with around 40,000+ rows containing the lengths of neuron segments. I want to compare the length trends of neurons in different groups. The first step in this analysis involves sorting the measurements into 1 of 6 categories: '<10', '10-15', '15-20', '20-25', '25-30', and '>30'.
I created these categories as appended columns using 'mutate' from the 'dplyr' package and now I am trying to write a boolean function to determine where the measurement fits by applying a value of '1' to the corresponding column if it fits, and a '0' if it doesn't.
Here is what I wrote:
for (i in 1:40019) {
{if (FinalData$Length[i] <10)
{FinalData$`<10`[i]<-1
} else {FinalData$`<10`[i]<-0}} #Fills '<10'
if (FinalData$Length[i] >=10 & FinalData$Length[i]<15){
FinalData$`10-15`[i]<-1
} else{FinalData$`10-15`[i]<-0} #Fills'10-15'
if (FinalData$Length[i] >=15 & FinalData$Length[i]<20){
FinalData$`15-20`[i]<-1
} else{FinalData$`15-20`[i]<-0} #Fills '15-20'
if (FinalData$Length[i] >=20 & FinalData$Length[i]<25) {
FinalData$`20-25`[i]<-1
} else{FinalData$`20-25`[i]<-0} #Fills '20-25'
if(FinalData$Length[i] >=25 & FinalData$Length[i]<30){
FinalData$`25-30`[i]<-1
} else{FinalData$`25-30`[i]<-0} #Fills '25-30'
if(FinalData$Length[i] >=30){
FinalData$`>30`[i]<-1
} else{FinalData$`>30`[i]<-0} #Fills '>30'
}
This seems to work, but it takes a long time:
system.time(source('~/Desktop/Home/Programming/R/Boolean Loop R.R'))
user system elapsed
94.408 19.147 118.203
The way I coded this seems very clunky and inefficient. Is there a faster and more efficient way to code something like this or am I doing this appropriately for what I am asking for?
Here is an example of some of the values I am testing:
'Length': 14.362, 12.482337, 8.236, 16.752, 12.045
If I am not being clear about how the data frame is structured, here is a screenshot: [image: How my data frame is organized]
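You can use the cut function in R. It converts numeric values to a factor:
x <- c(1,2,4,2,3,5,6,5,6,5,8,0,5,5,4,4,3,3,3,5,7,9,0,5,6,7,4,4)
cut(x = x, breaks = c(0,3,6,9,12), labels = c("grp1","grp2","grp3","grp4"), right = FALSE)
Set right = TRUE or right = FALSE as needed: with right = FALSE the intervals are closed on the left, e.g. [0, 3), and with right = TRUE (the default) they are closed on the right, e.g. (0, 3].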
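Applied to the six categories from the question, cut could be used roughly as follows. This is a sketch on the five example lengths the asker posted; using -Inf and Inf as outer breaks is my assumption about how to catch the open-ended categories.

```r
# The example lengths from the question
len <- c(14.362, 12.482337, 8.236, 16.752, 12.045)

# right = FALSE makes the intervals closed on the left: [10, 15), [15, 20), ...
# -Inf and Inf catch everything below 10 and everything from 30 upwards
grp <- cut(len,
           breaks = c(-Inf, 10, 15, 20, 25, 30, Inf),
           labels = c("<10", "10-15", "15-20", "20-25", "25-30", ">30"),
           right  = FALSE)

table(grp)   # counts per category
```

With these settings a value of exactly 30 lands in the '>30' group, which matches the >= 30 test in the original loop.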
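You can vectorise that as follows (I made some sample data called DF):
DF <- data.frame(1:40000, sample(letters, 40000, replace = TRUE), "Length" = sample(1:40, 40000, replace = TRUE))
MyFunc <- function(x) {
  # Write the labels into a separate character vector. Assigning them into x
  # itself would turn x into character and break the later numeric comparisons.
  out <- character(length(x))
  out[x < 10] <- "<10"
  out[x >= 10 & x < 15] <- "10-15"
  out[x >= 15 & x < 20] <- "15-20"
  out[x >= 20 & x < 25] <- "20-25"
  out[x >= 25 & x < 30] <- "25-30"
  out[x >= 30] <- ">30"
  out
}
DF$Group <- MyFunc(DF[,3])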
If it has to be 6 columns like that, you can modify the above to return a 1 for values in the given range and a 0 for everything else, once for each of the 6 columns.
Edit: I guess a series of ifelse calls might be best if it really has to be 6 columns like that. ifelse is already vectorised, so no sapply is needed,
e.g.
DF$'<10' <- ifelse(DF$Length < 10, 1, 0)
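To fill all six 0/1 columns at once, a short loop over the category bounds may be enough, since each comparison is vectorised over all rows. The data below is randomly generated for illustration, and the vectors lower, upper and cols are names I made up.

```r
# Hypothetical sample data with a Length column
set.seed(1)
DF <- data.frame(Length = runif(20, min = 0, max = 40))

# Lower (inclusive) and upper (exclusive) bound of each category
lower <- c(-Inf, 10, 15, 20, 25, 30)
upper <- c(10, 15, 20, 25, 30, Inf)
cols  <- c("<10", "10-15", "15-20", "20-25", "25-30", ">30")

# One vectorised comparison per column; as.integer() turns TRUE/FALSE into 1/0
for (k in seq_along(cols)) {
  DF[[cols[k]]] <- as.integer(DF$Length >= lower[k] & DF$Length < upper[k])
}

head(DF)
```

The loop runs six times regardless of the number of rows, instead of once per row as in the question.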
I am trying to build a decision table. At time 3 for example I have to take the previous results in time t=1 and time t=2 in order to make my decision in time 3. The decision table is going to be pretty big so I am considering an efficient way to do it by building a function. For instance at time 3:
rm(list=ls()) # clear memory
names <- c("a","b","c","d","e","f","g","h","i","j","k50","l50","m50","n50","o50")
proba <- c(1,1,1,1,1,1,1,1,1,1,0.5,0.5,0.5,0.5,0.5)
need <- 4
re <- 0.5
w <- 1000000000
# t1
t1 <- as.integer(names %in% (sample(names,need,prob=proba,replace=F)))
# t2
t2 <- rep(t1)
# t3
proba3 <- ifelse(t2==1,proba*re,proba)
t3 <- as.integer(names %in% (sample(names,need,prob=proba3,replace=F)))
The table keeps growing until t=7, where proba7 depends on the outcomes from t=1 to t=6. After t=7, each step always uses the 6 previous outcomes plus the random part proba to make its decision. In other words, the ifelse must be dynamic so that I can call it for any later step. I have been trying something like
probF <- function(a){
test <- ifelse(paste0("t",a,sep="")==1,proba*re,proba)
return(test)
}
test <- probF(2)
but there is an error: I get just one value and not a vector. I know that it looks complicated.
Here are the conditions that one person asked for (I know they are not very well written):
proba7 <- ifelse(t2==1 & t3==1 & t4==0 & t5==0 & t6==0,proba,
ifelse(t2==1 & t3==0 & t4==0 & t5==1 & t6==1,proba*re,
ifelse(t2==1 & t3==0 & t4==0 & t5==0 & t6==1, w,
ifelse(t2==0 & t3==1 & t4==1 & t5==0 & t6==0,proba,
ifelse(t2==0 & t3==1 & t4==1 & t5==1 & t6==0,0,
ifelse(t2==0 & t3==0 & t4==1 & t5==1 & t6==1,0,
ifelse(t2==0 & t3==0 & t4==1 & t5==1 &t6==0,0,
ifelse(t2==0 & t3==0 & t4==0 & t5==1 & t6==1, proba*re,
ifelse(t2==0 & t3==0 & t4==0 & t5==0 & t6==1,w,proba)))))))))
t7 <- as.integer(names %in% (sample(names,need,prob=proba7,replace=F)))
If you take a bit of a different approach, you'll gain quite a lot of speed.
First of all, it is really a terribly bad idea to store every step as a separate t1, proba1, etc. If you need to keep all that information, predefine a matrix or list of the right size and store everything in there. That way you can use simple indices instead of having to resort to the bug-prone use of get(). If you find yourself typing get(), almost always it's time to stop and rethink your solution.
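As a small illustration of that first point, here is a sketch that keeps the steps in a list, so that the probF from the question can select a step by index instead of building the name "t2" as a string. The list name steps is my own choice; the other values come from the question.

```r
set.seed(42)
names <- c("a","b","c","d","e","f","g","h","i","j","k50","l50","m50","n50","o50")
proba <- c(rep(1, 10), rep(0.5, 5))
need  <- 4
re    <- 0.5

# One list element per time step instead of separate objects t1, t2, ...
steps <- vector("list", 3)
steps[[1]] <- as.integer(names %in% sample(names, need, prob = proba))
steps[[2]] <- steps[[1]]

# The step is selected by index, so no get(paste0("t", a)) is needed
probF <- function(a) ifelse(steps[[a]] == 1, proba * re, proba)

steps[[3]] <- as.integer(names %in% sample(names, need, prob = probF(2)))
```

The original probF compared the string "t2" with 1, which yields a single FALSE; indexing the list returns the whole 0/1 vector, so ifelse gives one probability per name.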
Secondly, you can use a simple principle to select the indices of the previous steps:
seq(max(1, i-6), i-1)
will allow you to use a loop index i (starting at 2) and refer to up to 6 previous positions, as far as they exist.
Thirdly, depending on what you want, you can reformulate your decision as well. If you store every t as a row in the matrix, you can simply use colSums() and check whether the result is larger than 0. Based on that index, you can update the probabilities in such a way that a 1 in any of the previous 6 rows halves the probability.
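The window index and the colSums() check can be tried out on a tiny matrix. The numbers below are invented purely for illustration.

```r
# Hypothetical 0/1 decision matrix: 4 time steps (rows) x 3 items (columns)
theT <- rbind(c(1, 0, 0),
              c(0, 0, 0),
              c(0, 1, 0),
              c(0, 0, 0))
proba <- c(1, 1, 0.5)
re    <- 0.5

i   <- 5                           # the step about to be simulated
tid <- seq(max(1, i - 6), i - 1)   # rows 1 to 4: at most the 6 previous steps

# TRUE for every column with at least one 1 inside the window;
# drop = FALSE keeps a matrix even when the window is a single row
pid <- colSums(theT[tid, , drop = FALSE]) > 0

p <- proba
p[pid] <- p[pid] * re   # reduce the probability of recently chosen items
p
```

Note that for i = 1 the expression would be seq(1, 0), which counts down and gives c(1, 0), so the loop has to start at i = 2.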
Wrapping everything in a function would then look like:
myfun <- function(names, proba, need, re,
w=100){
# For convenience, so I don't have to type this twice
resample <- function(p){
as.integer(
names %in% sample(names,need,prob=p, replace = FALSE)
)
}
# get the number of needed columns
nnames <- length(names)
# create two matrices to store all the t-steps and the probabilities used
theT <- matrix(nrow = w, ncol = nnames)
theproba <- matrix(nrow = w, ncol = nnames)
# Create a first step, using the original probabilities
theT[1,] <- resample(proba)
theproba[1,] <- proba
# loop over the other simulations, each time checking the condition
# recalculating the probability and storing the result in the next
# row of the matrices
for(i in 2:w){
# the id vector to select the (maximal) 6 previous rows. If
# i-6 is smaller than 1 (i.e. there are no 6 steps yet), the
# max(1, i-6) guarantees that you start minimal at 1.
tid <- seq(max(1, i-6), i-1)
# Create the probability vector from the original one
p <- proba
# look for which columns in the 6 previous steps contain a 1
pid <- colSums(theT[tid,,drop = FALSE]) > 0
# update the probability vector using the reduction factor re
p[pid] <- p[pid]*re
# store the next step and the used probabilities in the matrices
theT[i,] <- resample(p)
theproba[i,] <- p
}
# Return both matrices in a single list for convenience
return(list(decisions = theT,
proba = theproba)
)
}
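which can be used as follows (here w is the number of time steps to simulate, not the weight of 1000000000 from the question, which would exhaust memory when used as a row count):
myres <- myfun(names, proba, need, re, w = 100)
head(myres$decisions)
head(myres$proba)
This returns a list of two matrices in which every row is one t-point of the decision table.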
I have a table in R with three columns. I want to get the correlation of the first two columns with a subset of the third column following a specific set of conditions (values are all numeric, I want them to be > a certain number). The cor() function doesn't seem to have an argument to define such a subset.
I know that I could use the summary(lm()) function and square-root the r^2, but the issue is that I'm doing this inside a for loop and am just appending the correlation to a separate list that I have. I can't really append part of the summary of the regression easily to a list.
Here is what I am trying to do:
for (i in x) {list[i] = cor(data$column_a, data$column_b, subset = data$column_c > i)}
Obviously, though, I can't do that because the cor() function doesn't work with subsets.
(Note: x = seq(1,100) and list = NULL)
You can do this without a loop using lapply. Here's some code that will output a data frame with the month-range in one column and the correlation in another column. The do.call(rbind... business is just to take the list output from lapply and turn it into a data frame.
corrs = do.call(rbind, lapply(min(airquality$Month):max(airquality$Month),
function(x) {
data.frame(month_range=paste0(x," - ", max(airquality$Month)),
correlation = cor(airquality$Temp[airquality$Month >= x & airquality$Temp < 80],
airquality$Wind[airquality$Month >= x & airquality$Temp < 80]))
}))
corrs
month_range correlation
1 5 - 9 -0.3519351
2 6 - 9 -0.2778532
3 7 - 9 -0.3291274
4 8 - 9 -0.3395647
5 9 - 9 -0.3823090
You can subset the data first, and then find the correlation.
a <- subset(airquality, Temp < 80 & Month > 7)
cor(a$Temp, a$Wind)
Edit: I don't really know what your list variable is, but here is an example of dynamically changing the subset based on i (see how the month requirement changes with each iteration)
list <- seq(1, 5)
for (i in 1:5){
a <- subset(airquality, Temp < 80 & Month > i)
list[i] <- cor(a$Temp, a$Wind)
}
Based on the pseudo-code you provided alone, here's something that should work:
for (i in x) {
df <- subset(data, column_c > i)
list[i] = cor(df$column_a, df$column_b)
}
However, I don't know why you would want your index in list[i] to be the same value that you use to subset column_c. That could be another source of problems.
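One way to sidestep that indexing concern is to let sapply build the result vector and to name the entries by threshold. The sketch below uses the built-in airquality data, as in the other answers; the thresholds are arbitrary and only serve as an illustration.

```r
# Built-in airquality data; the thresholds are an arbitrary illustration
thresholds <- 5:8

cors <- sapply(thresholds, function(th) {
  d <- airquality[airquality$Month > th, ]   # subset first
  cor(d$Temp, d$Wind)                        # then correlate
})

# Label each result by its threshold instead of relying on position
names(cors) <- paste0("Month > ", thresholds)
cors
```

The same pattern should carry over to the question's data by replacing the subset condition with data$column_c > th and the two columns with column_a and column_b.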