I created a data frame from another data set called CompleteData:
conditions=data.frame(CompleteData$CAD,CompleteData$CVD,CompleteData$AAA,CompleteData$PAD)
This data frame has 2000 rows of data, with potential values of either 0 or 1. (The values represent the presence or absence of a condition such as CAD or CVD). I am trying to determine how many rows have two or more conditions in this data frame.
My general plan was to use an if statement combined with a for loop to determine which rows have multiple conditions, and then add the number of rows together. Here is the function I created:
for (conditions in conditions)
{
  multiple.conditions = function(conditions)
  {
    if (sum(conditions) > 1) {return("multiple conditions")} else {
      return("0 or 1 condition")
    }
  }
}
I'm just stuck in trying to figure out how to apply the for loop so that it performs the if statement row by row. I tested the if statement portion, which comes out correct, but how to structure the for loop is confusing me. As this is an introductory-level class, we are limited in the types of functions we are able to use. In this case, we have learned the syntax of functions, creating nested functions, if statements, for loops, and while loops. Any ideas?
What about just using rowSums()?
sum(rowSums(conditions)>1)
[1] 106
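To see what that one-liner does step by step (using the same conditions data frame):
row_totals <- rowSums(conditions)   # number of conditions present in each row
sum(row_totals > 1)                 # each TRUE counts as 1, so this counts rows with 2+ conditions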
If you need to use a loop, you can do something like this:
mult_cond_rows = 0
for(i in 1:nrow(conditions)) {
row = conditions[i,]
if(sum(row)>1) mult_cond_rows = mult_cond_rows + 1
}
print(mult_cond_rows)
[1] 106
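Since your original plan was to combine a function with a for loop, the same check can also be wrapped in a small helper function (has_multiple_conditions is just an illustrative name):
has_multiple_conditions = function(row) {
  sum(row) > 1   # TRUE when a row has two or more conditions
}
mult_cond_rows = 0
for(i in 1:nrow(conditions)) {
  if(has_multiple_conditions(conditions[i,])) mult_cond_rows = mult_cond_rows + 1
}
mult_cond_rows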
Input:
set.seed(123)
conditions = as.data.frame(setNames(
lapply(1:4, \(i) sample(0:1,2000, replace=T, prob=c(.9,.1))),
nm = c("CAD", "CVD", "AAA", "PAD")
))
Related
I'm in need of some R for-loop and grep optimisation assistance.
I have a data.frame made up of columns of different data types. 42 of these columns have the name "treatmentmedication_code_#", where # is a number 1 to 42.
There is a lot of code so a reproducible example is quite tricky. As a compromise, the following code is the precise operation I need to optimise.
for(i in 1:nTreatments) {
...lots of code...
controlsDrugStatusDF <- cbind(controlsTreatmentDF, Drug=0)
for(n in 1:nControls) {
if(treatment %in% controlsDrugStatusDF[n,grep(pattern="^treatmentmedication_code*",x=colnames(controlsDrugStatusDF))]) {
controlsDrugStatusDF$Drug[n] <- 1
} else {
controlsDrugStatusDF$Drug[n] <- 0
}
}
}
treatment is some coded medication e.g., 145374524. The condition inside the if statement is very slow. It checks to see whether the treatment value is present in any one of those columns defined by the grep for the row n. To make matters worse, this is done for every treatment, thus the i for-loop.
Short of launching multiple processes or massacring my data.frames into lots of separate matrices then pasting them together and converting them back into a data.frame, are there any notable improvements one could make on the if statement?
As part of the optimization, the grep for selecting the columns can be done outside the loop. Regarding the treatment part, it is not clear whether it is a single value or a vector; assuming it is a vector of values (treatments), we can use
nm1 <- grep("^treatmentmedication_code*",
            colnames(controlsDrugStatusDF), value = TRUE)
nm2 <- paste0("Drug", seq_along(nm1))
controlsDrugStatusDF[nm2] <- lapply(controlsDrugStatusDF[nm1],
                                    function(x) +(x %in% treatments))
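If instead treatment is a single code inside the outer i loop (as in the original code), the same precomputed nm1 lets you replace the inner n loop with one vectorized comparison; this is a sketch under that assumption:
controlsDrugStatusDF$Drug <- +(rowSums(controlsDrugStatusDF[nm1] == treatment, na.rm = TRUE) > 0)
Here rowSums(...) > 0 is TRUE whenever the code appears in any of the medication columns for that row, and the unary + turns that into the 0/1 Drug flag.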
I have two variables containing missing data, loon and profstat. For a better overview of which data are missing and need to be imputed, I wanted to create an additional variable, problem, in the data frame that returns for each case 1 if loon is missing and profstat is observed, and 0 otherwise. I have generated the following code, which only gives me x[] = 1 as output. Any solution to this problem?
{
problem <- dim(length(t))
for (i in 1:nrow(dflapopofficial))
{
if (is.na(dflapopofficial$loon[i])==TRUE & is.na(dflapopofficial$profstat[i])==FALSE) {
dflapopofficial$problem[i]=1
} else {
dflapopofficial$problem[i]=0
}
return(problem)
}
There are a few things that could be improved here:
Remember, many operations in R are vectorized. You don't need to loop through each element in a vector when doing logical checks etc.
is.na(some_condition) == TRUE is just the same as is.na(some_condition) and is.na(some_condition) == FALSE is the same as !is.na(some_condition)
If you want to write a new column inside a dataframe, and you are referring to several variables in that dataframe, using within can save you a lot of typing - particularly if your dataframe has a long name
You are returning problem, yet in your loop, you are writing to dflapopofficial$problem, which is a different variable.
If you want to write 1s and 0s, you can implicitly convert logical to numeric using +(logical_vector)
Putting all this together, you can replace your whole loop with a single line:
within(dflapopofficial, problem <- +(is.na(loon) & !is.na(profstat)))
Remember to store the result, either back to the dataframe or to a copy of it, like
df <- within(dflapopofficial, problem <- +(is.na(loon) & !is.na(profstat)))
So that df is just a copy of dflapopofficial with your extra column.
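Equivalently, if you prefer plain assignment over within, the same logic is:
dflapopofficial$problem <- +(is.na(dflapopofficial$loon) & !is.na(dflapopofficial$profstat))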
I am trying to obtain the number of cases for each variable in a df. There are 275 cases in the df but most columns have some missing data. I am trying to run a for loop to obtain the information as follows:
idef_id<-readxl::read_xlsx("IDEF.xlsx")
casenums <- for (i in names(idef_id)) {
nas<- sum(is.na(i))
275-nas
}
However, the output for casenums is:
> summary(casenums)
Length Class Mode
0 NULL NULL
Any help would be much appreciated!
A for loop isn't a function - it doesn't return anything, so x <- for(... doesn't ever make sense. You can do that with, e.g., sapply, like this
casenums <- sapply(idef_id, function(x) sum(!is.na(x)))
Or you can do it in a for loop, but you need to assign to a particular value inside the loop:
casenums = rep(NA, ncol(idef_id))
names(casenums) = names(idef_id)
for(i in names(idef_id)) {
casenums[i] = sum(!is.na(idef_id[[i]]))
}
You also had a problem that i is taking on column names, so sum(is.na(i)) is asking if the value of the column name is missing. You need to use idef_id[[i]] to access the actual column, not just the column name, as I show above.
You seem to want the answer to be the number of non-NA values, so I switched to sum(!is.na(...)) to count that directly, rather than hard-coding the number of rows of the data frame and doing subtraction.
The immediate fix for your for loop is that your i is a column name, not the data within. On your first pass through the for loop, your i is class character, always length 1, so sum(is.na(i)) is going to be 0. Due to how frames are structured, there is very little likelihood that a name is NA (though it is possible ... with manual subterfuge).
I suggest a literal fix for your code could be:
casenums <- for (i in names(idef_id)) {
nas<- sum(is.na(idef_id[[i]]))
275-nas
}
But this has the added problem that for loops don't return anything (as Gregor's answer also discusses). For the sake of walking through things, I'll keep the for loop in the first point below and then replace it in the second:
Two things:
hard-coding 275 (assuming that's the number of rows in the frame) will be problematic if/when your data ever changes. Even if you're "confident" it never will ... I still recommend not hard-coding it. If it's based on the number of rows, then perhaps
OUT_OF <- 275 # should this be nrow(idef_id)?
casenums <- for (i in names(idef_id)) {
nas<- sum(is.na(idef_id[[i]]))
OUT_OF - nas
}
at least in a declarative sense, so the variable name (please choose something better) makes it clear how you determined 275 and how, if necessary, it should be updated in the future.
(Or better, use Gregor's logic of sum(!is.na(...)) if you just need to count not-NA.)
doing something for each column of a frame is easily done using sapply or lapply, perhaps
OUT_OF <- 275 # should this be nrow(idef_id)?
OUT_OF - sapply(idef_id, function(one_column) sum(is.na(one_column)))
## or
sapply(idef_id, function(one_column) OUT_OF - sum(is.na(one_column)))
I have some data that contains 400+ columns and ~80 observations. I would like to use a for loop to go through each column and, if it contains the desired prefix exp_, I would like to create a new column which is that value divided by a reference column, stored as the same name but with a suffix _pp. I'd also like to do an else if with the other prefix rev_ but I think as long as I can get the first problem figured out I can solve the rest myself. Some example data is below:
exp_alpha exp_bravo rev_charlie rev_delta pupils
10 28 38 95 2
24 56 39 24 5
94 50 95 45 3
15 93 72 83 9
72 66 10 12 3
The first time I tried it, the loop ran through properly but only stored the final column in which the if statement was true, rather than storing each column in which the if statement was true. I made some tweaks and lost that code but now have this which runs without error but doesn't modify the data frame at all.
for (i in colnames(test)) {
if(grepl("exp_", colnames(test)[i])) {
test[paste(i,"pp", sep="_")] <- test[i] / test$pupils
}
}
My understanding of what this is doing:
loop through the vector of column names
if the substring "exp_" is in the ith element of the colnames vector == TRUE
create a new column in the data set which is the ith element of the colnames vector divided by the reference category (pupils), and with "_pp" appended at the end
else do nothing
I imagine, since the code executes without error but doesn't do anything, that my problem is in the if() statement, but I can't figure out what I'm doing wrong. I also tried adding "== TRUE" in the if() statement, but that achieved the same result.
Almost correct, but your i is a column name, so colnames(test)[i] indexes the names vector by a character rather than by position and never matches anything. Loop over indices instead:
for (i in 1:length(colnames(test))) {
if(grepl("exp_", colnames(test)[i])) {
test[paste(colnames(test)[i],"pp", sep="_")] <- test[i] / test$pupils
}
}
As an alternative to #timfaber's answer, you can keep your first line the same but not treat i as an index:
for (i in colnames(test)) {
if(grepl("exp_", i)) {
print(i)
test[paste(i,"pp", sep="_")] <- test[i] / test$pupils
}
}
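If you also want the else if branch for the rev_ prefix mentioned in the question, the same loop extends naturally; dividing the rev_ columns by pupils as well is an assumption here:
for (i in colnames(test)) {
  if(grepl("exp_", i)) {
    test[paste(i,"pp", sep="_")] <- test[i] / test$pupils
  } else if(grepl("rev_", i)) {
    test[paste(i,"pp", sep="_")] <- test[i] / test$pupils
  }
}
If both branches end up doing the same thing, a single grepl("^(exp_|rev_)", i) condition collapses them.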
Linear solution:
Don't use a loop for that! You can vectorise your code and run it much faster than looping over the columns. Here's how to do it:
# Extract column names
cNames <- colnames(test)
# Find columns whose names start with exp_
foo <- grep("^exp_", cNames)
# Divide by reference: ALL columns at the SAME time
bar <- test[, foo] / test$pupils
# Append the _pp suffix to those column names: ALL columns at the SAME time
colnames(bar) <- paste0(cNames[foo], "_pp")
# Add to the original dataset instead of iteratively appending
test <- cbind(test, bar)
I have a dataframe of samples with categorical and numerical attributes. I would like to compare every pair of samples: take one sample and compare it against all the others. This comparison is performed by a function that takes two parameters (the two samples being compared).
Let us suppose that data2 is that dataframe and ComputeSimilarityMeasure is the function that I would like to apply. It is worth saying that this function separates categorical and numerical attributes in order to perform different calculations with them.
I have tried this:
nsamples=nrow(data2)
for (i in 1:nsamples) {
KX(i) <- apply( data2, 1, function(x) ComputeSimilarityMeasure(x,data2[i,]) )
#...rest of the code...
}
The problem is that, inside the ComputeSimilarityMeasure the sample x has all its attributes as strings, even numerical ones. Therefore, the function doesn't work properly.
Input sample to the function (before the call):
KEY_PROMO PROMO_TYPE KEY_STORE KEY_MKT MKT_HQ_CITY MKT_HQ_STATE
1 0 1 6 Chicago IL
Input sample to the function (inside the function):
KEY_PROMO PROMO_TYPE KEY_STORE KEY_MKT MKT_HQ_CITY MKT_HQ_STATE
" 1" " 0" " 1" " 6" "Chicago " "IL
At the moment, I have a working solution that uses two for loops; however, it is unacceptable in terms of computation time (data2 has thousands of samples).
Any idea how to fix my apply call? Any other alternative that you think would work better?
You can use sapply like a for loop. Your apply() version passes strings because apply() first coerces the data frame to a matrix, and with mixed column types that matrix is character; indexing rows with data2[x,] instead keeps them as one-row data frames:
nsamples = nrow(data2)
KX = vector("list", nsamples)  # pre-allocate a list so we can use KX[[i]]; KX(i) <- ... is not valid R assignment
for (i in 1:nsamples) {
KX[[i]] <- sapply(1:nrow(data2), function(x) ComputeSimilarityMeasure(data2[x,],data2[i,]) )
#...rest of the code...
}
If your data set is big enough, it is also worth parallelizing this procedure; I recommend parallel::mclapply instead of a for loop.
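A minimal sketch of that, assuming data2 and ComputeSimilarityMeasure are as described above (mclapply relies on forking, so on Windows it only works with mc.cores = 1):
library(parallel)
KX <- mclapply(1:nrow(data2), function(i) {
  sapply(1:nrow(data2), function(x) ComputeSimilarityMeasure(data2[x,], data2[i,]))
}, mc.cores = max(1, detectCores() - 1))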