Basic for() loop in R

I'm new to loops and R in general. Using the "iris" dataset, I need to use a for() loop to create an object called 'X.IQR' that contains the interquartile range of each of the first four columns of "iris". Could someone please provide a little insight for me here? Thank you!
Edit: Sorry, I forgot to include my attempt:
for(row in 1:150){
  for(column in 1:4){
    print(paste("row =", row, "; col =", column))
    print(iris[1:150, 1:4])
  }
}
I've tried this code, which is partly my own knowledge and partly example code I learned in class. I understand that this is a loop, and I THINK I have specified the first four columns as I intended; I'm just not sure how to incorporate IQR here. Does anyone have any advice?

When selecting a subset of the data, if you intend to keep all the rows (as you have), you can simply omit the row index:
iris[1:150,1:4]
becomes
iris[ ,1:4]
As Richard mentioned in a comment, you can use sapply:
X.IQR = sapply(X = iris[,1:4], FUN = IQR)
sapply will apply the function given as FUN (here IQR) to each element of the iris data frame, and the elements of a data frame are its columns.
or using apply:
X.IQR = apply(X = iris[ ,1:4], 2, FUN = IQR)
apply can do the same thing, but it's a bit more code and won't always be as clean.
Read more with the excellent response here: R Grouping functions: sapply vs. lapply vs. apply. vs. tapply vs. by vs. aggregate
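Since the exercise specifically asks for a for() loop, here is a minimal sketch of how the loop version could look. The object name X.IQR and the first four iris columns come from your question; the pre-allocation pattern is just one reasonable way to write it:
# Pre-allocate a numeric vector, one slot per column of interest
X.IQR <- numeric(4)
names(X.IQR) <- names(iris)[1:4]
# Fill each slot with the IQR of the corresponding column
for (column in 1:4) {
  X.IQR[column] <- IQR(iris[, column])
}
X.IQR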

Related

apply fisher test in a large dataset that join all contingency tables

I have a dataset like this:
contingency_table <- tibble::tibble(
  x1_not_happy = c(1, 4),
  x1_happy = c(19, 31),
  x2_not_happy = c(1, 4),
  x2_happy = c(19, 28),
  x3_not_happy = c(14, 21),
  X3_happy = c(0, 9),
  x4_not_happy = c(3, 13),
  X4_happy = c(17, 22)
)
In fact, there are many more variables like these, which come from a poll applied in two different years.
Then I apply a Fisher test to each 2x2 contingency matrix, using this code:
matrix1_prueba <- contingency_table[1:2, 1:2]
matrix2_prueba <- contingency_table[1:2, 3:4]
fisher1 <- fisher.test(matrix1_prueba, alternative = "two.sided", conf.level = 0.9)
fisher2 <- fisher.test(matrix2_prueba, alternative = "two.sided", conf.level = 0.9)
I would like to carry out this task with shorter code, by means of a function or a loop. The output should be a vector with the p-value for each question.
Thanks,
Frederick
So this was a bit of fun to do. The main thing you need to recognize is that you want combinations of your data, and there are a number of functions in R that can build those for you. The main workhorse is combn().
So, in the language of the problem, we want all combinations of the columns of your tibble taken 2 at a time.
From there, you just need to do some looping structure to get your tests to work, and extract the p-values from the object.
list_tables <- lapply(combn(contingency_table, 2, simplify = FALSE), fisher.test)
unlist(lapply(list_tables, `[`, 'p.value'))
This should produce your answer.
EDIT
Given the updated requirement of testing only adjacent data.frame columns, the following modifications should work.
full_list <- combn(contingency_table, 2, simplify = FALSE)
full_list <- full_list[sapply(
  full_list,
  function(x) all(startsWith(names(x), substr(names(x)[1], 1, 2)))
)]
full_list <- lapply(full_list, fisher.test)
unlist(lapply(full_list, `[`, 'p.value'))
This is approximately the same code as before, but now we have to find the subsets of the data that share the same question prefix. This only works if the prefixes match exactly (X3 != x3). I think this is a better solution than working with column indexes, which offers no guarantee that paired columns are always next to one another. The sapply call does exactly that filtering. The final output should be what you need for the problem.
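That said, if the adjacent-pair layout really is guaranteed, a plain index-based loop is another possible sketch. It assumes every question occupies exactly two neighbouring columns (not_happy followed by happy), as in your example tibble:
# Sketch: step through the columns two at a time and test each 2x2 block
p_values <- numeric(ncol(contingency_table) / 2)
for (j in seq(1, ncol(contingency_table), by = 2)) {
  tab <- as.matrix(contingency_table[1:2, j:(j + 1)])
  p_values[(j + 1) / 2] <- fisher.test(tab, alternative = "two.sided",
                                       conf.level = 0.9)$p.value
}
p_values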

Recall different data names inside loop

Here is how I created a number of data sets with names data_1, data_2, data_3, ... and so on. Initially, data is a data frame with dim(data) of 500 rows and 17 columns:
for (i in 1:length(unique(data$cluster))) {
  assign(paste("data", i, sep = "_"), subset(data[data$cluster == i, ]))
}
Up to this point everything is fine. Now I am trying to use these data sets one by one inside another loop, like this:
for (i in 1:5) {
  data <- paste(data, i, sep = "_")
}
However, this is not giving me the data in the required format. Any help will be really appreciated. Thank you in advance.
Let me give you a tip here: don't just assign everything in the global environment; use lists for this instead. That way you avoid all the things that can go wrong when meddling with the global environment. The code in your question will overwrite the original dataset data, so you'll be in trouble if you want to rerun it after something goes wrong: you'll have to reconstruct the original data frame.
Second: If you need to split a data frame based on a factor and carry out some code on each part, you should take a look at split, by and tapply, or at the plyr and dplyr packages.
Using Base R
With base R, it depends on what you want to do. In the most general case you can use a combination of split() and lapply or even a for loop:
mylist <- split(data, f = data$cluster)
for (mydata in mylist) {
  head(mydata)
  ...
}
Or
mylist <- split(data, f = data$cluster)
result <- lapply(mylist, function(mydata) {
  doSomething(mydata)
})
Which one you use depends largely on what the result should be. If you need some kind of summary for every subset, lapply will give you a list with the results per subset. If you need this for a simulation, plotting, or the like, you are better off with the for loop.
If you want to add some variables based on other variables, then the plyr or dplyr packages come in handy.
Using plyr and dplyr
These packages come especially handy if the result of your code is going to be an array or data frame of some kind. This would be similar to using split and lapply but then in a way Hadley approves of :-)
For example:
library(plyr)
result <- ddply(data, .(cluster),
                function(mydata) {
                  doSomething(mydata)
                })
Use dlply if the result should be a list.
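For completeness, a dplyr analogue of the ddply() call above could look like the sketch below. It assumes doSomething() returns a data frame for each subset and that you have dplyr >= 0.8.1 for group_modify():
library(dplyr)
# Split by cluster, apply doSomething() to each group, and row-bind the results
result <- data %>%
  group_by(cluster) %>%
  group_modify(~ doSomething(.x)) %>%
  ungroup()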

R: How to do this without a for-loop?

The following code in R uses a for-loop. What is a way I could solve the same problem without a for-loop (maybe by vectorizing it)?
I am looking at an unfamiliar dataset with many columns (243), and am trying to figure out which columns hold unstructured text. As a first check, I was going to flag columns that are 1) of class 'character' and 2) have at least ten unique values.
openEnded <- rep(x = NA, times = ncol(scaryData))
for (i in 1:ncol(scaryData)) {
  openEnded[i] <- is.character(scaryData[[i]]) & length(unique(scaryData[[i]])) >= 10
}
This would probably do the job:
openEnded <- apply(scaryData, 2, function(x) is.character(x) & length(unique(x))>=10)
From the loop, you simply iterate over columns (that's the apply(scaryData, 2, ...) part) and apply an anonymous function that combines your two conditions (function(x) cond1 & cond2).
Since your data is a data.frame, sapply(scaryData, function(x) ...) would also work; sapply iterates over the columns directly, so it takes no margin argument.
A nice post about the *apply family can be found here.
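One caveat worth noting: apply() coerces a data frame to a matrix first, which can change column classes (for example, numeric columns become character if any column is character). A column-wise sketch with vapply() avoids that and also pins the return type:
# vapply() iterates over the data frame's columns directly, so each column
# keeps its original class; FUN.VALUE guarantees one logical per column
openEnded <- vapply(
  scaryData,
  function(x) is.character(x) && length(unique(x)) >= 10,
  logical(1)
)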

Use approxfun with each unique element of vector

I am fairly new to R and I would like to understand the concept of using the apply family of functions to avoid loops and custom functions. Unfortunately I am failing at the very first exercise.
Here is my minimum reproducible example:
x <- data.frame(
  Hours = cbind(c(rep(5, 5), rep(6, 5), rep(7, 5), rep(8, 5), rep(9, 5))),
  Price = c(cbind(seq(48, 50.4, by = 0.1), seq(48, 52.8, by = 0.2),
                  seq(48, 55.2, by = 0.3), seq(48, 57.8, by = 0.4),
                  seq(48, 60.0, by = 0.5))),
  Volume = seq(10000:10024)
)
f1 <- approxfun(x$Volume,x$Price, rule=2)
plot(x$Volume, x$Price)
curve(f1, add=TRUE)
However, I would like to perform approxfun() for every unique value of x$Hours.
How would I have to approach this?
Thank you for your help.
This solution was provided by bunk.
The idiom is split/apply/combine: split the data, apply the function, combine the results. Base R, the *plyr packages, data.table, etc. have many functions to do this:
fns <- lapply(split(x, x$Hours), function(dat) approxfun(dat$Volume, dat$Price, rule = 2))
plot(x$Volume, x$Price)
cols <- 1
for (fn in fns) curve(fn, add = TRUE, col = (cols <<- cols + 1))
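As a hypothetical usage note (the hour and volume values below are just illustrative), each element of fns is itself an interpolating function, keyed by the hour it was fitted on, so you can call it directly:
# Look up the function fitted for Hours == 5 and evaluate it at a chosen Volume
fns[["5"]](12)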

Apply function to every value in an R dataframe

I have a 58-column dataframe and I need to apply the transformation $\log(x_{i,j}+1)$ to all values in the first 56 columns. What method could I use to go about this most efficiently? I'm assuming there is something that would allow me to do this rather than just running some for loops over the entire dataframe.
alexwhan's answer is right for log (and should probably be selected as the correct answer). However, it works so cleanly because log is vectorized. I have experienced the special pain of non-vectorized functions too frequently. When I started with R and didn't understand the apply family well, I resorted to ugly loops very often. So, for the benefit of those who stumble onto this question with a function that is not vectorized, I provide the following proof of concept.
#Creating sample data
df <- as.data.frame(matrix(runif(56 * 56), 56, 56))
#Writing an ugly non-vectorized function
logplusone <- function(x) {log(x[1] + 1)}
#Example code that achieves the desired result, despite the lack of a vectorized function
df[, 1:56] <- as.data.frame(lapply(df[, 1:56], FUN = function(x) {sapply(x, FUN = logplusone)}))
#Proof that the results are the same using both methods
#Note: I used all.equal rather than all so that the values are tested using machine
#tolerance for mathematical equivalence. This is probably a non-issue for the current
#example, but might be relevant with some other testing functions.
#Should evaluate to TRUE
all.equal(log(df[, 1:56] + 1),
          as.data.frame(lapply(df[, 1:56], FUN = function(x) {sapply(x, FUN = logplusone)})))
You should be able to just refer to the columns you want and do the operation directly, i.e.:
df[,1:56] <- log(df[,1:56]+1)
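Base R also provides log1p(), which computes log(x + 1) with better numerical accuracy for small x. And if you prefer the tidyverse, a dplyr sketch of the same vectorized operation (assuming dplyr >= 1.0 for across()) would be:
library(dplyr)
# Apply the transformation to the first 56 columns only, leaving the rest untouched
df <- df %>% mutate(across(1:56, ~ log(.x + 1)))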
