cor() function in R with a subset - r

I have a table in R with three columns. I want to get the correlation of the first two columns with a subset of the third column following a specific set of conditions (values are all numeric, I want them to be > a certain number). The cor() function doesn't seem to have an argument to define such a subset.
I know that I could use the summary(lm()) function and square-root the r^2, but the issue is that I'm doing this inside a for loop and am just appending the correlation to a separate list that I have. I can't really append part of the summary of the regression easily to a list.
Here is what I am trying to do:
for (i in x) {list[i] = cor(data$column_a, data$column_b, subset = data$column_c > i)}
Obviously, though, I can't do that because the cor() function doesn't work with subsets.
(Note: x = seq(1,100) and list = NULL)

You can do this without a loop using lapply. Here's some code that will output a data frame with the month-range in one column and the correlation in another column. The do.call(rbind... business is just to take the list output from lapply and turn it into a data frame.
corrs = do.call(rbind, lapply(min(airquality$Month):max(airquality$Month),
function(x) {
data.frame(month_range=paste0(x," - ", max(airquality$Month)),
correlation = cor(airquality$Temp[airquality$Month >= x & airquality$Temp < 80],
airquality$Wind[airquality$Month >= x & airquality$Temp < 80]))
}))
corrs
month_range correlation
1 5 - 9 -0.3519351
2 6 - 9 -0.2778532
3 7 - 9 -0.3291274
4 8 - 9 -0.3395647
5 9 - 9 -0.3823090

You can subset the data first, and then find the correlation.
a <- subset(airquality, Temp < 80 & Month > 7)
cor(a$Temp, a$Wind)
Edit: I don't really know what your list variable is, but here is an example of dynamically changing the subset based on i (see how the month requirement changes with each iteration)
list <- seq(1, 5)
for (i in 1:5){
a <- subset(airquality, Temp < 80 & Month > i)
list[i] <- cor(a$Temp, a$Wind)
}

Based on the pseudo-code you provided alone, here's something that should work:
for (i in x) {
df <- subset(data, column_c > i)
list[i] = cor(df$column_a, df$column_b)
}
However, I don't know why you would want your index in list[i] to be the same value that you use to subset column_c. That could be another source of problems.

Related

How do I save a single column of data produced from a while loop in R to a dataframe?

I have written the following very simple while loop in R.
i=1
while (i <= 5) {
print(10*i)
i = i+1
}
I would like to save the results to a dataframe that will be a single column of data. How can this be done?
You may try(if you want while)
df1 <- c()
i=1
while (i <= 5) {
print(10*i)
df1 <- c(df1, 10*i)
i = i+1
}
as.data.frame(df1)
df1
1 10
2 20
3 30
4 40
5 50
Or
df1 <- data.frame()
i=1
while (i <= 5) {
df1[i,1] <- 10 * i
i = i+1
}
df1
If you already have a data frame (let's call it dat), you can create a new, empty column in the data frame, and then assign each value to that column by its row number:
# Make a data frame with column `x`
n <- 5
dat <- data.frame(x = 1:n)
# Fill the column `y` with the "missing value" `NA`
dat$y <- NA
# Run your loop, assigning values back to `y`
i <- 1
while (i <= 5) {
result <- 10*i
print(result)
dat$y[i] <- result
i <- i+1
}
Of course, in R we rarely need to write loops like his. Normally, we use vectorized operations to carry out tasks like this faster and more succinctly:
n <- 5
dat <- data.frame(x = 1:n)
# Same result as your loop
dat$y <- 10 * (1:n)
Also note that, if you really did need a loop instead of a vectorized operation, that particular while loop could also be expressed as a for loop.
I recommend consulting an introductory book or other guide to data manipulation in R. Data frames are very powerful and their use is a necessary and essential part of programming in R.

Summing multiple values from a data frame using a loop

I have a single data frame data and a vector cryptos <- c("btc","eth","bnb","xrp") (where "btc" and etc. are the names of crypto currencies). I need to create a FOR loop that would sum the values of each coin.
So far, I've managed to 'return' every value with a print function:
cryptos <- c("btc","eth","bnb","xrp")
for(i in 1:4) {
print(data[data$crypto_name == cryptos[i], 3]) #where 3 is the number of a column with crypto values
}
So it prints the given currencies' values:
[1] 45065
[1] 2190.07
[1] 459.61
[1] 1.12
Yet, I do not want to print these values, just sum them with the use of a loop. Please tell me, how could I possibly do this.
Is this what you need?
sum( data[data$crypto_name %in% cryptos, 3] )
A basic sum loop is trival:
s = 0
for(i in 1:4) {
s = s + data[data$crypto_name == cryptos[i], 3]
}
s

speeding up boolean logic loop in R

I am very new to R but I am interested in learning more and improving.
I have a dataset with around 40,000+ rows containing the length of neuron segments. I want to compare the length trends of neurons of different groups. The first step in this analysis involves sorting the measurements into 1 of 6 different categories such as '<10' '10-15', '15-20', '20-25', '25-30', and '>30'.
I created these categories as appended columns using 'mutate' from the 'dplyr' package and now I am trying to write a boolean function to determine where the measurement fits by applying a value of '1' to the corresponding column if it fits, and a '0' if it doesn't.
Here is what I wrote:
for (i in 1:40019) {
{if (FinalData$Length[i] <=10)
{FinalData$`<10`[i]<-1
} else {FinalData$`<10`[i]<-0}} #Fills '<10'
if (FinalData$Length[i] >=10 & FinalData$Length[i]<15){
FinalData$`10-15`[i]<-1
} else{FinalData$`10-15`[i]<-0} #Fills'10-15'
if (FinalData$Length[i] >=15 & FinalData$Length[i]<20){
FinalData$`15-20`[i]<-1
} else{FinalData$`15-20`[i]<-0} #Fills '15-20'
if (FinalData$Length[i] >=20 & FinalData$Length[i]<25) {
FinalData$`20-25`[i]<-1
} else{FinalData$`20-25`[i]<-0} #Fills '20-25'
if(FinalData$Length[i] >=25 & FinalData$Length[i]<30){
FinalData$`25-30`[i]<-1
} else{FinalData$`25-30`[i]<-0} #Fills '25-30'
if(FinalData$Length[i] >=30){
FinalData$`>30`[i]<-1
} else{FinalData$`>30`[i]<-0} #Fills '>30'
}
This seems to work, but it takes a long time:
system.time(source('~/Desktop/Home/Programming/R/Boolean Loop R.R'))
user system elapsed
94.408 19.147 118.203
The way I coded this seems very clunky and inefficient. Is there a faster and more efficient way to code something like this or am I doing this appropriately for what I am asking for?
Here is an example of some of the values I am testing:
'Length': 14.362, 12.482337, 8.236, 16.752, 12.045
If I am not being clear about how the dataframe is structured, here is a screenshot:
How my data frame is organized
You can use the cut function in R. It is used to convert numeric values to factors:
x<-c(1,2,4,2,3,5,6,5,6,5,8,0,5,5,4,4,3,3,3,5,7,9,0,5,6,7,4,4)
cut(x = x,breaks = c(0,3,6,9,12),labels = c("grp1","grp2","grp3","grp4"),right=F)
set right = "T" or "F" as per your need.
You can vectorise that as follows (I made a sample of some data called DF)
DF <- data.frame(1:40000,sample(letters,1:40000,replace=T),"Length"=sample(1:40,40000,replace=T))
MyFunc <- function(x) {
x[x >= 10 & x < 15] <- "10-15"
x[x >= 15 & x < 20] <- "15-20"
x[x >= 20 & x < 25] <- "20-25"
x[x >= 25 & x < 30] <- "25-30"
x[x > 30] <- ">30"
x[x < 10] <- "<10"
return(x)
}
DF$Group <- MyFunc(DF[,3])
If it has to be 6 columns like that, you can modify the above to return a one or zero for the appropriate size and everything else, respectively, for each of the 6 columns.
Edit: I guess a series of ifelse might be best if it really has to be 6 columns like that.
e.g.
DF$'<10' <- sapply(DF$Length, function(x) ifelse(x < 10,1,0))

Loop several data frames using R

I want to filter a group of data frames, first by a p.value and then divide it in two by the value of the t.
This is how I do it for just one data frame.
p.value.cut <- which(top_Na1$P.Value < 0.05)
top_Na1 <- top_Na1[p.value.cut,]
up <- which(top_Na1$t > 0)
down <- which(top_Na1$t < 0)
up.p.value <- top_Na1[up,]
down.p.value <- top_Na1[down,]
Whenever I try to replicate this using a loop for or apply, sapply, lapply I end applying changes to all columns or not being able to work with a proper column (it looks like the for loop doesn't uses the whole data frame but it goes column by column) or I just lose the row names of the data frame (like with some apply).
This is how one of the data frames looks like (all are the same).
logFC t P.Value adj.P.Val B
YMR290C -0.1952028 -4.593506 0.003484478 0.03596870 -1.602151
YBR090C -0.3406244 -4.373073 0.004429437 0.03930581 -1.857238
YPL037C -0.8737048 -4.088782 0.006100105 0.04526780 -2.197584
YGL035C -0.3058778 -3.839335 0.008159371 0.05142077 -2.506721
Thanks. It's solved. The problem was on the way I was creating the list.
This way worked.
l <- list(top_Na1, top_Na2)
function_filtering <- function(x){
p.value.cut <- which(x$P.Value < 0.05)
x <- x[p.value.cut,]
up <- which(x$t > 0)
down <- which(x$t < 0)
up.p.value <- x[up,]
down.p.value <- x[down,]
}
lapply(l, function_filtering)
Assume that "/tmp/dat.txt" contains your sample data.frame. The code below should do what you want.
d <- read.table("/tmp/dat.txt")
d[1,"P.Value"] <- 0.05 # to make the example more general
d[2,"t"] <- -d[2,"t"] # to make the example more general
input <- list(d,d) # generating a list of data frames for testing purposes
cutIt <- function(x) {
y <- x[x$P.Value < 0.05,]
list(up=y[y$t > 0,],down=y[y$t <= 0,])
}
lapply(input,cutIt)

improve a for loop with apply inside

I have a data.frame, ordered by mean column that looks like this:
10SE191_2 10SE207 10SE208 mean
7995783 12.64874 13.06391 12.69378 12.73937
8115327 12.69979 12.52285 12.41582 12.50363
8108370 12.58685 12.87818 12.66021 12.45720
7945680 12.46392 12.26087 11.77040 12.36518
7923547 11.98463 11.96649 12.50666 12.33138
8016718 12.81610 12.71548 12.48164 12.32703
I would like to apply a t.test to each row, using as input the intensity values: df[i,1:3] and the mean values from the rows with lower intensities. For example, for the first row I want to compute a t.test for df[1,1:3] vs _mean values_ from row 2 to row 6. My code uses a for loop but my current data.frame has more than 20,000 rows and 24 columns and it takes a long time. Any ideas for improving the code?
Thanks
Code:
temp <- matrix(-9, nrow=dim(matrix.order)[1], ncol=2) #create a result matrix
l <- dim(matrix.order)[1]
for (i in 1:l){
j <- 1+i
if (i < l | j +2 == l) { #avoid not enough y observations
mean.val <- matrix.order[j:l,4]
p <- t.test(matrix.order[i, 1:3], mean.val)
temp[i,1] <- p$p.value
}
else {temp[i,1] <- 1}
}
dput for my df
structure(list(`10SE191_2` = c(12.6487418898415, 12.6997932097351,12.5868508174491, 12.4639169398277, 11.9846348627906, 12.8160978540904), `10SE207` = c(13.0639063105224, 12.522848114011, 12.8781769160682, 12.260865493177, 11.9664905651469, 12.7154788700468), `10SE208` = c(12.6937808736673, 12.4158248856386, 12.6602128982717, 11.7704045448312, 12.5066604109231, 12.4816357798965), mean = c(12.7393707471856, 12.5036313008127, 12.4572035036992, 12.3651842840775, 12.3313821056582, 12.3270331271091)), .Names = c("10SE191_2", "10SE207", "10SE208", "mean"), row.names = c("7995783", "8115327", "8108370", "7945680", "7923547", "8016718"), class = "data.frame")
You can obtain all p-values (if possible) with this command:
apply(df, 1, function(x) {
y <- df$mean[df$mean < x[4]]
if(length(y) > 1)
t.test(x[1:3], y)$p.value
else NA
})
The function will return NA if there are not enough values for y.
7995783 8115327 8108370 7945680 7923547 8016718
0.08199794 0.15627947 0.04993244 0.50885253 NA NA
Running 2E4 t.tests probably takes a lot of time no matter what. Try using Rprof to find the hot spots. You might also want to use mcapply or similar parallel processing tools, since your analysis of each row is independent of all other data (which means this is a task well-suited to multicore parallel processing).

Resources