range problems needed to solve - r

I want to combine range, mean and sd using cbind (each should have 10 numbers), and I used the range function to calculate the range of each variable in my dataset. However, range returns an atomic vector of two values, and the output is like this:
my output
This is my data
BAA  BAF  Data Science
1    1    1
0    0    0
1    0    0
In my output, R has split each range into two values, giving 20 numbers. Lines 1 and 2 of the output should together be the range for my first variable. Does anyone know how to solve this? This is the code that produced this output:
range_com <- c()
mean_com <- c()
sd_com <- c()
dyad.realized_subset <- subset(dyad.realized, select=c(BAA,BAF,`Data Science`,
`Life Science`,Engineer,`Previous Raised`,`Max Raise`,Age,
Patent,`Committed Amount ($K)`))
range1 <- range(dyad.realized_subset$BAA,na.rm=T)
range2 <- range(dyad.realized_subset$BAF,na.rm=T)
range3 <- range(dyad.realized_subset$`Data Science`,na.rm=T)
range4 <- range(dyad.realized_subset$`Life Science`,na.rm=T)
range5 <- range(dyad.realized_subset$Engineer,na.rm=T)
range6 <- range(dyad.realized_subset$`Previous Raised`,na.rm=T)
range7 <- range(dyad.realized_subset$`Max Raise`,na.rm=T)
range8 <- range(dyad.realized_subset$Age,na.rm=T)
range9 <- range(dyad.realized_subset$Patent,na.rm=T)
range10 <- range(dyad.realized_subset$`Committed Amount ($K)`,na.rm=T)
range_com <-c(range1,range2,range3,range4,range5,range6,range7,range8,range9,range10)
for(i in seq(dyad.realized_subset)){
mean_com[i] <- mean(dyad.realized_subset[[i]], na.rm=T)
sd_com[i] <- sd(dyad.realized_subset[[i]],na.rm=T) }
# Bind and output table ####
desc_dyads <- cbind(range_com,mean_com,sd_com)

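You can try using sapply to calculate mean, sd and range for each column in dyad.realized_subset together. Because toString() returns text, the result is a character matrix.
t(sapply(dyad.realized_subset, function(x) {
# na.rm = TRUE matches the calls in the question
setNames(c(mean(x, na.rm = TRUE), sd(x, na.rm = TRUE), toString(range(x, na.rm = TRUE))), c('mean', 'sd', 'range'))
})) -> desc_dyads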
desc_dyads
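If the summary needs to stay numeric, one option is to store the two ends of the range as separate min and max columns instead of pasting them into a string. The sketch below is only an illustration: `dat` is a small made-up stand-in for dyad.realized_subset, and the column layout (min, max, mean, sd) is an assumption about what the final table should look like.

```r
# Made-up stand-in for dyad.realized_subset (hypothetical values)
dat <- data.frame(BAA = c(1, 0, 1, NA), BAF = c(1, 0, 0, 1))
dat[["Data Science"]] <- c(1, 0, 0, 0)

# One row per variable, one numeric column per statistic
desc <- t(sapply(dat, function(x) {
  c(min  = min(x, na.rm = TRUE),
    max  = max(x, na.rm = TRUE),
    mean = mean(x, na.rm = TRUE),
    sd   = sd(x, na.rm = TRUE))
}))
desc
```

Each variable then occupies exactly one row, so there is no mismatch between 20 range values and 10 means.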

Related

How to calculate correlation against 2 data frames and output by condition using R?

I want to calculate correlations between two gene expression data frames, one for protein and one for RNA. The data are here: https://drive.google.com/file/d/1S5bYm8baqLf40KqmEbOO59YrDLp4sWnL/view
https://drive.google.com/file/d/1wmZF4v8Ehq2giKldFWHv6UUBzu8Ncsy2/view
Row names are genes and column names are samples. Let's say gene1 is a gene in the 1st data frame and gene2 is a gene in the 2nd.
The output I want is a table that contains only the correlations that meet the condition, not all correlation values.
Here is what I need to do:
A. Calculate the correlation of gene1 and gene2.
B. Filter for |correlation(gene1, gene1)| < 0.3 & |correlation(gene1, gene2)| > 0.6, that is, the correlation of the same gene across the 2 data frames is smaller than 0.3 in absolute value, and the correlation of different genes across the 2 data frames is higher than 0.6.
C. Return a table with the columns 'gene, gene, correlation'.
This is my code, which does not do what I need. Each data frame has more than 10,000 rows and 77 columns, and the job was killed on the GPU after running for more than 2 hours. Please keep the code simple and use as little memory as possible.
library(stringr) # for str_c()
func.cor3 <- function(x,y){#func.cor
  # drop positions where either vector is NA
  nas <- union(which(is.na(x)), which(is.na(y)))
  if(length(nas)!=0){
    x <- x[-nas]
    y <- y[-nas]
  }
  nn <- round(cor(x,y),5)
  # compare the numeric value; format() returns a string used only for the label
  lab <- format(nn, nsmall = 5)
  if(is.na(nn)){
    return(NA)
  }else if(nn>0.6 | nn< -0.6){
    return(str_c(lab,"_r1p2>0.6"))
  }else if(nn<0.3 & nn> -0.3){
    return(str_c(lab,"_r1p1<0.3"))
  }else{
    return(NA)
  }
}
P2 <- P[,-1]
rownames(P2) <- P[,1]
nn <- setdiff(colnames(P2),colnames(R2))
ns <- vector();n=1
for (i in 1:length(nn)) {
nn.i <- nn[i]
w.i <- which(colnames(P2)==nn.i)
ns[n] <- w.i
n=n+1
}
P2 <- P2[,-ns]
nk <- colnames(P2)
ns <- order(nk)
P2 <- P2[,ns]
P2_1000 <- P2[1:1000,1:77]# try first 1000 rows of data
R <- read.csv(file = "RNA_Breast_2.csv",header = TRUE)
R2 <- R[,-1]
rownames(R2) <- R[,1]
nk <- colnames(R2)
ns <- order(nk)
R2 <- R2[,ns]
R2_1000 <- R2[1:1000,]
D_M <- matrix(rep(NA,3*nrow(R2_1000)*nrow(P2_1000)),ncol =3 )
colnames(D_M) <- c("gene1_RNA","gene_Protein","correlation")
n=1
for (i in 1:nrow(R2_1000)) {
for (j in 1:nrow(P2_1000)) {
D_M[n,1] <- rownames(R2_1000)[i]
D_M[n,2] <- rownames(P2_1000)[j]
D_M[n,3] <- func.cor3(as.numeric(R2_1000[i,]),as.numeric(P2_1000[j,]))
n=n+1
}
}
D_M
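The double loop calls cor() once per gene pair, which is slow for 10,000 x 10,000 pairs. A possible alternative, sketched here on simulated data, is to let cor() compute the whole cross-correlation matrix in one call and then filter it. The sketch assumes both matrices hold the same genes in the same row order and the same samples in the same column order; the objects R2 and P2 below are made-up stand-ins, not the real files. For 10,000 genes on each side the full matrix holds 10^8 doubles, roughly 0.8 GB, so it may need to be computed in row blocks if memory is tight.

```r
set.seed(1)
n_genes <- 50
n_samples <- 77
genes <- paste0("g", seq_len(n_genes))

# Simulated stand-ins for the RNA and protein matrices (genes in rows)
R2 <- matrix(rnorm(n_genes * n_samples), n_genes, dimnames = list(genes, NULL))
P2 <- matrix(rnorm(n_genes * n_samples), n_genes, dimnames = list(genes, NULL))
# Plant one strong cross-gene correlation so the filter is likely to return a row
P2["g2", ] <- R2["g1", ]

# cor() correlates columns, so transpose to put genes in columns
cm <- cor(t(R2), t(P2), use = "pairwise.complete.obs")

# Condition B: weak same-gene correlation, strong different-gene correlation
same_weak <- abs(diag(cm)) < 0.3
keep <- abs(cm) > 0.6 & same_weak[row(cm)] & row(cm) != col(cm)
hits <- which(keep, arr.ind = TRUE)

res <- data.frame(gene_RNA     = rownames(cm)[hits[, "row"]],
                  gene_Protein = colnames(cm)[hits[, "col"]],
                  correlation  = cm[hits])
res
```

Only the pairs that pass the filter are stored in the result table, so the output stays small even when the correlation matrix is large.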

Writing a for loop with the output as a data frame in R

I am currently working my way through the book 'R for Data Science'.
I am trying to solve this exercise question (21.2.1 Q1.4) but have not been able to determine the correct output before starting the for loop.
Write a for loop to:
Generate 10 random normals for each of μ= −10, 0, 10 and 100.
Like the previous questions in the book I have been trying to insert into a vector output but for this example, it appears I need the output to be a data frame?
This is my code so far:
values <- c(-10,0,10,100)
output <- vector("double", 10)
for (i in seq_along(values)) {
output[[i]] <- rnorm(10, mean = values[[i]])
}
I know the output is wrong but am unsure how to create the format I need here. Any help much appreciated. Thanks!
There are many ways of doing this. Here is one. See inline comments.
set.seed(357) # to make things reproducible, set random seed
N <- 10 # number of loops
xy <- vector("list", N) # create an empty list into which values are to be filled
# run the loop N times and on each loop...
for (i in 1:N) {
# generate a data.frame with 4 columns, and add a random number into each one
# random number depends on the mean specified
xy[[i]] <- data.frame(um10 = rnorm(1, mean = -10),
u0 = rnorm(1, mean = 0),
u10 = rnorm(1, mean = 10),
u100 = rnorm(1, mean = 100))
}
# result is a list of data.frames with 1 row and 4 columns
# you can bind them together into one data.frame using do.call
# rbind means they will be merged row-wise
xy <- do.call(rbind, xy)
um10 u0 u10 u100
1 -11.241117 -0.5832050 10.394747 101.50421
2 -9.233200 0.3174604 9.900024 100.22703
3 -10.469015 0.4765213 9.088352 99.65822
4 -9.453259 -0.3272080 10.041090 99.72397
5 -10.593497 0.1764618 10.505760 101.00852
6 -10.935463 0.3845648 9.981747 100.05564
7 -11.447720 0.8477938 9.726617 99.12918
8 -11.373889 -0.3550321 9.806823 99.52711
9 -7.950092 0.5711058 10.162878 101.38218
10 -9.408727 0.5885065 9.471274 100.69328
Another way would be to pre-allocate a matrix, add in values and coerce it to a data.frame.
xy <- matrix(NA, nrow = N, ncol = 4)
for (i in 1:N) {
xy[i, ] <- rnorm(4, mean = c(-10, 0, 10, 100))
}
# notice that i name the column names post festum
colnames(xy) <- c("um10", "u0", "u10", "u100")
xy <- as.data.frame(xy)
As this is a learning question I will not provide the solution directly.
> values <- c(-10,0,10,100)
> for (i in seq_along(values)) {print(i)} # Checking we iterate by position
[1] 1
[1] 2
[1] 3
[1] 4
> output <- vector("double", 10)
> output # Checking the place where the output will be
[1] 0 0 0 0 0 0 0 0 0 0
> for (i in seq_along(values)) { # Testing the full code
+ output[[i]] <- rnorm(10, mean = values[[i]])
+ }
Error in output[[i]] <- rnorm(10, mean = values[[i]]) :
more elements supplied than there are to replace
As you can see, the error says there are more elements to store than there is space for: each iteration generates 10 random numbers (40 in total), but output[[i]] is a single slot in a vector of 10 doubles. Consider using a data structure that can store several values for each iteration.
So that:
> output <- ??
> for (i in seq_along(values)) { # Testing the full code
+ output[[i]] <- rnorm(10, mean = values[[i]])
+ }
> output # Should have length 4 and each element all the 10 values you created in the loop
# set the number of rows
rows <- 10
# vector with the values
means <- c(-10,0,10,100)
# generating output matrix
output <- matrix(nrow = rows,
ncol = 4)
# setting seed and looping through the number of rows
set.seed(222)
for (i in 1:rows){
output[i,] <- rnorm(length(means),
mean=means)
}
#printing the output
output
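For comparison with the matrix versions above, here is a sketch that keeps the asker's loop unchanged and only swaps the output container for a list, with one element per mean. The column names reuse the um10/u0/u10/u100 naming from the first answer; the seed is arbitrary.

```r
values <- c(-10, 0, 10, 100)

# A list can hold a whole vector of 10 draws in each slot
output <- vector("list", length(values))

set.seed(42)  # arbitrary seed, only for reproducibility
for (i in seq_along(values)) {
  output[[i]] <- rnorm(10, mean = values[[i]])
}

# Name the elements, then convert the list to a data frame (one column per mean)
names(output) <- c("um10", "u0", "u10", "u100")
df <- as.data.frame(output)
dim(df)
```

Here the loop runs once per mean (4 iterations of 10 draws) rather than once per row.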

creating a function for processing my dataframe calculations

I am doing systematic calculations on a dataframe I created. I have the code for the calculations, but I would like to:
1) Write it as a function and call it for the dataframe I created.
2) Reset the calculations for the next ID in the dataframe.
I would appreciate your help and advice on this.
The dataframe is created in R using the following code:
#Create a dataframe
dosetimes <- c(0,6,12,18)
df <- data.frame("ID"=1,"TIME"=sort(unique(c(seq(0,30,1),dosetimes))),"AMT"=0,"A1"=NA,"WT"=NA)
doserows <- subset(df, TIME%in%dosetimes)
doserows$AMT[doserows$TIME==dosetimes[1]] <- 100
doserows$AMT[doserows$TIME==dosetimes[2]] <- 100
doserows$AMT[doserows$TIME==dosetimes[3]] <- 100
doserows$AMT[doserows$TIME==dosetimes[4]] <- 100
#Add back dose information
df <- rbind(df,doserows)
df <- df[order(df$TIME,-df$AMT),]
df <- subset(df, (TIME==0 & AMT==0)==F)
df$A1[(df$TIME==0)] <- df$AMT[(df$TIME ==0)]
#Time-dependent covariate
df$WT <- 70
df$WT[df$TIME >= 12] <- 120
#The calculations are done in a for-loop. Here is the code for it:
#values needed for the calculation
C <- 2
V <- 10
k <- C/V
#I would like this part to be written as a function
for(i in 2:nrow(df))
{
t <- df$TIME[i]-df$TIME[i-1]
A1last <- df$A1[i-1]
df$A1[i] = df$AMT[i]+ A1last*exp(-t*k)
}
head(df)
plot(A1~TIME, data=df, type="b", col="blue", ylim=c(0,150))
The other thing is that the previous code assumes subject ID=1 for all time points. Suppose the subject ID becomes 2 when WT (weight) changes to 120. How can I reset the calculations and automate this for all subject IDs in the dataframe? In this case the original dataframe would look like this:
#code:
rm(list=ls(all=TRUE))
dosetimes <- c(0,6,12,18)
df <- data.frame("ID"=1,"TIME"=sort(unique(c(seq(0,30,1),dosetimes))),"AMT"=0,"A1"=NA,"WT"=NA)
doserows <- subset(df, TIME%in%dosetimes)
doserows$AMT[doserows$TIME==dosetimes[1]] <- 100
doserows$AMT[doserows$TIME==dosetimes[2]] <- 100
doserows$AMT[doserows$TIME==dosetimes[3]] <- 100
doserows$AMT[doserows$TIME==dosetimes[4]] <- 100
df <- rbind(df,doserows)
df <- df[order(df$TIME,-df$AMT),]
df <- subset(df, (TIME==0 & AMT==0)==F)
df$A1[(df$TIME==0)] <- df$AMT[(df$TIME ==0)]
df$WT <- 70
df$WT[df$TIME >= 12] <- 120
df$ID[(df$WT>=120)==T] <- 2
df$TIME[df$ID==2] <- c(seq(0,20,1))
Thank you in advance!
In general, when doing calculations on different subjects' data, I like to split the dataframe by ID, pass the list of per-subject data into a for loop, do all the calculations, collect the newly calculated data in a list, and then collapse that list and return a dataframe with all the numbers you want. This allows for a lot of control over what you do for each subject.
subjects = split(df, df$ID)
forResults = vector("list", length=length(subjects))
# initialize these constants
C <- 2
V <- 10
k <- C/V
myFunc = function(data, resultsArray){
  # the loop index is s, not k, so it does not mask the rate constant k
  for(s in seq_along(data)){
    df = data[[s]]
    df$A1 = 100 # I assume this should be 100 for t=0 for each subject?
    # you could vectorize this nested for loop..
    for(i in 2:nrow(df)) {
      t <- df$TIME[i]-df$TIME[i-1]
      A1last <- df$A1[i-1]
      df$A1[i] = df$AMT[i]+ A1last*exp(-t*k)
    }
    # you can add all sorts of other calculations you want to do on each subject's data
    # when you're done doing calculations, put the resultant into
    # the resultsArray and we'll rebuild the dataframe with all the new variables
    resultsArray[[s]] = df
    # if you're not using RStudio, then you want to use dev.new() to instantiate a new plot canvas
    # dev.new() # dont need this if you're using RStudio (which doesnt allow multiple plots open)
    plot(A1~TIME, data=df, type="b", col="blue", ylim=c(0,150))
  }
  # collapse the results list into a dataframe
  resultsDF = do.call(rbind, resultsArray)
  return(resultsDF)
}
results = myFunc(subjects, forResults)
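The same split-by-ID idea can be written more compactly by putting the per-subject recursion in its own function and passing the rate constant as an argument. This is only a sketch: calc_A1 is a hypothetical helper name, and the small two-subject data frame is made up so the block runs on its own.

```r
# Hypothetical helper: accumulate A1 for one subject's rows (sorted by TIME)
calc_A1 <- function(d, k) {
  d$A1[1] <- d$AMT[1]                    # restart from the first dose
  for (i in seq_len(nrow(d))[-1]) {      # rows 2..n; empty if only one row
    dt <- d$TIME[i] - d$TIME[i - 1]
    d$A1[i] <- d$AMT[i] + d$A1[i - 1] * exp(-dt * k)
  }
  d
}

# Made-up data: two subjects, doses of 100 at TIME 0 and 6
df <- data.frame(ID   = rep(1:2, each = 4),
                 TIME = rep(c(0, 2, 6, 8), 2),
                 AMT  = rep(c(100, 0, 100, 0), 2),
                 A1   = NA_real_)

k <- 2 / 10  # C / V from the question
results <- do.call(rbind, lapply(split(df, df$ID), calc_A1, k = k))
results
```

Because each subject is processed separately, A1 restarts from that subject's first dose instead of carrying over from the previous ID.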
Do you want this:
ddf <- data.frame("ID"=1,"TIME"=sort(unique(c(seq(0,30,1),dosetimes))),"AMT"=0,"A1"=NA,"WT"=NA)
myfn = function(df){
dosetimes <- c(0,6,12,18)
doserows <- subset(df, TIME%in%dosetimes)
doserows$AMT[doserows$TIME==dosetimes[1]] <- 100
doserows$AMT[doserows$TIME==dosetimes[2]] <- 100
doserows$AMT[doserows$TIME==dosetimes[3]] <- 100
doserows$AMT[doserows$TIME==dosetimes[4]] <- 100
#Add back dose information
df <- rbind(df,doserows)
df <- df[order(df$TIME,-df$AMT),]
df <- subset(df, (TIME==0 & AMT==0)==F)
df$A1[(df$TIME==0)] <- df$AMT[(df$TIME ==0)]
#Time-dependent covariate
df$WT <- 70
df$WT[df$TIME >= 12] <- 120
#The calculations are done in a for-loop. Here is the code for it:
#values needed for the calculation
C <- 2
V <- 10
k <- C/V
#I would like this part to be written as a function
for(i in 2:nrow(df))
{
t <- df$TIME[i]-df$TIME[i-1]
A1last <- df$A1[i-1]
df$A1[i] = df$AMT[i]+ A1last*exp(-t*k)
}
head(df)
plot(A1~TIME, data=df, type="b", col="blue", ylim=c(0,150))
}
myfn(ddf)
For multiple calls:
for(i in unique(ddf$ID)) { # loop over the subject IDs present in the data
myfn(ddf[ddf$ID==i,])
readline(prompt="Press <Enter> to continue...")
}

R subscript out of bounds with for loops

I am trying to count entries that fall within windows of 1000. The problem is that I'm using for loops, which makes the number of operations quite large (I'm fairly new to R), and I get an out of bounds error. I know there must be a better way to do this.
File (warning the file is a little over 100mb): bamDF.txt
Use:
dget(file="bamDF.txt")
Script:
attach(bamDF)
out <- matrix(0,1,ceiling((max(pos, na.rm=TRUE)-min(pos, na.rm=TRUE))/interval))
interval <- 1000
for(q in 1:nrow(bamDF)){
for(z in 1:ceiling((max(pos, na.rm=TRUE)-min(pos, na.rm=TRUE))/interval)){
if(min(pos, na.rm=TRUE)+interval*(z-1)<pos[q]&&pos[q]<(min(pos, na.rm=TRUE)+interval*(z))){
out[z,] <- out[z,]+1;
}
}
}
detach(bamDF)
You can use the cut function
# set the seed to get a reproducible example
set.seed(12345)
min.val <- 0
max.val <- 5000
num.val <- 10000
# Generate some random values
values <- sample(min.val:max.val, num.val, replace=T)
interval <- 1000
num.split <- ceiling((max.val - min.val)/interval)+1
# Use cut to split the data.
# You can set labels=FALSE if you want the group number
# rather than the interval
groups <- cut(values, seq(min.val, max.val, length.out=num.split))
# Count the elements in each group
res <- table(groups)
res will contain:
groups
    (0,1e+03] (1e+03,2e+03] (2e+03,3e+03] (3e+03,4e+03] (4e+03,5e+03]
         1987          1974          2054          2000          1984
Similarly, you can just use the hist function:
h <- hist(values, 10) # 10 bins
or
h <- hist(values, seq(min.val, max.val, length.out=num.split))
h$counts contains the counts. Use plot=FALSE if you don't want to plot the results.
grps <- seq(min(pos), max(pos), by= 1000)
counts <- table( findInterval( pos, c(grps, Inf) ) )
names(counts) <- grps
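Another way to get per-window counts without any loop is to turn each position into a window index with integer division and tabulate the indices. The sketch below uses made-up positions in place of bamDF$pos and assumes there are no NA values. Note that the windows are half-open, [start, start + interval), whereas the loop in the question uses strict inequalities on both sides and so skips values that fall exactly on a boundary.

```r
set.seed(1)
pos <- sample(1:10000, 500, replace = TRUE)  # made-up positions
interval <- 1000

# Window index of each position, starting at 1 for the window containing min(pos)
bin <- (pos - min(pos)) %/% interval + 1

# Count positions per window; windows with no positions get a count of 0
counts <- tabulate(bin, nbins = max(bin))
counts
```

Every position lands in exactly one window, so the counts add up to length(pos).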

Adding vector next to data.frame under new column name

I am experimenting with R and would like to implement a loop which runs 1000000 times and creates a vector of length 10 and adds each vector to a data frame under the name cycle and the number it has iterated.
This is my current code:
loser <- 100
winner <- 500
percentageWinner <- 70
runns <- 1000000
numbs <- 10
for(i in runns ) {
randNumb <- runif(numbs, min=0, max=100)
outcome <- ifelse(randNumb < percentageWinner, winner, loser) # true are winners and false are losers
df <- data.frame(outcome)
colnames(df)[which(names(df) == "outcome")] <- paste("cycle",i)
}
df
I am struggling to add the vector next to the other data.frame column.
Any suggestions, how to do that?
I appreciate your replies!
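In your code, any i <- 1 inside the for loop overwrites i at each iteration. Without it, for(i in runns) still runs only once, with i equal to runns, because runns is a single number rather than a sequence. In addition, df is recreated on every iteration, so earlier columns are lost.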
You need to change your code for something like:
loser <- 100
winner <- 500
percentageWinner <- 70
runns <- 1000000
numbs <- 10
outcome <- matrix(NA, numbs, runns)
for(i in seq_len(runns)) {
randNumb <- runif(numbs, min=0, max=100)
outcome[,i] <- ifelse(randNumb < percentageWinner, winner, loser)
}
df <- data.frame(outcome)
colnames(df) <- paste0("cycle",seq_len(runns))
Or you can avoid the loop:
randNumb <- runif(numbs*runns, min=0, max=100)
outcome <- ifelse(randNumb < percentageWinner, winner, loser)
outcome <- matrix(outcome, numbs, runns)
df <- data.frame(outcome)
colnames(df) <- paste0("cycle",seq_len(runns))
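The difference between looping over a single number and looping over a sequence can be checked with a small counter. The variable names below are made up for the illustration, and runns is reduced to 5 so the result is easy to inspect.

```r
runns <- 5

# Looping over the number itself: the body runs once, with i == 5
count_single <- 0
for (i in runns) count_single <- count_single + 1

# Looping over seq_len(runns): the body runs 5 times, with i = 1, 2, ..., 5
count_seq <- 0
for (i in seq_len(runns)) count_seq <- count_seq + 1

c(single = count_single, sequence = count_seq)
```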
