calculate percentage in R

I am a beginner in R; for me, R is really only a means to analyse my statistical data, so I am far from being a programmer. I need some help with building percentages of my variables from an Excel sheet. I need R.total expressed with R.max as the 100% base. This is what I did:
library(readxl)  # read_excel() comes from the readxl package
DB <- read_excel("WechslerData.xlsx", sheet = 1, col_names = TRUE,
                 col_types = NULL, na = "", skip = 0)
I wanted to use prop.table,
but this does not work for me. Then I tried to make a data frame:
R.total <- DB$R.total
R.max <- DB$R.max
DB.rus <- data.frame(R.total, R.max)
but prop.table still does not work. Can somebody give me a hint?

Not really sure what you want, but take this mock data:
r.total <- runif(100,min=0, max=.6) # generate random variable
r.max <- runif(100,min=0.7, max=1) # generate random variable
df <- data.frame(r.total, r.max) # create mock data frame
You could try
# create a new column with r.total as a share of r.max
df$percentage <- df$r.total / df$r.max
Hope it helps.
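A side note on the original attempt: prop.table() rescales a whole table or matrix so its entries sum to 1, which is a different operation from dividing one column by another, and that mismatch is most likely why it did not work here. A minimal sketch of both points (the 0-100 scaling is an assumption about what "percentage" should mean):
# for a 0-100 percentage rather than a 0-1 proportion, multiply by 100
df$percentage <- 100 * df$r.total / df$r.max
# prop.table() normalizes a table so its entries sum to 1, a different task
prop.table(table(c("a", "a", "b")))  # a: 0.667, b: 0.333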

Related

Creating Drug-target interaction network in R

I am looking to achieve something along the lines of what is done here, with the intention of creating a drug-target interaction network.
I have downloaded data from here and I would like to reproduce that network.
My data has the below form:
#Drug Gene
DB00357 P05108
DB02721 P00325
DB00773 P23219
DB07138 Q16539
DB08136 P24941
DB01242 P23975
DB01238 P08173
DB00186 P48169
DB00338 P10635
DB01151 P08913
DB01244 P05023
DB01745 P07477
DB01996 P08254
I consulted this previous post as a first step in order to create the similarity matrix. The resulting matrix on the entire data set is large, so I tried recreating the procedure on a smaller data frame, as below.
# packages used
library("qgraph")
library("dplyr")
drugs <- c("DB00357", "DB02721", "DB00773",
           "DB07138", "DB08136",
           "DB01242", "DB01238",
           "DB00186", "DB00338",
           "DB01151", "DB01244",
           "DB01745", "DB01996")
genes <- c("P05108", "P00325", "P23219",
           "Q16539", "P24941",
           "P23975", "P08173",
           "P48169", "P10635",
           "P08913", "P05023",
           "P07477", "P08254")
# Dataframe with a small subset of observations
df <- data.frame(drugs, genes)
# Consulting the other post
b <- df %>% full_join(df, by = "genes")
tb <- table(b$drugs.x, b$genes)
My next step, I believe, is to create the correlation matrix and the network as per the guide I'm trying to replicate. Here I face issues; my attempts are documented below:
# Follow guide trying to replicate correlation matrix
cormatrix <- cor_auto(tb)
### Error ###
"Removing factor variables: Var1; Var2
Error in data[, sapply(data, function(x) mean(is.na(x))) != 1] :
incorrect number of dimensions"
So I instead tried using cor(), and this works. However, when I try to apply it to the entire data frame, it just keeps running and never produces output.
# Second way, using cor() instead to replicate the correlation matrix
cormatrix <- cor(tb)
graph1 <- qgraph(cormatrix, verbose = FALSE)
I therefore wonder if anyone has any ideas on how to get this to run properly and produce the network as intended?
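One possible direction, not from the original post: drug-target data is naturally a bipartite network, so it can be drawn directly from the edge list with igraph, skipping the correlation matrix entirely. A minimal sketch, assuming the igraph package and the small df built above:
library(igraph)
g <- graph_from_data_frame(df, directed = FALSE)  # one edge per drug-gene pair
V(g)$type <- V(g)$name %in% df$genes              # flag gene nodes for a bipartite layout
plot(g, layout = layout_as_bipartite(g))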

bucket data in R Data frame

I have code in Python that creates a bucketed dataframe from a simple dataframe, and I want to replicate it in R. So far I understand that I can use the transform function, but I am unable to make it work. Can anyone help me with this?
The example dataframe and the Python bucketing code were shown as images in the original post.
I achieved this with the lines of code below:
bins <- seq(0, max(df_s$wordCount) + input$bins, by = 5)
df_s <- transform(df_s, group = cut(df_s$wordCount, bins))
df <- aggregate(df_s$Freq, by = list(Category = df_s$group), FUN = sum)
@Ronak, thanks for your advice.
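For reference, a self-contained sketch of the same idea on mock data; wordCount and Freq mirror the columns above, and a bin width of 5 stands in for input$bins, which comes from the asker's Shiny app:
df_s <- data.frame(wordCount = sample(1:40, 50, replace = TRUE), Freq = 1)  # mock data
bins <- seq(0, max(df_s$wordCount) + 5, by = 5)             # fixed-width buckets of 5
df_s <- transform(df_s, group = cut(df_s$wordCount, bins))  # assign each row to a bucket
aggregate(df_s$Freq, by = list(Category = df_s$group), FUN = sum)  # total Freq per bucket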

Including additional data at each stage of a loop

I am trying to create minimum convex polygons (MCPs) for a set of GPS coordinates. Each day has 32 coordinates, and I want to create an MCP with 1 day, 2 days, 3 days, and so on worth of data. For instance, in the first step I want to include rows 1-32, which I have managed:
mydata <- read.csv("file.csv", stringsAsFactors = FALSE)
mydata <- mydata[1:32, ]
Currently, to select two days' worth of data, I have written:
mydata <- read.csv("file.csv", stringsAsFactors = FALSE)
mydata <- mydata[1:64, ]
Is there a way to automate adding 32 rows at each step (in a loop), rather than running the code manually and changing the amount of data used each time?
I am very new to R, so I do not know whether it is possible to do this. The way I thought it would work was:
n <- 32
for (i in 1:100) {
  mydata <- mydata[1:n, ]
  ## CREATE MCP AND STORE HOME RANGE OUTPUT
  n <- n + 32
}
However, it does not seem possible to have n represent a row number this way, but is there a way to do this?
Apologies if this is unclear, but as I said, I am quite new to using R and would really appreciate any help that can be given.
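A minimal sketch of one way to do this, offered as a suggestion rather than taken from an answer: the loop above overwrites mydata with its own subset on every pass, so later iterations have nothing left to grow into. Keeping the full dataset intact and slicing a growing window avoids that; the mcp() line is a hypothetical placeholder for the home-range step:
mydata <- read.csv("file.csv", stringsAsFactors = FALSE)
n_days <- nrow(mydata) / 32            # 32 GPS fixes per day
results <- vector("list", n_days)      # one home-range output per window
for (i in seq_len(n_days)) {
  day_slice <- mydata[1:(i * 32), ]    # days 1..i, leaving mydata untouched
  ## CREATE MCP AND STORE HOME RANGE OUTPUT, e.g. (hypothetical):
  ## results[[i]] <- adehabitatHR::mcp(day_slice_as_spatial_points)
}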

Is there a way to parallelize summary functions running over loop?

For an input data frame
input <- data.frame(col1 = seq(1, 10000), col2 = seq(1, 10000),
                    col3 = seq(1, 10000), col4 = seq(1, 10000))
I have to run the following summaries, stored in another data frame:
summary <- data.frame(Summary_name = c("Col1_col2", "Col3_Col4", "Col2_Col3"),
                      ColIndex = c("1,2", "3,4", "2,3"))
#summary
Summary_name ColIndex
Col1_col2 1,2
Col3_Col4 3,4
Col2_Col3 2,3
I have the following function to run the aggregates
library(stringr)  # str_split() comes from stringr

loopSum <- function(input, summary) {
  for (i in seq_len(nrow(summary))) {
    cols <- as.numeric(unlist(str_split(summary$ColIndex[i], ",")))  # e.g. "1,2" -> c(1, 2)
    summary$aggregate[i] <- sum(input[, cols])
  }
  return(summary)
}
My requirement is to run the sums used in loopSum in parallel, i.e. I would like to run all the summaries in one shot and thus reduce the total time taken for the function to create them. Is there a way to do this?
My actual scenario requires me to create summary statistics over hundreds of columns for each Summary_name in the summary data frame, so I am looking for the most optimized way to do this. Any help is much appreciated.
Does it improve the running time?
library(tidyr)
input1 <- colSums(input)  # compute each column sum once
summary1 <- separate(summary, "ColIndex", into = c("X1", "X2"), sep = ",", convert = TRUE)
summary$aggregate <- input1[summary1$X1] + input1[summary1$X2]  # look up the precomputed sums
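If parallel execution is genuinely needed, for example with expensive summary functions rather than plain sums, a sketch with the base parallel package follows; it assumes a Unix-alike where mclapply() can fork (on Windows, parLapply() with a cluster is the usual substitute):
library(parallel)
library(stringr)
# one worker task per summary row; results come back in input order
summary$aggregate <- unlist(mclapply(seq_len(nrow(summary)), function(i) {
  cols <- as.numeric(unlist(str_split(summary$ColIndex[i], ",")))
  sum(input[, cols])
}, mc.cores = 2))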

R for loop to summarize matrix of data

New user to R (like 2 days of use new) coming from MATLAB, so syntax nuances are driving me a little crazy. If anyone can point me in a direction on this topic I would really appreciate it. I have a dataset (fl1.back) that has 32 variables (columns) and 513 measurements (rows), and I want to create a table with basic stats for 9 of the 32 columns of data. There is a separate dataset (fl2.back) from which I would also like to pull 1 column of data for the final table.
Here's the code I used to do the above tasks for 1 of the columns of data (sodium measurements) from fl1.back and fl2.back:
fl1.back <- read.delim("web.flat",comment.char="#",colClasses="character")
fl1.back <- fl1.back[-1,]
fl2.back <- read.delim("web.flat2",comment.char="#",colClasses="character")
fl2.back <- fl2.back[-1,]
head(fl1.back)
head(fl2.back)
#for rep criteria for sodium
back.sod.rep <- fl2.back[fl2.back$P00930!="",]
back.sod.rep$P00930 <- as.numeric(back.sod.rep$P00930)
back.sod.rep$P00930
#for samples...sodium
back.sod <- fl1.back[fl1.back$P00930!="",]
back.sod$P00930 <- as.numeric(back.sod$P00930)
back.sod$P00930
head(back.sod)
back.sod.summ <- data.frame("Sodium")
back.sod.summ
colnames(back.sod.summ) <- "Compound"
back.sod.summ$WQ_crit <- "20 mg/L"
back.sod.summ$n <- nrow(back.sod)
back.sod.summ$n_det <- nrow(back.sod[back.sod$R00930!="<",])
back.sod.summ$min <- min(back.sod[back.sod$R00930!="<","P00930"])
back.sod.summ$max <- max(back.sod[back.sod$R00930!="<","P00930"])
back.sod.summ$mean <- mean(back.sod[back.sod$R00930!="<","P00930"])
back.sod.summ$median <- median(back.sod[back.sod$R00930!="<","P00930"])
back.sod.summ$percent_samp_det <- 100*(back.sod.summ$n_det/back.sod.summ$n)
back.sod.summ$percent_samp_above_crit <- 100*(length(back.sod[back.sod$P00930>20,"P00930"])/back.sod.summ$n)
back.sod.summ$percent_rep_above_crit <- (sum(back.sod.rep$P00930>=20)/(nrow(back.sod.rep)))
back.sod$P00930
length(back.sod[back.sod$P00930>back.sod.summ$WQ_crit,"P00930"])
back.sod.summ
final <- data.frame(back.sod.summ)
Instead of rewriting or copying and pasting the above code to create the data frame final, I would like to loop over the two datasets, since I'm looking to repeat the same task on different columns of data. I really don't know where to start, and there doesn't seem to be much literature on for loops in R.
Any insight is appreciated!
Here is an example of what I think you want with the iris dataset:
library(plyr)
dlply(iris, .(Species), summary)
This can be extended if you need additional stats. Anyway, you probably should use (as I show above) the "split-apply-combine" approach as implemented in various functions and packages.
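To address the literal question of looping over columns, here is a minimal base-R sketch; it assumes, as in the sodium example, that each measurement column Pxxxxx has a matching remark column Rxxxxx, and the params vector is illustrative rather than from the original post:
params <- c(Sodium = "P00930")  # extend with the other 8 compound codes
final <- do.call(rbind, lapply(names(params), function(nm) {
  p <- params[[nm]]
  r <- sub("^P", "R", p)                   # matching remark column, e.g. R00930
  d <- fl1.back[fl1.back[[p]] != "", ]
  vals <- as.numeric(d[d[[r]] != "<", p])  # detected values only
  data.frame(Compound = nm, n = nrow(d), n_det = length(vals),
             min = min(vals), max = max(vals),
             mean = mean(vals), median = median(vals))
}))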
