Expectation using R

I just started using R and I have 5 files (each file has only one column) of data with 227 observations in total. I have to find E(X) and E(X^2). I found E(X) by summing up all the values and dividing by 227. I also need to find E(X^2), but I don't know how to loop through the 5 files, get each individual value, and square it.
I have code for loading the files, and this is my code for finding the mean:
library(readr)  # read_csv() comes from the readr package
mydataset1 = read_csv("file1.txt", col_names = FALSE)
mydataset2 = read_csv("file2.txt", col_names = FALSE)
mydataset3 = read_csv("file3.txt", col_names = FALSE)
mydataset4 = read_csv("file4.txt", col_names = FALSE)
mydataset5 = read_csv("file5.txt", col_names = FALSE)
sum1 <- sum(mydataset1)
sum2 <- sum(mydataset2)
sum3 <- sum(mydataset3)
sum4 <- sum(mydataset4)
sum5 <- sum(mydataset5)
sumAll <- sum1 + sum2 + sum3 + sum4 + sum5
mean <- sumAll / 227

We can get all the datasets into a list with mget, based on the pattern of the object names, get the sum of each list element into a vector, and then divide the sum of that vector by 227:
sum(sapply(mget(ls(pattern = '^mydataset\\d+$')), sum))/227
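The same pattern gives E(X^2); a minimal sketch, assuming the mydataset objects are loaded as above:
# E(X^2): square each dataset before summing, then divide by the total count
sum(sapply(mget(ls(pattern = '^mydataset\\d+$')), function(d) sum(d^2))) / 227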

You can simply square the variable
mydataset1 = read_csv("file1.txt", col_names = FALSE)
mydataset2 = read_csv("file2.txt", col_names = FALSE)
mydataset3 = read_csv("file3.txt", col_names = FALSE)
mydataset4 = read_csv("file4.txt", col_names = FALSE)
mydataset5 = read_csv("file5.txt", col_names = FALSE)
sum1 <- sum(mydataset1 ^ 2)
sum2 <- sum(mydataset2 ^ 2)
sum3 <- sum(mydataset3 ^ 2)
sum4 <- sum(mydataset4 ^ 2)
sum5 <- sum(mydataset5 ^ 2)
The rest of your code stays the same.

Maybe you can try base R code like below:
sum(unlist(mget(ls(pattern = "mydataset\\d+"))))/227
Mathematically, sumAll is the sum of all the data from mydataset1 to mydataset5. In this sense, you can gather them via unlist and then sum them up before dividing by 227.
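Squaring the unlisted vector gives E(X^2) in the same way; a small sketch under the same assumptions:
vals <- unlist(mget(ls(pattern = "mydataset\\d+")))
sum(vals^2) / length(vals)  # E(X^2); length(vals) should be 227 here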

@Hugo actually answers the simple question of how to square a variable in R and then do an operation on it. I think we all assume you don't really want to create a new variable that is X1 squared (but you could do that if you wanted).
I'm going to suggest a somewhat more beginner-friendly solution than some of the above, if what you are doing is trying to learn the basics of R.
mydataset1 = read_csv("file1.txt", col_names = FALSE)
mydataset2 = read_csv("file2.txt", col_names = FALSE)
mydataset3 = read_csv("file3.txt", col_names = FALSE)
mydataset4 = read_csv("file4.txt", col_names = FALSE)
mydataset5 = read_csv("file5.txt", col_names = FALSE)
combined <- rbind(mydataset1, mydataset2, mydataset3, mydataset4, mydataset5)
sum(combined$X1)/nrow(combined)
sum(combined$X1^2)/nrow(combined)
In this solution you are still reading the individual files and typing out their names; as shown in other answers there are lots of neat ways to do that automatically, but this will always work.
Here I'm combining the data frames/tibbles using the base rbind() function. It does what it sounds like: it binds the data frames together.
Then I am doing the calculation, but instead of assuming that I know the number of rows, I'm getting it from the data. (If you have missing data that becomes a little trickier, but you will learn that soon enough.)
Also note that I am specifying the actual variable that you want. That is so you have a model for a situation in the future when you have multiple variables in your data frame.
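As a follow-up to the point about reading the files automatically, here is one hedged sketch; it assumes the files really are named file1.txt through file5.txt in the working directory and that readr is loaded:
files <- paste0("file", 1:5, ".txt")                        # assumed file names
combined <- do.call(rbind, lapply(files, read_csv, col_names = FALSE))
sum(combined$X1) / nrow(combined)     # E(X)
sum(combined$X1^2) / nrow(combined)   # E(X^2)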

Related

Gene names from vmatchPattern (Biostrings)

I am trying to get the gene names out of a binding analysis of the 5'UTR. For that I have this little piece of code. Up to the vmatchPattern call everything works fine, at least I hope so.
library(biomaRt)
library(GenomicFeatures)
library(XVector)
library(Biostrings)
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(BSgenome.Mmusculus.UCSC.mm10)
fUTR <- fiveUTRsByTranscript(TxDb.Mmusculus.UCSC.mm10.knownGene)
Mmusculus <- BSgenome.Mmusculus.UCSC.mm10
seqlevelsStyle(Mmusculus) <- 'ensembl'
seqlevelsStyle(fUTR) <- 'ensembl'
Seq <- getSeq(Mmusculus, fUTR)
Pbind <- RNAString('UGUGUGAAHAA')
Match <- vmatchPattern(Pbind, unlist2(Seq), max.mismatch = 0, min.mismatch = 0, with.indels = F, fixed = T, algorithm = 'auto')
Afterwards, however, I want to get the gene names so I can create a list in the end and use it in Python for further analysis of an RNA-seq experiment. Here is where the problem comes in: I think I have found three different ways to potentially do this so far, but none of them are working for me.
##How to get gene names from the match Pattern
#1
matches <- unlist(Match, recursive = T, use.names = T)
m <- as.matrix(matches)
subseq(genes[rownames(m),], start = m[rownames(m),1], width = 20)
#2
transcripts(TxDb.Mmusculus.UCSC.mm10.knownGene, columns = c('tx_id', 'tx_name', 'gene_id'))
#3
count_index <- countIndex(Match)
wh <- which(count_index > 0)
result_list = list()
for(i in 1:length(wh)) {
  result_list[[i]] = Views(subject[[wh[i]]], mindex[[wh[i]]])
}
names(result_listF) = nm[wh]
I am happy to hear suggestions and to get some help or a solution for this problem. I am not a bioinformatician by training, so it already took me quite a while to figure this much out.
So I found an answer; I hope this helps someone and that there is no mistake anywhere.
library(BSgenome.Mmusculus.UCSC.mm10)
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(org.Mm.eg.db)
##get all 5’ UTR sequences
fUTR <- fiveUTRsByTranscript(TxDb.Mmusculus.UCSC.mm10.knownGene)
utr_ul <- unlist(fUTR, use.names = F)
mcols(utr_ul)$tx_id <- rep(as.integer(names(fUTR)), lengths(fUTR))
utr_ul
tx2gene <- mcols(transcripts(TxDb.Mmusculus.UCSC.mm10.knownGene, columns = c('tx_id', 'tx_name', 'gene_id')))
tx2gene$gene_id <- as.character(tx2gene$gene_id)
m <- match(mcols(utr_ul)$tx_id, tx2gene$tx_id)
mcols(utr_ul) <- cbind(mcols(utr_ul), tx2gene[m, -1L, drop = F])
utr5_by_gene <- split(utr_ul, mcols(utr_ul)$gene_id)
seqs <- getSeq(Mmusculus, utr5_by_gene)
##search with motif UGUGUGAAHAA
motif <- DNAString('TGTGTGAAHAA')
x <- vmatchPattern(motif, unlist(seqs), fixed = F)
matches <- unlist(x, recursive = T, use.names = T)
##list all genes with matches
hits <- mapIds(org.Mm.eg.db, keys = unique(names(matches)), keytype = 'ENTREZID',
               column = 'SYMBOL', multiVals = 'first')
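To hand the gene list over to Python, one simple option is to write it to a tab-separated file; a sketch, assuming hits is the named character vector produced by mapIds above (matched_genes.txt is just a file name of my choosing):
## write Entrez IDs and gene symbols of the matched genes to a file Python can read
write.table(data.frame(entrez = names(hits), symbol = unname(hits)),
            file = 'matched_genes.txt', sep = '\t', row.names = FALSE, quote = FALSE)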

Is there an easy way to simplify this code using a loop in r?

I am working in RStudio and have a series of code blocks just like these. There are 34 in total, and I am wondering if there is an easy way to write the block once and have it loop through the defined variables rsqRow.a{#} and combineddfs.a{#} and the internally used variables s_{## 'State'}.
# s_WA.train.lr.Summary
rsqRow.a32 = summary(s_WA.train.lr)$r.squared
# rsqRow.a32
Coef = summary(s_WA.train.lr)$coef[,1] # row, column
CoefRows = data.frame(Coef)
Pval = summary(s_WA.train.lr)$coef[,4]
PvalRows = data.frame(Pval)
combineddfs.a32 <- merge(CoefRows, PvalRows, by=0, all=TRUE)
# combineddfs.a32
# s_WI.train.lr.Summary
rsqRow.a33 = summary(s_WI.train.lr)$r.squared
# rsqRow.a33
Coef = summary(s_WI.train.lr)$coef[,1] # row, column
CoefRows = data.frame(Coef)
Pval = summary(s_WI.train.lr)$coef[,4]
PvalRows = data.frame(Pval)
combineddfs.a33 <- merge(CoefRows, PvalRows, by=0, all=TRUE)
# combineddfs.a33
# s_WY.train.lr.Summary
rsqRow.a34 = summary(s_WY.train.lr)$r.squared
# rsqRow.a34
Coef = summary(s_WY.train.lr)$coef[,1] # row, column
CoefRows = data.frame(Coef)
Pval = summary(s_WY.train.lr)$coef[,4]
PvalRows = data.frame(Pval)
combineddfs.a34 <- merge(CoefRows, PvalRows, by=0, all=TRUE)
# combineddfs.a34
As already mentioned in the comments, you should get the data into a list to avoid such repetitive code.
Find a common pattern that matches all your data frame names in the global environment and use that as the pattern in ls to get a character vector of their names. You can then use mget to get the data frames into a list.
list_data <- mget(ls(pattern = 's_W.*train\\.lr'))
Once you have the list of data frames, you can use lapply to iterate over it and, in the function, return the values that you want. Note that there might be a simpler way to write what you have in your attempt; however, as I don't have the data, I am not going to take the risk of shortening your code. Here I am returning rsqRow and combineddfs for each data frame; you can add or remove objects according to your preference.
all_values <- lapply(list_data, function(x) {
  rsqRow = summary(x)$r.squared
  Coef = summary(x)$coef[, 1]
  CoefRows = data.frame(Coef)
  Pval = summary(x)$coef[, 4]
  PvalRows = data.frame(Pval)
  combineddfs <- merge(CoefRows, PvalRows, by = 0, all = TRUE)
  list(rsqRow = rsqRow, combineddfs = combineddfs)
})
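To pull the pieces back out of all_values afterwards, a short usage sketch (the element names follow the list built above; the model names come from the objects found by ls):
## r-squared of every model as a named numeric vector
rsq_all <- sapply(all_values, `[[`, 'rsqRow')
## coefficient/p-value table of one particular model
all_values[['s_WA.train.lr']][['combineddfs']]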

R - lapply() versus assign() in while loop

I would like to read a large .csv into R. It'd be handy to split it into various objects and treat them separately. I managed to do this with a while loop, assigning each tenth of the file to an object:
# The dataset is larger, numbers are fictitious
n <- 0
while(n < 10000){
  a <- paste('a_', n, sep = '')
  assign(a, read.csv('df.csv',
                     header = F, stringsAsFactors = F, nrows = 1000, skip = 0 + n))
  # There will be some additional processing here (omitted)
  n <- n + 1000
}
Is there a more R-like way of doing this? I immediately thought of lapply. According to my understanding, each object would be an element of a list that I would then have to unlist.
I gave the following a shot, but it didn't work and my list only has one element:
A <- lapply('df.csv', read.csv,
            header = F, stringsAsFactors = F, nrows = 1000, skip = seq(0, 10000, 1000))
What am I missing? How do I proceed from here? How do I then unlist A and specify each element of the list as a separate data.frame?
If you apply lapply to a single element you'll have only one element as an output.
You probably want to do this:
a <- seq(0, 9000, by = 1000)  # the skip offsets, one per chunk of 1000 rows
A <- lapply(a, function(x){
  read.csv('df.csv', header = F, stringsAsFactors = F, nrows = 1000, skip = x)
})
For each element of a, here called x because that's the name I chose for the function parameter, the read.csv command is executed with that skip value. A will be a list of the results.
Edit: As @Val mentions in the comments, assign isn't needed here, so I removed it; you'll end up with a list of data frames coming from your csv if all works fine.
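If you really do want each chunk as a separate object in the global environment (usually keeping the list is nicer), a hedged sketch, with names chosen to match the a_ scheme from the while loop:
names(A) <- paste0('a_', seq(0, 9000, by = 1000))  # assumed names, one per chunk
list2env(A, envir = .GlobalEnv)                    # creates a_0, a_1000, ... as separate data frames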

Factors and Dummy Variables in R

I am new to data analytics and learning R. I have a few very basic questions which I am not very clear about. I hope to find some help here. Please bear with me... still learning.
I wrote a small function to perform basic exploratory analysis on a data set with 9 variables, of which 8 are of integer/numeric type and 1 is a factor. The function looks like this:
out <- function(x)
{
  c <- class(x)
  na.len <- length(which(is.na(x)))
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  uc <- m + 3*s
  lc <- m - 3*s
  return(c(classofvar = c, noofNA = na.len, mean = m, stdev = s, UpperCap = uc, LowerCap = lc))
}
And I apply it to the data set using:
stats <- apply(train, 2, FUN = out)
But in the output all the variable classes show up as character and all the means are NA. After some head scratching, I figured out that the problem is due to the factor variable. I converted it to numeric using this:
train$MonthlyIncome=as.numeric(as.character(train$MonthlyIncome))
It worked fine. But I am confused: if I use the above function without first looking at the dataset, it won't work. How can I handle this situation?
When should I consider creating dummy variables?
Thank you in advance, and I hope the questions are not too silly!
Note that c() results in a vector, and all elements within a vector must be of the same class. If the elements have different classes, then c() uses the least complex class which is able to hold all the information. For example, numeric and integer will result in numeric; character and integer will result in character.
Use a list or a data.frame if you need different classes.
out <- function(x)
{
  c <- class(x)
  na.len <- length(which(is.na(x)))
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  uc <- m + 3*s
  lc <- m - 3*s
  return(data.frame(classofvar = c, noofNA = na.len, mean = m, stdev = s, UpperCap = uc, LowerCap = lc))
}
sum(is.na(x)) is faster than length(which(is.na(x)))
Use lapply to run the function on each variable, and do.call to append the resulting data frames.
stats <- do.call(
  rbind,
  lapply(train, out)
)
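On the asker's question of how to handle this without inspecting the dataset first, one possible approach (a sketch, not the only way) is to skip the numeric summaries for non-numeric columns:
out_safe <- function(x)                      # hypothetical variant of out()
{
  na.len <- sum(is.na(x))
  if (is.numeric(x)) {
    m <- mean(x, na.rm = TRUE)
    s <- sd(x, na.rm = TRUE)
  } else {
    m <- NA_real_                            # factors/characters get NA summaries
    s <- NA_real_
  }
  data.frame(classofvar = class(x)[1], noofNA = na.len, mean = m, stdev = s,
             UpperCap = m + 3*s, LowerCap = m - 3*s)
}
stats <- do.call(rbind, lapply(train, out_safe))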

Reading series of values in R

I have read a series of 332 files like below, storing the data in each file as a data frame in a list.
files <- list.files()
data <- list()
for (i in 1:332){
  data[[i]] = read.csv(files[[i]])
}
The data has 3 columns named id, city, town. Now I need to calculate the mean of all values under city corresponding to the id values 1:10, for which I wrote the code below:
for(j in 1:10){
  req.data <- data[[j]]$city
}
mean(na.omit(req.data))
But it is giving me a wrong value, and when I call it in a function it returns NULL values. Any help is highly appreciated.
Each time you iterate through j = 1:10 you assign data[[j]]$city to the object req.data. In doing so, for steps j = 2:10 you overwrite the previous version of req.data with the contents of the jth data set. Hence req.data only ever contains a single city's worth of data at any one time, and hence you are getting the wrong answer, as you are computing the mean for the last city only, not all 10.
Also note that you could do mean(req.data, na.rm = TRUE) to remove the NAs.
You can do this without an explicit loop at the user R level using lapply(), for example, with dummy data,
set.seed(42)
data <- list(data.frame(city = rnorm(100)),
             data.frame(city = rnorm(100)),
             data.frame(city = rnorm(100)))
mean(unlist(lapply(data, `[`, "city")), na.rm = TRUE)
which gives
> mean(unlist(lapply(data, `[`, "city")), na.rm = TRUE)
[1] -0.02177902
So in your case, you need:
mean(unlist(lapply(data[1:10], `[`, "city")), na.rm = TRUE)
If you want to write a loop, then perhaps
req.data <- vector("list", length = 3) ## allocate, adjust to length = 10
for (j in 1:3) {                       ## adjust to 1:10 for your data / Q
  req.data[[j]] <- data[[j]]$city      ## fill in
}
mean(unlist(req.data), na.rm = TRUE)
> mean(unlist(req.data), na.rm = TRUE)
[1] -0.02177902
is one way. Or alternatively, compute the mean of the individual cities and then average those means
vec <- numeric(length = 3) ## allocate, adjust to length = 10
for (j in 1:3) {           ## adjust to 1:10 for your question
  vec[j] <- mean(data[[j]]$city, na.rm = TRUE)
}
mean(vec)
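The same per-file means can also be computed without the explicit loop; a small sketch using the dummy data above:
vec <- sapply(data, function(d) mean(d$city, na.rm = TRUE))  # one mean per data frame
mean(vec)
Note that averaging the per-file means matches the overall mean only when every file contributes the same number of non-missing values.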
