Juxtaposing Replicate Data - r

I have provided a sample dataset that I have arranged in column format (called "full.table").
These data were extracted from a 96-well PCR plate, and while collecting my data I always ran a duplicate experiment, meaning each variable (aka test) has 1 replicate. I would like to take all replicates and juxtapose them (place them side by side), which would let me easily compare replicates and then calculate an average value for the variable "Cq" between the two.
The complications stem from having done multiple tests over several days (complication one), and from NOT always having my samples laid out the same way on the PCR plate (complication two). Typically, as you can see in my data set below, Well A1 has a duplicate in Well B1; however, this is not always the case. Occasionally, Well A7 matches Well A8 (and NOT B7).
Replicates were always run on the same day, so an important variable here is “date”, which I added via R before uploading to Stack Exchange. I am not sure how to rearrange the data to get my desired result (or even where to start).
I have provided an example of what I would like in the end, called “sample.finished.table”.
Logically, since there are 768 observations in this example, pairing them up should halve that, resulting in 384 total lines of data (385 with header).
I appreciate any feedback. Thank you.
full.table<- read.table("https://pastebin.com/raw/kTQhuttv", header=T, sep="")
sample.finished.table <- read.table("https://pastebin.com/raw/Phg7C9xD", header=T, sep="")

You can use dplyr here to group by sample and date and extract the requested values:
library(dplyr)
full.table %>%
  group_by(sample, date) %>%
  summarise(
    Well1 = first(Well), Cq1 = first(Cq),
    Well2 = last(Well), sample1 = last(sample), Cq2 = last(Cq),
    Cq_mean = mean(Cq[Cq > 0]))
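For readers who want to see the pairing logic without dplyr, here is a minimal base R sketch on made-up data. The `toy` table and its values are invented for illustration; only the column names (Well, sample, date, Cq) come from the question. It assumes every sample/date group holds exactly two rows, and it keeps the `Cq > 0` filter from the answer above, presumably there to skip wells recorded as 0.

```r
# Made-up stand-in for full.table (values are illustrative only)
toy <- data.frame(
  Well   = c("A1", "B1", "A7", "A8", "A1", "B1"),
  sample = c("s1", "s1", "s2", "s2", "s1", "s1"),
  date   = c("d1", "d1", "d1", "d1", "d2", "d2"),
  Cq     = c(20, 22, 30, 0, 25, 27),
  stringsAsFactors = FALSE
)

# One piece per sample/date combination; drop = TRUE skips empty combinations
pairs <- split(toy, list(toy$sample, toy$date), drop = TRUE)

# Turn each two-row piece into a single side-by-side row
side_by_side <- do.call(rbind, lapply(pairs, function(g) {
  stopifnot(nrow(g) == 2)  # assumption: exactly one replicate per test
  data.frame(
    sample = g$sample[1], date = g$date[1],
    Well1 = g$Well[1], Cq1 = g$Cq[1],
    Well2 = g$Well[2], Cq2 = g$Cq[2],
    Cq_mean = mean(g$Cq[g$Cq > 0])  # average over non-zero Cq values only
  )
}))
rownames(side_by_side) <- NULL
print(side_by_side)
```

Because the grouping is by sample and date rather than by well position, it does not matter whether a replicate sits in B1 or A8. If both replicates in a group are 0, `Cq_mean` comes out as NaN, which may or may not be what you want.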

Related

Finding summary statistics. Struggling with having anything work after importing data into R from Excel

Very new to R here, also very new to the idea of coding and computer stuff.
It's the second week of class and I need to find some summary statistics from a set of data my professor provided. I downloaded the data and tried to follow along with his verbal instructions during class, but I am one of the only students in my degree program without a computer science background (I am an RN going for a degree in Health Informatics), so he went way too fast for me.
I was hoping for some input on just where to start with his list of tasks for me to complete. I downloaded his data into an Excel file and then imported it into R, where it is now a matrix. However, everything I try for getting the mean and standard deviation of the columns he wants comes up with an error. My understanding is that I need to convert these columns into some sort of vector, but every website tells me to do these tasks differently. I don't even know where to start with this assignment.
Any help on how to get myself started would be greatly appreciated. I've included a screenshot of his instructions and of my matrix. And please excuse my ignorance/lack of familiarity compared to most of you here; this is my second week into my master's and I am hoping I begin to pick this up soon, I am just not there yet.
the instructions include:
# * Import the dataset
# * Summarize the dataset. Compute the mean and standard deviation for the three variables (columns): age, height, weight
# * Tabulate smokers and age.level data with the variable and its frequency. How many smokers are in each age category?
# * Subset the dataset by the mothers that smoke and weigh less than 100 kg. How many mothers meet these requirements?
# * Compute the mean and standard deviation for the three variables (columns): age, height, weight
# * Plot a histogram
Stack Overflow is not a place for homework, but I feel your pain. Let's go piece by piece.
First let's use a package that helps us do those tasks:
library(data.table) # if not installed, install it with install.packages("data.table")
Then, let's load the data:
library(readxl) #again, install it if not installed
dt = setDT(read_excel("path/to/your/file/here.xlsx"))
Now to the calculations:
1 summarize the dataset. Here you'll see the ranges, means, medians and other interesting data of your table.
summary(dt)
1A mean and standard deviation of age, height and weight (replace age with the column name of height and weight to get those)
dt[, .(meanValue = mean(age, na.rm = TRUE), stdDev = sd(age, na.rm = TRUE))]
2 tabulate smokers and age.level. get the counts for each combination:
dt[, .N, by = .(smoke, age.level)]
3 subset smoker mothers with weight < 100 (I'm assuming non-pregnant mothers have NA in the gestation field. Adjust as necessary):
dt[smoke == 1 & weight < 100 & !is.na(gestation), .N]
4 Is the same as 1A.
5 Plot a histogram (but you don't specify of what variable, so let's say it's age):
hist(dt$age)
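If you want to try the steps above before touching the real file, here is a rough base R sketch on a tiny made-up table. The `births` data frame and every value in it are invented; the column names (age, height, weight, smoke, age.level) are taken from the assignment text, and coding smokers as 1 follows the assumption used above.

```r
# Tiny made-up data set (illustrative values only)
births <- data.frame(
  age       = c(25, 31, 40, 22, 35),
  height    = c(160, 165, 170, 158, 172),
  weight    = c(55, 70, 105, 60, 95),
  smoke     = c(1, 0, 1, 1, 0),
  age.level = c("20-29", "30-39", "40+", "20-29", "30-39")
)

# Mean and standard deviation of all three columns in one go
num_cols <- c("age", "height", "weight")
stats <- data.frame(
  mean = sapply(births[num_cols], mean, na.rm = TRUE),
  sd   = sapply(births[num_cols], sd, na.rm = TRUE)
)
print(stats)

# Cross-tabulate smoking status against age category
smokers_by_age <- table(smoke = births$smoke, age.level = births$age.level)
print(smokers_by_age)

# Count the mothers who smoke and weigh less than 100
n_light_smokers <- nrow(subset(births, smoke == 1 & weight < 100))
print(n_light_smokers)
```

`sapply` applies the same function to each selected column, which saves repeating the mean/sd line three times.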
Keep on studying R, it's not that difficult. The book recommended in the comments is a very good start.

Create a new row to assign M/F to a column based on heading, referencing second table?

I am new to R (and coding in general) and am really stuck on how to approach this problem.
I have a very large data set; columns are sample ID# (~7000 samples) and rows are gene expression (~20,000 genes). Column headings are BIOPSY1-A, BIOPSY1-B, BIOPSY1-C, ..., BIOPSY200-Z. Each number (1-200) is a different patient, and each sample for that patient is a different letter (-A through -Z).
I would like to do some comparisons between samples that came from men and women. Gender is not included in this gene expression table. I have a separate file with patient numbers (BIOPSY1-200) and their gender M/F.
I would like to code something that will look at the column ID (ex: BIOPSY7-A), recognize that it includes "BIOPSY7" (but not == BIOPSY7 because there is BIOPSY7-A through BIOPSY7-Z), find "BIOPSY7" in the reference file, extrapolate M/F, and create a new row with M/F designation.
Honestly, I am so overwhelmed with coding this that I tried to open the file in Excel to manually input M/F, for the 7000 columns as it would probably be faster. However, the file is so large that Excel crashes when it opens.
Any input or resources that would put me on the right path would be extremely appreciated!!
I don't quite know what your data looks like, so I made up mine based on your description. I'm sure you can modify this answer based on your needs and your dataset structure:
library(data.table)
genderfile <-data.frame("ID"=c("BIOPSY1", "BIOPSY2", "BIOPSY3", "BIOPSY4", "BIOPSY5"),"Gender"=c("F","M","M","F","M"))
#you can just read in your gender file to r with the line below
#genderfile <- read.csv("~/gender file.csv")
View(genderfile)
df<-matrix(rnorm(45, mean=10, sd=5),nrow=3)
colnames(df)<-c("BIOPSY1-A", "BIOPSY1-B", "BIOPSY1-C", "BIOPSY2-A", "BIOPSY2-B", "BIOPSY2-C","BIOPSY3-A", "BIOPSY3-B", "BIOPSY3-C","BIOPSY4-A", "BIOPSY4-B", "BIOPSY4-C","BIOPSY5-A", "BIOPSY5-B", "BIOPSY5-C")
df<-cbind(Gene=1:3,df)
df<-as.data.frame(df)
#you can just read in your main df to r with the line below; fread keeps the dashes in the column names instead of turning them into periods (you need the data.table package installed and loaded)
#df<-fread("~/first file.csv")
View(df)
Note that the following line of code removes the dash and letter from the column names of df (I removed the first column by df[,-c(1)] because it is the Gene id):
substr(x=names(df[,-c(1)]),start=1,stop=nchar(names(df[,-c(1)]))-2)
#[1] "BIOPSY1" "BIOPSY1" "BIOPSY1" "BIOPSY2" "BIOPSY2" "BIOPSY2" "BIOPSY3" "BIOPSY3" "BIOPSY3" "BIOPSY4" "BIOPSY4"
#[12] "BIOPSY4" "BIOPSY5" "BIOPSY5" "BIOPSY5"
Now, we are ready to match the columns of df with the ID in genderfile to get the Gender column:
Gender<-genderfile[, "Gender"][match(substr(x=names(df[,-c(1)]),start=1,stop=nchar(names(df[,-c(1)]))-2), genderfile[,"ID"])]
Gender
#[1] F F F M M M M M M F F F M M M
Last step is to add the Gender defined above as a row to the df:
df_withGender<-rbind(c("Gender", as.character(Gender)), df)
View(df_withGender)
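One caveat with the `substr` approach is that it assumes the suffix is always exactly two characters (a dash plus one letter). A slightly more forgiving sketch, assuming only that the patient ID is everything before the last dash, uses `sub` with a regular expression. The sample IDs and the small `genderfile` below are made up for illustration.

```r
# Made-up reference table and sample IDs (illustrative only)
genderfile <- data.frame(
  ID     = c("BIOPSY1", "BIOPSY2", "BIOPSY10"),
  Gender = c("F", "M", "F"),
  stringsAsFactors = FALSE
)
sample_ids <- c("BIOPSY1-A", "BIOPSY1-B", "BIOPSY2-A", "BIOPSY10-A", "BIOPSY10-AB")

# Strip the last dash and everything after it, whatever its length
patient_ids <- sub("-[^-]*$", "", sample_ids)

# Look up each patient's gender in the reference table
gender <- genderfile$Gender[match(patient_ids, genderfile$ID)]

# Keep the result as a separate annotation table, one row per sample
annotation <- data.frame(sample = sample_ids, patient = patient_ids, gender = gender)
print(annotation)

# Example use: names of the columns that came from female patients
female_cols <- annotation$sample[annotation$gender == "F"]
```

Keeping the gender in a separate annotation table rather than as an extra row is a design choice: adding a row of "M"/"F" strings with `rbind` tends to turn the numeric expression columns into character columns, whereas `female_cols` can be used directly to select columns from the expression table.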

how to use the functcomp code in R

I am having trouble using the functcomp function (from the FD package) in R.
I have 2 datasets: one with species frequency, and the other listing the functional traits of my species. The frequency dataset has 264 species listed in the first row and 27 sites listed in the first column, all values in dataset are between 0-1. The functional trait dataset has the same 264 species (copied & pasted from the frequency dataset to make sure identical) listed in the first column, and 5 different functional traits listed in the 1st row (height, life history, life form, origin, palatability).
I am using the following code:
traits.df <- read.table("species_functional_traits_6_ August.txt", header = TRUE)
frequency.df <- read.table("Spring 2014 - combined table - 6 August.txt", header = TRUE)
x <- (as.matrix(traits.df))
a <- (as.matrix(frequency.df))
functcomp(x, a, CWM.type = c("dom", "all"), bin.num = height)
But keep getting the following error message:
Error in functcomp(x, a, CWM.type = c("dom", "all"), bin.num = height) :
Different number of species in 'x' and 'a'.
I have tried fiddling with a couple of things in the code and datasets, but cannot work out what I am doing wrong here. Any help would be greatly appreciated!
Here are links the frequency & trait data (a subset of it, but still get same error message with this data) as a tab-delimited text file
frequency: https://www.dropbox.com/s/girs3nrq1ciyg1a/frequency%20-%20small.txt?dl=0
traits: https://www.dropbox.com/s/l888sallx7mu3f6/traits%20-%20small.txt?dl=0
Try setting row.names = 1 when you read in your tables; this solved my problem.
- Anna
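To illustrate why `row.names = 1` can matter here: without it, the column of site names is read as an ordinary column, so the frequency table ends up with one more column than there are species. The sketch below uses small made-up tab-delimited tables (three species, two sites) in place of the real files, and it assumes the error comes from comparing the number of columns in `a` with the number of rows in `x`. The `functcomp` call itself is left commented out because it needs the FD package.

```r
# Made-up tab-delimited tables standing in for the real files
trait_txt <- "species\theight\torigin\nsp1\t10\tnative\nsp2\t20\texotic\nsp3\t30\tnative"
freq_txt  <- "site\tsp1\tsp2\tsp3\nsiteA\t0.1\t0.5\t0.0\nsiteB\t0.3\t0.0\t0.9"

# Without row.names: the label columns count as data
x_bad <- read.table(text = trait_txt, header = TRUE, sep = "\t")
a_bad <- as.matrix(read.table(text = freq_txt, header = TRUE, sep = "\t"))
c(species_in_x = nrow(x_bad), columns_in_a = ncol(a_bad))  # 3 vs 4: mismatch

# With row.names = 1: the labels become row names
x <- read.table(text = trait_txt, header = TRUE, sep = "\t", row.names = 1)
a <- as.matrix(read.table(text = freq_txt, header = TRUE, sep = "\t", row.names = 1))

# Sanity checks worth running before calling functcomp
stopifnot(ncol(a) == nrow(x))
stopifnot(identical(colnames(a), rownames(x)))

# FD::functcomp(x, a)  # not run here; requires the FD package
```

With real species names it is also worth comparing `colnames(a)` and `rownames(x)` directly, since `read.table` may alter column names that contain spaces or other special characters.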

Reliability tests for classic content analysis (multiple categorial codes per item)

In classic content analysis (or qualitative content analysis), as typically done with Atlas.TI or Nvivo type tools (sometimes called CAQDAS tools), you typically face the situation of having multiple raters rate many objects with many codes, so there are multiple codes that each rater might apply to each object. I think this is what the excellent John Uebersax page on agreement statistics calls "Two Raters, Polytomous Ratings".
For example you might have two raters read articles and code them with some group of topic codes from a coding scheme (e.g., diy, shelving, circular saw), and you are asking how well the coders agree on applying the codes.
What I'd like is to use the irr package functions, agree and kappa2, in these situations. Yet their documentation didn't help me figure out how to proceed, since they expect input in the form of "n*m matrix or dataframe, n subjects m raters." which implies that there is a single rating per rater, per object.
Given two raters using (up to) three codes to code two articles, the input data looks like this (two diy articles, the second with some additional topic tags):
article,rater,code
article1,rater1,diy
article1,rater2,diy
article2,rater1,diy
article2,rater2,diy
article2,rater1,circular-saw
article2,rater1,shelving
article2,rater2,shelving
I'd like to get:
Overall percentage agreement.
Percentage agreement for each code.
Contingency table for each code.
Ideally, I'd also like to get Positive agreement (how often do the raters agree that a code should be present?) and Negative Agreement (how often do the raters agree that a code should not be present). See discussion of these at http://www.john-uebersax.com/stat/raw.htm#binspe
I'm pretty sure that this involves breaking the input data.frame up and processing it code by code, using something like dplyr, but I wondered if others have tackled this problem.
(The kappa functions take the same input, so let's just keep this simple by using the agree function from the irr package, plus the positive and negative agreement only really make sense with percentage agreement).
Looking at the meta.stackexchange threads on answering one's own question, it seems that is an acceptable thing to do. Makes sense, good place to store stuff for others to find :)
I solved most of this with the following code:
library(plyr); library(dplyr); library(reshape2); library(irr)
# The irr package expects input in the form of n x m (objects in rows, raters in columns)
# for multiple coders per coded items that is really confusing. Here we have 10 articles (to be coded) and
# many codes. So each rater rates each combinations of articles and codes as present (or not).
# Basically you send only the ratings columns to agree and kappa2. You can send them all at
# once for overall agreement, or send only those for each code for code-by-code agreement.
# letter,code,rater
# letter1,code1,rater1
# letter1,code2,rater1
# letter2,code3,rater2
coding <- read.csv("CombinedCoding.csv")
# Now want:
# letter, code, rater1, rater2
# where 0 = no (this code wasn't used), 1 = yes (this code was used)
# dcast can do this, collapsing across a group. In this case we're not really
# grouping, so if the code was not present length gives a 0, if it was length
# gives a 1.
# This excludes all the times where we agreed that both codes weren't present.
ccoding <- dcast(coding, letter + code ~ rater, length) # "rater" is the column name from the header shown above
# create data.frame from combination of letters and codes
# this handles the negative agreement parts.
codelist <- unique(coding$code)
letterlist <- unique(coding$letter)
coding_with_negatives <- merge(codelist, letterlist) # Gets the Cartesian product of these.
names(coding_with_negatives) <- c("code", "letter") # align the names
# merge this with the coding, produces NA for rows that don't exist in ccoding
coding_with_negatives <- merge(coding_with_negatives,ccoding,by=c("letter","code"), all.x=T)
# replace NAs with zeros.
coding_with_negatives[is.na(coding_with_negatives)] <- 0
# Now want agreement per code.
# need a function that returns a df
# this function gets given the split data frame (ie this happens once per code)
getagree <- function(df) {
# for positive agreement remove the cases where we both coded it negative
positive_df <- filter(df, (rater1 == 1 & rater2 == 1) | (rater1 == 0 & rater2 == 1) | (rater1 == 1 & rater2 == 0))
# for negative agreement remove the cases where we both coded it positive
negative_df <- filter(df, (rater1 == 0 & rater2 == 0) | (rater1 == 0 & rater2 == 1) | (rater1 == 1 & rater2 == 0))
data.frame( positive_agree = round(agree(positive_df[,3:4])$value,2) # Run agree on the raters columns, get the $value, and round it.
, negative_agree = round(agree(negative_df[,3:4])$value,2)
, agree = round(agree(df[,3:4])$value,2)
, used_in_articles = nrow(positive_df) # gives some idea of the prevalence.
)
}
# split the df up by code, run getagree on the sections
# recombine into a data frame.
results <- ddply(coding_with_negatives, .(code), getagree)
The confusion matrices can be gotten with:
print(table(coding_with_negatives[,3],coding_with_negatives[,4],dnn=c("rater1","rater2")))
I haven't done it but I think I could do that per code inside the function using print to push them into a text file.
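For anyone who wants to check the idea without plyr, reshape2 or irr, here is a small base R sketch that rebuilds the article-by-code grid from the example data in the question and computes plain percentage agreement. The specific-agreement formulas at the end (2a / (2a + b + c) and 2d / (2d + b + c), with a = both raters positive, d = both negative, b + c = disagreements) are my reading of how positive and negative agreement are usually defined; they are not the same quantity that the getagree function above returns, so treat them as an assumption to verify.

```r
# Example data from the question
coding <- data.frame(
  article = c("article1", "article1", "article2", "article2",
              "article2", "article2", "article2"),
  rater   = c("rater1", "rater2", "rater1", "rater2",
              "rater1", "rater1", "rater2"),
  code    = c("diy", "diy", "diy", "diy",
              "circular-saw", "shelving", "shelving"),
  stringsAsFactors = FALSE
)

# Every article x code combination, so "neither rater used it" is a row too
grid <- expand.grid(article = unique(coding$article),
                    code = unique(coding$code),
                    stringsAsFactors = FALSE)

# 1 if the given rater applied that code to that article, else 0
used <- function(r) {
  as.integer(paste(grid$article, grid$code) %in%
               paste(coding$article, coding$code)[coding$rater == r])
}
grid$rater1 <- used("rater1")
grid$rater2 <- used("rater2")
grid$agree  <- grid$rater1 == grid$rater2

overall  <- 100 * mean(grid$agree)                      # overall % agreement
per_code <- 100 * tapply(grid$agree, grid$code, mean)   # % agreement per code

# Cells of the overall 2x2 table
a  <- sum(grid$rater1 == 1 & grid$rater2 == 1)  # both present
d  <- sum(grid$rater1 == 0 & grid$rater2 == 0)  # both absent
bc <- sum(grid$rater1 != grid$rater2)           # disagreements
positive_agreement <- 2 * a / (2 * a + bc)
negative_agreement <- 2 * d / (2 * d + bc)

print(table(rater1 = grid$rater1, rater2 = grid$rater2))
```

The same calculation can be run per code by subsetting `grid` on `code` first.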

R storing different columns in different vectors to compute conditional probabilities

I am completely new to R. I tried reading the reference and a couple of good introductions, but I am still quite confused.
I am hoping to do the following:
I have produced a .txt file that looks like the following:
area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10
1.64155121046e-05,9.0815757601e-12
1.85202830392e-05,8.3207522281e-11
1.5256036289e-05,4.24756620609e-10
1.82107587343e-05,0.0
I have the following command to read the file in R:
tbl <- read.csv("foo.txt",header=TRUE)
producing:
> tbl
area energy
1 1.411559e-05 1.091459e-11
2 1.468934e-05 5.250117e-11
3 1.392440e-05 1.579050e-10
4 1.641551e-05 9.081576e-12
5 1.852028e-05 8.320752e-11
6 1.525604e-05 4.247566e-10
7 1.821076e-05 0.000000e+00
Now I want to store each column in two different vectors, respectively area and energy.
I tried:
area <- c(tbl$first)
energy <- c(tbl$second)
but it does not seem to work.
I need two different vectors (which must include only the numerical data of each column) in order to do the following:
> prob(energy, given = area), i.e. the conditional probability P(energy|area).
And then plot it. Can you help me please?
As @Ananda Mahto alluded to, the problem is in the way you are referring to columns.
To 'get' a column of a data frame in R, you have several options:
DataFrameName$ColumnName
DataFrameName[,ColumnNumber]
DataFrameName[["ColumnName"]]
So to get area, you would do:
tbl$area #or
tbl[,1] #or
tbl[["area"]]
With the first option generally being preferred (from what I've seen).
Incidentally, for your 'end goal', you don't need to do any of this:
with(tbl, prob(energy, given = area))
does the trick.
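As a quick check of the column access forms, here is a small self-contained sketch. It reads the first three rows of the example data from an inline string instead of foo.txt, which is just a convenience for illustration.

```r
# First three rows of the example data, read from a string instead of a file
txt <- "area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10"
tbl <- read.csv(text = txt, header = TRUE)

# There is no column called "first", so this is NULL (no error is raised)
print(tbl$first)

# Columns are referred to by their actual names (or by position)
area   <- tbl$area
energy <- tbl[["energy"]]

# The three access forms return the same numeric vector
identical(tbl$area, tbl[, 1])
identical(tbl$area, tbl[["area"]])
```

That `tbl$first` quietly returns NULL rather than failing is likely why the original attempt "did not seem to work" without any obvious error message.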
