Create a function in R that accomplishes the following

Columns A-F are identity (1/0) columns. Column G has the values "WLB0", "WLB2" to "WLB10", "WLB46" and "WLB89".
I am trying to do the following for every combination of columns A-F with the values of column G.
I am looking for a function to call instead of the very awkward code below that I wrote.
The test data is available for download at the bottom.
X1 <- {dd <- subset(TEST, TEST$A == 1 & TEST$G =="WLB10"); de <-transform(dd, RP = sum(dd$I)/sum(dd$H));mean(de$RP)}
X2 <- {dd <- subset(TEST, TEST$A == 1 & TEST$G =="WLB8"); de <-transform(dd, RP = sum(dd$I)/sum(dd$H));mean(de$RP)}
X3 <- {dd <- subset(TEST, TEST$B == 1 & TEST$G =="WLB10"); de <-transform(dd, RP = sum(dd$I)/sum(dd$H));mean(de$RP)}
TEST1$finalnumber <-ifelse(TEST1$A == 1 & TEST1$G == "WLB10", X1,
ifelse(TEST1$A == 1 & TEST1$G == "WLB8", X2,
ifelse(TEST1$B == 1 & TEST1$G == "WLB10", X3, 0)))
Test data
"https://s3.amazonaws.com/RProgramming/TEST.csv"
"https://s3.amazonaws.com/RProgramming/TEST1.csv"

I'm a bit confused about the purpose of setting RP to be constant across the rows of de, but the code below should get you some of the way, I hope. ddply and melt are two great functions for this sort of thing.
library(plyr)
library(reshape)
long <- melt(TEST, measure.vars=LETTERS[1:6])
#long <- subset(long, value == 1)
shorter <- ddply(long, .(G, variable, value), summarize, RP=sum(I)/sum(H))
You can uncomment the line to just get subtotals corresponding to 1, but I thought it was illustrative to show you how it works.
You can then do a similar melt on TEST1, and carry out a lookup for the relevant value:
long <- melt(TEST1, measure.vars=LETTERS[1:6])
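# match on G, indicator column and indicator value, so that a 1 in TEST1 is never paired with the subtotal for 0
ind <- match(paste(long$G, long$variable, long$value), paste(shorter$G, shorter$variable, shorter$value))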
long$final <- shorter$RP[ind]
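Since the question asks for a function, the same idea can also be wrapped in one. The following is a minimal base-R sketch rather than the answerer's code: the name ratio_lookup and the toy data frames are invented for illustration, and it assumes (as the original ifelse chain does) that the first identity column equal to 1 wins and that rows with no match get 0.

```r
# Hypothetical helper: for each identity column, compute sum(I) / sum(H)
# per level of G over the rows of `train` where that column is 1,
# then look the ratio up for every row of `target`.
ratio_lookup <- function(train, target, id_cols, default = 0) {
  out <- rep(default, nrow(target))
  filled <- rep(FALSE, nrow(target))
  for (col in id_cols) {
    sub <- train[train[[col]] == 1, , drop = FALSE]
    if (nrow(sub) == 0) next
    # One ratio per level of G for this identity column
    ratio <- tapply(sub$I, sub$G, sum) / tapply(sub$H, sub$G, sum)
    ratio <- ratio[!is.na(ratio)]
    # Only fill rows that an earlier column has not already filled
    hit <- !filled & target[[col]] == 1 & target$G %in% names(ratio)
    out[hit] <- ratio[as.character(target$G[hit])]
    filled <- filled | hit
  }
  out
}

# Toy stand-ins for TEST and TEST1 (values invented)
train <- data.frame(A = c(1, 1, 0, 0), B = c(0, 0, 1, 1),
                    G = c("WLB10", "WLB10", "WLB10", "WLB8"),
                    H = c(10, 30, 20, 4), I = c(1, 3, 4, 1))
target <- data.frame(A = c(1, 0, 0), B = c(0, 1, 0),
                     G = c("WLB10", "WLB10", "WLB8"))
target$finalnumber <- ratio_lookup(train, target, id_cols = c("A", "B"))
```

With the real data the call would presumably be ratio_lookup(TEST, TEST1, LETTERS[1:6]), assuming both files share the column names A-F, G, H and I.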

Related

Import recorded genotype table in genlight object

I am trying to clean up my SNP data by filtering out any alleles that have depth < 3 and, if the total depth of the locus is < 5, excluding the genotype altogether, i.e. turning it into NA. Then I tried to record the "clean" genotypes back into a genlight object. This is where I am stuck.
Recall genotypes based on the new depth filter
het <- Map(function(x, y) x>0 & y>0, read.depth.ref.filter, read.depth.snp.filter)
hom_0 <- Map(function(x, y) x>0 & y==0, read.depth.ref.filter, read.depth.snp.filter)
hom_2 <- Map(function(x, y) x==0 & y>0, read.depth.ref.filter, read.depth.snp.filter)
re.geno <- read.depth.snp.filter
Record the genotype table
re.geno[] <- Map(ifelse, het, 1, re.geno)
re.geno[] <- Map(ifelse, hom_0, 0, re.geno)
re.geno[] <- Map(ifelse, hom_2, 2, re.geno)
I tried:
gl[,] <- t(re.geno[,])
Any suggestions would be greatly appreciated.
R.
I found my own answer, which is:
gl <- as.genlight(t(re.geno))
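For readers without the original objects, the depth filter and recoding described in the question can be sketched on plain matrices. This is an illustration under stated assumptions, not the poster's code: the function name recode_genotypes, the toy depths, and the choice to compute the total locus depth before the per-allele filter are all assumptions.

```r
# Toy read-depth matrices (individuals in rows, loci in columns);
# the values are invented for illustration.
ref <- matrix(c(6, 0, 2, 4,
                1, 7, 3, 0), nrow = 2, byrow = TRUE)
snp <- matrix(c(0, 5, 1, 4,
                2, 0, 3, 9), nrow = 2, byrow = TRUE)

recode_genotypes <- function(ref, snp, min_allele = 3, min_total = 5) {
  total <- ref + snp            # locus depth before allele filtering (assumption)
  ref[ref < min_allele] <- 0    # drop alleles with too few reads
  snp[snp < min_allele] <- 0
  geno <- matrix(NA_integer_, nrow(ref), ncol(ref), dimnames = dimnames(ref))
  geno[ref > 0 & snp == 0] <- 0L   # homozygous reference
  geno[ref > 0 & snp > 0]  <- 1L   # heterozygous
  geno[ref == 0 & snp > 0] <- 2L   # homozygous alternate
  geno[total < min_total] <- NA    # too little depth overall
  geno
}

geno <- recode_genotypes(ref, snp)
```

The resulting matrix could then be passed to as.genlight as in the answer above, transposing first if the loci are stored in rows.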

Na/NaN Error In R

I have just started using R and have a somewhat complex question. I have a data frame called "data" in which each individual is assigned a PID number. I want to write a loop that finds the closest pair of dates (SampleDate and LTROT.Date), since there are multiple sample dates for each LTROT.Date. When running this code I keep getting "Error in start.of.PID:end.of.PID : NA/NaN argument". The data is confidential, so I can't provide it. I am new to Stack Overflow, so I apologize if my question doesn't meet some of the guidelines.
unique <- unique(data$PID)
z <- 1
end.of.PID <- 0
max <- 100000000
sample.ideal <- vector(length = 58)
for(i in unique){
  start.of.PID <- (end.of.PID + 1)
  multi <- sum(unique[i] == data$PID)
  end.of.PID <- (start.of.PID + multi)-1
  for(j in start.of.PID:end.of.PID){
    Sample.Date <- as.Date(data$SampleDate)
    LTROT.Date <- as.Date(data$LTROT.Date)
    time <- Sample.Date[j]-LTROT.Date[j]
    if(time < max){
      max <- time
      sample.ID <- data$SampleID[j]
    }else{
      max <- max
    }
    sample.ideal[z] <- sample.ID
    z <- z + 1
  }
}
Mistake in OP's code:
unique <- unique(data$PID)
......
for(i in unique){ # i is an item of the "unique" vector, not a position in it
  start.of.PID <- (end.of.PID + 1)
  multi <- sum(unique[i] == data$PID) # here i has been used as an index
  # The line above should be written as:
  multi <- sum(i == data$PID)
Although the OP did not provide sample data, the logic of the for-loop suggests that a dplyr-based solution would be easier. The result can be obtained with a self-join on PID, followed by a filter that keeps the record with the minimum date difference. The query can be written as:
library(dplyr)
data %>% mutate(SampleDate = as.Date(SampleDate), LTROT.Date = as.Date(LTROT.Date)) %>%
  inner_join(., ., by = "PID") %>%
  group_by(PID) %>%
  # absolute difference, so that dates on either side of LTROT.Date count as close
  mutate(MinDateDiff = abs(SampleDate.x - LTROT.Date.y)) %>%
  filter(MinDateDiff == min(MinDateDiff)) %>%
  select(PID, SampleDate = SampleDate.x, LTROT.Date = LTROT.Date.y, MinDateDiff)
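If every row already carries the LTROT.Date it should be compared with, the self-join may not be needed. A minimal base-R sketch of that simpler case follows. The data are invented (the real data were not shared), the column names follow the question, and "closest" is taken to mean the smallest absolute gap in days.

```r
# Invented example data; column names follow the question.
data <- data.frame(
  PID        = c(1, 1, 1, 2, 2),
  SampleID   = c("a", "b", "c", "d", "e"),
  SampleDate = as.Date(c("2020-01-01", "2020-03-01", "2020-06-01",
                         "2021-02-10", "2021-05-01")),
  LTROT.Date = as.Date(c("2020-02-20", "2020-02-20", "2020-02-20",
                         "2021-02-01", "2021-02-01"))
)

# Absolute gap in days between the two dates on each row
data$gap <- abs(as.numeric(data$SampleDate - data$LTROT.Date))

# For each PID keep the row whose gap is smallest
closest <- do.call(rbind, lapply(split(data, data$PID),
                                 function(d) d[which.min(d$gap), ]))
closest
```

which.min keeps only the first row in case of ties, whereas the filter in the dplyr version keeps all tied rows.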

How can I convert these nested loops into an R loop function like sapply or tapply or

I have this code which I run over a data frame t.
for (i in years){
  for (j in type){
    x <- rbind(x, cbind(i, j,
                        sum(t[(t$year == i) & (t$type == j),]$Emissions,
                            na.rm = TRUE)))
  }
}
Basically, I have two vectors years and type. I'm finding the sum of each category and merging that into a data frame. The above code works, but I cannot figure out how to use one of the loop functions.
Yes, there are ways to do this using the apply functions. I'm going to suggest a high performance approach using dplyr, though.
library(dplyr)
x <- t %>%
  group_by(year,type) %>%
  summarize(SumEmmissions=sum(Emissions,na.rm=TRUE))
I think you will find that it is much faster than either a loop or apply approach.
=================== Proof, as requested ===============
library(dplyr)
N <- 1000000
Nyear <- 50
Ntype <- 40
myt <- data.frame(year=sample.int(50,N,replace=TRUE),
                  type=sample.int(4,N,replace=TRUE),
                  Emissions=rnorm(N)
)
years <- 1:Nyear
type <- 1:Ntype
v1 <- function(){
  x <- myt %>%
    group_by(year,type) %>%
    summarize(SumEmmissions=sum(Emissions,na.rm=TRUE))
}
v2 <- function(){
  x <- data.frame()
  for (i in years){
    for (j in type){
      x <- rbind(x, cbind(i, j,
                          sum(myt[(myt$year == i) & (myt$type == j),]$Emissions, na.rm = TRUE)))
    }
  }
}
v3 <- function(){
  t0 <- myt[myt$year %in% years & myt$type %in% type, ]
  x <- aggregate(Emissions ~ year + type, t0, sum, na.rm = TRUE)
}
system.time(v1())
user system elapsed
0.051 0.000 0.051
system.time(v2())
user system elapsed
176.482 0.402 177.231
system.time(v3())
user system elapsed
7.758 0.011 7.783
As the size of the data and the number of groups increase, so does the performance spread.
Pick out all rows for which year is in years and type is in type giving t0. Then aggregate Emissions based on years and type.
t0 <- t[t$year %in% years & t$type %in% type, ]
aggregate(Emissions ~ year + type, t0, sum, na.rm = TRUE)
If the years and type vectors contain all years and types then the first line could be omitted and t0 in the second line replaced with t.
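For the apply-family version the question asks about, tapply is probably the closest fit. A minimal sketch with an invented data frame standing in for t (the type labels and values are made up):

```r
# Invented data frame standing in for `t` from the question
emis <- data.frame(
  year = c(1999, 1999, 2002, 2002, 2002),
  type = c("POINT", "ROAD", "POINT", "POINT", "ROAD"),
  Emissions = c(1.5, 2, NA, 4, 0.5)
)

# tapply returns a year-by-type matrix of sums
sums <- tapply(emis$Emissions, list(emis$year, emis$type), sum, na.rm = TRUE)

# Reshape to the long layout that the nested loops build
x <- as.data.frame(as.table(sums), responseName = "Emissions")
names(x)[1:2] <- c("year", "type")
x
```

One difference from the loop: a year/type combination with no rows comes out as NA here, whereas the loop's sum over an empty selection gives 0.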
Next time please make your example reproducible.

How to extract some sample in R

How do I extract only the random numbers (CD) for 'Trt' at time point 1?
ns <- 20
ans <- matrix(rep(0,200),nrow=100)
for(k in 1:100)
{
  x1=rnorm(ns,0,1)
  x2=rnorm(ns,5,5)
  x3=rnorm(ns,10,5)
  U=c(x1,x2,x3)
  simdata=data.frame(CD=U,
                     Time=factor(rep(c(1,2,3),each=ns)),
                     treatment=sample(rep(c('Trt','placebo'),ns/2)))
  ans[k,]=table(simdata$treatment)
}
simdata
You can do that in multiple ways:
simdata$CD[simdata$Time == 1]
or use subset:
subset(simdata, Time == 1, select = "CD")
The former is recommended for use in scripts, the latter works well in interactive mode (R prompt).
You can subset for both conditions (treatment = "Trt" and Time = "1") like this:
smpl <- simdata[simdata$Time=="1" & simdata$treatment=="Trt",]
If you only want the CD column:
smpl <- simdata$CD[simdata$Time=="1" & simdata$treatment=="Trt"]
I think you want CD for time point "1" and treatment "Trt":
subset(simdata, Time == 1 & treatment == "Trt", select = "CD")
alternatively for the whole data frame
subset(simdata, Time == 1 & treatment == "Trt")
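The indexing and subset approaches can be checked against each other on a single simulated data set. This sketch rebuilds simdata once, as in the question; the seed is an arbitrary choice added so that the run is repeatable.

```r
set.seed(1)  # arbitrary seed so the example is repeatable
ns <- 20
simdata <- data.frame(
  CD = c(rnorm(ns, 0, 1), rnorm(ns, 5, 5), rnorm(ns, 10, 5)),
  Time = factor(rep(c(1, 2, 3), each = ns)),
  # 20 labels, recycled by data.frame across the 60 rows as in the question
  treatment = sample(rep(c("Trt", "placebo"), ns / 2))
)

# Logical index for both conditions
keep <- simdata$Time == "1" & simdata$treatment == "Trt"

by_index  <- simdata$CD[keep]
by_subset <- subset(simdata, Time == 1 & treatment == "Trt", select = "CD")$CD

identical(by_index, by_subset)
```

Note that simdata$CD is a plain vector, so it is indexed with a single logical vector and no trailing comma.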

Function not returning appropriate value when applied to a data.table column while successful on one variable [duplicate]

I have the following dataframe:
sp <- combn(c("sp1","sp2","sp3","sp4"),2)
d <- data.frame(t(sp),"freq"=sample(0:100,6))
and two factors
x1 <- as.factor(c("sp1","sp2"))
x2 <- as.factor(c("sp3","sp4"))
I need a dataframe returned containing all possible combinations of x1 and x2 and the freq from dataframe d associated with this combination.
The returned dataframe would look like this:
data.frame("X1" = c("sp1","sp1","sp2","sp2"),
"X2" = c("sp3","sp4","sp3","sp4"),
"freq" = c(4,94,46,74))
I have tried:
sub <- d[d$X1 == x1 & d$X2 == x2,]
but get the error
Error in Ops.factor(d$X1, x1) : level sets of factors are different
Any ideas on how to solve this problem?
Do not make x1 and x2 factors. Just use vectors. Use %in% for the logical test.
sp <- combn(c("sp1","sp2","sp3","sp4"),2)
d <- data.frame(t(sp),"freq"=sample(0:100,6))
x1 <- c("sp1","sp2")
x2 <- c("sp3","sp4")
sub <- d[d$X1 %in% x1 & d$X2 %in% x2,]
You are almost there:
d[d$X1 %in% x1 & d$X2 %in% x2,]
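An alternative that builds the requested combinations explicitly is to cross x1 and x2 with expand.grid and merge the result with d. This is a sketch, not part of the answers above; the freq values are fixed here (instead of sampled) so that the output is repeatable, and were chosen to reproduce the table shown in the question.

```r
sp <- combn(c("sp1", "sp2", "sp3", "sp4"), 2)
# Fixed freq values instead of sample(), for repeatability
d <- data.frame(t(sp), freq = c(10, 4, 94, 46, 74, 30))

x1 <- c("sp1", "sp2")
x2 <- c("sp3", "sp4")

# All combinations of x1 and x2, then an inner join against d
wanted <- expand.grid(X1 = x1, X2 = x2, stringsAsFactors = FALSE)
res <- merge(wanted, d, by = c("X1", "X2"))
res
```

Like the %in% version, this only finds pairs stored in the same order as in d (X1 first, X2 second).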
