How to extract some sample in R - r

How do I extract only the random numbers (CD) for 'Trt' at time point 1?
ns <- 20
ans <- matrix(rep(0,200),nrow=100)
for(k in 1:100)
{
x1=rnorm(ns,0,1)
x2=rnorm(ns,5,5)
x3=rnorm(ns,10,5)
U=c(x1,x2,x3)
simdata=data.frame(CD=U,
Time=factor(rep(c(1,2,3),each=ns)),
treatment=sample(rep(c('Trt','placebo'),ns/2)))
ans[k,]=table(simdata$treatment)
}
simdata

You can do that in multiple ways:
simdata$CD[simdata$Time == 1]
or use subset:
subset(simdata, Time == 1, select = "CD")
The former is recommended for use in scripts, the latter works well in interactive mode (R prompt).

You can subset for both conditions (treatment = "Trt" and Time = "1") like this:
smpl <- simdata[simdata$Time=="1" & simdata$treatment=="Trt",]
If you only want the CD column:
smpl <- simdata$CD[simdata$Time=="1" & simdata$treatment=="Trt"]

I think you want CD for Timepoint "1" and Treatment ="Trt"
subset(simdata, Time == 1 & treatment == "Trt", select = "CD")
alternatively for the whole data frame
subset(simdata, Time == 1 & treatment == "Trt")
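For a quick check that the two approaches agree, here is a minimal sketch that rebuilds one simulated data set and extracts CD for 'Trt' at time point 1 both ways. The seed and the names cd_trt_t1 and cd_subset are my own additions, not from the question.

```r
set.seed(123)  # assumed seed, only so the example is reproducible
ns <- 20
simdata <- data.frame(
  CD = c(rnorm(ns, 0, 1), rnorm(ns, 5, 5), rnorm(ns, 10, 5)),
  Time = factor(rep(c(1, 2, 3), each = ns)),
  # 20 labels, recycled by data.frame() across the 60 rows
  treatment = sample(rep(c("Trt", "placebo"), ns / 2))
)

# logical indexing on the CD vector
cd_trt_t1 <- simdata$CD[simdata$Time == 1 & simdata$treatment == "Trt"]

# same rows via subset(); [["CD"]] pulls the column out as a plain vector
cd_subset <- subset(simdata, Time == 1 & treatment == "Trt")[["CD"]]

identical(cd_trt_t1, cd_subset)
```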

Related

Create a list of decompositions for each variable combination in a for loop in R and extract components later

On my way to calculate each component of the time series for each X (50 levels) and Y (80 levels) from my dataset (df) I realised I probably need to create a list of lists of all the decomposition lists for those combinations.
Unfortunately, my code below brings me no closer to the solution. I know that on its own, the first lines work, until the decomposition bit. It creates a list of lists but then how do I extract components for each X*Y combination?
P <- df$X
for(y in 1:length(P)) {
OneP <- P[y]
AllS <- unique(df$Y[df$X== OneP])
for(i in 1:length(AllS)) {
OneS<- AllS[i]
df$TS[df$Y == OneS & df$X== OneP] <- ts(df$Mean[df$Y == OneS & df$X
== OneP], start = c(1999, 1), end = c(2015, 12), frequency = 12)
ListOfDec <- list()
for(d in 1:length(OneS)) {
# Run for-loop over lists
ListOfDec[df$Y == OneS & df$X== OneP] <- list(decompose(ts(df$TS[df$Y == OneS & df$X== OneP], frequency = 12), type = c("additive")))
}
df$Decomposition_seasonal[df$Y == OneS & df$X== OneP] <- what goes here?
}
}
Any advice is deeply appreciated.
Similar data:
X    Y  Mean  Date(mY)
Tru  A  35.6  02.2015
Fle  A  15    05.2010
Srl  C  67.1  05.1999
Tru  A  13.2  08.2006
Srl  B  89    08.2006
Tru  B  14.8  12.2001
Fle  A  21.5  11.2001
Lub  D  34.8  03.2000
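One possible way to organise this is to split the data by X and Y, run decompose() once per group, and keep the results in a named list so that any component can be pulled out later. This is only a sketch on made-up monthly data, since the sample above is too sparse to decompose; the names groups and decs and the 2 x 2 set of combinations are my own assumptions.

```r
set.seed(1)  # assumed seed for the made-up data
# 36 consecutive months for each of 2 x 2 hypothetical X/Y combinations
df <- expand.grid(month = 1:36, X = c("Tru", "Fle"), Y = c("A", "B"))
df$Mean <- rnorm(nrow(df), mean = 30, sd = 10)

# one data frame per X.Y combination, named e.g. "Tru.A"
groups <- split(df, list(df$X, df$Y), drop = TRUE)

# one decomposition per combination, stored under the same name
decs <- lapply(groups, function(g) {
  decompose(ts(g$Mean, start = c(1999, 1), frequency = 12), type = "additive")
})

# extract a component for one combination
head(decs[["Tru.A"]]$seasonal)
```

Note that decompose() needs at least two full periods per series, so combinations with fewer than 24 monthly values would have to be skipped or handled separately.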

Numeric vs Factors & IF Statements

I am trying to create a function for gender distribution. Is there a way to define a letter as something other than as.factor? I would like to operate func(F) instead of func("F"). Or should I go numeric: func(0), func(1), func(2)?
I also finished off the statement with an else that is designed to operate when left blank, but does not. If I whittle down the function to not include an IF statement a blank variable works fine:
genderDist <- function(){
cat("Female:", sum(voterData$GENDER == "F"))
}
Thanks in advance! Cheers!
Full Statement:
genderDist <- function(x){
if (x == "F"){
cat("Female:", sum(voterData$GENDER == "F"))
}
else if (x == "M"){
cat("Male:", sum(voterData$GENDER == "M"))
}
else if(x == "U"){
cat("Unknown:", sum(voterData$GENDER == ""))
}
else{
cat("Female:", sum(voterData$GENDER == "F"))
cat("Male:", sum(voterData$GENDER == "M"))
cat("Unknown:", sum(voterData$GENDER == ""))
}
}
Desired results:
genderDist(F) gives count of Females
genderDist(M) gives count of Males
genderDist(U) gives count of Unknown
genderDist() gives count for all the above
There are several possibilities for coding gender, besides factor:
1. as character, not as factor. You will still have to call your function like func("F").
2. You already thought of using numeric yourself. Disadvantage is that it may be unclear if 1 is male or female.
3. The best option IMHO would be to go binary. Name your column "male" and use TRUE, FALSE and NA for unknown. The binary also works great in your if statement. Start with if(is.na(male)) ... ; else if(male).
EDIT
But to achieve your desired outcome, the coding of gender is not the issue, I would take this approach:
#First, define variables Fe, Ma and Un
#WARNING: Do NOT USE 'F', as 'F' is an abbr. for 'FALSE'!!
Fe <- "F"
Ma <- "M"
Un <- "U"
#now define a lookup data frame for convenience
LT <- data.frame(code = c(Fe,Ma,Un), name = c("Female","Male","Unknown"), stringsAsFactors = FALSE)
# then define your function without needing an if/else chain
genderDist <- function(x){
cat(LT[LT$code == x,"name"], sum(voterData$GENDER == x))
}
Introduce some fake data:
voterData <- data.frame(GENDER= c("F","F","F","M","M","U"))
Then run function:
> genderDist(Fe)
Female 3
> genderDist(Ma)
Male 2
> genderDist(Un)
Unknown 1
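To also cover the genderDist() call with no argument from the question, one option is to give x a default value and test for it. The version in the question fails there because x == "F" is evaluated while x is missing. This is a sketch that assumes a default of NULL is acceptable and reuses the fake data above.

```r
voterData <- data.frame(GENDER = c("F", "F", "F", "M", "M", "U"),
                        stringsAsFactors = FALSE)

genderDist <- function(x = NULL) {
  if (is.null(x)) {
    # no code given: return the counts for every code at once
    return(table(voterData$GENDER))
  }
  # one code given: count the matching rows
  sum(voterData$GENDER == x)
}

genderDist("F")  # count for one code
genderDist()     # counts for all codes
```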

speeding up boolean logic loop in R

I am very new to R but I am interested in learning more and improving.
I have a dataset with around 40,000+ rows containing the length of neuron segments. I want to compare the length trends of neurons of different groups. The first step in this analysis involves sorting the measurements into 1 of 6 different categories such as '<10' '10-15', '15-20', '20-25', '25-30', and '>30'.
I created these categories as appended columns using 'mutate' from the 'dplyr' package and now I am trying to write a boolean function to determine where the measurement fits by applying a value of '1' to the corresponding column if it fits, and a '0' if it doesn't.
Here is what I wrote:
for (i in 1:40019) {
{if (FinalData$Length[i] <=10)
{FinalData$`<10`[i]<-1
} else {FinalData$`<10`[i]<-0}} #Fills '<10'
if (FinalData$Length[i] >=10 & FinalData$Length[i]<15){
FinalData$`10-15`[i]<-1
} else{FinalData$`10-15`[i]<-0} #Fills'10-15'
if (FinalData$Length[i] >=15 & FinalData$Length[i]<20){
FinalData$`15-20`[i]<-1
} else{FinalData$`15-20`[i]<-0} #Fills '15-20'
if (FinalData$Length[i] >=20 & FinalData$Length[i]<25) {
FinalData$`20-25`[i]<-1
} else{FinalData$`20-25`[i]<-0} #Fills '20-25'
if(FinalData$Length[i] >=25 & FinalData$Length[i]<30){
FinalData$`25-30`[i]<-1
} else{FinalData$`25-30`[i]<-0} #Fills '25-30'
if(FinalData$Length[i] >=30){
FinalData$`>30`[i]<-1
} else{FinalData$`>30`[i]<-0} #Fills '>30'
}
This seems to work, but it takes a long time:
system.time(source('~/Desktop/Home/Programming/R/Boolean Loop R.R'))
user system elapsed
94.408 19.147 118.203
The way I coded this seems very clunky and inefficient. Is there a faster and more efficient way to code something like this or am I doing this appropriately for what I am asking for?
Here is an example of some of the values I am testing:
'Length': 14.362, 12.482337, 8.236, 16.752, 12.045
If I am not being clear about how the data frame is structured, here is a screenshot:
[screenshot: how my data frame is organized]
You can use the cut function in R. It is used to convert numeric values to factors:
x<-c(1,2,4,2,3,5,6,5,6,5,8,0,5,5,4,4,3,3,3,5,7,9,0,5,6,7,4,4)
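cut(x = x,breaks = c(0,3,6,9,12),labels = c("grp1","grp2","grp3","grp4"),right=FALSE)
Set right = TRUE or right = FALSE as needed: TRUE makes the intervals closed on the right, (a, b], and FALSE makes them closed on the left, [a, b).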
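Applied to the six categories from the question, that could look like the sketch below. The breaks are my reading of the question: 10 is assumed to belong to '10-15' and 30 to '>30', and the names len and grp are mine.

```r
len <- c(14.362, 12.482337, 8.236, 16.752, 12.045)  # sample values from the question

grp <- cut(len,
           breaks = c(-Inf, 10, 15, 20, 25, 30, Inf),
           labels = c("<10", "10-15", "15-20", "20-25", "25-30", ">30"),
           right = FALSE)  # intervals are [a, b), so 10 falls in "10-15"

as.character(grp)
table(grp)
```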
You can vectorise that as follows (I made a sample of some data called DF)
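DF <- data.frame(1:40000,sample(letters,40000,replace=TRUE),"Length"=sample(1:40,40000,replace=TRUE))
MyFunc <- function(x) {
# Write the labels to a separate vector: assigning them into x would coerce x
# to character and turn the later comparisons into string comparisons
out <- character(length(x))
out[x < 10] <- "<10"
out[x >= 10 & x < 15] <- "10-15"
out[x >= 15 & x < 20] <- "15-20"
out[x >= 20 & x < 25] <- "20-25"
out[x >= 25 & x < 30] <- "25-30"
out[x >= 30] <- ">30"
return(out)
}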
DF$Group <- MyFunc(DF[,3])
If it has to be 6 columns like that, you can modify the above to return a one or zero for the appropriate size and everything else, respectively, for each of the 6 columns.
Edit: I guess a series of ifelse calls might be best if it really has to be 6 columns like that. ifelse is already vectorised, so no sapply is needed.
e.g.
DF$'<10' <- ifelse(DF$Length < 10,1,0)
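If all six 0/1 columns are needed, they can be derived from a single grouping step instead of six separate tests. The sketch below assumes the same bin edges as the question (with 10 counted in '10-15') and uses the question's sample values; labs, grp and ind are names I made up.

```r
FinalData <- data.frame(Length = c(14.362, 12.482337, 8.236, 16.752, 12.045))
labs <- c("<10", "10-15", "15-20", "20-25", "25-30", ">30")

# assign every length to exactly one category
grp <- cut(FinalData$Length,
           breaks = c(-Inf, 10, 15, 20, 25, 30, Inf),
           labels = labs, right = FALSE)

# one 0/1 column per category, named after the category
ind <- sapply(labs, function(l) as.integer(grp == l))
FinalData <- cbind(FinalData, ind)
FinalData
```

Because the whole column is processed at once, this avoids the row-by-row loop that made the original version slow.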

Extend conditions in a dynamic way

I am trying to build a decision table. At time 3 for example I have to take the previous results in time t=1 and time t=2 in order to make my decision in time 3. The decision table is going to be pretty big so I am considering an efficient way to do it by building a function. For instance at time 3:
rm(list=ls()) # clear memory
names <- c("a","b","c","d","e","f","g","h","i","j","k50","l50","m50","n50","o50")
proba <- c(1,1,1,1,1,1,1,1,1,1,0.5,0.5,0.5,0.5,0.5)
need <- 4
re <- 0.5
w <- 1000000000
# t1
t1 <- as.integer(names %in% (sample(names,need,prob=proba,replace=F)))
# t2
t2 <- rep(t1)
# t3
proba3 <- ifelse(t2==1,proba*re,proba)
t3 <- as.integer(names %in% (sample(names,need,prob=proba3,replace=F)))
Now the table is going to be big until t=7 with proba7 which takes condition from t=1 to t=6. After t=7 it always takes the 6 previous outcomes plus the random part proba in order to make decision. In other words the ifelse must be dynamic in order that I can call it later. I have been trying something like
probF <- function(a){
test <- ifelse(paste0("t",a,sep="")==1,proba*re,proba)
return(test)
}
test <- probF(2)
but there is an error as I got just one value and not a vector. I know that it looks complicated
For the conditions requested by one person (I know it's not very well written):
proba7 <- ifelse(t2==1 & t3==1 & t4==0 & t5==0 & t6==0,proba,
ifelse(t2==1 & t3==0 & t4==0 & t5==1 & t6==1,proba*re,
ifelse(t2==1 & t3==0 & t4==0 & t5==0 & t6==1, w,
ifelse(t2==0 & t3==1 & t4==1 & t5==0 & t6==0,proba,
ifelse(t2==0 & t3==1 & t4==1 & t5==1 & t6==0,0,
ifelse(t2==0 & t3==0 & t4==1 & t5==1 & t6==1,0,
ifelse(t2==0 & t3==0 & t4==1 & t5==1 &t6==0,0,
ifelse(t2==0 & t3==0 & t4==0 & t5==1 & t6==1, proba*re,
ifelse(t2==0 & t3==0 & t4==0 & t5==0 & t6==1,w,proba)))))))))
t7 <- as.integer(names %in% (sample(names,need,prob=proba7,replace=F)))
If you take a bit of a different approach, you'll gain quite a lot of speed.
First of all, it is really a terribly bad idea to store every step as a separate t1, proba1, etc. If you need to keep all that information, predefine a matrix or list of the right size and store everything in there. That way you can use simple indices instead of having to resort to the bug-prone use of get(). If you find yourself typing get(), almost always it's time to stop and rethink your solution.
Secondly, you can use a simple principle to select the indices of the test t:
seq(max(1, i-6), i-1)
will allow you to use a loop index i and refer to the 6 previous positions if they exist.
Thirdly, depending on what you want, you can reformulate your decision as well. If you store every t as a row in the matrix, you can simply use colSums() and check whether that one is larger than 0. Based on that index, you can update the probabilities in such a way that a 1 in any of the previous 6 rows halves the probability.
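To make the window index and the colSums() check concrete, here is a tiny worked example on a hand-written matrix. The four items and their values are hypothetical; only the indexing pattern is the point.

```r
# three stored steps (rows) for four hypothetical items (columns)
theT <- rbind(c(1, 0, 0, 0),
              c(0, 0, 1, 0),
              c(0, 0, 0, 0))
proba <- c(1, 1, 1, 0.5)
re <- 0.5

i <- 4                                          # the step about to be simulated
tid <- seq(max(1, i - 6), i - 1)                # rows 1..3: at most the 6 previous steps
pid <- colSums(theT[tid, , drop = FALSE]) > 0   # items chosen somewhere in that window

p <- proba
p[pid] <- p[pid] * re                           # reduce only the items that were chosen
p
```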
Wrapping everything in a function would then look like:
myfun <- function(names, proba, need, re,
w=100){
# For convenience, so I don't have to type this twice
resample <- function(p){
as.integer(
names %in% sample(names,need,prob=p, replace = FALSE)
)
}
# get the number of needed columns
nnames <- length(names)
# create two matrices to store all the t-steps and the probabilities used
theT <- matrix(nrow = w, ncol = nnames)
theproba <- matrix(nrow = w, ncol = nnames)
# Create a first step, using the original probabilities
theT[1,] <- resample(proba)
theproba[1,] <- proba
# loop over the other simulations, each time checking the condition
# recalculating the probability and storing the result in the next
# row of the matrices
for(i in 2:w){
# the id vector to select the (maximal) 6 previous rows. If
# i-6 is smaller than 1 (i.e. there are no 6 steps yet), the
# max(1, i-6) guarantees that you start minimal at 1.
tid <- seq(max(1, i-6), i-1)
# Create the probability vector from the original one
p <- proba
# look for which columns in the 6 previous steps contain a 1
pid <- colSums(theT[tid,,drop = FALSE]) > 0
# update the probability vector, scaling by the reduction factor re
p[pid] <- p[pid]*re
# store the next step and the used probabilities in the matrices
theT[i,] <- resample(p)
theproba[i,] <- p
}
# Return both matrices in a single list for convenience
return(list(decisions = theT,
proba = theproba)
)
}
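which can be used as:
# here w is the number of t-steps to simulate; do not pass the w = 1000000000
# from the question, which would try to allocate matrices with that many rows
myres <- myfun(names, proba, need, re, w = 100)
head(myres$decisions)
head(myres$proba)
This returns a list of two matrices, decisions and proba, where every row is one t-point in the decision table.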

Create a function in R that accomplishes the following

Columns A - F are identity columns - (1,0). Column G has the values "WLB0", "WLB2": "WLB10" and "WLB46", "WLB89".
I am trying to do the following for every combination of columns A-F with column G.
I am looking for a function to call instead of using this very awkward code that I wrote.
The test data is available for download at the bottom.
X1 <- {dd <- subset(TEST, TEST$A == 1 & TEST$G =="WLB10"); de <-transform(dd, RP = sum(dd$I)/sum(dd$H));mean(de$RP)}
X2 <- {dd <- subset(TEST, TEST$A == 1 & TEST$G =="WLB8"); de <-transform(dd, RP = sum(dd$I)/sum(dd$H));mean(de$RP)}
X3 <- {dd <- subset(TEST, TEST$B == 1 & TEST$G =="WLB10"); de <-transform(dd, RP = sum(dd$I)/sum(dd$H));mean(de$RP)}
TEST1$finalnumber <-ifelse(TEST1$A == 1 & TEST1$G == "WLB10", X1,
ifelse(TEST1$A == 1 & TEST1$G == "WLB8", X2,
ifelse(TEST1$B == 1 & TEST1$G == "WLB10", X3, 0)))
Test data
"https://s3.amazonaws.com/RProgramming/TEST.csv"
"https://s3.amazonaws.com/RProgramming/TEST1.csv"
I'm a bit confused about the purpose of setting RP to be constant across the rows of de, but the bit of code below will get you some way along, I hope. ddply and melt are two great functions for this sort of thing.
library(plyr)
library(reshape)
long <- melt(TEST, measure.vars=LETTERS[1:6])
#long <- subset(long, value == 1)
shorter <- ddply(long, .(G, variable, value), summarize, RP=sum(I)/sum(H))
You can uncomment the line to just get subtotals corresponding to 1, but I thought it was illustrative to show you how it works.
You can then do a similar melt on TEST1, and carry out a lookup for the relevant value:
long <- melt(TEST1, measure.vars=LETTERS[1:6])
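# include value in the key: shorter has one row per (G, variable, value), so matching on G and variable alone could pick up the value == 0 row
ind <- match(paste0(long$G, long$variable, long$value), paste0(shorter$G, shorter$variable, shorter$value))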
long$final <- shorter$RP[ind]
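If you would rather avoid the extra packages, the same ratio can be computed in base R with a small helper that is called once per column/G combination. This is a sketch on a tiny made-up TEST data frame (the real CSV is not reproduced here), and ratio_for and combos are hypothetical names.

```r
# made-up stand-in for TEST, with only two indicator columns
TEST <- data.frame(A = c(1, 1, 0, 0, 1, 0),
                   B = c(0, 1, 1, 0, 0, 1),
                   G = c("WLB10", "WLB8", "WLB10", "WLB8", "WLB10", "WLB8"),
                   H = c(2, 4, 5, 10, 8, 1),
                   I = c(1, 2, 10, 5, 4, 3),
                   stringsAsFactors = FALSE)

# sum(I) / sum(H) over the rows where indicator column `col` is 1 and G equals `g`
ratio_for <- function(data, col, g) {
  rows <- data[[col]] == 1 & data$G == g
  sum(data$I[rows]) / sum(data$H[rows])
}

# every indicator column paired with every value of G
combos <- expand.grid(col = c("A", "B"), G = unique(TEST$G),
                      stringsAsFactors = FALSE)
combos$RP <- mapply(ratio_for, combos$col, combos$G,
                    MoreArgs = list(data = TEST), USE.NAMES = FALSE)
combos
```

A combination with no matching rows gives 0/0, which R reports as NaN, so those cases may need a separate default such as 0.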
