Sequence manipulation - r

I have a matrix that is equivalent to a 96-well plate commonly used in microbiology. In that matrix I randomized 12 treatments each 8 times. I printed a kind of guide in order to follow the patten in the lab easily, and then after measurements I merged the randomized plate to the data.
cipPlate <- c(rep(c(seq(0,50,5),"E"),8)); cipPlate
rcipPlate <- array(sample(cipPlate),dim=c(8,12),dimnames=dimna); rcipPlate
platecCIP <- melt(rcipPlate); platecCIPbbCIP
WellCIP <- paste(platecCIP$Var1,platecSTR$Var2,sep=''); WellCIP
bbCIP <- data.frame(Well=WellCIP,ID=platecCIP$value); bbCIP
That works fine, except that the numbers in the sequence created for this are characters instead of integers. Then when i try to use ggplo2 to plot this with the measurements instead of plotting in the x -axis (o,5,10,15,...,50) it goes (0,10,15,...,45,5,50)
Is there a way to avoid this, or to make the numbers inside these sequence to represent the actual number as an integer instead of a character
BTW: sorry for the clumpsy code, I'm not an expert and it works good enough so that i can use it further.

Related

Translating a for-loop to perhaps an apply through a list

I have a r code question that has kept me from completing several tasks for the last year, but I am relatively new to r. I am trying to loop over a list to create two variables with a specified correlation structure. I have been able to "cobble" this together with a "for" loop. To further complicate matters, I need to be able to put the correlation number into a data frame two times.
For my ultimate usage, I am concerned about speed, efficiency, and long-term effectiveness of my code.
library(mvtnorm)
n=100
d = NULL
col = c(0, .3, .5)
for (j in 1:length(col)){
X.corr = matrix(c(1, col[j], col[j], 1), nrow=2, ncol=2)
x=rmvnorm(n, mean=c(0,0), sigma=X.corr)
x1=x[,1]
x2=x[,2]
}
d = rbind(d, c(j))
Let me describe my code, so my logic is clear. This is part of a larger simulation. I am trying to draw 2 correlated variables from the mvtnorm function with 3 different correlation levels per pass using 100 observations [toy data to get the coding correct]. d is a empty data frame. The 3 correlation levels will occur in the following way pass 1 uses correlation 0 then create the variables, and yes other code will occur; pass 2 uses correlation .3 to create 2 new variables, and then other code will occur; pass 3 uses correlation .5 to create 2 new variables, and then other code will occur. Within my larger code, the for-loop gets the job done. The last line puts the number of the correlation into the data frame. I realize as presented here it will only put 1 number into this data frame, but when it is incorporated into my larger code it works as desired by putting 3 different numbers in a single column (1=0, 2=.3, and 3=.5). To reiterate, the for-loop gets the job done, but I believe there is a better way--perhaps something in the apply family. I do not know how to construct this and still access which correlation is being used. Would someone help me develop this little piece of code? Thank you.

Making a histogram

this sounds pretty basic but every time I try to make a histogram, my code is saying x needs to be numeric. I've been looking everywhere but can't find one relating to my problem. I have data with 240 obs with 5 variables.
Nipper length
Number of Whiskers
Crab Carapace
Sex
Estuary location
There is 3 locations and i'm trying to make a histogram with nipper length
I've tried making new factors and levels, with the 80 obs in each location but its not working
Crabs.data <-read.table(pipe("pbpaste"),header = FALSE)##Mac
names(Crabs.data)<-c("Crab Identification","Estuary Location","Sex","Crab Carapace","Length of Nipper","Number of Whiskers")
Crabs.data<-Crabs.data[,-1]
attach(Crabs.data)
hist(`Length of Nipper`~`Estuary Location`)
Error in hist.default(Length of Nipper ~ Estuary Location) :
'x' must be numeric
Instead of correct result
hist() doesn't seem to like taking more than one variable.
I think you'd have the best luck subsetting the data, that is, making a vector of nipper lengths for all crabs in a given estuary.
crabs.data<-read.table("whatever you're calling it")
names<-(as you have it)
Estuary1<-as.vector(unlist(subset(crabs.data, `Estuary Loc`=="Location", select = `Length of Nipper`)))
hist(Estuary1)
Repeat the last two lines for your other two estuaries. You may not need the unlist() command, depending on your table. I've tended to need it for Excel files, but I don't know what format your table is in (that would've been helpful).

TraMineR, Extract all present combination of events as dummy variables

Lets say I have this data. My objective is to extraxt combinations of sequences.
I have one constraint, the time between two events may not be more than 5, lets call this maxGap.
User <- c(rep(1,3)) # One users
Event <- c("C","B","C") # Say this is random events could be anything from LETTERS[1:4]
Time <- c(c(1,12,13)) # This is a timeline
df <- data.frame(User=User,
Event=Event,
Time=Time)
If want to use these sequences as binary explanatory variables for analysis.
Given this dataframe the result should be like this.
res.df <- data.frame(User=1,
C=1,
B=1,
CB=0,
BC=1,
CBC=0)
(CB) and (CBC) will be 0 since the maxGap > 5.
I was trying to write a function for this using many for-loops, but it becomes very complex if the sequence becomes larger and the different number of evets also becomes larger. And also if the number of different User grows to 100 000.
Is it possible of doing this in TraMineR with the help of seqeconstraint?
Here is how you would do that with TraMineR
df.seqe <- seqecreate(id=df$User, timestamp=df$Time, event=df$Event)
constr <- seqeconstraint(maxGap=5)
subseq <- seqefsub(df.seqe, minSupport=0, constraint=constr)
(presence <- seqeapplysub(subseq, method="presence"))
which gives
(B) (B)-(C) (C)
1-(C)-11-(B)-1-(C) 1 1 1
presence is a table with a column for each subsequence that occurs at least once in the data set. So, if you have several individuals (event sequences), the table will have one row per individual and the columns will be the binary variable you are looking for. (See also TraMineR: Can I get the complete sequence if I give an event sub sequence? )
However, be aware that TraMineR works fine only with subsequences of length up to about 4 or 5. We suggest to set maxK=3 or 4 in seqefsub. The number of individuals should not be a problem, nor should the number of different possible events (the alphabet) as long as you restrict the maximal subsequence length you are looking for.
Hope this helps

R: how to divide a vector of values into fixed number of groups, based on smallest distance?

I think I have a rather simple problem but I can't figure out the best approach. I have a vector with 30 different values. Now I need to divide the vector into 10 groups in such a way that the mean within group variance is as small as possible. the size of the groups is not important, it can anything between one and 21.
Example. Let's say I have vector of six values, that I have to split into three groups:
Myvector <- c(0.88,0.79,0.78,0.62,0.60,0.58)
Obviously the solution would be:
Group1 <-c(0.88)
Group2 <-c(0.79,0.78)
Group3 <-c(0.62,0.60,0.58)
Is there a function that gives the same outcome as the example and that I can use for my vector withe 30 values?
Many thanks in advance.
It sounds like you want to do k-means clustering. Something like this would work
kmeans(Myvector,3, algo="Lloyd")
Note that I changed the default algorithm to match your desired output. If you read the ?kmeans help page you will see that there are different algorithms to calculate the different clusters because it's not a trivial computational problem. They might necessarily guarantee optimality.

R spline function given a fixed space

So, I need to generate a spline function to feed it into another program which only accepts a fixed space between consecutive points. So, I used spline function in R with a given number of points to genrate spline, however, the floating-point cutoff makes the space among the points variable, for example:
spline(d$V1, d$V2, n=(max(d$V1)-min(d$V1))/0.0200)
> head(t.spl, 7)
x y
1 2.3000 -3.0204
2 2.3202 -3.0204
3 2.3404 -3.0204
4 2.3606 -3.0204
5 2.3807 -3.0204
6 2.4009 -3.0204
7 2.4211 -3.0204
so, the space between 1st 1nd 2nd row is 0.0202, while between 4th and 5th is 0.0201. So because of this problem, the other program that I am feeding this spline into, doesn't accept this. So, is there any way to make this work?
As an aside: please provide a reproducible example next time (I can't copy/paste your code in because I don't have d or t.spl)
I think you'll find that the different intervals (0.0202 vs 0.0201) is an artifact of the number of characters you are printing on the screen, not of the spline function.
It seems R is printing 4 digits after the decimal point for you for neatness, so it's doing the rounding only for the purposes of displaying the results to you.
You can see how many digits are displayed with options('digits')$digits, and adjust it with options(digits=new_number_of_digits) (see ?options for details).
For example:
options(digits=4)
pi
# 3.142
options(digits=10)
pi
# 3.141592654
In summary, when you feed the values in to your other program, make sure you print the values with enough decimal points that the other program accepts the intervals as being "equal".
If you are writing to a file, for example, just make sure you write enough digits out. If you are copy-pasting from the R console, make sure you adjust R to print out enough digits.
MathematicalCoffee is probably right. I'm just adding an alternative for the sake of wordiness.
myspline <- splinefun(dV$1,dV$2)
mydata.y <- myspline(desired_x_values,deriv=0)
Will guarantee the uniform x-spacings you desire.

Resources