I'm using LSD.test from the agricolae package.
Below is a reproducible example:
library('agricolae')
group <- c(1,1,1,2,2,2,3,3,3)
variable <- c(1,2,1.5,10,11,12,22,23,21)
df <- data.frame(cbind(group,variable))
model <- aov(variable~group,data=df)
output <- LSD.test(model, "group", p.adj = "bonferroni")
output
I'm getting the output below, which is great:
$statistics
MSerror Df Mean CV t.value MSD
0.8035714 7 11.5 7.794969 3.127552 2.289134
$parameters
test p.ajusted name.t ntr alpha
Fisher-LSD bonferroni group 3 0.05
$means
variable std r LCL UCL Min Max Q25 Q50 Q75
1 1.5 0.5 3 0.2761907 2.723809 1 2 1.25 1.5 1.75
2 11.0 1.0 3 9.7761907 12.223809 10 12 10.50 11.0 11.50
3 22.0 1.0 3 20.7761907 23.223809 21 23 21.50 22.0 22.50
$comparison
NULL
$groups
variable groups
3 22.0 a
2 11.0 b
1 1.5 c
attr(,"class")
[1] "group"
I wanted to extract the median and letter from this output.
To extract the median of group 3, for example, I used
output[[5]][[1]][[1]]
which gives this output:
[1] 22
Till now, everything is fine. I'll explain the problem and ask the question below.
Now, I need to extract the letter as well.
I tried the following code:
output[[5]][[2]][[1]]
[1] a
Levels: a b c
My question is:
Is there any way to get rid of the Levels: a b c statement in the code and get only the letter?
Many thanks in advance.
as.character(output[[5]][[2]][[1]])
Solved it, thanks to @Tim Biegeleisen's comment.
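For readability, the same pieces can also be pulled out by name rather than by position. A minimal sketch, assuming the LSD.test result was stored in output as above:
res <- output$groups                  # the $groups data frame
res["3", "variable"]                  # value for group 3 -> 22
as.character(res["3", "groups"])      # letter for group 3 -> "a"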
This is my DF:
Con1 Con2 Con3 Con4
1 45 576
2 23 1234
3 67 345
4 22 44
5 5 567
For each column, I want to find the mean and the SD.
Then, for each cell in that column, I want to apply a normal distribution calculation to find the probability of that cell's value.
For example, if Con1's mean is 32.4 and its SD is 4, I want to take each number in this column, apply the normal distribution to find its probability, and then replace the number with that probability.
The output would look something like this:
Con1 Con2 Con3 Con4
1 0.6 0.455
2 0.34 0.09
3 0.23 0.12
4 0.1 0.55
5 0.7 0.88
Any help?
In base R you can do this with...
sapply(df, function(x) pnorm(x, mean = mean(x), sd = sd(x)))
Con1 Con2
[1,] 0.7002401 0.5207649
[2,] 0.3476271 0.9400139
[3,] 0.9253371 0.3172112
[4,] 0.3323590 0.1224208
[5,] 0.1267551 0.5125718
This uses pnorm, which is the cumulative normal distribution function. If you want the density instead, use dnorm. You might also like to have a look at the scale function to normalise values.
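For instance, a quick sketch of those two alternatives on the same df as above:
# Density of each value under its column's fitted normal (not a cumulative probability)
sapply(df, function(x) dnorm(x, mean = mean(x), sd = sd(x)))
# Standardise each column to z-scores (mean 0, SD 1) instead
scale(df)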
OK, so I have a pretty large data set of around 500 observations and 3 variables. The first column refers to time.
For a test data set I am using:
dat <- as.data.frame(matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                              1, 1.8, 3.5, 3.8, 5.6, 6.2, 7.8, 8.2, 9.8, 10.1,
                              2, 4.8, 6.5, 8.8, 10.6, 12.2, 14.8, 16.2, 18.8, 20.1), 10, 3))
colnames(dat) <- c("Time", "Var1", "Var2")
Time Var1 Var2
1 1 1.0 2.0
2 2 1.8 4.8
3 3 3.5 6.5
4 4 3.8 8.8
5 5 5.6 10.6
6 6 6.2 12.2
7 7 7.8 14.8
8 8 8.2 16.2
9 9 9.8 18.8
10 10 10.1 20.1
So what I need to do is create new columns where each observation is the slope, with respect to time, of some number of past points. For example, taking 3 past points it would be something like:
slopeVar1[i] = slope(Var1[(i-2):i], Time[(i-2):i])  # not real code
slopeVar2[i] = slope(Var2[(i-2):i], Time[(i-2):i])  # not real code
Time Var1 Var2 slopeVar1 slopeVar2
1 1 1 2 NA NA
2 2 1.8 4.8 NA NA
3 3 3.5 6.5 1.25 2.25
4 4 3.8 8.8 1.00 2.00
5 5 5.6 10.6 1.05 2.05
6 6 6.2 12.2 1.20 1.70
7 7 7.8 14.8 1.10 2.10
8 8 8.2 16.2 1.00 2.00
9 9 9.8 18.8 1.00 2.00
10 10 10.1 20.1 0.95 1.95
I actually got as far as a working for() loop, but for really large data sets (>100,000 rows) it starts taking too long.
The for() loop I used is shown below:
# CREATE DATA FRAME
rm(dat)
dat <- as.data.frame(matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                              1, 1.8, 3.5, 3.8, 5.6, 6.2, 7.8, 8.2, 9.8, 10.1,
                              2, 4.8, 6.5, 8.8, 10.6, 12.2, 14.8, 16.2, 18.8, 20.1), 10, 3))
colnames(dat) <- c("Time", "Var1", "Var2")
dat
plot(dat)

# CALCULATE SLOPE OF n POINTS FROM i-n+1 TO i.
# In this case I am taking just 3 points, but it should
# be possible to change the number of points taken.
attach(dat)
n <- 3          # number of points to take the slope over
l <- nrow(dat)  # number of iterations
slopeVar1 <- NA
slopeVar2 <- NA
for (i in 1:l) {
  if (i < n) {slopeVar1[i] <- NA}  # not enough previous observations: output NA
  if (i >= n) {
    y1 <- Var1[(i-n+1):i]        # y values for the slope of Var1
    y2 <- Var2[(i-n+1):i]        # y values for the slope of Var2
    x  <- Time[(i-n+1):i]        # x values for the slopes of Var1 & Var2
    z1 <- lm(y1 ~ x)             # fit for Var1 over the window
    z2 <- lm(y2 ~ x)             # fit for Var2 over the window
    slopeVar1[i] <- coef(z1)[2]  # populating the slopeVar1 vector
    slopeVar2[i] <- coef(z2)[2]  # populating the slopeVar2 vector
  }
}
slopeVar1  # checking results
slopeVar2
(result <- cbind(dat, slopeVar1, slopeVar2))  # bind the original data with the new slope columns
This code outputs exactly what I want, but again, for really large data sets it is quite inefficient.
This quick rollapply implementation seems to speed it up somewhat:
library("zoo")
slope_func = function(period) {
y1=period[,2] #y data sets for calculating slope of Var1
y2=period[,3] #y data sets for calculating slope of Var2
x=period[,1] #x data sets for calculating slope of Var1&Var2
z1=lm(y1~x) #Temporal value of slope of Var1
z2=lm(y2~x) #Temporal value of slope of Var2
slope1=as.data.frame(z1[1]) #Temporal value of slope of Var1
slopeVar1[i]=slope1[2,1] #Populating string of slopeVar1
slope2=as.data.frame(z1[1])#Temporal value of slope of Var2
slopeVar2[i]=slope2[2,1] #Populating string of slopeVar2
}
}
start = Sys.time()
rollapply(dat[1:3], FUN=slope_func, width=3, by.column=FALSE)
end=Sys.time()
print(end-start)
Time difference of 0.04980111 secs
The OP's previous implementation was taking Time difference of 0.2666121 secs on the same data.
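If you also want the slopes lined up against the original rows, with NA for the first width - 1 rows as in the desired output, something along these lines should work (a sketch, assuming the slope_func above):
library(zoo)
slopes <- rollapplyr(dat[1:3], width = 3, FUN = slope_func, by.column = FALSE, fill = NA)
result <- cbind(dat, coredata(slopes))  # NA-padded slopes aligned with dat
result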
I have data like so:
aye <- c(0,0,3,4,5,6)
bee <- c(3,4,0,0,7,8)
see <- c(9,8,3,5,0,0)
df <- data.frame(aye, bee, see)
I am looking for a concise way to replace the values in each column with that column's mean, where zero is kept at zero.
To obtain the mean excluding zero:
df2 <- as.data.frame(t(apply(df, 2, function(x) mean(x[x>0]))))
I can't figure out how to simply replace the values in the column with the mean excluding zero. My approach so far is:
df$aye <- ifelse(df$aye == 0, 0, df2$aye)
df$bee <- ifelse(df$bee == 0, 0, df2$bee)
df$see <- ifelse(df$see == 0, 0, df2$see)
But this gets messy with many variables; it would be nice to wrap it all up in one function.
Thanks for your help!
Why can't we just use
data.frame(lapply(dat, function (u) ave(u, u > 0, FUN = mean)))
# aye bee see
#1 0.0 5.5 6.25
#2 0.0 5.5 6.25
#3 4.5 0.0 6.25
#4 4.5 0.0 6.25
#5 4.5 5.5 0.00
#6 4.5 5.5 0.00
Note that I used dat rather than df as the name of your data frame: df is a function in R (the F distribution density), so it's best not to mask it.
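To see why the zeros stay at zero: ave() computes the mean within each level of the grouping u > 0, and the mean of the all-zero group is itself 0. For example, on the aye column alone:
u <- c(0, 0, 3, 4, 5, 6)
ave(u, u > 0, FUN = mean)
# [1] 0.0 0.0 4.5 4.5 4.5 4.5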
We can keep the result of the apply call as a numeric vector x:
x <- apply(df, 2, function(x){ mean(x[x>0])})
df[which(df!=0, arr.ind = T)] <- x[ceiling(which(df!=0)/nrow(df))]
df
# aye bee see
#1 0.0 5.5 6.25
#2 0.0 5.5 6.25
#3 4.5 0.0 6.25
#4 4.5 0.0 6.25
#5 4.5 5.5 0.00
#6 4.5 5.5 0.00
Breaking the code down further to explain how it works.
This gives the indices where the value is not zero:
which(df != 0)
#[1] 3 4 5 6 7 8 11 12 13 14 15 16
This line decides which index we are going to select from x
ceiling(which(df!=0)/nrow(df))
#[1] 1 1 1 1 2 2 2 2 3 3 3 3
x[ceiling(which(df!=0)/nrow(df))]
#aye aye aye aye bee bee bee bee see see see see
#4.50 4.50 4.50 4.50 5.50 5.50 5.50 5.50 6.25 6.25 6.25 6.25
Now substitute the above values into the data frame wherever the value isn't 0:
df[which(df!=0, arr.ind = T)] <- x[ceiling(which(df!=0)/nrow(df))]
Try rearranging what you already have into a zeroless_mean function, and then use apply on each column of your data.frame:
# Data
aye <- c(0,0,3,4,5,6)
bee <- c(3,4,0,0,7,8)
see <- c(9,8,3,5,0,0)
dff <- data.frame(aye, bee, see)
# Function
zeroless_mean <- function(x) ifelse(x==0,0,mean(x[x!=0]))
# apply
data.frame(apply(dff, 2, zeroless_mean))
# Output
aye bee see
1 0.0 5.5 6.25
2 0.0 5.5 6.25
3 4.5 0.0 6.25
4 4.5 0.0 6.25
5 4.5 5.5 0.00
6 4.5 5.5 0.00
I hope this helps.
I am looking for an explicit function to subscript elements in R, say subscript(x,i) to mean x[i].
The reason I need this traces back to a piece of code using dplyr and the magrittr pipe operator (which, famously, "is not a pipe"), where I need to divide by the first element of each column. Roughly (pseudocode):
pipedDF <- rawdata %>% filter, merge, summarize, dcast %>%
  mutate_each(funs(. / subscript(., 1)), -index)
I think this would do the trick and keep that pipe syntax which people like.
Without dplyr it would look like this...
Example:
> df
index a b c
1 1 6.00 5.0 4
2 2 7.50 6.0 5
3 3 5.00 4.5 6
4 4 9.00 7.0 7
> data.frame(sapply(df, function(x)x/x[1]))
index a b c
1 1 1.00 1.0 1.00
2 2 1.25 1.2 1.25
3 3 0.83 0.9 1.50
4 4 1.50 1.4 1.75
You should be able to use '[', as in
x<-5:1
'['(x,2)
# [1] 4
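Plugged back into the pipeline idea from the question, that could look something like this (a sketch using the df printed above and the mutate_each/funs syntax from the question):
library(dplyr)
df %>% mutate_each(funs(. / `[`(., 1)), -index)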
Say I have an "integer" factor vector of length 5:
vecFactor = c(1,3,2,2,3)
and another "integer" data vector of length 5:
vecData = c(1.3,4.5,6.7,3,2)
How can I find the average of the data in each factor, so that I would get a result of:
Factor 1: Average = 1.3
Factor 2: Average = 4.85
Factor 3: Average = 3.25
tapply(vecData, vecFactor, FUN=mean)
1 2 3
1.30 4.85 3.25
I sometimes use a linear model instead of tapply for this, which is quite flexible (for instance, if you need to add weights). Don't forget the -1 in the formula:
lm(vecData~factor(vecFactor)-1)$coef
factor(vecFactor)1 factor(vecFactor)2 factor(vecFactor)3
1.30 4.85 3.25
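For example, with some hypothetical weights w, purely to illustrate the flexibility mentioned above:
w <- c(1, 1, 2, 2, 1)                                  # hypothetical weights, for illustration only
lm(vecData ~ factor(vecFactor) - 1, weights = w)$coef  # weighted group means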
To get a nicely formatted table, try the aggregate function with a data frame:
ddf = data.frame(vecData, vecFactor)
aggregate(vecData~vecFactor, data=ddf, mean)
vecFactor vecData
1 1 1.30
2 2 4.85
3 3 3.25
data.table can also be used for this:
library(data.table)
ddt = data.table(ddf)
ddt[,list(meanval=mean(vecData)),by=vecFactor]
vecFactor meanval
1: 1 1.30
2: 3 3.25
3: 2 4.85
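If you want the groups sorted in the output, keyby does that with the same call:
ddt[, list(meanval = mean(vecData)), keyby = vecFactor]
#    vecFactor meanval
# 1:         1    1.30
# 2:         2    4.85
# 3:         3    3.25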