Using variations of `apply` in R - r

In research we often have to build a summary table. I would like to create one using tapply in R. The only problem is that I have 40 variables and I would like to perform the same operation on all of them. Here is an example of the data:
Age Wt Ht Type
79 134 66 C
67 199 64 C
39 135 78 T
92 149 61 C
33 138 75 T
68 139 71 C
95 198 62 T
65 132 65 T
56 138 81 C
71 193 78 T
Essentially I would like to get the means of all the variables by Type. The result should look like this:
C T
Age 72.4 60.6
Wt 151.8 159.2
Ht 68.6 71.6
I tried using
sapply(df, tapply(df, df$Type, mean))
but got an error.
Any guidance would be appreciated.

Try:
> sapply(df[1:3], tapply, df$Type, mean)
Age Wt Ht
C 72.4 151.8 68.6
T 60.6 159.2 71.6
Alternatively, you can use colMeans:
> sapply(split(df[1:3], df$Type), colMeans)
C T
Age 72.4 60.6
Wt 151.8 159.2
Ht 68.6 71.6
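With 40 variables it may be easier to pick out the numeric columns programmatically than to index them by position. A minimal sketch, rebuilding the example data from the question (the names num and res are only illustrative):

```r
# Example data from the question
df <- data.frame(
  Age  = c(79, 67, 39, 92, 33, 68, 95, 65, 56, 71),
  Wt   = c(134, 199, 135, 149, 138, 139, 198, 132, 138, 193),
  Ht   = c(66, 64, 78, 61, 75, 71, 62, 65, 81, 78),
  Type = c("C", "C", "T", "C", "T", "C", "T", "T", "C", "T")
)

# Logical index of the numeric columns (everything except Type here)
num <- sapply(df, is.numeric)

# Per-column group means; t() puts variables in rows and types in columns
res <- t(sapply(df[num], tapply, df$Type, mean))
res
```

The original call failed because sapply expects a function as its second argument, whereas tapply(df, df$Type, mean) is evaluated immediately and is not a function.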

You could use aggregate:
res <- aggregate(DF[,names(DF) != 'Type'],list(DF$Type),mean)
> res
Group.1 Age Wt Ht
1 C 72.4 151.8 68.6
2 T 60.6 159.2 71.6
Then transpose it:
m <- t(res[-1]) # convert the data.frame (excluding the first column) to a matrix and transpose it
colnames(m) <- res[[1]] # set the matrix column names from the first column of the data.frame
> m
C T
Age 72.4 60.6
Wt 151.8 159.2
Ht 68.6 71.6

Related

Adding a new column to data.frame with a factor depending conditions from another data.frame

I have 2 different data.frames. I want to add the grouping$.group column to the phenology data.frame under the conditions given by the group data.frame (LEVEL and SPECIES). I have tried the merge() function using by= but it keeps giving me "Error in fix.by(by.y, y) : 'by' must specify a uniquely valid column". Sorry this might seem like a very easy thing. I'm a beginner..
> head(phenology1)
YEAR GRADIENT SPECIES ELEVATION SITE TREE_ID CN b_E b_W b_M d_E d_W d_X c_E c_W t_max r_max r_delta_t LEVEL
1 2019 1 Pseudotsuga menziesii 395 B1_D B1_D1 59 119 135.5 143.0 139.0 148.5 165 258.0 284 154 0.7908536 0.4244604 lower
2 2019 1 Pseudotsuga menziesii 395 B1_D B1_D2 69 106 127.0 142.0 177.0 173.0 194 283.0 300 156 0.9807529 0.3898305 lower
3 2019 1 Pseudotsuga menziesii 395 B1_D B1_D3 65 97 125.0 154.5 169.0 174.0 202 266.0 299 167 NA 0.3846154 lower
4 2019 1 Picea abies 405 B1_F B1_F1 68 162 171.5 182.0 106.5 127.5 137 268.5 299 190 NA 0.6384977 lower
5 2019 1 Picea abies 405 B1_F B1_F2 78 139 165.5 176.5 152.0 140.5 167 291.0 306 181 0.9410427 0.5131579 lower
6 2019 1 Picea abies 405 B1_F B1_F3 34 147 177.5 188.0 100.0 97.5 128 247.0 275 187 0.5039245 0.3400000 lower
> grouping
LEVEL SPECIES emmean SE df lower.CL upper.CL .group
lower Pseudotsuga menziesii 107 8.19 12 89.5 125 1
upper Pseudotsuga menziesii 122 8.19 12 103.8 140 12
lower Abies alba 128 8.19 12 110.2 146 12
upper Abies alba 144 8.19 12 126.7 162 12
upper Picea abies 147 8.19 12 129.2 165 2
lower Picea abies 149 8.19 12 131.5 167 2
You can use left_join() from dplyr package (join phenology1 with only the columns LEVEL, SPECIES and .group from grouping):
library(dplyr)
phenology1 %>%
left_join(grouping %>% select(LEVEL, SPECIES, .group))
This automatically joins on the columns whose names are identical in both data frames. If you want to set these explicitly, you can add by = c("LEVEL" = "LEVEL", "SPECIES" = "SPECIES").
Base R, using match on a combined LEVEL/SPECIES key (each row of phenology1 is looked up in grouping):
phenology1$.group <- grouping$.group[match(paste(phenology1$LEVEL, phenology1$SPECIES), paste(grouping$LEVEL, grouping$SPECIES))]
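To see how a two-column lookup with match behaves, here is a small sketch. The phen data frame is a made-up, three-row stand-in for phenology1 (its rows are assumptions for illustration only); the grouping values are copied from the question.

```r
# Hypothetical miniature of phenology1 (illustration only)
phen <- data.frame(
  TREE_ID = c("B1_D1", "B1_F1", "X1"),
  SPECIES = c("Pseudotsuga menziesii", "Picea abies", "Abies alba"),
  LEVEL   = c("lower", "lower", "upper"),
  stringsAsFactors = FALSE
)

# LEVEL, SPECIES and .group as shown in the question
grouping <- data.frame(
  LEVEL   = c("lower", "upper", "lower", "upper", "upper", "lower"),
  SPECIES = c("Pseudotsuga menziesii", "Pseudotsuga menziesii",
              "Abies alba", "Abies alba", "Picea abies", "Picea abies"),
  .group  = c("1", "12", "12", "12", "2", "2"),
  stringsAsFactors = FALSE
)

# Build one key per row from both columns, then look each phen key up in grouping
key_phen  <- paste(phen$LEVEL, phen$SPECIES, sep = "|")
key_group <- paste(grouping$LEVEL, grouping$SPECIES, sep = "|")
phen$.group <- grouping$.group[match(key_phen, key_group)]
phen
```

Rows without a matching LEVEL/SPECIES pair would get NA, which mirrors what a left join does.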

P spline smoother

Hi, I am trying to fit a non-parametric regression smoother to the difference between the control and treatment groups, so as to determine the effectiveness of the appetite suppressant over time. I then need to use my model to estimate the difference between the treatment and control groups at t=0 and t=50.
I want to use a P-spline smoother, but I do not have enough background on it.
This is my data :
t
0 1 3 7 8 10 14 15 17 21 22 24 28 29 31 35 36 38 42 43 45 49 50 52 56 57 59 63 64 70 73 77 80 84 87 91 94 98 105
con
20.5 19.399 22.25 17.949 19.899 21.449 16.899 21.5 22.8 24.699 26.2 28.5 24.35 24.399 26.6 26.2 26.649 29.25 27.55 29.6 24.899 27.6 28.1 27.85 26.899 27.8 30.25 27.6 27.449 27.199 27.8 28.199 28 27.3 27.899 28.699 27.6 28.6 27.5
trt
21.3 16.35 19.25 16.6 14.75 18.149 14.649 16.7 15.05 15.5 13.949 16.949 15.6 14.699 14.15 14.899 12.449 14.85 16.75 14.3 16 16.85 15.65 17.149 18.05 15.699 18.25 18.149 16.149 16.899 18.95 22 23.6 23.75 27.149 28.449 25.85 29.7 29.449
where:
t - the time in days since the experiment started.
con - the median food intake of the control group.
trt - the median food intake of the treatment group.
Can anybody help please?
Just to give you a start: the mgcv package implements various regression spline bases, including P-splines (penalized B-splines with a difference penalty).
First, you need to set up your data:
dat <- data.frame(time = rep(t, 2), y = c(con, trt),
grp = gl(2, 39, labels = c("con", "trt")))
Then call gam for non-parametric regression:
library(mgcv) # no need to install; it comes with R
fit <- gam(y ~ s(time, bs = 'ps', by = grp) + grp, data = dat)
Read "mgcv: how to specify interaction between smooth and factor?" for how to specify the interaction. bs = 'ps' sets the P-spline basis. By default the basis dimension is k = 10, with evenly spaced knots. You can change k if you want.
For more about P-splines in mgcv, read "mgcv: how to extract knots, basis, coefficients and predictions for P-splines in adaptive smooth?".
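The question also asks for the estimated difference at t=0 and t=50. One way to get it from the model above is to predict both groups at those times and subtract. This is a sketch under that assumption, not the only approach (one could instead smooth trt - con directly); newd and diff_est are illustrative names.

```r
library(mgcv)

# Data from the question
t <- c(0, 1, 3, 7, 8, 10, 14, 15, 17, 21, 22, 24, 28, 29, 31, 35, 36, 38, 42,
       43, 45, 49, 50, 52, 56, 57, 59, 63, 64, 70, 73, 77, 80, 84, 87, 91, 94,
       98, 105)
con <- c(20.5, 19.399, 22.25, 17.949, 19.899, 21.449, 16.899, 21.5, 22.8,
         24.699, 26.2, 28.5, 24.35, 24.399, 26.6, 26.2, 26.649, 29.25, 27.55,
         29.6, 24.899, 27.6, 28.1, 27.85, 26.899, 27.8, 30.25, 27.6, 27.449,
         27.199, 27.8, 28.199, 28, 27.3, 27.899, 28.699, 27.6, 28.6, 27.5)
trt <- c(21.3, 16.35, 19.25, 16.6, 14.75, 18.149, 14.649, 16.7, 15.05, 15.5,
         13.949, 16.949, 15.6, 14.699, 14.15, 14.899, 12.449, 14.85, 16.75,
         14.3, 16, 16.85, 15.65, 17.149, 18.05, 15.699, 18.25, 18.149, 16.149,
         16.899, 18.95, 22, 23.6, 23.75, 27.149, 28.449, 25.85, 29.7, 29.449)

dat <- data.frame(time = rep(t, 2), y = c(con, trt),
                  grp = gl(2, 39, labels = c("con", "trt")))
fit <- gam(y ~ s(time, bs = "ps", by = grp) + grp, data = dat)

# Prediction grid: both groups at t = 0 and t = 50
newd <- expand.grid(time = c(0, 50),
                    grp = factor(c("con", "trt"), levels = levels(dat$grp)))
newd$fit <- as.numeric(predict(fit, newdata = newd))

# Estimated treatment minus control difference at each time
diff_est <- newd$fit[newd$grp == "trt"] - newd$fit[newd$grp == "con"]
names(diff_est) <- c("t=0", "t=50")
diff_est
```

Passing se.fit = TRUE to predict would give standard errors for each fitted curve, though not directly for their difference.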

How best to index for max values in data frame?

The dataset in use here is genotype, from the CRAN package MASS.
> names(genotype)
[1] "Litter" "Mother" "Wt"
> str(genotype)
'data.frame': 61 obs. of 3 variables:
$ Litter: Factor w/ 4 levels "A","B","I","J": 1 1 1 1 1 1 1 1 1 1 ...
$ Mother: Factor w/ 4 levels "A","B","I","J": 1 1 1 1 1 2 2 2 3 3 ...
$ Wt : num 61.5 68.2 64 65 59.7 55 42 60.2 52.5 61.8 ...
This was the given question from a tutorial:
Exercise 6.7. Find the heaviest rats born to each mother in the genotype() data.
Using tapply to split by the factor genotype$Mother gives:
> tapply(genotype$Wt, genotype$Mother, max)
A B I J
68.2 69.8 61.8 61.0
Also:
> out <- tapply(genotype$Wt, genotype[,1:2],max)
> out
Mother
Litter A B I J
A 68.2 60.2 61.8 61.0
B 60.3 64.7 59.0 51.3
I 68.0 69.8 61.3 54.5
J 59.0 59.5 61.4 54.0
The first tapply gives the heaviest rat from each mother, and the second (out) gives a table that lets me identify which litter type was heaviest for each mother. Is there another way to find which Litter has the greatest weight for each mother, for instance if the two-dimensional table is very large?
We could use data.table. We convert the 'data.frame' to 'data.table' (setDT(genotype)). Create the index using which.max and subset the rows of the dataset grouped by the 'Mother'.
library(data.table)#v1.9.5+
setDT(genotype)[, .SD[which.max(Wt)], by = Mother]
# Mother Litter Wt
#1: A A 68.2
#2: B I 69.8
#3: I A 61.8
#4: J A 61.0
If we are only interested in the max of 'Wt' by 'Mother'
setDT(genotype)[, list(Wt=max(Wt)), by = Mother]
# Mother Wt
#1: A 68.2
#2: B 69.8
#3: I 61.8
#4: J 61.0
Based on the last tapply code shown by the OP, if we need similar output, we can use dcast from the devel version of 'data.table'
dcast(setDT(genotype), Litter ~ Mother, value.var='Wt', max)
# Litter A B I J
#1: A 68.2 60.2 61.8 61.0
#2: B 60.3 64.7 59.0 51.3
#3: I 68.0 69.8 61.3 54.5
#4: J 59.0 59.5 61.4 54.0
data
library(MASS)
data(genotype)
From stats:
aggregate(. ~ Mother, data = genotype, max)
or
aggregate(Wt ~ Mother, data = genotype, max)
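If the whole row for the heaviest rat is wanted (including Litter) without extra packages, one possible base R sketch is to split by Mother and keep the which.max row of each piece. The name heaviest is illustrative.

```r
library(MASS)  # provides the genotype data

# Split the data by Mother, keep the row with the largest Wt in each piece,
# then bind the pieces back into one data frame
heaviest <- do.call(
  rbind,
  lapply(split(genotype, genotype$Mother), function(d) d[which.max(d$Wt), ])
)
heaviest
```

Like the data.table version, which.max returns only the first row in case of ties.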

Confusion matrix output error

I tried to build a predictive model in R using decision tree through this code:
library(rpart)
library(caret)
DataYesNo<-read.csv('DataYesNo.csv',header=T)
worktrain<- sample(1:50,40)
worktest <- setdiff(1:50,worktrain)
M <- ncol(DataYesNo)
input <- names(DataYesNo)[1:(M-1)]
target <- "ICUtransfer"
tree<- rpart(ICUtransfer~Temperature+RespiratoryRate+HeartRate+SystolicBP+OxygenSaturations,
data=DataYesNo[worktrain, c(input,target)],
method="class",
parms=list(split="information"),
control=rpart.control(usesurrogate=0, maxsurrogate=0))
fitted <- predict(tree, DataYesNo[worktest, c(input,target)])
cmatrix <- confusionMatrix(fitted, worktest$ICUtransfer)
print(cmatrix)
tree
plot(tree)
text(tree)
I got an error at: cmatrix <- confusionMatrix(fitted, worktest$ICUtransfer)
"$ operator is invalid for atomic vectors"
Can you please help me solve this?
Regards,
DataYesNo[worktest,]
Temperature RespiratoryRate HeartRate SystolicBP OxygenSaturations ICUtransfer
11 36.3 26 65 140 97 no
15 37.3 20 80 129 99 no
21 36.9 20 72 154 95 no
26 36.0 28 56 199 97 no
30 36.9 20 72 150 96 no
34 36.6 16 97 118 95 yes
36 36.0 20 77 145 97 yes
38 36.0 20 77 145 97 yes
43 36.3 28 98 116 95 yes
47 36.0 20 77 145 97 yes
I tried this line:
cmatrix <- confusionMatrix(fitted, DataYesNo[worktest,]$ICUtransfer)
but I got this error: Error in confusionMatrix.default(fitted, DataYesNo[worktest, ]$ICUtransfer) :
the data and reference factors must have the same number of levels
Can anyone please help me?
You're getting that error because worktest doesn't have any factor called ICUtransfer. worktest is just a numeric vector of indices, and thus has no factors. You want the subset of your data corresponding to the worktest indices.
It's impossible to know what exactly needs to be done, because I can't see into the data structures you're using.
Instead of worktest$ICUtransfer, try DataYesNo[worktest, target], which extracts the ICUtransfer column for the rows indexed by worktest.
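The second error in the question ("the data and reference factors must have the same number of levels") is consistent with predict() on an rpart classification tree returning a matrix of class probabilities by default; type = "class" returns a factor of predicted labels instead. Since DataYesNo.csv is not available, the sketch below uses the kyphosis data shipped with rpart as a stand-in, and base table() in place of caret::confusionMatrix so it runs without caret.

```r
library(rpart)

set.seed(1)
n <- nrow(kyphosis)                       # stand-in data set from rpart
train_idx <- sample(n, 60)                # row indices used for training
test_idx  <- setdiff(seq_len(n), train_idx)

tree <- rpart(Kyphosis ~ Age + Number + Start,
              data = kyphosis[train_idx, ], method = "class")

# type = "class" gives a factor of labels rather than a probability matrix
pred <- predict(tree, kyphosis[test_idx, ], type = "class")

# Reference labels come from the data frame, indexed by the test rows
ref <- kyphosis$Kyphosis[test_idx]

cm <- table(Predicted = pred, Reference = ref)
cm
# With caret loaded, confusionMatrix(pred, ref) takes the same two factors
```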

using rbind to create dataframe is not working

I am trying to write a script to get some specific values for the equation 25a+20b=1600 with a in the range between 24:60 and b in 20:50
I need to get the pairs of a and b satisfying the equation.
My first problem was how to define a and b with a single digit decimal place (a=24.0,24.1,24.2...etc.) but I overcame that defining a<-c(240:600)/10, so my first question is: Is there any direct method to do that?
Now, I wrote a couple of nested loops and I am able to get a vector each time the equation is satisfied. I want to use rbind() to attach this vector to a matrix or a data frame, but it does not work, and it gives no error or warning; it just keeps the value of the first vector and that's it!
Here is my code, can someone help me define where the problem is?
solve_ms <- function() {
index<-1
sol<-data.frame()
temp<-vector("numeric")
a<-c(240:600)/10
b<-c(200:500)/10
for (i in 1:length(a)){
for (j in 1:length(b)) {
c <- 25*a[i]+20*b[j]
if(c == 1600) {
temp<-c(a[i], b[j])
if(index == 1) {
sol<-temp
index<-0
}
else rbind(sol,temp)
}
}
}
return(sol)
}
I found out where the problem in my code is: I was calling rbind without assigning its return value back to the data frame. I had to write {sol<-rbind(sol,temp)} and then it works.
I will check other suggestions as well.. thanks.
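For reference, a sketch of the loop with the rbind result assigned back. Two things here are additions rather than part of the original code: the exact test c == 1600 is replaced by a tolerance comparison (tol is an assumed parameter), since values such as 24.1 are not represented exactly in floating point, and sol starts as NULL so no special case is needed for the first match.

```r
solve_ms <- function(tol = 1e-8) {
  sol <- NULL                      # rbind(NULL, x) simply returns x as a row
  a <- (240:600) / 10              # 24.0, 24.1, ..., 60.0
  b <- (200:500) / 10              # 20.0, 20.1, ..., 50.0
  for (ai in a) {
    for (bj in b) {
      # compare with a tolerance instead of ==
      if (abs(25 * ai + 20 * bj - 1600) < tol) {
        sol <- rbind(sol, c(a = ai, b = bj))   # assign the result back
      }
    }
  }
  as.data.frame(sol)
}

res <- solve_ms()
head(res)
```

On the side question, seq(24, 60, by = 0.1) is a more direct way to generate the values of a.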
Try this instead:
#define a function
fun <- function(a,b) (25*a+20*b) == 1600
Since floating point precision could be an issue:
#alternative function
fun <- function(a,b,tol=.Machine$double.eps ^ 0.5) abs(25*a+20*b-1600) < tol
#create all possible combinations
paras <- expand.grid(a=c(240:600)/10, b=20:50)
paras[fun(paras$a,paras$b),]
a b
241 48.0 20
594 47.2 21
947 46.4 22
1300 45.6 23
1653 44.8 24
2006 44.0 25
2359 43.2 26
2712 42.4 27
3065 41.6 28
3418 40.8 29
3771 40.0 30
4124 39.2 31
4477 38.4 32
4830 37.6 33
5183 36.8 34
5536 36.0 35
5889 35.2 36
6242 34.4 37
6595 33.6 38
6948 32.8 39
7301 32.0 40
7654 31.2 41
8007 30.4 42
8360 29.6 43
8713 28.8 44
9066 28.0 45
9419 27.2 46
9772 26.4 47
10125 25.6 48
10478 24.8 49
10831 24.0 50
If the problem really is this simple, i.e. finding solutions of a linear equation in two variables, you can rearrange the equation to write b in terms of a, i.e. b = (1600-25*a)/20, compute b for every value of a, and keep only the combinations where b falls in the allowed set.
For example:
a = c(240:600)/10
b = 20:50
RESULTS <- data.frame(a, b = (1600 - 25 * a)/20)[((1600 - 25 * a)/20) %in% b, ]
RESULTS
## a b
## 1 24.0 50
## 9 24.8 49
## 17 25.6 48
## 25 26.4 47
## 33 27.2 46
## 41 28.0 45
## 49 28.8 44
## 57 29.6 43
## 65 30.4 42
## 73 31.2 41
## 81 32.0 40
## 97 33.6 38
## 105 34.4 37
## 121 36.0 35
## 137 37.6 33
## 145 38.4 32
## 161 40.0 30
## 177 41.6 28
## 185 42.4 27
## 193 43.2 26
## 201 44.0 25
## 209 44.8 24
## 217 45.6 23
## 225 46.4 22
## 233 47.2 21
## 241 48.0 20
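The output above skips some pairs that the expand.grid answer does find (for example a = 32.8, b = 39), which is consistent with %in% doing an exact floating-point comparison. One way around this, sketched here as an assumption-labelled alternative, is to work in integer tenths: with a10 = 10 * a the equation becomes 25 * a10 + 200 * b = 16000, and all arithmetic stays exact. As in the answers, b is taken to be an integer in 20:50; a10, num and RESULTS_INT are illustrative names.

```r
a10 <- 240:600                    # a expressed in tenths, as integers
num <- 16000L - 25L * a10         # equals 200 * b when the equation holds

# b must be a whole number ...
ok <- num %% 200L == 0L
b  <- num %/% 200L
# ... and lie in the allowed range
ok <- ok & b >= 20L & b <= 50L

RESULTS_INT <- data.frame(a = a10[ok] / 10, b = b[ok])
nrow(RESULTS_INT)
```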
