Counting function on a data frame in R

I have the following data frame:
> Mice
   Blood States    Minute
1  0.875     X0 0.8352569
2  0.875     A2 0.7551901
3  0.625     X0 1.4508139
4  0.625     A1 0.7876343
5  0.375     X0 1.1345252
6  0.125     X0 0.8699363
7  0.375     X0 0.9378742
8  1.125     H1 0.9769522
9  0.625     X0 0.4716321
10 0.875     H1 0.9935999
11 0.625     X0 1.0025917
12 0.375     A1 1.0703999
13 0.375     X0 1.3044854
14 0.875     H1 0.6720436
15 0.875     A1 1.0431863
So every mouse has some value of drug in its "Blood", and its "States" is recorded. This is just a piece of my data frame; the mice can be in 4 different states. "Minute" records when something happens to the mouse; what it is does not matter here.
For every value of "Blood", the mice can be in any of the 4 different states, and I want to count how many observations I have in each category.
The count() function with both columns Blood and States did not work for me because "States" is a factor column.

To operate on factor levels, you can use tapply or by. If you have a discrete scale for Mice$Blood, convert it to a factor as well:
> by(Mice$States, as.factor(Mice$Blood), function(x) summary(factor(x)))
as.factor(Mice$Blood): 0.125
X0 
 1 
------------------------------------------------------------
as.factor(Mice$Blood): 0.375
A1 X0 
 1  3 
------------------------------------------------------------
as.factor(Mice$Blood): 0.625
A1 X0 
 1  3 
------------------------------------------------------------
as.factor(Mice$Blood): 0.875
A1 A2 H1 X0 
 1  1  2  1 
------------------------------------------------------------
as.factor(Mice$Blood): 1.125
H1 
 1 
The returned object is a list, so you may capture it and use it for your purposes.
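If all you need is the full table of counts per Blood/States combination, base R's table (or its formula-style cousin xtabs) gets there in one call. A minimal sketch using the column names from the question:
# Cross-tabulate Blood against States; each cell is the number of
# observations in that Blood/States combination.
table(Mice$Blood, Mice$States)
# Equivalent formula interface:
xtabs(~ Blood + States, data = Mice)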


Merge double integers with different decimal places

I want to merge two datasets based on their double (decimal) values; however, one of them falls short by 2 decimal places. I would like to merge them whenever all digits of the shorter value match the leading digits of the longer one. The final value can come from either the longer or the shorter vector.
Example with expectation (xy):
x = c(0.456, 0.797, 0.978, 0.897, 0.567)
x1 = c(0.45698, 0.79786, 0.97895, 0.75869, 0.56049)
x x1
0.456 0.45698
0.797 0.79786
0.978 0.97895
# or, if they match, replace with the longer values
I have tried merge, but it produces all pairwise combinations:
> xx1<-merge(x, x1)
> unique(xx1)
x x1
1 0.456 0.45698
2 0.797 0.45698
3 0.978 0.45698
4 0.456 0.79786
5 0.797 0.79786
6 0.978 0.79786
7 0.456 0.97895
8 0.797 0.97895
9 0.978 0.97895
You can use pmatch, which converts the numbers to strings and partially matches them, so "0.456" matches the start of "0.45698":
i <- pmatch(x, x1)  # index of the partial match in x1, NA where none
j <- !is.na(i)      # keep only the values that found a match
cbind(x = x[j], x1 = x1[i[j]])
# x x1
#[1,] 0.456 0.45698
#[2,] 0.797 0.79786
#[3,] 0.978 0.97895
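If pmatch's implicit number-to-string conversion feels fragile, an alternative sketch is to compare fixed-width string representations explicitly; this assumes all values are non-negative, below 1, and that x has exactly 3 and x1 exactly 5 decimal places:
# Format both vectors with fixed decimals, then compare the first
# 5 characters ("0.456" against "0.45698" truncated to "0.456").
i <- match(sprintf("%.3f", x), substr(sprintf("%.5f", x1), 1, 5))
j <- !is.na(i)
cbind(x = x[j], x1 = x1[i[j]])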

Why am I getting different predicted probabilities on random forest rf$votes vs. predict()?

I ran randomForest on a dataset with binary outcome and want the predicted probabilities (on the same dataset - I don't need separate train/test for this). I was expecting the values for p1 and p2 below to be the same, but clearly they are not. I haven't been able to find a clear description of how they are different. Any help would be appreciated.
library(randomForest)
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
rf <- randomForest(factor(admit) ~ ., data = mydata)
p1 <- predict(rf, mydata[, 2:4], type = "prob")
p2 <- rf$votes
> head(p1)
      0     1
1 0.926 0.074
2 0.584 0.416
3 0.166 0.834
4 0.722 0.278
5 0.968 0.032
6 0.258 0.742
> head(p2)
          0          1
1 0.8324324 0.16756757
2 0.7663043 0.23369565
3 0.2447917 0.75520833
4 0.9695431 0.03045685
5 0.9264706 0.07352941
6 0.3351351 0.66486486
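The difference comes from how the two sets of probabilities are computed: rf$votes holds out-of-bag (OOB) vote fractions, where each tree votes only on the rows it did not see during training, whereas predict() with newdata runs every tree on every row, including the trees that were fit on that exact row, which makes p1 look overconfident. Calling predict() without newdata returns the OOB predictions, which should line up with rf$votes; a quick check, reusing the rf object from above:
# No newdata: predict() returns the out-of-bag probabilities,
# i.e. each row is predicted only by trees that never trained on it.
p3 <- predict(rf, type = "prob")
all.equal(p2, p3, check.attributes = FALSE)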

Long-format regression model referring to older measures?

Imagine I have an experiment in which I take repeated measures on a group of individuals.
The following diagram represents the measures taken from each individual at each time.
The data could be stored in a wide-format table, with one row for each individual:
ID x1 x2 x3 x4 y1 y2 y3 y4
Or it could also be stored in a more compact long format, like this:
ID T X Y
where T would be 1, 2, 3 or 4. For the moment keep it simple: the T increments are always the same, 1 unit.
I've been using this long format to fit regression models with dummy variables.
I usually do it in R (with the syntax "Y ~ X*T", where T is a factor), but it can be done in many other programs in different ways.
In this situation you can find the relationship between each y and its corresponding x (for the same time).
It would be similar to say:
y1 = a1 + b1·X1
y2 = a2 + b2·X2
y3 = a3 + b3·X3
But you get more power because you analyze all the data together.
Usually I do it with lme4 in order to take into account the repeated measures. But forget it for the moment.
My question is: is it possible to use the "long format" to find relationships such as this?
y1 = a10 + a11·X1
y2 = a20 + a21·X1 + a22·X2
y3 = a30 + a31·X1 + a32·X2 + a33·X3
I mean, every "y" depends not only on the "x" at the same time but also on the previous "x" values (a kind of cumulative effect).
Or am I forced to use a wide format and create new variables instead?
I think the problem with the wide format is that we would lose the explicit dependency on T, the time, and I would like to see how the outcome depends on it.
And I find it easier to work with long format.
If you want a very simple reproducible example:
set.seed(1)
ID <- rep(1:4,each=4)
XX <- round(runif(16),3)
TT <- rep(1:4, 4)
YY <- ave(XX*TT,ID, FUN = cumsum)
data.frame(ID,TT,XX, YY)
ID TT XX YY
1 1 0.266 0.266
1 2 0.372 1.010
1 3 0.573 2.729
1 4 0.908 6.361
2 1 0.202 0.202
2 2 0.898 1.998
2 3 0.945 4.833
2 4 0.661 7.477
3 1 0.629 0.629
3 2 0.062 0.753
3 3 0.206 1.371
3 4 0.177 2.079
4 1 0.687 0.687
4 2 0.384 1.455
4 3 0.770 3.765
4 4 0.498 5.757
Any solution not relying on R is also welcome.
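One way to keep the long format and still get the cumulative dependence is to carry the history along as an extra column. A minimal sketch, under the assumption that a running sum of the earlier x values is an adequate summary of the history (names follow the example above):
d <- data.frame(ID, TT, XX, YY)
# Sum of all *previous* X values within each individual (0 at T = 1).
d$cumX_prev <- ave(d$XX, d$ID, FUN = function(x) cumsum(x) - x)
# Current X interacts with time as before; the accumulated history
# enters as its own regressor.
fit <- lm(YY ~ XX * factor(TT) + cumX_prev, data = d)
summary(fit)
If you need a separate coefficient for each past time point rather than one pooled sum, you would instead build explicit lag columns (lag 1, lag 2, ...), which is essentially the wide-format information carried into the long table.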

Why does R return a low p-value for ANOVA on a set of 1s?

I'm trying to use repeated rounds of ANOVA to sort a large dataset into different categories. For each element in the dataset I have twelve data points, representing three replicates each of four conditions, which arise as the combinations of two two-level variables. The data are expression levels relative to a control, which means that for the control itself all twelve values are 1:
> at
   v1 v2 values
1   a  X      1
2   b  X      1
3   a  X      1
4   b  X      1
5   a  X      1
6   b  X      1
7   a  Y      1
8   b  Y      1
9   a  Y      1
10  b  Y      1
11  a  Y      1
12  b  Y      1
which I analyze this way (the TukeyHSD wrapper gives me information about whether it is up or down in addition to whether it is different, which is why I'm using it):
stats <- TukeyHSD(aov(values~v1+v2, data=at))
> stats
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = values ~ v1 + v2, data = at)

$v1
            diff           lwr          upr     p adj
a-b 4.440892e-16 -1.359166e-16 1.024095e-15 0.1173068

$v2
             diff           lwr          upr     p adj
X-Y -4.440892e-16 -1.024095e-15 1.359166e-16 0.1173068
I expected the p-value to be very close or equal to 1, since clearly the null hypothesis that the two groups are the same is correct for both tests. Instead the p-value is surprisingly low: 0.117! Clearly the difference and the bounds are tiny (on the order of 1e-16), so I'm guessing the problem has to do with the numbers being stored internally as slightly off from 1, but I'm not sure how to solve it. Any suggestions?
Thanks a lot!
I'm adding some sample data:
aX1 bX1 aX2 bX2 aX3 bX3 aY1 bY1 aY2 bY2 aY3 bY3
element1 0.112 0 0.172 0.072 0.058 0.055 0 0 0.046 0 0.042 0
element2 0.859 0.294 0.565 0 0.669 0 0.11 0 1.707 0 1.324 0
element3 1.255 0.721 3.645 1.636 5.36 6.701 0 0.097 0.533 0.209 0.358 2.219
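For what it's worth, the 0.117 is an artifact of exactly the suspicion raised above: with all responses equal to 1, both the effect and the residual sums of squares sit at machine-epsilon scale (~1e-16), so the F statistic is a ratio of rounding noise. A minimal guard sketch is to skip elements whose response is numerically constant before fitting:
# var() of a constant vector is exactly 0; anything at floating-point
# noise level is treated as constant and the ANOVA is skipped.
if (var(at$values) < 1e-10) {
  message("constant response: no differences to test")
} else {
  stats <- TukeyHSD(aov(values ~ v1 + v2, data = at))
}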

How to convert only SOME positive numbers to negative numbers (conditional recoding)?

I am looking for a convenient way to convert positive values (proportions) into negative values of the same variable, depending on the value of another variable.
This is what the data structure looks like:
id Item Var1 Freq
1 P1 0 0.043
2 P2 1 0.078
3 P3 2 0.454
4 P4 3 0.543
5 T1 0 0.001
6 T2 1 0
7 T3 2 0.045
8 T4 3 0.321
9 A1 0 0.671
...
More precisely, I would like to make the numbers in Freq negative whenever Var1 <= 1 (e.g. -0.043).
This is what I tried:
for(i in 1:180) {
  if (mydata$Var1 <= "1") (mydata$Freq*(-1))}
or
mydata$Freq[mydata$Var1 <= "1"] = -abs(mydata$Freq)
In both cases the negative sign is set, but the numbers themselves are altered as well.
Any help is highly appreciated. THANKS!
A vectorized one-liner with ifelse:
new.Freq <- with(mydata, ifelse(Var1 <= 1, -Freq, Freq))
Try:
index <- mydata$Var1 <= 1
mydata$Freq[index] = -abs(mydata$Freq[index])
There are two errors in your attempted code:
You did a character comparison by writing x <= "1" - this should be a numeric comparison, i.e. x <= 1.
Although you are replacing a subset of your vector, the replacement is computed from the whole vector; you need to apply the same subset on the right-hand side, i.e. -abs(mydata$Freq[index]).
ifelse can also be used to combine two variables when one of them holds negative values that you want to retain; similarly, you can convert a variable to negative values by putting - in front of it (as mentioned above), e.g. -Freq.
mydata$new_Freq <- with(mydata, ifelse(Low_Freq < 0, Low_Freq, Freq))
id Item Var1  Freq Low_Freq
 1   P1    0 1.043   -0.063
 2   P2    1 1.078   -0.077
 3   P3    2 2.401   -0.068
 4   P4    3 3.543   -0.323
 5   T1    0 1.001    1.333
 6   T2    1 1.778    1.887
 7   T3    2 2.045    1.011
 8   T4    3 3.321    1.000
 9   A1    0 4.671    2.303
# Output would be:
id Item Var1  Freq Low_Freq new_Freq
 1   P1    0 1.043   -0.063   -0.063
 2   P2    1 1.078   -0.077   -0.077
 3   P3    2 2.401   -0.068   -0.068
 4   P4    3 3.543   -0.323   -0.323
 5   T1    0 1.001    1.333    1.001
 6   T2    1 1.778    1.887    1.778
 7   T3    2 2.045    1.011    2.045
 8   T4    3 3.321    1.000    3.321
 9   A1    0 4.671    2.303    4.671
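For completeness, the same conditional sign flip as a short dplyr sketch (assuming the package is available); if_else is the type-strict cousin of ifelse:
library(dplyr)
mydata <- mydata %>%
  mutate(Freq = if_else(Var1 <= 1, -Freq, Freq))  # flip sign only where Var1 <= 1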
