Calculating entropy in R

A=c("f","t","t","f","t","f","f","f","t","f")
B=c("t","t","t","t","t","f","f","f","t","t")
class=c("+","+","+","-","+","-","-","-","-","-")
df=data.frame(A,B,class)
df
A B class
1 f t +
2 t t +
3 t t +
4 f t -
5 t t +
6 f f -
7 f f -
8 f f -
9 t t -
10 f t -
I partitioned the data on attribute A and on attribute B and counted the classes in each branch, as follows:
{A}
[T , F]
/ \
------- -------
[3+,1-] [1+,5-]
{B}
[T , F]
/ \
------- -------
[4+,3-] [0+,3-]
Using the entropy formula on the partitions above, I calculated the entropy with this R code.
1. For attribute A:
t=table(A,class)
t
class
A - +
f 5 1
t 1 3
prop1=t[1,]/sum(t[1,])
prop1
- +
0.8333333 0.1666667
prop2=t[2,]/sum(t[2,])
prop2
- +
0.25 0.75
H1=-(prop1[1]*log2(prop1[1]))-(prop1[2]*log2(prop1[2]))
H1
0.6500224
H2=-(prop2[1]*log2(prop2[1]))-(prop2[2]*log2(prop2[2]))
H2
0.8112781
entropy=(table(A)[1]/length(A))*H1 +(table(A)[2]/length(A))*H2
entropy
0.7145247
2. For attribute B:
t=table(B,class)
t
class
B - +
f 3 0
t 3 4
prop1=t[1,]/sum(t[1,])
prop1
- +
1 0
prop2=t[2,]/sum(t[2,])
prop2
- +
0.4285714 0.5714286
H1=-(prop1[1]*log2(prop1[1]))-(prop1[2]*log2(prop1[2]))
H1
NaN
H2=-(prop2[1]*log2(prop2[1]))-(prop2[2]*log2(prop2[2]))
H2
0.9852281
entropy=(table(B)[1]/length(B))*H1 +(table(B)[2]/length(B))*H2
entropy
NaN
When I calculate the entropy for attribute B, the result is NaN. This is caused by the zero proportion: log2(0) is -Inf, and 0 * -Inf evaluates to NaN in R. How can I fix this, or how can I make H1 return zero instead of NaN in such a situation?
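One common way to handle this is to adopt the usual convention that a zero proportion contributes nothing to the entropy (the limit of p * log2(p) as p goes to 0 is 0) and to drop those terms before taking the logarithm. The sketch below is a minimal illustration of that idea; the helper names `entropy_bits` and `split_entropy` are made up for this example and are not part of base R.

```r
# Shannon entropy in bits for a vector of class counts.
# Zero counts are dropped, so they contribute 0 instead of NaN.
entropy_bits <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Weighted average entropy of `class` after splitting on `attribute`.
split_entropy <- function(attribute, class) {
  tab <- table(attribute, class)
  branch_entropy <- apply(tab, 1, entropy_bits)  # one entropy per branch
  weights <- rowSums(tab) / sum(tab)             # share of rows per branch
  sum(weights * branch_entropy)
}

A <- c("f","t","t","f","t","f","f","f","t","f")
B <- c("t","t","t","t","t","f","f","f","t","t")
class <- c("+","+","+","-","+","-","-","-","-","-")

split_entropy(A, class)  # about 0.7145, the same value as the manual calculation
split_entropy(B, class)  # about 0.6897, finite instead of NaN
```

Filtering with `p[p > 0]` keeps the function vectorised and works for any number of classes, which avoids writing out one term per class by hand.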

Related

Calculate percentage change in dataframe from first row

I want to calculate the per cent change in my dataframe using the first row as the reference. For example my dataframe
Set rate field
A 3 10
B 2 17
C 5 4
Using row A as the reference, I want to calculate the percentage change from row A to every other row for all columns in the dataframe.
which will result in
Set rate field
A 3 10
B -33 70
C 66.66 -60
or
Set rate field pct_rate pct-field
A 3 10 0 0
B 2 17 -33 70
C 5 4 66.66 -60
My code:
z %>%
mutate(pct_rate = (rate - lag(rate)/ rate ) * 100)
which doesn't give me the desired result
df <- fread("Set rate field
A 3 10
B 2 17
C 5 4")
Solution using dplyr: We can use dplyr's first() function to refer to the first element of a vector (your attempt with lag() is very close to this solution). I also used first(rate) in the denominator, so that the percentage difference matches the numbers in your example.
library(dplyr)
df %>%
mutate(pct_rate = (rate - first(rate)) / first(rate) * 100,
pct_field = (field - first(field)) / first(field) * 100)
Returns:
Set rate field pct_rate pct_field
1: A 3 10 0.00000 0
2: B 2 17 -33.33333 70
3: C 5 4 66.66667 -60
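To see why the attempt with lag() gives different numbers, it may help to evaluate both expressions on the rate column by hand. The sketch below uses base R only; `previous` is a stand-in, introduced here for illustration, for what dplyr's lag(rate) would return. Two things differ: lag() refers to the previous row rather than the first row, and without parentheses the division binds tighter than the subtraction.

```r
rate <- c(3, 2, 5)

# Stand-in for dplyr::lag(rate): each value shifted down by one row
previous <- c(NA, head(rate, -1))

# Expression from the question: evaluated as rate - (previous / rate)
attempt <- (rate - previous / rate) * 100
attempt   # NA 50 460

# Change relative to the first row, with the subtraction parenthesised
fixed <- (rate - rate[1]) / rate[1] * 100
round(fixed, 2)   # 0.00 -33.33 66.67
```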
You can use z$rate[1] or z$field[1] to get the first element and then do the calculations with all values.
z$pct_rate <- 100 * (z$rate - z$rate[1]) / z$rate[1]
z$pct_field <- 100 * (z$field - z$field[1]) / z$field[1]
z
# Set rate field pct_rate pct_field
#1 A 3 10 0.00000 0
#2 B 2 17 -33.33333 70
#3 C 5 4 66.66667 -60
or for many columns:
rbind(z[1,], do.call(cbind.data.frame, c(z[1],
lapply(z[-1], function(x) 100 * (x - x[1]) / x[1])))[-1,])
# Set rate field
#1 A 3.00000 10
#2 B -33.33333 70
#3 C 66.66667 -60

Friedman's test manual calculation

I have obtained a negative value from the Friedman's test. The data is:
Full MIC ReliefF LCorrel InfoGain
equinox 69.939 80.178 78.794 75.205 62.268
lucene 78.175 84.103 79.017 82.044 75.564
mylyn 75.531 78.006 77.161 47.711 81.575
pde 70.282 82.686 81.884 75.07 79.476
jdt 71.675 93.202 95.387 85.878 82.818
Ranking is below
Full MIC ReliefF LCorrel InfoGain
equinox 2 5 4 3 1
lucene 2 5 3 4 1
mylyn 2 4 3 1 5
pde 1 5 4 2 3
jdt 1 4 5 3 2
Sum 8 23 19 13 12
The Friedman's F Calculation formula:
F = (5 / [5*5*(5+1)]) * [8*8 + 23*23 + 19*19 + 13*13 + 12*12] - [5*5*(5+1)]
The value I get is -107.7666667.
How do I interpret that? The examples I have seen all have positive result.
I know about the R code but want the manual calculation.
This is how I generated the results and it worked
pacc_part
f1 <- friedman.test(pacc_part)
print (f1)
# Post-hoc tests are conducted only if the omnibus Friedman test p-value
# is 0.05 or less.
if ( f1$p.value < 0.05 )
{
n1 <- posthoc.friedman.nemenyi.test(pacc_part)
}
n1;
# alternate representation of post-hoc test results
summary(n1);
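For comparison with the manual calculation in the question, the Friedman chi-squared statistic is usually written as 12 / (N k (k+1)) * sum(R_j^2) - 3 N (k+1), where N is the number of blocks (here the 5 data sets), k the number of treatments (here the 5 methods) and R_j the rank sums. The formula in the question uses 5 in place of 12 and subtracts N k (k+1) instead of 3 N (k+1), which is what pushes the result below zero. The sketch below redoes the calculation by hand and checks it against friedman.test() from base R's stats package; the object names are chosen for this example only.

```r
acc <- matrix(c(
  69.939, 80.178, 78.794, 75.205, 62.268,
  78.175, 84.103, 79.017, 82.044, 75.564,
  75.531, 78.006, 77.161, 47.711, 81.575,
  70.282, 82.686, 81.884, 75.070, 79.476,
  71.675, 93.202, 95.387, 85.878, 82.818
), nrow = 5, byrow = TRUE,
dimnames = list(c("equinox", "lucene", "mylyn", "pde", "jdt"),
                c("Full", "MIC", "ReliefF", "LCorrel", "InfoGain")))

ranks <- t(apply(acc, 1, rank))   # rank the methods within each data set
rank_sums <- colSums(ranks)       # 8 23 19 13 12, as in the question

n <- nrow(acc)  # number of blocks (data sets)
k <- ncol(acc)  # number of treatments (methods)

chi_sq <- 12 / (n * k * (k + 1)) * sum(rank_sums^2) - 3 * n * (k + 1)
chi_sq                       # 11.36
friedman.test(acc)$statistic # same value (there are no ties in these data)
```

The statistic is compared against a chi-squared distribution with k - 1 degrees of freedom, so it cannot be negative when computed this way.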

Survdiff() output fields in R

My question is about the output structure of the survdiff() function from the 'survival' library in R. Namely, I have a data frame containing survival data
> dat
ID Time Treatment Gender Censored
1 E002 2.7597536 IND F 0
2 E003 4.2710472 Control M 0
3 E005 1.4784394 IND F 0
4 E006 6.8993840 Control F 1
5 E008 9.5934292 IND M 0
6 E009 2.9897331 Control F 0
7 E014 1.3470226 IND F 1
8 E016 2.1683778 Control F 1
9 E018 2.7597536 IND F 1
10 E022 1.3798768 IND F 0
11 E023 0.7227926 IND M 1
12 E024 5.5195072 IND F 0
13 E025 2.4640657 Control F 0
14 E028 7.4579055 Control M 1
15 E029 5.5195072 Control F 1
16 E030 2.7926078 IND M 0
17 E031 4.9938398 Control F 0
18 E032 2.7268994 IND M 0
19 E033 0.1642710 IND M 1
20 E034 4.1396304 Control F 0
and a model
> diff = survdiff(Surv(Time, Censored) ~ Treatment+Gender, data = dat)
> diff
Call:
survdiff(formula = Surv(Time, Censored) ~ Treatment + Gender,
data = dat)
N Observed Expected (O-E)^2/E (O-E)^2/V
Treatment=Control, Gender=M 2 1 1.65 0.255876 0.360905
Treatment=Control, Gender=F 7 3 2.72 0.027970 0.046119
Treatment=IND, Gender=M 5 2 2.03 0.000365 0.000519
Treatment=IND, Gender=F 6 2 1.60 0.100494 0.139041
Chisq= 0.5 on 3 degrees of freedom, p= 0.924
I'm wondering which field of the output object contains the values from the rightmost column, (O-E)^2/V. I'd like to use them further, but I can't obtain them from diff$obs, diff$exp, diff$var or from their combinations.
Your help's gonna be much appreciated.
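For (O-E)^2/V try something like
(diff$obs - diff$exp)^2 / diag(diff$var)
while for (O-E)^2/E try something like
(diff$obs - diff$exp)^2 / diff$exp
If the model contains a strata() term, diff$obs and diff$exp are matrices rather than vectors; in that case take rowSums() of each before applying the formulas above.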
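As a self-contained illustration of rebuilding the two printed columns from the components of a survdiff object, the sketch below uses the `lung` data set that ships with the survival package. This is an assumption made for the example (it is not the asker's data), and the model has no strata() term, so `obs` and `exp` are plain vectors.

```r
library(survival)  # provides Surv(), survdiff() and the `lung` example data

# Illustrative two-group model; not the data from the question
fit <- survdiff(Surv(time, status) ~ sex, data = lung)

o_minus_e <- fit$obs - fit$exp           # observed minus expected, per group
chisq_e <- o_minus_e^2 / fit$exp         # the (O-E)^2/E column
chisq_v <- o_minus_e^2 / diag(fit$var)   # the (O-E)^2/V column

cbind(N = as.vector(fit$n), Observed = fit$obs, Expected = fit$exp,
      chisq_e, chisq_v)
```

With only two groups, each entry of `chisq_v` coincides with the overall statistic stored in `fit$chisq`, which gives a quick sanity check on the reconstruction.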

How to create new column based on another column's contents in R

This is a sample of my data set:
>data
C1 C2 C3 C4 C5 C6
ATOM 1 -4.794 -7.29 6.756 C
ATOM 1 -4.357 -6.181 6.473 O
ATOM 2 -5.878 -8.511 5.233 C
ATOM 2 -7.02 -9.179 5.732 C
ATOM 3 -7.479 -9.499 6.108 C
ATOM 5 -4.873 -7.021 6.767 C
ATOM 8 -3.891 -6.723 6.31 O
ATOM 1 -7.515 -10.402 -0.621 C
ATOM 1 -7.26 -11.716 -0.22 O
ATOM 2 -7.53 -10.348 0.581 C
ATOM 3 -6.689 -11.008 2.344 C
Ultimately, what I want is a new column that detects when the numbers in C2 reset back to 1, as shown below:
>data
C1 C2 C3 C4 C5 C6 C7
ATOM 1 -4.794 -7.29 6.756 C 1
ATOM 1 -4.357 -6.181 6.473 O 1
ATOM 2 -5.878 -8.511 5.233 C 1
ATOM 2 -7.02 -9.179 5.732 C 1
ATOM 3 -7.479 -9.499 6.108 C 1
ATOM 5 -4.873 -7.021 6.767 C 1
ATOM 8 -3.891 -6.723 6.31 O 1
ATOM 1 -7.515 -10.402 -0.621 C 2
ATOM 1 -7.26 -11.716 -0.22 O 2
ATOM 2 -7.53 -10.348 0.581 C 2
ATOM 3 -6.689 -11.008 2.344 C 2
I used a for loop with a nested if statement. My method was to compare each row with the following row: if the current value is less than or equal to the next value, I assign 1 to the new column; if not, I increment the old count by 1.
C7 = NULL
i <- 1
index <- 1
for ( i in 1:nrow(data)){
if(data$C2[i] <= data$C2[i+1]){
data$C7 = index
} else {
data$C7 <- index + 1
}
}
This code doesn't work and this error occurs:
Error in if (data$C2[i] <= data$C2[i + 1]) { :
missing value where TRUE/FALSE needed
I'm not quite sure what I'm doing wrong and have a feeling that there's an even better way to do this. I appreciate any help -- thank you in advance!
You don't need a loop:
data$C7 <- c(1, cumsum(diff(data$C2) < 0) + 1)
C1 C2 C3 C4 C5 C6 C7
1 ATOM 1 -4.794 -7.290 6.756 C 1
2 ATOM 1 -4.357 -6.181 6.473 O 1
3 ATOM 2 -5.878 -8.511 5.233 C 1
4 ATOM 2 -7.020 -9.179 5.732 C 1
5 ATOM 3 -7.479 -9.499 6.108 C 1
6 ATOM 5 -4.873 -7.021 6.767 C 1
7 ATOM 8 -3.891 -6.723 6.310 O 1
8 ATOM 1 -7.515 -10.402 -0.621 C 2
9 ATOM 1 -7.260 -11.716 -0.220 O 2
10 ATOM 2 -7.530 -10.348 0.581 C 2
11 ATOM 3 -6.689 -11.008 2.344 C 2
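The one-liner above can be unpacked step by step. The sketch below works on the C2 values alone (copied from the sample data) and uses intermediate names that are introduced only for this illustration. It also hints at why the loop in the question fails: on the last iteration `data$C2[i + 1]` points past the end of the vector and is NA, so the `if` condition has no TRUE/FALSE value.

```r
C2 <- c(1, 1, 2, 2, 3, 5, 8, 1, 1, 2, 3)

steps <- diff(C2)       # change from each row to the next: 0 1 0 1 2 3 -7 0 1 1
resets <- steps < 0     # TRUE where the numbering drops back down
group <- c(1, cumsum(resets) + 1)  # first row is group 1; each reset starts a new group

group
# 1 1 1 1 1 1 1 2 2 2 2

C2[length(C2) + 1]      # NA: the value the loop compares against on its last pass
```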

Converting between numbering systems

I'm trying to understand the reason for a rule when converting.
I'm sure there must be a simple explanation, but I can't seem to wrap my head around it.
Appreciate any help!
Converting from base10 to any other base is done like this:
number / desiredBase = number + remainder
You do this until number = 0.
But after all of the calculations, you have to read the remainders in reverse order (from bottom to top). I don't understand why.
For example: base10 number to base2
11 / 2 = 5 + 1
5 / 2 = 2 + 1
2 / 2 = 1 + 0
1 / 2 = 0 + 1
Why is the correct answer: 1011 and not 1101 ?
I know it's a little petty, but it would really help me remember better if I could understand this.
Think of the same in decimal system, even if it doesn't make that much sense to actually do the math in this case :)
1234 / 10 = 123 | 4
123 / 10 = 12 | 3
12 / 10 = 1 | 2
1 / 10 = 0 | 1
Every time you divide, you strip off the least significant digit, so the first remainder is the least significant digit, that is, the rightmost one.
Because 11 =
1 * 2 ^ 3 + 0 * 2 ^ 2 + 1 * 2 ^ 1 + 1 * 2 ^ 0 (1011)
and not
1 * 2 ^ 3 + 1 * 2 ^ 2 + 0 * 2 ^ 1 + 1 * 2 ^ 0 (1101)
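The repeated-division procedure can be sketched in a few lines of R. The function name `to_base` is made up for this example, and the sketch assumes a non-negative whole number and a base between 2 and 10 so that every digit is a single character. Each remainder is placed in front of the digits collected so far, which is the same as writing the remainders down and then reading them from bottom to top.

```r
to_base <- function(n, base = 2) {
  stopifnot(n >= 0, base >= 2, base <= 10)
  if (n == 0) return("0")
  digits <- integer(0)
  while (n > 0) {
    digits <- c(n %% base, digits)  # newest remainder goes to the left
    n <- n %/% base                 # integer division strips the last digit
  }
  paste(digits, collapse = "")
}

to_base(11, 2)     # "1011"
to_base(1234, 10)  # "1234"
```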
