I am trying to take the mean over the sam and dup columns of the following dataset:
fat co lab sam dup
1 0.62 1 1 1 1
2 0.55 1 1 1 2
3 0.34 1 1 2 1
4 0.24 1 1 2 2
5 0.80 1 1 3 1
6 0.68 1 1 3 2
7 0.76 1 1 4 1
8 0.65 1 1 4 2
9 0.30 1 2 1 1
10 0.40 1 2 1 2
11 0.33 1 2 2 1
12 0.43 1 2 2 2
13 0.39 1 2 3 1
14 0.40 1 2 3 2
15 0.29 1 2 4 1
16 0.18 1 2 4 2
17 0.46 1 3 1 1
18 0.38 1 3 1 2
19 0.27 1 3 2 1
20 0.37 1 3 2 2
21 0.37 1 3 3 1
22 0.42 1 3 3 2
23 0.45 1 3 4 1
24 0.54 1 3 4 2
25 0.18 2 1 1 1
26 0.47 2 1 1 2
27 0.53 2 1 2 1
28 0.32 2 1 2 2
29 0.40 2 1 3 1
30 0.37 2 1 3 2
31 0.31 2 1 4 1
32 0.43 2 1 4 2
33 0.35 2 2 1 1
34 0.39 2 2 1 2
35 0.37 2 2 2 1
36 0.33 2 2 2 2
37 0.42 2 2 3 1
38 0.36 2 2 3 2
39 0.20 2 2 4 1
40 0.41 2 2 4 2
41 0.37 2 3 1 1
42 0.43 2 3 1 2
43 0.28 2 3 2 1
44 0.36 2 3 2 2
45 0.18 2 3 3 1
46 0.20 2 3 3 2
47 0.26 2 3 4 1
48 0.06 2 3 4 2
The output should be this:
lab co fat
1 1 1 0.58000
2 2 1 0.34000
3 3 1 0.40750
4 1 2 0.37625
5 2 2 0.35375
6 3 2 0.26750
These are both in the form of .RData files.
How can this be done?
An example with part of the data you posted:
dt = read.table(text = "
fat co lab sam dup
0.62 1 1 1 1
0.55 1 1 1 2
0.34 1 1 2 1
0.24 1 1 2 2
0.80 1 1 3 1
0.68 1 1 3 2
0.76 1 1 4 1
0.65 1 1 4 2
0.30 1 2 1 1
0.40 1 2 1 2
0.33 1 2 2 1
0.43 1 2 2 2
0.39 1 2 3 1
0.40 1 2 3 2
0.29 1 2 4 1
0.18 1 2 4 2
", header= T)
library(dplyr)
dt %>%
group_by(lab, co) %>% # for each lab and co combination
summarise(fat = mean(fat)) %>% # get the mean of fat
ungroup() # forget the grouping
# # A tibble: 2 x 3
# lab co fat
# <int> <int> <dbl>
# 1 1 1 0.58
# 2 2 1 0.34
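A base R alternative that gives the same result is aggregate(); and since the original data comes as .RData files, you would first bring it in with load() (the file name below is only a placeholder, and the object name after loading depends on how it was saved):
# load("mydata.RData")   # placeholder path; load() restores the saved data frame
aggregate(fat ~ lab + co, data = dt, FUN = mean)   # mean of fat for each lab/co combination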
I am currently trying to visualize k-means clusters and running into a bit of trouble. I'm getting this error message when I run the code below:
Error in fviz_cluster(res.km, data = nci[, 5], palette = c("#2E9FDF", :
The dimension of the data < 2! No plot.
Here's my code:
library(dplyr)
library(tidyr)
library(ggplot2)
library(tidyverse)
library(hrbrthemes)
library(factoextra)
library(ggpubr)
nci <- read.csv('/Users/KyleHammerberg/Desktop/ML Extra Credit/nci.datanames.csv')
names(nci)[1] <- "gene"
# Compute k-means with k = 3
set.seed(123)
res.km <- kmeans(scale(nci[,2]), 3, nstart = 25)
# K-means clusters showing the group of each individuals
res.km$cluster
fviz_cluster(res.km, data = nci[,5 ],
palette = c("#2E9FDF", "#00AFBB", "#E7B800"),
geom = "point",
ellipse.type = "convex",
ggtheme = theme_bw()
)
res.km$cluster
[1] 1 2 1 2 3 1 1 3 3 3 3 3 1 1 1 3 3 3 1 3 3 3 3 1 1 1 3 3 3 3 1 3 3 1 3 3 1 1 1 1 1 3
[43] 1 3 3 3 1 1 1 1 3 3 3 3 3 3 3 1 1 3 3 1 1 1 1 1 1 1 3 1 3 1 1 1 3 3 1 2 1 1 3 2 1 3
[85] 1 1 1 1 1 1 1 2 3 1 1 1 3 3 1 1 1 1 1 1 1 3 2 1 2 1 3 3 1 1 1 1 3 3 1 3 3 3 3 1 1 1
[127] 3 3 1 3 1 1 1 3 1 1 1 2 2 2 1 2 2 2 3 1 1 3 3 1 3 1 2 1 3 3 3 3 3 3 1 1 3 1 1 3 3 3
[169] 1 3 3 3 3 1 1 3 1 1 1 1 1 3 1 1 1 1 1 3 1 1 1 1 2 3 3 3 1 3 3 1 1 3 3 1 3 1 1 3 3 1
[211] 3 1 3 1 3 3 1 3 3 1 1 1 1 3 3 1 3 1 3 3 3 3 1 1 1 1 1 3 3 1 3 1 3 1 3 1 3 1 3 3 3 3
[253] 3 3 1 3 3 3 3 3 1 2 1 3 1 3 3 1 1 3 1 1 1 1 1 3 1 3 3 3 3 1 1 3 3 1 3 3 1 1 1 3 1 1
[295] 2 3 1 3 1 3 1 3 1 3 3 3 1 3 3 3 3 3 3 3 1 1 1 1 3 1 1 1 3 1 3 1 1 1 1 3 3 1 3 1 1 1
[337] 3 1 1 2 1 1 1 1 1 1 3 1 3 3 1 3 1 3 3 1 1 3 3 1 1 1 3 1 1 3 3 1 1 1 1 1 1 1 3 1 3 1
[379] 1 1 1 1 1 1 1 1 3 3 1 3 1 1 1 2 1 1 1 3 1 1 1 1 1 3 3 1 3 3 3 1 1 1 1 1 1 1 1 1 3 1
[421] 1 1 1 3 1 3 1 2 1 3 3 3 1 1 1 1 1 1 3 1 1 3 1 1 1 1 1 1 1 3 1 3 3 3 1 1 3 3 1 1 1 3
[463] 3 3 1 3 3 1 3 3 3 3 1 3 1 1 1 3 1 3 3 3 3 3 3 3 3 3 1 3 1 1 3 3 1 1 3 3 3 3 3 3 3 3
[505] 3 3 3 1 3 1 3 3 2 1 1 3 3 1 3 3 3 1 1 3 3 3 1 1 1 1 1 3 3 1 3 3 1 1 1 3 3 1 3 3 1 3
[547] 1 1 1 1 3 3 3 1 3 3 3 3 3 3 1 2 1 1 3 3 3 3 1 1 3 3 3 3 3 1 3 1 1 3 1 3 3 3 3 3 3 3
[589] 1 1 1 1 1 1 3 1 3 1 3 3 3 3 3 1 3 3 3 3 3 1 1 3 3 3 3 3 3 1 3 1 3 3 3 3 3 3 1 3 3 3
[631] 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 3 1 3 3 1 3 3 3 1 3
[673] 1 3 3 1 1 1 3 1 3 3 3 3 1 3 3 1 3 1 1 1 1 3 1 3 1 3 3 3 1 1 1 3 1 1 1 1 3 3 3 3 3 3
[715] 1 1 1 1 1 1 1 3 1 1 1 3 1 1 3 3 1 1 3 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 3 1 3 1 1 3 3
[757] 1 1 1 1 1 1 1 3 3 3 3 1 3 1 1 3 1 3 3 1 1 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 1 1 1 1
[799] 1 1 1 1 1 1 1 1 3 1 1 1 1 3 1 1 3 3 1 3 3 1 3 1 3 1 3 1 3 1 3 1 3 1 1 1 1 3 3 1 3 3
[841] 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 1 1 3 3 1 2 1 1 1 3 3 1 3 1 1 1 1 1 1 3 1 3 1 1 1
[883] 1 1 1 1 1 1 3 1 1 1 1 3 3 1 1 3 3 3 3 3 3 1 1 2 1 3 1 1 1 1 1 1 1 3 1 3 1 3 1 1 1 1
[925] 1 1 1 3 3 1 1 3 1 1 1 1 1 1 1 1 1 1 3 3 3 3 1 3 3 3 3 3 3 3 1 1 1 3 1 3 1 1 1 1 1 1
[967] 1 1 1 3 1 1 3 1 3 1 3 1 1 3 1 3 3 3 3 3 3 3 1 3 1 3 3 3 3 1 3 1 1 1
[ reached getOption("max.print") -- omitted 5830 entries ]
Here's a look at the data if that helps:
head(nci)
gene CNS CNS.1 CNS.2 RENAL BREAST CNS.3 CNS.4 BREAST.1 NSCLC NSCLC.1
1 g1 0.300 0.679961 0.940 2.80e-01 0.485 0.310 -0.830 -0.190 0.460 0.760
2 g2 1.180 1.289961 -0.040 -3.10e-01 -0.465 -0.030 0.000 -0.870 0.000 1.490
3 g3 0.550 0.169961 -0.170 6.80e-01 0.395 -0.100 0.130 -0.450 1.150 0.280
4 g4 1.140 0.379961 -0.040 -8.10e-01 0.905 -0.460 -1.630 0.080 -1.400 0.100
5 g5 -0.265 0.464961 -0.605 6.25e-01 0.200 -0.205 0.075 0.005 -0.005 -0.525
6 g6 -0.070 0.579961 0.000 -1.39e-17 -0.005 -0.540 -0.360 0.350 -0.700 0.360
RENAL.1 RENAL.2 RENAL.3 RENAL.4 RENAL.5 RENAL.6 RENAL.7 BREAST.2 NSCLC.2 RENAL.8 UNKNOWN
1 0.270 -0.450 -0.030 0.710 -0.360 -0.210 -0.500 -1.060 0.150 -0.290 -0.200
2 0.630 -0.060 -1.120 0.000 -1.420 -1.950 -0.520 -2.190 -0.450 0.000 0.740
3 -0.360 0.150 -0.050 0.160 -0.030 -0.700 -0.660 -0.130 -0.320 0.050 0.080
4 -1.040 -0.610 0.000 -0.770 -2.280 -1.650 -2.610 0.000 -1.610 0.730 0.760
5 0.015 -0.395 -0.285 0.045 0.135 -0.075 0.225 -0.485 -0.095 0.385 -0.105
6 -0.040 0.150 -0.250 -0.160 -0.320 0.060 -0.050 -0.430 -0.080 0.390 -0.080
OVARIAN MELANOMA PROSTATE OVARIAN.1 OVARIAN.2 OVARIAN.3 OVARIAN.4 OVARIAN.5 PROSTATE.1
1 0.430 -0.490 -0.530 -0.010 0.640 -0.480 0.140 0.640 0.070
2 0.500 0.330 -0.050 -0.370 0.550 0.970 0.720 0.150 0.290
3 -0.730 0.010 -0.230 -0.160 -0.540 0.300 -0.240 -0.170 0.070
4 0.600 -1.660 0.170 0.930 -1.780 0.470 0.000 0.550 1.310
5 -0.635 -0.185 0.825 0.395 0.315 0.425 1.715 -0.205 0.085
6 -0.430 -0.140 0.010 -0.100 0.810 0.020 0.260 0.290 -0.620
NSCLC.3 NSCLC.4 NSCLC.5 LEUKEMIA K562B.repro X6K562B.repro LEUKEMIA.1 LEUKEMIA.2
1 0.130 0.320 0.515 0.080 0.410 -0.200 -0.36998050 -0.370
2 2.240 0.280 1.045 0.120 0.000 0.000 -1.38998000 0.180
3 0.640 0.360 0.000 0.060 0.210 0.060 -0.05998047 0.000
4 0.680 -1.880 0.000 0.400 0.180 -0.070 0.07001953 -1.320
5 0.135 0.475 0.330 0.105 -0.255 -0.415 -0.07498047 -0.825
6 0.300 0.110 -0.155 -0.190 -0.110 0.020 0.04001953 -0.130
LEUKEMIA.3 LEUKEMIA.4 LEUKEMIA.5 COLON COLON.1 COLON.2 COLON.3 COLON.4
1 -0.430 -0.380 -0.550 -0.32003900 -0.620 -4.90e-01 0.07001953 -0.120
2 -0.590 -0.550 0.000 0.08996101 0.080 4.20e-01 -0.82998050 0.000
3 -0.500 -1.710 0.100 -0.29003900 0.140 -3.40e-01 -0.59998050 -0.010
4 -1.520 -1.870 -2.390 -1.03003900 0.740 7.00e-02 -0.90998050 0.130
5 -0.785 -0.585 -0.215 0.09496101 0.205 -2.05e-01 0.24501950 0.555
6 0.520 0.120 -0.620 0.05996101 0.000 -1.39e-17 -0.43998050 -0.550
COLON.5 COLON.6 MCF7A.repro BREAST.3 MCF7D.repro BREAST.4 NSCLC.6 NSCLC.7
1 -0.290 -0.8100195 0.200 0.37998050 0.3100195 0.030 -0.42998050 0.160
2 0.030 0.0000000 -0.230 0.44998050 0.4800195 0.220 -0.38998050 -0.340
3 -0.310 0.2199805 0.360 0.65998050 0.9600195 0.150 -0.17998050 -0.020
4 1.500 0.7399805 0.180 0.76998050 0.9600195 -1.240 0.86001950 -1.730
5 0.005 0.1149805 -0.315 0.05498047 -0.2149805 -0.305 0.78501950 -0.625
6 -0.540 0.1199805 0.410 0.54998050 0.3700195 0.050 0.04001953 -0.140
NSCLC.8 MELANOMA.1 BREAST.5 BREAST.6 MELANOMA.2 MELANOMA.3 MELANOMA.4 MELANOMA.5
1 0.010 -0.620 -0.380 0.04998047 0.650 -0.030 -0.270 0.210
2 -1.280 -0.130 0.000 -0.72001950 0.640 -0.480 0.630 -0.620
3 -0.770 0.200 -0.060 0.41998050 0.150 0.070 -0.100 -0.150
4 0.940 -1.410 0.800 0.92998050 -1.970 -0.700 1.100 -1.330
5 -0.015 1.585 -0.115 -0.09501953 -0.065 -0.195 1.045 0.045
6 0.270 1.160 0.180 0.19998050 0.130 0.410 0.080 -0.400
MELANOMA.6 MELANOMA.7
1 -5.00e-02 0.350
2 1.40e-01 -0.270
3 -9.00e-02 0.020
4 -1.26e+00 -1.230
5 4.50e-02 -0.715
6 -2.71e-20 -0.340
nci[, 5] is data with only one column. fviz_cluster requires data with at least 2 columns. This check is performed in these lines: https://github.com/kassambara/factoextra/blob/master/R/fviz_cluster.R#L184-L203
Using mtcars as an example.
Passing a single column in data:
res.km <- kmeans(scale(mtcars[,2]), 3, nstart = 25)
factoextra::fviz_cluster(res.km, data = mtcars[,5],
palette = c("#2E9FDF", "#00AFBB", "#E7B800"),
geom = "point",
ellipse.type = "convex",
ggtheme = theme_bw())
Error in factoextra::fviz_cluster(res.km, data = mtcars[, 5], palette = c("#2E9FDF", :
The dimension of the data < 2! No plot.
Passing two columns in data:
factoextra::fviz_cluster(res.km, data = mtcars[,5:6],
palette = c("#2E9FDF", "#00AFBB", "#E7B800"),
geom = "point",
ellipse.type = "convex",
ggtheme = theme_bw())
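Applied back to the original code, one possible fix is to pass at least two numeric columns both to kmeans() and to fviz_cluster(); the columns chosen below (2:3) are only an illustration and depend on which variables of nci you actually want to cluster on:
res.km <- kmeans(scale(nci[, 2:3]), 3, nstart = 25)   # cluster on two numeric columns
factoextra::fviz_cluster(res.km, data = nci[, 2:3],   # plot the same two columns
                         palette = c("#2E9FDF", "#00AFBB", "#E7B800"),
                         geom = "point",
                         ellipse.type = "convex",
                         ggtheme = theme_bw())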
This is my data set. You can get the data from this link (if you can't, please let me know):
https://www.dropbox.com/s/1n9hpyhcniaghh5/table.csv?dl=0
LABEL DATE TAU TYPE x y z
1 A 1 2 1 0.75 7 16
2 A 1 2 0 0.41 5 18
3 A 1 2 1 0.39 6 14
4 A 2 3 0 0.65 5 14
5 A 2 3 1 0.55 7 19
6 A 2 3 1 0.69 5 19
7 A 2 3 0 0.66 7 19
8 A 3 1 0 0.38 8 15
9 A 3 1 0 0.02 5 16
10 A 3 1 0 0.71 8 13
11 B 1 2 1 0.25 9 18
12 B 1 2 0 0.06 8 20
13 B 1 2 1 0.60 8 20
14 B 1 2 0 0.56 6 13
15 B 1 3 1 0.50 8 19
16 B 1 3 0 0.04 8 16
17 B 2 1 1 0.04 5 15
18 B 2 1 1 0.75 5 13
19 B 2 1 0 0.44 8 18
20 B 2 1 1 0.52 9 13
I want to filter the data by group with multiple conditions. The conditions are:
1. the number of rows for each type (0, 1) of the TYPE variable within each group must be bigger than 1
2. the number of rows for each type must be equal
(for example: the number of rows of type 1 equals the number of rows of type 0 within each group)
I have tried many times, and finally I got this code and this output:
table %>% group_by(label,date,tau,type) %>% filter(n()>1) %>% filter(length(type==1)==length(type==0))
# A tibble: 16 x 7
# Groups: label, date, tau, type [7]
LABEL DATE TAU TYPE x y z
<fctr> <int> <int> <int> <dbl> <int> <int>
1 A 1 2 1 0.75 7 16
2 A 1 2 1 0.39 6 14
3 A 2 3 0 0.65 5 14
4 A 2 3 1 0.55 7 19
5 A 2 3 1 0.69 5 19
6 A 2 3 0 0.66 7 19
7 A 3 1 0 0.38 8 15
8 A 3 1 0 0.02 5 16
9 A 3 1 0 0.71 8 13
10 B 1 2 1 0.25 9 18
11 B 1 2 0 0.06 8 20
12 B 1 2 1 0.60 8 20
13 B 1 2 0 0.56 6 13
14 B 2 1 1 0.04 5 15
15 B 2 1 1 0.75 5 13
16 B 2 1 1 0.52 9 13
I am confused by the output I get with this code. It already removes the data that does not meet condition 1, BUT the data that does not meet condition 2 is still there.
The result I want is the one below:
LABEL DATE TAU TYPE x y z
<fctr> <int> <int> <int> <dbl> <int> <int>
3 A 2 3 0 0.65 5 14
4 A 2 3 1 0.55 7 19
5 A 2 3 1 0.69 5 19
6 A 2 3 0 0.66 7 19
10 B 1 2 1 0.25 9 18
11 B 1 2 0 0.06 8 20
12 B 1 2 1 0.60 8 20
13 B 1 2 0 0.56 6 13
And if I want to compute a value with the function below for each row, how should I code it? Just use mutate()?
f(x,y,z) = 2 * x + y - z / 3 if TYPE == 1
f(x,y,z) = 4 * x - y / 2 + z / 3 if TYPE == 0
I hope someone can help me, and I appreciate your help! If you need any other information, just let me know.
# example dataset
df = read.table(text = "
LABEL DATE TAU TYPE x y z
1 A 1 2 1 0.75 7 16
2 A 1 2 0 0.41 5 18
3 A 1 2 1 0.39 6 14
4 A 2 3 0 0.65 5 14
5 A 2 3 1 0.55 7 19
6 A 2 3 1 0.69 5 19
7 A 2 3 0 0.66 7 19
8 A 3 1 0 0.38 8 15
9 A 3 1 0 0.02 5 16
10 A 3 1 0 0.71 8 13
11 B 1 2 1 0.25 9 18
12 B 1 2 0 0.06 8 20
13 B 1 2 1 0.60 8 20
14 B 1 2 0 0.56 6 13
15 B 1 3 1 0.50 8 19
16 B 1 3 0 0.04 8 16
17 B 2 1 1 0.04 5 15
18 B 2 1 1 0.75 5 13
19 B 2 1 0 0.44 8 18
20 B 2 1 1 0.52 9 13
", header=T, stringsAsFactors=F)
library(dplyr)
library(tidyr)
# function to use for each row
# (assumes that type can be only 1 or 0)
f = function(t,x,y,z) { ifelse(t == 1,
2 * x + y - z / 3,
4 * x - y / 2 + z / 3) }
df %>%
count(LABEL, DATE, TAU, TYPE) %>% # count rows for each group (based on those combinations)
filter(n > 1) %>% # keep groups with multiple rows
mutate(TYPE = paste0("TYPE_",TYPE)) %>% # update variable
spread(TYPE, n, fill = 0) %>% # reshape data
filter(TYPE_0 == TYPE_1) %>% # keep groups with equal number of rows for type 0 and 1
select(LABEL, DATE, TAU) %>% # keep variables/groups of interest
inner_join(df, by=c("LABEL", "DATE", "TAU")) %>% # join back info
mutate(f_value = f(TYPE,x,y,z)) # apply function
# # A tibble: 8 x 8
# LABEL DATE TAU TYPE x y z f_value
# <chr> <int> <int> <int> <dbl> <int> <int> <dbl>
# 1 A 2 3 0 0.65 5 14 4.76666667
# 2 A 2 3 1 0.55 7 19 1.76666667
# 3 A 2 3 1 0.69 5 19 0.04666667
# 4 A 2 3 0 0.66 7 19 5.47333333
# 5 B 1 2 1 0.25 9 18 3.50000000
# 6 B 1 2 0 0.06 8 20 2.90666667
# 7 B 1 2 1 0.60 8 20 2.53333333
# 8 B 1 2 0 0.56 6 13 3.57333333
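A more compact variant, closer in spirit to what the asker tried, keeps the grouping at the LABEL/DATE/TAU level (not including TYPE, so both type counts are visible inside one group) and states the two conditions directly; this sketch assumes TYPE only ever takes the values 0 and 1:
df %>%
  group_by(LABEL, DATE, TAU) %>%                  # group WITHOUT TYPE so both counts are comparable
  filter(sum(TYPE == 1) > 1,                      # condition 1: more than one row of type 1 ...
         sum(TYPE == 0) > 1,                      # ... and more than one row of type 0
         sum(TYPE == 1) == sum(TYPE == 0)) %>%    # condition 2: equal counts of the two types
  ungroup() %>%
  mutate(f_value = f(TYPE, x, y, z))              # apply the row-wise function defined above
The original attempt grouped by TYPE as well, which is why the counts of the two types could never be compared within one group.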
I have 100 rows and 5 columns; each row is a sequence of 5 integers between 1 and 3, like:
X1 X2 X3 X4 X5
1 2 2 1 3 1
2 2 1 2 3 2
3 2 3 1 2 2
4 2 3 1 1 3
5 2 1 2 3 1
6 2 1 2 2 2
7 2 2 2 1 2
8 2 1 3 2 2
9 2 2 2 2 2
10 2 2 1 2 3
11 2 3 2 1 1
12 2 1 3 2 1
13 2 1 3 2 3
14 2 2 2 3 3
15 2 3 2 2 1
16 2 1 2 2 3
17 2 3 3 1 2
18 2 1 3 1 3
.....
How can I calculate the probability of a row's whole sequence having seen only the first integer, and then update it as I look at the second, the third, and so on?
More info:
There is no dependency between rows or between columns, but I have the probability of each integer (1, 2, 3) appearing in each column: 1 (0.2), 2 (0.2), 3 (0.6), and the probability of each unique series in a row is:
11223 0.01
12311 0.01
13121 0.01
13233 0.01
13313 0.01
13323 0.01
.........
33233 0.02
33313 0.01
33323 0.01
33331 0.01
33332 0.03
33333 0.16
Each draw is independent of the previous draw.
Instead of the integers 1, 2, 3 I should perhaps talk about A, B and C to avoid confusion.
This is a categorical Bayesian probability problem (I think :-) ).
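Under the stated independence assumption the probability of a full row is simply the product of the per-position symbol probabilities, and the "update" after each new symbol is a running (cumulative) product. A minimal sketch using the 0.2/0.2/0.6 probabilities quoted above (note this model will not reproduce the empirical per-series frequencies listed, which were presumably estimated from data):
p <- c("1" = 0.2, "2" = 0.2, "3" = 0.6)         # per-symbol probabilities from the question
row1 <- c(2, 2, 1, 3, 1)                        # e.g. the first row of the 100 x 5 data
prefix_prob <- cumprod(p[as.character(row1)])   # probability of the prefix after 1, 2, ..., 5 symbols
prefix_prob                                     # the last entry is the probability of the full sequence
# for all rows at once (X being the hypothetical 100 x 5 matrix or data frame):
# apply(X, 1, function(r) prod(p[as.character(r)]))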
Consider this data:
m = data.frame(pop=c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4),
id=c(0,1,1,1,1,1,0,2,1,1,1,2,1,2,2,2))
> m
pop id
1 1 0
2 1 1
3 1 1
4 1 1
5 2 1
6 2 1
7 2 0
8 2 2
9 2 1
10 3 1
11 3 1
12 3 2
13 3 1
14 3 2
15 4 2
16 4 2
I would like to get the frequency of each unique id in each unique pop. For example, id 1 is present 3 times out of 4 when pop == 1, therefore the frequency of id 1 in pop 1 is 0.75.
I came up with this ugly solution:
out = matrix(0,ncol=3)
for (p in unique(m$pop))
{
for (i in unique(m$id))
{
m1 = m[m$pop == p,]
f = nrow(m1[m1$id == i,])/nrow(m1)
out = rbind(out, c(p, f, i))
}
}
out = out[-1,]
colnames(out) = c("pop", "freq", "id")
# SOLUTION
> out
pop freq id
[1,] 1 0.25 0
[2,] 1 0.75 1
[3,] 1 0.00 2
[4,] 2 0.20 0
[5,] 2 0.60 1
[6,] 2 0.20 2
[7,] 3 0.00 0
[8,] 3 0.60 1
[9,] 3 0.40 2
[10,] 4 0.00 0
[11,] 4 0.00 1
[12,] 4 1.00 2
I am sure there is a more efficient solution using data.table or table, but I couldn't find it.
Here's what I might do:
as.data.frame(prop.table(table(m),1))
# pop id Freq
# 1 1 0 0.25
# 2 2 0 0.20
# 3 3 0 0.00
# 4 4 0 0.00
# 5 1 1 0.75
# 6 2 1 0.60
# 7 3 1 0.60
# 8 4 1 0.00
# 9 1 2 0.00
# 10 2 2 0.20
# 11 3 2 0.40
# 12 4 2 1.00
If you want it sorted by pop, you can do that afterwards. Alternatively, you could transpose the table with t before converting to data.frame, or use rev(m) and prop.table on dimension 2.
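To make that concrete, here is a short sketch of two of those options: reordering the converted data frame afterwards, and building the table with pop as the second margin via rev():
out <- as.data.frame(prop.table(table(m), 1))
out[order(out$pop, out$id), ]                  # reorder the result by pop, then id

# rev(m) reverses the column order, so pop becomes the second margin;
# proportions are then taken over dimension 2 and the conversion comes out ordered by pop
as.data.frame(prop.table(table(rev(m)), 2))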
Try:
library(dplyr)
m %>%
group_by(pop, id) %>%
summarise(s = n()) %>%
mutate(freq = s / sum(s)) %>%
select(-s)
Which gives:
#Source: local data frame [8 x 3]
#Groups: pop
#
# pop id freq
#1 1 0 0.25
#2 1 1 0.75
#3 2 0 0.20
#4 2 1 0.60
#5 2 2 0.20
#6 3 1 0.60
#7 3 2 0.40
#8 4 2 1.00
A data.table solution:
setDT(m)[, {div = .N; .SD[, .N/div, keyby = id]}, by = pop]
# pop id V1
#1: 1 0 0.25
#2: 1 1 0.75
#3: 2 0 0.20
#4: 2 1 0.60
#5: 2 2 0.20
#6: 3 1 0.60
#7: 3 2 0.40
#8: 4 2 1.00
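An arguably more readable data.table variant counts each (pop, id) combination first and then normalizes within pop; note that, like the dplyr answer, it omits combinations that never occur (no 0.00 rows):
library(data.table)
setDT(m)
m[, .N, by = .(pop, id)][, .(id, freq = N / sum(N)), by = pop]   # count, then divide by group size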
I have a file that looks like
X90045GridMs.TotPFPrc X90045Inv.TmpLimStt X90042InvCtl.Stt X90042Mode
1 NA NA NA NA
2 0.00 1 3 7
3 0.44 1 2 1
4 0.80 1 2 1
5 0.88 1 2 1
6 0.93 1 2 1
7 0.95 1 2 1
8 0.98 1 2 1
9 0.99 1 2 1
where the headers are made up of a serial number and a parameter name. I would like to change the X90045 and X90042 prefixes in the headers to Inv1 and Inv2 using gsub. Is there a way to use gsub on the headers? The end result should look something like this:
Inv1GridMs.TotPFPrc Inv1Inv.TmpLimStt Inv2InvCtl.Stt Inv2Mode
1 NA NA NA NA
2 0.00 1 3 7
3 0.44 1 2 1
4 0.80 1 2 1
5 0.88 1 2 1
6 0.93 1 2 1
7 0.95 1 2 1
8 0.98 1 2 1
9 0.99 1 2 1
Is your data in a data.frame object? If so, you can access and modify the header using names():
names(yourdata) <- gsub("X90045", "Inv1", names(yourdata))
and likewise for the other serial number.
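If there are several serial numbers to rename, a small lookup vector keeps the replacements in one place; a sketch along the same lines (yourdata is the placeholder name used above):
repl <- c(X90045 = "Inv1", X90042 = "Inv2")     # pattern -> replacement lookup
for (pat in names(repl)) {
  names(yourdata) <- gsub(pat, repl[[pat]], names(yourdata), fixed = TRUE)
}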