I need to make a DESeq2 analysis with my dataset for an homework, but I'm really new with this package (I never used it before).
When I want to make a
counts <- read.table("ProstateCancerCountData.txt",sep="", header=TRUE, row.names=1)
metadat<- read.table("mart_export.txt",sep=",", header=TRUE, row.names=1)
counts <- as.matrix(counts)
dds <- DESeqDataSetFromMatrix(countData = counts, colData = metadat, design = ~ GC.content+ Gene.type)
I have this error :
Erreur dans DESeqDataSetFromMatrix(countData = counts, colData = metadat, :
ncol(countData) == nrow(colData) n'est pas TRUE
I don't know how to fix it.
This is the two dataset I have to used for the analysis :
head(counts)
N_10 T_10 N_11 T_12 N_13 T_13 N_14 T_14 N_1 T_1 N_2 T_2 N_3
ENSG00000000003 401 442 1155 1095 788 754 852 938 774 520 808 648 891
ENSG00000000005 0 7 23 9 5 2 45 5 11 10 56 8 7
ENSG00000000419 112 96 424 468 385 452 751 491 247 222 509 363 706
ENSG00000000457 13 121 327 165 40 204 290 199 70 121 104 151 352
ENSG00000000460 24 66 162 137 71 159 174 156 86 94 120 91 166
ENSG00000000938 96 128 218 372 126 129 538 320 117 129 157 238 177
T_3 N_4 N_5 T_6 N_7 T_7 N_8 T_8 N_9 T_9
ENSG00000000003 1071 2059 737 1006 1146 653 1299 1306 1522 490
ENSG00000000005 0 18 0 7 1 4 1 2 0 3
ENSG00000000419 622 988 307 402 294 323 535 518 573 322
ENSG00000000457 333 328 58 153 138 115 179 200 86 85
ENSG00000000460 152 162 100 100 101 148 128 78 83 109
ENSG00000000938 86 113 410 230 64 76 93 61 121 68
head(metadat)
Chromosome.scaffold.name Gene.start..bp. Gene.end..bp.
ENSG00000271782 1 50902700 50902978
ENSG00000232753 1 103817769 103828355
ENSG00000225767 1 50927141 50936822
ENSG00000202140 1 50965430 50965529
ENSG00000207194 1 51048076 51048183
ENSG00000252825 1 51215968 51216025
GC.content Gene.type
ENSG00000271782 35.48 lincRNA
ENSG00000232753 33.99 lincRNA
ENSG00000225767 38.99 antisense
ENSG00000202140 43.00 misc_RNA
ENSG00000207194 37.96 snRNA
ENSG00000252825 36.21 snRNA
Thank you for your help, and for your lighting
EDIT :
Thank you for your previous answer.
I take an another dataset to make this homework. But I have another bug :
This is my new dataset :
head(mycounts)
R1L1Kidney R1L2Liver R1L3Kidney R1L4Liver R1L6Liver
ENSG00000177757 2 1 0 0 1
ENSG00000187634 49 27 43 34 23
ENSG00000188976 73 34 77 56 45
ENSG00000187961 15 8 15 13 11
ENSG00000187583 1 0 1 1 0
ENSG00000187642 4 0 5 0 2
R1L7Kidney R1L8Liver R2L2Kidney R2L3Liver R2L6Kidney
ENSG00000177757 2 0 1 1 3
ENSG00000187634 41 35 42 25 47
ENSG00000188976 68 55 70 42 82
ENSG00000187961 13 12 12 20 15
ENSG00000187583 3 0 0 2 3
ENSG00000187642 12 1 9 4 9
head(myfactors)
Tissue TissueRun
R1L1Kidney Kidney Kidney_1
R1L2Liver Liver Liver_1
R1L3Kidney Kidney Kidney_1
R1L4Liver Liver Liver_1
R1L6Liver Liver Liver_1
R1L7Kidney Kidney Kidney_1
When I code my DESeq object, I would take the Tissue and TissueRun for take care of the batch. But I have an error :
dds2 <- DESeqDataSetFromMatrix(countData = mycounts, colData = myfactors, design = ~ Tissue + TissueRun)
Error in checkFullRank(modelMatrix) :
the model matrix is not full rank, so the model cannot be fit as specified.
One or more variables or interaction terms in the design formula are linear
combinations of the others and must be removed.
Please read the vignette section 'Model matrix not full rank':
vignette('DESeq2')
Thank you for your help
Related
I would like to build the Cumulative Distribution Function (CDF) from an input file that contains the data to generate a histogram. The input file has one column per bin and one column with the amount of ocurrences inside each bin, so it looks like this:
bin column6
0 1189
5 11957
10 24203
15 21518
20 14515
25 10323
30 7799
35 6015
40 4869
45 3858
50 3215
55 2615
60 2350
65 1890
70 1673
75 1433
80 1218
85 942
90 869
95 736
100 605
105 528
110 449
115 429
120 327
125 252
130 208
135 170
140 154
145 138
150 124
155 86
160 113
165 108
170 71
175 72
180 51
185 58
190 37
195 29
200 35
205 24
210 11
215 24
220 16
225 20
230 15
235 5
240 11
245 4
250 4
255 6
260 6
265 6
270 4
275 3
280 4
285 2
290 3
295 1
300 5
305 3
310 2
315 1
320 1
325 2
330 0
335 1
340 2
345 0
350 0
355 2
360 4
365 2
370 0
375 1
380 1
385 2
390 0
395 1
400 1
405 1
I use R to visualize the histogram using the following code:
library(ggplot2)
input <- read.table('/home/agalvez/data/domains/histo_leu.txt', sep="\t", header=TRUE)
histo <- ggplot(data=input, aes(x=input$bin, y=input$column6)) +
geom_bar(stat="identity")
histo
Could someone give me some advice on how to build the CDF for this histogram? Thanks in advance!
Bit unclear question, I assume you are looking for the eCDF since any parametric CDF generally has an analytical formula.
In R, you can use ecdf to generate an eCDF.
library(purrr)
library(tidyr)
library(dplyr)
library(ggplot2)
input <- input %>%
filter(column6 != 0) %>%
mutate(
column6 = map(column6, ~1:.x)
) %>%
unnest(column6)
# Make the ecdf
input %$%
ecdf(bin)
# To plot use stat_ecdf
input %>%
ggplot(aes(bin))+
stat_ecdf(geom = "step")
I tried fitting gams to some dataframes I have. All minus one work. It fails with the error:
Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) : A term has fewer unique covariate combinations than specified maximum degrees of freedom
I looked a bit on the internet but couldn't really figure out what's really going wrong. All my 7 over dataframes run without a problem.
I then ran epiR::epi.cp(srtm[-c(1,7,8)]) and it gave me this output:
$cov.pattern
id n curv_plan curv_prof dem slope ca
1 1 1 1.113192e-02 3.991046e-03 3909 43.601479 5.225853
2 2 1 -2.686749e-03 3.474989e-03 3312 35.022511 4.418310
3 3 1 -1.033450e-02 -4.626922e-03 3326 36.678623 4.421465
4 4 1 -5.439283e-03 2.066148e-03 4069 31.501045 3.887526
5 5 1 -2.602015e-03 -1.249511e-04 3021 37.199219 5.010560
6 6 1 1.068216e-03 1.216902e-03 2844 44.694374 4.852220
7 7 1 -1.855443e-02 -5.965539e-03 2841 42.753750 5.088554
8 8 1 2.363193e-03 2.353357e-03 2833 33.160995 4.652209
9 9 1 2.169674e-02 1.049735e-02 2964 32.311535 4.671970
10 10 1 2.850910e-02 9.416230e-03 2956 50.791847 3.496096
11 11 1 -1.932028e-02 4.949751e-04 2794 38.714302 4.217102
12 12 1 -1.372750e-03 -4.437230e-03 3799 48.356312 4.597039
13 13 1 1.154181e-04 -4.114155e-03 3808 54.669777 3.518823
14 14 1 2.743768e-02 7.829833e-03 3580 23.674162 3.268744
15 15 1 7.216539e-03 9.818082e-04 3969 29.421440 4.354250
16 16 1 2.385139e-03 6.333927e-04 3635 10.555381 4.905733
17 17 1 -1.129411e-02 2.719948e-03 2805 29.195084 4.807369
18 18 1 4.584329e-04 -1.497223e-03 3676 32.754879 3.729304
19 19 1 1.883965e-03 4.189690e-03 3165 30.973505 4.833158
20 20 1 -5.350136e-03 -2.615470e-03 2745 32.534698 4.420852
21 21 1 1.484253e-02 -1.245213e-03 3872 26.113234 4.045357
22 22 1 -2.449377e-02 -5.045668e-04 2931 31.060991 5.170872
23 23 1 -2.962795e-02 -9.271557e-03 2917 21.680889 4.547461
24 24 1 -2.487545e-02 -7.834328e-03 2736 41.775677 4.543325
25 25 1 2.890568e-03 -2.040353e-03 2577 47.003765 3.739546
26 26 1 -5.119631e-03 8.869720e-03 3401 38.519680 5.428564
27 27 1 6.171266e-03 -6.515175e-04 2687 36.678623 4.152842
28 28 1 -8.297552e-03 -7.053435e-03 3678 39.532673 4.081311
29 29 1 8.652663e-03 2.394378e-03 3515 33.895370 4.220177
30 30 1 -2.528805e-03 -1.293259e-03 3404 42.548138 4.266330
31 31 1 1.899994e-02 6.367806e-03 3191 41.696201 3.300749
32 32 1 -2.243623e-02 -1.866033e-04 2433 34.162479 5.364681
33 33 1 -6.934012e-03 9.280805e-03 2309 32.667160 5.650699
34 34 1 -1.121149e-02 6.376335e-05 2188 31.119059 4.706416
35 35 1 -1.429000e-02 5.299596e-04 2511 34.543365 4.538456
36 36 1 -7.168889e-03 1.301791e-03 2625 30.826660 4.059711
37 37 1 -4.226461e-03 7.440552e-03 2830 33.398251 4.941027
38 38 1 -2.635832e-03 8.748529e-03 3378 45.972672 4.861779
39 39 1 -2.007920e-02 -8.081778e-03 3281 31.735376 5.173269
40 40 1 -3.453595e-02 -6.867430e-03 2690 47.515182 4.935358
41 41 1 1.698363e-03 -8.296107e-03 2529 42.224693 4.386349
42 42 1 5.257193e-03 1.021242e-02 2571 43.070564 4.194372
43 43 1 6.968817e-03 5.538784e-03 2581 36.055031 4.209373
44 44 1 -7.632907e-04 2.803704e-04 2582 28.257311 4.230427
45 45 1 -3.468894e-03 -9.099842e-04 2409 29.421440 4.190946
46 46 1 1.879089e-02 6.532978e-03 3733 41.535984 4.032614
47 47 1 -1.076225e-03 -1.138945e-03 2712 39.260731 4.580621
48 48 1 -5.306205e-03 2.667941e-03 3446 34.250553 4.925404
49 49 1 -5.380515e-03 -2.595619e-03 3785 50.561493 4.642792
50 50 1 -2.571232e-03 -2.063937e-03 3768 46.160892 4.728879
51 51 1 -7.638110e-03 -2.432463e-03 3413 32.401161 5.058373
52 52 1 -2.950254e-03 -2.034031e-04 3852 32.543564 4.443869
53 53 1 -2.702386e-03 -1.776183e-03 2483 31.002720 3.879390
54 54 1 -3.892425e-02 -2.266178e-03 2225 26.126318 5.750985
55 55 1 -2.644659e-03 3.034660e-03 2192 32.103516 4.949506
56 56 1 -2.862503e-02 3.673996e-04 2361 23.930893 5.181818
57 57 1 6.263880e-03 -7.725377e-04 3780 17.752790 4.890797
58 58 1 1.054093e-03 -1.563014e-03 3089 36.422310 4.520845
59 59 1 9.474340e-04 -3.901043e-03 3155 42.552841 4.265886
60 60 1 5.569567e-03 -1.770366e-04 3516 13.166321 4.772187
61 61 1 -8.342760e-03 -9.908290e-03 3097 36.815479 5.346615
62 62 1 -1.422498e-03 -1.645628e-03 2865 29.802414 4.131463
63 63 1 4.523963e-02 1.067406e-02 2163 36.154739 3.369432
64 64 1 -1.164162e-02 6.808200e-04 2316 19.610609 4.634536
65 65 1 -8.043590e-03 9.395104e-03 2614 44.298817 3.983136
66 66 1 -1.925332e-02 -4.521391e-03 2035 31.205780 4.134195
67 67 1 -1.429050e-02 5.435983e-03 2799 38.876656 4.180761
68 68 1 6.935605e-04 3.015038e-03 2679 37.863647 4.213497
69 69 1 -5.062089e-03 5.961242e-04 2831 32.401161 3.729215
70 70 1 -3.617065e-04 -2.874465e-03 3152 45.871994 4.703659
71 71 1 -4.216370e-02 -4.917050e-03 3726 25.376934 4.614913
72 72 1 -2.184333e-02 -2.840071e-03 3610 43.138550 4.237120
73 73 1 -1.735273e-02 -2.199261e-03 3339 33.984894 4.811754
74 74 1 1.929157e-02 5.358084e-03 3447 32.356407 3.355368
75 75 1 -4.118797e-02 -2.408211e-03 3251 22.373844 5.160147
76 76 1 -1.393304e-02 7.900328e-05 3297 22.090260 4.724728
77 77 1 -3.078095e-02 -5.535597e-03 3143 37.298687 4.625203
78 78 1 1.717030e-02 -1.120720e-03 3617 37.965389 4.627342
79 79 1 -5.965119e-04 -5.377157e-04 3689 28.360373 4.767213
80 80 1 7.843294e-03 -9.579902e-04 3676 48.356312 3.907819
81 81 1 5.994634e-03 2.034169e-03 2759 25.142431 3.980591
82 82 1 -1.323012e-02 2.393529e-03 3972 26.880308 5.107575
83 83 1 6.312347e-03 2.877600e-04 3323 32.167103 3.496723
84 84 1 -1.180464e-02 4.438243e-03 3790 40.369972 4.081389
85 85 1 -8.333334e-03 4.009274e-03 3248 14.931417 4.881107
86 86 1 2.016023e-03 -5.707344e-04 3994 18.305449 4.278613
87 87 1 -5.515654e-03 -8.373593e-04 3368 40.703190 4.229169
88 88 1 8.931696e-03 1.677515e-03 4651 30.133842 4.327270
89 89 1 1.962347e-04 -7.458636e-04 5075 57.352509 3.263017
90 90 1 -2.880805e-02 -5.200595e-04 2645 11.976726 5.634262
91 91 1 -2.101875e-02 -5.110677e-03 3109 34.218582 4.925558
92 92 1 -8.390786e-03 -1.188547e-02 3667 39.895481 4.249029
93 93 1 -1.366958e-02 9.873455e-04 2827 22.636129 5.269634
94 94 1 1.004551e-02 5.205147e-04 3667 44.028976 3.993555
95 95 1 5.892557e-03 -5.482296e-04 2416 5.385977 4.614692
96 96 1 -1.662132e-02 -9.946494e-04 3806 42.599808 3.951163
97 97 1 -7.977792e-03 5.937776e-03 3470 28.888371 3.120762
98 98 1 -2.408042e-02 -2.647421e-03 2975 16.228737 4.227977
99 99 1 -1.191509e-02 -2.014583e-03 2461 30.051607 4.361413
100 100 1 1.110316e-02 2.506189e-04 3362 29.517509 4.591039
101 101 1 2.010373e-03 4.185408e-04 5104 17.387333 3.642855
102 102 1 -3.218945e-03 1.004196e-02 4113 44.448421 3.282414
103 103 1 2.438254e-03 2.551999e-03 3234 31.205780 3.844411
104 104 1 -1.178511e-02 2.775465e-04 1864 1.350224 3.875072
105 105 1 -9.511201e-04 -1.446065e-03 2351 22.406872 4.392300
106 106 1 -4.563018e-03 -5.890041e-03 3141 24.862123 3.998985
107 107 1 -1.471223e-02 5.965497e-03 3765 25.363234 3.661456
108 108 1 -5.857890e-03 -9.363544e-03 2272 22.878105 5.105480
109 109 1 1.369277e-02 1.019289e-02 4016 44.848000 4.092690
110 110 1 -8.784844e-03 3.358194e-03 3293 32.543564 4.115062
111 111 1 -5.148044e-03 5.372697e-03 3038 31.772562 3.626687
112 112 1 -1.556184e+35 5.799786e+34 4961 29.421440 3.020591
113 113 1 3.831991e-03 1.570888e-03 2069 28.821898 3.790284
114 114 1 8.289138e-04 6.439757e-04 2154 21.045721 3.959267
115 115 1 -4.800863e-03 3.194520e-03 5294 45.660866 3.701611
116 116 1 2.974254e-02 1.197812e-02 4380 31.670097 3.877057
117 117 1 1.137725e-02 -1.082659e-02 5172 18.774675 3.572600
118 118 1 -4.678526e-03 7.448288e-03 2257 39.260731 4.227000
119 119 1 -4.655881e-03 -1.119303e-03 3233 30.205467 5.613868
120 120 1 -4.827522e-03 -4.766134e-03 3414 42.974857 3.831894
121 121 1 -8.568994e-04 1.053632e-03 1750 29.421440 4.132886
122 122 1 1.212121e-02 0.000000e+00 5018 20.136303 3.669850
123 123 1 -4.711660e-03 -2.261143e-03 3013 45.007954 3.622240
124 124 1 -1.226328e-02 4.688181e-04 3842 26.880308 3.098333
125 125 1 3.438910e-03 1.441129e-03 3470 11.386165 4.552782
126 126 1 1.192164e-02 -1.295839e-03 3473 22.684824 4.748498
127 127 1 -1.960781e-40 0.000000e+00 4155 90.000000 2.960569
128 128 1 2.124726e-04 1.945100e-03 2496 32.103516 5.242211
129 129 1 5.669804e-03 -4.589476e-03 2577 35.398876 4.271112
130 130 1 -8.838220e-03 -9.496282e-04 4921 14.506372 4.088247
131 131 1 1.009090e-02 -2.243944e-03 3385 38.372120 4.067030
132 132 1 5.630660e-03 -8.632211e-04 4003 33.322365 3.776054
133 133 1 -9.103803e-03 -6.322661e-03 2758 47.934212 3.739807
134 134 1 6.225513e-03 -1.824928e-03 3925 37.085732 3.389725
135 135 1 -1.303080e-03 3.580316e-03 2978 27.432941 4.345174
136 136 1 1.355920e-02 3.468190e-03 5058 57.797195 3.739124
137 137 1 2.092464e-02 -3.244962e-04 2400 3.931096 3.032193
138 138 1 5.691811e-02 -7.933985e-04 3885 15.069956 3.414036
139 139 1 8.052407e-05 -3.197287e-03 3493 33.993008 3.881695
140 140 1 -1.892967e-02 -5.049255e-03 2985 24.904482 4.417928
141 141 1 2.278842e-02 1.188287e-02 3666 31.670097 3.313449
142 142 1 1.496110e-02 2.181270e-03 3702 30.498932 3.171413
[ reached 'max' / getOption("max.print") -- omitted 18 rows ]
$id
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
[34] 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
[67] 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
[100] 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132
[133] 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
I tried to lower the number of knots in the gam-call but didn't suceed as well...
Anyone might have an idea?
I fit the gam using the following line:
mgcv::gam(slide ~ s(curv_plan) + s(curv_prof) + s(dem) + s(slope) + s(ca), data = dataframes_new[[7]], family = binomial)
I have experienced the same issue. The root cause was that some of my categorical variables had fewer levels than k in my formula specification. To give an example:
Suppose one of the terms in my formula specification was:
s(I(pmin(example_variable, 120)), k = 5)
and the data in my example_variable had 3 levels (say, "yes", "no", "maybe"). This would throw the above-mentioned error.
In my case, I solved it by creating additional levels in my data (I was creating test data for a unit test). In other cases it could be solved by ensuring k does not exceed the number of levels in your categorical variables.
If you're using categorical variables, check if the root cause might be the same for you.
I found the solution to my problem by reading these:
https://stat.ethz.ch/pipermail/r-sig-ecology/2011-May/002148.html
https://stat.ethz.ch/pipermail/r-help/2007-October/143569.html
The error means that you tried to create a thin plate spline basis expansion with more basis functions than the variable from which the expansion is to be made has unique values.
As you don't show the model fitting code, we can't say more than that one of the smooths in the model you tried to fit didn't have enough unique values for the value of k you specific or used (if you didn't set k a default value was used).
I'm dealing with the following dataset
animal protein herd sire dam
6 416 189.29 2 15 236
7 417 183.27 2 6 295
9 419 193.24 3 11 268
10 420 198.84 2 12 295
11 421 205.25 3 3 251
12 422 204.15 2 2 281
13 423 200.20 2 3 248
14 424 197.22 2 11 222
15 425 201.14 1 10 262
17 427 196.20 1 11 290
18 428 208.13 3 9 294
19 429 213.01 3 14 254
21 431 203.38 2 4 273
22 432 190.56 2 8 248
25 435 196.59 3 9 226
26 436 193.31 3 10 249
27 437 207.89 3 7 272
29 439 202.98 2 10 260
30 440 177.28 2 4 291
31 441 182.04 1 6 282
32 442 217.50 2 3 265
33 443 190.43 2 11 248
35 445 197.24 2 4 256
37 447 197.16 3 5 240
42 452 183.07 3 5 293
43 453 197.99 2 6 293
44 454 208.27 2 6 254
45 455 187.61 3 12 271
46 456 173.18 2 6 280
47 457 187.89 2 6 235
48 458 191.96 1 7 286
49 459 196.39 1 4 275
50 460 178.51 2 13 262
52 462 204.17 1 6 253
53 463 203.77 2 11 273
54 464 206.25 1 13 249
55 465 211.63 2 13 222
56 466 211.34 1 6 228
57 467 194.34 2 1 217
58 468 201.53 2 12 247
59 469 198.01 2 3 251
60 470 188.94 2 7 290
61 471 190.49 3 2 220
62 472 197.34 2 3 224
63 473 194.04 1 15 229
64 474 202.74 2 1 287
67 477 189.98 1 6 300
69 479 206.37 3 2 293
70 480 183.81 2 10 274
72 482 190.70 2 12 265
74 484 194.25 3 2 262
75 485 191.15 3 10 297
76 486 193.23 3 15 255
77 487 193.29 2 4 266
78 488 182.20 1 15 260
81 491 195.89 2 12 294
82 492 200.77 1 8 278
83 493 179.12 2 7 281
85 495 172.14 3 13 252
86 496 183.82 1 4 264
88 498 195.32 1 6 249
89 499 197.19 1 13 274
90 500 178.07 1 8 293
92 502 209.65 2 7 241
95 505 199.66 3 5 220
96 506 190.96 2 11 259
98 508 206.58 3 3 230
100 510 196.60 2 5 231
103 513 193.25 2 15 280
104 514 181.34 2 3 227
I'm interested with the animals indexes and corresponding to them the dams' indexes. Using table function I was able to check that some dams are matched to different animals. In fact I got the following output
217 220 222 224 226 227 228 229 230 231 235 236 240 241 247 248 249 251 252 253 254 255 256 259 260 262
1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 1 1 2 1 1 1 2 3
264 265 266 268 271 272 273 274 275 278 280 281 282 286 287 290 291 293 294 295 297 300
1 2 1 1 1 1 2 2 1 1 2 2 1 1 1 2 1 4 2 2 1 1
Using length function I checked that there are only 48 dams in this dataset.
I would like to 'reindex' them with the integers 1, ..., 48 instead of these given in my set. Is there any method of doing such things?
You can use match and unique.
df$index <- match(df$dam, unique(df$dam))
Or convert to factor and then integer
df$index <- as.integer(factor(df$dam))
Another option is group_indices from dplyr.
df$index <- dplyr::group_indices(df, dam)
We can use .GRP in data.table
library(data.table)
setDT(df)[, index := .GRP, dam]
I have this data from an r package, where X is the dataset with all the data
library(ISLR)
data("Hitters")
X=Hitters
head(X)
here is one part of the data:
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague
-Andy Allanson 293 66 1 30 29 14 1 293 66 1 30 29 14 A E 446 33 20 NA A
-Alan Ashby 315 81 7 24 38 39 14 3449 835 69 321 414 375 N W 632 43 10 475.0 N
-Alvin Davis 479 130 18 66 72 76 3 1624 457 63 224 266 263 A W 880 82 14 480.0 A
-Andre Dawson 496 141 20 65 78 37 11 5628 1575 225 828 838 354 N E 200 11 3 500.0 N
-Andres Galarraga 321 87 10 39 42 30 2 396 101 12 48 46 33 N E 805 40 4 91.5 N
-Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19 501 336 194 A W 282 421 25 750.0 A
I want to convert all the columns and the rows with non numeric values to zero, is there any simple way to do this.
I found here an example how to remove the rows for one column just but for more I have to do it for every column manually.
Is in r any function that does this for all columns and rows?
To remove non-numeric columns, perhaps something like this?
df %>%
select(which(sapply(., is.numeric)))
# AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
#-Andy Allanson 293 66 1 30 29 14 1 293 66 1
#-Alan Ashby 315 81 7 24 38 39 14 3449 835 69
#-Alvin Davis 479 130 18 66 72 76 3 1624 457 63
#-Andre Dawson 496 141 20 65 78 37 11 5628 1575 225
#-Andres Galarraga 321 87 10 39 42 30 2 396 101 12
#-Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19
# CRuns CRBI CWalks PutOuts Assists Errors Salary
#-Andy Allanson 30 29 14 446 33 20 NA
#-Alan Ashby 321 414 375 632 43 10 475.0
#-Alvin Davis 224 266 263 880 82 14 480.0
#-Andre Dawson 828 838 354 200 11 3 500.0
#-Andres Galarraga 48 46 33 805 40 4 91.5
#-Alfredo Griffin 501 336 194 282 421 25 750.0
or
df %>%
select(-which(sapply(., function(x) is.character(x) | is.factor(x))))
Or much neater (thanks to #AntoniosK):
df %>% select_if(is.numeric)
Update
To additionally replace NAs with 0, you can do
df %>% select_if(is.numeric) %>% replace(is.na(.), 0)
# AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
#-Andy Allanson 293 66 1 30 29 14 1 293 66 1
#-Alan Ashby 315 81 7 24 38 39 14 3449 835 69
#-Alvin Davis 479 130 18 66 72 76 3 1624 457 63
#-Andre Dawson 496 141 20 65 78 37 11 5628 1575 225
#-Andres Galarraga 321 87 10 39 42 30 2 396 101 12
#-Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19
# CRuns CRBI CWalks PutOuts Assists Errors Salary
#-Andy Allanson 30 29 14 446 33 20 0.0
#-Alan Ashby 321 414 375 632 43 10 475.0
#-Alvin Davis 224 266 263 880 82 14 480.0
#-Andre Dawson 828 838 354 200 11 3 500.0
#-Andres Galarraga 48 46 33 805 40 4 91.5
#-Alfredo Griffin 501 336 194 282 421 25 750.0
library(ISLR)
data("Hitters")
d = head(Hitters)
library(dplyr)
d %>%
mutate_if(function(x) !is.numeric(x), function(x) 0) %>% # if column is non numeric add zeros
mutate_all(function(x) ifelse(is.na(x), 0, x)) # if there is an NA element replace it with 0
# AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague
# 1 293 66 1 30 29 14 1 293 66 1 30 29 14 0 0 446 33 20 0.0 0
# 2 315 81 7 24 38 39 14 3449 835 69 321 414 375 0 0 632 43 10 475.0 0
# 3 479 130 18 66 72 76 3 1624 457 63 224 266 263 0 0 880 82 14 480.0 0
# 4 496 141 20 65 78 37 11 5628 1575 225 828 838 354 0 0 200 11 3 500.0 0
# 5 321 87 10 39 42 30 2 396 101 12 48 46 33 0 0 805 40 4 91.5 0
# 6 594 169 4 74 51 35 11 4408 1133 19 501 336 194 0 0 282 421 25 750.0 0
If you want to avoid function(x) you can use this
d %>%
mutate_if(Negate(is.numeric), ~0) %>%
mutate_all(~ifelse(is.na(.), 0, .))
You can get the numeric columns with sapply/inherits.
X <- Hitters
inx <- sapply(X, inherits, c("integer", "numeric"))
Y <- X[inx]
Then, it wouldn't make much sense to remove the rows with non-numeric entries, they were already removed, but you could do
inx <- apply(Y, 1, function(y) all(inherits(y, c("integer", "numeric"))))
Y[inx, ]
> head(m)
X id1 q_following topic_followed topic_answered nfollowers nfollowing
1 1 1 80 80 100 180 180
2 2 1 76 76 95 171 171
3 3 1 72 72 90 162 162
4 4 1 68 68 85 153 153
5 5 1 64 64 80 144 144
6 6 1 60 60 75 135 135
> head(d)
X id1 q_following topic_followed topic_answered nfollowers nfollowing
1 1 1 63 735 665 949 146
2 2 1 89 737 666 587 185
3 3 1 121 742 670 428 264
4 4 1 277 750 706 622 265
5 5 1 339 765 734 108 294
6 6 1 363 767 766 291 427
matcher <- function(x,y){ return(na.omit(m[which(d[,y]==x),y])) }
max_matcher <- function(x) { return(sum(matcher(x,3:13))) }
result <- foreach(1:1000, function(x) {
if(max(max_matcher(1:1000)) == max_matcher(x)) return(x)
})
I want to compute result across each group, grouped by id1 of dataframe m.
m %>% group_by(id1) %>% summarise(result) #doesn't work
by(m, m[,"id1"], result) #doesn't work
How should I proceed?