Difficulties applying PCA in R

I am experimenting with PCA in R. I have the following data:
V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
2454 0 168 290 45 1715 61 551 245 30 91
222 188 94 105 60 3374 615 7 294 0 169
552 0 0 465 0 3040 0 0 771 0 0
2872 0 0 0 0 3380 0 289 0 0 0
2938 0 56 56 0 2039 538 311 113 0 254
2849 0 0 332 0 2548 0 332 0 0 221
3102 0 0 0 0 2690 0 0 0 807 807
3134 0 0 0 0 2897 289 144 144 144 0
558 0 0 0 0 3453 0 0 0 0 0
2893 0 262 175 0 2452 350 1138 262 87 175
552 0 0 351 0 3114 0 0 678 0 0
2874 0 109 54 0 2565 272 1037 109 0 0
1396 0 0 407 0 1730 0 0 305 0 0
2866 0 71 179 0 2403 358 753 35 107 143
449 0 0 0 0 2825 0 0 0 0 0
2888 0 0 523 0 2615 104 627 209 0 0
2537 0 57 0 0 1854 0 0 463 0 0
2873 0 0 342 0 3196 0 114 0 0 114
720 0 0 365 4 2704 0 4 643 4 0
218 125 31 94 219 2479 722 0 219 0 94
to which I apply the following code:
fit <- prcomp(data)
ev <- fit$rotation # pc loadings
As a test, I tried to see which data matrix I get back when I keep all the components:
numberComponentsKept = 10
featureVector = ev[,1:numberComponentsKept]
newData <- as.matrix(data)%*%as.matrix(featureVector)
I expected the newData matrix to be the same as the original one, but instead I get a very different result:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
2454 1424.447 867.5986 514.0592 -155.4783720 -574.7425 85.38724 -86.71887 90.872507 4.305168 92.08284
222 3139.681 1020.4150 376.3165 471.8718398 -796.9549 142.14301 -119.86945 32.919950 -31.269467 32.55846
552 2851.544 539.6075 883.3969 -93.3579153 -908.6689 68.34030 -40.97052 -13.856931 23.133566 89.00851
2872 3111.317 1210.0187 433.0382 -144.4065362 -381.2305 -20.08927 -49.03447 9.569258 44.201571 70.13113
2938 1788.334 945.8162 189.6526 308.7703509 -593.5577 124.88484 -109.67276 -115.127348 14.170615 99.19492
2849 2291.839 978.1819 374.7567 -243.6739292 -496.8707 287.01065 -126.22501 -18.747873 54.080763 62.80605
3102 2530.989 814.7548 -510.5978 -410.6295894 -1015.3228 46.85727 -21.20662 14.696831 23.687923 72.37691
3134 2679.430 970.1323 311.8627 124.2884480 -536.4490 -26.23858 83.86768 -17.808390 -28.802387 92.09583
558 3268.599 988.2515 353.6538 -82.9155988 -342.5729 12.96219 -60.94886 18.537087 7.291126 96.14917
2893 1921.761 1664.0084 631.0800 -55.6321469 -864.9628 -28.11045 -104.78931 37.797727 -12.078535 104.88374
552 2927.108 607.6489 799.9602 -79.5494412 -827.6994 14.14625 -50.12209 -14.020936 29.996639 86.72887
2874 2084.285 1636.7999 621.6383 -49.2934502 -577.4815 -67.27198 -11.06071 -7.167577 47.395309 51.02962
1396 1618.171 337.4320 488.2717 -100.1663625 -469.8857 212.37199 -1.19409 13.531485 -23.332701 64.58806
2866 2007.261 1387.6890 395.1586 0.8640971 -636.1243 133.41074 12.34794 -26.969634 5.506828 74.13767
449 2674.136 808.5174 289.3345 -67.8356695 -280.2689 10.60475 -49.86404 15.165731 5.965083 78.66244
2888 2254.171 1162.4988 749.7230 -206.0215007 -652.2364 302.36320 40.76341 -1.079259 17.635956 57.86999
2537 1747.098 371.8884 429.1309 9.3761544 -480.7130 -196.25019 -81.31580 2.819608 24.089379 56.91885
2873 2973.872 974.3854 433.7282 -197.0601947 -478.3647 301.96576 -81.81105 14.516646 -1.191972 100.79057
720 2537.535 504.4124 744.5909 -78.1162036 -771.1396 38.17725 -36.61446 -9.079443 25.488688 78.21597
218 2292.718 800.5257 260.6641 603.3295960 -641.9296 187.38913 11.71382 70.011487 78.047216 96.10967
What did I do wrong?

I think this is a PCA problem rather than an R problem. You multiply the original data by the rotation matrix and then wonder why newData != data. The two would be equal only if the rotation matrix were the identity matrix.
What you probably intended is the following:
# Run PCA:
fit <- prcomp(USArrests)
ev <- fit$rotation # pc loadings
# Reversed PCA:
head(fit$x%*% t(as.matrix(ev)))
# Centered Original data:
head(t(apply(USArrests,1,'-',colMeans(USArrests))))
In the last step you have to center the data, because the function prcomp centers them by default.
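As a hedged sketch of the same idea, the check below uses the built-in USArrests data, as in the answer above. It confirms that the centered data times the rotation matrix gives fit$x, and that fit$x times the transposed rotation, plus fit$center, recovers the original matrix. The names centered, scores, recon, k and approx_k are illustrative only.

```r
# Fit PCA on a built-in data set (prcomp centers the columns by default).
fit <- prcomp(USArrests)

# The scores are the *centered* data times the rotation matrix.
centered <- scale(USArrests, center = TRUE, scale = FALSE)
scores <- centered %*% fit$rotation
all.equal(scores, fit$x, check.attributes = FALSE)

# Undo the rotation, then add the column means back.
recon <- sweep(fit$x %*% t(fit$rotation), 2, fit$center, "+")
all.equal(recon, as.matrix(USArrests), check.attributes = FALSE)

# Keeping only the first k components gives an approximation, not the original.
k <- 2
approx_k <- sweep(fit$x[, 1:k] %*% t(fit$rotation[, 1:k]), 2, fit$center, "+")
max(abs(approx_k - as.matrix(USArrests)))
```

Both all.equal() calls should return TRUE, while the last line reports the largest error introduced by dropping components.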

Related

Subsetting nested lists based on condition (values) in R

I have a large nested list (a list of named lists); an example is given below. I would like to create a new list that keeps only the sub-lists whose "co" vector contains both 0 and 1 values, and discards the sub-lists that contain only 0s (e.g. the output should contain only the first, third and fourth groups).
I played with lapply and filter according to this thread:
Subset elements in a list based on a logical condition
However, it threw errors. I would appreciate tips on how to handle lists within lists.
# reprex
set.seed(123)
## empty lists
first_group <- list()
second_group <- list()
third_group <- list()
fourth_group <- list()
# dummy_vecs
values1 <- c(sample(120:730, 30, replace=TRUE))
coeff1 <- c(sample(0:1, 30, replace=TRUE))
values2 <- c(sample(50:810, 43, replace=TRUE))
coeff2 <- c(rep(0, 43))
values3 <- c(sample(510:730, 57, replace=TRUE))
coeff3 <- c(rep(0, 8), rep(1, 4), rep(0, 45))
values4 <- c(sample(123:770, 28, replace=TRUE))
coeff4 <- c(sample(0:1, 28, replace=TRUE))
## fill lists with values:
first_group[["val"]] <- values1
first_group[["co"]] <- coeff1
second_group[["val"]] <- values2
second_group[["co"]] <- coeff2
third_group[["val"]] <- values3
third_group[["co"]] <- coeff3
fourth_group[["val"]] <- values4
fourth_group[["co"]] <- coeff4
#concatenate lists:
dummy_list <- list()
dummy_list[["first-group"]] <- first_group
dummy_list[["second-group"]] <- second_group
dummy_list[["third-group"]] <- third_group
dummy_list[["fourth-group"]] <- fourth_group
rm(values1, values2, values3, values4, coeff1, coeff2, coeff3, coeff4, first_group, second_group, third_group, fourth_group)
gc()
#show list
print(dummy_list)
# create boolean for where condition is TRUE
cond <- sapply(dummy_list, function(x) 0 %in% x$co && 1 %in% x$co)
# subset
dummy_list[cond]
You could use Filter from base R:
Filter(function(x) sum(x$co) !=0, dummy_list)
Or you can use purrr:
library(tidyverse)
dummy_list %>%
keep( ~ sum(.$co) != 0)
Output
$`first-group`
$`first-group`$val
[1] 534 582 298 645 314 237 418 348 363 133 493 721 722 210 467 474 145 638 545 330 709 712 674 492 262 663 609 142 428 254
$`first-group`$co
[1] 0 0 1 1 0 1 0 0 1 1 0 0 1 0 0 0 0 1 0 0 0 0 1 1 0 1 1 1 1 0
$`third-group`
$`third-group`$val
[1] 713 721 683 526 699 555 563 672 619 603 588 533 622 724 616 644 730 716 660 663 611 669 644 664 679 514 579 525 533 541 530 564 584 673 592 726 548 563 727
[40] 646 708 557 586 592 693 620 548 705 510 677 539 603 726 525 597 563 712
$`third-group`$co
[1] 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
$`fourth-group`
$`fourth-group`$val
[1] 142 317 286 174 656 299 676 206 645 755 514 424 719 741 711 552 550 372 551 520 650 503 667 162 644 595 322 247
$`fourth-group`$co
[1] 0 0 0 0 1 0 1 1 1 0 1 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 1 1
However, if you also want to exclude any sub-list whose co is all 1s, you can add an extra condition.
Filter(function(x) sum(x$co) !=0 & sum(x$co == 0) > 0, dummy_list)
purrr
dummy_list %>%
keep( ~ sum(.$co) != 0 & sum(.$co == 0) > 0)
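As a minimal sketch, the "contains both 0 and 1" condition can also be written as a single membership test. The toy list and the helper name has_both below are made up for illustration; they are not part of the question's data.

```r
# Toy nested list: one mixed sub-list, one all-zero, one all-one.
toy <- list(
  a = list(val = 1:3, co = c(0, 1, 0)),
  b = list(val = 4:5, co = c(0, 0)),
  c = list(val = 6:7, co = c(1, 1))
)

# TRUE only when both 0 and 1 occur in the "co" vector.
has_both <- function(x) all(c(0, 1) %in% x$co)

# Filter() keeps the elements for which the predicate is TRUE, names included.
kept <- Filter(has_both, toy)
names(kept)
```

Only sub-list a survives, because b has no 1 and c has no 0.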

How to create a dummy variable for date interval

I have been trying to generate a dummy variable from the date column for an interval.
Sample data
Date <- seq(as.Date("1988-01-01"), as.Date("2018-12-31"), by="1 day")
DATASET <- data.frame(rnorm(11323), Date)
I would like to flag the interval 20-04 to 20-08 of each year, coded as 1. I would be grateful for a hint with code for doing this.
You could compare the day of the year. In base R that would be
DATASET$day_of_year <- as.integer(format(DATASET$Date, "%j"))
DATASET$flag <- +(with(DATASET, ifelse(as.integer(format(Date, "%Y")) %% 4 == 0 ,
day_of_year %in% 111:233, day_of_year %in% 110:232)))
In leap years, 20-04 is the 111th day of the year and 20-08 is the 233rd; in all other years they are the 110th and 232nd. We assign 1 when the date falls between those two days.
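The day-of-year numbers depend on the leap-year rule, and `%% 4 == 0` holds for 1988 to 2018 but not for century years such as 1900 or 2100. A sketch that sidesteps this compares month and day directly. The column names value and flag and the variable month_day are assumptions made for this example.

```r
Date <- seq(as.Date("1988-01-01"), as.Date("2018-12-31"), by = "1 day")
DATASET <- data.frame(value = rnorm(length(Date)), Date = Date)

# Month and day as one integer, e.g. 20 April -> 420, 20 August -> 820.
month_day <- as.integer(format(DATASET$Date, "%m%d"))

# 1 inside the interval (both ends included), 0 otherwise.
DATASET$flag <- as.integer(month_day >= 420 & month_day <= 820)

# Every year should contribute the same number of flagged days.
table(tapply(DATASET$flag, format(DATASET$Date, "%Y"), sum))
```

Because the comparison never looks at the day of the year, leap years need no special case.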
Alternatively, you can try the following code to flag the interval between 20-04 and 20-08 for each year
DATASET <- within(DATASET,
code <- ave(as.numeric(format(DATASET$Date,"%m%d")),
as.numeric(format(DATASET$Date,"%Y")),
FUN = function(x) ifelse(x>=420 & x <=820,1,0)))
The first rows of the result are shown below
> DATASET
rnorm.11323. Date code
1 -0.326546058 1988-01-01 0
2 -0.561589735 1988-01-02 0
3 -0.417091199 1988-01-03 0
4 -0.482488496 1988-01-04 0
5 0.039820482 1988-01-05 0
6 -0.285270230 1988-01-06 0
7 -1.301004464 1988-01-07 0
8 1.835118221 1988-01-08 0
9 -0.207213889 1988-01-09 0
10 1.695089989 1988-01-10 0
11 -0.618905489 1988-01-11 0
12 1.689917961 1988-01-12 0
13 -0.272349252 1988-01-13 0
14 0.585059685 1988-01-14 0
15 -0.793666725 1988-01-15 0
16 -0.276084733 1988-01-16 0
17 -0.474363507 1988-01-17 0
18 1.703568414 1988-01-18 0
19 0.011776841 1988-01-19 0
20 0.029492096 1988-01-20 0
21 -1.313446231 1988-01-21 0
22 -0.127952381 1988-01-22 0
23 -0.203861769 1988-01-23 0
24 -0.365669967 1988-01-24 0
25 -0.239937083 1988-01-25 0
26 0.620562975 1988-01-26 0
27 0.652111601 1988-01-27 0
28 -0.869191381 1988-01-28 0
29 0.130085565 1988-01-29 0
30 0.059768397 1988-01-30 0
31 0.349921562 1988-01-31 0
32 -1.087277224 1988-02-01 0
33 -1.250976040 1988-02-02 0
34 -0.970337410 1988-02-03 0
35 2.063232550 1988-02-04 0
36 -0.294777997 1988-02-05 0
37 0.535559649 1988-02-06 0
38 -0.229363577 1988-02-07 0
39 -1.819158790 1988-02-08 0
40 1.020335484 1988-02-09 0
41 0.102285275 1988-02-10 0
42 1.254992570 1988-02-11 0
43 1.584044869 1988-02-12 0
44 -0.629629933 1988-02-13 0
45 -1.073561540 1988-02-14 0
46 1.273920124 1988-02-15 0
47 -0.376367657 1988-02-16 0
48 1.331066300 1988-02-17 0
49 0.694872356 1988-02-18 0
50 0.863826292 1988-02-19 0
51 -1.411795778 1988-02-20 0
52 0.388793450 1988-02-21 0
53 -0.216112938 1988-02-22 0
54 -0.196632011 1988-02-23 0
55 0.558895841 1988-02-24 0
56 0.818765192 1988-02-25 0
57 -1.250469812 1988-02-26 0
58 0.803231988 1988-02-27 0
59 0.002634810 1988-02-28 0
60 0.252328475 1988-02-29 0
61 -0.958851197 1988-03-01 0
62 -1.448732431 1988-03-02 0
63 0.647314543 1988-03-03 0
64 0.644802476 1988-03-04 0
65 -0.087973096 1988-03-05 0
66 1.088076864 1988-03-06 0
67 -0.293465532 1988-03-07 0
68 0.141825697 1988-03-08 0
69 0.413649305 1988-03-09 0
70 -1.877052966 1988-03-10 0
71 -2.200275448 1988-03-11 0
72 -0.025524427 1988-03-12 0
73 1.236501510 1988-03-13 0
74 -0.872516837 1988-03-14 0
75 -1.063727523 1988-03-15 0
76 0.264564444 1988-03-16 0
77 0.971958801 1988-03-17 0
78 0.102470655 1988-03-18 0
79 1.369131551 1988-03-19 0
80 -0.041148284 1988-03-20 0
81 -2.476135538 1988-03-21 0
82 0.836740451 1988-03-22 0
83 0.078102241 1988-03-23 0
84 -0.949778901 1988-03-24 0
85 -0.975874102 1988-03-25 0
86 2.011305586 1988-03-26 0
87 1.441333862 1988-03-27 0
88 1.404182762 1988-03-28 0
89 -0.425158054 1988-03-29 0
90 1.250722900 1988-03-30 0
91 0.060629220 1988-03-31 0
92 -1.593162931 1988-04-01 0
93 0.475640908 1988-04-02 0
94 0.102547315 1988-04-03 0
95 -2.350611181 1988-04-04 0
96 0.185065822 1988-04-05 0
97 0.463470128 1988-04-06 0
98 1.722202344 1988-04-07 0
99 -1.344383635 1988-04-08 0
100 0.858491817 1988-04-09 0
101 -0.008338174 1988-04-10 0
102 0.572599035 1988-04-11 0
103 0.138858045 1988-04-12 0
104 -1.808541857 1988-04-13 0
105 1.308927384 1988-04-14 0
106 -2.374371017 1988-04-15 0
107 1.134519340 1988-04-16 0
108 1.604437740 1988-04-17 0
109 -0.109549779 1988-04-18 0
110 -0.011355562 1988-04-19 0
111 -1.462229758 1988-04-20 1
112 1.006583367 1988-04-21 1
113 -0.124824926 1988-04-22 1
114 1.611795681 1988-04-23 1
115 0.818715370 1988-04-24 1
116 -0.440445043 1988-04-25 1
117 0.024114452 1988-04-26 1
118 -1.418044894 1988-04-27 1
119 -0.632317886 1988-04-28 1
120 0.599948691 1988-04-29 1
121 1.055118998 1988-04-30 1
122 0.301676490 1988-05-01 1
123 -0.662547532 1988-05-02 1
124 0.425191055 1988-05-03 1
125 1.715003304 1988-05-04 1
126 -0.298346044 1988-05-05 1
127 -1.043983256 1988-05-06 1
128 -1.194283503 1988-05-07 1
129 -1.517810914 1988-05-08 1
130 0.386735460 1988-05-09 1
131 0.742102056 1988-05-10 1
132 0.953762078 1988-05-11 1
133 -0.602941007 1988-05-12 1
134 1.469329252 1988-05-13 1
135 -0.233230972 1988-05-14 1
136 0.663378860 1988-05-15 1
137 -0.749108544 1988-05-16 1
138 0.591009181 1988-05-17 1
139 0.013732152 1988-05-18 1
140 -0.774612526 1988-05-19 1
141 -1.707183964 1988-05-20 1
142 -0.808360648 1988-05-21 1
143 1.420371293 1988-05-22 1
144 0.603838459 1988-05-23 1
145 0.743964804 1988-05-24 1
146 0.059498235 1988-05-25 1
147 -0.597795793 1988-05-26 1
148 0.867167938 1988-05-27 1
149 0.441291857 1988-05-28 1
150 1.348769636 1988-05-29 1
151 -1.768938126 1988-05-30 1
152 1.070400122 1988-05-31 1
153 0.321542409 1988-06-01 1
154 -0.495030342 1988-06-02 1
155 -0.740337974 1988-06-03 1
156 -1.887552572 1988-06-04 1
157 0.805602475 1988-06-05 1
158 -0.824104379 1988-06-06 1
159 0.801460489 1988-06-07 1
160 -0.912871263 1988-06-08 1
161 -0.422677222 1988-06-09 1
162 0.126785279 1988-06-10 1
163 -0.598578319 1988-06-11 1
164 -1.535492985 1988-06-12 1
165 0.018486996 1988-06-13 1
166 -1.156209268 1988-06-14 1
167 0.656276068 1988-06-15 1
168 0.045640396 1988-06-16 1
169 0.627538985 1988-06-17 1
170 2.640792582 1988-06-18 1
171 -0.383475408 1988-06-19 1
172 -2.631633446 1988-06-20 1
173 0.772980776 1988-06-21 1
174 1.930884904 1988-06-22 1
175 2.026248604 1988-06-23 1
176 -0.134588724 1988-06-24 1
177 -0.593768442 1988-06-25 1
178 -0.427553478 1988-06-26 1
179 0.303955588 1988-06-27 1
180 -0.195481230 1988-06-28 1
181 1.231190798 1988-06-29 1
182 -0.871672993 1988-06-30 1
183 -1.002028081 1988-07-01 1
184 -0.912352588 1988-07-02 1
185 -0.714319398 1988-07-03 1
186 0.053181016 1988-07-04 1
187 0.865163557 1988-07-05 1
188 0.474865269 1988-07-06 1
189 -1.105410939 1988-07-07 1
190 -0.110529764 1988-07-08 1
191 -0.805821554 1988-07-09 1
192 -1.550774659 1988-07-10 1
193 -0.508057551 1988-07-11 1
194 -0.755394814 1988-07-12 1
195 0.993023957 1988-07-13 1
196 -0.342427853 1988-07-14 1
197 -1.481690158 1988-07-15 1
198 -0.095168751 1988-07-16 1
199 1.320208464 1988-07-17 1
200 -0.340080090 1988-07-18 1
201 -1.545902324 1988-07-19 1
202 0.389589474 1988-07-20 1
203 -0.734778233 1988-07-21 1
204 0.296933278 1988-07-22 1
205 -0.024469569 1988-07-23 1
206 1.261660247 1988-07-24 1
207 -0.136786252 1988-07-25 1
208 0.908519533 1988-07-26 1
209 1.576193030 1988-07-27 1
210 0.413044482 1988-07-28 1
211 -0.601938271 1988-07-29 1
212 0.495905040 1988-07-30 1
213 0.440665366 1988-07-31 1
214 -0.804152825 1988-08-01 1
215 -1.065705237 1988-08-02 1
216 0.149246056 1988-08-03 1
217 -0.530891226 1988-08-04 1
218 -0.879233155 1988-08-05 1
219 -0.262727374 1988-08-06 1
220 -2.244552614 1988-08-07 1
221 -1.531707789 1988-08-08 1
222 1.498847169 1988-08-09 1
223 0.810096179 1988-08-10 1
224 -1.690822775 1988-08-11 1
225 0.303456055 1988-08-12 1
226 -0.874022497 1988-08-13 1
227 0.244933676 1988-08-14 1
228 1.220193574 1988-08-15 1
229 -0.456840188 1988-08-16 1
230 1.083075786 1988-08-17 1
231 -1.769152445 1988-08-18 1
232 -1.038850200 1988-08-19 1
233 0.963345582 1988-08-20 1
234 0.036574589 1988-08-21 0
235 -2.613751531 1988-08-22 0
236 1.441930677 1988-08-23 0
237 -1.927433949 1988-08-24 0
238 -0.045661284 1988-08-25 0
239 0.974935858 1988-08-26 0
240 -1.457985965 1988-08-27 0
241 0.914085417 1988-08-28 0
242 -0.004152904 1988-08-29 0
243 1.653886738 1988-08-30 0
244 0.972947047 1988-08-31 0

How to sort or order by month?

I have the following data frame, and I have tabulated the output I need with xtabs:
df1<-data.frame(
Year=sample(2016:2018,100,replace = T),
Month=sample(month.abb,100,replace = T),
category1=sample(letters[1:6],100,replace = T),
catergory2=sample(LETTERS[8:16],100,replace = T),
lic=sample(c("P","F","T"),100,replace = T),
count=sample(1:1000,100,replace = T)
)
Code :
xtabs(count~Month+category1+lic,data=df1)
Output :
, , lic = F
category1
Month a b c d e f
Apr 0 0 0 0 0 0
Aug 418 0 0 0 0 208
Dec 628 0 0 0 0 0
Feb 0 0 0 968 0 701
Jan 388 0 0 0 0 0
Jul 771 0 0 0 0 2514
Jun 987 913 0 216 0 395
Mar 454 0 0 0 0 314
May 0 1298 0 0 0 0
Nov 906 0 526 262 0 1417
Oct 783 0 853 336 310 286
Sep 0 0 0 0 928 0
, , lic = P
category1
Month a b c d e f
Apr 13 0 0 0 0 0
Aug 0 774 0 0 416 652
Dec 0 0 0 241 462 123
Feb 150 857 0 169 6 1
Jan 954 0 567 0 0 0
Jul 481 0 0 0 0 846
Jun 0 0 0 484 0 535
Mar 751 0 0 0 241 0
May 0 549 37 0 0 2
Nov 649 0 0 0 154 692
Oct 0 0 182 0 0 0
Sep 0 0 585 0 493 0
, , lic = T
category1
Month a b c d e f
Apr 0 0 410 0 0 0
Aug 0 0 0 0 0 0
Dec 0 0 833 289 811 0
Feb 0 1223 0 716 366 552
Jan 555 0 802 0 1598 0
Jul 0 0 69 0 0 696
Jun 0 0 0 0 190 0
Mar 0 1165 0 0 0 0
May 979 951 676 0 0 0
Nov 267 0 79 1951 290 530
Oct 230 78 0 679 321 0
Sep 0 871 0 0 0 0
The output matches my requirement, but the months are not in calendar order.
Can I achieve the same thing with a package, or is there an easier way to get the same table?
I suggest making Month an ordered factor:
df1$Month <- ordered(df1$Month, levels = month.abb)
xtabs(count~Month+category1+lic,data=df1)
#, , lic = F
#
# category1
#Month a b c d e f
# Jan 0 0 0 0 563 0
# Feb 0 0 0 826 0 0
# Mar 0 0 3 685 443 814
# Apr 0 848 0 474 0 0
# May 192 412 1942 0 803 545
# Jun 593 0 0 0 520 807
# Jul 829 745 0 0 926 0
# Aug 1474 0 603 376 0 706
# Sep 0 0 0 173 0 0
# Oct 0 0 661 915 814 0
# Nov 0 881 0 0 0 0
# Dec 0 0 0 0 0 0
#</snip>
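To illustrate why this works, here is a small sketch with made-up month labels (months_seen and as_factor are names invented for this example). A factor sorts and tabulates by its levels, so setting the levels to month.abb gives calendar order.

```r
months_seen <- c("Mar", "Jan", "Dec", "Feb", "Jan")

# Character vectors sort alphabetically.
sort(months_seen)

# A factor sorts by its levels, here the calendar order in month.abb.
as_factor <- factor(months_seen, levels = month.abb)
as.character(sort(as_factor))

# table() and xtabs() list categories in level order as well;
# droplevels() removes months that never occur.
table(droplevels(as_factor))
```

A plain factor() is enough for ordering the table; ordered() additionally makes comparisons such as < meaningful between months.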
Hopefully this is what OP is aiming to do:
library(tidyverse)
df1 <- as_tibble(df1) # as.tibble() is deprecated in favour of as_tibble()
df1 %>%
arrange(Month)
Year Month category1 catergory2 lic count
<int> <fct> <fct> <fct> <fct> <int>
1 2016 Apr a N F 745
2 2016 Apr b K F 346
3 2016 Apr b O T 61
4 2016 Apr a J T 680
5 2018 Apr d O P 308
6 2017 Apr e M F 408
7 2016 Apr b P P 474
8 2017 Apr b O P 332
9 2016 Apr b P F 321
10 2017 Apr e N T 384
# ... with 90 more rows
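Note that the rows above start with Apr: Month is a factor with alphabetical levels, so the sort is alphabetical. A base R sketch for sorting rows in calendar order looks up each month's position in month.abb. The data frame toy and the name by_calendar are invented for this example.

```r
toy <- data.frame(
  Month = c("Sep", "Apr", "Jan", "Apr"),
  count = c(10, 20, 30, 40),
  stringsAsFactors = TRUE
)

# Alphabetical factor levels put Apr first.
toy[order(toy$Month), ]

# match() returns each month's position in month.abb (Jan = 1, ..., Dec = 12).
by_calendar <- toy[order(match(toy$Month, month.abb)), ]
by_calendar
```

The same key should also work inside arrange(), for example arrange(df1, match(Month, month.abb)), assuming dplyr is loaded.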

Convert data frame from wide to long with 2 variables

I have the following wide data frame (mydf.wide):
DAY JAN F1 FEB F2 MAR F3 APR F4 MAY F5 JUN F6 JUL F7 AUG F8 SEP F9 OCT F10 NOV F11 DEC F12
1 169 0 296 0 1095 0 599 0 1361 0 1746 0 2411 0 2516 0 1614 0 908 0 488 0 209 0
2 193 0 554 0 1085 0 1820 0 1723 0 2787 0 2548 0 1402 0 1633 0 897 0 411 0 250 0
3 246 0 533 0 1111 0 1817 0 2238 0 2747 0 1575 0 1912 0 705 0 813 0 156 0 164 0
4 222 0 547 0 1125 0 1789 0 2181 0 2309 0 1569 0 1798 0 1463 0 878 0 241 0 230 0
I want to produce the following "semi-long":
DAY variable_month value_month value_F
1 JAN 169 0
I tried:
library(reshape2)
mydf.long <- melt(mydf.wide, id.vars=c("YEAR","DAY"), measure.vars=c("JAN","FEB","MAR","APR","MAY","JUN","JUL","AUG","SEP","OCT","NOV","DEC"))
but this skips the F variables, and I don't know how to deal with two sets of variables...
This is one of those cases where reshape(...) in base R is a better option.
months <- c(2,4,6,8,10,12,14,16,18,20,22,24) # column numbers of months
F <- c(3,5,7,9,11,13,15,17,19,21,23,25) # column numbers of Fn
mydf.long <- reshape(mydf.wide,idvar=1,
times=colnames(mydf.wide)[months],
varying=list(months,F),
v.names=c("value_month","value_F"),
direction="long")
colnames(mydf.long)[2] <- "variable_month"
head(mydf.long)
# DAY variable_month value_month value_F
# 1.JAN 1 JAN 169 0
# 2.JAN 2 JAN 193 0
# 3.JAN 3 JAN 246 0
# 4.JAN 4 JAN 222 0
# 1.FEB 1 FEB 296 0
# 2.FEB 2 FEB 554 0
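A variant of the same reshape() call, sketched on a two-month slice of the data, uses column names instead of column numbers and avoids a variable called F, which masks R's shorthand for FALSE. The names wide, month_cols, flag_cols and long are assumptions made for this example.

```r
# Two days and two months taken from the question's data.
wide <- data.frame(
  DAY = 1:2,
  JAN = c(169, 193), F1 = c(0, 0),
  FEB = c(296, 554), F2 = c(0, 0)
)

month_cols <- c("JAN", "FEB")  # columns holding the monthly values
flag_cols  <- c("F1", "F2")    # the matching F columns, in the same order

long <- reshape(
  wide,
  idvar     = "DAY",
  varying   = list(month_cols, flag_cols),
  v.names   = c("value_month", "value_F"),
  timevar   = "variable_month",
  times     = month_cols,
  direction = "long"
)
long
```

Setting timevar directly removes the need to rename the second column afterwards.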
You can also do this with 2 calls to melt(...)
library(reshape2)
months <- c(2,4,6,8,10,12,14,16,18,20,22,24) # column numbers of months
F <- c(3,5,7,9,11,13,15,17,19,21,23,25) # column numbers of Fn
z.1 <- melt(mydf.wide,id=1,measure=months,
variable.name="variable_month",value.name="value_month")
z.2 <- melt(mydf.wide,id=1,measure=F,value.name="value_F")
mydf.long <- cbind(z.1,value_F=z.2$value_F)
head(mydf.long)
# DAY variable_month value_month value_F
# 1 1 JAN 169 0
# 2 2 JAN 193 0
# 3 3 JAN 246 0
# 4 4 JAN 222 0
# 5 1 FEB 296 0
# 6 2 FEB 554 0
melt() and dcast() are available from the reshape2 and data.table packages. Recent versions of data.table can melt multiple sets of columns simultaneously. The patterns() function can be used to specify the two sets of columns by regular expressions:
library(data.table) # CRAN version 1.10.4 used
regex_month <- toupper(paste(month.abb, collapse = "|"))
mydf.long <- melt(setDT(mydf.wide), measure.vars = patterns(regex_month, "F\\d"),
value.name = c("MONTH", "F"))
# rename factor levels
mydf.long[, variable := forcats::lvls_revalue(variable, toupper(month.abb))][]
DAY variable MONTH F
1: 1 JAN 169 0
2: 2 JAN 193 0
3: 3 JAN 246 0
4: 4 JAN 222 0
5: 1 FEB 296 0
...
44: 4 NOV 241 0
45: 1 DEC 209 0
46: 2 DEC 250 0
47: 3 DEC 164 0
48: 4 DEC 230 0
DAY variable MONTH F
Note that "F\\d" is used as the regular expression in patterns(). A plain "F" would have caught FEB as well as F1, F2, etc., producing unexpected results.
Also note that mydf.wide needs to be coerced to a data.table object. Otherwise, reshape2::melt() will be dispatched on the data.frame object, and it doesn't recognize patterns().
Data
library(data.table)
mydf.wide <- fread(
"DAY JAN F1 FEB F2 MAR F3 APR F4 MAY F5 JUN F6 JUL F7 AUG F8 SEP F9 OCT F10 NOV F11 DEC F12
1 169 0 296 0 1095 0 599 0 1361 0 1746 0 2411 0 2516 0 1614 0 908 0 488 0 209 0
2 193 0 554 0 1085 0 1820 0 1723 0 2787 0 2548 0 1402 0 1633 0 897 0 411 0 250 0
3 246 0 533 0 1111 0 1817 0 2238 0 2747 0 1575 0 1912 0 705 0 813 0 156 0 164 0
4 222 0 547 0 1125 0 1789 0 2181 0 2309 0 1569 0 1798 0 1463 0 878 0 241 0 230 0",
data.table = FALSE)

How to remove rows with 0 values using R

Hi, I am using a matrix of gene expression frag counts to calculate differentially expressed genes. I would like to know how to remove the rows whose values are all 0. That would make my data set more compact and give fewer spurious results in the downstream analysis I do with this matrix.
Input
gene ZPT.1 ZPT.0 ZPT.2 ZPT.3 PDGT.1 PDGT.0
XLOC_000001 3516 626 1277 770 4309 9030
XLOC_000002 342 82 185 72 835 1095
XLOC_000003 2000 361 867 438 454 687
XLOC_000004 143 30 67 37 90 236
XLOC_000005 0 0 0 0 0 0
XLOC_000006 0 0 0 0 0 0
XLOC_000007 0 0 0 0 1 3
XLOC_000008 0 0 0 0 0 0
XLOC_000009 0 0 0 0 0 0
XLOC_000010 7 1 5 3 0 1
XLOC_000011 63 10 19 15 92 228
Desired output
gene ZPT.1 ZPT.0 ZPT.2 ZPT.3 PDGT.1 PDGT.0
XLOC_000001 3516 626 1277 770 4309 9030
XLOC_000002 342 82 185 72 835 1095
XLOC_000003 2000 361 867 438 454 687
XLOC_000004 143 30 67 37 90 236
XLOC_000007 0 0 0 0 1 3
XLOC_000010 7 1 5 3 0 1
XLOC_000011 63 10 19 15 92 228
For now I only want to remove the rows where all the frag count columns are 0. If a row has some zero values and some non-zero values, I would like to keep it intact, as you can see in the example above.
Please let me know how to do this.
df[apply(df[,-1], 1, function(x) !all(x==0)),]
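A vectorized alternative, sketched on four genes and three count columns taken from the question's table, counts the non-zero entries per row with rowSums(). The names keep and filtered are invented for this example.

```r
df <- data.frame(
  gene   = c("XLOC_000004", "XLOC_000005", "XLOC_000007", "XLOC_000010"),
  ZPT.1  = c(143, 0, 0, 7),
  ZPT.0  = c(30, 0, 0, 1),
  PDGT.1 = c(90, 0, 1, 0)
)

# df[, -1] != 0 is a logical matrix; rowSums() counts the TRUEs in each row.
keep <- rowSums(df[, -1] != 0) > 0

# Keep rows with at least one non-zero count.
filtered <- df[keep, ]
filtered$gene
```

XLOC_000005 is dropped because every count is 0, while XLOC_000007 and XLOC_000010 stay because each has at least one non-zero value.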
Many ways to do this within the tidyverse have been posted here: How to remove rows where all columns are zero using dplyr pipe
My preferred option is rowwise() (replace col1, col2 and col3 with your own count columns):
library(tidyverse)
df <- df %>%
rowwise() %>%
filter(sum(c(col1,col2,col3)) != 0)

Resources