Convert data frame from wide to long with 2 variables - r

I have the following wide data frame (mydf.wide):
DAY JAN F1 FEB F2 MAR F3 APR F4 MAY F5 JUN F6 JUL F7 AUG F8 SEP F9 OCT F10 NOV F11 DEC F12
1 169 0 296 0 1095 0 599 0 1361 0 1746 0 2411 0 2516 0 1614 0 908 0 488 0 209 0
2 193 0 554 0 1085 0 1820 0 1723 0 2787 0 2548 0 1402 0 1633 0 897 0 411 0 250 0
3 246 0 533 0 1111 0 1817 0 2238 0 2747 0 1575 0 1912 0 705 0 813 0 156 0 164 0
4 222 0 547 0 1125 0 1789 0 2181 0 2309 0 1569 0 1798 0 1463 0 878 0 241 0 230 0
I want to produce the following "semi-long":
DAY variable_month value_month value_F
1 JAN 169 0
I tried:
library(reshape2)
mydf.long <- melt(mydf.wide, id.vars=c("YEAR","DAY"), measure.vars=c("JAN","FEB","MAR","APR","MAY","JUN","JUL","AUG","SEP","OCT","NOV","DEC"))
but this skips the F variables, and I don't know how to deal with the two sets of columns...

This is one of those cases where reshape(...) in base R is a better option.
months <- c(2,4,6,8,10,12,14,16,18,20,22,24) # column numbers of months
F <- c(3,5,7,9,11,13,15,17,19,21,23,25) # column numbers of Fn
mydf.long <- reshape(mydf.wide, idvar=1,
                     times=colnames(mydf.wide)[months],
                     varying=list(months, F),
                     v.names=c("value_month", "value_F"),
                     direction="long")
colnames(mydf.long)[2] <- "variable_month"
head(mydf.long)
# DAY variable_month value_month value_F
# 1.JAN 1 JAN 169 0
# 2.JAN 2 JAN 193 0
# 3.JAN 3 JAN 246 0
# 4.JAN 4 JAN 222 0
# 1.FEB 1 FEB 296 0
# 2.FEB 2 FEB 554 0
You can also do this with two calls to melt(...):
library(reshape2)
months <- c(2,4,6,8,10,12,14,16,18,20,22,24) # column numbers of months
F <- c(3,5,7,9,11,13,15,17,19,21,23,25) # column numbers of Fn
z.1 <- melt(mydf.wide, id=1, measure=months,
            variable.name="variable_month", value.name="value_month")
z.2 <- melt(mydf.wide, id=1, measure=F, value.name="value_F")
mydf.long <- cbind(z.1, value_F=z.2$value_F)
head(mydf.long)
# DAY variable_month value_month value_F
# 1 1 JAN 169 0
# 2 2 JAN 193 0
# 3 3 JAN 246 0
# 4 4 JAN 222 0
# 5 1 FEB 296 0
# 6 2 FEB 554 0

melt() and dcast() are available from both the reshape2 and data.table packages. Recent versions of data.table allow melting multiple columns simultaneously. The patterns() parameter can be used to specify the two sets of columns by regular expressions:
library(data.table) # CRAN version 1.10.4 used
regex_month <- toupper(paste(month.abb, collapse = "|"))
mydf.long <- melt(setDT(mydf.wide), measure.vars = patterns(regex_month, "F\\d"),
                  value.name = c("MONTH", "F"))
# rename factor levels
mydf.long[, variable := forcats::lvls_revalue(variable, toupper(month.abb))][]
DAY variable MONTH F
1: 1 JAN 169 0
2: 2 JAN 193 0
3: 3 JAN 246 0
4: 4 JAN 222 0
5: 1 FEB 296 0
...
44: 4 NOV 241 0
45: 1 DEC 209 0
46: 2 DEC 250 0
47: 3 DEC 164 0
48: 4 DEC 230 0
DAY variable MONTH F
Note that "F\\d" is used as regular expression in patterns(). A simple "F" would have catched FEB as well as F1, F2, etc. producing unexpected results.
Also note that mydf.wide needs to be coerced to a data.table object. Otherwise, reshape2::melt() will be dispatched on a data.frame object which doesn't recognize patterns().
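To see the difference, here is a quick check of the two patterns against the column names (a minimal illustration; grep() is used here only to show which columns each regular expression would select):
cols <- colnames(mydf.wide)
grep("F", cols, value = TRUE)    # "FEB" "F1" "F2" ... -- also picks up FEB
grep("F\\d", cols, value = TRUE) # "F1" "F2" ... "F12" -- only the Fn columns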
Data
library(data.table)
mydf.wide <- fread(
"DAY JAN F1 FEB F2 MAR F3 APR F4 MAY F5 JUN F6 JUL F7 AUG F8 SEP F9 OCT F10 NOV F11 DEC F12
1 169 0 296 0 1095 0 599 0 1361 0 1746 0 2411 0 2516 0 1614 0 908 0 488 0 209 0
2 193 0 554 0 1085 0 1820 0 1723 0 2787 0 2548 0 1402 0 1633 0 897 0 411 0 250 0
3 246 0 533 0 1111 0 1817 0 2238 0 2747 0 1575 0 1912 0 705 0 813 0 156 0 164 0
4 222 0 547 0 1125 0 1789 0 2181 0 2309 0 1569 0 1798 0 1463 0 878 0 241 0 230 0",
data.table = FALSE)

How to sort or order by month?

I have the following data frame, and I have tabulated the output as per my requirement with xtabs:
df1 <- data.frame(
  Year = sample(2016:2018, 100, replace = T),
  Month = sample(month.abb, 100, replace = T),
  category1 = sample(letters[1:6], 100, replace = T),
  catergory2 = sample(LETTERS[8:16], 100, replace = T),
  lic = sample(c("P", "F", "T"), 100, replace = T),
  count = sample(1:1000, 100, replace = T)
)
Code:
xtabs(count~Month+category1+lic,data=df1)
Output:
, , lic = F
category1
Month a b c d e f
Apr 0 0 0 0 0 0
Aug 418 0 0 0 0 208
Dec 628 0 0 0 0 0
Feb 0 0 0 968 0 701
Jan 388 0 0 0 0 0
Jul 771 0 0 0 0 2514
Jun 987 913 0 216 0 395
Mar 454 0 0 0 0 314
May 0 1298 0 0 0 0
Nov 906 0 526 262 0 1417
Oct 783 0 853 336 310 286
Sep 0 0 0 0 928 0
, , lic = P
category1
Month a b c d e f
Apr 13 0 0 0 0 0
Aug 0 774 0 0 416 652
Dec 0 0 0 241 462 123
Feb 150 857 0 169 6 1
Jan 954 0 567 0 0 0
Jul 481 0 0 0 0 846
Jun 0 0 0 484 0 535
Mar 751 0 0 0 241 0
May 0 549 37 0 0 2
Nov 649 0 0 0 154 692
Oct 0 0 182 0 0 0
Sep 0 0 585 0 493 0
, , lic = T
category1
Month a b c d e f
Apr 0 0 410 0 0 0
Aug 0 0 0 0 0 0
Dec 0 0 833 289 811 0
Feb 0 1223 0 716 366 552
Jan 555 0 802 0 1598 0
Jul 0 0 69 0 0 696
Jun 0 0 0 0 190 0
Mar 0 1165 0 0 0 0
May 979 951 676 0 0 0
Nov 267 0 79 1951 290 530
Oct 230 78 0 679 321 0
Sep 0 871 0 0 0 0
The output matches my requirement, but the order of the months is wrong.
Can I achieve the same thing with another package, or is there an easier method to get the same data?
I suggest making Month an ordered factor:
df1$Month <- ordered(df1$Month, levels = month.abb)
xtabs(count~Month+category1+lic,data=df1)
#, , lic = F
#
# category1
#Month a b c d e f
# Jan 0 0 0 0 563 0
# Feb 0 0 0 826 0 0
# Mar 0 0 3 685 443 814
# Apr 0 848 0 474 0 0
# May 192 412 1942 0 803 545
# Jun 593 0 0 0 520 807
# Jul 829 745 0 0 926 0
# Aug 1474 0 603 376 0 706
# Sep 0 0 0 173 0 0
# Oct 0 0 661 915 814 0
# Nov 0 881 0 0 0 0
# Dec 0 0 0 0 0 0
# ... (output for lic = P and lic = T omitted)
Hopefully this is what OP is aiming to do:
library(tidyverse)
df1 <- as_tibble(df1)
df1 %>%
  arrange(Month)
Year Month category1 catergory2 lic count
<int> <fct> <fct> <fct> <fct> <int>
1 2016 Apr a N F 745
2 2016 Apr b K F 346
3 2016 Apr b O T 61
4 2016 Apr a J T 680
5 2018 Apr d O P 308
6 2017 Apr e M F 408
7 2016 Apr b P P 474
8 2017 Apr b O P 332
9 2016 Apr b P F 321
10 2017 Apr e N T 384
# ... with 90 more rows

How can I call for something in a data.frame when the distinction has to be done in two columns?

Sorry for the very specific question, but I have a file like this:
Adj Year man mt wm wmt by bytl gr grtl
3 careless 1802 0 126 0 54 0 13 0 51
4 careless 1803 0 166 0 72 0 1 0 18
5 careless 1804 0 167 0 58 0 2 0 25
6 careless 1805 0 117 0 5 0 5 0 7
7 careless 1806 0 408 0 88 0 15 0 27
8 careless 1807 0 214 0 71 0 9 0 32
...
560 mean 1939 21 5988 8 1961 0 1152 0 1512
561 mean 1940 20 5810 6 1965 1 914 0 1444
562 mean 1941 10 6062 4 2097 5 964 0 1550
563 mean 1942 8 5352 2 1660 2 947 2 1506
564 mean 1943 14 5145 5 1614 1 878 4 1196
565 mean 1944 42 5630 6 1939 1 902 0 1583
566 mean 1945 17 6140 7 2192 4 1004 0 1906
Now I have to look up specific values (e.g. [careless, 1804, man] or [mean, 1944, wmt]).
I have no clue how to do that; one possibility would be to split the data.frame and create an array, if I'm correct, but I'd love a simpler solution.
Thank you in advance!
Subsetting on specific values in the Adj and Year columns and selecting the man column gives you the required output:
df[df$Adj == "careless" & df$Year == 1804, "man"]

Organizing three dimensional data from table into matrix/array form using R

I have a table that looks similar to this:
MUNI YEAR ENTE SALE
D101 1995 F001 1000
D101 1995 F002 1200
D101 1995 F003 1300
D101 1996 F001 1000
D101 1996 F003 1250
D101 1996 F004 1300
D101 1997 F001 1000
D101 1998 F002 1400
D101 1998 F003 1500
D102 1995 F001 1000
D102 1995 F003 1200
D102 1995 F006 1300
D102 1996 F001 1050
D102 1996 F002 1320
D102 1996 F003 1250
D102 1996 F006 1350
D102 1996 F002 1320
...
It is a sales table where MUNI stands for markets and ENTE stands for firms. The data consists of 7 years, 1200 markets, and 200 firms. I would like to reorganize this table into matrix form such that the dimensions are (rows = MUNI x YEAR, cols = ENTE) and each cell contains the value of SALE, something like this:
MUNIxYEAR\ENTE F001 F002 F003 F004 ...
D101x1995 1000 1200 1300 NA ...
D101x1996 1000 NA 1250 1300 ...
...
I am not sure how to do this or what the best way to proceed is so that I get the above-mentioned data organization. I have checked other posts and I believe the way to do this is to use the sparseMatrix command. However, I don't know how to use it when (1) there are multiple criteria (i.e., two conditions for the rows) and (2) the dimensions of the matrix are string IDs (convert them into factors and then take the levels?).
Thanks in advance for any help and guidance.
There are many ways and packages to do that. Here is a method using the tidyr package:
library(tidyr)
df = data.frame(MUNI = rep(paste0("D10", c(1,1,2,2,3,4)), each = 2),
                YEAR = rep(1999:2000, 3),
                ENTE = paste0("F00", c(1,2,3,3,4,5)),
                SALE = sample(1000:2000, 6, replace = T))
df
# MUNI YEAR ENTE SALE
# 1 D101 1999 F001 1670
# 2 D101 2000 F002 1420
# 3 D101 1999 F003 1985
# 4 D101 2000 F003 1914
# 5 D102 1999 F004 1727
# 6 D102 2000 F005 1195
# 7 D102 1999 F001 1670
# 8 D102 2000 F002 1420
# 9 D103 1999 F003 1985
# 10 D103 2000 F003 1914
# 11 D104 1999 F004 1727
# 12 D104 2000 F005 1195
spread(df, ENTE, SALE, fill = 0) # keeps MUNI and YEAR as separate columns, in case you want them for querying or further grouping later
# MUNI YEAR F001 F002 F003 F004 F005
# 1 D101 1999 1716 0 1516 0 0
# 2 D101 2000 0 1917 1155 0 0
# 3 D102 1999 1716 0 0 1259 0
# 4 D102 2000 0 1917 0 0 1291
# 5 D103 1999 0 0 1516 0 0
# 6 D103 2000 0 0 1155 0 0
# 7 D104 1999 0 0 0 1259 0
# 8 D104 2000 0 0 0 0 1291
df2 = spread(df,ENTE,SALE, fill=0)
unite(df2, "MUNIxYEAR", MUNI,YEAR, sep = " x ") # if you want to combine columns
# MUNIxYEAR F001 F002 F003 F004 F005
# 1 D101 x 1999 1716 0 1516 0 0
# 2 D101 x 2000 0 1917 1155 0 0
# 3 D102 x 1999 1716 0 0 1259 0
# 4 D102 x 2000 0 1917 0 0 1291
# 5 D103 x 1999 0 0 1516 0 0
# 6 D103 x 2000 0 0 1155 0 0
# 7 D104 x 1999 0 0 0 1259 0
# 8 D104 x 2000 0 0 0 0 1291
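If you specifically need a matrix with MUNIxYEAR as row names rather than a data frame, here is a small follow-up sketch (df3 is just a hypothetical name for the unite() result above):
df3 <- unite(df2, "MUNIxYEAR", MUNI, YEAR, sep = " x ")
m <- as.matrix(df3[, -1])     # drop the id column, keeping only the ENTE columns
rownames(m) <- df3$MUNIxYEAR  # carry the combined id over as row names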
You can use xtabs
For instance:
# Set random seed for reproducibility
set.seed(12345)
# Generate 500 rows of random data
my.data = data.frame(MUNI = rep(paste0("D", 101:110), each = 50),
                     YEAR = sample(1990:2000, 500, replace = TRUE),
                     ENTE = sample(paste0("F00", 1:9), 500, replace = T),
                     SALE = sample(1000:2000, 500, replace = T))
# Create a new column with the string "MUNIxYEAR"
my.data$MUNIxYEAR = paste(my.data$MUNI, my.data$YEAR, sep = "x")
# Call xtabs to get the table!
res <- xtabs(SALE ~ MUNIxYEAR + ENTE, my.data)
First lines of the output:
ENTE
MUNIxYEAR F001 F002 F003 F004 F005 F006 F007 F008 F009
D101x1990 1339 0 0 1693 0 2831 2779 0 0
D101x1991 0 1407 0 3619 0 0 0 1254 0
D101x1992 0 0 0 0 1807 0 1766 0 1657
D101x1993 1174 1154 0 0 1794 0 0 1218 0
D101x1994 0 1015 6636 0 0 0 2126 0 0
D101x1995 0 0 0 0 0 3478 3228 1517 0
D101x1996 0 0 1304 0 0 0 1505 0 0
D101x1997 0 1077 1481 1802 0 2494 0 0 0
D101x1998 0 0 1660 5366 1844 0 0 1006 0
D101x1999 0 1437 0 0 0 0 1844 0 2394
D101x2000 0 0 1714 0 0 0 1950 1758 1108
D102x1990 3761 0 3307 1182 0 0 0 0 0
D102x1991 0 0 0 1539 2716 0 1716 0 0
D102x1992 1980 0 1056 1458 0 0 0 0 1641
D102x1993 0 0 1429 0 1784 0 1114 0 0
D102x1994 0 0 0 0 1377 0 1038 1000 0
D102x1995 0 0 1088 0 0 1031 4205 1764 0
D102x1996 0 0 0 0 1658 0 3559 0 0
D102x1997 0 1048 2453 0 0 1741 0 0 0
D102x1998 1427 5139 0 1336 0 0 1372 0 1395
D102x1999 0 0 0 3957 0 1972 0 0 0
D102x2000 0 3258 0 0 0 3780 0 3299 1360
D103x1990 0 0 0 1247 1526 0 0 0 1234
D103x1991 0 1919 0 0 0 0 0 1704 0
D103x1992 0 1489 0 0 4428 0 1371 0 0
D103x1993 0 1477 0 0 0 0 1319 0 1211
D103x1994 0 2649 0 0 1488 0 0 0 0
The xtabs function can help reformat your data into a 3-dimensional array, and the ftable function can then flatten it into a 2-dimensional table.
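A minimal sketch of that approach, reusing the my.data object generated above (row.vars and col.vars control how ftable() lays out the flattened table):
res3d <- xtabs(SALE ~ MUNI + YEAR + ENTE, my.data)             # 3-dimensional array
ftable(res3d, row.vars = c("MUNI", "YEAR"), col.vars = "ENTE") # flattened to 2 dimensions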
Other options would be the reshape2 or plyr packages (and probably others as well).

Difficulties applying pca

I am experimenting with PCA in R. I have the following data:
V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
2454 0 168 290 45 1715 61 551 245 30 91
222 188 94 105 60 3374 615 7 294 0 169
552 0 0 465 0 3040 0 0 771 0 0
2872 0 0 0 0 3380 0 289 0 0 0
2938 0 56 56 0 2039 538 311 113 0 254
2849 0 0 332 0 2548 0 332 0 0 221
3102 0 0 0 0 2690 0 0 0 807 807
3134 0 0 0 0 2897 289 144 144 144 0
558 0 0 0 0 3453 0 0 0 0 0
2893 0 262 175 0 2452 350 1138 262 87 175
552 0 0 351 0 3114 0 0 678 0 0
2874 0 109 54 0 2565 272 1037 109 0 0
1396 0 0 407 0 1730 0 0 305 0 0
2866 0 71 179 0 2403 358 753 35 107 143
449 0 0 0 0 2825 0 0 0 0 0
2888 0 0 523 0 2615 104 627 209 0 0
2537 0 57 0 0 1854 0 0 463 0 0
2873 0 0 342 0 3196 0 114 0 0 114
720 0 0 365 4 2704 0 4 643 4 0
218 125 31 94 219 2479 722 0 219 0 94
to which I apply the following code:
fit <- prcomp(data)
ev <- fit$rotation # pc loadings
To run some checks, I tried to reconstruct the data matrix I get back when I keep all the components:
numberComponentsKept = 10
featureVector = ev[,1:numberComponentsKept]
newData <- as.matrix(data)%*%as.matrix(featureVector)
The newData matrix should be the same as the original one, but instead, I get a very different result:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
2454 1424.447 867.5986 514.0592 -155.4783720 -574.7425 85.38724 -86.71887 90.872507 4.305168 92.08284
222 3139.681 1020.4150 376.3165 471.8718398 -796.9549 142.14301 -119.86945 32.919950 -31.269467 32.55846
552 2851.544 539.6075 883.3969 -93.3579153 -908.6689 68.34030 -40.97052 -13.856931 23.133566 89.00851
2872 3111.317 1210.0187 433.0382 -144.4065362 -381.2305 -20.08927 -49.03447 9.569258 44.201571 70.13113
2938 1788.334 945.8162 189.6526 308.7703509 -593.5577 124.88484 -109.67276 -115.127348 14.170615 99.19492
2849 2291.839 978.1819 374.7567 -243.6739292 -496.8707 287.01065 -126.22501 -18.747873 54.080763 62.80605
3102 2530.989 814.7548 -510.5978 -410.6295894 -1015.3228 46.85727 -21.20662 14.696831 23.687923 72.37691
3134 2679.430 970.1323 311.8627 124.2884480 -536.4490 -26.23858 83.86768 -17.808390 -28.802387 92.09583
558 3268.599 988.2515 353.6538 -82.9155988 -342.5729 12.96219 -60.94886 18.537087 7.291126 96.14917
2893 1921.761 1664.0084 631.0800 -55.6321469 -864.9628 -28.11045 -104.78931 37.797727 -12.078535 104.88374
552 2927.108 607.6489 799.9602 -79.5494412 -827.6994 14.14625 -50.12209 -14.020936 29.996639 86.72887
2874 2084.285 1636.7999 621.6383 -49.2934502 -577.4815 -67.27198 -11.06071 -7.167577 47.395309 51.02962
1396 1618.171 337.4320 488.2717 -100.1663625 -469.8857 212.37199 -1.19409 13.531485 -23.332701 64.58806
2866 2007.261 1387.6890 395.1586 0.8640971 -636.1243 133.41074 12.34794 -26.969634 5.506828 74.13767
449 2674.136 808.5174 289.3345 -67.8356695 -280.2689 10.60475 -49.86404 15.165731 5.965083 78.66244
2888 2254.171 1162.4988 749.7230 -206.0215007 -652.2364 302.36320 40.76341 -1.079259 17.635956 57.86999
2537 1747.098 371.8884 429.1309 9.3761544 -480.7130 -196.25019 -81.31580 2.819608 24.089379 56.91885
2873 2973.872 974.3854 433.7282 -197.0601947 -478.3647 301.96576 -81.81105 14.516646 -1.191972 100.79057
720 2537.535 504.4124 744.5909 -78.1162036 -771.1396 38.17725 -36.61446 -9.079443 25.488688 78.21597
218 2292.718 800.5257 260.6641 603.3295960 -641.9296 187.38913 11.71382 70.011487 78.047216 96.10967
What did I do wrong?
I think this is more a PCA problem than an R problem. You multiply the original data by the rotation matrix and then wonder why newData != data. That would only be the case if the rotation matrix were the identity matrix.
What you probably intended to do is the following:
# Run PCA:
fit <- prcomp(USArrests)
ev <- fit$rotation # pc loadings
# Reversed PCA:
head(fit$x %*% t(as.matrix(ev)))
# Centered Original data:
head(t(apply(USArrests,1,'-',colMeans(USArrests))))
In the last step you have to center the original data for the comparison, because prcomp() centers the data by default.
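If you want the original, uncentered data back rather than the centered version, a minimal sketch is to add the stored column means back in after reversing the rotation (fit$center holds the means that prcomp() subtracted):
reconstructed <- sweep(fit$x %*% t(ev), 2, fit$center, "+")
head(reconstructed)  # matches head(USArrests) up to numerical precision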

How to remove rows with 0 values using R

Hi, I am using a matrix of gene expression fragment counts to calculate differentially expressed genes. I would like to know how to remove the rows in which all values are 0, so that my data set is more compact and the downstream analysis I do with this matrix gives fewer spurious results.
Input
gene ZPT.1 ZPT.0 ZPT.2 ZPT.3 PDGT.1 PDGT.0
XLOC_000001 3516 626 1277 770 4309 9030
XLOC_000002 342 82 185 72 835 1095
XLOC_000003 2000 361 867 438 454 687
XLOC_000004 143 30 67 37 90 236
XLOC_000005 0 0 0 0 0 0
XLOC_000006 0 0 0 0 0 0
XLOC_000007 0 0 0 0 1 3
XLOC_000008 0 0 0 0 0 0
XLOC_000009 0 0 0 0 0 0
XLOC_000010 7 1 5 3 0 1
XLOC_000011 63 10 19 15 92 228
Desired output
gene ZPT.1 ZPT.0 ZPT.2 ZPT.3 PDGT.1 PDGT.0
XLOC_000001 3516 626 1277 770 4309 9030
XLOC_000002 342 82 185 72 835 1095
XLOC_000003 2000 361 867 438 454 687
XLOC_000004 143 30 67 37 90 236
XLOC_000007 0 0 0 0 1 3
XLOC_000010 7 1 5 3 0 1
XLOC_000011 63 10 19 15 92 228
As of now I only want to remove those rows where all the frag count columns are 0; if in any row some values are 0 and others are non-zero, I would like to keep that row intact, as you can see in my example above.
Please let me know how to do this.
# keep a row unless every count column (all columns except gene) is zero
df[apply(df[,-1], 1, function(x) !all(x==0)),]
A lot of options to do this within the tidyverse have been posted here: How to remove rows where all columns are zero using dplyr pipe
My preferred option is using rowwise():
library(tidyverse)
df <- df %>%
  rowwise() %>%
  filter(sum(c(ZPT.1, ZPT.0, ZPT.2, ZPT.3, PDGT.1, PDGT.0)) != 0)
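With many count columns, listing them all by name gets tedious; assuming dplyr >= 1.0.0, an equivalent sketch uses c_across() with a tidyselect expression instead:
df <- df %>%
  rowwise() %>%
  filter(sum(c_across(-gene)) != 0)  # sum every column except gene and drop all-zero rows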
