filling in a matrix based on rownames - r

I have a matrix which looks like the following:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
the 0.07200378 0.173467875 -0.32174805 -0.17641919 -0.1895841 0.41491635 0.52559372 0.46668538 -0.622698039 0.07609943
dog -0.03110763 -0.307907604 -0.51872045 -0.61390705 0.2901446 -0.30045110 0.37480375 0.43265162 -0.095877141 0.13267635
went -0.10276563 0.006781152 -0.22007612 0.29408635 -0.5130759 -0.54109880 0.27203657 -0.10996491 -0.442054480 0.14811820
to -0.25024018 -0.690325871 0.04050764 0.19626275 0.1937401 0.22256489 -0.28244329 0.01593702 0.357230552 0.56581933
play -0.30871394 -0.093627274 -0.28149478 0.09634858 -0.0895794 0.40877385 -0.60633919 0.15760252 0.001222108 0.82736039
with -0.15535758 0.103512824 -0.22533448 0.18746118 0.4194084 0.64124607 -0.03984496 0.16687895 -0.373183180 -0.58537456
his 0.56851056 -0.376888059 0.48226617 0.06921187 0.5648746 -0.20768129 -0.28356166 0.70855895 0.031217873 0.71860737
owner -0.29910484 -0.727676094 -0.29929429 -0.23175114 0.4336813 0.39667153 -0.29670753 -0.04054499 -0.041433528 0.34875186
fun 0.08032176 -0.431446284 0.15740608 0.16003107 -0.1894946 0.37010769 0.26229681 -0.22716813 -0.310652746 0.06291729
john 0.08629179 0.470551208 0.31550134 0.61767611 0.6179546 -0.01474994 0.58974983 -0.39419778 -0.689627200 -0.18293759
Words are the rownames, and the matrix is populated with word vectors from a word2vec model. I also have a second matrix, which looks like the following:
dog play owner went fun NA_TEXT NA_2TEXT
1750_10-K_2005 0 1 0 0 1 0 0
1800_10-K_2005 1 0 1 0 0 0 1
1923_10-K_2005 1 0 0 0 0 0 0
2135_10-K_2005 0 0 0 0 0 1 0
2488_10-K_2005 0 0 0 0 0 0 0
2491_10-K_2005 0 0 1 0 0 0 1
2969_10-K_2005 1 1 0 1 0 0 1
3133_10-K_2005 0 0 0 0 0 0 0
3197_10-K/A_2005 0 0 0 1 0 1 0
3197_10-K_2005 0 0 0 0 0 0 0
This is a bag-of-words matrix whose rownames are documents. I want to compute the cosine similarity between the documents based on the word embeddings in the first matrix. I can average each word's embedding by doing rowMeans(wrds), which gives:
> rowMeans(wrds)
the dog went to play with
0.041831714 -0.063769466 -0.120801359 0.036905292 0.011155287 0.013941266
his owner fun john
0.227511640 -0.075740768 -0.006568109 0.141621237
Now I would like to "join" these word means up with the docs matrix wherever a word appears in its colnames (see the sketch after the data below).
Expected output (for the first few columns of the docs matrix):
dog play owner went
1750_10-K_2005 0 0.011 0 0
1800_10-K_2005 -0.063 0 -0.075 0
1923_10-K_2005 -0.063 0 0 0
2135_10-K_2005 0 0 0 0
2488_10-K_2005 0 0 0 0
2491_10-K_2005 0 0 -0.075 0
2969_10-K_2005 -0.063 -0.075 0 -0.121
3133_10-K_2005 0 0 0 0
3197_10-K/A_2005 0 0 0 -0.12
3197_10-K_2005 0 0 0 0
Data:
wrds <- structure(c(0.0720037762075663, -0.031107634305954, -0.102765634655952,
-0.250240176916122, -0.30871394276619, -0.155357576906681, 0.568510562181473,
-0.299104837700725, 0.0803217552602291, 0.0862917900085449, 0.173467874526978,
-0.307907603681087, 0.00678115151822567, -0.690325871109962,
-0.0936272740364075, 0.103512823581696, -0.376888059079647, -0.727676093578339,
-0.43144628405571, 0.470551207661629, -0.321748048067093, -0.51872044801712,
-0.220076121389866, 0.0405076444149017, -0.281494781374931, -0.225334476679564,
0.482266165316105, -0.299294285476208, 0.157406084239483, 0.315501344390213,
-0.17641919106245, -0.613907054066658, 0.294086349196732, 0.196262747049332,
0.0963485836982727, 0.18746118247509, 0.0692118704319, -0.231751143932343,
0.16003106534481, 0.617676109075546, -0.189584106206894, 0.290144592523575,
-0.513075917959213, 0.193740077316761, -0.0895793968811631, 0.419408403337002,
0.564874619245529, 0.433681339025497, -0.189494623802602, 0.617954611778259,
0.414916351437569, -0.300451099872589, -0.541098803281784, 0.222564890980721,
0.408773854374886, 0.641246065497398, -0.207681285217404, 0.396671526134014,
0.370107688009739, -0.0147499442100525, 0.525593716651201, 0.374803751707077,
0.272036574780941, -0.282443292438984, -0.606339186429977, -0.0398449599742889,
-0.283561661839485, -0.296707525849342, 0.262296808883548, 0.589749827980995,
0.466685384511948, 0.432651624083519, -0.109964912757277, 0.015937015414238,
0.157602518796921, 0.166878946125507, 0.708558946847916, -0.0405449904501438,
-0.227168127894402, -0.394197784364223, -0.622698038816452, -0.0958771407604218,
-0.442054480314255, 0.357230551540852, 0.00122210755944252, -0.37318317964673,
0.0312178730964661, -0.0414335280656815, -0.310652745887637,
-0.689627200365067, 0.0760994255542755, 0.132676348090172, 0.148118201643229,
0.565819330513477, 0.827360391616821, -0.585374563932419, 0.718607366085052,
0.348751857876778, 0.0629172921180725, -0.18293759226799), .Dim = c(10L,
10L), .Dimnames = list(c("the", "dog", "went", "to", "play",
"with", "his", "owner", "fun", "john"), NULL))
docs <- structure(c(0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0), .Dim = c(10L, 7L), .Dimnames = list(
c("1750_10-K_2005", "1800_10-K_2005", "1923_10-K_2005", "2135_10-K_2005",
"2488_10-K_2005", "2491_10-K_2005", "2969_10-K_2005", "3133_10-K_2005",
"3197_10-K/A_2005", "3197_10-K_2005"), c("dog", "play", "owner",
"went", "fun", "NA_TEXT", "NA_2TEXT")))

Related

combine redundant row items in r

I have a dataset with the names of many different plant species (column MTmatch), some of which appear repeatedly. Each of these has a column (ReadSum) with a sum associated with it (as well as many other pieces of information). How do I combine/aggregate all of the redundant plant species and sum the associated ReadSum for each, while leaving the non-redundant rows alone?
I would like to take a dataset like this and either have it transformed so that each sample has the aggregate of the combined rows, or at least gain an additional column showing the sum of the ReadSum column for the combined redundant species. Sorry if this is confusing; I'm not sure how to ask this question.
I have been messing about with dplyr, using group_by() and summarise(), but that seems to summarize across the whole column rather than just within each new group.
structure(list(ESVID = c("ESV_000090", "ESV_000682", "ESV_000028",
"ESV_000030", "ESV_000010", "ESV_000182", "ESV_000040", "ESV_000135",
"ESV_000383"), S026401.R1 = c(0.222447727, 0, 0, 0, 0, 0, 0.029074432,
0, 0), S026404.R1 = c(0.022583349, 0, 0, 0, 0, 0, 0.016390389,
0.001257217, 0), S026406.R1 = c(0.360895503, 0, 0, 0.00814677,
0, 0, 0.01513888, 0, 0.00115466)), row.names = c(NA, -9L), class = "data.frame")
> dput(samp5[1:9])
structure(list(ESVID = c("ESV_000090", "ESV_000682", "ESV_000028",
"ESV_000030", "ESV_000010", "ESV_000182", "ESV_000040", "ESV_000135",
"ESV_000383"), S026401.R1 = c(0.222447727, 0, 0, 0, 0, 0, 0.029074432,
0, 0), S026404.R1 = c(0.022583349, 0, 0, 0, 0, 0, 0.016390389,
0.001257217, 0), S026406.R1 = c(0.360895503, 0, 0, 0.00814677,
0, 0, 0.01513888, 0, 0.00115466), S026409.R1 = c(0.221175955,
0, 0, 0, 0, 0, 0.005146173, 0, 0), S026412.R1 = c(0.026058888,
0, 0, 0, 0, 0, 0, 0, 0), MAX = c(0.400577608, 0.009933177, 0.124412855,
0.00814677, 0.009824944, 0.086475106, 0.154850408, 0.015593835,
0.008340888), ReadSum = c(3.54892343, 0.012059346, 0.203303936,
0.021075546, 0.009824944, 0.128007863, 0.859687787, 0.068159534,
0.050266853), SPECIES = c("Abies ", "Abies ", "Acer", "Alnus",
"Berberis", "Betula ", "Boykinia", "Boykinia", "Boykinia")), row.names = c(NA,
-9L), class = "data.frame")
Does either of these approaches produce your intended outcome?
Data:
df <- structure(list(ESVID = c("ESV_000090", "ESV_000682", "ESV_000028",
"ESV_000030", "ESV_000010", "ESV_000182", "ESV_000040", "ESV_000135",
"ESV_000383"), S026401.R1 = c(0.222447727, 0, 0, 0, 0, 0, 0.029074432,
0, 0), S026404.R1 = c(0.022583349, 0, 0, 0, 0, 0, 0.016390389,
0.001257217, 0), S026406.R1 = c(0.360895503, 0, 0, 0.00814677,
0, 0, 0.01513888, 0, 0.00115466), S026409.R1 = c(0.221175955,
0, 0, 0, 0, 0, 0.005146173, 0, 0), S026412.R1 = c(0.026058888,
0, 0, 0, 0, 0, 0, 0, 0), MAX = c(0.400577608, 0.009933177, 0.124412855,
0.00814677, 0.009824944, 0.086475106, 0.154850408, 0.015593835,
0.008340888), ReadSum = c(3.54892343, 0.012059346, 0.203303936,
0.021075546, 0.009824944, 0.128007863, 0.859687787, 0.068159534,
0.050266853), SPECIES = c("Abies ", "Abies ", "Acer", "Alnus",
"Berberis", "Betula ", "Boykinia", "Boykinia", "Boykinia")), row.names = c(NA,
-9L), class = "data.frame")
Create a new column "combined_ReadSum" (2nd col) which is the sum of "ReadSum" for each "SPECIES":
library(dplyr)
df %>%
  group_by(SPECIES) %>%
  summarise(combined_ReadSum = sum(ReadSum)) %>%
  left_join(df, by = "SPECIES")
#> # A tibble: 9 × 10
#> SPECIES combi…¹ ESVID S0264…² S0264…³ S0264…⁴ S0264…⁵ S0264…⁶ MAX ReadSum
#> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 "Abies " 3.56 ESV_… 0.222 0.0226 0.361 0.221 0.0261 0.401 3.55
#> 2 "Abies " 3.56 ESV_… 0 0 0 0 0 0.00993 0.0121
#> 3 "Acer" 0.203 ESV_… 0 0 0 0 0 0.124 0.203
#> 4 "Alnus" 0.0211 ESV_… 0 0 0.00815 0 0 0.00815 0.0211
#> 5 "Berber… 0.00982 ESV_… 0 0 0 0 0 0.00982 0.00982
#> 6 "Betula… 0.128 ESV_… 0 0 0 0 0 0.0865 0.128
#> 7 "Boykin… 0.978 ESV_… 0.0291 0.0164 0.0151 0.00515 0 0.155 0.860
#> 8 "Boykin… 0.978 ESV_… 0 0.00126 0 0 0 0.0156 0.0682
#> 9 "Boykin… 0.978 ESV_… 0 0 0.00115 0 0 0.00834 0.0503
#> # … with abbreviated variable names ¹​combined_ReadSum, ²​S026401.R1,
#> # ³​S026404.R1, ⁴​S026406.R1, ⁵​S026409.R1, ⁶​S026412.R1
Or, summarise columns by summing the values for each unique species:
library(dplyr)
df %>%
  group_by(SPECIES) %>%
  summarise(across(where(is.numeric), sum))
#> # A tibble: 6 × 8
#> SPECIES S026401.R1 S026404.R1 S026406.R1 S026409.R1 S0264…¹ MAX ReadSum
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 "Abies " 0.222 0.0226 0.361 0.221 0.0261 0.411 3.56
#> 2 "Acer" 0 0 0 0 0 0.124 0.203
#> 3 "Alnus" 0 0 0.00815 0 0 0.00815 0.0211
#> 4 "Berberis" 0 0 0 0 0 0.00982 0.00982
#> 5 "Betula " 0 0 0 0 0 0.0865 0.128
#> 6 "Boykinia" 0.0291 0.0176 0.0163 0.00515 0 0.179 0.978
#> # … with abbreviated variable name ¹​S026412.R1
Created on 2022-10-28 by the reprex package (v2.0.1)
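If only the extra column is needed, a grouped mutate is a compact alternative (a sketch reusing the same df and column name as above); it adds combined_ReadSum in place and keeps the original row order:
library(dplyr)
df %>%
  group_by(SPECIES) %>%
  mutate(combined_ReadSum = sum(ReadSum)) %>%   # per-species total, repeated on every row
  ungroup()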

XTS searching in dates prior to current index

I have an xts with OHLC columns and a logical column, isSwingBottom. I want a new column giving the index (date) of the most recent occurrence of isSwingBottom, searching only the rows prior to the current one; that is, the closest matching row looking backwards.
For example, in this data, 1995-10-04/1995-10-10 should be NA, 1995-10-11/1995-10-31 should be 1995-10-10, and so on. I want to avoid using a loop.
myXTS <- structure(c(408.75, 417.2937, 423.3188, 423.775, 419.3375, 415.55,
419.325, 417.25, 415.8375, 411.7812, 408.975, 406.2625, 396.6062,
385.6125, 374.375, 368.0375, 369.8625, 356.2688, 360.45, 357.45,
354.35, 363.6313, 356.25, 361.0813, 364.375, 365.9, 351.15, 342.0375,
333.5938, 315.4812, 418.75, 431.25, 431.25, 429.375, 421.875,
422.5, 419.375, 418.125, 417.5, 415.625, 411.875, 406.25, 396.875,
378.75, 393.8188, 373.5, 371.5, 366.25, 366.25, 358.4937, 364.625,
372.25, 363.75, 368.75, 369.25, 362.5, 350, 343.75, 337.5, 322.5,
408.75, 417.625, 420.625, 415.625, 412.5, 417.625, 417, 415.125,
406.25, 406.25, 405, 387.5, 382.5, 368.75, 361.875, 368.125,
353.75, 350, 356.25, 350.875, 355, 360.625, 355.125, 358.875,
362.5, 344, 339.5, 332.5, 313.375, 309.5, 417.2937, 423.3188,
423.775, 419.3375, 415.55, 419.325, 417.25, 415.8375, 411.7812,
408.975, 406.2625, 396.6062, 385.6125, 372.0562, 368.0375, 369.8625,
356.2688, 360.45, 357.45, 354.35, 363.6313, 361.5, 361.0813,
364.375, 365.9, 351.15, 342.0375, 333.5938, 315.4812, 315.8188,
4560, 11120, 30840, 8640, 7400, 3040, 6360, 1720, 6080, 6840,
9200, 38440, 53000, 32000, 46240, 17680, 57240, 57840, 28680,
28880, 66760, 56640, 23240, 33480, 39120, 62480, 64320, 79920,
66320, 60360, NA, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
0, 0, 0), .Dim = c(30L, 7L), .Dimnames = list(NULL, c("Open",
"High", "Low", "Close", "Volume", "isSwingBottom", "isSwingTop"
)), index = structure(c(812764800, 812851200, 812937600, 813196800,
813283200, 813369600, 813456000, 813542400, 813801600, 813888000,
813974400, 814060800, 814147200, 814406400, 814492800, 814579200,
814665600, 814752000, 815011200, 815097600, 815184000, 815270400,
815356800, 815616000, 815702400, 815788800, 815875200, 815961600,
816220800, 816307200), tzone = "UTC", tclass = "Date"), class = c("xts",
"zoo"))
If you want a new column with dates, you may need to convert myXTS to a data frame first. Try this approach:
library(dplyr)
library(zoo)
library(tidyr)
myXTS %>%
  fortify.zoo() %>%
  mutate(new_col = if_else(isSwingBottom == 1, Index, as.Date(NA))) %>%
  fill(new_col)
# Index Open High Low Close Volume isSwingBottom isSwingTop new_col
#1 1995-10-04 408.7500 418.7500 408.750 417.2937 4560 NA 0 <NA>
#2 1995-10-05 417.2937 431.2500 417.625 423.3188 11120 0 0 <NA>
#3 1995-10-06 423.3188 431.2500 420.625 423.7750 30840 0 0 <NA>
#4 1995-10-09 423.7750 429.3750 415.625 419.3375 8640 0 0 <NA>
#5 1995-10-10 419.3375 421.8750 412.500 415.5500 7400 1 0 1995-10-10
#6 1995-10-11 415.5500 422.5000 417.625 419.3250 3040 0 1 1995-10-10
#7 1995-10-12 419.3250 419.3750 417.000 417.2500 6360 0 0 1995-10-10
#8 1995-10-13 417.2500 418.1250 415.125 415.8375 1720 0 0 1995-10-10
#9 1995-10-16 415.8375 417.5000 406.250 411.7812 6080 0 0 1995-10-10
#10 1995-10-17 411.7812 415.6250 406.250 408.9750 6840 0 0 1995-10-10
#11 1995-10-18 408.9750 411.8750 405.000 406.2625 9200 0 0 1995-10-10
#12 1995-10-19 406.2625 406.2500 387.500 396.6062 38440 0 0 1995-10-10
#13 1995-10-20 396.6062 396.8750 382.500 385.6125 53000 0 0 1995-10-10
#14 1995-10-23 385.6125 378.7500 368.750 372.0562 32000 0 0 1995-10-10
#15 1995-10-24 374.3750 393.8188 361.875 368.0375 46240 0 0 1995-10-10
#16 1995-10-25 368.0375 373.5000 368.125 369.8625 17680 0 0 1995-10-10
#17 1995-10-26 369.8625 371.5000 353.750 356.2688 57240 0 0 1995-10-10
#18 1995-10-27 356.2688 366.2500 350.000 360.4500 57840 0 0 1995-10-10
#19 1995-10-30 360.4500 366.2500 356.250 357.4500 28680 0 0 1995-10-10
#20 1995-10-31 357.4500 358.4937 350.875 354.3500 28880 1 0 1995-10-31
#21 1995-11-01 354.3500 364.6250 355.000 363.6313 66760 0 0 1995-10-31
#22 1995-11-02 363.6313 372.2500 360.625 361.5000 56640 0 1 1995-10-31
#23 1995-11-03 356.2500 363.7500 355.125 361.0813 23240 1 0 1995-11-03
#24 1995-11-06 361.0813 368.7500 358.875 364.3750 33480 0 0 1995-11-03
#25 1995-11-07 364.3750 369.2500 362.500 365.9000 39120 0 1 1995-11-03
#26 1995-11-08 365.9000 362.5000 344.000 351.1500 62480 0 0 1995-11-03
#27 1995-11-09 351.1500 350.0000 339.500 342.0375 64320 0 0 1995-11-03
#28 1995-11-10 342.0375 343.7500 332.500 333.5938 79920 0 0 1995-11-03
#29 1995-11-13 333.5938 337.5000 313.375 315.4812 66320 0 0 1995-11-03
#30 1995-11-14 315.4812 322.5000 309.500 315.8188 60360 NA 0 1995-11-03
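If the matched date must come strictly from earlier rows (so that 1995-10-10 itself stays NA, as the question specifies), one hedged variant is to lag the flagged dates before filling; a sketch reusing the same pipeline:
# dplyr, zoo and tidyr loaded as above
myXTS %>%
  fortify.zoo() %>%
  mutate(new_col = if_else(isSwingBottom == 1, Index, as.Date(NA))) %>%
  mutate(new_col = lag(new_col)) %>%   # shift down one row so only prior rows can supply a date
  fill(new_col)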

problem while changing col names with str_to_title

I have a data set that looks like the one below. It can be built with the following code:
df<- structure(list(`Med` = c("DOCETAXEL",
"BEVACIZUMAB", "CARBOPLATIN", "CETUXIMAB", "DOXORUBICIN", "IRINOTECAN"
), `2.4 mg` = c(0, 0, 0, 0, 1, 0), `PRIOR CANCER THERAPY` = c(4L,
3L, 3L, 3L, 3L, 3L), `PRIOR CANCER SURGERY` = c(0, 0, 0, 0, 0,
0), `PRIOR RADIATION THERAPY` = c(0, 0, 0, 0, 0, 0)), row.names = c(NA,
6L), class = "data.frame")
Now I would like to change the column names that do not start with a number to proper case. How should I do it? I thought I could use str_to_title, and I have tried many ways but cannot get it to work. Here is the code I tried:
# try1:
df[,3:5] %>% setNames(str_to_title(colnames(df[,3:5])))
#try2:
df[,3:5] <- df[,3:5]%>% rename_with (str_to_title)
# try3:
colnames(df[,3:5])<- str_to_title(colnames(df[,3:5]))
What did I do wrong? There is no error message; the column names just do not get updated. Could anyone help me identify the issue, or maybe show me a better way if you have one?
Here the data are small, so I can find the column numbers by hand. If I want the code to automatically correct the relevant column names to proper case, how can I do that?
Thanks.
We can use
library(dplyr)
library(stringr)
df %>%
  rename_at(3:5, ~ str_to_title(.))
-output
# Med 2.4 mg Prior Cancer Therapy Prior Cancer Surgery Prior Radiation Therapy
#1 DOCETAXEL 0 4 0 0
#2 BEVACIZUMAB 0 3 0 0
#3 CARBOPLATIN 0 3 0 0
#4 CETUXIMAB 0 3 0 0
#5 DOXORUBICIN 1 3 0 0
#6 IRINOTECAN 0 3 0 0
Or, using rename_with:
df %>%
  rename_with(~ str_to_title(.), 3:5)
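As for why the original attempts did not change df: the piped result in try 1 is never assigned back to df, and assigning into df[, 3:5] (tries 2 and 3) replaces the column values but leaves the data frame's existing column names in place. If the columns should also be picked automatically rather than by position, a minimal sketch, assuming "does not start with a number" can be expressed as the regular expression ^[0-9] on the column names:
library(dplyr)
library(stringr)
df %>%
  rename_with(str_to_title, .cols = !matches("^[0-9]"))  # title-case every column whose name does not start with a digit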

Weighted random sampling for Monte Carlo simulation in R

I would like to run a Monte Carlo simulation. I have a data.frame where rows are unique IDs which have a probability of association with one of the columns. The data entered into the columns can be treated as the weights for that probability. I want to randomly sample each row in the data.frame based on the weights listed for each row. Each row should only return one value per run. The data.frame structure looks like this:
ID, X2000, X2001, X2002, X2003, X2004
X11, 0, 0, 0.5, 0.5, 0
X33, 0.25, 0.25, 0.25, 0.25, 0
X55, 0, 0, 0, 0, 1
X77, 0.5, 0, 0, 0, 0.5
Given these weights, "X11" should return either X2002 or X2003; "X33" should have an equal probability of returning X2000, X2001, X2002, or X2003, with no chance of returning X2004; and the only possible return for "X55" should be X2004.
The output data I am interested in are the IDs and the column that was sampled for that run, although it would probably be simpler to return something like this:
ID, X2000, X2001, X2002, X2003, X2004
X11, 0, 0, 1, 0, 0
X33, 1, 0, 0, 0, 0
X55, 0, 0, 0, 0, 1
X77, 1, 0, 0, 0, 0
Your data.frame is effectively transposed for this task: sample() takes a probability vector, but your probabilities run rowwise, which makes them harder to extract from a data.frame.
To get around this, you can import your ID column as row names. That lets you access the IDs during an apply() call. Note that apply() coerces the data.frame to a matrix, which allows only one data type; that is why the IDs need to be row names, otherwise we would end up with a probability vector of characters instead of numerics.
mc_df <- read.table(
  text = 'ID X2000 X2001 X2002 X2003 X2004
X11 0 0 0.5 0.5 0
X33 0.25 0.25 0.25 0.25 0
X55 0 0 0 0 1
X77 0.5 0 0 0 0.5',
  header = T,
  row.names = 1)
From there, you can use the apply function:
apply(mc_df, 1, function(x) sample(names(x), size = 200, replace = T, prob = x))
Or you could make it fancy
apply(mc_df, 1, function(x) table(sample(names(x), size = 200, replace = T, prob = x)))
$X11
X2002 X2003
102 98
$X33
X2000 X2001 X2002 X2003
54 47 64 35
$X55
X2004
200
$X77
X2000 X2004
103 97
Fancier:
apply(mc_df, 1, function(x) table(sample(as.factor(names(x)), size = 200, replace = T, prob = x)))
X11 X33 X55 X77
X2000 0 51 0 99
X2001 0 50 0 0
X2002 91 57 0 0
X2003 109 42 0 0
X2004 0 0 200 101
Or fanciest:
prop.table(
  apply(mc_df, 1,
        function(x) table(sample(as.factor(names(x)), size = 200, replace = T, prob = x))),
  2)
X11 X33 X55 X77
X2000 0.00 0.270 0 0.515
X2001 0.00 0.235 0 0.000
X2002 0.51 0.320 0 0.000
X2003 0.49 0.175 0 0.000
X2004 0.00 0.000 1 0.485
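If each run should return exactly one draw per ID in the one-hot layout shown in the question, a minimal sketch along the same lines (assuming mc_df as built above; one_run is just an illustrative name):
one_run <- t(apply(mc_df, 1, function(x) {
  picked <- sample(names(x), size = 1, prob = x)  # one weighted draw for this ID
  as.integer(names(x) == picked)                  # 1 for the sampled column, 0 elsewhere
}))
colnames(one_run) <- colnames(mc_df)
one_run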

R - matching values in a dataframe column to another dataframes row names

I think this is a fairly challenging data manipulation problem in R, and I have struggled to construct a function that achieves it. The context is organizing basketball players who play different positions into a lineup together, subject to what position each player plays. For some clarity, here is an example of the dataframe I am working with, in two different forms:
dput(my_df)
structure(list(Name = c("C.J. McCollum", "DeMar DeRozan", "Jimmy Butler",
"Jonas Valanciunas", "Kevin Durant", "Markieff Morris", "Pascal Siakam",
"Pau Gasol"), Pos1 = c("PG", "SG", "SG", "C", "SF", "SF", "PF",
"C"), Pos2 = c("SG", "", "SF", "", "PF", "PF", "", "")), .Names = c("Name",
"Pos1", "Pos2"), class = "data.frame", row.names = c(18L, 33L,
62L, 68L, 78L, 92L, 106L, 111L))
my_df
Name Pos1 Pos2
18 C.J. McCollum PG SG
33 DeMar DeRozan SG
62 Jimmy Butler SG SF
68 Jonas Valanciunas C
78 Kevin Durant SF PF
92 Markieff Morris SF PF
106 Pascal Siakam PF
111 Pau Gasol C
dput(my_df2)
structure(list(Name = c("C.J. McCollum", "DeMar DeRozan", "Jimmy Butler",
"Jonas Valanciunas", "Kevin Durant", "Markieff Morris", "Pascal Siakam",
"Pau Gasol"), Pos1 = c("PG", "SG", "SG", "C", "SF", "SF", "PF",
"C"), Pos2 = c("SG", "", "SF", "", "PF", "PF", "", ""), PG = c(1,
0, 0, 0, 0, 0, 0, 0), SG = c(1, 1, 1, 0, 0, 0, 0, 0), SF = c(0,
0, 1, 0, 1, 1, 0, 0), PF = c(0, 0, 0, 0, 1, 1, 1, 0), C = c(0,
0, 0, 1, 0, 0, 0, 1), BackupG = c(1, 1, 1, 0, 0, 0, 0, 0), BackupF = c(0,
0, 1, 0, 1, 1, 1, 0), Man8 = c(1, 1, 1, 1, 1, 1, 1, 1)), .Names = c("Name",
"Pos1", "Pos2", "PG", "SG", "SF", "PF", "C", "BackupG", "BackupF",
"Man8"), row.names = c(18L, 33L, 62L, 68L, 78L, 92L, 106L, 111L
), class = "data.frame")
my_df2
Name Pos1 Pos2 PG SG SF PF C BackupG BackupF Man8
18 C.J. McCollum PG SG 1 1 0 0 0 1 0 1
33 DeMar DeRozan SG 0 1 0 0 0 1 0 1
62 Jimmy Butler SG SF 0 1 1 0 0 1 1 1
68 Jonas Valanciunas C 0 0 0 0 1 0 0 1
78 Kevin Durant SF PF 0 0 1 1 0 0 1 1
92 Markieff Morris SF PF 0 0 1 1 0 0 1 1
106 Pascal Siakam PF 0 0 0 1 0 0 1 1
111 Pau Gasol C 0 0 0 0 1 0 0 1
In a basketball lineup, we want 1 player set for each of the 5 positions in basketball (PG, SG, SF, PF, C); we also want 1 backup guard (a PG or SG is a guard), 1 backup forward (a PF or SF is a forward), and an 8th player who can play any position. With this group of 8 players, we could construct the lineup in this way:
Name
PG C.J. McCollum
SG DeMar DeRozan
PF Kevin Durant
SF Markieff Morris
C Pau Gasol
Backup G Jimmy Butler
Backup F Pascal Siakam
8th Man Jonas Valanciunas
Of course there is some flexibility here (Kevin Durant and Markieff Morris could have been switched; in fact, several players could have switched spots in the 2nd dataframe). I would like to be able to organize my_df into this 2nd dataframe's format fairly quickly: something that takes the Pos1 and Pos2 columns from my_df, checks the rownames of the 2nd dataframe, and then fills in the player names.
There is a puzzle aspect to this, however. Of note: not all players have a second position, but those who do can be listed at either of their two positions (for example, Jimmy Butler can be set as a SG, a SF, a Backup G, a Backup F, or the 8th man, whereas Pau Gasol can only be set as a C, or as the 8th man). Additionally, while C.J. McCollum is listed as a PG and SG, he is the only player in my_df who is listed as a PG, and therefore must go in the PG row of the second dataframe.
Any thoughts are appreciated with this! I can provide more context if needed.
(edit: potentially editing my_df, adding Pos3, Pos4, Pos5 columns for whether a player can be a backup G, backup F, or 8th Man, may help as well, and is something I am currently working with).
Edit - see Simplify this grid such that each row and column has 1 value for a revised version of my question, which is a simpler problem to solve but will give me a solution to this question!
This approach is guaranteed to return a result if one exists; in fact, it will return all viable combinations.
st <- as.matrix(my_df2[4:dim(my_df2)[2]]) # Make a numeric matrix of the position columns
## allCombinationsAux may not be necessary if you are using a combinatorics library
allCombinationsAux <- function(z, nreg, x) {
  if (sum(nreg) > 1) {
    innerLoop <- do.call(rbind, lapply(x[nreg & (z != x)], allCombinationsAux, nreg & (z != x), x))
    ret <- cbind(z, innerLoop)
  } else {
    ret <- x[nreg]
  }
  ret
}
## Row indices to permute
x <- seq_len(nrow(st))
## Find all the possible row combinations for the matrix
combs <- do.call(rbind, lapply(x, function(y) allCombinationsAux(y, y != x, x)))
## Identify which combinations are valid
inds <- which(apply(combs, 1, function(x) sum(diag(st[x, ])) == 8))
## Select valid matrices
validChoices <- lapply(inds, function(x) st[combs[x, ], ])
1. Make a matrix out of my_df2
2. Find all possible matrix substitutions
3. Iterate through all possible matrices, testing if the diags are all 1
4. Select those matrices that are valid
To get the output to look like your example, you can run:
validChoices <- lapply(inds, function(x) {
  matr <- st[combs[x, ], ]
  retVal <- data.frame(Name = my_df2[combs[x, ], "Name"])
  rownames(retVal) <- colnames(matr)
  retVal
})
