I have an xts with OHLC and a logical column: isSwingBottom. I want a new column that gives the index of first occurrence of isSwingBottom by searching only in dates prior to the current row. I need the closest found row.
For example, in this data for isSwingBottom, 1995-10-04/1995-10-10 should be NA, 1995-10-11/1995-10-31 should be 1995-10-10 and so on. I want to avoid using a loop.
myXTS <- structure(c(408.75, 417.2937, 423.3188, 423.775, 419.3375, 415.55,
419.325, 417.25, 415.8375, 411.7812, 408.975, 406.2625, 396.6062,
385.6125, 374.375, 368.0375, 369.8625, 356.2688, 360.45, 357.45,
354.35, 363.6313, 356.25, 361.0813, 364.375, 365.9, 351.15, 342.0375,
333.5938, 315.4812, 418.75, 431.25, 431.25, 429.375, 421.875,
422.5, 419.375, 418.125, 417.5, 415.625, 411.875, 406.25, 396.875,
378.75, 393.8188, 373.5, 371.5, 366.25, 366.25, 358.4937, 364.625,
372.25, 363.75, 368.75, 369.25, 362.5, 350, 343.75, 337.5, 322.5,
408.75, 417.625, 420.625, 415.625, 412.5, 417.625, 417, 415.125,
406.25, 406.25, 405, 387.5, 382.5, 368.75, 361.875, 368.125,
353.75, 350, 356.25, 350.875, 355, 360.625, 355.125, 358.875,
362.5, 344, 339.5, 332.5, 313.375, 309.5, 417.2937, 423.3188,
423.775, 419.3375, 415.55, 419.325, 417.25, 415.8375, 411.7812,
408.975, 406.2625, 396.6062, 385.6125, 372.0562, 368.0375, 369.8625,
356.2688, 360.45, 357.45, 354.35, 363.6313, 361.5, 361.0813,
364.375, 365.9, 351.15, 342.0375, 333.5938, 315.4812, 315.8188,
4560, 11120, 30840, 8640, 7400, 3040, 6360, 1720, 6080, 6840,
9200, 38440, 53000, 32000, 46240, 17680, 57240, 57840, 28680,
28880, 66760, 56640, 23240, 33480, 39120, 62480, 64320, 79920,
66320, 60360, NA, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
0, 0, 0), .Dim = c(30L, 7L), .Dimnames = list(NULL, c("Open",
"High", "Low", "Close", "Volume", "isSwingBottom", "isSwingTop"
)), index = structure(c(812764800, 812851200, 812937600, 813196800,
813283200, 813369600, 813456000, 813542400, 813801600, 813888000,
813974400, 814060800, 814147200, 814406400, 814492800, 814579200,
814665600, 814752000, 815011200, 815097600, 815184000, 815270400,
815356800, 815616000, 815702400, 815788800, 815875200, 815961600,
816220800, 816307200), tzone = "UTC", tclass = "Date"), class = c("xts",
"zoo"))
If you want a new column with dates you might need to convert myXTS to dataframe. Try this approach :
library(dplyr)
library(zoo)
library(tidyr)
myXTS %>%
fortify.zoo() %>%
mutate(new_col = if_else(isSwingBottom == 1, Index, as.Date(NA))) %>%
fill(new_col)
# Index Open High Low Close Volume isSwingBottom isSwingTop new_col
#1 1995-10-04 408.7500 418.7500 408.750 417.2937 4560 NA 0 <NA>
#2 1995-10-05 417.2937 431.2500 417.625 423.3188 11120 0 0 <NA>
#3 1995-10-06 423.3188 431.2500 420.625 423.7750 30840 0 0 <NA>
#4 1995-10-09 423.7750 429.3750 415.625 419.3375 8640 0 0 <NA>
#5 1995-10-10 419.3375 421.8750 412.500 415.5500 7400 1 0 1995-10-10
#6 1995-10-11 415.5500 422.5000 417.625 419.3250 3040 0 1 1995-10-10
#7 1995-10-12 419.3250 419.3750 417.000 417.2500 6360 0 0 1995-10-10
#8 1995-10-13 417.2500 418.1250 415.125 415.8375 1720 0 0 1995-10-10
#9 1995-10-16 415.8375 417.5000 406.250 411.7812 6080 0 0 1995-10-10
#10 1995-10-17 411.7812 415.6250 406.250 408.9750 6840 0 0 1995-10-10
#11 1995-10-18 408.9750 411.8750 405.000 406.2625 9200 0 0 1995-10-10
#12 1995-10-19 406.2625 406.2500 387.500 396.6062 38440 0 0 1995-10-10
#13 1995-10-20 396.6062 396.8750 382.500 385.6125 53000 0 0 1995-10-10
#14 1995-10-23 385.6125 378.7500 368.750 372.0562 32000 0 0 1995-10-10
#15 1995-10-24 374.3750 393.8188 361.875 368.0375 46240 0 0 1995-10-10
#16 1995-10-25 368.0375 373.5000 368.125 369.8625 17680 0 0 1995-10-10
#17 1995-10-26 369.8625 371.5000 353.750 356.2688 57240 0 0 1995-10-10
#18 1995-10-27 356.2688 366.2500 350.000 360.4500 57840 0 0 1995-10-10
#19 1995-10-30 360.4500 366.2500 356.250 357.4500 28680 0 0 1995-10-10
#20 1995-10-31 357.4500 358.4937 350.875 354.3500 28880 1 0 1995-10-31
#21 1995-11-01 354.3500 364.6250 355.000 363.6313 66760 0 0 1995-10-31
#22 1995-11-02 363.6313 372.2500 360.625 361.5000 56640 0 1 1995-10-31
#23 1995-11-03 356.2500 363.7500 355.125 361.0813 23240 1 0 1995-11-03
#24 1995-11-06 361.0813 368.7500 358.875 364.3750 33480 0 0 1995-11-03
#25 1995-11-07 364.3750 369.2500 362.500 365.9000 39120 0 1 1995-11-03
#26 1995-11-08 365.9000 362.5000 344.000 351.1500 62480 0 0 1995-11-03
#27 1995-11-09 351.1500 350.0000 339.500 342.0375 64320 0 0 1995-11-03
#28 1995-11-10 342.0375 343.7500 332.500 333.5938 79920 0 0 1995-11-03
#29 1995-11-13 333.5938 337.5000 313.375 315.4812 66320 0 0 1995-11-03
#30 1995-11-14 315.4812 322.5000 309.500 315.8188 60360 NA 0 1995-11-03
Related
I have a dataset with the the names of many different plant species (column MTmatch), some of which appear repeatedly. Each of these has a column (ReadSum) with a sum associated with it (as well as many other pieces of information). How do I combine/aggregate all of the redundant plant species and sum the associated ReadSum with each, while leaving the non-redundant rows alone?
I would like to take a dataset like this, and either have it transformed so that each sample has the aggregate of the combined rows, or at least an additional column showing the sum of the ReadSum column for the combined redundant species. Sorry if this is confusing, I'm not sure how to ask this question.
I have been messing about with dplyr, using group_by() and summarise(), but that seems to be summarizing across the whole column rather than just the new group.
structure(list(ESVID = c("ESV_000090", "ESV_000682", "ESV_000028",
"ESV_000030", "ESV_000010", "ESV_000182", "ESV_000040", "ESV_000135",
"ESV_000383"), S026401.R1 = c(0.222447727, 0, 0, 0, 0, 0, 0.029074432,
0, 0), S026404.R1 = c(0.022583349, 0, 0, 0, 0, 0, 0.016390389,
0.001257217, 0), S026406.R1 = c(0.360895503, 0, 0, 0.00814677,
0, 0, 0.01513888, 0, 0.00115466)), row.names = c(NA, -9L), class = "data.frame")
> dput(samp5[1:9])
structure(list(ESVID = c("ESV_000090", "ESV_000682", "ESV_000028",
"ESV_000030", "ESV_000010", "ESV_000182", "ESV_000040", "ESV_000135",
"ESV_000383"), S026401.R1 = c(0.222447727, 0, 0, 0, 0, 0, 0.029074432,
0, 0), S026404.R1 = c(0.022583349, 0, 0, 0, 0, 0, 0.016390389,
0.001257217, 0), S026406.R1 = c(0.360895503, 0, 0, 0.00814677,
0, 0, 0.01513888, 0, 0.00115466), S026409.R1 = c(0.221175955,
0, 0, 0, 0, 0, 0.005146173, 0, 0), S026412.R1 = c(0.026058888,
0, 0, 0, 0, 0, 0, 0, 0), MAX = c(0.400577608, 0.009933177, 0.124412855,
0.00814677, 0.009824944, 0.086475106, 0.154850408, 0.015593835,
0.008340888), ReadSum = c(3.54892343, 0.012059346, 0.203303936,
0.021075546, 0.009824944, 0.128007863, 0.859687787, 0.068159534,
0.050266853), SPECIES = c("Abies ", "Abies ", "Acer", "Alnus",
"Berberis", "Betula ", "Boykinia", "Boykinia", "Boykinia")), row.names = c(NA,
-9L), class = "data.frame")
Do either of these approached produce your intended outcome?
Data:
df <- structure(list(ESVID = c("ESV_000090", "ESV_000682", "ESV_000028",
"ESV_000030", "ESV_000010", "ESV_000182", "ESV_000040", "ESV_000135",
"ESV_000383"), S026401.R1 = c(0.222447727, 0, 0, 0, 0, 0, 0.029074432,
0, 0), S026404.R1 = c(0.022583349, 0, 0, 0, 0, 0, 0.016390389,
0.001257217, 0), S026406.R1 = c(0.360895503, 0, 0, 0.00814677,
0, 0, 0.01513888, 0, 0.00115466), S026409.R1 = c(0.221175955,
0, 0, 0, 0, 0, 0.005146173, 0, 0), S026412.R1 = c(0.026058888,
0, 0, 0, 0, 0, 0, 0, 0), MAX = c(0.400577608, 0.009933177, 0.124412855,
0.00814677, 0.009824944, 0.086475106, 0.154850408, 0.015593835,
0.008340888), ReadSum = c(3.54892343, 0.012059346, 0.203303936,
0.021075546, 0.009824944, 0.128007863, 0.859687787, 0.068159534,
0.050266853), SPECIES = c("Abies ", "Abies ", "Acer", "Alnus",
"Berberis", "Betula ", "Boykinia", "Boykinia", "Boykinia")), row.names = c(NA,
-9L), class = "data.frame")
Create a new column "combined_ReadSum" (2nd col) which is the sum of "ReadSum" for each "SPECIES":
library(dplyr)
df %>%
group_by(SPECIES) %>%
summarise(combined_ReadSum = sum(ReadSum)) %>%
left_join(df, by = "SPECIES")
#> # A tibble: 9 × 10
#> SPECIES combi…¹ ESVID S0264…² S0264…³ S0264…⁴ S0264…⁵ S0264…⁶ MAX ReadSum
#> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 "Abies " 3.56 ESV_… 0.222 0.0226 0.361 0.221 0.0261 0.401 3.55
#> 2 "Abies " 3.56 ESV_… 0 0 0 0 0 0.00993 0.0121
#> 3 "Acer" 0.203 ESV_… 0 0 0 0 0 0.124 0.203
#> 4 "Alnus" 0.0211 ESV_… 0 0 0.00815 0 0 0.00815 0.0211
#> 5 "Berber… 0.00982 ESV_… 0 0 0 0 0 0.00982 0.00982
#> 6 "Betula… 0.128 ESV_… 0 0 0 0 0 0.0865 0.128
#> 7 "Boykin… 0.978 ESV_… 0.0291 0.0164 0.0151 0.00515 0 0.155 0.860
#> 8 "Boykin… 0.978 ESV_… 0 0.00126 0 0 0 0.0156 0.0682
#> 9 "Boykin… 0.978 ESV_… 0 0 0.00115 0 0 0.00834 0.0503
#> # … with abbreviated variable names ¹combined_ReadSum, ²S026401.R1,
#> # ³S026404.R1, ⁴S026406.R1, ⁵S026409.R1, ⁶S026412.R1
Or, summarise columns by summing the values for each unique species:
library(dplyr)
df %>%
group_by(SPECIES) %>%
summarise(across(where(is.numeric), sum))
#> # A tibble: 6 × 8
#> SPECIES S026401.R1 S026404.R1 S026406.R1 S026409.R1 S0264…¹ MAX ReadSum
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 "Abies " 0.222 0.0226 0.361 0.221 0.0261 0.411 3.56
#> 2 "Acer" 0 0 0 0 0 0.124 0.203
#> 3 "Alnus" 0 0 0.00815 0 0 0.00815 0.0211
#> 4 "Berberis" 0 0 0 0 0 0.00982 0.00982
#> 5 "Betula " 0 0 0 0 0 0.0865 0.128
#> 6 "Boykinia" 0.0291 0.0176 0.0163 0.00515 0 0.179 0.978
#> # … with abbreviated variable name ¹S026412.R1
Created on 2022-10-28 by the reprex package (v2.0.1)
I have a data set that looks like this:
It can be build using codes:
df<- structure(list(`Med` = c("DOCETAXEL",
"BEVACIZUMAB", "CARBOPLATIN", "CETUXIMAB", "DOXORUBICIN", "IRINOTECAN"
), `2.4 mg` = c(0, 0, 0, 0, 1, 0), `PRIOR CANCER THERAPY` = c(4L,
3L, 3L, 3L, 3L, 3L), `PRIOR CANCER SURGERY` = c(0, 0, 0, 0, 0,
0), `PRIOR RADIATION THERAPY` = c(0, 0, 0, 0, 0, 0)), row.names = c(NA,
6L), class = "data.frame")
Now I would like to change col name that are not start with number to proper case. How should I do it? I thought I could use str_to_title. I have tried many ways can not get it to work. Here is the codes that I tried:
# try1:
df[,3:5] %>% setNames(str_to_title(colnames(df[,3:5])))
#try2:
df[,3:5] <- df[,3:5]%>% rename_with (str_to_title)
# try3:
colnames(df[,3:5])<- str_to_title(colnames(df[,3:5]))
What did I do wrong? there is no error message, just the col names did not get updated. Could anyone help me identify the issue, or maybe show me a better way if you have?
Here I have small data then I can find the col number. If I want it to auto correct the col names to proper case, how can I do that?
Thanks.
We can use
library(dplyr)
library(stringr)
df %>%
rename_at(3:5, ~ str_to_title(.))
-output
# Med 2.4 mg Prior Cancer Therapy Prior Cancer Surgery Prior Radiation Therapy
#1 DOCETAXEL 0 4 0 0
#2 BEVACIZUMAB 0 3 0 0
#3 CARBOPLATIN 0 3 0 0
#4 CETUXIMAB 0 3 0 0
#5 DOXORUBICIN 1 3 0 0
#6 IRINOTECAN 0 3 0 0
Or using rename_with
df %>%
rename_with(~ str_to_title(.), 3:5)
In R, i have a table where the column name is a date, how do I invert the columns by rows to be able to record in the database?
Example Table:
estab codigo descricao 2021-02-01 2021-02-02
1 103 4390160 ANM 2003 0 0
2 103 4390161 ANM 2004 MF 0 0
3 103 4390162 ANM 2008 MF 0 0
4 103 4390193 ANM 3004 ST 0 0
5 103 4390189 ANM 3008 ST 0 0
6 103 4543512 ANM 24 NET 0 0
7 103 4390163 AMT 2008 RF 0 0
8 103 4543520 ANM 2003 COM BATERIA 0 0
9 103 4543521 ANM 2004 MF COM BATERIA 0 0
10 103 4543522 ANM 2008 MF COM BATERIA 0 0
11 103 4543523 ANM 3004 ST COM BATERIA 0 0
12 103 4543524 ANM 3008 ST COM BATERIA 0 0
13 103 4543516 AMT 8000 0 0
14 103 4390165 AMT 2018 0 0
15 103 4390164 AMT 2010 0 0
I tried to use melt, but it didn't work very well:
xxx <- reshape2::melt(xxx[[1]], id.vars = 'codigo')
If I understood your question, here is a code that should work for you:
# Tried recreating your dataframe
dt <- data.frame(estab = 103,
codigo = 4390160:4390174,
descricao = c("ANM 2003", "ANM 2004", "ANM BATERIA"),
"2021-02-01" = 0,
"2021-02-02" = 0)
dt <- reshape2::melt(dt, id.vars = c("estab", "codigo", "descricao"), variable.name = "Date", value.name = "Value")
# Make column into date
dt$Date <- gsub("X", "", dt$Date)
dt$Date <- as.Date(dt$Date, format = "%Y.%m.%d")
head(dt)
In base R, we can use reshape
out <- reshape(df1, direction = "long", varying = 4:5, sep = "")
row.names(out) <- NULL
data
df1 <- structure(list(estab = c(103, 103, 103, 103, 103, 103, 103, 103,
103, 103, 103, 103, 103, 103, 103), codigo = 4390160:4390174,
descricao = c("ANM 2003", "ANM 2004", "ANM BATERIA", "ANM 2003",
"ANM 2004", "ANM BATERIA", "ANM 2003", "ANM 2004", "ANM BATERIA",
"ANM 2003", "ANM 2004", "ANM BATERIA", "ANM 2003", "ANM 2004",
"ANM BATERIA"), X2021.02.01 = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0), X2021.02.02 = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA,
-15L))
I think you might be after something like the following.
library(tidyr)
library(lubridate)
pivot_longer(df, matches('\\d{4}-\\d{2}-\\d{2}'),
names_to = 'date',
names_transform = list(date = ymd))
# # A tibble: 6 x 5
# estab codigo descricao date value
# <dbl> <dbl> <chr> <date> <dbl>
# 1 103 4390160 ANM 2003 2021-02-01 0
# 2 103 4390160 ANM 2003 2021-02-02 0
# 3 103 4390161 ANM 2004 MF 2021-02-01 0
# 4 103 4390161 ANM 2004 MF 2021-02-02 0
# 5 103 4390162 ANM 2008 MF 2021-02-01 0
# 6 103 4390162 ANM 2008 MF 2021-02-02 0
Data
df <- structure(list(estab = c(103, 103, 103), codigo = c(4390160,
4390161, 4390162), descricao = c("ANM 2003", "ANM 2004 MF", "ANM 2008 MF"
), `2021-02-01` = c(0, 0, 0), `2021-02-02` = c(0, 0, 0)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
I have a matrix which looks like the following:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
the 0.07200378 0.173467875 -0.32174805 -0.17641919 -0.1895841 0.41491635 0.52559372 0.46668538 -0.622698039 0.07609943
dog -0.03110763 -0.307907604 -0.51872045 -0.61390705 0.2901446 -0.30045110 0.37480375 0.43265162 -0.095877141 0.13267635
went -0.10276563 0.006781152 -0.22007612 0.29408635 -0.5130759 -0.54109880 0.27203657 -0.10996491 -0.442054480 0.14811820
to -0.25024018 -0.690325871 0.04050764 0.19626275 0.1937401 0.22256489 -0.28244329 0.01593702 0.357230552 0.56581933
play -0.30871394 -0.093627274 -0.28149478 0.09634858 -0.0895794 0.40877385 -0.60633919 0.15760252 0.001222108 0.82736039
with -0.15535758 0.103512824 -0.22533448 0.18746118 0.4194084 0.64124607 -0.03984496 0.16687895 -0.373183180 -0.58537456
his 0.56851056 -0.376888059 0.48226617 0.06921187 0.5648746 -0.20768129 -0.28356166 0.70855895 0.031217873 0.71860737
owner -0.29910484 -0.727676094 -0.29929429 -0.23175114 0.4336813 0.39667153 -0.29670753 -0.04054499 -0.041433528 0.34875186
fun 0.08032176 -0.431446284 0.15740608 0.16003107 -0.1894946 0.37010769 0.26229681 -0.22716813 -0.310652746 0.06291729
john 0.08629179 0.470551208 0.31550134 0.61767611 0.6179546 -0.01474994 0.58974983 -0.39419778 -0.689627200 -0.18293759
With words as the rownames and its populated with word vectors from a word2vec model. I also have a second matrix which looks like the following:
dog play owner went fun NA_TEXT NA_2TEXT
1750_10-K_2005 0 1 0 0 1 0 0
1800_10-K_2005 1 0 1 0 0 0 1
1923_10-K_2005 1 0 0 0 0 0 0
2135_10-K_2005 0 0 0 0 0 1 0
2488_10-K_2005 0 0 0 0 0 0 0
2491_10-K_2005 0 0 1 0 0 0 1
2969_10-K_2005 1 1 0 1 0 0 1
3133_10-K_2005 0 0 0 0 0 0 0
3197_10-K/A_2005 0 0 0 1 0 1 0
3197_10-K_2005 0 0 0 0 0 0 0
Which is a bag of words matrix. The rownames here are documents. I want to compute the cosine similarity between the documents based on the word embeddings matrix in the first matrix. I can average the word embedding matrix words by doing rowMeans(wrds) which gives:
> rowMeans(wrds)
the dog went to play with
0.041831714 -0.063769466 -0.120801359 0.036905292 0.011155287 0.013941266
his owner fun john
0.227511640 -0.075740768 -0.006568109 0.141621237
Now I would like to "join" these words up with the docs matrix when a word exists in the colnames.
Expected output (for the first few columns of the docs matrix):
dog play owner went
1750_10-K_2005 0 0.011 0 0
1800_10-K_2005 -0.063 0 -0.12 0
1923_10-K_2005 -0.063 0 0 0
2135_10-K_2005 0 0 0 0
2488_10-K_2005 0 0 0 0
2491_10-K_2005 0 0 -0.075 0
2969_10-K_2005 -0.063 -0.075 0 -0.121
3133_10-K_2005 0 0 0 0
3197_10-K/A_2005 0 0 0 -0.12
3197_10-K_2005 0 0 0 0
Data:
wrds <- structure(c(0.0720037762075663, -0.031107634305954, -0.102765634655952,
-0.250240176916122, -0.30871394276619, -0.155357576906681, 0.568510562181473,
-0.299104837700725, 0.0803217552602291, 0.0862917900085449, 0.173467874526978,
-0.307907603681087, 0.00678115151822567, -0.690325871109962,
-0.0936272740364075, 0.103512823581696, -0.376888059079647, -0.727676093578339,
-0.43144628405571, 0.470551207661629, -0.321748048067093, -0.51872044801712,
-0.220076121389866, 0.0405076444149017, -0.281494781374931, -0.225334476679564,
0.482266165316105, -0.299294285476208, 0.157406084239483, 0.315501344390213,
-0.17641919106245, -0.613907054066658, 0.294086349196732, 0.196262747049332,
0.0963485836982727, 0.18746118247509, 0.0692118704319, -0.231751143932343,
0.16003106534481, 0.617676109075546, -0.189584106206894, 0.290144592523575,
-0.513075917959213, 0.193740077316761, -0.0895793968811631, 0.419408403337002,
0.564874619245529, 0.433681339025497, -0.189494623802602, 0.617954611778259,
0.414916351437569, -0.300451099872589, -0.541098803281784, 0.222564890980721,
0.408773854374886, 0.641246065497398, -0.207681285217404, 0.396671526134014,
0.370107688009739, -0.0147499442100525, 0.525593716651201, 0.374803751707077,
0.272036574780941, -0.282443292438984, -0.606339186429977, -0.0398449599742889,
-0.283561661839485, -0.296707525849342, 0.262296808883548, 0.589749827980995,
0.466685384511948, 0.432651624083519, -0.109964912757277, 0.015937015414238,
0.157602518796921, 0.166878946125507, 0.708558946847916, -0.0405449904501438,
-0.227168127894402, -0.394197784364223, -0.622698038816452, -0.0958771407604218,
-0.442054480314255, 0.357230551540852, 0.00122210755944252, -0.37318317964673,
0.0312178730964661, -0.0414335280656815, -0.310652745887637,
-0.689627200365067, 0.0760994255542755, 0.132676348090172, 0.148118201643229,
0.565819330513477, 0.827360391616821, -0.585374563932419, 0.718607366085052,
0.348751857876778, 0.0629172921180725, -0.18293759226799), .Dim = c(10L,
10L), .Dimnames = list(c("the", "dog", "went", "to", "play",
"with", "his", "owner", "fun", "john"), NULL))
docs <- structure(c(0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0), .Dim = c(10L, 7L), .Dimnames = list(
c("1750_10-K_2005", "1800_10-K_2005", "1923_10-K_2005", "2135_10-K_2005",
"2488_10-K_2005", "2491_10-K_2005", "2969_10-K_2005", "3133_10-K_2005",
"3197_10-K/A_2005", "3197_10-K_2005"), c("dog", "play", "owner",
"went", "fun", "NA_TEXT", "NA_2TEXT")))
I would like to run a Monte Carlo simulation. I have a data.frame where rows are unique IDs which have a probability of association with one of the columns. The data entered into the columns can be treated as the weights for that probability. I want to randomly sample each row in the data.frame based on the weights listed for each row. Each row should only return one value per run. The data.frame structure looks like this:
ID, X2000, X2001, X2002, X2003, X2004
X11, 0, 0, 0.5, 0.5, 0
X33, 0.25, 0.25, 0.25, 0.25, 0
X55, 0, 0, 0, 0, 1
X77, 0.5, 0, 0, 0, 0.5
For weighting, "X11" should either return X2002 or X2003, "X33" should have an equal probability of returning X2000, X2001, X2002, or X2003, should be equal with no chance of returning X2004. The only possible return for "X55" should be X2004.
The output data I am interested in are the IDs and the column that was sampled for that run, although it would probably be simpler to return something like this:
ID, X2000, X2001, X2002, X2003, X2004
X11, 0, 0, 1, 0, 0
X33, 1, 0, 0, 0, 0
X55, 0, 0, 0, 0, 1
X77, 1, 0, 0, 0, 0
Your data.frame is transposed - the sample() function takes a probability vector. However, your probability vector is rowwise which means it's harder to extract from a data.frame.
To get around this - you can import your ID column as a row.name. This allows you to be able to access it during an apply() statement. Note the apply() will coerce the data.frame to a matrix which means only one data type is allowed. That's why the IDs needed to be rownames - otherwise we'd have a probability vector of characters instead of numerics.
mc_df <- read.table(
text =
'ID X2000 X2001 X2002 X2003 X2004
X11 0 0 0.5 0.5 0
X33 0.25 0.25 0.25 0.25 0
X55 0 0 0 0 1
X77 0.5 0 0 0 0.5'
, header = T
,row.names = 1)
From there, can use the apply function:
apply(mc_df, 1, function(x) sample(names(x), size = 200, replace = T, prob = x))
Or you could make it fancy
apply(mc_df, 1, function(x) table(sample(names(x), size = 200, replace = T, prob = x)))
$X11
X2002 X2003
102 98
$X33
X2000 X2001 X2002 X2003
54 47 64 35
$X55
X2004
200
$X77
X2000 X2004
103 97
Fancier:
apply(mc_df, 1, function(x) table(sample(as.factor(names(x)), size = 200, replace = T, prob = x)))
X11 X33 X55 X77
X2000 0 51 0 99
X2001 0 50 0 0
X2002 91 57 0 0
X2003 109 42 0 0
X2004 0 0 200 101
Or fanciest:
prop.table(apply(mc_df
, 1
, function(x) table(sample(as.factor(names(x)), size = 200, replace = T, prob = x)))
,2)
X11 X33 X55 X77
X2000 0.00 0.270 0 0.515
X2001 0.00 0.235 0 0.000
X2002 0.51 0.320 0 0.000
X2003 0.49 0.175 0 0.000
X2004 0.00 0.000 1 0.485