combine redundant row items in r - r

I have a dataset with the the names of many different plant species (column MTmatch), some of which appear repeatedly. Each of these has a column (ReadSum) with a sum associated with it (as well as many other pieces of information). How do I combine/aggregate all of the redundant plant species and sum the associated ReadSum with each, while leaving the non-redundant rows alone?
I would like to take a dataset like this, and either have it transformed so that each sample has the aggregate of the combined rows, or at least an additional column showing the sum of the ReadSum column for the combined redundant species. Sorry if this is confusing, I'm not sure how to ask this question.
I have been messing about with dplyr, using group_by() and summarise(), but that seems to be summarizing across the whole column rather than just the new group.
structure(list(ESVID = c("ESV_000090", "ESV_000682", "ESV_000028",
"ESV_000030", "ESV_000010", "ESV_000182", "ESV_000040", "ESV_000135",
"ESV_000383"), S026401.R1 = c(0.222447727, 0, 0, 0, 0, 0, 0.029074432,
0, 0), S026404.R1 = c(0.022583349, 0, 0, 0, 0, 0, 0.016390389,
0.001257217, 0), S026406.R1 = c(0.360895503, 0, 0, 0.00814677,
0, 0, 0.01513888, 0, 0.00115466)), row.names = c(NA, -9L), class = "data.frame")
> dput(samp5[1:9])
structure(list(ESVID = c("ESV_000090", "ESV_000682", "ESV_000028",
"ESV_000030", "ESV_000010", "ESV_000182", "ESV_000040", "ESV_000135",
"ESV_000383"), S026401.R1 = c(0.222447727, 0, 0, 0, 0, 0, 0.029074432,
0, 0), S026404.R1 = c(0.022583349, 0, 0, 0, 0, 0, 0.016390389,
0.001257217, 0), S026406.R1 = c(0.360895503, 0, 0, 0.00814677,
0, 0, 0.01513888, 0, 0.00115466), S026409.R1 = c(0.221175955,
0, 0, 0, 0, 0, 0.005146173, 0, 0), S026412.R1 = c(0.026058888,
0, 0, 0, 0, 0, 0, 0, 0), MAX = c(0.400577608, 0.009933177, 0.124412855,
0.00814677, 0.009824944, 0.086475106, 0.154850408, 0.015593835,
0.008340888), ReadSum = c(3.54892343, 0.012059346, 0.203303936,
0.021075546, 0.009824944, 0.128007863, 0.859687787, 0.068159534,
0.050266853), SPECIES = c("Abies ", "Abies ", "Acer", "Alnus",
"Berberis", "Betula ", "Boykinia", "Boykinia", "Boykinia")), row.names = c(NA,
-9L), class = "data.frame")

Do either of these approached produce your intended outcome?
Data:
df <- structure(list(ESVID = c("ESV_000090", "ESV_000682", "ESV_000028",
"ESV_000030", "ESV_000010", "ESV_000182", "ESV_000040", "ESV_000135",
"ESV_000383"), S026401.R1 = c(0.222447727, 0, 0, 0, 0, 0, 0.029074432,
0, 0), S026404.R1 = c(0.022583349, 0, 0, 0, 0, 0, 0.016390389,
0.001257217, 0), S026406.R1 = c(0.360895503, 0, 0, 0.00814677,
0, 0, 0.01513888, 0, 0.00115466), S026409.R1 = c(0.221175955,
0, 0, 0, 0, 0, 0.005146173, 0, 0), S026412.R1 = c(0.026058888,
0, 0, 0, 0, 0, 0, 0, 0), MAX = c(0.400577608, 0.009933177, 0.124412855,
0.00814677, 0.009824944, 0.086475106, 0.154850408, 0.015593835,
0.008340888), ReadSum = c(3.54892343, 0.012059346, 0.203303936,
0.021075546, 0.009824944, 0.128007863, 0.859687787, 0.068159534,
0.050266853), SPECIES = c("Abies ", "Abies ", "Acer", "Alnus",
"Berberis", "Betula ", "Boykinia", "Boykinia", "Boykinia")), row.names = c(NA,
-9L), class = "data.frame")
Create a new column "combined_ReadSum" (2nd col) which is the sum of "ReadSum" for each "SPECIES":
library(dplyr)
df %>%
group_by(SPECIES) %>%
summarise(combined_ReadSum = sum(ReadSum)) %>%
left_join(df, by = "SPECIES")
#> # A tibble: 9 × 10
#> SPECIES combi…¹ ESVID S0264…² S0264…³ S0264…⁴ S0264…⁵ S0264…⁶ MAX ReadSum
#> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 "Abies " 3.56 ESV_… 0.222 0.0226 0.361 0.221 0.0261 0.401 3.55
#> 2 "Abies " 3.56 ESV_… 0 0 0 0 0 0.00993 0.0121
#> 3 "Acer" 0.203 ESV_… 0 0 0 0 0 0.124 0.203
#> 4 "Alnus" 0.0211 ESV_… 0 0 0.00815 0 0 0.00815 0.0211
#> 5 "Berber… 0.00982 ESV_… 0 0 0 0 0 0.00982 0.00982
#> 6 "Betula… 0.128 ESV_… 0 0 0 0 0 0.0865 0.128
#> 7 "Boykin… 0.978 ESV_… 0.0291 0.0164 0.0151 0.00515 0 0.155 0.860
#> 8 "Boykin… 0.978 ESV_… 0 0.00126 0 0 0 0.0156 0.0682
#> 9 "Boykin… 0.978 ESV_… 0 0 0.00115 0 0 0.00834 0.0503
#> # … with abbreviated variable names ¹​combined_ReadSum, ²​S026401.R1,
#> # ³​S026404.R1, ⁴​S026406.R1, ⁵​S026409.R1, ⁶​S026412.R1
Or, summarise columns by summing the values for each unique species:
library(dplyr)
df %>%
group_by(SPECIES) %>%
summarise(across(where(is.numeric), sum))
#> # A tibble: 6 × 8
#> SPECIES S026401.R1 S026404.R1 S026406.R1 S026409.R1 S0264…¹ MAX ReadSum
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 "Abies " 0.222 0.0226 0.361 0.221 0.0261 0.411 3.56
#> 2 "Acer" 0 0 0 0 0 0.124 0.203
#> 3 "Alnus" 0 0 0.00815 0 0 0.00815 0.0211
#> 4 "Berberis" 0 0 0 0 0 0.00982 0.00982
#> 5 "Betula " 0 0 0 0 0 0.0865 0.128
#> 6 "Boykinia" 0.0291 0.0176 0.0163 0.00515 0 0.179 0.978
#> # … with abbreviated variable name ¹​S026412.R1
Created on 2022-10-28 by the reprex package (v2.0.1)

Related

How can I make a moving sum from a cell in R?

I have a dataframe looking like this:
date
P
>60?
03-31-2020
6.8
0
03-30-2020
5.0
0
03-29-2020
0.0
0
03-28-2020
0.0
0
03-27-2020
2.0
0
03-26-2020
0.0
0
03-25-2020
71.0
1
03-24-2020
2.0
0
03-23-2020
0.0
0
03-22-2020
23.8
0
03-21-2020
0.0
0
03-20-2020
23.8
0
Code to reproduce the dataframe:
df1 <- data.frame(date = c("03-31-2020", "03-30-2020", "03-29-2020", "03-28-2020", "03-27-2020", "03-26-2020",
"03-25-2020", "03-24-2020", "03-23-2020", "03-22-2020", "03-21-2020", "03-20-2020"),
P = c(6.8, 5.0, 0.0, 0.0, 2.0, 0.0, 71.0, 2.0, 0.0, 23.8, 0.0, 23.8),
Sup60 = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0))
I want to sum the P values N days befores the P > 60.
For example, the first barrier (number bigger than 60) is the P = 71 on the day 25-03-2020, from that i want to sum the 5 P values before that day, like:
2.0 + 0.0 + 23.8 + 0.0 + 23.8 = 49,6
It is a kind of moving sum because the concept is similar to a moving average.
Instead of the average of the last 5 values, for example, I want the sum of the last 5 values from a value greater than 60.
How can I do this?
Hi firstly we can solve how to calculate a running sum then we do an if_else on this column, as a general rule you always split complex problems into minor solvable problems
library(tidyverse)
df_example <- tibble::tribble(
~date, ~P, ~`>60?`,
"03-31-2020", 6.8, 0L,
"03-30-2020", 5, 0L,
"03-29-2020", 0, 0L,
"03-28-2020", 0, 0L,
"03-27-2020", 2, 0L,
"03-26-2020", 0, 0L,
"03-25-2020", 71, 1L,
"03-24-2020", 2, 0L,
"03-23-2020", 0, 0L,
"03-22-2020", 23.8, 0L,
"03-21-2020", 0, 0L,
"03-20-2020", 23.8, 0L
)
# lets start by doing a simple running sum
jjj <- df_example |>
arrange(date)
jjj |>
mutate(running_sum = slider::slide_dbl(.x = P,.f = ~ sum(.x),.before = 5,.after = -1)) |>
mutate(chosen_sum = if_else(P > 60,running_sum,NA_real_))
#> # A tibble: 12 x 5
#> date P `>60?` running_sum chosen_sum
#> <chr> <dbl> <int> <dbl> <dbl>
#> 1 03-20-2020 23.8 0 0 NA
#> 2 03-21-2020 0 0 23.8 NA
#> 3 03-22-2020 23.8 0 23.8 NA
#> 4 03-23-2020 0 0 47.6 NA
#> 5 03-24-2020 2 0 47.6 NA
#> 6 03-25-2020 71 1 49.6 49.6
#> 7 03-26-2020 0 0 96.8 NA
#> 8 03-27-2020 2 0 96.8 NA
#> 9 03-28-2020 0 0 75 NA
#> 10 03-29-2020 0 0 75 NA
#> 11 03-30-2020 5 0 73 NA
#> 12 03-31-2020 6.8 0 7 NA
Created on 2021-10-20 by the reprex package (v2.0.1)

XTS searching in dates prior to current index

I have an xts with OHLC and a logical column: isSwingBottom. I want a new column that gives the index of first occurrence of isSwingBottom by searching only in dates prior to the current row. I need the closest found row.
For example, in this data for isSwingBottom, 1995-10-04/1995-10-10 should be NA, 1995-10-11/1995-10-31 should be 1995-10-10 and so on. I want to avoid using a loop.
myXTS <- structure(c(408.75, 417.2937, 423.3188, 423.775, 419.3375, 415.55,
419.325, 417.25, 415.8375, 411.7812, 408.975, 406.2625, 396.6062,
385.6125, 374.375, 368.0375, 369.8625, 356.2688, 360.45, 357.45,
354.35, 363.6313, 356.25, 361.0813, 364.375, 365.9, 351.15, 342.0375,
333.5938, 315.4812, 418.75, 431.25, 431.25, 429.375, 421.875,
422.5, 419.375, 418.125, 417.5, 415.625, 411.875, 406.25, 396.875,
378.75, 393.8188, 373.5, 371.5, 366.25, 366.25, 358.4937, 364.625,
372.25, 363.75, 368.75, 369.25, 362.5, 350, 343.75, 337.5, 322.5,
408.75, 417.625, 420.625, 415.625, 412.5, 417.625, 417, 415.125,
406.25, 406.25, 405, 387.5, 382.5, 368.75, 361.875, 368.125,
353.75, 350, 356.25, 350.875, 355, 360.625, 355.125, 358.875,
362.5, 344, 339.5, 332.5, 313.375, 309.5, 417.2937, 423.3188,
423.775, 419.3375, 415.55, 419.325, 417.25, 415.8375, 411.7812,
408.975, 406.2625, 396.6062, 385.6125, 372.0562, 368.0375, 369.8625,
356.2688, 360.45, 357.45, 354.35, 363.6313, 361.5, 361.0813,
364.375, 365.9, 351.15, 342.0375, 333.5938, 315.4812, 315.8188,
4560, 11120, 30840, 8640, 7400, 3040, 6360, 1720, 6080, 6840,
9200, 38440, 53000, 32000, 46240, 17680, 57240, 57840, 28680,
28880, 66760, 56640, 23240, 33480, 39120, 62480, 64320, 79920,
66320, 60360, NA, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
0, 0, 0), .Dim = c(30L, 7L), .Dimnames = list(NULL, c("Open",
"High", "Low", "Close", "Volume", "isSwingBottom", "isSwingTop"
)), index = structure(c(812764800, 812851200, 812937600, 813196800,
813283200, 813369600, 813456000, 813542400, 813801600, 813888000,
813974400, 814060800, 814147200, 814406400, 814492800, 814579200,
814665600, 814752000, 815011200, 815097600, 815184000, 815270400,
815356800, 815616000, 815702400, 815788800, 815875200, 815961600,
816220800, 816307200), tzone = "UTC", tclass = "Date"), class = c("xts",
"zoo"))
If you want a new column with dates you might need to convert myXTS to dataframe. Try this approach :
library(dplyr)
library(zoo)
library(tidyr)
myXTS %>%
fortify.zoo() %>%
mutate(new_col = if_else(isSwingBottom == 1, Index, as.Date(NA))) %>%
fill(new_col)
# Index Open High Low Close Volume isSwingBottom isSwingTop new_col
#1 1995-10-04 408.7500 418.7500 408.750 417.2937 4560 NA 0 <NA>
#2 1995-10-05 417.2937 431.2500 417.625 423.3188 11120 0 0 <NA>
#3 1995-10-06 423.3188 431.2500 420.625 423.7750 30840 0 0 <NA>
#4 1995-10-09 423.7750 429.3750 415.625 419.3375 8640 0 0 <NA>
#5 1995-10-10 419.3375 421.8750 412.500 415.5500 7400 1 0 1995-10-10
#6 1995-10-11 415.5500 422.5000 417.625 419.3250 3040 0 1 1995-10-10
#7 1995-10-12 419.3250 419.3750 417.000 417.2500 6360 0 0 1995-10-10
#8 1995-10-13 417.2500 418.1250 415.125 415.8375 1720 0 0 1995-10-10
#9 1995-10-16 415.8375 417.5000 406.250 411.7812 6080 0 0 1995-10-10
#10 1995-10-17 411.7812 415.6250 406.250 408.9750 6840 0 0 1995-10-10
#11 1995-10-18 408.9750 411.8750 405.000 406.2625 9200 0 0 1995-10-10
#12 1995-10-19 406.2625 406.2500 387.500 396.6062 38440 0 0 1995-10-10
#13 1995-10-20 396.6062 396.8750 382.500 385.6125 53000 0 0 1995-10-10
#14 1995-10-23 385.6125 378.7500 368.750 372.0562 32000 0 0 1995-10-10
#15 1995-10-24 374.3750 393.8188 361.875 368.0375 46240 0 0 1995-10-10
#16 1995-10-25 368.0375 373.5000 368.125 369.8625 17680 0 0 1995-10-10
#17 1995-10-26 369.8625 371.5000 353.750 356.2688 57240 0 0 1995-10-10
#18 1995-10-27 356.2688 366.2500 350.000 360.4500 57840 0 0 1995-10-10
#19 1995-10-30 360.4500 366.2500 356.250 357.4500 28680 0 0 1995-10-10
#20 1995-10-31 357.4500 358.4937 350.875 354.3500 28880 1 0 1995-10-31
#21 1995-11-01 354.3500 364.6250 355.000 363.6313 66760 0 0 1995-10-31
#22 1995-11-02 363.6313 372.2500 360.625 361.5000 56640 0 1 1995-10-31
#23 1995-11-03 356.2500 363.7500 355.125 361.0813 23240 1 0 1995-11-03
#24 1995-11-06 361.0813 368.7500 358.875 364.3750 33480 0 0 1995-11-03
#25 1995-11-07 364.3750 369.2500 362.500 365.9000 39120 0 1 1995-11-03
#26 1995-11-08 365.9000 362.5000 344.000 351.1500 62480 0 0 1995-11-03
#27 1995-11-09 351.1500 350.0000 339.500 342.0375 64320 0 0 1995-11-03
#28 1995-11-10 342.0375 343.7500 332.500 333.5938 79920 0 0 1995-11-03
#29 1995-11-13 333.5938 337.5000 313.375 315.4812 66320 0 0 1995-11-03
#30 1995-11-14 315.4812 322.5000 309.500 315.8188 60360 NA 0 1995-11-03

How to convert long form to wide form based on category in R

I have the following data.
name x1 x2 x3 x4
1 V1_3 1 0 999 999
2 V2_3 1.12 0.044 25.4 0
3 V3_3 0.917 0.045 20.4 0
4 V1_15 1 0 999 999
5 V2_15 1.07 0.036 29.8 0
6 V3_15 0.867 0.039 22.5 0
7 V1_25 1 0 999 999
8 V2_25 1.07 0.034 31.1 0
9 V3_25 0.917 0.037 24.6 0
10 V1_35 1 0 999 999
11 V2_35 1.05 0.034 31.2 0
12 V3_35 0.994 0.037 26.6 0
13 V1_47 1 0 999 999
14 V2_47 1.03 0.031 33.6 0
15 V3_47 0.937 0.034 27.4 0
16 V1_57 1 0 999 999
17 V2_57 1.13 0.036 31.9 0
18 V3_57 1.03 0.037 28.1 0
I want to convert this data to the following data. Can someone give me some suggestion, please?
name est_3 est_15 est_25 est_35 est_47 est_57
1 V2 1.12 1.07 1.07 1.05 1.03 1.13
2 V3 0.917 0.867 0.917 0.994 0.937 1.03
Here is one approach for you. Your data is called mydf here. First, you want to choose necessary columns (i.e., name and x1) using select(). Then, you want to subset rows using filter(). You want to grab rows that begin with V2 or V3 in strings. grepl() checks if each string has the pattern. Then, you want to split the column, name and create two columns (i.e., name and est). Finally, you want to convert the data to a long-format data using pivot_wider().
library(dplyr)
library(tidyr)
select(mydf, name:x1) %>%
filter(grepl(x = name, pattern = "^V[2|3]")) %>%
separate(col = name, into = c("name", "est"), sep = "_") %>%
pivot_wider(names_from = "est",values_from = "x1", names_prefix = "est_")
# name est_3 est_15 est_25 est_35 est_47 est_57
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 V2 1.12 1.07 1.07 1.05 1.03 1.13
#2 V3 0.917 0.867 0.917 0.994 0.937 1.03
For your reference, when you ask questions, you want to provide a minimal sample data and code. If you can do that, SO users can easily help you out. Please read this question.
DATA
mydf <- structure(list(name = c("V1_3", "V2_3", "V3_3", "V1_15", "V2_15",
"V3_15", "V1_25", "V2_25", "V3_25", "V1_35", "V2_35", "V3_35",
"V1_47", "V2_47", "V3_47", "V1_57", "V2_57", "V3_57"), x1 = c(1,
1.122, 0.917, 1, 1.069, 0.867, 1, 1.066, 0.917, 1, 1.048, 0.994,
1, 1.03, 0.937, 1, 1.133, 1.032), x2 = c(0, 0.044, 0.045, 0,
0.036, 0.039, 0, 0.034, 0.037, 0, 0.034, 0.037, 0, 0.031, 0.034,
0, 0.036, 0.037), x3 = c(999, 25.446, 20.385, 999, 29.751, 22.478,
999, 31.134, 24.565, 999, 31.18, 26.587, 999, 33.637, 27.405,
999, 31.883, 28.081), x4 = c(999, 0, 0, 999, 0, 0, 999, 0, 0,
999, 0, 0, 999, 0, 0, 999, 0, 0)), row.names = c(NA, -18L), class = c("tbl_df",
"tbl", "data.frame"))

filling in a matrix based on rownames

I have a matrix which looks like the following:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
the 0.07200378 0.173467875 -0.32174805 -0.17641919 -0.1895841 0.41491635 0.52559372 0.46668538 -0.622698039 0.07609943
dog -0.03110763 -0.307907604 -0.51872045 -0.61390705 0.2901446 -0.30045110 0.37480375 0.43265162 -0.095877141 0.13267635
went -0.10276563 0.006781152 -0.22007612 0.29408635 -0.5130759 -0.54109880 0.27203657 -0.10996491 -0.442054480 0.14811820
to -0.25024018 -0.690325871 0.04050764 0.19626275 0.1937401 0.22256489 -0.28244329 0.01593702 0.357230552 0.56581933
play -0.30871394 -0.093627274 -0.28149478 0.09634858 -0.0895794 0.40877385 -0.60633919 0.15760252 0.001222108 0.82736039
with -0.15535758 0.103512824 -0.22533448 0.18746118 0.4194084 0.64124607 -0.03984496 0.16687895 -0.373183180 -0.58537456
his 0.56851056 -0.376888059 0.48226617 0.06921187 0.5648746 -0.20768129 -0.28356166 0.70855895 0.031217873 0.71860737
owner -0.29910484 -0.727676094 -0.29929429 -0.23175114 0.4336813 0.39667153 -0.29670753 -0.04054499 -0.041433528 0.34875186
fun 0.08032176 -0.431446284 0.15740608 0.16003107 -0.1894946 0.37010769 0.26229681 -0.22716813 -0.310652746 0.06291729
john 0.08629179 0.470551208 0.31550134 0.61767611 0.6179546 -0.01474994 0.58974983 -0.39419778 -0.689627200 -0.18293759
With words as the rownames and its populated with word vectors from a word2vec model. I also have a second matrix which looks like the following:
dog play owner went fun NA_TEXT NA_2TEXT
1750_10-K_2005 0 1 0 0 1 0 0
1800_10-K_2005 1 0 1 0 0 0 1
1923_10-K_2005 1 0 0 0 0 0 0
2135_10-K_2005 0 0 0 0 0 1 0
2488_10-K_2005 0 0 0 0 0 0 0
2491_10-K_2005 0 0 1 0 0 0 1
2969_10-K_2005 1 1 0 1 0 0 1
3133_10-K_2005 0 0 0 0 0 0 0
3197_10-K/A_2005 0 0 0 1 0 1 0
3197_10-K_2005 0 0 0 0 0 0 0
Which is a bag of words matrix. The rownames here are documents. I want to compute the cosine similarity between the documents based on the word embeddings matrix in the first matrix. I can average the word embedding matrix words by doing rowMeans(wrds) which gives:
> rowMeans(wrds)
the dog went to play with
0.041831714 -0.063769466 -0.120801359 0.036905292 0.011155287 0.013941266
his owner fun john
0.227511640 -0.075740768 -0.006568109 0.141621237
Now I would like to "join" these words up with the docs matrix when a word exists in the colnames.
Expected output (for the first few columns of the docs matrix):
dog play owner went
1750_10-K_2005 0 0.011 0 0
1800_10-K_2005 -0.063 0 -0.12 0
1923_10-K_2005 -0.063 0 0 0
2135_10-K_2005 0 0 0 0
2488_10-K_2005 0 0 0 0
2491_10-K_2005 0 0 -0.075 0
2969_10-K_2005 -0.063 -0.075 0 -0.121
3133_10-K_2005 0 0 0 0
3197_10-K/A_2005 0 0 0 -0.12
3197_10-K_2005 0 0 0 0
Data:
wrds <- structure(c(0.0720037762075663, -0.031107634305954, -0.102765634655952,
-0.250240176916122, -0.30871394276619, -0.155357576906681, 0.568510562181473,
-0.299104837700725, 0.0803217552602291, 0.0862917900085449, 0.173467874526978,
-0.307907603681087, 0.00678115151822567, -0.690325871109962,
-0.0936272740364075, 0.103512823581696, -0.376888059079647, -0.727676093578339,
-0.43144628405571, 0.470551207661629, -0.321748048067093, -0.51872044801712,
-0.220076121389866, 0.0405076444149017, -0.281494781374931, -0.225334476679564,
0.482266165316105, -0.299294285476208, 0.157406084239483, 0.315501344390213,
-0.17641919106245, -0.613907054066658, 0.294086349196732, 0.196262747049332,
0.0963485836982727, 0.18746118247509, 0.0692118704319, -0.231751143932343,
0.16003106534481, 0.617676109075546, -0.189584106206894, 0.290144592523575,
-0.513075917959213, 0.193740077316761, -0.0895793968811631, 0.419408403337002,
0.564874619245529, 0.433681339025497, -0.189494623802602, 0.617954611778259,
0.414916351437569, -0.300451099872589, -0.541098803281784, 0.222564890980721,
0.408773854374886, 0.641246065497398, -0.207681285217404, 0.396671526134014,
0.370107688009739, -0.0147499442100525, 0.525593716651201, 0.374803751707077,
0.272036574780941, -0.282443292438984, -0.606339186429977, -0.0398449599742889,
-0.283561661839485, -0.296707525849342, 0.262296808883548, 0.589749827980995,
0.466685384511948, 0.432651624083519, -0.109964912757277, 0.015937015414238,
0.157602518796921, 0.166878946125507, 0.708558946847916, -0.0405449904501438,
-0.227168127894402, -0.394197784364223, -0.622698038816452, -0.0958771407604218,
-0.442054480314255, 0.357230551540852, 0.00122210755944252, -0.37318317964673,
0.0312178730964661, -0.0414335280656815, -0.310652745887637,
-0.689627200365067, 0.0760994255542755, 0.132676348090172, 0.148118201643229,
0.565819330513477, 0.827360391616821, -0.585374563932419, 0.718607366085052,
0.348751857876778, 0.0629172921180725, -0.18293759226799), .Dim = c(10L,
10L), .Dimnames = list(c("the", "dog", "went", "to", "play",
"with", "his", "owner", "fun", "john"), NULL))
docs <- structure(c(0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0), .Dim = c(10L, 7L), .Dimnames = list(
c("1750_10-K_2005", "1800_10-K_2005", "1923_10-K_2005", "2135_10-K_2005",
"2488_10-K_2005", "2491_10-K_2005", "2969_10-K_2005", "3133_10-K_2005",
"3197_10-K/A_2005", "3197_10-K_2005"), c("dog", "play", "owner",
"went", "fun", "NA_TEXT", "NA_2TEXT")))

Weighted random sampling for Monte Carlo simulation in R

I would like to run a Monte Carlo simulation. I have a data.frame where rows are unique IDs which have a probability of association with one of the columns. The data entered into the columns can be treated as the weights for that probability. I want to randomly sample each row in the data.frame based on the weights listed for each row. Each row should only return one value per run. The data.frame structure looks like this:
ID, X2000, X2001, X2002, X2003, X2004
X11, 0, 0, 0.5, 0.5, 0
X33, 0.25, 0.25, 0.25, 0.25, 0
X55, 0, 0, 0, 0, 1
X77, 0.5, 0, 0, 0, 0.5
For weighting, "X11" should either return X2002 or X2003, "X33" should have an equal probability of returning X2000, X2001, X2002, or X2003, should be equal with no chance of returning X2004. The only possible return for "X55" should be X2004.
The output data I am interested in are the IDs and the column that was sampled for that run, although it would probably be simpler to return something like this:
ID, X2000, X2001, X2002, X2003, X2004
X11, 0, 0, 1, 0, 0
X33, 1, 0, 0, 0, 0
X55, 0, 0, 0, 0, 1
X77, 1, 0, 0, 0, 0
Your data.frame is transposed - the sample() function takes a probability vector. However, your probability vector is rowwise which means it's harder to extract from a data.frame.
To get around this - you can import your ID column as a row.name. This allows you to be able to access it during an apply() statement. Note the apply() will coerce the data.frame to a matrix which means only one data type is allowed. That's why the IDs needed to be rownames - otherwise we'd have a probability vector of characters instead of numerics.
mc_df <- read.table(
text =
'ID X2000 X2001 X2002 X2003 X2004
X11 0 0 0.5 0.5 0
X33 0.25 0.25 0.25 0.25 0
X55 0 0 0 0 1
X77 0.5 0 0 0 0.5'
, header = T
,row.names = 1)
From there, can use the apply function:
apply(mc_df, 1, function(x) sample(names(x), size = 200, replace = T, prob = x))
Or you could make it fancy
apply(mc_df, 1, function(x) table(sample(names(x), size = 200, replace = T, prob = x)))
$X11
X2002 X2003
102 98
$X33
X2000 X2001 X2002 X2003
54 47 64 35
$X55
X2004
200
$X77
X2000 X2004
103 97
Fancier:
apply(mc_df, 1, function(x) table(sample(as.factor(names(x)), size = 200, replace = T, prob = x)))
X11 X33 X55 X77
X2000 0 51 0 99
X2001 0 50 0 0
X2002 91 57 0 0
X2003 109 42 0 0
X2004 0 0 200 101
Or fanciest:
prop.table(apply(mc_df
, 1
, function(x) table(sample(as.factor(names(x)), size = 200, replace = T, prob = x)))
,2)
X11 X33 X55 X77
X2000 0.00 0.270 0 0.515
X2001 0.00 0.235 0 0.000
X2002 0.51 0.320 0 0.000
X2003 0.49 0.175 0 0.000
X2004 0.00 0.000 1 0.485

Resources