I have this coefficient table in the form of a data frame:
coefficient_table <- data.frame("less_than_1" = c(1, 0.5, 0.1, 0.025, 0.010, 0.005, 0.001),
"1-5" = c(0.500, 1.000, 0.200, 0.050, 0.020, 0.010, 0.002),
"5-20" = c(0.10, 0.20, 1.00, 0.25, 0.10, 0.05, 0.01),
"20-50" = c(0.025, 0.050, 0.250, 1.000, 0.400, 0.200, 0.040),
"50-100" = c(0.010, 0.020, 0.10, 0.400, 1.00, 0.500, 0.100),
"100-500" = c(0.005, 0.010, 0.050, 0.200, 0.500, 1.000, 0.200),
"more_than_500" = c(0.001, 0.002, 0.010, 0.040, 0.100, 0.200, 1.000),
check.names = FALSE) # keep column names such as "1-5" instead of converting them to "X1.5"
I would like to apply it to a matrix with 7 columns, one for each variable in my coefficient data frame. For now, this is how it looks:
A <- data.frame("less_than_1" = c(0,0,1,0), "1-5" = c(1,0,0,0), "5-20" = c(0,0,0,0),
"20-50" = c(0,1,0,1), "50-100" = c(0,0,0,0), "100-500" = c(0,0,0,0),
"more_than_500" = c(0,0,0,0),
check.names = FALSE) # keep column names such as "1-5" unchanged
A <- as.matrix(A)
less_than_1 1-5 5-20 20-50 50-100 100-500 more_than_500
1 0 1 0 0 0 0 0
2 0 0 0 1 0 0 0
3 1 0 0 0 0 0 0
4 0 0 0 1 0 0 0
However, I would like to use my coefficient table to weight the elements of the matrix based on the formula I used to create the coefficients, namely: min(BudgetRange1,BudgetRange2) / max(BudgetRange1,BudgetRange2).
For example, the first row has a budget of "1-5", so that column should take the value 1. The other columns should take their respective values from the "1-5" column of the coefficient table. The desired result is:
table_z
less_than_1 1-5 5-20 20-50 50-100 100-500 more_than_500
1 0.5 1 0.2 0.05 0.02 0.01 0.002
2 0.025 0.05 0.25 1 0.4 0.2 0.04
3 1 0.5 0.1 0.025 0.01 0.005 0.001
4 0.025 0.05 0.25 1 0.4 0.2 0.04
Does anyone know how? Thanks for reading this far.
Assuming the columns (i.e. 'less_than_1', '1-5', '5-20', ...) are in exactly the same order in both coefficient_table and A, you can use matrix multiplication:
Z <- as.matrix(A)%*%as.matrix(coefficient_table)
> Z
less_than_1 1-5 5-20 20-50 50-100 100-500 more_than_500
[1,] 0.500 1.00 0.20 0.050 0.02 0.010 0.002
[2,] 0.025 0.05 0.25 1.000 0.40 0.200 0.040
[3,] 1.000 0.50 0.10 0.025 0.01 0.005 0.001
[4,] 0.025 0.05 0.25 1.000 0.40 0.200 0.040
# Z is a matrix; you can convert it to a data.frame if you need one:
table_z <- as.data.frame(Z)
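Why this works: each row of A contains a single 1, so multiplying by the coefficient table simply picks out the matching row of that table (which equals the matching column, because the table is symmetric). Here is a minimal sketch of that equivalence. The representative values in `budget` are an assumption: they were inferred from the ratios in the table and are not stated in the question. `coef_mat` and `Z_indexed` are illustrative names.

```r
# Assumed representative value per budget range (inferred, not from the question).
budget <- c(1, 2, 10, 40, 100, 200, 1000)

# min(a, b) / max(a, b) for every pair of ranges reproduces the coefficient table.
coef_mat <- outer(budget, budget, pmin) / outer(budget, budget, pmax)

# One-hot matrix: rows 1..4 fall into ranges 2, 4, 1 and 4, as in the question.
A <- matrix(0, nrow = 4, ncol = 7)
A[cbind(1:4, c(2, 4, 1, 4))] <- 1

# Matrix multiplication, as in the answer above.
Z <- A %*% coef_mat

# Equivalent: find the position of the 1 in each row and index the table directly.
Z_indexed <- coef_mat[max.col(A, ties.method = "first"), ]

stopifnot(isTRUE(all.equal(Z, Z_indexed)))
```

The indexing form only applies when every row has exactly one 1; the matrix product also handles rows with several non-zero weights.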
I made a minimal reproducible example, but my real data is much larger.
ac_1 <-c(0.1, 0.3, 0.03, 0.03)
ac_2 <-c(0.2, 0.4, 0.1, 0.008)
ac_3 <-c(0.8, 0.043, 0.7, 0.01)
ac_4 <-c(0.2, 0.73, 0.1, 0.1)
c_2<-c(1,2,5,23)
check_1<-c(0.01, 0.902,0.02,0.07)
check_2<-c(0.03, 0.042,0.002,0.00001)
check_3<-c(0.01, 0.02,0.5,0.001)
check_4<-c(0.001, 0.042,0.02,0.2)
id<-1:4
df<-data.frame(id,ac_1, ac_2,ac_3,ac_4,c_2,check_1,check_2,check_3,check_4)
so, the dataframe is like this:
> df
id ac_1 ac_2 ac_3 ac_4 c_2 check_1 check_2 check_3 check_4
1 1 0.10 0.200 0.800 0.20 1 0.010 0.03000 0.010 0.001
2 2 0.30 0.400 0.043 0.73 2 0.902 0.04200 0.020 0.042
3 3 0.03 0.100 0.700 0.10 5 0.020 0.00200 0.500 0.020
4 4 0.03 0.008 0.010 0.10 23 0.070 0.00001 0.001 0.200
What I want to do is this:
if check_1 is 0.02, set the corresponding ac_1 to missing (NA).
if check_2 is 0.02, set the corresponding ac_2 to missing (NA).
I will repeat this for every pair of "check" and "ac" columns.
For example, in the check_1 column, the person with id 3 has 0.02,
so this person's ac_1 score (0.03) should become NA.
In the check_3 column, the person with id 2 has 0.02,
so this person's ac_3 score should become NA.
In the check_4 column, the person with id 3 has 0.02,
so this person's ac_4 score should become NA.
So what I did is as follows:
for(i in 1:4){
if(paste0("df$check_",i)==0.02){
paste0("df$ac_",i)==NA
}
}
But, it did not work...
You're really close, but you're off on a few fundamentals.
You can't (easily) use strings to refer to objects, so "df$check_1" won't work. You can use strings to refer to column names, but not with $, you need to use [ or [[, so df[["check_1"]] will work.
if isn't vectorized, so it won't work on each value in a column. Use ifelse instead, or even better in this case we can skip the if entirely.
Using == on non-integer numbers is risky due to precision issues. We'll use a tolerance instead.
Minor issue, paste0("df$ac_",i)==NA isn't good, == is for checking equality. You need = or <- for assignment on that line.
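A small sketch of the first three points on toy data may help. The data frame `d` and the names `col_name`, `is_hit`, `exact` and `within_tol` are made up for illustration.

```r
# Toy data (hypothetical values).
d <- data.frame(check_1 = c(0.01, 0.02), ac_1 = c(0.1, 0.3))

i <- 1
col_name <- paste0("check_", i)

# A string selects a column through [[ ]], not through $.
same_column <- identical(d[[col_name]], d$check_1)

# Comparisons are vectorized: this gives one logical value per row,
# so no if() is needed.
is_hit <- abs(d[[col_name]] - 0.02) < 1e-10

# Exact equality on doubles can fail even when the arithmetic "should" match.
exact      <- (0.1 + 0.2) == 0.3
within_tol <- abs((0.1 + 0.2) - 0.3) < 1e-10
```

On typical double-precision hardware `exact` is FALSE while `within_tol` is TRUE, which is why the comparison with 0.02 uses a tolerance.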
Addressing all of these issues:
for(i in 1:4){
df[
## rows to replace
abs(df[[paste0("check_", i)]] - 0.02) < 1e-10,
## column to replace
paste0("ac_", i)
] <- NA
}
df
# id ac_1 ac_2 ac_3 ac_4 c_2 check_1 check_2 check_3 check_4
# 1 1 0.10 0.200 0.80 0.20 1 0.010 0.03000 0.010 0.001
# 2 2 0.30 0.400 NA 0.73 2 0.902 0.04200 0.020 0.042
# 3 3 NA 0.100 0.70 NA 5 0.020 0.00200 0.500 0.020
# 4 4 0.03 0.008 0.01 0.10 23 0.070 0.00001 0.001 0.200
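If the loop becomes slow on very large data, the same replacement can probably be done in one step with a logical matrix. This is a sketch, assuming the "ac" and "check" columns pair up by their numeric suffix; the small data frame below uses made-up values, and `ac_cols`, `check_cols`, `hit` and `ac` are illustrative names.

```r
# Small stand-in data (hypothetical values, same layout as the question).
df <- data.frame(
  id      = 1:3,
  ac_1    = c(0.1, 0.3, 0.03),
  ac_2    = c(0.2, 0.4, 0.1),
  check_1 = c(0.01, 0.902, 0.02),
  check_2 = c(0.03, 0.02, 0.002)
)

# Build both name vectors from the same suffixes so the columns line up.
ac_cols    <- paste0("ac_", 1:2)
check_cols <- paste0("check_", 1:2)

# Logical matrix: TRUE where a check value is (approximately) 0.02.
hit <- abs(as.matrix(df[check_cols]) - 0.02) < 1e-10

# Set the cells at the same positions in the ac columns to NA, then write back.
ac <- as.matrix(df[ac_cols])
ac[hit] <- NA
df[ac_cols] <- ac
```

The positions in `hit` are matched to `ac` purely by row and column position, so the two name vectors must be in the same order.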
It's often better to work with long format data, even if just temporarily. Here is an example of doing so, using dplyr and tidyr:
library(dplyr)
library(tidyr)
pivot_longer(df, -c(id,c_2)) %>%
separate(name,into=c("type", "pos")) %>%
pivot_wider(names_from=type, values_from = value) %>%
mutate(ac=if_else(near(check,0.02), as.double(NA), ac)) %>%
pivot_wider(names_from = pos, values_from = ac:check)
(Updated with near() thanks to Gregor)
Output:
id c_2 ac_1 ac_2 ac_3 ac_4 check_1 check_2 check_3 check_4
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0.1 0.2 0.8 0.2 0.01 0.03 0.01 0.001
2 2 2 0.3 0.4 NA 0.73 0.902 0.042 0.02 0.042
3 3 5 NA 0.1 0.7 NA 0.02 0.002 0.5 0.02
4 4 23 0.03 0.008 0.01 0.1 0.07 0.00001 0.001 0.2
I have a time series data set where I would like to replace the values of specific observations in the first row (the row where date == "2020-12-31") with zero. The reason is that the percentage returns displayed in these observations are from last year, and for my purpose I need them to equal zero.
The actual dataset that I am working with is much larger than the example provided below, and the order of the columns in the real-life dataset changes from month to month when I receive the file (except for the date column, which is always the first one on the left-hand side). There are around 50 observations in the first row that I need to change every month. The common trait of the columns where the first observation needs to change is that their names contain the string "local_return", "currency_return" or "twr_".
Please see a simplified example of the problem in the picture below. Basically, I need to change the observations coloured red in the first row to zero. As mentioned above, the order of the columns to the right of the date column changes every month (as does the selection of columns). However, the columns that I need to change always contain one of a handful of strings.
I would appreciate any help! I have tried using the replace() function but I am only able to change all variables (except the variable 'date') in the first row to zero. I can also change the value of specific observations using x[i,j] but since the columns change order every month this is not a good long-term solution.
A screenshot of the problem in Excel
Example data to copy and paste into R for easy reproducing of problem:
df <- data.frame (date = c("2020-12-31", "2021-01-04", "2021-01-05", "2021-01-06", "2021-01-07", "2021-01-08", "2021-01-11"),
nominal_amount = c( 5000000000, 5000000000,5000000000,5000000000,5000000000,5000000000,5000000000),
market_value = c(5132748596, 5463759675, 5231957361, 5212386748, 5194812564, 5192248647, 5198903366),
exposure = c(5132748596, 5463759675, 5231957361, 5212386748, 5194812564, 5192248647, 5198903366),
net_cash_flow = c(0, 0, 0, 0, 0, 0, 0),
local_return_ytd =c(0.12, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02),
currency_return_ytd = c(0.11, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01),
duration = c(4.02, 4.01, 3.99, 3.98, 3.93, 3.79, 3.94),
twr_ytd = c(0.13, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03),
local_return_mtd = c(0.102, 0.002, 0.002, 0.002, 0.002, 0.002, 0.002),
currency_return_mtd = c(0.101, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001),
twr_mtd = c(0.103, 0.003, 0.003, 0.003, 0.003, 0.003, 0.003))
Here's a direct approach with base R:
# "local_return" or "currency_return" or "twr_".
# get those column names
cols = grep("local_return|currency_return|twr_", names(df), value = TRUE)
# check cols - looks right
cols
# [1] "local_return_ytd" "currency_return_ytd" "twr_ytd" "local_return_mtd"
# [5] "currency_return_mtd" "twr_mtd"
# get row numbers to change (you can just use 1 if you know it's the first row)
rows = which(df$date == "2020-12-31")
# check rows - looks right
rows
# [1] 1
# set values in those columns and rows to 0
df[rows, cols] = 0
df
# date nominal_amount market_value exposure net_cash_flow local_return_ytd currency_return_ytd
# 1 2020-12-31 5e+09 5132748596 5132748596 0 0.00 0.00
# 2 2021-01-04 5e+09 5463759675 5463759675 0 0.02 0.01
# 3 2021-01-05 5e+09 5231957361 5231957361 0 0.02 0.01
# 4 2021-01-06 5e+09 5212386748 5212386748 0 0.02 0.01
# 5 2021-01-07 5e+09 5194812564 5194812564 0 0.02 0.01
# 6 2021-01-08 5e+09 5192248647 5192248647 0 0.02 0.01
# 7 2021-01-11 5e+09 5198903366 5198903366 0 0.02 0.01
# duration twr_ytd local_return_mtd currency_return_mtd twr_mtd
# 1 4.02 0.00 0.000 0.000 0.000
# 2 4.01 0.03 0.002 0.001 0.003
# 3 3.99 0.03 0.002 0.001 0.003
# 4 3.98 0.03 0.002 0.001 0.003
# 5 3.93 0.03 0.002 0.001 0.003
# 6 3.79 0.03 0.002 0.001 0.003
# 7 3.94 0.03 0.002 0.001 0.003
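Since the file arrives every month with a different column order, it may be convenient to wrap these steps in a function. This is a minimal sketch: `zero_matching_cells` and its arguments are hypothetical names, and the `jan` and `feb` data frames are made-up stand-ins for two monthly files.

```r
# Set the cells in the given date's row(s) to 0 for every column whose
# name matches the pattern; the column order does not matter.
zero_matching_cells <- function(data, date_value,
                                pattern = "local_return|currency_return|twr_") {
  cols <- grep(pattern, names(data), value = TRUE)
  rows <- which(data$date == date_value)
  # Nothing to do if no column or no row matches.
  if (length(cols) == 0 || length(rows) == 0) return(data)
  data[rows, cols] <- 0
  data
}

# Two hypothetical monthly files with the same columns in a different order.
jan <- data.frame(date = c("2020-12-31", "2021-01-04"),
                  twr_ytd = c(0.13, 0.03),
                  duration = c(4.02, 4.01),
                  local_return_mtd = c(0.102, 0.002))
feb <- jan[c("date", "local_return_mtd", "duration", "twr_ytd")]

jan_fixed <- zero_matching_cells(jan, "2020-12-31")
feb_fixed <- zero_matching_cells(feb, "2020-12-31")
```

Because the columns are selected by name, the same call works regardless of where they sit in the file.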
I have a simple question, but I can't solve it easily with lag/lead or similar functions.
Let's say I have this table. I have an initial balance of 100, Position indicates whether I bid or not, and Percentage_change is what I gain if I bid. How can I calculate the balance to get something like this?
Position Percentage_change Balance
0 0.01 100
0 -0.01 100
1 0.02 102
1 0.05 107.1
0 -0.02 107.1
1 0.03 110.3
cumprod is the function you are looking for, e.g.:
df <- data.frame(Position = c(0,0,1,1,0,1),
Percentage_change = c(0.01, -0.01, 0.02, 0.05, -0.02, 0.03))
# convert to multiplier form, e.g. 0.01 becomes 1.01
df$Multiplier <- df$Percentage_change + 1
# when Position is 0, reset the multiplier to 1 so the balance does not change
df$Multiplier[df$Position == 0] <- 1
# multiply the starting balance of 100 by the cumulative product of the multipliers
df$Balance <- 100 * cumprod(df$Multiplier)
df
Position Percentage_change Multiplier Balance
1 0 0.01 1.00 100.000
2 0 -0.01 1.00 100.000
3 1 0.02 1.02 102.000
4 1 0.05 1.05 107.100
5 0 -0.02 1.00 107.100
6 1 0.03 1.03 110.313
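The same calculation can be written more compactly. This sketch assumes Position only takes the values 0 and 1, so multiplying by it switches the percentage change off when there is no bid; `start_balance` is an illustrative name.

```r
df <- data.frame(Position = c(0, 0, 1, 1, 0, 1),
                 Percentage_change = c(0.01, -0.01, 0.02, 0.05, -0.02, 0.03))

start_balance <- 100

# Position * Percentage_change is 0 when Position is 0, so the multiplier is 1
# and the balance carries over unchanged.
df$Balance <- start_balance * cumprod(1 + df$Position * df$Percentage_change)
```

This avoids the intermediate Multiplier column but gives the same balances as the step-by-step version.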
Edit: For anyone coming later: THIS IS NOT A DUPLICATE, since it explicitly concerns work on data frames, not single variables/vectors.
I have found several sites describing how to drop leading zeros in numbers or strings, including vectors. But none of the descriptions I found seem applicable to data frames.
There is also the f_num function in the numform package. It handles "[a] vector of numbers (or string equivalents)", but does not seem to solve unwanted leading zeros in a data frame.
I am relatively new to R but understand that I could develop some (in my mind) complex code to drop leading zeros by subsetting vectors from a data frame and then combining those vectors into a full data frame. I would like to avoid that.
Here is a simple data frame:
df <- structure(list(est = c(0.05, -0.16, -0.02, 0, -0.11, 0.15, -0.26,
-0.23), low2.5 = c(0.01, -0.2, -0.05, -0.03, -0.2, 0.1, -0.3,
-0.28), up2.5 = c(0.09, -0.12, 0, 0.04, -0.01, 0.2, -0.22, -0.17
)), row.names = c(NA, 8L), class = "data.frame")
Which gives
df
est low2.5 up2.5
1 0.05 0.01 0.09
2 -0.16 -0.20 -0.12
3 -0.02 -0.05 0.00
4 0.00 -0.03 0.04
5 -0.11 -0.20 -0.01
6 0.15 0.10 0.20
7 -0.26 -0.30 -0.22
8 -0.23 -0.28 -0.17
I would want
est low2.5 up2.5
1 .05 .01 .09
2 -.16 -.20 -.12
3 -.02 -.05 .00
4 .00 -.03 .04
5 -.11 -.20 -.01
6 .15 .10 .20
7 -.26 -.30 -.22
8 -.23 -.28 -.17
Is that possible with relatively simple code for a whole data frame?
Edit: An incorrect link has been removed.
I interpret your question as asking to convert each numeric cell in the data.frame into a "pretty-printed" string, which is possible using string substitution and a simple regular expression. (A good question, by the way: I do not know of any way to configure the output of numeric data to suppress leading zeros without converting the numeric data into strings.)
df2 <- data.frame(lapply(df,
function(x) gsub("^0\\.", "\\.", gsub("^-0\\.", "-\\.", as.character(x)))),
stringsAsFactors = FALSE)
df2
# est low2.5 up2.5
# 1 .05 .01 .09
# 2 -.16 -.2 -.12
# 3 -.02 -.05 0
# 4 0 -.03 .04
# 5 -.11 -.2 -.01
# 6 .15 .1 .2
# 7 -.26 -.3 -.22
# 8 -.23 -.28 -.17
str(df2)
# 'data.frame': 8 obs. of 3 variables:
# $ est : chr ".05" "-.16" "-.02" "0" ...
# $ low2.5: chr ".01" "-.2" "-.05" "-.03" ...
# $ up2.5 : chr ".09" "-.12" "0" ".04" ...
If you want to get a fixed number of digits after the decimal point (as shown in the expected output but not asked for explicitly) you could use sprintf or format:
df3 <- data.frame(lapply(df, function(x) gsub("^0\\.", "\\.", gsub("^-0\\.", "-\\.", sprintf("%.2f", x)))), stringsAsFactors = FALSE)
df3
# est low2.5 up2.5
# 1 .05 .01 .09
# 2 -.16 -.20 -.12
# 3 -.02 -.05 .00
# 4 .00 -.03 .04
# 5 -.11 -.20 -.01
# 6 .15 .10 .20
# 7 -.26 -.30 -.22
# 8 -.23 -.28 -.17
Note: This solution is not robust against a different decimal point character (different locales); it always expects a decimal point.
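The two nested gsub calls can probably be folded into a single sub with an optional minus sign captured as a group. This is a sketch, not a tested drop-in replacement; `drop_leading_zero` and its `digits` argument are hypothetical names, and the small data frame uses a few values from the question.

```r
# Format with a fixed number of decimals, then remove a single leading zero
# that sits directly before the decimal point, keeping any minus sign.
drop_leading_zero <- function(x, digits = 2) {
  sub("^(-?)0\\.", "\\1.", sprintf(paste0("%.", digits, "f"), x))
}

df <- data.frame(est = c(0.05, -0.16, 0), low2.5 = c(0.01, -0.2, -0.03))

# Assigning into out[] keeps the data frame structure and the column names.
out <- df
out[] <- lapply(df, drop_leading_zero)
```

Values with a non-zero integer part, such as 1.5, do not match the pattern and are only formatted to the fixed number of decimals.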
I am interested in combining survival estimates of heart disease at age 70. Since these are survival estimates, they range from 0 to 1 (similar to a proportion). I have 5 studies, and a summary of the estimates, 95% CI, n, and SE are shown below. Each row represents a study.
> dat
Estimate Lower Upper n SE
1 0.55 0.40 0.71 100 1.479592
2 0.23 0.15 0.35 300 2.562728
3 0.54 0.44 0.66 200 2.092459
4 0.59 0.30 0.75 400 2.959184
5 0.88 0.67 0.98 40 0.935776
dat <- structure(list(Estimate = c(0.55, 0.23, 0.54, 0.59, 0.88), Lower = c(0.4,
0.15, 0.44, 0.3, 0.67), Upper = c(0.71, 0.35, 0.66, 0.75, 0.98
), n = c(100, 300, 200, 400, 40), SE = c(1.47959183673469, 2.56272823568864,
2.09245884228672, 2.95918367346939, 0.935776042294725)), .Names = c("Estimate",
"Lower", "Upper", "n", "SE"), row.names = c(NA, -5L), class = "data.frame")
rma in package metafor allows me to input sei (standard error), which would encapsulate the information I have from the 95% CIs based on the studies, but the resulting confidence interval is not bounded between 0 and 1 (i.e. the CI is -0.51, 1.77). How can I go about restricting this? i.e. make sure that rma is treating my dat$Estimate as values between 0 and 1.
> rma(dat$Estimate, dat$SE, method = "DL")
Random-Effects Model (k = 5; tau^2 estimator: DL)
tau^2 (estimated amount of total heterogeneity): 0 (SE = 1.2621)
tau (square root of estimated tau^2 value): 0
I^2 (total heterogeneity / total variability): 0.00%
H^2 (total variability / sampling variability): 1.00
Test for Heterogeneity:
Q(df = 4) = 0.1380, p-val = 0.9977
Model Results:
estimate se zval pval ci.lb ci.ub
0.6302 0.5822 1.0824 0.2791 -0.5109 1.7712
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1