Fill data frame by column with for loop - r

I created an empty data frame with 11 columns and 15 rows and subsequently named the columns.
L_df <- data.frame(matrix(ncol = 11, nrow = 15))
names(L_df) <- paste0("L_por", 0:10)
w <- c(0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2, 2.2, 2.4, 2.6, 2.8, 3)
wu <- 0
L <- 333.7
pm <- c(2600, 2574, 2548, 2522, 2496, 2470, 2444, 2418, 2392, 2366, 2340)
The data frame looks like this:
head(L_df)
L_por0 L_por1 L_por2 L_por3 L_por4 L_por5 L_por6 L_por7 L_por8 L_por9 L_por10
1 NA NA NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA NA NA NA
Now, I would like to fill the data frame by column, based on a formula. I tried to express this with a nested for loop:
for (i in 1:ncol(L_df)) {
  pm_tmp <- pm[i]
  col_tmp <- colnames(L_df)[i]
  for (j in 1:nrow(L_df)) {
    w_tmp <- w[j]
    L_por_tmp <- pm_tmp*L*((w_tmp-wu)/100)
    col_tmp[j] <- L_por_tmp
  }
}
For each column, I iterate over a predefined vector pm of length 11. For each row, I iterate over a predefined vector w of length 15 (repeats each column).
Example: First, select pm[1] for the first column. Second, select w[j] for each row j in the first column. Store the result of the formula in L_por_tmp and use it to fill the first column from row 1 to row 15. The whole procedure should start over for the second column (with pm[2]), again with w[j] for each row, and so on. wu and L are fixed in the formula.
R executes the code without an error. When I check the tmp values, they are correct. However, the data frame remains empty: L_df does not get filled. I would like to solve this with a loop, but if you have other solutions, I am happy to hear them! I get the impression there might be a smoother way of doing this. Cheers!

Solution
L_df <- data.frame(sapply(pm, function(x) x * L * ((w - wu) / 100)))
names(L_df) <- c("L_por0", "L_por1", "L_por2", "L_por3", "L_por4", "L_por5",
"L_por6", "L_por7", "L_por8", "L_por9", "L_por10")
L_df
L_por0 L_por1 L_por2 L_por3 L_por4 L_por5 L_por6 L_por7
1 1735.24 1717.888 1700.535 1683.183 1665.830 1648.478 1631.126 1613.773
2 3470.48 3435.775 3401.070 3366.366 3331.661 3296.956 3262.251 3227.546
3 5205.72 5153.663 5101.606 5049.548 4997.491 4945.434 4893.377 4841.320
4 6940.96 6871.550 6802.141 6732.731 6663.322 6593.912 6524.502 6455.093
5 8676.20 8589.438 8502.676 8415.914 8329.152 8242.390 8155.628 8068.866
6 10411.44 10307.326 10203.211 10099.097 9994.982 9890.868 9786.754 9682.639
7 12146.68 12025.213 11903.746 11782.280 11660.813 11539.346 11417.879 11296.412
8 13881.92 13743.101 13604.282 13465.462 13326.643 13187.824 13049.005 12910.186
9 15617.16 15460.988 15304.817 15148.645 14992.474 14836.302 14680.130 14523.959
10 17352.40 17178.876 17005.352 16831.828 16658.304 16484.780 16311.256 16137.732
11 19087.64 18896.764 18705.887 18515.011 18324.134 18133.258 17942.382 17751.505
12 20822.88 20614.651 20406.422 20198.194 19989.965 19781.736 19573.507 19365.278
13 22558.12 22332.539 22106.958 21881.376 21655.795 21430.214 21204.633 20979.052
14 24293.36 24050.426 23807.493 23564.559 23321.626 23078.692 22835.758 22592.825
15 26028.60 25768.314 25508.028 25247.742 24987.456 24727.170 24466.884 24206.598
L_por8 L_por9 L_por10
1 1596.421 1579.068 1561.716
2 3192.842 3158.137 3123.432
3 4789.262 4737.205 4685.148
4 6385.683 6316.274 6246.864
5 7982.104 7895.342 7808.580
6 9578.525 9474.410 9370.296
7 11174.946 11053.479 10932.012
8 12771.366 12632.547 12493.728
9 14367.787 14211.616 14055.444
10 15964.208 15790.684 15617.160
11 17560.629 17369.752 17178.876
12 19157.050 18948.821 18740.592
13 20753.470 20527.889 20302.308
14 22349.891 22106.958 21864.024
15 23946.312 23686.026 23425.740
Explanation
The sapply() function is a more idiomatic way to iterate over a vector in R. We iterate over pm and apply your formula once per element; because R is vectorised over w, each call returns a vector of length 15 (so 11 vectors of length 15 in total). sapply() combines these into a 15 x 11 matrix, data.frame() turns that into the data frame you want, and then we add the column names.
NOTE: Applying functions to every element of a vector using an apply() family function has some different implications than iterating using for loops. In your case, I think sapply() is easier and more understandable. For more information on when you need a loop or when something like apply is better, see for example this discussion from Hadley Wickham's Advanced R book.
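Because the formula is a product of a row term (w - wu) and a column term (pm), the same table can also be sketched with outer(). This is an alternative that is not part of the answer above; it assumes the same w, wu, L and pm as in the question, and the names L_df_outer and L_df_sapply are made up for illustration.

```r
# Inputs copied from the question
w  <- c(0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2, 2.2, 2.4, 2.6, 2.8, 3)
wu <- 0
L  <- 333.7
pm <- c(2600, 2574, 2548, 2522, 2496, 2470, 2444, 2418, 2392, 2366, 2340)

# outer() builds the 15 x 11 matrix of (w[j] - wu) * pm[i];
# scaling by L / 100 gives pm * L * ((w - wu) / 100) in every cell
L_df_outer <- data.frame(outer(w - wu, pm) * L / 100)
names(L_df_outer) <- paste0("L_por", seq_along(pm) - 1)

# Compare against the sapply() approach from the answer
L_df_sapply <- data.frame(sapply(pm, function(x) x * L * ((w - wu) / 100)))
names(L_df_sapply) <- names(L_df_outer)
print(isTRUE(all.equal(L_df_outer, L_df_sapply)))
```

Whether outer() or sapply() reads better is a matter of taste; both avoid the explicit double loop.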

You made just one small mistake and were almost there. Here is your loop, edited:
for (i in 1:ncol(L_df)) {
  pm_tmp <- pm[i]
  col_tmp <- colnames(L_df)[i]
  for (j in 1:nrow(L_df)) {
    w_tmp <- w[j]
    L_por_tmp <- pm_tmp * L * ((w_tmp - wu) / 100)
    # assign into the data frame with [row, column] indexing;
    # col_tmp is only the column name, so col_tmp[j] never touches L_df
    L_df[j, col_tmp] <- L_por_tmp
  }
}
Output:
Printing just the first few rows:
head(L_df, 3)
L_por0 L_por1 L_por2 L_por3 L_por4 L_por5 L_por6 L_por7 L_por8 L_por9 L_por10
1 1735.24 1717.888 1700.535 1683.183 1665.830 1648.478 1631.126 1613.773 1596.421 1579.068 1561.716
2 3470.48 3435.775 3401.070 3366.366 3331.661 3296.956 3262.251 3227.546 3192.842 3158.137 3123.432
3 5205.72 5153.663 5101.606 5049.548 4997.491 4945.434 4893.377 4841.320 4789.262 4737.205 4685.148
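Why the original loop left L_df untouched can be seen in a small sketch: col_tmp holds only a copy of the column name (a character string), so assigning to col_tmp[j] changes that local vector and never reaches the data frame. The two-column data frame demo_df below is made up for illustration.

```r
# A tiny empty data frame standing in for L_df
demo_df <- data.frame(matrix(NA_real_, nrow = 3, ncol = 2))
names(demo_df) <- c("L_por0", "L_por1")

col_tmp <- colnames(demo_df)[1]   # just the string "L_por0"
col_tmp[2] <- 42                  # extends the local character vector only
print(col_tmp)
print(all(is.na(demo_df)))        # the data frame has not been modified

demo_df[2, "L_por0"] <- 42        # [row, column] assignment writes into it
print(demo_df)
```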

Related

data length cannot be over width of moving average

I use quantmod to calculate moving averages for about 2000 data frames in a loop.
Here price is an xts object:
price <- cbind(price, SMA(price, 5), SMA(price, 10),
SMA(price, 20), SMA(price, 60), SMA(price, 120),
SMA(price, 180), SMA(price, 240))
But some of the series have fewer rows than the window width, so the loop stops with an error partway through. In that case, I just want those columns filled with NA.
I need some support to solve this problem.
Or, if I need to use another package to solve this problem, let me know.
Thanks
Moving average functions give an error when the chosen period is longer than the available data. As @RuiBarradas mentions in the comments, zoo::rollmean could work for an SMA. As you need to loop over quite a few data frames, a function is easier. The function below can be used with lapply or in a plain loop.
I created a helper function inside the main function that checks whether the chosen period is greater than the number of rows supplied. If so, it returns a vector of NAs; otherwise it returns the SMA. After that, the function loops over the periods and returns a data.frame with the supplied price column and all the SMA columns, named so you can see which SMA is in which column.
Note that there is no error handling in case of incorrect inputs. Sample data below.
# periods for the SMA
periods <- c(5, 10, 20, 60, 120, 180, 240)
get_smas <- function(price, n) {
  my_sma <- function(x, n = 10) {
    if (n < 1 || n > NROW(x)) {
      out <- rep(NA_real_, NROW(x))
    } else {
      # change SMA for EMA if you want the EMA's
      out <- TTR::SMA(x, n = n)
    }
    out
  }
  # combine the price column with the MAs; price is passed as Reduce's
  # init value, so it ends up as the first column
  price_combined <- Reduce(cbind, lapply(n, function(x) my_sma(price, n = x)), price)
  # turn matrix into data.frame
  price_combined <- data.frame(price_combined)
  # rename columns, assuming price column has a column name.
  # change paste0 value from SMA to EMA if EMA is used.
  names(price_combined) <- c(names(price_combined)[1], paste0("SMA_", n))
  price_combined
}
# supply a price and a vector of periods
my_prices <- get_smas(price, periods)
head(my_prices, 2)
Close SMA_5 SMA_10 SMA_20 SMA_60 SMA_120 SMA_180 SMA_240
1 182.01 NA NA NA NA NA NA NA
2 179.70 NA NA NA NA NA NA NA
tail(my_prices, 2)
Close SMA_5 SMA_10 SMA_20 SMA_60 SMA_120 SMA_180 SMA_240
142 156.79 154.156 152.053 147.475 145.4393 156.1770 NA NA
143 157.35 154.556 152.941 148.381 145.4292 156.0474 NA NA
data:
# close prices of aapl from 2022-01-03 to 2022-07-28
price <- structure(list(Close = c(182.009995, 179.699997, 174.919998,
172, 172.169998, 172.190002, 175.080002, 175.529999, 172.190002,
173.070007, 169.800003, 166.229996, 164.509995, 162.410004, 161.619995,
159.779999, 159.690002, 159.220001, 170.330002, 174.779999, 174.610001,
175.839996, 172.899994, 172.389999, 171.660004, 174.830002, 176.279999,
172.119995, 168.639999, 168.880005, 172.789993, 172.550003, 168.880005,
167.300003, 164.320007, 160.070007, 162.740005, 164.850006, 165.119995,
163.199997, 166.559998, 166.229996, 163.169998, 159.300003, 157.440002,
162.949997, 158.520004, 154.729996, 150.619995, 155.089996, 159.589996,
160.619995, 163.979996, 165.380005, 168.820007, 170.210007, 174.070007,
174.720001, 175.600006, 178.960007, 177.770004, 174.610001, 174.309998,
178.440002, 175.059998, 171.830002, 172.139999, 170.089996, 165.75,
167.660004, 170.399994, 165.289993, 165.070007, 167.399994, 167.229996,
166.419998, 161.789993, 162.880005, 156.800003, 156.570007, 163.639999,
157.649994, 157.960007, 159.479996, 166.020004, 156.770004, 157.279999,
152.059998, 154.509995, 146.5, 142.559998, 147.110001, 145.539993,
149.240005, 140.820007, 137.350006, 137.589996, 143.110001, 140.360001,
140.520004, 143.779999, 149.639999, 148.839996, 148.710007, 151.210007,
145.380005, 146.139999, 148.710007, 147.960007, 142.639999, 137.130005,
131.880005, 132.759995, 135.429993, 130.059998, 131.559998, 135.869995,
135.350006, 138.270004, 141.660004, 141.660004, 137.440002, 139.229996,
136.720001, 138.929993, 141.559998, 142.919998, 146.350006, 147.039993,
144.869995, 145.860001, 145.490005, 148.470001, 150.169998, 147.070007,
151, 153.039993, 155.350006, 154.089996, 152.949997, 151.600006,
156.789993, 157.350006)), class = "data.frame", row.names = c(NA,
-143L))
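The guard described in this answer (return NA when the window is longer than the series) can also be sketched without extra packages and applied to many series with lapply. This is not the TTR::SMA implementation: it swaps in stats::filter() as a trailing mean, and sma_or_na, add_smas and the toy series in price_list are hypothetical names and data for illustration.

```r
# Trailing simple moving average, or all NA when the window exceeds the data
sma_or_na <- function(x, n) {
  if (n < 1 || n > length(x)) {
    return(rep(NA_real_, length(x)))
  }
  # sides = 1 averages the current value and the n - 1 values before it
  as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))
}

# One SMA column per period, next to the original prices
add_smas <- function(x, periods) {
  smas <- sapply(setNames(periods, paste0("SMA_", periods)),
                 function(n) sma_or_na(x, n))
  data.frame(Close = x, smas)
}

# Two toy series of different lengths stand in for the 2000 data frames
price_list <- list(short = c(10, 11, 12, 13, 14, 15),
                   long  = as.numeric(1:30))
result <- lapply(price_list, add_smas, periods = c(5, 10, 20))
print(tail(result$short, 2))
```

Series that are too short for a given period simply get an all-NA column, so the loop over all series does not stop partway.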
rollmeanr and rollapplyr can handle the situation with fewer data items than width.
library(zoo)
price <- 1:6
rollmeanr(price, 10, fill = NA)
## [1] NA NA NA NA NA NA
w <- c(5, 10, 20, 60, 120, 180, 240)
sapply(setNames(w, w), rollmeanr, x = price, fill = NA)
## 5 10 20 60 120 180 240
## [1,] NA NA NA NA NA NA NA
## [2,] NA NA NA NA NA NA NA
## [3,] NA NA NA NA NA NA NA
## [4,] NA NA NA NA NA NA NA
## [5,] 3 NA NA NA NA NA NA
## [6,] 4 NA NA NA NA NA NA

Create column from data on dynamic number of columns depending on availability in R

Given an uncertain number of columns containing source values for the same variable, I would like to create a column that defines the final value to be selected depending on source importance and availability.
Reproducible data:
set.seed(123)
actuals = runif(10, 500, 1000)
get_rand_vector <- function(){return (runif(10, 0.95, 1.05))}
get_na_rand_ixs <- function(){return (round(runif(5,0,10),0))}
df = data.frame("source_1" = actuals*get_rand_vector(),
"source_2" = actuals*get_rand_vector(),
"source_n" = actuals*get_rand_vector())
df[["source_1"]][get_na_rand_ixs()] <- NA
df[["source_2"]][get_na_rand_ixs()] <- NA
df[["source_n"]][get_na_rand_ixs()] <- NA
My manual solution is as follows:
df$available <- ifelse(
  !is.na(df$source_1),
  df$source_1,
  ifelse(
    !is.na(df$source_2),
    df$source_2,
    df$source_n
  )
)
Given the desired result of:
source_1 source_2 source_n available
1 NA NA NA NA
2 NA NA 930.1242 930.1242
3 716.9981 NA 717.9234 716.9981
4 NA 988.0446 NA 988.0446
5 931.7081 NA 924.1101 931.7081
6 543.6802 533.6798 NA 543.6802
7 744.6525 767.4196 783.8004 744.6525
8 902.8788 955.1173 NA 902.8788
9 762.3690 NA 761.6135 762.3690
10 761.4092 702.6064 708.7615 761.4092
How could I automatically iterate over the available sources to set the data to be considered? The number of sources can vary (1, 2, 3, ..., 7), and priority follows the natural order (1 > 2 > ...).
Once you have all of the candidate vectors in order and in an appropriate data structure (e.g., data.frame or matrix), you can use apply to apply a function over the rows. In this case, we just look for the first non-NA value. Thus, after the first block of code above, you only need the following line:
df$available <- apply(df, 1, FUN = function(x) x[which(!is.na(x))[1]])
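A base R sketch of the same first-non-NA rule, working column by column instead of row by row. It is an alternative, not the method from the answer above, and it assumes the columns are already ordered by priority; first_non_na and the small demo data frame are made up for illustration.

```r
# Toy data: three sources in priority order, with gaps
demo <- data.frame(source_1 = c(NA, NA, 3, 4),
                   source_2 = c(NA, 20, NA, 40),
                   source_n = c(NA, 200, 300, NA))

# Keep the value from the higher-priority column; fall back to the next
# column only where the running result is still NA
first_non_na <- function(a, b) ifelse(is.na(a), b, a)

# Reduce() folds the rule over the columns from left to right
demo$available <- Reduce(first_non_na, demo)
print(demo)
```

Because Reduce() walks over however many columns the data frame has, the number of sources does not need to be known in advance.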
coalesce() from dplyr is designed for this:
library(dplyr)
df %>%
mutate(available = coalesce(!!!.))
source_1 source_2 source_n available
1 NA NA NA NA
2 NA NA 930.1242 930.1242
3 716.9981 NA 717.9234 716.9981
4 NA 988.0446 NA 988.0446
5 931.7081 NA 924.1101 931.7081
6 543.6802 533.6798 NA 543.6802
7 744.6525 767.4196 783.8004 744.6525
8 902.8788 955.1173 NA 902.8788
9 762.3690 NA 761.6135 762.3690
10 761.4092 702.6064 708.7615 761.4092

For loop incorrect number of dimensions in R

I am attempting to create a matrix of response probabilities by looping through the elements of a vector (theta) and the rows of a separate data frame (tmp). I keep receiving the error message "incorrect number of subscripts on matrix" and am not sure what I am doing wrong. Any help would be appreciated!
theta = seq(from=-4, to=4, by=.01)
ID = c(1:10)
a = c(1.11,1.03,1.03,1.62,1.23,1.16,1.46,0.91,0.78,0.85)
b = c(-0.33,0.05,-1.25,-0.18,0.47,-1.11,-0.17,-0.57,-0.18,0.45)
c = c(0.16,0.18,0.17,0.24,0.12,NA,NA,NA,0.29,NA)
tmp = data.frame(ID,a,b,c)
for (j in 1:nrow(tmp)) {
  for (k in 1:length(theta)){
    RP[k,j] = tmp$c[j] + ((1-tmp$c[j])/
      (1+exp(-1.7 * tmp$a[j]*theta - tmp$b[j])))
  }
}
The desired result is a matrix with as many rows as theta has elements and as many columns as the tmp data frame has rows. It should look like this:
head(tmp2)
p1 p2 p3 p4 p5 p6 p7 p8 p9 p10
1 0.1603182 0.1807822 0.1702159 0.2400104 0.1203281 NA NA NA 0.2929362 NA
2 0.1603243 0.1807960 0.1702197 0.2400107 0.1203350 NA NA NA 0.2929752 NA
3 0.1603305 0.1808100 0.1702236 0.2400110 0.1203421 NA NA NA 0.2930148 NA
4 0.1603368 0.1808243 0.1702276 0.2400113 0.1203493 NA NA NA 0.2930549 NA
5 0.1603432 0.1808389 0.1702316 0.2400116 0.1203567 NA NA NA 0.2930955 NA
6 0.1603497 0.1808537 0.1702357 0.2400120 0.1203642 NA NA NA 0.2931366 NA
In the last line of the for loop, you are using the whole of vector theta as a multiplier:
(1+exp(-1.7 * tmp$a[j]*theta - tmp$b[j])))
Presumably you meant to use the kth element:
(1+exp(-1.7 * tmp$a[j]*theta[k] - tmp$b[j])))
I can't test this as you've left out the definition of the matrix RP, but I'm sure you didn't mean to return 801 elements every pass through the loop.
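Since the question omits the definition of RP, here is a minimal sketch that assumes RP is meant to be a numeric matrix with one row per theta value and one column per item, preallocated before the loop. The preallocation and the p1..p10 column names are assumptions based on the desired output shown in the question. The reported error is what R gives when a plain vector is indexed with two subscripts, so a one-dimensional RP is a plausible cause.

```r
theta <- seq(from = -4, to = 4, by = 0.01)
tmp <- data.frame(
  ID = 1:10,
  a = c(1.11, 1.03, 1.03, 1.62, 1.23, 1.16, 1.46, 0.91, 0.78, 0.85),
  b = c(-0.33, 0.05, -1.25, -0.18, 0.47, -1.11, -0.17, -0.57, -0.18, 0.45),
  c = c(0.16, 0.18, 0.17, 0.24, 0.12, NA, NA, NA, 0.29, NA)
)

# Assumed shape of RP: rows follow theta, columns follow the rows of tmp
RP <- matrix(NA_real_, nrow = length(theta), ncol = nrow(tmp),
             dimnames = list(NULL, paste0("p", tmp$ID)))

for (j in seq_len(nrow(tmp))) {
  for (k in seq_along(theta)) {
    # theta[k] (a single value) keeps the right-hand side length one
    RP[k, j] <- tmp$c[j] + (1 - tmp$c[j]) /
      (1 + exp(-1.7 * tmp$a[j] * theta[k] - tmp$b[j]))
  }
}
print(head(RP, 3))
```

Items with a missing c value produce NA columns, which matches the NA columns in the desired output.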

How to remove extra rows and columns with NA values when importing from csv file using R

I have just started learning R. I'm trying to read data from a .csv file, but R keeps adding extra rows and columns with NA values. Does anyone know why this might be happening? Any advice on removing these NAs would be greatly appreciated. I have used the following code:
no_col <- max(count.fields("6%AA_comp.csv", sep=","))
mydata <- read.csv(file="6%AA_comp.csv", fill=TRUE, header=TRUE, col.names = 1:no_col-1)
mydata
X0 X1 X2 X3 X4
1 206428 152160 122080 111940 NA
2 183620 148300 118820 107260 NA
3 169100 164480 151420 146200 NA
4 179000 135920 107340 93540 NA
5 213820 146640 113040 109140 NA
6 150920 141400 133600 132000 NA
7 185645 154000 124510 128900 NA
8 176102 139100 141000 110300 NA
9 159045 154350 121050 153500 NA
10 198610 161000 119000 105600 NA
11 183100 138900 141500 129550 NA
12 211050 142550 136700 113500 NA
13 167000 150100 120000 102540 NA
14 NA NA NA NA NA
15 NA NA NA NA NA
16 NA NA NA NA NA
Well, data cleansing is always half the job or more. What you can do is to read the file as it is and then clean it by indexing only the rows and columns you are interested in, in your case this would be:
mydata <- read.csv(file="6%AA_comp.csv", fill=TRUE, header=TRUE)
mydata <- mydata[1:13, 1:5]
This typically happens when you delete some rows in your CSV file and then import it again.
If it's a one-off, the easiest solution is to open the CSV in Excel and delete all the rows below the last data row.
Addressing the comment below, we can do something like this
NA.Count = function(x)
{
  return(sum(is.na(x)))
}
Row.NA.Count = apply(MAT,1,NA.Count)
Idx = Row.NA.Count == ncol(MAT)
MAT = MAT[!Idx,]
where MAT is the imported matrix.
The above code will take care of all the empty rows. You can do a similar thing for the columns.
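The same idea, dropping rows and columns that are entirely NA, can be sketched more compactly with rowSums() and colSums(). The small data frame below is made up to mimic the imported data; it is not read from the original CSV file, and the names keep_rows, keep_cols and cleaned are illustrative.

```r
# Toy stand-in for the imported data: two empty rows and one empty column
mydata <- data.frame(X0 = c(206428, 183620, NA, NA),
                     X1 = c(152160, 148300, NA, NA),
                     X2 = c(NA_real_, NA, NA, NA))

# A row or column is kept if it has at least one non-NA value
keep_rows <- rowSums(!is.na(mydata)) > 0
keep_cols <- colSums(!is.na(mydata)) > 0

# drop = FALSE keeps the result a data frame even if one column remains
cleaned <- mydata[keep_rows, keep_cols, drop = FALSE]
print(cleaned)
```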
Hope this helps.

subsetting error in R

I have a large dataframe called dualbeta which contains 2 rows and 6080 columns. Here is a sample:
row.names A.Close AA.Close AADR.Close AAIT.Close AAL.Close
1 upside 1.253929 0.9869027 0.6169613 0.6353903 0.1782124
2 downside 1.027412 1.1936236 0.5915299 0.5697878 0.1702382
I am trying to extract only those columns with an upside >= 1.00 and a downside <= 1.00. I used combinations <- subset(dualbeta, upside>=1.00 & downside<=1.00) but I get the following:
row.names A.Close AA.Close AADR.Close AAIT.Close
1 NA NA NA NA NA
2 NA.1 NA NA NA NA
3 NA.2 NA NA NA NA
4 NA.3 NA NA NA NA
5 NA.4 NA NA NA NA
...
It should just return a 2 by x table, where x is the number of combinations found. I do not know why I am getting a bunch of rows. Additionally, I thought I had NA values in dualbeta, so I used na.omit(dualbeta)->dualbeta, but it deleted everything and turned dualbeta into a 0 by 6080 data frame. I also used which(is.na(dualbeta)), which returned 3307 and 3308, but when I checked those columns, they did not contain NAs.
You might work on the transpose of the data in order to select rows with the proper characteristics (which are columns in the transpose):
# Fix up the data, use proper row names
rownames(x) <- x$row.names
# Remove old row name column
x <- x[-1]
# transpose and subset
subset(data.frame(t(x)), upside > 1 & downside < 1)
This expression returns a zero-length result with your example data. Changing the parameters shows what is returned:
subset(data.frame(t(x)), upside > .6 & downside < .6)
## upside downside
## AADR.Close 0.6169613 0.5915299
## AAIT.Close 0.6353903 0.5697878
You can extract the data with simple indexing.
Let's say this is your data
dualbeta<-data.frame(matrix(runif(24,0,2),
nrow=2,
dimnames=list(c("upside","downside"), letters[1:12])))
then you can extract with
dualbeta[, dualbeta[1,]>=1.00 & dualbeta[2,]<=1.00]
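Applied to the five sample columns from the question, the indexing approach can be sketched as below. The thresholds of 0.6 are borrowed from the transpose example only so that some columns match; unlist() and drop = FALSE are additions to keep the comparison a plain vector and the result a data frame, and the names up, down, selected and strict are illustrative.

```r
# Sample values from the question, with upside/downside as row names
dualbeta <- data.frame(
  A.Close    = c(1.253929, 1.027412),
  AA.Close   = c(0.9869027, 1.1936236),
  AADR.Close = c(0.6169613, 0.5915299),
  AAIT.Close = c(0.6353903, 0.5697878),
  AAL.Close  = c(0.1782124, 0.1702382),
  row.names  = c("upside", "downside")
)

# Turn each row into a named numeric vector, one entry per column
up   <- unlist(dualbeta["upside", ])
down <- unlist(dualbeta["downside", ])

# Relaxed thresholds so the sample data yields a non-empty result
selected <- dualbeta[, up >= 0.6 & down <= 0.6, drop = FALSE]
print(selected)

# The thresholds from the question match none of the five sample columns
strict <- dualbeta[, up >= 1 & down <= 1, drop = FALSE]
print(ncol(strict))
```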
