Continuous multiplication same column previous value - r

I have a problem.
I have the following data frame.
1         2
NA        100
1.00499   NA
1.00813   NA
0.99203   NA
There are two columns. In the second column, apart from the starting value, there are only NAs. I want to fill the first NA of the 2nd column by multiplying the 1st value of column 2 by the 2nd value of column 1 (100 * 1.00499). The 3rd value of column 2 should be the product of the newly created 2nd value in column 2 and the 3rd value in column 1, and so on, so that in the end all NAs are replaced by values.
These two sources have helped me understand how to refer to different rows. But in both cases a new column is created. I don't want that. I want to fill the already existing column 2.
Use a value from the previous row in an R data.table calculation
https://statisticsglobe.com/use-previous-row-of-data-table-in-r
Can anyone help me?
Thanks so much in advance.
Sample code
library(quantmod)
data.N225<-getSymbols("^N225",from="1965-01-01", to="2022-03-30", auto.assign=FALSE, src='yahoo')
data.N225[c(1:3, nrow(data.N225)),]
data.N225<- na.omit(data.N225)
N225 <- data.N225[,6]
N225$DiskreteRendite= Delt(N225$N225.Adjusted)
N225[c(1:3,nrow(N225)),]
options(digits=5)
N225.diskret <- N225[,3]
N225.diskret[c(1:3,nrow(N225.diskret)),]
N225$diskretplus1 <- N225$DiskreteRendite+1
N225[c(1:3,nrow(N225)),]
library(dplyr)
N225$normiert <-"Value"
N225$normiert[1,] <-100
N225[c(1:3,nrow(N225)),]
N225.new <- N225[,4:5]
N225.new[c(1:3,nrow(N225.new)),]
Here is the code to create the data frame in RStudio.
a <- c(NA, 1.0050,1.0081, 1.0095, 1.0016,0.9947)
b <- c(100, NA, NA, NA, NA, NA)
c<- data.frame(ONE = a, TWO=b)

You could use cumprod to compute the cumulative product
transform(
df,
TWO = cumprod(c(na.omit(TWO),na.omit(ONE)))
)
which yields
ONE TWO
1 NA 100.0000
2 1.0050 100.5000
3 1.0081 101.3140
4 1.0095 102.2765
5 1.0016 102.4402
6 0.9947 101.8972
data
> dput(df)
structure(list(ONE = c(NA, 1.005, 1.0081, 1.0095, 1.0016, 0.9947
), TWO = c(100, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-6L))
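Since the question asks to fill the existing column rather than create a new one, here is a minimal base R sketch of the same cumprod idea that assigns straight back into TWO. It assumes, as in the sample data, that only the first value of TWO is known and that ONE has no missing values after its first row.

```r
# Sample data from the question
df <- data.frame(
  ONE = c(NA, 1.0050, 1.0081, 1.0095, 1.0016, 0.9947),
  TWO = c(100, NA, NA, NA, NA, NA)
)

# Start from the known first value of TWO, then multiply by the
# remaining factors in ONE; the result overwrites the existing column.
df$TWO <- cumprod(c(df$TWO[1], df$ONE[-1]))

print(df)
```

No extra column is created; df still has exactly the columns ONE and TWO.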

What about (gasp) a for loop?
(I'll use dat instead of c for your dataframe to avoid confusion with function c()).
for (row in 2:nrow(dat)) {
  if (!is.na(dat$TWO[row-1])) {
    dat$TWO[row] <- dat$ONE[row] * dat$TWO[row-1]
  }
}
This means:
For each row from the second to the end, if the TWO in the previous row is not a missing value, calculate the TWO in this row by multiplying ONE in the current row and TWO from the previous row.
Output:
#> ONE TWO
#> 1 NA 100.0000
#> 2 1.0050 100.5000
#> 3 1.0081 101.3140
#> 4 1.0095 102.2765
#> 5 1.0016 102.4402
#> 6 0.9947 101.8972
Created on 2022-04-28 by the reprex package (v2.0.1)
I'd love to read a dplyr solution!
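This is not a dplyr solution, but the same "previous result times current factor" recursion can also be sketched in base R with Reduce and accumulate = TRUE. The name dat follows the loop answer; the anonymous function and its argument names are only illustrative. Unlike the loop, this sketch does not skip rows whose predecessor is NA, so it assumes the only missing value in ONE is in the first row.

```r
dat <- data.frame(
  ONE = c(NA, 1.0050, 1.0081, 1.0095, 1.0016, 0.9947),
  TWO = c(100, NA, NA, NA, NA, NA)
)

# Reduce() carries the running value forward: each step multiplies the
# previous result by the next factor. `init` is the known start value,
# and accumulate = TRUE keeps every intermediate result.
dat$TWO <- Reduce(
  function(previous, factor) previous * factor,
  dat$ONE[-1],
  accumulate = TRUE,
  init = dat$TWO[1]
)

print(dat)
```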

Related

Sort dataframe by row index value without changing values

I have a dataframe that I have to sort in decreasing order of absolute row value without changing the actual values (some of which are negative).
To give you an example, e.g. for the 1st row, I would like to go from
-0.01189179 0.03687456 -0.12202753 to
-0.12202753 0.03687456 -0.01189179.
For the 2nd row from
-0.04220260 0.04129326 -0.07178175 to
-0.07178175 -0.04220260 0.04129326 etc.
How can I do this in R?
Many thanks!
Try this
lst <- lapply(df , \(x) order(-abs(x)))
ans <- data.frame(Map(\(x,y) x[y] , df ,lst))
output
a b
1 -0.12202753 -0.07178175
2 0.03687456 -0.04220260
3 -0.01189179 0.04129326
data
df <- structure(list(a = c(-0.12202753, 0.03687456, -0.01189179), b = c(-0.0422026,
0.04129326, -0.07178175)), row.names = c(NA, -3L), class = "data.frame")
Here is a simple approach (using @Mohamed Desouky's data)
df <- df[nrow(df):1,]
> df
a b
3 -0.01189179 -0.07178175
2 0.03687456 0.04129326
1 -0.12202753 -0.04220260
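Both answers above reorder values within columns, whereas the question describes reordering within each row. A minimal sketch of the row-wise variant, assuming the two example rows from the question are stored as rows of a data frame (the column names V1 to V3 are made up for illustration):

```r
# The two example rows from the question, one row per observation
rows <- data.frame(
  V1 = c(-0.01189179, -0.04220260),
  V2 = c( 0.03687456,  0.04129326),
  V3 = c(-0.12202753, -0.07178175)
)

# For every row, reorder the values by decreasing absolute value.
# apply() returns one column per input row, hence the transpose.
sorted <- as.data.frame(
  t(apply(rows, 1, function(r) unname(r[order(-abs(r))])))
)
names(sorted) <- names(rows)

print(sorted)
```

After sorting, the column names no longer identify the original variables; they only mark the position within the row.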

(R) Filter rows based on string names if the only resulting match in another column is NA

The title may sound kinda weird but I have found no way of better defining my issue.
Here an example data set:
test = data.frame(genus = c("Acicarpha", "Acicarpha", "Acicarpha", "Acicarpha", "Acisanthera", "Acisanthera", "Acisanthera", "Acisanthera", "Acmella", "Acmella"), sp1 = c("NA", "bonariensis", "bonariensis", "spathulata", NA, "variabilis", "variabilis", "variabilis", NA, NA))
As you can see, I have a few species names (genus+sp1) possible: Acicarpha NA, Acicarpha bonariensis, Acicarpha spathulata, Acisanthera variabilis, Acisanthera NA, and Acmella NA.
Here's the deal: I'm trying to select only the row related to Acmella NA since the only returning value on the sp1 column is NA. Other species also have NA, but they do not have only NA. How can I do this? I'm bashing my head.
Here's some code that does what I think you're asking for. It has four steps:
1. Group the rows by genus.
2. Make a new column called all_sp1_na that is TRUE if all of each genus's sp1 observations are NA, FALSE otherwise (i.e. FALSE if at least one sp1 observation is not NA for that genus).
3. Filter for rows where all_sp1_na is TRUE.
4. Remove the temporary column all_sp1_na.
library(tidyverse)
test %>%
group_by(genus) %>%
mutate(all_sp1_na = all(is.na(sp1))) %>%
filter(all_sp1_na) %>%
select(-all_sp1_na)
And it gives this result:
# A tibble: 2 x 2
# Groups: genus [1]
genus sp1
<chr> <chr>
1 Acmella NA
2 Acmella NA
Let me know if you're looking for something else.
We may use subset from base R
subset(test, !genus %in% genus[!is.na(sp1)])
genus sp1
9 Acmella <NA>
10 Acmella <NA>
Or with filter from dplyr
library(dplyr)
test %>%
filter(!genus %in% genus[!is.na(sp1)])
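The grouped "all NA" check can also be sketched in base R with ave, which applies a function per group and returns a vector as long as the input. The variable names all_na and only_na are illustrative. Note that the first sp1 entry in the question's data is the string "NA", not a missing value, so is.na() is FALSE for it; this does not change the result here.

```r
test <- data.frame(
  genus = c("Acicarpha", "Acicarpha", "Acicarpha", "Acicarpha",
            "Acisanthera", "Acisanthera", "Acisanthera", "Acisanthera",
            "Acmella", "Acmella"),
  sp1 = c("NA", "bonariensis", "bonariensis", "spathulata", NA,
          "variabilis", "variabilis", "variabilis", NA, NA)
)

# For each genus, TRUE only if every sp1 value in that genus is missing
all_na <- ave(is.na(test$sp1), test$genus, FUN = all)

# Keep the rows of genera that have nothing but NA in sp1
only_na <- test[all_na, ]
print(only_na)
```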

Replacing the entries in a data frame in R

I have a data frame of dimension 100 by 54, where the rows are stock values at the end of the week, and each column represents a stock. I want to replace each entry in my data frame with the return value of the stock, so divide the current value of the cell by the previous one, and replace the current value by the new value. Example: Say I have this data frame with these values,
table 1
I want to manipulate my data frame to be:
table 2
So that it can eventually look like this:
table 3
I have written this as my code, but it does not do that job. I was wondering if someone can help me.
Returns99 <- NULL
for(i in 2:100){
Returns99 <- rbind(Returns99, rep(NA, 54))
Returns <- rbind(Returns99, (df100[i, ]/df100[i-1,]))
}
Where df100 is the data frame with price entries.
You don't need a loop. With Base R,
rbind(NA, df100[-1,] / df100[-nrow(df100),])
gives,
AGG DBC DFE
1 NA NA NA
2 1.0000000 1.0000000 1.0000000
3 1.0021019 0.9739496 0.9990862
4 0.9993008 1.0008628 0.9911585
Data:
structure(list(AGG = c(99.91, 99.91, 100.12, 100.05), DBC = c(23.8,
23.8, 23.18, 23.2), DFE = c(65.66, 65.66, 65.6, 65.02)), class = "data.frame", row.names = c(NA,
-4L))
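If a loop is still preferred, the version in the question appears to fail because Returns is overwritten on every iteration, so only the last ratio survives. A minimal sketch of a working loop, using the small sample data above in place of the real 100 by 54 data frame, preallocates the result and fills it row by row; it should agree with the vectorised one-liner.

```r
df100 <- data.frame(
  AGG = c(99.91, 99.91, 100.12, 100.05),
  DBC = c(23.8, 23.8, 23.18, 23.2),
  DFE = c(65.66, 65.66, 65.6, 65.02)
)

# Preallocate: same shape as the prices, first row has no previous value
Returns <- df100
Returns[1, ] <- NA

# Fill each row with current price / previous price
for (i in 2:nrow(df100)) {
  Returns[i, ] <- df100[i, ] / df100[i - 1, ]
}

# Vectorised equivalent from the answer, for comparison
vectorised <- rbind(NA, df100[-1, ] / df100[-nrow(df100), ])

print(Returns)
```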

to find count of distinct values across two columns in r

I have two columns. Both are of character data type.
One column has plain strings and the other has strings wrapped in quotes.
I want to compare both columns and find the number of distinct names across the data frame.
string f.string.name
john NA
bravo NA
NA "john"
NA "hulk"
Here the count should be 2, as john is common.
Somehow I am not able to remove the quotes from the second column. Not sure why.
Thanks
The main problem I'm seeing is the NA values.
First, let's get rid of the quotes you mention.
dat$f.string.name <- gsub('["]', '', dat$f.string.name)
Now, count the number of distinct values.
i1 <- complete.cases(dat$string)
i2 <- complete.cases(dat$f.string.name)
sum(dat$string[i1] %in% dat$f.string.name[i2]) + sum(dat$f.string.name[i2] %in% dat$string[i1])
DATA
dat <-
structure(list(string = c("john", "bravo", NA, NA), f.string.name = c(NA,
NA, "\"john\"", "\"hulk\"")), .Names = c("string", "f.string.name"
), class = "data.frame", row.names = c(NA, -4L))
library(stringr)
table(str_replace_all(unlist(dat), '["]', ''))
# bravo hulk john
# 1 1 2
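The expected count of 2 in the question is ambiguous, so this base R sketch computes two candidate readings after stripping the quotes: the names that occur in both columns, and the distinct names over both columns together. The variable names clean, left, right, common and all_distinct are illustrative.

```r
dat <- data.frame(
  string = c("john", "bravo", NA, NA),
  f.string.name = c(NA, NA, "\"john\"", "\"hulk\""),
  stringsAsFactors = FALSE
)

# Strip the literal double quotes; NA entries stay NA
clean <- gsub('"', "", dat$f.string.name, fixed = TRUE)

# Drop missing values before comparing
left  <- dat$string[!is.na(dat$string)]
right <- clean[!is.na(clean)]

common       <- intersect(left, right)  # names present in both columns
all_distinct <- union(left, right)      # distinct names across both columns

print(common)
print(length(all_distinct))
```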

Data munging in R: Subsetting and arranging vectors of uneven length

I am sorry I could not make a more specific title. I am trying to wean myself off of spreadsheets for the more difficult tasks and this one is giving me particular trouble - I can do it in Excel but I don't really know how to begin in R. It is somewhat hard to describe. I imagine a mix of techniques could be involved here so I hope this is of use to others.
I have data that comes in the following form from a spreadsheet:
Data:
1 GOEK, WOWP, PEOL, WJRN, KENC, QPOE, JFPG, PWKR, PWEOR, JFOKE, POQK, LSPF, PEKF,PFOW, VCNS, ALAO, LFOD
2 KFDL, LFOD, WOWP, PWEO, PWEOR, PRCP, ALPQ, JFOKE, ALLF, VCNS CNIR,
3 KJTJ, FKOF, VCNS, FLEP
4 FKKF, EPTR
5 QPOE, PEOL, WJRN, VCNS, PEKF, PFPW
And this data is associated with the following key:
Key:
Items A B C
ALAO NA 0.12246503 0.137902549
ALLF 0.016262491 0.557522799 0.622560763
ALPQ 0.409770566 0.770904525 NA
CNIR NA 0.38075281 0.698236443
EPTR 0.718354484 0.290028597 0.525661861
FKKF 0.801489091 0.878405308 0.645004844
FKOF 0.643251028 0.131643544 NA
FLEP 0.018262707 0.211220859 0.457302727
GOEK 0.902121539 NA NA
JFOKE 0.808410498 0.301443669 0.575188395
JFPG NA NA 0.343824191
KENC 0.882285296 0.372821865 0.593742731
KFDL 0.077569421 0.076497291 NA
KJTJ 0.249613609 0.227241864 NA
LFOD NA 0.000343115 0.329546051
LSPF 0.088451014 0.65148309 0.267490643
PEKF 0.645309773 NA 0.116601451
PEOL 0.626916187 0.093812247 0.152577881
PFOW 0.86690534 0.596673645 NA
PFPW NA 0.018869604 NA
POQK 0.683221579 NA 0.472456955
PRCP 0.486488748 0.860947689 0.097916066
PWEO 0.665854791 0.814111848 0.026085774
PWEOR 0.611034332 0.17254104 0.212386401
PWKR NA NA 0.357298987
QPOE 0.815885005 0.083834541 NA
VCNS 0.394817612 0.250760686 0.419539549
WJRN 0.403002388 0.705142265 0.768961818
WOWP 0.794250738 NA 0.967405211
Here is the general approach:
Each row shown in data comes from one cell of a spreadsheet so it would be interpreted by R as one string if imported directly. Split the string for each row into a form that can be stored as a vector in R.
Filter the data into three categories (A, B, or C) depending on the values in the key row each item is associated with. For example, for the 5th row of data, we have the values: QPOE, PEOL, WJRN, VCNS, PEKF, PFPW. Looking at the key, we can turn this into three subcategories based on what is contained in A, B, or C. An item belongs to a category only if its value in that column of the key is not NA:
A QPOE PEOL WJRN VCNS PEKF
B QPOE PEOL WJRN VCNS PFPW
C PEOL WJRN VCNS PEKF
Now that we have divided up row 5 of our data into its respective categories, we can make a separate table for this row that includes the associated value:
A 0.815885005 0.626916187 0.403002388 0.394817612 0.645309773
B 0.083834541 0.093812247 0.705142265 0.250760686 0.018869604
C 0.152577881 0.768961818 0.419539549 0.116601451
So we have a kind of hash table... sort of. Now I want to store these values in one table. It would essentially look something like this in the final form (shown for row 5 of data only):
Cat A Item A Value B Item B Value C Item C Value
5 QPOE 0.815885005 QPOE 0.083834541 PEOL 0.152577881
5 PEOL 0.626916187 PEOL 0.093812247 WJRN 0.768961818
5 WJRN 0.403002388 WJRN 0.705142265 VCNS 0.419539549
5 VCNS 0.394817612 VCNS 0.250760686 PEKF 0.116601451
5 PEKF 0.645309773 PFPW 0.018869604 NA NA
In reality, I have 400 rows of "Cat" in data not just 5.
Is this the best way to store the data for easy reference? Would a nested list be preferred like so?
Cat Row 1
A Items
Values
B Items
Values
C Items
Values
Cat Row 2...
I am just hesitant to make data frames for this data because there is so much variability in the length of the rows in my original data when divided into A, B, and C. The shortest ones would have to have NA's to fill up to the length of the longest ones to fit in the data frame. Something about this just makes me uncomfortable.
I can always look up the functions used in an answer and figure it out, so an in-depth explanation is not necessary unless you are feeling particularly generous! Thank you for your time.
I think that this is what I'd do, although it returns the answer in a slightly different form than you've asked for - my approach is to avoid ragged arrays (ones with different column lengths).
Start with your data:
d <- c("GOEK, WOWP, PEOL, WJRN, KENC, QPOE, JFPG, PWKR, PWEOR, JFOKE, POQK, LSPF, PEKF,PFOW, VCNS, ALAO, LFOD",
"KFDL, LFOD, WOWP, PWEO, PWEOR, PRCP, ALPQ, JFOKE, ALLF, VCNS CNIR",
"KJTJ, FKOF, VCNS, FLEP", "FKKF, EPTR", "QPOE, PEOL, WJRN, VCNS, PEKF, PFPW" )
key <- structure(list(Items = c("ALAO", "ALLF", "ALPQ", "CNIR", "EPTR",
"FKKF", "FKOF", "FLEP", "GOEK", "JFOKE", "JFPG", "KENC", "KFDL",
"KJTJ", "LFOD", "LSPF", "PEKF", "PEOL", "PFOW", "PFPW", "POQK",
"PRCP", "PWEO", "PWEOR", "PWKR", "QPOE", "VCNS", "WJRN", "WOWP"
), A = c(NA, 0.016262491, 0.409770566, NA, 0.718354484, 0.801489091,
0.643251028, 0.018262707, 0.902121539, 0.808410498, NA, 0.882285296,
0.077569421, 0.249613609, NA, 0.088451014, 0.645309773, 0.626916187,
0.86690534, NA, 0.683221579, 0.486488748, 0.665854791, 0.611034332,
NA, 0.815885005, 0.394817612, 0.403002388, 0.794250738), B = c(0.12246503,
0.557522799, 0.770904525, 0.38075281, 0.290028597, 0.878405308,
0.131643544, 0.211220859, NA, 0.301443669, NA, 0.372821865, 0.076497291,
0.227241864, 0.000343115, 0.65148309, NA, 0.093812247, 0.596673645,
0.018869604, NA, 0.860947689, 0.814111848, 0.17254104, NA, 0.083834541,
0.250760686, 0.705142265, NA), C = c(0.137902549, 0.622560763,
NA, 0.698236443, 0.525661861, 0.645004844, NA, 0.457302727, NA,
0.575188395, 0.343824191, 0.593742731, NA, NA, 0.329546051, 0.267490643,
0.116601451, 0.152577881, NA, NA, 0.472456955, 0.097916066, 0.026085774,
0.212386401, 0.357298987, NA, 0.419539549, 0.768961818, 0.967405211
)), .Names = c("Items", "A", "B", "C"), class = "data.frame", row.names = c(NA, -29L))
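# Split each string into a vector of items, as you suggest
d <- strsplit(d, ",")
# Remove all spaces from the items (leading spaces after each comma).
# Note: "VCNS CNIR" in row 2 has no comma in the source data, so it becomes "VCNSCNIR" and matches no key.
d <- lapply(d, gsub, pattern=" ", replacement="")
# Convert key to a long data.frame with no NAs
library(reshape2)
key <- melt(key, id.vars = "Items")
names(key)[2] <- "letter" # You might have a better name for this
key <- key[complete.cases(key),]
# Extract the matching subset of the key for each row of data
lapply(d, function(x) key[key$Items %in% x,])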
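If the nested-list layout from the question is preferred over the long data frame, here is a minimal base R sketch for row 5 only. It uses a hypothetical six-row excerpt of the key (key5, with values copied from the key table) and builds one named vector of values per category, dropping items whose value is NA. The names key5, row5 and by_cat are made up for illustration.

```r
# Excerpt of the key covering only the items in row 5 of the data
key5 <- data.frame(
  Items = c("QPOE", "PEOL", "WJRN", "VCNS", "PEKF", "PFPW"),
  A = c(0.815885005, 0.626916187, 0.403002388, 0.394817612, 0.645309773, NA),
  B = c(0.083834541, 0.093812247, 0.705142265, 0.250760686, NA, 0.018869604),
  C = c(NA, 0.152577881, 0.768961818, 0.419539549, 0.116601451, NA)
)

# Split the cell contents into items and trim surrounding whitespace
row5 <- trimws(strsplit("QPOE, PEOL, WJRN, VCNS, PEKF, PFPW", ",")[[1]])

# For each category, look up the items and keep those with a non-NA value
by_cat <- lapply(c(A = "A", B = "B", C = "C"), function(col) {
  hit <- key5[match(row5, key5$Items), c("Items", col)]
  hit <- hit[!is.na(hit[[col]]), ]
  setNames(hit[[col]], hit$Items)
})

print(by_cat)
```

Each list element can have a different length, so no NA padding is needed; repeating this over all rows of data would give a list of such lists.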

Resources