Extract value from previous row based on a condition - r

I have a dataset that looks as follows:
data <- tribble(
~Date, ~Ticker, ~Close, ~Open,
"1989-09-11","COND",77.3292,77.3292,
"1989-09-12","COND",77.4435,77.4435,
"1989-09-13","COND",76.3118,76.3118,
"1989-09-14","COND",75.5309,75.6344,
"1989-09-15","COND",75.6598,75.4675)
# A tibble: 5 x 4
Date Ticker Close Open
<chr> <chr> <dbl> <dbl>
1 1989-09-11 COND 77.3 77.3
2 1989-09-12 COND 77.4 77.4
3 1989-09-13 COND 76.3 76.3
4 1989-09-14 COND 75.5 75.6
5 1989-09-15 COND 75.7 75.5
The issue is that, until a certain date, the closing price is identical to the opening price. I'm trying to write a function that checks whether the opening and closing prices are the same and, if so, replaces the opening price with the closing price from the previous row. Applied to the data above, it would transform it as follows:
# A tibble: 5 x 4
Date Ticker Close Open
<chr> <chr> <dbl> <dbl>
1 1989-09-11 COND 77.3 NA
2 1989-09-12 COND 77.4 77.3
3 1989-09-13 COND 76.3 77.4
4 1989-09-14 COND 75.5 75.6
5 1989-09-15 COND 75.7 75.5
I tried to do it with an if statement, but I'm running into problems as soon as I try to get the value from the previous row in the "Close" column to the current "Open" value.

In dplyr, it's a simple mutate with lag.
library(dplyr)
data %>%
mutate(Open = if_else(Open == Close, lag(Close), Open))
## A tibble: 5 x 4
# Date Ticker Close Open
# <chr> <chr> <dbl> <dbl>
#1 1989-09-11 COND 77.3 NA
#2 1989-09-12 COND 77.4 77.3
#3 1989-09-13 COND 76.3 77.4
#4 1989-09-14 COND 75.5 75.6
#5 1989-09-15 COND 75.7 75.5
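For comparison, here is a minimal base R sketch of the same idea that needs no packages. It assumes the rows are already sorted by date, and it shifts Close within each Ticker so that a previous close is never taken from a different ticker (the question shows only one ticker, so that grouping is an assumption about the full data). The name prev_close is just an illustrative helper.

```r
# Same data as in the question, built as a plain data.frame
data <- data.frame(
  Date   = c("1989-09-11", "1989-09-12", "1989-09-13", "1989-09-14", "1989-09-15"),
  Ticker = "COND",
  Close  = c(77.3292, 77.4435, 76.3118, 75.5309, 75.6598),
  Open   = c(77.3292, 77.4435, 76.3118, 75.6344, 75.4675),
  stringsAsFactors = FALSE
)

# Previous row's Close within each Ticker (NA for the first row of a ticker)
prev_close <- ave(data$Close, data$Ticker,
                  FUN = function(x) c(NA, head(x, -1)))

# Rows where Open equals Close; which() drops any NA comparisons
same <- which(data$Open == data$Close)
data$Open[same] <- prev_close[same]
data
```

With dplyr, the equivalent of the per-ticker shift would be to add group_by(Ticker) before the mutate.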

Related

Create a temporary group in dplyr group_by

I would like to group all members of the same genus together for some summary statistics, but would like to keep their full names in the original dataframe. I know that I could change their names or create a new column in the original dataframe, but I am looking for a more elegant solution. I would like to implement this in R with the dplyr package.
Example data here https://knb.ecoinformatics.org/knb/d1/mn/v2/object/urn%3Auuid%3Aeb176981-1909-4d6d-ac07-3406e4efc43f
I would like to group all clams of the genus Macoma as one group, "Macoma sp.", ideally creating this grouping within the following pipeline, perhaps before the group_by(site_code, species_scientific):
summary <- data %>%
group_by(site_code, species_scientific) %>%
summarize(mean_size = mean(width_mm))
Note that there are multiple Macoma xxx species and multiple other species that I want to group as is.
We may modify species_scientific by replacing the elements that contain the substring 'Macoma' (str_detect) with 'Macoma', use that as the grouping column, and get the mean
library(dplyr)
library(stringr)
data %>%
mutate(species_scientific = replace(species_scientific,
str_detect(species_scientific, "Macoma"), "Macoma")) %>%
group_by(site_code, species_scientific) %>%
summarise(mean_size = mean(width_mm, na.rm = TRUE), .groups = 'drop')
-output
# A tibble: 97 × 3
site_code species_scientific mean_size
<chr> <chr> <dbl>
1 H_01_a Clinocardium nuttallii 33.9
2 H_01_a Macoma 41.0
3 H_01_a Protothaca staminea 37.3
4 H_01_a Saxidomus gigantea 56.0
5 H_01_a Tresus nuttallii 100.
6 H_02_a Clinocardium nuttallii 35.1
7 H_02_a Macoma 41.3
8 H_02_a Protothaca staminea 38.0
9 H_02_a Saxidomus gigantea 54.7
10 H_02_a Tresus nuttallii 50.5
# … with 87 more rows
If the intention is to keep only the first word in 'species_scientific' (the genus), group on that instead
data %>%
group_by(genus = str_remove(species_scientific, "\\s+.*"), site_code) %>%
summarise(mean_size = mean(width_mm, na.rm = TRUE), .groups = 'drop')
-output
# A tibble: 97 × 3
genus site_code mean_size
<chr> <chr> <dbl>
1 Clinocardium H_01_a 33.9
2 Clinocardium H_02_a 35.1
3 Clinocardium H_03_a 37.5
4 Clinocardium H_04_a 48.2
5 Clinocardium H_05_a 37.6
6 Clinocardium H_06_a 38.7
7 Clinocardium H_07_a 40.2
8 Clinocardium L_01_a 44.4
9 Clinocardium L_02_a 54.8
10 Clinocardium L_03_a 61.1
# … with 87 more rows
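As a dependency-free illustration of a grouping that leaves the original column untouched, here is a base R sketch. The clams data frame and its values are made up for the example (the real data lives at the URL in the question), and the species_group name is hypothetical; only the column names site_code, species_scientific and width_mm come from the question.

```r
# Toy stand-in for the real data (values are invented for illustration)
clams <- data.frame(
  site_code = c("H_01_a", "H_01_a", "H_01_a", "H_01_a", "H_02_a", "H_02_a"),
  species_scientific = c("Macoma nasuta", "Macoma inquinata",
                         "Saxidomus gigantea", "Saxidomus gigantea",
                         "Macoma nasuta", "Tresus nuttallii"),
  width_mm = c(40, 42, 55, 57, 41, 50),
  stringsAsFactors = FALSE
)

# Temporary grouping vector: collapse every Macoma species, keep the rest as is
species_group <- ifelse(grepl("^Macoma", clams$species_scientific),
                        "Macoma sp.", clams$species_scientific)

# Mean width per site and group; clams itself is not modified
res <- aggregate(clams$width_mm,
                 by = list(site_code = clams$site_code,
                           species_group = species_group),
                 FUN = mean)
names(res)[3] <- "mean_size"
res
```

Because the grouping lives in a separate vector, the full species names in clams stay available for later steps.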

Is there any function that gives the changes between columns?

I have a df that looks like this.
head(dfhigh)
rownames 2015Y 2016Y 2017Y 2018Y 2019Y 2020Y 2021Y
1 Australia 29583.7403 48397.383 45220.323 68461.941 39218.044 20140.351 29773.188
2 Austria* 1294.5092 -8400.973 14926.164 5511.625 2912.795 -14962.963 5855.014
3 Belgium* -24013.3111 68177.596 -3057.153 27119.084 -9208.553 13881.481 22955.298
4 Canada 43852.7732 36061.859 22764.156 37653.521 50141.784 23174.006 59693.992
5 Chile* 20507.8407 12249.294 6128.716 7735.778 12499.238 8385.907 15251.538
6 Czech Republic 465.2137 9814.496 9517.948 11010.423 10108.914 9410.576 5805.084
I want to calculate the changes between years, so instead of the values, the table has the percentage of change (obviously deleting 2015Y).
Try this using (current - previous)/ previous *100
lst <- list()
nm <- names(dfhigh)[-1]
for(i in 1:(length(nm) - 1)){
lst[[i]] <- (dfhigh[[nm[i+1]]] - dfhigh[[nm[i]]]) / dfhigh[[nm[i]]] * 100
}
ans <- do.call(cbind , lst)
colnames(ans) <- paste("ch_of" , nm[-1])
ans
You can change the formula to compute the percentage however you prefer.
You could also use a tidyverse solution.
library(tidyverse)
dfhigh %>%
pivot_longer(!rownames) %>%
group_by(rownames) %>%
mutate(value = 100*value/lag(value)-100) %>%
ungroup() %>%
pivot_wider(names_from = name, values_from = value)
# # A tibble: 6 × 8
# rownames `2015Y` `2016Y` `2017Y` `2018Y` `2019Y` `2020Y` `2021Y`
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Australia NA 63.6 -6.56 51.4 -42.7 -48.6 47.8
# 2 Austria* NA -749. -278. -63.1 -47.2 -614. -139.
# 3 Belgium* NA -384. -104. -987. -134. -251. 65.4
# 4 Canada NA -17.8 -36.9 65.4 33.2 -53.8 158.
# 5 Chile* NA -40.3 -50.0 26.2 61.6 -32.9 81.9
# 6 CzechRepublic NA 2010. -3.02 15.7 -8.19 -6.91 -38.3
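The loop can also be written without iteration by subtracting two shifted copies of the value matrix. The sketch below is one way to do that in base R; it uses only the Australia and Canada rows and the first three year columns from the question to stay short, and the names vals, prev and pct are illustrative.

```r
# Two rows and three year columns copied from the question
dfhigh <- data.frame(
  rownames = c("Australia", "Canada"),
  `2015Y` = c(29583.7403, 43852.7732),
  `2016Y` = c(48397.383, 36061.859),
  `2017Y` = c(45220.323, 22764.156),
  check.names = FALSE,   # keep column names that start with a digit
  stringsAsFactors = FALSE
)

vals <- as.matrix(dfhigh[-1])                  # numeric part only
curr <- vals[, -1, drop = FALSE]               # 2016Y onwards
prev <- vals[, -ncol(vals), drop = FALSE]      # 2015Y up to the second-to-last year

# (current - previous) / previous * 100, column names follow `curr`
pct <- (curr - prev) / prev * 100

result <- data.frame(rownames = dfhigh$rownames, pct, check.names = FALSE)
result
```

The first year drops out naturally because it has no previous column to compare against.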

Conditional statements of rows and columns

I am trying to write a rather complex conditional statement in R and need help with it. I have a massive dataset (~1,450,000 different values).
I am trying to shave down the dataset in R with a conditional statement: if multiple rows have the same value in column "date" AND the same value in column "PageName", THEN give the averages of columns "sst", "lat", and "long".
The code I have Frankensteined together so far is:
Combines_averages <- if(Combine_12$date == Combine_12$PageName{aggregate(Combine_12$sst, Combine_12$`location-lat`, Combine_12$`location-long`)}
Data:
Example_Data
# A tibble: 3 x 7
animal_id `location-lat` `location-long` date sst month PageName
<chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
1 Alpha-626 30.5 -79.5 3/14/2020 22.2 3 ABCD
2 Bravo-522 30.5 -79.5 3/14/2020 22.6 3 ABCD
3 Charlie-389 30.5 -79.5 3/13/2020 22.4 3 BCAD
4 Delta-720 30.5 -79.5 3/16/2020 22.8 3 CADB
5 Echo-550 30.5 -79.5 3/14/2020 22.2 3 ABCD
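One way to read this requirement is as a grouped average rather than an if statement: rows sharing the same date and PageName form a group, and each group is reduced to its means. The sketch below shows that reading in base R with aggregate, using the five example rows above; the result name averages is illustrative, and whether the real Combine_12 data has exactly these column names is an assumption.

```r
# The example rows from the question
Example_Data <- data.frame(
  animal_id = c("Alpha-626", "Bravo-522", "Charlie-389", "Delta-720", "Echo-550"),
  `location-lat` = 30.5,
  `location-long` = -79.5,
  date = c("3/14/2020", "3/14/2020", "3/13/2020", "3/16/2020", "3/14/2020"),
  sst = c(22.2, 22.6, 22.4, 22.8, 22.2),
  month = 3,
  PageName = c("ABCD", "ABCD", "BCAD", "CADB", "ABCD"),
  check.names = FALSE,   # keep the hyphenated column names
  stringsAsFactors = FALSE
)

# One row per (date, PageName) combination, with the mean of each measure
averages <- aggregate(
  Example_Data[c("sst", "location-lat", "location-long")],
  by = Example_Data[c("date", "PageName")],
  FUN = mean
)
averages
```

Groups with a single row simply keep their own values, so no separate condition for "multiple rows" is needed. The dplyr equivalent would be group_by(date, PageName) followed by summarise with mean for each column.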

R filter function is returning a dataset full of NA values when run

I have downloaded a dataset from https://www.kaggle.com/aungpyaeap/supermarket-sales
I am trying to filter the data on branch A to make a line graph for just branch A. When I run the code below, the output is a lot of NA values. I have checked for NA and NULL values in the dataset. Any help with how to filter on branch A correctly would be greatly appreciated.
head of the dataset -
`Invoice ID` Branch City `Customer type` Gender `Product line` `Unit price` Quantity `Tax 5%` Total Date Time Payment cogs `gross margin percentage` `gross income` Rating
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <date> <time> <chr> <dbl> <dbl> <dbl> <dbl>
1 750-67-8428 A Yangon Member Female Health and beauty 74.7 7 26.1 549. 2019-01-05 13:08 Ewallet 523. 4.76 26.1 9.1
2 226-31-3081 C Naypyitaw Normal Female Electronic accessories 15.3 5 3.82 80.2 2019-03-08 10:29 Cash 76.4 4.76 3.82 9.6
3 631-41-3108 A Yangon Normal Male Home and lifestyle 46.3 7 16.2 341. 2019-03-03 13:23 Credit card 324. 4.76 16.2 7.4
4 123-19-1176 A Yangon Member Male Health and beauty 58.2 8 23.3 489. 2019-01-27 20:33 Ewallet 466. 4.76 23.3 8.4
5 373-73-7910 A Yangon Normal Male Sports and travel 86.3 7 30.2 634. 2019-02-08 10:37 Ewallet 604. 4.76 30.2 5.3
6 699-14-3026 C Naypyitaw Normal Male Electronic accessories 85.4 7 29.9 628. 2019-03-25 18:30 Ewallet 598. 4.76 29.9 4.1
Code-
library(readr)
dataset <- read_csv("dataset.csv",
col_types = cols(Date = col_date(format = "%m/%d/%Y"),
Time = col_time(format = "%H:%M")))
View(dataset)
df<- dataset
sum(is.na(df))
sum(is.null(df))
df_filter_A <- filter(df, Branch == "A")
head(df_filter_A)
I uninstalled and reinstalled R, and now the code works as it should. I believe it had something to do with the libraries. I followed this post for uninstalling:
How to uninstall R and RStudio with all packages, settings and everything else?.
Thank you for all the help and looking at my code.
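The root cause here was never pinned down, so the following is only a cross-check, not a diagnosis. The code in the question attaches readr but not dplyr, and base R has its own stats::filter, so a bare filter() call depends on what is on the search path; writing dplyr::filter(df, Branch == "A") removes that ambiguity. The base R sketch below gives a package-free way to verify the subset. The small df and its NA branch are made up for illustration.

```r
# Toy data; the NA in Branch is a hypothetical value added for the demonstration
df <- data.frame(
  Branch = c("A", "C", NA, "A"),
  Total  = c(549, 80.2, 341, 489),
  stringsAsFactors = FALSE
)

# Logical indexing keeps a row of NAs wherever the comparison is NA
with_na_rows <- df[df$Branch == "A", ]

# which() keeps only rows where the comparison is TRUE
branch_a <- df[which(df$Branch == "A"), ]

# subset() also drops rows where the condition is NA
branch_a2 <- subset(df, Branch == "A")

nrow(with_na_rows)  # includes the all-NA row
nrow(branch_a)
```

If branch_a looks right while the filter() call does not, the difference points at the package environment rather than the data.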

A quick way to rename multiple columns with unique names using dplyr

I am a beginner R user, currently learning the tidyverse way. I imported a dataset which is a time series of monthly indexed consumer prices over a period of four years. The imported headings on the monthly CPI columns are displayed in R as five-digit numbers (as characters). Here is a short mockup of what it looks like:
df <- tibble(`Product` = c("Eggs", "Chicken"),
`44213` = c(35.77, 36.77),
`44244` = c(39.19, 39.80),
`44272` = c(40.12, 43.42),
`44303` = c(41.09, 41.33)
)
# A tibble: 2 x 5
# Product `44213` `44244` `44272` `44303`
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 Eggs 35.8 39.2 40.1 41.1
#2 Chicken 36.8 39.8 43.4 41.3
I want to change the column headings (44213 etc) to dates that make more sense to me (still as characters). I understand, using dplyr, to do it the following way:
df <- df %>% rename("Jan17" = `44213`, "Feb17" = `44244`,
"Mar17" = `44272`, "Apr17" = `44303`)
# A tibble: 2 x 5
# Product Jan17 Feb17 Mar17 Apr17
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 Eggs 35.8 39.2 40.1 41.1
#2 Chicken 36.8 39.8 43.4 41.3
The problem is that my actual dataset contains 48 such columns (months) to rename, so typing them all out is a lot of work. I looked at other replace and set_names functions, but these seem to apply the same repeated change to every column name rather than providing new unique names like the ones I am looking for.
(I realise dates as columns is not good practice and would need to shift these to rows before proceeding with any analysis... or maybe this must be a prior step to renaming?)
I trust I have expressed my question clearly. I would love to learn a quicker solution using dplyr or be directed to where one can be found. Thank you for your time.
We can use !!! with rename by passing a named vector
library(dplyr)
library(stringr)
df1 <- df %>%
rename(!!! setNames(names(df)[-1], str_c(month.abb[1:4], 17)))
-output
df1
# A tibble: 2 x 5
# Product Jan17 Feb17 Mar17 Apr17
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 Eggs 35.8 39.2 40.1 41.1
#2 Chicken 36.8 39.8 43.4 41.3
Or use rename_with
df %>%
rename_with(~str_c(month.abb[1:4], 17), -1)
If the column names should be converted to dates and then formatted
nm1 <- format(as.Date(as.numeric(names(df)[-1]), origin = '1896-01-01'), '%b%y')
df %>%
rename_with(~ nm1, -1)
# A tibble: 2 x 5
# Product Jan17 Feb17 Mar17 Apr17
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 Eggs 35.8 39.2 40.1 41.1
#2 Chicken 36.8 39.8 43.4 41.3
Or assign generic sequential names (paste0 takes no sep argument, so it is dropped here):
names(df)[2:ncol(df)] <- paste0('col_', 1:(ncol(df)-1))
## A tibble: 2 x 5
# Product col_1 col_2 col_3 col_4
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 Eggs 35.8 39.2 40.1 41.1
#2 Chicken 36.8 39.8 43.4 41.3
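For the full 48-column case, the labels can be generated instead of typed. The base R sketch below assumes the month columns are in chronological order, contiguous, and start in January 2017, as the mockup suggests; make_labels is a hypothetical helper written for this example, not a dplyr function.

```r
# Hypothetical helper: n consecutive month labels such as "Jan17", "Feb17", ...
# start_year is the two-digit year of the first column (assumed to be January)
make_labels <- function(n, start_year = 17) {
  months <- rep(month.abb, length.out = n)          # Jan..Dec, recycled
  years  <- start_year + (seq_len(n) - 1) %/% 12    # bump the year every 12 columns
  paste0(months, years)
}

# Mockup from the question as a plain data.frame
df <- data.frame(
  Product = c("Eggs", "Chicken"),
  `44213` = c(35.77, 36.77),
  `44244` = c(39.19, 39.80),
  `44272` = c(40.12, 43.42),
  `44303` = c(41.09, 41.33),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Rename every column except the first
names(df)[-1] <- make_labels(ncol(df) - 1)
names(df)

# The same helper covers four years of monthly columns
labels48 <- make_labels(48)
```

The same vector can be passed to rename_with, for example rename_with(df, ~ make_labels(length(.x)), -1), if staying inside a dplyr pipeline is preferred.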

Resources