Add character to specific value in rows by condition - r

Let's say we have the following data:
df1 = data.frame(cm = c('10129', '21120', '123456', '345'),
                 num = c(6, 6, 6, 4))
> df1
cm num
1 10129 6
2 21120 6
3 123456 6
4 345 4
As you can see, some entries in the cm column have 6 digits and others fewer. I want to code the following: if the number of digits in the cm column is less than the number in the num column, pad the value with leading zeros up to that width, to get this output:
cm num
1 010129 6
2 021120 6
3 123456 6
4 0345 4

You can use str_pad from stringr. Its width argument is vectorized, so each row is padded to its own num:
library(tidyverse)
df1 %>% mutate(cm = str_pad(cm, num, "left", "0"))
#> cm num
#> 1 010129 6
#> 2 021120 6
#> 3 123456 6
#> 4 0345 4
Created on 2022-04-13 by the reprex package (v2.0.1)
Input Data
df1 <- data.frame(cm = c('10129', '21120', '123456','345'), num = c(6,6,6,4))
df1
#> cm num
#> 1 10129 6
#> 2 21120 6
#> 3 123456 6
#> 4 345 4

Perhaps simplest using dplyr and nchar:
library(dplyr)
df1 %>% mutate(cm = if_else(nchar(cm) < num, paste0(0, cm), cm))
cm num
1 010129 6
2 021120 6
3 123456 6
4 0345 4

The other tidyverse/dplyr answers are nicer, but if you want to stick to base R for some reason:
df1$cm <- ifelse(nchar(df1$cm) < df1$num, paste0('0', df1$cm), df1$cm)
df1
#> cm num
#> 1 010129 6
#> 2 021120 6
#> 3 123456 6
#> 4 0345 4
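Note that paste0('0', cm) prepends exactly one zero, so the two if_else/ifelse answers above only work because every short value here is exactly one digit short of its target width. If values could be several digits short, a base R sketch that pads to the full width (assuming cm holds digit-only strings small enough for as.integer) is sprintf with the * width specifier:
df1$cm <- sprintf("%0*d", as.integer(df1$num), as.integer(df1$cm))
With the data above this gives the same result as str_pad.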

how to group rows by first digit of id column in r

I have an R data frame with a numeric id as the first column. I would like to group the data frame rows according to the first digit of id:
id = c(33211, 5966, 4478, 5589, 10003, 633, 4411, 99874, 3641) ...
Maybe you want this: you can use substr to extract the first digit of your id, like so:
df <- data.frame(id=c(33211,5966,4478,5589,10003,633,4411,99874,3641))
df$new_id <- with(df, as.numeric(substr(id, 1, 1)))
df
#> id new_id
#> 1 33211 3
#> 2 5966 5
#> 3 4478 4
#> 4 5589 5
#> 5 10003 1
#> 6 633 6
#> 7 4411 4
#> 8 99874 9
#> 9 3641 3
Created on 2022-07-27 by the reprex package (v2.0.1)
Similar to @Quinten's answer, but also arranging the rows by the extracted digit:
library(dplyr)
df <- data.frame(id = c(33211, 5966, 4478, 5589, 10003, 633, 4411, 99874, 3641))
df %>%
  mutate(new_id = substr(id, 1, 1)) %>%
  arrange(new_id)
id new_id
1 10003 1
2 33211 3
3 3641 3
4 4478 4
5 4411 4
6 5966 5
7 5589 5
8 633 6
9 99874 9
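If the goal is to actually operate on those groups rather than just label them, the extracted digit plugs straight into group_by. A minimal sketch (the summary columns are illustrative, not from the original answers):
library(dplyr)
df %>%
  group_by(first_digit = substr(id, 1, 1)) %>%   # group rows by leading digit
  summarise(n_ids = n(), ids = toString(id))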

Using accumulate function with second to last value as .init argument

I have recently come across an interesting question: calculating a vector's values using its second-to-last value (in place of the usual .init argument) together with another vector's current value. Here is the sample data set:
library(dplyr)
set.seed(13)
dt <- data.frame(id = rep(letters[1:2], each = 5), time = rep(1:5, 2), ret = rnorm(10)/100)
dt$ind <- if_else(dt$time == 1, 120, if_else(dt$time == 2, 125, as.numeric(NA)))
id time ret ind
1 a 1 0.005543269 120
2 a 2 -0.002802719 125
3 a 3 0.017751634 NA
4 a 4 0.001873201 NA
5 a 5 0.011425261 NA
6 b 1 0.004155261 120
7 b 2 0.012295066 125
8 b 3 0.002366797 NA
9 b 4 -0.003653828 NA
10 b 5 0.011051443 NA
What I would like to calculate is:
ind_{t} = ind_{t-2}*(1+ret_{t})
I tried the following code. Since .init is of no direct use here, I tried to neutralize the original .init and create a virtual one, but unfortunately it won't drag the newly created values (from the third row downward) into the calculation:
dt %>%
  group_by(id) %>%
  mutate(ind = c(120, accumulate(3:n(), .init = 125,
                                 ~ .x * 1/.x * ind[.y - 2] * (1 + ret[.y]))))
# A tibble: 10 x 4
# Groups: id [2]
id time ret ind
<chr> <int> <dbl> <dbl>
1 a 1 0.00554 120
2 a 2 -0.00280 125
3 a 3 0.0178 122.
4 a 4 0.00187 125.
5 a 5 0.0114 NA
6 b 1 0.00416 120
7 b 2 0.0123 125
8 b 3 0.00237 120.
9 b 4 -0.00365 125.
10 b 5 0.0111 NA
I was wondering if there was a tweak I could make to this code and make it work completely.
I would greatly appreciate your help.
Use a state vector consisting of the current value of ind and the prior value of ind. That way the prior state contains the second prior value of ind. We encode that into complex values with the real part equal to ind and the imaginary part equal to the prior value of ind. At the end we take the real part.
library(dplyr)
library(purrr)
library(dplyr)
library(purrr)

dt %>%
  group_by(id) %>%
  mutate(result = c(ind[1],
                    Re(accumulate(.x = tail(ret, -2),
                                  .f = ~ Im(.x) * (1 + .y) + Re(.x) * 1i,
                                  .init = ind[2] + ind[1] * 1i)))) %>%
  ungroup
giving:
# A tibble: 10 x 5
id time ret ind result
<chr> <int> <dbl> <dbl> <dbl>
1 a 1 0.00554 120 120
2 a 2 -0.00280 125 125
3 a 3 0.0178 NA 122.
4 a 4 0.00187 NA 125.
5 a 5 0.0114 NA 124.
6 b 1 0.00416 120 120
7 b 2 0.0123 125 125
8 b 3 0.00237 NA 120.
9 b 4 -0.00365 NA 125.
10 b 5 0.0111 NA 122.
Variation
This variation eliminates the complex numbers and instead uses a vector of 2 elements in place of each complex number: the first element corresponds to the real part in the prior solution, and the second to the imaginary part. This could be extended to cases where we need more than 2 numbers per state, or where the dependence involves all of the last N values. For the question here, though, it has the downside of an extra line of code to extract the result from the list of pairs, which is more involved than using Re in the prior solution.
dt %>%
  group_by(id) %>%
  mutate(result = c(ind[1],
                    accumulate(.x = tail(ret, -2),
                               .f = ~ c(.x[2] * (1 + .y), .x[1]),
                               .init = ind[2:1])),
         result = map_dbl(result, first)) %>%
  ungroup
Check
We check that the results above are correct. Alternatively, this could be used as a straightforward solution on its own.
calc <- function(ind, ret) {
  for (i in seq(3, length(ret))) ind[i] <- ind[i - 2] * (1 + ret[i])
  ind
}

dt %>%
  group_by(id) %>%
  mutate(result = calc(ind, ret)) %>%
  ungroup
giving:
# A tibble: 10 x 5
id time ret ind result
<chr> <int> <dbl> <dbl> <dbl>
1 a 1 0.00554 120 120
2 a 2 -0.00280 125 125
3 a 3 0.0178 NA 122.
4 a 4 0.00187 NA 125.
5 a 5 0.0114 NA 124.
6 b 1 0.00416 120 120
7 b 2 0.0123 125 125
8 b 3 0.00237 NA 120.
9 b 4 -0.00365 NA 125.
10 b 5 0.0111 NA 122.
I would have done it by creating dummy groups for each sequence (row_number() %% N splits each id into N interleaved streams, so a lag of 1 within a stream equals a lag of N in the original data), which makes it work for any number N. Demonstrating it on new, more elaborate data:
df <- data.frame(
  stringsAsFactors = FALSE,
  grp = c("a", "a", "a", "a", "a", "a", "a", "a", "a",
          "b", "b", "b", "b", "b", "b", "b", "b", "b"),
  rate = c(0.082322056, 0.098491104, 0.07294593, 0.08741672, 0.030179747,
           0.061389031, 0.011232314, 0.08553277, 0.091272669,
           0.031577847, 0.024039791, 0.091719552, 0.032540636,
           0.020411727, 0.094521716, 0.081729178, 0.066429708, 0.04985793),
  ind = c(11000L, 12000L, 13000L, NA, NA, NA, NA, NA, NA,
          10000L, 13000L, 12000L, NA, NA, NA, NA, NA, NA)
)
df
#> grp rate ind
#> 1 a 0.08232206 11000
#> 2 a 0.09849110 12000
#> 3 a 0.07294593 13000
#> 4 a 0.08741672 NA
#> 5 a 0.03017975 NA
#> 6 a 0.06138903 NA
#> 7 a 0.01123231 NA
#> 8 a 0.08553277 NA
#> 9 a 0.09127267 NA
#> 10 b 0.03157785 10000
#> 11 b 0.02403979 13000
#> 12 b 0.09171955 12000
#> 13 b 0.03254064 NA
#> 14 b 0.02041173 NA
#> 15 b 0.09452172 NA
#> 16 b 0.08172918 NA
#> 17 b 0.06642971 NA
#> 18 b 0.04985793 NA
library(tidyverse)
N = 3
df %>%
  group_by(grp) %>%
  group_by(d = row_number() %% N, .add = TRUE) %>%
  mutate(ind = accumulate(rate[-1] + 1, .init = ind[1], ~ .x * .y))
#> # A tibble: 18 x 4
#> # Groups: grp, d [6]
#> grp rate ind d
#> <chr> <dbl> <dbl> <dbl>
#> 1 a 0.0823 11000 1
#> 2 a 0.0985 12000 2
#> 3 a 0.0729 13000 0
#> 4 a 0.0874 11962. 1
#> 5 a 0.0302 12362. 2
#> 6 a 0.0614 13798. 0
#> 7 a 0.0112 12096. 1
#> 8 a 0.0855 13420. 2
#> 9 a 0.0913 15057. 0
#> 10 b 0.0316 10000 1
#> 11 b 0.0240 13000 2
#> 12 b 0.0917 12000 0
#> 13 b 0.0325 10325. 1
#> 14 b 0.0204 13265. 2
#> 15 b 0.0945 13134. 0
#> 16 b 0.0817 11169. 1
#> 17 b 0.0664 14147. 2
#> 18 b 0.0499 13789. 0
An alternate answer in dplyr (using your own data, only slightly rescaled). Here duplicated(id) is FALSE on the first row of each subgroup, which zeroes out ret there so the starting ind passes through unchanged:
set.seed(13)
dt <- data.frame(id = rep(letters[1:2], each = 5), time = rep(1:5, 2), ret = rnorm(10)/100)
dt$ind <- ifelse(dt$time == 1, 12000, ifelse(dt$time == 2, 12500, as.numeric(NA)))
library(dplyr, warn.conflicts = F)
dt %>%
  group_by(id) %>%
  group_by(d = row_number() %% 2, .add = TRUE) %>%
  mutate(ind = cumprod(1 + duplicated(id) * ret) * ind[1])
#> # A tibble: 10 x 5
#> # Groups: id, d [4]
#> id time ret ind d
#> <chr> <int> <dbl> <dbl> <dbl>
#> 1 a 1 0.00554 12000 1
#> 2 a 2 -0.00280 12500 0
#> 3 a 3 0.0178 12213. 1
#> 4 a 4 0.00187 12523. 0
#> 5 a 5 0.0114 12353. 1
#> 6 b 1 0.00416 12000 0
#> 7 b 2 0.0123 12500 1
#> 8 b 3 0.00237 12028. 0
#> 9 b 4 -0.00365 12454. 1
#> 10 b 5 0.0111 12161. 0
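To double-check that the modulo trick reproduces the recurrence ind_{t} = ind_{t-2}*(1+ret_{t}), you can reuse the calc() helper from the Check section above on the same rescaled data; the numbers should match the pipeline output (a quick sketch):
library(dplyr)
dt %>%
  group_by(id) %>%
  mutate(check = calc(ind, ret)) %>%  # calc() as defined in the Check section
  ungroup()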

row-wise iteration in a dataframe where each row depends on previous row calculation in R

In a data frame, I'd like to use the previous row's calculated result to compute the current row's result, together with other values from the current row. I also need to apply some conditions, and it has to be done per dimension product_id. The key point is that the target column is used to calculate itself. I reproduced a sample in Excel, and it looks like this:
product_id <- c(rep(1,each=9), rep(2,each=8))
dates <- c("24/09/20","25/09/20","26/09/20","27/09/20","28/09/20","29/09/20", "30/09/20","01/10/20","02/10/20","08/10/20","09/10/20","10/10/20","11/10/20","12/10/20","13/10/20","14/10/20","15/10/20")
date <- as.Date(dates, "%d/%m/%y")
num_day <- c(1:9, 1:8)
production <- c(rep(4,each=9), rep(3.5,each=8))
demand <- c(0,0,3,1,3,20,0,1,3,0,1,2,5,0,15,1,3)
df <- data.frame(product_id, date, num_day, production, demand)
Target column to be created is stock. The df is sorted by product_id and then by date, so, the order of the rows is meaningful.
Conditions (both can be applied with one statement, but I split them to make it clear):
Condition 1: if (stock previous day + production current day - demand current day <= 0) then 0, else stock previous day + production current day - demand current day.
Condition 2: if num_day = 1, stock = production current day - demand current day, and it cannot be negative either, when production current day - demand current day < 0.
In other words, stock_t = max(stock_{t-1} + production_t - demand_t, 0), with stock_0 = 0.
In Excel it's a pretty straightforward formula, but when dealing with a large amount of data (more than 1 million rows) it's not feasible there. I'm trying to build a function in R, but maybe that's not the best approach. Is there a way to do it in R?
I tried using an auxiliary column with a cumulative sum and shifted columns, but it does not work. I think it's more complex than that.
This can easily be done using purrr's accumulate:
library(dplyr)
library(purrr)

df %>%
  group_by(product_id) %>%
  mutate(stock = accumulate(production - demand, ~ max(.x + .y, 0))) %>%
  ungroup()
Result:
# A tibble: 17 x 6
product_id date num_day production demand stock
<dbl> <date> <int> <dbl> <dbl> <dbl>
1 1 2020-09-24 1 4 0 4
2 1 2020-09-25 2 4 0 8
3 1 2020-09-26 3 4 3 9
4 1 2020-09-27 4 4 1 12
5 1 2020-09-28 5 4 3 13
6 1 2020-09-29 6 4 20 0
7 1 2020-09-30 7 4 0 4
8 1 2020-10-01 8 4 1 7
9 1 2020-10-02 9 4 3 8
10 2 2020-10-08 1 3.5 0 3.5
11 2 2020-10-09 2 3.5 1 6
12 2 2020-10-10 3 3.5 2 7.5
13 2 2020-10-11 4 3.5 5 6
14 2 2020-10-12 5 3.5 0 9.5
15 2 2020-10-13 6 3.5 15 0
16 2 2020-10-14 7 3.5 1 2.5
17 2 2020-10-15 8 3.5 3 3
The result matches yours and @rjen's, so I am relatively sure this is correct.
Explanation: with accumulate, a simple cumulative sum could be implemented as accumulate(production - demand, ~ .x + .y) (or even shorter as accumulate(production - demand, `+`)). Using the max function here ensures the running result never drops below 0, which is what you intended.
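To see the difference on a toy vector (a quick sketch; note that accumulate returns its first input unchanged, so if day-one demand could ever exceed production you would want accumulate(..., .init = 0)[-1] to floor the first day as well):
library(purrr)
x <- c(4, 4, -16, 4)
accumulate(x, `+`)                # 4  8 -8 -4  (plain cumulative sum)
accumulate(x, ~ max(.x + .y, 0))  # 4  8  0  4  (floored at zero)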
Until you find a more elegant solution, you can do the following.
library(dplyr)
df %>%
  group_by(product_id) %>%
  mutate(group = if_else(cumsum(production - demand) < 0 |
                           num_day == min(num_day), 1, 0)) %>%
  ungroup() %>%
  mutate(group = if_else(group != 0, row_number(), as.integer(0)),
         group = cumsum(group)) %>%
  group_by(group) %>%
  mutate(modDiff = if_else(num_day == min(num_day), 0, production - demand)) %>%
  ungroup() %>%
  group_by(product_id) %>%
  mutate(modDiff = if_else(num_day == min(num_day), production - demand, modDiff),
         modDiff = if_else(num_day == min(num_day) & modDiff < 0, 0, modDiff)) %>%
  group_by(group) %>%
  mutate(stock = cumsum(modDiff)) %>%
  ungroup() %>%
  select(-modDiff, -group)
# # A tibble: 17 x 6
# product_id date num_day production demand stock
# <dbl> <date> <int> <dbl> <dbl> <dbl>
# 1 1 2020-09-24 1 4 0 4
# 2 1 2020-09-25 2 4 0 8
# 3 1 2020-09-26 3 4 3 9
# 4 1 2020-09-27 4 4 1 12
# 5 1 2020-09-28 5 4 3 13
# 6 1 2020-09-29 6 4 20 0
# 7 1 2020-09-30 7 4 0 4
# 8 1 2020-10-01 8 4 1 7
# 9 1 2020-10-02 9 4 3 8
# 10 2 2020-10-08 1 3.5 0 3.5
# 11 2 2020-10-09 2 3.5 1 6
# 12 2 2020-10-10 3 3.5 2 7.5
# 13 2 2020-10-11 4 3.5 5 6
# 14 2 2020-10-12 5 3.5 0 9.5
# 15 2 2020-10-13 6 3.5 15 0
# 16 2 2020-10-14 7 3.5 1 2.5
# 17 2 2020-10-15 8 3.5 3 3
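For completeness, the same floor-at-zero recurrence can be written in base R with Reduce(..., accumulate = TRUE) inside ave (a sketch; like the purrr answer, it returns the first element unfloored):
df$stock <- ave(df$production - df$demand, df$product_id,
                FUN = function(z) Reduce(function(acc, y) max(acc + y, 0), z, accumulate = TRUE))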

Mutate a part of my variables to a unique column [duplicate]

This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 5 years ago.
I am coding in R.
I have a table like:
region;2012;2013;2014;2015
1;2465;245;2158;645
2;44;57;687;564
3;545;784;897;512
...
And I want to transform it into:
region;value;annee
1;2465;2012
1;245;2013
1;2158;2014
1;645;2015
2;44;2012
...
Do you know how I can do it?
First, read the data:
dat <- read.csv2(text = "region;2012;2013;2014;2015
1;2465;245;2158;645
2;44;57;687;564
3;545;784;897;512",
check.names = FALSE)
The data frame can be converted into long format with gather from the tidyr package.
library(tidyr)
dat_long <- gather(dat, key = "annee", value = "value", -region)
The result:
region annee value
1 1 2012 2465
2 2 2012 44
3 3 2012 545
4 1 2013 245
5 2 2013 57
6 3 2013 784
7 1 2014 2158
8 2 2014 687
9 3 2014 897
10 1 2015 645
11 2 2015 564
12 3 2015 512
You can also produce the ;-separated result of your question:
write.csv2(dat_long, "", row.names = FALSE, quote = FALSE)
This results in:
region;annee;value
1;2012;2465
2;2012;44
3;2012;545
1;2013;245
2;2013;57
3;2013;784
1;2014;2158
2;2014;687
3;2014;897
1;2015;645
2;2015;564
3;2015;512
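Note that gather() has since been superseded in tidyr; with tidyr 1.0 or later, the same reshape can be written with pivot_longer() (a sketch, using the same dat as above):
library(tidyr)
dat_long <- pivot_longer(dat, cols = -region, names_to = "annee", values_to = "value")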
Another example to answer the question:
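So the example is runnable, here is the wide input, reconstructed from the printed output below (the original answer did not include its construction):
olddata_wide <- data.frame(subject = 1:4,
                           sex = c("M", "F", "F", "M"),
                           control = c(7.9, 6.3, 9.5, 11.5),
                           cond1 = c(12.3, 10.6, 13.1, 13.4),
                           cond2 = c(10.7, 11.1, 13.8, 12.9))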
olddata_wide
#> subject sex control cond1 cond2
#> 1 1 M 7.9 12.3 10.7
#> 2 2 F 6.3 10.6 11.1
#> 3 3 F 9.5 13.1 13.8
#> 4 4 M 11.5 13.4 12.9
library(tidyr)
# The arguments to gather():
# - data: Data object
# - key: Name of new key column (made from names of data columns)
# - value: Name of new value column
# - ...: Names of source columns that contain values
# - factor_key: Treat the new key column as a factor (instead of character vector)
data_long <- gather(olddata_wide, condition, measurement, control:cond2, factor_key=TRUE)
data_long
#> subject sex condition measurement
#> 1 1 M control 7.9
#> 2 2 F control 6.3
#> 3 3 F control 9.5
#> 4 4 M control 11.5
#> 5 1 M cond1 12.3
#> 6 2 F cond1 10.6
#> 7 3 F cond1 13.1
#> 8 4 M cond1 13.4
#> 9 1 M cond2 10.7
#> 10 2 F cond2 11.1
#> 11 3 F cond2 13.8
#> 12 4 M cond2 12.9

How to mimick ROW_NUMBER() OVER(...) in R

To manipulate/summarize data over time, I usually use SQL ROW_NUMBER() OVER(PARTITION by ...). I'm new to R, so I'm trying to recreate tables I otherwise would create in SQL. The package sqldf does not allow OVER clauses. Example table:
ID Day Person Cost
1 1 A 50
2 1 B 25
3 2 A 30
4 3 B 75
5 4 A 35
6 4 B 100
7 6 B 65
8 7 A 20
I want my final table to include, for each person, the average of their previous 2 instances, from each day after their 2nd instance onward (day 4 for both):
ID Day Person Cost Prev2
5 4 A 35 40
6 4 B 100 50
7 6 B 65 90
8 7 A 20 35
I've been trying to play around with aggregate, but I'm not really sure how to partition or qualify the function. Ideally, I'd prefer not to rely on the fact that ID increases with the date (i.e. the original table could be rearranged in random date order and the code would still work). Let me know if you need more details; thanks for your help!
You could lag zoo::rollapplyr with a width of 2. In dplyr,
library(dplyr)
df %>% arrange(Day) %>% # sort
group_by(Person) %>% # set grouping
mutate(Prev2 = lag(zoo::rollapplyr(Cost, width = 2, FUN = mean, fill = NA)))
#> Source: local data frame [8 x 5]
#> Groups: Person [2]
#>
#> ID Day Person Cost Prev2
#> <int> <int> <fctr> <int> <dbl>
#> 1 1 1 A 50 NA
#> 2 2 1 B 25 NA
#> 3 3 2 A 30 NA
#> 4 4 3 B 75 NA
#> 5 5 4 A 35 40.0
#> 6 6 4 B 100 50.0
#> 7 7 6 B 65 87.5
#> 8 8 7 A 20 32.5
or all in dplyr,
df %>% arrange(Day) %>% group_by(Person) %>% mutate(Prev2 = (lag(Cost) + lag(Cost, 2)) / 2)
which returns the same thing. In base R,
df <- df[order(df$Day), ]
df$Prev2 <- ave(df$Cost, df$Person, FUN = function(x) {
  c(NA, zoo::rollapplyr(x, width = 2, FUN = mean, fill = NA)[-length(x)])
})
df
#> ID Day Person Cost Prev2
#> 1 1 1 A 50 NA
#> 2 2 1 B 25 NA
#> 3 3 2 A 30 NA
#> 4 4 3 B 75 NA
#> 5 5 4 A 35 40.0
#> 6 6 4 B 100 50.0
#> 7 7 6 B 65 87.5
#> 8 8 7 A 20 32.5
or without zoo,
df$Prev2 <- ave(df$Cost, df$Person, FUN = function(x) {
  (c(NA, x[-length(x)]) + c(NA, NA, x[-(length(x) - 1):-length(x)])) / 2
})
which does the same thing. If you want to remove the NA rows, tack on tidyr::drop_na(Prev2) or na.omit.
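As an aside on the title question: the direct dplyr analogue of ROW_NUMBER() OVER (PARTITION BY Person ORDER BY Day) is row_number() inside a grouped mutate (a sketch; the column name rn is illustrative):
library(dplyr)
df %>%
  arrange(Day) %>%
  group_by(Person) %>%
  mutate(rn = row_number()) %>%
  ungroup()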
