How to add letter label to numeric values in a tibble? - r

i have a column in my tibble with values 10 10 10 20 20 20 20 30 30, I want to create a new column as follows: 10a 10b 10c 20a 20b 20c 20d 30a 30b.

A simpler and concise solution using only dplyr.
# import library
library(dplyr)
# data
df <- data.frame(id = c(10, 10, 10, 20, 20, 20, 20, 30, 30))
# solution
df %>%
group_by(id) %>% # grouping by id
mutate(
alpha_id = letters[1:length(id)], # create alpha id
new_id = paste0(id, alpha_id) # new_id = id + alpha_id
) %>%
ungroup() %>% # ungroup
select(-alpha_id) # dropping staging variable.
Output
# A tibble: 9 x 2
id new_id
<dbl> <chr>
1 10 10a
2 10 10b
3 10 10c
4 20 20a
5 20 20b
6 20 20c
7 20 20d
8 30 30a
9 30 30b

BaseR
df$x2 <- ave(df$x, df$x, FUN = function(.x) paste0(.x, letters[seq_len(length(.x))]))
x x2
1 10 10a
2 10 10b
3 10 10c
4 20 20a
5 20 20b
6 20 20c
7 20 20d
8 30 30a
9 30 30b
dplyr
df <- data.frame(x = c(10, 10, 10, 20, 20, 20, 20, 30, 30))
library(dplyr)
df %>% group_by(x) %>%
mutate(x2 = paste0(x, letters[row_number()]))
# A tibble: 9 x 2
# Groups: x [3]
x x2
<dbl> <chr>
1 10 10a
2 10 10b
3 10 10c
4 20 20a
5 20 20b
6 20 20c
7 20 20d
8 30 30a
9 30 30b

Here's an economical solution with dplyr and data.table:
First convert to dataframe:
df <- data.frame(x = c(10, 10, 10, 20, 20, 20, 20, 30, 30))
Now transform in dplyr using data.table's function rowid and the built-in constant letters (for lower-case letters) subsetted on rowid (credit to #Henrik):
library(dplyr)
library(data.table)
df %>%
mutate(x_new = paste0(x, letters[rowid(x)]))
# A tibble: 9 x 2
x x_new
<dbl> <chr>
1 10 10a
2 10 10b
3 10 10c
4 20 20a
5 20 20b
6 20 20c
7 20 20d
8 30 30a
9 30 30b

Related

Reformat Dataframe to start a new row at a certain column

I have a dataframe that looks like this:
ID x.2019 x.2020
1 10 20
2 20 30
3 30 40
4 40 50
5 50 60
and I would like to reformat it to look like this:
ID time x
1 2019 10
1 2020 20
2 2019 20
2 2020 30
3 2019 40
3 2020 50
4 2019 60
4 2020 70
5 2019 70
5 2020 80
Any idea how to achieve this?
This is a rather simple task which you can probably find in other answers. Though, you can achieve what you want with data.table as follows:
library(data.table)
df = data.table( ID = 1:5,
x.2019 = seq(10, 50, by = 10),
x.2020 = seq(20, 60, by = 10)
)
# change column names conveniently
setnames(df, c("x.2019", "x.2020"), c("2019", "2020"))
# transform the dataset from wide to long format
out = melt(df, id.vars = "ID", variable.name = "time", value.name = "x", variable.factor = FALSE)
# cast time to integer
out[ , time := as.integer(time)]
# reorder by ID
setorder(out, ID)
out
#> ID time x
#> 1: 1 2019 10
#> 2: 1 2020 20
#> 3: 2 2019 20
#> 4: 2 2020 30
#> 5: 3 2019 30
#> 6: 3 2020 40
#> 7: 4 2019 40
#> 8: 4 2020 50
#> 9: 5 2019 50
#> 10: 5 2020 60
Created on 2022-01-20 by the reprex package (v2.0.1)
You can use pivot_longer:
library(dplyr)
library(tidyr)
df = data.frame(ID=1:5,
x.2019=c(10, 20, 30, 40, 50),
x.2020=c(20, 30, 40, 50, 60))
df %>%
pivot_longer(cols = c(2, 3), names_to = 'time', values_to = 'x') %>%
mutate(time = as.integer(stringr::str_replace(time, 'x.', '')))
Result:
# A tibble: 10 x 3
ID time x
<int> <int> <dbl>
1 1 2019 10
2 1 2020 20
3 2 2019 20
4 2 2020 30
5 3 2019 30
6 3 2020 40
7 4 2019 40
8 4 2020 50
9 5 2019 50
10 5 2020 60

Sum up with the next line into a new colum

I'm having some trouble on figuring out how to create a new column with the sum of 2 subsequent cells.
I have :
df1<- tibble(Years=c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
Values=c(1,2,3,4,5,6,7,8,9 ))
Now, I want a new column where the first line is the sum of 1+2, the second line is the sum of 1+2+3 , the third line is the sum 1+2+3+4 and so on.
As 1, 2, 3, 4... are hipoteticall values, I need to measure the absolute growth from a decade to another in order to create later on a new variable to measure the percentage change from a decade to another.
library(tibble)
df1<- tibble(Years=c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
Values=c(1,2,3,4,5,6,7,8,9 ))
library(slider)
library(dplyr, warn.conflicts = F)
df1 %>%
mutate(xx = slide_sum(Values, after = 1, before = Inf))
#> # A tibble: 9 x 3
#> Years Values xx
#> <dbl> <dbl> <dbl>
#> 1 1990 1 3
#> 2 2000 2 6
#> 3 2010 3 10
#> 4 2020 4 15
#> 5 2030 5 21
#> 6 2050 6 28
#> 7 2060 7 36
#> 8 2070 8 45
#> 9 2080 9 45
Created on 2021-08-12 by the reprex package (v2.0.0)
Assuming the last row is to be repeated. Otherwise the fill part can be skipped.
library(dplyr)
library(tidyr)
df1 %>%
mutate(x = lead(cumsum(Values))) %>%
fill(x)
# Years Values x
# <dbl> <dbl> <dbl>
# 1 1990 1 3
# 2 2000 2 6
# 3 2010 3 10
# 4 2020 4 15
# 5 2030 5 21
# 6 2050 6 28
# 7 2060 7 36
# 8 2070 8 45
# 9 2080 9 45
Using base R
v1 <- cumsum(df1$Values)[-1]
df1$new <- c(v1, v1[length(v1)])
You want the cumsum() function. Here are two ways to do it.
### Base R
df1$cumsum <- cumsum(df1$Values)
### Using dplyr
library(dplyr)
df1 <- df1 %>%
mutate(cumsum = cumsum(Values))
Here is the output in either case.
df1
# A tibble: 9 x 3
Years Values cumsum
<dbl> <dbl> <dbl>
1 1990 1 1
2 2000 2 3
3 2010 3 6
4 2020 4 10
5 2030 5 15
6 2050 6 21
7 2060 7 28
8 2070 8 36
9 2080 9 45
A data.table option
> setDT(df)[, newCol := shift(cumsum(Values), -1, fill = sum(Values))][]
Years Values newCol
1: 1990 1 3
2: 2000 2 6
3: 2010 3 10
4: 2020 4 15
5: 2030 5 21
6: 2050 6 28
7: 2060 7 36
8: 2070 8 45
9: 2080 9 45
or a base R option following a similar idea
transform(
df,
newCol = c(cumsum(Values)[-1],sum(Values))
)

Interpolating Mid-Year Averages

I have yearly observations of income for a series of geographies, like this:
library(dplyr)
library(lubridate)
date <- c("2004-01-01", "2005-01-01", "2006-01-01",
"2004-01-01", "2005-01-01", "2006-01-01")
geo <- c(1, 1, 1, 2, 2, 2)
inc <- c(10, 12, 14, 32, 34, 50)
data <- tibble(date = ymd(date), geo, inc)
date geo inc
<date> <dbl> <dbl>
1 2004-01-01 1 10
2 2005-01-01 1 12
3 2006-01-01 1 14
4 2004-01-01 2 32
5 2005-01-01 2 34
6 2006-01-01 2 50
I need to insert mid-year values, as averages of the start-of-year and end-of-year observations, so that the data is every 6 months. The outcome would like this:
2004-01-01 1 10
2004-06-01 1 11
2005-01-01 1 12
2004-06-01 1 13
2006-01-01 1 14
2004-01-01 2 32
2004-06-01 2 33
2005-01-01 2 34
2004-06-01 2 42
2006-01-01 2 50
Would appreciate any ideas.
Grouped by 'geoo', add (+) the 'inc' with the next value (lead) and get the average (/2), as well as add 5 months to the 'date', then filter out the NA elements in 'inc', bind the rows with the original data
library(dplyr)
library(lubridate)
data %>%
group_by(geo) %>%
summarise(date = date %m+% months(5),
inc = (inc + lead(inc))/2, .groups = 'drop') %>%
filter(!is.na(inc)) %>%
bind_rows(data, .) %>%
arrange(geo, date)
-output
# A tibble: 10 x 3
# date geo inc
# <date> <dbl> <dbl>
# 1 2004-01-01 1 10
# 2 2004-06-01 1 11
# 3 2005-01-01 1 12
# 4 2005-06-01 1 13
# 5 2006-01-01 1 14
# 6 2004-01-01 2 32
# 7 2004-06-01 2 33
# 8 2005-01-01 2 34
# 9 2005-06-01 2 42
#10 2006-01-01 2 50
You can use complete to create a sequence of dates for 6 months and then use na.approx to fill the NA values with interpolated values.
library(dplyr)
library(lubridate)
data %>%
group_by(geo) %>%
tidyr::complete(date = seq(min(date), max(date), by = '6 months')) %>%
mutate(date = if_else(is.na(inc), date %m-% months(1), date),
inc = zoo::na.approx(inc))
# geo date inc
# <dbl> <date> <dbl>
# 1 1 2004-01-01 10
# 2 1 2004-06-01 11
# 3 1 2005-01-01 12
# 4 1 2005-06-01 13
# 5 1 2006-01-01 14
# 6 2 2004-01-01 32
# 7 2 2004-06-01 33
# 8 2 2005-01-01 34
# 9 2 2005-06-01 42
#10 2 2006-01-01 50

Running delta - Lag difference over row of data frame

I have a data frame such as
df1 <- data.frame(Company = c('A','B','C','D','E'),
`X1980` = c(1, 5, 3, 8, 13),
`X1981` = c(20, 13, 23, 11, 29),
`X1982` = c(33, 32, 31, 41, 42),
`X1983` = c(45, 47, 53, 58, 55))
I would like to create a new data frame (df2) keeping the company column as is. The values for the years 1980 and 1983 should be calculated by using the current value minus the previous value. So basically I would like a data frame resulting in the rolling deltas.
Company 1980 1981 1982 1983
A NA 19 13 12
B NA 8 19 15
C NA 20 8 22
D NA 3 30 17
E NA 16 13 13
Thanks for the help! If there's any way for me to improve the question, then just let me know.
You can find difference in each row adding NA to first value.
df1[-1] <- t(apply(df1[-1], 1, function(x) c(NA, diff(x))))
df1
# Company X1980 X1981 X1982 X1983
#1 A NA 19 13 12
#2 B NA 8 19 15
#3 C NA 20 8 22
#4 D NA 3 30 17
#5 E NA 16 13 13
You can also use tidyverse functions.
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = -Company) %>%
group_by(Company) %>%
mutate(value = value - lag(value)) %>%
pivot_wider()
We can use rowDiffs from matrixStats
library(matrixStats)
df1[-1] <- cbind(NA, rowDiffs(as.matrix(df1[-1])))
-output
df1
# Company X1980 X1981 X1982 X1983
#1 A NA 19 13 12
#2 B NA 8 19 15
#3 C NA 20 8 22
#4 D NA 3 30 17
#5 E NA 16 13 13

data frame selecting top by grouping

I have a data frame such as:
set.seed(1)
df <- data.frame(
sample = 1:50,
value = runif(50),
group = c(rep(NA, 20), gl(3, 10)))
I want to select the top 10 samples based on value. However, if there is a group corresponding to the sample, I only want to include one sample from that group. If group == NA, I want to include all of them. Arranging df by value looks like:
df_top <- df %>%
arrange(-value) %>%
top_n(10, value)
sample value group
1 46 0.7973088 3
2 49 0.8108702 3
3 22 0.8394404 1
4 2 0.8612095 NA
5 27 0.8643395 1
6 20 0.8753213 NA
7 44 0.8762692 3
8 26 0.8921983 1
9 11 0.9128759 NA
10 30 0.9606180 1
I would want to include samples 36, 22, 2, 20, 11, and the next five highest values in my data frame that continue to fit the pattern. How do I accomplish this?
I think I figured this out. Would this be the best way:
df_top <- df %>%
arrange(-value) %>%
group_by(group) %>%
filter(ifelse(!is.na(group), value == max(value), value == value)) %>%
ungroup() %>%
top_n(10, value)
# A tibble: 10 x 3
sample value group
<int> <dbl> <int>
1 18 0.992 NA
2 7 0.945 NA
3 21 0.935 1
4 4 0.908 NA
5 6 0.898 NA
6 35 0.827 2
7 41 0.821 3
8 20 0.777 NA
9 15 0.770 NA
10 17 0.718 NA
Similar method that uses slice instead of filter:
library(dplyr)
df_top <- df %>%
arrange(-value) %>%
group_by(group) %>%
slice(if(any(!is.na(group))) 1 else 1:n()) %>%
ungroup() %>%
top_n(10, value)
Result:
# A tibble: 10 x 3
sample value group
<int> <dbl> <int>
1 21 0.9347052 1
2 35 0.8273733 2
3 41 0.8209463 3
4 18 0.9919061 NA
5 7 0.9446753 NA
6 4 0.9082078 NA
7 6 0.8983897 NA
8 20 0.7774452 NA
9 15 0.7698414 NA
10 17 0.7176185 NA

Resources