Recode character IDs into numeric IDs - r

I need to modify the values of an id variable. Here is what the sample data looks like:
df <- data.frame(id = c(11,21,22,"33_AS_A","33_AS_B","33_AS_X", "35_Part1","35_Part2","35_Part4","35_Part7"),
Grade= c(3,3,3, 4,4,4,5,5,5,5))
> df
id Grade
1 11 3
2 21 3
3 22 3
4 33_AS_A 4
5 33_AS_B 4
6 33_AS_X 4
7 35_Part1 5
8 35_Part2 5
9 35_Part4 5
10 35_Part7 5
I need to recode id as a numeric variable, replacing each text suffix with a sequential number in the order the rows appear within the same numeric prefix.
Here is my desired output:
> df2
id Grade
1 11 3
2 21 3
3 22 3
4 331 4
5 332 4
6 333 4
7 351 5
8 352 5
9 353 5
10 354 5
Any ideas?

library(dplyr)
library(stringr)
df %>%
mutate(
group = str_extract(id, "[0-9]+")
) %>%
group_by(group) %>%
mutate(id = as.numeric(paste0(group, if(n() > 1) row_number() else ""))) %>%
ungroup() %>%
select(-group)
# # A tibble: 10 × 2
# id Grade
# <dbl> <dbl>
# 1 11 3
# 2 21 3
# 3 22 3
# 4 331 4
# 5 332 4
# 6 333 4
# 7 351 5
# 8 352 5
# 9 353 5
#10 354 5

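Using base R, split the ids into groups by their numeric prefix; if a group has more than one member, append the position within the group. This assumes df is already sorted by that prefix, because split() returns the groups in sorted order:
# numeric prefix of each id (the part before the first "_")
x <- sapply(strsplit(df$id, "_"), `[`, 1)
# as.numeric() because the question asks for a numeric id
df$ID <- as.numeric(unlist(sapply(split(x, x), function(i)
if(length(i) == 1) i else paste0(i, seq_along(i)))))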
df
# id Grade ID
# 1 11 3 11
# 2 21 3 21
# 3 22 3 22
# 4 33_AS_A 4 331
# 5 33_AS_B 4 332
# 6 33_AS_X 4 333
# 7 35_Part1 5 351
# 8 35_Part2 5 352
# 9 35_Part4 5 353
# 10 35_Part7 5 354
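If the rows might not be sorted by prefix, a variant with ave() may be safer, since ave() writes each group's result back into the original row positions. This is only a minimal sketch on the same sample data; the names prefix and new_id are illustrative and not from the answers above.

```r
df <- data.frame(
  id = c(11, 21, 22, "33_AS_A", "33_AS_B", "33_AS_X",
         "35_Part1", "35_Part2", "35_Part4", "35_Part7"),
  Grade = c(3, 3, 3, 4, 4, 4, 5, 5, 5, 5)
)

# Leading part of each id, up to the first underscore (illustrative name)
prefix <- sub("_.*$", "", df$id)

# ave() applies the function per prefix group and keeps the original row order
new_id <- ave(prefix, prefix, FUN = function(i) {
  if (length(i) == 1) i else paste0(i, seq_along(i))
})

df$ID <- as.numeric(new_id)
df$ID
```

On the sample data this should give 11, 21, 22, 331, 332, 333, 351, 352, 353, 354, and the result does not depend on how the rows are ordered across groups.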

Related

Converting time-dependent variable to long format using one variable indicating day of update

I am trying to convert my data to a long format using one variable that indicates a day of the update.
I have the following variables:
baseline temperature variable "temp_b";
time-varying temperature variable "temp_v" and
the number of days "n_days" when the varying variable is updated.
I want to create a long format using the carried forward approach and a max follow-up time of 5 days.
Example of data
df <- structure(list(id=1:3, temp_b=c(20L, 7L, 7L), temp_v=c(30L, 10L, NA), n_days=c(2L, 4L, NA)), class="data.frame", row.names=c(NA, -3L))
# id temp_b temp_v n_days
# 1 1 20 30 2
# 2 2 7 10 4
# 3 3 7 NA NA
df_long <- structure(list(id=c(1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3),
days_cont=c(1,2,3,4,5, 1,2,3,4,5, 1,2,3,4,5),
long_format=c(20,30,30,30,30,7,7,7,10,10,7,7,7,7,7)),
class="data.frame", row.names=c(NA, -15L))
# id days_cont long_format
# 1 1 1 20
# 2 1 2 30
# 3 1 3 30
# 4 1 4 30
# 5 1 5 30
# 6 2 1 7
# 7 2 2 7
# 8 2 3 7
# 9 2 4 10
# 10 2 5 10
# 11 3 1 7
# 12 3 2 7
# 13 3 3 7
# 14 3 4 7
# 15 3 5 7
You could repeat each row 5 times with tidyr::uncount():
library(dplyr)
df %>%
tidyr::uncount(5) %>%
group_by(id) %>%
transmute(days_cont = 1:n(),
temp = ifelse(row_number() < n_days | is.na(n_days), temp_b, temp_v)) %>%
ungroup()
# # A tibble: 15 × 3
# id days_cont temp
# <int> <int> <int>
# 1 1 1 20
# 2 1 2 30
# 3 1 3 30
# 4 1 4 30
# 5 1 5 30
# 6 2 1 7
# 7 2 2 7
# 8 2 3 7
# 9 2 4 10
# 10 2 5 10
# 11 3 1 7
# 12 3 2 7
# 13 3 3 7
# 14 3 4 7
# 15 3 5 7
Here's a possibility using tidyverse functions. First, pivot_longer and get rid of unwanted values (that will not appear in the final df, i.e. values with temp_v == NA), then group_by id, and mutate the n_days variable to match the number of rows it will have in the final df. Finally, uncount the dataframe.
library(tidyverse)
df %>%
replace_na(list(n_days = 6)) %>%
pivot_longer(-c(id, n_days)) %>%
filter(!is.na(value)) %>%
group_by(id) %>%
mutate(n_days = case_when(name == "temp_b" ~ n_days - 1,
name == "temp_v" ~ 5 - (n_days - 1))) %>%
uncount(n_days) %>%
mutate(days_cont = row_number()) %>%
select(id, days_cont, long_format = value)
id days_cont long_format
<int> <int> <int>
1 1 1 20
2 1 2 30
3 1 3 30
4 1 4 30
5 1 5 30
6 2 1 7
7 2 2 7
8 2 3 7
9 2 4 10
10 2 5 10
11 3 1 7
12 3 2 7
13 3 3 7
14 3 4 7
15 3 5 7
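For comparison, the same carry-forward logic can be sketched in base R without any packages. This is an illustration only: max_days, long, src and updated are names made up here, and the 5-day follow-up is taken from the question.

```r
df <- data.frame(id = 1:3,
                 temp_b = c(20L, 7L, 7L),
                 temp_v = c(30L, 10L, NA),
                 n_days = c(2L, 4L, NA))

max_days <- 5  # maximum follow-up, as stated in the question

# One row per id and day
long <- data.frame(id = rep(df$id, each = max_days),
                   days_cont = rep(seq_len(max_days), times = nrow(df)))

# Row of df that each long row comes from
src <- match(long$id, df$id)

# The updated value applies from day n_days onwards; ids with no update keep the baseline
updated <- !is.na(df$n_days[src]) & long$days_cont >= df$n_days[src]
long$long_format <- ifelse(updated, df$temp_v[src], df$temp_b[src])
long
```

The comparison uses >= because, in the desired output, the updated temperature already appears on day n_days itself.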

Filling the missing values within each id in r

I have a data frame in which some rows have a missing value. Here is a sample:
df <- data.frame(id = c(1,1,1, 2,2,2, 3,3,3),
item = c(11,12,13, 24,25,26, 56,45,56),
score = c(5,5, NA, 6,6,6, 7,NA, 7))
> df
id item score
1 1 11 5
2 1 12 5
3 1 13 NA
4 2 24 6
5 2 25 6
6 2 26 6
7 3 56 7
8 3 45 NA
9 3 56 7
Grouping the data by the id column, I would like to fill those NA values with the score that the other rows of the same id have.
The desired output should be:
> df
id item score
1 1 11 5
2 1 12 5
3 1 13 5
4 2 24 6
5 2 25 6
6 2 26 6
7 3 56 7
8 3 45 7
9 3 56 7
Any ideas?
Thanks!
We can group by 'id' and fill
library(dplyr)
library(tidyr)
df %>%
group_by(id) %>%
fill(score, .direction = "downup") %>%
ungroup()
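Here is another option with base R. Note that it overwrites every score in a group with the group mean, so it reproduces the desired output only when all non-missing scores within an id are identical, as they are here: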
> transform(df, score = ave(score, id, FUN = function(x) mean(x, na.rm = TRUE)))
id item score
1 1 11 5
2 1 12 5
3 1 13 5
4 2 24 6
5 2 25 6
6 2 26 6
7 3 56 7
8 3 45 7
9 3 56 7
Another option is to write your own function, e.g. the one below, which sets every score in an id group to the group maximum:
fill.in<-function(dataf){
dataf2<-data.frame()
for (i in 1:length(unique(dataf$id))){
dataf1<-subset(dataf, id %in% unique(dataf$id)[i])
dataf1$score<-max(dataf1$score,na.rm=TRUE)
dataf2<-rbind(dataf2,dataf1)
}
return(dataf2)
}
fill.in(df)
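Both the mean() and max() versions replace every score in a group, not just the missing ones. If only the NA entries should change, something like the following base R sketch may be closer to what fill() does. The helper name fill_first is hypothetical, and using the first non-missing score of the group is an assumption.

```r
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 item = c(11, 12, 13, 24, 25, 26, 56, 45, 56),
                 score = c(5, 5, NA, 6, 6, 6, 7, NA, 7))

# Replace only the NA entries of a group with its first non-missing value;
# a group with no known score is returned unchanged.
fill_first <- function(v) {
  known <- v[!is.na(v)]
  if (length(known) > 0) v[is.na(v)] <- known[1]
  v
}

df$score <- ave(df$score, df$id, FUN = fill_first)
df
```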

Sum column over specific rownumbers in grouped dataframe in R

I have a dataframe like this:
library(dplyr)
df = data.frame(
x = 1:100,
y = rep(1:10, each = 10) # 10 groups of 10 rows; adding times = 10 would recycle x and give 1000 rows
) %>%
group_by(y)
And I would like to compute the sum of x from the 3rd to the 6th row of each group of y.
I think this should be easy, but I just can not figure it out at the moment.
In pseudocode I imagine something like this:
df %>%
mutate(
sum(x, ifelse(between(row_number(), 3,6)))
)
But this of course does not work. I would like to solve it with some dplyr-function, but also in base R I cannot think of a fast solution.
For the first group the sum would be 3+4+5+6....
One option could be:
df %>%
group_by(y) %>%
mutate(z = sum(x[row_number() %in% 3:6]))
x y z
<int> <int> <int>
1 1 1 18
2 2 1 18
3 3 1 18
4 4 1 18
5 5 1 18
6 6 1 18
7 7 1 18
8 8 1 18
9 9 1 18
10 10 1 18
You could also do this with filter() and summarise() and obtain a group-wise summary:
df %>%
group_by(y) %>%
mutate(rn = 1:n()) %>%
filter(rn %in% 3:6) %>%
summarise(x_sum = sum(x))
# A tibble: 10 x 2
y x_sum
<int> <int>
1 1 18
2 2 58
3 3 98
4 4 138
5 5 178
6 6 218
7 7 258
8 8 298
9 9 338
10 10 378
Update: If you want to sum multiple sequences from x then you can sum by index:
df %>%
group_by(y) %>%
mutate(sum_row3to6 = sum(x[3:6]),
sum_row1to4 = sum(x[1:4])
)
Output:
x y sum_row3to6 sum_row1to4
<int> <int> <int> <int>
1 1 1 18 10
2 2 1 18 10
3 3 1 18 10
4 4 1 18 10
5 5 1 18 10
6 6 1 18 10
7 7 1 18 10
8 8 1 18 10
9 9 1 18 10
10 10 1 18 10
First answer:
We could use slice() followed by summarise():
library(dplyr)
df %>%
group_by(y) %>%
slice(3:6) %>%
summarise(sum = sum(x))
Output:
y sum
<int> <int>
1 1 18
2 2 58
3 3 98
4 4 138
5 5 178
6 6 218
7 7 258
8 8 298
9 9 338
10 10 378
data.table
library(data.table)
df = data.frame(
x = 1:100,
y = rep(1:10, each = 10) # 10 groups of 10 rows
)
setDT(df)[rowid(y) %in% 3:6, list(sum_x = sum(x)), by = y][]
#> y sum_x
#> 1: 1 18
#> 2: 2 58
#> 3: 3 98
#> 4: 4 138
#> 5: 5 178
#> 6: 6 218
#> 7: 7 258
#> 8: 8 298
#> 9: 9 338
#> 10: 10 378
Created on 2021-05-21 by the reprex package (v2.0.0)
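The question also asks about base R. A minimal sketch, assuming every group has at least 6 rows (otherwise x[3:6] contains NA and the sum is NA); the names sums and z are illustrative:

```r
df <- data.frame(x = 1:100, y = rep(1:10, each = 10))

# One value per group: sum of the 3rd to 6th x within each y
sums <- tapply(df$x, df$y, function(v) sum(v[3:6]))
sums

# Same total repeated on every row of the group, like the mutate() answer
df$z <- ave(df$x, df$y, FUN = function(v) sum(v[3:6]))
head(df, 10)
```

tapply() gives the group-wise summary (18, 58, ..., 378), while ave() keeps all 100 rows.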

How do I combine dataframes of unequal length based on a condition

I would like to know how to best combine the two following dataframes:
df1 <- data.frame(Date = c(1,2,3,4,5,6,7,8,9,10),
Altitude=c(100,101,101,102,103,99,98,99,89,70))
> df1
Date Altitude
1 1 100
2 2 101
3 3 101
4 4 102
5 5 103
6 6 99
7 7 98
8 8 99
9 9 89
10 10 70
df2 <- data.frame(Start = c(1,4,8),Stop = c(3,7,10),Longitude=c(10,12,13))
> df2
Start Stop Longitude
1 1 3 10
2 4 7 12
3 8 10 13
I would basically need a third column in df1, with the Longitude from df2 chosen according to which Start to Stop range the Date falls in, resulting in something like this:
Date Altitude Longitude
1 1 100 10
2 2 101 10
3 3 101 10
4 4 102 12
5 5 103 12
6 6 99 12
7 7 98 12
8 8 99 13
9 9 89 13
10 10 70 13
I've been trying all kinds of subsetting, filtering, ... but I just can't figure it out. Any help would be appreciated!
Kind regards
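An idea via dplyr is to expand each Start:Stop range into a sequence of dates, unnest, and join, i.e.
library(dplyr)
df2 %>%
# Map() always returns a list, whereas mapply() would simplify to a matrix if all ranges had equal length
mutate(Date = Map(seq, Start, Stop)) %>%
tidyr::unnest(Date) %>%
select(-Start, -Stop) %>%
right_join(df1, by = 'Date')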
which gives,
Longitude Date Altitude
1 10 1 100
2 10 2 101
3 10 3 101
4 12 4 102
5 12 5 103
6 12 6 99
7 12 7 98
8 13 8 99
9 13 9 89
10 13 10 70
Here is a tidyverse answer using the group_by and group_modify functions in the dplyr package (introduced in version 0.8.1 in May 2019).
library(dplyr)
df1 %>%
group_by(Date, Altitude) %>%
group_modify(~ data.frame(df2 %>%
filter(.x$Date >= Start, .x$Date <= Stop)) %>%
select(Longitude),
keep = TRUE)
For each unique combination in df1 of date and altitude (i.e. for each row), this finds the longitude corresponding to the date range in df2.
The output is a tibble:
# A tibble: 10 x 3
# Groups: Date, Altitude [10]
Date Altitude Longitude
<dbl> <dbl> <dbl>
1 1 100 10
2 2 101 10
3 3 101 10
4 4 102 12
5 5 103 12
6 6 99 12
7 7 98 12
8 8 99 13
9 9 89 13
10 10 70 13
Base R solution:
ind <- apply(df2, 1, function(x) which(df1$Date >= x[1] & df1$Date <= x[2]))
df1$Longitude <- unlist(Map(function(x,y) rep(y, length(x)), ind, df2$Longitude))
Output
Date Altitude Longitude
1 1 100 10
2 2 101 10
3 3 101 10
4 4 102 12
5 5 103 12
6 6 99 12
7 7 98 12
8 8 99 13
9 9 89 13
10 10 70 13
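Another base R possibility is findInterval(), which avoids building index lists. This is only a sketch: it assumes df2 is sorted by Start and that the ranges do not overlap, and idx and inside are names introduced here for illustration.

```r
df1 <- data.frame(Date = 1:10,
                  Altitude = c(100, 101, 101, 102, 103, 99, 98, 99, 89, 70))
df2 <- data.frame(Start = c(1, 4, 8), Stop = c(3, 7, 10),
                  Longitude = c(10, 12, 13))

# Index of the last range whose Start is <= Date (0 when Date precedes all ranges)
idx <- findInterval(df1$Date, df2$Start)
idx[idx == 0] <- NA

# Keep the match only if Date is also on or before that range's Stop
inside <- !is.na(idx) & df1$Date <= df2$Stop[idx]
df1$Longitude <- ifelse(inside, df2$Longitude[idx], NA)
df1
```

Dates that fall in a gap between ranges, or outside all of them, get NA instead of shifting the remaining values, and df1 does not need to be sorted by Date.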

Sum of group but keep the same value for each row in r

I have a data frame and want to create new variables holding the sum within each combination of ID and Group. A normal grouped sum reduces the number of rows, but in my case I need to keep every row and repeat the group sum on each of them.
ID <- c(rep(1,3), rep(3, 5), rep(4,4))
Group <-c(1,1,2,1,1,1,2,2,1,1,1,2)
x <- c(1:12)
y<- c(12:23)
df <- data.frame(ID,Group,x,y)
ID Group x y
1 1 1 1 12
2 1 1 2 13
3 1 2 3 14
4 3 1 4 15
5 3 1 5 16
6 3 1 6 17
7 3 2 7 18
8 3 2 8 19
9 4 1 9 20
10 4 1 10 21
11 4 1 11 22
12 4 2 12 23
The desired output has two more variables, "sumx" and "sumy", computed by (ID, Group):
ID Group x y sumx sumy
1 1 1 1 12 3 25
2 1 1 2 13 3 25
3 1 2 3 14 3 14
4 3 1 4 15 15 48
5 3 1 5 16 15 48
6 3 1 6 17 15 48
7 3 2 7 18 15 37
8 3 2 8 19 15 37
9 4 1 9 20 30 63
10 4 1 10 21 30 63
11 4 1 11 22 30 63
12 4 2 12 23 12 23
Any Idea?
As short as:
df$sumx <- with(df,ave(x,ID,Group,FUN = sum))
df$sumy <- with(df,ave(y,ID,Group,FUN = sum))
We can use dplyr
library(dplyr)
df %>%
group_by(ID, Group) %>%
# across() needs dplyr >= 1.0.0; it replaces the deprecated mutate_each(funs(sum))
mutate(across(c(x, y), sum, .names = "sum{.col}")) %>%
ungroup()
If there are only two columns to sum, then
df %>%
group_by(ID, Group) %>%
mutate(sumx = sum(x), sumy = sum(y))
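The same group_by() plus mutate() pattern works for a single column, and further columns can be added to mutate() as needed. The example below comes from a different data set (data12, grouped by Category) and uses cumsum(), which gives a running total within each group; use sum() instead to repeat the group total on every row: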
library(dplyr)
data13 <- data12 %>%
group_by(Category) %>%
mutate(cum_Cat_GMR = cumsum(GrossMarginRs))
