Create nominal variable from multiple columns R - r

My intention involves creating a variable based on the values of two numeric ones. I have not written any user-defined functions in R and need help getting started.
Dataset:
My dataset has over 3k stores, but I created a reproducible example from the first 10 rows. The day-of-week columns show the total delivery volume for that day over the year. Store_Num is the store number and Total is the total number of deliveries for a store over the year.
I want the predominant delivery days stored in a variable called Del_Sch, based on the following bands. If the first condition is TRUE (50-100%), create the variable from the matching column name. If it is FALSE, test the second condition and create the variable from all column names in the 32-50% band, etc. If no day is over 20%, no predominant delivery days are counted.
-Volume in a day between 50-100% of the total.
-Volume in a day between 32-50% of total
-Volume in a day between 25-32% of total.
-Volume in a day between 20-25% of total.
-Volume in a day less than 20% of total.
Reproducible Example:
Store_Num <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
#Total deliveries made per week
Sun_Del <- c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
Mon_Del <- c(10, 50, 51, 7, 80, 97, 21, 49, 30, 3)
Tue_Del <- c(7, NA, 2, 50, 5, 56, 1, 4, 35, 52)
Wed_Del <- c(49, 51, 1, 4, 51, 16, 2, 2, 1, 1)
Thu_Del <- c(3, 2, 47, 7, 40, 2, 6, 5, 1, 7)
Fri_Del <- c(50, 49, 3, 51, 53, 86, 9, 52, 25, 52)
Sat_Del <- c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
Total <- c(119, 152, 104, 119, 229, 257, 39, 112, 92, 115)
#Single dataset
Schedule <- data.frame(Store_Num, Sun_Del, Mon_Del, Tue_Del,
Wed_Del, Thu_Del, Fri_Del, Sat_Del, Total)
Schedule
Store_Num Sun_Del Mon_Del Tue_Del Wed_Del Thu_Del Fri_Del Sat_Del Total
1 1 NA 10 7 49 3 50 NA 119
2 2 NA 50 NA 51 2 49 NA 152
3 3 NA 51 2 1 47 3 NA 104
4 4 NA 7 50 4 7 51 NA 119
5 5 NA 80 5 51 40 53 NA 229
6 6 NA 97 56 16 2 86 NA 257
7 7 NA 21 1 2 6 9 NA 39
8 8 NA 49 4 2 5 52 NA 112
9 9 NA 30 35 1 1 25 NA 92
10 10 NA 3 52 1 7 52 NA 115
Desired Output:
Store_Num Sun_Del Mon_Del Tue_Del Wed_Del Thu_Del Fri_Del Sat_Del Total Del_Sch
1 1 NA 10 7 49 3 50 NA 119 WF
2 2 NA 50 NA 51 2 49 NA 152 MWF
3 3 NA 51 2 1 47 3 NA 104 MTh
4 4 NA 7 50 4 7 51 NA 119 TF
5 5 NA 80 5 51 40 53 NA 229 MWF
6 6 NA 97 56 16 2 86 NA 257 MTF
7 7 NA 21 1 2 6 9 NA 39 M
8 8 NA 49 4 2 5 52 NA 112 MF
9 9 NA 30 35 1 1 25 NA 92 MTF
10 10 NA 3 52 1 7 52 NA 115 TF

Using tidyr and dplyr. I used the first two letters of each column name as the day label to avoid the Tuesday/Thursday ambiguity:
library(dplyr)
library(tidyr)
Schedule %>% gather(Day, del, -Store_Num, -Total) %>%
mutate(proportion = ifelse(del/Total >= 0.5, 1,
ifelse(del/Total >= 0.32, 2,
ifelse(del/Total >= 0.25, 3,
ifelse(del/Total >= 0.20, 4,
NA))))) %>%
group_by(Store_Num) %>%
summarise(days = paste0(substr(Day[which(
proportion == min(proportion, na.rm = TRUE))],
1, 2), collapse = "")) %>%
merge(Schedule, ., by = "Store_Num")
Store_Num Sun_Del Mon_Del Tue_Del Wed_Del Thu_Del Fri_Del Sat_Del Total days
1 1 NA 10 7 49 3 50 NA 119 WeFr
2 2 NA 50 NA 51 2 49 NA 152 MoWeFr
3 3 NA 51 2 1 47 3 NA 104 MoTh
4 4 NA 7 50 4 7 51 NA 119 TuFr
5 5 NA 80 5 51 40 53 NA 229 Mo
6 6 NA 97 56 16 2 86 NA 257 MoFr
7 7 NA 21 1 2 6 9 NA 39 Mo
8 8 NA 49 4 2 5 52 NA 112 MoFr
9 9 NA 30 35 1 1 25 NA 92 MoTu
10 10 NA 3 52 1 7 52 NA 115 TuFr
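Edit: a few of my results differ from your desired output (rows 5, 6 and 9). According to your rules, the desired output is wrong for those rows. For store 5, for example, only Monday (80/229, about 35%) falls in the 32-50% band, so the result is Mo rather than MWF.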
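Since the question asks about writing a user-defined function, the same band rule can also be sketched in base R. This is only an illustration under stated assumptions: the helper name del_schedule and its cutoffs argument are hypothetical, store 99 is made-up data with an even spread, and the two-letter day labels follow the answer above.

```r
# Hypothetical helper (not from the original answer): label the predominant
# delivery days of one store from its per-day volumes and its yearly total.
del_schedule <- function(dels, total, cutoffs = c(0.50, 0.32, 0.25, 0.20)) {
  share <- dels / total
  for (cutoff in cutoffs) {
    hit <- which(!is.na(share) & share >= cutoff)
    if (length(hit) > 0) {
      # The first band containing any day wins; label days by two letters.
      return(paste(substr(names(dels)[hit], 1, 2), collapse = ""))
    }
  }
  NA_character_  # no day reaches 20% of the total
}

# Stores 1, 5 and 7 from the question, plus a made-up store 99.
Schedule <- data.frame(
  Store_Num = c(1, 5, 7, 99),
  Sun_Del = NA,
  Mon_Del = c(10, 80, 21, 10),
  Tue_Del = c(7, 5, 1, 10),
  Wed_Del = c(49, 51, 2, 10),
  Thu_Del = c(3, 40, 6, 10),
  Fri_Del = c(50, 53, 9, 10),
  Sat_Del = c(NA, NA, NA, 10),
  Total = c(119, 229, 39, 60)
)

day_cols <- grep("_Del$", names(Schedule), value = TRUE)
Schedule$Del_Sch <- vapply(
  seq_len(nrow(Schedule)),
  function(i) del_schedule(unlist(Schedule[i, day_cols]), Schedule$Total[i]),
  character(1)
)
Schedule[, c("Store_Num", "Total", "Del_Sch")]
```

Because the cutoffs are tried from highest to lowest and the first one with a hit wins, checking share >= cutoff is equivalent to checking the bands one by one. A store with no day at 20% or more gets NA.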

Related

Merge overlapping datasets by column identifier?

I am trying to merge/join two datasets which have different data about the same samples with no rows in common. I would like to be able to merge them by the column names and have that add the rows from the smaller dataset to the larger, filling in NA for all columns that do not have information from the smaller dataset. I feel like this is something super easy that I'm just somehow not able to figure out.
2 tiny sample datasets:
df1 <- data.frame(team=c('A', 'B', 'C', 'D'),
points=c(88, 98, 104, 100),
league=c('Alpha', 'Beta', 'Gamma', 'Delta'))
team points league
1 A 88 Alpha
2 B 98 Beta
3 C 104 Gamma
4 D 100 Delta
df2 <- data.frame(team=c('L', 'M','N', 'O', 'P', 'Q'),
points=c(43, 66, 77, 83, 12, 12),
league=c('Epsilon', 'Zeta', 'Eta', 'Theta', 'Iota', 'Kappa'),
rebounds=c(22, 31, 29, 20, 33, 44),
fouls=c(1, 3, 2, 4, 5, 1))
team points league rebounds fouls
1 L 43 Epsilon 22 1
2 M 66 Zeta 31 3
3 N 77 Eta 29 2
4 O 83 Theta 20 4
5 P 12 Iota 33 5
6 Q 12 Kappa 44 1
the output I would like to get would be:
df3<- data.frame(team=c('A', 'B', 'C', 'D', 'L', 'M','N', 'O', 'P', 'Q' ),
points=c(88, 98, 104, 100, 43, 66, 77, 83, 12, 12),
league=c('Alpha', 'Beta', 'Gamma', 'Delta', 'Epsilon', 'Zeta', 'Eta', 'Theta', 'Iota', 'Kappa'),
rebounds=c(NA, NA, NA, NA, 22, 31, 29, 20, 33, 44),
fouls= c(NA, NA, NA, NA, 1, 3, 2, 4, 5, 1))
team points league rebounds fouls
1 A 88 Alpha NA NA
2 B 98 Beta NA NA
3 C 104 Gamma NA NA
4 D 100 Delta NA NA
5 L 43 Epsilon 22 1
6 M 66 Zeta 31 3
7 N 77 Eta 29 2
8 O 83 Theta 20 4
9 P 12 Iota 33 5
10 Q 12 Kappa 44 1
I tried transposing the dfs, but because they have no rows in common that does not work either. I thought about making an index, but I'm just learning about those and I'm not sure how I would do it or if that's the correct move.
Use full_join and arrange
library(dplyr)
full_join(df2, df1) %>%
arrange(team)
-output
team points league rebounds fouls
1 A 88 Alpha NA NA
2 B 98 Beta NA NA
3 C 104 Gamma NA NA
4 D 100 Delta NA NA
5 L 43 Epsilon 22 1
6 M 66 Zeta 31 3
7 N 77 Eta 29 2
8 O 83 Theta 20 4
9 P 12 Iota 33 5
10 Q 12 Kappa 44 1
Or with rows_upsert
rows_upsert(df2, df1, by = c("team", "points", "league"))
We could use bind_rows()
When row-binding, columns are matched by name, and any missing columns will be filled with NA:
library(dplyr)
bind_rows(df1, df2)
team points league rebounds fouls
1 A 88 Alpha NA NA
2 B 98 Beta NA NA
3 C 104 Gamma NA NA
4 D 100 Delta NA NA
5 L 43 Epsilon 22 1
6 M 66 Zeta 31 3
7 N 77 Eta 29 2
8 O 83 Theta 20 4
9 P 12 Iota 33 5
10 Q 12 Kappa 44 1
Using base R, you could add the missing columns in df1 using setdiff() and then rbind them together:
df1[, setdiff(names(df2), names(df1))] <- NA
rbind(df1, df2)
Output:
# team points league rebounds fouls
# 1 A 88 Alpha NA NA
# 2 B 98 Beta NA NA
# 3 C 104 Gamma NA NA
# 4 D 100 Delta NA NA
# 5 L 43 Epsilon 22 1
# 6 M 66 Zeta 31 3
# 7 N 77 Eta 29 2
# 8 O 83 Theta 20 4
# 9 P 12 Iota 33 5
# 10 Q 12 Kappa 44 1
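The setdiff() idea can be wrapped in a small helper that fills in missing columns on either side, so the order of the arguments does not matter. This is a minimal sketch; the name rbind_fill is hypothetical and not a function from base R or dplyr.

```r
# Hypothetical helper: row-bind two data frames, adding any column that is
# missing from one of them as NA first.
rbind_fill <- function(x, y) {
  for (nm in setdiff(names(y), names(x))) x[[nm]] <- NA
  for (nm in setdiff(names(x), names(y))) y[[nm]] <- NA
  rbind(x, y[names(x)])  # align the column order before binding
}

df1 <- data.frame(team = c("A", "B", "C", "D"),
                  points = c(88, 98, 104, 100),
                  league = c("Alpha", "Beta", "Gamma", "Delta"))
df2 <- data.frame(team = c("L", "M", "N", "O", "P", "Q"),
                  points = c(43, 66, 77, 83, 12, 12),
                  league = c("Epsilon", "Zeta", "Eta", "Theta", "Iota", "Kappa"),
                  rebounds = c(22, 31, 29, 20, 33, 44),
                  fouls = c(1, 3, 2, 4, 5, 1))

out <- rbind_fill(df1, df2)
out
```

Unlike the one-liner above, this leaves df1 itself unmodified, since the columns are added to copies inside the function.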

Subtract the result for level time0 from the results from all other levels, for each id

I want to subtract the result for level time0 from the results for all other levels, for each id.
id <- rep(1:4,each=4)
time <- rep(c(0,5,10,15),4)
a <- c(34,56,67,35)
b <-c(56,78,23,90)
c <- c(23,89,67,78)
df <- data.frame(id,time,a,b,c)
df
id time a b c
1 1 0 34 56 23
2 1 5 56 78 89
3 1 10 67 23 67
4 1 15 35 90 78
5 2 0 34 56 23
6 2 5 56 78 89
7 2 10 67 23 67
8 2 15 35 90 78
9 3 0 34 56 23
10 3 5 56 78 89
11 3 10 67 23 67
12 3 15 35 90 78
13 4 0 34 56 23
14 4 5 56 78 89
15 4 10 67 23 67
16 4 15 35 90 78
I started like this but it feels there must be a more efficient way. Any suggestions? Thanks!
for (i in 1:length(unique(df$id))) {
  df_id <- df[df$id == i, ]
  for (j in 2:length(time)) {
    test <- t(df_id[, -1])
    test[, c(2:4)] - test[, 1]
  }
}
Here's an option with dplyr -
library(dplyr)
df %>%
group_by(id) %>%
mutate(across(a:c, ~. - .[time == 0])) %>%
ungroup
# id time a b c
# <int> <dbl> <dbl> <dbl> <dbl>
# 1 1 0 0 0 0
# 2 1 5 22 22 66
# 3 1 10 33 -33 44
# 4 1 15 1 34 55
# 5 2 0 0 0 0
# 6 2 5 22 22 66
# 7 2 10 33 -33 44
# 8 2 15 1 34 55
# 9 3 0 0 0 0
#10 3 5 22 22 66
#11 3 10 33 -33 44
#12 3 15 1 34 55
#13 4 0 0 0 0
#14 4 5 22 22 66
#15 4 10 33 -33 44
#16 4 15 1 34 55
Using time == 0 works if every id is guaranteed to have exactly one row with time = 0. If some ids have no row with time = 0, or more than one, match is probably the better option, because match(0, time) returns the position of the first match, or NA when there is none.
df %>% group_by(id) %>% mutate(across(a:c, ~. - .[match(0, time)]))
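The difference between the two indexing styles can be seen on plain vectors. The values below are made up purely for illustration.

```r
# Made-up time vectors for three hypothetical ids.
time_one  <- c(0, 5, 10, 15)   # exactly one baseline row
time_two  <- c(5, 0, 10, 0)    # two baseline rows
time_none <- c(5, 10, 15)      # no baseline row

# Logical indexing returns every matching position (possibly none).
pos_all_two  <- which(time_two == 0)
pos_all_none <- which(time_none == 0)

# match() always returns a single position: the first match, or NA.
pos_first_one  <- match(0, time_one)
pos_first_two  <- match(0, time_two)
pos_first_none <- match(0, time_none)
```

With match(), the subtracted baseline always has length one, so the result has the same length as the group. A group without a baseline yields NA rather than a length-mismatch error.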
Use mapply in by.
vc <- c('a', 'b', 'c')
by(df, df$id, \(x) {x[-1, vc] <- mapply(`-`, x[-1, vc], x[1, vc]);x}) |>
do.call(what=rbind)
# id time a b c
# 1.1 1 0 34 56 23
# 1.2 1 5 22 22 66
# 1.3 1 10 33 -33 44
# 1.4 1 15 1 34 55
# 2.5 2 0 34 56 23
# 2.6 2 5 22 22 66
# 2.7 2 10 33 -33 44
# 2.8 2 15 1 34 55
# 3.9 3 0 34 56 23
# 3.10 3 5 22 22 66
# 3.11 3 10 33 -33 44
# 3.12 3 15 1 34 55
# 4.13 4 0 34 56 23
# 4.14 4 5 22 22 66
# 4.15 4 10 33 -33 44
# 4.16 4 15 1 34 55
If the position of the time == 0 row is not consistent, you need a more verbose formulation:
{x[x$time != 0, vc] <- mapply(`-`, x[x$time != 0, vc], x[x$time == 0, vc]);x}
Data:
df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L), time = c(0, 5, 10, 15, 0, 5, 10, 15,
0, 5, 10, 15, 0, 5, 10, 15), a = c(34, 56, 67, 35, 34, 56, 67,
35, 34, 56, 67, 35, 34, 56, 67, 35), b = c(56, 78, 23, 90, 56,
78, 23, 90, 56, 78, 23, 90, 56, 78, 23, 90), c = c(23, 89, 67,
78, 23, 89, 67, 78, 23, 89, 67, 78, 23, 89, 67, 78)), class = "data.frame", row.names = c(NA,
-16L))
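Another base R option is to build a lookup table of the baseline rows and subtract it by id. This is a sketch that assumes at most one time == 0 row per id; the names vc, base, idx and out are chosen here for illustration.

```r
df <- data.frame(id = rep(1:4, each = 4),
                 time = rep(c(0, 5, 10, 15), 4),
                 a = rep(c(34, 56, 67, 35), 4),
                 b = rep(c(56, 78, 23, 90), 4),
                 c = rep(c(23, 89, 67, 78), 4))

vc <- c("a", "b", "c")

# One baseline row per id (assumes at most one time == 0 row per id).
base <- df[df$time == 0, c("id", vc)]

# For every row of df, the position of its id in the baseline table.
idx <- match(df$id, base$id)

out <- df
for (v in vc) {
  out[[v]] <- df[[v]] - base[[v]][idx]
}
out
```

An id without a baseline row gets NA in idx, so its differences come out as NA instead of raising an error.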

Merge Dataframes with different number of rows

I am trying to merge together 8 dataframes into one, matching against the row names.
Examples of the dataframes:
DF1
          Arable and Horticulture
Acer                          100
Achillea                       90
Aesculus                       23
Alliaria                        3
Allium                         56
Anchusa                       299
DF2
            Improved Grassland
Acer                        12
Alliaria                     3
Allium                      50
Brassica                    23
Calystegia                 299
Campanula                   29
And so on for a few hundred rows for different plants and 8 columns of different habitats.
What I want the merged frame to look like:
            Arable and Horticulture  Improved Grassland
Acer                            100                  12
Achillea                         90                   0
Aesculus                         23                   0
Alliaria                          3                   3
Allium                           56                  50
Anchusa                         299                   0
Brassica                          0                  23
Calystegia                        0                 299
Campanula                         0                  29
I tried merging
PolPerGen <- merge(DF1, DF2, all=TRUE)
But that does not match on the row names; it drops them entirely from the output:
    Arable and Horticulture  Improved Grassland
1                       100                  NA
2                        90                  NA
3                        23                  NA
4                         2                  NA
5                        56                  NA
6                       299                  NA
7                        NA                  12
8                        NA                   3
9                        NA                  50
10                       NA                  23
11                       NA                 299
12                       NA                  29
I am completely out of ideas, any thoughts?
Your dataset is:
dat1 = data.frame("Arable and Horticulture" = c(100, 90, 23, 3, 56, 299),
row.names = c("Acer", "Achillea", "Aesculus", "Alliaria", "Allium", "Anchusa"))
dat2 = data.frame("Improved Grassland" = c(12, 3, 50, 23, 299, 29),
row.names = c("Acer", "Alliaria", "Allium", "Brassica", "Calystegia", "Campanula"))
As @Vinícius Félix suggested, first convert the row names to a column.
library(tibble)
dat1 = rownames_to_column(dat1, "Plants")
dat2 = rownames_to_column(dat2, "Plants")
Then let's join both datasets on that column,
library(dplyr)
dat = full_join(dat1, dat2, by = "Plants")
And replace the NAs with 0:
dat = dat %>% replace(is.na(.), 0)
dat
Plants Arable.and.Horticulture Improved.Grassland
1 Acer 100 12
2 Achillea 90 0
3 Aesculus 23 0
4 Alliaria 3 3
5 Allium 56 50
6 Anchusa 299 0
7 Brassica 0 23
8 Calystegia 0 299
9 Campanula 0 29
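The question mentions eight data frames, so the two-frame join needs to be repeated. A minimal base R sketch uses Reduce() with merge(). The third frame dat3 and its Woodland column are made up here to stand in for the remaining habitats, and dat1 and dat2 use the values from the question.

```r
dat1 <- data.frame("Arable and Horticulture" = c(100, 90, 23, 3, 56, 299),
                   row.names = c("Acer", "Achillea", "Aesculus",
                                 "Alliaria", "Allium", "Anchusa"))
dat2 <- data.frame("Improved Grassland" = c(12, 3, 50, 23, 299, 29),
                   row.names = c("Acer", "Alliaria", "Allium",
                                 "Brassica", "Calystegia", "Campanula"))
# Hypothetical third habitat, only to show that the approach scales.
dat3 <- data.frame(Woodland = c(5, 7), row.names = c("Acer", "Brassica"))

frames <- list(dat1, dat2, dat3)

# Move the row names into a regular key column in every frame.
with_key <- lapply(frames, function(d) {
  data.frame(Plants = rownames(d), d, row.names = NULL)
})

# Full outer join of all frames on the key column, one pair at a time.
merged <- Reduce(function(x, y) merge(x, y, by = "Plants", all = TRUE),
                 with_key)

# Plants that are absent from a habitat get 0 instead of NA.
merged[is.na(merged)] <- 0
merged
```

With eight real data frames, only the frames list changes. The rest stays the same because Reduce() folds merge() over however many elements the list holds.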

Loop over certain columns to replace NAs with 0 in a dataframe

I have spent a lot of time trying to write a loop to replace NAs with zeros for certain columns in a data frame and have not yet succeeded. I have searched and can't find a similar question.
df <- data.frame(A = c(2, 4, 6, NA, 8, 10),
B = c(NA, 10, 12, 14, NA, 16),
C = c(20, NA, 22, 24, 26, NA),
D = c(30, NA, NA, 32, 34, 36))
df
Gives me:
A B C D
1 2 NA 20 30
2 4 10 NA NA
3 6 12 22 NA
4 NA 14 24 32
5 8 NA 26 34
6 10 16 NA 36
I want to set NAs to 0 for only columns B and D. Using separate code lines, I could:
df$B[is.na(df$B)] <- 0
df$D[is.na(df$D)] <- 0
However, I want to use a loop because I have many variables in my real data set.
I cannot find a way to loop over only columns B and D so I get:
df
A B C D
1 2 0 20 30
2 4 10 NA 0
3 6 12 22 0
4 NA 14 24 32
5 8 0 26 34
6 10 16 NA 36
Essentially, I want to apply a loop using a variable list to a data frame:
varlist <- c("B", "D")
How can I loop over only certain columns in the data frame using a variable list to replace NAs with zeros?
Here is a tidyverse approach:
library(tidyverse)
df %>%
  mutate(across(c(B, D), ~ ifelse(is.na(.), 0, .)))
#output:
A B C D
1 2 0 20 30
2 4 10 NA 0
3 6 12 22 0
4 NA 14 24 32
5 8 0 26 34
6 10 16 NA 36
Basically, you say that the variables B and D should be changed by a defined function, where . stands for the column currently being processed.
Here's a base R one-liner
df[, varlist][is.na(df[, varlist])] <- 0
Using the zoo package, we can fill the selected columns.
library(zoo)
df[varlist]=na.fill(df[varlist],0)
df
A B C D
1 2 0 20 30
2 4 10 NA 0
3 6 12 22 0
4 NA 14 24 32
5 8 0 26 34
6 10 16 NA 36
In base R we can have
df[varlist]=lapply(df[varlist],function(x){x[is.na(x)]=0;x})
df
A B C D
1 2 0 20 30
2 4 10 NA 0
3 6 12 22 0
4 NA 14 24 32
5 8 0 26 34
6 10 16 NA 36
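Since the question asks for an explicit loop over a variable list, the same replacement can also be written as a plain for loop. This is a sketch using the question's own df and varlist.

```r
df <- data.frame(A = c(2, 4, 6, NA, 8, 10),
                 B = c(NA, 10, 12, 14, NA, 16),
                 C = c(20, NA, 22, 24, 26, NA),
                 D = c(30, NA, NA, 32, 34, 36))
varlist <- c("B", "D")

# Visit only the listed columns and overwrite their NAs in place.
for (v in varlist) {
  df[[v]][is.na(df[[v]])] <- 0
}
df
```

Using df[[v]] with a character name is what lets the column be chosen from a variable, which df$v cannot do.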

transforming & adding new column in r

I currently have a data frame taken from a data feed of events that happened in chronological order. I would like to add a new column to each row of my data that corresponds to the previous event's endx if the prior event type is 1, and to the previous event's x if the prior event type is not 1.
e.g
player_id <- c(12, 17, 26, 3)
event_type <- c(1, 3, 1, 10)
x <- c(65, 34, 43, 72)
endx <- c(68, NA, 47, NA)
df <- data.frame(player_id, event_type, x, endx)
df
player_id event_type x endx
1 12 1 65 68
2 17 3 34 NA
3 26 1 43 47
4 3 10 72 NA
so end result
player_id event_type x endx previous
1 12 1 65 68 NA
2 17 3 34 NA 68
3 26 1 43 47 34
4 3 10 72 NA 47
We can use if_else
library(dplyr)
df %>%
mutate(previous = if_else(lag(event_type)==1, lag(endx), lag(x)))
# player_id event_type x endx previous
#1 12 1 65 68 NA
#2 17 3 34 NA 68
#3 26 1 43 47 34
#4 3 10 72 NA 47
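I am sure this isn't the most succinct way, but you can use a loop and indexing.
df$previous <- NA
for (i in 2:nrow(df)) {
  # Take endx from the previous row if it was a type 1 event, otherwise x
  prev_col <- if (df[i - 1, "event_type"] == 1) "endx" else "x"
  df[i, "previous"] <- df[i - 1, prev_col]
}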
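The same rule can also be written without a loop or dplyr by shifting the columns down one row by hand. This is a sketch; the names prev_type, prev_endx and prev_x are introduced here for illustration.

```r
df <- data.frame(player_id = c(12, 17, 26, 3),
                 event_type = c(1, 3, 1, 10),
                 x = c(65, 34, 43, 72),
                 endx = c(68, NA, 47, NA))

n <- nrow(df)

# Shift each column down by one row; the first row has no predecessor.
prev_type <- c(NA, df$event_type[-n])
prev_endx <- c(NA, df$endx[-n])
prev_x    <- c(NA, df$x[-n])

# Previous endx after a type 1 event, previous x otherwise.
df$previous <- ifelse(prev_type == 1, prev_endx, prev_x)
df
```

The first row has no previous event, so its condition is NA and ifelse() returns NA there.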
