I have a data.frame (df), shown in the example below, that contains information about people. Based on a key column (sleutel), I know whether people live together (e.g. form a family) or not. Now I need to create new columns with information about the 'head' of the family.
name sex gzverh sleutel gzhfd lft
1 Loekens Man 6 1847LS 9 3 49
2 Kemel Vrouw 5 1847LK 10 2 18
3 Kemel Man 5 1847LK 10 2 22
4 Boersma Vrouw 4 1847LK 10 2 52
5 Kemel Man 2 1847LK 10 1 54
For example: row 5 is Kemel, male, with gzhfd 1 (= head of the Kemel family). He is married to Mrs. Boersma (same key). I want to add a new column (lfthb) holding the age of the head of the family for all family members. The result should look something like this:
name sex gzverh sleutel gzhfd lft lfthb
1 Loekens Man 6 1847LS 9 3 49 NA
2 Kemel Vrouw 5 1847LK 10 2 18 54
3 Kemel Man 5 1847LK 10 2 22 54
4 Boersma Vrouw 4 1847LK 10 2 52 54
5 Kemel Man 2 1847LK 10 1 54 54
I tried several approaches with dplyr, using various combinations of group_by, case_when and if_else. I manage to fill the column for the head of the family itself, but not for the other members.
For example, this evidently only changes the value for the head itself:
df <- df %>% mutate(lfthb = case_when(sleutel == lag(sleutel) & gzhfd == 1 ~ lft))
But how do I refer to the row with gzhfd == 1 after the ~?
dput of example data:
structure(list(naam = c("Loekens", "Kemel", "Kemel", "Boersma",
"Kemel"), gesl = c("Man", "Vrouw", "Man", "Vrouw", "Man"), gzverh = c(6L,
5L, 5L, 4L, 2L), sleutel = c("1847LS 9", "1847LK 10", "1847LK 10",
"1847LK 10", "1847LK 10"), gzhfd = c(3, 2, 2, 2, 1), lft = c(49,
18, 22, 52, 54)), row.names = c(NA, 5L), class = "data.frame")
A combination of replace and ifelse will do the job, i.e.,
library(tidyverse)
df %>%
group_by(sleutel) %>%
mutate(lfthb = ifelse(any(gzhfd == 1), replace(lft, gzhfd != 1, lft[gzhfd == 1]), NA))
which gives,
# A tibble: 5 x 7
# Groups: sleutel [2]
naam gesl gzverh sleutel gzhfd lft lfthb
<chr> <chr> <int> <chr> <dbl> <dbl> <dbl>
1 Loekens Man 6 1847LS 9 3 49 NA
2 Kemel Vrouw 5 1847LK 10 2 18 54
3 Kemel Man 5 1847LK 10 2 22 54
4 Boersma Vrouw 4 1847LK 10 2 52 54
5 Kemel Man 2 1847LK 10 1 54 54
As @Ronak mentions, we can omit the replace part:
df %>%
group_by(sleutel) %>%
mutate(lfthb = if (any(gzhfd == 1)) lft[gzhfd == 1] else NA)
A data.table approach (returns -Inf instead of NA for the group without a head):
library(data.table)
library(magrittr) # for %>%
dt <- df %>% as.data.table() %>%
.[gzhfd == 1, lfthb := lft, by = "sleutel"] %>%
.[, lfthb := max(lfthb, na.rm = TRUE), by = "sleutel"]
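For comparison, here is a minimal base R sketch of the same lookup that leaves families without a head as NA rather than -Inf. It assumes at most one row per key has gzhfd == 1 (match() picks the first one), and the intermediate name heads is only illustrative.

```r
# Reduced copy of the example data (only the columns needed here)
df <- data.frame(
  sleutel = c("1847LS 9", "1847LK 10", "1847LK 10", "1847LK 10", "1847LK 10"),
  gzhfd   = c(3, 2, 2, 2, 1),
  lft     = c(49, 18, 22, 52, 54)
)

# One row per family head: key and age
heads <- df[df$gzhfd == 1, c("sleutel", "lft")]

# Look up the head's age by key; keys without a head yield NA
df$lfthb <- heads$lft[match(df$sleutel, heads$sleutel)]
df
```

Because match() returns NA for keys that have no head, no special case is needed for the Loekens row.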
I have a dataframe (sample) as given below
(8K rows and 1.6K sellers)
# create data frame
df <- data.frame(name =c('Tom', 'Tom', 'Tom','Tom','Tom','jack','jack','jack','jack','jack','Malik'),
week = c(1, 2, 3, 4, 5, 1, 2, 3, 4,5,1),
sell = c(20, 21, 19, 18, 23,24, 36, 35, 46, 50,44),
demand=c(28, 16, 43,NaN,NaN,30, 35, 35, 72,NaN, 60)
)
df$`demand-sell` <- df$demand - df$sell
df
Expected output logic:
For week = 4:
I would like to fill NaN values of demand for week = 4 with the sum of the remaining demand (demand - sell) of weeks 1, 2, 3 of the same seller (name).
Note: if the week = 4 demand is not NaN, then add the week = 4 demand to the sum of (demand - sell) of weeks 1, 2, 3 (e.g. in the case of name = jack).
For week = 5:
I would like to fill NaN values of demand for week = 5 with the sum of the remaining demand (demand - sell) of weeks 1, 2, 3, 4 of the same seller (name).
Note: if the week = 5 demand is not NaN, then add the week = 5 demand to the sum of (demand - sell) of weeks 1, 2, 3, 4.
Expected Output (sample data)
Update with the correct answer:
The issue was the missing is.nan(demand) check.
Here is the correct answer:
df %>%
mutate(`demand-sell` = demand - sell) %>%
group_by(name) %>%
mutate(demand=case_when(week == 4 & is.nan(demand) ~ sum(`demand-sell`[1:3]),
week == 4 & !is.nan(demand) ~ demand + sum(`demand-sell`[1:3]),
TRUE ~ demand)) %>%
mutate(`demand-sell`= case_when(week == 4 ~ demand-sell,
TRUE ~ `demand-sell`)) %>%
mutate(demand = case_when(week == 5 ~ `demand-sell`[4],
TRUE ~ demand)) %>%
mutate(`demand-sell`= case_when(week == 5 ~ demand-sell,
TRUE ~ `demand-sell`))
Correct output:
name week sell demand `demand-sell`
<chr> <dbl> <dbl> <dbl> <dbl>
1 Tom 1 20 28 8
2 Tom 2 21 16 -5
3 Tom 3 19 43 24
4 Tom 4 18 27 9
5 Tom 5 23 9 -14
6 jack 1 24 30 6
7 jack 2 36 35 -1
8 jack 3 35 35 0
9 jack 4 46 77 31
10 jack 5 50 31 -19
11 Malik 1 44 60 16
First answer:
Here is a solution. At least for Tom it is correct; I don't know whether your expected output for jack is correct.
If the logic is the same for every name, it should look like this:
df %>%
mutate(`demand-sell` = demand - sell) %>%
group_by(name) %>%
mutate(demand=case_when(week == 4 ~ sum(`demand-sell`[1:3]),
TRUE ~ demand)) %>%
mutate(`demand-sell`= case_when(week == 4 ~ demand-sell,
TRUE ~ `demand-sell`)) %>%
mutate(demand = case_when(week == 5 ~ `demand-sell`[4],
TRUE ~ demand)) %>%
mutate(`demand-sell`= case_when(week == 5 ~ demand-sell,
TRUE ~ `demand-sell`))
Output:
name week sell demand `demand-sell`
<chr> <dbl> <dbl> <dbl> <dbl>
1 Tom 1 20 28 8
2 Tom 2 21 16 -5
3 Tom 3 19 43 24
4 Tom 4 18 27 9
5 Tom 5 23 9 -14
6 jack 1 24 30 6
7 jack 2 36 35 -1
8 jack 3 35 35 0
9 jack 4 46 5 -41
10 jack 5 50 -41 -91
11 Malik 1 44 60 16
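The carry-over rule can also be checked by hand with a small base R sketch. The helper fill_demand is hypothetical (it is not part of either answer). It assumes the rows of one seller are sorted by week with weeks 1 to 3 in the first three positions, and it treats a non-NaN week 5 the way the question's note describes (add the carried remainder), which goes slightly beyond the dplyr code above.

```r
# Hypothetical helper: fill demand from week 4 onward for ONE seller.
# Assumes sell/demand are ordered by week and weeks 1-3 come first.
fill_demand <- function(sell, demand) {
  # Unmet demand accumulated over weeks 1-3
  carry <- sum(demand[1:3] - sell[1:3])
  # Positions 4, 5, ... (empty when the seller has fewer than 4 weeks)
  for (w in seq_along(demand)[-(1:3)]) {
    base <- if (is.nan(demand[w])) 0 else demand[w]
    demand[w] <- base + carry          # NaN is replaced, a known value is topped up
    carry <- demand[w] - sell[w]       # remainder carried into the next week
  }
  demand
}

tom  <- fill_demand(sell = c(20, 21, 19, 18, 23), demand = c(28, 16, 43, NaN, NaN))
jack <- fill_demand(sell = c(24, 36, 35, 46, 50), demand = c(30, 35, 35, 72, NaN))
tom   # 28 16 43 27  9
jack  # 30 35 35 77 31
```

For Tom, weeks 1 to 3 leave 8 - 5 + 24 = 27, so week 4 demand becomes 27 and its remainder 27 - 18 = 9 becomes the week 5 demand. For jack, the known week 4 demand of 72 is topped up by 6 - 1 + 0 = 5 to 77, which matches the corrected output.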
I'm trying to use R to recreate baseball splits as found on MLB.com. The splits are created from game logs and provide different cuts of the data: for example, home games vs. away games, day games vs. night games, August vs. September, and many more, all in one convenient table. I believe the ratios (AVG, OBP, SLG) can all be added via mutate once the basic splits have been totaled.
My question is: what is the best and most efficient way to create these splits, and how should the data be shaped? The game log obviously has additional (hidden) column(s) that contain the split topics. The nature of the problem leads me to believe purrr might be a tool to employ, but I can't quite wrap my mind around how to approach this one.
Here is how I believe the data should be shaped and a link to a sample game log. I would appreciate any thoughts, ideas or solutions to this problem.
Links and images of the game logs and splits for Nationals outfielder Juan Soto are set forth below.
Game Logs: Juan Soto Game Log
Splits: Juan Soto Game Splits
I've gone through the dataset, although I'm not sure whether the sums, or the averages, match those in the images above.
You're right that mutate is the way to create the values you mention.
Hopefully my approach helps you get what you're after.
library(tidyverse)
library(data.table)
game.splits <- "https://raw.githubusercontent.com/MundyMSDS/GAMELOG/main/SAMPLE_GAME_LOG.csv"
game.splits <- fread(game.splits, fill = TRUE)
game.splits.pivot <- game.splits
game.splits.pivot$Var1 <- ifelse(game.splits.pivot$Var1 %in% "HOME", 1, 0)
game.splits.pivot$Var2 <- ifelse(game.splits.pivot$Var2 %in% "NIGHT", 3, 2)
game.splits.pivot$Var3 <- ifelse(game.splits.pivot$Var3 %in% "SEPTEMBER", 5, 4)
game.splits.pivot <- game.splits.pivot %>% pivot_longer(-c(1:16, 20))
colnames(game.splits.pivot)[19] <- "name_c"
game.splits.pivot <- game.splits.pivot[, -c(17, 18)]
game.splits.pivot <- game.splits.pivot %>% pivot_longer(-c(1:3, 17))
#test
game.splits.pivot_test <- game.splits.pivot[, -c(1, 2, 3)]
game.splits.pivot_test <- aggregate(value ~ name_c + name, game.splits.pivot_test, sum)
game.splits.pivot_test <- game.splits.pivot_test %>% pivot_wider(names_from = name, values_from = value)
lc_name <- tibble(name_c = 0:5, split = c("HOME", "AWAY", "DAY", "NIGHT", "AUGUST", "SEPTEMBER"))
game.splits.pivot_test <- game.splits.pivot_test %>%
inner_join(lc_name, by = "name_c") %>%
arrange(name_c) %>%
select(-name_c)
game.splits.pivot_test <- game.splits.pivot_test[, c(14, 3, 9, 6, 1, 2, 7, 10, 4, 8, 12, 11, 5, 13)]
A look into the dataset:
# A tibble: 6 x 14
split AB R H `2B` `3B` HR RBI BB IBB SO SB CS TB
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 HOME 88 24 32 5 0 9 23 15 5 12 1 2 64
2 AWAY 66 15 22 9 0 4 14 26 7 16 5 0 43
3 DAY 29 21 18 4 0 5 17 12 4 3 4 0 37
4 NIGHT 125 18 36 10 0 8 20 29 8 25 2 2 70
5 AUGUST 90 21 33 6 0 11 25 13 1 13 1 1 72
6 SEPTEMBER 64 18 21 8 0 2 12 28 11 15 5 1 35
This turned out to be more straightforward than I had thought. The following solution relies on pivot_longer to shape the data and summarise_if to tally the splits; no rbinds or purrr needed.
library(tidyverse)
game.splits <- "https://raw.githubusercontent.com/MundyMSDS/GAMELOG/main/SAMPLE_GAME_LOG.csv"
game.splits <- read_csv(game.splits)
game.splits %>%
pivot_longer(Var1:Var3, names_to = "split") %>%
group_by(split) %>%
arrange(split) %>%
select(split, value, everything()) %>%
ungroup() %>%
select(split, value, everything()) %>%
select(-Date, -OPP) %>%
mutate(value = str_c(split, "_", value)) %>%
group_by(value) %>%
summarise_if(is.numeric, sum) %>%
mutate(value= str_replace(value, "(Var\\d_)",""))
#> # A tibble: 6 x 14
#> value AB R H TB `2B` `3B` HR RBI BB IBB SO SB
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AWAY 88 24 32 64 5 0 9 23 15 5 12 1
#> 2 HOME 66 15 22 43 9 0 4 14 26 7 16 5
#> 3 DAY 29 21 18 37 4 0 5 17 12 4 3 4
#> 4 NIGHT 125 18 36 70 10 0 8 20 29 8 25 2
#> 5 AUGUST 90 21 33 72 6 0 11 25 13 1 13 1
#> 6 SEPTE~ 64 18 21 35 8 0 2 12 28 11 15 5
Created on 2021-03-03 by the reprex package (v0.3.0)
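Once the totals exist, the ratios mentioned in the question are plain column arithmetic. The base R sketch below copies the AB, H and TB totals of the first two rows of the table above and adds AVG (H / AB) and SLG (TB / AB). OBP is left out because its usual definition, (H + BB + HBP) / (AB + BB + HBP + SF), needs HBP and SF, which do not appear in the printed columns.

```r
# Totals copied from the first two rows of the summarised table above
splits <- data.frame(
  value = c("AWAY", "HOME"),
  AB    = c(88, 66),
  H     = c(32, 22),
  TB    = c(64, 43)
)

# Batting average and slugging percentage, rounded to three decimals
splits$AVG <- round(splits$H  / splits$AB, 3)
splits$SLG <- round(splits$TB / splits$AB, 3)
splits
```

Inside the pipeline, the same two expressions could go into a mutate() call after summarise_if().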
I have a data frame
> dput(df)
structure(list(id = c(1, 2, 3, 4, 1, 2, 3, 4), level = structure(c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("g01", "g02"), class = "factor"),
m_col = c(1, 2, 3, 4, 11, 22, 33, 44), u_col = c(11, 12,
13, 14, 21, 22, 23, 24), group = c(0, 0, 1, 1, 0, 0, 1, 1
)), row.names = c(NA, -8L), class = "data.frame")
Which looks like this
id level m_col u_col group
1 1 g01 1 11 0
2 2 g01 2 12 0
3 3 g01 3 13 1
4 4 g01 4 14 1
5 1 g02 11 21 0
6 2 g02 22 22 0
7 3 g02 33 23 1
8 4 g02 44 24 1
I want to perform a binomial weighted test on each 'level' (I need to compare u_col and m_col for each id, essentially) ... so using tidyverse and broom I can do the following:
res <- df %>%
group_by(level) %>%
do(tidy(glm(cbind(.$m_col,.$u_col) ~ .$group, family="binomial"))) %>%
filter(term == ".$group")
Which gives me some p-values for each level:
> res
# A tibble: 2 x 6
# Groups: level [2]
level term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 g01 .$group 0.687 0.746 0.921 0.357
2 g02 .$group 0.758 0.296 2.56 0.0105
I can then ask how many p<0.05
length(which(res$p.value < 0.05))
I would now like to permute the data, repeat the binomial test, ask how many p's < 0.05 and then store that value, and then repeat 999 more times.
HOWEVER, the permutation needs to shuffle the 'group' column within each 'level'. I'm struggling to find a way to do this, so for example one permutation would look like this
id level m_col u_col group
1 1 g01 1 11 1
2 2 g01 2 12 0
3 3 g01 3 13 1
4 4 g01 4 14 0
5 1 g02 11 21 1
6 2 g02 22 22 0
7 3 g02 33 23 1
8 4 g02 44 24 0
A second would look like
id level m_col u_col group
1 1 g01 1 11 0
2 2 g01 2 12 1
3 3 g01 3 13 1
4 4 g01 4 14 0
5 1 g02 11 21 0
6 2 g02 22 22 1
7 3 g02 33 23 1
8 4 g02 44 24 0
etc
Having the test rely on 2 columns limits the shuffle options and I'm stumped. I would appreciate any advice.
If you want a dataframe you may try this:
library(tidyverse)
library(broom) # for tidy(); not attached by library(tidyverse)
map_dfr(1:1000, ~ df %>%
group_by(level) %>%
mutate(group = group[sample(row_number())]) %>% # permutation shuffle the 'group' column within each 'level'.
do(tidy(glm(cbind(.$m_col,.$u_col) ~ .$group, family="binomial"))) %>%
filter(term == ".$group") %>%
ungroup() %>%
summarise(sum(p.value < 0.05))) # ask how many p<0.05
and if you want a vector:
map_dbl(1:1000, ~ df %>%
group_by(level) %>%
mutate(group = group[sample(row_number())]) %>% # permutation shuffle the 'group' column within each 'level'.
do(tidy(glm(cbind(.$m_col,.$u_col) ~ .$group, family="binomial"))) %>%
filter(term == ".$group") %>%
ungroup() %>%
summarise(sum(p.value < 0.05)) %>% # ask how many p<0.05
pull())
You could write a function :
library(dplyr)
library(broom)
apply_fun <- function(data) {
sum(subset(tidy(glm(cbind(m_col, u_col)~group, data,
family="binomial")), term == 'group')$p.value < 0.05)
}
and then use replicate to repeat it.
result <- replicate(100, df %>%
group_by(level) %>%
mutate(group = sample(group)) %>%
summarise(value = apply_fun(cur_data())), simplify = FALSE)
result
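To see what shuffling within each level does on its own, here is a base R sketch using ave(), which applies a function to each level's slice of a vector and puts the results back in the original row order. The reduced data frame is only illustrative. Note that sample(x) behaves differently when x is a single number (it samples from 1:x), so this sketch assumes every level has more than one row.

```r
# Reduced copy of the example data: two levels with two 0s and two 1s each
df <- data.frame(
  level = rep(c("g01", "g02"), each = 4),
  group = c(0, 0, 1, 1, 0, 0, 1, 1)
)

set.seed(1)  # only to make the illustration reproducible

# Permute 'group' separately inside each level
df$group_perm <- ave(df$group, df$level, FUN = sample)

# The number of 1s per level is unchanged by the permutation
tapply(df$group, df$level, sum)
tapply(df$group_perm, df$level, sum)
```

Each level keeps its two 0s and two 1s; only their assignment to ids changes, which is what a within-level permutation test requires.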
I would like to calculate the entropy of a continuous variable of each group in R.
Here is an example of the data:
id group X
1 1 1 28
2 2 1 45
3 3 2 21
4 4 2 46
5 5 3 82
6 6 3 98
There are actually 273 groups and more variables other than X.
I would like to be able to calculate the entropy of X for each "group".
I have tried using the group_by and summarise commands in dplyr, but I don't believe there is an appropriate command for entropy.
Hopefully there is a simple solution to this.
Thanks in advance.
There is an entropy package in R.
#install.packages('entropy')
library(entropy)
library(dplyr) # for %>%, group_by, mutate, summarize
df %>%
group_by(group) %>%
mutate(entropy = entropy(X))
Gives us:
id group X entropy
<dbl> <dbl> <dbl> <dbl>
1 1 1 28 0.666
2 2 1 45 0.666
3 3 2 21 0.622
4 4 2 46 0.622
5 5 3 82 0.689
6 6 3 98 0.689
Or for aggregate results:
df %>%
group_by(group) %>%
summarize(entropy = entropy(X))
Gives us:
group entropy
<dbl> <dbl>
1 1 0.666
2 2 0.622
3 3 0.689
data:
df <- structure(list(id = c(1, 2, 3, 4, 5, 6), group = c(1, 1, 2, 2,
3, 3), X = c(28, 45, 21, 46, 82, 98)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
In base R, you could write the entropy function once, outside the grouping call, instead of redefining it each time, i.e.
entropy_base <- function(x) -sum((x<-prop.table(x))*log(x))
if you want to summarize:
aggregate(X~group,df, entropy_base)
group X
1 1 0.6657808
2 2 0.6218199
3 3 0.6891913
if you want to mutate:
transform(df,entropy=ave(X,group,FUN = entropy_base))
id group X entropy
1 1 1 28 0.6657808
2 2 1 45 0.6657808
3 3 2 21 0.6218199
4 4 2 46 0.6218199
5 5 3 82 0.6891913
6 6 3 98 0.6891913
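To make explicit what entropy_base computes, the sketch below works through group 1 by hand. The values are normalised to proportions and plugged into the Shannon formula with natural logarithms. This treats X as counts or weights within the group; as far as I can tell that is also what the entropy package does by default, but treat that as an assumption, and note it is not a differential entropy of a continuous variable.

```r
# Group 1 of the example data
x <- c(28, 45)

# Normalise to proportions that sum to 1
p <- x / sum(x)

# Shannon entropy in nats: -sum(p * log(p))
H <- -sum(p * log(p))
H
```

The result, about 0.6658, agrees with the value shown for group 1 in the outputs above.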
I would recommend the ddply function in the plyr package: https://www.rdocumentation.org/packages/plyr/versions/1.8.6/topics/ddply
This function allows you to pass data, columns to group by, and apply a function.
Example:
ddply(data, .(Group), summarise, Entropy(x,Y))
Where I have Entropy(x,Y) you could include whichever entropy function you would like.
My data came to me like this (but with 4000+ records). The following is data for 4 patients. Every time you see surgery OR age reappear, it is referring to a new patient.
col1 = c("surgery", "age", "weight","albumin","abiotics","surgery","age", "weight","BAPPS", "abiotics","surgery", "age","weight","age","weight","BAPPS","albumin")
col2 = c("yes","54","153","normal","2","no","65","134","yes","1","yes","61","210", "46","178","no","low")
testdat = data.frame(col1,col2)
So to say again, every time surgery or age appear (surgery isn't always there, but age is), those records and the ones after pertain to the same patient until you see surgery or age appear again.
Thus I somehow need to add an ID column with this data:
ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,4,4,4,4)
testdat$ID = ID
I know how to transpose and melt and all that to put the data into regular format, but how can I create that ID column?
Advice on relevant tags to use is helpful!
Assuming that surgery and age are the first pieces of information for each patient, and that each patient has at least one piece of information afterward that is neither age nor surgery, this is a solution.
col1 = c("surgery", "age", "weight","albumin","abiotics","surgery","age", "weight","BAPPS", "abiotics","surgery", "age","weight","age","weight","BAPPS","albumin")
col2 = c("yes","54","153","normal","2","no","65","134","yes","1","yes","61","210", "46","178","no","low")
testdat = data.frame(col1,col2)
library(tidyverse)
# Use a tibble and make sure both columns are character, not factor.
dfTest = as_tibble(testdat) %>%
mutate_all(as.character)
# A little dplyr magic to find the start of each new patient, then give them an id.
dfTest = dfTest %>%
mutate(couldBeStart = if_else(col1 == "surgery" | col1 == "age", T, F)) %>%
mutate(isStart = couldBeStart & !lag(couldBeStart, default = FALSE)) %>%
mutate(patientID = cumsum(isStart)) %>%
select(-couldBeStart, -isStart)
# # A tibble: 17 x 3
# col1 col2 patientID
# <chr> <chr> <int>
# 1 surgery yes 1
# 2 age 54 1
# 3 weight 153 1
# 4 albumin normal 1
# 5 abiotics 2 1
# 6 surgery no 2
# 7 age 65 2
# 8 weight 134 2
# 9 BAPPS yes 2
# 10 abiotics 1 2
# 11 surgery yes 3
# 12 age 61 3
# 13 weight 210 3
# 14 age 46 4
# 15 weight 178 4
# 16 BAPPS no 4
# 17 albumin low 4
# Get the data to a wide workable format.
dfTest %>% spread(col1, col2)
# # A tibble: 4 x 7
# patientID abiotics age albumin BAPPS surgery weight
# <int> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 2 54 normal NA yes 153
# 2 2 1 65 NA yes no 134
# 3 3 NA 61 NA NA yes 210
# 4 4 NA 46 low no NA 178
Using dplyr:
library(dplyr)
testdat = testdat %>%
mutate(patient_counter = cumsum(col1 == 'surgery' | (col1 == 'age' & lag(col1, default = '') != 'surgery')))
This works by checking whether the col1 value is either 'surgery' or 'age', provided 'age' is not preceded by 'surgery'. It then uses cumsum() to get the cumulative sum of the resulting logical vector.
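The same idea can be traced step by step in base R. In this sketch the short vector col1 and the names prev and is_start are only illustrative. Each TRUE marks the first row of a patient, and cumsum() counts how many starts have been seen so far.

```r
# Two patients: the first starts with 'surgery', the second with 'age'
col1 <- c("surgery", "age", "weight", "age", "weight")

# Value of the previous row ("" for the first row, so it is never 'surgery')
prev <- c("", head(col1, -1))

# A row starts a patient if it is 'surgery', or 'age' not preceded by 'surgery'
is_start <- col1 == "surgery" | (col1 == "age" & prev != "surgery")

# TRUE counts as 1, so the running total is the patient id
id <- cumsum(is_start)
id
```

Here is_start is TRUE, FALSE, FALSE, TRUE, FALSE, so id is 1, 1, 1, 2, 2.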
You can try the following
keywords <- c('surgery', 'age')
lgl <- testdat$col1 %in% keywords
testdat$ID <- cumsum(c(0, diff(lgl)) == 1) + 1
col1 col2 ID
1 surgery yes 1
2 age 54 1
3 weight 153 1
4 albumin normal 1
5 abiotics 2 1
6 surgery no 2
7 age 65 2
8 weight 134 2
9 BAPPS yes 2
10 abiotics 1 2
11 surgery yes 3
12 age 61 3
13 weight 210 3
14 age 46 4
15 weight 178 4
16 BAPPS no 4
17 albumin low 4