Creating a variable by group for sample data in R

I have a sample dataset (which I did not make myself) as follows:
panelID <- c(1:50)
year <- c(2005, 2010)
country <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
n <- 2
library(data.table)
set.seed(123)
DT <- data.table(country = rep(sample(country, length(panelID), replace = TRUE), each = n),
                 year = c(replicate(length(panelID), sample(year, n))),
                 sales = rnorm(length(panelID) * n, 10, 10)) # assumed: the snippet references a sales column below, but its definition was cut off
DT[, uniqueID := .I]          # creates a unique ID per row
DT[DT == 0] <- NA             # treat zeros as missing
DT$sales[DT$sales < 0] <- NA  # treat negative sales as missing
DT <- as.data.frame(DT)
I always struggle when I want to create a new variable that has to meet certain conditions.
I would like to create a tax rate for my sample dataset. The tax rate has to be the same within each country-year, lie between 10% and 40%, and vary by no more than 5% within a country.
I cannot seem to figure out how to do it. It would be great if someone could point me in the right direction.

Not 100% sure what you are looking for. You could use dplyr:
library(dplyr)
DT %>%
  group_by(country) %>%
  mutate(base_rate = as.integer(runif(1, 12.5, 37.5))) %>%
  group_by(country, year) %>%
  mutate(tax_rate = base_rate + as.integer(runif(1, -2.5, 2.5)))
which returns
# A tibble: 100 x 6
# Groups: country, year [20]
country year uniqueID sales base_rate tax_rate
<chr> <dbl> <int> <lgl> <int> <int>
1 C 2005 1 NA 26 26
2 C 2010 2 NA 26 26
3 C 2010 3 NA 26 26
4 C 2005 4 NA 26 26
5 J 2005 5 NA 21 21
6 J 2010 6 NA 21 20
7 B 2010 7 NA 20 20
8 B 2005 8 NA 20 22
9 F 2010 9 NA 26 26
10 F 2005 10 NA 26 26
I first created a random base_rate per country and then a random tax_rate per country and year.
I used integer rates, but you could easily replace them with real-valued percentages.
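If you want to sanity-check that the result meets the constraints (rates between 10 and 40, no more than 5 apart within a country), a quick grouped summary works; this is just a sketch, where result is an assumed name for the pipeline's output:
# the answer's pipeline, assigned so we can inspect it
result <- DT %>%
  group_by(country) %>%
  mutate(base_rate = as.integer(runif(1, 12.5, 37.5))) %>%
  group_by(country, year) %>%
  mutate(tax_rate = base_rate + as.integer(runif(1, -2.5, 2.5)))
result %>%
  group_by(country) %>%
  summarise(min_rate = min(tax_rate),                   # should be >= 10
            max_rate = max(tax_rate),                   # should be <= 40
            spread   = max(tax_rate) - min(tax_rate))   # should be <= 5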

Related

Creating a summary row for each group in a dataframe based on other variables in the group

Fairly new to R, I ended up in the following situation: I want to create a summary row for each group in the dataframe (grouped by Year and Model), where the value of that row comes from subtracting the values of some Variables from another within the group.
df <- data.frame(Model = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
                 Year = c(2020, 2020, 2020, 2020, 2020, 2020, 2020, 2030, 2030, 2030, 2040, 2040, 2040),
                 Variable = c("A", "B", "C", "A", "B", "C", "D", "A", "C", "E", "A", "C", "D"),
                 value = c(15, 2, 5, 25, 6, 4, 4, 41, 24, 1, 15, 3, 2))
I have managed to create a new row for each group, so it already has a Year and a Variable name that I manually specified, using:
df <- df %>%
  group_by(Model, Year) %>%
  group_modify(~ add_row(., Variable = "New", .before = 0))
However, I am struggling to write the expression that calculates the value.
What I want instead of the NAs is the value of A - B - D in each group.
I would appreciate any help. My first thread here; pardon any inconvenience.
You could pivot wide and then back; this would add rows with zeros where missing:
library(dplyr); library(tidyr)
df %>%
  pivot_wider(names_from = Variable, values_from = value, values_fill = 0) %>%
  mutate(new = A - B - D) %>%
  pivot_longer(-c(Model, Year), names_to = "Variable")
# A tibble: 24 × 4
Model Year Variable value
<dbl> <dbl> <chr> <dbl>
1 1 2020 A 15
2 1 2020 B 2
3 1 2020 C 5
4 1 2020 D 0
5 1 2020 E 0
6 1 2020 new 13 # 15 - 2 - 0 = 13
7 2 2020 A 25
8 2 2020 B 6
9 2 2020 C 4
10 2 2020 D 4
# … with 14 more rows
EDIT - a variation where we leave the missing values alone and use coalesce(x, 0) so the subtraction treats NAs as zeroes. The pivot_wider creates NAs in the missing spots, but we can exclude them in the pivot_longer using values_drop_na = TRUE.
df %>%
  pivot_wider(names_from = Variable, values_from = value) %>%
  mutate(new = A - coalesce(B, 0) - coalesce(D, 0)) %>%
  pivot_longer(-c(Model, Year), names_to = "Variable", values_drop_na = TRUE)
# A tibble: 17 × 4
Model Year Variable value
<dbl> <dbl> <chr> <dbl>
1 1 2020 A 15
2 1 2020 B 2
3 1 2020 C 5
4 1 2020 new 13
5 2 2020 A 25
6 2 2020 B 6
7 2 2020 C 4
8 2 2020 D 4
9 2 2020 new 15
10 2 2030 A 41
11 2 2030 C 24
12 2 2030 E 1
13 2 2030 new 41
14 2 2040 A 15
15 2 2040 C 3
16 2 2040 D 2
17 2 2040 new 13
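If you'd rather skip the reshaping, a minimal sketch of the same idea is to compute the summary rows per group with summarise() and bind them back on; sum() over an empty selection returns 0, so a missing Variable is treated as zero. The row label "new" is just a placeholder:
new_rows <- df %>%
  group_by(Model, Year) %>%
  summarise(value = sum(value[Variable == "A"]) -
                    sum(value[Variable == "B"]) -
                    sum(value[Variable == "D"]),
            Variable = "new",
            .groups = "drop")
bind_rows(df, new_rows) %>% arrange(Model, Year)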

How to merge rows based on common columns in R

I am working in a dataframe that looks something like this:
vims <- data.frame(
  patient_ID = c("a", "a", "a", "b", "b"),
  Date = c(2020, 2020, 2018, 2020, 2028),
  Eye = c("Right", "Left", "Right", "Right", "Right"),
  V1 = c(21, 18, 30, 30, 18),
  V2 = c(28, 30, 15, 45, 60)
)
As you can see, the data has an ID, there may be several evaluations on different dates for the same ID, and there may be evaluations for different eyes within those dates. I am trying to merge rows so that the data is arranged by ID and date, with a single row containing the ID, the date, and all the info for each eye (V1 and V2 for the right and left eye, if available).
Are you looking for this:
library(dplyr)
library(tidyr)
vims %>%
  pivot_wider(id_cols = c(patient_ID, Date), names_from = Eye, values_from = c(V1, V2))
# A tibble: 4 x 6
patient_ID Date V1_Right V1_Left V2_Right V2_Left
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 2020 21 18 28 30
2 a 2018 30 NA 15 NA
3 b 2020 30 NA 45 NA
4 b 2028 18 NA 60 NA
A data.table alternative:
library(data.table)
vims <- as.data.table(vims)
dcast(vims, patient_ID+Date~Eye, value.var = c("V1","V2"))
patient_ID Date V1_Left V1_Right V2_Left V2_Right
1: a 2018 NA 30 NA 15
2: a 2020 18 21 30 28
3: b 2020 NA 30 NA 45
4: b 2028 NA 18 NA 60
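For completeness, a base-R sketch with reshape() gives the same wide layout (run on the original data.frame, before the as.data.table() conversion above, and assuming one row per patient/date/eye combination):
# creates columns V1.Right, V1.Left, V2.Right, V2.Left
reshape(vims, idvar = c("patient_ID", "Date"), timevar = "Eye", direction = "wide")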

Count number of times two values co-occur within a group in R

I've searched for answers to this question and found similar ones (Count number of rows within each group, count unique combinations of variable values in an R dataframe column, R count occurrences of an element by groups) but none of them address my particular issue.
I have a dataframe with variables year, ID, and code. Every person has an ID and can have multiple code values over the course of (potentially) multiple years.
df = data.frame(ID = c(1,1,1,1, 2,2,2, 3, 4,4,4,4,4,4,4,4, 5,5,5),
                year = c(2018, 2018, 2020, 2020,
                         2020, 2020, 2020,
                         2011,
                         2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020,
                         2018, 2019, 2020),
                code = c("A", "B", "C", "D",
                         "A", "B", "Q",
                         "G",
                         "A", "B", "Q", "G", "C", "D", "T", "S",
                         "S", "Z", "F")
                )
df
ID year code
1 1 2018 A
2 1 2018 B
3 1 2020 C
4 1 2020 D
5 2 2020 A
6 2 2020 B
7 2 2020 Q
8 3 2011 G
9 4 2019 A
10 4 2019 B
11 4 2019 Q
12 4 2019 G
13 4 2020 C
14 4 2020 D
15 4 2020 T
16 4 2020 S
17 5 2018 S
18 5 2019 Z
19 5 2020 F
What I would like is another dataframe giving the number of times two different values of code co-occurred within groupings of ID and year (in this example, A and B co-occurred 3 times, and A and C co-occurred 0 times), which I will then use for a network analysis.
So far I have this syntax:
1: Make a wide version of the data
library(tidyverse)
wide = df %>%
  group_by(year, ID) %>%
  mutate(row = row_number()) %>%
  ungroup() %>%
  pivot_wider(
    id_cols = c(ID, year),
    names_from = row,
    names_prefix = "code_",
    values_from = code
  )
2: Make a node list
nodes = distinct(df, code) %>% rowid_to_column("id")
3: Make an edge list
#edge list needs to be three vars: source, dest, and weight
# source and dest are simply code names that (potentially) co-occur in the same year for an ID
# weight is the number of times the codes co-occurred in the same year for each ID.
#all combinations of two codes
edges = combn(x = nodes$code, m = 2) %>%
  t() %>%
  as.data.frame()
colnames(edges) = c("source", "dest")
edges$weight = NA_integer_
#oh, no! a for() loop! a coder's last ditch effort to make something work
for (i in 1:nrow(edges)) {
  source = edges$source[i]
  dest = edges$dest[i]
  # get the cases with the first code of interest
  temp = df %>%
    filter(code == source) %>%
    select(ID, year)
  # get the other codes that occurred for that ID in that year
  temp = left_join(temp, wide, by = c("ID", "year"))
  # convert to a logical showing if the other codes are the one I want
  temp = temp %>% mutate_at(vars(starts_with("code_")),
                            function(x) { x == dest })
  # sum the number of times `source` and `dest` co-occurred
  temp$dest = temp %>% select(starts_with("code_")) %>% rowSums(., na.rm = TRUE)
  edges$weight[i] = sum(temp$dest, na.rm = TRUE)
}
Edit to add the result:
edges
source dest weight
1 A B 3
2 A C 0
3 A D 0
4 A Q 2
5 A G 1
6 A T 0
7 A S 0
8 A Z 0
9 A F 0
10 B C 0
11 B D 0
12 B Q 2
13 B G 1
14 B T 0
15 B S 0
16 B Z 0
17 B F 0
18 C D 2
19 C Q 0
20 C G 0
21 C T 1
22 C S 1
23 C Z 0
24 C F 0
25 D Q 0
26 D G 0
27 D T 1
28 D S 1
29 D Z 0
30 D F 0
31 Q G 1
32 Q T 0
33 Q S 0
34 Q Z 0
35 Q F 0
36 G T 0
37 G S 0
38 G Z 0
39 G F 0
40 T S 1
41 T Z 0
42 T F 0
43 S Z 0
44 S F 0
45 Z F 0
That gives me what I want (a dataframe showing A and B co-occurred 3 times, A and C co-occurred 0 times, A and D co-occurred 0 times, A and G co-occurred 1 time, A and Q co-occurred 2 times, etc...). So this works, but it takes a second or two even for this small example. My real data set is ~3,000,000 observations. I let it run for a while but stopped it only to find it was ~1% complete.
Is there a better/faster way to do this?
Here's an alternative which just does a join instead and so is probably very fast for large data.
library(data.table)
setDT(df)
df[df, on = c('ID', 'year'), allow.cartesian = TRUE][
  code < i.code, .N, .(pair = paste0(code, i.code))]
#> pair N
#> 1: AB 3
#> 2: CD 2
#> 3: AQ 2
#> 4: BQ 2
#> 5: GQ 1
#> 6: AG 1
#> 7: BG 1
#> 8: CT 1
#> 9: DT 1
#> 10: ST 1
#> 11: CS 1
#> 12: DS 1
This should work. You'll only get one entry for each pair because of the sort.
library(data.table)
setDT(df)
all_pairs <- function(x) {
  if (length(x) > 1) {
    sapply(combn(sort(x), 2, simplify = FALSE), paste, collapse = '')
  } else {
    c()
  }
}
df[, .(pairs = all_pairs(code)), .(ID, year)][, .N, .(pairs)]
#> pairs N
#> 1: AB 3
#> 2: CD 2
#> 3: AQ 2
#> 4: BQ 2
#> 5: AG 1
#> 6: BG 1
#> 7: GQ 1
#> 8: CS 1
#> 9: CT 1
#> 10: DS 1
#> 11: DT 1
#> 12: ST 1
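Note that, unlike the loop in the question, both data.table answers drop pairs that never co-occur. If you also need those zero-weight pairs (as in the edges data frame above), one sketch is to left-join the counts from the self-join onto all combn() pairs; all_edges and counts are just local names used here:
library(data.table)
setDT(df)
# all 45 unordered pairs of codes, sorted within each pair
all_edges <- as.data.table(t(combn(sort(unique(df$code)), 2)))
setnames(all_edges, c("source", "dest"))
# co-occurrence counts from the self-join, kept as two columns
counts <- df[df, on = c("ID", "year"), allow.cartesian = TRUE][
  code < i.code, .N, .(source = code, dest = i.code)]
# join and fill the never-co-occurring pairs with zero
edges <- counts[all_edges, on = c("source", "dest")]
edges[is.na(N), N := 0]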

R insert week number from vector and perform na.locf afterwards

For a dataframe similar to the one below (but much larger, obviously) I want to add missing week numbers from a vector (named weeks below). In the end, each value of var1 should have a row for each of weeks 40-42, so the weeks that need inserting can differ between values of var1. Initially the inserted rows can have the value NA, but as a second step I would like to perform na.locf for each value of var1. Does anyone know how to do this?
Data frame example:
dat <- data.frame(var1 = rep(c('a', 'b', 'c', 'd'), 3),
                  week = c(rep(40, 4), rep(41, 4), rep(42, 4)),
                  value = c(2, 3, 3, 2, 4, 5, 5, 6, 8, 9, 10, 10))
dat <- dat[-c(6, 11), ]
weeks <- c(40:42)
Like this?
library(dplyr)
dat %>%
  tidyr::complete(var1, week) %>%
  group_by(var1) %>%
  arrange(week, .by_group = TRUE) %>%
  tidyr::fill(value)
# A tibble: 12 x 3
# Groups: var1 [4]
var1 week value
<fct> <dbl> <dbl>
1 a 40 2
2 a 41 4
3 a 42 8
4 b 40 3
5 b 41 3
6 b 42 9
7 c 40 3
8 c 41 5
9 c 42 5
10 d 40 2
11 d 41 6
12 d 42 10
Hi, have you considered tidyr::complete() and dplyr::fill()? Grouping by var1 keeps the fill from carrying values across groups:
library(dplyr)
library(tidyr)
complete(dat, week = weeks, var1 = c("a", "b", "c", "d")) %>%
  group_by(var1) %>%
  fill(value, .direction = "down") %>%
  ungroup()

Manually calculate variance from count data for categorical ratings

I am trying to manually calculate the variance (and mean) from categorical rating count data.
Item <- c("A", "B", "C", "D")
cat1 <- c(4, 12, 17, NA)
cat2 <- c(NA, 10, 20, 15)
cat3 <- c(17, 5, 12, 6)
cat4 <- c(10, 12, 17, NA)
cat5 <- c(3, 21, NA, 16)
cat6 <- c(2, 14, 12, 20)
cat7 <- c(7, NA, 18, 23)
Data <- data.frame(Item = Item, Never = cat1, Rarely = cat2, Occasionally = cat3,
                   Sometimes = cat4, Frequently = cat5, Usually = cat6, Always = cat7,
                   stringsAsFactors = FALSE)
Data
Item Never Rarely Occasionally Sometimes Frequently Usually Always
1 A 4 NA 17 10 3 2 7
2 B 12 10 5 12 21 14 NA
3 C 17 20 12 17 NA 12 18
4 D NA 15 6 NA 16 20 23
Each categorical rating has an equivalent numeric value (1:7). I have calculated the average numerical rating for each Item as follows:
Rating_wt <- 1:7  # vector of weights for each frequency rating
Rating.wt.mat <- rep(Rating_wt, each = dim(Data[, 2:8])[1])
Data$Avg_rating <- rowSums(Data[, 2:8] * Rating.wt.mat, na.rm = TRUE) / rowSums(Data[, 2:8], na.rm = TRUE)
Data
Item Never Rarely Occasionally Sometimes Frequently Usually Always Avg_rating
1 A 4 NA 17 10 3 2 7 3.976744
2 B 12 10 5 12 21 14 NA 3.837838
3 C 17 20 12 17 NA 12 18 3.739583
4 D NA 15 6 NA 16 20 23 5.112500
I would like to also calculate the variance for each Average and store that as a new variable in Data.
I believe I need to subtract the Average for each item from each numeric rating, square that difference, multiply it by the count in each respective cell, then sum those results across rows and divide by the total counts in each row.
But I can't figure out how to set up the element-wise calculations to accomplish that.
Conceptually, I think it should be something like this:
Data$Rating_var <- rowSums((Numeric_Rating - Avg_rating)^2 * Value, na.rm = TRUE) / rowSums(Data[, 2:8], na.rm = TRUE)
Where Numeric_Rating corresponds to Rating_wt:
Never = 1
Rarely = 2
Occasionally = 3
Sometimes = 4
Frequently = 5
Usually = 6
Always = 7
and Value is the corresponding cell for each Numeric_Rating by Item intersection.
I'd suggest you try to reshape your dataset before you apply your calculations, as it will be easier.
library(dplyr)
library(tidyr)
Item <- c("A", "B", "C", "D")
cat1 <- c(4, 12, 17, NA)
cat2 <- c(NA, 10, 20, 15)
cat3 <- c(17, 5, 12, 6)
cat4 <- c(10, 12, 17, NA)
cat5 <- c(3, 21, NA, 16)
cat6 <- c(2, 14, 12, 20)
cat7 <- c(7, NA, 18, 23)
Data <- data.frame(Item = Item, Never = cat1, Rarely = cat2, Occasionally = cat3,
                   Sometimes = cat4, Frequently = cat5, Usually = cat6, Always = cat7,
                   stringsAsFactors = FALSE)
Data %>%
  gather(category, value, -Item) %>%                        # reshape dataset to long form
  mutate(Rating = recode(category, "Never" = 1, "Rarely" = 2, "Occasionally" = 3,
                         "Sometimes" = 4, "Frequently" = 5,
                         "Usually" = 6, "Always" = 7)) %>%  # assign numeric ratings
  group_by(Item) %>%                                        # for each item
  mutate(Avg = sum(Rating * value, na.rm = TRUE) / sum(value, na.rm = TRUE),  # weighted average rating
         variance = sum((Rating - Avg)^2 * value, na.rm = TRUE) / sum(value, na.rm = TRUE)) %>%  # weighted (population) variance; note the squared deviation
  ungroup() %>%                                             # forget the grouping
  select(-Rating) %>%                                       # no need for the rating any more
  spread(category, value) %>%                               # reshape back to original form
  select(all_of(c(names(Data), "Avg", "variance")))         # get columns in the desired order
# # A tibble: 4 x 10
# Item Never Rarely Occasionally Sometimes Frequently Usually Always Avg variance
# * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A 4 NA 17 10 3 2 7 3.976744 2.952948
# 2 B 12 10 5 12 21 14 NA 3.837838 3.081812
# 3 C 17 20 12 17 NA 12 18 3.739583 4.671766
# 4 D NA 15 6 NA 16 20 23 5.112500 3.374844
Try to run the piped process step by step to see how it works, especially if you're not familiar with the dplyr and tidyr syntax.
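If you'd rather stay in base R and skip the reshaping, here is a minimal sketch of the same row-wise calculation, reusing Rating_wt and the Avg_rating column from the question's code:
counts <- as.matrix(Data[, 2:8])
# one row of ratings 1:7 per item, centered on that item's average, then squared
dev2 <- sweep(matrix(Rating_wt, nrow(counts), 7, byrow = TRUE), 1, Data$Avg_rating)^2
Data$Rating_var <- rowSums(dev2 * counts, na.rm = TRUE) / rowSums(counts, na.rm = TRUE)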
