How to create differences between several pairs of columns? - r

I have a panel (cross-sectional time series) dataset. For each group (defined by (NAICS2, occ_type)) and each time period ym I have many variables. For each variable I would like to subtract each group's first (dplyr::first) value from every value of that group.
Ultimately I am trying to compute the Euclidean distance between each row's vector and its group's first entry, i.e. sqrt(c_1^2 + ... + c_k^2), where c_j is the difference in variable j.
I was able to create a column equal to the first entry for each group:
df2 <- df %>%
group_by(ym, NAICS2, occ_type) %>%
distinct(ym, NAICS2, occ_type, .keep_all = T) %>%
arrange(occ_type, NAICS2, ym) %>%
select(group_cols(), ends_with("_scf")) %>%
mutate_at(vars(-group_cols(), ends_with("_scf")),
list(first = dplyr::first))
I then tried to include variations of f.diff = . - dplyr::first(.) in the list, but none of them worked. I searched for the dot notation for a while, as well as first and lag for time series in dplyr, but have not been able to resolve this yet.
Ideally, I would first unite all variables into a vector for each row and then take the difference:
df2 <- df %>%
group_by(ym, NAICS2, occ_type) %>%
distinct(ym, NAICS2, occ_type, .keep_all = T) %>%
arrange(occ_type, NAICS2, ym) %>%
select(group_cols(), ends_with("_scf")) %>%
unite(vector, c(-group_cols(), ends_with("_scf")), sep = ',') %>%
# TODO: DISTANCE_BETWEEN_ENTRY_AND_FIRST
mutate(vector.diff = ???)
I expect the output to be a numeric column that contains a distance measure of how different each group's row vector is from its initial row vector.
Here is a sample of the data:
structure(list(ym = c("2007-01-01", "2007-02-01"), NAICS2 = c(0L,
0L), occ_type = c("is_middle_manager", "is_middle_manager"),
Administration_scf = c(344, 250), Agriculture..Horticulture..and.the.Outdoors_scf = c(11,
17), Analysis_scf = c(50, 36), Architecture.and.Construction_scf = c(57,
51), Business_scf = c(872, 585), Customer.and.Client.Support_scf = c(302,
163), Design_scf = c(22, 17), Economics..Policy..and.Social.Studies_scf = c(7,
7), Education.and.Training_scf = c(77, 49), Energy.and.Utilities_scf = c(25,
28), Engineering_scf = c(90, 64), Environment_scf = c(19,
19), Finance_scf = c(455, 313), Health.Care_scf = c(105,
71), Human.Resources_scf = c(163, 124), Industry.Knowledge_scf = c(265,
174), Information.Technology_scf = c(467, 402), Legal_scf = c(21,
17), Maintenance..Repair..and.Installation_scf = c(194, 222
), Manufacturing.and.Production_scf = c(176, 174), Marketing.and.Public.Relations_scf = c(139,
109), Media.and.Writing_scf = c(18, 20), Personal.Care.and.Services_scf = c(31,
16), Public.Safety.and.National.Security_scf = c(14, 7),
Religion_scf = c(0, 0), Sales_scf = c(785, 463), Science.and.Research_scf = c(52,
24), Supply.Chain.and.Logistics_scf = c(838, 455), total_scf = c(5599,
3877)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -2L), groups = structure(list(ym = c("2007-01-01",
"2007-02-01"), NAICS2 = c(0L, 0L), occ_type = c("is_middle_manager",
"is_middle_manager"), .rows = list(1L, 2L)), row.names = c(NA,
-2L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE))
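One possible approach, offered as a sketch rather than a definitive answer: group by (NAICS2, occ_type) only, so that first() refers to the earliest ym within each group, subtract that first value from every _scf column, and then take the row-wise square root of the summed squares. The toy data frame and its column names a_scf and b_scf are hypothetical stand-ins for the real data, and the sketch assumes dplyr 1.0.0 or later for across().

```r
library(dplyr)

# Toy panel (hypothetical values): two groups, three months each
df <- data.frame(
  ym       = rep(c("2007-01-01", "2007-02-01", "2007-03-01"), times = 2),
  NAICS2   = rep(c(0L, 1L), each = 3),
  occ_type = "is_middle_manager",
  a_scf    = c(3, 0, 3, 1, 1, 4),
  b_scf    = c(4, 0, 8, 1, 1, 5)
)

df_dist <- df %>%
  arrange(occ_type, NAICS2, ym) %>%
  # Group WITHOUT ym; grouping by ym as well would make every group a single row
  group_by(NAICS2, occ_type) %>%
  # Difference between each value and the group's first value, per variable
  mutate(across(ends_with("_scf"), ~ .x - first(.x), .names = "{.col}_diff")) %>%
  ungroup() %>%
  # Euclidean norm of the difference vector in each row
  mutate(dist = sqrt(rowSums(across(ends_with("_diff"))^2)))

df_dist$dist
# 0 5 4 0 0 5
```

Keeping the differences as numeric columns avoids unite(), which would turn the values into a comma-separated string that has to be parsed again before any arithmetic.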

Related

Create mean value plot without missing values count to total

Using a dataframe with missing values:
structure(list(id = c("id1", "test", "rew", "ewt"), total_frq_1 = c(54, 87, 10, 36), total_frq_2 = c(45, 24, 202, 43), total_frq_3 = c(24, NA, 25, 8), total_frq_4 = c(36, NA, 104, NA)), row.names = c(NA, 4L), class = "data.frame")
How is it possible to create a bar plot with the mean of every column (excluding the id column), without filling the missing values with 0 but instead leaving out the rows with missing values? For example, for total_frq_3: (24 + 25 + 8) / 3 = 57 / 3 = 19.
You can use the colMeans function with na.rm = TRUE to ignore NA values.
library(ggplot2)
xy <- structure(list(id = c("id1", "test", "rew", "ewt"),
total_frq_1 = c(54, 87, 10, 36), total_frq_2 = c(45, 24, 202, 43), total_frq_3 = c(24, NA, 25, 8),
total_frq_4 = c(36, NA, 104, NA)),
row.names = c(NA, 4L),
class = "data.frame")
xy.means <- colMeans(x = xy[, 2:ncol(xy)], na.rm = TRUE)
xy.means <- as.data.frame(xy.means)
xy.means$total <- rownames(xy.means)
ggplot(xy.means, aes(x = total, y = xy.means)) +
theme_bw() +
geom_col()
Or just use base graphics:
barplot(height = colMeans(x = xy[, 2:ncol(xy)], na.rm = TRUE))
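As a quick check of the arithmetic in the question, the sketch below recomputes the column means with na.rm = TRUE and also counts how many non-missing values each mean is based on. The names col_means and n_used are introduced here for illustration only.

```r
xy <- data.frame(
  id          = c("id1", "test", "rew", "ewt"),
  total_frq_1 = c(54, 87, 10, 36),
  total_frq_2 = c(45, 24, 202, 43),
  total_frq_3 = c(24, NA, 25, 8),
  total_frq_4 = c(36, NA, 104, NA)
)

# Mean of each numeric column, skipping NA instead of treating it as 0
col_means <- colMeans(xy[, -1], na.rm = TRUE)

# Number of non-missing values that went into each mean
n_used <- colSums(!is.na(xy[, -1]))

col_means[["total_frq_3"]]  # 19, i.e. (24 + 25 + 8) / 3
n_used[["total_frq_3"]]     # 3
```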

How to apply functions depending on the column, and mutate into new data frame?

I came up with the idea of representing stats on a chart like this (example of the plot), and built it like this:
df_n <- df_normalized %>%
transmute(
Height_x = round(Height*cos_my(45), 2),
Height_y = round(Height*sin_my(45), 2),
Weight_x = round(Weight*cos_my(45*2), 2),
Weight_y = round(Weight*sin_my(45*2), 2),
Reach_x = round(Reach*cos_my(45*3), 2),
Reach_y = round(Reach*sin_my(45*3), 2),
SLpM_x = round(SLpM*cos_my(45*4), 2),
SLpM_y = round(SLpM*sin_my(45*4), 2),
Str_Def_x = round(`Str_Def %`*cos_my(45*5), 2),
Str_Def_y = round(`Str_Def %`*sin_my(45*5), 2),
TD_Avg_x = round(TD_Avg*cos_my(45*6), 2),
TD_Avg_y = round(TD_Avg*sin_my(45*6), 2),
TD_Acc_x = round(`TD_Acc %`*cos_my(45*7), 2),
TD_Acc_y = round(`TD_Acc %`*sin_my(45*7), 2),
Sub_Avg_x = round(Sub_Avg*cos_my(45*8), 2),
Sub_Avg_y = round(Sub_Avg*sin_my(45*8), 2))
Now I want to do this in a smarter way, so I created a data frame with the same number of rows, empty_df, and in a for loop I try to mutate it with a new array on every iteration. For example, I want to multiply the 1st column by cos(30), the 2nd by cos(30*2), and so on.
But...
It mutates only the last column, because on every iteration the new column gets the same name, 'column'.
I want to name each column using the variable column, built with paste0().
reprex_df <- structure(list(Height = c(190, 180, 183, 196, 185),
Weight = c(120, 77, 93, 120, 84),
Reach = c(193, 180, 188, 203, 193),
SLpM = c(2.45, 3.8, 2.05, 7.09, 3.17),
`Str_Def %` = c(58, 56, 55, 34, 44),
TD_Avg = c(1.23, 0.33, 0.64, 0.91, 0),
`TD_Acc %` = c(24, 50, 20, 66, 0),
Sub_Avg = c(0.2, 0, 0, 0, 0)), row.names = c(NA, -5L),
class = c("tbl_df", "tbl", "data.frame"))
temp <- apply(reprex_df[,1], function(x) x*cos(60), MARGIN = 2)
temp
empty_df <- data.frame(first_column = replicate(length(temp),1))
for (x in 1:8) {
temp <- apply(df[,x], function(x) round(x*cos((360/8)*x),2), MARGIN = 2)
column <- paste0("Column_",x)
empty_df <- mutate(empty_df, column = temp)
}
Later I want to turn this into a function to which I can pass a data frame and receive a data frame with X and Y coordinates.
So, how should I do it?
Perhaps this helps:
library(purrr)
library(stringr)
nm1 <- names(reprex_df)
nm_cos <- str_c(names(reprex_df), "_x")
nm_sin <- str_c(names(reprex_df), "_y")
reprex_df[nm_cos] <- map2(reprex_df, seq_along(nm1),
~ round(.x * cos(45 *.y ), 2))
reprex_df[nm_sin] <- map2(reprex_df[nm1], seq_along(nm1),
~ round(.x * sin(45 *.y ), 2))
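The question also asks for a reusable function. A minimal base R sketch follows, under two assumptions that are not stated in the original: the angles are meant in degrees (R's cos() and sin() expect radians, so they are converted), and the columns are spaced evenly at 360 / ncol(df) degrees, which gives 45 for the eight columns in the question. The name to_xy is hypothetical. Assigning with out[[name]] <- value is what lets the column name come from a variable, which mutate(empty_df, column = temp) does not do.

```r
# Hypothetical helper: returns <name>_x and <name>_y columns for every input column
to_xy <- function(df, digits = 2) {
  k <- ncol(df)
  out <- data.frame(row.names = seq_len(nrow(df)))
  for (i in seq_len(k)) {
    # Angle of column i: i * (360 / k) degrees, converted to radians
    theta <- i * (360 / k) * pi / 180
    # [[<- uses the string as the column name, so each iteration adds new columns
    out[[paste0(names(df)[i], "_x")]] <- round(df[[i]] * cos(theta), digits)
    out[[paste0(names(df)[i], "_y")]] <- round(df[[i]] * sin(theta), digits)
  }
  out
}

# Toy input (hypothetical values); with two columns the angles are 180 and 360 degrees
toy <- data.frame(Height = c(1, 2), `Str_Def %` = c(1, 2), check.names = FALSE)
res <- to_xy(toy)
names(res)
# "Height_x" "Height_y" "Str_Def %_x" "Str_Def %_y"
```

Calling to_xy(reprex_df) on the eight-column example should give sixteen columns, one x and one y per stat.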

Error: Problem with `mutate()` input x `labels` must be unique

I am trying to recode some labelled variables to a 0 to 1 scale in the following fashion. When I try to calculate the mean of the two variables using c_across(), I get this odd error: Error: Problem with mutate() input market_liberalism. x labels must be unique.
If I delete the value labels then it works. I don't understand what problem the value labels cause.
Thank you.
#Install car package if necessary
#install.packages('car')
library(tidyverse)
library(car)
structure(list(PESE15 = structure(c(3, 5, 5, 8, NA), label = "The Government Should Leave it Entirely to the Private Sector to Create Jobs", na_values = c(8, 9), format.spss = "F1.0", display_width = 0L, labels = c(`Strongly agree` = 1, `Somewhat agree` = 3, Somewhatdisagree = 5, Stronglydisagree = 7,D.K. = 8, Refused = 9), class = c("haven_labelled_spss", "haven_labelled", "vctrs_vctr", "double")), MBSA2 = structure(c(3, 8, 1, 1, NA), label = "People Who Do Not Get Ahead Should Blame Themselves Not the System", na_values = 8, format.spss = "F1.0", display_width = 0L, labels = c(`Strongly agree` = 1, Agree = 2, Disagree = 3, Stronglydisagree = 4, `No opinion` = 8), class = c("haven_labelled_spss", "haven_labelled", "vctrs_vctr", "double"))), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), label = "NSDstat generated file")->out
#use the car::Recode command to convert values to 0 to 1
out$market1<-Recode(out$PESE15, "1=1; 3=0.75; 5=0.25; 7=0; 8=0.5; else=NA")
out$market2<-Recode(out$MBSA2, "1=1; 2=0.75; 3=0.25; 4=0; 8=0.5; else=NA")
#Use dplyr to try to calculate the average
out %>%
rowwise() %>%
mutate(market_liberalism=mean(
c_across(market1:market2))) -> out2
#setting value labels to NULL makes it work.
val_labels(out$market1)<-NULL
val_labels(out$market2)<-NULL
out %>%
rowwise() %>%
mutate(market_liberalism=mean(
c_across(market1:market2)))
For me, car::Recode gives an error and does not work with the haven labelled class, but dplyr::recode does if you have the labelled package loaded.
library(labelled)
library(dplyr)
out %>%
mutate(PESE15 = recode(PESE15, `1` = 1,`3` = 0.75, `5`=0.25, `7`=0, `8` = 0.5),
MBSA2 = recode(MBSA2, `1`=1, `2`=0.75, `3`=0.25, `4`=0, `8`=0.5),
market_liberalism = rowMeans(., na.rm = TRUE))

Reshape from long to wide according to the number of occurrence of one variable

I have a data frame that looks like this:
df1<-structure(list(person = c("a", "a", "a", "a", "b", "b", "b",
"c"), visitID = c(123, 123, 256, 816, 237, 828, 828, 911), v1 = c(10,
5, 15, 8, 95, 41, 31, 16), v2 = c(8, 72, 29, 12, 70, 23, 28,
66), v3 = c(0, 1, 0, 0, 1, 1, 0, 1)), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"))
Here person is the name/ID of the person and visitID is a number generated for each visit. Each visit may have one or several values of the variables (v1, v2, v3). I am trying to reshape the data so that each person occupies a single row, with the visits and variables spread wide, like this:
df2<-structure(list(person = c("a", "b", "c"), visit1 = c(123, 237,
911), visit2 = c(256, 828, NA), visit3 = c(816, NA, NA), v1.visit1 = c("10,5",
"95", "16"), v1.visit2 = c("15", "41,31", NA), v1.visit3 = c("8",
NA, NA), v2.visit1 = c("8,72", "70", "66"), v2.visit2 = c("29",
"23,28", NA), v2.visit3 = c("12", NA, NA), v3.visit1 = c("0,1",
"1", "1"), v3.visit2 = c("0", "1,0", NA), v3.visit3 = c("0",
NA, NA)), row.names = c(NA, -3L), class = c("tbl_df", "tbl",
"data.frame"))
Methods I have tried so far:
Method 1:
1. Aggregate by person, with all other variables separated by commas.
2. Split the variables into multiple columns.
The problem with this method is that I would no longer know which variable corresponds to which visit, especially since some visits have multiple entries and some do not.
Method 2:
1. Count the occurrences of each visitID and take the maximum number of visits per unique person (3 in the case above).
2. Create 3 columns for each variable.
I did not know how to proceed from here.
I found a great answer in the thread "Reshape three column data frame to matrix ("long" to "wide" format)", so I tried working with reshape and pivot_wider but couldn't get it to work.
Any ideas are appreciated, even if they do not lead to exactly the same output.
Thank you
You can try something like this:
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
group_by(person, visitID) %>%
summarise(across(matches("v[0-9]+"), list)) %>%
group_by(person) %>%
mutate(visit = seq_len(n()) %>% str_c("visit.", .)) %>%
ungroup() %>%
pivot_wider(
id_cols = person,
names_from = visit,
values_from = c("visitID", matches("v[0-9]+"))
)
Replace list with ~str_c(.x, collapse = ",") if you want the values as comma-separated character strings.
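For reference, here is a self-contained sketch of the comma-separated variant, using paste() in place of str_c() so that only dplyr and tidyr are needed. It assumes visits should be numbered in the order summarise() returns them, which is sorted by visitID within each person; that happens to match the order of appearance in the sample data. The resulting column names (v1.visit1, visitID.visit1, and so on) follow from names_sep = "." and differ slightly from the desired output in the question, where the ID columns are called visit1, visit2, visit3.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  person  = c("a", "a", "a", "a", "b", "b", "b", "c"),
  visitID = c(123, 123, 256, 816, 237, 828, 828, 911),
  v1      = c(10, 5, 15, 8, 95, 41, 31, 16),
  v2      = c(8, 72, 29, 12, 70, 23, 28, 66),
  v3      = c(0, 1, 0, 0, 1, 1, 0, 1)
)

wide <- df1 %>%
  group_by(person, visitID) %>%
  # Collapse repeated measurements within one visit into "x,y" strings
  summarise(across(matches("^v[0-9]+$"), ~ paste(.x, collapse = ",")),
            .groups = "drop") %>%
  group_by(person) %>%
  # Number each person's visits: visit1, visit2, ...
  mutate(visit = paste0("visit", row_number())) %>%
  ungroup() %>%
  pivot_wider(
    id_cols     = person,
    names_from  = visit,
    values_from = c(visitID, matches("^v[0-9]+$")),
    names_sep   = "."
  )

wide$v1.visit1
# "10,5" "95" "16"
```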

Create new column with percentages in data frame

I have the following dataframe:
dput(df1)
structure(list(month = c(1, 1, 2, 2, 3, 4), transaction_type = c("AAA",
"BBB", "BBB", "CCC",
"DDD", "AAA"), max_wt_per_month = c(54.9,
51.6833333333333, 52.3333333333333, 49.4666666666667, 49.85,
48.5833333333333), min_wt_per_month = c(0, 0, 0, 0, 0, 0), avg_wt_per_month = c(8.41701333107861,
7.65211141060198, 6.44184012508551, 7.74798927613941, 7.4360566888844,
7.50611319574734), prop = c(Inf, Inf, Inf, Inf, Inf, Inf)), .Names = c("month",
"transaction_type", "max_wt_per_month", "min_wt_per_month", "avg_wt_per_month",
"prop"), row.names = c(NA, -6L), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), vars = list(month), drop = TRUE, indices = list(
0:5), group_sizes = 6L, biggest_group_size = 6L, labels = structure(list(
month = 1), row.names = c(NA, -1L), class = "data.frame", vars = list(
month), drop = TRUE, .Names = "month"))
I want to create a column prop that contains each row's maximum waiting time as a percentage of the total for its month. If I run this code, I get Inf values in most of the rows (this is especially evident in the real dataset):
my_fun=function(vec){
100*as.numeric(vec[3]) /
sum(with(data_merged_transactions, ifelse(month == vec[1], max_wt_per_month, 0))) }
data_merged_transactions$prop=apply(data_merged_transactions , 1 , my_fun)
Finally, I need to create a filled area chart in which each area is a percentage out of 100%:
ggplot(data_merged_transactions, aes(x=month, y=prop, fill=transaction_type)) +
geom_area(alpha=0.6 , size=1, colour="black")
Why do I get Inf if the sum is not equal to 0?
Moreover, is it possible to create a filled area chart with months as factors (Jan, Feb, etc.) rather than numbers? I tried to replace the month IDs with month names, but then I got very thin bars instead of a filled area.
Is this what you were looking for?
library(tidyverse)
df1_tidy <- df1 %>%
group_by(month) %>%
summarise(SUM = sum(max_wt_per_month)) %>%
full_join(df1) %>%
mutate(prop = max_wt_per_month / SUM)
ggplot(data = df1_tidy,
aes(x = month,
y = prop,
fill = transaction_type)) +
geom_area(alpha = 0.6,
size = 1,
colour = "black") +
scale_x_continuous(labels = c("Jan", "Feb", "Mar", "Apr"))
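The same share can also be computed without a join, by using a grouped mutate(). The sketch below multiplies by 100 because the question asks for percentages. The max_wt_per_month values are hypothetical toy numbers chosen so the result is easy to verify by eye.

```r
library(dplyr)

# Toy values (hypothetical), same months and transaction types as the question
df <- data.frame(
  month            = c(1, 1, 2, 2, 3, 4),
  transaction_type = c("AAA", "BBB", "BBB", "CCC", "DDD", "AAA"),
  max_wt_per_month = c(30, 10, 20, 20, 5, 8)
)

df_prop <- df %>%
  group_by(month) %>%
  # Each row's share of its month's total, in percent
  mutate(prop = 100 * max_wt_per_month / sum(max_wt_per_month)) %>%
  ungroup()

df_prop$prop
# 75 25 50 50 100 100
```

Because sum() is evaluated within each month, the shares of every month add up to 100.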
