I have some tidy data and need to transform it into the format required for drawing small graphs (sparklines) with the dataui package. The required data frame format is shown in the code example below as df_sparkline.
The tidy data covers about 30 companies over one year, fewer than 10,000 rows in total. What is the best way to transform df_tidy into df_sparkline? Clarity matters more to me than raw speed.
library("dataui")
library("reactable")
library("tidyverse")
df_tidy <- tibble(
company = c("A", "B", "A", "B", "A", "B"),
line_data = c(1, 2, 2, 2, 1, 1),
date = c(as.Date("2021-01-01"), as.Date("2021-01-01"), as.Date("2021-01-02"), as.Date("2021-01-02"), as.Date("2021-01-03"), as.Date("2021-01-03"))
)
df_sparkline <- structure(list(company = c("A", "B"), line_data = list(list(c(1, 2, 1)), list(c(2, 2, 1)))), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"))
rt1 <- reactable(
df_sparkline,
columns = list(
line_data = colDef(
cell = function(value, index) {
dui_sparkline(
data = value[[1]],
height = 80,
components = dui_sparklineseries(curve = "linear") # https://github.com/williaster/data-ui/tree/master/packages/sparkline#series
)
}
)
)
)
rt1
All you need is group_by() and summarise():
df_sparkline2 = df_tidy %>%
group_by(company) %>%
summarise(line_data=list(list(line_data)))
waldo::compare(df_sparkline, df_sparkline2)
# √ No differences
The key here is to call list() inside summarise().
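To see why the nesting matters, here is a small sketch comparing one list() call with two. The names one_level and two_levels are only illustrative; the date column is left out because it does not affect the result.

```r
library(dplyr)
library(tibble)

df_tidy <- tibble(
  company   = c("A", "B", "A", "B", "A", "B"),
  line_data = c(1, 2, 2, 2, 1, 1)
)

# One list(): each cell of the list-column is a plain numeric vector
one_level <- df_tidy %>%
  group_by(company) %>%
  summarise(line_data = list(line_data))

# Two list() calls: each cell is itself a list that holds the vector,
# which is the shape df_sparkline uses (hence value[[1]] in the cell function)
two_levels <- df_tidy %>%
  group_by(company) %>%
  summarise(line_data = list(list(line_data)))

str(one_level$line_data[[1]])
str(two_levels$line_data[[1]])
```

With the single list() version, the cell function would receive the vector directly rather than a list wrapping it.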
I have a .csv file like this (except that the real .csv file has many more columns):
library(tidyverse)
tibble(id1 = c("a", "b"),
id2 = c("c", "d"),
data1 = c(1, 2),
data2 = c(3, 4),
data1s = c(5, 6),
data2s = c(7, 8)) %>%
write_csv("df.csv")
I only want id1, id2, data1, and data2.
I can do this:
df <- read_csv("df.csv",
col_names = TRUE,
cols_only(id1 = col_character(),
id2 = col_character(),
data1 = col_integer(),
data2 = col_integer()))
But, as mentioned above, my real dataset has many more columns, so I'd like to use tidyselect helpers to only read in specified columns and ensure specified formats.
I tried this:
df2 <- read_csv("df.csv",
col_names = TRUE,
cols_only(starts_with("id") = col_character(),
starts_with("data") & !ends_with("s") = col_integer()))
But the error message indicates that there's a problem with the syntax. Is it possible to use tidyselect helpers in this way?
My proposal is somewhat roundabout, but it does let you customise the read spec with rules rather than by listing every column explicitly:
library(tidyverse)
tibble(id1 = c("a", "b"),
id2 = c("c", "d"),
data1 = c(1, 2),
data2 = c(3, 4),
data1s = c(5, 6),
data2s = c(7, 8)) %>%
write_csv("df.csv")
# read only 1 row to make a spec from with minimal read; really just to get the colnames
df_spec <- spec(read_csv("df.csv",
col_names = TRUE,
n_max = 1))
# alter the spec with base R functions startsWith / endsWith etc.
# (&& is used because if() needs a single TRUE/FALSE)
df_spec$cols <- imap(df_spec$cols, ~ {
  if (startsWith(.y, "id")) {
    col_character()
  } else if (startsWith(.y, "data") && !endsWith(.y, "s")) {
    col_integer()
  } else {
    col_skip()
  }
})
df <- read_csv("df.csv",
col_types = df_spec$cols)
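As a possible alternative, here is a sketch that assumes readr 2.0 or later, where read_csv() has a col_select argument that accepts tidyselect helpers. Note that col_select only chooses which columns to keep; in this sketch the types are enforced afterwards with across(). The temporary file path and the name df3 are illustrative.

```r
library(readr)
library(dplyr)
library(tibble)

# Recreate the example file in a temporary location
path <- tempfile(fileext = ".csv")
write_csv(
  tibble(id1 = c("a", "b"), id2 = c("c", "d"),
         data1 = c(1, 2), data2 = c(3, 4),
         data1s = c(5, 6), data2s = c(7, 8)),
  path
)

df3 <- read_csv(
  path,
  # tidyselect helpers pick the columns to read
  col_select = c(starts_with("id"), starts_with("data") & !ends_with("s")),
  show_col_types = FALSE
) %>%
  # enforce the wanted types after reading
  mutate(across(starts_with("id"), as.character),
         across(starts_with("data"), as.integer))

names(df3)
```

Unlike the spec-based approach, this converts types after the read instead of declaring them up front, so parsing problems would surface at the conversion step.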
Here is a tidy data frame:
df_tidy <- tibble(
company = c("A", "B", "A", "B", "A", "B"),
line_data = c(1, 2, 2, 2, 1, 1)
)
The format required is:
df_ll <- structure(list(company = c("A", "B"), line_data = list(list(c(1, 2, 1)), list(c(2, 2, 1)))), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"))
How do I transform df_tidy into df_ll?
Group by 'company' and summarise 'line_data' into a nested list:
df_ll2 <- df_tidy %>%
group_by(company) %>%
summarise(line_data = list(list(line_data)))
Checking against the expected output:
all.equal(df_ll, df_ll2)
[1] TRUE
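Another option is nest_by() (or nest()), then converting each nested tibble to a list:
df_tidy %>%
nest_by(company, .key = "line_data") %>%
# use.names = FALSE drops the names that unlist() would otherwise add
mutate(line_data = list(list(unlist(line_data, use.names = FALSE)))) %>%
ungroup()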
You can also use the plyr package:
library(plyr)
df_ll <- dlply(df_tidy,.(company),c)
library(tidyverse)
# Create a tibble comprised of: df_ll2 => tibble
df_ll2 <- tibble(
# Uniquify the company vector: company => character vector
company = unique(df_tidy$company),
# Split the data into a list by the company vector, coerce each
# element to an unnamed list:
line_data = unname(
lapply(
with(df_tidy, split(line_data, company)),
list
)
)
)
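For reference, a minimal base R sketch of what the split() and lapply() steps produce on their own. The names by_company and nested are illustrative. Note that split() orders its groups by the sorted group values, so names(by_company) is a safer source for the company column than unique() whenever the data are not already sorted by company.

```r
company   <- c("A", "B", "A", "B", "A", "B")
line_data <- c(1, 2, 2, 2, 1, 1)

# split() returns a named list with one numeric vector per company
by_company <- split(line_data, company)

# Wrap each vector in its own list and drop the names,
# giving the list(list(vector)) shape of the target column
nested <- unname(lapply(by_company, list))

str(by_company)
str(nested)
```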
Why does this...
B.aov2<- eval(parse(text=StringforEvaluation))
summary(B.aov2)
produce a less detailed report than this...?
res.aov2 <- aov(DV ~ IV1*IV2+Error(Participant/(IV1*IV2), data = AnovaAnalysisData))
summary(res.aov2)
Reproducible data, per a user's request. I know it's unusual, but I'm building a user-friendly stats program for my students. Now that I've posted the "reproducible code", it wants me to provide more data. I didn't post this code originally because I thought it was a bit much and my general coding isn't great. My question still stands: why does one call give me a short, unhelpful output while the other gives exactly what I want, when I ran the same command?
library(tidyverse)
DV <- c(1,1, 5,6, 1, 2, 7, 7, 1, 4, 9, 9)
IV1 <- c("A","B", "A", "B","A","B", "A", "B","A","B", "A", "B" )
IV2 <- c("C","C","D","D", "C","C", "D","D", "C","C", "D","D")
Participant <- c("A", "A", "A", "A", "B","B","B","B","C","C","C","C")
IV3 <- "no_data" #remove the word "no_data" and add c("A","B", "C", etc.. )
IV4 <- "no_data" #remove the word "no_data" and add c("A","B", "C", etc.. )
##### You have to tell the computer if the variable is within!
IV1_iswithin <-"Y"
IV2_iswithin <-"Y"
IV3_iswithin <-"N"
IV4_iswithin <-"N"
####### Your JOB is DONE
data <- data.frame(DV,Participant,IV1,IV2)
#Grouping the dataframe
data %>%group_by(IV2, IV1)%>% #subsetting the data set to calculate 4 different stats
mutate(MAD = median(abs(DV-median(DV))*2.5*1.4826))%>% #calculates 4 different MAD values, one per group
mutate(MADLL = median(DV)-MAD)%>% #calculates the LL of MAD, pipes output to next command
mutate(MADUL = median(DV)+MAD)%>% #calculates the UL of MAD
mutate(OutlierPresent = ifelse(DV<MADLL | DV>MADUL, NA, DV))%>% #Creates NA values if it is an outlier
ungroup()%>% #converts back to big data set
mutate(OutlierPresent = ifelse(DV<MADLL | DV>MADUL, NA, DV))%>% #Creates NA
mutate(whichgroup = paste(IV1,IV2))%>%
mutate(observation = 1:n()) %>%
{. ->> b }
b %>%
select(DV,OutlierPresent) %>%
{. ->> outlierfeedback }
formattable(outlierfeedback)
b %>%
select(OutlierPresent,IV1,IV2,Participant) %>%
pivot_wider(names_from = c(IV1,IV2), values_from = OutlierPresent, names_sep = "_", id_cols = Participant)%>%
drop_na()%>%
pivot_longer(-Participant, names_to = c("IV1","IV2"), names_sep = "_", values_to = "DV")%>%
{. ->> AnovaAnalysisData }
### Calculating the ANOVA for our outlier free data AnovaAnalysisData
#SettingUpTheListofFixedFactors
FactorModel <- list()
ifelse(length(IV1)> 1, FactorModel<- c(FactorModel, "IV1"), FactorModel<-FactorModel)
ifelse(length(IV2)> 1, FactorModel<- c(FactorModel, "IV2"), FactorModel<-FactorModel)
ifelse(length(IV3)> 1, FactorModel<- c(FactorModel, "IV3"), FactorModel<-FactorModel)
ifelse(length(IV4)> 1, FactorModel<- c(FactorModel, "IV4"), FactorModel<-FactorModel)
#SettingUPTheListofErrorFactors
ErrorModel <- list()
ifelse(IV1_iswithin== "Y", ErrorModel<- c(ErrorModel, "IV1"), ErrorModel<-ErrorModel)
ifelse(IV2_iswithin== "Y", ErrorModel<- c(ErrorModel, "IV2"), ErrorModel<-ErrorModel)
ifelse(IV3_iswithin== "Y",ErrorModel<- c(ErrorModel, "IV3"), ErrorModel<-ErrorModel)
ifelse(IV4_iswithin== "Y", ErrorModel<- c(ErrorModel, "IV4"), ErrorModel<-ErrorModel)
StrStart = "aov(DV ~"
StrFactor<-paste(ErrorModel, collapse='*' )
StrErrorStart<-("+Error(Participant/(")
StrError<-paste(ErrorModel, collapse='*' )
StrErrorEnd<- ("),data=AnovaAnalysisData))")
StringforEvaluation<- paste(StrStart, StrFactor,StrErrorStart,StrError,StrErrorEnd)
B.aov2<- eval(parse(text=StringforEvaluation))
summary(B.aov2)
res.aov2 <- aov(DV ~ IV1*IV2+Error(Participant/(IV1*IV2), data = AnovaAnalysisData))
summary(res.aov2)
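As a side note, here is a sketch of one way to build the same kind of model without eval(parse()). It assumes the within-subject factor names are held in a character vector (within_factors, d, rhs and model_formula are hypothetical names), assembles the formula with as.formula(), and passes data to aov() itself rather than inside Error(). It uses the raw data from above, not the outlier-filtered AnovaAnalysisData, and it is not claimed to explain the difference in output.

```r
# Raw data from the question
d <- data.frame(
  DV          = c(1, 1, 5, 6, 1, 2, 7, 7, 1, 4, 9, 9),
  IV1         = rep(c("A", "B"), times = 6),
  IV2         = rep(c("C", "C", "D", "D"), times = 3),
  Participant = rep(c("A", "B", "C"), each = 4)
)

within_factors <- c("IV1", "IV2")  # hypothetical: names of the within factors

# Build "IV1 * IV2" once and reuse it in both parts of the formula
rhs <- paste(within_factors, collapse = " * ")
model_formula <- as.formula(
  paste0("DV ~ ", rhs, " + Error(Participant / (", rhs, "))")
)

# data is an argument of aov(), not of Error()
fit <- aov(model_formula, data = d)
summary(fit)
```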
I would like to visualize some deeply nested data using networkD3. I can't figure out how to get the data into the right format before sending to radialNetwork.
Here is some sample data:
level <- c(1, 2, 3, 4, 4, 3, 4, 4, 1, 2, 3)
value <- letters[1:11]
where level indicates the level of the nest, and value is the name of the node. By using these two vectors, I need to get the data into the following format:
my_list <- list(
name = "root",
children = list(
list(
name = value[1], ## a
children = list(list(
name = value[2], ## b
children = list(list(
name = value[3], ## c
children = list(
list(name = value[4]), ## d
list(name = value[5]) ## e
)
),
list(
name = value[6], ## f
children = list(
list(name = value[7]), ## g
list(name = value[8]) ## h
)
))
))
),
list(
name = value[9], ## i
children = list(list(
name = value[10], ## j
children = list(list(
name = value[11] ## k
))
))
)
)
)
Here is the deparsed object:
> dput(my_list)
# structure(list(name = "root",
# children = list(
# structure(list(
# name = "a",
# children = list(structure(
# list(name = "b",
# children = list(
# structure(list(
# name = "c", children = list(
# structure(list(name = "d"), .Names = "name"),
# structure(list(name = "e"), .Names = "name")
# )
# ), .Names = c("name",
# "children")), structure(list(
# name = "f", children = list(
# structure(list(name = "g"), .Names = "name"),
# structure(list(name = "h"), .Names = "name")
# )
# ), .Names = c("name",
# "children"))
# )), .Names = c("name", "children")
# ))
# ), .Names = c("name",
# "children")), structure(list(
# name = "i", children = list(structure(
# list(name = "j", children = list(structure(
# list(name = "k"), .Names = "name"
# ))), .Names = c("name",
# "children")
# ))
# ), .Names = c("name", "children"))
# )),
# .Names = c("name",
# "children"))
Then I can pass it to the final plotting function:
library(networkD3)
radialNetwork(List = my_list)
The output will look similar to this:
Question: How can I create the nested list?
Note: As pointed out by @zx8754, there is already a solution in this SO post, but it requires a data.frame as input. Because my level vector is irregular, I don't see a simple way to transform it into a data.frame.
Using a data.table-style merge:
library(data.table)
dt = data.table(idx=1:length(value), level, parent=value)
dt = dt[dt[, .(i=idx, level=level-1, child=parent)], on=.(level, idx < i), mult='last']
dt[is.na(parent), parent:= 'root'][, c('idx','level'):= NULL]
> dt
# parent child
# 1: root a
# 2: a b
# 3: b c
# 4: c d
# 5: c e
# 6: b f
# 7: f g
# 8: f h
# 9: root i
# 10: i j
# 11: j k
Now we can use the solution from the other post:
x = maketreelist(as.data.frame(dt))
> identical(x, my_list)
# [1] TRUE
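The maketreelist() function comes from the other post and is not shown here. As a rough idea of what such a conversion involves, here is a minimal recursive sketch in base R. build_tree and edges are hypothetical names, this is not the code from that post, and it assumes every child appears exactly once and the top-level parent is called "root".

```r
# Parent/child table, as printed above
edges <- data.frame(
  parent = c("root", "a", "b", "c", "c", "b", "f", "f", "root", "i", "j"),
  child  = letters[1:11],
  stringsAsFactors = FALSE
)

# Recursively turn the edge table into the name/children structure
build_tree <- function(edges, node = "root") {
  kids <- edges$child[edges$parent == node]
  if (length(kids) == 0) {
    return(list(name = node))  # leaf: no children element
  }
  list(
    name     = node,
    children = lapply(kids, function(k) build_tree(edges, k))
  )
}

tree <- build_tree(edges)
str(tree, max.level = 3)
```

The recursion rescans the whole table for every node, which is fine for small trees but would be slow for very large ones.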
As a preface, your data is difficult to work with because critical information is encoded in the order of the values in level. I don't know how you get those values in that order, but consider that there may be a better way to structure that information in the first place, which would make the next task easier.
Here's a base-y way of converting your data into a data frame with 2 columns, parent and child, then passing that into data.tree functions that can easily convert to the JSON format you need... and then pass it on to radialNetwork...
level <- c(1, 2, 3, 4, 4, 3, 4, 4, 1, 2, 3)
value <- letters[1:11]
library(data.tree)
library(networkD3)
parent_idx <- sapply(1:length(level), function(n) rev(which(level[1:n] < level[n]))[1])
df <- data.frame(parent = value[parent_idx], child = value, stringsAsFactors = FALSE)
df$parent[is.na(df$parent)] <- ""
# named tree_list to avoid masking base R's list()
tree_list <- ToListExplicit(FromDataFrameNetwork(df), unname = TRUE)
radialNetwork(tree_list)
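To make the parent lookup easier to follow, here is the same idea in isolation, as a base R sketch with no packages. For each position, the parent is the most recent earlier element with a lower level. The name edges is illustrative, and "root" is used here as the label for missing parents instead of the empty string.

```r
level <- c(1, 2, 3, 4, 4, 3, 4, 4, 1, 2, 3)
value <- letters[1:11]

# For each element, find the last earlier index with a smaller level
parent_idx <- vapply(seq_along(level), function(n) {
  earlier <- which(level[seq_len(n)] < level[n])
  if (length(earlier) > 0) max(earlier) else NA_integer_
}, integer(1))

edges <- data.frame(
  parent = value[parent_idx],
  child  = value,
  stringsAsFactors = FALSE
)
edges$parent[is.na(edges$parent)] <- "root"  # top-level nodes have no parent

edges
```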
Here's a tidyverse way of achieving the same...
level <- c(1, 2, 3, 4, 4, 3, 4, 4, 1, 2, 3)
value <- letters[1:11]
library(tidyverse)
library(data.tree)
library(networkD3)
data.frame(level, value, stringsAsFactors = F) %>%
mutate(row = row_number()) %>%
mutate(level2 = level, value2 = value) %>%
spread(level2, value2) %>%
mutate(`0` = "") %>%
arrange(row) %>%
fill(-level, -value, -row) %>%
gather(parent_level, parent, -level, -value, -row) %>%
filter(parent_level == level - 1) %>%
arrange(row) %>%
select(parent, child = value) %>%
data.tree::FromDataFrameNetwork() %>%
data.tree::ToListExplicit(unname = TRUE) %>%
radialNetwork()
As a bonus, the current dev version of networkD3 (v0.4.9000) has a new treeNetwork function that takes a data frame with nodeId and parentId columns/variables, which eliminates the need for the data.tree functions to convert to JSON, so something like this works...
level <- c(1, 2, 3, 4, 4, 3, 4, 4, 1, 2, 3)
value <- letters[1:11]
library(tidyverse)
library(networkD3)
data.frame(level, value, stringsAsFactors = F) %>%
mutate(row = row_number()) %>%
mutate(level2 = level, value2 = value) %>%
spread(level2, value2) %>%
mutate(`0` = "root") %>%
arrange(row) %>%
fill(-level, -value, -row) %>%
gather(parent_level, parent, -level, -value, -row) %>%
filter(parent_level == level - 1) %>%
arrange(row) %>%
select(nodeId = value, parentId = parent) %>%
rbind(data.frame(nodeId = "root", parentId = NA)) %>%
mutate(name = nodeId) %>%
treeNetwork(direction = "radial")
I am trying to replicate the Fama French (1993) paper using R. I need to do the following sorting:
1. For each month,
2. calculate ME decile breakpoints on NYSE stocks only,
3. sort all stocks into the deciles created in step 2.
Data generation:
set.seed(1234)
n = 120
stocks <- c("A", "B", "C", "D", "E")
exchange <- c("NYSE", "NASDAQ", "AMEX")
df <- as.data.frame(cbind(Month = 1:12,
exchangeCode = exchange[round(runif(n, 1, 3))],
Stock = stocks[round(runif(n, 1, 5))],
ME=floor(100*abs(rnorm(n)))))
Desired Output:
ME_NYSE_vals <- as.numeric(paste(df[df$Month==1 & df$exchangeCode=="NYSE","ME"]))
ME_ALL_vals <- as.numeric(paste(df[df$Month==1,"ME"]))
cut(x = ME_ALL_vals,
breaks = c(-Inf,quantile(ME_NYSE_vals,probs=seq(.1,.9,.1)),+Inf),
labels = 1:10
)
The breaks should be calculated based on ME_NYSE_vals. The cut should be applied to all ME_ALL_vals for each month.
If the intention is to keep the whole data frame but generate deciles only for the NYSE values, the code below could do it. The point is to compute deciles only for the NYSE entries while keeping the full data set, which amounts to a partial sort.
# Libs
Vectorize(require)(package = c("dplyr", "magrittr"),
character.only = TRUE)
# Transformations
df %<>%
mutate(nTileNYSE = ifelse(exchangeCode == "NYSE", ntile(ME, 10), NA)) %>%
arrange(nTileNYSE)
The code was applied to the data:
set.seed(1)
df <- as.data.frame(cbind(exchangeCode = c("NYSE", "NASDAQ"),
Stock = c("A", "B", "C", "A"),
Month = 1:12,
ME=rnorm(1200)))
2nd approach
Following the discussion in the comments I would suggest the following approach:
# Libs --------------------------------------------------------------------
Vectorize(require)(package = c( "tidyr", "dplyr", "magrittr", "xts", "Hmisc"),
character.only = TRUE)
# Data generation ---------------------------------------------------------
set.seed(1234)
n = 120
stocks <- c("A", "B", "C", "D", "E")
exchange <- c("NYSE", "NASDAQ", "AMEX")
df <- as.data.frame(cbind(Month = 1:12,
exchangeCode = exchange[round(runif(n, 1, 3))],
Stock = stocks[round(runif(n, 1, 5))],
ME = floor(100*abs(rnorm(n)))))
# Transformations ---------------------------------------------------------
# Needed because cbind() coerces every column to character before as.data.frame()
df$ME <- as.numeric(as.character(df$ME))
# Generate cuts
dfNtiles <- df %>%
arrange(exchangeCode, Month, ME) %>%
group_by(exchangeCode, Month) %>%
mutate(cutsBsdOnNYSE = cut(x = ME,
breaks = cut2(x = df$ME[df$exchangeCode == "NYSE"],
g = 10, onlycuts = TRUE))) %>%
ungroup() %>%
group_by(cutsBsdOnNYSE) %>%
mutate(grpBsdOnNYSE = n())
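Note that the cut step above takes its breaks from df$ME across all months. If the breakpoints should instead be recomputed within each month, as the question describes, a sketch along these lines may help. assign_decile, toy and toy_deciles are hypothetical names, the toy data are made up for illustration, and findInterval() is swapped in for cut() so that tied breakpoints do not raise an error.

```r
library(dplyr)

# Assign each ME value to a decile (1..10) using breakpoints from NYSE rows only
assign_decile <- function(me, is_nyse) {
  if (!any(is_nyse)) {
    return(rep(NA_integer_, length(me)))  # no NYSE stocks: no breakpoints
  }
  breaks <- quantile(me[is_nyse], probs = seq(0.1, 0.9, 0.1), names = FALSE)
  # left.open = TRUE gives right-closed intervals, like cut()'s default
  findInterval(me, breaks, left.open = TRUE) + 1L
}

# Small made-up data set: two months, six stocks each
toy <- data.frame(
  Month        = rep(1:2, each = 6),
  exchangeCode = rep(c("NYSE", "NASDAQ", "NYSE", "AMEX", "NYSE", "NASDAQ"), times = 2),
  ME           = c(10, 20, 30, 40, 50, 60, 5, 15, 25, 35, 45, 55),
  stringsAsFactors = FALSE
)

toy_deciles <- toy %>%
  group_by(Month) %>%
  mutate(ME_decile = assign_decile(ME, exchangeCode == "NYSE")) %>%
  ungroup()

toy_deciles
```

Returning NA for months without NYSE stocks is a design choice in this sketch; dropping those months would be another reasonable option.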
It's fairly straightforward:
1. Generating cut brackets reflecting a subset of the data.
2. Applying those brackets to the whole vector (ME).
3. Numbering the obtained groups so a group identifier is created
and boils down to: