I have data that look like this (thanks once again dput!):
dat <- structure(list(vars = c("var_1", "var_2"), data = list(structure(list(
time = 1:10, value = c(1:10
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
)), structure(list(time = 1:10, value = c(11:20
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))), mu = c(1, 2), stdev = c(1,2)), class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA,-2L))
I am trying to mutate an extra column that maps a function over each row, e.g. calculate dnorm for each element of the nested variable in dat$data[[1]]$value using dat$mu[1] and dat$stdev[1], and then go on to do the same for row two.
The column I would like to mutate is a tibble [10 x 1] for each row containing this as the output:
dnorm(dat$data[[1]]$value, mean = dat$mu[1], sd = dat$stdev[1])
dnorm(dat$data[[2]]$value, mean = dat$mu[2], sd = dat$stdev[2])
Things I have tried that don't work but might be close?:
# This alternates between mean and stdev for each element of each nested variable
dat_1 <- dat %>%
mutate(z = map(data, ~ dnorm(.x$value, mean = dat$mu, sd = dat$stdev)))
# apply by row has structure issues
dat_2 <- dat %>%
apply(MARGIN = 1, function(x){
mutate(x, z = map(data, ~ dnorm(.x$value, mean = dat$mu, sd = dat$stdev)))
})
A basic map call like dat_3 <- dat %>% mutate(sigma = map(data, ~ sum(.x$value))) works fine, since it does not reference other values in the df. These are early days for me with nested data and map used this way. I have been looking through the documentation for all the map functions to try to solve this, but no luck yet! If that's as clear as mud I can try to clarify. Thanks in advance!
We can use a parallel map:
library(purrr)
library(dplyr)
expected_out1 <- dnorm(dat$data[[1]]$value, mean = dat$mu[1], sd = dat$stdev[1])
expected_out2 <- dnorm(dat$data[[2]]$value, mean = dat$mu[2], sd = dat$stdev[2])
out <-
dat %>%
mutate(z = pmap(list(map(data, "value"), mu, stdev), dnorm))
all.equal(out$z, list(expected_out1, expected_out2))
# [1] TRUE
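To see why the first attempt in the question misbehaves, here is a minimal base R sketch. It swaps in Map() for pmap() (both walk several inputs in parallel), and data_list, mu and stdev are stand-in objects invented for this illustration rather than columns of the original tibble.

```r
# Stand-in for the nested column: one data frame of values per row
data_list <- list(
  data.frame(time = 1:10, value = 1:10),
  data.frame(time = 1:10, value = 11:20)
)
mu    <- c(1, 2)
stdev <- c(1, 2)

# Map() pairs the i-th data frame with the i-th mean and the i-th sd
z <- Map(function(d, m, s) dnorm(d$value, mean = m, sd = s),
         data_list, mu, stdev)

# Passing the full mu/stdev vectors instead recycles them across the ten
# values, so mean and sd alternate between 1 and 2 element by element
recycled <- dnorm(data_list[[1]]$value, mean = mu, sd = stdev)

isTRUE(all.equal(z[[1]], recycled))
# [1] FALSE
```

The same idea applies inside mutate(): each row needs its own scalar mu and stdev, which is what iterating over the columns in parallel provides.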
I have some tidy data and need to transform it into a format that works for building small graphs (sparklines) using the dataui package. You can see the required dataframe format in the code example below, df_sparkline.
The tidy data I have has about 30 companies and a year of data which is < 10,000 rows. What is the best (clearest to understand is valued more than raw speed) way to transform df_tidy to df_sparklines?
library("dataui")
library("reactable")
library("tidyverse")
df_tidy <- tibble(
company = c("A", "B", "A", "B", "A", "B"),
line_data = c(1, 2, 2, 2, 1, 1),
date = c(as.Date("2021-01-01"), as.Date("2021-01-01"), as.Date("2021-01-02"), as.Date("2021-01-02"), as.Date("2021-01-03"), as.Date("2021-01-03"))
)
df_sparkline <- structure(list(company = c("A", "B"), line_data = list(list(c(1, 2, 1)), list(c(2, 2, 1)))), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"))
rt1 <- reactable(
df_sparkline,
columns = list(
line_data = colDef(
cell = function(value, index) {
dui_sparkline(
data = value[[1]],
height = 80,
components = dui_sparklineseries(curve = "linear") # https://github.com/williaster/data-ui/tree/master/packages/sparkline#series
)
}
)
)
)
rt1
All you need is group_by() and summarise():
df_sparkline2 = df_tidy %>%
group_by(company) %>%
summarise(line_data=list(list(line_data)))
waldo::compare(df_sparkline, df_sparkline2)
# √ No differences
The key here is to call list() inside summarise().
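For anyone who wants to see the target shape without dplyr, the following is a rough base R sketch of the same nesting. It uses split() and lapply() in place of group_by() and summarise(), and df_base is a name made up for this example.

```r
df_tidy <- data.frame(
  company   = c("A", "B", "A", "B", "A", "B"),
  line_data = c(1, 2, 2, 2, 1, 1)
)

# One numeric vector per company, in sorted company order
per_company <- split(df_tidy$line_data, df_tidy$company)

# Wrap each vector in list() so every cell is a list holding one vector,
# which is the shape that value[[1]] in the cell function relies on
df_base <- data.frame(company = names(per_company))
df_base$line_data <- lapply(unname(per_company), list)

df_base$line_data[[1]][[1]]
# [1] 1 2 1
```

The double nesting is the point: the outer list makes a list column, and the inner list() is what the sparkline cell unwraps with value[[1]].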
I want to define the first two columns of a data frame as rownames. Actually I want to do some calculations and the data frame has to be numeric for that.
data.frame <- data_frame(id=c("A1","B2"),name=c("julia","daniel"),BMI=c("20","49"))
The values for BMI are numeric (checked with is.numeric), but the data frame as a whole is not. How can I define the first two columns (id and name) as rownames?
Thank you in advance for any suggestions
You can combine the id and name columns and then assign the result as rownames
data.frame %>%
tidyr::unite(rowname, id, name) %>%
tibble::column_to_rownames()
# BMI
#A1_julia 20
#B2_daniel 49
In base R, you can do the same in steps as
data.frame <- as.data.frame(data.frame)
rownames(data.frame) <- paste(data.frame$id, data.frame$name, sep = "_")
data.frame[c('id', 'name')] <- NULL
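Since the goal is a data frame that is numeric throughout, a small sketch of the full round trip may help. It assumes BMI arrives as character strings, as in the example data, and uses people as a hypothetical object name so that nothing shadows the data.frame() function.

```r
people <- data.frame(
  id   = c("A1", "B2"),
  name = c("julia", "daniel"),
  BMI  = c("20", "49"),
  stringsAsFactors = FALSE
)

# Move the two identifier columns into the row names
rownames(people) <- paste(people$id, people$name, sep = "_")
people[c("id", "name")] <- NULL

# Convert the remaining column from character to numeric
people$BMI <- as.numeric(people$BMI)

sapply(people, is.numeric)
#  BMI
# TRUE
```

If BMI were stored as a factor instead, as.numeric(as.character(people$BMI)) would be needed, because as.numeric() on a factor returns the internal level codes.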
Not sure if the code and result below is the thing you are after:
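dfout <- `rownames<-`(data.frame(BMI = as.numeric(as.character(df$BMI))),paste(df$id,df$name)) # as.character first: BMI is a factor, and as.numeric on a factor returns level codes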
such that
> dfout
BMI
A1 julia 20
B2 daniel 49
DATA
df <- structure(list(id = structure(1:2, .Label = c("A1", "B2"), class = "factor"),
name = structure(2:1, .Label = c("daniel", "julia"), class = "factor"),
BMI = structure(1:2, .Label = c("20", "49"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
I have this list:
list(structure(list(a = 1:10, b = 2:11, c = 3:12), .Names = c("a",
"b", "c"), row.names = c(NA, -10L), class = "data.frame"), structure(list(
a = 1:10, b = 2:11, c = 3:12), .Names = c("a", "b", "c"), row.names = c(NA,
-10L), class = "data.frame"), structure(list(a = 1:10, b = 2:11,
c = 3:12), .Names = c("a", "b", "c"), row.names = c(NA, -10L
), class = "data.frame"))
And this function:
fun1<-function(x){
funs<-c(s=sum,m=mean)
lapply(funs,function(f)f(x,na.rm=TRUE))
}
With lapply the result is ok. See:
list%>%
lapply(function(x){
lapply(x,fun1)
})
But, purrr::map doesn't work:
list%>%
map(.)%>%
map(.,fun1)
What's wrong?
Your syntax for the map part is wrong. You need the same code structure as you are using with lapply. Also, don't give objects the same name as R functions; below, the list is stored as my_list instead of list.
First let's get rid of the pipes so the two versions look more alike:
library(purrr)
lapply_outcome <- lapply(my_list, function(x) {lapply(x, fun1)})
map_outcome <- map(my_list, function(x) {map(x, fun1)})
identical(lapply_outcome, map_outcome)
[1] TRUE
With pipes:
my_list %>%
lapply(function(x) lapply(x,fun1))
my_list %>%
map(., function(x) map(x, fun1))
or with a formula call inside map, but personally I find this less readable:
my_list %>%
map(~ map(., fun1))
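The reason two levels of iteration are needed can be sketched in base R alone: the outer loop visits each data frame, the inner loop visits each column, and only then does fun1 receive a plain numeric vector. The list below is a shortened stand-in for the one in the question.

```r
my_list <- list(
  data.frame(a = 1:10, b = 2:11, c = 3:12),
  data.frame(a = 1:10, b = 2:11, c = 3:12)
)

fun1 <- function(x) {
  funs <- c(s = sum, m = mean)
  lapply(funs, function(f) f(x, na.rm = TRUE))
}

# Outer lapply: one data frame at a time; inner lapply: one column at a time
nested <- lapply(my_list, function(df) lapply(df, fun1))

nested[[1]]$a$s
# [1] 55
nested[[1]]$c$m
# [1] 7.5
```

Replacing both lapply() calls with map() gives the purrr version, because map() takes the same (data, function) arguments.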
I have a data.frame
res0 = structure(list(year = "2017", il = 11200000), .Names = c("year",
"il"), row.names = c(NA, -1L), class = "data.frame")
however, when I try to make this an xts object I lose the column names.
as.xts(x = res0[,2:ncol(res0)], order.by = as.POSIXct(paste0(res0$year,"-01-01")), name = NULL)
This returns:
[,1]
2017-01-01 11200000
instead of
il
2017-01-01 11200000
Subscripting in R drops dimensions by default. Use drop = FALSE to prevent this.
res0[, 2:ncol(res0), drop = FALSE]
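A quick way to check what is being lost, using only base R and the same one-row data frame (as_vector and as_frame are names chosen for this illustration):

```r
res0 <- data.frame(year = "2017", il = 11200000, stringsAsFactors = FALSE)

# Default subsetting of a single column returns a bare vector:
# no dimensions and no column name to carry over
as_vector <- res0[, 2:ncol(res0)]

# drop = FALSE keeps a one-column data frame, so the name "il" survives
as_frame <- res0[, 2:ncol(res0), drop = FALSE]

is.null(dim(as_vector))
# [1] TRUE
colnames(as_frame)
# [1] "il"
```

Passing as_frame rather than as_vector as x is what lets the column name reach the resulting object.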
Also note that this works to create an n x 1 zoo series with year as the index.
library(zoo)
z <- read.zoo(res0, FUN = c, drop = FALSE)
I have many dataframes stored in a list, and I want to create weighted averages from these and store the results in a new dataframe. For example, with the list:
dfs <- structure(list(df1 = structure(list(A = 4:5, B = c(8L, 4L), Weight = c(TRUE, TRUE), Site = c("X", "X")),
.Names = c("A", "B", "Weight", "Site"), row.names = c(NA, -2L), class = "data.frame"),
df2 = structure(list(A = c(6L, 8L), B = c(9L, 4L), Weight = c(FALSE, TRUE), Site = c("Y", "Y")),
.Names = c("A", "B", "Weight", "Site"), row.names = c(NA, -2L), class = "data.frame")),
.Names = c("df1", "df2"))
In this example, I want to use columns A, B, and Weight for the weighted averages. I also want to move over related data such as Site, and want to sum the number of TRUE and FALSE. My desired result would look something like:
result <- structure(list(Site = structure(1:2, .Label = c("X", "Y"), class = "factor"),
A.Weight = c(4.5, 8), B.Weight = c(6L, 4L), Sum.Weight = c(2L,
1L)), .Names = c("Site", "A.Weight", "B.Weight", "Sum.Weight"
), class = "data.frame", row.names = c(NA, -2L))
Site A.Weight B.Weight Sum.Weight
1 X 4.5 6 2
2 Y 8.0 4 1
The above is just a very simple example, but my real data have many dataframes in the list, and many more columns than just A and B for which I want to calculate weighted averages. I also have several columns similar to Site that are constant in each dataframe and that I want to move to the result.
I'm able to manually calculate weighted averages using something like
weighted.mean(dfs$df1$A, dfs$df1$Weight)
weighted.mean(dfs$df1$B, dfs$df1$Weight)
weighted.mean(dfs$df2$A, dfs$df2$Weight)
weighted.mean(dfs$df2$B, dfs$df2$Weight)
but I'm not sure how I can do this in a shorter, less "manual" way. Does anyone have any recommendations? I've recently learned how to lapply across dataframes in a list, but my attempts have not been so great so far.
The trick is to create a function that works for a single data.frame, then use lapply to iterate across your list. Since lapply returns a list, we'll then use do.call to rbind the resulting objects together:
foo <- function(data, meanCols = LETTERS[1:2], weightCol = "Weight", otherCols = "Site") {
means <- t(sapply(data[, meanCols], weighted.mean, w = data[, weightCol]))
sumWeight <- sum(data[, weightCol])
others <- data[1, otherCols, drop = FALSE] #You said all the other data was constant, so we can just grab first row
out <- data.frame(others, means, sumWeight)
return(out)
}
In action:
do.call(rbind, lapply(dfs, foo))
---
Site A B sumWeight
df1 X 4.5 6 2
df2 Y 8.0 4 1
Since you said this was a minimal example, here's one approach to expanding this to other columns. We'll use grepl() and use regular expressions to identify the right columns. Alternatively, you could write them all out in a vector. Something like this:
do.call(rbind, lapply(dfs, foo,
meanCols = grepl("A|B", names(dfs[[1]])),
otherCols = grepl("Site", names(dfs[[1]]))
))
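Another option, sketched below under the assumption that every column other than the weight and the constant columns should be averaged, is to derive the mean columns with setdiff() instead of a regular expression. summarise_one is a hypothetical helper written for this illustration, not part of the answer above.

```r
dfs <- list(
  df1 = data.frame(A = 4:5, B = c(8L, 4L),
                   Weight = c(TRUE, TRUE), Site = c("X", "X")),
  df2 = data.frame(A = c(6L, 8L), B = c(9L, 4L),
                   Weight = c(FALSE, TRUE), Site = c("Y", "Y"))
)

summarise_one <- function(data, weightCol = "Weight", otherCols = "Site") {
  # Everything that is neither the weight nor a constant column gets averaged
  meanCols <- setdiff(names(data), c(weightCol, otherCols))
  means <- lapply(data[meanCols], weighted.mean, w = data[[weightCol]])
  data.frame(data[1, otherCols, drop = FALSE], means,
             Sum.Weight = sum(data[[weightCol]]))
}

result <- do.call(rbind, lapply(dfs, summarise_one))
result$A
# [1] 4.5 8.0
```

This avoids accidental regex matches (for example "A|B" also matching a column named "Area"), at the cost of assuming the remaining columns are all numeric.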
using dplyr
library(dplyr)
library('devtools')
install_github('hadley/tidyr')
library(tidyr)
unnest(dfs) %>%
group_by(Site) %>%
filter(Weight) %>%
mutate(Sum=n()) %>%
select(-Weight) %>%
summarise_each(funs(mean=mean(., na.rm=TRUE)))
gives the result
# Site A B Sum
#1 X 4.5 6 2
#2 Y 8.0 4 1
Or using data.table
library(data.table)
DT <- rbindlist(dfs)
DT[(Weight)][, c(lapply(.SD, mean, na.rm = TRUE),
Sum=.N), by = Site, .SDcols = c("A", "B")]
# Site A B Sum
#1: X 4.5 6 2
#2: Y 8.0 4 1
Update
In response to @jazzuro's comment, using dplyr 0.3, I am getting
unnest(dfs) %>%
group_by(Site) %>%
summarise_each(funs(weighted.mean=stats::weighted.mean(., Weight),
Sum.Weight=sum(Weight)), -starts_with("Weight")) %>%
select(Site:B_weighted.mean, Sum.Weight=A_Sum.Weight)
# Site A_weighted.mean B_weighted.mean Sum.Weight
#1 X 4.5 6 2
#2 Y 8.0 4 1
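For readers on a current dplyr, here is a hedged sketch of the same weighted summary. It assumes dplyr 1.0 or later, where across() is available, and uses bind_rows() to stack the list of data frames; it is an alternative formulation, not the code that was run above.

```r
library(dplyr)

dfs <- list(
  df1 = data.frame(A = 4:5, B = c(8L, 4L),
                   Weight = c(TRUE, TRUE), Site = c("X", "X")),
  df2 = data.frame(A = c(6L, 8L), B = c(9L, 4L),
                   Weight = c(FALSE, TRUE), Site = c("Y", "Y"))
)

result <- bind_rows(dfs) %>%
  group_by(Site) %>%
  summarise(
    # Weighted mean of each value column, using the group's Weight column
    across(c(A, B), ~ weighted.mean(.x, Weight)),
    # Number of TRUE weights per site
    Sum.Weight = sum(Weight)
  )
```

With many value columns, the c(A, B) selection could be replaced by a tidyselect helper such as where(is.numeric), provided the weight column is excluded.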