Subtotals in columns using reshape2 - r

I have spent some time now learning reshape2 and plyr but I still do not get it. This time I have a problem with (a) subtotals and (b) passing different aggregate functions. Here is an example using data from a tutorial on mrdwab's blog:
# libraries
library(plyr)
library(reshape2)
# get data and add few more variables
book.sales = read.csv("http://news.mrdwab.com/data-booksales")
book.sales$Stock = book.sales$Quantity + 10
book.sales$SubjCat[(book.sales$Subject == 'Economics') |
(book.sales$Subject == 'Management')] <- '1_EconSciences'
book.sales$SubjCat[book.sales$Subject %in%
c('Anthropology', 'Politics', 'Sociology')] <- '2_SocSciences'
book.sales$SubjCat[book.sales$Subject %in% c('Communication', 'Fiction',
'History', 'Research', 'Statistics')] <- '3_other'
# to get to my starting dataframe (close to the project I am working on)
book.sales1 <- ddply(book.sales, c('Region', 'Representative', 'SubjCat',
'Subject', 'Publisher'), summarize,
Stock = sum(Stock), Sold = sum(Quantity),
Ratio = round((100 * sum(Quantity)/sum(Stock)), digits = 1))
#melt it
m.book.sales = melt(data = book.sales1, id.vars = c('Region', 'Representative',
'SubjCat', 'Subject', 'Publisher'),
measure.vars = c('Stock', 'Sold', 'Ratio'))
# cast it --- # Please ignore this cast this was a mistake
# Tab1 <- dcast(data = m.book.sales,
# formula = Region + Representative ~ Publisher + variable,
# fun.aggregate = sum, margins = c('Region', 'Representative'))
Tab1 <- dcast(data = m.book.sales, formula = Region + Representative ~
SubjCat + Subject + variable, fun.aggregate = sum,
margins = c('Region', 'Representative', 'SubjCat', 'Subject'))
Now my questions:
1. I have been able to add the subtotals in rows. Is it also possible to add margins in the columns, say, for example, totals of Stock for one Publisher? (Sorry, I meant to say: for example, total sold for all publishers.)
2. There is a problem with the "Ratio" columns. How can I get "mean" instead of "sum" for this variable?
Please note: Question number one (about subtotals in margins) could be solved.
P.S.: I have seen some examples using reshape. Would you recommend using it instead of reshape2 (which seems not to include the functionality of two functions)?

Not sure exactly what you want for question 1, but if you want the total of stock per Publisher, would you not just do this?
# one row per Publisher with the summed Stock in the second column
totalofstock <- ddply(book.sales, 'Publisher', function(x)
c(subtotals = sum(x$Stock)))
and if you want to add it to Tab1 you just do this:
Tab1$bloomsburytotalofstock<-totalofstock[1,][[2]]
head(Tab1)
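If what is wanted is a total across all publishers as an extra column (and an extra row), one possible base-R sketch is xtabs() followed by addmargins(). The toy data frame below is hypothetical and only stands in for the book sales data:

```r
# Hypothetical toy data: units sold by region and publisher
toy <- data.frame(
  Region    = c("East", "East", "West", "West"),
  Publisher = c("A", "B", "A", "B"),
  Sold      = c(5, 7, 11, 13)
)

# Cross-tabulate the sums, then append a "Sum" row and a "Sum" column
tab <- xtabs(Sold ~ Region + Publisher, data = toy)
tab_with_margins <- addmargins(tab)
tab_with_margins
```

The "Sum" column holds the total over all publishers for each region, and the "Sum" row holds the total over all regions for each publisher.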
As for question 2, to get a mean instead of a sum you would change the function from sum to mean,
e.g.
# one row per Publisher with the mean Ratio in the second column
ratiomeans <- ddply(book.sales1, 'Publisher', function(x)
c(ratioMEAN = mean(x$Ratio)))
Also I would suggest sticking with reshape2. reshape2 is basically the new version of reshape. As far as I know reshape is no longer being worked on but still exists so that people with old code using reshape do not have to rewrite everything.
EDIT
justratio<-(m.book.sales[m.book.sales$variable=="Ratio",])
Tab2 <- dcast(data = justratio,
formula = Region + Representative ~ SubjCat + Subject + variable,
fun.aggregate = mean,
margins = c('Region', 'Representative', 'SubjCat', 'Subject'))
final<-merge(Tab1,Tab2,by=c("Region","Representative"))
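One thing to keep in mind for the Ratio columns: the mean of per-row ratios is generally not the same as the ratio recomputed from summed totals. The sketch below uses hypothetical toy data (not the book sales file) and base R only; which of the two figures is appropriate depends on what the report is meant to show.

```r
# Hypothetical toy data: two rows for East, one for West
toy <- data.frame(
  Region = c("East", "East", "West"),
  Stock  = c(20, 40, 50),
  Sold   = c(10, 10, 25)
)

# Option A: sum the raw counts per region, then recompute the ratio
totals <- aggregate(cbind(Stock, Sold) ~ Region, data = toy, FUN = sum)
totals$Ratio <- round(100 * totals$Sold / totals$Stock, 1)

# Option B: average the per-row ratios within each region
mean_of_ratios <- tapply(100 * toy$Sold / toy$Stock, toy$Region, mean)

totals
mean_of_ratios
```

For East, option A gives 100 * 20 / 60 and option B gives the mean of 50 and 25, so the two differ whenever the rows carry different amounts of stock.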

Related

r.squared matrix of predictions vs actual values in R

I want to create a matrix that displays the r.squared coefficient of determination of some predictions made over the years and the actual values.
My goal is to display a matrix that looks something like this.
The only way I found is to make multiple lists, calculate each row/ column individually using map2_dbl(l.predicted_line1, l.actual, ~ summary(lm(.x ~ .y))$r.squared), and then add the resulting vectors in a matrix with some code. This would create 9 lists, which I want to avoid.
Is there a way of doing this more efficiently?
#sample data
l.actual <- list(
overall_15 = c(59,65,73,73,64,69,64,69,63,NA,82,60,NA,73,NA,73,73,NA,69,
69,71,66,65,70,72,72,NA,64,69,67,64,71,NA,62,62,71,67,63,64,76,72),
overall_16 = c(60,68,75,74,68,71,NA,72,64,69,82,66,64,77,NA,71,72,NA,69,
69,75,67,71,73,73,73,NA,66,NA,69,65,70,76,NA,67,71,72,64,65,76,73),
overall_17 = c(63,68,NA,74,72,72,NA,73,66,69,83,67,64,76,NA,71,73,NA,70,
70,79,NA,73,72,NA,NA,NA,NA,NA,70,NA,70,77,NA,68,74,74,66,64,75,69),
overall_18 = c(NA,68,NA,78,73,72,NA,72,68,67,86,NA,62,75,65,71,71,67,71,
71,76,NA,71,71,NA,NA,74,NA,71,NA,NA,68,74,NA,67,75,74,65,NA,72,NA),
overall_19 = c(NA,NA,NA,77,73,72,NA,71,69,66,87,63,62,73,65,NA,NA,NA,NA,
NA,75,NA,NA,67,NA,NA,73,NA,NA,NA,NA,NA,74,NA,NA,74,74,65,NA,68,NA),
overall_20 = c(NA,NA,NA,77,NA,NA,NA,72,71,66,87,NA,NA,NA,65,NA,NA,NA,70,
70,75,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,74,NA,66,71,73,NA,NA,69,NA),
overall_21 = c(NA,67,NA,76,NA,69,NA,73,69,65,85,NA,NA,NA,NA,NA,NA,NA,NA,
NA,75,NA,NA,NA,NA,NA,69,NA,NA,NA,NA,NA,73,NA,67,68,72,NA,NA,68,NA),
overall_22 = c(NA,NA,NA,75,NA,NA,NA,75,67,65,84,NA,NA,NA,NA,NA,NA,NA,68,
68,73,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,67,69,71,NA,NA,68,NA)
)
l.predicted <- list(
potential_15 = c(59,68,74,76,65,75,64,72,66,NA,85,60,NA,76,NA,73,75,NA,71,
71,71,67,65,70,72,72,NA,68,74,67,64,71,NA,62,62,71,71,63,67,78,72),
potential_16 = c(60,71,75,75,68,73,NA,74,66,69,83,66,64,77,NA,71,74,NA,70,
70,76,67,71,73,73,73,NA,66,NA,69,65,70,76,NA,67,71,72,64,66,76,73),
potential_17 = c(63,69,NA,75,72,72,NA,73,69,69,83,67,64,76,NA,71,73,NA,70,
70,79,NA,73,72,NA,NA,NA,NA,NA,70,NA,70,77,NA,68,74,74,66,64,75,69),
potential_18 = c(NA,68,NA,78,73,72,NA,72,69,67,86,NA,62,75,65,71,71,67,71,
71,76,NA,71,71,NA,NA,74,NA,71,NA,NA,68,74,NA,67,75,74,65,NA,72,NA),
potential_19 = c(NA,NA,NA,77,73,72,NA,71,70,66,87,63,62,73,65,NA,NA,NA,NA,
NA,75,NA,NA,67,NA,NA,73,NA,NA,NA,NA,NA,74,NA,NA,74,74,65,NA,68,NA),
potential_20 = c(NA,NA,NA,77,NA,NA,NA,72,71,66,87,NA,NA,NA,65,NA,NA,NA,70,
70,75,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,74,NA,66,71,73,NA,NA,69,NA),
potential_21 = c(NA,67,NA,76,NA,69,NA,73,69,65,85,NA,NA,NA,NA,NA,NA,NA,NA,
NA,75,NA,NA,NA,NA,NA,69,NA,NA,NA,NA,NA,73,NA,67,68,72,NA,NA,68,NA),
potential_22 = c(NA,NA,NA,75,NA,NA,NA,75,67,65,84,NA,NA,NA,NA,NA,NA,NA,68,
68,73,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,67,69,71,NA,NA,68,NA)
)
Here is a solution using some tidyverse packages. The key thing is to use the function expand_grid() to get all combinations of the elements of each list. This results in a tibble with two named list columns. Next we can use mutate() to pull out the names of the list and assign them to new columns, and extract the numeric IDs. Use filter() to retain only the rows where potential is less than or equal to overall. Finally get the R-squared for each row using your suggested code, and plot. (Note I did not try too hard to get the plot to look just like yours.)
library(purrr)
library(dplyr)
library(ggplot2)
library(tidyr)
r_squared_combinations <- expand_grid(l.actual, l.predicted) %>%
mutate(overall = names(l.actual),
potential = names(l.predicted),
overall_n = as.numeric(gsub('overall_', '', overall)),
potential_n = as.numeric(gsub('potential_', '', potential))) %>%
filter(potential_n <= overall_n) %>%
mutate(r_squared = map2_dbl(l.predicted, l.actual, ~ summary(lm(.x ~ .y))$r.squared))
ggplot(r_squared_combinations, aes(x = overall, y = potential, fill = r_squared, label = round(r_squared, 3))) +
geom_tile() +
geom_text(color = 'white')
Side note: the base function expand.grid() would work about as well as tidyr::expand_grid(), but expand_grid() returns a tibble by default, which may be more convenient if you are otherwise using tidyverse functions.
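If a plain matrix is all that is needed, nested sapply() calls are one possible base-R alternative. This is only a sketch on small hypothetical lists (toy_actual and toy_predicted are made-up names and values); lm() drops rows containing NA by default, so the missing values need no special handling.

```r
# Hypothetical toy lists standing in for l.actual and l.predicted
toy_actual <- list(
  overall_15 = c(1, 2, 3, 4, NA, 6),
  overall_16 = c(2, 4, 5, 4, 6, 7)
)
toy_predicted <- list(
  potential_15 = c(1.1, 1.9, 3.2, 3.8, 5.1, 6.3),
  potential_16 = c(2, 3, 5, 5, 6, 8)
)

# Rows are predictions, columns are actual values
r2_matrix <- sapply(toy_actual, function(a) {
  sapply(toy_predicted, function(p) summary(lm(p ~ a))$r.squared)
})
round(r2_matrix, 3)
```

Unlike the filtered tibble above, this fills every cell; entries for unwanted combinations could be set to NA afterwards.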

How to calculate the moving average based on date and time in R

I uploaded my data.
https://filebin.net/a29fn87b8wpfnos0/Plume_2.csv?t=iouc5vg7
It looks like this in a csv file format
I looked for an answer that suits my data but could not find one.
I have spent about a month trying to solve it myself.
First I need to do a moving average for:
30 min
1 hour
1 day
1 week
for each PM2.5, PM10, NO2
However, I can't do that manually using this type of code:
Plume_2$PM2.5_30min_ <- TTR::SMA(Plume_2$pm2.5, n=31)
Plume_2$PM2.5_1hour_ <- TTR::SMA(Plume_2$pm2.5, n=61)
Plume_2$PM2.5_1day_ <- TTR::SMA(Plume_2$pm2.5, n=1441)
Plume_2$PM2.5_1week_ <- TTR::SMA(Plume_2$pm2.5, n=10080)
With this code, the n values do not match the dates I have.
I also tried the following code, but the average does not seem to work well.
library(runner)
dates = Plume_2$timestamp
value = Plume_2$PM2.5_Plume2
Plume_2$MA <- mean_run(x = value, k = 7, lag = 1, idx = as.Date(dates))
The final output will be a plot graph containing those different moving averages.
Can anyone help me, please?
I hope the following is a satisfying solution.
library(data.table)
dt <- fread("https://filebin.net/a29fn87b8wpfnos0/Plume_2.csv?t=phgmlykh")
# store the result as dt2 so it can be plotted below
dt2 <- dt[, .(timestamp,
PM2.5_30min_mean = frollmean(PM2.5_Plume2, 31),
PM2.5_1hour_mean = frollmean(PM2.5_Plume2, 61),
PM2.5_1day_mean = frollmean(PM2.5_Plume2, 1441),
PM2.5_1week_mean = frollmean(PM2.5_Plume2, 10080))]
The result is shown as
Then I want to plot the result using ggplot. Here I choose PM2.5_1hour_mean as an example.
library(ggplot2)
library(lubridate) # turn timestamp into POSIXct format with dmy_hm function
ggplot(dt2, aes(dmy_hm(timestamp), PM2.5_1hour_mean)) +
geom_line(na.rm = TRUE) + # drop the leading NAs of the rolling mean silently
scale_x_datetime()
Using zoo's rollmeanr function along with across from dplyr can help you with this.
library(dplyr)
library(zoo)
df <- read.csv('https://filebin.net/a29fn87b8wpfnos0/Plume_2.csv?t=up70ngy3')
df %>%
mutate(across(PM2.5_Plume2:NO2_Plume2,
list(avg_30min = ~rollmeanr(.x, 30, fill = NA),
avg_hour = ~rollmeanr(.x, 60, fill = NA),
avg_day = ~rollmeanr(.x, 1440, fill = NA),
avg_week = ~rollmeanr(.x, 10080, fill = NA)))) -> result
result
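The fixed window widths used above (30, 60, 1440, 10080 rows) only correspond to 30 minutes, an hour, a day and a week if there is exactly one reading per minute with no gaps. If that assumption does not hold, a window defined on the timestamps themselves may be closer to what is wanted. The following is a minimal base-R sketch; time_window_mean is a hypothetical helper, the data are made up, and the loop is quadratic in the number of rows, so it is meant to illustrate the idea rather than to be fast.

```r
# Hypothetical helper: mean of `values` over the `width_secs` seconds
# ending at (and including) each timestamp
time_window_mean <- function(times, values, width_secs) {
  t <- as.numeric(times)  # POSIXct -> seconds since the epoch
  vapply(seq_along(t), function(i) {
    in_window <- t > t[i] - width_secs & t <= t[i]
    mean(values[in_window], na.rm = TRUE)
  }, numeric(1))
}

# Made-up readings with a gap between the third and fourth timestamp
times  <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC") + c(0, 60, 120, 600)
values <- c(10, 20, 30, 40)

ma_3min <- time_window_mean(times, values, width_secs = 180)
ma_3min
```

Because the last reading has no neighbours within three minutes, its moving average is just its own value, which a fixed row count would not reflect.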

How to create a table in R for mean, SD, and range from particular data items?

I am trying to make a table in R for a particular set of data by creating five columns (name, total number of a particular name, mean, SD, and range).
I already have a dataset(sizes) with columns: name, height(H), and weight(W) and I would like to create a table using the sizes data with columns labeled as: name, total number of a particular name, mean of H, SD of H, and range of H, mean of W, SD of W, and range of W. However, I am having trouble extracting the data by name. Any suggestions?
example of dataframe (sizes)
desired table
This may be a rookie question, but that is exactly what I am in the R world so any help would be great!
I think that example below will be helpful:
library(dplyr)
data<-iris
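data %>% group_by(Species) %>% summarise(Count = n(), Mean = mean(Sepal.Width), SD = sd(Sepal.Width))
Here Species plays the role of your name column: group by the column holding the names. Note that the columns are referenced without data$ inside summarise(), otherwise the statistics are computed over the whole data set rather than per group.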
The following does what you want. I have used the built in dataset iris, selecting one name column and two numeric columns.
The main function is aggregate. You should go through its help page. At an R command prompt run ?aggregate.
sizes <- iris[5:3]
names(sizes) <- c("name", "height", "weight")
head(sizes)
stats <- function(x){
c(Sum = sum(x), Mean = mean(x), SD = sd(x), Range = range(x))
}
agg <- aggregate(. ~ name, data = sizes, stats)
colnames(agg$height) <- paste("height", colnames(agg$height), sep = ".")
colnames(agg$height) <- sub("Range1", "Min", colnames(agg$height))
colnames(agg$height) <- sub("Range2", "Max", colnames(agg$height))
colnames(agg$weight) <- paste("weight", colnames(agg$weight), sep = ".")
colnames(agg$weight) <- sub("Range1", "Min", colnames(agg$weight))
colnames(agg$weight) <- sub("Range2", "Max", colnames(agg$weight))
agg <- cbind(agg[1], agg$height, agg$weight)
agg
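Since the question asks for the number of rows per name rather than a sum, here is a small base-R sketch that returns a count alongside the mean, SD and range. The toy data frame is hypothetical and only one measurement column is shown; a second column such as weight would follow the same pattern.

```r
# Hypothetical toy data standing in for the sizes data frame
toy <- data.frame(
  name   = c("a", "a", "b", "b", "b"),
  height = c(1, 3, 2, 4, 6)
)

# One summary row per name: split, summarise each piece, bind together
per_name <- lapply(split(toy, toy$name), function(d) {
  data.frame(
    name   = d$name[1],
    n      = nrow(d),
    mean_h = mean(d$height),
    sd_h   = sd(d$height),
    min_h  = min(d$height),
    max_h  = max(d$height)
  )
})
summary_table <- do.call(rbind, per_name)
summary_table
```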

How to work around an error while reshaping a data frame with spread()

I am trying to transform a long data frame into a wide one with flagged cases. I pivot it and use a temporary vector that serves as a flag. It works perfectly on small data sets (see the example; copy and paste it into RStudio), but when I try to do it on real data it reports an error:
churnTrain3 <- spread(churnTrain, key = "state", value = "temporary", fill = 0)
Error: Duplicate identifiers for rows (169, 249), (57, 109), (11, 226)
The wide structure of the data set is needed for further processing.
Is there any workaround for this problem? I bet a lot of people run into the same problem when cleaning data.
Please help me
Here is the code:
The first chunk, "example", makes a small data set to show how the result is supposed to look.
The second chunk, "real data", is a slice of the data set from the churn library.
library(caret)
library(tidyr)
#example
#============
df <- data.frame(var1 = (1:6),
var2 = (7:12),
factors = c("facto1", "facto2", "facto3", "facto3","facto5", "facto1") ,
flags = c(1, 1, 1, 1, 1, 1))
df
df2 <- spread(data = df, key = "factors" , value = flags, fill = " ")
df2
#=============
# real data
#============
data(churn)
str(churnTrain)
churnTrain <- churnTrain[1:250,1:4]
churnTrain$temporary <-1
churnTrain3 <- spread(churnTrain, key = "state", value = "temporary", fill = 0)
str(churnTrain)
head(churnTrain3)
str(churnTrain3)
#============
Spread can only put one unique value in the 'cell' that intersects the spread 'key' and the rest of the data (in the churn example, account_length, area_code and international_plan). So the real question is how to manage these duplicate entries. The answer to that depends on what you are trying to do. I provide one possible solution below. Instead of making a dummy 'temporary' variable, I instead count the number of episodes and use that as the dummy variable. This can be done very easily with dplyr:
library(tidyr)
library(dplyr)
library(C50) # this is one source for the churn data
data(churn)
churnTrain <- churnTrain[1:250,1:4]
churnTrain2 <- churnTrain %>%
group_by(state, account_length, area_code, international_plan) %>%
tally %>%
dplyr::rename(temporary = n)
churnTrain3 <- spread(churnTrain2, key = "state", value = "temporary", fill = 0)
Spread now works.
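To see which rows are responsible for the "Duplicate identifiers" error before deciding how to handle them, duplicated() can be used, and a count-based wide table can also be produced in base R. This is a sketch on hypothetical toy data, not on the churn data itself:

```r
# Hypothetical toy data: id 1 appears twice with the same state
toy <- data.frame(
  id    = c(1, 1, 2),
  state = c("KS", "KS", "OH")
)

# Rows that share all their values with at least one other row
dup_rows <- toy[duplicated(toy) | duplicated(toy, fromLast = TRUE), ]

# Wide table of counts: one row per id, one column per state
wide_counts <- as.data.frame.matrix(table(toy$id, toy$state))

dup_rows
wide_counts
```

The counts play the same role as the tallied temporary column above: a cell holds how many times that combination occurred instead of a single flag.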
As others point out, you need to pass unique identifiers to spread. My solution is to use base R:
library(C50)
f<- function(df, key){
if (sum(names(df)==key)==0) stop("No such key");
u <- unique(df[[key]])
id <- matrix(0,dim(df)[1],length(u))
uu <- lapply(df[[key]],function(x)which(u==x)) ## check 43697442 for details
for(i in 1:dim(df)[1]) id[i,uu[[i]]] <- 1
colnames(id) = as.character(u)
return(cbind(df,id));
}
df <- data.frame(var1 = (1:6),
var2 = (7:12),
factors = c("facto1", "facto2", "facto3", "facto3","facto5", "facto1"))
f(df, key='fact') # stops with "No such key"; included to show the check
f(df, key='factors')
data(churn)
churnTrain <- churnTrain[1:250,1:4]
f(churnTrain, key='state')
Although the f function contains a for-loop and some temporary variables, it is not slow in practice.
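Another base-R route to the same kind of 0/1 indicator columns is model.matrix(). The sketch below uses a hypothetical toy data frame; removing the intercept with "- 1" makes every level get its own column, and the "factors" prefix that model.matrix() adds to the column names is stripped afterwards.

```r
# Hypothetical toy data with a repeated level
toy <- data.frame(
  var1    = 1:4,
  factors = c("facto1", "facto2", "facto1", "facto3")
)

# One indicator column per level; duplicate rows are not a problem here
flags <- model.matrix(~ factors - 1, data = toy)
colnames(flags) <- sub("^factors", "", colnames(flags))

wide <- cbind(toy, flags)
wide
```

Note that model.matrix() drops rows where the key is NA, so this assumes the key column is complete.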

Cut a variable differently based on another grouping variable

Example: I have a dataset of heights by gender.
I'd like to split the heights into low and high where the cut points are defined as the mean - 2sd within each gender.
example dataset:
set.seed(8)
df = data.frame(sex = c(rep("M",100), rep("F",100)),
ht = c(rnorm(100, mean=1.7, sd=.17), rnorm(100, mean=1.6, sd=.16)))
I'd like to do something in a single line of vectorized code because I'm fairly sure that is possible, however, I do not know how to write it. I imagine that there may be a way to use cut(), apply(), and/or dplyr to achieve this.
How about this using cut from base R:
sapply(c("F", "M"), function(s){
dfF <- df[df$sex==s,] # filter out per gender
cut(dfF$ht, breaks = c(0, mean(dfF$ht)-2*sd(dfF$ht), Inf), labels = c("low", "high"))
})
# dfF$ht heights per gender
# mean(dfF$ht)-2*sd(dfF$ht) cut point
In the code below, I create two new data frames. Each is built by grouping on the sex variable and filtering on a different range of ht.
library(dplyr)
df_low <- df %>% group_by(sex) %>% filter(ht<(mean(ht)-2*sd(ht)))
df_high<- df %>% group_by(sex) %>% filter(ht>(mean(ht)+2*sd(ht)))
Just discovered the following solution using base r:
df$ht_grp <- ave(x = df$ht, df$sex,
FUN = function(x)
cut(x, breaks = c(0, (mean(x, na.rm=T) - 2*sd(x, na.rm=T)), Inf)))
This works because I know that 0 and Inf are reasonable bounds, but I could also use min(x) and max(x) as my lower and upper bounds. This results in a factor variable that is split into low, high, and NA.
My prior solution:
I came up with the following two-step process which is not so bad:
df = merge(df,
setNames( aggregate(ht ~ sex, df, FUN = function(x) mean(x)-2*sd(x)),
c("sex", "ht_cutoff")),
by = "sex")
df$ht_is_low = ifelse(df$ht <= df$ht_cutoff, 1, 0)
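A variant that avoids the merge, sketched here under the assumption that a two-level low/high factor is the desired output: ave() can return the per-group cutoff aligned with the original rows, and the comparison is then a single vectorised step. The variable names cutoff and ht_grp are only illustrative.

```r
set.seed(8)
df <- data.frame(
  sex = c(rep("M", 100), rep("F", 100)),
  ht  = c(rnorm(100, mean = 1.7, sd = 0.17),
          rnorm(100, mean = 1.6, sd = 0.16))
)

# Per-row cutoff: mean - 2 * sd computed within each sex
cutoff <- ave(df$ht, df$sex, FUN = function(x) mean(x) - 2 * sd(x))

# Label each height relative to the cutoff of its own group
df$ht_grp <- factor(ifelse(df$ht < cutoff, "low", "high"),
                    levels = c("low", "high"))
table(df$sex, df$ht_grp)
```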
