I have to make a table in which the mean and median from the logsales per month are set out from the "txhousing" dataset. The exercise i got is the following: “The table below show the means and medians of the log of sales per month, sorted by mean”
Insert a new r chunk and type the code in it to display that table
Use na.omit to get rid of cases with missing values
Use the dplyr command mutate to make the variable logsales
Use the dplyr command group_by to group by month
Use the dplyr command summarise to display the table
Use the dplyr command arrange to sorth by mean
Connect the commands with the pipe operator %>%
I've tried to mix up the code multiple times but I can't find out why it keeps on giving me NA's in my table.
library(tidyverse)
summary(txhousing)
na.omit(txhousing)
txhousing<- as.data.frame(txhousing)
logsales <- log(txhousing$sales)
group_by(txhousing, txhousing$month)
txhousing<- txhousing %>% mutate(logsales= log(txhousing$sales))
txhousing %>% group_by(txhousing$month) %>% summarise(mean(logsales), median(logsales)) %>% arrange(mean)
I expect to get a table with the mean and median of the logsales per month, but what i get is only NA in the column from the mean en the median and the arrange gives the following error:
Error: cannot arrange column of class 'function' at position 1`
There are NA values in the columns, so you need to tell mean and median to ignore them. And also name the columns in summarise to use arrange on column named as mean.
txhousing %>%
group_by(txhousing$month) %>%
summarise(mean = mean(logsales, na.rm = T),
med= median(logsales, na.rm = T)) %>%
arrange(mean) %>%
rename(month = `txhousing$month`)
This creates following tibble
# A tibble: 12 x 3
month mean med
<int> <dbl> <dbl>
1 1 4.95 4.74
2 2 5.13 4.93
3 11 5.19 4.96
4 12 5.24 5.02
5 10 5.29 5.08
6 9 5.32 5.09
7 3 5.38 5.15
8 4 5.42 5.21
9 5 5.52 5.29
10 7 5.53 5.30
11 8 5.53 5.33
12 6 5.56 5.34
Related
I am trying to combing the dividend history of 2 different stocks and I have the below code:
library(quantmod)
library(tidyr)
AAPL<-round(getDividends("AAPL"),4)
MSFT<-round(getDividends("MSFT"),4)
dividend<-(cbind(AAPL,MSFT))
As the 2 stocks pay out dividend on different dates so there will be NAs after combining and so I try to use drop_na function from tidyr like below:
drop_na(dividend)
#Error in UseMethod("drop_na") :
no applicable method for 'drop_na' applied to an object of class "c('xts', 'zoo')"
May I know what did I do wrong here? Many thanks for your help.
Update 1:
Tried na.omit, which returns with the following:
> dividend<-(cbind(AAPL,MSFT))%>%
> na.omit(dividend)%>%
> print(dividend)
[,1] [,2]
drop_na only works on data frames but even if you converted it to data frame it would not give what you want. Every row contains an NA so drop_na and na.omit would both remove every row leaving a zero row result.
Instead try aggregating over year/month which gives zoo object both. If you need an xts object use as.xts(both) .
both <- aggregate(dividend, as.yearmon, sum, na.rm = TRUE)
Optionally we could remove the rows with zero dividends in both columns:
both[rowSums(both) > 0]
We could use na.omit
na.omit(dividend)
drop_na expects a data.frame as input whereas the the object dividend is not a data.frame
We could fortify into a data.frame and then use drop_na after reordering the NAs
library(dplyr)
library(lubridate)
fortify.zoo(dividend) %>%
group_by(Index = format(ymd(Index), '%Y-%b')) %>%
mutate(across(everything(), ~ .x[order(is.na(.x))])) %>%
ungroup %>%
drop_na()
-output
# A tibble: 41 × 3
Index AAPL.div MSFT.div
<chr> <dbl> <dbl>
1 2012-Aug 0.0034 0.2
2 2012-Nov 0.0034 0.23
3 2013-Feb 0.0034 0.23
4 2013-May 0.0039 0.23
5 2013-Aug 0.0039 0.23
6 2013-Nov 0.0039 0.28
7 2014-Feb 0.0039 0.28
8 2014-May 0.0042 0.28
9 2014-Aug 0.0294 0.28
10 2014-Nov 0.0294 0.31
# … with 31 more rows
I am trying to compute the variance for each group in a data set with multiple factors. For example, the data set below is the first 6 lines of a data frame with 5 columns: 4 factors of two levels each (No and Yes) and 1 continuous variable:
Factor A
Factor B
Factor C
Factor D
VarX
Yes
Yes
Yes
No
66.8
No
Yes
Yes
No
66.0
Yes
No
No
No
58.4
No
Yes
Yes
Yes
68.3
Yes
Yes
Yes
No
61.8
Yes
No
No
No
67.3
What I want to do is produce a summary table such as the one below:
Factor
SD (NO)
SD (YES)
SD Ratio
Factor A
3.79
3.51
1.08
Factor B
3.44
3.83
1.11
Factor C
3.77
3.53
1.07
Factor D
3.92
3.32
1.18
For each factor, I have calculated the standard deviation at each level ("No" and "Yes") as well as the ratio of the two standard deviations.
Here is the code I am using to do this:
#
# Define modify function for SD ratio column
#
sd_ratio<-function(x,y){
return(max(x,y)/min(x,y))
}
#
# Set up storage
#
nc<-4 # number of factors in data
testDataSum<-tibble(SD_No=rep(NA,nc),
SD_Yes=rep(NA,nc),
SD_Ratio=rep(NA,nc))
#
Factor<-vector("list",4)
SDList<-vector("list",4)
#
# For Loop. Group data by factors 1,2,3,4
#
for (i in 1:4){
Factor[[i]]<-names(testData[,i])
SDList[[i]]<-testData %>%
group_by(testData[,i])%>%
summarize(SD=sd(VarX))
}
# Load summary DF with data by unlisting SDList
#
testDataSum$SD_No<-as.vector(matrix(unlist(SDList),ncol=4,byrow=T)[,3])
testDataSum$SD_Yes<-as.vector(matrix(unlist(SDList),ncol=4,byrow=T)[,4])
testDataSum$SD_Ratio=modify2(testDataSum$SD_No,testDataSum$SD_Yes,sd_ratio)
#
# Load formatted factor names and put it at the front
#
testDataSum<-testDataSum %>%
mutate(Factor=unlist(Factor)) %>%
relocate(Factor)
# Show results
testDataSum
My request is for help in simplifying this code. This works but it seems horribly ugly and complex, not to mention difficult to come back to at a later date and modify. I believe there is a much simpler way to do it without a for-loop, and without the ungainly process of unlisting SDList using the "as.vector (matrix (..." lines. I have reviewed the documentation for DPLYR and PLYR, especially the grouping section, but I am baffled. Any suggestions are much appreciated.
Here is a link to a github repository with the code and a csv file with 192 rows that you can use to produce the result table.
Git Hub Link for code and Data
You may try using reshape2, dplyr, and tidyr
When I read your data, column names get broken, so I rename them beforehand.
library(dplyr)
library(tidyr)
library(reshape2)
names(df) <- c("A","B","C","D","VarX")
df %>%
melt(id.vars = "VarX", variable.name = "Factor") %>%
group_by(Factor, value) %>%
summarize(sd = sd(VarX)) %>%
pivot_wider(id_cols = Factor, values_from = sd, names_from = value, names_glue = "sd_{value}") %>%
mutate(SD_ratio = pmax(sd_No,sd_Yes)/pmin(sd_No,sd_Yes))
Factor sd_No sd_Yes SD_ratio
<fct> <dbl> <dbl> <dbl>
1 A 3.51 3.79 1.08
2 B 3.83 3.44 1.11
3 C 3.53 3.77 1.07
4 D 3.92 3.32 1.18
I'd like to store functions, or at least their names, in a column of a data.frame for use in a call to mutate. A simplified broken example:
library(dplyr)
expand.grid(mu = 1:5, sd = c(1, 10), stat = c('mean', 'sd')) %>%
group_by(mu, sd, stat) %>%
mutate(sample = get(stat)(rnorm(100, mu, sd))) %>%
ungroup()
If this worked how I thought it would, the value of sample would be generated by the function in .GlobalEnv corresponding to either 'mean' or 'sd', depending on the row.
The error I get is:
Error in mutate_impl(.data, dots) :
Evaluation error: invalid first argument.
Surely this has to do with non-standard evaluation ... grrr.
A few issues here. First expand.grid will convert character values to factors. And get() doesn't like working with factors (ie get(factor("mean")) will give an error). The tidyverse-friendly version is tidyr::crossing(). (You could also pass stringsAsFactors=FALSE to expand.grid.)
Secondly, mutate() assumes that all functions you call are vectorized, but functions like get() are not vectorized, they need to be called one-at-a-time. A safer way rather than doing the group_by here to guarantee one-at-a-time evaluation is to use rowwise().
And finally, your real problem is that you are trying to call get("sd") but when you do, sd also happens to be a column in your data.frame that is part of the mutate. Thus get() will find this sd first, and this sd is just a number, not a function. You'll need to tell get() to pull from the global environment explicitly. Try
tidyr::crossing(mu = 1:5, sd = c(1, 10), stat = c('mean', 'sd')) %>%
rowwise() %>%
mutate(sample = get(stat, envir = globalenv())(rnorm(100, mu, sd)))
Three problems (that I see): (1) expand.grid is giving you factors; (2) get finds variables, so using "sd" as a stat is being confused with the column names "sd" (that was hard to find!); and (3) this really is a row-wise operation, grouping isn't helping enough. The first is easily fixed with an option, the second can be fixed by using match.fun instead of get, and the third can be mitigated with dplyr::rowwise, purrr::pmap, or base R's mapply.
This helper function was useful during debugging and can be used to "clean up" the code within mutate, but it isn't required (for other than this demonstration). Inline "anonymous" functions will work as well.
func <- function(f,m,s) get(f)(rnorm(100,mean=m,sd=s))
Several implementation methods:
set.seed(0)
expand.grid(mu = 1:5, sd = c(1, 10), stat = c('mean', 'sd'),
stringsAsFactors=FALSE) %>%
group_by(mu, sd, stat) %>% # can also be `rowwise() %>%`
mutate(
sample0 = match.fun(stat)(rnorm(100, mu, sd)),
sample1 = purrr::pmap_dbl(list(stat, mu, sd), ~ match.fun(..1)(rnorm(100, ..2, ..3))),
sample2 = purrr::pmap_dbl(list(stat, mu, sd), func),
sample3 = mapply(function(f,m,s) match.fun(f)(rnorm(100,m,s)), stat, mu, sd),
sample4 = mapply(func, stat, mu, sd)
) %>%
ungroup()
# # A tibble: 20 x 8
# mu sd stat sample0 sample1 sample2 sample3 sample4
# <int> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 mean 1.02 1.03 0.896 1.08 0.855
# 2 2 1 mean 1.95 2.07 2.05 1.90 1.92
# 3 3 1 mean 2.93 3.07 3.03 2.89 3.01
# 4 4 1 mean 4.01 3.94 4.23 4.05 3.96
# 5 5 1 mean 5.04 5.11 5.05 5.17 5.19
# 6 1 10 mean 1.67 1.21 1.30 2.08 -0.641
# 7 2 10 mean 1.82 2.82 2.35 3.65 1.78
# 8 3 10 mean 1.45 3.10 3.15 4.28 2.58
# 9 4 10 mean 3.49 6.33 5.11 2.84 3.41
# 10 5 10 mean 5.33 4.85 4.07 5.58 6.66
# 11 1 1 sd 0.965 1.04 0.993 0.942 1.08
# 12 2 1 sd 0.974 0.967 0.981 0.984 1.15
# 13 3 1 sd 1.12 0.902 1.06 0.977 1.02
# 14 4 1 sd 0.946 0.928 0.960 1.01 0.992
# 15 5 1 sd 1.06 1.01 0.911 1.11 1.00
# 16 1 10 sd 9.46 8.95 10.0 8.91 9.60
# 17 2 10 sd 9.51 9.11 11.5 9.85 10.6
# 18 3 10 sd 9.77 9.96 11.0 9.09 10.7
# 19 4 10 sd 10.5 9.84 10.1 10.6 8.89
# 20 5 10 sd 11.2 8.82 10.4 9.06 9.64
sample0 happens to work because you have grouped it to be row-wise. If at some point any one grouping has two or more values, this will fail.
For sample1 through sample4, you can remove the group_by and it works equally well (though sample0 demonstrates its failing, so remove it too). You won't get identical results as above with grouping removed, because the entropy is being consumed differently.
To give a small working example, suppose I have the following data frame:
library(dplyr)
country <- rep(c("A", "B", "C"), each = 6)
year <- rep(c(1,2,3), each = 2, times = 3)
categ <- rep(c(0,1), times = 9)
pop <- rep(c(NA, runif(n=8)), each=2)
money <- runif(18)+100
df <- data.frame(Country = country,
Year = year,
Category = categ,
Population = pop,
Money = money)
Now the data I'm actually working with has many more repetitions, namely for every country, year, and category, there are many repeated rows corresponding to various sources of money, and I want to sum these all together. However, for now it's enough just to have one row for each country, year, and category, and just trivially apply the sum() function on each row. This will still exhibit the behavior I'm trying to get rid of.
Notice that for country A in year 1, the population listed is NA. Therefore when I run
aggregate(Money ~ Country+Year+Category+Population, df, sum)
the resulting data frame has dropped the rows corresponding to country A and year 1. I'm only using the ...+Population... bit of code because I want the output data frame to retain this column.
I'm wondering how to make the aggregate() function not drop things that have NAs in the columns by which the grouping occurs--it'd be nice if, for instance, the NAs themselves could be treated as values to group by.
My attempts: I tried turning the Population column into factors but that didn't change the behavior. I read something on the na.action argument but neither na.action=NULL nor na.action=na.skip changed the behavior. I thought about trying to turn all the NAs to 0s, and I can't think of what that would hurt but it feels like a hack that might bite me later on--not sure. But if I try to do it, I'm not sure how I would. When I wrote a function with the is.na() function in it, it didn't apply the if (is.na(x)) test in a vectorized way and gave the error that it would just use the first element of the vector. I thought about perhaps using lapply() on the column and coercing it back to a vector and sticking that in the column, but that also sounds kind of hacky and needlessly round-about.
The solution here seemed to be about keeping the NA values out of the data frame in the first place, which I can't do: Aggregate raster in R with NA values
As you have already mentioned dplyr before your data, you can use dplyr::summarise function. The summarise function supports grouping on NA values.
library(dplyr)
df %>% group_by(Country,Year,Category,Population) %>%
summarise(Money = sum(Money))
# # A tibble: 18 x 5
# # Groups: Country, Year, Category [?]
# Country Year Category Population Money
# <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 A 1.00 0 NA 101
# 2 A 1.00 1.00 NA 100
# 3 A 2.00 0 0.482 101
# 4 A 2.00 1.00 0.482 101
# 5 A 3.00 0 0.600 101
# 6 A 3.00 1.00 0.600 101
# 7 B 1.00 0 0.494 101
# 8 B 1.00 1.00 0.494 101
# 9 B 2.00 0 0.186 100
# 10 B 2.00 1.00 0.186 100
# 11 B 3.00 0 0.827 101
# 12 B 3.00 1.00 0.827 101
# 13 C 1.00 0 0.668 100
# 14 C 1.00 1.00 0.668 101
# 15 C 2.00 0 0.794 100
# 16 C 2.00 1.00 0.794 100
# 17 C 3.00 0 0.108 100
# 18 C 3.00 1.00 0.108 100
Note: The OP's sample data doesn't have multiple rows for same groups. Hence, number of summarized rows will be same as actual rows.
EDIT: My data (for reproducible research) looks as follows. The dplyr will summarise the values for each win_name category:
inv_name inv_province inv_town nip win_name value start duration year
CustomerA łódzkie TownX 1111111111 CompX 233.50 2015-10-23 24 2017
CustomerA łódzkie TownX 1111111111 CompX 300.5 2015-10-23 24 2017
CustomerA łódzkie TownX 1111111111 CompX 200.5 2015-10-23 24 2017
CustomerB łódzkie TownY 2222222222 CompY 200.5 2015-10-25 12 2017
CustomerB łódzkie TownY 2222222222 CompY 1200.0 2015-10-25 12 2017
CustomerB łódzkie TownY 2222222222 CompY 320.00 2015-10-25 12 2017
The dplyr will summarise the values, then the spread will make the summary spread into several columns for each win_name category with numeric values.
I would like to create new columns with formatted text corresponding to existing columns with numbers. Create as many columns as there are numeric columns with numeric data. The number of these columns can change from analysis to analysis. My code so far looks like:
county_marketshare<-df_monthly_val %>%
select(win_name,value,inv_province) %>%
group_by(win_name,inv_province)%>%
summarise(value=round(sum(value),0))%>%
spread(key="win_name", value=value, fill=0) %>% # teraz muszę stworzyc kolumny sformatowane "finansowo"
mutate(!!as.symbol(paste0(bestSup[1],"_lbl")):= formatC(!!as.symbol(bestSup[1]),digits = 0, big.mark = " ", format = "f",zero.print = ""),
!!as.symbol(paste0(bestSup[2],"_lbl")):= formatC(!!as.symbol(bestSup[2]),digits = 0, big.mark = " ", format = "f",zero.print = ""),
!!as.symbol(paste0(bestSup[3],"_lbl")):= formatC(!!as.symbol(bestSup[3]),digits = 0, big.mark = " ", format = "f",zero.print = "")
)
is there a way to loop the mutate function so that as many columns are created as there are existing numeric columns? The relavant lines with the repetitive code are the last three. Each new formatted text column has the name of existing numeric column with a suffix. !!as.symbol makes it possible to put together a parameter, the name of the source column, with _lbl suffix.
you could for example use mutate_at with a function and a conditional such as
dat %>%
mutate_at(.vars = c('num_col1','num_col2'),
.funs = function(x) if(is.numeric(x)) as.character(x))
This will replace the specified numeric columns with character columns. You can tweak the function to your needs, i.e. specifying how the columns should look like. We could help you a bit more with a better data example.
You can also filter only the numeric columns and then use mutate_all:
dat %>%Filter(is.numeric,.) %>% mutate_all(funs(as.character))
# Filter() is not dplyr, but base R, caveat capital 'F' !
# You can also use dat %>%.[sapply(.,is.numeric)], with the same result
# or dplyr::select_if
...:)
P.S. Always worth to cite the reference. Have a look at this gorgeous question:
Selecting only numeric columns from a data frame
Please consult tidyverse documentation.
# mutate_if() is particularly useful for transforming variables from
# one type to another
iris %>% as_tibble() %>% mutate_if(is.factor, as.character)
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 5.10 3.50 1.40 0.200 setosa
#> 2 4.90 3.00 1.40 0.200 setosa
#> 3 4.70 3.20 1.30 0.200 setosa
#> 4 4.60 3.10 1.50 0.200 setosa
#> 5 5.00 3.60 1.40 0.200 setosa
#> 6 5.40 3.90 1.70 0.400 setosa
#> 7 4.60 3.40 1.40 0.300 setosa
#> 8 5.00 3.40 1.50 0.200 setosa
#> 9 4.40 2.90 1.40 0.200 setosa
#> 10 4.90 3.10 1.50 0.100 setosa
#> # ... with 140 more rows
Unexpectedly I found a hint at http://stackoverflow.com/a/47971650/3480717
I did not realise that in the syntax
mtcars %>% mutate_at(columnstolog, funs(log = log(.)))
adding a name part "log="in funs will append it to the names of new colums.... in the effect the following in my case is enough:
mutate_if(is.numeric, funs(lbl = formatC(.,digits = 0, big.mark = " ", format = "f",zero.print = "")))
This will generate new columns, as many as there are original numeric columns, and these new columns will have the name sufficed with "_lbl". No need for loops or advanced syntax. Big thanks to Thebo and Nettle