Combining 2 series and dropping NA in R

I am trying to combine the dividend history of 2 different stocks, and I have the code below:
library(quantmod)
library(tidyr)
AAPL<-round(getDividends("AAPL"),4)
MSFT<-round(getDividends("MSFT"),4)
dividend<-(cbind(AAPL,MSFT))
Because the 2 stocks pay dividends on different dates, there will be NAs after combining, so I tried to use the drop_na function from tidyr as below:
drop_na(dividend)
#Error in UseMethod("drop_na") :
no applicable method for 'drop_na' applied to an object of class "c('xts', 'zoo')"
What did I do wrong here? Many thanks for your help.
Update 1:
I tried na.omit, which returns the following:
> dividend<-(cbind(AAPL,MSFT))%>%
> na.omit(dividend)%>%
> print(dividend)
[,1] [,2]

drop_na only works on data frames, but even if you converted dividend to a data frame it would not give what you want. Every row contains an NA, so drop_na and na.omit would both remove every row, leaving a zero-row result.
Instead, try aggregating over year/month, which gives a zoo object, both. If you need an xts object, use as.xts(both).
both <- aggregate(dividend, as.yearmon, sum, na.rm = TRUE)
Optionally we could remove the rows with zero dividends in both columns:
both[rowSums(both) > 0]
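To see why every row ends up with an NA, and what the monthly aggregation does about it, here is a minimal sketch on made-up data. The series a and m and their dates are hypothetical stand-ins for the two dividend series (no download needed), and only zoo is assumed to be installed:

```r
library(zoo)

# Hypothetical dividend series: same months, but never the same day
a <- zoo(c(0.10, 0.11), as.Date(c("2020-02-07", "2020-05-08")))
m <- zoo(c(0.51, 0.51), as.Date(c("2020-02-19", "2020-05-20")))

# cbind on zoo objects is an outer join on the dates: 4 rows, one NA in each
dividend <- cbind(a, m)
complete.cases(coredata(dividend))   # no row is complete

# Sum within each year/month so both payments land on the same row
both <- aggregate(dividend, as.yearmon, sum, na.rm = TRUE)
both
```

With real data, a month in which only one stock paid would show 0 for the other, since sum(..., na.rm = TRUE) over all-NA values is 0.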

We could use na.omit
na.omit(dividend)
drop_na expects a data.frame as input, whereas the object dividend is not a data.frame (it is an xts/zoo object).
We could fortify it into a data.frame and then use drop_na after reordering the NAs within each year/month:
library(dplyr)
library(lubridate)
fortify.zoo(dividend) %>%
group_by(Index = format(ymd(Index), '%Y-%b')) %>%
mutate(across(everything(), ~ .x[order(is.na(.x))])) %>%
ungroup %>%
drop_na()
-output
# A tibble: 41 × 3
Index AAPL.div MSFT.div
<chr> <dbl> <dbl>
1 2012-Aug 0.0034 0.2
2 2012-Nov 0.0034 0.23
3 2013-Feb 0.0034 0.23
4 2013-May 0.0039 0.23
5 2013-Aug 0.0039 0.23
6 2013-Nov 0.0039 0.28
7 2014-Feb 0.0039 0.28
8 2014-May 0.0042 0.28
9 2014-Aug 0.0294 0.28
10 2014-Nov 0.0294 0.31
# … with 31 more rows

Related

Simplify R code for computing group statistics across multiple factors

I am trying to compute the variance for each group in a data set with multiple factors. For example, the data set below is the first 6 lines of a data frame with 5 columns: 4 factors of two levels each (No and Yes) and 1 continuous variable:
Factor A  Factor B  Factor C  Factor D  VarX
Yes       Yes       Yes       No        66.8
No        Yes       Yes       No        66.0
Yes       No        No        No        58.4
No        Yes       Yes       Yes       68.3
Yes       Yes       Yes       No        61.8
Yes       No        No        No        67.3
What I want to do is produce a summary table such as the one below:
Factor    SD (NO)  SD (YES)  SD Ratio
Factor A  3.79     3.51      1.08
Factor B  3.44     3.83      1.11
Factor C  3.77     3.53      1.07
Factor D  3.92     3.32      1.18
For each factor, I have calculated the standard deviation at each level ("No" and "Yes") as well as the ratio of the two standard deviations.
Here is the code I am using to do this:
#
# Define modify function for SD ratio column
#
sd_ratio<-function(x,y){
return(max(x,y)/min(x,y))
}
#
# Set up storage
#
nc<-4 # number of factors in data
testDataSum<-tibble(SD_No=rep(NA,nc),
SD_Yes=rep(NA,nc),
SD_Ratio=rep(NA,nc))
#
Factor<-vector("list",4)
SDList<-vector("list",4)
#
# For Loop. Group data by factors 1,2,3,4
#
for (i in 1:4){
Factor[[i]]<-names(testData[,i])
SDList[[i]]<-testData %>%
group_by(testData[,i])%>%
summarize(SD=sd(VarX))
}
# Load summary DF with data by unlisting SDList
#
testDataSum$SD_No<-as.vector(matrix(unlist(SDList),ncol=4,byrow=T)[,3])
testDataSum$SD_Yes<-as.vector(matrix(unlist(SDList),ncol=4,byrow=T)[,4])
testDataSum$SD_Ratio=modify2(testDataSum$SD_No,testDataSum$SD_Yes,sd_ratio)
#
# Load formatted factor names and put it at the front
#
testDataSum<-testDataSum %>%
mutate(Factor=unlist(Factor)) %>%
relocate(Factor)
# Show results
testDataSum
My request is for help in simplifying this code. This works but it seems horribly ugly and complex, not to mention difficult to come back to at a later date and modify. I believe there is a much simpler way to do it without a for-loop, and without the ungainly process of unlisting SDList using the "as.vector (matrix (..." lines. I have reviewed the documentation for DPLYR and PLYR, especially the grouping section, but I am baffled. Any suggestions are much appreciated.
Here is a link to a github repository with the code and a csv file with 192 rows that you can use to produce the result table.
Git Hub Link for code and Data
You may try using reshape2, dplyr, and tidyr
When I read your data, column names get broken, so I rename them beforehand.
library(dplyr)
library(tidyr)
library(reshape2)
names(df) <- c("A","B","C","D","VarX")
df %>%
melt(id.vars = "VarX", variable.name = "Factor") %>%
group_by(Factor, value) %>%
summarize(sd = sd(VarX)) %>%
pivot_wider(id_cols = Factor, values_from = sd, names_from = value, names_glue = "sd_{value}") %>%
mutate(SD_ratio = pmax(sd_No,sd_Yes)/pmin(sd_No,sd_Yes))
Factor sd_No sd_Yes SD_ratio
<fct> <dbl> <dbl> <dbl>
1 A 3.51 3.79 1.08
2 B 3.83 3.44 1.11
3 C 3.53 3.77 1.07
4 D 3.92 3.32 1.18
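If you would rather not load reshape2, the same reshaping can probably be done with tidyr alone. This is only a sketch: the data frame below is made up (16 random rows standing in for the 192-row CSV), the column names A to D follow the renaming above, and level and result are names chosen here for illustration.

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for the real data: 4 two-level factors and one numeric column
set.seed(42)
df <- tibble(
  A = rep(c("No", "Yes"), times = 8),
  B = rep(c("No", "Yes"), each = 2, times = 4),
  C = rep(c("No", "Yes"), each = 4, times = 2),
  D = rep(c("No", "Yes"), each = 8),
  VarX = rnorm(16, mean = 65, sd = 4)
)

result <- df %>%
  # one row per (observation, factor) pair, replacing melt()
  pivot_longer(A:D, names_to = "Factor", values_to = "level") %>%
  group_by(Factor, level) %>%
  summarise(sd = sd(VarX), .groups = "drop") %>%
  # one column per level: SD_No and SD_Yes
  pivot_wider(names_from = level, values_from = sd, names_prefix = "SD_") %>%
  mutate(SD_Ratio = pmax(SD_No, SD_Yes) / pmin(SD_No, SD_Yes))

result
```

The numbers will not match the table above because the input here is random; only the shape of the result (one row per factor) is the point.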

Why might one use the unlist() function? (example inside)

I'm learning the basics of R and I'm going through an example where the user loads a .csv file containing the weights of mice fed a Normal Control or High Fat diet.
He proceeds to make two vectors (is this true, once the columns are extracted and unlisted?).
I'm confused as to what purpose the unlist function serves here. I've seen the unlist function used before graphing as well, and I'm not sure what difference it makes.
dplyr functions, such as filter() and select(), return tibbles (a variant on data.frames). Data frames and tibbles are a special type of list, where each element is a vector of the same length, but not necessarily the same type.
In the example given, each statement is selecting a single column, returned as a 1-column tibble. A 1-column tibble is a list with one element, in this case the vector of Bodyweights. However, many functions do not expect a 1-column tibble (or data.frame), but want a vector. By using unlist(), we are squashing the structure down to a single vector. This would be true whether you selected a single column or multiple columns.
The idiomatic way in dplyr would be to pipe pull(Bodyweight), as opposed to using unlist().
Consider this simple example for the difference
tib <- tibble(a = 1:5, b = letters[1:5])
select(tib, a)
class(select(tib, a))
# Notice the different printing and class when we unlist
unlist(select(tib, a))
class(unlist(select(tib, a)))
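The pull() route mentioned above can be sketched on the same toy tibble. The variable names v_pull and v_unlist are just illustrative:

```r
library(dplyr)

tib <- tibble(a = 1:5, b = letters[1:5])

# pull() extracts the column itself: a plain integer vector without names
v_pull <- tib %>% pull(a)

# unlist() flattens the 1-column tibble and attaches names a1, a2, ...
v_unlist <- tib %>% select(a) %>% unlist()

class(v_pull)
names(v_pull)
names(v_unlist)

# Apart from the names, the two results hold the same values
identical(v_pull, unname(v_unlist))
```

For a single column the two give the same values; pull() just skips the intermediate tibble and the generated names.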
Well, that just depends on what you want to achieve. Before the unlist() you end up with a data.frame (or, more specifically, a tibble in this example because of the dplyr functions applied to the data). When unlisting the single-column tibble you end up with an atomic (named) numeric vector, which behaves quite differently in some situations (the final rbind below is an example).
library(tidyverse)
mice <- structure(list(Diet=c("chow","chow","chow","chow","chow",
"chow","chow","chow","chow","chow","chow","chow","hf",
"hf","hf","hf","hf","hf","hf","hf","hf","hf","hf","hf"
),Bodyweight=c(21.51,28.14,24.04,23.45,23.68,19.79,28.4,
20.98,22.51,20.1,26.91,26.25,25.71,26.37,22.8,25.34,
24.97,28.14,29.58,30.92,34.02,21.9,31.53,20.73)),class=c("spec_tbl_df",
"tbl_df","tbl","data.frame"),row.names=c(NA,-24L),spec=structure(list(
cols=list(Diet=structure(list(),class=c("collector_character",
"collector")),Bodyweight=structure(list(),class=c("collector_double",
"collector"))),default=structure(list(),class=c("collector_guess",
"collector")),skip=1),class="col_spec"))
bodyweight <- mice %>% filter(Diet == "chow") %>% select(Bodyweight)
class(bodyweight)
#> [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
bodyweight
#> # A tibble: 12 x 1
#> Bodyweight
#> <dbl>
#> 1 21.5
#> 2 28.1
#> 3 24.0
#> 4 23.4
#> 5 23.7
#> 6 19.8
#> 7 28.4
#> 8 21.0
#> 9 22.5
#> 10 20.1
#> 11 26.9
#> 12 26.2
bodyweight_unl <- mice %>% filter(Diet == "chow") %>% select(Bodyweight) %>% unlist
class(bodyweight_unl)
#> [1] "numeric"
bodyweight_unl
#> Bodyweight1 Bodyweight2 Bodyweight3 Bodyweight4 Bodyweight5 Bodyweight6
#> 21.51 28.14 24.04 23.45 23.68 19.79
#> Bodyweight7 Bodyweight8 Bodyweight9 Bodyweight10 Bodyweight11 Bodyweight12
#> 28.40 20.98 22.51 20.10 26.91 26.25
rbind(bodyweight, 1:12)
rbind(bodyweight_unl, 1:12)
Created on 2020-07-12 by the reprex package (v0.3.0)
The purpose of unlist is to flatten a list of vectors into a single vector. This is from R for Data Science, which is certainly worth reading.
See further explanations in the comments below.
library(tidyverse)
head(data)
#> Diet Bodyweight
#> 1 chow 21.51
#> 2 chow 28.14
#> 3 chow 24.04
#> 4 chow 23.45
#> 5 chow 23.68
#> 6 chow 19.79
# without unlist you get a data.frame
dplyr::filter(data, Diet == 'chow') %>% select(Bodyweight) %>% class()
#> [1] "data.frame"
# by unlisting you get a named vector with the names taken from the selected data
dplyr::filter(data, Diet == 'chow') %>% select(Bodyweight) %>% unlist()
#> Bodyweight1 Bodyweight2 Bodyweight3 Bodyweight4 Bodyweight5 Bodyweight6
#> 21.51 28.14 24.04 23.45 23.68 19.79
#> Bodyweight7 Bodyweight8 Bodyweight9 Bodyweight10 Bodyweight11 Bodyweight12
#> 28.40 20.98 22.51 20.10 26.91 26.25
# If you set use.names=F you get a vector with the data you selected
dplyr::filter(data, Diet == 'chow') %>% select(Bodyweight) %>% unlist(use.names = F)
#> [1] 21.51 28.14 24.04 23.45 23.68 19.79 28.40 20.98 22.51 20.10 26.91 26.25

Reshaping database using reshape package-part 2 [duplicate]

Following the discussion in the previous post, Reshaping database using reshape package, I am opening this one to ask another question.
Briefly: I have a database in which some rows are repeated for the same Id, and I would like to transpose it. The following code shows a small example of my database.
test<-data.frame(Id=c(1,1,2,3),
St=c(20,80,80,20),
gap=seq(0.02,0.08,by=0.02),
gip=c(0.23,0.60,0.86,2.09),
gat=c(0.0107,0.989,0.337,0.663))
I would like a final database like this figure that I attached:
With one row for each Id value and the different columns attached.
Can you give me any suggestions?
You can use dcast from data.table. This function allows you to spread multiple value variables at once.
library(data.table)
setDT(test) # convert test to a data.table
test1 <- dcast(test, Id ~ rowid(Id),
value.var = c('St', 'gap', 'gip', 'gat'), fill = 0)
test1
# Id St_1 St_2 gap_1 gap_2 gip_1 gip_2 gat_1 gat_2
#1: 1 20 80 0.02 0.04 0.23 0.6 0.0107 0.989
#2: 2 80 0 0.06 0.00 0.86 0.0 0.3370 0.000
#3: 3 20 0 0.08 0.00 2.09 0.0 0.6630 0.000
If you want to continue with a data.frame call setDF(test1) at the end.
A dplyr/tidyr alternative is to first gather to long format, group_by Id and key and create a sequential row identifier for each group (new_key) and finally spread it back to wide form.
library(dplyr)
library(tidyr)
test %>%
gather(key, value, -Id) %>%
group_by(Id, key) %>%
mutate(new_key = paste0(key, row_number())) %>%
ungroup() %>%
select(-key) %>%
spread(new_key, value, fill = 0)
# Id gap1 gap2 gat1 gat2 gip1 gip2 St1 St2
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 0.02 0.04 0.0107 0.989 0.23 0.6 20 80
#2 2 0.06 0 0.337 0 0.86 0 80 0
#3 3 0.08 0 0.663 0 2.09 0 20 0
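In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(), and pivot_wider() can spread several value columns in one call. A sketch of that variant, assuming a reasonably recent tidyr (where a scalar values_fill is accepted); occ and wide are names picked here for illustration:

```r
library(dplyr)
library(tidyr)

test <- data.frame(Id  = c(1, 1, 2, 3),
                   St  = c(20, 80, 80, 20),
                   gap = seq(0.02, 0.08, by = 0.02),
                   gip = c(0.23, 0.60, 0.86, 2.09),
                   gat = c(0.0107, 0.989, 0.337, 0.663))

wide <- test %>%
  group_by(Id) %>%
  mutate(occ = row_number()) %>%   # 1, 2, ... for repeated rows of the same Id
  ungroup() %>%
  pivot_wider(names_from = occ,
              values_from = c(St, gap, gip, gat),
              values_fill = 0)     # Ids with a single row get 0 in the *_2 columns

wide
```

The columns come out as St_1, St_2, gap_1, gap_2 and so on, which matches the naming of the dcast result above.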

Can someone find out how this dplyr table doesn't work?

I have to make a table that shows the mean and median of the log of sales per month from the "txhousing" dataset. The exercise I got is the following: “The table below shows the means and medians of the log of sales per month, sorted by mean”
Insert a new r chunk and type the code in it to display that table
Use na.omit to get rid of cases with missing values
Use the dplyr command mutate to make the variable logsales
Use the dplyr command group_by to group by month
Use the dplyr command summarise to display the table
Use the dplyr command arrange to sort by mean
Connect the commands with the pipe operator %>%
I've rearranged the code multiple times, but I can't find out why it keeps giving me NAs in my table.
library(tidyverse)
summary(txhousing)
na.omit(txhousing)
txhousing<- as.data.frame(txhousing)
logsales <- log(txhousing$sales)
group_by(txhousing, txhousing$month)
txhousing<- txhousing %>% mutate(logsales= log(txhousing$sales))
txhousing %>% group_by(txhousing$month) %>% summarise(mean(logsales), median(logsales)) %>% arrange(mean)
I expect to get a table with the mean and median of logsales per month, but what I get is only NA in the mean and median columns, and the arrange gives the following error:
Error: cannot arrange column of class 'function' at position 1
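There are NA values in the column, so you need to tell mean and median to ignore them with na.rm = TRUE. Also name the columns in summarise; otherwise there is no column called mean, and arrange(mean) picks up the function mean instead, which is what the error message is complaining about.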
txhousing %>%
group_by(txhousing$month) %>%
summarise(mean = mean(logsales, na.rm = T),
med= median(logsales, na.rm = T)) %>%
arrange(mean) %>%
rename(month = `txhousing$month`)
This creates the following tibble:
# A tibble: 12 x 3
month mean med
<int> <dbl> <dbl>
1 1 4.95 4.74
2 2 5.13 4.93
3 11 5.19 4.96
4 12 5.24 5.02
5 10 5.29 5.08
6 9 5.32 5.09
7 3 5.38 5.15
8 4 5.42 5.21
9 5 5.52 5.29
10 7 5.53 5.30
11 8 5.53 5.33
12 6 5.56 5.34
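For what it's worth, the steps listed in the exercise can also be chained in one pipe, grouping by the bare column name so that no rename is needed afterwards. This is a sketch rather than the official solution: monthly, mean_logsales and median_logsales are names chosen here, and ggplot2 is loaded only because it ships the txhousing data.

```r
library(dplyr)
library(ggplot2)   # only needed for the txhousing dataset

monthly <- txhousing %>%
  na.omit() %>%                       # the cleaned data must be piped on (or assigned);
                                      # a bare na.omit(txhousing) call changes nothing
  mutate(logsales = log(sales)) %>%
  group_by(month) %>%                 # bare column name, not txhousing$month
  summarise(mean_logsales   = mean(logsales),
            median_logsales = median(logsales)) %>%
  arrange(mean_logsales)

monthly
```

Note that na.omit() drops every row with a missing value in any column, so the numbers can differ from the table above, which only ignores missing sales.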

R: How to aggregate with NA values

To give a small working example, suppose I have the following data frame:
library(dplyr)
country <- rep(c("A", "B", "C"), each = 6)
year <- rep(c(1,2,3), each = 2, times = 3)
categ <- rep(c(0,1), times = 9)
pop <- rep(c(NA, runif(n=8)), each=2)
money <- runif(18)+100
df <- data.frame(Country = country,
Year = year,
Category = categ,
Population = pop,
Money = money)
Now the data I'm actually working with has many more repetitions, namely for every country, year, and category, there are many repeated rows corresponding to various sources of money, and I want to sum these all together. However, for now it's enough just to have one row for each country, year, and category, and just trivially apply the sum() function on each row. This will still exhibit the behavior I'm trying to get rid of.
Notice that for country A in year 1, the population listed is NA. Therefore when I run
aggregate(Money ~ Country+Year+Category+Population, df, sum)
the resulting data frame has dropped the rows corresponding to country A and year 1. I'm only using the ...+Population... bit of code because I want the output data frame to retain this column.
I'm wondering how to make the aggregate() function not drop things that have NAs in the columns by which the grouping occurs--it'd be nice if, for instance, the NAs themselves could be treated as values to group by.
My attempts: I tried turning the Population column into factors but that didn't change the behavior. I read something on the na.action argument but neither na.action=NULL nor na.action=na.skip changed the behavior. I thought about trying to turn all the NAs to 0s, and I can't think of what that would hurt but it feels like a hack that might bite me later on--not sure. But if I try to do it, I'm not sure how I would. When I wrote a function with the is.na() function in it, it didn't apply the if (is.na(x)) test in a vectorized way and gave the error that it would just use the first element of the vector. I thought about perhaps using lapply() on the column and coercing it back to a vector and sticking that in the column, but that also sounds kind of hacky and needlessly round-about.
The solution here seemed to be about keeping the NA values out of the data frame in the first place, which I can't do: Aggregate raster in R with NA values
Since you already load dplyr before creating your data, you can use the dplyr::summarise function. Unlike aggregate(), group_by() followed by summarise() keeps NA as its own group.
library(dplyr)
df %>% group_by(Country,Year,Category,Population) %>%
summarise(Money = sum(Money))
# # A tibble: 18 x 5
# # Groups: Country, Year, Category [?]
# Country Year Category Population Money
# <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 A 1.00 0 NA 101
# 2 A 1.00 1.00 NA 100
# 3 A 2.00 0 0.482 101
# 4 A 2.00 1.00 0.482 101
# 5 A 3.00 0 0.600 101
# 6 A 3.00 1.00 0.600 101
# 7 B 1.00 0 0.494 101
# 8 B 1.00 1.00 0.494 101
# 9 B 2.00 0 0.186 100
# 10 B 2.00 1.00 0.186 100
# 11 B 3.00 0 0.827 101
# 12 B 3.00 1.00 0.827 101
# 13 C 1.00 0 0.668 100
# 14 C 1.00 1.00 0.668 101
# 15 C 2.00 0 0.794 100
# 16 C 2.00 1.00 0.794 100
# 17 C 3.00 0 0.108 100
# 18 C 3.00 1.00 0.108 100
Note: The OP's sample data doesn't have multiple rows for the same group. Hence, the number of summarized rows will be the same as the number of original rows.
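If you want to stay with base aggregate(), one possible workaround is the vectorized NA replacement the question was reaching for: swap NA for a sentinel before aggregating and put NA back afterwards. This is a sketch under the assumption that -1 can never be a real population value; sentinel, tmp and out are illustrative names, and set.seed is added only to make the sample data reproducible.

```r
set.seed(1)
df <- data.frame(Country    = rep(c("A", "B", "C"), each = 6),
                 Year       = rep(c(1, 2, 3), each = 2, times = 3),
                 Category   = rep(c(0, 1), times = 9),
                 Population = rep(c(NA, runif(n = 8)), each = 2),
                 Money      = runif(18) + 100)

sentinel <- -1                      # assumption: never a valid Population value
tmp <- df
# Vectorized replacement: no if() and no lapply() needed
tmp$Population[is.na(tmp$Population)] <- sentinel

out <- aggregate(Money ~ Country + Year + Category + Population, tmp, sum)

# Put the NAs back so the sentinel does not leak into later steps
out$Population[out$Population == sentinel] <- NA
out
```

The dplyr version above is less fragile because it needs no sentinel; this is mainly useful when adding a dependency is not an option.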
