I'm learning the basics of R and I'm going through an example where the user loads a .csv file containing the weights of mice fed a normal control or high-fat diet.
He proceeds to make two vectors (is this true, once the column is extracted and unlisted?).
I'm confused as to what purpose the unlist function serves here. I've seen the unlist function used before graphing as well, and I don't understand what difference it makes.
dplyr functions, such as filter() and select(), return tibbles (a variant on data.frames). Data frames and tibbles are a special type of list, where each element is a vector of the same length, but not necessarily the same type.
In the example given, each statement is selecting a single column, returned as a 1-column tibble. A 1-column tibble is a list with one element, in this case the vector of Bodyweights. However, many functions do not expect a 1-column tibble (or data.frame), but want a vector. By using unlist(), we are squashing the structure down to a single vector. This would be true whether you selected a single column or multiple columns.
The idiomatic way in dplyr would be to pipe pull(Bodyweight), as opposed to using unlist().
Consider this simple example of the difference:
library(dplyr)
tib <- tibble(a = 1:5, b = letters[1:5])
select(tib, a)
class(select(tib, a))
# Notice the different printing and class when we unlist
unlist(select(tib, a))
class(unlist(select(tib, a)))
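As a rough sketch of the pull() approach mentioned above, the snippet below compares the three routes on a tiny made-up tibble. The name `mice_demo` and its values are illustrative, not the original data, and the code assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical stand-in for the mice data (values are made up)
mice_demo <- tibble(
  Diet = c("chow", "chow", "hf"),
  Bodyweight = c(21.51, 28.14, 25.71)
)

# select() keeps the tibble structure: a 2 x 1 tibble
as_tibble_col <- mice_demo %>% filter(Diet == "chow") %>% select(Bodyweight)

# unlist() flattens it to a named numeric vector
as_named_vec <- unlist(as_tibble_col)

# pull() extracts the column directly as an unnamed vector
as_plain_vec <- mice_demo %>% filter(Diet == "chow") %>% pull(Bodyweight)

is.data.frame(as_tibble_col)  # TRUE
names(as_named_vec)           # "Bodyweight1" "Bodyweight2"
names(as_plain_vec)           # NULL
```

The values are the same in both vectors; only the names attribute differs.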
Well, that just depends on what you want to achieve. Before the unlist() you have a data.frame (or, more specifically, a tibble in this example, because of the dplyr functions applied to the data). After unlisting the single-column tibble you have an atomic, named numeric vector, which behaves quite differently in some situations (the final rbind calls below are an example).
library(tidyverse)
mice <- structure(list(Diet=c("chow","chow","chow","chow","chow",
"chow","chow","chow","chow","chow","chow","chow","hf",
"hf","hf","hf","hf","hf","hf","hf","hf","hf","hf","hf"
),Bodyweight=c(21.51,28.14,24.04,23.45,23.68,19.79,28.4,
20.98,22.51,20.1,26.91,26.25,25.71,26.37,22.8,25.34,
24.97,28.14,29.58,30.92,34.02,21.9,31.53,20.73)),class=c("spec_tbl_df",
"tbl_df","tbl","data.frame"),row.names=c(NA,-24L),spec=structure(list(
cols=list(Diet=structure(list(),class=c("collector_character",
"collector")),Bodyweight=structure(list(),class=c("collector_double",
"collector"))),default=structure(list(),class=c("collector_guess",
"collector")),skip=1),class="col_spec"))
bodyweight <- mice %>% filter(Diet == "chow") %>% select(Bodyweight)
class(bodyweight)
#> [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
bodyweight
#> # A tibble: 12 x 1
#> Bodyweight
#> <dbl>
#> 1 21.5
#> 2 28.1
#> 3 24.0
#> 4 23.4
#> 5 23.7
#> 6 19.8
#> 7 28.4
#> 8 21.0
#> 9 22.5
#> 10 20.1
#> 11 26.9
#> 12 26.2
bodyweight_unl <- mice %>% filter(Diet == "chow") %>% select(Bodyweight) %>% unlist
class(bodyweight_unl)
#> [1] "numeric"
bodyweight_unl
#> Bodyweight1 Bodyweight2 Bodyweight3 Bodyweight4 Bodyweight5 Bodyweight6
#> 21.51 28.14 24.04 23.45 23.68 19.79
#> Bodyweight7 Bodyweight8 Bodyweight9 Bodyweight10 Bodyweight11 Bodyweight12
#> 28.40 20.98 22.51 20.10 26.91 26.25
rbind(bodyweight, 1:12)
rbind(bodyweight_unl, 1:12)
Created on 2020-07-12 by the reprex package (v0.3.0)
The purpose of unlist is to flatten a list of vectors into a single vector. This is from R for Data Science, which is certainly worth reading.
See further explanations in the comments below.
library(tidyverse)
head(data)
#> Diet Bodyweight
#> 1 chow 21.51
#> 2 chow 28.14
#> 3 chow 24.04
#> 4 chow 23.45
#> 5 chow 23.68
#> 6 chow 19.79
# without unlist you get a data.frame
dplyr::filter(data, Diet == 'chow') %>% select(Bodyweight) %>% class()
#> [1] "data.frame"
# by unlisting you get a named vector with the names taken from the selected data
dplyr::filter(data, Diet == 'chow') %>% select(Bodyweight) %>% unlist()
#> Bodyweight1 Bodyweight2 Bodyweight3 Bodyweight4 Bodyweight5 Bodyweight6
#> 21.51 28.14 24.04 23.45 23.68 19.79
#> Bodyweight7 Bodyweight8 Bodyweight9 Bodyweight10 Bodyweight11 Bodyweight12
#> 28.40 20.98 22.51 20.10 26.91 26.25
# If you set use.names=F you get a vector with the data you selected
dplyr::filter(data, Diet == 'chow') %>% select(Bodyweight) %>% unlist(use.names = F)
#> [1] 21.51 28.14 24.04 23.45 23.68 19.79 28.40 20.98 22.51 20.10 26.91 26.25
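If the names are not wanted, there are a few equivalent base R ways to get the bare values. The sketch below uses a small made-up data frame (`data_demo` is a hypothetical stand-in for the mice data) and shows that `use.names = FALSE`, `unname()` and `[[` all give the same unnamed vector.

```r
# Hypothetical stand-in for the mice data (values are made up)
data_demo <- data.frame(
  Diet = c("chow", "chow", "hf"),
  Bodyweight = c(21.51, 28.14, 25.71)
)

# drop = FALSE keeps the result as a one-column data frame
one_col <- data_demo[data_demo$Diet == "chow", "Bodyweight", drop = FALSE]

via_unlist  <- unlist(one_col, use.names = FALSE)  # flatten without names
via_unname  <- unname(unlist(one_col))             # flatten, then strip names
via_extract <- one_col[["Bodyweight"]]             # extract the column itself

identical(via_unlist, via_unname)   # TRUE
identical(via_unlist, via_extract)  # TRUE
```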
After I updated RStudio today, when I tried to get z-scores of a data frame by using mutate() and scale(), it returned a matrix with a 'New names' message:
df <- df %>% group_by(participants) %>% mutate(zscore=scale(answer))
New names:
* NA -> ...8
class(df$zscore)
[1] "matrix" "array"
The column of z-scores should have been named 'zscore', so why is it now named '...8'? I never had any problems with this code before. Is it because of the update?
I think you just added another column without a header or read in data with a column without a header. There is no issue with your classes.
library(tidyverse)
test <- mtcars|>
group_by(cyl) |>
mutate(zscore=scale(mpg))
#class of test
class(test)
#> [1] "grouped_df" "tbl_df" "tbl" "data.frame"
#class of column
class(test$zscore)
#> [1] "matrix" "array"
#recreate warning
test <- test |>
bind_cols("")
#> New names:
#> * `` -> `...13`
The warning at the bottom means that I added a column without a name in the 13th position.
Part of the issue is that scale() returns a matrix. You can fix this by wrapping in as.double():
library(dplyr)
starwars2 <- starwars %>%
select(height, gender) %>%
group_by(gender) %>%
mutate(zscore = as.double(scale(height)))
Output:
# A tibble: 87 × 3
# Groups: gender [3]
height gender zscore
<int> <chr> <dbl>
1 172 masculine -0.120
2 167 masculine -0.253
3 96 masculine -2.14
4 202 masculine 0.677
5 150 feminine -0.624
6 178 masculine 0.0394
7 165 feminine 0.0133
8 97 masculine -2.11
9 183 masculine 0.172
10 182 masculine 0.146
# … with 77 more rows
But I’m not sure this explains your NA -> ...8 issue. If not, please update your question to include your data (using dput(df)) or a subset (using dput(head(df))).
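To see what as.double() is doing here, a minimal base R sketch on a made-up three-element vector: scale() returns a one-column matrix carrying the centering and scaling attributes, and as.double() drops both the dimensions and the attributes.

```r
x <- c(2, 4, 6)          # illustrative values: mean 4, standard deviation 2

z_mat <- scale(x)        # 3 x 1 matrix with "scaled:center" and "scaled:scale"
z_vec <- as.double(z_mat)  # plain numeric vector, no dim attribute

is.matrix(z_mat)  # TRUE
dim(z_mat)        # 3 1
dim(z_vec)        # NULL
z_vec             # -1 0 1
```

Inside mutate(), the matrix version is stored as a matrix column, which is why class(df$zscore) reports "matrix" "array".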
I am trying to combine the dividend history of two different stocks, and I have the code below:
library(quantmod)
library(tidyr)
AAPL<-round(getDividends("AAPL"),4)
MSFT<-round(getDividends("MSFT"),4)
dividend<-(cbind(AAPL,MSFT))
The two stocks pay dividends on different dates, so there are NAs after combining, and I tried to use the drop_na function from tidyr as shown below:
drop_na(dividend)
#Error in UseMethod("drop_na") :
no applicable method for 'drop_na' applied to an object of class "c('xts', 'zoo')"
May I know what I did wrong here? Many thanks for your help.
Update 1:
Tried na.omit, which returns with the following:
> dividend<-(cbind(AAPL,MSFT))%>%
> na.omit(dividend)%>%
> print(dividend)
[,1] [,2]
drop_na only works on data frames, but even if you converted the object to a data frame it would not give what you want. Every row contains an NA, so drop_na and na.omit would both remove every row, leaving a zero-row result.
Instead, try aggregating over year/month, which gives a zoo object, both. If you need an xts object, use as.xts(both).
both <- aggregate(dividend, as.yearmon, sum, na.rm = TRUE)
Optionally we could remove the rows with zero dividends in both columns:
both[rowSums(both) > 0]
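The same idea can be sketched without quantmod or a network connection. The data frame below is hypothetical (the dates and dividend amounts are made up for illustration); it only mimics the shape of the combined object, where each row has a value for one stock and NA for the other.

```r
# Made-up dividend data: each payment date belongs to only one stock
div <- data.frame(
  date = as.Date(c("2020-02-07", "2020-02-19", "2020-05-08", "2020-05-20")),
  AAPL = c(0.77, NA, 0.82, NA),
  MSFT = c(NA, 0.51, NA, 0.51)
)

# Every row contains an NA, so dropping incomplete rows leaves nothing
nrow(na.omit(div))  # 0

# Summing within each year-month puts both stocks on one row per month
monthly <- aggregate(
  div[c("AAPL", "MSFT")],
  by = list(month = format(div$date, "%Y-%m")),
  FUN = sum, na.rm = TRUE
)
monthly
```

Months in which neither stock paid would show zeros in both columns, which is what the optional rowSums() filter removes.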
We could use na.omit
na.omit(dividend)
drop_na expects a data.frame as input, whereas the object dividend is not a data.frame (it is an xts object).
We could fortify it into a data.frame and then use drop_na after reordering the NAs within each month:
library(dplyr)
library(lubridate)
fortify.zoo(dividend) %>%
group_by(Index = format(ymd(Index), '%Y-%b')) %>%
mutate(across(everything(), ~ .x[order(is.na(.x))])) %>%
ungroup %>%
drop_na()
-output
# A tibble: 41 × 3
Index AAPL.div MSFT.div
<chr> <dbl> <dbl>
1 2012-Aug 0.0034 0.2
2 2012-Nov 0.0034 0.23
3 2013-Feb 0.0034 0.23
4 2013-May 0.0039 0.23
5 2013-Aug 0.0039 0.23
6 2013-Nov 0.0039 0.28
7 2014-Feb 0.0039 0.28
8 2014-May 0.0042 0.28
9 2014-Aug 0.0294 0.28
10 2014-Nov 0.0294 0.31
# … with 31 more rows
In the code below, I've simulated dice rolls at increasing sample sizes and computed the average roll at each sample size. My lapply function works, but I'm uncomfortable with it since I know sample_n() has been superseded by slice_sample(). I would like to make my code better with a current dplyr solution rather than sample_n() within the lapply. I think I may have other syntactical errors within the lapply. Here is the code:
#Dice
dice <- c(1,2,3,4,5,6) #the set of possible outcomes of a dice roll
dice_probs <- c(1/6,1/6,1/6,1/6,1/6,1/6) #the probability of each option per roll
dice_df <- data.frame(dice,dice_probs)
#Simulate dice rolls for each of these sample sizes and record the average of the rolls
sample_sizes <- c(10,25,50,100,1000,10000,100000,1000000,100000000) #compute at each sample size
output <- lapply(X=sample_sizes, FUN = function(var){
obs = sample_n(dice_df,var,replace=TRUE)
sample_mean = mean(obs$dice)
new.df <- data.frame(sample_mean, var)
return(new.df)
})
The final step is computing the difference from the expected value, 3.5. I want a column that shows the difference between 3.5 and the sample mean. We should see the difference shrink as the sample size increases.
output <- output %>%
mutate(difference = across(sample_mean, ~3.5 - .x))
When I run this, it's throwing this error:
Error in UseMethod("mutate") :
no applicable method for 'mutate' applied to an object of class "list"
I've tried using sapply but I get a similar error: no applicable method for 'mutate' applied to an object of class "c('matrix', 'array', 'list')"
If it helps, here was my failed attempt at using slice_sample:
output <- lapply(X=sample_sizes, FUN = function(...){
obs = slice_sample(dice_df, ..., .preserve=TRUE)
sample_mean = mean(obs$dice)
new.df <- data.frame(sample_mean, ...)
return(new.df)
})
I got this error: Error: '...' used in an incorrect context
The output is a list in which each element is a single-row data.frame. We can bind them with bind_rows and do the subtraction once, instead of once per element:
library(dplyr)
bind_rows(output) %>%
mutate(difference = 3.5 - sample_mean )
sample_mean var difference
1 3.500000 10 0.00000000
2 2.800000 25 0.70000000
3 3.440000 50 0.06000000
4 3.510000 100 -0.01000000
5 3.495000 1000 0.00500000
6 3.502200 10000 -0.00220000
7 3.502410 100000 -0.00241000
8 3.498094 1000000 0.00190600
9 3.500183 100000000 -0.00018332
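The error in the question comes from calling mutate() on a list rather than a data frame. A minimal sketch with two hand-written one-row data frames (the numbers are made up so that the result is deterministic) shows the difference; it assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical stand-in for the lapply() result: a list of one-row data frames
output_demo <- list(
  data.frame(sample_mean = 3.2, var = 10),
  data.frame(sample_mean = 3.6, var = 25)
)

is.data.frame(output_demo)  # FALSE: mutate() has no method for a plain list

# bind_rows() stacks the elements into one data frame, which mutate() accepts
combined <- bind_rows(output_demo) %>%
  mutate(difference = 3.5 - sample_mean)

combined
```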
The n argument of slice_sample corresponds to sample_n's size argument.
And to calculate the difference of your output list we can use purrr::map instead of dplyr::across.
library(dplyr)
library(purrr)
set.seed(123)
#Dice
dice <- c(1,2,3,4,5,6) #the set of possible outcomes of a dice roll
dice_probs <- c(1/6,1/6,1/6,1/6,1/6,1/6) #the probability of each option per roll
dice_df <- data.frame(dice,dice_probs)
#Simulate dice rolls for each of these sample sizes and record the average of the rolls
sample_sizes <- c(10,25,50,100,1000,10000,100000,1000000,100000000) #compute at each sample size
output <- lapply(X=sample_sizes, FUN = function(var){
obs = slice_sample(dice_df,n = var,replace=TRUE)
sample_mean = mean(obs$dice)
new.df <- data.frame(sample_mean, var)
return(new.df)
})
output %>%
map(~ 3.5 - .x$sample_mean)
#> [[1]]
#> [1] -0.5
#>
#> [[2]]
#> [1] 0.42
#>
#> [[3]]
#> [1] -0.04
#>
#> [[4]]
#> [1] -0.34
#>
#> [[5]]
#> [1] 0.025
#>
#> [[6]]
#> [1] 0.0317
#>
#> [[7]]
#> [1] 0.00416
#>
#> [[8]]
#> [1] -2.6e-05
#>
#> [[9]]
#> [1] -4.405e-05
Created on 2021-08-02 by the reprex package (v0.3.0)
Alternatively, we can use purrr::map_df and add a column diff inside each tibble, as proposed by Martin Gal in the comments:
output %>%
map_df(~ tibble(.x, diff = 3.5 - .x$sample_mean))
#> # A tibble: 9 x 3
#> sample_mean var diff
#> <dbl> <dbl> <dbl>
#> 1 2.6 10 0.9
#> 2 3.28 25 0.220
#> 3 3.66 50 -0.160
#> 4 3.5 100 0
#> 5 3.53 1000 -0.0270
#> 6 3.50 10000 -0.00180
#> 7 3.50 100000 -0.00444
#> 8 3.50 1000000 -0.000226
#> 9 3.50 100000000 -0.0000669
Here is a base R way -
transform(do.call(rbind, output), difference = 3.5 - sample_mean)
# sample_mean var difference
#1 3.80 10 -0.300000
#2 3.44 25 0.060000
#3 3.78 50 -0.280000
#4 3.30 100 0.200000
#5 3.52 1000 -0.015000
#6 3.50 10000 -0.004200
#7 3.50 100000 -0.004370
#8 3.50 1000000 0.002696
#9 3.50 100000000 0.000356
If you just need the difference value you can do -
3.5 - sapply(output, `[[`, 'sample_mean')
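Since a fair die needs no probability column, the whole simulation can also be sketched without a list of data frames at all. This is a variant, not what the answers above do: it swaps sample_n()/slice_sample() for base sample() and builds the result in one step. The sample sizes are deliberately small here so it runs quickly.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

sample_sizes <- c(10, 100, 1000)

# One mean per sample size; vapply() guarantees a numeric vector back
sample_means <- vapply(
  sample_sizes,
  function(n) mean(sample(1:6, size = n, replace = TRUE)),
  numeric(1)
)

result <- data.frame(
  var = sample_sizes,
  sample_mean = sample_means,
  difference = 3.5 - sample_means
)
result
```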
EDIT: My data (for reproducible research) looks as follows. The dplyr will summarise the values for each win_name category:
inv_name inv_province inv_town nip win_name value start duration year
CustomerA łódzkie TownX 1111111111 CompX 233.50 2015-10-23 24 2017
CustomerA łódzkie TownX 1111111111 CompX 300.5 2015-10-23 24 2017
CustomerA łódzkie TownX 1111111111 CompX 200.5 2015-10-23 24 2017
CustomerB łódzkie TownY 2222222222 CompY 200.5 2015-10-25 12 2017
CustomerB łódzkie TownY 2222222222 CompY 1200.0 2015-10-25 12 2017
CustomerB łódzkie TownY 2222222222 CompY 320.00 2015-10-25 12 2017
The dplyr will summarise the values, then the spread will make the summary spread into several columns for each win_name category with numeric values.
I would like to create new columns with formatted text corresponding to existing columns with numbers. Create as many columns as there are numeric columns with numeric data. The number of these columns can change from analysis to analysis. My code so far looks like:
county_marketshare<-df_monthly_val %>%
select(win_name,value,inv_province) %>%
group_by(win_name,inv_province)%>%
summarise(value=round(sum(value),0))%>%
spread(key="win_name", value=value, fill=0) %>% # now I need to create the "financially" formatted columns
mutate(!!as.symbol(paste0(bestSup[1],"_lbl")):= formatC(!!as.symbol(bestSup[1]),digits = 0, big.mark = " ", format = "f",zero.print = ""),
!!as.symbol(paste0(bestSup[2],"_lbl")):= formatC(!!as.symbol(bestSup[2]),digits = 0, big.mark = " ", format = "f",zero.print = ""),
!!as.symbol(paste0(bestSup[3],"_lbl")):= formatC(!!as.symbol(bestSup[3]),digits = 0, big.mark = " ", format = "f",zero.print = "")
)
Is there a way to loop the mutate function so that as many columns are created as there are existing numeric columns? The relevant lines with the repetitive code are the last three. Each new formatted text column has the name of an existing numeric column plus a suffix. !!as.symbol makes it possible to build the new column name from a parameter (the name of the source column) and the _lbl suffix.
You could, for example, use mutate_at with a function and a conditional, such as
dat %>%
mutate_at(.vars = c('num_col1','num_col2'),
.funs = function(x) if (is.numeric(x)) as.character(x) else x)
This will replace the specified numeric columns with character columns. You can tweak the function to your needs, i.e. to specify how the columns should be formatted. We could help you more with a better data example.
You can also filter only the numeric columns and then use mutate_all:
dat %>% Filter(is.numeric, .) %>% mutate_all(as.character)
# Filter() is base R, not dplyr; note the capital 'F'.
# dat %>% .[sapply(., is.numeric)] gives the same result,
# as does dplyr::select_if(is.numeric)
P.S. It is always worth citing the reference. Have a look at this question:
Selecting only numeric columns from a data frame
Please consult tidyverse documentation.
# mutate_if() is particularly useful for transforming variables from
# one type to another
iris %>% as_tibble() %>% mutate_if(is.factor, as.character)
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 5.10 3.50 1.40 0.200 setosa
#> 2 4.90 3.00 1.40 0.200 setosa
#> 3 4.70 3.20 1.30 0.200 setosa
#> 4 4.60 3.10 1.50 0.200 setosa
#> 5 5.00 3.60 1.40 0.200 setosa
#> 6 5.40 3.90 1.70 0.400 setosa
#> 7 4.60 3.40 1.40 0.300 setosa
#> 8 5.00 3.40 1.50 0.200 setosa
#> 9 4.40 2.90 1.40 0.200 setosa
#> 10 4.90 3.10 1.50 0.100 setosa
#> # ... with 140 more rows
Unexpectedly I found a hint at http://stackoverflow.com/a/47971650/3480717
I did not realise that in the syntax
mtcars %>% mutate_at(columnstolog, funs(log = log(.)))
adding a name part "log =" in funs will append it to the names of the new columns. As a result, the following is enough in my case:
mutate_if(is.numeric, funs(lbl = formatC(.,digits = 0, big.mark = " ", format = "f",zero.print = "")))
This will generate as many new columns as there are original numeric columns, and the new columns will have names suffixed with "_lbl". No need for loops or advanced syntax. Big thanks to Thebo and Nettle.
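In current dplyr (1.0 and later), funs() is deprecated, and the same effect can be sketched with across() and its .names argument. The tibble below is a hypothetical stand-in for the result of the summarise/spread step, with made-up values; the formatC() arguments are the ones from the question.

```r
library(dplyr)

# Hypothetical wide table: one character column, two numeric columns
wide_demo <- tibble(
  inv_province = c("A", "B"),
  CompX = c(1500, 234),
  CompY = c(320, 98765)
)

# One "_lbl" text column is created per numeric column, however many exist
labelled <- wide_demo %>%
  mutate(across(
    where(is.numeric),
    ~ formatC(.x, digits = 0, big.mark = " ", format = "f", zero.print = ""),
    .names = "{.col}_lbl"
  ))

names(labelled)
# "inv_province" "CompX" "CompY" "CompX_lbl" "CompY_lbl"
```

The "{.col}_lbl" template plays the role that the name inside funs(lbl = ...) played in the older syntax.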
I am trying to calculate the absolute difference between lagged values over several columns. The first row of the resulting data set is NA, which is correct because there is no previous value to calculate the lag. What I don't understand is why the lag isn't calculated for the last value. Note that the last value in the example below (temp) is the lag between the 2nd to last and the 3rd to last values, the lag value between the last and 2nd to last value is missing.
library(tidyverse)
library(purrr)
dim(mtcars) # 32 rows
temp <- map_df(mtcars, ~ abs(diff(lag(.x))))
names(temp) <- paste(names(temp), '.abs.diff.lag', sep= '')
dim(temp) # 31 rows
It would be an awesome bonus if someone could show me how to pipe the renaming step; I played around with paste and enquo without success. The real dataset is too long for a gather/newcolumnname/spread approach.
Thanks in advance!
EDIT: added the libraries needed to run the script
I think the lag call in your existing code is unnecessary as diff calculates the lagged difference automatically (although perhaps I don't understand properly what you are trying to do). You can also use rename_all to add a suffix to all the variable names.
library(purrr)
library(dplyr)
mtcars %>%
map_df(~ abs(diff(.x))) %>%
rename_all(funs(paste0(., ".abs.diff.lag")))
#> # A tibble: 31 x 11
#> mpg.abs.diff.lag cyl.abs.diff.lag disp.abs.diff.lag hp.abs.diff.lag
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.0 0 0.0 0
#> 2 1.8 2 52.0 17
#> 3 1.4 2 150.0 17
#> 4 2.7 2 102.0 65
#> 5 0.6 2 135.0 70
#> 6 3.8 2 135.0 140
#> 7 10.1 4 213.3 183
#> 8 1.6 0 5.9 33
#> 9 3.6 2 26.8 28
#> 10 1.4 0 0.0 0
#> # ... with 21 more rows, and 7 more variables: drat.abs.diff.lag <dbl>,
#> # wt.abs.diff.lag <dbl>, qsec.abs.diff.lag <dbl>, vs.abs.diff.lag <dbl>,
#> # am.abs.diff.lag <dbl>, gear.abs.diff.lag <dbl>,
#> # carb.abs.diff.lag <dbl>
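rename_all(funs(...)) still reflects older dplyr syntax. A sketch of the same pipeline for dplyr 1.0 or later, using rename_with(), is shown below. It also swaps map_df() for base lapply() plus as_tibble(), purely to avoid depending on purrr; that substitution is a choice made for this sketch, not part of the answer above.

```r
library(dplyr)

abs_diffs <- mtcars %>%
  lapply(function(col) abs(diff(col))) %>%    # one vector of 31 differences per column
  as_tibble() %>%                             # named list of equal-length vectors -> tibble
  rename_with(~ paste0(.x, ".abs.diff.lag"))  # append the suffix to every name

dim(abs_diffs)       # 31 11
names(abs_diffs)[1]  # "mpg.abs.diff.lag"
```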
Maybe something like this:
dataCars <- mtcars%>%mutate(diffMPG = abs(mpg - lag(mpg)),
diffHP = abs(hp - lag(hp)))
And then do this for all the columns you are interested in
I was not able to reproduce your issue with the lag function. When I run your sample code, I get a data frame with 31 rows, exactly as you mentioned, but the first row is not NA; it is already the difference between the 1st and 2nd rows.
Regarding your bonus question, the answer is provided here:
temp <- map_df(mtcars, ~ abs(diff(lag(.x)))) %>% setNames(paste0(names(.), '.abs.diff.lag'))
This should result in the desired column naming.
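A small vector makes it easier to see where the last difference goes, and why the first row may or may not be NA. The sketch below assumes the original code picked up dplyr::lag (which tidyverse attaches); if only stats::lag is in scope, the values are not shifted at all, which might explain the differing reports above. The vector `x` is made up for illustration.

```r
x <- c(1, 4, 9, 16)

lagged      <- dplyr::lag(x)    # NA 1 4 9: shifted right, last value dropped
diff_lagged <- diff(lagged)     # NA 3 5: the final difference (16 - 9) is lost
diff_plain  <- diff(x)          # 3 5 7: all consecutive differences
same_length <- x - dplyr::lag(x)  # NA 3 5 7: keeps the original length

# stats::lag() only changes time-series attributes, not the values
diff_stats  <- as.numeric(diff(stats::lag(x)))  # 3 5 7
```

So diff() alone gives every consecutive difference, and `x - lag(x)` is the form to use when the result must stay the same length as the input.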