R - dplyr lag function - r

I am trying to calculate the absolute difference between lagged values over several columns. The first row of the resulting data set is NA, which is correct because there is no previous value to calculate the lag. What I don't understand is why the lag isn't calculated for the last value. Note that the last value in the example below (temp) is the lag between the 2nd to last and the 3rd to last values, the lag value between the last and 2nd to last value is missing.
library(tidyverse)
library(purrr)
dim(mtcars) # 32 rows
temp <- map_df(mtcars, ~ abs(diff(lag(.x))))
names(temp) <- paste(names(temp), '.abs.diff.lag', sep= '')
dim(temp) # 31 rows
It would be an awesome bonus if someone could show me how to pipe the renaming step, I played around with paste and enquo. The real dataset is too long to do a gather/newcolumnname/spread approach.
Thanks in advance!
EDIT: added the libraries needed to run the script.

I think the lag call in your existing code is unnecessary as diff calculates the lagged difference automatically (although perhaps I don't understand properly what you are trying to do). You can also use rename_all to add a suffix to all the variable names.
library(purrr)
library(dplyr)
mtcars %>%
map_df(~ abs(diff(.x))) %>%
# funs() is deprecated in current dplyr; a lambda does the same job
rename_all(~ paste0(.x, ".abs.diff.lag"))
#> # A tibble: 31 x 11
#> mpg.abs.diff.lag cyl.abs.diff.lag disp.abs.diff.lag hp.abs.diff.lag
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.0 0 0.0 0
#> 2 1.8 2 52.0 17
#> 3 1.4 2 150.0 17
#> 4 2.7 2 102.0 65
#> 5 0.6 2 135.0 70
#> 6 3.8 2 135.0 140
#> 7 10.1 4 213.3 183
#> 8 1.6 0 5.9 33
#> 9 3.6 2 26.8 28
#> 10 1.4 0 0.0 0
#> # ... with 21 more rows, and 7 more variables: drat.abs.diff.lag <dbl>,
#> # wt.abs.diff.lag <dbl>, qsec.abs.diff.lag <dbl>, vs.abs.diff.lag <dbl>,
#> # am.abs.diff.lag <dbl>, gear.abs.diff.lag <dbl>,
#> # carb.abs.diff.lag <dbl>
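A small vector makes it easier to see why the question's code loses the last difference. This is only a sketch on a made-up vector x (not from the question); it calls dplyr::lag() explicitly because base R also has a stats::lag() that behaves differently:

```r
x <- c(10, 13, 11, 18)  # made-up values for illustration

# dplyr::lag() shifts the values down by one: NA first, last value dropped
lagged <- dplyr::lag(x)

# diff() on its own already returns x[i + 1] - x[i], one element shorter than x
step_diff <- diff(x)

# Differencing the shifted vector starts with NA and never reaches 18 - 11
lagged_diff <- diff(lagged)

print(lagged)       # NA 10 13 11
print(step_diff)    # 3 -2 7
print(lagged_diff)  # NA 3 -2
```

So diff() alone gives every consecutive difference, while diff(lag(x)) trades the last one for a leading NA.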

Maybe something like this:
dataCars <- mtcars %>%
mutate(diffMPG = abs(mpg - lag(mpg)),
diffHP = abs(hp - lag(hp)))
And then do this for all the columns you are interested in.
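To avoid typing one line per column, the same expression can be applied to every column at once. The following is a sketch that assumes dplyr 1.0.0 or later (for across() and its .names argument); the suffix reuses the naming from the question:

```r
library(dplyr)

# Apply abs(x - lag(x)) to every column and append a suffix to the new names
diffs <- mtcars %>%
  mutate(across(everything(),
                ~ abs(.x - lag(.x)),
                .names = "{.col}.abs.diff.lag"))

dim(diffs)  # 32 rows are kept: the first difference is NA instead of being dropped
```

Because mutate() keeps the row count, the result lines up with the original rows, unlike the diff() approach, which returns one row fewer.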

I was not able to reproduce your issue with the lag function. When I run your sample code, I get a data frame with 31 rows, exactly as you mentioned, but the first row is not NA; it is already the difference between the 1st and 2nd rows.
Regarding your bonus question, the answer is provided here:
temp <- map_df(mtcars, ~ abs(diff(lag(.x)))) %>% setNames(paste0(names(.), '.abs.diff.lag'))
This should result in the desired column naming.

Related

Error in using group_by and summarise in running correlation and test of significance with SPSS dataset

I borrowed a dataset from Julie Pallant's SPSS Survival Manual and am analysing it in R.
I select three columns to run correlation and significance test: toptim, tnegaff, sex. I select the columns using select: df <- survey %>% select(toptim, tnegaff, sex).
Then, problems emerge.
I'd like to know the correlation between toptim and tnegaff by sex. But I can't use cor and have to resort to correlate. Why does cor fail, and is there any difference between the two methods?
df %>% group_by(sex) %>% summarise(cor = correlate(toptim, tnegaff)) # OK (male = 0.22, female = 0.394)
df %>% group_by(sex) %>% summarise(cor = cor(toptim, tnegaff)) # fails, returns NA
I also failed to obtain the test of significance with cor.test (the answer should be p = 0.0488):
Error in `summarise()`:
! Problem while computing `cor = cor.test(toptim, tnegaff)`.
✖ `cor` must be a vector, not a `htest` object.
ℹ The error occurred in group 1: sex = 1.
Then I tried to follow past examples and use broom::tidy, but the output contains no p-values:
> df %>% group_by(sex) %>% broom::tidy(cor.test(toptim, tnegaff))
# A tibble: 3 × 13
column n mean sd median trimmed mad min max range skew kurtosis se
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 toptim 435 22.1 4.43 22 22.3 3 7 30 23 NA NA 0.212
2 tnegaff 435 19.4 7.07 18 18.6 4 10 39 29 NA NA 0.339
3 sex 439 1.58 0.494 2 1.58 0 1 2 1 -0.318 1.10 0.0236
How can I get the result? May I know the reason for such failure?
Thank you for your answers in advance.
I presume cor is using all values and running into NAs. If you set use = "complete.obs", it should work. For the cor.test part, wrap the output in list() so that the tibble can store the htest objects in a list column.
For the final tidying and getting the p-values, use map(cor.test, broom::tidy) and then tidyr::unnest() to get a full, tidy data frame.
That's a few steps to go through, but I hope it helps!
df <- haven::read_sav("survey.sav")
library(tidyverse)
df %>%
group_by(sex) %>%
summarise(cor = cor(toptim, tnegaff, use = "complete.obs"),
cor.test = list(cor.test(toptim, tnegaff))) %>%
mutate(tidy_out = map(cor.test, broom::tidy)) %>%
unnest(tidy_out)
#> # A tibble: 2 × 11
#> sex cor cor.t…¹ estim…² stati…³ p.value param…⁴ conf.…⁵ conf.…⁶ method
#> <dbl+l> <dbl> <list> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr>
#> 1 1 [MAL… -0.220 <htest> -0.220 -3.04 2.73e- 3 182 -0.353 -0.0775 Pears…
#> 2 2 [FEM… -0.394 <htest> -0.394 -6.75 1.06e-10 248 -0.494 -0.284 Pears…
#> # … with 1 more variable: alternative <chr>, and abbreviated variable names
#> # ¹​cor.test, ²​estimate, ³​statistic, ⁴​parameter, ⁵​conf.low, ⁶​conf.high
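If only the coefficient and the p-value are needed, a shorter variant is to pull the p-value out of the htest object inside summarise(), so that every summary column is a plain number. This is a sketch on the built-in mtcars data (grouped by am) rather than on the survey file, with a made-up x and y to show the effect of a single missing value:

```r
library(dplyr)

# cor() returns NA as soon as one value is missing, unless incomplete pairs are dropped
x <- c(1, 2, 3, 4, NA)
y <- c(2, 1, 4, 3, 5)
cor(x, y)                        # NA
cor(x, y, use = "complete.obs")  # 0.6

# Extracting $p.value gives summarise() one number per group instead of an htest object
by_group <- mtcars %>%
  group_by(am) %>%
  summarise(r = cor(mpg, wt, use = "complete.obs"),
            p_value = cor.test(mpg, wt)$p.value)
by_group
```

The list-column route above is still the better choice when the confidence interval or test statistic is wanted as well.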
Edit - examining difference in correlation
Borrowing the function from here you can examine the difference in correlation coefficients between sexes like this:
cor.diff.test(df$toptim[df$sex == 1], df$tnegaff[df$sex == 1], df$toptim[df$sex == 2], df$tnegaff[df$sex == 2])

How to estimate means from same column in large number of dataframes, based upon a grouping variable in R

I have a large number of data frames in R (>50), each corresponding to a different filter I've applied. Here's an example of 7 of them:
Steps_Day1 <- filter(PD2, Gait_Day == 1)
Steps_Day2 <- filter(PD2, Gait_Day == 2)
Steps_Day3 <- filter(PD2, Gait_Day == 3)
Steps_Day4 <- filter(PD2, Gait_Day == 4)
Steps_Day5 <- filter(PD2, Gait_Day == 5)
Steps_Day6 <- filter(PD2, Gait_Day == 6)
Steps_Day7 <- filter(PD2, Gait_Day == 7)
Each of the dataframes contains 19 variables, however I'm only interested in their speed (to calculate mean) and their subjectID, as each subject has multiple observations of speed in the same DF.
An example of the data we're interested in, in dataframe - Steps_Day1:
Speed SubjectID
0.6 1
0.7 1
0.7 2
0.8 2
0.1 2
1.1 3
1.2 3
1.5 4
1.7 4
0.8 4
The data goes up to 61 pts. and each participant's number of observations is much larger than this.
Now what I want to do is write code that automatically cycles through each of the 50 data frames (taking the 7 above as an example), calculates the mean speed for each participant, and saves it in a new data frame, alongside the variables containing the mean for each participant in the other DFs.
An example of Steps day 1 (Values not accurate)
Speed SubjectID
0.6 1
0.7 2
1.2 3
1.7 4
and so on... Before I end up with a final DF containing in column vectors the means for each participant from each of the other data frames, which may look something like:
Steps_Day1 Steps_Day2 Steps_Day3 Steps_Day4 SubjectID
0.6 0.8 0.5 0.4 1
0.7 0.9 0.6 0.6 2
1.2 1.1 0.4 0.7 3
1.7 1.3 0.3 0.8 4
I could do this through some horrible, messy long code - but looking to see if anyone has more intuitive ideas please!
:)
To add to the previous answer, I agree that it is much easier to do this without creating a new data frame for each day. Using some generated data, you can achieve your desired results as follows:
library(dplyr)
library(tidyr)
# Generate some data
df <- data.frame(
day = rep(1:5, 1, 100),
subject = rep(5:10, 1, 100),
speed = runif(500)
)
df %>%
group_by(day, subject) %>%
summarise(avg_speed = mean(speed)) %>%
pivot_wider(names_from = day,
names_prefix = "Steps_Day",
values_from = avg_speed)
# A tibble: 6 × 6
subject Steps_Day1 Steps_Day2 Steps_Day3 Steps_Day4 Steps_Day5
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5 0.605 0.416 0.502 0.516 0.517
2 6 0.592 0.458 0.625 0.531 0.460
3 7 0.475 0.396 0.586 0.517 0.449
4 8 0.430 0.435 0.489 0.512 0.548
5 9 0.512 0.645 0.509 0.484 0.566
6 10 0.530 0.453 0.545 0.497 0.460
You don't include a MCVE of your dataset so I can't test out a solution, but it seems like a pretty simple problem using tidyverse solutions.
First, why do you split PD2 into separate dataframes? If you skip that, you can just use group and summarize to get the average for groups:
PD2 %>%
group_by(Gait_Day, SubjectID) %>%
summarize(Steps = mean(Speed))
This will give you a "long-form" data.frame with 3 variables: Gait_Day, SubjectID, and Steps, which holds the mean speed for that subject and day. If you want it in the format you show at the end, just pivot it into "wide-form" using pivot_wider. You can see this question for further explanation of that: How to reshape data from long to wide format
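If the separate data frames already exist and cannot easily be recreated from PD2, one option is to collect them in a named list and stack them before summarising. The sketch below uses two small made-up data frames; the values are placeholders, not the real data, and the column name "source" is an arbitrary choice:

```r
library(dplyr)
library(tidyr)

# Two toy data frames standing in for Steps_Day1, Steps_Day2, ...
Steps_Day1 <- data.frame(Speed = c(0.6, 0.7, 0.7, 0.8), SubjectID = c(1, 1, 2, 2))
Steps_Day2 <- data.frame(Speed = c(0.9, 1.1, 0.5, 0.7), SubjectID = c(1, 1, 2, 2))

# For 50+ objects, mget() with a vector of object names would build this list
steps <- list(Steps_Day1 = Steps_Day1, Steps_Day2 = Steps_Day2)

means_wide <- bind_rows(steps, .id = "source") %>%  # "source" holds the list names
  group_by(source, SubjectID) %>%
  summarise(mean_speed = mean(Speed), .groups = "drop") %>%
  pivot_wider(names_from = source, values_from = mean_speed)

means_wide  # one row per subject, one column per original data frame
```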

Weird things with Automatically generate new variable names using dplyr mutate

OK this is going to be a long post.
So I am fairly new to R (I am currently using the MR free 3.5, with no checkpoint), but I am trying to work with the tidyverse, which I find very elegant for writing code and often a lot simpler.
I decided to replicate an exercise from guru99 here. It is a simple k-means exercise. However, because I always want to write "generalizable" code, I was trying to automatically rename the variables in mutate with new names. So I searched SO and found this solution here, which is very nice.
First what works fine.
#library(tidyverse)
link <- "https://raw.githubusercontent.com/guru99-edu/R-Programming/master/computers.csv"
df <- read.csv(link)
rescaled <- df %>% discard(is.factor) %>%
select(-X) %>%
mutate_all(
funs("scaled" = scale)
)
When you download the data with read.csv you get the df in dataframe class and everything works.
And now the weird things start. If you download the data with read_csv or make it a tibble at any point afterwards (the first X variable will be named X1, and you need to change is.factor to is.character because strings are converted to character, not factors, unless explicitly asked for; a note for future me and others)
and then run the code
df1 <- read_csv(link)
df1 %>% discard(is.character) %>%
select(-X1) %>%
mutate_all(
funs("scaled" = scale)
)
the new variables are named price_scaled[,1], speed_scaled[,1], hd_scaled[,1], ram_scaled[,1] etc. when you view the output in the console, or even if you print() it.
BUT if you view() it, you see the output with the names you expect, which are price_scaled, speed_scaled, hd_scaled etc. ALSO, I am using an Rmarkdown document for the code, and when I change the chunk output to inline it displays the names correctly, with hd_scaled etc.
Does anyone have any idea how to get the names printed in the console as price_scaled etc.?
Why is this happening?
Thought that this would be interesting to ask.
scale() returns a matrix, and dplyr/tibble isn't automatically coercing it to a vector. By changing your mutate_all() call to the below, we can have it return a vector. I identified this is what was happening by calling class(df1$speed_scaled) and seeing the result of "matrix".
library(tidyverse)
link <- "https://raw.githubusercontent.com/guru99-edu/R-Programming/master/computers.csv"
df <- read_csv(link)
#> Warning: Missing column names filled in: 'X1' [1]
#> Parsed with column specification:
#> cols(
#> X1 = col_double(),
#> price = col_double(),
#> speed = col_double(),
#> hd = col_double(),
#> ram = col_double(),
#> screen = col_double(),
#> cd = col_character(),
#> multi = col_character(),
#> premium = col_character(),
#> ads = col_double(),
#> trend = col_double()
#> )
df %>% discard(is.character) %>%
select(-X1) %>%
mutate_all(
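# [, 1] returns the whole scaled column as a plain vector; [[1]] would return only
# its first value, which is then recycled to every row (as in the printed output below)
list("scaled" = function(x) scale(x)[, 1])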
)
#> # A tibble: 6,259 x 14
#> price speed hd ram screen ads trend price_scaled speed_scaled
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1499 25 80 4 14 94 1 -1.24 -1.28
#> 2 1795 33 85 2 14 94 1 -1.24 -1.28
#> 3 1595 25 170 4 15 94 1 -1.24 -1.28
#> 4 1849 25 170 8 14 94 1 -1.24 -1.28
#> 5 3295 33 340 16 14 94 1 -1.24 -1.28
#> 6 3695 66 340 16 14 94 1 -1.24 -1.28
#> 7 1720 25 170 4 14 94 1 -1.24 -1.28
#> 8 1995 50 85 2 14 94 1 -1.24 -1.28
#> 9 2225 50 210 8 14 94 1 -1.24 -1.28
#> 10 2575 50 210 4 15 94 1 -1.24 -1.28
#> # ... with 6,249 more rows, and 5 more variables: hd_scaled <dbl>,
#> # ram_scaled <dbl>, screen_scaled <dbl>, ads_scaled <dbl>,
#> # trend_scaled <dbl>
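The difference between the matrix returned by scale() and a plain vector can be checked on a tiny made-up input. This sketch shows two common ways to drop the matrix structure, and what [[1]] returns instead:

```r
x <- c(1, 2, 3)  # made-up values: mean 2, standard deviation 1

scaled <- scale(x)
is.matrix(scaled)  # TRUE: a 3 x 1 matrix carrying centering/scaling attributes

# Either of these returns the full column as a plain numeric vector
as.vector(scaled)  # -1 0 1
scaled[, 1]        # -1 0 1

# [[1]] extracts only the first element, so a whole column would be filled with it
scaled[[1]]        # -1
```

A quick sanity check on any scaled column is that its values differ from row to row and have a mean close to zero.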

How to reference "cells" within a column in R?

I'm trying to calculate numeric ranges based on the moving average of a column of data. I have found a way to use caTools::runmean to produce a column of moving averages, and I know how to work with this in Excel to produce the columns I want, but I would love to know a way to do all of this in one R script.
Here is my simplified reproducible example for R.
library(tidyverse)
library(caTools)
data <- as_tibble(data.frame(
Index = as.integer(c(18,19,21,22,23,25,26,29)),
mydbl = c(8.905,13.31,15.739,17.544,19.054,20.393,21.623,22.764)))
data <- data %>%
mutate(avg = runmean(mydbl,
k = 2,
alg = "exact",
endrule = "NA"))
This tibble will look like this:
> data
# A tibble: 8 x 3
Index mydbl avg
<int> <dbl> <dbl>
1 18 8.90 NA
2 19 13.3 11.1
3 21 15.7 14.5
4 22 17.5 16.6
5 23 19.1 18.3
6 25 20.4 19.7
7 26 21.6 21.0
8 29 22.8 22.2
To produce the remaining data I want, I exported this to Excel with write_csv(data,...) and the final table is shown below. The first value in dbl_i is the formula =B2-ABS(C3-B2) (the difference between mydbl and the next avg subtracted from mydbl to create an equidistant lower limit). The last value in dbl_f is the formula =B9+ABS(C9-B9) (the difference between mydbl and the avg added to mydbl to create an equidistant upper limit). The other values in the two columns are just direct references to the avg column.
Index mydbl avg dbl_i dbl_f
18 8.905 NA 6.7025 11.1075
19 13.31 11.1075 11.1075 14.5245
21 15.739 14.5245 14.5245 16.6415
22 17.544 16.6415 16.6415 18.299
23 19.054 18.299 18.299 19.7235
25 20.393 19.7235 19.7235 21.008
26 21.623 21.008 21.008 22.1935
29 22.764 22.1935 22.1935 23.3345
Yes, the dbl_i is just the avg column but with the first value being =B2-abs(C3-B2). And the dbl_f column is the same as the avg column except it's moved up one, and the final value is =B9+abs(C9-B9). Ultimately it seems the real problem lies in finding a way to reproduce the Excel calculations D2=B2-ABS(C3-B2) and E9=B9+ABS(C9-B9).
Does anyone know how they would reproduce these calculations in R? I was looking for a way to create a formula in R that could be the equivalent of B2-ABS(C3-B2), but could not find one, unless I create a matrix instead. Do I have to create a matrix?
Thanks for your time.
data %>%
mutate(
avg = zoo::rollmean(mydbl, 2, align="right", fill=NA),
dbl_i = if_else(row_number() == 1L, mydbl - abs(lead(avg) - mydbl), avg),
dbl_f = if_else(row_number() == n(), mydbl + abs(avg - mydbl), lead(avg))
)
# # A tibble: 8 x 5
# Index mydbl avg dbl_i dbl_f
# <int> <dbl> <dbl> <dbl> <dbl>
# 1 18 8.90 NA 6.70 11.1
# 2 19 13.3 11.1 11.1 14.5
# 3 21 15.7 14.5 14.5 16.6
# 4 22 17.5 16.6 16.6 18.3
# 5 23 19.1 18.3 18.3 19.7
# 6 25 20.4 19.7 19.7 21.0
# 7 26 21.6 21.0 21.0 22.2
# 8 29 22.8 22.2 22.2 23.3
Honestly it's not the most elegant, but it gets the job done.
(BTW: I'm using zoo::rollmean because I don't have caTools installed, but it's the same effect I believe.)
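The Excel-style cell references can also be written with plain vector indexing, where avg[2] plays the role of C3 and mydbl[1] the role of B2. A base R sketch using the data from the question (the two-point average is computed by hand here, as a stand-in for runmean or rollmean):

```r
mydbl <- c(8.905, 13.31, 15.739, 17.544, 19.054, 20.393, 21.623, 22.764)
n <- length(mydbl)

# Two-point trailing average: NA first, then the mean of each value and its predecessor
avg <- c(NA, (mydbl[-n] + mydbl[-1]) / 2)

# dbl_i: the avg column, with the first cell replaced by B2 - ABS(C3 - B2)
dbl_i <- avg
dbl_i[1] <- mydbl[1] - abs(avg[2] - mydbl[1])

# dbl_f: the avg column moved up one, with B9 + ABS(C9 - B9) as the last cell
dbl_f <- c(avg[-1], mydbl[n] + abs(avg[n] - mydbl[n]))

data.frame(mydbl, avg, dbl_i, dbl_f)
```

No matrix is needed: a column of a data frame is an ordinary vector, so single "cells" are reached with [ and an index.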

Dplyr: looping the creation of new columns

EDIT: My data (for reproducible research) looks as follows. The dplyr will summarise the values for each win_name category:
inv_name inv_province inv_town nip win_name value start duration year
CustomerA łódzkie TownX 1111111111 CompX 233.50 2015-10-23 24 2017
CustomerA łódzkie TownX 1111111111 CompX 300.5 2015-10-23 24 2017
CustomerA łódzkie TownX 1111111111 CompX 200.5 2015-10-23 24 2017
CustomerB łódzkie TownY 2222222222 CompY 200.5 2015-10-25 12 2017
CustomerB łódzkie TownY 2222222222 CompY 1200.0 2015-10-25 12 2017
CustomerB łódzkie TownY 2222222222 CompY 320.00 2015-10-25 12 2017
The dplyr will summarise the values, then the spread will make the summary spread into several columns for each win_name category with numeric values.
I would like to create new columns with formatted text corresponding to existing columns with numbers. Create as many columns as there are numeric columns with numeric data. The number of these columns can change from analysis to analysis. My code so far looks like:
county_marketshare<-df_monthly_val %>%
select(win_name,value,inv_province) %>%
group_by(win_name,inv_province)%>%
summarise(value=round(sum(value),0))%>%
spread(key="win_name", value=value, fill=0) %>% # now I need to create the "financially" formatted columns
mutate(!!as.symbol(paste0(bestSup[1],"_lbl")):= formatC(!!as.symbol(bestSup[1]),digits = 0, big.mark = " ", format = "f",zero.print = ""),
!!as.symbol(paste0(bestSup[2],"_lbl")):= formatC(!!as.symbol(bestSup[2]),digits = 0, big.mark = " ", format = "f",zero.print = ""),
!!as.symbol(paste0(bestSup[3],"_lbl")):= formatC(!!as.symbol(bestSup[3]),digits = 0, big.mark = " ", format = "f",zero.print = "")
)
Is there a way to loop the mutate function so that as many columns are created as there are existing numeric columns? The relevant lines with the repetitive code are the last three. Each new formatted text column has the name of an existing numeric column plus a suffix. !!as.symbol makes it possible to combine a parameter, the name of the source column, with the _lbl suffix.
You could, for example, use mutate_at with a function and a conditional such as
dat %>%
mutate_at(.vars = c('num_col1','num_col2'),
.funs = function(x) if (is.numeric(x)) as.character(x) else x)
This will replace the specified numeric columns with character columns. You can tweak the function to your needs, i.e. specify what the columns should look like. We could help you a bit more with a better data example.
You can also filter only the numeric columns and then use mutate_all:
dat %>% Filter(is.numeric, .) %>% mutate_all(as.character)
# Filter() is base R, not dplyr; note the capital 'F'!
# You can also use dat %>% .[sapply(., is.numeric)], with the same result
# or dplyr::select_if
...:)
P.S. It's always worth citing the reference. Have a look at this gorgeous question:
Selecting only numeric columns from a data frame
Please consult tidyverse documentation.
# mutate_if() is particularly useful for transforming variables from
# one type to another
iris %>% as_tibble() %>% mutate_if(is.factor, as.character)
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 5.10 3.50 1.40 0.200 setosa
#> 2 4.90 3.00 1.40 0.200 setosa
#> 3 4.70 3.20 1.30 0.200 setosa
#> 4 4.60 3.10 1.50 0.200 setosa
#> 5 5.00 3.60 1.40 0.200 setosa
#> 6 5.40 3.90 1.70 0.400 setosa
#> 7 4.60 3.40 1.40 0.300 setosa
#> 8 5.00 3.40 1.50 0.200 setosa
#> 9 4.40 2.90 1.40 0.200 setosa
#> 10 4.90 3.10 1.50 0.100 setosa
#> # ... with 140 more rows
Unexpectedly I found a hint at http://stackoverflow.com/a/47971650/3480717
I did not realise that in the syntax
mtcars %>% mutate_at(columnstolog, funs(log = log(.)))
adding a name part "log =" in funs will append it to the names of the new columns... In effect, the following is enough in my case:
mutate_if(is.numeric, funs(lbl = formatC(.,digits = 0, big.mark = " ", format = "f",zero.print = "")))
This will generate new columns, as many as there are original numeric columns, and these new columns will have names suffixed with "_lbl". No need for loops or advanced syntax. Big thanks to Thebo and Nettle.
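In dplyr 1.0.0 and later, where funs() is deprecated, the same idea can be sketched with across() and its .names argument. The data frame below is made up (the column names CompX and CompY only mimic the question), and zero.print is left out to keep the example minimal:

```r
library(dplyr)

# Toy stand-in for the spread() result: one numeric column per win_name
wide <- tibble(inv_province = c("A", "B"),
               CompX = c(735, 12500),
               CompY = c(1720, 300))

# One formatted text column per numeric column, named <column>_lbl
labelled <- wide %>%
  mutate(across(where(is.numeric),
                ~ formatC(.x, digits = 0, big.mark = " ", format = "f"),
                .names = "{.col}_lbl"))

names(labelled)  # original columns plus CompX_lbl and CompY_lbl
```

As with the funs() version, the number of new columns follows the number of numeric columns, so nothing has to change when the set of win_name values changes.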

Resources