How to subset a data frame with an R pipeline

I am trying to subset/filter a data frame, keeping only the rows whose col1 value also appears in a column of another data frame.
Here is what I use for this:
df <- df1[df1$col1 %in% df2$col2,]
Then I set that column as the row names:
df <- df %>% remove_rownames %>% column_to_rownames('col1')
However, I have no idea how to combine these two steps into one using %>%.

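library(dplyr); library(tibble)  # filter() comes from dplyr, the rowname helpers from tibble
df <- df1 %>% filter(col1 %in% df2$col2) %>% remove_rownames() %>% column_to_rownames('col1')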
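As a quick check of the pipeline above, here is a minimal sketch on toy data. The contents of df1 and df2 are made up for illustration; only the column names col1 and col2 come from the question. Note that column_to_rownames() needs the values in col1 to be unique.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for the question's df1 and df2
df1 <- data.frame(col1 = c("a", "b", "c", "d"), score = c(1, 2, 3, 4))
df2 <- data.frame(col2 = c("b", "d", "z"))

df <- df1 %>%
  filter(col1 %in% df2$col2) %>%   # keep rows whose col1 appears in df2$col2
  remove_rownames() %>%            # drop any existing row names
  column_to_rownames("col1")       # move col1 into the row names

print(df)
```

Only the rows for "b" and "d" survive, and col1 no longer appears as a regular column.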

Related

R filter or subset for finding a specific repeat count for data.frame

I want to use filter or subset from dplyr to get a new data frame containing only the rows whose value in the selected column occurs exactly 2 times in the original data.frame.
I tried this:
df2 <-
df %>%
group_by(x) %>% mutate(duplicate = n()) %>%
filter(duplicate == 2)
and this
df2 <- subset(df,duplicated(x))
but neither option works.
In group_by(), just use the unquoted column name. There is also no need to create a column with mutate() before filtering; the count can be computed on the fly inside filter().
library(dplyr)
df %>%
group_by(x) %>%
filter(n() == 2) %>%
ungroup()
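To see the difference between the two approaches, here is a small sketch on made-up data (the values of x and y are assumptions for illustration). It also shows why duplicated() does not answer the question: it flags the second and later occurrences of a value, regardless of how many times the value occurs in total.

```r
library(dplyr)

# Hypothetical data: "a" occurs twice, "b" once, "c" three times
df <- data.frame(x = c("a", "a", "b", "c", "c", "c"), y = 1:6)

# Keep only rows whose x value occurs exactly twice
df2 <- df %>%
  group_by(x) %>%
  filter(n() == 2) %>%
  ungroup()

# duplicated() marks repeats, so this keeps rows 2, 5 and 6 instead
dup_rows <- subset(df, duplicated(x))

print(df2)
print(dup_rows)
```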

dataframe in wideformat to dataframe of timeseries

I am currently struggling with reshaping my dataset to a preferred result. Lets say I have the following dataset to start with:
library(tsbox)
library(dplyr)
library(tidyr)
# create df that matches my format
df1 <- ts_wide(ts_df(ts_c(mdeaths)))
df1$id <- 1
df2 <- ts_wide(ts_df(ts_c(mdeaths)))
df2$id <- 2
df <- rbind(df1, df2)
Now this dataset has a date column, a value column and an "id" column, which should specify which date/value points belong to the same observation object. I would now like to reshape my dataset into a 2x2 data frame, where the first column is the id and the second column is a time series object (built from the date/value pairs corresponding to that id). To do so, I tried the following:
# create a new df, with two cols (id and ts)
df_ts <- df %>%
group_by(id) %>%
nest()
The nest command creates "a list-column of data frames", which is not exactly what I wanted. I know that a ts can be defined via ts(data$value, data$date), but I do not know how to integrate it after the group_by(id) call. Can anyone show me how to turn this column into a ts object instead of a data frame? I am new to R and grateful for any form of help.
Thanks in advance
If you have a non-atomic data type, it will have to be a list column of something.
If you want a list-column of ts objects, you can use:
df %>%
group_by(id) %>%
summarize(ts = list(ts(value, time)))
Or, continuing your pipe, you could use:
df %>%
group_by(id) %>%
nest() %>%
mutate(data = purrr::map(data, with, ts(value, time)))
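Since ts() describes time through start and frequency arguments rather than a vector of dates, a variant that derives those from the date column may be easier to check. The following is only a sketch under stated assumptions: the data are made up, the columns are assumed to be named id, time and value, and the series are assumed to be monthly with no gaps.

```r
library(dplyr)

# Hypothetical monthly data: two ids with 24 observations each
dates <- seq(as.Date("1974-01-01"), by = "month", length.out = 24)
df <- data.frame(
  id    = rep(1:2, each = 24),
  time  = rep(dates, times = 2),
  value = c(1:24, 101:124)
)

df_ts <- df %>%
  group_by(id) %>%
  arrange(time, .by_group = TRUE) %>%      # make sure each series is in date order
  summarise(ts = list(ts(
    value,
    start = c(as.integer(format(min(time), "%Y")),   # first year
              as.integer(format(min(time), "%m"))),  # first month
    frequency = 12                                    # assumption: monthly data
  )))

print(df_ts)
print(df_ts$ts[[1]])
```

The result has one row per id, and each element of the ts column is a regular ts object.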

How to remove outliers in only one column after grouping by another column in R

I want to remove outliers from a variable MEASURE after grouping by TYPE. I tried the following code, but it didn't work. I've searched and have only come across ways to remove outliers for the whole data frame or for one column, but not after grouping.
df2 <- df %>%
group_by(TYPE) %>%
mutate(MEASURE_WITHOUT_OUTLIERS = remove_outliers(MEASURE))
You can use boxplot.stats to get outlier values in each group and use filter to remove them.
library(dplyr)
df2 <- df %>%
group_by(TYPE) %>%
filter(!MEASURE %in% boxplot.stats(MEASURE)$out) %>%
ungroup()
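Here is a small sketch of how this behaves on made-up data (the values are assumptions for illustration; only the column names TYPE and MEASURE come from the question). boxplot.stats() uses the usual 1.5 x IQR rule relative to the hinges, so what counts as an outlier is decided separately within each TYPE.

```r
library(dplyr)

# Hypothetical data: group A contains one extreme value, group B does not
df <- data.frame(
  TYPE    = rep(c("A", "B"), each = 5),
  MEASURE = c(1, 2, 3, 4, 100, 10, 11, 12, 13, 14)
)

df2 <- df %>%
  group_by(TYPE) %>%
  # boxplot.stats()$out holds the values flagged as outliers in this group
  filter(!MEASURE %in% boxplot.stats(MEASURE)$out) %>%
  ungroup()

print(df2)
```

The value 100 is dropped from group A, while group B keeps all five rows.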

How to Transpose (t) in the Tidyverse Using Tidyr

Using the sample data (bottom), I want to use the code below to group and summarise the data. After this, I want to transpose, but I'm stuck on how to use tidyr to achieve this?
For context, I'm attempting to recreate an existing table that was created in Excel using knitr::kable, so the final product of my code below is expected to break tidy principles.
For example:
library(tidyverse)
Df <- Df %>% group_by(Code1, Code2, Level) %>%
summarise_all(funs(count = sum(!is.na(.))))
I can add t(.) using the pipe...
Df <- Df %>% group_by(Code1, Code2, Level) %>%
summarise_all(funs(count = sum(!is.na(.)))) %>%
t(.)
or I can add...
Df <- as.data.frame(t(Df))
Both of these options allow me to transpose, but I'm wondering if there's a tidyverse method of achieving this using tidyr's gather and spread functions? I want to have more control over the process and also want to remove the "V1","V2", etc, that appear as column names when using transpose (t).
How can I achieve this using tidyverse?
Sample Code:
Code1 <- c("H200","H350","H250","T400","T240","T600")
Code2 <- c("4A","4A","4A","2B","2B","2B")
Level <- c(1,2,3,1,2,3)
Q1 <- c(30,40,40,50,60,80)
Q2 <- c(50,30,50,40,80,30)
Q3 <- c(30,45,70,42,81,34)
Df <- data.frame(Code1, Code2, Level, Q1, Q2, Q3)
The general idiom in the tidyverse is to gather() your data to the maximal extent, forming a "long" data frame with one measurement per row. Then, spread() can revert this long data frame into whichever "wide" format that you like best. This procedure can effectively transpose the data: just gather() all the identifier columns except the row names, and then spread() the row names.
For example, here is how to effectively transpose mtcars:
require(tidyverse)
mtcars %>%
rownames_to_column %>%
gather(variable, value, -rowname) %>%
spread(rowname, value)
Your data does not have "row names" as understood in R, but Code1 effectively serves as a row name because it uniquely identifies each (original) row of your data.
Df1 <- Df %>%
group_by(Code1, Code2, Level) %>%
summarise_all(funs(count = sum(!is.na(.)))) %>%
gather(column, value, -Code1) %>%
spread(Code1, value)
UPDATE for tidyr 1.0 or higher (late 2019 onwards)
The new pivot_wider() and pivot_longer() functions are now preferred over the older (but still supported) gather() and spread(). Thus the preferred way to transpose mtcars is probably
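library(tidyverse)
mtcars %>%
  rownames_to_column() %>%
  # name the arguments: positional matching differs between tidyr versions
  pivot_longer(-rowname, names_to = 'variable', values_to = 'value') %>%
  pivot_wider(names_from = rowname, values_from = value)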
library(tidyr)
library(dplyr)
Df <- Df %>% group_by(Code1, Code2, Level) %>%
summarise_all(funs(count = sum(!is.na(.)))) %>%
gather(var, val, 2:ncol(Df)) %>%
spread(Code1, val)
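With pivot_longer() and pivot_wider(), the same kind of transposition can be sketched on the sample data. This is only an illustration under assumptions: it transposes the raw Q1 to Q3 columns rather than the summarised counts, and it drops Code2 and Level first so that every pivoted column is numeric. The names question and value are made up for the example.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
Code1 <- c("H200", "H350", "H250", "T400", "T240", "T600")
Code2 <- c("4A", "4A", "4A", "2B", "2B", "2B")
Level <- c(1, 2, 3, 1, 2, 3)
Q1 <- c(30, 40, 40, 50, 60, 80)
Q2 <- c(50, 30, 50, 40, 80, 30)
Q3 <- c(30, 45, 70, 42, 81, 34)
Df <- data.frame(Code1, Code2, Level, Q1, Q2, Q3)

transposed <- Df %>%
  select(Code1, Q1:Q3) %>%                       # keep the id and numeric columns
  pivot_longer(-Code1, names_to = "question", values_to = "value") %>%
  pivot_wider(names_from = Code1, values_from = value)

print(transposed)
```

Each former column (Q1, Q2, Q3) becomes a row and each Code1 value becomes a column, so no "V1", "V2" names appear.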

Standardize data columns in R in subgroups

I'm struggling with standardization of data columns in R in subgroups.
I created the data frame:
df<-data.frame(
salesPerson=sample(c('Alan','Bob','Cindy'),20 ,replace=TRUE)
, quater=sample(c('Q1','Q2','Q3'),20 ,replace=TRUE)
,salesValue=runif(20, 5.0, 7.5)
)
I would like to add an additional column to the data frame with scaled values of salesValue.
To scale the whole column I can use this code:
df$salesValueScaled<-scale(df$salesValue)
The problem is that I would like to scale sales separately for each combination of the columns salesPerson and quater. Something like:
df$salesValueScaled<-scale(df$salesValue, by =c(df$salesPerson,df$quater))
I have been searching for this solution on this forum but I couldn't find a solution to this problem.
Thank you in advance for help.
You can use dplyr for this:
library(dplyr)
new_df <- df %>% group_by(salesPerson, quater) %>%
mutate(scaled_Col = scale(salesValue)) %>%
ungroup
Groups with a single row return NA, because the standard deviation of one value is undefined. To work around this, you can either keep the original values for those groups or filter them out before scaling:
Keeping the original values (scaling only groups with more than one row; if/else is used because ifelse() would return a single value per group here):
new_df <- df %>% group_by(salesPerson, quater) %>%
  mutate(scaled_Col = if (n() > 1) as.numeric(scale(salesValue)) else salesValue) %>%
  ungroup()
Filtering them out (as suggested by @steveb):
new_df <- df %>% group_by(salesPerson, quater) %>%
filter(n() > 1) %>%
mutate(scaled_Col = scale(salesValue)) %>%
ungroup
I hope this helps.
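For a deterministic check of the grouped scaling, here is a minimal sketch on made-up data (the question uses random samples, so these values are assumptions). It uses if/else with n() rather than ifelse(), since ifelse() returns a result with the length of its test, and wraps scale() in as.numeric() because scale() returns a one-column matrix.

```r
library(dplyr)

# Hypothetical data: Alan/Q1 has 3 rows, Bob/Q2 has 2, Cindy/Q3 has only 1
df <- data.frame(
  salesPerson = c("Alan", "Alan", "Alan", "Bob", "Bob", "Cindy"),
  quater      = c("Q1", "Q1", "Q1", "Q2", "Q2", "Q3"),
  salesValue  = c(5, 6, 7, 5.5, 6.5, 7)
)

new_df <- df %>%
  group_by(salesPerson, quater) %>%
  # scale within each group; leave single-row groups unchanged
  mutate(scaled_Col = if (n() > 1) as.numeric(scale(salesValue)) else salesValue) %>%
  ungroup()

print(new_df)
```

Within each multi-row group the scaled values have mean 0 and standard deviation 1, and the single Cindy/Q3 row keeps its original value instead of becoming NA.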
