Does anyone know how I can loop pipe-operator code through multiple dataframes?
I have quite a few dataframes named by year (df_1990, df_1991 ... df_2020). However, not all years are included (i.e. df_1993, df_2012 and 3 more years are not available). To account for this, I manually created a list to store all the data frames for the looping (do enlighten me if there's a faster way to do this).
df_list = list(df_1990, df_1991, ..., df_2020)
for (i in df_list) {
...
}
The dataframes are pretty simple, with just 2 columns: Item (character field) and Cost (numeric field).
Item     Cost
Book_A   3.00
Book_B   5.00
...      ...
Sample code for the dataframe:
df = structure(list(Item = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 2L,
3L, 1L, 2L, 1L, 2L, 1L, 3L, 1L, 2L, 2L, 1L, 3L, 1L), .Label = c("Book A",
"Book B", "Book C"), class = "factor"), Cost = c(5, 3.5, 12,
6, 8, 3, 6, 3.5, 3.8, 13, 5.1, 7, 11.5, 3.8, 5.5, 6.5, 13.5,
5.5, 3.5, 1.2)), class = "data.frame", row.names = c(NA, -20L
))
Does anyone know how I can add the following code into the ... portion of the for loop above? Thank you!
df %>%
group_by(Item) %>%
summarise(outlier = mean(Cost),
offset = outlier * 0.6,
higher_value = outlier + offset,
lower_value = outlier - offset) %>%
left_join(df, by = 'Item') %>%
transmute(Item, Cost, Outlier = ifelse(Cost < lower_value | Cost > higher_value, 'Y', 'N'))
The code basically detects outliers (e.g. flags a Cost that is more than 60% above or below the average for that particular Item) and outputs a column of "Y" or "N" for each row. (Credit for the code goes to Ronak Shah.)
Ideally the new column should appear in each dataframe of the list, to allow exporting to Excel format.
Thank you!
Personally I would move the data wrangling code into a function and then use lapply to loop over your list of data frames.
library(dplyr)
df_list <- list(df, df, df)
prep_data <- function(x) {
x %>%
group_by(Item) %>%
summarise(
outlier = mean(Cost),
offset = outlier * 0.6,
higher_value = outlier + offset,
lower_value = outlier - offset
) %>%
left_join(x, by = "Item") %>%
transmute(Item, Cost, Outlier = ifelse(Cost < lower_value | Cost > higher_value, "Y", "N"))
}
df_prep <- lapply(df_list, prep_data)
lapply(df_prep, head, 2)
#> [[1]]
#> # A tibble: 2 × 3
#> Item Cost Outlier
#> <fct> <dbl> <chr>
#> 1 Book A 5 N
#> 2 Book A 3.5 N
#>
#> [[2]]
#> # A tibble: 2 × 3
#> Item Cost Outlier
#> <fct> <dbl> <chr>
#> 1 Book A 5 N
#> 2 Book A 3.5 N
#>
#> [[3]]
#> # A tibble: 2 × 3
#> Item Cost Outlier
#> <fct> <dbl> <chr>
#> 1 Book A 5 N
#> 2 Book A 3.5 N
If you want to do it via a for loop, you could achieve the same result like so:
df_prep <- list()
for (i in seq_along(df_list)) {
df_prep[[i]] <- prep_data(df_list[[i]])
}
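Since the question also mentions exporting to Excel: a minimal sketch, assuming the writexl package is installed (writexl::write_xlsx writes each element of a named list of data frames to its own sheet):
library(writexl)

# Hypothetical sheet names -- adjust to the years you actually have
names(df_prep) <- c("df_1990", "df_1991", "df_2020")

# Writes one sheet per prepared data frame
write_xlsx(df_prep, "outliers_by_year.xlsx")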
Why don't you put all your data into one dataframe?
df_list = list(df_1990 = df_1990, df_1991 = df_1991, ..., df_2020 = df_2020)
df2 = dplyr::bind_rows(df_list, .id = 'Year')
then you only have to add the variable Year into the group_by statement:
group_by(Year, Item)
If you need to, you can always convert it back to a list of dataframes:
df2 %>%
tidyr::nest(data = Item:Cost) %>%
pull(data, name = Year)
By the way, you can also improve the outlier detection code by omitting the join:
df2 %>%
group_by(Year, Item) %>%
mutate(outlier = mean(Cost),
offset = outlier * 0.6,
higher_value = outlier + offset,
lower_value = outlier - offset) %>%
transmute(Item, Cost, Outlier = if_else(Cost < lower_value | Cost > higher_value, 'Y', 'N'))
Using mutate instead of summarise copies the result of mean(Cost) to every row of the group.
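For illustration, using the sample df from the question (dplyr loaded):
df %>% group_by(Item) %>% summarise(avg = mean(Cost))  # 3 rows: one per Item
df %>% group_by(Item) %>% mutate(avg = mean(Cost))     # 20 rows: avg repeated within each Item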
I'm currently using the "weightloss" dataset from the datarium package to start running an RMANOVA. Here is the dput:
dput(head(weightloss))
structure(list(id = structure(1:6, .Label = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10", "11", "12"), class = "factor"),
diet = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("no",
"yes"), class = "factor"), exercises = structure(c(1L, 1L,
1L, 1L, 1L, 1L), .Label = c("no", "yes"), class = "factor"),
t1 = c(10.43, 11.59, 11.35, 11.12, 9.5, 9.5), t2 = c(13.21,
10.66, 11.12, 9.5, 9.73, 12.74), t3 = c(11.59, 13.21, 11.35,
11.12, 12.28, 10.43)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
So this is the script I have come up with so far:
# Load Required Packages:
library(datarium)  # weightloss dataset
library(dplyr)     # %>%
library(tidyr)     # pivot_longer
library(ggplot2)   # ggplot
library(ggpubr)    # ggqqplot
library(rstatix)   # identify_outliers, shapiro_test

# Create Data Frame for Dataset:
weight <- weightloss
weight
# Pivot Longer Data to Create Factors and Scores:
weight <- weight %>%
pivot_longer(names_to = 'trial', # creates factor (x)
values_to = 'value', # creates value (y)
cols = t1:t3) # finds which cols to factor
# Plot Means in Boxplot:
ggplot(weight,
aes(x=trial,y=value))+
geom_boxplot()+
labs(title = "Trial Means") # As can be predicted, inc w/time
I get this pretty normal-looking boxplot:
Now it's time to find outliers and test for normality.
# Identify Outliers (Should be None Given Boxplot):
outlier <- weight %>%
group_by(trial) %>%
identify_outliers(value)
outlier_frame <- data.frame(outlier)
outlier_frame # none found :)
# Normality (Shapiro-Wilk and QQPlot):
model <- lm(value~trial,
data = weight) # creates model
shapiro_test(residuals(model)) # measures Shapiro
ggqqplot(residuals(model))+
labs(title = "QQ Plot of Residuals") # creates QQ
This again gives me a pretty normal QQplot:
I then facet-wrapped the QQ plot by trial:
ggqqplot(weight, "value", ggtheme = theme_bw())+
facet_wrap(~trial)+
labs(title = "QQPlot of Each Trial") #looks normal
And it comes out right from what I can tell:
However, when I try to do a Shapiro Wilk test by group, I keep having issues with this code:
shapiro_group <- weight %>%
group_by(trial) %>%
shapiro_test(value)
It gives me this error:
Error: Problem with `mutate()` column `data`.
i `data = map(.data$data, .f, ...)`.
x Must group by variables found in `.data`.
Column `variable` is not found.
I also tried this:
shapiro_test(weight, trial$value)
And get this error instead:
Error: Can't subset columns that don't exist.
x Column `trial$value` doesn't exist.
If anybody has some insight as to why, I would greatly appreciate it!
The reason you were getting an error with shapiro_test is that its implementation contains this line:
shapiro_test
function (data, ..., vars = NULL)
{
....
....
data <- data %>% gather(key = "variable", value = "value") %>%
filter(!is.na(value))
....
....
}
where it gets the data in long format using gather. Since you already have a column named value, this doesn't work.
If you rename the value column to anything else, it works.
library(dplyr)
library(rstatix)
weight %>%
rename(value1 = value) %>%
group_by(trial) %>%
shapiro_test(value1)
# trial variable statistic p
# <chr> <chr> <dbl> <dbl>
#1 t1 value1 0.869 0.222
#2 t2 value1 0.910 0.440
#3 t3 value1 0.971 0.897
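Alternatively, a minimal sketch that sidesteps the name clash entirely by calling base R's shapiro.test inside summarise (no renaming needed):
library(dplyr)

weight %>%
  group_by(trial) %>%
  summarise(statistic = shapiro.test(value)$statistic,  # W statistic per trial
            p = shapiro.test(value)$p.value)            # p value per trial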
I have df1 below. I would like to systematically calculate wilcox.test p values to test whether the variable is significantly higher/lower per color, as defined in the color1 and color2 columns.
I would like to test this for all samples, and per group in the column group.
I am hoping to create a new data.frame with the results, including the sample numbers per group (n). The anticipated result is shown in df2 below. Note, though, that the p values in df2 are made up as examples only.
df1 <- data.frame(
stringsAsFactors = FALSE,
sample = c(1L,2L,3L,4L,
5L,6L,7L,8L,9L,10L,11L,12L,13L,14L,15L,
16L,17L,18L,19L,20L,21L,22L,23L,24L,25L,
26L,27L,28L,29L,30L),
group = c("a","a","a",
"a","a","a","a","a","a","a","a","a","a",
"a","a","a","b","b","b","b","b","b","c",
"c","c","c","c","c","c","c"),
variable = c(5L,2L,4L,4L,
1L,3L,3L,5L,1L,7L,13L,9L,4L,4L,3L,12L,
0L,11L,1L,3L,0L,4L,5L,2L,6L,4L,6L,7L,5L,
3L),
color1 = c("black","white",
"white","black","black","white","white",
"black","black","black","black","black","white",
"white","black","white","black","white",
"black","white","black","white","white","white",
"black","white","black","black","white",
"black"),
color2 = c("red","blue",
"blue","blue","red","blue","blue","red","blue",
"red","red","blue","red","red","red",
"blue","blue","red","blue","red","red","blue",
"red","red","red","blue","red","blue","blue",
"blue")
)
df2 <- data.frame(
stringsAsFactors = FALSE,
group = c("all", "a", "b", "c"),
n = c(30L, 16L, 6L, 8L),
color1_pval = c(0.0485, 0.9641, 0.0832, 0.3882),
color2_pval = c(0.6727, 0.4121, 0.1282, 0.4344)
)
You can try:
library(dplyr)
df1 %>%
group_by(group) %>%
summarise(n = n(),
color1_pval = wilcox.test(variable[color1 == 'white'],
variable[color1 == 'black'])$p.value,
color2_pval = wilcox.test(variable[color2 == 'blue'],
variable[color2 == 'red'])$p.value)
# group n color1_pval color2_pval
# <chr> <int> <dbl> <dbl>
#1 a 16 0.556 0.457
#2 b 6 0.0765 0.825
#3 c 8 0.189 1
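The anticipated df2 also has an overall "all" row; one possible sketch is to wrap the summary in a small helper and bind the ungrouped and grouped results (assumes dplyr >= 1.0 for mutate(.before = )):
library(dplyr)

pvals <- function(d) {
  d %>%
    summarise(n = n(),
              color1_pval = wilcox.test(variable[color1 == 'white'],
                                        variable[color1 == 'black'])$p.value,
              color2_pval = wilcox.test(variable[color2 == 'blue'],
                                        variable[color2 == 'red'])$p.value)
}

bind_rows(df1 %>% pvals() %>% mutate(group = 'all', .before = 1),
          df1 %>% group_by(group) %>% pvals())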
I would like to run a function to test whether a value exists in a dataset. I've looked for answers and have since found a workaround, but it's not as neat, and I'm curious why my initial attempt failed.
Here's a simplified dataset
df <- structure(list(country = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("IRE","USA"),
class = "factor"), year = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1990", "1995", "2000"),
class = "factor")), class = "data.frame", row.names = c(NA, -6L))
> df
country year
1 IRE 1990
2 IRE 1995
3 IRE 2000
4 USA 1990
5 USA 1995
6 USA 2000
I would like my function to return a 1 if a particular country code and year are present or a 0 otherwise. This is the working code:
myFunc <- function(x,y){
p.ans <- df %>% filter(year == y)
ifelse(x %in% p.ans$country, 1, 0)
}
> myFunc("USA", 1995)
[1] 1
> myFunc("USA", 1997)
[1] 0
But why doesn't this alternative code work? Is there a variation of it that would?
myFunc <- function(x,y){
df %>% filter(year == y) %>% ifelse(x %in% country, 1, 0)
}
> myFunc("USA", 1997)
Error in ifelse(., x %in% country, 1, 0) : unused argument (0)
Thanks!
But why doesn't this alternative code work?
Because x %>% f(y) is the same as f(x, y). Thus the code you wrote is equal to
ifelse(filter(df, year == y), x %in% country, 1, 0)
… which is not how the ifelse function works.
Instead, you could write
df %>% filter(year == y) %>% pull(country) %>% {ifelse(x %in% ., 1, 0)}
Here you need to surround the ifelse function call with {…} to prevent the pipe from inserting the right-hand side as the first argument into the function call (we want to use x %in% . as the first argument rather than just .).
… or, if you’ve loaded the ‘magrittr’ package, you can use %$% instead of %>% pull(…) %>% {…}:
df %>% filter(year == y) %$% ifelse(x %in% country, 1, 0)
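Putting either fix back into the original function, a minimal sketch:
library(dplyr)

myFunc <- function(x, y) {
  df %>%
    filter(year == y) %>%
    pull(country) %>%
    {ifelse(x %in% ., 1, 0)}
}

myFunc("USA", 1995)  # 1
myFunc("USA", 1997)  # 0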
Assume I have a dataset with two columns, Location and Product, that shows how many of each product is sold at each location. I create a contingency table for the number of each product sold at each location:
data %>%
  group_by(Location, Product) %>%
  summarize(n = n()) %>%
  pivot_wider(names_from = Product, values_from = n)
Now, imagine that instead of a single Product column, I have US_Product, Japan_Product,..., Germany_Product. How can I create my contingency tables in a for loop?
NOTE: when I create a vector of products like p<-c("Product1", "Product2",..., "Product3") and loop through these products, I get an error message because these are strings and not variable names.
Here is a minimal example:
library(dplyr)  # tibble(), %>%, group_by()
library(tidyr)  # pivot_wider()

Location <- c("AB","ON","MN","AB","ON")
Product1<-c("Type1","Type2","Type1","Type3","Type1")
Product2<-c("Type3","Type2","Type3","Type3","Type2")
Product3<-c("Type1","Type2","Type1","Type1","Type1")
data <- tibble(Location,Product1,Product2,Product3)
data%>%
group_by(Location,Product1)%>%
summarize(n=n()) %>%
pivot_wider(names_from = Product1, values_from = n) #this works as expected
#now I want to do the same thing in a loop
prodV <- c("Product1","Product2","Product3")
for (i in c(1:3)){
var <- prodV[i]
data %>%
  group_by(Location, var) %>%  # fails: group_by() looks for a column literally named 'var'
  summarize(n = n()) %>%
  pivot_wider(names_from = var, values_from = n)
}
If we need to use it in a loop, then one option is map:
library(dplyr)
library(purrr)
library(tidyr)
map(p, ~
data%>%
group_by_at(vars("Location", .x)) %>%
summarize(n=n()) %>%
pivot_wider(names_from = .x, values_from = n))
Using a reproducible example
data(mtcars)
p <- c("cyl", "vs", "am")
map(p, ~
mtcars %>%
group_by_at(vars('gear', .x)) %>%
summarise(n = n()) %>%
pivot_wider(names_from = .x, values_from = n) )
Or, if we use a for loop: create an empty list to store the output from each iteration ('out'), loop over the 'p' values (here 'p' is the vector of column names, e.g. prodV from the question), and change only the .x part from map while assigning the output to each element of the 'out' list:
out <- vector('list', length(p))
names(out) <- p
for(p1 in p) {
out[[p1]] <- data %>%
group_by_at(vars("Location", p1)) %>%
summarize(n = n()) %>%
pivot_wider(names_from = p1, values_from = n)
}
Not sure if the following is the thing you are after. Below is a base R solution to make contingency tables:
p <- c("US_Product","Japan_product","Germany_Product")
res <- Map(function(x) table(df[c("Location",x)]),p)
such that
> res
$US_Product
US_Product
Location a b c
XX 2 0 1
YY 1 1 2
$Japan_product
Japan_product
Location d e f
XX 0 2 1
YY 3 0 1
$Germany_Product
Germany_Product
Location g i j
XX 0 3 0
YY 1 1 2
Dummy data:
df <- structure(list(Location = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
2L), .Label = c("XX", "YY"), class = "factor"), US_Product = structure(c(1L,
3L, 1L, 2L, 1L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"),
Japan_product = structure(c(2L, 2L, 3L, 3L, 1L, 1L, 1L), .Label = c("d",
"e", "f"), class = "factor"), Germany_Product = structure(c(2L,
2L, 2L, 2L, 3L, 1L, 3L), .Label = c("g", "i", "j"), class = "factor")), class = "data.frame", row.names = c(NA,
-7L))
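If plain data frames are preferred over the table objects, each contingency table in res can be converted afterwards, e.g.:
# Locations become row names; product types become columns
res_df <- lapply(res, as.data.frame.matrix)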
I was able to handle the problem using group_by_at as opposed to group_by. According to dplyr: whats the difference between group_by and group_by_ functions?, if one needs to pass inputs as quoted strings, the standard-evaluation (SE) versions of functions should be used instead of the non-standard-evaluation (NSE) versions; please see the link for a detailed explanation.
prodV <- c("Product1", "Product2", "Product3")
for (i in c(1:3)) {
  var <- prodV[i]
  a <- data %>%
    group_by_at(vars("Location", var)) %>%
    summarize(n = n()) %>%
    pivot_wider(names_from = var, values_from = n)
  print(a)
}
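As a side note, group_by_at is superseded in current dplyr; a sketch of the same loop using across(all_of()) (assumes dplyr >= 1.0):
library(dplyr)
library(tidyr)

for (var in prodV) {
  a <- data %>%
    group_by(across(all_of(c("Location", var)))) %>%
    summarize(n = n(), .groups = "drop") %>%
    pivot_wider(names_from = all_of(var), values_from = n)
  print(a)
}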
I have a data frame "A" with two columns:
the first has names of cities (unique values), and the second has NA values, which I want to fill with unemployment figures.
Data frame "B" has a column with city names, but the unemployment value isn't in the same row; to be precise, it is always 1 row below.
How would you merge these two datasets, so that R looks at the first column of data frame "A", finds its match in data frame "B", and replaces the NA in the second column of data frame "A" with the value 1 row below the row in which the match is made?
Here is a summarized version of what data frames A and B would look like.
names= c("Bogotá", "Medellín")
data_frame_A= as.data.frame(names, ncol=1)
colnames(data_frame_A)= "city"
data_frame_A$Unemployment = NA
data_frame_A
Data frame B looks something like this:
names= c("Bogotá", "life_exp","Unemployment","Medellín","life_exp","Unemployment")
data_frame_B= as.data.frame(names, ncol=1)
colnames(data_frame_B)= "city"
data_frame_B$column_20 = runif(6, 0.5, 0.8)
data_frame_B
How would you merge these two data frames, then?
Here's a method that checks if each city in data_frame_B is in data_frame_A to assign rows to each city. We make a new column that has the actual city name, and then we can spread the variables out into their own columns. You can join back on to data_frame_A after this if there are columns there that you need.
library(tidyverse)
data_frame_A <- structure(list(city = structure(1:2, .Label = c("Bogotá", "Medellín"), class = "factor"), Unemployment = c(NA, NA)), row.names = c(NA, -2L), class = "data.frame")
data_frame_B <- structure(list(city = structure(c(1L, 2L, 4L, 3L, 2L, 4L), .Label = c("Bogotá", "life_exp", "Medellín", "Unemployment"), class = "factor"), column_20 = c(0.653383622108959, 0.685130500583909, 0.616564040770754, 0.731770524056628, 0.53738643436227, 0.571727990615182)), row.names = c(NA, -6L), class = "data.frame")
data_frame_B %>%
group_by(city_id = cumsum(city %in% data_frame_A$city)) %>%
mutate(city_name = first(city)) %>%
filter(city_name != city) %>%
spread(city, column_20)
#> # A tibble: 2 x 4
#> # Groups: city_id [2]
#> city_id city_name life_exp Unemployment
#> <int> <fct> <dbl> <dbl>
#> 1 1 Bogotá 0.685 0.617
#> 2 2 Medellín 0.537 0.572
Created on 2019-04-22 by the reprex package (v0.2.1)
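The join back onto data_frame_A mentioned above could look like this; a sketch, assuming the pipeline's result is saved as result (dplyr will coerce the factor keys to character, with a warning):
result <- data_frame_B %>%
  group_by(city_id = cumsum(city %in% data_frame_A$city)) %>%
  mutate(city_name = first(city)) %>%
  filter(city_name != city) %>%
  spread(city, column_20) %>%
  ungroup() %>%
  select(-city_id)

data_frame_A %>%
  select(city) %>%
  left_join(result, by = c("city" = "city_name"))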
Setting the random seed in the Note at the end to make the data reproducible, we can use the following double left join:
library(sqldf)
sqldf("select a.city, b2.[column_20]
from [data_frame_A] as a
left join [data_frame_B] as b using(city)
left join [data_frame_B] as b2 on b2.rowid = b.rowid + 1")
giving:
city column_20
1 Bogotá 0.7364915
2 Medellín 0.7821402
Note
set.seed(123)
names= c("Bogotá", "Medellín")
data_frame_A= as.data.frame(names, ncol=1)
colnames(data_frame_A)= "city"
data_frame_A$Unemployment = NA
names= c("Bogotá", "life_exp","Unemployment","Medellín","life_exp","Unemployment")
data_frame_B= as.data.frame(names, ncol=1)
colnames(data_frame_B)= "city"
data_frame_B$column_20 = runif(6, 0.5, 0.8)
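For completeness, a base R sketch that follows the question's stated rule that the value sits exactly one row below the matching city name (with the dummy data_frame_B above, where a life_exp row intervenes, the offset would need to be idx + 2 instead):
idx <- match(data_frame_A$city, data_frame_B$city)
data_frame_A$Unemployment <- data_frame_B$column_20[idx + 1]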