Create a function with a column name as an argument (with grouping) - r

I want to write a function that takes a data frame, grouping a variable(a column) and a variable (also a column).
From reading multiple stakcflow attempts at cracking this, mostly recommended using arguments as strings.
My initial code to check for normality using the Shapiro-Wilk test for multiple data frames and variables was unsuccessful.
check_normality <- function(d, x_grouping_variable, y_cont_var){
d %>%
group_by([[x_grouping_variable]]) %>%
summarise(`W Statistic` = shapiro.test([[y_cont_var]])$statistic,
`p-value` = shapiro.test([[y_cont_var]])$p.value)
return(shapiro.test([[y_cont_var]])$p.value)
}
ERROR:
Error: unexpected '[[' in " return(shapiro.test([["
My attempt to fix it using this code was also unsuccessful.
check_normality <- function(d, x_grouping_variable, y_cont_var){
d %>%
group_by(((!! sym(x_grouping_variable)))) %>%
summarise(`W Statistic` = shapiro.test((!! sym(y_cont_var)))$statistic,
`p-value` = shapiro.test((!! sym(y_cont_var)))$p.value)
return(shapiro.test((!! sym(y_cont_var)))$p.value)
}
check_normality(df, "RHF", "duratoin_days")
the error :
Error in !sym(y_cont_var) : invalid argument type
3.stopifnot(is.numeric(x))
2.shapiro.test((!!sym(y_cont_var)))
1.check_normality(df, "RHF", "duratoin_days")

Related

How *not* to remove entire case from analysis, using aov_car

I'm running an ANOVA with:
within: Session (Pre vs. Post)
within: Condition (A, B, C)
between: Group (Female, Male)
Three participants are missing all of 'C' (pre and post). I don't want to completely exclude them from my analyses because I think their 'A' and 'B' data is still interesting. I have tried including na.rm=TRUE to my script, and to no avail. Is there any way that I can run my aov_car (mixed-design ANOVA) without completely remove all the data from these three participants?
I keep getting the following error: Contrasts set to contr.sum for the following variables: Group. Warning message: Missing values for following ID(s): P20, R21, R22. Removing those cases from the analysis.
Sample data (note, it's fudged/randomized data here):
my_data <- readr::read_csv("PID,Session,Condition,Group,data
P1,Pre,A,Female,0.935147485
P2,Pre,A,Female,0.290449952
P3,Pre,A,Female,0.652213856
P4,Pre,A,Female,0.349222763
P5,Pre,A,Female,0.235789135
P6,Pre,A,Female,0.268469251
P7,Pre,A,Female,0.419284033
P8,Pre,A,Female,0.797236877
P9,Pre,A,Female,0.784526027
P10,Pre,A,Female,0.44837527
P11,Pre,A,Female,0.359525572
P12,Pre,A,Male,0.923775343
P13,Pre,A,Male,0.431557872
P14,Pre,A,Male,0.425703913
P15,Pre,A,Male,0.39916012
P16,Pre,A,Male,0.168378348
P17,Pre,A,Male,0.260462544
P18,Pre,A,Male,0.945835896
P19,Pre,A,Male,0.495932288
P20,Pre,A,Male,0.045565042
P21,Pre,A,Male,0.748259161
P22,Pre,A,Male,0.426588091
P1,Pre,B,Female,0.761677517
P2,Pre,B,Female,0.985953719
P3,Pre,B,Female,0.657063156
P4,Pre,B,Female,0.166859072
P5,Pre,B,Female,0.850201269
P6,Pre,B,Female,0.227918183
P7,Pre,B,Female,0.701946655
P8,Pre,B,Female,0.079116861
P9,Pre,B,Female,0.094935181
P10,Pre,B,Female,0.376525478
P11,Pre,B,Female,0.725431114
P12,Pre,B,Male,0.922099723
P13,Pre,B,Male,0.664993697
P14,Pre,B,Male,0.450501356
P15,Pre,B,Male,0.201276143
P16,Pre,B,Male,0.735428897
P17,Pre,B,Male,0.304752274
P18,Pre,B,Male,0.393020637
P19,Pre,B,Male,0.452345203
P20,Pre,B,Male,0.697709526
P21,Pre,B,Male,0.130459291
P22,Pre,B,Male,0.210211859
P1,Pre,C,Female,0.280820754
P2,Pre,C,Female,0.206499238
P3,Pre,C,Female,0.127540559
P4,Pre,C,Female,0.001998028
P5,Pre,C,Female,0.554408227
P6,Pre,C,Female,0.235435708
P7,Pre,C,Female,0.341077362
P8,Pre,C,Female,0.101103042
P9,Pre,C,Female,0.834297025
P10,Pre,C,Female,0.256605011
P11,Pre,C,Female,0.65647746
P12,Pre,C,Male,0.110716441
P13,Pre,C,Male,0.075856866
P14,Pre,C,Male,0.518357132
P15,Pre,C,Male,0.222078883
P16,Pre,C,Male,0.414747048
P17,Pre,C,Male,0.525522894
P18,Pre,C,Male,0.758019496
P19,Pre,C,Male,0.213927508
P20,Pre,C,Male,
P21,Pre,C,Male,
P22,Pre,C,Male,
P1,Post,A,Female,0.435204978
P2,Post,A,Female,0.681378597
P3,Post,A,Female,0.928158111
P4,Post,A,Female,0.525061816
P5,Post,A,Female,0.46271948
P6,Post,A,Female,0.649810342
P7,Post,A,Female,0.748819476
P8,Post,A,Female,0.207494638
P9,Post,A,Female,0.060148769
P10,Post,A,Female,0.074998663
P11,Post,A,Female,0.177396477
P12,Post,A,Male,0.61446322
P13,Post,A,Male,0.367348586
P14,Post,A,Male,0.853124208
P15,Post,A,Male,0.268734518
P16,Post,A,Male,0.784226481
P17,Post,A,Male,0.892830959
P18,Post,A,Male,0.950081146
P19,Post,A,Male,0.731274982
P20,Post,A,Male,0.901554267
P21,Post,A,Male,0.170960222
P22,Post,A,Male,0.2337913
P1,Post,B,Female,0.940130538
P2,Post,B,Female,0.575209304
P3,Post,B,Female,0.84527559
P4,Post,B,Female,0.160605498
P5,Post,B,Female,0.547844182
P6,Post,B,Female,0.287795345
P7,Post,B,Female,0.010274473
P8,Post,B,Female,0.408166731
P9,Post,B,Female,0.562733542
P10,Post,B,Female,0.44217795
P11,Post,B,Female,0.390071799
P12,Post,B,Male,0.767768344
P13,Post,B,Male,0.548800315
P14,Post,B,Male,0.489825627
P15,Post,B,Male,0.783939035
P16,Post,B,Male,0.772595033
P17,Post,B,Male,0.252895712
P18,Post,B,Male,0.383513642
P19,Post,B,Male,0.709882712
P20,Post,B,Male,0.517304459
P21,Post,B,Male,0.77186642
P22,Post,B,Male,0.395415627
P1,Post,C,Female,0.649783292
P2,Post,C,Female,0.490853459
P3,Post,C,Female,0.467705056
P4,Post,C,Female,0.988740552
P5,Post,C,Female,0.413980642
P6,Post,C,Female,0.83941706
P7,Post,C,Female,0.111722237
P8,Post,C,Female,0.501984852
P9,Post,C,Female,0.15634255
P10,Post,C,Female,0.547770503
P11,Post,C,Female,0.576203944
P12,Post,C,Male,0.857518274
P13,Post,C,Male,0.176794297
P14,Post,C,Male,0.127501287
P15,Post,C,Male,0.831191664
P16,Post,C,Male,0.257022941
P17,Post,C,Male,0.295366754
P18,Post,C,Male,0.113785049
P19,Post,C,Male,0.621389037
P20,Post,C,Male,
P21,Post,C,Male,
P22,Post,C,Male,")
Current Code :
library(tidyverse)
library(car)
library(afex)
library(emmeans)
my_anova <-aov_car(data ~ Group*Session*Condition
+ Error(PID/Session*Condition), na.rm = TRUE,
data=my_data)
I've also tried:
my_anova2 <- aov_ez("PID", "data",
my_data,
within = c("Session", "Condition"),
between = "Group", na.rm=TRUE)

Error while using lapply() function to perform logistic regression (undefined columns selected)

So the data is retrieved here https://archive.ics.uci.edu/ml/machine-learning-databases/00497/divorce.rar
When I run the code to perform logistic regression, it shows an error. But it runs perfectly on others R program. Is there anything that I have missed out?
set.seed(123)
divorce = read.csv("C://Users//User//Documents//Y2S3//Predictive Modelling//divorce//divorce.csv")
dim(divorce)
Outcome: [1] 170 1
summary(divorce)
Outcome:
Atr1.Atr2.Atr3.Atr4.Atr5.Atr6.Atr7.Atr8.Atr9.Atr10.Atr11.Atr12.Atr13.Atr14.Atr15.Atr16.Atr17.Atr18.Atr19.Atr20.Atr21.Atr22.Atr23.Atr24.Atr25.Atr26.Atr27.Atr28.Atr29.Atr30.Atr31.Atr32.Atr33.Atr34.Atr35.Atr36.Atr37.Atr38.Atr39.Atr40.Atr41.Atr42.Atr43.Atr44.Atr45.Atr46.Atr47.Atr48.Atr49.Atr50.Atr51.Atr52.Atr53.Atr54.Class
Length:170
Class :character
Mode :character
colnames(divorce)
Outcome:
[1] "Atr1.Atr2.Atr3.Atr4.Atr5.Atr6.Atr7.Atr8.Atr9.Atr10.Atr11.Atr12.Atr13.Atr14.Atr15.Atr16.Atr17.Atr18.Atr19.Atr20.Atr21.Atr22.Atr23.Atr24.Atr25.Atr26.Atr27.Atr28.Atr29.Atr30.Atr31.Atr32.Atr33.Atr34.Atr35.Atr36.Atr37.Atr38.Atr39.Atr40.Atr41.Atr42.Atr43.Atr44.Atr45.Atr46.Atr47.Atr48.Atr49.Atr50.Atr51.Atr52.Atr53.Atr54.Class"
sapply(divorce,class)
Outcome:
Atr1.Atr2.Atr3.Atr4.Atr5.Atr6.Atr7.Atr8.Atr9.Atr10.Atr11.Atr12.Atr13.Atr14.Atr15.Atr16.Atr17.Atr18.Atr19.Atr20.Atr21.Atr22.Atr23.Atr24.Atr25.Atr26.Atr27.Atr28.Atr29.Atr30.Atr31.Atr32.Atr33.Atr34.Atr35.Atr36.Atr37.Atr38.Atr39.Atr40.Atr41.Atr42.Atr43.Atr44.Atr45.Atr46.Atr47.Atr48.Atr49.Atr50.Atr51.Atr52.Atr53.Atr54.Class
"character"
col_fac = c("Atr1","Atr2","Atr3","Atr4","Atr5","Atr6","Atr7","Atr8","Atr9","Atr10",
+"Atr11","Atr12","Atr13","Atr14","Atr15","Atr16","Atr17","Atr18","Atr19","Atr20",
+"Atr21","Atr22","Atr23","Atr24","Atr25","Atr26","Atr27","Atr28","Atr29","Atr30",
+"Atr31","Atr32","Atr33","Atr34","Atr35","Atr36","Atr37","Atr38","Atr39","Atr40",
+"Atr41","Atr42","Atr43","Atr44","Atr45","Atr46","Atr47","Atr48","Atr49","Atr50",
+"Atr51","Atr52","Atr53","Atr54","Class")
divorce[col_fac] = lapply(divorce[col_fac],factor)
Outcome: Error in [.data.frame(divorce, col_fac) : undefined columns selected)
The only issue is that you read a file that is separated by ";" and not ",". The sep = ";" will solve the issue.
# downloaded and extracted from https://archive.ics.uci.edu/ml/machine-learning-databases/00497/divorce.rar
divorce <- read.csv("./divorce.csv", sep = ";")
dim(divorce)
summary(divorce)
colnames(divorce)
sapply(divorce,class)
col_fac = c("Atr1","Atr2","Atr3","Atr4","Atr5","Atr6","Atr7","Atr8","Atr9","Atr10",
"Atr11","Atr12","Atr13","Atr14","Atr15","Atr16","Atr17","Atr18","Atr19","Atr20",
"Atr21","Atr22","Atr23","Atr24","Atr25","Atr26","Atr27","Atr28","Atr29","Atr30",
"Atr31","Atr32","Atr33","Atr34","Atr35","Atr36","Atr37","Atr38","Atr39","Atr40",
"Atr41","Atr42","Atr43","Atr44","Atr45","Atr46","Atr47","Atr48","Atr49","Atr50",
"Atr51","Atr52","Atr53","Atr54","Class")
divorce[col_fac] = lapply(divorce[col_fac],factor)
less error prone version using dplyr
The following will mutate your dataset by applying the function as.factor across those variables where the function is.numeric returns TRUE.
Note that the functions passed within across and where do not get the usual parenthesis.
library(dplyr)
divorce <- read.csv("./divorce.csv", sep = ";") %>%
mutate(across(where(is.numeric), as.factor))
glimpse(divorce)
For detailed info on mutate across, type ?across in R Console.

Error in UseMethod("select_") : no applicable method for 'select_' applied to an object of class "character"

I am trying to extract some columns from the data which is result of analysis. The data is composed of 592 rows and 20 variables.
When I run the code as below, I got the error message
"Error in UseMethod("select_") : no applicable method for 'select_' applied to an object of class "character" "
unused_cols <- -c(2:9)
pvals_long <- pvals %>%
map(function(x){
x <- x %>%
dplyr::select(unused_cols) %>%
gather(key = "celltype_pair", value = "pvalue", -interacting_pair)
x
})
Thanks in advance,
map is not needed. Mapping on a dataframe means that you are trying to apply your function on each column. However, select expects a dataframe, while in your code it gets a vector. That's what the error is telling you.
unused_cols <- -c(2:9) will not work. Put the -in the call to select.
Try this:
unused_cols <- c(2:9)
pvals_long <- pvals %>%
select(-unused_cols) %>%
gather(key = "celltype_pair", value = "pvalue", -interacting_pair)

How to extract columns from a row and save the output as a variable dplyr

I am trying to extract a specific column from a specific row on my excel sheet (df). However, when I try to do so I get the message:
Error: ... must evaluate to column positions or names, not a list
Call `rlang::last_error()` to see a backtrace.
When I call rlang::last_error() I get:
Backtrace:
1. dplyr::select(., FGA, FTA, TOV, MP, TmFga, TmFta, TmTov, TmMin)
9. tidyselect::vars_select(tbl_vars(.data), !!!enquos(...))
10. tidyselect:::bad_calls(bad, "must evaluate to { singular(.vars) } positions or names, \\\n not { first_type }")
11. tidyselect:::glubort(fmt_calls(calls), ..., .envir = .envir)
12. dplyr::select(., FGA, FTA, TOV, MP, TmFga, TmFta, TmTov, TmMin)
At this point, I am lost. What can I do to my code to work?
library(readxl)
Lakers_Overall_Stats <- read_excel("Desktop/Lakers Overall Stats.xlsx")
library(readxl)
Lakers_Record <- read_excel("Desktop/Lakers Record.xlsx")
require(dplyr)
require(ggplot2)
##WinPercentage of the Team after season
mydata <- Lakers_Record %>% select(Pts,Opp,W,L)%>%
+ mutate(wpct=Pts^13.91/(Pts^13.91+Opp^13.91),expwin=round(wpct*(W+L)),diff=W-expwin)
head(mydata)
##Specifiying
Lakers_Overall_Stats[23,6] <- TmMin
Lakers_Overall_Stats[23,8] <- TmFga
Lakers_Overall_Stats[23,18] <- TmFta
Lakers_Overall_Stats[23,26] <- TmTov
rlang::last_error()
##Usage Percentage
Usgpct <- Lakers_Overall_Stats %>% select(FGA,FTA,TOV,MP,TmFga,TmFta,TmTov,TmMin)%>%
+ mutate(100*(Fga+0.44*Fta+Tov))*TmMin/(TmFga+0.44*TmFta+TmTov)*5(MP)
##head(Usgpct)
##filter(rank(desc(Usgpct))==1)
Also, am I filtering correctly? or should it be written as
Usgpct <- Lakers_Overall_Stats %>% select(FGA,FTA,TOV,MP,TmFga,TmFta,TmTov,TmMin)%>%
filter(rank(desc(Usgpct))==1)%>%
mutate(100*(Fga+0.44*Fta+Tov))*TmMin/(TmFga+0.44*TmFta+TmTov)*5(MP)
head(Usgpct)
You have
Lakers_Overall_Stats[23,6] <- TmMin
This will modify the Lakers_Overall_Stats data frame by setting the element at 23,6 etc. to be TmMin. TmMin is an object outside of your data frame.
Maybe you want:
TmMin <- Lakers_Overall_Stats[23,6]
?
Also, you cannot select TmFga,TmFta,TmTov,TmMin since these variables are not part of your data frame. You can refer to those variables in your mutate equation, but because of the way you've set it up, they're stand-alone variables.

Nestled Loop not Working to gather data from NOAA

I'm using the R package rnoaa(along with it required other packages) to gather historical weather data. I wrote this nestled loop to gather all the data sets but I keep getting errors when I run it. It seems to run for a second fine
The loop:
require('triebeard')
require('bindr')
require('colorspace')
require('mime')
require('curl')
require('openssl')
require('R6')
require('urltools')
require('httpcode')
require('stringr')
require('assertthat')
require('bindrcpp')
require('glue')
require('magrittr')
require('pkgconfig')
require('rlang')
require('Rcpp')
require('BH')
require('plogr')
require('purrr')
require('stringi')
require('tidyselect')
require('digest')
require('gtable')
require('plyr')
require('reshape2')
require('lazyeval')
require('RColorBrewer')
require('dichromat')
require('munsell')
require('labeling')
require('viridisLite')
require('data.table')
require('rjson')
require('httr')
require('crul')
require('lubridate')
require('dplyr')
require('tidyr')
require('ggplot2')
require('scales')
require('XML')
require('xml2')
require('jsonlite')
require('rappdirs')
require('gridExtra')
require('tibble')
require('isdparser')
require('geonames')
require('hoardr')
require('rnoaa')
install.package('ncdf4')
install.packages("devtools")
library(devtools)
install_github("rnoaa", "ropensci")
library(rnoaa)
list <- buoys(dataset='wlevel')
lid <- data.frame(list$id)
foo <- for(range in 1990:2017){
for(bid in lid){
bid_range <- buoy(dataset = 'wlevel', buoyid = bid, year = range)
bid.year.data <- data.frame(bid.year$data)
write.csv(bid.year.data, file='cwind/bid_range.csv')
}
}
The response:
Using c1990.nc
Using
Error: length(url) == 1 is not TRUE
It saves the first data-set but it does not apply the for in the file name it just names it bid_range.csv.
This error message shows that there are no any data of a given station id in 1990. Because you were using for loop, once it gots an error, it stops.
Here I introduce the use of tidyverse to download the NOAA buoy data. A lot of the following functions are from the purrr package, which is part of the tidyverse.
# Load packages
library(tidyverse)
library(rnoaa)
Step 1: Create a "Grid" containing all combination of id and year
The expand function from tidyr can create the combination of different values.
data_list <- buoys(dataset = 'wlevel')
data_list2 <- data_list %>%
select(id) %>%
expand(id, year = 1990:2017)
Step 2: Create a "safe" version that does not break when there is no data.
Also make this function suitable for the map2 function
Because we will use map2 to loop through all the combination of id and year using the map2 function by its .x and .y argument. We modified the sequence of argument to create buoy_modify. We also use the safely function to create a safe version of buoy_modify. Now when it meets error, it will store the error message and moves to the next one rather than breaks.
# Modify the buoy function
buoy_modify <- function(buoyid, year, dataset, ...){
buoy(dataset, buoyid = buoyid, year = year, ...)
}
# Creare a safe version of buoy_modify
buoy_safe <- safely(buoy_modify)
Step 3: Apply the buoy_safe function
wlevel_data <- map2(data_list2$id, data_list2$year, buoy_safe, dataset = "wlevel")
# Assign name for the element in the list based on id and year
names(wlevel_data) <- paste(data_list2$id, data_list2$year, sep = "_")
After this step, all the data were downloaded in wlevel_data. Each element in wlevel_data has two parts. $result shows the data if the download is successful, otherwise, it shows NULL. $error shows NULL if the download is successful, otherwise, it shows the error message.
Step 4: Access the data
transpose can turn a list "inside out". So now wlevel_data2 has two elements: result and error. We can store these two and access the data.
# Turn the list "inside out"
wlevel_data2 <- transpose(wlevel_data)
# Get the error message
wlevel_error <- wlevel_data2$error
# Get he result
wlevel_result <- wlevel_data2$result
# Remove NULL element in wlevel_result
wlevel_result2 <- wlevel_result[!map_lgl(wlevel_result, is.null)]

Resources