Create a function with a column name as an argument (with grouping) - r
I want to write a function that takes a data frame, grouping a variable(a column) and a variable (also a column).
From reading multiple stakcflow attempts at cracking this, mostly recommended using arguments as strings.
My initial code to check for normality using the Shapiro-Wilk test for multiple data frames and variables was unsuccessful.
check_normality <- function(d, x_grouping_variable, y_cont_var){
d %>%
group_by([[x_grouping_variable]]) %>%
summarise(`W Statistic` = shapiro.test([[y_cont_var]])$statistic,
`p-value` = shapiro.test([[y_cont_var]])$p.value)
return(shapiro.test([[y_cont_var]])$p.value)
}
ERROR:
Error: unexpected '[[' in " return(shapiro.test([["
My attempt to fix it using this code was also unsuccessful.
check_normality <- function(d, x_grouping_variable, y_cont_var){
d %>%
group_by(((!! sym(x_grouping_variable)))) %>%
summarise(`W Statistic` = shapiro.test((!! sym(y_cont_var)))$statistic,
`p-value` = shapiro.test((!! sym(y_cont_var)))$p.value)
return(shapiro.test((!! sym(y_cont_var)))$p.value)
}
check_normality(df, "RHF", "duratoin_days")
the error :
Error in !sym(y_cont_var) : invalid argument type
3.stopifnot(is.numeric(x))
2.shapiro.test((!!sym(y_cont_var)))
1.check_normality(df, "RHF", "duratoin_days")
Related
How *not* to remove entire case from analysis, using aov_car
I'm running an ANOVA with: within: Session (Pre vs. Post) within: Condition (A, B, C) between: Group (Female, Male) Three participants are missing all of 'C' (pre and post). I don't want to completely exclude them from my analyses because I think their 'A' and 'B' data is still interesting. I have tried including na.rm=TRUE to my script, and to no avail. Is there any way that I can run my aov_car (mixed-design ANOVA) without completely remove all the data from these three participants? I keep getting the following error: Contrasts set to contr.sum for the following variables: Group. Warning message: Missing values for following ID(s): P20, R21, R22. Removing those cases from the analysis. Sample data (note, it's fudged/randomized data here): my_data <- readr::read_csv("PID,Session,Condition,Group,data P1,Pre,A,Female,0.935147485 P2,Pre,A,Female,0.290449952 P3,Pre,A,Female,0.652213856 P4,Pre,A,Female,0.349222763 P5,Pre,A,Female,0.235789135 P6,Pre,A,Female,0.268469251 P7,Pre,A,Female,0.419284033 P8,Pre,A,Female,0.797236877 P9,Pre,A,Female,0.784526027 P10,Pre,A,Female,0.44837527 P11,Pre,A,Female,0.359525572 P12,Pre,A,Male,0.923775343 P13,Pre,A,Male,0.431557872 P14,Pre,A,Male,0.425703913 P15,Pre,A,Male,0.39916012 P16,Pre,A,Male,0.168378348 P17,Pre,A,Male,0.260462544 P18,Pre,A,Male,0.945835896 P19,Pre,A,Male,0.495932288 P20,Pre,A,Male,0.045565042 P21,Pre,A,Male,0.748259161 P22,Pre,A,Male,0.426588091 P1,Pre,B,Female,0.761677517 P2,Pre,B,Female,0.985953719 P3,Pre,B,Female,0.657063156 P4,Pre,B,Female,0.166859072 P5,Pre,B,Female,0.850201269 P6,Pre,B,Female,0.227918183 P7,Pre,B,Female,0.701946655 P8,Pre,B,Female,0.079116861 P9,Pre,B,Female,0.094935181 P10,Pre,B,Female,0.376525478 P11,Pre,B,Female,0.725431114 P12,Pre,B,Male,0.922099723 P13,Pre,B,Male,0.664993697 P14,Pre,B,Male,0.450501356 P15,Pre,B,Male,0.201276143 P16,Pre,B,Male,0.735428897 P17,Pre,B,Male,0.304752274 P18,Pre,B,Male,0.393020637 P19,Pre,B,Male,0.452345203 P20,Pre,B,Male,0.697709526 P21,Pre,B,Male,0.130459291 P22,Pre,B,Male,0.210211859 P1,Pre,C,Female,0.280820754 P2,Pre,C,Female,0.206499238 P3,Pre,C,Female,0.127540559 P4,Pre,C,Female,0.001998028 P5,Pre,C,Female,0.554408227 P6,Pre,C,Female,0.235435708 P7,Pre,C,Female,0.341077362 P8,Pre,C,Female,0.101103042 P9,Pre,C,Female,0.834297025 P10,Pre,C,Female,0.256605011 P11,Pre,C,Female,0.65647746 P12,Pre,C,Male,0.110716441 P13,Pre,C,Male,0.075856866 P14,Pre,C,Male,0.518357132 P15,Pre,C,Male,0.222078883 P16,Pre,C,Male,0.414747048 P17,Pre,C,Male,0.525522894 P18,Pre,C,Male,0.758019496 P19,Pre,C,Male,0.213927508 P20,Pre,C,Male, P21,Pre,C,Male, P22,Pre,C,Male, P1,Post,A,Female,0.435204978 P2,Post,A,Female,0.681378597 P3,Post,A,Female,0.928158111 P4,Post,A,Female,0.525061816 P5,Post,A,Female,0.46271948 P6,Post,A,Female,0.649810342 P7,Post,A,Female,0.748819476 P8,Post,A,Female,0.207494638 P9,Post,A,Female,0.060148769 P10,Post,A,Female,0.074998663 P11,Post,A,Female,0.177396477 P12,Post,A,Male,0.61446322 P13,Post,A,Male,0.367348586 P14,Post,A,Male,0.853124208 P15,Post,A,Male,0.268734518 P16,Post,A,Male,0.784226481 P17,Post,A,Male,0.892830959 P18,Post,A,Male,0.950081146 P19,Post,A,Male,0.731274982 P20,Post,A,Male,0.901554267 P21,Post,A,Male,0.170960222 P22,Post,A,Male,0.2337913 P1,Post,B,Female,0.940130538 P2,Post,B,Female,0.575209304 P3,Post,B,Female,0.84527559 P4,Post,B,Female,0.160605498 P5,Post,B,Female,0.547844182 P6,Post,B,Female,0.287795345 P7,Post,B,Female,0.010274473 P8,Post,B,Female,0.408166731 P9,Post,B,Female,0.562733542 P10,Post,B,Female,0.44217795 P11,Post,B,Female,0.390071799 P12,Post,B,Male,0.767768344 P13,Post,B,Male,0.548800315 P14,Post,B,Male,0.489825627 P15,Post,B,Male,0.783939035 P16,Post,B,Male,0.772595033 P17,Post,B,Male,0.252895712 P18,Post,B,Male,0.383513642 P19,Post,B,Male,0.709882712 P20,Post,B,Male,0.517304459 P21,Post,B,Male,0.77186642 P22,Post,B,Male,0.395415627 P1,Post,C,Female,0.649783292 P2,Post,C,Female,0.490853459 P3,Post,C,Female,0.467705056 P4,Post,C,Female,0.988740552 P5,Post,C,Female,0.413980642 P6,Post,C,Female,0.83941706 P7,Post,C,Female,0.111722237 P8,Post,C,Female,0.501984852 P9,Post,C,Female,0.15634255 P10,Post,C,Female,0.547770503 P11,Post,C,Female,0.576203944 P12,Post,C,Male,0.857518274 P13,Post,C,Male,0.176794297 P14,Post,C,Male,0.127501287 P15,Post,C,Male,0.831191664 P16,Post,C,Male,0.257022941 P17,Post,C,Male,0.295366754 P18,Post,C,Male,0.113785049 P19,Post,C,Male,0.621389037 P20,Post,C,Male, P21,Post,C,Male, P22,Post,C,Male,") Current Code : library(tidyverse) library(car) library(afex) library(emmeans) my_anova <-aov_car(data ~ Group*Session*Condition + Error(PID/Session*Condition), na.rm = TRUE, data=my_data) I've also tried: my_anova2 <- aov_ez("PID", "data", my_data, within = c("Session", "Condition"), between = "Group", na.rm=TRUE)
Error while using lapply() function to perform logistic regression (undefined columns selected)
So the data is retrieved here https://archive.ics.uci.edu/ml/machine-learning-databases/00497/divorce.rar When I run the code to perform logistic regression, it shows an error. But it runs perfectly on others R program. Is there anything that I have missed out? set.seed(123) divorce = read.csv("C://Users//User//Documents//Y2S3//Predictive Modelling//divorce//divorce.csv") dim(divorce) Outcome: [1] 170 1 summary(divorce) Outcome: Atr1.Atr2.Atr3.Atr4.Atr5.Atr6.Atr7.Atr8.Atr9.Atr10.Atr11.Atr12.Atr13.Atr14.Atr15.Atr16.Atr17.Atr18.Atr19.Atr20.Atr21.Atr22.Atr23.Atr24.Atr25.Atr26.Atr27.Atr28.Atr29.Atr30.Atr31.Atr32.Atr33.Atr34.Atr35.Atr36.Atr37.Atr38.Atr39.Atr40.Atr41.Atr42.Atr43.Atr44.Atr45.Atr46.Atr47.Atr48.Atr49.Atr50.Atr51.Atr52.Atr53.Atr54.Class Length:170 Class :character Mode :character colnames(divorce) Outcome: [1] "Atr1.Atr2.Atr3.Atr4.Atr5.Atr6.Atr7.Atr8.Atr9.Atr10.Atr11.Atr12.Atr13.Atr14.Atr15.Atr16.Atr17.Atr18.Atr19.Atr20.Atr21.Atr22.Atr23.Atr24.Atr25.Atr26.Atr27.Atr28.Atr29.Atr30.Atr31.Atr32.Atr33.Atr34.Atr35.Atr36.Atr37.Atr38.Atr39.Atr40.Atr41.Atr42.Atr43.Atr44.Atr45.Atr46.Atr47.Atr48.Atr49.Atr50.Atr51.Atr52.Atr53.Atr54.Class" sapply(divorce,class) Outcome: Atr1.Atr2.Atr3.Atr4.Atr5.Atr6.Atr7.Atr8.Atr9.Atr10.Atr11.Atr12.Atr13.Atr14.Atr15.Atr16.Atr17.Atr18.Atr19.Atr20.Atr21.Atr22.Atr23.Atr24.Atr25.Atr26.Atr27.Atr28.Atr29.Atr30.Atr31.Atr32.Atr33.Atr34.Atr35.Atr36.Atr37.Atr38.Atr39.Atr40.Atr41.Atr42.Atr43.Atr44.Atr45.Atr46.Atr47.Atr48.Atr49.Atr50.Atr51.Atr52.Atr53.Atr54.Class "character" col_fac = c("Atr1","Atr2","Atr3","Atr4","Atr5","Atr6","Atr7","Atr8","Atr9","Atr10", +"Atr11","Atr12","Atr13","Atr14","Atr15","Atr16","Atr17","Atr18","Atr19","Atr20", +"Atr21","Atr22","Atr23","Atr24","Atr25","Atr26","Atr27","Atr28","Atr29","Atr30", +"Atr31","Atr32","Atr33","Atr34","Atr35","Atr36","Atr37","Atr38","Atr39","Atr40", +"Atr41","Atr42","Atr43","Atr44","Atr45","Atr46","Atr47","Atr48","Atr49","Atr50", +"Atr51","Atr52","Atr53","Atr54","Class") divorce[col_fac] = lapply(divorce[col_fac],factor) Outcome: Error in [.data.frame(divorce, col_fac) : undefined columns selected)
The only issue is that you read a file that is separated by ";" and not ",". The sep = ";" will solve the issue. # downloaded and extracted from https://archive.ics.uci.edu/ml/machine-learning-databases/00497/divorce.rar divorce <- read.csv("./divorce.csv", sep = ";") dim(divorce) summary(divorce) colnames(divorce) sapply(divorce,class) col_fac = c("Atr1","Atr2","Atr3","Atr4","Atr5","Atr6","Atr7","Atr8","Atr9","Atr10", "Atr11","Atr12","Atr13","Atr14","Atr15","Atr16","Atr17","Atr18","Atr19","Atr20", "Atr21","Atr22","Atr23","Atr24","Atr25","Atr26","Atr27","Atr28","Atr29","Atr30", "Atr31","Atr32","Atr33","Atr34","Atr35","Atr36","Atr37","Atr38","Atr39","Atr40", "Atr41","Atr42","Atr43","Atr44","Atr45","Atr46","Atr47","Atr48","Atr49","Atr50", "Atr51","Atr52","Atr53","Atr54","Class") divorce[col_fac] = lapply(divorce[col_fac],factor) less error prone version using dplyr The following will mutate your dataset by applying the function as.factor across those variables where the function is.numeric returns TRUE. Note that the functions passed within across and where do not get the usual parenthesis. library(dplyr) divorce <- read.csv("./divorce.csv", sep = ";") %>% mutate(across(where(is.numeric), as.factor)) glimpse(divorce) For detailed info on mutate across, type ?across in R Console.
Error in UseMethod("select_") : no applicable method for 'select_' applied to an object of class "character"
I am trying to extract some columns from the data which is result of analysis. The data is composed of 592 rows and 20 variables. When I run the code as below, I got the error message "Error in UseMethod("select_") : no applicable method for 'select_' applied to an object of class "character" " unused_cols <- -c(2:9) pvals_long <- pvals %>% map(function(x){ x <- x %>% dplyr::select(unused_cols) %>% gather(key = "celltype_pair", value = "pvalue", -interacting_pair) x }) Thanks in advance,
map is not needed. Mapping on a dataframe means that you are trying to apply your function on each column. However, select expects a dataframe, while in your code it gets a vector. That's what the error is telling you. unused_cols <- -c(2:9) will not work. Put the -in the call to select. Try this: unused_cols <- c(2:9) pvals_long <- pvals %>% select(-unused_cols) %>% gather(key = "celltype_pair", value = "pvalue", -interacting_pair)
How to extract columns from a row and save the output as a variable dplyr
I am trying to extract a specific column from a specific row on my excel sheet (df). However, when I try to do so I get the message: Error: ... must evaluate to column positions or names, not a list Call `rlang::last_error()` to see a backtrace. When I call rlang::last_error() I get: Backtrace: 1. dplyr::select(., FGA, FTA, TOV, MP, TmFga, TmFta, TmTov, TmMin) 9. tidyselect::vars_select(tbl_vars(.data), !!!enquos(...)) 10. tidyselect:::bad_calls(bad, "must evaluate to { singular(.vars) } positions or names, \\\n not { first_type }") 11. tidyselect:::glubort(fmt_calls(calls), ..., .envir = .envir) 12. dplyr::select(., FGA, FTA, TOV, MP, TmFga, TmFta, TmTov, TmMin) At this point, I am lost. What can I do to my code to work? library(readxl) Lakers_Overall_Stats <- read_excel("Desktop/Lakers Overall Stats.xlsx") library(readxl) Lakers_Record <- read_excel("Desktop/Lakers Record.xlsx") require(dplyr) require(ggplot2) ##WinPercentage of the Team after season mydata <- Lakers_Record %>% select(Pts,Opp,W,L)%>% + mutate(wpct=Pts^13.91/(Pts^13.91+Opp^13.91),expwin=round(wpct*(W+L)),diff=W-expwin) head(mydata) ##Specifiying Lakers_Overall_Stats[23,6] <- TmMin Lakers_Overall_Stats[23,8] <- TmFga Lakers_Overall_Stats[23,18] <- TmFta Lakers_Overall_Stats[23,26] <- TmTov rlang::last_error() ##Usage Percentage Usgpct <- Lakers_Overall_Stats %>% select(FGA,FTA,TOV,MP,TmFga,TmFta,TmTov,TmMin)%>% + mutate(100*(Fga+0.44*Fta+Tov))*TmMin/(TmFga+0.44*TmFta+TmTov)*5(MP) ##head(Usgpct) ##filter(rank(desc(Usgpct))==1) Also, am I filtering correctly? or should it be written as Usgpct <- Lakers_Overall_Stats %>% select(FGA,FTA,TOV,MP,TmFga,TmFta,TmTov,TmMin)%>% filter(rank(desc(Usgpct))==1)%>% mutate(100*(Fga+0.44*Fta+Tov))*TmMin/(TmFga+0.44*TmFta+TmTov)*5(MP) head(Usgpct)
You have Lakers_Overall_Stats[23,6] <- TmMin This will modify the Lakers_Overall_Stats data frame by setting the element at 23,6 etc. to be TmMin. TmMin is an object outside of your data frame. Maybe you want: TmMin <- Lakers_Overall_Stats[23,6] ? Also, you cannot select TmFga,TmFta,TmTov,TmMin since these variables are not part of your data frame. You can refer to those variables in your mutate equation, but because of the way you've set it up, they're stand-alone variables.
Nestled Loop not Working to gather data from NOAA
I'm using the R package rnoaa(along with it required other packages) to gather historical weather data. I wrote this nestled loop to gather all the data sets but I keep getting errors when I run it. It seems to run for a second fine The loop: require('triebeard') require('bindr') require('colorspace') require('mime') require('curl') require('openssl') require('R6') require('urltools') require('httpcode') require('stringr') require('assertthat') require('bindrcpp') require('glue') require('magrittr') require('pkgconfig') require('rlang') require('Rcpp') require('BH') require('plogr') require('purrr') require('stringi') require('tidyselect') require('digest') require('gtable') require('plyr') require('reshape2') require('lazyeval') require('RColorBrewer') require('dichromat') require('munsell') require('labeling') require('viridisLite') require('data.table') require('rjson') require('httr') require('crul') require('lubridate') require('dplyr') require('tidyr') require('ggplot2') require('scales') require('XML') require('xml2') require('jsonlite') require('rappdirs') require('gridExtra') require('tibble') require('isdparser') require('geonames') require('hoardr') require('rnoaa') install.package('ncdf4') install.packages("devtools") library(devtools) install_github("rnoaa", "ropensci") library(rnoaa) list <- buoys(dataset='wlevel') lid <- data.frame(list$id) foo <- for(range in 1990:2017){ for(bid in lid){ bid_range <- buoy(dataset = 'wlevel', buoyid = bid, year = range) bid.year.data <- data.frame(bid.year$data) write.csv(bid.year.data, file='cwind/bid_range.csv') } } The response: Using c1990.nc Using Error: length(url) == 1 is not TRUE It saves the first data-set but it does not apply the for in the file name it just names it bid_range.csv.
This error message shows that there are no any data of a given station id in 1990. Because you were using for loop, once it gots an error, it stops. Here I introduce the use of tidyverse to download the NOAA buoy data. A lot of the following functions are from the purrr package, which is part of the tidyverse. # Load packages library(tidyverse) library(rnoaa) Step 1: Create a "Grid" containing all combination of id and year The expand function from tidyr can create the combination of different values. data_list <- buoys(dataset = 'wlevel') data_list2 <- data_list %>% select(id) %>% expand(id, year = 1990:2017) Step 2: Create a "safe" version that does not break when there is no data. Also make this function suitable for the map2 function Because we will use map2 to loop through all the combination of id and year using the map2 function by its .x and .y argument. We modified the sequence of argument to create buoy_modify. We also use the safely function to create a safe version of buoy_modify. Now when it meets error, it will store the error message and moves to the next one rather than breaks. # Modify the buoy function buoy_modify <- function(buoyid, year, dataset, ...){ buoy(dataset, buoyid = buoyid, year = year, ...) } # Creare a safe version of buoy_modify buoy_safe <- safely(buoy_modify) Step 3: Apply the buoy_safe function wlevel_data <- map2(data_list2$id, data_list2$year, buoy_safe, dataset = "wlevel") # Assign name for the element in the list based on id and year names(wlevel_data) <- paste(data_list2$id, data_list2$year, sep = "_") After this step, all the data were downloaded in wlevel_data. Each element in wlevel_data has two parts. $result shows the data if the download is successful, otherwise, it shows NULL. $error shows NULL if the download is successful, otherwise, it shows the error message. Step 4: Access the data transpose can turn a list "inside out". So now wlevel_data2 has two elements: result and error. We can store these two and access the data. # Turn the list "inside out" wlevel_data2 <- transpose(wlevel_data) # Get the error message wlevel_error <- wlevel_data2$error # Get he result wlevel_result <- wlevel_data2$result # Remove NULL element in wlevel_result wlevel_result2 <- wlevel_result[!map_lgl(wlevel_result, is.null)]