I currently have a dataset that has two variables, winner_entry and winner_seed. In a few instances the data was entered incorrectly: the winner_entry value was put into the winner_seed variable.
Atp_singles_2022 %>%
filter(winner_seed == "WC") %>%
select(winner_seed, winner_entry, winner_name, tourney_name) %>%
print(n=10)
This produces the output below
Atp_singles_2022 %>%
mutate(winner_seed == str_replace_all(tourney_name, fixed("WC"),"NA"))
I was thinking of doing this, but that wouldn't fix winner_entry, which needs to be changed to WC.
This may be solved using an ifelse statement within mutate.
Atp_singles_2022 <- Atp_singles_2022 %>%
mutate(winner_entry = ifelse(is.na(winner_entry),"WC",winner_entry))
This code says that if is.na(winner_entry) (AKA if winner_entry is NA), change it to WC, else leave it as winner_entry. With this code, you can change the contents of a column based on values in that column, or you could change it based on another column.
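As a minimal sketch of the "based on another column" case, the snippet below moves a misplaced "WC" from winner_seed into winner_entry. The toy_matches table and its values are hypothetical stand-ins for Atp_singles_2022, and it assumes winner_seed is stored as character.

```r
library(dplyr)

# Hypothetical toy data standing in for Atp_singles_2022
toy_matches <- tibble(
  winner_name  = c("Player A", "Player B", "Player C"),
  winner_seed  = c("WC", "3", NA),
  winner_entry = c(NA, NA, "Q")
)

fixed <- toy_matches %>%
  mutate(
    # %in% never returns NA, so rows with a missing seed keep their entry
    winner_entry = ifelse(winner_seed %in% "WC", "WC", winner_entry),
    # the first assignment used the old seed; now blank out the misplaced value
    winner_seed  = na_if(winner_seed, "WC")
  )

print(fixed)
```

Unlike the is.na() version, this only touches rows where winner_seed actually holds "WC", so other missing entries stay missing.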
I am new to R and am sure the solution is simple, but I am having a hard time figuring out where I'm going wrong.
Apologies if this question has been asked. I did look, but once again, I'm new and it might have gone right over my head. :)
I'm working with a data set "minuteSleep_merged.csv" from the Fitabase Data 4.12.16-5.12.16 folder. I'm trying to determine if the data set is accurate and has insights for my capstone project with the Google Data Analyst Certification.
For backstory, I am using tidyverse and have loaded the package into my session in addition to the .csv file I am using.
This is what I have so far:
minSleep <- read_csv("minuteSleep_merged.csv")
## getting a summary of the data
head(minSleep)
colnames(minSleep)
n_distinct(minSleep)
## separating the date column into date and time to make it easier to read and aggregate
summ_minSleep <- separate(minSleep, date, into = c("date", "time"), sep = " ")
## confirming the column was separated
head(summ_minSleep)
## creating a table that shows the amount of time each participant spent asleep per date
asleep_summ_minSleep <- summ_minSleep %>%
group_by(Id, date) %>%
count(value = 1) %>%
rename(min_asleep = n) %>%
mutate(hours_asleep = min_asleep/60)
head(asleep_summ_minSleep)
## doing the same for value 2 (restless)
restless_summ_minSleep <- summ_minSleep %>%
group_by(Id, date) %>%
count(value = 2) %>%
rename(min_restless = n) %>%
mutate(hours_restless = min_restless/60)
head(restless_summ_minSleep)
## and one more time for value 3 (awake)
awake_summ_minSleep <- summ_minSleep %>%
group_by(Id, date) %>%
count(value = 3) %>%
rename(min_awake = n) %>%
mutate(hours_awake = min_awake/60)
head(awake_summ_minSleep)
I thought I was onto something when I got the first table asleep_summ_minSleep to run properly.
But my next thought was, to know if the data set should be kept for analysis or removed in the cleaning process, I would also need to know how many hours each participant spent awake and restless per day.
So, I created a separate table for each value (1 = asleep, 2 = restless, 3 = awake). As you can see each table has a different name with column names that create a clear distinction.
Also, as a side note, I created a separate table for each because I couldn't figure out how to make a pipe that would contain all this information in one table. That will be my next learning adventure.
Anyway, back to the task at hand. You all can probably already see what my issue is and the cause, but, to spell it out, while each table is correctly labeled and shows the correct corresponding value column, the data within the columns hours_asleep, hours_restless, and hours_awake are exactly the same.
It took an unfortunate amount of time and web searching to create the first table, but with trial and error I thought I was on to something, but this clearly shows I'm off somewhere so I'm seeking help.
Here is a link to an R Markdown file: https://5b69b06b6e8d44fe9dd87e7ae606c95a.app.rstudio.cloud/file_show?path=%2Fcloud%2Fproject%2FminSleep_help.html
Any suggestions, hints, theories, really anything, would be appreciated.
Thank you!!
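One possible explanation, offered as a sketch rather than a definitive diagnosis: inside count(), value = 1 does not filter on value. It creates a new column named value that is 1 on every row and counts by it, so each of the three tables ends up counting all rows per Id and date. Counting by the existing value column and then spreading the states into columns gives one table. The toy_sleep data and the min_ column prefix below are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for summ_minSleep (1 = asleep, 2 = restless, 3 = awake)
toy_sleep <- tibble(
  Id    = c(1, 1, 1, 1, 2, 2),
  date  = rep("4/12/2016", 6),
  value = c(1, 1, 2, 3, 1, 3)
)

state_labels <- c("asleep", "restless", "awake")

sleep_by_state <- toy_sleep %>%
  count(Id, date, value, name = "minutes") %>%   # one row per Id/date/state
  mutate(state = state_labels[value]) %>%        # label the numeric codes
  select(-value) %>%
  pivot_wider(names_from = state, values_from = minutes,
              names_prefix = "min_", values_fill = 0)

print(sleep_by_state)
```

Hours can then be derived with mutate(), for example hours_asleep = min_asleep / 60.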
I need to create a dataframe summarising information relating to file checking.
I have a list of 126 unique combinations of climate scenarios and years (e.g. 'ssp126_2030', 'ssp126_2050', 'ssp145_2030', 'ssp245_2050'). These unique elements represent sections of a larger full file path pointing to a specific file (scenario_list, below). For each unique element, I need to create multiple new columns specifying whether the file exists, its size and the date it was created.
I would like to loop through the list of 126 elements and stitch together a table of file checks (file_check_table, below). I start with a table of sub-directories, then split these strings into sections so I can paste0() together a string that points to the file within the sub-directory that I want to check. I am aiming to use mutate()/transmute() and purrr::map() to loop through each element in the climate scenario list and add multiple file checking columns (see below image of table).
I am new to functional programming, and this is what I have tried so far. I was thinking of creating a function to add new columns, and then applying the function to the list of climate scenarios. My end goal is to have one new column for each climate scenario and type of file check:
file_checks <- function(x) {
dir_list %>%
mutate(file_check_table,!!paste0(new_col_name) := ifelse(file.exists(paste0(file))==TRUE,1,0))}
file_check_table <- map(scenario_list, file_checks(x))
However, this function does not work. I don't think I have written the function correctly, or perhaps I have not used purrr correctly. Any thoughts on how to fix this would be much appreciated, thank you. This is what I would like file_check_table to look like:
If I understand your question correctly, you have a scenario_list that describes the path to the files, and would like the characteristics of the files. The natural way to do that would be to run a pipe with one entry per row, no reason to put it in a function.
For example:
library(tidyverse)
scenario_list <- read_lines("scenario_list.txt")
root_dir <- "C:/Users/Documents/my_project/data_subdir"
# one row per scenario; file.exists() and file.info() are vectorised over path
file_table <- tibble(scenario = scenario_list) %>%
  mutate(path = file.path(root_dir, paste0(scenario, ".csv")),
         exists = file.exists(path),
         full_info = file.info(path),
         file_size = full_info$size,
         file_date = full_info$mtime)
And then if you want the output on a single row as in your screenshot:
file_table %>%
select(-path, -full_info) %>%
pivot_wider(names_from = scenario,
names_glue = "{scenario}_{.value}",
values_from = !scenario) %>%
write_csv("output.csv")
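To try the same idea without the real project folder, here is a self-contained sketch. The temporary directory, the two scenario names and the .csv extension are assumptions made for illustration; file.size() and file.mtime() are base R shortcuts for the corresponding file.info() columns.

```r
library(dplyr)

# Hypothetical sandbox: one file that exists and one that does not
demo_dir <- file.path(tempdir(), "file_check_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("a,b", file.path(demo_dir, "ssp126_2030.csv"))

scenario_list <- c("ssp126_2030", "ssp126_2050")

file_table <- tibble(scenario = scenario_list) %>%
  mutate(path      = file.path(demo_dir, paste0(scenario, ".csv")),
         exists    = file.exists(path),
         file_size = file.size(path),    # NA when the file is missing
         file_date = file.mtime(path))   # last modification time

print(file_table)
```

Note that mtime is the last modification time. Whether a true creation time is available depends on the operating system and file system, so check the file.info() documentation before relying on ctime for that.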
I have added a variable that is the sum of all policies for each customer:
mhomes %>% mutate(total_policies = rowSums(select(., starts_with("num"))))
However, when I now want to use this total_policies variable in plots or when using summary() it says: Error in summary(total_policies) : object 'total_policies' not found.
I don't understand what I did wrong or what I should do differently here.
This may be slightly roundabout, but I feel it solves the purpose. Assuming df is the dataset and it has customer_id, policy_id and policy_amount as variables, the command below should work:
req_output = df %>% group_by(customer_id) %>% summarise(total_policies = sum(policy_amount))
If you still face the issue, kindly convert it to a data frame and try plotting:
req_output = as.data.frame(req_output)
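Another likely cause of the "object not found" error, sketched under the assumption that the code in the question is all that was run: mutate() returns a new data frame and does not modify mhomes in place, and a column has to be reached through its data frame. The toy mhomes table below is hypothetical.

```r
library(dplyr)

# Hypothetical toy version of mhomes
mhomes <- tibble(
  num_car  = c(1, 0, 2),
  num_fire = c(1, 1, 0),
  age      = c(30, 45, 52)
)

# Assign the result back, otherwise the new column is only printed and then lost
mhomes <- mhomes %>%
  mutate(total_policies = rowSums(select(., starts_with("num"))))

# Refer to the column through the data frame, not as a free-standing object
summary(mhomes$total_policies)
```

In plots the same applies: pass the data frame to the plotting function (for example as the data argument of ggplot()) or use mhomes$total_policies.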
I'm working on a script for a swirl lesson on using the tidyr package and I'm having some trouble with the %>% operator. I've got a data frame called passed that contains the name, class number, and final grade of 4 students. I want to add a new column called status and populate it with a character vector that says "passed". Before that, I used select to grab some columns from a data frame called students4 and stored the result in a data frame called gradebook:
gradebook <- students4 %>%
select(id, class, midterm, final) %>%
passed<-passed %>% mutate(status="passed")
Swirl problems build on each other, and the last one just had me running the first two lines of code, so I think those two are correct. The third line was what was suggested after a couple of wrong attempts, so I think there's something about %>% that I'm not understanding. When I run the code I get an error that says:
Error in students4 %>% select(id, class, midterm, final) %>% passed <- passed %>% :
could not find function "%>%<-"
I found another user who asked about a 'could not find function "%>%"' error and was able to resolve the issue by installing the magrittr package, but that didn't do the trick for me. Any input on the issues in my code would be super appreciated!
It’s not a problem with the package or the operator. You’re trying to pipe into a new line that starts a new assignment.
The %>% operator passes the previous data frame into the next function as that function’s first argument.
Instead of doing all of this:
Gradebook <- select(students4, id, class, midterm, final)
Gradebook2 <- mutate(Gradebook, test4 = 100)
Gradebook3 <- arrange(Gradebook2, desc(final))
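You can use the pipe operator to pass the result into the next function if you’re working on the same data frame. Note that the data frame is no longer repeated inside select(), because the pipe already supplies it:
Gradebook <- students4 %>%
  select(id, class, midterm, final) %>%
  mutate(test4 = 100) %>%
  arrange(desc(final))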
Much cleaner and easier to read.
In your second line you’re trying to pass it to a new function but instead of there being a function you’re all of a sudden defining a variable. I don’t know the exercise you’re doing but you should remove the second operator.
gradebook <- students4 %>%
select(id, class, midterm, final)
passed <- passed %>% mutate(status="passed")
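As a small check of the rule that x %>% f(y) is the same as f(x, y), the sketch below compares the piped and nested forms. The students4 values are made up for illustration and are not the swirl data.

```r
library(dplyr)

# Hypothetical stand-in for the swirl data frame
students4 <- tibble(
  id      = c(168, 588, 710),
  name    = c("Brian", "Sally", "Jeff"),
  class   = c(1, 2, 1),
  midterm = c("B", "C", "A"),
  final   = c("B", "C", "A")
)

# The pipe supplies students4 as the first argument of select()
piped  <- students4 %>% select(id, class, midterm, final)
nested <- select(students4, id, class, midterm, final)

# Both forms should give the same result
print(all.equal(piped, nested))
```

Because a trailing %>% always expects a function call next, ending a line with it makes R read the following line as part of the same expression, which is how the assignment to passed ended up inside the pipeline.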
This is the continuation of the following thread:
Creating Binary Identifiers Based On Condition Of Word Combinations For Filter
Expected output is the same as per the said thread.
I am now writing a function that can take dynamic names as variables.
This is the code that I am aiming at, if I am to run it manually:
df <- df %>% group_by(id, date) %>% mutate(flag1 = if(eval(parse(text=conditions))) grepl(pattern, item_name2) else FALSE)
To make it take into consideration dynamic variable names, I have been doing the code this way:
groupcolumns <- c(id, date)
# where id and date will be entered into the function as character strings by the user
variable <- list(~if(eval(parse(text=conditions))) grepl(pattern, item) else FALSE)
# converting to formula to use with dynamically generated column names
# "conditions" being the following character vector, which I can automatically generate:
conditions <- 'any(grepl("Alpha", Item)) & any(grepl("Bravo", Item))'
This becomes:
df <- df %>% group_by_(.dots = groupcolumns) %>% mutate_(.dots = setNames(variable, flags[1]))
# where flags[1] is a predefined vector of columns names that I have created
flags <- paste("flag", seq(1:100), sep = "")
The problem is that I am unable to do anything to the grepl function to specify the "item" dynamically. If I write it as "df$item" and do an eval(parse(text="df$item")), the piping fails because I am doing a group_by_, and it results in an error (naturally). This also applies to the conditions that I set.
Does a way exist for me to tell grepl to use a dynamic variable name?
Thanks a lot (especially to akrun)!
edit 1:
I tried the following, and now there is no problem passing the name of the item into grepl.
variable <- list(~if(eval(parse(text=conditions))) grepl(pattern, as.name(item)) else FALSE)
However, the problem lies in that piping seems not to work, as the output of as.name(item) is seen as an object, which does not exist in the environment.
edit 2:
trying do() in dplyr:
variable <- list(~if(eval(parse(text=conditions))) grepl(pattern, .$deparse(as.name(item))) else FALSE)
df <- df %>% group_by_(.dots = groupcolumns) %>% do_(.dots = setNames(variable, combiflags[1]))
which throws me the error:
Error: object 'Item' not found
If I understand your question correctly, you want to be able to dynamically input both patterns and the object to be searched by these patterns in grepl? The best solution for you will depend entirely on how you choose to store the patterns and how you choose to store the objects to be searched. I have a few ideas that should help you though.
For dynamic patterns, try inputting a list of patterns using the paste function. This will allow you to search many different patterns at once.
grepl(paste(your.pattern.list, collapse="|"), item)
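A minimal sketch of what the collapsed pattern does, using made-up patterns and items: paste() with collapse = "|" builds a single regular expression in which "|" means "or", so grepl() reports a match when any of the patterns is found.

```r
# Hypothetical patterns and strings to search
patterns <- c("Alpha", "Bravo")
item <- c("Alpha one", "Charlie", "bravo two", "Bravo three")

combined <- paste(patterns, collapse = "|")   # "Alpha|Bravo"
matches  <- grepl(combined, item)

print(matches)
# TRUE FALSE FALSE TRUE  (matching is case-sensitive, so "bravo two" is not matched)
```

Passing ignore.case = TRUE to grepl() would also match the lower-case entry.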
Let's say you want to set up a scenario where you are storing many patterns of interest in a directory, perhaps collected automatically from a server or from some other output. If the patterns are in separate files, you can create lists of patterns using this:
library(readr)  # provides read_csv()
#set working directory
setwd("/path/to/files/i/want")
#make a list of all files in this directory
inFilePaths = list.files(path=".", pattern=glob2rx("*"), full.names=TRUE)
#perform a function for each file in the list
for (inFilePath in inFilePaths)
{
  #grepl function goes here
  #if each file in the folder is a table/matrix/dataframe of patterns try this
  inFileData = read_csv(inFilePath)
  vectorData = as.vector(inFileData$ColumnOfPatterns)
  #store or print this result as needed; inside a loop it is not printed automatically
  grepl(paste(vectorData, collapse="|"), item)
}
For dynamically specifying the item, you can use an almost identical framework
library(readr)  # provides read_csv()
#set working directory
setwd("/path/to/files/i/want")
#make a list of all files in this directory
inFilePaths = list.files(path=".", pattern=glob2rx("*"), full.names=TRUE)
#perform a function for each file in the list
for (inFilePath in inFilePaths)
{
  #grepl function goes here
  #if each file in the folder is a table/matrix/dataframe of data to be searched try this
  inFileData = read_csv(inFilePath)
  #store or print this result as needed; inside a loop it is not printed automatically
  grepl(pattern, inFileData$ColumnToBeSearched)
}
If this is too far off from what you envisioned, please update your question with details about how the data you are using is stored.
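If the data is already in a data frame and only the column names are dynamic, a different technique from the file-based loops above may fit better. The sketch below swaps in dplyr's .data pronoun and across(all_of()) in place of the underscore verbs used in the question; it assumes dplyr 1.0 or later, and the toy df, pattern and column names are hypothetical.

```r
library(dplyr)

# Hypothetical data and user-supplied names (all passed as character strings)
df <- tibble(
  id   = c(1, 1, 2, 2),
  date = c("d1", "d1", "d1", "d1"),
  Item = c("Alpha x", "Bravo y", "Alpha z", "Charlie")
)
groupcolumns <- c("id", "date")
item    <- "Item"
pattern <- "Alpha"

flagged <- df %>%
  group_by(across(all_of(groupcolumns))) %>%
  mutate(
    # .data[[item]] looks the column up by its name within the current group
    flag1 = if (any(grepl("Alpha", .data[[item]])) &&
                any(grepl("Bravo", .data[[item]]))) {
      grepl(pattern, .data[[item]])
    } else {
      FALSE
    }
  ) %>%
  ungroup()

print(flagged)
```

Here only the first group contains both "Alpha" and "Bravo", so only its "Alpha x" row is flagged. The condition is written out directly instead of going through eval(parse()), which keeps the column lookup inside the grouped data.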