I am looking to use a function to speed up a data cleaning process. In the example shown I am looking to remove values reported in the am and pm columns if the ".no" column for that day has a value of 1.
df1 = data.frame (identifier = c(1:4),
mon.no = c(1,NA,NA,NA),mon.am = c(2,1,NA,3),mon.pm = c(3,4,NA,5),
tues.no = c(NA,NA,1,NA),tues.am = c(2,3,1,4),tues.pm = c(3,3,2,3))
I envisage using a function uses the day to clean the data:
clean1 = function (day) {
df1$day.am[df1$day.no==1] = NA
df1$day.pm[df1$day.no==1] = NA
return (df1)}
df2 = clean1(mon)
However this returns the following error.
Error in `$<-.data.frame`(`*tmp*`, "day.am", value = logical(0)) :
replacement has 0 rows, data has 4
I assume that this is because the function expects a full column name and cannot fill in the gaps around a text input? Is it possible to use a function in that way?
Having read these notes I think that it would be better practice to have my data in a tidy format and am working on a solution which involves reorganising my data. However it would also be handy to be able to do this while the data is in it's original format.
Thanks.
You're really close. #Tyler Rinker in comments has explained why it doesn't work. Here's a fix:
clean1 = function (day) {
day.am = paste(day, "am", sep=".") # make a string from the variable day and the suffixes
day.pm = paste(day, "pm", sep=".")
day.no = paste(day, "no", sep=".")
df1[day.am][df1[day.no]==1] = NA
df1[day.pm][df1[day.no]==1] = NA
return (df1)}
df2 = clean1("mon") # "mon" should be a string
Somebody else might offer more efficient ways of doing this. Note that you're only ever working from your original df1 here. If you now run
df3 = clean1("tues")
you won't get a dataframe with both days cleaned. You could fix this by supplying the dataframe to be acted on to the function too:
clean2 = function(df, day){...
Related
I am trying to create a large number of data frames in a for loop using the "assign" function in R. I want to use the colnames function to set the column names in the data frame. The code I am trying to emulate is the following:
county_tmax_min_df <- data.frame(array(NA,c(length(days),67)))
colnames(county_tmax_min_df) <- c('Date',sd_counties$NAME)
county_tmax_min_df$Date <- days
The code I have so far in the loop looks like this:
file_vars = c('file1','file2')
days <- seq(as.Date("1979-01-01"), as.Date("1979-01-02"), "days")
f = 1
for (f in 1:2){
assign(paste0('county_',file_vars[f]),data.frame(array(NA,c(length(days),67))))
}
I need to be able to set the column names similar to how I did in the above statement. How do I do this? I think it needs to be something like this, but I am unsure what goes in the text portion. The end result I need is just a bunch of data frames. Any help would be wonderful. Thank you.
expression(parse(text = ))
You can set the names within assign, like that:
file_vars = c('file1', 'file2')
days <- seq.Date(from = as.Date("1979-01-01"), to = as.Date("1979-01-02"), by = "days")
for (f in seq_along(file_vars)) {
assign(x = paste0('county_', file_vars[f]),
value = {
df <- data.frame(array(NA, c(length(days), 67)))
colnames(df) <- paste0("fancy_column_",
sample(LETTERS, size = ncol(df), replace = TRUE))
df
})
}
When in {} you can use colnames(df) or setNames to assign column names in any manner desired. In your first piece of code you are referring to sd_counties object that is not available but the generic idea should work for you.
I am attempting to split out a flags column into multiple new columns in r using mutate_at and then separate functions. I have simplified and cleaned my solution as seen below, however I am getting an error that indicates that the entire column of data is being passed into my function rather than each row individually. Is this normal behaviour which just requires me to loop over each element of x inside my function? or am I calling the mutate_at function incorrectly?
example data:
dataVariable <- data.frame(c_flags = c(".q.q.q","y..i.o","0x5a",".lll.."))
functions:
dataVariable <- read_csv("...",
col_types = cols(
c_date = col_datetime(format = ""),
c_dbl = col_double(),
c_flags = col_character(),
c_class = col_factor(c("a", "b", "c")),
c_skip = col_skip()
))
funTranslateXForNewColumn <- function(x){
binary = ""
if(startsWith(x, "0x")){
binary=hex2bin(x)
} else {
binary = c(0,0,0,0,0,0)
splitFlag = strsplit(x, "")[[1]]
for(i in splitFlag){
flagVal = 1
if(i=="."){
flagVal = 0
}
binary=append(binary, flagVal)
}
}
return(paste(binary[4:12], collapse='' ))
}
mutate_at(dataVariable, vars(c_flags), funs(funTranslateXForNewColumn(.)))
separate(dataVariable, c_flags, c(NA, "flag_1","flag_2","flag_3","flag_4","flag_5","flag_6","flag_7","flag_8","flag_9"), sep="")
The error I am receiving is:
Warning messages:
1: Problem with `mutate()` input `c_flags`.
i the condition has length > 1 and only the first element will be used
After translating the string into an appropriate binary representation of the flags, I will then use the seperate function to split it into new columns.
Similar to OP's logic but maybe shorter :
dataVariable$binFlags <- sapply(strsplit(dataVariable$c_flags, ''), function(x)
paste(as.integer(x != '.'), collapse = ''))
If you want to do this using dplyr we can implement the same logic as :
library(dplyr)
dataVariable %>%
mutate(binFlags = purrr::map_chr(strsplit(c_flags, ''),
~paste(as.integer(. != '.'), collapse = '')))
# c_flags binFlags
#1 .q.q.q 010101
#2 y..i.o 100101
#3 .lll.. 011100
mutate_at/across is used when you want to apply a function to multiple columns. Moreover, I don't see here that you are creating only one new binary column and not multiple new columns as mentioned in your post.
I was able to get the outcome I desired by replacing the mutate_at function with:
dataVariable$binFlags <- mapply(funTranslateXForNewColumn, dataVariable$c_flags)
However I want to know how to use the mutate_at function correctly.
credit to: https://datascience.stackexchange.com/questions/41964/mutate-with-custom-function-in-r-does-not-work
The above link also includes the solution to get this function to work which is to vectorize the function:
v_funTranslateXForNewColumn <- Vectorize(funTranslateXForNewColumn)
mutate_at(dataVariable, vars(c_flags), funs(v_funTranslateXForNewColumn(.)))
I'm a newbie in R, so please have some patience and... tips are most welcome.
My goal is to create tibble that holds a "Full Name" (of a person, that may have 2 to 4 names) and his/her gender. I must start from a tibble that contains typical Male and Female names.
Below I present a minimum working example.
My problem: I can call get_name() multiple time (in 10.000 for loop!!) and get the right answer. But, I was looking for a more 'elegant' way of doing it. replicate() unfortunately returns a vector... which make it unusable.
My doubts: I know I have some (very few... right!!) issues, like the if statement, that is evaluated every time (which is redundant), but I don't find another way to do it. Any suggestion?
Any other suggestions about code struct are also welcome.
Thank you very much in advance for your help.
# Dummy name list
unit_names <- tribble(
~Women, ~Man,
"fem1", "male1",
"fem2", "male2",
"fem3", "male3",
"fem4", "male4",
"fem5", "male5",
"fem6", NA,
"fem7", NA
)
set.seed(12345) # seed for test
# Create a tibble with the full names
full_name <- tibble("Full Name" = character(), "Gender" = character() )
get_name <- function() {
# Get the Number of 'Unit-names' to compose a 'Full-name'
nbr_names <- sample(2:4, 1, replace = TRUE)
# Randomize the Gender
gender <- sample(c("Women", "Man"), 1, replace = TRUE)
if (gender == "Women") {
lim_names <- sum( !is.na(unit_names$"Women"))
} else {
lim_names <- sum( !is.na(unit_names$"Man"))
}
# Sample the Fem/Man List names (may have duplicate)
sample(unlist(unit_names[1:lim_names, gender]), nbr_names, replace = TRUE) %>%
# Form a Full-name
paste ( . , collapse = " ") %>%
# Add it to the tibble (INCLUDE the Gender)
add_row(full_name, "Full Name" = . , "Gender" = gender)
}
# How can I make 10k of this?
full_name <- get_name()
If you pass a larger number than 1 to sample this problem becomes easier to vectorise.
One thing that currently makes your problem much harder is the layout of your unit_names table: you are effectively treating male and female names as individually paired, but they clearly aren’t: hence they shouldn’t be in columns of the same table. Use a list of two vectors, for instance:
unit_names = list(
Women = c("fem1", "fem2", "fem3", "fem4", "fem5", "fem6", "fem7"),
Men = c("male1", "male2", "male3", "male4", "male5")
)
Then you can generate random names to your heart’s delight:
generate_names = function (n, unit_names) {
name_length = sample(2 : 4, n, replace = TRUE)
genders = sample(c('Women', 'Men'), n, replace = TRUE)
names = Map(sample, unit_names[genders], name_length, replace = TRUE) %>%
lapply(paste, collapse = ' ') %>%
unlist()
tibble(`Full name` = names, Gender = genders)
}
A note on style, unlike your function the above doesn’t use any global variables. Furthermore, don’t "quote" variable names (you do this in unit_names$"Women" and for the arguments of add_row). R allows this, but this is arguably a mistake in the language specification: these are not strings, they’re variable names, making them look like strings is misleading. You don’t quote your other variable names, after all. You do need to backtick-quote the `Full name` column name, since it contains a space. However, the use of backticks, rather than quotes, signifies that this is a variable name.
I am not 100% of what you are trying to get, but if I got it right...did you try with mutate at dplyr? For example:
result= mutate(data.frame,
concated_column = paste(column1, column2, column3, column4, sep = '_'))
With a LITTLE help from Konrad Rudolph, the following elegant (and vectorized ... and fast) solution that I was looking. map2 does the necessary trick.
Here is the full working example if someone needs it:
(Just a side note: I kept the initial conversion from tibble to list because the data arrives to me as a tibble...)
Once again thanks to Konrad.
# Dummy name list
unit_names <- tribble(
~Women, ~Men,
"fem1", "male1",
"fem2", "male2",
"fem3", "male3",
"fem4", "male4",
"fem5", "male5",
"fem6", NA,
"fem7", NA
)
name_list <- list(
Women = unit_names$Women[!is.na(unit_names$Women)],
Men = unit_names$Men[!is.na(unit_names$Men)]
)
generate_names = function (n, name_list) {
name_length = sample(2 : 4, n, replace = TRUE)
genders = sample(c('Women', 'Men'), n, replace = TRUE)
#names = lapply(name_list[genders], sample, name_length) %>%
names = map2(name_list[genders], name_length, sample) %>%
lapply(paste, collapse = ' ') %>%
unlist()
tibble(`Full name` = names, Gender = genders)
}
full_name <- generate_names(10000, name_list)
Often, when using dplyr programmatically, I'll want to select a column by its name, where the column name is stored as a string in some variable.
I've noticed that attempts to do this with dplyr often lead to unexpected results. This seems to be as a result of how tbl_df is handled.
Below are some examples:
## regular data frame:
df = data.frame(subject = 1:3, resp = c(2,3,3)) # example dataframe
response_column = "resp" # I want to select the contents of a column with a string
# for loop over unique values:
unique_responses = unique(df[,response_column])
for (resp in unique_responses) {
cat("\nA response:", resp)
}
# convert column type:
df[,response_column] = as.character(df[,response_column])
str(df) # modified the column
So these are the sorts of things I'm used to doing. Accessing column's contents, converting them and reassigning them, taking their unique values, etc.
But when the data.frame has class tbl_df and tbl, things don't work as well.
## with tbl_df and tbl
require(dplyr)
df = data.frame(subject = 1:3, resp = c(2,3,3))
class(df) = c("tbl_df","tbl", class(df))
class(df)
df[,response_column]
# for loop doesn't seem to know what to do with this:
unique_responses = unique(df[,response_column])
for (resp in unique_responses) {
cat("\nA response:", resp)
}
# as.character seems to concatenate the entire column into one string!
df[,response_column] = as.character(df[,response_column])
df
I wasn't sure what to make of this behavior (i.e., intentional vs. bug), or generally, what the best practice is to be able to use the same (programmatic) code across normal data frames as well as dplyr's data frames.
I am trying to add a column to a zoo object. I found merge which works well
test = zoo(data.frame('x' = c(1,2,3)))
test = merge(test, 'x1' = 0)
However when I try to name the column dynamically, it no longer works
test = merge(test, paste0('x',1) = 0)
Error: unexpected '=' in "merge(test,paste0('x',1) ="
I have been working with data frames and the same syntax works
test = data.frame('x' = c(1,2,3))
test[paste0('x',1)] = 0
Can someone help explain what the problem is and how to get around this?
Try setNames :
setNames( merge(test, 0), c(names(test), paste0("x", 1)) )
or names<-.zoo like this:
test2 <- merge(test, 0)
names(test2) <- c(names(test), paste0("x", 1))
I found this solution very easy and elegant. It uses the eval() function to interpret a string as an R command. Thus, you are completely free to assemble the string exactly the way you want:
test = merge(test, paste0("x",1) = 0)
# does not work (see question)
test[,"x1"] <- 0
# does not work for uninitialized columns
test$x1 <- 0
# works to initialize a new column
# so lets trick R by assembling this command out of strings:
newcolumn <- "x1"
eval(parse(text=paste0("test$",newcolumn," <- 0")))
# welcome test$x1 :-)
Merge expects a string as variable name, it doesn't understand variable names that are return values of functions. Why not
test = zoo(data.frame('x' = c(1,2,3)))
var <- paste0('x',1)
test = merge(test, var = 0)