Trouble interpolating column name for subset in R function - r

I have (fbodata) 'data.frame': 6181090 obs. of 41 variables:
I want to subset it and save the portion that pertains to a specific subset (like a zip). My approach seems to work when it is not in a function, but I ultimately want to use sapply.
nmakedir <- function(item, ccol) {
snipped a bunch of code that works
trim<- fbodata[ which(paste(ccol)==item),]
trim%>% drop_na(paste(ccol))
trim<- droplevels(trim
save(trim, file = paste(item, "rda", sep="."))
}
The line that doesn't work is one where I creating the subset with which. If i hardcode the line using fbodata$zip instead of paste(ccol) it works fine. Eventually, I plan to call it with something like:
sapply(unique(fbodata$zip),zip, FUN = nmakedir)
I appreciate any clues, I have been on this for a good long while.

A few things going on:
ccol is a string. paste(ccol) is the same string. You never need to call paste with only one argument. (You can use paste to coerce non-strings to strings, but in that case you should use as.character() to be clear.)
Keeping in mind that ccol is a string, what is fbodata$zip? It's a column! What is the equivalent using ccol and brackets? fbodata[[ccol]] or fbodata[, ccol]. You can use either of those interchangeably with fbodata$zip. So, this bad line
fbodata[ which(paste(ccol)==item),]
# should be this:
fbodata[which(fbodata[[ccol]] == item), ]
drop_na, like most dplyr functions, expects (quoting from the help) "bare variable names", not strings. Also from the help, "See Also: drop_na_ for a version that uses regular evaluation and is suitable for programming with". In this case, I don't think you need to do anything more than replace drop_na with drop_na_.
You are missing a right parenthesis on your droplevels command.
There might be more, but this is much as I can see without any sample data. Your sapply call looks funny to me because I thought zip is supposed to be a column name, but when you call sapply(unique(fbodata$zip),zip, FUN = nmakedir) it needs to be an object in your global environment. I would think sapply(unique(fbodata$zip), 'zip', FUN = nmakedir) makes more sense, but without a reproducile example there's no way to know.
It also seems like you're coding your own version of split. I would probably start this off with fbo_split = split(fbodata, fbodata$zip) and then use lapply to drop_na_, droplevels, and save, but maybe your snipped code makes that a less good idea.

Related

Rename a column with R

I'm trying to rename a specific column in my R script using the colnames function but with no sucess so far.
I'm kinda new around programming so it may be something simple to solve.
Basically, I'm trying to rename a column called Reviewer Overall Notes and name it Nota Final in a data frame called notas with the codes:
colnames(notas$`Reviewer Overall Notes`) <- `Nota Final`
and it returns to me:
> colnames(notas$`Reviewer Overall Notes`) <- `Nota Final`
Error: object 'Nota Final' not found
I also found in [this post][1] a code that goes:
colnames(notas) [13] <- `Nota Final`
But it also return the same message.
What I'm doing wrong?
Ps:. Sorry for any misspeling, English is not my primary language.
You probably want
colnames(notas)[colnames(notas) == "Reviewer Overall Notes"] <- "Nota Final"
(#Whatif's answer shows how you can do this with the numeric index, but probably better practice to do it this way; working with strings rather than column indices makes your code both easier to read [you can see what you're renaming] and more robust [in case the order of columns changes in the future])
Alternatively,
notas <- notas %>% dplyr::rename(`Nota Final` = `Reviewer Overall Notes`)
Here you do use back-ticks, because tidyverse (of which dplyr is a part) prefers its arguments to be passed as symbols rather than strings.
Why using backtick? Use the normal quotation mark.
colnames(notas)[13] <- 'Nota Final'
This seems to matter:
df <- data.frame(a = 1:4)
colnames(df)[1] <- `b`
Error: object 'b' not found
You should not use single or double quotes in naming:
I have learned that we should not use space in names. If there are spaces in names (it works and is called a non-syntactic name: And according to Wickham Hadley's description in Advanced R book this is due to historical reasons:
"You can also create non-syntactic bindings using single or double quotes (e.g. "_abc" <- 1) instead of backticks, but you shouldn’t, because you’ll have to use a different syntax to retrieve the values. The ability to use strings on the left hand side of the assignment arrow is an historical artefact, used before R supported backticks."
To get an overview what syntactic names are use ?make.names:
make.names("Nota Final")
[1] "Nota.Final"

Why am I having errors with order of functions using %>% in R?

This is the code I am trying to run:
data_table<-data_table%>%
merge(new_table, by = 'Sample ID')%>%
mutate(Normalized_value = ((1.8^(data_table$Ubb - data_table$Ct_adj))*10000))
I want to first add the new column ("Ubb") from "new_table" and then add a calculated column using that new column. However, I get an error saying that Ubb column does not exist. So it's not performing merge before running mutate? When I separate the functions everything works fine:
data_table<-data_table%>%
merge(new_table, by = 'Sample ID')
data_table<-data_table%>%
mutate(Normalized_value = ((1.8^(data_table$Ubb - data_table$Ct_adj))*10000))
I would like to keep everything together just for style, but I'm also just curious, shouldn't R perform merge first and then mutate? How does order of operation during piping work?
Thank you!
you dont need to refer to column name with $ sign. i.e. use Normalized_value = ((1.8^(Ubb - Ct_adj))*10000)
because it is merged now. with $ sign I believe R, even though does the merge, has original data_table still in memory. because the assignment operator did not work yet. the assignment will take place after all operations.
Try running the code like this:
data_table<-data_table%>%
merge(new_table, by = 'Sample ID')%>%
mutate(Normalized_value = ((1.8^(Ubb - Ct_adj))*10000))
Notice that I'm not using the table name with the $ within the pipe. Your way is telling the mutate column to look at a vector. Maybe it's having some trouble understanding the length of that vector when used with the merge. Just call the variable name within the pipe. It's easiest to think of the %>% as 'and then': data_table() and then merge() and then mutate(). You might also want to think about a left_join instead of a merge.

How do I write a dplyr pipe-friendly function where a new column name is provided from a function argument?

I'm hoping to produce a pipe-friendly function where a user specifies the "name of choice" for a new column produced by the function as one of the function arguments.
In the function below, I'd like name_for_elective to be something that the user can set at will, and afterwards, the user could expect that there will be a new column in their data with the name that they provided here.
I've looked at https://dplyr.tidyverse.org/reference/dplyr_data_masking.html, the mutate() function documentation, searched here, and tried working with https://dplyr.tidyverse.org/reference/rename.html, but to no avail.
elective_open<-function(.data,name_for_elective,course,tiebreaker){
name_for_elective<-rlang::ensym(name_for_elective)
course<-rlang::ensym(course)
tiebreaker<-rlang::ensym(tiebreaker)
.data%>%
mutate(!!name_for_elective =ifelse(!!tiebreaker==max(!!tiebreaker),1,0))%>%mutate(!!name_for_elective=ifelse(!!name_for_elective==0,!!course[!!name_for_elective==1],""))%>%
filter(!(!!course %in% !!name_for_elective))
}
I've included this example function because there are several references to the desired new column name, and I'm unsure if the context in which the reference is made changes syntax.
As you can see, I was hoping !!name_for_elective would let me name our new column, but no. I've played with {{}}, not using rlang::ensym, and still haven't got this figured out.
Any solution would be greatly appreciated.
This: Use dynamic variable names in `dplyr` may be helpful, but I can't seem to figure out how to extend this to work in the case where multiple references are made to the name argument.
Example data, per a good suggestion by #MrFlick, takes the form below:
dat<-tibble(ID=c("AA","BB","AA","BB","AA","BB"),Class=c("A_Class","B_Class","C_Class","D_Class","E_Class","F_Class"),
randomNo=c(.75,.43,.97,.41,.27,.38))
The user could then run something like:
dat2<-dat%>%
elective_open(MyChosenName,Class,randomNo)
A desired result, if using the function a single time, would be:
desired_result_1<-tibble(ID=c("AA","BB","AA","BB"),
Class=c("A_Class","D_Class","E_Class","F_Class"),
randomNo=c(.75,.41,.27,.38),
MyChosenName=c("C_Class","B_Class"))
The goal would be to allow this function to be used again if so desired, with a different name specified.
In the case where a user runs:
dat3<-dat%>%
elective_open(MyChosenName,Class,randomNo)%>%
mutate(Just_Another_One=1)%>%
elective_open(SecondName,Class,randomNo)
The output would be:
desired_result_2<-tibble(ID=c("AA","BB"),
Class=c("E_Class","F_Class"),
randomNo=c(.27,.38),
MyChosenName=c("C_Class","B_Class"),
Just_Another_One=c(1,1),
SecondName=c("A_Class","D_Class"))
In reality, there may be any number of records with the same ID, and any number of Class-es.
In this case you can just stick to using the embrace {{}} option for your variables. If you want to dynamically create column names, you're going to still need to use :=. The difference here is that you can use the glue-style syntax with the embrace operator to get the name of the symbol. This works with the data provided.
elective_open <- function(.data, name_for_elective, course, tiebreaker){
.data%>%
mutate("{{name_for_elective}}" := ifelse({{tiebreaker}}==max({{tiebreaker}}),1,0)) %>%
mutate("{{name_for_elective}}" := ifelse({{name_for_elective}}==0,{{course}}[{{name_for_elective}}==1],"")) %>%
filter(!({{course}} %in% {{name_for_elective}}))
}

Combining many vectors into one larger vector (in an automated way)

I have a list of identifiers as follows:
url_num <- c('85054655', '85023543', '85001177', '84988480', '84978776', '84952756', '84940316', '84916976', '84901819', '84884081', '84862066', '84848942', '84820189', '84814935', '84808144')
And from each of these I'm creating a unique variable:
for (id in url_num){
assign(paste('test_', id, sep = ""), FUNCTION GOES HERE)
}
This leaves me with my variables which are:
test_8505465, test_85023543, etc, etc
Each of them hold the correct output from the function (I've checked), however my next step is to combine them into one big vector which holds all of these created variables as a seperate element in the vector. This is easy enough via:
c(test_85054655,test_85023543,test_85001177,test_84988480,test_84978776,test_84952756,test_84940316,test_84916976,test_84901819,test_84884081,test_84862066,test_84848942,test_84820189,test_84814935,test_84808144)
However, as I update the original 'url_num' vector with new identifiers, I'd also have to come down to the above chunk and update this too!
Surely there's a more automated way I can setup the above chunk?
Maybe some sort of concat() function in the original for-loop which just adds each created variable straight into an empty vector right then and there?
So far I've just been trying to list all the variable names and somehow get the output to be in an acceptable format to get thrown straight into the c() function.
for (id in url_num){
cat(as.name(paste('test_', id, ",", sep = "")))
}
...which results in:
test_85054655,test_85023543,test_85001177,test_84988480,test_84978776,test_84952756,test_84940316,test_84916976,test_84901819,test_84884081,test_84862066,test_84848942,test_84820189,test_84814935,test_84808144,
This is close to the output I'm looking for but because it's using the cat() function it's essentially a print statement and its output can't really get put anywhere. Not to mention I feel like this method I've attempted is wrong to begin with and there must be something simpler I'm missing.
Thanks in advance for any help you guys can give me!
Troy

R data table issue

I'm having trouble working with a data table in R. This is probably something really simple but I can't find the solution anywhere.
Here is what I have:
Let's say t is the data table
colNames <- names(t)
for (col in colNames) {
print (t$col)
}
When I do this, it prints NULL. However, if I do it manually, it works fine -- say a column name is "sample". If I type t$"sample" into the R prompt, it works fine. What am I doing wrong here?
You need t[[col]]; t$col does an odd form of evaluation.
edit: incorporating #joran's explanation:
t$col tries to find an element literally named 'col' in list t, not what you happen to have stored as a value in a variable named col.
$ is convenient for interactive use, because it is shorter and one can skip quotation marks (i.e. t$foo vs. t[["foo"]]. It also does partial matching, which is very convenient but can under unusual circumstances be dangerous or confusing: i.e. if a list contains an element foolicious, then t$foo will retrieve it. For this reason it is not generally recommended for programming.
[[ can take either a literal string ("foo") or a string stored in a variable (col), and does not do partial matching. It is generally recommended for programming (although there's no harm in using it interactively).

Resources