I have to read a bunch of .xlsx files into R, which I do with readxl::read_excel(). Each of these files does not give a variable name for the first column. Since there are plenty of files, I do not want to change those manually.
In order to process the data properly, it is necessary to give these first columns a name. In the end, I want to write a function that I can call for each of these .xlsx files (e.g. using purrr::map), and within this function I would prefer a solution that is a single pipe.
Unfortunately, dplyr::rename(df, timeseries = ``) throws the following error:
Error: attempt to use zero-length variable name
Using the column index (dplyr::rename(df, timeseries = 1)) does not work either:
Error: Arguments to rename() must be unquoted variable names.
Argument timeseries is not.
How can I avoid interrupting the pipe to rename the variable with names(df)[1] <- "timeseries"?
This can be accomplished with dplyr::select() in the following way:
select(df, timeseries = 1, everything())
dplyr::select() accepts column positions, which is what makes this solution work.
Please comment if you are aware of any particular reason why this is not possible with dplyr::rename()!
If you want to use rename and a column index (in this case 1), you can use rename_() (deprecated in current dplyr releases):
rename_(df, timeseries = names(df)[1])
When chaining, use a dot:
df %>% ... %>% rename_(timeseries = names(.)[1])
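In recent dplyr releases (1.0.0 and later, as far as I can tell), rename() itself accepts a column position, so the pipe needs no workaround. A minimal sketch, assuming a toy data frame whose first column carries a placeholder name such as `...1`; the data and the column name `value` are made up for illustration:

```r
library(dplyr)

# Toy stand-in for a sheet read with readxl::read_excel(); the data are invented.
df <- data.frame(a = c("2020-01", "2020-02"), value = c(1.5, 2.5))
names(df)[1] <- "...1"  # placeholder name standing in for the unnamed first column

renamed <- df %>%
  rename(timeseries = 1)  # rename by position; the pipe stays intact

names(renamed)
```

If your dplyr version is older, the select() and rename_() variants above remain the fallback.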
I am trying to extract data from a website using a custom function:
library(tidyverse)
library(rvest)
url = "https://www.boerse.de/fundamental-analyse/garbage/" # last part does not change outcome, therefore 'garbage'
read_html_tables = function(ISIN){
content <- read_html(paste0(url,ISIN,"#guv")) %>%
html_table(dec = ",") %>%
.[c(5:10)]
return(content)
}
If I run this function with a given ISIN, e.g. US88579Y1010, I get the desired result: a list containing 6 tibbles with the data I want. But if I wrap this function in lapply() with a vector containing a few hundred ISINs, I get the following error:
list_of_all <- lapply(X = df[,2], FUN = read_html_tables)
Error: x must be a string of length 1
Called from: read_xml.character(x, encoding = encoding, ..., as_html = TRUE,
options = options)
If I call which(length(df[,2]) != 1) (the column where the ISINs are), I get integer(0), so there seems to be no issue with the ISIN column in this dataframe. And since it works with a single ISIN as input, the read_html(paste0(url,ISIN)) part seems to work as well.
I have used a very similar function before and wrapped it into lapply(). The earlier function did basically exactly what this function does, but had to do some searching and combining for the correct URL to pass into the read_html(paste0(url,ISIN)) part (on another website).
I am a bit puzzled, since this error did not occur before. But now that it has occurred, I get the same error when I run the earlier function too (which never happened before).
Maybe there is a more talented R programmer out there who can spot the issue?
Edit: Since a reply suggested the ISIN-list is the issue:
The first two are US88579Y1010 and US8318652091. Passing them individually into the function works, and so does putting them in a vector (c(ISIN1, ISIN2)) and passing that vector to lapply. But if I point at both ISINs inside the tibble (df[1:2,2]), I get the error from above. What am I missing here?
Solution:
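read_xml.character() (called by read_html()) needs a single string. On a tibble, df[,2] returns a one-column tibble rather than a vector, so lapply() passes the whole column to the function in one call. Converting the tibble to a data.frame (where df[,2] drops to a plain vector) and rerunning gives the desired output.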
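The difference between the two ways of extracting the column can be checked directly. This is a small sketch, assuming a hypothetical two-row tibble (the column names `name` and `isin` are invented; the ISINs are the two quoted above):

```r
library(tibble)

# Hypothetical table; only the two ISINs come from the question.
df <- tibble(name = c("first", "second"),
             isin = c("US88579Y1010", "US8318652091"))

col_tbl <- df[, 2]   # still a tibble: lapply() sees ONE element (the whole column)
col_vec <- df[[2]]   # plain character vector: lapply() sees one ISIN per element

# Length of what each function call would receive
per_call_tbl <- lengths(lapply(col_tbl, identity))  # a single call gets both strings
per_call_vec <- lengths(lapply(col_vec, identity))  # each call gets exactly one string
```

So lapply(df[[2]], read_html_tables) (or dplyr::pull(df, 2)) should avoid the error without converting the whole tibble.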
I'm trying to write a function to use the R tidymodels function initial_split with an argument that would let me change the strata to a different variable each time I call the function.
Using initial_split regularly like this works perfectly:
split_glab=initial_split(data,prop=0.7,strata=sp_glabrata)
Then I converted it to a function and plugged in my species parameter:
split_data=function(df,species){
initial_split(df,prop=0.7,strata=species)
}
split_data(data,species=sp_glabrata)
And get the following error:
Error: Can't subset columns that don't exist.
x Column `species` doesn't exist.
Of course, this column doesn't exist in my data since it's just an argument in my function --the column I'm trying to reference is called sp_glabrata. I can't figure out how to get my function to reference the column instead of the parameter. I don't want to just type the column name since I have to apply many similar functions to several columns and it would take forever.
Any guidance would be appreciated!
Since this is a tidyverse-style package, you can use the curly-curly operator ({{ }}) to evaluate the unquoted argument as a column name:
library(tidymodels)
split_data <- function(df, species){
initial_split(df, prop=0.7, strata={{species}})
}
Testing:
split_data(iris, species = Species)
#<Analysis/Assess/Total>
#<105/45/150>
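The same pattern carries over to other tidyverse functions that take unquoted column names. A minimal sketch with dplyr, where `count_by` is a hypothetical helper written only for this illustration:

```r
library(dplyr)

# Hypothetical helper: count rows per level of whichever column is passed unquoted.
count_by <- function(df, column) {
  df %>% count({{ column }})  # {{ }} forwards the caller's column, not the literal name `column`
}

counts <- count_by(iris, Species)
counts
```

Without the braces, count() would look for a column literally named `column`, which is the same failure as the `species` error in the question.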
I am using the pROC package and I want to plot multiple ROC curves using a for loop.
My variables are column names stored as strings in a vector, and I want pROC to go through that vector sequentially and use each string in the "predictor" argument, which seems to accept text/characters.
However, I cannot pass the variable correctly; I get the error:
'predictor' argument should be the name of the column, optionally quoted.
Here is example code using the aSAH dataset:
ROCvector<- c("s100b","ndka")
for (i in seq_along(ROCvector)){
a<-ROCvector[i]
pROC_obj <- roc(data=aSAH, outcome, as.character(a))
#code for output/print#
}
I have tried to call just "a" and using the functions print() or get() without any results.
Typing the variable name manually (with or without quotes) works, of course.
Is there something I am missing about the type of variable I should use in the predictor field?
By passing data=aSAH as first argument, you are triggering the non-standard evaluation (NSE) of arguments, dplyr-style. Therefore you cannot simply pass the column name in a variable. Note the inconsistency with outcome that you pass unquoted and looks like a variable (but isn't)? Fortunately, functions with NSE in dplyr come with an equivalent function with standard evaluation, whose name ends with _. The pROC package follows this convention. You should usually use those if you are programming with column names.
Long story short, you should use the roc_ function instead, which accepts characters as column names (don't forget to quote "outcome"):
pROC_obj <- roc_(data=aSAH, "outcome", as.character(a))
A slightly more idiomatic version of your code would be:
for (predictor in ROCvector) {
pROC_obj <- roc_(data=aSAH, "outcome", predictor)
}
roc also accepts a formula, so we can build one with paste0 and as.formula:
library(pROC)
ROCvector<- c("s100b","ndka")
for (i in seq_along(ROCvector)){
a<-ROCvector[i]
pROC_obj <- roc(as.formula(paste0("outcome~",a)), data=aSAH)
print(pROC_obj)
#code for output/print#
}
To keep the original call (i.e. without paste0), which you can use later for downstream calculations, use eval and bquote:
pROC_obj <- eval(bquote(roc(.(as.formula(paste0("outcome~",a))), data=aSAH)))
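As a variation on the formula approach, base R's reformulate() builds the formula from strings without pasting. This is only a sketch of the formula construction; the variable name `predictors` is made up, and the roc() call itself is left out:

```r
predictors <- c("s100b", "ndka")

# One formula per predictor, each with "outcome" on the left-hand side.
formulas <- lapply(predictors, function(p) reformulate(p, response = "outcome"))

formulas[[1]]
```

Each element can then be passed to roc() in place of as.formula(paste0("outcome~", a)).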
So, I have a set of data, and what I'm trying to do is find all the local maxima on the resulting curve. I read in a CSV file, which has x-values in the first column and y-values in the second, first step done, easy.
To find the maxima, I tried to use the findpeaks() function from the pracma package. However, each time I tried to run it, I got the same error:
Error: is.vector(x, mode = "numeric") is not TRUE
So, I first tried just converting this to a vector. Still got the same issue, however is.vector(x, mode = "any") was now returning true. I found some other help threads (which I can no longer find, so I can't share them, sorry!), and decided to try using lapply to coerce each entry in the new vector using as.numeric. Didn't work. Looked into ?as.numeric, and it mentioned that as.double might be better suited. Didn't work. Now I'm at a loss and not sure what to do - current working code is shown below.
plot <- read_csv("AFGP60 UV-05-04-16.csv",
col_names = FALSE, na = "null", skip = 2,n_max = numrow)
diffplot <- c(plot[1:601,2])
diffplot <- lapply(diffplot,as.double)
findpeaks(diffplot)
Try diffplot <- as.numeric(plot[[2]][1:601]).
The problem is that c(plot[1:601, 2]) is a list holding one column rather than a numeric vector, and the column may also have been read as character. The above code addresses both. However, there are other issues with your code. First, plot is a base function used for plotting. Naming a variable with such a name is bad practice.
Second, once diffplot is a vector (the first 601 rows of the second column), there is no need to convert each element separately with the lapply function.
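The check that findpeaks() fails on can be reproduced without the CSV file. A small sketch, assuming the data arrive as a tibble (which is what readr::read_csv() returns); the values are invented:

```r
library(tibble)

# Invented stand-in for the CSV contents: x in column 1, y in column 2.
dat <- tibble(x = 1:5, y = c(0.1, 0.8, 0.3, 0.9, 0.2))

as_list <- c(dat[1:5, 2])               # list of length 1 holding the column
still_list <- lapply(as_list, as.double) # lapply() returns a list again
as_vec <- dat[[2]][1:5]                 # plain numeric vector

ok_list <- is.vector(still_list, mode = "numeric")  # FALSE: this is what findpeaks() rejects
ok_vec <- is.vector(as_vec, mode = "numeric")       # TRUE
```

The vector form is what should be handed to findpeaks().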
I have a character vector (chr [1:5] named keynn) of column names on which I would like to perform an aggregation.
Every element of the vector is a valid column name of the data frame (mydata), but it is a string and not the variable ("YEAR" instead of mydata$YEAR).
I tried using get() to return the column from the name and it works, for the first element, like so:
attach(mydata)
aggregate(mydata, by=list(get(keynn, .GlobalEnv)), FUN=length)
I tried using mget() since my vector has more than one element, like this:
attach(mydata)
aggregate(mydata, by=list(mget(keynn, .GlobalEnv)), FUN=length)
but I get an error:
value for 'YEAR' not found.
How can I get the equivalent of get for multiple columns to aggregate by?
Thank you!
I would suggest not using attach in general.
If you are just trying to get columns from mydata, you can use [ to index the list:
aggregate(mydata, by = mydata[keynn], FUN = length)
should work -- and is very clear that you want to get keynn from mydata
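For a runnable illustration of the same call, here is a sketch on the built-in mtcars data; the choice of `cyl` and `gear` as grouping columns and `mpg` as the counted column is arbitrary:

```r
keynn <- c("cyl", "gear")  # grouping columns, given as strings

# mtcars[keynn] is a data frame, i.e. a named list of columns, which is what `by` expects.
res <- aggregate(mtcars["mpg"], by = mtcars[keynn], FUN = length)

res  # one row per cyl/gear combination; the mpg column holds the group size
```

Aggregating only the columns of interest (here mtcars["mpg"]) keeps the grouping columns from being counted alongside everything else.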
The problem with using attach is that it adds mydata to the search path (not copying to the global environment)
try
attach(mydata)
mget(keynn, .GlobalEnv)
so if you were to use mget and attach, you need
mget(keynn, .GlobalEnv, inherits = TRUE)
so that it will not just search in the global environment.
But that is more effort than it is worth (IMHO)
The reason get works is that inherits = TRUE by default. You could thus use lapply(keynn, get) if mydata were attached, but again this is ugly and unclear about what it is doing.
Another approach would be to use data.table, which will evaluate the by argument within the data.table in question:
library(data.table)
DT <- data.table(mydata)
DT[, {what you want to aggregate} , by =keynn]
Note that keynn doesn't need to be a character vector of names, it can be a list of names or a named list of functions of names etc
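As one concrete instance of the data.table pattern, counting rows per group uses the special symbol .N. A sketch on the built-in mtcars data, with `cyl` and `gear` chosen arbitrarily as the grouping columns:

```r
library(data.table)

keynn <- c("cyl", "gear")    # grouping columns, given as strings
DT <- as.data.table(mtcars)

counts <- DT[, .N, by = keynn]  # .N is the number of rows in each group
counts
```

Any other expression in the middle slot (for example a mean of one column) is evaluated per group in the same way.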