I'm trying to use the "assign" function inside a for loop to assign the contents of a file to a variable.
When I call the function, it produces the correct answer, but at the end it gives me the following warning message:
In assign(fileList, read.csv(fileList[i])) :
only the first element is used as variable name
If I run corr("specdata", 129), I can see the correct answer and it prints all the right values, but if I assign the result to a variable, the variable turns out to be "NULL". An example:
cr <- corr("specdata", 150)
head(cr)
NULL
It prints all the values that fit the criteria, but it seems that it can't pass the values to "cr". Other examples:
> class(cr)
[1] "NULL"
> cr
NULL
The code that I'm currently using, if it is helpful:
corr <- function(directory, threshold){
  if (directory == "specdata"){
    setwd("C:/Users/User/Downloads/specdata")
    fileList <- list.files(pattern="*.csv")
    for (i in 1:length(fileList)){
      fileValues <- assign(fileList, read.csv(fileList[i]))
      okValues <- na.omit(fileValues)
      completeCases <- sum(complete.cases(okValues))
      if (completeCases > threshold) {
        sulfate <- okValues["sulfate"]
        nitrate <- okValues["nitrate"]
        correlation <- cor(sulfate, nitrate, use="complete.obs", method=c("pearson", "kendall", "spearman"))
        #print(correlation)
      }
      else if (completeCases <= threshold){
        #print("0")
      }
      i = i+1
    }
  }
  else {
    print("There's no such directory")
  }
}
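I'm a beginner in R, so if there's any way to fix this issue, or to read every file in a folder and manipulate each one separately, I'd be glad to hear it.
assign is used to do just that, "assign" a value to a variable (up to semantics). Its first argument is the variable name as a single string, so passing the whole fileList vector triggers the warning that only the first element is used. Here you only need the value, so I suspect you want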
fileValues <- read.csv(fileList[i])
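The NULL result has a separate cause that is worth spelling out: an R function returns the value of its last expression, and a for loop evaluates to NULL. A minimal sketch of a version that collects and returns the correlations follows. The name corr_sketch and the in-memory demo data are made up for illustration, and the function takes a list of data frames instead of reading files so that it runs on its own.

```r
# Hypothetical helper: correlation of sulfate and nitrate for every data frame
# that has more than `threshold` complete rows.
corr_sketch <- function(data_list, threshold = 0) {
  results <- numeric(0)
  for (df in data_list) {
    ok <- df[complete.cases(df), ]          # keep only rows without NA
    if (nrow(ok) > threshold) {
      results <- c(results, cor(ok$sulfate, ok$nitrate))
    }
  }
  results  # last expression is the return value; ending on the loop would give NULL
}

# Made-up data standing in for the contents of the CSV files
demo <- list(
  data.frame(sulfate = c(1, 2, 3, NA), nitrate = c(2, 4, 6, 1)),
  data.frame(sulfate = c(1, NA),       nitrate = c(NA, 2))
)
cr <- corr_sketch(demo, threshold = 2)
```

With real files, the loop body would start from read.csv(fileList[i]) as suggested above; the important part is that the function ends with the object holding the results.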
I am writing a function that will go through a list of files in a directory, count the number of complete cases in each, and, if the number of complete cases is above a given threshold, calculate a correlation. The output must be a numeric vector of correlations for all files that meet the threshold requirement. This is what I have so far (it gives me Error: unexpected '}' in "}"). Full disclosure: I am a complete newbie, as in I wrote my first code 2 weeks ago. What am I doing wrong?
correlation <- function (directory, threshhold = 0) {
  all_files <- list.files(path = getwd())
  correlations_list <- numeric()
  for (i in seq_along(all_files)) {
    dataFR2 <- read.csv(all_files[i])
    c <- c(sum(complete.cases(dataFR2)))
    if c >= threshhold {
      d <- cor(dataFR2$sulfate, dataFR2$nitrate, use = "complete.obs", method = c("pearson"))
      correlations_list <- c(correlations_list, d)
    }
  }
  correlations_list
}
"Unexpected *" errors are syntax errors, often caused by a missing parenthesis, comma, or curly bracket. In this case, you need to change if c >= threshhold { to if (c >= threshhold) {. R's syntax requires the condition of an if statement to be wrapped in parentheses.
I'd also strongly recommend that you not use c as a variable name. c() is one of the most commonly used R functions, and giving an object the same name will make your code confusing to anyone else reading it.
Lastly, I'd recommend that you make your output the same length as the number of files. As you have it, there won't be any way to know which files met the threshold to have their correlations calculated. I'd make correlations_list have the same length as the number of files, and add names to it so you know which correlation belongs to which file. This has the side benefit of not "growing an object in a loop", which is an anti-pattern known for its inefficiency. A rewritten function would look something like this:
correlation <- function (directory, threshhold = 0) {
  all_files <- list.files(path = getwd())
  correlations_list <- numeric(length(all_files)) ## initialize to full length
  for (i in seq_along(all_files)) {
    dataFR2 <- read.csv(all_files[i])
    n_complete <- sum(complete.cases(dataFR2))
    if(n_complete >= threshhold) {
      d <- cor(dataFR2$sulfate, dataFR2$nitrate, use = "complete.obs", method = c("pearson"))
    } else {
      d <- NA
    }
    correlations_list[i] <- d
  }
  names(correlations_list) <- all_files
  correlations_list
}
I want to calculate the percentage of sentences in a text that contain a double quotation mark and have written the following function to do so:
library(tokenizers)
quote_ratio <- function(text){
  sentences <- tokenize_sentences(text, simplify = TRUE)
  quote_sentences <- 0
  for (i in sentences){
    quote_hits <- grepl('\\"', i)
    if (quote_hits == TRUE) {
      quote_sentences <- quote_sentences + 1
    }
  }
  ratio <- quote_sentences / length(sentences)
  return (ratio)
}
The function works in many cases but with more data I run into the issue of having NA and/or NULL values in my sentences.
library(tm)
corpus = VCorpus(DirSource("/path/to/directory"))
ratios <- tm_map(corpus, content_transformer(quote_ratio))
# Error in if (quote_hits == TRUE) { : argument is of length zero
# In addition: Warning message:
# In if (quote_hits == TRUE) { : the condition has length > 1 and only the first element will be used
I've tried changing the if statement to check for null and NA values as follows:
if (!is.na(quote_hits) && !is.null(quote_hits) && quote_hits == TRUE) {
But this only produces more errors:
# Error in if (!is.na(quote_hits) && !is.null(quote_hits) && quote_hits == : missing value where TRUE/FALSE needed
Is there a better way to formulate the if statement and/or function? Many thanks.
EDIT:
I later realized it was likely a mistake to use the tm_map and content_transformer functions to calculate this. The function worked just fine when I stored the texts in a vector and used lapply.
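One plausible reading of the errors above is that if() needs a single TRUE or FALSE, while the value reaching it was sometimes empty, longer than one, or NA. Since grepl() is vectorized, a loop-free version can avoid if() on individual sentences altogether. The following is only a sketch: the name quote_ratio_vec and the sample sentences are made up, and the function takes already-tokenized sentences so that it runs in base R without the tokenizers package.

```r
# Hypothetical variant of quote_ratio() that works on a character vector of
# sentences and tolerates NA, NULL, and empty input.
quote_ratio_vec <- function(sentences) {
  sentences <- sentences[!is.na(sentences)]      # drop missing sentences
  if (length(sentences) == 0) return(NA_real_)   # nothing to measure
  # grepl() returns one TRUE/FALSE per sentence; mean() of that is the ratio
  mean(grepl('"', sentences, fixed = TRUE))
}

# Made-up example input
example_sentences <- c('He said "hi".', "No quote here.", NA,
                       '"Yes," she said.', "Plain.")
ratio <- quote_ratio_vec(example_sentences)
```

With the original setup, the output of tokenize_sentences(text, simplify = TRUE) for a single text could be passed in as the sentences argument.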
corr <- function(directory, threshold) {
  files <- list.files(directory, full.names = TRUE)
  nu <- numeric()
  for(i in length(files)) {
    my_data <- read.csv(files[i])
    if (sum(complete.cases(my_data)) >= threshold) {
      vec_sul <- my_data[complete.cases(my_data),]$sulfate
      vec_nit <- my_data[complete.cases(my_data),]$nitrate
      nu <- c(nu, cor(vec_sul, vec_nit))
    }
  }
  nu
}
I have a set of .csv files inside the directory I wish to pass as an argument to the function shown above. I also pass a threshold value as the second argument. The objective is to read through all the files in the directory and check whether each file has more complete cases than the threshold value passed as the second argument.
Files that meet this criterion are examined further: the correlation between two of their variables, sulfate and nitrate, is computed. The correlation values for all files with more complete cases than the threshold are concatenated into a numeric vector. At the end of the loop, I want the function to return the vector of correlation values computed inside the "if" block.
cr <- corr("specdata", 150)
When I run the above line of code in console, I get a numerical variable which is null. Could someone help me fix the code?
Although this kind of error has been seen many times, it still happens. You want
i in 1:length(files)
You get numeric(0) (the "numeric null" you talk about), because your loop only reads in the final file. I guess the final file does not satisfy sum(complete.cases(my_data)) >= threshold so nothing is added to nu, initialized as numeric(0).
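The difference is easy to see on a toy vector. The sketch below uses made-up file names and records which indices each loop visits. It also shows seq_along(), which behaves like 1:length(x) for non-empty input but does not iterate at all when the input is empty.

```r
files <- c("001.csv", "002.csv", "003.csv")   # made-up file names

# for (i in length(files)) iterates over the single value 3
visited_wrong <- c()
for (i in length(files)) visited_wrong <- c(visited_wrong, i)

# for (i in seq_along(files)) iterates over 1, 2, 3
visited_right <- c()
for (i in seq_along(files)) visited_right <- c(visited_right, i)

# With empty input, 1:length(x) counts down from 1 to 0, seq_along(x) is empty
empty <- character(0)
n_iter_colon     <- length(1:length(empty))    # two iterations: 1 and 0
n_iter_seq_along <- length(seq_along(empty))   # no iterations
```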
Also, I would like to point out that
vec_sul <- my_data[complete.cases(my_data),]$sulfate
vec_nit <- my_data[complete.cases(my_data),]$nitrate
nu <- c(nu, cor(vec_sul, vec_nit))
can be replaced by
nu <- c(nu, with(my_data, cor(sulfate, nitrate, use = "complete.obs")))
Consider using lapply() across the list of files, which avoids growing a vector inside a loop. The only adjustment is that lapply() returns a list with the same length as its input, files, so an else branch is added to return NA for data frames that do not meet the threshold. After the call, the list is flattened with unlist() and these NAs are removed from nu.
corr <- function(directory, threshold) {
  files <- list.files(directory, full.names = TRUE)
  nu <- lapply(files, function(i) {
    my_data <- read.csv(i)
    if (sum(complete.cases(my_data)) >= threshold) {
      vec_sul <- my_data[complete.cases(my_data),]$sulfate
      vec_nit <- my_data[complete.cases(my_data),]$nitrate
      temp <- cor(vec_sul, vec_nit)
    } else {
      temp <- NA_real_ # SET NAs
    }
    return(temp)
  })
  nu <- unlist(nu)     # FLATTEN LIST TO A NUMERIC VECTOR
  nu <- nu[!is.na(nu)] # REMOVE NAs
  return(nu)
}
Alternatively, try vapply() (arguably slightly faster), which lets you declare that each result is a single numeric value and returns a numeric vector directly:
corr <- function(directory, threshold) {
  files <- list.files(directory, full.names = TRUE)
  nu <- vapply(files, function(i) {
    my_data <- read.csv(i)
    if (sum(complete.cases(my_data)) >= threshold) {
      vec_sul <- my_data[complete.cases(my_data),]$sulfate
      vec_nit <- my_data[complete.cases(my_data),]$nitrate
      temp <- cor(vec_sul, vec_nit)
    } else {
      temp <- NA # SET NAs
    }
    return(temp)
  }, numeric(1))
  nu <- nu[!is.na(nu)] # REMOVE NAs
  return(nu)
}
So, I have a function:
complete <- function(directory,id = 1:332 ) {
  directory <- list.files(path="......a")
  g <- list()
  for(i in 1:length(directory)) {
    g[[i]] <- read.csv(directory[i],header=TRUE)
  }
  rbg <- do.call(rbind,g)
  rbgr <- na.omit(rbg) #reads files and omits NA's
  complete_subset <- subset(rbgr,rbgr$ID %in% id,select = ID)
  table.rbgr <- sapply(complete_subset,table)
  table.rbd <- data.frame(table.rbgr)
  id.table <- c(id)
  findla.tb <- cbind (id.table,table.rbd)
  names(findla.tb) <- c("id","nob")
  print(findla.tb) #creates table with number of observations
}
Basically, when you call it with a specific numeric id (say 4),
you are supposed to get this output:
id nobs
15 328
So, I just need the nobs data to be fed into another function which measures the correlation between two columns if the nobs value is greater than another arbitrarily determined value(T). Since nobs is determined by the value of id, I am uncertain how to create a function that takes into account the output of the other function?
I have tried something like this:
corr <- function (directory, t) {
  directory <- list.files(path=".......")
  g <- list()
  for(i in 1:length(directory)) {
    g[[i]] <- read.csv(directory[i],header=TRUE)
  }
  rbg <- do.call(rbind,g)
  g.all <- na.omit(rbg) #reads files and removes observations
  source(".....complete.R") #sourcing the complete function above
  complete("spec",id)
  g.allse <- subset(g.all,g.all$ID %in% id,scol )
  g.allnit <- subset(g.all,g.all$ID %in% id,nit )
  for(g.all$ID %in% id) {
    if(id > t) {
      cor(g.allse,g.allnit) #calculate correlation of these two columns if they have similar id
    }
  }
  #basically for each id that matches the ID in g.all function, if the id > t variable, calculate the correlation between columns
}
complete("spec", 3)
cr <- corr("spec", 150)
head(cr)
I have also tried to make the complete function a data.frame but it does not work and it gives me the following error:
error in data.frame(... check.names = false) arguments imply differing number of rows. So, I am not sure how to proceed....
First off, a reproducible example always helps in getting your question answered, along with a clear explanation of what your functions do/are supposed to do. We cannot run your example code.
Next, you seem to have an error in your corr function. You make multiple references to id but never actually populate this variable in your example code. So we'll just have to guess at what you need help with.
I think what you are trying to do is:
given an id, call complete with that id
use the nobs from that in your code.
In this case, you need to make sure to store the output of your call to complete, e.g.
comp <- complete('spec', id)
You can access the id value via comp['id'] and the nobs value via comp['nobs'], so you could do e.g.
if (comp['nobs'] > t) {
# do stuff e.g.
cor(g.allse, g.allnit)
}
Make sure you store the output of cor somewhere if you wish to actually get it back later.
You will have to fix the problem of id not being defined yourself, because it is unclear what you want that to be.
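To make the idea concrete, here is a small self-contained sketch of one function consuming the stored output of another. Everything in it is an assumption for illustration: complete_counts and corr_from_counts are hypothetical stand-ins for complete and corr, they take a data frame that is already loaded instead of a directory, and the demo data is made up.

```r
# Hypothetical stand-in for complete(): one row per id with its number of
# complete observations. It returns the table instead of only printing it.
complete_counts <- function(dat, id) {
  ok <- dat[complete.cases(dat) & dat$ID %in% id, ]
  data.frame(id = id,
             nobs = vapply(id, function(k) sum(ok$ID == k), integer(1)))
}

# Hypothetical stand-in for corr(): stores the output of complete_counts()
# and uses its nobs column to decide which ids get a correlation.
corr_from_counts <- function(dat, id, t) {
  comp <- complete_counts(dat, id)       # store the result, do not discard it
  keep <- comp$id[comp$nobs > t]         # ids with more than t complete rows
  vapply(keep, function(k) {
    rows <- dat[dat$ID == k & complete.cases(dat), ]
    cor(rows$sulfate, rows$nitrate)
  }, numeric(1))
}

# Made-up data: ID 1 has three complete rows, ID 2 has one
dat <- data.frame(
  ID      = c(1, 1, 1, 2, 2, 2),
  sulfate = c(1, 2, 3, 5, NA, 1),
  nitrate = c(3, 2, 1, 4, 2, NA)
)
comp <- complete_counts(dat, id = 1:2)
cr   <- corr_from_counts(dat, id = 1:2, t = 2)
```

The design point is that the first function returns its table as a value, so the second can index into it; a function that only prints cannot be reused this way.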
I want to write a function that creates a time series, but I'd like it to generate the name of the time series as part of the call.
Sort of
makeTS(my.data.frame, string(dateName), string(varName)){
-create time series tsAux from my.data.frame, dateName and varName
-create string tsName
(-the creation of tsAux is not a problem)
assign(tsName, tsAux)
return(tsName)
}
This, perhaps not surprisingly, returns the string tsName, but is there any way that I can make it return a named object?
I've tried with
do.call('<-', list(tsName, tsAux))
and I've also tried using
as.name(tsName) <- tsAux
but nothing seems to work.
I know that
tsName <- makeTS2(my.data.frame, dateName, varName)
would do the trick (where makeTS2() just generates the time series tsAux and returns it), but is there any way to make it work with one function call?
Thanks!
Can you? Sure:
makeTS <- function(dat, varName) {
result <- NA
assign( varName, result, envir = .GlobalEnv )
result
}
> makeTS(NA, "test")
[1] NA
> test
[1] NA
Should you? Almost surely not.
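A commonly suggested alternative to writing into the global environment is to return the object inside a named list, so the generated name travels with the value. The sketch below is only an illustration: make_ts, the "ts_" prefix, and the demo data frame are assumptions, not part of the question's code.

```r
# Hypothetical make_ts(): builds a time series from one column and returns it
# in a list whose element name is generated inside the function.
make_ts <- function(dat, var_name) {
  ts_aux  <- ts(dat[[var_name]])          # the time series itself
  ts_name <- paste0("ts_", var_name)      # the generated name
  setNames(list(ts_aux), ts_name)         # no side effects outside the function
}

df <- data.frame(sales = c(10, 12, 9, 14))   # made-up data
series <- make_ts(df, "sales")
series_names <- names(series)
first_series <- series[["ts_sales"]]
```

Results from several calls can then be combined with c() into one list and looked up by name, which keeps the workspace free of variables created behind the caller's back.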
Ari B.'s answer is good. You could also use assign() with a variable.
> makeTS <- function(dat) {
+ return(666)
+ }
> varName <- "tmp"
> tmp
Error: object 'tmp' not found
> assign(varName, makeTS(1))
> tmp
[1] 666