Issue with summary() function in R - r

I am new to programming and trying to learn R using swirl.
In one of the exercises I was told to use the summary function on a dataset. However I encountered a discrepancy in the way the summary was printed:
Instead of summarizing the categorical variable values, it instead says something about length, class and mode.
I went around searching for why this might be happening to no avail, but I did manage to find what the output is supposed to look like:
Any help would be greatly appreciated!

This behaviour is due to the option stringsAsFactors, which is FALSE by default on R 4. Previously it was TRUE by default:
From R 4 news: "now uses a `stringsAsFactors = FALSE' default, and hence by default no longer converts strings to factors in calls to data.frame() and read.table()."
A way to return to the previous behaviour with the same code is to run options(stringsAsFactors=T) before building dataframes. However, there is a warning saying this option will eventually be removed, as explained here.
For your new code, you can use the stringsAsFactors parameter, for instance data.frame(..., stringsAsFactors=T).
If you already have dataframes and you want to convert them, you could use this function to convert all character variables (you will have to adapt if only some variables need conversion):
to.factors <- function(df) {
i <- sapply(df, is.character)
df[i] <- lapply(df[i], as.factor)
df
}

Related

R minus in row name makes name sometimes unaccessible

I have an error I cant recreate with smaller examples so I hope anyone has an idea where to look at.
The problem
As described in the code-comments: rownamesX is not found in the rownames of the matrix (But they are there of course). If I print the not-found names, something like this comes out:
hsa−miR−00
It should be
hsa-miR-00
Further, the I tested some different approaches:
Code works if I source the subscript directly in Rstudio in the console (ctrl-shift-s hotkey)
Code works if I call the function in the Console (Ctrl-Enter on the line)
Code does not work if the subscript is sourced in the main script by (Ctrl-Enter on the line)
Code does not work if the whole main.R is sourced (ctrl-shift-s hotkey)
My environment:
The data matrix
~200k elements
rownames in the form of "type-type2-number"
colnames (=samples) : "S1", "S2", ...
The call:
A main script
sources a subscript
sources a function
calls the function with the data matrix as parameter
The function:
myFunction <- function(rownamesX = c("type-type2-number")
,mat){
indexes <- which(rownames(mat) %in% rownamesX) # This is empty
mat.part <- mat[indexes, ] # therefore his is empty
resp <- mat.part[1, ] - mat.part[2, ] # therefore this yields an error
}
The mistake was quite easy:
There are more than one "-":
−
-
These two look even more equal in Rstudio than here. So I looked for the first(larger) one when the second (smaller) one was in the rowname

Fixing a multiple warning "unknown column"

I have a persistent multiple warning of "unknown column" for all types of commands (e.g., str(x) to installing updates on packages), and not sure how to debug this or fix it.
The warning "unknown column" is clearly related to a variable in a tbl_df that I renamed, but the warning comes up in all kinds of commands seemingly unrelated to the tbl_df (e.g., installing updates on a package, str(x) where x is simply a character vector).
This is an issue with the Diagnostics tool in RStudio (the tool that shows warnings and possible mistakes in your code). It was partially fixed at this commit in RStudio v1.1.103 or later by #kevin-ushey. That fix was partial, because the warnings still appeared (albeit with less frequency). This issue was reported with a reproducible example at https://github.com/rstudio/rstudio/issues/7372 and it was fixed on RStudio v1.4 pull request.
Update to the latest RStudio release to fix this issue. Alternatively, there are several workarounds available, choose the solution you prefer:
Disable the code diagnostics for all files in Preferences/Code/Diagnostics
Disable all diagnostics for a specific file:
Add at the beginning of the opened file(s):
# !diagnostics off
Then save the files and the warnings should stop appearing.
Disable the diagnostics for the variables that cause the warning
Add at the beginning of the opened file(s):
# !diagnostics suppress=<comma-separated list of variables>
Then save the files and the warnings should stop appearing.
The warnings appear because the diagnostics tool in RStudio parses the source code to detect errors and when it performs the diagnostic checks it accesses columns in your tibble that are not initialized, giving the Warning we see. The warnings do not appear because you run unrelated things, they appear when the RStudio diagnostics are executed (when a file is saved, then modified, when you run something...).
I have been encountering the same problem, and although I don't know why it occurs, I have been able to pin down when it occurs, and thus prevent it from happening.
The issue seems to be with adding in a new column, derived from indexing, in a base R data frame vs. in a tibble data frame. Take this example, where you add a new column (age) to a base R data frame:
base_df <- data.frame(id = c(1:3), name = c("mary", "jill","steve"))
base_df$age[base_df$name == "mary"] <- 47
That works without returning a warning. But when the same is done with a tibble, it throws a warning (and consequently, I think causing the weird, seemingly unprovoked, multiple warning issue):
library(tibble)
tibble_df <- tibble(id = c(1:3), name = c("mary", "jill","steve"))
tibble_df$age[tibble_df$name == "mary"] <- 47
Warning message:
Unknown column 'age'
There are surely better ways of avoiding this, but I have found that first creating a vector of NAs does the job:
tibble_df$age <- NA
tibble_df$age[tibble_df$name == "mary"] <- 47
I have faced this issue when using the "dplyr" package.
For those facing this problem after using the "group_by" function in the "dplyr" library:
I have found that ungrouping the variables solves the unknown column warning problem. Sometimes I have had to iterate through the ungrouping several times until the problem is resolved.
Converting the class into data.frame solved the problem for me:
library(dplyr)
df <- data.frame(id = c(1,1:3), name = c("mary", "jo", "jill","steve"))
dfTbl <- df %>%
group_by(id) %>%
summarize (n = n())
class(dfTbl) # [1] "tbl_df" "tbl" "data.frame"
dfTbl = as.data.frame(dfTbl)
class(dfTbl) # [1] "data.frame"
Borrowed the partial script from #adts
I had this problem when dealing with tibble and lapply functions together. The tibble seemed to save things as a list inside the dataframe.
I solved it by using unlist before adding the results of an lapply function to the tibble.
I ran into this problem too except through a tibble created using a dyplyr block. Here's slight modification of sabre's code to show how I came to the same error.
library(dplyr)
df <- data.frame(id = c(1,1:3), name = c("mary", "jo", "jill","steve"))
t <- df %>%
group_by(id) %>%
summarize (n = n())
t
str(t)
t$newvar[t$id==1] <- 0
I know this is an old thread, but I just encountered the same problem when loading a spatial vector in geopackage format with the package sf. Using as_tibble=FALSE worked for me. The file was loaded as an sp object but everything still worked fine. As mentioned by #sabre, trying to force an object into a tibble seems to be making the problems while trying to index a column that was not anymore there.
Let's say I wanted to select the following column(s)
best.columns = 'id'
For me the following gave the warning:
df%>% select_(one_of(best.columns))
While this worked as expected, although, as far as I know dplyr, this should be identical.
df%>% select_(.dots = best.columns)
I get these warnings when I rename a column using dplyr::rename after reading it using the readr package.
The old name of the column is not renamed in the spec attribute. So removing the the spec attribute makes the warnings go away. Also removing the "spec_tbl_df" class seems like a good idea.
attr(dat, "spec") <- NULL
class(dat) <- setdiff(class(dat), "spec_tbl_df")
Building on the answer by #stok ( https://stackoverflow.com/a/47848259/7733418 ), who found this problem when using group_by (which also converts your data.frame to a tibble), and solved it in the same way.
For me the problem was ultimately due to the use of "slice()".
Slice() converted my data.frame to a tibble, causing this error.
Checking the class of your data.frame and re-converting it to a data.frame whenever a function converts it to a tibble could solve this issue.

dplyr rename command with spaces

I have tried multiple variations of the rename function in dplyr.
I have a data frame called from a database called alldata, and a column within the data frame named WindDirection:N. I am trying to rename it as Wind Direction. I understand creating variable names containing spaces is not a good practice, but I want it to be named as such to improve readability for a selectInput list in shiny, and even if I settle for renaming it WindDirection I am getting all of the same error messages.
I have tried:
rename(alldata, Wind Direction = WindDirection:N)
which gives the error message:
Error: unexpected symbol in "rename(alldata, Wind Direction"
rename(alldata, `Wind Direction` = `WindDirection:N`)
which does not give an error message, but also does not rename the variable
rename(alldata, "Wind Direction" = "WindDirection:N")
which gives the error message:
Error: Arguments to rename must be unquoted variable names. Arguments Wind Direction are not.
I then tried the same 3 combinations of the reverse order (because I know that is how plyr works even though I do not call it to be used using the library command earlier in my code) putting the old variable first and the new variable 2nd with similar error messages.
I then tried to specify the package as I have 1 example below and tried all 6 combinations again.
dplyr::rename(alldata, `Wind Direction` = `WindDirection:N`)
to similar error messages as the first time.
I have used the following thread as an attempt to do this myself.
Replacement for "rename" in dplyr
as agenis pointed out, my mistake was not redefining the dataframe after renaming the variable.
So where I had
dplyr::rename(alldata,Wind Direction=WindDirection:N)
I should have
alldata <- dplyr::rename(alldata,Wind Direction=WindDirection:N)

read.csv() and colClasses [duplicate]

This question already has answers here:
Error in <my code> : object of type 'closure' is not subsettable
(6 answers)
Closed 8 years ago.
I am trying to use read.csv() command, but I do not understand colClasses part to run coding. Does anyone explain what it is, and also give me example of simple coding for read.csv()?
Also, if I run my coding for read.csv(), I get an error
> object of type 'closure' is not subsettable
What type of error is this? Last time I run my code, it worked, but now I get this. I am not sure what change I should make here. This is my code:
Precipfiles[1:24] <- list.files(pattern=".csv")
> DF <- NULL
> for (f in Precipfiles[1:24]) {
data[1:24]<-read.csv(f,header=T,sep="\t",na.string="",colClasses="character")
DF[1:24]<-rbind(DF,data[1:24])
}
Basically, I load all data and put them together, but I have not able to use merge() command since I am having troubles I listed above.
I think I should not use colClasses="character" because data I am using are all numeric in 200 by 200 matrix. There are 24 data files that I have to put them together.
If you have any suggestions and advise to improve this coding please let me know.
Thank you for all of your help.
You really don't need the [1:24] in every assignment, this is what is causing your problems. You are assign to a subset of a indexed vector of some description.
The error message when are trying to assign to data[1:24], without data being assigned previously (in your previous usage (which you mentioned worked), data was probably a list or data.frame you had created.). As such data is a function (for loading data associated with packages, see ?data) and the error you saw is saying that (a function includes a closure)
I would suggest something like
Precipfiles <- list.files(pattern=".csv")
DFlist <- lapply(Precipfiles, read.table, sep = '\t',
na.string = '', header = TRUE)
bigDF <- do.call(rbind, DFlist)
# or, much faster using data.table
library(data.table)
bigDF <- rbindlist(DFlist)

Why can't I pass a dataset to a function?

I'm using the package glmulti to fit models to several datasets. Everything works if I fit one dataset at a time.
So for example:
output <- glmulti(y~x1+x2,data=dat,fitfunction=lm)
works just fine.
However, if I create a wrapper function like so:
analyze <- function(dat)
{
out<- glmulti(y~x1+x2,data=dat,fitfunction=lm)
return (out)
}
simply doesn't work. The error I get is
error in evaluating the argument 'data' in selecting a method for function 'glmulti'
Unless there is a data frame named dat, it doesn't work. If I use results=lapply(list_of_datasets, analyze), it doesn't work.
So what gives? Without my said wrapper, I can't lapply a list of datasets through this function. If anyone has thoughts or ideas on why this is happening or how I can get around it, that would be great.
example 2:
dat=list_of_data[[1]]
analyze(dat)
works fine. So in a sense it is ignoring the argument and just literally looking for a data frame named dat. It behaves the same no matter what I call it.
I guess this is -yet another- problem due to the definition of environments in the parse tree of S4 methods (one of the resons why I am not a big fan of S4...)
It can be shown by adding quotes around the dat :
> analyze <- function(dat)
+ {
+ out<- glmulti(y~x1+x2,data="dat",fitfunction=lm)
+ return (out)
+ }
> analyze(test)
Initialization...
Error in eval(predvars, data, env) : invalid 'envir' argument
You should in the first place send this information to the maintainers of the package, as they know how they deal with the environments internally. They'll have to adapt the functions.
A -very dirty- workaround for yourself, is to put "dat" in the global environment and delete it afterwards.
analyze <- function(dat)
{
assign("dat",dat,envir=.GlobalEnv) # put the dat in the global env
out<- glmulti(y~x1+x2,data=dat,fitfunction=lm)
remove(dat,envir=.GlobalEnv) # delete dat again from global env
return (out)
}
EDIT:
Just for clarity, this is really about the worst solution possible, but I couldn't manage to find anything better. If somebody else gives you a solution where you don't have to touch your global environment, by all means use that one.

Resources