Closing unused connection after sqldf::read.csv.sql() - r

The sqldf::read.csv.sql() function has been useful for retrieving only a small portion of a large CSV. However, the connection remains open, and after running it a few times it eventually produces warnings like:
Warning messages:
closing unused connection 11 (C:\Users\wibeasley\AppData\Local\Temp\asfasdf\fileasdfasdfasdf)
Four years ago, it was recommended to issue base::closeAllConnections().
Is there a newer way to selectively close only the connection created by sqldf::read.csv.sql()?
path <- tempfile()
write.csv(mtcars, file=path, row.names=F)
# read.csv(path)
ds <- sqldf::read.csv.sql(path, "SELECT * FROM file", eol="\n")
base::closeAllConnections() # I'd like to be more selective than 'All'.
unlink(path)
The real code is the middle two lines. The first three lines set up the pretend file. The final base::unlink() deletes the temp CSV.
My attempts to pass an existing file connection (so I can later explicitly close it) apparently still leave the connection open when I run it several times:
Warning messages:
1: In .Internal(sys.call(which)) : closing unused connection 13 ()
path <- tempfile()
write.csv(mtcars, file=path, row.names=F)
ff <- base::file(path) # Create an explicit connection.
ds <- sqldf::read.csv.sql(sql="SELECT * FROM ff")
base::close(ff)
unlink(path)
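One idea would be to look the stray connection up by its number and close only that one; a rough sketch using base R's connection tools (untested against the connection sqldf opens internally):
showConnections()         # list open connections and their numbers
con <- getConnection(11)  # 11 is the number reported in the warning above
close(con)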

Both snippets in my OP still produce warnings about open connections. The approach below avoids them.
path_db <- tempfile(fileext = ".sqlite3")
path_csv <- tempfile(fileext = ".csv")
write.csv(mtcars, file=path_csv, row.names = F)
# read.csv(path_csv) # Peek at the results.
db <- DBI::dbConnect(RSQLite::SQLite(), dbname = path_db)
# DBI::dbExecute(db, "DROP TABLE if exists car;") # If desired
RSQLite::dbWriteTable(db, name = "car", value = path_csv)
ds <- DBI::dbGetQuery(db, "SELECT * FROM car")
str(ds)
#> 'data.frame': 32 obs. of 11 variables:
#> $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> $ cyl : int 6 6 4 6 8 6 8 4 4 6 ...
#> $ disp: num 160 160 108 258 360 ...
#> $ hp : int 110 110 93 110 175 105 245 62 95 123 ...
#> $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#> $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
#> $ qsec: num 16.5 17 18.6 19.4 17 ...
#> $ vs : int 0 0 1 1 0 1 0 1 1 1 ...
#> $ am : int 1 1 1 0 0 0 0 0 0 0 ...
#> $ gear: int 4 4 4 3 3 3 3 4 4 4 ...
#> $ carb: int 4 4 1 1 2 1 4 2 2 4 ...
DBI::dbDisconnect(db)
unlink(path_db)
unlink(path_csv)
Created on 2022-03-23 by the reprex package (v2.0.1)

Related

Combine multiple dataframes by selecting names dynamically

I received a script that generates a bunch of objects. I want to combine multiple dataframes using bind_rows. I am able to choose the correct objects using grep, but I am not able to pass those object names as arguments to bind_rows.
For example, I want to select the objects that start with df and pass those to bind_rows. In the example below I expect to get a dataframe named data which contains the dataframe mtcars 3 times.
df1 <- mtcars
df2 <- mtcars
df3 <- mtcars
notdf4 <- mtcars
dfx <- ls()[grep("^df", ls())]
data <- bind_rows(eval(parse(text = dfx)))
The suggestion to use mget makes sense, although it returns a list, so you would need to use do.call to execute an rbind operation.
str(do.call(rbind, mget(ls(pattern = "^df."))))
'data.frame': 96 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
I think using mget and do.call (rather than eval(parse(...))) will have a lower chance of offending people like me who might be called R purists. I chose to use the "pattern" argument to ls as cleaner than first getting all the workspace names and then selecting from them with grep.
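If you do want bind_rows specifically, it accepts a list of data frames directly, so mget's result can be passed straight in. A quick sketch (assuming dplyr is loaded and the pattern matches only the data frames):
library(dplyr)
# "^df[0-9]" matches df1, df2, df3 but not notdf4 or the character vector dfx
data <- bind_rows(mget(ls(pattern = "^df[0-9]")))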

Bootstrap Variance in R?

Trying to do a bootstrap variance of an estimator in R and having a difficult time. Essentially, I'm trying to pull 50 random rows out of a larger dataset, then, from those 50 rows, bootstrap a specific estimator (formula below) 1000 times using a sample size of 20, and then calculate the variance of those estimates. My code is below. I am very lost.
vector = d()
bootstraprows <- data[sample(nrow(data), 50), ]
for (i:100){
i <- sample(nrow(bootstraprows), size=20, replace=T)
c <- (sum(i$mpg*i$weight))/(sum((i$weight)^2))
append(d, c)
}
var(d)
As noted, I'm trying to calculate the sum of MPG * weight divided by the sum of weight^2. Please help if you can. Thanks!
I am not quite sure what you seek to accomplish, but I tried to construct an example. I used the built-in mtcars dataset that comes with R.
# load sample data
data(mtcars)
df <- mtcars
# show data structure
str(df)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
# fix randomization seed, make sample() reproducible
set.seed(1)
# take random 10 rows from df
sampleSize <- 10
bRows <- df[sample(nrow(df), sampleSize), ]
# do 7 bootstrap replications
bSamples <- 7
# make container for results
bResults <- rep(NA, bSamples)
Now we can actually perform the bootstrap
# loop over bootstraps
for (b in seq_len(bSamples)) {
  # make bootstrap draw from bRows
  bData <- bRows[sample(sampleSize, size = sampleSize, replace = TRUE), ]
  # compute your statistic of interest
  bValue <- sum(bData[["mpg"]] * bData[["wt"]]) / sum((bData[["wt"]])^2)
  # store results in container
  bResults[[b]] <- bValue
}
# show what we computed
bResults
[1] 4.490459 6.297782 3.651372 3.612414 5.348291 5.149250 3.818677
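The bootstrap variance the question asks about is then just the variance of those stored replicates:
# variance of the bootstrapped estimator
var(bResults)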
Does any of this help?

Dynamically redefine a function like print.data.frame

Is there a way to redefine a locked function?
What would be the best way to dynamically redefine such a globally available function while evaluating some code?
Example: I have the following code:
print(cars[1:5, ])
This usually calls print.data.frame, but for whatever reason I want it to call my.fancy.print.data.frame() instead. What would be the best way to achieve this?
In the end, I would like to have something like this:
evalWithEnvir(print(cars[1:5, ]), envir = list(print.data.frame = my.fancy.print.data.frame))
EDIT:
The question was badly asked. The problem was that I used <<- to redefine the function. This tried to set the function in the wrong environment. As #hrbrmstr pointed out below, the function can be easily redefined in the global environment.
print.data.frame is not 'locked' (or hidden). It appears among methods("print"), where the non-visible methods are also given.
If you prefer not to define a special class, you can overwrite base::print.data.frame in a defined environment and reference this in your code e.g.
e1 <- new.env(parent = .GlobalEnv)
assign("print.data.frame",
       function(x) print(unclass(x)),
       envir = e1)
with(e1, print(cars[1:5, ]))
giving:
$speed
[1] 4 4 7 7 8
$dist
[1] 2 10 4 22 16
attr(,"row.names")
[1] 1 2 3 4 5
and your other code should run as normal inside e1.
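If you want something closer to the evalWithEnvir() call sketched in the question, the same idea can be wrapped in a small helper. This is only a sketch; the function name and interface come from the question, not from any package:
evalWithEnvir <- function(expr, envir) {
  # build an environment holding the replacement methods and evaluate the call there
  e <- list2env(envir, parent = parent.frame())
  eval(substitute(expr), e)
}
evalWithEnvir(print(cars[1:5, ]),
              envir = list(print.data.frame = function(x) print(unclass(x))))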
You can redefine the functionality of print.data.frame in your environment with:
print.data.frame <- function(x, ..., digits = NULL,
                             quote = FALSE, right = TRUE, row.names = TRUE) {
  print("WOO HOO")
}
Now that's useless since it will just print WOO HOO instead of doing something meaningful, but it should help you get started.
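To restore the default behaviour afterwards, removing the masking copy from the global environment should be enough:
rm(print.data.frame)  # dispatch falls back to base::print.data.frame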
SabDeM's idea is a better one:
class(mtcars) <- c("myclass", class(mtcars))
print.myclass <- function(x) {
  print(ls.str(x))
}
print(mtcars)
## am : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
## carb : num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
## cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
## disp : num [1:32] 160 160 108 258 360 ...
## drat : num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## gear : num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
## hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
## mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## qsec : num [1:32] 16.5 17 18.6 19.4 17 ...
## vs : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
## wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...

Split a dataframe into a list of dataframes, but how to re-merge?

I have a big ol' data frame with two ID columns for courses and users, and I needed to split it into one dataframe per course to do some further analysis/subsetting. After eliminating quite a few rows from each of the individual course dataframes, I'll need to stick them back together.
I split it up using, you guessed it, split, and that worked exactly as I needed it to. However, unsplitting was harder than I thought. The R documentation says that "unsplit reverses the effect of split," but my reading on the web so far is suggesting that that is not the case when the elements of the split-out list are themselves dataframes.
What can I do to rejoin my modified dfs?
This is a place for do.call. Simply calling df <- rbind(split.df) will result in a weird and useless list object, but do.call("rbind", split.df) should give you the result you're looking for.
unsplit() does seem to work in the general situation that you describe, but not in the particular situation of removing rows from the split data frames.
Consider
> spl <- split(mtcars, mtcars$cyl)
> str(spl, max = 1)
List of 3
$ 4:'data.frame': 11 obs. of 11 variables:
$ 6:'data.frame': 7 obs. of 11 variables:
$ 8:'data.frame': 14 obs. of 11 variables:
> str(unsplit(spl, f = mtcars$cyl))
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
As we can see, unsplit() can undo a split. However, in the case where the split data frame is further worked upon and altered to remove rows, there will be a mismatch between the total number of rows across the data frames in the split list and the length of the variable used to split the original data frame.
If you know, or can compute, the corresponding changes to the variable used to split the original data frame, then unsplit() can still be deployed, though this will more than likely not be trivial.
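For instance, if the grouping column survives inside each piece, the factor can be rebuilt from the pieces themselves. A rough sketch:
spl <- split(mtcars, mtcars$cyl)
spl <- lapply(spl, function(d) d[d$mpg > 15, , drop = FALSE])    # drop some rows per group
f   <- unlist(lapply(spl, function(d) d$cyl), use.names = FALSE) # recompute the split variable
out <- unsplit(spl, f)                                           # lengths now match again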
The general solution, as #Andrew Sannier mentions, is the do.call(rbind, ...) idiom:
> spl <- split(mtcars, mtcars$cyl)
> str(do.call(rbind, spl))
'data.frame': 32 obs. of 11 variables:
$ mpg : num 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26 30.4 ...
$ cyl : num 4 4 4 4 4 4 4 4 4 4 ...
$ disp: num 108 146.7 140.8 78.7 75.7 ...
$ hp : num 93 62 95 66 52 65 97 66 91 113 ...
$ drat: num 3.85 3.69 3.92 4.08 4.93 4.22 3.7 4.08 4.43 3.77 ...
$ wt : num 2.32 3.19 3.15 2.2 1.61 ...
$ qsec: num 18.6 20 22.9 19.5 18.5 ...
$ vs : num 1 1 1 1 1 1 1 1 0 1 ...
$ am : num 1 0 0 1 1 1 0 1 1 1 ...
$ gear: num 4 4 4 4 4 4 3 4 5 5 ...
$ carb: num 1 2 2 1 2 1 1 1 2 2 ...
Outside of base R, also consider:
data.table::rbindlist() with the side effect of the result being a data.table
dplyr::bind_rows() which despite its somewhat confusing name will bind rows across lists
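For example, with the split list from above (assuming both packages are installed):
dt       <- data.table::rbindlist(spl)  # result is a data.table
combined <- dplyr::bind_rows(spl)       # result is a plain data frame / tibble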
The answer by Andrew Sannier works but has the side-effect that the rownames get changed. rbind adds the list names to them, so e.g. "Datsun 710" becomes "4.Datsun 710". One can use unname in between to avoid this problem.
Complete example:
mtcars_reorder = mtcars[order(mtcars$cyl), ] #reorder based on cyl first
l1 = split(mtcars_reorder, mtcars_reorder$cyl) #split by cyl
l1 = unname(l1) #remove list names
l2 = do.call(what = "rbind", l1) #unsplit
all(l2 == mtcars_reorder) #check if matches
#> TRUE

R: rename subset of variables in data frame

I'm renaming the majority of the variables in a data frame and I'm not really impressed with my method.
Therefore, does anyone on SO have a smarter or faster way than the one presented below, using only base?
data(mtcars)
# head(mtcars)
temp.mtcars <- mtcars
names(temp.mtcars) <- c((x <- c("mpg", "cyl", "disp")),
                        gsub('^', "baR.", setdiff(names(mtcars), x)))
str(temp.mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp : num 160 160 108 258 360 ...
$ baR.hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ baR.drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ baR.wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ baR.qsec: num 16.5 17 18.6 19.4 17 ...
$ baR.vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ baR.am : num 1 1 1 0 0 0 0 0 0 0 ...
$ baR.gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ baR.carb: num 4 4 1 1 2 1 4 2 2 4 ...
Edited for answer using base R only
The package plyr has a convenient function rename() that does what you ask. Your modified question specifies using base R only. One easy way of doing this is to simply copy the code from plyr::rename and create your own function.
rename <- function(x, replace) {
  old_names <- names(x)
  new_names <- unname(replace)[match(old_names, names(replace))]
  setNames(x, ifelse(is.na(new_names), old_names, new_names))
}
The function rename takes an argument that is a named vector, where the elements of the vector are the new names, and the names of the vector are the existing names. There are many ways to construct such a named vector. In the example below I simply use structure.
x <- c("mpg", "disp", "wt")
some.names <- structure(paste0("baR.", x), names=x)
some.names
mpg disp wt
"baR.mpg" "baR.disp" "baR.wt"
Now you are ready to rename:
mtcars <- rename(mtcars, replace=some.names)
The results:
'data.frame': 32 obs. of 11 variables:
$ baR.mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ baR.disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ baR.wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec : num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear : num 4 4 4 3 3 3 3 4 4 4 ...
$ carb : num 4 4 1 1 2 1 4 2 2 4 ...
I would use ifelse:
names(temp.mtcars) <- ifelse(names(mtcars) %in% c("mpg", "cyl", "disp"),
                             names(mtcars),
                             paste("bar", names(mtcars), sep = "."))
Nearly the same but without plyr:
data(mtcars)
temp.mtcars <- mtcars
carNames <- names(temp.mtcars)
modifyNames <- !(carNames %in% c("mpg", "cyl", "disp"))
names(temp.mtcars)[modifyNames] <- paste("baR.", carNames[modifyNames], sep="")
Output:
str(temp.mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp : num 160 160 108 258 360 ...
$ baR.hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ baR.drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ baR.wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ baR.qsec: num 16.5 17 18.6 19.4 17 ...
$ baR.vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ baR.am : num 1 1 1 0 0 0 0 0 0 0 ...
$ baR.gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ baR.carb: num 4 4 1 1 2 1 4 2 2 4 ...
You could use the rename.vars function in the gdata package.
It works well when you only want to replace a subset of variable names and where the order of your vector of names is not the same as the order of names in the data.frame.
Adapted from the help file:
library(gdata)
data <- data.frame(x=1:10,y=1:10,z=1:10)
names(data)
data <- rename.vars(data, from=c("z","y"), to=c("Z","Y"))
names(data)
Converts data.frame names:
[1] "x" "y" "z"
to
[1] "x" "Y" "Z"
I.e., note how this handles the subsetting and the fact that the vector of names is not in the same order as the names in the data.frame.
# Base one-liner: look up the positions of the old names and assign the new ones
names(df)[match(c('old_var1', 'old_var2'), names(df))] <- c('new_var1', 'new_var2')
