How to "really" convert variables in R [duplicate] - r

Let's assume that we have a data frame x which contains the columns job and income. Referring to the data in the frame normally requires the commands x$job for the data in the job column and x$income for the data in the income column.
However, the command attach(x) lets us drop the name of the data frame and the $ symbol when referring to the same data. Consequently, x$job becomes job and x$income becomes income in the R code.
The problem is that many experts in R advise NOT to use the attach() command when coding in R.
What is the main reason for that? What should be used instead?

When to use it:
I use attach() when I want the environment you get in most statistics packages (e.g. Stata, SPSS) of working with one rectangular dataset at a time.
When not to use it:
However, it gets very messy and the code quickly becomes unreadable when you have several different datasets, particularly if you are in effect using R as a crude relational database: different rectangles of data, all relevant to the problem at hand and perhaps matched against one another in various ways, may have variables with the same name.
The with() function, or the data= argument that many functions accept, is an excellent alternative in many of the situations where attach() is tempting.
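A minimal sketch of those two alternatives, using the built-in mtcars data purely for illustration:

# data= argument: the formula interface looks up columns inside mtcars
fit <- lm(mpg ~ wt + hp, data = mtcars)

# with(): evaluate an expression in the context of the data frame,
# without attaching anything to the search path
avg_ratio <- with(mtcars, mean(mpg / wt))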

Another reason not to use attach(): it gives read-only access to the columns of a data frame, and only to their values as they were when the data frame was attached; it is not a shorthand for the current value of a column. Two examples:
> head(cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
> attach(cars)
> # convert stopping distance to meters
> dist <- 0.3048 * dist
> # convert speed to meters per second
> speed <- 0.44707 * speed
> # compute a meaningless time
> time <- dist / speed
> # check our work
> head(cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
No changes were made to the cars data set even though dist and speed were assigned to.
If explicitly assigned back to the data set...
> head(cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
> attach(cars)
> # convert stopping distance to meters
> cars$dist <- 0.3048 * dist
> # convert speed to meters per second
> cars$speed <- 0.44707 * speed
> # compute a meaningless time
> cars$time <- dist / speed
> # compute meaningless time being explicit about using values in cars
> cars$time2 <- cars$dist / cars$speed
> # check our work
> head(cars)
    speed   dist      time     time2
1 1.78828 0.6096 0.5000000 0.3408862
2 1.78828 3.0480 2.5000000 1.7044311
3 3.12949 1.2192 0.5714286 0.3895842
4 3.12949 6.7056 3.1428571 2.1427133
5 3.57656 4.8768 2.0000000 1.3635449
6 4.02363 3.0480 1.1111111 0.7575249
The dist and speed referenced in computing time are the original (untransformed) values: the values of cars$dist and cars$speed at the moment cars was attached.

I think there's nothing wrong with using attach. I myself don't use it (then again, I love animals, but don't keep any, either). When I think of attach, I think long term. Sure, while I'm working on a script I know it inside and out. But when I come back to that script in a week, a month or a year, the overhead of figuring out where a certain variable comes from is just too expensive. A lot of functions have a data argument, which makes referring to variables easy (sensu lm(x ~ y + z, data = mydata)). If not, I find with() satisfactory.
In short, in my book, attach is fine for quick data exploration, but for developing scripts that I or others might want to reuse, I try to keep my code as readable (and transferable) as possible.

If you execute attach(data) multiple times, e.g. 5 times, then you can see (with the help of search()) that your data has been attached 5 times in the workspace environment. So if you detach it once (detach(data)), the data will still be present on the search path 4 times. Hence, with()/within() are better options: they create a local environment containing that object, and you can use it without creating any confusion.
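A short sketch of that masking behaviour and of the with()/within() alternatives, using the built-in cars data as a stand-in:

attach(cars)
attach(cars)        # R warns that speed and dist are masked
search()            # "cars" now appears twice on the search path
detach(cars)
search()            # ...and is still attached once

# with() evaluates an expression in a temporary environment built from cars
with(cars, mean(dist / speed))

# within() returns a modified copy and leaves the search path alone
cars2 <- within(cars, time <- dist / speed)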

Related

Number of masked objects keeps increasing when using attach in R [duplicate]

R: Dropping variables using number of observations

I have a large dataset, and I'm trying to drop some of my variables based on how many observations each has. For instance, I would like to drop any variable in my dataframe where n < 3 (total observations for that variable is less than 3). Since R can count observations for each variable using describe, can't I use that number to subset the data instead of having to type in each variable name each time I pull in a new version? (Each version has different variables with low n's, and there are over 40 variables.) Thanks so much for your help!
For instance, my data looks like this:
ID Runaway Aggressive Emergency Hospitalization Injury
 1       3         NA         4               1     NA
 2      NA         NA         2               1     NA
 3       4         NA         6               2      3
 4       1         NA         1               1     NA
I want to be able to drop "Aggressive" and "Injury" based on their n's being 0 and 1 respectively. However, instead of telling R to drop them by variable name, it would be much more convenient if it was possible to tell R to drop any variable where n < 3 (or whatever number I choose) as I'll be using this code for multiple versions of this dataset. I have tried using column numbers (which is better than writing them out) but it's still pretty tedious when I have to describe() the data, figure out which variables have low n's, and then drop 28 variables or subset() around them.
This works but it's cumbersome...
UIRCorrelation <- UIRKidUnique61[c(28, 30, 32, 34:38, 42, 54:74)]
For some reason, my example looks different when I'm editing versus when I save, so I also included an image of it. Sorry. This is the first time I've ever used Stack Overflow to ask a question. I actually spent a lot of time googling this but couldn't find an answer relating to n.
One suggested answer, DF being your data frame:
DF[, sapply(DF, function(col) length(na.omit(col))) > 4]
That line did not work for me, but this function did the trick:
valid <- function(x) { sum(!is.na(x)) }
N <- apply(UIRCorrelation, 2, valid)
UIRCorrelation2 <- UIRCorrelation[N > 3]
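For completeness, a compact variant of the same idea, sketched on the example data from the question (the data frame name df and the threshold of 3 are assumptions):

# rebuild the example data frame from the question
df <- data.frame(
    ID              = 1:4,
    Runaway         = c(3, NA, 4, 1),
    Aggressive      = c(NA, NA, NA, NA),
    Emergency       = c(4, 2, 6, 1),
    Hospitalization = c(1, 1, 2, 1),
    Injury          = c(NA, NA, 3, NA)
)

# count non-missing observations per column and keep columns with n >= 3
n_obs <- colSums(!is.na(df))
df_kept <- df[, n_obs >= 3]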

Using Histogram as input in R

This is admittedly a very simple question that I just can't find an answer to.
In R, I have a file that has 2 columns: one of categorical data names, and the second a count column (the count for each category). With a small dataset I would use 'reshape' and the function 'untable' to make a single column and do the analysis that way. The question is, how to handle this with a large data set?
In this case, my data is humongous and that approach just isn't going to work.
My question is, how do I tell R to use something like the following as distribution data:
Cat Count
  A     5
  B     7
  C     1
That is, I give it a histogram as an input and have R figure out that it means there are 5 of A, 7 of B and 1 of C when calculating other information about the data.
That is, rather than my expanding the data myself, R should understand that the input above is equivalent to the following:
A
A
A
A
A
B
B
B
B
B
B
B
C
With reasonably sized data I can do this on my own, but what do you do when the data is very large?
Edit
The total sum of all the counts is 262,916,849.
In terms of what it would be used for:
This is new data, trying to understand the correlation between this new data and other pieces of data. Need to work on linear regressions and mixed models.
I think what you're asking is to reshape a data frame of categories and counts into a single vector of observations, where categories are repeated. Here's one way:
dat <- data.frame(Cat=LETTERS[1:3], Count=c(5,7,1))
#   Cat Count
# 1   A     5
# 2   B     7
# 3   C     1
rep.int(dat$Cat, times=dat$Count)
# [1] A A A A A B B B B B B B C
# Levels: A B C
To follow up on @Blue Magister's excellent answer, here's a 100,000-row histogram with a total count of 551,245,193:
set.seed(42)
Cat <- sapply(rep(10, 100000), function(x) {
    paste(sample(LETTERS, x, replace=TRUE), collapse='')
})
dat <- data.frame(Cat, Count=sample(1000:10000, length(Cat), replace=TRUE))
> head(dat)
         Cat Count
1 XYHVQNTDRS  5154
2 LSYGMYZDMO  4724
3 XDZYCNKXLV  8691
4 TVKRAVAFXP  2429
5 JLAZLYXQZQ  5704
6 IJKUBTREGN  4635
This is a pretty big dataset by my standards, and the operation Blue Magister describes is very quick:
> system.time(x <- rep(dat$Cat,times=dat$Count))
   user  system elapsed
   4.48    1.95    6.42
It uses about 6 GB of RAM to complete the operation.
This really depends on what statistics you are trying to calculate. The xtabs function will create tables for you where you can specify the counts. The Hmisc package has functions like wtd.mean that will take a vector of weights for computing a mean (and related functions for standard deviation, quantiles, etc.). The biglm package could be used to expand parts of the dataset at a time and analyze. There are probably other packages as well that would handle the frequency data, but which is best depends on what question(s) you are trying to answer.
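A brief sketch of the weighted route, assuming the categories are numeric values stored in a value column (wtd.mean and wtd.var are from the Hmisc package):

library(Hmisc)

dat <- data.frame(value = c(1.2, 3.4, 5.6), Count = c(5, 7, 1))

# weighted summaries computed directly from the binned counts
wtd.mean(dat$value, weights = dat$Count)
wtd.var(dat$value,  weights = dat$Count)

# a contingency table built from the pre-tabulated counts
xtabs(Count ~ value, data = dat)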
The existing answers all expand the pre-binned dataset into a full distribution and then use R's histogram functions, which is memory inefficient and will not scale for the very large datasets the original poster asked about. The HistogramTools CRAN package includes a PreBinnedHistogram function, which takes arguments for breaks and counts to create a histogram object in R without massively expanding the dataset.
For example, if the data set has 3 buckets with 5, 7, and 1 elements, all of the other solutions posted here so far first expand that into a vector of 13 elements and then create the histogram. PreBinnedHistogram, in contrast, creates the histogram directly from the 3-element input without creating a much larger intermediate vector in memory.
big.histogram <- PreBinnedHistogram(my.data$breaks, my.data$counts)
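A minimal sketch of that call for the three-bucket example above; the break points are made up and only need to be one element longer than the counts:

library(HistogramTools)

# 3 buckets holding 5, 7 and 1 elements
h <- PreBinnedHistogram(breaks = c(0, 1, 2, 3), counts = c(5, 7, 1))
plot(h)   # behaves like an object returned by hist(), without expanding the data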

Why is it not advisable to use attach() in R, and what should I use instead?

writing to global environment when running in parallel

I have a data.frame of cells, values and coordinates. It resides in the global environment.
> head(cont.values)
   cell value   x   y
1 11117    NA -34 322
2 11118    NA -30 322
3 11119    NA -26 322
4 11120    NA -22 322
5 11121    NA -18 322
6 11122    NA -14 322
Because my custom function takes almost a second to calculate an individual cell (and I have tens of thousands of cells to calculate), I don't want to duplicate calculations for cells that already have a value. My solution below tries to avoid that. Each cell can be calculated independently, which screams for parallel execution.
What my function actually does is check whether there's a value for a specified cell number; if it's NA, it calculates the value and inserts it in place of the NA.
I can run my magic function (the result is the value for the corresponding cell) using the apply family of functions, and from within apply I can read and write cont.values without a problem (it's in the global environment).
Now, I want to run this in parallel (using snowfall), and I'm unable to read from or write to this variable from the individual cores.
Question: what solution would allow me to read from and write to a dynamic variable residing in the global environment from within a worker (core) when executing a function in parallel? Is there a better way of doing this?
The pattern of a central store that workers consult for values is implemented in the rredis package on CRAN. The idea is that the Redis server maintains a store of key-value pairs (your global data frame, re-implemented). Workers query the server to see if the value has been calculated (redisGet) and if not do the calculation and store it (redisSet) so that other workers can re-use it. Workers can be R scripts, so it's easy to expand the work force. It's a very nice alternative parallel paradigm. Here's an example that uses the notion of 'memoizing' each result. We have a function that is slow (sleeps for a second)
fun <- function(x) { Sys.sleep(1); x }
We write a 'memoizer' that returns a variant of fun that first checks to see if the value for x has already been calculated, and if so uses that
memoize <- function(FUN) {
    force(FUN) # circumvent lazy evaluation
    require(rredis)
    redisConnect()
    function(x) {
        key <- as.character(x)
        val <- redisGet(key)
        if (is.null(val)) {
            val <- FUN(x)
            redisSet(key, val)
        }
        val
    }
}
We then memoize our function
funmem <- memoize(fun)
and go
> system.time(res <- funmem(10)); res
   user  system elapsed
  0.003   0.000   1.082
[1] 10
> system.time(res <- funmem(10)); res
   user  system elapsed
  0.001   0.001   0.040
[1] 10
This does require a Redis server running outside R, but it is very easy to install; see the documentation that comes with the rredis package.
A within-R parallel version might be
library(snow)
cl <- makeCluster(c("localhost","localhost"), type = "SOCK")
clusterEvalQ(cl, { require(rredis); redisConnect() })
tasks <- sample(1:5, 100, TRUE)
system.time(res <- parSapply(cl, tasks, funmem))
It will depend on what the function in question is, of course, but I'm afraid that snowfall won't be much of a help there. The thing is, you'll have to export the whole data frame to the different cores (see ?sfExport) and still find a way to combine it, which rather defeats the purpose of changing the value in the global environment, as you probably want to keep memory use as low as possible.
You can dive into the low-level functions of snow to (kind of) get this to work. See the following example:
# Some data
Data <- data.frame(
    cell  = 1:10,
    value = sample(c(100, NA), 10, TRUE),
    x     = 1:10,
    y     = 1:10
)

# A sample function
sample.func <- function() {
    id <- which(is.na(Data$value)) # get the NA values
    # this splits up the values from the dataframe in a list
    # which will be passed to clusterApply later on.
    parts <- lapply(clusterSplit(cl, id), function(i) Data[i, c("x", "y")])
    # Here happens the magic
    Data$value[id] <<-
        unlist(clusterApply(cl, parts, function(x) {
            x$x + x$y
        }))
}
#now we run it
require(snow)
cl <- makeCluster(c("localhost","localhost"), type = "SOCK")
sample.func()
stopCluster(cl)
> Data
   cell value  x  y
 1    1   100  1  1
 2    2   100  2  2
 3    3     6  3  3
 4    4     8  4  4
 5    5    10  5  5
 6    6    12  6  6
 7    7   100  7  7
 8    8   100  8  8
 9    9    18  9  9
10   10    20 10 10
You will still have to copy (part of) your data, though, to get it to the cores. But that happens anyway when you call snowfall's high-level functions on data frames, as snowfall uses the low-level functions of snow under the hood.
Plus, one shouldn't forget that if you change one value in a data frame, the whole data frame is copied in memory as well. So you won't gain that much by adding the values one by one as they come back from the cluster. You might want to try some different approaches and do some memory profiling as well.
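A tiny sketch of how to check that copy-on-modify behaviour before settling on an approach (tracemem() and object.size() ship with base R / utils):

Data <- data.frame(cell = 1:10, value = NA_real_, x = 1:10, y = 1:10)  # same shape as the toy data above

print(object.size(Data), units = "Kb")  # size of the whole data frame

tracemem(Data)       # print a message every time Data gets duplicated
Data$value[3] <- 6   # even a single-element assignment triggers a copy
untracemem(Data)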
I agree with Joris that you will need to copy your data to the other cores.
On the positive side, you don't have to worry about NA's being in the data or not, within the cores.
If your original data.frame is called cont.values:
nnaidx <- is.na(cont.values$value)  # where is missing data originally
dfrnna <- cont.values[nnaidx, ]     # subset for copying to other cores
calcValForDfrRow <- function(dfrRow) { return(dfrRow$x + dfrRow$y) }  # or whatever pleases you
sfExport(dfrnna, calcValForDfrRow)  # export what is needed to other cores
# sfSapply handles 'reordering', so this works exactly as if you had called sapply
cont.values$value[nnaidx] <- sfSapply(seq(dim(dfrnna)[1]),
                                      function(i) { calcValForDfrRow(dfrnna[i, ]) })
should work nicely (barring typos)
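For completeness, a hedged sketch of the snowfall boilerplate that the lines above assume is already in place (the number of CPUs is an assumption):

library(snowfall)
sfInit(parallel = TRUE, cpus = 2)   # start a local cluster with 2 workers

# ... the sfExport()/sfSapply() lines above go here ...

sfStop()                            # shut the workers down again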
