Is there an easy function, outside of attach/detach, that will break apart a data frame or data table into its individual vectors, with the vectors named after the columns of the data frame?
For example, suppose I have a data frame
x <- data.frame(a=c(1,2,3), b=c(4,5,6), d=c(7,8,9))
Then the function would return 3 vectors: a, b, and d. It seems like there should be a function to do this, but I cannot find it.
One option would be list2env:
list2env(x, .GlobalEnv)
a
#[1] 1 2 3
b
#[1] 4 5 6
d
#[1] 7 8 9
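If you would rather not write into the global environment, the same idea works with a throwaway environment (a minimal sketch using only base R):
e <- new.env()
list2env(x, envir = e) # the columns become variables inside e
e$a
#[1] 1 2 3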
Yes, there is a (very old, very standard) function called attach() that does that:
> x <- data.frame(a=c(1,2,3), b=c(4,5,6), d=c(7,8,9))
> attach(x)
> a
[1] 1 2 3
> b
[1] 4 5 6
> d
[1] 7 8 9
However, the general consensus is Don't Do That (TM), as it can a) clutter the environment into which you attach() (usually the global one) and b) silently mask an existing variable (though it warns by default unless you override a switch; see ?attach). There is a counterpart detach() to remove them again. The (aptly named) section "Good Practice" in the help page for attach has more on all this, including a hint to use on.exit() with detach() wherever you use attach().
But if you need it, you can use it. Just be aware of Them Dragons.
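If you do, a minimal sketch of the "Good Practice" pattern the help page hints at, with on.exit() guaranteeing the detach() even if an error occurs (the function f is just for illustration):
f <- function(dat) {
  attach(dat)
  on.exit(detach(dat)) # runs when f exits, error or not
  mean(a + b)
}
f(x)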
While two answers have already been posted, I would like to suggest a solution that does not write to the global environment, which is generally considered a bad thing (see Dirk's comments on attach).
First of all, an R function can only return a single object; it is not possible to return three separate vectors. However, we can easily return a named list of vectors.
df <- data.frame(a=c(1,2,3), b=c(4,5,6), d=c(7,8,9))
# Returns a named list
as.list(df)
#> $a
#> [1] 1 2 3
#>
#> $b
#> [1] 4 5 6
#>
#> $d
#> [1] 7 8 9
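If the goal is simply to use the columns by name without copying them anywhere, with() is often all you need (a sketch):
with(df, a + b)
#> [1] 5 7 9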
In R, is there a way to reference a vector from within the vector?
Say I have vectors with long names:
my.vector.with.a.long.name <- 1:10
Rather than this:
my.vector.with.a.long.name[my.vector.with.a.long.name > 5]
Something like this would be nice:
> my.vector.with.a.long.name[~ > 5]
[1] 6 7 8 9 10
Or alternatively indexing by a function would be convenient:
> my.vector.with.a.long.name[is.even]
[1] 2 4 6 8 10
Is there a package that already supports this?
You can use pipeR's pipes, which allow self-referencing with .:
library(pipeR)
my.vector.with.a.long.name %>>% `[`(.>5)
[1] 6 7 8 9 10
my.vector.with.a.long.name %>>% `[`(.%%2==0)
[1] 2 4 6 8 10
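For what it's worth, the same self-reference works with the more widespread magrittr pipe, where . stands for the left-hand side (a sketch, assuming magrittr is installed):
library(magrittr)
my.vector.with.a.long.name %>% .[. > 5]
[1] 6 7 8 9 10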
The Filter function helps with this:
my.vector.with.a.long.name <- 1:10
Filter(function(x) x%%2==0, my.vector.with.a.long.name)
or
is.even <- function(x) x%%2==0
Filter(is.even, my.vector.with.a.long.name)
You can easily create another object with a shorter name:
my.vector.with.a.long.name <- 1:10
mm <- my.vector.with.a.long.name
mm
[1] 1 2 3 4 5 6 7 8 9 10
mm[mm<5]
[1] 1 2 3 4
mm[mm>5]
[1] 6 7 8 9 10
Why use other packages and complex code?
So, you're basically asking if you can use something other than the variable's name to refer to it. The short answer is no. That is the whole idea behind variable names. If you want a shorter name, name it something shorter.
The longer answer is it depends. You're really just using logical indexing in its long form. To shorten it, or to refer to it more than once without typing that enormous name, just save it in a vector like so:
gt5 <- my.vector.with.a.long.name > 5
[1] FALSE FALSE FALSE FALSE FALSE TRUE...
my.vector.with.a.long.name[gt5]
[1] 6 7 8 9 10
You can do the same thing with a function as long as it returns the indexes or a logical vector.
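For instance, reusing the is.even idea from the question (defined here for illustration):
is.even <- function(x) x %% 2 == 0
evens <- is.even(my.vector.with.a.long.name)
my.vector.with.a.long.name[evens]
[1] 2 4 6 8 10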
The dplyr package allows some convenient chaining, where the %.% operator takes the left-hand side and feeds it into the first argument of the function call on the right-hand side. That lets you say things like:
data %.% group_by(group.var) %.% summarize(Mean=mean(ID))
instead of:
summarize(group_by(data, group.var), Mean=mean(ID))
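Note that in current versions of dplyr the %.% operator has been deprecated in favor of the magrittr pipe %>%, so the same chain would read:
data %>% group_by(group.var) %>% summarize(Mean = mean(ID))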
First of all, sorry for this question; I suppose it's super basic, but I can't find the right search terms. For a vector a, let's say:
a <- c(1,1,3,2,1)
I want to get the vector b that results from summing element by element:
> b
[1] 1 2 5 7 8
it would be something like:
x <- 2
b <- as.vector(a[1])
while (x <= length(a)) {
  c <- a[x] + b[x-1]
  b <- c(b, c)
  x <- x + 1
}
rm(x, c)
but isn't there a built-in function for this?
You are looking for cumsum:
> a <- c(1,1,3,2,1)
> cumsum(a)
[1] 1 2 5 7 8
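For the record, cumsum is one of a small family of cumulative functions in base R, in case a sibling is what you need next time:
cumprod(a) # [1] 1 1 3 6 6 (running product)
cummax(a)  # [1] 1 1 3 3 3 (running maximum)
cummin(a)  # [1] 1 1 1 1 1 (running minimum)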
I do:
assign('test', 'bye')
test
[1] "bye"
Now the string 'bye' is stored in the test variable. I would like to use the string inside the test variable as the name of the element of the following list:
list(test=c(1:10))
$test
[1] 1 2 3 4 5 6 7 8 9 10
But I would like to use 'bye' as the NAME (because 'bye' is what is written inside the test variable).
How can I do it?
I don't think eval or assign are at all necessary here; their use usually (although not always) indicates that you're doing something the hard way, or at least the un-R-ish way.
> test <- "bye"
> L <- list(1:10) ## c() unnecessary here too
> names(L) <- test
> L
$bye
[1] 1 2 3 4 5 6 7 8 9 10
If you really want to do this in a single statement, you can do:
L <- setNames(list(1:10), test)
or
L <- structure(list(1:10), .Names=test)
I guess this will be the answer you're looking for?
assign('test','bye')
z<-list(c(1:10))
names(z)<-test
I'm trying to get a handle on the ubiquitous which function. Until I started reading questions/answers on SO I never found the need for it. And I still don't.
As I understand it, which takes a Boolean vector and returns a (weakly shorter) vector containing the indices of the elements that were TRUE:
> seq(10)
[1] 1 2 3 4 5 6 7 8 9 10
> x <- seq(10)
> tf <- (x == 6 | x == 8)
> tf
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
> w <- which(tf)
> w
[1] 6 8
So why would I ever use which instead of just using the Boolean vector directly? I could maybe see some memory issues with huge vectors, since length(w) << length(tf), but that's hardly compelling. And there are some options in the help file which don't add much to my understanding of possible uses of this function. The examples in the help file aren't of much help either.
Edit for clarity -- I understand that which returns the indices. My question is about two things: 1) why would you ever need to use the indices instead of just using the Boolean selector vector? and 2) what interesting behaviors of which might make it preferable to a plain vectorized Boolean comparison?
Okay, here is something where it proved useful last night:
In a given vector of values what is the index of the 3rd non-NA value?
> x <- c(1,NA,2,NA,3)
> which(!is.na(x))[3]
[1] 5
A little different from DWin's use, although I'd say his is compelling too!
The title of the man page ?which provides a motivation. The title is:
Which indices are TRUE?
Which I interpret as being the function one might use if you want to know which elements of a logical vector are TRUE. This is inherently different to just using the logical vector itself. That would select the elements that are TRUE, not tell you which of them was TRUE.
Common use cases were to get the position of the maximum or minimum values in a vector:
> set.seed(2)
> x <- runif(10)
> which(x == max(x))
[1] 5
> which(x == min(x))
[1] 7
Those were so commonly used that which.max() and which.min() were created:
> which.max(x)
[1] 5
> which.min(x)
[1] 7
However, note that the specific forms are not exact replacements for the generic form. See ?which.min for details. One example is below:
> x <- c(4,1,1)
> which.min(x)
[1] 2
> which(x==min(x))
[1] 2 3
Two very compelling reasons not to forget which:
1) When you use "[" to extract from a data frame, any NA produced by the calculation in the row position returns a junk row. Using which removes the NAs. You can instead use subset or %in%, which do not create the same problem.
> dfrm <- data.frame( a=sample(c(1:3, NA), 20, replace=TRUE), b=1:20)
> dfrm[dfrm$a >0, ]
a b
1 1 1
2 3 2
NA NA NA
NA.1 NA NA
NA.2 NA NA
6 1 6
NA.3 NA NA
8 3 8
# Snipped remaining rows
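Wrapping the row condition in which() avoids the junk rows, because NAs are simply dropped from the resulting index vector:
> dfrm[which(dfrm$a > 0), ]   # same extraction, no NA rows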
2) When you need the array indicators.
which could be useful (saving both computer and human resources), e.g. if you have to filter the elements of a data frame/matrix by a given variable/column and update other variables/columns based on that. Example:
df <- mtcars
Instead of:
df$gear[df$hp > 150] <- mean(df$gear[df$hp > 150])
You could do:
p <- which(df$hp > 150)
df$gear[p] <- mean(df$gear[p])
A further case is when you have to filter already-filtered elements in a way that cannot be expressed with a simple & or |, e.g. when you have to update some parts of a data frame based on other data tables. Then you need to store (at least temporarily) the indexes of the filtered elements.
Another use that comes to mind is when you have to loop through part of a data frame/matrix, or do other kinds of transformations that require knowing the indexes of several cases. Example:
urban <- which(USArrests$UrbanPop > 80)
> USArrests[urban, ] - USArrests[urban-1, ]
Murder Assault UrbanPop Rape
California 0.2 86 41 21.1
Hawaii -12.1 -165 23 -5.6
Illinois 7.8 129 29 9.8
Massachusetts -6.9 -151 18 -11.5
Nevada 7.9 150 19 29.5
New Jersey 5.3 102 33 9.3
New York -0.3 -31 16 -6.0
Rhode Island -2.9 68 15 -6.6
Sorry for the dummy examples; I know it does not make much sense to compare the most urbanized states of the USA against the states just before them in the alphabet, but I hope the idea comes through :)
Checking out which.min and which.max gives some clues also, and saves typing. Example:
> row.names(mtcars)[which.max(mtcars$hp)]
[1] "Maserati Bora"
Well, I found one possible reason. At first I thought it might be the useNames option, but it turns out that simple Boolean selection does that too.
However, if your object of interest is a matrix, you can use the arr.ind option to return the result as (row, column) ordered pairs:
> x <- matrix(seq(10),ncol=2)
> x
[,1] [,2]
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
> which((x == 6 | x == 8),arr.ind=TRUE)
row col
[1,] 1 2
[2,] 3 2
> which((x == 6 | x == 8))
[1] 6 8
That's a handy trick to know about, but hardly seems to justify its constant use.
Surprised no one has answered this: how about memory efficiency?
If you have a long vector with very sparse TRUEs, then keeping track of only the indices of the TRUE values will probably be much more compact.
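A rough sketch of the difference (exact sizes are platform-dependent):
x <- rep(FALSE, 1e6)
x[c(10, 5e5)] <- TRUE
object.size(x)        # about 4 MB: one logical per element
object.size(which(x)) # well under 100 bytes: just two integer indices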
I use it quite often in data exploration. For example, if I have a dataset of kids' data and see from summary that the max age is 23 (when it should be 18), I might go:
sum(dat$age > 18)
If that were 67 and I wanted to look closer, I might use:
dat[which(dat$age>18)[1:10], ]
Also useful if you're making a presentation and want to pull out a snippet of data to demonstrate a certain oddity or what not.
A common use case in R (at least for me) is identifying observations in a data frame that have some characteristic that depends on the values in some subset of other observations.
To make this more concrete, suppose I have a number of workers (indexed by WorkerId) that have an associated "Iteration":
raw <- data.frame(WorkerId = c(1,1,1,1,2,2,2,2,3,3,3,3),
                  Iteration = c(1,2,3,4,1,2,3,4,1,2,3,4))
and I want to eventually subset the data frame to exclude the "last" iteration (by creating a "remove" boolean) for each worker. I can write a function to do this:
raw$remove <- mapply(function(wid, iter) {
                       iter == max(raw$Iteration[raw$WorkerId == wid])
                     }, raw$WorkerId, raw$Iteration)
> raw$remove
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
but this gets very slow as the data frame gets larger (presumably because I'm needlessly computing the max for every observation).
My question is: what's the more efficient (and idiomatic) way of doing this in the functional programming style? Is it first creating a WorkerId-to-maximum dictionary and then using that as a parameter in another function that operates on each observation?
The "most natural way" IMO is the split-lapply-rbind method. You start by split()-ting into a list of groups, then lapply() the processing rule (in this case removing the last row) and then rbind() them back together. It's all doable as a nested set of function calls. The inner two steps are illustrated here and the final one-liner is presented at the bottom:
> lapply( split(raw, raw$WorkerId), function(x) x[-NROW(x),] )
$`1`
WorkerId Iteration
1 1 1
2 1 2
3 1 3
$`2`
WorkerId Iteration
5 2 1
6 2 2
7 2 3
$`3`
WorkerId Iteration
9 3 1
10 3 2
11 3 3
do.call(rbind, lapply( split(raw, raw$WorkerId), function(x) x[-NROW(x),] ) )
Hadley Wickham has developed a wide set of tools, the plyr package, that extend this strategy to a wider variety of tasks.
For the specific problem posed, use !rev(duplicated(rev(raw$WorkerId))) or, better, following Charles' advice, !duplicated(raw$WorkerId, fromLast=TRUE).
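Indexing with the complement of that remove flag then drops each worker's last row in one step:
raw[duplicated(raw$WorkerId, fromLast=TRUE), ]   # all but the last row per worker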
This situation is tailor-made for using the plyr package.
ddply(raw, .(WorkerId), function(df) df[-NROW(df),])
It produces the output
WorkerId Iteration
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 3
# Keep every row whose Iteration is not its worker's maximum:
subset(raw, Iteration != ave(Iteration, WorkerId, FUN=max))
# Or build the remove flag directly (TRUE marks each worker's last row):
remove <- with(raw, as.logical(ave(Iteration, WorkerId,
                                   FUN=function(x) c(rep(FALSE, length(x)-1), TRUE))))