Rookie question - thanks in advance for patience...
I have a dataframe:
vals <- c(1,1,1,1)
testdf <- data.frame("var1"=vals, "var2"=vals, "var3"=vals)
I have a character vector of variable names:
varnames <- c("var1", "var2")
This is a character vector b/c I use it to generate a formula earlier in the script.
I'd like to subset a dataframe such that variables in varnames are excluded, e.g.
newDF <- subset(testdf, select=-varnames)
This creates an error since subset expects names instead of characters. So, I use lapply to change the characters to names:
varnames <- lapply(varnames, as.name)
The result of this lapply function is a named(?) and nested(?) list.
[[1]]
var1
[[2]]
var2
Here's where I get lost (I feel like Mugatu on crazy pills... is this confusing to anyone else!?). I can see that each value has correctly been changed from character to name, but it's in this weird nested structure - so when I try to subset, I get an error.
I've tried various solutions to unnest and unname, but with no success. This must be something easy I'm missing.
As a bonus - can someone tell me why it is ever useful for lapply to return this nested named list instead of simple vector? It seems very different than, for instance, Python. Thank you.
You can define the names of the columns you want inside [ (see the help file ?Extract or help("[") for the subset operator [).
testdf[ names(testdf)[!names(testdf) %in% varnames] ]
## or
## testdf[, names(testdf)[!names(testdf) %in% varnames] , drop = FALSE]
Or, more concisely (thanks @Frank)
testdf[ setdiff(names(testdf), varnames)]
var3
1 1
2 1
3 1
4 1
where
names(testdf)
# [1] "var1" "var2" "var3"
varnames
# [1] "var1" "var2"
And so
names(testdf) %in% varnames
# [1] TRUE TRUE FALSE
And therefore
names(testdf)[!names(testdf) %in% varnames]
# [1] "var3"
Which is the same as
testdf[, "var3" ]
The drop = FALSE argument stops the result from 'dropping' to a plain vector when only one column is returned.
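As a rough sketch of the difference between the indexing styles above (the object names keep, kept_df, kept_vec and kept_df2 are made up for illustration):

```r
# Toy data frame and the names to exclude, as in the question
vals <- c(1, 1, 1, 1)
testdf <- data.frame(var1 = vals, var2 = vals, var3 = vals)
varnames <- c("var1", "var2")

# Names to keep: everything in testdf that is not listed in varnames
keep <- setdiff(names(testdf), varnames)

# List-style indexing always returns a data frame
kept_df <- testdf[keep]

# Matrix-style indexing drops a single column to a plain vector ...
kept_vec <- testdf[, keep]

# ... unless drop = FALSE is supplied
kept_df2 <- testdf[, keep, drop = FALSE]

is.data.frame(kept_df)   # TRUE
is.data.frame(kept_vec)  # FALSE, it is a numeric vector
is.data.frame(kept_df2)  # TRUE
```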
Also, if you look at the help file for lapply(X, FUN, ...)
?lapply
lapply returns a list of the same length as X
This is why you're getting a list.
As a bonus - can someone tell me why it is ever useful for lapply to return this nested named list instead of simple vector? It seems very different than, for instance, Python. Thank you.
When you're working with a list, and you want it to remain as a list.
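To make the list-versus-vector point concrete, here is a small sketch (the names syms, lens and upper are only illustrative). Symbols cannot live in an atomic vector, which is one reason a list is the natural return type here:

```r
varnames <- c("var1", "var2")

# lapply() always returns a list, one element per input element.
# Symbols (names) cannot be stored in an atomic vector, so a list is required.
syms <- lapply(varnames, as.name)
class(syms)       # "list"
class(syms[[1]])  # "name"

# When every result is a single value of a known type, vapply()
# returns an atomic vector instead of a list.
lens <- vapply(varnames, nchar, integer(1))

# unlist() flattens a list of atomic values back into a vector
upper <- unlist(lapply(varnames, toupper))
```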
You can also use match which returns an index
testdf[-match(varnames,names(testdf))]
# var3
#1 1
#2 1
#3 1
#4 1
You can access the elements using varnames[[1]] etc. and convert it into a vector, if it makes it easier for you.
Source: https://www.datacamp.com/community/tutorials/r-tutorial-apply-family
lapply applies the function to every element of a list (or vector) and always returns a list of the same length. Because the elements of a list can themselves be lists, the result has to be a list rather than an atomic vector, and that is the [[1]], [[2]] structure you see printed.
I'm trying to use the named index to replace some elements of a list.
I have three lists:
Superset
Subset
SubsetNames
My objective is to replace the old elements in Superset with the corresponding ones from Subset where Name(Subset) == Name(Superset).
Example Code (Edited for correctness):
# Setting things up
Superset <- list(1, 2, 3, 4)
names(Superset) <- c("a", "b", "c", "d")
Subset <- list(5, 6)
names(Subset) <- c("b", "c") # or any names from Superset
SubsetNames <- as.list(names(Subset))
I have tried things like this:
lapply(SubsetNames, FUN=function(x) Superset[[x]] <- Subset[[x]])
And:
Superset[SubsetNames] <- Subset
I even tried to construct a for-loop with a counter; however, this is not a working solution in my scenario.
In reality, Superset is a list of dataframes, each of which has almost 90k datapoints in 117 columns.
Some of those dataframes need some tweaking. I have code which successfully extracts a list of the ones needing tweaking and tweaks them... now I just need to put them back.
Your help much appreciated! Thank you!
We can use the names of the 'Subset' to subset the 'Superset' and assign it to values of 'Subset'
Superset[names(Subset)] <- Subset
Superset
#$a
#[1] 1
#$b
#[1] 5
#$c
#[1] 6
#$d
#[1] 4
The list creation seems to be faulty. It would be as.list
Superset <- as.list(1:4)
It will return a list of length 4 as opposed to length 1 with list(1:4)
If you want to change for every value in Subset, you could just do
modifyList(Superset, Subset)
or, if you are just updating a smaller set of values from Subset (SubsetNames is a list, so it has to be unlisted before it can be used as an index)
modifyList(Superset, Subset[unlist(SubsetNames)])
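A minimal sketch comparing the two approaches on the toy lists (by_index and by_modify are hypothetical names used only here):

```r
Superset <- list(a = 1, b = 2, c = 3, d = 4)
Subset <- list(b = 5, c = 6)

# Name-indexed assignment overwrites the matching elements
by_index <- Superset
by_index[names(Subset)] <- Subset

# modifyList() returns a modified copy; Superset itself is left unchanged
by_modify <- modifyList(Superset, Subset)

identical(by_index, by_modify)  # TRUE
```

One caveat, as far as I understand modifyList: it recurses into elements that are themselves lists, and a data frame is a list. For a list of data frames it would therefore merge columns instead of swapping whole data frames, so the name-indexed assignment is probably the safer choice in that scenario.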
I would like to take a data frame with characters and numbers, and concatenate all of the elements of the each row into a single string, which would be stored as a single element in a vector. As an example, I make a data frame of letters and numbers, and then I would like to concatenate the first row via the paste function, and hopefully return the value "A1"
df <- data.frame(letters = LETTERS[1:5], numbers = 1:5)
df
## letters numbers
## 1 A 1
## 2 B 2
## 3 C 3
## 4 D 4
## 5 E 5
paste(df[1,], sep =".")
## [1] "1" "1"
So paste is converting each element of the row into an integer that corresponds to the 'index of the corresponding level', as if it were a factor, and it keeps it as a vector of length two. (I know/believe that factors coerced to character behave this way, but since R is not storing df[1,] as a factor at all (tested with is.factor()), I can't verify that it is actually an index for a level.)
is.factor(df[1,])
## [1] FALSE
is.vector(df[1,])
## [1] FALSE
So if it is not a vector then it makes sense that it is behaving oddly, but I can't coerce it into a vector
> is.vector(as.vector(df[1,]))
[1] FALSE
Using as.character did not seem to help in my attempts
Can anyone explain this behavior?
While others have focused on why your code isn't working and how to improve it, I'm going to try and focus more on getting the result you want. From your description, it seems you can readily achieve what you want using paste:
df <- data.frame(letters = LETTERS[1:5], numbers = 1:5, stringsAsFactors=FALSE)
paste(df$letters, df$numbers, sep="")
## [1] "A1" "B2" "C3" "D4" "E5"
You can change df$letters to character using df$letters <- as.character(df$letters) if you don't want to use the stringsAsFactors argument.
But let's assume that's not what you want. Let's assume you have hundreds of columns and you want to paste them all together. We can do that with your minimal example too:
df_args <- c(df, sep="")
do.call(paste, df_args)
## [1] "A1" "B2" "C3" "D4" "E5"
EDIT: Alternative method and explanation:
I realised the problem you're having is a combination of the fact that you're using a factor and that you're using the sep argument instead of collapse (as @adibender picked up). The difference is that sep gives the separator between separate vectors, while collapse gives the separator between the elements of a single vector. When you use df[1,], you supply a single object to paste, and hence you must use the collapse argument. Using your idea of getting every row and concatenating them, the following line of code will do what you want:
apply(df, 1, paste, collapse="")
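The sep versus collapse distinction can be seen in a short sketch (the variable names are illustrative only):

```r
# sep joins corresponding elements of several vectors
by_sep <- paste(c("A", "B"), c(1, 2), sep = "")
by_sep       # "A1" "B2"

# collapse joins the elements of a single vector into one string
by_collapse <- paste(c("A", "1"), collapse = "")
by_collapse  # "A1"

# Both together: join element-wise first, then collapse the result
both <- paste(c("A", "B"), c(1, 2), sep = "", collapse = "+")
both         # "A1+B2"
```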
Ok, now for the explanations:
Why won't as.list work?
as.list converts an object to a list. So it does work. It will convert your dataframe to a list and subsequently ignore the sep="" argument. c combines objects together. Technically, a dataframe is just a list where every column is an element and all elements have to have the same length. So when I combine it with sep="", it just becomes a regular list with the columns of the dataframe as elements.
Why use do.call?
do.call allows you to call a function using a named list as its arguments. You can't just throw the list straight into paste, because it doesn't like dataframes. It's designed for concatenating vectors. So remember that df_args is a list containing a vector of letters, a vector of numbers and sep, which is a length 1 vector containing only "". When I use do.call, the resulting paste call is essentially paste(letters, numbers, sep="").
But what if my original dataframe had columns "letters", "numbers", "squigs", "blargs" after which I added the separator like I did before? Then the paste function through do.call would look like:
paste(letters, numbers, squigs, blargs, sep="")
So you see it works for any number of columns.
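Here is a small sketch of that idea with a third column (squigs is an invented column and paste_args an invented name, both just for illustration):

```r
df <- data.frame(
  letters = LETTERS[1:3],
  numbers = 1:3,
  squigs  = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

# c() on a data frame gives a plain list: the three columns plus sep
paste_args <- c(df, sep = "-")

# Equivalent to paste(df$letters, df$numbers, df$squigs, sep = "-")
pasted <- do.call(paste, paste_args)
pasted  # "A-1-x" "B-2-y" "C-3-z"
```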
For those using library(tidyverse), you can simply use the unite function (from tidyr).
new.df <- df %>%
  unite(together, letters, numbers, sep = "")
This will give you a new column called together with A1, B2, etc.
This is indeed a little weird, but this is also what is supposed to happen.
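When you create the data.frame as you did, the column letters is stored as a factor (this was the default before R 4.0.0, where stringsAsFactors became FALSE). A factor stores its values internally as integer codes that index into its levels, so when as.numeric() is applied to a factor it returns those codes rather than the labels. For example: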
> df[, 1]
[1] A B C D E
Levels: A B C D E
> as.numeric(df[, 1])
[1] 1 2 3 4 5
A is the first level of the factor df[, 1], therefore A gets converted to the value 1 when as.numeric is applied. This is what happens when you call paste(df[1, ]). Since columns 1 and 2 are of different classes, paste first transforms both elements of row 1 to numeric and then to character.
When you want to concatenate both columns, you first need to transform the first column to character:
df[, 1] <- as.character(df[, 1])
paste(df[1,], collapse = "")
As @sebastian-c pointed out, you can also use stringsAsFactors = FALSE in the creation of the data.frame, then you can omit the as.character() step.
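A brief sketch of how a factor's integer codes differ from its labels (f, codes, labels and joined are illustrative names; stringsAsFactors = TRUE is set explicitly so the behaviour does not depend on the R version):

```r
f <- factor(c("B", "A", "C"))

# Integer codes: the position of each value within levels(f), i.e. A, B, C
codes <- as.integer(f)
codes   # 2 1 3

# Labels: what is usually wanted when building strings
labels <- as.character(f)
labels  # "B" "A" "C"

# Converting the factor column explicitly gives the expected string
df <- data.frame(letters = LETTERS[1:5], numbers = 1:5, stringsAsFactors = TRUE)
row1 <- df[1, ]
joined <- paste(as.character(row1$letters), row1$numbers, sep = "")
joined  # "A1"
```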
If you want to start with
df <- data.frame(letters = LETTERS[1:5], numbers = 1:5, stringsAsFactors=TRUE)
then there is no general rule about how df$letters will be interpreted by any given function. It's a factor for modelling functions, a character vector for some others and an integer vector for others still. Even the same function, such as paste, may interpret it differently depending on how you use it:
paste(df[1,], collapse="") # "11"
apply(df, 1, paste, collapse="") # "A1" "B2" "C3" "D4" "E5"
No logic in it except that it will probably make sense once you know the internals of every function.
The factors seem to be converted to integers when an argument is converted to a vector. As you know, data frames are lists of vectors of equal length, so the first row of a data frame is also a list, and when it is forced to be a vector, something like this happens:
df[1,]
# letters numbers
# 1 A 1
unlist(df[1,])
# letters numbers
# 1 1
I don't know how apply achieves what it does (i.e., factors are represented by character values) -- if you're interested, look at its source code. It may be useful to know, though, that you can trust (in this specific sense) apply (in this specific occasion). More generally, it is useful to store every piece of data in a sensible format, that includes storing strings as strings, i.e., using stringsAsFactors=FALSE.
Btw, every introductory R book should have this idea in a subtitle. For example, my plan for retirement is to write "A (not so) gentle introduction to the zen of data fishery with R, the stringsAsFactors=FALSE way".
I have a function myFun(a,b,c,d,e,f,g,h) which contains a vectorised expression of its parameters within it.
I'd like to add a new column: data$result <- with(data, myFun(A,B,C,D,E,F,G,H)) where A,B,C,D,E,F,G,H are column names of data. I'm using data.table but data.frame answers are appreciated too.
So far the parameter list (column names) can be tedious to type out, and I'd like to improve readability. Is there a better way?
> myFun <- function(a,b,c) a+b+c
> dt <- data.table(a=1:5,b=1:5,c=1:5)
> with(dt,myFun(a,b,c))
[1] 3 6 9 12 15
The ultimate thing I would like to do is:
dt[isFlag, newCol:=myFun(A,B,C,D,E,F,G,H)]
However:
> dt[a==1,do.call(myFun,dt)]
[1] 3 6 9 12 15
Notice that the j expression seems to ignore the subset. The result should be just 3.
Ignoring the subset aspect for now: df$result <- do.call("myFun", df). But that copies the whole df whereas data.table allows you to add the column by reference: df[,result:=myFun(A,B,C,D,E,F,G,H)].
To include the comment from @Eddi (and I'm not sure how to combine these operations in data.frame so easily):
dt[isFlag, newCol := do.call(myFun, .SD)]
Note that .SD can be used even when you aren't grouping, just subsetting.
Or, if your function is literally just adding its arguments together row by row, use rowSums (do.call(sum, .SD) would collapse everything into a single grand total):
dt[isFlag, newCol := rowSums(.SD)]
This automatically places NA into newCol where isFlag is FALSE.
You can use
df$result <- do.call(myFun, df)
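For a plain data.frame, one possible way to combine the subset with do.call is sketched below. It assumes a logical column called flag and a vector arg_cols of column names that match myFun's arguments; both are made up for this example:

```r
myFun <- function(a, b, c) a + b + c

df <- data.frame(
  a = 1:5, b = 1:5, c = 1:5,
  flag = c(TRUE, FALSE, TRUE, FALSE, TRUE)
)

# Only pass the columns whose names match myFun's arguments;
# extra columns such as flag would trigger an "unused argument" error.
arg_cols <- c("a", "b", "c")

# Start with NA everywhere, then fill in the flagged rows only
df$result <- NA_integer_
df$result[df$flag] <- do.call(myFun, df[df$flag, arg_cols])

df$result  # 3 NA 9 NA 15
```

Unlike the data.table version, this copies data rather than updating by reference, so it is more of an illustration than a drop-in replacement for large tables.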
When I pass a row of a data frame to a function using apply, I lose the class information of the elements of that row. They all turn into 'character'. The following is a simple example. I want to add a couple of years to the three stooges' ages. When I try to add 2 to a value that had been numeric, R says "non-numeric argument to binary operator." How do I avoid this?
age = c(20, 30, 50)
who = c("Larry", "Curly", "Mo")
df = data.frame(who, age)
colnames(df) <- c( '_who_', '_age_')
dfunc <- function (er) {
print(er['_age_'])
print(er[2])
print(is.numeric(er[2]))
print(class(er[2]))
return (er[2] + 2)
}
a <- apply(df,1, dfunc)
Output follows:
_age_
"20"
_age_
"20"
[1] FALSE
[1] "character"
Error in er[2] + 2 : non-numeric argument to binary operator
apply only really works on matrices (which have the same type for all elements). When you run it on a data.frame, it simply calls as.matrix first.
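The coercion can be checked directly. In this sketch the columns are simply named who and age (instead of the question's underscored names) and m_all, m_num and aged are illustrative names:

```r
df <- data.frame(
  who = c("Larry", "Curly", "Mo"),
  age = c(20, 30, 50),
  stringsAsFactors = FALSE
)

# A matrix has a single type, so mixing text and numbers yields character
m_all <- as.matrix(df)
typeof(m_all)  # "character"

# With only numeric columns the matrix stays numeric
m_num <- as.matrix(df[, "age", drop = FALSE])
typeof(m_num)  # "double"

# apply() over the numeric part therefore sees real numbers
aged <- apply(df[, "age", drop = FALSE], 1, function(er) er["age"] + 2)
```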
The easiest way around this is to work on the numeric columns only:
# skips the first column
a <- apply(df[, -1, drop=FALSE],1, dfunc)
# Or in two steps:
m <- as.matrix(df[, -1, drop=FALSE])
a <- apply(m,1, dfunc)
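The drop=FALSE is needed so that the result stays a data frame instead of collapsing to a plain vector when only one column remains.
-1 means all but the first column; you could instead explicitly specify the columns you want, for example df[, c('foo', 'bar')]. Note that dropping columns shifts the positions of the remaining ones, so inside dfunc it is safer to index by name (er['_age_']) than by position (er[2]).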
UPDATE
If you want your function to access one full data.frame row at a time, there are (at least) two options:
# "loop" over the index and extract a row at a time
sapply(seq_len(nrow(df)), function(i) dfunc(df[i,]))
# Use split to produce a list where each element is a row
sapply(split(df, seq_len(nrow(df))), dfunc)
The first option is probably better for large data frames since it doesn't have to create a huge list structure upfront.
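A different technique from the two options above, offered only as a sketch: mapply can walk several columns in parallel, so each value keeps its own type and no row object is built at all. The function describe and the plain column names who and age are invented for this example:

```r
df <- data.frame(
  who = c("Larry", "Curly", "Mo"),
  age = c(20, 30, 50),
  stringsAsFactors = FALSE
)

# Hypothetical per-row function: who stays character, age stays numeric
describe <- function(who, age) paste(who, "will be", age + 2)

# mapply() calls describe() once per row, taking one element from each column
out <- mapply(describe, df$who, df$age, USE.NAMES = FALSE)
out  # "Larry will be 22" "Curly will be 32" "Mo will be 52"
```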