I'm looking to add a column to a data.table which is a concatenation of several other columns, the names of which I've stored in a vector cols. Per https://stackoverflow.com/a/21682545/1840471 I tried do.call + paste but couldn't get it working. Here's what I've tried:
# Using mtcars as example, e.g. first record should be "110 21 6"
library(data.table)
dt <- data.table(mtcars)
cols <- c("hp", "mpg", "cyl")
# Works old-fashioned way
dt[, slice.verify := paste(hp, mpg, cyl)]
# Raw do.call+paste fails with message:
# Error in do.call(paste, cols): second argument must be a list
dt[, slice := do.call(paste, cols)]
# Making cols a list makes the column "hpmpgcyl" for each row
dt[, slice := do.call(paste, as.list(cols))]
# Applying get fails with message:
# Error in (function (x) : unused arguments ("mpg", "cyl")
dt[, slice := do.call(function(x) paste(get(x)), as.list(cols))]
Help appreciated - thanks.
Similar questions:
Concatenate columns and add them to beginning of Data Frame (operates on data.frames using cbind and do.call - this was very slow on my data.table)
R - concatenate row-wise across specific columns of dataframe (doesn't deal with columns as names or large number of columns)
Accessing columns in data.table using a character vector of column names (considers aggregation using column names)
We can use mget to return the values of the columns named in 'cols' as a list, which do.call then passes to paste:
dt[, slice := do.call(paste, mget(cols))]
head(dt, 2)
# mpg cyl disp hp drat wt qsec vs am gear carb slice
#1: 21 6 160 110 3.9 2.620 16.46 0 1 4 4 110 21 6
#2: 21 6 160 110 3.9 2.875 17.02 0 1 4 4 110 21 6
Or another option is to specify the 'cols' in .SDcols and paste the .SD
dt[, slice:= do.call(paste, .SD), .SDcols = cols]
head(dt, 2)
# mpg cyl disp hp drat wt qsec vs am gear carb slice
#1: 21 6 160 110 3.9 2.620 16.46 0 1 4 4 110 21 6
#2: 21 6 160 110 3.9 2.875 17.02 0 1 4 4 110 21 6
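If a separator other than the default space is wanted, one possible sketch is to append a sep element to the list handed to do.call. The column name slice_sep and the "_" separator are assumptions made for illustration; they are not part of the answer above.

```r
library(data.table)

dt <- as.data.table(mtcars)
cols <- c("hp", "mpg", "cyl")

# c(.SD, sep = "_") builds list(hp = ..., mpg = ..., cyl = ..., sep = "_"),
# so do.call() ends up calling paste(hp, mpg, cyl, sep = "_")
dt[, slice_sep := do.call(paste, c(.SD, sep = "_")), .SDcols = cols]

head(dt$slice_sep, 2)
```

Because paste is vectorised over whole columns, this should stay fast on large tables, unlike a row-by-row loop.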
I came across a possibly simpler solution using apply, as follows:
df[, "combined_column"] <- apply(df[, cols], 1, paste0, collapse = "")
This may not work for data.tables, but it did what I needed and possibly what was needed here.
I want to order a data.table by using a set of predefined names available in a list.
For example:
library(data.table)
dt <- as.data.table(mtcars)
list_name <-c("mpg", "disp", "xyz")
#Order columns
setcolorder(dt, list_name) #requirement: if "xyz" column doesn't exist it should ignore and take the rest
The use case is that multiple data.tables are being created, and all of them take their column names from a list of names. Some column names can be missing from some of the data, but each data.table still needs to be ordered according to the list.
output:
dt
disp wt mpg cyl hp drat qsec vs am gear carb
1: 160.0 2.620 21.0 6 110 3.90 16.46 0 1 4 4
2: 160.0 2.875 21.0 6 110 3.90 17.02 0 1 4 4
3: 108.0 2.320 22.8 4 93 3.85 18.61 1 1 4 1
An option is to load all of them into a list and loop over it with lapply, calling setcolorder on the intersect of list_name and each dataset's names so that missing columns are ignored:
lst1 <- list(dt, dt)
lst1 <- lapply(lst1, function(x) setcolorder(x, intersect(list_name, names(x))))
If we need to reuse, create a function
f1 <- function(dat, nm1) {
  setcolorder(dat, intersect(nm1, names(dat)))
}
f1(dt, list_name)
# dt2 stands for any other data.table to be reordered the same way
f1(dt2, list_name)
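As a self-contained sketch of the same idea, the snippet below also records which requested names were skipped. The variable names present and skipped are hypothetical and only used for illustration.

```r
library(data.table)

dt <- as.data.table(mtcars)
list_name <- c("mpg", "disp", "xyz")

present <- intersect(list_name, names(dt))  # requested names that exist, in list order
skipped <- setdiff(list_name, names(dt))    # requested names that are ignored

# setcolorder() reorders by reference; columns not listed keep their relative order
setcolorder(dt, present)

names(dt)[seq_along(present)]
skipped
```

Keeping skipped around makes it easy to log or warn about missing columns instead of dropping them silently.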
I cannot successfully sum a column in RStudio from a SQL database. I keep getting the error "Error in FUN: only defined on a data frame with all numeric variables".
Currently, I have:
newObject <- dataFrame %>% sum("COLUMN NAME", na.rm = FALSE)
The problem is that you're trying to pipe the entire dataFrame object into the sum function.
In essence, you're trying this:
newObject <- sum(dataFrame, "COLUMN NAME", na.rm = FALSE)
That isn't working because some of the values in your dataFrame are character. And even if they aren't, "COLUMN NAME" at the very least is a character string.
You might be looking for summarise, but other possibilities may be transmute or mutate:
mtcars %>%
summarise(Sum = sum(mpg, na.rm= FALSE))
# Sum
#1 642.9
mtcars %>%
transmute(Sum = sum(mpg, na.rm=FALSE))
# Sum
#1 642.9
#2 642.9
#...
mtcars %>%
mutate(Sum = sum(mpg, na.rm= FALSE))
# mpg cyl disp hp drat wt qsec vs am gear carb Sum
#1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 642.9
#2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 642.9
#...
Here mpg is the name of a column in mtcars. You can replace that with your column name, but without quotes.
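If the column name really is held as a string, a minimal base R sketch is to extract the column with [[ ]] before summing. The variable col_name is hypothetical and mtcars stands in for the asker's data frame.

```r
col_name <- "mpg"  # hypothetical: the column name held as a string

# [[ ]] extracts a single column as a plain vector, which sum() accepts
total <- sum(mtcars[[col_name]], na.rm = FALSE)
total
```

This avoids passing the whole data frame, or the string itself, to sum.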
I often join two dataframes that share a column name. Is there a way to handle this within the join step so that I don't end up with a .x and a .y column? So the names might be 'original_mpg' and 'new_mpg'?
library(dplyr)
joined <- left_join(mtcars, mtcars[, c("mpg", "cyl")], by = "cyl")
names(joined) # ugh: includes mpg.x and mpg.y
Currently, this is an open issue with dplyr. You'll either have to rename before or after the join or use merge from base R, which takes a suffixes argument.
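The base R route mentioned above could look roughly like this. It is only a sketch: all.x = TRUE is added here as an assumption to mimic a left join, and the suffix strings are illustrative.

```r
merged <- merge(
  mtcars,
  mtcars[, c("mpg", "cyl")],
  by = "cyl",
  all.x = TRUE,                        # keep every row of the left table
  suffixes = c("_original", "_new")    # replaces the default ".x" / ".y"
)

names(merged)
```

Note that merge sorts the result by the key by default and does not keep the row names of mtcars, so the row order will differ from a dplyr join.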
The default suffixes, c(".x", ".y"), can be overridden by passing them as a character vector of length 2:
library(dplyr)
left_join(mtcars, mtcars[,c("mpg","cyl")],
by = c("cyl"),
suffix = c("_original", "_new")) %>%
head()
Output
mpg_original cyl disp hp drat wt qsec vs am gear carb mpg_new
1 21 6 160 110 3.9 2.62 16.46 0 1 4 4 21.0
2 21 6 160 110 3.9 2.62 16.46 0 1 4 4 21.0
3 21 6 160 110 3.9 2.62 16.46 0 1 4 4 21.4
4 21 6 160 110 3.9 2.62 16.46 0 1 4 4 18.1
5 21 6 160 110 3.9 2.62 16.46 0 1 4 4 19.2
6 21 6 160 110 3.9 2.62 16.46 0 1 4 4 17.8
You can use suffix with a slightly modified function I found in the help of strsplit to make it a prefix
library(dplyr)
mt_cars <- left_join(mtcars, mtcars[,c("mpg","cyl")],
by = c("cyl"),
suffix = c("_original", "_new"))
strReverse <- function(x){
sapply(lapply(strsplit(x, "_"), rev), paste, collapse = "_")
}
colnames(mt_cars) <- strReverse(colnames(mt_cars))
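One caveat with reversing on every underscore is that a column such as "a_b_c" would also be rearranged. A narrower sketch, assuming the suffixes are exactly "_original" and "_new", uses sub with a regular expression. The vector cols is made-up sample input.

```r
cols <- c("mpg_original", "cyl", "disp", "mpg_new")

# Move a trailing "_original" or "_new" to the front; other names pass through
renamed <- sub("^(.*)_(original|new)$", "\\2_\\1", cols)
renamed
```

The same call could be applied to colnames(mt_cars) in place of strReverse.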
Well, I had a similar question when I found this post.
I found a different solution to the question that I hope helps.
The solution is actually fairly simple: build a list of all the data frames you want to merge and fold over it with base R's Reduce function.
library(dplyr)
df_list <- list(df1, df2, df3)
df <- Reduce(function(x, y) merge(x, y, all=TRUE), df_list)
This was a solution to another problem I had, where I wanted to simplify merging multiple dataframes. But it works all the same with just two dataframes in the list, and merging does not rename the columns.
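To make the Reduce pattern concrete, here is a small runnable sketch. The three toy data frames and the explicit by = "id" key are assumptions for illustration, since df1, df2 and df3 are not defined in the answer.

```r
# Hypothetical toy data frames sharing an "id" key
df1 <- data.frame(id = 1:3, a = c(10, 20, 30))
df2 <- data.frame(id = 2:4, b = c("x", "y", "z"))
df3 <- data.frame(id = 1:4, c = c(TRUE, FALSE, TRUE, FALSE))

df_list <- list(df1, df2, df3)

# Reduce() merges df1 with df2, then merges that result with df3
merged <- Reduce(function(x, y) merge(x, y, by = "id", all = TRUE), df_list)
merged
```

With no by argument, merge joins on every column name the two inputs share, which is why identically named columns are matched rather than suffixed.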
I am getting this error when I try to create a new column in a data.table programmatically:
dt[, (new_x) := get(x)]
# Error in get(x) : invalid first argument
Where x is a variable that holds the name of the column that I am using in the assignment, which also happens to be named "x" in this case. In other words, x <- "x", and "x" %in% names(dt) is TRUE. This error only seems to occur when the variable name is the same as the column name.
A reproducible example:
library(data.table)
# Our data.table
dt <- as.data.table(mtcars)
dt
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
# 2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
# 3: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# 4: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
# ...
# My new column name
new_col <- "new_column"
# Will make my new column be the sum of two columns
mpg <- "mpg"
cyl <- "cyl"
# I thought this would work:
dt[, (new_col) := get(mpg) + get(cyl)]
# Error in get(mpg) : invalid first argument
# If the variable names are not the same as the string it contains, it works
mpg_col <- "mpg"
cyl_col <- "cyl"
dt[, (new_col) := get(mpg_col) + get(cyl_col)]
Now, in my script, I have a helper function that takes in two column names, x and y, as arguments to calculate a new column with name new_col.
calculate_new_column <- function(dt, x, y, new_col) {
dt[, (new_col) := some calculation with x and y ]
}
Is there a way to make my function safe to this kind of corner case, where x = 'x' or y = 'y'? I guess I could give unique names to the arguments of the function (e.g. .x. and .y.), but would prefer a better solution.
EDIT
Following my reproducible example, it seems this works:
dt[, (new_col) := get(eval(mpg)) + get(eval(cyl))]
But I am wary of using eval and am not sure if this follows best practices. Would this be the way to go?
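One way to sidestep the name clash, offered here only as a sketch and not as an accepted answer from the thread, is to do the lookup outside of data.table's j expression. dt[[x]] and set() are evaluated like ordinary function calls, so a variable called mpg is never confused with the column mpg.

```r
library(data.table)

calculate_new_column <- function(dt, x, y, new_col) {
  # dt[[x]] is plain list-style extraction: x is looked up in the function,
  # not among the columns of dt
  set(dt, j = new_col, value = dt[[x]] + dt[[y]])
  invisible(dt)
}

dt <- as.data.table(mtcars)

# Variables deliberately named like the columns they refer to
mpg <- "mpg"
cyl <- "cyl"
calculate_new_column(dt, mpg, cyl, "new_column")

head(dt$new_column)
```

set() adds the column by reference, so the caller's dt is updated without reassignment. The sum of the two columns stands in for whatever calculation the helper actually needs.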
I'm familiar with being able to extract columns from an R data frame (or matrix) like so:
df.2 <- df[, c("name1", "name2", "name3")]
But can one use a ! or other tool to select all but those listed columns?
For background, I have a data frame with quite a few column vectors and I'd like to avoid:
Typing out the majority of the names when I could just remove a minority
Using the much shorter df.2 <- df[, c(1,3,5)] because when my .csv file changes, my code goes to heck since the numbering isn't the same anymore. I'm new to R and think I've learned the hard way not to use number vectors for larger df's that might change.
I tried:
df.2 <- df[, !c("name1", "name2", "name3")]
df.2 <- df[, !=c("name1", "name2", "name3")]
And just as I was typing this, found out that this works:
df.2 <- df[, !names(df) %in% c("name1", "name2", "name3")]
Is there a better way than this last one?
An alternative to grep is which:
df.2 <- df[, -which(names(df) %in% c("name1", "name2", "name3"))]
You can make a shorter call that is also more generalizable with negative-grep:
df.2 <- df[, -grep("^name[1-3]$", names(df))]
Since grep returns numeric positions, you can use negative vector indexing to remove columns. You could add further numbers or more complex patterns.
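A caveat worth knowing with -which() and -grep(): when nothing matches, the index is integer(0), and negating it still gives integer(0), which selects no columns at all. The sketch below uses a made-up data frame to show the difference from the logical form.

```r
df <- data.frame(name1 = 1:2, name2 = 3:4, other = 5:6)
to_drop <- c("not_a_column")  # hypothetical name that is absent from df

idx <- which(names(df) %in% to_drop)  # integer(0): nothing matched

# Negative indexing with an empty index drops every column
by_position <- df[, -idx, drop = FALSE]

# The logical form keeps everything when nothing matches
by_logical <- df[, !names(df) %in% to_drop, drop = FALSE]

ncol(by_position)
ncol(by_logical)
```

For that reason the logical version is the safer default when the names to drop may not all be present.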
dplyr::select() has several options for dropping specific columns:
library(dplyr)
drop_columns <- c('cyl','disp','hp')
mtcars %>%
select(-one_of(drop_columns)) %>%
head(2)
mpg drat wt qsec vs am gear carb
Mazda RX4 21 3.9 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21 3.9 2.875 17.02 0 1 4 4
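In more recent versions of dplyr (1.0.0 and later, as far as I know), any_of() is the suggested replacement for one_of(). The sketch below assumes such a version is installed. The extra name "not_a_column" is hypothetical and is there only to show that missing names are tolerated.

```r
library(dplyr)

drop_columns <- c("cyl", "disp", "hp", "not_a_column")

# any_of() ignores names that are not present instead of raising an error
kept <- mtcars %>%
  select(-any_of(drop_columns))

names(kept)
```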
Negating specific column names, the following drops the column "hp" and the columns from "qsec" through "gear":
mtcars %>%
select(-hp, -(qsec:gear)) %>%
head(2)
mpg cyl disp drat wt carb
Mazda RX4 21 6 160 3.9 2.620 4
Mazda RX4 Wag 21 6 160 3.9 2.875 4
You could also negate contains(), starts_with(), ends_with(), or matches():
mtcars %>%
select(-contains('t')) %>%
select(-starts_with('a')) %>%
select(-ends_with('b')) %>%
select(-matches('^m.+g$')) %>%
head(2)
cyl disp hp qsec vs gear
Mazda RX4 6 160 110 16.46 0 4
Mazda RX4 Wag 6 160 110 17.02 0 4
Old thread, but here's another solution:
df.2 <- subset(df, select=-c(name1, name2, name3))
This was posted in another similar thread (though I can't find it right now). Should be sustainable code in the situation you describe, and is probably easier to read and edit than some of the other options.
You could make a custom function to do this if you're using it for your own use to manipulate data. I may do something like this:
rm.col <- function(df, ...) {
  # capture the unquoted column names passed in ...
  x <- substitute(...())
  z <- trimws(unlist(lapply(x, function(y) as.character(y))))
  df[, !names(df) %in% z]
}
rm.col(mtcars, hp, mpg)
The first argument is the dataframe. The following ... arguments are the names of any columns you wish to remove.
The easiest way that comes to my mind:
filtered_df <- df[, setdiff(names(df), c("name1", "name2"))]
Essentially, you are computing the set difference between the full list of column names and the subset you want to filter out (name1 and name2 above).
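A runnable version of the setdiff approach might look like this. The data frame is made up for illustration, and drop = FALSE is an added assumption so the result stays a data frame even if only one column is left.

```r
df <- data.frame(name1 = 1:2, name2 = 3:4, name3 = 5:6, other = 7:8)

# Columns to keep: everything except name1 and name2
keep <- setdiff(names(df), c("name1", "name2"))

filtered_df <- df[, keep, drop = FALSE]
names(filtered_df)
```

Names listed for removal that do not exist in df are simply ignored by setdiff.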