If you are familiar with SVMs, you know we can map data to a higher-dimensional space in order to deal with non-linearity. I want to do that explicitly. I have 19 features, and for every pair of features x_i and x_j (with i < j) I need to compute:
sqrt(2)*x_i*x_j
and also the square of each feature:
(x_i)^2
so the new features will be:
(x_1)^2, (x_2)^2, ..., (x_19)^2, sqrt(2)*x_1*x_2, sqrt(2)*x_1*x_3, ...
At the end, I want to remove any columns whose values are all zero.
Example:
col1 col2 col3
   1    2    6
New data frame:
 col1  col2  col3            col4            col5            col6
(1)^2 (2)^2 (6)^2 sqrt(2)*(1)*(2) sqrt(2)*(1)*(6) sqrt(2)*(2)*(6)
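For context, this is the explicit feature map \phi of the degree-2 polynomial kernel; the sqrt(2) factor is exactly what makes the inner product of the mapped vectors reproduce the squared inner product of the originals:
(x^\top z)^2 = \sum_i x_i^2 z_i^2 + \sum_{i<j} (\sqrt{2}\, x_i x_j)(\sqrt{2}\, z_i z_j) = \langle \phi(x), \phi(z) \rangle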
I use the data.table package for this kind of operation. You will need gtools as well, to generate the combinations of the features.
# input data frame
df <- data.frame(x1 = 1:3, x2 = 4:6, x3 = 7:9)

library(data.table)
library(gtools)

# convert to a data.table
dt <- as.data.table(df)

# specify the feature variables
features <- c("x1", "x2", "x3")

# squared columns
dt[, (paste0(features, "_squared")) := lapply(.SD, function(x) x^2),
   .SDcols = features]

# pairwise-combination columns
all_combs <- as.data.table(gtools::combinations(n = length(features), r = 2, v = features))
for (i in 1:nrow(all_combs)) {
  set(dt,
      j = paste0(all_combs[i, V1], "_", all_combs[i, V2]),
      value = sqrt(2) * dt[, get(all_combs[i, V1]) * get(all_combs[i, V2])])
}

# convert back to a data frame
df2 <- as.data.frame(dt)
df2
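The code above skips the final step from the question; a minimal sketch for dropping all-zero columns, assuming df2 as produced above:
# drop columns whose values are all zero (the question's final step)
all_zero <- sapply(df2, function(x) all(x == 0))
df2 <- df2[, !all_zero, drop = FALSE]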
I have a vector with 2 elements:
v1 <- c('X1','X2')
I want to create all possible combinations of these elements. The resultant data frame would look like:
structure(list(ID = c(1, 2, 3, 4),
               c1 = c("X1", "X2", "X1", "X2"),
               c2 = c("X1", "X1", "X2", "X2")),
          class = "data.frame", row.names = c(NA, -4L))
Here, the rows with ID = 2 and ID = 3 contain the same elements, just arranged in a different order, so I would like to treat these two rows as duplicates. Hence, the final output should have only 3 rows,
i.e. these 3 combinations:
X1, X1
X1, X2
X2, X2
In my actual dataset, I have 16 such elements in the vector v1.
I have tried the expand.grid approach for obtaining the possible combinations, but it actually exceeds my machine's limits: the number of ordered combinations of 16 elements is enormous, largely because of the duplications described above.
Can someone help me get all possible combinations without any duplicates?
I am ideally looking for a solution that uses data.table functionality, which I believe would be faster.
Thanks in advance.
Here is a short solution using data.table's CJ() with your sample data:
First, create the combinations. Using unique = TRUE cuts back on the number of rows.
library(data.table)
# CJ() already returns a data.table, so setDT() is not needed
data <- CJ(df$c1, df$c2, unique = TRUE)
Then, filter out duplicates:
# sort each row so that (X2, X1) and (X1, X2) compare equal, then drop duplicates
data[!duplicated(t(apply(data, 1, sort))), ]
This gives us:
   V1 V2
1: X1 X1
2: X1 X2
3: X2 X2
I would look into the ?expand.grid function for this type of task.
expand.grid(v1, v1)
Var1 Var2
1 X1 X1
2 X2 X1
3 X1 X2
4 X2 X2
In the code below, dat4 is the final output.
v1 <- c('X1','X2')

library(data.table)

dat <- expand.grid(v1, v1, stringsAsFactors = FALSE)
setDT(dat)

# A function to combine and sort the strings from two columns
f <- function(x, y){
  z <- paste(sort(c(x, y)), collapse = "-")
  return(z)
}

# Apply the f function to each row
dat2 <- dat[, comb := f(Var1, Var2), by = 1:nrow(dat)]

# Remove the duplicated elements in the comb column
dat3 <- unique(dat2, by = "comb")

# Select the columns
dat4 <- dat3[, c("Var1", "Var2")]
print(dat4)
#    Var1 Var2
# 1:   X1   X1
# 2:   X2   X1
# 3:   X2   X2
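As an aside, here is a vectorized sketch of the same canonicalization (my variation, not part of the answer above): pmin()/pmax() also work on character vectors, so the row-by-row f() can be replaced with a single assignment:
dat[, comb := paste(pmin(Var1, Var2), pmax(Var1, Var2), sep = "-")]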
You may want to check RcppAlgos::comboGeneral, which does exactly what you want and is known to be fast and memory efficient. Just do something like this:
vars <- paste0("X", 1:2)
RcppAlgos::comboGeneral(vars, length(vars), repetition = TRUE)
Output
[,1] [,2]
[1,] "X1" "X1"
[2,] "X1" "X2"
[3,] "X2" "X2"
On my laptop with 16 GB of RAM, I can run this function for up to 14 variables, and it takes less than 5 seconds to finish, so speed is not really the concern. However, note that you would need at least 17.9 GB of RAM to hold all 16-variable combinations.
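If you want to check the size of the result before generating anything, RcppAlgos also provides comboCount(); a small sketch (the figure is just choose(31, 16), the number of 16-element combinations with repetition drawn from 16 items):
vars <- paste0("X", 1:16)
# count the combinations without materializing them
RcppAlgos::comboCount(vars, length(vars), repetition = TRUE)
#> [1] 300540195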
We can use crossing from tidyr
library(tidyr)
crossing(v1, v1)
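Note that crossing() returns ordered pairs, so both X1/X2 and X2/X1 appear in the output. A possible dedup step to match the other answers (a sketch; the column names c1 and c2 are my own labels, passed explicitly):
library(tidyr)
library(dplyr)

crossing(c1 = v1, c2 = v1) %>%
  filter(c1 <= c2)   # keep one ordering of each unordered pair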
Imagine you have 2 distributions resulting from two simulations stored in a data.frame:
sim1 = 1:10
sim2 = 91:100
sim = data.frame(sim1, sim2)
Now, we want to find the 10th and 90th percentiles of each distribution. This can be done by:
diffSim = ncol(sim)
confidenceInterval = c(0.1, 0.9)
results = lapply(1:diffSim, function(j) {
  quantile(sim[, j], confidenceInterval, names = FALSE, type = 3)
})
I would like to store these results in a data.table by assigning by reference (:=). However, I first need to get results into the appropriate shape (i.e. a data.table with 1 row and 4 columns). To do so, I subsequently apply unlist, matrix and as.data.table to results:
DT = data.table(Col1 = "Result")
DT[, c("col2", "col3", "col4", "col5") := as.data.table(matrix(unlist(results), nrow = 1))]
I don't like this at all. Is there a shorter way of doing this?
Not necessarily shorter, but everything in data.table:
library(data.table)
setDT(sim)[, .(col1 = 'Result',
               cols = paste0('col', 2:5),
               vals = unlist(lapply(.SD, quantile, probs = confidenceInterval, type = 3)))
][, dcast(.SD, col1 ~ cols, value.var = 'vals')]
which gives:
col1 col2 col3 col4 col5
1: Result 1 9 91 99
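For what it's worth, a possibly shorter route (a sketch, not part of the answer above): the right-hand side of := can be a list, which assigns one element per target column, so the unlist/matrix/as.data.table detour can be skipped:
library(data.table)
DT <- data.table(Col1 = "Result")
# as.list() splits the 4 quantiles into one column each
DT[, (paste0("col", 2:5)) := as.list(unlist(results))]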
This may be a bad question because I am not posting a reproducible example. My main goal is to identify columns that have different types between two data frames that share the same column names.
For example:
df1
      Id   Col1    Col2 Col3
 Numeric Factor Integer Date
df2
      Id    Col1    Col2 Col3
 Numeric Numeric Integer Date
Here both data frames (df1, df2) have the same column names, but the type of Col1 differs, and I am interested in identifying such columns. Expected output:
Col1 Factor Numeric
Any suggestions or tips on achieving this? Thanks.
Try compare_df_cols() from the janitor package:
library(janitor)
mtcars2 <- mtcars
mtcars2$cyl <- as.character(mtcars2$cyl)
compare_df_cols(mtcars, mtcars2, return = "mismatch")
#> column_name mtcars mtcars2
#> 1 cyl numeric character
Self-promotion alert: I authored this package, and I'm posting this function because it exists to solve precisely this problem.
Try this:
compareColumns <- function(df1, df2) {
  commonNames <- names(df1)[names(df1) %in% names(df2)]
  data.frame(Column = commonNames,
             df1 = sapply(df1[, commonNames], class),
             df2 = sapply(df2[, commonNames], class))
}
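A quick usage sketch with made-up inputs (these df1/df2 are hypothetical stand-ins for the question's data):
df1 <- data.frame(Id = 1, Col1 = factor("a"), Col2 = 2L)
df2 <- data.frame(Id = 1, Col1 = "a", Col2 = 2L, stringsAsFactors = FALSE)
compareColumns(df1, df2)
#>      Column     df1       df2
#> Id       Id numeric   numeric
#> Col1   Col1  factor character
#> Col2   Col2 integer   integer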
For a more compact method, you could use sapply() over a list of the data frames. Efficiency shouldn't be a problem here, since all we're doing is grabbing the class of each column. Naming the list entries gives a clearer output:
m <- sapply(list(df1 = df1, df2 = df2), sapply, class)
m[m[, "df1"] != m[, "df2"], , drop = FALSE]
# df1 df2
# Col1 "factor" "character"
where df1 and df2 are the data from #ycw's answer.
If the two data frames have the same column names, the following will give you the columns with different classes.
library(dplyr)
m1 = mtcars
m2 = mtcars %>% mutate(cyl = factor(cyl), vs = factor(vs))
out = cbind(sapply(m1, class), sapply(m2, class))
out[apply(out, 1, function(x) !identical(x[1], x[2])), ]
#     [,1]      [,2]
# cyl "numeric" "factor"
# vs  "numeric" "factor"
We can use sapply with class to loop through all the columns in df1 and df2, and then compare the results.
# Create example data frames
# (stringsAsFactors = TRUE is explicit here so that Col1 becomes a factor
#  even on R >= 4.0, where the default changed to FALSE)
df1 <- data.frame(ID = 1:3,
                  Col1 = as.character(2:4),
                  Col2 = 2:4,
                  Col3 = as.Date(paste0("2017-01-0", 2:4)),
                  stringsAsFactors = TRUE)

df2 <- data.frame(ID = 1:3,
                  Col1 = as.character(2:4),
                  Col2 = 2:4,
                  Col3 = as.Date(paste0("2017-01-0", 2:4)),
                  stringsAsFactors = FALSE)

# Use sapply and class to get the class of every column
class1 <- sapply(df1, class)
class2 <- sapply(df2, class)

# Combine the results, then filter for rows that differ
result <- data.frame(class1, class2, stringsAsFactors = FALSE)
result[!(result$class1 == result$class2), ]

#      class1    class2
# Col1 factor character
I was playing with some data and trying to create a new data frame that contains key and value pairs that could be a dictionary. Here's some sample data and a quick manual solution.
df = data.frame(col1 = c("one", "one", "two", "two", "one"),
                col2 = c("AG", "AB", "AC", "AG", "AB"),
                col3 = c("F3", "F1", "F2", "F3", "F2"))
df

d1 = data.frame(vals = unique(df$col1))
d2 = data.frame(vals = unique(df$col2))
d3 = data.frame(vals = unique(df$col3))
d1
d2
d3

d1$name = "col1"
d2$name = "col2"
d3$name = "col3"
d1
d2
d3

rbind(d1, d2, d3)
Of course, this is a simple use case; real data will be messier and have many more columns. For that reason, I was looking for a loop that could go through the columns and collect the key-value pairs into a dictionary-like data frame.
Most of my attempts have resulted in failure. Here's the skeleton of my solution, but I'm not sure how to dynamically build the new_df dictionary. Any suggestions?
new_df = data.frame()
prod.cols = c("col1", "col2", "col3")

for(col in prod.cols){
  if(col %in% colnames(df)){
    ## solution in here
  }
}

new_df
tidyr makes this easy:
library(tidyr)
df %>% gather(name, vals) %>% unique()
# name vals
# 1 col1 one
# 3 col1 two
# 6 col2 AG
# 7 col2 AB
# 8 col2 AC
# 11 col3 F3
# 12 col3 F1
# 13 col3 F2
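As an aside, gather() is superseded in tidyr >= 1.0; a possible modern equivalent (a sketch of the same idea with pivot_longer() and distinct()):
library(tidyr)
library(dplyr)

df %>%
  pivot_longer(everything(), names_to = "name", values_to = "vals") %>%
  distinct()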
alistaire's answer is quite elegant and readable. Just for fun, here's a base R approach. Not that efficiency is particularly important here, but this scales relatively well as more rows and columns are added:
My second and third approaches are nicer than my first, so I'm moving them to the top of the answer:
Approach # 2, implementing thelatemail's comment for a nice, efficient one-liner:
stack(lapply(df, function(ii) as.character(unique(ii))))
What's nice about this solution is that it first reduces each column with unique, which means less work for as.character and then for stack.
Approach # 3: a more concise and more efficient version of approach # 2 that avoids the need for unique and the character conversion by using levels; note that this assumes the columns are factors (e.g., a data frame built with stringsAsFactors = TRUE), since levels() returns NULL for character columns:
stack(lapply(df, levels))
First approach:
Reduce(rbind,
       lapply(seq_along(df),
              function(ii) data.frame(vals = unique(df[, ii]),
                                      name = names(df)[ii])))
# vals name
#1 one col1
#2 two col1
#3 AG col2
#4 AB col2
#5 AC col2
#6 F3 col3
#7 F1 col3
#8 F2 col3
Using do.call instead of Reduce is roughly equivalent here:
do.call(rbind,
        lapply(seq_along(df),
               function(ii) data.frame(vals = unique(df[, ii]),
                                       name = names(df)[ii])))
We can also do this with reshape2: melting the matrix form of df gives the row index (Var1), the column name (Var2) and the value; dropping the first column and taking unique() leaves the distinct name/value pairs.
library(reshape2)
unique(melt(as.matrix(df))[-1])
I'm trying to put together several files and need to do a bunch of merges on column names that are created inside a loop. I can do this fine using a data.frame, but am having issues with similar code for a data.table:
library(data.table)

df1 <- data.frame(id = 1:20, col1 = runif(20))
df2 <- data.frame(id = 1:20, col1 = runif(20))

newColNum <- 5
newColName <- paste('col', newColNum, sep = '')
df1[, newColName] <- runif(20)
df2 <- merge(df2, df1[, c('id', newColName)], by = 'id', all.x = TRUE)  # Works fine

######################

dt1 <- data.table(id = 1:20, col1 = runif(20))
dt2 <- data.table(id = 1:20, col1 = runif(20))

newColNum <- 5
newColName <- paste('col', newColNum, sep = '')
dt1[, newColName] <- runif(20)
dt2 <- merge(dt2, dt1[, c('id', newColName)], by = 'id', all.x = TRUE)  # Doesn't work
Any suggestions?
This really has nothing to do with merge(), and everything to do with how the j (i.e. column) index is, by default, interpreted by [.data.table().
You can make the whole statement work by setting with=FALSE, which causes the j index to be interpreted as it would be in a data.frame:
dt2 <- merge(dt2, dt1[, c('id', newColName), with = FALSE], by = 'id', all.x = TRUE)
head(dt2, 3)
# id col1 col5
# 1: 1 0.4954940 0.07779748
# 2: 2 0.1498613 0.12707070
# 3: 3 0.8969374 0.66894157
More precisely, from ?data.table:
with: By default 'with=TRUE' and 'j' is evaluated within the frame
of 'x'. The column names can be used as variables. When
'with=FALSE', 'j' is a vector of names or positions to
select.
Note that this could be avoided by storing the columns in a variable like so:
cols = c('id', newColName)
dt1[ , ..cols]
The .. prefix signals data.table to "look up one level", i.e. to evaluate cols in the calling scope rather than as a column name.
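Putting that together, the merge from the question could be written as follows (a sketch, equivalent to the with=FALSE version above):
cols <- c('id', newColName)
dt2 <- merge(dt2, dt1[, ..cols], by = 'id', all.x = TRUE)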
Try dt1[, list(id, get(newColName))] in your merge. Note that the extracted column will come out named V2 unless you rename it afterwards.