Is it possible to manipulate data frame by column name and number? - r

I'm working on large data set, about 900 columns. i have something like this:
B <- c(1)
A_1 <- c(2)
A_2 <- c(3)
A_3 <- c(7)
A_4 <- c(9)
df <- data.frame(B,A_1,A_2,A_3,A_4)
I would like to be able to do something like this :
df[,A_1:A_1+3]
Do you know if it's possible ?
I'm also working with data.table so if there is a way with data.table it could be good.

Base R's subset will let you do this.
subset(mtcars, , mpg:(mpg + 1))
# mpg cyl
#Mazda RX4 21.0 6
#Mazda RX4 Wag 21.0 6
#Datsun 710 22.8 4
#Hornet 4 Drive 21.4 6
#Hornet Sportabout 18.7 8
#...
dplyr's select works the same way.

Related

Subset a dataframe using colnames from another dataframe

Question:
I have a particular problem where I want to subset a given dataframe columnwise where the column names are stored in another dataframe.
Example using mtcars dataset:
options(stringsAsFactors = FALSE)
col_names <- c("hp,disp", "disp,hp,mpg")
df_col_names <- as.data.frame(col_names)
vec <- df_col_names[1,] # first row contains "hp" and "disp"
mtcars_new <- mtcars[, c("hp", "disp")] ## assuming that vec gives colnames
I even tried inserting double quotes to each of the words using the following:
Attempted solution:
options(stringsAsFactors = FALSE)
col_names <- c("hp,disp", "disp,hp,mpg")
df_col_names <- as.data.frame(col_names)
df_col_names$col_names <- gsub("(\\w+)", '"\\1"', df_col_names$col_names)
vec <- df_col_names[1,]
vec2 <- gsub("(\\w+)", '"\\1"', vec)
mtcars_new <- mtcars[,vec2] ## this should be same as mtcars[, c("hp", "disp")]
Expected Solution
mtcars_new <- mtcars[,vec2] is equal to mtcars_new <- mtcars[, c("hp", "disp")]
Here's another way to do this:
col_names <- c("hp,disp", "disp,hp,mpg")
vec2 <- unlist(str_split(col_names[[1]],','))
mtcars_new <- mtcars[,vec2]
What you are doing is picking the first element from the col_names vector, splitting it by the separator, then unlisting it (because str_split() makes a list), then you are using your new vector of names to subset the mtcars data-frame.
Do you need this?
lapply(strsplit(as.character(df_col_names$col_names), ","), function(x) mtcars[x])
#[[1]]
# hp disp
#Mazda RX4 110 160.0
#Mazda RX4 Wag 110 160.0
#Datsun 710 93 108.0
#Hornet 4 Drive 110 258.0
#Hornet Sportabout 175 360.0
#.....
#[[2]]
# disp hp mpg
#Mazda RX4 160.0 110 21.0
#Mazda RX4 Wag 160.0 110 21.0
#Datsun 710 108.0 93 22.8
#Hornet 4 Drive 258.0 110 21.4
#Hornet Sportabout 360.0 175 18.7
#....
Here, we split the column names on comma (",") and then subset it from the dataframe using lapply. This returns a list of dataframes with length of list which is same as number of rows in the data frame.
If you want to subset only the first row, you could do
mtcars[strsplit(as.character(df_col_names$col_names[1]), ",")[[1]]]

Remove columns with dplyr [duplicate]

This question already has answers here:
how to drop columns by passing variable name with dplyr?
(6 answers)
Closed 5 years ago.
I'm interested in simplifying the way that I can remove columns with dplyr (version >= 0.7). Let's say that I have a character vector of names.
drop <- c("disp", "drat", "gear", "am")
Selecting Columns
With the current version version of dplyr, you can perform a selection with:
dplyr::select(mtcars, !! rlang::quo(drop))
Or even easier with base R:
mtcars[, drop]
Removing Columns
Removing columns names is another matter. We could use each unquoted column name to remove them:
dplyr::select(mtcars, -disp, -drat, -gear, -am)
But, if you have a data.frame with several hundred columns, this isn't a great solution. The best solution I know of is to use:
dplyr::select(mtcars, -which(names(mtcars) %in% drop))
which is fairly simple and works for both dplyr and base R. However, I wonder if there's an approach which doesn't involve finding the integer positions for each column name in the data.frame.
Use modify_atand set columns to NULL which will remove them:
mtcars %>% modify_at(drop,~NULL)
# mpg cyl hp wt qsec vs carb
# Mazda RX4 21.0 6 110 2.620 16.46 0 4
# Mazda RX4 Wag 21.0 6 110 2.875 17.02 0 4
# Datsun 710 22.8 4 93 2.320 18.61 1 1
# Hornet 4 Drive 21.4 6 110 3.215 19.44 1 1
# Hornet Sportabout 18.7 8 175 3.440 17.02 0 2
# Valiant 18.1 6 105 3.460 20.22 1 1
# ...
Closer to what you were trying, you could have tried magrittr::extract instead of dplyr::select
extract(mtcars,!names(mtcars) %in% drop) # same output
You can use -one_of(drop) with select:
drop <- c("disp", "drat", "gear", "am")
select(mtcars, -one_of(drop)) %>% names()
# [1] "mpg" "cyl" "hp" "wt" "qsec" "vs" "carb"
one_of evaluates the column names in character vector to integers, similar to which(... %in% ...) does:
one_of(drop, vars = names(mtcars))
# [1] 3 5 10 9
which(names(mtcars) %in% drop)
# [1] 3 5 9 10

select and rename stored in variable

I have several similar data frames with many columns in common. I would like to select and rename a subset of those columns from any table.
library(tidyverse)
mtcars %>%
select(my_mpg = mpg,
cylinders = cyl,
gear)
Is it possible to do something like
my_select_rename <- c("my_mpg"="mpg","cylinders"="cyl","gear")
mtcars %>%
select_(.dots = my_select_rename)
but using the tidyeval framework instead?
I think you want:
my_select <- c("mpg","cyl","gear")
my_select_rename <- c("my_mpg","cylinders","gear")
mtcars %>%
select_at(vars(my_select)) %>%
setNames(., my_select_rename)
my_mpg cylinders gear
Mazda RX4 21.0 6 4
Mazda RX4 Wag 21.0 6 4
Datsun 710 22.8 4 4
Hornet 4 Drive 21.4 6 3
Hornet Sportabout 18.7 8 3
lionel's answer to this question group_by by a vector of characters using tidy evaluation semantics provides the answer
mtcars %>%
select(!!! rlang::syms(my_select_rename))

Finding duplicate columns in a data.table

I have a pretty big data.table (500 x 2000), and I need to find out if any of the columns are duplicates, i.e., have the same values for all rows. Is there a way to efficiently do this within the data.table structure?
I have tried a naive two loop approach with all(col1 == col2) for each pair of columns, but it takes too long. I have also tried converting it to a data.frame and using the above approach, and it still takes quite a long time.
My current solution is to convert the data.table to a matrix and use the apply() function as:
similarity.matrix <- apply(m, 2, function(x) colSums(x == m)))/nrow(m)
However, the approach forces the modes of all elements to be the same, and I'd rather not have that happen. What other options do I have?
Here is a sample construction for the data.table:
m = matrix(sample(1:10, size=1000000, replace=TRUE), nrow=500, ncol=2000)
DF = as.data.frame(m)
DT = as.data.table(m)
Following the suggestion of #Haboryme*, you can do this using duplicated to find any duplicated vectors. duplicated usually works rowwise, but you can transpose it with t() just for finding the duplicates.
DF <- DF[ , which( !duplicated( t( DF ) ) ) ]
With a data.table, you may need to add with = FALSE (I think this depends on the version of data.table you're using).
DT <- DT[ , which( !duplicated( t( DT ) ) ), with = FALSE ]
*#Haboryme, if you were going to turn your comment into an answer, please do and I'll remove this one.
Here's a different approach, where you hash each column first and then call duplicated.
library(digest)
dups <- duplicated(sapply(DF, digest))
DF <- DF[,which(!dups)]
Depending on your data this might be a faster way.
I am using mtcars for a reproducible result:
library(data.table)
library(digest)
# Create data
data <- as.data.table(mtcars)
data[, car.name := rownames(mtcars)]
data[, car.name.dup := car.name] # create a duplicated row
data[, car.name.not.dup := car.name] # create a second duplicated row...
data[1, car.name.not.dup := "Moon walker"] # ... but change a value so that it is no longer a duplicated column
data contains now:
> head(data)
mpg cyl disp hp drat wt qsec vs am gear carb car.name car.name.dup car.name.not.dup
1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Mazda RX4 Moon walker
2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag Mazda RX4 Wag Mazda RX4 Wag
3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Datsun 710 Datsun 710 Datsun 710
4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive Hornet 4 Drive Hornet 4 Drive
5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Hornet Sportabout Hornet Sportabout Hornet Sportabout
6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 Valiant Valiant Valiant
Now find the duplicated colums:
# create a vector with the checksum for each column (and keep the column names as row names)
col.checksums <- sapply(data, function(x) digest(x, "md5"), USE.NAMES = T)
# make a data table with one row per column name and hash value
dup.cols <- data.table(col.name = names(col.checksums), hash.value = col.checksums)
# self join using the hash values and filter out all column name pairs that were joined to themselves
dup.cols[dup.cols,, on = "hash.value"][col.name != i.col.name,]
Results in:
col.name hash.value i.col.name
1: car.name.dup 58fed3da6bbae3976b5a0fd97840591d car.name
2: car.name 58fed3da6bbae3976b5a0fd97840591d car.name.dup
Note: The result still contains both directions (col1 == col2 and col2 == col1) and should be deduplicated ;-)

Re-assembling a dataframe after a split [duplicate]

This question already has answers here:
Grouping functions (tapply, by, aggregate) and the *apply family
(10 answers)
Closed 6 years ago.
I have trouble applying a split to a data.frame and then assembling some aggregated results back into a different data.frame. I tried using the 'unsplit' function but I can't figure out how to use it properly to get the desired result. Let me demonstrate on the common 'mtcars' data: Let's say that my ultimate result is to get a data frame with two variables: cyl (cylinders) and mean_mpg (mean over mpg for group of cars sharing the same count of cylinders).
So the initial split goes like this:
spl <- split(mtcars, mtcars$cyl)
The result of which looks something like this:
$`4`
mpg cyl disp hp drat wt qsec vs am gear carb
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
...
$`6`
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
...
$`8`
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
...
Now I want to do something along the lines of:
df <- as.data.frame(lapply(spl, function(x) mean(x$mpg)), col.names=c("cyl", "mean_mpg"))
However, doing the above results in:
X4 X6 X8
1 26.66364 19.74286 15.1
While I'd want the df to be like this:
cyl mean_mpg
1 4 26.66364
2 6 19.74286
3 8 15.10000
Thanks, J.
If you are only interested in reassembling a split then look at (2), (4) and (4a) but if the actual underlying question is really about the way to perform aggregations over groups then they all may be of interest:
1) aggregate Normally one uses aggregate as already mentioned in the comments. Simplifying #alistaire's code slightly:
aggregate(mpg ~ cyl, mtcars, mean)
2) split/lapply/do.call Also #rawr has given a split/lapply/do.call solution in the comments which we can also simplify slightly:
spl <- split(mtcars, mtcars$cyl)
do.call("rbind", lapply(spl, with, data.frame(cyl = cyl[1], mpg = mean(mpg))))
3) do.call/by The last one could alternately be rewritten in terms of by:
do.call("rbind", by(mtcars, mtcars$cyl, with, data.frame(cyl = cyl[1], mpg = mean(mpg))))
4) split/lapply/unsplit Another possibility is to use split and unsplit:
spl <- split(mtcars, mtcars$cyl)
L <- lapply(spl, with, data.frame(cyl = cyl[1], mpg = mean(mpg), row.names = cyl[1]))
unsplit(L, sapply(L, "[[", "cyl"))
4a) or if row names are sufficient:
spl <- split(mtcars, mtcars$cyl)
L <- lapply(spl, with, data.frame(mpg = mean(mpg), row.names = cyl[1]))
unsplit(L, sapply(L, rownames))
The above do not use any packages but there are also many packages that can do aggregations including dplyr, data.table and sqldf:
5) dplyr
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarize(mpg = mean(mpg)) %>%
ungroup()
6) data.table
library(data.table)
as.data.table(mtcars)[, list(mpg = mean(mpg)), by = "cyl"]
7) sqldf
library(sqldf)
sqldf("select cyl, avg(mpg) mpg from mtcars group by cyl")

Resources