I have a very large data frame that contains 100 rows and 400000 columns.
To sample each column, I can simply do:
df <- apply(df, 2, sample)
But I want every two columns to be sampled together. For example, if originally col1 is c(1,2,3,4,5) and col2 is c(6,7,8,9,10), and after resampling col1 becomes c(1,3,2,4,5), I want col2 to be c(6,8,7,9,10), following the resampling pattern of col1. The same goes for col3 & col4, col5 & col6, etc.
I wrote a for loop to do this, which takes forever. Is there a better way? Thanks!
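A loop of the kind described would look roughly like this (a hypothetical reconstruction, not the asker's actual code):
# hypothetical reconstruction of the slow per-pair loop
for (j in seq(1, ncol(df), by = 2)) {
  idx <- sample(nrow(df))                   # one permutation per pair
  df[, c(j, j + 1)] <- df[idx, c(j, j + 1)] # apply it to both columns
}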
You might try this: split the data frame into sub data frames of two columns each with split.default, sample the rows of each sub data frame, and then bind them back together:
df <- data.frame(col1 = 1:5, col2 = 6:10, col3 = 11:15)
index <- seq_len(nrow(df))

cbind.data.frame(
  setNames(lapply(
    # pair up columns: columns 1-2 -> group 0, columns 3-4 -> group 1, ...
    split.default(df, (seq_along(df) - 1) %/% 2),
    # draw a fresh permutation per group so paired columns move together
    function(sdf) sdf[sample(index), , drop = FALSE]),
    NULL)  # strip the group names so the original column names are kept
)
# col1 col2 col3
#5 5 10 12
#4 4 9 11
#1 1 6 15
#2 2 7 14
#3 3 8 13
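If the split/lapply approach is still too slow at 400000 columns, a vectorized sketch that indexes a matrix directly may help. This is untested at that scale and assumes all columns share a single type and that the column count is even (wide is a hypothetical demo frame):
wide <- data.frame(col1 = 1:5, col2 = 6:10, col3 = 11:15, col4 = 16:20)
m <- as.matrix(wide)
n <- nrow(m)
p <- ncol(m) %/% 2
perm <- matrix(replicate(p, sample(n)), nrow = n)  # one row permutation per pair
idx <- perm[, rep(seq_len(p), each = 2)]           # both columns of a pair share it
shuffled <- matrix(m[cbind(c(idx), rep(seq_len(ncol(m)), each = n))],
                   nrow = n, dimnames = list(NULL, colnames(m)))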
I want to do a union of two dataframes that share some rows with the same rowName. For rows with common rowNames, I would like to take the second dataframe's values, not the first's. For example:
df1 <- data.frame(col1 = c(1,2), col2 = c(2,4), row.names = c("row_1", "row_2"))
df1
# col1 col2
# row_1 1 2
# row_2 2 4
df2 <- data.frame(col1 = c(3,10), col2 = c(6,99), row.names = c("row_3", "row_2"))
df2
# col1 col2
# row_3 3 6
# row_2 10 99
The result I would like to obtain would then be :
someSpecificRBind(df1,df2, takeIntoAccount=df2)
# col1 col2
# row_1 1 2
# row_2 10 99
# row_3 3 6
The rbind function doesn't do the job: it keeps both rows and just makes the common rowNames unique.
I would conceptualize this as only adding to df2 the rows in df1 that aren't already there:
rbind(df2, df1[setdiff(rownames(df1), rownames(df2)), ])
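With the example data this gives the desired rows, in a different order (which can be sorted afterwards if needed):
#       col1 col2
# row_3    3    6
# row_2   10   99
# row_1    1    2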
Alternatively, we can get the indices of duplicated row names and use those to filter:
rbind(df2, df1)[!duplicated(c(row.names(df2), row.names(df1))),]
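This keeps df2's version of row_2 because df2's row names come first in the concatenated vector, giving the same result as above:
#       col1 col2
# row_3    3    6
# row_2   10   99
# row_1    1    2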
I have a list of identically structured lists as follows:
test1 <- list(first = data.frame(col1 = c(1,2), col2 = c(3,4)),
              second = data.frame(COL1 = c(100,200), COL2 = c(300,400)))
test2 <- list(first = data.frame(col1 = c(5,6), col2 = c(7,8)),
              second = data.frame(COL1 = c(500,600), COL2 = c(700,800)))
orig.list <- list(test1, test2)
I want to:
Bind the rows of the first element of each nested list together, bind the rows of the second element of each nested list together, etc.
Recombine the resulting elements into a single list with an identical structure to the first list.
I can easily do this element by element via:
firsts <- orig.list %>% purrr::map(1) %>% dplyr::bind_rows()
seconds <- orig.list %>% purrr::map(2) %>% dplyr::bind_rows()
new.list <- list(first = firsts, second = seconds)
However, for n list elements this requires that I:
know the number of elements in each list,
know the names and order of the elements so I can recreate the new list with the correct names and order,
copy and paste the same line of code over and over again.
I'm looking for how to apply purrr::map (or some other tidyverse function) more generically to combine all elements of a list of lists, preserving the element names and order.
In the simplest case, as you've shown with your data, you can use pmap to walk through the lists in parallel and bind_rows to combine the individual data frames:
library(tidyverse)
pmap(orig.list, bind_rows)
#$first
# col1 col2
#1 1 3
#2 2 4
#3 5 7
#4 6 8
#$second
# COL1 COL2
#1 100 300
#2 200 400
#3 500 700
#4 600 800
identical(pmap(orig.list, bind_rows), new.list)
# [1] TRUE
To make this a little more generic, i.e. to handle cases where the number of elements and the order of names in each sublist can vary, you can use:
map(map_df(orig.list, ~ as.data.frame(map(.x, ~ unname(nest(.))))), bind_rows)
i.e. you nest each sublist into a one-row data frame and let bind_rows match the names for you.
Test Cases:
With test1 the same, switch the order of the elements in test2:
test2 <- list(second = data.frame(COL1 = c(500,600), COL2 = c(700,800)),
              first = data.frame(col1 = c(5,6), col2 = c(7,8)))
orig.list1 <- list(test1, test2)
map(map_df(orig.list1, ~ as.data.frame(map(.x, ~ unname(nest(.))))), bind_rows)
gives:
#$first
# col1 col2
#1 1 3
#2 2 4
#3 5 7
#4 6 8
#$second
# COL1 COL2
#1 100 300
#2 200 400
#3 500 700
#4 600 800
Now drop one element from test2:
test2 <- list(first = data.frame(col1 = c(5,6), col2 = c(7,8)))
orig.list2 <- list(test1, test2)
map(map_df(orig.list2, ~ as.data.frame(map(.x, ~ unname(nest(.))))), bind_rows)
gives:
#$first
# col1 col2
#1 1 3
#2 2 4
#3 5 7
#4 6 8
#$second
# COL1 COL2
#1 100 300
#2 200 400
You want purrr::transpose:
library(purrr)
library(dplyr)
transpose(orig.list) %>% map(bind_rows)
# $first
# col1 col2
# 1 1 3
# 2 2 4
# 3 5 7
# 4 6 8
#
# $second
# COL1 COL2
# 1 100 300
# 2 200 400
# 3 500 700
# 4 600 800
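For completeness, the same transpose-and-bind idea in base R, assuming every sublist has the same element names in the same order (a sketch; it does not handle the reordered or missing-element cases above):
nms <- names(orig.list[[1]])
# for each element name, pull that element from every sublist and row-bind
setNames(lapply(nms, function(nm) do.call(rbind, lapply(orig.list, `[[`, nm))), nms)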
col1 <- c('A','B','C', 'D')
col2 <- c('B','A','C', 'C')
col3 <- c('B','C','C', 'A')
dat <- data.frame(cbind(col1, col2, col3))
dat
col1 col2 col3
1 A B B
2 B A C
3 C C C
4 D C A
I would like to remove rows 1 and 3 from dat as the letter B is present more than once in row 1 and the letter C is present more than once in row 3.
EDIT:
My actual data contains over 1 million rows and 14 columns, all of which contain character data. The solution that runs the fastest is preferred as I am using the dataframe in a live setting to make decisions, and the underlying data is changing every few minutes.
You could try this (but I'm sure there is a better way)
cols <- ncol(dat)
indx <- apply(dat, 1, function(x) length(unique(x)) == cols)
dat[indx, ]
# col1 col2 col3
# 2 B A C
# 4 D C A
Another way (if your columns are characters and you don't have too many columns) is something like the following, which is vectorized:
indx <- with(dat, (col1 == col2) | (col1 == col3) | (col2 == col3))
dat[!indx, ]
# col1 col2 col3
# 2 B A C
# 4 D C A
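The real data has 14 columns, so writing out every pair by hand gets tedious; a generic version of the same vectorized idea (a sketch, untested on the real data) builds all choose(14, 2) = 91 comparisons programmatically:
pairs <- combn(names(dat), 2, simplify = FALSE)  # every pair of column names
indx <- Reduce(`|`, lapply(pairs, function(p) dat[[p[1]]] == dat[[p[2]]]))
dat[!indx, ]
#   col1 col2 col3
# 2    B    A    C
# 4    D    C    A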
You could do this in dplyr, if you don't mind specifying the columns:
library(dplyr)
dat %>%
  rowwise() %>%
  mutate(repeats = max(table(c(col1, col2, col3))) - 1) %>%
  filter(repeats == 0) %>%
  select(-repeats)  # drop the helper column if you don't want it in the results
Source: local data frame [2 x 3]
col1 col2 col3
1 B A C
2 D C A
Here is an alternative. I haven't tested it on a big dataset:
library(data.table) #devel version v1.9.5
dat[setDT(melt(as.matrix(dat)))[, uniqueN(value) == .N, Var1]$V1, ]
# col1 col2 col3
#2 B A C
#4 D C A
Or use anyDuplicated:
dat[!apply(dat, 1, anyDuplicated),]
# col1 col2 col3
#2 B A C
#4 D C A
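Given the speed requirement, a quick comparison on representative data (assuming the microbenchmark package is installed) is the safest way to choose among these:
library(microbenchmark)
microbenchmark(
  apply_unique = dat[apply(dat, 1, function(x) length(unique(x)) == ncol(dat)), ],
  pairwise     = dat[!with(dat, (col1 == col2) | (col1 == col3) | (col2 == col3)), ],
  any_dup      = dat[!apply(dat, 1, anyDuplicated), ],
  times = 100L
)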
I'd like to know how to consolidate duplicate rows in a data frame and then combine the duplicated values in another column.
Here's a sample of the existing dataframe and two dataframes that would be acceptable as a solution:
df1 <- data.frame(col1 = c("test1", "test2", "test2", "test3"), col2 = c(1, 2, 3, 4))
df.ideal <- data.frame(col1 = c("test1", "test2", "test3"), col2 = c(1, "2, 3", 4))
df.ideal2 <- data.frame(col1 = c("test1", "test2", "test3"),
                        col2 = c(1, 2, 4),
                        col3 = c(NA, 3, NA))
In the first ideal dataframe, the duplicated rows are collapsed and the column combines both numbers. I've looked at other similar questions on Stack Overflow, but they all dealt with combining rows. I need to delete the duplicate row because I have another dataset I'm merging it with that needs a certain number of rows, so I want to preserve all of the values. Thanks for your help!
To go from df1 to df.ideal, you can use aggregate().
aggregate(col2~col1, df1, paste, collapse=",")
# col1 col2
# 1 test1 1
# 2 test2 2,3
# 3 test3 4
If you want to get to df.ideal2, that's more of a long-to-wide reshaping process. You can do
reshape(transform(df1, time = ave(col2, col1, FUN = seq_along)),
        idvar = "col1", direction = "wide")
# col1 col2.1 col2.2
# 1 test1 1 NA
# 2 test2 2 3
# 4 test3 4 NA
using just the base reshape() function.
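For comparison, a dplyr equivalent of the collapse step (assuming dplyr is available) would be:
library(dplyr)
df1 %>%
  group_by(col1) %>%
  summarize(col2 = paste(col2, collapse = ","))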
Another option would be to use splitstackshape
library(data.table)
library(splitstackshape)
DT1 <- setDT(df1)[, list(col2 = toString(col2)), col1]
DT1
# col1 col2
#1: test1 1
#2: test2 2, 3
#3: test3 4
You could split the col2 in DT1 to get df.ideal2:
cSplit(DT1, 'col2', sep=',')
# col1 col2_1 col2_2
#1: test1 1 NA
#2: test2 2 3
#3: test3 4 NA
or get it directly from df1:
dcast.data.table(getanID(df1, 'col1'), col1~.id, value.var='col2')
# col1 1 2
#1: test1 1 NA
#2: test2 2 3
#3: test3 4 NA
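A newer tidyverse route to df.ideal2 (assuming tidyr >= 1.0 for pivot_wider), starting from the original df1 before the setDT conversion above:
library(dplyr)
library(tidyr)
df1 %>%
  group_by(col1) %>%
  mutate(id = row_number()) %>%  # position of each value within its group
  ungroup() %>%
  pivot_wider(names_from = id, values_from = col2, names_prefix = "col2_")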
I have a dataframe containing 5 columns
COL1 | COL2 | COL3 | COL4 | COL5
I need to aggregate on COL1 and apply a different function to each of the columns COL2 to COL5:
a1 <- aggregate(COL2 ~ COL1, data = dataframe, sum)
a2 <- aggregate(COL3 ~ COL1, data = dataframe, length)
a3 <- aggregate(COL4 ~ COL1, data = dataframe, max)
a4 <- aggregate(COL5 ~ COL1, data = dataframe, min)
finalDF <- Reduce(function(x, y) merge(x, y, all = TRUE), list(a1, a2, a3, a4))
1) I have 24 cores on the machine. How can I execute the above 4 lines of code (a1, a2, a3, a4) in parallel? I want to use 4 cores simultaneously and then use Reduce to compute finalDF.
2) Can I use a different function on a different column in one aggregate call? (I can use one function on multiple columns, and I can use multiple functions on one column, but I was unable to apply a different function to each column: [COL2-sum, COL3-length, COL4-max, COL5-min].)
This is an example of how you might do it with dplyr, as suggested by @Roland:
set.seed(2)
df <- data.frame(COL1 = sample(LETTERS, 1e6, replace = TRUE),
                 COL2 = rnorm(1e6),
                 COL3 = runif(1e6, 100, 1000),
                 COL4 = rnorm(1e6, 25, 100),
                 COL5 = runif(1e6, -100, 10))
#> head(df)
# COL1 COL2 COL3 COL4 COL5
#1 E 1.0579823 586.2360 -3.157057 -14.462318
#2 S 0.1238110 872.3868 129.579090 9.525772
#3 O 0.4902512 498.0537 93.063487 1.910506
#4 E 1.7215843 200.7077 126.716256 -5.865204
#5 Y 0.6515853 275.3369 12.554218 -26.301225
#6 Y 0.7959678 134.4977 54.789415 -33.145334
require(dplyr)
df <- df %>%
  group_by(COL1) %>%
  summarize(a1 = sum(COL2),
            a2 = length(COL3),
            a3 = max(COL4),
            a4 = min(COL5))  # add as many calculations as you like
On my machine this took 0.064 seconds.
#> head(df)
#Source: local data frame [6 x 5]
#
# COL1 a1 a2 a3 a4
#1 A -0.9068368 38378 403.4208 -99.99943
#2 B 6.0557452 38551 419.0970 -99.99449
#3 C 108.5680251 38673 491.8061 -99.99382
#4 D -34.1217133 38469 481.0626 -99.99697
#5 E -68.2998926 38168 452.8280 -99.99602
#6 F -185.9059338 38159 417.2271 -99.99995
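For question 1, if you still want the four aggregate() calls themselves to run in parallel, here is a sketch with the base parallel package, using the question's original dataframe object. It assumes a Unix-alike, since mclapply relies on forking; on Windows you would need a cluster and parLapply instead:
library(parallel)
tasks <- list(
  function() aggregate(COL2 ~ COL1, data = dataframe, sum),
  function() aggregate(COL3 ~ COL1, data = dataframe, length),
  function() aggregate(COL4 ~ COL1, data = dataframe, max),
  function() aggregate(COL5 ~ COL1, data = dataframe, min)
)
results <- mclapply(tasks, function(f) f(), mc.cores = 4)  # one core per call
finalDF <- Reduce(function(x, y) merge(x, y, all = TRUE), results)
That said, the single grouped pass above will almost certainly beat four parallel aggregate() calls, since the forking and merging overhead tends to outweigh the gain.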