I'm combining 12 CSV files into one dataframe in R. Before doing this I want to ensure all the column names are an exact match with each other. I've made a dataframe where each column is the column names of the 12 CSV files.
jul21_cols <- data.frame(colnames(jul21))
aug21_cols <- data.frame(colnames(aug21))
sep21_cols <- data.frame(colnames(sep21))
oct21_cols <- data.frame(colnames(oct21))
nov21_cols <- data.frame(colnames(nov21))
dec21_cols <- data.frame(colnames(dec21))
jan22_cols <- data.frame(colnames(jan22))
feb22_cols <- data.frame(colnames(feb22))
mar22_cols <- data.frame(colnames(mar22))
apr22_cols <- data.frame(colnames(apr22))
may22_cols <- data.frame(colnames(may22))
jun22_cols <- data.frame(colnames(jun22))
col_df <- cbind(jul21_cols,aug21_cols,sep21_cols,oct21_cols,nov21_cols,dec21_cols,
jan22_cols,feb22_cols,mar22_cols,apr22_cols,may22_cols,jun22_cols)
I've tried using the identical function to compare 2 columns at a time.
identical(col_df[['jul21']], col_df[['aug21']])
identical(col_df[['aug21']], col_df[['sep21']])
identical(col_df[['sep21']], col_df[['oct21']])
identical(col_df[['oct21']], col_df[['nov21']])
identical(col_df[['nov21']], col_df[['dec21']])
identical(col_df[['dec21']], col_df[['jan22']])
identical(col_df[['jan22']], col_df[['feb22']])
identical(col_df[['feb22']], col_df[['mar22']])
identical(col_df[['mar22']], col_df[['apr22']])
identical(col_df[['apr22']], col_df[['may22']])
identical(col_df[['may22']], col_df[['jun22']])
All of the identical lines return the value of TRUE.
I'm just trying to verify that this code is telling me all my column names are identical in each CSV file before I move on. I'd also like to know if there is a more efficient way to solve this problem.
First, identical() will only return TRUE if the two dataframes have all the same column names in the same order. If you don’t care about order, just that all the same names are in both dataframes, you can sort() the names before comparing as shown below.
Second, you can often use the base::lapply() or purrr::map() families of functions for operations requiring iteration.
For your case, let’s put your dataframes in a list (which they probably should be to begin with), then use sapply() to compare the column names of the first df in the list to the column names of all other dfs.
jul21 <- data.frame(x = 1, y = 2)
aug21 <- data.frame(x = 3, y = 4)
sep21 <- data.frame(y = 6, x = 5)
dfs <- list(jul21,aug21,sep21)
all(sapply(
dfs[-1],
\(x) identical(sort(colnames(x)), sort(colnames(dfs[[1]])))
))
# TRUE
And as another test case, we’ll add a df with a non-matching column.
oct22 <- data.frame(x = 1, y = 2, z = 3)
dfs[[4]] <- oct22
all(sapply(
dfs[-1],
\(x) identical(sort(colnames(x)), sort(colnames(dfs[[1]])))
))
# FALSE
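As an aside, since the data start out as 12 CSV files, they can be read straight into a named list rather than 12 separate objects. A minimal sketch, assuming hypothetical file names like jul21.csv through jun22.csv (adjust to the real paths):
# Hypothetical file names; replace with the actual paths to the 12 CSVs
files <- c("jul21.csv", "aug21.csv", "sep21.csv")  # ... through "jun22.csv"
dfs <- lapply(files, read.csv)
names(dfs) <- sub("\\.csv$", "", files)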
We assume that what is needed is to determine whether the column names are the same and in the same order and, if not, to determine which differ.
First get a character vector, Names, containing the names of the data frames. From those names assemble a named list L of the data frames themselves, and then get a character vector nms whose elements are strings of column names, one for each data frame.
Finally, group the names of the data frames using tapply with nms as the groupings so we can see which data frames contain which columns. In the example below, aug21 and jul21 have one set of columns, i.e. Time and demand, and sep21 has a different set, i.e. Time and DEMAND. If there were only one row, all data frames would have the same column names in the same order.
Names <- c("jul21", "aug21", "sep21") # using example in Note
L <- mget(Names)[Names]
nms <- sapply(names(L), function(x) toString(names(L[[x]])))
tab <- stack(tapply(names(nms), nms, toString))
names(tab) <- c("data.frames", "column.names")
nrow(tab)
## [1] 2
tab
## data.frames column.names
## 1 jul21, aug21 Time, demand
## 2 sep21 Time, DEMAND
Graph
Another approach, which could be used alternately or in conjunction with the one above, is to create a graph such that each vertex is a data frame and each edge means that the two data frames on either end of the edge have the same column names in the same order. Each connected component then represents a distinct set or ordering of column names. From the example below we see that jul21 and aug21 form one connected component and sep21 forms a second connected component.
To investigate how data frame column names differ, note that setdiff(names(jul21), names(sep21)) will show names that are in jul21 but not in sep21, and the reverse can be used for the other direction. If the setdiff in both directions is a zero-length vector but the name vectors are not identical, then they differ only in order.
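For example, with the test data from the Note below, where sep21 uses DEMAND instead of demand:
setdiff(names(jul21), names(sep21))
## [1] "demand"
setdiff(names(sep21), names(jul21))
## [1] "DEMAND"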
library(igraph)
set.seed(123)
# +identical() coerces the logical result to 0/1 for the adjacency matrix
isSame <- function(x, y) +identical(names(x), names(y))
A <- outer(L, L, Vectorize(isSame))
diag(A) <- 0
g <- graph_from_adjacency_matrix(A, "undirected")
plot(g, vertex.color = "white", vertex.size = 30)
Note
Test data. BOD comes with R.
jul21 <- aug21 <- sep21 <- BOD
names(sep21) <- c("Time", "DEMAND")
I have some large data frames that are big enough to push the limits of R on my machine; e.g., the one on which I'm currently working is 2 columns by 70 million rows. The contents aren't important, but just in case, column 1 is a string and column 2 is an integer.
What I would like to do is split that data frame into n parts (say, 20, but preferably something that could change on a case-by-case basis) so that I can work on each of the smaller data frames one at a time. That means that (a) the result has to produce things that are named (e.g., "newdf_1", "newdf_2", ... "newdf_20" or something), and (b) each line in the original data frame needs to be in one (and only one) of the new "sub" data frames. The order does not matter, but doing it sequentially by rows makes sense to me.
Once I do the work, I will start to recombine them (using rbind()) one pair at a time.
I've looked at split(), but from what I can tell, it is designed to work with factors (which I don't have).
Any ideas?
You can create a new column and split the data frame based on that column. The column does not need to be a factor, but it needs to be of a type that the split function can convert to a factor.
# Number of groups
N <- 20
dat$group <- 1:nrow(dat) %% N
# Add 1 to group
dat$group <- dat$group + 1
# Split the dat by group
dat_list <- split(dat, f = ~group)
# Set the name of the list
names(dat_list) <- paste0("newdf_", 1:N)
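Once the per-group work is done, the pieces can be recombined; a quick sketch (the question mentions rbind(); note the rows come back grouped rather than in the original order):
# Recombine all sub data frames and drop the helper column
recombined <- do.call(rbind, dat_list)
recombined$group <- NULL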
Data
set.seed(123)
# Create example data frame
dat <- data.frame(
A = sample(letters, size = 70000000, replace = TRUE),
B = rpois(70000000, lambda = 1)
)
Here's a tidyverse based solution. Try using read_csv_chunked().
library(readr)
library(dplyr)
# practice data
tibble(string = sample(letters, 1e6, replace = TRUE),
       value = rnorm(1e6)) %>%
  write_csv("test.csv")
# here's the solution
partial_data <- read_csv_chunked("test.csv",
                                 DataFrameCallback$new(function(x, pos) filter(x, string == "a")),
                                 chunk_size = 1000)
You can wrap the call to read_csv_chunked in a function where you change the string that you subset on.
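A minimal sketch of such a wrapper (readr and dplyr loaded as above; the function name read_filtered is just illustrative):
# Reads path in chunks, keeping only rows where the string column equals `letter`
read_filtered <- function(path, letter, chunk_size = 1000) {
  read_csv_chunked(
    path,
    DataFrameCallback$new(function(x, pos) filter(x, string == letter)),
    chunk_size = chunk_size
  )
}
a_rows <- read_filtered("test.csv", "a")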
This is more or less a repeat of this question:
How to read only lines that fulfil a condition from a csv into R?
I have two lists, A and B:
List A contains K character vectors of length W. Each vector contains the same W string values, but the indices of the strings may differ. We can think of this list in practice as containing vectors of variable names, where each vector contains the same variable names but in potentially differing orders.
List B contains K character vectors of length W. Each vector can contain W arbitrary values. We can think of this list in practice as containing vectors with the corresponding values of the variables contained in each vector of List A.
I am trying to generate a data frame that is K rows long and W columns wide, where the column names are the W unique values in each vector in List A and the values for each row are drawn from the vector found at that row's index in List B.
I've been able to do this (minimal working example below) but it seems very hackish because it basically involves turning the two lists into data frames and then assigning values from one as column names for the other in a loop.
Is there a way to skip the steps of turning each list into a data frame before then using a loop to combine them? Looping through the lists seems inefficient, as does generating the two data frames rather than a single data frame that draws on contents of both lists.
library(dplyr)
# Declare number of rows and columns
K <- 10
W <- 5
colnames_set <- sample(LETTERS, W)
# Generate example data
# List A: column names
list_a <- vector(mode = "list", length = K)
list_a <- lapply(list_a, function(x) x <- sample(colnames_set, W))
# List B: values
list_b <- vector(mode = "list", length = K)
list_b <- lapply(list_b, function(x) x <- rnorm(n = W))
# Define function to take a vector and turn it into a
# data frame where each element of the vector is
# assigned to its own column
vec2df <- function(x) {
  x %>%
    as.data.frame(., stringsAsFactors = FALSE) %>%
    t() %>%
    as.data.frame(., stringsAsFactors = FALSE)
}
# Convert vectors to data frames
vars <- lapply(list_a, vec2df)
vals <- lapply(list_b, vec2df)
# Combine the data frames into one
# (note the looping)
for(i in 1:K){
colnames(vals[[i]]) <- vars[[i]][1, ]
}
# Combine rows into a single data frame
out <- vals %>%
dplyr::bind_rows()
rownames(out) <- NULL
# Show output
out
Arrange the data in list_b so that the variables are aligned. We can use Map/mapply to do this, convert the output to a data frame and name the columns.
setNames(data.frame(t(mapply(function(x, y) y[order(x)], list_a, list_b))),
sort(colnames_set))
I have a list of data frames with the same column names, where each data frame corresponds to a month.
June_2018 <- data.frame(Features=c("abc","def","ghi","jkl"), Metric1=c(100,200,250,450), Metric2=c(1000,2000,5000,6000))
July_2018 <- data.frame(Features=c("abc","def","ghi","jkl"), Metric1=c(140,250,125,400), Metric2=c(2000,3000,2000,3000))
Aug_2018 <- data.frame(Features=c("abc","def","ghi","jkl"), Metric1=c(200,150,250,600), Metric2=c(1500,2000,4000,2000))
Sep_2018 <- data.frame(Features=c("abc","def","ghi","jkl"), Metric1=c(500,500,1000,100), Metric2=c(500,4000,6000,8000))
lst1 <- list(Aug_2018,June_2018,July_2018,Sep_2018)
names(lst1) <- c("Aug_2018","June_2018","July_2018","Sep_2018")
I intend to create new columns in each of the data frames in the list, named Percent_Change_Metric1 and Percent_Change_Metric2, by doing the calculation below
for (i in names(lst1)){
lst1[[i]]$Percent_Change_Metric1 <- ((lst1[[i+1]]$Metric1-lst1[[i]]$Metric1)*100/lst1[[i]]$Metric1)
lst1[[i]]$Percent_Change_Metric2 <- ((lst1[[i+1]]$Metric2-lst1[[i]]$Metric2)*100/lst1[[i]]$Metric2)
}
However, i in the for loop iterates over names(lst1), so indexing the next data frame with i+1 obviously doesn't work.
Also, the data frames in my list are in random order rather than ordered by month-year, so subtracting the columns of successive data frames isn't accurate anyway.
Please advise on (1) how to go about adding the Percent_Change_Metric1 and Percent_Change_Metric2 columns, and (2) how to choose the data frame corresponding to the next month so that the Percent_Change is correct.
Thanks for the guidance.
Here is one option with base R
lst1[-length(lst1)] <- Map(function(x, y)
  transform(y, Percent_Change_Metric1 = (x$Metric1 - Metric1) * 100 / Metric1,
               Percent_Change_Metric2 = (x$Metric2 - Metric2) * 100 / Metric2),
  lst1[-1], lst1[-length(lst1)])
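Since the question notes the list is not necessarily in month order, here is a small base R sketch (assuming the Month_Year naming shown above) to put lst1 in chronological order before running the Map() step:
# Parse the "Month_Year" names: month names may be full or abbreviated,
# so match the first three letters against month.abb
mo <- substr(sub("_.*", "", names(lst1)), 1, 3)
yr <- as.numeric(sub(".*_", "", names(lst1)))
lst1 <- lst1[order(yr, match(mo, month.abb))]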
I would like to organise a data frame by the contents of three of its six columns (a minimal example with just the three below) and have each unique cluster of similarity (over those 3 columns) returned as a subsetted data frame inside a list. So I basically want to chop the data frame up into smaller data frames and put them in a list.
var1 <- "erg11"
var2 <- "cyp51"
df <- data.frame(primerID=c(1,2,3,2,4,3,2,1,1,1,2),geneName=c(var1,var1,var2,var1,var1,var2,var2,var2,var1,var2,var1),insertLength=c(111,111,81,81,81,111,102,111,81,81,102))
Given my old C background, I tried nested for loops, subsetting the data frame whenever all three elements were found in the three lists, e.g.:
Alist <- as.list(unique(df$primerID))
Blist <- as.list(unique(df$geneName))
Clist <- as.list(unique(df$insertLength))
uniqueCounter <- 1
uniqueList <- list()
for(i in 1:length(Alist)) {
  for(k in 1:length(Blist)) {
    for(n in 1:length(Clist)) {
      indDF <- subset(df, df$primerID %in% Alist[i] & df$geneName %in% Blist[k] & df$insertLength %in% Clist[n])
      if(nrow(indDF) > 0) {
        uniqueList[[uniqueCounter]] <- indDF
        uniqueCounter <- uniqueCounter + 1
      }
    }
  }
}
However, this takes most of the night to run.
Thanks
You can give a list of factors as the grouping variable so that their interaction is used for the grouping. Since all of your data frame's columns are grouping variables, we can do split(df, df).
Optionally do split(df, df, drop = TRUE), which drops groups with no records / cases.
Just read that your real data frame has 6 columns, 3 of which are for grouping. If, say, the grouping columns are 1, 3 and 4, we can use split(df, df[c(1, 3, 4)]).
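For instance, with the example df from the question, a quick sketch:
# One sub data frame per combination of primerID, geneName and insertLength
# that actually occurs in df
uniqueList <- split(df, df, drop = TRUE)
length(uniqueList)  # number of distinct combinations present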
From ?split:
Description:
‘split’ divides the data in the vector ‘x’ into the groups defined
by ‘f’. The replacement forms replace values corresponding to
such a division. ‘unsplit’ reverses the effect of ‘split’.
Arguments:
x: vector or data frame containing values to be divided into
groups.
f: a ‘factor’ in the sense that ‘as.factor(f)’ defines the
grouping, or a list of such factors in which case their
interaction is used for the grouping.
drop: logical indicating if levels that do not occur should be
dropped (if ‘f’ is a ‘factor’ or a list).
My problem is the following. Suppose I have 1000 dataframes in R with the names eq1.1, eq1.2, ..., eq1.1000. I would like a single dataframe containing my 1000 dataframes. Normally, if I have only two dataframes, say eq1.1 and eq1.2 then I could define
df <- data.frame(eq1.1,eq1.2)
and I'm good. However, I can't follow this procedure because I have 1000 dataframes.
I was able to define a list containing the names of my 1000 dataframes using the code
names <- c()
for (i in 1:1000){names[i]<- paste0("eq1.",i)}
However, the elements of my list are recognized as strings and not as the dataframes that I previously defined.
Any help is appreciated!
How about
df.names <- ls(pattern = "^eq1\\.\\d")
eq1.dat <- do.call(cbind, lapply(df.names, get))
rm(list = df.names)
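Equivalently, a small sketch using base mget(), which fetches all of the named objects at once:
eq1.dat <- do.call(cbind, mget(df.names))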
library(stringi)
library(dplyr)
# recreate dummy data
lapply(1:1000, function(i){
  assign(sprintf("eq1.%s", i),
         as.data.frame(matrix(ncol = 12, nrow = 13, sample(1:15, 12 * 13, replace = TRUE))),
         envir = .GlobalEnv)
})
# Now have 1000 data frames in my working environment named eq1.[1:1000]
> str(ls(pattern = "eq1.\\d+"))
 chr [1:1000] "eq1.1" "eq1.10" "eq1.100" "eq1.1000" "eq1.101" "eq1.102" "eq1.103" ...
1) Create a holding data frame from the eq1.1 data frame; it will be appended to on each iteration of the following loop.
empty_df <- eq1.1
2) Search for all the data frames named by convention and create a data frame from the returned characters, which represent our data frame objects but are nothing more than character strings.
3) Mutate that data frame to hold an indexing column so the data frames can be ordered properly from 1:1000, since the character representations from the step above are not in numeric order.
4) Drop the indexing column once the data frame names are in the proper sequence, then unlist the dfs column back into a character vector and slice the first value out, since it is already stored in empty_df.
5) Loop through that sequence; on each iteration, globally assign and bind the next data frame into place. For example, after iteration 1 empty_df is the same as data.frame(eq1.1, eq1.2), and after the second iteration it is the same as data.frame(eq1.1, eq1.2, eq1.3).
NOTE: the get function takes the character representation and returns the data object it names; see ?get for details.
lapply(
  data.frame(dfs = ls(pattern = 'eq1\\.\\d+')) %>%
    mutate(nth = as.numeric(stri_extract_last_regex(dfs, '\\d+'))) %>%
    arrange(nth) %>% select(-nth) %>% slice(-1) %>% .$dfs,
  function(i){
    empty_df <<- data.frame(empty_df, get(i))
  }
)
All done; all the data frames are bound to empty_df. To check:
> dim(empty_df)
[1] 13 12000