R dataframe subset collation optimization - r

This isn't a question about how to do something per se, it's more about how to do something better.
In R, say I have a dataframe, df:
df<-read.table(text="
Column1 Column2 Category
1 1 A
2 1 B
3 1 D
4 1 E
5 2 B
6 3 B
7 4 C
8 4 C
9 5 E
10 6 A", header=TRUE)
Now I would like to create a list (of dataframes) where each dataframe in the list is a subset of df where each subset is conditional on Category. I can create this as follows:
mylist <-list()
mylist[[1]] <- subset(df,df$Category=='A')
mylist[[2]] <- subset(df,df$Category=='B')
mylist[[3]] <- subset(df,df$Category=='C')
mylist[[4]] <- subset(df,df$Category=='D')
mylist[[5]] <- subset(df,df$Category=='E')
Now this works but is pretty clunky, is effectively a hard-coded loop and won't scale easily if I have more than five categories.
Is there a tighter/better way to do this?

You can use the split function, which returns a list of dataframes, one per value of Category:
split(df,df$Category)
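As a small sketch of what split returns here, using the df from the question (the name by_category is only illustrative): the result is a list named after the values of Category, so each subset can be pulled out by name rather than by position.

```r
# Rebuild the example dataframe from the question
df <- read.table(text = "
Column1 Column2 Category
1 1 A
2 1 B
3 1 D
4 1 E
5 2 B
6 3 B
7 4 C
8 4 C
9 5 E
10 6 A", header = TRUE)

# One list element per distinct Category value
by_category <- split(df, df$Category)

# The list is named after the categories
names(by_category)

# Retrieve a single subset by its category name
by_category[["B"]]
```

Because the elements are named, adding a sixth or a sixtieth category needs no change to the code.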

You can also use the dplyr library with a loop:
library(dplyr)
mylist <-list()
for ( v in unique(df$Category)){
mylist[[length(mylist)+1]] <- filter(df, Category == v)
}
mylist
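If a loop is preferred, a possible variation in base R (a sketch, not the answerer's code) is to index the list by the category value itself, so that each element carries its category as a name. The as.character() call is an assumption to cover the case where Category was read in as a factor.

```r
# Rebuild the example dataframe from the question
df <- read.table(text = "
Column1 Column2 Category
1 1 A
2 1 B
3 1 D
4 1 E
5 2 B
6 3 B
7 4 C
8 4 C
9 5 E
10 6 A", header = TRUE)

mylist <- list()
# Loop over the distinct categories in order of first appearance
for (v in as.character(unique(df$Category))) {
  # Store each subset under the name of its category
  mylist[[v]] <- df[df$Category == v, ]
}

names(mylist)
mylist[["C"]]
```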

Related

Using a loop to create multiple dataframes in R based on columns criteria [duplicate]

Suppose I have a dataframe with 3 columns. I would like to create separate sub-dataframes for each of the unique combinations of a few columns.
For example, suppose we have just 3 columns,
a <- c(1,5,2,3,4,5,3,2,1,3)
b <- c("a","a","f","d","f","c","a","r","a","c")
c <- c(.2,.6,.4,.545,.98,.312,.112,.4,.9,.5)
df <- data.frame(a,b,c)
I would like to get a separate dataframe for each of the unique combinations of Column 'a' and 'b'
I started with using unique to get a list of the unique combinations as the following,
factors <- unique(df[,c('a','b')])
a b
1 1 a
2 5 a
3 2 f
4 3 d
5 4 f
6 5 c
7 3 a
8 2 r
10 3 c
But I am not sure what to do next.
The code below is for illustration purposes only. Ideally this would be done in a loop that uses each row of factors to create the dataframes.
library(dplyr)
df_1_a <- df %>% filter(a==1, b=='a')
a b c
1 1 a 0.2
2 1 a 0.9
df_3_a <- df %>% filter(a==3, b=='a')
a b c
1 3 a 0.112
.
.
.
This is kind of dirty, and I'm not sure it answers your question, but try this:
a <- c(1,5,2,3,4,5,3,2,1,3)
b <- c("a","a","f","d","f","c","a","r","a","c")
c <- c(.2,.6,.4,.545,.98,.312,.112,.4,.9,.5)
d <- paste0(a,b)
df <- data.frame(a,b,c,d)
df_splited <- split(df,df$d)
You obtain a list of dataframes, one for each unique combination of a and b.
You can use split after you get the unique combinations you are after.
a <- c(1,5,2,3,4,5,3,2,1,3)
b <- c("a","a","f","d","f","c","a","r","a","c")
c <- c(.2,.6,.4,.545,.98,.312,.112,.4,.9,.5)
df <- data.frame(a,b,c,stringsAsFactors = FALSE)
fx <- unique(df[,c('a','b')])
fx_list <- split(fx,rownames(fx))
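Another option, sketched here on the assumption that the goal is the sub-dataframes themselves rather than the list of combinations, is to pass both columns to split as a list. drop = TRUE discards combinations of a and b that never occur in the data. The name df_by_ab is illustrative.

```r
a <- c(1, 5, 2, 3, 4, 5, 3, 2, 1, 3)
b <- c("a", "a", "f", "d", "f", "c", "a", "r", "a", "c")
c <- c(.2, .6, .4, .545, .98, .312, .112, .4, .9, .5)
df <- data.frame(a, b, c)

# Split on the combination of a and b; unused combinations are dropped
df_by_ab <- split(df, list(df$a, df$b), drop = TRUE)

# Elements are named "<a>.<b>", for example "1.a"
names(df_by_ab)
df_by_ab[["1.a"]]
```

This avoids building a helper column by hand, and the separator in the names keeps the two keys distinguishable.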

Add Named Columns to Data Frame in R [duplicate]

I am in the process of reformatting a few data frames and was wondering if there is a more efficient way to add named columns to data frames, rather than the below:
colnames(df) <- c("c1", "c2")
to rename the current columns and:
df$c3 <- ""
to create a new column.
Is there a way to do this in a quicker manner? I'm trying to add dozens of named columns and this seems like an inefficient way of going through the process.
Use your own method, just in a shorter form:
cols_2_add=c("a","b","c","f")
df[,cols_2_add]=""
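A self-contained sketch of the same idea, assuming a small example dataframe with columns x and y (not from the question) and filling the new columns with a typed missing value instead of an empty string:

```r
# Example dataframe; x and y are placeholder columns
df <- data.frame(x = 1:3, y = 4:6)

# Names of the columns to add
cols_2_add <- c("a", "b", "c", "f")

# Create all of them in one assignment, as character columns of NA
df[cols_2_add] <- NA_character_

str(df)
```

Swapping NA_character_ for NA_real_ or NA_integer_ would give numeric or integer columns instead.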
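Additional columns can also be added with merge: merge the existing dataframe with a one-row dataframe that holds the desired columns. Because the two share no column names, the result is their Cartesian product, so every existing row gains the new columns. If you want empty columns of different types, bind_rows with a zero-row dataframe is helpful.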
For example:
# Existing dataframe
df <- data.frame(x=1:3, y=4:6)
#use merge to create say desired columns as a, b, c, d and e
merge(df, data.frame(a="", b="", c="", d="", e=""))
# Result
# x y a b c d e
#1 1 4
#2 2 5
#3 3 6
# Desired columns of different types
library(dplyr)
bind_rows(df, data.frame(a=character(), b=numeric(), c=double(), d=integer(),
e=as.Date(character()), stringsAsFactors = FALSE))
# x y a b c d e
#1 1 4 <NA> NA NA NA <NA>
#2 2 5 <NA> NA NA NA <NA>
#3 3 6 <NA> NA NA NA <NA>
A simple loop can help here
name_list <- c('a1','b1','c1','d1')
# example df
df <- data.frame(a = runif(3))
# this adds a new column
for(i in name_list)
{
df[[i]] <- runif(3)
}
# output
a a1 b1 c1 d1
1 0.09227574 0.08225444 0.4889347 0.2232167 0.8718206
2 0.94361151 0.58554887 0.7095412 0.2886408 0.9803941
3 0.22934864 0.73160433 0.6781607 0.7598064 0.4663031
# in case of data.table, a for loop with set() provides a faster version:
library(data.table)
# example df
df <- data.table(a = runif(3))
for(i in name_list)
set(df, j=i, value = runif(3))

How to replace values in multiple columns in a data.frame with values from a vector in R?

I would like to replace the values in the last three columns in my data.frame with the three values in a vector.
Example of data.frame
df
A B C D
5 3 8 9
Vector
1 2 3
what I would like the data.frame to look like.
df
A B C D
5 1 2 3
Currently I am doing:
df$B <- Vector[1]
df$C <- Vector[2]
df$D <- Vector[3]
I would like to not replace the values one by one. I would like to do it all at once.
Any help will be appreciated. Please let me know if any further information is needed.
We can subset the last three columns of the dataset with tail, replicate the 'Vector' to make the lengths similar and assign the values to those columns
df[,tail(names(df),3)] <- Vector[col(df[,tail(names(df),3)])]
df
# A B C D
#1 5 1 2 3
NOTE: I replicated the 'Vector' assuming that there will be more rows in the 'df' in the original dataset.
Try this:
df[-1] <- 1:3
giving:
> df
A B C D
1 5 1 2 3
Alternately, we could do it non-destructively like this:
replace(df, -1, 1:3)
Note: The input df in reproducible form is:
df <- data.frame(A = 5, B =3, C = 8, D = 9)
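When df has more than one row, a bare df[-1] <- 1:3 would, as far as I can tell, recycle the values down the columns rather than across them. A sketch that repeats each value once per row keeps one value per column; the two-row df below is made up for illustration.

```r
# Hypothetical two-row version of the example dataframe
df <- data.frame(A = c(5, 6), B = c(3, 4), C = c(8, 7), D = c(9, 0))
Vector <- 1:3

# Repeat each replacement value once per row, so that the values
# fill column B, then C, then D
df[-1] <- rep(Vector, each = nrow(df))

df
```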

Replace values in selected columns by passing column name of data.frame into apply() or plyr function

Suppose I have a data.frame like:
df <- data.frame(a=1:5, b=sample(1:5, 5, replace=TRUE), c=5:1)
df
a b c
1 1 4 5
2 2 3 4
3 3 5 3
4 4 2 2
5 5 1 1
and I need to replace every 5 with NA in columns b and c, then assign the result back to df:
df
a b c
1 1 4 NA
2 2 3 4
3 3 NA 3
4 4 2 2
5 5 1 1
But I want to use a generic apply() function instead of calling replace() on each column one by one, because many variables need to be replaced in the real data. Suppose I've defined a list of variables:
var <- c("b", "c")
and come up with something like:
df <- within(df, sapply(var, function(x) x <- replace(x, x==5, NA)))
but nothing happens. I was thinking if there is a way to work this out with something similar to the above by passing a variable list of column names from a data.frame into a generic apply / plyr function (or maybe some other completely different ways). Thanks~
You could just do
df[,var][df[,var] == 5] <- NA
df <- data.frame(a=1:5, b=sample(1:5, 5, replace=TRUE), c=5:1)
df
var <- c("b","c")
df[,var] <- sapply(df[,var],function(x) ifelse(x==5,NA,x))
df
I find the ifelse notation easier to understand here, but most R users would probably use indexing instead.
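For something closer to the apply-style call the question was reaching for, one possible sketch uses lapply over the selected columns and assigns the result back. Column b is fixed here, instead of sampled, so that the result is reproducible.

```r
# Same shape as the question's dataframe, with b fixed for reproducibility
df <- data.frame(a = 1:5, b = c(4, 3, 5, 2, 1), c = 5:1)

# Columns in which 5 should become NA
var <- c("b", "c")

# Apply replace() to each selected column and write the columns back
df[var] <- lapply(df[var], function(x) replace(x, x == 5, NA))

df
```

The attempt in the question changed nothing because the assignment inside the anonymous function only modifies its local x, and the result of sapply was never stored.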

Convert the values in a column into row names in an existing data frame

I would like to convert the values in a column of an existing data frame into row names. Is it possible to do this without exporting the data frame and then reimporting it with a row.names = call?
For example I would like to convert:
> samp
names Var.1 Var.2 Var.3
1 A 1 5 0
2 B 2 4 1
3 C 3 3 2
4 D 4 2 3
5 E 5 1 4
Into:
> samp.with.rownames
Var.1 Var.2 Var.3
A 1 5 0
B 2 4 1
C 3 3 2
D 4 2 3
E 5 1 4
This should do:
samp2 <- samp[,-1]
rownames(samp2) <- samp[,1]
So in short: no, there is no alternative to reassigning.
Edit: Correcting myself, one can also do it in place: assign rowname attributes, then remove column:
R> df<-data.frame(a=letters[1:10], b=1:10, c=LETTERS[1:10])
R> rownames(df) <- df[,1]
R> df[,1] <- NULL
R> df
b c
a 1 A
b 2 B
c 3 C
d 4 D
e 5 E
f 6 F
g 7 G
h 8 H
i 9 I
j 10 J
R>
As of 2016 you can also use the tidyverse.
library(tidyverse)
samp %>% remove_rownames %>% column_to_rownames(var="names")
In one line:
> samp.with.rownames <- data.frame(samp[,-1], row.names=samp[,1])
It looks like the one-liner got even simpler along the line (currently using R 3.5.3):
# generate original data.frame
df <- data.frame(a = letters[1:10], b = 1:10, c = LETTERS[1:10])
# use first column for row names
df <- data.frame(df, row.names = 1)
The column used for row names is removed automatically.
With a one-row dataframe
Beware that if the dataframe has a single row, the behaviour might be confusing. As the documentation mentions:
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number).
This means that, if you use the same command as above, it might look like it did nothing (when it actually named the first row "1", which won't look any different in the viewer).
In that case, you will have to stick to the more verbose:
df <- data.frame(a = "a", b = 1)
df <- data.frame(df, row.names = df[,1])
... but the column won't be removed. Also remember that, if you remove a column to end up with a single-column dataframe, R will simplify it to an atomic vector. In that case, you will want to use the extra drop argument:
df <- data.frame(df[,-1, drop = FALSE], row.names = df[,1])
You can execute this in 2 simple statements:
row.names(samp) <- samp$names
samp[1] <- NULL
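The pieces above can be wrapped in a small function. This is only a sketch: column_to_rn is a hypothetical helper, not part of base R or any package, and it assumes the column is given by name. Using drop = FALSE keeps the result a dataframe even when only one column remains, and setting the row names directly sidesteps the single-row behaviour of data.frame(row.names = ...).

```r
# Hypothetical helper: move the column named `col` into the row names
column_to_rn <- function(d, col) {
  keep <- setdiff(names(d), col)           # every column except `col`
  out <- d[, keep, drop = FALSE]           # stay a dataframe, even with one column
  rownames(out) <- as.character(d[[col]])  # use the column's values as row names
  out
}

# The example dataframe from the question
samp <- data.frame(names = c("A", "B", "C", "D", "E"),
                   Var.1 = 1:5, Var.2 = 5:1, Var.3 = 0:4)
samp.with.rownames <- column_to_rn(samp, "names")
samp.with.rownames

# Single-row, two-column case
one_row <- column_to_rn(data.frame(a = "a", b = 1), "a")
one_row
```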
