Split R dataframe by n number of factors - r

I have a dataframe that I need to split into smaller dataframes by groups of factors so that I can paginate tables and figures.
For example, say I wanted to split the diamonds dataset into mini dataframes with 2 cut levels per dataframe. That would mean a list of 2 dataframes with 2 levels each, and 1 dataframe with 1 level.
levels(diamonds$cut)
# "Fair" "Good" "Very Good" "Premium" "Ideal"
I'm trying to use split() to accomplish this. split(diamonds, diamonds$cut) splits the set into one dataframe per factor level, but how would you split it up by groups of 2, 3, or n levels? Something like split(data,rep(1:round(nrow(data)/10),each=10)) works when each factor level only has one row, but I'm working with a "long" dataframe, so the levels are spread out along the length of the dataframe.
This question comes close, but uses a numeric variable that I don't have.

We split the levels of the 'cut' variable using a grouping variable created with gl, and then subset 'diamonds' for each list element using %in%.
v1 <- levels(diamonds$cut)
n <- 2
lapply(split(v1, as.numeric(gl(length(v1), n, length(v1)))),
function(x) diamonds[diamonds$cut %in% x,])
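The same idea can be wrapped in a small helper. This is only a sketch: the function name split_by_n_levels and the toy data frame are made up for illustration, and ceiling(seq_along(lv) / n) is used in place of gl() to build the chunk index.

```r
# Toy "long" data frame; 'grp' plays the role of diamonds$cut
df <- data.frame(
  grp = factor(rep(c("a", "b", "c", "d", "e"), times = c(3, 2, 4, 1, 2)),
               levels = c("a", "b", "c", "d", "e")),
  val = seq_len(12)
)

# Hypothetical helper: split `data` into pieces holding at most `n` levels of column `f`
split_by_n_levels <- function(data, f, n) {
  lv <- levels(data[[f]])
  # Chunk index per level, e.g. 1 1 2 2 3 for five levels and n = 2
  chunk <- ceiling(seq_along(lv) / n)
  lapply(split(lv, chunk), function(x) data[data[[f]] %in% x, , drop = FALSE])
}

pieces <- split_by_n_levels(df, "grp", 2)
length(pieces)        # number of pieces
sapply(pieces, nrow)  # rows in each piece
```

Changing the last argument to 3 (or any n) gives groups of that many levels, with the remainder in the final piece.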

By using:
diamonds$splt <- c("B","A")[diamonds$cut %in% c("Very Good","Premium","Ideal") + 1L]
you create a new variable on which you can split the dataset in two with:
split(diamonds, diamonds$splt)

A simple solution:
df_splt<-split(diamonds,ceiling(as.numeric(diamonds$cut)/2))
Note, though, that each data.frame keeps all of the original levels, so the levels not present in it are empty.
> table(df_splt[[1]]$cut)
     Fair      Good Very Good   Premium     Ideal
     1610      4906         0         0         0
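If the empty levels get in the way (for example in table() output or plot legends), droplevels() can be applied to each piece. A minimal sketch on a made-up data frame, assuming the same ceiling(as.numeric(...) / n) grouping:

```r
# Toy data; 'grp' stands in for diamonds$cut
df <- data.frame(
  grp = factor(c("a", "a", "b", "c", "d", "e", "e")),
  val = 1:7
)
n <- 2

# Group rows by the integer code of their level, n levels per piece
pieces <- split(df, ceiling(as.numeric(df$grp) / n))

# droplevels() removes factor levels that no longer occur in each piece
pieces <- lapply(pieces, droplevels)

levels(pieces[[1]]$grp)
table(pieces[[3]]$grp)
```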

Related

Converting to factors

I have a data set as below:
age sex Cond label
range1 M 1 0
range2 M 2 1
range3 F 4 1
with more rows. All data columns are discrete.
I intend to use the hc, gs, bn, and tan functions of the bnlearn package in R. What data transformation should I use? How should I convert the data to factors?
Regarding the second question, it is very straightforward to convert to factor. Just loop through the columns of interest with lapply and apply the factor. Then update the original dataset with the output.
df1[] <- lapply(df1, factor)
If we only need a subset of the columns, say 'age' and 'sex', subset the dataset and then loop through those:
df1[c('age', 'sex')] <- lapply(df1[c('age', 'sex')], factor)
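For a quick check that the conversion did what was intended, here is a small sketch. The data frame below just re-types the three rows shown in the question, so its contents are an assumption about the real data.

```r
# Re-creation of the example rows from the question
df1 <- data.frame(
  age   = c("range1", "range2", "range3"),
  sex   = c("M", "M", "F"),
  Cond  = c(1, 2, 4),
  label = c(0, 1, 1),
  stringsAsFactors = FALSE
)

# Convert every column to a factor; df1[] keeps the data.frame structure
df1[] <- lapply(df1, factor)

# Confirm that every column is now a factor
sapply(df1, is.factor)
str(df1)
```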

Combine, count df columns w/o repeating other columns

A simple question that I am completely stumped on after consulting packages that I thought would help (plyr, reshape, unique).
Let's say I have the df below:
df <- data.frame(location=c("ny","nj","pa","ct"),
animal=c("dog","hamster","dog","pig"),
animal2=c("cat","dog","pig","dog"))
I would like to count the unique entities in specific columns and then rank occurrences. So here, I'd like to count the combined unique entities in the columns animal and animal2. If I use reshape and melt, the associated location values will repeat in the additional rows, but I don't want that because I only want to count the frequencies of the "location" variables as given in the original df.
Is there a way to rbind without repeating other columns? So in this case I would have another column called AnimalMaster and that would have all of the frequencies I need.
When I try count(df,c("animal","animal2")), it counts the joint occurrences, which is not what I want. Alternatively, I could also do this by just counting the unique strings across multiple columns without combining them. Is there a straightforward way to do this without running into the count problem?
Thank you for helping a beginner.
EDIT:
My desired output is the following:
countsdf with columns (Type, Name, Frequency, Frequency (%)), so that top row would be:
AnimalMaster | dog | 4 | 100%
Here's a suggestion with reshape2 and data.table
require(reshape2)
require(data.table)
dt <- data.table(melt(df, id.vars = 'location', value.name = 'animal'))
dt[, list(n=length(unique(location)),
percent=100*.N/dt[, length(unique(location))]),
by=animal]
#     animal n percent
# 1:     dog 4     100
# 2: hamster 1      25
# 3:     pig 2      50
# 4:     cat 1      25
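If you would rather avoid extra packages, something along these lines in base R might also work. It is only a sketch: it pools the two animal columns with unlist() and takes the percentage relative to the number of rows in df, which is my reading of the desired output. The column names in countsdf are taken from the question's EDIT.

```r
df <- data.frame(location = c("ny", "nj", "pa", "ct"),
                 animal   = c("dog", "hamster", "dog", "pig"),
                 animal2  = c("cat", "dog", "pig", "dog"),
                 stringsAsFactors = FALSE)

# Pool the two animal columns into one vector, ignoring location
pooled <- unlist(df[c("animal", "animal2")], use.names = FALSE)

# Count each name and rank from most to least frequent
counts <- sort(table(pooled), decreasing = TRUE)

countsdf <- data.frame(Type      = "AnimalMaster",
                       Name      = names(counts),
                       Frequency = as.vector(counts),
                       Percent   = 100 * as.vector(counts) / nrow(df))
countsdf
```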

keep most common factor levels in R

I used the "dummies" package to create 42 dummy variables for the 42 levels of a factor variable in my data-frame. Now I only want to keep the 5 dummies that represent the five most common factor levels. I used:
counts <- colSums(dummy_variables)
rank <- sort(counts)
to figure out what those levels are, but now I want to be able to reference the most common ones and keep them in my data frame. I am somewhat new to R - I just can't figure out the syntax to do this.
Pick out the 5 most common variables, and then subset to only those columns.
rank <- sort(counts)[(length(counts)-4):length(counts)]
dummy_variables <- dummy_variables[names(dummy_variables) %in% names(rank)]
Or in one line as the commenter suggested,
dummy_variables[names(dummy_variables) %in% names(tail(sort(colSums(dummy_variables)),5))]
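Here is a small self-contained sketch of the same idea. The dummy columns are built by hand with sapply() instead of the dummies package, and k = 2 is used instead of 5 to keep the toy data short. Indexing by the sorted names also returns the columns ordered from most to least common.

```r
# Toy factor with level counts a = 1, b = 2, c = 3, d = 4
f <- factor(c("a", "b", "b", "c", "c", "c", "d", "d", "d", "d"))

# One 0/1 column per level (hand-built stand-in for the dummies package output)
dummy_variables <- as.data.frame(
  sapply(levels(f), function(l) as.integer(f == l))
)

k <- 2
counts <- colSums(dummy_variables)

# Names of the k most common levels, most common first
top <- names(sort(counts, decreasing = TRUE))[seq_len(k)]

# Keep only those columns
kept <- dummy_variables[, top, drop = FALSE]
names(kept)
```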

two difficult conditions for subsetting in R

I need to subset a df with two very difficult conditions to code (for me) in R:
Given the following dataframe:
A=as.factor(rep(1:50,3))
B=as.factor(rep(c(1,2,3),50))
C=(rep(rnorm(10,30,3),15))
df=data.frame(A,B,C)
I need to subset the rows of that dataframe which, for a given level of factor A, contain observations of two of the levels of B (e.g., level "1" and level "2").
Any hint?
Thanks in advance
Agus
Assuming you want first level of factor A and first 2 levels of factor B
df[df$A %in% levels(df$A)[1] & df$B %in% levels(df$B)[1:2], ]
To change the subset, replace levels(df$A)[1] and levels(df$B)[1:2] by exact values you need.
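With exact values written out, that might look like the sketch below. The set.seed() call and the droplevels() step are my additions, included so the example is reproducible and the result carries no unused levels.

```r
set.seed(42)  # only so that C is reproducible
A <- as.factor(rep(1:50, 3))
B <- as.factor(rep(c(1, 2, 3), 50))
C <- rep(rnorm(10, 30, 3), 15)
df <- data.frame(A, B, C)

# Rows where A is level "1" and B is level "1" or "2"
sub <- df[df$A == "1" & df$B %in% c("1", "2"), ]

# Optionally drop the levels that no longer appear in the subset
sub <- droplevels(sub)
sub
```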

How to combine two columns of factors into one column without changing the factor levels into number [duplicate]

I am trying to find a way to combine two columns of factors into one column without changing the factor levels into numbers. For instance, consider the following two data.frame datasets
dataset 1              dataset 2
Number  Student        Number  Student
1       Chris          1       Matt
2       Sarah          2       Keith
I am trying to take "student" column from the dataset1 and the "student" column from the dataset2, and make one big student column containing the names "Chris", "Sarah", "Matt", and "Keith"
I tried:
student.list<-c(dataset1[,2],dataset2[,2])
student.list
However, this doesn't work since the names turn into numbers with the c() function. I want my list to preserve the names of the students (i.e. without converting them into numbers). I also tried cbind(), but it gives the same problem as c().
Thank you
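Factors are numbers that happen to have labels. When you combine factors, you are generally combining their numeric values. This can often trip a person up. (In R 4.1.0 and later, c() on factors returns a factor with the combined levels, but coercing explicitly works in every version.)
If you want their labels, you must coerce them to strings, using as.character: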
student.list <- c( as.character(dataset1[,2]) ,
as.character(dataset2[,2]) )
If you want to get that back to factors, wrap it all in as.factor (can be all in one line, or split into two lines for easier reading)
student.list <- c(as.character(dataset1[,2]),as.character(dataset2[,2]))
student.list <- as.factor(student.list)
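To see what is going on, it can help to look at the integer codes behind the labels. A small sketch, with the two data frames typed in from the question:

```r
dataset1 <- data.frame(Number = 1:2, Student = factor(c("Chris", "Sarah")))
dataset2 <- data.frame(Number = 1:2, Student = factor(c("Matt", "Keith")))

# The integer codes stored behind the labels
as.integer(dataset1$Student)
as.integer(dataset2$Student)

# Coerce to character first, combine, then turn back into a factor
student.list <- factor(c(as.character(dataset1$Student),
                         as.character(dataset2$Student)))
student.list
levels(student.list)
```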
There is the interaction() function in base R.
There is also the strata() function in the survival package.
The data.table package, which extends the functionality of data frames in some very useful ways, will combine factors automatically when you use the rbindlist function. Plus, if your two data sets are large, it will usually combine them more quickly.
library(data.table)
# Example data:
# (If you already have data frames, you can convert them using `as.data.table(dataframename)`)
dataset1<-data.table(Number=1:2,Student=as.factor(c("Chris","Sarah")))
dataset2<-data.table(Number=1:2,Student=as.factor(c("Matt","Keith")))
# Combine the two data sets:
# (It's not necessary to convert factors to characters)
rbindlist(list(dataset1,dataset2))
# Number Student
#1: 1 Chris
#2: 2 Sarah
#3: 1 Matt
#4: 2 Keith
You can now do this easily with fct_c() from the forcats package.
dataset1 <- data.frame(Number = c(1,2), Student = factor(c('Chris','Sarah')))
dataset2 <- data.frame(Number = c(1,2), Student = factor(c('Matt','Keith')))
library(forcats)
# forcats >= 0.3.0 takes the factors as separate arguments rather than as a list
fct_c(dataset1[ ,2], dataset2[ ,2])
# [1] Chris Sarah Matt Keith
# Levels: Chris Sarah Keith Matt
If your factors are inside data frames, then you can combine them using rbind:
> df1 <- data.frame(x=factor(c('a','b')))
> df2 <- data.frame(x=factor(c('c','d')))
> rbind(df1,df2)
x
1 a
2 b
3 c
4 d

Resources