Two difficult conditions for subsetting in R

I need to subset a data frame using two conditions that I find very difficult to code in R.
Given the following data frame:
A=as.factor(rep(1:50,3))
B=as.factor(rep(c(1,2,3),50))
C=(rep(rnorm(10,30,3),15))
df=data.frame(A,B,C)
I need to subset the rows of that data frame which, for a given level of factor A, contain observations of two of the levels of B (for example, level "1" and level "2").
Any hints?
Thanks in advance
Agus

Assuming you want the first level of factor A and the first two levels of factor B:
df[df$A %in% levels(df$A)[1] & df$B %in% levels(df$B)[1:2], ]
To change the subset, replace levels(df$A)[1] and levels(df$B)[1:2] with the exact values you need.
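As a minimal sketch of the answer above, the following rebuilds the example data and selects by explicit level labels rather than by position. The set.seed call and the names a_level, b_levels and res are additions for illustration and are not part of the original post.

```r
# Rebuild the example data; set.seed is added only to make C reproducible.
set.seed(1)
A <- as.factor(rep(1:50, 3))
B <- as.factor(rep(c(1, 2, 3), 50))
C <- rep(rnorm(10, 30, 3), 15)
df <- data.frame(A, B, C)

# Hypothetical choices: one level of A and two levels of B, given as labels.
a_level  <- "1"
b_levels <- c("1", "2")

# Keep rows where A equals the chosen level AND B is one of the chosen levels.
res <- df[df$A == a_level & df$B %in% b_levels, ]
res
```

Comparing against the labels ("1", "2") instead of levels(df$A)[1] may be easier to read when the levels of interest are known in advance.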

Related

Extract certain amount of factor levels in R

I have a data frame with a column with more than 100 factor levels.
I want to extract rows so that the column has just 50 factor levels, to decrease the calculation time.
How can I randomly extract a certain number of factor levels?
So that this question does not go unanswered:
You can use sample to get a random sample of the factor and then use %in% to select the relevant rows of your data.frame.
ReducedFactors = sample(levels(df$MyFactor), 50)
df[which(df$MyFactor %in% ReducedFactors ), ]
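A small, self-contained sketch of that approach is shown below. The toy data frame, its 120 levels, and the name reduced are assumptions made for illustration; droplevels is added because the subset otherwise still reports all of the original levels.

```r
# Toy data (an assumption): a factor column with 120 levels, 3 rows per level.
set.seed(42)
lv <- sprintf("lvl%03d", 1:120)
df <- data.frame(MyFactor = factor(rep(lv, each = 3)),
                 value    = rnorm(360))

# Randomly pick 50 of the existing levels.
ReducedFactors <- sample(levels(df$MyFactor), 50)

# Keep only rows whose level was picked, then drop the unused levels.
reduced <- droplevels(df[df$MyFactor %in% ReducedFactors, ])

nlevels(reduced$MyFactor)
```

Without droplevels, nlevels would still return the original number of levels even though only 50 of them occur in the rows.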

Split R dataframe by n number of factors

I have a dataframe that I need to split into smaller dataframes by groups of factors so that I can paginate tables and figures.
For example, say I wanted to split the diamonds dataset into mini dataframes with 2 cut levels per dataframe. That would mean a list of 2 dataframes with 2 levels each and 1 dataframe with 1 level.
levels(diamonds$cut)
# "Fair" "Good" "Very Good" "Premium" "Ideal"
I'm trying to use split() to accomplish this. split(diamonds, diamonds$cut) splits the set into one dataframe per level, but how would you split it by groups of 2, 3, or n levels? Something like split(data,rep(1:round(nrow(data)/10),each=10)) works when each factor level has only one row, but I'm working with a "long" dataframe, so the levels are spread out along the length of the dataframe.
This question comes close, but uses a numeric variable that I don't have.
We split the levels of the 'cut' variable using a grouping variable created with gl, and then subset 'diamonds' for each list element using %in%.
v1 <- levels(diamonds$cut)
n <- 2
lapply(split(v1, as.numeric(gl(length(v1), n, length(v1)))),
function(x) diamonds[diamonds$cut %in% x,])
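To see what the grouping step does on its own, here is a sketch that works on the level labels only, so it does not need the diamonds data. The labels are typed in by hand, and the names grp, level_groups and alt are illustrative assumptions.

```r
# The five levels of diamonds$cut, written out by hand for this sketch.
v1 <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
n  <- 2

# gl() builds group indices with n repeats each, truncated to length(v1).
# as.numeric() turns the factor into plain numbers so that split() does not
# create empty groups for the unused levels 4 and 5.
grp <- as.numeric(gl(length(v1), n, length(v1)))
grp

# Levels grouped n at a time.
level_groups <- split(v1, grp)
level_groups

# An equivalent grouping built with integer arithmetic instead of gl().
alt <- split(v1, ceiling(seq_along(v1) / n))
```

Each element of level_groups can then be passed to %in% to pull the matching rows, as in the lapply call above.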
By using:
diamonds$splt <- c("B","A")[diamonds$cut %in% c("Very Good","Premium","Ideal") + 1L]
you create a new variable on which you can split the dataset in two with:
split(diamonds, diamonds$splt)
A simple solution:
df_splt<-split(diamonds,ceiling(as.numeric(diamonds$cut)/2))
Note, though, that each resulting data.frame still carries the unused (empty) levels:
> table(df_splt[[1]]$cut)
     Fair      Good Very Good   Premium     Ideal
     1610      4906         0         0         0
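If the empty levels get in the way (for example in tables or plot legends), one option is to call droplevels on each piece. The sketch below uses a small stand-in data frame instead of diamonds; toy, cut_levels and pieces are names assumed for illustration.

```r
# Stand-in for diamonds (an assumption): a factor with the same five levels.
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
toy <- data.frame(
  cut   = factor(rep(cut_levels, times = 1:5), levels = cut_levels),
  price = seq_len(15)
)

# Group the levels two at a time via their integer codes, as in the answer.
pieces <- split(toy, ceiling(as.integer(toy$cut) / 2))

# Drop the levels that no longer occur in each piece.
pieces <- lapply(pieces, droplevels)

lapply(pieces, function(d) levels(d$cut))
```

This relies on the level order of the factor, so it groups neighbouring levels together.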

How to subset a data frame by taking only the Non NA values of 2 columns in this data frame

I am trying to subset a data frame by taking the integer values of 2 columns of my data frame
Subs1<-subset(DATA,DATA[,2][!is.na(DATA[,2])] & DATA[,3][!is.na(DATA[,3])])
but it gives me an error: longer object length is not a multiple of shorter object length.
How can I construct a subset composed of the non-NA values of column 2 AND column 3?
Thanks a lot.
Try this:
Subs1<-subset(DATA, (!is.na(DATA[,2])) & (!is.na(DATA[,3])))
The second argument of subset is a logical vector of length nrow(DATA), indicating whether to keep the corresponding row.
The na.omit function can also be an answer to your question:
Subs1 <- na.omit(DATA[2:3])
[https://stat.ethz.ch/R-manual/R-patched/library/stats/html/na.fail.html]
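The two suggestions above differ in which columns they return, which the following sketch tries to make visible. The four-column DATA here is made up for illustration, and complete.cases restricted to columns 2 and 3 is shown as one further option, not as something taken from the answers.

```r
# Made-up data: NAs in columns 2 and 3, plus a column z that is mostly NA.
DATA <- data.frame(id = 1:4,
                   x  = c(1, NA, 3, 4),
                   y  = c(NA, 2, 3, 4),
                   z  = c(NA, NA, NA, 1))

# Row filter on columns 2 and 3 only; all columns are kept.
by_subset <- subset(DATA, !is.na(DATA[, 2]) & !is.na(DATA[, 3]))

# Same row filter written with complete.cases on the two columns.
by_cc <- DATA[complete.cases(DATA[, 2:3]), ]

# na.omit(DATA[2:3]) filters the same rows but returns only columns 2 and 3.
only_cols <- na.omit(DATA[2:3])
```

Here by_subset and by_cc keep rows 3 and 4 with every column, including the NA in z, while only_cols contains just x and y.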
Here is an example.
a, b and c are three vectors; a and b each contain a missing value.
Once they are created, I use cbind to bind them into one matrix, which is then converted to a data frame.
The result is a data frame in which 2 of the 3 columns have a missing value.
So we need to keep only the rows with complete cases. DATA[complete.cases(DATA), ] keeps only the rows that have no missing values in any column. The subset object holds those complete rows.
a <- c(1,NA,2)
b <- c(NA,1,2)
c <- c(1,2,3)
DATA <- as.data.frame(cbind(a,b,c))
subset <- DATA[complete.cases(DATA), ]

subset based on frequency level [duplicate]

I want to generate a df that selects the rows whose "ID" occurs more often than a threshold called cutoff. For this example, I set the cutoff to 9, meaning that I want to select the rows in df1 whose ID value is associated with more than 9 rows. The last line of my code generates a df that I don't understand. The correct df would have 24 rows, all with either a 4 or a 5 in the ID column. Can someone explain what my last line of code is actually doing and suggest a different approach?
set.seed(123)
ID<-rep(c(1,2,3,4,5),times=c(5,7,9,11,13))
sub1<-rnorm(45)
sub2<-rnorm(45)
df1<-data.frame(ID,sub1,sub2)
IDfreq<-count(df1,"ID")  # count() with a "freq" column is plyr::count, so library(plyr) must be loaded
cutoff<-9
df2<-subset(df1,subset=(IDfreq$freq>cutoff))
df1[ df1$ID %in% names(table(df1$ID))[table(df1$ID) >9] , ]
This tests whether each df1$ID value belongs to a group with more than 9 rows. Where it does, the corresponding element of the logical vector is TRUE, and passing that vector as the "i" argument makes the [ function return the entire row, since the "j" argument is empty.
See:
?`[`
?'%in%'
Using dplyr
library(dplyr)
df1 %>%
group_by(ID) %>%
filter(n()>cutoff)
Maybe closer to what you had in mind is to create a vector of frequencies using ave:
subset(df1, ave(ID, ID, FUN = length) > cutoff)
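As for what the original last line does: the condition has one element per ID (5 values), not one per row (45 values), so R recycles it across the rows. The sketch below reproduces this without plyr by using table for the counts. The names per_id, per_row, recycled and correct are assumptions for illustration, and sub1 is replaced by a simple row counter.

```r
ID  <- rep(c(1, 2, 3, 4, 5), times = c(5, 7, 9, 11, 13))
df1 <- data.frame(ID, sub1 = seq_along(ID))
cutoff <- 9

# One logical per ID: FALSE FALSE FALSE TRUE TRUE (length 5, not 45).
per_id <- as.vector(table(df1$ID)) > cutoff

# Used as a row filter, the 5 values are recycled over the 45 rows,
# so rows 4, 5, 9, 10, 14, 15, ... are kept regardless of their ID.
recycled <- subset(df1, subset = per_id)
nrow(recycled)

# One logical per row: the size of each row's ID group compared to cutoff.
per_row <- ave(df1$ID, df1$ID, FUN = length) > cutoff
correct <- df1[per_row, ]
nrow(correct)
unique(correct$ID)
```

Because 45 is a multiple of 5, the recycling happens silently, without a warning about object lengths.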

keep most common factor levels in R

I used the "dummies" package to create 42 dummy variables for the 42 levels of a factor variable in my data-frame. Now I only want to keep the 5 dummies that represent the five most common factor levels. I used:
counts <- colSums(dummy_variables)
rank <- sort(counts)
to figure out what those levels are, but now I want to be able to reference the most common ones and keep them in my data frame. I am somewhat new to R - I just can't figure out the syntax to do this.
Filter out the top 5 variables, and then subset only those columns.
rank <- sort(counts)[(length(counts)-4):length(counts)]
dummy_variables <- dummy_variables[names(dummy_variables) %in% names(rank)]
Or in one line as the commenter suggested,
dummy_variables[names(dummy_variables) %in% names(tail(sort(colSums(dummy_variables)),5))]
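A self-contained sketch of the same idea follows, assuming dummy_variables is a data frame of 0/1 columns. The toy data (8 columns named level_1 to level_8 with known counts) and the names top5 and kept are hypothetical. Selecting by name returns the columns in order of frequency rather than in their original order.

```r
# Toy dummy matrix (an assumption): column k contains exactly k ones.
dummy_variables <- as.data.frame(
  sapply(1:8, function(k) as.integer(seq_len(20) <= k))
)
names(dummy_variables) <- paste0("level_", 1:8)

# Frequency of each level = column sum of its dummy.
counts <- colSums(dummy_variables)

# Names of the five most frequent levels, most frequent first.
top5 <- names(head(sort(counts, decreasing = TRUE), 5))

# Keep only those columns; drop = FALSE guards the single-column case.
kept <- dummy_variables[, top5, drop = FALSE]
names(kept)
```

If the dummies are stored as a matrix instead of a data frame, colnames would be needed in place of names.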
