R select subset of data - r

I have a dataset with three columns.
## generate sample data
set.seed(1)
x<-sample(1:3,50,replace = T )
y<-sample(1:3,50,replace = T )
z<-sample(1:3,50,replace = T )
data<-as.data.frame(cbind(x,y,z))
What I am trying to do is:
Select those rows where all the three columns have 1
Select those rows where only two columns have 1 (could be any column)
Select only those rows where only column has 1 (could be any column)
Basically I want any two columns (for 2nd case) to fulfill the conditions and not any specific column.
I am aware of rows selection using
subset<-data[c(data$x==1,data$y==1,data$z==1),]
But this only selects those rows based on conditions for specific columns whereas I want any of the three/two columns to fullfill me criteria
Thanks

n = 1 # or 2 or 3
data[rowSums(data == 1) == n,]

Here is another method:
rowCounts <- table(c(which(data$x==1), which(data$y==1), which(data$z==1)))
# this is the long way
df.oneOne <- data[as.integer(names(rowCounts)[rowCounts == 1]),]
df.oneTwo <- data[as.integer(names(rowCounts)[rowCounts == 2]),]
df.oneThree <- data[as.integer(names(rowCounts)[rowCounts == 3]),]
It is better to save multiple data.frames in a list especially when there is some structure that guides this storage as is the case here. Following #richard-scriven 's suggestion, you can do this easily with lapply:
df.oneCountList <- lapply(1:3, function(i)
data[as.integer(names(rowCounts)[rowCounts == i]),]
names(df.oneCountList) <- c("df.oneOne", "df.oneTwo", df.oneThree)
You can then pull out the data.frames using either their index, df.oneCountList[[1]] or their name df.oneCountList[["df.oneOne"]].
#eddi below suggests a nice shortcut to my method of pulling out the table names using tabulate and the arr.ind argument of which. When which is applied on a multipdimensional object such as an array or a data.frame, setting arr.ind==TRUE produces indices of the rows and the columns where the logical expression evaluates to TRUE. His suggestion exploits this to pull out the row vector where a 1 is found across all variables. The tabulate function is then applied to these row values and tabulate returns a sorted vector that where each element represents a row and rows without a 1 are filled in with a 0.
Under this method,
rowCounts <- tabulate(which(data == 1, arr.ind = TRUE)[,1])
returns a vector from which you might immediately pull the values. You can include the above lapply to get a list of data.frames:
df.oneCountList <- lapply(1:3, function(i) data[rowCounts == i,])
names(df.oneCountList) <- c("df.oneOne", "df.oneTwo", df.oneThree)

Related

R- Remove rows based on condition across some columns

I have a data frame like this :
I want to remove rows which have values = 0 in those columns which are "numeric". I tried some functions but returned to me error o dind't remove anything be cause not the entire row is = 0. Summarizing, i need to remove the rows which are equals to 0 on the colums that have a numeric class( i.e from sales month to expected sales,). How i could do this???(below attach the result i expect)
PD: If I could do it with some function that allows me to put the number of the column instead of the name, it would be great!
Here a simple solution with lapply.
set.seed(5)
df <- data.frame(a=1:10,b=letters[1:10],x=sample(0:5,5,replace = T),y=sample(c(0,10,20,30,40,50),5,replace = T))
df <-df[!unlist(lapply(1:nrow(df), function(i) {
any(df[i, ] == 0)
})), ]

Removing columns from a matrix using a list of column names

I'm attempting to remove all colSums !=0 columns from my matrix. Using this post I have gotten close.
data2<-subset(data, Gear == 20) #subset off larger data matrix
i <- (colSums(data2[,3:277], na.rm=T) != 0) #column selection in this line limits to numeric columns only
data3<- data2[, i] #select only columns with non-zero colSums
This creates a vector, i, that properly identifies the columns to be removed using a logical True/False. However, the final line is removing columns by some logic other than my intentional if true then include if false then exclude.
The goal: remove all columns in that range that have a colSums == 0 .
The problem: my current code does not seem to properly identify said columns
Suggestion as to what I'm missing?
Update:
Adding dummy data to use:
a<-matrix(1:10, ncol = 10,nrow=10)
a
a[,c(3,5,8)]<-0
a
i <- (colSums(a, na.rm=T) != 0)
b<- a[, i]
it works well here, so I'm not sure why it won't work above on real data.
We can get the column names from the colSums and concatenate with the first two column names, which was not used for creating the condition with colSums to select the columns of interest
data[c(names(data)[1:2], names(which(i1)))]

Subset the remaining of a dataframe using another subset

I have a sample dataset. I've created a subset of the original data frame using some condition. Now I need to extract the remaining contents of the original sample data frame, except the subset created. How can I do this?
data("mtcars")
fulldf <- mtcars
subdf <- subset.data.frame(fulldf, subset = fulldf$disp < 100)
restdf <- subset.data.frame(fulldf, subset = <fulldf without subdf>)
There are a lot of questions on subsetting data frames in R, but I couldn't find one that satisfied my requirement.
Also the final solution need not necessarily be using subset.data.frame. Any method/package will do.
It is better to assign the logical condition in base R to an object identifier and then negate (!)
i1 <- fulldf$disp < 100
subdf <- subset.data.frame(fulldf, subset = i1)
restdf <- subset.data.frame(fulldf, subset = !i1)
Also another option is to create a list of two datasets with split
lst1 <- split(fulldf, i1)
If the 'subdf' is creating with multiple conditions (not clear though), one option is to add a sequence variable in the data and then subset with %in%
fulldf$ind <- seq_len(nrow(fulldf))
then after the 'subdf' step
restdf <- subset(fulldf, !ind %in% subdf$ind)
and remove the 'ind' columns
restdf$ind <- NULL
subdf$ind <- NULL

Extract Columns that Do Not Exist in Another Matrix Based on Column Names

I have two matrices df_matrix and df_subset. One is a subset of the other one. Therefore, df_matrix has 10000 rows and columns and df_subset contains only 8222 columns and rows of df_matrix.
I want to select only those columns from df_matrix that are NOT in df_subset. I thought it is best to do it by column names, so I tried executing this code:
newdf <- df_matrix[, which( (colnames(df_matrix)) != (colnames(KroneckerProducts)) )]
However, this is not working at all. Is there any other way to do this?
General rule is not to use == or != with objects of different length
Use %in% with !
newdf <- df_matrix[, !(colnames(df_matrix) %in% colnames(KroneckerProducts))]

Guetting a subset in R

I have a dataframe with 14 columns, and I want to subset a dataframe with the same column but keeping only row that repeats (for example, I have an ID variable and if ID = 2 repeated so I subset it).
To begin, I applied a table to my dataframe to see the frequencies of ID
head(sort(table(call.dat$IMSI), decreasing = TRUE), 100)
In my case, 20801170106338 repeat two time; so I want to see the two observation for this ID.
Afterward, I did x <- subset(call.dat, IMSI == "20801170106338") and hsb6 <- call.dat[call.dat$IMSI == "20801170106338", ], but the result is false (for x, it's returning me 0 observation of 14 variale and for hsb6 I have only NA in my dataframe).
Can you help me, thanks.
PS: IMSI is a numeric value.
And x <- subset(call.dat, Handset.Manufacturer == "LG") is another example which works perfectly...
You can use duplicated that is a function giving you an array that is TRUE in case the record is duplicated.
isDuplicated <- duplicated(call.dat$IMSI)
Then, you can extract all the rows containing a duplicated value.
call.dat.duplicated <- all.dat[isDuplicated, ]

Resources