R- Remove rows based on condition across some columns - r

I have a data frame like this :
I want to remove rows which have values = 0 in those columns which are "numeric". I tried some functions but returned to me error o dind't remove anything be cause not the entire row is = 0. Summarizing, i need to remove the rows which are equals to 0 on the colums that have a numeric class( i.e from sales month to expected sales,). How i could do this???(below attach the result i expect)
PD: If I could do it with some function that allows me to put the number of the column instead of the name, it would be great!

Here a simple solution with lapply.
set.seed(5)
df <- data.frame(a=1:10,b=letters[1:10],x=sample(0:5,5,replace = T),y=sample(c(0,10,20,30,40,50),5,replace = T))
df <-df[!unlist(lapply(1:nrow(df), function(i) {
any(df[i, ] == 0)
})), ]

Related

Removing columns from a matrix using a list of column names

I'm attempting to remove all colSums !=0 columns from my matrix. Using this post I have gotten close.
data2<-subset(data, Gear == 20) #subset off larger data matrix
i <- (colSums(data2[,3:277], na.rm=T) != 0) #column selection in this line limits to numeric columns only
data3<- data2[, i] #select only columns with non-zero colSums
This creates a vector, i, that properly identifies the columns to be removed using a logical True/False. However, the final line is removing columns by some logic other than my intentional if true then include if false then exclude.
The goal: remove all columns in that range that have a colSums == 0 .
The problem: my current code does not seem to properly identify said columns
Suggestion as to what I'm missing?
Update:
Adding dummy data to use:
a<-matrix(1:10, ncol = 10,nrow=10)
a
a[,c(3,5,8)]<-0
a
i <- (colSums(a, na.rm=T) != 0)
b<- a[, i]
it works well here, so I'm not sure why it won't work above on real data.
We can get the column names from the colSums and concatenate with the first two column names, which was not used for creating the condition with colSums to select the columns of interest
data[c(names(data)[1:2], names(which(i1)))]

How do you drop all rows from a dataframe where the sum of a range of columns is 0?

I have a dataframe with the columns
experimentResultDataColumns - faceGenderClk - 35 more columns ending with Clk - rougeClk - someMoreExperimentDataColumns
I am trying to drop all rows from the dataframe, where the sum of the 50 colums from faceGenderClk to (including) rougeClk is 0
There is data of an online study in the dataframe and the "Clk" columns count how many times the participant clicked a specific slider. If no sliders were clicked, the data is invalid. (It's basically like someone handing you your survey without setting their pen on the paper)
I was able to perform similar logic with a statement like this:
df<-df[!(df$screenWidth < 1280),]
to cut out all insufficiently sized screens, but I am unsure of how to perform this sum operation within that statement. I tried
df <- df[!(sum(df$faceGenderClk:df$rougeClk) > 0)]
but that doesn't work. (I'm not very good at R, I assume it definitely shouldn't work with that syntax)
The expected result is a dataframe which has all rows stripped from it, where the sum of all 50 values in that row from faceGenderClk to rougeClk is 0
EDIT:
data: https://pastebin.com/SLAmkHk5
the expected result of the code would drop the second row of data
code so far:
df <- read.csv("./trials.csv")
SECONDS_IN_AN_HOUR <- 60*60
MILLISECONDS_IN_AN_HOUR <- SECONDS_IN_AN_HOUR * 1000
library(dplyr)
#levels(df$latinSquare) <- c("AlexaF", "SiriF", "CortanaF", "SiriM", "GoogleF", "RobotM") ignore this since I faked the dataset to protect participants' personal data
df<-df[!(df$timeMainSessionTime > 6 * MILLISECONDS_IN_AN_HOUR),]
df<-df[!(df$screenWidth < 1280),]
the as of this edit accepted answer solves the problem with:
cols = grep(pattern = "Clk$", names(df), value=TRUE)
sums = rowSums(df[cols])
df <- df[sums != 0, ]
First, get the names of the column you want to check. Then add up the columns and do your subset.
# columns that end in Clk
cols = grep(pattern = "Clk$", names(df), value = TRUE)
# add them up
sums = rowSums(df[cols])
# susbet
df[sums != 0, ]

Add a column to empty data.frame

I want to initialise a column in a data.frame look so:
df$newCol = 1
where df is a data.frame that I have defined earlier and already done some processing on. As long as nrow(df)>0, this isn't a problem, but sometimes my data.frame has row length 0 and I get:
> df$newCol = 1
Error in `[[<-`(`*tmp*`, name, value = 1) :
1 elements in value to replace 0 elements
I can work around this by changing my original line to
df$newCol = rep(1,nrow(df))
but this seems a bit clumsy and is computationally prohibitive if the number of rows in df is large. Is there a built in or standard solution to this problem? Or should I use some custom function like so
addCol = function(df,name,value) {
if(nrow(df)==0){
df[,name] = rep(value,0)
}else{
df[,name] = value
}
df
}
If I understand correctly,
df = mtcars[0, ]
df$newCol = numeric(nrow(df))
should be it?
This is assuming that by "row length" you mean nrows, in which case you need to append a vector of length 0. In such case, numeric(nrow(df)) will give you the exact same result as rep(0, nrow(df)).
It also kind of assumes that you just need a new column, and not specifically column of ones - then you would simply do +1, which is a vectorized operation and therefore fast.
Other than that, I'm not sure you can have an "empty" column - the vector should have the same number of elements as the other vectors in the data frame. But numeric is fast, it should not hurt.

R select subset of data

I have a dataset with three columns.
## generate sample data
set.seed(1)
x<-sample(1:3,50,replace = T )
y<-sample(1:3,50,replace = T )
z<-sample(1:3,50,replace = T )
data<-as.data.frame(cbind(x,y,z))
What I am trying to do is:
Select those rows where all the three columns have 1
Select those rows where only two columns have 1 (could be any column)
Select only those rows where only column has 1 (could be any column)
Basically I want any two columns (for 2nd case) to fulfill the conditions and not any specific column.
I am aware of rows selection using
subset<-data[c(data$x==1,data$y==1,data$z==1),]
But this only selects those rows based on conditions for specific columns whereas I want any of the three/two columns to fullfill me criteria
Thanks
n = 1 # or 2 or 3
data[rowSums(data == 1) == n,]
Here is another method:
rowCounts <- table(c(which(data$x==1), which(data$y==1), which(data$z==1)))
# this is the long way
df.oneOne <- data[as.integer(names(rowCounts)[rowCounts == 1]),]
df.oneTwo <- data[as.integer(names(rowCounts)[rowCounts == 2]),]
df.oneThree <- data[as.integer(names(rowCounts)[rowCounts == 3]),]
It is better to save multiple data.frames in a list especially when there is some structure that guides this storage as is the case here. Following #richard-scriven 's suggestion, you can do this easily with lapply:
df.oneCountList <- lapply(1:3, function(i)
data[as.integer(names(rowCounts)[rowCounts == i]),]
names(df.oneCountList) <- c("df.oneOne", "df.oneTwo", df.oneThree)
You can then pull out the data.frames using either their index, df.oneCountList[[1]] or their name df.oneCountList[["df.oneOne"]].
#eddi below suggests a nice shortcut to my method of pulling out the table names using tabulate and the arr.ind argument of which. When which is applied on a multipdimensional object such as an array or a data.frame, setting arr.ind==TRUE produces indices of the rows and the columns where the logical expression evaluates to TRUE. His suggestion exploits this to pull out the row vector where a 1 is found across all variables. The tabulate function is then applied to these row values and tabulate returns a sorted vector that where each element represents a row and rows without a 1 are filled in with a 0.
Under this method,
rowCounts <- tabulate(which(data == 1, arr.ind = TRUE)[,1])
returns a vector from which you might immediately pull the values. You can include the above lapply to get a list of data.frames:
df.oneCountList <- lapply(1:3, function(i) data[rowCounts == i,])
names(df.oneCountList) <- c("df.oneOne", "df.oneTwo", df.oneThree)

Guetting a subset in R

I have a dataframe with 14 columns, and I want to subset a dataframe with the same column but keeping only row that repeats (for example, I have an ID variable and if ID = 2 repeated so I subset it).
To begin, I applied a table to my dataframe to see the frequencies of ID
head(sort(table(call.dat$IMSI), decreasing = TRUE), 100)
In my case, 20801170106338 repeat two time; so I want to see the two observation for this ID.
Afterward, I did x <- subset(call.dat, IMSI == "20801170106338") and hsb6 <- call.dat[call.dat$IMSI == "20801170106338", ], but the result is false (for x, it's returning me 0 observation of 14 variale and for hsb6 I have only NA in my dataframe).
Can you help me, thanks.
PS: IMSI is a numeric value.
And x <- subset(call.dat, Handset.Manufacturer == "LG") is another example which works perfectly...
You can use duplicated that is a function giving you an array that is TRUE in case the record is duplicated.
isDuplicated <- duplicated(call.dat$IMSI)
Then, you can extract all the rows containing a duplicated value.
call.dat.duplicated <- all.dat[isDuplicated, ]

Resources