Subset a dataframe based on a single condition applied to multiple columns - r

I've had a look through the existing subset Q&A's on this site and couldn't quite find what I was looking for.
I want to subset a data frame based on one condition (e.g. if the value is below 5). However, I only want the rows where the value in all of the columns is below 5.
For example using the iris dataset - I would like to select all the rows where columns 1-3 all have values below 5.
subdata <- iris[which(iris[,1:3]<5),]
This doesn't do it for me. I get lots of NA rows at the bottom of the subset data.
Any help much appreciated!

Try
subdata <- iris[apply(iris[,1:3] < 5, 1, all),]
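As for why the which() attempt produces NA rows: iris[,1:3] < 5 is a 150 x 3 logical matrix, so which() returns positions between 1 and 450, and any position above 150 does not correspond to a row. Below is a minimal sketch of an equivalent row filter built with rowSums(), assuming the columns contain no NA values (which holds for iris). The names below5 and keep are only illustrative.

```r
# Logical matrix: one TRUE/FALSE per cell of columns 1-3
below5 <- iris[, 1:3] < 5

# Keep a row only when every cell in that row is TRUE
keep <- rowSums(below5) == ncol(below5)
subdata <- iris[keep, ]

# Should select the same rows as the apply() version
same_rows <- identical(subdata, iris[apply(below5, 1, all), ])
```

rowSums() works on the whole matrix at once, so it tends to be faster than apply() on large data frames, but both express the same row-wise "all" test.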

Related

Loop over even/odd columns & stack them under specific ones

I have the following data set from Douglas Montgomery's book Introduction to Time Series Analysis & Forecasting:
I created a data frame called pharm from this spreadsheet. We only have two variables but they're repeated over several columns. I'd like to take all odd "Week" columns past the 2nd column and stack them under the 1st Week column in order. Conversely I'd like to do the same thing with the even "Sales, in thousands" columns. Here's what I've tried so far:
pharm2 <- data.frame(week=c(pharm$week, pharm[,3], pharm[,5], pharm[,7]), sales=c(pharm$sales, pharm[,4], pharm[,6], pharm[,8]))
This works because there aren't many columns, but I need a way to do this more efficiently because hard coding won't be practical with many columns. Does anyone know a more efficient way to do this?
If the columns alternate, just subset with a recycled logical vector, unlist the result, and create a new data.frame:
out <- data.frame(week = unlist(pharm[c(TRUE, FALSE)]),
sales = unlist(pharm[c(FALSE, TRUE)]))
You can also use the seq function to generate the indices of the alternating columns:
pharm2 <- data.frame(week = unlist(pharm[seq(1, ncol(pharm), 2)]),
sales = unlist(pharm[seq(2, ncol(pharm), 2)]))
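Since the original spreadsheet is not shown, here is a small sketch on a made-up stand-in for pharm (the values and the column names week.1 and sales.1 are assumptions) to check that the stacking order comes out as expected:

```r
# Hypothetical stand-in for pharm: week/sales pairs repeated across columns
pharm <- data.frame(week = 1:3, sales = c(10, 20, 30),
                    week.1 = 4:6, sales.1 = c(40, 50, 60))

odd  <- seq(1, ncol(pharm), by = 2)   # positions of the week columns
even <- seq(2, ncol(pharm), by = 2)   # positions of the sales columns

# unlist() concatenates the selected columns in order, left to right
pharm2 <- data.frame(week  = unlist(pharm[odd],  use.names = FALSE),
                     sales = unlist(pharm[even], use.names = FALSE))
```

Passing use.names = FALSE is optional; it only keeps the original column names from leaking into the row names of the result.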

Count values per rows in a data frame R

I know there are other questions like this one, but none of them answers my specific problem.
In my data frame, I need to count the number of values in each row between columns 3 and 8.
I want a simple NB.VAL like in Excel.
base_graphs$NB <- rowSums(!is.na(base_graphs)) # with this code, I count all values except NAs but I can't select specific columns
How can I create this new column "NB" in my data frame "base_graphs"?
You were really close:
base_graphs$NB <- rowSums(!is.na(base_graphs[, 3:8]))
The [, 3:8] subsets and selects columns 3 through 8.
apply can also apply a function to each row of a data frame. Note the !, which counts the values that are not NA. Try:
base_graphs$NB <- apply(base_graphs[3:8], 1, function(x) sum(!is.na(x)))
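A quick way to sanity-check the count is to run it on a tiny data frame where the answer can be verified by eye. The data below is invented for illustration; only the name base_graphs and the 3:8 column range come from the question.

```r
# Made-up example: two id columns followed by six value columns (3 to 8)
base_graphs <- data.frame(id = 1:3, grp = c("a", "b", "c"),
                          v1 = c(1, NA, 3),  v2 = c(NA, NA, 1),
                          v3 = c(2, 5, NA),  v4 = c(NA, 1, 1),
                          v5 = c(4, NA, NA), v6 = c(1, 1, 1))

# Count the non-NA cells per row, restricted to columns 3 through 8
base_graphs$NB <- rowSums(!is.na(base_graphs[, 3:8]))
# NB is 4, 3, 4 for the three rows
```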

Conditional sum of rows by a column value

I'm trying to sum rows that contain a value in a different column.
rowSums(wood_plastics[,c(48,52,56,60)], na.rm=TRUE)
The above got me row sums for the columns identified but now I'd like to only sum rows that contain a certain year in a different column. I tried this
rowSums(mydata[,c(48,52,56,60)], na.rm=TRUE, mydata$current_year = '2015')
with no success. I thought I might have to single out the year value from the column number, 7, in the initial column list.
Any help is appreciated.
I would say simply
rowSums(mydata[mydata$current_year == '2015',c(48,52,56,60)], na.rm=TRUE)
Since I don't have the original data frame I cannot show you the result, but the idea is that you select the rows you want before the comma and the columns you want after it.
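To illustrate the row-then-column selection on something runnable, here is a sketch with an invented three-row mydata (the columns a and b stand in for the numbered columns in the question):

```r
# Made-up data: only current_year is a name taken from the question
mydata <- data.frame(current_year = c("2015", "2016", "2015"),
                     a = c(1, 2, NA), b = c(10, 20, 30))

# Logical row filter goes before the comma, column selection after it
is_2015 <- mydata$current_year == "2015"
totals <- rowSums(mydata[is_2015, c("a", "b")], na.rm = TRUE)
# totals is 11 for row 1 and 30 for row 3
```

If current_year itself can contain NA, wrapping the filter in which() avoids picking up all-NA rows.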

Selecting different numbers of columns on each row of a data frame

This question is about selecting a different number of columns on every row of a data frame. I have a data frame:
df = data.frame(
START=sample(1:2, 10, replace=TRUE), END=sample(2:4, 10, replace=TRUE),
X1=rnorm(10), X2=rnorm(10), X3=rnorm(10), X4=rnorm(10)
)
I would like to have a way without loops to select columns (START[i]:END[i])+2 on row i for all rows of my data frame.
Base R solution
lapply(split(df,1:nrow(df)),function(row) row[(row$START+2):(row$END+2)])
Or something similar as given in the comment above (I would store the output in a list)
library(plyr)
alply(df,1,function(row) row[(row$START+2):(row$END+2)])
Edit per request of OP:
To get a TRUE/FALSE index matrix, use the following base R solution:
idx_matrix=col(df)>=df$START+2&col(df)<=df$END+2
df[idx_matrix]
Note, however, that you lose some information here (compared to the list based solution).
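The difference between the two approaches is easier to see on a fixed, two-row version of df. The values below are made up so that each cell shows its row and column; picked, idx and flat are illustrative names.

```r
# Deterministic stand-in for df: cell value "rc" means row r, column Xc
df <- data.frame(START = c(1, 2), END = c(2, 4),
                 X1 = c(11, 21), X2 = c(12, 22),
                 X3 = c(13, 23), X4 = c(14, 24))

# List-based version: one element per row, so the row grouping is kept
picked <- lapply(seq_len(nrow(df)), function(i) {
  unlist(df[i, (df$START[i] + 2):(df$END[i] + 2)])
})

# Matrix-index version: a single flat vector in column-major order
idx  <- col(df) >= df$START + 2 & col(df) <= df$END + 2
flat <- df[idx]
# flat is 11 12 22 23 24: the values are all there, but not which row they came from
```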

How to remove rows where columns satisfy certain condition in data frame

I have a data frame that looks like this
df <- data.frame(cbind(1:10, sample(c(1:5), 10, replace=TRUE)))
# in real case the columns could be more than two
# and the column name could be anything.
What I want to do is remove all rows in which the values of all columns are smaller than 5.
What's the way to do it?
df[!apply(df,1,function(x)all(x<5)),]
First of all ...please stop using cbind to create data.frames. You will be sorry if you continue. R will punish you.
df[ !rowSums(df <5) == length(df), ]
(The length() function returns the number of columns in a dataframe.)
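Because sample() makes the question's data random, a fixed example helps confirm which rows survive. This is a sketch on invented data; ncol(df) is used here in place of length(df) only because it reads more explicitly, and both return 2 for this data frame.

```r
# Made-up data built directly with data.frame(), not cbind()
df <- data.frame(a = c(1, 6, 3, 9), b = c(2, 1, 4, 7))

# TRUE for rows in which every column is below 5
all_small <- rowSums(df < 5) == ncol(df)

# Drop those rows; rows with at least one value >= 5 remain
kept <- df[!all_small, ]
# kept contains the original rows 2 and 4
```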
