Select columns that don't contain any NA value in R

How to select columns that don't contain any NA values in R? As long as a column contains at least one NA, I want to exclude it. What's the best way to do it? I am trying to use sum(is.na(x)) to achieve this, but haven't been successful.
Also, another R question. Is it possible to use commands to exclude columns that contain all same values? For example,
     column1 column2
row1 a       b
row2 a       c
row3 a       c
My purpose is to exclude column1 from my matrix so the final result is:
     column2
row1 b
row2 c
row3 c

The related question "Remove columns from dataframe where ALL values are NA" deals with the case where ALL values are NA.
For a matrix, you can use colSums(is.na(x)) to find out which columns contain NA values.
given a matrix x
x[, !colSums(is.na(x)), drop = FALSE]
will subset appropriately.
For a data.frame, it will be more efficient to use lapply or sapply with the function anyNA:
xdf[, sapply(xdf, Negate(anyNA)), drop = FALSE]
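A small worked example of both approaches (the toy matrix x and data.frame xdf here are made up for illustration):

```r
# Toy matrix: column b contains an NA, columns a and c do not
x <- matrix(c(1, 2, NA, 4, 5, 6), nrow = 2,
            dimnames = list(NULL, c("a", "b", "c")))
# colSums(is.na(x)) is c(a = 0, b = 1, c = 0); !0 is TRUE,
# so only the NA-free columns are kept
x[, !colSums(is.na(x)), drop = FALSE]

# Toy data.frame with the same layout
xdf <- data.frame(a = 1:2, b = c(NA, 4), c = 5:6)
# anyNA() short-circuits on the first NA it finds,
# so it avoids scanning entire columns
xdf[, sapply(xdf, Negate(anyNA)), drop = FALSE]
```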

Also, could do
new.df <- df[, colSums(is.na(df)) == 0 ]
this way lets you subset based on the number of NA values in the columns.

Also, if 'mat1' is the matrix:
indx <- unique(which(is.na(mat1), arr.ind = TRUE)[, 2])
subset(mat1, select = -indx)
(Caveat: if no column contains an NA, indx is empty and select = -indx drops every column, so guard this with length(indx) > 0.)
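The second part of the question (dropping columns where every value is the same) isn't covered by the answers above; a minimal sketch, assuming the data is a character matrix as shown:

```r
# The example matrix from the question
m <- matrix(c("a", "a", "a", "b", "c", "c"), nrow = 3,
            dimnames = list(c("row1", "row2", "row3"),
                            c("column1", "column2")))
# Keep only the columns with more than one distinct value
m[, apply(m, 2, function(col) length(unique(col)) > 1), drop = FALSE]
```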

Related

R: Deleting rows from a data frame based on values of other vector

So I have a data frame with baskets of products from purchases of individuals. A row stands for a basket of products of one individual. I want to remove all the rows (baskets) that contain a product (expressed as an integer) that is listed in a vector named products.to.delete. Here is a small image of how the data set looks.
Next to that I have a vector containing a large number of numbers that must be deleted. I would like to delete all the rows that contain a value from this vector.
Here is some code to make it reproducible:
dataframe <- as.data.frame( matrix(data = sample(10000,1000,replace = TRUE),20,50))
products.to.delete <- sample(10000,200,replace = FALSE)
Thank you in advance for helping me out!
If your data is data, and your vector of target values is vals, you could do this:
data[apply(data,1,\(r) !any(r %in% vals)),]
That is, within each row of data (i.e. apply(data, 1, ...)), you can check whether any of the values are in vals. Reverse the result using !, to create a global logical vector for selecting the remaining rows.
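Putting this together with the asker's reproducible data (the set.seed call is added here only to make the sketch repeatable):

```r
set.seed(1)
dataframe <- as.data.frame(matrix(data = sample(10000, 1000, replace = TRUE), 20, 50))
products.to.delete <- sample(10000, 200, replace = FALSE)

# TRUE for rows whose basket contains none of the flagged products
keep <- apply(dataframe, 1, \(r) !any(r %in% products.to.delete))
cleaned <- dataframe[keep, ]

# No remaining row contains a product marked for deletion
any(unlist(cleaned) %in% products.to.delete)  # FALSE
```

(The \(r) lambda shorthand requires R 4.1+; use function(r) on older versions.)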
For your next questions, please create reproducible examples such as the one below.
What you're after is called filtering, and it can be done in base R as follows.
First, create an object, for example myfilter: a logical vector with the same length as the number of rows in your data.frame.
mydat <- data.frame("col1"=1:5, "col2"=letters[1:5])
col1 col2
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
myfilter <- mydat$col2 %in% c("a", "c")
[1] TRUE FALSE TRUE FALSE FALSE
mydat[myfilter,]
col1 col2
1 1 a
3 3 c
Then simply place this object inside the brackets [ ]; R will keep the rows where the value is TRUE.

extracting identifiers from row observations

I want to extract specific elements, specifically ID, from rows that have NAs. Here is my df:
df
ID x
1-12 1
1-13 NA
1-14 3
2-12 20
3-11 NA
I want a dataframe that has the IDs of observations that are NA, like so:
df
ID x
1-13 NA
3-11 NA
I tried this, but it's giving me a dataframe with the row #s that have NAs (e.g., row 2, row 5), not the IDs.
df1 <- data.frame(which(is.na(df$x)))
Can someone please help?
This is a very basic subsetting question:
df[is.na(df$x),]
Good basic and free guides can be found on w3schools: https://www.w3schools.com/r/
Simply run the following line:
df[is.na(df$x), ]
Another option is complete.cases
subset(df, !complete.cases(x))
Here is another base R option using na.omit
> df[!1:nrow(df) %in% row.names(na.omit(df)), ]
ID x
2 1-13 NA
5 3-11 NA

R - How to compare values across more than two columns

I'm trying to write code to compare the values of several columns, and I don't know ahead of time how many columns I will have. The data will look like this:
X Val1 Val2 Val3 Val4
A 1 1 1 2
B NA 2 2 2
C 3 3 3 3
The code should return a Fail for rows A and B, and a Pass for row C, but it needs to be able to handle a changing number of columns. I can't figure out how to do this without nesting a couple of for loops, but there has to be some way to use apply or sapply to iterate through columns 2:length(df).
EDIT: I want to see if the values (which will be numbers) are equal
Assuming that the first column is excluded from the comparison and that all the other columns are not, you can try:
which(rowSums(df[,2]==df[,3:ncol(df)])==(ncol(df)-2))
You can use apply with a custom function length(unique(x)) to count the number of unique values across columns 2:ncol(df). You can then wrap the whole thing in an ifelse to return a TRUE/FALSE vector.
ifelse(apply(df[, 2:ncol(df)], MARGIN = 1, function(x) length(unique(x))) == 1, TRUE, FALSE)
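Running the sample data through this approach (the column names here are assumptions based on the table above):

```r
df <- data.frame(X    = c("A", "B", "C"),
                 Val1 = c(1, NA, 3),
                 Val2 = c(1, 2, 3),
                 Val3 = c(1, 2, 3),
                 Val4 = c(2, 2, 3))
# One distinct value per row => Pass; unique() treats NA as a
# distinct value, so row B fails as required
same <- apply(df[, 2:ncol(df)], 1, function(x) length(unique(x)) == 1)
ifelse(same, "Pass", "Fail")  # "Fail" "Fail" "Pass"
```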

Drop columns in a data.frame with conditions R

I am trying to be lazier than ever with R and I was wondering to know if there is a chance to drop columns from a data.frame by using a condition.
For instance, let's say my data.frame has 50 columns.
I want to drop all the columns whose means all equal zero:
mean(mydata$coli) = ... = mean(mydata$coln) = 0
How would you write this code in order to drop them all at once? Because I usually drop columns with
mydata2 <- subset(mydata, select = c(vari, ..., varn))
Obviously not interesting because of the need of manual data checking.
Thank you all!
Something similar to @akrun's answer, using lapply:
mydata <- data.frame(col1=0, col2=1:7, col3=0, col4=-3:3)
mydata[lapply(mydata, mean)!=0]
# col2
#1 1
#2 2
#3 3
#4 4
#5 5
#6 6
#7 7
We can use colMeans to get the mean of all the columns as a vector, convert that to a logical index (!=0) and subset the dataset.
mydata[colMeans(mydata)!=0]
Or use Filter with f as mean. If the mean of a column is 0, it is coerced to FALSE, and all others to TRUE, which filters out the zero-mean columns.
Filter(mean, mydata)
data
mydata <- data.frame(col1=0, col2=1:7, col3=0, col4=-3:3)
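A quick check that the three approaches pick the same columns (note col4 = -3:3 also has mean 0, so only col2 survives):

```r
mydata <- data.frame(col1 = 0, col2 = 1:7, col3 = 0, col4 = -3:3)
r1 <- mydata[lapply(mydata, mean) != 0]  # list vs. atomic comparison coerces element-wise
r2 <- mydata[colMeans(mydata) != 0]
r3 <- Filter(mean, mydata)               # mean 0 coerces to FALSE
names(r1); names(r2); names(r3)          # each: "col2"
```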

Combine different columns from different data frames based on another columns index R

So I have three data frames we will call them a,b,c
within each data frame there are columns called 1,2,3,4 with 54175 rows of data
Column 1 has id names that are the same in each data frame but not necessarily in the same order
Columns 2,3,4 are just numeric values
I want to pull out all the information from column 2 for a,b,c based on ID from column 1 so each values for a,b,c will correlate to the correct ID
I tried something like
m1 <- merge(A[,'2'], b[,'2'], c[,'2'], by='1')
I get this error
Error in fix.by(by.x, x) : 'by' must match numbers of columns
Thank you for your help!
Couple problems:
Merge works two-at-a-time, no more.
You need to have the by column in the data.frames that are merged.
Fix these like this:
m1 <- merge(A[,c("1", "2")], B[,c("1", "2")])
m2 <- merge(m1, C[, c("1", "2")])
Then m2 should be the result you're looking for.
As an aside, it's pretty weird to use column names that are just characters of numbers. If they're in order, just use column indices (no quotes), and otherwise put something in them to indicate that they're names not numbers, e.g., R's default of "V1", "V2", "V3". Of course, the best is a meaningful name, like "id", "MeasureDescription", ...
You can either use merge two times:
merge(merge(a[1:2], b[1:2], by = "1"), c[1:2])
or Reduce with merge:
Reduce(function(...) merge(..., by = "1"), list(a[1:2], b[1:2], c[1:2]))
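A toy demonstration of the Reduce version (the data frames and ids are invented; the column names are the characters "1" and "2", as in the question):

```r
a <- data.frame("1" = c("id1", "id2"), "2" = c(10, 20), check.names = FALSE)
b <- data.frame("1" = c("id2", "id1"), "2" = c(30, 40), check.names = FALSE)
c <- data.frame("1" = c("id1", "id2"), "2" = c(50, 60), check.names = FALSE)

# merge matches on the "1" column, so row order doesn't matter;
# duplicated "2" columns get suffixed as 2.x and 2.y
merged <- Reduce(function(...) merge(..., by = "1"), list(a[1:2], b[1:2], c[1:2]))
merged
```

(Reusing c as a variable name shadows base::c, so avoid it in real code.)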
You have to merge them 2 at a time:
a<-data.frame(sample(1:100,100),100*runif(100),100*runif(100),100*runif(100))
colnames(a)<-1:4
b<-data.frame("C1"=sample(1:100,100),"C2"=100*runif(100),"C3"=100*runif(100),"C4"=100*runif(100))
colnames(b)<-1:4
c<-data.frame("C1"=sample(1:100,100),"C2"=100*runif(100),"C3"=100*runif(100),"C4"=100*runif(100))
colnames(c)<-1:4
f<-merge(a[,1:2],b[,1:2],by=(1))
f<-merge(f,c[,1:2],by=(1))
colnames(f)<-c(1,"A2","B2","C2")
head(f)
1 A2 B2 C2
1 1 54.63326 39.23676 28.10989
2 2 10.10024 56.08021 69.44268
3 3 45.02948 14.69028 22.44243
4 4 90.50883 33.61303 98.00917
5 5 13.80767 80.93382 77.22679
6 6 80.72241 27.22139 51.34516
I think the easiest way to answer this question is that
m1 <- merge(A[,'2'], b[,'2'], c[,'2'], by='1')
should use by=(1):
m1 <- merge(A[,'2'], b[,'2'], c[,'2'], by=(1))
You only need single quotes around by when you merge by a column name, for example:
m1 <- merge(A[,'2'], b[,'2'], c[,'2'], by='ID')
(Note that merge still only accepts two data frames at a time, as the other answers point out.)
