How to apply this code to all cells of dataframe in R

This line of code applies to the f_name column of my data frame and removes the rows whose f_name cell is longer than 100 characters, but I want to apply it to all columns.
How do I do it?
subset(m, nchar(as.character(f_name)) <= 100)

If your data.frame is named dat, try the following.
It first creates a logical index inx, with one value per column, that is TRUE if all elements of that column have fewer than 100 characters. It then subsets the original data.frame, keeping only those columns.
inx <- sapply(dat, function(x) all(nchar(x) < 100))
new_dat <- dat[which(inx)]
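If the goal is instead to drop rows (as the subset() call in the question does), the same test can be run on every column and combined row by row. This is only a minimal sketch: the data frame and its l_name column are made up for illustration, and it assumes no NA values in the data.

```r
# Hypothetical sample data: two text columns, some cells longer than 100 characters
dat <- data.frame(
  f_name = c("Ann", strrep("x", 150), "Bob"),
  l_name = c("Lee", "Kim", strrep("y", 101)),
  stringsAsFactors = FALSE
)

# One logical vector per column: TRUE where the cell has at most 100 characters
ok_by_col <- lapply(dat, function(x) nchar(as.character(x)) <= 100)

# Combine the columns with & so a row is kept only if every cell passes
keep <- Reduce(`&`, ok_by_col)

new_dat <- dat[keep, , drop = FALSE]
```

Here keep has one element per row, so new_dat retains only the rows in which no cell exceeds the limit.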

Related

remove outliers from multiple columns in R

I used below codes to identify outliers on different columns:
outliers_x1 <- boxplot(mydata$x1, plot=FALSE)$out
outliers_x4 <- boxplot(mydata$x4, plot=FALSE)$out
outliers_x6 <- boxplot(mydata$x6, plot=FALSE)$out
Now, how can I remove those outliers from the dataset with a single piece of code?
This will set any outlier values to NA, and then optionally remove all rows where any column contains an outlier. Works with arbitrary number of columns.
Uses data.table for convenience.
library(data.table)
library(matrixStats)
##
# create sample data
#
set.seed(1)
dt <- data.table(x1=rnorm(100), x2=rnorm(100), x3=rnorm(100))
##
# incorporate possible outliers
#
dt[sample(100, 5), x1:=10*x1]
dt[sample(100, 5), x2:=10*x2]
dt[sample(100, 5), x3:=10*x3]
##
# you start here...
# remove all rows where any column contains an outlier
#
indx <- sapply(dt, \(x) !(x %in% boxplot(x, plot=FALSE)$out))
dt[as.logical(rowProds(indx))]
In the above, indx is a matrix with three logical columns. Each element is TRUE unless the corresponding value in that column is an outlier. We use rowProds(...) from the matrixStats package to multiply ( & ) the 3 columns together, row by row. Unfortunately this converts everything to numeric (1, 0), so we have to convert back to logical to use as an index into dt.
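If you would rather avoid the matrixStats dependency, the same row filter can probably be written in base R by counting the FALSE entries per row. This is a sketch on a plain data.frame with made-up values (50 and 100 are planted as obvious outliers), not the data from the question.

```r
set.seed(1)
# Hypothetical data: one planted outlier in each column
df <- data.frame(x1 = c(rnorm(20), 50), x2 = c(100, rnorm(20)))

# Logical matrix: TRUE where the value is NOT a boxplot outlier in its column
indx <- sapply(df, function(x) !(x %in% boxplot(x, plot = FALSE)$out))

# Keep a row only if it has no FALSE entry; apply(indx, 1, all) is equivalent
keep <- rowSums(!indx) == 0
clean <- df[keep, ]
```

Because keep is already logical, no conversion back from 0/1 is needed.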
##
# replaces outliers with NA in each column
#
dt.melt <- melt(dt[, id:=seq(.N)], id='id')
dt.melt[, ol:=(value %in% boxplot(value, plot=FALSE)$out), by=.(variable)]
dt.melt[(ol), value:=NA]
result <- dcast(dt.melt, id~variable)[, id:=NULL]
##
# remove all rows where any column contains an outlier
#
na.omit(result)
In the code above we add an id column, then melt(...) so all other columns are in one column (value) with a second column (variable) indicating the original source column. Then we apply the boxplot(...) algorithm group-wise (by variable) to produce an ol column indicating an outlier. Then we set any value corresponding to ol == TRUE to NA. Then we re-convert to your original wide format with dcast(...) and remove the id.
It's a bit roundabout but this melt - process - dcast pattern is common when processing multiple columns like this.
Finally, na.omit(result) will remove any rows which have NA in any of the columns. If that's what you want it's simpler to use the first approach.
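For a plain data.frame, replacing outliers with NA column by column can also be sketched without melt/dcast, using lapply and replace. The values below are invented so that each column has exactly one clear outlier.

```r
# Hypothetical data: 100 and -50 are the only boxplot outliers
df <- data.frame(x1 = c(1, 2, 3, 4, 5, 100),
                 x2 = c(-50, 10, 11, 12, 13, 14))

# df[] <- keeps the data.frame structure while replacing every column
df[] <- lapply(df, function(x) {
  replace(x, x %in% boxplot(x, plot = FALSE)$out, NA)
})

# Optionally drop every row that now contains an NA
complete <- na.omit(df)
```

This keeps the wide layout throughout, at the cost of losing the per-group flexibility that the melt approach gives you.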

How to compare a data frame with duplicates and a vector?

I have a data frame in which some ids appear more than once. I sampled these ids uniquely and now I have a vector with the sampled ids. Now I need to create a logical vector that tells me which rows in the data frame have ids that also appear in my sample.
I have tried the match function, but it selects only the first appearance and I need all appearances.
I have also tried merge, but the dataset is too large, so there is not enough memory to do it.
You can use %in% to get a logical vector, and which together with %in% to get the row indices. Here is a reproducible example that contains duplicate IDs.
set.seed(1234)
df <- data.frame(id=sample(1:80, 100, replace=TRUE), b=rnorm(100))
mySample <- seq(1, 80, by=6)
#logical vector length of nrow(df)
myRows <- df$id %in% mySample
# row indices
myIndices <- which(df$id %in% mySample)
This is what you can do using match (as you were trying this function):
x=match(df$id, mySample, nomatch = 0) > 0
Which gives you a logical vector which is TRUE if df$id appears in mySample and FALSE otherwise.
To retrieve the respective indices:
which(x)
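The difference comes down to the argument order of match. A tiny sketch with made-up ids shows why match(sample, ids) only finds first appearances, while testing each id against the sample finds all of them:

```r
ids     <- c(3, 7, 3, 9, 7)   # hypothetical id column with duplicates
sampled <- c(3, 7)            # hypothetical sample of unique ids

# One position per sampled id: only the FIRST appearance of each
first_only <- match(sampled, ids)

# One answer per row of ids: every appearance is flagged
in_sample <- ids %in% sampled
row_idx   <- which(in_sample)
```

Here first_only has length 2 (one entry per sampled id), whereas in_sample has one entry per row, which is what is needed for subsetting the data frame.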

Extract Columns that Do Not Exist in Another Matrix Based on Column Names

I have two matrices df_matrix and df_subset. One is a subset of the other one. Therefore, df_matrix has 10000 rows and columns and df_subset contains only 8222 columns and rows of df_matrix.
I want to select only those columns from df_matrix that are NOT in df_subset. I thought it is best to do it by column names, so I tried executing this code:
newdf <- df_matrix[, which( (colnames(df_matrix)) != (colnames(KroneckerProducts)) )]
However, this is not working at all. Is there any other way to do this?
The general rule is not to use == or != with objects of different lengths: they compare element by element and recycle the shorter vector.
Use %in% with ! instead:
newdf <- df_matrix[, !(colnames(df_matrix) %in% colnames(KroneckerProducts))]
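A small sketch of the same idea on a toy matrix (the names m and sub are made up here), together with setdiff, which should give the same columns when the column names are unique:

```r
# Hypothetical 3 x 4 matrix with named columns, and a subset of its columns
m   <- matrix(1:12, nrow = 3, dimnames = list(NULL, c("a", "b", "c", "d")))
sub <- m[, c("b", "d")]

# Columns of m whose names do not occur in sub
only_in_m <- m[, !(colnames(m) %in% colnames(sub)), drop = FALSE]

# Same selection by name; assumes column names are unique
same <- m[, setdiff(colnames(m), colnames(sub)), drop = FALSE]
```

drop = FALSE keeps the result a matrix even if only one column is left.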

R select subset of data

I have a dataset with three columns.
## generate sample data
set.seed(1)
x<-sample(1:3,50,replace = T )
y<-sample(1:3,50,replace = T )
z<-sample(1:3,50,replace = T )
data<-as.data.frame(cbind(x,y,z))
What I am trying to do is:
Select those rows where all the three columns have 1
Select those rows where only two columns have 1 (could be any column)
Select only those rows where only one column has 1 (could be any column)
Basically I want any two columns (for 2nd case) to fulfill the conditions and not any specific column.
I am aware of rows selection using
subset<-data[c(data$x==1,data$y==1,data$z==1),]
But this only selects rows based on conditions for specific columns, whereas I want any of the three/two columns to fulfill my criteria.
Thanks
n = 1 # or 2 or 3
data[rowSums(data == 1) == n,]
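To see what rowSums(data == 1) is doing, here is a small sketch on four made-up rows. data == 1 gives a logical matrix, and summing each row counts how many columns equal 1; split then groups the rows by that count.

```r
# Hypothetical data, small enough to check by eye
data <- data.frame(x = c(1, 1, 2, 1),
                   y = c(1, 2, 3, 1),
                   z = c(1, 3, 1, 2))

# Number of columns equal to 1 in each row
ones <- rowSums(data == 1)

# Rows with exactly two 1s
two_ones <- data[ones == 2, ]

# All groups at once, as a list named by the count
by_count <- split(data, ones)
```

In this toy data the first row has three 1s, the second and third have one each, and the fourth has two.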
Here is another method:
rowCounts <- table(c(which(data$x==1), which(data$y==1), which(data$z==1)))
# this is the long way
df.oneOne <- data[as.integer(names(rowCounts)[rowCounts == 1]),]
df.oneTwo <- data[as.integer(names(rowCounts)[rowCounts == 2]),]
df.oneThree <- data[as.integer(names(rowCounts)[rowCounts == 3]),]
It is better to save multiple data.frames in a list, especially when there is some structure that guides this storage, as is the case here. Following @richard-scriven's suggestion, you can do this easily with lapply:
df.oneCountList <- lapply(1:3, function(i)
  data[as.integer(names(rowCounts)[rowCounts == i]),])
names(df.oneCountList) <- c("df.oneOne", "df.oneTwo", "df.oneThree")
You can then pull out the data.frames using either their index, df.oneCountList[[1]] or their name df.oneCountList[["df.oneOne"]].
@eddi below suggests a nice shortcut to my method of pulling out the table names, using tabulate and the arr.ind argument of which. When which is applied to a multidimensional object such as an array or a data.frame, setting arr.ind = TRUE produces the row and column indices where the logical expression evaluates to TRUE. His suggestion exploits this to pull out the vector of row indices where a 1 is found in any variable. The tabulate function is then applied to these row indices and returns a vector in which each element is the number of 1s in the corresponding row, with rows that contain no 1 filled in with a 0.
Under this method,
rowCounts <- tabulate(which(data == 1, arr.ind = TRUE)[,1], nbins = nrow(data))  # nbins keeps one count per row
returns a vector from which you might immediately pull the values. You can include the above lapply to get a list of data.frames:
df.oneCountList <- lapply(1:3, function(i) data[rowCounts == i,])
names(df.oneCountList) <- c("df.oneOne", "df.oneTwo", "df.oneThree")
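A small sketch of the which/tabulate combination on made-up data may help. One thing to watch: by default tabulate only counts up to the largest index it sees, so if the last rows contain no 1 the result is shorter than the data unless nbins is supplied.

```r
# Hypothetical data: rows 2 and 4 contain no 1
data <- data.frame(x = c(1, 2, 1, 3),
                   y = c(1, 3, 2, 2),
                   z = c(1, 2, 1, 3))

# Matrix of (row, col) positions where the value is 1
hits <- which(data == 1, arr.ind = TRUE)

# Without nbins the trailing row with no 1 is missing from the counts
short_counts <- tabulate(hits[, "row"])

# With nbins there is exactly one count per row of data
rowCounts <- tabulate(hits[, "row"], nbins = nrow(data))

exactly_two <- data[rowCounts == 2, ]
```

With one count per row, rowCounts == i can be used directly as a logical row index.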

sum different columns in a data.frame

I have a very big data.frame and want to sum the values in every column.
So I used the following code:
sum(production[,4],na.rm=TRUE)
or
sum(production$X1961,na.rm=TRUE)
The problem is that the data.frame is very big. And I only want to sum 40 certain columns with different names of my data.frame. And I don't want to list every single column. Is there a smarter solution?
At the end I also want to store the sum of every column in a new data.frame.
Thanks in advance!
Try this:
colSums(df[sapply(df, is.numeric)], na.rm = TRUE)
where sapply(df, is.numeric) is used to detect all the columns that are numeric.
If you just want to sum a few columns, then do:
colSums(df[c("X1961", "X1962", "X1999")], na.rm = TRUE)
res <- unlist(lapply(production, function(x) if(is.numeric(x)) sum(x, na.rm=T)))
will return the sum of each numeric column.
You could create a new data frame based on the result with
data.frame(t(res))
If you don't want to include every single column, you have to indicate which ones to include (or, alternatively, which to exclude):
colsInclude <- c("X1961", "X1962", "X1963") # by name
# or #
colsInclude <- paste0("X", 1961:2003) # by name
# or #
colsInclude <- c(10:19, 23, 55, 147) # by column number
To put those columns in a new data frame, simply use [ ] as you've done:
newDF <- oldDF[, colsInclude]
To sum up each column, simply use colSums
sums <- colSums(newDF, na.rm=T)
# or #
sums <- colSums(oldDF[, colsInclude], na.rm=T)
Note that sums will be a named numeric vector, not a data frame.
You can make it into a data frame using as.data.frame
sums <- as.data.frame(sums)
# or, to include the data frame from which it came #
sums <- rbind(newDF, "totals"=sums)
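Putting the pieces together, a minimal end-to-end sketch might look like this. The production data frame below is invented (only the X1961-style column names come from the question), and the column and total names in the result are just one possible layout.

```r
# Hypothetical stand-in for the real production data
production <- data.frame(
  country = c("A", "B", "C"),
  X1961   = c(1, NA, 3),
  X1962   = c(4, 5, 6),
  X1963   = c(NA, 8, 9)
)

# Build the column names from a pattern instead of typing each one
colsInclude <- paste0("X", 1961:1962)

# Named numeric vector with one total per selected column
sums <- colSums(production[, colsInclude], na.rm = TRUE)

# Store the totals in a new data frame, one row per column
sums_df <- data.frame(column = names(sums), total = unname(sums))
```

With 40 columns the only change would be the range passed to paste0, assuming the names follow the same pattern.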
