I've created an R script that calculates the percentage of missing values in each column of a data frame and then removes the columns that exceed a preset threshold. The column names need to be maintained.
The names are maintained when there is more than one column in the data frame after column deletion, but not when there is only one column.
Code where the column names are kept:
df <- data.frame(A=rnorm(10, 10, 1), B=rep(NA, 10), C=rnorm(10, 10, 1))
threshold <- 80
pmiss <- function(x) {
ifelse(sum(is.na(x))/length(x)*100 > threshold, TRUE, FALSE)
}
temp <- sapply(df, pmiss)
deletecols <- names(temp[temp==TRUE])
df <- as.data.frame(df[,!(names(df) %in% deletecols)])
names(df) #prints
[1] "A" "C"
However, define df as
df <- data.frame(A=rnorm(10, 10, 1), B=rep(NA, 10))
and
names(df) #prints
[1] "df[, !(names(df) %in% deletecols)]"
Does anybody know why the column names are not kept when there is only one column?
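You've been bitten by an R FAQ. When the subset leaves a single column, df[, j] drops the data frame to a plain vector, and as.data.frame() then names the new column after the expression it was given. Add , drop = FALSE to your data frame subsetting (as a side effect, you no longer need as.data.frame).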
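As a minimal sketch of that fix applied to the two-column example from the question: the helper name too_sparse and the result name df_kept are made up for illustration, and the seed is only there to make the run repeatable.

```r
set.seed(42)
df <- data.frame(A = rnorm(10, 10, 1), B = rep(NA, 10))
threshold <- 80

# TRUE for every column whose percentage of missing values exceeds the threshold
too_sparse <- sapply(df, function(x) mean(is.na(x)) * 100 > threshold)

# drop = FALSE keeps the result a data frame even when only one column remains,
# so the surviving column keeps its own name
df_kept <- df[, !too_sparse, drop = FALSE]
names(df_kept)
```

Because the result is already a data frame, no as.data.frame() wrapper is involved and nothing gets renamed.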
Related
I have a data frame with 20 rows, I randomly select n rows and modify them. How can I put the modified value back to the original data frame with only the modified value being different?
df<- data.frame(rnorm(n = 20, mean = 0, sd = 1))
n = 8
a<- data.frame(df[ c(1, sample(2:(nrow(df)-1), n), nrow(df) ), ])
a$changedvalue <- a[,1]*(2.5)
Now I want to replace the values of the original dataframe df with the values of a$changedvalue such that only the sampled values are changed while everything else is same in df. I tried doing something like this but it's not working.
df %>% a[order(as.numeric(rownames(a))),]
I just want to point out that in my original dataset the data are timeseries data, so maybe they can be used for the purpose.
Instead of writing data.frame(df[ c(1, sample(2:(nrow(df)-1), n), nrow(df) ), ])
you can first define the rows you want to use; let's call them rows:
rows <- c(1, sample(2:(nrow(df)-1), n), nrow(df) )
Now you can do
a<- data.frame(df[rows, ])
a$changedvalue <- a[,1]*(2.5)
df[rows, ] <- a$changedvalue
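The same idea can be checked end to end with a small sketch. It assumes the single column is called x (the original data frame has an auto-generated column name) and keeps a copy of the original values only to verify which rows changed.

```r
set.seed(1)
df <- data.frame(x = rnorm(20))   # 'x' is a hypothetical column name
original <- df$x                  # copy kept only for the check below

n <- 8
# First row, n randomly chosen interior rows, and the last row
rows <- c(1, sample(2:(nrow(df) - 1), n), nrow(df))

# Modify the sampled rows in place; all other rows are left untouched
df$x[rows] <- df$x[rows] * 2.5

# Row positions whose value differs from the original
changed <- which(df$x != original)
```

Since the row order is never altered, this also suits time series data: the modified values stay at their original positions.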
Let's say there is a matrix - 'mat' which has 115 columns.
There is another matrix - 'res_mat' which has a column having 38 column names of the previous matrix 'mat'.
I want to create a third matrix - 'fin_mat' which will be a subset of the first matrix 'mat' having the columns which are stored as values in the column of the second matrix 'res_mat'.
Or in other words, I have a list of column names which is stored in a variable. How can I create a subset of the first matrix containing the columns which are stored in a variable?
Doesn't seem very difficult. If I understand your question correctly, something like this will do it.
# First make up some matrix
mat <- matrix(1:24, ncol = 6)
colnames(mat) <- paste0("Col", 1:6)
# These would be the columns to keep
res_mat <- matrix(c("Col1", "Col3", "Col4"), ncol = 1)
fin_mat <- mat[, res_mat[, 1]]
fin_mat
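One caveat with this approach, sketched below under the assumption that res_mat might hold only a single name: without drop = FALSE the result would collapse to a vector. The intermediate variable wanted and the name check are illustrative additions, not part of the answer above.

```r
mat <- matrix(1:24, ncol = 6)
colnames(mat) <- paste0("Col", 1:6)

# Only one column requested this time
res_mat <- matrix("Col3", ncol = 1)
wanted <- res_mat[, 1]

# Fail early if a requested name does not exist in mat
stopifnot(all(wanted %in% colnames(mat)))

# drop = FALSE keeps fin_mat a matrix even with a single column
fin_mat <- mat[, wanted, drop = FALSE]
dim(fin_mat)
```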
One way would be to use the dplyr package with the functions select() and one_of(). one_of() selects columns by name, given as character strings. (In current dplyr, one_of() is superseded by any_of() and all_of().)
Here is a simple example with the iris table, in which I extract the columns named "Sepal.Length" and "Sepal.Width".
library(dplyr)
mat1 <- iris
mat2 <- data.frame(names = c("Sepal.Length", "Sepal.Width")) %>%
mutate(names = as.character(names)) #make sure the names are characters
results <- mat1 %>% select(one_of(mat2$names))
This can be done pretty easily. In the code below, I am creating a data frame mat and another one, res_mat. mat holds the data and res_mat has a single column named select_these_cols. The mat data frame has 10 columns named a, b, c, ..., j. The select_these_cols column of res_mat has five rows with the entries a, b, c, d, e. All that needs to be done is to pass res_mat$select_these_cols to mat.
a <- (matrix(rnorm(1000), nrow = 100, ncol = 10))
mat <- as.data.frame(a)
names(mat) <- letters[1:10]
res_mat <- data.frame(x = letters[1:5])
names(res_mat) <- 'select_these_cols'
fin_mat <- mat[as.character(res_mat$select_these_cols)] # subsetting operation; as.character() guards against a factor column (the default before R 4.0)
I have searched extensively but not found an answer to this question on Stack Overflow.
Let's say I have a data frame a.
I define:
a <- NULL
a <- as.data.frame(a)
If I wanted to add a column to this data frame as so:
a$col1 <- c(1,2,3)
I get the following error:
Error in `$<-.data.frame`(`*tmp*`, "a", value = c(1, 2, 3)) :
replacement has 3 rows, data has 0
Why is the row dimension fixed but the column is not?
How do I change the number of rows in a data frame?
If I do this (inputting the data into a list first and then converting to a df), it works fine:
a <- NULL
a$col1 <- c(1,2,3)
a <- as.data.frame(a)
The row dimension is not fixed, but data.frames are stored as a list of vectors that are constrained to have the same length. You cannot add col1 to a because col1 has three values (rows) and a has zero, which would break that constraint. By default, R does not pad the existing data.frame with new rows when you add a column that is longer than the data.frame. The second example works because col1 is the only vector in the list, so the data.frame is initialized with three rows.
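A short sketch of that constraint follows. The names attempt and b are illustrative only; the failing assignment is wrapped in tryCatch() so the script keeps running.

```r
a <- as.data.frame(NULL)   # 0 rows, 0 columns

# Assigning a 3-element column to a 0-row data frame is rejected
attempt <- tryCatch({
  a$col1 <- c(1, 2, 3)
  "assigned"
}, error = function(e) "rejected")

# Starting from a data frame that already has 3 rows works,
# because every column then has the same length
b <- data.frame(col1 = c(1, 2, 3))
b$col2 <- c("x", "y", "z")
dim(b)
```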
If you want to automatically have the data.frame expand, you can use the following function:
cbind.all <- function (...)
{
nm <- list(...)
nm <- lapply(nm, as.matrix)
n <- max(sapply(nm, nrow))
do.call(cbind, lapply(nm, function(x) rbind(x, matrix(, n -
nrow(x), ncol(x)))))
}
This will fill missing values with NA. And you would use it like: cbind.all( df, a )
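A usage sketch for the function above, with two made-up inputs (short and long) of different lengths. Note that every input goes through as.matrix(), so the result is a matrix rather than a data.frame; wrap it in as.data.frame() if a data.frame is needed.

```r
# Same function as above, repeated so the example is self-contained
cbind.all <- function(...) {
  nm <- lapply(list(...), as.matrix)
  n <- max(sapply(nm, nrow))
  # Pad each input with NA rows up to the longest input, then bind columns
  do.call(cbind, lapply(nm, function(x) rbind(x, matrix(NA, n - nrow(x), ncol(x)))))
}

short <- data.frame(id = 1:2)            # 2 rows
long  <- data.frame(col1 = c(1, 2, 3))   # 3 rows

combined <- cbind.all(short, long)
combined
```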
You could also do something like this where I read in data from multiple files, grab the column I want, and store it in the dataframe. I check whether the dataframe has anything in it, and if it doesn't, create a new one rather than getting the error about mismatched number of rows:
readCounts = data.frame()
for(f in names(files)){
d = read.table(files[f], header=T, as.is=T)
d2 = round(data.frame(d$NumReads))
colnames(d2) = f
if(ncol(readCounts) == 0){
readCounts = d2
rownames(readCounts) = d$Name
} else{
readCounts = cbind(readCounts, d2)
}
}
If you have an empty data frame, called df for example, another quite simple solution, in my opinion, is the following:
df[1,]=NA # add a temporary new row of NA values
df[,'new_column'] = NA # adding new column, called for example 'new_column'
df = df[0,] # delete row with NAs
I hope this may help.
Why is my last step converting the data frame to a vector? I want to keep the first 6000 observations in the data frame key.
set.seed(1)
key <- data.frame(matrix(NA, nrow = 10000, ncol = 1))
names(key) <- "ID"
key$ID <- replicate(10000,
rawToChar(as.raw(sample(c(48:57,65:90,97:122), 8, replace=T))))
key <- unique(key) # still a data frame
key <- key[1:6000,] # no longer a data frame
key1 <- key[1:6000,,drop=F] #should prevent the data.frame from converting to a vector.
According to the documentation of ?Extract.data.frame
drop: logical. If ‘TRUE’ the result is coerced to the lowest
possible dimension. The default is to drop if only one
column is left, but not to drop if only one row is left.
Or, you could use subset, but usually, this is a bit slower. Here the row.names are numbers from 1 to 10000, so this matches the first 6000 rows only if unique() removed none of the rows before them.
key2 <- subset(key, as.numeric(rownames(key)) <= 6000)
is.data.frame(key2)
#[1] TRUE
because,
## S3 method for class 'data.frame'
subset(x, subset, select, drop = FALSE, ...) #by default it uses drop=F
It's being coerced to a vector basically because it can be, and that's the default coercion when only one column is left. R is trying to be "helpful".
This will keep it as a data frame:
set.seed(1)
key <- data.frame(matrix(NA, nrow = 10000, ncol = 1))
names(key) <- "ID"
key$ID <- replicate(10000,
rawToChar(as.raw(sample(c(48:57,65:90,97:122), 8, replace=T))))
key <- unique(key)
key <- as.data.frame(key[1:6000,]) # still a data frame, but the column is no longer named "ID"
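To compare the approaches from both answers side by side, here is a small sketch on a four-row stand-in for key (the values and the names as_vector, kept and rewrapped are invented for illustration).

```r
key <- data.frame(ID = c("a1", "b2", "c3", "d4"), stringsAsFactors = FALSE)

# Default: a single remaining column is dropped to a plain vector
as_vector <- key[1:2, ]

# drop = FALSE keeps both the data frame and the column name
kept <- key[1:2, , drop = FALSE]

# Re-wrapping the vector gives a data frame again, but the column
# is named after the subsetting expression instead of ID
rewrapped <- as.data.frame(key[1:2, ])

is.data.frame(as_vector)
names(kept)
names(rewrapped)
```

If the column name matters downstream, drop = FALSE is the safer choice of the two.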
I have a data.frame with names "d", "n", "beta", "family", "alpha", and "value". I would like to create a LaTeX table with Hmisc::latex, where the first three columns contain the variables "d", "n", and "beta" which give the corresponding row names. The other variables ("family", "alpha") should be displayed in the remaining columns (each of "F1" and "F2" -- the elements of family -- defines a group; for each of these two groups, the different values of alpha define columns; overall, there are thus 2 * 3 = 6 columns containing the corresponding "value"). Here is what I have so far:
## running parameters
nn <- length(n <- c(100, 500)) # sample sizes
nd <- length(d <- c(10, 100, 1000)) # dimensions
nfamily <- length(family <- c("F1", "F2")) # families
nbeta <- length(beta <- c(0.25, 0.75)) # betas
nalpha <- length(alpha <- c(0.95, 0.99, 0.999)) # alphas
## create array containing the results
res <- array(NA, dim=c(nn, nd, nfamily, nbeta, nalpha),
dimnames=list(n=n, d=d, family=family, beta=beta, alpha=alpha))
set.seed(1)
for(i in 1:nn){
for(j in 1:nd){
for(k in 1:nfamily){
for(l in 1:nbeta){
for(m in 1:nalpha){
res[i,j,k,l,m] <- i+j+k+l+m+runif(1) # some dummy values
}
}
}
}
}
## create a data.frame from the array of values
df <- as.data.frame.table(res, responseName="value")
## sort it according to the variables you want to display in the rows and bring the
## corresponding columns to the front/beginning
row.vars <- c("d", "n", "beta") # specify row variables
df. <- df[with(df, do.call(order, sapply(row.vars, as.name))), # sort rows
c(row.vars, setdiff(names(df), row.vars))] # sort columns
## format numbers, set unwanted row names to NA
df.. <- df.
df..$value <- formatC(df.$value, digits=3, format="f")
names2NA <- function(x) {x[c(FALSE, x[-1]==x[-length(x)])] <- NA; x} # arg = TRUE <=> entry equal to previous one
for(j in 1:length(row.vars)) df..[, row.vars[j]] <- names2NA(df..[, row.vars[j]])
## now use Hmisc's latex()
require(Hmisc)
latex(df.., title="title",
file="",
label="tab:res",
cgroup=c("family", "alpha"),
na.blank=TRUE, # use blanks rather than NA => not working (see first columns)!
rowname=NULL,
colheads=c("Family", "alpha"), # character() specifying column headings
dcolumn=TRUE,
booktabs=TRUE,
caption="My table containing all results.",
caption.loc="bottom",
collabel.just=rep("c", 2),
where="htbp",
center="centering",
type="verbatim",
helvetica=FALSE
)
Here are my questions:
1) Why are the NAs in the first three columns not replaced by blanks (as should be the case for na.blank=TRUE)?
2) Why is an empty fourth column inserted?
3) How can I obtain the variables "family" and "alpha" as groups in the columns as described above?
Update
In the meantime, I managed to convert the data.frame to a matrix. I have similar problems with that, which I posted here (since it is more specific): Hmisc: How to group column variables with latex()?
I only have an answer to question 1.
Apparently na.blank=TRUE only applies to numeric columns, not to character or factor columns. This doesn't seem to be documented anywhere, but I found it out with this very simple example.
x <- data.frame(c(1, NA, NA), c("cow", NA, NA), factor(c("chicken", NA, NA)))
names(x) <- c("numeric", "character", "factor")
library(Hmisc)
latex(x, file = '', na.blank = TRUE)
If you run the code you see that the NAs in the numeric column become blank, while the NAs in the other columns become "NA". I don't know the reason for this behavior. It is however easy to remedy by replacing the NA's in character and factor columns by "" before running the latex command.
In your code the first few columns are factor so the above applies.
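The remedy mentioned above could look roughly like this in base R. The helper blank_na is a made-up name, and as a simplification it converts factor columns to character, which is usually fine for a table that is only going to be printed.

```r
x <- data.frame(numeric = c(1, NA, NA),
                character = c("cow", NA, NA),
                factor = factor(c("chicken", NA, NA)),
                stringsAsFactors = FALSE)

# Hypothetical helper: replace NA by "" in character and factor columns.
# Factors are converted to character first; numeric columns are returned as is.
blank_na <- function(col) {
  if (is.factor(col)) col <- as.character(col)
  if (is.character(col)) col[is.na(col)] <- ""
  col
}

# x[] <- ... keeps x a data frame with the same column names
x[] <- lapply(x, blank_na)
x
```

The cleaned data frame can then be passed to latex(x, file = '', na.blank = TRUE) as before; the numeric NAs are left in place for na.blank to handle.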