subsetting data frame by row index - r

Why is my last step converting the data frame to a vector? I want to keep the first 6000 observations in the data frame key.
set.seed(1)
key <- data.frame(matrix(NA, nrow = 10000, ncol = 1))
names(key) <- "ID"
key$ID <- replicate(10000,
  rawToChar(as.raw(sample(c(48:57, 65:90, 97:122), 8, replace = TRUE))))
key <- unique(key) # still a data frame
key <- key[1:6000,] # no longer a data frame

key1 <- key[1:6000,,drop=F] #should prevent the data.frame from converting to a vector.
According to the documentation of ?Extract.data.frame
drop: logical. If ‘TRUE’ the result is coerced to the lowest
possible dimension. The default is to drop if only one
column is left, but not to drop if only one row is left.
Or you could use subset, although it is usually a bit slower. Here the row names are the numbers 1 to 10000, so they can be compared numerically (use <= to include row 6000):
key2 <- subset(key, as.numeric(rownames(key)) <= 6000)
is.data.frame(key2)
#[1] TRUE
because,
## S3 method for class 'data.frame'
subset(x, subset, select, drop = FALSE, ...) #by default it uses drop=F

It's being coerced to a vector because the result has only one column, and dropping to the lowest possible dimension is the default in that case. R is trying to be "helpful".
This will keep it as a dataframe:
set.seed(1)
key <- data.frame(matrix(NA, nrow = 10000, ncol = 1))
names(key) <- "ID"
key$ID <- replicate(10000,
  rawToChar(as.raw(sample(c(48:57, 65:90, 97:122), 8, replace = TRUE))))
key <- unique(key)
key <- as.data.frame(key[1:6000,]) # a data frame again, but the column name "ID" is lost
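To compare the variants discussed above side by side, here is a minimal sketch on a toy four-row data frame. The values and the names dropped, kept, and rewrapped are made up for illustration and are not part of the original question.

```r
# Toy one-column data frame; the values are invented for this sketch
key <- data.frame(ID = c("a1", "b2", "c3", "d4"), stringsAsFactors = FALSE)

dropped   <- key[1:2, ]                 # default drop: a plain character vector
kept      <- key[1:2, , drop = FALSE]   # stays a data frame, column is still "ID"
rewrapped <- as.data.frame(key[1:2, ])  # a data frame again, but "ID" is gone

is.data.frame(dropped)
is.data.frame(kept)
names(kept)
names(rewrapped)
```

If the column name matters later on, drop = FALSE is probably the safer of the two fixes, since it never leaves the data frame in the first place.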

Related

Subsetting a matrix on the basis of a list of some of the column names

Let's say there is a matrix - 'mat' which has 115 columns.
There is another matrix - 'res_mat' which has a column having 38 column names of the previous matrix 'mat'.
I want to create a third matrix - 'fin_mat' which will be a subset of the first matrix 'mat' having the columns which are stored as values in the column of the second matrix 'res_mat'.
Or in other words, I have a list of column names which is stored in a variable. How can I create a subset of the first matrix containing the columns which are stored in a variable?
Doesn't seem very difficult. If I understand your question correctly, something like this will do it.
# First make up some matrix
mat <- matrix(1:24, ncol = 6)
colnames(mat) <- paste0("Col", 1:6)
# These would be the columns to keep
res_mat <- matrix(c("Col1", "Col3", "Col4"), ncol = 1)
fin_mat <- mat[, res_mat[, 1]]
fin_mat
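If some of the stored names might be missing from mat, or only a single name might end up selected, a slightly more defensive variant may help. This is only a sketch under those assumptions; the name "Col9" and the variables wanted, present, and one_col are invented here.

```r
mat <- matrix(1:24, ncol = 6)
colnames(mat) <- paste0("Col", 1:6)

# Hypothetical list of requested names; "Col9" does not exist in mat
wanted  <- c("Col1", "Col3", "Col9")
present <- intersect(wanted, colnames(mat))  # keep only names that exist

# drop = FALSE keeps the result a matrix even when one column is left
fin_mat <- mat[, present, drop = FALSE]
one_col <- mat[, "Col1", drop = FALSE]
dim(fin_mat)
dim(one_col)
```

Without the intersect() step, indexing with a name that is not a column of mat stops with a "subscript out of bounds" error.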
One way would be to use the dplyr package with the functions "select" and "one_of". one_of lets you select columns by name (given as strings).
Here is a simple example with the iris table, in which I extract the columns named "Sepal.Length" and "Sepal.Width".
library(dplyr)
mat1 <- iris
mat2 <- data.frame(names = c("Sepal.Length", "Sepal.Width")) %>%
  mutate(names = as.character(names)) # make sure the names are characters
results <- mat1 %>% select(one_of(mat2$names))
It can be done pretty easily. In the code below, I am creating a data frame mat and another one, res_mat. mat has the data and res_mat has a single column named select_these_cols. The mat data frame has 10 columns named a, b, c, d, e, ..., j. The select_these_cols column of res_mat has five rows with the entries a, b, c, d, e. All that needs to be done is to pass res_mat$select_these_cols to mat:
a <- (matrix(rnorm(1000), nrow = 100, ncol = 10))
mat <- as.data.frame(a)
names(mat) <- letters[1:10]
res_mat <- data.frame(x = letters[1:5])
names(res_mat) <- 'select_these_cols'
fin_mat <- mat[res_mat$select_these_cols] # subsetting operation

cbind equally named vectors in multiple data.frames in a list to a single data.frame

I have a list similar to this one:
set.seed(1602)
l <- list(data.frame(subst_name = sample(LETTERS[1:10]), perc = runif(10), crop = rep("type1", 10)),
          data.frame(subst_name = sample(LETTERS[1:7]), perc = runif(7), crop = rep("type2", 7)),
          data.frame(subst_name = sample(LETTERS[1:4]), perc = runif(4), crop = rep("type3", 4)),
          NULL,
          data.frame(subst_name = sample(LETTERS[1:9]), perc = runif(9), crop = rep("type5", 9)))
Question: How can I extract the subst_name-column of each data.frame and combine them with cbind() (or similar functions) to a new data.frame without messing up the order of each column? Additionally the columns should be named after the corresponding crop type (this is possible 'cause the crop types are unique for each data.frame)
EDIT: The output should look as follows:
Having read the comments, I'm aware that this doesn't make much sense within R, but for the sake of having a look at the output, the data.frame's View option is quite handy.
With the help of this SO question I came up with the following solution. (There's probably room for improvement.)
a <- lapply(l, '[[', 1) # extract the first element of the dfs in the list
a <- Filter(function(x) !is.null(unlist(x)), a) # remove NULLs
a <- lapply(a, as.character)
max.length <- max(sapply(a, length))
## Add NA values to list elements
b <- lapply(a, function(v) { c(v, rep(NA, max.length-length(v)))})
e <- as.data.frame(do.call(cbind, b)) # b holds the NA-padded columns
names(e) <- unlist(lapply(lapply(lapply(l, '[[', "crop"), '[[', 2), as.character))
It is not really correct to do this with the given example, because the number of rows is not the same in each of the list's data frames. But if you don't care, you can do:
nullElements = unlist(sapply(l,is.null))
l = l[!nullElements] #delete useless null elements in list
columns=lapply(l,function(x) return(as.character(x$subst_name)))
newDf = as.data.frame(Reduce(cbind,columns))
If you don't want recycled elements in the columns you can do
for(i in 1:ncol(newDf)){
  colLength = nrow(l[[i]])
  if(colLength < nrow(newDf)){ # skip full-length columns: (n+1):n would count backwards
    newDf[(colLength+1):nrow(newDf),i] = NA
  }
}
newDf = newDf[1:max(unlist(sapply(l,nrow))),] #remove possible extra NA rows
Note that I edited my previous code to remove NULL entries from l to simplify things
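An alternative that avoids recycling altogether is to pad each column with NA before combining. The following is a sketch on a smaller made-up list, not the data from the question; the names dfs, cols, padded, and res are assumptions of this example. It relies on the fact that assigning a larger length to a vector fills the new positions with NA.

```r
# Smaller stand-in for the list in the question (values invented)
l <- list(
  data.frame(subst_name = c("A", "B", "C"), crop = "type1"),
  NULL,
  data.frame(subst_name = c("D", "E"), crop = "type2")
)

dfs  <- Filter(Negate(is.null), l)                    # drop NULL entries
cols <- lapply(dfs, function(d) as.character(d$subst_name))
n    <- max(lengths(cols))                            # longest column

# Extending a vector with `length<-` pads it with NA
padded <- lapply(cols, function(v) { length(v) <- n; v })

# Name each column after the (unique) crop type of its data frame
names(padded) <- vapply(dfs, function(d) as.character(d$crop[1]), character(1))

res <- as.data.frame(padded, stringsAsFactors = FALSE)
res
```

Because every column is padded to the same length first, the order within each column is left untouched and no warnings about recycling should appear.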

Add Columns to an empty data frame in R

I have searched extensively but not found an answer to this question on Stack Overflow.
Let's say I have a data frame a.
I define:
a <- NULL
a <- as.data.frame(a)
If I wanted to add a column to this data frame as so:
a$col1 <- c(1,2,3)
I get the following error:
Error in `$<-.data.frame`(`*tmp*`, "a", value = c(1, 2, 3)) :
replacement has 3 rows, data has 0
Why is the row dimension fixed but the column is not?
How do I change the number of rows in a data frame?
If I do this (inputting the data into a list first and then converting to a df), it works fine:
a <- NULL
a$col1 <- c(1,2,3)
a <- as.data.frame(a)
The row dimension is not fixed, but data.frames are stored as a list of vectors that are constrained to have the same length. You cannot add col1 to a because col1 has three values (rows) and a has zero, which would break that constraint. R does not by default auto-vivify values when you attempt to extend the dimension of a data.frame by adding a column that is longer than the data.frame. The second example works because a is still a plain list when col1 is assigned, so col1 is the only vector present when as.data.frame is called and the data.frame is initialized with three rows.
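The length constraint can be checked directly. The sketch below uses tryCatch only to capture the failure without stopping the script; the variable names result and b are made up for this example.

```r
a <- data.frame()                      # zero rows, zero columns

# Adding a length-3 column to a zero-row data frame violates the constraint
result <- tryCatch({
  a$col1 <- c(1, 2, 3)
  "no error"
}, error = function(e) "error")
result

# Building the data frame from the column avoids the mismatch
b <- data.frame(col1 = c(1, 2, 3))
b$col2 <- c("x", "y", "z")             # same length as col1, so this is fine
dim(b)
```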
If you want to automatically have the data.frame expand, you can use the following function:
cbind.all <- function (...)
{
  nm <- list(...)
  nm <- lapply(nm, as.matrix)
  n <- max(sapply(nm, nrow))
  do.call(cbind, lapply(nm, function(x) rbind(x, matrix(, n -
    nrow(x), ncol(x)))))
}
This will fill missing values with NA. You would use it like cbind.all(df, a). Note that the result is a matrix, so wrap it in as.data.frame() if you need a data frame.
You could also do something like this where I read in data from multiple files, grab the column I want, and store it in the dataframe. I check whether the dataframe has anything in it, and if it doesn't, create a new one rather than getting the error about mismatched number of rows:
readCounts = data.frame()
for(f in names(files)){
  d = read.table(files[f], header=T, as.is=T)
  d2 = round(data.frame(d$NumReads))
  colnames(d2) = f
  if(ncol(readCounts) == 0){
    readCounts = d2
    rownames(readCounts) = d$Name
  } else{
    readCounts = cbind(readCounts, d2)
  }
}
If you have an empty data frame, called for example df, in my opinion another quite simple solution is the following:
df[1,]=NA # add a temporary new row of NA values
df[,'new_column'] = NA # add the new column, called for example 'new_column'
df = df[0,] # delete the row with NAs
I hope this may help.

Keeping column names when deleting columns

I've created an R script that calculates the percentage of missing values in each column of a data frame, and then removes the columns that exceed a preset threshold. The column names need to be maintained.
The names are maintained when more than one column is left in the data frame after column deletion, but not when only one column is left.
Code for the case where the column names are kept:
df <- data.frame(A=rnorm(10, 10, 1), B=rep(NA, 10), C=rnorm(10, 10, 1))
threshold <- 80
pmiss <- function(x) {
  ifelse(sum(is.na(x))/length(x)*100 > threshold, TRUE, FALSE)
}
temp <- sapply(df, pmiss)
deletecols <- names(temp[temp==TRUE])
df <- as.data.frame(df[,!(names(df) %in% deletecols)])
names(df) #prints
[1] "A" "C"
However, define df as
df <- data.frame(A=rnorm(10, 10, 1), B=rep(NA, 10))
and
names(df) #prints
[1] "df[, !(names(df) %in% deletecols)]"
Does anybody know why the column names are not kept when there is only one column?
You've been bitten by an R FAQ. Add , drop = FALSE to your data frame subsetting (and you'll notice as a side effect that you no longer need as.data.frame).
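As a sketch of what that looks like on the one-column case from the question: the example below computes the missing-value percentages with colMeans(is.na(df)) instead of the question's pmiss helper, which is a substitution made here for brevity, and the names too_sparse, kept1, and kept2 are invented.

```r
set.seed(1)
df <- data.frame(A = rnorm(10, 10, 1), B = rep(NA, 10))
threshold <- 80

# Percentage of missing values per column, compared with the threshold
too_sparse <- colMeans(is.na(df)) * 100 > threshold

kept1 <- df[, !too_sparse, drop = FALSE]  # matrix-style indexing, no dropping
kept2 <- df[!too_sparse]                  # list-style indexing never drops
names(kept1)
names(kept2)
```

Both variants should return a one-column data frame that still carries the name A.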

Sorting and finding values in other data frames

I have a dataframe named commodities_3. It contains 28 columns with different commodities and 403 rows representing end-of-month data. For each row separately, I need to find the positions of:
the max value,
the min value,
all other positives,
all other negatives.
These indices should then be used to locate the corresponding data in another dataframe with the same column and row characteristics, called commodities_3_returns. These data should then be copied into 4 new dataframes (one dataframe for each sorting).
I know how to find the positions of the values for each row using which and which.min and which.max. But I don't know how to put this in a loop in order to do it for all 403 rows. And subsequently how to use this data to locate the corresponding data in the other dataframe commodities_3_returns.
Unfortunately, I have to use a dataframe because I have dates as rownames in there, which I have to keep as I need them later for indexing, as well as NA's. It looks roughly like this:
commodities_3 <- as.data.frame(matrix(rnorm(15), nrow=5, ncol=3))
mydates <- as.Date(c("2011-01-01", "2011-01-02", "2011-01-03", "2011-01-04", "2011-01-05"))
rownames(commodities_3) <- mydates
commodities_3[3,2] <- NA
commodities_3_returns <- as.data.frame(matrix(rnorm(15), nrow=5, ncol=3))
mydates <- as.Date(c("2011-01-01", "2011-01-02", "2011-01-03", "2011-01-04", "2011-01-05"))
rownames(commodities_3_returns) <- mydates
commodities_3_returns[3,3] <- NA
As I said, I have in total 403 rows and 27 columns. In every row, there are some NA's which I have to keep as well. max.col doesn't seem to be able to handle NA's.
My desired output for the example above would be something like this:
max_values <- as.data.frame(matrix(data=c(1:5,3,2,1,3,1), nrow=5, ncol=2, byrow=F))
If all the columns in commodities_3 are numeric, then you want a matrix, not a data frame. Then use the apply function. Some sample data, for reproducibility:
commodities_3 <- matrix(rnorm(12), nrow = 4)
commodities_3_returns <- matrix(1:12, nrow = 4)
The stats.
mins <- apply(commodities_3, 1, which.min)
maxs <- apply(commodities_3, 1, which.max)
pos <- apply(commodities_3, 1, function(x) which(x > 0)) #which is optional
neg <- apply(commodities_3, 1, function(x) which(x < 0))
Now use these in the index for commodities_3_returns. In the absence of coffee, my brain has only a clunky solution with a for loop
n_months <- nrow(commodities_3_returns)
min_returns <- numeric(n_months)
for(i in seq_len(n_months))
{
  min_returns[i] <- commodities_3_returns[i, mins[i]]
}
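For the positive and negative groups, one possible approach is to keep the full shape of the returns and blank out everything that does not belong to the group, which also preserves row names and copes with NA. This is a sketch on invented numbers; vals, rets, and the mask names are assumptions of this example, and it does not exclude the row maximum or minimum from the positives or negatives.

```r
# Invented stand-ins for commodities_3 and commodities_3_returns
vals <- matrix(c(0.5, -1.2,   NA,
                 2.0,  0.3, -0.7), nrow = 2, byrow = TRUE)
rets <- matrix(c(10, 20, 30,
                 40, 50, 60), nrow = 2, byrow = TRUE)

# Logical masks; NA entries in vals belong to neither group
pos_mask <- !is.na(vals) & vals > 0
neg_mask <- !is.na(vals) & vals < 0

# Same dimensions as rets, with NA wherever the mask is FALSE
pos_returns <- ifelse(pos_mask, rets, NA)
neg_returns <- ifelse(neg_mask, rets, NA)
pos_returns
neg_returns
```

Working with masks sidesteps the issue that apply() returns a list when the rows have different numbers of positive or negative entries.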
Here is an alternate approach to get the min and max using max.col, which is implemented in C internally. On a large data set, max.col is much faster than apply-based solutions.
mins = max.col(-commodities_3)
maxs = max.col(commodities_3)
N = NROW(commodities_3)
commodities_3_returns[cbind(1:N, mins)] # returns min
commodities_3_returns[cbind(1:N, maxs)] # returns max
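Since the question mentions NA values, note that max.col returns NA for any row that contains one. A possible workaround, sketched below on invented numbers, is to replace NA with -Inf (or Inf for the minimum) before calling max.col, and to pass ties.method = "first" so that ties are not broken at random. The names vals, rets, for_max, and for_min are assumptions of this example.

```r
# Invented stand-ins for commodities_3 and commodities_3_returns
vals <- matrix(c(0.5, -1.2,   NA,
                 2.0,  0.3, -0.7), nrow = 2, byrow = TRUE)
rets <- matrix(1:6, nrow = 2, byrow = TRUE)

# NA can never win: -Inf when looking for the max, Inf for the min
for_max <- vals; for_max[is.na(for_max)] <- -Inf
for_min <- vals; for_min[is.na(for_min)] <- Inf

maxs <- max.col(for_max,  ties.method = "first")
mins <- max.col(-for_min, ties.method = "first")

N <- nrow(vals)
max_returns <- rets[cbind(1:N, maxs)]  # returns at the row maxima
min_returns <- rets[cbind(1:N, mins)]  # returns at the row minima
max_returns
min_returns
```

A row consisting only of NA would still yield column 1 with this trick, so such rows may need separate handling.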

Resources