Undesired output (Levels) while selecting from R dataframe [duplicate] - r

Part of my code is similar to following:
id.row <- c("x1","x2", "x10", "x20")
id.num <- c(1, 2, 10, 20)
id.name <- c("one","two","ten","twenty")
info.data <- data.frame(id.row, id.num, id.name)
req.mat <- matrix(1:4, nrow = 2, ncol = 2)
row.names(req.mat) <- c("x1","x10")
p1 <- info.data$id.row %in% row.names(req.mat)
op1 <- info.data$id.num[p1]
op2 <- info.data$id.name[p1]
I think the code is pretty much self-explanatory, and I am getting the results that I want. There is no problem in printing op1, but when I print op2 I get additional output (Levels). Since there are thousands of rows in the original info.data, this "Levels" line does not look nice. Is there a way to turn it off?

Maybe you can use print(op2, max.levels=0)

If you don't want any of your variables to be factors, then
info.data <- data.frame(id.row, id.num, id.name, stringsAsFactors = FALSE)
is a good choice. If you want id.row but not id.name to be a factor, then
info.data <- data.frame(id.row = factor(id.row), id.num, id.name,
                        stringsAsFactors = FALSE)
is a better way to set that up. If you created your data in some other way, so that the columns already exist as factors in the data.frame, you can convert them back to character with
info.data$id.name <- as.character(info.data$id.name)
If you do want id.name to be a factor but just want to drop the unused levels after subsetting, then
op2 <- info.data$id.name[p1, drop = TRUE]
will do the trick.
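To compare these options side by side, here is a minimal sketch on a cut-down version of the question's data. The vector keep stands in for p1 and is an assumption for illustration; droplevels() is a base R alternative that the answers above do not use.

```r
# Cut-down version of the question's data; `keep` stands in for p1 (assumption).
id.name <- factor(c("one", "two", "ten", "twenty"))
keep <- c(TRUE, FALSE, TRUE, FALSE)

sub_all  <- id.name[keep]                # values subset, all four levels kept
sub_drop <- id.name[keep, drop = TRUE]   # unused levels dropped while subsetting
sub_dl   <- droplevels(id.name[keep])    # same levels, via droplevels()
sub_chr  <- as.character(id.name[keep])  # plain character vector, no levels

levels(sub_all)   # still lists all four original levels
levels(sub_drop)  # only the levels that actually occur

# Printing without the "Levels:" line, leaving the factor itself untouched
print(sub_all, max.levels = 0)
```

Also note that since R 4.0.0, data.frame() defaults to stringsAsFactors = FALSE, so on a recent R the question's code produces character columns and no Levels line unless factors are requested explicitly.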

Related

Create a vector of all values from all columns in a dataframe in R [duplicate]

I would like to create a vector of all values from a data frame. It seems like there must be a simple way of doing this but I can't find it.
# Dummy data
samples <- c('A','B','C', 'D')
var1 <- c(3, 5, NA, 5)
var2 <- c(4, 4, 2, 2)
var3 <- c(NA, 12, 12, 8)
df <- data.frame(var1,var2,var3,row.names=samples)
df
Desired output:
output <- c(3,5,NA,5,4,4,2,2,NA,12,12,8)
output
I've thought about looping through every column but haven't figured out how to iteratively add to a vector with each column. Something like this, but at the moment vals just contains the final column without adding each column to it:
for(i in 1:ncol(df)) {
vals <- df[,i]
}
Maybe there's an easier way though. Thanks for your help.
Maybe you can try unlist
output <- unlist(df,use.names = FALSE)
or
output <- unname(unlist(df))
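As a rough illustration of what unlist does here, the sketch below rebuilds the question's data frame and checks the result against the desired output. The loop variant and the row-wise variant are additions for comparison, not part of the original answer.

```r
# The question's dummy data
df <- data.frame(var1 = c(3, 5, NA, 5),
                 var2 = c(4, 4, 2, 2),
                 var3 = c(NA, 12, 12, 8),
                 row.names = c("A", "B", "C", "D"))

# unlist() concatenates the columns one after another (column-wise order)
by_column <- unlist(df, use.names = FALSE)

# The loop from the question works once each column is appended to vals
# instead of overwriting it
vals <- c()
for (i in seq_len(ncol(df))) {
  vals <- c(vals, df[, i])
}

# If row-wise order is wanted instead, transpose the matrix first
by_row <- as.vector(t(as.matrix(df)))
```

Here by_column and vals both match the desired output from the question, while by_row walks through the samples A to D one row at a time.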

From a dataframe extract columns with numerical values [duplicate]

I would like to extract all columns for which the values are numeric from a dataframe, for a large dataset.
#generate mixed data
dat <- matrix(rnorm(100), nrow = 20)
df <- data.frame(letters[1 : 20], dat)
I was thinking of something along the lines of:
numdat <- df[,df == "numeric"]
That however leaves me without variables. The following gives an error.
dat <- df[,class == "numeric"]
Error in class == "numeric" :
comparison (1) is possible only for atomic and list types
What should I do instead?
Use sapply:
numdat <- df[, sapply(df, function(x) {class(x) == "numeric"})]
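A possible variation, sketched under the assumption that integer columns should count as numeric too: class(x) == "numeric" is FALSE for integer columns, whereas is.numeric() is TRUE for both. The id and count columns below are made up for the example.

```r
set.seed(42)
# Mixed data: one character column, five double columns, one integer column
df <- data.frame(id = letters[1:20],
                 matrix(rnorm(100), nrow = 20),
                 count = 1:20)

# class(x) == "numeric" skips the integer column `count`
by_class <- df[, sapply(df, function(x) class(x) == "numeric")]

# is.numeric() keeps doubles and integers; vapply() guarantees a logical vector
is_num <- vapply(df, is.numeric, logical(1))
numdat <- df[, is_num, drop = FALSE]  # drop = FALSE keeps a data frame even for one column

names(numdat)
```

The drop = FALSE argument matters when only a single column qualifies; without it the result would collapse to a plain vector.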

cbind dataframes in loop [duplicate]

I have n dataframes named "s.dfx", where x = 1:n. All the dataframes have 7 columns with different names. Now I want to cbind all the dataframes.
I know the command
t <- cbind.data.frame(s.df1, s.df2, ..., s.dfn)
But I want to optimize this and cbind them in a loop, since n is a large number.
I have tried
for(t2 in 1:n){
t <- cbind.data.frame(s.df[t2])
}
But I get this error "Error in [.data.frame(s.df, t2) : undefined columns selected"
Can anyone help?
I don't think that a for-loop would be any faster than do.call(cbind, dfs), but it wasn't clear to me that you actually had such a list yet. I thought you might need to build such a list from a character vector of names. This answer assumes you don't have a list yet, but that you do have all your dataframes numbered in an ascending sequence that ends in n, where n might have multiple digits.
t <- do.call( cbind, mget( paste0("s.df", 1:n) ) )
Pasqui uses ls inside mget with a pattern to capture all the numbered dataframes. I would have used a slightly different pattern, since you suggested that the number could be higher than 9, which is all that his pattern would capture:
ls(pattern = "^s\\.df[0-9]+") # any number of digits
# ^ need double escapes to make '.' a literal period or fixed=TRUE
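A small sketch of the two ways of collecting the names, assuming three made-up data frames called s.df1, s.df2 and s.df10. One difference worth knowing is that ls() returns names in alphabetical order, so s.df10 sorts before s.df2, while paste0() with an explicit number sequence keeps the numeric order.

```r
# Hypothetical example data frames following the question's naming scheme
s.df1  <- data.frame(a1 = 1:2, b1 = 3:4)
s.df2  <- data.frame(a2 = 5:6, b2 = 7:8)
s.df10 <- data.frame(a10 = 9:10, b10 = 11:12)

env  <- environment()   # the environment the data frames live in
nums <- c(1, 2, 10)

# Names built from the numbers: numeric order is preserved
by_number <- paste0("s.df", nums)

# Names found by pattern: alphabetical order, so "s.df10" precedes "s.df2"
by_pattern <- ls(envir = env, pattern = "^s\\.df[0-9]+$")

# unname() drops the list names so the original column names are kept as-is
combined <- do.call(cbind, unname(mget(by_number, envir = env)))
dim(combined)
```

If the column order of the combined result matters, building the names from the numbers is the safer of the two approaches.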
library(purrr) # loaded for reduce(); the call below is namespaced anyway
# Generating dummy data frames
df1 <- data.frame(x = c(1,2), y = letters[1:2])
df2 <- data.frame(x = c(10,20), y = letters[c(10, 20)])
df3 <- data.frame(x = c(100, 200), y = letters[c(11, 22)])
# DEMO [to be adapted]: capturing the EXAMPLE data frames in a list
dfs <- mget(ls(pattern = "^df[1-3]"))
# A Tidyverse (purrr) solution; note that bind_cols() comes from dplyr, not purrr
t <- purrr::reduce(.x = dfs, .f = dplyr::bind_cols)
# Base R
do.call(cbind, dfs)
# or
Reduce(cbind, dfs)

Cut a column based on intervals of another column in r

I want to cut test$income into 25 levels. I stored the derived intervals in a variable called levels, and I wish to cut train$income using the same intervals. I tried the code below, but I am not sure why some of my values in train$income were coerced to NA.
What went wrong? Is there a better way to do this? Thank you!
test$income <- cut(test$income,b=25)
levels <- c(-0.853,-0.586,-0.325,-0.0643,0.196,0.457,0.718,0.978,1.24,1.5,1.76,2.02,2.28,2.54,2.8,3.06,3.32,3.59,3.85,4.11,4.37,4.63,4.89,5.15,5.41,5.68)
train$income <- cut(train$income,levels)
As @JohnGilfillan says, one reason can be that some values in train$income are higher than 5.68 or lower than -0.853. In that case those values become NA, while the others are assigned to an interval. This is the likely case, but another possible reason is that you used a character vector to specify the breaks in your actual code (calling levels() on the result of cut returns a character vector). In that case you would get a vector with only NAs (printed as <NA>).
The solution is to expand the extremes of your levels vector.
Try this:
set.seed(1)
a <- runif(100, -6, 6)
set.seed(2)
b <- runif(100, -6, 6)
levs <- levels(cut(a, 25))
levs <- gsub("\\(", "", levs)
levs <- gsub("\\]", "", levs)
levs <- c(as.numeric(sapply(strsplit(levs, ","), "[", 1)),
as.numeric(sapply(strsplit(levs, ","), "[", 2))[length(levs)])
cut.b <- cut(b, levs)
## Both NA values are outside levs
b[is.na(cut.b)]
cut.b.new <- cut(b, c(-6, levs[c(-1, -length(levs))], 6))
## No NAs
any(is.na(cut.b.new))
PS: It is not recommended to use function names as object names. Therefore levs instead of levels.
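An alternative sketch that avoids parsing the level labels altogether: compute the numeric breaks once, open the two outer intervals with -Inf and Inf, and reuse the same breaks for both data sets. The vectors test_income and train_income are simulated stand-ins for test$income and train$income, and equal-width bins over the range of the test data are an assumption, not necessarily what cut(x, 25) chose in the question.

```r
# Simulated stand-ins for test$income and train$income (assumption)
set.seed(1)
test_income <- runif(100, -1, 5)
set.seed(2)
train_income <- runif(100, -2, 6)   # deliberately wider than the test range

# 26 break points give 25 equal-width intervals over the test range
rng <- range(test_income)
brks <- seq(rng[1], rng[2], length.out = 26)

# Open the outer intervals so values beyond the test range still get a bin
brks[1] <- -Inf
brks[length(brks)] <- Inf

test_bins  <- cut(test_income, brks)
train_bins <- cut(train_income, brks)

anyNA(train_bins)   # no value falls outside the breaks
```

Because both calls use the same numeric breaks, the two factors share identical levels, which is usually what is needed when the binned variable is later used in a model.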

R assign levels to factor variable

I was given an Excel table similar to this:
datos <- data.frame(op= 1:4, var1= c(4, 2, 3, 2))
Now, there are other tables with the keys to op and var1, which happen to be categorical variables. Suppose that after loading them, they become:
set.seed(1)
op <- paste("op",c(1:4),sep="")
var1 <- sample(LETTERS, 19, replace= FALSE)
As you can see, there are unused levels in the data frame. I want to replace the numbers for the proper associated levels. This is what I've tried:
datos[] <- lapply(datos, factor)
levels(datos$op) <- op
levels(datos$var1) <- var1
This fails, because it reorders the factors alphabetically and gives a wrong output. I then tried:
datos$var1 <- factor(datos$var1, levels= var1, ordered= TRUE)
but this puts everything in datos$var1 as NA (I guess that's because of unmatching lengths).
What would be the right way to do this?
Following the kind advice of @docendoDiscimus, I am posting this answer for future reference:
For the data provided in the question:
datos$var1 <- factor(var1[datos$var1], levels= unique(var1))
datos
## op
Please notice that this solution should be applied without first converting datos$var1 to a factor (that is, without running datos[] <- lapply(datos, factor)).
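To make the lookup step easier to follow, here is a small deterministic sketch. The key vectors key_op and key_var1 are made up for illustration (the question draws its key with sample(), whose result depends on the R version); the numeric codes in datos are used as positions into the key.

```r
datos <- data.frame(op = 1:4, var1 = c(4, 2, 3, 2))

# Hypothetical keys: position i holds the label for code i
key_op   <- paste0("op", 1:4)
key_var1 <- c("K", "B", "Q", "D", "A")

# Index the key by the numeric codes, then fix the level order to the key order
datos$op   <- factor(key_op[datos$op], levels = key_op)
datos$var1 <- factor(key_var1[datos$var1], levels = unique(key_var1))

datos
levels(datos$var1)   # all key entries, in key order, including unused ones
```

Indexing with the original numbers is what keeps code 4 mapped to the fourth key entry; converting to a factor first would replace the codes by their rank among the values present, which is why the first attempt in the question gave wrong labels.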
