Format class of many variables at once in R

I have several variables whose names all start with the same pattern in my data frame (around 20). R reads them in as characters but they should be formatted as factors.
Below I have provided a comparable (just much smaller) data frame.
animal.farm <- data.frame(matrix(0, 5, 0))
set.seed(1)
animal.farm$ord.3 <- sample(1:4, 5, replace=T)
animal.farm$ani.4 <- sample(c("dog", "horse", "mink"), 5, replace=T)
animal.farm$ani.5 <- sample(c("fun", "boring", "clever"), 5, replace=T)
I've tried both
ls(pattern = "animal.farm$ani")
and
apropos("animal.farm$ani")
so that I can apply factor() to all the variables with one or two lines of code (that in this case start with "ani") but no luck so far.

A simple base R solution:
id <- grep("^ani", names(animal.farm))
animal.farm[id] <- lapply(animal.farm[id], as.factor)

Using stringr to detect column names that start with ani
library(stringr)
cols <- str_detect(colnames(animal.farm), "^ani")
# list-style indexing keeps a data frame even if only one column matches
animal.farm[cols] <- lapply(animal.farm[cols], as.factor)
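For context on why the attempts in the question found nothing: ls() and apropos() search for objects in an environment, not for columns inside a data frame, so the column names have to come from names(animal.farm). The sketch below rebuilds the example data and uses base R's startsWith() as one more way to pick the columns; stringsAsFactors = FALSE is set only so that the behaviour is the same on older R versions.

```r
# Rebuild the example data frame from the question
set.seed(1)
animal.farm <- data.frame(
  ord.3 = sample(1:4, 5, replace = TRUE),
  ani.4 = sample(c("dog", "horse", "mink"), 5, replace = TRUE),
  ani.5 = sample(c("fun", "boring", "clever"), 5, replace = TRUE),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE for every column whose name starts with "ani"
cols <- startsWith(names(animal.farm), "ani")

# Convert only the selected columns; the others are left untouched
animal.farm[cols] <- lapply(animal.farm[cols], factor)

# Check the resulting classes
sapply(animal.farm, class)
#     ord.3     ani.4     ani.5
# "integer"  "factor"  "factor"
```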

Related

Stacking two vectors into one column of data.frame with additional ID column

I have data in multiple vectors that I would like to convert to a data.frame with one ID column (vector name) and one data column (vector values). Here's a toy example:
data.1 <- c(1, 2)
data.2 <- c(10, 20, 30)
library(dplyr)  # provides bind_rows()
df <- bind_rows(data.frame(ID="data.1", value=data.1), data.frame(ID="data.2", value=data.2))
If I have another vector (or any other data structure) that contains the name of the variables as a character string, how can I elegantly shorten the code? One time I would need to retrieve the entry as a character string (for ID) and the other time as the variable name (for value).
studies <- c("data.1", "data.2")
You can define a function f that returns a data.frame whose ID column holds the name of the object passed to the function:
f <- function(x){
return(data.frame(ID=deparse(substitute(x)), value=x))
}
So, you can define your new data.frame as follows:
require(dplyr)
data.1 <- c(1, 2)
data.2 <- c(10, 20, 30)
bind_rows(f(data.1), f(data.2))
It looks much more elegant to me because you don't need to write the name of each source twice.
I think I found a general solution via lists that uses only base R functions and is fairly simple. The most complicated part is keeping the names intact when they have different lengths, or when the number of values per study spans more than one order of magnitude (unlist then appends a different number of characters to each name).
Thank you @Gregor for pointing me toward working with lists.
# data input
studies <- c("study1", "std2", "This name is very long")
data_1 <- c(1, 2)
data_2 <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
data_3 <- c(3, 6, 9, 12, 15)
# generate list with data and assign names of identical length
data_list <- list(data_1, data_2, data_3)
names(data_list) <- format(studies, width=max(nchar(studies)))
# get names from 'unlist', make them the same length, remove last chars
names <- names(unlist(data_list))
names <- format(names, width=max(nchar(names)))
names <- substr(names,1,nchar(names)-2)
# get values from 'unlist'
values <- unname(unlist(data_list))
# make data.frame
data <- data.frame(names, values)
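If the goal is only an ID column plus a value column, a shorter route may be to skip the name padding entirely. The sketch below is an alternative, not the approach used above: it assumes the vectors exist under the names listed in studies, fetches them with mget(), and repeats each name once per value with rep() and lengths().

```r
data.1 <- c(1, 2)
data.2 <- c(10, 20, 30)
studies <- c("data.1", "data.2")

# mget() looks up several objects by name and returns a named list
data_list <- mget(studies)

# Repeat each name once per element, and flatten the values without names
df <- data.frame(
  ID    = rep(names(data_list), lengths(data_list)),
  value = unlist(data_list, use.names = FALSE)
)
df
```

Base R's stack(data_list) gives a similar result, with columns named values and ind instead.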

R assign levels to factor variable

I was given an Excel table similar to this:
datos <- data.frame(op= 1:4, var1= c(4, 2, 3, 2))
Now, there are other tables with the keys to op and var1, which happen to be categorical variables. Suppose that after loading them, they become:
set.seed(1)
op <- paste("op",c(1:4),sep="")
var1 <- sample(LETTERS, 19, replace= FALSE)
As you can see, there are unused levels in the data frame. I want to replace the numbers with the proper associated levels. This is what I've tried:
datos[] <- lapply(datos, factor)
levels(datos$op) <- op
levels(datos$var1) <- var1
This fails, because it reorders the factors alphabetically and gives a wrong output. I then tried:
datos$var1 <- factor(datos$var1, levels= var1, ordered= TRUE)
but this turns everything in datos$var1 into NA (I guess that's because of the mismatched lengths).
What would be the right way to do this?
Following the kind advice of @docendoDiscimus, I post this answer for future reference:
For the data provided in the question:
datos$var1 <- factor(var1[datos$var1], levels= unique(var1))
datos
## op
Please note that this solution should be applied without first converting datos$var1 to a factor (that is, without running datos[] <- lapply(datos, factor)).
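A small deterministic sketch of the same idea, using a made-up key (var1_key is hypothetical and stands in for the sampled letters in the question). The numbers in the data act as positions in the key, so indexing the key by them yields the labels. The earlier attempt with levels = var1 most likely returned NA because none of the numeric values matches any of the letter levels.

```r
datos <- data.frame(op = 1:4, var1 = c(4, 2, 3, 2))

op_key   <- paste0("op", 1:4)
var1_key <- c("K", "D", "G", "A", "B")  # hypothetical key, for illustration only

# Look up each code in its key, then fix the level order to the key order
datos$op   <- factor(op_key[datos$op],     levels = op_key)
datos$var1 <- factor(var1_key[datos$var1], levels = var1_key)

as.character(datos$var1)
# "A" "D" "G" "D"
levels(datos$var1)  # all five key entries are kept, including unused ones
```

An equivalent form on the original numeric column would be factor(x, levels = seq_along(var1_key), labels = var1_key).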

Averaging values between paired columns across a large data frame

I have a dataframe consisting of a series of paired columns. Here is a small example.
df1 <- as.data.frame(matrix(sample(0:1000, 36*10, replace=TRUE), ncol=1))
df2 <- as.data.frame(rep(1:12, each=30))
df3 <- as.data.frame(matrix(sample(0:500, 36*10, replace=TRUE), ncol=1))
df4 <- as.data.frame(c(rep(5:12, each=30),rep(1:4, each=30)))
df5 <- as.data.frame(matrix(sample(0:200, 36*10, replace=TRUE), ncol=1))
df6 <- as.data.frame(c(rep(8:12, each=30),rep(1:7, each=30)))
Example <- cbind(df1,df2,df3,df4,df5,df6)
What I would like to do is find an average value for the odd-numbered columns (df1, df3, df5) based on the values in the adjacent column, so in the example I would have three sets of averages for each value between 1 and 12. I have managed to apply a function for a specific pair of columns...
Example_two <- cbind(df1,df2)
colnames (Example_two) <- c("x","y")
tapply(Example_two$x, Example_two$y, mean)
However, the data frame I will be looking at will be considerably larger, so some form of apply function would be ideal to perform this iteratively across each paired set. I have found a similar problem, "Is there a R function that applies a function to each pair of columns?", but I can't seem to apply this to my own dataset.
Any help would be much appreciated, thank you in advance.
Try
mapply(function(x,y) tapply(x,y, FUN=mean) ,
Example[seq(1, ncol(Example), 2)], Example[seq(2, ncol(Example), 2)])
Or, instead of seq(1, ncol(Example), 2) and seq(2, ncol(Example), 2), use the recycled logical indices c(TRUE, FALSE) and c(FALSE, TRUE), respectively.
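A self-contained sketch of the logical-index variant, on a smaller data frame with two value/group pairs. The column names x1, g1, x2 and g2 are made up for the illustration. Because every group column here contains the same 12 groups, mapply() simplifies the result to a matrix with one column per pair; with differing groups it would return a list instead.

```r
set.seed(42)
Example <- data.frame(
  x1 = sample(0:1000, 360, replace = TRUE),
  g1 = rep(1:12, each = 30),
  x2 = sample(0:500, 360, replace = TRUE),
  g2 = c(rep(5:12, each = 30), rep(1:4, each = 30))
)

# c(TRUE, FALSE) is recycled across columns and selects columns 1, 3, ...
# c(FALSE, TRUE) selects columns 2, 4, ...
means <- mapply(function(x, y) tapply(x, y, FUN = mean),
                Example[c(TRUE, FALSE)],
                Example[c(FALSE, TRUE)])

dim(means)  # 12 rows (groups) by 2 columns (pairs)
```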

R: Add columns to a data frame on the fly

New to R and programming in general over here. I have several binary matrices of presence/absence data for species (columns) and plots (rows). I'm trying to use them in several dissimilarity indices, which require that they all have the same dimensions. Although there are always 10 plots, there is a variable number of columns depending on which species were observed at that particular time. My attempt to add the 'missing' columns to each matrix so I can perform the analyses went as follows:
df1 <- read.csv('file1.csv', header=TRUE)
df2 <- read.csv('file2.csv', header=TRUE)
newCol <- unique(append(colnames(df1),colnames(df2)))
diff1 <- setdiff(newCol,colnames(df1))
diff2 <- setdiff(newCol,colnames(df2))
for (i in 1:length(diff1)) {
df1[paste(diff1[i])]
}
for (i in 1:length(diff2)) {
df2[paste(diff2[i])]
}
No errors are thrown, but df1 and df2 both remain unchanged. I suspect my issue is with my use of paste, but I couldn't find any other way to add columns to a data frame on the fly like that. When added, the new columns should have 0s in the matrix as well, but I think that's the default, so I didn't add anything to specify it.
Thanks all.
Using your code, you can generate the columns without the for loop by:
df1[, diff1] <- 0 #I guess you want `0` to fill those columns
df2[, diff2] <- 0
identical(sort(colnames(df1)), sort(colnames(df2)))
#[1] TRUE
Or, if you want to combine the datasets into one, you could use rbindlist from data.table with fill=TRUE (note that missing cells are then filled with NA rather than 0).
library(data.table)
rbindlist(list(df1, df2), fill=TRUE)
data
set.seed(22)
df1 <- as.data.frame(matrix(sample(0:1, 10*6, replace=TRUE), ncol=6,
dimnames=list(NULL, sample(paste0("Species", 1:10), 6, replace=FALSE))))
set.seed(35)
df2 <- as.data.frame(matrix(sample(0:1, 10*8, replace=TRUE), ncol=8,
dimnames=list(NULL, sample(paste0("Species", 1:10),8 , replace=FALSE))))
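The same column-filling idea on a tiny hand-made example, as a sketch with made-up species names. It also reorders both data frames to a common column order, which is an extra step beyond the answer above and is only needed if the later analysis matches columns by position.

```r
df1 <- data.frame(SpeciesA = c(1, 0), SpeciesB = c(0, 1))
df2 <- data.frame(SpeciesB = c(1, 1), SpeciesC = c(0, 1))

# Every species seen in either data frame
all_cols <- union(names(df1), names(df2))

# Assigning to column names that do not exist yet creates them, filled with 0
df1[setdiff(all_cols, names(df1))] <- 0
df2[setdiff(all_cols, names(df2))] <- 0

# Put the columns in the same order in both data frames
df1 <- df1[all_cols]
df2 <- df2[all_cols]

identical(names(df1), names(df2))
# TRUE
```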

Generate multiple scatter plots on the same canvas from a dataframe's columns

df is a data frame comprising 10 rows and 5 columns. The row names are '1', '2', ..., '10', in order, whereas the column names are 'A', 'B', ..., 'E', in order. The values in df's cells are numeric. Each column represents a function that maps the number labeling each row to the corresponding cell value. I would like to draw five scatter plots on the same canvas, depicting the functions defined by the five columns. Additionally, for each function I'd like each dot to be connected by lines to its two "neighbors" (on the left and on the right).
How would the code need to be modified in order to draw the five functions on separate canvases?
Thank you.
Attempted solution of a simplified problem.
> row_labels <- (1 : 10)
> col_labels <- (1 : 3)
> raw_data <- outer(row_labels, col_labels, FUN = '^')
> df <- as.data.frame(raw_data)
> dimnames(df) <- list(row_labels, col_labels)
> plot(row_labels, df[['1']])
> lines(row_labels, df[['1']])
> points(row_labels, df[['2']])
> lines(row_labels, df[['2']])
> points(row_labels, df[['3']])
> lines(row_labels, df[['3']])
Problems with this solution:
The second and third function graphs exceed the canvas.
Is there a shorter, more elegant way to accomplish this task?
Is there a way to create three plots on three separate canvases in a short and elegant way?
# your data
df <- as.data.frame(outer(1:10, 1:3, FUN = '^'))
colnames(df) <- c("A","B","C")
#plot it
plot(df$A,type="n",ylim=range(df),ylab="") # just creates "canvas"
lapply(df,points,type="b") # this does the plotting
EDIT: response to OP's comment.
rows <- c(1, 3, 7, 8, 16, 19, 20, 24, 28, 30)
df <- as.data.frame(outer(rows, 1:3, FUN = '^'))
colnames(df) <- c("A","B","C")
rownames(df) <- rows
plot(df$A,type="n",ylim=range(df),xlim=range(rows),ylab="")
lapply(df,points,x=as.integer(rownames(df)),type="b")
par(mfrow=c(1,3))
lapply(df,plot,x=as.integer(rownames(df)),type="b",ylab="")
You should know that the tortured syntax above is solely because the x-values are in the row names - a really, really bad idea.
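To illustrate that last remark: if the x-values are stored in an ordinary column instead of the row names, the plotting calls get simpler. The sketch below is an alternative layout, not the code from the answer; it assumes a column named x and uses base R's matplot(), which draws one series per matrix column on shared axes.

```r
rows <- c(1, 3, 7, 8, 16, 19, 20, 24, 28, 30)

# x-values live in their own column rather than in the row names
df <- data.frame(x = rows, A = rows^1, B = rows^2, C = rows^3)
series <- c("A", "B", "C")

# All three functions on one canvas: points joined by lines
matplot(df$x, as.matrix(df[series]), type = "b", pch = 1,
        xlab = "x", ylab = "")

# One canvas per function, side by side
par(mfrow = c(1, 3))
for (s in series) {
  plot(df$x, df[[s]], type = "b", xlab = "x", ylab = "", main = s)
}
```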
