Convert character column to factor preserving column label - r

I have a dataframe that I read from an XLSX file. Every column name looks like CODE___DESCRIPTION, for example A1___Some funky column here. It is easier to use the codes as column names, but I want to use the descriptions when needed, so they must be stored in the dataframe. This is why I use the sjlabelled package later on.
Make yourself some random data and save it as some_data.xlsx.
library(dplyr) #to play with tibbles
library(stringi) #to play with strings
library(writexl) #name speaks for itself
tibble(col1 = sample(c("a", "b", "c", NA, "N/A"), 50, replace = T),
col2 = sample(c("d", "e", "f", NA, "N/A"), 50, replace = T),
col3 = sample(c("g", "h", "i", NA, "N/A"), 50, replace = T),
col4 = sample(c("j", "k", "l", NA, "N/A"), 50, replace = T)) %>%
setNames(stri_c("A", 1:4, "___", stri_rand_strings(4, 10))) %>%
write_xlsx(path = "some_data.xlsx", col_names = T, format_headers = F)
I've created a simple function to prepare my data the way I want it.
library(sjlabelled) #to play with labelled data
label_it <- function(data = NULL, split = "___"){
#This basically makes an array of two columns (of codes and descriptions respectively)
k.n <- data %>%
names() %>%
stri_split_fixed(pattern = split, simplify = T)
data%>%
set_label(k.n[,2]) %>% #set description as each column's label
setNames(k.n[,1]) #set code as each column's name
}
First I read the data from the XLSX file. Then I label it.
library(readxl) #name speaks for itself again
data <- read_xlsx("some_data.xlsx", na = c("", "N/A")) %>%
label_it()
Now each column of my dataframe is a character vector (in fact a structure) with two attributes:
label, which is the description part
names, which is the original dataframe column name (CODE___DESCRIPTION style) and is not to be mistaken for the output of names(data), which would be the codes
Let's say I would like to change first and third column to factor.
To do this I have tried two things:
data[,1] <- factor(data[,1], levels = c("c", "a", "b"))
data[,3] <- factor(data[,3], levels = c("h", "g", "i"))
This changes all values in those two columns to NA_integer_.
data <- data %>%
mutate(A1 = factor(A1, levels = c("c", "a", "b")),
A3 = factor(A3, levels = c("h", "g", "i")))
This changes the character vectors to factors as intended, but it drops both column attributes (label and names), which I need to preserve.
I also tried quite a lot of functions from the sjlabelled, labelled and haven packages. Nothing worked as I intended. Finally, I found a solution, but it isn't perfect and I would love to find an easier way of doing this.
The solution is to lose those attributes and then regain (in fact, copy) them.
data <- data %>%
mutate(A1 = factor(A1, levels = c("c", "a", "b")),
A3 = factor(A3, levels = c("h", "g", "i"))) %>%
copy_labels(data)
copy_labels is a function from the sjlabelled package, used when labels are lost due to e.g. data subsetting, as in this example.
P.S.
I would love to add the r-sjlabelled and r-labelled tags because those packages are relevant to this problem, but I am under the 1500 reputation required to do this.
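The lose-and-restore idea can also be applied per column, so no second copy of the dataframe is needed. Below is a minimal sketch in base R. The helper name to_factor_keep_label is hypothetical, and it assumes the description lives in the "label" attribute as described above; it does not try to restore the names attribute.

```r
# Hypothetical helper: convert a labelled character vector to a factor
# and put the "label" attribute back afterwards.
to_factor_keep_label <- function(x, levels) {
  lbl <- attr(x, "label", exact = TRUE)  # remember the description
  f <- factor(x, levels = levels)        # factor() drops the label
  attr(f, "label") <- lbl                # restore it
  f
}

# Toy labelled vector standing in for one column of the dataframe
x <- structure(c("a", "b", "c", NA), label = "Some funky column here")
f <- to_factor_keep_label(x, levels = c("c", "a", "b"))
attr(f, "label")
levels(f)
```

Inside a pipeline this would presumably be used as mutate(A1 = to_factor_keep_label(A1, c("c", "a", "b"))), though I have only checked it on plain vectors.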

Related

Use index to subset dataframe based on unique values in a column

I have a large dataset with numerous sample IDs. A very simplified version looks something like this:
df <- data.frame(ID = rep(c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"), times = c(10, 4, 12, 19, 5, 22, 6, 7, 11, 4)),
Value = sample(x = 20:30, size = 100, replace = T))
I would like to split my large dataset into multiple smaller dataframes based on ID so that when I plot the data my graph doesn't get too crowded. In this simplified example, I would like to split it into two dataframes/plots, one with data from the first 5 unique IDs (A-E) and the other with data from the next 5 unique IDs (F-J). How can I do this easily using index notation (assuming I have hundreds of IDs)? My code below doesn't work and I don't know what's wrong with it:
subset.1 <- df[unique(df$ID)[1:5]]
subset.2 <- df[unique(df$ID)[6:10]]
You should subset with a logical vector:
df[df$ID %in% unique(df$ID)[1:5], ]
df[df$ID %in% unique(df$ID)[6:10], ]
You can also use split with cut to split your dataframe into n datasets (here, 2) by group.
split(df, cut(as.numeric(as.factor(df$ID)), 2))
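With hundreds of IDs it may be easier to assign every unique ID to a chunk of fixed size and split once. This is only a sketch; chunk_size, chunk_of_id and subsets are names made up for the illustration, and the chunks follow the order in which the IDs first appear.

```r
set.seed(1)
df <- data.frame(
  ID = rep(LETTERS[1:10], times = c(10, 4, 12, 19, 5, 22, 6, 7, 11, 4)),
  Value = sample(20:30, size = 100, replace = TRUE)
)

ids <- unique(df$ID)   # IDs in order of first appearance
chunk_size <- 5        # number of unique IDs per plot (assumed)

# Chunk number for each unique ID: 1,1,1,1,1,2,2,2,2,2
chunk_of_id <- ceiling(seq_along(ids) / chunk_size)

# Look up each row's chunk via its ID, then split into a list of dataframes
subsets <- split(df, chunk_of_id[match(df$ID, ids)])

names(subsets)
unique(subsets[["1"]]$ID)
```

Each element of subsets can then be plotted in a loop or with lapply.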

Left join two R data frames with OR conditions

Problem
I have two data frames that I want to join using a conditional statement on three non-numeric variables. Here is a pseudo-code version of what I want to achieve.
Join DF1 and DF2 on DF1$A == DF2$A | DF1$A == DF2$B
Dataset
Here's some code to create the two data frames. variant_index is the data frame that will be used to annotate input using a left_join:
library(dplyr)
options(stringsAsFactors = FALSE)
set.seed(5)
variant_index <- data.frame(
rsid = rep(sapply(1:5, function(x) paste0(c("rs", sample(0:9, 8, replace = TRUE)), collapse = "")), each = 2),
chrom = rep(sample(1:22, 5), each = 2),
ref = rep(sample(c("A", "T", "C", "G"), 5, replace = TRUE), each = 2),
alt = sample(c("A", "T", "C", "G"), 10, replace = TRUE),
eaf = runif(10),
stringsAsFactors = FALSE
)
variant_index[1, "alt"] <- "T"
variant_index[8, "alt"] <- "A"
input <- variant_index[seq(1, 10, 2), ] %>%
select(rsid, chrom)
input$assessed <- c("G", "C", "T", "A", "T")
What I tried
I would like to perform a left_join on input to annotate it with the eaf column from variant_index. As you can see from the input data frame, its assessed column can match either variant_index$ref or variant_index$alt. The rsid and chrom columns will always match.
I know I can specify multiple columns in the by argument of left_join, but if I understand correctly, the condition will always be
input$assessed == variant_index$ref & input$assessed == variant_index$alt
whereas I want to achieve
input$assessed == variant_index$ref | input$assessed == variant_index$alt
Possible solution
The desired output can be obtained like so:
input %>%
left_join(variant_index) %>%
filter(assessed == ref | assessed == alt)
But it doesn't seem like the best solution to me, since I am possibly generating double the lines, and would like to apply this join to data frames containing 100M+ lines. Is there a better solution?
Complex joins are straightforward in SQL:
library(sqldf)
sqldf("select i.*, v.ref, v.alt, v.eaf
from input i
left join variant_index v
on i.rsid = v.rsid and i.chrom = v.chrom
and (i.assessed = v.ref or i.assessed = v.alt)")
Try this
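library(dplyr)
library(dbplyr)
x1 <- memdb_frame(x = 1:5)
x2 <- memdb_frame(x1 = 1:3, x2 = letters[1:3])
# sql_on refers to the left and right tables as LHS and RHS
x1 <- x1 %>% left_join(x2, sql_on = "LHS.x = RHS.x1 OR LHS.x = RHS.x2")
We can use show_query(x1) to see the generated SQL.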
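Note that the join-then-filter approach from the question drops input rows that match neither ref nor alt, so it is not a true left join. A possible way around this, sketched in base R on small made-up data (the values below are invented for the illustration, not taken from the question), is to filter the candidates first and then merge them back onto input with all.x = TRUE.

```r
# Toy data, invented for this sketch
input <- data.frame(
  rsid = c("rs1", "rs2", "rs3"),
  chrom = c(1, 2, 3),
  assessed = c("A", "T", "G")
)
variant_index <- data.frame(
  rsid = c("rs1", "rs1", "rs2", "rs2", "rs3"),
  chrom = c(1, 1, 2, 2, 3),
  ref = c("A", "A", "C", "C", "T"),
  alt = c("G", "T", "T", "G", "C"),
  eaf = c(0.1, 0.2, 0.3, 0.4, 0.5)
)

# Step 1: equality join on the keys that always match
candidates <- merge(input, variant_index, by = c("rsid", "chrom"))

# Step 2: keep rows where assessed equals ref OR alt
matched <- candidates[candidates$assessed == candidates$ref |
                        candidates$assessed == candidates$alt, ]

# Step 3: merge back so unmatched input rows survive with NA annotations
result <- merge(input, matched, by = c("rsid", "chrom", "assessed"),
                all.x = TRUE)
result
```

This still materialises the intermediate candidates table, so it does not address the memory concern for 100M+ rows; it only restores the left-join semantics.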

How to lapply() over selective columns? - R

I am a novice R programmer. I am wondering how to lapply over a dataframe while avoiding certain columns.
# Some dummy dataframe
df <- data.frame(
grp = c("A", "B", "C", "D"),
trial = as.factor(c(1,1,2,2)),
mean = as.factor(c(44,33,22,11)),
sd = as.factor(c(3,4,1,.5)))
df <- lapply(df, function (x) {as.numeric(as.character(x))})
However, the method I used introduces NAs by coercion.
Would there be a way to selectively (or deselectively) lapply over the dataframe while maintaining the integrity of the dataframe?
In other words, would there be a way to convert only mean and sd to numerics? (In general form)
Thank you
Try doing this:
df[,3:4] <- lapply(df[,3:4], function (x) {as.numeric(as.character(x))})
This simply applies the function to the specified columns. You can also select the columns with a condition, for example by excluding the ones you don't want to convert.
col <- names(df)[!(names(df) %in% c("grp", "trial"))]
df[,col] <- lapply(df[,col], function (x) {as.numeric(as.character(x))})
Well as you might have guessed, there are many ways. Since you seem to be doing in place substitution, actually, a for loop would be suitable.
df <- data.frame(
grp = c("A", "B", "C", "D"),
trial = as.factor(c(1,1,2,2)),
mean = as.factor(c(44,33,22,11)),
sd = as.factor(c(3,4,1,.5)))
my_cols <- c("trial", "mean", "sd")
for(mc in my_cols) {
df[[mc]] <- as.numeric(as.character(df[[mc]]))
}
If you want to convert selectively by column names:
library(dplyr)
df %>%
mutate_if(names(.) %in% c("mean", "sd"),
function(x) as.numeric(as.character(x)))
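All the answers above go through as.character() first, and the reason is worth spelling out: calling as.numeric() directly on a factor returns the internal level codes, not the printed values. A small base R sketch (the variable names codes, values and num_cols are only for illustration):

```r
f <- factor(c(44, 33, 22, 11))

codes <- as.numeric(f)                  # internal level codes
values <- as.numeric(as.character(f))   # the numbers that were printed
codes
values

# Same idea applied to selected columns only, chosen by name
df <- data.frame(
  grp = c("A", "B", "C", "D"),
  trial = as.factor(c(1, 1, 2, 2)),
  mean = as.factor(c(44, 33, 22, 11)),
  sd = as.factor(c(3, 4, 1, .5))
)
num_cols <- c("mean", "sd")
df[num_cols] <- lapply(df[num_cols], function(x) as.numeric(as.character(x)))
str(df)
```

Indexing with df[num_cols] on both sides keeps df a dataframe, and the untouched columns (grp, trial) stay as they were.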

Finding occurrence of character from multiple vector or list

I wish to find the number of times a unique/distinct character occurs across multiple vectors or in a list.
Perhaps it's best to describe this with an example:
In this example, let's say the "unique characters" are letters and the multiple "vectors" are books. I wish to find the number of unique letters as the number of books increases.
# Initial data in the format of a list
book_list <- list(book_A = c("a", "b", "c", "z"),
book_B = c("c", "d", "a"),
book_C = c("b", "a", "c", "e", "x"))
# Initial data in the format of multiple vectors
book_A <- c("a", "b", "c", "z")
book_B <- c("c", "d", "a")
book_C <- c("b", "a", "c", "e", "x")
# Finding the unique letters in each book
# This is the part I'm struggling to code in a loop fashion
one_book <- length(unique(book_A))
two_book <- length(unique(c(book_A, book_B)))
three_book <- length(unique(c(book_A, book_B, book_C)))
# Plot the desired output
plot(x=c(1,2,3),
y=c(one_book, two_book, three_book),
ylab = "Number of unique letters", xlab = "Book Number",
main="The occurrence of unique letters as number of books increases")
Note: The real data set is much bigger. Each vector (book_A, book_B, etc.) is about 7000 in length.
I am attempting to solve the problem with dplyr or a data frame, but I'm not quite there yet.
# Explore data frame option with an example data
library(dplyr)
df <- read.delim("http://m.uploadedit.com/ba3s/148950223626.txt")
# Group them
df_group <- dplyr::group_by(df, book) %>% summarize(occurence = length(letter))
# Use the cumulative sum
plot(x=1:length(unique(df$book)), y=cumsum(df_group$occurence))
But I know the plot is not correct, as it is only plotting the cumulative sum rather than what I intended. Any hints would be most helpful.
To add to the complexity, it would be nice if the book with the fewest letters could be plotted first. Something along these lines:
# Example ;
# Find the length of the letters in the book
lapply(book_list, length)
# I know that book_B has the fewest letters (3);
# followed by book_A (4) then book_C (5)
one_book <- length(unique(book_B))
two_book <- length(unique(c(book_B, book_A)))
three_book <- length(unique(c(book_B, book_A, book_C)))
plot(x=c(1,2,3),
y=c(one_book, two_book, three_book),
ylab = "Number of letters", xlab = "Book Number")
You can use Reduce with accumulate = TRUE, i.e.
sapply(Reduce(c, book_list, accumulate = TRUE), function(i) length(unique(i)))
#[1] 4 5 7
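For the second part of the question (shortest book first), the same Reduce idea should work after reordering the list by length. A minimal sketch, assuming the books are stored in a named list; union is swapped in for c so that each accumulated element already holds only unique letters, and ordered and cum_unique are names made up here.

```r
book_list <- list(
  book_A = c("a", "b", "c", "z"),
  book_B = c("c", "d", "a"),
  book_C = c("b", "a", "c", "e", "x")
)

# Shortest book first: book_B (3), book_A (4), book_C (5)
ordered <- book_list[order(lengths(book_list))]

# Running union of letters, then its size after each book
cum_unique <- sapply(Reduce(union, ordered, accumulate = TRUE), length)
cum_unique

plot(x = seq_along(cum_unique), y = cum_unique,
     ylab = "Number of unique letters", xlab = "Book Number")
```

Because union keeps only distinct elements at every step, the intermediate vectors stay small even when each book has thousands of entries.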

R - reshape dataframe from duplicated column names but unique values

Hi I have a dataframe that looks like the following
I want to apply a function to it so that it reshapes it like this
How would I do that?
Here is one option that could work. We loop through the unique names of the dataset, create a logical index with ==, extract the columns, unlist, create a data.frame, and then cbind it together or just use data.frame (the assumption is that the number of duplicate elements is equal for each set).
data.frame(lapply(unique(names(df1)), function(x)
setNames(data.frame(unlist(df1[names(df1)==x], use.names = FALSE)), x)))
# type model make
#1 a b c
#2 d e f
data
df1 <- data.frame(type = "a", model = "b", make = "c", type = "d",
model = "e",
make = "f", check.names=FALSE, stringsAsFactors=FALSE)
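For the one-row case shown in the data above, a shorter route may be to treat the values as a matrix filled row by row. This is a sketch under fairly strong assumptions: the input has a single row, the duplicated names repeat in the same order (type, model, make, type, model, make), and all values share one type. The names cols and reshaped are introduced only for this example.

```r
df1 <- data.frame(type = "a", model = "b", make = "c", type = "d",
                  model = "e", make = "f",
                  check.names = FALSE, stringsAsFactors = FALSE)

cols <- unique(names(df1))   # "type" "model" "make"

# Flatten the single row and refill it with one row per repeated block
reshaped <- as.data.frame(
  matrix(unlist(df1, use.names = FALSE),
         ncol = length(cols), byrow = TRUE,
         dimnames = list(NULL, cols)),
  stringsAsFactors = FALSE
)
reshaped
```

If the columns are not in a repeating order, the lapply version above is the safer choice because it matches columns by name.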
