Use index to subset dataframe based on unique values in a column - r

I have a large dataset with numerous sample IDs. A very simplified version looks something like this:
df <- data.frame(ID = rep(c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"), times = c(10, 4, 12, 19, 5, 22, 6, 7, 11, 4)),
Value = sample(x = 20:30, size = 100, replace = T))
I would like to split my large dataset into multiple smaller dataframes based on ID so that when I plot the data my graph doesn't get too crowded. In this simplified example, I would like to split it into two dataframes/plots, one with data from the first 5 unique IDs (A-E) and the other with data from the next 5 unique IDs (F-J). How can I do this easily using index notation (assuming I have hundreds of IDs)? My code below doesn't work and I don't know what's wrong with it:
subset.1 <- df[unique(df$ID)[1:5]]
subset.2 <- df[unique(df$ID)[6:10]]

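Indexing a data frame with a single character vector, as in df[unique(df$ID)[1:5]], selects columns by name, which is why your code fails. Subset the rows with a logical vector instead: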
df[df$ID %in% unique(df$ID)[1:5], ]
df[df$ID %in% unique(df$ID)[6:10], ]
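You can also use split with cut to split your dataframe into n datasets (here, 2) by group. Note that cut divides the factor codes into equal-width intervals, so the IDs are grouped in sorted order rather than in order of appearance.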
split(df, cut(as.numeric(as.factor(df$ID)), 2))
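For hundreds of IDs, the same idea can be generalised to chunks of any size. A minimal sketch, assuming the goal is groups of 5 IDs in order of first appearance (the names chunk_size, id_rank and chunks are made up for illustration and are not from the original answer):

```r
set.seed(1)  # only to make the example reproducible
df <- data.frame(ID = rep(LETTERS[1:10], times = c(10, 4, 12, 19, 5, 22, 6, 7, 11, 4)),
                 Value = sample(x = 20:30, size = 100, replace = TRUE))

chunk_size <- 5
# position of each row's ID among the unique IDs (1 for A, 2 for B, ...)
id_rank <- match(df$ID, unique(df$ID))
# IDs 1-5 fall in chunk 1, IDs 6-10 in chunk 2, and so on
chunks <- split(df, ceiling(id_rank / chunk_size))

length(chunks)           # number of smaller data frames
unique(chunks[[1]]$ID)   # IDs in the first chunk
```

Each element of chunks can then be passed to the plotting code in turn, for example with lapply.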

Related

Breaking ties based on repeated counts/Subsetting data in R

I'm trying to come up with a reasonable (if not clever) way of subsetting some data. Assume that when I create a table from the original data, it looks like this:
testdat <- data.frame(nom = c("A", "B", "C", "D", "E", "F", "G", "H", "I",
"J", "K"), cts = c(100, 50, 35, 10, 10, 5, 4, 2, 1, 1, 1))
My idea was to cut the data after the first three points here (they all have unique name/count combinations) and then take points D, E, F, and G as a group (they are the first group with repeated counts), and then points I, J, and K (second group with repeated counts). Just in case it isn't clear what I mean by "repeated counts," I mean that there's no difference between D and E except their name: they both appear 10 times in the data.
This isn't searching for duplicates (since each row is unique), but it is (since there are repeated counts in the second column). We can assume that the order is always either decreasing or repeated; it never increases (the table results were sorted in decreasing order).
How can I find the row (and row number) of the first time cts is repeated n times?
You can get the row containing the first value that repeats more than once by doing:
which(testdat$cts == rle(testdat$cts)$values[which(rle(testdat$cts)$lengths > 1)[1]])[1]
#> [1] 4
And the first entry that repeats at least three times is:
which(testdat$cts == rle(testdat$cts)$values[which(rle(testdat$cts)$lengths > 2)[1]])[1]
#> [1] 9
And you can get all the rows with a repeated count with:
which(duplicated(testdat$cts) | rev(duplicated(rev(testdat$cts))))
#> [1] 4 5 9 10 11
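The two rle one-liners above can be wrapped in a small helper that calls rle only once. This is a sketch, not part of the original answer; first_run_start is a hypothetical name, and it locates the run from the cumulative run lengths, so it does not rely on the counts being sorted:

```r
testdat <- data.frame(nom = LETTERS[1:11],
                      cts = c(100, 50, 35, 10, 10, 5, 4, 2, 1, 1, 1))

# Row number where the first run of at least n identical values starts
first_run_start <- function(x, n) {
  r <- rle(x)
  i <- which(r$lengths >= n)[1]          # index of the first long-enough run
  if (is.na(i)) return(NA_integer_)      # no run of that length
  sum(r$lengths[seq_len(i - 1)]) + 1L    # rows before that run, plus one
}

first_run_start(testdat$cts, 2)
#> [1] 4
first_run_start(testdat$cts, 3)
#> [1] 9
```

Indexing testdat with the result, e.g. testdat[first_run_start(testdat$cts, 2), ], gives the row itself.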

Compare each row in two dataframes in R

I have 2 data frames with account numbers and amounts plus some other irrelevant columns. I would like to compare the output with a Y or N if they match or not.
I need to compare the account number in row 1 in dataframe A to the account number in row 1 in dataframe B and if they match put a Y in a column or an N if they don't. I've managed to get the code to check if there is a match in the entire dataframe but I need to check each row individually.
E.g.
df1
|account.num|x1|x2|x3|
|100|a|b|c|
|101|a|b|c|
|102|a|b|c|
|103|a|b|c|
df2
|account.num|x1|x2|x3|
|100|a|b|c|
|102|a|b|c|
|101|a|b|c|
|103|a|b|c|
output
|account.num|x1|x2|x3|match|
|100|a|b|c|Y|
|101|a|b|c|N|
|102|a|b|c|N|
|103|a|b|c|Y|
So, row 1 matches as they have the same account number, but row 2 doesn't because they are different. However, the other data in the dataframe doesn't matter just that column. Can I do this without merging the data frames? (I did have tables, but they won't work. I don't know why. So sorry if that's hard to follow).
You can use == to compare account.num row by row, and use the resulting logical vector to index into c("N", "Y"). Adding 1 turns FALSE into index 1 ("N") and TRUE into index 2 ("Y").
df1$match <- c("N", "Y")[1 + (df1[[1]] == df2[[1]])]
df1
# account.num x1 x2 x3 match
#1 100 a b c Y
#2 101 a b c N
#3 102 a b c N
#4 103 a b c Y
Data:
df1 <- data.frame(account.num=100:103, x1="a", x2="b", x3="c")
df2 <- data.frame(account.num=c(100,102,101,103), x1="a", x2="b", x3="c")
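If the indexing trick feels too terse, the same comparison can be spelled out with ifelse. A minimal sketch, assuming both data frames have the same number of rows (the variable same is introduced only for illustration):

```r
df1 <- data.frame(account.num = 100:103, x1 = "a", x2 = "b", x3 = "c")
df2 <- data.frame(account.num = c(100, 102, 101, 103), x1 = "a", x2 = "b", x3 = "c")

# a positional comparison only makes sense if the row counts agree
stopifnot(nrow(df1) == nrow(df2))

same <- df1$account.num == df2$account.num   # TRUE where row i matches row i
df1$match <- ifelse(same, "Y", "N")
df1$match
#> [1] "Y" "N" "N" "Y"
```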
If you want a base R solution, here is a quick sketch. Assuming both dataframes have the same number of rows, it should work with your data.
# example dataframes
a <- data.frame(A=c(1,2,3), B=c("one","two","three"))
b <- data.frame(A=c(3,2,1), B=c("three","two","one"))
res <- c() #initialise empty result vector
for (rownum in seq_len(nrow(a))) {
# iterate over the row numbers (seq_len is safe even for a frame with zero rows)
res[rownum] <- all(a[rownum,]==b[rownum,])
}
res # result vector
# [1] FALSE TRUE FALSE
# you can put it in frame a like this. example colname is "equalB"
a$equalB <- res
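The loop above compares whole rows. A vectorised sketch of the same check, assuming both frames have identical dimensions and no NA values (row_equal is a made-up name):

```r
a <- data.frame(A = c(1, 2, 3), B = c("one", "two", "three"), stringsAsFactors = FALSE)
b <- data.frame(A = c(3, 2, 1), B = c("three", "two", "one"), stringsAsFactors = FALSE)

# a == b is a logical matrix; a row matches when every cell in it is TRUE
row_equal <- unname(rowSums(a == b) == ncol(a))
row_equal
#> [1] FALSE  TRUE FALSE
```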
If you want a tidyverse solution, you can use left_join.
The principle here is to try to match the data from df2 to the data from df1. Where a row matches, the join adds TRUE in a match column. Then the code replaces the remaining NA values with FALSE.
I'm also adding code to create the data frames from the example.
library(tidyverse)
df1 <-
tribble(~account_num, ~x1, ~x2, ~x3,
100, "a", "b", "c",
101, "a", "b", "c",
102, "a", "b", "c",
103, "a", "b", "c") %>%
rowid_to_column() # position in the df matters for this comparison,
# so it is stored as an explicit rowid column
df2 <-
tribble(~account_num, ~x1, ~x2, ~x3,
100, "a", "b", "c",
102, "a", "b", "c",
101, "a", "b", "c",
103, "a", "b", "c") %>%
rowid_to_column()
# take df1
df1 %>%
# try to match df1 with version of df2 with a new column where `match` = TRUE
# according to `rowid`, `account_num`, `x1`, `x2`, and `x3`
left_join(df2 %>%
tibble::add_column(match = TRUE),
by = c("rowid", "account_num", "x1", "x2", "x3")
) %>%
# replace the NA in `match` with FALSE in the df
replace_na(list(match = FALSE))

Convert character column to factor preserving column label

I have a dataframe that I read from the XLSX file. Every column name looks like this: CODE___DESCRIPTION so for example A1___Some funky column here. It is easier to use the codes as colnames but I want to use description when needed so it must be stored in the dataframe. This is why I am using sjlabelled package later on.
Make yourself some random data and save it as some_data.xlsx.
library(dplyr) #to play with tibbles
library(stringi) #to play with strings
library(writexl) #name speaks for itself
tibble(col1 = sample(c("a", "b", "c", NA, "N/A"), 50, replace = T),
col2 = sample(c("d", "e", "f", NA, "N/A"), 50, replace = T),
col3 = sample(c("g", "h", "i", NA, "N/A"), 50, replace = T),
col4 = sample(c("j", "k", "l", NA, "N/A"), 50, replace = T)) %>%
setNames(stri_c("A", 1:4, "___", stri_rand_strings(4, 10))) %>%
write_xlsx(path = "some_data.xlsx", col_names = T, format_headers = F)
I've created a simple function to prepare my data the way I want it.
library(sjlabelled) #to play with labelled data
label_it <- function(data = NULL, split = "___"){
#This basically makes an array of two columns (of codes and descriptions respectively)
k.n <- data %>%
names() %>%
stri_split_fixed(pattern = split, simplify = T)
data%>%
set_label(k.n[,2]) %>% #set description as each column's label
setNames(k.n[,1]) #set code as each column's name
}
First I read the data from XLSX file. Then I label it.
library(readxl) #name speaks for itself again
data <- read_xlsx("some_data.xlsx", na = c("", "N/A")) %>%
label_it()
Now each column of my dataframe is a character vector (in fact, a structure) with two attributes:
label, which is the description part
names, which is the original dataframe column name (CODE___DESCRIPTION style) and is not to be mistaken for the output of names(data), which would be the codes part
Let's say I would like to change first and third column to factor.
To do this I have tried two things:
data[,1] <- factor(data[,1], levels = c("c", "a", "b"))
data[,3] <- factor(data[,3], levels = c("h", "g", "i"))
This changes all values in those two columns to NA_integer_.
data <- data %>%
mutate(A1 = factor(A1, levels = c("c", "a", "b")),
A3 = factor(A3, levels = c("h", "g", "i")))
This changes the character vectors to factors as intended, but it drops both column attributes (label and names), which I need to preserve.
I also tried quite a lot of functions from sjlabelled, labelled and haven packages. Nothing worked as I intended. Finally, I have found a solution, but it isn't perfect and I would love to find an easier way of doing this.
The solution is to lose those attributes but then regain ('copy' in fact) them.
data <- data %>%
mutate(A1 = factor(A1, levels = c("c", "a", "b")),
A3 = factor(A3, levels = c("h", "g", "i"))) %>%
copy_labels(data)
copy_labels is a function from the sjlabelled package, used when labels are lost, e.g. due to data subsetting or, as in this example, conversion to factor.
P.S.
I would love to add r-sjlabelled and r-labelled tags because those packages are considered in this problem but am under 1500 reputation required to do this.
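One possible way to avoid the round trip through copy_labels is to carry the attribute over by hand in base R. This is only a sketch: it assumes the description lives in an attribute called "label" (as set_label appears to do), and factor_keep_label is a hypothetical helper, not an sjlabelled function. As far as I can tell, the first attempt above returns NA because data[, 1] on a tibble is still a one-column tibble rather than a vector, so data[[1]] would be needed there.

```r
# Hypothetical helper: convert to factor but carry the "label" attribute over
factor_keep_label <- function(x, levels) {
  lab <- attr(x, "label", exact = TRUE)          # remember the description
  f <- factor(as.character(x), levels = levels)  # conversion drops attributes
  attr(f, "label") <- lab                        # put the description back
  f
}

# Stand-in for one labelled column (made-up values and label)
x <- structure(c("a", "b", NA, "c", "a"), label = "Some funky column here")
f <- factor_keep_label(x, levels = c("c", "a", "b"))

levels(f)
attr(f, "label")
```

With the real data this would be used column by column, e.g. data$A1 <- factor_keep_label(data$A1, c("c", "a", "b")).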

Finding occurrence of character from multiple vector or list

I wish to find the number of unique/distinct characters that occur across multiple vectors or in a list.
Perhaps it's best to describe with an example:
In this example, let's say the "unique characters" are letters and the multiple "vectors" are books. I wish to find the number of distinct letters as the number of books increases.
# Initial data in the format of a list
book_list <- list(book_A = c("a", "b", "c", "z"),
book_B = c("c", "d", "a"),
book_C = c("b", "a", "c", "e", "x"))
# Initial data in the format of multiple vectors
book_A <- c("a", "b", "c", "z")
book_B <- c("c", "d", "a")
book_C <- c("b", "a", "c", "e", "x")
# Finding the unique letters in each book
# This is the part I'm struggling to code as a loop
one_book <- length(unique(book_A))
two_book <- length(unique(c(book_A, book_B)))
three_book <- length(unique(c(book_A, book_B, book_C)))
# Plot the desired output
plot(x=c(1,2,3),
y=c(one_book, two_book, three_book),
ylab = "Number of unique letters", xlab = "Book Number",
main="The occurrence of unique letters as number of books increases")
Note: The real data set is much bigger. Each vector (book_A, book_B, etc.) is about 7000 in length.
I attempted to solve the problem with dplyr and a data frame, but I'm not quite there yet.
# Explore data frame option with an example data
library(dplyr)
df <- read.delim("http://m.uploadedit.com/ba3s/148950223626.txt")
# Group them
df_group <- dplyr::group_by(df, book) %>% summarize(occurence = length(letter))
# Use the cumulative sum
plot(x=1:length(unique(df$book)), y=cumsum(df_group$occurence))
But I know the plot is not correct, as it only plots the cumulative sum rather than what I intended. Any hints would be most helpful.
To add to the complexity, it would be nice if the books could be plotted in order of length, with the book that has the fewest letters first. Something along the lines of:
# Example ;
# Find the length of the letters in the book
lapply(book_list, length)
# I know that book_B has the fewest letters (3);
# followed by book_A (4) then book_C (5)
one_book <- length(unique(book_B))
two_book <- length(unique(c(book_B, book_A)))
three_book <- length(unique(c(book_B, book_A, book_C)))
plot(x=c(1,2,3),
y=c(one_book, two_book, three_book),
ylab = "Number of letters", xlab = "Book Number")
You can use Reduce with accumulate = TRUE, i.e.
sapply(Reduce(c, book_list, accumulate = TRUE), function(i) length(unique(i)))
#[1] 4 5 7
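For the second part of the question (shortest book first), the list can be reordered by length before accumulating. A sketch along the same lines, using union instead of c so that each step already holds only distinct letters (ordered and counts are illustrative names, not from the answer above):

```r
book_list <- list(book_A = c("a", "b", "c", "z"),
                  book_B = c("c", "d", "a"),
                  book_C = c("b", "a", "c", "e", "x"))

# shortest book first: book_B (3), book_A (4), book_C (5)
ordered <- book_list[order(lengths(book_list))]

# running union of letters; unique() guards against repeats within the first book
counts <- lengths(Reduce(union, lapply(ordered, unique), accumulate = TRUE))
counts
#> [1] 3 5 7

plot(x = seq_along(counts), y = counts,
     xlab = "Book Number", ylab = "Number of unique letters")
```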

R - reshape dataframe from duplicated column names but unique values

Hi I have a dataframe that looks like the following
I want to apply a function to it so that it reshapes it like this
How would I do that?
Here is one option that could work. We loop through the unique names of the dataset, create a logical index with ==, extract the matching columns, unlist them, create a data.frame, and then cbind the pieces together or just use data.frame (the assumption is that each name is duplicated the same number of times).
data.frame(lapply(unique(names(df1)), function(x)
setNames(data.frame(unlist(df1[names(df1)==x], use.names = FALSE)), x)))
# type model make
#1 a b c
#2 d e f
data
df1 <- data.frame(type = "a", model = "b", make = "c", type = "d",
model = "e",
make = "f", check.names=FALSE, stringsAsFactors=FALSE)
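An alternative sketch of the same reshaping that groups the columns with split instead of looping over the names. It relies on the same assumption (every name is repeated equally often); cols, grp, stacked and out are names made up for this example:

```r
df1 <- data.frame(type = "a", model = "b", make = "c", type = "d",
                  model = "e", make = "f",
                  check.names = FALSE, stringsAsFactors = FALSE)

cols <- as.list(df1)   # plain list of columns, duplicate names kept
# factor levels in order of first appearance, so the column order is preserved
grp <- factor(names(cols), levels = unique(names(cols)))
# stack the columns that share a name into one vector each
stacked <- lapply(split(cols, grp), unlist, use.names = FALSE)
out <- data.frame(stacked, stringsAsFactors = FALSE)
out
```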
