Generation of Unique ID

Generation of Unique ID - r

Can some help with how to generate a unique 6 digit URN in R,as I don't know how to do this please.
Below are the rule for the URN
It needs to be alphanumeric,start with letter and maybe end with letter (e.g AA34YB)
Use only upper case alphabets
Do not use the alphabets O or I (this is the alphabet after H and before J)
Use only digits from 1- 9. Exclude 0
First two digit should be letter,then followed by 2 digit number and end with 2 digit letter,e.g "AA22DD","EE34TY","ER67YU"
All records must contain number as shown in rule 5
IT MUST BE 6 DIGIT PLEASE
I would love to generate upto 4 million unique records please.Any R code suggestion is highly welcome.I am not an expert in R,actually new to R
Thanks very much

here is a function that will generate ordered unique IDs:
generateIDs <- function(n, existing=NULL){
# Initialise a counter to produce IDs
counter <- 0
# Create a arrays of letters and digits
letters <- LETTERS[LETTERS %in% c("O", "I") == FALSE]
digits <- 1:9
# Initialise an array to store the IDs created
ids <- c()
# iterate through the letters
for(first in letters){
# iterate through the letters
for(second in letters){
# iterate through the digits
for(third in digits){
# iterate through the digits
for(fourth in digits){
# iterate through the letters
for(fifth in letters){
# iterate through the letters
for(sixth in letters){
# Create the unique code
code <- paste0(first, second, third, fourth, fifth, sixth)
# Check if already exists
if(code %in% existing == FALSE){
# Iterate the counter
counter <- counter + 1
# Store the ID
ids[counter] <- code
existing[length(existing) + 1] <- code
# Check if created enough IDs
if(counter == n){
return(ids)
}
# Note progress
if(counter %% 10000 == 0){
cat("\rCreated", counter, "ids!")
}
}
}
}
}
}
}
}
}
That is a horrific number of nested for loops but it avoids the inefficient random generation of IDs. You can test it using the following code:
generateIDs(10)
"AA11AA" "AA11AB" "AA11AC" "AA11AD" "AA11AE" "AA11AF" "AA11AG" "AA11AH" "AA11AJ" "AA11AK"
Note that ideally you should run this function once. Theoretically, this function could create up to 26873856 unique IDs but it doesn't scale well!
See #GKi's answer for a much better solution! :-)

You can use expand.grid to generate Unique ID's.
n <- 10
t1 <- LETTERS[!LETTERS %in% c("O", "I")]
t2 <- 1:9
#t1 <- rawToChar(as.raw(c(65:72,74:78,80:90)), multiple = TRUE) #Alternativ
#t2 <- rawToChar(as.raw(49:57), multiple = TRUE)
apply(expand.grid(t1, t1, t2, t2, t1, t1)[seq(n),], 1, paste, collapse = "")
# 1 2 3 4 5 6 7 8
#"AA11AA" "BA11AA" "CA11AA" "DA11AA" "EA11AA" "FA11AA" "GA11AA" "HA11AA"
# 9 10
#"JA11AA" "KA11AA"
set.seed(1) #Sample randomly
apply(expand.grid(t1, t1, t2, t2, t1, t1)[sample(length(t1)^4*length(t2)^2, n),]
, 1, paste, collapse = "")
#10938497 17633234 12201267 18120554 21612295 21509711 13901861 6841049
#"SL15UK" "BG59TR" "CU65XL" "BH54ES" "GJ13HV" "YF31FV" "EE79KN" "SV66CG"
#23945701 10770210
#"NK23KX" "TG68QK"
In case it needs to much memory look #Joseph-Crispell's answer.

Related

Adding values in a sample

If I have a sample of numbers, say a set of dice rolls given by:
Rolls <- sample(1:6, 4, replace = TRUE)
What I would like to do is take any two values in the sample set and be able to add them, however, after adding them I'd like to put the new value into the sample AND remove the values that were used in the addition.
For example, say my sample is 2 4 3 6 and I choose to add 4 and 6, I would then want my sample to look like this: 2 3 10 (the order they are arranged in isn't important for me)
(Side note: It would also be great if the code would then ask the user if they'd like to add the 2 leftover values or not, in the case I've given, after adding 10 to the sample and removing 4 and 6, it would then ask "would you like to add your other two values?" and if the user says yes it would add 2 and 3, put 5 in the sample then remove 2 and 3, same as for 4 and 6.)
Any help would be much appreciated

Here is a simple solution you could use.
# Create our Array of four
Rolls<-sample(1:6, 4, replace = TRUE)
cat("Values from our sample are: ", Rolls)
# Auxiliary variable to check if the input is correct
selectedVariables <- FALSE
values <- vector()
# Get the selected variables by the user
while(!selectedVariables){
number <- readline(prompt='Write two values from the sample (separate them using a comma): ')
# Transform input to int array
values <- as.integer(unlist(strsplit(number,",")))
# Check if both values are in the array considering repetitions
# (if the user select twice the same number and there is only one in the sample reject the input given)
if(values[1] %in% Rolls & values[2] %in% Rolls[-match(values[1],Rolls) & length(values)==2]){
selectedVariables = TRUE
}
else{
cat("Please select two valid values from the sample. Example: ", sample(Rolls,2))
}
}
# Depending on whether or not they are equal, we will use "which" or "match"
if (values[1] == values[2]){
# If they are equal we'll use which to get both positions of this values
RollsAux<- append(Rolls[! 1:length(Rolls) %in% which(Rolls %in% values)], sum(values))
}else{
# If they are different we'll get only the first appearance in our array of four
RollsAux<- append(Rolls[! 1:length(Rolls) %in% match(values,Rolls)], sum(values))
}
ans <- readline(prompt="Would you like to add up the other numbers? [y/n] ")
if (ans == "y" || ans == "Y"){
# If the user wants to sum the other two values this adds up the first two values in a new array with the last result
Result<- append(RollsAux[length(RollsAux)], sum(RollsAux[1:length(RollsAux)-1]))
}else{
# If not, the resulting array will be the same as our auxiliary array
Result <- RollsAux
}
Like I've mentioned in the comments you could also force the user to insert the positions of the values that he wants to select like this:
while(!selectedVariables){
number <- readline(prompt='Write the positions of the two values from the sample (separate them using a comma): ')
# Transform input to int array
values <- as.integer(unlist(strsplit(number,",")))
# Check if the user has given the same position twice
if (any(duplicated(values))){
print("Positions need to be different.")
}
else{
# Checks if all the positions given are in the length of the array
if(all(values <= length(Rolls))&length(values)==2){
values <- Rolls[values]
selectedVariables = TRUE
}
else{
print("Positions selected exceed length of array sample.")
}
}
}

I suggest you create a smaller generic function like so, which will change your sample by adding together two chosen values of the sample:
change_sample <- function(sample, i1, i2) {
new_sample <- sample
new_sample[i2] <- sample[i1] + sample[i2]
return(new_sample[-i1])
}
If you then want to apply this multiple times, you could chain each application of this function together to achieve your goal.

As in your previous question we can utilize scan() to handle user input.
Code
sum.dice <- function(Roll, p1, p2){
# extract values
tmp1 <- Roll[c(p1, p2)]
# ad and remove
tmp2 <- c(Roll[-c(p1, p2)], sum(tmp1))
# User input whether to add remaining wo numbers
cat("Your sample currently looks as follows: \n\n",
tmp2
,"\n\n Would you like to sum up the remaining two entries of your sample?")
# scan
tmp3 <- scan(what = "")
# add remaining values if so desired
if(tolower(tmp3) == "yes"){
c(sum(Roll[-c(p1, p2)]), sum(tmp1))
} else{
tmp2
}
}
sum.dice(Rolls, 3, 4)
# Your sample currently looks as follows:
#
# 5 3 6
#
# Would you like to sum up the remaining two entries of your sample?
# 1: no
# 2: ""
# Read 1 item
# [1] 5 3 6
# or
# Your sample currently looks as follows:
#
# 5 3 6
#
# Would you like to sum up the remaining two entries of your sample?
# 1: yes
# 2: ""
# Read 1 item
# [1] 8 6

DNA conditional frequency in R

I'm trying to find if there is any conditional dependence within 2 different DNA sequences in R
This is my code, however i'm getting an error;
Error in `[.data.frame`(data, i) : undefined columns selected
I'm not sure where the issue is, if I parentheses the data[i-1]==bases[b2], i just get multiple unexpected}, which is the only thing I can think else to do.
for (b1 in 1:length(bases))
{
for (b2 in 1:length(bases))
{
count = 1
for (i in 2:length(mydata1))
{
if ((mydata1[i]==bases[b1]) & mydata1[i-1]==bases[b2])
{
count = count+1
}
}
b3 = c(bases[b1], bases[b2], count)
print(b3)
}
}
_I'm expecting essentially a list of certain DNA bases, for example I see it as if the DNA sequence IS conditional upon the previous base then;.
[1] "A" "C" "002"
[1] "A" "C" "005"
[1] "A" "C" "009"
and so on, that can show me any indication as to whether a certain base has any sort of affect upon the identity of the following base, by clearly showing a condition for A to be previous to C.
Ok so essentially the mydata1 (there is also mydata2) are DNA sequences, that is to say a list of "A", "G", "C" and "T", each of which is 10,000 bases long.
As shown here;
V1
1 T
2 C
3 G
4 G
5 T
6 G
7 G
8 G
9 C
10 A
I'm tasked with trying to determine if the sequence has bases that are dependent on one another, so if [1] T affects the presence of [2] C, etc. One of the sequences is dependent, the other is not.

If I understand correctly, you want to count the occurrences of each pair of nucleotides i, i+1 in a sequence of DNA. You can achieve this with R function table; an example is provided below.
# input sequence
seq <- "ACGTACTGCACAAACTAC"
# length of input sequence
length_seq <- nchar(seq, type="chars")
# first substring: from 1 to second-last
seq1 <- substr(seq, 1, (length_seq - 1))
# second substring: from 2 to last
seq2 <- substring(seq, 2, length_seq)
# split strings
seq1_split <- strsplit(seq1, "")[[1]]
seq2_split <- strsplit(seq2, "")[[1]]
# initialize vectors
first_nt <- vector(mode="character", length = (length_seq - 1))
second_nt <- vector(mode="character", length = (length_seq -1))
# fill vectors
count = 0
for (b in seq1_split)
{
count = count + 1
first_nt[count] <- b
}
count = 0
for (b in seq2_split)
{
count = count + 1
second_nt[count] <- b
}
# create matrix with character i and i+1 in each row
mat <- matrix(c(first_nt, second_nt), nrow=(length_seq - 1))
# collapse matrix
to_table <- apply(mat, 1, paste, collapse="")
# table
my_table <- table(to_table)
print(my_table)

Shuffling string (non-randomly) for maximal difference

After trying for an embarrassingly long time and extensive searches online, I come to you with a problem.
I am looking for a method to (non-randomly) shuffle a string to get a string which has the maximal ‘distance’ from the original one, while still containing the same set of characters.
My particular case is for short nucleotide sequences (4-8 nt long), as represented by these example sequences:
seq_1<-"ACTG"
seq_2<-"ATGTT"
seq_3<-"ACGTGCT"
For each sequence, I would like to get a scramble sequence which contains the same nucleobase count, but in a different order.
A favourable scramble sequence for seq_3 could be something like;
seq_3.scramble<-"CATGTGC"
,where none of the sequence positions 1-7 has the same nucleobase, but the overall nucleobase count is the same (A =1, C = 2, G= 2, T=2). Naturally it would not always be possible to get a completely different string, but these I would just flag in the output.
I am not particularly interested in randomising the sequence and would prefer a method which makes these scramble sequences in a consistent manner.
Do you have any ideas?

python, since I don't know r, but the basic solution is as follows
def calcDistance(originalString,newString):
d = 0
i=0
while i < len(originalString):
if originalString[i] != newString[i]: d=d+1
i=i+1
s = "ACTG"
d_max = 0
s_final = ""
for combo in itertools.permutations(s):
if calcDistance(s,combo) > d_max:
d_max = calcDistance(s,combo)
s_final = combo

Give this a try. Rather than return a single string that fits your criteria, I return a data frame of all strings sorted by their string-distance score. String-distance score is calculated using stringdist(..., ..., method=hamming), which determines number of substitutions required to convert string A to B.
seq_3<-"ACGTGCT"
myfun <- function(S) {
require(combinat)
require(dplyr)
require(stringdist)
vec <- unlist(strsplit(S, ""))
P <- sapply(permn(vec), function(i) paste(i, collapse=""))
Dist <- c(stringdist(S, P, method="hamming"))
df <- data.frame(seq = P, HD = Dist, fixed=TRUE) %>%
distinct(seq, HD) %>%
arrange(desc(HD))
return(df)
}
library(combinat)
library(dplyr)
library(stringdist)
head(myfun(seq_3), 10)
# seq HD
# 1 TACGTGC 7
# 2 TACGCTG 7
# 3 CACGTTG 7
# 4 GACGTTC 7
# 5 CGACTTG 7
# 6 CGTACTG 7
# 7 TGCACTG 7
# 8 GTCACTG 7
# 9 GACCTTG 7
# 10 GATCCTG 7

Crafty ways to make super efficient R vector processing?

I have a very simple assignment for a project that requires processing a large amount of information; my professor's first words were "this will take a while to run" so I figured it'd be a good opportunity to spend that time i would be running my program making a super efficient one :P
Basically, I have a input file where each line is either a node or details. It might look something like:
#NODE1_length_17_2309482.2394832.2
val1 5 18
val2 6 21
val3 100 23
val4 9 6
#NODE2_length_1298_23948349.23984.2
val1 2 293
...
and so on. Basically, I want to know how I can efficiently use R to either output, line by line, something like:
NODE1_length_17 val1 18
NODE1_length_17 val2 21
...
So, as you can see, I would want to node name, the value, and the third column of the value line. I have implemented it using an ultra slow for loop that uses strsplit a whole bunch of times, and obviously this is not ideal. My current implementation looks like:
nodevals <- which(substring(data, 1, 1) == "#") # find lines with nodes
vallines <- which(substring(data, 1, 3) == "val")
out <- vector(mode="character", length=length(vallines))
for (i in vallines) {
line_ra <- strsplit(data[i], "\\s+")[[1]]
... and so on using a bunch of str splits and pastes to reformat
out[i] <- paste(node, val, value, sep="\t")
}
Does anybody know how I can optimize this using data frames or crafty vector manipulations?
EDIT: I'm implementing vecor wise splitting for everything, and so far I've found that the main thing I can't split correctly is the names of each node. I'm trying to do something like,
names <- data[max(nodes[nodelines < vallines])]
where nodes are the names of each line containing a node and vallines are the numbers of each line containing a val. The return vector should have the same number of elements as vallines. The goal is to find the maximum nodelines that is less than the line number of vallines for each vallines. Any thoughts?

I suggest using data.table package - it has very fast string split function tstrsplit.
library(data.table)
#read from file
data <- scan('data.txt', 'character', sep = '\n')
#create separate objects for nodes and values
dt <- data.table(data)
dt[, c('IsNode', 'NodeId') := list(IsNode <- substr(data, 1, 1) == '#', cumsum(IsNode))]
nodes <- dt[IsNode == TRUE, list(NodeId, data)]
values <- dt[IsNode == FALSE, list(data, NodeId)]
#split string and join back values and nodes
tmp <- values[, tstrsplit(data, '\\s+')]
values <- data.table(values[, list(NodeId)], tmp[, list(val = V1, value = V3)], key = 'NodeId')
res <- values[nodes]

Is it possible to swap columns around in a data frame using R?

I have three variables in a data frame and would like to swap the 4 columns around from
"dam" "piglet" "fdate" "ssire"
to
"piglet" "ssire" "dam" "tdate"
Is there any way I can do the swapping using R?
Any help would be very much appreciated.
Baz

dfrm <- dfrm[c("piglet", "ssire", "dam", "tdate")]
OR:
dfrm <- dfrm[ , c("piglet", "ssire", "dam", "tdate")]

d <- data.frame(a=1:3, b=11:13, c=21:23)
d
# a b c
#1 1 11 21
#2 2 12 22
#3 3 13 23
d2 <- d[,c("b", "c", "a")]
d2
# b c a
#1 11 21 1
#2 12 22 2
#3 13 23 3
or you can do same thing using index:
d3 <- d[,c(2, 3, 1)]
d3
# b c a
#1 11 21 1
#2 12 22 2
#3 13 23 3

To summarise the other posts, there are three ways of changing the column order, and two ways of specifying the indexing in each method.
Given a sample data frame
dfr <- data.frame(
dam = 1:5,
piglet = runif(5),
fdate = letters[1:5],
ssire = rnorm(5)
)
Kohske's answer: You can use standard matrix-like indexing using column numbers
dfr[, c(2, 4, 1, 3)]
or using column names
dfr[, c("piglet", "ssire", "dam", "fdate")]
DWin & Gavin's answer: Data frames allow you to omit the row argument when specifying the index.
dfr[c(2, 4, 1, 3)]
dfr[c("piglet", "ssire", "dam", "fdate")]
PaulHurleyuk's answer: You can also use subset.
subset(dfr, select = c(2, 4, 1, 3))
subset(dfr, select = c(c("piglet", "ssire", "dam", "fdate")))

You can use subset's 'select' argument;
#Assume df contains "dam" "piglet" "fdate" "ssire"
newdf<-subset(df, select=c("piglet", "ssire", "dam", "tdate"))

I noticed that this is almost an 8-year old question. But for people who are starting to learn R and might stumble upon this question, like I did, you can now use a much flexible select() function from dplyr package to accomplish the swapping operation as follows.
# Install and load the dplyr package
install.packages("dplyr")
library("dplyr")
# Override the existing data frame with the desired column order
df <- select(df, piglet, ssire, dam, tdate)
This approach has following advantages:
You will have to type less as the select() does not require variable names to be enclosed within quotes.
In case your data frame has more than 4 variables, you can utilize select helper functions such as starts_with(), ends_with(), etc. to select multiple columns without having to name each column and rearrange them with much ease.

Relevance Note: In response to some users (myself included) that would like to swap columns without having to specify every column, I wrote this answer up.
TL;DR: A one-liner for numerical indices is provided herein and a function for swapping exactly 2 nominal and numerical indices at the end, neither using imports, that will correctly swap any two columns in a data frame of any size is provided. A function that allows the reassignment of an arbitrary number of columns that may cause unavoidable superfluous swaps if not used carefully is also made available (read more & get functions in Summary section)
Preliminary Solution
Suppose you have some huge (or not) data frame, DF, and you only know the indices of the two columns you want to swap, say 1 < n < m < length(DF). (Also important is that your columns are not adjacent, i.e. |n-m| > 1 which is very likely to be the case in our "huge" data frame but not necessarily for smaller ones; work-arounds for all degenerate cases are provided at the end).
Because it is huge, there are a ton of columns and you don't want to have to specify every other column by hand, or it isn't huge and you're just lazy someone with fine taste in coding, either way, this one-liner will do the trick:
DF <- DF[ c( 1:(n-1), m, (n+1):(m-1), n, (m+1):length(DF) ) ]
Each piece works like this:
1:(n-1) # This keeps every column before column `n` in place
m # This places column `m` where column `n` was
(n+1):(m-1) # This keeps every column between the two in place
n # This places column `n` where column `m` was
(m+1):length(DF) # This keeps every column after column `m` in place
Generalizing for Degenerates
Because of how the : operator works, i.e. allowing "backwards-ranges" like this,
> 10:0
[1] 10 9 8 7 6 5 4 3 2 1 0
we have to be careful about our choices and placements of n and m, hence our previous restrictions. For instance, n < m doesn't lose us any generality (one of the columns has to be before the other one if they are different), however, it means we do need to be careful about which goes where in our line of code. We can make it so that we don't have to check this condition with the following modification:
DF <- DF[ c( 1:(min(n,m)-1), max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m), (max(n,m)+1):length(DF) ) ]
We have replaced every instance of n and m with min(n,m) and max(n,m) respectively, meaning that the correct ordering for our code will be preserved even in the case that m > n.
In the cases where min(n,m) == 1, max(n,m) == length(DF), both of those at the same time, and |n-m| == 1, we we will make some unreadable less aesthetic modifications involving if\else to forget about having to check if these are the case. Versions for where you know that one of these are the case, (i.e. you are always swapping some interior column with the first column, swapping some interior column with the last column, swapping the first and last columns, or swapping two adjacent columns), you can actually express these actions more succinctly because they usually just require omitting parts from our restricted case:
# Swapping not the last column with the first column
# We just got rid of 1:(min(n,m)-1) because it would be invalid and not what we meant
# since min(n,m) == 1
# Now we just stick the other column right at the front
DF <- DF[ c( max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m), (max(n,m)+1):length(DF) ) ]
# Also equivalent since we know min(n,m) == 1, for the leftover index i
DF <- DF[ c( i, 2:(i-1), 1, (i+1):length(DF) ) ]
# Swapping not the first column with the last column
# Similarly, we just got rid of (max(n,m)+1):length(DF) because it would also be invalid
# and not what we meant since max(n,m) == length(DF)
# Now we just stick the other column right at the end
DF <- DF[ c( 1:(min(n,m)-1), max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m) ) ]
# Also equivalent since we know max(n,m) == length(DF), for the leftover index, say i
DF <- DF[ c( 1:(i-1), length(DF), (i+1):(length(DF)-1), i ) ]
# Swapping the first column with the last column
DF <- DF[ c( max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m) ) ]
# Also equivalent (for if you don't actually know the length beforehand, as assumed
# elsewhere)
DF <- DF[ c( length(DF), 2:(length(DF)-1), 1 ) ]
# Swapping two interior adjacent columns
# Here we drop the explicit swap on either side of our middle column segment
# This is actually enough because then the middle segment becomes a backwards range
# because we know that `min(n,m) + 1 = max(n,m)`
# The range is just an ordering of the two adjacent indices from largest to smallest
DF <- DF[ c( 1:(min(n,m)-1), (min(n,m)+1):(max(n,m)-1), (max(n,m)+1):length(DF) )]
"But!", I hear you saying, "What if more than one of these cases occur simultaneously, like did in the third version in the block above!?". Right, coding up versions for each case is an enormous waste of time if one wants to be able to "swap columns" in the most general sense.
Swapping any Two Columns
It will be easiest to generalize our code to cover all of the cases at the same time, because they all employ essentially the same strategy. We will use if\else to keep our code a one-liner:
DF <- DF[ if (n==m) 1:length(DF) else c( (if (min(n,m)==1) c() else 1:(min(n,m)-1) ), (if (min(n,m)+1 == max(n,m)) (min(n,m)+1):(max(n,m)-1) else c( max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m))), (if (max(n,m)==length(DF)) c() else (max(n,m)+1):length(DF) ) ) ]
That's totally unreadable and probably pretty unfriendly to anyone who might try to understand or recreate your code (including yourself), so better to box it up in a function.
# A function that swaps the `n` column and `m` column in the data frame DF
swap <- function(DF, n, m)
{
return (DF[ if (n==m) 1:length(DF) else c( (if (min(n,m)==1) c() else 1:(min(n,m)-1) ), (if (min(n,m)+1 == max(n,m)) (min(n,m)+1):(max(n,m)-1) else c( max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m))), (if (max(n,m)==length(DF)) c() else (max(n,m)+1):length(DF) ) ) ])
}
A more robust version that can also swap on column names and has semi-explanatory comments:
# Returns data frame object with columns `n` and `m` swapped
# `n` and `m` can be column names, numerical indices, or a heterogeneous pair of both
swap <- function(DF, n, m)
{
# Of course, first, we want to make sure that n != m,
# because if they do, we don't need to do anything
if (n==m) return(DF)
# Next, if either n or m is a column name, we want to get its index
# We assume that if they aren't column names, they are indices (integers)
n <- if (class(n)=="character" & is.na(suppressWarnings(as.integer(n)))) which(colnames(DF)==n) else as.integer(n)
m <- if (class(m)=="character" & is.na(supressWarnings(as.integer(m)))) which(colnames(DF)==m) else as.integer(m)
# Make sure each index is actually valid
if (!(1<=n & n<=length(DF))) stop( "`n` represents invalid index!" )
if (!(1<=m & m<=length(DF))) stop( "`m` represents invalid index!" )
# Also, for readability, lets go ahead and set which column is earlier, and which is later
earlier <- min(n,m)
later <- max(n,m)
# This constructs the first third of the indices
# These are the columns that, if any, come before the earlier column you are swapping
firstThird <- if ( earlier==1 ) c() else 1:(earlier-1)
# This constructs the last third of the the indices
# These are the columns, if any, that come after the later column you are swapping
lastThird <- if ( later==length(DF) ) c() else (later+1):length(DF)
# This checks if the columns to be swapped are adjacent and then constructs the
# secondThird accordingly
if ( earlier+1 == later )
{
# Here; the second third is a list of the two columns ordered from later to earlier
secondThird <- (earlier+1):(later-1)
}
else
{
# Here; the second third is a list of
# the later column you want to swap
# the columns in between
# and then the earlier column you want to swap
secondThird <- c( later, (earlier+1):(later-1), earlier)
}
# Now we assemble our indices and return our permutation of DF
return (DF[ c( firstThird, secondThird, lastThird ) ])
}
And, for ease of repatriation with less of the spatial cost, a comment-less version that checks index validity and can handle column names, i.e. does everything in pretty close to the smallest space it can (yes, you could vectorize, using ifelse(...), the two checks that get performed, but then you'd have to unpack the vector back into n,m or change how the final line is written):
swap <- function(DF, n, m)
{
n <- if (class(n)=="character" & is.na(suppressWarnings(as.integer(n)))) which(colnames(DF)==n) else as.integer(n)
m <- if (class(m)=="character" & is.na(suppressWarnings(as.integer(m)))) which(colnames(DF)==m) else as.integer(m)
if (!(1<=n & n<=length(DF))) stop( "`n` represents invalid index!" )
if (!(1<=m & m<=length(DF))) stop( "`m` represents invalid index!" )
return (DF[ if (n==m) 1:length(DF) else c( (if (min(n,m)==1) c() else 1:(min(n,m)-1) ), (if (min(n,m)+1 == max(n,m)) (min(n,m)+1):(max(n,m)-1) else c( max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m))), (if (max(n,m)==length(DF)) c() else (max(n,m)+1):length(DF) ) ) ])
}
Permutations (or How to Do Specifically What the Question Asked and More!)
With our swap function in tow, we can try to actually do what the original question asked. The easiest way to do this, is to build a function that utilizes the really cool power that comes with a choice of heterogeneous arguments. Create a mapping:
mapping <- data.frame( "piglet" = 1, "ssire" = 2, "dam" = 3, "tdate" = 4)
In the case of the original question, these are all of the columns in our original data frame, but we will build a function where this doesn't have to be the case:
# A function that takes two data frames, one with actual data: DF, and the other with a
# rearrangement of the columns: R
# R must be structured so that colnames(R) is a subset of colnames(DF)
# Alternatively, R can be structured so that 1 <= as.integer(colnames(R)) <= length(DF)
# Further, 1 <= R$column <= length(DF), and length(R$column) == 1
# These structural requirements on R are not checked
# This is for brevity and because most likely R has been created specifically for use with
# this function
rearrange <- function(DF, R)
{
for (col in colnames(R))
{
DF <- swap(DF, col, R[col])
}
return (DF)
}
Wait, that's it? Yup. This will swap every column name to the appropriate placement. The power for such simplicity comes from swap taking heterogeneous arguments meaning we can specify the moving column name that we want to put somewhere, and so long as we only ever try to put one column in each position (which we should), once we put that column where it belongs, it won't move again. This means that even though it seems like later swaps could undo previous placements, the heterogeneous arguments make certain that won't happen, and so additionally, the order of the columns in our mapping also doesn't matter. This is a really nice quality because it means that we aren't kicking this whole "organizing the data" issue down the road too much. You only have to be able to determine which placement you want to send each column you want to move to.
Ok, ok, there is a catch. If you don't reassign the entire data frame when you do this, then you have superfluous swaps that occur, meaning that if you re-arrange over a subset of columns that isn't "closed", i.e. not every column name has an index that is represented in the rearrangement, then other columns that you didn't explicitly say to move may get moved to other places they don't exactly belong. This can be handled by creating your mapping very carefully, or simply using numerical indices mapping to other numerical indices. In the latter case, this doesn't solve the issue, but it makes more explicit what swaps are taking place and in what order so planning the rearrangement is more explicit and thus less likely to lead to problematic superfluous swaps.
Summary
You can use the swap function that we built to successfully swap exactly two columns or the rearrange function with a "rearrangement" data frame specifying where to send each column name you want to move. In the case of the rearrange function, if any of the placements chosen for each column name are not already occupied by one of the specified columns (i.e. not in colnames(R)), then superfluous swaps can and are very likely to occur (The only instance they won't is when every superfluous swap has a partner superfluous swap that undoes it before the end. This is, as stated, very unlikely to happen by accident, but the mapping can be structured to accomplish this outcome in practice).
swap <- function(DF, n, m)
{
n <- if (class(n)=="character" & is.na(suppressWarnings(as.integer(n)))) which(colnames(DF)==n) else as.integer(n)
m <- if (class(m)=="character" & is.na(suppressWarnings(as.integer(m)))) which(colnames(DF)==m) else as.integer(m)
if (!(1<=n & n<=length(DF))) stop( "`n` represents invalid index!" )
if (!(1<=m & m<=length(DF))) stop( "`m` represents invalid index!" )
return (DF[ if (n==m) 1:length(DF) else c( (if (min(n,m)==1) c() else 1:(min(n,m)-1) ), (if (min(n,m)+1 == max(n,m)) (min(n,m)+1):(max(n,m)-1) else c( max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m))), (if (max(n,m)==length(DF)) c() else (max(n,m)+1):length(DF) ) ) ])
}
rearrange <- function(DF, R)
{
for (col in colnames(R))
{
DF <- swap(DF, col, R[col])
}
return (DF)
}

I quickly wrote a function that takes a vector v and column indexes a and b which you want to swap.
swappy = function(v,a,b){ # where v is a dataframe, a and b are the columns indexes to swap
name = deparse(substitute(v))
helpy = v[,a]
v[,a] = v[,b]
v[,b] = helpy
name1 = colnames(v)[a]
name2 = colnames(v)[b]
colnames(v)[a] = name2
colnames(v)[b] = name1
assign(name,value = v , envir =.GlobalEnv)
}

I was using the function by Khôra Willis, which is helpful. But I encountered an error. I tried to make corrections. Here is R code that finally works. The arguments n and m could either be column names or column numbers in data frame DF.
require(tidyverse)
swap <- function(DF, n, m)
{
if (class(n)=="character") n <- which(colnames(DF)==n)
if (class(m)=="character") m <- which(colnames(DF)==m)
p <- NCOL(DF)
if (!(1<=n & n<=p)) stop("`n` represents invalid index!")
if (!(1<=m & m<=p)) stop("`m` represents invalid index!")
index <- 1:p
index[n] <- m; index[m] <- n
DF0 <- DF %>% select(all_of(index))
return(DF0)
}

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Generation of Unique ID - r

Related

Adding values in a sample

DNA conditional frequency in R

Shuffling string (non-randomly) for maximal difference

Crafty ways to make super efficient R vector processing?

Is it possible to swap columns around in a data frame using R?

Categories

Resources