Selectively replacing columns in R with their delta values

I've got data being read into a data frame R, by column. Some of the columns will increase in value; for those columns only, I want to replace each value (n) with its difference from the previous value in that column. For example, looking at an individual column, I want
c(1,2,5,7,8)
to be replaced by
c(1,3,2,1)
which are the differences between successive elements.
However, it's getting really late in the day, and I think my brain has just stopped working. Here's my code at present
col1 <- c(1,2,3,4,NA,2,3,1) # This column rises and falls, so we want to ignore it
col2 <- c(1,2,3,5,NA,5,6,7) # Note: this column always rises in value, so we want to replace it with deltas
col3 <- c(5,4,6,7,NA,9,3,5) # This column rises and falls, so we want to ignore it
d <- cbind(col1, col2, col3)
d
fix_data <- function(data) {
  # Iterate through each column...
  for (column in data[,1:dim(data)[2]]) {
    lastvalue <- 0
    # Now walk through each value in the column,
    # checking to see if the column consistently rises in value
    for (value in column) {
      if (is.na(value) == FALSE) { # Need to ignore NAs
        if (value >= lastvalue) {
          alwaysIncrementing <- TRUE
        } else {
          alwaysIncrementing <- FALSE
          break
        }
      }
    }
    if (alwaysIncrementing) {
      print(paste("Column", column, "always increments"))
    }
    # If a column is always incrementing, alwaysIncrementing will now be TRUE
    # In this case, I want to replace each element in the column with the delta between successive
    # elements. The size of the column shrinks by 1 in doing this, so just prepend a copy of
    # the 1st element to the start of the list to ensure the column length remains the same
    if (alwaysIncrementing) {
      print(paste("This is an incrementing column:", colnames(column)))
      column <- c(column[1], diff(column, lag=1))
    }
  }
  data
}
fix_data(d)
d
If you copy/paste this code into RGui, you'll see that it doesn't do anything to the supplied data frame.
Besides losing my mind, what am I doing wrong??
Thanks in advance

Without addressing the code in any detail, you're assigning values to column, which is a local variable within the loop (i.e. there is no relationship between column and data in that context). You need to assign those values to the appropriate column of data.
Also, data is local to your function, so you need to assign the function's return value back to d after running it, i.e. d <- fix_data(d).
Incidentally, you can use diff to see if any value is incrementing rather than looping over every value:
idx <- apply(d, 2, function(x) !any(diff(x[!is.na(x)]) < 0))
d[,idx] <- blah
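Putting those two points together, a corrected fix_data might look like the sketch below. It is one possible rewrite rather than the asker's final code: it assumes a numeric matrix like d, treats a column as incrementing when no non-NA value is smaller than the previous non-NA value, and pads with the first element as the question describes.

```r
# Sketch: replace consistently rising columns of a numeric matrix with deltas.
fix_data <- function(data) {
  for (j in seq_len(ncol(data))) {
    x <- data[, j]
    # Differences between successive non-NA values decide whether the column rises
    steps <- diff(x[!is.na(x)])
    if (all(steps >= 0)) {
      # Write back into data itself (not a loop-local copy);
      # the first element is kept so the column length is unchanged
      data[, j] <- c(x[1], diff(x))
    }
  }
  data
}

d <- cbind(col1 = c(1, 2, 3, 4, NA, 2, 3, 1),
           col2 = c(1, 2, 3, 5, NA, 5, 6, 7),
           col3 = c(5, 4, 6, 7, NA, 9, 3, 5))
d <- fix_data(d)  # assign the result back; the function does not modify d in place
d
```

Only col2 changes. Note that diff() propagates the NA, so the two differences that touch the missing value are themselves NA.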

diff calculates the difference between consecutive values in a vector. You can apply it to each column in a dataframe using, e.g.
dfr <- data.frame(x = c(1,2,5,7,8), y = (1:5)^2)
as.data.frame(lapply(dfr, diff))
#   x y
# 1 1 3
# 2 3 5
# 3 2 7
# 4 1 9
EDIT: I just noticed a few more things. You are using a matrix, not a data frame (as you stated in the question). For your matrix 'd', you can use
d_diff <- apply(d, 2, diff)
#Find columns that are (strictly) increasing
incr <- apply(d_diff, 2, function(x) all(x > 0, na.rm=TRUE))
#Replace values in the appropriate columns
d[2:nrow(d),incr] <- d_diff[,incr]

Related

Fill the gaps, depending on the length of missing values and the last & previous known value in R

consider following dataset:
df<-data.frame(ID=c(1,2), Value_1=c(1,7), Value_2= c(NA,10), Value_3=c(NA,13), Value_4=c(7,NA))
What I would like to achieve is this:
df_target<-data.frame(ID=c(1,2), Value_1=c(1,7), Value_2= c(3,10), Value_3=c(5,13), Value_4=c(7,16))
As you can see, there are two different issues here:
In the first row I would like to do the following operation:
"(last_known + previous_known)/number_of_elements", and add this number to the last known value, proceeding until you reach the last value:
i.e.
(1+7)/4=2 --> 1; 1+2; 1+2+2; 7
The second one is to use lm() to predict the last value.
But how do I combine these? The first case is the most challenging part.
I guess it should be done with median(last_known, previous_known), and then somehow count the missing values, map the count to na_count_id, and then add the product of the mean and the corresponding na_count_id:
previous_known_value + na_count_id*median
Thanks in advance for your help!

Here is a solution that works. This should work even if there is an NA in the first column, based on testing I did. Basically, I iterate over every row by column. The increaser variable is the amount by which the column must be increased over the previous column to get the pattern you are looking to achieve.
library(tidyverse)
df <- column_to_rownames(df, var = "ID") # need to convert ID column to rownames
for (i in 1:nrow(df)) {
  # increaser is calculated by taking the range of the row and dividing by the
  # difference between the indices of the max and min of the row
  row_range <- range(df[i,], na.rm = TRUE)
  increaser <- as.numeric((row_range[2] - row_range[1])/(which.max(df[i,]) - which.min(df[i,])))
  for (j in 1:ncol(df)) { # this iterates through every column
    if (is.na(df[i,j])) {
      if (j == 1) {
        # special calculation needed for the first column since there's no previous column:
        # find the next non-NA column in the row and step back from it by
        # the index distance multiplied by the increaser
        next_known <- min(which(!is.na(df[i,])))
        df[i, j] <- df[i, next_known] - increaser*(next_known - j)
      } else {
        # an NA position which is not in the first column
        df[i, j] <- df[i, j-1] + increaser
      }
    }
    # if a position is not NA, no calculations are needed
  }
}
# this loop returns the following. You can convert the row ID back to a column if desired.
#   Value_1 Value_2 Value_3 Value_4
# 1       1       3       5       7
# 2       7      10      13      16
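Since the question mentions lm(), here is a hedged alternative sketch that fits a straight line through the known values of each row and predicts the missing ones. The helper name fill_linear is made up for illustration, and the sketch assumes every row has at least two known values; it is not the approach of the answer above.

```r
# Sketch: fill NAs in a numeric vector from a straight-line fit on position.
# Assumes at least two non-NA values; fill_linear is a hypothetical helper name.
fill_linear <- function(values) {
  idx <- seq_along(values)
  known <- !is.na(values)
  fit <- lm(y ~ x, data = data.frame(x = idx[known], y = values[known]))
  values[!known] <- predict(fit, newdata = data.frame(x = idx[!known]))
  values
}

df <- data.frame(ID = c(1, 2), Value_1 = c(1, 7), Value_2 = c(NA, 10),
                 Value_3 = c(NA, 13), Value_4 = c(7, NA))
value_cols <- setdiff(names(df), "ID")

# apply() returns one column per row, so transpose back before assigning
filled <- t(apply(as.matrix(df[value_cols]), 1, fill_linear))
df[value_cols] <- as.data.frame(filled)
df
```

For this particular data the fitted lines pass exactly through the known points, so interior gaps are interpolated and the trailing gap is extrapolated, which matches df_target up to floating-point error.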

R function to subset dataframe so that non-adjacent values in a column differ by >= X (starting with the first value)

I am looking for a function that iterates through the rows of a given column ("pos" for position, ascending) in a data frame and keeps only those rows whose values differ by at least, say, 10, starting with the first row. That is, it would start with the first row (and store it), then carry on until it finds a row with a value at least 10 higher than the first, store that row, and then start again from that value, looking for the next row that is at least 10 higher.
So far I have an R for loop that successfully finds adjacent rows at least X values apart, but it does not have the capability of looking any further than one row down, nor of stopping once it has found the given row and starting again from there.
Here is the function I have:
# example data frame
df <- data.frame(x=c(1:1000), pos=sort(sample(1:10000, 1000)))
# prep function (this only checks row above)
library(dplyr)
pos.apart.subset <- function(df, pos.diff) {
# create new dfs to store output
new.df <- list()
new.df1 <- data.frame()
# iterate through each row of df
for (i in 1:nrow(df)) {
# if the value of next row is higher or equal than value or row i+posdiff, keep
# if not ascending, keep
# if first row, keep
if(isTRUE(df$pos[i+1] >= df$pos[i]+pos.diff | df$pos[i+1] < df$pos[i] | i==1 )) {
# add rows that meet conditions to list
new.df[[i]] <- df[i,] }
}
# bind all rows that met conditions
new.df1 <- bind_rows(new.df)
return(new.df1)}
# test run for pos column adjacent values to be at least 10 apart
df1 <- pos.apart.subset(df, 10); head(df1)
Happy to do this in awk or any other language. Many thanks.

It seems I misunderstood the question earlier. Since we don't want to calculate the difference between consecutive rows, you can try:
nrows <- 1
previous_match <- 1
for(i in 2:nrow(df)) {
  if(df$pos[i] - df$pos[previous_match] > 10) {
    nrows <- c(nrows, i)
    previous_match <- i
  }
}
and then subset the selected rows :
df[nrows, ]
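The same greedy idea can be wrapped in a function with the gap as a parameter. This is a sketch under a few assumptions: the name pos_apart_subset and the toy data are invented for illustration, the pos column is sorted ascending, and the comparison is >= to follow the "at least" wording of the question (the loop above uses a strict >).

```r
# Sketch: keep the first row, then every row whose pos is at least pos_diff
# above the most recently kept row. Assumes df$pos is sorted ascending.
pos_apart_subset <- function(df, pos_diff) {
  if (nrow(df) == 0) return(df)
  keep <- logical(nrow(df))
  keep[1] <- TRUE
  last_kept <- df$pos[1]
  for (i in seq_len(nrow(df))[-1]) {
    if (df$pos[i] - last_kept >= pos_diff) {
      keep[i] <- TRUE
      last_kept <- df$pos[i]  # restart the search from the row just kept
    }
  }
  df[keep, , drop = FALSE]
}

toy <- data.frame(x = 1:7, pos = c(1, 5, 11, 12, 20, 21, 35))
kept <- pos_apart_subset(toy, 10)
kept$pos  # 1 11 21 35
```

Row 20 is dropped even though it is far from 1, because the comparison is always against the last kept value (11), not against the first row.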
Earlier answer
We can use diff to get the difference between consecutive rows and select the row which has difference of greater than 10.
head(subset(df, c(TRUE, diff(pos) > 10)))
# x pos
#1 1 1
#2 2 31
#6 6 71
#9 9 134
#10 10 151
#13 13 185
The leading TRUE selects the first row by default.
In dplyr, we can use lag to get value from previous row :
library(dplyr)
df %>% filter(pos - lag(pos, default = -Inf) > 10)

In R, remove rows where subtraction of values between contiguous values in same column excede a value

I have a data frame in R consisting in 5 columns and 30000 rows. One column, called "pos", has this kind of values sorted in ascending order:
pos
785989
888659
918573
949608
990417
I would like to remove all rows where the difference between a "x" value in "pos" (in a "n" row) and the anterior value in the "n-1" row or the difference between the posterior value in "n+1" row and "x" is greater than, lets say, 100000. Eg: in the input example, 888659-785989 = 102670 > 100000, therefore rows containing 888659 and 785989 values should be removed.
Thanks for your help !

One solution is to create a user function that takes the diff of a vector and checks a conditional gap provided by the user:
diff_set <- function(x, gap) {
  ind <- c(FALSE, diff(x) > gap)
  if(sum(ind) == 0) return(!ind)
  subst <- x[-unique(c(which(ind), which(ind)-1))]
  x %in% subst
}
df1[diff_set(df1$x, 1e5),]
#        x y
# 3 918573 C
# 4 949608 D
# 5 990417 E
Data
x <- scan(text="785989
888659
918573
949608
990417")
df1 <- data.frame(x, y=LETTERS[1:5])
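As a hedged alternative, the same filter can be written without a helper function by shifting the "gap too large" flags so that both rows on either side of a large gap are marked. The variable names below are illustrative only.

```r
# Sketch: drop both neighbours of every gap larger than the threshold.
x <- c(785989, 888659, 918573, 949608, 990417)
df1 <- data.frame(x, y = LETTERS[1:5])

too_far <- diff(df1$x) > 1e5                        # one flag per adjacent pair
drop_row <- c(too_far, FALSE) | c(FALSE, too_far)   # flag the earlier and the later row of each pair
result <- df1[!drop_row, ]
result
```

Padding the flag vector with FALSE at the end marks the earlier row of each pair, and padding at the start marks the later row.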

function to subtract each column from one specific column in r

I want to subtract each column from a column called df$Means in r. I want to do this as a function but Im not sure how to iterate through each of the columns- each iteration relies on one column being subtracted from df$Means and then there is a load of downstream code that uses the output. I have simplified the code for here as this is the bit that's giving me trouble. So far I have:
CopyNumberLoop <- function (i) {df$ZScore <- (df[3:5]-df$Means)/(df$sd)
}
apply(df[3:50], 2, CopyNumberLoop)
but Im not sure how to make sure that the operation is done on one column at a time. I don't think df[3:5] is correct?
I have been asked to produce a reproducible example so all the code I want is here:
df1 <- read.delim(file.choose(),header=TRUE)
#Take the control samples and average each row for three columns excluding the first two columns- add the per row means to the data frame
df$Means <- rowMeans(df[,30:32])
RowVar <- function(x) {rowSums((x - rowMeans(x))^2)/(dim(x)[2] - 1)}
df$sd=sqrt(RowVar(df[,c(30:32)]))
#Get a Z score by dividing the test sample count at each locus by the average for the control samples and divide everything by the st dev for controls at each locus.
{
df$ZScore <- (df[,35]-df$Means)/(df$sd)
######################################### QUARTILE FILTER ###########################################################
alpha=1.5
numberofControls = 3
UL = median(df$ZScore, na.rm = TRUE) + alpha*IQR(df$ZScore, na.rm = TRUE)
LL = median(df$ZScore, na.rm = TRUE) - alpha*IQR(df$ZScore, na.rm = TRUE)
#Copy the Z score if the score is > or < a certain number, i.e. LL or UL.
Zoutliers <- which(df$ZScore > UL | df$ZScore < LL)
df$Zoutliers <- ifelse(df$ZScore > UL |df$ZScore <LL ,1,-1)
tempout = ifelse(df$ZScore[Zoutliers] > UL,1,-1)
######################################### Three neighbour Isolation filter ##############################################################################
finalSeb=c()
for(i in 2:(length(Zoutliers)-1)){
j=Zoutliers[i]
if(sum(ifelse((j-1) == Zoutliers,1,0)) > 0 & tempout[i] == tempout[i-1] & sum(ifelse((j+1) == Zoutliers,1,0)) > 0 & tempout[i] == tempout[i+1]){
finalSeb = c(finalSeb,i)
}
}
finalset_row_number = Zoutliers[finalSeb]
#View(finalset_row_number)
p_seq = rep(0,nrow(df))
for(i in 1:length(finalset_row_number)){
p_seq[(finalset_row_number[i]-1):(finalset_row_number[i]+1)] = median(df$ZScore[(finalset_row_number[i]-1):(finalset_row_number[i]+1)])
}
nrow(as.data.frame(finalset_row_number))
}
For each column between 3 and 50 I'd like to generate a nrow(as.data.frame(finalset_row_number)) and keep it in another dataframe. Admittedly my code is a mess because I dont know how to create the function that will allow me to apply this to each column

Your code isn’t using the parameter i at all. In fact, i is the current column, so that’s what you should use:
result = apply(df[, 3 : 50], 2, function (col) col - df$Means)
Or you can subtract the means directly:
result = df[, 3 : 50] - df$Means
This will return a new matrix consisting of the columns 3–50 from df, subtracting df$Means from each in turn. Or, if you want to calculate Z scores as your code seems to do:
result = (df[, 3 : 50] - df$Means) / df$sd

It appeared that you wanted the Z-scores assigned back into the original dataframe as named columns. If you want to loop over columns, it would be just as economical to use lapply or sapply. The receiving function will accept each column in turn and match it to the first parameter. Any other arguments offered after the receiving function will get matched by name or position to any other symbol/names in the parameter list. You do not do any assignment to 'df' inside the function:
CopyNumberLoop <- function (col) { (col - df$Means)/(df$sd)
}
df[, paste0('ZScore' , 3:50)] <- # assignment done outside the loop
lapply(df[3:50], CopyNumberLoop) # result is a list
# but the `[.data.frame<-` method will accept a list.
Using apply coerces to a matrix, which may have undesirable effects if a column is not numeric (say factor or date-time). It's better to get into the habit of using lapply when working on ranges of columns in dataframes.
If you want to assign the result of this operation to a new dataframe, then the lapply(.) result would need to be wrapped in as.data.frame and then column names could be assigned. Same effort would need to be done to a result from apply(.).
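A small self-contained run may make the lapply pattern concrete. The toy data frame, its column names, and the z_score helper are assumptions for illustration; only the Means and sd columns mirror the question.

```r
# Sketch: per-column Z-scores against precomputed row-wise Means and sd.
df <- data.frame(a = c(1, 2, 3), b = c(4, 6, 8),
                 Means = c(2, 3, 4), sd = c(1, 2, 4))

# Parentheses matter: subtract first, then divide
z_score <- function(col) (col - df$Means) / df$sd

sample_cols <- c("a", "b")
df[paste0("ZScore_", sample_cols)] <- lapply(df[sample_cols], z_score)
df
```

Each sample column is passed to z_score in turn, and the resulting list is assigned to the new ZScore_ columns in one step.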

Is it possible to swap columns around in a data frame using R?

I have a data frame with four columns and would like to swap them around from
"dam" "piglet" "fdate" "ssire"
to
"piglet" "ssire" "dam" "tdate"
Is there any way I can do the swapping using R?
Any help would be very much appreciated.
Baz

dfrm <- dfrm[c("piglet", "ssire", "dam", "tdate")]
OR:
dfrm <- dfrm[ , c("piglet", "ssire", "dam", "tdate")]

d <- data.frame(a=1:3, b=11:13, c=21:23)
d
# a b c
#1 1 11 21
#2 2 12 22
#3 3 13 23
d2 <- d[,c("b", "c", "a")]
d2
# b c a
#1 11 21 1
#2 12 22 2
#3 13 23 3
or you can do same thing using index:
d3 <- d[,c(2, 3, 1)]
d3
# b c a
#1 11 21 1
#2 12 22 2
#3 13 23 3

To summarise the other posts, there are three ways of changing the column order, and two ways of specifying the indexing in each method.
Given a sample data frame
dfr <- data.frame(
dam = 1:5,
piglet = runif(5),
fdate = letters[1:5],
ssire = rnorm(5)
)
Kohske's answer: You can use standard matrix-like indexing using column numbers
dfr[, c(2, 4, 1, 3)]
or using column names
dfr[, c("piglet", "ssire", "dam", "fdate")]
DWin & Gavin's answer: Data frames allow you to omit the row argument when specifying the index.
dfr[c(2, 4, 1, 3)]
dfr[c("piglet", "ssire", "dam", "fdate")]
PaulHurleyuk's answer: You can also use subset.
subset(dfr, select = c(2, 4, 1, 3))
subset(dfr, select = c("piglet", "ssire", "dam", "fdate"))

You can use subset's 'select' argument;
#Assume df contains "dam" "piglet" "fdate" "ssire"
newdf<-subset(df, select=c("piglet", "ssire", "dam", "tdate"))

I noticed that this is an almost 8-year-old question. But for people who are starting to learn R and might stumble upon this question, like I did, you can now use the much more flexible select() function from the dplyr package to accomplish the swapping operation as follows.
# Install and load the dplyr package
install.packages("dplyr")
library("dplyr")
# Override the existing data frame with the desired column order
df <- select(df, piglet, ssire, dam, tdate)
This approach has following advantages:
You will have to type less as the select() does not require variable names to be enclosed within quotes.
In case your data frame has more than 4 variables, you can utilize select helper functions such as starts_with(), ends_with(), etc. to select multiple columns without having to name each column and rearrange them with much ease.

Relevance Note: In response to some users (myself included) that would like to swap columns without having to specify every column, I wrote this answer up.
TL;DR: This answer provides a one-liner for numerical indices and, at the end, a function that correctly swaps exactly two columns (given by name or by numerical index) in a data frame of any size; neither uses imports. A function that reassigns an arbitrary number of columns is also provided, though it may cause unavoidable superfluous swaps if not used carefully (read more and get the functions in the Summary section).
Preliminary Solution
Suppose you have some huge (or not) data frame, DF, and you only know the indices of the two columns you want to swap, say 1 < n < m < length(DF). (Also important is that your columns are not adjacent, i.e. |n-m| > 1 which is very likely to be the case in our "huge" data frame but not necessarily for smaller ones; work-arounds for all degenerate cases are provided at the end).
Because it is huge, there are a ton of columns and you don't want to have to specify every other column by hand, or it isn't huge and you're just someone with fine taste in coding; either way, this one-liner will do the trick:
DF <- DF[ c( 1:(n-1), m, (n+1):(m-1), n, (m+1):length(DF) ) ]
Each piece works like this:
1:(n-1) # This keeps every column before column `n` in place
m # This places column `m` where column `n` was
(n+1):(m-1) # This keeps every column between the two in place
n # This places column `n` where column `m` was
(m+1):length(DF) # This keeps every column after column `m` in place
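To see the pieces fit together, here is the one-liner applied to a made-up six-column data frame with n = 2 and m = 5, which satisfies the restrictions 1 < n < m < length(DF) and |n-m| > 1.

```r
# Sketch: the restricted one-liner on toy data (column names are illustrative).
DF <- data.frame(a = 1, b = 2, c = 3, d = 4, e = 5, f = 6)
n <- 2
m <- 5

DF <- DF[ c( 1:(n-1), m, (n+1):(m-1), n, (m+1):length(DF) ) ]
names(DF)  # "a" "e" "c" "d" "b" "f"
```

Columns b and e trade places while a, c, d and f stay where they were.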
Generalizing for Degenerates
Because of how the : operator works, i.e. allowing "backwards-ranges" like this,
> 10:0
[1] 10 9 8 7 6 5 4 3 2 1 0
we have to be careful about our choices and placements of n and m, hence our previous restrictions. For instance, n < m doesn't lose us any generality (one of the columns has to be before the other one if they are different), however, it means we do need to be careful about which goes where in our line of code. We can make it so that we don't have to check this condition with the following modification:
DF <- DF[ c( 1:(min(n,m)-1), max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m), (max(n,m)+1):length(DF) ) ]
We have replaced every instance of n and m with min(n,m) and max(n,m) respectively, meaning that the correct ordering for our code will be preserved even in the case that n > m.
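The backwards-range behaviour is what makes the boundary cases awkward, and it is easy to reproduce. The snippet below also shows seq_len(), a base R function that is not used in this answer but returns an empty vector where the : operator would count down.

```r
# Sketch: why 1:(n-1) misbehaves when n is 1.
n <- 1
backwards <- 1:(n - 1)   # counts down from 1 to 0 instead of being empty
empty <- seq_len(n - 1)  # zero-length integer vector

backwards  # 1 0
length(empty)  # 0
```

Indexing a data frame with c(1, 0) silently keeps column 1 a second time instead of keeping nothing, which is why the first-column case needs separate handling.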
In the cases where min(n,m) == 1, max(n,m) == length(DF), both of those at the same time, or |n-m| == 1, we will make some less aesthetic modifications involving if/else so that we no longer have to check for them. If you know that one of these cases always holds (i.e. you are always swapping some interior column with the first column, swapping some interior column with the last column, swapping the first and last columns, or swapping two adjacent columns), you can express the swap more succinctly, because each case usually just requires omitting parts of our restricted version:
# Swapping not the last column with the first column
# We just got rid of 1:(min(n,m)-1) because it would be invalid and not what we meant
# since min(n,m) == 1
# Now we just stick the other column right at the front
DF <- DF[ c( max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m), (max(n,m)+1):length(DF) ) ]
# Also equivalent since we know min(n,m) == 1, for the leftover index i
DF <- DF[ c( i, 2:(i-1), 1, (i+1):length(DF) ) ]
# Swapping not the first column with the last column
# Similarly, we just got rid of (max(n,m)+1):length(DF) because it would also be invalid
# and not what we meant since max(n,m) == length(DF)
# Now we just stick the other column right at the end
DF <- DF[ c( 1:(min(n,m)-1), max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m) ) ]
# Also equivalent since we know max(n,m) == length(DF), for the leftover index, say i
DF <- DF[ c( 1:(i-1), length(DF), (i+1):(length(DF)-1), i ) ]
# Swapping the first column with the last column
DF <- DF[ c( max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m) ) ]
# Also equivalent (for if you don't actually know the length beforehand, as assumed
# elsewhere)
DF <- DF[ c( length(DF), 2:(length(DF)-1), 1 ) ]
# Swapping two interior adjacent columns
# Here we drop the explicit swap on either side of our middle column segment
# This is actually enough because then the middle segment becomes a backwards range
# because we know that `min(n,m) + 1 = max(n,m)`
# The range is just an ordering of the two adjacent indices from largest to smallest
DF <- DF[ c( 1:(min(n,m)-1), (min(n,m)+1):(max(n,m)-1), (max(n,m)+1):length(DF) )]
"But!", I hear you saying, "What if more than one of these cases occur simultaneously, like did in the third version in the block above!?". Right, coding up versions for each case is an enormous waste of time if one wants to be able to "swap columns" in the most general sense.
Swapping any Two Columns
It will be easiest to generalize our code to cover all of the cases at the same time, because they all employ essentially the same strategy. We will use if/else to keep our code a one-liner:
DF <- DF[ if (n==m) 1:length(DF) else c( (if (min(n,m)==1) c() else 1:(min(n,m)-1) ), (if (min(n,m)+1 == max(n,m)) (min(n,m)+1):(max(n,m)-1) else c( max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m))), (if (max(n,m)==length(DF)) c() else (max(n,m)+1):length(DF) ) ) ]
That's totally unreadable and probably pretty unfriendly to anyone who might try to understand or recreate your code (including yourself), so better to box it up in a function.
# A function that swaps the `n` column and `m` column in the data frame DF
swap <- function(DF, n, m)
{
return (DF[ if (n==m) 1:length(DF) else c( (if (min(n,m)==1) c() else 1:(min(n,m)-1) ), (if (min(n,m)+1 == max(n,m)) (min(n,m)+1):(max(n,m)-1) else c( max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m))), (if (max(n,m)==length(DF)) c() else (max(n,m)+1):length(DF) ) ) ])
}
A more robust version that can also swap on column names and has semi-explanatory comments:
# Returns data frame object with columns `n` and `m` swapped
# `n` and `m` can be column names, numerical indices, or a heterogeneous pair of both
swap <- function(DF, n, m)
{
# Of course, first, we want to make sure that n != m,
# because if they do, we don't need to do anything
if (n==m) return(DF)
# Next, if either n or m is a column name, we want to get its index
# We assume that if they aren't column names, they are indices (integers)
n <- if (class(n)=="character" & is.na(suppressWarnings(as.integer(n)))) which(colnames(DF)==n) else as.integer(n)
m <- if (class(m)=="character" & is.na(suppressWarnings(as.integer(m)))) which(colnames(DF)==m) else as.integer(m)
# Make sure each index is actually valid
if (!(1<=n & n<=length(DF))) stop( "`n` represents invalid index!" )
if (!(1<=m & m<=length(DF))) stop( "`m` represents invalid index!" )
# Also, for readability, lets go ahead and set which column is earlier, and which is later
earlier <- min(n,m)
later <- max(n,m)
# This constructs the first third of the indices
# These are the columns that, if any, come before the earlier column you are swapping
firstThird <- if ( earlier==1 ) c() else 1:(earlier-1)
# This constructs the last third of the indices
# These are the columns, if any, that come after the later column you are swapping
lastThird <- if ( later==length(DF) ) c() else (later+1):length(DF)
# This checks if the columns to be swapped are adjacent and then constructs the
# secondThird accordingly
if ( earlier+1 == later )
{
# Here; the second third is a list of the two columns ordered from later to earlier
secondThird <- (earlier+1):(later-1)
}
else
{
# Here; the second third is a list of
# the later column you want to swap
# the columns in between
# and then the earlier column you want to swap
secondThird <- c( later, (earlier+1):(later-1), earlier)
}
# Now we assemble our indices and return our permutation of DF
return (DF[ c( firstThird, secondThird, lastThird ) ])
}
And, for ease of repatriation with less of the spatial cost, a comment-less version that checks index validity and can handle column names, i.e. does everything in pretty close to the smallest space it can (yes, you could vectorize, using ifelse(...), the two checks that get performed, but then you'd have to unpack the vector back into n,m or change how the final line is written):
swap <- function(DF, n, m)
{
n <- if (class(n)=="character" & is.na(suppressWarnings(as.integer(n)))) which(colnames(DF)==n) else as.integer(n)
m <- if (class(m)=="character" & is.na(suppressWarnings(as.integer(m)))) which(colnames(DF)==m) else as.integer(m)
if (!(1<=n & n<=length(DF))) stop( "`n` represents invalid index!" )
if (!(1<=m & m<=length(DF))) stop( "`m` represents invalid index!" )
return (DF[ if (n==m) 1:length(DF) else c( (if (min(n,m)==1) c() else 1:(min(n,m)-1) ), (if (min(n,m)+1 == max(n,m)) (min(n,m)+1):(max(n,m)-1) else c( max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m))), (if (max(n,m)==length(DF)) c() else (max(n,m)+1):length(DF) ) ) ])
}
Permutations (or How to Do Specifically What the Question Asked and More!)
With our swap function in tow, we can try to actually do what the original question asked. The easiest way to do this, is to build a function that utilizes the really cool power that comes with a choice of heterogeneous arguments. Create a mapping:
mapping <- data.frame( "piglet" = 1, "ssire" = 2, "dam" = 3, "tdate" = 4)
In the case of the original question, these are all of the columns in our original data frame, but we will build a function where this doesn't have to be the case:
# A function that takes two data frames, one with actual data: DF, and the other with a
# rearrangement of the columns: R
# R must be structured so that colnames(R) is a subset of colnames(DF)
# Alternatively, R can be structured so that 1 <= as.integer(colnames(R)) <= length(DF)
# Further, 1 <= R$column <= length(DF), and length(R$column) == 1
# These structural requirements on R are not checked
# This is for brevity and because most likely R has been created specifically for use with
# this function
rearrange <- function(DF, R)
{
for (col in colnames(R))
{
DF <- swap(DF, col, R[col])
}
return (DF)
}
Wait, that's it? Yup. This will swap every column name to the appropriate placement. The power for such simplicity comes from swap taking heterogeneous arguments meaning we can specify the moving column name that we want to put somewhere, and so long as we only ever try to put one column in each position (which we should), once we put that column where it belongs, it won't move again. This means that even though it seems like later swaps could undo previous placements, the heterogeneous arguments make certain that won't happen, and so additionally, the order of the columns in our mapping also doesn't matter. This is a really nice quality because it means that we aren't kicking this whole "organizing the data" issue down the road too much. You only have to be able to determine which placement you want to send each column you want to move to.
Ok, ok, there is a catch. If you don't reassign the entire data frame when you do this, then you have superfluous swaps that occur, meaning that if you re-arrange over a subset of columns that isn't "closed", i.e. not every column name has an index that is represented in the rearrangement, then other columns that you didn't explicitly say to move may get moved to other places they don't exactly belong. This can be handled by creating your mapping very carefully, or simply using numerical indices mapping to other numerical indices. In the latter case, this doesn't solve the issue, but it makes more explicit what swaps are taking place and in what order so planning the rearrangement is more explicit and thus less likely to lead to problematic superfluous swaps.
Summary
You can use the swap function that we built to successfully swap exactly two columns or the rearrange function with a "rearrangement" data frame specifying where to send each column name you want to move. In the case of the rearrange function, if any of the placements chosen for each column name are not already occupied by one of the specified columns (i.e. not in colnames(R)), then superfluous swaps can and are very likely to occur (The only instance they won't is when every superfluous swap has a partner superfluous swap that undoes it before the end. This is, as stated, very unlikely to happen by accident, but the mapping can be structured to accomplish this outcome in practice).
swap <- function(DF, n, m)
{
n <- if (class(n)=="character" & is.na(suppressWarnings(as.integer(n)))) which(colnames(DF)==n) else as.integer(n)
m <- if (class(m)=="character" & is.na(suppressWarnings(as.integer(m)))) which(colnames(DF)==m) else as.integer(m)
if (!(1<=n & n<=length(DF))) stop( "`n` represents invalid index!" )
if (!(1<=m & m<=length(DF))) stop( "`m` represents invalid index!" )
return (DF[ if (n==m) 1:length(DF) else c( (if (min(n,m)==1) c() else 1:(min(n,m)-1) ), (if (min(n,m)+1 == max(n,m)) (min(n,m)+1):(max(n,m)-1) else c( max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m))), (if (max(n,m)==length(DF)) c() else (max(n,m)+1):length(DF) ) ) ])
}
rearrange <- function(DF, R)
{
for (col in colnames(R))
{
DF <- swap(DF, col, R[col])
}
return (DF)
}

I quickly wrote a function that takes a data frame v and the column indexes a and b which you want to swap.
swappy = function(v,a,b){ # where v is a dataframe, a and b are the columns indexes to swap
name = deparse(substitute(v))
helpy = v[,a]
v[,a] = v[,b]
v[,b] = helpy
name1 = colnames(v)[a]
name2 = colnames(v)[b]
colnames(v)[a] = name2
colnames(v)[b] = name1
assign(name,value = v , envir =.GlobalEnv)
}

I was using the function by Khôra Willis, which is helpful. But I encountered an error. I tried to make corrections. Here is R code that finally works. The arguments n and m could either be column names or column numbers in data frame DF.
require(tidyverse)
swap <- function(DF, n, m)
{
if (class(n)=="character") n <- which(colnames(DF)==n)
if (class(m)=="character") m <- which(colnames(DF)==m)
p <- NCOL(DF)
if (!(1<=n & n<=p)) stop("`n` represents invalid index!")
if (!(1<=m & m<=p)) stop("`m` represents invalid index!")
index <- 1:p
index[n] <- m; index[m] <- n
DF0 <- DF %>% select(all_of(index))
return(DF0)
}
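The index-permutation idea does not strictly need the tidyverse; a base R sketch of the same approach follows. The name swap_columns and the toy data frame are assumptions for illustration, and n and m are expected to be single column names or single column numbers.

```r
# Sketch: swap two columns by permuting a vector of column positions (base R only).
swap_columns <- function(DF, n, m) {
  # Translate column names to positions; match() gives NA for unknown names
  if (is.character(n)) n <- match(n, colnames(DF))
  if (is.character(m)) m <- match(m, colnames(DF))
  p <- ncol(DF)
  if (is.na(n) || n < 1 || n > p) stop("`n` represents invalid index!")
  if (is.na(m) || m < 1 || m > p) stop("`m` represents invalid index!")
  index <- seq_len(p)
  index[c(n, m)] <- c(m, n)  # exchange the two positions; a no-op when n == m
  DF[index]
}

d <- data.frame(a = 1:3, b = 11:13, c = 21:23)
by_name <- swap_columns(d, "a", "c")
by_index <- swap_columns(d, 1, 2)
names(by_name)   # "c" "b" "a"
names(by_index)  # "b" "a" "c"
```

Because the whole position vector is built first and only two entries are exchanged, first, last and adjacent columns need no special cases.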
