Comparing 2 datasets in R

I have two data sets extracted from a data set called babies2009 (three vectors: count, name, gender).
One is girls2009, containing all the girls, and the other is boys2009.
I want to find out what similar names exist between boys and girls.
I tried this:
common.names = (boys2009$name %in% girls2009$name)
When I try
babies2009[common.names, ] [1:10, ]
all I get is the girl names, not the common names.
I have confirmed that both data sets do contain boys and girls respectively by taking a sample of 10 rows from each:
boys2009 [1:10,]
girls2009 [1:10,]
How else can I compare the two data sets and determine which values they share?
Thanks,

common.names = (boys2009$name %in% girls2009$name) gives you a logical vector of length length(boys2009$name). So when you try selecting from a much longer data.frame babies2009[common.names, ] [1:10, ], you wind up with nonsense.
Solution: use that logical vector on the proper data.frame!
# The column is called `name` so that `$name` matches exactly instead of relying on partial matching
boys2009 <- data.frame(name = c("Billy", "Bob"), data = runif(2), gender = "M", stringsAsFactors = FALSE)
girls2009 <- data.frame(name = c("Billy", "Mae", "Sue"), data = runif(3), gender = "F", stringsAsFactors = FALSE)
babies2009 <- rbind(boys2009, girls2009)
common.names <- boys2009$name %in% girls2009$name
> boys2009[common.names, ]$name
[1] "Billy"

Since you want similarities but did not specify exact matches, you should consider agrep:
sapply(boys2009$name, agrep, girls2009$name, max.distance = 0.1)
You can adjust the max.distance argument to suit your needs.

How about using set functions:
list(
`only boys` = setdiff(boys2009$name, girls2009$name),
`common` = intersect(boys2009$name, girls2009$name),
`only girls` = setdiff(girls2009$name, boys2009$name)
)
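As a quick illustration of the set-function approach, here is a minimal sketch on toy data. The two small data frames and the names in them are invented stand-ins for boys2009 and girls2009, not the real data.

```r
# Toy stand-ins for the real data (the names are invented for illustration)
boys2009  <- data.frame(name = c("Billy", "Bob", "Jordan"), stringsAsFactors = FALSE)
girls2009 <- data.frame(name = c("Billy", "Mae", "Sue", "Jordan"), stringsAsFactors = FALSE)

name_sets <- list(
  `only boys`  = setdiff(boys2009$name, girls2009$name),
  common       = intersect(boys2009$name, girls2009$name),
  `only girls` = setdiff(girls2009$name, boys2009$name)
)

# Rows of the boys table whose name also occurs among the girls
shared_rows <- boys2009[boys2009$name %in% name_sets$common, , drop = FALSE]

print(name_sets)
```

Note that setdiff and intersect return unique values, so a name that is repeated within one table appears only once in the result.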

Related

How to count missing values from two columns in R

I have a data frame which looks like this
**Contig_A** **Contig_B**
Contig_0 Contig_1
Contig_3 Contig_5
Contig_4 Contig_1
Contig_9 Contig_0
I want to count how many contig ids (from Contig_0 to Contig_1193) are not present in either the Contig_A column or the Contig_B column.
For example: if we consider there are total 10 contigs here for this data frame (Contig_0 to Contig_9), then the answer would be 4 (Contig_2, Contig_6, Contig_7, Contig_8)

Create a vector of all the values that you want to check (all_contig) which is Contig_0 to Contig_10 here. Use setdiff to find the absent values and length to get the count of missing values.
cols <- c('Contig_A', 'Contig_B')
#If there are lot of 'Contig' columns that you want to consider
#cols <- grep('Contig', names(df), value = TRUE)
all_contig <- paste0('Contig_', 0:10)
missing_contig <- setdiff(all_contig, unlist(df[cols]))
#[1] "Contig_2" "Contig_6" "Contig_7" "Contig_8" "Contig_10"
count_missing <- length(missing_contig)
#[1] 5
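The same idea can be wrapped in a small helper so that the range of ids is a parameter. This is only a sketch: the function name count_missing_contigs is made up for illustration, and the data frame is rebuilt from the four rows shown in the question.

```r
# The four example rows from the question
df <- data.frame(
  Contig_A = c("Contig_0", "Contig_3", "Contig_4", "Contig_9"),
  Contig_B = c("Contig_1", "Contig_5", "Contig_1", "Contig_0"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count the ids in `ids` that appear in none of `cols`
count_missing_contigs <- function(df, ids, cols = c("Contig_A", "Contig_B")) {
  all_contig <- paste0("Contig_", ids)
  present <- unlist(df[cols], use.names = FALSE)
  length(setdiff(all_contig, present))
}

n_missing_small <- count_missing_contigs(df, 0:9)     # ids Contig_0 .. Contig_9
n_missing_full  <- count_missing_contigs(df, 0:1193)  # ids Contig_0 .. Contig_1193
```

With ids 0:9 this gives 4, matching the example in the question; with 0:1193 it counts against the full range of ids.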

Using match:
x <- c(0:9)
contigs <- sapply(x, function(t) paste0("Contig_",t))
df1 <- data.frame(
Contig_A = c("Contig_0", "Contig_3", "Contig_4", "Contig_9"),
Contig_B = c("Contig_1", "Contig_5", "Contig_1", "Contig_0")
)
xx <- c(df1$Contig_A,df1$Contig_B)
contigs[is.na(match(contigs, xx))]
[1] "Contig_2" "Contig_6" "Contig_7" "Contig_8"
In your case, just change x to x <- 0:1193 (note that c(0,1193) would contain only the two values 0 and 1193).

Standardize group names using a vector of possible matches

I need to standardize how subgroups are referred to in a data set. To do this I need to identify when a variable matches one of several strings and then set a new variable with the standardized name. I am trying to do that with the following:
df <- data.frame(a = c(1,2,3,4), b = c("depression_male", "depression_female", "depression_hsgrad", "depression_collgrad"))
TestVector <- "male"
for (i in TestVector) {
df$grpl <- grepl(paste0(i), df$b)
df[ which(df$grpl == TRUE),]$standard <- "male"
}
The test vector will frequently have multiple elements. The grepl works (I was going to deal with the male/female match confusion later but I'll take suggestions on that) but the subsetting and setting a new variable doesn't. It would be better (and work) if I could transform the grepl output directly into the standard name variable.

Your only real issue is that you need to initialize the standard column. But we can simplify your code a bit:
df <- data.frame(a = c(1,2,3,4), b = c("depression_male", "depression_female", "depression_hsgrad", "depression_collgrad"))
TestVector <- "male"
df$standard <- NA
for (i in TestVector) {
df[ grepl(i, df$b), "standard"] <- "male"
}
df
# a b standard
# 1 1 depression_male male
# 2 2 depression_female male
# 3 3 depression_hsgrad <NA>
# 4 4 depression_collgrad <NA>
Then you've got the issue that the "male" pattern matches "female" as well.
Perhaps you're looking for sub instead? It works like find/replace:
df$standard = sub(pattern = "depression_", replacement = "", df$b)
df
# a b standard
# 1 1 depression_male male
# 2 2 depression_female female
# 3 3 depression_hsgrad hsgrad
# 4 4 depression_collgrad collgrad
It's hard to generalize what will be best in your case without more example input/output pairs. If all your data is of the form "depression_" followed by the group name, this will work well. Or maybe the standard name always comes after an underscore, so you could use pattern = ".*_" to remove everything up to and including the last underscore. Or maybe something else... Hopefully these ideas give you a good start.
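To make the two ideas concrete, here is a small sketch. It assumes the group name always follows the last underscore; the `patterns` lookup vector is a made-up illustration of how a multi-element test vector could map anchored patterns to standard names, which also avoids "male" matching "female".

```r
b <- c("depression_male", "depression_female", "depression_hsgrad", "depression_collgrad")

# Idea 1: strip everything up to and including the last underscore
suffix <- sub(pattern = ".*_", replacement = "", b)

# Idea 2: anchored patterns, so "_male$" no longer matches "depression_female"
# (the names of `patterns` are the standardized labels; this vector is hypothetical)
patterns <- c(male = "_male$", female = "_female$")
standard <- rep(NA_character_, length(b))
for (label in names(patterns)) {
  standard[grepl(patterns[[label]], b)] <- label
}

print(data.frame(b, suffix, standard))
```

Rows that match none of the patterns keep NA in standard, which makes unmatched subgroups easy to spot.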

How to get in a specific order the results of an r lapply function with arguments from a dataframe

Following a previous question I asked, I got an awesome answer.
Here is a quick summary:
I want to compute a multidimensional development index based on South Africa Data for several years. My list is composed of individual information for each year, so basically df1 is about year 1 and df2 about year2.
df1<-data.frame(var1=c(1, 1,1), var2=c(0,0,1), var3=c(1,1,0))
df2<-data.frame(var1=c(1, 0,1), var2=c(1,0,1), var3=c(0,1,0))
mylist <-list (df1,df2)
var1 could be the stance on religion of each person, var2 how she voted in last national election, etc. In my very simple case, I have the data for 3 different persons each year.
From there, I compute an index based on a number of variables (not all of them)
You can find here a very simplified working index function, with only 2 of 3 variables, named dimX and dimY:
myindex <- function(x, dimX, dimY){
econ_i<- ( x[dimX]+ x[dimY] )
return ( (1/length(econ_i))*sum(econ_i) )
}
myindex(df1, "var2", "var3")
and
myindex2 = function(x, d) {
myindex(x, d[1], d[2])
}
Then I have my dataframe of variables I want to use for my index. I am trying to compute the index for several sets of variables.
args <- data.frame(set1=c("var1", "var2"), set2=c("var2", "var3"), stringsAsFactors = F)
I'd like to have the result as follows: (a) list(set1 = list(df1, df2), set2 = list(df1, df2)) instead of (b) list(df1 = list(set1, set2), df2 = list(set1, set2)).
Case (a) represents a time series, meaning I have a list of results of my indexes each year for only one set of variables. Case (b) is the opposite where I have the index results of one year for every set of variables. Each individual result should be a unique numeric value. Hence, I am expecting to get a list of 2 sublists df1 and df2, each sublist containing 3 numeric values.
I've been advised to use this great command:
lapply(mylist, function(m) lapply(args, myindex2, x = m))
It's working great, but I get the result in the "wrong" format, namely the second one (b) I showed.
How could I get the results ordered per set (i.e. case (a) as time series) instead of per year?
Thanks a lot for your help!
PJ
EDIT: I've managed to find a solution that doesn't answer the question, but still allows me to get my data in desired order.
Namely, I'm transforming my list of lists to a matrix that I simply transpose.

This answer will be edited!
Currently, your function myindex() does this:
myindex <- function(x, dimX, dimY){
econ_i<- ( x[dimX]+ x[dimY] )
return ( (1/length(econ_i))*sum(econ_i) )
}
Aren't you after this, however?
myindex <- function(x, dimX, dimY){
econ_i<- ( x[,dimX]+ x[,dimY] )
return ( (1/length(econ_i))*sum(econ_i) )
}
The way you have it right now, length(econ_i) always returns 1 because econ_i is a one-column data.frame and not a vector. The length of a data.frame is its number of columns, while the length of a vector is the number of elements within it.
Here is what the output looks like in R:
df1["var1"]
var1
1 1
2 1
3 1
returns a data.frame()
df1[,"var1"]
[1] 1 1 1
returns a vector.
I will adjust this post to answer your question when you respond. I think it's important to solve this part first.
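In the meantime, one possible way to get the results grouped per set, i.e. case (a), is to swap the nesting of the two lapply calls so that the outer loop runs over the columns of args and the inner loop over the years. The sketch below uses the vector-indexing version of myindex shown above; naming the elements of mylist is an assumption added here so that the output is labelled.

```r
df1 <- data.frame(var1 = c(1, 1, 1), var2 = c(0, 0, 1), var3 = c(1, 1, 0))
df2 <- data.frame(var1 = c(1, 0, 1), var2 = c(1, 0, 1), var3 = c(0, 1, 0))
mylist <- list(df1 = df1, df2 = df2)  # named here for readable output (assumption)
args <- data.frame(set1 = c("var1", "var2"), set2 = c("var2", "var3"),
                   stringsAsFactors = FALSE)

# Vector-indexing version of the simplified index function
myindex <- function(x, dimX, dimY) {
  econ_i <- x[, dimX] + x[, dimY]
  (1 / length(econ_i)) * sum(econ_i)
}
myindex2 <- function(x, d) myindex(x, d[1], d[2])

# Outer loop over variable sets, inner loop over years
by_set <- lapply(args, function(d) lapply(mylist, myindex2, d = d))
str(by_set)
```

The result is a list with one element per set (set1, set2), each holding one numeric value per year.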

If it helps, here is my actual index function, taken from this article:
RCI_a_3det <-function(x, econ1, econ2, econ3, perso1, perso2, perso3, civic1, civic2, civic3){
econ_i<- (1/3) *( x[econ1]+ x[econ2] + x[econ3])
perso_i<- (1/3)*( x[perso1] + x[perso2] + x[perso3])
civic_i<- (1/3)*(x[civic1] + x[civic2] + x[civic3])
daf <- data.frame(econ_i, perso_i, civic_i)
colnames(daf)<- c("econ_i", "perso_i", "civic_i")
df1 <- subset(daf, daf$econ_i !=1 & daf$perso_i !=1 & daf$civic_i!=1 )
sum_xik <- (df1$econ_i + df1$perso_i + df1$civic_i)
return ( 1/(3*nrow(df1)) * sum(sum_xik, na.rm=T))
}
Edit:
x is a list of all personal information, for every variable and for every year. It is pretty large.
I am using 9 variables to compute this index, but I actually have 30 such variables in my data, so I have set up a dataframe of sets of variables I could use to compute this index. This is the equivalent of my args df in the simple example. I am actually using 200 such combinations.

Not all values storing in a loop

I want to store values in "yy", but my code below stores only one row (the last value). Please see the output below. Can somebody help me store all the values in "yy"?
Thanks in advance. I am a beginner in R.
arrPol <- as.matrix(unique(TN_97_Lau_Cot[,6]))
arrYear <- as.matrix(unique(TN_97_Lau_Cot[,1]))
for (ij in length(arrPol)){
for (ik in length(arrYear)) {
newPolicy <- subset(TN_97_Lau_Cot, POLICY == as.character(arrPol[ij]) & as.numeric(arrYear[ik]))
yy <- newPolicy[which.min(newPolicy$min_dist),]
}
}
Output:
YEAR DIVISION STATE COUNTY CROP POLICY STATE_ABB LRPP min_dist
1: 2016 8 41 97 21 699609 TN 0 2.6
Here is an image of the "TN_97_Lau_Cot" matrix.

No loops required. There could be an easier way to do it, but two set-based steps are better than two loops. These are the two ways I would try and do it:
base
# Perform an aggregate and merge it to your data.frame.
TN_97_Lau_Cot_Agg <- merge(
x = TN_97_Lau_Cot,
y = aggregate(min_dist ~ YEAR + POLICY, data = TN_97_Lau_Cot, min),
by = c("YEAR","POLICY"),
all.x = TRUE
)
# Subset the values that you want.
TN_97_Lau_Cot_Final <- unique(subset(TN_97_Lau_Cot_Agg, min_dist.x == min_dist.y))
data.table
library(data.table)
# Convert your data.frame to a data.table.
TN_97_Lau_Cot <- data.table(TN_97_Lau_Cot)
# Perform a "window" function that calculates the min value for each year and policy without reducing the rows.
TN_97_Lau_Cot[, minDistAggregate:=min(min_dist), by = c("YEAR","POLICY")]
# Find the rows that match the minimum distance for that year and policy.
TN_97_Lau_Cot_Final <- unique(TN_97_Lau_Cot[min_dist==minDistAggregate, -10, with=FALSE])
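For completeness, here is a sketch of why the original loop keeps only one row and how a loop-based version could collect all of them. In the question, for (ij in length(arrPol)) iterates over the single number length(arrPol) rather than over 1, 2, ..., and yy is overwritten on every pass; in addition, & as.numeric(arrYear[ik]) is not a comparison against YEAR. The toy data frame below is invented for illustration and only mimics three of the columns.

```r
# Toy stand-in for TN_97_Lau_Cot (values are invented for illustration)
toy <- data.frame(
  YEAR     = c(2015, 2015, 2016, 2016, 2016),
  POLICY   = c("A", "A", "A", "B", "B"),
  min_dist = c(3.1, 2.4, 5.0, 2.6, 4.2),
  stringsAsFactors = FALSE
)

rows <- list()  # collect one row per (policy, year) combination
for (p in unique(toy$POLICY)) {
  for (y in unique(toy$YEAR)) {
    sub <- toy[toy$POLICY == p & toy$YEAR == y, ]
    if (nrow(sub) > 0) {
      # append instead of overwriting
      rows[[length(rows) + 1]] <- sub[which.min(sub$min_dist), ]
    }
  }
}
yy <- do.call(rbind, rows)
print(yy)
```

If index-based loops are preferred, seq_along(arrPol) gives the sequence 1, 2, ..., length(arrPol) that the original code presumably intended.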

Is it possible to swap columns around in a data frame using R?

I have four variables in a data frame and would like to swap the columns around from
"dam" "piglet" "fdate" "ssire"
to
"piglet" "ssire" "dam" "tdate"
Is there any way I can do the swapping using R?
Any help would be very much appreciated.
Baz

dfrm <- dfrm[c("piglet", "ssire", "dam", "tdate")]
OR:
dfrm <- dfrm[ , c("piglet", "ssire", "dam", "tdate")]

d <- data.frame(a=1:3, b=11:13, c=21:23)
d
# a b c
#1 1 11 21
#2 2 12 22
#3 3 13 23
d2 <- d[,c("b", "c", "a")]
d2
# b c a
#1 11 21 1
#2 12 22 2
#3 13 23 3
Or you can do the same thing using column indices:
d3 <- d[,c(2, 3, 1)]
d3
# b c a
#1 11 21 1
#2 12 22 2
#3 13 23 3

To summarise the other posts, there are three ways of changing the column order, and two ways of specifying the indexing in each method.
Given a sample data frame
dfr <- data.frame(
dam = 1:5,
piglet = runif(5),
fdate = letters[1:5],
ssire = rnorm(5)
)
Kohske's answer: You can use standard matrix-like indexing using column numbers
dfr[, c(2, 4, 1, 3)]
or using column names
dfr[, c("piglet", "ssire", "dam", "fdate")]
DWin & Gavin's answer: Data frames allow you to omit the row argument when specifying the index.
dfr[c(2, 4, 1, 3)]
dfr[c("piglet", "ssire", "dam", "fdate")]
PaulHurleyuk's answer: You can also use subset.
subset(dfr, select = c(2, 4, 1, 3))
subset(dfr, select = c("piglet", "ssire", "dam", "fdate"))
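When the data frame has many columns, it can be tedious to list all of them. One base R option, sketched below, is to name only the columns that should come first and let setdiff supply the rest in their existing order. The variable names front and reordered are illustrative, and the sample values are fixed so the example is reproducible.

```r
dfr <- data.frame(
  dam    = 1:3,
  piglet = c(0.1, 0.2, 0.3),
  fdate  = letters[1:3],
  ssire  = c(1.5, 2.5, 3.5)
)

# Columns to move to the front; everything else keeps its relative order
front <- c("piglet", "ssire")
reordered <- dfr[c(front, setdiff(names(dfr), front))]

names(reordered)
```

Here the remaining columns dam and fdate follow the two named columns, so the result has the order piglet, ssire, dam, fdate.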

You can use subset's 'select' argument:
#Assume df contains "dam" "piglet" "fdate" "ssire"
newdf<-subset(df, select=c("piglet", "ssire", "dam", "fdate"))

I noticed that this is an almost 8-year-old question. But for people who are starting to learn R and might stumble upon this question, like I did, you can now use the much more flexible select() function from the dplyr package to accomplish the swapping operation as follows.
# Install and load the dplyr package
install.packages("dplyr")
library("dplyr")
# Override the existing data frame with the desired column order
df <- select(df, piglet, ssire, dam, tdate)
This approach has the following advantages:
You will have to type less, as select() does not require variable names to be enclosed in quotes.
In case your data frame has more than 4 variables, you can use select helper functions such as starts_with(), ends_with(), etc. to select multiple columns without having to name each column, and rearrange them with ease.

Relevance Note: In response to some users (myself included) that would like to swap columns without having to specify every column, I wrote this answer up.
TL;DR: This answer provides a one-liner for swapping two columns by numerical index and, at the end, a function that swaps exactly two columns given by name or by numerical index. Neither uses imports, and the function correctly swaps any two columns in a data frame of any size. A function that reassigns an arbitrary number of columns is also provided; it may cause unavoidable superfluous swaps if not used carefully (read more and get the functions in the Summary section).
Preliminary Solution
Suppose you have some huge (or not) data frame, DF, and you only know the indices of the two columns you want to swap, say 1 < n < m < length(DF). (Also important is that your columns are not adjacent, i.e. |n-m| > 1 which is very likely to be the case in our "huge" data frame but not necessarily for smaller ones; work-arounds for all degenerate cases are provided at the end).
Because it is huge, there are a ton of columns and you don't want to have to specify every other column by hand, or it isn't huge and you're just someone with fine taste in coding. Either way, this one-liner will do the trick:
DF <- DF[ c( 1:(n-1), m, (n+1):(m-1), n, (m+1):length(DF) ) ]
Each piece works like this:
1:(n-1) # This keeps every column before column `n` in place
m # This places column `m` where column `n` was
(n+1):(m-1) # This keeps every column between the two in place
n # This places column `n` where column `m` was
(m+1):length(DF) # This keeps every column after column `m` in place
Generalizing for Degenerates
Because of how the : operator works, i.e. allowing "backwards-ranges" like this,
> 10:0
[1] 10 9 8 7 6 5 4 3 2 1 0
we have to be careful about our choices and placements of n and m, hence our previous restrictions. For instance, n < m doesn't lose us any generality (one of the columns has to be before the other one if they are different), however, it means we do need to be careful about which goes where in our line of code. We can make it so that we don't have to check this condition with the following modification:
DF <- DF[ c( 1:(min(n,m)-1), max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m), (max(n,m)+1):length(DF) ) ]
We have replaced every instance of n and m with min(n,m) and max(n,m) respectively, meaning that the correct ordering for our code will be preserved even in the case that n > m.
In the cases where min(n,m) == 1, max(n,m) == length(DF), both of those at the same time, or |n-m| == 1, we will make some less aesthetic modifications involving if/else so that we can forget about having to check whether any of these is the case. For versions where you know that one of these is the case (i.e. you are always swapping some interior column with the first column, swapping some interior column with the last column, swapping the first and last columns, or swapping two adjacent columns), you can actually express these actions more succinctly, because they usually just require omitting parts from our restricted case:
# Swapping not the last column with the first column
# We just got rid of 1:(min(n,m)-1) because it would be invalid and not what we meant
# since min(n,m) == 1
# Now we just stick the other column right at the front
DF <- DF[ c( max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m), (max(n,m)+1):length(DF) ) ]
# Also equivalent since we know min(n,m) == 1, for the leftover index i
DF <- DF[ c( i, 2:(i-1), 1, (i+1):length(DF) ) ]
# Swapping not the first column with the last column
# Similarly, we just got rid of (max(n,m)+1):length(DF) because it would also be invalid
# and not what we meant since max(n,m) == length(DF)
# Now we just stick the other column right at the end
DF <- DF[ c( 1:(min(n,m)-1), max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m) ) ]
# Also equivalent since we know max(n,m) == length(DF), for the leftover index, say i
DF <- DF[ c( 1:(i-1), length(DF), (i+1):(length(DF)-1), i ) ]
# Swapping the first column with the last column
DF <- DF[ c( max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m) ) ]
# Also equivalent (for if you don't actually know the length beforehand, as assumed
# elsewhere)
DF <- DF[ c( length(DF), 2:(length(DF)-1), 1 ) ]
# Swapping two interior adjacent columns
# Here we drop the explicit swap on either side of our middle column segment
# This is actually enough because then the middle segment becomes a backwards range
# because we know that `min(n,m) + 1 = max(n,m)`
# The range is just an ordering of the two adjacent indices from largest to smallest
DF <- DF[ c( 1:(min(n,m)-1), (min(n,m)+1):(max(n,m)-1), (max(n,m)+1):length(DF) )]
"But!", I hear you saying, "What if more than one of these cases occurs simultaneously, as it did in the third version in the block above!?" Right, coding up versions for each case is an enormous waste of time if one wants to be able to "swap columns" in the most general sense.
Swapping any Two Columns
It will be easiest to generalize our code to cover all of the cases at the same time, because they all employ essentially the same strategy. We will use if/else to keep our code a one-liner:
DF <- DF[ if (n==m) 1:length(DF) else c( (if (min(n,m)==1) c() else 1:(min(n,m)-1) ), (if (min(n,m)+1 == max(n,m)) (min(n,m)+1):(max(n,m)-1) else c( max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m))), (if (max(n,m)==length(DF)) c() else (max(n,m)+1):length(DF) ) ) ]
That's totally unreadable and probably pretty unfriendly to anyone who might try to understand or recreate your code (including yourself), so better to box it up in a function.
# A function that swaps the `n` column and `m` column in the data frame DF
swap <- function(DF, n, m)
{
return (DF[ if (n==m) 1:length(DF) else c( (if (min(n,m)==1) c() else 1:(min(n,m)-1) ), (if (min(n,m)+1 == max(n,m)) (min(n,m)+1):(max(n,m)-1) else c( max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m))), (if (max(n,m)==length(DF)) c() else (max(n,m)+1):length(DF) ) ) ])
}
A more robust version that can also swap on column names and has semi-explanatory comments:
# Returns data frame object with columns `n` and `m` swapped
# `n` and `m` can be column names, numerical indices, or a heterogeneous pair of both
swap <- function(DF, n, m)
{
# Of course, first, we want to make sure that n != m,
# because if they do, we don't need to do anything
if (n==m) return(DF)
# Next, if either n or m is a column name, we want to get its index
# We assume that if they aren't column names, they are indices (integers)
n <- if (class(n)=="character" & is.na(suppressWarnings(as.integer(n)))) which(colnames(DF)==n) else as.integer(n)
m <- if (class(m)=="character" & is.na(suppressWarnings(as.integer(m)))) which(colnames(DF)==m) else as.integer(m)
# Make sure each index is actually valid
if (!(1<=n & n<=length(DF))) stop( "`n` represents invalid index!" )
if (!(1<=m & m<=length(DF))) stop( "`m` represents invalid index!" )
# Also, for readability, lets go ahead and set which column is earlier, and which is later
earlier <- min(n,m)
later <- max(n,m)
# This constructs the first third of the indices
# These are the columns that, if any, come before the earlier column you are swapping
firstThird <- if ( earlier==1 ) c() else 1:(earlier-1)
# This constructs the last third of the indices
# These are the columns, if any, that come after the later column you are swapping
lastThird <- if ( later==length(DF) ) c() else (later+1):length(DF)
# This checks if the columns to be swapped are adjacent and then constructs the
# secondThird accordingly
if ( earlier+1 == later )
{
# Here; the second third is a list of the two columns ordered from later to earlier
secondThird <- (earlier+1):(later-1)
}
else
{
# Here; the second third is a list of
# the later column you want to swap
# the columns in between
# and then the earlier column you want to swap
secondThird <- c( later, (earlier+1):(later-1), earlier)
}
# Now we assemble our indices and return our permutation of DF
return (DF[ c( firstThird, secondThird, lastThird ) ])
}
And, for ease of copying with less of the spatial cost, a comment-less version that checks index validity and can handle column names, i.e. does everything in pretty close to the smallest space it can (yes, you could vectorize, using ifelse(...), the two checks that get performed, but then you'd have to unpack the vector back into n,m or change how the final line is written):
swap <- function(DF, n, m)
{
n <- if (class(n)=="character" & is.na(suppressWarnings(as.integer(n)))) which(colnames(DF)==n) else as.integer(n)
m <- if (class(m)=="character" & is.na(suppressWarnings(as.integer(m)))) which(colnames(DF)==m) else as.integer(m)
if (!(1<=n & n<=length(DF))) stop( "`n` represents invalid index!" )
if (!(1<=m & m<=length(DF))) stop( "`m` represents invalid index!" )
return (DF[ if (n==m) 1:length(DF) else c( (if (min(n,m)==1) c() else 1:(min(n,m)-1) ), (if (min(n,m)+1 == max(n,m)) (min(n,m)+1):(max(n,m)-1) else c( max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m))), (if (max(n,m)==length(DF)) c() else (max(n,m)+1):length(DF) ) ) ])
}
Permutations (or How to Do Specifically What the Question Asked and More!)
With our swap function in tow, we can try to actually do what the original question asked. The easiest way to do this, is to build a function that utilizes the really cool power that comes with a choice of heterogeneous arguments. Create a mapping:
mapping <- data.frame( "piglet" = 1, "ssire" = 2, "dam" = 3, "tdate" = 4)
In the case of the original question, these are all of the columns in our original data frame, but we will build a function where this doesn't have to be the case:
# A function that takes two data frames, one with actual data: DF, and the other with a
# rearrangement of the columns: R
# R must be structured so that colnames(R) is a subset of colnames(DF)
# Alternatively, R can be structured so that 1 <= as.integer(colnames(R)) <= length(DF)
# Further, 1 <= R$column <= length(DF), and length(R$column) == 1
# These structural requirements on R are not checked
# This is for brevity and because most likely R has been created specifically for use with
# this function
rearrange <- function(DF, R)
{
for (col in colnames(R))
{
DF <- swap(DF, col, R[col])
}
return (DF)
}
Wait, that's it? Yup. This will swap every column name to the appropriate placement. The power for such simplicity comes from swap taking heterogeneous arguments meaning we can specify the moving column name that we want to put somewhere, and so long as we only ever try to put one column in each position (which we should), once we put that column where it belongs, it won't move again. This means that even though it seems like later swaps could undo previous placements, the heterogeneous arguments make certain that won't happen, and so additionally, the order of the columns in our mapping also doesn't matter. This is a really nice quality because it means that we aren't kicking this whole "organizing the data" issue down the road too much. You only have to be able to determine which placement you want to send each column you want to move to.
Ok, ok, there is a catch. If you don't reassign the entire data frame when you do this, then you have superfluous swaps that occur, meaning that if you re-arrange over a subset of columns that isn't "closed", i.e. not every column name has an index that is represented in the rearrangement, then other columns that you didn't explicitly say to move may get moved to other places they don't exactly belong. This can be handled by creating your mapping very carefully, or simply using numerical indices mapping to other numerical indices. In the latter case, this doesn't solve the issue, but it makes more explicit what swaps are taking place and in what order so planning the rearrangement is more explicit and thus less likely to lead to problematic superfluous swaps.
Summary
You can use the swap function that we built to successfully swap exactly two columns or the rearrange function with a "rearrangement" data frame specifying where to send each column name you want to move. In the case of the rearrange function, if any of the placements chosen for each column name are not already occupied by one of the specified columns (i.e. not in colnames(R)), then superfluous swaps can and are very likely to occur (The only instance they won't is when every superfluous swap has a partner superfluous swap that undoes it before the end. This is, as stated, very unlikely to happen by accident, but the mapping can be structured to accomplish this outcome in practice).
swap <- function(DF, n, m)
{
n <- if (class(n)=="character" & is.na(suppressWarnings(as.integer(n)))) which(colnames(DF)==n) else as.integer(n)
m <- if (class(m)=="character" & is.na(suppressWarnings(as.integer(m)))) which(colnames(DF)==m) else as.integer(m)
if (!(1<=n & n<=length(DF))) stop( "`n` represents invalid index!" )
if (!(1<=m & m<=length(DF))) stop( "`m` represents invalid index!" )
return (DF[ if (n==m) 1:length(DF) else c( (if (min(n,m)==1) c() else 1:(min(n,m)-1) ), (if (min(n,m)+1 == max(n,m)) (min(n,m)+1):(max(n,m)-1) else c( max(n,m), (min(n,m)+1):(max(n,m)-1), min(n,m))), (if (max(n,m)==length(DF)) c() else (max(n,m)+1):length(DF) ) ) ])
}
rearrange <- function(DF, R)
{
for (col in colnames(R))
{
DF <- swap(DF, col, R[col])
}
return (DF)
}

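I quickly wrote a function that takes a data frame v and the column indexes a and b which you want to swap. Note that it writes the result back to the global environment under the original variable name.
swappy = function(v,a,b){ # where v is a data frame, a and b are the column indexes to swap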
name = deparse(substitute(v))
helpy = v[,a]
v[,a] = v[,b]
v[,b] = helpy
name1 = colnames(v)[a]
name2 = colnames(v)[b]
colnames(v)[a] = name2
colnames(v)[b] = name1
assign(name,value = v , envir =.GlobalEnv)
}

I was using the function by Khôra Willis, which is helpful. But I encountered an error. I tried to make corrections. Here is R code that finally works. The arguments n and m could either be column names or column numbers in data frame DF.
require(tidyverse)
swap <- function(DF, n, m)
{
if (class(n)=="character") n <- which(colnames(DF)==n)
if (class(m)=="character") m <- which(colnames(DF)==m)
p <- NCOL(DF)
if (!(1<=n & n<=p)) stop("`n` represents invalid index!")
if (!(1<=m & m<=p)) stop("`m` represents invalid index!")
index <- 1:p
index[n] <- m; index[m] <- n
DF0 <- DF %>% select(all_of(index))
return(DF0)
}
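The same index-permutation idea also works without the tidyverse, since a data frame can be indexed directly by a vector of column positions. The following is a minimal base R sketch; the function name swap_cols is made up here to avoid clashing with the swap functions above.

```r
# Base R sketch: swap columns n and m (names or positions) via an index permutation
swap_cols <- function(DF, n, m) {
  if (is.character(n)) n <- match(n, names(DF))
  if (is.character(m)) m <- match(m, names(DF))
  p <- ncol(DF)
  if (is.na(n) || n < 1 || n > p) stop("`n` represents invalid index!")
  if (is.na(m) || m < 1 || m > p) stop("`m` represents invalid index!")
  idx <- seq_len(p)
  idx[c(n, m)] <- c(m, n)  # exchange the two positions, leave the rest alone
  DF[idx]
}

d <- data.frame(a = 1:3, b = 11:13, c = 21:23)
swapped <- swap_cols(d, "a", 3)
names(swapped)
```

Because only two entries of the index vector change, the first, last, adjacent, and n == m cases need no special handling.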
