Sort columns of a dataframe by column name - r

This is possibly a simple question, but I do not know how to order columns alphabetically.
test = data.frame(C = c(0, 2, 4, 7, 8), A = c(4, 2, 4, 7, 8), B = c(1, 3, 8, 3, 2))
# C A B
# 1 0 4 1
# 2 2 2 3
# 3 4 4 8
# 4 7 7 3
# 5 8 8 2
I like to order the columns by column names alphabetically, to achieve
# A B C
# 1 4 1 0
# 2 2 3 2
# 3 4 8 4
# 4 7 3 7
# 5 8 2 8
For others I want my own defined order:
# B A C
# 1 4 1 0
# 2 2 3 2
# 3 4 8 4
# 4 7 3 7
# 5 8 2 8
Please note that my datasets are huge, with 10000 variables. So the process needs to be more automated.

You can use order on the names, and use that to order the columns when subsetting:
test[ , order(names(test))]
A B C
1 4 1 0
2 2 3 2
3 4 8 4
4 7 3 7
5 8 2 8
For your own defined order, you will need to define your own mapping of the names to the ordering. This would depend on how you would like to do this, but swapping whatever function would to this with order above should give your desired output.
You may for example have a look at Order a data frame's rows according to a target vector that specifies the desired order, i.e. you can match your data frame names against a target vector containing the desired column order.

Here's the obligatory dplyr answer in case somebody wants to do this with the pipe.
test %>%
select(sort(names(.)))

test = data.frame(C=c(0,2,4, 7, 8), A=c(4,2,4, 7, 8), B=c(1, 3, 8,3,2))
Using the simple following function replacement can be performed (but only if data frame does not have many columns):
test <- test[, c("A", "B", "C")]
for others:
test <- test[, c("B", "A", "C")]

An alternative option is to use str_sort() from library stringr, with the argument numeric = TRUE. This will correctly order column that include numbers not just alphabetically:
str_sort(c("V3", "V1", "V10"), numeric = TRUE)
# [1] V1 V3 V10

test[,sort(names(test))]
sort on names of columns can work easily.

If you only want one or more columns in the front and don't care about the order of the rest:
require(dplyr)
test %>%
select(B, everything())

So to have a specific column come first, then the rest alphabetically, I'd propose this solution:
test[, c("myFirstColumn", sort(setdiff(names(test), "myFirstColumn")))]

Here is what I found out to achieve a similar problem with my data set.
First, do what James mentioned above, i.e.
test[ , order(names(test))]
Second, use the everything() function in dplyr to move specific columns of interest (e.g., "D", "G", "K") at the beginning of the data frame, putting the alphabetically ordered columns after those ones.
select(test, D, G, K, everything())
­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­

Similar to other syntax above but for learning - can you sort by column names?
sort(colnames(test[1:ncol(test)] ))

another option is..
mtcars %>% dplyr::select(order(names(mtcars)))

In data.table you can use the function setcolorder:
setcolorder reorders the columns of data.table, by reference, to the
new order provided.
Here a reproducible example:
library(data.table)
test = data.table(C = c(0, 2, 4, 7, 8), A = c(4, 2, 4, 7, 8), B = c(1, 3, 8, 3, 2))
setcolorder(test, c(order(names(test))))
test
#> A B C
#> 1: 4 1 0
#> 2: 2 3 2
#> 3: 4 8 4
#> 4: 7 3 7
#> 5: 8 2 8
Created on 2022-07-10 by the reprex package (v2.0.1)

Related

impute data with condition (4 levels) in R

I have two different variables in my data. first one: vein size (there are NAs).
Second variable is: procedure site (c=(1,2,3,4))
I want to impute different value to vein size based on different procedure site. I tried if else, but it wasn't successful. e.g.: if procedure site is 1 or 2, impute 3; if procedure site is 3, impute 4; if procedure site is 4, impute 5.I am new to this field. Any help is much appreciated!
vein.size<-(3,3,3,NA,NA,NA)
procedure.site<-(1,2,2,3,4,4)
df<-cbind(vein.size,procedure.site)
My expected output is:
vein.size<-(3,3,3,4,5,5)
thank you
You can use chain of ifelse statement or try case_when from dplyr :
library(dplyr)
df <- df %>%
mutate(output = case_when(is.na(vein.size) & procedure.site %in% 1:2 ~ 3,
is.na(vein.size) & procedure.site == 3 ~ 4,
is.na(vein.size) & procedure.site == 4 ~ 5,
TRUE ~ vein.size))
# vein.size procedure.site output
#1 3 1 3
#2 3 2 3
#3 3 2 3
#4 NA 3 4
#5 NA 4 5
#6 NA 4 5
data
vein.size<-c(3,3,3,NA,NA,NA)
procedure.site<-c(1,2,2,3,4,4)
df<-data.frame(vein.size,procedure.site)
You could use a lookup table and then merge:
# your data
vein.size <- c(3,3,3,NA,NA,NA)
procedure.site <- c(1,2,2,3,4,4)
your_df <- data.frame(vein.size = vein.size,
procedure.site = procedure.site)
# the lookup table
lookup_df <- data.frame(
procedure.site = c(1, 2, 3, 4),
imputation = c(3, 3, 4, 5)
)
# result
merge(your_df, lookup_df, by='procedure.site')
Which gives:
procedure.site vein.size imputation
1 1 3 3
2 2 3 3
3 2 3 3
4 3 NA 4
5 4 NA 5
6 4 NA 5

How to list R data frame variables in alphabetical order? [duplicate]

This is possibly a simple question, but I do not know how to order columns alphabetically.
test = data.frame(C = c(0, 2, 4, 7, 8), A = c(4, 2, 4, 7, 8), B = c(1, 3, 8, 3, 2))
# C A B
# 1 0 4 1
# 2 2 2 3
# 3 4 4 8
# 4 7 7 3
# 5 8 8 2
I like to order the columns by column names alphabetically, to achieve
# A B C
# 1 4 1 0
# 2 2 3 2
# 3 4 8 4
# 4 7 3 7
# 5 8 2 8
For others I want my own defined order:
# B A C
# 1 4 1 0
# 2 2 3 2
# 3 4 8 4
# 4 7 3 7
# 5 8 2 8
Please note that my datasets are huge, with 10000 variables. So the process needs to be more automated.
You can use order on the names, and use that to order the columns when subsetting:
test[ , order(names(test))]
A B C
1 4 1 0
2 2 3 2
3 4 8 4
4 7 3 7
5 8 2 8
For your own defined order, you will need to define your own mapping of the names to the ordering. This would depend on how you would like to do this, but swapping whatever function would to this with order above should give your desired output.
You may for example have a look at Order a data frame's rows according to a target vector that specifies the desired order, i.e. you can match your data frame names against a target vector containing the desired column order.
Here's the obligatory dplyr answer in case somebody wants to do this with the pipe.
test %>%
select(sort(names(.)))
test = data.frame(C=c(0,2,4, 7, 8), A=c(4,2,4, 7, 8), B=c(1, 3, 8,3,2))
Using the simple following function replacement can be performed (but only if data frame does not have many columns):
test <- test[, c("A", "B", "C")]
for others:
test <- test[, c("B", "A", "C")]
An alternative option is to use str_sort() from library stringr, with the argument numeric = TRUE. This will correctly order column that include numbers not just alphabetically:
str_sort(c("V3", "V1", "V10"), numeric = TRUE)
# [1] V1 V3 V10
test[,sort(names(test))]
sort on names of columns can work easily.
If you only want one or more columns in the front and don't care about the order of the rest:
require(dplyr)
test %>%
select(B, everything())
So to have a specific column come first, then the rest alphabetically, I'd propose this solution:
test[, c("myFirstColumn", sort(setdiff(names(test), "myFirstColumn")))]
Here is what I found out to achieve a similar problem with my data set.
First, do what James mentioned above, i.e.
test[ , order(names(test))]
Second, use the everything() function in dplyr to move specific columns of interest (e.g., "D", "G", "K") at the beginning of the data frame, putting the alphabetically ordered columns after those ones.
select(test, D, G, K, everything())
­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­
Similar to other syntax above but for learning - can you sort by column names?
sort(colnames(test[1:ncol(test)] ))
another option is..
mtcars %>% dplyr::select(order(names(mtcars)))
In data.table you can use the function setcolorder:
setcolorder reorders the columns of data.table, by reference, to the
new order provided.
Here a reproducible example:
library(data.table)
test = data.table(C = c(0, 2, 4, 7, 8), A = c(4, 2, 4, 7, 8), B = c(1, 3, 8, 3, 2))
setcolorder(test, c(order(names(test))))
test
#> A B C
#> 1: 4 1 0
#> 2: 2 3 2
#> 3: 4 8 4
#> 4: 7 3 7
#> 5: 8 2 8
Created on 2022-07-10 by the reprex package (v2.0.1)

How to reorder columns of a data.frame with na condition? [duplicate]

This is possibly a simple question, but I do not know how to order columns alphabetically.
test = data.frame(C = c(0, 2, 4, 7, 8), A = c(4, 2, 4, 7, 8), B = c(1, 3, 8, 3, 2))
# C A B
# 1 0 4 1
# 2 2 2 3
# 3 4 4 8
# 4 7 7 3
# 5 8 8 2
I like to order the columns by column names alphabetically, to achieve
# A B C
# 1 4 1 0
# 2 2 3 2
# 3 4 8 4
# 4 7 3 7
# 5 8 2 8
For others I want my own defined order:
# B A C
# 1 4 1 0
# 2 2 3 2
# 3 4 8 4
# 4 7 3 7
# 5 8 2 8
Please note that my datasets are huge, with 10000 variables. So the process needs to be more automated.
You can use order on the names, and use that to order the columns when subsetting:
test[ , order(names(test))]
A B C
1 4 1 0
2 2 3 2
3 4 8 4
4 7 3 7
5 8 2 8
For your own defined order, you will need to define your own mapping of the names to the ordering. This would depend on how you would like to do this, but swapping whatever function would to this with order above should give your desired output.
You may for example have a look at Order a data frame's rows according to a target vector that specifies the desired order, i.e. you can match your data frame names against a target vector containing the desired column order.
Here's the obligatory dplyr answer in case somebody wants to do this with the pipe.
test %>%
select(sort(names(.)))
test = data.frame(C=c(0,2,4, 7, 8), A=c(4,2,4, 7, 8), B=c(1, 3, 8,3,2))
Using the simple following function replacement can be performed (but only if data frame does not have many columns):
test <- test[, c("A", "B", "C")]
for others:
test <- test[, c("B", "A", "C")]
An alternative option is to use str_sort() from library stringr, with the argument numeric = TRUE. This will correctly order column that include numbers not just alphabetically:
str_sort(c("V3", "V1", "V10"), numeric = TRUE)
# [1] V1 V3 V10
test[,sort(names(test))]
sort on names of columns can work easily.
If you only want one or more columns in the front and don't care about the order of the rest:
require(dplyr)
test %>%
select(B, everything())
So to have a specific column come first, then the rest alphabetically, I'd propose this solution:
test[, c("myFirstColumn", sort(setdiff(names(test), "myFirstColumn")))]
Here is what I found out to achieve a similar problem with my data set.
First, do what James mentioned above, i.e.
test[ , order(names(test))]
Second, use the everything() function in dplyr to move specific columns of interest (e.g., "D", "G", "K") at the beginning of the data frame, putting the alphabetically ordered columns after those ones.
select(test, D, G, K, everything())
­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­
Similar to other syntax above but for learning - can you sort by column names?
sort(colnames(test[1:ncol(test)] ))
another option is..
mtcars %>% dplyr::select(order(names(mtcars)))
In data.table you can use the function setcolorder:
setcolorder reorders the columns of data.table, by reference, to the
new order provided.
Here a reproducible example:
library(data.table)
test = data.table(C = c(0, 2, 4, 7, 8), A = c(4, 2, 4, 7, 8), B = c(1, 3, 8, 3, 2))
setcolorder(test, c(order(names(test))))
test
#> A B C
#> 1: 4 1 0
#> 2: 2 3 2
#> 3: 4 8 4
#> 4: 7 3 7
#> 5: 8 2 8
Created on 2022-07-10 by the reprex package (v2.0.1)

Reverse Scoring Items

I have a survey of about 80 items, primarily the items are valanced positively (higher scores indicate better outcome), but about 20 of them are negatively valanced, I need to find a way to reverse score the ones negatively valanced in R. I am completely lost on how to do so. I am definitely an R beginner, and this is probably a dumb question, but could someone point me in an direction code-wise?
Here's an example with some fake data that you can adapt to your data:
# Fake data: Three questions answered on a 1 to 5 scale
set.seed(1)
dat = data.frame(Q1=sample(1:5,10,replace=TRUE),
Q2=sample(1:5,10,replace=TRUE),
Q3=sample(1:5,10,replace=TRUE))
dat
Q1 Q2 Q3
1 2 2 5
2 2 1 2
3 3 4 4
4 5 2 1
5 2 4 2
6 5 3 2
7 5 4 1
8 4 5 2
9 4 2 5
10 1 4 2
# Say you want to reverse questions Q1 and Q3
cols = c("Q1", "Q3")
dat[ ,cols] = 6 - dat[ ,cols]
dat
Q1 Q2 Q3
1 4 2 1
2 4 1 4
3 3 4 2
4 1 2 5
5 4 4 4
6 1 3 4
7 1 4 5
8 2 5 4
9 2 2 1
10 5 4 4
If you have a lot of columns, you can use tidyverse functions to select multiple columns to recode in a single operation.
library(tidyverse)
# Reverse code columns Q1 and Q3
dat %>% mutate(across(matches("^Q[13]"), ~ 6 - .))
# Reverse code all columns that start with Q followed by one or two digits
dat %>% mutate(across(matches("^Q[0-9]{1,2}"), ~ 6 - .))
# Reverse code columns Q11 through Q20
dat %>% mutate(across(Q11:Q20, ~ 6 - .))
If different columns could have different maximum values, you can (adapting #HellowWorld's suggestion) customize the reverse-coding to the maximum value of each column:
# Reverse code columns Q11 through Q20
dat %>% mutate(across(Q11:Q20, ~ max(.) + 1 - .))
Here is an alternative approach using the psych package. If you are working with survey data this package has lots of good functions. Building on #eipi10 data:
# Fake data: Three questions answered on a 1 to 5 scale
set.seed(1)
original_data = data.frame(Q1=sample(1:5,10,replace=TRUE),
Q2=sample(1:5,10,replace=TRUE),
Q3=sample(1:5,10,replace=TRUE))
original_data
# Say you want to reverse questions Q1 and Q3. Set those keys to -1 and Q2 to 1.
# install.packages("psych") # Uncomment this if you haven't installed the psych package
library(psych)
keys <- c(-1,1,-1)
# Use the handy function from the pysch package
# mini is the minimum value and maxi is the maimum value
# mini and maxi can also be vectors if you have different scales
new_data <- reverse.code(keys,original_data,mini=1,maxi=5)
new_data
The pro to this approach is that you can recode your entire survey in one function. The con to this is you need a library. The stock R approach is more elegant as well.
FYI, this is my first post on stack overflow. Long time listener, first time caller. So please give me feedback on my response.
Just converting #eipi10's answer using tidyverse:
# Create same fake data: Three questions answered on a 1 to 5 scale
set.seed(1)
dat <- data.frame(Q1 = sample(1:5,10, replace=TRUE),
Q2 = sample(1:5,10, replace=TRUE),
Q3 = sample(1:5,10, replace=TRUE))
# Reverse scores in the desired columns (Q2 and Q3)
dat <- dat %>%
mutate(Q2Reversed = 6 - Q2,
Q3Reversed = 6 - Q3)
Another example is to use recode in library(car).
#Example data
data = data.frame(Q1=sample(1:5,10, replace=TRUE))
# Say you want to reverse questions Q1
library(car)
data$Q1reversed <- recode(data$Q1, "1=5; 2=4; 3=3; 4=2; 5=1")
data
The psych package has the intuitive reverse.code() function that can be helpful. Using the dataset started by #eipi10 and the same goal or reversing q1 and q2:
set.seed(1)
dat <- data.frame(q1 =sample(1:5,10,replace=TRUE),
q2=sample(1:5,10,replace=TRUE),
q3 =sample(1:5,10,replace=TRUE))
You can use the reverse.code() function. The first argument is the keys. This is a vector of 1 and -1. -1 means that you want to reverse that item. These go in the same order as your data.
The second argument, called items, is simply the name of your dataset. That is, where are these items located?
Last, the mini and maxi arguments are the smallest and largest values that a participant could possibly score. You can also leave these arguments to NULL and the function will use the lowest and highest values in your data.
library(psych)
keys <- c(-1, 1, -1)
dat1 <- reverse.code(keys = keys, items = dat, mini = 1, maxi = 5)
dat1
Alternatively, your keys can also contain the specific names of the variables that you want to reverse score. This is helpful if you have many variables to reverse score and yields the same answer:
library(psych)
keys <- c("q1", "q3")
dat2 <- reverse.code(keys = keys, items = dat, mini = 1, maxi = 5)
dat2
Note that, after reverse scoring, reverse.code() slightly modifies the variable name to have a - behind it (i.e., q1 becomes q1- after being reverse scored).
The solutions above assume wide data (one score per column). This reverse scores specific rows in long data (one score per row).
library(magrittr)
max <- 5
df <- data.frame(score=sample(1:max, 20, replace=TRUE))
df <- mutate(df, question = rownames(df))
df
df[c(4,13,17),] %<>% mutate(score = max + 1 - score)
df
Here is another attempt that will generalize to any number of columns. Let's use some made up data to illustrate the function.
# create a df
{
A = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
B = c(9, 2, 3, 2, 4, 0, 2, 7, 2, 8)
C = c(2, 4, 1, 0, 2, 1, 3, 0, 7, 8)
df1 = data.frame(A, B, C)
print(df1)
}
A B C
1 3 9 2
2 3 2 4
3 3 3 1
4 3 2 0
5 3 4 2
6 3 0 1
7 3 2 3
8 3 7 0
9 3 2 7
10 3 8 8
The columns to reverse code
# variables to reverse code
vtcode = c("A", "B")
The function to reverse-code the selected columns
reverseCode <- function(data, rev){
# get maximum value per desired col: lapply(data[rev], max)
# subtract values in cols to reverse-code from max value plus 1
data[, rev] = mapply("-", lapply(data[rev], max), data[, rev]) + 1
return(data)
}
reverseCode(df1, vtcode)
A B C
1 1 1 2
2 1 8 4
3 1 7 1
4 1 8 0
5 1 6 2
6 1 10 1
7 1 8 3
8 1 3 0
9 1 8 7
10 1 2 8
This code was inspired by another response a response from #catastrophic-failure relating to subtract max of column from all entries in column R

comparing two files and outputting common elements

I have 2 files of 3 columns and hundreds of rows. I want to compare and list the common elements of first two columns of the two files. Then the list which i will get after comparing i have to add the third column of second file to that list. Third column will contain the values which were in the second file corresponding to numbers of remaining two columns which i have got as common to both the files.
For example, consider two files of 6 rows and 3 columns
First file -
1 2 3
2 3 4
4 6 7
3 8 9
11 10 5
19 6 14
second file -
1 4 1
2 1 4
4 6 10
3 7 2
11 10 3
19 6 5
As i said i have to compare the first two columns and then add the third column of second file to that list. Therefore, output must be:
4 6 10
11 10 3
19 6 5
I have the following code, however its showing an error object not found also i am not able to add the third column. Please help :)
df2 = reading first file, df3 = reading second file. Code is in R language.
s1 = 1
for(i in 1:nrow(df2)){
for(j in 1:nrow(df3)){
if(df2[i,1] == df3[j,1]){
if(df2[i,2] == df3[j,2]){
common.rows1[s1,1] <- df2[i,1]
common.rows1[s1,2] <- df2[i,2]
s1 = s1 + 1
}
}
}
You can use the %in% operator twice to subset your second data.frame (I call it df2):
df2[df2$V1 %in% df1$V1 & df2$V2 %in% df1$V2,]
# V1 V2 V3
#3 4 6 10
#5 11 10 3
#6 19 6 5
V1 and V2 in my example are the column names of df1 and df2.
It seems that this is the perfect use-case for merge, e.g.
merge(d1[c('V1','V2')],d2)
results in:
V1 V2 V3
1 11 10 3
2 19 6 5
3 4 6 10
In which 'V1' and 'V2' are the column names of interest.
data.table proposal
library(data.table)
setDT(df1)
setDT(df2)
setkey(df1, V1, V2)
setkey(df2, V1, V2)
df2[df1[, -3, with = F], nomatch = 0]
## V1 V2 V3
## 1: 4 6 10
## 2: 11 10 3
## 3: 19 6 5
If your two tables are d1 and d2,
d1<-data.frame(
V1 = c(1, 2, 4, 3, 11, 19),
V2 = c(2, 3, 6, 8, 10, 6),
V3 = c(3, 4, 7, 9, 5, 14)
)
d2<-data.frame(
V1 = c(1, 2, 4, 3, 11, 19),
V2 = c(4, 1, 6, 7, 10, 6),
V3 = c(1, 4, 10, 2, 3, 5)
)
then you can subset d2 (in order to keep the third column) with
d2[interaction(d2$V1, d2$V2) %in% interaction(d1$V1, d1$V2),]
The interaction() treats the first two columns as a combined key.

Resources