sorting data frame columns based on specific value in each column - r

I am using the Tidyverse package in R. I have a data frame with 20 rows and 500 columns. I want to sort all the columns based on the size of the value in the last row of each column.
Here is an example with just 3 rows and 4 columns:
1 2 3 4,
5 6 7 8,
8 7 9 1
The desired result is:
3 1 2 4,
7 5 6 8,
9 8 7 1
I searched stack overflow but could not find an answer to this type of question.

With dplyr from the tidyverse, we can use slice(n()) to extract the last row, pass it to order() with decreasing = TRUE, and use the resulting positions to reorder the columns.
library(dplyr)
df[df %>% slice(n()) %>% order(decreasing = TRUE)]
# V3 V1 V2 V4
#1 3 1 2 4
#2 7 5 6 8
#3 9 8 7 1
The base R equivalent is
df[order(df[nrow(df), ], decreasing = TRUE)]
Data:
df <- read.table(text = "1 2 3 4
5 6 7 8
8 7 9 1")

The following reorders the data frame columns by the values in the last row:
df <- data.frame(col1 = c(1, 5, 8), col2 = c(2, 6, 7), col3 = c(3, 7, 9), col4 = c(4, 8, 1))
last_row <- df[nrow(df), ]
df <- df[, order(last_row, decreasing = TRUE)]
First extract the last row, then pass it to order() to get the column positions sorted by value, and finally index the data frame with those positions.
> df
col3 col1 col2 col4
1 3 1 2 4
2 7 5 6 8
3 9 8 7 1
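As a minimal sketch of the same idea, assuming every column is numeric, the last row can be flattened to a named vector with unlist() before ordering, so that order() receives a plain vector rather than a one-row data frame. The names last_vals and sorted are illustrative only.

```r
# Example data from the question, with explicit column names
df <- data.frame(col1 = c(1, 5, 8), col2 = c(2, 6, 7),
                 col3 = c(3, 7, 9), col4 = c(4, 8, 1))

# Flatten the last row to a named numeric vector (assumes numeric columns)
last_vals <- unlist(df[nrow(df), ])

# Column positions sorted by the last-row value, largest first
sorted <- df[, order(last_vals, decreasing = TRUE), drop = FALSE]
names(sorted)
```

The drop = FALSE argument keeps the result a data frame even if only one column remains.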

Related

How to subset a data.frame according to the values of last two rows?

###the original data
df1 <- data.frame(a=c(2,2,5,5,7), b=c(1,5,4,7,6))
df2 <- data.frame(a=c(2,2,5,5,7,7), b=c(1,5,4,7,6,3))
When the values of column a in the last two rows are not equal (here the 4th row differs from the 5th row, namely 5 != 7), I want to keep only the last row.
#input
> df1
a b
1 2 1
2 2 5
3 5 4
4 5 7
5 7 6
#output
> df1
a b
1 7 6
When the values of column a in the last two rows are equal (here the 5th row equals the 6th row, namely 7 == 7), I want to keep the last two rows.
#input
> df2
a b
1 2 1
2 2 5
3 5 4
4 5 7
5 7 6
6 7 3
#output
> df2
a b
1 7 6
2 7 3
You can write a function that compares the values of column a in the last two rows:
return_rows <- function(data) {
  n <- nrow(data)
  # Keep two rows when the last two values of a match, otherwise one
  if (data$a[n] == data$a[n - 1])
    tail(data, 2)
  else tail(data, 1)
}
return_rows(df1)
# a b
#5 7 6
return_rows(df2)
# a b
#5 7 6
#6 7 3
Try it this way:
library(tidyverse)
df1 %>%
filter(a == last(a))
a b
1 7 6
df2 %>%
filter(a == last(a))
a b
1 7 6
2 7 3
We can use subset from base R:
subset(df1, a == a[length(a)])
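The answers above agree on df1 and df2 but are not equivalent in general: comparing a against its last value keeps every row that matches, not only the last two. The sketch below illustrates this with a hypothetical data frame df3 (not from the question) and a variant of return_rows that also guards against inputs with fewer than two rows.

```r
# Variant of return_rows with a guard for data frames shorter than two rows
return_rows <- function(data) {
  n <- nrow(data)
  if (n >= 2 && data$a[n] == data$a[n - 1]) tail(data, 2) else tail(data, 1)
}

# Hypothetical input: the last value of a (7) also appears in row 1
df3 <- data.frame(a = c(7, 2, 7, 7), b = c(1, 5, 4, 7))

by_position <- return_rows(df3)                  # last two rows only
by_value    <- subset(df3, a == a[length(a)])    # every row where a equals 7

nrow(by_position)
nrow(by_value)
```

Which behaviour is wanted depends on whether earlier rows with the same value should count; the question as stated only mentions the last two rows.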

How to check if rows in one column present in another column in R

I have a data set, data1, with id and emails columns as follows:
id emails
1 A,B,C,D,E
2 F,G,H,A,C,D
3 I,K,L,T
4 S,V,F,R,D,S,W,A
5 P,A,L,S
6 Q,W,E,R,F
7 S,D,F,E,Q
8 Z,A,D,E,F,R
9 X,C,F,G,H
10 A,V,D,S,C,E
I have another data set, data2, with a check_email column as follows:
check_email
A
D
S
V
I want to check whether any value in the check_email column of data2 appears in the emails column of data1, and keep only the ids from data1 for which that is the case.
My desired output will be:
id
1
2
4
5
7
8
10
I wrote code using a for loop, but it takes forever because my actual dataset is very large.
Any advice in this regard will be highly appreciated!
You can use a regular expression to subset your data. First collapse all the values into one pattern:
paste(data2$check_email, collapse = "|")
# [1] "A|D|S|V"
Then find the indices of the rows whose emails match the pattern:
grep(paste(data2$check_email, collapse = "|"), data1$emails)
# [1] 1 2 4 5 7 8 10
And then combine everything:
data1[grep(paste(data2$check_email, collapse = "|"), data1$emails), ]
#    id emails
# 1   1 A,B,C,D,E
# 2   2 F,G,H,A,C,D
# 4   4 S,V,F,R,D,S,W,A
# 5   5 P,A,L,S
# 7   7 S,D,F,E,Q
# 8   8 Z,A,D,E,F,R
# 10 10 A,V,D,S,C,E
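Alternatively, test each value separately with grepl and keep the rows with at least one match:
data1[rowSums(sapply(data2$check_email, function(x) grepl(x, data1$emails))) > 0, "id", drop = FALSE]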
id
1 1
2 2
4 4
5 5
7 7
8 8
10 10
We can split each element of the character vector as.character(data1$emails) on commas, then iterate over the resulting list with sapply, checking whether any of the pieces is contained in data2$check_email. Finally we extract the matching rows from data1:
> emails <- strsplit(as.character(data1$emails), ",")
> ind <- sapply(emails, function(emails) any(emails %in% as.character(data2$check_email)))
> data1[ind,"id", drop = FALSE]
id
1 1
2 2
4 4
5 5
7 7
8 8
10 10
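The regular-expression and split-based answers give the same result on the single-letter example, but they can differ on realistic addresses, because grep matches substrings and treats characters such as "." as wildcards. The sketch below uses made-up addresses (hypothetical, not from the question) to show the difference; it assumes the emails are comma-separated with no surrounding spaces.

```r
# Hypothetical data with realistic-looking addresses
data1 <- data.frame(id = 1:3,
                    emails = c("ann@x.org,bob@x.org", "joann@x.org", "cy@x.org"),
                    stringsAsFactors = FALSE)
check <- c("ann@x.org")

# Substring/regex matching: "ann@x.org" is also found inside "joann@x.org"
regex_hit <- grepl(paste(check, collapse = "|"), data1$emails)

# Exact token matching: split on commas, then compare whole addresses
tokens <- strsplit(data1$emails, ",", fixed = TRUE)
token_hit <- vapply(tokens, function(x) any(x %in% check), logical(1))

data1$id[regex_hit]
data1$id[token_hit]
```

If exact addresses are what matters, the split-and-%in% approach is the safer of the two.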

Moving down columns in data frames in R

Suppose I have the following data frame:
df<-data.frame(step1=c(1,2,3,4),step2=c(5,6,7,8),step3=c(9,10,11,12),step4=c(13,14,15,16))
step1 step2 step3 step4
1 1 5 9 13
2 2 6 10 14
3 3 7 11 15
4 4 8 12 16
and what I have to do is something like the following:
df2<-data.frame(col1=c(1,2,3,4,5,6,7,8,9,10,11,12),col2=c(5,6,7,8,9,10,11,12,13,14,15,16))
col1 col2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
6 6 10
7 7 11
8 8 12
9 9 13
10 10 14
11 11 15
12 12 16
How can I do that? Note that more steps may be included (for example, 20 steps).
Thanks!!
We can write a function for this task; df_final is the final output. The bin argument lets the user specify how many consecutive columns are stacked into each new column.
# A function to conduct data transformation
trans_fun <- function(df, bin = 3){
  # Calculate the number of new columns
  new_ncol <- (ncol(df) - bin) + 1
  # Create a list to store all data frames
  df_list <- lapply(1:new_ncol, function(num){
    return(df[, num:(num + bin - 1)])
  })
  # Convert each data frame to a vector
  dt_list2 <- lapply(df_list, unlist)
  # Convert dt_list2 to data frame
  df_final <- as.data.frame(dt_list2)
  # Set the column and row names of df_final
  colnames(df_final) <- paste0("col", 1:new_ncol)
  rownames(df_final) <- 1:nrow(df_final)
  return(df_final)
}
# Apply the trans_fun
df_final <- trans_fun(df)
df_final
col1 col2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
6 6 10
7 7 11
8 8 12
9 9 13
10 10 14
11 11 15
12 12 16
Here is a method using dplyr and reshape2. It assumes all of the columns are the same length.
library(dplyr)
library(reshape2)
Drop the last column from the data frame:
df[, 1:(ncol(df) - 1)] %>%
melt() %>%
dplyr::select(col1=value) -> col1
Drop the first column from the data frame:
df %>%
dplyr::select(-step1) %>%
melt() %>%
dplyr::select(col2=value) -> col2
Combine the data frames:
bind_cols(col1, col2)
This should do the job:
df2 <- data.frame(col1 = 1:(length(df$step1) + length(df$step2) + length(df$step3)))
df2$col1 <- c(df$step1, df$step2, df$step3)
df2$col2 <- c(df$step2, df$step3, df$step4)
Points to note:
The first line matters because the data frame must be created with the right number of rows (here 12, that is, three stacked columns of 4 values each)
Assigning to a column that does not exist creates one with that name
Deleting a column in R is done like this: df2$col <- NULL
Are you not just looking to do:
df2 <- data.frame(col1 = unlist(df[, -ncol(df)]),  # all columns except the last
col2 = unlist(df[, -1]))  # all columns except the first
rownames(df2) <- NULL
df2
col1 col2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
6 6 10
7 7 11
8 8 12
9 9 13
10 10 14
11 11 15
12 12 16
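The last approach extends to any number of step columns, since it only depends on dropping the last or the first column. A minimal sketch, assuming all columns have the same type and using integer data equivalent to the example (the variable k is illustrative):

```r
# Same values as the question's df, written with integer ranges
df <- data.frame(step1 = 1:4, step2 = 5:8, step3 = 9:12, step4 = 13:16)

k <- ncol(df)  # number of step columns; works for 20 steps as well as 4

# Stack all but the last column, and all but the first column
df2 <- data.frame(col1 = unlist(df[-k], use.names = FALSE),
                  col2 = unlist(df[-1], use.names = FALSE))
dim(df2)
```

Using use.names = FALSE avoids the generated names such as step11, so there is no need to reset the row names afterwards.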

Select the variables in one dataframe from a list in another dataframe

I have a large data set with many columns and rows. I want to select a few columns of df1 using a list of variable names (the column names of df1) stored in df2. For example, I have
df1 <- data.frame(A=sample(1:10, 10), B=sample(1:10, 10), C=sample(1:10,10), D=sample(1:10,10))
var <- c('A','C')
ratio <- c(0.5,0.6)
df2 <- data.frame(var,ratio)
New dataframe should look like this:
A C
1 9 2
2 1 3
3 4 5
4 2 8
5 10 7
6 5 1
7 7 9
8 3 4
9 8 10
10 6 6
If 'var' is stored as a factor (the default for strings in data.frame() before R 4.0.0), we need to convert it to character before using it to subset the first data frame:
df1[as.character(df2$var)]
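If df2 may list names that are not columns of df1, indexing directly raises an "undefined columns selected" error. One possible sketch, assuming such names should simply be skipped, is to intersect the requested names with names(df1) first. The extra name "Z" and the variables wanted, present and subset_df are hypothetical.

```r
set.seed(1)  # only to make the sampled example reproducible
df1 <- data.frame(A = sample(1:10, 10), B = sample(1:10, 10),
                  C = sample(1:10, 10), D = sample(1:10, 10))

# "Z" is a hypothetical name that does not exist in df1
df2 <- data.frame(var = c("A", "C", "Z"), ratio = c(0.5, 0.6, 0.7))

wanted  <- as.character(df2$var)          # safe whether var is factor or character
present <- intersect(wanted, names(df1))  # keeps the order given in df2
subset_df <- df1[present]
names(subset_df)
```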

R Looking up closest value in data.frame less than equal to another value

I have two data.frames, lookup_df and values_df. For each row in lookup_df I want to lookup the closest value in the values_df that is less than or equal to an index value.
Here's my code so far:
lookup_df <- data.frame(ids = 1:10)
values_df <- data.frame(idx = c(1,3,7), values = c(6,2,8))
What I'm wanting for the result_df is the following:
> result_df
ids values
1 1 6
2 2 6
3 3 2
4 4 2
5 5 2
6 6 2
7 7 8
8 8 8
9 9 8
10 10 8
I know how to do this fairly easily with SQL, but I'm curious whether there is a straightforward way in R. I could iterate over the rows of lookup_df and, for each one, loop through the rows of values_df, but that is not computationally efficient. I'm open to using the dplyr library if someone knows how to solve the problem with it.
If values_df is sorted by idx ascending, then findInterval will work:
lookup_df <- data.frame(ids = 1:10)
values_df <- data.frame(idx = c(1,3,7), values = c(6,2,8))
lookup_df$values <- values_df$values[findInterval(lookup_df$ids,values_df$idx)]
> lookup_df
ids values
1 1 6
2 2 6
3 3 2
4 4 2
5 5 2
6 6 2
7 7 8
8 8 8
9 9 8
10 10 8
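Two edge cases are worth handling with findInterval: values_df must be sorted by idx, and any id smaller than the smallest idx gets position 0, which silently drops an element when used as an index. A minimal sketch, assuming such ids should receive NA (the unsorted values_df and the id 0 are hypothetical additions to the example):

```r
lookup_df <- data.frame(ids = 0:4)                              # 0 is below every idx
values_df <- data.frame(idx = c(3, 1, 7), values = c(2, 6, 8))  # deliberately unsorted

# findInterval requires its second argument to be sorted ascending
values_df <- values_df[order(values_df$idx), ]

pos <- findInterval(lookup_df$ids, values_df$idx)
pos[pos == 0] <- NA   # no idx is <= this id, so there is nothing to look up

lookup_df$values <- values_df$values[pos]
lookup_df
```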
