Sort data based on conditions - r

I have a data frame (x) in R with 5 numeric columns. Apart from this, I have one more piece of information: the sorting order to follow, given as a vector, i.e.
1, 0, 2, 4, 3
dataset
v1 v2 v3 v4 v5
1 2 3 4 5
3 13 12 1 4
6 4 6 5 3
Expected result
v1 v2 v3 v4 v5
3 13 12 1 4
1 2 3 4 5
6 4 6 5 3
This vector defines the sorting order: the first column is sorted first, then the 3rd column, then the 5th column and then the 4th column (the 0 means the 2nd column is not used). Manually it can be done as
x = x[order(x[[1]]), ]
x = x[order(x[[3]]), ]
x = x[order(x[[5]]), ]
x = x[order(x[[4]]), ]
rownames(x) = NULL
The problem is that this is easy for 5 columns but gets complicated for hundreds of columns.
Any lead on this will be appreciated.
Thanks

We can do a match on the original vector and then use a for loop to get the output
i1 <- match(seq_along(x), vec, nomatch = 0)
i1 <- i1[i1!=0]
for(i in i1){
# x[[i]] extracts column i as a plain vector for order()
x <- x[order(x[[i]]), ]
}
x
# v1 v2 v3 v4 v5
# 2 3 13 12 1 4
# 1 1 2 3 4 5
# 3 6 4 6 5 3
data
x <- structure(list(v1 = c(1L, 3L, 6L), v2 = c(2L, 13L, 4L), v3 = c(3L,
12L, 6L), v4 = c(4L, 1L, 5L), v5 = c(5L, 4L, 3L)), .Names = c("v1",
"v2", "v3", "v4", "v5"), class = "data.frame", row.names = c(NA,
-3L))
vec <- c(1, 0, 2, 4, 3)
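As a possible alternative to the loop, here is a minimal sketch that does the same thing in one call. It relies on order() leaving ties in their original order, so sorting repeatedly by columns 1, 3, 5, 4 should be equivalent to a single sort whose keys are those columns in reverse (4, 5, 3, 1). The name res is only used here for illustration.

```r
# Sample data and order vector, copied from the answer above
x <- data.frame(v1 = c(1L, 3L, 6L), v2 = c(2L, 13L, 4L), v3 = c(3L, 12L, 6L),
                v4 = c(4L, 1L, 5L), v5 = c(5L, 4L, 3L))
vec <- c(1, 0, 2, 4, 3)

# Columns in the order the loop sorts them: 1, 3, 5, 4
i1 <- match(seq_along(x), vec, nomatch = 0)
i1 <- i1[i1 != 0]

# The column sorted LAST by the loop acts as the primary key, so the
# keys are passed to a single order() call in reverse.
# unname() stops column names from being matched to order()'s arguments.
res <- x[do.call(order, unname(as.list(x[rev(i1)]))), ]
rownames(res) <- NULL
res
#   v1 v2 v3 v4 v5
# 1  3 13 12  1  4
# 2  1  2  3  4  5
# 3  6  4  6  5  3
```

With hundreds of columns this avoids reordering the whole data frame once per column.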

Related

How to sum multiple columns in two data frames in r

This question is similar to this one: R: Sum column wise value of two/more data frames having same variables (column names) and take Date column as reference, but my data frames have different numbers of columns and different column names, and there is no specific reference column.
Modifying his example:
df1:
V1 V2 V3
2 4 5
3 5 7
df2:
V1 V5 V2 V4
2 4 4 5
3 0 5 7
I want the result as:
df3:
V1 V2 V3 V4 V5
4 8 5 5 4
6 10 7 7 0
I keep getting errors like:
Error: Problem with `mutate()` input `..1`.
✖ Input `..1` can't be recycled to size 28. # 28 because this is referring to my df
ℹ Input `..1` is `colnames(col_name)`.
ℹ Input `..1` must be size 28 or 1, not 5992.
Run `rlang::last_error()` to see where the error occurred.
I've tried merge, join, by, etc.
Here's a base R option :
tmp <- cbind(df1, df2)
data.frame(sapply(split.default(tmp, names(tmp)), rowSums))
# V1 V2 V3 V4 V5
#1 4 8 5 5 4
#2 6 10 7 7 0
data
df1 <- structure(list(V1 = 2:3, V2 = 4:5, V3 = c(5L, 7L)),
class = "data.frame", row.names = c(NA, -2L))
df2 <- structure(list(V1 = 2:3, V5 = c(4L, 0L), V2 = 4:5, V4 = c(5L,
7L)), class = "data.frame", row.names = c(NA, -2L))
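If there are more than two data frames, the same idea seems to carry over by collecting them in a list first. This is only a sketch under the assumption that all data frames have the same number of rows and that rows line up by position; dfs and df3 are names chosen here for illustration.

```r
# Sample data from the answer above
df1 <- data.frame(V1 = 2:3, V2 = 4:5, V3 = c(5L, 7L))
df2 <- data.frame(V1 = 2:3, V5 = c(4L, 0L), V2 = 4:5, V4 = c(5L, 7L))

# Any number of data frames can go in this list
dfs <- list(df1, df2)

# Rows are matched by position, so every data frame needs the same row count
stopifnot(length(unique(vapply(dfs, nrow, integer(1)))) == 1)

# Bind all columns side by side (duplicate names are kept),
# then sum the columns that share a name
tmp <- do.call(cbind, dfs)
df3 <- data.frame(sapply(split.default(tmp, names(tmp)), rowSums))
df3
```

Columns that appear in only one data frame pass through unchanged, because rowSums() of a single column is just that column.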

Fill in missing rows in data in R

Suppose I have a data frame like this:
1 8
2 12
3 2
5 -6
6 1
8 5
I want to add a row in the places where the 4 and 7 would have gone in the first column and have the second column for these new rows be 0, so adding these rows:
4 0
7 0
I have no idea how to do this in R.
In Excel, I could use a VLOOKUP inside an IFERROR. Is there a similar combination of functions in R to make this happen?
Edit: also, suppose that row 1 was missing and needed to be filled in similarly. Would this require another solution? What if I wanted to add rows until I reached ten rows?
Use tidyr::complete to fill in the missing sequence between min and max values.
library(tidyr)
library(rlang)
complete(df, V1 = min(V1):max(V1), fill = list(V2 = 0))
#Or using `seq`
#complete(df, V1 = seq(min(V1), max(V1)), fill = list(V2 = 0))
# V1 V2
# <int> <dbl>
#1 1 8
#2 2 12
#3 3 2
#4 4 0
#5 5 -6
#6 6 1
#7 7 0
#8 8 5
If we already know the min and max we want, we can use them directly. Let's say we want rows for V1 = 1 to 10; then we can do:
complete(df, V1 = 1:10, fill = list(V2 = 0))
If we don't know the column names beforehand, we can do something like :
col1 <- names(df)[1]
col2 <- names(df)[2]
complete(df, !!sym(col1) := 1:10, fill = as.list(setNames(0, col2)))
data
df <- structure(list(V1 = c(1L, 2L, 3L, 5L, 6L, 8L), V2 = c(8L, 12L,
2L, -6L, 1L, 5L)), class = "data.frame", row.names = c(NA, -6L))
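Since the question asks for something like a VLOOKUP inside an IFERROR, a base R sketch along those lines might be a left join with merge() followed by replacing the NAs. The names full and res are made up for this example, and the 1:10 range is the one mentioned in the question's edit.

```r
# Sample data from the answer above
df <- data.frame(V1 = c(1L, 2L, 3L, 5L, 6L, 8L),
                 V2 = c(8L, 12L, 2L, -6L, 1L, 5L))

# Every key we want to see in the result
full <- data.frame(V1 = 1:10)

# Left join: keep every row of 'full'; keys missing from df get NA in V2
res <- merge(full, df, by = "V1", all.x = TRUE)

# Replace the NAs from unmatched keys with 0
res$V2[is.na(res$V2)] <- 0L
res
```

This also covers the case where row 1 is missing, since the key range comes from full rather than from df.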

How to combine two rows in R?

I would like to combine/sum two rows based on rownames to make one row in R. The best route might be to create a new row and sum the two rows together.
Example df:
A 1 3 4 6
B 3 2 7 9
C 6 8 1 2
D 3 2 8 9
Where A,B,C,D are rownames, I want to combine/sum two rows (A & C) into one to get:
A+C 7 11 5 8
B 3 2 7 9
D 3 2 8 9
Thank you.
aggregate to the rescue:
aggregate(df, list(Group=replace(rownames(df),rownames(df) %in% c("A","C"), "A&C")), sum)
# Group V2 V3 V4 V5
#1 A&C 7 11 5 8
#2 B 3 2 7 9
#3 D 3 2 8 9
You can replace the A row using the standard addition arithmetic operator, and then remove the C row with a logical statement.
df["A", ] <- df["A", ] + df["C", ]
df[rownames(df) != "C", ]
# V2 V3 V4 V5
# A 7 11 5 8
# B 3 2 7 9
# D 3 2 8 9
For more than two rows, you can use colSums() for the addition. This presumes the first value in nm is the one we are replacing/keeping.
nm <- c("A", "C")
df[nm[1], ] <- colSums(df[nm, ])
df[!rownames(df) %in% nm[-1], ]
I'll leave it up to you to change the row names. :)
Data:
df <- structure(list(V2 = c(1L, 3L, 6L, 3L), V3 = c(3L, 2L, 8L, 2L),
V4 = c(4L, 7L, 1L, 8L), V5 = c(6L, 9L, 2L, 9L)), .Names = c("V2",
"V3", "V4", "V5"), class = "data.frame", row.names = c("A", "B",
"C", "D"))
Matrix multiply?
> X <- as.matrix(df)  # assumed: X is the data frame above as a numeric matrix
> A <- matrix(c(1,0,0,0,1,0,1,0,0,0,0,1), 3)
> A
[,1] [,2] [,3] [,4]
[1,] 1 0 1 0
[2,] 0 1 0 0
[3,] 0 0 0 1
> A %*% X
V2 V3 V4 V5
[1,] 7 11 5 8
[2,] 3 2 7 9
[3,] 3 2 8 9
Or using the Matrix package for sparse matrices:
library(Matrix)
fac2sparse(factor(c(1,2,1,4))) %*% X
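Another base R option that might fit here is rowsum(), which sums rows by a grouping vector and uses the group labels as row names, so the combined row gets its new name in the same step. A minimal sketch, where grp and res are names chosen for illustration:

```r
# Sample data from the answers above
df <- data.frame(V2 = c(1L, 3L, 6L, 3L), V3 = c(3L, 2L, 8L, 2L),
                 V4 = c(4L, 7L, 1L, 8L), V5 = c(6L, 9L, 2L, 9L),
                 row.names = c("A", "B", "C", "D"))

# One group label per row; rows A and C share the label "A+C"
grp <- rownames(df)
grp[grp %in% c("A", "C")] <- "A+C"

# Sum the rows within each group; the labels become the row names
res <- rowsum(df, grp)
res
```

By default rowsum() returns the groups in sorted order, so the result may not follow the original row order.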

Manipulating a Data frame in R

I am new to R. I have a data frame:
A 5 8 9 6
B 8 2 3 6
C 1 8 9 5
I want to make
A 5
A 8
A 9
A 6
B 8
B 2
B 3
B 6
C 1
C 8
C 9
C 5
My actual data file is big.
Assuming you're starting with something like this:
mydf <- structure(list(V1 = c("A", "B", "C"), V2 = c(5L, 8L, 1L),
V3 = c(8L, 2L, 8L), V4 = c(9L, 3L, 9L),
V5 = c(6L, 6L, 5L)),
.Names = c("V1", "V2", "V3", "V4", "V5"),
class = "data.frame", row.names = c(NA, -3L))
mydf
# V1 V2 V3 V4 V5
# 1 A 5 8 9 6
# 2 B 8 2 3 6
# 3 C 1 8 9 5
Try one of the following:
library(reshape2)
melt(mydf, 1)
Or
cbind(mydf[1], stack(mydf[-1]))
Or
library(splitstackshape)
merged.stack(mydf, var.stubs = "V[2-5]", sep = "var.stubs")
The name pattern in the last example is unlikely to be applicable to your actual data though.
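Note that melt() and stack() return the values column by column (all of V2, then all of V3, and so on), while the question lists them row by row. A small sketch of one way to get the row-wise order with base R, assuming the first column can be used as the sort key (long is a name made up here):

```r
# Sample data from the answer above
mydf <- data.frame(V1 = c("A", "B", "C"), V2 = c(5L, 8L, 1L), V3 = c(8L, 2L, 8L),
                   V4 = c(9L, 3L, 9L), V5 = c(6L, 6L, 5L))

# Long format; the values come out column by column
long <- cbind(mydf[1], stack(mydf[-1]))

# order() keeps ties in their original order, so sorting by V1 groups the
# rows by letter while keeping the V2, V3, V4, V5 sequence inside each group
long <- long[order(long$V1), c("V1", "values")]
rownames(long) <- NULL
long
```

If the key column is not already in the order you want, converting it to a factor with the levels in the desired order before sorting should work too.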
Someone could probably do this in a better way but here I go...
I put your data into a data frame called data
#repeat the value in the first column (c - 1) times, where c is the number of columns (data[1,])
rep(data[,1], each=length(data[1,])-1)
#turning your data frame into a matrix allows you then turn it into a vector...
#transpose the matrix because the vector concatenates columns rather than rows
as.vector(t(as.matrix(data[,2:5])))
#combining these ideas you get...
data.frame(col1=rep(data[,1], each=length(data[1,])-1),
col2=as.vector(t(as.matrix(data[,2:5]))))
If you can use a matrix, you can just 'cast' it to a vector and add the row names. I have assumed that you really want 'a', 'b', 'c' as row names.
n <- 3
data <- matrix(1:9, ncol = n)
# as.vector() reads a matrix column by column, so transpose first
# to keep the values of each row together
data <- t(t(as.vector(t(data))))
rownames(data) <- rep(letters[1:3], each = n)
If you want to keep the row names from your first data frame, this is of course also possible without libraries.
n <- 3
# example row names (assumed here); in practice they come from your own data
data <- matrix(1:9, ncol = n, dimnames = list(c("A", "B", "C"), NULL))
names <- rownames(data)
data <- t(t(as.vector(t(data))))
rownames(data) <- rep(names, each = n)

Merging files on the basis of columns

I have multiple files with many rows and three columns and need to merge them on the basis of a match in the first two columns.
File 1
12 13 a
13 15 b
14 17 c
4 9 d
. . .
. . .
81 23 h
File 2
12 13 e
3 10 b
14 17 c
4 9 j
. . .
. . .
1 2 k
File 3
12 13 m
13 15 k
1 7 x
24 9 d
. . .
. . .
1 2 h
and so on.
I want to merge them to obtain the following result
12 13 a e m
13 15 b k
14 17 c c
4 9 d j
3 10 b
24 9 d
. . .
. . .
81 23 h
1 2 k
1 7 x
The first thing that usually comes to mind with these types of problems is merge, perhaps in conjunction with a Reduce(function(x, y) merge(x, y, by = "somecols", all = TRUE), yourListOfDataFrames).
However, merge is not always the most efficient function, especially since it looks like you want to "collapse" all the values to fill in the rows from left to right, which would not be the default merge behavior.
Instead, I suggest you stack everything into one long data.frame and reshape it after you have added an index variable.
Here are two approaches:
Option 1: "dplyr" + "tidyr"
Use mget to put all of your data.frames into a list.
Use bind_rows to convert that list into a single data.frame (the original answer used rbind_all, which is no longer available in current "dplyr").
Use sequence(n()) in mutate from "dplyr" to group the data and create an index.
Use spread from "tidyr" to transform from a "long" format to a "wide" format.
library(dplyr)
library(tidyr)
# bind_rows() replaces the older rbind_all()
combined <- bind_rows(mget(ls(pattern = "^file\\d")))
combined %>%
group_by(V1, V2) %>%
mutate(time = sequence(n())) %>%
ungroup() %>%
spread(time, V3, fill = "")
# Source: local data frame [7 x 5]
#
# V1 V2 1 2 3
# 1 1 7 x
# 2 3 10 b
# 3 4 9 d j
# 4 12 13 a e m
# 5 13 15 b k
# 6 14 17 c c
# 7 24 9 d
Option 2: "data.table"
Use mget to put all of your data.frames into a list.
Use rbindlist to convert that list into a single data.table.
Use sequence(.N) to generate your sequence by your groups.
Use dcast.data.table to convert the "long" data.table into a "wide" one.
library(data.table)
dcast.data.table(
rbindlist(mget(ls(pattern = "^file\\d")))[,
time := sequence(.N), by = list(V1, V2)],
V1 + V2 ~ time, value.var = "V3", fill = "")
# V1 V2 1 2 3
# 1: 1 7 x
# 2: 3 10 b
# 3: 4 9 d j
# 4: 12 13 a e m
# 5: 13 15 b k
# 6: 14 17 c c
# 7: 24 9 d
Both of these answers assume we are starting with the following sample data:
file1 <- structure(
list(V1 = c(12L, 13L, 14L, 4L), V2 = c(13L, 15L, 17L, 9L),
V3 = c("a", "b", "c", "d")), .Names = c("V1", "V2", "V3"),
class = "data.frame", row.names = c(NA, -4L))
file2 <- structure(
list(V1 = c(12L, 3L, 14L, 4L), V2 = c(13L, 10L, 17L, 9L),
V3 = c("e", "b", "c", "j")), .Names = c("V1", "V2", "V3"),
class = "data.frame", row.names = c(NA, -4L))
file3 <- structure(
list(V1 = c(12L, 13L, 1L, 24L), V2 = c(13L, 15L, 7L, 9L),
V3 = c("m", "k", "x", "d")), .Names = c("V1", "V2", "V3"),
class = "data.frame", row.names = c(NA, -4L))
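The same stack, index and reshape idea can probably also be done without extra packages. The following is only a base R sketch using ave() for the index and reshape() for the long-to-wide step; combined and wide are names chosen here, and the resulting value columns are called V3.1, V3.2, V3.3 rather than 1, 2, 3.

```r
# Sample data from the answer above
file1 <- data.frame(V1 = c(12L, 13L, 14L, 4L), V2 = c(13L, 15L, 17L, 9L),
                    V3 = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
file2 <- data.frame(V1 = c(12L, 3L, 14L, 4L), V2 = c(13L, 10L, 17L, 9L),
                    V3 = c("e", "b", "c", "j"), stringsAsFactors = FALSE)
file3 <- data.frame(V1 = c(12L, 13L, 1L, 24L), V2 = c(13L, 15L, 7L, 9L),
                    V3 = c("m", "k", "x", "d"), stringsAsFactors = FALSE)

# Stack everything into one long data frame
combined <- do.call(rbind, list(file1, file2, file3))

# Running index within each (V1, V2) pair: 1 for the first match, 2 for the next...
combined$time <- ave(seq_len(nrow(combined)), combined$V1, combined$V2,
                     FUN = seq_along)

# Long to wide: one row per (V1, V2) pair, one V3.<index> column per match
wide <- reshape(combined, idvar = c("V1", "V2"), timevar = "time",
                direction = "wide")

# Pairs with fewer matches get NA; show those as empty strings
wide[is.na(wide)] <- ""
wide
```

Rows come out in order of first appearance rather than sorted by V1 and V2.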

Resources