I would like to combine/sum two rows based on rownames to make one row in R. The best route might be to create a new row and sum the two rows together.
Example df:
A 1 3 4 6
B 3 2 7 9
C 6 8 1 2
D 3 2 8 9
Where A,B,C,D are rownames, I want to combine/sum two rows (A & C) into one to get:
A+C 7 11 5 8
B 3 2 7 9
D 3 2 8 9
Thank you.
aggregate to the rescue:
aggregate(df, list(Group=replace(rownames(df),rownames(df) %in% c("A","C"), "A&C")), sum)
# Group V2 V3 V4 V5
#1 A&C 7 11 5 8
#2 B 3 2 7 9
#3 D 3 2 8 9
You can replace the A row using the standard addition arithmetic operator, and then remove the C row with a logical statement.
df["A", ] <- df["A", ] + df["C", ]
df[rownames(df) != "C", ]
# V2 V3 V4 V5
# A 7 11 5 8
# B 3 2 7 9
# D 3 2 8 9
For more than two rows, you can use colSums() for the addition. This presumes the first value in nm is the one we are replacing/keeping.
nm <- c("A", "C")
df[nm[1], ] <- colSums(df[nm, ])
df[!rownames(df) %in% nm[-1], ]
I'll leave it up to you to change the row names. :)
Data:
df <- structure(list(V2 = c(1L, 3L, 6L, 3L), V3 = c(3L, 2L, 8L, 2L),
V4 = c(4L, 7L, 1L, 8L), V5 = c(6L, 9L, 2L, 9L)), .Names = c("V2",
"V3", "V4", "V5"), class = "data.frame", row.names = c("A", "B",
"C", "D"))
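If you also want the merged row renamed in the same step, base R's rowsum() may be worth a look. The following is only a sketch under my own assumptions: the helper names merge_rows and grp are made up for illustration, and the label "A+C" is just the one used in the question. It rebuilds the example data, relabels the rows to be merged, and sums by label.

```r
# Example data from the question (rownames A-D)
df <- data.frame(V2 = c(1L, 3L, 6L, 3L), V3 = c(3L, 2L, 8L, 2L),
                 V4 = c(4L, 7L, 1L, 8L), V5 = c(6L, 9L, 2L, 9L),
                 row.names = c("A", "B", "C", "D"))

# Rows to merge (hypothetical helper name)
merge_rows <- c("A", "C")

# Give the rows to be merged one shared label, e.g. "A+C"
grp <- rownames(df)
grp[grp %in% merge_rows] <- paste(merge_rows, collapse = "+")

# rowsum() adds up all rows that share a label; the labels become the rownames
res <- rowsum(df, group = grp)
res
```

This works for any number of rows in merge_rows, as long as all columns are numeric.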
Matrix multiplication also works, taking X to be as.matrix(df):
> A <- matrix(c(1,0,0,0,1,0,1,0,0,0,0,1), 3)
> A
[,1] [,2] [,3] [,4]
[1,] 1 0 1 0
[2,] 0 1 0 0
[3,] 0 0 0 1
> A %*% X
V2 V3 V4 V5
[1,] 7 11 5 8
[2,] 3 2 7 9
[3,] 3 2 8 9
Or using the Matrix package for sparse matrices:
library(Matrix)
fac2sparse(factor(c(1,2,1,4))) %*% X
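For larger inputs you probably don't want to type the 0/1 matrix by hand. Here is a rough sketch of building it from a grouping vector with outer(). The vector grp and the row labels are my own illustrative choices, and X is assumed to be the question's data as a numeric matrix.

```r
# The question's data as a matrix (assumed to be what X stands for above)
X <- matrix(c(1, 3, 6, 3,  3, 2, 8, 2,  4, 7, 1, 8,  6, 9, 2, 9), nrow = 4,
            dimnames = list(c("A", "B", "C", "D"), c("V2", "V3", "V4", "V5")))

# One label per row of X; rows sharing a label get summed (hypothetical labels)
grp <- c("A+C", "B", "A+C", "D")

# Indicator matrix: entry [i, j] is 1 when row j of X belongs to group i
A <- 1 * outer(unique(grp), grp, "==")
rownames(A) <- unique(grp)

res <- A %*% X
res
```

Each row of A picks out the rows of X that belong to one group, so the product is the per-group column sum.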
I have 2 data sets, both of which include an ID column with the same IDs. I have already removed rows from the first data set. For the second data set, I would like to use dplyr to remove any rows whose IDs do not appear in the first data set.
In other words, every ID in DF2 must also be in DF1; if it is not, that row must be removed from DF2.
For example:
DF1
ID X Y Z
1 1 1 1
2 2 2 2
3 3 3 3
5 5 5 5
6 6 6 6
DF2
ID A B C
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
DF2 once rows have been removed
ID A B C
1 1 1 1
2 2 2 2
3 3 3 3
5 5 5 5
6 6 6 6
I used anti_join(), which shows me the rows that differ, but I cannot figure out how to use dplyr to remove the rows whose IDs do not match the first data set.
Try with paste
i1 <- do.call(paste, DF2) %in% do.call(paste, DF1)
# if it is only to compare the 'ID' columns
i1 <- DF2$ID %in% DF1$ID
DF3 <- DF2[i1,]
DF3
ID A B C
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 5 5 5 5
5 6 6 6 6
DF4 <- DF2[!i1,]
DF4
ID A B C
4 4 4 4 4
7 7 7 7 7
data
DF1 <- structure(list(ID = c(1L, 2L, 3L, 5L, 6L), X = c(1L, 2L, 3L,
5L, 6L), Y = c(1L, 2L, 3L, 5L, 6L), Z = c(1L, 2L, 3L, 5L, 6L)), class = "data.frame", row.names = c(NA,
-5L))
DF2 <- structure(list(ID = 1:7, A = 1:7, B = 1:7, C = 1:7), class = "data.frame", row.names = c(NA,
-7L))
# Load package
library(dplyr)
# Create example data frames
df1 <- data.frame(
ID = 1:6,
X = 1:6,
Y = 1:6,
Z = 1:6
)
df2 <- data.frame(
ID = 1:7,
X = 1:7,
Y = 1:7,
Z = 1:7
)
# Include all rows in df1
df1 %>%
left_join(df2)
Joining, by = c("ID", "X", "Y", "Z")
ID X Y Z
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
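Since the question asks for a dplyr way to keep only the DF2 rows whose ID occurs in DF1, a filtering join looks like the closest fit. This is a small sketch, assuming dplyr is installed and that ID is the only matching key. The names kept and dropped are just mine.

```r
library(dplyr)

# Data from the question
DF1 <- data.frame(ID = c(1L, 2L, 3L, 5L, 6L), X = c(1L, 2L, 3L, 5L, 6L),
                  Y = c(1L, 2L, 3L, 5L, 6L), Z = c(1L, 2L, 3L, 5L, 6L))
DF2 <- data.frame(ID = 1:7, A = 1:7, B = 1:7, C = 1:7)

# semi_join() keeps the rows of DF2 that have a match in DF1,
# without adding any columns from DF1
kept <- semi_join(DF2, DF1, by = "ID")

# anti_join() is the complement: the rows of DF2 with no match in DF1
dropped <- anti_join(DF2, DF1, by = "ID")

kept
```

So anti_join() shows what would be removed, and semi_join() gives DF2 with those rows already removed.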
In R I have:
DataSet1
A
1
4
13
19
22
DataSet2
(min)B (max)C
4 6
8 9
12 15
16 18
I am looking to set up a binary column D based on whether A is between B and C.
So D would be added to DataSet1 and calculated as follows:
A D
1 0
4 1
13 1
19 0
22 0
I have tried using the InRange function, but it only checks against one row of B and C rather than all intervals.
Any help would be much appreciated.
Here is one option using fuzzy_left_join
library(fuzzyjoin)
library(dplyr)
df1 %>% fuzzy_left_join(df2, by = c("A" = "B", "A" = "C"),
match_fun = list(`>=`, `<`)) %>%
mutate(D = ifelse(is.na(B) & is.na(C), 0, 1))
A B C D
1 1 NA NA 0
2 4 4 6 1
3 13 12 15 1
4 19 NA NA 0
5 22 NA NA 0
Data
df1 <- structure(list(A = c(1L, 4L, 13L, 19L, 22L)), class = "data.frame", row.names = c(NA, -5L))
df2 <- structure(list(B = c(4L, 8L, 12L, 16L), C = c(6L, 9L, 15L, 18L)), class = "data.frame", row.names = c(NA, -4L))
Here's a way using sapply from base R -
df1$D <- sapply(df1$A, function(x) {
+any(x >= df2$B & x <= df2$C)
})
df1
A D
1 1 0
2 4 1
3 13 1
4 19 0
5 22 0
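If the sapply() loop turns out to be slow, the same check can be written without an explicit loop using outer(). This is a sketch only: it assumes closed intervals (>= B and <= C, as in the sapply version), and it builds a length(A) by nrow(df2) matrix, so it is only reasonable when that product fits in memory. The name inside is mine.

```r
df1 <- data.frame(A = c(1, 4, 13, 19, 22))
df2 <- data.frame(B = c(4, 8, 12, 16), C = c(6, 9, 15, 18))

# inside[i, j] is TRUE when A[i] lies within the j-th interval [B[j], C[j]]
inside <- outer(df1$A, df2$B, ">=") & outer(df1$A, df2$C, "<=")

# D is 1 when A falls inside at least one interval, otherwise 0
df1$D <- as.integer(rowSums(inside) > 0)
df1
```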
I have a data frame (x) in R with 5 numeric columns. Apart from this, I have the sorting order to follow, given as a vector, i.e.
1, 0, 2, 4, 3
dataset
v1 v2 v3 v4 v5
1 2 3 4 5
3 13 12 1 4
6 4 6 5 3
Expected result
v1 v2 v3 v4 v5
3 13 12 1 4
1 2 3 4 5
6 4 6 5 3
This vector defines the sorting order: the first column is sorted first, then the 3rd column, then the 5th column, and then the 4th column. Manually it can be done as
x = x[order(x[[1]]), ]
x = x[order(x[[3]]), ]
x = x[order(x[[5]]), ]
x = x[order(x[[4]]), ]
rownames(x) = NULL
The problem is that this is easy for 5 columns but complicated for hundreds of columns.
Any lead on this will be appreciated.
Thanks
We can do a match on the original vector and then use a for loop to get the output
i1 <- match(seq_along(x), vec, nomatch = 0)
i1 <- i1[i1!=0]
for(i in i1){
x <- x[order(x[[i]]),]
}
x
# v1 v2 v3 v4 v5
# 2 3 13 12 1 4
# 1 1 2 3 4 5
# 3 6 4 6 5 3
data
x <- structure(list(v1 = c(1L, 3L, 6L), v2 = c(2L, 13L, 4L), v3 = c(3L,
12L, 6L), v4 = c(4L, 1L, 5L), v5 = c(5L, 4L, 3L)), .Names = c("v1",
"v2", "v3", "v4", "v5"), class = "data.frame", row.names = c(NA,
-3L))
vec <- c(1, 0, 2, 4, 3)
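Because order() leaves ties in their original order, sorting repeatedly by one column at a time should give the same result as a single order() call with the keys in reverse priority (the last column sorted becomes the primary key). Here is a sketch of that idea. The names keys and sorted are mine, and I am assuming the intent is exactly the sequential sort from the question.

```r
x <- data.frame(v1 = c(1, 3, 6), v2 = c(2, 13, 4), v3 = c(3, 12, 6),
                v4 = c(4, 1, 5), v5 = c(5, 4, 3))
vec <- c(1, 0, 2, 4, 3)

# Columns in the order they would be sorted one after another: 1, 3, 5, 4
i1 <- match(seq_along(x), vec, nomatch = 0)
i1 <- i1[i1 != 0]

# The column sorted last wins, so pass the columns to order() in reverse;
# unname() keeps column names from being mistaken for order()'s own arguments
keys <- unname(as.list(x[rev(i1)]))
sorted <- x[do.call(order, keys), ]
rownames(sorted) <- NULL
sorted
```

This avoids reordering the whole data frame once per column, which may matter with hundreds of columns.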
I would like to split each row of a (numeric) data frame into two rows. For example, part of the original data frame looks like this (nrow(original data frame) > 2800000):
ID X Y Z value_1 value_2
1 3 2 6 22 54
6 11 5 9 52 71
3 7 2 5 2 34
5 10 7 1 23 47
And after splitting each row, we get:
ID X Y Z
1 3 2 6
22 54 NA NA
6 11 5 9
52 71 NA NA
3 7 2 5
2 34 NA NA
5 10 7 1
23 47 NA NA
The "value_1" and "value_2" columns are split off and their elements are placed in a new row. For example, value_1 = 22 and value_2 = 54 go into a new row below the first one.
Here is one option with data.table. We convert the 'data.frame' to 'data.table' by creating a column of rownames (setDT(df1, keep.rownames = TRUE)). Subset the columns 1:5 and 1, 6, 7 in a list, rbind the list element with fill = TRUE option to return NA for corresponding columns that are not found in one of the datasets, order by the row number ('rn') and assign (:=) the row number column to 'NULL'.
library(data.table)
setDT(df1, keep.rownames = TRUE)[]
# 'rn' is character, so convert it to integer before ordering
rbindlist(list(df1[, 1:5, with = FALSE], setnames(df1[, c(1, 6:7),
with = FALSE], 2:3, c("ID", "X"))), fill = TRUE)[order(as.integer(rn))][, rn:= NULL][]
# ID X Y Z
#1: 1 3 2 6
#2: 22 54 NA NA
#3: 6 11 5 9
#4: 52 71 NA NA
#5: 3 7 2 5
#6: 2 34 NA NA
#7: 5 10 7 1
#8: 23 47 NA NA
The corresponding tidyverse ("hadleyverse") version of the above logic would be
library(dplyr)
tibble::rownames_to_column(df1[1:4]) %>%
bind_rows(., setNames(tibble::rownames_to_column(df1[5:6]),
c("rowname", "ID", "X"))) %>%
arrange(as.integer(rowname)) %>%
select(-rowname)
# ID X Y Z
#1 1 3 2 6
#2 22 54 NA NA
#3 6 11 5 9
#4 52 71 NA NA
#5 3 7 2 5
#6 2 34 NA NA
#7 5 10 7 1
#8 23 47 NA NA
data
df1 <- structure(list(ID = c(1L, 6L, 3L, 5L), X = c(3L, 11L, 7L, 10L
), Y = c(2L, 5L, 2L, 7L), Z = c(6L, 9L, 5L, 1L), value_1 = c(22L,
52L, 2L, 23L), value_2 = c(54L, 71L, 34L, 47L)), .Names = c("ID",
"X", "Y", "Z", "value_1", "value_2"), class = "data.frame",
row.names = c(NA, -4L))
Here's a (very slow) pure R solution using no extra packages:
# Simulate a data frame shaped like yours
input_df <- data.frame(ID = rnorm(10000),
X = rnorm(10000),
Y = rnorm(10000),
Z = rnorm(10000),
value_1 = rnorm(10000),
value_2 = rnorm(10000))
# Preallocate memory to a data frame
output_df <- data.frame(
matrix(
nrow = nrow(input_df)*2,
ncol = ncol(input_df)-2))
# Loop through each input row in turn.
# Put the first four elements into an odd output
# row, and the next two into the following row
# with two NAs attached.
for(i in seq_len(nrow(input_df))){
  output_df[2*i - 1,] <- input_df[i, 1:4]
  output_df[2*i,] <- c(unlist(input_df[i, 5:6]), NA, NA)
}
colnames(output_df) <- c("ID", "X", "Y", "Z")
Which results in
> head(output_df)
ID X Y Z
1 0.5529417 -0.93859275 2.0900276 -2.4023800
2 0.9751090 0.13357075 NA NA
3 0.6753835 0.07018647 0.8529300 -0.9844643
4 1.6405939 0.96133195 NA NA
5 0.3378821 -0.44612782 -0.8176745 0.2759752
6 -0.8910678 -0.37928353 NA NA
This should work
data <- read.table(text= "ID X Y Z value_1 value_2
1 3 2 6 22 54
6 11 5 9 52 71
3 7 2 5 2 34
5 10 7 1 23 47", header = TRUE)
data1 <- data[,1:4]
data2 <- data[, 5:6]  # the two value columns
names(data2) <- names(data1)[1:ncol(data2)]
combined <- plyr::rbind.fill(data1,data2)
n <- nrow(data1)
combined[kronecker(1:n, c(0, n), "+"),]
Though why you would need to do this beats me.
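For what it's worth, the interleaving can also be done in base R without plyr or a loop, which may matter at 2.8 million rows. This is a sketch under my own naming (top, bottom, out): stack the two halves, then reorder so that each value row follows its source row.

```r
df1 <- data.frame(ID = c(1L, 6L, 3L, 5L), X = c(3L, 11L, 7L, 10L),
                  Y = c(2L, 5L, 2L, 7L), Z = c(6L, 9L, 5L, 1L),
                  value_1 = c(22L, 52L, 2L, 23L),
                  value_2 = c(54L, 71L, 34L, 47L))
n <- nrow(df1)

# First half: the original ID/X/Y/Z columns
top <- df1[, c("ID", "X", "Y", "Z")]

# Second half: value_1/value_2 moved under ID/X, padded with NA
bottom <- data.frame(ID = df1$value_1, X = df1$value_2,
                     Y = NA_integer_, Z = NA_integer_)

# Row i of 'top' and row i of 'bottom' share key i; order() keeps ties in
# their original order, so each 'top' row comes right before its 'bottom' row
out <- rbind(top, bottom)[order(rep(seq_len(n), times = 2)), ]
rownames(out) <- NULL
out
```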
I'm looking for a way to refer to a previous row in my data frame that has one column value in common with the 'current row'. Basically, if this were my data frame
A B D
1 10
4 5
6 6
3 25
1 40
I would want D(i) to contain the B value of the last row for which A has the same value as A(i). So for the last row that should be 10.
You could try this:
for(i in seq_len(nrow(dat))) {
  # earlier rows that share the current row's A value
  prev <- which(dat$A[seq_len(i - 1)] == dat$A[i])
  if(length(prev) > 0) dat$D[i] <- dat$B[tail(prev, 1)]
}
Results:
> dat
A B D
1 1 10 NA
2 4 5 NA
3 6 6 NA
4 3 25 NA
5 1 40 10
Data:
dat <- read.csv(text="A,B,D
1,10
4,5
6,6
3,25
1,40")
You may try
library(dplyr)
df1%>%
group_by(A) %>%
mutate(D=lag(B))
# A B D
#1 1 10 NA
#2 4 5 NA
#3 6 6 NA
#4 3 25 NA
#5 1 40 10
Or
library(data.table)#data.table_1.9.5
setDT(df1)[, D:=shift(B), A][]
data
df1 <- structure(list(A = c(1L, 4L, 6L, 3L, 1L), B = c(10L, 5L, 6L,
25L, 40L)), .Names = c("A", "B"), class = "data.frame",
row.names = c(NA, -5L))
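If you would rather not load a package, the same "lag within group" idea can be sketched in base R with ave(). This is my own variant and not part of the answers above. It assumes the rows are already in the order you care about.

```r
df1 <- data.frame(A = c(1L, 4L, 6L, 3L, 1L), B = c(10L, 5L, 6L, 25L, 40L))

# Within each group of equal A, shift B down by one position:
# the first row of a group gets NA, later rows get the previous B
df1$D <- ave(df1$B, df1$A, FUN = function(b) c(NA, head(b, -1)))
df1
```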