How to properly combine columns into one column using R

I have 3 sets of data. Each one is a column of variables:
A B C
81 35 31
62 34 33
46 36 31
45 31 33
81 35 31
62 34 33
46 36 31
45 31 33
81 35 31
62 34 33
46 36 31
45 31 33
I have been trying to use rbind to combine these three data sets into one dataset with one column.
Combine<-rbind(A,B,C)
Instead I get something like this: not only do I end up with a series of shorter columns, but the numbers all change. How do I stop this from happening?
V1 V2 V3 V4
14 9 9 5
19 15 14 5

# example data frames
dt1 = data.frame(A = 1:5)
dt2 = data.frame(B = 3:10)
dt3 = data.frame(C = 5:7)
# change to a common column name
names(dt1) = "x"
names(dt2) = "x"
names(dt3) = "x"
# bind rows
rbind(dt1, dt2, dt3)
# x
# 1 1
# 2 2
# 3 3
# 4 4
# 5 5
# 6 3
# 7 4
# 8 5
# 9 6
# 10 7
# 11 8
# 12 9
# 13 10
# 14 5
# 15 6
# 16 7
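If the three columns already sit in one data frame, a shorter route is to flatten them with unlist(). The sketch below uses made-up values copied from the first rows of the question. The factor conversion at the end only matters on the assumption that the changed numbers in the question come from factor columns, which the question does not confirm.

```r
# Hypothetical data mirroring the first four rows of the question
df <- data.frame(A = c(81, 62, 46, 45),
                 B = c(35, 34, 36, 31),
                 C = c(31, 33, 31, 33))

# Flatten column by column into a single column named x
combined <- data.frame(x = unlist(df, use.names = FALSE))
nrow(combined)

# Assumption: if a column was read in as a factor, go through character
# first, otherwise as.numeric() returns the internal level codes
f <- factor(c("81", "62", "46", "45"))
f_num <- as.numeric(as.character(f))
```

unlist() walks the columns in order, so the result is all of A, then B, then C.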

Related

Operations on multiple columns across many tables

I have two tables (dt1, dt2). dt2 contains the same variables names as dt1.
For each variable in dt1 I would like to multiply it with its values from dt2.
In the example below, x from dt1 will get multiplied by 4 and y by 7.
What would be a fast way to do it?
Thank you
set.seed(123)
dt1 <- data.frame(x = sample(1:10, 10, TRUE), y = sample(1:10, 10, TRUE) )
dt1
dt2 = data.frame (names = c("x", "y"), values = c(4, 7))
dt2
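A purrr-style approach (needs purrr and tidyr; it pairs columns by position, so dt2 must list the names in the same order as the columns of dt1):
library(purrr)
library(tidyr)
map2_df(dt1, dt2 %>% pivot_wider(names_from = names, values_from = values), ~.y * .x)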
# A tibble: 10 x 2
x y
<dbl> <dbl>
1 12 35
2 12 21
3 40 63
4 8 63
5 24 63
6 20 21
7 16 56
8 24 70
9 36 49
10 40 70
You can try sweep, looking up the multiplier for each column of dt1 by name:
> sweep(dt1, 2, dt2$values[match(names(dt1), dt2$names)], "*")
x y
1 12 35
2 12 21
3 40 63
4 8 63
5 24 63
6 20 21
7 16 56
8 24 70
9 36 49
10 40 70
or
> dt1[] <- t(t(dt1) * dt2$values[match(names(dt1), dt2$names)])
> dt1
x y
1 12 35
2 12 21
3 40 63
4 8 63
5 24 63
6 20 21
7 16 56
8 24 70
9 36 49
10 40 70
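The lookup direction inside match() matters once the rows of the lookup table are not in the same order as the columns. A minimal sketch, assuming hypothetical tables d and m (not from the question), where match(names(d), m$names) finds the row of m that belongs to each column of d:

```r
# Hypothetical data: three columns, multipliers listed in a different order
d <- data.frame(x = 1:3, y = 4:6, z = 7:9)
m <- data.frame(names = c("y", "z", "x"), values = c(10, 100, 1))

# For every column of d, find its multiplier in m by name
mult <- m$values[match(names(d), m$names)]
stopifnot(!anyNA(mult))  # every column must have a multiplier

# Multiply each column by its own multiplier
res <- sweep(d, 2, mult, "*")
res
```

With only two columns either argument order happens to give the same answer, which is why the difference does not show in the example above.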

Assign new value to database based on value stored in another database

Here I share with you a simplified version of my issue. Say I have 6 observations (pid) for two variables:
pid <- c(1,2,3,4,5,6)
V1 <- c(11,11,33,11,22,33)
V2 <- c("A", "C", "M", "M", "A", "A")
data <- data.frame(pid, V1, V2)
# pid V1 V2
# 1 1 11 A
# 2 2 11 C
# 3 3 33 M
# 4 4 11 M
# 5 5 22 A
# 6 6 33 A
I would like to create a new column based on the values associated with the different combinations of V1 and V2, which are stored in a second database:
V1 <- c(11,11,11,22,22,22,33,33,33)
V2 <- c("A", "C", "M","A", "C", "M","A", "C", "M")
valueA <- c(16,26,36,46,56,66,76,86,96)
valueB <- c(15,25,35,45,55,65,75,85,95)
values <- data.frame(V1, V2, valueA, valueB)
# V1 V2 valueA valueB
# 1 11 A 16 15
# 2 11 C 26 25
# 3 11 M 36 35
# 4 22 A 46 45
# 5 22 C 56 55
# 6 22 M 66 65
# 7 33 A 76 75
# 8 33 C 86 85
# 9 33 M 96 95
I tried this, following @akrun's suggestion:
data <- mutate (data,
valueA = as.integer (ifelse(data$V1 %in% values$V1
& data$V2 %in% values$V2, values$valueA, NA))
)
But the result is the following:
# pid V1 V2 valueA
# 1 1 11 A 16
# 2 2 11 C 26
# 3 3 33 M 36
# 4 4 11 M 46
# 5 5 22 A 56
# 6 6 33 A 66
As you can see, the combination 33 M gets 36 while it should get 96...
I would like to achieve this:
# pid V1 V2 valueA
# 1 1 11 A 16
# 2 2 11 C 26
# 3 3 33 M 96
# 4 4 11 M 36
# 5 5 22 A 46
# 6 6 33 A 76
Any suggestions on how to fix this? Any help would be much appreciated!
I solved the issue above creating a single column merging V1 and V2 as follows:
data$unique <- paste(data$V1,data$V2)
values$unique <- paste(values$V1, values$V2)
and then merged by the new column:
merge(x = data, y = values, by = "unique")
# unique pid V1.x V2.x V1.y V2.y valueA valueB
# 1 11 A 1 11 A 11 A 16 15
# 2 11 C 2 11 C 11 C 26 25
# 3 11 M 4 11 M 11 M 36 35
# 4 22 A 5 22 A 22 A 46 45
# 5 33 A 6 33 A 33 A 76 75
# 6 33 M 3 33 M 33 M 96 95
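The helper column is not strictly needed, because merge() accepts several key columns in by. The ifelse() attempt fails because %in% only tests membership, after which values$valueA is recycled by position rather than looked up. A minimal sketch that rebuilds the two tables from the question; res is a name introduced here for illustration:

```r
# Tables as defined in the question
data <- data.frame(pid = 1:6,
                   V1 = c(11, 11, 33, 11, 22, 33),
                   V2 = c("A", "C", "M", "M", "A", "A"))
values <- data.frame(V1 = rep(c(11, 22, 33), each = 3),
                     V2 = rep(c("A", "C", "M"), times = 3),
                     valueA = c(16, 26, 36, 46, 56, 66, 76, 86, 96),
                     valueB = c(15, 25, 35, 45, 55, 65, 75, 85, 95))

# Join on both keys at once; all.x = TRUE keeps rows of data without a match
res <- merge(data, values[, c("V1", "V2", "valueA")],
             by = c("V1", "V2"), all.x = TRUE)

# merge() sorts by the keys, so restore the original order and column layout
res <- res[order(res$pid), c("pid", "V1", "V2", "valueA")]
res
```

Merging on c("V1", "V2") also avoids the duplicated V1.x/V1.y columns seen in the output above.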

Select multiple ranges of columns using column names in data.table

Let's say I have a data table,
dt = data.table(matrix(1:50, nrow = 5));
colnames(dt) = letters[1:10];
> dt
a b c d e f g h i j
1: 1 6 11 16 21 26 31 36 41 46
2: 2 7 12 17 22 27 32 37 42 47
3: 3 8 13 18 23 28 33 38 43 48
4: 4 9 14 19 24 29 34 39 44 49
5: 5 10 15 20 25 30 35 40 45 50
I want to select several discontinuous ranges of columns like: a, c:d, f:h and j. This can be done easily via dplyr's select():
dt %>% select(a, c:d, f:h, j)
I am looking for a data.table way of achieving the same.
Right now, I can either select columns individually in any order: dt[ , .(a, c)] or give just one sequence of column names in the form startcol:endcol:
dt[ , c:f]
However, I can't combine the above two methods to select several column ranges in one shot in .SDcols, like I did with dplyr::select.
We can use the range part in .SDcols and then append the other column by concatenating
dt[, c(list(a= a), .SD) , .SDcols = c:d]
If there are multiple ranges, we create a sequence of ranges by match, and then get the corresponding column names
i1 <- match(c("c", "f"), names(dt))
j1 <- match(c("d", "h"), names(dt))
nm1 <- c("a", names(dt)[unlist(Map(`:`, i1, j1))], "j")
dt[, ..nm1]
# a c d f g h j
#1: 1 11 16 26 31 36 46
#2: 2 12 17 27 32 37 47
#3: 3 13 18 28 33 38 48
#4: 4 14 19 29 34 39 49
#5: 5 15 20 30 35 40 50
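The match/Map steps above can be wrapped in a small helper that expands "start:end" strings into column names. This is only a sketch: cols_from_ranges is a hypothetical name, not a data.table function, and it works on a plain character vector of names so it runs without any package.

```r
# Hypothetical helper: turn specs like "a" or "c:d" into column names
cols_from_ranges <- function(nms, spec) {
  unlist(lapply(strsplit(spec, ":", fixed = TRUE), function(p) {
    idx <- match(p, nms)                       # positions of start (and end)
    if (anyNA(idx)) {
      stop("unknown column: ", paste(p[is.na(idx)], collapse = ", "))
    }
    nms[idx[1]:idx[length(idx)]]               # single name or whole range
  }))
}

# Column names as in the question's data.table
nm <- cols_from_ranges(letters[1:10], c("a", "c:d", "f:h", "j"))
nm
```

The resulting vector can then be passed to the data.table as dt[, ..nm] or dt[, nm, with = FALSE].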
Also, the dplyr methods can be used within the data.table
dt[, select(.SD, a, c:d, f:h, j)]
# a c d f g h j
#1: 1 11 16 26 31 36 46
#2: 2 12 17 27 32 37 47
#3: 3 13 18 28 33 38 48
#4: 4 14 19 29 34 39 49
#5: 5 15 20 30 35 40 50
Here is a workaround with cbind and two or more selections.
cbind(dt[, .(a)], dt[, c:d])
# a c d
# 1: 1 11 16
# 2: 2 12 17
# 3: 3 13 18
# 4: 4 14 19
# 5: 5 15 20

Subset data frame where values are greater than another data frame

Say I have a data frame with 3 columns of data (a,b,c) and 1 column of categories with multiple instances of each category (class).
set.seed(273)
a <- floor(runif(20,0,100))
b <- floor(runif(20,0,100))
c <- floor(runif(20,0,100))
class <- floor(runif(20,0,6))
df1 <- data.frame(a,b,c,class)
print(df1)
a b c class
1 31 73 28 3
2 44 33 57 3
3 19 35 53 0
4 68 70 39 4
5 92 7 57 2
6 13 67 23 3
7 73 50 14 2
8 59 14 91 5
9 37 3 72 5
10 27 3 13 4
11 63 28 0 5
12 51 7 35 4
13 11 36 76 3
14 72 25 8 5
15 23 24 6 3
16 15 1 16 5
17 55 24 5 5
18 2 54 39 1
19 54 95 20 3
20 60 39 65 1
And I have another data frame with the same 3 columns of data and category column, however this only has one instance per category (class).
a <- floor(runif(6,0,20))
b <- floor(runif(6,0,20))
c <- floor(runif(6,0,20))
class <- seq(0,5)
df2 <- data.frame(a,b,c,class)
print(df2)
a b c class
1 8 15 13 0
2 0 3 6 1
3 14 4 0 2
4 7 10 6 3
5 18 18 16 4
6 17 17 11 5
How do I subset the first data frame so that only rows where a, b, and c are all greater than the values in the second data frame for the same class are kept? For example, I only want rows where class == 0 if a > 8 & b > 15 & c > 13.
Note that I don't want to join the data frames, as the second data frame holds the lowest acceptable values for the first data frame.
As commented by Frank this can be done with non-equi joins.
library(data.table)
# coerce to data.table
tmp <- setDT(df1)[
  # non-equi join to find which rows of df1 fulfill conditions in df2
  setDT(df2), on = .(class, a > a, b > b, c > c), nomatch = 0L, which = TRUE]
# return subset in original order of df1
df1[sort(tmp)]
a b c class
1: 31 73 28 3
2: 44 33 57 3
3: 19 35 53 0
4: 68 70 39 4
5: 92 7 57 2
6: 13 67 23 3
7: 73 50 14 2
8: 11 36 76 3
9: 2 54 39 1
10: 54 95 20 3
11: 60 39 65 1
The parameter which = TRUE returns a vector of the matching row numbers instead of the joined data set. This saves us from creating a row id column before the join. (Credit to @Frank for reminding me of the which parameter!)
Note that there is no row in df1 which fulfills the condition for class == 5 in df2. Therefore, the parameter nomatch = 0L is used to exclude non-matching rows from the result.
This can be put together in a "one-liner":
setDT(df1)[sort(df1[setDT(df2), on = .(class, a > a, b > b, c > c), nomatch = 0L, which = TRUE])]
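For comparison, the same filter can be sketched in base R by looking up each row's thresholds with match() and comparing column by column. This is an alternative to the non-equi join above, not the answer's method, and the small tables below are made up so that the result is easy to check by hand.

```r
# Hypothetical data: df1 holds observations, df2 one threshold row per class
df1 <- data.frame(a = c(10, 5, 20, 1),
                  b = c(10, 20, 3, 9),
                  c = c(10, 20, 30, 9),
                  class = c(0, 0, 1, 1))
df2 <- data.frame(a = c(8, 0), b = c(5, 3), c = c(9, 6), class = c(0, 1))

# Row of df2 that holds the thresholds for each row of df1
idx <- match(df1$class, df2$class)

# TRUE only where all three values exceed their class thresholds
keep <- df1$a > df2$a[idx] & df1$b > df2$b[idx] & df1$c > df2$c[idx]

# which() drops NA, so rows whose class is missing from df2 are excluded
res <- df1[which(keep), ]
res
```

Row order of df1 is preserved automatically, so no sorting step is needed.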

Combine two dataframes one above the other

I have two dataframes and I want to put one above the other "with" column names of second as a row of the new dataframe. Column names are different and one dataframe has more columns.
For example:
mydf1 <- data.frame(V1=c(1:5), V2=c(21:25))
mydf1
V1 V2
1 1 21
2 2 22
3 3 23
4 4 24
5 5 25
mydf2 <- data.frame(C1=c(1:10), C2=c(21:30),C3=c(41:50))
mydf2
C1 C2 C3
1 1 21 41
2 2 22 42
3 3 23 43
4 4 24 44
5 5 25 45
6 6 26 46
7 7 27 47
8 8 28 48
9 9 29 49
10 10 30 50
Result:
mydf
V1 V2
1 1 21 NA
2 2 22 NA
3 3 23 NA
4 4 24 NA
5 5 25 NA
6 C1 C2 C3
7 1 21 41
8 2 22 42
9 3 23 43
10 4 24 44
11 5 25 45
12 6 26 46
13 7 27 47
14 8 28 48
15 9 29 49
16 10 30 50
I don't care if all numeric values are treated as characters.
Many thanks
You can do this easily without any packages:
mydf1 <- data.frame(V1=c(1:5), V2=c(21:25))
mydf1[,3] <- NA
names(mydf1) <- c("one", "two", "three")
mydf2 <- data.frame(C1=c(1:10), C2=c(21:30),C3=c(41:50))
names <- t(as.data.frame(names(mydf2)))
names <- as.data.frame(names)
names(mydf2) <- c("one", "two", "three")
names(names) <- c("one", "two", "three")
mydf3 <- rbind(mydf1, names)
mydf4 <- rbind(mydf3, mydf2)
> mydf4
one two three
1 1 21 <NA>
2 2 22 <NA>
3 3 23 <NA>
4 4 24 <NA>
5 5 25 <NA>
6 C1 C2 C3
7 1 21 41
8 2 22 42
9 3 23 43
10 4 24 44
11 5 25 45
12 6 26 46
13 7 27 47
14 8 28 48
15 9 29 49
16 10 30 50
>
Of course, you can edit the <- c("one", "two", "three") to make the final column names whatever you'd like. For example:
> mydf1 <- data.frame(V1=c(1:5), V2=c(21:25))
> mydf1[,3] <- NA
> names(mydf1) <- c("V1", "V2", "NA")
> mydf2 <- data.frame(C1=c(1:10), C2=c(21:30),C3=c(41:50))
> names <- t(as.data.frame(names(mydf2)))
> names <- as.data.frame(names)
> names(mydf2) <- c("V1", "V2", "NA")
> names(names) <- c("V1", "V2", "NA")
> mydf3 <- rbind(mydf1, names)
> mydf4 <- rbind(mydf3, mydf2)
> row.names(mydf4) <- NULL
> mydf4
V1 V2 NA
1 1 21 <NA>
2 2 22 <NA>
3 3 23 <NA>
4 4 24 <NA>
5 5 25 <NA>
6 C1 C2 C3
7 1 21 41
8 2 22 42
9 3 23 43
10 4 24 44
11 5 25 45
12 6 26 46
13 7 27 47
14 8 28 48
15 9 29 49
16 10 30 50
If you need to resort to a package for any reason when scaling this up to your real use case, then try melt from reshape2 or the package plyr. However, use of a package shouldn't be necessary.
I don't know what you tried with write.table, but that seems to me like the way to go.
I would create a function something like this:
myFun <- function(...) {
  L <- list(...)
  temp <- tempfile()
  maxCol <- max(vapply(L, ncol, 1L))
  lapply(L, function(x)
    suppressWarnings(
      write.table(x, file = temp, row.names = FALSE,
                  sep = ",", append = TRUE)))
  read.csv(temp, header = FALSE, fill = TRUE,
           col.names = paste0("New_", sequence(maxCol)),
           stringsAsFactors = FALSE)
}
Usage would then simply be:
myFun(mydf1, mydf2)
# New_1 New_2 New_3
# 1 V1 V2
# 2 1 21
# 3 2 22
# 4 3 23
# 5 4 24
# 6 5 25
# 7 C1 C2 C3
# 8 1 21 41
# 9 2 22 42
# 10 3 23 43
# 11 4 24 44
# 12 5 25 45
# 13 6 26 46
# 14 7 27 47
# 15 8 28 48
# 16 9 29 49
# 17 10 30 50
The function is written such that you can specify more than two data.frames as input:
mydf3 <- data.frame(matrix(1:8, ncol = 4))
myFun(mydf1, mydf2, mydf3)
# New_1 New_2 New_3 New_4
# 1 V1 V2
# 2 1 21
# 3 2 22
# 4 3 23
# 5 4 24
# 6 5 25
# 7 C1 C2 C3
# 8 1 21 41
# 9 2 22 42
# 10 3 23 43
# 11 4 24 44
# 12 5 25 45
# 13 6 26 46
# 14 7 27 47
# 15 8 28 48
# 16 9 29 49
# 17 10 30 50
# 18 X1 X2 X3 X4
# 19 1 3 5 7
# 20 2 4 6 8
Here's one approach with the rbind.fill function (part of the plyr package).
library(plyr)
setNames(rbind.fill(setNames(mydf1, names(mydf2[seq(mydf1)])),
rbind(names(mydf2), mydf2)), names(mydf1))
V1 V2 NA
1 1 21 <NA>
2 2 22 <NA>
3 3 23 <NA>
4 4 24 <NA>
5 5 25 <NA>
6 C1 C2 C3
7 1 21 41
8 2 22 42
9 3 23 43
10 4 24 44
11 5 25 45
12 6 26 46
13 7 27 47
14 8 28 48
15 9 29 49
16 10 30 50
Give this a try.
Assign the column names of the second data set to a vector, then replace the second set's names with the names of the first set. Next, build a list whose middle element is that vector of names. When you call rbind on the list, everything is stacked in the right order. Here d1 and d2 presumably stand for mydf1 and mydf2.
d1$V3 <- NA
nm <- names(d2)
names(d2) <- names(d1)
dc <- do.call(rbind, list(d1,nm,d2))
rownames(dc) <- NULL
dc
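A self-contained variant of the same idea, written as a sketch: every column is converted to character first, the shorter frame is padded with NA columns, and the second frame's names are inserted as a row in between. stack_with_header and the extra_ column names are hypothetical, introduced only for this illustration.

```r
mydf1 <- data.frame(V1 = 1:5, V2 = 21:25)
mydf2 <- data.frame(C1 = 1:10, C2 = 21:30, C3 = 41:50)

# Hypothetical helper: put `top` above `bottom`, with bottom's names as a row
stack_with_header <- function(top, bottom) {
  stopifnot(ncol(top) <= ncol(bottom))
  n_pad <- ncol(bottom) - ncol(top)

  # Character columns of top, padded with all-NA columns on the right
  cols_top <- c(unname(lapply(top, as.character)),
                rep(list(rep(NA_character_, nrow(top))), n_pad))
  cols_bottom <- unname(lapply(bottom, as.character))

  # Per column: values of top, then the header of bottom, then its values
  out <- Map(function(a, h, b) c(a, h, b),
             cols_top, names(bottom), cols_bottom)
  names(out) <- c(names(top), paste0("extra_", seq_len(n_pad)))
  as.data.frame(out, stringsAsFactors = FALSE)
}

res <- stack_with_header(mydf1, mydf2)
res
```

All values end up as character strings, which the question states is acceptable.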
