Merging files on the basis of columns - r

I have multiple files, each with many rows and three columns, and I need to merge them wherever the first two columns match.
File 1
12 13 a
13 15 b
14 17 c
4 9 d
. . .
. . .
81 23 h
File 2
12 13 e
3 10 b
14 17 c
4 9 j
. . .
. . .
1 2 k
File 3
12 13 m
13 15 k
1 7 x
24 9 d
. . .
. . .
1 2 h
and so on.
I want to merge them to obtain the following result
12 13 a e m
13 15 b k
14 17 c c
4 9 d j
3 10 b
24 9 d
. . .
. . .
81 23 h
1 2 k
1 7 x

The first thing that usually comes to mind with these types of problems is merge, perhaps in conjunction with a Reduce(function(x, y) merge(x, y, by = "somecols", all = TRUE), yourListOfDataFrames).
However, merge is not always the most efficient function, especially since it looks like you want to "collapse" all the values to fill in the rows from left to right, which would not be the default merge behavior.
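To make the point about merge concrete, here is a minimal sketch of the Reduce + merge approach on three small data frames shaped like the question's files. The names f1, f2, f3 and merged are only for illustration. Each input keeps its own value column, so a key that is missing from the second file leaves an NA in the middle rather than collapsing to the left:

```r
# Illustrative inputs shaped like the question's three files
f1 <- data.frame(V1 = c(12L, 13L, 14L, 4L), V2 = c(13L, 15L, 17L, 9L),
                 V3 = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
f2 <- data.frame(V1 = c(12L, 3L, 14L, 4L), V2 = c(13L, 10L, 17L, 9L),
                 V3 = c("e", "b", "c", "j"), stringsAsFactors = FALSE)
f3 <- data.frame(V1 = c(12L, 13L, 1L, 24L), V2 = c(13L, 15L, 7L, 9L),
                 V3 = c("m", "k", "x", "d"), stringsAsFactors = FALSE)

# Full outer join of all three on the first two columns
merged <- Reduce(function(x, y) merge(x, y, by = c("V1", "V2"), all = TRUE),
                 list(f1, f2, f3))

names(merged)
# "V1" "V2" "V3.x" "V3.y" "V3"

# Key (13, 15) is absent from f2, so the middle column is NA
merged[merged$V1 == 13 & merged$V2 == 15, ]
```

With more than three inputs the suffixed column names start to repeat, which is another reason the stacking approach tends to be easier to manage.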
Instead, I suggest you stack everything into one long data.frame and reshape it after you have added an index variable.
Here are two approaches:
Option 1: "dplyr" + "tidyr"
Use mget to put all of your data.frames into a list.
Use bind_rows to convert that list into a single data.frame (older dplyr releases called this rbind_all, which has since been removed).
Use sequence(n()) in mutate from "dplyr" to group the data and create an index.
Use spread from "tidyr" to transform from a "long" format to a "wide" format.
library(dplyr)
library(tidyr)
combined <- bind_rows(mget(ls(pattern = "^file\\d")))
combined %>%
group_by(V1, V2) %>%
mutate(time = sequence(n())) %>%
ungroup() %>%
spread(time, V3, fill = "")
# Source: local data frame [7 x 5]
#
# V1 V2 1 2 3
# 1 1 7 x
# 2 3 10 b
# 3 4 9 d j
# 4 12 13 a e m
# 5 13 15 b k
# 6 14 17 c c
# 7 24 9 d
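In newer versions of tidyr, spread is superseded by pivot_wider. The following is a sketch of the same idea with that function, assuming tidyr 1.1.0 or later (where values_fill accepts a single value). The object names f1, f2, f3 and wide are illustrative, and the row order may differ from the spread output above because pivot_wider keeps keys in order of first appearance:

```r
library(dplyr)
library(tidyr)

# Illustrative inputs shaped like the question's three files
f1 <- data.frame(V1 = c(12L, 13L, 14L, 4L), V2 = c(13L, 15L, 17L, 9L),
                 V3 = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
f2 <- data.frame(V1 = c(12L, 3L, 14L, 4L), V2 = c(13L, 10L, 17L, 9L),
                 V3 = c("e", "b", "c", "j"), stringsAsFactors = FALSE)
f3 <- data.frame(V1 = c(12L, 13L, 1L, 24L), V2 = c(13L, 15L, 7L, 9L),
                 V3 = c("m", "k", "x", "d"), stringsAsFactors = FALSE)

wide <- bind_rows(f1, f2, f3) %>%
  group_by(V1, V2) %>%
  mutate(time = row_number()) %>%   # running index within each (V1, V2) key
  ungroup() %>%
  pivot_wider(names_from = time, values_from = V3, values_fill = "")

wide
```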
Option 2: "data.table"
Use mget to put all of your data.frames into a list.
Use rbindlist to convert that list into a single data.table.
Use sequence(.N) to generate your sequence by your groups.
Use dcast.data.table to convert the "long" data.table into a "wide" one.
library(data.table)
dcast.data.table(
rbindlist(mget(ls(pattern = "^file\\d")))[,
time := sequence(.N), by = list(V1, V2)],
V1 + V2 ~ time, value.var = "V3", fill = "")
# V1 V2 1 2 3
# 1: 1 7 x
# 2: 3 10 b
# 3: 4 9 d j
# 4: 12 13 a e m
# 5: 13 15 b k
# 6: 14 17 c c
# 7: 24 9 d
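A slightly shorter variant of the data.table approach, sketched here as an assumption rather than as part of the original answer, skips the explicit time column and lets rowid() generate the within-group counter directly in the dcast formula. The object names f1, f2, f3, long and wide are illustrative:

```r
library(data.table)

# Illustrative inputs shaped like the question's three files
f1 <- data.table(V1 = c(12L, 13L, 14L, 4L), V2 = c(13L, 15L, 17L, 9L),
                 V3 = c("a", "b", "c", "d"))
f2 <- data.table(V1 = c(12L, 3L, 14L, 4L), V2 = c(13L, 10L, 17L, 9L),
                 V3 = c("e", "b", "c", "j"))
f3 <- data.table(V1 = c(12L, 13L, 1L, 24L), V2 = c(13L, 15L, 7L, 9L),
                 V3 = c("m", "k", "x", "d"))

long <- rbindlist(list(f1, f2, f3))

# rowid(V1, V2) numbers the rows within each (V1, V2) key: 1, 2, 3, ...
wide <- dcast(long, V1 + V2 ~ rowid(V1, V2), value.var = "V3", fill = "")
wide
```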
Both of these answers assume we are starting with the following sample data:
file1 <- structure(
list(V1 = c(12L, 13L, 14L, 4L), V2 = c(13L, 15L, 17L, 9L),
V3 = c("a", "b", "c", "d")), .Names = c("V1", "V2", "V3"),
class = "data.frame", row.names = c(NA, -4L))
file2 <- structure(
list(V1 = c(12L, 3L, 14L, 4L), V2 = c(13L, 10L, 17L, 9L),
V3 = c("e", "b", "c", "j")), .Names = c("V1", "V2", "V3"),
class = "data.frame", row.names = c(NA, -4L))
file3 <- structure(
list(V1 = c(12L, 13L, 1L, 24L), V2 = c(13L, 15L, 7L, 9L),
V3 = c("m", "k", "x", "d")), .Names = c("V1", "V2", "V3"),
class = "data.frame", row.names = c(NA, -4L))
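If the data really lives in files on disk, the stacking step could start from a list of paths instead of mget. The sketch below is only an assumption about the layout: it writes two small whitespace-delimited files without headers to a temporary directory (the names file1.txt and file2.txt are made up) and reads them back into one long data.frame:

```r
# Hypothetical setup: two small whitespace-delimited files in a temp directory
tmp <- file.path(tempdir(), "merge_demo")
dir.create(tmp, showWarnings = FALSE)
writeLines(c("12 13 a", "13 15 b"), file.path(tmp, "file1.txt"))
writeLines(c("12 13 e", "3 10 b"),  file.path(tmp, "file2.txt"))

# Find the files, read each into a data.frame, and stack them
paths <- list.files(tmp, pattern = "^file\\d+\\.txt$", full.names = TRUE)
file_list <- lapply(paths, read.table, header = FALSE, stringsAsFactors = FALSE)
stacked <- do.call(rbind, file_list)

stacked   # columns are named V1, V2, V3 by read.table
```

The stacked object can then be passed to either of the reshaping options above in place of the rbind step.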

Related

Column values based on different column [duplicate]

This question already has answers here:
Creating a new variable from a lookup table
I am working on R and I have the following data frame data:
country index value
A 0 15
B 1 15
C 2 15
D 3 15
E 4 15
F 5 15
How can I map the index values to an extra column, EXTRA? For example, I want to specify (in any form) that countries with index 0, 1 and 2 get the value first in EXTRA, 3 and 5 get second, and 4 gets eleventh. The expected output would look like this:
country index value EXTRA
A 0 15 first
B 1 15 first
C 2 15 first
D 3 15 second
E 4 15 eleventh
F 5 15 second
We can use a named vector to match and replace
nm1 <- setNames(c('first', 'first', 'first', 'second', 'eleventh', 'second'), 0:5)
df1$EXTRA <- nm1[as.character(df1$index)]
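The as.character() call matters here because the index starts at 0. A short sketch of the difference, using illustrative names (lookup, idx) that are not part of the answer: numeric subscripts select by position, so 0 is silently dropped and every other index is off by one, whereas character subscripts select by name.

```r
lookup <- setNames(c("first", "first", "first", "second", "eleventh", "second"), 0:5)
idx <- c(0L, 3L, 4L)

# Positional indexing: 0 selects nothing, 3 and 4 pick the 3rd and 4th elements
by_position <- unname(lookup[idx])
by_position
# "first" "second"

# Name-based indexing: looks up the elements named "0", "3" and "4"
by_name <- unname(lookup[as.character(idx)])
by_name
# "first" "second" "eleventh"
```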
Or can use a join
library(data.table)
keydat <- data.frame(index = 0:5,
EXTRA = c('first', 'first', 'first', 'second', 'eleventh', 'second'))
setDT(df1)[keydat, EXTRA := EXTRA, on = .(index)]
data
df1 <- structure(list(country = c("A", "B", "C", "D", "E", "F"), index = 0:5,
value = c(15L, 15L, 15L, 15L, 15L, 15L)), class = "data.frame",
row.names = c(NA,
-6L))
Here is one option using nested ifelse
transform(
  df1,
  EXTRA = ifelse(index %in% 0:2,
    "first",
    ifelse(index %in% c(3, 5),
      "second",
      "eleventh"
    )
  )
)
or merge + stack
merge(df1,
  setNames(
    stack(list(first = 0:2, second = c(3, 5), eleventh = 4)),
    c("index", "EXTRA")
  ),
  by = "index",
  all.x = TRUE
)
which gives
country index value EXTRA
1 A 0 15 first
2 B 1 15 first
3 C 2 15 first
4 D 3 15 second
5 E 4 15 eleventh
6 F 5 15 second

How do you find if a number is between a range of multiple mins and max numbers

In R I have:
DataSet1
A
1
4
13
19
22
DataSet2
(min)B (max)C
4 6
8 9
12 15
16 18
I am looking to set up a binary column D based on whether A falls between B and C in any row of DataSet2.
So D would be added to DataSet1 and calculated as follows:
A D
1 0
4 1
13 1
19 0
22 0
I have tried using the InRange function, but it only checks against a single row of B and C rather than against all intervals.
Any help would be much appreciated.
Here is one option using fuzzy_left_join
library(fuzzyjoin)
library(dplyr)
df1 %>% fuzzy_left_join(df2, by = c("A" = "B", "A" = "C"),
match_fun = list(`>=`, `<`)) %>%
mutate(D = ifelse(is.na(B) & is.na(C), 0, 1))
A B C D
1 1 NA NA 0
2 4 4 6 1
3 13 12 15 1
4 19 NA NA 0
5 22 NA NA 0
Data
df1 <- structure(list(A = c(1L, 4L, 13L, 19L, 22L)), class = "data.frame", row.names = c(NA, -5L))
df2 <- structure(list(B = c(4L, 8L, 12L, 16L), C = c(6L, 9L, 15L, 18L)), class = "data.frame", row.names = c(NA, -4L))
Here's a way using sapply from base R -
df1$D <- sapply(df1$A, function(x) {
+any(x >= df2$B & x <= df2$C)
})
df1
A D
1 1 0
2 4 1
3 13 1
4 19 0
5 22 0
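The same check can also be written without an explicit loop. This is a sketch using outer(), with plain vectors A, B and C standing in for the columns of the two data sets. It builds a logical matrix with one row per value of A and one column per interval, then flags the rows that hit at least one interval:

```r
A <- c(1L, 4L, 13L, 19L, 22L)   # values to test
B <- c(4L, 8L, 12L, 16L)        # interval lower bounds
C <- c(6L, 9L, 15L, 18L)        # interval upper bounds

# inside[i, j] is TRUE when B[j] <= A[i] <= C[j]
inside <- outer(A, B, ">=") & outer(A, C, "<=")

# 1 if the value lies in at least one interval, otherwise 0
D <- as.integer(rowSums(inside) > 0)
D
# 0 1 1 0 0
```

The matrix has length(A) * length(B) cells, so for very large inputs the sapply version or a join-based approach may use less memory.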

Summarize the lowest values in a Dataframe?

My data frame looks like this:
View(df)
Product Value
a 2
b 4
c 3
d 10
e 15
f 5
g 6
h 4
i 50
j 20
k 35
l 25
m 4
n 6
o 30
p 4
q 40
r 5
s 3
t 40
I want to find the 9 most expensive products and summarise the rest. It should look like this:
Product Value
d 10
e 15
i 50
j 20
k 35
l 25
o 30
q 40
t 40
rest 46
Rest is the sum of the other 11 products.
I tried it with summarise, but it didn't work:
new <- df %>%
group_by(Product)%>%
summarise((Value > 10) = sum(Value)) %>%
ungroup()
We can use dplyr::row_number to effectively rank the observations after using arrange to order the data by Value. Then, we augment the Product column so that any values that aren't in the top 9 are coded as Rest. Finally, we group by the updated Product and take the sum using summarise
library(dplyr)
dat %>%
  arrange(desc(Value)) %>%
  mutate(RowNum = row_number(),
         Product = ifelse(RowNum <= 9, Product, 'Rest')) %>%
  group_by(Product) %>%
  summarise(Value = sum(Value))
# A tibble: 10 × 2
Product Value
<chr> <int>
1 d 10
2 e 15
3 i 50
4 j 20
5 k 35
6 l 25
7 o 30
8 q 40
9 Rest 46
10 t 40
data
dat <- structure(list(Product = c("a", "b", "c", "d", "e", "f", "g",
"h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t"
), Value = c(2L, 4L, 3L, 10L, 15L, 5L, 6L, 4L, 50L, 20L, 35L,
25L, 4L, 6L, 30L, 4L, 40L, 5L, 3L, 40L)), .Names = c("Product",
"Value"), class = "data.frame", row.names = c(NA, -20L))
Another way with dplyr is to build the result with do. The code becomes a bit harder to read because you need to use .$, but you avoid ifelse/if_else. After arranging the rows by Value in descending order, you create two vectors: one with the first nine product names plus "Rest", and one with the first nine values plus the sum of the remaining values. do then builds the data frame directly from them.
df %>%
arrange(desc(Value)) %>%
do(data.frame(Product = c(as.character(.$Product[1:9]), "Rest"),
Value = c(.$Value[1:9], sum(.$Value[10:length(.$Value)]))))
# Product Value
#1 i 50
#2 q 40
#3 t 40
#4 k 35
#5 o 30
#6 l 25
#7 j 20
#8 e 15
#9 d 10
#10 Rest 46
Here is one option using data.table
library(data.table)
setDT(df)[, i1 := .I][order(-Value)
][-(seq_len(9)), Product := 'rest'
][, .(Value = sum(Value), i1 = i1[1L]), Product
][order(Product == 'rest', i1)][, i1 := NULL][]
# Product Value
#1: d 10
#2: e 15
#3: i 50
#4: j 20
#5: k 35
#6: l 25
#7: o 30
#8: q 40
#9: t 40
#10: rest 46
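For comparison, the same result can be sketched in base R without any packages. The names ord, top, rest and result are illustrative, and ties at the cutoff are not treated specially; whichever tied rows order() places first are kept:

```r
dat <- data.frame(
  Product = letters[1:20],
  Value = c(2L, 4L, 3L, 10L, 15L, 5L, 6L, 4L, 50L, 20L, 35L,
            25L, 4L, 6L, 30L, 4L, 40L, 5L, 3L, 40L),
  stringsAsFactors = FALSE
)

# Row positions sorted from most to least expensive
ord <- order(dat$Value, decreasing = TRUE)

top  <- dat[ord[1:9], ]                                  # nine most expensive
rest <- data.frame(Product = "rest",
                   Value = sum(dat$Value[ord[-(1:9)]]))  # everything else, summed

# Top nine in alphabetical order, followed by the "rest" row
result <- rbind(top[order(top$Product), ], rest)
rownames(result) <- NULL
result
```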

R: Transform data frame into pseudoCSV

Suppose we have a two-column data frame like this:
A 1
A 2
A 4
A 5
B 2
B 13
C 1
C 3
C 6
C 18
D 8
E 2
E 112
...
Is there a quick way in R to transform it into a two-column data frame like this?
A 1;2;4;5
B 2;13
C 1;3;6;18
D 8
E 2;112
And how to put it back to the first structure again?
Thank you
A base R option would be (from comments by @David Arenburg)
res1 <- aggregate(Col2 ~ Col1, df1, paste, collapse = ";")
Or using data.table
library(data.table)
res2 <- setDT(df1)[, list(Col2=paste(Col2, collapse=";")), Col1]
Or with dplyr
library(dplyr)
res3 <- df1 %>%
group_by(Col1) %>%
summarise(Col2= paste(Col2, collapse=";") )
Update
To convert the output back to the original structure
library(splitstackshape)
cSplit(res2, 'Col2', ';', 'long')
data
df1 <- structure(list(Col1 = c("A", "A", "A", "A", "B", "B", "C", "C",
"C", "C", "D", "E", "E"), Col2 = c(1L, 2L, 4L, 5L, 2L, 13L, 1L,
3L, 6L, 18L, 8L, 2L, 112L)), .Names = c("Col1", "Col2"),
class = "data.frame", row.names = c(NA, -13L))
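As an alternative for the reverse step, tidyr offers separate_rows, which splits a delimited column into one row per value. The sketch below is an addition rather than part of the answer above. It rebuilds the collapsed table with aggregate and then expands it again; the names collapsed and expanded are illustrative:

```r
library(tidyr)

df1 <- data.frame(
  Col1 = c("A", "A", "A", "A", "B", "B", "C", "C", "C", "C", "D", "E", "E"),
  Col2 = c(1L, 2L, 4L, 5L, 2L, 13L, 1L, 3L, 6L, 18L, 8L, 2L, 112L),
  stringsAsFactors = FALSE
)

# Collapse: one row per Col1, values joined with ";"
collapsed <- aggregate(Col2 ~ Col1, df1, paste, collapse = ";")

# Expand: one row per value again; convert = TRUE turns the strings back into numbers
expanded <- separate_rows(collapsed, Col2, sep = ";", convert = TRUE)
expanded
```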
paste() with collapse = ";" is used in aggregate() to concatenate V2. To return it to the original structure, strsplit() is used to split V2 in lapply() - do.call() is just to bind the resulting list row-wise.
df <- read.table(header = FALSE, text = "
A 1
A 2
A 4
A 5
B 2
B 13
C 1
C 3
C 6
C 18
D 8
E 2
E 112")
df1 <- aggregate(df, by = list(df$V1), FUN = function(x) paste(x, collapse = ";"))[,-2]
names(df1) <- c("V1", "V2")
df1
# V1 V2
#1 A 1;2;4;5
#2 B 2;13
#3 C 1;3;6;18
#4 D 8
#5 E 2;112
df <- do.call(rbind, lapply(unique(df1$V1), function(x) {
df <- data.frame(x, strsplit(df1[df1$V1 == x, 2], ";"))
names(df) <- c("V1", "V2")
df
}))
df
# V1 V2
#1 A 1
#2 A 2
#3 A 4
#4 A 5
#5 B 2
#6 B 13
#7 C 1
#8 C 3
#9 C 6
#10 C 18
#11 D 8
#12 E 2
#13 E 112

Manipulating a Data frame in R

I am new to R. I have a data frame
A 5 8 9 6
B 8 2 3 6
C 1 8 9 5
I want to make
A 5
A 8
A 9
A 6
B 8
B 2
B 3
B 6
C 1
C 8
C 9
C 5
I have a big data file
Assuming you're starting with something like this:
mydf <- structure(list(V1 = c("A", "B", "C"), V2 = c(5L, 8L, 1L),
V3 = c(8L, 2L, 8L), V4 = c(9L, 3L, 9L),
V5 = c(6L, 6L, 5L)),
.Names = c("V1", "V2", "V3", "V4", "V5"),
class = "data.frame", row.names = c(NA, -3L))
mydf
# V1 V2 V3 V4 V5
# 1 A 5 8 9 6
# 2 B 8 2 3 6
# 3 C 1 8 9 5
Try one of the following:
library(reshape2)
melt(mydf, 1)
Or
cbind(mydf[1], stack(mydf[-1]))
Or
library(splitstackshape)
merged.stack(mydf, var.stubs = "V[2-5]", sep = "var.stubs")
The name pattern in the last example is unlikely to be applicable to your actual data though.
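Note that melt and stack return the values column by column (all of V2, then all of V3, and so on), so the result would still need sorting to match the layout in the question. As a sketch of one more option, tidyr's pivot_longer returns the values row by row, which gives the requested order directly. The output column names variable and value are arbitrary choices:

```r
library(tidyr)

mydf <- data.frame(V1 = c("A", "B", "C"), V2 = c(5L, 8L, 1L),
                   V3 = c(8L, 2L, 8L), V4 = c(9L, 3L, 9L),
                   V5 = c(6L, 6L, 5L), stringsAsFactors = FALSE)

# Every column except V1 is gathered into name/value pairs, one input row at a time
long <- pivot_longer(mydf, cols = -V1, names_to = "variable", values_to = "value")

# Keep only the identifier and the value, as in the question
long[, c("V1", "value")]
```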
Someone could probably do this in a better way but here I go...
I put your data into a data frame called data
#repeat the value in the first column (c - 1) times, where c is the number of columns (length(data[1,]))
rep(data[,1], each=length(data[1,])-1)
#turning your data frame into a matrix allows you then turn it into a vector...
#transpose the matrix because the vector concatenates columns rather than rows
as.vector(t(as.matrix(data[,2:5])))
#combining these ideas you get...
data.frame(col1=rep(data[,1], each=length(data[1,])-1),
col2=as.vector(t(as.matrix(data[,2:5]))))
If you could use a matrix you can just 'cast' it to a vector and add the row names. I have assumed that you really want 'a', 'b', 'c' as row names.
n <- 3;
data <- matrix(1:9, ncol = n);
data <- t(t(as.vector(data)));
rownames(data) <- rep(letters[1:3], each = n);
If you want to keep the row names from your first data frame, this is of course also possible without libraries.
n <- 3;
data <- matrix(1:9, ncol=n);
names <- rownames(data);
data <- t(t(as.vector(data)))
rownames(data) <- rep(names, each = n)
