Merge select columns from multiple tables using common identifiers in R

I would like to combine (merge) select columns from multiple tables with the following organization.
Here are two example datasets that I want to combine:
"dataset1"
A B C D E F (header)
1 2 3 4 5 F1(1st row)
6 7 8 9 10 F2(2nd row)
11 12 13 14 15 F3 (3rd row)
....
"dataset2"
A B C D E F (header)
16 17 18 19 20 F1(1st row)
21 22 23 24 25 F2(2nd row)
26 27 28 29 30 F3 (3rd row)
....
Here, the headers of all the different datasets (I have more than 100 datasets) are identical, and I want to use the names in column F (F1, F2, F3, ... more than F200) as unique identifiers.
For example, if I combine column "A" from all the different datasets using column F as the identifier, the result should look like this. Also, to distinguish where the data came from, the headers need to be changed to the dataset IDs.
dataset1  dataset2  F     (header)
1         16        F1    (1st row)
6         21        F2    (2nd row)
11        26        F3    (3rd row)
....
Note that all my datasets contain different numbers of rows, so some data point values corresponding to F1~F200 could be missing; in that case I want NA or an empty cell.
To this end, I tried the following code:
x <- merge(dataset1, dataset2, by="F", all=T)
But this way I cannot extract only column A; it merges every column.
Similarly, I also tried
x <- Reduce(function(x, y) merge(x, y, all=TRUE, by=("F")), list(dataset1, dataset2))
This actually gave me results identical to the previous code. To further extract only column A, I then tried the following, but it did not work:
x <- Reduce(function(x, y) merge(x, y, all = TRUE, by = "F"), list(dataset1[, 1], dataset2[, 1]))
And I have no idea how to change the header names to the names of the datasets they came from.
Please understand that I have just started learning the basics of R.
I'm using RStudio 0.98.507, and all datasets (more than a hundred) are currently loaded and present in the Global Environment.
Thank you very much!

Here's one solution with the following four sample data frames:
dataset1 <- data.frame(A = c(1, 6, 11),
                       B = c(2, 7, 12),
                       C = c(3, 8, 12),
                       D = c(4, 9, 13),
                       E = c(5, 10, 14),
                       F = c("F1", "F2", "F3"))
dataset2 <- data.frame(A = c(16, 21, 26),
                       B = c(17, 22, 27),
                       C = c(18, 23, 28),
                       D = c(19, 24, 29),
                       E = c(20, 25, 30),
                       F = c("F1", "F2", "F3"))
dataset3 <- data.frame(A = c(30, 61),
                       B = c(57, 90),
                       C = c(38, 33),
                       D = c(2, 16),
                       E = c(77, 25),
                       F = c("F1", "F2"))
dataset4 <- data.frame(A = c(36, 61),
                       B = c(47, 30),
                       C = c(37, 33),
                       D = c(45, 10),
                       E = c(66, 29),
                       F = c("F1", "F2"))
First combine them into a list:
datasets <- list(dataset1, dataset2, dataset3, dataset4)
Then rename all the columns except the F column. This is because later when we merge the data frames together, if the columns all have the same names then merge will try to differentiate them by adding .x or .y to the names -- which is fine when you're only merging two data sets, but gets confusing with more than two.
for (i in seq_along(datasets)) {
  for (j in seq_along(colnames(datasets[[i]]))) {
    if (colnames(datasets[[i]])[j] != "F") {
      colnames(datasets[[i]])[j] <- paste(colnames(datasets[[i]])[j], i, sep = ".")
    }
  }
}
This gives us data frames whose column headers look like this:
datasets[[1]]
## A.1 B.1 C.1 D.1 E.1 F
## 1 1 2 3 4 5 F1
## 2 6 7 8 9 10 F2
## 3 11 12 12 13 14 F3
Then use Reduce:
df <- Reduce(function(x, y) merge(x, y, all = TRUE, by = "F"), datasets)
And select the columns you want, in this case all the columns with A in the column name:
df[, c("F", grep("A", names(df), value = TRUE))]
## F A.1 A.2 A.3 A.4
## 1 F1 1 16 30 36
## 2 F2 6 21 61 61
## 3 F3 11 26 NA NA
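Since the asker mentioned that the 100+ data frames are already loaded in the Global Environment, the list-building step can also be automated. The sketch below is an assumption-laden variant of the answer above: it assumes the objects follow a dataset1, dataset2, ... naming pattern, collects them with mget(), and appends the dataset name itself (rather than a position index) to the column headers.

```r
# Collect every object named dataset<number> from the Global Environment.
# Assumes the naming pattern "dataset1", "dataset2", ... -- adjust the
# regular expression if your objects are named differently.
datasets <- mget(ls(pattern = "^dataset[0-9]+$"))

# Append each dataset's name to its non-F columns, so the merged result
# shows which dataset every column came from.
datasets <- lapply(names(datasets), function(nm) {
  d <- datasets[[nm]]
  keep <- colnames(d) != "F"
  colnames(d)[keep] <- paste(colnames(d)[keep], nm, sep = ".")
  d
})

# Merge everything on F, then keep only the A columns plus F.
df <- Reduce(function(x, y) merge(x, y, all = TRUE, by = "F"), datasets)
df[, c("F", grep("^A\\.", names(df), value = TRUE))]
```

With this variant the result's headers read A.dataset1, A.dataset2, and so on, which answers the "change header to dataset ID" part of the question directly.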

Related

Arranging a dataset by groups in R

I need help trying to make a dataset which contains which treatment the participants are on and what they scored in a composite test (this is just an exercise for my course; no real data is used):
A <- c(36, 35, 22, 20)
B <- c(26, 30, 25, 20)
C <- c(42, 30, 45, 62)
treatment <- c("A", "B", "C")
depression <- c(A, B, C)
df1 <- data.frame(treatment, depression)
It arranges the data in the dataframe wrongly, with the treatment values cycling A, B, C instead of A, A, A, A, B, ...
Does anyone know how to convert and arrange this data?
I need the data arranged so I can split it and run different calculations on it.
treatment <- rep(LETTERS[1:3], each=4)
depression <- c(A, B, C)
df1 <- data.frame(treatment, depression)
I think you're looking for
df1 <- data.frame(treatment = rep(treatment, each = 4), depression)
For production/"real-life" code you would probably want to do something fancier, e.g.
L <- tibble::lst(A, B, C)  ## self-naming list
data.frame(treatment = rep(names(L), lengths(L)),
           depression = unlist(L))
Here is a tidyverse approach:
library(tidyverse)
tibble(depression) %>%
  mutate(treatment = rep(treatment, each = length(A)))
depression treatment
<dbl> <chr>
1 36 A
2 35 A
3 22 A
4 20 A
5 26 B
6 30 B
7 25 B
8 20 B
9 42 C
10 30 C
11 45 C
12 62 C
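As a base-R aside, stack() builds the same long layout directly from a named list, without counting repetitions by hand. A small sketch using the question's own vectors (the renaming step just matches the asker's column names):

```r
A <- c(36, 35, 22, 20)
B <- c(26, 30, 25, 20)
C <- c(42, 30, 45, 62)

# stack() turns a named list into a two-column data frame:
# a "values" column and an "ind" factor holding the source names.
df1 <- stack(list(A = A, B = B, C = C))
names(df1) <- c("depression", "treatment")
```

This also keeps working if the treatment groups ever have unequal sizes, since the group labels are taken from the list names rather than from a fixed `each =` count.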

Concatenate two dataframes, replace NA values in R, and write the result to a csv file

Hello, I have two dataframes in R and I would like to concatenate them. The structure of the dfs is like this:
x <- data.frame(
  S1 = c(10, NA, NA),
  S2 = c(21, 22, 23)
)
y <- data.frame(
  S1 = c(11, 12, 13, 14),
  S2 = c(24, 25, 26, 27)
)
And I would like to have something like:
final <- data.frame(
  S1 = c(10, 11, 12, 13, 14, NA, NA),
  S2 = c(21, 22, 23, 24, 25, 26, 27)
)
I've tried to use natural_join but it gives me an error:
>library("rquery")
> final <- natural_join(ipeadata_d, ipeadata_d.cont, by = "ID",jointype = "FULL")
Error in natural_join.relop(dnodea, dnodeb, jointype = jointype, by = by, :
rquery::natural_join.relop all tables must have all join keys, the following keys are not in some tables: ID
I also tried rbind, but the dataframe keeps the NAs.
I would like to concatenate the dataframes as in the "final" example, and then write final to a csv file. Thanks for your help.
You may combine the two datasets using bind_rows and sort each column, putting NAs last.
library(dplyr)
bind_rows(x, y) %>%
  mutate(ID = row_number(),
         across(c(S1, S2), ~ sort(.x, na.last = TRUE)))
# ID S1 S2
#1 1 10 21
#2 2 11 22
#3 3 12 23
#4 4 13 24
#5 5 14 25
#6 6 NA 26
#7 7 NA 27
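The question also asks how to write the result to a csv file, which neither step above covers. Assuming the combined result has been assigned to a variable (final here, matching the asker's example; the file name is just an illustration), base R's write.csv does it directly:

```r
# Assumes `final` holds the combined data frame from the steps above.
# row.names = FALSE drops the automatic row-index column from the file.
write.csv(final, "final.csv", row.names = FALSE)
```

NA cells are written as the literal text NA by default; pass `na = ""` if you prefer them empty in the csv.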
There are a few issues here:
First: your example isn't reproducible because those data.frames do not have the same number of elements in each vector. I assume your ID vector should be of equal length to S1 and S2.
Second: it sounds like you can accomplish what you want in base R, without any special functions. You are just attempting to concatenate or "union" the 2 data.frames. R uses the command rbind to do this.
I am making an assumption here on what your desired output is.
Here is a working example using rbind:
x <- data.frame(
  ID = c(1, 2, 3),
  S1 = c(10, NA, NA),
  S2 = c(21, 22, 23)
)
y <- data.frame(
  ID = c(4, 5, 6, 7),
  S1 = c(11, 12, 13, 14),
  S2 = c(24, 25, 26, 27)
)
final <- rbind(x,y)
> rbind(x,y)
ID S1 S2
1 1 10 21
2 2 NA 22
3 3 NA 23
4 4 11 24
5 5 12 25
6 6 13 26
7 7 14 27
For your reference, "merging" usually refers to combining 2 data.frames based on a shared column or key.

R using combn with apply

I have a data frame that has percentage values for a number of variables and observations, as follows:
obs <- data.frame(Site = c("A", "B", "C"), X = c(11, 22, 33), Y = c(44, 55, 66), Z = c(77, 88, 99))
I need to prepare this data as an edge list for network analysis, with "Site" as the nodes and the remaining variables as the edges. The result should look like this:
Node1 Node2 Weight Type
A B 33 X
A C 44 X
...
B C 187 Z
So that for "Weight" we are calculating the sum of all possible pairs, and this separately for each column (which ends up in "Type").
I suppose the answer to this has to be using apply on a combn expression, like here Applying combn() function to data frame, but I haven't quite been able to work it out.
I can do this all by hand, taking the combinations for "Site":
sites <- combn(obs$Site, 2)
Then the individual columns like so:
combX <- combn(obs$X, 2, function(x) sum(x))
and binding those datasets together, but this obviously becomes tedious very quickly.
I have tried to do all the variable columns in one go like this:
b <- apply(newdf[, -1], 1, function(x) {
  sum(utils::combn(x, 2))
})
but there is something wrong with that.
Can anyone help, please?
One option would be to create a function and then map that function to all the columns that you have.
library(dplyr)
library(purrr)

func1 <- function(var) {
  obs %>%
    transmute(Node1 = combn(Site, 2)[1, ],
              Node2 = combn(Site, 2)[2, ],
              Weight = combn(!!sym(var), 2, function(x) sum(x)),
              Type = var)
}
map(colnames(obs)[-1], func1) %>% bind_rows()
Here is an example using combn:
do.call(
  rbind,
  combn(1:nrow(obs),
        2,
        FUN = function(k) cbind(data.frame(t(obs[k, 1])),
                                stack(data.frame(as.list(colSums(obs[k, -1]))))),
        simplify = FALSE
  )
)
which gives
X1 X2 values ind
1 A B 33 X
2 A B 99 Y
3 A B 165 Z
4 A C 44 X
5 A C 110 Y
6 A C 176 Z
7 B C 55 X
8 B C 121 Y
9 B C 187 Z
Try it this way:
library(tidyverse)
obs_long <- obs %>% pivot_longer(-Site, names_to = "type")
sites <- combn(obs$Site, 2) %>% t() %>% as_tibble()
Type <- tibble(type = c("X", "Y", "Z"))
merge(sites, Type) %>%
  left_join(obs_long, by = c("V1" = "Site", "type" = "type")) %>%
  left_join(obs_long, by = c("V2" = "Site", "type" = "type")) %>%
  mutate(res = value.x + value.y) %>%
  select(-c(value.x, value.y))
V1 V2 type res
1 A B X 33
2 A C X 44
3 B C X 55
4 A B Y 99
5 A C Y 110
6 B C Y 121
7 A B Z 165
8 A C Z 176
9 B C Z 187
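For completeness, here is a compact base-R sketch of the same idea, with output columns renamed to match the header the asker requested (Node1/Node2/Weight/Type come from the question; everything else is just one possible arrangement):

```r
obs <- data.frame(Site = c("A", "B", "C"),
                  X = c(11, 22, 33), Y = c(44, 55, 66), Z = c(77, 88, 99))

# all pairs of row indices: (1,2), (1,3), (2,3)
pairs <- combn(nrow(obs), 2)

# for each value column, sum every pair and label rows with the column name
edges <- do.call(rbind, lapply(names(obs)[-1], function(v) {
  data.frame(Node1 = obs$Site[pairs[1, ]],
             Node2 = obs$Site[pairs[2, ]],
             Weight = combn(obs[[v]], 2, sum),
             Type = v)
}))
edges
```

Computing `pairs` once and reusing it for every column avoids the repeated combn(Site, 2) calls of the first answer.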

How to make a new column in a dataframe, based on ranges within factors from another, in R

This is a slightly complex issue I am trying to solve in R (R-Studio, R version 3.3.1).
I have two dataframes (DF_A, DF_B). DF_A is structured like this:
Filename Timestamp
A 11
A 12
A 17
B 18
B 22
B 23
C 24
C 28
C 30
And, DF_B like this:
Timestamp
11
12
13
14
15
16
17
18
19
...
30
And I'd like to be able to move the filename from DF_A to DF_B, based on the range of values seen in each Filename factor from DF_A. So:
Timestamp Filename
11 A
12 A
13 A
14 A
...
18 B
19 B
...
24 C
I was considering getting the min-max timestamp of each factor in DF_A, and appending the Filename where timestamps in DF_B fall into the same range. Thus far, I have managed to get the min-max with a solution I found, which turns the dataframe into a data.table and gets the min/max for each factor:
DT_A <- as.data.table(DF_A)
DT_A[, .SD[which.min(Timestamp)], by = Filename]
DT_A[, .SD[which.max(Timestamp)], by = Filename]
Alas, this is as far as I have gotten. I am not sure how I would apply this to DF_B. The solution can be pretty open here. Curious to see the different solutions. Any help is greatly appreciated. Thanks!
# import the necessary package
library(data.table)

# create lookup data table
DT_A <- data.table(
  Filename = rep(c("A", "B", "C"), each = 3),
  Timestamp = c(11, 12, 17, 18, 22, 23, 24, 28, 30)
)

# form data table to be labelled
DT_B <- data.table(
  Timestamp = 11:30
)

# get the minimum and maximum timestamp for each filename
DT_limits <- DT_A[,
  .(from = min(Timestamp, na.rm = TRUE),
    to = max(Timestamp, na.rm = TRUE)),
  by = Filename]

## apply a fast overlap join
DT_B[, dummy := Timestamp]
setkey(DT_limits, from, to)
DT_final <- foverlaps(
  DT_B,
  DT_limits,
  by.x = c("Timestamp", "dummy"),
  nomatch = 0L
)[, c("from", "to", "dummy") := NULL]
DT_final
# Filename Timestamp
# 1: A 11
# 2: A 12
# 3: A 13
# 4: A 14
# ...
# 8: B 18
# 9: B 19
# ...
# 14: C 24
# ...
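Since each Filename here covers one contiguous, non-overlapping range, a base-R alternative is findInterval over the sorted per-file minimum timestamps. One behavioural difference from the foverlaps answer to note: timestamps after the last file's start are always assigned to that file rather than dropped. A sketch under that non-overlap assumption:

```r
DF_A <- data.frame(Filename = rep(c("A", "B", "C"), each = 3),
                   Timestamp = c(11, 12, 17, 18, 22, 23, 24, 28, 30))
DF_B <- data.frame(Timestamp = 11:30)

# earliest timestamp per file, sorted ascending
starts <- sort(tapply(DF_A$Timestamp, DF_A$Filename, min))

# each DF_B timestamp belongs to the file whose range opened last,
# i.e. the largest start value that is <= the timestamp
DF_B$Filename <- names(starts)[findInterval(DF_B$Timestamp, starts)]
```

This avoids the data.table dependency, at the cost of the stricter assumptions stated above.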

Making contingency table

I'm having trouble with a contingency table.
I want to convert this kind of table:
dat <- read.csv(text="Gatunek,Obecnosc,Lokalizacja,Frekwencja
Koń dziki,TAK,Polska,11
Koń dziki,NIE,Polska,14
Koń dziki,TAK,Kujawy,39
Koń dziki,NIE,Kujawy,31",header=TRUE)
# Gatunek Obecnosc Lokalizacja Frekwencja
#Koń dziki TAK Polska 11
#Koń dziki NIE Polska 14
#Koń dziki TAK Kujawy 39
#Koń dziki NIE Kujawy 31
to this (the desired cross-table was shown as an image):
Don't be afraid, it's just Polish.
For the moment I only have a table which looks like this:
xtabs should do the trick:
x <- data.frame(a = c(1, 2, 1, 2),
                b = c("a", "a", "b", "b"),
                c = c(11, 14, 39, 31))
xtabs(c ~ a + b, data = x)
# b
#a a b
# 1 11 39
# 2 14 31
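Applied to the question's own dat, the same xtabs call cross-tabulates Frekwencja by the two remaining factors:

```r
dat <- read.csv(text = "Gatunek,Obecnosc,Lokalizacja,Frekwencja
Koń dziki,TAK,Polska,11
Koń dziki,NIE,Polska,14
Koń dziki,TAK,Kujawy,39
Koń dziki,NIE,Kujawy,31", header = TRUE)

# Frekwencja summed within each Obecnosc x Lokalizacja cell
xtabs(Frekwencja ~ Obecnosc + Lokalizacja, data = dat)
```

The cell values land as TAK/Polska = 11, NIE/Polska = 14, TAK/Kujawy = 39, NIE/Kujawy = 31, with the row and column order following the factors' alphabetical levels.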