Adding column with information from another dataframe in R

I have two dataframes and I need to join information from both.
Here is the first df, where I have different points (1,2,3..):
eleno elety resno
1 N 1
2 CA 1
3 C 1
4 O 1
5 CB 1
6 CG 1
The second one indicates distances between points, "eleno" represents the first point and "ele2" the second one:
eleno ele2 values
<chr> <chr> <dbl>
1 2 1.46
1 3 2.46
1 4 2.86
1 5 2.46
1 6 3.83
1 7 4.47
I'd like to have in the 1st df a new column with info from df 2. For example, for point 1 I'd like to have -2(second point):1.46(distance), -3:2.46, -4:2.86 and so on, preferably in one column.
Something like this
eleno elety resno dist
1 N 1 -2:1.46, -3:2.46, -4:2.86 ...
2 CA 1
3 C 1
4 O 1
5 CB 1
6 CG 1
Thank you!

If I understand your preference for one column correctly, then a possibility without dplyr is as follows. First, we create the new column by concatenating the ele2 and values columns from df2 using the paste() function, with a colon as the separator:
new_column <- paste(-df2$ele2, df2$values, sep = ":")
Then, we use cbind() to bind it to df1:
new_df1 <- cbind(df1, ele2_values = new_column)
This will give us a new data frame like so:
eleno elety resno ele2_values
1 1 N 1 -2:1.46
2 2 CA 1 -3:2.46
3 3 C 1 -4:2.86
4 4 O 1 -5:2.46
5 5 CB 1 -6:3.83
6 6 CG 1 -7:4.47
Here is the data that I used, based on what you have given:
df1 <- data.frame(
eleno = 1:6,
elety = c("N", "CA", "C", "O", "CB", "CG"),
resno = rep(1, 6)
)
df2 <- data.frame(
eleno = rep(1, 6),
ele2 = 2:7,
values = c(1.46, 2.46, 2.86, 2.46, 3.83, 4.47)
)
If we want to get this column as a single element for each point, we can modify our code in the following manner:
Instantiate new_column as an empty vector:
new_column <- vector()
Then call some variant of *apply() or use a for loop to subset the original data frame by points, while applying our original code and appending our singular character elements back to new_column:
lapply(unique(df2$eleno), FUN = function(x) {
# rows of df2 for the current point (named `rows` to avoid shadowing subset())
rows <- subset(df2, eleno == x)
new_elem <- paste(-rows$ele2, rows$values, sep = ":", collapse = ", ")
new_column <<- c(new_column, new_elem)
})
Once this operation is complete, we use cbind() as before to bind new_column to df1:
new_df1 <- cbind(df1, ele2_values = new_column)
Our output is as follows,
eleno elety resno ele2_values
1 1 N 1 -2:1.13703411305323, -3:6.22299404814839, -4:6.09274732880294, -5:6.23379441676661, -6:8.60915383556858, -7:6.40310605289415
2 2 CA 1 -2:0.094957563560456, -3:2.32550506014377, -4:6.66083758231252, -5:5.14251141343266, -6:6.93591291783378, -7:5.44974835589528
3 3 C 1 -2:2.82733583590016, -3:9.23433484276757, -4:2.92315840255469, -5:8.37295628152788, -6:2.86223284667358, -7:2.66820780001581
4 4 O 1 -2:1.86722789658234, -3:2.32225910527632, -4:3.16612454829738, -5:3.02693370729685, -6:1.59046002896503, -7:0.399959180504084
5 5 CB 1 -2:2.18799541005865, -3:8.10598552459851, -4:5.25697546778247, -5:9.14658166002482, -6:8.3134504687041, -7:0.45770263299346
6 6 CG 1 -2:4.56091482425109, -3:2.65186671866104, -4:3.04672203026712, -5:5.0730687007308, -6:1.81096208281815, -7:7.59670635452494
Here is my random data that I used for df2 in this case:
set.seed(1234)
df2 <- data.frame(
eleno = rep(1:6, rep(6, 6)),
ele2 = 2:7,
values = runif(length(rep(1:6, rep(6, 6)))) * 10
)
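Note that cbind() relies on new_column having one element per row of df1, in the same order. As a hedged alternative (a sketch, not the answerer's method), the strings can be collapsed per point with aggregate() and matched back by eleno with merge(). The names pair, dist_by_point and result are made up for illustration; the data is the first df2 from the answer, where only point 1 has distances.

```r
# Data as in the answer above (only point 1 has distances).
df1 <- data.frame(
  eleno = 1:6,
  elety = c("N", "CA", "C", "O", "CB", "CG"),
  resno = rep(1, 6)
)
df2 <- data.frame(
  eleno = rep(1L, 6),
  ele2 = 2:7,
  values = c(1.46, 2.46, 2.86, 2.46, 3.83, 4.47)
)

# One "-ele2:value" string per row of df2.
df2$pair <- paste(-df2$ele2, df2$values, sep = ":")

# Collapse the strings into a single element per first point.
dist_by_point <- aggregate(pair ~ eleno, data = df2, FUN = paste, collapse = ", ")
names(dist_by_point)[names(dist_by_point) == "pair"] <- "dist"

# Left join on eleno: points without any distances get NA.
result <- merge(df1, dist_by_point, by = "eleno", all.x = TRUE)
result
```

Because the match is done on eleno, this does not depend on the row order of either data frame.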

Related

calculate means and occurences from multiple matrices

I have a number of matrices that all have the same type of elements but different lengths. Columns in all files are the same (let's call them "A" and "B"), but the rows are mostly, though not always, the same elements across files.
Here are some example data (in the form of dataframes)
df1 <- data.frame(A = 1:3, B = 3:1)
rownames(df1)=c("alpha","beta","gamma")
df2 <- data.frame(A = 1:5,B = 5:1)
rownames(df2)=c("alpha","beta","delta","gamma","zeta")
df3 <- data.frame(A = 1:7, B = 7:1)
rownames(df3)=c("alpha","beta","delta","gamma","zeta","theta","epsilon")
As you can see, as far as the rows go, even though "alpha", "beta" and "gamma" are always present, many of the others are not always there.
I would like to calculate 2 things:
the average values of all A and B columns in all matrices and ideally that would be by creating an ave.matr that would have all rownames and the average/mean values of the columns "A" and "B"
A B
alpha 1 7
beta 2 6
delta 3 5
gamma 4 4
zeta 5 3
theta 6 2
epsilon 7 1
(where the above numbers are the mean values of all matrices)
and then an occurrence matrix, let's call it occur.matr, that would count the number of occurrences of each row across all matrices. It should look like this:
A B
alpha 3
beta 3
delta 2
gamma 3
zeta 2
theta 1
epsilon 1
I have started working on this today but I cannot figure out how to do it.
I started by creating a list and a matrix with the unique rownames from all matrices
list=c(rownames(df1),rownames(df2),rownames(df3))
unique=unique(list)
avematr<-matrix(NA,nrow=length(unique),ncol=2)
My next step would be to make the rownames of all matrices identical. I tried with match but I cannot figure it out, and at this moment I don't even know if this is the best strategy...
And all similar questions out there are related to merging the matrices (which is not what I want to do).
Any help is greatly appreciated
Here is a tidyverse approach:
library(tidyverse)
df1 <- data.frame(A = 1:3, B = 3:1)
rownames(df1)=c("alpha","beta","gamma")
df2 <- data.frame(A = 1:5,B = 5:1)
rownames(df2)=c("alpha","beta","delta","gamma","zeta")
df3 <- data.frame(A = 1:7, B = 7:1)
rownames(df3)=c("alpha","beta","delta","gamma","zeta","theta","epsilon")
dat <- list(df1, df2, df3) %>%
map_dfr(rownames_to_column, var = "id")
avg_dat <- dat %>%
group_by(id) %>%
summarise(A = mean(A),
B = mean(B))
#> `summarise()` ungrouping output (override with `.groups` argument)
avg_dat
#> # A tibble: 7 x 3
#> id A B
#> <chr> <dbl> <dbl>
#> 1 alpha 1 5
#> 2 beta 2 4
#> 3 delta 3 4
#> 4 epsilon 7 1
#> 5 gamma 3.67 2.33
#> 6 theta 6 2
#> 7 zeta 5 2
occ_dat <- dat %>% count(id)
occ_dat
#> id n
#> 1 alpha 3
#> 2 beta 3
#> 3 delta 2
#> 4 epsilon 1
#> 5 gamma 3
#> 6 theta 1
#> 7 zeta 2
Created on 2021-01-27 by the reprex package (v0.3.0)
If you want to stick to base R:
For the averaging task it makes things easier when you add your rowname as a column. This prevents autonumbering of rownames when combining the dataframes. You then can simply loop over every unique rowname and construct the averages. A quick and dirty solution could look like this:
df1 <- data.frame(A = 1:3, B = 3:1)
rownames(df1)=c("alpha","beta","gamma")
df2 <- data.frame(A = 1:5,B = 5:1)
rownames(df2)=c("alpha","beta","delta","gamma","zeta")
df3 <- data.frame(A = 1:7, B = 7:1)
rownames(df3)=c("alpha","beta","delta","gamma","zeta","theta","epsilon")
add_row_names_to_df <- function(df) {
df$rn <- rownames(df)
return(df)
}
new_df <- rbind(add_row_names_to_df(df1),
add_row_names_to_df(df2),
add_row_names_to_df(df3))
avg_df <- as.data.frame(matrix(unique(new_df$rn),
nrow = length(unique(new_df$rn)),
ncol = 3))
for(i in 1:nrow(avg_df)) {
avg_df[i,] <- c(avg_df[i,1],
mean(new_df$A[new_df$rn==avg_df[i,1]]),
mean(new_df$B[new_df$rn==avg_df[i,1]]))
}
colnames(avg_df) <- c("rowname", "avgA", "avgB")
avg_df
results in:
rowname avgA avgB
1 alpha 1 5
2 beta 2 4
3 gamma 3.66666666666667 2.33333333333333
4 delta 3 4
5 zeta 5 2
6 theta 6 2
7 epsilon 7 1
For the occurrence matrix you can use the table() function from base R:
as.matrix(table(c(rownames(df1),rownames(df2),rownames(df3))))
yields:
[,1]
alpha 3
beta 3
delta 2
epsilon 1
gamma 3
theta 1
zeta 2
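The loop above can also be written without explicit iteration. The following is a minimal base R sketch under the same idea (row names copied into a column, then grouped); the names add_rn, stacked, avg and occ are made up for illustration.

```r
df1 <- data.frame(A = 1:3, B = 3:1)
rownames(df1) <- c("alpha", "beta", "gamma")
df2 <- data.frame(A = 1:5, B = 5:1)
rownames(df2) <- c("alpha", "beta", "delta", "gamma", "zeta")
df3 <- data.frame(A = 1:7, B = 7:1)
rownames(df3) <- c("alpha", "beta", "delta", "gamma", "zeta", "theta", "epsilon")

# Copy the row names into a regular column so they survive rbind().
add_rn <- function(df) {
  df$rn <- rownames(df)
  df
}
stacked <- do.call(rbind, lapply(list(df1, df2, df3), add_rn))

# Mean of A and B per original row name (numeric columns, sorted by rn).
avg <- aggregate(cbind(A, B) ~ rn, data = stacked, FUN = mean)

# Number of data frames in which each row name occurs.
occ <- table(stacked$rn)

avg
occ
```

Unlike the loop, aggregate() keeps avgA and avgB numeric rather than converting them to character.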

Including map() function to tabulate each element in a character vector returns an error

I'd like to tabulate the frequencies of each individual element in a character vector. This vector contains the answers to a set of items in a survey, with the structure "ADCDAB...", where "A" is the answer to the first item, "D" to the second one, etc.
I'd like to process the data with purrr::map combined with base string functions.
p1 <- strsplit(test$answer, "")
map(p1,table)
However, if I include the code in a dplyr pipeline, the system returns an error message:
test %>%
mutate(p1=strsplit(answer,"")) %>%
map(p1,table)
the system returns the following error message:
Error: Index 1 must have length 1, not 10
What's wrong with the second syntax?
A dummy dataset
structure(list(answer = c(".BBCBD.A.D", "...DB..AA.", "B......AB.",
"BDDDBACADD", "BB.ABC.AAD"), d.n.i = c(1, 2, 3, 4, 5)), row.names = c(NA,
5L), class = "data.frame")
Here is a base R option
x <- "ADCDAB"
out <- table(utf8ToInt(x))
names(out) <- intToUtf8(names(out), multiple = TRUE)
out
#A B C D
#2 1 1 2
With multiple elements use lapply
x <- c("ADCDAB", "EFG")
f <- function(i) {
out <- table(utf8ToInt(i))
names(out) <- intToUtf8(names(out), multiple = TRUE)
out
}
lapply(x, f)
Returns
#[[1]]
#A B C D
#2 1 1 2
#[[2]]
#E F G
#1 1 1
If you need output as single table, try
x <- c("ADCDAB", "EFGAA")
f(paste(x, collapse = ""))
#A B C D E F G
#4 1 1 2 1 1 1
.. or as dataframe
as.data.frame(f(paste(x, collapse = "")))
# Var1 Freq
#1 A 4
#2 B 1
#3 C 1
#4 D 2
#5 E 1
#6 F 1
#7 G 1
You could do :
library(tidyverse)
test %>% mutate(p1 = strsplit(answer,""), p2 = map(p1, table))
However, I would suggest something like below :
test %>%
mutate(p1 = strsplit(answer,"")) %>%
unnest(p1) %>%
count(answer, p1)
# answer p1 n
# <chr> <chr> <int>
#1 ABCD A 1
#2 ABCD B 1
#3 ABCD C 1
#4 ABCD D 1
#5 ADCDAB A 2
#6 ADCDAB B 1
#7 ADCDAB C 1
#8 ADCDAB D 2
data
test <- data.frame(answer = c("ADCDAB", "ABCD"), stringsAsFactors = FALSE)
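One plausible reading of the error (an interpretation, not something stated in the answers above): %>% supplies the whole data frame as the first argument of map(), so the call becomes map(test, p1, table) and p1 ends up in the position of the function to apply instead of being the data. Keeping the tabulation inside a column assignment avoids that. A minimal base R sketch of the same idea on the two-row example data, where p2 is an illustrative column name:

```r
test <- data.frame(answer = c("ADCDAB", "ABCD"), stringsAsFactors = FALSE)

# Split each answer string into single characters (one list element per row).
test$p1 <- strsplit(test$answer, "")

# Tabulate each row's characters; the result is a list column of tables.
test$p2 <- lapply(test$p1, table)

test$p2[[1]]
```

Here lapply() plays the role of map(), and the assignment to test$p2 plays the role of mutate().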

How to stack columns of data-frame in r?

I have a data-frame with these characteristics:
Z Y X1 X2 X3 X4 X5 ... X30
A n1 1 2 1 2 1 2 1 2
B n2 1 2 1 2 1 2 1 2
C n3 1 2 1 2 1 2 1 2
D n4 1 2 1 2 1 2 1 2
.
.
.
My goal is to stack the columns x1, x2, … x30 and associate the new column with a label built from columns z and y and the name of the x column. Something like this:
Newcolumn zyx
1 x-y-z
... I need a data-frame like this:
colum1 colum2
1 A+n1+X1.headername 1
2 B+n2+X2.headername 2
3 C+n3+X3.headername 1
4 D+n4+X4.headername 2
. .
. .
. .
I’m trying to build a function, but I am having some trouble.
I use this code for the data-frame:
df$zy <- paste(df$z,"-",df$y)
After that, I eliminate the columns “z” and “y”:
df$z <- NULL
df$y <- NULL
And save column df$zy as data-frame for use later:
df_zy <- as.data.frame(df$zy)
Then I eliminate df$zy from the original dataframe:
df$zy <- NULL
After that, I save column x1 as a data-frame and add df_zy and the name of column x1 (the name is “1”):
a <- as.data.frame(df$`1`)
b <- cbind(a, df_zy, x_column= 1)
b$zy <- paste(b$x_column,"-",b$` df$zy`)
b$` df$zy ` <- NULL
b$ x_column <- NULL
colnames(b)
names(b)[names(b) == "b$`1`"] <- "new_column"
This works, but only for the column x1 and I need this for x1 to x30, and stack all new column
Does anybody have an answer to this problem? Thanks!
You can use the tidyr and dplyr libraries:
library(dplyr)
library(tidyr)
df_zy = df %>% pivot_longer(., cols = starts_with("X"), names_to = "Variables", values_to = "Value") %>%
mutate(NewColumn = paste0(Z,"-",Y,"-",Variables)) %>% select(NewColumn, Value)
And you get:
> df_zy
# A tibble: 8 x 2
NewColumn Value
<chr> <dbl>
1 A-n1-X1 1
2 A-n1-X2 2
3 B-n2-X1 1
4 B-n2-X2 2
5 C-n3-X1 1
6 C-n3-X2 2
7 D-n4-X1 1
8 D-n4-X2 2
Data
df = data.frame("Z" = LETTERS[1:4],
"Y" = c("n1","n2","n3","n4"),
"X1" = c(1,1,1,1),
"X2" = c(2,2,2,2))
Is this what you are looking for?
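For comparison, here is a hedged base R sketch of the same reshaping with stack(), using the small example data from the answer. The names x_cols and long are made up for illustration. Note that stack() goes column by column, so the row order differs from the pivot_longer() output (all X1 rows first, then all X2 rows).

```r
df <- data.frame(Z = LETTERS[1:4],
                 Y = c("n1", "n2", "n3", "n4"),
                 X1 = c(1, 1, 1, 1),
                 X2 = c(2, 2, 2, 2))

# Names of the columns to stack (everything starting with "X").
x_cols <- grep("^X", names(df), value = TRUE)

# stack() returns two columns: `values` and `ind` (the source column name).
long <- stack(df[x_cols])

# Repeat Z and Y once per stacked column so they line up with `long`.
long$NewColumn <- paste(rep(df$Z, times = length(x_cols)),
                        rep(df$Y, times = length(x_cols)),
                        long$ind, sep = "-")

df_zy <- long[c("NewColumn", "values")]
df_zy
```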

Recursively sum data frames for matching rows

I would like to combine a set of data frames into a single data frame by summing columns that have matching variables (instead of appending columns).
For example, given
df1 <- data.frame(A = c(0,0,1,1,1,2,2), B = c(1,2,1,2,3,1,5), x = c(2,3,1,5,3,7,0))
df2 <- data.frame(A = c(0,1,1,2,2,2), B = c(1,1,3,2,4,5), x = c(4,8,4,1,0,3))
df3 <- data.frame(A = c(0,1,2), B = c(5,4,2), x = c(5,3,1))
I want to match by "A" and "B" and sum the values of "x". For this example, I can get the desired result as follows:
library(plyr)
library(dplyr)
# rename columns so that join_all preserves them all:
colnames(df1)[3] <- "x1"
colnames(df2)[3] <- "x2"
colnames(df3)[3] <- "x3"
# join the data frames by matching "A" and "B" values:
res <- join_all(list(df1, df2, df3), by = c("A", "B"), type = "full")
# get the sums and drop superfluous columns:
arrange(res, A, B) %>%
rowwise() %>%
mutate(x = sum(x1, x2, x3, na.rm = TRUE)) %>%
select(A, B, x)
Result:
A B x
<dbl> <dbl> <dbl>
1 0 1 6
2 0 2 3
3 0 5 5
4 1 1 9
5 1 2 5
6 1 3 7
7 1 4 3
8 2 1 7
9 2 2 2
10 2 4 0
11 2 5 3
A more general solution is
library(dplyr)
# function to get the desired result for two data frames:
my_merge <- function(df1, df2)
{
m1 <- merge(df1, df2, by = c("A", "B"), all = TRUE)
m1 <- rowwise(m1) %>%
mutate(x = sum(x.x, x.y, na.rm = TRUE)) %>%
select(A, B, x)
return(m1)
}
l1 <- list(df2, df3) # omit the first data frame
res <- df1 # initial value of the result
for(df in l1) res <- my_merge(res, df) # call the function repeatedly
Is there a more efficient option for combining a large set of data frames? Ideally it should be recursive (i.e. it's better not to join all data frames into one massive data frame before calculating the sums).
An easier option is to bind the rows of the datasets, then group by the columns of interest and get the summarised output by getting the sum of 'x'
library(tidyverse)
bind_rows(df1, df2, df3) %>%
group_by(A, B) %>%
summarise(x = sum(x))
# A tibble: 11 x 3
# Groups: A [?]
# A B x
# <dbl> <dbl> <dbl>
# 1 0 1 6
# 2 0 2 3
# 3 0 5 5
# 4 1 1 9
# 5 1 2 5
# 6 1 3 7
# 7 1 4 3
# 8 2 1 7
# 9 2 2 2
#10 2 4 0
#11 2 5 3
If there are many objects in the global environment with the pattern "df" followed by some digits
mget(ls(pattern= "^df\\d+")) %>%
bind_rows %>%
group_by(A, B) %>%
summarise(x = sum(x))
Since the OP mentioned memory constraints, another option is to do the join first and then use rowSums (or + with reduce), which may be more efficient
mget(ls(pattern= "^df\\d+")) %>%
reduce(full_join, by = c("A", "B")) %>%
transmute(A, B, x = rowSums(.[3:5], na.rm = TRUE)) %>%
arrange(A, B)
# A B x
#1 0 1 6
#2 0 2 3
#3 0 5 5
#4 1 1 9
#5 1 2 5
#6 1 3 7
#7 1 4 3
#8 2 1 7
#9 2 2 2
#10 2 4 0
#11 2 5 3
This could also be done with data.table
library(data.table)
rbindlist(mget(ls(pattern= "^df\\d+")))[, .(x = sum(x)), by = .(A, B)]
Ideally it should be recursive (i.e. it's better not to join all data frames into one massive data frame before calculating the sums).
If you're memory constrained and willing to sacrifice speed (vs @akrun's data.table approach), use one table at a time in a loop:
library(data.table)
tabs = c("df1", "df2", "df3")
# enumerate all combos for the results table
# initializing sum to 0
res = CJ(A = 0:2, B = 1:5, x = 0)
# loop over tabs, adding on
for (i in seq_along(tabs)){
tab = get(tabs[[i]])
res[tab, on=.(A, B), x := x + i.x][]
rm(tab)
}
If you need to read tables from disk, change tabs to file names and get to fread or whatever function.
I am skeptical that you can fit all the tables in memory, but cannot also fit an rbind-ed copy of them together.
Similarly (thanks to @akrun's comment), use his approach pairwise:
res = data.table(get(tabs[[1]]))[0L]
for (i in seq_along(tabs)){
tab = get(tabs[[i]])
res = rbind(res, tab)[, .(x = sum(x)), by=.(A,B)]
rm(tab)
}
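The pairwise idea can also be sketched in base R with Reduce(), which folds the list one data frame at a time so that only the running result and the next table are combined. This is an illustrative sketch rather than a benchmarked solution; sum_pair is a made-up helper name.

```r
df1 <- data.frame(A = c(0,0,1,1,1,2,2), B = c(1,2,1,2,3,1,5), x = c(2,3,1,5,3,7,0))
df2 <- data.frame(A = c(0,1,1,2,2,2), B = c(1,1,3,2,4,5), x = c(4,8,4,1,0,3))
df3 <- data.frame(A = c(0,1,2), B = c(5,4,2), x = c(5,3,1))

# Stack the running result with the next data frame and re-sum x per (A, B).
sum_pair <- function(acc, df) {
  aggregate(x ~ A + B, data = rbind(acc, df), FUN = sum)
}

# Fold over the list: ((df1 + df2) + df3).
res <- Reduce(sum_pair, list(df1, df2, df3))

# aggregate() does not order by A first, so sort explicitly.
res <- res[order(res$A, res$B), ]
res
```

With a single data frame in the list, Reduce() returns it unchanged, so duplicate (A, B) pairs inside that one frame would not be summed.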

Grouping of R dataframe by connected values

I didn't find a solution for this common grouping problem in R:
This is my original dataset
ID State
1 A
2 A
3 B
4 B
5 B
6 A
7 A
8 A
9 C
10 C
This should be my grouped resulting dataset
State min(ID) max(ID)
A 1 2
B 3 5
A 6 8
C 9 10
So the idea is to sort the dataset first by the ID column (or a timestamp column). Then all connected states with no gaps should be grouped together and the min and max ID value should be returned. It's related to the rle method, but this doesn't allow the calculation of min, max values for the groups.
Any ideas?
You could try:
library(dplyr)
df %>%
mutate(rleid = cumsum(State != lag(State, default = ""))) %>%
group_by(rleid) %>%
summarise(State = first(State), min = min(ID), max = max(ID)) %>%
select(-rleid)
Or, as mentioned by @alistaire in the comments, you can actually mutate within group_by() with the same syntax, combining the first two steps. Stealing data.table::rleid() and using summarise_all() to simplify:
df %>%
group_by(State, rleid = data.table::rleid(State)) %>%
summarise_all(list(min = min, max = max)) %>%
select(-rleid)
Which gives:
## A tibble: 4 × 3
# State min max
# <fctr> <int> <int>
#1 A 1 2
#2 B 3 5
#3 A 6 8
#4 C 9 10
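To see why the cumsum() trick identifies runs, it may help to look at the intermediate vectors. The following base R sketch uses the question's data; starts and run_id are illustrative names, and it assumes no State value is the empty string.

```r
df <- data.frame(ID = 1:10,
                 State = c("A", "A", "B", "B", "B", "A", "A", "A", "C", "C"),
                 stringsAsFactors = FALSE)
df <- df[order(df$ID), ]

# TRUE wherever the state differs from the previous row (the first row always starts a run).
starts <- df$State != c("", head(df$State, -1))

# Counting the starts so far gives each row the index of its run.
run_id <- cumsum(starts)

# One row per run: its state plus the smallest and largest ID in the run.
res <- data.frame(State = df$State[starts],
                  min = as.vector(tapply(df$ID, run_id, min)),
                  max = as.vector(tapply(df$ID, run_id, max)))
res
```

Grouping by run_id rather than by State alone is what keeps the two separate runs of "A" apart.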
Here is a method that uses the rle function in base R for the data set you provided.
# get the run length encoding
temp <- rle(df$State)
# construct the data.frame
newDF <- data.frame(State=temp$values,
min.ID=c(1, head(cumsum(temp$lengths) + 1, -1)),
max.ID=cumsum(temp$lengths))
which returns
newDF
State min.ID max.ID
1 A 1 2
2 B 3 5
3 A 6 8
4 C 9 10
Note that rle requires a character vector rather than a factor, so I use the as.is argument below.
As @cryo111 notes in the comments below, the data set might be unordered timestamps that do not correspond to the lengths calculated in rle. For this method to work, you would need to first convert the timestamps to a date-time format, with a function like as.POSIXct, use df <- df[order(df$ID),], and then employ a slight alteration of the method above:
# get the run length encoding
temp <- rle(df$State)
# construct the data.frame
newDF <- data.frame(State=temp$values,
min.ID=df$ID[c(1, head(cumsum(temp$lengths) + 1, -1))],
max.ID=df$ID[cumsum(temp$lengths)])
data
df <- read.table(header=TRUE, as.is=TRUE, text="ID State
1 A
2 A
3 B
4 B
5 B
6 A
7 A
8 A
9 C
10 C")
An idea with data.table:
require(data.table)
dt <- fread("ID State
1 A
2 A
3 B
4 B
5 B
6 A
7 A
8 A
9 C
10 C")
dt[,rle := rleid(State)]
dt2<-dt[,list(min=min(ID),max=max(ID)),by=c("rle","State")]
which gives:
rle State min max
1: 1 A 1 2
2: 2 B 3 5
3: 3 A 6 8
4: 4 C 9 10
The idea is to identify sequences with rleid and then get the min and max of ID by the tuple rle and State.
You can remove the rle column with
dt2[,rle:=NULL]
Chained:
dt2<-dt[,list(min=min(ID),max=max(ID)),by=c("rle","State")][,rle:=NULL]
You can shorten the above code even more by using rleid inside by directly:
dt2 <- dt[, .(min=min(ID),max=max(ID)), by=.(State, rleid(State))][, rleid:=NULL]
Here is another attempt using rle and aggregate from base R:
rl <- rle(df$State)
newdf <- data.frame(ID=df$ID, State=rep(1:length(rl$lengths),rl$lengths))
newdf <- aggregate(ID~State, newdf, FUN = function(x) c(minID=min(x), maxID=max(x)))
newdf$State <- rl$values
# State ID.minID ID.maxID
# 1 A 1 2
# 2 B 3 5
# 3 A 6 8
# 4 C 9 10
data
df <- structure(list(ID = 1:10, State = c("A", "A", "B", "B", "B",
"A", "A", "A", "C", "C")), .Names = c("ID", "State"), class = "data.frame",
row.names = c(NA,
-10L))
