Melting columns with mutually exclusive values and adding an origin column - r

I have a dataframe that looks like this (m):
a <- rep(c("one","two"),6)
b <- c(1,2,3,4,NA,NA,NA,NA,NA,NA,NA,NA)
c <- c(NA,NA,NA,NA,5,6,7,8,NA,NA,NA,NA)
d <- c(NA,NA,NA,NA,NA,NA,NA,NA,9,10,11,12)
(m <- cbind(a,b,c,d))
I would like to reduce it to a dataframe that looks like this (n):
e <- seq(1:12)
f <- rep(c("b","c","d"), each = 4)
(n <- cbind(a,e,f))
I tried melt, but without success:
melt(data = m, na.rm=TRUE)
Var1 Var2 value
1 1 a one
2 2 a two
3 3 a one
4 4 a two
5 5 a one
6 6 a two
7 7 a one
8 8 a two
9 9 a one
10 10 a two
11 11 a one
12 12 a two
13 1 b 1
14 2 b 2
15 3 b 3
16 4 b 4
29 5 c 5
30 6 c 6
31 7 c 7
32 8 c 8
45 9 d 9
46 10 d 10
47 11 d 11
48 12 d 12
What would be the necessary change, and is there a better function than melt?

An option with pivot_longer
library(dplyr)
library(tidyr)
m %>%
pivot_longer(cols = b:d, names_to = 'f', values_to ='e', values_drop_na = TRUE)
data
m <- data.frame(a, b, c, d)
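As for why the original melt call misbehaved: cbind(a, b, c, d) builds a character matrix rather than a data frame, so every cell, including those of a, is treated as a value. For comparison, here is a minimal base R sketch of the same reshaping without extra packages. It assumes each row has exactly one non-NA value among b, c and d; value_cols and idx are names introduced here for illustration only.

```r
a <- rep(c("one", "two"), 6)
b <- c(1, 2, 3, 4, rep(NA, 8))
c <- c(rep(NA, 4), 5, 6, 7, 8, rep(NA, 4))
d <- c(rep(NA, 8), 9, 10, 11, 12)
m <- data.frame(a, b, c, d, stringsAsFactors = FALSE)

value_cols <- c("b", "c", "d")

# Position (1, 2 or 3) of the non-NA entry in each row.
# Assumption: exactly one of b, c, d is non-NA per row.
idx <- max.col(!is.na(m[value_cols]), ties.method = "first")

n <- data.frame(
  a = m$a,
  # Pick the (row, idx) cell from the value columns
  e = as.matrix(m[value_cols])[cbind(seq_len(nrow(m)), idx)],
  # Name of the column the value came from
  f = value_cols[idx],
  stringsAsFactors = FALSE
)
```

If a row could hold more than one non-NA value, this sketch would keep only the first, so pivot_longer remains the safer general choice.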

Related

Filter a grouped variable from a dataset based on the range values of another dataset using dplyr

I want to take the values of a (large) data frame:
library(tidyverse)
df.grid = expand.grid(x = letters, y = 1:60)
head(df.grid)
x y
1 a 1
2 b 1
3 c 1
4 d 1
5 e 1
6 f 1
[...]
Which eventually reaches a 2, a 3, etc.
I also have a second data frame listing some of the x values, each with its own range (min, max) of y values that I want to keep:
sub.data = data.frame(x = c("a","c","d"), min = c(2,50,25), max = c(6,53,30))
sub.data
x min max
1 a 2 6
2 c 50 53
3 d 25 30
The output should look like something like this:
x y
1 a 2
2 a 3
3 a 4
4 a 5
5 a 6
6 c 50
7 c 51
8 c 52
9 c 53
10 d 25
11 d 26
12 d 27
13 d 28
14 d 29
15 d 30
I've tried this:
df.grid %>%
group_by(x) %>%
filter_if(y > sub.data$min)
But it doesn't work, because the min column has multiple values and the 'if' part complains.
I also found this post, but it doesn't seem to work for me, as there are no 'matching' variables to guide the filtering process.
I want to avoid using for loops since I want to apply this to a data frame that is 11GB in size.
We could use a non-equi join
library(data.table)
setDT(df.grid)[, y1 := y][sub.data, .(x, y), on = .(x, y1 >= min, y1 <= max)]
-output
x y
1: a 2
2: a 3
3: a 4
4: a 5
5: a 6
6: c 50
7: c 51
8: c 52
9: c 53
10: d 25
11: d 26
12: d 27
13: d 28
14: d 29
15: d 30
With dplyr version 1.1.0 or later, we could also use non-equi joins with join_by
library(dplyr)
inner_join(df.grid, sub.data, by = join_by(x, y >= min , y <= max)) %>%
select(x, y)
-output
x y
1 a 2
2 a 3
3 a 4
4 a 5
5 a 6
6 d 25
7 d 26
8 d 27
9 d 28
10 d 29
11 d 30
12 c 50
13 c 51
14 c 52
15 c 53
Or, as @Davis Vaughan mentioned, use between with a left_join
left_join(sub.data, df.grid, by = join_by(x, between(y$y, x$min, x$max))) %>%
select(names(df.grid))
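For comparison, the same range filter can be sketched in base R with merge() followed by a row filter. This is only meant to illustrate the logic: it materialises every (x, y) pair for the matching x values before filtering, so for data on the order of 11GB the non-equi joins above are likely the better fit. The names joined and res are introduced here.

```r
df.grid <- expand.grid(x = letters, y = 1:60, stringsAsFactors = FALSE)
sub.data <- data.frame(x = c("a", "c", "d"),
                       min = c(2, 50, 25),
                       max = c(6, 53, 30),
                       stringsAsFactors = FALSE)

# Equi-join on x: attaches min and max to every row of the matching groups
joined <- merge(df.grid, sub.data, by = "x")

# Keep only rows whose y lies inside that group's [min, max] range
res <- joined[joined$y >= joined$min & joined$y <= joined$max, c("x", "y")]

# merge() does not guarantee the order of y within x, so sort explicitly
res <- res[order(res$x, res$y), ]
rownames(res) <- NULL
```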

How to use matrix algebra in R to create new column?

I have a dataframe with multiple columns. I have another dataframe with two columns, factor and coefficient. I want to create a new column in the initial dataframe (mydata) that is the sum of each element in a row of mydata (a:e) multiplied by the corresponding coefficient (a:e) in df. The result for the first row of newcol should be 70 (1*1 + 2*2 + 3*3 + 4*4 + 8*5). Ideally, I would be able to replicate this 20+ times with different coefficients.
mydata <- data.frame(a = 1:10, b = 2:11, c = 3:12, d = 4:13, d_1 = 5:14, d_2 = 6:15, d_3 = 7:16, e = 8:17)
df <- data.frame(factor = c('a','b','c','d','e'), coefficient = 1:5)
mydata$newcol <- mydata[,c("a","b","c","d","e")] %*% df$coefficient
mydata$newcol2 <- mydata[,c("a","b","c","d_1","e")] %*% df$coefficient
Any advice would be helpful!
We can use sweep here: subset mydata to the columns named in the factor column of df, multiply each column by its coefficient, and then take rowSums to get the total for each row.
mydata$newcol <- rowSums(sweep(mydata[as.character(df$factor)], 2,df$coefficient, `*`))
mydata
# a b c d d_1 d_2 d_3 e newcol
#1 1 2 3 4 5 6 7 8 70
#2 2 3 4 5 6 7 8 9 85
#3 3 4 5 6 7 8 9 10 100
#4 4 5 6 7 8 9 10 11 115
#5 5 6 7 8 9 10 11 12 130
#6 6 7 8 9 10 11 12 13 145
#7 7 8 9 10 11 12 13 14 160
#8 8 9 10 11 12 13 14 15 175
#9 9 10 11 12 13 14 15 16 190
#10 10 11 12 13 14 15 16 17 205
Or we can transpose mydata, multiply by the coefficients (which are recycled down the rows of the transposed matrix), and take colSums.
colSums(t(mydata[as.character(df$factor)]) * df$coefficient)
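The %*% attempt in the question fails because a data frame subset is not a matrix. Converting it with as.matrix() lets the matrix product work directly, and stacking several coefficient vectors as columns covers the "20+ times" case in a single product. A minimal sketch follows; X, coef_mat, scores and the second coefficient set (5, 4, 3, 2, 1) are made up here purely for illustration.

```r
mydata <- data.frame(a = 1:10, b = 2:11, c = 3:12, d = 4:13,
                     d_1 = 5:14, d_2 = 6:15, d_3 = 7:16, e = 8:17)
df <- data.frame(factor = c("a", "b", "c", "d", "e"),
                 coefficient = 1:5,
                 stringsAsFactors = FALSE)

# Numeric matrix of the columns named in df$factor, in that order
X <- as.matrix(mydata[df$factor])

# Row-wise weighted sum; drop() turns the n x 1 result into a plain vector
mydata$newcol <- drop(X %*% df$coefficient)

# Several coefficient sets at once: one column per set.
# "set2" is a hypothetical second set, not taken from the question.
coef_mat <- cbind(set1 = df$coefficient, set2 = c(5, 4, 3, 2, 1))
scores <- X %*% coef_mat   # 10 x 2 matrix, one column per coefficient set
```

Each column of scores can then be attached to mydata under whatever name suits the coefficient set.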

Is there an R function for selecting common values of 2 dataframe?

I am trying to select the common values of two data frames. I have a big_df and a small_df.
What I am trying to obtain is a data frame containing only the rows whose "ID" values appear in both data frames, keeping only the columns of big_df and not those of small_df.
library(dplyr)
df3 <- merge(big_df, small_df, by =("ID"))
> df3
ID Age Name Colour
1 1 21 a blue
2 4 20 d green
3 8 87 h red
4 9 9 i black
big_df <- data.frame("ID" = 1:10, "Age" = c(21,15,1,20,34,45,67,87,9,77), "Name" = c("a","b","c","d","e","f","g","h","i","l"))
> big_df
ID Age Name
1 1 21 a
2 2 15 b
3 3 1 c
4 4 20 d
5 5 34 e
6 6 45 f
7 7 67 g
8 8 87 h
9 9 9 i
10 10 77 l
small_df <- data.frame("ID" = c(1,4,8,9), "Colour" = c("blue","green","red","black"))
> small_df
ID Colour
1 1 blue
2 4 green
3 8 red
4 9 black
I would like to have this instead, without the colour information:
> df3
ID Age Name
1 1 21 a
2 4 20 d
3 8 87 h
4 9 9 i
dplyr's semi_join() was intended for exactly this
big_df <- data.frame("ID" = 1:10, "Age" = c(21,15,1,20,34,45,67,87,9,77), "Name" = c("a","b","c","d","e","f","g","h","i","l"))
small_df <- data.frame("ID" = c(1,4,8,9), "Colour" = c("blue","green","red","black"))
library(dplyr)
semi_join(big_df,small_df,by='ID')
#
# ID Age Name
# 1 1 21 a
# 2 4 20 d
# 3 8 87 h
# 4 9 9 i
I have a feeling what you really need is:
#check which big IDs exist in small IDs and subset
big_df[big_df$ID %in% unique(small_df$ID), ]
# ID Age Name
#1 1 21 a
#4 4 20 d
#8 8 87 h
#9 9 9 i
So, I don't think you need a join in this case.
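One practical difference between filtering and joining shows up when the lookup table repeats an ID: a filter keeps each big_df row at most once, whereas merge() returns one row per match. The sketch below uses a hypothetical small_dup with ID 4 listed twice; it is not part of the question's data.

```r
big_df <- data.frame(ID = 1:10,
                     Age = c(21, 15, 1, 20, 34, 45, 67, 87, 9, 77),
                     Name = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "l"))

# Hypothetical lookup table in which ID 4 appears twice
small_dup <- data.frame(ID = c(1, 4, 4, 8, 9),
                        Colour = c("blue", "green", "teal", "red", "black"))

# Filtering: one row per matching big_df row, regardless of repeats
filtered <- big_df[big_df$ID %in% small_dup$ID, ]

# Joining and then dropping Colour: ID 4 now appears twice
merged <- merge(big_df, small_dup, by = "ID")[names(big_df)]

nrow(filtered)  # 4
nrow(merged)    # 5
```

Since %in% only tests membership, the unique() call in the answer above is optional.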

Merging and summarizing two dataframes

I have the following data:
a <- data.frame(ID=c("A","B","Z","H"), a=c(0,1,2,45), b=c(3,4,5,22), c=c(6,7,8,3))
> a
ID a b c
1 A 0 3 6
2 B 1 4 7
3 Z 2 5 8
4 H 45 22 3
b <- data.frame(ID=c("A","B","E","W","Z","H"), a=c(9,10,11,39,5,0), b=c(4,2,7,54,12,34), c=c(12,0,34,23,13,14))
> b
ID a b c
1: A 9 4 12
2: B 10 2 0
3: E 11 7 34
4: W 39 54 23
5: Z 5 12 13
6: H 0 34 14
I want to merge both dataframes, keeping only the rows of data.frame a, and sum the columns that share a name, so at the end I get:
> z
ID a b c
1 A 9 7 18
2 B 11 6 7
3 Z 7 17 21
4 H 45 56 17
So far I have tried the following:
> merge(a,b,by="ID",all.x=T,all.y=F)
ID a.x b.x c.x a.y b.y c.y
1 A 0 3 6 9 4 12
2 B 1 4 7 10 2 0
3 H 45 22 3 0 34 14
4 Z 2 5 8 5 12 13
> join(a,b,type="left",by="ID")
ID a b c a b c
1 A 0 3 6 9 4 12
2 B 1 4 7 10 2 0
3 Z 2 5 8 5 12 13
4 H 45 22 3 0 34 14
I cannot manage to sum the matching columns.
My dataframe is pretty big so if the solution can speed up things that would even be better.
If your data.frame is very big, then you may consider this option:
library(data.table)
## convert both data.frames to data.tables (by reference)
setDT(a)
setDT(b)
## merge the two data.tables
c <- merge(a,b,by='ID')
## extract names of all columns except the first one i.e. ID
col_names <- colnames(a)[-1]
## query building
col_1 <- paste0(col_names,'.x')
col_2 <- paste0(col_names,'.y')
cols <- paste(col_1,col_2,sep=',')
cols_2 <- paste0(col_names," = sum(",cols,")")
cols_3 <- paste(cols_2,collapse=',')
query <- paste0("z <- c[,.(",cols_3,"),by=ID]")
## query execution
eval(parse(text = query))
This works at least for your example:
a <- data.frame(ID=c("A","B","Z","H"), a=c(0,1,2,45), b=c(3,4,5,22), c=c(6,7,8,3))
b <- data.frame(ID=c("A","B","E","W","Z","H"), a=c(9,10,11,39,5,0), b=c(4,2,7,54,12,34), c=c(12,0,34,23,13,14))
match_a <- na.omit(match(b$ID, a$ID))
match_b <- na.omit(match(a$ID, b$ID))
df <- cbind(ID = a$ID[match_a], a[match_a, -1] + b[match_b, -1])
First, get the matching rows from a in b and vice versa, so we can be sure that we only have those rows that appear in both data frames (and we now know their row indices in both data frames). Then simply use vectorized addition on those matching rows, omitting ID, since a factor cannot be summed; ID is added back manually.
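You cannot add the two data frames directly because they have different numbers of rows. To make them the same size, keep only the rows of b whose ID is present in a, and then add them element-wise. Note that this assumes the retained rows of b are in the same order as the rows of a, as they are in the example.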
new <- b[b$ID %in% a$ID, ]
cbind(ID = a$ID, a[-1] + new[-1])
# ID a b c
#1 A 9 7 18
#2 B 11 6 7
#3 Z 7 17 21
#4 H 45 56 17
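Both match-based answers above give the right result for the example because the shared IDs happen to appear in the same order in a and b. A minimal sketch that does not depend on row order is to look up, for each row of a, the corresponding row of b with a single match() call. The names idx and z are introduced here, and the guard assumes every ID in a also exists in b.

```r
a <- data.frame(ID = c("A", "B", "Z", "H"),
                a = c(0, 1, 2, 45), b = c(3, 4, 5, 22), c = c(6, 7, 8, 3),
                stringsAsFactors = FALSE)
b <- data.frame(ID = c("A", "B", "E", "W", "Z", "H"),
                a = c(9, 10, 11, 39, 5, 0), b = c(4, 2, 7, 54, 12, 34),
                c = c(12, 0, 34, 23, 13, 14),
                stringsAsFactors = FALSE)

# For each row of a, the row number of the same ID in b (NA if absent)
idx <- match(a$ID, b$ID)

# Assumption: every ID of a is present in b; otherwise the sums would be NA
stopifnot(!anyNA(idx))

# Rows of b reordered to line up with a, then added column by column
z <- cbind(ID = a$ID, a[-1] + b[idx, -1])
```

Because b[idx, ] is aligned to a row by row, the result keeps the row order of a even if b is shuffled.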

How to sum over diagonals of data frame

Say that I have this data frame:
1 2 3 4
100 8 12 5 14
99 1 6 4 3
98 2 5 4 11
97 5 3 7 2
In this above data frame, the values indicate counts of how many observations take on (100, 1), (99, 1), etc.
In my context, the diagonals have the same meanings:
1 2 3 4
100 A B C D
99 B C D E
98 C D E F
97 D E F G
How would I sum across the diagonals (i.e., sum the counts of the like letters) in the first data frame?
This would produce:
group sum
A 8
B 13
C 13
D 28
E 10
F 18
G 2
For example, D is 5+5+4+14
You can use row() and col() to identify row/column relationships.
m <- read.table(text="
1 2 3 4
100 8 12 5 14
99 1 6 4 3
98 2 5 4 11
97 5 3 7 2")
vals <- sapply(2:8,
function(j) sum(m[row(m)+col(m)==j]))
or (as suggested in comments by @thelatemail)
vals <- sapply(split(as.matrix(m), row(m) + col(m)), sum)
data.frame(group=LETTERS[seq_along(vals)],sum=vals)
or (@Frank)
data.frame(vals = tapply(as.matrix(m),
(LETTERS[row(m) + col(m)-1]), sum))
as.matrix() is required to make split() work correctly ...
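To see why row() + col() identifies the diagonals, it helps to work on a plain matrix: every cell on the same anti-diagonal has the same row index plus column index, so that sum can serve directly as the grouping key. A minimal sketch, with the data entered as a matrix (the names key and out are introduced here):

```r
m <- matrix(c(8, 12, 5, 14,
              1,  6, 4,  3,
              2,  5, 4, 11,
              5,  3, 7,  2),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("100", "99", "98", "97"),
                            c("1", "2", "3", "4")))

# row + col is constant along each anti-diagonal; subtract 1 so that
# the top-left cell maps to 1 (A) and the bottom-right cell to 7 (G)
key <- row(m) + col(m) - 1

# Sum the counts within each diagonal group
vals <- tapply(m, LETTERS[key], sum)

out <- data.frame(group = names(vals), sum = as.vector(vals))
```

With m already a matrix, no as.matrix() conversion is needed before grouping.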
Another aggregate variation, avoiding the formula interface, which actually complicates matters in this instance:
aggregate(list(Sum=unlist(dat)), list(Group=LETTERS[c(row(dat) + col(dat))-1]), FUN=sum)
# Group Sum
#1 A 8
#2 B 13
#3 C 13
#4 D 28
#5 E 10
#6 F 18
#7 G 2
Another solution, using bgoldst's definitions of df1 and df2 (given in the stack()/aggregate() answer below)
sapply(unique(c(as.matrix(df2))),
function(x) sum(df1[df2 == x]))
Gives
#A B C D E F G
#8 13 13 28 10 18 2
(Not quite the format that you wanted, but maybe it's ok...)
Here's a solution using stack(), and aggregate(), although it requires the second data.frame contain character vectors, as opposed to factors (could be forced with lapply(df2,as.character)):
df1 <- data.frame(a=c(8,1,2,5), b=c(12,6,5,3), c=c(5,4,4,7), d=c(14,3,11,2))
df2 <- data.frame(a=c('A','B','C','D'), b=c('B','C','D','E'), c=c('C','D','E','F'), d=c('D','E','F','G'), stringsAsFactors=FALSE)
aggregate(sum~group, data.frame(sum=stack(df1)[,1], group=stack(df2)[,1]), sum)
## group sum
## 1 A 8
## 2 B 13
## 3 C 13
## 4 D 28
## 5 E 10
## 6 F 18
## 7 G 2

Resources