Merge data frames based on rownames in R - r

How can I merge the columns of two data frames, containing a distinct set of columns but some rows with the same names? The fields for rows that don't occur in both data frames should be filled with zeros:
> d
a b c d e f g h i j
1 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10
2 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
> e
k l m n o p q r s t
1 11 12 13 14 15 16 17 18 19 20
3 21 22 23 24 25 26 27 28 29 30
> de
a b c d e f g h i j k l m n o p q r s t
1 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10 11 12 13 14 15 16 17 18 19 20
2 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0 0 0 0 0 0 0 0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 21 22 23 24 25 26 27 28 29 30

See ?merge:
the name "row.names" or the number 0 specifies the row names.
Example:
R> de <- merge(d, e, by=0, all=TRUE) # merge by row names (by=0 or by="row.names")
R> de[is.na(de)] <- 0 # replace NA values
R> de
Row.names a b c d e f g h i j k l m n o p q r s
1 1 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10 11 12 13 14 15 16 17 18 19
2 2 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0 0 0 0 0 0 0
3 3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 21 22 23 24 25 26 27 28 29
t
1 20
2 0
3 30
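Note that merge() moves the row names into a Row.names column, so the result differs slightly from the layout asked for in the question. A minimal sketch of how the row names could be restored afterwards, assuming d and e are rebuilt as below from the printed example (the construction code is not from the original post):

```r
# Rebuild the two example data frames from the question.
d <- data.frame(matrix(c(1:10, (1:10) / 10), nrow = 2, byrow = TRUE,
                       dimnames = list(c("1", "2"), letters[1:10])))
e <- data.frame(matrix(11:30, nrow = 2, byrow = TRUE,
                       dimnames = list(c("1", "3"), letters[11:20])))

# Full outer join on row names; cells without a match come back as NA.
de <- merge(d, e, by = "row.names", all = TRUE)
de[is.na(de)] <- 0

# Move the key column back into the row names and drop it.
rownames(de) <- de$Row.names
de$Row.names <- NULL
de
```

After this, de has 20 columns (a to t) and the row names 1, 2 and 3, with zeros where a row was missing from one of the inputs.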

Related

r - How to subset a dataframe based on another dataframe

I have data dat like this:
s A chan
10 0.1 1
20 0.2 1
30 0.3 1
40 0.5 1
50 0.7 1
60 0.5 1
10 0.1 2
20 0.3 2
30 0.4 2
40 0.5 2
50 0.6 2
60 0.6 2
10 0.2 3
20 0.2 3
30 0.3 3
40 0.4 3
50 0.5 3
40 0.7 3
10 0.2 4
20 0.2 4
30 0.3 4
40 0.3 4
50 0.6 4
60 0.8 4
and I want to subset my data frame dat based on s (time) for each chan (channel) with a data frame df like this
s chan
10 1
20 2
30 3
40 4
If I use dat %>% filter(s %in% df$s) I get each value for every channel like this:
s A chan
10 0.1 1
20 0.2 1
30 0.3 1
40 0.5 1
10 0.1 2
20 0.3 2
30 0.4 2
40 0.5 2
10 0.2 3
20 0.2 3
30 0.3 3
40 0.4 3
10 0.2 4
20 0.2 4
30 0.3 4
40 0.3 4
but what I actually want is this:
s A chan
10 0.1 1
20 0.3 2
30 0.3 3
40 0.3 4
How can I achieve this result?
What you are looking for is semi_join from dplyr; it keeps only the rows of the left data frame that have a match in the right data frame:
semi_join(dat, df, by = c("s", "chan"))
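In base R, I think this should do it. Comparing the columns with == would recycle df against dat position by position, so build a combined key and test membership instead:
dat[paste(dat$s, dat$chan) %in% paste(df$s, df$chan), ]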
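Another base R way to express the same idea is an inner join on both key columns with merge(). This is only a sketch: the data frames are retyped from the question, and the reordering step at the end is an addition to get the column order shown in the desired output.

```r
# Example data retyped from the question (24 rows, 4 channels).
dat <- data.frame(
  s    = c(rep(seq(10, 60, by = 10), 2),
           10, 20, 30, 40, 50, 40,
           seq(10, 60, by = 10)),
  A    = c(0.1, 0.2, 0.3, 0.5, 0.7, 0.5,
           0.1, 0.3, 0.4, 0.5, 0.6, 0.6,
           0.2, 0.2, 0.3, 0.4, 0.5, 0.7,
           0.2, 0.2, 0.3, 0.3, 0.6, 0.8),
  chan = rep(1:4, each = 6)
)
df <- data.frame(s = c(10, 20, 30, 40), chan = 1:4)

# Inner join: keep only rows of dat whose (s, chan) pair occurs in df.
res <- merge(dat, df, by = c("s", "chan"))

# merge() puts the key columns first; restore the original column order.
res <- res[order(res$chan), c("s", "A", "chan")]
res
```

Unlike semi_join, merge() would duplicate rows of dat if df contained the same (s, chan) pair more than once, so this is only equivalent when the pairs in df are unique.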

Frequency count based on two columns in r

I have just one dataframe as below.
df=data.frame(o=c(rep("a",12),rep("b",3)), d=c(0,0,1,0,0.3,0.6,0,1,2,3,4,0,0,1,0))
> df
o d
1 a 0.0
2 a 0.0
3 a 1.0
4 a 0.0
5 a 0.3
6 a 0.6
7 a 0.0
8 a 1.0
9 a 2.0
10 a 3.0
11 a 4.0
12 a 0.0
13 b 0.0
14 b 1.0
15 b 0.0
I want to add a new column that counts frequency based on both columns 'o' and 'd'.
And the frequency should start again from 1 if the value of column 'd' is zero like below(hand-made).
> df_result
o d freq
1 a 0.0 1
2 a 0.0 2
3 a 1.0 2
4 a 0.0 3
5 a 0.3 3
6 a 0.6 3
7 a 0.0 5
8 a 1.0 5
9 a 2.0 5
10 a 3.0 5
11 a 4.0 5
12 a 0.0 1
13 b 0.0 2
14 b 1.0 2
15 b 0.0 1
In base R, use ave :
df$freq <- with(df, ave(d, cumsum(d == 0), FUN = length))
df
# o d freq
#1 a 0.0 1
#2 a 0.0 2
#3 a 1.0 2
#4 a 0.0 3
#5 a 0.3 3
#6 a 0.6 3
#7 a 0.0 5
#8 a 1.0 5
#9 a 2.0 5
#10 a 3.0 5
#11 a 4.0 5
#12 a 0.0 1
#13 b 0.0 2
#14 b 1.0 2
#15 b 0.0 1
With dplyr :
library(dplyr)
df %>% add_count(grp = cumsum(d == 0), name = "freq") # name = "freq" instead of the default column name n
Using data.table and @Ronak Shah's approach:
df=data.frame(o=c(rep("a",12),rep("b",3)), d=c(0,0,1,0,0.3,0.6,0,1,2,3,4,0,0,1,0))
library(data.table)
setDT(df)[, freq := .N, by = cumsum(d == 0)]
df
#> o d freq
#> 1: a 0.0 1
#> 2: a 0.0 2
#> 3: a 1.0 2
#> 4: a 0.0 3
#> 5: a 0.3 3
#> 6: a 0.6 3
#> 7: a 0.0 5
#> 8: a 1.0 5
#> 9: a 2.0 5
#> 10: a 3.0 5
#> 11: a 4.0 5
#> 12: a 0.0 1
#> 13: b 0.0 2
#> 14: b 1.0 2
#> 15: b 0.0 1
Created on 2021-02-26 by the reprex package (v1.0.0)
One more answer using rle()
df$freq <- with(rle(cumsum(df$d == 0)), rep(lengths, lengths))
df
o d freq
1 a 0.0 1
2 a 0.0 2
3 a 1.0 2
4 a 0.0 3
5 a 0.3 3
6 a 0.6 3
7 a 0.0 5
8 a 1.0 5
9 a 2.0 5
10 a 3.0 5
11 a 4.0 5
12 a 0.0 1
13 b 0.0 2
14 b 1.0 2
15 b 0.0 1
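All of the answers above group only on cumsum(d == 0), which starts a new run at every zero in d. In the example this happens to respect column o, because the first b row has d equal to zero. If that cannot be assumed, a possible variation (a sketch, not part of the original answers) is to pass o to ave() as an additional grouping variable; run_id is a helper name introduced here for illustration:

```r
df <- data.frame(o = c(rep("a", 12), rep("b", 3)),
                 d = c(0, 0, 1, 0, 0.3, 0.6, 0, 1, 2, 3, 4, 0, 0, 1, 0))

# Each zero in d starts a new run; cumsum() turns the zero flags into run ids.
run_id <- cumsum(df$d == 0)

# Count rows per (o, run) combination so that a run never spans two groups.
df$freq <- ave(df$d, df$o, run_id, FUN = length)
df
```

On the example data this gives the same freq column as the answers above.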

How to create a variable with multiple conditions in two different datasets in R

I've been working on developing a reference curve for risky placenta thickness by gestational week.
So I calculated the .03, .05, .10, .50, .90, .95, and .99 quantiles for each gestational week.
Consequently, I have two datasets, one for placenta thickness and one for the quantiles. I'd like to make a new variable that flags outliers in the former dataset using the lowest and highest quantiles for each week.
Here are examples of the data:
Data A for thickness:
ID week day thickness
1 15 0 1.3
2 15 0 1.5
3 16 2 2.3
4 16 1 3.5
5 16 1 2.5
6 17 0 3.6
7 17 0 3.4
8 17 3 2.4
Data B for quantiles:
week .03 .05 .10 .50 .90 .95 .99
15 1.6 1.7 1.8 2.4 2.6 2.7 2.8
16 1.7 1.8 2.0 2.5 3.1 3.3 3.4
17 1.7 1.8 2.1 2.6 3.4 3.5 3.7
So I tried code using ifelse(), like below:
C<-within(A, {outlier = ifelse(A$Thickness<B[2] & A$week == B[1], 1, 0)
outlier = ifelse(A$Thickness>B[8] & A$week == B[1], 1, 0)})
But an error occurred regarding the mismatched number of rows from each data.
Error in `[<-.data.frame`(`*tmp*`, nl, value = list(outlier = c(0, 0, : replacement element 1 is a matrix/data frame of 33 rows, need 55808
The expected form of data based on Data A will be like this:
Data C:
ID week day thickness outlier
1 15 0 1.3 1
2 15 0 1.5 1
3 16 2 2.3 0
4 16 1 3.5 1
5 16 1 2.5 0
6 17 0 3.6 0
7 17 0 3.4 0
8 17 3 2.4 0
The base R solution I can think of (match() looks up the row of B for each week, so it does not depend on the row order of B):
transform(A,outlier=as.numeric((C<-thickness-B[match(week,B$week),c(2,8)])[,1]<0|C[,2]>0))
ID week day thickness outlier
1 1 15 0 1.3 1
2 2 15 0 1.5 1
3 3 16 2 2.3 0
4 4 16 1 3.5 1
5 5 16 1 2.5 0
6 6 17 0 3.6 0
7 7 17 0 3.4 0
8 8 17 3 2.4 0
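You can also write it in two steps:
C=A$thickness-B[match(A$week,B$week),c(2,8)] # subtract only columns 2 (.03) and 8 (.99) of the matching week from thickness
transform(A,outlier=as.numeric(C[,1]<0|C[,2]>0)) # outlier if below the .03 quantile (first column negative) or above the .99 quantile (second column positive)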
ID week day thickness outlier
1 1 15 0 1.3 1
2 2 15 0 1.5 1
3 3 16 2 2.3 0
4 4 16 1 3.5 1
5 5 16 1 2.5 0
6 6 17 0 3.6 0
7 7 17 0 3.4 0
8 8 17 3 2.4 0
A solution using dplyr. We can perform a join and then determine the outlier condition.
library(dplyr)
B2 <- B %>% select(week, X.03, X.99)
A2 <- A %>%
left_join(B2, by = "week") %>%
mutate(outlier = as.integer(thickness < X.03 | thickness > X.99)) %>%
select(-starts_with("X"))
A2
# ID week day thickness outlier
# 1 1 15 0 1.3 1
# 2 2 15 0 1.5 1
# 3 3 16 2 2.3 0
# 4 4 16 1 3.5 1
# 5 5 16 1 2.5 0
# 6 6 17 0 3.6 0
# 7 7 17 0 3.4 0
# 8 8 17 3 2.4 0
Here is the base R version of the same operation.
B2 <- B[, c("week", "X.03", "X.99")]
A2 <- merge(A, B2, by = "week", all.x = TRUE)
A2$outlier <- as.integer(A2$thickness < A2$X.03 | A2$thickness > A2$X.99)
A2[, c("X.03", "X.99")] <- NULL
A2
# week ID day thickness outlier
# 1 15 1 0 1.3 1
# 2 15 2 0 1.5 1
# 3 16 3 2 2.3 0
# 4 16 4 1 3.5 1
# 5 16 5 1 2.5 0
# 6 17 6 0 3.6 0
# 7 17 7 0 3.4 0
# 8 17 8 3 2.4 0
Here is the data.table version of the same operation.
library(data.table)
setDT(A)
setDT(B)
B2 <- B[, .(week, X.03, X.99)]
setkey(A, week)
setkey(B2, week)
# between() is TRUE inside the quantile band, so negate it to flag outliers
A2 <- merge(A, B2)[, outlier := as.integer(!between(thickness, X.03, X.99))
][, c("X.03", "X.99") := NULL]
A2[]
# week ID day thickness outlier
# 1: 15 1 0 1.3 1
# 2: 15 2 0 1.5 1
# 3: 16 3 2 2.3 0
# 4: 16 4 1 3.5 1
# 5: 16 5 1 2.5 0
# 6: 17 6 0 3.6 0
# 7: 17 7 0 3.4 0
# 8: 17 8 3 2.4 0
DATA
A <- read.table(text = "ID week day thickness
1 15 0 1.3
2 15 0 1.5
3 16 2 2.3
4 16 1 3.5
5 16 1 2.5
6 17 0 3.6
7 17 0 3.4
8 17 3 2.4
",
header = TRUE)
B <- read.table(text = "week .03 .05 .10 .50 .90 .95 .99
15 1.6 1.7 1.8 2.4 2.6 2.7 2.8
16 1.7 1.8 2.0 2.5 3.1 3.3 3.4
17 1.7 1.8 2.1 2.6 3.4 3.5 3.7",
header = TRUE)
Here is an option using a data.table update join (the quantile columns are named X.03 and X.99 when B is read with read.table as above):
library(data.table)
setDT(A)
setDT(B)
A[B, outlier := as.integer(thickness < i.X.03 | thickness > i.X.99), on = .(week)]
A
# ID week day thickness outlier
#1: 1 15 0 1.3 1
#2: 2 15 0 1.5 1
#3: 3 16 2 2.3 0
#4: 4 16 1 3.5 1
#5: 5 16 1 2.5 0
#6: 6 17 0 3.6 0
#7: 7 17 0 3.4 0
#8: 8 17 3 2.4 0
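For comparison, the same lookup can be sketched in base R without a join, using match() to find the row of B that belongs to each week in A. The data is the DATA block above; idx is a helper name introduced here, and X.03 and X.99 are the column names that read.table generates from the .03 and .99 headers.

```r
A <- read.table(text = "ID week day thickness
1 15 0 1.3
2 15 0 1.5
3 16 2 2.3
4 16 1 3.5
5 16 1 2.5
6 17 0 3.6
7 17 0 3.4
8 17 3 2.4", header = TRUE)

B <- read.table(text = "week .03 .05 .10 .50 .90 .95 .99
15 1.6 1.7 1.8 2.4 2.6 2.7 2.8
16 1.7 1.8 2.0 2.5 3.1 3.3 3.4
17 1.7 1.8 2.1 2.6 3.4 3.5 3.7", header = TRUE)

# For each row of A, the position of its week in B.
idx <- match(A$week, B$week)

# Flag values below the .03 quantile or above the .99 quantile of that week.
A$outlier <- as.integer(A$thickness < B$X.03[idx] | A$thickness > B$X.99[idx])
A
```

This keeps the row order of A and does not rely on B being sorted by week. A week that is missing from B would give NA in outlier.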

Using grep to get the rows of a dataframe, instead of the row number

I am trying to make a sub dataframe based on the already existing dataframe. My sub dataframe is being filled with the number of the row instead of the row itself.
rates = read.csv("file.txt")
genes = unique(gsub('_[0-9]+', '', rates[,1]))
for (k in unique(gsub('_[0-9]+', '', rates[,1])) ){
sub = print(grep(k, rates[,1]), value=T)
sub
}
file.txt
clothing,freq,temp
coat_1,0.3,10
coat_1,0.9,0
coat_1,0.1,20
coat_2,0.5,20
coat_2,0.3,15
coat_2,0.1,5
scarf,0.4,30
scarf,0.2,20
scarf,0.1,10
This is what is currently output
[1] 1 2 3 4 5 6
[1] 7 8 9
I would like something like this instead
clothing freq temp
1 coat_1 0.3 10
2 coat_1 0.9 0
3 coat_1 0.1 20
4 coat_2 0.5 20
5 coat_2 0.3 15
6 coat_2 0.1 5
clothing freq temp
1 scarf 0.4 30
2 scarf 0.2 20
3 scarf 0.1 10
rates <- read.csv("file.txt", stringsAsFactors = FALSE)
rates
# clothing freq temp
# 1 coat_1 0.3 10
# 2 coat_1 0.9 0
# 3 coat_1 0.1 20
# 4 coat_2 0.5 20
# 5 coat_2 0.3 15
# 6 coat_2 0.1 5
# 7 scarf 0.4 30
# 8 scarf 0.2 20
# 9 scarf 0.1 10
rates[rates$clothing != "scarf",]
# clothing freq temp
# 1 coat_1 0.3 10
# 2 coat_1 0.9 0
# 3 coat_1 0.1 20
# 4 coat_2 0.5 20
# 5 coat_2 0.3 15
# 6 coat_2 0.1 5
rates[rates$clothing == "scarf",]
# clothing freq temp
#7 scarf 0.4 30
#8 scarf 0.2 20
#9 scarf 0.1 10
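The reason the loop in the question prints numbers is that grep() returns the positions of the matching elements; to get the rows themselves, those positions have to be used as a row index, as in rates[grep(k, rates[,1]), ]. A possible way to get one sub data frame per name without a loop is split(). This is a sketch in which the file contents are inlined through the text argument, and gene and subs are names chosen here for illustration:

```r
rates <- read.csv(text = "clothing,freq,temp
coat_1,0.3,10
coat_1,0.9,0
coat_1,0.1,20
coat_2,0.5,20
coat_2,0.3,15
coat_2,0.1,5
scarf,0.4,30
scarf,0.2,20
scarf,0.1,10", stringsAsFactors = FALSE)

# Strip the numeric suffix so coat_1 and coat_2 fall into the same group.
gene <- sub("_[0-9]+$", "", rates$clothing)

# One data frame per distinct name, returned as a named list.
subs <- split(rates, gene)
subs$coat
subs$scarf
```

The sub data frames keep the original row names (1 to 6 and 7 to 9); rownames(x) <- NULL would renumber them from 1 if that is wanted.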

Concatenating difference rows with multi-column keys

Suppose I have a data.frame where if I take multiple columns together (say a, b, and c), then I have an identifier that is unique to two different rows (that differ on column name, and a bunch of value columns x, y, and z).
I'd like to take the difference on the value columns, preserve the key columns, and give the name column a new value like diff.
So for example, suppose I have the following data:
a b c x y z name
1 1 M J 0.0 1.0 2.0 alpha
2 1 M K 0.1 0.9 2.0 alpha
3 1 O J 0.2 0.8 2.0 alpha
4 1 O K 0.3 0.7 2.0 alpha
5 2 M J 0.4 0.6 2.0 alpha
6 2 M K 0.5 0.5 2.0 alpha
7 2 O J 0.6 0.4 2.0 alpha
8 2 O K 0.7 0.3 2.0 alpha
9 1 M J 0.0 2.0 1.0 beta
10 1 M K 0.1 1.9 3.0 beta
11 1 O J 0.2 1.8 1.0 beta
12 1 O K 0.3 1.7 3.0 beta
13 2 M J 0.4 1.6 1.0 beta
14 2 M K 0.5 1.5 3.0 beta
15 2 O J 0.6 1.4 1.0 beta
16 2 O K 0.7 1.3 3.0 beta
Then I want the new data frame to be:
a b c x y z name
1 1 M J 0.0 1.0 2.0 alpha
2 1 M K 0.1 0.9 2.0 alpha
3 1 O J 0.2 0.8 2.0 alpha
4 1 O K 0.3 0.7 2.0 alpha
5 2 M J 0.4 0.6 2.0 alpha
6 2 M K 0.5 0.5 2.0 alpha
7 2 O J 0.6 0.4 2.0 alpha
8 2 O K 0.7 0.3 2.0 alpha
9 1 M J 0.0 2.0 1.0 beta
10 1 M K 0.1 1.9 3.0 beta
11 1 O J 0.2 1.8 1.0 beta
12 1 O K 0.3 1.7 3.0 beta
13 2 M J 0.4 1.6 1.0 beta
14 2 M K 0.5 1.5 3.0 beta
15 2 O J 0.6 1.4 1.0 beta
16 2 O K 0.7 1.3 3.0 beta
17 1 M J 0.0 -1.0 1.0 diff
18 1 M K 0.0 -1.0 -1.0 diff
19 1 O J 0.0 -1.0 1.0 diff
20 1 O K 0.0 -1.0 -1.0 diff
21 2 M J 0.0 -1.0 1.0 diff
22 2 M K 0.0 -1.0 -1.0 diff
23 2 O J 0.0 -1.0 1.0 diff
24 2 O K 0.0 -1.0 -1.0 diff
What's the easiest way to accomplish this?
You could make each column individually:
colx = ave(df$x, paste(df$a, df$b, df$c), FUN=function(x) x[1]-x[2])
coly = ave(df$y, paste(df$a, df$b, df$c), FUN=function(x) x[1]-x[2])
colz = ave(df$z, paste(df$a, df$b, df$c), FUN=function(x) x[1]-x[2])
And then put them together:
df2 = subset(df, name=="alpha")
df2$name = "diff"
df2$x = colx[1:(length(colx)/2)]
df2$y = coly[1:(length(coly)/2)]
df2$z = colz[1:(length(colz)/2)]
Now append to the original:
df = rbind(df, df2)
That gives:
a b c x y z name
1 1 m j 0.0 1.0 2 a
2 1 m k 0.1 0.9 2 a
3 1 o j 0.2 0.8 2 a
4 1 o k 0.3 0.7 2 a
5 2 m j 0.4 0.6 2 a
6 2 m k 0.5 0.5 2 a
7 2 o j 0.6 0.4 2 a
8 2 o k 0.7 0.3 2 a
9 1 m j 0.0 2.0 1 b
10 1 m k 0.1 1.9 3 b
11 1 o j 0.2 1.8 1 b
12 1 o k 0.3 1.7 3 b
13 2 m j 0.4 1.6 1 b
14 2 m k 0.5 1.5 3 b
15 2 o j 0.6 1.4 1 b
16 2 o k 0.7 1.3 3 b
17 1 m j 0.0 -1.0 1 diff
18 1 m k 0.0 -1.0 -1 diff
19 1 o j 0.0 -1.0 1 diff
20 1 o k 0.0 -1.0 -1 diff
21 2 m j 0.0 -1.0 1 diff
22 2 m k 0.0 -1.0 -1 diff
23 2 o j 0.0 -1.0 1 diff
24 2 o k 0.0 -1.0 -1 diff
If your data frame is always sorted and balanced, then this should work:
half<-1:(nrow(df)/2)
rbind(
df,
cbind(
df[half, 1:3],
df[half, 4:6] - df[half+half[length(half)], 4:6],
name="diff"
)
)
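Both answers above rely on the alpha and beta rows appearing in the same order. If that cannot be guaranteed, one possible alternative is to pair the rows explicitly by joining on the key columns. This is a sketch: the data frame is rebuilt from the question, and keys, vals, m and diff_rows are helper names introduced here.

```r
df <- data.frame(
  a = rep(rep(1:2, each = 4), times = 2),
  b = rep(rep(c("M", "O"), each = 2), times = 4),
  c = rep(c("J", "K"), times = 8),
  x = rep(seq(0, 0.7, by = 0.1), times = 2),
  y = c(seq(1, 0.3, by = -0.1), seq(2, 1.3, by = -0.1)),
  z = c(rep(2, 8), rep(c(1, 3), times = 4)),
  name = rep(c("alpha", "beta"), each = 8),
  stringsAsFactors = FALSE
)

keys <- c("a", "b", "c")
vals <- c("x", "y", "z")

# Pair each alpha row with the beta row that has the same key.
m <- merge(df[df$name == "alpha", c(keys, vals)],
           df[df$name == "beta",  c(keys, vals)],
           by = keys, suffixes = c(".alpha", ".beta"))

# alpha minus beta for every value column, keeping the key columns.
diff_rows <- m[keys]
diff_rows[vals] <- m[paste0(vals, ".alpha")] - m[paste0(vals, ".beta")]
diff_rows$name <- "diff"

out <- rbind(df, diff_rows)
out
```

Because merge() matches on the keys, the input rows can be in any order; the diff rows come out sorted by a, b and c.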
