Binding dataframes by multiple conditions in R

I have a data frame which looks like this:
> data
Class Number
1 A 1
2 A 2
3 A 3
4 B 1
5 B 2
6 B 3
7 C 1
8 C 2
9 C 3
I have a reference data frame which is:
> reference
Class Number Value
1 A 1 0.5
2 B 3 0.3
I want to join these data frames to create a single data frame:
> resultdata
Class Number Value
1 A 1 0.5
2 A 2 0.0
3 A 3 0.0
4 B 1 0.0
5 B 2 0.0
6 B 3 0.3
7 C 1 0.0
8 C 2 0.0
9 C 3 0.0
How can I achieve this? Any help will be greatly appreciated.

You can do
library(data.table)
setkey(setDT(reference), Class, Number)[data]
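Note that this first option leaves NA in Value for the rows with no match in reference; if you want zeros as in the expected output, you can fill them afterwards, e.g.
res <- setkey(setDT(reference), Class, Number)[data]
res[is.na(Value), Value := 0]
res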
Or
setkey(setDT(data), Class, Number)[reference, Value := i.Value][is.na(Value), Value := 0]
# Class Number Value
#1: A 1 0.5
#2: A 2 0.0
#3: A 3 0.0
#4: B 1 0.0
#5: B 2 0.0
#6: B 3 0.3
#7: C 1 0.0
#8: C 2 0.0
#9: C 3 0.0

The basic starting point for this would be merge.
merge(data, reference, all = TRUE)
# Class Number Value
# 1 A 1 0.5
# 2 A 2 NA
# 3 A 3 NA
# 4 B 1 NA
# 5 B 2 NA
# 6 B 3 0.3
# 7 C 1 NA
# 8 C 2 NA
# 9 C 3 NA
There are many questions which show how to replace NA with 0.
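For example, one common base R idiom is:
resultdata <- merge(data, reference, all.x = TRUE)
resultdata$Value[is.na(resultdata$Value)] <- 0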

You can do:
library(dplyr)
left_join(data, reference) %>% (function(x) { x[is.na(x)] <- 0; x })
Or (as per @akrun's suggestion):
left_join(data, reference) %>% mutate(Value = replace(Value, is.na(Value), 0))
Which gives:
# Class Number Value
#1 A 1 0.5
#2 A 2 0.0
#3 A 3 0.0
#4 B 1 0.0
#5 B 2 0.0
#6 B 3 0.3
#7 C 1 0.0
#8 C 2 0.0
#9 C 3 0.0
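If you also load tidyr, replace_na is a dedicated helper that only touches the column you name:
library(tidyr)
left_join(data, reference) %>% mutate(Value = replace_na(Value, 0))
Since Value is numeric, mutate(Value = coalesce(Value, 0)) works as well.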

Related

Frequency count based on two columns in R

I have just one dataframe as below.
df=data.frame(o=c(rep("a",12),rep("b",3)), d=c(0,0,1,0,0.3,0.6,0,1,2,3,4,0,0,1,0))
> df
o d
1 a 0.0
2 a 0.0
3 a 1.0
4 a 0.0
5 a 0.3
6 a 0.6
7 a 0.0
8 a 1.0
9 a 2.0
10 a 3.0
11 a 4.0
12 a 0.0
13 b 0.0
14 b 1.0
15 b 0.0
I want to add a new column that counts frequency based on both columns 'o' and 'd'. The frequency should start again from 1 whenever the value of column 'd' is zero, like below (hand-made):
> df_result
o d freq
1 a 0.0 1
2 a 0.0 2
3 a 1.0 2
4 a 0.0 3
5 a 0.3 3
6 a 0.6 3
7 a 0.0 5
8 a 1.0 5
9 a 2.0 5
10 a 3.0 5
11 a 4.0 5
12 a 0.0 1
13 b 0.0 2
14 b 1.0 2
15 b 0.0 1
In base R, use ave, grouping the rows with cumsum(d == 0) (each zero in d opens a new group) and taking each group's length:
df$freq <- with(df, ave(d, cumsum(d == 0), FUN = length))
df
# o d freq
#1 a 0.0 1
#2 a 0.0 2
#3 a 1.0 2
#4 a 0.0 3
#5 a 0.3 3
#6 a 0.6 3
#7 a 0.0 5
#8 a 1.0 5
#9 a 2.0 5
#10 a 3.0 5
#11 a 4.0 5
#12 a 0.0 1
#13 b 0.0 2
#14 b 1.0 2
#15 b 0.0 1
With dplyr:
library(dplyr)
df %>% add_count(grp = cumsum(d == 0))
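Note that add_count keeps the helper column grp and names the count n by default; with a recent dplyr you can match the desired output exactly, e.g.:
df %>% add_count(grp = cumsum(d == 0), name = "freq") %>% select(-grp)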
Using data.table and @Ronak Shah's approach:
df=data.frame(o=c(rep("a",12),rep("b",3)), d=c(0,0,1,0,0.3,0.6,0,1,2,3,4,0,0,1,0))
library(data.table)
setDT(df)[, freq := .N, by = cumsum(d == 0)]
df
#> o d freq
#> 1: a 0.0 1
#> 2: a 0.0 2
#> 3: a 1.0 2
#> 4: a 0.0 3
#> 5: a 0.3 3
#> 6: a 0.6 3
#> 7: a 0.0 5
#> 8: a 1.0 5
#> 9: a 2.0 5
#> 10: a 3.0 5
#> 11: a 4.0 5
#> 12: a 0.0 1
#> 13: b 0.0 2
#> 14: b 1.0 2
#> 15: b 0.0 1
Created on 2021-02-26 by the reprex package (v1.0.0)
One more answer, using rle(): cumsum(df$d == 0) increases at every zero, so its runs of equal values are exactly the groups, and rep(lengths, lengths) expands each run's length back to one value per row.
df$freq <- with(rle(cumsum(df$d == 0)), rep(lengths, lengths))
df
o d freq
1 a 0.0 1
2 a 0.0 2
3 a 1.0 2
4 a 0.0 3
5 a 0.3 3
6 a 0.6 3
7 a 0.0 5
8 a 1.0 5
9 a 2.0 5
10 a 3.0 5
11 a 4.0 5
12 a 0.0 1
13 b 0.0 2
14 b 1.0 2
15 b 0.0 1

Filter a group of a data.frame based on multiple conditions

I am looking for an elegant way to filter the values of a specific group of a big data.frame based on multiple conditions.
My data frame looks like this.
data = data.frame(group = c("A","B","C","A","B","C","A","B","C"),
                  time = c(rep(1,3), rep(2,3), rep(3,3)),
                  value = c(0.2, 1, 1, 0.1, 10, 20, 10, 20, 30))
group time value
1 A 1 0.2
2 B 1 1.0
3 C 1 1.0
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
For time point 1 only, I would like to keep just the values that are smaller than 1 but bigger than 0.1.
I want my data.frame to look like this.
group time value
1 A 1 0.2
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
Any help is highly appreciated.
With dplyr you can do
library(dplyr)
data %>% filter(!(time == 1 & (value <= 0.1 | value >= 1)))
# group time value
# 1 A 1 0.2
# 2 A 2 0.1
# 3 B 2 10.0
# 4 C 2 20.0
# 5 A 3 10.0
# 6 B 3 20.0
# 7 C 3 30.0
Or, if you have too much free time and decide to avoid dplyr:
ind <- with(data, time == 1 & value > 0.1 & value < 1)
# equivalently: ind <- data$time == 1 & data$value > 0.1 & data$value < 1
data <- data[data$time != 1 | ind, ]
group time value
1 A 1 0.2
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
Another simple option would be to use subset twice and then append the results row-wise.
rbind(
subset(data, time == 1 & value > 0.1 & value < 1),
subset(data, time != 1)
)
# group time value
# 1 A 1 0.2
# 4 A 2 0.1
# 5 B 2 10.0
# 6 C 2 20.0
# 7 A 3 10.0
# 8 B 3 20.0
# 9 C 3 30.0
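You can also collapse this into a single subset call by combining the two conditions with |:
subset(data, time != 1 | (value > 0.1 & value < 1))
This keeps every row outside time point 1 and, within time point 1, only the values strictly between 0.1 and 1.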

Use conditions from multiple variables to replace a variable in R

I did some searches but could not find the best keywords to phrase my question, so I will attempt to ask it here.
I am dealing with a data frame in R that has two variables representing the identity of the data points. In the following example, A and 1 represent the same individual, B and 2 are the same, and so are C and 3, but they are mixed together in the original data.
ID1 ID2 Value
A 1 0.5
B 2 0.8
C C 0.7
A A 0.6
B 2 0.3
3 C 0.4
2 2 0.3
1 A 0.4
3 3 0.6
What I want to achieve is to unify the identity by using only one of the identifiers so it can be either:
ID1 ID2 Value ID
A 1 0.5 A
B 2 0.8 B
C C 0.7 C
A A 0.6 A
B 2 0.3 B
3 C 0.4 C
2 2 0.3 B
1 A 0.4 A
3 3 0.6 C
or:
ID1 ID2 Value ID
A 1 0.5 1
B 2 0.8 2
C C 0.7 3
A A 0.6 1
B 2 0.3 2
3 C 0.4 3
2 2 0.3 2
1 A 0.4 1
3 3 0.6 3
I can probably achieve this with the ifelse function, but that would mean writing two ifelse statements for each condition, which does not seem efficient, so I was wondering if there is a better way to do it. Here is the example data set:
df=data.frame(ID1=c("A","B","C","A","B","3","2","1","3"),
ID2=c("1","2","C","A","2","C","2","A","3"),
Value=c(0.5,0.8,0.7,0.6,0.3,0.4,0.3,0.4,0.6))
Thank you so much for the help!
Edit:
To clarify, the two identifiers I have in my real data are longer string of texts instead of just ABC and 123. Sorry I did not make it clear.
An option is to detect the elements that are only digits, convert them to integer, and then get the corresponding letter from LETTERS in case_when:
library(dplyr)
library(stringr)
df %>%
  mutate(ID = case_when(str_detect(ID1, '\\d+') ~ LETTERS[as.integer(ID1)],
                        TRUE ~ ID1))
# ID1 ID2 Value ID
#1 A 1 0.5 A
#2 B 2 0.8 B
#3 C C 0.7 C
#4 A A 0.6 A
#5 B 2 0.3 B
#6 3 C 0.4 C
#7 2 2 0.3 B
#8 1 A 0.4 A
#9 3 3 0.6 C
Or more compactly (as.integer() gives NA with a warning for the letter codes, and coalesce() then falls back to ID1):
df %>%
mutate(ID = coalesce(LETTERS[as.integer(ID1)], ID1))
If we have different sets of values, then create a key/value dataset and do a join
keyval <- data.frame(ID1 = c('1', '2', '3'), ID = c('A', 'B', 'C'))
left_join(df, keyval) %>% mutate(ID = coalesce(ID, ID1))
A base R option using replace
within(
df,
ID <- replace(
ID1,
!ID1 %in% LETTERS,
LETTERS[as.numeric(ID1[!ID1 %in% LETTERS])]
)
)
or ifelse
within(
df,
ID <- suppressWarnings(ifelse(ID1 %in% LETTERS,
ID1,
LETTERS[as.integer(ID1)]
))
)
which gives
ID1 ID2 Value ID
1 A 1 0.5 A
2 B 2 0.8 B
3 C C 0.7 C
4 A A 0.6 A
5 B 2 0.3 B
6 3 C 0.4 C
7 2 2 0.3 B
8 1 A 0.4 A
9 3 3 0.6 C
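If your real identifiers are longer strings (as mentioned in the edit), the LETTERS trick no longer applies, but the same idea works with a named lookup vector; a minimal base R sketch, where lookup is a hypothetical mapping you would build from your own codes:
lookup <- c("1" = "A", "2" = "B", "3" = "C")  # hypothetical: map each code to its canonical identifier
df$ID <- ifelse(df$ID1 %in% names(lookup),
                unname(lookup[as.character(df$ID1)]),
                as.character(df$ID1))
This is the same key/value idea as the left_join answer above, just without the extra data frame.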

Combine incremental sequence with a fixed columns in a dataframe [duplicate]

This question already has answers here:
Alternative to expand.grid for data.frames
(6 answers)
Closed 2 years ago.
I have a dataframe:
data.frame(x=c(1,2,3), y=c(4,5,6))
x y
1 1 4
2 2 5
3 3 6
For each row, I want to repeat x and y for each element within a given sequence, where the sequence is:
E=seq(0,0.2,by=0.1)
So when combined this would give:
x y E
1 1 4 0
2 1 4 0.1
3 1 4 0.2
4 2 5 0
5 2 5 0.1
6 2 5 0.2
7 3 6 0
8 3 6 0.1
9 3 6 0.2
I cannot seem to achieve this with expand.grid, which seems to give me all possible combinations. Am I after a Cartesian product?
library(data.table)
dt <- data.table(x=c(1,2,3), y=c(4,5,6))
dt[,.(E=seq(0,0.2,by=0.1)),by=.(x,y)]
#> x y E
#> 1: 1 4 0.0
#> 2: 1 4 0.1
#> 3: 1 4 0.2
#> 4: 2 5 0.0
#> 5: 2 5 0.1
#> 6: 2 5 0.2
#> 7: 3 6 0.0
#> 8: 3 6 0.1
#> 9: 3 6 0.2
Created on 2020-05-01 by the reprex package (v0.3.0)
Yes, you are looking for a Cartesian product, but base expand.grid cannot handle data frames: it crosses individual vectors, so x and y would be crossed with each other as well.
You can use tidyr functions here (assuming your data frame is stored as df):
tidyr::expand_grid(df, E)
# A tibble: 9 x 3
# x y E
# <dbl> <dbl> <dbl>
#1 1 4 0
#2 1 4 0.1
#3 1 4 0.2
#4 2 5 0
#5 2 5 0.1
#6 2 5 0.2
#7 3 6 0
#8 3 6 0.1
#9 3 6 0.2
Or with crossing
tidyr::crossing(df, E)
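In base R you can also build the result directly by repeating the rows of the data frame and recycling E; a short sketch, assuming the data frame is stored as df:
df <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
E <- seq(0, 0.2, by = 0.1)
res <- cbind(df[rep(seq_len(nrow(df)), each = length(E)), ], E = E)  # each row repeated once per element of E
rownames(res) <- NULL
res
cbind recycles E across the repeated rows, which reproduces the nine rows in the desired order.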

Summation of products of a variable [duplicate]

This question already has answers here:
Group by and then add a column for ratio based on condition
(3 answers)
Closed 5 years ago.
I have a dataset like this one:
test <-
data.frame(
variable = c("A","A","B","B","C","D","E","E","E","F","F","G"),
confidence = c(1,0.6,0.1,0.15,1,0.3,0.4,0.5,0.2,1,0.4,0.9),
freq = c(2,2,2,2,1,1,3,3,3,2,2,1),
weight = c(2,2,0,0,1,3,5,5,5,0,0,4)
)
> test
variable confidence freq weight
1 A 1.00 2 2
2 A 0.60 2 2
3 B 0.10 2 0
4 B 0.15 2 0
5 C 1.00 1 1
6 D 0.30 1 3
7 E 0.40 3 5
8 E 0.50 3 5
9 E 0.20 3 5
10 F 1.00 2 0
11 F 0.40 2 0
12 G 0.90 1 4
I want to calculate, for each observation, the sum of its weight multiplied by the confidences of all observations of the same variable, like this:
SWC[i] = sum_j( w[i] * c[j] )
where j runs over all observations of the same variable (A, B, C, ...) as observation i.
Developing the formula above:
w[1]c[1]+w[1]c[2]=2*1+2*0.6=3.2
w[2]c[1]+w[2]c[2]
w[3]c[3]+w[3]c[4]
w[4]c[3]+w[4]c[4]
w[5]c[5]
w[6]c[6]
w[7]c[7]+w[7]c[8]+w[7]c[9]
w[8]c[7]+w[8]c[8]+w[8]c[9]
w[9]c[7]+w[9]c[8]+w[9]c[9]
…
The result should look like this:
> test
variable confidence freq weight SWC
1 A 1.00 2 2 3.2
2 A 0.60 2 2 3.2
3 B 0.10 2 0 0.0
4 B 0.15 2 0 0.0
5 C 1.00 1 1 1.0
6 D 0.30 1 3 0.9
7 E 0.40 3 5 5.5
8 E 0.50 3 5 5.5
9 E 0.20 3 5 5.5
10 F 1.00 2 0 0.0
11 F 0.40 2 0 0.0
12 G 0.90 1 4 3.6
Note that the confidence value is different for each observation, but each variable has the same weight, so the summation I need is the same for every observation of a given variable.
First, I tried to make a loop iterating over each variable the corresponding number of times with:
> table(test$variable)
A B C D E F G
2 2 1 1 3 2 1
but I couldn't make it work. So then I calculated the position where each variable starts, to try to make the for loop iterate only over these values:
> tpos = cumsum(table(test$variable))
> tpos = tpos+1
> tpos
A B C D E F G
3 5 6 7 10 12 13
> tpos = shift(tpos, 1)
> tpos
[1] NA 3 5 6 7 10 12
> tpos[1]=1
> tpos
[1] 1 3 5 6 7 10 12
# tpos is a vector with the positions where each variable (A, B, c...) start
> tposn = c(1:nrow(test))[-tpos]
> tposn
[1] 2 4 8 9 11
> c(1:nrow(test))[-tposn]
[1] 1 3 5 6 7 10 12
# then I came up with this loop, but it doesn't give the correct result
for(i in 1:nrow(test)[-tposn]){  # note: parses as 1:(nrow(test)[-tposn]), i.e. 1:12, not (1:nrow(test))[-tposn]
a = test$freq[i]-1
test$SWC[i:i+a] = sum(test$weight[i]*test$confidence[i:i+a])  # note: i:i+a parses as (i:i)+a, i.e. the single index i+a
}
Maybe there is an easier way to do this? tapply?
By using dplyr:
library(dplyr)
test %>%
group_by(variable) %>%
mutate(SWC=sum(confidence*weight))
# A tibble: 12 x 5
# Groups: variable [7]
variable confidence freq weight SWC
<fctr> <dbl> <dbl> <dbl> <dbl>
1 A 1.00 2 2 3.2
2 A 0.60 2 2 3.2
3 B 0.10 2 0 0.0
4 B 0.15 2 0 0.0
5 C 1.00 1 1 1.0
6 D 0.30 1 3 0.9
7 E 0.40 3 5 5.5
8 E 0.50 3 5 5.5
9 E 0.20 3 5 5.5
10 F 1.00 2 0 0.0
11 F 0.40 2 0 0.0
12 G 0.90 1 4 3.6
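For completeness, the same group-wise sum can be done in base R with ave (close to the tapply idea in the question) or with data.table:
test$SWC <- with(test, ave(confidence * weight, variable, FUN = sum))
library(data.table)
setDT(test)[, SWC := sum(confidence * weight), by = variable]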
