Check if values exist in other reference dataframes - r

I have the toy dataset below, which is representative of a much larger dataset; these are the columns that matter. I'm trying to check whether the values in DataFrame appear in the reference dataframes Reference_A, Reference_B, and Reference_C.
DataFrame
group type value
x A Teddy
x A William
x A Lars
y B Robert
y B Elsie
y C Maeve
y C Charlotte
y C Bernard
Reference_A
type value
A Teddy
A William
A Lars
Reference_B
type value
B Elsie
B Dolores
Reference_C
type value
C Maeve
C Hale
C Bernard
Desired output:
group type value check
x A Teddy TRUE
x A William TRUE
x A Lars TRUE
y B Robert FALSE
y B Elsie TRUE
y C Maeve TRUE
y C Charlotte FALSE
y C Bernard TRUE
I posted a similar question here, but realized that TRUE/FALSE flags might be a more effective check: Check if values of one dataframe exist in another dataframe in exact order. I don't think order matters, since I can manipulate my data so that all values are unique.

You can combine the "Reference" dataframes into one dataframe and join it with DataFrame by type. Then, for each group, type, and value, check whether any reference value matches.
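library(dplyr)
# Collect the reference dataframes by name and stack them
mget(paste0('Reference_', c('A', 'B', 'C'))) %>%
  bind_rows() %>%
  # value.x comes from the references, value.y from DataFrame
  right_join(DataFrame, by = 'type') %>%
  group_by(group, type, value = value.y) %>%
  # TRUE if any reference value of the same type equals this value
  summarise(check = any(value.x == value.y))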
# group type value check
# <chr> <chr> <chr> <lgl>
#1 x A Lars TRUE
#2 x A Teddy TRUE
#3 x A William TRUE
#4 y B Elsie TRUE
#5 y B Robert FALSE
#6 y C Bernard TRUE
#7 y C Charlotte FALSE
#8 y C Maeve TRUE
data
Reference_A <- structure(list(type = c("A", "A", "A"),
value = c("Teddy", "William", "Lars")), class = "data.frame",
row.names = c(NA, -3L))
Reference_B <- structure(list(type = c("B", "B"), value = c("Elsie", "Dolores")),
class = "data.frame", row.names = c(NA, -2L))
Reference_C <- structure(list(type = c("C", "C", "C"), value = c("Maeve", "Hale",
"Bernard")), class = "data.frame", row.names = c(NA, -3L))
DataFrame <- structure(list(group = c("x", "x", "x", "y", "y", "y", "y", "y"),
type = c("A", "A", "A", "B", "B", "C", "C", "C"), value = c("Teddy",
"William", "Lars", "Robert", "Elsie", "Maeve", "Charlotte", "Bernard"
)), class = "data.frame", row.names = c(NA, -8L))
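For comparison, the same check can be sketched in base R without a join, which also leaves the rows of DataFrame in their original order. This is only a sketch under one assumption: the "|" separator used to build a combined type/value key never occurs in the data. The helper name key is hypothetical.

```r
# Toy data rebuilt from the question
DataFrame <- data.frame(
  group = c("x", "x", "x", "y", "y", "y", "y", "y"),
  type  = c("A", "A", "A", "B", "B", "C", "C", "C"),
  value = c("Teddy", "William", "Lars", "Robert", "Elsie",
            "Maeve", "Charlotte", "Bernard"),
  stringsAsFactors = FALSE
)
Reference_A <- data.frame(type = "A", value = c("Teddy", "William", "Lars"),
                          stringsAsFactors = FALSE)
Reference_B <- data.frame(type = "B", value = c("Elsie", "Dolores"),
                          stringsAsFactors = FALSE)
Reference_C <- data.frame(type = "C", value = c("Maeve", "Hale", "Bernard"),
                          stringsAsFactors = FALSE)

# Stack the references into one lookup table
refs <- rbind(Reference_A, Reference_B, Reference_C)

# Combined type/value key (assumption: "|" never occurs in type or value)
key <- function(d) paste(d$type, d$value, sep = "|")

# A value only counts as a match within its own type
DataFrame$check <- key(DataFrame) %in% key(refs)
DataFrame$check
# [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE
```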

Related

Verifying whether at least two columns have the same value in a specific row

I have a dataset and I want to see whether all my variables have unique values in a specific row.
Let's say I want to analyze row D.
my data
Name F S T
A 1 2 3
B 2 3 4
C 3 4 5
D 4 5 6
> TRUE (because all three variables have unique values)
Second example
Name F S T
A 1 2 3
B 2 3 4
C 3 4 5
D 4 5 4
> FALSE (because F and T have the same value in row D)
In base R do
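f1 <- function(dat, ind) {
  # Flatten row `ind`, dropping the first (Name) column
  tmp <- unlist(dat[ind, -1])
  # TRUE when no value is repeated within the row
  length(unique(tmp)) == length(tmp)
}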
-testing
> f1(df, 4)
[1] TRUE
> f1(df1, 4)
[1] FALSE
data
df <- structure(list(Name = c("A", "B", "C", "D"), F = 1:4, S = 2:5,
T = 3:6), class = "data.frame", row.names = c(NA, -4L))
df1 <- structure(list(Name = c("A", "B", "C", "D"), F = 1:4, S = 2:5,
T = c(3L, 4L, 5L, 4L)), class = "data.frame", row.names = c(NA,
-4L))
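You can also use dplyr for this. Note that this counts distinct values down each column rather than across row D, so it agrees with the expected results for the two examples above but is not a row-wise check in general:
library(dplyr)
df %>%
  summarize_at(c(2:ncol(.)), n_distinct) %>%
  summarize(if_all(.fns = ~ .x == nrow(df)))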
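A row-wise check can also select the row by its Name value rather than by position. The following is a minimal base R sketch, assuming the layout from the question (a Name column followed by the value columns); the helper name row_all_unique is hypothetical.

```r
# Toy data rebuilt from the question
df <- data.frame(Name = c("A", "B", "C", "D"),
                 F = 1:4, S = 2:5, T = 3:6,
                 stringsAsFactors = FALSE)
df1 <- data.frame(Name = c("A", "B", "C", "D"),
                  F = 1:4, S = 2:5, T = c(3L, 4L, 5L, 4L),
                  stringsAsFactors = FALSE)

# Hypothetical helper: TRUE when the row labelled `name` has no repeated
# value across the non-Name columns
row_all_unique <- function(dat, name) {
  vals <- unlist(dat[dat$Name == name, names(dat) != "Name"])
  # anyDuplicated() returns 0 when there are no duplicates
  anyDuplicated(vals) == 0
}

row_all_unique(df, "D")
# [1] TRUE
row_all_unique(df1, "D")
# [1] FALSE
```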

Filter rows which has at least two of particular values

I have a data frame like this.
df
Languages Order Machine Company
[1] W,X,Y,Z,H,I D D B
[2] W,X B A G
[3] W,I E B A
[4] H,I B C B
[5] W G G C
I want to get the number of rows where Languages has at least 2 of the 3 values W, H, I.
The result should be 3, because rows 1, 3, and 4 each contain at least 2 of the 3 values W, H, I.
You can use strsplit on df$Languages and take the intersect with W, H, I. Then get the lengths of the results and use sum to count those with more than one match.
sum(lengths(sapply(strsplit(df$Languages, ",", TRUE), intersect, c("W","H","I"))) > 1)
#[1] 3
You can use:
sum(sapply(strsplit(df$Languages, ','), function(x)
  sum(c("W","H","I") %in% x) >= 2))
#[1] 3
data
df<- structure(list(Languages = c("W,X,Y,Z,H,I", "W,X", "W,I", "H,I",
"W"), Order = c("D", "B", "E", "B", "G"), Machine = c("D", "A",
"B", "C", "G"), Company = c("B", "G", "A", "B", "C")),
class = "data.frame", row.names = c(NA, -5L))
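A tidyverse approach, which returns the matching rows rather than their count:
library(dplyr)
library(purrr)
library(stringr)
df %>% filter(map_int(str_split(Languages, ','), ~ sum(.x %in% c('W', 'H', 'I'))) >= 2)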
Languages Order Machine Company
1 W,X,Y,Z,H,I D D B
2 W,I E B A
3 H,I B C B
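The same filter can be sketched in base R, where vapply makes the expected return type explicit. This is an illustrative sketch using the question's data; the names targets and keep are hypothetical.

```r
# Toy data rebuilt from the question
df <- data.frame(
  Languages = c("W,X,Y,Z,H,I", "W,X", "W,I", "H,I", "W"),
  Order     = c("D", "B", "E", "B", "G"),
  Machine   = c("D", "A", "B", "C", "G"),
  Company   = c("B", "G", "A", "B", "C"),
  stringsAsFactors = FALSE
)

targets <- c("W", "H", "I")

# One logical per row: does the row contain at least two of the targets?
keep <- vapply(
  strsplit(df$Languages, ",", fixed = TRUE),
  function(x) sum(targets %in% x) >= 2,
  logical(1)
)

sum(keep)    # number of qualifying rows
# [1] 3
df[keep, ]   # the qualifying rows themselves
```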

Recode variable values into strings based on different variable

I have two datasets:
df1:
structure(list(v1 = c(1, 4, 3, 7, 8, 1, 2, 4)), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"))
df2:
structure(list(val = c(1, 2, 3, 4, 5, 6, 7, 8, 9), lab = c("a",
"b", "c", "d", "e", "f", "g", "h", "i")), row.names = c(NA, -9L
), class = c("tbl_df", "tbl", "data.frame"))
I want to recode v1 in df1 according to the values (val) and labels (lab) in df2.
Following this, my output should look like this:
df3:
structure(list(v1 = c("a", "d", "c", "g", "h", "a", "b", "d")), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"))
Is there a package or function I am missing that could easily solve this problem? The problem looks quite easy to me, but I found no simple solution. Writing a for loop would always be possible, but it would probably make the operation too complicated, as I want to do this many times with big datasets.
An option using dplyr which will keep the original order
library(dplyr)
new_df <- df1 %>%
transmute(v1 = left_join(df1, df2, by = c("v1" = "val"))$lab)
# v1
# <chr>
#1 a
#2 d
#3 c
#4 g
#5 h
#6 a
#7 b
#8 d
identical(new_df, df3)
#[1] TRUE
Another option in base R is merge. It sorts the result by the key, so the labels come back sorted rather than in the original row order of df1:
df1$v1 <- merge(df1, df2, all.x = TRUE, by.x = "v1", by.y = "val")$lab
# v1
# <chr>
#1 a
#2 a
#3 b
#4 c
#5 d
#6 d
#7 g
#8 h
Below is a simple solution:
X<-as.data.frame(df1)
Y<-as.data.frame(df2)
final_df <- merge(X, Y, all.x = TRUE, by.x = "v1", by.y = "val")
print(final_df)
output
v1 lab
1 1 a
2 1 a
3 2 b
4 3 c
5 4 d
6 4 d
7 7 g
8 8 h
This does not keep the original order, but the dplyr approach below does:
library(dplyr)
X<-as.data.frame(df1)
Y<-as.data.frame(df2)
final_df <- X %>%
transmute(v1 = left_join(X, Y, by = c("v1" = "val"))$lab)
print(final_df)
output
v1
1 a
2 d
3 c
4 g
5 h
6 a
7 b
8 d
I hope this helps
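If an order-preserving base R version is preferred, a lookup with match is one possibility. The sketch below uses the data from the question; values of v1 that do not occur in df2$val would come back as NA.

```r
# Toy data rebuilt from the question
df1 <- data.frame(v1 = c(1, 4, 3, 7, 8, 1, 2, 4))
df2 <- data.frame(val = 1:9,
                  lab = c("a", "b", "c", "d", "e", "f", "g", "h", "i"),
                  stringsAsFactors = FALSE)

# match() gives, for each v1, the position of that value in df2$val;
# indexing df2$lab with those positions keeps df1's row order
df3 <- data.frame(v1 = df2$lab[match(df1$v1, df2$val)],
                  stringsAsFactors = FALSE)
df3$v1
# [1] "a" "d" "c" "g" "h" "a" "b" "d"
```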

Change values in data frame in one column where some other column is equal to some text.

Given the following dataframe, how can I change all hab type values to "ungrazed" where Study area is equal to "B"? It seems I would need to use an apply function, but I can't work out the correct construction. Thanks in advance.
hab type Study area
grazed A
grazed A
grazed B
grazed B
grazed C
grazed C
You can try
df$hab.type[df$Study.area=='B'] <- 'Ungrazed'
df
# hab.type Study.area
#1 grazed A
#2 grazed A
#3 Ungrazed B
#4 Ungrazed B
#5 grazed C
#6 grazed C
Or
transform(df, hab.type=replace(hab.type, Study.area=='B', 'Ungrazed'))
data
df <- structure(list(hab.type = c("grazed", "grazed", "grazed", "grazed",
"grazed", "grazed"), Study.area = c("A", "A", "B", "B", "C",
"C")), .Names = c("hab.type", "Study.area"), class = "data.frame",
row.names = c(NA, -6L))
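No apply function is needed because the comparison is vectorized over the whole column. As a further sketch, ifelse expresses the same replacement as a single expression; this version writes the lowercase "ungrazed" used in the question and assumes hab.type is a character column, not a factor.

```r
# Toy data rebuilt from the question
df <- data.frame(
  hab.type   = rep("grazed", 6),
  Study.area = c("A", "A", "B", "B", "C", "C"),
  stringsAsFactors = FALSE
)

# Where Study.area is "B" use "ungrazed", otherwise keep the existing value
df$hab.type <- ifelse(df$Study.area == "B", "ungrazed", df$hab.type)
df$hab.type
# [1] "grazed"   "grazed"   "ungrazed" "ungrazed" "grazed"   "grazed"
```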

Random sampling by category, different number of samples needed per category in R [duplicate]

I have a question about random sampling in R. I have two datasets. In the first, say df1, each observation is a sample, and the location where the sample was collected is stored in the variable "loc", which is a character column. An example data layout is shown below.
ID loc x1 x2 x3
1 A x x x
2 A x x x
3 A x x x
4 B x x x
5 B x x x
6 C x x x
7 C x x x
8 C x x x
9 C x x x
etc.
The second dataset, say df2, is a list of all of the locations and the number of random samples required from each location. It looks like this:
loc n
A 2
B 1
C 3
I am wondering how to take different numbers of random samples by group, where the number of samples required is denoted in df2.
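We can split the first dataset by 'loc', use map2 to loop over the resulting list together with the corresponding 'n' from the second dataset, and pass that 'n' to sample_n. This relies on the rows of df2 being in the same (sorted) order as the groups produced by group_split.
library(purrr)
library(dplyr)
df1 %>%
  group_split(loc) %>%
  map2_dfr(df2$n, ~ sample_n(.x, .y))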
# A tibble: 6 x 5
# ID loc x1 x2 x3
# <int> <chr> <chr> <chr> <chr>
#1 1 A x x x
#2 2 A x x x
#3 5 B x x x
#4 6 C x x x
#5 8 C x x x
#6 7 C x x x
Another option is to use match to look up each group's 'n':
df1 %>%
group_by(loc) %>%
sample_n(df2$n[match(first(loc), df2$loc)])
data
df1 <- structure(list(ID = 1:9, loc = c("A", "A", "A", "B", "B", "C",
"C", "C", "C"), x1 = c("x", "x", "x", "x", "x", "x", "x", "x",
"x"), x2 = c("x", "x", "x", "x", "x", "x", "x", "x", "x"), x3 = c("x",
"x", "x", "x", "x", "x", "x", "x", "x")), class = "data.frame",
row.names = c(NA,
-9L))
df2 <- structure(list(loc = c("A", "B", "C"), n = c(2L, 1L, 3L)),
class = "data.frame", row.names = c(NA,
-3L))
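For reference, the same per-location sampling can be sketched in base R, looking up each location's sample size by name so that the row order of df2 does not matter. The seed value and the names n_by_loc, pieces, and sampled are illustrative assumptions, not part of the answers above.

```r
# Toy data rebuilt from the question
df1 <- data.frame(ID = 1:9,
                  loc = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
                  x1 = "x", x2 = "x", x3 = "x",
                  stringsAsFactors = FALSE)
df2 <- data.frame(loc = c("A", "B", "C"), n = c(2L, 1L, 3L),
                  stringsAsFactors = FALSE)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Named lookup: location -> number of samples required
n_by_loc <- setNames(df2$n, df2$loc)

# Split by location and draw the required number of rows from each piece
pieces <- lapply(split(df1, df1$loc), function(d) {
  k <- n_by_loc[[d$loc[1]]]
  d[sample(nrow(d), k), , drop = FALSE]
})
sampled <- do.call(rbind, pieces)

table(sampled$loc)  # counts per location: A = 2, B = 1, C = 3
```

Sampling here is without replacement, so each location needs at least as many rows in df1 as df2 requests for it.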
