I have two datasets:
df1:
structure(list(v1 = c(1, 4, 3, 7, 8, 1, 2, 4)), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"))
df2:
structure(list(val = c(1, 2, 3, 4, 5, 6, 7, 8, 9), lab = c("a",
"b", "c", "d", "e", "f", "g", "h", "i")), row.names = c(NA, -9L
), class = c("tbl_df", "tbl", "data.frame"))
I want to recode v1 in df1 according to the values (val) and labels (lab) in df2.
Following this, my output should look like this:
df3:
structure(list(v1 = c("a", "d", "c", "g", "h", "a", "b", "d")), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"))
Is there a package or function I am missing that would solve this easily? The problem looks simple, but I found no simple solution. Writing a for loop is always possible, but it would probably make the operation too complicated, as I want to do this many times with big datasets.
An option using dplyr which will keep the original order
library(dplyr)
new_df <- df1 %>%
transmute(v1 = left_join(df1, df2, by = c("v1" = "val"))$lab)
# v1
# <chr>
#1 a
#2 d
#3 c
#4 g
#5 h
#6 a
#7 b
#8 d
identical(new_df, df3)
#[1] TRUE
Another option is merge from base R. Note that merge sorts the result by the key, so the labels no longer line up with the original rows of df1:
df1$v1 <- merge(df1, df2, all.x = TRUE, by.x = "v1", by.y = "val")$lab
# v1
# <chr>
#1 a
#2 a
#3 b
#4 c
#5 d
#6 d
#7 g
#8 h
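For comparison, here is a minimal base R sketch that keeps the original row order by looking up each v1 in df2$val with match(). The data is rebuilt as plain data frames rather than tibbles, and the names idx and recoded are only illustrative.

```r
# Rebuild the example data as plain data frames (assumption: same values as the question)
df1 <- data.frame(v1 = c(1, 4, 3, 7, 8, 1, 2, 4))
df2 <- data.frame(val = 1:9, lab = letters[1:9], stringsAsFactors = FALSE)

# match() returns, for each v1, the position of that value in df2$val
idx <- match(df1$v1, df2$val)

# Index the labels by position; values missing from df2 would become NA
recoded <- data.frame(v1 = df2$lab[idx], stringsAsFactors = FALSE)
recoded$v1
```

Because match() is vectorised, this avoids both the loop and the reordering that merge introduces.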
Below is a simple solution:
X<-as.data.frame(df1)
Y<-as.data.frame(df2)
final_df <- merge(X, Y, all.x = TRUE, by.x = "v1", by.y = "val")
print(final_df)
output
v1 lab
1 1 a
2 1 a
3 2 b
4 3 c
5 4 d
6 4 d
7 7 g
8 8 h
This does not keep the original row order, but the dplyr approach below does.
library(dplyr)
X<-as.data.frame(df1)
Y<-as.data.frame(df2)
final_df <- X %>%
transmute(v1 = left_join(X, Y, by = c("v1" = "val"))$lab)
print(final_df)
output
v1
1 a
2 d
3 c
4 g
5 h
6 a
7 b
8 d
I hope this helps
Given two dataframes df1 and df2 as follows:
df1:
df1 <- structure(list(A = 1L, B = 2L, C = 3L, D = 4L, G = 5L), class = "data.frame", row.names = c(NA,
-1L))
Out:
A B C D G
1 1 2 3 4 5
df2:
df2 <- structure(list(Col1 = c("A", "B", "C", "D", "X"), Col2 = c("E",
"Q", "R", "Z", "Y")), class = "data.frame", row.names = c(NA,
-5L))
Out:
Col1 Col2
1 A E
2 B Q
3 C R
4 D Z
5 X Y
I need to rename the columns of df1 using df2, except column G, since it is not in df2's Col1.
I use df2$Col2[match(names(df1), df2$Col1)] based on the answer from here, but it returns "E" "Q" "R" "Z" NA; as you can see, column G becomes NA. I would like it to keep its original name.
The expected result:
E Q R Z G
1 1 2 3 4 5
How could I deal with this issue? Thanks.
Using na.omit (it's a little bit messy):
colnames(df1)[na.omit(match(names(df1), df2$Col1))] <- df2$Col2[na.omit(match(names(df1), df2$Col1))]
df1
E Q R Z G
1 1 2 3 4 5
I was able to reproduce your error with
df2 <- data.frame(
Col1 = c("H","I","K","A","B","C","D"),
Col2 = c("a1","a2","a3","E","Q","R","Z")
)
The problem is the order of df2$Col1 and names(df1) in match.
na.omit(match(names(df1), df2$Col1))
gives [1] 4 5 6 7. These are positions in df2$Col1, and indices 6 and 7 do not exist in df1, which has only 5 columns.
To index df1, swap the arguments of match: na.omit(match(df2$Col1, names(df1))) gives [1] 1 2 3 4
colnames(df1)[na.omit(match(df2$Col1, names(df1)))] <- df2$Col2[na.omit(match(names(df1), df2$Col1))]
This will work, as long as the matched names appear in the same relative order in df2$Col1 and in names(df1).
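A variant that does not depend on the two orderings agreeing is to compute the match once and reuse it on both sides of the assignment. This is only a sketch using the reproduction data above; idx and found are illustrative names.

```r
df1 <- data.frame(A = 1L, B = 2L, C = 3L, D = 4L, G = 5L)
# Lookup table whose rows are in a different order than names(df1)
df2 <- data.frame(
  Col1 = c("H", "I", "K", "A", "B", "C", "D"),
  Col2 = c("a1", "a2", "a3", "E", "Q", "R", "Z"),
  stringsAsFactors = FALSE
)

# Position of each column name in df2$Col1 (NA when the name is absent)
idx <- match(names(df1), df2$Col1)
found <- !is.na(idx)

# Rename only the matched columns; unmatched ones (G) keep their names
names(df1)[found] <- df2$Col2[idx[found]]
names(df1)
```

Using the same logical mask on the left and the matching positions on the right keeps each old name paired with its own replacement.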
A solution using the rename_with function from the dplyr package.
library(dplyr)
df3 <- df2 %>%
filter(Col1 %in% names(df1))
df4 <- df1 %>%
rename_with(.cols = df3$Col1, .fn = function(x) df3$Col2[df3$Col1 %in% x])
df4
# E Q R Z G
# 1 1 2 3 4 5
I have a dataset and I want to check whether my variables all have unique values in a specific row.
Let's say I want to analyze row D.
My data:
Name F S T
A 1 2 3
B 2 3 4
C 3 4 5
D 4 5 6
> TRUE (because all three variables have unique values)
Second example
Name F S T
A 1 2 3
B 2 3 4
C 3 4 5
D 4 5 4
> FALSE (because F and T have the same value in row D)
In base R do
f1 <- function(dat, ind) {
tmp <- unlist(dat[ind, -1])
length(unique(tmp)) == length(tmp)
}
Testing:
> f1(df, 4)
[1] TRUE
> f1(df1, 4)
[1] FALSE
data
df <- structure(list(Name = c("A", "B", "C", "D"), F = 1:4, S = 2:5,
T = 3:6), class = "data.frame", row.names = c(NA, -4L))
df1 <- structure(list(Name = c("A", "B", "C", "D"), F = 1:4, S = 2:5,
T = c(3L, 4L, 5L, 4L)), class = "data.frame", row.names = c(NA,
-4L))
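If every row needs checking rather than a single one, the same idea can be applied row by row. The sketch below uses apply() with anyDuplicated(); row_unique is an illustrative name, and the data is the second example from the question.

```r
df1 <- data.frame(
  Name = c("A", "B", "C", "D"),
  F = 1:4, S = 2:5, T = c(3L, 4L, 5L, 4L),
  stringsAsFactors = FALSE
)

# For each row of the value columns, anyDuplicated() returns 0 when no value repeats
row_unique <- apply(df1[, -1], 1, function(r) anyDuplicated(r) == 0)

# Label the result with the row names from the Name column
names(row_unique) <- df1$Name
row_unique
```

Only row D is flagged FALSE here, because its F and T values are both 4.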
You can use dplyr for this:
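library(dplyr)
# Row D has unique values when its distinct count equals the number of value columns
row_d <- df %>% filter(Name == "D") %>% select(-Name)
n_distinct(unlist(row_d)) == ncol(row_d)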
I have the toy dataset below, which is representative of a much larger dataset; these are the columns of importance. I'm attempting to check whether the values in DataFrame match the reference dataframes Reference_A, Reference_B, and Reference_C.
DataFrame
group type value
x A Teddy
x A William
x A Lars
y B Robert
y B Elsie
y C Maeve
y C Charlotte
y C Bernard
Reference_A
type value
A Teddy
A William
A Lars
Reference_B
type value
B Elsie
B Dolores
Reference_C
type value
C Maeve
C Hale
C Bernard
Desired output:
group type value check
x A Teddy TRUE
x A William TRUE
x A Lars TRUE
y B Robert FALSE
y B Elsie TRUE
y C Maeve TRUE
y C Charlotte FALSE
y C Bernard TRUE
I posted a similar question here, but realized that TRUE/FALSE flags might be a more effective check: Check if values of one dataframe exist in another dataframe in exact order. I don't think that order matters, since I can manipulate my data so that all values are unique.
You can combine the "Reference" dataframes into one dataframe and join it with DataFrame by type. Then, for each type and value, you can check whether any reference value matches.
library(dplyr)
mget(paste0('Reference_', c('A', 'B', 'C'))) %>%
bind_rows() %>%
right_join(DataFrame, by = 'type') %>%
group_by(group, type, value = value.y) %>%
summarise(check = any(value.x == value.y))
# group type value check
# <chr> <chr> <chr> <lgl>
#1 x A Lars TRUE
#2 x A Teddy TRUE
#3 x A William TRUE
#4 y B Elsie TRUE
#5 y B Robert FALSE
#6 y C Bernard TRUE
#7 y C Charlotte FALSE
#8 y C Maeve TRUE
data
Reference_A <- structure(list(type = c("A", "A", "A"),
value = c("Teddy", "William", "Lars")), class = "data.frame",
row.names = c(NA, -3L))
Reference_B <- structure(list(type = c("B", "B"), value = c("Elsie", "Dolores")),
class = "data.frame", row.names = c(NA, -2L))
Reference_C <- structure(list(type = c("C", "C", "C"), value = c("Maeve", "Hale",
"Bernard")), class = "data.frame", row.names = c(NA, -3L))
DataFrame <- structure(list(group = c("x", "x", "x", "y", "y", "y", "y", "y"),
type = c("A", "A", "A", "B", "B", "C", "C", "C"), value = c("Teddy",
"William", "Lars", "Robert", "Elsie", "Maeve", "Charlotte", "Bernard"
)), class = "data.frame", row.names = c(NA, -8L))
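A base R alternative that keeps the original row order is to build a combined key on both sides and test membership with %in%. This is a sketch only: the "|" separator is an assumption and must not occur in the data, and refs, ref_key and data_key are illustrative names.

```r
Reference_A <- data.frame(type = "A", value = c("Teddy", "William", "Lars"), stringsAsFactors = FALSE)
Reference_B <- data.frame(type = "B", value = c("Elsie", "Dolores"), stringsAsFactors = FALSE)
Reference_C <- data.frame(type = "C", value = c("Maeve", "Hale", "Bernard"), stringsAsFactors = FALSE)
DataFrame <- data.frame(
  group = c("x", "x", "x", "y", "y", "y", "y", "y"),
  type  = c("A", "A", "A", "B", "B", "C", "C", "C"),
  value = c("Teddy", "William", "Lars", "Robert", "Elsie", "Maeve", "Charlotte", "Bernard"),
  stringsAsFactors = FALSE
)

# Stack the reference tables into one lookup table
refs <- rbind(Reference_A, Reference_B, Reference_C)

# Build a "type|value" key on both sides (assumed separator, see above)
ref_key  <- paste(refs$type, refs$value, sep = "|")
data_key <- paste(DataFrame$type, DataFrame$value, sep = "|")

# A row passes when its key is present among the reference keys
DataFrame$check <- data_key %in% ref_key
DataFrame
```

Including type in the key means a value only counts as a match against the reference table of its own type.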
Let's assume there are two huge dataframes of different lengths, each with 2 columns, like:
df1 df2
A 1 C X
A 1 D X
B 4 C X
A 1 F X
B 4 A X
B 4 B X
C 7 B X
Whenever the first columns match, X should be replaced with the value from column 2 of df1. If the first column of df2 contains elements that are not in the first column of df1 (F, D), X should be replaced with 0.
Since the dataframes are huge, a loop within a loop would not be practical.
The solution should look like this:
df1 df2
A 1 C 7
A 1 D 0
B 4 C 7
A 1 F 0
B 4 A 1
B 4 B 4
C 7 B 4
Thank You in advance
As there are duplicate rows in 'df1', we can get the unique rows
df3 <- unique(df1)
Then, use match to get the index
i1 <- match(df2$Col1, df3$Col1)
and based on the index, assign
df2$Col2 <- df3$Col2[i1]
If there are no matches, it would be NA, which can be changed to 0
df2$Col2[is.na(df2$Col2)] <- 0
df2
# Col1 Col2
#1 C 7
#2 D 0
#3 C 7
#4 F 0
#5 A 1
#6 B 4
#7 B 4
Alternatively, use data.table: drop Col2 from df2, then join on Col1 and assign Col2 from df3. Unmatched rows are NA and still need to be set to 0 as above.
library(data.table)
setDT(df2)[, Col2 := NULL][df3, Col2 := Col2, on = .(Col1)]
data
df1 <- structure(list(Col1 = c("A", "A", "B", "A", "B", "B", "C"), Col2 = c(1,
1, 4, 1, 4, 4, 7)), class = "data.frame", row.names = c(NA, -7L
))
df2 <- structure(list(Col1 = c("C", "D", "C", "F", "A", "B", "B"), Col2 = c("X",
"X", "X", "X", "X", "X", "X")), class = "data.frame", row.names = c(NA,
-7L))
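The same lookup can also be written with a named vector, which some find easier to read than match(). This is a sketch under the same data; lut and vals are illustrative names.

```r
df1 <- data.frame(Col1 = c("A", "A", "B", "A", "B", "B", "C"),
                  Col2 = c(1, 1, 4, 1, 4, 4, 7), stringsAsFactors = FALSE)
df2 <- data.frame(Col1 = c("C", "D", "C", "F", "A", "B", "B"),
                  Col2 = "X", stringsAsFactors = FALSE)

# Named lookup vector: names are the keys, values are the replacements
df3 <- unique(df1)
lut <- setNames(df3$Col2, df3$Col1)

# Index by key; keys absent from the lookup give NA, which becomes 0
vals <- unname(lut[df2$Col1])
vals[is.na(vals)] <- 0
df2$Col2 <- vals
df2$Col2
```

Like match(), indexing a named vector is vectorised, so no explicit loop is needed even for large inputs.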
Sample data.frame:
structure(list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9)), .Names = c("a", "b", "c"), row.names = c(NA, -3L), class = "data.frame")
Output:
df
# a b c
# 1 1 4 7
# 2 2 5 8
# 3 3 6 9
I'd like to get the first and third columns, but I want to subset by name and also by column index.
df[, "a"]
# [1] 1 2 3
df[, 3]
# [1] 7 8 9
df[, c("a", 3)]
# Error in `[.data.frame`(df, , c("a", 3)) : undefined columns selected
df[, c(match("a", names(df)), 3)]
# a c
# 1 1 7
# 2 2 8
# 3 3 9
Are there functions or packages that allow for clean/simple syntax, as in the third example, while also achieving the result of the fourth example?
Maybe use dplyr?
For interactive use - i.e., if you know ahead of time the name of the column you want to select
library(dplyr)
df %>% select(a, 3)
If you do not know the name of the column in advance and want to pass it as a variable, wrap it in all_of() (select_() is deprecated):
x <- names(df)[1]
x
# [1] "a"
df %>% select(all_of(x), 3)
Either way the output is
# a c
#1 1 7
#2 2 8
#3 3 9
In base R you can use subset with its select argument, which accepts unquoted names and positions together.
df <- structure(list(a = c(1, 2, 3),
b = c(4, 5, 6), c = c(7, 8, 9)),
.Names = c("a", "b", "c"), row.names = c(NA, -3L), class = "data.frame")
df <- subset(df, select = c(a, 3))
You can index names(df) without using dplyr:
df <- structure(list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9)), .Names = c("a", "b", "c"), row.names = c(NA, -3L), class = "data.frame")
df[,c("a",names(df)[3]) ]
Output:
a c
1 1 7
2 2 8
3 3 9