I have this code:
df1 <- data.frame(letter=c("a", "a", "a", "b", "b", "c", "d", "e"),
value=c(NA))
df2 <- data.frame(letter=c("a", "b", "g", "f", "d", "e", "a", "b", "a", "c"),
number=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
I want to match these two data frames by letter and return the corresponding number in df2 into the Value column in df1.
So the result for df1 would look like this:
letter value
a 1
a 7
a 9
b 2
b 8
c 10
d 5
e 6
Matching the two data frames by letter can be accomplished with merge and unique:
> df = unique(merge(df1['letter'], df2))
> df
letter number
1 a 1
2 a 9
3 a 7
10 b 2
11 b 8
14 c 10
15 d 5
16 e 6
This is almost what you want, except that the number values for each letter might not be sorted the way you want, so sort them before assigning them to the value column of df1:
> df1$value = df$number[order(df$letter, df$number)]
> df1
letter value
1 a 1
2 a 7
3 a 9
4 b 2
5 b 8
6 c 10
7 d 5
8 e 6
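Note that this positional assignment works because df1 happens to be sorted by letter already (a, a, a, b, b, c, d, e) and because each letter has exactly as many rows in df1 as it has distinct numbers in df2. If you prefer not to rely on the positions lining up, one possible sketch is to rebuild df1 from the sorted merge result instead:
df_sorted <- df[order(df$letter, df$number), ]   # sort the merged pairs
df1 <- data.frame(letter = df_sorted$letter, value = df_sorted$number)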
Have a look at the dplyr package and its join functions:
library(dplyr)
inner_join(df1, df2) %>%
distinct(letter, number)
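If you also want the result shaped like df1 (a value column, one row per matching number, sorted within each letter), a small extension of that idea could look like the sketch below; the explicit by = "letter" and the rename() step are just assumptions about the shape you want:
library(dplyr)
result <- inner_join(df1["letter"], df2, by = "letter") %>%
  distinct(letter, number) %>%   # drop duplicated letter/number pairs
  arrange(letter, number) %>%    # sort the numbers within each letter
  rename(value = number)
result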
I have a sample dataset:
df <- data.frame(category = c("A", "A", "B", "C", "C", "D", "E", "C", "E", "A", "B", "C", "B", "B", "B", "D", "D", "D", "D", "B"), year = c(1, 2, 1, 2, 3, 2, 3, 1, 3, 2, 1, 1, 2, 1, 2, 3, 1, 2, 3, 1))
and would like to create a cross-tabulation of year and category such that only the 3 most frequent categories are in the table, sorted by total number of occurrences:
1 2 3
B 4 2 0
D 1 2 2
C 2 1 1
Using something like
df %>%
add_count(category) %>%
filter(n %in% tail(sort(unique(n)),3)) %>%
arrange(desc(n)) %>% {table(.$category, .$year)}
will filter for the three most frequently occurring categories but leave the table unsorted:
1 2 3
B 4 2 0
C 2 1 1
D 1 2 2
This should give you what you want.
# Make a table
df.t <- table(df)
# Order by top occurrences (sum over margin 1)
df.t <- df.t[order(apply(df.t, 1, sum), decreasing=TRUE),]
# Keep top 3 results
df.t <- df.t[1:3,]
Output:
year
category 1 2 3
B 4 2 0
D 1 2 2
C 2 1 1
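As a side note, apply(df.t, 1, sum) computes the row sums, so the ordering step could equivalently be written with rowSums():
df.t <- df.t[order(rowSums(df.t), decreasing = TRUE), ]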
You'd want to arrange by the row sums after creating the table. If you want to stay (mostly) within the tidyverse, e.g.:
df |>
janitor::tabyl(category, year) |>
arrange(desc(rowSums(across(where(is.numeric))))) |>
head(3)
Here with janitor::tabyl(), but you could also use dplyr::tally() and tidyr::pivot_longer() directly, or do df |> table() |> as.data.frame.matrix() like @Adamm.
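For that dplyr/tidyr route, one possible sketch is to count category/year pairs and then widen them (here with pivot_wider(); the values_fill = 0 is an assumption so that missing combinations show as 0 rather than NA):
library(dplyr)
library(tidyr)
df %>%
  count(category, year) %>%                                   # count per category/year pair
  pivot_wider(names_from = year, values_from = n, values_fill = 0) %>%
  arrange(desc(rowSums(across(where(is.numeric))))) %>%       # sort by total occurrences
  head(3)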
It's not an elegant solution, just base R, but it works:
result <- as.data.frame.matrix(table(df))
result$sum <- rowSums(result)
result <- result[order(-result$sum),]
result <- result[1:3,]
result$sum <- NULL
1 2 3
B 4 2 0
D 1 2 2
C 2 1 1
Here is the data frame:
p <- c(1, 3, 45, 1, 1, 54, 6, 6, 2)
x <- c("a", "b", "a", "a", "b", "c", "a", "b", "b")
df <- data.frame(p, x)
I want to subset the data frame such that I get a new data frame with only the top two "x" based on the count of "x".
One of the simplest ways to achieve what you want is with the data.table package, which allows for fast and easy aggregation of your data.
Please note that I modified your initial data by appending the elements 10 and c to p and x, respectively. This way, you won't see an NA when filtering the top two observations.
The idea is to sort your dataset and then use .SD, which is a convenient way of subsetting/filtering/extracting observations within each group.
Please, see the code below.
library(data.table)
p <- c(1, 3, 45, 1, 1, 54, 6, 6, 2, 10)
x <- c("a", "b", "a", "a", "b", "c", "a", "b", "b", "c")
df <- data.table(p, x)
# Sort by the group x and then by p in descending order
setorder( df, x, -p )
# Extract the first two rows by group "x"
top_two <- df[ , .SD[ 1:2 ], by = x ]
top_two
#> x p
#> 1: a 45
#> 2: a 6
#> 3: b 6
#> 4: b 3
#> 5: c 54
#> 6: c 10
Created on 2021-02-16 by the reprex package (v1.0.0)
Does this work for you?
Using dplyr:
library(dplyr)
df %>%
add_count(x) %>%
slice_max(n, n = 2)
p x n
1 1 a 4
2 3 b 4
3 45 a 4
4 1 a 4
5 1 b 4
6 6 a 4
7 6 b 4
8 2 b 4
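One thing to be aware of: slice_max() keeps ties by default, which is why all eight rows survive here ("a" and "b" are tied at n = 4 and only "c" is dropped). If you specifically want the rows belonging to the two most frequent values of x, another possible sketch is to pick those two values first and then filter:
library(dplyr)
top_x <- df %>%
  count(x, sort = TRUE) %>%   # count rows per x, most frequent first
  slice_head(n = 2) %>%
  pull(x)
df %>% filter(x %in% top_x)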
I am currently trying to figure out a vectorized way to match by two values in the same row. I have the following two simplified data frames:
# Dataframe 1: Displaying all my observations
df1 <- data.frame(c(1, 2, 3, 4, 5, 6, 7, 8),
c("A", "B", "C", "D", "A", "B", "A", "C"),
c("B", "E", "D", "A", "C", "A", "D", "A"))
colnames(df1) <- c("ID", "Number1", "Number2")
> df1
ID Number1 Number2
1 1 A B
2 2 B E
3 3 C D
4 4 D A
5 5 A C
6 6 B A
7 7 A D
8 8 C A
# Dataframe 2: Matrix of observations I am interested in
df2 <- matrix(c("A", "B",
"D", "A",
"C", "B",
"E", "D"),
ncol = 2,
byrow = TRUE)
> df2
[,1] [,2]
[1,] "A" "B"
[2,] "D" "A"
[3,] "C" "B"
[4,] "E" "D"
What I am trying to accomplish is to create a new column in df1 that states TRUE only if the exact combination is present in df2 (for example ID = 1 is equivalent to the first row in df2 because both of them consist of A and B). Additionally, if there is a shortcut, I would also like the status to be TRUE if the numbers are reversed, i.e. df1$Number1 matches df2[i,2] and df1$Number2 matches df2[i,1] (for example for ID = 7, the combination in df1 is A,D and in df2, the combination is D,A --> TRUE).
My desired output looks like this:
> df1
ID Number1 Number2 Status
1 1 A B TRUE
2 2 B E FALSE
3 3 C D FALSE
4 4 D A TRUE
5 5 A C FALSE
6 6 B A TRUE
7 7 A D TRUE
8 8 C A FALSE
All I have gotten so far is this:
StatusComb <- matrix(NA, nrow = nrow(df1), ncol = nrow(df2)) # preallocate the comparison matrix
for (i in 1:nrow(df1)) {
for (j in 1:nrow(df2)) {
Status <- ifelse(df1$Number1[i] %in% df2[j,1] &&
df1$Number2[i] %in% df2[j,2], TRUE, FALSE)
StatusComb[i,j] <- Status
}
df1$Status[i] <- ifelse(any(StatusComb[i,]) == TRUE, TRUE, FALSE)
}
It is really inefficient (you can clearly tell I am new to R) and does not look very nice either. I would appreciate any help!
One method would be to merge things together.
Adapting your data to account for reversed labels, I'll reverse df2's columns and rbind it to itself:
df2 <- rbind.data.frame(df2, df2[,c(2,1)])
colnames(df2) <- c("Number1", "Number2")
df2$a <- TRUE
df2
# Number1 Number2 a
# 1 A B TRUE
# 2 D A TRUE
# 3 C B TRUE
# 4 E D TRUE
# 5 B A TRUE
# 6 A D TRUE
# 7 B C TRUE
# 8 D E TRUE
I added the column a so that it will be carried along in the merge. From there:
df3 <- merge(df1, df2, all.x = TRUE)
df3$a <- !is.na(df3$a)
df3[ order(df3$ID), ]
# Number1 Number2 ID a
# 1 A B 1 TRUE
# 5 B E 2 FALSE
# 7 C D 3 FALSE
# 8 D A 4 TRUE
# 2 A C 5 FALSE
# 4 B A 6 TRUE
# 3 A D 7 TRUE
# 6 C A 8 FALSE
If you look at it before the !is.na(df3$a) step, you'll see that the column contains only TRUE and NA (the NA rows are the combinations not present in df2); if that is enough for you, you can omit that middle step. The order step is only there because row order after merge is not assured (in fact I find it always inconveniently different). Since df1 was previously ordered by ID, I restored that order, but it is purely cosmetic here, to match your desired output.
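If you would like the flag column to be called Status, as in your desired output, one last cosmetic step could be:
names(df3)[names(df3) == "a"] <- "Status"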
You could define the combination variable that you want to search for in alphabetical order as below:
combination <- apply(df2, 1, function(x) {
paste(sort(x), collapse = '')
})
combination
[1] "AB" "AD" "BC" "DE"
Then mutate the Status field based on the concatenation of the Number fields:
library(dplyr)
df1 %>%
rowwise() %>%
mutate(S = paste(sort(c(Number1, Number2)), collapse = "")) %>%
mutate(Status = ifelse(S %in% combination, TRUE, FALSE))
Source: local data frame [8 x 5]
Groups: <by row>
# A tibble: 8 x 5
ID Number1 Number2 S Status
<dbl> <chr> <chr> <chr> <lgl>
1 1 A B AB TRUE
2 2 B E BE FALSE
3 3 C D CD FALSE
4 4 D A AD TRUE
5 5 A C AC FALSE
6 6 B A AB TRUE
7 7 A D AD TRUE
8 8 C A AC FALSE
Data:
I set stringsAsFactors = F in the dataframe
df1 <- data.frame(c(1, 2, 3, 4, 5, 6, 7, 8),
c("A", "B", "C", "D", "A", "B", "A", "C"),
c("B", "E", "D", "A", "C", "A", "D", "A"), stringsAsFactors = F)
colnames(df1) <- c("ID", "Number1", "Number2")
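Since Number1 and Number2 are single characters here (and kept as character with stringsAsFactors = F, as in the data above), a fully vectorised variant of the same idea, with no rowwise(), could build the sorted key with pmin()/pmax(); this is only a sketch under that single-character assumption:
key1 <- paste0(pmin(df1$Number1, df1$Number2), pmax(df1$Number1, df1$Number2)) # sorted pair per row of df1
key2 <- paste0(pmin(df2[, 1], df2[, 2]), pmax(df2[, 1], df2[, 2]))             # sorted pair per row of df2
df1$Status <- key1 %in% key2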
So, with this dummy dataset
test_species <- c("a", "b", "c", "d", "e")
test_abundance <- c(4, 7, 15, 2, 9)
df <- rbind(test_species, test_abundance)
df <- as.data.frame(df)
colnames(df) <- c("a", "b", "c", "d", "e")
df <- dplyr::slice(df, 2)
we get a dataframe that's something like this:
a b c d e
4 7 15 2 9
I'd like to transform it into something like
species abundance
a 4
b 7
c 15
d 2
e 9
using the reshape2 function melt(). I tried the code
melted_df <- melt(df,
variable.name = "species",
value.name = "abundance")
but that tells me: "Using a, b, c, d, e as id variables", and the end result looks like this:
a b c d e
4 7 15 2 9
What am I doing wrong, and how can I fix it?
You can define it in the correct shape from the start, using only base library functions:
> data.frame(species=test_species, abundance=test_abundance)
species abundance
1 a 4
2 b 7
3 c 15
4 d 2
5 e 9
rbind() is adding the odd behaviour here, I think: binding the two vectors creates a character matrix, so after as.data.frame() every column is character rather than numeric.
A fairly basic fix is:
test_species <-c("a", "b", "c", "d", "e")
test_abundance <- c(4, 7, 15, 2, 9)
df <- data.frame(test_species, test_abundance) #skip rbind and go straight to df
colnames(df) <- c('species', 'abundance') #colnames correct
This skips the rbind function and will give the desired outcome.
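If you do want to keep the original construction and still use reshape2::melt(), the "Using a, b, c, d, e as id variables" message comes from the fact that every column is character, so melt() finds no measure variables on its own. Telling it explicitly which columns are measures (and converting the values back to numeric) should then work; this is a sketch of that idea, assuming character columns as in current R:
library(reshape2)
melted_df <- melt(df,
                  measure.vars = names(df),   # treat every column as a measure variable
                  variable.name = "species",
                  value.name = "abundance")
melted_df$abundance <- as.numeric(melted_df$abundance)
melted_df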
I have a df that looks like
df <- data.frame(Name = c("A", "A","A","B", "B", "C", "D", "E", "E"),
Value = c(1, 1, 1, 2, 15, 3, 4, 5, 5))
Basically, A is 1, B is 2, C is 3 and so on.
However, as you can see, B has "2" and "15". "15" is the wrong value and should not be there.
I would like to find the rows where the Value does not match within the same Name.
The ideal output would look like:
B 2
B 15
You can use tidyverse functions like:
df %>%
group_by(Name, Value) %>%
unique()
giving:
Name Value
1 A 1
2 B 2
3 B 15
4 C 3
5 D 4
6 E 5
Then, to keep only the Names with multiple Values, append to the above:
df %>%
group_by(Name, Value) %>%
unique() %>%
group_by(Name) %>%
filter(n() > 1)
Something like this? This searches for Names that are associated with more than one Value and outputs one copy of each {Name, Value} pair.
df <- data.frame(Name = c("A", "A","A","B", "B", "C", "D", "E", "E"),
Value = c(1, 1, 1, 2, 15, 3, 4, 5, 5))
res <- do.call(rbind, lapply(unique(df$Name), (function(i){
if (length(unique(df[df$Name == i,]$Value)) > 1 ) {
out <- df[df$Name == i,]
out[!duplicated(out$Value), ]
}
})))
res
Result, as expected:
Name Value
4 B 2
5 B 15
Filter(function(x) nrow(unique(x)) != 1, split(df, df$Name))
$B
Name Value
4 B 2
5 B 15
Or:
Reduce(rbind, by(df, df$Name, function(x) if (nrow(unique(x)) > 1) x))
Name Value
4 B 2
5 B 15
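A compact base-R variant in the same spirit, using ave() to count the distinct Values per Name and keep only the Names with more than one, might be:
df[ave(df$Value, df$Name, FUN = function(v) length(unique(v))) > 1, ]
Name Value
4 B 2
5 B 15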