I have two data tables as shown below:
bigrams
w1w2 freq w1 w2
common names 1 common names
department of 4 department of
family name 6 family name
bigrams = setDT(structure(list(w1w2 = c("common names", "department of", "family name"
), freq = c(1L, 4L, 6L), w1 = c("common", "department", "family"
), w2 = c("names", "of", "name")), .Names = c("w1w2", "freq",
"w1", "w2"), row.names = c(NA, -3L), class = "data.frame"))
unigrams
w1 freq
common 2
department 3
family 4
name 5
names 1
of 9
unigrams = setDT(structure(list(w1 = c("common", "department", "family", "name",
"names", "of"), freq = c(2L, 3L, 4L, 5L, 1L, 9L)), .Names = c("w1",
"freq"), row.names = c(NA, -6L), class = "data.frame"))
desired output
w1w2 freq w1 w2 w1freq w2freq
common names 1 common names 2 1
department of 4 department of 3 9
family name 6 family name 4 5
What I have done so far
setkey(bigrams, w1)
setkey(unigrams, w1)
result <- bigrams[unigrams]
This gives me the i.freq column for w1 but when I try to do the same for w2 the i.freq column is updated to reflect the freq of w2.
How can I get freq for both w1 and w2 in separate columns?
Note: I have already seen solutions to data.table Lookup value and translate and Modify column of a data.table based on another column and add the new column
You can do two joins. From v1.9.6 of data.table onward you can use the on= argument to join on columns with differing names.
library(data.table)
bigrams[unigrams, on=c("w1"), nomatch = 0][unigrams, on=c(w2 = "w1"), nomatch = 0]
w1w2 freq w1 w2 i.freq i.freq.1
1: family name 6 family name 4 5
2: common names 1 common names 2 1
3: department of 4 department of 3 9
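If you would rather have the new columns come out named w1freq and w2freq, as in the desired output, an update join is one option. The following is a minimal sketch, assuming the bigrams and unigrams tables from the question (rebuilt here so the block is self-contained) and a data.table version that supports on= together with :=.

```r
library(data.table)

# Same data as in the question, rebuilt so this block runs on its own.
bigrams <- data.table(
  w1w2 = c("common names", "department of", "family name"),
  freq = c(1L, 4L, 6L),
  w1   = c("common", "department", "family"),
  w2   = c("names", "of", "name")
)
unigrams <- data.table(
  w1   = c("common", "department", "family", "name", "names", "of"),
  freq = c(2L, 3L, 4L, 5L, 1L, 9L)
)

# Match bigrams$w1 to unigrams$w1 and copy the unigram frequency (i.freq).
bigrams[unigrams, on = "w1", w1freq := i.freq]

# Match bigrams$w2 to unigrams$w1 and copy the unigram frequency again.
bigrams[unigrams, on = c(w2 = "w1"), w2freq := i.freq]

print(bigrams)
```

Because := updates bigrams by reference, the row order of bigrams is left as it was and no extra result table is created.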
You can do this with a bit of reshaping.
library(dplyr)
library(tidyr)
bigrams %>%
rename(w1w2_string = w1w2,
w1w2_freq = freq) %>%
gather(order, string,
w1, w2) %>%
left_join(unigrams %>%
rename(string = w1) ) %>%
gather(type, value,
string, freq) %>%
unite(order_type, order, type) %>%
spread(order_type, value)
Edit: Explanation
The first observation you can make is that bigrams contains in fact information about three different units of analysis: a bigram and two unigrams. Convert to long form so that the unit of analysis is a unigram. Then we can merge in the other unigram data. Now note that your unigram has two different pieces of information per row: the frequency for the unigram, and the text of the unigram. Convert to long form again so that the unit of analysis is a piece of information about a unigram. Now spread, so that each new column is a type of information about a unigram.
I need to prepare queries that are made of character strings (DOI, Digital Object Identifier) stored in a data frame. All strings associated with the same case have to be joined to produce one query.
The df looks like this:
Case DOI
1 1212313/dfsjk23
1 322332/jdkdsa12
2 21323/xsw.w3
2 311331313/q1231
2 1212121/1231312
The output should be a data frame looking like this:
Case Query
1 DO=(1212313/dfsjk23 OR 322332/jdkdsa12)
2 DO=(21323/xsw.w3 OR 311331313/q1231 OR 1212121/1231312)
The prefix ("DO="), the suffix (")") and the "OR" separator are not critical, since I can add them later. But how do I aggregate character strings based on a case number?
In base R you could do:
aggregate(DOI~Case, df1, function(x) sprintf('DO=(%s)', paste0(x, collapse = ' OR ')))
Case DOI
1 1 DO=(1212313/dfsjk23 OR 322332/jdkdsa12)
2 2 DO=(21323/xsw.w3 OR 311331313/q1231 OR 1212121/1231312)
If using R 4.1.0 or later, the shorthand lambda syntax also works:
aggregate(DOI~Case, df1, \(x)sprintf('DO=(%s)', paste0(x, collapse = ' OR ')))
We can use glue with str_c to collapse the 'DOI' column after grouping by 'Case'
library(stringr)
library(dplyr)
df1 %>%
group_by(Case) %>%
summarise(Query = glue::glue("DO=({str_c(DOI, collapse= ' OR ')})"))
-output
## A tibble: 2 x 2
# Case Query
# <int> <glue>
#1 1 DO=(1212313/dfsjk23 OR 322332/jdkdsa12)
#2 2 DO=(21323/xsw.w3 OR 311331313/q1231 OR 1212121/1231312)
data
df1 <- structure(list(Case = c(1L, 1L, 2L, 2L, 2L), DOI = c("1212313/dfsjk23",
"322332/jdkdsa12", "21323/xsw.w3", "311331313/q1231", "1212121/1231312"
)), class = "data.frame", row.names = c(NA, -5L))
I have two data.frame tables in R. Both have IDs for users who took particular actions. The users in the second table should all have done the actions in the first table, but I want to confirm. What would be the best way to determine whether all the IDs in table B are represented in table A, and if not, which IDs aren't?
Table A
**Unique ID** **Count**
abc123 1
zyx456 15
888aaaa 4
Table B
**Unique ID** **Count**
abc123 1
zyx456 1
zzzzz123 2
I'm trying to get a response that abc123 and zyx456 in Table B are in Table A and that zzzzz123 is not represented in Table A but is in B (which would be an error, since all B should be in A).
This is an efficient one-liner in base R:
setdiff(TableB$ID, TableA$ID)
It will return an empty result if everything in TableB is in TableA, and return the missing IDs if there are any.
Other answers may be better choices with broader context, but this is a simple solution for a simple problem.
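As a small illustration of the setdiff approach, here is a sketch using the example tables from the question. The names dfA, dfB and the column name UniqueID are assumptions for the sketch; substitute your own table and column names.

```r
# Example tables from the question (column name UniqueID is assumed).
dfA <- data.frame(UniqueID = c("abc123", "zyx456", "888aaaa"),
                  Count = c(1L, 15L, 4L),
                  stringsAsFactors = FALSE)
dfB <- data.frame(UniqueID = c("abc123", "zyx456", "zzzzz123"),
                  Count = c(1L, 1L, 2L),
                  stringsAsFactors = FALSE)

# IDs that appear in B but not in A; empty if B is fully covered by A.
missing_ids <- setdiff(dfB$UniqueID, dfA$UniqueID)

# A single TRUE/FALSE answer to "is every ID in B also in A?"
all_present <- length(missing_ids) == 0

# IDs that appear in both tables.
present_ids <- intersect(dfB$UniqueID, dfA$UniqueID)

print(missing_ids)
print(all_present)
print(present_ids)
```

Note that setdiff and intersect work on unique values, so repeated IDs in either table are reported once.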
We can do this easily with a join in the tidyverse:
library(tidyverse)
JoinedTable = full_join(
x = TableA %>% mutate(in.A = TRUE),
y = TableB %>% mutate(in.B = TRUE),
by = "UniqueID",
suffix = c(".A",".B")
)
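### Use whichever of the following is applicable
### (after the full join, in.A / in.B are NA, not FALSE, for rows missing from that table)
## Is in both
JoinedTable %>%
filter(in.A, in.B)
## In A only
JoinedTable %>%
filter(in.A, is.na(in.B))
## In B only
JoinedTable %>%
filter(is.na(in.A), in.B)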
Use a full_join to combine the tables: set "by" to your ID column and add a suffix to differentiate the other columns that appear in both tables. I've added the mutates to make the filtering code clearer, but you could also just look for NAs in the respective Count columns (i.e. filter(!is.na(Count.A), is.na(Count.B)) to find ones in A but not B).
If you just want a vector of the ones that meet each condition, just tack on %>% pull(UniqueID) to grab that.
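If only the unmatched or matched rows are needed, dplyr's filtering joins may be a shorter route than a full join. This is a sketch under the assumption that both tables share an ID column named UniqueID; the example data are copied from the question.

```r
library(dplyr)

# Example tables from the question (column name UniqueID is assumed).
TableA <- data.frame(UniqueID = c("abc123", "zyx456", "888aaaa"),
                     Count = c(1L, 15L, 4L),
                     stringsAsFactors = FALSE)
TableB <- data.frame(UniqueID = c("abc123", "zyx456", "zzzzz123"),
                     Count = c(1L, 1L, 2L),
                     stringsAsFactors = FALSE)

# Rows of B that have no matching ID in A (the "errors").
only_in_B <- anti_join(TableB, TableA, by = "UniqueID")

# Rows of B that do have a matching ID in A.
in_both <- semi_join(TableB, TableA, by = "UniqueID")

print(only_in_B)
print(in_both)
```

Both functions keep only the columns of the first table, so no suffixes or flag columns are needed.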
You can add another column to table B that shows whether each ID is also in table A. Here is code that does it (assuming dfA and dfB denote tables A and B):
dfB <- within(dfB, in_dfA <- UniqueID %in% dfA$UniqueID)
gives
> dfB
UniqueID Count in_dfA
1 abc123 1 TRUE
2 zyx456 1 TRUE
3 zzzzz123 2 FALSE
DATA
dfA <- structure(list(UniqueID = structure(c(2L, 3L, 1L), .Label = c("888aaaa",
"abc123", "zyx456"), class = "factor"), Count = c(1L, 15L, 4L
)), class = "data.frame", row.names = c(NA, -3L))
dfB <- structure(list(UniqueID = structure(1:3, .Label = c("abc123",
"zyx456", "zzzzz123"), class = "factor"), Count = c(1L, 1L, 2L
), in_dfA = c(TRUE, TRUE, FALSE)), row.names = c(NA, -3L), class = "data.frame")
How about using the %in% operator to see which are in both versus those that are not:
library(tibble)
library(tidyverse)
df1 <- tribble(~ID, ~Count,
'abc', 1,
'zyx', 15,
'other', 3)
df2 <- tribble(~ID, ~Count,
'abc', 2,
'zyx', 33,
'another', 334)
match <- df2[which(df2$ID %in% df1$ID),'ID']
notmatch <- df2[which(!(df2$ID %in% df1$ID)),'ID']
This outputs two comparisons that you can use to check for values in a function and pass errors if need be:
match
# A tibble: 2 x 1
ID
<chr>
1 abc
2 zyx
notmatch
# A tibble: 1 x 1
ID
<chr>
1 another
You could do an update join to see which IDs are/aren't in the first table
tblb[tbla, on = 'UniqueID', in_tbla := i.UniqueID
][, in_tbla := !is.na(in_tbla)]
tblb
# UniqueID Count in_tbla
# 1: abc123 1 TRUE
# 2: zyx456 1 TRUE
# 3: zzzzz123 2 FALSE
Not sure if that's any better than @Onyambu's suggestion though (same output)
tblb[, in_tbla := UniqueID %in% tbla$UniqueID]
Data used:
tbla <- fread('
UniqueID Count
abc123 1
zyx456 15
888aaaa 4
')
tblb <- fread('
UniqueID Count
abc123 1
zyx456 1
zzzzz123 2
')
My data frame has 8211 observations, but the following is a simplified example. If I have the following data frame in R:
Var1 Freq
a/b/e 1
b/a/e 2
a/c/d 3
d/c/a 1
How can I obtain the following data frame:
Var1 Freq
a/b/e 3
a/c/d 4
Here is a way
df1[, "Var1"] <- sapply(strsplit(df1$Var1, "/"), function(x) paste0(sort(x), collapse = "/"))
aggregate(Freq ~ Var1, df1, FUN = sum)
# Var1 Freq
#1 a/b/e 3
#2 a/c/d 4
We use strsplit to split column Var1 on "/". This returns a list of character vectors which we sort, paste back together and later aggregate.
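To make the intermediate steps visible, here is a short sketch that runs the split, sort and paste stages separately on the Var1 values from the question. The variable names x, parts and canonical are only illustrative.

```r
# Var1 values from the question.
x <- c("a/b/e", "b/a/e", "a/c/d", "d/c/a")

# Step 1: split each string on "/" into a list of character vectors.
parts <- strsplit(x, "/", fixed = TRUE)

# Step 2: sort the pieces of each vector and paste them back together,
# so that every permutation of the same letters gets the same key.
canonical <- vapply(parts,
                    function(p) paste(sort(p), collapse = "/"),
                    character(1))

print(canonical)
```

Once every row carries the canonical key, summing Freq by that key is an ordinary grouped aggregation.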
data
df1 <- structure(list(Var1 = c("a/b/e", "b/a/e", "a/c/d", "d/c/a"),
Freq = c(1L, 2L, 3L, 1L)), .Names = c("Var1", "Freq"), row.names = c(NA,
-4L), class = "data.frame")
I am working on a dataset in R, where WO can have values "K" and "B". I want to have the WO be returned where the frequency per WO does not match between the "K" and "B" records. For example the following table:
df <- structure(list(WO = c(917595L, 917595L, 1011033L, 1011033L),
Invoice = c("B", "K", "B", "K"), freq = c(3L, 6L, 2L, 2L)),
.Names = c("WO", "Invoice", "freq"),
class = "data.frame", row.names = c(NA, -4L)
)
I want 917595 returned because 3 does not equal 6. However, 1011033 should not be returned because its frequencies match.
Reshaping the data lets you do a comparison on the frequency values.
library(dplyr)
library(reshape2)
dframe <-
"WO,Invoice,freq
917595,B,3
917595,K,6
1011033,B,2
1011033,K,2" %>%
read.csv(text = .,
stringsAsFactors = FALSE)
dcast(dframe,
WO ~ Invoice,
value.var = "freq") %>%
filter(B != K)
We could do it with base R using duplicated
df1[!(duplicated(df1[c(1, 3)])|duplicated(df1[c(1,3)], fromLast = TRUE)),]
# WO Invoice freq
#1 917595 B 3
#2 917595 K 6
Or another option is to group by 'WO' and check if the number of unique elements in 'freq' is greater than 1
library(data.table)
setDT(df1)[, if(uniqueN(freq)>1) .SD, WO]
# WO Invoice freq
#1: 917595 B 3
#2: 917595 K 6
I have a data frame in R like the following:
Group.ID status
1 1 open
2 1 open
3 2 open
4 2 closed
5 2 closed
6 3 open
I want to count the number of group IDs for which every status is "open". For example, Group ID 1 has two observations, and their statuses are both "open", so that's one for my count. Group ID 2 does not count because not all of its statuses are open.
I can count the rows or the group IDs under conditions. However I don't know how to apply "all status equal to one value for a group" logic.
DATA.
df1 <-
structure(list(Group.ID = c(1, 1, 2, 2, 2, 3), status = structure(c(2L,
2L, 2L, 1L, 1L, 2L), .Label = c("closed", "open"), class = "factor")), .Names = c("Group.ID",
"status"), row.names = c(NA, -6L), class = "data.frame")
Here are two solutions, both using base R: a more complicated one with aggregate and a simpler one with tapply. If you just want the total count of Group.ID values matching your request, I suggest that you use the second solution.
agg <- aggregate(status ~ Group.ID, df1, function(x) as.integer(all(x == "open")))
sum(agg$status)
#[1] 2
sum(tapply(df1$status, df1$Group.ID, FUN = function(x) all(x == "open")))
#[1] 2
A dplyr solution:
library(dplyr)
df1 %>%
  group_by(Group.ID) %>%
  # TRUE for a group only when every row in it is "open"
  summarise(all_open = all(status == "open")) %>%
  summarise(n = sum(all_open))