I have two tables: coc_data and DT. The coc_data table contains co-occurrence frequencies for pairs of words. Its structure looks like this:
word1 word2 freq
1 A B 1
2 A C 2
3 A D 3
4 A E 2
Second table, DT contains frequencies for each word for different years, e.g.:
word year weight
1 A 1966 9
2 A 1967 3
3 A 1968 1
4 A 1969 4
5 A 1970 10
6 B 1966 9
In reality, coc_data currently has 150,000 rows and DT has about 450,000 rows. Below is R code that simulates both datasets.
# Prerequisites
library(data.table)
set.seed(123)
n <- 5
# Simulate co-occurrence data [coc_data]
words <- LETTERS[1:n]
# Times each word used
freq <- sample(10, n, replace = TRUE)
# Co-occurrence data.frame
coc_data <- setNames(data.frame(t(combn(words,2))),c("word1", "word2"))
coc_data$freq <- apply(combn(freq, 2), 2, function(x) sample(1:min(x), 1))
# Simulate frequency table [DT]
years <- (1965 + 1):(1965 + 5)
word <- sort(rep(LETTERS[1:n], 5))
year <- rep(years, 5)
weight <- sample(10, 25, replace = TRUE)
freq_data <- data.frame(word = word, year = year, weight = weight)
# Combine to data.table for speed
DT <- data.table(freq_data, key = c("word", "year"))
My task is to normalize the frequencies in the coc_data table according to the frequencies in the DT table, using the following function:
my_fun <- function(x, freq_data, years) {
  word1 <- x[1]
  word2 <- x[2]
  freq12 <- as.numeric(x[3])
  # sum of yearly weights for each word: a full scan of the lookup table per call
  freq1 <- sum(freq_data[word == word1 & year %in% years]$weight)
  freq2 <- sum(freq_data[word == word2 & year %in% years]$weight)
  ei <- (freq12^2) / (freq1 * freq2)
  return(ei)
}
Then I use the apply() function to apply my_fun to each row of the coc_data table:
apply(X = coc_data, MARGIN = 1, FUN = my_fun, freq_data = DT, years = years)
Because the DT lookup table is quite large, the whole mapping process takes very long. I wonder how I could improve my code to speed up the computation.
Since the years parameter is constant across the whole apply() call, you can compute the total frequency for every word first:
f <- aggregate(weight ~ word, data = DT, FUN = sum)
Now transform this into a hash, e.g.:
hs <- f$weight
names(hs) <- f$word
Now, in my_fun, use the precomputed frequencies by looking up hs[word] instead of scanning DT. This should be faster.
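A minimal sketch of that rewrite (my_fun2 is a hypothetical name; apply() passes each row as a character vector, so the word strings can index the named vector directly):
my_fun2 <- function(x, hs) {
  freq12 <- as.numeric(x[3])
  # named-vector lookups replace the per-row scans of DT
  freq12^2 / (hs[x[1]] * hs[x[2]])
}
ei <- apply(coc_data, 1, my_fun2, hs = hs)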
Even better, the whole computation can be vectorised in one expression (if word1 and word2 are factors rather than characters, wrap them in as.character() first, since indexing a named vector by a factor uses the integer codes, not the labels):
(coc_data$freq)^2 / (hs[coc_data$word1] * hs[coc_data$word2])
The data.table implementation of this would be:
# total weight per word
f <- DT[, sum(weight), word]
# named lookup vector: word -> total weight
vec <- setNames(f$V1, f$word)
# vectorised normalisation, added by reference
setDT(coc_data)[, freq_new := freq^2 / (vec[word1] * vec[word2])]
which gives the following result:
> coc_data
word1 word2 freq freq_new
1: A B 1 0.0014792899
2: A C 1 0.0016025641
3: A D 1 0.0010683761
4: A E 1 0.0013262599
5: B C 5 0.0434027778
6: B D 1 0.0011574074
7: B E 1 0.0014367816
8: C D 4 0.0123456790
9: C E 1 0.0009578544
10: D E 2 0.0047562426
Related
I have a table with "power" values in one column and team ids in the other, and a function that calculates the probability that team A beats team B. I want to build a matrix in R that applies this function to every combination of teams: the team ids as the rows and columns, and each cell holding the probability that the row team beats the column team. I am fairly new to R, so I'm not quite sure how to go about this.
In what may be the ugliest possible answer to this question, a brute-force way to achieve this is below, broken out step by step. Hopefully a more elegant coder than me can improve upon it. Since there is no sample data, I am not sure if this is exactly what you need.
set.seed(05062020)
# sample data
df <- data.frame(team = LETTERS[1:10],
                 power = runif(10))
# team power
# 1 A 0.06839351
# 2 B 0.99013777
# 3 C 0.65360185
# 4 D 0.87851168
# ....
# 8 H 0.83947190
# 9 I 0.17248571
# 10 J 0.21813885
# all possible combination of teams
df2 <- data.frame(expand.grid(df$team, df$team))
colnames(df2) <- c("team", "team2")
# > df2
# team team2
# 1 A A
# 2 B A
# 3 C A
#.....
# 98 H J
# 99 I J
# 100 J J
## add in power values
df3 <- merge(df2, df, by = "team")
colnames(df3) <- c("team1", "team", "power1")
df4 <- merge(df3, df, by = "team")
df4 <- df4[,c(2:1,3:4)] #rearrange
colnames(df4) <- c("team1", "team2", "power1", "power2")
# Define whatever function you want to use; this is just a dummy function.
# Note: apply() converts each row to character, so coerce the powers back to numeric.
test_fun <- function(team1, team2, p1, p2){
  p1 <- as.numeric(p1)
  p2 <- as.numeric(p2)
  if (p1 == p2) {
    NA
  } else if (p1 > p2) {
    paste0("Team ", team1, " wins")
  } else {
    paste0("Team ", team2, " wins")
  }
}
# apply across rows
df4$winner <- apply(df4, 1, function(x) test_fun(x[1], x[2], x[3], x[4]))
# team1 team2 power1 power2 winner
#1 A A 0.06839351 0.06839351 <NA>
#2 B A 0.99013777 0.06839351 Team B wins
#3 C A 0.65360185 0.06839351 Team C wins
#4 H A 0.83947190 0.06839351 Team H wins
#5 D A 0.87851168 0.06839351 Team D wins
# ......
#97 C J 0.65360185 0.21813885 Team C wins
#98 H J 0.83947190 0.21813885 Team H wins
#99 I J 0.17248571 0.21813885 Team J wins
#100 J J 0.21813885 0.21813885 <NA>
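If the goal is literally a matrix with team ids as row and column names, the long table above can be reshaped with name-based matrix indexing. A sketch, assuming df4 as built above:
teams <- sort(unique(as.character(df4$team1)))
mat <- matrix(NA_character_, nrow = length(teams), ncol = length(teams),
              dimnames = list(teams, teams))
# a two-column character matrix indexes cells by their dimnames
mat[cbind(as.character(df4$team1), as.character(df4$team2))] <- df4$winner
mat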
I currently have two .csv files that look like this:
File 1:
Attempt          Result
Intervention 1   B
Intervention 2   H
and File 2:
Name       Outcome 1   Outcome 2   Outcome 3
Sample 1   A           B           C
Sample 2   D           E           F
Sample 3   G           H           I
I would like to merge and align the two .csvs such that each row of File 1 is matched, by its "Result" cell, against any of the three "Outcome" columns in File 2, leaving blanks or NAs where there is no match.
Ideally, the merged result would look like this:
Attempt          Result   Name       Outcome 1   Outcome 2   Outcome 3
Intervention 1   B        Sample 1   A           B           C
                          Sample 2   D           E           F
Intervention 2   H        Sample 3   G           H           I
I've looked and only found answers for merging two .csv files with one common column. Any help would be very much appreciated.
I will assume that Result in File 1 is unique, since multiple File 1 rows with the same Result value (e.g. "B") would force us to consider extra columns in the final data frame. With that assumption:
Attempt <- c("Intervention 1", "Intervention 2")
Result <- c("B", "H")
df1 <- as.data.frame(cbind(Attempt, Result))
one <- c("Sample 1", "A", "B", "C")
two <- c("Sample 2", "D", "E", "F")
three <- c("Sample 3", "G", "H", "I")
df2 <- as.data.frame(rbind(one, two, three))
row.names(df2) <- 1:3
colnames(df2) <- c("Name", "Outcome 1", "Outcome 2", "Outcome 3")
# define NA vectors to fill
vec_at <- rep(NA, nrow(df2))
vec_res <- rep(NA, nrow(df2))
for (j in 1:nrow(df2)) {
  # which row of df1 has its Result among this row's outcomes?
  a <- which(is.element(df1$Result, df2[j, 2:4]))
  if (length(a) >= 1) {  # "a" may be empty if no element satisfies the condition
    vec_at[j] <- df1$Attempt[a]   # carry over the wanted information
    vec_res[j] <- df1$Result[a]
  }
}
# the desired data frame
desired_df <- as.data.frame(cbind(vec_at, vec_res, df2))
Output:
vec_at vec_res Name Outcome 1 Outcome 2 Outcome 3
1 Intervention 1 B Sample 1 A B C
2 <NA> <NA> Sample 2 D E F
3 Intervention 2 H Sample 3 G H I
I wonder if you could use fuzzyjoin for something like this.
Here, you can provide a custom function for matching between the two data.frames.
library(fuzzyjoin)
fuzzy_left_join(
  df2,
  df1,
  match_fun = NULL,
  multi_by = list(x = paste0("Outcome_", 1:3), y = "Result"),
  multi_match_fun = function(x, y) {
    y == x[, "Outcome_1"] | y == x[, "Outcome_2"] | y == x[, "Outcome_3"]
  }
)
Output
Name Outcome_1 Outcome_2 Outcome_3 Attempt Result
1 Sample_1 A B C Intervention_1 B
2 Sample_2 D E F <NA> <NA>
3 Sample_3 G H I Intervention_2 H
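If the desired column order (Attempt and Result first) matters, a plain column shuffle on the joined result restores it. A sketch, assuming the join result above was saved as res:
res[, c("Attempt", "Result", "Name", paste0("Outcome_", 1:3))]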
It's hard to explain, so I'll start with an example. I have some numeric columns (A, B, C). The column 'tmp' contains variable names of the numeric columns as concatenated strings:
set.seed(100)
A <- floor(runif(5, min=0, max=10))
B <- floor(runif(5, min=0, max=10))
C <- floor(runif(5, min=0, max=10))
tmp <- c('A','B,C','C','A,B','A,B,C')
df <- data.frame(A,B,C,tmp)
A B C tmp
1 3 4 6 A
2 2 8 8 B,C
3 5 3 2 C
4 0 5 3 A,B
5 4 1 7 A,B,C
Now, for each row, I want to use the variable names in tmp to select the values from the corresponding numeric columns with the same name(s). Then I want to keep only the rows where all the selected values are less than or equal to 3.
E.g. in the first row, tmp is A, and the corresponding value in column A is 3, i.e. keep this row.
Another example: in row 4, tmp is A,B. The corresponding values are A = 0 and B = 5. Since not all selected values are less than or equal to 3, this row is discarded.
Desired result:
A B C tmp
1 3 4 6 A
2 5 3 2 C
How can I perform such filtering?
This is a bit more complicated than I like and there might be a more elegant solution, but here we go:
#split tmp
col <- strsplit(df[["tmp"]], ",")
#create an index matrix
inds <- do.call(rbind, Map(data.frame, row = seq_along(col), col = col))
inds$col <- match(inds$col, names(df))
inds <- as.matrix(inds)
#check
chk <- m <- as.matrix(df[, names(df) != "tmp"])
mode(chk) <- "logical"
chk[] <- NA
chk[inds] <- m[inds] <= 3
sel <- apply(chk, 1, prod, na.rm = TRUE)
df[as.logical(sel),]
# A B C tmp
#1 3 4 6 A
#3 5 3 2 C
Not sure if it always works (and it probably isn't the best solution)... but it worked here:
library(dplyr)
library(tidyr)
library(stringr)
List <- vector("list")
for (i in 1:nrow(df)) {  # loop over rows (1:length(df) would loop over columns)
  tmpT <- as.vector(str_split(df$tmp[i], ",", simplify = TRUE))
  selec <- df %>%
    select(all_of(tmpT)) %>%
    slice(i) %>%
    filter_all(all_vars(. <= 3)) %>%
    unite(val, sep = ", ")
  if (nrow(selec) == 0) {
    tab <- NA
  } else {
    tab <- df[i, ]
  }
  List[[i]] <- tab
}
df2 <- do.call("rbind", List)
df2 <- df2[!is.na(df2$tmp), ]  # drop the NA placeholders left by discarded rows
This answer has some similarities with #Roland's, but here we work with the data in a 'longer' format:
# create row index
df$ri = seq_len(nrow(df))
# split the concatenated column
l <- strsplit(df$tmp, ',')
# repeat each row of the data with the lengths of the split string,
# bind with individual strings
d = cbind(df[rep(1:nrow(df), lengths(l)), ], x = unlist(l))
# use match to grab values from corresponding columns
d$val <- d[cbind(seq(nrow(d)), match(d$x, names(d)))]
# for each original row 'ri', check if all values are <= 3. use result to index data frame
d[as.logical(ave(d$val, d$ri, FUN = function(x) all(x <= 3))), ]
# A B C tmp ri x val
# 1 3 4 6 A 1 A 3
# 3 5 3 2 C 3 C 2
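For comparison, a more compact base-R sketch of the same row-wise filtering (assuming, as in the answers above, that every name in tmp is a column of df):
keep <- mapply(
  function(cols, i) all(df[i, strsplit(cols, ",")[[1]]] <= 3),
  as.character(df$tmp),  # the column names to check for this row
  seq_len(nrow(df))
)
df[keep, ]
mapply() walks tmp and the row index in lockstep, and strsplit() recovers the column names for each row.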
I want to turn combinations of columns into some kind of interpretable variable. There are 3 levels of a factor repeated in three columns, for each id. For all the combinations between the variables I would like to get a list, and once I have the list, I want to know how many times each combination occurs. For example, when q1 is 1 and q2 is 1, it should return "A"; then A appears XX times. Anyone with suggestions? Thanks!!
id <- 1:10
set.seed(1)
q1 <- sample(1:3, 10, replace=TRUE)
set.seed(2)
q2 <- sample(1:3, 10, replace=TRUE)
set.seed(2)
q3 <- sample(1:3, 10, replace=TRUE)
df <- data.frame(id,q1,q2,q3)
df
id q1 q2 q3
1 1 1 1 1
2 2 2 3 3
3 3 2 2 2
4 4 3 1 1
5 5 1 3 3
6 6 3 3 3
7 7 3 1 1
8 8 2 3 3
9 9 2 2 2
10 10 1 2 2
if df$q1=="1" & df$q2=="1" print A
if df$q1=="1" & df$q2=="2" print B
if df$q1=="1" & df$q2=="3" print C
if df$q1=="2" & df$q2=="3" print D
if df$q1=="2" & df$q2=="2" print E
if df$q1=="3" & df$q2=="3" print F
if df$q2=="1" & df$q2=="1" print G
if df$q2=="1" & df$q2=="2" print H
response <- save(print A, print B, print C and so on....)
length(A)
length(B)
and so on...
I think this should do what you want, using base R; I hope I understood your desired output. I basically combined each pair of columns into its own variable (comb.var[, i]), then combined that with each column-name pair to create another variable, output$fct, which represents each q-pair x value-pair combination. Finally I relabeled the new variable and counted the occurrences of each combination with summary().
code:
# dimensions of df
n = nrow(df) #rows
p = ncol(df) #columns
# unique pairs of q columns
pairs.n = choose(p - 1, 2) # number of unique pairs
pairs = combn(1:(p - 1), 2) # matrix of those pairs
# matrix of NAs of the proper size
comb.var <- matrix(NA, nrow = n, ncol = pairs.n)
for (combo in 1:ncol(pairs)) {
  i <- pairs[1, combo]
  j <- pairs[2, combo]
  # get the right 2 columns from df
  qi <- df[, i + 1]
  qj <- df[, j + 1]
  # combine into 1 variable
  comb.var[, combo] <- paste(qi, qj, sep = "_")
}
# clean up the output: turn comb.var into a vector and add id columns
output <- data.frame(id = rep(df$id, times = pairs.n),
                     qi = rep(pairs[1, ], each = n),
                     qj = rep(pairs[2, ], each = n),
                     val = as.vector(comb.var))
# combine variables again
output$fct = with(output, paste(qi, qj, val, sep = "."))
# count number of different outputs
uniq.n = length(unique(output$fct))
# re-label the factor
output$fct <- factor(output$fct, labels = LETTERS[1:uniq.n])
# count the group members
summary(output$fct)
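If only the q1/q2 pairing from the pseudocode is needed, a shorter sketch using interaction() yields the same kind of letter-coded counts (assuming the letters are simply labels for whichever distinct combinations occur):
combo <- interaction(df$q1, df$q2, sep = "_", drop = TRUE)
# relabel the observed combinations as A, B, C, ...
levels(combo) <- LETTERS[seq_along(levels(combo))]
table(combo)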
I'm trying to find the most efficient way of joining data from one dataframe into another. The idea is that I have a master data set (df) and a secondary dataset (lookup), and I want to append the data in the lookup table to the master data set.
Theoretical data as follows:
COLUMN_A <- 1:5
COLUMN_B <- 1:5
LOOKUP_COL <- letters[1:5]
df <- data.frame(COLUMN_A,COLUMN_B,LOOKUP_COL)
COLUMN_A COLUMN_B LOOKUP_COL
1 1 1 a
2 2 2 b
3 3 3 c
4 4 4 d
5 5 5 e
COLUMN_A <- 2*(1:5)
LOOKUP_COL <- letters[1:5]
SPARE_COL <- runif(5)
lookup <- data.frame(COLUMN_A,LOOKUP_COL,SPARE_COL)
COLUMN_A LOOKUP_COL SPARE_COL
1 1 a 0.6113499
2 2 b 0.3712987
3 3 c 0.3551038
4 4 d 0.6650248
5 5 e 0.2680611
This is how I've been doing it so far:
results <- merge(df,lookup,by='LOOKUP_COL')
Which provides me with:
LOOKUP_COL COLUMN_A.x COLUMN_B COLUMN_A.y SPARE_COL
1 a 1 1 1 0.6113499
2 b 2 2 2 0.3712987
3 c 3 3 3 0.3551038
4 d 4 4 4 0.6650248
5 e 5 5 5 0.2680611
So it seems that the entire lookup table has been merged into the master data, and SPARE_COL is surplus to requirements. How can I control which columns get passed into the master data? Essentially, I'm trying to understand how the functionality of an Excel VLOOKUP can be reproduced in R.
thanks
EDIT: This one uses SPARE_COL as the one to keep instead of COLUMN_A. If you have columns with the same name in different dataframes, the solution with indices will require that you rename them in one of the data frames before merging everything together.
Single column
You can do this by passing only the columns you want to merge to the function merge. Obviously you have to keep the columns used for merging in your selection. Taking your example, this becomes:
keep <- c('LOOKUP_COL','SPARE_COL')
results <- merge(df,lookup[keep],by='LOOKUP_COL')
And the result is
> results
LOOKUP_COL COLUMN_A COLUMN_B SPARE_COL
1 a 1 1 0.75670441
2 b 2 2 0.52122950
3 c 3 3 0.99338019
4 d 4 4 0.71904088
5 e 5 5 0.05405722
By selecting the columns first, you make merge work faster and you don't have to bother about finding the columns you want after the merge.
If speed is an issue and the merge is simple, you can speed things up by manually doing the merge using indices:
id <- match(df$LOOKUP_COL, lookup$LOOKUP_COL)
keep <- c('SPARE_COL')
results <- df
results[keep] <- lookup[id,keep, drop = FALSE]
This gives the same result, and gives a good speedup.
More columns
Let's create an example with 2 lookup columns first:
N <- 10000
COLUMN_A <- 1:N
COLUMN_B <- 1:N
LOOKUP_COL <- sample(letters[3:7], N, replace = TRUE)
LOOKUP_2 <- sample(letters[10:14], N, replace = TRUE)
df <- data.frame(COLUMN_A,COLUMN_B,LOOKUP_COL, LOOKUP_2)
COLUMN_A <- 2*(1:36)
LOOKUP_COL <- rep(letters[1:6], each = 6)
LOOKUP_2 <- rep(letters[10:15], times = 6)
SPARE_COL <- runif(36)
lookup <- data.frame(COLUMN_A,LOOKUP_COL, LOOKUP_2, SPARE_COL)
You can use merge again like this:
keep <- c('LOOKUP_COL','SPARE_COL', 'LOOKUP_2')
results <- merge(df,lookup[keep],by=c('LOOKUP_COL', 'LOOKUP_2'))
And you can use indices again. Before you match, you have to create the interaction between the lookup columns. You can do this using the function
interaction() for any number of lookup columns:
lookups <- c('LOOKUP_COL','LOOKUP_2')
id <- match(interaction(df[lookups]),
interaction(lookup[lookups]))
keep <- c('SPARE_COL')
results <- df
results[keep] <- lookup[id,keep, drop = FALSE]
Timing
In the test below the speedup is about a 6-fold for the two-column case:
test replications elapsed relative user.self sys.self user.child
1 code1() 100 6.30 6.117 6.30 0 NA
2 code2() 100 1.03 1.000 1.03 0 NA
sys.child
1 NA
2 NA
The code for testing:
N <- 10000
COLUMN_A <- 1:N
COLUMN_B <- 1:N
LOOKUP_COL <- sample(letters[3:7], N, replace = TRUE)
LOOKUP_2 <- sample(letters[10:14], N, replace = TRUE)
df <- data.frame(COLUMN_A,COLUMN_B,LOOKUP_COL, LOOKUP_2)
COLUMN_A <- 2*(1:36)
LOOKUP_COL <- rep(letters[1:6], each = 6)
LOOKUP_2 <- rep(letters[10:15], times = 6)
SPARE_COL <- runif(36)
lookup <- data.frame(COLUMN_A,LOOKUP_COL, LOOKUP_2, SPARE_COL)
code1 <- function(){
keep <- c('LOOKUP_COL','SPARE_COL', 'LOOKUP_2')
results <- merge(df,lookup[keep],by=c('LOOKUP_COL', 'LOOKUP_2'))
}
code2 <- function(){
lookups <- c('LOOKUP_COL','LOOKUP_2')
id <- match(interaction(df[lookups]),
interaction(lookup[lookups]))
keep <- c('SPARE_COL')
results <- df
results[keep] <- lookup[id,keep, drop = FALSE]
}
require(rbenchmark)
benchmark(code1(),code2())
For manipulating and merging dataframes, I suggest package dplyr:
library(dplyr)
df %>%
  left_join(lookup, by = c("LOOKUP_COL")) %>%
  select(LOOKUP_COL, COLUMN_A = COLUMN_A.x, COLUMN_B, COLUMN_C = COLUMN_A.y)
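Since the first thread above already uses data.table, its equivalent idiom may also be worth noting: an update join adds exactly the columns you ask for, by reference. A sketch, assuming LOOKUP_COL uniquely identifies the rows of lookup:
library(data.table)
setDT(df)
setDT(lookup)
# pull only SPARE_COL into df, matching rows on LOOKUP_COL (df is modified in place)
df[lookup, SPARE_COL := i.SPARE_COL, on = "LOOKUP_COL"]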