I have two directories. The first is named "model" and the second "test". Both directories contain the same list of files, but with different content, and the same total number of files (37).
Here is an example of the content of one file from each directory.
First file from model directory
Name file : Model_A5B45
data
1 papaya | durian | orange | grapes
2 orange
3 grapes
4 banana | durian
5 tomato
6 apple | tomato
7 apple
8 mangostine
9 strawberry
10 strawberry | mango
dput output :
structure(list(data = structure(c(7L, 6L, 4L, 3L, 10L, 2L, 1L,
5L, 8L, 9L), .Label = c("apple", "apple | tomato", "banana | durian",
"grapes", "mangostine ", "orange", "papaya | durian | orange | grapes",
"strawberry", "strawberry | mango", "tomato"), class = "factor")), .Names = "data", class = "data.frame", row.names = c(NA,
-10L))
Second file in test directory
Name file: Test_A5B45
data
1 apple
2 orange | apple | mango
3 apple
4 banana
5 grapes
6 papaya
7 durian
8 tomato | orange | papaya | durian
dput output:
structure(list(data = structure(c(1L, 5L, 1L, 2L, 4L, 6L, 3L,
7L), .Label = c("apple", "banana", "durian", "grapes", "orange | apple | mango",
"papaya", "tomato | orange | papaya | durian"), class = "factor")), .Names = "data", class = "data.frame", row.names = c(NA,
-8L))
I want to calculate, for each file in the test directory, the percentage of rows that also appear in a file in the model directory (intersect) and the percentage that do not (except).
Here is my code for a single pair of files (Model_A5B45 and Test_A5B45):
library(dplyr)
data_test <- read.csv("Test_A5B45")
data_model <- read.csv("Model_A5B45")
intersect <- semi_join(data_test,data_model)
except <- anti_join(data_test,data_model)
except_percentage <- (nrow(except)/nrow(data_test))*100
intersect_percentage <- (nrow(intersect)/nrow(data_test))*100
sprintf("%s/%s",intersect_percentage,except_percentage)
Output : "37.5/62.5"
My question: I want to apply this code to all files (looping over both directories) so that the output looks like a confusion matrix.
Example of my expected output:
## y
## Model_A5B45 Model_A6B46 Model_A7B47
## Test_A5B45 37.5/62.5 value value
## Test_A6B46 value value value
## Test_A7B47 value value value
My answer:
I've written code that can process all the files, but I still don't know how to make the output look like a confusion matrix.
This is my code (I don't know whether it is efficient; I use a for loop):
f_performance_testing <- function(data_model_path, data_test_path){
library(dplyr)
data_model <- read.csv(data_model_path, header=TRUE)
data_test <- read.csv(data_test_path, header=TRUE)
intersect <- semi_join(data_test,data_model)
except <- anti_join(data_test,data_model)
except_percentage <- (nrow(except)/nrow(data_test))*100
intersect_percentage <- (nrow(intersect)/nrow(data_test))*100
return(list("intersect"=intersect_percentage,"except"=except_percentage))
}
for (model in model_list){
for (test in test_list){
result <- f_performance_testing(model,test)
intersect_percentage <- round(result$intersect,3)
except_percentage <- round(result$except,3)
final_output <- sprintf("intersect : %s | except : %s",intersect_percentage,except_percentage)
cat(print(paste(substring(model,57),substring(test,56), final_output,sep=",")),file="outfile.txt",append=TRUE,"\n")
print("Writing to file.......")
}
}
The output is:
Model_A5B45,Test_A5B45, 37.5/62.5
Model_A5B45,Test_A6B46, value
Model_A5B45,Test_A7B47, value
Model_A6B46,......
Model_A7B47,.....
...............
......
....
How can I transform this output into a confusion-matrix-style table?
This won't answer your question directly, but hopefully gives you enough information to arrive at your own solution.
I would recommend creating a function like the following:
myFun <- function(model, test, datasource) {
model <- datasource[[model]]
test <- datasource[[test]]
paste(rev(mapply(function(x, y) (x/y)*100,
lapply(split(test, test %in% model), length),
length(test))),
collapse = "/")
}
This function is to be used with a two-column data.frame, where the columns represent all the combinations of "test" and "model" values (why work with a data.frame structure when a character vector would suffice?)
Here's an example of such a data.frame (other sample data is found at the end of the answer).
models <- c("model_1", "model_2", "model_3")
tests <- c("test_1", "test_2", "test_3")
A <- expand.grid(models, tests, stringsAsFactors = FALSE)
Next, create a named list of your models and tests. If you've read your data in using lapply, it is likely you might have names to work with anyway.
dataList <- mget(c(models, tests))
Now, calculate the relevant values. Here, we can use apply to cycle through each row and perform the relevant calculation.
A$value <- apply(A, 1, function(x) myFun(x[1], x[2], dataList))
Finally, you reshape the data from a "long" form to a "wide" form.
reshape(A, direction = "wide", idvar = "Var1", timevar = "Var2")
# Var1 value.test_1 value.test_2 value.test_3
# 1 model_1 75/25 100 75/25
# 2 model_2 50/50 50/50 62.5/37.5
# 3 model_3 62.5/37.5 50/50 87.5/12.5
Here's some sample data. Note that they are basic character vectors and not data.frames.
set.seed(1)
sets <- c("A", "A|B", "B", "C", "A|B|C", "A|C", "D", "A|D", "B|C", "B|D")
test_1 <- sample(sets, 8, TRUE)
model_1 <- sample(sets, 10, TRUE)
test_2 <- sample(sets, 8, TRUE)
model_2 <- sample(sets, 10, TRUE)
test_3 <- sample(sets, 8, TRUE)
model_3 <- sample(sets, 10, TRUE)
In a real world application, you would probably do something like:
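# full.names = TRUE returns paths that read.csv can open from any working directory
testFiles <- list.files(path = "path/to/test/files", full.names = TRUE)
modelFiles <- list.files(path = "path/to/model/files", full.names = TRUE)
testList <- lapply(testFiles, function(x) read.csv(x, stringsAsFactors = FALSE)$data)
modelList <- lapply(modelFiles, function(x) read.csv(x, stringsAsFactors = FALSE)$data)
# name the elements by file name so that myFun can look them up by name
names(testList) <- basename(testFiles)
names(modelList) <- basename(modelFiles)
dataList <- c(testList, modelList)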
But, this is pure speculation on my part based on what you've shared in your question as working code (for example, csv files with no file extension).
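As a rough base R alternative to the expand.grid/reshape route, the same table can be built directly as a character matrix with nested sapply calls. This is only a sketch: the toy vectors, the names Model_X, Model_Y, Test_X and Test_Y, and the helper pct are all hypothetical stand-ins for the "data" column of each file.

```r
# Hypothetical toy data standing in for the "data" column of each file
dataList <- list(
  Model_X = c("apple", "orange", "grapes"),
  Model_Y = c("banana", "tomato"),
  Test_X  = c("apple", "banana", "grapes", "papaya"),
  Test_Y  = c("tomato", "durian")
)
models <- c("Model_X", "Model_Y")
tests  <- c("Test_X", "Test_Y")

# Percentage of test entries found / not found in the model, as "found/notfound"
pct <- function(test, model) {
  hit <- mean(dataList[[test]] %in% dataList[[model]]) * 100
  paste(hit, 100 - hit, sep = "/")
}

# Inner sapply runs over tests (rows), outer sapply over models (columns)
res <- sapply(models, function(m) sapply(tests, pct, model = m))
res
#        Model_X Model_Y
# Test_X "50/50" "25/75"
# Test_Y "0/100" "50/50"
```

Because both sapply calls run over character vectors, the row and column names come for free, which gives the tests-by-models layout asked for in the question.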
I have two lists of dataframes. One list of dataframes is structured as follows:
data1
Label Pred n
1 Mito-0001_Series007_blue.tif Pear 10
2 Mito-0001_Series007_blue.tif Orange 223
3 Mito-0001_Series007_blue.tif Apple 890
4 Mito-0001_Series007_blue.tif Peach 34
And repeats with different numbers e.g.
Label Pred n
1 Mito-0002_Series007_blue.tif Pear 90
2 Mito-0002_Series007_blue.tif Orange 127
3 Mito-0002_Series007_blue.tif Apple 76
4 Mito-0002_Series007_blue.tif Peach 344
The second list of dataframes is structured like this:
data2
Slice Area
Mask of Mask-0001Series007_blue-1.tif. 789.21
etc
Question
I want to
Make the row names match up by:
a) Remove the "Mito-" from data1
b) Remove the "Mask of Mask-" from data 2
c) Remove the "-1" towards the end of data 2
Keeping in mind that this is a list of dataframes.
So far:
I have used the information from the post named "How can I remove certain part of row names in data frame"
They suggest using
data2$Slice <- sub("Mask of Mask-", "", data2$Slice)
Which obviously isn't working for the list of dataframes. It returns a blank character
character(0)
Thanks in advance, I have been amazed at how great people are at answering questions on this site :)
First, we could define a function f that applies gsub with a regex that fits for all.
f <- \(x) gsub('.*(\\d{4}_?Series\\d{3}_blue).*(\\.tif)?\\.?', '\\1\\2', x)
Explanation:
.* any single character, repeatedly
\\d{4} four digits
_? underscore, if available
Series literally
(...) capture group (they get numbered internally)
\\. a period (needs to be escaped, otherwise we say "any character")
\\1 capture group 1
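To see the capture group in isolation, here is a minimal sketch using a simplified pattern (not the exact one above) on two strings from the question. The last step, which drops the underscore so that both naming schemes produce the same key, is an assumption about how the names should be reconciled; core and key are hypothetical names.

```r
x <- c("Mito-0001_Series007_blue.tif",
       "Mask of Mask-0001Series007_blue-1.tif.")

# Replace the whole string with the text matched by capture group 1
core <- sub(".*(\\d{4}_?Series\\d{3}_blue).*", "\\1", x)
core
# [1] "0001_Series007_blue" "0001Series007_blue"

# Assumption: remove the optional underscore so the two sources share one key
key <- sub("_Series", "Series", core, fixed = TRUE)
key
# [1] "0001Series007_blue" "0001Series007_blue"
```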
Test the regex
## test it
(x <- c(names(data1), data1[[1]]$Label, data2$Slice))
# [1] "Mito-0001_Series007_blue" "Mito-0002_Series007_blue"
# [3] "Mito-0001_Series007_blue.tif" "Mito-0001_Series007_blue.tif"
# [5] "Mito-0001_Series007_blue.tif" "Mito-0001_Series007_blue.tif"
# [7] "Mask of Mask-0001Series007_blue-1.tif."
f(x)
# [1] "0001_Series007_blue" "0002_Series007_blue" "0001_Series007_blue" "0001_Series007_blue"
# [5] "0001_Series007_blue" "0001_Series007_blue" "0001Series007_blue"
Seems to work, so we can apply it.
names(data1) <- f(names(data1))
data1 <- lapply(data1, \(x) {x$Label <- f(x$Label); x})
data2$Slice <- f(data2$Slice)
data1
# $`0001_Series007_blue`
# Label Pred n
# 1 0001_Series007_blue Pear 10
# 2 0001_Series007_blue Orange 223
# 3 0001_Series007_blue Apple 890
# 4 0001_Series007_blue Peach 34
#
# $`0002_Series007_blue`
# Label Pred n
# 1 0002_Series007_blue Pear 90
# 2 0002_Series007_blue Orange 127
# 3 0002_Series007_blue Apple 76
# 4 0002_Series007_blue Peach 344
data2
# Slice Area
# 1 0001Series007_blue 789.21
Data:
data1 <- list(`Mito-0001_Series007_blue` = structure(list(Label = c("Mito-0001_Series007_blue.tif",
"Mito-0001_Series007_blue.tif", "Mito-0001_Series007_blue.tif",
"Mito-0001_Series007_blue.tif"), Pred = c("Pear", "Orange", "Apple",
"Peach"), n = c(10L, 223L, 890L, 34L)), class = "data.frame", row.names = c("1",
"2", "3", "4")), `Mito-0002_Series007_blue` = structure(list(
Label = c("Mito-0002_Series007_blue.tif", "Mito-0002_Series007_blue.tif",
"Mito-0002_Series007_blue.tif", "Mito-0002_Series007_blue.tif"
), Pred = c("Pear", "Orange", "Apple", "Peach"), n = c(90L,
127L, 76L, 344L)), class = "data.frame", row.names = c("1",
"2", "3", "4")))
data2 <- structure(list(Slice = "Mask of Mask-0001Series007_blue-1.tif.",
Area = 789.21), class = "data.frame", row.names = c(NA, -1L
))
Using the given info
The answer by @jay.sf was really helpful, but it only worked for data1 and not for data2, which in my case is also a list of dataframes. To apply it to data2 as well, I added the extra lines of code below:
#Old code
f <-function(x) gsub('.*(\\d{4}_?Series\\d{3}_blue).*(\\.tif)?\\.?', '\\1\\2', x)
#I added the [[1]] after data2 as well
(x <- c(names(data1), data1[[1]]$Label, data2[[1]]$Slice))
f(x)
names(data1) <- f(names(data1))
data1 <- lapply(data1, function(x) {x$Label <- f(x$Label); x})
# This line of code was causing problems, so I removed it
# data2$Slice <- f(data2$Slice)
#And added the following to apply it to data 2
names(data2) <- f(names(data2))
data2 <- lapply(data2, function(x) {x$Slice <- f(x$Slice); x})
I have a large (>200000 observations) flat file dataframe which has multiple "paired" codes throughout it. For each pair, one column contains a numerical code, the second is a description of the code. I have set both the codes and descriptions to "factors".
An example of the dataframe is below
|-------------|---------------|---------------|-------------|---------------|---------
| ID | Unit_CD | Unit | Name_CD | Name | etc
|-------------|---------------|---------------|-------------|---------------|---------
| 01 | 12 | Bob | A01 | EPID | etc
| 02 | 10 | Sue | A04 | UPIM | etc
| 03 | 12 | Bob | V03 | AVRM | etc
| 04 | 14 | Moo | A04 | UPIM | etc
I would like to create a function where you can input the 2 paired column names and it will return a concatenated field which displays the numeric code and the description as per below:
'code.names(df,Unit_CD,Unit)'
OUTPUT:
Unit Codes
10: Sue
12: Bob
14: Moo
I have written the following code however I can not get it to accept the column names as an input to the function:
code.names <- function(df,column1, column2){
n <-count(df,column1,column2)
CD.V <- as.vector(n[,1])
CD.Code <- as.vector(n[,2])
i <- nrows(n)
for (i in 1:n){
paste(CD.V[i],CD.Code[i])
}
}
The error I am getting is
Error: Must group by variables found in `.data`.
* Column `column1` is not found.
* Column `column2` is not found.
As I am doing this multiple times through the code, I would prefer to set this up as a function, however any other method of achieving my end goal would still be appreciated.
Hope the code below works for your goal
code.names <- function(df,column1, column2) unique(paste0(df[[column1]],":",df[[column2]]))
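The one-liner works because the column names are passed as character strings and looked up with [[ ]], rather than being passed unquoted. A small usage sketch follows, using a cut-down version of the example data. The variant code_names_sorted is a hypothetical addition that orders the result by code, as in the desired output.

```r
df <- data.frame(Unit_CD = c(12L, 10L, 12L, 14L),
                 Unit    = c("Bob", "Sue", "Bob", "Moo"),
                 stringsAsFactors = FALSE)

code.names <- function(df, column1, column2) {
  unique(paste0(df[[column1]], ":", df[[column2]]))
}
code.names(df, "Unit_CD", "Unit")   # column names are quoted strings
# [1] "12:Bob" "10:Sue" "14:Moo"

# Hypothetical variant: keep unique pairs, then order them by the code column
code_names_sorted <- function(df, column1, column2) {
  u <- unique(df[c(column1, column2)])
  u <- u[order(u[[column1]]), ]
  paste0(u[[column1]], ": ", u[[column2]])
}
code_names_sorted(df, "Unit_CD", "Unit")
# [1] "10: Sue" "12: Bob" "14: Moo"
```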
An option with unite
library(dplyr)
library(tidyr)
df %>%
distinct(Unit_CD, Unit) %>%
unite(New, Unit_CD, Unit, sep=": ")
# New
#1 12: Bob
#2 10: Sue
#3 14: Moo
data
df <- structure(list(ID = 1:4, Unit_CD = c(12L, 10L, 12L, 14L), Unit = c("Bob",
"Sue", "Bob", "Moo"), Name_CD = c("A01", "A04", "V03", "A04"),
Name = c("EPID", "UPIM", "AVRM", "UPIM")), class = "data.frame",
row.names = c(NA,
-4L))
I would suggest this approach with a new function using your data and the names of the columns to be concatenated:
#Data
df <- structure(list(ID = 1:4, Unit_CD = c(12L, 10L, 12L, 14L), Unit = c("Bob",
"Sue", "Bob", "Moo"), Name_CD = c("A01", "A04", "V03", "A04"),
Name = c("EPID", "UPIM", "AVRM", "UPIM")), class = "data.frame", row.names = c(NA,
-4L))
Code:
#Function
myfun <- function(x,cola,colb)
{
var <- paste0(x[,cola],': ',x[,colb])
var <- unique(var)
var <- data.frame(var)
return(var)
}
#Apply
myfun(df,'Unit_CD', 'Unit')
Output:
var
1 12: Bob
2 10: Sue
3 14: Moo
You can use duplicated to keep only unique values in the dataframe.
code.names <- function(df,column1, column2) {
df1 <- df[!duplicated(df[c(column1, column2)]), ]
cat(paste(df1[[column1]], df1[[column2]], sep = ":", collapse = "\n"))
}
code.names(df, 'Unit_CD','Unit')
#12:Bob
#10:Sue
#14:Moo
I am trying to extract information from more than 2 columns (2 columns are given as an example below) using a list, creating another column that contains the string from the list found in one of the columns, with a specified order in which the columns are searched. The example below shows the data and the desired output.
A<-c("This contains NYU", "This has NYU", "This has XT", "This has FIT",
"Something something UNH","I got into UCLA","Hello XT")
B<-c("NYU","UT","USC","FIT","UNA","UCLA", "CA")
data<-data.frame(A,B)
list <- c("NYU","FIT","UCLA","CA","UT","USC")
A B
1 This contains NYU NYU
2 This has NYU UT
3 This has XT USC
4 This has FIT FIT
5 Something something UNH UNA
6 I got into UCLA UCLA
7 Hello XT CA
I would want the code to search from the list and look in column A first and if it cannot find the string then look in column B and if not then give null. By looking at the list, I would like the desired output to look like the below.
A B C
1 This contains NYU NYU NYU
2 This has NYU UT NYU
3 This has XT USC USC
4 This has FIT FIT FIT
5 Something something UNH UNA <NA>
6 I got into UCLA UCLA UCLA
7 Hello XT CA CA
You can turn your list into a regular expression and then apply R's regexpr function:
expr <- paste0(list,collapse = "|")
# expr = "NYU|FIT|UCLA|CA|UT|USC" -> Reg expr means NYU or FIT or ......
data[,"C"] <- ""
cols <- rev(names(data)[-(which(names(data)=="C"))])
for(c in cols) {
index <- regexpr(expr,data[,c])
data[,"C"] <- ifelse(index != -1,substr(data[,c],index,index + attr(index,"match.length")-1),data[,"C"])
}
Hope that helps.
Gottavianoni
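A variation on the same regexpr idea, sketched in base R, returns NA instead of an empty string when nothing matches. It also wraps the alternatives in word boundaries so that, for example, "CA" is not picked up inside a longer word. The names keys and first_match are hypothetical, and the word-boundary behaviour is an assumption about what is wanted.

```r
A <- c("This contains NYU", "This has NYU", "This has XT", "This has FIT",
       "Something something UNH", "I got into UCLA", "Hello XT")
B <- c("NYU", "UT", "USC", "FIT", "UNA", "UCLA", "CA")
data <- data.frame(A, B, stringsAsFactors = FALSE)
keys <- c("NYU", "FIT", "UCLA", "CA", "UT", "USC")

# One pattern for all keys, matched as whole words only
expr <- paste0("\\b(", paste(keys, collapse = "|"), ")\\b")

# First key found in each string, or NA when there is none
first_match <- function(x) {
  m <- regexpr(expr, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

fromA <- first_match(data$A)
fromB <- first_match(data$B)
# Prefer column A, fall back to column B
data$C <- ifelse(is.na(fromA), fromB, fromA)
data$C
# [1] "NYU"  "NYU"  "USC"  "FIT"  NA     "UCLA" "CA"
```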
Another approach could be
#common between column A & vector l
C_tempA <- sapply(df$A, function(x) intersect(strsplit(as.character(x), split = " ")[[1]], l))
#common between column B & vector l
C_tempB <- sapply(df$B, function(x) intersect(as.character(x), l))
#column C calculation
df$C <- ifelse(C_tempA=="character(0)", C_tempB, C_tempA)
df$C[df$C=="character(0)"] <- NA
#final dataframe
df
Output is:
A B C
1 This contains NYU NYU NYU
2 This has NYU UT NYU
3 This has XT USC USC
4 This has FIT FIT FIT
5 Something something UNH UNA NA
6 I got into UCLA UCLA UCLA
7 Hello XT CA CA
Sample data:
df <- structure(list(A = structure(c(4L, 6L, 7L, 5L, 3L, 2L, 1L), .Label = c("Hello XT",
"I got into UCLA", "Something something UNH", "This contains NYU",
"This has FIT", "This has NYU", "This has XT"), class = "factor"),
B = structure(c(3L, 7L, 6L, 2L, 5L, 4L, 1L), .Label = c("CA",
"FIT", "NYU", "UCLA", "UNA", "USC", "UT"), class = "factor")), .Names = c("A",
"B"), row.names = c(NA, -7L), class = "data.frame")
l <- c("NYU","FIT","UCLA","CA","UT","USC")
Use library(tokenizers) from tokenizers package.
Merge two columns and create a new column with merged A and B
data$newC <- paste(data$A, data$B, sep = " " )
Then run the loop below, which extracts the values into a vector; you can then cbind that vector to the existing dataframe.
newcolumn <- 'X'
for (p in data$newC)
{
if (!is.na(p))
{
x <- which(is.element(unlist(tokenize_words(list, lowercase = TRUE)), unlist(tokenize_words(p, lowercase = TRUE, stopwords = NULL, simplify = FALSE))))
newcolumn <- append(newcolumn,ifelse(x[1]!= 0, list[x[1]], "NA"))
}
}
newcolumn <- newcolumn[-1]
newcolumn
data <- cbind(data, newcolumn)
Hope it helps.
With the code above I get the output you expected.
I've been trying to solve this issue with mapply, but I believe I will have to use several nested applies to make this work, and it has gotten real confusing.
The problem is as follows:
Dataframe one contains around 400 keywords. These fall into roughly 15 categories.
Dataframe two contains a string description field, and 15 additional columns, each named to correspond to the categories mentioned in dataframe one. This has millions of rows.
If a keyword from dataframe 1 exists in the string field in dataframe 2, the category in which the keyword exists should be flagged in dataframe 2.
What I want should look something like this:
> #Dataframe1 df1
>> keyword category
>> cat A
>> dog A
>> pig A
>> crow B
>> pigeon B
>> hawk B
>> catfish C
>> carp C
>> ...
>>
> #Dataframe2 df2
>> description A B C ....
>> false cat 1 0 0 ....
>> smiling pig 1 0 0 ....
>> shady pigeon 0 1 0 ....
>> dogged dog 2 0 0 ....
>> sad catfish 0 0 1 ....
>> hawkward carp 0 1 1 ....
>> ....
I tried to use mapply to get this to work but it fails, giving me the error "longer argument not a multiple of length of shorter". It also computes this only for the first string in df2. I haven't proceeded beyond this stage, i.e. attempting to get category flags.
> mapply(grepl, pattern = df1$keyword, x = df2$description)
Could anyone be of help? I thank you very much. I am new to R so it would also help if someone could mention some 'thumb rules' for turning loops into apply functions. I cannot afford to use loops to solve this as it would take way too much time.
There might be a more elegant way to do this but this is what I came up with:
## Your sample data:
df1 <- structure(list(keyword = c("cat", "dog", "pig", "crow", "pigeon", "hawk", "catfish", "carp"),
category = c("A", "A", "A", "B", "B", "B", "C", "C")),
.Names = c("keyword", "category"),
class = "data.frame", row.names = c(NA,-8L))
df2 <- structure(list(description = structure(c(2L, 6L, 5L, 1L, 4L,3L),
.Label = c("dogged dog", "false cat", "hawkward carp", "sad catfish", "shady pigeon", "smiling pig"), class = "factor")),
.Names = "description", row.names = c(NA, -6L), class = "data.frame")
## Load packages:
library(stringr)
library(dplyr)
library(tidyr)
## For each entry in df2$description count how many times each keyword
## is contained in it:
outList <- lapply(df2$description, function(description){
outDf <- data.frame(description = description,
value = vapply(stringr::str_extract_all(description, df1$keyword),
length, numeric(1)), category = df1$category)
})
## Combine to one long data frame and aggregate by category:
outLongDf<- do.call('rbind', outList) %>%
group_by(description, category) %>%
dplyr::summarise(value = sum(value))
## Reshape from long to wide format:
outWideDf <- tidyr::spread(data = outLongDf, key = category,
value = value)
outWideDf
# Source: local data frame [6 x 4]
# Groups: description [6]
#
# description A B C
# * <fctr> <dbl> <dbl> <dbl>
# 1 dogged dog 2 0 0
# 2 false cat 1 0 0
# 3 hawkward carp 0 1 1
# 4 sad catfish 1 0 1
# 5 shady pigeon 1 1 0
# 6 smiling pig 1 0 0
This approach, however also catches the "pig" in "pigeon" and the "cat" in "catfish". I don't know if this is what you want, though.
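If partial matches are not wanted, one option is to match whole words only. The base R sketch below wraps each keyword in word boundaries and then sums the counts per category. The names counts and flags are hypothetical. Note that whole-word matching also changes some results relative to the question's example: "dogged dog" counts 1 for A rather than 2, and "hawkward carp" no longer flags B.

```r
df1 <- data.frame(
  keyword  = c("cat", "dog", "pig", "crow", "pigeon", "hawk", "catfish", "carp"),
  category = c("A", "A", "A", "B", "B", "B", "C", "C"),
  stringsAsFactors = FALSE
)
descriptions <- c("false cat", "smiling pig", "shady pigeon",
                  "dogged dog", "sad catfish", "hawkward carp")

# Count whole-word occurrences of each keyword in each description
# (result: one row per description, one column per keyword)
counts <- sapply(df1$keyword, function(k) {
  m <- gregexpr(paste0("\\b", k, "\\b"), descriptions, perl = TRUE)
  vapply(m, function(z) sum(z > 0), integer(1))
})

# Sum keyword counts within each category, then put descriptions on the rows
flags <- t(rowsum(t(counts), group = df1$category))
rownames(flags) <- descriptions
flags
```

For millions of rows the same idea applies, but the loop over keywords is the part worth parallelizing or replacing with a tokenizer-based approach.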
What you are looking for is a so-called document-term-matrix (or dtm for short), which stems from NLP (Natural Language Processing). There are many options available. I prefer text2vec. This package is blazingly fast (I wouldn't be surprised if it outperformed the other solutions here by a large margin), especially in combination with tokenizers.
In your case the code would look something like this:
# Create the data
df1 <- structure(list(keyword = c("cat", "dog", "pig", "crow", "pigeon", "hawk", "catfish", "carp"),
category = c("A", "A", "A", "B", "B", "B", "C", "C")),
.Names = c("keyword", "category"),
class = "data.frame", row.names = c(NA,-8L))
df2 <- structure(list(description = structure(c(2L, 6L, 5L, 1L, 4L,3L),
.Label = c("dogged dog", "false cat", "hawkward carp", "sad catfish", "shady pigeon", "smiling pig"), class = "factor")),
.Names = "description", row.names = c(NA, -6L), class = "data.frame")
# load the libraries
library(text2vec) # to create the dtm
library(tokenizers) # to help creating the dtm
library(reshape2) # to reshape the data from wide to long
# 1. create the vocabulary from the keywords
vocabulary <- vocab_vectorizer(create_vocabulary(itoken(df1$keyword)))
# 2. create the dtm
dtm <- create_dtm(itoken(as.character(df2$description)), vocabulary)
# 3. convert the sparse-matrix to a data.frame
dtm_df <- as.data.frame(as.matrix(dtm))
dtm_df$description <- df2$description
# 4. melt to long format
df_result <- melt(dtm_df, id.vars = "description", variable.name = "keyword")
df_result <- df_result[df_result$value == 1, ]
# 5. combine the data, i.e., add category
df_final <- merge(df_result, df1, by = "keyword")
# keyword description value category
# 1 carp hawkward carp 1 C
# 2 cat false cat 1 A
# 3 catfish sad catfish 1 C
# 4 dog dogged dog 1 A
# 5 pig smiling pig 1 A
# 6 pigeon shady pigeon 1 B
Whatever the implementation, counting the number of matches per category needs k x d comparisons, where k is the number of keywords and d the number of descriptions.
There are a few tricks to solve this problem quickly and without a lot of memory:
Use vectorized operations. These can be performed a lot quicker than for loops. Note that lapply, mapply or vapply are just shorthand for for loops. I parallelize (see next) over the keywords so that the vectorization can be over the descriptions, which is the largest dimension.
Use parallelization. Making optimal use of your multiple cores speeds up the process at the cost of an increase in memory (since every core needs its own copy).
Example:
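library(stringr)  # for str_detect
library(parallel) # for mclapply (mc.cores > 1 is not supported on Windows)
keywords <- stringi::stri_rand_strings(400, 2)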
categories <- letters[1:15]
keyword_categories <- sample(categories, 400, TRUE)
descriptions <- stringi::stri_rand_strings(3e6, 20)
keyword_occurance <- function(word, list_of_descriptions) {
description_keywords <- str_detect(list_of_descriptions, word)
}
category_occurance <- function(category, mat) {
rowSums(mat[,keyword_categories == category])
}
list_keywords <- mclapply(keywords, keyword_occurance, descriptions, mc.cores = 8)
df_keywords <- do.call(cbind, list_keywords)
list_categories <- mclapply(categories, category_occurance, df_keywords, mc.cores = 8)
df_categories <- do.call(cbind, list_categories)
With my computer this takes 140 seconds and 14GB RAM to match 400 keywords in 15 categories to 3 million descriptions.
Supposing I have the following dataframes:
d1 <- data.frame(index = c(1,2,3,4), location = c('barn', 'house', 'restaurant', 'tomb'), random = c(5,3,2,1), different_col1 = c(66,33,22,11))
d2 <- data.frame(index = c(1,2,3,4), location = c('server', 'computer', 'home', 'dictionary'), random = c(1,7,2,9), differen_col2 = c('hi', 'there', 'different', 'column'))
What I am trying to do is get the location based on the index and what dataframe it is. So I have the following:
data <- data.frame(src = c('one', 'one', 'two', 'one', 'two'), index = c(1,4,2,3,2))
Where src indicates which dataframe the data should come from and index, the value in index from the index column.
src | index
-------------
one | 1
one | 4
two | 2
one | 3
two | 2
And I would like it to become:
src | index | location
-----------------------
one | 1 | barn
one | 4 | tomb
two | 2 | computer
one | 3 | restaurant
two | 2 | computer
Due to the size of my data I would like to avoid merge or comparable joins (sqldf, etc).
Here's one way to add a new column by reference using data.table:
require(data.table)
setDT(d1); setDT(d2); setDT(data) # convert all data.frames to data.tables
data[src == "one", location := d1[.SD, location, on="index"]]
data[src == "two", location := d2[.SD, location, on="index"]]
.SD stands for Subset of Data, and contains all the columns of data for the rows that match the condition provided in the i argument.
See the vignettes for more.
You can use match in the expression to the right of := as well instead of extracting location using a join. But it'd not be extensible if you'd want to match on multiple columns.
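For comparison, the match idea can be sketched in base R without any join. Only the index and location columns of d1 and d2 are reproduced here, and the helper vectors one and two are hypothetical names.

```r
d1 <- data.frame(index = 1:4,
                 location = c("barn", "house", "restaurant", "tomb"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(index = 1:4,
                 location = c("server", "computer", "home", "dictionary"),
                 stringsAsFactors = FALSE)
data <- data.frame(src = c("one", "one", "two", "one", "two"),
                   index = c(1, 4, 2, 3, 2),
                   stringsAsFactors = FALSE)

data$location <- NA_character_
one <- data$src == "one"
two <- data$src == "two"

# match() gives the row position of each index in the lookup table
data$location[one] <- d1$location[match(data$index[one], d1$index)]
data$location[two] <- d2$location[match(data$index[two], d2$index)]
data$location
# [1] "barn"       "tomb"       "computer"   "restaurant" "computer"
```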
library(dplyr)
mutate(data,
location = ifelse(src == "one",
as.character(d1[index, "location"]),
as.character(d2[index, "location"])))
output
src index location
1 one 1 barn
2 one 4 tomb
3 two 2 computer
4 one 3 restaurant
5 two 2 computer
data.table will help you to deal with Big Data much more efficiently.
You could either use match or a special data.table implementation of merge that's much faster than the merge of my original solution, as we discussed in the comments.
Here's an example:
require(data.table)
d1 <- data.frame(index = c(1,2,3,4), location = c('barn', 'house', 'restaurant', 'tomb'), random = c(5,3,2,1), different_col1 = c(66,33,22,11))
d2 <- data.frame(index = c(1,2,3,4), location = c('server', 'computer', 'home', 'dictionary'), random = c(1,7,2,9), differen_col2 = c('hi', 'there', 'different', 'column'))
mydata <- data.table(src = c('one', 'one', 'two', 'one', 'two'), index = c(1,4,2,3,2))
mydata.d1 <- mydata[mydata$src == "one",]
mydata.d2 <- mydata[mydata$src == "two",]
mydata.d1 <- merge(mydata.d1, d1, all.x = T, by = "index")
mydata.d2 <- merge(mydata.d2, d2, all.x = T, by = "index")
# If you want to keep the 'different column' values from d1 and d2:
mydata <- rbind(mydata.d1, mydata.d2, fill = T)
mydata
index src location random different_col1 differen_col2
1: 1 one barn 5 66 NA
2: 3 one restaurant 2 22 NA
3: 4 one tomb 1 11 NA
4: 2 two computer 7 NA there
5: 2 two computer 7 NA there
# If you don't want to keep those 'different column' values:
mydata <- rbind(mydata.d1[,.(index, src, location)], mydata.d2[,.(index, src, location)])
mydata
index src location
1: 1 one barn
2: 3 one restaurant
3: 4 one tomb
4: 2 two computer
5: 2 two computer
Base solution: use a character index to choose the correct dataframe, and then use mapply to handle submission of the multiple "parallel" arguments.
dput(dat)
structure(list(src = c("one", "one", "two", "one", "two"), X. = c("|",
"|", "|", "|", "|"), index = c(1L, 4L, 2L, 3L, 2L), location = structure(c(1L,
4L, 5L, 3L, 5L), .Label = c("barn", "house", "restaurant", "tomb",
"computer", "dictionary", "home", "server"), class = "factor")), .Names = c("src",
"X.", "index", "location"), row.names = c(NA, -5L), class = "data.frame")
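May need to use stringsAsFactors = FALSE to ensure that src is a character vector.
# dlist is not defined above; it is assumed to be a list of the two lookup dataframes, named after the src values
dlist <- list(one = d1, two = d2)
dat$location <- mapply(function(whichd,i) dlist[[whichd]][i,'location'], whichd=dat$src, i=dat$index)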
> dat
src X. index location
1 one | 1 barn
2 one | 4 tomb
3 two | 2 computer
4 one | 3 restaurant
5 two | 2 computer
>