Mapping elements of a data frame by looping through another data frame - r

I have two R data frames with differing dimensions. However, both data frames have an id column.
df1 (nrow(df1) = 22308):
                             c1        c2        c3 pattern1.match
ENSMUSG00000000001_at 10.175115 10.175423 10.109524              0
ENSMUSG00000000003_at  2.133651  2.144733  2.106649              0
ENSMUSG00000000028_at  5.713781  5.714827  5.701983              0
df2 (nrow(df2) = 425):
                                   Genes Pattern.Count
ENSMUSG00000000276 ENSMUSG00000000276_at             1
ENSMUSG00000000876 ENSMUSG00000000876_at             1
ENSMUSG00000001065 ENSMUSG00000001065_at             1
ENSMUSG00000001098 ENSMUSG00000001098_at             1
I would like to loop through df2, find all genes that have Pattern.Count = 1, and flag them in the df1$pattern1.match column.
Basically, I would like to overwrite the Genes and pattern1.match fields with df2$Genes and df2$Pattern.Count. All the elements of df2$Pattern.Count are equal to one.
I wrote this function, but R freezes while looping through all these rows.
idcol <- ncol(df1)
return.frame.matches <- function(df1, df2, idcol) {
  for (i in 1:nrow(df1)) {
    for (j in 1:nrow(df2)) {
      if (df1[i, 1] == df2[j, 1]) {
        df1[i, idcol] <- 1
        break
      }
    }
  }
  return(df1)
}
Is there another way of doing that without almost killing the computer?

I'm not sure I get exactly what you are doing, but the following should at least get you closer.
The first column of df1 doesn't seem to have a name, are they rownames?
If so,
df1$Genes <- rownames(df1)
Then you could do a merge to create a new data frame with the genes you require:
merge(df1, subset(df2, Pattern.Count == 1))
Note that they are matched on the common column Genes. I'm not sure what you want to do with the pattern1.match column, but a subset on the df1 part of the merge can incorporate conditions on it.
Edit
Going by the extra information in the comments,
df1$pattern1.match <- as.numeric(df1$Genes %in% df2$Genes)
should achieve what you are looking for.
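For example, with a couple of made-up rows (hypothetical data, just to illustrate the idea):
df1 <- data.frame(Genes = c("g1", "g2", "g3"), pattern1.match = 0)
df2 <- data.frame(Genes = c("g2", "g3"), Pattern.Count = 1)
df1$pattern1.match <- as.numeric(df1$Genes %in% df2$Genes)
df1
#   Genes pattern1.match
# 1    g1              0
# 2    g2              1
# 3    g3              1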

Your sample data is not enough to play around with, but here is what I would start with:
dfm <- merge( df1, df2, by = idcol, all = TRUE )
dfm_pc <- subset( dfm, Pattern.Count == 1 )
I took the idcol from your code; I don't see it in the data shown.

Related

Improve R speed when using dynamic column names

I have some code that works as expected but is very inefficient and slow. I have two data frames: 1) one with 40,645 rows and 264 columns, where each column represents a KPI/dimension of some kind, and 2) one with 478,872 rows and 11 columns. DF1 is a wide data frame and DF2 is a long data frame. I need to merge the two, but cannot simply merge the columns since the data frames have different formats. Additionally, I need to combine two column values from DF2 to create the name for each new column in DF1. In the end I am accomplishing this with a loop, and the command doing the work is what slows the code down substantially:
DF1[[col_name_v]][index_v] <- col_val_v
# Additionally, these other methods appear just as slow:
DF1[index_v,"col1"] <- col_val_v
DF1[index_v,265] <- col_val_v
If I actually were to specify each of the column names manually like this, it works 10x+ faster:
DF1$col1[index_v] <- col_val1_v
DF1$col2[index_v] <- col_val2_v
DF1$col3[index_v] <- col_val3_v
#.... etc
The problem is that I need the code to be dynamic: there are many columns, and those column names may change over time, so I would prefer that the code learn the column names and apply them on the fly, to avoid frequent additions and changes to the code.
Data looks like the following:
DF1 (40645*264) - before adding new columns from the code below
Date_Time,YEAR,MON,DOW,WEEK,Location,ID,KPI1,KPI2,KPI3,...,KPI264
9/7/2020,2020,September,Monday,36,33,33001,43,0,2,...,10
DF2 (478872*11) - multiple rows to be merged as multiple columns to DF1
Date_Time,Location,ID,Technology,Cluster,Dimension,Variable,Formula,Value,Index
9/7/2020,33,33001,1,"OVERALL","NEWKPINAME1","SUM(N)/SUM(D)",2.8,0.003
9/7/2020,33,33001,1,"LOCATION","NEWKPINAME1","SUM(N)/SUM(D)",2.8,0.004
9/7/2020,33,33001,1,"GROUP1","NEWKPINAME1","SUM(N)/SUM(D)",2.8,0.002
Dimension and Variable are combined to create a new unique KPI name, which is added as a column, and the Index value is written to DF1 at that new column's location.
# Provides the number of new columns that need to be added to DF1
dim_v <- unique(DF2$Dimension)
var_v <- unique(DF2$Variable)
limit_v <- length(dim_v) * length(var_v)
# This part adds the new columns to DF1, with values defaulting to NA
index_v <- 1
while (index_v <= limit_v) {
  # Create the column name
  col_name_v <- paste(DF2$Dimension[index_v], DF2$Variable[index_v], sep = "_")
  # Add the column with default values of NA
  DF1[[col_name_v]] <- NA
  index_v <- index_v + 1
}
# This part writes the values of each DF2 Dimension_Variable KPI, stored in ROWS
# (12 per ID), to DF1 across 12 COLUMNS
# Merge the long DF2 ROW-based KPIs into the wide DF1 COLUMN-based KPIs
index_v <- 1
while (index_v <= nrow(DF1)) {
  # Identify the key fields we need for the lookup
  datetime_v <- as.Date(DF1$Date_Time[index_v])
  id_v <- DF1$ID[index_v]
  # Create a temp data frame for the related data
  data_df <- subset(DF2, Date_Time == datetime_v & ID == id_v)
  # Do we even have records?
  if (nrow(data_df) != 0) {
    # Cycle through and write each value
    index2_v <- 1
    while (index2_v <= nrow(data_df)) {
      # Create the column name
      col_name_v <- paste(data_df$Dimension[index2_v], data_df$Variable[index2_v], sep = "_")
      col_val_v <- data_df$Index[index2_v]
      # Write the value related to the column name
      DF1[[col_name_v]][index_v] <- col_val_v
      index2_v <- index2_v + 1
    }
  } else {
    print(sprintf("No records for Date: %s ID: %s", datetime_v, id_v))
  }
  index_v <- index_v + 1
}
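One way to avoid the per-cell assignment entirely is to pivot DF2 to wide form once and then join. A minimal sketch, assuming the column names shown above and one Index value per Date_Time/ID/Dimension/Variable combination (dplyr and tidyr are not part of the original post):
library(dplyr)
library(tidyr)
# Build each KPI column name once, then spread Index across the new columns
DF2_wide <- DF2 %>%
  mutate(kpi_name = paste(Dimension, Variable, sep = "_")) %>%
  select(Date_Time, ID, kpi_name, Index) %>%
  pivot_wider(names_from = kpi_name, values_from = Index)
# A single join replaces the row-by-row lookup and assignment
DF1_merged <- left_join(DF1, DF2_wide, by = c("Date_Time", "ID"))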

R: How do you subset all data-frames within a list?

I have a list of data-frames called WaFramesCosts. I want to simply subset it to show specific columns so that I can then export them. I have tried:
for (i in names(WaFramesCosts)) {
  WaFramesCosts[[i]][, c("Cost_Center", "Domestic_Anytime_Min_Used", "Department",
                         "Domestic_Anytime_Min_Used")]
}
but it returns the error of
Error in `[.data.frame`(WaFramesCosts[[i]], , c("Cost_Center", "Department", :
undefined columns selected
I also tried:
for (i in seq_along(WaFramesCosts)) {
  WaFramesCosts[[i]][, -which(names(WaFramesCosts[[i]]) %in% c("Cost_Center", "Domestic_Anytime_Min_Used", "Department",
                                                               "Domestic_Anytime_Min_Used"))]
}
but I get the same error. Can anyone see what I am doing wrong?
Side note: for reference, I used this:
for (i in seq_along(WaFramesCosts)) {
  t <- WaFramesCosts[[i]][, grepl("Domestic", names(WaFramesCosts[[i]]))]
  q <- subset(WaFramesCosts[[i]], select = c("Cost_Center", "Domestic_Anytime_Min_Used", "Department", "Domestic_Anytime_Min_Used"))
  WaFramesCosts[[i]] <- merge(q, t)
}
while attempting the same goal with a different approach and seemed to get closer.
Welcome back, Kootseeahknee. You are still incorrectly assuming that the last command of a for loop is implicitly returned at the end. If you want that behavior, perhaps you want lapply:
myoutput <- lapply(names(WaFramesCosts), function(i) {
  WaFramesCosts[[i]][, c("Cost_Center", "Domestic_Anytime_Min_Used", "Department", "Domestic_Anytime_Min_Used")]
})
The undefined columns selected error tells me that your assumptions about the datasets are not correct: at least one data frame is missing at least one of the columns. From your previous question (How to do a complex edit of columns of all data frames in a list?), I'm inferring that you want whichever of those columns happen to exist, rather than assuming every column is present in every data frame. For that, you could/should be using grep or some variant:
myoutput <- lapply(names(WaFramesCosts), function(i) {
  WaFramesCosts[[i]][, grep("(Cost_Center|Domestic_Anytime_Min_Used|Department)",
                            colnames(WaFramesCosts[[i]])), drop = FALSE]
})
This will match column names that contain any of those strings. You can be a lot more precise by ensuring whole strings or start/end matches occur by using regular expressions. For instance, changing from (Cost|Dom) (anything that contains "Cost" or "Dom") to (^Cost|Dom) means anything that starts with "Cost" or contains "Dom"; similarly, (Cost|ment$) matches anything that contains "Cost" or ends with "ment". If, however, you always want exact matches and just need those that exist, then something like this will work:
myoutput <- lapply(names(WaFramesCosts), function(i) {
  WaFramesCosts[[i]][, intersect(c("Cost_Center", "Domestic_Anytime_Min_Used", "Department"),
                                 colnames(WaFramesCosts[[i]])), drop = FALSE]
})
Note, in that last example, the difference between mtcars[,2] (returns a vector) and mtcars[,2,drop=FALSE] (returns a data.frame with one column). Defensive programming: if you think it at all possible that your filtering will return a single column, make sure you do not inadvertently convert it to a vector by appending ,drop=FALSE to your bracket subsetting.
Based on your description, this is an example of using the dplyr library to combine a list of data frames for a given set of columns. This doesn't require all data frames to have identical columns. (Providing your data in a reproducible example would be better.)
# test data
df1 = read.table(text = "
c1 c2 c3
a 1 101
b 2 102
", header = TRUE, stringsAsFactors = FALSE)
df2 = read.table(text = "
c1 c2 c3
w 11 201
x 12 202
", header = TRUE, stringsAsFactors = FALSE)
# dfs is a list of data frames
dfs <- list(df1, df2)
# use dplyr::bind_rows
library(dplyr)
cols <- c("c1", "c3")
result <- bind_rows(dfs)[cols]
result
# c1 c3
# 1 a 101
# 2 b 102
# 3 w 201
# 4 x 202

Cross-referencing data frames without using for loops

I'm having an issue with the speed of using for loops to cross-reference two data frames. The overall aim is to identify rows in data frame 2 that lie between coordinates specified in data frame 1 (and meet other criteria). e.g. df1:
chr start stop strand
1 chr1 179324331 179327814 +
2 chr21 45176033 45182188 +
3 chr5 126887642 126890780 +
4 chr5 148730689 148734146 +
df2:
chr start strand
1 chr1 179326331 +
2 chr21 45175033 +
3 chr5 126886642 +
4 chr5 148729689 +
My current code for this is:
found_log <- NULL
for (index in 1:nrow(df1)) {
  found_miRNAs <- ""
  curr_row <- df1[index, ]
  for (index2 in 1:nrow(df2)) {
    curr_target <- df2[index2, ]
    if (curr_row$chr == curr_target$chr & curr_row$start < curr_target$start & curr_row$stop > curr_target$start & curr_row$strand == curr_target$strand) {
      found_miRNAs <- paste(found_miRNAs, curr_target$start, sep = ":")
    }
  }
  curr_row$miRNAs <- found_miRNAs
  found_log <- rbind(found_log, curr_row)
}
My actual data frames are 400 lines for df1 and >100,000 lines for df2, and I am hoping to do 500 iterations, so, as you can imagine, this is unworkably slow. I'm relatively new to R, so any hints for functions that may increase the efficiency of this would be great.
Maybe not fast enough, but probably faster and a lot easier to read:
df1 <- data.frame(foo=letters[1:5], start=c(1,3,4,6,2), end=c(4,5,5,9,4))
df2 <- data.frame(foo=letters[1:5], start=c(3,2,5,4,1))
where <- sapply(df2$start, function (x) which(x >= df1$start & x <= df1$end))
This will give you a list of the relevant rows in df1 for each row in df2. I just tried it with 500 rows in df1 and 50000 in df2. It finished in a second or two.
To add criteria, change the inner function within sapply. If you then want to put where into your second data frame, you could do e.g.
df2$matching_rows <- sapply(where, paste, collapse=":")
But you probably want to keep it as a list, which is a natural data structure for it.
Actually, you can even have a list column in the data frame:
df2$matching_rows <- where
though this is quite unusual.
You've run into two of the most common mistakes people make when coming to R from another programming language: using for loops instead of vector-based operations, and dynamically appending to a data object. I'd suggest that as you get more fluent you take some time to read Patrick Burns' R Inferno; it provides some interesting insight into these and other problems.
As @David Arenburg and @zx8754 have pointed out in the comments above, there are specialized packages that can solve the problem, and the data.table package and @David's approach can be very efficient for larger datasets. But for your case base R can do what you need very efficiently as well. I'll document one approach here, with a few more steps than necessary for clarity, just in case you're interested:
set.seed(1001)
ranges <- data.frame(beg=rnorm(400))
ranges$end <- ranges$beg + 0.005
test <- data.frame(value=rnorm(100000))
## Add an ID field for duplicate removal:
test$ID <- 1:nrow(test)
## This is where you'd set your criteria. The apply() function is just
## a wrapper for a for() loop over the rows in the ranges data.frame:
out <- apply(ranges, MAR=1, function(x) test[ (x[1] < test$value & x[2] > test$value), "ID"])
selected <- unlist(out)
selected <- unique( selected )
selection <- test[ selected, ]
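For reference, the data.table route mentioned above might look roughly like this. This is only a sketch, assuming df1 and df2 have the chr/start/stop/strand columns shown; note that type = "within" uses closed intervals, so adjust the bounds by 1 if you need the strict inequalities from the original loop:
library(data.table)
dt1 <- as.data.table(df1)   # chr, start, stop, strand
dt2 <- as.data.table(df2)   # chr, start, strand
dt2[, stop := start]        # treat each position as a zero-length interval
# The last two key columns define the interval for the overlap join
setkey(dt1, chr, strand, start, stop)
# Rows of df2 falling within a df1 range (chr and strand must match)
hits <- foverlaps(dt2, dt1, type = "within", nomatch = NULL)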

Searching and Replacing between Two Data Frames with Apply Family

I'm trying to analyze a large set of data, so I can't use for loops to search for the IDs from one data frame in the other and replace the text.
Basically, the first data frame has IDs but no names; the names are in the other data frame.
(Edit) Input dfs
(Edit) df1
ID------Name
1,2,3---NA
4,5-----NA
6-------NA
(Edit) df2
ID------Name
1-------John
2-------John
3-------John
4-------Stacy
5-------Stacy
6-------Alice
(Edit) Expected output df
ID------Name
1,2,3---John
4,5-----Stacy
6-------Alice
(Edit) Please note that this is a very simplified version. df1 actually has 63 columns and 8551 rows; df2 has 5 columns and 37291 rows.
I can search for the IDs and get the names from the second data frame like this. It's super fast!
namer <- function(df2, ids) {
  ids <- gsub(',', '|', ids)
  names <- df2[which(apply(df2, 1, function(x) any(grepl(ids, x)))), ][['Name']]
  if (length(names) != 0) {
    return(names[[1]])
  } else {
    return(NA)
  }
}
But I can't do the replacement with the apply family. I know how to do it with for loops, and that's super slow because I have around 8500 rows in the first data frame.
for (k in 1:nrow(df1)) {
  df1$Name[k] <- namer(df2, df1$ID[k])
}
Can you please help to do convert for loops into apply functions as well to speed it up?
Thanks in advance
You can try
df1$Name <- sapply(as.character(df1$ID), function(x)
  paste(unique(df2[match(strsplit(x, ",")[[1]], df2$ID), "Name"]), collapse = ","))
df1
# ID Name
# 1 1,2,3 John
# 2 4,5 Stacy
# 3 6 Alice
Although I doubt sapply will be faster than a for loop. I've also added the paste function here in case more than one name matches an entry of df1$ID.
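If speed is still a concern, building a named lookup vector once avoids scanning df2 for every row. A sketch, assuming df2$ID values are unique, as in the example:
# Build the lookup once: names are IDs, values are Names
lookup <- setNames(as.character(df2$Name), as.character(df2$ID))
ids <- strsplit(as.character(df1$ID), ",")
df1$Name <- vapply(ids, function(v) paste(unique(lookup[v]), collapse = ","),
                   character(1))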

Replace NA with 0 in R using a loop on a dataframe

I would like to run through specific columns in a data frame and replace all NAs with 0s using a loop.
extract = read.csv("2013-09 Data extract.csv")
extract$Premium1[is.na(extract$Premium1)] <- 0
extract$Premium1
gives me the required result for Premium1 in the data frame extract, but I would like to loop through all 27 columns of premiums, so what I am trying is
extract = read.csv("2013-09 Data extract.csv")
for (i in 1:27) {
  thispremium <- get(paste("extract$Premium", i, sep = ""))
  thispremium[is.na(thispremium)] <- 0
}
which gives
Error in get(paste("extract$Premium", i, sep = "")) :
object 'extract$Premium1' not found
Any idea on what is causing the error?
Do you need the loop because of other requirements? Because it works just fine without one:
extract[is.na(extract)] <- 0
If you want to do the replacement for some columns only, select those columns first, perform the replacement, and substitute the columns back into the original set:
first5 <- extract[, 1 : 5]
first5[is.na(first5)] <- 0
extract[, 1 : 5] <- first5
More generally, loops can (and should) almost always be avoided in R, especially when manipulating data frames. Often operations vectorise automatically (like above). When they don't, functions of the apply family can be used.
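For instance, the same column-wise replacement could be written with lapply; a sketch, assuming the first five columns are the ones to clean:
# Replace NAs with 0, column by column, and assign the columns back
extract[, 1:5] <- lapply(extract[, 1:5], function(col) replace(col, is.na(col), 0))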
How about
for (colname in names(extract))
  extract[[colname]][is.na(extract[[colname]])] <- 0
(or even extract[is.na(extract)] <- 0)
Or, if you are not doing it to all the columns (I think I misread your question):
for (i in 1:27) {
  colname <- paste0("Premium", i)
  extract[[colname]][is.na(extract[[colname]])] <- 0
}
Alternatively, you don't really need to know the number of such columns:
premium <- grep("^Premium[0-9]*$",names(extract))
extract[premium][is.na(extract[premium])] <- 0
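As a quick demonstration with toy data (hypothetical column names, just to show the pattern):
extract <- data.frame(Premium1 = c(1, NA), Premium2 = c(NA, 4), Other = c(NA, "x"))
premium <- grep("^Premium[0-9]*$", names(extract))
extract[premium][is.na(extract[premium])] <- 0
extract
#   Premium1 Premium2 Other
# 1        1        0  <NA>
# 2        0        4     x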
