I have some code that works as expected but is very inefficient and slow. I have two data frames: 1) 40,645 rows and 264 columns, where each column represents a KPI/dimension of some kind, and 2) 478,872 rows and 11 columns. DF1 is a wide data frame and DF2 is a long data frame. I need to merge the two, but I cannot simply merge the columns since the data frames are in different formats. Additionally, I need to combine the values of two columns from DF2 to create the name for each new column in DF1. At the moment I accomplish this using a loop, and this command doing the work is what slows the code down substantially:
DF1[[col_name_v]][index_v] <- col_val_v
# Additionally, these other methods appear just as slow:
DF1[index_v,"col1"] <- col_val_v
DF1[index_v,265] <- col_val_v
If I were to actually specify each of the column names manually like this, it works 10x+ faster:
DF1$col1[index_v] <- col_val1_v
DF1$col2[index_v] <- col_val2_v
DF1$col3[index_v] <- col_val3_v
#.... etc
The problem is I need the code to be dynamic: there are many columns, and those column names may change over time, so I would prefer the code to learn the column names dynamically and apply them on the fly, to avoid frequent additions and changes to the code.
The data looks like the following:
DF1 (40645 x 264) - before adding the new columns from the code below
Date_Time,YEAR,MON,DOW,WEEK,Location,ID,KPI1,KPI2,KPI3,...,KPI264
9/7/2020,2020,September,Monday,36,33,33001,43,0,2,...,10
DF2 (478872 x 11) - multiple rows to be merged as multiple columns into DF1
Date_Time,Location,ID,Technology,Cluster,Dimension,Variable,Formula,Value,Index
9/7/2020,33,33001,1,"OVERALL","NEWKPINAME1","SUM(N)/SUM(D)",2.8,0.003
9/7/2020,33,33001,1,"LOCATION","NEWKPINAME1","SUM(N)/SUM(D)",2.8,0.004
9/7/2020,33,33001,1,"GROUP1","NEWKPINAME1","SUM(N)/SUM(D)",2.8,0.002
The Dimension and Variable values are combined to create a new, unique KPI name, which becomes a column in DF1, and the Index value is written to DF1 under that new column for the matching row.
# Provides the number of new columns that need to be added to DF1
dim_v <- unique(DF2$Dimension)
var_v <- unique(DF2$Variable)
limit_v <- length(dim_v) * length(var_v)
# This part adds the new columns to DF1 - with values defaulting to NA
index_v <- 1
while (index_v <= limit_v) {
  # Create the column name
  col_name_v <- paste(DF2$Dimension[index_v], DF2$Variable[index_v], sep = "_")
  # Add the column name with default values of NA
  DF1[[col_name_v]] <- NA
  index_v <- index_v + 1
}
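As an aside, the loop above takes its names from the first limit_v rows of DF2, so it assumes those rows happen to cover every Dimension/Variable pair. If that ordering is not guaranteed, here is a sketch that enumerates the combinations explicitly (base R only; combos_v and new_cols_v are illustrative names following the question's _v convention):
combos_v <- expand.grid(dim_v, var_v, stringsAsFactors = FALSE)
new_cols_v <- paste(combos_v$Var1, combos_v$Var2, sep = "_")
DF1[new_cols_v] <- NA  # adds every missing column at once, filled with NA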
# This part writes the values of each DF2 Dimension_Variable KPI values stored in ROWS (12 per ID) to DF1 across 12 COLUMNS
# Merge the long DF2 ROW based KPIs into the wide DF1 COLUMN based KPIs
index_v <- 1
while (index_v <= nrow(DF1)) {
  # Identify the key fields we need for the lookup
  datetime_v <- as.Date(DF1$Date_Time[index_v])
  id_v <- DF1$ID[index_v]
  # Create a temp data.frame for the related data
  data_df <- subset(DF2, Date_Time == datetime_v & ID == id_v)
  # Do we even have records?
  if (nrow(data_df) != 0) {
    # Cycle through and write each value
    index2_v <- 1
    while (index2_v <= nrow(data_df)) {
      # Create the column name
      col_name_v <- paste(data_df$Dimension[index2_v], data_df$Variable[index2_v], sep = "_")
      col_val_v <- data_df$Index[index2_v]
      # Write the value under the matching column name
      DF1[[col_name_v]][index_v] <- col_val_v
      index2_v <- index2_v + 1
    }
  } else {
    print(sprintf("No records for Date: %s ID: %s", datetime_v, id_v))
  }
  index_v <- index_v + 1
}
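For reference, the entire row-by-row write can typically be replaced by pivoting DF2 to wide form once and joining on the key columns. Below is a sketch using tidyr and dplyr (column names are taken from the question; it assumes Date_Time is comparable across both frames, e.g. both run through as.Date first, and that each Date_Time/ID/Dimension/Variable combination carries a single Index value):
library(dplyr)
library(tidyr)
# Pivot DF2 so each Dimension_Variable pair becomes a column holding its Index value
DF2_wide <- pivot_wider(DF2,
                        id_cols = c(Date_Time, ID),
                        names_from = c(Dimension, Variable),
                        names_sep = "_",
                        values_from = Index)
# One keyed join instead of 40,645 subset/assign passes
DF1_new <- left_join(DF1, DF2_wide, by = c("Date_Time", "ID"))
If per-cell assignment is genuinely required, data.table's set() writes by reference and avoids the copy-on-modify overhead that makes DF1[[col_name_v]][index_v] <- col_val_v slow inside a loop.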
Suppose I have two lists with the following embedded data frames:
# Data frames to intersect
US <- data.frame("Group" = c(1,2,3), "Age" = c(21,20,17), "Name" = c("John","Dora","Helen"))
CA <- data.frame("Group" = c(2,3,4), "Age" = c(21,20,19), "Name" = c("John","Dora","Dan"))
JP <- data.frame("Group" = c(4,5,6), "Age" = c(16,15,14), "Name" = c("Mac","Hector","Jack"))
# Lists to compare
list1 <- list(US, CA, JP)
names(list1) <- c("US", "CA", "JP")
# List 2 can serve as a "reference list," a duplicate of the first
list2 <- list(US, CA, JP)
names(list2) <- c("US", "CA", "JP")
I have a second list that serves as a "reference list" to the first. It is a copy and is only meant to be used as a reference in some operation, like a for loop. What I want to do is intersect the values from only the first column (e.g. Group) and store the intersected output in separate data frames or matrices. I do not want to intersect data frame groups that have the same names (i.e. list 1 US groups should not be intersected with list 2 US groups).
Ideally, a final list of data frames would be created, containing all possible combinations of intersected data frames and their names; the final output would be something to the effect of:
print(comb_list)
$US_CA
Group
1 2
2 3
$US_JP
data frame with 0 columns and 0 rows
$CA_JP
Group
1 4
Would it be possible to create this as a for-loop?
Sure, that looks doable with a nested for loop. There's no need to copy the initial list; the loop can iterate over the same list twice. I'd suggest using dplyr for its handy filter and select functions:
library(dplyr)
comb_list <- list()
for (i in seq_along(list1)) {
  for (j in seq_along(list1)) {
    # Don't intersect a country with itself
    if (names(list1)[i] != names(list1)[j]) {
      value <- filter(list1[[i]], Group %in% list1[[j]]$Group)
      value <- select(value, Group)
      name <- paste0(names(list1)[i], "_", names(list1)[j])
      name_alt <- paste0(names(list1)[j], "_", names(list1)[i])
      # Don't store equivalent country intersections, i.e. US_CA and CA_US
      if (!name %in% names(comb_list) & !name_alt %in% names(comb_list)) {
        comb_list[[name]] <- value
      }
    }
  }
}
print(comb_list)
$US_CA
Group
1 2
2 3
$US_JP
[1] Group
<0 rows> (or 0-length row.names)
$CA_JP
Group
1 4
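As a design note, base combn can generate the unordered pairs directly, which removes the need for the name/name_alt bookkeeping. A rough sketch along those lines (same list1 as above; the empty intersections come back as 0-row data frames with a Group column rather than fully empty frames):
pairs <- combn(names(list1), 2, simplify = FALSE)
comb_list <- lapply(pairs, function(p) {
  data.frame(Group = intersect(list1[[p[1]]]$Group, list1[[p[2]]]$Group))
})
names(comb_list) <- vapply(pairs, paste, character(1), collapse = "_")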
I am looking for a function that iterates through the rows of a given column ("pos" for position, ascending) in a data frame and keeps only those rows whose values are at least, say, 10 apart, starting with the first row. It would start with the first row (and store it), carry on until it finds a row with a value at least 10 higher than the first, store that row, then start again from that value looking for the next row more than 10 higher.
So far I have an R for loop that successfully finds adjacent rows at least X apart, but it cannot look any further than one row down, nor stop once it has found a qualifying row and start again from there.
Here is the function I have:
# Example data frame
df <- data.frame(x = c(1:1000), pos = sort(sample(1:10000, 1000)))
# Prep function (this only checks the row above)
library(dplyr)
pos.apart.subset <- function(df, pos.diff) {
  # Create new dfs to store output
  new.df <- list()
  new.df1 <- data.frame()
  # Iterate through each row of df
  for (i in 1:nrow(df)) {
    # Keep the row if the next value is at least pos.diff higher,
    # if the values are not ascending, or if it is the first row
    if (isTRUE(df$pos[i+1] >= df$pos[i] + pos.diff | df$pos[i+1] < df$pos[i] | i == 1)) {
      # Add rows that meet the conditions to the list
      new.df[[i]] <- df[i, ]
    }
  }
  # Bind all rows that met the conditions
  new.df1 <- bind_rows(new.df)
  return(new.df1)
}
# test run for pos column adjacent values to be at least 10 apart
df1 <- pos.apart.subset(df, 10); head(df1)
Happy to do this in awk or any other language. Many thanks.
It seems I misunderstood the question earlier: since we don't want the difference between consecutive rows but from the last kept row, you can try:
nrows <- 1
previous_match <- 1
for (i in 2:nrow(df)) {
  if (df$pos[i] - df$pos[previous_match] > 10) {
    # Keep this row and measure subsequent differences from it
    nrows <- c(nrows, i)
    previous_match <- i
  }
}
and then subset the selected rows:
df[nrows, ]
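For what it's worth, the same "distance from the last kept row" logic can be written without an explicit loop using Reduce with accumulate = TRUE (a sketch that assumes pos is sorted ascending, as in the question's sample data):
# Running value of the last kept pos; it changes exactly when a row qualifies
last_kept <- Reduce(function(prev, x) if (x - prev > 10) x else prev,
                    df$pos, accumulate = TRUE)
# The first occurrence of each value marks a kept row
df[!duplicated(last_kept), ]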
Earlier answer
We can use diff to get the difference between consecutive rows and select the rows whose difference is greater than 10.
head(subset(df, c(TRUE, diff(pos) > 10)))
# x pos
#1 1 1
#2 2 31
#6 6 71
#9 9 134
#10 10 151
#13 13 185
The first TRUE is there to select the first row by default.
In dplyr, we can use lag to get the value from the previous row:
library(dplyr)
df %>% filter(pos - lag(pos, default = -Inf) > 10)
I defined the following function, which takes two data frames, DF_TAGS_LIST and DF_epc_list. Each data frame has a single column, with a different number of rows. I want to search for each value of DF_TAGS_LIST in DF_epc_list and, if found, store it in another data frame.
One example of DF_TAGS_LIST:
TAGS_LIST
3036029B539869100000000B
3036029B537663000000002A
3036029B5398694000000009
3036029B539869400000000C
3036029B5398690000000006
3036029B5398692000000007
And one example of DF_epc_list:
EPC
3036029B539869100000000B
3036029B537663000000002A
3036029B5398690000000006
3036029B5398692000000007
3036029B5398691000000006
3036029B5376630000000034
3036029B53986940000000WF
3036029B5398694000000454
3036029B5398690000000234
3036029B53986920000000FG
In this case, I would like the output to be a single data frame with the following values:
FOUND_TAGS
3036029B5398690000000006
3036029B5398692000000007
3036029B539869100000000B
3036029B537663000000002A
My function is:
FOUND_COMPARE_TAGS <- function(DF_TAGS_LIST, DF_epc_list) {
  DF_epc_list <- toString(DF_epc_list)
  DF_TAGS_LIST <- toString(DF_TAGS_LIST)
  DF_found_epc_tags <- data.frame(DF_found_epc_tags = intersect(DF_TAGS_LIST$DF_TAGS_LIST, DF_epc_list$DF_epc_list))
  setdiff(union(DF_TAGS_LIST$DF_TAGS_LIST, DF_epc_list$DF_epc_list), DF_found_epc_tags$DF_found_epc_tags)
  #DF_found_epc_tags <- data.frame(DF_found_epc_tags = DF_TAGS_LIST[unique(na.omit(match(DF_epc_list$DF_epc_list, DF_TAGS_LIST$DF_TAGS_LIST))),])
  return(DF_found_epc_tags)
}
It now returns an empty data frame with two columns. I have only recently started programming in R.
You can use %in% or (as I mentioned in my comment) intersect:
DF_TAGS_LIST[DF_TAGS_LIST$TAGS_LIST %in% DF_epc_list$EPC, , drop = FALSE]
# TAGS_LIST
# 1 3036029B539869100000000B
# 2 3036029B537663000000002A
# 5 3036029B5398690000000006
# 6 3036029B5398692000000007
intersect(DF_TAGS_LIST$TAGS_LIST, DF_epc_list$EPC)
# [1] "3036029B539869100000000B" "3036029B537663000000002A"
# [3] "3036029B5398690000000006" "3036029B5398692000000007"
Or stack the two one-column frames and keep the duplicated rows (this assumes neither frame contains duplicates of its own; setNames aligns the column names so rbind works):
FOUND_TAGS <- rbind(setNames(DF_TAGS_LIST, "TAG"), setNames(DF_epc_list, "TAG"))
FOUND_TAGS <- FOUND_TAGS[duplicated(FOUND_TAGS), , drop = FALSE]
I have two R data frames with differing dimensions. However, both data frames have an id column.
df1:
nrow(df1)=22308
c1 c2 c3 pattern1.match
ENSMUSG00000000001_at 10.175115 10.175423 10.109524 0
ENSMUSG00000000003_at 2.133651 2.144733 2.106649 0
ENSMUSG00000000028_at 5.713781 5.714827 5.701983 0
df2:
Genes Pattern.Count
ENSMUSG00000000276 ENSMUSG00000000276_at 1
ENSMUSG00000000876 ENSMUSG00000000876_at 1
ENSMUSG00000001065 ENSMUSG00000001065_at 1
ENSMUSG00000001098 ENSMUSG00000001098_at 1
nrow(df2)=425
I would like to loop through df2, find all genes that have Pattern.Count == 1, and check them against the df1$pattern1.match column.
Basically, I would like to overwrite the Genes and pattern1.match fields with df2$Genes and df2$Pattern.Count. All the elements of df2$Pattern.Count are equal to one.
I wrote this function, but R freezes while looping through all these rows.
idcol <- ncol(df1)
return.frame.matches <- function(df1, df2, idcol) {
  for (i in 1:nrow(df1)) {
    for (j in 1:nrow(df2)) {
      if (df1[i, 1] == df2[j, 1]) {
        df1[i, idcol] = 1
        break
      }
    }
  }
  return(df1)
}
Is there another way of doing that without almost killing the computer?
I'm not sure I get exactly what you are doing, but the following should at least get you closer.
The first column of df1 doesn't seem to have a name; are those row names?
If so,
df1$Genes <- rownames(df1)
Then you could then do a merge to create a new dataframe with the genes you require:
merge(df1,subset(df2,Pattern.Count==1))
Note they are matching on the common column Genes. I'm not sure what you want to do with the pattern1.match column, but a subset on the df1 part of the merge can incorporate conditions on it.
Edit
Going by the extra information in the comments,
df1$pattern1.match <- as.numeric(df1$Genes %in% df2$Genes)
should achieve what you are looking for.
Your sample data is not enough to play around with, but here is what I would start with:
dfm <- merge( df1, df2, by = idcol, all = TRUE )
dfm_pc <- subset( dfm, Pattern.Count == 1 )
I took "idcol" from your code; I don't see it in the data.
I've got data being read into a data frame in R, by column. Some of the columns will increase in value; for those columns only, I want to replace each value (n) with its difference from the previous value in that column. For example, looking at an individual column, I want
c(1,2,5,7,8)
to be replaced by
c(1,3,2,1)
which are the differences between successive elements
However, it's getting really late in the day, and I think my brain has just stopped working. Here's my code at present
col1 <- c(1,2,3,4,NA,2,3,1) # This column rises and falls, so we want to ignore it
col2 <- c(1,2,3,5,NA,5,6,7) # Note: this column always rises in value, so we want to replace it with deltas
col3 <- c(5,4,6,7,NA,9,3,5) # This column rises and falls, so we want to ignore it
d <- cbind(col1, col2, col3)
d
fix_data <- function(data) {
  # Iterate through each column...
  for (column in data[, 1:dim(data)[2]]) {
    lastvalue <- 0
    # Now walk through each value in the column,
    # checking to see if the column consistently rises in value
    for (value in column) {
      if (is.na(value) == FALSE) { # Need to ignore NAs
        if (value >= lastvalue) {
          alwaysIncrementing <- TRUE
        } else {
          alwaysIncrementing <- FALSE
          break
        }
      }
    }
    if (alwaysIncrementing) {
      print(paste("Column", column, "always increments"))
    }
    # If a column is always incrementing, alwaysIncrementing will now be TRUE.
    # In this case, I want to replace each element in the column with the delta between
    # successive elements. The size of the column shrinks by 1 in doing this, so just
    # prepend a copy of the 1st element to the start to keep the column length the same
    if (alwaysIncrementing) {
      print(paste("This is an incrementing column:", colnames(column)))
      column <- c(column[1], diff(column, lag = 1))
    }
  }
  data
}
fix_data(d)
d
If you copy/paste this code into RGui, you'll see that it doesn't do anything to the supplied data frame.
Besides losing my mind, what am I doing wrong??
Thanks in advance
Without addressing the code in any detail: you're assigning values to column, which is a local variable within the loop (i.e. there is no relationship between column and data in that context), so you need to assign those values to the appropriate elements of data instead.
Also, data is local to your function, so you need to assign the function's result back to your object after running it.
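To make that second point concrete (a minimal sketch; as noted above, fix_data would also need to write its changes into data before returning):
d <- fix_data(d)  # functions receive copies, so the result must be assigned back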
Incidentally, you can use diff to check whether a column is always incrementing, rather than looping over every value:
idx <- apply(d, 2, function(x) !any(diff(x[!is.na(x)]) < 0))
d[,idx] <- blah
diff calculates the difference between consecutive values in a vector. You can apply it to each column of a data frame using, e.g.
dfr <- data.frame(x = c(1,2,5,7,8), y = (1:5)^2)
as.data.frame(lapply(dfr, diff))
x y
1 1 3
2 3 5
3 2 7
4 1 9
EDIT: I just noticed a few more things. You are using a matrix, not a data frame (as you stated in the question). For your matrix 'd', you can use
d_diff <- apply(d, 2, diff)
# Find columns that are (strictly) increasing
incr <- apply(d_diff, 2, function(x) all(x > 0, na.rm = TRUE))
# Replace values in the appropriate columns
d[2:nrow(d), incr] <- d_diff[, incr]