I have a double for loop over the rows of two data frames. I'm applying some calculations to each row of the first data frame (the rows represent different batteries and therefore all vary in their values). In the end I want to check whether a row (i.e. a battery) fits the criteria. If it does, I want to put it in a new data frame gathering all batteries that fit the criteria.
df1 <- as.data.frame(matrix("values",nrow=24,ncol=19))
df2 <- as.data.frame(matrix("values",nrow=2976,ncol=22))
df3 <- df1[0,] #empty df of the same structure as df1
What I'm doing:
for(i in 1:nrow(df1)){
  for(j in 1:nrow(df2)){
    # some calculations giving me a result for what the necessary capacity "nc" is
    ...
So far this works alright. What I then want to do is check whether the result for each row in df1 (i.e. the necessary battery capacity) is bigger than a condition "con":
...
    con <- df1[i,4]
    nc <- max(df2[[20]]) # defining the necessary capacity
    if(con > nc){
      newdf <- bind_rows(df3,df1[i,])
    }
  }
}
I expect newdf to have between 0 and at most 24 rows. With the real data I should get 11 entries. What I actually got was either 1 entry (the last row of df1) or more than 30,000 entries. So this is not working as expected. Any ideas? Thanks!
I think you forgot to update df3, which is why you only get one row: you always bind a single row to the empty df3 data frame. Your code otherwise works. You should change the line
newdf <- bind_rows(df3,df1[i,])
into
df3 <- bind_rows(df3,df1[i,])
This might be very slow, however, so I suggest you use vectorization, for example as suggested in the comments by @Dave2i: newdf <- df[df[,4] > nc, ]
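Spelled out a bit more (a minimal sketch, assuming nc does not actually depend on the loop indices, which is the case in the code shown):

nc  <- max(df2[[20]])        # necessary capacity, identical in every iteration
df3 <- df1[df1[[4]] > nc, ]  # all rows of df1 whose 4th column exceeds nc, in one step

This replaces both loops with a single logical-subset operation.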
I need help rewriting my function (randomdraws(), see below), which currently works through a repeat loop inside a for loop. This takes a lot of time (especially the for loop) on my bigger datasets. Additionally, I need to repeat this function x times and want to store the results in a list object.
Here is what I am trying to achieve: I have two data frames (here df_1 and df_2) which I need as input for my function randomdraws(). The interesting part of the function begins in the repeat{} section; here I need to draw a number of values from the extreme value distribution (EVD) equal to the length of a row of df_1. Afterwards I need to add these values (called evd_draw) to the values of df_1 and check whether the altered values fulfil a certain condition (i.e. varX == varY). If the condition is not met with the random draw added, I want to repeat this part until it is met. Once the condition is met, I need to store the evd_draw with which it was met. I then want to iterate this over each row of df_1. In the end I get a new data frame with the "stored" random draws of the EVD, one per row of df_1, that fulfilled the condition varX == varY. In my example below, with only 10 observations, my code runs just fine.
But: if the number of rows and columns of df_1 (and df_2) grows, randomdraws() gets very slow. I therefore need another solution that performs the repeat-loop calculation for each row of df_1. I think I need to parallelize my computations instead of iterating over the rows one after another, but I seem to fail at (i) rewriting my repeat part for this and (ii) using it in functions like apply()/map()/...
QUESTION: Is there a way to achieve my result (i.e. a data frame/list of the random draws that fulfilled the condition on df_1 and df_2) that avoids the for loop and is quick for large datasets/data frames?
Example data:
df_1 <- as.data.frame(rbind(c(0.23040,0.10153,0.28394,0.17105,0.00125),
c(0.11839,0.16768 ,0.26914 ,0.19163,0.00126),
c(0.11703,0.18245 ,0.16571 ,0.16223,0.00423),
c(0.39406,0.08535 ,0.21181 ,0.12780,0.00039),
c(0.16097 ,0.16369, 0.23839, 0.17830,0.00158),
c(0.39812 ,0.04525, 0.17583, 0.09064,0.00167),
c(0.30547 ,0.10900, 0.18930 ,0.12665,0.00197),
c(0.19357 ,0.17854, 0.18003 ,0.19576,0.00189),
c(0.19466 ,0.17339, 0.21267 ,0.18410,0.00069),
c(0.07884 ,0.21299 ,0.18480 ,0.17908,0.00178)))
colnames(df_1) <- c("xf0m40","xf30m40","xf10m40","xf20m40","xf40m0")
rownames(df_1) <- c(2,7,21,33,50,77,80,96,102,110)
df_2 <- cbind.data.frame(varX=c("xf0m40","xf30m40","xf10m40","xf0m40","xf20m40","xf0m40","xf0m40","xf40m0","xf10m40","xf30m40"),
id=c(2,7,21,33,50,77,80,96,102,110))
Function (that runs smoothly but is too slow):
randomdraws <- function(df_1, df_2) {
  require(tidyverse)
  require(EnvStats)
  dfx <- df_1 # here df_1 is actually retrieved from fitted values of regression output,
              # simplified here for the sake of clarity
  df <- df_2  # select two variables from separate dataframe df_2 (already simplified
              # here), where varX is a character var, id is numeric
  # matrix containing only 0; to be filled with rowwise iteration
  df_evd <- matrix(0, nrow = nrow(dfx), ncol = ncol(dfx), byrow = TRUE)
  colnames(df_evd) <- colnames(dfx)
  rownames(df_evd) <- rownames(dfx)
  for (i in 1:nrow(dfx)) {
    repeat {
      evd_draw <- revd(length(dfx), scale = .5) # draw from evd for the length of one row
      t <- as.data.frame(dfx[i,] + evd_draw) %>% bind_cols(df[i,]) %>%
        mutate(varY = as.character(pmap(across(1:ncol(dfx)), ~ names(c(...)[which.max(c(...))]))),
               overlap = ifelse(varX == varY, 1, 0))
      # object t sums the row i values of dfx and evd_draw, then adds varX and id from
      # df_2 and calculates a new varY to check if varX == varY
      df_evd[i,] <- evd_draw
      if (t[,ncol(t)] == 1) break
      # this section is repeated until the condition varX == varY (in other words,
      # overlap == 1, i.e. t[,ncol(t)] == 1) is true
    }
  }
  return(df_evd)
}
Apply function on data:
system.time(exampledf <- randomdraws(df_1, df_2))
#replicate this function 3 times (takes even longer!)
ls_example <- replicate(3, list(as.data.frame(randomdraws(df_1, df_2))), simplify=TRUE)
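For reference, one hedged sketch of that refactor: pull the repeat block into a per-row helper that lapply() (or parallel::mclapply() on Unix-alikes) can drive. This is untested here; the helper name draw_until_match and the core count are invented, and it assumes the rows of df_2 align positionally with the rows of df_1, as in the original code.

library(parallel)
library(EnvStats)

# Hypothetical helper: redraw for row i until the column holding the
# maximum shifted value equals varX, then return the accepted draw.
draw_until_match <- function(i, dfx, df) {
  repeat {
    evd_draw <- revd(ncol(dfx), scale = .5)       # one draw per column
    shifted  <- unlist(dfx[i, ]) + evd_draw       # named numeric vector
    varY     <- names(shifted)[which.max(shifted)]
    if (as.character(df$varX[i]) == varY) return(evd_draw)
  }
}

# serial version, one row at a time
draws  <- lapply(seq_len(nrow(df_1)), draw_until_match, dfx = df_1, df = df_2)
df_evd <- do.call(rbind, draws)

# parallel version (Unix-alikes only; 4 cores is an assumption)
# draws <- mclapply(seq_len(nrow(df_1)), draw_until_match,
#                   dfx = df_1, df = df_2, mc.cores = 4)

Note that the repeat itself cannot be removed (its iteration count is random), but the rows are independent, so they can run in parallel.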
I have two large data frames (500k rows) from two separate sources without a key. Instead of merging on a key, I want to merge the two data frames by matching on other columns, such as age and amount. It is not a perfect match between the two data frames, so some values will not match; I will simply remove those later.
The data could look something like this (the original post showed the two tables as an image).
So, in the example above I want to create a table matching Key 1 and Key 2. In the image we see that XXX1 and YYY3 are a match. From here I would like to create a data frame like:
[Key 1] [Key 2]
XXX1 YYY3
XXX2 N/A
XXX3 N/A
I know how to do this in Excel, but due to the large amount of data it simply crashes. I want to focus on R, but for what it is worth, this is how I built it in Excel (the idea is that we first do a VLOOKUP, and then use INDEX as a VLOOKUP to get the second match if the first one does not match both criteria):
=IF(P2=0;IFNA(VLOOKUP(L2;B:C;2;FALSE);VLOOKUP(L2;G:H;2;FALSE));IF(O2=Q2;INDEX($A$2:$A$378300;SMALL(IF($L2=$B$2:$B378300;ROW($B$2:$B$378300)-ROW($B$2)+1);2));0))
And this is the approach made in R:
for (i in 1:nrow(df)) {
  for (j in 1:nrow(df)) {
    if (df_1$pc_age[i] == df_2$pp_age[j] &&
        (df_1$amount[i] %in% c(df_2$amount1[j], df_2$amount2[j], df_2$amount3[j]))) {
      df_1$Key1[i] = df_2$Key2[j]
    } else {
      df_1$Key1[i] = NA
    }
  }
}
The problem is that this takes way, way too long. Is there a more effective way to map this data as well as possible?
Thanks!
Create dummy columns in both data frames, such as (shown here for df1):
for(i in 1:nrow(df1)){
  df1$key1[i] <- paste0("X_", i)
}
Similarly for df2 (Y_1 ... Y_n), and then join both data frames using merge() on the columns age and amount.
Concatenate Key1 and key2 in a new column in the merged data frame. You will directly get your desired data frame.
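A rough sketch of that idea (untested; the column names pc_age, pp_age, amount, and amount1 are taken from the question's loop, and matching is shown against amount1 only for brevity):

df_1$Key1 <- paste0("X_", seq_len(nrow(df_1)))
df_2$Key2 <- paste0("Y_", seq_len(nrow(df_2)))

# left join: unmatched df_1 rows are kept, so their Key2 comes out NA
merged <- merge(df_1, df_2,
                by.x = c("pc_age", "amount"),
                by.y = c("pp_age", "amount1"),
                all.x = TRUE)
merged[, c("Key1", "Key2")]

To also match against amount2/amount3, you would first reshape df_2 into long format so that all three amount columns end up in a single column.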
Could the following code work for you?
# create random data
set.seed(123)
df1 <- data.frame(
key_1=as.factor(paste("xxx",1:100,sep="_")),
age = sample(1:100,100,replace=TRUE),
amount = sample(1:200,100))
df2 <- data.frame(
key_1=paste("yyy",1:500,sep="_"),
age = sample(1:100,500,replace=TRUE),
amount_1 = sample(1:200,500,replace=TRUE),
amount_2 = sample(1:200,500,replace=TRUE),
amount_3 = sample(1:200,500,replace=TRUE))
# ensure at least three matching rows
df2[10,2:3] <- df1[1,2:3]
df2[20,c(2,4)] <- df1[2,2:3]
df2[30,c(2,5)] <- df1[3,2:3]
# define comparison with df2
comp2df2 <- function(x){
  ageComp <- df2$age == as.numeric(x[2])
  if(!any(ageComp)){
    return(NaN)
  }
  amountComp <- apply(df2, 1, function(a) as.numeric(x[3]) %in% as.numeric(a[3:5]))
  if(!any(amountComp)){
    return(NaN)
  }
  matchIdx <- ageComp & amountComp
  if(sum(matchIdx) > 1){
    warning("multiple matches detected; first match is taken\n")
  }
  return(which(matchIdx)[1])
}
# run match
matchIdx <- apply(df1,1,comp2df2)
# merge
df_new <- cbind(df1[!is.na(matchIdx),],df2[matchIdx[!is.na(matchIdx)],])
I didn't have time to test it on really big data, but I guess this should be faster than your two for loops....
To speed things up further, you could delete the
if(sum(matchIdx) > 1){
  warning("multiple matches detected; first match is taken\n")
}
lines if you are not worried about a row matching several others.
I want to extract some values out of a vector, modify them and put them back at the original position.
I have been searching a lot and tried different approaches to this problem. I'm afraid this might be really simple but I'm not seeing it yet.
Creating a vector and converting it to a data frame, plus an empty data frame for the results:
hight <- c(5,6,1,3)
hight_df <- data.frame("ID"=1:length(hight), "hight"=hight)
hight_min_df <- data.frame()
For every pair of adjacent values, extract the smaller value with its corresponding ID.
for(i in 1:(length(hight_df[,2])-1))
{
  hight_min_df[i,1] <- which(grepl(min(hight_df[,2][i:(i+1)]), hight_df[,2]))
  hight_min_df[i,2] <- min(hight_df[,2][i:(i+1)])
}
Modify the extracted values and aggregate equal IDs by the higher value. At the end, write the modified values back.
hight_min_df[,2] <- hight_min_df[,2]+20
adj_hight <- aggregate(x=hight_min_df[,2],by=list(hight_min_df[,1]), FUN=max)
hight[adj_hight[,1]] <- adj_hight[,2]
This works perfectly as long as I have only unique values in hight.
How can I run this script with a vector like this: hight <- c(5,6,1,3,5)?
Alright, there's a lot to unpack here. Instead of looping, I would suggest piping functions with dplyr. Read the dplyr vignette - it is an outstanding resource and an excellent approach to data manipulation in R.
So using dplyr we can rewrite your code like this:
library(dplyr)
hight <- c(5,6,1,3,5) #skip straight to the test case
hight_df <- data.frame("ID"=1:length(hight), "hight"=hight)
adj_hight <- hight_df %>%
  # logic pseudo-code: if the last hight (using the lag() function),
  # going from the first row to the last,
  # is greater than the current row's hight, take the current row's value;
  # else take the last row's value
  mutate(subst.id = ifelse(lag(hight) > hight, ID, lag(ID)),
         subst.val = ifelse(lag(hight) > hight, hight, lag(hight)) + 20) %>%
  filter(!is.na(subst.val)) %>%   # remove extra rows
  select(subst.id, subst.val) %>% # take just the columns we want
  # grouping - rewrite of your use of aggregate
  group_by(subst.id) %>%
  summarise(subst.val = max(subst.val)) %>%
  data.frame(.)
#tying back in
hight[adj_hight[,1]] <- adj_hight[,2]
print(hight)
Giving:
[1] 25 6 21 23 5
I'm working with data gathered from multi-channel electrode systems, and am trying to make this run faster than it currently is, but I can't find any good way of doing it without loops.
The gist of it is: I have modified averages for each column (each column is a channel), and need to compare each value in a column to the average for that column. If a value is above the adjusted mean, I need to put it in another data frame so it can be easily read.
Here is some sample code for the problematic bit:
readout <- data.frame(dimnmames <- c("Values"))
#need to clear the dataframe in order to run it multiple times without errors
#timeFrame is just a subsection of the original data, 60 channels with upwards of a few million rows
readout <- readout[0,]
for (i in 1:ncol(timeFrame)){
  for (g in 1:nrow(timeFrame)){
    if (timeFrame[g,i] >= posCompValues[i,1])
      append(spikes, timeFrame[g,i])
  }
}
The data ranges from 500 thousand to upwards of 130 million readings, so if anyone could point me in the right direction I'd appreciate it.
Something like this should work:
Return values of x greater than y:
cmpfun <- function(x,y) return(x[x>y])
For each element (column) of timeFrame, compare with the corresponding value of the first column of posCompValues
vals1 <- Map(cmpfun,timeFrame,posCompValues[,1])
Collapse the list into a single vector:
spikes <- unlist(vals1)
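A quick toy check of those three lines, reusing cmpfun from above (the data here are invented: three channels of five readings each, with one threshold per channel):

timeFrame     <- data.frame(ch1 = 1:5, ch2 = 6:10, ch3 = 11:15)
posCompValues <- data.frame(thresh = c(3, 8, 13))

spikes <- unlist(Map(cmpfun, timeFrame, posCompValues[, 1]))
spikes
# ch11 ch12 ch21 ch22 ch31 ch32
#    4    5    9   10   14   15

The names show which channel each spike came from, since Map() iterates over the columns of the data frame.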
If you want to save both the value and the corresponding column it may be worth unpacking this a bit into a for loop:
resList <- list()
for (i in seq(ncol(timeFrame))) {
  tt <- timeFrame[,i]
  spikes <- tt[tt > posCompValues[i,1]]
  if (length(spikes) > 0) {
    resList[[i]] <- data.frame(value = spikes, orig_col = i)
  }
}
res <- do.call(rbind, resList)
I have a dataframe which looks like this (obviously with fewer variables than the original data I need to work with):
woe <- c('1:woe', '2:woe', '3:woe', '4:woe', '5:woe')
svi <- c('stated','verified','verified','stated','stated')
fico_avg <- ceiling(runif(5,750, 780))
count <- c(8,12,34,24,7)
df <- data.frame(cbind(woe,svi,fico_avg,count))
woe svi fico_avg count
1:woe stated 771 8
2:woe verified 759 12
3:woe verified 752 34
4:woe stated 776 24
5:woe stated 767 7
I would like to create a dataset where the first row repeats 8 times (filling the first 8 rows), the second row 12 times, the third 34 times, and so on, depending on the value of the variable 'count'. I tried looking up the function InsertRow() in the DataCombine package, but InsertRow() requires RowNum as one of its arguments to insert a new row, and the RowNum changes as I insert new rows into the frame. The basic idea is to extract each row from the original data frame, copy it x times (if count = x), and finally row-bind all those rows into one frame. Any help is appreciated. Thanks in advance!
If your dataset is large, this should probably be quicker:
df <- data.frame(woe,svi,fico_avg,count)
df[rep(seq.int(1,nrow(df)), df$count),]
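One caveat (see also the answer below about cbind()): this assumes df$count is numeric. If df was instead built with the cbind() call from the question, convert count back first:

# count comes out non-numeric when df is built via cbind(); recover it
df$count <- as.numeric(as.character(df$count))
df[rep(seq.int(1, nrow(df)), df$count), ]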
Works.
Try:
outdf = df
outdf = outdf[-c(1:nrow(outdf)),]
for(i in 1:nrow(df)){
  for(j in 1:df[i,]$count) outdf[nrow(outdf)+1,] = df[i,]
}
outdf
You should use:
df <- data.frame(woe,svi,fico_avg,count)
rather than
df <- data.frame(cbind(woe,svi,fico_avg,count))
No need for cbind here. It actually converts your count variable from numeric to a factor variable.
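A quick illustration of the pitfall (whether you get a factor or a character column depends on your R version; before R 4.0 it is a factor, afterwards character - either way no longer numeric):

woe   <- c("1:woe", "2:woe")
count <- c(8, 12)
class(data.frame(cbind(woe, count))$count) # "factor"/"character": cbind made a character matrix
class(data.frame(woe, count)$count)        # "numeric": each vector keeps its own type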
Try this:
df_long <- df[rep(1:nrow(df), df$count), ]
Hope it helps