Multiply 2 very large data frames in R

I have 2 data frames in R, as below:
Data Frame 1
structure(list(X1 = c(1, 4, 3), X2 = c(2, 1, 2), X3 = c(3, 1,
1)), class = "data.frame", row.names = c(NA, -3L))
Data Frame 2
structure(list(X1 = c(0.5, 0.1), X2 = c(0.7, 0.2), X3 = c(0.3,
0.2)), class = "data.frame", row.names = c(NA, -2L))
I want to multiply each row of DF1 with every row of DF2 and then perform some further calculations, as described below. This is essentially a matrix multiplication with a couple of additional steps:
After the matrix multiplication, I will calculate 1/(1 + exp(-x)) for every cell of the resulting matrix,
and lastly take the column sums of that matrix.
The dataset above is just a dummy set. In reality, DF1 has 1.1 million rows while DF2 has 65,000 rows.
When I do the matrix multiplication, I get the error
cannot allocate vector of size 560 GB
Is there an alternative to this? I am also looking for a time-efficient solution, given the size of the data frames.
Maybe data.table?
Thanks,
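One possible way around the allocation error (a sketch of my own, not from the original thread): assuming the intended product is as.matrix(DF1) %*% t(as.matrix(DF2)), the logistic transform and the column sums can be accumulated chunk by chunk, so the full 1.1M x 65,000 result matrix never has to exist in memory at once.
A  <- as.matrix(DF1)               # 1.1M x k
tB <- t(as.matrix(DF2))            # k x 65k

chunk_size <- 1000                 # tune to available RAM (each block is chunk_size x 65k)
col_totals <- numeric(ncol(tB))    # one running total per row of DF2

for (start in seq(1, nrow(A), by = chunk_size)) {
  rows  <- start:min(start + chunk_size - 1, nrow(A))
  block <- A[rows, , drop = FALSE] %*% tB    # partial product for this chunk of DF1 rows
  block <- 1 / (1 + exp(-block))             # logistic transform, cell by cell
  col_totals <- col_totals + colSums(block)  # accumulate the column sums
}
The loop does more R-level work than a single %*% call, but its peak memory use is only one chunk_size x 65,000 block at a time.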

Related

How to remove the last x zeros from a column dataframe in r?

My data frame looks like this:
dput(tree)
structure(list(date = c(2.0220409e+13, 2.022041e+13, 2.0220411e+13,
2.0220412e+13, 2.0220413e+13, 2.0220414e+13, 2.0220415e+13, 2.0220416e+13,
2.0220417e+13, 2.0220418e+13), N = c(1, 2, 3, 4, 5, 6, 7, 8,
9, 10), NDVI = c(0.7192, 0.7034, 0.689, 0.6761, 0.6646, 0.6545,
0.6458, 0.6386, 0.6299, 0.6231)), class = "data.frame", row.names = c(NA,
-10L))
In the column date I want to remove the last 6 zeros (which are repeated for all entries). How can I do that?
Any help is much appreciated.
Maybe like this?
options(scipen = 999)   # keep the result from printing in scientific notation
library(dplyr)
tree |>
  dplyr::mutate(across(date, ~ .x / 1000000))
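As a small addition (mine, not part of the original answer), it may be worth confirming that every value really does end in six zeros before dividing; the division then yields the 8-digit YYYYMMDD number:
stopifnot(all(tree$date %% 1e6 == 0))   # every date must be divisible by 1,000,000
tree$date <- tree$date / 1e6            # e.g. 2.0220409e+13 becomes 20220409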

Performance indices for unequal datasets in R

I wanted to compute performance indices in R. My data looks like the example below (it was originally posted as an image; the d1 and d2 objects under "Input:" in the answer serve as the example data).
I want to ignore the values at Time 2 and 4 in data frame 1 and compare only the rows for which observed data are available. I know how to write the equations for the performance indicators (R2, RMSE, IA, etc.), but I am not sure how to skip the values in the simulated data frame when no corresponding observed data are available for comparison.
Perhaps just do a left join, and compare the columns directly?
library(dplyr)
left_join(d2, d1 %>% rename(simData = Data), by = "Time")
Output:
Time Data simData
<dbl> <dbl> <dbl>
1 1 57 52
2 3 88 78
3 5 19 23
Input:
d1 = structure(list(Time = c(1, 2, 3, 4, 5), Data = c(52, 56, 78,
56, 23)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-5L))
d2 = structure(list(Time = c(1, 3, 5), Data = c(57, 88, 19)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -3L))

use mutate_if by subtracting from another data frame

I would like to do (more or less) the following
dplyr::mutate_if(tmp, is.numeric, function(x) x-df[3,])
In effect, this should subtract a value taken from df from every cell of tmp. The problem is that it should only use the matching column, i.e. tmp[x, y] - df[3, y].
However, what actually happens is that the whole df[3, ] vector is recycled down every column, irrespective of column position.
Is there any way to make this work with mutate_if by indexing the column somehow? That would be my preferred solution.
here is an example:
tmp is:
tmp <- structure(list(x = c(1, 1, 1, 1),
y = c(2, 2, 2, 2)),
row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
df (actually a matrix) is:
df <- structure(c(1L, 2L, 3L, 2L, 3L, 4L),
.Dim = 3:2, .Dimnames = list(NULL, c("x", "y")))
now when I apply mutate it returns:
structure(list(x = c(-2, -3, -2, -3),
y = c(-1, -2, -1, -2)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L))
but I want it to be:
structure(list(x = c(-2, -2, -2, -2),
y = c(-2, -2, -2, -2)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L))
I hope that makes it clearer
We can use purrr:
library(tibble)   # for as_tibble()
df1 <- as.data.frame(df)
as_tibble(purrr::map2(tmp[, purrr::map_lgl(tmp, is.numeric)],
                      df1[3, ],
                      function(x, y) x - y))
This gives us:
# A tibble: 4 x 2
x y
<dbl> <dbl>
1 -2 -2
2 -2 -2
3 -2 -2
4 -2 -2
This isn't a perfect solution, but it will get you what you want (if my understanding is correct), and then you will have to play with the formatting. I don't quite understand why you have an entire data frame for df if you only care about the 3rd row. I don't know how to index the column using dplyr::mutate_if either; that would be useful to know!
Since you want the columns to match, you are effectively trying to subtract a fixed row of df from each row of tmp. For loops and sapply() work well for this kind of row-wise operation.
library(dplyr)   # for the %>% pipe
sapply(1:nrow(tmp), function(x) tmp[x, ] - df[3, ]) %>%
  as.data.frame() %>%
  t()
## x y
## V1 -2 -2
## V2 -2 -2
## V3 -2 -2
## V4 -2 -2
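If you are on dplyr 1.0 or later (an assumption on my part; mutate_if has since been superseded), across() together with cur_column() can index the matching column of df directly, which is essentially the behaviour the question asks for:
library(dplyr)
tmp %>%
  mutate(across(where(is.numeric), ~ .x - df[3, cur_column()]))   # subtracts df[3, "x"] from x, df[3, "y"] from y
This returns the desired tibble of -2s without converting df or looping over rows.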

Control number of rows when binding dataframes with different number of rows?

I have a data frame generated by a function.
Each time the function is called, the result has a different number of rows:
structure(list(a = c(1, 2, 3), b = c("er", "gd", "ku"), c = c(43,
453, 12)), .Names = c("a", "b", "c"), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
structure(list(a = c(1, 2), b = c("er", "gd"), c = c(43, 453)), .Names = c("a",
"b", "c"), row.names = c(NA, -2L), class = c("tbl_df", "tbl",
"data.frame"))
I want to control the total number of rows when binding these data frames, the way a while loop would: keep binding until the total reaches at least n (n = 4, 100, 4242, ...).
Please advise how to do this using functional programming, without a while loop.
To be clear: if n = 10 and the total before the last bind_rows is 7, binding one more data frame might bring it to 20. That is fine; I just want the first total k with k >= n.
Here is my while loop doing this:
b <- list()
total_rows <- 0
while (total_rows < 1000) {
  df <- f_produce_rand_df()
  b[[length(b) + 1]] <- df
  total_rows <- total_rows + nrow(df)
}
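One loop-free alternative (a sketch of mine, not from the original thread) is to recurse until the bound result is long enough. The generator below is only a stand-in for f_produce_rand_df() so the example is self-contained:
library(dplyr)

f_produce_rand_df <- function() {           # stand-in: random number of rows, purely illustrative
  k <- sample(1:5, 1)
  tibble::tibble(a = seq_len(k), b = sample(letters, k), c = runif(k))
}

bind_at_least <- function(n, acc = NULL) {  # keep binding until at least n rows are reached
  acc <- bind_rows(acc, f_produce_rand_df())
  if (nrow(acc) >= n) acc else bind_at_least(n, acc)
}

nrow(bind_at_least(1000))                   # first total at or above 1000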

reshaping a large data frame from wide to long in R

I've been through the various reshape questions but don't believe this particular case has been asked before. I am dealing with a data frame of 81K rows and 4,188 variables. Variables 161:4188 are the measurements, stored as separate columns. The idvar is in column 1. I want to repeat columns 1:160 and create new records for columns 161:4188. The final data frame will have 162 columns and 326,268,000 rows (81K rows * 4,028 measurement variables converted into individual records).
Here is what I tried:
reshapeddf <- reshape(c, idvar = "PID", varying = c(dput(names(c[161:4188]))),
v.names = "viewership",
timevar = "network.show",
times = c(dput(names(c[161:4188]))),
direction = "long")
The operation didn't complete; I waited almost 10 minutes. Is this the right way? I am on a Windows 7 PC with 8 GB RAM and an i5 3.20 GHz CPU. What is the most efficient way to complete this reshape in R? Both of the answers by BondedDust and Nick are clever, but I run into memory issues with them. Could any of the three approaches in this thread (reshape, tidyr, or the do.call) be implemented using ff?
In the example data below, columns 1:4 are the ones I want to repeat and columns 5:9 are the ones to convert into new records.
structure(list(PID = c(1003401L, 1004801L, 1007601L, 1008601L,
1008602L, 1011901L), HHID = c(10034L, 10048L, 10076L, 10086L,
10086L, 10119L), HH.START.DATE = structure(c(1378440000, 1362974400,
1399521600, 1352869200, 1352869200, 1404964800), class = c("POSIXct",
"POSIXt"), tzone = ""), VISITOR.CODE = structure(c(1L, 1L, 1L,
1L, 1L, 1L), .Label = c("0", "L"), class = "factor"), WEIGHTED.MINUTES.VIEWED..ABC...20.20.FRI = c(0,
0, 305892, 0, 101453, 0), WEIGHTED.MINUTES.VIEWED..ABC...BLACK.ISH = c(0,
0, 0, 0, 127281, 0), WEIGHTED.MINUTES.VIEWED..ABC...CASTLE = c(0,
27805, 0, 0, 0, 0), WEIGHTED.MINUTES.VIEWED..ABC...CMA.AWARDS = c(0,
679148, 0, 0, 278460, 498972), WEIGHTED.MINUTES.VIEWED..ABC...COUNTDOWN.TO.CMA.AWARDS = c(0,
316448, 0, 0, 0, 0)), .Names = c("PID", "HHID", "HH.START.DATE",
"VISITOR.CODE", "WEIGHTED.MINUTES.VIEWED..ABC...20.20.FRI", "WEIGHTED.MINUTES.VIEWED..ABC...BLACK.ISH",
"WEIGHTED.MINUTES.VIEWED..ABC...CASTLE", "WEIGHTED.MINUTES.VIEWED..ABC...CMA.AWARDS",
"WEIGHTED.MINUTES.VIEWED..ABC...COUNTDOWN.TO.CMA.AWARDS"), row.names = c(NA,
6L), class = "data.frame")
Might be as easy as something like this:
dat2 <- cbind(dat[1:4], stack(dat[5:length(dat)]))
I think this should work:
library(tidyr)
newdf <- gather(yourdf, program, minutes, -PID:-VISITOR.CODE)
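Given the memory issues mentioned in the question, data.table's melt() may also be worth a try (my suggestion, not one of the original answers); it performs the same wide-to-long reshape and is generally faster and more memory-frugal than reshape() or gather():
library(data.table)
dt     <- as.data.table(yourdf)
longdt <- melt(dt,
               id.vars       = c("PID", "HHID", "HH.START.DATE", "VISITOR.CODE"),  # names(yourdf)[1:160] in the full data
               variable.name = "program",
               value.name    = "minutes")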
