use mutate_if by subtracting from another data frame - r

I would like to do (more or less) the following:
dplyr::mutate_if(tmp, is.numeric, function(x) x-df[3,])
In effect, this should subtract a value from df at every x. The problem is that it should only use the value from the matching column, i.e. tmp[x,y] - df[3,y].
However, what actually happens is that the whole df[3,] vector is recycled down each column, irrespective of column position.
Is there any way to make this work with mutate_if by indexing the column somehow? That would be my preferred solution.
Here is an example:
tmp is:
tmp <- structure(list(x = c(1, 1, 1, 1),
y = c(2, 2, 2, 2)),
row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
df (actually a matrix) is:
df <- structure(c(1L, 2L, 3L, 2L, 3L, 4L),
.Dim = 3:2, .Dimnames = list(NULL, c("x", "y")))
Now, when I apply mutate_if, it returns:
structure(list(x = c(-2, -3, -2, -3),
y = c(-1, -2, -1, -2)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L))
but I want it to be:
structure(list(x = c(-2, -2, -2, -2),
y = c(-2, -2, -2, -2)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L))
I hope that makes it clearer

We can use purrr:
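library(purrr)
# convert the matrix so that map2() can walk over its columns
df1 <- as.data.frame(df)
# pair each numeric column of tmp with the matching column of row 3 of df1
tibble::as_tibble(map2(tmp[, map_lgl(tmp, is.numeric)], df1[3, ], function(x, y) x - y))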
This gives us:
# A tibble: 4 x 2
x y
<dbl> <dbl>
1 -2 -2
2 -2 -2
3 -2 -2
4 -2 -2

This isn't a perfect solution, but it will get you what you want (if my understanding is correct), and then you will have to play with the formatting. I don't quite understand why you have an entire data frame for df if you only care about the 3rd row. I don't know how to index the column using dplyr::mutate_if either; that would be useful to know!
Since you want the columns to match, you are effectively trying to subtract a fixed row of df from each row of tmp. For loops and sapply() are good for row-wise subtraction.
library(magrittr)  # provides %>%
sapply(1:nrow(tmp), function(x) tmp[x, ] - df[3, ]) %>%
as.data.frame() %>%
t()
## x y
## V1 -2 -2
## V2 -2 -2
## V3 -2 -2
## V4 -2 -2
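On indexing the column from within dplyr: a minimal sketch, assuming dplyr 1.0.0 or later, where across() and cur_column() are available. cur_column() returns the name of the column currently being processed, so it can be used to pick the matching column of df. This relies on the column names of tmp and df being identical, which is the case in the example but is an assumption in general.

```r
library(dplyr)

# Same data as in the question, rebuilt in a shorter form
tmp <- tibble(x = c(1, 1, 1, 1), y = c(2, 2, 2, 2))
df <- matrix(c(1L, 2L, 3L, 2L, 3L, 4L), nrow = 3,
             dimnames = list(NULL, c("x", "y")))

# For every numeric column, subtract the value in row 3 of df
# that sits in the column of the same name
result <- tmp %>%
  mutate(across(where(is.numeric), ~ .x - df[3, cur_column()]))

print(result)
```

With the example data, every value in both columns should come out as -2, which is the result the question asks for.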

Related

Performance indices for unequal datasets in R

I want to compute performance indices in R. My data looks like this (example shown as an image in the original post):
I want to skip the values at Time 2 and 4 in data frame 1 and compare the remaining values with the available set of observed data. I know how to write the equations for the performance indicators (R2, RMSE, IA, etc.), but I am not sure how to ignore the rows of the simulated data frame for which no corresponding observed data is available.
Perhaps just do a left join, and compare the columns directly?
library(dplyr)
left_join(d2,d1 %>% rename(simData=Data), by="Time")
Output:
Time Data simData
<dbl> <dbl> <dbl>
1 1 57 52
2 3 88 78
3 5 19 23
Input:
d1 = structure(list(Time = c(1, 2, 3, 4, 5), Data = c(52, 56, 78,
56, 23)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-5L))
d2 = structure(list(Time = c(1, 3, 5), Data = c(57, 88, 19)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -3L))
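Once the two sets are paired by Time, the indicators can be computed directly on the matched columns. A minimal base R sketch, assuming d1 holds the simulated values and d2 the observed ones (as in the join above); the suffixes _obs and _sim are names chosen here for illustration. merge() keeps only the Time values present in both data frames by default, so Time 2 and 4 drop out.

```r
# Same input as above, as plain data frames
d1 <- data.frame(Time = c(1, 2, 3, 4, 5), Data = c(52, 56, 78, 56, 23))  # simulated
d2 <- data.frame(Time = c(1, 3, 5), Data = c(57, 88, 19))                # observed

# Inner join on Time; unmatched simulated rows are dropped
paired <- merge(d2, d1, by = "Time", suffixes = c("_obs", "_sim"))

# Root mean squared error between simulated and observed values
rmse <- sqrt(mean((paired$Data_sim - paired$Data_obs)^2))

# Squared Pearson correlation, one common definition of R2
r2 <- cor(paired$Data_sim, paired$Data_obs)^2

print(paired)
print(c(rmse = rmse, r2 = r2))
```

Other indicators follow the same pattern, since they only need the two aligned vectors paired$Data_obs and paired$Data_sim.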

Calculate mean of each row in a large list of dataframes in R

I know this question has been asked before on this forum. But my data set is significantly large and I could not make any of the existing solutions work.
Here's a sample dataset.
list(structure(list(id = c("id1", "id2", "id3"), value = c(2,
0, 2), value_2 = c(0, 1, 2)), class = "data.frame", row.names = c(NA,
-3L)), structure(list(id = c("id1", "id2", "id3"), value = c(-1,
0, 0), value_2 = c(1, 0, -3)), class = "data.frame", row.names = c(NA,
-3L)), structure(list(id = c("id1", "id2", "id3"), value = c(-2,
1, 0), value_2 = c(-2, 0, 1)), class = "data.frame", row.names = c(NA,
-3L)), structure(list(id = c("id1", "id2", "id3"), value = c(2,
0, 0), value_2 = c(-2, 0, -1)), class = "data.frame", row.names = c(NA,
-3L)))
I want to calculate the mean of the column 'value' for each 'id' across the list. The result should look like this, where 'value_mean' should be the average of the column 'value' of each id in lists 1, 2, 3 and 4.
structure(list(id = c("id1", "id2", "id3"), value_mean = c(NA,
NA, NA)), class = "data.frame", row.names = c(NA, -3L))
Please note that my real list has 5000 data frames, each with 100,000 rows. I have tried using "bind_rows" and similar functions to convert the list to a data frame first, but the data frame becomes too large and R runs out of memory.
Any help would be much appreciated! Thanks!
We may bind the list elements into a single data frame and then use a grouped mean operation:
library(dplyr)
bind_rows(lst1) %>%
group_by(id) %>%
summarise(value_mean = mean(value, na.rm = TRUE), .groups = 'drop')
Output:
# A tibble: 3 x 2
id value_mean
<chr> <dbl>
1 id1 0.25
2 id2 0.25
3 id3 0.5
If the datasets have the same dimensions and the 'id' values are in the same order, extract the 'value' column, use Reduce to do an elementwise + and divide by the length of the list:
Reduce(`+`, lapply(lst1, `[[`, "value"))/length(lst1)
[1] 0.25 0.25 0.50
Or a more efficient approach is with dapply/t_list from collapse
library(collapse)
dapply(t_list(dapply(lst1, `[[`, "value")), fmean)
V1 V2 V3
0.25 0.25 0.50
You could calculate the mean for each data.frame in your list first, and then combine those per-data.frame means into an overall mean, weighting each one by the number of elements it was computed from:
library(dplyr)
library(purrr)
my_list %>%
map_df(~ .x %>%
group_by(id) %>%
summarise(n = n(),
mean = mean(value, na.rm = TRUE))) %>%
group_by(id) %>%
summarize(mean_value = sum(n * mean)/ sum(n))
This returns
# A tibble: 3 x 2
id mean_value
<chr> <dbl>
1 id1 0.25
2 id2 0.25
3 id3 0.5
Disclaimer: I'm tired right now, don't know if this makes any sense.
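If the order of 'id' is not guaranteed to be the same in every data frame, a running total avoids binding everything into one large data frame. This is only a sketch under stated assumptions: every id appears exactly once in each data frame, there are no missing values, and running_mean_by_id is a hypothetical helper name, not an existing function.

```r
# Sample list from the question
lst1 <- list(
  data.frame(id = c("id1", "id2", "id3"), value = c(2, 0, 2),  value_2 = c(0, 1, 2)),
  data.frame(id = c("id1", "id2", "id3"), value = c(-1, 0, 0), value_2 = c(1, 0, -3)),
  data.frame(id = c("id1", "id2", "id3"), value = c(-2, 1, 0), value_2 = c(-2, 0, 1)),
  data.frame(id = c("id1", "id2", "id3"), value = c(2, 0, 0),  value_2 = c(-2, 0, -1))
)

running_mean_by_id <- function(dfs) {
  ids <- dfs[[1]]$id                 # reference order taken from the first data frame
  total <- numeric(length(ids))      # one running sum per id
  for (d in dfs) {
    # match() realigns this data frame's rows to the reference order
    total <- total + d$value[match(ids, d$id)]
  }
  data.frame(id = ids, value_mean = total / length(dfs))
}

res <- running_mean_by_id(lst1)
print(res)
```

Only one vector of sums is held in memory besides the list itself, so the extra memory needed is roughly one column of length 100,000 rather than a bound data frame of 5000 x 100,000 rows.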

Multiply 2 very large data frames in R

I have 2 dataframes in R as below
Data Frame 1
structure(list(X1 = c(1, 4, 3), X2 = c(2, 1, 2), X3 = c(3, 1,
1)), class = "data.frame", row.names = c(NA, -3L))
Data Frame 2
structure(list(X1 = c(0.5, 0.1), X2 = c(0.7, 0.2), X3 = c(0.3,
0.2)), class = "data.frame", row.names = c(NA, -2L))
I want to multiply each row of DF1 with every row of DF2 and then perform some further calculations. This is a sort of matrix multiplication with additional steps:
After the matrix multiplication, I calculate 1/(1+exp(-x)) for every cell in the resulting matrix,
and lastly I take the column sums of that matrix.
The above dataset is just a dummy set. In reality, DF1 has 1.1 million rows while DF2 has 65000 rows.
While doing the matrix multiplication, I get the error
cannot allocate vector of Size 560 GB
Is there any alternative to this? I am also looking for a time-efficient solution, given how large the data frames are.
Maybe data.table?
Thanks,
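One way around the allocation error is to never build the full product at once. Since only the column sums are needed, DF1 can be processed in blocks of rows and the sums accumulated. The sketch below assumes "matrix multiplication" means DF1 %*% t(DF2), so the result has one column per row of DF2; chunked_sigmoid_colsums and chunk_size are hypothetical names. plogis(x) from base R computes 1/(1+exp(-x)).

```r
DF1 <- data.frame(X1 = c(1, 4, 3), X2 = c(2, 1, 2), X3 = c(3, 1, 1))
DF2 <- data.frame(X1 = c(0.5, 0.1), X2 = c(0.7, 0.2), X3 = c(0.3, 0.2))

chunked_sigmoid_colsums <- function(m1, m2, chunk_size = 1000L) {
  m1 <- as.matrix(m1)
  m2t <- t(as.matrix(m2))              # p x n2, transposed once up front
  totals <- numeric(ncol(m2t))         # running column sums, one per row of m2
  starts <- seq(1L, nrow(m1), by = chunk_size)
  for (s in starts) {
    e <- min(s + chunk_size - 1L, nrow(m1))
    # Only a (chunk x n2) block is held in memory at any time
    block <- m1[s:e, , drop = FALSE] %*% m2t
    totals <- totals + colSums(plogis(block))
  }
  totals
}

# Tiny chunk size just to exercise the loop on the dummy data
res <- chunked_sigmoid_colsums(DF1, DF2, chunk_size = 2L)
print(res)
```

The chunk size trades memory for the number of loop iterations: with 65000 rows in DF2, a block of 1000 rows needs 65 million doubles, so it should be picked to fit the available RAM.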

Control number of rows when binding dataframes with different number of rows?

I have a data frame generated by a function.
Each time it has a different number of rows:
structure(list(a = c(1, 2, 3), b = c("er", "gd", "ku"), c = c(43,
453, 12)), .Names = c("a", "b", "c"), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
structure(list(a = c(1, 2), b = c("er", "gd"), c = c(43, 453)), .Names = c("a",
"b", "c"), row.names = c(NA, -2L), class = c("tbl_df", "tbl",
"data.frame"))
I want to be able to control the total number of rows relative to n (n = 4, 100, 4242...) when I bind rows, the way I would in a while loop.
Please advise how to do this using functional programming, without a while loop.
I mean that sometimes n = 10, the data frame before bind_rows has 7 rows, and after binding the last one it has 20. That's OK: I want the number of rows to be the smallest total k that is reached with k >= n.
Here is my while loop doing this:
b <- list()
total_rows <- 0
while(total_rows < 1000) {
df <- f_produce_rand_df()
b[[length(b) + 1]] <- df
total_rows <- total_rows + nrow(df)
}
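One loop-free way to express the same stopping rule is a recursive accumulator. This is only a sketch: collect_until and make_df are hypothetical names, and make_df here is a deterministic stand-in for f_produce_rand_df(). It assumes the generator eventually returns rows (a generator that only returns empty data frames would recurse forever), and R's limit on nested calls means it suits cases where a moderate number of data frames is needed.

```r
collect_until <- function(make_df, n, acc = list(), rows = 0L) {
  # Stop as soon as the running row count has reached n
  if (rows >= n) return(acc)
  df <- make_df()
  # Recurse with the new data frame appended and the count updated
  collect_until(make_df, n, c(acc, list(df)), rows + nrow(df))
}

# Stand-in generator: always returns a 3-row data frame
make_df <- function() data.frame(a = 1:3, b = c("er", "gd", "ku"))

pieces <- collect_until(make_df, n = 10)
combined <- do.call(rbind, pieces)
print(nrow(combined))
```

The result is the list of data frames whose combined row count is the first total at or above n; do.call(rbind, ...) can be swapped for dplyr::bind_rows() if dplyr is already in use.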

Export data frames from list to txt file

I have a question about exporting data frames from a list into a txt file. I found some solutions, but they only work for vectors. Here is one example:
dataframe1 <- data.frame(a= c(1,2,3,4,5), b= c(1,1,1,1,1))
dataframe2 <- data.frame(a= c(5,5,5), b= c(1,1,1))
mylist <- list(dataframe1, dataframe2)
I would like the txt file to look like this:
$dataframe1
a b
1 1
2 1
3 1
4 1
5 1
$dataframe2
a b
5 1
5 1
5 1
Thank you for the help.
Say your list is named:
mylist<-structure(list(dataframe1 = structure(list(a = c(1, 2, 3, 4,
5), b = c(1, 1, 1, 1, 1)), .Names = c("a", "b"), row.names = c(NA,
-5L), class = "data.frame"), dataframe2 = structure(list(a = c(5,
5, 5), b = c(1, 1, 1)), .Names = c("a", "b"), row.names = c(NA,
-3L), class = "data.frame")), .Names = c("dataframe1", "dataframe2"
))
You can try:
con<-file("temp.csv",open="at")
Map(function(x,y) {cat(file=con,y,"\n");write.table(x,file=con,quote=FALSE,row.names=FALSE)},
mylist,names(mylist))
close(con)
The above will write the data frames to the file temp.csv, each preceded by its name. You have to give names to your list if you want it to work.
Alternatively, if you are ok with the print method, you can just redirect the standard output to a file:
sink("temp.csv")
print(mylist)
sink(NULL)
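To get the "$name" header lines from the question in a txt file, the first approach can be adapted slightly. A minimal sketch, assuming a named list; write_df_list is a hypothetical helper name, and a temporary file is used here only so the example does not leave files behind.

```r
mylist <- list(
  dataframe1 = data.frame(a = c(1, 2, 3, 4, 5), b = c(1, 1, 1, 1, 1)),
  dataframe2 = data.frame(a = c(5, 5, 5), b = c(1, 1, 1))
)

write_df_list <- function(dfs, path) {
  stopifnot(!is.null(names(dfs)))        # the headers come from the list names
  con <- file(path, open = "wt")
  on.exit(close(con))                    # close the connection even on error
  for (nm in names(dfs)) {
    writeLines(paste0("$", nm), con)     # header line such as $dataframe1
    write.table(dfs[[nm]], file = con, quote = FALSE, row.names = FALSE)
  }
  invisible(path)
}

out_path <- tempfile(fileext = ".txt")
write_df_list(mylist, out_path)
out_lines <- readLines(out_path)
cat(out_lines, sep = "\n")
```

Columns are separated by a single space, which is the write.table() default; pass sep = "\t" to write.table() if tab-separated output is preferred.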
