Performance indices for unequal datasets in R - r

I wanted to do the performance indices in R. My data looks like this (example):
enter image description here
I want to ignore the comparison of values in Time 2 and 4 in data frame 1 and then compare it with the available set of observed data. I know how to develop the equation for the performance indicators (R2, RMSE, IA, etc.), but I am not sure how to ignore the data in the simulated data frame when corresponding observed data is not available for comparison.

Perhaps just do a left join, and compare the columns directly?
library(dplyr)
left_join(d2,d1 %>% rename(simData=Data), by="Time")
Output:
Time Data simData
<dbl> <dbl> <dbl>
1 1 57 52
2 3 88 78
3 5 19 23
Input:
d1 = structure(list(Time = c(1, 2, 3, 4, 5), Data = c(52, 56, 78,
56, 23)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-5L))
d2 = structure(list(Time = c(1, 3, 5), Data = c(57, 88, 19)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -3L))

Related

Combine ifelse two conditions and loop

I have a liste of dataframes (file1, file2, ..., file 72). For each dataframe I want to create one variable containing information from another dataframe based on two conditions.
The idea is simple:
condition 1: if file*$countryid equals source$country, and
condition 2: if file*$year is higher than source$starting but lower than source$ending, then if true I want to create a column file*$rank with the value in source$rank
I have been trying code lines like this but this code does not go through all lines in source:
file1$rank<-ifelse(file1$countryid=source$countryid & file1$year>source$starting & file1$year<source$ending,source$rank,NA)
In addition I would like to implement this within a loop to avoid iterating manually through all these dataframes:
dflist<-Filter(is.data.frame, mget(ls()))
dflist<-function(df,x){df$rank<-ifelse(df$countryid=source$countryid & df$year>source$starting & df$year<source$ending,source$rank,NA))
Here is an example of the data I have.
Thank you!
> dput(file1)
structure(list(id = c(1, 2, 3), countryid = c(10, 10, 13), year = c(1948,
1954, 1908)), row.names = c(NA, -3L), class = c("tbl_df", "tbl",
"data.frame"))
dput(file2)
structure(list(id = c(1, 2, 3), countryid = c(13, 10, 13), year = c(1907,
1908, 1907)), row.names = c(NA, -3L), class = c("tbl_df", "tbl",
"data.frame"))
> dput(source)
structure(list(country = c(13, 13, 13, 10, 10, 10), rank = c(1,
2, 3, 1, 2, 3), starting = c(1885, 1909, 1940, 1902, 1907, 1931
), ending = c(1908, 1939, 1960, 1906, 1930, 1960)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
We can use a non-equi join after getting all the file\\d+ datasets into a list
library(data.table)
out <- lapply(mget(ls(pattern = '^file\\d+$')), function(dat)
setDT(dat)[, year := as.integer(year)][as.data.table(source), rank := i.rank,
on = .(countryid = country, year > starting, year < ending)])
-output
out
#$file1
# id countryid year rank
#1: 1 10 1948 3
#2: 2 10 1954 3
#3: 3 13 1908 NA
#$file2
# id countryid year rank
#1: 1 13 1907 1
#2: 2 10 1908 2
#3: 3 13 1907 1
if it needs to update the original objects, use list2env
list2env(out, .GlobalEnv)

Multiply 2 very large data frames in R

I have 2 dataframes in R as below
Data Frame 1
structure(list(X1 = c(1, 4, 3), X2 = c(2, 1, 2), X3 = c(3, 1,
1)), class = "data.frame", row.names = c(NA, -3L))
Data Frame 2
structure(list(X1 = c(0.5, 0.1), X2 = c(0.7, 0.2), X3 = c(0.3,
0.2)), class = "data.frame", row.names = c(NA, -2L))
I want to multiply each row of DF1 with every row of DF2 and perform some calculation as below. This is a sort of matrix multiplication along with additional calculations
After matrix multiplication, I will calculate 1/(1+exp(-x)) for every cell in resultant matrix
and lastly, take the column sum of the matrix
The above dataset is just a dummy set. In actual, DF1 has 1.1 million rows while DF2 has 65000 rows.
While doing matrix multiplication, I get error
cannot allocate vector of Size 560 GB
Is there any alternative to this. Also, I am looking for time effective solution due to large data frames.
May be Data table ?
Thanks,

use mutate_if by subtracting from another data frame

I would like to do (more or less) the following
dplyr::mutate_if(tmp, is.numeric, function(x) x-df[3,])
in effect this should subtract at every x a value from df. The problem I have is that it should only use the matching column number, i.e. tmp[x,y] - df[3,y].
However what's happening is that it loops over the df[3,] vector for every x, irrespective of column position.
Is there any way to make this work with mutate_if by indexing the column somehow, which would be my preferred solution?
here is an example:
tmp is:
tmp <- structure(list(x = c(1, 1, 1, 1),
y = c(2, 2, 2, 2)),
row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
df (actually a matrix) is:
df <- structure(c(1L, 2L, 3L, 2L, 3L, 4L),
.Dim = 3:2, .Dimnames = list(NULL, c("x", "y")))
now when I apply mutate it returns:
structure(list(x = c(-2, -3, -2, -3),
y = c(-1, -2, -1, -2)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L))
but I want it to be:
structure(list(x = c(-2, -2, -2, -2),
y = c(-2, -2, -2, -2)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L))
I hope that makes it clearer
We can use purrr:
df1<-as.data.frame(df)
as_tibble(purrr::map2(tmp[,purrr::map_lgl(tmp,is.numeric)],df1[3,],function(x,y) x-y))
This gives us:
# A tibble: 4 x 2
x y
<dbl> <dbl>
1 -2 -2
2 -2 -2
3 -2 -2
4 -2 -2
This isn't a perfect solution, but it will get you what you want (if my understanding is correct), and then you will have to play with the formatting. I don't quite understand why you have an entire data frame for df if you only care about the 3rd row. I don't know how to index the column using dplyr::mutate_if either; that would be useful to know!
Since you want the columns to match, you are effectively trying to subtract each row of tmp from a set row of df. For loops and sapply() are good for row-wise subtraction.
sapply(1:nrow(tmp), function(x) tmp[x, ] - df[3, ]) %>%
as.data.frame() %>%
t()
## x y
## V1 -2 -2
## V2 -2 -2
## V3 -2 -2
## V4 -2 -2

Control number of rows when binding dataframes with different number of rows?

I have a dataframe generated by a function:
Each time it's of different number of rows:
structure(list(a = c(1, 2, 3), b = c("er", "gd", "ku"), c = c(43,
453, 12)), .Names = c("a", "b", "c"), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
structure(list(a = c(1, 2), b = c("er", "gd"), c = c(43, 453)), .Names = c("a",
"b", "c"), row.names = c(NA, -2L), class = c("tbl_df", "tbl",
"data.frame"))
I want to be able like in a while loop to control the number of rows to be less then n (n = 4, 100, 4242...) when I bind rows.
Please advise how to do this using functional programming without a while loop?
I mean sometimes you will get n = 10 and the df before bind_rows is 7 and after binding the last one it will be 20. It's ok, I want the number of rows to be min_k (k >= n)
Here is my while loop doing this:
b <- list()
total_rows <- 0
while(total_rows < 1000) {
df <- f_produce_rand_df()
b[[length(b) + 1]] <- df
total_rows <- total_rows + nrow(df)
}

How to cast nested list to matrix like or tabular like object?

I have nested list data that needed to be in desired output representation, either matrix like object or just directly export these nested list as csv file. I tried several general approach to do this task, but exporting nested list is not going well, so I am looking for the solution that casting nested list to matrix like or tabular like object to hold data in desired way. Maybe I could hold nested list data in data.table, but not quite sure about this. Can anyone tell me how to do this sort of manipulation easily ? How can I achieve clean, well structured data representation for nested list data ? Any idea ? Thanks a lot
mini example :
output of custom function:
AcceptedList <- list(
A_accepted = data.frame(pos.start=c(1,6,16), pos.stop=c(4,12,23), pos.ID=c("A1","A2","A3"), pos.score=c(11,8,13)),
B_accepted = data.frame(pos.start=c(7,19,31), pos.stop=c(13,28,43), pos.ID=c("B3","B6","B7"), pos.score=c(12,5,7)),
C_accepted = data.frame(pos.start=c(5,21,36), pos.stop=c(11,29,42), pos.ID=c("C2","C4","C9"), pos.score=c(7,13,9))
)
RejectedList <- list(
A_rejected = data.frame(pos.start=c(6,25,40), pos.stop=c(12,33,49), pos.ID=c("A2","A5","A8"), pos.score=c(8,4,7)),
B_rejected = data.frame(pos.start=c(15,19,47), pos.stop=c(18,28,55), pos.ID=c("B4","B6","B9"), pos.score=c(10,5,14)),
C_rejected = data.frame(pos.start=c(13,21,36,53), pos.stop=c(19,29,42,67), pos.ID=c("C3","C4","C9","C12"), pos.score=c(4,13,9,17))
)
so I implement this function to further manipulate output one more step :
func <- function(mlist, threshold) {
res <- lapply(mlist, function(x) {
splt <- split(x, ifelse(x$pos.score >= threshold, "up", "down"))
})
return(res)
}
#example
.res_accepted <- func(AcceptedList, 9)
.res_rejected <- func(RejectedList, 9)
I have hard time how to case nested list .res_accepted, .res_rejected as matrix like object. Ideally exporting nested list as csv file is highly expected, but I failed to export them in desired way. How can I make this happen ?
ultimately, desired list of csv files with desired named as follows:
A_accepted_up.csv
A_accepted_down.csv
A_rejected_up.csv
A_rejected_down.csv
B_accepted_up.csv
B_accepted_down.csv
B_rejected_up.csv
B_rejected_down.csv
C_accepted_up.csv
C_accepted_down.csv
C_rejected_up.csv
C_rejected_down.csv
The point is, nested list returned by my custom functions, so I intend to either directly export them or cast them into matrix like object as well. Any idea for this sort of manipulation ? Thanks:)
This returns a data.frame DF of the data. No packages are used.
both <- do.call("rbind", c(AcceptedList, RejectedList))
cn <- c("letter", "accepted", "seq")
DF <- cbind(
read.table(text = chartr("_", ".", rownames(both)), sep = ".", col.names = cn),
both)
DF <- transform(DF, updown = ifelse(pos.score > 8, "up", "down"))
giving:
> DF
letter accepted seq pos.start pos.stop pos.ID pos.score updown
A_accepted.1 A accepted 1 1 4 A1 11 up
A_accepted.2 A accepted 2 6 12 A2 8 down
A_accepted.3 A accepted 3 16 23 A3 13 up
B_accepted.1 B accepted 1 7 13 B3 12 up
B_accepted.2 B accepted 2 19 28 B6 5 down
B_accepted.3 B accepted 3 31 43 B7 7 down
C_accepted.1 C accepted 1 5 11 C2 7 down
C_accepted.2 C accepted 2 21 29 C4 13 up
C_accepted.3 C accepted 3 36 42 C9 9 up
A_rejected.1 A rejected 1 6 12 A2 8 down
A_rejected.2 A rejected 2 25 33 A5 4 down
A_rejected.3 A rejected 3 40 49 A8 7 down
B_rejected.1 B rejected 1 15 18 B4 10 up
B_rejected.2 B rejected 2 19 28 B6 5 down
B_rejected.3 B rejected 3 47 55 B9 14 up
C_rejected.1 C rejected 1 13 19 C3 4 down
C_rejected.2 C rejected 2 21 29 C4 13 up
C_rejected.3 C rejected 3 36 42 C9 9 up
C_rejected.4 C rejected 4 53 67 C12 17 up
This will write DF out in separate files:
junk <- by(DF, DF[c("letter", "accepted", "updown")],
function(x) write.csv(x[-(1:3)],
sprintf("%s_%s_%s.csv", x$letter[1], x$accepted[1], x$updown[1])))
or this will write out the data frames in .res_accepted -- .res_rejected could be handled similarly:
junk <- lapply(names(.res_accepted), function(nm)
mapply(write.csv,
.res_accepted[[nm]],
paste0(nm, "_", names(.res_accepted[[nm]]), ".csv")))
Note: The poster changed the data after this answer already had appeared. The output above corresponds to the original data; however, it should also work for the revised data. The original data was:
AcceptedList <-
structure(list(foo_accepted = structure(list(pos.start = c(1,
6, 16), pos.stop = c(4, 12, 23), pos.ID = structure(1:3, .Label = c("A1",
"A2", "A3"), class = "factor"), pos.score = c(11, 8, 13)), .Names = c("pos.start",
"pos.stop", "pos.ID", "pos.score"), row.names = c(NA, -3L), class = "data.frame"),
bar_accepted = structure(list(pos.start = c(7, 19, 31), pos.stop = c(13,
28, 43), pos.ID = structure(1:3, .Label = c("B3", "B6", "B7"
), class = "factor"), pos.score = c(12, 5, 7)), .Names = c("pos.start",
"pos.stop", "pos.ID", "pos.score"), row.names = c(NA, -3L
), class = "data.frame"), cat_accepted = structure(list(pos.start = c(5,
21, 36), pos.stop = c(11, 29, 42), pos.ID = structure(1:3, .Label = c("C2",
"C4", "C9"), class = "factor"), pos.score = c(7, 13, 9)), .Names = c("pos.start",
"pos.stop", "pos.ID", "pos.score"), row.names = c(NA, -3L
), class = "data.frame")), .Names = c("foo_accepted", "bar_accepted",
"cat_accepted"))
RejectedList <-
structure(list(foo_rejected = structure(list(pos.start = c(6,
25, 40), pos.stop = c(12, 33, 49), pos.ID = structure(1:3, .Label = c("A2",
"A5", "A8"), class = "factor"), pos.score = c(8, 4, 7)), .Names = c("pos.start",
"pos.stop", "pos.ID", "pos.score"), row.names = c(NA, -3L), class = "data.frame"),
bar_rejected = structure(list(pos.start = c(15, 19, 47),
pos.stop = c(18, 28, 55), pos.ID = structure(1:3, .Label = c("B4",
"B6", "B9"), class = "factor"), pos.score = c(10, 5,
14)), .Names = c("pos.start", "pos.stop", "pos.ID", "pos.score"
), row.names = c(NA, -3L), class = "data.frame"), cat_rejected = structure(list(
pos.start = c(13, 21, 36, 53), pos.stop = c(19, 29, 42,
67), pos.ID = structure(c(2L, 3L, 4L, 1L), .Label = c("C12",
"C3", "C4", "C9"), class = "factor"), pos.score = c(4,
13, 9, 17)), .Names = c("pos.start", "pos.stop", "pos.ID",
"pos.score"), row.names = c(NA, -4L), class = "data.frame")),
.Names = c("foo_rejected",
"bar_rejected", "cat_rejected"))

Resources