R: subtracting 2 dataframes with different numbers of rows by a matching column

I have 2 dataframes with the same headers, similar to this:
Jul X1 X2 X3 X4 X5
The sizes of the two dataframes are:
A: nrowA=2191, ncolA=51.
B: nrowB=366, ncolB=51.
I have exactly the same columns in each dataframe. The first dataframe (A) holds daily temperature data for 4 years, while the second (B) is a "reference". I want to compute (A-B) for the rows where the first column (Jul) of the two dataframes matches. Could you please suggest a way to do that while avoiding loops? Cheers

If you know SQL, the sqldf package lets you run SQL queries against data frames:
D1 <- data.frame(a = 1:5, b=letters[1:5])
D2 <- data.frame(a = 1:3, b=letters[1:3])
require(sqldf)
a1NotIna2 <- sqldf('SELECT * FROM D1 WHERE (a NOT IN (SELECT a FROM D2))')
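For the subtraction itself, base R's match() can line up each row of A with the row of B that has the same Jul value, with no explicit loop. The following is only a sketch on small made-up stand-ins for A and B (the values and the two value columns are assumptions, not the asker's data), and it assumes every Jul in A appears exactly once in B:

```r
# Hypothetical stand-ins: A repeats each Jul value, B holds one reference row per Jul
A <- data.frame(Jul = c(1, 2, 3, 1, 2, 3),
                X1 = c(10, 11, 12, 13, 14, 15),
                X2 = c(1, 2, 3, 4, 5, 6))
B <- data.frame(Jul = c(3, 1, 2),
                X1 = c(2, 0, 1),
                X2 = c(1, 1, 1))

# For each row of A, the position of the row in B with the same Jul
idx <- match(A$Jul, B$Jul)

# Subtract every column except the key, keeping Jul untouched
value_cols <- setdiff(names(A), "Jul")
result <- A
result[value_cols] <- A[value_cols] - B[idx, value_cols, drop = FALSE]
result
```

Rows of A whose Jul has no partner in B would come out as NA, which may or may not be what is wanted.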

Related

find the common rows in 2 different dataframes

I'm writing a function that needs to work with different datasets. The columns that have to be passed inside the function look somewhat like the following data frames:
df1 <- data.frame(x1 = c("d","e","f","g"), x2 = c("Aug 2017","Sep 2017","Oct 2017","Nov 2017"), x3 = c(456,678,876,987))
df2 <- data.frame(x1 = c("a","b","c","d"), x2 = c("Aug 2017","Sep 2017","Oct 2017","Nov 2017"), x3 = c(123,324,345,564))
From these I need to find out if any of the df1$x1 values are present in df2$x1. If so, print the entire row of df1 whose x1 value is present in df2$x1.
I need to use the data frames inside the function but I can't specify the column names explicitly. So I need to find a way to access the columns without exactly using the column name.
The desired output:
x1 x2 x3 x4
d Aug 2017 456 common
My problem is, I can't use any kind of function where I need to specify the column names explicitly. For example, inner join cannot be performed since I have to specify
by = 'col_name'
You can use match with column indices:
df1[na.omit(match(df2[, 1], df1[, 1])), ]
# x1 x2 x3
#1 d Aug 2017 456
Here are three simple examples of functions that you might use to return the rows you want, where the function itself does not need to hardcode the name of the column, and does not require the indices to match:
Pass the frames directly, and the column names as strings:
f <- function(d1, d2, col1, col2) d1[which(d1[,col1] %in% d2[,col2]),]
Usage and Output
f(df1,df2, "x1", "x1")
x1 x2 x3
1 d Aug 2017 456
Pass only the values of the columns as vectors:
f <- function(x,y) which(x %in% y)
Usage and Output
library(dplyr)
df1 %>% filter(row_number() %in% f(x1, df2$x1))
x1 x2 x3
1 d Aug 2017 456
Pass the frames and the unquoted columns, using the {{}} operator:
f <- function(d1, d2, col1, col2) {
  filter(d1, {{col1}} %in% pull(d2, {{col2}}))
}
Usage and Output:
f(df1,df2,x1,x1)
x1 x2 x3
1 d Aug 2017 456
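If the extra x4 column from the desired output is wanted too, a variant that addresses columns by position could look like this. It is a sketch rather than a drop-in: the function name common_rows, its arguments i and j, and the label column name x4 are chosen here for illustration.

```r
df1 <- data.frame(x1 = c("d", "e", "f", "g"),
                  x2 = c("Aug 2017", "Sep 2017", "Oct 2017", "Nov 2017"),
                  x3 = c(456, 678, 876, 987))
df2 <- data.frame(x1 = c("a", "b", "c", "d"),
                  x2 = c("Aug 2017", "Sep 2017", "Oct 2017", "Nov 2017"),
                  x3 = c(123, 324, 345, 564))

# i and j are column positions (hypothetical arguments), so no column name is hardcoded
common_rows <- function(d1, d2, i = 1, j = 1) {
  hits <- d1[d1[[i]] %in% d2[[j]], , drop = FALSE]
  # Label the surviving rows; rep() also copes with zero matches
  hits$x4 <- rep("common", nrow(hits))
  hits
}

out <- common_rows(df1, df2)
out
```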

Changing character values in a data frame column based on a conversion data frame in R

I have a data frame in R that has a column of strings/characters. I am calling this "myDat" below.
I have another data frame in R that has two columns of strings/characters. I am calling this "conversionDat" below. One column ("Name") contains names similar to those in the "myDat" column. The other column ("Name2") contains the names to which the "myDat" column should be converted.
Here is a MWE of these two data frames:
myDat <- data.frame(Name = c("A","D","P","R"))
conversionDat <- data.frame(Name = c("D","R","A","P"), Name2 = c("S","T","B","Z"))
myDat$Name <- as.character(myDat$Name)
conversionDat$Name <- as.character(conversionDat$Name)
conversionDat$Name2 <- as.character(conversionDat$Name2)
I would like to find any case where "myDat" equals a value in "conversionDat$Name" and convert it to "conversionDat$Name2". So, in the MWE above, the "conversionDat" data frame would remain unchanged, but the "myDat" data frame would become:
B2
S2
Z2
T2
Is there a painless method to go about doing this? Any ideas would be much appreciated!
A painless method would be to simply merge both and then add the "2" you need in the Name2 column?
myDat <- data.frame(Name = c("A","D","P","R"))
conversionDat <- data.frame(Name = c("D","R","A","P"), Name2 = c("S","T","B","Z"))
myDat <- merge(myDat, conversionDat, by = "Name")
myDat$Name2 <- paste(myDat$Name2, "2", sep = "")
> myDat
Name Name2
1 A B2
2 D S2
3 P Z2
4 R T2
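Note that merge() sorts its result by the key by default, so the row order of myDat is only preserved here because it was already sorted. A lookup with match() keeps the original order. This is a sketch of that alternative, and leaving unmatched names unchanged is an assumption about what is wanted:

```r
myDat <- data.frame(Name = c("A", "D", "P", "R"), stringsAsFactors = FALSE)
conversionDat <- data.frame(Name = c("D", "R", "A", "P"),
                            Name2 = c("S", "T", "B", "Z"),
                            stringsAsFactors = FALSE)

# Position of each myDat name inside the conversion table (NA if absent)
idx <- match(myDat$Name, conversionDat$Name)

# Convert matched names and append "2"; keep unmatched names as they are (assumption)
converted <- ifelse(is.na(idx),
                    myDat$Name,
                    paste0(conversionDat$Name2[idx], "2"))
myDat$Name <- converted
myDat
```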

Pasting a column from column names of a data.frame

I have a list (list1) with n elements. Each element of list1 is a data frame (df1, df2, ..., dfn), and the data frames may have different (or the same) numbers of columns.
Let the i'th element be the data frame dfi with column names x1,x2,x3,....,xi.
I want to paste together a formula from the column names:
x1+x2+x3+......+xi.
and assign this formula as the i'th element of a list (list2).
I want to do this for each data frame in list1.
How can I do this using R? I will be very glad for any help. Thanks a lot.
Ex: Let list1 have two elements(two data frames: df1 and df2)
list1[[1]]:
df1:
x1 x2 x3
-- -- --
43 12 7
3 6 5
and
list1[[2]]:
df2:
x1 x2
-- --
21 45
14 16
I want to return list2 which is:
list2[[1]]:
x1+x2+x3
and
list2[[2]]:
x1+x2
I am not interested in the elements of the data frames (df1 and df2), just in the column names.
From my understanding of your question, this may do the work for you?
list2 <- lapply(list1, function(x) { return(paste(names(x), collapse = "+")) })
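To check that one-liner against the example from the question, here is a small self-contained run. The last step, turning each string into a one-sided formula with as.formula(), is an extra that the question does not ask for and is shown only as a possibility:

```r
# The two example data frames from the question
list1 <- list(
  data.frame(x1 = c(43, 3), x2 = c(12, 6), x3 = c(7, 5)),
  data.frame(x1 = c(21, 14), x2 = c(45, 16))
)

# One "x1+x2+..." string per data frame, built from the column names only
list2 <- lapply(list1, function(df) paste(names(df), collapse = "+"))
list2

# Optional (assumption): one-sided formula objects instead of plain strings
formulas <- lapply(list2, function(s) as.formula(paste("~", s)))
```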
Does this do what you want?
list2 <- lapply(list1, function(x){apply(x, 1, sum)})
I have tried to do something similar recently. I don't have a graceful solution, but the below will work only if you have an equal number of columns in all the dataframes.
length_list <- length(list1)
i <- 1
cols_in_df <- ncol(list1[[i]])
for (i in 1:length_list)
{
  # pseudocode: "...." stands for the remaining columns
  assign(model[i], lm(list1[[i]][1] + list1[[i]][2] + .... + list1[[i]][cols_in_df]))
}

Function for pasting corrected values inside existing dataframe

Does something like the 'paste_over' function below already exist within base R or one of the standard R packages?
paste_over <- function(original, corrected, key){
  corrected <- corrected[order(corrected[[key]]),]
  output <- original
  output[
    original[[key]] %in% corrected[[key]],
    names(corrected)
  ] <- corrected
  return(output)
}
An example:
D1 <- data.frame(
  k = 1:5,
  A = runif(5),
  B = runif(5),
  C = runif(5),
  D = runif(5),
  E = runif(5)
)
D2 <- data.frame(
  k=c(4,1,3),
  D=runif(3),
  E=runif(3),
  A=runif(3)
)
D2 <- D2[order(D2$k),]
D3 <- D1
D3[
  D1$k %in% D2$k,
  names(D2)
] <- D2
D4 <- paste_over(D1, D2, "k")
all(D4==D3)
In the example D2 contains some values that I want to paste over corresponding cells within D1. However D2 is not in the same order and does not have the same dimension as D1.
The motivation for this is that I was given a very large dataset, reported some errors within it, and received a subset of the original dataset with some corrected values. I would like to be able to 'paste over' the new, corrected values into the old dataset without changing the old dataset in terms of structure. (As the rest of the code I've written assumes the old dataset's structure.)
Although the paste_over function seems to work I can't help but think this must have been tackled before, and so maybe there's already a well known function that's both faster and has error checking. If there is then please let me know what it is.
Thanks.
We can accomplish this using data.table as follows:
library(data.table)
setkeyv(setDT(D1), "k")
cols = c("D", "E", "A")
D1[D2, (cols) := D2[, cols]]
setDT() converts a data.frame to data.table by reference (without actually copying the data). We want D1 to be a data.table.
setkeyv() (like setkey(), but taking the column names as a character vector) sorts the data.table by the column specified (here k) and marks that column as sorted (by setting the attribute sorted) by reference. This allows us to perform joins using binary search.
x[i] in data.table performs a join. Briefly, for each row of column k in D2, it finds the matching row indices in D1 by matching on D1's key column (here k).
x[i, LHS := RHS] performs the join to find matching rows, and the LHS := RHS part adds/updates x with the columns specified in LHS with the values specified in RHS by reference. LHS should be a vector of column names or numbers, and RHS should be a list of values.
So, D1[D2, (cols) := D2[, cols]] finds matching rows in D1 for k=c(1,3,4) from D2 and updates the columns D,E,A specified in cols by the list (a data.frame is also a list) of corresponding columns from D2 on RHS.
D1 will now be modified in-place.
HTH
You could use the replacement method for data frames in your function, like this maybe. It does adequate checking for you. I chose to pass the logical row subset as an argument, but you can change that.
pasteOver <- function(original, corrected, key) {
  "[<-.data.frame"(original, key, names(corrected), corrected)
}
(p1 <- pasteOver(D1, D2, D1$k %in% D2$k))
k A B C D E
1 1 0.18827167 0.006275082 0.3754535 0.8690591 0.73774065
2 2 0.54335829 0.122160101 0.6213813 0.9931259 0.38941407
3 3 0.62946977 0.323090601 0.4464805 0.5069766 0.41443988
4 4 0.66155954 0.201218532 0.1345516 0.2990733 0.05296677
5 5 0.09400961 0.087096652 0.2327039 0.7268058 0.63687025
p2 <- paste_over(D1, D2, "k")
identical(p1, p2)
# [1] TRUE
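Both versions above rely on the corrected rows arriving in the same key order as the original. A variant that looks up row positions with match() avoids that requirement. This is a sketch, not an existing library function: the name paste_over_match, the small example frames, and the choice to stop on unknown keys are all assumptions made for illustration.

```r
# Hypothetical helper: write corrected values into original, matching rows on key
paste_over_match <- function(original, corrected, key) {
  # Row of original that each corrected row belongs to
  idx <- match(corrected[[key]], original[[key]])
  # Refuse corrections whose key does not exist in original (assumption)
  stopifnot(!anyNA(idx))
  output <- original
  output[idx, names(corrected)] <- corrected
  output
}

# Small deterministic example; corrected is deliberately not sorted by k
original  <- data.frame(k = 1:5, A = c(10, 20, 30, 40, 50), B = c(1, 2, 3, 4, 5))
corrected <- data.frame(k = c(4, 1), B = c(44, 11))
fixed <- paste_over_match(original, corrected, "k")
fixed
```

Only the cells named in corrected are touched, so the column layout of the original is left as it was.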

Compare two frames by parts R

I want to compare two data frames by parts. Here is an example of my data frames:
a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = c(1,6,3,4), b=letters[1:4])
I would like to write a function which finds two sequential rows in a1 which also exist in data frame a2 (both columns have to match) and saves them in a new data frame.
Any help would be appreciated.
dual.matches <- match(a1$a, a2$a) == match(a1$b, a2$b)
sequential.dual.matches <- with(rle(dual.matches), rep(replace(values, lengths==1, FALSE), lengths))
a1[sequential.dual.matches, ]
# a b
# 3 3 c
# 4 4 d
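The same idea can be spelled out step by step, which may make it easier to follow what rle() is doing. This is a sketch that mirrors the answer above with intermediate variables (their names are chosen here); like the original, it compares first-match positions, so it assumes the values in each column of a2 are not duplicated.

```r
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = c(1, 6, 3, 4), b = letters[1:4])

# Row of a2 where each a1 value is found, per column (NA if absent)
pos_a <- match(a1$a, a2$a)
pos_b <- match(a1$b, a2$b)

# TRUE when both columns point at the same row of a2; NA when either is missing
dual <- pos_a == pos_b

# Run-length encode, then keep only runs of TRUE longer than one row
runs <- rle(dual)
keep_run <- !is.na(runs$values) & runs$values & runs$lengths > 1
keep <- rep(keep_run, runs$lengths)

res <- a1[keep, ]
res
```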
