Excluding rows if present in second dataframe in R

I have 2 dataframes.
df1-
col1 col2 col3 col4 col5
name1 A 23 x y
name1 A 29 x y
name1 B 17 x y
name1 A 77 x y
df2-
col1 col2 col3
B 17 LL1
Z 193 KK1
A 77 LO9
Y 80 LK2
I want to return the rows of df1 whose col2 and col3 values do not match the col1 and col2 values of any row in df2.
The output should be-
col1 col2 col3 col4 col5
name1 A 23 x y
name1 A 29 x y
Solution I found-
unique.rows <- function (df1, df2) {
out <- NULL
for (i in 1:nrow(df1)) {
found <- FALSE
for (j in 1:nrow(df2)) {
if (all(df1[i,2:3] == df2[j,1:2])) {
found <- TRUE
break
}
}
if (!found) out <- rbind(out, df1[i,])
}
out
}
This solution works, but I originally wrote it for small dataframes. Now my df1 has about 10k rows and df2 has about 7 million rows, and it has been running for the last 2 days. Could anyone please suggest a faster way to do this?

try
> df1[!paste(df1$col2,df1$col3)%in%paste(df2$col1,df2$col2),]
col1 col2 col3 col4 col5
1 name1 A 23 x y
2 name1 A 29 x y

What is probably biting you is the line:
if (!found) out <- rbind(out, df1[i,])
Each call to rbind copies the entire data.frame built so far, so growing the result row by row gets slower as it grows. I would recommend preallocating the result (or a logical vector marking which rows to keep), filling it in by index, and subsetting df1 once at the end. This avoids the repeated copying; the inner loop over df2 is still slow, so replacing it with a vectorized comparison helps further.
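As a minimal sketch of that idea in base R, the loop below fills a preallocated logical vector instead of calling rbind, and replaces the inner loop with one vectorized comparison per row of df1. The example data is rebuilt from the question; the name keep is introduced here only for illustration.

```r
# Example data from the question
df1 <- data.frame(col1 = "name1",
                  col2 = c("A", "A", "B", "A"),
                  col3 = c(23, 29, 17, 77),
                  col4 = "x", col5 = "y",
                  stringsAsFactors = FALSE)
df2 <- data.frame(col1 = c("B", "Z", "A", "Y"),
                  col2 = c(17, 193, 77, 80),
                  col3 = c("LL1", "KK1", "LO9", "LK2"),
                  stringsAsFactors = FALSE)

# Preallocate one logical flag per row of df1 (keep is an illustrative name)
keep <- logical(nrow(df1))

for (i in seq_len(nrow(df1))) {
  # Vectorized check against all rows of df2 at once
  matched <- any(df2$col1 == df1$col2[i] & df2$col2 == df1$col3[i])
  keep[i] <- !matched
}

# Subset once at the end instead of growing a data.frame
out <- df1[keep, ]
print(out)
```

The result is built in a single subsetting step, so no intermediate copies of the output are made inside the loop.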
In addition, R is vectorized, so there is often no need for an explicit loop. See for example the answer by @ttmaccer. You could also take a look at data.table, which is very fast for these kinds of operations.
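As a rough illustration of the data.table route, the sketch below expresses the exclusion as an anti-join. It assumes the data.table package is installed and rebuilds the example data from the question; the names dt1, dt2 and res are introduced here only for illustration.

```r
library(data.table)

# Example data from the question, rebuilt as data.tables (illustrative names)
dt1 <- data.table(col1 = "name1",
                  col2 = c("A", "A", "B", "A"),
                  col3 = c(23, 29, 17, 77),
                  col4 = "x", col5 = "y")
dt2 <- data.table(col1 = c("B", "Z", "A", "Y"),
                  col2 = c(17, 193, 77, 80),
                  col3 = c("LL1", "KK1", "LO9", "LK2"))

# Anti-join: keep the rows of dt1 that have no match in dt2,
# pairing dt1$col2 with dt2$col1 and dt1$col3 with dt2$col2
res <- dt1[!dt2, on = c(col2 = "col1", col3 = "col2")]
print(res)
```

The join columns are matched as separate keys, so no pasted string key is needed.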

Related

Strange behaviour when selecting columns in data.table in r: Only works when string is given directly, not as a variable

I want to select some columns in a data.frame/data.table. However there seems to be a strange behaviour:
Create dummy data:
df=data.frame(col1=c(1,2),col2=c(11,22),col3=c(111,222))
So our data.frame looks like
col1 col2 col3
1 1 11 111
2 2 22 222
Now I define some variables for the column names:
col1='col1'
col2='col2'
So both df[,c(col1,col2)] and df[,c('col1','col2')] result in
col1 col2
1 1 11
2 2 22
as one would expect.
However if I do the same on the data.table (created by df=data.table(df))
col1 col2 col3
1: 1 11 111
2: 2 22 222
something strange happens. df[,c('col1','col2')] still gets the correct result:
col1 col2
1: 1 11
2: 2 22
but df[,c(col1,col2)] does not work anymore:
[1] 1 2 11 22
Why is that?
This is not strange behavior; it is covered in the documentation. Use with = FALSE:
df[, c(col1, col2), with = FALSE]
-output
col1 col2
1: 1 11
2: 2 22
According to ?data.table
When with=TRUE (default), j is evaluated within the frame of the data.table; i.e., it sees column names as if they are variables. This allows to not just select columns in j, but also compute on them e.g., x[, a] and x[, sum(a)] returns x$a and sum(x$a) as a vector respectively. x[, .(a, b)] and x[, .(sa=sum(a), sb=sum(b))] returns a two column data.table each, the first simply selecting columns a, b and the second computing their sums.
Other options are
df[, .(col1, col2)]
col1 col2
1: 1 11
2: 2 22
df[, .SD, .SDcols = c(col1, col2)]
col1 col2
1: 1 11
2: 2 22
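To make the scoping rule concrete, here is a small sketch, assuming the data.table package is installed. It stores the column names in a variable called cols (a name chosen here so that it does not coincide with any column name) and selects with the `..` prefix, which tells data.table to look the variable up outside the table. Inside j, a symbol that matches a column name refers to the column, which is why c(col1, col2) in the question concatenated the two columns.

```r
library(data.table)

dt <- data.table(col1 = c(1, 2), col2 = c(11, 22), col3 = c(111, 222))

# Column names held in a variable whose name is not itself a column
cols <- c("col1", "col2")

# The .. prefix looks up `cols` in the calling scope, not in the table
a <- dt[, ..cols]

# Equivalent selections mentioned in the answer above
b <- dt[, cols, with = FALSE]
d <- dt[, .SD, .SDcols = cols]

print(a)
```

All three forms return a two-column data.table rather than a plain vector.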

Add each column to a new data frame

For example, I have a data frame with 4 columns:
col1 col2 col3 col4
I would like to get a new data frame by accumulating each column:
col1 col1+col2 col1+col2+col3 col1+col2+col3+col4
How should I write in R?
In base R, you can calculate row-wise cumsum using apply.
Using @Henry's data:
startdf[] <- t(apply(startdf, 1, cumsum))
startdf
# col1 col2 col3 col4
#1 1 21 321 4321
#2 4 34 234 1234
If this were a matrix, you could use rowCumsums from the matrixStats package.
Starting with a dataframe and returning to a dataframe, you could try something like:
library(matrixStats)
startdf <- data.frame(col1=c(1,4), col2=c(20,30),
col3=c(300,200), col4=c(4000,1000))
finishdf <- as.data.frame(rowCumsums(as.matrix(startdf)))
to go from
col1 col2 col3 col4
1 1 20 300 4000
2 4 30 200 1000
to
V1 V2 V3 V4
1 1 21 321 4321
2 4 34 234 1234
Base R (not as efficient or clean as Ronak's) [using Henry's data]:
data.frame(Reduce(rbind, Map(cumsum, data.frame(t(startdf)))), row.names = NULL)
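Another base R possibility, offered as a sketch rather than a recommendation, is to accumulate whole columns instead of working row by row. Reduce with accumulate = TRUE adds each column to the running total of the columns before it. The names acc and finishdf2 are introduced here only for illustration.

```r
startdf <- data.frame(col1 = c(1, 4), col2 = c(20, 30),
                      col3 = c(300, 200), col4 = c(4000, 1000))

# Running column-wise sum: col1, col1+col2, col1+col2+col3, ...
acc <- Reduce(`+`, startdf, accumulate = TRUE)

# Restore the column names and turn the list back into a data frame
names(acc) <- names(startdf)
finishdf2 <- as.data.frame(acc)
print(finishdf2)
```

Because the additions run over entire columns, this does one vectorized operation per column rather than one cumsum call per row.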

Renaming data frame column variables without using column name

I want to create a function which renames specific values in a column to something else, which is specified by the function, something like this (although in reality there would be much more to rename):
func <- function(x) x %>%
  mutate(col_name = ifelse(col_name == "something", "something else",
                    ifelse(col_name == "something2", "something_else2", col_name)))
Note that it isn't the column names that I want to change, but the values themselves in the column. However, I would like this to work regardless of which column the values are in (i.e. the function should work across the whole data frame). Also, this only works if the values named in the function are present, and I would like it to ignore the ones that aren't present in the columns. Here is a small reproducible example (column values are arbitrary):
col1 <- c("a","b","c","d","e")
col2 <- c("b","f","d","c","g")
df <- data.frame(col1, col2)
col3 <- c("a","h","i","b","c")
col4 <- c("c","d","j","a","g")
df2 <- data.frame(col3, col4)
Which looks like this:
df:
col1 col2
1 a b
2 b f
3 c d
4 d c
5 e g
df2:
col3 col4
1 a c
2 h d
3 i j
4 b a
5 c g
Say that I want to rename the values like this:
df:
col1 col2
1 can chi
2 chi pig
3 equ she
4 she equ
5 fox bov
df2:
col3 col4
1 can equ
2 avi she
3 tyr asp
4 chi can
5 equ bov
So what I was hoping to get was a function that changes the names of multiple values in data frame columns regardless of its position in the data frame, and that it ignores the values not found in the data frame by the function.
Recode all columns
library(dplyr)
func = function(x, originals = letters[1:10],
rename_tos = c("can", "chi", "equ", "she", "fox", "pig", "bov", "avi", "tyr", "asp")){
names(rename_tos) = originals
x %>%
mutate_if(is.factor, as.character) %>%
lapply(function(y){
y = rename_tos[y]
}) %>%
data.frame(row.names = NULL)
}
Results:
> func(df)
col1 col2
1 can chi
2 chi pig
3 equ she
4 she equ
5 fox bov
> func(df2)
col3 col4
1 can equ
2 avi she
3 tyr asp
4 chi can
5 equ bov
Notes:
The method I used is basically to create a lookup table (named vector) for the renames and index the rename_tos vector with column values. Here, I've set the originals and renames as the default of the function, but you can also supply your own.
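One detail of the lookup approach is that indexing a named vector by a name it does not contain returns NA. If values without a mapping should be left as they are instead, a small variation could look like the sketch below. The helper name recode_keep and the three-entry lookup are hypothetical and only for illustration.

```r
# Lookup table: names are the original values, elements are the replacements
lookup <- c(a = "can", b = "chi", c = "equ")

# recode_keep is a hypothetical helper: recode via the lookup,
# but keep any value that has no entry in the lookup unchanged
recode_keep <- function(x, lookup) {
  x <- as.character(x)
  out <- unname(lookup[x])      # NA where x is not a name in lookup
  unmatched <- is.na(out)
  out[unmatched] <- x[unmatched]
  out
}

result <- recode_keep(c("a", "c", "z"), lookup)
print(result)
```

Here "z" has no entry in the lookup, so it passes through unchanged instead of becoming NA.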
User-supplied column names
If you want to be able to rename columns specified and leave the other columns the same, you can do something like the following:
library(dplyr)
library(rlang)
func = function(x, ..., originals = letters[1:10],
rename_tos = c("can", "chi", "equ", "she", "fox", "pig", "bov", "avi", "tyr", "asp")){
names(rename_tos) = originals
dots = quos(...)
x %>%
mutate_at(vars(!!! dots), as.character) %>%
mutate_at(vars(!!! dots), ~ rename_tos[.]) %>%  # formula shorthand; funs() is deprecated
data.frame(row.names = NULL)
}
Result:
> func(df, col2)
col1 col2
1 a chi
2 b pig
3 c she
4 d equ
5 e bov
> func(df2, col3, col4)
col3 col4
1 can equ
2 avi she
3 tyr asp
4 chi can
5 equ bov
> func(df2, c(col3, col4))
col3 col4
1 can equ
2 avi she
3 tyr asp
4 chi can
5 equ bov
Notes:
Here, I added the ... argument to allow the user to input their own column names. I used quos from rlang to quote the ... arguments and later unquoted them inside vars to mutate_at using !!!. For example, if the user supplied func(df, col2), the first argument of mutate_at evaluates to vars(col2). This works with multiple arguments as well as a vector of arguments as one can see in the results.

Checking if a value is numerical in R

I have two dataframes, df1 and df2.
df1:
col1 <- c('30','30','30','30')
col2 <- c(3,13,18,41)
col3 <- c("heavy","light","blue","black")
df1 <- data.frame(col1,col2,col3)
>df1
col1 col2 col3
1 30 3 heavy
2 30 13 light
3 30 18 blue
4 30 41 black
df2:
col1 <- c('10',"NONE")
col2 <- c(21,"NONE")
col3 <- c("blue","NONE")
df2 <- data.frame(col1,col2,col3)
>df2
col1 col2 col3
1 10 21 blue
2 NONE NONE NONE
I wrote a bit of script that says; if a value in col3 is equal to "light", I want to remove that row and all subsequent rows in the dataframe. So df1 would look like:
col1 col2 col3
1 30 3 heavy
And there would be no changes to df2 (as it has no matches to "light" in col3).
I have shown two separate data frames above as two examples, but the script below just refers to a general "df", to save me copying and pasting the same bit of code twice with df1 replaced by df2.
phrase=c("light")
start_rownum=which(grepl(phrase, df[,3]))
end_rownum=nrow(df)
end_rownum=as.numeric(end_rownum)
if(start_rownum > 0){
df=df[-c(start_rownum:end_rownum),]
}
This script works fine with df1, as the start_rownum has a numerical value. However, I get the following error with df2:
Error in start_rownum:end_rownum : argument of length 0
Instead of saying "if(start_rownum > 0)", is there some way to check if start_rownum has a numerical value? I can't find a working solution.
Thanks.
For anyone who has a similar problem, I just solved it. which() returns a zero-length vector when nothing matches, so check the length first.
Use the condition
if (length(start_rownum) > 0 && is.numeric(start_rownum))
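Putting the pieces together, a version of the script wrapped in a function might look like the sketch below. The function name drop_from_match and its arguments are hypothetical; it also uses only the first match, which is an assumption about what should happen when the phrase occurs more than once.

```r
# Example data from the question
df1 <- data.frame(col1 = c("30", "30", "30", "30"),
                  col2 = c(3, 13, 18, 41),
                  col3 = c("heavy", "light", "blue", "black"))
df2 <- data.frame(col1 = c("10", "NONE"),
                  col2 = c(21, "NONE"),
                  col3 = c("blue", "NONE"))

# drop_from_match is a hypothetical helper: remove the first row whose
# column `col` contains `phrase`, together with every row after it
drop_from_match <- function(df, phrase, col = 3) {
  hits <- which(grepl(phrase, df[[col]], fixed = TRUE))
  if (length(hits) > 0) {
    # Keep only the rows before the first match (possibly none)
    df <- df[seq_len(hits[1] - 1), , drop = FALSE]
  }
  df
}

print(drop_from_match(df1, "light"))
print(drop_from_match(df2, "light"))
```

Using seq_len(hits[1] - 1) also covers the case where the match is in the first row, where a range built with the colon operator would count backwards.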

Loop to count number of rows in each column and save output

I think this is pretty simple. I have a dataframe called df. It has 51 columns. The rows in each column contain random integers. All I want to do, as a loop, is add up all the integers in all the rows of each column and then store the output for each of the columns in a separate list or dataframe.
The df looks like this
Col1 col2 col3 col4
34 12 33 67
22 1 56 66
Etc
The output I want is:
Col1 col2 col3 col4
56 13 89 133
I do want to do this as a loop as I want to apply what I've learnt here to a more complex script with similar output and I need to do it quick- can't quite master functions as yet...
You can use the built in function colSums for this:
> df <- data.frame(col1 = c(1,2,3), col2 = c(2,3,4))
> colSums(df)
col1 col2
6 9
Another option using a loop:
# Create the result data frame (one row, same columns as df)
> res <- data.frame(df[1,], row.names = c('count'))
# Populate the results: overwrite each column with that column's sum
> for(n in seq_len(ncol(df))) { res[colnames(df)[n]] <- sum(df[n]) }
> res
      col1 col2
count    6    9
If you really want to use a loop over a vectorized solution, use apply to loop over columns (second argument equal to 2, 1 is to loop over rows), by mentioning the function you want (here sum):
df = data.frame(col1=1:3,col2=2:4,col3=3:5)
apply(df, 2, sum)
#col1 col2 col3
# 6 9 12
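Since the question asks specifically for a loop, here is a minimal sketch of an explicit for loop that stores one sum per column in a preallocated named vector. It uses the two example rows from the question; the name totals is introduced here only for illustration.

```r
# The two example rows shown in the question
df <- data.frame(Col1 = c(34, 22), col2 = c(12, 1),
                 col3 = c(33, 56), col4 = c(67, 66))

# Preallocate one slot per column and label the slots with the column names
totals <- numeric(ncol(df))
names(totals) <- names(df)

# Loop over column positions and store each column's sum
for (j in seq_along(df)) {
  totals[j] <- sum(df[[j]])
}

print(totals)
```

The same pattern carries over to other per-column computations by replacing sum with another function of df[[j]].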

Resources