How to exclude a set of elements in R? - r

I have two data frames: A and B of the same number of columns names and content. Data frame B is the subset of A. I want to get A without B. I have tried different functions like setdiff, duplicated, which and others. None of them worked for me, perhaps I didn't use them correctly. Any help is appreciated.

You could use merge e.g.:
df1 <- data.frame(col1=c('A','B','C','D','E'),col2=1:5,col3=11:15)
subset <- df1[c(2,4),]
subset$EXTRACOL <- 1 # use a column name that is not present among
# the original data.frame columns
merged <- merge(df1,subset,all=TRUE)
dfdifference <- merged[is.na(merged$EXTRACOL),]
dfdifference$EXTRACOL <- NULL
-----------------------------------------
> df1:
col1 col2 col3
1 A 1 11
2 B 2 12
3 C 3 13
4 D 4 14
5 E 5 15
> subset:
col1 col2 col3
2 B 2 12
4 D 4 14
> dfdifference:
col1 col2 col3
1 A 1 11
3 C 3 13
5 E 5 15

Related

Create dummy variable if a dataframe contains rows from another dataframe

I'm trying to create a dummy variable based on if df1 is contained within df2. Note that df2 has columns more than just the columns in df1.
e.g.:
df1:
A
B
C
1
2
3
4
5
6
7
8
0
df2:
A
B
C
D
1
2
3
E
4
5
6
F
7
8
9
G
Resulting in:
df2:
A
B
C
D
Dummy
1
2
3
E
1
4
5
6
F
1
7
8
9
G
0
Any good approaches I should consider?
I've tried using an ifelse function applied to the dataframe, but I suspect I've coded it wrong. Any tips would be appreciated!
One approach would be to add a column called "dummy" to df1, then join with df2 on all variables of df1.
df1$dummy <- 1
library(dplyr)
dplyr::left_join(df2, df1) %>%
mutate(dummy = ifelse(is.na(dummy), 0, dummy))
# Joining, by = c("A", "B", "C")
# A B C D dummy
# 1 2 3 E 1
# 4 5 6 F 1
# 7 8 9 G 0
By default left_join joins using all commonly named variables, but this can be modified as required.

different print.gap value for specific column

Is there any way to have a different print.gap for a particular column?
Example data:
dd <- data.frame(col1 = 1:5, col2 = 1:5, col3 = I(letters[1:5]))
print (dd, quote=F, right=T, print.gap=5)
Output with print.gap=5:
col1 col2 col3
1 1 1 a
2 2 2 b
3 3 3 c
4 4 4 d
5 5 5 e
Desired output (print.gap mix, first two with print.gap=5, third with print.gap=12)
col1 col2 col3
1 1 1 a
2 2 2 b
3 3 3 c
4 4 4 d
5 5 5 e
I realise this may not be achievable with any change of the print statement, but perhaps some have an alternative method or suggestion. The output is to be saved in a text file. Also please note, the solution should be flexible enough to not just increase the gap for the last column, it could be any column, or multiple columns with different print.gaps in a data frame.
There's probably a way to do this by defining a "proper" alternative print method, but here's a hackish solution that can be used to adjust each column width independently.
rbind(
data.frame(lapply(dd, as.character), stringsAsFactors=FALSE),
substring(" ", 1, c(1,7,12))
)
# col1 col2 col3
#1 1 1 a
#2 2 1 b
#3 3 1 c
#4 4 1 d
#5 5 2 e
#6 6 2 f
#7 7 2 g
#8 8 2 h

Pass variable as column name to dplyr?

I have a very ugly dataset that is a flat file of a relational database. A minimal reproducible example is here:
df <- data.frame(col1 = c(letters[1:4],"c"),
col1.p = 1:5,
col2 = c("a","c","l","c","l"),
col2.p = 6:10,
col3= letters[3:7],
col3.p = 11:20)
I need to be able to identify the '.p' value for the 'col#' that has the "c". My previous question on SO got the first part: In R, find the column that contains a string in for each row. Which I'm providing for context.
tmp <- which(projectdata=='Transmission and Distribution of Electricity', arr.ind=TRUE)
cnt <- ave(tmp[,"row"], tmp[,"row"], FUN=seq_along)
maxnames <- paste0("max",sequence(max(cnt)))
projectdata[maxnames] <- NA
projectdata[maxnames][cbind(tmp[,"row"],cnt)] <- names(projectdata)[tmp[,"col"]]
rm(tmp, cnt, maxnames)
This results in a dataframe that looks like this:
df
col1 col1.p col2 col2.p col3 col3.p max1
1 a 1 a 6 c 11 col3
2 b 2 c 7 d 12 col2
3 c 3 l 8 e 13 col1
4 d 4 c 9 f 14 col2
5 c 5 l 10 g 15 col1
6 a 1 a 6 c 16 col3
7 b 2 c 7 d 17 col2
8 c 3 l 8 e 18 col1
9 d 4 c 9 f 19 col2
10 c 5 l 10 g 20 col1
When I tried to get the ".p" that matched the value in "max1", I kept getting errors. I thought the approach would be:
df %>%
mutate(my.p = eval(as.name(paste0(max1,'.p'))))
Error: object 'col3.p' not found
Clearly, this did not work, so I thought maybe this was similar to passing a column name in a function, where I need to use 'get'. That also didn't work.
df %>%
mutate(my.p = get(as.name(paste0(max1,'.p'))))
Error: invalid first argument
df %>%
mutate(my.p = get(paste0(max1,'.p')))
Error: object 'col3.p' not found
I found something that gets rid of this error, using data.table from a different, but related problem, here: http://codereply.com/answer/7y2ra3/dplyr-error-object-found-using-rle-mutate.html. However, it gives me "col3.p" for every row. This is max1 for the first row, df$max1[1]
library('dplyr')
library('data.table') # must have the data.table package
df %>%
tbl_dt(df) %>%
mutate(my.p = get(paste0(max1,'.p')))
Source: local data table [10 x 8]
col1 col1.p col2 col2.p col3 col3.p max1 my.p
1 a 1 a 6 c 11 col3 11
2 b 2 c 7 d 12 col2 12
3 c 3 l 8 e 13 col1 13
4 d 4 c 9 f 14 col2 14
5 c 5 l 10 g 15 col1 15
6 a 1 a 6 c 16 col3 16
7 b 2 c 7 d 17 col2 17
8 c 3 l 8 e 18 col1 18
9 d 4 c 9 f 19 col2 19
10 c 5 l 10 g 20 col1 20
Using the lazyeval interp approach (from this SO: Hot to pass dynamic column names in dplyr into custom function?) doesn't work for me. Perhaps I am implementing it incorrectly?
library(lazyeval)
library(dplyr)
df %>%
mutate_(my.p = interp(~colp, colp = as.name(paste0(max1,'.p'))))
I get an error:
Error in paste0(max1, ".p") : object 'max1' not found
Ideally, I will have the new column my.p equal the appropriate p based on the column identified in max1.
I can do this all with ifelse, but I am trying to do it with less code and to make it applicable to the next ugly flat table.
We can do this with data.table. We convert the 'data.frame' to 'data.table' (setDT(df)), grouped by the the row sequence, we get the value of the paste output, and assign (:=) it to a new column ('my.p').
library(data.table)
setDT(df)[, my.p:= get(paste0(max1, '.p')), 1:nrow(df)]
df
# col1 col1.p col2 col2.p col3 col3.p max1 my.p
# 1: a 1 a 6 c 11 col3 11
# 2: b 2 c 7 d 12 col2 7
# 3: c 3 l 8 e 13 col1 3
# 4: d 4 c 9 f 14 col2 9
# 5: c 5 l 10 g 15 col1 5
# 6: a 1 a 6 c 16 col3 16
# 7: b 2 c 7 d 17 col2 7
# 8: c 3 l 8 e 18 col1 3
# 9: d 4 c 9 f 19 col2 9
#10: c 5 l 10 g 20 col1 5

Matching and merging headers in R

In R, I want to match and merge two matrices.
For example,
> A
ID a b c d e f g
1 ex 3 8 7 6 9 8 4
2 am 7 5 3 0 1 8 3
3 ple 8 5 7 9 2 3 1
> B
col1
1 a
2 c
3 e
4 f
Then, I want to match header of matrix A and 1st column of matrix B.
The final result should be a matrix like below.
> C
ID a c e f
1 ex 3 7 9 8
2 am 7 3 1 8
3 ple 8 7 2 3
*(My original data has more than 500 columns and more than 20,000 rows.)
Are there any tips for that? Would really appreciate your help.
*In advance, if the matrix B is like below,
> B
col1 col2 col3 col4
1 a c e f
How to make the matrix C in this case?
You want:
A[, c('ID', B[, 1])]
For the second case, you want to use row number 1 of the second matrix, instead of its first column.
A[, c('ID', B[1, ])]
If B is a data.frame instead of a matrix, the syntax changes somewhat — you can use B$col1 instead of B[, 1], and to select by row, you need to transform the result to a vector, because the result of selecting a row in a data.frame is again a data.frame, i.e. you need to do unlist(B[1, ]).
You can use a subset:
cbind(A$ID, A[names(A) %in% B$col1])

Order a data frame only from a certain row index to a certain row index

Let's say we have a DF like this:
col1 col2
A 1
A 5
A 3
A 16
B 5
B 4
B 3
C 7
C 2
I'm trying to order col2 but only for same values in col1. Better said, I want it to look like this:
col1 col2
A 1
A 3
A 5
A 16
B 3
B 4
B 5
C 2
C 7
So order col2 only for A, B and C values, not order the entire col2 column
x <- function() {
values<- unique(DF[, 1])
for (i in values) {
currentData <- which(DF$col1== i)
## what to do here ?
data[order(data[, 2]), ]
}
}
so in CurrentData I have indexes for col2 values for only As, Bs etc. But how do I order only those items in my entire DF data frame ? Is it somehow possible to tell the order function to do order only on certain row indexes of data frame ?
ave will group the data by the first element, and apply the named function to the second element for each group. Here is an application of ave sorting within groups:
DF$col2 <- ave(DF$col2, DF$col1, FUN=sort)
DF
## col1 col2
## 1 A 1
## 2 A 3
## 3 A 5
## 4 A 16
## 5 B 3
## 6 B 4
## 7 B 5
## 8 C 2
## 9 C 7
This will work even if the values in col1 are not consecutive, leaving them in their original positions.
If that is not an important consideration, there are better ways to do this, such as the answer by #user314046.
It seems that
my_df[with(my_df, order(col1, col2)), ]
will do what you want - this just sorts the dataframe by col1 and col2. If you don't want to order by col1 a method is provided in the other answer.

Resources