select rows in one data frame that partially match rows in another data frame - r

I wish to select rows in one data frame, data.1, that partially match rows in a second data frame, keep.these, to obtain the desired.result. I have found several questions here that match based on one column, but I want to match on three columns: STATE, COUNTY and CITY. I have come up with three solutions so far, but none seem ideal.
Note that each row contains a unique combination of STATE, COUNTY and CITY in my real data.
When I use merge I must re-order. The function match seems to work, but I am not familiar with it and do not know if my use of this function is as intended. The apply solution below is clearly too complex.
The merge approach would be ideal if I did not have to reorder the result. Reordering can be time consuming with large data sets. The match approach seems okay if someone can confirm this is a reasonable approach.
Is there a better solution, ideally in base R?
data.1 <- read.table(text= "
CITY COUNTY STATE AA
1 1 1 2
2 1 1 4
1 2 1 6
2 2 1 8
1 1 2 20
2 1 2 40
1 2 2 60
2 2 2 80
1 1 3 200
2 1 3 400
1 2 3 600
2 2 3 800
1 1 4 2000
2 1 4 4000
1 2 4 6000
2 2 4 8000
1 1 5 20000
2 1 5 40000
1 2 5 60000
2 2 5 80000
", header=TRUE, na.strings=NA)
keep.these <- read.table(text= "
CITY COUNTY STATE BB
1 1 2 -10
2 1 2 -11
1 2 2 -12
2 2 2 -13
1 1 4 -14
2 1 4 -15
1 2 4 -16
2 2 4 -17
", header=TRUE, na.strings=NA)
desired.result <- read.table(text= "
CITY COUNTY STATE AA
1 1 2 20
2 1 2 40
1 2 2 60
2 2 2 80
1 1 4 2000
2 1 4 4000
1 2 4 6000
2 2 4 8000
", header=TRUE, na.strings=NA)
##########
# this works, but I need to reorder
new.data.a <- merge(keep.these[,1:3], data.1, by=c('CITY', 'COUNTY', 'STATE'))
new.data.a <- new.data.a[order(new.data.a$STATE, new.data.a$COUNTY, new.data.a$CITY),]
rownames(desired.result) <- NULL
rownames(new.data.a) <- NULL
all.equal(desired.result, new.data.a)
##########
# this seems to work, but match is unfamiliar
new.data.2 <- data.1[match(data.1$CITY , keep.these$CITY , nomatch=0) &
                     match(data.1$STATE , keep.these$STATE , nomatch=0) &
                     match(data.1$COUNTY, keep.these$COUNTY, nomatch=0), ]
rownames(desired.result) <- NULL
rownames(new.data.2) <- NULL
all.equal(desired.result, new.data.2)
##########
# this works, but is too complex
data.1b <- data.frame(my.group = apply( data.1[,1:3], 1, paste, collapse = "."), data.1)
keep.these.b <- data.frame(my.group = apply(keep.these[,1:3], 1, paste, collapse = "."), keep.these)
data.1b <- data.1b[apply(data.1b, 1, function(x) {x[1] %in% keep.these.b$my.group}),]
data.1b <- data.1b[,-1]
rownames(desired.result) <- NULL
rownames(data.1b) <- NULL
all.equal(desired.result, data.1b)
##########

Here is a generic solution for this type of problem which is very efficient:
data.1.ID <- paste(data.1[,1],data.1[,2],data.1[,3])
keep.these.ID <- paste(keep.these[,1],keep.these[,2],keep.these[,3])
desired.result <- data.1[data.1.ID %in% keep.these.ID,]
I have simply created a unique ID for each record and then searched on it.
Note: The result keeps data.1's original row names rather than renumbering them, so you may want to reset them with:
row.names(desired.result) <- 1:nrow(desired.result)
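An equivalent base R spelling (my own sketch, not part of the original answer; the object names are just placeholders) uses interaction() to build the same kind of composite key, so you do not need to pick a separator yourself:
key.data  <- interaction(data.1$CITY,     data.1$COUNTY,     data.1$STATE)      # factor levels like "1.1.2"
key.keep  <- interaction(keep.these$CITY, keep.these$COUNTY, keep.these$STATE)
kept.rows <- data.1[key.data %in% key.keep, ]                                   # same %in% lookup as above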
EDIT:
Here is another way to solve the same problem.
If you have a very large data set, say millions of rows, another very efficient option is the data.table package. It can run roughly 50-100 times faster than merge, depending on how much data you have.
All you have to do is the following:
library(data.table)
Step 1: Convert each data.frame to a data.table, with the first three columns as keys.
d1 <- data.table(data.1, key=names(data.1)[1:3])
kt <- data.table(keep.these, key=names(keep.these)[1:3])
Step 2: A merge using data.table's binary search:
d1[kt]
Note 1: Notice how simple the join is to write.
Note 2: This will sort the data by key. To avoid that, try the following:
data.1$index <- 1:nrow(data.1) # Add index to original data
d1 <- data.table(data.1,key=names(data.1)[1:3]) # Step1 as above
kt <- data.table(keep.these,key=names(keep.these)[1:3]) # Step1 as above
d1[kt][order(index)] # Step2 as above
If you want to remove the last two columns (index, BB), that's straightforward too:
d1[kt][order(index)][, -(5:6), with=FALSE]  # drop the index and BB columns
Try this with large data sets, and compare the timing with merge. It's typically about 50-100 times faster.
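A rough way to check that claim yourself (a sketch with made-up sizes, not a benchmark from the original answer):
library(data.table)
set.seed(1)
n   <- 1e6                                            # hypothetical data size
big <- data.frame(CITY   = sample(1:50, n, replace = TRUE),
                  COUNTY = sample(1:50, n, replace = TRUE),
                  STATE  = sample(1:50, n, replace = TRUE),
                  AA     = rnorm(n))
keys <- unique(big[sample(n, 1e4), 1:3])              # combinations to keep
system.time(merge(keys, big, by = c("CITY", "COUNTY", "STATE")))   # base merge
bigDT  <- data.table(big,  key = c("CITY", "COUNTY", "STATE"))
keysDT <- data.table(keys, key = c("CITY", "COUNTY", "STATE"))
system.time(bigDT[keysDT])                            # keyed data.table join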
To learn more about data.table, try:
vignette("datatable-intro")
vignette("datatable-faq")
vignette("datatable-timings")
Or see it in action:
example(data.table)
Hope this helps!!

I'm not sure how this compares in speed with reordering, but you can simply pass merge an option so that it does not re-sort the result.
new.data.a <- merge(keep.these[,1:3], data.1, by=c('CITY', 'COUNTY', 'STATE'), sort = FALSE)
rownames(desired.result) <- NULL
rownames(new.data.a) <- NULL
all.equal(desired.result, new.data.a)
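?merge documents the row order under sort = FALSE as unspecified, so if you need the exact row order of data.1 back, one hedge (my own addition, not part of this answer) is to carry a single index column and sort on that instead of the three key columns:
data.1$idx <- seq_len(nrow(data.1))                   # remember the original order
new.data.a <- merge(keep.these[, 1:3], data.1,
                    by = c("CITY", "COUNTY", "STATE"))
new.data.a <- new.data.a[order(new.data.a$idx), ]     # restore data.1's order
new.data.a$idx <- NULL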

Related

Compare lists in data frames based on personal code, shorten one list if longer

I have two separate data frames, one for each speaker of an interacting dyad. They have different numbers of talk-turns (rows), which is why I keep them in separate files for now.
In order to run my final analyses I need an identical number of rows for each speaker.
So what I want to do is compare dyad_id 1 in both data frames and then shorten the longer of the two by deleting its last row(s) across all columns.
I prepared a data frame to illustrate what I already have.
So far, I tried to split the data frame by the dyad_id in both data sets to now compare the splits one after another and delete the unnecessary rows. As I have various conversations, I need to automate this to go through all dyad_ids one after another.
I hope someone can help me, I am completely lost.
dyad_id_A <- c(1,1,1,2,2,2,2,3,3,3,3,3)
fw_quantiles_a <- c(4,3,1,2,3,2,4,1,4,5,6,7)
df_A<- data.frame(dyad_id_A,fw_quantiles_a)
dyad_id_B <- c(1,1,1,1,2,2,2,3,3,3,3)
fw_quantiles_b <- c(3,1,2,1,2,4,1,3,3,4,5)
df_B <- data.frame(dyad_id_B,fw_quantiles_b)
Example of the final dataset:
dyad_id_AB <- c(1,1,1,2,2,2,3,3,3,3)
What I tried so far:
split_conv_A = split(df_A, list(df_A$dyad_id_A))
split_conv_B = split(df_B, list(df_B$dyad_id_B))
Add a time counter within each dyad_id_x group and then merge together:
df_A$time <- ave(df_A$dyad_id_A, df_A$dyad_id_A, FUN=seq_along)
df_B$time <- ave(df_B$dyad_id_B, df_B$dyad_id_B, FUN=seq_along)
merge(
  df_A, df_B,
  by.x=c("dyad_id_A","time"), by.y=c("dyad_id_B","time")
)
#   dyad_id_A time fw_quantiles_a fw_quantiles_b
#1          1    1              4              3
#2          1    2              3              1
#3          1    3              1              2
#4          2    1              2              2
#5          2    2              3              4
#6          2    3              2              1
#7          3    1              1              3
#8          3    2              4              3
#9          3    3              5              4
#10         3    4              6              5
Maybe we can try using table to calculate the frequencies of the ids in both data frames, assuming the same ids appear in both. Calculate the element-wise minimum with pmin and repeat the names based on those frequencies.
tab <- pmin(table(df_A$dyad_id_A), table(df_B$dyad_id_B))
as.integer(rep(names(tab), tab))
# [1] 1 1 1 2 2 2 3 3 3 3
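If you also need to shorten the two data frames themselves, a sketch building on tab above (my addition, not part of the original answer) keeps only the first tab[id] rows of each dyad:
keep_first <- function(ids, counts) {                                      # hypothetical helper
  ave(seq_along(ids), ids, FUN = seq_along) <= counts[as.character(ids)]   # within-dyad row counter
}
df_A_trim <- df_A[keep_first(df_A$dyad_id_A, tab), ]
df_B_trim <- df_B[keep_first(df_B$dyad_id_B, tab), ]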

How to remove columns of data from a data frame using a vector with a regular expression

I am trying to remove columns from a dataframe using a vector of numbers, with those numbers being just a part of the whole column header. What I'm looking to use is something like the wildcard "*" in unix, so that I can say that I want to remove columns with labels xxxx, xxkx, etc... To illustrate what I mean, if I have the following data:
data_test_read <- read.table("batch_1_8c9.structure-edit.tsv",sep="\t", header=TRUE)
data_test_read[1:5,1:5]
                  samp pop X12706_10 X14223_16 X14481_7
1 BayOfIslands_s088.fq   1         4         1        3
2 BayOfIslands_s088.fq   1         4         1        3
3 BayOfIslands_s089.fq   1         4         1        3
4 BayOfIslands_s089.fq   1         4         3        3
5 BayOfIslands_s090.fq   1         4         1        3
If I want to take out, for example, the columns with headers X12706_10 and X14481_7, the following works:
data_subs1=subset(data_test_read, select = -c(X12706_10, X14481_7))
data_subs1[1:4,1:4]
                  samp pop X14223_16 X15213_19
1 BayOfIslands_s088.fq   1         1         3
2 BayOfIslands_s088.fq   1         1         3
3 BayOfIslands_s089.fq   1         1         3
4 BayOfIslands_s089.fq   1         3         3
However, what I need is to identify these columns by the numbers alone, e.g. using (12706, 14481). But if I try this, I get the following:
data_subs2=subset(data_test_read, select = -c(12706,14481))
data_subs2[1:4,1:4]
                  samp pop X12706_10 X14223_16
1 BayOfIslands_s088.fq   1         4         1
2 BayOfIslands_s088.fq   1         4         1
3 BayOfIslands_s089.fq   1         4         1
4 BayOfIslands_s089.fq   1         4         3
This is clearly because I haven't specified anything to do with the "X", the "_", or what comes after the underscore. I've read so many answers on using regular expressions, and I just can't seem to sort it out. Any thoughts, or pointers to what I might turn to, would be appreciated.
First, you can just extract the numbers from the headers:
# for testing
col_names <- c("X12706_10","X14223_16","X14481_7")
# in practice, use
# col_names <- names(data_test_read)
samples <- gsub("X(\\d+)_.*","\\1",col_names)
Then find the indexes of the samples you want to drop:
samples_to_drop <- c(12706, 14481)
cols_to_drop <- match(samples_to_drop, samples)
Then you can use
data_subs2 <- subset(data_test_read, select = -cols_to_drop)
to actually get rid of those columns.
Perhaps put this all in a function to make it easier to use
sample_subset <- function(x, drop) {
  samples <- gsub("X(\\d+)_.*","\\1", names(x))
  subset(x, select = -match(drop, samples))
}
sample_subset(data_test_read, c(12706, 14481))
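If you would rather build the wildcard-style pattern directly (a variant I am adding, not part of the original answer), paste the numbers into one regular expression and drop the matching column names:
drop_nums  <- c(12706, 14481)
pattern    <- paste0("^X(", paste(drop_nums, collapse = "|"), ")_")   # "^X(12706|14481)_"
data_subs2 <- data_test_read[, !grepl(pattern, names(data_test_read))]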

Take the subsets of a data.frame with the same feature and select a single row from each subset

Suppose I have a matrix in R as follows:
ID Value
1 10
2 5
2 8
3 15
4 7
4 9
...
What I need is a random sample where every element is represented once and only once.
That means that ID 1 will be chosen, one of the two rows with ID 2, ID 3 will be chosen, one of the two rows with ID 4, etc...
There can be more than two duplicates.
What is the most R-esque way to do this without subsetting and sampling the subsets?
Thanks!
tapply across the rownames and grab a sample of 1 in each ID group:
dat[tapply(rownames(dat),dat$ID,FUN=sample,1),]
#  ID Value
#1  1    10
#3  2     8
#4  3    15
#6  4     9
If your data is truly a matrix and not a data.frame, you can work around this too, with:
dat[tapply(as.character(seq(nrow(dat))),dat$ID,FUN=sample,1),]
Don't be tempted to remove the as.character, as sample will give unintended results when there is only one value passed to it. E.g.
replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4
You can do that with dplyr like so:
library(dplyr)
df %>% group_by(ID) %>% sample_n(1)
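In more recent dplyr releases (1.0.0 or later) sample_n() is superseded, so the equivalent call, assuming you have such a version installed, is:
df %>% group_by(ID) %>% slice_sample(n = 1)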
The idea is to reorder the rows randomly and then remove duplicates in that order.
df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]

How to load a text file into R where a column contains data points separated by comma

I have a text file with two variables (data points) - the first variable is for student ID and the second variable contains a set of grades for each student ID.
The format is student_id,{grades}
For example:
0,80,100
1,65,71,40
2,99,50,0
3,90
4
indicates
student_id=0 has grades {80,100}
student_id=1 has grades {65,71,40}, and so on.
I'd like to get a data frame in R as follows
student_id grades
0 80,100
1 65,71,40
2 99,50,0
3 90
4
I tried the following command to load the data into R
x <- read.delim(file, header=TRUE, row.names=NULL)
and this is what I ended up with
  student_id..grades.
1            0,80,100
2          1,65,71,40
3           2,99,50,0
4                3,90
5                   4
I'd appreciate any help as to how to go about this problem. Please let me know if you would like me to provide more information. Thanks!
I don't understand exactly what your input looks like. But here I assume that you have something like this:
x <- readLines(textConnection(
"student_id grades
0 80,100
1 65,71,40
2 99,50,0
3 90
4"))
Then use read.table like this, after replacing all whitespace with | and using it as a regular separator:
res <- read.table(text=gsub('\\s+','|',x),sep='|',header=TRUE,fill=TRUE)
you get this table:
  student_id   grades  X
1          0   80,100 NA
2          1 65,71,40 NA
3          2  99,50,0 NA
4          3       90 NA
5          4          NA
Of course it is easy to remove the last column like this:
res[,-ncol(res)]
  student_id   grades
1          0   80,100
2          1 65,71,40
3          2  99,50,0
4          3       90
5          4
This is a bit tricky because it's a mixture of whitespace- and comma-delimited. My solution is a bit ugly -- maybe someone will come up with something better.
x <- readLines(textConnection(
"student_id grades
0 80,100
1 65,71,40
2 99,50,0
3 90
4"))
padNA <- function(x,maxLen) {
  if ((L <- (maxLen-length(x)))>0) x <- c(x,rep(NA,L))
  x
}
getPos <- function(x,n) if (length(x)>=n) x[[n]] else ""
## separate student IDs
student_id <- sapply(strsplit(x[-1],"\\s+"),getPos,1)
## (convert to numeric if you want)
## separate scores
scores <- sapply(strsplit(x[-1],"\\s+"),getPos,2)
## split scores by comma and pad to max length
scoreMat <- do.call(rbind,lapply(strsplit(scores,","),padNA,5))
## convert from character to numeric
storage.mode(scoreMat) <- "numeric"
## combine
data.frame(student_id,scoreMat)
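If the raw file really is one "id,grade,grade,..." line per student, as the example at the top of the question suggests, a short sketch (my own assumption about the format, not part of either answer) that yields the asker's two-column layout is:
raw <- readLines(textConnection(
"0,80,100
1,65,71,40
2,99,50,0
3,90
4"))
data.frame(student_id = sub(",.*", "", raw),          # everything before the first comma
           grades     = sub("^[^,]*,?", "", raw),     # everything after it (may be empty)
           stringsAsFactors = FALSE)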

Data frame "expand" procedure in R?

This is not a real statistical question, but rather a data preparation question before performing the actual statistical analysis. I have a data frame which consists of sparse data. I would like to "expand" this data to include zeroes for missing values, group by group.
Here is an example of the data (a and b are two factors defining the group, t is the sparse timestamp and x is the value):
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
Assuming I would like to expand the values between t=0 and t=9, this is the result I'm hoping for:
test.expanded <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
t=c(0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9),
x=c(1,0,2,1,2,0,0,2,0,0,0,0,0,1,1,0,2,1,1,3))
Zeroes have been inserted for all missing values of t. This makes it easier to use.
I have a quick and dirty implementation which sorts the data frame and loops through each of its lines, adding missing lines one at a time. But I'm not entirely satisfied with that solution. Is there a better way to do it?
For those who are familiar with SAS, it is similar to PROC EXPAND.
Thanks!
As you noted in a comment to the other answer, doing it by group is easy with plyr which just leaves how to "fill in" the data sets. My approach is to use merge.
library("plyr")
test.expanded <- ddply(test, c("a","b"), function(DF) {
  DF <- merge(data.frame(t=0:9), DF[,c("t","x")], all.x=TRUE)
  DF[is.na(DF$x),"x"] <- 0
  DF
})
merge with all.x=TRUE will make the missing values NA, so the second line of the function is needed to replace those NAs with 0's.
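The same idea also works without plyr (a base R sketch of mine, assuming every a/b combination in the data is a real group): build the full grid of groups and timestamps, merge, and zero-fill.
full_grid     <- expand.grid(t = 0:9, b = unique(test$b), a = unique(test$a))
test.expanded <- merge(full_grid, test, all.x = TRUE)            # merges on a, b and t
test.expanded$x[is.na(test.expanded$x)] <- 0                     # zero-fill the gaps
test.expanded <- test.expanded[order(test.expanded$a, test.expanded$b, test.expanded$t), ]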
This is convoluted but works fine:
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
my.seq <- seq(0,9)
not.t <- !(my.seq %in% test$t)
test[nrow(test)+seq(length(my.seq[not.t])),"t"] <- my.seq[not.t]
test
#------------
    a  b t  x
1   1  1 0  1
2   1  1 2  2
3   1  1 3  1
4   1  1 4  2
5   1  1 7  2
6   1  2 3  1
7   1  2 4  1
8   1  2 6  2
9   1  2 7  1
10  1  2 8  1
11  1  2 9  3
12 NA NA 1 NA
13 NA NA 5 NA
Not sure if you want it sorted by t afterwards or not. If so, easy enough to do:
https://stackoverflow.com/a/6871968/636656
