Finding similar elements between two data frames in R

I asked a more complicated version of this question before and did not get any help, so I have tried to simplify the question and the input/output.
I have tried many approaches, but none worked. Here are some of them:
# 1
for(i in ncol(mydata)){
corsA = grep(colnames(mydata)[i] , colnames(mysecond))
mydata[,corsA]%in%mysecond[,i]}
# if this returns TRUE, the columns have a match
## 2
are.cols.identical <- function(col1, col2) identical(mydata[,col1], mysecond[,col2])
res <- outer(colnames(mydata), colnames(mysecond),FUN = Vectorize(are.cols.identical))
cut <- apply(res, 1, function(x)match(TRUE, x))
### 3
(mydata$Rad) %in% (mysecond$Ro5_P1_A5)
#### 4
which(mydata %in% mysecond)
#### 5
match(mydata$sus., mysecond$R5_P1_A5)
or
which(mydata$sus. %in% mysecond$RP1_A5)
matches <- sapply(mydata,function(x) sapply(mysecond,identical,x))
and a few others, but none led me to an answer.

Here is another solution using regex:
rows<-mapply(grep,mysecond,mydata)
The step above will return a list with the matched rows in each column:
rows
If you would like to see how many rows were matched, you can do this:
lapply(rows,length)
Now we can go ahead and get the rows of interest in mydata. Because rows is a list, we need to unlist() it, and because some rows may be matched more than once and should not appear twice in the output, we also apply unique():
rows<-unique(unlist(rows))
mydata[rows,]
#View(mydata[rows,])

require(plyr)
dat <- strsplit(as.character(mydata$subunits..UniProt.IDs.), ',')
dat <- data.frame(mydata[,1],rbind.fill(lapply(dat,function(y){as.data.frame(t(y),stringsAsFactors=FALSE)})))
mydata[unlist(apply(dat,2, function(x) which(x %in% mysecond[,2]))),]
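The real mydata and mysecond are not shown in this thread, so the following is only a minimal sketch on hypothetical toy data (the column names and values are invented for illustration). It swaps in a plain %in% lookup: every value in mysecond is pooled, and each column of mydata is checked against that pool.

```r
# Hypothetical toy data; the real mydata/mysecond are not shown in the thread.
mydata <- data.frame(id  = c("P1", "P2", "P3"),
                     sus = c("A", "B", "C"),
                     stringsAsFactors = FALSE)
mysecond <- data.frame(R1 = c("B", "X"),
                       R2 = c("P3", "Y"),
                       stringsAsFactors = FALSE)

# Pool every value that occurs anywhere in mysecond
pool <- unique(unlist(mysecond, use.names = FALSE))

# For each column of mydata, the row positions whose value occurs in the pool
hits <- lapply(mydata, function(col) which(col %in% pool))

# Rows of mydata with at least one matching value, each listed once
rows <- sort(unique(unlist(hits, use.names = FALSE)))
mydata[rows, ]
```

Here hits tells you which column produced each match, while rows gives the de-duplicated row positions. This matches exact values only; partial matches would need grepl() or a similar function instead.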

Related

How to iterate a function over multiple instances based on number of partial string matches?

Had trouble figuring out the best way to phrase this in the title, but the broader issue here is I'm trying to combine two non-overlapping columns (split by gender) in a dataset into a third, gender-neutral column with values for each row/participant... and then do that for i times.
Here's an example. My dataset is ELSH2, and the first set of columns will be HTM1, HTW1, and HT1. I figured out pretty quickly how to combine columns just once:
ELSH2$HT1 <- ifelse(is.na(ELSH2$HTM1), ELSH2$HTW1, ELSH2$HTM1)
So all the values from the HTW1 and HTM1 columns are now combined in the HT1 column. But essentially what I want is:
ELSH2$HTi <- ifelse(is.na(ELSH2$HTMi), ELSH2$HTWi, ELSH2$HTMi)
where i is each sequential number in the range 1-k, k being the largest number at the end of column names matching the above strings (i.e., there are k columns that start with HTM or HTW; HTM and HTW will always have the same k value). In this example, k=5, but I'm going to do this with multiple cases (i.e., other strings to match in place of HTM/HTW) involving different values of k.
I tried using grepl:
ELSH2[,grepl("HT.", names(ELSH2))] <- ifelse(
is.na(ELSH2[,grepl("HTM.", names(ELSH2))]),
ELSH2[,grepl("HTW.", names(ELSH2))],
ELSH2[,grepl("HTM.", names(ELSH2))])
But I'm getting the following error:
Warning message:
In `[<-.data.frame`(`*tmp*`, , grepl("HTM.", names(ELSH2)), value = list( :
provided 5300 variables to replace 10 variables
I'm pretty sure there's something wrong with the way I'm trying to make the HT columns here, but even if I create them manually, I get the same sort of error.
EDIT: Here's a sample dataset.
HTM1<- rnorm(10)
HTW1<- rnorm(10)
HTM2<- rnorm(10)
HTW2<- rnorm(10)
HTM3<- rnorm(10)
HTW3<- rnorm(10)
HTM4<- rnorm(10)
HTW4<- rnorm(10)
HTM5<- rnorm(10)
HTW5<- rnorm(10)
HTM <- data.frame(HTM1,HTM2,HTM3,HTM4,HTM5)
HTW <- data.frame(HTW1,HTW2,HTW3,HTW4,HTW5)
HTM[1, ] <- NA
HTM[3, ] <- NA
HTM[5, ] <- NA
HTM[7, ] <- NA
HTM[9, ] <- NA
HTW[2, ] <- NA
HTW[4, ] <- NA
HTW[6, ] <- NA
HTW[8, ] <- NA
HTW[10, ] <- NA
ELSH2 <- cbind(HTW, HTM)
ELSH2 looks like this:
And I want the final HT columns to look like this poorly photoshopped monstrosity:
Just interleaving the columns where they have missing values.
One possibility is just to treat this like a reshaping problem. Here we use dplyr and tidyr to make that easier:
library(dplyr)
library(tidyr)
ELSH2 %>%
mutate(row=row_number()) %>%
pivot_longer(HTW1:HTM5) %>%
filter(!is.na(value)) %>%
extract(name, into=c("prefix","code"), "^([A-Za-z]+)(\\d+)$") %>%
mutate(name=paste0("HT", code)) %>%
pivot_wider(id_cols=row, names_from=name, values_from=value)
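For the "do that for i times" part of the question, a plain loop over the numeric suffixes may also be enough. The sketch below is one possible base R approach, not the answerer's method. The helper combine_pairs() and its argument names are hypothetical, and it assumes every column named first + i has a partner named second + i.

```r
# Rebuild the sample data from the question (seed added for reproducibility)
set.seed(42)
HTM <- as.data.frame(matrix(rnorm(50), nrow = 10,
                            dimnames = list(NULL, paste0("HTM", 1:5))))
HTW <- as.data.frame(matrix(rnorm(50), nrow = 10,
                            dimnames = list(NULL, paste0("HTW", 1:5))))
HTM[c(1, 3, 5, 7, 9), ] <- NA
HTW[c(2, 4, 6, 8, 10), ] <- NA
ELSH2 <- cbind(HTW, HTM)

# Hypothetical helper: for each numeric suffix i found on the `first` prefix,
# create out<i> from first<i>, falling back to second<i> where first<i> is NA.
combine_pairs <- function(data, first, second, out) {
  cols <- grep(paste0("^", first, "[0-9]+$"), names(data), value = TRUE)
  suffixes <- sub(paste0("^", first), "", cols)
  for (i in suffixes) {
    a <- data[[paste0(first, i)]]
    b <- data[[paste0(second, i)]]
    data[[paste0(out, i)]] <- ifelse(is.na(a), b, a)
  }
  data
}

ELSH2 <- combine_pairs(ELSH2, "HTM", "HTW", "HT")
```

Because k is read off the column names, the same call can be reused for other prefix pairs with a different number of columns.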

Is there an easy way to tell if many data frames stored in one list contain the same columns?

I have a list containing many data frames:
df1 <- data.frame(A = 1:5, B = 2:6, C = LETTERS[1:5])
df2 <- data.frame(A = 1:5, B = 2:6, C = LETTERS[1:5])
df3 <- data.frame(A = 1:5, C = LETTERS[1:5])
my_list <- list(df1, df2, df3)
I want to know if every data frame in this list contains the same columns (i.e., the same number of columns, all having the same names and in the same order).
I know that you can easily find column names of data frames in a list using lapply:
lapply(my_list, colnames)
Is there a way to determine if any differences in column names occur? I realize this is a complicated question involving pairwise comparisons.
You can avoid pairwise comparisons by simply checking whether the count of each column name equals length(my_list). This checks the number of columns and their names at the same time (but not their order) -
lapply(my_list, names) %>%
unlist() %>%
table() %>%
all(. == length(my_list))
[1] FALSE
In base R i.e. without %>% -
all(table(unlist(lapply(my_list, names))) == length(my_list))
[1] FALSE
or slightly more optimized -
!any(table(unlist(lapply(my_list, names))) != length(my_list))
Here's another base solution with Reduce:
!is.logical(
Reduce(function(x,y) if(identical(x,y)) x else FALSE
, lapply(my_list, names)
)
)
You could also account for same columns in a different order with
!is.logical(
Reduce(function(x,y) if(identical(x,y)) x else FALSE
, lapply(my_list, function(z) sort(names(z)))
)
)
As for what's going on, Reduce() accumulates as it goes through the list. First, identical(names_df1, names_df2) is evaluated. If it's true, we return the same names vector, so we can keep comparing it to the remaining members of the list. If it's false, FALSE is returned and every later comparison fails as well.
Finally, if everything evaluates as true, we get a character vector returned. Since you probably want a logical output, !is.logical(...) is used to turn that character vector into a boolean.
See also here as I was very inspired by another post:
check whether all elements of a list are in equal in R
And a similar one that I saw after my edit:
Test for equality between all members of list
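The same check can also be written without Reduce(). The sketch below uses a different technique: it collects the name vectors and tests whether only one distinct vector remains after unique(). The helper same_columns() and its ignore_order argument are hypothetical names introduced for illustration.

```r
df1 <- data.frame(A = 1:5, B = 2:6, C = LETTERS[1:5])
df2 <- data.frame(A = 1:5, B = 2:6, C = LETTERS[1:5])
df3 <- data.frame(A = 1:5, C = LETTERS[1:5])

# Hypothetical helper: TRUE if every data frame has the same column names.
# With ignore_order = TRUE the names are sorted first, so order is ignored.
same_columns <- function(dfs, ignore_order = FALSE) {
  cols <- lapply(dfs, names)
  if (ignore_order) cols <- lapply(cols, sort)
  length(unique(cols)) == 1L
}

same_columns(list(df1, df2, df3))                # df3 lacks column B
same_columns(list(df1, df2))                     # identical names and order
same_columns(list(df1, df1[c("C", "B", "A")]))   # same names, different order
same_columns(list(df1, df1[c("C", "B", "A")]), ignore_order = TRUE)
```

Unlike the table()-based count, this version is sensitive to column order by default, which is what the question asks for.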
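We can use dplyr::bind_rows, which fills columns missing from a data frame with NA. This assumes the data frames themselves contain no NA values, and it does not check column order: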
!any(is.na(dplyr::bind_rows(my_list)))
# [1] FALSE
Here is my answer:
k <- 1
output <- NULL
for(i in 1:(length(my_list) - 1)) {
for(j in (i + 1):length(my_list)) {
output[k] <- identical(colnames(my_list[[i]]), colnames(my_list[[j]]))
k <- k + 1
}
}
all(output)

read.csv.chunked not working with filter %in% list or is.element in list in R

I have a large file (>6GB) with 5million+ rows and 329 columns.
Need to pull out full records for 23k rows for a fixed list of Health Care Providers (HCPlist$NPI). Trying to subset or filter by chunks during read as file size overloads my 14GB RAM.
Initially had trouble due to data type so I have already converted HCPlist$NPI to integer to match data type in source file.
Tried the following and both ran smoothly but came up with 0 rows and 329 columns (ie no records)
f <- function(x, pos) filter(x, x[,1] %in% HCPlist$NPI)
NPPESinfo_list <- read_csv_chunked("npidata_pfile_20050523-20190210.csv",
DataFrameCallback$new(f), chunk_size = 10000)
Also tried subset instead of filter as well as the following...also all ran smoothly but output was 0 rows and 329 columns (ie again no records)
# Filter NPPES Data for NPIs
f <- function(x, pos) x[(is.element(x[,1], HCPlist$NPI)),]
NPPESinfo_list <- read_csv_chunked("npidata_pfile_20050523-20190210.csv",
DataFrameCallback$new(f), chunk_size = 10000)
I have run similar code in the past filtering specific specialty codes and it has run fine. For example...
# Filter NPPES Data for Specialty (Medical Oncologists = "207RX0202X")
f2 <- function(x, pos) subset(x,
x[,48] == "207RX0202X" |
x[,52] == "207RX0202X" )
NPIs_MedOnc <- read_csv_chunked("npidata_pfile_20050523-20190210.csv",
DataFrameCallback$new(f2), chunk_size = 10000)
When I run the same filter above only against the first 2000 rows of the file it runs fine.
# Test run on first 2000 rows
df <- read.csv(file="npidata_pfile_20050523-20190210.csv",nrows=2000)
df2 <- filter(df, df[,1] %in% HCPlist$NPI)
I get a nice dataframe with 48 rows and 329 columns.
Not sure why filter with %in% works fine on just the first 2000 rows and gives me 48 records. However when part of a function and applied to read.csv.chunked it gives me no records?
Could use some help here as I haven't found a similar case/question elsewhere on Stackoverflow or google.
The parts seem to work fine but when I put what I want together not getting the needed records.
Thanks in advance!!!
I'm not sure if this is the answer, but I tried a few things with the mtcars dataset.
I tried to select only the cars with 3 gears.
This works:
library(tidyverse)
library(readr)
f1 <- function(x, pos) filter(x, gear %in% c(3))
read_csv_chunked(readr_example("mtcars.csv"), DataFrameCallback$new(f1), chunk_size = 5)
And this also works, just like you showed:
library(tidyverse)
library(readr)
my_df <- read.csv(readr_example("mtcars.csv"), nrows = 5)
my_df2 <- filter(my_df, my_df[, 10] %in% c(3))
But this gives me 0 rows:
library(tidyverse)
library(readr)
f2 <- function(x, pos) filter(x, x[, 10] %in% c(3))
read_csv_chunked(readr_example("mtcars.csv"), DataFrameCallback$new(f2), chunk_size = 5)
I don't know why (yet) it behaves like this, but the trick seems to be to use the column name in your function f.
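A likely explanation, offered as an assumption rather than something confirmed in this thread: read_csv_chunked() passes each chunk to the callback as a tibble, and x[, 10] on a tibble returns a one-column tibble rather than a vector. read.csv() returns a plain data.frame, where df[, 10] drops to a vector. The base R sketch below mimics the tibble behaviour with drop = FALSE to show what %in% then does.

```r
df <- data.frame(mpg = c(21, 22.8, 18.7), gear = c(3, 4, 3))

# Plain data.frame: single-column indexing drops to a vector,
# so %in% is evaluated element by element.
as_vector <- df[, 2]
res_vector <- as_vector %in% 3
res_vector

# Emulating a tibble: the result stays a one-column data frame, which
# %in% treats as a list of length 1. The whole column is compared to 3
# as a single element, so the result is a single FALSE.
as_frame <- df[, 2, drop = FALSE]
res_frame <- as_frame %in% 3
res_frame

# Extracting the column with [[ ]] or by name gives a vector again
res_fixed <- df[["gear"]] %in% 3
res_fixed
```

A single FALSE is recycled by filter() across all rows, which would give 0 rows with all columns intact. Referring to the column by name inside filter(), or using x[[10]], avoids this.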
Thanks ricoderks!!! Great insight for a simple but effective fix!!!
For some reason read.csv.chunked did not like having the variable identified by the column number indicator in the function (ie. x[, 10]), even though this same function worked great separately. Not sure why...
Simplest solution worked best!
Just replaced x[, 10] with the name of the column/variable NPI. Did not even include the name of the dataframe as it is already specified as part of filter function, and it worked like a charm!
More specifically...replaced this...
f <- function(x, pos) filter(x, x[,1] %in% HCPlist$NPI)
NPPESinfo_list <- read_csv_chunked("npidata_pfile_20050523-20190210.csv",
DataFrameCallback$new(f), chunk_size = 10000)
...with this...
f <- function(x, pos) filter(x, NPI %in% HCPlist$NPI)
NPPESinfo_list <- read_csv_chunked("npidata_pfile_20050523-20190210.csv",
DataFrameCallback$new(f), chunk_size = 10000)
Worked perfectly!!!
Thanks again!!!

R: How to efficiently find out whether data.frame A is contained in data.frame B?

In order to find out whether data frame df.a is a subset of data frame df.b I did the following:
df.a <- data.frame( x=1:5, y=6:10 )
df.b <- data.frame( x=1:7, y=6:12 )
inds.x <- as.integer( lapply( df.a$x, function(x) which(df.b$x == x) ))
inds.y <- as.integer( lapply( df.a$y, function(y) which(df.b$y == y) ))
identical( inds.x, inds.y )
The last line gave TRUE, hence df.a is contained in df.b.
Now I wonder whether there is a more elegant - and possibly more efficient - way to answer this question?
This task also is easily extended to find the intersection between two given data frames, possibly based on only a subset of columns.
Help will be much appreciated.
I am going to hazard a guess at an answer.
I think semi_join from dplyr will do what you want, even taking into account duplicated rows.
First note the helpfile ?semi_join:
return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.
Ok, this suggests that the following should correctly fail:
library(dplyr)
df.a <- data.frame( x=c(1:5,1), y=c(6:10,6) )
df.b <- data.frame( x=1:7, y=6:12 )
identical(semi_join(df.b, df.a), semi_join(df.a, df.a))
which gives FALSE, as expected since
> semi_join(df.b, df.a)
Joining by: c("x", "y")
x y
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
However, the following should pass:
df.c <- data.frame( x=c(1:7, 1), y= c(6:12, 6) )
identical(semi_join(df.c, df.a), semi_join(df.a, df.a))
and it does, giving TRUE.
The second semi_join(df.a, df.a) is required to get the canonical sorting on df.a.
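If duplicated rows do not matter, a base R sketch along the following lines may also work. It uses a different technique from the semi_join approach: after de-duplicating both sides, every row of a must survive an inner merge() with b. The helper is_row_subset() and its by argument are hypothetical names. The by argument covers the "subset of columns" case from the question.

```r
df.a <- data.frame(x = 1:5, y = 6:10)
df.b <- data.frame(x = 1:7, y = 6:12)

# Hypothetical helper: TRUE if every distinct row of `a` (restricted to the
# columns in `by`) also occurs in `b`. Duplicate counts are ignored.
is_row_subset <- function(a, b, by = intersect(names(a), names(b))) {
  a <- unique(a[by])
  b <- unique(b[by])
  nrow(merge(a, b, by = by)) == nrow(a)
}

is_row_subset(df.a, df.b)            # all rows of df.a occur in df.b
is_row_subset(df.b, df.a)            # rows 6 and 7 of df.b are not in df.a
is_row_subset(df.a, df.b, by = "x")  # compare on column x only
```

The intersection itself is merge(unique(df.a), unique(df.b)), so the same call answers the follow-up question about finding common rows.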

Create a table from survey results

I have the following data and I was wondering how to generate a table of the frequency from each response via base, plyr, or another package.
My data:
df = data.frame(id = c(1,2,3,4,5),
Did_you_use_tv=c("tv","","","tv","tv"),
Did_you_use_internet=c("","","","int","int"))
df
I can get the frequencies for any column using table():
table(df[,2])
table(df[,2], df[,3])
However, how can I go about setting up the data so it looks like below.
df2 = data.frame(Did_you_use_tv=c(3),
Did_you_use_internet=c(2))
df2
It's just a summary of frequencies for each column.
I'm going to be creating cross tabs but given the structure of the data, I feel this may be a little more useful.
This is similar in concept to @Tyler's answer. Just count the values in each column that are not equal to "":
colSums(df[-1] != "")
# Did_you_use_tv Did_you_use_internet
# 3 2
Update
Fellow Stack Overflow user @juba has done some work on a function called multi.table which looks like this:
multi.table <- function(df, true.codes=NULL, weights=NULL) {
true.codes <- c(as.list(true.codes), TRUE, 1)
as.table(sapply(df, function(v) {
sel <- as.numeric(v %in% true.codes)
if (!is.null(weights)) sel <- sel * weights
sum(sel)
}))
}
The function is part of the questionr package.
Usage in your example would be:
library(questionr)
multi.table(df[-1], true.codes=list("tv", "int"))
# Did_you_use_tv Did_you_use_internet
# 3 2
Here's one approach of many that came to mind:
FUN <- function(x) sum(x != "")
do.call(cbind, lapply(df[, -1], FUN))
## Did_you_use_tv Did_you_use_internet
## [1,] 3 2
Here's another approach:
do.call(cbind, lapply(df[,-1], table))[-1, ]
# Did_you_use_tv Did_you_use_internet
#              3                    2
With reshape2 (melt() and dcast()):
t(dcast(subset(melt(df,id.var="id"), value!=""), variable ~ .))
