How to merge two data frames using (parts of) text values? [duplicate] - r

This question already has answers here:
R: I have to do Softmatch in String
(2 answers)
Closed 9 years ago.
I have two data frames, both with columns containing text. I want to merge those data frames using (imperfect) matches between the text columns. For example, if cell 1 of the text column of data frame 1 contains (part of) a word that resembles (part of) a word in cell 2 of the text column of data frame 2, then I want the data frames to be merged using these cells. What is the best way to do this in R?
I am not sure if my question is clear enough, but if so, does anyone know of an R package or a function that can help me do this kind of merging?
Many thanks in advance!

Try the RecordLinkage package.
Here is a possible solution where the merge is based on how closely the two strings match, measured by Levenshtein (edit) distance:
library(reshape2)
library(RecordLinkage)
set.seed(16)
l <- LETTERS[1:10]
ex1 <- data.frame(lets = paste(l, l, l, sep = ""), nums = 1:10)
ex2 <- data.frame(lets = paste(sample(l), sample(l), sample(l), sep = ""),
nums = 11:20)
ex1
# lets nums
# 1 AAA 1
# 2 BBB 2
# 3 CCC 3
# 4 DDD 4
# 5 EEE 5
# 6 FFF 6
# 7 GGG 7
# 8 HHH 8
# 9 III 9
# 10 JJJ 10
ex2
# lets nums
# 1 GDJ 11
# 2 CFH 12
# 3 DBE 13
# 4 BED 14
# 5 FJB 15
# 6 JHG 16
# 7 AII 17
# 8 ICC 18
# 9 EGF 19
# 10 HAA 20
lets <- melt(outer(ex1$lets, ex2$lets, FUN = "levenshteinDist"))
lets <- lets[lets$value < 2, ] # adjust the "< 2" as necessary
cbind(ex1[lets$Var1, ], ex2[lets$Var2, ])
# lets nums lets nums
# 9 III 9 AII 17
# 3 CCC 3 ICC 18
# 1 AAA 1 HAA 20
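If installing extra packages is not an option, the same idea can be sketched with base R's adist(), which also computes a (generalized) Levenshtein distance. The toy data frames df1 and df2 and the output column names below are assumptions made for illustration; they are not part of the original answer.

```r
# Toy data (hypothetical), mirroring the structure of ex1 and ex2 above
df1 <- data.frame(lets = c("AAA", "CCC", "III"), nums = 1:3,
                  stringsAsFactors = FALSE)
df2 <- data.frame(lets = c("AII", "ICC", "HAA", "GDJ"), nums = 11:14,
                  stringsAsFactors = FALSE)

# Matrix of edit distances: rows index df1, columns index df2
d <- utils::adist(df1$lets, df2$lets)

# Row/column positions of every pair closer than the threshold
hits <- which(d < 2, arr.ind = TRUE)  # adjust the "< 2" as necessary

# Combine the matching rows side by side, with unambiguous column names
matched <- data.frame(
  lets1 = df1$lets[hits[, 1]], nums1 = df1$nums[hits[, 1]],
  lets2 = df2$lets[hits[, 2]], nums2 = df2$nums[hits[, 2]]
)
matched
```

Note that the distance matrix has one cell per pair of rows, so this approach may become memory-hungry for large data frames.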

Related

How to use a fulljoin on my dataframes and rename columns with the same name R

I have two dataframes and they both have the exact same column names, however the data in the columns is different in each dataframe. I am trying to join the two frames (as seen below) by a full join. However, the hard part for me is the fact that I have to rename the columns so that the columns corresponding to my one dataset have some text added to the end while adding different text to the end of the columns that correspond to the second data set.
combined_df <- full_join(any.drinking, binge.drinking, by = ?)
A look at one of my df's:
Without custom function and shorter:
df <- cbind(cars, cars)
colnames(df) <- c(paste0(colnames(cars), "_any"), paste0(colnames(cars), "_binge"))
Output:
> head(df)
speed_any dist_any speed_binge dist_binge
1 4 2 4 2
2 4 10 4 10
3 7 4 7 4
4 7 22 7 22
5 8 16 8 16
6 9 10 9 10
Certainly not the most elegant way but maybe it is what you want:
custom_bind <- function(df1, suffix1, df2, suffix2){
colnames(df1) <- paste(colnames(df1), suffix1, sep = "_")
colnames(df2) <- paste(colnames(df2), suffix2, sep = "_")
df <- cbind(df1, df2)
return(df)
}
custom_bind(cars, "any", cars, "binge")
I made it as a function in case you want to do it with other tables. If not then it is not necessary.
Output:
> head(custom_bind(cars, "any", cars, "binge"))
speed_any dist_any speed_binge dist_binge
1 4 2 4 2
2 4 10 4 10
3 7 4 7 4
4 7 22 7 22
5 8 16 8 16
6 9 10 9 10
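If the two data frames share a key column, a join can add the suffixes itself. The sketch below uses base R's merge() with all = TRUE (a full outer join) and its suffixes argument; dplyr's full_join() accepts a similar suffix argument. The key column state, the value column both_sexes, and the numbers are hypothetical, since the question does not show the actual data.

```r
# Hypothetical stand-ins for the two data frames in the question
any.drinking <- data.frame(state = c("AL", "AK", "AZ"),
                           both_sexes = c(50.1, 60.2, 55.3))
binge.drinking <- data.frame(state = c("AK", "AZ", "AR"),
                             both_sexes = c(20.4, 17.5, 15.6))

# Full outer join on the key; non-key columns with clashing names
# get the given suffixes appended
combined_df <- merge(any.drinking, binge.drinking,
                     by = "state", all = TRUE,
                     suffixes = c("_any", "_binge"))
combined_df
```

Unlike cbind, this aligns rows by key, so rows present in only one data frame are kept and filled with NA on the other side.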

How to check if rows in one column present in another column in R

I have a data set = data1 with id and emails as follows:
id emails
1 A,B,C,D,E
2 F,G,H,A,C,D
3 I,K,L,T
4 S,V,F,R,D,S,W,A
5 P,A,L,S
6 Q,W,E,R,F
7 S,D,F,E,Q
8 Z,A,D,E,F,R
9 X,C,F,G,H
10 A,V,D,S,C,E
I have another data set = data2 with check_email as follows:
check_email
A
D
S
V
I want to check whether the values in the check_email column of data2 are present in data1, and keep only those ids from data1 whose emails contain at least one check_email value.
My desired output will be:
id
1
2
4
5
7
8
10
I wrote code using a for loop, but it takes forever because my actual dataset is very large.
Any advice in this regard will be highly appreciated!
You can use a regular expression to subset your data. First collapse everything into one pattern:
paste(data2$check_email, collapse = "|")
# [1] "A|D|S|V"
Then use grep to get the indices of the rows whose emails match the pattern:
grep(paste(data2$check_email, collapse = "|"), data1$emails)
# [1] 1 2 4 5 7 8 10
And then combine everything:
data1[grep(paste(data2$check_email, collapse = "|"), data1$emails), ]
# id emails
# 1 1 A,B,C,D,E
# 2 2 F,G,H,A,C,D
# 3 4 S,V,F,R,D,S,W,A
# 4 5 P,A,L,S
# 5 7 S,D,F,E,Q
# 6 8 Z,A,D,E,F,R
# 7 10 A,V,D,S,C,E
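One caveat with a plain "A|D|S|V" pattern is that it matches substrings, so "A" would also match an entry such as "AB". A possible refinement, sketched here on made-up data, anchors each alternative between commas or string boundaries. The small data1 and data2 below are hypothetical, and real e-mail addresses contain regex metacharacters such as ".", which would need escaping first.

```r
# Hypothetical data where substring matching gives a false positive
data1 <- data.frame(id = 1:3, emails = c("A,B", "AB,C", "X,D"),
                    stringsAsFactors = FALSE)
data2 <- data.frame(check_email = c("A", "D"), stringsAsFactors = FALSE)

alternatives <- paste(data2$check_email, collapse = "|")

# Loose pattern: matches "A" inside "AB" as well
loose <- grepl(alternatives, data1$emails)

# Strict pattern: a value must sit between commas or string boundaries
strict_pattern <- paste0("(^|,)(", alternatives, ")(,|$)")
strict <- grepl(strict_pattern, data1$emails)

data1[strict, "id", drop = FALSE]
```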
data1[rowSums(sapply(data2$check_email, function(x) grepl(x, data1$emails))) > 0, "id", drop = FALSE]
id
1 1
2 2
4 4
5 5
7 7
8 8
10 10
We can split each element of as.character(data1$emails) into substrings with strsplit, then iterate over the resulting list with sapply, checking whether any substring appears in data2$check_email. Finally, we extract the matching rows from data1:
> emails <- strsplit(as.character(data1$emails), ",")
> ind <- sapply(emails, function(emails) any(emails %in% as.character(data2$check_email)))
> data1[ind,"id", drop = FALSE]
id
1 1
2 2
4 4
5 5
7 7
8 8
10 10

Create all possible combinations from two values for each element in a vector in R [duplicate]

This question already has answers here:
How to generate a matrix of combinations
(3 answers)
Closed 6 years ago.
I have been trying to create vectors where each element can take two different values present in two different vectors.
For example, if there are two vectors a and b, where a is c(6,2,9) and b is c(12,5,15) then the output should be 8 vectors given as follows,
6 2 9
6 2 15
6 5 9
6 5 15
12 2 9
12 2 15
12 5 9
12 5 15
The following piece of code works,
aa1 <- c(6,12)
aa2 <- c(2,5)
aa3 <- c(9,15)
for(a1 in 1:2)
for(a2 in 1:2)
for(a3 in 1:2)
{
v <- c(aa1[a1],aa2[a2],aa3[a3])
print(v)
}
But I was wondering if there was a simpler way to do this instead of writing several for loops which will also increase linearly with the number of elements the final vector will have.
expand.grid builds all combinations of the vectors you pass it. In this case you first need to rearrange your vectors into a pair of first elements, a pair of second elements, and a pair of third elements, so the final call is:
expand.grid(c(6, 12), c(2, 5), c(9, 15))
A quick way to rearrange the vectors in base R is Map, the multivariate version of lapply, with c() as the function:
a <- c(6, 2, 9)
b <- c(12, 5, 15)
Map(c, a, b)
## [[1]]
## [1] 6 12
##
## [[2]]
## [1] 2 5
##
## [[3]]
## [1] 9 15
Conveniently expand.grid is happy with either individual vectors or a list of vectors, so we can just call:
expand.grid(Map(c, a, b))
## Var1 Var2 Var3
## 1 6 2 9
## 2 12 2 9
## 3 6 5 9
## 4 12 5 9
## 5 6 2 15
## 6 12 2 15
## 7 6 5 15
## 8 12 5 15
If Map is confusing you, if you put a and b in a list, purrr::transpose will do the same thing, flipping from a list of two elements of length three to a list of three elements of length two:
library(purrr)
list(a, b) %>% transpose() %>% expand.grid()
This returns the same thing.
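expand.grid varies its first argument fastest, so the rows come out in a different order from the listing in the question. If that order matters, one possible sketch (an assumption about what the asker wants, not part of the original answer) is to reverse the list before expanding and then reverse the columns back:

```r
a <- c(6, 2, 9)
b <- c(12, 5, 15)

# Pair up first, second and third elements: list(c(6, 12), c(2, 5), c(9, 15))
pairs <- Map(c, a, b)

# Expand the reversed list so that the LAST position varies fastest,
# then restore the original column order and drop the dimnames
grid <- expand.grid(rev(pairs))
combos <- unname(as.matrix(grid[, rev(seq_along(grid))]))
combos
```

Each row of combos is one of the 2^3 = 8 vectors, starting with 6 2 9 and 6 2 15 as in the question.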
I think what you're looking for is expand.grid.
a <- c(6,2,9)
b <- c(12,5,15)
expand.grid(a,b)
Var1 Var2
1 6 12
2 2 12
3 9 12
4 6 5
5 2 5
6 9 5
7 6 15
8 2 15
9 9 15

reshape data into panel with multiple variables and no time variable in R

I'm new to reshaping data in R and can't figure out how to use reshape() (or another package) to create panel data. There are two time observations for each geographical unit, but each time observation is stored in a separate variable. For example:
subdistrict <- 1:4
control_t1 <- 5:8
control_t2 <- 9:12
motivation_t1 <- 12:15
motivation_t2 <- 16:19
data_mat <- as.data.frame(cbind(subdistrict, control_t1, control_t2, motivation_t1, motivation_t2))
data_mat
subdistrict control_t1 control_t2 motivation_t1 motivation_t2
1 1 5 9 12 16
2 2 6 10 13 17
3 3 7 11 14 18
4 4 8 12 15 19
Here, control_t1 and control_t2 each refer to a different period. My goal is to reshape the data so that a time variable is created and the named variables are collapsed, producing the following frame:
subdistrict time control motivation
1 1 5 12
1 2 9 16
2 1 6 13
2 2 10 17
3 1 7 14
3 2 11 18
4 1 8 15
4 2 12 19
I'm not sure how to create the new time variable, and collapse and rename the variables to reshape the data as such. Thanks for any help.
You just have to use the reshape() function with the option direction = "long". Here is the code:
district <- 1:4
control_t1 <- 5:8
control_t2 <- 9:12
relax_t1 <- 12:15
relax_t2 <- 16:19
data_mat <- as.data.frame(cbind(district, control_t1, control_t2, relax_t1, relax_t2))
reshape(data = data_mat, direction = "long", idvar = "district", timevar = "time", varying = list(c(2:3), c(4:5)))
# district time control_t1 relax_t1
# 1.1 1 1 5 12
# 2.1 2 1 6 13
# 3.1 3 1 7 14
# 4.1 4 1 8 15
# 1.2 1 2 9 16
# 2.2 2 2 10 17
# 3.2 3 2 11 18
# 4.2 4 2 12 19
Have a look at the R Programming wikibooks to learn more.
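To get the column names and row order asked for in the question, reshape() can also be given v.names for the stacked columns. This is a sketch using the question's own variable names; sorting the result by subdistrict and time is an assumption about the desired layout.

```r
data_mat <- data.frame(
  subdistrict   = 1:4,
  control_t1    = 5:8,
  control_t2    = 9:12,
  motivation_t1 = 12:15,
  motivation_t2 = 16:19
)

long <- reshape(
  data_mat,
  direction = "long",
  idvar     = "subdistrict",
  timevar   = "time",
  times     = 1:2,
  # one vector of wide columns per long column, in time order
  varying   = list(c("control_t1", "control_t2"),
                   c("motivation_t1", "motivation_t2")),
  v.names   = c("control", "motivation")
)

# Sort so that both periods of a subdistrict sit next to each other
long <- long[order(long$subdistrict, long$time), ]
long
```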
A simple answer is to split and rebind the data frame into your new form, like so:
new_Data <- data.frame(
subdistrict=data_mat[,1],
control=unlist(data_mat[,2:3]),
motivation=unlist(data_mat[,4:5]))
All we are doing here is collapsing the two columns of 'control' and 'motivation' into single columns of data by using the 'unlist' function and then binding it all into a new data frame. The 'subdistrict' data repeats, so there is no reason to specify it twice.
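The data frame built this way has no time column. A minimal sketch of adding one, assuming the _t1 columns are period 1 and the _t2 columns are period 2, relies on unlist() stacking the columns one after another:

```r
data_mat <- data.frame(
  subdistrict   = 1:4,
  control_t1    = 5:8,
  control_t2    = 9:12,
  motivation_t1 = 12:15,
  motivation_t2 = 16:19
)

n <- nrow(data_mat)
new_Data <- data.frame(
  subdistrict = rep(data_mat$subdistrict, times = 2),
  # unlist() stacks all t1 values first, then all t2 values
  time        = rep(1:2, each = n),
  control     = unlist(data_mat[, 2:3], use.names = FALSE),
  motivation  = unlist(data_mat[, 4:5], use.names = FALSE)
)
new_Data
```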

How to perform wilcox test in R?

I have this data frame with 4 genes and 3 samples measured in duplicate.
The TS is the standard.
I want to perform the Wilcoxon test between sample S1 and TS, and between S2 and TS, for each protein, but I'm having problems with the for loop.
MS.rawMV <- read.table("C:/Users/aaa/Desktop/genomic/MS.csv", header=T)
S1_1 S1_2 S2_1 S2_2 TS_1 TS_2
gene 1 1 1 2 3 5 5
gene 2 10 10 4 5 9 10
gene 3 5 6 4 4 5 7
gene 4 9 9 8 7 6 6
Samples=list(
S1=grep("S1_*", colnames(MS.rawMV), value=TRUE),
S2=grep("S2_*", colnames(MS.rawMV), value=TRUE),
TS=grep("TS_*", colnames(MS.rawMV), value=TRUE))
sample.names <- names(Samples)
ref.sample <- "TS_"
# Build a data.frame
GRates <- data.frame(MS.rawMV[Reduce("c", Samples)])
## Statistics: non-parametric test using TS as a standard
for (i in names(Samples)) {
WILCOXTEST <- wilcox.test(GRates[c(Samples[[i]])],Samples[[ref.sample]])
pnames <- paste(i,".wilcoxtest",sep="")
GRates[pnames] <- WILCOXTEST["p.value"]
}
Error in wilcox.test.default(GRates[Samples[[i]]], Samples[[ref.sample[i]]]) :
'x' must be numeric
It looks like the data is being treated as a factor.
The easiest fix would be to convert them back to numeric via factor->character->numeric.
Try this:
wilcox.test(
as.numeric(as.character(GRates[c(Samples[[i]])])),
as.numeric(as.character(Samples[[ref.sample]]))
)
If you try to convert straight to numeric from factor, you'll end up with integers that represent the factor classes instead of the actual values.
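The difference between the two conversions can be seen on a small factor. The vector f below is made up for illustration:

```r
# A factor whose labels look like numbers
f <- factor(c("10", "5", "9"))

# Direct conversion returns the internal level codes, not the values
codes <- as.numeric(f)

# Going through character first recovers the actual numbers
values <- as.numeric(as.character(f))

codes
values
```

Here codes is 1 2 3 (the positions of the levels "10", "5", "9"), while values is 10 5 9.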
@DWin's comment is well taken (you have additional structure in your data that is hard to incorporate into a Wilcoxon test). However, if you want to ignore the distinction between the _1 and _2 columns and run a Wilcoxon test on S1 vs TS and S2 vs TS, here's a way to rearrange the data and do it:
dat <- read.table(text="
gene S1_1 S1_2 S2_1 S2_2 TS_1 TS_2
1 1 1 2 3 5 5
2 10 10 4 5 9 10
3 5 6 4 4 5 7
4 9 9 8 7 6 6",
header=TRUE)
library(reshape2)
library(plyr)
m1 <- melt(dat,id.var="gene")
## break var_num into separate components
m2 <- subset(data.frame(m1,
colsplit(m1$variable,"_",names=c("var","num"))),
select=-variable)
## combine treatments with standards
m3 <- merge(subset(m2,var!="TS"),
subset(m2,var=="TS"),by=c("gene","num"))
## clean up
m4 <- subset(rename(m3,c(value.x="value",var.x="var",value.y="standard")),
select=-var.y)
## apply Wilcoxon test to each component, save the p value
ddply(m4,"var",
function(x) with(x,wilcox.test(value,standard))$p.value)
Or, if you want to test each replication separately (as in @agstudy's answer), do
ddply(m4,c("var","num"),
function(x) with(x,wilcox.test(value,standard))$p.value)
instead.
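For readers without reshape2 and plyr, the pooled comparison (ignoring the _1/_2 distinction) can be sketched in base R. The helper names samples and p_values are illustrative, and exact = FALSE is an added choice to request the normal approximation up front, since the data contain ties:

```r
dat <- read.table(text = "
gene S1_1 S1_2 S2_1 S2_2 TS_1 TS_2
1 1 1 2 3 5 5
2 10 10 4 5 9 10
3 5 6 4 4 5 7
4 9 9 8 7 6 6", header = TRUE)

# All standard (TS) measurements pooled into one numeric vector
ref <- unlist(dat[, grep("^TS_", names(dat))])

samples <- c("S1", "S2")
p_values <- sapply(samples, function(s) {
  # Pool both replicate columns of this sample
  x <- unlist(dat[, grep(paste0("^", s, "_"), names(dat))])
  wilcox.test(x, ref, exact = FALSE)$p.value
})
p_values
```

unlist() turns the selected data frame columns into a plain numeric vector, which is what wilcox.test expects; passing a data frame directly is one way to trigger the "'x' must be numeric" error from the question.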
I think that, since wilcox.test is not vectorized, you need two loops. Although I am not sure of the statistical meaning of this, here is how you can do it:
nn <- colnames(dat)
lapply(1:2,function(x){
col.L <- grep(paste0('S',x,'_*'),nn)
col.R <- dat[,paste0('TS_',x)]
lapply(col.L,function(y)
wilcox.test(dat[,y],col.R)['p.value'])
})
Here I assume dat is:
dat <- read.table(text='S1_1 S1_2 S2_1 S2_2 TS_1 TS_2
gene_1 1 1 2 3 5 5
gene_2 10 10 4 5 9 10
gene_3 5 6 4 4 5 7
gene_4 9 9 8 7 6 6',header=TRUE)
