overlapping unique dataframes in R

My two dataframes are:
df1<-structure(list(header1 = structure(1:4, .Label = c("a", "b",
"c", "d"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
and
df2<-structure(list(sample_x = structure(c(1L, 1L, 2L, 3L), .Label = c("0",
"a", "c"), class = "factor"), sample_y = structure(c(1L, 3L,
2L, 4L), .Label = c("0", "a", "m", "t"), class = "factor"), sample_z = structure(c(3L,
2L, 1L, 1L), .Label = c("0", "a", "c"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
A 0 in df2 means no value.
Now I want to overlap df1 and df2 to produce an output data frame (df3):
df3<-structure(list(sample_x = c(2L, 2L, 0L), sample_y = c(1L, 3L,
2L), sample_z = c(2L, 2L, 0L)), class = "data.frame", row.names = c("overlap_df1_df2",
"unique_df1", "unique_df2"))
I tried the data.table function foverlaps:
setkeyv(df1, names(df1))
setkeyv(df2, names(df2))
df3<-foverlaps(df1,df2)
But it seems the two data frames need some column names in common, which is not the case here.
Thank you!

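Loop through the columns and use set operations. The "0" entries mark missing values, so drop them (along with any NA) before comparing:
sapply(df2, function(i){
  ref = as.character(df1$header1)
  x = as.character(i)
  # "0" means no value in df2, so it must not count as a unique entry
  x = x[ !is.na(x) & x != "0" ]
  o = intersect(ref, x)
  u_df1 = setdiff(ref, o)
  u_df2 = setdiff(x, o)
  c(o = length(o),
    u_df1 = length(u_df1),
    u_df2 = length(u_df2))
})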
# sample_x sample_y sample_z
# o 2 1 2
# u_df1 2 3 2
# u_df2 0 2 0
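
For reference, the same idea can be wrapped in a small helper that returns a data frame with the row names asked for in df3. This is only a sketch: count_sets() and its missing argument are names made up for illustration, and the example data is rebuilt as plain character columns rather than factors.

```r
# Example data from the question, rebuilt as character columns (assumption:
# the factor encoding is not needed for the comparison)
df1 <- data.frame(header1 = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
df2 <- data.frame(
  sample_x = c("0", "0", "a", "c"),
  sample_y = c("0", "m", "a", "t"),
  sample_z = c("c", "a", "0", "0"),
  stringsAsFactors = FALSE
)

# count_sets() is a hypothetical helper: it compares one df2 column against
# the reference values, ignoring NA and the "0" placeholder
count_sets <- function(col, ref, missing = "0") {
  x <- unique(col[!is.na(col) & col != missing])
  c(overlap_df1_df2 = length(intersect(ref, x)),
    unique_df1      = length(setdiff(ref, x)),
    unique_df2      = length(setdiff(x, ref)))
}

# sapply() returns a 3 x 3 matrix; the names of the returned vector become row names
df3 <- as.data.frame(sapply(df2, count_sets, ref = df1$header1))
df3
```

Keeping the placeholder as an argument makes it easy to switch to another marker (or to NA only) if the real data encodes missing values differently.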

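A solution using purrr's map functions:
library(purrr)
ref <- as.character(df1$header1)
# drop the "0" placeholders (no value) before comparing
vals <- map(df2, ~ setdiff(as.character(.x), "0"))
rbind(
  overlap = map_dbl(vals, ~ length(intersect(ref, .x))),
  unique_df1 = map_dbl(vals, ~ length(setdiff(ref, .x))),
  # arguments of rbind() cannot refer to each other, so compute this directly
  unique_df2 = map_dbl(vals, ~ length(setdiff(.x, ref)))
)
#            sample_x sample_y sample_z
# overlap           2        1        2
# unique_df1        2        3        2
# unique_df2        0        2        0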

Related

Function for referencing values associated with specific factor values

I have a fairly large list that looks something like this, where the first two variables are stored as factors:
Product Vendor Sales Product sales share
a x 100
b y 200
a y 250
c y 700
a z 150
Ideally, I'd like to create a new column containing each vendor's share of that product's total sales, i.e. Share_{p=a,v=x} = 100/(100+250+150) = 0.2.
I figure lapply() would be viable, but I'm not sure how to write the function.
> dput(list)
list(structure(list(Product = structure(c(1L, 2L, 1L, 3L, 1L), .Label = c("a",
"b", "c"), class = "factor"), Vendor = structure(c(1L, 2L, 2L,
2L, 3L), .Label = c("x", "y", "z"), class = "factor"), Sales = c(100,
200, 250, 700, 150)), class = "data.frame", row.names = c(NA,
-5L)))

Using the dplyr package, you could calculate the total sales for each product, then divide each vendor's sales by that total to get the vendor share.
library(dplyr)
df %>%
  group_by(Product) %>%
  mutate(Total_Sales = sum(Sales),
         Vendor_Share = Sales/Total_Sales)
A base R approach could use prop.table as an alternative:
df$Vendor_Share <- with(df, ave(Sales, Product, FUN = prop.table))
Output
Product Vendor Sales Vendor_Share
1 a x 100 0.2
2 b y 200 1.0
3 a y 250 0.5
4 c y 700 1.0
5 a z 150 0.3
Data
df <- structure(list(Product = structure(c(1L, 2L, 1L, 3L, 1L), .Label = c("a",
"b", "c"), class = "factor"), Vendor = structure(c(1L, 2L, 2L,
2L, 3L), .Label = c("x", "y", "z"), class = "factor"), Sales = c(100,
200, 250, 700, 150), Vendor_Share = c(0.2, 1, 0.5, 1, 0.3)), row.names = c(NA,
-5L), class = "data.frame")
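
To make the arithmetic behind both answers explicit, here is a rough base R sketch that computes the per-product totals first and then looks them up row by row. The data is the question's example rebuilt with character columns; totals is just an illustrative variable name.

```r
# Example data rebuilt from the question (character columns for simplicity)
df <- data.frame(
  Product = c("a", "b", "a", "c", "a"),
  Vendor  = c("x", "y", "y", "y", "z"),
  Sales   = c(100, 200, 250, 700, 150),
  stringsAsFactors = FALSE
)

# Total sales per product, as a vector named by product
totals <- tapply(df$Sales, df$Product, sum)

# Look up each row's product total by name, then divide
df$Vendor_Share <- df$Sales / as.numeric(totals[df$Product])
df
```

If Product is a factor, as in the original data, index with as.character(df$Product) so the lookup goes by name rather than by integer code.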

I need to convert the levels of multiple categorical variable into 0,1

I have five columns with 2 levels each, and their column names are like c(a,b,x,y,z). The command below works for one column at a time, but I need to do it for all five columns at once.
levels(car_data[,"x"]) <- c(0,1)
car_data[,"x"] <- as.numeric(levels(car_data[,"x"]))[car_data[,"x"]]

If each factor has two levels, then we can do
library(dplyr)
car_data %>%
  # funs() is deprecated in current dplyr; across() needs dplyr >= 1.0.0
  mutate(across(everything(), ~ as.integer(.x) - 1L))
# a b c
#1 0 0 0
#2 1 1 1
#3 0 0 0
#4 1 1 1
data
car_data <- structure(list(a = structure(c(1L, 2L, 1L, 2L), .Label = c("a",
"b"), class = "factor"), b = structure(c(1L, 2L, 1L, 2L), .Label = c("a",
"b"), class = "factor"), c = structure(c(1L, 2L, 1L, 2L), .Label = c("a",
"b"), class = "factor")), .Names = c("a", "b", "c"), row.names = c(NA,
-4L), class = "data.frame")
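
If only some columns should be converted (the question mentions five named columns), a base R sketch along these lines may be enough. The data below is made up for illustration, and cols is an assumed vector of the column names to convert.

```r
# Made-up example: two two-level factors plus a column that must stay untouched
car_data <- data.frame(
  a     = factor(c("no", "yes", "no", "yes")),
  x     = factor(c("lo", "hi", "hi", "lo")),
  other = 1:4
)

# Names of the factor columns to recode (assumption: each has exactly two levels)
cols <- c("a", "x")

# as.integer() on a factor returns the level codes 1 and 2; subtract 1 to get 0/1
car_data[cols] <- lapply(car_data[cols], function(f) as.integer(f) - 1L)
car_data
```

Which level becomes 0 depends on the level order (alphabetical by default), so check levels() first if the direction matters.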

merging and counting similar strings

I have a data with three columns like
Inputdf<-structure(list(df1 = structure(c(4L, 5L, 2L, 1L, 3L), .Label = c("P61160,P61158,O15143,O15144,O15145,P59998,O15511",
"P78537,Q6QNY1,Q6QNY0", "Q06323,Q9UL46", "Q92793,Q09472,Q9Y6Q9,Q92831",
"Q92828,Q13227,O15379,O75376,O60907,Q9BZK7"), class = "factor"),
df2 = structure(c(3L, 2L, 5L, 4L, 1L), .Label = c("", "P61158,O15143,O15144",
"Q06323,Q9UL46", "Q6QNY0", "Q92828"), class = "factor"),
df3 = structure(c(5L, 4L, 3L, 2L, 1L), .Label = c("", "O15511",
"Q06323,Q9UL46", "Q6QNY0", "Q92793,Q09472"), class = "factor")), .Names = c("df1",
"df2", "df3"), class = "data.frame", row.names = c(NA, -5L))
I am trying to find similar strings in this data. For example,
the first row of df1 is Q92793,Q09472,Q9Y6Q9,Q92831.
I then look at df2 and df3 to see whether any of these members appear there. For this example, I build the following row:
df1 df2 df3 Numberdf1 df2 df3
1 0 1 4 0 Q92793,Q09472
df1 = 1 means the first row of df1.
df2 = 0 means no row of df2 shares a string with it.
df3 = 1 means the first row of df3 shares strings with row 1 of df1.
Numberdf1 is the count of comma-separated strings in that df1 row, which is 4.
The second df2 column is 0 because no string was shared with df2.
The second df3 column is Q92793,Q09472, the shared strings pasted together.
The desired output looks like this:
out<- structure(list(df1 = 1:5, df2 = c(0L, 3L, 4L, 2L, 1L), df3 = c(1L,
0L, 2L, 4L, 3L), Numberdf1 = c(4L, 6L, 2L, 7L, 2L), df2.1 = structure(c(1L,
5L, 4L, 2L, 3L), .Label = c("0", "P61158,O15143,O15144", "Q06323,Q9UL46",
"Q6QNY0", "Q92828"), class = "factor"), df3.1 = structure(c(5L,
1L, 4L, 2L, 3L), .Label = c("0", "O15511", "Q06323,Q9UL46", "Q6QNY0",
"Q92793,Q09472"), class = "factor")), .Names = c("df1", "df2",
"df3", "Numberdf1", "df2.1", "df3.1"), class = "data.frame", row.names = c(NA,
-5L))
The function below does not work in all cases; for example, use this data as input:
Inputdf1<- structure(list(df1 = structure(c(2L, 3L, 1L), .Label = c("Q06323,Q9UL46",
"Q92793,Q09472,Q9Y6Q9,Q92831", "Q92828,Q13227,O15379,O75376,O60907,Q9BZK7"
), class = "factor"), df2 = structure(1:3, .Label = c("P25788,P25789",
"Q92828, O60907, O75376", "Q9UL46, Q06323"), class = "factor"),
df3 = structure(c(2L, 1L, 3L), .Label = c("Q92831, Q92793, Q09472",
"Q9BZK7, Q92828, O75376, O60907", "Q9UL46, Q06323"), class = "factor")), .Names = c("df1",
"df2", "df3"), class = "data.frame", row.names = c(NA, -3L))

This works for your example:
# First convert factors to strings to lists
Inputdf[] = lapply(Inputdf, as.character)
Inputdf[] = lapply(Inputdf, function(col) sapply(col, function(x) unlist(strsplit(x,','))))
not.empty = function(x) length(x) > 0
out = data.frame()
for (r in 1:nrow(Inputdf)) {
  df2.intersect = lapply(Inputdf$df2, intersect, Inputdf$df1[[r]])
  df3.intersect = lapply(Inputdf$df3, intersect, Inputdf$df1[[r]])
  out[r, 'df1'] = r
  out[r, 'df2'] = Position(not.empty, df2.intersect, nomatch=0)
  out[r, 'df3'] = Position(not.empty, df3.intersect, nomatch=0)
  out[r, 'Numberdf1'] = length(Inputdf$df1[[r]])
  out[r, 'df2.1'] = paste(Find(not.empty, df2.intersect, nomatch=0), collapse=',')
  out[r, 'df3.1'] = paste(Find(not.empty, df3.intersect, nomatch=0), collapse=',')
}
out
# df1 df2 df3 Numberdf1 df2.1 df3.1
# 1 1 0 1 4 0 Q92793,Q09472
# 2 2 3 0 6 Q92828 0
# 3 3 4 2 3 Q6QNY0 Q6QNY0
# 4 4 2 4 7 P61158,O15143,O15144 O15511
# 5 5 1 3 2 Q06323,Q9UL46 Q06323,Q9UL46
Note: Find and Position identify the first match only. If there are potentially multiple matches, use which.
EDIT
Version accounting for multiple matches
Inputdf[] = lapply(Inputdf, as.character)
Inputdf[] = lapply(Inputdf, function(col) sapply(col, function(x) unlist(strsplit(x,',\\s*'))))
not.empty = function(x) length(x) > 0
out = data.frame()
for (r in 1:nrow(Inputdf)) {
  df2.intersect = lapply(Inputdf$df2, intersect, Inputdf$df1[[r]])
  df3.intersect = lapply(Inputdf$df3, intersect, Inputdf$df1[[r]])
  out[r, 'df1'] = r
  out[r, 'df2'] = paste(which(sapply(df2.intersect, not.empty)), collapse=',')
  out[r, 'df3'] = paste(which(sapply(df3.intersect, not.empty)), collapse=',')
  out[r, 'Numberdf1'] = length(Inputdf$df1[[r]])
  out[r, 'df2.1'] = paste(unique(unlist(df2.intersect)), collapse=',')
  out[r, 'df3.1'] = paste(unique(unlist(df3.intersect)), collapse=',')
}
out[out==""] = "0"
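
The core step in both versions is "split each cell on commas, then intersect". A minimal sketch of just that step, using two cells from the question's second input, might look like this. split_ids() is a hypothetical helper, not part of the answer above; it also trims spaces after the commas.

```r
# split_ids() is a hypothetical helper: one cell -> character vector of IDs
split_ids <- function(s) {
  s <- as.character(s)
  if (is.na(s) || s == "") return(character(0))
  trimws(strsplit(s, ",", fixed = TRUE)[[1]])
}

row_df1 <- split_ids("Q92793,Q09472,Q9Y6Q9,Q92831")
row_df3 <- split_ids("Q92831, Q92793, Q09472")

# IDs that occur in both cells
shared <- intersect(row_df1, row_df3)

length(row_df1)               # the Numberdf1 value for this row
paste(shared, collapse = ",") # the shared IDs as one string
```

Splitting with a fixed comma and trimming afterwards behaves the same as the ',\\s*' pattern for this kind of input, and it also removes stray spaces at the start or end of a cell.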

R - individual categorical plot [duplicate]

This question already has answers here:
How to produce a heatmap with ggplot2?
(2 answers)
Closed 7 years ago.
I would simply like to represent a sequence of categorical states with different colours.
This kind of plot is also known as individual sequence plot (TraMineR).
I would like to use ggplot2.
My data simply look like this
> head(dta)
V1 V2 V3 V4 V5 id
1 b a e d c 1
2 d b a e c 2
3 b c a e d 3
4 c b a e d 4
5 b c e a d 5
with the personal id in the last column.
The plot looks like this.
Each letter (state) is represented by a colour. Basically, this plot visualises the successive states of each individual.
Blue is a, Red is b, Purple is c, Yellow is d and Brown is e.
Any idea how I could do this with ggplot2?
dta = structure(list(V1 = structure(c(1L, 3L, 1L, 2L, 1L), .Label = c("b",
"c", "d"), class = "factor"), V2 = structure(c(1L, 2L, 3L, 2L,
3L), .Label = c("a", "b", "c"), class = "factor"), V3 = structure(c(2L,
1L, 1L, 1L, 2L), .Label = c("a", "e"), class = "factor"), V4 = structure(c(2L,
3L, 3L, 3L, 1L), .Label = c("a", "d", "e"), class = "factor"),
V5 = structure(c(1L, 1L, 2L, 2L, 2L), .Label = c("c", "d"
), class = "factor"), id = 1:5), .Names = c("V1", "V2", "V3",
"V4", "V5", "id"), row.names = c(NA, -5L), class = "data.frame")
what I tried so far
nr = nrow(dta3)
nc = ncol(dta3)
# space
m = 0.8
n = 1 # do not touch this one
plot(0, xlim = c(1,nc*n), ylim = c(1, nr), type = 'n', axes = F, ylab = 'individual sequences', xlab = 'Time')
axis(1, at = c(1:nc*m), labels = c(1:nc))
axis(2, at = c(1:nr), labels = c(1:nr) )
for(i in 1:nc){
points(x = rep(i*m,nr) , y = 1:nr, col = dta3[,i], pch = 15)
}
But it is not with ggplot2 and not very satisfying.

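Here you go:
library(reshape2)
library(ggplot2)
# reshape to long format: one row per (id, time point) pair
m_dta <- melt(dta, id.vars = "id")
m_dta
# one tile per (time point, individual), filled by state
p1 <- ggplot(m_dta, aes(x = variable, y = id, fill = value)) +
  geom_tile()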
p1
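
To get the colours listed in the question, something like the sketch below could work. It rebuilds the long format in base R (so all columns share one set of states) and passes a named colour vector to scale_fill_manual(). The names long and state_cols are illustrative, the exact colour shades are a guess, and the plotting part only runs if ggplot2 is installed.

```r
# Example data from the question, as character columns
dta <- data.frame(
  V1 = c("b", "d", "b", "c", "b"),
  V2 = c("a", "b", "c", "b", "c"),
  V3 = c("e", "a", "a", "a", "e"),
  V4 = c("d", "e", "e", "e", "a"),
  V5 = c("c", "c", "d", "d", "d"),
  id = 1:5,
  stringsAsFactors = FALSE
)

# Long format: one row per (id, time point), built column by column
vars <- setdiff(names(dta), "id")
long <- data.frame(
  id       = rep(dta$id, times = length(vars)),
  variable = rep(vars, each = nrow(dta)),
  value    = unlist(dta[vars], use.names = FALSE),
  stringsAsFactors = FALSE
)

# State -> colour mapping as described in the question (shades are assumptions)
state_cols <- c(a = "blue", b = "red", c = "purple", d = "yellow", e = "brown")

if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(long, ggplot2::aes(x = variable, y = factor(id), fill = value)) +
    ggplot2::geom_tile(colour = "white") +
    ggplot2::scale_fill_manual(values = state_cols) +
    ggplot2::labs(x = "Time", y = "individual sequences", fill = "state")
  print(p)
}
```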

Compare first element of a list with another list

I am using R and need a hint to solve my problem:
I have two lists and I want to compare the values of the first row of list "a" with the values of the first row of list "b". If the element exists, I want to write the value of the second row of list "b" into the second row of list "a".
So, here is list "a":
X.WORD FREQ
abase 0
abased 0
abasing 0
abashs 0
here list "b"
V1 V2
arthur 11
abased 29
turtle 9
abash 2
The result should be
X.WORD FREQ
abase 0
abased 29
abasing 0
abashs 0
Thanks for your answers

That's a job for a simple merge in base R:
Res <- merge(a, b, by.x = "X.WORD", by.y = "V1", all.x = TRUE)[, -2]
Res$V2[is.na(Res$V2)] <- 0
Res
# X.WORD V2
# 1 abase 0
# 2 abased 29
# 3 abashs 0
# 4 abasing 0
Data
a <- structure(list(X.WORD = structure(c(1L, 2L, 4L, 3L), .Label = c("abase",
"abased", "abashs", "abasing"), class = "factor"), FREQ = c(0L,
0L, 0L, 0L)), .Names = c("X.WORD", "FREQ"), class = "data.frame", row.names = c(NA,
-4L))
b <- structure(list(V1 = structure(c(3L, 1L, 4L, 2L), .Label = c("abased",
"abash", "arthur", "turtle"), class = "factor"), V2 = c(11L,
29L, 9L, 2L)), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA,
-4L))

Here is one approach.
library(dplyr)
ana <- foo %>%
  left_join(foo2, by = c("X.WORD" = "V1")) %>%
  select(-FREQ) %>%
  rename(FREQ = V2)
ana$FREQ[is.na(ana$FREQ)] <- 0
# X.WORD FREQ
#1 abase 0
#2 abased 29
#3 abasing 0
#4 abashs 0
Data
foo <- structure(list(X.WORD = structure(c(1L, 2L, 4L, 3L), .Label = c("abase",
"abased", "abashs", "abasing"), class = "factor"), FREQ = c(0L,
0L, 0L, 0L)), .Names = c("X.WORD", "FREQ"), class = "data.frame", row.names = c(NA,
-4L))
foo2 <- structure(list(V1 = structure(c(3L, 1L, 4L, 2L), .Label = c("abased",
"abash", "arthur", "turtle"), class = "factor"), V2 = c(11L,
29L, 9L, 2L)), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA,
-4L))
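
Another option, sketched here under the assumption that the words in list "b" are unique, is a plain match() lookup. Unlike merge(), it keeps the original row order of "a". The data is rebuilt from the question as character columns; idx and found are illustrative names.

```r
# Example data from the question, as character columns
a <- data.frame(X.WORD = c("abase", "abased", "abasing", "abashs"),
                FREQ = 0L, stringsAsFactors = FALSE)
b <- data.frame(V1 = c("arthur", "abased", "turtle", "abash"),
                V2 = c(11L, 29L, 9L, 2L), stringsAsFactors = FALSE)

# Position of each word of a in b (NA where the word is absent)
idx <- match(a$X.WORD, b$V1)
found <- !is.na(idx)

# Copy the frequency only for the words that were found
a$FREQ[found] <- b$V2[idx[found]]
a
```

Note that match() compares whole strings, so "abashs" does not pick up the value of "abash".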
