I want to know which values are common among N columns, N-1 columns, N-2 columns etc.
Input
structure(c("a", "b", "c", "d", "e", "f", "a", "z", "d", "b",
"e", "s", "a", "b", "c", "d", "e", "s", "a", "b", "c", "d", "e",
"f"), .Dim = c(6L, 4L), .Dimnames = list(NULL, c("x", "y", "z",
"a")))
Output:
common in all 4 columns: a, b, d, e
common in a maximum of 3 columns: c
common in a maximum of 2 columns: f, s
Reshaping the given matrix from wide to long format (melt() has a method for matrices) and counting by value might be an approach:
library(data.table)
options(datatable.print.class = TRUE)
setDT(melt(dat))[, .N, by = "value"][order(-N)]
value N
<fctr> <int>
1: a 4
2: b 4
3: d 4
4: e 4
5: c 3
6: f 2
7: s 2
8: z 1
However, the code needs to be enhanced to deal with duplicates within a column (dat2 has row 1 duplicated), so that each value is counted at most once per column:
setDT(melt(dat2))[, unique(value), by = Var2][, .N, by = "V1"][order(-N)]
V1 N
<fctr> <int>
1: a 4
2: b 4
3: d 4
4: e 4
5: c 3
6: f 2
7: s 2
8: z 1
or, more concisely:
setDT(melt(dat2))[, unique(value), by = Var2][, .N, by = "V1"][
, toString(sort(V1)), by = N][order(-N)]
N V1
<int> <char>
1: 4 a, b, d, e
2: 3 c
3: 2 f, s
4: 1 z
N denotes the count of columns a value appears in.
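For comparison, the same "unique per column, then count" idea can be sketched in base R without reshaping. This is only an illustrative alternative, not part of the data.table answer; `col_values`, `counts` and `groups` are names made up for this sketch.

```r
dat <- structure(
  c("a", "b", "c", "d", "e", "f", "a", "z", "d", "b", "e", "s",
    "a", "b", "c", "d", "e", "s", "a", "b", "c", "d", "e", "f"),
  .Dim = c(6L, 4L),
  .Dimnames = list(NULL, c("x", "y", "z", "a")))
# same data with row 1 duplicated, to show that duplicates do not inflate counts
dat2 <- dat[c(1, seq_len(nrow(dat))), ]

# distinct values of each column, so a value counts once per column
col_values <- lapply(seq_len(ncol(dat2)), function(j) unique(dat2[, j]))

# number of columns each value appears in
counts <- table(unlist(col_values))

# group the values by that number of columns
groups <- split(names(counts), as.vector(counts))

# print with the largest column count first
groups[order(-as.integer(names(groups)))]
```

Using `lapply()` over column indices rather than `apply()` keeps the result a list even when every column happens to have the same number of distinct values.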
Data
dat <- structure(
c("a", "b", "c", "d", "e", "f", "a", "z", "d", "b", "e", "s",
"a", "b", "c", "d", "e", "s", "a", "b", "c", "d", "e", "f"),
.Dim = c(6L, 4L),
.Dimnames = list(NULL, c("x", "y", "z", "a")))
# second data set with duplicated row 1
dat2 <- dat[c(1, seq_len(nrow(dat))), ]
dat2
x y z a
[1,] "a" "a" "a" "a"
[2,] "a" "a" "a" "a"
[3,] "b" "z" "b" "b"
[4,] "c" "d" "c" "c"
[5,] "d" "b" "d" "d"
[6,] "e" "e" "e" "e"
[7,] "f" "s" "s" "f"
Related
From a single dataset I created two datasets by filtering on the target variable. Now I'd like to compare all the features between them using the chi-square test. The problem is that one of the two datasets is much smaller than the other, so for some features there are values that do not appear in the smaller one, and when I try to apply the chi-square test I get this error: "all arguments must have the same length".
How can I add the missing values to the smaller dataset so that I can use the chi-square test?
Example:
I want to use chi square on the same feature in the two datasets:
chisq.test(table(df1$var1, df2$var1))
but I get the error "all arguments must have the same length" because table(df1$var1) is:
a b c d
2 5 7 18
while table(df2$var1) is:
a b c
8 1 12
so what I would like to do is add the value d to df2 with a count of 0, in order to be able to use the chi-square test.
The table output of df2 can be modified if we convert to a factor with the levels specified:
table(factor(df2$var1, levels = letters[1:4]))
a b c d
8 1 12 0
However, table() with two inputs cross-tabulates them element by element, so both must have the same length. To get one row of counts per dataset instead, we can bind the datasets and then call table():
library(dplyr)
table(bind_rows(df1, df2, .id = 'grp'))
var1
grp a b c d
1 2 5 7 18
2 8 1 12 0
Or in base R
table(data.frame(col1 = rep(1:2, c(nrow(df1), nrow(df2))),
col2 = c(df1$var1, df2$var1)))
col2
col1 a b c d
1 2 5 7 18
2 8 1 12 0
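As a minimal sketch of how the combined table could then be passed to the test (assuming the goal is a test of homogeneity between the two datasets; `lev`, `grp`, `tab` and `res` are names introduced here for illustration):

```r
# same counts as df1 and df2 in the data section
df1 <- data.frame(var1 = rep(c("a", "b", "c", "d"), c(2, 5, 7, 18)),
                  stringsAsFactors = FALSE)
df2 <- data.frame(var1 = rep(c("a", "b", "c"), c(8, 1, 12)),
                  stringsAsFactors = FALSE)

# shared set of levels, so a value missing from one dataset gets a 0 count
lev <- sort(union(df1$var1, df2$var1))

# one row per dataset, one column per level
grp <- rep(c("df1", "df2"), c(nrow(df1), nrow(df2)))
tab <- table(grp = grp, var1 = factor(c(df1$var1, df2$var1), levels = lev))

# chisq.test() accepts the 2 x 4 contingency table directly;
# with expected counts this small R may warn that the approximation is poor
res <- chisq.test(tab)
res
```

If the warning about small expected counts matters for the analysis, `chisq.test(tab, simulate.p.value = TRUE)` or `fisher.test(tab)` are options to consider.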
data
df1 <- structure(list(var1 = c("a", "a", "b", "b", "b", "b", "b", "c",
"c", "c", "c", "c", "c", "c", "d", "d", "d", "d", "d", "d", "d",
"d", "d", "d", "d", "d", "d", "d", "d", "d", "d", "d")), class = "data.frame",
row.names = c(NA,
-32L))
df2 <- structure(list(var1 = c("a", "a", "a", "a", "a", "a", "a",
"a",
"b", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c"
)), class = "data.frame", row.names = c(NA, -21L))
My initial matrix looks like the following (but my matrix is huge)
A NA A A A D D B NA B C NA C
A NA A B B D C A NA A A NA A
D NA D D A A A C NA C C NA C
structure(c("A", "A", "D", NA, NA, NA, "A", "A", "D", "A", "B",
"D", "A", "B", "A", "D", "D", "A", "D", "C", "A", "B", "A", "C",
NA, NA, NA, "B", "A", "C", "C", "A", "C", NA, NA, NA, "C", "A",
"C"), .Dim = c(3L, 13L), .Dimnames = list(NULL, c("V1", "V2",
"V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "V11", "V12",
"V13")))
I want to substitute the NA with the letters surroundings (left and right), if they are the same, that is, I want something like this:
A A A A A D D B B B C C C
A A A B B D C A A A A A A
D D D D A A A C C C C C C
structure(c("A", "A", "D", "A", "A", "D", "A", "A", "D", "A",
"B", "D", "A", "B", "A", "D", "D", "A", "D", "C", "A", "B", "A",
"C", "B", "A", "C", "B", "A", "C", "C", "A", "C", "C", "A", "C",
"C", "A", "C"), .Dim = c(3L, 13L), .Dimnames = list(NULL, c("V1",
"V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "V11",
"V12", "V13")))
So, if both surrounding letters are the same, I would change the NA to the surrounding letter, otherwise, I would keep the NA.
Any ideas?
Thank you very much.
Here is my approach without using additional libraries:
dat <- matrix(c('A',NA,'A','A',NA,'B',
'B',NA,'A','B',NA,'B',
'B',NA,NA,'B','B',NA
),nrow=3,byrow=TRUE)
t(apply(dat,1,function(x){
pos <- which(!is.na(x))
## if the delta of the index of two non-na elements is 2 -> potential match
dif <- which(diff(pos)==2)
## skip rows with no potential match (otherwise NA would be converted to "NA")
if(length(dif)){
x[pos[dif]+1] <- sapply(dif,function(y) ifelse(x[pos[y]]==x[pos[y]+2], x[pos[y]],NA))
}
x
}))
Open questions: how do you want to handle a sequence of NAs, and NAs at the margins?
Here is a version which also handles sequences of NAs:
t(apply(dat,1,function(x){
pos <- which(!is.na(x))
## if the delta of the index of two non-na elements is > 1 -> potential match
dif <- diff(pos)
for(cur in which(dif>1)){
if(x[pos[cur]]==x[pos[cur]+dif[cur]]){
x[(pos[cur]+1):(pos[cur]+dif[cur])] <- x[pos[cur]]
}
}
x
}))
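The same idea can be wrapped in a small named function and checked against the matrix from the question. This is only a sketch: `fill_gaps` is a hypothetical helper name, and it assumes that NAs at the margins and gaps with different neighbours should stay NA.

```r
# fill a run of NAs only when the nearest non-NA values on both sides agree
fill_gaps <- function(x) {
  pos <- which(!is.na(x))
  if (length(pos) < 2) return(x)          # nothing to compare against
  for (i in seq_len(length(pos) - 1)) {
    left  <- pos[i]
    right <- pos[i + 1]
    # a gap exists if the two non-NA positions are not adjacent
    if (right - left > 1 && x[left] == x[right]) {
      x[(left + 1):(right - 1)] <- x[left]
    }
  }
  x
}

# input matrix from the question, entered row by row
mat <- matrix(c(
  "A", NA, "A", "A", "A", "D", "D", "B", NA, "B", "C", NA, "C",
  "A", NA, "A", "B", "B", "D", "C", "A", NA, "A", "A", NA, "A",
  "D", NA, "D", "D", "A", "A", "A", "C", NA, "C", "C", NA, "C"),
  nrow = 3, byrow = TRUE)

res <- t(apply(mat, 1, fill_gaps))
res
```

A call such as `fill_gaps(c(NA, "A", NA, "B", NA))` returns its input unchanged, because no NA there has matching neighbours on both sides.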
I'm not sure if there is an elegant and simple way. Assuming your matrix is named mat, you could use
library(tidyr)
library(dplyr)
library(zoo)
mat %>%
as.data.frame(stringsAsFactors = FALSE) %>%
mutate(id = row_number()) %>%
pivot_longer(cols=-id) %>%
group_by(id) %>%
mutate(value = ifelse(is.na(value) & (na.locf(value) == na.locf(value, fromLast = TRUE)), na.locf(value), value)) %>%
ungroup() %>%
pivot_wider() %>%
select(-id) %>%
as.matrix()
which returns
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
[1,] "A" "A" "A" "A" "A" "D" "D" "B" "B" "B" "C" "C" "C"
[2,] "A" "A" "A" "B" "B" "D" NA "A" "A" "A" "A" "A" "A"
[3,] "D" "D" "D" "D" "A" "A" "A" "C" "C" "C" "C" "C" "C"
Note: I added an NA-value in mat[2,7] for the case of unequal surroundings.
Data
mat <- structure(c("A", "A", "D", NA, NA, NA, "A", "A", "D", "A", "B",
"D", "A", "B", "A", "D", "D", "A", "D", NA, "A", "B", "A", "C",
NA, NA, NA, "B", "A", "C", "C", "A", "C", NA, NA, NA, "C", "A",
"C"), .Dim = c(3L, 13L))
The following code:
df <- data.frame(
"letter" = c("a", "b", "c", "d", "e", "f"),
"score" = seq(1,6)
)
Results in the following dataframe:
letter score
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 f 6
I want to get the scores for a sequence of letters, for example the scores of c("f", "a", "d", "e"). It should result in c(6, 1, 4, 5).
What's more, I want to get the scores for c("c", "o", "f", "f", "e", "e"). Now the o is not in the letter column so it should return NA, resulting in c(3, NA, 6, 6, 5, 5).
What is the best way to achieve this? Can I use dplyr for this?
We can use match to create an index and extract the corresponding 'score'. If there is no match, it returns NA by default:
df$score[match(v1, df$letter)]
#[1] 3 NA 6 6 5 5
df$score[match(v2, df$letter)]
#[1] 6 1 4 5
data
v1 <- c("c", "o", "f", "f", "e", "e")
v2 <- c("f", "a", "d", "e")
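A closely related base R idiom is a named lookup vector, sketched below under the assumption that the letters in `df` are unique. `lookup` and `scores_for` are names introduced only for this illustration.

```r
df <- data.frame(
  letter = c("a", "b", "c", "d", "e", "f"),
  score = 1:6,
  stringsAsFactors = FALSE
)

# named vector: names are the letters, values are the scores
lookup <- setNames(df$score, df$letter)

# indexing by name yields NA for names that are not present
scores_for <- function(letters_wanted) unname(lookup[letters_wanted])

scores_for(c("c", "o", "f", "f", "e", "e"))
scores_for(c("f", "a", "d", "e"))
```

Like `match()`, this keeps the order and the repeats of the requested letters.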
If you want to use dplyr I would use a join:
df <- data.frame(
"letter" = c("a", "b", "c", "d", "e", "f"),
"score" = seq(1, 6)
)
library(dplyr)
df2 <- data.frame(letter = c("c", "o", "f", "f", "e", "e"))
left_join(df2, df, by = "letter")
letter score
1 c 3
2 o NA
3 f 6
4 f 6
5 e 5
6 e 5
I have a (seemingly) simple problem but have not yet been able to find a suitably fast and resource-efficient solution. This is a problem in R.
My data is of format:
INPUT
col1 col2
A q
C w
B e
A r
A t
A y
C q
B w
C e
C r
B t
C y
DESIRED OUTPUT
unit1 unit2 same_col2_freq
A B 1
A C 3
B A 1
B C 2
C A 3
C B 2
That is, in the input A occurs in col1 with q, r, t, y in col2. Of these, only t also occurs with B, so the A-B combination has count 1.
B occurs in col1 with e, w, t in col2. Of these, e and w also occur with C, so the B-C combination has count 2.
.... and so on for all combinations in col1.
I have done it using a for loop but it is slow. I am picking unique elements from col1 and then, all the data is iterated for each element of col1. Then I am combining the results using rbind. This is slow and resource costly.
I am looking for an efficient method. Maybe a library, function etc. exists that I am unaware of. I tried using co-occurrence matrix but the number of elements in col1 is of order of ~10,000 and it does not solve my purpose.
Any help is greatly appreciated.
Thanks!
Use merge to join the dataframe with itself and then use aggregate to count within groups. demo:
d = data.frame(col1=c("A", "C", "B", "A", "A", "A", "C", "B", "C", "C", "B", "C"), col2=c("q", "w", "e", "r", "t", "y", "q", "w", "e", "r", "t", "y"))
dm = merge(d, d, by="col2")
dm = dm[dm[,'col1.x']!=dm[,'col1.y'],]
aggregate(col2 ~ col1.x + col1.y, data=dm, length)
# col1.x col1.y col2
# 1 B A 1
# 2 C A 3
# 3 A B 1
# 4 C B 2
# 5 A C 3
# 6 B C 2
Here is a similar approach (as shown by #cogitovita), but using data.table. Convert the "data.frame" to a "data.table" using setDT, then cross join (CJ) the unique elements of "col1", grouped by "col2". Keep the rows where the two new columns differ (V1!=V2), get the count (.N) grouped by those columns (.(V1, V2)), and finally order the rows (order(V1,V2)).
library(data.table)
setDT(df)[,CJ(unique(col1), unique(col1)), col2][V1!=V2,
.N, .(V1,V2)][order(V1,V2)]
# V1 V2 N
#1: A B 1
#2: A C 3
#3: B A 1
#4: B C 2
#5: C A 3
#6: C B 2
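For a small number of units, the same counts can also be read off a cross product of the unit-by-value incidence matrix. This is a base R sketch only; the question notes that a co-occurrence matrix was not practical with around 10,000 units, so treat it as a cross-check rather than a scalable solution. `inc`, `co`, `pairs` and `res` are names made up here.

```r
d <- data.frame(
  col1 = c("A", "C", "B", "A", "A", "A", "C", "B", "C", "C", "B", "C"),
  col2 = c("q", "w", "e", "r", "t", "y", "q", "w", "e", "r", "t", "y"),
  stringsAsFactors = FALSE
)

# 0/1 incidence matrix: rows are col2 values, columns are col1 units
inc <- 1 * (unclass(table(d$col2, d$col1)) > 0)

# entry [i, j] counts the col2 values shared by units i and j
co <- crossprod(inc)

# long format, dropping the diagonal (a unit paired with itself)
pairs <- as.data.frame(as.table(co), stringsAsFactors = FALSE)
names(pairs) <- c("unit1", "unit2", "same_col2_freq")
pairs <- pairs[pairs$unit1 != pairs$unit2, ]
res <- pairs[order(pairs$unit1, pairs$unit2), ]
res
```

The dense `co` matrix has one cell per pair of units, which is what makes this approach memory-hungry as the number of units grows.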
data
df <- structure(list(col1 = c("A", "C", "B", "A", "A", "A", "C", "B",
"C", "C", "B", "C"), col2 = c("q", "w", "e", "r", "t", "y", "q",
"w", "e", "r", "t", "y")), .Names = c("col1", "col2"), class =
"data.frame", row.names = c(NA, -12L))
I have following data:
ddf2 = structure(list(col1 = c(3, 3, 2, 1, 1, 1, 3, 2, 1, 1, 3, 1, 1,
2, 1, 1, 1, 2, 3, 1, 1, 3, 2, 3, 3), col2 = c("c", "c", "b",
"b", "b", "a", "b", "c", "b", "b", "c", "c", "b", "b", "a", "c",
"c", "b", "a", "b", "b", "c", "a", "c", "a"), col3 = c("C", "E",
"E", "B", "D", "E", "C", "C", "E", "E", "C", "A", "D", "D", "C",
"E", "A", "A", "A", "D", "A", "A", "B", "A", "E")), .Names = c("col1",
"col2", "col3"), row.names = c(NA, 25L), class = "data.frame")
head(ddf2)
col1 col2 col3
1 3 c C
2 3 c E
3 2 b E
4 1 b B
5 1 b D
6 1 a E
For every combination of col1 and col2, there may be many values of col3:
with(ddf2, ddf2[col1==1 & col2=='b',])
col1 col2 col3
4 1 b B
5 1 b D
9 1 b E
10 1 b E
13 1 b D
20 1 b D
21 1 b A
with(ddf2, table(col1, col2))
col2
col1 a b c
1 2 7 3
2 1 3 1
3 2 1 5
I want to create a table/matrix of col1 and col2 as above but each cell should have a list of unique col3 entries for that set of col1 and col2. I expect following output:
col2
col1 a b c
1 E,C A,B,D,E A,E
2 B A,D,E C
3 A,E C A,C,E
I tried following but it does not work:
with(ddf2, tapply(col3, list(col1,col2), c))
a b c
1 Character,2 Character,7 Character,3
2 "B" Character,3 "C"
3 Character,2 "C" Character,5
How can this be done? Thanks for your help.
One option:
d <- with(ddf2, aggregate(col3 ~ col2 + col1, FUN = function(x) paste0(unique(x))))
library(reshape2)
dcast(d, col1 ~ col2, value.var = "col3")
# col1 a b c
#1 1 E, C B, D, E, A A, E
#2 2 B E, D, A C
#3 3 A, E C C, E, A
Most likely it's possible to do both steps in one, but I'll generously leave it to someone else to figure this out ;)
Or
library(dplyr)
library(tidyr)
ddf2 %>%
group_by(col1, col2) %>%
summarise(col3 = paste(unique(col3), collapse = ", ")) %>%
spread(col2, col3)
#Source: local data frame [3 x 4]
#
# col1 a b c
#1 1 E, C B, D, E, A A, E
#2 2 B E, D, A C
#3 3 A, E C C, E, A
Edit after comment:
Just tested with tapply and this seems to work (the problem was apparently in calling c()):
with(ddf2, tapply(col3, list(col1,col2), FUN = function(x) paste(unique(x), collapse = ", ")))
# a b c
#1 "E, C" "B, D, E, A" "A, E"
#2 "B" "E, D, A" "C"
#3 "A, E" "C" "C, E, A"
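If the entries within each cell should also be sorted, as in the expected output of the question, a small variation of the tapply call could look like the sketch below. It uses `toString()` and `sort()`, which are a swap-in for the `paste(..., collapse = ", ")` call above; `res` is a name introduced here.

```r
ddf2 <- data.frame(
  col1 = c(3, 3, 2, 1, 1, 1, 3, 2, 1, 1, 3, 1, 1,
           2, 1, 1, 1, 2, 3, 1, 1, 3, 2, 3, 3),
  col2 = c("c", "c", "b", "b", "b", "a", "b", "c", "b", "b", "c", "c", "b",
           "b", "a", "c", "c", "b", "a", "b", "b", "c", "a", "c", "a"),
  col3 = c("C", "E", "E", "B", "D", "E", "C", "C", "E", "E", "C", "A", "D",
           "D", "C", "E", "A", "A", "A", "D", "A", "A", "B", "A", "E"),
  stringsAsFactors = FALSE
)

# one string per (col1, col2) cell: distinct col3 values, sorted, comma-separated
res <- with(ddf2, tapply(col3, list(col1, col2),
                         FUN = function(x) toString(sort(unique(x)))))
res
```

The result is a character matrix with col1 values as row names and col2 values as column names, for example `res["1", "b"]` holds the string for col1 == 1 and col2 == "b".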