I have a sequence which looks like this
SEQENCE
1 A
2 B
3 B
4 C
5 A
Now from this sequence, I want to get the matrix like this where i the row and jth column element denotes how many times movement occurred from ith row node to jth column node
A B C
A 0 1 0
B 0 1 1
C 1 0 0
How Can I get this in R
1) Use table like this:
s <- DF[, 1]
table(tail(s, -1), head(s, -1))
giving:
A B C
A 0 0 1
B 1 1 0
C 0 1 0
2) or like this. Since embed does not work with factors we convert the factor to character,
s <- as.character(DF[, 1])
do.call(table, data.frame(embed(s, 2)))
giving:
X2
X1 A B C
A 0 0 1
B 1 1 0
C 0 1 0
3) xtabs also works:
s <- as.character(DF[, 1])
xtabs(data = data.frame(embed(s, 2)))
giving:
X2
X1 A B C
A 0 0 1
B 1 1 0
C 0 1 0
Note: The input DF in reproducible form is:
Lines <- " SEQENCE
1 A
2 B
3 B
4 C
5 A"
DF <- read.table(text = Lines, header = TRUE)
Related
I have for example a datset like this:
data <- data.frame(matrix(c(1,2,2,3,4,5,5,"a","a","b","a","a","a","b"), nrow = 7, ncol = 2, byrow = F))
X1 X2
1 a
2 a
2 b
3 a
4 a
5 a
5 b
then I add another variable "tag", initially set to 0.
data$tag <- 0
X1 X2 tag
1 a 0
2 a 0
2 b 0
3 a 0
4 a 0
5 a 0
5 b 0
I'd like to have "tag" equal to 1 for each row that is repeated, like:
X1 X2 tag
1 a 0
2 a 1
2 b 1
3 a 0
4 a 0
5 a 1
5 b 1
I used the followed code:
for (i in data$X1) {
for (j in 1:length(data$X1)) {
if (j==2) {data$tag[j] <- 1}
}
}
but it doesn't work like I would like to. I'd like the second loop (j) to work inside the previous one in order to obtain what I want, where j starts from 1 every time X1 changes.
How can I manage it?
Thanks a lot
Maybe you can try ave
within(
data,
tag <- +(ave(X1, X1, FUN = length) > 1)
)
which gives
X1 X2 tag
1 1 a 0
2 2 a 1
3 2 b 1
4 3 a 0
5 4 a 0
6 5 a 1
7 5 b 1
You can use duplicated from both the ends in base R :
data$tag <- as.integer(duplicated(data$X1) |
duplicated(data$X1, fromLast = TRUE))
data
# X1 X2 tag
#1 1 a 0
#2 2 a 1
#3 2 b 1
#4 3 a 0
#5 4 a 0
#6 5 a 1
#7 5 b 1
An option with add_count
library(dplyr)
data %>%
add_count(X1) %>%
mutate(n = +(n > 1))
I have the following data frame in R:
Row number A B C D E F G H I J
1 1 1 0 0 1 0 0 1 1
2 1 0 0 0 1 0 0 1
3 1 0 0 0 1 0 0 1 1
I am trying to calculate the number of times the number changes between 1 and 0 excluding the Nulls
The result I am expecting is this
Row Number No of changes
---------- --------------
1 4
2 4
3 4
An explanation for row 1
In row 1, A has a null so we exclude that.
B and C have 1 which is our first set of values.
D and E have 0 which is our second set of values. Now Change = 1
F has our third set of values which is 1. Now Change = 1+1
G and H have 0 which is our third set of values. Now Change = 1+1+1
I and J have 1 which is our fourth set of values. Now Change = 1+1+1+1 =4
Here's a tidyverse approach.
I gather into longer format (from tidyr::pivot_longer), then add a helper column noting when we have a change from 0 to 1 or from 1 to 0, and then sum those by row.
library(tidyverse)
df %>%
# before tidyr 1.0, this would be gather(col, value, -1)
pivot_longer(-1, "col") %>%
group_by(Row.number) %>%
mutate(chg = value == 1 & lag(value) == 0 |
value == 0 & lag(value) == 1) %>%
summarize(no_chgs = sum(chg, na.rm = T))
# A tibble: 3 x 2
Row.number no_chgs
<int> <int>
1 1 4
2 2 4
3 3 4
Sample data:
df <- read.table(
header = T,
stringsAsFactors = F,
text = "'Row number' A B C D E F G H I J
1 NA 1 1 0 0 1 0 0 1 1
2 NA NA 1 0 0 0 1 0 0 1
3 NA 1 0 0 0 1 0 0 1 1")
Here's a data.table solution:
library(data.table)
dt <- as.data.table(df)
dt[,
no_change := max(rleid(na.omit(t(.SD)))) - 1,
by = RowNumber
]
dt
Alternatively, here's a base version:
apply(df[, -1],
1,
function(x) {
complete_case = complete.cases(x)
if (sum(complete_case) > 0) {
return(length(rle(x[complete_case])$lengths) - 1)
} else {
return (0)
}
}
)
I want to identify (not eliminate) duplicates in a data frame and add 0/1 variable accordingly (wether a row is a duplicate or not), using the R dplyr package.
Example:
| A B C D
1 | 1 0 1 1
2 | 1 0 1 1
3 | 0 1 1 1
4 | 0 1 1 1
5 | 1 1 1 1
Clearly, row 1 and 2 are duplicates, so I want to create a new variable (with mutate?), say E, that is equal to 1 in row 1,2,3 and 4 since row 3 and 4 are also identical.
Moreover, I want to add another variable, F, that is equal to 1 if there is a duplicate differing only by one column. That is, F in row 1,2 and 5 would be equal to 1 since they only differ in the B column.
I hope it is clear what I want to do and I hope that dplyr offers a smooth solution to this problem. This is of course possible in "base" R but I believe (hope) that there exists a smoother solution.
You can use dist() to compute the differences, and then a search in the resulting distance object can give the needed answers (E, F, etc.). Here is an example code, where X is the original data.frame:
W=as.matrix(dist(X, method="manhattan"))
X$E = as.integer(sapply(1:ncol(W), function(i,D){any(W[-i,i]==D)}, D=0))
X$F = as.integer(sapply(1:ncol(W), function(i,D){any(W[-i,i]==D)}, D=1))
Just change D= for the number of different columns needed.
It's all base R though. Using plyr::laply instead of sappy has same effect. dplyr looks overkill here.
Here is a data.table solution that is extendable to an arbitrary case (1..n columns the same)- not sure if someone can convert to dpylr for you. I had to change your dataset a bit to show your desired F column - in your example all rows would get a 1 because 3 and 4 are one column different from 5 as well.
library(data.table)
DT <- data.frame(A = c(1,1,0,0,1), B = c(0,0,1,1,1), C = c(1,1,1,1,1), D = c(1,1,1,1,1), E = c(1,1,0,0,0))
DT
A B C D E
1 1 0 1 1 1
2 1 0 1 1 1
3 0 1 1 1 0
4 0 1 1 1 0
5 1 1 1 1 0
setDT(DT)
DT_ncols <- length(DT)
base <- data.table(t(combn(1:nrow(DT), 2)))
setnames(base, c("V1","V2"),c("ind_x","ind_y"))
DT[, ind := .I)]
DT_melt <- melt(DT, id.var = "ind", variable.name = "column")
base <- merge(base, DT_melt, by.x = "ind_x", by.y = "ind", allow.cartesian = TRUE)
base <- merge(base, DT_melt, by.x = c("ind_y", "column"), by.y = c("ind", "column"))
base <- base[, .(common_cols = sum(value.x == value.y)), by = .(ind_x, ind_y)]
This gives us a data.frame that looks like this:
base
ind_x ind_y common_cols
1: 1 2 5
2: 1 3 2
3: 2 3 2
4: 1 4 2
5: 2 4 2
6: 3 4 5
7: 1 5 3
8: 2 5 3
9: 3 5 4
10: 4 5 4
This says that rows 1 and 2 have 5 common columns (duplicates). Rows 3 and 5 have 4 common columns, and 4 and 5 have 4 common columns. We can now use a fairly extendable format to flag any combination we want:
base <- melt(base, id.vars = "common_cols")
# Unique - common_cols == DT_ncols
DT[, F := ifelse(ind %in% unique(base[common_cols == DT_ncols, value]), 1, 0)]
# Same save 1 - common_cols == DT_ncols - 1
DT[, G := ifelse(ind %in% unique(base[common_cols == DT_ncols - 1, value]), 1, 0)]
# Same save 2 - common_cols == DT_ncols - 2
DT[, H := ifelse(ind %in% unique(base[common_cols == DT_ncols - 2, value]), 1, 0)]
This gives:
A B C D E ind F G H
1: 1 0 1 1 1 1 1 0 1
2: 1 0 1 1 1 2 1 0 1
3: 0 1 1 1 0 3 1 1 0
4: 0 1 1 1 0 4 1 1 0
5: 1 1 1 1 0 5 0 1 1
Instead of manually selecting, you can append all combinations like so:
# run after base <- melt(base, id.vars = "common_cols")
base <- unique(base[,.(ind = value, common_cols)])
base[, common_cols := factor(common_cols, 1:DT_ncols)]
merge(DT, dcast(base, ind ~ common_cols, fun.aggregate = length, drop = FALSE), by = "ind")
ind A B C D E 1 2 3 4 5
1: 1 1 0 1 1 1 0 1 1 0 1
2: 2 1 0 1 1 1 0 1 1 0 1
3: 3 0 1 1 1 0 0 1 0 1 1
4: 4 0 1 1 1 0 0 1 0 1 1
5: 5 1 1 1 1 0 0 0 1 1 0
Here is a dplyr solution:
test%>%mutate(flag = (A==lag(A)&
B==lag(B)&
C==lag(C)&
D==lag(D)))%>%
mutate(twice = lead(flag)==T)%>%
mutate(E = ifelse(flag == T | twice ==T,1,0))%>%
mutate(E = ifelse(is.na(E),0,1))%>%
mutate(FF = ifelse( ( (A +lag(A)) + (B +lag(B)) + (C+lag(C)) + (D + lag(D))) == 7,1,0))%>%
mutate(FF = ifelse(is.na(FF)| FF == 0,0,1))%>%
select(A,B,C,D,E,FF)
Result:
A B C D E FF
1 1 0 1 1 1 0
2 1 0 1 1 1 0
3 0 1 1 1 1 0
4 0 1 1 1 1 0
5 1 1 1 1 0 1
I have a data.frame that looks like this:
> DF1
A B C D E
a x c h p
c d q t w
s e r p a
w l t s i
p i y a f
I would like to compare each column of my data.frame with the remaining columns in order to count the number of common elements. For example, I would like to compare column A with all the remaining columns (B, C, D, E) and count the common entities in this way:
A versus the remaining:
A vs B: 0 (because they have 0 common elements)
A vs C: 1 (c in common)
A vs D: 2 (p and s in common)
A vs E: 3 (p,w,a, in common)
Then the same: B versus C,D,E columns and so on.
How can I implement this?
We can loop through the column names and compare with the other columns, by taking the intersect and get the length
sapply(names(DF1), function(x) {
x1 <- lengths(Map(intersect, DF1[setdiff(names(DF1), x)], DF1[x]))
c(x1, setNames(0, setdiff(names(DF1), names(x1))))[names(DF1)]})
# A B C D E
#A 0 0 1 3 3
#B 0 0 0 0 1
#C 1 0 0 1 0
#D 3 0 1 0 2
#E 3 1 0 2 0
Or this can be done more compactly by taking the cross product after getting the frequency of the long formatted (melt) dataset
library(reshape2)
tcrossprod(table(melt(as.matrix(DF1))[-1])) * !diag(5)
# Var2
#Var2 A B C D E
# A 0 0 1 3 3
# B 0 0 0 0 1
# C 1 0 0 1 0
# D 3 0 1 0 2
# E 3 1 0 2 0
NOTE: The crossprod part is also implemented with RcppEigen here which would make this faster
An alternative is to use combn twice, once to get the column combinations and next to find the lengths of the element intersections.
cbind.data.frame returns a data.frame and setNames is used to add column names.
setNames(cbind.data.frame(t(combn(names(df), 2)),
combn(names(df), 2, function(x) length(intersect(df[, x[1]], df[, x[2]])))),
c("col1", "col2", "count"))
col1 col2 count
1 A B 0
2 A C 1
3 A D 3
4 A E 3
5 B C 0
6 B D 0
7 B E 1
8 C D 1
9 C E 0
10 D E 2
I have a dataframe that looks like this:
step var1 score1 score2
1 a 0 0
2 b 1 1
3 d 1 1
4 e 0 0
5 g 0 0
1 b 1 1
2 a 1 0
3 d 1 0
4 e 0 1
5 f 1 1
1 g 0 1
2 d 1 1
etc.
Because I need to correlate variabeles a-g (their scores are in score1) with score2 in only step 5 I think i need to schange my dataset into this first:
a b c d e f g score2_step5
0 1 1 0 0 0
1 1 1 0 1 1
1 0
etc.
I am pretty sure that the Reshape package should be able to help me to do the job, but I haven't been able to make it work yet.
Can anyone help me? Many thanks in advance!
Here's another version. In case there is no step = 5, the value for score2_step = 0. Assuming your data.frame is df:
require(reshape2)
out <- do.call(rbind, lapply(seq(1, nrow(df), by=5), function(ix) {
iy <- min(ix+4, nrow(df))
df.b <- df[ix:iy, ]
tt <- dcast(df.b, 1 ~ var1, fill = 0, value.var = "score1", drop=F)
tt$score2_step5 <- 0
if (any(df.b$step == 5)) {
tt$score2_step5 <- df.b$score2[df.b$step == 5]
}
tt[,-1]
}))
> out
a b d e f g score2_step5
2 0 1 1 0 0 0 0
21 1 1 1 0 1 0 1
22 0 0 1 0 0 0 0
It looks like you want 7 correlations between the variables a-g and score2_step5--is that correct? First, you're going to need another variable. I'm assuming that step repeats continuously from 1 to 5; if not, this is going to be more complicated. I'm assuming your data is called df. I also prefer the newer reshape2 package, so I'm using that.
df$block <- rep(1:(nrow(df)/5),each=5)
df.molten <- melt(df,id.vars=c("var1", "step", "block"),measure.vars=c("score1"))
df2 <- dcast(df.molten, block ~ var1)
score2_step5 <- df$score2[df$step==5]
and then finally
cor(df2, score2_step5, use='pairwise')
There's an extra column (block) in df2 that you can get rid of or just ignore.
I added another row to your example data because my code doesn't work unless there is a step-5 observation in every block.
dat <- read.table(textConnection("
step var1 score1 score2
1 a 0 0
2 b 1 1
3 d 1 1
4 e 0 0
5 g 0 0
1 b 1 1
2 a 1 0
3 d 1 0
4 e 0 1
5 f 1 1
1 g 0 1
2 d 1 1
5 a 1 0"),header=TRUE)
Like #JonathanChristensen, I made another variable (I called it rep instead of block), and I made var1 into a factor (since there are no c values in the example data set given and I wanted a placeholder).
dat <- transform(dat,var1=factor(var1,levels=letters[1:7]),
rep=cumsum(step==1))
tapply makes the table of score1 values:
tab <- with(dat,tapply(score1,list(rep,var1),identity))
add the score2, step-5 values:
data.frame(tab,subset(dat,step==5,select=score2))