I have a data set from which I would like to remove the rows that hold duplicate information across several columns.
foo <- data.frame(g1 = c("1","0","0","1","1"),
                  v1 = c("7","5","4","4","3"),
                  v2 = c("a","b","x","x","e"),
                  y1 = c("y","c","f","f","w"),
                  y2 = c("y","y","y","f","c"),
                  y3 = c("y","c","c","f","w"),
                  y4 = c("y","y","f","f","c"),
                  y5 = c("y","w","f","f","w"),
                  y6 = c("y","c","f","f","w"))
foo then looks like:
g1 v1 v2 y1 y2 y3 y4 y5 y6
1 1 7 a y y y y y y
2 0 5 b c y c y w c
3 0 4 x f y c f f f
4 1 4 x f f f f f f
5 1 3 e w c w c w w
Now I want to remove any row whose values are all identical across the y1 to y6 columns. Done properly, only rows 1 and 4 would be removed, because all of their y variables are exactly the same. It's a multiple-column condition.
I believe I am close, but it's just not working correctly.
I have tried: new = foo[!(duplicated(foo[,1:6]))]
thinking that the duplicated command would search for and find only the rows that matched exactly?
I thought about using a conditional statement with &, but can't figure out how to do that either.
new = foo[foo$y1==foo$y2|foo$y3|foo$y4|foo$y5|foo$y6]
I thought about which, but I'm now overwhelmed and lost. I would expect foo to look like:
g1 v1 v2 y1 y2 y3 y4 y5 y6
2 0 5 b c y c y w c
3 0 4 x f y c f f f
5 1 3 e w c w c w w
> foo[apply(foo[ , paste("y", 1:6, sep = "")], 1,
FUN = function(x) length(unique(x)) > 1 ), ]
g1 v1 v2 y1 y2 y3 y4 y5 y6
2 0 5 b c y c y w c
3 0 4 x f y c f f f
5 1 3 e w c w c w w
# compare only the y columns; on the full foo every row differs somewhere
foo[apply(foo[, paste0("y", 1:6)], 1, function(x) any(x != x[1])), ]
> ys <- foo[paste0("y", 1:6)]   # keep only the y columns
> ys[ !rowSums( apply( ys[2:6], 2, "!=", ys[[1]] ) )==0, ]
y1 y2 y3 y4 y5 y6
2 c y c y w c
3 f y c f f f
5 w c w c w w
> ys <- foo[paste0("y", 1:6)]   # keep only the y columns
> ys[ ! colSums( apply( ys, 1, duplicated ) ) == 5, ]
y1 y2 y3 y4 y5 y6
2 c y c y w c
3 f y c f f f
5 w c w c w w
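For comparison, here is a minimal vectorized sketch of the same idea, assuming the y columns are named y1 to y6 as in the question. It avoids apply by comparing the whole character matrix against its first column; the names `m`, `all_same`, and `new` are only illustrative.

```r
# Data from the question.
foo <- data.frame(g1 = c("1","0","0","1","1"),
                  v1 = c("7","5","4","4","3"),
                  v2 = c("a","b","x","x","e"),
                  y1 = c("y","c","f","f","w"),
                  y2 = c("y","y","y","f","c"),
                  y3 = c("y","c","c","f","w"),
                  y4 = c("y","y","f","f","c"),
                  y5 = c("y","w","f","f","w"),
                  y6 = c("y","c","f","f","w"),
                  stringsAsFactors = FALSE)

# Character matrix of the y columns only.
m <- as.matrix(foo[paste0("y", 1:6)])

# m[, 1] is recycled down each column, so every cell is compared with the
# y1 value of its own row; a row with zero mismatches is constant.
all_same <- rowSums(m != m[, 1]) == 0

# Keep the rows that are not constant across y1..y6 (all columns retained).
new <- foo[!all_same, ]
new
```

Note that duplicated(foo[, 1:6]) answers a different question: it flags rows that repeat an earlier row, not rows whose values are equal across columns.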
I used rbind to join 2 dataframes, with a column denoting its source, resulting in
from | to | source
1 A B X
2 C D Y
3 B A Y
...
I would like to look for overlapping pairs, regardless of "order", combine those pairs, then edit the source column to something else, e.g. "Z".
In the above example, rows 1 and 3 would be flagged as overlapping, so they will be combined and modified.
So the desired output would look something like
from | to | source
1 A B Z
2 C D Y
...
How can this be done?
You can try the code below:
unique(
  transform(
    transform(
      df,
      from = pmin(from, to),
      to = pmax(from, to)
    ),
    source = ave(source, from, to, FUN = function(x) ifelse(length(x) > 1, "Z", x))
  )
)
which gives
from to source
1 A B Z
2 C D Y
Example
set.seed(1)
df <- data.frame(
  "from"   = sample(LETTERS[1:4], 10, replace = TRUE),
  "to"     = sample(LETTERS[1:4], 10, replace = TRUE),
  "source" = sample(c("X", "Y"), 10, replace = TRUE)
)
from to source
1 A C X
2 D C X
3 C A X
4 A A X
5 B A X
6 A B X
7 C B Y
8 C B X
9 B B X
10 B C Y
and then
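# sort each (from, to) pair so that A-B and B-A become the same key
tmp <- t(
  apply(df, 1, function(x) {
    sort(x[1:2])
  })
)
t1 <- duplicated(tmp, fromLast = FALSE)  # TRUE for every repeat after the first
t2 <- duplicated(tmp, fromLast = TRUE)   # TRUE for every row with a later repeat
df[t2, "source"] <- "Z"
df[!t1, ]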
from to source
1 A C Z
2 D C X
4 A A X
5 B A Z
7 C B Z
9 B B X
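Both answers above rely on the same trick: build a key for each pair that does not depend on order, then look for repeated keys. A minimal sketch on the three rows shown in the question follows, assuming from, to, and source are character columns; `key`, `dup`, and `res` are illustrative names, not part of either answer.

```r
# The three example rows from the question, as character columns.
df <- data.frame(from   = c("A", "C", "B"),
                 to     = c("B", "D", "A"),
                 source = c("X", "Y", "Y"),
                 stringsAsFactors = FALSE)

# Order-independent key: smaller label first, larger label second.
key <- paste(pmin(df$from, df$to), pmax(df$from, df$to))

# TRUE for every row whose pair occurs more than once, in either order.
dup <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Relabel the overlapping pairs, then keep the first row of each pair.
df$source[dup] <- "Z"
res <- df[!duplicated(key), ]
res
```

If source were a factor, "Z" would first have to be added to its levels, otherwise the assignment would produce NA.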
# Construct a design matrix
# model: y = y0 + y1*t + y2*cos(t) + y3 + y4 + y5
# unknowns are y0, y1, y2, y3, y4, y5
# where y3, y4, y5 correspond to three instruments A, B, C
# Data
dat <- list("A" = data.frame(t = c(1, 5), var = "A"),
            "B" = data.frame(t = c(7, 4, 1), var = "B"),
            "C" = data.frame(t = c(12, 4, 3, 9), var = "C"))
design_mat <- lapply(dat, function(x) {
  matrix(c(rep(1, nrow(x)), x$t, cos(x$t),
           rep(1, nrow(x)), rep(0, nrow(x)), rep(0, nrow(x))),
         ncol = 6)
})
design_mat <- do.call(rbind.data.frame, design_mat)
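The code above puts 1, 0, 0 in the last three columns of every row, whatever the instrument.
The desired result has one indicator column per instrument, so rows from A end in 1, 0, 0, rows from B in 0, 1, 0, and rows from C in 0, 0, 1.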
We could use table to build the indicator columns:
out <- cbind(V1 = 1, do.call(rbind, dat))
out$V3 <- cos(out$t)
out <- cbind(out, as.data.frame.matrix( table(seq_len(nrow(out)), out$var)))
model.matrix can also be handy when constructing design matrices.
design_mat <- lapply(dat, function(x) data.frame(v1 = x$t, v2 = cos(x$t), v3 = x$var))
tmp <- do.call(rbind, design_mat)
cbind(model.matrix(v2 ~ v3, tmp), tmp)
(Intercept) v3B v3C v1 v2 v3
A.1 1 0 0 1 0.5403023 A
A.2 1 0 0 5 0.2836622 A
B.1 1 1 0 7 0.7539023 B
B.2 1 1 0 4 -0.6536436 B
B.3 1 1 0 1 0.5403023 B
C.1 1 0 1 12 0.8438540 C
C.2 1 0 1 4 -0.6536436 C
C.3 1 0 1 3 -0.9899925 C
C.4 1 0 1 9 -0.9111303 C
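The model.matrix output above drops the indicator for A because it is treated as the reference level. If all three instrument columns are wanted, as in the model stated in the question, one possible sketch is to ask model.matrix for a formula without an intercept. The names `long`, `X`, `y0`, and `cos_t` are illustrative assumptions.

```r
# Data from the question.
dat <- list("A" = data.frame(t = c(1, 5), var = "A"),
            "B" = data.frame(t = c(7, 4, 1), var = "B"),
            "C" = data.frame(t = c(12, 4, 3, 9), var = "C"))

# Stack the three instruments into one long data frame.
long <- do.call(rbind, dat)
long$var <- factor(long$var)

# ~ 0 + var removes the intercept, so every factor level gets its own
# 0/1 column (varA, varB, varC) instead of one level being dropped.
X <- cbind(y0    = 1,
           t     = long$t,
           cos_t = cos(long$t),
           model.matrix(~ 0 + var, data = long))
X
```

With a constant column plus all three indicators, the columns of X are linearly dependent (the indicators sum to the constant), so fitting this design by least squares would need a constraint or one column removed.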
I am currently trying to cycle through a dataframe of integers and characters and change one value of each row, conditionally. For all rows that do not meet the conditions I would just like to add them back into a new dataframe filled with the modified rows.
I've done this before with no trouble, but I feel as though I have been staring at this too long without any enlightenment.
a <- data.frame(cbind(1,'a',2,'c',3,'d'), stringsAsFactors = F)
b <- data.frame(cbind(1,'a',2,'c',3,'g'), stringsAsFactors = F)
c <- data.frame(cbind(1,'f',4,'g',5,'h'), stringsAsFactors = F)
x <- rbind(a,b,c)
fun <- function(x){
  fin <- NULL
  for(i in 1:nrow(x)){
    v <- x[i+1,]
    if ((x[i,1]== v[i,1]) & (x[i,2]==v[i,2]) ){
      x[i,3] <- "f"
      fin <- rbind(fin, x[i,])
    }else {fin <- rbind(fin, x[i,]) }
    return(fin)
  }
}
fun(x)
X1 X2 X3 X4 X5 X6
1 1 a f c 3 d
The result I desire:
X1 X2 X3 X4 X5 X6
1 1 a f c 3 d
1 1 a 2 c 3 g
1 1 f 4 g 5 h
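Two things in the loop above explain the single-row result: return(fin) sits inside the for loop, so the function exits after the first iteration, and v is a one-row data frame, so v[i,1] only exists for i = 1. A minimal sketch of a corrected version follows, assuming the intent is to set column 3 to "f" whenever the next row has the same values in columns 1 and 2. The function name `flag_repeats` and the variables `r1` to `r3` are hypothetical, and the loop is replaced by a vectorized comparison.

```r
# Data from the question (cbind makes every column character).
r1 <- data.frame(cbind(1, "a", 2, "c", 3, "d"), stringsAsFactors = FALSE)
r2 <- data.frame(cbind(1, "a", 2, "c", 3, "g"), stringsAsFactors = FALSE)
r3 <- data.frame(cbind(1, "f", 4, "g", 5, "h"), stringsAsFactors = FALSE)
x <- rbind(r1, r2, r3)

flag_repeats <- function(x) {
  n <- nrow(x)
  if (n < 2) return(x)              # nothing to compare against
  idx <- seq_len(n - 1)             # every row except the last
  # TRUE where row i and row i + 1 agree in columns 1 and 2
  same <- x[idx, 1] == x[idx + 1, 1] & x[idx, 2] == x[idx + 1, 2]
  x[idx[same], 3] <- "f"            # modify only the matching rows
  x                                 # all rows are returned, changed or not
}

res <- flag_repeats(x)
res
```

Because the data frame is modified in place, the rows that do not meet the condition need no separate handling.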
Or an alternative:
library(dplyr)
library(magrittr)
> z <- x %>% mutate(match = ifelse(( (lead(X1)==X1) & (lead(X2)==X2)),"YES","NO"))
> z %>% mutate(X3 = replace(X3, match=="YES", "f"))
X1 X2 X3 X4 X5 X6 match
1 1 a f c 3 d YES
2 1 a 2 c 3 g NO
3 1 f 4 g 5 h <NA>
I have a large database from which I have extracted a data value (x) using the aggregate function:
library(plotrix)
aggregate(mydataNC[,c(52)],by=list(patientNC, siteNC, supNC),max)
Each (x) value has a corresponding distance value located in a column titled (dist) in this database.
What is the easiest way to extract the dist value and add it to the table?
I'd probably start with merge() first. Here's a small reproducible example you can use to see what's going on and modify it to use your data:
# generate bogus data and view it
x1 <- rep(c("A", "B", "C"), each = 4)
x2 <- rep(c("E", "E", "F", "F"), times = 3)
y1 <- rnorm(12)
y2 <- rnorm(12)
md <- data.frame(x1, x2, y1, y2)
> head(md)
x1 x2 y1 y2
1 A E -1.4603164 -0.9662473
2 A E -0.5247227 1.7970341
3 A F 0.8990502 1.7596285
4 A F -0.6791145 2.2900357
5 B E 1.2894863 0.1152571
6 B E -0.1981511 0.6388998
# aggregate by taking maximum of each unique (x1, x2) combination
md.agg <- with(md, aggregate(y1, by = list(x1, x2), FUN = max))
names(md.agg) <- c("x1", "x2", "y1")
> md.agg
x1 x2 y1
1 A E -0.5247227
2 B E 1.2894863
3 C E 0.9982510
4 A F 0.8990502
5 B F 2.5125956
6 C F -0.5916491
# merge y2 into the aggregated data
md.final <- merge(md, md.agg)
> md.final
x1 x2 y1 y2
1 A E -0.5247227 1.7970341
2 A F 0.8990502 1.7596285
3 B E 1.2894863 0.1152571
4 B F 2.5125956 -0.2217510
5 C E 0.9982510 0.6813261
6 C F -0.5916491 1.0348518
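As a possible alternative to aggregating and merging, the rows holding each group's maximum can be selected directly, which keeps the companion column in the same step. This is only a sketch on made-up data of the same shape as above: the seed is arbitrary, and `grp_max` and `top` are illustrative names.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

# Bogus data with the same layout as in the answer above.
md <- data.frame(x1 = rep(c("A", "B", "C"), each = 4),
                 x2 = rep(c("E", "E", "F", "F"), times = 3),
                 y1 = rnorm(12),
                 y2 = rnorm(12))

# ave() returns, for every row, the maximum y1 of that row's (x1, x2) group.
grp_max <- ave(md$y1, md$x1, md$x2, FUN = max)

# Keep the rows where y1 equals its group maximum; y2 comes along for free.
top <- md[md$y1 == grp_max, ]
top
```

If several rows tie for a group's maximum, both this approach and merge() return all of them.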
I am stuck on this problem and would be happy for advice. I have the following data.frame:
c1 <- factor(c("a","a","a","a"))
c2 <- factor(c("b","b","y","b"))
c3 <- factor(c("c","y","z","c"))
c4 <- factor(c("y","z","","y"))
c5 <- factor(c("z","","","z"))
x <- data.frame(c1,c2,c3,c4,c5)
So this data looks like this:
c1 c2 c3 c4 c5
1 a b c y z
2 a b y z
3 a y z
4 a b c y z
So in each row there is a sequence of varying length made up of a, b, c, which concludes with the values y and z. What I need to do is move y and z each into a separate column that I can work with, so the data looks like this:
  c6 c7 c8 c9 c10
1  a  b  c  y   z
2  a  b     y   z
3  a        y   z
4  a  b  c  y   z
I have worked out how to identify the length of each sequence per row and added that as a column, so I know which columns y and z are located in:
x$not.na <- apply(x, 1, function(x) length(which(!x=="")))
But I am stuck on how to loop(?) over each row to perform the necessary cut and paste of z and y.
Something like this:
lastTwoToEnd <- function(x){
  i <- sum(x != "") - 1:0              # positions of the last two non-empty values
  x[c(setdiff(seq_along(x), i), i)]    # everything else first, then those two
}
data.frame(t(apply(x, 1, lastTwoToEnd)))
##   X1 X2 X3 X4 X5
## 1  a  b  c  y  z
## 2  a  b     y  z
## 3  a        y  z
## 4  a  b  c  y  z