How to write the remaining data frame in R after randomly subseting the data - r

I took a random sample from a data frame. But I don't know how to get the remaining data frame.
df <- data.frame(x=rep(1:3,each=2),y=6:1,z=letters[1:6])
#select 3 random rows
df[sample(nrow(df),3)]
What I want is to get the remaining data frame with the other 3 rows.

sample sets a random seed each time you run it, thus if you want to reproduce its results you will either need to set.seed or save its results in a variable.
Addressing your question, you simply need to add - before your index in order to get the rest of the data set.
Also, don't forget to add a comma after the indx if you want to select rows (unlike in your question)
set.seed(1)
indx <- sample(nrow(df), 3)
Your subset
df[indx, ]
# x y z
# 2 1 5 b
# 6 3 1 f
# 3 2 4 c
Remaining data set
df[-indx, ]
# x y z
# 1 1 6 a
# 4 2 3 d
# 5 3 2 e

Try:
> df
x y z
1 1 6 a
2 1 5 b
3 2 4 c
4 2 3 d
5 3 2 e
6 3 1 f
>
> df2 = df[sample(nrow(df),3),]
> df2
x y z
5 3 2 e
3 2 4 c
1 1 6 a
> df[!rownames(df) %in% rownames(df2),]
x y z
1 1 6 a
2 1 5 b
5 3 2 e

Related

Filtering observations using multivariate column conditions

I'm not very experienced R user, so seek advice how to optimize what I've build and in which direction to move on.
I have one reference data frame, it contains four columns with integer values and one ID.
df <- matrix(ncol=5,nrow = 10)
colnames(df) <- c("A","B","C","D","ID")
# df
for (i in 1:10){
df[i,1:4] <- sample(1:5,4, replace = TRUE)
}
df <- data.frame(df)
df$ID <- make.unique(rep(LETTERS,length.out=10),sep='')
df
A B C D ID
1 2 4 3 5 A
2 5 1 3 5 B
3 3 3 5 3 C
4 4 3 1 5 D
5 2 1 2 5 E
6 5 4 4 5 F
7 4 4 3 3 G
8 2 1 5 5 H
9 4 4 1 3 I
10 4 2 2 2 J
Second data frame has manual input, it's user input, I want to turn it into shiny app later on, that's why also I'm asking for optimization, because my code doesn't seem very neat to me.
df.man <- data.frame(matrix(ncol=5,nrow=1))
colnames(df.man) <- c("A","B","C","D","ID")
df.man$ID <- c("man")
df.man$A <- 4
df.man$B <- 4
df.man$C <- 3
df.man$D <- 4
df.man
A B C D ID
4 4 3 4 man
I want to filter rows from reference sequentially, following the rules:
If there is exact match in a whole row between reference table and manual than extract this(those) from reference and show me that row, if not then reduce number of matching columns from right to left until there is a match but not between less then two variables(columns A,B).
So with my limited knowledge I've wrote this:
# subtraction manual from reference
df <- df %>% dplyr::mutate(Adiff=A-df.man$A)%>%
dplyr::mutate(Bdiff=B-df.man$B)%>%
dplyr::mutate(Cdiff=C-df.man$C) %>%
dplyr::mutate(Ddiff=D-df.man$D)
# check manually how much in a row has zero difference and filter those
ifelse(nrow(df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0 & Ddiff==0)) != 0,
df0<-df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0 & Ddiff==0),
ifelse(nrow(df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0)) != 0,
df0<-df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0),
ifelse(nrow(df%>%filter(Adiff==0 & Bdiff==0)) != 0,
df0<-df%>%filter(Adiff==0 & Bdiff==0),
"less then two exact match")
))
tbl_df(df0[,1:5])
# A tibble: 1 x 5
A B C D ID
<int> <int> <int> <int> <chr>
1 4 4 3 3 G
It works and found ID G but looks ugly to me. So the first question is - What would be recommended way to improve this? Are there any functions, packages or smth I'm missing?
Second question - I want to complicate condition.
Imagine we have reference data set.
A B C D ID
2 4 3 5 A
5 1 3 5 B
3 3 5 3 C
4 3 1 5 D
2 1 2 5 E
5 4 4 5 F
4 4 3 3 G
2 1 5 5 H
4 4 1 3 I
4 2 2 2 J
Manual input is
A B C D ID
4 4 2 2 man
Filtering rules should be following:
If there is exact match in a whole row between reference table and manual than extract this(those) from reference and show me that row, if not then reduce number of matching columns from right to left until there is a match but not between less then two variables(columns A,B).
From those rows where I have only two variable matches filter those which has ± 1 difference in columns to the right. So I should have filtered case G and I from reference table from the example above.
keep going the way I did above, I would do the following:
ifelse(nrow(df0%>%filter(Cdiff %in% (-1:1) & Ddiff %in% (-1:1)))>0,
df01 <- df0%>%filter(Cdiff %in% (-1:1) & Ddiff %in% (-1:1)),
ifelse(nrow(df0%>%filter(Cdiff %in% (-1:1)))>0,
df01<- df0%>%filter(Cdiff %in% (-1:1)),
"NA"))
It will be about 11 columns at the end, but I assume it doesn't matter so much.
Keeping in mind this objective - how would you suggest to proceed?
Thanks!
This is a lot to sort through, but I have some ideas that might be helpful.
First, you could keep your df a matrix, and use row names for your letters. Something like:
set.seed(2)
df
A B C D
A 5 1 5 1
B 4 5 1 2
C 3 1 3 2
D 3 1 1 4
E 3 1 5 3
F 1 5 5 2
G 2 3 4 3
H 1 1 5 1
I 2 4 5 5
J 4 2 5 5
And for demonstration, you could use a vector for manual as this is input:
# Complete match example
vec.man <- c(3, 1, 5, 3)
To check for complete matches between the manual input and reference (all 4 columns), with all numbers, you can do:
df[apply(df, 1, function(x) all(x == vec.man)), ]
A B C D
3 1 5 3
If you don't have a complete match, would calculate differences between df and vec.man:
# Change example vec.man
vec.man <- c(3, 1, 5, 2)
df.diff <- sweep(df, 2, vec.man)
A B C D
A 2 0 0 -1
B 1 4 -4 0
C 0 0 -2 0
D 0 0 -4 2
E 0 0 0 1
F -2 4 0 0
G -1 2 -1 1
H -2 0 0 -1
I -1 3 0 3
J 1 1 0 3
The diffs that start with and continue with 0 will be your best matches (same as looking from right to left iteratively). Then, your best match is the column of the first non-zero element in each row:
df.best <- apply(df.diff, 1, function(x) which(x!=0)[1])
A B C D E F G H I J
1 1 3 3 4 1 1 1 1 1
You can see that the best match is E which was non-zero in the 4th column (last column did not match). You can extract rows that have 4 in df.best as your best matches:
df.match <- df[which(df.best == max(df.best, na.rm = T)), ]
A B C D
3 1 5 3
Finally, if you want all the rows with closest match +/- 1 if only 2 match, you could check for number of best matches (should be 3). Then, compare differences with vector c(0,0,1) which would imply 2 matches then 3rd column off by +/- 1:
# Example vec.man with only 2 matches
vec.man <- c(3, 1, 6, 9)
> df.match
A B C D
C 3 1 3 2
D 3 1 1 4
E 3 1 5 3
if (max(df.best, na.rm = T) == 3) {
vec.alt = c(0, 0, 1)
df[apply(df.diff[,1:3], 1, function(x) all(abs(x) == vec.alt)), ]
}
A B C D
3 1 5 3
This should be scalable for 11 columns and 4 matches.
To generalize for different numbers of columns, #IlyaT suggested:
n.cols <- max(df.best, na.rm=TRUE)
vec.alt <- c(rep(0, each=n.cols-1), 1)

Replace values in a series exceeding a threshold

In a dataframe I'd like to replace values in a series where they exceed a given threshold.
For example, within a group ('ID') in a series designated by 'time', if 'value' ever exceeds 3, I'd like to make all following entries also equal 3.
ID <- as.factor(c(rep("A", 3), rep("B",3), rep("C",3)))
time <- rep(1:3, 3)
value <- c(c(1,1,2), c(2,3,2), c(3,3,2))
dat <- cbind.data.frame(ID, time, value)
dat
ID time value
A 1 1
A 2 1
A 3 2
B 1 2
B 2 3
B 3 2
C 1 3
C 2 3
C 3 2
I'd like it to be:
ID time value
A 1 1
A 2 1
A 3 2
B 1 2
B 2 3
B 3 3
C 1 3
C 2 3
C 3 3
This should be easy, but I can't figure it out. Thanks!
The ave function makes this very easy by allowing you to apply a function to each of the groupings. In this case, we will adapth the cummax (cumulative maximum) to see if we've seen a 3 yet.
dat$value2<-with(dat, ave(value, ID, FUN=
function(x) ifelse(cummax(x)>=3, 3, x)))
dat;
# ID time value value2
# 1 A 1 1 1
# 2 A 2 1 1
# 3 A 3 2 2
# 4 B 1 2 2
# 5 B 2 3 3
# 6 B 3 2 3
# 7 C 1 3 3
# 8 C 2 3 3
# 9 C 3 2 3
You could also just use FUN=cummax if you want never-decreasing values. I wasn't sure about the sequence c(1,2,1) if you wanted to keep that unchanged or not.
If you can assume your data are sorted by group, then this should be fast, essentially relying on findInterval() behind the scenes:
library(IRanges)
id <- Rle(ID)
three <- which(value>=3L)
ir <- reduce(IRanges(three, end(id)[findRun(three, id)])))
dat$value[as.integer(ir)] <- 3L
This avoids looping over the groups.

Subseting data frame by another data frame

The data is as follows:
> x
a b
1 1 a
2 2 a
3 3 a
4 1 b
5 2 b
6 3 b
> y
a b
1 2 a
2 3 a
3 3 b
My goal is to compare both data frames, and for each row in x indicate whether equivalent row exists in y. All of the y rows are actually contained in x, so I would like to end up with something like this:
> x
a b intersect.x.y
1 1 a F
2 2 a T
3 3 a T
4 1 b F
5 2 b F
6 3 b T
How about that?
How about this?
x$rn <- 1:nrow(x)
xyrows <- merge(x,y)$rn # maybe you just want to look at the merge ...?
x$iny <- FALSE
x$iny[xyrows] <- TRUE
I suspect there is a more standard approach, but this way is easy to understand.

How can I loop a data matrix in R?

I am trying to loop a data matrix for each separate ID tag, “1”, “2” and “3” (see my data at the bottom). Ultimately I am doing this to transform the X and Y coordinates into a timeseries with the ts() function, but first i need to build a loop into the function that returns a timeseries for each separate ID. The looping itself works perfectly fine when I use the following code for a dataframe:
for(i in 1:3){
print(na.omit(xyframe[ID==i,]))
}
Returning the following output:
Timestamp X Y ID
1. 0 -34.012 3.406 1
2. 100 -33.995 3.415 1
3. 200 -33.994 3.427 1
Timestamp X Y ID
4. 0 -34.093 3.476 2
5. 100 -34.145 3.492 2
6. 200 -34.195 3.506 2
Timestamp X Y ID
7. 0 -34.289 3.522 3
8. 100 -34.300 3.520 3
9. 200 -34.303 3.517 3
Yet, when I want to produce a loop in a matrix with the same code:
for(i in 1:3){
print(na.omit(xymatrix[ID==i,])
}
It returns the following error:
Error in print(na.omit(xymatrix[ID == i, ]) :
(subscript) logical subscript too long
Why does it not work to loop the ID through a matrix while it does work for the dataframe and how would I be able to fix it?
Furthermore did I read that looping requires much more computational strength then doing the same thing vector based, would there be a way to do this vector based?
The data (simplification of the real data):
Timestamp X Y ID
1. 0 -34.012 3.406 1
2. 100 -33.995 3.415 1
3. 200 -33.994 3.427 1
4. 0 -34.093 3.476 2
5. 100 -34.145 3.492 2
6. 200 -34.195 3.506 2
7. 0 -34.289 3.522 3
8. 100 -34.300 3.520 3
9. 200 -34.303 3.517 3
The format xymatrix[ID==i,] doesn't work for matrix. Try this way:
for(i in 1:3){ print(na.omit(xymatrix[xymatrix[,'ID'] == i,])) }
In general, if you want to apply a function to a data frame, split by some factor, then you should be using one of the apply family of functions in combination with split.
Here's some reproducible sample data.
n <- 20
some_data <- data.frame(
x = sample(c(1:5, NA), n, replace= TRUE),
y = sample(c(letters[1:5], NA), n, replace= TRUE),
id = gl(3, 1, length = n)
)
If you want to print out the rows with no missing values, split by each ID level, then you want something like this.
lapply(split(some_data, some_data$grp), na.omit)
or more concisely using the plyr package.
library(plyr)
dlply(some_data, .(grp), na.omit)
Both methods return output like this
# $`1`
# x y grp
# 1 2 d 1
# 4 3 e 1
# 7 3 c 1
# 10 4 a 1
# 13 2 e 1
# 16 3 a 1
# 19 1 d 1
# $`2`
# x y grp
# 2 1 e 2
# 5 3 e 2
# 8 3 b 2
# $`3`
# x y grp
# 6 3 c 3
# 9 5 a 3
# 12 2 c 3
# 15 2 d 3
# 18 4 a 3

Convert a matrix with dimnames into a long format data.frame

Hoping there's a simple answer here but I can't find it anywhere.
I have a numeric matrix with row names and column names:
# 1 2 3 4
# a 6 7 8 9
# b 8 7 5 7
# c 8 5 4 1
# d 1 6 3 2
I want to melt the matrix to a long format, with the values in one column and matrix row and column names in one column each. The result could be a data.table or data.frame like this:
# col row value
# 1 a 6
# 1 b 8
# 1 c 8
# 1 d 1
# 2 a 7
# 2 c 5
# 2 d 6
...
Any tips appreciated.
Use melt from reshape2:
library(reshape2)
#Fake data
x <- matrix(1:12, ncol = 3)
colnames(x) <- letters[1:3]
rownames(x) <- 1:4
x.m <- melt(x)
x.m
Var1 Var2 value
1 1 a 1
2 2 a 2
3 3 a 3
4 4 a 4
...
The as.table and as.data.frame functions together will do this:
> m <- matrix( sample(1:12), nrow=4 )
> dimnames(m) <- list( One=letters[1:4], Two=LETTERS[1:3] )
> as.data.frame( as.table(m) )
One Two Freq
1 a A 7
2 b A 2
3 c A 1
4 d A 5
5 a B 9
6 b B 6
7 c B 8
8 d B 10
9 a C 11
10 b C 12
11 c C 3
12 d C 4
Assuming 'm' is your matrix...
data.frame(col = rep(colnames(m), each = nrow(m)),
row = rep(rownames(m), ncol(m)),
value = as.vector(m))
This executes extremely fast on a large matrix and also shows you a bit about how a matrix is made, how to access things in it, and how to construct your own vectors.
A modification that doesn't require you to know anything about the storage structure, and that easily extends to high dimensional arrays if you use the dimnames, and slice.index functions:
data.frame(row=rownames(m)[as.vector(row(m))],
col=colnames(m)[as.vector(col(m))],
value=as.vector(m))

Resources