subset.data.frame in R - r

I have a data frame of raw data:
raw <- data.frame(subj = c(1,1,1,2,2,2,3,3,3,4,4,4),
blah = c(0,0,0,1,1,1,1,0,1,0,0,0))
From it, I want to remove the bad subj.
badsubj <- c(1,4)
trim <- subset.data.frame(raw, subj != badsubj)
But for some reason, all the badsubj values are not removed:
subj blah
2 1 0
4 2 1
5 2 1
6 2 1
7 3 1
8 3 0
9 3 1
11 4 0
What am I doing wrong? Obersvations 2 and 11 should be excluded because they are members of badsubj.

raw[!raw$subj %in% badsubj, ]
wrong use of !=

The problem is that subj and badsubj do not have the same length. Therefore badsubj will be recycled until both vectors have the same length. Then your code compares elementwise the values in the output below.
subj badsubj
1 1 1
2 1 4
3 1 1
4 2 4
5 2 1
6 2 4
7 3 1
8 3 4
9 3 1
10 4 4
11 4 1
12 4 4

Related

R: Return values in a columns when the value in another column becomes negative for the first time

For each ID, I want to return the value in the 'distance' column where the value becomes negative for the first time. If the value does not become negative at all, return the value 99 (or some other random number) for that ID. A sample data frame is given below.
df <- data.frame(ID=c(rep(1, 4),rep(2,4),rep(3,4),rep(4,4),rep(5,4)),distance=rep(1:4,5), value=c(1,4,3,-1,2,1,-4,1,3,2,-1,1,-4,3,2,1,2,3,4,5))
> df
ID distance value
1 1 1 1
2 1 2 4
3 1 3 3
4 1 4 -1
5 2 1 2
6 2 2 1
7 2 3 -4
8 2 4 1
9 3 1 3
10 3 2 2
11 3 3 -1
12 3 4 1
13 4 1 -4
14 4 2 3
15 4 3 2
16 4 4 1
17 5 1 2
18 5 2 3
19 5 3 4
20 5 4 5
The desired output is as follows
> df2
ID first_negative_distance
1 1 4
2 2 3
3 3 3
4 4 1
5 5 99
I tried but couldn't figure out how to do it through dplyr. Any help would be much appreciated. The actual data I'm working on has thousands of ID's with 30 different distance levels for each. Bear in mind that for any ID, there could be multiple instances of negative values. I just need the first one.
Edit:
Tried the solution proposed by AntonoisK.
> df%>%group_by(ID)%>%summarise(first_neg_dist=first(distance[value<0]))
first_neg_dist
1 4
This is the result I am getting. Does not match what Antonois got. Not sure why.
library(dplyr)
df %>%
group_by(ID) %>%
summarise(first_neg_dist = first(distance[value < 0]))
# # A tibble: 5 x 2
# ID first_neg_dist
# <dbl> <int>
# 1 1 4
# 2 2 3
# 3 3 3
# 4 4 1
# 5 5 NA
If you really prefer 99 instead of NA you can use
summarise(first_neg_dist = coalesce(first(distance[value < 0]), 99L))
instead.

Group by each increasing sequence in data frame

If I have a data frame with a column of monotonically increasing values such as:
x
1
2
3
4
1
2
3
1
2
3
4
5
6
1
2
How do I add a column to group each increasing sequence that results in:
x y
1 1
2 1
3 1
4 1
1 2
2 2
3 2
1 3
2 3
3 3
4 3
5 3
6 3
1 4
2 4
I can only think of using a loop which will be slow.
You may choose cumsum function to do it.
> x <- c(1,2,3,4,1,2,3,1,2,4,5,1,2)
> cumsum(x==1)
[1] 1 1 1 1 2 2 2 3 3 3 3 4 4
I would use diff and compute the cumulative sum:
df$y <- c(1, cumsum(diff(df$x) < 0 ) + 1)
> df
x y
1 1 1
2 2 1
3 3 1
4 4 1
5 1 2
6 2 2
7 3 2
8 1 3
9 2 3
10 3 3
11 4 3
12 5 3
13 6 3
14 1 4
15 2 4

remove i+1th term if reoccuring

Say we have the following data
A <- c(1,2,2,2,3,4,8,6,6,1,2,3,4)
B <- c(1,2,3,4,5,1,2,3,4,5,1,2,3)
data <- data.frame(A,B)
How would one write a function so that for A, if we have the same value in the i+1th position, then the reoccuring row is removed.
Therefore the output should like like
data.frame(c(1,2,3,4,8,6,1,2,3,4), c(1,2,5,1,2,3,5,1,2,3))
My best guess would be using a for statement, however I have no experience in these
You can try
data[c(TRUE, data[-1,1]!= data[-nrow(data), 1]),]
Another option, dplyr-esque:
library(dplyr)
dat1 <- data.frame(A=c(1,2,2,2,3,4,8,6,6,1,2,3,4),
B=c(1,2,3,4,5,1,2,3,4,5,1,2,3))
dat1 %>% filter(A != lag(A, default=FALSE))
## A B
## 1 1 1
## 2 2 2
## 3 3 5
## 4 4 1
## 5 8 2
## 6 6 3
## 7 1 5
## 8 2 1
## 9 3 2
## 10 4 3
using diff, which calculates the pairwise differences with a lag of 1:
data[c( TRUE, diff(data[,1]) != 0), ]
output:
A B
1 1 1
2 2 2
5 3 5
6 4 1
7 8 2
8 6 3
10 1 5
11 2 1
12 3 2
13 4 3
Using rle
A <- c(1,2,2,2,3,4,8,6,6,1,2,3,4)
B <- c(1,2,3,4,5,1,2,3,4,5,1,2,3)
data <- data.frame(A,B)
X <- rle(data$A)
Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
View(data[Y, ])
row.names A B
1 1 1 1
2 2 2 2
3 5 3 5
4 6 4 1
5 7 8 2
6 8 6 3
7 10 1 5
8 11 2 1
9 12 3 2
10 13 4 3

Two dimensional heatmap with R

I have an input file of this form:
0.35217720 1 201 1
0.26413283 1 209 1
1.1665874 1 210 1
...
0.30815500 2 194 1
0.15407741 2 196 1
0.15407741 2 197 1
0.33016610 2 205 1
...
where the first column is a scalar value, the second is the x coordinate of a discrete lattice, the third is the y coordinate and the last one is time-like discrete component.
I would like to make a two dimensional heatmap of the scalar values at fixed time. How can i do? Edit: I don't know how to use image() to use the second and the third column as x, y coordinates.
Example file:
7.62939453 1 1 1
1.3153768 1 2 1
7.5560522 1 3 1
4.5865011 1 4 1
5.3276706 1 5 1
2.1895909 2 1 1
0.47044516 2 2 1
6.7886448 2 3 1
6.7929626 2 4 1
9.3469286 2 5 1
3.8350201 3 1 1
5.1941633 3 2 1
8.3096523 3 3 1
0.34571886 3 4 1
0.53461552 3 5 1
5.2970004 4 1 1
6.7114925 4 2 1
7.69805908 4 3 1
3.8341546 4 4 1
0.66842079 4 5 1
4.1748595 5 1 1
6.8677258 5 2 1
5.8897662 5 3 1
9.3043633 5 4 1
8.4616680 5 5 1
Reshape your data to a matrix and then use heatmap():
This worked on R version 2.10.1 (2009-12-14):
txt <- textConnection("7.62939453 1 1 1
1.3153768 1 2 1
7.5560522 1 3 1
4.5865011 1 4 1
5.3276706 1 5 1
2.1895909 2 1 1
0.47044516 2 2 1
6.7886448 2 3 1
6.7929626 2 4 1
9.3469286 2 5 1
3.8350201 3 1 1
5.1941633 3 2 1
8.3096523 3 3 1
0.34571886 3 4 1
0.53461552 3 5 1
5.2970004 4 1 1
6.7114925 4 2 1
7.69805908 4 3 1
3.8341546 4 4 1
0.66842079 4 5 1
4.1748595 5 1 1
6.8677258 5 2 1
5.8897662 5 3 1
9.3043633 5 4 1
8.4616680 5 5 1
")
df <- read.table(txt)
close(txt)
names(df) <- c("value", "x", "y", "t")
require(reshape)
dfc <- cast(df[ ,-4], x ~ y)
heatmap(as.matrix(dfc))
## Some copy/pasteable fake data for you (dput() works nicely for pasteable real data)
your_matrix <- cbind(runif(25, 0, 10), rep(1:5, each = 5), rep(1:5, 5), rep(1, 25))
heatmap_matrix <- matrix(your_matrix[, 1], nrow = 5)
## alternatively, if your_matrix isn't in order
## (The reshape method in EDi's answer is a nicer alternative)
for (i in 1:nrow(your_matrix)) {
heatmap_matrix[your_matrix[i, 2], you_matrix[i, 3]]
}
heatmap(heatmap_matrix) # one option
image(z = heatmap_matrix) # another option
require(gplots)
heatmap.2(heatmap_matrix) # this has fancier preferences

Conditonally delete columns in R

I know how to delete columns in R, but I am not sure how to delete them based on the following set of conditions.
Suppose a data frame such as:
DF <- data.frame(L = c(2,4,5,1,NA,4,5,6,4,3), J= c(3,4,5,6,NA,3,6,4,3,6), K= c(0,1,1,0,NA,1,1,1,1,1),D = c(1,1,1,1,NA,1,1,1,1,1))
DF
L J K D
1 2 3 0 1
2 4 4 1 1
3 5 5 1 1
4 1 6 0 1
5 NA NA NA NA
6 4 3 1 1
7 5 6 1 1
8 6 4 1 1
9 4 3 1 1
10 3 6 1 1
The data frame has to be set up in this fashion. Column K corresponds to column L, and column D, corresponds to column J. Because column D has values that are all equal to one, I would like to delete column D, and the corresponding column J yielding a dataframe that looks like:
DF
L K
1 2 0
2 4 1
3 5 1
4 1 0
5 NA NA
6 4 1
7 5 1
8 6 1
9 4 1
10 3 1
I know there has got to be a simple command to do so, I just can't think of any. And if it makes any difference, the NA's must be retained.
Additional helpful information, in my real data frame there are a total of 20 columns, so there are 10 columns like L and J, and another 10 that are like K and D, I need a function that can recognize the correspondence between these two groups and delete columns accordingly if necessary
Thank you in advance!
Okey, assuming the column-number based correspondence, here is an example:
> n <- 10
>
> # sample data
> d <- data.frame(lapply(1:n, function(x)sample(n)), lapply(1:n, function(x)sample(2, n, T, c(0.1, 0.9))-1))
> names(d) <- c(LETTERS[1:n], letters[1:n])
> head(d)
A B C D E F G H I J a b c d e f g h i j
1 5 5 2 7 4 3 4 3 5 8 0 1 1 1 1 1 1 1 1 1
2 9 8 4 6 7 8 8 2 10 5 1 1 1 1 1 1 1 1 1 1
3 6 6 10 3 5 6 2 1 8 6 1 1 1 1 1 1 1 1 1 1
4 1 7 5 5 1 10 10 4 2 4 1 1 1 1 1 1 1 1 1 1
5 10 9 6 2 9 5 6 9 9 9 1 1 0 1 1 1 1 1 1 1
6 2 1 1 4 6 1 5 8 4 10 1 1 1 1 1 1 1 1 1 1
>
> # find the column that should be left.
> idx <- which(colMeans(d[(n+1):(2*n)], na.rm = TRUE) != 1)
>
> # filter the data
> d[, c(idx, idx+n)]
A B C D F a b c d f
1 5 5 2 7 3 0 1 1 1 1
2 9 8 4 6 8 1 1 1 1 1
3 6 6 10 3 6 1 1 1 1 1
4 1 7 5 5 10 1 1 1 1 1
5 10 9 6 2 5 1 1 0 1 1
6 2 1 1 4 1 1 1 1 1 1
7 8 4 7 10 2 1 1 1 1 0
8 7 3 9 9 4 1 0 1 0 1
9 3 10 3 1 9 1 1 0 1 1
10 4 2 8 8 7 1 0 1 1 1
I basically agree with koshke (whose SO work is excellent), but would suggest that the test to use is colSums(d[(n+1):(2*n)], na.rm=TRUE) == NROW(d) , since a paired 0 and 2 or -1 and 3 could throw off the colMeans test.

Resources