Subset first n occurrences of certain value in dataframe - r

Suppose I have a matrix (or dataframe):
1 5 8
3 4 9
3 9 6
6 9 3
3 1 2
4 7 2
3 8 6
3 2 7
I would like to select only the first three rows that have "3" as their first entry, as follows:
3 4 9
3 9 6
3 1 2
It is clear to me how to pull out all rows that begin with "3", and how to pull out just the first such row.
But in general, how can I extract the first n rows that begin with "3"?
Furthermore, how can I select just the 3rd and 4th appearances, as follows:
3 1 2
3 8 6

Without the need for an extra package:
mydf[mydf$V1==3,][1:3,]
results in:
V1 V2 V3
2 3 4 9
3 3 9 6
5 3 1 2
When you need the third and fourth matching rows:
mydf[mydf$V1==3,][3:4,]
# or:
mydf[mydf$V1==3,][c(3,4),]
Used data:
mydf <- structure(list(V1 = c(1L, 3L, 3L, 6L, 3L, 4L, 3L, 3L),
V2 = c(5L, 4L, 9L, 9L, 1L, 7L, 8L, 2L),
V3 = c(8L, 9L, 6L, 3L, 2L, 2L, 6L, 7L)),
.Names = c("V1", "V2", "V3"), class = "data.frame", row.names = c(NA, -8L))
Bonus material: besides dplyr, you can also do this very efficiently with data.table (see this answer for speed comparisons of the different data.table methods on large datasets):
setDT(mydf)[V1==3, head(.SD,3)]
# or:
setDT(mydf)[V1==3, .SD[1:3]]
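The same indexing idea covers the 3rd-and-4th case from the question; a minimal sketch:
setDT(mydf)[V1==3, .SD[3:4]]
# or equivalently:
setDT(mydf)[V1==3, .SD[c(3,4)]]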

You can do something like this with dplyr to extract the first three rows of each unique value of that column:
library(dplyr)
df %>% arrange(columnName) %>% group_by(columnName) %>% slice(1:3)
If you want to extract only three rows where the value of that column is 3, you can try:
df %>% filter(columnName == 3) %>% slice(1:3)
If you want specific rows, you can supply something like c(3, 4) to slice.
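For instance, to pull only the 3rd and 4th matching rows, keeping the same placeholder names:
df %>% filter(columnName == 3) %>% slice(c(3, 4))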

We could also use subset
head(subset(mydf, V1==3),3)
Update
If we also need to extract the row below each row where V1==3:
i1 <- with(mydf, V1==3)
mydf[sort(unique(c(which(i1),pmin(which(i1)+1L, nrow(mydf))))),]
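To unpack that one-liner, a commented sketch of the intermediate steps (using mydf from above):
idx <- which(i1)                     # rows where V1 == 3: 2 3 5 7 8
below <- pmin(idx + 1L, nrow(mydf))  # the row below each match, capped at the last row
mydf[sort(unique(c(idx, below))), ]  # both sets, ordered, without repeats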

Related

group two variables (in rows) in R to create one variable [duplicate]

I have a data frame:
Disease      Genemutation  Mean.  Total.No.of.pateints  No.of.pateints.
cancertype1  BRCA1         1      10                    2
cancertype2  BRCA2         5      10                    3
cancertype3  BRCA2         7      10                    4
cancertype1  BRCA1         8      10                    1
cancertype3  BRCA2         4      10                    4
cancertype2  BRCA1         6      10                    1
How do I create a new variable called cancertype4 (from cancertype2 and cancertype3) whose patient total is the sum of the two merged types?
We can use replace with %in% to relabel those values (assuming 'Disease' is of character class):
library(dplyr)
df %>%
  group_by(Disease = replace(Disease,
    Disease %in% c("cancertype2", "cancertype3"), "cancertype4")) %>%
  summarise(Total.No.of.pateints = sum(Total.No.of.pateints))
Output:
# A tibble: 2 x 2
#   Disease     Total.No.of.pateints
#   <chr>                      <int>
# 1 cancertype1                   20
# 2 cancertype4                   40
Here is a base R option using aggregate
aggregate(
  Total.No.of.pateints ~ Disease,
  transform(
    df,
    Disease = replace(Disease, Disease %in% c("cancertype2", "cancertype3"), "cancertype4")
  ),
  sum
)
giving
Disease Total.No.of.pateints
1 cancertype1 20
2 cancertype4 40
Data
> dput(df)
structure(list(Disease = c("cancertype1", "cancertype2", "cancertype3",
"cancertype1", "cancertype3", "cancertype2"), Genemutation = c("BRCA1",
"BRCA2", "BRCA2", "BRCA1", "BRCA2", "BRCA1"), Mean. = c(1L, 5L,
7L, 8L, 4L, 6L), Total.No.of.pateints = c(10L, 10L, 10L, 10L,
10L, 10L), No.of.pateints. = c(2L, 3L, 4L, 1L, 4L, 1L)), class = "data.frame", row.names = c(NA,
-6L))
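A data.table version in the same spirit is also possible; a sketch, assuming the same df (fifelse and %chin% are data.table helpers):
library(data.table)
setDT(df)[, .(Total.No.of.pateints = sum(Total.No.of.pateints)),
  by = .(Disease = fifelse(Disease %chin% c("cancertype2", "cancertype3"),
                           "cancertype4", Disease))]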

rowSums with multiple conditions

I'm trying to calculate a sum across several variables in each row.
This is my data as an example. I have 5 patient IDs and 4 condition variables. For each condition whose value is between 1 and 3, the row's sum should increase by 1.
ID<-c("a","b","c","d","e")
cond1<-as.factor(sample(x=1:7,size=5,replace=TRUE))
cond2<-as.factor(sample(x=1:7,size=5,replace=TRUE))
cond3<-as.factor(sample(x=1:7,size=5,replace=TRUE))
cond4<-as.factor(sample(x=1:7,size=5,replace=TRUE))
df<-data.frame(ID,cond1,cond2,cond3,cond4)
df
ID cond1 cond2 cond3 cond4
1 a 2 7 6 6
2 b 7 2 3 6
3 c 4 3 1 4
4 d 7 3 3 6
5 e 6 7 7 3
I used rowSums with the following statement. However, in the 2nd row, although cond2 is 2 and cond3 is 3, the result was 1 instead of 2. The 4th row has the same problem.
df$cumsum<-rowSums(df[,2:5]==c(1,2,3),na.rm=TRUE)
df
ID cond1 cond2 cond3 cond4 cumsum
1 a 2 7 6 6 0
2 b 7 2 3 6 1
3 c 4 3 1 4 1
4 d 7 3 3 6 1
5 e 6 7 7 3 0
How can I count every matching condition in each row? I would really appreciate your help.
Your == comparison recycles c(1,2,3) across the matrix, so each cell is compared with only one of the three values. For comparing against more than one element, use %in%; but %in% works on a vector, so we loop through the columns with sapply and then apply rowSums to the resulting logical matrix:
df$RSum <- rowSums(sapply(df[,2:5], `%in%`, 1:3))
df$RSum
#[1] 1 2 2 2 1
If the values were numeric, then we could also make use of >= and <=:
df$RSum <- rowSums(df[, 2:5] >=1 & df[, 2:5] <=3)
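With the factor columns from the question's own construction, that >=/<= comparison only yields NAs with a warning; a minimal sketch of converting the factors to numbers first (as.character avoids mapping factors to their level indices):
num <- sapply(df[, 2:5], function(x) as.numeric(as.character(x)))
df$RSum <- rowSums(num >= 1 & num <= 3)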
data
df <- structure(list(ID = c("a", "b", "c", "d", "e"), cond1 = c(2L,
7L, 4L, 7L, 6L), cond2 = c(7L, 2L, 3L, 3L, 7L), cond3 = c(6L,
3L, 1L, 3L, 7L), cond4 = c(6L, 6L, 4L, 6L, 3L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5"))
I suggest you fix two problems with your data:
1. Your data is wide instead of long formatted. Had your data been long formatted, your analysis would have been much simpler. This is especially true for plotting.
2. Your values for each condition are factors. That makes comparisons more difficult and might induce some difficult-to-spot errors. If you look at #akrun's answer carefully, you'll notice the values there are integers.
That said, I propose a data.table solution:
# 1. load libraries and make df a data.table:
library(data.table)
setDT(df)
# 2. make the wide table a long one
melt(df, id.vars = "ID")
# 3. with a long table, count the number of conditions that are in the 1:3 range for each ID. Notice I chained the first command with this second one:
melt(df, id.vars = "ID")[, sum(value %in% 1:3), by = ID]
Which produces the result:
ID V1
1: a 1
2: b 2
3: c 2
4: d 2
5: e 1
You'll only need to run commands under 1 and 3 (2 has been chained into 3). See ?data.table for further details.
You can read more about wide vs. long formats on Wikipedia and in Mike Wise's answer.
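If you want the count back on the original wide table, one sketch is a data.table join-assign (df is already a data.table after the setDT above):
counts <- melt(df, id.vars = "ID")[, .(RSum = sum(value %in% 1:3)), by = ID]
df[counts, RSum := i.RSum, on = "ID"]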
The data I used is the same as #akrun's:
df <- structure(list(ID = c("a", "b", "c", "d", "e"),
cond1 = c(2L, 7L, 4L, 7L, 6L),
cond2 = c(7L, 2L, 3L, 3L, 7L),
cond3 = c(6L, 3L, 1L, 3L, 7L),
cond4 = c(6L, 6L, 4L, 6L, 3L)),
class = "data.frame",
row.names = c("1", "2", "3", "4", "5"))

In R: How can I create a Variable C that has the maximum value from Variable B for a group in Variable A prior to the given observation? [duplicate]

I need to find a running maximum of a variable by group using R. The data is sorted by time within group using df[order(df$group, df$time),].
My variable has some NAs, but I can deal with that by replacing them with zeros for this computation.
This is what the data frame df looks like:
(df <- structure(list(var = c(5L, 2L, 3L, 4L, 0L, 3L, 6L, 4L, 8L, 4L),
group = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
.Label = c("a", "b"), class = "factor"),
time = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L)),
.Names = c("var", "group","time"),
class = "data.frame", row.names = c(NA, -10L)))
# var group time
# 1 5 a 1
# 2 2 a 2
# 3 3 a 3
# 4 4 a 4
# 5 0 a 5
# 6 3 b 1
# 7 6 b 2
# 8 4 b 3
# 9 8 b 4
# 10 4 b 5
And I want a variable curMax as:
var  group  time  curMax
  5      a     1       5
  2      a     2       5
  3      a     3       5
  4      a     4       5
  0      a     5       5
  3      b     1       3
  6      b     2       6
  4      b     3       6
  8      b     4       8
  4      b     5       8
Please let me know if you have any idea how to implement it in R.
We can try data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)); then, grouped by 'group', we get the cummax of 'var' and assign (:=) it to a new variable ('curMax'):
library(data.table)
setDT(df1)[, curMax := cummax(var), by = group]
As commented by #Michael Chirico, if the data is not ordered by 'time', we can do the ordering in 'i':
setDT(df1)[order(time), curMax:=cummax(var), by = group]
Or with dplyr
library(dplyr)
df1 %>%
group_by(group) %>%
mutate(curMax = cummax(var))
If df1 is a tbl_sql, explicit ordering might be required, using arrange:
df1 %>%
group_by(group) %>%
arrange(time, .by_group=TRUE) %>%
mutate(curMax = cummax(var))
or dbplyr::window_order
library(dbplyr)
df1 %>%
group_by(group) %>%
window_order(time) %>%
mutate(curMax = cummax(var))
You can also do it in base R with ave:
df$curMax <- ave(df$var, df$group, FUN=cummax)
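The ave approach assumes the rows are already ordered by time within group; the question also mentions NAs. A minimal sketch combining both, using the zero-replacement the asker proposed:
df$var[is.na(df$var)] <- 0                # the asker's suggested NA handling
df <- df[order(df$group, df$time), ]      # cummax needs time order within group
df$curMax <- ave(df$var, df$group, FUN = cummax)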

Delete a set of data from .csv file after figuring out the duplicate values

The problem is that I have a set of data in two columns. Example:
A B
3 5
6 7
4 4
7 8
1 6
8 7
Here I want to figure out the values that are the same in both the A and B columns (4 & 4). I also want to know the duplicates present in the B column (7 & 7).
After figuring that out, is there a way to remove them and keep them in a different file?
Also, can you point me to good material on data manipulation with R?
We create two indices, one for each column:
i1 <- df1$A == df1$B
i2 <- with(df1, duplicated(B)|duplicated(B, fromLast = TRUE))
df1[i1,1]
#[1] 4
df1[i1|i2,2]
#[1] 7 4 7
As the number of elements to be removed differs between the two columns, we loop through the columns and remove the values based on the logical indices:
Map(`[`, df1, list(!i1, !(i1|i2)))
#$A
#[1] 3 6 7 1 8
#$B
#[1] 5 8 6
data
df1 <- structure(list(A = c(3L, 6L, 4L, 7L, 1L, 8L), B = c(5L, 7L, 4L,
8L, 6L, 7L)), .Names = c("A", "B"), class = "data.frame", row.names = c(NA,
-6L))
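The question also asks to keep the removed data in a different file; a minimal sketch that writes the full rows flagged by either condition (the file name is illustrative):
write.csv(df1[i1 | i2, ], "removed_rows.csv", row.names = FALSE)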
Your data frame:
db <- data.frame(A = c(3, 6, 4, 7, 1, 8),
                 B = c(5, 7, 4, 8, 6, 7))
Identify equal and duplicated data (note that duplicated() flags only the second and later occurrences, so the first 7 in B, row 2, is kept):
not_equal <- !db[, 1] == db[, 2]
not_duplicated <- !duplicated(db[, 2])
Filter out
db[not_equal & not_duplicated,]
A B
1 3 5
2 6 7
4 7 8
5 1 6

Using a vector of characters that correspond to an expression as an argument to a function

I have the following code,
z <- data.frame(a=sample.int(10),b=sample.int(10),c=sample.int(10))
letter <- c("a","c","b") # this will be used as the argument to a function
vec <- unlist(lapply(1:length(letter),
function(x) cat(paste("z[[letter[",x,"]]],",sep=""))))
vec[length(vec)] <- paste("z[[letter[",length(vec),"]]]",sep="")
Consequently:
> vec
[1] "z[[letter[1]]]," "z[[letter[2]]]," "z[[letter[3]]]"
I want to use vec to order the rows of dataframe z, using the code below,
z.sort <- z[with(z, order(???)),]
How can I get the character vector vec to be evaluated as the arguments to order?
Is there a better way of doing this, bearing in mind that letter, which is used to form vec, will be an argument to a function?
Desired output would be:
a b c
5 1 1 9
10 2 10 2
1 3 7 1
9 4 2 5
8 5 8 6
2 6 4 3
4 7 9 10
3 8 3 8
6 9 5 7
7 10 6 4
or as dput output:
structure(list(a = 1:10, b = c(1L, 10L, 7L, 2L, 8L, 4L, 9L, 3L, 5L, 6L), c = c(9L, 2L, 1L, 5L, 6L, 3L, 10L, 8L, 7L, 4L)), .Names = c("a", "b", "c"), row.names = c(5L, 10L, 1L, 9L, 8L, 2L, 4L, 3L, 6L, 7L), class = "data.frame")
Here's what you want (with different random data):
> z[do.call(order, z[,letter]),]
a b c
5 1 2 1
4 2 4 8
1 3 3 9
6 4 6 3
8 5 8 5
10 6 1 4
2 7 5 7
3 8 10 2
9 9 7 6
7 10 9 10
do.call lets us send a list to a function as its arguments, so we can just reorder the columns of z and send them to order using do.call, as a data.frame is just a special kind of list.
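To make the mechanics concrete, with letter as above the do.call form is just a programmatic way of writing the columns out by hand:
# do.call(order, z[, letter]) is equivalent to:
order(z[["a"]], z[["c"]], z[["b"]])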
Works with no problem in a function:
my.reorder <- function(dat, cols) { dat[do.call(order, dat[,cols]),] }
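For example:
z.sort <- my.reorder(z, letter)   # sorts z by column a, then c, then b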
A little more clarification about what you're doing would help a lot, but a few thoughts...
You can simplify your lapply construction dramatically by using sapply:
vec <- sapply(letter, function(x) paste('z[["',x,'"]]',sep=''))
The cat step is unnecessary unless you're trying to see the output as well as assign it, and even then you'd be better served by a two-line function that uses print. You can get the sorting you're talking about by using eval(parse(text=...)):
z[order(eval(parse(text=vec[1]))),]
And you can get each sort out as a list:
lapply(vec, function(x) z[order(eval(parse(text=x))),])
But... if sorting your data.frame by each column specified in letter is the goal:
lapply(letter, function(x) z[order(z[[x]]),])
gives you the same output as all the fiddly steps above. Using something like eval(parse(text=...)) often means you're doing something silly and should rethink your steps, since there could very well be a more straightforward solution.
