R: create group variable based on row order and condition

I have a data frame containing multiple groups that are not explicitly labelled. Instead, a new group always starts when type == 1 and continues through the following rows where type == 2. The number of rows per group can vary.
How can I explicitly create a new grouping variable based on the order of another column? The groups should, of course, be mutually exclusive.
My data:
df <- data.frame(type = c(1, 2, 2, 1, 2, 1, 2, 2, 2, 1),
                 stand = 1:10)
Expected output with the new grouping variable myGroup:
type stand myGroup
1 1 1 a
2 2 2 a
3 2 3 a
4 1 4 b
5 2 5 b
6 1 6 c
7 2 7 c
8 2 8 c
9 2 9 c
10 1 10 d

One option could be:
with(df, letters[cumsum(type == 1)])
[1] "a" "a" "a" "b" "b" "c" "c" "c" "c" "d"

Here is another option using rep() + diff(), though it is not as simple as the approach by @tmfmnk:
idx <- which(df$type == 1)
v <- diff(idx)
df$myGroup <- rep(letters[seq_along(idx)], c(v, nrow(df) - sum(v)))
such that
> df
type stand myGroup
1 1 1 a
2 2 2 a
3 2 3 a
4 1 4 b
5 2 5 b
6 1 6 c
7 2 7 c
8 2 8 c
9 2 9 c
10 1 10 d

Related

Randomly delete rows based on a variable in a data frame

In my data below, I wonder how to delete all rows with a given value of outcome (say "A") from n (say 1) randomly selected studies.
The only condition is that we want to select only from studies that have used more than one value of outcome (e.g., study == 1 and study == 2, each of which has both outcome == "A" and outcome == "B").
For example, below let's say the given value of outcome is "A". Then, for a given n (say n = 1), we delete all rows with outcome == "A" from n = 1 randomly selected study out of study == 1 and study == 2.
Is this possible in R?
m =
"
study group outcome
1 1 1 A
2 1 1 B
3 1 2 A
4 1 2 B
5 2 1 A
6 2 1 B
7 2 2 A
8 2 2 B
9 3 1 B
10 4 1 B
"
data <- read.table(text = m, header = TRUE)
library(dplyr)
n = 1
studies_to_remove = sample(unique(data$study), size = n)
outcome_to_remove = "A"
data %>%
  filter(
    !(
      study %in% studies_to_remove &
        outcome %in% outcome_to_remove
    )
  )
# study group outcome
# 2 1 1 B
# 4 1 2 B
# 5 2 1 A
# 6 2 1 B
# 7 2 2 A
# 8 2 2 B
# 9 3 1 B
# 10 4 1 B
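Note that sample(unique(data$study), size = n) draws from every study, not only those meeting the stated condition. A small sketch (not part of the original answer) of how the candidate pool could first be restricted to studies with more than one distinct outcome:
library(dplyr)

eligible <- data %>%
  group_by(study) %>%
  filter(n_distinct(outcome) > 1) %>%  # keep only studies that used more than one outcome
  pull(study) %>%
  unique()

# note: if eligible ever has length 1, sample() would sample from 1:eligible,
# so a length check may be needed in general
studies_to_remove <- sample(eligible, size = n)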

Cumulative product in R across columns

I have a dataframe in the following format:
> x <- data.frame("a" = c(1,1),"b" = c(2,2),"c" = c(3,4))
> x
a b c
1 1 2 3
2 1 2 4
I'd like to add 3 new columns which are a cumulative product of the columns a, b, c; however, I need a reverse cumulative product, i.e. the output should be
row 1:
result_d = 1*2*3 = 6, result_e = 2*3 = 6, result_f = 3
and similarly for row 2
The end result will be
a b c result_d result_e result_f
1 1 2 3 6 6 3
2 1 2 4 8 8 4
The column names do not matter; this is just an example. Does anyone have any idea how to do this?
As per my comment, is it possible to do this on a subset of columns? E.g., only for columns b and c, to return:
a b c results_e results_f
1 1 2 3 6 3
2 1 2 4 8 4
so that column "a" is effectively ignored?
One option is to loop through the rows, apply cumprod() over the reversed elements, and then reverse the result:
nm1 <- paste0("result_", c("d", "e", "f"))
x[nm1] <- t(apply(x, 1, function(x) rev(cumprod(rev(x)))))
x
# a b c result_d result_e result_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
Or a vectorized option is rowCumprods() from matrixStats:
library(matrixStats)
x[nm1] <- rowCumprods(as.matrix(x[ncol(x):1]))[,ncol(x):1]
Another base R possibility, starting again from the original three-column x, uses Reduce() with accumulate = TRUE:
temp <- data.frame(Reduce("*", x[NCOL(x):1], accumulate = TRUE))
setNames(cbind(x, temp[NCOL(temp):1]),
         c(names(x), c("res_d", "res_e", "res_f")))
# a b c res_d res_e res_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
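Regarding the follow-up about a subset of columns, the same row-wise trick can be applied to just b and c while leaving a untouched (a sketch; the results_e/results_f names simply mirror the example in the question):
x <- data.frame(a = c(1, 1), b = c(2, 2), c = c(3, 4))
nm2 <- paste0("results_", c("e", "f"))
x[nm2] <- t(apply(x[c("b", "c")], 1, function(r) rev(cumprod(rev(r)))))
x
#   a b c results_e results_f
# 1 1 2 3         6         3
# 2 1 2 4         8         4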

Repeat elements of data.frame [duplicate]

This question already has answers here:
Repeat rows of a data.frame [duplicate]
This seems to be a fairly simple problem but I can't find a simple solution:
I want to repeat a data.frame (i) several times as follows:
My initial data.frame:
i <- data.frame(c("A","A","A","B","B","B","C","C","C"))
i
Printing i results in:
1 A
2 A
3 A
4 B
5 B
6 B
7 C
8 C
9 C
How I want to repeat the elements (the numbers in the first column are just for easy understanding/viewing):
i
1 A
2 A
3 A
4 B
5 B
6 B
7 C
8 C
9 C
1 A
2 A
3 A
4 B
5 B
6 B
7 C
8 C
9 C
I tried doing it using:
i[rep(seq_len(nrow(i)), each = 2), ]
but it gives output like this (the numbers in the first column are again just for easy viewing):
1 A
2 A
3 A
1 A
2 A
3 A
4 B
5 B
6 B
4 B
5 B
6 B
7 C
8 C
9 C
7 C
8 C
9 C
Please help!
Not sure if this solves your problem, but to obtain the desired output you could simply repeat the entire sequence:
i <- c("A","A","A","B","B","B","C","C","C")
i2 <- rep(i,2)
#> i2
# [1] "A" "A" "A" "B" "B" "B" "C" "C" "C" "A" "A" "A" "B" "B" "B" "C" "C" "C"
Since you're dealing with a data frame, you could use a slightly modified variant:
i <- data.frame(c("A","A","A","B","B","B","C","C","C"))
i2 <- rep(i[,1],2)
You could use rbind(i, i). Does that work?
If you are working with a data frame, this code will work fine too (here making 5 copies; use 2 to match the example above):
i[rep(1:nrow(i), 5), , drop = FALSE]
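The root cause of the unexpected output above is the each argument: rep(..., each = 2) duplicates every row index in place, while rep(..., times = 2) repeats the whole index sequence, which is what the question asks for. A minimal sketch using the question's data frame i:
# times = 2 repeats the full block of rows, giving A-A-A-B-B-B-C-C-C twice in order
i[rep(seq_len(nrow(i)), times = 2), , drop = FALSE]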

Create new column with lowest value of several columns in data.frame

I have a data.frame like this:
data <- data.frame(A=c(1,3,5),B=c(4,3,6),C=c(2,2,8),D=c(3,3,4))
A B C D
1 4 2 3
3 3 2 3
5 6 8 4
Now I want to create a new variable "E", which is the lowest value of columns A, B, and C, so that the data.frame looks like this:
A B C D E
1 4 2 3 1
3 3 2 3 2
5 6 8 4 5
I can do this using a for loop:
for (i in 1:nrow(data)) {
  data$E[i] <- min(data[i, c("A", "B", "C")])
}
But I was wondering whether this could be done differently (more efficiently)?
Many thanks!
Here are a few ways of doing it, with apply (to apply the min function to each row) or pmin (parallel min):
pmin( data[,1], data[,2], data[,3] )
# [1] 1 2 5
do.call( pmin, data[,1:3] )
# [1] 1 2 5
apply(data[,1:3], 1, min)
# [1] 1 2 5
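To actually add the new column E to the data frame, any of these can be assigned directly; for example, the do.call(pmin, ...) form restricted to columns A, B, and C, as in the question (a short sketch):
data$E <- do.call(pmin, data[, c("A", "B", "C")])
data
#   A B C D E
# 1 1 4 2 3 1
# 2 3 3 2 3 2
# 3 5 6 8 4 5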

Filtering a data frame in R by row names from a second data frame

I have the data.frame :
df1<-data.frame("Sp1"=1:6,"Sp2"=7:12,"Sp3"=13:18)
rownames(df1)=c("A","B","C","D","E","F")
df1
Sp1 Sp2 Sp3
A 1 7 13
B 2 8 14
C 3 9 15
D 4 10 16
E 5 11 17
F 6 12 18
I filter df1 by a cutoff value for rowSums(df1) and return sites (row names) that I want to include in downstream analysis.
include<-rownames(df1[rowSums(df1)>=22,])
include
[1] "B" "C" "D" "E" "F"
I have a second data.frame:
df2 <- data.frame(site.x = c("A","B","C"), site.y = c("D","E","F"), score = 1:3)
site.x site.y score
1 A D 1
2 B E 2
3 C F 3
I want to filter df2 so that it only includes rows where both df2$site.x and df2$site.y are among the sites listed in include, i.e. filtering out the row containing "A" and returning:
site.x site.y score
2 B E 2
3 C F 3
I have tried:
filter <- df2$site.x == include & df2$site.y == include
filtered <- df2[filter, ]
Thanks for any advice!
ANSWER
Use %in%:
filter <- df2$site.x %in% include & df2$site.y %in% include
filtered <- df2[filter, ]
filtered
site.x site.y score
2 B E 2
3 C F 3
For me, it works with:
filter <- df2$site.x %in% include & df2$site.y %in% include
df2[filter, ]
In fact, you've put df1 instead of df2 in the last two lines of your question.
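As a side note, the same membership test can be written in one line with base subset() (equivalent logic, not taken from the original answers):
# Same %in% logic as above, expressed with subset()
subset(df2, site.x %in% include & site.y %in% include)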
