Hello I have a df such as
COL1 COL2
A OKI
B OKO
C OKU
D OKP
E BRUT
F 0.87
G 0.82
H 0.57
and I would like to subset the df to keep all lines after the "BRUT" row
and get :
COL1 COL2
F 0.87
G 0.82
H 0.57
You can use match to get the line with BRUT, add 1 and create a sequence until nrow(x) to subset x to get all lines after BRUT.
x[(match("BRUT", x$COL2)+1):nrow(x),]
# COL1 COL2
#6 F 0.87
#7 G 0.82
#8 H 0.57
Or using tail, as suggested by @thelatemail (Thanks!).
tail(x, -match("BRUT",x$COL2))
Or some other alternatives:
x[-(1:match("BRUT", x$COL2)),]
x[-seq_len(match("BRUT", x$COL2)),]
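One caveat with the first variant: if "BRUT" is missing, match returns NA, and if it sits on the last row, (i + 1):nrow(x) counts backwards. A minimal sketch of a guarded version follows. rows_after is a hypothetical helper name (not part of base R), and x is rebuilt here from the data in the question.

```r
# Rebuild the example data from the question
x <- data.frame(
  COL1 = c("A", "B", "C", "D", "E", "F", "G", "H"),
  COL2 = c("OKI", "OKO", "OKU", "OKP", "BRUT", "0.87", "0.82", "0.57"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rows strictly after the first occurrence of `marker`
rows_after <- function(d, marker, col = "COL2") {
  i <- match(marker, d[[col]])
  # Marker not found: return an empty data frame with the same columns
  if (is.na(i)) return(d[0, , drop = FALSE])
  # A logical comparison avoids the backwards sequence (n + 1):n
  d[seq_len(nrow(d)) > i, , drop = FALSE]
}

res <- rows_after(x, "BRUT")
```

With this version, a missing marker or a marker on the last row both give a data frame with zero rows instead of an error or NA rows.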
It seems you only want the numeric values. In this case, a more robust solution could be:
df[grepl('[0-9]', df$COL2),]
# COL1 COL2
#6 F 0.87
#7 G 0.82
#8 H 0.57
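Note that after this subset COL2 is still a character column. If the goal is to work with the numbers, a sketch along these lines keeps only the rows that convert cleanly and stores them as numeric. It swaps the regular expression for as.numeric and assumes every value you want to keep is parseable by it.

```r
df <- data.frame(
  COL1 = c("A", "B", "C", "D", "E", "F", "G", "H"),
  COL2 = c("OKI", "OKO", "OKU", "OKP", "BRUT", "0.87", "0.82", "0.57"),
  stringsAsFactors = FALSE
)

# Non-numeric strings become NA; the coercion warning is expected here
num <- suppressWarnings(as.numeric(df$COL2))

# Keep the rows that parsed and replace the text with real numbers
res <- df[!is.na(num), ]
res$COL2 <- num[!is.na(num)]
```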
You can use which.max to get the row number of the first "BRUT" value.
df[(which.max(df$COL2 == 'BRUT') + 1):nrow(df), ]
# COL1 COL2
#6 F 0.87
#7 G 0.82
#8 H 0.57
Some other options comparing with row number :
df[seq_len(nrow(df)) > which.max(df$COL2 == 'BRUT'), ]
Using dplyr :
library(dplyr)
df %>% filter(row_number() > which.max(COL2 == 'BRUT'))
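One thing to keep in mind with which.max: when no element equals "BRUT", the comparison is all FALSE and which.max still returns 1, so the subset silently drops only the first row. A small check, using a made-up vector that does not contain the marker:

```r
# Hypothetical column without the marker value
col2 <- c("OKI", "OKO", "OKU")

first_max   <- which.max(col2 == "BRUT")  # 1, although nothing matched
first_match <- match("BRUT", col2)        # NA, which makes the miss visible
first_which <- which(col2 == "BRUT")[1]   # NA as well
```

If the marker may be absent, match or which(...)[1] makes it easier to detect that case before subsetting.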
data
df <- structure(list(COL1 = c("A", "B", "C", "D", "E", "F", "G", "H"
), COL2 = c("OKI", "OKO", "OKU", "OKP", "BRUT", "0.87", "0.82",
"0.57")), class = "data.frame", row.names = c(NA, -8L))
Another option with cumsum in base R
subset(df, cumsum(cumsum(COL2 == "BRUT")) >1)
# COL1 COL2
#6 F 0.87
#7 G 0.82
#8 H 0.57
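To see why the double cumsum works, it can help to look at the intermediate vectors. The sketch below just unpacks the one-liner on the same data:

```r
df <- data.frame(
  COL1 = c("A", "B", "C", "D", "E", "F", "G", "H"),
  COL2 = c("OKI", "OKO", "OKU", "OKP", "BRUT", "0.87", "0.82", "0.57"),
  stringsAsFactors = FALSE
)

hit   <- df$COL2 == "BRUT"  # TRUE only on the BRUT row
once  <- cumsum(hit)        # 0 0 0 0 1 1 1 1: from the BRUT row onwards
twice <- cumsum(once)       # 0 0 0 0 1 2 3 4: BRUT row itself stays at 1

res <- df[twice > 1, ]      # strictly after the BRUT row
```

The first cumsum flags the BRUT row and everything below it; the second one pushes only the rows below it past 1.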
Related
I have a data frame like this.
df
Languages Order Machine Company
[1] W,X,Y,Z,H,I D D B
[2] W,X B A G
[3] W,I E B A
[4] H,I B C B
[5] W G G C
I want to get the number of rows where Languages contains at least 2 of the 3 values W, H, I.
The result should be 3, because row 1, row 3 and row 4 each contain at least 2 of the 3 values W, H, I.
You can use strsplit on df$Languages and take the intersect with W, H, I for each row. Then get the lengths of the result and count those that are greater than 1. lapply is used so the result is always a list, even when every row has the same number of matches.
sum(lengths(lapply(strsplit(df$Languages, ",", TRUE), intersect, c("W","H","I"))) > 1)
#[1] 3
You can use :
sum(sapply(strsplit(df$Languages, ','), function(x)
sum(c("W","H","I") %in% x) >= 2))
#[1] 3
data
df<- structure(list(Languages = c("W,X,Y,Z,H,I", "W,X", "W,I", "H,I",
"W"), Order = c("D", "B", "E", "B", "G"), Machine = c("D", "A",
"B", "C", "G"), Company = c("B", "G", "A", "B", "C")),
class = "data.frame", row.names = c(NA, -5L))
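If you also want to know which rows qualify, not only how many, the per-row counts can be kept in a vector first. This is a small variation on the answers above; targets and n_hits are names chosen here for illustration.

```r
df <- data.frame(
  Languages = c("W,X,Y,Z,H,I", "W,X", "W,I", "H,I", "W"),
  stringsAsFactors = FALSE
)

targets <- c("W", "H", "I")

# Number of target values present in each row; vapply guarantees an integer vector
n_hits <- vapply(
  strsplit(df$Languages, ",", fixed = TRUE),
  function(x) sum(targets %in% x),
  integer(1)
)

rows_ok <- which(n_hits >= 2)  # row numbers with at least 2 of the 3 values
n_ok    <- sum(n_hits >= 2)    # how many such rows there are
```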
A tidyverse approach:
library(tidyverse)
df %>% filter(map_int(str_split(Languages, ','), ~ sum(.x %in% c('W', 'H', 'I'))) >= 2)
Languages Order Machine Company
1 W,X,Y,Z,H,I D D B
2 W,I E B A
3 H,I B C B
Given a column 1 that has A, B, and C values, how can I create column 2 under this condition:
- If column 1 is either A or B, column 2 should alternate F, F1, F, F1 (every second cell is F1); otherwise it keeps the value from column 1 (C).
We can use transform with ifelse
transform(df, Col2 = ifelse(Col1 %in% c("A", "B"), c("F", "F1"), Col1))
# Col1 Col2
#1 A F
#2 A F1
#3 A F
#4 A F1
#5 A F
#6 A F1
#7 B F
#8 B F1
#9 B F
#10 B F1
#11 B F
#12 C C
#13 C C
#14 C C
#15 C C
Probably, using it by group is more appropriate.
library(dplyr)
df %>%
group_by(Col1) %>%
mutate(Col2 = ifelse(Col1 %in% c("A", "B"), c("F", "F1"), Col1))
data
df <- data.frame(Col1 = rep(c("A", "B", "C"), 6:4), stringsAsFactors = FALSE)
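In the ungrouped version the F/F1 pattern is recycled over the whole column, so where a group starts depends on the rows above it. In the example data group A has an even number of rows (6), so B happens to start on F. A base R sketch that restarts the pattern in each group, shown on made-up group sizes (5, 4, 3) where the difference is visible; pos is a helper name chosen here:

```r
# Hypothetical data: group A has an odd number of rows
df <- data.frame(Col1 = rep(c("A", "B", "C"), c(5, 4, 3)), stringsAsFactors = FALSE)

# Position of each row within its own group: 1, 2, 3, ...
pos <- ave(seq_along(df$Col1), df$Col1, FUN = seq_along)

# Odd positions get "F", even positions get "F1"; other groups keep their value
df$Col2 <- ifelse(df$Col1 %in% c("A", "B"),
                  c("F", "F1")[(pos - 1) %% 2 + 1],
                  df$Col1)
```

Here the first B row gets F, whereas recycling over the whole column would give it F1.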
The dataframe I am working on is coded in dyadic format, where each observation (i.e., row) contains a source node (from) and a target node (to) along with some other dyadic covariates (such as the dyadic correlation, corr).
For simplicity's sake, I want to treat each dyad as un-ordered and generate a unique identifier for each dyad, like the one (i.e., df1) below:
# original data
df <- data.frame(
from = c("A", "A", "A", "B", "C", "A", "D", "E", "F", "B"),
to = c("B", "C", "D", "C", "B", "B", "A", "A", "A", "A"),
corr = c(0.5, 0.7, 0.2, 0.15, 0.15, 0.5, 0.2, 0.45, 0.54, 0.5))
from to corr
1 A B 0.50
2 A C 0.70
3 A D 0.20
4 B C 0.15
5 C B 0.15
6 A B 0.50
7 D A 0.20
8 E A 0.45
9 F A 0.54
10 B A 0.50
# desired format
df1 <- data.frame(
from = c("A", "A", "A", "B", "C", "A", "D", "E", "F", "B"),
to = c("B", "C", "D", "C", "B", "B", "A", "A", "A", "A"),
corr = c(0.5, 0.7, 0.2, 0.15, 0.15, 0.5, 0.2, 0.45, 0.54, 0.5),
dyad = c(1, 2, 3, 4, 4, 1, 3, 5, 6, 1))
from to corr dyad
1 A B 0.50 1
2 A C 0.70 2
3 A D 0.20 3
4 B C 0.15 4
5 C B 0.15 4
6 A B 0.50 1
7 D A 0.20 3
8 E A 0.45 5
9 F A 0.54 6
10 B A 0.50 1
where dyads A-B/B-A and A-D/D-A are treated as identical pairs and are assigned the same dyad identifier.
While it's easy to extract a list of un-ordered pairs from the original data, it's hard to map them onto the original dataframe to generate un-ordered dyad identifiers. Could anyone offer some insights on this?
One dplyr option could be:
df %>%
mutate(dyad = group_indices(., paste0(pmax(from, to), pmin(from, to))))
from to corr dyad
1 A B 0.50 1
2 A C 0.70 2
3 A D 0.20 4
4 B C 0.15 3
5 C B 0.15 3
6 A B 0.50 1
7 D A 0.20 4
8 E A 0.45 5
9 F A 0.54 6
10 B A 0.50 1
Or:
df %>%
mutate(dyad = dense_rank(paste0(pmax(from, to), pmin(from, to))))
However, if you need to assign the identifiers in a specific order (meaning that the identifiers hold some information on their own), then the solution from @Ronak Shah could be better for you.
One way using apply could be to sort and paste the values in the two columns, convert them to a factor and then to integer to get a unique number for each combination.
df$temp <- apply(df[1:2], 1, function(x) paste(sort(x), collapse = "_"))
df$dyad <- as.integer(factor(df$temp, levels = unique(df$temp)))
df$temp <- NULL
df
# from to corr dyad
#1 A B 0.50 1
#2 A C 0.70 2
#3 A D 0.20 3
#4 B C 0.15 4
#5 C B 0.15 4
#6 A B 0.50 1
#7 D A 0.20 3
#8 E A 0.45 5
#9 F A 0.54 6
#10 B A 0.50 1
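A vectorised base R variant of the same idea, in case apply is slow on a large data frame: pmin and pmax also work on character vectors, and match against unique numbers the pairs in order of first appearance. This is a sketch that assumes from and to are character columns (not factors); key is a name chosen here.

```r
df <- data.frame(
  from = c("A", "A", "A", "B", "C", "A", "D", "E", "F", "B"),
  to   = c("B", "C", "D", "C", "B", "B", "A", "A", "A", "A"),
  corr = c(0.5, 0.7, 0.2, 0.15, 0.15, 0.5, 0.2, 0.45, 0.54, 0.5),
  stringsAsFactors = FALSE
)

# Order-independent key: smaller label first, larger label second
key <- paste(pmin(df$from, df$to), pmax(df$from, df$to), sep = "_")

# Identifier = position of the key among the distinct keys, in order of appearance
df$dyad <- match(key, unique(key))
```

On this data the result is the same numbering as the desired df1 in the question.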
Let's assume there are 2 columns in each of two huge dataframes (of different lengths), like:
df1 df2
A 1 C X
A 1 D X
B 4 C X
A 1 F X
B 4 A X
B 4 B X
C 7 B X
Each time there is a match in the 1st columns, X should be replaced with the data from column 2 of df1. If the 1st column of df2 contains elements that are not in the first column of df1 (F, D), X should be replaced with 0.
Since the dataframes are huge, a loop within a loop would not be practical.
The solution should look like this:
df1 df2
A 1 C 7
A 1 D 0
B 4 C 7
A 1 F 0
B 4 A 1
B 4 B 4
C 7 B 4
Thank You in advance
As there are duplicate rows in 'df1', we can get the unique rows
df3 <- unique(df1)
Then, use match to get the index
i1 <- match(df2$Col1, df3$Col1)
and based on the index, assign
df2$Col2 <- df3$Col2[i1]
If there are no matches, it would be NA, which can be changed to 0
df2$Col2[is.na(df2$Col2)] <- 0
df2
# Col1 Col2
#1 C 7
#2 D 0
#3 C 7
#4 F 0
#5 A 1
#6 B 4
#7 B 4
Or this can be done with data.table: remove the placeholder Col2 from 'df2', then join on 'Col1' and assign Col2 from 'df3' by reference. Rows of 'df2' without a match are left as NA and can be set to 0 as shown above.
library(data.table)
setDT(df2)[, Col2 := NULL][df3, Col2 := Col2, on = .(Col1)]
data
df1 <- structure(list(Col1 = c("A", "A", "B", "A", "B", "B", "C"), Col2 = c(1,
1, 4, 1, 4, 4, 7)), class = "data.frame", row.names = c(NA, -7L
))
df2 <- structure(list(Col1 = c("C", "D", "C", "F", "A", "B", "B"), Col2 = c("X",
"X", "X", "X", "X", "X", "X")), class = "data.frame", row.names = c(NA,
-7L))
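The same lookup can also be written with a named vector, which some find easier to read. This is just another way of spelling the match approach above; lookup and out are names chosen here.

```r
df1 <- data.frame(Col1 = c("A", "A", "B", "A", "B", "B", "C"),
                  Col2 = c(1, 1, 4, 1, 4, 4, 7),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Col1 = c("C", "D", "C", "F", "A", "B", "B"),
                  Col2 = "X",
                  stringsAsFactors = FALSE)

# One entry per distinct key: values named by Col1
df3    <- unique(df1)
lookup <- setNames(df3$Col2, df3$Col1)

# Indexing by name returns NA for keys that are not in the lookup
out <- unname(lookup[df2$Col1])
out[is.na(out)] <- 0

df2$Col2 <- out
```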
I have a data frame as follows:
Col1 Col2 Col3
A B C
D E F
G H I
I am trying to keep lines matching 'B' in 'Col2' OR F in 'Col3', in order to get:
Col1 Col2 Col3
A B C
D E F
I tried:
data[(grep("B",data$Col2) || grep("F",data$Col3)), ]
but it returns the entire data frame.
NOTE: it works when calling the 2 grep one at a time.
Or using a single grepl after pasting the columns
df1[with(df1, grepl("B|F", paste(Col2, Col3))),]
# Col1 Col2 Col3
#1 A B C
#2 D E F
with(df1, df1[ Col2 == 'B' | Col3 == 'F',])
# Col1 Col2 Col3
# 1 A B C
# 2 D E F
Using grepl
with(df1, df1[ grepl( 'B', Col2) | grepl( 'F', Col3), ])
# Col1 Col2 Col3
# 1 A B C
# 2 D E F
Data:
df1 <- structure(list(Col1 = c("A", "D", "G"), Col2 = c("B", "E", "H"
), Col3 = c("C", "F", "I")), .Names = c("Col1", "Col2", "Col3"
), row.names = c(NA, -3L), class = "data.frame")
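As for why the attempt in the question returns the whole data frame: grep returns row positions (integers), and || collapses its operands to a single TRUE or FALSE. Here that is 1 || 2, which is TRUE, and indexing rows with a single TRUE is recycled to every row. The sketch below shows the pieces, plus two ways to combine the conditions element-wise (union on positions, or | on grepl results).

```r
df1 <- data.frame(Col1 = c("A", "D", "G"),
                  Col2 = c("B", "E", "H"),
                  Col3 = c("C", "F", "I"),
                  stringsAsFactors = FALSE)

idx_b <- grep("B", df1$Col2)  # positions, here 1
idx_f <- grep("F", df1$Col3)  # positions, here 2

collapsed <- idx_b || idx_f   # a single TRUE, so df1[collapsed, ] is all rows
all_rows  <- df1[collapsed, ]

# Option 1: combine the positions
res_idx <- df1[sort(union(idx_b, idx_f)), ]

# Option 2: combine logical vectors element-wise with |
keep    <- grepl("B", df1$Col2) | grepl("F", df1$Col3)
res_lgl <- df1[keep, ]
```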
The data.table package makes this type of operation trivial due to its compact and readable syntax. Here is how you would perform the above using data.table:
> df1 <- structure(list(Col1 = c("A", "D", "G"), Col2 = c("B", "E", "H"
+ ), Col3 = c("C", "F", "I")), .Names = c("Col1", "Col2", "Col3"
+ ), row.names = c(NA, -3L), class = "data.frame")
> library(data.table)
> DT <- data.table(df1)
> DT
Col1 Col2 Col3
1: A B C
2: D E F
3: G H I
> DT[Col2 == 'B' | Col3 == 'F']
Col1 Col2 Col3
1: A B C
2: D E F
>
data.table performs its matching operations with with=TRUE by default, so column names can be used directly inside [ ]. Note that the matching is much faster if you set keys on the data, but that is a topic for another time.