Select row based on value of another row in R

EDIT: To make myself clear, I know how to select individual rows, and I know there are many different ways of doing it. I want to write code that works no matter what the actual values in the rows are, so that it works on a larger data frame without my having to change the code based on the content. So instead of saying "select row 1, then row 3", it would say "select row 1, then row [value in row 1, column Z], then row [value in column Z of the row just selected]", and so on. So my question is: how do I tell R to read that value as a row number?
I'm trying to figure out how to select and save a row based on a value in another row, so that I get a new df with row 1 (aA), then go to row 3 and save it (cC), then go to row 2, etc.
X Y Z
a A 3
b B 5
c C 2
d D 1
e E NA
Knowing the row number, I can use rbind, which gives me the following:
rbind(df[1, ], df[3, ])
a A 3
c C 2
But I want R to extract the number 3 from the column, rather than being told explicitly which row to pick. How do I do that?
Thanks

You can use a while loop to keep on selecting rows until NA occurs or all the rows are selected in the dataframe.
all_rows <- 1
next_row <- df$Z[all_rows]
# Continue only while the next row is known AND we have not yet collected every row
while(!is.na(next_row) && length(all_rows) < nrow(df)) {
  all_rows <- c(all_rows, next_row)
  next_row <- df$Z[all_rows[length(all_rows)]]
}
result <- df[all_rows, ]
# X Y Z
#1 a A 3
#3 c C 2
#2 b B 5
#5 e E NA
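A variation on the loop above, as a minimal sketch rather than a definitive implementation: wrapping it in a function and stopping as soon as a row would be visited twice guards against values in Z that point back to an earlier row. The function name follow_rows and its arguments start and col are made up for this example; the data frame is retyped from the question.

```r
# Hypothetical helper: follow the row numbers stored in column `col`,
# starting at row `start`, until an NA or an already visited row is reached.
follow_rows <- function(df, start = 1, col = "Z") {
  visited <- integer(0)
  current <- start
  while (!is.na(current) && !(current %in% visited)) {
    visited <- c(visited, current)
    current <- df[[col]][current]  # an out-of-range index yields NA and ends the loop
  }
  df[visited, ]
}

df <- data.frame(X = letters[1:5], Y = LETTERS[1:5], Z = c(3, 5, 2, 1, NA))
result <- follow_rows(df)
rownames(result)
# [1] "1" "3" "2" "5"
```

Tracking the visited rows means a cycle such as Z = c(2, 1, NA) ends after rows 1 and 2 instead of looping.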

If you know which values of which column you want, you can use:
df <- read.table(textConnection('X Y Z
a A 3
b B 5
c C 2
d D 1
e E NA'),
header = TRUE)
desired_rows <- c('a','c')
df2 <- df[df$X %in% desired_rows,]
df2
Output:
X Y Z
<fct> <fct> <int>
1 a A 3
2 c C 2

Related

R - Merging and aligning two CSVs using common values in multiple columns

I currently have two .csv files that look like this:
File 1:
Attempt         Result
Intervention 1  B
Intervention 2  H
and File 2:
Name      Outcome 1  Outcome 2  Outcome 3
Sample 1  A          B          C
Sample 2  D          E          F
Sample 3  G          H          I
I would like to merge and align the two .csvs such that each row of File 1 is aligned, by its "Result" cell, against any of the three "Outcome" columns in File 2, leaving blanks or NAs where there is no match.
Ideally, it would look like this:
Attempt         Result  Name      Outcome 1  Outcome 2  Outcome 3
Intervention 1  B       Sample 1  A          B          C
                        Sample 2  D          E          F
Intervention 2  H       Sample 3  G          H          I
I've looked and only found answers when merging two .csv files with one common column. Any help would be very appreciated.
I will assume that "Result" in File 1 is unique, since several File 1 rows with the same result value (e.g. "B") would force us to add new columns to the final data frame.
With that assumption:
Attempt <- c("Intervention 1","Intervention 2")
Result <- c("B","H")
df1 <- as.data.frame(cbind(Attempt,Result))
one <- c("Sample 1","A","B","C")
two <- c("Sample 2","D","E","F")
three <- c("Sample 3","G","H","I")
df2 <- as.data.frame(rbind(one,two,three))
row.names(df2) <- 1:3
colnames(df2) <- c("Name","Outcome 1","Outcome 2","Outcome 3")
vec_at <- rep(NA, nrow(df2)); vec_res <- rep(NA, nrow(df2)) # Define NA vectors
for (j in 1:nrow(df2)){
  a <- which(is.element(df1$Result, unlist(df2[j, 2:4]))) # Rows of df1 whose Result appears among the outcomes in row j of df2
  if (length(a) >= 1){ # "a" is empty if no element satisfies the condition
    vec_at[j] <- df1$Attempt[a] # fill the vectors with the wanted information
    vec_res[j] <- df1$Result[a]
  }
}
desired_df <- as.data.frame(cbind(vec_at,vec_res,df2)) # define your wanted data frame
Output:
vec_at vec_res Name Outcome 1 Outcome 2 Outcome 3
1 Intervention 1 B Sample 1 A B C
2 <NA> <NA> Sample 2 D E F
3 Intervention 2 H Sample 3 G H I
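I wonder if you could use fuzzyjoin for something like this.
It lets you provide a custom function for matching between the two data frames. Note that the example below assumes the column names use underscores instead of spaces (Outcome_1 rather than "Outcome 1").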
library(fuzzyjoin)
fuzzy_left_join(
df2,
df1,
match_fun = NULL,
multi_by = list(x = paste0("Outcome_", 1:3), y = "Result"),
multi_match_fun = function(x, y) {
y == x[, "Outcome_1"] | y == x[, "Outcome_2"] | y == x[, "Outcome_3"]
}
)
Output
Name Outcome_1 Outcome_2 Outcome_3 Attempt Result
1 Sample_1 A B C Intervention_1 B
2 Sample_2 D E F <NA> <NA>
3 Sample_3 G H I Intervention_2 H

Comparing elements from different columns but from the same data frame with R

I am trying to determine sequence similarity.
I would like to create a function to compare df elements, for the following example:
V1 V2 V3 V4
1 C D A D
2 A A S E
3 V T T V
4 A T S S
5 C D R Y
6 C A D V
7 V T E T
8 A T A A
9 R V V W
10 W R D D
I want to compare the first element from the first column with a first element from the second column. If it matches == 1, else 0. Then the second element from the first column compared with the second element from the second column. and so on.
For example:
C != D -----0
A == A -----1
That way I would like to compare column 1 with column 2 then column 3 and column 4.
Then column 2 compare with column 3 and column 4.
Then column 3 with column 4.
The output would be just the numbers:
0
1
0
0
0
0
0
0
0
0
I tried the following but it doesn't work:
compared_df <- ifelse(df_trial$V1==df_trial$V2,1,ifelse(df_trial$V1==df_trial$V2,0,NA))
compared_df
As suggested, I tried the following:
compared_df1 <- df_trial$matches <- as.integer(df_trial$V1 == df_trial$V2)
This works well for small sample comparison. Is there a way to compare more globally? Like for the updated columns.
As @Ronak Shah said in the comments, the following is sufficient if you want to compare two columns:
df$matches <- as.integer(df$V1 == df$V2)
Another option, which also works for more than two columns, is to use apply to check whether each row contains only one unique element:
df$matches = apply(df, 1, function(x) as.integer(length(unique(x)) == 1))
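To compare every pair of columns (V1 with V2, V3 and V4, then V2 with V3 and V4, then V3 with V4), one possible sketch uses combn to enumerate the column pairs. The data frame is retyped from the question; the names pairs and pair_matches are made up for this example.

```r
df <- data.frame(
  V1 = c("C", "A", "V", "A", "C", "C", "V", "A", "R", "W"),
  V2 = c("D", "A", "T", "T", "D", "A", "T", "T", "V", "R"),
  V3 = c("A", "S", "T", "S", "R", "D", "E", "A", "V", "D"),
  V4 = c("D", "E", "V", "S", "Y", "V", "T", "A", "W", "D"),
  stringsAsFactors = FALSE
)

# Each column of `pairs` holds the names of one pair of columns to compare
pairs <- combn(names(df), 2)

# One 0/1 column per pair: 1 where the two columns agree in that row
pair_matches <- apply(pairs, 2, function(p) as.integer(df[[p[1]]] == df[[p[2]]]))
colnames(pair_matches) <- apply(pairs, 2, paste, collapse = "_vs_")

pair_matches[, "V1_vs_V2"]
# [1] 0 1 0 0 0 0 0 0 0 0
```

The result is a matrix with one row per data frame row and one column per pair, so colSums(pair_matches) gives the number of matching positions for each pair.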

R: How to merge two rows based on specific values

given the example table,
ID T A B X Y Z
1  S 1
2  S 2
1  E   4 a b c
3  S 5
2  E   8 d e f
and the assumptions:
for the same ID there is a pair of rows (first row T == S; second row T == E)
in the first row (T == S) the columns ID, T, A have values
in the second row (T == E) the columns ID, T, B, X, Y, Z have values
the two row pairs are not necessarily below each other
I try to do the following:
Look for rows with the same ID
and merge the values (into the row T == S)
remove rows with T == E // since merged with other row
The result would look like this
ID T A B X Y Z
1 S 1 4 a b c
2 S 2 8 d e f
3 S 5
...
Currently I use two nested for-loops, which is too slow. Does anybody have an idea that is faster than two nested for-loops?
See the answers to the question "Combine rows and sum their values".
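Since the linked answer is not reproduced here, the following is only a sketch of one loop-free approach in base R: split the rows by T and join the two halves on ID with merge. It assumes the blank cells in the table are NA; the names starts, ends and merged are hypothetical.

```r
# Example data retyped from the question, with blanks assumed to be NA
df <- data.frame(
  ID = c(1, 2, 1, 3, 2),
  T  = c("S", "S", "E", "S", "E"),
  A  = c(1, 2, NA, 5, NA),
  B  = c(NA, NA, 4, NA, 8),
  X  = c(NA, NA, "a", NA, "d"),
  Y  = c(NA, NA, "b", NA, "e"),
  Z  = c(NA, NA, "c", NA, "f"),
  stringsAsFactors = FALSE
)

# Keep only the columns that carry values in each kind of row
starts <- df[df$T == "S", c("ID", "T", "A")]
ends   <- df[df$T == "E", c("ID", "B", "X", "Y", "Z")]

# Left join: every S row is kept, E values are filled in where an ID matches
merged <- merge(starts, ends, by = "ID", all.x = TRUE)
```

With all.x = TRUE, an ID that has no E row (ID 3 here) stays in the result with NA in B, X, Y and Z.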

How to create a table in R that includes column totals

I'm somewhat new to R programming and am in need of assistance.
I'm looking to take the sum of 4 columns in a dataframe and list these totals in a simple table.
Essentially, take the sum of 4 columns (A, B, C, D) and list the total in a table (table = column 1: A, B, C, D column 2: sum of column A, B, C, D) - something along the lines of:
A = 3
B = 4
C = 4
D = 3
Does anyone know how to get this output? Also, the less "manual" the response, the better (i.e. trying to avoid having to input several lines of code to get this output if possible).
Thank you.
If your data looks like this:
a <- c(1:4)
b <- c(2:5)
c <- c(3:6)
d <- c(4:7)
df <- data.frame(a,b,c,d)
a b c d
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
Use
> res <- sapply(df,sum)
to get
a b c d
10 14 18 22
To apply the function only to numeric columns, try
> res <- colSums(df[sapply(df,is.numeric)])
There is colSums:
colSums(Filter(is.numeric, df))
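If the totals should end up in a two-column table (column name, sum) as described in the question, one way, sketched here, is to wrap the named vector from colSums in a data frame. The names totals, totals_table, column and total are placeholders chosen for this example.

```r
df <- data.frame(a = 1:4, b = 2:5, c = 3:6, d = 4:7)

# Named numeric vector of sums, restricted to numeric columns
totals <- colSums(Filter(is.numeric, df))

# Two-column table: one row per original column
totals_table <- data.frame(column = names(totals), total = unname(totals))
totals_table
#   column total
# 1      a    10
# 2      b    14
# 3      c    18
# 4      d    22
```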

Find row with lowest value in COLUMNA, return row's COLUMNB

I have a feeling this is one of those stupidly easy things where I'm just not using a function I should be.
Here's the relevant part of the function:
min(DATASET$COLUMNNAME, na.rm = TRUE)
Right now, it reports the correct value from COLUMNNAME: the lowest value in that column. Great. However, what I really want it to do is look across the data frame to that result's entry in column NAME and print that. It should not print the minimum value at all, just the entry in NAME for the row with COLUMNNAME's minimum value.
Is the best way to do it to get the row number of that minimum value somehow, and return DATASET$NAME[row,] ?
Looking for this maybe:
DATASET$NAME[DATASET$COLUMNNAME == min(DATASET$COLUMNNAME)]
That is, you select NAME from DATASET, where COLUMNNAME has the minimum value.
If you don't like repeating DATASET so many times, this is equivalent using with:
with(DATASET, NAME[COLUMNNAME == min(COLUMNNAME)])
The function you are looking for is which.min:
> set.seed(123)
> df<-data.frame(name=sample(LETTERS[1:10]),value=sample(10))
> df
name value
1 C 10
2 H 5
3 D 6
4 G 9
5 F 1
6 A 7
7 J 8
8 I 4
9 B 3
10 E 2
> df[which.min(df$value),]
name value
5 F 1
> df$name[which.min(df$value)]
[1] F
Levels: A B C D E F G H I J
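Two details are worth checking against your own data: which.min skips NA values and returns only the first position of the minimum, whereas the == comparison returns every tied row but needs na.rm = TRUE and which() once NAs are present. A small sketch with made-up data (the data frame scores and its columns are hypothetical):

```r
scores <- data.frame(
  name  = c("A", "B", "C", "D"),
  value = c(4, NA, 1, 1),
  stringsAsFactors = FALSE
)

# First row holding the minimum; the NA is ignored
first_min <- scores$name[which.min(scores$value)]
first_min
# [1] "C"

# All rows tied for the minimum; which() drops the NA comparison result
all_min <- scores$name[which(scores$value == min(scores$value, na.rm = TRUE))]
all_min
# [1] "C" "D"
```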
