How to merge two rows based on specific values in R

given the example table,
ID T A B X Y Z
1 S 1
2 S 2
1 E 4 a b c
3 S 5
2 E 8 d e f
and the assumptions:
for the same ID there is a pair of rows (first row T == S; second
row T == E)
in the first row (T == S) the columns ID, T, A have values
in the second row (T == E) the columns ID, T, B, X, Y, Z have values
the two rows of a pair are not necessarily adjacent to each other
I try to do the following:
Look for rows with the same ID
and merge the values (into the row T == S)
remove rows with T == E // since merged with other row
The result would look like this
ID T A B X Y Z
1 S 1 4 a b c
2 S 2 8 d e f
3 S 5
...
Currently I use two nested for-loops, which is too slow. Does anybody have an idea that is faster than two nested for-loops?
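One way to avoid the nested loops entirely is to split the data into the S-rows and the E-rows and join them on ID with merge(). A sketch, assuming the data frame is called df with the columns ID, T, A, B, X, Y, Z from the example table:

```r
# Rebuild the example data (NA where the table is blank)
df <- data.frame(
  ID = c(1, 2, 1, 3, 2),
  T  = c("S", "S", "E", "S", "E"),
  A  = c(1, 2, NA, 5, NA),
  B  = c(NA, NA, 4, NA, 8),
  X  = c(NA, NA, "a", NA, "d"),
  Y  = c(NA, NA, "b", NA, "e"),
  Z  = c(NA, NA, "c", NA, "f")
)

starts <- df[df$T == "S", c("ID", "T", "A")]            # rows with values in ID, T, A
ends   <- df[df$T == "E", c("ID", "B", "X", "Y", "Z")]  # rows with values in B, X, Y, Z
merged <- merge(starts, ends, by = "ID", all.x = TRUE)  # all.x keeps ID 3 (no E-row) with NAs
merged
```

merge() does the ID matching in one vectorised step, so the two loops disappear; all.x = TRUE preserves unpaired S-rows such as ID 3.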



select row based on value of another row in R

I'm trying to figure out how to select and save a row based on a value in another row, so that I get a new df with row 1 (aA), then go to row 3 and save it (cC), then go to row 2, etc.
EDIT: To make myself clear, I know how to select individual rows and I know there are many different ways of doing it. I want to write code that works no matter what the actual values of the rows are, so that it works over a larger dataframe and I don't have to change the code based on the content. So instead of saying "select row 1, then row 3", it should say "select row one, then row [value in row 1, column Z], then row [value in column Z of the row just selected]", and so on. My question is how to tell R to read that value as a row number.
X Y Z
a A 3
b B 5
c C 2
d D 1
e E NA
Knowing the row number, I can use rbind which will give me the following
rbind(df[1, ], df[3, ])
a A 3
c C 2
But I want R to extract the number 3 from the column not to explicitly tell it which row to pick - how do I do that?
Thanks
You can use a while loop to keep on selecting rows until NA occurs or all the rows are selected in the dataframe.
all_rows <- 1
next_row <- df$Z[all_rows]
while(!is.na(next_row) && length(all_rows) < nrow(df)) {
all_rows <- c(all_rows, next_row)
next_row <- df$Z[all_rows[length(all_rows)]]
}
result <- df[all_rows, ]
# X Y Z
#1 a A 3
#3 c C 2
#2 b B 5
#5 e E NA
If you know the values in a column that identify the rows you want, you can use %in%:
df <- read.table(textConnection('X Y Z
a A 3
b B 5
c C 2
d D 1
e E NA'),
header=T)
desired_rows <- c('a','c')
df2 <- df[df$X %in% desired_rows,]
df2
Output:
X Y Z
<fct> <fct> <int>
1 a A 3
2 c C 2

Comparing elements from different columns but from the same data frame with R

I am trying to determine sequence similarity.
I would like to create a function to compare df elements, for the following example:
V1 V2 V3 V4
1 C D A D
2 A A S E
3 V T T V
4 A T S S
5 C D R Y
6 C A D V
7 V T E T
8 A T A A
9 R V V W
10 W R D D
I want to compare the first element of the first column with the first element of the second column: if they match, the result is 1, else 0. Then the second element of the first column is compared with the second element of the second column, and so on.
For example:
C != D -----0
A == A -----1
That way I would like to compare column 1 with columns 2, 3, and 4, then column 2 with columns 3 and 4, then column 3 with column 4.
The output would be just the numbers:
0
1
0
0
0
0
0
0
0
0
I tried the following but it doesn't work:
compared_df <- ifelse(df_trial$V1==df_trial$V2,1,ifelse(df_trial$V1==df_trial$V2,0,NA))
compared_df
As suggested, I tried the following:
compared_df1 <- df_trial$matches <- as.integer(df_trial$V1 == df_trial$V2)
This works well for comparing two columns. Is there a way to compare more globally, across all the column pairs described above?
As @Ronak Shah said in the comments, the following is sufficient in the case where you want to compare 2 columns:
df$matches <- as.integer(df$V1 == df$V2)
Another option, which also works for more than 2 columns, is to use apply to check whether all values in each row are identical (i.e. the row has only one unique element):
df$matches = apply(df, 1, function(x) as.integer(length(unique(x)) == 1))
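For the pairwise comparisons the question asks about (V1 vs V2, V1 vs V3, ..., V3 vs V4), one option is to generate every pair of column names with combn() and compare each pair. A sketch, using the first two rows of the example data:

```r
# Small sample of the example data (first two rows)
df <- data.frame(V1 = c("C", "A"), V2 = c("D", "A"),
                 V3 = c("A", "S"), V4 = c("D", "E"))

pairs <- combn(names(df), 2)  # 2 x 6 matrix: every pair of column names
# For each pair, compare the two columns element-wise (1 = match, 0 = no match)
res <- apply(pairs, 2, function(p) as.integer(df[[p[1]]] == df[[p[2]]]))
colnames(res) <- apply(pairs, 2, paste, collapse = "_")
res
```

Each column of res holds the 0/1 vector for one pair of columns, so res[, "V1_V2"] reproduces the 0, 1 from the question's example.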

How to create a table in R that includes column totals

I'm somewhat new to R programming and am in need of assistance.
I'm looking to take the sum of 4 columns in a dataframe and list these totals in a simple table.
Essentially, take the sum of 4 columns (A, B, C, D) and list the totals in a table (column 1: the names A, B, C, D; column 2: the corresponding sums), something along the lines of:
A = 3
B = 4
C = 4
D = 3
Does anyone know how to get this output? Also, the less "manual" the response, the better (i.e. trying to avoid having to input several lines of code to get this output if possible).
Thank you.
If your data looks like this:
a <- c(1:4)
b <- c(2:5)
c <- c(3:6)
d <- c(4:7)
df <- data.frame(a,b,c,d)
a b c d
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
Use
> res <- sapply(df,sum)
to get
a b c d
10 14 18 22
In order to apply the function only to the numeric columns, try
> res <- colSums(df[sapply(df,is.numeric)])
There is colSums:
colSums(Filter(is.numeric, df))
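If you want the result as a small two-column table rather than a named vector, the sums can be wrapped in a data.frame. A sketch, using the same example data:

```r
# Example data from the answer above
a <- c(1:4); b <- c(2:5); c <- c(3:6); d <- c(4:7)
df <- data.frame(a, b, c, d)

totals <- colSums(df[sapply(df, is.numeric)])           # named vector: a b c d
tab <- data.frame(column = names(totals),               # column 1: the names
                  total  = unname(totals))              # column 2: the sums
tab
```

This gives one row per column with its total, which matches the "A = 3, B = 4, ..." shape the question sketches.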

Assigning random rows from a dataframe into 2 other dataframes in R

I have a dataframe (a) as mentioned below:
V1 V2
1 a b
2 a e
3 a f
4 b c
5 b e
6 b f
7 c d
8 c g
9 c h
10 d g
11 d h
12 e f
13 f g
14 g h
Now what I want is to randomly assign rows from the above dataframe (a) to 2 other empty dataframes (b and c) such that none of the rows are repeated: neither b nor c has any repeated row, and no row appears in both (a row in b shouldn't be present in c and vice versa).
One way is to sample 7 rows from (a) without replacement, assign them to (b), and then assign the remaining rows to (c). But in this approach all rows would be assigned to (b) at once and then to (c), whereas what I want is to assign rows one by one: a random row to (b), then a random row to (c), then again a random row to (b), and so on until all rows in dataframe (a) are used.
Thanks
Sampling all of the row numbers and then partitioning them by the parity of their position in the sample achieves what you are after: it is equivalent to assigning random rows one by one, alternating between the two dataframes.
n <- nrow(df)
s <- sample.int(n, n)
odd.idxs <- seq_along(s) %% 2 != 0   # TRUE for the 1st, 3rd, 5th, ... draws
s1 <- s[odd.idxs]
s2 <- s[!odd.idxs]                   # note: !, not -, since odd.idxs is logical
d1 <- df[s1, ]
d2 <- df[s2, ]
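The same alternating split can also be written with split(), which hands out the shuffled row numbers alternately and guarantees the two halves are disjoint. A sketch with n = 14, the number of rows in the example dataframe (a):

```r
set.seed(42)                                 # for a reproducible shuffle
n <- 14                                      # nrow(a) in the example
s <- sample(n)                               # random order of the row numbers
parts <- split(s, rep(1:2, length.out = n))  # alternate: 1st, 3rd, ... vs 2nd, 4th, ...
# b <- a[parts[[1]], ]
# c <- a[parts[[2]], ]
```

Because parts[[1]] and parts[[2]] partition the sample, every row of (a) lands in exactly one of the two dataframes.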

Find row with lowest value in COLUMNA, return row's COLUMNB

I have a feeling this is one of those stupidly easy things where I'm just not using a function I should be.
Here's the relevant part of the function:
min(DATASET$COLUMNNAME, na.rm = TRUE)
Right now, it reports the correct value from COLUMNNAME: the lowest value in that column. Great. However, what I really want is to look across the dataframe to that result's entry in column NAME and print that. It should not print the minimum value at all, just the entry in NAME for the row with COLUMNNAME's minimum value.
Is the best way to do this to get the row number of that minimum value somehow, and return DATASET$NAME[row]?
Looking for this maybe:
DATASET$NAME[DATASET$COLUMNNAME == min(DATASET$COLUMNNAME)]
That is, you select NAME from DATASET, where COLUMNAME has the minimum value.
If you don't like repeating DATASET so many times, this is equivalent using with:
with(DATASET, NAME[COLUMNNAME == min(COLUMNNAME)])
The function you are looking for is which.min:
> set.seed(123)
> df<-data.frame(name=sample(LETTERS[1:10]),value=sample(10))
> df
name value
1 C 10
2 H 5
3 D 6
4 G 9
5 F 1
6 A 7
7 J 8
8 I 4
9 B 3
10 E 2
> df[which.min(df$value),]
name value
5 F 1
> df$name[which.min(df$value)]
[1] F
Levels: A B C D E F G H I J
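One difference between the two answers worth knowing: which.min returns only the first minimum, while the == comparison returns every row that ties for the minimum. A small illustration (assuming R >= 4.0, so the name column stays character rather than becoming a factor):

```r
# Data with a tied minimum in `value`
df2 <- data.frame(name = c("A", "B", "C"), value = c(1, 1, 2))

df2$name[which.min(df2$value)]           # "A"       (first minimum only)
df2$name[df2$value == min(df2$value)]    # "A" "B"   (all tied minima)
```

So pick which.min when you want exactly one row, and the == form when you want all rows sharing the minimum.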
