Sum cells of certain columns for each row in R

I would like to calculate the sum of certain columns and then apply that summation to every row. Unfortunately, I only got as far as a single row. How do I now make it happen for each row? I know R usually doesn't need explicit loops; what are good approaches?
My matrix (zscore) looks like this:
  a  b c t y
1 3  4 7 7 4
2 4 56 6 6 4
3 3  3 2 1 7
4 3 88 9 9 9
Now I would want to calculate the row sum for each row, based on some of the columns. For one row it could look like this:
f1 <- sum(zscore[1,1:2], zscore[1,3], zscore[1,5])
How do I do that now for each row?

You could do something like this:
summed <- rowSums(zscore[, c(1, 2, 3, 5)])
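If you prefer to select the columns by name (assuming the column names shown in the question) and want the sum attached as a new column, something like this should also work:
# same row sums, selecting columns a, b, c and y by name
zscore <- cbind(zscore, summed = rowSums(zscore[, c("a", "b", "c", "y")]))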

The row-wise sum can also be computed with dplyr's row-wise operations (here col1, col2, col3 stand for the three selected columns):
library(tidyverse)
df <- df %>%
  rowwise() %>%
  mutate(rowsum = sum(c(col1, col2, col3)))
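With dplyr 1.0 or later, the same row-wise sum is usually written with c_across(); col1, col2, col3 are again the placeholder column names:
df <- df %>%
  rowwise() %>%
  mutate(rowsum = sum(c_across(c(col1, col2, col3)))) %>%
  ungroup()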

If you don't have NAs, you can simply add the columns:
suma.zscore <- zscore$a + zscore$c + zscore$t + zscore$y
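If NAs are possible, rowSums() with na.rm = TRUE is a drop-in alternative for the same columns:
# NAs are ignored instead of propagating into the sum
suma.zscore <- rowSums(zscore[, c("a", "c", "t", "y")], na.rm = TRUE)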

Related

How to filter rows where there are changes in a categorical variable

We have a dataframe called df1 with 2 columns: Time, which is arranged in ascending order, and Place, which describes where the individual was:
df1 <- data.frame(Time = seq(1, 20, 1),
                  Place = rep(letters[c(1:3, 1)], c(5, 5, 3, 7)))
Since this data is in ascending order with respect to Time, we want to subset the rows where Place changes from the previous observation.
The resulting dataframe for this data would look like this:
Time Place
1 a
6 b
11 c
14 a
Notice that the same Place can show up later, like Place == a did in this example. How can we perform this kind of subset in R?
Apply duplicated() to the rleid() of 'Place':
library(dplyr)
library(data.table)
df1 %>%
  filter(!duplicated(rleid(Place)))
Or in base R with rle
subset(df1, !duplicated(with(rle(Place), rep(seq_along(values), lengths))))
Output:
   Time Place
1     1     a
6     6     b
11   11     c
14   14     a
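To see why this works: rleid() assigns a new id each time the value changes, so duplicated() flags every row after the first of each run. A tiny illustration (with data.table loaded):
rleid(c("a", "a", "b", "b", "a"))
# [1] 1 1 2 2 3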
Another base R option using subset + tail + head
subset(
  df1,
  c(TRUE, tail(Place, -1) != head(Place, -1))
)
which gives
   Time Place
1     1     a
6     6     b
11   11     c
14   14     a
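Another dplyr variant (a sketch, not one of the original answers) keeps the first row plus every row whose Place differs from the previous one, using lag():
library(dplyr)
df1 %>%
  filter(row_number() == 1 | Place != lag(Place))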

R Fill empty cells of separate column with matching ID [duplicate]

I have just added new data to my original data frame, so simplified, it looks like this:
df <- data.frame(
  ID = rep(letters[1:4], times = c(4, 4, 5, 3)),
  Volume = c(1.23, NA, NA, NA, 4.74, NA, NA, NA, 5.35, NA, NA, NA, NA, 1.53, NA, NA)
)
df
   ID Volume
1   a   1.23
2   a     NA
3   a     NA
4   a     NA
5   b   4.74
6   b     NA
7   b     NA
8   b     NA
9   c   5.35
10  c     NA
11  c     NA
12  c     NA
13  c     NA
14  d   1.53
15  d     NA
16  d     NA
where I have an ID column with differing numbers of entries per ID and a Volume value for one entry, but not the others.
Is there a way to populate the empty Volume cells with the filled cell of the corresponding ID?
I'm essentially trying to remove the step of going into Excel and using "drag to fill" for each ID (I have over 2000 IDs). Not every ID has the same number of entries (e.g. ID "a" has 4, whereas ID "c" has 5 and ID "d" has 3).
I'm thinking dplyr will have a tool to do this, but I have not been able to find the answer.
In the tidyverse:
library(tidyverse)
df %>%
  group_by(ID) %>%
  fill(Volume)
Here is a base R solution, where ave() is used to fill in the NAs:
df <- within(df, Volume <- ave(Volume, ID, FUN = function(x) unique(x[!is.na(x)])))
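If a group could ever contain more than one distinct non-NA value, unique() would return a vector of the wrong length; taking the first non-NA value is a safer variant of the same idea (a sketch, same data assumed):
df <- within(df, Volume <- ave(Volume, ID, FUN = function(x) x[!is.na(x)][1]))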

Transform table [duplicate]

I would like to repeat entire rows in a data-frame based on the samples column.
My input:
df <- 'chr start end samples
1 10 20 2
2 4 10 3'
df <- read.table(text=df, header=TRUE)
My expected output:
df <- 'chr start end samples
1 10 20 1-10-20-s1
1 10 20 1-10-20-s2
2 4 10 2-4-10-s1
2 4 10 2-4-10-s2
2 4 10 2-4-10-s3'
Any idea how to do this cleanly?
We can use expandRows() to expand the rows based on the value in the 'samples' column, convert to a data.table, and then, grouped by 'chr', build the new 'samples' labels with sprintf() from the other columns and the row sequence.
library(splitstackshape)
setDT(expandRows(df, "samples"))[,
  samples := sprintf("%d-%d-%d-%s%d", chr, start, end, "s", 1:.N), chr][]
# chr start end samples
#1: 1 10 20 1-10-20-s1
#2: 1 10 20 1-10-20-s2
#3: 2 4 10 2-4-10-s1
#4: 2 4 10 2-4-10-s2
#5: 2 4 10 2-4-10-s3
NOTE: data.table will be loaded when we load splitstackshape.
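A tidyr-based alternative (a sketch, not from the original answers) uses uncount() to repeat the rows and row_number() within each chr group to build the labels:
library(dplyr)
library(tidyr)
df %>%
  uncount(samples, .remove = FALSE) %>%
  group_by(chr) %>%
  mutate(samples = paste0(chr, "-", start, "-", end, "-s", row_number())) %>%
  ungroup()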
You can achieve this using base R (i.e. avoiding data.tables), with the following code:
df <- 'chr start end samples
1 10 20 2
2 4 10 3'
df <- read.table(text = df, header = TRUE)
duplicate_rows <- function(chr, starts, ends, samples) {
  expanded_samples <- paste0(chr, "-", starts, "-", ends, "-", "s", 1:samples)
  repeated_rows <- data.frame("chr" = chr, "starts" = starts, "ends" = ends, "samples" = expanded_samples)
  repeated_rows
}
expanded_rows <- Map(f = duplicate_rows, df$chr, df$start, df$end, df$samples)
new_df <- do.call(rbind, expanded_rows)
The basic idea is to define a function that will take a single row from your initial data.frame and duplicate rows based on the value in the samples column (as well as creating the distinct character strings you're after). This function is then applied to each row of your initial data.frame. The output is a list of data.frames that then need to be re-combined into a single data.frame using the do.call pattern.
The above code can be made cleaner by using Hadley Wickham's purrr package (on CRAN) and the data.frame-specific version of map (see the documentation for the by_row function), but this may be overkill for what you're after.
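For reference, a purrr-based sketch of the same Map idea (assuming the df defined above); pmap_dfr() calls the function once per row, with arguments matching the column names:
library(purrr)
new_df <- pmap_dfr(df, function(chr, start, end, samples) {
  data.frame(chr = chr, start = start, end = end,
             samples = paste0(chr, "-", start, "-", end, "-s", seq_len(samples)))
})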
Example using the DataFrame function from the S4Vectors package:
df <- DataFrame(x=c('a', 'b', 'c', 'd', 'e'), y=1:5)
rep(df, df$y)
where the y column gives the number of times to repeat its corresponding row.
Result:
DataFrame with 15 rows and 2 columns
x y
<character> <integer>
1 a 1
2 b 2
3 b 2
4 c 3
5 c 3
... ... ...
11 e 5
12 e 5
13 e 5
14 e 5
15 e 5

Expand each row with specific value in tidyr [duplicate]

I have a dataset with grouped observations per row. However, I would like to expand each row observation from a single observation per replicate to a set number (in this case "20" observations each).
In the attached picture (not reproduced here), each replicate is a row. I would like to expand each row into 20, so "wellA" for "LS x SB" expands to 20 copies of the same line. As a bonus, I would also like to add a new column called "Replicate2" that lists 1 to 20 to index these 20 new rows per replicate.
The idea would then be to add the survival status per individual (reflected in the new columns "Status" and "Event").
I think the expand() function in tidyr has potential, but I can't figure out how to add a fixed number of rows per replicate; using the "Alive" column adds a variable number of observations.
expand <- DF %>% expand(nesting(Date, Time, Cumulative.hrs, Timepoint, Treatment, Boat, Parentage, Well, Mom, Dad, Cone, NumParents, Parents), Alive)
Any help appreciated!
In base R, we can use rep to repeat the rows and transform to add the new column:
n <- 20
transform(df[rep(seq_len(nrow(df)), each = n), ], Replicate2 = 1:n, row.names = NULL)
Using a reproducible example with n = 3
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
n <- 3
transform(df[rep(seq_len(nrow(df)), each = n), ], Replicate2 = 1:n, row.names = NULL)
# a b c Replicate2
#1 1 4 7 1
#2 1 4 7 2
#3 1 4 7 3
#4 2 5 8 1
#5 2 5 8 2
#6 2 5 8 3
#7 3 6 9 1
#8 3 6 9 2
#9 3 6 9 3
Using dplyr, we can use slice() to repeat the rows and mutate() to add the new column:
library(dplyr)
df %>%
  slice(rep(seq_len(n()), each = n)) %>%
  mutate(Replicate2 = rep(seq_len(n), length.out = n()))
Do a cross join between your existing data and the numbers 1:20.
tidyr::crossing(DF, replicate2 = 1:20)
If you want to add additional columns, use mutate:
... %>% mutate(status = 1, event = FALSE)
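Putting it together, a sketch of the full pipeline (Status and Event are just placeholders to be filled in with the real survival data; note that crossing() sorts and de-duplicates its inputs):
library(dplyr)
library(tidyr)
expanded <- DF %>%
  crossing(Replicate2 = 1:20) %>%
  mutate(Status = NA_character_, Event = NA_integer_)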

Unique Data Frame Based On Three Column Values

I have a 6449x743 data frame in which a few rows are repeated twice with the same column_X and column_Y values but a higher column_Z value on the second occurrence. I want to keep only the row with the higher column_Z.
I tried the following, but it doesn't get rid of the duplicates and still gives me a 6449x743 output.
output <- unique(Data[,c('column_X', 'column_Y', max('column_Z'))])
Ideally, the output should be (6449 - N)x743: the number of rows decreases while the number of columns stays the same, and the column_X/column_Y combinations become unique after filtering on column_Z.
If anyone has suggestions, please let me know. Thanks.
You can use !duplicated() with fromLast = TRUE on the relevant columns, like this:
df <- data.frame(a = c(1, 1, 2, 3, 4), b = c(2, 2, 3, 4, 5), c = 1:5)
df <- df[order(df$c), ] # make sure the data is sorted
a b c
1 1 2 1
2 1 2 2
3 2 3 3
4 3 4 4
5 4 5 5
df[!duplicated(df$a, fromLast = TRUE) & !duplicated(df$b, fromLast = TRUE), ]
a b c
2 1 2 2
3 2 3 3
4 3 4 4
5 4 5 5
Try
library(dplyr)
Data %>%
  group_by(column_X, column_Y) %>%
  filter(column_Z == max(column_Z))
It works with sample data:
set.seed(13)
df <- tibble(a = sample(1:4, 50, replace = TRUE),
             b = sample(1:3, 50, replace = TRUE),
             x = runif(50), y = rnorm(50))
df %>% group_by(a, b) %>% filter(x == max(x))
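With dplyr 1.0 or later, the same grouped maximum can be taken with slice_max() (a sketch, not part of the original answer):
Data %>%
  group_by(column_X, column_Y) %>%
  slice_max(column_Z, n = 1, with_ties = FALSE) %>%
  ungroup()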
Probably the easiest way would be to order the whole thing by column_Z and then remove the duplicates:
output <- Data[order(Data$column_Z, decreasing=TRUE),]
output <- output[!duplicated(paste(output$column_X, output$column_Y)),]
assuming I understood you correctly.
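A related variant (a sketch) avoids paste() by calling duplicated() on the two-column subset directly:
output <- Data[order(Data$column_Z, decreasing = TRUE), ]
output <- output[!duplicated(output[c("column_X", "column_Y")]), ]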
Here's an older answer which may be trying to accomplish the same thing that you are:
How to make a unique in R by column A and keep the row with maximum value in column B
Editing with relevant code:
A solution using package data.table:
set.seed(42)
dat <- data.frame(A = c('a', 'a', 'a', 'b', 'b'), B = c(1, 2, 3, 5, 200), C = rnorm(5))
library(data.table)
dat <- as.data.table(dat)
dat[, .SD[which.max(B)], by = A]
A B C
1: a 3 0.3631284
2: b 200 0.4042683
