Row-wise sum for columns with certain names - r

I have some sample data:
SampleID a b d f ca k l cb
1 0.1 2 1 2 7 1 4 3
2 0.2 3 2 3 4 2 5 5
3 0.5 4 3 6 1 3 9 2
I need to find the row-wise sum of columns that have something in common in their names, e.g. row-wise sum(a, ca) or row-wise sum(b, cb). The problem is that I have a large data.frame, and ideally I would be able to specify what the column headers have in common so that the code picks only those columns to sum.
Thank you in advance for any assistance.

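We can find the columns whose names contain 'a' with grep, subset them, and take rowSums; the same works for the 'b' columns. The first column (SampleID) is dropped before matching because its name also contains an 'a', and the + 1 shifts the matched positions back to indexes in the full data frame.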
rowSums(df1[grep('a', names(df1)[-1])+1])
rowSums(df1[grep('b', names(df1)[-1])+1])
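If there are many name patterns to sum, the grep approach above can be wrapped in a small helper. The following is a minimal sketch, not part of the original answer: `sum_matching` and its `exclude` argument are hypothetical names, and the anchored patterns "a$" and "b$" assume the shared part always sits at the end of the column name.

```r
# Sample data from the question
df1 <- data.frame(SampleID = 1:3,
                  a = c(0.1, 0.2, 0.5), b = c(2, 3, 4), d = c(1, 2, 3),
                  f = c(2, 3, 6), ca = c(7, 4, 1), k = c(1, 2, 3),
                  l = c(4, 5, 9), cb = c(3, 5, 2))

# Hypothetical helper: row-wise sum of all columns whose names match a regex,
# skipping any columns listed in 'exclude'
sum_matching <- function(data, pattern, exclude = "SampleID") {
  cols <- setdiff(grep(pattern, names(data), value = TRUE), exclude)
  rowSums(data[cols])
}

# One column of sums per pattern; "a$" matches names ending in 'a' (a, ca)
sums <- sapply(c(a = "a$", b = "b$"), sum_matching, data = df1)
sums
```

Anchoring the pattern with `$` avoids accidental matches such as the 'a' inside SampleID, which is why the unanchored version has to drop the first column by position.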

If you want the output as a data frame, try using dplyr:
# Recreating your sample data
df <- data.frame(SampleID = c(1, 2, 3),
                 a = c(0.1, 0.2, 0.5),
                 b = c(2, 3, 4),
                 d = c(1, 2, 3),
                 f = c(2, 3, 6),
                 ca = c(7, 4, 1),
                 k = c(1, 2, 3),
                 l = c(4, 5, 9),
                 cb = c(3, 5, 2))
Process the data
# load dplyr
library(dplyr)
# Sum across columns 'a' and 'ca' (sum(a, ca))
df2 <- df %>%
  # 'select' chooses the columns you want
  select(contains('a'), -SampleID) %>%
  # 'mutate' creates a new column 'row_sum' holding the sum of the selected
  # columns; use 'transmute' instead to drop the selected columns
  mutate(row_sum = rowSums(.))
df2 # have a look
a ca row_sum
1 0.1 7 7.1
2 0.2 4 4.2
3 0.5 1 1.5

Related

Combine two columns into one column, and have values in new column correspond to unique combinations of values in adjacent columns

I have two dataframes.
First dataframe:
#################################################
### create first data.frame
df1 <- data.frame(matrix(ncol = 5, nrow = 8))
### vector of column names
cols <- c("identifier1", "identifier2", "time", "lightsaber_length", "blaster_length")
### assign column names to df1
colnames(df1) <- cols
### generate random data for time column
tenths <- seq(from = 0, to = 10, by = .1)
sample_time <- sample(tenths, size = 8, replace = TRUE)
### generate random data for lightsaber_length and blaster_length columns
hundredths <- seq(from = 0, to = 1, by = .01)
sample_lightsaber <- sample(hundredths, size = 8, replace = TRUE)
sample_blaster <- sample(hundredths, size = 8, replace = TRUE)
### Assign column values
df1$identifier1 <- c(1, 1, 1, 2, 2, 3, 3, 4)
df1$identifier2 <- c("hello", "hello", "hello", "there", "there", "general", "general", "kenobi")
df1$time <- sample_time
df1$lightsaber_length <- sample_lightsaber
df1$blaster_length <- sample_blaster
Second dataframe:
### create second data.frame
df2 <- data.frame(matrix(ncol = 4, nrow = 8))
### vector of column names
cols <- c("study_id", "identifier1", "identifier2", "object")
### assign column names to df2
colnames(df2) <- cols
### create new study_id column, where study_id equals row number
df2$study_id <- 1:nrow(df2)
### move study_id column to front (relocate() and %>% come from dplyr)
library(dplyr)
df2 <- df2 %>% relocate(study_id, .before = identifier1)
### assign column values
df2$identifier1 <- c(1, 1, 2, 2, 3, 3, 4, 4)
df2$identifier2 <- c("hello", "hello", "there", "there", "general", "general", "kenobi", "kenobi")
df2$object <- c("lightsaber", "blaster")
I want a third dataframe, generated from wrangling df1 and df2, that combines 'lightsaber_length' and 'blaster_length' into one 'length' column. However, I want to retain which 'object' a 'length' value corresponds to by assigning the appropriate 'study_id' in the same row.
Each 'study_id' value represents a unique combination of 'identifier1', 'identifier2', and 'object', and does away with the need to have two separate length columns.
Desired output:
As my actual data is larger and more varied, I would like a solution that is not unique to the small example I made here.
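We can reshape to 'long' format with pivot_longer and then do a join with df2. The ".value" entry in names_to keeps the part of each column name after the underscore ("length") as the output column, while the part before it goes into 'object'.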
library(tidyr)
library(dplyr)
df1 %>%
  pivot_longer(cols = ends_with("length"), names_to = c("object", ".value"),
               names_sep = "_") %>%
  left_join(df2)
-output
# A tibble: 16 x 6
identifier1 identifier2 time object length study_id
<dbl> <chr> <dbl> <chr> <dbl> <int>
1 1 hello 4.5 lightsaber 0.28 1
2 1 hello 4.5 blaster 0.92 2
3 1 hello 1.7 lightsaber 0.42 1
4 1 hello 1.7 blaster 0.57 2
5 1 hello 8.5 lightsaber 0.55 1
6 1 hello 8.5 blaster 0.88 2
7 2 there 4.7 lightsaber 0.09 3
8 2 there 4.7 blaster 0.37 4
9 2 there 0.1 lightsaber 0.57 3
10 2 there 0.1 blaster 0.74 4
11 3 general 3.6 lightsaber 0.18 5
12 3 general 3.6 blaster 0.15 6
13 3 general 6.8 lightsaber 0.84 5
14 3 general 6.8 blaster 0.66 6
15 4 kenobi 1.2 lightsaber 0.76 7
16 4 kenobi 1.2 blaster 0.64 8
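For a reproducible check, the same reshape-and-join can be run on a small fixed data set. This is a sketch under assumptions: the two-row `df1` and four-row `df2` below are made up for illustration (reusing a few values from the output above), and the join keys are spelled out with `by =` rather than left to `left_join`'s default of all shared column names.

```r
library(tidyr)
library(dplyr)

# Illustrative data (assumed, much smaller than the question's)
df1 <- data.frame(identifier1 = c(1, 2),
                  identifier2 = c("hello", "there"),
                  time = c(4.5, 4.7),
                  lightsaber_length = c(0.28, 0.09),
                  blaster_length = c(0.92, 0.37))
df2 <- data.frame(study_id = 1:4,
                  identifier1 = c(1, 1, 2, 2),
                  identifier2 = c("hello", "hello", "there", "there"),
                  object = c("lightsaber", "blaster", "lightsaber", "blaster"))

long <- df1 %>%
  # "lightsaber_length" -> object = "lightsaber", value column = "length"
  pivot_longer(cols = ends_with("length"),
               names_to = c("object", ".value"),
               names_sep = "_") %>%
  # explicit keys: one study_id per identifier1/identifier2/object combination
  left_join(df2, by = c("identifier1", "identifier2", "object"))
long
```

Naming the keys explicitly suppresses the "Joining with by = ..." message and protects against an unrelated column with the same name in both tables silently becoming part of the join.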

How to arrange elements of a vector based on a square matrix

I have a vector that results from a square matrix as below
P = as.vector(matrix(c(1,2,3,4),nrow=2))
What would be the simplest way of arranging this vector to get a response similar to what I have below as columns
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4
1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4
1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4,1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4,1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4,1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4
I have been able to arrange the first 2 columns as
library(tidyverse)
df <- expand.grid(rep(c(1, 2, 3, 4),2))
df1 <- df %>% arrange_all()
df = expand.grid(a = df1[,1], b = df1[,1])
df[,c(2,1)]
The last column should repeat the following sequence as a whole, all the way through:
1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4
Does this work?
as.vector(apply(matrix(P, nrow = 2), 2, rep, 4))
[1] 1 2 1 2 1 2 1 2 3 4 3 4 3 4 3 4
paste(as.vector(apply(matrix(P, nrow = 2), 2, rep, 4)), collapse = ',')
[1] "1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4"
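The answer above produces the 16-value block for the last column only. As a hedged sketch of how all three columns from the question might be assembled, the block can be combined with plain `rep()` calls; the column names `col1`, `col2` and `col3` are placeholders, and the repeat counts (16, 2, 8, 4) are read off the desired output rather than derived from a general rule.

```r
P <- as.vector(matrix(c(1, 2, 3, 4), nrow = 2))

# 16-value block from the answer: 1,2 repeated 4 times, then 3,4 repeated 4 times
block <- as.vector(apply(matrix(P, nrow = 2), 2, rep, 4))

res <- data.frame(
  col1 = rep(P, each = 16),                 # 1 x16, 2 x16, 3 x16, 4 x16
  col2 = rep(rep(P, each = 2), times = 8),  # 1,1,2,2,3,3,4,4 repeated 8 times
  col3 = rep(block, times = 4)              # the block repeated as a whole
)
dim(res)
```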

Using a column as a column index to extract value from a data frame in R

I am trying to use the values from a column to extract column numbers in a data frame. My problem is similar to this topic in r-bloggers. Copying the script here:
df <- data.frame(x = c(1, 2, 3, 4),
y = c(5, 6, 7, 8),
choice = c("x", "y", "x", "z"),
stringsAsFactors = FALSE)
However, instead of having column names in choice, I have column index numbers, such that my data frame looks like this:
df <- data.frame(x = c(1, 2, 3, 4),
y = c(5, 6, 7, 8),
choice = c(1, 2, 1, 3),
stringsAsFactors = FALSE)
I tried using this solution:
df$newValue <-
  df[cbind(
    seq_len(nrow(df)),
    match(df$choice, colnames(df))
  )]
Instead of giving me an output that looks like this:
# x y choice newValue
# 1 1 5 1 1
# 2 2 6 2 6
# 3 3 7 1 3
# 4 4 8 3 NA
My newValue column returns all NAs.
# x y choice newValue
# 1 1 5 1 NA
# 2 2 6 2 NA
# 3 3 7 1 NA
# 4 4 8 3 NA
What should I modify in the code so that it would read my choice column as column index?
Since choice already holds the column numbers to extract, we don't need match here. However, choice is itself a column of the data frame and should not be picked as a value, so any index beyond the value columns (here, greater than ncol(df) - 1) is set to NA before subsetting the data frame.
mat <- cbind(seq_len(nrow(df)), df$choice)
mat[mat[, 2] > (ncol(df) -1), ] <- NA
df$newValue <- df[mat]
df
# x y choice newValue
#1 1 5 1 1
#2 2 6 2 6
#3 3 7 1 3
#4 4 8 3 NA
data
df <- data.frame(x = c(1, 2, 3, 4),
y = c(5, 6, 7, 8),
choice = c(1, 2, 1, 3))
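The `ncol(df) - 1` check above relies on choice being the last column. A variation that does not depend on column order is sketched below; it is an assumption-laden alternative rather than the answer's method, and `value_cols` is a hypothetical name for the columns that choice is allowed to point at.

```r
df <- data.frame(x = c(1, 2, 3, 4),
                 y = c(5, 6, 7, 8),
                 choice = c(1, 2, 1, 3))

# Assumption: 'choice' indexes into these columns only
value_cols <- c("x", "y")

# Turn out-of-range indexes into NA
idx <- df$choice
idx[!idx %in% seq_along(value_cols)] <- NA

# Matrix indexing with (row, column) pairs; a pair containing NA yields NA
vals <- as.matrix(df[value_cols])
df$newValue <- vals[cbind(seq_len(nrow(df)), idx)]
df
```

Because the lookup runs against a matrix built from `value_cols` only, adding further columns to `df` later does not change which values can be selected.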

how to get a cumulative sum on a large dataframe in r

I have a large data frame (48 x 100). Is there an elegant function in R that lets you transform this data frame into a "custom data frame"?
a = c(2, 3, 5)
b = c(2, 3, 5)
c = c(2, 3, 5)
df <- rbind(a,b,c)
Now I want the cumulative sum of df.
I know it's easy to do with a loop, but is there a function for it?
cumsum works directly on a data frame and is applied to each column, like this:
a <- c(2, 3, 5)
b <- c(2, 3, 5)
c <- c(2, 3, 5)
df <- data.frame(rbind(a,b,c))
df <- cumsum(df)
so this dataframe:
> df
X1 X2 X3
a 2 3 5
b 2 3 5
c 2 3 5
becomes this:
> cumsum(df)
X1 X2 X3
a 2 3 5
b 4 6 10
c 6 9 15
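Note that the question's original `df <- rbind(a, b, c)` is a matrix, not a data frame, and `cumsum()` on a matrix returns one flat vector. A minimal sketch for the matrix case, using `apply()`; the names `col_cum` and `row_cum` are just illustrative:

```r
m <- rbind(a = c(2, 3, 5), b = c(2, 3, 5), c = c(2, 3, 5))

# Cumulative sum down each column (same result as cumsum() on the data frame)
col_cum <- apply(m, 2, cumsum)

# Cumulative sum along each row; apply() returns the results as columns,
# so transpose to get back the original orientation
row_cum <- t(apply(m, 1, cumsum))

col_cum
row_cum
```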

data.table merge produces extra columns [R]

Below I define a master dataset of dimensions 12x5. I divide it into four data.tables and I want to merge them. There is no row ID overlap between data.tables and some column name overlap. When I merge them, merge() doesn't recognize column name matches, and creates new columns for every column in each data.table. The final merged data.table should be 12x5, but it is coming out as 12x7. I thought that the all = TRUE argument in data.table's merge() would solve this.
library(data.table)
a <- data.table(id = c(1, 2, 3), C1 = c(1, 2, 3))
b <- data.table(id = c(4, 5, 6), C1 = c(1, 2, 3), C2 = c(2, 3, 4))
c <- data.table(id = c(7, 8, 9), C3 = c(5, 2, 7))
d <- data.table(id = c(10, 11, 12), C3 = c(8, 2, 3), C4 = c(4, 6, 8))
setkey(a, "id")
setkey(b, "id")
setkey(c, "id")
setkey(d, "id")
final <- merge(a, b, all = TRUE)
final <- merge(final, c, all = TRUE)
final <- merge(final, d, all = TRUE)
names(final)
dim(final) #outputs correct numb of rows, but too many columns
The problem is with the way you are using the merge function.
By default, the merge function in the data.table package merges two data tables by the "shared key columns between them". Suppose you create the data tables 'a' and 'b' like this:
library(data.table)
a <- data.table(id = c(1, 2, 3), C1 = c(1, 2, 3))
b <- data.table(id = c(4, 5, 6), C1 = c(1, 2, 3), C2 = c(2, 3, 4))
setkey(a, "id")
setkey(b, "id")
where 'a' is going to be like:
id C1
1: 1 1
2: 2 2
3: 3 3
and 'b' is going to be like:
id C1 C2
1: 4 1 2
2: 5 2 3
3: 6 3 4
Now, let's first try your code:
merge(a, b, all = TRUE)
This is the result:
id C1.x C1.y C2
1: 1 1 NA NA
2: 2 2 NA NA
3: 3 3 NA NA
4: 4 NA 1 2
5: 5 NA 2 3
6: 6 NA 3 4
This happens because merge uses only the 'id' field (the shared key between data tables 'a' and 'b') as the merging column. C1 exists in both tables but is not part of the key, so it is kept as two separate columns, C1.x and C1.y. Now let's try specifying which columns to merge on:
merge(a, b, by=c("id","C1"), all = TRUE)
Now the result is going to be:
id C1 C2
1: 1 1 NA
2: 2 2 NA
3: 3 3 NA
4: 4 1 2
5: 5 2 3
6: 6 3 4
The same applies to the other merge calls you made. So try this:
final <- merge(a, b, by = c("id", "C1"), all = TRUE)
final <- merge(final, c, by = "id", all = TRUE) # here you don't necessarily need to specify by
final <- merge(final, d, by = c("id", "C3"), all = TRUE)
dim(final)
[1] 12 5
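Since the four tables share no ids, the task is really a row-bind rather than a join. As an alternative to the chained merges (a different technique from the one in this answer), data.table's `rbindlist()` with `fill = TRUE` stacks the tables and fills missing columns with NA. A minimal sketch, assuming the tables are defined as in the question:

```r
library(data.table)

a <- data.table(id = c(1, 2, 3), C1 = c(1, 2, 3))
b <- data.table(id = c(4, 5, 6), C1 = c(1, 2, 3), C2 = c(2, 3, 4))
c <- data.table(id = c(7, 8, 9), C3 = c(5, 2, 7))
d <- data.table(id = c(10, 11, 12), C3 = c(8, 2, 3), C4 = c(4, 6, 8))

# Stack all tables by column name; columns absent from a table become NA
final <- rbindlist(list(a, b, c, d), fill = TRUE)
dim(final)
```

This avoids having to work out the right `by` columns for each pairwise merge, but it only applies when rows never need to be matched across tables.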
