Multiply columns in a data frame by a vector - r

What I want to do is multiply all the values in column 1 of a data.frame by the first element in a vector, then multiply all the values in column 2 by the 2nd element in the vector, etc...
c1 <- c(1,2,3)
c2 <- c(4,5,6)
c3 <- c(7,8,9)
d1 <- data.frame(c1,c2,c3)
c1 c2 c3
1 1 4 7
2 2 5 8
3 3 6 9
v1 <- c(1,2,3)
So the result is this:
c1 c2 c3
1 1 8 21
2 2 10 24
3 3 12 27
I can do this one column at a time but what if I have 100 columns? I want to be able to do this programmatically.

Or simply diagonalize the vector, so that each column of d1 is multiplied by the corresponding element of v1:
c1 <- c(1,2,3)
c2 <- c(4,5,6)
c3 <- c(7,8,9)
d1 <- as.matrix(cbind(c1,c2,c3))
v1 <- c(1,2,3)
d1%*%diag(v1)
[,1] [,2] [,3]
[1,] 1 8 21
[2,] 2 10 24
[3,] 3 12 27

Transposing the dataframe works.
c1 <- c(1,2,3)
c2 <- c(4,5,6)
c3 <- c(7,8,9)
d1 <- data.frame(c1,c2,c3)
v1 <- c(1,2,3)
t(t(d1)*v1)
# c1 c2 c3
#[1,] 1 8 21
#[2,] 2 10 24
#[3,] 3 12 27
EDIT: If not all columns are numeric, you can do the following:
c1 <- c(1,2,3)
c2 <- c(4,5,6)
c3 <- c(7,8,9)
d1 <- data.frame(c1,c2,c3)
# Adding a column of characters for demonstration
d1$c4 <- c("rr", "t", "s")
v1 <- c(1,2,3)
#Choosing only numeric columns
index <- which(sapply(d1, is.numeric) == TRUE)
d1_mat <- as.matrix(d1[,index])
d1[,index] <- t(t(d1_mat)*v1)
d1
# c1 c2 c3 c4
#1 1 8 21 rr
#2 2 10 24 t
#3 3 12 27 s

We can also replicate the vector to make the lengths equal and then multiply
d1*v1[col(d1)]
# c1 c2 c3
#1 1 8 21
#2 2 10 24
#3 3 12 27
Or use sweep
sweep(d1, 2, v1, FUN="*")
Or with mapply to multiply the corresponding columns of 'data.frame' and elements of 'vector'
mapply(`*`, d1, v1)
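The variants above differ mainly in what they return. The following is a minimal sketch, using the d1 and v1 from the question, that compares them; the names by_sweep, by_col, by_mapply and by_map are only illustrative. It also shows a Map-based assignment, which is an addition to the answers above rather than part of them, as one way to keep the result a data.frame.

```r
d1 <- data.frame(c1 = c(1, 2, 3), c2 = c(4, 5, 6), c3 = c(7, 8, 9))
v1 <- c(1, 2, 3)

by_sweep  <- sweep(d1, 2, v1, FUN = "*")  # stays a data.frame
by_col    <- d1 * v1[col(d1)]             # stays a data.frame
by_mapply <- mapply(`*`, d1, v1)          # simplified to a numeric matrix

# Map returns a list of columns; assigning into by_map[] keeps the
# data.frame structure (row names, column names, class).
by_map <- d1
by_map[] <- Map(`*`, d1, v1)

class(by_sweep)
class(by_mapply)
```

If the result has to remain a data.frame, sweep, the col() indexing, or the Map assignment avoid a conversion back from a matrix.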

Related

Create multiple variables in data.table based on other variables' names [duplicate]

I am trying to create a series of variables, c1, c2, and c3, based on the values of two sets of variables, a1, a2, and a3, and b1, b2, and b3. The code below shows a hard-coded solution, but in reality I don't know the total number of sets of variables, say an and bn. As you can see, the names of the c variables depend on the names of the a and b variables.
Is there a way in data.table to do this? I tried to do it by using purrr::map2 within data.table but I could not make it work. I would highly appreciate your help.
Thanks.
library(data.table)
DT <- data.table(
a1 = c(1, 2, 3),
a2 = c(1, 2, 3)*2,
a3 = c(1, 2, 3)*3,
b1 = c(5, 6, 7),
b2 = c(5, 6, 7)*4,
b3 = c(5, 6, 7)*5
)
DT[]
#> a1 a2 a3 b1 b2 b3
#> 1: 1 2 3 5 20 25
#> 2: 2 4 6 6 24 30
#> 3: 3 6 9 7 28 35
DT[,
`:=`(
c1 = a1 + b1,
c2 = a2 + b2,
c3 = a3 + b3
)
]
DT[]
#> a1 a2 a3 b1 b2 b3 c1 c2 c3
#> 1: 1 2 3 5 20 25 6 22 28
#> 2: 2 4 6 6 24 30 8 28 36
#> 3: 3 6 9 7 28 35 10 34 44
Created on 2020-08-26 by the reprex package (v0.3.0)
This first part is mostly defensive, guarding against: a* variables without matching b* variables; vice versa; and different order of each:
anames <- grep("^a[0-9]+$", colnames(DT), value = TRUE)
bnames <- grep("^b[0-9]+$", colnames(DT), value = TRUE)
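# keep only the numeric suffixes that appear in both the a* and b* sets
numnames <- intersect(gsub("^a", "", anames), gsub("^b", "", bnames))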
anames <- sort(anames[gsub("^a", "", anames) %in% numnames])
bnames <- sort(bnames[gsub("^b", "", bnames) %in% numnames])
cnames <- gsub("^b", "c", bnames)
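The name-matching step can be tried in isolation on a plain character vector. This is a sketch under assumptions: the column names in cols are made up for illustration, and the numeric ordering of the suffixes is an extra precaution (so that "2" sorts before "10") that the code above does not take.

```r
# hypothetical column names: a1 has no b1, b3 has no a3
cols <- c("a1", "a2", "a10", "b2", "b10", "b3")

a_ids <- sub("^a", "", grep("^a[0-9]+$", cols, value = TRUE))
b_ids <- sub("^b", "", grep("^b[0-9]+$", cols, value = TRUE))

# suffixes present in both sets, in numeric rather than alphabetical order
ids <- intersect(a_ids, b_ids)
ids <- ids[order(as.integer(ids))]

anames <- paste0("a", ids)
bnames <- paste0("b", ids)
cnames <- paste0("c", ids)
cnames
```

Building all three name vectors from the same ids guarantees that position k in anames, bnames and cnames always refers to the same suffix.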
If you know the number ranges a priori and want something less dynamic but more straightforward, then
anames <- paste0("a", 1:3)
bnames <- paste0("b", 1:3)
cnames <- paste0("c", 1:3)
Now the magic:
DT[, (cnames) := Map(`+`, mget(anames), mget(bnames)) ]
DT
# a1 a2 a3 b1 b2 b3 c1 c2 c3
# 1: 1 2 3 5 20 25 6 22 28
# 2: 2 4 6 6 24 30 8 28 36
# 3: 3 6 9 7 28 35 10 34 44
You could tackle this issue if you split DT column-wise by the pattern of the names first, and then aggregate it
# removes numbers from col names
(ptn <- sub("\\d", "", names(DT)))
# [1] "a" "a" "a" "b" "b" "b"
# get unique numbers contained in the col names (as strings but it doesn't matter here)
(nmb <- unique(sub("\\D", "", names(DT))))
# [1] "1" "2" "3"
Next step is to split DT and finally do the aggregation
DT[, paste0("c", nmb) := do.call(`+`, split.default(DT, f = ptn))]
Result
DT
# a1 a2 a3 b1 b2 b3 c1 c2 c3
#1: 1 2 3 5 20 25 6 22 28
#2: 2 4 6 6 24 30 8 28 36
#3: 3 6 9 7 28 35 10 34 44
We can melt to long format, create the column 'c', dcast into 'wide' format and then cbind
library(data.table)
cbind(DT, dcast(melt(DT, measure = patterns('^a', '^b'))[,
c := value1 + value2], rowid(variable) ~ paste0('c', variable),
value.var = 'c')[, variable := NULL])
# a1 a2 a3 b1 b2 b3 c1 c2 c3
#1: 1 2 3 5 20 25 6 22 28
#2: 2 4 6 6 24 30 8 28 36
#3: 3 6 9 7 28 35 10 34 44
A base R option
u<-split.default(DT,gsub("\\D","",names(DT)))
cbind(DT,do.call(cbind,Map(rowSums,setNames(u,paste0("c",names(u))))))
which gives
a1 a2 a3 b1 b2 b3 c1 c2 c3
1: 1 2 3 5 20 25 6 22 28
2: 2 4 6 6 24 30 8 28 36
3: 3 6 9 7 28 35 10 34 44
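The same split-by-suffix idea can be sketched on a plain data.frame, without data.table. The data and the names groups, sums and res below are assumptions chosen for illustration; lapply is used in place of Map only to make the steps easier to follow.

```r
df <- data.frame(a1 = c(1, 2, 3), a2 = c(2, 4, 6),
                 b1 = c(5, 6, 7), b2 = c(20, 24, 28))

# group the columns by the digits in their names: "1" -> a1, b1; "2" -> a2, b2
groups <- split.default(df, gsub("\\D", "", names(df)))

# row-wise sum within each group, then name the results c1, c2, ...
sums <- lapply(groups, rowSums)
names(sums) <- paste0("c", names(groups))

res <- cbind(df, as.data.frame(sums))
res
```

Because rowSums adds every column in a group, this version also works unchanged if a third set of columns with the same suffixes is present, which may or may not be what is wanted.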

Data manipulation for pairwise.t.test in R

So I'm trying to do a pairwise table and retain a p-value of each pair.
Please note that I'm still a beginner in R.
My data looks like this (though much bigger):
a <- factor(c("ID1","ID2","ID3","ID4","ID5"))
b <- runif(5)
b1 <- runif(5)
b2 <- runif(5)
b3 <- runif(5)
c1 <- runif(5)
c2 <- runif(5)
c3 <- runif(5)
df <- data.frame(a,b1,b2,b3,c1,c2,c3)
Where b1,b2,b3 should be compared to c1,c2,c3 for each row (for each ID in column a).
The end result should be something like:
a <- cbind(a,Adjusted_P_Values)
Where the head(a,1) would look like:
head(a,1)
a b1 b2 b3 c1 c2
1 ID1 0.1337694 0.7347543 0.5808391 0.4324976 0.5378458
c3 Adjusted_P_value
1 0.6368778 0.99
where each row has its corresponding P-value.
A function I have found which I think could do the trick is pairwise.t.test.
(Currently, I'm just running a loop for each row and doing a normal t-test and then correct them with p.adjust, but I can't do pooled sd---which I would like.)
So my issue now is how to construct the data so that R likes it. I can use melt.data.frame from the reshape2 library, but it won't give me the correct structure.
I use it like this:
Test_Data <- melt(df, "a", c("b1","b2","b3","c1","c2","c3"))
But I lose the row symmetry.
As, when I now do pairwise.t.test I have to use either the "a" column or the "variable" column created by melt, hence I either get a comparison between the replicates or between the IDs.
So, simply my question is:
how do I structure the data so that each row is tested and I get a p-value for each row, and where each treatment (b or c) has a standard deviation based on all the rows (one sd for all b's and one for all c's)?
I have been googling a lot looking for similar problems (and tutorials on pairwise.t.test) but without success.
My approach was slightly different from the other answer: I spread the data into two columns, b and c, by time measure (1 - 3), and then used t.test(...,paired=TRUE) to conduct a paired t-test for each ID.
set.seed(1234)
a <- factor(c("ID1","ID2","ID3","ID4","ID5"))
b <- runif(5)
b1 <- runif(5)
b2 <- runif(5)
b3 <- runif(5)
c1 <- runif(5)
c2 <- runif(5)
c3 <- runif(5)
df <- data.frame(a,b1,b2,b3,c1,c2,c3)
library(tidyr)
library(dplyr)
df %>%
gather(.,key="variable",value="value",-a) %>%
extract(.,variable,into = c("measure", "time"),
regex = "([A-Za-z]+)([0-9]+)") %>%
spread(.,measure,value) -> spreadData
# split by ID to conduct paired t-tests by ID
dataList <- split(spreadData,spreadData$a)
pValues <- unlist(lapply(dataList,function(x){
t.test(x$b,x$c,paired=TRUE)$p.value
}))
df$p.value <- pValues
df
...and the output:
> df
a b1 b2 b3 c1 c2
1 ID1 0.640310605 0.6935913 0.8372956 0.31661245 0.81059855
2 ID2 0.009495756 0.5449748 0.2862233 0.30269337 0.52569755
3 ID3 0.232550506 0.2827336 0.2668208 0.15904600 0.91465817
4 ID4 0.666083758 0.9234335 0.1867228 0.03999592 0.83134505
5 ID5 0.514251141 0.2923158 0.2322259 0.21879954 0.04577026
c3 p.value
1 0.4560915 0.3391364
2 0.2651867 0.5043753
3 0.3046722 0.4598274
4 0.5073069 0.6764142
5 0.1810962 0.1178471
>
NOTE: if one modifies the code from the other answer to include paired=TRUE argument, the p-values across the two solutions match.
Alternative approach: run t-test on difference between c and b
Given the commentary on this post about pairwise t-tests, I thought I'd illustrate what's happening with a paired test. Essentially, for each time period 1 - 3, we subtract the b value from the c value and run a one-sample t-test on the differences. Since we've reduced the data to a single column, there's no need for the paired= argument, but the test produces the same results as passing 2 columns with the paired=TRUE argument to t.test().
# alternative 2: subtract b from c and use regular t-test
# to show how pairwise works
spreadData$difference <- spreadData$c - spreadData$b
dataList <- split(spreadData,spreadData$a)
pValues <- unlist(lapply(dataList,function(x){
t.test(x$difference)$p.value
}))
df$p.value <- pValues
df
...and the output:
> spreadData$difference <- spreadData$c - spreadData$b
> dataList <- split(spreadData,spreadData$a)
> pValues <- unlist(lapply(dataList,function(x){
+ t.test(x$difference)$p.value
+ }))
> df$p.value <- pValues
> df
a b1 b2 b3 c1 c2
1 ID1 0.640310605 0.6935913 0.8372956 0.31661245 0.81059855
2 ID2 0.009495756 0.5449748 0.2862233 0.30269337 0.52569755
3 ID3 0.232550506 0.2827336 0.2668208 0.15904600 0.91465817
4 ID4 0.666083758 0.9234335 0.1867228 0.03999592 0.83134505
5 ID5 0.514251141 0.2923158 0.2322259 0.21879954 0.04577026
c3 p.value
1 0.4560915 0.3391364
2 0.2651867 0.5043753
3 0.3046722 0.4598274
4 0.5073069 0.6764142
5 0.1810962 0.1178471
>
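The equivalence between the two approaches can be checked on its own with a few made-up numbers. In this sketch the seed and the vectors b_vals and c_vals are arbitrary assumptions, not the data from the question.

```r
set.seed(42)          # arbitrary seed, for reproducibility only
b_vals <- runif(3)    # stand-ins for b1..b3 of one ID
c_vals <- runif(3)    # stand-ins for c1..c3 of one ID

# paired t-test on the two columns
p_paired <- t.test(b_vals, c_vals, paired = TRUE)$p.value

# one-sample t-test on the differences
p_diff <- t.test(c_vals - b_vals)$p.value

all.equal(p_paired, p_diff)
```

The sign of the difference (c minus b or b minus c) flips the sign of the t statistic but leaves the two-sided p-value unchanged.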
A possible solution using the tidyverse package.
First, adjust the format of the data frame to the following structure.
library(tidyverse)
df2 <- df %>%
gather(Column, Value, -a) %>%
extract(Column, into = c("Group", "Number"), regex = "([A-Za-z]+)([0-9]+)")
df2
# a Group Number Value
# 1 ID1 b 1 0.640310605
# 2 ID2 b 1 0.009495756
# 3 ID3 b 1 0.232550506
# 4 ID4 b 1 0.666083758
# 5 ID5 b 1 0.514251141
# 6 ID1 b 2 0.693591292
# 7 ID2 b 2 0.544974836
# 8 ID3 b 2 0.282733584
# 9 ID4 b 2 0.923433484
# 10 ID5 b 2 0.292315840
# 11 ID1 b 3 0.837295628
# 12 ID2 b 3 0.286223285
# 13 ID3 b 3 0.266820780
# 14 ID4 b 3 0.186722790
# 15 ID5 b 3 0.232225911
# 16 ID1 c 1 0.316612455
# 17 ID2 c 1 0.302693371
# 18 ID3 c 1 0.159046003
# 19 ID4 c 1 0.039995918
# 20 ID5 c 1 0.218799541
# 21 ID1 c 2 0.810598552
# 22 ID2 c 2 0.525697547
# 23 ID3 c 2 0.914658166
# 24 ID4 c 2 0.831345047
# 25 ID5 c 2 0.045770263
# 26 ID1 c 3 0.456091482
# 27 ID2 c 3 0.265186672
# 28 ID3 c 3 0.304672203
# 29 ID4 c 3 0.507306870
# 30 ID5 c 3 0.181096208
Second, split the data frame and conduct pairwise.t.test, and then extract the P values.
p_value <- df2 %>%
split(.$a) %>%
map(function(x) pairwise.t.test(x$Value, x$Group, paired = TRUE)) %>%
map_dbl("p.value")
p_value
# ID1 ID2 ID3 ID4 ID5
# 0.3391364 0.5043753 0.4598274 0.6764142 0.1178471
Finally, add the P values to the original data frame as a new column.
df_final <- df %>% mutate(Adjusted_P_value = p_value)
df_final
# a b1 b2 b3 c1 c2 c3 Adjusted_P_value
# 1 ID1 0.640310605 0.6935913 0.8372956 0.31661245 0.81059855 0.4560915 0.3391364
# 2 ID2 0.009495756 0.5449748 0.2862233 0.30269337 0.52569755 0.2651867 0.5043753
# 3 ID3 0.232550506 0.2827336 0.2668208 0.15904600 0.91465817 0.3046722 0.4598274
# 4 ID4 0.666083758 0.9234335 0.1867228 0.03999592 0.83134505 0.5073069 0.6764142
# 5 ID5 0.514251141 0.2923158 0.2322259 0.21879954 0.04577026 0.1810962 0.1178471
DATA
set.seed(1234)
a <- factor(c("ID1","ID2","ID3","ID4","ID5"))
b <- runif(5)
b1 <- runif(5)
b2 <- runif(5)
b3 <- runif(5)
c1 <- runif(5)
c2 <- runif(5)
c3 <- runif(5)
df <- data.frame(a,b1,b2,b3,c1,c2,c3)
Edit:
In order to correctly map the P-values back onto the data frame, the data frame has to be ordered on the 'a' column.
Just adding to Baraliuh's solution:
map_dbl("p.value") did not work, however,
map_df("p.value") worked in my case.

Column Split into columns and rows in R

My Data looks like
df <- data.frame(user_id=c('13','15'),
answer_id = c('{"row[0][0]":"A","row[0][1]":"B","row[0][2]":"C","row[0][3]":"D","row[1][0]":"A1","row[1][1]":"B1","row[1][2]":"C1","row[1][3]":"D1"}', '{"row[0][0]":"W","row[0][1]":"X","row[0][2]":"Y","row[0][3]":"Z","row[1][0]":"W1","row[1][1]":"X1","row[1][2]":"Y1","row[1][3]":"Z1"}'))
Desired data view
user_id answer_id1 answer_id2 answer_id3 answer_id4
13 A B C D
13 A1 B1 C1 D1
15 W X Y Z
15 W1 X1 Y1 Z1
I'm new to R and hope to get a solution soon, as I always do.
This may not be the best solution, but it gets you from your sample input to your desired output using stringr, purrr, & tidyr. See regex101 for an explanation of the regex used in the stringr::str_match_all() call.
df <- data.frame(user_id=c('13','15'),
answer_id = c('{"row[0][0]":"A","row[0][1]":"B","row[0][2]":"C","row[0][3]":"D","row[1][0]":"A1","row[1][1]":"B1","row[1][2]":"C1","row[1][3]":"D1"}', '{"row[0][0]":"W","row[0][1]":"X","row[0][2]":"Y","row[0][3]":"Z","row[1][0]":"W1","row[1][1]":"X1","row[1][2]":"Y1","row[1][3]":"Z1"}'),
stringsAsFactors = FALSE)
#use regex to extract row ids and answers
regex_matches <- stringr::str_match_all(df$answer_id, '\\"row\\[(\\d+)\\]\\[(\\d+)\\]\\":\\"([^\\"]*)\\"')
#add user id to each result
answers_by_user <- purrr::map2(df$user_id, regex_matches, ~cbind(.x, .y[,-1]))
#combine list of matrices and convert to df
answers_df <- data.frame(do.call(rbind, answers_by_user))
#add meaningful names
names(answers_df) <- c("user_id", "row_1", "row_2", "value")
#convert to wide: one column per inner index (row_2), one row per user_id and row_1
final_df <- tidyr::spread(answers_df, row_2, value)
#remove row column
final_df$row_1 <- NULL
#clean up names
names(final_df) <- c("user_id", "answer_id1", "answer_id2", "answer_id3", "answer_id4")
final_df
#output
user_id answer_id1 answer_id2 answer_id3 answer_id4
1 13 A B C D
2 13 A1 B1 C1 D1
3 15 W X Y Z
4 15 W1 X1 Y1 Z1
Column 2 looks like JSON, so you could do something like this to get it into a form that you can do something with...
library(rjson)
df2 <- lapply(1:nrow(df),function(i)
data.frame(user=df[i,1],
answer=unlist(fromJSON(as.character(df[i,2]))),stringsAsFactors = FALSE))
df2 <- do.call(rbind,df2)
df2[,"r1"] <- gsub(".+\\[(\\d)]\\[(\\d)].*","\\1",rownames(df2))
df2[,"r2"] <- gsub(".+\\[(\\d)]\\[(\\d)].*","\\2",rownames(df2))
df2
user answer r1 r2
row[0][0] 13 A 0 0
row[0][1] 13 B 0 1
row[0][2] 13 C 0 2
row[0][3] 13 D 0 3
row[1][0] 13 A1 1 0
row[1][1] 13 B1 1 1
row[1][2] 13 C1 1 2
row[1][3] 13 D1 1 3
row[0][0]1 15 W 0 0
row[0][1]1 15 X 0 1
row[0][2]1 15 Y 0 2
row[0][3]1 15 Z 0 3
row[1][0]1 15 W1 1 0
row[1][1]1 15 X1 1 1
row[1][2]1 15 Y1 1 2
row[1][3]1 15 Z1 1 3

Math function using multiple matching criteria

I'm new to this but I'm pretty sure this question hasn't been answered, or I'm just not good at searching....
I would like to subtract the values in a particular row from the values in multiple rows, based on matching columns and values. My actual data will be a large matrix with >5000 columns, each needing to have a blank value subtracted, where the blank is chosen by matching a value in a factor column.
Here is an example data table:
c1 c2 c3 c4 c5
r1 A 1 2 3 aa
r2 B 2 3 4 bb
r3 C 3 4 5 aa
r4 D 4 1 6 bb
r5 Blank 2 3 4 aa
r6 Blank 3 4 5 bb
I would like to subtract the c2, c3, and c4 values of the c1 = "Blank" rows from rows A, B, C, and D, using the c5 factor to define which Blank values are used (aa or bb). I would like the "Blank" values to be subtracted from all rows sharing the same c5 value.
(I know this is confusing to describe)
So the results would look like this:
c1 c2 c3 c4 c5
r1 A -1 -1 -1 aa
r2 B -1 -1 -1 bb
r3 C 1 1 1 aa
r4 D 1 -3 1 bb
I've seen the ddply function work for doing something like this with a single column, but I wasn't able to expand that to perform this task for multiple columns. I'm a noob though...
Thank you for your help!
This is not tested for all possible cases, but should give you an idea:
df <- read.table(text =
"c1 c2 c3 c4 c5
r1 A 1 2 3 aa
r2 B 2 3 4 bb
r3 C 3 4 5 aa
r4 D 4 1 6 bb
r5 Blank 2 3 4 aa
r6 Blank 3 4 5 bb", header = T)
library(data.table)
# separate dataset into two
dt <- data.table(df, key = "c5")
dt.blank <- dt[c1 == "Blank"]
dt <- dt[c1 != "Blank"]
# merge into resulting dataset
dt.res <- dt[dt.blank]
# update each column
columns.count <- ncol(dt)
for(i in 2:(columns.count-1)) {
dt.res[[i]] <- dt.res[[i]] - dt.res[[i + columns.count]]
}
# > dt.res
# c1 c2 c3 c4 c5 i.c1 i.c2 i.c3 i.c4
# 1: A -1 -1 -1 aa Blank 2 3 4
# 2: C 1 1 1 aa Blank 2 3 4
# 3: B -1 -1 -1 bb Blank 3 4 5
# 4: D 1 -3 1 bb Blank 3 4 5
First split your data, since there's no reason to keep the samples and the blanks in a single data structure. Then apply the function:
# recreate your data
df <- data.frame(rbind(c(1:3, "aa"), c(2:4, "bb"), c(3:5, "aa"), c(4,1,6, "bb"), c(2:4, "aa"), c(3:5, "bb")))
df[,1:3] <- apply(df[,1:3], 2, as.integer)
# split it
blank1 <- df[5,]
blank2 <- df[6,]
df <- df[1:4,]
for (i in 1:nrow(df)) {
if (df[i,4] == "aa") {df[i,1:3] <- df[i,1:3] - blank1[1:3]}
else {df[i,1:3] <- df[i,1:3] - blank2[1:3]}
}
There are a few different ways to run the loop, including vectorizing. But this suffices. I'd also argue that there's no reason to keep the labels "aa" v "bb" in the initial data structure either, which would make this simpler; but it's your choice.
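A vectorized version of the same idea might look like the sketch below. It assumes the example table from the question with its original column names, and exactly one Blank row per c5 value; the names num_cols, blanks, samples and idx are illustrative.

```r
df <- data.frame(
  c1 = c("A", "B", "C", "D", "Blank", "Blank"),
  c2 = c(1, 2, 3, 4, 2, 3),
  c3 = c(2, 3, 4, 1, 3, 4),
  c4 = c(3, 4, 5, 6, 4, 5),
  c5 = c("aa", "bb", "aa", "bb", "aa", "bb"),
  stringsAsFactors = FALSE
)
num_cols <- c("c2", "c3", "c4")

blanks  <- df[df$c1 == "Blank", ]
samples <- df[df$c1 != "Blank", ]

# for each sample row, the position of the blank row with the same c5
idx <- match(samples$c5, blanks$c5)

# subtract the matching blank from every numeric column in one step
samples[num_cols] <- as.matrix(samples[num_cols]) -
  as.matrix(blanks[idx, num_cols])
samples
```

Since num_cols can hold any number of column names, the same code should carry over to the wide case described in the question, as long as each c5 level has a single blank.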

R: Rearrange matrix into three columns

I have a matrix in R. Each entry i,j is a score and the rownames and colnames are ids.
Instead of the matrix I just want a 3 column matrix that has: i,j,score
Right now I'm using nested for loops. Like:
for(i in rownames(g))
{
print(which(rownames(g)==i))
for(j in colnames(g))
{
cur.vector<-c(cur.ref, i, j, g[rownames(g) %in% i,colnames(g) %in% j])
rbind(new.file,cur.vector)->new.file
}
}
But that's very inefficient, I think... I'm sure there's a better way; I'm just not good enough with R yet.
Thoughts?
If I understand you correctly, you need to flatten the matrix.
You can use as.vector and rep to add the id columns e.g. :
m = cbind(c(1,2,3),c(4,5,6),c(7,8,9))
row.names(m) = c('R1','R2','R3')
colnames(m) = c('C1','C2','C3')
d <- data.frame(i=rep(row.names(m),ncol(m)),
j=rep(colnames(m),each=nrow(m)),
score=as.vector(m))
Result:
> m
C1 C2 C3
R1 1 4 7
R2 2 5 8
R3 3 6 9
> d
i j score
1 R1 C1 1
2 R2 C1 2
3 R3 C1 3
4 R1 C2 4
5 R2 C2 5
6 R3 C2 6
7 R1 C3 7
8 R2 C3 8
9 R3 C3 9
Please note that this code converts the matrix into a data.frame, since the row and column names can be strings and a matrix cannot hold columns of different types.
If you are sure that all row and column names are numbers, you can coerce the result to a matrix.
If you convert your matrix first to a table (with as.table) then to a data frame (as.data.frame) then it will accomplish what you are asking for. A simple example:
> tmp <- matrix( 1:12, 3 )
> dimnames(tmp) <- list( letters[1:3], LETTERS[4:7] )
> as.data.frame( as.table( tmp ) )
Var1 Var2 Freq
1 a D 1
2 b D 2
3 c D 3
4 a E 4
5 b E 5
6 c E 6
7 a F 7
8 b F 8
9 c F 9
10 a G 10
11 b G 11
12 c G 12
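To get the i, j, score column names asked for in the question, the as.table route can be combined with the responseName argument of as.data.frame and a rename. This is a small sketch on an assumed 2 x 3 matrix; the name long is illustrative.

```r
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("R1", "R2"), c("C1", "C2", "C3")))

# responseName sets the name of the value column (default "Freq")
long <- as.data.frame(as.table(m), responseName = "score",
                      stringsAsFactors = FALSE)

# the two id columns come out as Var1 and Var2 when dimnames are unnamed
names(long)[1:2] <- c("i", "j")
long
```

The rows come out in column-major order, so all entries of the first column of m appear before those of the second.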
