How to use matrix algebra in R to create a new column?

I have a dataframe (mydata) with multiple columns, and another dataframe (df) with two columns, factor and coefficient. I want to create a new column in mydata that is, for each row, the sum of the elements of mydata's columns a:e multiplied by the corresponding coefficients (a:e) in df. The result for the first row of newcol should be 70 (1*1 + 2*2 + 3*3 + 4*4 + 8*5). Ideally, I would be able to replicate this 20+ times with different sets of coefficients.
mydata <- data.frame(a = 1:10, b = 2:11, c = 3:12, d = 4:13, d_1 = 5:14, d_2 = 6:15, d_3 = 7:16, e = 8:17)
df <- data.frame(factor = c('a','b','c','d','e'), coefficient = 1:5)
# these attempts fail: %*% requires a matrix, not a data frame
mydata$newcol <- mydata[,c("a","b","c","d","e")] %*% df$coefficient
mydata$newcol2 <- mydata[,c("a","b","c","d_1","e")] %*% df$coefficient
Any advice would be helpful!

We can use sweep here: subset mydata by the factor column of df, multiply each column by its coefficient, and then take rowSums to get the per-row total.
mydata$newcol <- rowSums(sweep(mydata[as.character(df$factor)], 2, df$coefficient, `*`))
mydata
#     a  b  c  d d_1 d_2 d_3  e newcol
#1    1  2  3  4   5   6   7  8     70
#2    2  3  4  5   6   7   8  9     85
#3    3  4  5  6   7   8   9 10    100
#4    4  5  6  7   8   9  10 11    115
#5    5  6  7  8   9  10  11 12    130
#6    6  7  8  9  10  11  12 13    145
#7    7  8  9 10  11  12  13 14    160
#8    8  9 10 11  12  13  14 15    175
#9    9 10 11 12  13  14  15 16    190
#10  10 11 12 13  14  15  16 17    205
Or we can transpose mydata, multiply by the coefficients (which recycle along the rows of the transposed matrix), and take colSums:
colSums(t(mydata[as.character(df$factor)]) * df$coefficient)
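For the matrix-algebra route the title asks about, the OP's %*% attempt fails only because %*% needs a matrix, not a data frame. A minimal sketch of the fix, assuming the selected columns are all numeric:
# coerce the selected columns to a matrix before %*%;
# drop() turns the resulting one-column matrix into a plain vector
mydata$newcol <- drop(as.matrix(mydata[as.character(df$factor)]) %*% df$coefficient)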

Related

R: average of distance by Id

I have a dataset with two groups of subjects, Group A and Group B, like this:
Id Group Age
1 A 17
2 A 14
3 A 10
4 A 17
5 A 12
6 A 6
7 A 18
8 A 7
9 B 18
9 B 13
10 B 6
10 B 12
11 B 16
11 B 17
12 B 11
12 B 18
The subjects in Group A are unique: one row per subject. The subjects in Group B are not unique; there are two or in some cases three rows of observations per subject in Group B (for example, ID 9 and ID 10 above).
What I am trying to do is:
a) estimate the average distance of each subject in Group B to everyone in Group A, using Age to measure the distance;
b) estimate the distance of each subject in Group B to the mode of Group A, using Age to estimate the mode of Group A and the Age in Group B to measure the distance from that mode.
Expecting a dataset like this:
ID Group Age AvDistance DistanceToMedian
1 A 17 NA NA
2 A 14 NA NA
3 A 10 NA NA
4 A 17 NA NA
5 A 12 NA NA
6 A 6 NA NA
7 A 18 NA NA
8 A 7 NA NA
9 B 18 5.375 2.11
9 B 13 3.875 2.88
10 B 6 ... ...
10 B 12 ... ...
11 B 16 ... ...
11 B 17 ... ...
12 B 11 ... ...
12 B 18 ... ...
I can do this manually; any suggestions on how to make it more efficient are much appreciated. Thanks.
# Estimate Average Distance of Id in Group B to all subjects in Group A
(sqrt((17-18)^2) + sqrt((14-18)^2) + sqrt((10-18)^2) + sqrt((17-18)^2) + sqrt((12-18)^2) + sqrt((6-18)^2) + sqrt((18-18)^2) + sqrt((7-18)^2))/8 = 5.375
(sqrt((17-13)^2) + sqrt((14-13)^2) + sqrt((10-13)^2) + sqrt((17-13)^2) + sqrt((12-13)^2) + sqrt((6-13)^2) + sqrt((18-13)^2) + sqrt((7-13)^2))/8 = 3.875
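Since sqrt((x - y)^2) is just abs(x - y), each of these reduces to a single mean() call; a quick equivalent:
age_a <- c(17, 14, 10, 17, 12, 6, 18, 7)  # Group A ages
mean(abs(age_a - 18))  # 5.375, average distance for Age 18
mean(abs(age_a - 13))  # 3.875, average distance for Age 13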
# Density-based mode estimate: the x value where the kernel density peaks
estimate_mode <- function(x) {
  d <- density(x)
  d$x[which.max(d$y)]
}
# Estimate Mode for Age in Group A
x <- c(17, 14, 10, 17, 12, 6, 18, 7)
estimate_mode(x)
m1 <- estimate_mode(x)
# Estimate distance of each Group B age to the mode
sqrt((18 - m1)^2) = 2.11
sqrt((13 - m1)^2) = 2.88
This will be easier with a unique row ID, so I'll create one:
library(dplyr)
library(tibble)
df = df %>%
  mutate(rownum = paste0("row", row_number()))
ages = setNames(df$Age, df$rownum)
## make a distance matrix
dist = outer(ages[df$Group == "B"], ages[df$Group == "A"], FUN = \(x, y) abs(x - y))
## calculate average distances
av_dist = data.frame(AvDist = rowMeans(dist)) %>% rownames_to_column("rownum")
## calculate median age for A
med_a = median(ages[df$Group == "A"])
## add back to original data
df %>%
  left_join(av_dist, by = "rownum") %>%
  mutate(DistanceToMedian = ifelse(Group == "B", abs(Age - med_a), NA))
# Id Group Age rownum AvDist DistanceToMedian
# 1 1 A 17 row1 NA NA
# 2 2 A 14 row2 NA NA
# 3 3 A 10 row3 NA NA
# 4 4 A 17 row4 NA NA
# 5 5 A 12 row5 NA NA
# 6 6 A 6 row6 NA NA
# 7 7 A 18 row7 NA NA
# 8 8 A 7 row8 NA NA
# 9 9 B 18 row9 5.375 5
# 10 9 B 13 row10 3.875 0
# 11 10 B 6 row11 6.625 7
# 12 10 B 12 row12 3.875 1
# 13 11 B 16 row13 4.375 3
# 14 11 B 17 row14 4.625 4
# 15 12 B 11 row15 4.125 2
# 16 12 B 18 row16 5.375 5
I used median, not mode, because I was looking at your column names, but you can easily swap in your mode instead.
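For instance, a sketch of that swap, reusing the estimate_mode() function defined in the question (DistanceToMode is a hypothetical column name):
## density-based mode of the Group A ages, per the question's estimate_mode()
mode_a <- estimate_mode(ages[df$Group == "A"])
df %>%
  left_join(av_dist, by = "rownum") %>%
  mutate(DistanceToMode = ifelse(Group == "B", abs(Age - mode_a), NA))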
Using this sample data:
df = read.table(text = 'Id Group Age
1 A 17
2 A 14
3 A 10
4 A 17
5 A 12
6 A 6
7 A 18
8 A 7
9 B 18
9 B 13
10 B 6
10 B 12
11 B 16
11 B 17
12 B 11
12 B 18', header = T)

R: Change name of variable using for loop

I have a data frame, and vectors containing the names of some of its variables. From each vector I calculate the sum of the variables it contains, and I want to put each result in a new variable with a different name.
Say I have this data:
>data
Name A B C D E
r1 1 5 12 21 15
r2 2 4 7 10 9
r3 5 15 6 9 6
r4 7 8 0 7 18
And I have these vectors, which are generated by a for loop:
V1 <- c("A","B","C")
V2 <- c("B","D")
V3 <- c("D","E")
Edit 1:
These vectors are generated by a for loop, and I don't know in advance which vectors will be generated or which elements they will contain; I'm just giving an example here. I want to calculate the sum of the variables in each vector and put the result in a new variable in my data frame.
The issue is that I don't know how to name the newly created variables (the ones that contain the sum for each vector).
data$column[j] <- rowSums(all_data_Second_program[,vec])
j <- j+1
My goal is to obtain this result, for example:
Name A B C Column1 D Column2 E Column3
r1 1 5 12 18 21 26 15 36
r2 2 4 7 13 10 14 9 19
r3 5 15 6 26 9 24 6 15
r4 7 8 0 15 7 15 18 25
But I didn't obtain this result.
Please tell me if you need any more information or clarification.
Can you please tell me how to do that?
Put the vectors in a list and then you can use rowSums in lapply:
list_vec <- list(c("A","B","C"), c("B","D"), c("D","E"))
new_cols <- paste0('Column', seq_along(list_vec))
data[new_cols] <- lapply(list_vec, function(x) rowSums(data[x]))
data
# Name A B C D E Column1 Column2 Column3
#1 r1 1 5 12 21 15 18 26 36
#2 r2 2 4 7 10 9 13 14 19
#3 r3 5 15 6 9 6 26 24 15
#4 r4 7 8 0 7 18 15 15 25
We may use a for loop:
for(i in 1:3) {
  data[[paste0('Column', i)]] <- rowSums(data[get(paste0('V', i))],
                                         na.rm = TRUE)
}
Output:
> data
Name A B C D E Column1 Column2 Column3
1 r1 1 5 12 21 15 18 26 36
2 r2 2 4 7 10 9 13 14 19
3 r3 5 15 6 9 6 26 24 15
4 r4 7 8 0 7 18 15 15 25
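If the vectors really are created one by one in the global environment, a hedged alternative is to collect them with mget() first (assuming they all follow the V1, V2, ... naming pattern) and then reuse the lapply approach:
## gather V1, V2, V3 into a named list, then sum each set of columns
list_vec <- mget(paste0("V", 1:3))
data[paste0("Column", seq_along(list_vec))] <-
  lapply(list_vec, function(x) rowSums(data[x], na.rm = TRUE))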

How to extract a sample of pairs from a grouping variable

My data looks like this:
x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18
y is a grouping variable, and I would like to see how well this grouping went. To that end, I want to extract a sample of n pairs of cases that are grouped together by variable y, and n pairs of cases that are not grouped together by variable y, in order to count the false positives and false negatives (cases falsely grouped or falsely not grouped). How do I extract a sample of grouped pairs and a sample of not-grouped pairs?
I would like the samples to look like this (for n=6):
Grouped sample:
x y
2 2
3 2
9 9
10 9
15 14
17 14
Not-grouped sample:
x y
1 1
2 2
6 8
6 8
11 11
19 17
How would I go about this in R?
I'm not entirely clear on what you'd like to do, partly because I feel there is some context missing as to what you're trying to achieve. I also don't quite understand your expected output (for example, the not-grouped sample contains an entry 6 8 that does not exist in your original data...).
That aside, here is a possible approach.
# Maximum number of samples per group
n <- 3;
# Set fixed RNG seed for reproducibility
set.seed(2017);
# Grouped samples
df.grouped <- do.call(rbind.data.frame, lapply(split(df, df$y),
    function(x) if (nrow(x) > 1) x[sample(min(n, nrow(x))), ]));
df.grouped;
# x y
#2.3 3 2
#2.2 2 2
#6.6 6 6
#6.7 7 6
#9.10 10 9
#9.9 9 9
#13.13 13 13
#13.14 14 13
#14.15 15 14
#14.17 17 14
# Ungrouped samples
df.ungrouped <- df[sample(nrow(df.grouped)), ];
df.ungrouped;
# x y
#7 7 6
#1 1 1
#9 9 9
#4 4 4
#3 3 2
#2 2 2
#5 5 5
#6 6 6
#10 10 9
#8 8 8
Explanation: Split df based on y, then draw min(n, nrow(x)) samples from every subset x containing more than one row; rbind-ing the pieces gives the grouped sample df.grouped. We then draw nrow(df.grouped) samples from df to produce the ungrouped sample df.ungrouped.
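If you need actual pairs of rows that are not grouped together, here is a rough sketch under the assumption that enumerating all row pairs is feasible at your data size (sample_ungrouped_pairs is a hypothetical helper, not part of the answer above):
sample_ungrouped_pairs <- function(df, n) {
  pairs <- t(combn(nrow(df), 2))                  # all pairs of row indices
  diff_y <- df$y[pairs[, 1]] != df$y[pairs[, 2]]  # keep pairs with differing y
  idx <- pairs[diff_y, , drop = FALSE]
  idx[sample(nrow(idx), n), ]                     # n random index pairs
}
sample_ungrouped_pairs(df, 6)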
Sample data
df <- read.table(text =
"x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18", header = T)

Merging and summarizing two dataframes

I have the following data:
a <- data.frame(ID=c("A","B","Z","H"), a=c(0,1,2,45), b=c(3,4,5,22), c=c(6,7,8,3))
> a
ID a b c
1 A 0 3 6
2 B 1 4 7
3 Z 2 5 8
4 H 45 22 3
b <- data.frame(ID=c("A","B","E","W","Z","H"), a=c(9,10,11,39,5,0), b=c(4,2,7,54,12,34), c=c(12,0,34,23,13,14))
> b
  ID  a  b  c
1  A  9  4 12
2  B 10  2  0
3  E 11  7 34
4  W 39 54 23
5  Z  5 12 13
6  H  0 34 14
I want to merge both dataframes, keeping only the rows of dataframe a, and sum the matching columns, so at the end I get:
> z
ID a b c
1 A 9 7 18
2 B 11 6 7
3 Z 7 17 21
4 H 45 56 17
So far I have tried the following:
> merge(a,b,by="ID",all.x=T,all.y=F)
ID a.x b.x c.x a.y b.y c.y
1 A 0 3 6 9 4 12
2 B 1 4 7 10 2 0
3 H 45 22 3 0 34 14
4 Z 2 5 8 5 12 13
> join(a,b,type="left",by="ID")
ID a b c a b c
1 A 0 3 6 9 4 12
2 B 1 4 7 10 2 0
3 Z 2 5 8 5 12 13
4 H 45 22 3 0 34 14
I cannot manage to summarize the columns.
My dataframe is pretty big, so a solution that also speeds things up would be even better.
If your data.frame is very big, then you may consider this option:
library(data.table)
## convert data.frame to data.table
setDT(a)
## convert data.frame to data.table
setDT(b)
## merge the two data.tables
c <- merge(a,b,by='ID')
## extract names of all columns except the first one i.e. ID
col_names <- colnames(a)[-1]
## query building
col_1 <- paste0(col_names,'.x')
col_2 <- paste0(col_names,'.y')
cols <- paste(col_1,col_2,sep=',')
cols_2 <- paste0(col_names," = sum(",cols,")")
cols_3 <- paste(cols_2,collapse=',')
query <- paste0("z <- c[,.(",cols_3,"),by=ID]")
## query execution
eval(parse(text = query))
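A more direct data.table sketch that avoids building a query string, assuming a and b share exactly the same measure columns: keep only b's rows whose ID appears in a, stack the two tables, and sum per ID.
## stack the matching rows, then sum every non-ID column by ID
z <- rbindlist(list(a, b[b$ID %in% a$ID]))[, lapply(.SD, sum), by = ID]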
This works at least for your example:
a <- data.frame(ID=c("A","B","Z","H"), a=c(0,1,2,45), b=c(3,4,5,22), c=c(6,7,8,3))
b <- data.frame(ID=c("A","B","E","W","Z","H"), a=c(9,10,11,39,5,0), b=c(4,2,7,54,12,34), c=c(12,0,34,23,13,14))
match_a <- na.omit(match(b$ID, a$ID))
match_b <- na.omit(match(a$ID, b$ID))
df <- cbind(ID = a$ID[match_a], a[match_a, -1] + b[match_b, -1])
First, get the rows of a whose IDs appear in b and vice versa, so we can be sure that we only keep rows present in both data frames (and we now know their row indices in both). Then simply use vectorized addition on those matching rows, but omit ID, as a factor cannot be summed up; add ID back manually.
You cannot directly add the two data frames because they are of unequal size. To make them equal in size, you can keep only the rows of b whose IDs are present in a and then add the remaining columns element-wise.
new <- b[b$ID %in% a$ID, ]
cbind(ID = a$ID, a[-1] + new[-1])
# ID a b c
#1 A 9 7 18
#2 B 11 6 7
#3 Z 7 17 21
#4 H 45 56 17
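Note that this relies on the shared IDs appearing in the same row order in both data frames; if that is not guaranteed, aligning b to a's ID order with match() is a safer sketch:
## reorder b's matching rows to follow a's ID order before adding
new <- b[match(a$ID, b$ID), ]
cbind(ID = a$ID, a[-1] + new[-1])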

R: subset a data frame based on conditions from another data frame

Here is a problem I am trying to solve. Say, I have two data frames like the following:
observations <- data.frame(id = rep(rep(c(1,2,3,4), each=5), 5),
                           time = c(rep(1:5,4), rep(6:10,4), rep(11:15,4), rep(16:20,4), rep(21:25,4)),
                           measurement = rnorm(100,5,7))
sampletimes <- data.frame(location = letters[1:20],
                          id = rep(1:4,5),
                          time1 = rep(c(2,7,12,17,22), each=4),
                          time2 = rep(c(4,9,14,19,24), each=4))
They both contain a column named id, which links the data frames. I want to have the measurements from observations for which time is between time1 and time2 from the sampletimes data frame. Additionally, I'd like to connect the appropriate location to each measurement.
I have successfully done this by converting sampletimes to a wide format (i.e. all the time1 and time2 information in one row per id), merging the two data frames by the id variable, using conditional statements to keep only the rows where time falls within at least one of the intervals in the row, and then assigning the appropriate location to each measurement.
However, I have around 2 million rows in observations and doing this takes a really long time. I'm looking for a better way where I can keep the data in long format. The example dataset is very simple, but in reality, my data contains variable numbers of intervals and locations per id.
For our example, the data frame I would hope to get back would be as follows:
id time measurement location
1 3 10.5163892 a
2 3 5.5774119 b
3 3 10.5057060 c
4 3 14.1563179 d
1 8 2.2653761 e
2 8 -1.0905546 f
3 8 12.7434161 g
4 8 17.6129261 h
1 13 10.9234673 i
2 13 1.6974481 j
3 13 -0.3664951 k
4 13 13.8792198 l
1 18 6.5038847 m
2 18 1.2032935 n
3 18 15.0889469 o
4 18 0.8934357 p
1 23 3.6864527 q
2 23 0.2404074 r
3 23 11.6028766 s
4 23 20.7466908 t
Here's a proposal with merge:
# merge both data frames
dat <- merge(observations, sampletimes, by = "id")
# extract valid rows
dat2 <- dat[dat$time > dat$time1 & dat$time < dat$time2, seq(4)]
# sort
dat2[order(dat2$time, dat2$id), ]
The result:
id time measurement location
11 1 3 7.086246 a
141 2 3 6.893162 b
251 3 3 16.052627 c
376 4 3 -6.559494 d
47 1 8 11.506810 e
137 2 8 10.959782 f
267 3 8 11.079759 g
402 4 8 11.082015 h
83 1 13 5.584257 i
218 2 13 -1.714845 j
283 3 13 -11.196792 k
418 4 13 8.887907 l
99 1 18 1.656558 m
234 2 18 16.573179 n
364 3 18 6.522298 o
454 4 18 1.005123 p
125 1 23 -1.995719 q
250 2 23 -6.676464 r
360 3 23 10.514282 s
490 4 23 3.863357 t
Not efficient, but it does the job:
subset(merge(observations,sampletimes), time > time1 & time < time2)
id time measurement location time1 time2
11 1 3 3.180321 a 2 4
47 1 8 6.040612 e 7 9
83 1 13 -5.999317 i 12 14
99 1 18 2.689414 m 17 19
125 1 23 12.514722 q 22 24
137 2 8 4.420679 f 7 9
141 2 3 11.492446 b 2 4
218 2 13 6.672506 j 12 14
234 2 18 12.290339 n 17 19
250 2 23 12.610828 r 22 24
251 3 3 8.570984 c 2 4
267 3 8 -7.112291 g 7 9
283 3 13 6.287598 k 12 14
360 3 23 11.941846 s 22 24
364 3 18 -4.199001 o 17 19
376 4 3 7.133370 d 2 4
402 4 8 13.477790 h 7 9
418 4 13 3.967293 l 12 14
454 4 18 12.845535 p 17 19
490 4 23 -1.016839 t 22 24
EDIT
Since your observations has around 2 million rows, you should give a data.table solution a try:
library(data.table)
OBS <- data.table(observations)
SAM <- data.table(sampletimes)
merge(OBS,SAM,allow.cartesian=TRUE,by='id')[time > time1 & time < time2]
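With a recent data.table (1.9.8 or later), a non-equi join is a sketch that avoids the cartesian merge entirely:
## join each sampling window directly onto the observations;
## x.time recovers the observation's own time, nomatch = 0L drops misses
OBS[SAM, on = .(id, time > time1, time < time2),
    .(id, time = x.time, measurement, location), nomatch = 0L]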
