R: Change name of variable using for loop

I have a data frame and some vectors that contain variable names. For each vector I calculate the sum of the variables it names, and I want to put each result in a new variable with a different name.
Let's say I have three vectors.
>data
Name A B C D E
r1 1 5 12 21 15
r2 2 4 7 10 9
r3 5 15 6 9 6
r4 7 8 0 7 18
And I have these vectors, which are generated in a for loop and held in the variable vec:
V1 <- c("A","B","C")
V2 <- c("B","D")
V3 <- c("D","E")
Edit 1:
These vectors are generated in a for loop, and I don't know in advance which vectors will be generated or which elements they will contain. What I give here is just an example. I want to calculate the sum of the variables in each vector and store the result as a new variable in my data frame.
The issue is that I don't know how to give a new name to each variable created (the ones that contain the sum for each vector).
data$column[j] <- rowSums(all_data_Second_program[,vec])
j <- j+1
To obtain this result for example
Name A B C Column1 D Column2 E Column3
r1 1 5 12 18 21 26 15 36
r2 2 4 7 13 10 14 9 19
r3 5 15 6 26 9 24 6 15
r4 7 8 0 15 7 15 18 25
But I didn't obtain this result.
Please tell me if you need any more information or clarification.
Can you please tell me how to do that?

Put the vectors in a list, and then you can use rowSums in lapply:
list_vec <- list(c("A","B","C"), c("B","D"), c("D","E"))
new_cols <- paste0('Column', seq_along(list_vec))
data[new_cols] <- lapply(list_vec, function(x) rowSums(data[x]))
data
# Name A B C D E Column1 Column2 Column3
#1 r1 1 5 12 21 15 18 26 36
#2 r2 2 4 7 10 9 13 14 19
#3 r3 5 15 6 9 6 26 24 15
#4 r4 7 8 0 7 18 15 15 25
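Since the vectors are produced inside a loop, one option is to append each one to a list as it is generated and then apply the same rowSums/lapply idea. The following is only a minimal sketch, assuming the example data from the question; the hard-coded `generated` list is a hypothetical stand-in for whatever the real loop produces.

```r
# Example data from the question
data <- data.frame(
  Name = c("r1", "r2", "r3", "r4"),
  A = c(1, 2, 5, 7),
  B = c(5, 4, 15, 8),
  C = c(12, 7, 6, 0),
  D = c(21, 10, 9, 7),
  E = c(15, 9, 6, 18)
)

# Hypothetical stand-in for the vectors the real loop would generate
generated <- list(c("A", "B", "C"), c("B", "D"), c("D", "E"))

list_vec <- list()
for (vec in generated) {
  # Append each vector of column names as it is produced
  list_vec[[length(list_vec) + 1]] <- vec
}

# One new column per collected vector, named Column1, Column2, ...
new_cols <- paste0("Column", seq_along(list_vec))
data[new_cols] <- lapply(list_vec, function(x) rowSums(data[x]))
```

Because the column names are built from seq_along(list_vec), the number of vectors does not need to be known in advance.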

We may use a for loop
for(i in 1:3) {
data[[paste0('Column', i)]] <- rowSums(data[get(paste0('V', i))],
na.rm = TRUE)
}
-output
> data
Name A B C D E Column1 Column2 Column3
1 r1 1 5 12 21 15 18 26 36
2 r2 2 4 7 10 9 13 14 19
3 r3 5 15 6 9 6 26 24 15
4 r4 7 8 0 7 18 15 15 25
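If the vectors really do exist as separate objects named V1, V2, V3, base R's mget() can collect them into a list in one call, which avoids hard-coding 1:3 in the loop. This is a sketch under that naming assumption, using the example data from the question.

```r
# Example data from the question
data <- data.frame(
  Name = c("r1", "r2", "r3", "r4"),
  A = c(1, 2, 5, 7),
  B = c(5, 4, 15, 8),
  C = c(12, 7, 6, 0),
  D = c(21, 10, 9, 7),
  E = c(15, 9, 6, 18)
)
V1 <- c("A", "B", "C")
V2 <- c("B", "D")
V3 <- c("D", "E")

# Look up V1, V2, V3 by name and gather them into a named list
vecs <- mget(paste0("V", 1:3))

for (i in seq_along(vecs)) {
  # vecs[[i]] is a character vector of column names to sum
  data[[paste0("Column", i)]] <- rowSums(data[vecs[[i]]], na.rm = TRUE)
}
```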

Related

How to use matrix algebra in R to create new column?

I have a dataframe with multiple columns. I have another dataframe with two columns, factor and coefficient. I want to create a new column in the initial dataframe (mydata) that is the sum of multiplying each element in each row of mydata (a:e) by the coefficients (a:e) in df. The result for the first row in newcol should be 70 (1*1 + 2*2 + 3*3 + 4*4 + 8*5). Ideally, I would be able to somehow replicate this 20+ times with different coefficients.
mydata <- data.frame(a = 1:10, b = 2:11, c = 3:12, d = 4:13, d_1 = 5:14, d_2 = 6:15, d_3 = 7:16, e = 8:17)
df <- data.frame(factor = c('a','b','c','d','e'), coefficient = 1:5)
mydata$newcol <- mydata[,c("a","b","c","d","e")] %*% df$coefficient
mydata$newcol2 <- mydata[,c("a","b","c","d_1","e")] %*% df$coefficient
Any advice would be helpful!
We can use sweep here: subset mydata based on the factor column in df, multiply each column by its coefficient, and then take rowSums to calculate the sum.
mydata$newcol <- rowSums(sweep(mydata[as.character(df$factor)], 2,df$coefficient, `*`))
mydata
# a b c d d_1 d_2 d_3 e newcol
#1 1 2 3 4 5 6 7 8 70
#2 2 3 4 5 6 7 8 9 85
#3 3 4 5 6 7 8 9 10 100
#4 4 5 6 7 8 9 10 11 115
#5 5 6 7 8 9 10 11 12 130
#6 6 7 8 9 10 11 12 13 145
#7 7 8 9 10 11 12 13 14 160
#8 8 9 10 11 12 13 14 15 175
#9 9 10 11 12 13 14 15 16 190
#10 10 11 12 13 14 15 16 17 205
Or we can transpose mydata, multiply by the coefficients, and take colSums.
colSums(t(mydata[as.character(df$factor)]) * df$coefficient)
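The matrix product attempted in the question can also be made to work, because %*% expects a numeric matrix rather than a data frame. A minimal sketch, assuming the mydata and df objects from the question:

```r
mydata <- data.frame(a = 1:10, b = 2:11, c = 3:12, d = 4:13,
                     d_1 = 5:14, d_2 = 6:15, d_3 = 7:16, e = 8:17)
df <- data.frame(factor = c("a", "b", "c", "d", "e"), coefficient = 1:5)

# %*% needs a matrix, so convert the selected columns first
m <- as.matrix(mydata[c("a", "b", "c", "d", "e")])
# The product is a 10 x 1 matrix; as.vector() flattens it into a plain column
mydata$newcol <- as.vector(m %*% df$coefficient)

# Same coefficients applied to a different set of columns
m2 <- as.matrix(mydata[c("a", "b", "c", "d_1", "e")])
mydata$newcol2 <- as.vector(m2 %*% df$coefficient)
```

To repeat this for many coefficient sets, the column names and coefficients could be kept in a list and looped over in the same way.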

subset a dataframe based on a matrix of row numbers and save the result in one list

I have a data frame called df that looks like:
> df
Date A B C
1 2001 1 12 14
2 2002 2 13 15
3 2003 3 14 16
4 2004 4 15 17
5 2005 5 16 18
6 2006 6 17 19
7 2007 7 18 20
8 2008 8 19 21
9 2009 9 20 22
10 2010 10 21 23
and a matrix called index that looks like:
> index
Resample01 Resample02 Resample03 Resample04 Resample05
[1,] 1 7 1 2 7
[2,] 3 9 2 3 8
[3,] 5 1 3 8 1
[4,] 8 3 4 9 4
[5,] 10 4 5 10 9
The numbers in each column stand for the row numbers to be selected.
The aim is to split the dataframe into two mutually exclusive groups, "train" and "test", according to the row numbers in each column of the matrix "index". For example, for "Resample01" the result should look like:
> train
Date A B C
1 2001 1 12 14
3 2003 3 14 16
5 2005 5 16 18
8 2008 8 19 21
10 2010 10 21 23
and
> test
Date A B C
2 2002 2 13 15
4 2004 4 15 17
6 2006 6 17 19
7 2007 7 18 20
9 2009 9 20 22
and this process should be done for each column in "index", and the results should be saved in two lists, "train" and "test", in which "train" looks like:
$train1
Date A B C
1 2001 1 12 14
3 2003 3 14 16
5 2005 5 16 18
8 2008 8 19 21
10 2010 10 21 23
$train2
:
:
$train5
and "test" should be in the same format.
Note that my df actually contains 43,000 observations, and the index matrix has 2000 columns and more than 20,000 rows. I know that subsetting for one column is easy, by doing:
test = df[-c(index[,1]),]
but for multiple columns I don't know how to do it (or loop it), and saving the results as a list also seems difficult.
You could try something like this. The result should be of length ncol(index), and each element should hold two list elements, the training and testing datasets.
apply(index, MARGIN = 2, FUN = function(x, data) {
# x is "demoted" from a column to a vector
list(train = data[x, ], test = data[-x, ])
}, data = df)
The solution from akrun solves my problem.
Using the code from Roman Luštrik:
listofsample = apply(index, MARGIN = 2, FUN = function(x, data) {
list(train = df[x, ], test = df[-x, ])
}, data = df)
followed by the code from akrun:
train = sapply(listofsample, `[`,1)
test = sapply(listofsample, `[`,2)
This produces the two lists that I wanted.
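The two lists can also be built directly, one lapply per list, which makes it easy to give the elements the train1 ... train5 names shown in the question. This is a sketch on the small example data; the names `cols`, `train` and `test` are chosen here for illustration.

```r
df <- data.frame(Date = 2001:2010, A = 1:10, B = 12:21, C = 14:23)
index <- cbind(
  Resample01 = c(1, 3, 5, 8, 10),
  Resample02 = c(7, 9, 1, 3, 4),
  Resample03 = c(1, 2, 3, 4, 5),
  Resample04 = c(2, 3, 8, 9, 10),
  Resample05 = c(7, 8, 1, 4, 9)
)

cols <- seq_len(ncol(index))

# Rows listed in column j form the training set ...
train <- lapply(cols, function(j) df[index[, j], ])
# ... and all remaining rows form the test set
test <- lapply(cols, function(j) df[-index[, j], ])

names(train) <- paste0("train", cols)
names(test) <- paste0("test", cols)
```

With 2000 resamples of a 43,000-row data frame, holding every split in memory at once may be costly, so it can be worth storing only the indices and subsetting on demand.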

Merging and summarizing two dataframes

I have the following data:
a <- data.frame(ID=c("A","B","Z","H"), a=c(0,1,2,45), b=c(3,4,5,22), c=c(6,7,8,3))
> a
ID a b c
1 A 0 3 6
2 B 1 4 7
3 Z 2 5 8
4 H 45 22 3
b <- data.frame(ID=c("A","B","E","W","Z","H"), a=c(9,10,11,39,5,0), b=c(4,2,7,54,12,34), c=c(12,0,34,23,13,14))
> b
ID a b c
1: A 9 4 12
2: B 10 2 0
3: E 11 7 34
4: W 39 54 23
5: Z 5 12 13
6: H 0 34 14
I want to merge both dataframes, keeping only rows of data.frame a and summarize the same columns, so at the end I get:
> z
ID a b c
1 A 9 7 18
2 B 11 6 7
3 Z 7 17 21
4 H 45 56 17
So far I have tried the following:
merge(a,b,by="ID",all.x=T,all.y=F)
> merge(a,b,by="ID",all.x=T,all.y=F)
ID a.x b.x c.x a.y b.y c.y
1 A 0 3 6 9 4 12
2 B 1 4 7 10 2 0
3 H 45 22 3 0 34 14
4 Z 2 5 8 5 12 13
> join(a,b,type="left",by="ID")
ID a b c a b c
1 A 0 3 6 9 4 12
2 B 1 4 7 10 2 0
3 Z 2 5 8 5 12 13
4 H 45 22 3 0 34 14
I cannot manage to summarize the columns.
My dataframe is pretty big, so if the solution can speed things up, that would be even better.
If your data.frame is very big, then you may consider this option:
library(data.table)
## convert data.frame to data.table
setDT(a)
## convert data.frame to data.table
setDT(b)
## merge the two data.tables
c <- merge(a,b,by='ID')
## extract names of all columns except the first one i.e. ID
col_names <- colnames(a)[-1]
## query building
col_1 <- paste0(col_names,'.x')
col_2 <- paste0(col_names,'.y')
cols <- paste(col_1,col_2,sep=',')
cols_2 <- paste0(col_names," = sum(",cols,")")
cols_3 <- paste(cols_2,collapse=',')
query <- paste0("z <- c[,.(",cols_3,"),by=ID]")
## query execution
eval(parse(text = query))
This works at least for your example:
a <- data.frame(ID=c("A","B","Z","H"), a=c(0,1,2,45), b=c(3,4,5,22), c=c(6,7,8,3))
b <- data.frame(ID=c("A","B","E","W","Z","H"), a=c(9,10,11,39,5,0), b=c(4,2,7,54,12,34), c=c(12,0,34,23,13,14))
match_a <- na.omit(match(b$ID, a$ID))
match_b <- na.omit(match(a$ID, b$ID))
df <- cbind(ID = a$ID[match_a], a[match_a, -1] + b[match_b, -1])
First, get matching rows from a in b and vice versa, so we can be sure that we only have those rows that appear in both data frames (and we now know their row-indices in both data frames). Then, simply use vectorized additions for those matching rows, but omit ID, as factor cannot be summed up; add ID back manually.
You cannot add the two data frames directly because they are of unequal size. To make them the same size, keep only the rows of b whose ID is present in a, and then add them element-wise. Note that this assumes the matching IDs appear in the same order in both data frames.
new <- b[b$ID %in% a$ID, ]
cbind(ID = a$ID, a[-1] + new[-1])
# ID a b c
#1 A 9 7 18
#2 B 11 6 7
#3 Z 7 17 21
#4 H 45 56 17
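Another base R option that does not depend on the row order of the two data frames is to stack the relevant rows and sum within each ID using aggregate(). A minimal sketch, assuming the a and b data frames from the question:

```r
a <- data.frame(ID = c("A", "B", "Z", "H"),
                a = c(0, 1, 2, 45), b = c(3, 4, 5, 22), c = c(6, 7, 8, 3))
b <- data.frame(ID = c("A", "B", "E", "W", "Z", "H"),
                a = c(9, 10, 11, 39, 5, 0), b = c(4, 2, 7, 54, 12, 34),
                c = c(12, 0, 34, 23, 13, 14))

# Stack a on top of the rows of b whose ID also occurs in a
stacked <- rbind(a, b[b$ID %in% a$ID, ])

# Sum every non-ID column within each ID
z <- aggregate(. ~ ID, data = stacked, FUN = sum)

# aggregate() returns rows sorted by ID; restore the row order of a
z <- z[match(a$ID, z$ID), ]
```

An ID that exists only in a is still kept, since its single row is simply summed on its own.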

R: subset a data frame based on conditions from another data frame

Here is a problem I am trying to solve. Say, I have two data frames like the following:
observations <- data.frame(id = rep(rep(c(1,2,3,4), each=5), 5),
time = c(rep(1:5,4), rep(6:10,4), rep(11:15,4), rep(16:20,4), rep(21:25,4)),
measurement = rnorm(100,5,7))
sampletimes <- data.frame(location = letters[1:20],
id = rep(1:4,5),
time1 = rep(c(2,7,12,17,22), each=4),
time2 = rep(c(4,9,14,19,24), each=4))
They both contain a column named id, which links the data frames. I want to have the measurements from observations for which time is between time1 and time2 from the sampletimes data frame. Additionally, I'd like to connect the appropriate location to each measurement.
I have successfully done this by converting my sampletimes to a wide format (i.e. all the time1 and time2 information in one row per entry for id), merging the two data frames by the id variable, and using conditional statements to take only instances when the time falls between at least one of the time intervals in the row, and then assigning location to the appropriate measurement.
However, I have around 2 million rows in observations and doing this takes a really long time. I'm looking for a better way where I can keep the data in long format. The example dataset is very simple, but in reality, my data contains variable numbers of intervals and locations per id.
For our example, the data frame I would hope to get back would be as follows:
id time measurement letters[1:20]
1 3 10.5163892 a
2 3 5.5774119 b
3 3 10.5057060 c
4 3 14.1563179 d
1 8 2.2653761 e
2 8 -1.0905546 f
3 8 12.7434161 g
4 8 17.6129261 h
1 13 10.9234673 i
2 13 1.6974481 j
3 13 -0.3664951 k
4 13 13.8792198 l
1 18 6.5038847 m
2 18 1.2032935 n
3 18 15.0889469 o
4 18 0.8934357 p
1 23 3.6864527 q
2 23 0.2404074 r
3 23 11.6028766 s
4 23 20.7466908 t
Here's a proposal with merge:
# merge both data frames
dat <- merge(observations, sampletimes, by = "id")
# extract valid rows
dat2 <- dat[dat$time > dat$time1 & dat$time < dat$time2, seq(4)]
# sort
dat2[order(dat2$time, dat2$id), ]
The result:
id time measurement location
11 1 3 7.086246 a
141 2 3 6.893162 b
251 3 3 16.052627 c
376 4 3 -6.559494 d
47 1 8 11.506810 e
137 2 8 10.959782 f
267 3 8 11.079759 g
402 4 8 11.082015 h
83 1 13 5.584257 i
218 2 13 -1.714845 j
283 3 13 -11.196792 k
418 4 13 8.887907 l
99 1 18 1.656558 m
234 2 18 16.573179 n
364 3 18 6.522298 o
454 4 18 1.005123 p
125 1 23 -1.995719 q
250 2 23 -6.676464 r
360 3 23 10.514282 s
490 4 23 3.863357 t
Not efficient, but it does the job:
subset(merge(observations,sampletimes), time > time1 & time < time2)
id time measurement location time1 time2
11 1 3 3.180321 a 2 4
47 1 8 6.040612 e 7 9
83 1 13 -5.999317 i 12 14
99 1 18 2.689414 m 17 19
125 1 23 12.514722 q 22 24
137 2 8 4.420679 f 7 9
141 2 3 11.492446 b 2 4
218 2 13 6.672506 j 12 14
234 2 18 12.290339 n 17 19
250 2 23 12.610828 r 22 24
251 3 3 8.570984 c 2 4
267 3 8 -7.112291 g 7 9
283 3 13 6.287598 k 12 14
360 3 23 11.941846 s 22 24
364 3 18 -4.199001 o 17 19
376 4 3 7.133370 d 2 4
402 4 8 13.477790 h 7 9
418 4 13 3.967293 l 12 14
454 4 18 12.845535 p 17 19
490 4 23 -1.016839 t 22 24
EDIT
Since you have millions of rows, you should try a data.table solution:
library(data.table)
OBS <- data.table(observations)
SAM <- data.table(sampletimes)
merge(OBS,SAM,allow.cartesian=TRUE,by='id')[time > time1 & time < time2]
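Staying in base R, one way to limit the size of the intermediate merge is to process one id at a time, so that only that id's cross product exists at any moment. The sketch below uses the example data with an arbitrary seed; whether it is actually faster on 2 million rows depends on the data and is not measured here.

```r
set.seed(1)  # arbitrary seed, only to make the example reproducible
observations <- data.frame(
  id = rep(rep(c(1, 2, 3, 4), each = 5), 5),
  time = c(rep(1:5, 4), rep(6:10, 4), rep(11:15, 4),
           rep(16:20, 4), rep(21:25, 4)),
  measurement = rnorm(100, 5, 7)
)
sampletimes <- data.frame(
  location = letters[1:20],
  id = rep(1:4, 5),
  time1 = rep(c(2, 7, 12, 17, 22), each = 4),
  time2 = rep(c(4, 9, 14, 19, 24), each = 4)
)

# Merge and filter one id at a time
pieces <- lapply(split(observations, observations$id), function(obs) {
  sam <- sampletimes[sampletimes$id == obs$id[1], ]
  m <- merge(obs, sam, by = "id")
  # Keep observations strictly inside an interval, as in the answers above
  m[m$time > m$time1 & m$time < m$time2,
    c("id", "time", "measurement", "location")]
})

result <- do.call(rbind, pieces)
result <- result[order(result$time, result$id), ]
```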

Is there a way for a column to keep a running total of all columns that precede it?

I have a data frame to which I am adding columns, and I would like one column to sum them up. I will not know the names of the columns ahead of time, so I guess I would need some kind of function that would count the number of columns and then sum them up.
If my data is like this:
w=1:10
x=11:20
z=data.frame(w,x)
I would like the total for z$w and z$x. But then if I were to add z$y, I would like to have that incorporated into the sum as well.
You should consider not adding a column for the sum, and just call rowSums(z) whenever you need it. That removes the hassle of having to update the column whenever you modify your data.frame.
Now if that's really what you want, here is a little function that will update the sum and always keep it as the last column. You'll have to run it every time you make a change to your data.frame:
> refresh.total <- function(df) {
+ df$total <- NULL
+ df$total <- rowSums(df)
+ return(df)
+ }
>
> z <- refresh.total(z)
> z
w x total
1 1 11 12
2 2 12 14
3 3 13 16
4 4 14 18
5 5 15 20
6 6 16 22
7 7 17 24
8 8 18 26
9 9 19 28
10 10 20 30
>
> z$y <- 2:11
> z <- refresh.total(z)
> z
w x y total
1 1 11 2 14
2 2 12 3 17
3 3 13 4 20
4 4 14 5 23
5 5 15 6 26
6 6 16 7 29
7 7 17 8 32
8 8 18 9 35
9 9 19 10 38
10 10 20 11 41
After you've finished adding in all the columns, you can do:
z$total <- rowSums(z)
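rowSums() stops with an error if the data frame contains a non-numeric column, so a variant of the refresh function can restrict the sum to numeric columns. This is a sketch only; the name `refresh_total` and the `total_name` argument are made up for this example.

```r
refresh_total <- function(df, total_name = "total") {
  # Drop any previous total so it is not counted twice
  df[[total_name]] <- NULL
  # Sum only the numeric columns; character or factor columns are skipped
  numeric_cols <- vapply(df, is.numeric, logical(1))
  df[[total_name]] <- rowSums(df[numeric_cols])
  df
}

z <- data.frame(w = 1:10, x = 11:20)
z <- refresh_total(z)

# Add a text column and another numeric column, then refresh again
z$label <- letters[1:10]
z$y <- 2:11
z <- refresh_total(z)
```

As in the original function, the total is removed and re-added on each call, so it always ends up as the last column.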
