R: Creating new columns using a vector that contains variable names

I have a data frame and a vector containing variable names. I want to create a new variable that holds the row sum of the variables named in my vector, and I want the name of the new variable to be the concatenation of those variable names.
For example, I have this data:
> data
Name A B C D E
r1 1 5 12 21 15
r2 2 4 7 10 9
r3 5 15 6 9 6
r4 7 8 0 7 18
and this vector
>Vec
"A" , "C" , "D"
The result I want is the sum of variables A, C and D, with the new variable named ACD.
Here is the result I want:
> data
Name A B C D ACD E
r1 1 5 12 21 34 15
r2 2 4 7 10 19 9
r3 5 15 6 9 20 6
r4 7 8 0 7 14 18
I tried this :
data <- cbind(data , as.data.frame(rowSums(data[,Vec]) ))
But I don't know how to set the name.
Here's the result I got:
>data
Name A B C D E rowSums(data[,Vec])
r1 1 5 12 21 15 34
r2 2 4 7 10 9 19
r3 5 15 6 9 6 20
r4 7 8 0 7 18 14
Note that I gave just a simple example to explain what I want to do.
I want to assign the new data (containing the new variable) back to my old data, as I did in my command above.
Edit 1: in my real program, I don't know the elements (the variable names) of my vector in advance, so I cannot do data$ACD <- cbind(data , as.data.frame(rowSums(data[,Vec]) )) as suggested by Pax. In fact, I have a for loop that generates my vectors, and each time I create a variable to hold the result I want (the sum of the variables in my vector), so I don't know how to assign the name without knowing the elements of the vector.
Please tell me if you need any more clarification or information.
Thank you

It's not a one line solution but you can set the name on the subsequent line:
data <- data.frame(A = c(1, 2, 5, 7),
B = c(5, 4, 15, 8),
C = c(12, 7, 6, 0),
D = c(21, 10, 9, 7),
E = c(15, 9, 6, 18))
Vec <- c("A" , "C" , "D")
data <- cbind(data, rowSums(data[,Vec]))
# Add name
names(data)[ncol(data)] <- paste(Vec, collapse="")
# A B C D E ACD
# 1 1 5 12 21 15 34
# 2 2 4 7 10 9 19
# 3 5 15 6 9 6 20
# 4 7 8 0 7 18 14
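For the looped case described in the question's edit, where the vectors are not known in advance, the same idea can probably be written as a single assignment with [[<-. This is only a minimal sketch: the list name vec_list and its contents are hypothetical stand-ins for whatever the asker's loop generates.

```r
data <- data.frame(A = c(1, 2, 5, 7),
                   B = c(5, 4, 15, 8),
                   C = c(12, 7, 6, 0),
                   D = c(21, 10, 9, 7),
                   E = c(15, 9, 6, 18))

# Hypothetical: vectors of column names produced elsewhere (e.g. by a loop)
vec_list <- list(c("A", "C", "D"), c("B", "E"))

for (Vec in vec_list) {
  new_name <- paste(Vec, collapse = "")   # e.g. "ACD"
  data[[new_name]] <- rowSums(data[Vec])  # create and name the column in one step
}

names(data)
```

Because the column name is built from Vec itself, nothing in the loop body needs to know the names ahead of time.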

Here is an option with the janitor package. You can use adorn_totals, which appends a totals row or column to a data.frame. Here the name argument sets the name of the new column, and the final argument, all_of(Vec), selects the columns to total.
library(janitor)
adorn_totals(data, "col", fill = NA, na.rm = TRUE, name = paste(Vec, collapse = ""), all_of(Vec))
Output
A B C D E ACD
1 5 12 21 15 34
2 4 7 10 9 19
5 15 6 9 6 20
7 8 0 7 18 14

Related

Labelling rows according to how many times the group appeared in previous rows

Suppose I have the following data.frame object:
df = data.frame(id=(1:25),
col1=c('a','a','a',
'b','b','b',
'c','c','c',
'd','d',
'b','b','b',
'e',
'c','c','c',
'e','e',
'a','a','a',
'e','e'))
From the snapshot above, you can see that there are two groups of rows that have col1=="a": rows 1 through 3 and rows 21 through 23. Similarly, there are three groups of rows that have col1=="e": row 15, rows 19 through 20 and rows 24 through 25 (and so on and so on with "b", "c" and "d").
Here's my main question
Is it possible to label the rows according to what "chunk" we're currently on? More explicitly: since rows 1 through 3 are the first time where we have col1=="a", they should be labelled as 1. Then, rows 21 through 23 should be labelled as 2, because that is the second time that we have a set of rows that have col1=="a". Using the same logic, but for col1=="e", we'd label row 15 as 1, rows 19 and 20 as 2 and rows 24 and 25 as 3 (again, so on and so on with "b", "c" and "d").
Desired output
Here is what the resulting data.frame would look like:
df = data.frame(id=(1:25),
col1=c('a','a','a',
'b','b','b',
'c','c','c',
'd','d',
'b','b','b',
'e',
'c','c','c',
'e','e',
'a','a','a',
'e','e'),
grup=c(1,1,1,
1,1,1,
1,1,1,
1,1,
2,2,2,
1,
2,2,2,
2,2,
2,2,2,
3,3))
My attempt
I tried implementing a solution using a for loop, but that was quite slow (the original data I'm working on has about 500,000 rows), and it just looked a bit sloppy:
my_classifier = function(input_df, ref_column){
# Keeps a tally of how many times each unique group was "found" before.
group_counter = list()
# Dealing with the corner case of the first row
group_counter[[input_df[[ref_column]][1]]] = 1
output_groups = rep(-1, nrow(input_df))
output_groups[1] = 1
# The for loop starts at the second row because I've already "dealt" with the
# first row in the corner cases above
for(i in 2:nrow(input_df)){
prev_group = input_df[[ref_column]][i-1]
this_group = input_df[[ref_column]][i]
if(is.null(group_counter[[this_group]])){
this_counter = 0
}
else{
this_counter = group_counter[[this_group]]
}
if(prev_group != this_group){
this_counter = this_counter + 1
}
output_groups[i] = this_counter
group_counter[[this_group]] = this_counter
}
return(output_groups)
}
df$grup = my_classifier(df,'col1')
Is there a quicker/more efficient way to solve this problem? Maybe something that relies on vectorized functions or something?
Important notes
Consider that we cannot rely on the number of repetitions of each "block". Sometimes, col1 will have just one single row of a particular group, while other times the "block" will have several rows where col1 share the same value. Also consider that we cannot assume any logic in the "order" or the number of times each group shows up.
So, for example, there might be a stretch of 10 rows where col1=="z", then a stretch of 15 rows where col1=="x", then another single row where col1=="x" and then finally a stretch of 100 rows where col1=="w".
You can use data.table::rleid() twice, like this:
library(data.table)
setDT(df)[,grp:=rleid(col1)][, grp:=rleid(grp), by=col1][order(id)]
Output:
id col1 grp
<int> <char> <int>
1: 1 a 1
2: 2 a 1
3: 3 a 1
4: 4 b 1
5: 5 b 1
6: 6 b 1
7: 7 c 1
8: 8 c 1
9: 9 c 1
10: 10 d 1
11: 11 d 1
12: 12 b 2
13: 13 b 2
14: 14 b 2
15: 15 e 1
16: 16 c 2
17: 17 c 2
18: 18 c 2
19: 19 e 2
20: 20 e 2
21: 21 a 2
22: 22 a 2
23: 23 a 2
24: 24 e 3
25: 25 e 3
id col1 grp
Here is a possible base R solution:
change <- with(rle(df$col1), rep(seq_along(values), lengths))
cbind(df, grp = with(df, ave(
change,
col1,
FUN = function(x)
inverse.rle(within.list(rle(x), values <- seq_along(values)))
)))
Or another option combining rle and dplyr, using a small helper function:
rle_new <- function(x) {
x <- rle(x)$lengths
rep(seq_along(x), times=x)
}
library(dplyr)
df %>%
mutate(grp = rle_new(col1)) %>%
group_by(col1) %>%
mutate(grp = rle_new(grp))
Output
id col1 grp
1 1 a 1
2 2 a 1
3 3 a 1
4 4 b 1
5 5 b 1
6 6 b 1
7 7 c 1
8 8 c 1
9 9 c 1
10 10 d 1
11 11 d 1
12 12 b 2
13 13 b 2
14 14 b 2
15 15 e 1
16 16 c 2
17 17 c 2
18 18 c 2
19 19 e 2
20 20 e 2
21 21 a 2
22 22 a 2
23 23 a 2
24 24 e 3
25 25 e 3
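Another way to read the approaches above: flag the first row of every run, then take a cumulative sum of those flags within each value of col1. The following base R sketch illustrates that idea under the assumption that col1 contains no NA values; the names x and run_start are introduced here only for illustration.

```r
df <- data.frame(id = 1:25,
                 col1 = c('a','a','a','b','b','b','c','c','c','d','d',
                          'b','b','b','e','c','c','c','e','e',
                          'a','a','a','e','e'))

x <- df$col1
# TRUE on the first row of every run of identical consecutive values
run_start <- c(TRUE, x[-1] != x[-length(x)])
# Within each value of col1, count how many runs have started so far
df$grp <- ave(as.integer(run_start), x, FUN = cumsum)
df
```

Both steps are vectorized, so this should scale reasonably to a few hundred thousand rows, although it has not been benchmarked here.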

How do I loop over a vector of strings and another vector of numbers at the same time in R?

I want to loop over a vector of strings that correspond to columns of a data frame (d) and put these columns into a new data frame (h). But instead of copying the columns unchanged, I want to add a distinct number to each column. The numbers are specified in the vector numvec. Here is my sample code:
d<-data.frame(replicate(5,sample(0:9,300,rep=TRUE)))
d$rn<-replicate(1,"mystring")
h<-as.data.frame(d[,6])
colnames(d)[1:5]<-c("first","second","third","fourth","fifth")
trtvec<-colnames(d)[1:5]
numvec<-c(2,8,7,6,5)
#loop for each trait
for(i in seq_along(trtvec))
{
h$trtvec[i]<-d$trtvec[i]+numvec[i]
}
Basically, already the first part
h$trtvec[i]<-d$trtvec[i]
doesn't seem to work.
Can anyone help me?
If I understand correctly, the following does what you want.
It uses an apply loop to add numvec to the first five elements of each row of data frame d.
h <- d[6]
numvec <- c(2, 8, 7, 6, 5)
h1 <- cbind(h, t(apply(d[1:5], 1, '+', numvec)))
head(h1)
# rn first second third fourth fifth
#1 mystring 3 11 9 12 11
#2 mystring 8 17 8 10 8
#3 mystring 8 13 8 15 12
#4 mystring 8 12 12 10 7
#5 mystring 10 17 11 15 5
#6 mystring 8 12 10 6 14
If you want column rn as the last column, use cbind.data.frame and change the order of the arguments.
h2 <- cbind.data.frame(t(apply(d[1:5], 1, '+', numvec)), h)
head(h2)
# first second third fourth fifth rn
#1 3 11 9 12 11 mystring
#2 8 17 8 10 8 mystring
#3 8 13 8 15 12 mystring
#4 8 12 12 10 7 mystring
#5 10 17 11 15 5 mystring
#6 8 12 10 6 14 mystring
You can use sapply on the selected column names like this
data <- data.frame(a = rep(0, 100), b = rep(1, 100), c = rep(2, 100))
# using a named vector to simplify indexing
num.vec <- c(a = 2, b = 3)
# add the corresponding number to selected columns
new.data <- sapply(names(num.vec), FUN = function(x) data[,x] + num.vec[x])
head(new.data, 1)
a b
[1,] 2 4
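As for why the original loop fails: the $ operator takes the name after it literally, so h$trtvec refers to a column actually called "trtvec" rather than to the column whose name is stored in trtvec[i]. A minimal sketch of the same loop using [[ ]] instead is shown below; the seed is arbitrary and only there to make the example reproducible.

```r
set.seed(1)  # arbitrary seed, an assumption for reproducibility
d <- data.frame(replicate(5, sample(0:9, 300, replace = TRUE)))
colnames(d) <- c("first", "second", "third", "fourth", "fifth")
d$rn <- "mystring"

trtvec <- c("first", "second", "third", "fourth", "fifth")
numvec <- c(2, 8, 7, 6, 5)

h <- d["rn"]  # start the new data frame with the label column
for (i in seq_along(trtvec)) {
  # [[ ]] looks up the column whose name is stored in trtvec[i]
  h[[trtvec[i]]] <- d[[trtvec[i]]] + numvec[i]
}
head(h)
```

Indexing both vectors with the same i is what keeps each column paired with its number.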

For Loops using colnames in R , increment i by 10

I have a somewhat specific problem: running a for loop over colnames, incrementing i by a fixed step, and creating a new data frame at each step.
For example
x <- data.frame(A = c(1, 2), B = c(3, 4),C =c(5,6),D=c(7,8),E=c(9,10),F=c(11,12),G=c(13,14),
H=c(16,17),I=c(18,19),J=c(22,25),K=c(12,13),L=c(19,20))
# the loop below creates 12 data frames, A to L, which I do not want
for (i in colnames(x))
assign(i, subset(x, select=i))
I want to increment i by 3, so I want my output as col A to C in one dataframe, col D to F in one dataframe, col G to I in one dataframe and col J to L in one dataframe, which means only 4 dataframes not 12.
Assigning to the global environment is generally not the way to go, especially from functions. Instead, you could generate a list containing the split data frames.
First, make a vector of the indices where each new data frame should start, beginning at 1 and incrementing by i.
i<- 3
start_indices <- seq(1,ncol(x),by=i)
> start_indices
[1] 1 4 7 10
Use lapply to generate a list of split data frames.
res <- lapply(start_indices, function(j){
return(x[,j:(j+i-1)])
})
>res
[[1]]
A B C
1 1 3 5
2 2 4 6
[[2]]
D E F
1 7 9 11
2 8 10 12
[[3]]
G H I
1 13 16 18
2 14 17 19
[[4]]
J K L
1 22 12 19
2 25 13 20
If you want to use your approach
> for (i in 1:(ncol(x)/3))
+ assign(names(x)[3*i-2], subset(x, select=(3*i-2):(3*i)))
> A
A B C
1 1 3 5
2 2 4 6
> D
D E F
1 7 9 11
2 8 10 12
> G
G H I
1 13 16 18
2 14 17 19
> J
J K L
1 22 12 19
2 25 13 20
To add one last step to the previous answer by Heroka, here is how to unpack the list and create multiple data frames from it:
for(i in 1:length(res)) {
assign(paste0("gf", i), res[[i]])
}
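The list-based approach can also be sketched with split.default, which splits a data frame by columns when given one group label per column. This is an alternative under the same assumptions as above, not the answerers' method; the names step and grp are introduced here for illustration. It also copes with a column count that is not a multiple of the step, since the last group simply ends up smaller.

```r
x <- data.frame(A = c(1, 2), B = c(3, 4), C = c(5, 6), D = c(7, 8),
                E = c(9, 10), F = c(11, 12), G = c(13, 14), H = c(16, 17),
                I = c(18, 19), J = c(22, 25), K = c(12, 13), L = c(19, 20))

step <- 3
# One group label per column: 1, 1, 1, 2, 2, 2, ...
grp <- ceiling(seq_along(x) / step)
# split.default treats the data frame as a list of columns
res <- split.default(x, grp)
res[["2"]]
```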

automating a normal transformation function in R over multiple columns

I have a data frame m with:
>m
id w y z
1 2 5 8
2 18 5 98
3 1 25 5
4 52 25 8
5 5 5 4
6 3 3 5
Below is a general function for normally transforming a variable that I need to apply to columns w,y,z.
y<-qnorm((rank(x,na.last="keep")-0.5)/sum(!is.na(x)))
For example, if I wanted to run this function on "column w" to get the output column appended to dataframe "m" then:
m$w_n<-qnorm((rank(m$w,na.last="keep")-0.5)/sum(!is.na(m$w)))
Can someone help me automate this to run on multiple columns in data frame m?
Ideally, I would want an output data frame with the following columns:
id w y z w_n y_n z_n
Note this is a sample data frame, the one I have is much larger and I have more letter columns to run this function on other than w, y,z.
Thanks!
There is probably a way to do it in a single step, but what about:
df <- data.frame(id = 1:6, w = sample(50, 6), z = sample(50, 6) )
df
id w z
1 1 39 40
2 2 20 26
3 3 43 11
4 4 4 37
5 5 36 24
6 6 27 14
transCols <- function(x) qnorm((rank(x,na.last="keep")-0.5)/sum(!is.na(x)))
tmpdf <- lapply(df[, -1], transCols)
names(tmpdf) <- paste0(names(tmpdf), "_n")
df_final <- cbind(df, tmpdf)
df_final
id w z w_n z_n
1 1 39 40 0.6744898 1.3829941
2 2 20 26 -0.6744898 0.2104284
3 3 43 11 1.3829941 -1.3829941
4 4 4 37 -1.3829941 0.6744898
5 5 36 24 0.2104284 -0.2104284
6 6 27 14 -0.2104284 -0.6744898
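If only some columns should be transformed, the same function can be applied to a chosen set of names and the results assigned back in one step. The sketch below uses the data frame m from the question; the vector cols is a hypothetical name for the list of columns to transform.

```r
m <- data.frame(id = 1:6,
                w = c(2, 18, 1, 52, 5, 3),
                y = c(5, 5, 25, 25, 5, 3),
                z = c(8, 98, 5, 8, 4, 5))

# Rank-based normal transformation from the question
transCols <- function(x) qnorm((rank(x, na.last = "keep") - 0.5) / sum(!is.na(x)))

cols <- c("w", "y", "z")  # columns to transform
# Create w_n, y_n, z_n alongside the original columns
m[paste0(cols, "_n")] <- lapply(m[cols], transCols)
names(m)
```

Extending this to more columns only requires adding their names to cols.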

sum by group in a data.frame

I'm trying to get the sum of a numerical variable per level of a categorical variable (in a data frame). I've tried using tapply, but it doesn't take a whole data.frame.
Here is a working example with some data that looks like this:
> set.seed(667)
> df <- data.frame(a = sample(c("Group A","Group B","Group C",NA), 10, rep = TRUE),
b = sample(c(1, 2, 3, 4, 5, 6), 10, rep=TRUE),
c = sample(c(11, 12, 13, 14, 15, 16), 10, rep=TRUE))
> df
a b c
1 Group A 4 12
2 Group B 6 12
3 <NA> 4 14
4 Group C 1 16
5 <NA> 2 14
6 <NA> 3 13
7 Group C 4 13
8 <NA> 6 15
9 Group B 3 16
10 Group B 5 16
using tapply, I can get one vector at a time:
> tapply(df$b,df$a,sum)
Group A Group B Group C
4 14 5
but I am more interested in getting something like this:
a b c
1 Group A 4 12
2 Group B 14 44
3 Group C 5 29
Any help would be appreciated. Thanks.
Use aggregate instead:
aggregate(df[ , c("b","c")], df['a'], FUN=sum)
a b c
1 Group A 4 12
2 Group B 14 44
3 Group C 5 29
The by argument of aggregate must be a list, so passing df$a (a plain vector) will error out, whereas df['a'] is a one-column data frame, which is a list. The function is then applied to each column in the first argument separately.
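aggregate also has a formula interface, which avoids the list requirement altogether. The sketch below rebuilds the example data by hand from the printed table rather than with sample(), so it does not depend on the random seed. With the formula method's default na.action, rows where a is NA are dropped.

```r
df <- data.frame(a = c("Group A", "Group B", NA, "Group C", NA, NA,
                       "Group C", NA, "Group B", "Group B"),
                 b = c(4, 6, 4, 1, 2, 3, 4, 6, 3, 5),
                 c = c(12, 12, 14, 16, 14, 13, 13, 15, 16, 16))

# cbind(b, c) on the left sums both columns; a on the right is the grouping
res <- aggregate(cbind(b, c) ~ a, data = df, FUN = sum)
res
```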
