Create multiple data frames from one based off values with a for loop - r

I have a large data frame that I would like to convert in to smaller subset data frames using a for loop. I want the new data frames to be based on the the values in a column in the large/parent data frame. Here is an example
x<- 1:20
y <- c("A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","C","C","C")
df <- as.data.frame(cbind(x,y))
ok, now I want three data frames, one will be columns x and y but only where y == "A", the second where y==
"B" etc etc. So the end result will be 3 new data frames df.A, df.B, and df.C. I realize that this would be easy to do out of a for loop but my actual data has a lot of levels of y so using a for loop (or similar) would be nice.
Thanks!

If you want to create separate objects in a loop, you can use assign. I used unique because you said you had many levels.
for(i in unique(df$y)) {
nam <- paste("df", i, sep = ".")
assign(nam, df[df$y==i,])
}
> df.A
x y
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 6 A
7 7 A
8 8 A
> df.B
x y
9 9 B
10 10 B
11 11 B
12 12 B
13 13 B
14 14 B

I think you just need the split function:
split(df, df$y)
$A
x y
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 6 A
7 7 A
8 8 A
$B
x y
9 9 B
10 10 B
11 11 B
12 12 B
13 13 B
14 14 B
15 15 B
16 16 B
17 17 B
$C
x y
18 18 C
19 19 C
20 20 C
It is just a matter of properly subsetting the output to split and store the results to objects like dfA <- split(df, df$y)[[1]] and dfB <- split(df, df$y)[[2]] and so on.

Related

Moving down columns in data frames in R

Suppose I have the next data frame:
df<-data.frame(step1=c(1,2,3,4),step2=c(5,6,7,8),step3=c(9,10,11,12),step4=c(13,14,15,16))
step1 step2 step3 step4
1 1 5 9 13
2 2 6 10 14
3 3 7 11 15
4 4 8 12 16
and what I have to do is something like the following:
df2<-data.frame(col1=c(1,2,3,4,5,6,7,8,9,10,11,12),col2=c(5,6,7,8,9,10,11,12,13,14,15,16))
col1 col2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
6 6 10
7 7 11
8 8 12
9 9 13
10 10 14
11 11 15
12 12 16
How can I do that? consider that more steps can be included (example, 20 steps).
Thanks!!
We can design a function to achieve this task. df_final is the final output. Notice that bin is an argument that the users can specify how many columns to transform together.
# A function to conduct data transformation
trans_fun <- function(df, bin = 3){
# Calculate the number of new columns
new_ncol <- (ncol(df) - bin) + 1
# Create a list to store all data frames
df_list <- lapply(1:new_ncol, function(num){
return(df[, num:(num + bin - 1)])
})
# Convert each data frame to a vector
dt_list2 <- lapply(df_list, unlist)
# Convert dt_list2 to data frame
df_final <- as.data.frame(dt_list2)
# Set the column and row names of df_final
colnames(df_final) <- paste0("col", 1:new_ncol)
rownames(df_final) <- 1:nrow(df_final)
return(df_final)
}
# Apply the trans_fun
df_final <- trans_fun(df)
df_final
col1 col2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
6 6 10
7 7 11
8 8 12
9 9 13
10 10 14
11 11 15
12 12 16
Here is a method using dplyr and reshape2 - this assumes all of the columns are the same length.
library(dplyr)
library(reshape2)
Drop the last column from the dataframe
df[,1:ncol(df)-1]%>%
melt() %>%
dplyr::select(col1=value) -> col1
Drop the first column from the dataframe
df %>%
dplyr::select(-step1) %>%
melt() %>%
dplyr::select(col2=value) -> col2
Combine the dataframes
bind_cols(col1, col2)
This should do the work:
df2 <- data.frame(col1 = 1:(length(df$step1) + length(df$step2)))
df2$col1 <- c(df$step1, df$step2, df$step3)
df2$col2 <- c(df$step2, df$step3, df$step4)
Things to point:
The important thing to see in the first line of the code, is the need for creating a table with the right amount of rows
Calling a columns that does not exist will create one, with that name
Deleting columns in R should be done like this df2$col <- NULL
Are you not just looking to do:
df2 <- data.frame(col1 = unlist(df[,-nrow(df)]),
col2 = unlist(df[,-1]))
rownames(df2) <- NULL
df2
col1 col2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
6 6 10
7 7 11
8 8 12
9 9 13
10 10 14
11 11 15
12 12 16

How to do iterations in R?

I'm operating with a dataset that contains the values of same variables at different points in time. In the example below I have the values of variables a and b at time points 1 and 2.
> set.seed(1)
> data <- data.frame(matrix(sample(16), ncol = 4))
> names(data) <- paste(rep(c("a", "b"), each = 2), 1:2, sep = "")
> data
a1 a2 b1 b2
1 5 3 14 13
2 6 10 1 8
3 9 11 2 4
4 12 15 7 16
Now, suppose I want to calculate a new variable for both time points so that it would contain the sum of a and b (instead of the NAs as in example below). Since my actual dataset contains about 15 different variables and 10 time points (so 150 columns), I want to automate this calculation of 10 new variables.
> data[, paste("ab", 1:2, sep = "")] <- NA
> data
a1 a2 b1 b2 ab1 ab2
1 5 3 14 13 NA NA
2 6 10 1 8 NA NA
3 9 11 2 4 NA NA
4 12 15 7 16 NA NA
I've previously used Stata where I could create a simple 'foreach' loop to do this. Something like below.
foreach t of numlist 1/2 {
generate ab`t' = a`t' + b`t'
}
But I've learned that using loops in R is not feasible, nor have I any idea how to loop over variable names like that in R.
So what would be the correct solution for my problem in R?
This will replicate the same foreach loop you used in Stata.
for(i in 1:2){
data[, paste("ab", i, sep="")] <-
data[,paste("a", i, sep="")] + data[, paste("b", i, sep="")]
}
The output looks like this:
> data
a1 a2 b1 b2 ab1 ab2
1 15 1 16 12 31 13
2 10 7 14 3 24 10
3 2 5 9 4 11 9
4 6 8 13 11 19 19
to do this the R way,
make use of some native iteration via a *apply function
use the built-in rowSums (as in #Sotos) answer
make use of assignment into the data.frame, that is `]`<-
all together
data[paste0('ab', 1:2)] <- sapply(1:2,
function(i)
rowSums(data[paste0(c('a', 'b'), i)]))
data
# a1 a2 b1 b2 ab1 ab2
# 1 5 3 14 13 19 16
# 2 6 10 1 8 7 18
# 3 9 11 2 4 11 15
# 4 12 15 7 16 19 31
ps, in a program use vapply instead, you'll need to provide an additional argument specifying the shape of the output but its safer and sometimes faster
You can do without iteration:
data$ab1 <- data$a1 + data$b1
data$ab2 <- data$a2 + data$b2
or
data <- transform(data, ab1=a1+b1, ab2=a2+b2)
BTW:
It is better not to name an object data because data= is often a parameter in functions.
Here is one way to do it. We iterate over the unique values of the column names and we calculate the rowSums when those unique values match the colname values.
sapply(unique(sub('\\D', '', names(data))),
function(i) rowSums(data[,grepl(i, sub('\\D', '', names(data)))]))
# 1 2
#[1,] 17 23
#[2,] 24 22
#[3,] 14 10
#[4,] 15 11

For Loops using colnames in R , increment i by 10

I had a bit specific problem in running for loops in colnames , increment i by 10 and creating new dataframe using i.
For example
x <- data.frame(A = c(1, 2), B = c(3, 4),C =c(5,6),D=c(7,8),E=c(9,10),F=c(11,12),G=c(13,14),
H=c(16,17),I=c(18,19),J=c(22,25),K=c(12,13),L=c(19,20))
# below create 12 dataframe starting from A to L which i do not want
for (i in colnames(x))
assign(i, subset(x, select=i))
I want to increment i by 3, so I want my output as col A to C in one dataframe, col D to F in one dataframe, col G to I in one dataframe and col J to L in one dataframe, which means only 4 dataframes not 12.
Assigning to the global environment is generally not the way to go, especially from functions. You could do the following, generating a list containg the splitted dataframes.
Make a vector of indices where a 'new' dataframe should start, starting at 1 and incrementing by i.
i<- 3
start_indices <- seq(1,ncol(x),by=i)
> start_indices
[1] 1 4 7 10
Use lapply to generate a list of splitted dataframes.
res <- lapply(start_indices, function(j){
return(x[,j:(j+i-1)])
})
>res
[[1]]
A B C
1 1 3 5
2 2 4 6
[[2]]
D E F
1 7 9 11
2 8 10 12
[[3]]
G H I
1 13 16 18
2 14 17 19
[[4]]
J K L
1 22 12 19
2 25 13 20
If you want to use your approach
> for (i in 1:(ncol(x)/3))
+ assign(names(x)[3*i-2], subset(x, select=(3*i-2):(3*i)))
> A
A B C
1 1 3 5
2 2 4 6
> D
D E F
1 7 9 11
2 8 10 12
> G
G H I
1 13 16 18
2 14 17 19
> J
J K L
1 22 12 19
2 25 13 20
Just thought to add last line on unlisting list ,previous answer by Heroka
create multiple data frame from list of
for(i in 1:length(res)) {
assign(paste0("gf", i), res[[i]])
}

Creating a dataframe grouping observations according to labels

I have an x vector with categorical variables and a y vector of numerical variables, both of the same length.
I need to create a data-frame in which all numerical observations in y are separated into groups by a categorical label in x so the end result would look something like:
x obs1 obs2 obs3
a 1 3 5
b 6 7 8
c 3 4 6
Now both aggregate and tapply require a FUN specification but I don't want to do operations on the variables.
x= {random sampling from letters of the alphabet}
y= {random numbers}
Remember, everything is a function in R. So things like c() are just function calls.
x <- rep(letters[1:3], each=3)
y <- c(1, 3, 5, 6, 7, 8, 3, 4, 6)
foo <- tapply(y, x, c)
# > foo
# $a
# [1] 1 3 5
# $b
# [1] 6 7 8
# $c
# [1] 3 4 6
Then you can use this silly pattern to get the data.frame you're looking for:
do.call(rbind, foo)
# [,1] [,2] [,3]
# a 1 3 5
# b 6 7 8
# c 3 4 6
I am not clear about something from your example: is it possible for there to be different numbers of y-values for each category in x? For example, would you consider basic data like this:
> x <- c(rep(c("a", "b", "c"), 3), "c", "c")
> y <- sample(1:20, 11)
> df <- data.frame(x, y)
> df
x y
1 a 16
2 b 4
3 c 9
4 a 2
5 b 12
6 c 17
7 a 7
8 b 10
9 c 11
10 c 1
11 c 8
Here there are more values for category c. This is not entirely what you are looking for, but it might be a start:
> library(reshape2)
> dcast(df, x ~ y)
Using y as value column: use value.var to override.
x 1 2 4 7 8 9 10 11 12 16 17
1 a NA 2 NA 7 NA NA NA NA NA 16 NA
2 b NA NA 4 NA NA NA 10 NA 12 NA NA
3 c 1 NA NA NA 8 9 NA 11 NA NA 17
The values for each of the categories appear on the right rows... the NAs are a nuisance though. How would you want the data to appear in this case? Something like
1 a 2 7 16
2 b 4 10 12
3 c 1 8 9 11 17
This will not work, of course, because each row must have the same number of columns, so you would end up with NAs for the last two elements in the top two rows.
However, I suspect that a list would probably be the best solution in this case anyway, in which case, consider this:
> dl <- split(y, x)
> dl[["a"]]
[1] 16 2 7
> dl$b
[1] 4 12 10
> dl[["c"]]
[1] 9 17 11 1 8
You can then operate on the elements of this list. As with all things R, there are a variety of ways to do this. For example, to get the output as a list:
> lapply(dl, sum)
$a
[1] 25
$b
[1] 26
$c
[1] 46
Or with output as a vector
> sapply(dl, sum)
a b c
25 26 46
Or, alternatively, to get the output as a data frame:
> library(plyr)
> ldply(dl, sum)
.id V1
1 a 25
2 b 26
3 c 46
These mechanisms afford a far greater degree of generality than functions like rowSum() since you can apply essentially arbirary functions to each of the elements in the original list.

automating a normal transformation function in R over multiple columns

I have a data frame m with:
>m
id w y z
1 2 5 8
2 18 5 98
3 1 25 5
4 52 25 8
5 5 5 4
6 3 3 5
Below is a general function for normally transforming a variable that I need to apply to columns w,y,z.
y<-qnorm((rank(x,na.last="keep")-0.5)/sum(!is.na(x))
For example, if I wanted to run this function on "column w" to get the output column appended to dataframe "m" then:
m$w_n<-qnorm((rank(m$w,na.last="keep")-0.5)/sum(!is.na(m$w))
Can someone help me automate this to run on multiple columns in data frame m?
Ideally, I would want an output data frame with the following columns:
id w y z w_n y_n z_n
Note this is a sample data frame, the one I have is much larger and I have more letter columns to run this function on other than w, y,z.
Thanks!
Probably a way to do it in a single step, but what about:
df <- data.frame(id = 1:6, w = sample(50, 6), z = sample(50, 6) )
df
id w z
1 1 39 40
2 2 20 26
3 3 43 11
4 4 4 37
5 5 36 24
6 6 27 14
transCols <- function(x) qnorm((rank(x,na.last="keep")-0.5)/sum(!is.na(x)))
tmpdf <- lapply(df[, -1], transCols)
names(tmpdf) <- paste0(names(tmpdf), "_n")
df_final <- cbind(df, tmpdf)
df_final
df_final
id w z w_n z_n
1 1 39 40 -0.2104284 -1.3829941
2 2 20 26 1.3829941 1.3829941
3 3 43 11 0.2104284 0.6744898
4 4 4 37 -1.3829941 0.2104284
5 5 36 24 0.6744898 -0.6744898
6 6 27 14 -0.6744898 -0.2104284

Resources