Creating multiple variables in loops in R

I am quite new to R, and I do not know how to create variables in a loop. I have a dataset where each observation is uniquely defined by an id and a type. My goal is to create several datasets from the starting one, each keeping the id, the type, and one specific variable, and to rename the variable type as type_variable. Please see below a reproducible example of my dataset:
dt_type <- data.frame(id = c(1,1,1,1,2,2,2,2),
type= c("b1", "b2","c1", "c2","b1", "b2","c1", "c2"),
a=rnorm(8), b=rnorm(8),c=rnorm(8),d=rnorm(8))
# id type a b c d
# 1 1 b1 -0.74733339 -1.1121249 -0.2005649 1.70320036
# 2 1 b2 -0.87290362 -0.1221949 -2.7723691 1.04158671
# 3 1 c1 -0.00878965 -0.7592988 -0.5108226 2.10755315
# 4 1 c2 0.87295622 -0.5885439 0.2606365 -0.87080649
# 5 2 b1 -0.74536372 0.1377794 -0.1382621 0.01743011
# 6 2 b2 -0.01570109 -0.3058672 -0.3146880 -0.43594081
# 7 2 c1 -0.28966205 -0.2045772 -1.1776759 -2.24223369
# 8 2 c2 -0.63680969 2.3815740 0.4462243 -0.05397941
This is how I have tried to do it, but unfortunately it does not work.
varlist <- list("a", "b", "c", "d")
for (i in 1:4) {
tmp <- dt_type %>% rename(paste("type", varlist[[i]], sep=="_") = type) %>%
arrange(id, varlist[[i]], desc(paste("type", varlist[[i]], sep=="_"))) %>%
distinct(id, varlist[[i]], .keep_all = T)
assign(paste("dt_type_", varlist[[i]]), tmp)
}
I am used to using loops in other programming languages, but if there are better ways to reach the result I want, please let me know.
Sorry for not posting the expected output, here it is:
dt_type_a
# id type value
# 1 1 b1 -1.5023199
# 2 1 b2 -0.3653626
# 3 1 c1 1.2842098
# 4 1 c2 0.2732327
# 5 2 b1 -0.7581897
# 6 2 b2 1.1627059
# 7 2 c1 -1.6644546
# 8 2 c2 1.2916819
dt_type_b
# id type value
# 1 1 b1 -0.19573684
# 2 1 b2 -1.35095843
# 3 1 c1 0.69342205
# 4 1 c2 0.47689611
# 5 2 b1 0.67058845
# 6 2 b2 0.21992074
# 7 2 c1 -0.02046201
# 8 2 c2 0.19686712
Thanks,
Vincenzo

Hum, I would just go from wide to long but since you're asking to create variables dynamically:
library(data.table)
dt_type <- data.frame(id = c(1,1,1,1,2,2,2,2),
type= c("b1", "b2","c1", "c2","b1", "b2","c1", "c2"),
a=rnorm(8), b=rnorm(8),c=rnorm(8),d=rnorm(8))
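setDT(dt_type)
# reshape wide to long: one row per (id, type, variable)
dt_long <- melt(dt_type, id.vars = c("id", "type"))
varnames <- as.character(unique(dt_long$variable))
for (v in varnames) {
# one object per variable: dt_type_a, dt_type_b, ...
assign(paste0("dt_type_", v), dt_long[variable == v, .(id, type, value)])
}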
hope it helps...
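If the goal is simply one dataset per variable, a named list is often easier to work with than objects created by assign(). Below is a minimal base R sketch, assuming the same column layout as in the question; the names vars and dt_list are illustrative and do not come from the original posts.

```r
set.seed(1)
dt_type <- data.frame(id = c(1, 1, 1, 1, 2, 2, 2, 2),
                      type = c("b1", "b2", "c1", "c2", "b1", "b2", "c1", "c2"),
                      a = rnorm(8), b = rnorm(8), c = rnorm(8), d = rnorm(8))

vars <- c("a", "b", "c", "d")

# One list element per variable, named dt_type_a, dt_type_b, ...
dt_list <- lapply(setNames(vars, paste0("dt_type_", vars)), function(v) {
  out <- dt_type[, c("id", "type", v)]
  names(out)[3] <- "value"   # rename the selected variable to "value"
  out
})

names(dt_list)
head(dt_list$dt_type_a)
```

If separate objects in the global environment are really needed, list2env(dt_list, envir = globalenv()) should create them from the list.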

Related

Changing the structure of rows and columns in a data frame- R

I have a large database (90,000 * 1500) with one row per child observation, which includes the mom's info. I want to reorganize the database by mom.
The problem is that each child appears only once in the database, whereas a mom may appear up to 10 times.
I want one row per distinct mother (approx. 40,000 rows), with a few columns of data for each of her children (between 0 and 10 children).
For example, the DB I have and the DB I want to create:
You could use reshape
library(data.table)
df = data.frame(
'c' = c('c1', 'c2', 'c3', 'c4', 'c5'),
'id_num' = seq(1,5),
'age' = c(12, 15, 5, 8, 19),
'mom'= c(1,3,1,2,3)
)
df
c id_num age mom
1 c1 1 12 1
2 c2 2 15 3
3 c3 3 5 1
4 c4 4 8 2
5 c5 5 19 3
df = setDT(df)[order(mom)]
df[, id_child := seq(.N), mom]
reshape(df, idvar = "mom", timevar = "id_child", direction = "wide")
mom c.1 id_num.1 age.1 c.2 id_num.2 age.2
1: 1 c1 1 12 c3 3 5
2: 2 c4 4 8 <NA> NA NA
3: 3 c2 2 15 c5 5 19
Here is a solution similar to @Metariat's, but in base R, where ave() is used
df$seq <- with(df,ave(id_num,mom,FUN = seq_along))
dfout <- reshape(df, idvar = "mom", timevar = "seq", direction = "wide")
such that
> dfout
mom c.1 id_num.1 age.1 c.2 id_num.2 age.2
1 1 c1 1 12 c3 3 5
2 3 c2 2 15 c5 5 19
4 2 c4 4 8 <NA> NA NA
EDIT:
If you have a very big data frame, you can try a divide-and-conquer approach to see if it works
library(plyr)
dfs <- split(df,df$mom)
lst <- lapply(dfs, function(x) {
x <- within(x,seqnum <- ave(id_num,mom,FUN = seq_along))
reshape(x, idvar = "mom", timevar = "seqnum", direction = "wide")
}
)
dfout <- rbind.fill(lst)
You can do this using the dplyr package, with group_by.
data <- group_by(data, mom)
This groups the rows by mom, so each group holds one mother's children. You can then sort within each group as follows.
arrange(data, id_num, .by_group = TRUE)
To keep only children aged 10 or younger:
filter(data, age <= 10)

How to rearrange a tibble by rownames

Traditional dataframes support rearrangement of rows by rownames:
> df <- data.frame(c1 = letters[1:3], c2 = 1:3, row.names = paste0("x", 1:3))
> df
c1 c2
x1 a 1
x2 b 2
x3 c 3
#' If we want, say, row "x3" and "x1":
> df[c("x3", "x1"), ]
c1 c2
x3 c 3
x1 a 1
Since tibbles drop the concept of rownames, I wonder what the standard way is to achieve a similar goal with them.
> tb <- as_tibble(rownames_to_column(df))
> tb
# A tibble: 3 x 3
rowname c1 c2
<chr> <fct> <int>
1 x1 a 1
2 x2 b 2
3 x3 c 3
> ?
Thanks.
Update
I can come up with the following solution:
> tb[match(c("x3", "x1"), tb[["rowname"]]), ]
# A tibble: 2 x 3
rowname c1 c2
<chr> <fct> <int>
1 x3 c 3
2 x1 a 1
But it seems clumsy. Does anyone have a better idea?
Update 2
More generally, my question can be rephrased as: in tidyverse syntax, what is the neatest and quickest equivalent to
df[c("x3", "x1"), ]
that is, subsetting and rearranging rows of a dataframe.
As joran described, you can use filter to select the rows of interest. To then arrange the tibble in a specific, manually defined order, you can use arrange with a factor whose levels are given in that order:
tibble(rowname = paste0("x", 1:3), c1 = letters[1:3], c2 = 1:3) %>%
filter(rowname %in% c("x3", "x1")) %>%
arrange(factor(rowname, levels = c("x3", "x1")))
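The same idea can be sketched in base R without any packages: filter with %in%, then order by a factor whose levels encode the desired sequence. This is only an illustration on a plain data frame (the names tb, wanted and sub are assumptions), not a claim about how tibble or dplyr implement it.

```r
tb <- data.frame(rowname = paste0("x", 1:3),
                 c1 = letters[1:3],
                 c2 = 1:3,
                 stringsAsFactors = FALSE)

wanted <- c("x3", "x1")

# Keep only the requested rows ...
sub <- tb[tb$rowname %in% wanted, ]
# ... then sort them by their position in `wanted`
sub <- sub[order(factor(sub$rowname, levels = wanted)), ]
sub
```

Ordering by a factor works because order() sorts factors by their integer level codes, so the order of the levels determines the order of the rows.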

forloop inside dplyr mutate

I would like to do a few column operations using mutate in a more elegant way, as my table has more than 200 columns that I would like to transform.
Here is an example.
Sample data:
df <- data.frame(treatment=rep(letters[1:2],10),
c1_x=rnorm(20),c2_y=rnorm(20),c3_z=rnorm(20),
c4_x=rnorm(20),c5_y=rnorm(20),c6_z=rnorm(20),
c7_x=rnorm(20),c8_y=rnorm(20),c9_z=rnorm(20),
c10_x=rnorm(20),c11_y=rnorm(20),c12_z=rnorm(20),
c_n=rnorm(20))
sample code:
dfm<-df %>%
mutate(cx=(c1_x*c4_x/c_n+c7_x*c10_x/c_n),
cy=(c2_y*c5_y/c_n+c8_y*c11_y/c_n),
cz=(c3_z*c6_z/c_n+c9_z*c12_z/c_n))
Despite the tangent, the initial recommendations to use tidyr functions are where you need to go. This pipe of functions seems to do the job based on what you've provided.
Your data:
df <- data.frame(treatment=rep(letters[1:2],10),
c1_x=rnorm(20), c2_y=rnorm(20), c3_z=rnorm(20),
c4_x=rnorm(20), c5_y=rnorm(20), c6_z=rnorm(20),
c7_x=rnorm(20), c8_y=rnorm(20), c9_z=rnorm(20),
c10_x=rnorm(20), c11_y=rnorm(20), c12_z=rnorm(20),
c_n=rnorm(20))
library(dplyr)
library(tidyr)
This first auxiliary data.frame is used to translate your c#_[xyz] variable into a unified one. I'm sure there are other ways to handle this, but it works and is relatively easy to reproduce and extend based on your 200+ columns.
variableTransform <- tibble(  # data_frame() is deprecated; tibble() is re-exported by dplyr
cnum = paste0("c", 1:12),
cvar = rep(paste0("a", 1:4), each = 3)
)
head(variableTransform)
# Source: local data frame [6 x 2]
# cnum cvar
# <chr> <chr>
# 1 c1 a1
# 2 c2 a1
# 3 c3 a1
# 4 c4 a2
# 5 c5 a2
# 6 c6 a2
Here's the pipe all at once. I'll explain the steps in a sec. What you're looking for is likely a combination of the treatment, xyz, and ans columns.
df %>%
tidyr::gather(cnum, value, -treatment, -c_n) %>%
tidyr::separate(cnum, c("cnum", "xyz"), sep = "_") %>%
left_join(variableTransform, by = "cnum") %>%
select(-cnum) %>%
tidyr::spread(cvar, value) %>%
mutate(
ans = a1 * (a2/c_n) + a3 * (a4/c_n)
) %>%
head
# treatment c_n xyz a1 a2 a3 a4 ans
# 1 a -1.535934 x -0.3276474 1.45959746 -1.2650369 1.02795419 1.15801448
# 2 a -1.535934 y -1.3662388 -0.05668467 0.4867865 -0.10138979 -0.01828831
# 3 a -1.535934 z -2.5026018 -0.99797169 0.5181513 1.20321878 -2.03197283
# 4 a -1.363584 x -0.9742016 -0.12650863 1.3612361 -0.24840493 0.15759418
# 5 a -1.363584 y -0.9795871 1.52027017 0.5510857 1.08733839 0.65270681
# 6 a -1.363584 z 0.2985557 -0.22883439 0.1536078 -0.09993095 0.06136036
First, we take the original data and turn all (except two) columns into two columns of "column name" and "column values" pairs:
df %>%
tidyr::gather(cnum, value, -treatment, -c_n) %>%
# treatment c_n cnum value
# 1 a 0.20745647 c1_x -0.1250222
# 2 b 0.01015871 c1_x -0.4585088
# 3 a 1.65671028 c1_x -0.2455927
# 4 b -0.24037137 c1_x 0.6219516
# 5 a -1.16092349 c1_x -0.3716138
# 6 b 1.61191700 c1_x 1.7605452
It will be helpful to split c1_x into c1 and x in order to translate the first and preserve the latter:
tidyr::separate(cnum, c("cnum", "xyz"), sep = "_") %>%
# treatment c_n cnum xyz value
# 1 a 0.20745647 c1 x -0.1250222
# 2 b 0.01015871 c1 x -0.4585088
# 3 a 1.65671028 c1 x -0.2455927
# 4 b -0.24037137 c1 x 0.6219516
# 5 a -1.16092349 c1 x -0.3716138
# 6 b 1.61191700 c1 x 1.7605452
From here, let's translate the c1, c2, and c3 variables into a1 (repeat for other 9 variables) using variableTransform:
left_join(variableTransform, by = "cnum") %>%
select(-cnum) %>%
# treatment c_n xyz value cvar
# 1 a 0.20745647 x -0.1250222 a1
# 2 b 0.01015871 x -0.4585088 a1
# 3 a 1.65671028 x -0.2455927 a1
# 4 b -0.24037137 x 0.6219516 a1
# 5 a -1.16092349 x -0.3716138 a1
# 6 b 1.61191700 x 1.7605452 a1
Since we want to deal with multiple variables simultaneously (with a simple mutate), we need to bring some of the variables back into columns. (Gathering first and spreading now helps me keep things organized and well named. I'm confident somebody can come up with another way to do it.)
tidyr::spread(cvar, value) %>% head
# treatment c_n xyz a1 a2 a3 a4
# 1 a -1.535934 x -0.3276474 1.45959746 -1.2650369 1.02795419
# 2 a -1.535934 y -1.3662388 -0.05668467 0.4867865 -0.10138979
# 3 a -1.535934 z -2.5026018 -0.99797169 0.5181513 1.20321878
# 4 a -1.363584 x -0.9742016 -0.12650863 1.3612361 -0.24840493
# 5 a -1.363584 y -0.9795871 1.52027017 0.5510857 1.08733839
# 6 a -1.363584 z 0.2985557 -0.22883439 0.1536078 -0.09993095
From here, we just need to mutate to get the right answer.
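For comparison, here is a base R sketch that skips the reshaping and loops over the suffixes instead. It assumes, as in the sample data, that each suffix has exactly four columns and that their column order (c1_x, c4_x, c7_x, c10_x) matches the order used in the formula; the loop variable names are illustrative.

```r
set.seed(1)
suffixes <- c("x", "y", "z")

# Rebuild the sample layout: c1_x, c2_y, c3_z, c4_x, ..., c12_z, c_n
df <- data.frame(treatment = rep(letters[1:2], 10))
for (k in 1:12) {
  df[[paste0("c", k, "_", suffixes[(k - 1) %% 3 + 1])]] <- rnorm(20)
}
df$c_n <- rnorm(20)

for (s in suffixes) {
  # Columns ending in this suffix, in their original order
  cols <- grep(paste0("_", s, "$"), names(df), value = TRUE)
  # (first * second + third * fourth) / c_n, e.g. cx for suffix "x"
  df[[paste0("c", s)]] <-
    (df[[cols[1]]] * df[[cols[2]]] + df[[cols[3]]] * df[[cols[4]]]) / df$c_n
}

head(df[, c("treatment", "cx", "cy", "cz")])
```

With 200+ columns the same loop applies as long as the naming pattern holds; if the number of columns per suffix varies, the indexing inside the loop would need to be adapted.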
Similar to r2evans's answer, but with more manipulation instead of the joins (and less explanation).
library(tidyr)
library(stringr)
library(dplyr)
# get it into fully long form
gather(df, key = cc_xyz, value = value, c1_x:c12_z) %>%
# separate off the xyz and the c123
separate(col = cc_xyz, into = c("cc", "xyz")) %>%
# extract the number
mutate(num = as.numeric(str_replace(cc, pattern = "c", replacement = "")),
# group c1-c3, c4-c6, c7-c9, c10-c12 and add a letter so it's a good col name
num_grp = paste0("v", (num - 1) %/% 3 + 1)) %>%
# remove unwanted columns
select(-cc, -num) %>%
# go into a reasonable data width for calculation
spread(key = num_grp, value = value) %>%
# calculate, matching cx = c1_x*c4_x/c_n + c7_x*c10_x/c_n
mutate(result = v1 * v2 / c_n + v3 * v4 / c_n)
# Resulting columns: treatment, c_n, xyz, v1, v2, v3, v4, result

How to collapse/join selected factor levels across two columns in R

Let's say I have the following data frame:
x <-c(rep (c ("s1", "s2", "s3"),each=5 ))
y <- c(rep(c("a", "b", "c", "d", "e"), 3) )
z<-c(1:15)
x_name <- "dimensions"
y_name <- "aspects"
z_name<-"value"
df <- data.frame(x,y,z)
names(df) <- c(x_name,y_name, z_name)
How can I collapse/join factor levels 'a', 'c', 'd' into one new factor level 'x' within each 'dimensions' value, so that their 'value' entries are added up for the new level? The output should look like this:
I thought of using gsub to replace the names a, c, d with x and then summing their values using aggregate. But is there a simpler way to do this? Besides, I am not sure my solution would still work if I had other columns containing a, c, d.
I reviewed several related answers on the forum, but none addressed this situation. Thanks.
First rename a, c, and d to x, and then sum by dimensions and aspects.
Reading the data:
df <- data.frame(dimensions = x, aspects = y, value = z, stringsAsFactors = FALSE)
Base R solution:
# if you read the data my way the following line is unnecessary
# df$aspects <- as.character(df$aspects)
df[df$aspects %in% c("a","c","d"),]$aspects <- "x"
aggregate(value ~., df, sum)
Result:
dimensions aspects value
1 s1 b 2
2 s2 b 7
3 s3 b 12
4 s1 e 5
5 s2 e 10
6 s3 e 15
7 s1 x 8
8 s2 x 23
9 s3 x 38
data.table solution
require(data.table)
DT <- setDT(df)
DT[aspects %in% c("a","c","d"), aspects := "x"]
DT[,sum(value), by=.(dimensions, aspects)]
Results in
dimensions aspects V1
1: s1 x 8
2: s1 b 2
3: s1 e 5
4: s2 x 23
5: s2 b 7
6: s2 e 10
7: s3 x 38
8: s3 b 12
9: s3 e 15
Here's a solution using plyr::revalue (see also plyr::mapvalues) and dplyr:
# install.packages("plyr")
library(dplyr)
df %>%
mutate(aspects = plyr::revalue(aspects, c("a" = "x", "c" = "x", "d" = "x"))) %>%
group_by(dimensions, aspects) %>%
summarise(sum_value = sum(value))
# dimensions aspects sum_value
# (fctr) (fctr) (int)
# 1 s1 x 8
# 2 s1 b 2
# 3 s1 e 5
# 4 s2 x 23
# 5 s2 b 7
# 6 s2 e 10
# 7 s3 x 38
# 8 s3 b 12
# 9 s3 e 15

Summarize a dataframe by groups

Consider the following dataframe with 4 columns:
df = data.frame(A = rnorm(10), B = rnorm(10), C = rnorm(10), D = rnorm(10))
The columns A, B, C, D belong to different groups, and the groups are defined in a separate dataframe:
groups = data.frame(Class = c("A","B","C","D"), Group = c("G1", "G2", "G2", "G1"))
#> groups
# Class Group
#1 A G1
#2 B G2
#3 C G2
#4 D G1
I would like to average elements of the columns that belong to the same group, and get something similar to:
#> res
# G1 G2
#1 -0.30023039 -0.71075139
#2 0.53053443 -0.12397126
#3 0.21968567 -0.46916160
#4 -1.13775100 -0.61266026
#5 1.30388130 -0.28021734
#6 0.29275876 -0.03994522
#7 -0.09649998 0.59396983
#8 0.71334020 -0.29818438
#9 -0.29830924 -0.47094084
#10 -0.36102888 -0.40181739
where each cell of G1 is the mean of the corresponding cells of A and D, each cell of G2 is the mean of the corresponding cells of B and C, and so on.
I was able to achieve this result, but in a rather brute force way:
l = levels(factor(groups$Group))
res = data.frame(matrix(nc = length(l), nr = nrow(df)))
for(i in 1:length(l)) {
df.sub = df[which(groups$Group == l[i])]
res[,i] = apply(df.sub, 1, mean)
}
names(res) <- l
Is there a better way of doing this? In reality, I have more than 20 columns and more than 10 groups.
Thank you!
using data.table
library(data.table)
groups <- data.table(groups, key="Group")
DT <- data.table(df)
groups[, rowMeans(DT[, Class, with=FALSE]), by=Group][, setnames(as.data.table(matrix(V1, ncol=length(unique(Group)))), unique(Group))]
G1 G2
1: -0.13052091 -0.3667552
2: 1.17178729 -0.5496347
3: 0.23115841 0.8317714
4: 0.45209516 -1.2180895
5: -0.01861638 -0.4174929
6: -0.43156831 0.9008427
7: -0.64026238 0.1854066
8: 0.56225108 -0.3563087
9: -2.00405840 -0.4680040
10: 0.57608055 -0.6177605
# Also, make sure you have characters, not factors,
groups[, Class := as.character(Class)]
groups[, Group := as.character(Group)]
simple base:
tapply(groups$Class, groups$Group, function(X) rowMeans(df[, X]))
using sapply :
sapply(unique(groups$Group), function(X)
rowMeans(df[, groups[groups$Group==X, "Class"]]) )
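A slightly more defensive base R variant of the same idea is sketched below: split() builds the column list for each group, and drop = FALSE keeps rowMeans() working even when a group has a single column. The names cols_by_group and res are illustrative, and Class and Group are stored as characters, as recommended above.

```r
set.seed(1)
df <- data.frame(A = rnorm(10), B = rnorm(10), C = rnorm(10), D = rnorm(10))
groups <- data.frame(Class = c("A", "B", "C", "D"),
                     Group = c("G1", "G2", "G2", "G1"),
                     stringsAsFactors = FALSE)

# List of column names per group: $G1 = c("A", "D"), $G2 = c("B", "C")
cols_by_group <- split(groups$Class, groups$Group)

# Row-wise mean over each group's columns, one output column per group
res <- as.data.frame(lapply(cols_by_group, function(cols) {
  rowMeans(df[, cols, drop = FALSE])
}))

head(res)
```

Without drop = FALSE, selecting a single column returns a plain vector and rowMeans() fails, which matters once there are 10+ groups of uneven size.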
I would personally go with Ricardo's solution, but another option would be to merge your two datasets first, and then use your preferred method of aggregating.
library(reshape2)
## Retain the "rownames" so we can aggregate by row
temp <- merge(cbind(id = rownames(df), melt(df)), groups,
by.x = "variable", by.y = "Class")
head(temp)
# variable id value Group
# 1 A 1 -0.6264538 G1
# 2 A 2 0.1836433 G1
# 3 A 3 -0.8356286 G1
# 4 A 4 1.5952808 G1
# 5 A 5 0.3295078 G1
# 6 A 6 -0.8204684 G1
## This is the perfect form for `dcast` to do its work
dcast(temp, id ~ Group, value.var="value", mean)
# id G1 G2
# 1 1 0.36611287 1.21537927
# 2 10 0.22889368 0.50592144
# 3 2 0.04042780 0.58598977
# 4 3 -0.22397850 -0.27333780
# 5 4 0.77073788 -2.10202579
# 6 5 -0.52377589 0.87237833
# 7 6 -0.61773147 -0.05053117
# 8 7 0.04656955 -0.08599288
# 9 8 0.33950565 -0.26345809
# 10 9 0.83790336 0.17153557
(Above data using set.seed(1) on your sample "df".)
