I know the question of unnesting a list column in a data frame has been raised and answered multiple times. However, here is potentially the 237th problem of that kind.
I have the following data:
set.seed(666)
dat <- data.frame(sysRespNum = c(1,2,3,4,5,6),
product1 = sqrt(rnorm(6, 20, 5)^2),
product2 = sqrt(rnorm(6, 20, 5)^2),
product3 = sqrt(rnorm(6, 20, 5)^2))
data:
sysRespNum product1 product2 product3
1 1 23.766555 13.46907 24.32327
2 2 30.071773 15.98740 11.39922
3 3 18.224328 11.03880 20.67063
4 4 30.140839 19.78984 19.62087
5 5 8.915628 30.75021 24.29150
6 6 23.791981 11.14885 21.72450
Now, I want to calculate the proportion of each product among the sum of all products, i.e. product1/sum(my three products), and then the same for products 2 and 3. So I'm expecting three new columns.
I've tried the following:
library(tidyverse)
dat %>%
mutate(sum_Product = apply(across(-sysRespNum), 1, function(x) list(sum_Product = x/sum(x))))
(Side question: is there maybe a more straightforward way of mutating this directly, without the need to create a list? I know I could create a sum column first and then do a simple mutate along with across. But I'm wondering if the calculations can be achieved without creating a temporary sum column first.)
Now my problem is that it's difficult to unnest the sum_Product list column. unnest_wider doesn't work; the sum_Product column is still a list.
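For reference, the attempt looked roughly like this (a sketch; only the unnest_wider call is appended to the pipe above):
dat %>%
  mutate(sum_Product = apply(across(-sysRespNum), 1,
                             function(x) list(sum_Product = x/sum(x)))) %>%
  unnest_wider(sum_Product)  # sum_Product still comes back as a list column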
So the only thing that worked for me was to follow this solution (https://stackoverflow.com/a/60824506/2725773) and replace the list part of my code above with data.frame. Full code:
dat %>%
mutate(sum_Product = apply(across(-sysRespNum), 1, function(x) data.frame(sum_Product = x/sum(x)))) %>%
unnest(cols = everything()) %>%
mutate(product = rep(1:3, nrow(.)/3)) %>%
pivot_wider(values_from = sum_Product,
names_from = product,
names_prefix = "share_product")
which gives the correct result:
# A tibble: 6 x 7
sysRespNum product1 product2 product3 share_product1 share_product2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 23.8 13.5 24.3 0.386 0.219
2 2 30.1 16.0 11.4 0.523 0.278
3 3 18.2 11.0 20.7 0.365 0.221
4 4 30.1 19.8 19.6 0.433 0.285
5 5 8.92 30.8 24.3 0.139 0.481
6 6 23.8 11.1 21.7 0.420 0.197
# … with 1 more variable: share_product3 <dbl>
However, it feels unnecessarily complicated to unnest everything and then reshape with pivot_wider.
So a) is there a more elegant way to calculate my share variables and b) is there a more elegant/shorter/less verbose way of reshaping a list column into multiple vector columns?
It is easier to do this with rowSums, i.e. divide 'product1' by the rowSums of the columns whose names start with the keyword 'product'. Unlike a rowwise approach with c_across, this is vectorized and should be fast as well (a rowwise sketch follows the NOTE below for comparison).
library(dplyr)
dat %>%
mutate(sum_product = product1/rowSums(select(., starts_with('product'))))
NOTE: The attempt in the question mixes base R (apply) with the tidyverse across, which doesn't seem to be the optimal way.
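For comparison, here is a sketch of the rowwise()/c_across() equivalent mentioned above; it returns the same column but loops over rows and is typically slower on larger data:
dat %>%
  rowwise() %>%
  mutate(sum_Product = product1/sum(c_across(starts_with('product')))) %>%
  ungroup()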
If we need to do this for all the 'product' columns, create a sum column first with mutate and then use across on the columns that start with 'product' to divide each column by 'Sum_col':
dat %>%
mutate(Sum_col = rowSums(select(., starts_with('product'))),
across(starts_with('product'),
~ ./Sum_col, .names = '{.col}_sum_product')) %>%
select(-Sum_col)
Output:
# sysRespNum product1 product2 product3 product1_sum_product product2_sum_product product3_sum_product
#1 1 23.766555 13.46907 24.32327 0.3860783 0.2187998 0.3951219
#2 2 30.071773 15.98740 11.39922 0.5233660 0.2782431 0.1983909
#3 3 18.224328 11.03880 20.67063 0.3649701 0.2210688 0.4139610
#4 4 30.140839 19.78984 19.62087 0.4333597 0.2845348 0.2821054
#5 5 8.915628 30.75021 24.29150 0.1393996 0.4807925 0.3798079
#6 6 23.791981 11.14885 21.72450 0.4198684 0.1967490 0.3833826
Or using base R
nm1 <- startsWith(names(dat), 'product')  # logical index of the product columns
dat[paste0('sum_product', seq_len(sum(nm1)))] <- dat[nm1]/rowSums(dat[nm1])
I guess the following base R code should work for you
cbind(
dat,
setNames(dat[-1] / rowSums(dat[-1]), paste0("share_product", seq_along(dat[-1])))
)
which gives
sysRespNum product1 product2 product3 share_product1 share_product2
1 1 23.766555 13.46907 24.32327 0.3860783 0.2187998
2 2 30.071773 15.98740 11.39922 0.5233660 0.2782431
3 3 18.224328 11.03880 20.67063 0.3649701 0.2210688
4 4 30.140839 19.78984 19.62087 0.4333597 0.2845348
5 5 8.915628 30.75021 24.29150 0.1393996 0.4807925
6 6 23.791981 11.14885 21.72450 0.4198684 0.1967490
share_product3
1 0.3951219
2 0.1983909
3 0.4139610
4 0.2821054
5 0.3798079
6 0.3833826
Good old plain basic R
rdat <- dat[-1]
rdat <- rdat/rowSums(rdat)
colnames(rdat) <- paste0("share_", colnames(rdat))
cbind(dat, rdat)
Which gives:
sysRespNum product1 product2 product3 share_product1 share_product2
1 1 23.766555 13.46907 24.32327 0.3860783 0.2187998
2 2 30.071773 15.98740 11.39922 0.5233660 0.2782431
3 3 18.224328 11.03880 20.67063 0.3649701 0.2210688
4 4 30.140839 19.78984 19.62087 0.4333597 0.2845348
5 5 8.915628 30.75021 24.29150 0.1393996 0.4807925
6 6 23.791981 11.14885 21.72450 0.4198684 0.1967490
share_product3
1 0.3951219
2 0.1983909
3 0.4139610
4 0.2821054
5 0.3798079
6 0.3833826
I have a big dataset ('links_cl'; each participant of a study has several hundred rows), which I need to subset into dfs, one for each participant.
For those 42 dfs, I then need to do the same operation again and again. After spending half a day trying to write my own function and to find a solution online, I now have to ask here.
So, I am looking for a way to:
1. subset the huge dataset several times so that I have one df in my environment for every participant, without using the same code 42 times. What I did so far 'by hand' is:
Subj01 <- subset(links_cl, Subj == 01, select = c("Condition", "ACC_LINK", "RT_LINK", "CONF_LINK", "ACC_SOURCE", "RT_SOURCE", "CONF_SOURCE"))
2. filter for the column 'Condition' (either == 1, 2, 3 or 4) and describe/get the mean and SD of 'RT_LINK', which I so far also did 'manually':
Subj01 %>% filter(Condition == 01) %>% describe(Subj01$RT_LINK)
But here I just get the description of the whole df of Subj01, so I would have to find 4x41 means by hand. It would be great to just have an output with the means and SDs of every participant, but I have no idea where to start and how to tell R to do this.
I tried this, but it won't work:
subsetsubj <- function(x,y) {
Subj_x <- links_cl %>%
subset(links_cl,
Subj == x,
select = c("Condition", "ACC_LINK", "RT_LINK", "CONF_LINK", "ACC_SOURCE", "RT_SOURCE", "CONF_SOURCE")) %>%
filter(Condition == y) %>%
describe(Subj_x$RT_LINK)
}
I also tried putting all dfs into a list and working with that, but it led nowhere.
If there is a solution without the subsetting, that would also work; it just seemed a logical step to me. Any idea or help on how to solve this?
You don't really need to split the dataset up into one data frame per participant. I would recommend a standard group_by()/summarize() approach, like this:
links_cl %>%
group_by(Subj, Condition) %>%
summarize(mean_val = mean(RT_LINK),
sd_val = sd(RT_LINK))
Output:
Subj Condition mean_val sd_val
<int> <int> <dbl> <dbl>
1 1 1 0.0375 0.873
2 1 2 0.103 1.05
3 1 3 0.184 0.764
4 1 4 0.0375 0.988
5 2 1 -0.0229 0.962
6 2 2 -0.156 0.820
7 2 3 -0.175 0.999
8 2 4 -0.0763 1.12
9 3 1 0.272 1.02
10 3 2 0.0172 0.835
# … with 158 more rows
Input:
set.seed(123)
links_cl <- data.frame(
Subj = rep(1:42, each =100),
Condition = rep(1:4, times=4200/4),
RT_LINK = rnorm(4200)
)
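If you really do need one object per participant, a one-line alternative (a sketch, instead of 42 subset() calls) is split(), which returns a named list:
# one data frame per participant, as a named list
subj_list <- split(links_cl, links_cl$Subj)
subj_list[["1"]]  # the data for participant 1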
I used
df$Total.P.n <- rowSums(df[grep('p.n', names(df), ignore.case = FALSE)])
to sum count values from any column name containing p.n, but the values it produced are way off. The columns are counts of certain combinations of language types in a language corpus. I want to get a summary of all the times p.n was used within other combinations, but am struggling. It seems like perhaps it is counting other occurrences like e.sp.NR in my variable names, but shouldn't ignore.case = FALSE take care of that? I've also tried tidyverse and dplyr solutions to no avail.
Here's an example of the df structure:
ID.    do.p.n.NP   do.p.n.SE   p.d.e.sp.SR
1510   4           6           2
1515   2           0           1
and what I need:
ID.    do.p.n.NP   do.p.n.SE   p.d.e.sp.SR   Total.P.n
1510   4           6           2             10
1515   2           0           1             2
Update after the OP's edit (new column names):
The code is like:
df$Total.P.n <- rowSums(df[grep('p.n', names(df), ignore.case = FALSE)])
df$p.d.e.sp.SR <- rowSums(df[,2:3]!=0)
ID. do.p.n.NP do.p.n.SE. p.d.e.sp.SR Total.P.n
1 1510 4 6 2 10
2 1515 2 0 1 2
First answer:
The pattern you are searching for, e.g. p.n, does not exist in df. Therefore I think you mean pn; then your code works as expected:
df$Total.P.n <- rowSums(df[grep('pn', names(df), ignore.case = FALSE)])
ID. do.pn.NP do.pn.SE. p.d.e.sp.SR Total.P.n
1 1510 4 6 0 10
2 1515 2 0 1 2
If we can use dplyr, I would suggest using a tidy-select function / selection helper like matches. And please mind that your regex is likely wrong: if we need to match literal dots (.), we need to escape the metacharacter with a double backslash. The appropriate regex would be p\\.n.
library(dplyr)
Data:
df <- tibble(`ID.` = c(1510, 1515), `do.p.n.NP` = c(4,2), `do.p.n.SE.` = c(6,0), `p.d.e.sp.SR` = c(0,1))
Answer:
df %>%
mutate(Total.P.n = rowSums(across(matches('p\\.n'))))
# A tibble: 2 × 5
ID. do.p.n.NP do.p.n.SE. p.d.e.sp.SR Total.P.n
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1510 4 6 0 10
2 1515 2 0 1 2
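As a base R side note (an addition, not from the original attempt): ignore.case only controls case sensitivity; to make grep treat the dots in p.n literally, fixed = TRUE is an alternative to escaping:
df$Total.P.n <- rowSums(df[grep('p.n', names(df), fixed = TRUE)])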
I am interested in performing multiple tests for a single variable with an associated factor that splits the values into multiple groups. It is related to this question and, actually, I would like a solution of that kind, but it is not exactly the same.
In my case, I have a single variable and multiple groups (possibly many). Expanding on this example:
library(reshape)
# Create a dataset
mu=34
stdv=5
Location=rep(c("Area_A","Area_B","Area_C"),5)
distro=rnorm(length(Location),mu,stdv)
id=seq(1:length(Location))
sample_long=data.frame(id,Location,distro)
sample_long
id Location distro
1 1 Area_A 34.95737
2 2 Area_B 31.30298
3 3 Area_C 35.86569
4 4 Area_A 40.45378
5 5 Area_B 36.12060
6 6 Area_C 28.29649
7 7 Area_A 30.64495
8 8 Area_B 29.70668
9 9 Area_C 33.22874
10 10 Area_A 25.29148
11 11 Area_B 32.35511
12 12 Area_C 34.69159
13 13 Area_A 26.89791
14 14 Area_B 35.30717
15 15 Area_C 40.64628
I would like to perform all-against-all tests among Areas, i.e. test(Area_A,Area_B), test(Area_A,Area_C) and test(Area_B,Area_C) (in a more general case, all the i<j possible tests).
A simple way to go is to transform the data into wide format:
# Reshape to wide format
sample_wide=reshape(sample_long,direction="wide",idvar="id",timevar="Location")
sample_wide
id distro.Area_A distro.Area_B distro.Area_C
1 1 34.95737 NA NA
2 2 NA 31.30298 NA
3 3 NA NA 35.86569
4 4 40.45378 NA NA
5 5 NA 36.12060 NA
6 6 NA NA 28.29649
7 7 30.64495 NA NA
8 8 NA 29.70668 NA
9 9 NA NA 33.22874
10 10 25.29148 NA NA
11 11 NA 32.35511 NA
12 12 NA NA 34.69159
13 13 26.89791 NA NA
14 14 NA 35.30717 NA
15 15 NA NA 40.64628
and then loop across all-against-all columns. I've seen several approaches more R-like than the following one, which uses for loops:
# Now compute the test
test.out=list()
k=0
for(i in 2:(dim(sample_wide)[2]-1)){ # All against all var groups
for(j in (i+1):dim(sample_wide)[2]){
k=k+1
test.out[[k]]=t.test(sample_wide[,i],
sample_wide[,j]) # store results in a list
}
}
But my question is not about which is the best solution given the wide format, but whether it is possible to solve the problem working from the original long format, in line with the solutions in the links I provided above that use dplyr, broom, etc.
This is a little trickier and less straightforward than I hoped. You can first figure out the combinations of locations and, to make it a little simpler, save that in a lookup table. I turned that into a long shape with an ID for each pair, which I'll use as a grouping variable on the data.
library(dplyr)
library(tidyr)
library(purrr)
set.seed(111)
# same data creation code
grps <- as.data.frame(t(combn(levels(sample_long$Location), 2))) %>%
mutate(pair = row_number()) %>%
gather(key, value = loc, -pair) %>%
select(-key)
grps
#> pair loc
#> 1 1 Area_A
#> 2 2 Area_A
#> 3 3 Area_B
#> 4 1 Area_B
#> 5 2 Area_C
#> 6 3 Area_C
Joining the lookup to the data frame doubles the rows; the multiplier will differ depending on how many levels you're combining. Note also I dropped your ID column since it didn't seem necessary right now. Nest, do the t-test, and tidy the results.
sample_long %>%
select(-id) %>%
inner_join(grps, by = c("Location" = "loc")) %>%
group_by(pair) %>%
nest() %>%
mutate(t_test = map(data, ~t.test(distro ~ Location, data = .)),
tidied = map(t_test, broom::tidy)) %>%
unnest(tidied)
#> # A tibble: 3 x 13
#> pair data t_test estimate estimate1 estimate2 statistic p.value
#> <int> <lis> <list> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 <tib… <htes… -0.921 31.8 32.7 -0.245 0.816
#> 2 2 <tib… <htes… -1.48 31.8 33.3 -0.383 0.716
#> 3 3 <tib… <htes… -0.563 32.7 33.3 -0.305 0.769
#> # … with 5 more variables: parameter <dbl>, conf.low <dbl>,
#> # conf.high <dbl>, method <chr>, alternative <chr>
If you needed to, you could do something to show which locations are in each pair; joining with the lookup table would be one way to do this (a sketch follows below).
I'm realizing also that you mentioned wanting to use broom functions afterwards, but didn't specify that you need a broom::tidy call. In that case, just drop the last 2 lines.
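A sketch of that labelling join, assuming the result of the pipeline above has been saved as tests (a hypothetical name):
# build a "pair -> Area_X vs Area_Y" label from the lookup, then join it on
pair_labels <- grps %>%
  group_by(pair) %>%
  summarise(pair_label = paste(loc, collapse = " vs "))
tests %>%
  left_join(pair_labels, by = "pair")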
A little bit of base R will do the trick:
combn(x=unique(sample_long$Location), m=2, simplify=FALSE,
FUN=function(l) {
t.test(distro ~ Location, data=subset(sample_long, Location %in% l))
})
combn will generate all combinations of elements of x taken m at a time (sic). Combined with subset, you will apply your test to subsets of your data.frame.
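To keep track of which pair each list element belongs to, one option (a sketch) is to name the list with a second combn call over the same combinations:
res <- combn(x=unique(sample_long$Location), m=2, simplify=FALSE,
             FUN=function(l) {
               t.test(distro ~ Location, data=subset(sample_long, Location %in% l))
             })
# name each element "Area_X_Area_Y" using the same combinations
names(res) <- combn(unique(sample_long$Location), m=2, FUN=paste, collapse="_")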
I have a data frame with patient data and measurements of different variables over time.
The data frame looks a bit like this, but with more lab-value variables:
df <- data.frame(id=c(1,1,1,1,2,2,2,2,2),
time=c(0,3,7,35,0,7,14,28,42),
labvalue1=c(4.04,NA,2.93,NA,NA,3.78,3.66,NA,2.54),
labvalue2=c(NA,63.8,62.8,61.2,78.1,NA,77.6,75.3,NA))
> df
id time labvalue1 labvalue2
1 1 0 4.04 NA
2 1 3 NA 63.8
3 1 7 2.93 62.8
4 1 35 NA 61.2
5 2 0 NA 78.1
6 2 7 3.78 NA
7 2 14 3.66 77.6
8 2 28 NA 75.3
9 2 42 2.54 NA
I want to calculate, for each patient (unique id), the decrease or slope per day between the first and the last measurement, in order to compare the slopes between patients. Time is in days. So eventually I want a new variable, e.g. diff_labvalue, for each lab value, that gives me for labvalue1:
For patient 1: (2.93-4.04)/(7-0), and for patient 2: (2.54-3.78)/(42-7) (for now ignoring the measurements in between, just last minus first); likewise for labvalue2, and so forth.
So far I have used dplyr and created the first1 and last1 functions below, because first() and last() did not work with the NA values.
Thereafter, I grouped by 'id' and used mutate_all (because there are more lab values in the original df) to calculate the difference between the last1() and first1() lab values for each patient.
But I cannot find how to extract the corresponding time values (the delta time), which I need to calculate the slope of the decline.
Eventually I want something like this (last line):
first1 <- function(x) {
first(na.omit(x))
}
last1 <- function(x) {
last(na.omit(x))
}
df2 = df %>%
group_by(id) %>%
mutate_all(funs(diff=(last1(.)-first1(.)) / #it works until here
(time[position of last1(.)]-time[position of first1(.)]))) #something like this
Not sure if tidyverse even has a solution for this, so any help would be appreciated. :)
We can try the following (note: here labs stands in for a single lab-value column such as labvalue1; a sketch covering all lab-value columns at once follows the data.table option below):
df %>%
group_by(id) %>%
filter(!is.na(labs)) %>%
summarise(diff_labs = (last(labs) - first(labs))/(last(time) - first(time)))
# A tibble: 2 x 2
# id diff_labs
# <dbl> <dbl>
#1 1 -0.15857143
#2 2 -0.03542857
and
> (2.93-4.04)/ (7-0)
#[1] -0.1585714
> (2.54-3.78)/(42-7)
#[1] -0.03542857
Or another option is data.table
library(data.table)
setDT(df)[!is.na(labs), .(diff_labs = (labs[.N] - labs[1])/(time[.N] - time[1])) , id]
# id diff_labs
#1: 1 -0.15857143
#2: 2 -0.03542857
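For the multiple lab-value columns in the original df, here is a hedged sketch with across(), reusing the first1/last1 helpers from the question and assuming the columns all start with "labvalue" and each id has at least two non-NA values per column. For each column it pairs the first and last non-NA value with that value's own time:
df %>%
  group_by(id) %>%
  summarise(across(starts_with("labvalue"),
                   ~ (last1(.x) - first1(.x)) /
                     (time[max(which(!is.na(.x)))] - time[min(which(!is.na(.x)))]),
                   .names = "{.col}_diff"))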
I was trying to use ggplot2 to plot the built-in anscombe data set in R (which contains four different small data sets with identical correlations but radically different relationships between X and Y). My attempts to reshape the data properly were all pretty ugly. I used a combination of reshape2 and base R; a Hadleyverse 2 (tidyr/dplyr) or a data.table solution would be fine with me, but the ideal solution would be
short/no repeated code
comprehensible (somewhat conflicting with criterion #1)
involve as little hard-coding of column numbers, etc. as possible
The original format:
anscombe
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
## 3 13 13 13 8 7.58 8.74 12.74 7.71
## ...
## 11 5 5 5 8 5.68 4.74 5.73 6.89
Desired format:
## s x y
## 1 1 10 8.04
## 2 1 8 6.95
## ...
## 44 4 8 6.89
Here's my attempt:
library("reshape2")
ff <- function(x,v)
setNames(transform(
melt(as.matrix(x)),
v1=substr(Var2,1,1),
v2=substr(Var2,2,2))[,c(3,5)],
c(v,"s"))
f1 <- ff(anscombe[,1:4],"x")
f2 <- ff(anscombe[,5:8],"y")
f12 <- cbind(f1,f2)[,c("s","x","y")]
Now plot:
library("ggplot2"); theme_set(theme_classic())
th_clean <-
theme(panel.margin=grid::unit(0,"lines"),
axis.ticks.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.y=element_blank(),
axis.text.y=element_blank()
)
ggplot(f12,aes(x,y))+geom_point()+
facet_wrap(~s)+labs(x="",y="")+
th_clean
If you are really dealing with the "anscombe" dataset, then I would say @Thela's reshape solution is very direct.
However, here are a few other options to consider:
Option 1: Base R
You can write your own "reshape" function, perhaps something like this:
myReshape <- function(indf = anscombe, stubs = c("x", "y")) {
temp <- sapply(stubs, function(x) {
unlist(indf[grep(x, names(indf))], use.names = FALSE)
})
s <- rep(seq_along(grep(stubs[1], names(indf))), each = nrow(indf))
data.frame(s, temp)
}
Notes:
I'm not sure that this is necessarily less clunky than what you're already doing
This approach will not work if the data are "unbalanced" (for example, more "x" columns than "y" columns.)
Option 2: "dplyr" + "tidyr"
Since pipes are the rage these days, you can also try:
library(dplyr)
library(tidyr)
anscombe %>%
gather(var, val, everything()) %>%
extract(var, into = c("variable", "s"), "(.)(.)") %>%
group_by(variable, s) %>%
mutate(ind = sequence(n())) %>%
spread(variable, val)
Notes:
I'm not sure that this is necessarily less clunky than what you're already doing, but some people like the pipe approach.
This approach should be able to handle unbalanced data.
Option 3: "splitstackshape"
Before @Arun went and did all that fantastic work on melt.data.table, I had written merged.stack in my "splitstackshape" package. With that, the approach would be:
library(splitstackshape)
setnames(
merged.stack(
data.table(anscombe, keep.rownames = TRUE),
var.stubs = c("x", "y"), sep = "var.stubs"),
".time_1", "s")[]
A few notes:
merged.stack needs something to treat as an "id", hence the need for data.table(anscombe, keep.rownames = TRUE), which adds a column named "rn" with the row numbers
The sep = "var.stubs" basically means that we don't really have a separator variable, so we'll just strip out the stub and use whatever remains for the "time" variable
merged.stack will work if the data are unbalanced. For instance, try using it with anscombe2 <- anscombe[1:7] as your dataset instead of "anscombe".
The same package also has a function called Reshape that builds upon reshape to let it reshape unbalanced data. But it's slower and less flexible than merged.stack. The basic approach would be Reshape(data.table(anscombe, keep.rownames = TRUE), var.stubs = c("x", "y"), sep = "") and then rename the "time" variable using setnames.
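For concreteness, a sketch of that Reshape() route, following the description above (the "time" column name is as stated there):
out <- Reshape(data.table(anscombe, keep.rownames = TRUE),
               var.stubs = c("x", "y"), sep = "")
setnames(out, "time", "s")  # rename the "time" variable, as noted above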
Option 4: melt.data.table
This was mentioned in the comments above, but hasn't been shared as an answer. Outside of base R's reshape, this is a very direct approach that handles column renaming from within the function itself:
library(data.table)
melt(as.data.table(anscombe),
measure.vars = patterns(c("x", "y")),
value.name=c('x', 'y'),
variable.name = "s")
Notes:
Will be insanely fast.
Much better supported than "splitstackshape" or reshape ;-)
Handles unbalanced data just fine.
I think this meets the criteria of being 1) short 2) comprehensible and 3) no hardcoded column numbers. And it doesn't require any other packages.
reshape(anscombe, varying=TRUE, sep="", direction="long", timevar="s")
# s x y id
#1.1 1 10 8.04 1
#...
#11.1 1 5 5.68 11
#1.2 2 10 9.14 1
#...
#11.2 2 5 4.74 11
#1.3 3 10 7.46 1
#...
#11.3 3 5 5.73 11
#1.4 4 8 6.58 1
#...
#11.4 4 8 6.89 11
I don't know if a non-reshape solution would be acceptable, but here you go:
library(data.table)
#create the pattern that will have the Xs
#this will make it easy to create the Ys
pattern <- 1:4
#use Map to create a list of data.frames with the needed columns
#and also use rbindlist to rbind the list produced by Map
lists <- rbindlist(Map(data.frame,
pattern,
anscombe[pattern],
anscombe[pattern+length(pattern)]
)
)
#set the correct names
setnames(lists, names(lists), c('s','x','y'))
Output:
> lists
s x y
1: 1 10 8.04
2: 1 8 6.95
3: 1 13 7.58
4: 1 9 8.81
5: 1 11 8.33
6: 1 14 9.96
7: 1 6 7.24
8: 1 4 4.26
9: 1 12 10.84
10: 1 7 4.82
....
A newer tidyverse option is suggested in tidyr's pivoting vignette:
library(tidyr)
library(dplyr)
anscombe %>%
pivot_longer(everything(),
names_to = c(".value", "set"),
names_pattern = "(.)(.)"
) %>%
arrange(set)
#> # A tibble: 44 x 3
#> set x y
#> <chr> <dbl> <dbl>
#> 1 1 10 8.04
#> 2 1 8 6.95
#> 3 1 13 7.58
#> 4 1 9 8.81
#> 5 1 11 8.33
#> 6 1 14 9.96
#> 7 1 6 7.24
#> 8 1 4 4.26
#> 9 1 12 10.8
#> 10 1 7 4.82
#> # … with 34 more rows