I want to pivot multiple sets of variables in a data frame. My data looks like this:
require(dplyr)
require(tidyr)
x_1=rnorm(10,0,1)
x_2=rnorm(10,0,1)
x_3=rnorm(10,0,1)
y_1=rnorm(10,0,1)
y_2=rnorm(10,0,1)
aid=rep(1:5,2)
data=data.frame(aid, x_1,x_2,y_1,y_2)
> data
aid x_1 x_2 y_1 y_2
1 1 -0.82305819 0.9366731 0.95419200 2.29544019
2 2 0.64424320 -0.2807793 0.51303834 0.02560463
3 3 -1.11108822 -0.2475625 0.05747951 -0.51218368
4 4 -1.04026895 -0.4138653 0.57751999 0.60942652
5 5 1.29097040 -1.7829966 1.59940532 0.75868562
6 1 -0.57845406 -1.0002074 0.04302291 0.86766265
7 2 0.08996163 -0.7949632 -2.10422124 -0.43432995
8 3 0.14331978 0.4203010 -1.12748270 0.14484670
9 4 -0.25207187 1.5559295 0.23621422 -0.04719046
10 5 -0.25617731 0.6241852 -1.21131110 1.02236458
I want to pivot the x and y variables separately. I did that using the following lines of code.
data2 = data %>% reshape(.,direction = "long",
varying = list(c('x_1','x_2'),
c('y_1','y_2')),
v.names = c("x",'y'))
I need to generalize this to any number of columns. In this example x and y have 2 columns each, but another data set may have a different number. With more columns, it would be tedious to type every name in the varying argument.
In order to avoid specifying the columns when pivoting, I tried this code:
data1 <- data%>% pivot_longer(!aid, names_to = c("id"), names_pattern = "(.)(.)")
But it gave this error:
Error: `regex` should define 1 groups; found.
Can anyone help me to fix this?
Thank you.
The parentheses around a pattern mean that the match is captured as a group. In the code below, the first group captures one or more lower-case letters ([a-z]+). It is followed by a _, which sits outside the parentheses and is therefore dropped. The second group captures one or more digits (\\d+). Each group is matched with the corresponding element of names_to: .value means that the captured text becomes the name of an output column holding the values, so we get the columns 'x' and 'y', while the second element, 'time', is a new column that holds the digit suffix of the original column names.
library(tidyr)
pivot_longer(data, cols = -aid, names_to = c(".value", "time"),
names_pattern = "^([a-z]+)_(\\d+)")
-output
# A tibble: 20 × 4
aid time x y
<int> <chr> <dbl> <dbl>
1 1 1 -0.823 0.954
2 1 2 0.937 2.30
3 2 1 0.644 0.513
4 2 2 -0.281 0.0256
5 3 1 -1.11 0.0575
6 3 2 -0.248 -0.512
7 4 1 -1.04 0.578
8 4 2 -0.414 0.609
9 5 1 1.29 1.60
10 5 2 -1.78 0.759
11 1 1 -0.578 0.0430
12 1 2 -1.00 0.868
13 2 1 0.0900 -2.10
14 2 2 -0.795 -0.434
15 3 1 0.143 -1.13
16 3 2 0.420 0.145
17 4 1 -0.252 0.236
18 4 2 1.56 -0.0472
19 5 1 -0.256 -1.21
20 5 2 0.624 1.02
The OP's code fails because the pattern defines two groups ((.) and (.)) while names_to has only one element. The pattern also ignores the _ between the 'x'/'y' prefix and the digit. Finally, names_pattern is interpreted as a regular expression by default, so some characters act as metacharacters: . matches any character, not a literal dot.
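To see what the two capture groups pick up, here is a minimal base R sketch. It only inspects the column names and does not reshape anything; the helper objects (nms, pattern, parts, groups) are names made up for this illustration.

```r
# Column names from the question's data frame (excluding the id column "aid")
nms <- c("x_1", "x_2", "y_1", "y_2")

# Pattern from the answer: two capture groups separated by a literal underscore
pattern <- "^([a-z]+)_(\\d+)"

# regexec() returns the full match plus one entry per capture group
parts <- regmatches(nms, regexec(pattern, nms))

# Keep only the two groups: column 1 feeds ".value", column 2 feeds "time"
groups <- t(sapply(parts, function(p) p[-1]))
colnames(groups) <- c(".value", "time")
print(groups)

# The question's pattern "(.)(.)" also has two groups, but they capture
# the first two characters ("x" and "_"), not the prefix and the digit
print(regmatches("x_1", regexec("(.)(.)", "x_1"))[[1]])
```

The first column of groups contains the prefixes and the second the digit suffixes, which is how pivot_longer pairs them with the two elements of names_to.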
In this case names_sep is a handy alternative to names_pattern as the column names are already separated by _:
library(dplyr)
library(tidyr)
data %>%
pivot_longer(-aid,
names_to =c(".value","time"),
names_sep ="_"
)
aid time x y
<int> <chr> <dbl> <dbl>
1 1 1 1.08 -1.49
2 1 2 0.871 0.449
3 2 1 -1.01 -0.577
4 2 2 1.23 -0.0890
5 3 1 -0.905 -0.289
6 3 2 1.16 -0.380
7 4 1 -0.316 -0.446
8 4 2 0.902 1.05
9 5 1 -0.908 1.36
10 5 2 -0.558 -1.57
11 1 1 -0.383 1.22
12 1 2 0.704 0.000539
13 2 1 0.595 -0.668
14 2 2 -0.461 1.46
15 3 1 2.00 -0.365
16 3 2 -1.14 0.150
17 4 1 -2.13 -0.827
18 4 2 0.642 -0.798
19 5 1 0.397 -0.0143
20 5 2 0.981 1.79
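The question asks for something that works with any number of columns. As a hedged sketch, the same call can be tried on a hypothetical data frame (data_uneven, invented here) in which x has three columns and y only two; no column names need to be typed out.

```r
library(tidyr)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical data: three x columns but only two y columns
data_uneven <- data.frame(
  aid = rep(1:5, 2),
  x_1 = rnorm(10), x_2 = rnorm(10), x_3 = rnorm(10),
  y_1 = rnorm(10), y_2 = rnorm(10)
)

long <- pivot_longer(
  data_uneven,
  cols = -aid,
  names_to = c(".value", "time"),
  names_sep = "_"
)

# One row per original row and time suffix
nrow(long)

# There is no y_3 column, so y is missing wherever time == "3"
all(is.na(long$y[long$time == "3"]))
```

Sets of unequal size are padded with NA rather than raising an error, so it is worth checking for missing values after reshaping.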
I am trying to spread my data so that the months become columns, keyed by both site and species. I tried to use recast, but I lose the information about species. What do I do to get the expected output (attached)?
set.seed(111)
month <- rep(c("J","F","M"), each = 6)
site <- rep(c(1,2,3,4,5,6), times = 3)
spA <- rnorm(18,0,2)
spB <- rnorm(18,0,2)
spC <- rnorm(18,0,2)
spD <- rnorm(18,0,2)
df <- data.frame(month, site, spA, spB, spC, spD)
df.test <- reshape2::recast(df, site ~ month)
Here is what I am getting.
site F J M
1 1 5 5 5
2 2 5 5 5
3 3 5 5 5
4 4 5 5 5
5 5 5 5 5
6 6 5 5 5
#Expected output (It's dummy data)
site sp J F M
1 A 5 6 7
1 B 2 3 4
..
6 D 1 2 3
If the intention is not to aggregate but just to transpose, we can use pivot_longer to reshape to long format and then pivot_wider to reshape back to wide:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = starts_with('sp'), names_prefix = 'sp',
names_to = 'sp') %>%
pivot_wider(names_from = month, values_from = value)
-output
# A tibble: 24 × 5
site sp J F M
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 A 0.470 -2.99 3.69
2 1 B -2.39 0.653 -6.23
3 1 C -0.232 -2.72 4.97
4 1 D 0.350 -0.433 0.405
5 2 A -0.661 -2.02 0.788
6 2 B 0.728 1.20 -1.88
7 2 C 0.669 0.962 3.92
8 2 D -1.69 2.89 -1.61
9 3 A -0.623 -1.90 1.60
10 3 B 0.723 -3.68 2.80
# … with 14 more rows
Or use recast: specify id.var and also include variable in the formula:
library(reshape2)
reshape2::recast(df, site + variable ~ month, id.var = c("month", "site"))
site variable F J M
1 1 spA -2.99485331 0.4704414 3.6912725
2 1 spB 0.65309848 -2.3872179 -6.2264346
3 1 spC -2.72380897 -0.2323101 4.9713231
4 1 spD -0.43285732 0.3501913 0.4046144
5 2 spA -2.02037684 -0.6614717 0.7881082
6 2 spB 1.19650840 0.7283735 -1.8827148
7 2 spC 0.96224916 0.6685120 3.9199634
8 2 spD 2.89295633 -1.6945355 -1.6123984
9 3 spA -1.89695121 -0.6232476 1.5950570
10 3 spB -3.68306860 0.7233249 2.8005176
11 3 spC 1.48394325 -1.2417162 0.3833268
12 3 spD 0.81941960 1.9564633 0.5892684
13 4 spA -0.98792443 -4.6046913 -3.1333307
14 4 spB 5.43611120 0.6939287 -3.2409401
15 4 spC 0.05564925 -2.6196898 3.1050885
16 4 spD 1.82183314 3.6117365 2.8097662
...
I want to perform a comparison between the slope of different regressions: CO2 changes through time (day) for 8 different nests.
> structure(as1)
# A tibble: 16 x 4
day nest N2O CO2
<dbl> <dbl> <dbl> <dbl>
1 1 1 0.00549 0.206
2 1 2 0.129 0.0343
3 1 3 0.157 0.113
4 1 4 0.0760 0.106
5 2 1 -0.0487 0.214
6 2 2 -0.0561 0.358
7 2 3 -0.0522 0.767
8 2 4 -0.0193 0.188
9 3 1 -0.0757 0.255
10 3 2 -0.237 0.753
11 3 3 -0.117 0.745
12 3 4 0.0345 0.502
13 4 1 0.135 0.325
14 4 2 0.264 0.767
15 4 3 0.0116 0.926
16 4 4 0.0342 0.358
I'm following the instructions given in https://stats.stackexchange.com/questions/33013/what-test-can-i-use-to-compare-slopes-from-two-or-more-regression-models, in the answer with a score of 16.
Instead of the lsmeans package that the answer suggests, I used emmeans, because lsmeans encourages users to switch to emmeans. However, I also tried it with lsmeans and got the same problem. When I run this:
library(emmeans)
m.interaction <- lm(CO2 ~ day*nest, data = as1)
anova(m.interaction)
# Obtain slopes
m.interaction$coefficients
m.lst <- lstrends(m.interaction, "day", var="CO2", data = as1)
Everything works fine until lstrends, where I get this error:
##Error in emmfcn(...) : Variable 'CO2' is not in the dataset
Does anybody know what might be happening?
Thanks in advance!
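One possible reading of the error, offered as a sketch rather than a confirmed diagnosis: in emtrends (and its lstrends alias) the var argument names the predictor whose slope is estimated, and the specs argument names the grouping factor. CO2 is the response, which would explain why it is "not in the dataset" of predictors. The code below assumes nest is meant as a grouping variable and converts it to a factor; the data are copied from the printed tibble.

```r
library(emmeans)

# Data copied from the printed tibble in the question
as1 <- data.frame(
  day  = rep(1:4, each = 4),
  nest = rep(1:4, times = 4),
  CO2  = c(0.206, 0.0343, 0.113, 0.106,
           0.214, 0.358, 0.767, 0.188,
           0.255, 0.753, 0.745, 0.502,
           0.325, 0.767, 0.926, 0.358)
)

# Assumption: nest identifies groups, so it is treated as a factor
as1$nest <- factor(as1$nest)

m.interaction <- lm(CO2 ~ day * nest, data = as1)

# specs = the grouping factor, var = the predictor whose slope is wanted
m.lst <- emtrends(m.interaction, "nest", var = "day")
m.lst

# Pairwise comparisons of the per-nest slopes
pairs(m.lst)
```

With nest left as a numeric column, the model would fit a single day-by-nest interaction coefficient instead of one slope per nest, so the conversion to a factor matters for this comparison.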
I have a question about how to add the slopes of regression lines, computed by category, to a new data frame.
d1 <-read.csv(file.choose(), header = T)
d2 <- d1 %>%
group_by(ID)%>%
mutate(Slope=sapply(split(df,df$ID), function(v) lm(x~y,v)$coefficients["y"]))
ID x y
1 3.429865279 2.431363764
1 3.595066124 2.681241237
1 3.735263469 2.352182518
1 3.316473584 2.51851394
1 3.285984642 2.380211242
1 3.860793029 2.62324929
1 3.397714117 2.819543936
1 3.452997088 2.176091259
1 3.718933278 2.556302501
1 3.518566578 2.537819095
1 3.689033452 2.40654018
1 3.349160923 2.113943352
1 3.658888644 2.556302501
1 3.251151343 2.342422681
1 3.911194909 2.439332694
1 3.432584505 2.079181246
1 4.031267043 2.681241237
1 3.168733129 1.544068044
1 4.032239897 3.084576278
1 3.663361648 2.255272505
1 3.582302046 2.62324929
1 3.606585565 2.079181246
1 3.541791347 2.176091259
4 3.844012861 2.892094603
4 3.608318477 2.767155866
4 3.588990218 2.883661435
4 3.607957917 2.653212514
4 3.306753044 2.079181246
4 4.002604841 2.880813592
4 3.195299837 2.079181246
4 3.512203238 2.643452676
4 3.66878494 2.431363764
4 3.598910385 2.511883361
4 3.721810134 2.819543936
4 3.352964661 2.113943352
4 4.008109343 3.084576278
4 3.584693332 2.556302501
4 4.019461819 3.084576278
4 3.359474563 2.079181246
4 3.950256012 2.829303773
I got an error message like 'replacement has 2 rows, data has 119'. I am sure that the error comes from mutate().
Best,
Once you call group_by, every function that follows operates on the columns within each group of the grouped data.frame. In your case, it uses only the x and y values belonging to each ID.
If you only want the coefficient, it goes like this:
df %>% group_by(ID) %>% summarize(coef=lm(x~y)$coefficients["y"])
# A tibble: 2 x 2
ID coef
<int> <dbl>
1 1 0.437
2 4 0.660
If you want the coefficient repeated on every row, i.e. a vector as long as the data frame, use mutate:
df %>% group_by(ID) %>% mutate(coef=lm(x~y)$coefficients["y"])
# A tibble: 40 x 4
# Groups: ID [2]
ID x y coef
<int> <dbl> <dbl> <dbl>
1 1 3.43 2.43 0.437
2 1 3.60 2.68 0.437
3 1 3.74 2.35 0.437
4 1 3.32 2.52 0.437
5 1 3.29 2.38 0.437
6 1 3.86 2.62 0.437
7 1 3.40 2.82 0.437
8 1 3.45 2.18 0.437
9 1 3.72 2.56 0.437
10 1 3.52 2.54 0.437
# … with 30 more rows
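For comparison, here is a base R sketch of why the split/sapply call in the question does not fit inside mutate: it returns one slope per ID, not one per row. The data frame d below is only the first five rows of each ID from the question, and the object names are made up for this illustration.

```r
# First five rows of each ID from the question's data (a subset, for brevity)
d <- data.frame(
  ID = rep(c(1, 4), each = 5),
  x = c(3.429865279, 3.595066124, 3.735263469, 3.316473584, 3.285984642,
        3.844012861, 3.608318477, 3.588990218, 3.607957917, 3.306753044),
  y = c(2.431363764, 2.681241237, 2.352182518, 2.51851394, 2.380211242,
        2.892094603, 2.767155866, 2.883661435, 2.653212514, 2.079181246)
)

# One slope per ID: a named vector of length 2, not one value per row
slopes <- sapply(split(d, d$ID), function(g) coef(lm(x ~ y, data = g))[["y"]])
length(slopes)

# Look the slope up by ID to get one value per row
d$Slope <- unname(slopes[as.character(d$ID)])
head(d)
```

The lookup by ID plays the same role as mutate on a grouped data frame: each row receives the slope of its own group.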
I have some data on organism survival as a function of time. The data is constructed using the averages of many replicates for each time point, which can yield a forward time step with an increase in survival. Occasionally, this results in a survivorship greater than 1, which is impossible. How can I conditionally change values greater than 1 to the value preceding them in the same column?
Here's what the data looks like:
>df
Generation Treatment time lx
1 0 1 0 1
2 0 1 2 1
3 0 1 4 0.970
4 0 1 6 0.952
5 0 1 8 0.924
6 0 1 10 0.913
7 0 1 12 0.895
8 0 1 14 0.729
9 0 2 0 1
10 0 2 2 1
I've tried mutating the column of interest as such, which still yields values above 1:
df1 <- df %>%
group_by(Generation, Treatment) %>%
mutate(lx_diag = as.numeric(lx/lag(lx, default = first(lx)))) %>% #calculate running survival
mutate(lx_diag = if_else(lx_diag > 1.000000, lag(lx_diag), lx_diag)) #substitute values >1 with previous value
>df1
Generation Treatment time lx lx_diag
1 12 1 0 1 1
2 12 1 2 1 1
3 12 1 4 1 1
4 12 1 6 0.996 0.996
5 12 1 8 0.988 0.992
6 12 1 10 0.956 0.968
7 12 1 12 0.884 0.925
8 12 1 14 0.72 0.814
9 12 1 15 0.729 1.01
10 12 1 19 0.76 1.04
I expect the results to look something like:
>df1
Generation Treatment time lx lx_diag
1 12 1 0 1 1
2 12 1 2 1 1
3 12 1 4 1 1
4 12 1 6 0.996 0.996
5 12 1 8 0.988 0.992
6 12 1 10 0.956 0.968
7 12 1 12 0.884 0.925
8 12 1 14 0.72 0.814
9 12 1 15 0.729 0.814
10 12 1 19 0.76 0.814
I know you can conditionally change the values to a specific value (i.e. ifelse with no else), but I haven't found any solutions that can conditionally change a value in a column to the value in the previous row. Any help is appreciated.
EDIT: I realized that mutate and if_else are vectorized. Instead of replacing values in sequence from first to last, as I had expected, they replace all values at the same time, each using the original, unmodified lagged value. So in a run of consecutive values >1, some are left behind. Thus, if you just run the command:
SurvTot1$lx_diag <- if_else(SurvTot1$lx_diag > 1, lag(SurvTot1$lx_diag), SurvTot1$lx_diag)
over again, you can get rid of the values >1. Not the most elegant solution, but it works.
This looks like a very ugly solution to me, but I couldn't think of anything else:
df = data.frame(
"Generation" = rep(12,10),
"Treatent" = rep(1,10),
"Time" = c(seq(0,14,by=2),15,19),
"lx_diag" = c(1,1,1,0.996,0.992,0.968,0.925,0.814,1.04,1.04)
)
update_lag = function(x){
k <<- k+1
x
}
k=1
df %>%
mutate(
lx_diag2 = ifelse(lx_diag <=1,update_lag(lx_diag),lag(lx_diag,n=k))
)
Using the data from @Fino, here is my solution using base R:
vals.to.replace <- which(df$lx_diag > 1)
vals.to.substitute <- sapply(vals.to.replace, function(x) tail( df$lx_diag[which(df$lx_diag[1:x] <= 1)], 1) )
df$lx_diag[vals.to.replace] = vals.to.substitute
df
Generation Treatent Time lx_diag
1 12 1 0 1.000
2 12 1 2 1.000
3 12 1 4 1.000
4 12 1 6 0.996
5 12 1 8 0.992
6 12 1 10 0.968
7 12 1 12 0.925
8 12 1 14 0.814
9 12 1 15 0.814
10 12 1 19 0.814
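A possible alternative without sapply, sketched under the assumption that the first value of the series is valid (at most 1): track the index of the most recent valid value with cummax and use it to index the vector. The names x, last_valid and fixed are made up for this illustration.

```r
# lx_diag values from the example data above
x <- c(1, 1, 1, 0.996, 0.992, 0.968, 0.925, 0.814, 1.04, 1.04)

# Index of the most recent valid (<= 1) value at each position.
# Assumes the first value is valid, so the index is never 0.
last_valid <- cummax(ifelse(x <= 1, seq_along(x), 0L))

# Every value > 1 is replaced by the last valid value before it
fixed <- x[last_valid]
print(fixed)
```

Because this handles runs of consecutive values above 1 in a single pass, the command does not need to be repeated. For grouped data, the same expression could be applied inside mutate after group_by.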
I'm looking for an efficient way to create multiple two-dimensional tables from an R data frame of chi-square statistics. The code below builds on this answer to a previous question of mine about getting chi-square stats by groups. Now I want to create tables from the output by group. Here's what I have so far using the hsbdemo data frame from the UCLA R site:
ml <- foreign::read.dta("https://stats.idre.ucla.edu/stat/data/hsbdemo.dta")
str(ml)
'data.frame': 200 obs. of 13 variables:
$ id : num 45 108 15 67 153 51 164 133 2 53 ...
$ female : Factor w/ 2 levels "male","female": 2 1 1 1 1 2 1 1 2 1 ...
$ ses : Factor w/ 3 levels "low","middle",..: 1 2 3 1 2 3 2 2 2 2 ...
$ schtyp : Factor w/ 2 levels "public","private": 1 1 1 1 1 1 1 1 1 1 ...
$ prog : Factor w/ 3 levels "general","academic",..: 3 1 3 3 3 1 3 3 3 3 ...
ml %>%
dplyr::select(prog, ses, schtyp) %>%
table() %>%
apply(3, chisq.test, simulate.p.value = TRUE) %>%
lapply(`[`, c(6,7,9)) %>%
reshape2::melt() %>%
tidyr::spread(key = L2, value = value) %>%
dplyr::rename(SchoolType = L1) %>%
dplyr::arrange(SchoolType, prog) %>%
dplyr::select(-observed, -expected) %>%
reshape2::acast(., prog ~ ses ~ SchoolType ) %>%
tbl_df()
The output after the last arrange statement produces this tibble (showing only the first five rows):
prog ses SchoolType expected observed stdres
1 general low private 0.37500 2 3.0404678
2 general middle private 3.56250 3 -0.5187244
3 general high private 2.06250 1 -1.0131777
4 academic low private 1.50000 0 -2.5298221
5 academic middle private 14.25000 14 -0.2078097
It's easy to select one column, for example, stdres, and pass it to acast and tbl_df, which gets pretty much what I'm after:
# A tibble: 3 x 6
low.private middle.private high.private low.public middle.public high.public
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3.04 -0.519 -1.01 1.47 -0.236 -1.18
2 -2.53 -0.208 1.50 -0.940 -2.06 3.21
3 -0.377 1.21 -1.06 -0.331 2.50 -2.45
Now I can repeat these steps for the observed and expected frequencies and bind them by rows, but that seems inefficient. The output would be the observed frequencies stacked on the expected frequencies, stacked on the standardized residuals. Something like this:
low.private middle.private high.private low.public middle.public high.public
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 3 1 14 17 8
2 0 14 10 19 30 32
3 0 2 0 12 29 7
4 0.375 3.56 2.06 10.4 17.6 10.9
5 1.5 14.2 8.25 21.7 36.6 22.7
6 0.125 1.19 0.688 12.9 21.7 13.4
7 3.04 -0.519 -1.01 1.47 -0.236 -1.18
8 -2.53 -0.208 1.50 -0.940 -2.06 3.21
9 -0.377 1.21 -1.06 -0.331 2.50 -2.45
Seems there ought to be a way to do this without repeating the code for each column, probably by creating and processing a list. Thanks in advance.
Might this be the answer?
ml1 <- ml %>%
dplyr::select(prog, ses, schtyp) %>%
table() %>%
apply(3, chisq.test, simulate.p.value = TRUE) %>%
lapply(`[`, c(6,7,9)) %>%
reshape2::melt()
ml2 <- ml1 %>%
dplyr::mutate(type=paste(ses, L1, sep=".")) %>%
dplyr::select(-ses, -L1) %>%
tidyr::spread(type, value)
This gives you
prog L2 high.private high.public low.private low.public middle.private middle.public
1 general expected 2.062500 10.910714 0.3750000 10.4464286 3.5625000 17.6428571
2 general observed 1.000000 8.000000 2.0000000 14.0000000 3.0000000 17.0000000
3 general stdres -1.013178 -1.184936 3.0404678 1.4663681 -0.5187244 -0.2360209
4 academic expected 8.250000 22.660714 1.5000000 21.6964286 14.2500000 36.6428571
5 academic observed 10.000000 32.000000 0.0000000 19.0000000 14.0000000 30.0000000
6 academic stdres 1.504203 3.212431 -2.5298221 -0.9401386 -0.2078097 -2.0607058
7 vocation expected 0.687500 13.428571 0.1250000 12.8571429 1.1875000 21.7142857
8 vocation observed 0.000000 7.000000 0.0000000 12.0000000 2.0000000 29.0000000
9 vocation stdres -1.057100 -2.445826 -0.3771236 -0.3305575 1.2081594 2.4999085
I am not sure I completely understand what you are after... But basically, create a new variable that combines SES and school type, and spread based on that. And obviously, reorder it as you wish :-)
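As a side note on the lapply(`[`, c(6,7,9)) step used in both pipelines: the positions refer to the observed, expected and stdres components of the chisq.test result. A small sketch, using the built-in mtcars data purely as a stand-in for hsbdemo, shows that selecting by name gives the same result and may be easier to read.

```r
# Small built-in example table (mtcars is only a stand-in for the hsbdemo data)
tab <- table(mtcars$cyl, mtcars$gear)
res <- chisq.test(tab, simulate.p.value = TRUE)

# Positions 6, 7 and 9 of the result correspond to these components
names(res)[c(6, 7, 9)]

# Selecting by name is equivalent to selecting by position
by_name <- res[c("observed", "expected", "stdres")]
by_pos  <- res[c(6, 7, 9)]
identical(by_name, by_pos)
```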