Print data.frame structure as character - r

I have a data.frame that looks like this:
df <- data.frame(
y = c(0.348, 0.099, 0.041, 0.022, 0.015, 0.010, 0.007, 0.005, 0.004, 0.003),
x = c(458, 648, 694, 724, 756, 790, 818, 836, 848, 876))
When I print the data.frame I (obviously) get this output:
df
# y x
# 1 0.348 458
# 2 0.099 648
# 3 0.041 694
# 4 0.022 724
# 5 0.015 756
# 6 0.010 790
# 7 0.007 818
# 8 0.005 836
# 9 0.004 848
# 10 0.003 876
Is there any function where I can print the data.frame as a character string (or similar)?
magic_function(df)
# output
"df <- data.frame(
y = c(0.348, 0.099, 0.041, 0.022, 0.015, 0.010, 0.007, 0.005, 0.004, 0.003),
x = c(458, 648, 694, 724, 756, 790, 818, 836, 848, 876))"
I literally want to print out something like "df <- data.frame(x = c(...), y = (...))" so that I can copy the output and paste it to a stackoverflow question (for reproducibility)!

I just had to do this recently. deparse will do the trick, and you can paste the multi-line output into a single string with collapse:
df.as.char <- paste(deparse(df), collapse = "")
df.as.char
# [1] "structure(list(y = c(0.348, 0.099, 0.041, 0.022, 0.015, 0.01, 0.007, 0.005, 0.004, 0.003), x = c(458, 648, 694, 724, 756, 790, 818, 836, 848, 876)), .Names = c(\"y\", \"x\"), row.names = c(NA, -10L), class = \"data.frame\")"
Depending on the size of your object, you might consider using the width.cutoff argument to deparse (which will reduce the number of lines created by deparse).
If you've got the same thing in mind that I did, then you can assign this through:
df.from.char <- eval(parse(text = df.as.char))
df.from.char
# y x
# 1 0.348 458
# 2 0.099 648
# 3 0.041 694
# 4 0.022 724
# 5 0.015 756
# 6 0.010 790
# 7 0.007 818
# 8 0.005 836
# 9 0.004 848
# 10 0.003 876
identical(df.from.char, df)
# [1] TRUE
And if you really need the assignment arrow to be part of the character, just paste0 that in.
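A minimal sketch of that last step, using a two-row stand-in for df; the names df.assign and scratch are only illustrative and not part of the answer above:

```r
# Two-row stand-in for the df in the question
df <- data.frame(y = c(0.348, 0.099), x = c(458, 648))

# Collapse the deparsed structure into one string, as above
df.as.char <- paste(deparse(df), collapse = "")

# Prepend the assignment arrow so the string is a complete statement
df.assign <- paste0("df <- ", df.as.char)

# Evaluate in a scratch environment so the original df is not overwritten
scratch <- new.env()
eval(parse(text = df.assign), envir = scratch)
roundtrip_ok <- identical(scratch$df, df)
```

Printing with cat(df.assign) or writeLines(df.assign) rather than print() shows the string without the escaped \" quotes, which is usually what you want when pasting it elsewhere.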

One option is to use dput():
dput(df)
returns:
structure(list(y = c(0.348, 0.099, 0.041, 0.022, 0.015, 0.01,
0.007, 0.005, 0.004, 0.003), x = c(458, 648, 694, 724, 756, 790,
818, 836, 848, 876)), .Names = c("y", "x"), row.names = c(NA,
-10L), class = "data.frame")

I think I got something!
df4so <- function(df) {
# collapse dput
# shout out to KonradRudolph, Roland and MichaelChirico
a <- paste(capture.output(dput(df)), collapse = "")
# remove structure junk
b <- gsub("structure\\(list\\(", "", a)
# remove everything after names
c <- gsub("\\.Names\\s.*","",b)
# remove trailing whitespace
d <- gsub("\\,\\s+$", "", c)
# put it all together
e <- paste0('df <- data.frame(', d)
# return
print(e)
}
df4so(df)
Output:
[1] "df <- data.frame(y = c(0.348, 0.099, 0.041, 0.022, 0.015, 0.01, 0.007, 0.005, 0.004, 0.003), x = c(458, 648, 694, 724, 756, 790, 818, 836, 848, 876))"
Suitable for copying and pasting to stackoverflow!
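The regex approach depends on dput() printing a .Names argument. Some of the other dput() listings on this page have no .Names at all, and on such output the pattern would not match. A sketch that avoids parsing dput() output, by deparsing each column separately, might look like this; df_to_code is a hypothetical helper name, and it assumes syntactic column names:

```r
# Hypothetical helper: build a "name <- data.frame(...)" string column by column
df_to_code <- function(df, name = "df") {
  # Deparse every column to a single string such as "c(1, 2, 3)"
  cols <- vapply(
    df,
    function(col) paste(deparse(col, width.cutoff = 500L), collapse = ""),
    character(1)
  )
  # Join "colname = c(...)" pieces and wrap them in the data.frame() call
  paste0(name, " <- data.frame(",
         paste0(names(df), " = ", cols, collapse = ", "), ")")
}

df <- data.frame(
  y = c(0.348, 0.099, 0.041, 0.022, 0.015, 0.010, 0.007, 0.005, 0.004, 0.003),
  x = c(458, 648, 694, 724, 756, 790, 818, 836, 848, 876))

code <- df_to_code(df)
cat(code, "\n")
```

Columns with non-syntactic names (spaces, leading digits) would need backticks, which this sketch does not add.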

Related

Trying to downsize a dataframe by increasing the timestep in R

I have a column which records time in milliseconds starting at 0 and uniformly increasing by .001. I would like to downsize my dataframe by creating a new dataframe that only records the rows that occur every ten milliseconds.
My problem is that the data is in long format and not all participants took the same amount of time to complete the task, so I cannot just take every 10th row.
To try and clarify, this means that whenever there is 0.000 in the time column, I would like to record this point in the new dataframe and then restart the process of taking every tenth millisecond. So far I have tried using "filter" and "subset" with no success.
This is a small example of the data I have:
ID  Time   X    Y
1   0.000  1    5
1   0.001  2    10
1   0.002  3    15
1   0.003  4    20
1   0.004  ...  ...  (and so on, until 0.052)
1   0.053  10   25
2   0.000  30   30
2   0.001  35   35
2   0.002  ...  ...  (and so on, until 0.036)
2   0.037  50   55
3   0.000  55   50
And this is what I would like:
ID  Time   X    Y
1   0.000  1    5
1   0.010  30   40
1   0.020  35   45
1   0.030  30   40
1   0.040  33   44
1   0.050  60   100
2   0.000  30   30
2   0.010  40   40
2   0.020  50   50
2   0.030  60   60
3   0.000  55   50
You can try subset + ave + duplicated like below
subset(
df,
!ave(Time, ID, FUN = function(x) duplicated(ceiling(seq_along(x) / 10)))
)
which gives
ID Time
1 1 0.00
11 1 0.01
21 1 0.02
31 1 0.03
41 1 0.04
51 1 0.05
55 2 0.00
65 2 0.01
75 2 0.02
85 2 0.03
92 3 0.00
Data
> dput(df)
structure(list(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3), Time = c(0,
0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008, 0.009,
0.01, 0.011, 0.012, 0.013, 0.014, 0.015, 0.016, 0.017, 0.018,
0.019, 0.02, 0.021, 0.022, 0.023, 0.024, 0.025, 0.026, 0.027,
0.028, 0.029, 0.03, 0.031, 0.032, 0.033, 0.034, 0.035, 0.036,
0.037, 0.038, 0.039, 0.04, 0.041, 0.042, 0.043, 0.044, 0.045,
0.046, 0.047, 0.048, 0.049, 0.05, 0.051, 0.052, 0.053, 0, 0.001,
0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008, 0.009, 0.01,
0.011, 0.012, 0.013, 0.014, 0.015, 0.016, 0.017, 0.018, 0.019,
0.02, 0.021, 0.022, 0.023, 0.024, 0.025, 0.026, 0.027, 0.028,
0.029, 0.03, 0.031, 0.032, 0.033, 0.034, 0.035, 0.036, 0)), row.names = c(NA,
-92L), class = "data.frame")
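The same selection can be spelled out as "position within ID, then every tenth position", which may be easier to read than the duplicated() trick. This is only a sketch on a smaller stand-in data set (25, 12 and 1 rows per ID); the names pos and res are illustrative:

```r
# Stand-in data: Time restarts at 0 for every ID and grows by 0.001
df <- data.frame(
  ID   = rep(c(1, 2, 3), times = c(25, 12, 1)),
  Time = c(seq(0, by = 0.001, length.out = 25),
           seq(0, by = 0.001, length.out = 12),
           0)
)

# Position of each row within its ID: 1, 2, 3, ... restarting per ID
pos <- ave(seq_len(nrow(df)), df$ID, FUN = seq_along)

# Keep positions 1, 11, 21, ... i.e. Time 0.000, 0.010, 0.020, ...
res <- df[(pos - 1) %% 10 == 0, ]
res
```

Like the answer above, this counts rows rather than comparing Time values, so it assumes no time steps are missing within an ID.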

Reformatting cumulative data

I have data with the cumulative households against the cumulative wealth they possess. I've attached an image of a small amount of the data. Using the R diff() function allows me to get what % of households hold what % of wealth, which is good.
I aim to find the Gini index of my data which I first need to get in a format where the households are evenly spaced. There are 20000 rows or so meaning I need to standardise the wealth owned to 0.005% at a time or something like that so as to attain a true distribution of wealth with households (1,2, etc) and not the percentage of households.
EDIT:
structure(list(ï..0.002 = c(0.005, 0.007, 0.017, 0.025, 0.027,
0.037, 0.047, 0.057, 0.067, 0.075, 0.081, 0.09, 0.1, 0.107, 0.116,
0.124, 0.13, 0.138, 0.145, 0.151), X.0.002 = c(-0.004, -0.005,
-0.008, -0.01, -0.01, -0.013, -0.015, -0.017, -0.019, -0.02,
-0.021, -0.022, -0.024, -0.025, -0.026, -0.027, -0.027, -0.028,
-0.029, -0.03)), row.names = c(NA, 20L), class = "data.frame")
Data OCR'd with https://ocr.space/ :
Obs wealth households
1 -0.002 0.002
2 -0.004 0.005
3 -0.005 0.007
4 -0.008 0.017
5 -0.01 0.025
6 -0.01 0.027
7 -0.013 0.037
8 -0.015 0.047
9 -0.017 0.057
10 -0.019 0.067
11 -0.02 0.075
12 -0.021 0.081
13 -0.022 0.09
14 -0.024 0.1
I suggest using interpolation, via the approx function, to get your data into an evenly spaced form.
interpolation <- approx(x = df$cum_hh, y = df$cum_wealth, xout = seq(0, 1, by = 0.00005))
interpolation$x ## evenly spaced cumulative households
interpolation$y ## interpolated cumulative wealth
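As a rough illustration on the 14 OCR'd rows above: the column names cum_hh and cum_wealth are assumptions carried over from the answer, and the grid is limited to the range the sample covers, because approx() returns NA outside the observed x range by default.

```r
# The 14 OCR'd observations, with assumed column names
df <- data.frame(
  cum_hh = c(0.002, 0.005, 0.007, 0.017, 0.025, 0.027, 0.037,
             0.047, 0.057, 0.067, 0.075, 0.081, 0.09, 0.1),
  cum_wealth = c(-0.002, -0.004, -0.005, -0.008, -0.01, -0.01, -0.013,
                 -0.015, -0.017, -0.019, -0.02, -0.021, -0.022, -0.024)
)

# Evenly spaced household shares: 1%, 2%, ..., 10%
grid <- seq(1, 10) / 100

# Linear interpolation of cumulative wealth at those shares
interpolation <- approx(x = df$cum_hh, y = df$cum_wealth, xout = grid)

# Wealth share held by each successive 1% slice of households
share <- diff(interpolation$y)
```

With the full data the grid would run over the whole 0 to 1 range, as in the answer's seq(0, 1, by = 0.00005).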

What are the aes() values when making a boxplot using the ggplot package?

I'm trying to make a boxplot with the ggplot2 package in RStudio. I've been reading around on past ggplot2 questions but this is just so basic I can't find it covered in detail... I'm bad at using R.
This is my very basic code that I'm trying to use but I don't know my x and y values?
ggplot(data, aes(x,y)) + geom_boxplot()
So, my y values are Pearson coefficients, which range from 0 to 1, but I'm struggling to put that in as a range. Then I'm just confused because my x values are just 4 different conditions. Should I use a vector? e.g. c(drug 6hr, control, drug 24hr, control)
I successfully made a basic boxplot using boxplot() but I am using ggplot2 because I want to show every individual value on the plot using jitter, which I have also failed to use.
Sorry I have only been using R for about 6 months! Trying to learn as much as I can.
My data:
drug 6hr, control, drug 24hr, control
0.876 0.707 0.709 0.521
0.084 0.275 0.468 0.795
0.911 0.985 0.565 0.150
0.503 0.584 0.693 0.766
0.363 0.102 0.775 0.640
0.219 0.888 0.724 0.516
0.041 0.277 0.877 0.216
0.206 0.974 0.771 0.434
0.787 0.725 0.671 0.916
0.896 0.873 0.443 0.693
0.396 0.641 0.525 0.471
0.250 0.184 0.467 0.537
0.094 0.453 0.641 0.910
0.750 0.748 0.634 0.007
0.026 0.263 0.069 0.725
0.109 0.227 0.535
0.780 0.811 0.241
0.710 0.568 0.029
0.676 0.114 0.237
0.610 0.260 0.241
0.170 0.728 0.405
0.025 0.815 0.914
0.022 0.329 0.766
0.039 0.714
0.034 0.096
0.402 0.988
0.649
0.564
0.190
0.844
0.920
0.744
0.871
0.565
You need to reshape your dataframe into a longer format; that makes it much easier to get your boxplot with ggplot2.
Here, I'm using the pivot_longer function from the tidyr package to transform your data into two columns: the first holds the name of the condition and the second holds the values:
library(tidyr)
library(dplyr)
DF %>% pivot_longer(everything(), names_to = "var",values_to = "values")
# A tibble: 136 x 2
var values
<chr> <dbl>
1 drug_6hr 0.876
2 Control_6 0.707
3 drug_24hr 0.709
4 Control_24 0.521
5 drug_6hr 0.084
6 Control_6 0.275
7 drug_24hr 0.468
8 Control_24 0.795
9 drug_6hr 0.911
10 Control_6 0.985
# … with 126 more rows
Then you can extend the pipe (%>%) into ggplot: map var to x and values to y in aes() (fill and color are optional), and add the geom_boxplot and geom_jitter layers:
library(tidyr)
library(dplyr)
library(ggplot2)
DF %>% pivot_longer(everything(), names_to = "var",values_to = "values") %>%
ggplot(aes(x = var, y = values, fill = var, color = var))+
geom_boxplot(alpha = 0.2)+
geom_jitter()
Alternatively, to remove the warning messages based on the presence of NA values, you can filter out NA values by adding a filter function between the pivot_longer and ggplot:
DF %>% pivot_longer(everything(), names_to = "var",values_to = "values") %>%
filter(!is.na(values)) %>%
ggplot(aes(x = var, y = values, fill = var, color = var))+
geom_boxplot(alpha = 0.2)+
geom_jitter()
Does it answer your question?
Reproducible example
I edited your example to make it easier to read into R. I also modified the column names, as pointed out by @akrun:
structure(list(drug_6hr = c(0.876, 0.084, 0.911, 0.503, 0.363,
0.219, 0.041, 0.206, 0.787, 0.896, 0.396, 0.25, 0.094, 0.75,
0.026, 0.109, 0.78, 0.71, 0.676, 0.61, 0.17, 0.025, 0.022, 0.039,
0.034, 0.402, 0.649, 0.564, 0.19, 0.844, 0.92, 0.744, 0.871,
0.565), Control_6 = c(0.707, 0.275, 0.985, 0.584, 0.102, 0.888,
0.277, 0.974, 0.725, 0.873, 0.641, 0.184, 0.453, 0.748, 0.263,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA), drug_24hr = c(0.709, 0.468, 0.565, 0.693, 0.775,
0.724, 0.877, 0.771, 0.671, 0.443, 0.525, 0.467, 0.641, 0.634,
0.069, 0.227, 0.811, 0.568, 0.114, 0.26, 0.728, 0.815, 0.329,
0.714, 0.096, 0.988, NA, NA, NA, NA, NA, NA, NA, NA), Control_24 = c(0.521,
0.795, 0.15, 0.766, 0.64, 0.516, 0.216, 0.434, 0.916, 0.693,
0.471, 0.537, 0.91, 0.007, 0.725, 0.535, 0.241, 0.029, 0.237,
0.241, 0.405, 0.914, 0.766, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA)), row.names = c(NA, -34L), class = c("data.table", "data.frame"
))
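If tidyr is not available, base R's stack() gives a comparable long layout. This is an alternative to pivot_longer, not what the answer above uses; the sketch takes rows 15 to 17 of the data above so that the NA handling is visible:

```r
# Rows 15-17 of the data above (Control_6 has only 15 observations)
DF <- data.frame(
  drug_6hr   = c(0.026, 0.109, 0.78),
  Control_6  = c(0.263, NA, NA),
  drug_24hr  = c(0.069, 0.227, 0.811),
  Control_24 = c(0.725, 0.535, 0.241)
)

# stack() returns two columns: values (numeric) and ind (factor of column names)
long <- stack(DF)

# Drop the padding NAs so they do not trigger warnings when plotting
long <- long[!is.na(long$values), ]
long
```

The result can then be plotted with aes(x = ind, y = values) in place of var and values. Unlike pivot_longer, stack() goes column by column, so the row order differs.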

confidence intervals for a tibble in wide format [duplicate]

I have a large tibble, an example of which is shown below. It has seven predictors (V4 to V10) and nine outcomes (w1, w2, w3, mw, i1, i2, i3, mi, p2).
What I am trying to do is to create confidence intervals for the outcomes in columns 2 (w1) to 10 (p2)
vars w1 w2 w3 mw i1 i2 i3 mi p2
V4 0.084 0.017 0.061 0.054 22.800 4.570 16.700 14.700 0.367
V5 0.032 0.085 0.039 0.052 8.840 23.100 10.700 14.200 0.367
V6 0.026 0.066 0.022 0.038 7.030 18.000 6.070 10.400 0.367
V7 0.097 0.020 0.066 0.061 26.300 5.420 18.100 16.600 0.367
V8 0.048 0.071 0.043 0.054 13.100 19.300 11.800 14.700 0.367
V9 0.018 0.111 0.020 0.050 4.800 30.300 5.440 13.500 0.367
V10 0.053 0.020 0.103 0.058 14.300 5.330 28.000 15.900 0.367
V4 0.084 0.017 0.060 0.054 22.400 4.420 16.200 14.300 0.373
V5 0.032 0.072 0.036 0.047 8.630 19.300 9.760 12.500 0.373
V6 0.030 0.076 0.023 0.043 8.080 20.500 6.070 11.500 0.373
V7 0.080 0.021 0.087 0.063 21.500 5.720 23.300 16.800 0.373
V8 0.053 0.090 0.034 0.059 14.100 24.000 9.110 15.700 0.373
V9 0.016 0.101 0.025 0.048 4.410 27.100 6.790 12.800 0.373
V10 0.060 0.022 0.100 0.061 16.000 5.950 26.800 16.300 0.373
When I group_by variables (vars) in dplyr and run quantiles on three of the outcomes (as a test), it does not give me what I'm looking for. Instead of giving me the confidence intervals for the three outcomes, it just gives me one confidence interval as seen below:
+ group_by(vars) %>%
+ do(data.frame(t(quantile(c(.$w1, .$w2, .$w3), probs = c(0.025, 0.975)))))
# A tibble: 7 x 3
# Groups: variables [7]
variables X2.5 X97.5
1 V10 0.0202 0.103
2 V4 0.017 0.084
3 V5 0.032 0.0834
4 V6 0.0221 0.0748
5 V7 0.0201 0.0958
6 V8 0.0351 0.0876
7 V9 0.0162 0.110
In short, what I'm looking for is something like the table below, where I get the confidence intervals for each outcome.
w1 w2 w3
vars X2.5 X97.5 vars X2.5 X97.5 vars X2.5 X97.5
V10 0.020 0.103 V10 0.020 0.103 V10 0.020 0.103
V4 0.017 0.084 V4 0.017 0.084 V4 0.017 0.084
V5 0.032 0.083 V5 0.032 0.083 V5 0.032 0.083
V6 0.022 0.075 V6 0.022 0.075 V6 0.022 0.075
V7 0.020 0.096 V7 0.020 0.096 V7 0.020 0.096
V8 0.035 0.088 V8 0.035 0.088 V8 0.035 0.088
V9 0.016 0.110 V9 0.016 0.110 V9 0.016 0.110
Any pointers in the right direction would be greatly appreciated. I've read on StackOverflow, but can't seem to find an answer that addresses what I want to do.
Here are two ways.
Base R.
aggregate(df1[-1], list(df1[[1]]), quantile, probs = c(0.025, 0.975))
With the tidyverse.
library(dplyr)
df1 %>%
group_by(vars) %>%
mutate_at(vars(w1:p2), quantile, probs = c(0.025, 0.975))
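Note that with the second approach the output format is different: the first quantile (0.025) is in the first rows and the second (0.975) in the last rows. This works only because every group here has exactly two rows, matching the two quantiles returned.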
Data.
df1 <-
structure(list(vars = structure(c(2L, 3L, 4L,
5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L),
.Label = c("V10", "V4", "V5", "V6", "V7", "V8",
"V9"), class = "factor"), w1 = c(0.084, 0.032,
0.026, 0.097, 0.048, 0.018, 0.053, 0.084,
0.032, 0.03, 0.08, 0.053, 0.016, 0.06),
w2 = c(0.017, 0.085, 0.066, 0.02, 0.071, 0.111,
0.02, 0.017, 0.072, 0.076, 0.021, 0.09, 0.101,
0.022), w3 = c(0.061, 0.039, 0.022, 0.066,
0.043, 0.02, 0.103, 0.06, 0.036, 0.023, 0.087,
0.034, 0.025, 0.1), mw = c(0.054, 0.052, 0.038,
0.061, 0.054, 0.05, 0.058, 0.054, 0.047, 0.043,
0.063, 0.059, 0.048, 0.061), i1 = c(22.8, 8.84,
7.03, 26.3, 13.1, 4.8, 14.3, 22.4, 8.63, 8.08,
21.5, 14.1, 4.41, 16), i2 = c(4.57, 23.1, 18, 5.42,
19.3, 30.3, 5.33, 4.42, 19.3, 20.5, 5.72, 24, 27.1,
5.95), i3 = c(16.7, 10.7, 6.07, 18.1, 11.8, 5.44,
28, 16.2, 9.76, 6.07, 23.3, 9.11, 6.79, 26.8),
mi = c(14.7, 14.2, 10.4, 16.6, 14.7, 13.5, 15.9,
14.3, 12.5, 11.5, 16.8, 15.7, 12.8, 16.3),
p2 = c(0.367, 0.367, 0.367, 0.367, 0.367, 0.367,
0.367, 0.373, 0.373, 0.373, 0.373, 0.373, 0.373,
0.373)), class = "data.frame",
row.names = c(NA, -14L))
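One thing to be aware of with the base R version: when FUN returns more than one value, aggregate() stores each outcome as a matrix column. A sketch of flattening that into ordinary columns, using only the vars, w1 and w2 columns of df1 (agg and flat are illustrative names):

```r
# Trimmed copy of df1: grouping column plus two outcomes
df1 <- data.frame(
  vars = rep(c("V4", "V5", "V6", "V7", "V8", "V9", "V10"), 2),
  w1 = c(0.084, 0.032, 0.026, 0.097, 0.048, 0.018, 0.053,
         0.084, 0.032, 0.03, 0.08, 0.053, 0.016, 0.06),
  w2 = c(0.017, 0.085, 0.066, 0.02, 0.071, 0.111, 0.02,
         0.017, 0.072, 0.076, 0.021, 0.09, 0.101, 0.022)
)

# Each of w1 and w2 comes back as a 7 x 2 matrix (2.5% and 97.5%)
agg <- aggregate(df1[-1], list(vars = df1$vars), quantile,
                 probs = c(0.025, 0.975))

# Expand the matrix columns into one plain column per outcome and quantile
flat <- do.call(data.frame, agg)
flat
```

After flattening, each outcome has its lower and upper quantile side by side, which is close to the layout asked for in the question.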
Another possibility: melt/pivot to long format; compute summaries; then cast/pivot to wide format
library(tidyverse)
df2 <- (df1
%>% pivot_longer(-vars, names_to = "outcome", values_to = "value")
%>% group_by(vars,outcome)
%>% summarise(lwr=quantile(value,0.025),upr=quantile(value,0.975))
)
df2 %>% pivot_wider(names_from=outcome,values_from=c(lwr,upr))
Unfortunately the columns aren't in the order you want; I can't think of a quick fix (you can select() with variables in the order you want).

Nested reshape from wide to long

I keep on getting all sorts of error messages when trying to reshape an object into long direction. Toy data:
d <- structure(c(0.204, 0.036, 0.015, 0.013, 0.208, 0.037, 0.015,
0.006, 0.186, 0.044, 0.016, 0.023, 0.251, 0.044, 0.02, 0.01,
0.268, 0.04, 0.007, 0.007, 0.208, 0.062, 0.027, 0.036, 0.272,
0.054, 0.006, 0.01, 0.274, 0.05, 0.011, 0.006, 0.28, 0.039, 0.007,
0.019, 1.93, 0.345, 0.087, 0.094, 2.007, 0.341, 0.064, 0.061,
1.733, 0.39, 0.131, 0.201, 0.094, 0.01, 0.004, 0, 0.096, 0.014,
0, 0.001, 0.081, 0.016, 0.002, 0.016, 0.062, 0.007, 0.011, 0.001,
0.07, 0.003, 0.005, 0.002, 0.043, 0.033, 0, 0.007, 0.081, 0.039,
0.007, 0, 0.085, 0.033, 0.008, 0, 0.086, 0.023, 0.007, 0.007,
0.083, 0.015, 0, 0, 0.09, 0.009, 0, 0, 0.049, 0.052, 0, 0.025,
2.779, 0.203, 0.098, 0.016, 2.801, 0.242, 0.135, 0.01, 2.12,
0.466, 0.177, 0.121, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 0, 1,
2, 3, 0, 1, 2, 3, 0, 1, 2, 3), .Dim = c(12L, 11L), .Dimnames = list(
c("0", "1", "2", "3", "0", "1", "2", "3", "0", "1", "2",
"3"), c("age_77", "age_78", "age_79", "age_80", "age_81",
"age_82", "age_83", "age_84", "age_85", "item", "k")))
Basically I have different ages, for which 3 items have been reported with four response categories each. I would like to obtain a long-shaped object with colnames = age, item, k, proportion, like this:
structure(c(77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 78,
78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 1, 1, 1, 1, 2, 2,
2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 0, 1, 2,
3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3,
0.204, 0.036, 0.015, 0.013, 0.208, 0.037, 0.015, 0.006, 0.186,
0.044, 0.016, 0.023, 0.251, 0.044, 0.02, 0.01, 0.268, 0.04, 0.007,
0.007, 0.208, 0.062, 0.027, 0.036), .Dim = c(24L, 4L), .Dimnames = list(
c("0", "1", "2", "3", "0", "1", "2", "3", "0", "1", "2",
"3", "0", "1", "2", "3", "0", "1", "2", "3", "0", "1", "2",
"3"), c("age", "item", "k", "proportion")))
An example I tried:
reshape(as.data.frame(d), varying =1:9, sep = "_", direction = "long",
times = "k", idvar = "item")
Error in `row.names<-.data.frame`(`*tmp*`, value = paste(ids, times[i], :
duplicate 'row.names' are not allowed
Any clue where's my mistake? Thanks a lot beforehand!
The object d as provided by the OP is not a data.frame but a matrix which is causing the error:
str(d)
num [1:12, 1:11] 0.204 0.036 0.015 0.013 0.208 0.037 0.015 0.006 0.186 0.044 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:12] "0" "1" "2" "3" ...
..$ : chr [1:11] "age_77" "age_78" "age_79" "age_80" ...
In addition, the row names are not unique which causes an error as well when coercing d to data.frame.
With data.table, d can be coerced to a data.table object and reshaped from wide to long format using melt(). Finally, age is extracted from the column names and stored as integer values as requested by the OP.
library(data.table)
melt(as.data.table(d), measure.vars = patterns("^age_"),
variable.name = "age", value.name = "proportion")[
, age := as.integer(stringr::str_replace(age, "age_", ""))][]
item k age proportion
1: 1 0 77 0.204
2: 1 1 77 0.036
3: 1 2 77 0.015
4: 1 3 77 0.013
5: 2 0 77 0.208
---
104: 2 3 85 0.010
105: 3 0 85 2.120
106: 3 1 85 0.466
107: 3 2 85 0.177
108: 3 3 85 0.121
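For comparison, the same wide-to-long step can be sketched in base R without data.table, by repeating the id columns once per age column. This is an alternative to melt(), shown on a cut-down d (two ages, two items, taken from the values above); age_cols and long are illustrative names:

```r
# Cut-down version of d: two age columns, items 1-2, response categories 0-3
d <- cbind(
  age_77 = c(0.204, 0.036, 0.015, 0.013, 0.208, 0.037, 0.015, 0.006),
  age_78 = c(0.251, 0.044, 0.02, 0.01, 0.268, 0.04, 0.007, 0.007),
  item   = rep(1:2, each = 4),
  k      = rep(0:3, times = 2)
)

# Names of the measure columns
age_cols <- grep("^age_", colnames(d), value = TRUE)

# Matrices are stored column by column, so as.vector() yields all age_77
# values followed by all age_78 values; the id columns are repeated to match
long <- data.frame(
  age        = rep(as.integer(sub("^age_", "", age_cols)), each = nrow(d)),
  item       = rep(d[, "item"], times = length(age_cols)),
  k          = rep(d[, "k"], times = length(age_cols)),
  proportion = as.vector(d[, age_cols])
)
long
```

Because the data frame is built from plain vectors, the duplicated row names of the original matrix never come into play.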
