I was trying to use ggplot2 to plot the built-in anscombe data set in R (which contains four small data sets with identical correlations but radically different relationships between X and Y). My attempts to reshape the data properly were all pretty ugly. I used a combination of reshape2 and base R; a Hadleyverse 2 (tidyr/dplyr) or data.table solution would be fine with me, but the ideal solution would
be short, with no repeated code
be comprehensible (somewhat in conflict with criterion #1)
involve as little hard-coding of column numbers, etc., as possible
The original format:
anscombe
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
## 3 13 13 13 8 7.58 8.74 12.74 7.71
## ...
## 11 5 5 5 8 5.68 4.74 5.73 6.89
Desired format:
## s x y
## 1 1 10 8.04
## 2 1 8 6.95
## ...
## 44 4 8 6.89
Here's my attempt:
library("reshape2")
ff <- function(x, v)
  setNames(transform(
    melt(as.matrix(x)),
    v1 = substr(Var2, 1, 1),
    v2 = substr(Var2, 2, 2))[, c(3, 5)],
    c(v, "s"))
f1 <- ff(anscombe[,1:4],"x")
f2 <- ff(anscombe[,5:8],"y")
f12 <- cbind(f1,f2)[,c("s","x","y")]
Now plot:
library("ggplot2"); theme_set(theme_classic())
th_clean <-
theme(panel.margin=grid::unit(0,"lines"),
axis.ticks.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.y=element_blank(),
axis.text.y=element_blank()
)
ggplot(f12,aes(x,y))+geom_point()+
facet_wrap(~s)+labs(x="",y="")+
th_clean
If you are really dealing with the "anscombe" dataset, then I would say @Thela's reshape solution is very direct.
However, here are a few other options to consider:
Option 1: Base R
You can write your own "reshape" function, perhaps something like this:
myReshape <- function(indf = anscombe, stubs = c("x", "y")) {
  temp <- sapply(stubs, function(x) {
    unlist(indf[grep(x, names(indf))], use.names = FALSE)
  })
  s <- rep(seq_along(grep(stubs[1], names(indf))), each = nrow(indf))
  data.frame(s, temp)
}
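A quick usage check (the defaults already point at anscombe, so the first few rows should match the desired format from the question):
head(myReshape(anscombe))
#   s  x    y
# 1 1 10 8.04
# 2 1  8 6.95
# 3 1 13 7.58
# 4 1  9 8.81
# 5 1 11 8.33
# 6 1 14 9.96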
Notes:
I'm not sure that this is necessarily less clunky than what you're already doing
This approach will not work if the data are "unbalanced" (for example, more "x" columns than "y" columns).
Option 2: "dplyr" + "tidyr"
Since pipes are all the rage these days, you can also try:
library(dplyr)
library(tidyr)
anscombe %>%
gather(var, val, everything()) %>%
extract(var, into = c("variable", "s"), "(.)(.)") %>%
group_by(variable, s) %>%
mutate(ind = sequence(n())) %>%
spread(variable, val)
Notes:
I'm not sure that this is necessarily less clunky than what you're already doing, but some people like the pipe approach.
This approach should be able to handle unbalanced data.
Option 3: "splitstackshape"
Before @Arun went and did all that fantastic work on melt.data.table, I had written merged.stack in my "splitstackshape" package. With that, the approach would be:
library(splitstackshape)
setnames(
  merged.stack(
    data.table(anscombe, keep.rownames = TRUE),
    var.stubs = c("x", "y"), sep = "var.stubs"),
  ".time_1", "s")[]
A few notes:
merged.stack needs something to treat as an "id", hence the need for data.table(anscombe, keep.rownames = TRUE), which adds a column named "rn" with the row numbers
The sep = "var.stubs" basically means that we don't really have a separator variable, so we'll just strip out the stub and use whatever remains for the "time" variable
merged.stack will work if the data are unbalanced. For instance, try using it with anscombe2 <- anscombe[1:7] as your dataset instead of "anscombe".
The same package also has a function called Reshape that builds upon reshape to let it reshape unbalanced data. But it's slower and less flexible than merged.stack. The basic approach would be Reshape(data.table(anscombe, keep.rownames = TRUE), var.stubs = c("x", "y"), sep = "") and then rename the "time" variable using setnames.
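For completeness, a minimal sketch of that Reshape() route (the Reshape() call is taken from the description above; it is assumed here that the generated "time" column is literally named "time", so adjust the setnames() call if it differs):
library(splitstackshape)
library(data.table)
Reshaped <- Reshape(data.table(anscombe, keep.rownames = TRUE),
                    var.stubs = c("x", "y"), sep = "")
setnames(Reshaped, "time", "s")[]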
Option 4: melt.data.table
This was mentioned in the comments above, but hasn't been shared as an answer. Outside of base R's reshape, this is a very direct approach that handles column renaming from within the function itself:
library(data.table)
melt(as.data.table(anscombe),
measure.vars = patterns(c("x", "y")),
value.name=c('x', 'y'),
variable.name = "s")
Notes:
Will be insanely fast.
Much better supported than "splitstackshape" or reshape ;-)
Handles unbalanced data just fine.
I think this meets the criteria of being 1) short, 2) comprehensible, and 3) free of hardcoded column numbers. And it doesn't require any other packages.
reshape(anscombe, varying=TRUE, sep="", direction="long", timevar="s")
# s x y id
#1.1 1 10 8.04 1
#...
#11.1 1 5 5.68 11
#1.2 2 10 9.14 1
#...
#11.2 2 5 4.74 11
#1.3 3 10 7.46 1
#...
#11.3 3 5 5.73 11
#1.4 4 8 6.58 1
#...
#11.4 4 8 6.89 11
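As a usage note, the long result drops straight into the plotting code from the question; here is a sketch (the name ansc_long is just illustrative):
library(ggplot2)
ansc_long <- reshape(anscombe, varying=TRUE, sep="", direction="long", timevar="s")
ggplot(ansc_long, aes(x, y)) + geom_point() + facet_wrap(~s) + labs(x="", y="")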
I don't know if a non-reshape solution would be acceptable, but here you go:
library(data.table)
#create the column index pattern for the Xs
#this will make it easy to pick out the matching Ys
pattern <- 1:4
#use Map to create a list of data.frames with the needed columns
#and also use rbindlist to rbind the list produced by Map
lists <- rbindlist(Map(data.frame,
                       pattern,
                       anscombe[pattern],
                       anscombe[pattern + length(pattern)]))
#set the correct names
setnames(lists, names(lists), c('s','x','y'))
Output:
> lists
s x y
1: 1 10 8.04
2: 1 8 6.95
3: 1 13 7.58
4: 1 9 8.81
5: 1 11 8.33
6: 1 14 9.96
7: 1 6 7.24
8: 1 4 4.26
9: 1 12 10.84
10: 1 7 4.82
....
A newer tidyverse option is suggested in the tidyverse vignette:
anscombe %>%
pivot_longer(everything(),
names_to = c(".value", "set"),
names_pattern = "(.)(.)"
) %>%
arrange(set)
#> # A tibble: 44 x 3
#> set x y
#> <chr> <dbl> <dbl>
#> 1 1 10 8.04
#> 2 1 8 6.95
#> 3 1 13 7.58
#> 4 1 9 8.81
#> 5 1 11 8.33
#> 6 1 14 9.96
#> 7 1 6 7.24
#> 8 1 4 4.26
#> 9 1 12 10.8
#> 10 1 7 4.82
#> # … with 34 more rows
Related
I know the question of unnesting a list column in a data frame has been raised and answered multiple times. However, here's potentially the 237th problem of that kind.
I have the following data:
set.seed(666)
dat <- data.frame(sysRespNum = c(1,2,3,4,5,6),
product1 = sqrt(rnorm(6, 20, 5)^2),
product2 = sqrt(rnorm(6, 20, 5)^2),
product3 = sqrt(rnorm(6, 20, 5)^2))
data:
sysRespNum product1 product2 product3
1 1 23.766555 13.46907 24.32327
2 2 30.071773 15.98740 11.39922
3 3 18.224328 11.03880 20.67063
4 4 30.140839 19.78984 19.62087
5 5 8.915628 30.75021 24.29150
6 6 23.791981 11.14885 21.72450
Now, I want to calculate the proportion of each product among the sum of all products, so I want to calculate product1/sum(my three products), then the same for products 2 and 3. So I'm expecting three new columns.
I've tried the following:
library(tidyverse)
dat %>%
mutate(sum_Product = apply(across(-sysRespNum), 1, function(x) list(sum_Product = x/sum(x))))
(Side question: is there maybe a more straightforward way of mutating this directly without the need to create a list? I know I could create a sum column first and then do a simple mutate along with across, but I'm wondering if the calculations can be achieved without creating a temporary sum column first.)
Now my problem is that it's difficult to unnest the sum_Product list column. unnest_wider doesn't work, the sum_Product column is still a list.
So the only thing that worked for me is
following this solution: https://stackoverflow.com/a/60824506/2725773
changing my code above and replace the list part by data.frame:
full code:
dat %>%
mutate(sum_Product = apply(across(-sysRespNum), 1, function(x) data.frame(sum_Product = x/sum(x)))) %>%
unnest(cols = everything()) %>%
mutate(product = rep(1:3, nrow(.)/3)) %>%
pivot_wider(values_from = sum_Product,
names_from = product,
names_prefix = "share_product")
which gives the correct result:
# A tibble: 6 x 7
sysRespNum product1 product2 product3 share_product1 share_product2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 23.8 13.5 24.3 0.386 0.219
2 2 30.1 16.0 11.4 0.523 0.278
3 3 18.2 11.0 20.7 0.365 0.221
4 4 30.1 19.8 19.6 0.433 0.285
5 5 8.92 30.8 24.3 0.139 0.481
6 6 23.8 11.1 21.7 0.420 0.197
# … with 1 more variable: share_product3 <dbl>
However, it feels unnecessarily complicated to unnest everything and then reshape with pivot_wider.
So a) is there a more elegant way to calculate my share variables and b) is there a more elegant/shorter/less verbose way of reshaping a list column into multiple vector columns?
It is easier to do this with rowSums, i.e. divide 'product1' by the rowSums of the columns whose names start with 'product'. Unlike rowwise with c_across, this is vectorized and should be fast as well
library(dplyr)
dat %>%
mutate(sum_product = product1/rowSums(select(., starts_with('product'))))
NOTE: The question's code mixes base R (apply) with the tidyverse option across, which doesn't seem to be the optimal way
If we need to do this for all the 'product' columns, create a sum column first with mutate and then use across on the columns that start with 'product' to divide each of them by 'Sum_col'
dat %>%
mutate(Sum_col = rowSums(select(., starts_with('product'))),
across(starts_with('product'),
~ ./Sum_col, .names = '{.col}_sum_product')) %>%
select(-Sum_col)
-output
# sysRespNum product1 product2 product3 product1_sum_product product2_sum_product product3_sum_product
#1 1 23.766555 13.46907 24.32327 0.3860783 0.2187998 0.3951219
#2 2 30.071773 15.98740 11.39922 0.5233660 0.2782431 0.1983909
#3 3 18.224328 11.03880 20.67063 0.3649701 0.2210688 0.4139610
#4 4 30.140839 19.78984 19.62087 0.4333597 0.2845348 0.2821054
#5 5 8.915628 30.75021 24.29150 0.1393996 0.4807925 0.3798079
#6 6 23.791981 11.14885 21.72450 0.4198684 0.1967490 0.3833826
Or using base R
nm1 <- startsWith(names(dat), 'product')
dat[paste0('sum_product', seq_along(nm1))] <- dat[nm1]/rowSums(dat[nm1])
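For comparison, here is a rough sketch of the rowwise()/c_across() route mentioned at the top of this answer; it gives the same shares but is noticeably slower than the vectorized rowSums() versions (the temporary 'row_total' column is just an illustrative helper):
library(dplyr)
dat %>%
  rowwise() %>%
  mutate(row_total = sum(c_across(product1:product3)),
         across(product1:product3, ~ .x / row_total,
                .names = '{.col}_sum_product')) %>%
  ungroup() %>%
  select(-row_total)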
I guess the following base R code should work for you
cbind(
dat,
setNames(dat[-1] / rowSums(dat[-1]), paste0("share_product", seq_along(dat[-1])))
)
which gives
sysRespNum product1 product2 product3 share_product1 share_product2
1 1 23.766555 13.46907 24.32327 0.3860783 0.2187998
2 2 30.071773 15.98740 11.39922 0.5233660 0.2782431
3 3 18.224328 11.03880 20.67063 0.3649701 0.2210688
4 4 30.140839 19.78984 19.62087 0.4333597 0.2845348
5 5 8.915628 30.75021 24.29150 0.1393996 0.4807925
6 6 23.791981 11.14885 21.72450 0.4198684 0.1967490
share_product3
1 0.3951219
2 0.1983909
3 0.4139610
4 0.2821054
5 0.3798079
6 0.3833826
Good old plain basic R
rdat <- dat[-1]
rdat <- rdat/rowSums(rdat)
colnames(rdat) <- paste0("share_", colnames(rdat))
cbind(dat, rdat)
Which gives:
sysRespNum product1 product2 product3 share_product1 share_product2
1 1 23.766555 13.46907 24.32327 0.3860783 0.2187998
2 2 30.071773 15.98740 11.39922 0.5233660 0.2782431
3 3 18.224328 11.03880 20.67063 0.3649701 0.2210688
4 4 30.140839 19.78984 19.62087 0.4333597 0.2845348
5 5 8.915628 30.75021 24.29150 0.1393996 0.4807925
6 6 23.791981 11.14885 21.72450 0.4198684 0.1967490
share_product3
1 0.3951219
2 0.1983909
3 0.4139610
4 0.2821054
5 0.3798079
6 0.3833826
In R, I perform Dunn's test. The function I use has no option to group the input variables by their statistically significant differences. However, this is what I am genuinely interested in, so I tried to write my own function. Unfortunately, I am not able to wrap my head around it. Perhaps someone can help.
I use the airquality dataset that comes with R as an example. The result that I need could look somewhat like this:
> library (tidyverse)
> ozone_summary <- airquality %>% group_by(Month) %>% dplyr::summarize(Mean = mean(Ozone, na.rm=TRUE))
# A tibble: 5 x 2
Month Mean
<int> <dbl>
1 5 23.6
2 6 29.4
3 7 59.1
4 8 60.0
5 9 31.4
When I run the dunn.test, I get the following:
> dunn.test::dunn.test (airquality$Ozone, airquality$Month, method = "bh", altp = T)
Kruskal-Wallis rank sum test
data: x and group
Kruskal-Wallis chi-squared = 29.2666, df = 4, p-value = 0
Comparison of x by group
(Benjamini-Hochberg)
Col Mean-|
Row Mean | 5 6 7 8
---------+--------------------------------------------
6 | -0.925158
| 0.4436
|
7 | -4.419470 -2.244208
| 0.0001* 0.0496*
|
8 | -4.132813 -2.038635 0.286657
| 0.0002* 0.0691 0.8604
|
9 | -1.321202 0.002538 3.217199 2.922827
| 0.2663 0.9980 0.0043* 0.0087*
alpha = 0.05
Reject Ho if p <= alpha
From this result, I deduce that May differs from July and August, June differs from July (but not from August) and so on. So I'd like to append significantly differing groups to my results table:
# A tibble: 5 x 3
Month Mean Group
<int> <dbl> <chr>
1 5 23.6 a
2 6 29.4 ac
3 7 59.1 b
4 8 60.0 bc
5 9 31.4 a
While I did this by hand, I suppose it must be possible to automate this process. However, I can't find a good starting point. I created a data frame containing all comparisons:
> ozone_differences <- dunn.test::dunn.test (airquality$Ozone, airquality$Month, method = "bh", altp = T)
> ozone_differences <- data.frame ("P" = ozone_differences$altP.adjusted, "Compare" = ozone_differences$comparisons)
P Compare
1 4.436043e-01 5 - 6
2 9.894296e-05 5 - 7
3 4.963804e-02 6 - 7
4 1.791748e-04 5 - 8
5 6.914403e-02 6 - 8
6 8.604164e-01 7 - 8
7 2.663342e-01 5 - 9
8 9.979745e-01 6 - 9
9 4.314957e-03 7 - 9
10 8.671708e-03 8 - 9
I thought that a function iterating through this data frame and using a selection variable to choose the right letter from letters() might work. However, I cannot even think of a starting point, because changing numbers of rows have to be considered at the same time...
Perhaps someone has a good idea?
Perhaps you could look into the cldList() function from the rcompanion library; you can pipe the $res results from the output of dunnTest() into it and create a table that specifies the compact letter display comparison per group.
Following the advice of @TylerRuddenfort, the following code will work. The first cld is created with rcompanion::cldList, and the second directly uses multcompView::multcompLetters. Note that to use multcompLetters, the spaces have to be removed from the names of the comparisons.
Here, I have used FSA::dunnTest for the Dunn test (1964).
In general, I recommend ordering groups by e.g. median or mean before running e.g. dunnTest if you plan on using a cld, so that the cld comes out in a sensible order (a sketch of this is shown after the code below).
library (tidyverse)
ozone_summary <- airquality %>% group_by(Month) %>% dplyr::summarize(Mean = mean(Ozone, na.rm=TRUE))
library(FSA)
Result = dunnTest(airquality$Ozone, airquality$Month, method = "bh")$res
### Use cldList()
library(rcompanion)
cldList(P.adj ~ Comparison, data=Result)
### Use multcompView
library(multcompView)
X = Result$P.adj <= 0.05
names(X) = gsub(" ", "", Result$Comparison)
multcompLetters(X)
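As a sketch of the ordering advice above (the Month_ord helper column is just illustrative, not part of the original data):
### Order the months by mean ozone before running the test,
### so that the compact letter display comes out in a sensible order
ord <- ozone_summary$Month[order(ozone_summary$Mean)]
airquality$Month_ord <- factor(airquality$Month, levels = ord)
Result_ord = dunnTest(airquality$Ozone, airquality$Month_ord, method = "bh")$res
cldList(P.adj ~ Comparison, data = Result_ord)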
I am struggling with a data frame I have been given:
game.time.total game.time.first.half game.time.second.half
1 95:09 46:04 49:05
2 95:09 46:04 49:05
3 95:31 46:07 49:23
4 95:31 46:07 49:23
5 95:39 46:08 49:31
Currently, these columns are factor variables (see the str() output):
'data.frame': 5 obs. of 3 variables:
$ game.time.total : Factor w/ 29 levels "100:22","100:53",..: 7 7 10 10 12
$ game.time.first.half : Factor w/ 27 levels "45:18","46:00",..: 3 3 5 5 6
$ game.time.second.half: Factor w/ 29 levels "48:01","48:03",..: 12 12 15 15 17
However, I wish to be able to average each column using colMeans(). From my understanding, I need to convert each column to numeric, expressed as minutes.seconds, as shown here:
game.time.total game.time.first.half game.time.second.half
1 95.09 46.04 49.05
2 95.09 46.04 49.05
3 95.31 46.07 49.23
4 95.31 46.07 49.23
5 95.39 46.08 49.31
I understand that I could just type them out, however there are many more columns and rows with similar formatting... Is there a simple way to do this? Or do I need to re-adjust the format of the original file (.csv)?
EDIT: Thank you for the answers. My mistake, as in my original question I did not provide my actual DF. I have now added it, along with the str() result.
@hello_friend this is what is returned when I apply your second solution:
game.time.total game.time.first.half game.time.second.half
1 7 3 12
2 7 3 12
3 10 5 15
4 10 5 15
5 12 6 17
Thanks in advance.
Base R solution:
numeric_df <- setNames(data.frame(lapply(
  data.frame(Vectorize(gsub)(":", ".", DF), stringsAsFactors = FALSE),
  function(x) {
    as.double(x)
  })), names(DF))
Data:
DF <- structure(list(game.time.total = c("95:09", "95:09", "95:31",
"95:31", "95:39"), game.time.first.half = c("46:04", "46:04",
"46:07", "46:07", "46:08"), game.time.second.half = c("49:05",
"49:05", "49:23", "49:23", "49:31")), class = "data.frame", row.names = c(NA, -5L))
You can convert the columns to minutes and seconds with the ms function from the lubridate package.
library(lubridate)
library(dplyr)
DF %>%
mutate_all(ms) %>%
mutate_all(period_to_seconds) %>%
summarise_all(mean) %>%
mutate_all(seconds_to_period)
game.time.total game.time.first.half game.time.second.half
1 1H 35M 23.8000000000002S 46M 6S 49M 17.4000000000001S
Or without the last mutate_all call if you want the mean in seconds.
DF %>%
mutate_all(ms) %>%
mutate_all(period_to_seconds) %>%
summarise_all(mean)
game.time.total game.time.first.half game.time.second.half
1 5723.8 2766 2957.4
Note: Assuming 95.09 means 95 minutes and 9 seconds and not 95 minutes and .09 minutes
You have to be careful here. Think about the average of "89:30" and "90:30". They add to 180 minutes, so the mean should be 90:00. However, if you convert them to 89.30 and 90.30, then they add to 179.60 and the mean becomes 89.80, which doesn't even make sense.
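A throwaway check of that caveat (the conversions here are ad hoc, just for illustration):
x <- c("89:30", "90:30")
mean(as.numeric(sub(":", ".", x)))    # 89.8 -- treating mm:ss as a decimal is misleading
(89*60 + 30 + 90*60 + 30) / 2 / 60    # 90   -- the true mean is 90 minutes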
There are packages available that make dealing with time easier, such as lubridate, and if you deal with times frequently you should learn to use them. However, the solution below doesn't require any additional packages and is quite straightforward. It defines two small functions to convert between "mm:ss" format and number of seconds. You can safely do sums and averages on seconds then convert back to your original format:
as_seconds <- function(x) sapply(strsplit(x, ":"), function(y) sum(as.numeric(y) * c(60, 1)))
as_min_sec <- function(x) paste0(x %/% 60, sprintf(":%02d", round(x %% 60)))
apply(DF, 2, function(x) as_min_sec(mean(as_seconds(x))))
#> game.time.total game.time.first.half game.time.second.half
#> "95:24" "46:06" "49:17"
If you just want to replace the colons with dots in each column, you can do this:
as.data.frame(lapply(DF, function(x) gsub(":", ".", x)))
#> game.time.total game.time.first.half game.time.second.half
#> 1 95.09 46.04 49.05
#> 2 95.09 46.04 49.05
#> 3 95.31 46.07 49.23
#> 4 95.31 46.07 49.23
#> 5 95.39 46.08 49.31
Is there a package to easily calculate, for each specific value of n, the mean/std/ci?
For example, starting with the data:
> n = c(0,0,0,0,0,0,0,2,2,2,2,5,5,5,5,8,8,8,8)
> s = c(43,23,65,43,12,54,43,12,2,43,62,25,55,75,95,28,48,68,18)
> df = data.frame(n, s)
> df
n s
1 0 43
2 0 23
3 0 65
4 0 43
5 0 12
6 0 54
7 0 43
8 2 12
9 2 2
10 2 43
11 2 62
12 5 25
13 5 55
14 5 75
15 5 95
16 8 28
17 8 48
18 8 68
19 8 18
resulting in:
data
n mean std ci
0 40 .. ..
2 30 .. ..
5 63 .. ..
8 41 .. ..
dplyr is nice, but not necessary. In base R:
## df() is a built-in R function, so avoid using df as an object name ...
dd <- data.frame(n=rep(c(0,2,5,8),c(7,4,4,4)),
s = c(43,23,65,43,12,54,43,12,2,43,
62,25,55,75,95,28,48,68,18))
sumfun <- function(x) {
m <- mean(x)
s <- sd(x)
se <- s/sqrt(length(x))
c(mean=m,sd=s,lwr=m-1.96*se,upr=m+1.96*se)
}
(or see smean.cl.normal(), smean.cl.boot(), etc. from the Hmisc package ...)
res <- do.call(rbind,tapply(dd$s,dd$n,sumfun))
res <- cbind(n=unique(dd$n),as.data.frame(res))
Or as pointed out by @thelatemail:
res <- do.call(data.frame, aggregate(s ~ n, data=dd, FUN=sumfun))
You can easily package this into a function if you're going to use it on a regular basis.
For larger data sets/more complex transformations you can search SO for answers comparing solutions from the dplyr, plyr, data.table, doBy packages as well as base-R solutions using combinations of tapply(), ave(), aggregate(), by() ...
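A minimal sketch of the Hmisc route mentioned above (this assumes the Hmisc package is installed; smean.cl.normal() uses a t-based multiplier by default, so its limits will differ slightly from the ±1.96*SE version):
library(Hmisc)
res2 <- do.call(rbind, tapply(dd$s, dd$n, smean.cl.normal))
res2 <- cbind(n = unique(dd$n), as.data.frame(res2))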
You can use the dplyr package.
Here's a code snippet. Note that I'm assuming you want to build the confidence interval using the standard normal approximation at the 95% level, but you can make whatever choice you like.
n = c(0,0,0,0,0,0,0,2,2,2,2,5,5,5,5,8,8,8,8)
s = c(43,23,65,43,12,54,43,12,2,43,62,25,55,75,95,28,48,68,18)
df = data.frame(n, s)
library(dplyr)
df %>%
group_by(n) %>%
summarise(mean = mean(s),
std = sqrt(var(s)),
lower = mean(s) - qnorm(.975)*std/sqrt(n()),
upper = mean(s) + qnorm(.975)*std/sqrt(n()))
Source: local data frame [4 x 5]
n mean std lower upper
1 0 40.42857 17.88721 27.177782 53.67936
2 2 29.75000 27.69326 2.611104 56.88890
3 5 62.50000 29.86079 33.236965 91.76303
4 8 40.50000 22.17356 18.770313 62.22969
Thanks for the advice everyone, I have taken a look into plyr and solved it:
n = c(0,0,0,0,0,0,0,2,2,2,2,5,5,5,5,8,8,8,8)
s = c(43,23,65,43,12,54,43,12,2,43,62,25,55,75,95,28,48,68,18)
dd = data.frame(n, s)
library(plyr)
data <- ddply(dd,.(n),function(dd) c(mean=mean(dd$s),
std = sd(dd$s),
se = sd(dd$s)/sqrt(length(dd$s)),
lower = mean(dd$s)-qnorm(.975)*sd(dd$s)/sqrt(length(dd$s)),
upper = mean(dd$s)+qnorm(.975)*sd(dd$s)/sqrt(length(dd$s))
))
resulting in:
data
n mean std se lower upper
1 0 40.42857 17.88721 6.760731 27.177782 53.67936
2 2 29.75000 27.69326 13.846630 2.611104 56.88890
3 5 62.50000 29.86079 14.930394 33.236965 91.76303
4 8 40.50000 22.17356 11.086779 18.770313 62.22969
Will avoid df() in the future, thanks
Update tidyr 1.0.0
Even though the solution by @user1357015 is totally fine, if you are a tidyverse fan like me, there is an elegant alternative:
tidyr 1.0.0 introduced a function that didn't get much attention but is very useful: unnest_wider.
With that, you can simplify the code to the following:
library(tidyverse)
library(DescTools)  # for MeanCI()
df %>%
group_by(n) %>%
nest(data = -"n") %>%
mutate(ci = map(data, ~ MeanCI(.x$s))) %>%
unnest_wider(ci)
which gives
# A tibble: 4 x 5
# Groups: n [4]
n data mean lwr.ci upr.ci
<dbl> <list> <dbl> <dbl> <dbl>
1 0 <tibble [7 × 1]> 40.4 23.9 57.0
2 2 <tibble [4 × 1]> 29.8 -14.3 73.8
3 5 <tibble [4 × 1]> 62.5 15.0 110.
4 8 <tibble [4 × 1]> 40.5 5.22 75.8
Can I merge data like
name,#797,"Stachy, Poland"
at_rank,#797,1
to_center,#797,4.70
predicted,#797,4.70
According to the second column and take the first column as column names?
name at_rank to_center predicted
#797 "Stachy, Poland" 1 4.70 4.70
Upon request, the whole set of data: http://sprunge.us/cYSJ
The first task, reading the data in, should not be a problem if your strings with commas are quoted (which they seem to be). Using read.csv with the header=FALSE argument does the trick with the data you shared. (Of course, if the data file had headers, delete that argument.)
From there, you have several options. Here are two.
reshape (base R) works fine for this:
myDF <- read.csv("http://sprunge.us/cYSJ", header=FALSE)
myDF2 <- reshape(myDF, direction="wide", idvar="V2", timevar="V1")
head(myDF2)
# V2 V3.name V3.at_rank V3.to_center V3.predicted
# 1 #1 Kitoman 1 2.41 2.41
# 5 #2 Hosaena 2 4.23 9.25
# 9 #3 Vinzelles, Puy-de-Dôme 1 5.20 5.20
# 13 #4 Whitelee Wind Farm 6 3.29 8.07
# 17 #5 Steveville, Alberta 1 9.59 9.59
# 21 #6 Rocher, Ardèche 1 0.13 0.13
The reshape2 package is also useful in these cases. It has simpler syntax and the output is also a little "cleaner" (at least in terms of variable names).
library(reshape2)
myDFw_2 <- dcast(myDF, V2 ~ V1)
# Using V3 as value column: use value.var to override.
head(myDFw_2)
# V2 at_rank name predicted to_center
# 1 #1 1 Kitoman 2.41 2.41
# 2 #10 4 Icaraí de Minas 6.07 8.19
# 3 #100 2 Scranton High School (Pennsylvania) 5.78 7.63
# 4 #1000 1 Bat & Ball Inn, Clanfield 2.17 2.17
# 5 #10000 3 Tăuteu 1.87 5.87
# 6 #10001 1 Oak Grove, Northumberland County, Virginia 5.84 5.84
Look at the reshape package from Hadley. If I understand correctly, you are just pivoting your data from long to wide.
I think in this case all you really need to do is transpose, cast to data.frame, set the colnames to the first row and then remove the first row. It might be possible to skip the last step through some combination of arguments to data.frame but I don't know what they are right now.
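A rough sketch of that transpose idea, using the four example lines from the question (it only really makes sense when there is a single id such as "#797"):
myDF <- read.csv(text = 'name,#797,"Stachy, Poland"
at_rank,#797,1
to_center,#797,4.70
predicted,#797,4.70', header = FALSE, stringsAsFactors = FALSE)

m <- t(as.matrix(myDF))          # transpose; everything becomes character
colnames(m) <- m[1, ]            # the first transposed row holds the variable names
wide <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
wide                             # row "V2" holds the id, row "V3" the values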