R merge with itself - r

Can I merge data like
name,#797,"Stachy, Poland"
at_rank,#797,1
to_center,#797,4.70
predicted,#797,4.70
according to the second column, taking the values of the first column as column names? The desired result:
name at_rank to_center predicted
#797 "Stachy, Poland" 1 4.70 4.70
Upon request, the whole set of data: http://sprunge.us/cYSJ

The first problem, of reading the data in, should not be a problem if your strings with commas are quoted (which they seem to be). Using read.csv with the header=FALSE argument does the trick with the data you shared. (Of course, if the data file had headers, delete that argument.)
From there, you have several options. Here are two.
reshape (base R) works fine for this:
myDF <- read.csv("http://sprunge.us/cYSJ", header=FALSE)
myDF2 <- reshape(myDF, direction="wide", idvar="V2", timevar="V1")
head(myDF2)
# V2 V3.name V3.at_rank V3.to_center V3.predicted
# 1 #1 Kitoman 1 2.41 2.41
# 5 #2 Hosaena 2 4.23 9.25
# 9 #3 Vinzelles, Puy-de-Dôme 1 5.20 5.20
# 13 #4 Whitelee Wind Farm 6 3.29 8.07
# 17 #5 Steveville, Alberta 1 9.59 9.59
# 21 #6 Rocher, Ardèche 1 0.13 0.13
The reshape2 package is also useful in these cases. It has simpler syntax and the output is also a little "cleaner" (at least in terms of variable names).
library(reshape2)
myDFw_2 <- dcast(myDF, V2 ~ V1)
# Using V3 as value column: use value.var to override.
head(myDFw_2)
# V2 at_rank name predicted to_center
# 1 #1 1 Kitoman 2.41 2.41
# 2 #10 4 Icaraí de Minas 6.07 8.19
# 3 #100 2 Scranton High School (Pennsylvania) 5.78 7.63
# 4 #1000 1 Bat & Ball Inn, Clanfield 2.17 2.17
# 5 #10000 3 Tăuteu 1.87 5.87
# 6 #10001 1 Oak Grove, Northumberland County, Virginia 5.84 5.84
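For a quick check without downloading the full file, the same reshape call can be tried on the four sample rows from the question. This is only a sketch: the inline text, the names long and wide, and the clean-up with sub() and type.convert() are illustrative additions, not part of the answer above.

```r
# Four sample rows from the question, inlined so no download is needed
txt <- 'name,#797,"Stachy, Poland"
at_rank,#797,1
to_center,#797,4.70
predicted,#797,4.70'

# header = FALSE gives the default column names V1, V2, V3
long <- read.csv(text = txt, header = FALSE, stringsAsFactors = FALSE)

# One row per V2 (the id), one column per distinct value of V1
wide <- reshape(long, direction = "wide", idvar = "V2", timevar = "V1")

# Optional clean-up (an assumption about what is wanted):
# drop the "V3." prefix and convert numeric-looking columns to numbers
names(wide) <- sub("^V3\\.", "", names(wide))
wide <- type.convert(wide, as.is = TRUE)
str(wide)
```

Because V3 mixes text and numbers, every value is read as character; type.convert() is one way to recover numeric columns after the reshape.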

Look at the reshape package from Hadley. If I understand correctly, you are just pivoting your data from long to wide.

I think in this case all you really need to do is transpose, cast to data.frame, set the colnames to the first row and then remove the first row. It might be possible to skip the last step through some combination of arguments to data.frame but I don't know what they are right now.
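A minimal sketch of that transpose idea, assuming a single record (one id) and using made-up object names one and tr. With more than one id, as in the full data set, a reshape is probably the better fit.

```r
# One record in long form: V1 holds the field names, V3 the values
one <- data.frame(
  V1 = c("name", "at_rank", "to_center", "predicted"),
  V3 = c("Stachy, Poland", "1", "4.70", "4.70"),
  stringsAsFactors = FALSE
)

# Transpose (this yields a character matrix), then cast back to a data frame
tr <- as.data.frame(t(one), stringsAsFactors = FALSE)

# The first row holds the column names: use it, then drop it
colnames(tr) <- unname(unlist(tr[1, ]))
tr <- tr[-1, , drop = FALSE]
tr
```

All columns in tr remain character, so numeric fields would still need converting afterwards.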

Related

How to fuzzy join 2 dataframes on 2 variables with differing "fuzzy logic"?

# example
a <- data.frame(name=c("A","B","C"), KW=c(201902,201904,201905),price=c(1.99,3.02,5.00))
b <- data.frame(KW=c(201903,201904,201904),price=c(1.98,3.00,5.00),name=c("a","b","c"))
I want to match a and b with fuzzy logic, using the variables KW and price. I want to allow a tolerance of +/- 1 for KW and a tolerance of +/- 0.02 for price.
The desired outcome should look like this:
name.x KW.x price.x KW.y price.y name.y
1 A 201902 1.99 201903 1.98 a
2 B 201904 3.02 201904 3.00 b
3 C 201905 5.00 201904 5.00 c
I would prefer to find a solution using the fuzzyjoin package. So far I have tried the fuzzy_inner_join function, specifying my desired tolerances for KW and price through the match_fun argument. However, I couldn't get it to work.
Looking for help, how to solve this problem.
You can create a cartesian product of two dataframes using merge and then subset the rows which follow our required conditions.
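# eps guards against floating-point error: 3.02 - 3.00 is slightly more than 0.02 in double precision
eps <- 1e-9
subset(merge(a, b, by = NULL), abs(KW.x - KW.y) <= 1 &
abs(price.x - price.y) <= 0.02 + eps)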
# name.x KW.x price.x KW.y price.y name.y
#1 A 201902 1.99 201903 1.98 a
#5 B 201904 3.02 201904 3.00 b
#9 C 201905 5.00 201904 5.00 c
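Since the question asked for fuzzyjoin specifically, here is a hedged sketch of the same match with fuzzy_inner_join, passing one comparison function per column in by. The eps value and the name res are my own choices; the small eps is there because 3.02 - 3.00 comes out slightly above 0.02 in double precision, so a strict comparison against 0.02 would drop the B/b pair.

```r
library(fuzzyjoin)

a <- data.frame(name = c("A", "B", "C"),
                KW = c(201902, 201904, 201905),
                price = c(1.99, 3.02, 5.00))
b <- data.frame(KW = c(201903, 201904, 201904),
                price = c(1.98, 3.00, 5.00),
                name = c("a", "b", "c"))

# Assumed slack for floating-point rounding; adjust to taste
eps <- 1e-9

# match_fun takes a list with one vectorised predicate per "by" column:
# the first is applied to KW, the second to price
res <- fuzzy_inner_join(
  a, b,
  by = c("KW", "price"),
  match_fun = list(
    function(x, y) abs(x - y) <= 1,
    function(x, y) abs(x - y) <= 0.02 + eps
  )
)
res
```

A pair of rows is kept only when both predicates hold, which mirrors the & in the subset() call.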

changing variable value in data frame

I have a data frame:
id,male,exposure,age,tol
9,0,1.54,tol12,1.79
9,0,1.54,tol13,1.9
9,0,1.54,tol14,2.12
9,0,1.54,tol11,2.23
However, I want the values of the age variable to be (11,12,13,14) not (tol11,tol12,tol13,tol14). I tried the following, but it does not make a difference.
levels(tolerance_wide$age)[levels(tolerance_wide$age)==tol11] <- 11
levels(tolerance_wide$age)[levels(tolerance_wide$age)==tol12] <- 12
Any help would be appreciated.
(data from Singer, Willett book)
Assuming that your data frame is named foo:
foo$age <- as.numeric(gsub("tol", "", foo$age))
id male exposure age tol
1: 9 0 1.54 12 1.79
2: 9 0 1.54 13 1.90
3: 9 0 1.54 14 2.12
4: 9 0 1.54 11 2.23
Here we use two functions:
gsub to replace pattern in a string (we replace tol with nothing "").
as.numeric to transform gsub output (which is character) into numbers
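A self-contained version of the two steps, as a sketch: the data frame below is rebuilt from the four rows in the question, and age is deliberately made a factor (an assumption about how it was read in) to show that the approach does not depend on the column being character.

```r
# Rebuild the sample rows; age is a factor here on purpose
foo <- data.frame(
  id = 9, male = 0, exposure = 1.54,
  age = factor(c("tol12", "tol13", "tol14", "tol11")),
  tol = c(1.79, 1.90, 2.12, 2.23)
)

# Strip the "tol" prefix from the labels, then convert the remaining digits.
# as.character() makes the factor-to-text step explicit.
foo$age <- as.numeric(sub("^tol", "", as.character(foo$age)))
foo
```

Calling as.numeric() directly on a factor would return the internal level codes rather than 11 to 14, which is why the text is extracted first.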

less clunky reshaping of anscombe data

I was trying to use ggplot2 to plot the built-in anscombe data set in R (which contains four different small data sets with identical correlations but radically different relationships between X and Y). My attempts to reshape the data properly were all pretty ugly. I used a combination of reshape2 and base R; a Hadleyverse 2 (tidyr/dplyr) or a data.table solution would be fine with me, but the ideal solution would be
short/no repeated code
comprehensible (somewhat conflicting with criterion #1)
involve as little hard-coding of column numbers, etc. as possible
The original format:
anscombe
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
## 3 13 13 13 8 7.58 8.74 12.74 7.71
## ...
## 11 5 5 5 8 5.68 4.74 5.73 6.89
Desired format:
## s x y
## 1 1 10 8.04
## 2 1 8 6.95
## ...
## 44 4 8 6.89
Here's my attempt:
library("reshape2")
ff <- function(x,v)
setNames(transform(
melt(as.matrix(x)),
v1=substr(Var2,1,1),
v2=substr(Var2,2,2))[,c(3,5)],
c(v,"s"))
f1 <- ff(anscombe[,1:4],"x")
f2 <- ff(anscombe[,5:8],"y")
f12 <- cbind(f1,f2)[,c("s","x","y")]
Now plot:
library("ggplot2"); theme_set(theme_classic())
th_clean <-
theme(panel.spacing=grid::unit(0,"lines"),  # panel.margin was deprecated in ggplot2 2.2.0
axis.ticks.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.y=element_blank(),
axis.text.y=element_blank()
)
ggplot(f12,aes(x,y))+geom_point()+
facet_wrap(~s)+labs(x="",y="")+
th_clean
If you are really dealing with the "anscombe" dataset, then I would say @Thela's reshape solution is very direct.
However, here are a few other options to consider:
Option 1: Base R
You can write your own "reshape" function, perhaps something like this:
myReshape <- function(indf = anscombe, stubs = c("x", "y")) {
temp <- sapply(stubs, function(x) {
unlist(indf[grep(x, names(indf))], use.names = FALSE)
})
s <- rep(seq_along(grep(stubs[1], names(indf))), each = nrow(indf))
data.frame(s, temp)
}
Notes:
I'm not sure that this is necessarily less clunky than what you're already doing
This approach will not work if the data are "unbalanced" (for example, more "x" columns than "y" columns.)
Option 2: "dplyr" + "tidyr"
Since pipes are the rage these days, you can also try:
library(dplyr)
library(tidyr)
anscombe %>%
gather(var, val, everything()) %>%
extract(var, into = c("variable", "s"), "(.)(.)") %>%
group_by(variable, s) %>%
mutate(ind = sequence(n())) %>%
spread(variable, val)
Notes:
I'm not sure that this is necessarily less clunky than what you're already doing, but some people like the pipe approach.
This approach should be able to handle unbalanced data.
Option 3: "splitstackshape"
Before @Arun went and did all that fantastic work on melt.data.table, I had written merged.stack in my "splitstackshape" package. With that, the approach would be:
library(splitstackshape)
setnames(
merged.stack(
data.table(anscombe, keep.rownames = TRUE),
var.stubs = c("x", "y"), sep = "var.stubs"),
".time_1", "s")[]
A few notes:
merged.stack needs something to treat as an "id", hence the need for data.table(anscombe, keep.rownames = TRUE), which adds a column named "rn" with the row numbers
The sep = "var.stubs" basically means that we don't really have a separator variable, so we'll just strip out the stub and use whatever remains for the "time" variable
merged.stack will work if the data are unbalanced. For instance, try using it with anscombe2 <- anscombe[1:7] as your dataset instead of "anscombe".
The same package also has a function called Reshape that builds upon reshape to let it reshape unbalanced data. But it's slower and less flexible than merged.stack. The basic approach would be Reshape(data.table(anscombe, keep.rownames = TRUE), var.stubs = c("x", "y"), sep = "") and then rename the "time" variable using setnames.
Option 4: melt.data.table
This was mentioned in the comments above, but hasn't been shared as an answer. Outside of base R's reshape, this is a very direct approach that handles column renaming from within the function itself:
library(data.table)
melt(as.data.table(anscombe),
measure.vars = patterns(c("x", "y")),
value.name=c('x', 'y'),
variable.name = "s")
Notes:
Will be insanely fast.
Much better supported than "splitstackshape" or reshape ;-)
Handles unbalanced data just fine.
I think this meets the criteria of being 1) short 2) comprehensible and 3) no hardcoded column numbers. And it doesn't require any other packages.
reshape(anscombe, varying=TRUE, sep="", direction="long", timevar="s")
# s x y id
#1.1 1 10 8.04 1
#...
#11.1 1 5 5.68 11
#1.2 2 10 9.14 1
#...
#11.2 2 5 4.74 11
#1.3 3 10 7.46 1
#...
#11.3 3 5 5.73 11
#1.4 4 8 6.58 1
#...
#11.4 4 8 6.89 11
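To get from that output to exactly the desired s/x/y layout, the extra id column and the compound row names can be dropped afterwards. A small sketch, with long as an arbitrary name:

```r
# Base R only: anscombe ships with the datasets package
long <- reshape(anscombe, varying = TRUE, sep = "",
                direction = "long", timevar = "s")

# Keep the three columns of interest and reset the "1.1", "2.1", ... row names
long <- long[, c("s", "x", "y")]
rownames(long) <- NULL

head(long, 3)
```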
I don't know if a non-reshape solution would be acceptable, but here you go:
library(data.table)
#create the pattern that will have the Xs
#this will make it easy to create the Ys
pattern <- 1:4
#use Map to create a list of data.frames with the needed columns
#and also use rbindlist to rbind the list produced by Map
lists <- rbindlist(Map(data.frame,
pattern,
anscombe[pattern],
anscombe[pattern+length(pattern)]
)
)
#set the correct names
setnames(lists, names(lists), c('s','x','y'))
Output:
> lists
s x y
1: 1 10 8.04
2: 1 8 6.95
3: 1 13 7.58
4: 1 9 8.81
5: 1 11 8.33
6: 1 14 9.96
7: 1 6 7.24
8: 1 4 4.26
9: 1 12 10.84
10: 1 7 4.82
....
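The same pairing idea can be written without data.table. This is a sketch that assumes, like the code above, that all x columns come first and the matching y columns follow in the same order; k and long are illustrative names.

```r
# Number of data sets: half of the columns are x, half are y
k <- ncol(anscombe) / 2

# Build one small data frame per set and stack them
long <- do.call(rbind, lapply(seq_len(k), function(i) {
  data.frame(s = i, x = anscombe[[i]], y = anscombe[[i + k]])
}))

head(long, 3)
```

It relies on column position rather than column names, so it would give wrong pairs if the columns were reordered.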
A newer tidyverse option is suggested in the tidyr pivoting vignette, vignette("pivot"):
library(dplyr)
library(tidyr)
anscombe %>%
pivot_longer(everything(),
names_to = c(".value", "set"),
names_pattern = "(.)(.)"
) %>%
arrange(set)
#> # A tibble: 44 x 3
#> set x y
#> <chr> <dbl> <dbl>
#> 1 1 10 8.04
#> 2 1 8 6.95
#> 3 1 13 7.58
#> 4 1 9 8.81
#> 5 1 11 8.33
#> 6 1 14 9.96
#> 7 1 6 7.24
#> 8 1 4 4.26
#> 9 1 12 10.8
#> 10 1 7 4.82
#> # … with 34 more rows

R log-transformation on dataframe

I have a dataframe (df) with the values (V) of different stocks at different dates (t). I would like to get a new df with the profitability for each time period.
Profitability is: ln(Vi_t / Vi_t-1)
where:
ln is the natural logarithm
Vi_t is the Value of the stock i at the date t
Vi_t-1 the value of the same stock at the date before
This is the output of df[1:3, 1:10]
date SMI Bond ABB ADDECO Credit Holcim Nestle Novartis Roche
1 01/08/88 1507.5 3.63 4.98 159.20 15.62 14.64 4.01 4.59 11.33
2 01/09/88 1467.4 3.69 4.97 161.55 15.69 14.40 4.06 4.87 11.05
3 01/10/88 1538.0 3.27 5.47 173.72 16.02 14.72 4.14 5.05 11.94
Specifically, instead of 1467.4 at [2, "SMI"] I want the profitability which is ln(1467.4/1507.5) and the same for all the rest of the values in the dataframe.
As I am new to R I am stuck. I was thinking of using something like mapply, and create the transformation function myself.
Any help is highly appreciated.
This will compute the profitabilities (assuming the data is in a data.frame called d):
(d2<- log(embed(as.matrix(d[,-1]), 2) / d[-dim(d)[1], -1]))
# SMI Bond ABB ADDECO Credit Holcim Nestle Novartis Roche
#1 -0.02696052 0.01639381 -0.002010051 0.01465342 0.004471422 -0.01652930 0.01239173 0.05921391 -0.02502365
#2 0.04699074 -0.12083647 0.095858776 0.07263012 0.020814375 0.02197891 0.01951281 0.03629431 0.07746368
Then, you can add in the dates, if you want:
d2$date <- d$date[-1]
Alternatively, you could use an apply based approach:
(d2 <- apply(d[-1], 2, function(x) diff(log(x))))
# SMI Bond ABB ADDECO Credit Holcim Nestle Novartis Roche
#[1,] -0.02696052 0.01639381 -0.002010051 0.01465342 0.004471422 -0.01652930 0.01239173 0.05921391 -0.02502365
#[2,] 0.04699074 -0.12083647 0.095858776 0.07263012 0.020814375 0.02197891 0.01951281 0.03629431 0.07746368
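Both approaches rest on the identity ln(V_t / V_t-1) = ln(V_t) - ln(V_t-1), which is why diff(log(x)) gives the profitability directly. Below is a small self-contained sketch on the first three stocks from the question that keeps the result as a data frame with its dates; the name ret and the use of lapply are my choices, not the answer's.

```r
# First three rows and three of the value columns from the question
d <- data.frame(
  date = c("01/08/88", "01/09/88", "01/10/88"),
  SMI  = c(1507.5, 1467.4, 1538.0),
  Bond = c(3.63, 3.69, 3.27),
  ABB  = c(4.98, 4.97, 5.47)
)

# diff(log(x)) equals log(x[t] / x[t-1]) for each consecutive pair;
# the first date has no predecessor, so it is dropped
ret <- data.frame(date = d$date[-1],
                  lapply(d[-1], function(x) diff(log(x))))
ret
```

Unlike apply(), which returns a matrix, this keeps one typed column per stock, so adding or keeping the date column does not coerce the numbers to text.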

R data.table reshape chunks of columns at once

Let's say I have a data.table with these columns
nodeID
hour1aaa
hour1bbb
hour1ccc
hour2aaa
hour2bbb
hour2ccc
...
hour24aaa
hour24bbb
hour24ccc
for a total of 72 columns. Let's call it rawtable
I want to reshape it so I have
nodeID
hour
aaa
bbb
ccc
for a total of just these 5 columns
where the hour column will contain whichever hour from the original 72 that it should be.
Let's call it newshape
The way I'm doing it now is to use rbindlist with 24 items where each item is the proper subset of the bigger data.table. Like this (except I'm leaving out most of the hours in my example)
newshape<-rbindlist(list(
rawtable[,list(nodeID, Hour=1, aaa=hour1aaa, bbb=hour1bbb, ccc=hour1ccc)],
rawtable[,list(nodeID, Hour=2, aaa=hour2aaa, bbb=hour2bbb, ccc=hour2ccc)],
rawtable[,list(nodeID, Hour=24, aaa=hour24aaa, bbb=hour24bbb, ccc=hour24ccc)]))
Here is some sample data to play with
rawtable<-data.table(nodeID=c(1,2),hour1aaa=c(12.4,32),hour1bbb=c(61.1,65.33),hour1ccc=c(-4.2,54),hour2aaa=c(12.2,1.2),hour2bbb=c(12.2,5.7),hour2ccc=c(5.6,101.9),hour24aaa=c(45.2,8.5),hour24bbb=c(23,7.9),hour24ccc=c(98,32.3))
Using my rbindlist approach gives the desired result but, as with most things I do with R, there is probably a better way. By better I mean more memory efficient, faster, and/or using fewer lines of code. Does anyone have a better way to achieve this?
This is a classic reshape problem if you get your names in the standard convention it expects, though I'm not sure this really harnesses the efficiency of the data.table structure:
reshape(
setNames(rawtable, gsub("(\\D+)(\\d+)(\\D+)", "\\3.\\2", names(rawtable))),
idvar="nodeID", direction="long", varying=-1, timevar="hour"
)
Result:
nodeID hour aaa bbb ccc
1: 1 1 12.4 61.10 -4.2
2: 2 1 32.0 65.33 54.0
3: 1 2 12.2 12.20 5.6
4: 2 2 1.2 5.70 101.9
5: 1 24 45.2 23.00 98.0
6: 2 24 8.5 7.90 32.3
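The renaming step is what makes reshape able to guess the structure: it expects names of the form stub.time. Here is a base R sketch on a plain data frame with a reduced set of columns; rawdf and long are illustrative names, and timevar = "hour" is set so that the time column gets the name the question asks for.

```r
# A cut-down version of the sample data, as an ordinary data frame
rawdf <- data.frame(
  nodeID = c(1, 2),
  hour1aaa = c(12.4, 32),  hour1bbb = c(61.1, 65.33),
  hour2aaa = c(12.2, 1.2), hour2bbb = c(12.2, 5.7)
)

# "hour1aaa" becomes "aaa.1"; "nodeID" contains no digits and is left alone
names(rawdf) <- gsub("(\\D+)(\\d+)(\\D+)", "\\3.\\2", names(rawdf))
names(rawdf)

# varying = -1 means every column except the first is time-varying
long <- reshape(rawdf, idvar = "nodeID", direction = "long",
                varying = -1, timevar = "hour")
long
```

The hour values come from the digits in the original names, so hours such as 24 are preserved rather than being renumbered.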
@Arun's answer over here: https://stackoverflow.com/a/15510828/496803 may also be useful if you can adapt it to your current data.
One option is to use merged.stack from my package "splitstackshape". This function stacks groups of columns and then merges the output together. Because of how the function creates the "time" variable, you can specify whatever you want to strip out from the column names. In this case, we want to strip out "hour", "aaa", "bbb", and "ccc" and have just the numbers remaining.
library(splitstackshape)
## Make sure you're using at least 1.2.0
packageVersion("splitstackshape")
# [1] ‘1.2.0’
merged.stack(rawtable, id.vars="nodeID",
var.stubs=c("aaa", "bbb", "ccc"),
sep="hour|aaa|bbb|ccc")
# nodeID .time_1 aaa bbb ccc
# 1: 1 1 12.4 61.10 -4.2
# 2: 1 2 12.2 12.20 5.6
# 3: 1 24 45.2 23.00 98.0
# 4: 2 1 32.0 65.33 54.0
# 5: 2 2 1.2 5.70 101.9
# 6: 2 24 8.5 7.90 32.3
