Merging multiple columns in a single data frame in R

I have a bizarre problem where I've combined together several data frames that have different species abundance data. I used rbind.fill() to collate the data frames, but some of the column names for the same species are spelled slightly differently, so for several species I have 2-3 columns.
Does anyone know of a way I can merge the data from these columns together?
Simple example:
dat <- matrix(data=c(
Sp.a=c(1,2,3,4,5,NA,NA,NA,NA,NA),
Sp.b=c(3,4,5,6,7,5,4,6,3,4),
Sp.c=c(4,4,4,3,2,NA,NA,NA,NA,NA),
Spp.A=c(NA,NA,NA,NA,NA,2,3,4,2,3),
Spp.C=c(NA,NA,NA,NA,NA,3,4,2,5,4)
), 10,5)
colnames(dat) <- c("sp.a", "sp.b", "sp.c", "Spp.A", "Spp.C")
dat
sp.a sp.b sp.c Spp.A Spp.C
[1,] 1 3 4 NA NA
[2,] 2 4 4 NA NA
[3,] 3 5 4 NA NA
[4,] 4 6 3 NA NA
[5,] 5 7 2 NA NA
[6,] NA 5 NA 2 3
[7,] NA 4 NA 3 4
[8,] NA 6 NA 4 2
[9,] NA 3 NA 2 5
[10,] NA 4 NA 3 4
How can I get sp.a and Spp.A into a single column? (same for sp.c and Spp.C).
Thanks for any help,
Paul

Using reshape2 and going from wide --> long --> wide (again) format:
library(reshape2)
## long format
dat.m <- melt(dat)
## remove missing values
dat.m <- dat.m[!is.na(dat.m$value),]
## rename names
dat.m$Var2 <- tolower(sub("Spp","Sp", dat.m$Var2) )
## wide format
dcast(Var1~Var2,data=dat.m)
Var1 sp.a sp.b sp.c
1 1 1 3 4
2 2 2 4 4
3 3 3 5 4
4 4 4 6 3
5 5 5 7 2
6 6 2 5 3
7 7 3 4 4
8 8 4 6 2
9 9 2 3 5
10 10 3 4 4

Here's one way. This is pretty general, and would even work if you had one series divided over three or more columns.
dat <- data.frame(dat)
# get the last letter of each column and make it lowercase,
# we'll be grouping the columns by this
ns <- tolower(gsub('^.+\\.', '', names(dat)))
# group the columns by their last letter, and run each group through pmax
result <- lapply(split.default(dat, ns), function(x) do.call(function(...) pmax(..., na.rm=TRUE), x))
do.call(cbind, result)
# a b c
# [1,] 1 3 4
# [2,] 2 4 4
# [3,] 3 5 4
# [4,] 4 6 3
# [5,] 5 7 2
# [6,] 2 5 3
# [7,] 3 4 4
# [8,] 4 6 2
# [9,] 2 3 5
# [10,] 3 4 4

ColsToMerge <- c("sp.a", "Spp.A")
dat[["A.merged"]] <-
apply(dat[, ColsToMerge], 1, function(rr) ifelse(is.na(rr[[1]]), rr[[2]], rr[[1]]))
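The row-wise apply above can also be written as a vectorised "first non-missing value" helper, which avoids looping over rows. This is only a minimal sketch: it rebuilds two of the example columns with the lowercase names shown in the printed output, and coalesce_cols is a hypothetical helper name, not a base R function.

```r
# Two of the example columns; each row has a value in at most one of them
dat <- data.frame(
  sp.a  = c(1, 2, 3, 4, 5, NA, NA, NA, NA, NA),
  Spp.A = c(NA, NA, NA, NA, NA, 2, 3, 4, 2, 3)
)

# Hypothetical helper: take the first non-missing value across the vectors given.
# Reduce() folds the vectors pairwise, so it also works for three or more columns.
coalesce_cols <- function(...) {
  Reduce(function(x, y) ifelse(is.na(x), y, x), list(...))
}

dat$A.merged <- coalesce_cols(dat$sp.a, dat$Spp.A)
dat$A.merged
```

If both columns hold a value in the same row, this sketch keeps the one from the first column, which may or may not be what you want.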

Related

Save the results of a while loop in a matrix or a data frame

I want to save the results of a while loop in a matrix or a data frame, arranged in an orderly way.
i <- 15  # year
pon <- list()
while (i < 63) {
  # [[i]] stores the whole cumulative-product vector as one list element
  pon[[i]] <- cumprod(vlookup(i:62, Tabla_de_mortalidad_css, 4))
  i <- i + 1
}
This is the code I am running.
I want it to print something like this, for example:
v1 v2 v3
1
2 1
3 2 1
4 3 2
. . .
. . .
. . .
v1, v2, v3 are my variables.
Do you need something like this?
n <- 1:63
mat <- cbind(1:63, sapply(1:3, function(x) c(rep(NA, x), head(n, -x))))
mat
# [,1] [,2] [,3] [,4]
# [1,] 1 NA NA NA
# [2,] 2 1 NA NA
# [3,] 3 2 1 NA
# [4,] 4 3 2 1
# [5,] 5 4 3 2
# [6,] 6 5 4 3
# [7,] 7 6 5 4
#...
#...
Using the tidyverse:
library(tidyverse)
n <- 1:10
Lag <- 1:3
df <- data.frame(n = n)
bind_cols(df, map_dfc(Lag, ~transmute(df, !!paste0("Lag", .x) := lag(n, n = .x))))
#> n Lag1 Lag2 Lag3
#> 1 1 NA NA NA
#> 2 2 1 NA NA
#> 3 3 2 1 NA
#> 4 4 3 2 1
#> 5 5 4 3 2
#> 6 6 5 4 3
#> 7 7 6 5 4
#> 8 8 7 6 5
#> 9 9 8 7 6
#> 10 10 9 8 7
Created on 2020-12-11 by the reprex package (v0.3.0)
We can do this easily with shift from data.table
library(data.table)
do.call(cbind, shift(1:10, n = 1:3))
Output:
# [,1] [,2] [,3]
# [1,] NA NA NA
# [2,] 1 NA NA
# [3,] 2 1 NA
# [4,] 3 2 1
# [5,] 4 3 2
# [6,] 5 4 3
# [7,] 6 5 4
# [8,] 7 6 5
# [9,] 8 7 6
#[10,] 9 8 7
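For the original while loop itself, one option is to preallocate a matrix of NA and fill one column per iteration, so that each column starts one row lower than the previous one. The sketch below is an assumption-laden illustration: p is a made-up vector standing in for the probabilities the question looks up with vlookup() from Tabla_de_mortalidad_css, neither of which is available here.

```r
# Hypothetical stand-in for the looked-up values (one per year)
p <- c(0.99, 0.98, 0.97, 0.96)
n <- length(p)

# Preallocate: rows are years, column i holds the series that starts in year i
out <- matrix(NA_real_, nrow = n, ncol = n,
              dimnames = list(NULL, paste0("v", seq_len(n))))

i <- 1
while (i <= n) {
  # cumulative product from year i onwards, written into rows i..n of column i
  out[i:n, i] <- cumprod(p[i:n])
  i <- i + 1
}
out
```

Cells above the diagonal stay NA, which gives the staggered layout sketched in the question.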

Difference between rows in a data frame with NA

My sample data looks like this
DF
n a b c d
1 NA NA NA NA
2 1 2 3 4
3 5 6 7 8
4 9 NA 11 12
5 NA NA NA NA
6 4 5 6 NA
7 8 9 10 11
8 12 13 15 16
9 NA NA NA NA
I need to subtract row 2 from row 3 and from row 4.
Similarly, I need to subtract row 6 from row 7 and from row 8.
My real data is huge. Is there a way of doing this automatically? It seems a for loop could do it, but I am a novice R user and my attempts were not successful.
Thank you for any help and tips.
UPDATE
I want to achieve something like this
DF2
rowN1<-DF$row3-DF$row2
rowN2<-DF$row4-DF$row2
rowN3<-DF$row7-DF$row6 # there is NA in row 6 so after subtracting there should be NA also
rowN4<-DF$row8-DF$row6
Here's one idea
set.seed(1)
(m <- matrix(sample(c(1:9, NA), 60, T), ncol=5))
# [,1] [,2] [,3] [,4] [,5]
# [1,] 3 7 3 8 8
# [2,] 4 4 4 2 7
# [3,] 6 8 1 8 5
# [4,] NA 5 4 5 9
# [5,] 3 8 9 9 5
# [6,] 9 NA 4 7 3
# [7,] NA 4 5 8 1
# [8,] 7 8 6 6 1
# [9,] 7 NA 5 6 4
# [10,] 1 3 2 8 6
# [11,] 3 7 9 1 7
# [12,] 2 2 7 5 5
idx <- seq(2, nrow(m)-2, 4)
do.call(rbind, lapply(idx, function(x) {
rbind(m[x+1, ]-m[x, ], m[x+2, ]-m[x, ])
}))
# [1,] 2 4 -3 6 -2
# [2,] NA 1 0 3 2
# [3,] NA NA 1 1 -2
# [4,] -2 NA 2 -1 -2
# [5,] 2 4 7 -7 1
# [6,] 1 -1 5 -3 -1
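The fixed-stride index above assumes every block has the same length. If the blocks are instead delimited by the all-NA rows, as in the question's DF, they can be detected from the data. The following is a rough sketch under that assumption, using the sample values from the question (the n column is left out):

```r
DF <- data.frame(
  a = c(NA, 1, 5, 9, NA, 4, 8, 12, NA),
  b = c(NA, 2, 6, NA, NA, 5, 9, 13, NA),
  c = c(NA, 3, 7, 11, NA, 6, 10, 15, NA),
  d = c(NA, 4, 8, 12, NA, NA, 11, 16, NA)
)
m <- as.matrix(DF)

sep   <- rowSums(!is.na(m)) == 0   # TRUE for rows that are entirely NA
block <- cumsum(sep)               # block id increases at every separator row
keep  <- !sep

# For each block, subtract its first row from every following row
diffs <- do.call(rbind, lapply(split(which(keep), block[keep]), function(rows) {
  base <- m[rows[1], ]
  sweep(m[rows[-1], , drop = FALSE], 2, base)   # default FUN is "-"
}))
diffs
```

An NA in the base row (column d of row 6 here) propagates to every difference in that block, which matches the behaviour requested in the update.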

How can I create a 2D heatmap grid in R?

I have a data set like this
head(data)
V1 V2
[1,] NA NA
[2,] NA NA
[3,] NA NA
[4,] 5 2
[5,] 5 2
[6,] 5 2
where
unique(data$V1)
[1] NA 5 4 3 2 1 0 6 7 9 8
unique(data$V2)
[1] NA 2 6 1 5 3 7 4 0 8 9
What I would like to do is a plot similar to this
plot(data$V1,data$V2)
but drawn as a grid instead of points, with a colour indicating how many matches fall in each cell.
Can anyone help me?
It looks like this may be what you want - first you tabulate using table(), then plot a heatmap of the table using heatmap():
set.seed(1)
data <- data.frame(V1=sample(1:10,100,replace=TRUE),V2=sample(1:10,100,replace=TRUE))
foo <- table(data)
heatmap(foo,Rowv=NA,Colv=NA)
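If you want the grid to keep the actual axis values, and to show empty cells for combinations that never occur, one possibility is to tabulate over fixed factor levels and draw the counts with image(). This is a sketch on simulated data, assuming both variables take the values 0 to 9 as in the question; rows containing NA are dropped by table() by default.

```r
set.seed(1)
data <- data.frame(V1 = sample(0:9, 100, replace = TRUE),
                   V2 = sample(0:9, 100, replace = TRUE))

# Fixing the levels keeps a full 10 x 10 grid even if some values never occur
counts <- table(factor(data$V1, levels = 0:9),
                factor(data$V2, levels = 0:9))

# Each cell is coloured by the number of (V1, V2) pairs that fall into it
image(0:9, 0:9, unclass(counts), xlab = "V1", ylab = "V2",
      col = heat.colors(12))
```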

do.call with cbind and lags of a variable

I want to create a function that produces a matrix containing several lags of a variable. A simple example that works is
a <- ts(1:10)
cbind(a, lag(a, -1))
To do this for multiple lags, I have
lagger <- function(var, lags) {
### Create list of lags
lagged <- lapply(1:lags, function(x){
lag(var, -x)
})
### Join lags together
do.call(cbind, list(var, lagged))
}
Using the above example gives unexpected results;
lagger(a, 1)
gives a length-20 list, with the original time series broken out into separate list slots and the final 10 each being a replication of the lagged series.
Any suggestions for getting this working? Thanks!
This gives a lag of 0 and of 1.
library(zoo)
a <- ts(11:13)
lags <- -(0:1)
a.lag <- as.ts(lag(as.zoo(a), lags))
Now a.lag is this:
> a.lag
Time Series:
Start = 1
End = 4
Frequency = 1
lag0 lag-1
1 11 NA
2 12 11
3 13 12
4 NA 13
If you don't want the NA entries then use: as.ts(na.omit(lag(as.zoo(a), lags))) .
Based on @Joshua Ulrich's answer.
I think embed is the correct answer, but you get the vectors the other way around. That is, with embed the lagged series are not in the proper order; see the following:
lagged <- embed(a,4)
colnames(lagged) <- paste('t', 3:0, sep='-')
lagged
t-3 t-2 t-1 t-0
[1,] 4 3 2 1
[2,] 5 4 3 2
[3,] 6 5 4 3
[4,] 7 6 5 4
[5,] 8 7 6 5
[6,] 9 8 7 6
[7,] 10 9 8 7
This gives you the correct values but not in the correct order, since the lags are in descending order.
But if you reorder like this:
lagged_OK <- lagged[,ncol(lagged):1]
colnames(lagged_OK) <- paste('lag', 0:3, sep='.')
lagged_OK
lag.0 lag.1 lag.2 lag.3
[1,] 1 2 3 4
[2,] 2 3 4 5
[3,] 3 4 5 6
[4,] 4 5 6 7
[5,] 5 6 7 8
[6,] 6 7 8 9
[7,] 7 8 9 10
Then, you get the right lagged matrix.
I added the colnames only for explanation purposes; you can just do:
embed(a,4)[ ,4:1]
If you really want a lagger function, try this
lagger <- function(x, lag=1){
lag <- lag+1
Lagged <- embed(x,lag)[ ,lag:1]
colnames(Lagged) <- paste('lag', 0:(lag-1), sep='.')
return(Lagged)
}
lagger(a, 4)
lag.0 lag.1 lag.2 lag.3 lag.4
[1,] 1 2 3 4 5
[2,] 2 3 4 5 6
[3,] 3 4 5 6 7
[4,] 4 5 6 7 8
[5,] 5 6 7 8 9
[6,] 6 7 8 9 10
lagger(a, 1)
lag.0 lag.1
[1,] 1 2
[2,] 2 3
[3,] 3 4
[4,] 4 5
[5,] 5 6
[6,] 6 7
[7,] 7 8
[8,] 8 9
[9,] 9 10
I'm not sure what's wrong with your function, but you can probably use embed instead.
> embed(a,4)
[,1] [,2] [,3] [,4]
[1,] 4 3 2 1
[2,] 5 4 3 2
[3,] 6 5 4 3
[4,] 7 6 5 4
[5,] 8 7 6 5
[6,] 9 8 7 6
[7,] 10 9 8 7
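As for the original function, a likely cause is the argument list passed to do.call: list(var, lagged) has two elements, the series and a list of series, so cbind never sees the lagged series as separate arguments. Flattening with c(list(var), lagged) should address that. A minimal sketch, keeping the question's function name; the lag0, lag1, ... labels are an assumption added for readability, and stats::lag is spelled out in case another package masks lag:

```r
a <- ts(1:10)

lagger <- function(var, lags) {
  # One lagged copy of the series per lag 1..lags
  lagged <- lapply(seq_len(lags), function(k) stats::lag(var, -k))
  # Flatten into a single list of series: original first, then each lag
  series <- c(list(var), lagged)
  # Named so the columns get readable labels
  names(series) <- paste0("lag", 0:lags)
  do.call(cbind, series)
}

res <- lagger(a, 2)
res
```

Because the series are combined on the union of their time indices, the result has NA padding at both ends, unlike embed, which keeps only the complete rows.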

"Loop through" data.table to calculate conditional averages

I want to "loop through" the rows of a data.table and calculate an average for each row. The average should be calculated based on the following mechanism:
Look up the identifier ID in row i (ID(i))
Look up the value of T2 in row i (T2(i))
Calculate the average over the Data1 values in all rows j, which meet these two criteria: ID(j) = ID(i) and T1(j) = T2(i)
Enter the calculated average in the column Data2 of row i
DF = data.frame(ID=rep(c("a","b"),each=6),
T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12))
DT = data.table(DF)
DT[ , Data2:=NA_real_]
ID T1 T2 Data1 Data2
[1,] a 1 1 1 NA
[2,] a 1 2 2 NA
[3,] a 1 3 3 NA
[4,] a 2 1 4 NA
[5,] a 2 2 5 NA
[6,] a 2 3 6 NA
[7,] b 1 1 7 NA
[8,] b 1 2 8 NA
[9,] b 1 3 9 NA
[10,] b 2 1 10 NA
[11,] b 2 2 11 NA
[12,] b 2 3 12 NA
For this simple example the result should look like this:
ID T1 T2 Data1 Data2
[1,] a 1 1 1 2
[2,] a 1 2 2 5
[3,] a 1 3 3 NA
[4,] a 2 1 4 2
[5,] a 2 2 5 5
[6,] a 2 3 6 NA
[7,] b 1 1 7 8
[8,] b 1 2 8 11
[9,] b 1 3 9 NA
[10,] b 2 1 10 8
[11,] b 2 2 11 11
[12,] b 2 3 12 NA
I think one way of doing this would be to loop through the rows, but I think that is inefficient. I've had a look at the apply() function, but I'm not sure if it would solve my problem. I could also use data.frame instead of data.table if this would make it much more efficient or much easier. The real dataset contains approximately 1 million rows.
The rule of thumb is to aggregate first, and then join to that.
agg = DT[,mean(Data1),by=list(ID,T1)]
setkey(agg,ID,T1)
DT[,Data2:={JT=J(ID,T2);agg[JT,V1][[3]]}]
ID T1 T2 Data1 Data2
[1,] a 1 1 1 2
[2,] a 1 2 2 5
[3,] a 1 3 3 NA
[4,] a 2 1 4 2
[5,] a 2 2 5 5
[6,] a 2 3 6 NA
[7,] b 1 1 7 8
[8,] b 1 2 8 11
[9,] b 1 3 9 NA
[10,] b 2 1 10 8
[11,] b 2 2 11 11
[12,] b 2 3 12 NA
As you can see it's a bit ugly in this case (but will be fast). It's planned to add drop which will avoid the [[3]] bit, and maybe we could provide a way to tell [.data.table to evaluate i in calling scope (i.e. no self join) which would avoid the JT= bit which is needed here because ID is in both agg and DT.
keyby has been added to v1.8.0 on R-Forge so that avoids the need for the setkey, too.
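With a more recent data.table, the same aggregate-then-join idea can be written as an update join using the on= argument, which needs neither setkey nor the [[3]] step. This is a sketch rather than the answer's own code: it assumes a data.table version that supports on= and the .() alias, and the column name avg is chosen here for illustration.

```r
library(data.table)

DT <- data.table(ID = rep(c("a", "b"), each = 6),
                 T1 = rep(1:2, each = 3),
                 T2 = rep(1:3, times = 4),
                 Data1 = 1:12)

# Step 1: aggregate, one mean per (ID, T1) group
agg <- DT[, .(avg = mean(Data1)), by = .(ID, T1)]

# Step 2: update join, matching DT's T2 against agg's T1 within the same ID.
# Rows of DT without a match (T2 == 3 here) are left as NA.
DT[agg, Data2 := i.avg, on = .(ID, T2 = T1)]
DT
```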
A somewhat faster alternative to iterating over rows would be a solution which employs vectorization.
R> d <- data.frame(ID=rep(c("a","b"),each=6), T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12))
R> d
ID T1 T2 Data1
1 a 1 1 1
2 a 1 2 2
3 a 1 3 3
4 a 2 1 4
5 a 2 2 5
6 a 2 3 6
7 b 1 1 7
8 b 1 2 8
9 b 1 3 9
10 b 2 1 10
11 b 2 2 11
12 b 2 3 12
R> rowfunction <- function(i) with(d, mean(Data1[which(T1==T2[i] & ID==ID[i])]))
R> d$Data2 <- sapply(1:nrow(d), rowfunction)
R> d
ID T1 T2 Data1 Data2
1 a 1 1 1 2
2 a 1 2 2 5
3 a 1 3 3 NaN
4 a 2 1 4 2
5 a 2 2 5 5
6 a 2 3 6 NaN
7 b 1 1 7 8
8 b 1 2 8 11
9 b 1 3 9 NaN
10 b 2 1 10 8
11 b 2 2 11 11
12 b 2 3 12 NaN
Also, I'd prefer to preprocess the data before getting it into R. I.e. if you are retrieving the data from an SQL server, it might be a better choice to let the server calculate the averages, as it will very likely do a better job in this.
R is actually not very good at number crunching, for several reasons. But it's excellent when doing statistics on the already-preprocessed data.
Using tapply and part of another recent post:
DF = data.frame(ID=rep(c("a","b"),each=6), T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12))
EDIT: Actually, most of the original function is redundant and was intended for something else. Here, simplified:
ansMat <- tapply(DF$Data1, DF[, c("ID", "T1")], mean)
i <- cbind(match(DF$ID, rownames(ansMat)), match(DF$T2, colnames(ansMat)))
DF<-cbind(DF,Data2 = ansMat[i])
# ansMat<-tapply(seq_len(nrow(DF)), DF[, c("ID", "T1")], function(x) {
# curSub <- DF[x, ]
# myIndex <- which(DF$T2 == curSub$T1 & DF$ID == curSub$ID)
# meanData1 <- mean(curSub$Data1)
# return(meanData1 = meanData1)
# })
The trick was doing tapply over ID and T1 instead of ID and T2. Anything speedier?
