Prevent reshape2 from converting column headings to numbers - r

I am trying to create a lookup table from an actual distance matrix for US ZIP code locations, using the Google Maps example shown on Lars Relund Nielsen's webpage. ZIP codes in the Northeastern US begin with a "0", which gets dropped when reshape2 converts the matrix from wide to long as described on his page.
Example matrix of distances (km) between 5 ZIP codes:
mdat <- matrix(c(  0.000, 113.288, 145.986, 126.271, 368.354,
                 103.988,   0.000,  69.637,  49.922, 294.386,
                 144.851,  69.285,   0.000,  25.547, 244.024,
                 124.531,  48.965,  25.245,   0.000, 258.729,
                 368.346, 295.159, 243.478, 258.598,   0.000),
               nrow = 5, ncol = 5, byrow = TRUE,
               dimnames = list(c("01014", "01747", "02144", "02453", "04040"),
                               c("01014", "01747", "02144", "02453", "04040")))
It looks like this (all well and good):
# 01014 01747 02144 02453 04040
#01014 0.000 113.288 145.986 126.271 368.354
#01747 103.988 0.000 69.637 49.922 294.386
#02144 144.851 69.285 0.000 25.547 244.024
#02453 124.531 48.965 25.245 0.000 258.729
#04040 368.346 295.159 243.478 258.598 0.000
But when I reshape the matrix to a lookup table, it converts the row/column names to numbers, dropping the leading zeros from the ZIP codes.
# reshape into a table of distances
library(reshape2)
dat <- melt(mdat)
colnames(dat) <- c("from", "to", "km")
head(dat)
from to km
1 1014 1014 0.000
2 1747 1014 103.988
3 2144 1014 144.851
4 2453 1014 124.531
5 4040 1014 368.346
6 1014 1747 113.288
I am hoping to get:
from to km
1 01014 01014 0.000
2 01747 01014 103.988
3 02144 01014 144.851
4 02453 01014 124.531
5 04040 01014 368.346
6 01014 01747 113.288
Any thoughts on how I can keep reshape2 from converting the ZIP codes to numbers?

Include as.is=TRUE in your melt() call:
dat <- melt(mdat, as.is = TRUE)
colnames(dat) <- c("from", "to", "km")
This will keep the row and column names as character strings through the melting process.
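If the names have already been converted to numbers, one possible repair (a sketch, assuming every ZIP code is exactly five digits) is to re-pad both columns:
# re-pad numeric ZIP codes back to five-character strings
# (assumption: every ZIP code is exactly five digits)
dat$from <- sprintf("%05d", dat$from)
dat$to <- sprintf("%05d", dat$to)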

Related

Find columns with different values in duplicate rows

I have a data set that has some duplicate records. For those records, most of the column values are the same, but a few are different.
I need to identify the columns where the values are different, and then subset those columns.
This would be a sample of my dataset:
library(data.table)
dat <- "ID location date status observationID observationRep observationVal latitude longitude setSource
FJX8KL loc1 2018-11-17 open 445 1 17.6 -52.7 -48.2 XF47
FJX8KL loc2 2018-11-17 open 445 2 1.9 -52.7 -48.2 LT12"
dat <- setDT(read.table(textConnection(dat), header=T))
And this is the output I would expect:
observationRep observationVal setSource
1: 1 17.6 XF47
2: 2 1.9 LT12
One detail is: my original dataset has 189 columns, so I need to check all of them.
How to achieve this?
Two issues: first, use the text= argument rather than textConnection(); second, use as.data.table(), since setDT() modifies its argument in place, and the value returned by read.table() is not yet stored in a variable.
dat1 <- data.table::as.data.table(read.table(text=dat, header=TRUE))
dat1[, c('observationRep', 'observationVal', 'setSource')]
# observationRep observationVal setSource
# 1: 1 17.6 XF47
# 2: 2 1.9 LT12
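Since the real dataset has 189 columns, a possible way to find the differing columns programmatically (a sketch, assuming duplicates are identified by the ID column) is to count the distinct values of every column within each ID group:
library(data.table)
dat1 <- as.data.table(read.table(text = dat, header = TRUE))
# number of distinct values of each column within each ID group
n_distinct <- dat1[, lapply(.SD, uniqueN), by = ID]
# columns where some group has more than one distinct value
# (in this sample that also flags location, which varies within the group)
diff_cols <- setdiff(names(n_distinct), "ID")[sapply(n_distinct[, !"ID"], function(x) any(x > 1))]
dat1[, ..diff_cols]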

Merging Embedded Lists with Different Date Formats in R

I need to merge two lists with each other, but I am not getting what I want, and I think it is because the "Date" column is in two different formats. I have a list called li, and in this list there are 12 lists, each with the following format:
> tail(li$fxe)
Date fxe
3351 2020-06-22 0.0058722768
3352 2020-06-23 0.0044256216
3353 2020-06-24 -0.0044998220
3354 2020-06-25 -0.0027309539
3355 2020-06-26 0.0002832672
3356 2020-06-29 0.0007552346
I am trying to merge each of these unique lists with a different list called factors, which looks like:
> tail(factors)
Date Mkt-RF SMB HML RF
3351 20200622 0.0071 0.83 -1.42 0.000
3352 20200623 0.0042 0.15 -0.56 0.000
3353 20200624 -0.0261 -0.52 -1.28 0.000
3354 20200625 0.0112 0.25 0.50 0.000
3355 20200626 -0.0243 0.16 -1.37 0.000
3356 20200629 0.0151 1.25 1.80 0.000
The reason I need this structure is that I am trying to send them to a function I wrote to do linear regressions. The first line of my function merges these lists. When I merge them, I end up with a null structure even though my lists clearly have the same number of rows. In my function, df is li. The embedded lists of li are confusing me. Can someone help, please?
Function I want to use:
Bf <- function(df, fac) {
  # This function calculates the betas of the Fama-French factors
  # using linear regression
  # Input:  df  = a data frame containing returns of the security
  #         fac = a data frame containing the excess market return and
  #               the Fama-French 3 factors
  # Output: a beta vector of the Fama-French model
  temp <- merge(df, fac, by = "Date")
  temp <- temp[, !names(temp) %in% "Date"]
  temp[, 1] <- temp[, 1] - temp$RF
  return(lm(temp[, 1] ~ temp[, 2] + temp[, 3] + temp[, 4])$coeff)
}
(a) You are dealing with data frames, not lists.
(b) If you want to merge them, you need to modify the factors$Date column to match the format of li$fxe$Date.
Try:
factors$Date <- as.Date(strptime(factors$Date, format = "%Y%m%d"))
This should convert the Date column of factors to the "Date" class.
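Once the Date columns match, running the regression over every embedded data frame might look like this (a sketch, assuming li holds the 12 return data frames and Bf is the function above):
# apply Bf() to each embedded data frame, reusing the same factors
betas <- lapply(li, Bf, fac = factors)
betas$fxe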

R: Aggregating over several variables and observations (depending on values) and creating a new variable

The data set has the following structure
Key Date Mat Amount
<int> <date> <chr> <dbl>
1 1001056 2014-12-12 10025 0.10
2 1001056 2014-12-23 10025 0.20
3 1001056 2015-01-08 10025 0.10
4 1001056 2015-04-07 10025 0.20
5 1001056 2015-05-08 10025 0.20
6 1001076 2013-10-29 10026 3.00
7 1001140 2013-01-18 10026 0.72
8 1001140 2013-04-11 10026 2.40
9 1001140 2014-10-08 10026 0.24
10 1001237 2015-02-17 10025 2.40
11 1001237 2015-02-17 10026 3.40
Mat takes values in {10001,...,11000}, hence A:=|Mat|=1000.
I would like to accomplish the following goals:
1) (Intermediate step) For each Key-Date combination I would like to calculate, for all materials available at that combination (which might vary from key to key), the pairwise differences in Amount,
e.g. for combination "1001237 2015-02-17" this would be 2.40 - 3.40 = -1 for materials 10025 and 10026 (but there might be more combinations). (How can those values be stored efficiently?)
This step might be skipped.
2) Finally, I would like to construct a new A x A = 1000 x 1000 matrix where each entry (i,j) (material combination i and j) contains the average of the values calculated in the step before.
More formally, entry (i,j) is given by
\frac{1}{|K_{ij}|} \sum_{(Key, Date) \in K_{ij}} (Amount_i - Amount_j),
where K_{ij} denotes the set of all Key-Date combinations containing both Mat i and Mat j.
As the table is quite large efficiency of the computation is very important.
Thank you very much for your help in advance!
I can do it with list columns in tidyverse; the trick is to use group_by to get distinct combinations of Key and Date. Here's the code:
library(dplyr)
library(tidyr)

materials <- unique(x$Mat)
n <- length(materials)

x <- x %>%
  group_by(Key, Date) %>%
  nest() %>%
  # create an n x n matrix for each combination of Key and Date
  mutate(matrices = lapply(data, function(y) {
    out <- matrix(nrow = n, ncol = n,
                  dimnames = list(materials, materials))
    # only fill in when the pair of materials is present
    # for the Key-Date combination of interest
    mat_present <- as.character(unique(y$Mat))
    for (i in mat_present) {
      for (j in mat_present) {
        # you may want to take an absolute value
        out[i, j] <- y$Amount[y$Mat == i] - y$Amount[y$Mat == j]
      }
    }
    out
  }))
If you really want speed, you can implement the function passed to lapply in Rcpp, and use RcppParallel to speed it up further. Now one of the columns of the data frame is a list of matrices. Then, for each cell of the matrices, take an average while ignoring NAs:
# stack the matrices into an n x n x (number of Key-Date groups) array;
# for the sample data that is 2 x 2 x 10
x_arr <- array(unlist(x$matrices), dim = c(2, 2, 10))
results <- apply(x_arr, 2, rowMeans, na.rm = TRUE)
I stacked the list of matrices into a 3D array and found row means slice by slice. For performance, you can also do it in RcppArmadillo, with sum(x_arr, 2), but it's hard to deal with missing values when not all types of materials are represented in a combination of Key and Date.
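An equivalent and arguably clearer way to average over the third dimension (a sketch operating on the same x_arr as above):
# mean over the third dimension, cell by cell, ignoring NAs
results <- apply(x_arr, c(1, 2), mean, na.rm = TRUE)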

Find correlation coefficient of two columns in a dataframe by group

I have a large dataframe "Im" that looks like this
V1 V7 X134 X135 X136 X137 X138
1 m 1000 543.360 1057.770 1869.42 2664.06 3935.307
2 m 2000 767.256 1704.430 2993.63 5248.06 6341.129
3 m 3000 413.096 796.168 1441.13 3500.46 2962.048
4 a 4000 257.128 559.200 1014.79 2948.64 2080.437
5 a 5000 188.504 440.640 813.60 2538.11 1639.349
6 a 6000 483.704 921.064 1679.98 3626.44 3426.709
....
I want to find the correlation coefficient between column X135 and each of the other numbered columns; in other words, I essentially need this:
> cor(Im$X135,Im$X136)
> cor(Im$X135,Im$X134)
> cor(Im$X135,Im$X137)
The problem is I need the correlation grouped by "V1". The output I am looking for should be something like this (using hypothetical correlation coefficients)
V1 cc134 cc136 cc137
1 m 0.92 0.99 0.95
1 a 0.99 0.93 0.89
I have looked at tapply, ddply, and aggregate, and everything I found was for row-wise functions like sum and average, resulting in an output which has the same columns. I am new to R, so I couldn't figure out how to write a clever function to do this. I have considered reshaping the data, but didn't get anywhere with that either. Any and all help appreciated!
This type of problem is well suited to the data.table package.
Here is a simple attempt:
library(data.table)
data <- read.table(text=" V1 V7 134 135 136 137 138
1 m 1000 543.360 1057.770 1869.42 2664.06 3935.307
2 m 2000 767.256 1704.430 2993.63 5248.06 6341.129
3 m 3000 413.096 796.168 1441.13 3500.46 2962.048
4 a 4000 257.128 559.200 1014.79 2948.64 2080.437
5 a 5000 188.504 440.640 813.60 2538.11 1639.349
6 a 6000 483.704 921.064 1679.98 3626.44 3426.709",header=T)
data <- data.table(data)
setkey(data, V1)
data[, list(cc134 = cor(X135, X134), cc136 = cor(X135, X136), cc137 = cor(X135, X137)), by = key(data)]
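If there are many numbered columns, a more general sketch (an assumption: every column other than V1, V7, and X135 should be correlated against X135) avoids hard-coding the names:
# correlate X135 with every other measurement column, per group
num_cols <- setdiff(names(data), c("V1", "V7", "X135"))
data[, lapply(.SD, function(col) cor(X135, col)), by = V1, .SDcols = num_cols]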
To learn more about the package:
vignette("datatable-intro")
vignette("datatable-faq")
vignette("datatable-timings")
Or see it in action:
example(data.table)

How to column bind and row bind a large number of data frames in R?

I have a large data set of vehicles. They were recorded every 0.1 seconds, so their IDs repeat in the Vehicle ID column. In total there are 2169 vehicles. I filtered (smoothed) the 'Vehicle velocity' column for every vehicle (using a for loop), which resulted in a new column with the first and last 30 values removed (per vehicle). In order to bind it with the original data frame, I removed the first and last 30 rows of the table too and then combined them using cbind(). This works only for the last vehicle. I want this smoothing and column binding for all vehicles, and finally I want to combine all the data frames of vehicles into one single table, that is, row binding in sequence of vehicle IDs. This is what I wrote so far:
traj1 <- read.csv('trajectories-0750am-0805am.txt', sep = ' ', header = F)
head(traj1)
names(traj1) <- c('Vehicle ID', 'Frame ID', 'Total Frames', 'Global Time', 'Local X', 'Local Y', 'Global X', 'Global Y', 'Vehicle Length', 'Vehicle width', 'Vehicle class', 'Vehicle velocity', 'Vehicle acceleration', 'Lane', 'Preceding Vehicle ID', 'Following Vehicle ID', 'Spacing', 'Headway')
# TIME COLUMN (frames are recorded at 10 per second)
traj1$'Time' <- traj1$'Frame ID' / 10
# SMOOTHING VELOCITY
# two-sided exponential kernel smoother; D is the half-width, delta the decay
smooth <- function(x, D, delta) {
  z <- exp(-abs(-D:D / delta))
  convolve(x, z, type = 'filter') / convolve(rep(1, length(x)), z, type = 'filter')
}
for (i in unique(traj1$'Vehicle ID')) {
  veh <- subset(traj1, traj1$'Vehicle ID' == i)
  svel <- smooth(veh$'Vehicle velocity', 30, 10)
  svel <- data.frame(svel)
  veh <- head(tail(veh, -30), -30)
  fta <- cbind(veh, svel)
}
'fta' now only holds the data frame for the last vehicle, but I want the data frames for all vehicles i combined by row. Maybe a for loop is not the right way to do it, but I don't know how I can use tapply (or any other apply function) to do so many things at the same time.
EDIT
I can't reproduce my dataset here, but the 'Orange' data set in R provides a good analogy. Using the same smoothing function, the for loop would look like this (if the 'age' column is smoothed and the 'Tree' column is the equivalent of my 'Vehicle ID' column):
for (i in unique(Orange$Tree)) {
  tre <- subset(Orange, Orange$'Tree' == i)
  age2 <- round(smooth(tre$age, 2, 0.67), digits = 2)
  age2 <- data.frame(age2)
  tre <- head(tail(tre, -2), -2)
  comb <- cbind(tre, age2)
}
Umair, I am not sure I understood what you want.
If I understood right, you want to combine all the results by row. To do that, you could save all the results in a list and then do.call rbind:
comb <- list()  # create a list to save the results
length(comb) <- length(unique(Orange$Tree))

# your loop for smoothing:
for (i in 1:length(unique(Orange$Tree))) {
  tre <- subset(Orange, Tree == unique(Orange$Tree)[i])
  age2 <- round(smooth(tre$age, 2, 0.67), digits = 2)
  age2 <- data.frame(age2)
  tre <- head(tail(tre, -2), -2)
  comb[[i]] <- cbind(tre, age2)  # save results in the list
}

final.data <- do.call("rbind", comb)  # combine all results by row
This will give you:
Tree age circumference age2
3 1 664 87 687.88
4 1 1004 115 982.66
5 1 1231 120 1211.49
10 2 664 111 687.88
11 2 1004 156 982.66
12 2 1231 172 1211.49
17 3 664 75 687.88
18 3 1004 108 982.66
19 3 1231 115 1211.49
24 4 664 112 687.88
25 4 1004 167 982.66
26 4 1231 179 1211.49
31 5 664 81 687.88
32 5 1004 125 982.66
33 5 1231 142 1211.49
Just for fun, here is a different way to do it, using plyr::ddply and sapply with split:
library(plyr)
data <- ddply(Orange, .(Tree), tail, n = -2)
data <- ddply(data, .(Tree), head, n = -2)
data <- cbind(data,
              age2 = matrix(sapply(split(Orange$age, Orange$Tree), smooth, D = 2, delta = 0.67),
                            ncol = 1, byrow = FALSE))
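Applied back to the vehicle data, the same pattern might look like this (a sketch: column names are taken from the question, D = 30 and delta = 10 as in the original loop, and it assumes every vehicle has more than 60 records):
# split by vehicle, smooth, trim 30 rows at each end, then row-bind
fta_list <- lapply(split(traj1, traj1$'Vehicle ID'), function(veh) {
  svel <- smooth(veh$'Vehicle velocity', 30, 10)
  cbind(head(tail(veh, -30), -30), svel = svel)
})
fta <- do.call(rbind, fta_list)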
