reshape2 cast data frame with some duplicate values - r

I'm trying to use the reshape2 package to redistribute the columns across the top of my dataset. I have temperature and chl-a measured twice at three sites. However, when I melt and cast the data frame, the fun.aggregate defaults to length. I want to preserve the original values. Here is an example data set:
library(reshape2)
library(stringr)
df <- data.frame(site = rep(1:3, each = 2),
                 temp_2009 = c(23, 24, 25, 25, 23, 43),
                 chla_2009 = c(3, 2, 3, 4, 5, 6),
                 temp_2010 = c(23, 25, 26, 27, 23, 23),
                 chla_2010 = c(2, 3, 5, 6, 2, 1))
df2 <- melt(df, id.vars = 1, measure.vars = 2:5)
df2 <- cbind(df2, data.frame(str_split_fixed(df2$variable, "_", 2)))
df2 <- df2[, -2]
names(df2)[3:4] <- c("variable", "year")
dcast(df2, site + year ~ variable)
I think this has something to do with the way reshape2 handles duplicate values.
Any thoughts?

The rows are being aggregated because dcast() can't distinguish them based on the formula provided. If you want to keep the original values, you'll need to include a field that identifies the duplicates uniquely. To continue your code...
df2$group <- rep(1:2,12)
dcast(df2,site+year+group~variable)
Clearly this code is a bit simplistic (in particular, your data must be sorted so the replicates alternate, with no missing values), but it should serve to demonstrate how to preserve the original values.
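A slightly more robust variant of the same idea (a sketch, not the only way): ave() can number the replicates within each site/year/variable combination, so the index no longer depends on the rows arriving in a fixed, fully-crossed order:

```r
library(reshape2)
library(stringr)

df <- data.frame(site = rep(1:3, each = 2),
                 temp_2009 = c(23, 24, 25, 25, 23, 43),
                 chla_2009 = c(3, 2, 3, 4, 5, 6),
                 temp_2010 = c(23, 25, 26, 27, 23, 23),
                 chla_2010 = c(2, 3, 5, 6, 2, 1))
df2 <- melt(df, id.vars = 1, measure.vars = 2:5)
df2 <- cbind(df2, data.frame(str_split_fixed(df2$variable, "_", 2)))
df2 <- df2[, -2]
names(df2)[3:4] <- c("variable", "year")

# ave() numbers the replicates within each site/year/variable group,
# so the index is correct whatever order the rows come in
df2$group <- ave(df2$value, df2$site, df2$year, df2$variable,
                 FUN = seq_along)
res <- dcast(df2, site + year + group ~ variable)
```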

Another option when dcasting a molten dataset with duplicate values is to have dcast() compute a mean/median/min/max (whichever is most relevant for your case) to 'resolve' the duplicates.
dcast(df2, site+year~variable, fun.aggregate = mean)
Obviously that collapses (amalgamates) records, which the OP says is undesirable.

Related

Reshape dataframe with numeric and character variables

I want to transpose a data frame that has both numeric and character columns. Some lines have the same id repeated two or more times, and I would like a final dataframe where that data is on one line.
I thought about using the data.table and reshape2 libraries (they have similar functions), but I can't find the right combination to do what I want and I'm going crazy. Could someone give me some help?
Here is a modified example of my database:
example_data <- data.frame(
  cod = c(20, 20, 20, 20, 20, 20, 20, 40, 80, 80, 80, 80, 80, 240),
  id = c(44, 68, 137, 150, 186, 236, 289, 236, 44, 150, 155, 236, 68, 289),
  textVar = c('aaaa', 'aaaa', 'aaaa bbbb', 'aaaa', 'cccc', 'cccc', 'cccc bbb',
              'dddd', 'dddd cccc', 'dddd', 'ffff', 'ffff gggg', 'ffff', 'hhhh'),
  ww = c(4, 4, 4, 4, 4, 4, 4, 45, 118, 118, 118, 118, 118, 118))
If, for example, we consider the rows with id=44, my desired output looks like this:
exampleRow <-data.frame(cod_1=c(20),id=c(44),textVar_1=c('aaaa'),ww_1=c(4),cod_2=c(80),id=c(44),textVar_2=c('dddd cccc'),ww_2=c(118))
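No answer is shown here, but one hedged sketch of an approach: data.table's own dcast() accepts several value.var columns at once, and rowid(id) numbers the repeats of each id, which produces the cod_1/cod_2-style layout the question asks for:

```r
library(data.table)

example_data <- data.frame(
  cod = c(20, 20, 20, 20, 20, 20, 20, 40, 80, 80, 80, 80, 80, 240),
  id = c(44, 68, 137, 150, 186, 236, 289, 236, 44, 150, 155, 236, 68, 289),
  textVar = c('aaaa', 'aaaa', 'aaaa bbbb', 'aaaa', 'cccc', 'cccc', 'cccc bbb',
              'dddd', 'dddd cccc', 'dddd', 'ffff', 'ffff gggg', 'ffff', 'hhhh'),
  ww = c(4, 4, 4, 4, 4, 4, 4, 45, 118, 118, 118, 118, 118, 118))

dt <- as.data.table(example_data)
# rowid(id) numbers the repeats of each id (1, 2, ...), giving dcast a
# unique cell for every value; value.var lists every column to spread
wide <- dcast(dt, id ~ rowid(id), value.var = c("cod", "textVar", "ww"))
```

For id=44 this yields cod_1=20/textVar_1='aaaa' and cod_2=80/textVar_2='dddd cccc', matching the desired exampleRow (minus the repeated id column).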

Dcast() weird output

I have two dataframes. Applying the same dcast() call to the two gives me different results. Both datasets have the same structure but different sizes; the first one has more than 950 rows:
The code I apply is:
trans_matrix_complete <- mod_attrib$transition_matrix
trans_matrix_complete[which(trans_matrix_complete$channel_from == "_3RDLIVE"), ]
trans_matrix_complete <- rbind(trans_matrix_complete, df_dummy)
trans_matrix_complete$channel_to <- factor(trans_matrix_complete$channel_to,
                                           levels = levels(trans_matrix_complete$channel_to))
trans_matrix_complete <- dcast(trans_matrix_complete,
                               channel_from ~ channel_to,
                               value.var = 'transition_probability')
And the trans_matrix_complete output I get is the following:
Something is not working as it should, because with the smaller dataframe of just a few lines I get the following outcome:
Where
a) the row number is different; I'm not sure why there are two dots listed in the first case,
b) and trying to assign rownames to the dataframe with
row.names(trans_matrix_complete) <- trans_matrix_complete$channel_from
does not work for the large dataframe: despite the row.names call, the dataframe shows up exactly as in the first image, with no names assigned to the rows.
Any idea about this weird behavior?
I resolved this by moving from dcast() to spread() from the tidyr package, using the following:
trans_matrix_complete <- spread(trans_matrix_complete,
                                channel_to, transition_probability)
By applying spread() to the two dataframes, the output matrix has the same format in both cases and accepts rownames without any issue.
So I suspect it is all related to the fact that dcast() and the reshape2 package are no longer actively developed.
Regards
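For what it's worth, a likely alternative explanation (an assumption on my part, since the screenshots aren't available): reshape2's dcast() silently falls back to fun.aggregate = length whenever a channel_from/channel_to pair occurs more than once, e.g. after the rbind() with df_dummy, which would change both the cell values and the row set. A toy reproduction with invented values:

```r
library(reshape2)

# Toy transition matrix with a deliberate duplicate pair (a -> b twice);
# the column names mirror the question's data, the values are invented
trans <- data.frame(channel_from = c("a", "a", "b"),
                    channel_to   = c("b", "b", "a"),
                    transition_probability = c(0.4, 0.6, 1.0))

dups <- duplicated(trans[, c("channel_from", "channel_to")])
# With duplicates present, dcast() defaults to fun.aggregate = length,
# so the cells become counts (2, 1) rather than probabilities
wide <- dcast(trans, channel_from ~ channel_to,
              value.var = "transition_probability")
```

Checking `any(dups)` before casting (or deduplicating/aggregating explicitly) would make the larger dataframe behave like the smaller one.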

dcast object orders of magnitude larger than original object

I have a data.frame in R that consists of two columns, a Sample_ID variable and a value variable. Each sample (of which there are 1971) has 132 individual points. The entire object is only ~3,000,000 bytes, or about 0.003 gigabytes (according to object.size()). For some reason, when I try to dcast the object into wide format, it throws an error saying it can't allocate a vector of size 3.3 GB, which is more than 3 orders of magnitude larger than the original object.
The output I'm hoping for is 1 column for each sample, with 132 rows of data for each column.
The dcast code I am using is the following:
df_dcast = dcast(df, value.var = "Vals", Vals~Sample_ID)
I would provide the dataset for reproducibility but because this problem has to do with object size, I don't think a subset of it would help and I'm not sure how to easily post the full dataset. If you know how to post the full dataset or think that a subset would be helpful, let me know.
Thanks
Ok, I figured out what was going wrong. It was attempting to use each unique value in the Vals column as an individual row, producing far, far more rows than the 132 that I wanted, so I needed to add a new column that is basically a value index going from 1:132. The dataframe then has three columns: ID, Vals, ValsNumber
The dcast code then looks like the following:
df_wide = dcast(df, value.var = "Vals", ValsNumber ~ Sample_ID)
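To make the fix concrete, here is a minimal self-contained sketch with invented data (3 samples of 4 points standing in for the 1971 samples of 132 points); ave() builds the within-sample index the answer calls ValsNumber:

```r
library(reshape2)

# Invented stand-in data: 3 samples x 4 points each
df <- data.frame(Sample_ID = rep(c("s1", "s2", "s3"), each = 4),
                 Vals = rnorm(12))

# Number the points within each sample; this index, not Vals itself,
# goes on the left-hand side of the dcast formula
df$ValsNumber <- ave(seq_along(df$Vals), df$Sample_ID, FUN = seq_along)
df_wide <- dcast(df, ValsNumber ~ Sample_ID, value.var = "Vals")
```

The result has one row per point index and one column per sample, without dcast ever treating each unique value of Vals as its own row.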

R: Warning when creating a (long) list of dummies

A dummy column for a column c and a given value x equals 1 if c == x and 0 otherwise. Usually, when creating dummies for a column c, one excludes one value x of choice, since the last dummy column adds no information beyond the already existing dummy columns.
Here's how I'm trying to create a long list of dummies for a column firm, in a data.table:
values <- unique(myDataTable$firm)
cols <- paste('d', as.character(values[-1]), sep = '_')  # gives us nice d_value names for columns
# the [-1]: I arbitrarily do not create a dummy for the first unique value
myDataTable[, (cols) := lapply(values[-1], function(x) firm == x)]
This code reliably worked for previous columns, which had fewer unique values. firm, however, has more:
str(values)
num [1:3082] 51560090 51570615 51603870 51604677 51606085 ...
I get a warning when trying to add the columns:
Warning message:
truelength (6198) is greater than 1000 items over-allocated (length = 36). See ?truelength. If you didn't set the datatable.alloccol option very large, please report this to datatable-help including the result of sessionInfo().
As far as I can tell, all the columns I need are still there. Can I just ignore this issue? Will it slow down future computations? I'm not sure what to make of this or of the relevance of truelength.
Taking Arun's comment as an answer.
You should use the alloc.col function to pre-allocate the required number of columns in your data.table, to a number bigger than the expected ncol.
alloc.col(myDataTable, 3200)
Additionally, depending on how you consume the data, I would recommend considering reshaping your wide table to a long table (see EAV). Then you only need one column per data type.
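A minimal sketch of the suggested long/EAV layout, with invented firm IDs: melt() collapses the one-dummy-column-per-firm layout into firm/value pairs, so a single pair of columns suffices no matter how many firms there are:

```r
library(data.table)

# Invented wide table: one dummy column per firm (d_<firm> naming as in
# the question), only two firms here for brevity
wide <- data.table(obs = 1:3,
                   d_51560090 = c(1L, 0L, 0L),
                   d_51570615 = c(0L, 1L, 0L))

# Long/EAV form: one row per observation/firm pair instead of one
# column per firm, so no column over-allocation can occur
long <- melt(wide, id.vars = "obs",
             variable.name = "firm", value.name = "is_member")
```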

R: create data.table with periodic function

I would like to create a data.table in tidy form containing the columns articleID, period and demand (with articleID and period as key). The demand is subject to a random function with input data from another data.frame (params). It is created at runtime for differing numbers of periods.
It is easy to do this in "non-tidy" form:
# example data
params <- data.frame(shape = runif(10), rate = runif(10) * 2)
rownames(params) <- letters[1:10]
periods <- 10
# create non-tidy data with one column for each period
df <- replicate(nrow(params),
                rgamma(periods, shape = params[, "shape"], rate = params[, "rate"]))
rownames(df) <- rownames(params)
Is there a "tidy" way to do this creation? I would need to replicate the rgamma() call, but I am not sure how to make it use the parameters of the corresponding article. I tried starting with a cross join from data.table:
dt <- CJ(articleID=rownames(params), per=1:periods, demand=0)
but I don't know how to pass rgamma() to dt[, demand] directly and correctly at creation, nor how to change the values afterwards without some ugly for loop. I also considered gather() from the tidyr package, but as far as I can see, I would need a for loop there as well.
It does not really matter to me whether I use data.frame or data.table for my current use case. Solutions for any (or both!) would be highly appreciated.
This'll do (note that it assumes params is sorted by its row names; if not, you can convert it to a data.table and merge the two):
CJ(articleID = rownames(params), per = 1:periods)[,
  demand := rgamma(.N, shape = params[, "shape"], rate = params[, "rate"]), by = per]
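If params cannot be assumed sorted, the merge-based variant the answer alludes to might look like this (a sketch; the pdt table and the seed are my own additions):

```r
library(data.table)

set.seed(1)
params <- data.frame(shape = runif(10), rate = runif(10) * 2)
rownames(params) <- letters[1:10]
periods <- 10

# Put the parameters in a data.table keyed by articleID, join them onto
# the cross join, then draw each article's demands with its own parameters
pdt <- data.table(articleID = rownames(params),
                  shape = params$shape, rate = params$rate)
dt <- CJ(articleID = rownames(params), period = 1:periods)
dt <- pdt[dt, on = "articleID"]
dt[, demand := rgamma(.N, shape = shape, rate = rate), by = articleID]
dt[, c("shape", "rate") := NULL]  # drop the helper columns again
```

The join makes the pairing explicit, so row order in params no longer matters.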
