I have a data.frame in R that consists of two columns, a Sample_ID variable and a value variable. Each sample (of which there are 1971) has 132 individual points. The entire object is only ~3,000,000 bytes, or about 0.003 gigabytes (according to object.size()). For some reason, when I try to dcast the object into wide format, it throws an error saying it can't allocate a vector of size 3.3 GB, which is more than three orders of magnitude larger than the original object.
The output I'm hoping for is 1 column for each sample, with 132 rows of data for each column.
The dcast code I am using is the following:
df_dcast = dcast(df, value.var = "Vals", Vals~Sample_ID)
I would provide the dataset for reproducibility, but because this problem has to do with object size, I don't think a subset of it would help, and I'm not sure how to easily post the full dataset. If you know how to post the full dataset, or think that a subset would be helpful, let me know.
Thanks
OK, I figured out what was going wrong. dcast was attempting to use each unique value in the Vals column as an individual row, producing far more rows than the 132 that I wanted. I needed to add a new column that is basically a value index going from 1 to 132, so the data frame now has three columns: Sample_ID, Vals, ValsNumber.
The dcast code then looks like the following:
df_wide = dcast(df, value.var = "Vals", ValsNumber ~ Sample_ID)
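For reference, here is one way such an index column could be built before the cast; this is just a minimal sketch using base R's ave(), assuming the rows are already in the desired order within each sample:
# Per-sample index 1, 2, ..., 132, assuming rows are ordered within each sample
df$ValsNumber <- ave(df$Vals, df$Sample_ID, FUN = seq_along)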
I have a large dataset (~800M rows) as a data.table. The dataset consists of equidistant time-series data for thousands of IDs. My problem is that missing values were originally not encoded but are simply absent from the dataset, so I would like to add the rows with missing data. I know that for each ID the same timestamps should be present.
Given the size of the dataset, my initial idea was to create one data.table that includes every time step the data should contain, and then merge it, with all = TRUE, against each ID of the main data.table. However, so far I have only managed to do that when my data.table with all time steps (complete_dt) also includes the ID column. This creates a lot of redundant information, as each ID should have the same time steps.
I made an MWE; for simplicity, as my data is equidistant, I have replaced the POSIXct column with a simple integer column.
library(data.table)
# My main dataset
set.seed(123)
main_dt <- data.table(id = as.factor(rep(1:3, c(5, 4, 3))),
                      pseudo_time = c(1, 3, 4, 6, 7, 1, 3, 4, 5, 3, 5, 6),
                      value = runif(12))
# Assuming that I should have the pseudo timesteps 1:7 for each ID
# Given the size of my real data I would like to create the pseudo time not for each ID but only once
complete_dt <- main_dt[, list(pseudo_time = 1:7), by = id]
#The dt I need to get in the end
result_dt <- merge.data.table(main_dt, complete_dt, all = TRUE)
I have seen this somewhat similar question, Merge (full join) recursively one data.table with each group of another data.table, but I have not managed to apply it to my problem.
Any help towards a more efficient solution than mine would be much appreciated.
Here is an alternative, though probably not much more efficient:
setkey(main_dt, id, pseudo_time)
main_dt[CJ(id, pseudo_time = 1:7, unique = TRUE)]
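For the real data, where the time column is POSIXct rather than an integer index, the 1:7 above would be replaced by the full equidistant grid of timestamps; a sketch, with purely hypothetical start, end and step values:
# Hypothetical equidistant timestamp grid to use in place of 1:7
all_times <- seq(from = as.POSIXct("2024-01-01 00:00", tz = "UTC"),
                 to   = as.POSIXct("2024-01-07 00:00", tz = "UTC"),
                 by   = "1 hour")
# ...and then join against it, with the real time column in the key instead of pseudo_time:
# main_dt[CJ(id, time = all_times, unique = TRUE)]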
This question is quite difficult to describe, but easy to understand when visualized. I would therefore suggest looking at the two images that I linked to this post to help facilitate understanding the issue.
Here is a link to my practice data frame:
sample.data <-read.table("https://pastebin.com/uAQD6nnM", header=T, sep="\t")
I don't know why I get the error "more columns than column names" here, because using the same file from my desktop works just fine; clicking on the link does go to my dataset.
I received very large data frames that are arranged in rows, and I want the data put into columns. However, it is not that 'easy', because I do not necessarily want (or need) to transpose all the data.
This link appears to be close to what I would like to do, but is just not quite the right answer for me: Python Pandas: Transpose or Stack?
I have a header with GPS data (Coords_Y, Coords_X), followed by a list of 100+ plant species names. If a species is present at a certain location, the author used the term TRUE, and if not present, they used the term FALSE.
I would like to take this data set I've been sent and create a new column called "species" that stacks each of the species listed in rows on top of each other, keeping only the entries set to TRUE. Therefore, as my images point out, if two plants are both present at the same location, the GPS points will need to be duplicated so no data point is lost; and if a certain species is present at many locations, the species name will need to be repeated multiple times in the column. In the end, I will have a dataset that is thousands of rows long, but with only 5 columns in my header row.
Before
After
Here is a way to do it using base R:
# Notice that the link works if you include the /raw/ part
sample.data <-read.table("https://pastebin.com/raw/uAQD6nnM", header=T, sep="\t")
vars <- c("var0", "Var.1", "Coords_y", "Coords_x")
# Just selects the ones marked TRUE for each
alf <- sample.data[ sample.data$Alfaroa.williamsii, vars ]
aln <- sample.data[ sample.data$Alnus.acuminata, vars ]
alf$species <- "Alfaroa.williamsii"
aln$species <- "Alnus.acuminata"
final <- rbind(alf,aln)
final
var0 Var.1 Coords_y Coords_x species
192 191 7.10000 -73.00000 Alfaroa.williamsii
101 100 -13.18000 -71.59000 Alfaroa.williamsii
36 35 10.18234 -84.10683 Alnus.acuminata
38 37 10.26787 -84.05528 Alnus.acuminata
To do it more generally, using dplyr and tidyr, you can use the gather function:
library(dplyr)
library(tidyr)
tidyr::gather(sample.data, key = "species", value = "keep", 5:6) %>%
dplyr::filter(keep) %>%
dplyr::select(-keep)
Just replace the 5:6 with the indices of the columns of different species.
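If there are too many species columns to list by index, they can also be selected by exclusion; a sketch that gathers everything except the four header columns (assuming var0, Var.1, Coords_y and Coords_x are the only non-species columns):
tidyr::gather(sample.data, key = "species", value = "keep",
              -var0, -Var.1, -Coords_y, -Coords_x) %>%
  dplyr::filter(keep) %>%
  dplyr::select(-keep)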
I could not download the data so I made some:
sample.data = data.frame(var0 = c(192, 36, 38, 101), var1 = c(191, 35, 37, 100),
                         y = c(7.1, 10.1, 10.2, -13.8), x = c(-73, -84, -84, -71),
                         Alfaroa = c(T, F, F, T), Alnus = c(T, T, T, F))
The code that gives the requested result is:
dfAlfaroa <- sample.data %>% filter(Alfaroa) %>% select(-Alnus) %>%
  rename("Species" = "Alfaroa") %>% replace("Species", "Alfaroa")
dfAlnus <- sample.data %>% filter(Alnus) %>% select(-Alfaroa) %>%
  rename("Species" = "Alnus") %>% replace("Species", "Alnus")
rbind(dfAlfaroa,dfAlnus)
I did search for this, linking to entries HERE, HERE and HERE.
But they don't answer my question.
Code:
for (i in 1:nrow(files.df)) {
  Final <- parser(as.character(files.df[i, ]))
  Final2 <- rbind(Final2, Final)
}
files.df contains over 30 filenames (read from a directory using list.files), each of which is passed to a custom function parser that returns a data frame holding over 100 rows (the number varies from one file to the next). Both Final and Final2 are initialised with NA outside the for loop. The script runs fine with rbind, but there is a semantic issue: the resulting output is not what I expect. The resulting data frame is a lot smaller than the files combined.
I am certain it has to do with the rbind bit.
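For what it's worth, a common pattern that avoids both the NA initialisation and repeated rbind() calls is to collect the parsed data frames in a list and bind them once at the end; a sketch, assuming parser() and files.df as described above:
# Parse each file into a list element, then bind all rows in one go
parsed <- lapply(seq_len(nrow(files.df)),
                 function(i) parser(as.character(files.df[i, ])))
Final2 <- do.call(rbind, parsed)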
Secondly, I am looking to mimic the pivot functionality in Excel, whereby I have four columns: the first column is repeated for each row, and the second, third and fourth columns are distinct. The final data frame should be pivoted around the first column. Any idea as to how I can achieve this? I had a go at cast and melt, but to no avail.
Any thoughts would be great! It would be good if I could stick to the data frame structure.
Attaching pictures for reference:
With pivot on and ideal output
For your pivot functionality, which essentially requires transforming a data frame from long to wide format while aggregating on the Value column, you can use base R's reshape():
reshapedf <- reshape(df, v.names = c("Value"),
                     timevar = c("Identifier"),
                     idvar = c("Date"),
                     direction = "wide")
# RENAME COLUMNS
names(reshapedf) <- c('Date', 'A', 'B', 'C')
# CONVERT NAs TO ZEROS
reshapedf[, c(2:4)] <- data.frame(sapply(reshapedf[, c(2:4)],
                                         function(x) ifelse(is.na(x), 0, x)))
# RESET ROW.NAMES
row.names(reshapedf) <- 1:nrow(reshapedf)
Alternatively, there is the dedicated package reshape2, which, as seen below, tends to be less wordy and to require less post-formatting; hence, many prefer this transformation route. Plus, as with Excel pivot tables, other aggregate functions are available (sum, mean, length, etc.):
library(reshape2)
reshape2df <- dcast(df, Date~Identifier, sum, value.var="Value")
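Both snippets above assume a long-format df with Date, Identifier and Value columns; here is a purely hypothetical minimal example, just to make them runnable:
# Hypothetical long-format input: one row per Date/Identifier pair
df <- data.frame(Date = rep(c("2016-01-01", "2016-01-02", "2016-01-03"), each = 2),
                 Identifier = c("A", "B", "A", "C", "B", "C"),
                 Value = c(10, 20, 30, 40, 50, 60))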
A dummy column for a column c and a given value x equals 1 if c == x and 0 otherwise. Usually, when creating dummies for a column c, one excludes one value x of one's choice, as the last dummy column doesn't add any information beyond the already existing dummy columns.
Here's how I'm trying to create a long list of dummies for a column firm, in a data.table:
values <- unique(myDataTable$firm)
cols <- paste('d', as.character(values[-1]), sep = '_')  # gives us nice d_value names for columns
# the [-1]: I arbitrarily do not create a dummy for the first unique value
myDataTable[, (cols) := lapply(values[-1], function(x) firm == x)]
This code worked reliably for previous columns, which had fewer unique values. firm, however, is larger:
str(values)
num [1:3082] 51560090 51570615 51603870 51604677 51606085 ...
I get a warning when trying to add the columns:
Warning message:
truelength (6198) is greater than 1000 items over-allocated (length = 36). See ?truelength. If you didn't set the datatable.alloccol option very large, please report this to datatable-help including the result of sessionInfo().
As far as I can tell, all the columns that I need are still there. Can I just ignore this issue? Will it slow down future computations? I'm not sure what to make of this, or of the relevance of truelength.
Taking Arun's comment as an answer.
You should use the alloc.col function to pre-allocate the number of columns in your data.table to something bigger than the expected ncol.
alloc.col(myDataTable, 3200)
Additionally, depending on how you consume the data, I would recommend considering reshaping your wide table into a long table (see EAV). Then you only need one column per data type.
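A minimal sketch of what that long (EAV-style) layout could look like here, assuming the dummy columns created above follow the d_<value> naming pattern:
library(data.table)
# Melt all d_* dummy columns into a single (firm_dummy, is_firm) pair of columns
long_dt <- melt(myDataTable,
                measure.vars = patterns("^d_"),
                variable.name = "firm_dummy",
                value.name = "is_firm")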
I'm trying to use the reshape2 package to redistribute the columns across the top of my dataset. I have temperature and chl-a measured twice at three sites. However, when I melt and cast the data frame, the fun.aggregate defaults to length. I want to preserve the original values. Here is an example data set:
library(reshape2)
library(stringr)
df = data.frame(site = rep(1:3, each = 2),
                temp_2009 = c(23, 24, 25, 25, 23, 43),
                chla_2009 = c(3, 2, 3, 4, 5, 6),
                temp_2010 = c(23, 25, 26, 27, 23, 23),
                chla_2010 = c(2, 3, 5, 6, 2, 1))
df2 = melt(df, id.vars = 1, measure.vars = c(2:5))
df2 = cbind(df2, data.frame(str_split_fixed(df2$variable, "_", 2)))
df2 = df2[, -2]
names(df2)[3:4] = c("variable", "year")
dcast(df2, site + year ~ variable)
I think this has something to do with the way reshape2 handles duplicate values.
Any thoughts?
The rows are being aggregated as dcast can't distinguish them based upon the formula provided. If you want to maintain the original values then you'll need to include a field to identify the duplicates uniquely. To continue your code...
df2$group <- rep(1:2,12)
dcast(df2,site+year+group~variable)
Clearly this code is a bit simplistic (in particular your data must be sorted by 'group' with no missing values) but it should serve to demonstrate how to preserve the original values.
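A slightly more robust way to build that index, not relying on row order, is to number the duplicates within each site/year/variable combination; a sketch using base R's ave():
# Number the replicates within each site/year/variable combination
df2$group <- ave(seq_along(df2$value), df2$site, df2$year, df2$variable,
                 FUN = seq_along)
dcast(df2, site + year + group ~ variable)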
Another option, when trying to dcast a molten dataset with duplicate values, is to get dcast to calculate a mean/median/min/max (whatever is most relevant in your case) to 'resolve' the duplicates.
dcast(df2, site+year~variable, fun.aggregate = mean)
Obviously that deletes (amalgamates) records, which the OP says is undesirable.