Replace subset of data table with other data table - r

I feel a bit silly asking this, but I only want to do something that I know how to do with a data.frame and have not yet found a nice way to do with a data.table. All other similar questions seem far more complicated than what I have in mind. I simply want to replace a subset of a data.table with another data.table, based only on a row index and a choice of columns.
An MWE follows:
library(data.table)

x.df <- data.frame(a = c(1, 2, 3),
                   b = c(2, NA, NA),
                   c = c(3, NA, NA))
x.dt <- data.table(x.df)
x.df.replace <- data.frame(b = c(10, 11), c = c(22, 21))
x.dt.replace <- data.table(x.df.replace)
This works like a charm with a data.frame:
x.df[is.na(x.df$b),2:3]<-x.df.replace
With the data.table, on the other hand, I would like to refer to the columns by name. I only know how to replace each column individually, not both jointly:
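x.dt[is.na(b), ]
# Replacing one column at a time works:
x.dt[is.na(b), c := x.dt.replace[, c]]
x.dt[is.na(b), b := x.dt.replace[, b]]
# Attempts to replace both columns at once, neither of which works:
x.dt[is.na(b), list(b, c)] <- x.dt.replace
x.dt[is.na(b), list(b, c) := x.dt.replace]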

I had the same issue and came across this question with no answer. The comments above helped me find the solution to my problem, so I decided to post it here. The difference may simply be down to data.table versions (I am using version 1.11.8), since this is a relatively old question.
The solution is to store the names of the columns to be replaced in a character vector and wrap that vector in plain parentheses (), instead of using .() or list():
colunas <- c("b","c")
x.dt[is.na(b), (colunas) := x.dt.replace]
Hope this is useful
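For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the tables from the question and applies the replacement. It assumes that the number of rows matched by is.na(b) equals the number of rows in x.dt.replace, and that the replacement columns are in the same order as the names on the left-hand side, since the values are assigned by position rather than by name.

```r
library(data.table)

x.dt <- data.table(a = c(1, 2, 3),
                   b = c(2, NA, NA),
                   c = c(3, NA, NA))
x.dt.replace <- data.table(b = c(10, 11), c = c(22, 21))

# Names of the columns to overwrite. Wrapping the variable in () tells
# data.table to use its contents instead of creating a column named "colunas".
colunas <- c("b", "c")

# Update both columns by reference, only in the rows where b is NA.
x.dt[is.na(b), (colunas) := x.dt.replace]

print(x.dt)
```

Writing the names inline, as in x.dt[is.na(b), c("b", "c") := x.dt.replace], should behave the same way.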

Related

How to process a dataframe row by row, passing the columns as args to a function, *as a single call to function*

Still quite new to R. It's quite possible my question is due to gaps in my thinking about this problem, but after a few hours of googling, I'm still stuck.
The problem:
I have a data frame (tibble) that contains 6 rows and 3 columns.
The columns are Filename, Metadata1, Metadata2.
I want to call a function for each row, as follows:
function(Filename, Metadata1, Metadata2).
In other languages, this would be a simple for loop, but I am completely stuck on how to do this in R, having looked at both base and tidyverse approaches. All the answers I've come across are variations of calling the function on every element of the data frame or matrix, whereas I want to effectively pass the whole row to the function, as individual args.
It's probably blindingly obvious, but I would really appreciate some guidance.
EDIT:
I ran across mapply, and it seems to do the job I need, but I have no idea if this is the only or the best method. This is what I'm working with currently:
testfunc <- function(a, b, c) {
  str(a)
  str(b)
  str(c)
}
discard <- mapply(testfunc, a = files_sorted$file, b = files_sorted$AppID, c = files_sorted$server)
Moments after I posted the last edit to my question, I hit the exact issue that @mrflick mentioned, where my function was not vectorised.
In the end, I did end up using a for loop, this is what I settled on:
overall_data <- tibble()
for (a in transpose(files_sorted)) {
  df <- processFile(file = a[1]$list_files, srv = a[2]$server, tap = a[3]$AppID)
  # view(df)
  overall_data <- bind_rows(overall_data, df)
}
I'm sure I'll learn better ways to tackle this in future, but leaving this here
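For comparison, here is a sketch of one base R way to call a function once per row with the columns passed as named arguments, without writing the loop by hand. The data frame and the describe_row function are made up for illustration; they stand in for files_sorted and processFile, whose details are not shown above. Because a data frame is a list of columns, do.call() can hand every column to Map(), which then walks them in parallel and matches them to the function's arguments by name.

```r
# Hypothetical stand-in for the table of files and metadata.
files <- data.frame(
  Filename  = c("a.csv", "b.csv"),
  Metadata1 = c("srv1", "srv2"),
  Metadata2 = c(10, 20),
  stringsAsFactors = FALSE
)

# Hypothetical per-row function; the argument names match the column names.
describe_row <- function(Filename, Metadata1, Metadata2) {
  paste(Filename, Metadata1, Metadata2, sep = "|")
}

# One call to describe_row per row; the result is a list with one element per row.
results <- do.call(Map, c(list(f = describe_row), files))

print(unlist(results, use.names = FALSE))
```

If each call returns a data frame instead of a string, the resulting list can be combined in one step afterwards (for example with do.call(rbind, results)) rather than growing a table inside a loop.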

Is there a way to apply plyr's count() function to every column individually?

Similar to this question but for R. I want to get a summary count of every value in each column of a data frame.
Currently, doing something like plyr::count(df[,1:10]) counts how many times each combination of values across a row occurs. Instead, I just want a quick way of printing out what values each of my columns even contains. I know this can be done with C-style recursion, but I'm hoping for a more elegant/simpler solution.
You can use lapply:
lapply(df, plyr::count)
Alternatively, to keep everything in base R, you can use table with stack to get similar output:
lapply(df, function(x) stack(table(x)))
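As a small worked example of the per-column idea, here is a sketch on made-up data. It uses as.data.frame() on the table instead of stack(), which is simply another base R way to get one row per distinct value; the useNA = "ifany" argument is my addition so that missing values are counted too.

```r
# Made-up example data.
df <- data.frame(
  colour = c("red", "blue", "red", "red"),
  size   = c("S", "M", "M", "S"),
  stringsAsFactors = FALSE
)

# One small data frame of value counts per column.
counts <- lapply(df, function(x) {
  as.data.frame(table(x, useNA = "ifany"), stringsAsFactors = FALSE)
})

print(counts$colour)
print(counts$size)
```

Each element of counts has a column x holding the distinct values and a column Freq holding how often each one occurs.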

R - Order of the data table records from subsetting columns

I am currently learning data.table in R. A few questions have me confused:
1. Does subsetting columns always preserve the order of records (i.e. rows 1, 2, 3 stay as rows 1, 2, 3 instead of becoming rows 1, 3, 2)? Does the same conclusion apply to other expressions, such as DB[[1]], DB$V1, etc.?
2. When subsetting multiple columns, I know I need to use something like DB[,.(V1, V2)], but I am confused about the result of DB[,V1, V2]. The code runs and seems to produce a result, but the rows are not in the same order as in the original table. If someone could explain what the latter code means, that would be a great help.
Thanks a lot!
I want to start with a small suggestion: if you post a data-processing question on SO, it is much better to include reproducible code in the question, plus the expected output if it isn't clear. You will reach a much bigger audience and get better answers. This is common practice on the r tag.
Subsetting preserves row order. The underlying storage is column-oriented, unlike a regular SQL database (which is not aware of row order), and subsetting works exactly the same as subsetting a vector in base R, just much faster.
Regarding [[ and $, these are just methods for extracting a column from a data.table, or an element from a list in general; you can use DB[[1]], DB[["V1"]] or DB$V1. They behave differently depending on whether the column or list element exists.
The third argument inside the data.table [ operator is by, which expects the columns to group by. So DB[, V1, V2] queries column V1 grouped by V2, without using any aggregate function. This is very different from DB[, .(V1, V2)], DB[, c("V1","V2"), with=FALSE], DB[, list(V1,V2)] or DB[, .SD, .SDcols=c("V1","V2")], which all select the two columns. Most of the API is borrowed from base R, from functions like subset() or with().
Finally, I would recommend going through the data.table vignettes. There is also my recent longish post that goes through various data.table examples: Boost Your Data Munging with R.
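To make the difference concrete, here is a minimal sketch on a made-up four-row table (the values are hypothetical; only the names DB, V1 and V2 come from the question). It shows that selecting two columns keeps the original row order, while passing V2 as the third argument groups by it, so the rows come back gathered by group.

```r
library(data.table)

# Made-up table with the column names used in the question.
DB <- data.table(V1 = 1:4, V2 = c("x", "y", "x", "y"))

# Plain column subset: rows stay in their original order.
cols <- DB[, .(V1, V2)]

# The third positional argument is by, so this is DB[, V1, by = V2].
# Rows are returned group by group, in order of each group's first appearance.
grouped <- DB[, V1, V2]

print(cols)
print(grouped)
```

In the grouped result the grouping column V2 comes first, and V1 is listed as 1, 3 for group x followed by 2, 4 for group y, which explains the reordering seen in the question.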

R - Operate on a column w/o explicitly reassigning it?

I'm often writing things like:
dataframe$this_column <- as.Date(dataframe$this_column)
That is, when changing some column in my data frame [table], I'm constantly writing the column twice. Is there some function that allows me to directly change the data frame w/o explicitly reassigning it? Say: ch(dataframe$this_column, as.Date())
EDIT: While similar, the potential duplicate is not the same. I am not looking for a way to shorten self-referential reassignments. I'm looking to avoid the explicit reassignment altogether. The answer I accepted here is an appropriate solution (and much better than the answers provided in the "duplicate" question, with regard to their relevance to my question).
Here is an example using the magrittr package:
library(magrittr)
x = c('2015-12-12','2015-12-13','2015-12-14')
df = data.frame(x)
df$x %<>% as.Date

Calculate e.g. a mean in a list with multi-column data.frames

I have a list of several data.frames. Each data.frame has several columns.
By using
mean(mylist$first_dataframe$a)
I can get the mean for a in this one data.frame.
However, I do not know how to calculate this over all the data.frames stored in my list, or over specific data.frames only.
I could use a loop but I was told that
apply() and its variations are better
I tried using several solutions I found via search but somehow it just doesn't work.
I assume I need to use
unlist()
Could you provide an example of how to calculate e.g. a mean for a data structure like mine.
A list with several data.frames containing several columns.
Update:
I'm sorry for the confusion. I wanted the grand mean for a specific column in all dataframes.
Thanks to Thomas for providing a working solution for calculating a grand mean for a specific column in all dataframes and to psychometriko for providing a useful solution for calculating means over all columns in all dataframes (and even for the case when non-numeric data is involved).
Thanks!
Is this what you are looking for?
set.seed(42)
mylist <- list(a = data.frame(foo = rnorm(10),
                              bar = rnorm(10)),
               b = data.frame(foo = rnorm(10),
                              bar = rnorm(10)),
               c = data.frame(foo = rnorm(10),
                              bar = rnorm(10)))
sapply(do.call("rbind",mylist),mean)
foo bar
0.1163340 -0.1696556
Note: do.call("rbind",mylist) returns something similar to what you referred to above with the unlist function, and then sapply, as referred to by Roland in his answer, just calls the function mean on each component (column) of the data.frame that results from the above do.call function.
Edit: In response to the question of how to deal with non-numeric data.frame components, the below solution admittedly isn't very elegant and I'm sure better ones exist, but here's the first thing I was able to think of:
set.seed(42)
mylist <- list(a = data.frame(rand = rnorm(10),
                              lets = sample(LETTERS, 10, replace = TRUE)),
               b = data.frame(rand = rnorm(10),
                              lets = sample(LETTERS, 10, replace = TRUE)),
               c = data.frame(rand = rnorm(10),
                              lets = sample(LETTERS, 10, replace = TRUE)))
sapply(do.call("rbind", mylist), function(x) {
  if (is.numeric(x)) mean(x)
})
$rand
[1] -0.02470602
$lets
NULL
This basically just creates a custom function that first tests whether each component is numeric and, if it is, returns the mean. If it isn't, it skips it.
The whole do.call('rbind', List) thing can be quite slow and prone to mishaps. If there is only one column you need the mean for, the best way is:
mean(sapply(mylist, function(X) X$rand))
It's about 10x faster than the do.call method.
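One caveat worth sketching: when the data.frames have different numbers of rows, sapply() returns a list instead of a matrix, so the one-liner above no longer works as written. A variant that should work in both cases is to unlist the column first. The data below is made up and deliberately unbalanced, to show that the grand mean (over all values) is not the same thing as the mean of the per-data.frame means.

```r
# Made-up list of data.frames with unequal numbers of rows.
mylist <- list(
  a = data.frame(foo = c(1, 2, 3), bar = c(10, 20, 30)),
  b = data.frame(foo = c(4, 5),    bar = c(40, 50))
)

# Grand mean of foo: pool every value from every data.frame, then average.
grand_mean <- mean(unlist(lapply(mylist, function(d) d$foo)))

# Mean of the per-data.frame means: each data.frame counts equally,
# regardless of how many rows it has.
mean_of_means <- mean(sapply(mylist, function(d) mean(d$foo)))

print(grand_mean)     # 3
print(mean_of_means)  # 3.25
```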
