consecutive setdiff in datatable list

consecutive setdiff in datatable list - r

Using data organised as
dtl <- replicate(10,data.table(id=sample(letters,10),val=sample(10)), simplify=F)
lapply(dtl, function(x){setkey(x,'id')})
I need to extract a list of datatables that contain the rows in dtl[[n+1]]] with id not present in dtl[[n]]. I assume it would be something like
dtl2 <- list(setdiff(dtl[[1]][['id']],dtl[[2]][['id']]),setdiff(dtl[[2]][['id']],dtl[[3]][['id']]...)
Please notice that, while the setdiff should only take the id column into account, I expect the result to contain all columns from each datatable.

I think this will do it for you:
mapply(setdiff, head(dtl, -1), tail(dtl, -1), SIMPLIFY = FALSE)
Edit: with your new expected output, I would still use mapply as above, but with one of the following two changes:
replace setdiff with function(x,y)setdiff(x$id, y$id)
replace dtl with ids <- lapply(dtl, "[", "id")
Edit2:: you've changed your expected output again by adding a plain English description that does not match the code you had provided... I think you are now looking for this:
mapply(function(x,y)y[setdiff(y$id, x$id), ],
head(dtl, -1), tail(dtl, -1), SIMPLIFY = FALSE)

Related

Converting list of Characters to Named num in R

I want to create a dataframe with 3 columns.
#First column
name_list = c("ABC_D1", "ABC_D2", "ABC_D3",
"ABC_E1", "ABC_E2", "ABC_E3",
"ABC_F1", "ABC_F2", "ABC_F3")
df1 = data.frame(C1 = name_list)
These names in column 1 are a bunch of named results of the cor.test function. The second column should consist of the correlation coefficents I get by writing ABC_D1$estimate, ABC_D2$estimate.
My problem is now that I dont want to add the $estimate manually to every single name of the first column. I tried this:
df1$C2 = paste0(df1$C1, '$estimate')
But this doesnt work, it only gives me this back:
"ABC_D1$estimate", "ABC_D2$estimate", "ABC_D3$estimate",
"ABC_E1$estimate", "ABC_E2$estimate", "ABC_E3$estimate",
"ABC_F1$estimate", "ABC_F2$estimate", "ABC_F3$estimate")
class(df1$C2)
[1] "character
How can I get the numeric result for ABC_D1$estimate in my dataframe? How can I convert these characters into Named num? The 3rd column should constist of the results of $p.value.

As pointed out by #DSGym there are several problems, including the it is not very convenient to have a list of character names, and it would be better to have a list of object instead.
Anyway, I think you can get where you want using:
estimates <- lapply(name_list, function(dat) {
dat_l <- get(dat)
dat_l[["estimate"]]
}
)
cbind(name_list, estimates)
This is not really advisable but given those premises...

Ok I think now i know what you need.
eval(parse(text = paste0("ABC_D1", '$estimate')))
You connect the two strings and use the functions parse and eval the get your results.
This it how to do it for your whole data.frame:
name_list = c("ABC_D1", "ABC_D2", "ABC_D3",
"ABC_E1", "ABC_E2", "ABC_E3",
"ABC_F1", "ABC_F2", "ABC_F3")
df1 = data.frame(C1 = name_list)
df1$C2 <- map_dbl(paste0(df1$C1, '$estimate'), function(x) eval(parse(text = x)))

relisting resulted data.frame respect to input list

I feed inputList to my custom function, after several workflows(few simple filtration), I end up with data.frame resultDF, which needed to be relisted. I used relist to make resultDF has the same structure of inputList, but I got an error. Is there any simplest way of relisting resultDF? Can anyone point me out how to make this happen? Any idea? sorry for this simple question.
Here is input data.frame within the list:
inputList <- list(
bar=data.frame(from=c(8,18,33,53),
to=c(14,21,39,61), val=c(48,7,10,8)),
cat=data.frame(from=c(6,15,20,44),
to=c(10,17,34,51), val=c(54,21,14,12)),
foo=data.frame(from=c(11,43), to=c(36,49), val=c(49,13)))
After several workflows, I end up with this data.frame:
resultDF <- data.frame(
from=c(53,8,6,15,11,44,43,44,43),
to=c(61,14,10,17,36,51,49,51,49),
val=c(8,48,54,21,49,12,13,12,13)
)
I need to relist resultDF with the same structure of inputList. I used relit method, but I got an error.
This is my desired list:
desiredList <- list(
bar=data.frame(from=c(8,53), to=c(14,61), val=c(48,8)),
cat=data.frame(from=c(6,15,44,44), to=c(10,17,51,51), val=c(54,21,12,12)),
foo=data.frame(from=c(11,43,43), to=c(36,49,49), val=c(49,13,13))
)
How can I achieve desiredList ? Thanks in advance :)

We can loop through the 'inputList' and check whether the pasted row elements in 'resultDF' are %in% list elements and use that index to subset the 'resultDF'
lapply(inputList, function(x) resultDF[do.call(paste, resultDF) %in% do.call(paste, x),])
Another option is a join and then split. We rbind the 'inputList' to a data.table with an additional column 'grp' specifying the list names, join with the 'resultDF' on the column names of 'resultDF', and finally split the dataset using the 'grp' column
library(data.table)
dt <- rbindlist(inputList, idcol = "grp")[resultDF, on = names(resultDF)]
split(dt[,-1, with = FALSE], dt$grp)

R apply function on data frame column

I need to , efficiently, parse one of my dataframe column (a url string)
and call a function (strsplit) to parse it, e.g.:
url <- c("www.google.com/nir1/nir2/nir3/index.asp")
unlist(strsplit(url,"/"))
My data frame : spark.data.url.clean looks like this:
classes url
[107,662,685,508,111,654,509] drudgereport.com/level1/level2/level3
This df has 100k rows and I don't want to loop/iterate over it, parse each url separately and write the results to a new data frame.
What I DO need/want is to create a new 5 column data frame:
df.result <- data.frame(fullurl = as.character(),baseurl=as.character(), firstlevel = as.character(), secondlevel=as.character(),thirdlevel=as.character(),classificaiton=as.character())
call one of the "apply" family function over spark.data.url.clean$url
and to write the results to the new data frame df.result such that the first column (fullurl) will be populated with the relevant spark.data.url.clean$url, the 2nd to 5th columns will be populated with the relevant results from applying
unlist(strsplit(url,"/"))
- taking the only the first, 2nd, 3rd and 4th elements from the resulted vector and putting it in the first,2nd, 3rd and 4th columns in df.result and finally putting the spark.data.url.clean$classes in the new data frame columns df.result$classificaiton
Sorry for the complication and let me know if anything need to be further cleared out.

There is no need for apply, as far as I see it.
Try this:
spark.data.url.clean <- data.frame(classes = c(107,662,685,508,111,654,509),
url = c("drudgereport.com/level1/level2/level3", "drudgeddddreport.com/levelfe1/lefvel2/leveel3",
"drudgeaasreport2.com/lefvel13/lffvel244/fel223", "otherurl.com/level1/second/level3",
"whateversite.com/level13/level244/level223", "esportsnow.com/first/level2/level3",
"reeport2.com/level13/level244/third"), stringsAsFactors = FALSE)
df.result <- spark.data.url.clean
names(df.result) <- c("classification", "fullurl")
df.result[c("baseurl", "firstlevel", "secondlevel", "thirdlevel")] <- do.call(rbind, strsplit(df.result$fullurl, "/"))

You could consider using the package splitstackshape to do this; we can use its cSplit-function. Setting drop to F ensures that the original column is preserved. Not that it returns a data.table, not a data.frame.
library(splitstackshape)
output <- cSplit(dat,2,sep="/", drop=F)
data used:
dat <- data.frame(classes="[107,662,685,508,111,654,509]",
url="drudgereport.com/level1/level2/level3")

Here's an option with data.table which should be pretty fast. If your data looks like this:
> df
# classes url
#1 [107,662,685,508,111,654,509] drudgereport.com/level1/level2/level3
You can do the following:
library(data.table)
setDT(df) # convert to data.table
cols <- c("baseurl", "firstlevel", "secondlevel", "thirdlevel") # define new column names
df[, (cols) := tstrsplit(url, "/", fixed = TRUE)[1:4]] # assign new columns
Now, the data looks like this:
> df
# classes url baseurl firstlevel secondlevel thirdlevel
#1: [107,662,685,508,111,654,509] drudgereport.com/level1/level2/level3 drudgereport.com level1 level2 level3

The simple solution is to use:
apply(row, 2, function(col) {})

Using paste to create logical expression for data frame subset

I have two dataframes, remove and dat (the actual dataframe). remove specifies various combinations of the factor variables found in dat, and how many to sample (remove$cases).
Reproducible example:
set.seed(83)
dat <- data.frame(RateeGender=sample(c("Male", "Female"), size = 1500, replace = TRUE),
RateeAgeGroup=sample(c("18-39", "40-49", "50+"), size = 1500, replace = TRUE),
Relationship=sample(c("Direct", "Manager", "Work Peer", "Friend/Family"), size = 1500, replace = TRUE),
X=rnorm(n=1500, mean=0, sd=1),
y=rnorm(n=1500, mean=0, sd=1),
z=rnorm(n=1500, mean=0, sd=1))
What I am trying to accomplish is to read in a row from remove and use it to subset dat. My current approach looks like:
remove <- expand.grid(RateeGender = c("Male", "Female"),
RateeAgeGroup = c("18-39","40-49", "50+"),
Relationship = c("Direct", "Manager", "Work Peer", "Friend/Family"))
remove$cases <- c(36,34,72,58,47,38,18,18,15,22,17,10,24,28,11,27,15,25,72,70,52,43,21,27)
# For each row of remove (combination of factor levels:)
for (i in 1:nrow(remove)) {
selection <- character()
# For each column of remove (particular selection):
for (j in 1:(ncol(remove)-1)){
add <- paste0("dat$", names(remove)[j], ' == "', remove[i,j], '" & ')
selection <- paste0(selection, add)
}
selection <- sub(' & $', '', selection) # Remove trailing ampersand
cat(selection, sep = "\n") # What does selection string look like?
tmp <- sample(dat[selection, ], size = remove$cases[i], replace = TRUE)
}
The output from cat() while the loop runs looks right, for example: dat$RateeGender == "Male" & dat$RateeAgeGroup == "18-39" & dat$Relationship == "Direct" and if I paste that into dat[dat$RateeGender == "Male" & dat$RateeAgeGroup3 == "18-39" & dat$Relationship == "Direct" ,], I get the right subset.
However, if I run the loop as written with dat[selection, ], each subset only returns NAs. I get the same outcome if I use subset(). Note, I have replace = TRUE in the above solely because of the random sampling. In the actual application, there will always be more cases per combination than required.
I know I can dynamically construct formulas for lm() and other functions using paste() in this way, but am obviously missing something in translating this into working with [,].
Any advice would be really appreciated!

You cannot use character expressions as you describe to subset either with [ or subset. If you wanted to do that you would have to construct the entire expression, and then use eval. That said, there is a better solution using merge. For example, let's get all the entries in dat that match the first two rows from remove:
merge(dat, remove[1:2,])
If we want all the rows that don't match those two, then:
subset(merge(dat, remove[1:2,], all.x=TRUE), is.na(cases))
This is assuming you want to join on the columns with the same names across the two tables. If you have a lot of data you should consider using data.table as it is very fast for this type of operation.

I upvoted BrodieG's answer before I realized it doesn't do what you wanted in situations wehre the size of the category is smaller than the number of samples desired. (In fact his method doesn't really do sampling at all, but I think it is is an elegant solution to a different question so I'm not reversing my vote. And you could use a similar split strategy as illustrated below with that data.frame as the input.).
sub <- lapply( split(dat, with(dat, paste(RateeGender, # split vector
RateeAgeGroup,
Relationship, sep="_")) ),
function (d) { n= with(remove, remove[
RateeGender==d$RateeGender[1]&
RateeAgeGroup==d$RateeAgeGroup[1]&
Relationship==d$Relationship[1],
"cases"])
cat(n);
sample(d, n, repl=TRUE) } )

Split the dataframe into subset dataframes and naming them on-the-fly (for loop)

I have 9880 records in a data frame, I am trying to split it into 9 groups of 1000 each and the last group will have 880 records and also name them accordingly. I used for-loop for 1-9 groups but manually for the last 880 records, but i am sure there are better ways to achieve this,
library(sqldf)
for (i in 0:8)
{
assign(paste("test",i,sep="_"),as.data.frame(final_9880[((1000*i)+1):(1000*(i+1)), (1:53)]))
}
test_9<- num_final_9880[9001:9880,1:53]
also am unable to append all the parts in one for-loop!
#append all parts
all_9880<-rbind(test_0,test_1,test_2,test_3,test_4,test_5,test_6,test_7,test_8,test_9)
Any help is appreciated, thanks!

A small variation on this solution
ls <- split(final_9880, rep(0:9, each = 1000, length.out = 9880)) # edited to Roman's suggestion
for(i in 1:10) assign(paste("test",i,sep="_"), ls[[i]])
Your command for binding should work.
Edit
If you have many dataframes you can use a parse-eval combo. I use the package gsubfn for readability.
library(gsubfn)
nms <- paste("test", 1:10, sep="_", collapse=",")
eval(fn$parse(text='do.call(rbind, list($nms))'))
How does this work? First I create a string containing the comma-separated list of the dataframes
> paste("test", 1:10, sep="_", collapse=",")
[1] "test_1,test_2,test_3,test_4,test_5,test_6,test_7,test_8,test_9,test_10"
Then I use this string to construct the list
list(test_1,test_2,test_3,test_4,test_5,test_6,test_7,test_8,test_9,test_10)
using parse and eval with string interpolation.
eval(fn$parse(text='list($nms)'))
String interpolation is implemented via the fn$ prefix of parse, its effect is to intercept and substitute $nms with the string contained in the variable nms. Parsing and evaluating the string "list($mns)" creates the list needed. In the solution the rbind is included in the parse-eval combo.
EDIT 2
You can collect all variables with a certain pattern, put them in a list and bind them by rows.
do.call("rbind", sapply(ls(pattern = "test_"), get, simplify = FALSE))
ls finds all variables with a pattern "test_"
sapply retrieves all those variables and stores them in a list
do.call flattens the list row-wise.

No for loop required -- use split
data <- data.frame(a = 1:9880, b = sample(letters, 9880, replace = TRUE))
splitter <- (data$a-1) %/% 1000
.list <- split(data, splitter)
lapply(0:9, function(i){
assign(paste('test',i,sep='_'), .list[[(i+1)]], envir = .GlobalEnv)
return(invisible())
})
all_9880<-rbind(test_0,test_1,test_2,test_3,test_4,test_5,test_6,test_7,test_8,test_9)
identical(all_9880,data)
## [1] TRUE

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

consecutive setdiff in datatable list - r

Related

Converting list of Characters to Named num in R

relisting resulted data.frame respect to input list

R apply function on data frame column

Using paste to create logical expression for data frame subset

Split the dataframe into subset dataframes and naming them on-the-fly (for loop)

Categories

Resources