fast join data.table (potential bug, checking before reporting)

fast join data.table (potential bug, checking before reporting) - r

This might be a bug. In that case, I will delete this question and report as bug. I would like someone to take a look to make sure I'm not doing something incorrectly so I don't waste the developer time.
test = data.table(mo=1:100, b=100:1, key=c("mo", "b"))
mo = 1
test[J(mo)]
That returns the entire test data.table instead of the correct result returned by
test[J(1)]
I believe the error might be coming from test having the same column name as the table which is being joined by, mo. Does anyone else get the same problem?

This is a scoping issue, similar to the one discussed in data.table-faq 2.13 (warning, pdf). Because test contains a column named mo, when J(mo) is evaluated, it returns that entire column, rather than value of the mo found in the global environment, which it masks. (This scoping behavior is, of course, quite nice when you want to do something like test[mo<4]!)
Try this to see what's going on:
test <- data.table(mo=1:5, b=5:1, key=c("mo", "b"))
mo <- 1
test[browser()]
Browse[1]> J(mo)
# mo
# 1: 1
# 2: 2
# 3: 3
# 4: 4
# 5: 5
# Browse[1]>
As suggested in the linked FAQ, a simple solution is to rename the indexing variable:
MO <- 1
test[J(MO)]
# mo b
# 1: 1 6
(This will also work, for reasons discussed in the documentation of i in ?data.table):
mo <- data.table(1)
test[mo]
# mo b
# 1: 1 6

This is not a bug, but documented behaviour afaik. It's a scoping issue:
test[J(globalenv()$mo)]
mo b
1: 1 100

Related

Specify multiple conditions in long form data in R

How do I index rows I need by with specifications?
id<-c(65,65,65,65,65,900,900,900,900,900,900,211,211,211,211,211,211,211,45,45,45,45,45,45,45)
age<-c(19,22,23,24,25,21,26,31,32,37,38,22,23,25,28,29,31,32,30,31,36,39,42,44,48)
stat<-c('intern','reg','manage1','left','reg','manage1','manage2','left','reg',
'reg','left','intern','left','intern','reg','left','reg','manage1','reg','left','intern','manage1','left','reg','manage2')
mydf<-data.frame(id,age,stat)
I need to create 5 variables:
m01time & m12time: measure the amount of years elapsed before becoming a level1 manager (manage1), and then since manage1 to manage2 regardless of whether or not it's at the same job. (numeric in years)
change: capture whether or not they experienced a job change between manage1 and manage2 (if 'left' happens somewhere in between manage1 and manage2), (0 or 1)
& 4: m1p & m2p: capture the position before becoming manager1 and manager2 (intern, reg, or manage1).
There's a lot of information I don't need here that I am not sure how to ignore (all the jobs 211 went through before going to one where they become a manager).
The end result should look something like this:
id m01time m02time change m1p m2p
1 65 4 NA NA reg <NA>
2 900 NA 5 0 <NA> manage1
3 211 1 NA NA reg <NA>
4 45 3 9 1 intern reg
I tried to use ifelse with lag() and lead() to capture some conditions, but there are more for loop type of jobs (such as how to capture a "left" somewhere in between) that I am not sure what to do with.

I'd calculate the variables the first three variables differently than m1p and m2p. Maybe there's an elegant unified approach that I don't see at the moment.
So for the last position before manager you could do:
mydt <- data.table(mydf)
mydt[,.(m1p=stat[.I[stat=="manage1"]-1],
m2p=stat[.I[stat=="manage2"]-1]),by=id]
The other variables are more conveniently calculated in a wide data.format:
dt <- dcast(unique(mydt,by=c("id","stat")),
formula=id~stat,value.var="age")
dt[,.(m01time = manage1-intern,
m12time = manage2-manage1,
change = manage1<left & left<manage2)]
Two caveats:
reshaping might be quite costly larger data sets
I (over-)simplified your dummy data by ignoring duplicates of id and stat

how to use semi_join function properly

I have two files.
The first one looks like below,
> data.frame(head(Becker))
Becker
1 ABACK BACK A+ (BACK)
2 ABACUS ABACUS ~- (ABACUS)
3 ABANDGN ABANDON A+ ( BANDON)
4 ABANDONED ABANDON A+ (BANDON) +ED
5 ABANDONING ABANDON A+( BANDON) +ING
6 ABANOONMENT ABANDON A+( BANDON) #MENT
The second file looks like
> data.frame(head(unique))
Word
1 Aback
2 carful
3 basketful
4 meaningful
5 boxful
6 armsful
My ideal output
1 ABACK BACK A+ (BACK)
That is, I only wanted to extract words(including the neighbor words) that are present in the both files.
I read similar questions and I learned about semi_join function.
However, I kept getting the error message. Here is my code and the error message.
Could you anyone help me how to apply this function properly? or should I use different functions? If so, which function should I use?
Thank you.
semi_join(Becker, unique, by=c("Becker"="Word"))
Becker <= output
1 as
Warning message:
Column `Becker`/`Word` joining factors with different levels, coercing to character
vector

How do I find the sum of a category under a subset?

So... I'm very illiterate when it comes to RStudio and I'm using this program for a class... I'm trying to figure out how to sum a subset of a category. I apologize in advance if this doesn't make sense but I'll do my best to explain because I have no clue what I'm doing and would also appreciate an explanation of why and not just what the answer would be. Note: The two lines I included are part of the directions I have to follow, not something I just typed in because I knew how to - I don't... It's the last part, the sum, that I am not explained how to do and thus I don't know what to do and would appreciate help figuring out.
For example,
I have this:
category_name category2_name
1 ABC
2 ABC
3 ABC
4 ABC
5 ABC
6 BDE
5 EFG
7 EFG
I wanted to find the sum of these numbers, so I was told to put in this:
sum(dataname$category_name)
After doing this, I'm asked to type this in, apparently creating a subset.
allabc <- subset(dataname, dataname$category_name2 == "abc")
I created this subset and now I have a new table popped up with this subset. I'm asked to sum only the numbers of this ABC subset... I have absolutely no clue on how to do this. If someone could help me out, I'd really appreciate it!

R is the software you are using. It is case-sensitive. So "abc" is not equal to "ABC".
The arguments are the "things" you put inside functions. Some arguments have the same name as the functions (which is a little confusing at first, but you get used to this eventually). So when I say the subset argument, I am talking about your second argument to the subset function, which you didn't name. That's ok, but when starting to learn R, try to always name your arguments.
So,
allabc <- subset(dataname, dataname$category_name2 == "abc")
Needs to be changed to:
allabc <- subset(dataname, subset=category2_name == "ABC")
And you also don't need to specify the name of the data again in the subset argument, since you've done that already in the first argument (which you didn't name, but almost everyone never bothers to do that).

This is the most easily done using tidyverse.
# Your data
data <- data.frame(category_name = 1:8, category_name2 = c(rep("ABC", 5), "BDE", "EFG", "EFG"))
# Installing tidyverse
install.packages("tidyverse")
# Loading tidyverse
library(tidyverse)
# For each category_name2 the category_name is summed
data %>%
group_by(category_name2) %>%
summarise(sum_by_group = sum(category_name))
# Output
category_name2 sum_by_group
ABC 15
BDE 6
EFG 15

Tallying up data

I have two data sets. One has 2 million cases (individual donations to various causes), the other has about 38,000 (all zip codes in the U.S.).
I want to sort through the first data set and tally up the total number of donations by zip code. (Additionally, the total for each zip code will be broken down by cause.) Each case in the first data set includes the zip code of the corresponding donation and information about what kind of cause it went to.
Is there an efficient way to do this? The only approach that I (very much a novice) can think of is to use a for ... if loop to go through each case and count them up one by one. This seems like it would be really slow, though, for data sets of this size.
edit: thanks, #josilber. This gets me a step closer to what I'm looking for.
One more question, though. table seems to generate frequencies, correct? What if I'm actually looking for the sum for each cause by zip code? For example, if the data frame looks like this:
dat3 <- data.frame(zip = sample(paste("Zip", 1:3), 2000000, replace=TRUE),
cause = sample(paste("Cause", 1:3), 2000000, replace=TRUE),
amt = sample(250:2500, 2000000, replace=TRUE))
Suppose that instead of frequencies, I want to end up with output that looks like this?
# Cause 1(amt) Cause 2(amt) Cause 3(amt)
# Zip 1 (sum) (sum) (sum)
# Zip 2 (sum) (sum) (sum)
# Zip 3 (sum) (sum) (sum)
# etc. ... ... ...
Does that make sense?

Sure, you can accomplish what you're looking for with the table command in R. First, let's start with a reproducible example (I'll create an example with 2 million cases, 3 zip codes, and 3 causes; I know you have more zip codes and more causes but that won't cause the code to take too much longer to run):
# Data
set.seed(144)
dat <- data.frame(zip = sample(paste("Zip", 1:3), 2000000, replace=TRUE),
cause = sample(paste("Cause", 1:3), 2000000, replace=TRUE))
Please note that it's a good idea to include a reproducible example with all your questions on Stack Overflow because it helps make sure we understand what you're asking! Basically you should include a sample dataset (like the one I've just included) along with your desired output for that dataset.
Now you can use the table function to count the number of donations in each zip code, broken down by cause:
table(dat$zip, dat$cause)
# Cause 1 Cause 2 Cause 3
# Zip 1 222276 222004 222744
# Zip 2 222068 222791 222363
# Zip 3 221015 221930 222809
This took about 0.3 seconds on my computer.

could this work?-
aggregate(amt~cause+zip,data=dat3,FUN=sum)
cause zip amt
1 Cause 1 Zip 1 306231179
2 Cause 2 Zip 1 306600943
3 Cause 3 Zip 1 305964165
4 Cause 1 Zip 2 305788668
5 Cause 2 Zip 2 306306940
6 Cause 3 Zip 2 305559305
7 Cause 1 Zip 3 304898918
8 Cause 2 Zip 3 304281568
9 Cause 3 Zip 3 303939326

R eval list strings for data.frames and then concat the data.frames

I have the following situation where Im pretty desperate.
paste("crossdata","$geno$'",1:4,"'$data",sep="")
generates 4 strings which look like that:
"crossdata$geno$'1'$data" "crossdata$geno$'2'$data" "crossdata$geno$'3'$data" "crossdata$geno$'4'$data"
I want to retrieve the corresponding data.frames of these 4 strings via evaluation of one of these strings and the combine them via cbind. However when Im doing something like this:
cbind(sapply(parse(text=paste("crossdata","$geno$'",i,"'$data",sep="")),eval))
that does not work. Can anybody help me out?
Thanks

datlist <- list(adat=data.frame(u=1:5,v=6:10),bdat=data.frame(x=11:15,y=16:20))
extdat <- c("datlist$adat","datlist$bdat")
do.call('cbind',lapply(extdat,function(i) eval(parse(text=i))))
u v x y
1 1 6 11 16
2 2 7 12 17
3 3 8 13 18
4 4 9 14 19
5 5 10 15 20
Of course this uses eval + parse, which usually means you are on the wrong track.

Using the combination of parse and eval is like saying that you know how to get from New York City to Boston and therefore making all your travel plans by going from your origin to New York, then to Boston, then to your desitination. In some cases this may not be to bad, but it is a bit of a long detour if you are traveling from London to Paris.
You should first learn the relationship and difference between subsetting lists using $ and [[ (see ?'[[' for the documentation) and when it is, and more importantly, is not appropriate to use $. Once you understand that you should be able to find solutions that do not require parse and eval.
Your problem may be as simple as (untested since your example is not reproducible):
do.call( cbind, lapply( 1:4, function(x) crossdata[['geno']][[x]][['data']] ) )
or possibly
do.call(cbind, lapply(as.character(1:4), function(x) crossdata$geno[[x]]$data ) )

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

fast join data.table (potential bug, checking before reporting) - r

This is not a bug, but documented behaviour afaik. It's a scoping issue: test[J(globalenv()$mo)] mo b 1: 1 100

Related

Specify multiple conditions in long form data in R

how to use semi_join function properly

How do I find the sum of a category under a subset?

Tallying up data

R eval list strings for data.frames and then concat the data.frames

Categories

Resources