how to use semi_join function properly

I have two files.
The first one looks like below,
> data.frame(head(Becker))
Becker
1 ABACK BACK A+ (BACK)
2 ABACUS ABACUS ~- (ABACUS)
3 ABANDGN ABANDON A+ ( BANDON)
4 ABANDONED ABANDON A+ (BANDON) +ED
5 ABANDONING ABANDON A+( BANDON) +ING
6 ABANOONMENT ABANDON A+( BANDON) #MENT
The second file looks like
> data.frame(head(unique))
Word
1 Aback
2 carful
3 basketful
4 meaningful
5 boxful
6 armsful
My ideal output:
1 ABACK BACK A+ (BACK)
That is, I only want to extract the words (together with their neighboring entries) that are present in both files.
I read similar questions and learned about the semi_join function.
However, I keep getting a warning message instead of the output I want. Here are my code and the message.
Could anyone help me apply this function properly? Or should I use a different function? If so, which one?
Thank you.
semi_join(Becker, unique, by=c("Becker"="Word"))
Becker <= output
1 as
Warning message:
Column `Becker`/`Word` joining factors with different levels, coercing to character vector
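A possible reason the join returns nothing useful is that the two key columns never hold equal strings: `Becker` stores the whole line ("ABACK BACK A+ (BACK)") in upper case, while `Word` stores a single mixed-case word ("Aback"). The sketch below is one way to line them up, assuming the headword is the first space-delimited token of `Becker` and that matching should ignore case. The helper column `key` and the name `words` (used instead of `unique`, which masks a base R function) are made up for illustration.

```r
library(dplyr)

# Toy stand-ins for the two files (values copied from the question).
Becker <- data.frame(
  Becker = c("ABACK BACK A+ (BACK)", "ABACUS ABACUS ~- (ABACUS)"),
  stringsAsFactors = FALSE
)
words <- data.frame(Word = c("Aback", "carful"), stringsAsFactors = FALSE)

# Hypothetical helper column `key`: first token, upper-cased, on both sides.
Becker$key <- toupper(sub(" .*$", "", Becker$Becker))
words$key  <- toupper(words$Word)

# Keep the rows of Becker whose key also appears in words.
result <- semi_join(Becker, words, by = "key")
result$key <- NULL
print(result)
```

Creating the data frames with `stringsAsFactors = FALSE` (or converting with `as.character()`) should also avoid the factor-level warning shown above.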

Related

Specify multiple conditions in long form data in R

How do I index the rows I need based on several conditions?
id<-c(65,65,65,65,65,900,900,900,900,900,900,211,211,211,211,211,211,211,45,45,45,45,45,45,45)
age<-c(19,22,23,24,25,21,26,31,32,37,38,22,23,25,28,29,31,32,30,31,36,39,42,44,48)
stat<-c('intern','reg','manage1','left','reg','manage1','manage2','left','reg',
'reg','left','intern','left','intern','reg','left','reg','manage1','reg','left','intern','manage1','left','reg','manage2')
mydf<-data.frame(id,age,stat)
I need to create 5 variables:
m01time & m12time: the number of years elapsed before becoming a level-1 manager (manage1), and then from manage1 to manage2, regardless of whether it is at the same job (numeric, in years).
change: whether they experienced a job change between manage1 and manage2, i.e. whether 'left' occurs somewhere in between (0 or 1).
m1p & m2p: the position held before becoming manage1 and manage2 (intern, reg, or manage1).
There is a lot of information here that I don't need and am not sure how to ignore (for example, all the jobs id 211 went through before reaching the one where they became a manager).
The end result should look something like this:
id m01time m12time change m1p m2p
1 65 4 NA NA reg <NA>
2 900 NA 5 0 <NA> manage1
3 211 1 NA NA reg <NA>
4 45 3 9 1 intern reg
I tried to use ifelse with lag() and lead() to capture some of the conditions, but other parts seem to need something closer to a for loop (such as detecting a 'left' somewhere in between), and I am not sure how to handle those.

I'd calculate the first three variables differently from m1p and m2p. Maybe there's an elegant unified approach that I don't see at the moment.
For the last position before each manager level you could do:
library(data.table)
mydt <- data.table(mydf)
# shift() lags stat within each id; match() finds the first manage1/manage2 row (NA if absent)
mydt[, .(m1p = shift(stat)[match("manage1", stat)],
m2p = shift(stat)[match("manage2", stat)]), by = id]
The other variables are more conveniently calculated in a wide data format:
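# One row per id, one column per status; unique() keeps the first age for each id/stat pair
dt <- dcast(unique(mydt, by = c("id", "stat")),
formula = id ~ stat, value.var = "age")
dt[, .(id, m01time = manage1 - intern,
m12time = manage2 - manage1,
change = manage1 < left & left < manage2)]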
Two caveats:
Reshaping might be quite costly for larger data sets.
I (over-)simplified your dummy data by ignoring duplicate combinations of id and stat.
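For the change variable specifically, a base R sketch that does not drop duplicate id/stat rows could look like the following. It assumes the rows of each id are already ordered by age (as in the dummy data) and that only the first manage1 and the first manage2 matter. The helper name `change_for_id` is made up for illustration.

```r
id <- c(65,65,65,65,65,900,900,900,900,900,900,211,211,211,211,211,211,211,
        45,45,45,45,45,45,45)
age <- c(19,22,23,24,25,21,26,31,32,37,38,22,23,25,28,29,31,32,
         30,31,36,39,42,44,48)
stat <- c('intern','reg','manage1','left','reg','manage1','manage2','left','reg',
          'reg','left','intern','left','intern','reg','left','reg','manage1',
          'reg','left','intern','manage1','left','reg','manage2')
mydf <- data.frame(id, age, stat, stringsAsFactors = FALSE)

# Hypothetical helper: 1 if "left" occurs between the first manage1 and the
# first manage2 of one id, 0 if it does not, NA if either level is missing.
change_for_id <- function(d) {
  i1 <- match("manage1", d$stat)   # row of first manage1, NA if none
  i2 <- match("manage2", d$stat)   # row of first manage2, NA if none
  change <- NA_integer_
  if (!is.na(i1) && !is.na(i2) && i2 > i1) {
    change <- as.integer("left" %in% d$stat[i1:i2])
  }
  data.frame(id = d$id[1], change = change)
}

# Apply the helper to each id and stack the one-row results.
res <- do.call(rbind, lapply(split(mydf, mydf$id), change_for_id))
print(res)
```

Note that split() orders the groups by id, so the rows of `res` come out sorted rather than in the original order.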

Conditional sentence for specific rows

Disclaimer: I am not that advanced with R Studio, so the answer to my question might be quite obvious.
Let's assume the following data set:
ID value1a value2a value1b value2b ...
1 2 3 ...
8 4 4
2 5 5
I want to create a further variable that is filled by an if statement, which logically should go as follows:
If, for ID = 1, "value1x" is over 5 and "value2x" is below 3, then add 1 to this new variable. The new variable should therefore work as a counter: its value indicates how often value1x is over 5 while value2x is below 3.
I hope my question makes sense, and I'd appreciate any answers!
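One way to read this request is: for each row, count the number of value1x/value2x pairs (a, b, ...) in which value1x is above 5 and value2x is below 3. The base R sketch below follows that reading. The numbers in the data frame are invented for illustration, and it assumes the value1 and value2 columns appear in the same suffix order.

```r
# Invented example data with two pairs of columns (a and b).
df <- data.frame(
  ID      = 1:3,
  value1a = c(6, 2, 7), value2a = c(1, 1, 4),
  value1b = c(8, 9, 9), value2b = c(2, 5, 0)
)

# Collect the value1* and value2* columns as matrices with matching column order.
v1 <- as.matrix(df[grep("^value1", names(df))])
v2 <- as.matrix(df[grep("^value2", names(df))])

# Element-wise test per pair, then count the TRUEs in each row.
df$counter <- rowSums(v1 > 5 & v2 < 3)
print(df)
```

Because the comparison is vectorised over the whole matrix, no explicit if statement or loop over rows is needed.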

What does support feature mean in result of function "term_stats()" from package "tm" in R and how is it different from count?

Running the following script produces the results below:
a <- c("Your work is going to fill a large part of your life, and the only way to be truly satisfied is to do what you believe is great work. And the only way to do great work is to love what you do. If you haven't found it yet, keep looking. Don't settle. As with all matters of the heart, you'll know when you find it. - Steve Jobs")
a_source <- VectorSource(a)
a_corpus <- VCorpus(a_source)
term_stats(a_corpus)
term count support
1 . 5 1
2 to 5 1
3 is 4 1
4 you 4 1
5 , 3 1

Support is the number of documents where the word occurs, count is the number of occurrences. You need both if doing tf-idf.
library(tm)
library(corpus) # term_stats() is provided by the corpus package
txt <- c("Your work is going to fill a large part of your life,
and the only way to be truly satisfied is to do what you
believe is great work.
And the only way to do great work is to love what you do.
If you haven't found it yet, keep looking. Don't settle.
As with all matters of the heart, you'll know when you find it.
- Steve Jobs")
term_stats(VCorpus(VectorSource(txt)))[1:5,]
term count support
. 5 1
to 5 1
is 4 1
#Split txt into 4 docs
txt_df <- data.frame( txt = c(
"Your work is going to fill a large part of your life,
and the only way to be truly satisfied is to do what you
believe is great work." ,
"And the only way to do great work is to love what you do." ,
"If you haven't found it yet, keep looking. Don't settle." ,
"As with all matters of the heart, you'll know when you find it. -
Steve Jobs"))
term_stats(VCorpus(VectorSource(txt_df$txt)))[1:6,]
term count support
. 5 4
you 4 4
, 3 3
the 3 3
to 5 2
is 4 2
By default, the result is sorted by support.
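The difference between the two columns can also be reproduced by hand in base R. This is only a sketch: the tokenizer here (lower-casing and splitting on anything that is not a letter or apostrophe) is an assumption and is simpler than what term_stats() uses, so punctuation tokens are not counted.

```r
docs <- c("And the only way to do great work is to love what you do.",
          "If you haven't found it yet, keep looking. Don't settle.")

# Crude tokenizer: lower-case, split on non-letters, drop empty strings.
tokens <- lapply(strsplit(tolower(docs), "[^a-z']+"),
                 function(x) x[x != ""])

# count: total occurrences of each term across all documents.
count <- table(unlist(tokens))

# support: number of documents containing each term at least once.
support <- table(unlist(lapply(tokens, unique)))

count[c("to", "you")]
support[c("to", "you")]
```

Here "to" appears twice but only in the first document, while "you" appears once in each document, so the two terms share a count but differ in support.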

R: Subsetting rows by group based on time difference

I have the following data frame:
group_id date_show date_med
1 1976-02-07 1971-04-14
1 1976-02-09 1976-12-11
1 2011-03-02 1970-03-22
2 1993-08-04 1997-06-13
2 2008-07-25 2006-09-01
2 2009-06-18 2005-11-12
3 2009-06-18 1999-11-03
I want to subset my data frame in such a way that the new data frame only shows the rows in which the values of date_show are further than 10 days apart but this condition should only be applied per group. I.e. if the values in the date_show column are less than 10 days apart but the group_ids are different, I need to keep both entries. What I want my result to look like based on the above table is:
group_id date_show date_med
1 1976-02-07 1971-04-14
1 2011-03-02 1970-03-22
2 1993-08-04 1997-06-13
2 2008-07-25 2006-09-01
2 2009-06-18 2005-11-12
3 2009-06-18 1999-11-03
Which row gets deleted isn't important, because the only reason I'm subsetting is to count the rows I am left with after applying this criterion.
I've tried playing around with the diff function, but I'm not sure how to do this in the simplest possible way. The problem already sits inside another sapply call, so I'm trying to avoid any additional loop (in this case over group_id).
The df I'm working with has around 100 000 rows. Ideally, I would like to do this with base R because I have no rights to install any additional packages on the machine I'm working on but if this is not possible (or if solving this with an additional package would be significantly better), I can try and ask my admin to install it.
Any tips would be appreciated!
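A base R sketch of one possible approach: sort by group and date, compute the gap to the previous row within each group with ave() and diff(), and keep the rows whose gap exceeds 10 days. It assumes date_show is (or can be converted to) a Date, and it compares each row only with the row directly before it, so a chain of closely spaced dates would lose every row but the first.

```r
df <- data.frame(
  group_id  = c(1, 1, 1, 2, 2, 2, 3),
  date_show = as.Date(c("1976-02-07", "1976-02-09", "2011-03-02",
                        "1993-08-04", "2008-07-25", "2009-06-18",
                        "2009-06-18")),
  date_med  = as.Date(c("1971-04-14", "1976-12-11", "1970-03-22",
                        "1997-06-13", "2006-09-01", "2005-11-12",
                        "1999-11-03"))
)

# Sort so that diff() compares neighbouring dates inside each group.
df <- df[order(df$group_id, df$date_show), ]

# Gap in days to the previous row of the same group; Inf for the first row.
gap <- ave(as.numeric(df$date_show), df$group_id,
           FUN = function(x) c(Inf, diff(x)))

# Keep rows that are more than 10 days after their predecessor.
res <- df[gap > 10, ]
nrow(res)
```

Since ave() handles the grouping internally, this avoids an explicit loop over group_id and needs no additional packages.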

simple rank formula

I'm looking for a mathematical ranking formula.
Sample is
2008 2009 2010
A 5 6 4
B 6 7 5
C 7 8 2
I want to add a rank column for each period code field
rank
2008 2009 2010 2008 2009 2010
B 6 7 5 2 1 1
A 5 6 4 3 2 2
C 7 2 2 1 3 3
Please do not reply with methods that loop through the rows and columns, incrementing the rank value as they go; that's easy. I'm looking for a formula, much like finding the percent of total (item / total). I know I've seen this before but am having a tough time locating it.
Thanks in advance!

sort ((letters_col, number_col) descending by number_col)
As efficient as your sort alg.
Then number the rows, of course
Edit
I really got upset by your comment "please don't up vote this answer, sorting and loop is not what I'm asking for. i specifically stated this in my original question. " , and the negative votes, because, as you may have noted by the various answers received, it's basically correct.
However, I remained pondering where and how you may "have seen this before".
Well, I think I got the answer: You saw this in Excel.
Look at this:
This is the result after entering the formulas and sorting by column H.
It's exactly what you want ...

What are you using? If you're using Excel, you're looking for RANK(num, ref).
=RANK(B2,B$2:B$9)
I don't know of any programming language that has that built in, it would always require a loop of some form.

If you want the rank of a single element, you can do it in O(n) by looping through the elements, counting how many have value above the given element, and adding 1.
If you want the rank of all the elements, the best (and really only) way is to sort the elements. Anything else you do will be equivalent to sorting (there is no "formula")
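Both ideas can be sketched in R, using the 2008 column of the question and assuming rank 1 goes to the largest value. The helper name `rank_one` is made up for illustration. R also ships a built-in rank() function, which is applied to the negated values here to get a descending rank.

```r
x <- c(A = 5, B = 6, C = 7)

# Rank of a single value: count how many values are larger, then add 1.
rank_one <- function(value, values) sum(values > value) + 1

# Apply the counting rule to every element (O(n^2) overall).
ranks_count <- vapply(x, rank_one, numeric(1), values = x)

# Sort-based ranks: the i-th position in descending order gets rank i.
ranks_sort <- integer(length(x))
ranks_sort[order(x, decreasing = TRUE)] <- seq_along(x)

# Built-in alternative: rank() is ascending, so rank the negated values.
ranks_builtin <- rank(-x)

print(ranks_count)
print(ranks_sort)
print(ranks_builtin)
```

With tied values the three variants can differ: the counting rule gives tied items the same (minimum) rank, the order()-based version breaks ties by position, and rank() averages tied ranks by default.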

Are you using T-SQL? T-SQL RANK() may pull what you want.
