Conditional sentence for specific rows - r

Disclaimer: I am not very advanced with RStudio, so the answer to my question may be obvious.
Let's assume the following data set:
ID value1a value2a value1b value2b ...
1 2 3 ...
8 4 4
2 5 5
I want to create a fourth variable that is filled by an if statement, which logically should go as follows:
If ID = 1 is over 5 in "value1x" and below 3 in "value2x", then add 1 to this fourth variable. Hence the fourth variable should function as a counter: its value indicates how often value1x is over 5 and value2x is below 3.
I hope my question makes sense, and I'd appreciate answers!
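One way to build such a counter, sketched here in base R under the assumption that the value1x and value2x columns come in matching pairs, is to compare the two sets of columns as matrices and count the TRUE results per row. The numbers in the second pair of columns are made up for illustration.

```r
# Hypothetical data laid out like the example above (second pair of values is invented).
dat <- data.frame(
  ID      = 1:3,
  value1a = c(8, 4, 4),
  value2a = c(2, 5, 5),
  value1b = c(6, 7, 2),
  value2b = c(1, 4, 0)
)

# Columns must be listed in matching order: value1a pairs with value2a, and so on.
v1 <- as.matrix(dat[, c("value1a", "value1b")])
v2 <- as.matrix(dat[, c("value2a", "value2b")])

# Element-wise test of both conditions, then count the TRUE results in each row.
dat$counter <- rowSums(v1 > 5 & v2 < 3)
```

Each entry of dat$counter then holds how many (value1x, value2x) pairs met both conditions for that ID.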

Related

Specify multiple conditions in long form data in R

How do I index rows I need by with specifications?
id<-c(65,65,65,65,65,900,900,900,900,900,900,211,211,211,211,211,211,211,45,45,45,45,45,45,45)
age<-c(19,22,23,24,25,21,26,31,32,37,38,22,23,25,28,29,31,32,30,31,36,39,42,44,48)
stat<-c('intern','reg','manage1','left','reg','manage1','manage2','left','reg',
'reg','left','intern','left','intern','reg','left','reg','manage1','reg','left','intern','manage1','left','reg','manage2')
mydf<-data.frame(id,age,stat)
I need to create 5 variables:
m01time & m12time: measure the number of years elapsed before becoming a level 1 manager (manage1), and then from manage1 to manage2, regardless of whether or not it's at the same job (numeric, in years).
change: capture whether or not they experienced a job change between manage1 and manage2, i.e. whether 'left' happens somewhere in between (0 or 1).
m1p & m2p: capture the position held before becoming manage1 and manage2 (intern, reg, or manage1).
There's a lot of information I don't need here that I am not sure how to ignore (all the jobs 211 went through before going to one where they become a manager).
The end result should look something like this:
   id m01time m12time change    m1p     m2p
1  65       4      NA     NA    reg    <NA>
2 900      NA       5      0   <NA> manage1
3 211       1      NA     NA    reg    <NA>
4  45       3       9      1 intern     reg
I tried to use ifelse with lag() and lead() to capture some conditions, but other parts seem to need a for-loop-like approach (such as detecting a "left" somewhere in between), and I am not sure how to handle those.
I'd calculate the first three variables differently than m1p and m2p. Maybe there's an elegant unified approach that I don't see at the moment.
So for the last position before manager you could do:
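library(data.table)
mydt <- data.table(mydf)
# shift() lags stat within each id; match() finds the first manage1/manage2 row (NA if absent)
mydt[, .(m1p = shift(stat)[match("manage1", stat)],
         m2p = shift(stat)[match("manage2", stat)]), by = id]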
The other variables are more conveniently calculated in a wide data format:
# one row per id, one column per stat, holding the age at its first occurrence
dt <- dcast(unique(mydt, by = c("id", "stat")),
            formula = id ~ stat, value.var = "age")
dt[, .(id,
       m01time = manage1 - intern,
       m12time = manage2 - manage1,
       change = manage1 < left & left < manage2)]
Two caveats:
reshaping might be quite costly for larger data sets
I (over-)simplified your dummy data by ignoring duplicates of id and stat
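Because the wide reshape keeps only the first occurrence of each stat, a per-id pass over the rows in their original order may be easier to reason about for m12time and change. The following is a minimal base R sketch, assuming rows are already ordered by age within each id and that only the first manage1 and the first manage2 per id matter. It does not attempt m01time, m1p or m2p.

```r
id <- c(65,65,65,65,65,900,900,900,900,900,900,211,211,211,211,211,211,211,45,45,45,45,45,45,45)
age <- c(19,22,23,24,25,21,26,31,32,37,38,22,23,25,28,29,31,32,30,31,36,39,42,44,48)
stat <- c('intern','reg','manage1','left','reg','manage1','manage2','left','reg',
          'reg','left','intern','left','intern','reg','left','reg','manage1',
          'reg','left','intern','manage1','left','reg','manage2')
mydf <- data.frame(id, age, stat, stringsAsFactors = FALSE)

# One small data frame per id, then bind them together.
per_id <- lapply(split(mydf, mydf$id), function(d) {
  i1 <- match("manage1", d$stat)   # row of the first manage1, NA if none
  i2 <- match("manage2", d$stat)   # row of the first manage2, NA if none
  both <- !is.na(i1) && !is.na(i2)
  data.frame(
    id      = d$id[1],
    m12time = if (both) d$age[i2] - d$age[i1] else NA_real_,
    # 1 if a 'left' row sits between the two promotions, 0 otherwise
    change  = if (both) as.integer(any(d$stat[i1:i2] == "left")) else NA_integer_
  )
})
result <- do.call(rbind, per_id)
```

Note that split() orders the groups by sorted id, so the rows of result come out as 45, 65, 211, 900 rather than in the order of first appearance.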

Assign new ID taking into account previous changes

Sorry I do not know how to properly title my question. It is easier to understand with an example.
Sample data
Consider the following example.
> l_ids=as.data.frame(cbind(a=c("strong","intense","intensity"),
+ id=c("1","2","3"),new_id=c("","1","2")),stringsAsFactors = FALSE)
> l_ids
          a id new_id
1    strong  1       
2   intense  2      1
3 intensity  3      2
I would like to update the id of each word in a with its new_id, if one applies. Consider this a synonym dictionary. When I iterate over new_id:
> for (i in 1:nrow(l_ids)){
+ if (nchar(l_ids$new_id[i])>0){
+ l_ids$id[i]=l_ids$new_id[i]
+ }
+ }
> l_ids
a id new_id
1 strong 1
2 intense 1 1
3 intensity 2 2
The problem is that I would like for intensity to also be given a 1. Is there a way to do this without having to iterate multiple times?
Update on background
I have a document where I have a list of synonyms. These are synonyms only relevant to the field of application of the problem. Example:
> dictionary
good bad
1 strong intense
2 intense intensity
3 light soft
I am then given a list of words, each with a given id. My task is to check if any of those words is in the bad column of dictionary and, if so, update it with the id of the word to its left. As can be seen, intensity would need two steps to become strong (a good word in the dictionary). Is there a way to do so without having to do multiple iterations? (say, a for loop)
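A chain of synonyms can be followed with a vectorized lookup that is repeated until nothing changes, so the number of passes depends on the longest chain rather than on the number of words. This is only a sketch: resolve_root() is a hypothetical helper name, and the input words are made up to match the example dictionary.

```r
# Dictionary as shown in the question.
dictionary <- data.frame(
  good = c("strong", "intense", "light"),
  bad  = c("intense", "intensity", "soft"),
  stringsAsFactors = FALSE
)

# Illustrative helper (not from any package): replace every word found in
# dict$bad by its dict$good counterpart until no word is left to replace.
resolve_root <- function(words, dict) {
  # An acyclic chain cannot be longer than nrow(dict), so this always terminates.
  for (pass in seq_len(nrow(dict))) {
    hit <- match(words, dict$bad)        # row in dict for each word, NA if absent
    found <- !is.na(hit)
    if (!any(found)) break               # nothing left to replace
    words[found] <- dict$good[hit[found]]
  }
  words
}

resolved <- resolve_root(c("strong", "intense", "intensity", "soft"), dictionary)
```

Once every word is mapped to its root, the ids can be taken from the root words in a single match() against the word list.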

Static variable next to a dynamic variable in R

I posted yesterday another question but I feel I need to clarify it.
Let's say I have this code
md.NAME <- subset(MyData, HotelName=="ALAMEDA")
md.NAME.fc <- subset(md.NAME, TIPO=="FORECAST")
md.NAME.fc.bar <- subset(md.NAME.fc, Market.Segment=="BAR")
What I want is for NAME to change according to a variable set before those 3 lines are run.
So NAME is dynamic in the sense that before these 3 lines I could say that NAME is now equal to JOHN, and later that NAME is now equal to PATRIC.
So after running those 3 lines twice (once for JOHN and once for PATRIC), I will get something like this in the environment:
6 data frames, 3 for JOHN and 3 for PATRIC:
DATAFRAME 1 WILL BE md.JOHN
DATAFRAME 2 WILL BE md.JOHN.fc
DATAFRAME 3 WILL BE md.JOHN.fc.bar
DATAFRAME 1 WILL BE md.PATRIC
DATAFRAME 2 WILL BE md.PATRIC.fc
DATAFRAME 3 WILL BE md.PATRIC.fc.bar
All the answers I had so far would help me only if "md" and "fc" or "fc.bar" are always the same. But I will have several variables like this, and their naming will change a lot. So the center part (NAME) is the only one that should change.
I could even have something like:
md.test$NAME <- ...
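A common alternative to generating variable names is to keep the results in a named list keyed by NAME, and to fall back on assign() with paste0() only if separate objects are really required. The sketch below uses a made-up MyData with the column names from the question; the list element names all, fc and fc.bar are assumptions.

```r
# Hypothetical stand-in for MyData; only the column names come from the question.
MyData <- data.frame(
  HotelName      = c("JOHN", "JOHN", "PATRIC", "PATRIC"),
  TIPO           = c("FORECAST", "ACTUAL", "FORECAST", "FORECAST"),
  Market.Segment = c("BAR", "BAR", "BAR", "CORP"),
  stringsAsFactors = FALSE
)

md <- list()
for (NAME in c("JOHN", "PATRIC")) {
  base <- subset(MyData, HotelName == NAME)
  fc   <- subset(base, TIPO == "FORECAST")
  bar  <- subset(fc, Market.Segment == "BAR")

  # Preferred: one list entry per name, e.g. md$JOHN$fc.bar
  md[[NAME]] <- list(all = base, fc = fc, fc.bar = bar)

  # Only if separate objects are needed: builds md.JOHN, md.JOHN.fc, md.JOHN.fc.bar, ...
  assign(paste0("md.", NAME), base)
  assign(paste0("md.", NAME, ".fc"), fc)
  assign(paste0("md.", NAME, ".fc.bar"), bar)
}
```

With the list, md[["PATRIC"]]$fc can be looked up from a character variable directly, whereas the assign() objects need get(paste0("md.", NAME, ".fc")) to be read back dynamically.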

Calendar (again) manipulations in R

I have code like this:
today<-as.Date(Sys.Date())
spec<-as.Date(today-c(1:1000))
df<-data.frame(spec)
stage.dates<-as.Date(c('2015-05-31','2015-06-07','2015-07-01','2015-08-23','2015-09-15','2015-10-15','2015-11-03'))
stage.vals<-c(1:8)
stagedf<-data.frame(stage.dates,stage.vals)
df['IsMonthInStage']<-ifelse(format(df$spec,'%m')==(format(stagedf$stage.dates,'%m')),stagedf$stage.vals,0)
This is producing the incorrect output, i.e.
df.spec, df.IsMonthInStage
2013-05-01, 0
2013-05-02, 1
2013-05-03, 0
....
2013-05-10, 1
It seems to be recycling: stage.dates is 8 long, and the 'TRUE' match repeats every 8th row. How do I fix this so that it flags 1 for every date whose month appears in stage.dates?
Or for bonus reputation - how do I set it up so that between different stage.dates, it will populate 1, 2, 3, etc of the most recent stage?
For example:
31st of May to 7th of June would be populated 1, 7th of June to 1st of July would be populated 2, etc, 3rd of November to 30th of May would be populated 8?
Thanks
Edit:
I appreciate the latter is functionally different to the former question. I am ultimately trying to arrive at both (for different reasons), so all answers appreciated
See if this works.
Cut and split your data based on stage.dates, treating them as your buckets. By the way, you don't need stage.vals here.
Cut And Split
library(plyr)  # provides mutate(), ldply() and arrange() used below
data<-split(df, cut(df$spec, stagedf$stage.dates, include.lowest=TRUE))
This should give you a list of data frames, split as per stage.dates.
Now mutate your data with the list index; this is what your stage.vals were going to be.
Mutate
data<-lapply(seq_along(data), function(index) {mutate(data[[index]],
IsMonthInStage=index)})
Now join the data frames in the list using ldply
Join
data=ldply(data)
This will, however, give out-of-order dates, which you can sort with arrange:
Sort
data<-arrange(data,spec)
Final Output
data[1:10,]
spec IsMonthInStage
1 2015-05-31 1
2 2015-06-01 1
3 2015-06-02 1
4 2015-06-03 1
5 2015-06-04 1
6 2015-06-05 1
7 2015-06-06 1
8 2015-06-07 2
9 2015-06-08 2
10 2015-06-09 2
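A base R alternative, offered as a sketch rather than a drop-in replacement: the original ifelse() compares two vectors of different lengths with ==, so the shorter one is recycled, whereas %in% tests membership. For the "most recent stage" variant, findInterval() returns how many stage dates fall on or before each date. The short date range below is made up for illustration, and the column name stage is an assumption.

```r
stage.dates <- as.Date(c("2015-05-31", "2015-06-07", "2015-07-01", "2015-08-23",
                         "2015-09-15", "2015-10-15", "2015-11-03"))

# Small illustrative range instead of the 1000 days in the question.
df <- data.frame(spec = seq(as.Date("2015-05-29"), as.Date("2015-06-09"), by = "day"))

# 1 if the month of the date matches the month of any stage date, else 0.
df$IsMonthInStage <- as.integer(format(df$spec, "%m") %in% format(stage.dates, "%m"))

# Index of the most recent stage date on or before each date; 0 means "before the first stage".
# stage.dates must be sorted in increasing order for findInterval().
df$stage <- findInterval(as.numeric(df$spec), as.numeric(stage.dates))
```

Here 2015-05-29 and 2015-05-30 get stage 0, 2015-05-31 through 2015-06-06 get stage 1, and dates from 2015-06-07 onward get stage 2. Dates before the first stage date would need their own rule if they should wrap around to the last stage.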

simple rank formula

I'm looking for a mathematical ranking formula.
Sample is
  2008 2009 2010
A    5    6    4
B    6    7    5
C    7    8    2
I want to add a rank column for each period code field
                 rank
  2008 2009 2010 2008 2009 2010
B    6    7    5    2    1    1
A    5    6    4    3    2    2
C    7    2    2    1    3    3
Please do not reply with methods that loop through the rows and columns, incrementing the rank value as they go; that's easy. I'm looking for a formula, much like finding the percent of total (item / total). I know I've seen this before but am having a tough time locating it.
Thanks in advance!
sort ((letters_col, number_col) descending by number_col)
As efficient as your sort alg.
Then number the rows, of course
Edit
I really got upset by your comment "please don't up vote this answer, sorting and loop is not what I'm asking for. i specifically stated this in my original question. " , and the negative votes, because, as you may have noted by the various answers received, it's basically correct.
However, I remained pondering where and how you may "have seen this before".
Well, I think I got the answer: You saw this in Excel.
Look at this:
This is the result after entering the formulas and sorting by column H.
It's exactly what you want ...
What are you using? If you're using Excel, you're looking for RANK(num, ref).
=RANK(B2,B$2:B$9)
I don't know of any programming language that has that built in, it would always require a loop of some form.
If you want the rank of a single element, you can do it in O(n) by looping through the elements, counting how many have value above the given element, and adding 1.
If you want the rank of all the elements, the best (and really only) way is to sort the elements. Anything else you do will be equivalent to sorting (there is no "formula")
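Both ideas can be sketched in R, for instance: counting the larger values for a single element, and ranking a whole column with the built-in rank(), which sorts internally. The column names y2008 to y2010 and the helper rank_one() are assumptions; the values are the first sample table from the question.

```r
scores <- data.frame(
  row.names = c("A", "B", "C"),
  y2008 = c(5, 6, 7),
  y2009 = c(6, 7, 8),
  y2010 = c(4, 5, 2)
)

# Rank of a single value: one pass over the column, counting the larger
# values and adding 1 (illustrative helper, not a built-in).
rank_one <- function(value, column) sum(column > value) + 1
b_2008 <- rank_one(scores["B", "y2008"], scores$y2008)

# Rank of every element per column, highest value first.
# Negating the column turns rank()'s ascending order into descending order.
ranks <- sapply(scores, function(column) rank(-column, ties.method = "min"))
rownames(ranks) <- rownames(scores)
```

The single-element version is the same counting idea that a spreadsheet RANK formula expresses per cell, while the whole-column version relies on sorting behind the scenes.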
Are you using T-SQL? T-SQL RANK() may pull what you want.
