"Dummy" coding a factor that has two values in R [duplicate] - r

This question already has answers here:
Generate a dummy-variable
(17 answers)
Closed 5 years ago.
I'm not quite sure if there is a better way to say what I'm asking. Basically I have route data (for example LAX-BWI, SFO-JFK, etc). I want to dummy it so I basically would have a 1 for every airport that a flight touches (directionality doesn't matter so LAX-BWI is the same as BWI-LAX).
So for example:
ROUTE   | OFF   | ON
--------|-------|------
LAX-BWI | 10:00 | 17:00
LAX-SFO | 11:00 | 13:00
BWI-LAX | 18:00 | 01:00
BWI-SFO | 15:00 | 20:00
becomes
BWI | LAX | SFO | OFF   | ON
----|-----|-----|-------|------
 1  |  1  |  0  | 10:00 | 17:00
 0  |  1  |  1  | 11:00 | 13:00
 1  |  1  |  0  | 18:00 | 01:00
 1  |  0  |  1  | 15:00 | 20:00
I can either pull in the data as a string "BWI-LAX" or have two columns Orig and Dest whose values are string "BWI" and "LAX".
The closest thing I can think of is dummying it, but if there is an actual term for what I want, please let me know. I feel like this has been answered, but I can't think of how to search for it.

Someone just asked a very similar question so I'll copy my answer from here:
allDest <- sort(unique(unlist(strsplit(dataFrame$ROUTE, "-"))))
for (i in allDest) {
  dataFrame[, i] <- grepl(i, dataFrame$ROUTE)
}
This will create one new column for every airport in the set and indicate with TRUE or FALSE if a flight touches an airport. If you want 0 and 1 instead you can do:
allDest <- sort(unique(unlist(strsplit(dataFrame$ROUTE, "-"))))
for (i in allDest) {
  dataFrame[, i] <- grepl(i, dataFrame$ROUTE) * 1
}
TRUE * 1 is 1; FALSE * 1 is 0.

No need for the for loop. data.frames are just lists so we can assign extra elements all in one go:
cities <- unique(unlist(strsplit(df$ROUTE, "-")))
df[, cities] <- lapply(cities, function(x) as.numeric(grepl(x, df$ROUTE)))
# ROUTE OFF ON LAX BWI SFO
#1 LAX-BWI 10:00 17:00 1 1 0
#2 LAX-SFO 11:00 13:00 1 0 1
#3 BWI-LAX 18:00 01:00 1 1 0
#4 BWI-SFO 15:00 20:00 0 1 1
The ROUTE column is easy enough to drop after the calculation if you don't want it.
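If airport codes could ever be substrings of one another, exact matching on the split route is safer than `grepl`'s substring matching. A sketch along the same lines (same toy data as above, base R only):

```r
# Sketch: exact membership test on the split route, instead of grepl()
# substring matching (which could misfire if one code contained another).
df <- data.frame(ROUTE = c("LAX-BWI", "LAX-SFO", "BWI-LAX", "BWI-SFO"),
                 stringsAsFactors = FALSE)
parts  <- strsplit(df$ROUTE, "-")                 # list of c(orig, dest)
cities <- sort(unique(unlist(parts)))             # "BWI" "LAX" "SFO"
# one row per flight, one 0/1 column per airport
df[, cities] <- t(vapply(parts,
                         function(p) as.numeric(cities %in% p),
                         numeric(length(cities))))
df
#     ROUTE BWI LAX SFO
# 1 LAX-BWI   1   1   0
# 2 LAX-SFO   0   1   1
# 3 BWI-LAX   1   1   0
# 4 BWI-SFO   1   0   1
```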


Get frequency counts for a subset of elements in a column

I may be missing some elegant ways in Stata to get to this example, which has to do with electrical parts and observed monthly failures etc.
clear
input str3 (PartID Type FailType)
ABD A 4
BBB S 0
ABD A 3
ABD A 4
ABC A 2
BBB A 0
ABD B 1
ABC B 7
BBB C 1
BBB D 0
end
I would like to group by (bysort) each PartID and record the highest frequency for FailType within each PartID type. Ties can be broken arbitrarily, and preferably, the lower one can be picked.
I looked at groups etc., but do not know how to peel off certain elements from the result set, so that is a major question for me. If you execute a query, how do you select only the elements you want for the next computation? Something like n(0) is the count, n(1) is the mean, etc. I was able to use contract, bysort, etc. to create a separate data set, which I then merged back into the main set with an extra column. There must be something simple using gen or egen so that there is no need to create an extra data set.
The expected results here will be:
PartID Freq
ABD 4 #(4 occurs twice)
ABC 2 #(tie broken with minimum)
BBB 0 #(0 occurs 3 times)
Please let me know how I can pick off specific elements that I need from a result set (it can be from duplicate reports, tab, etc.).
Part II - Clarification: Perhaps I should have clarified and split the question into two parts. For example, suppose I issue this follow-up command after running your code: tabdisp Type, c(Freq). It may print out a nice table. Can I then use that (derived) table to perform more computations programmatically?
For example, get the first row of the table.
----------------------
     Type |       Freq
----------+-----------
        A |         -1
        B |         -1
        C |         -1
        D |         -3
        S |         -3
----------------------
I found this difficult to follow (see comment on question), but some technique is demonstrated here. The numbers of observations in subsets of observations defined by by: are given by _N. The rest is sorting tricks. Negating the frequency is a way to select the highest frequency and the lowest Type which I think is what you are after when splitting ties. Negating back gets you the positive frequencies.
clear
input str3 (PartID Type FailType)
ABD A 4
BBB S 0
ABD A 3
ABD A 4
ABC A 2
BBB A 0
ABD B 1
ABC B 7
BBB C 1
BBB D 0
end
bysort PartID FailType: gen Freq = -_N
bysort PartID (Freq Type) : gen ToShow = _n == 1
replace Freq = -Freq
list PartID Type FailType Freq if ToShow
+---------------------------------+
| PartID Type FailType Freq |
|---------------------------------|
1. | ABC A 2 1 |
3. | ABD A 4 2 |
7. | BBB A 0 3 |
+---------------------------------+

Searching a vector/data table backwards in R

Basically, I have a very large data frame/data table, and I would like to search a column for the closest NA value whose index is less than my current index position.
For example, let's say I have a data frame DF as follows:
INDEX | KEY | ITEM
----------------------
1 | 10 | AAA
2 | 12 | AAA
3 | NA | AAA
4 | 18 | AAA
5 | NA | AAA
6 | 24 | AAA
7 | 29 | AAA
8 | 31 | AAA
9 | 34 | AAA
From this data frame we have an NA value at index 3 and at index 5. Now, let's say we start at index 8 (which has KEY of 31). I would like to search the column KEY backwards such that the moment it finds the first instance of NA the search stops, and the index of the NA value is returned.
I know there are ways to find all NA values in a vector/column (for example, I can use which(is.na(x)) to return the indices that are NA), but due to the sheer size of the data frame I am working with, and the large number of iterations that need to be performed, this is very inefficient. One method I thought of is a kind of "do while" loop, and it does seem to work, but this again seems quite inefficient since it repeats the calculation each time (and given that I need to do over 100,000 iterations, this does not look like a good idea).
Is there a fast way of searching a column backwards from a particular index such that I can find the index of the closest NA value?
Why not do a forward-fill of the NA indexes once, so that you can then look up the most recent NA for any row in future:
library(dplyr)
library(tidyr)
df = df %>%
  mutate(last_missing = if_else(is.na(KEY), INDEX, as.integer(NA))) %>%
  fill(last_missing)
Output:
> df
INDEX KEY ITEM last_missing
1 1 10 AAA NA
2 2 12 AAA NA
3 3 NA AAA 3
4 4 18 AAA 3
5 5 NA AAA 5
6 6 24 AAA 5
7 7 29 AAA 5
8 8 31 AAA 5
9 9 34 AAA 5
Now there's no need to recalculate every time you need the answer for a given row. There may be more efficient ways to do the forward fill, but I think exploring those is easier than figuring out how to optimise the backward search.
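The same forward fill can also be done in one pass with base R's cummax() — a sketch on the KEY column from the example, assuming INDEX is just the row number:

```r
# Sketch: carry forward the index of the most recent NA using cummax().
key <- c(10, 12, NA, 18, NA, 24, 29, 31, 34)
idx <- seq_along(key)
last_missing <- cummax(ifelse(is.na(key), idx, 0L))  # 0 = no NA seen yet
last_missing[last_missing == 0L] <- NA
last_missing
# NA NA  3  3  5  5  5  5  5
```

Looking up `last_missing[8]` then answers the original question (closest NA at or before row 8) in constant time.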

R- How to get top 2 Values in Column which is depending on other column? [duplicate]

This question already has answers here:
Count number of rows within each group
(17 answers)
Closed 6 years ago.
I have a data set of Zip code and house code.
df = data.frame(zip = c(2900, 2900, 2900, 3200, 3100, 3200),
                house_code = c('abc', 'cde', 'efg', 'ghi', 'ijk', 'klm'))
I need to find the top 2 zip codes in terms of the number of house_codes.
I think it could be: head(df[df$house_code == 'some value']$zip, 2), where 'some value' is a house_code entry.
First use table to match house_codes and zip_codes.
> dftable <- table(df)
house_code
zip abc cde efg ghi ijk klm
2900 1 1 1 0 0 0
3100 0 0 0 0 1 0
3200 0 0 0 1 0 1
Then use rowSums to find the number of house_codes for each zip_code.
> numHouse <- rowSums(dftable)
2900 3100 3200
3 1 2
Finally use order to find the top 2.
> names(numHouse)[order(numHouse, decreasing = TRUE)[1:2]]
[1] "2900" "3200"

dplyr filter function gives wrong data

I have the following dataset: (sample)
Team Job Question Answer
1 1 1 2 1
2 1 1 3a 2
3 1 1 3b 2
4 1 1 4a 1
5 1 1 4b 1
and I have 21 teams, so there are many rows. I am trying to filter the rows of the teams which did well in the experiment (with the dplyr package):
q10best <- filter(quest,Team==c(2,4,6,10,13,17,21))
But it gives me messed up data and with many missing rows.
On the other hand, when I use:
q10best <- filter(quest,Team==2 | Team==4 | Team==6 | Team==10 | Team==13 | Team==17 | Team==21)
It gives me the right dataset that I want. What is the difference? What am I doing wrong in the first command?
Thanks
== checks whether two objects are exactly equal, comparing element by element and recycling the shorter vector. You are trying to check whether each element of quest$Team belongs to a set of values. The proper way to do that is to use %in%:
q10best <- filter(quest,Team %in% c(2,4,6,10,13,17,21))
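A small illustration of the difference: == recycles the shorter vector, so it only matches by position, while %in% tests set membership:

```r
Team <- 1:6
Team == c(2, 4)    # recycled to c(2, 4, 2, 4, 2, 4): positional matches only
# FALSE FALSE FALSE  TRUE FALSE FALSE
Team %in% c(2, 4)  # true set membership
# FALSE  TRUE FALSE  TRUE FALSE FALSE
```

With ==, team 2 is silently dropped because it happens to sit at a position compared against 4 — which is exactly the "messed up data with many missing rows" described above.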

Select consecutive date entries

I have updated the question, as (a) I articulated the question unclearly on the first attempt, and (b) my exact need also shifted somewhat.
I want to especially thank Hemmo for great help so far - and apologies for not articulating my question clearly enough to him. His code (that addressed earlier version of problem) is shown in the answer section.
At a high level, I am looking for code that helps to identify and differentiate the different blocks of consecutive free time of different individuals. More specifically, the code would ideally:
Check whether an activity is labelled as "Free"
Check whether consecutive weeks (week earlier, week later) of time spent by the same person were also labelled as "Free".
Give the entire block of consecutive weeks of that person that are labelled "Free" an indicator in the desired outcome column. Note that the length of the time periods (e.g. 1 consecutive week, 4 consecutive weeks, 8 consecutive weeks) will vary.
Finally, due to a need for further analysis on the characteristics of these clusters, different blocks should receive different indicators (e.g. the March block of Paul would have value 1, the May block value 2, and Kim's block in March would have value 3).
Hopefully this becomes more clear when one looks at the example dataframe (see the desired final column)
Any help much appreciated, code for the test dataframe per below.
Many thanks in advance,
W
Example (note that the last column should be generated by the code, purely included as illustration):
Week Name Activity Hours Desired_Outcome
1 01/01/2013 Paul Free 40 1
2 08/01/2013 Paul Free 10 1
3 08/01/2013 Paul Project A 30 0
4 15/01/2013 Paul Project B 30 0
5 15/01/2013 Paul Project A 10 0
6 22/01/2013 Paul Free 40 2
7 29/01/2013 Paul Project B 40 0
8 05/02/2013 Paul Free 40 3
9 12/02/2013 Paul Free 10 3
10 19/02/2013 Paul Free 30 3
11 01/01/2013 Kim Project E 40 0
12 08/01/2013 Kim Free 40 4
13 15/01/2013 Kim Free 40 4
14 22/01/2013 Kim Project E 40 0
15 29/01/2013 Kim Free 40 5
Code for dataframe:
Name=c(rep("Paul",10),rep("Kim",5))
Week=c("01/01/2013","08/01/2013","08/01/2013","15/01/2013","15/01/2013","22/01/2013","29/01/2013","05/02/2013","12/02/2013","19/02/2013","01/01/2013","08/01/2013","15/01/2013","22/01/2013","29/01/2013")
Activity=c("Free","Free","Project A","Project B","Project A","Free","Project B","Free","Free","Free","Project E","Free","Free","Project E","Free")
Hours=c(40,10,30,30,10,40,40,40,10,30,40,40,40,40,40)
Desired_Outcome=c(1,1,0,0,0,2,0,3,3,3,0,4,4,0,5)
df <- data.frame(Week, Name, Activity, Hours, Desired_Outcome)  # data.frame() directly; as.data.frame(cbind(...)) would coerce Hours to character
df
EDIT: This was messy already as the question was edited several times, so I removed old answers.
checkFree <- function(df){
  df$Week <- as.Date(df$Week, format = "%d/%m/%Y")
  df$outcome <- numeric(nrow(df))
  if (df$Activity[1] == "Free") {  # check first row
    counter <- 1
    df$outcome[1] <- counter
  } else counter <- 0
  for (i in 2:nrow(df)) {
    if (df$Activity[i] == "Free") {
      LastWeek <- (df$Week >= (df$Week[i] - 7) &
                   df$Week < df$Week[i])
      if (all(df$Activity[LastWeek] != "Free"))
        counter <- counter + 1
      df$outcome[i] <- counter
    }
  }
  df
}
splitdf <- split(df, Name)
df <- unsplit(lapply(splitdf, checkFree), Name)
uniqs <- unique(df$Name)  # for renumbering
for (i in 2:length(uniqs))
  df$outcome[df$Name == uniqs[i] & df$outcome > 0] <-
    max(df$outcome[df$Name == uniqs[i - 1]]) +
    df$outcome[df$Name == uniqs[i] & df$outcome > 0]
df
That should do it, although the above code is probably far from optimal.
Using the comment by user1885116 to Hemmo's answer as a guide to what is desired, here is a somewhat simpler approach:
N <- 1
x <- with(df, df[Activity=='Free',])
y <- with(x, diff(Week)) <= N*7
df$outcome <- 0
df[rownames(x[c(y, FALSE) | c(FALSE, y),]),]$outcome <- 1
df
## Week Activity Hours Desired_Outcome outcome
## 1 2013-01-01 Project A 40 0 0
## 2 2013-01-08 Project A 10 0 0
## 3 2013-01-08 Free 30 1 1
## 4 2013-01-15 Project B 30 0 0
## 5 2013-01-15 Free 10 1 1
## 6 2013-01-22 Project B 40 0 0
## 7 2013-01-29 Free 40 0 0
## 8 2013-02-05 Project C 40 0 0
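For the block-numbering part, there is also a fully vectorised route with cumsum() — a sketch assuming, as in the example data, that rows are already sorted by person and week (it numbers runs by row order, which happens to reproduce Desired_Outcome here, but it ignores the duplicate-week subtlety in general):

```r
df <- data.frame(
  Name = c(rep("Paul", 10), rep("Kim", 5)),
  Activity = c("Free", "Free", "Project A", "Project B", "Project A", "Free",
               "Project B", "Free", "Free", "Free", "Project E", "Free",
               "Free", "Project E", "Free"),
  stringsAsFactors = FALSE
)
free       <- df$Activity == "Free"
prev_free  <- c(FALSE, head(free, -1))
new_person <- c(TRUE, df$Name[-1] != head(df$Name, -1))
# a Free row starts a new block if the previous row was not Free,
# or if it belongs to a different person
starts <- free & (new_person | !prev_free)
df$outcome <- cumsum(starts) * free   # non-Free rows get 0
df$outcome
#  1 1 0 0 0 2 0 3 3 3 0 4 4 0 5
```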
