dplyr filter function gives wrong data

I have the following dataset: (sample)
Team Job Question Answer
1 1 1 2 1
2 1 1 3a 2
3 1 1 3b 2
4 1 1 4a 1
5 1 1 4b 1
and I have 21 teams, so there are many rows. I am trying to filter the rows of the teams which did well in the experiment (with the dplyr package):
q10best <- filter(quest,Team==c(2,4,6,10,13,17,21))
But it gives me messed-up data with many missing rows.
On the other hand, when I use:
q10best <- filter(quest,Team==2 | Team==4 | Team==6 | Team==10 | Team==13 | Team==17 | Team==21)
It gives me the right dataset that I want. What is the difference? What am I doing wrong in the first command?
Thanks

== performs element-wise comparison and recycles the shorter vector when the lengths differ, so each element of quest$Team gets compared against just one of your seven values, depending on its position. You are trying to check whether each element of quest$Team belongs to a set of values, and the proper way to do that is %in%:
q10best <- filter(quest,Team %in% c(2,4,6,10,13,17,21))
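To see why the == version misbehaves, here is a minimal sketch (illustrative values, not your data). R recycles the shorter vector, so the comparison becomes position-dependent:
x <- c(2, 2, 4, 4)
x == c(2, 4)    # TRUE FALSE FALSE TRUE: c(2, 4) is recycled to c(2, 4, 2, 4)
x %in% c(2, 4)  # TRUE TRUE TRUE TRUE: a true membership test
When the longer length is not a multiple of the shorter one, R also emits a recycling warning, which is often the first clue.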


For loop to iterate through columns in data.table [duplicate]

I am trying to write a for loop that iterates through each column in a data.table and returns a frequency table for each column. However, I keep getting the following error with this code:
library(datasets)
library(data.table)
data(cars)
cars <- as.data.table(cars)
for (i in names(cars)){
  print(table(cars[, i]))
}
Error in `[.data.table`(cars, , i) :
j (the 2nd argument inside [...]) is a single symbol but column name 'i' is not found. Perhaps you intended DT[, ..i]. This difference to data.frame is deliberate and explained in FAQ 1.1.
When I use each column individually like below, I do not have any problem:
> table(cars[,dist])
2 4 10 14 16 17 18 20 22 24 26 28 32 34 36 40 42 46 48 50 52 54 56 60 64 66
1 1 2 1 1 1 1 2 1 1 4 2 3 3 2 2 1 2 1 1 1 2 2 1 1 1
68 70 76 80 84 85 92 93 120
1 1 1 1 1 1 1 1 1
My data is quite large (8921483 x 52), which is why I want to use the for loop to run everything at once and then look at the results.
I included the cars dataset (which is easier to run) to demonstrate my code.
If I convert the dataset to a data.frame, the for loop runs with no problem. But I want to know why this does not work with data.table, since I am learning it and I believe it handles large datasets better.
If by chance someone has seen a post that already answers this, please let me know, because I have been searching for one for several hours.
Some solutions can be found here. My personal preference is the apply function, though:
library(datasets)
library(data.table)
data(cars)
cars <- as.data.table(cars)
apply(cars, 2, table)
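One caveat: apply() coerces the data.table to a matrix first, so with mixed column types every column would be converted to character before tabulating. A sketch that stays column-wise (same cars table as above):
lapply(cars, table)  # a data.table is a list of columns, so no matrix coercion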
To make your loop work, tweak how you refer to i so the column is looked up rather than taken literally:
library(datasets)
library(data.table)
data(cars)
cars <- as.data.table(cars)
for (i in names(cars)){
  print(table(cars[, get(i)]))  # get(i) evaluates i, returning the column it names
}
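As the error message itself suggests, the ..i prefix is another option: it tells data.table to look i up in the calling scope instead of treating it as a column name. A minimal sketch (plain [[ extraction behaves identically for data.frames and data.tables):
for (i in names(cars)){
  print(table(cars[, ..i][[1]]))  # ..i returns a one-column data.table
  # equivalently: print(table(cars[[i]]))
}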

Subdividing and counting how many values in particular columns under certain conditions in r

I am new to R and data analysis. I have a database similar to the one below, just a lot bigger, and I am trying to find a general way to count, for each country, how many actions there are and how many subquestions with value 1, value 2 and so on there are. Each action has multiple questions, subquestions and subsubquestions, but I would love to find a way to count:
1: how many actions there are per country, excluding subquestions;
2: how many subquestions 1 or 2 with value 1 there are for each country, actionn and questionn.
id country questionn subquestion value actionn
06 NIE 1 1 1 1
05 NIG 1 1 1 1
07 TAN 1 1 1 1
08 BEN 1 1 1 1
03 TOG 1 1 2 1
45 MOZ 1 1 2 1
40 ZIM 1 1 1 1
56 COD 1 1 1 1
87 BFA 1 1 1 1
09 IVC 1 1 2 1
08 SOA 1 1 2 1
02 MAL 1 1 2 1
78 MAI 1 1 2 1
35 GUB 1 1 2 1
87 RWA 1 1 2 1
41 ETH 1 1 1 1
06 NIE 1 2 2 1
05 NIG 1 2 1 1
87 BFA 1 2 1 2
I have tried to create subsets of the data frame and count everything for each country one at a time, but that would take forever, and I was wondering if there is a general way to do it.
For the first question I have done this
df1 <- df %>% group_by(country) %>% summarise(countries = country)
unique(df1)
count(df1)
For the second question I was thinking of individually selecting and counting the rows which have questionn=1, subquestion=1, value=1 and actionn=1, then selecting and counting how many per country have questionn=1, subquestion=2, value=1, actionn=1, and so on. Value refers to whether the answer to the question is 1=yes or 2=no.
I would be grateful for any help, thank you so much :)
For the first question you can try to do something like this:
df %>%
  filter(subquestion != 2) %>%
  group_by(country) %>%
  summarise(num_actions = n())
This will return the number of actions per country, dropping rows that have 2 in the subquestion column. Note that n() inside summarise counts the observations in each group (in this case, each country).
I am not sure I fully understand the second question, but my suggestion would be to make a new label for the particular observation you want to know (how many subquestions 1 or 2 with value 1 there are for each country, actionn and questionn):
df %>%
  mutate(country_question_code = paste(country, actionn, questionn, sep = "_")) %>%
  group_by(country_question_code) %>%
  summarize(num_subquestion = n())
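As an aside, dplyr's count() collapses this group-and-summarise pattern into a single call, and grouping by the raw columns avoids pasting them into a code (a sketch on the same df):
df %>% count(country, actionn, questionn)  # adds a column n with each group's size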
For question 1, a possible solution (assuming country names are not unique and actionn can be 0, 1, 2 or more). For just the total count:
df %>%
  group_by(country) %>%
  summarise(Count_actions = sum(actionn))  # ignores all other columns
If you want to count how many times a country appears, use n() in place of sum(actionn, na.rm = TRUE). This may not be what you want, but sometimes the simple solution is the best (just count the frequency of country).
Or df %>% group_by(country, actionn) %>% summarise(count_actions = n()) will give a per-country count for each type (say 1, 2 or more actions).
The data.table version: dt[, .N, by = .(country, actionn)]
For question 2, apply the grouping for each "for each" in your question after filtering the data as required. Here, filter to subquestions 1 or 2 with value 1, then group by country, questionn and actionn:
df %>%
  filter(subquestion <= 2 & value == 1) %>%
  group_by(country, questionn, actionn) %>%
  summarise(counts_desired = n(), sums_desired = sum(actionn, na.rm = TRUE))
Hope this works; I am also learning and applying this to similar data. I have not tested it and have made certain assumptions about your data (numerical and clean). (Also on mobile while traveling! Cheers!!)

"Dummy" coding a factor that has two values in R [duplicate]

I'm not quite sure if there is a better way to say what I'm asking. Basically I have route data (for example LAX-BWI, SFO-JFK, etc.). I want to dummy-code it so that I would have a 1 for every airport that a flight touches (directionality doesn't matter, so LAX-BWI is the same as BWI-LAX).
So for example:
ROUTE | OFF | ON |
LAX-BWI|10:00|17:00|
LAX-SFO|11:00|13:00|
BWI-LAX|18:00|01:00|
BWI-SFO|15:00|20:00|
becomes
BWI|LAX|SFO| OFF | ON |
1 | 1 | 0 |10:00|17:00|
0 | 1 | 1 |11:00|13:00|
1 | 1 | 0 |18:00|01:00|
1 | 0 | 1 |15:00|20:00|
I can either pull in the data as a string "BWI-LAX" or have two columns Orig and Dest whose values are string "BWI" and "LAX".
The closest thing I can think of is dummying it, but if there is an actual term for what I want, please let me know. I feel like this has been answered, but I can't think of how to search for it.
Someone just asked a very similar question so I'll copy my answer from here:
allDest <- sort(unique(unlist(strsplit(dataFrame$ROUTE, "-"))))
for(i in allDest){
  dataFrame[, i] <- grepl(i, dataFrame$ROUTE)
}
This will create one new column for every airport in the set and indicate with TRUE or FALSE if a flight touches an airport. If you want 0 and 1 instead you can do:
allDest <- sort(unique(unlist(strsplit(dataFrame$ROUTE, "-"))))
for(i in allDest){
  dataFrame[, i] <- grepl(i, dataFrame$ROUTE)*1
}
TRUE*1 is 1; FALSE*1 is 0.
No need for the for loop. data.frames are just lists so we can assign extra elements all in one go:
cities <- unique(unlist(strsplit(df$ROUTE, "-")))
df[, cities] <- lapply(cities, function(x) as.numeric(grepl(x, df$ROUTE)))
# ROUTE OFF ON LAX BWI SFO
#1 LAX-BWI 10:00 17:00 1 1 0
#2 LAX-SFO 11:00 13:00 1 0 1
#3 BWI-LAX 18:00 01:00 1 1 0
#4 BWI-SFO 15:00 20:00 0 1 1
The ROUTE column is easy enough to drop after the calculation if you don't want it.
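One caveat on both answers, offered as a hedge: grepl(i, ROUTE) does substring matching, so an airport code that happens to occur inside another code would match spuriously. A sketch that tests the exact split components instead (reusing cities and df from above):
legs <- strsplit(df$ROUTE, "-")  # list of the endpoints of each route
df[, cities] <- lapply(cities, function(x)
  as.numeric(vapply(legs, function(l) x %in% l, logical(1))))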

Get frequency counts for a subset of elements in a column

I may be missing some elegant ways in Stata to handle this example, which has to do with electrical parts and observed monthly failures.
clear
input str3 (PartID Type FailType)
ABD A 4
BBB S 0
ABD A 3
ABD A 4
ABC A 2
BBB A 0
ABD B 1
ABC B 7
BBB C 1
BBB D 0
end
I would like to group by (bysort) each PartID and record the highest frequency for FailType within each PartID type. Ties can be broken arbitrarily, and preferably, the lower one can be picked.
I looked at groups etc., but do not know how to peel off certain elements from the result set, so that is a major question for me. If you execute a query, how do you select only the elements you want for the next computation? Something like n(0) is the count, n(1) is the mean, etc. I was able to use contract, bysort etc. to create a separate data set which I then merged back into the main set with an extra column. There must be something simple using gen or egen so that there is no need to create an extra data set.
The expected results here will be:
PartID Freq
ABD 4 #(4 occurs twice)
ABC 2 #(tie broken with minimum)
BBB 0 #(0 occurs 3 times)
Please let me know how I can pick off specific elements that I need from a result set (can be from duplicate reports, tab etc.)
Part II - Clarification: Perhaps I should have clarified and split the question into two parts. For example, suppose I issue this follow-up command after running your code: tabdisp Type, c(Freq). It prints out a nice table. Can I then use that (derived) table to perform more computations programmatically? For example, get the first row of the table:
----------------------
     Type |       Freq
----------+-----------
        A |         -1
        B |         -1
        C |         -1
        D |         -3
        S |         -3
----------------------
I found this difficult to follow (see comment on question), but some technique is demonstrated here. The number of observations in each subset of observations defined by by: is given by _N. The rest is sorting tricks. Negating the frequency is a way to select the highest frequency and the lowest Type, which I think is what you are after when splitting ties. Negating back gets you the positive frequencies.
clear
input str3 (PartID Type FailType)
ABD A 4
BBB S 0
ABD A 3
ABD A 4
ABC A 2
BBB A 0
ABD B 1
ABC B 7
BBB C 1
BBB D 0
end
bysort PartID FailType: gen Freq = -_N
bysort PartID (Freq Type) : gen ToShow = _n == 1
replace Freq = -Freq
list PartID Type FailType Freq if ToShow
+---------------------------------+
| PartID Type FailType Freq |
|---------------------------------|
1. | ABC A 2 1 |
3. | ABD A 4 2 |
7. | BBB A 0 3 |
+---------------------------------+

Searching a vector/data table backwards in R

Basically, I have a very large data frame/data table and I would like to search a column backwards for the first (closest) NA value at an index less than my current index position.
For example, let's say I have a data frame DF as follows:
INDEX | KEY | ITEM
----------------------
1 | 10 | AAA
2 | 12 | AAA
3 | NA | AAA
4 | 18 | AAA
5 | NA | AAA
6 | 24 | AAA
7 | 29 | AAA
8 | 31 | AAA
9 | 34 | AAA
From this data frame we have an NA value at index 3 and at index 5. Now, let's say we start at index 8 (which has KEY of 31). I would like to search the column KEY backwards such that the moment it finds the first instance of NA the search stops, and the index of the NA value is returned.
I know there are ways to find all NA values in a vector/column (for example, which(is.na(x)) returns the indices that are NA), but due to the sheer size of the data frame I am working with, and the large number of iterations that need to be performed, this is very inefficient. One method I thought of is a kind of "do while" loop, and it does seem to work, but it again seems quite inefficient since it recomputes each time (and given that I need to do over 100,000 iterations, this does not look like a good idea).
Is there a fast way of searching a column backwards from a particular index such that I can find the index of the closest NA value?
Why not do a forward-fill of the NA indexes once, so that you can then look up the most recent NA for any row in future:
library(dplyr)
library(tidyr)
df = df %>%
  mutate(last_missing = if_else(is.na(KEY), INDEX, NA_integer_)) %>%
  fill(last_missing)
Output:
> df
INDEX KEY ITEM last_missing
1 1 10 AAA NA
2 2 12 AAA NA
3 3 NA AAA 3
4 4 18 AAA 3
5 5 NA AAA 5
6 6 24 AAA 5
7 7 29 AAA 5
8 8 31 AAA 5
9 9 34 AAA 5
Now there's no need to recalculate every time you need the answer for a given row. There may be more efficient ways to do the forward fill, but I think exploring those is easier than figuring out how to optimise the backward search.
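Since the question mentions data.table and a very large table, the same forward-fill can be sketched in data.table as well (a hedged, untested sketch, assuming the df above; nafill needs data.table >= 1.12.4):
library(data.table)
setDT(df)  # convert in place
df[, last_missing := fifelse(is.na(KEY), INDEX, NA)]       # NA wherever KEY is present
df[, last_missing := nafill(last_missing, type = "locf")]  # carry the last NA index forward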
