selecting consecutive answers in R - r

I have a data set as follows (it is just a sample below):
dataframe<-data.frame("id" = c(1,2,5,7,9,21,22,23),"questionfk"=c(145,51,51,145,145,51,145,51))
In this data, id represents the order of the questions and questionfk is the question id.
I would like to filter this data on questionfk 145 and 51, keeping only the cases where 145 is asked immediately before 51 (that is, 51 is the very next question). So what I want in the end looks like this:
dataframefiltered<-data.frame("id" = c(1,2,22,23),"questionfk"=c(145,51,145,51))
I did this with lots of ifs and fors. Is it possible to do this with data.table, and if so, how?
Thank you!

Maybe this helps
library(data.table)
# indx marks each row holding 145 whose next row holds 51 with id exactly one higher
setDT(dataframe)[dataframe[, {indx = which(c(questionfk[-.N] == 145 &
    questionfk[-1] == 51 & diff(id) == 1, FALSE))
  sort(c(indx, indx + 1))}]]
# id questionfk
#1: 1 145
#2: 2 51
#3: 22 145
#4: 23 51

I'm not sure I understand the exact conditions you're looking for, but I'm basing this on wanting to select questions 145 and 51, but only when they come consecutively in that order. I realize that this does not give the same result as you show, but presumably you can modify this to match the right conditions.
Rather than data.table, here's a way to do it with dplyr (which is also fast with big datasets, and very elegant):
library(dplyr)
dataframe %>%
  mutate(last_question = lag(questionfk),   # question asked on the previous row
         next_question = lead(questionfk),  # question asked on the next row
         after_145 = last_question==145,
         before_51 = next_question==51) %>%
  filter(after_145 | before_51) %>%
  select(id, questionfk)
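If the goal is the exact result shown in the question, one possible variation (a sketch, not the answerer's code) checks each row's own value together with its neighbour and also requires consecutive id values. The column names starts_pair and ends_pair are made up for illustration.

```r
library(dplyr)

dataframe <- data.frame(id = c(1, 2, 5, 7, 9, 21, 22, 23),
                        questionfk = c(145, 51, 51, 145, 145, 51, 145, 51))

dataframefiltered <- dataframe %>%
  mutate(
    # a 145 whose next row is a 51 with id exactly one higher
    starts_pair = questionfk == 145 & lead(questionfk) == 51 & lead(id) - id == 1,
    # a 51 whose previous row is a 145 with id exactly one lower
    ends_pair = questionfk == 51 & lag(questionfk) == 145 & id - lag(id) == 1
  ) %>%
  filter(starts_pair | ends_pair) %>%  # filter() drops rows where the test is NA
  select(id, questionfk)

dataframefiltered
```

Rows 9 and 21 are adjacent in the data and hold 145 then 51, but their ids are not consecutive, so this sketch leaves them out.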

Related

Specify multiple conditions in long form data in R

How do I index the rows I need according to the specifications below?
id<-c(65,65,65,65,65,900,900,900,900,900,900,211,211,211,211,211,211,211,45,45,45,45,45,45,45)
age<-c(19,22,23,24,25,21,26,31,32,37,38,22,23,25,28,29,31,32,30,31,36,39,42,44,48)
stat<-c('intern','reg','manage1','left','reg','manage1','manage2','left','reg',
'reg','left','intern','left','intern','reg','left','reg','manage1','reg','left','intern','manage1','left','reg','manage2')
mydf<-data.frame(id,age,stat)
I need to create 5 variables:
m01time & m12time: measure the number of years elapsed before becoming a level-1 manager (manage1), and then from manage1 to manage2, regardless of whether or not it's at the same job (numeric, in years).
change: capture whether or not they experienced a job change between manage1 and manage2, i.e. whether 'left' happens somewhere in between (0 or 1).
m1p & m2p: capture the position held before becoming manage1 and manage2 (intern, reg, or manage1).
There's a lot of information I don't need here that I am not sure how to ignore (all the jobs 211 went through before going to one where they become a manager).
The end result should look something like this:
id m01time m12time change m1p m2p
1 65 4 NA NA reg <NA>
2 900 NA 5 0 <NA> manage1
3 211 1 NA NA reg <NA>
4 45 3 9 1 intern reg
I tried to use ifelse with lag() and lead() to capture some conditions, but there are more for loop type of jobs (such as how to capture a "left" somewhere in between) that I am not sure what to do with.
I'd calculate the first three variables differently from m1p and m2p. Maybe there's an elegant unified approach that I don't see at the moment.
So for the last position before manager you could do:
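library(data.table)
mydt <- data.table(mydf)
# status held just before the first `target` within one id; NA if there is none
prev_stat <- function(s, target) as.character(s)[match(target, s) - 1][1]
mydt[, .(m1p = prev_stat(stat, "manage1"), m2p = prev_stat(stat, "manage2")), by = id]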
The other variables are more conveniently calculated in a wide data format:
dt <- dcast(unique(mydt,by=c("id","stat")),
formula=id~stat,value.var="age")
dt[,.(m01time = manage1-intern,
m12time = manage2-manage1,
change = manage1<left & left<manage2)]
Two caveats:
reshaping might be quite costly for larger data sets
I (over-)simplified your dummy data by ignoring duplicates of id and stat (only the first occurrence of each stat per id is kept)
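The change flag from the question can also be computed without reshaping. The sketch below assumes the rows are already ordered by age within each id and only looks at the first manage1 and the first manage2 of each person. left_between is a hypothetical helper written for this example, not part of any package.

```r
id <- c(65,65,65,65,65,900,900,900,900,900,900,211,211,211,211,211,211,211,
        45,45,45,45,45,45,45)
age <- c(19,22,23,24,25,21,26,31,32,37,38,22,23,25,28,29,31,32,
         30,31,36,39,42,44,48)
stat <- c('intern','reg','manage1','left','reg','manage1','manage2','left','reg',
          'reg','left','intern','left','intern','reg','left','reg','manage1',
          'reg','left','intern','manage1','left','reg','manage2')
mydf <- data.frame(id, age, stat)

# 1 if "left" occurs strictly between the first manage1 and the first manage2,
# 0 if it does not, NA if either promotion is missing for that person.
left_between <- function(s) {
  s <- as.character(s)
  p1 <- match("manage1", s)
  p2 <- match("manage2", s)
  if (is.na(p1) || is.na(p2) || p2 < p1) return(NA_integer_)
  between <- s[seq_len(p2 - 1)][-seq_len(p1)]  # elements p1 + 1 .. p2 - 1
  as.integer("left" %in% between)
}

# One value per id; names of the result are the ids as character strings.
change <- sapply(split(mydf$stat, mydf$id), left_between)
change[c("65", "900", "211", "45")]
```

On the sample data this yields NA, 0, NA and 1 for ids 65, 900, 211 and 45, which agrees with the change column in the desired output.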

R - Using Stringr to identify a string across hundreds of rows

I have a database where some people have multiple diagnoses. I posted a similar question in the past, but now have some more nuances I need to work through:
R- How to test multiple 100s of similar variables against a condition
I have this dataset (which was an import of a SAS file)
ID dx1 dx2 dx3 dx4 dx5 dx6 .... dx200
1 343 432 873 129 12 123 3445
2 34 12 44
3 12
4 34 56
Initially, I wanted to be able to create a new variable if any of the "dxs" equals a certain number, without using hundreds of if statements. All the different variables have the same format (dx#). So I used the following code:
Ex:
dataset$highbloodpressure <- rowSums(screen[0:832] == "410") > 0
This worked great. However, there are many different codes for the same diagnosis. For example, a heart attack can be defined as:
410.1,
410.71,
410.62,
410.42,
...this goes on for 20 additional codes. BUT! They all start with 410.
I thought about using stringr (the variable is a string), to identify the common code components (410, for the example above), but am not sure how to use it in the context of rowsums.
If anyone has any suggestions for this, please let me know!
Thanks for all the help!
You can use the grepl() function, which returns TRUE if a pattern is present. In order to check all columns simultaneously, just collapse all of them into one character string per row:
df$dx.410 = NA
for(i in seq_len(nrow(df))){
  # '(^| )410' only matches codes that start with 410, not e.g. 1410
  if(grepl('(^| )410',paste(df[i,2:200],collapse=' '))){
    df$dx.410[i]="Present"
  }
}
This will loop through all rows, create one long character string containing all diagnoses for that case, and write "Present" in column dx.410 if any column contains a 410-diagnosis.
(The solution expects the data structure you have here with the dx-variables in columns 2 to 200. If there are some other columns, just adjust these numbers)
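A vectorised variant of the same idea, closer to the rowSums() call in the question, is sketched below. It assumes the diagnosis columns are character and named dx1, dx2, and so on. The toy values and the heartattack column name are invented for illustration. Anchoring the pattern with ^ keeps a code such as 1410 from matching.

```r
# Toy data standing in for the real dx1..dx200 columns (values are made up).
df <- data.frame(ID  = 1:4,
                 dx1 = c("343", "410.71", "12", "34"),
                 dx2 = c("410.1", "12", NA, "1410"),
                 dx3 = c("873", "44", NA, NA),
                 stringsAsFactors = FALSE)

# Pick the diagnosis columns by name rather than by position.
dx_cols <- grep("^dx[0-9]+$", names(df))

# One logical column per dx variable: TRUE where the code starts with 410.
# grepl() returns FALSE for NA, so empty cells do not need special handling.
starts_410 <- sapply(df[dx_cols], function(x) grepl("^410", x))

# TRUE for every row with at least one matching code.
df$heartattack <- rowSums(starts_410) > 0
df
```

With the toy data, rows 1 and 2 are flagged and rows 3 and 4 are not.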

Data frame Manipulation

I have the following dataframe named stations https://i.stack.imgur.com/qOo44.png. I also have the two vectors from <- c(1, 147, 141, 8) and to <- c(147, 141, 8, 17). As you can see in the data frame, the columns "from" and "to" do not match up with the vectors. This is causing the longitude and latitude columns of the route to be backwards. For example, instead of going from San Francisco to Portland, it is going from Portland to San Francisco. In order to fix this I would have to reverse the order of the data frame rows that do not match up with my vectors. So my data frame should start at row 125 and go to 116 in order to correct the route. This would need to be done for all the rows of the data frame where the "from" and "to" columns do not match up with the from and to vectors. I am sorry if this was not the best explanation, but this is a difficult topic to explain.
EDIT: Here is reproducible code for the current structure
current<-data.frame(ID= c(116,117,118,119,120,121,122,123,124,125),
from = c(147,147,147,147,147,147,147,147,147,147),to = c(1,1,1,1,1,1,1,1,1,1),lon=c(-122.6742,-122.6402,-122.6267,-122.5792,-122.5634,-122.5401,-122.5199,-122.5081,-122.4775,-122.4415),
lat= c(45.52025, 44.48824, 44.07356, 42.62986, 42.14788, 41.44040, 40.58136,40.46431 ,39.53378, 38.43697))
and what I want
Final<-data.frame(ID= c(125,124,123,122,121,120,119,118,117,116),
from = c(1,1,1,1,1,1,1,1,1,1), to = c(147,147,147,147,147,147,147,147,147,147),lon=c(-122.4415, -122.4775, -122.5081 ,-122.5199, -122.5401, -122.5634, -122.5792,-122.6267, -122.6402, -122.6742),
lat= c(38.43697 ,39.53378, 40.46431, 40.58136, 41.44040, 42.14788 ,42.62986,44.07356, 44.48824, 45.52025))
The changing of the structure should be based on the detection of vectors from and to not matching the columns in the current data frame.
from <- c(1, 147, 141, 8)
to <- c(147, 141, 8, 17)
Any tips would help greatly, thank you.
You can use indexing backwards:
df[116:125,] <- df[125:116,]
Example (df is a data frame with 8 rows):
> df$nums <- c(1,2,3,4,5,6,7,8)
> df[2:5,] <- df[5:2,]
> df$nums
[1] 1 5 4 3 2 6 7 8
However, if I'm reading correctly, you will then still need to swap the to/from values for those rows, or they will still be backwards. Let me know if I understood your question properly.
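A minimal sketch of both steps on the reproducible data from the question, assuming a leg counts as backwards when its stored (to, from) pair appears as a (from, to) pair in the route vectors. The names stored_from, stored_to, is_backwards and fixed are made up for this example, and ID is written as 116:125 for brevity.

```r
current <- data.frame(
  ID   = 116:125,
  from = 147,
  to   = 1,
  lon  = c(-122.6742, -122.6402, -122.6267, -122.5792, -122.5634,
           -122.5401, -122.5199, -122.5081, -122.4775, -122.4415),
  lat  = c(45.52025, 44.48824, 44.07356, 42.62986, 42.14788,
           41.44040, 40.58136, 40.46431, 39.53378, 38.43697)
)
from <- c(1, 147, 141, 8)
to   <- c(147, 141, 8, 17)

# This block of rows describes a single leg, so its first row is representative.
stored_from <- current$from[1]
stored_to   <- current$to[1]

# Backwards if the route contains the same leg in the opposite direction.
is_backwards <- any(from == stored_to & to == stored_from)

fixed <- current
if (is_backwards) {
  fixed <- current[rev(seq_len(nrow(current))), ]   # reverse the row order
  fixed <- transform(fixed, from = to, to = from)   # swap the two columns
  rownames(fixed) <- NULL
}
fixed
```

For a full stations table the same check would have to be applied to each leg separately, for example after splitting the rows by their from/to pair. That part is not shown here.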

Matching Data from Different columns / dataframes - Working in R

Here is some sample data
Dataset A
id name reasonforlogin
123 Tom work
246 Timmy work
789 Mark play
Dataset B
id name reasonforlogin
789 Mark work
313 Sasha interview
000 Meryl interview
987 Dara play
789 Mark play
246 Timmy work
Two datasets. Same columns. Uneven number of rows.
I want to be able to say something like
1)"I want all of id numbers that appear in both datasetA and datasetB"
or
2)"I want to know how many times any one ID logs in on a day, say day 2."
So the answer to
1) So a list like
[246, 789]
2) So a data.frame with a "header" of ids, and then a "row" of their login numbers.
123, 246, 789, 313, 000, 987
0, 1, 2, 1, 1, 1
It seems easy, but I think it's non-trivial to do this quickly with large data. Originally I planned on doing loops within loops, but I'm sure there has to be a term for these kinds of comparisons, and likely packages that already do similar things.
If we have A as the first data set and B the second, and id as a character column in both so as to keep 000 from being printed as 0, we can do ...
id common to both data sets:
intersect(A$id, B$id)
# [1] "246" "789"
Times an id logged in on the second day (B), including those that were not logged in at all:
table(factor(B$id, levels = unique(c(A$id, B$id))))
# 123 246 789 313 000 987
# 0 1 2 1 1 1
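To try these two calls, the sample tables can be rebuilt as below. This is only a sketch of the setup: id is stored as character so that 000 survives, and the result names common and logins_day2 are made up for illustration.

```r
A <- data.frame(id = c("123", "246", "789"),
                name = c("Tom", "Timmy", "Mark"),
                reasonforlogin = c("work", "work", "play"),
                stringsAsFactors = FALSE)
B <- data.frame(id = c("789", "313", "000", "987", "789", "246"),
                name = c("Mark", "Sasha", "Meryl", "Dara", "Mark", "Timmy"),
                reasonforlogin = c("work", "interview", "interview",
                                   "play", "play", "work"),
                stringsAsFactors = FALSE)

# ids present in both data sets
common <- intersect(A$id, B$id)

# logins per id on the second day; listing every known id as a factor level
# keeps ids with zero logins (here 123) in the table
logins_day2 <- table(factor(B$id, levels = unique(c(A$id, B$id))))

common
logins_day2
```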
You can do both with dplyr
1
library(dplyr)
A %>% select(id) %>%
  inner_join(B %>% select(id), by = "id") %>%
  distinct()
2
B %>% count(id)
You need which and table.
1) Find which ids are in both data.frames
common_ids <- unique(df1[which(df1$id %in% df2$id), "id"])
Using intersect as in the other answers is much more elegant in this simple case. However, which provides more flexibility when the comparison you need is more complicated than simple equality, and is worth knowing.
2) Find how many times any ID logs in
table(df1$id)

R: iterating through unique values of a vector in for loop

I'm new to R and I am having some trouble iterating through the unique elements of a vector. I have a dataframe "School" with 700 different teachers. Each teacher has around 40 students.
I want to be able to loop through each teacher, create a graph of the mean score of his/her students over time, save the graphs in a folder and automatically email that folder to that teacher.
I'm just getting started and am having trouble setting up the for-loop. In Stata, I know how to loop through each unique element in a list, but am having trouble doing that in R. Any help would be appreciated.
School$Teacher School$Student School$ScoreNovember School$ScoreDec School$TeacherEmail
A 1 35 45 A#school.org
A 2 43 65 A#school.org
B 1 66 54 B#school.org
A 3 97 99 A#school.org
C 1 23 45 C#school.org
Your question seems a bit vague and it looks like you want us to write your whole project. Could you share what you have done so far and where exactly you are struggling?
see ?subset
School=data.frame(Teacher=c("A","B"), ScoreNovember=10:11, ScoreDec=13:14)
for (teacher in unique(School$Teacher)) {
  teacher_df=subset(School, Teacher==teacher)
  MeanScoreNovember=mean(teacher_df$ScoreNovember)
  MeanScoreDec     =mean(teacher_df$ScoreDec)
  # do your plot
  # send your email
}
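As a rough illustration of the plotting step, the loop could write one file per teacher as sketched below. The output folder under tempdir(), the file naming scheme and the use of pdf() are assumptions made for this example. The emailing step is left out.

```r
School <- data.frame(Teacher = c("A", "A", "B", "A", "C"),
                     Student = c(1, 2, 1, 3, 1),
                     ScoreNovember = c(35, 43, 66, 97, 23),
                     ScoreDec = c(45, 65, 54, 99, 45),
                     stringsAsFactors = FALSE)

# Hypothetical output folder; replace with a real path as needed.
out_dir <- file.path(tempdir(), "teacher_plots")
dir.create(out_dir, showWarnings = FALSE)

for (teacher in unique(School$Teacher)) {
  teacher_df <- subset(School, Teacher == teacher)
  # mean score of this teacher's students in each month
  means <- colMeans(teacher_df[c("ScoreNovember", "ScoreDec")])

  # open a PDF device, draw the plot, then close the device to write the file
  pdf(file.path(out_dir, paste0("teacher_", teacher, ".pdf")))
  plot(seq_along(means), means, type = "b", xaxt = "n",
       xlab = "Month", ylab = "Mean score", main = paste("Teacher", teacher))
  axis(1, at = seq_along(means), labels = c("November", "December"))
  dev.off()
}

list.files(out_dir)
```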
I think you have three questions here, each of which deserves its own post. How do I:
Create graphs
Automatically email output
Compute a subset mean based on group
For the 3rd one, I like using the plyr package; other people will recommend the data.table or dplyr packages. You can also use aggregate from base R. To get each teacher's mean:
library(plyr)
ddply(School,.(Teacher),summarise,Nov_m=mean(ScoreNovember))
If you want it per student per teacher, etc., just add the extra grouping columns inside .( ), like:
library(plyr)
ddply(School,.(Teacher,Student),summarise,Nov_m=mean(ScoreNovember))
You could do that for each score column (and then chart it). If your data were long rather than wide, you could also add the date ('November', 'Dec') as a grouping variable in the brackets. Otherwise, summarise both columns at once:
library(plyr)
ddply(School,.(Teacher,Student),summarise,Nov_m=mean(ScoreNovember),Dec_m=mean(ScoreDec))
See if that helps with the 3rd, but look at splitting your questions up too.
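For comparison, the base aggregate() route mentioned above could look like the sketch below, using the sample rows from the question. teacher_means is a made-up name.

```r
School <- data.frame(Teacher = c("A", "A", "B", "A", "C"),
                     Student = c(1, 2, 1, 3, 1),
                     ScoreNovember = c(35, 43, 66, 97, 23),
                     ScoreDec = c(45, 65, 54, 99, 45))

# Mean of both score columns for each teacher; one row per teacher.
teacher_means <- aggregate(cbind(ScoreNovember, ScoreDec) ~ Teacher,
                           data = School, FUN = mean)
teacher_means
```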
