How to see n rows after a specific row in R

I'm trying to track user actions, but I'd like to see what they do AFTER a specific event. How do I get the next n lines?
For example below, I'd like to know what the user is doing after they "Get mushroom" to see if they are eating it. I'd want to reference the "Get mushroom" for each User and see the next few lines after that.
User Action
Bob Enter chapter 1
Bob Attack
Bob Jump
Bob Get mushroom
Bob Open inventory
Bob Eat mushroom
Bob Close inventory
Bob Run
Mary Enter chapter 1
Mary Get mushroom
Mary Attack
Mary Jump
Mary Attack
Mary Open inventory
Mary Close inventory
I'm not sure how to approach this after grouping by users. The expected results would be something like the below if I wanted the 3 lines after the match:
User Action
Bob Get mushroom # Action I want to find and the next 3 lines below it
Bob Open inventory
Bob Eat mushroom
Bob Close inventory
Mary Get mushroom # Action I want to find and the next 3 lines below it
Mary Attack
Mary Jump
Mary Attack
Thank you.

Two alternatives with dplyr and data.table:
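The code below assumes a data frame df1 whose actions use hyphenated labels (matching the output shown further down); a minimal reconstruction, purely for reproducibility, might be:
df1 <- data.frame(
  User = c(rep("Bob", 8), rep("Mary", 7)),
  Action = c("Enter-chapter-1", "Attack", "Jump", "Get-mushroom",
             "Open-inventory", "Eat-mushroom", "Close-inventory", "Run",
             "Enter-chapter-1", "Get-mushroom", "Attack", "Jump",
             "Attack", "Open-inventory", "Close-inventory"),
  stringsAsFactors = FALSE
)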
library(dplyr)
# per user, take every 'Get-mushroom' row plus the 3 rows after it
df1 %>%
  group_by(User) %>%
  slice(rep(which(Action == 'Get-mushroom'), each = 4) + 0:3)

library(data.table)
# same idea with data.table: .I gives the original row numbers of the matches within each user group
setDT(df1)[df1[, rep(.I[Action == 'Get-mushroom'], each = 4) + 0:3, User]$V1]
both result in:
User Action
1: Bob Get-mushroom
2: Bob Open-inventory
3: Bob Eat-mushroom
4: Bob Close-inventory
5: Mary Get-mushroom
6: Mary Attack
7: Mary Jump
8: Mary Attack

Try this:
df
User Action
1 Bob Enterchapter1
2 Bob Attack
3 Bob Jump
4 Bob Getmushroom
5 Bob Openinventory
6 Bob Eatmushroom
7 Bob Closeinventory
8 Bob Run
9 Mary Enterchapter1
10 Mary Getmushroom
11 Mary Attack
12 Mary Jump
13 Mary Attack
14 Mary Openinventory
15 Mary Closeinventory
indices <- which(df$Action == 'Getmushroom')
n <- 3
# ensure that x + n does not go beyond the number of rows of df
do.call(rbind, lapply(indices, function(x) df[x:min(x + n, nrow(df)), ]))
User Action
4 Bob Getmushroom
5 Bob Openinventory
6 Bob Eatmushroom
7 Bob Closeinventory
10 Mary Getmushroom
11 Mary Attack
12 Mary Jump
13 Mary Attack
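Note that this indexes the whole data frame, so if a match falls within the last n rows of one user's block the window will spill into the next user's rows. A small wrapper that splits by User first avoids that; this is only a sketch, and the helper name next_n_rows is made up:
# for each user, return every matching row plus the next n rows (bounded per user)
next_n_rows <- function(data, action, n = 3) {
  do.call(rbind, lapply(split(data, data$User), function(d) {
    idx <- which(d$Action == action)
    rows <- unlist(lapply(idx, function(x) x:min(x + n, nrow(d))))
    d[rows, ]
  }))
}

next_n_rows(df, 'Getmushroom', n = 3)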

First, find the indices that contain the term "Get mushroom" using which.
Then you can use lapply on each of those indices and get the next 3 indices using seq.
args <- which(df$Action == "Get mushroom")
df[unlist(lapply(args, function(x) seq(x, x+3))), ]
# User Action
#4 Bob Get mushroom
#5 Bob Open inventory
#6 Bob Eat mushroom
#7 Bob Close inventory
#10 Mary Get mushroom
#11 Mary Attack
#12 Mary Jump
#13 Mary Attack
Or a similar approach (as suggested by @Sotos in comments):
df[sapply(args, function(x) seq(x, x+3)), ]
This sapply solution works on a data.frame but not on a data.table, since data.table does not accept a 2-column matrix for row indexing.
For it to work on a data.table, you can flatten the matrix using c:
df[c(sapply(args, function(x) seq(x, x+3))), ]

Related

R cleaning: change data format from horizontal to vertical, repeating some data

Right now, my data looks like this:
Coder Bill Witness1name Witness1job Witness2name Witness2Job
Joe 123 Fred Plumber Bob Coach
Karen 122 Sally Barista Helen Translator
Harry 431 Lisa Swimmer N/A N/A
Frank 301 N/A N/A N/A N/A
But I want my data to look like this:
Coder Bill WitnessName WitnessJob
Joe 123 Fred Plumber
Joe 123 Bob Coach
Karen 122 Sally Barista
Karen 122 Helen Translator
Harry 431 Lisa Swimmer
Frank 301 N/A N/A
So I want to take it from the coder/bill level to the "witness" level. Some coder/bills have up to 10 witnesses in their rows. Some have no witnesses, but I do not want to completely drop them from the dataset (see Frank).
All help is appreciated! I am familiar with the tidyverse package.
For those interested, I figured it out.
I had to change all the column names like this:
Witness1Name to Witness1_Name
Witness1Job to Witness1_Job
etc.
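For anyone wanting to reproduce this, a minimal version of the renamed data might look like the following (assuming the object is called mddata, as in the code below, and that the missing witnesses are genuine NA values):
mddata <- data.frame(
  Coder = c("Joe", "Karen", "Harry", "Frank"),
  Bill = c(123, 122, 431, 301),
  Witness1_Name = c("Fred", "Sally", "Lisa", NA),
  Witness1_Job = c("Plumber", "Barista", "Swimmer", NA),
  Witness2_Name = c("Bob", "Helen", NA, NA),
  Witness2_Job = c("Coach", "Translator", NA, NA),
  stringsAsFactors = FALSE
)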
Then I ran this:
library(tidyr)

cleandata <- pivot_longer(mddata, cols = -c(Coder, Bill),
                          names_to = c("Witness", ".value"),
                          names_pattern = 'Witness(\\d)_(.*)') %>%
  drop_na(Name)
And it gave me this:
Coder Bill Witness Name Job
Joe 123 1 Fred Plumber
Joe 123 2 Bob Coach
Karen 122 1 Sally Barista
Karen 122 2 Helen Translator
Harry 431 1 Lisa Swimmer
Close enough to what I wanted
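If rows like Frank's (no witnesses at all) should survive, as the question asked, one option might be to replace drop_na() with a grouped filter that keeps all real witness rows plus a single placeholder row for coder/bill combinations that have none. A rough sketch, assuming the same mddata as above:
library(dplyr)
library(tidyr)

cleandata2 <- pivot_longer(mddata, cols = -c(Coder, Bill),
                           names_to = c("Witness", ".value"),
                           names_pattern = 'Witness(\\d)_(.*)') %>%
  group_by(Coder, Bill) %>%
  # keep rows with a witness name, or one placeholder row when there are none
  filter(!is.na(Name) | (all(is.na(Name)) & Witness == "1")) %>%
  ungroup()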

Create weighted node and edge lists from a normal dataframe in R?

I'm trying to use visNetwork to create a node diagram. However, my data is not in the correct format and I haven't been able to find any help on this on the internet.
My current data frame looks similar to this:
name town car color age school
John Bringham Swift Red 22 Brighton
Sarah Bringham Corolla Red 33 Rustal
Beth Burb Swift Blue 43 Brighton
Joe Spring Polo Black 18 Riding
I want to use this data to create node and edge lists for a vis network.
I know that the "nodes" list will be made from the unique values in the "name" column, but I'm not sure how I would use the rest of the data to create the "edges" list.
I was thinking it may be possible to group by each column and then read back the matches, weighting the edges by how many matches are found across the various group-by calls, but I'm not sure how to implement this.
For example, Joe will not match with anyone because he shares no values with any of the others, while John and Sarah will have a weight of 2 because they share values in two columns.
Also open to solutions in python!
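To make the steps below reproducible, the example data might be built like this (assuming the data frame is called df, as in the answer's code):
df <- data.frame(
  name = c("John", "Sarah", "Beth", "Joe"),
  town = c("Bringham", "Bringham", "Burb", "Spring"),
  car = c("Swift", "Corolla", "Swift", "Polo"),
  color = c("Red", "Red", "Blue", "Black"),
  age = c(22, 33, 43, 18),
  school = c("Brighton", "Rustal", "Brighton", "Riding"),
  stringsAsFactors = FALSE
)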
One option is to compare row by row, in order to calculate the number of common values.
For instance for John (first row) and Sarah (second row):
sum(df[1,] == df[2,])
# 2
Then you can use the function combn() from the utils package to enumerate in advance all the pair combinations you have to compare:
# all unique pairs of names, one pair per row
nodes <- as.data.frame(matrix(combn(df$name, 2), ncol = 2, byrow = TRUE))
nodes$V1 <- as.character(nodes$V1)
nodes$V2 <- as.character(nodes$V2)
nodes$weight <- NA
nodes
# V1 V2 weight
#1 John Sarah NA
#2 John Beth NA
#3 John Joe NA
#4 Sarah Beth NA
#5 Sarah Joe NA
#6 Beth Joe NA
Finally, a loop to calculate the weight for each pair:
for (n in 1:nrow(nodes)) {
  name1 <- df[df$name == nodes$V1[n], ]
  name2 <- df[df$name == nodes$V2[n], ]
  nodes$weight[n] <- sum(name1 == name2)   # number of matching values between the two rows
}
nodes
# V1 V2 weight
#1 John Sarah 2
#2 John Beth 2
#3 John Joe 0
#4 Sarah Beth 0
#5 Sarah Joe 0
#6 Beth Joe 0
I think nodes is the kind of dataframe that you can then use with the visNetwork() function.
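As a rough follow-up sketch (not part of the original answer): visNetwork::visNetwork() expects a nodes data frame with an id column and an edges data frame with from and to columns, so the pair/weight table built above can be renamed into the edges argument, for example:
library(visNetwork)

# node table: one row per unique person
vis_nodes <- data.frame(id = unique(df$name), label = unique(df$name))

# edge table: reuse the pair/weight table from above, dropping zero-weight pairs
vis_edges <- data.frame(from = nodes$V1,
                        to = nodes$V2,
                        width = nodes$weight)
vis_edges <- vis_edges[vis_edges$width > 0, ]

visNetwork(vis_nodes, vis_edges)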

Find the favorite and analyse sequences in R

We have a daily meeting when participants nominate each other to speak. The first person is chosen randomly.
I have a dataframe that consists of names and the order of speech every day.
I have a day1, a day2 ,a day3 , etc. in the columns.
The data in the rows are numbers, meaning the order of speech on that particular day.
NA means that the person did not participate on that day.
Name day1 day2 day3 day4 ...
Albert 1 3 1 ...
Josh 2 2 NA
Veronica 3 5 3
Tim 4 1 2
Stew 5 4 4
...
I want to create two analyses. First, I want to create a dataframe showing who has chosen whom the most times. (I know that the result depends on whether a participant was nominated before, and therefore on that day that participant cannot be nominated again; I will handle it later, but for now this is enough.)
It should look like this:
Name Favorite
Albert Stew
Josh Veronica
Veronica Tim
Tim Stew
...
My questions (feel free to answer only one if you can):
1. What code shall I use for this without having to manually put the names in a different dataframe?
2. How shall I handle a tie, for example if Josh chose Veronica and Tim first the same number of times? Later I want to visualise it and I have no idea how to handle ties.
I also would like to analyse the results to visualise strong connections,
for example to show that there are people who usually choose each other, etc.
Is there a good package that is specialised for this? Or how should I approach it?
I do not need DNA sequence analysis, only simple sequences like this, but I have not found a suitable package yet.
Thanks for your help!
If I am not misunderstanding your problem, here is some code to get the number of occurrences of who chooses whom as the next speaker. I added a fourth day to have some counts that are not 1. There are ties in the result; choosing the first pair of each group by speaker ('who') may be a solution:
library(dplyr)

df <- read.table(textConnection(
"Name,day1,day2,day3,day4
Albert,1,3,1,3
Josh,2,2,,2
Veronica,3,5,3,1
Tim,4,1,2,4
Stew,5,4,4,5"), header = TRUE, sep = ",", stringsAsFactors = FALSE)

purrr::map(colnames(df)[-1],
           function(x) {
             # speakers in the order they spoke that day, NAs dropped
             who <- df$Name[order(df[[x]], na.last = NA)]
             # pair each speaker with the person they nominated next
             data.frame(who, lead(who), stringsAsFactors = FALSE)
           }
) %>%
  replyr::replyr_bind_rows() %>%
  filter(!is.na(lead.who.)) %>%
  group_by(who, lead.who.) %>%
  summarise(n = n()) %>%
  arrange(who, desc(n))
Input:
Name day1 day2 day3 day4
1 Albert 1 3 1 3
2 Josh 2 2 NA 2
3 Veronica 3 5 3 1
4 Tim 4 1 2 4
5 Stew 5 4 4 5
Result:
# A tibble: 12 x 3
# Groups: who [5]
who lead.who. n
<chr> <chr> <int>
1 Albert Tim 2
2 Albert Josh 1
3 Albert Stew 1
4 Josh Albert 2
5 Josh Veronica 1
6 Stew Veronica 1
7 Tim Stew 2
8 Tim Josh 1
9 Tim Veronica 1
10 Veronica Josh 1
11 Veronica Stew 1
12 Veronica Tim 1
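Following the suggestion about ties, one way to reduce this to a single favorite per speaker is to keep only the top row of each group. A minimal sketch, assuming the summarised result above has been stored in a tibble called counts (the name is made up here):
library(dplyr)

favorites <- counts %>%
  group_by(who) %>%
  slice(1) %>%      # rows are already sorted by descending n, so ties are broken arbitrarily
  ungroup() %>%
  rename(Favorite = lead.who.)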

If conditions and copying values from different rows

I have the following data:
Data <- data.frame(Project=c(123,123,123,123,123,123,124,124,124,124,124,125,125,125),
Name=c("Harry","David","David","Harry","Peter","Peter","John","Alex","Alex","Mary","Mary","Dan","Joe","Joe"),
Value=c(1,4,7,3,8,9,8,3,2,5,6,2,2,1),
OldValue=c("","Open","In Progress","Complete","Open","In Progress","Complete","Open","In Progress","System Declined","In Progress","","Open","In Progress"),
NewValue=c("Open","In Progress","Complete","Open","In Progress","Complete","Open","In Progress","System Declined","In Progress","Complete","Open","In Progress","Complete"))
I want to create another column called EditedBy that applies the following logic.
IF the project in row 1 equals the project in row 2 AND the NewValue in row 1 equals "Open", THEN take the name from row 2. If either of those two conditions is false, then keep the name from row 1.
The expected result is the EditedBy column shown in the answer output below.
How can I do this?
We can do this with data.table
library(data.table)
# group by Project and by "request cycle": a new group starts at each "Open"
# or right after a "System Declined"; within each group, take the second Name
setDT(Data)[, EditedBy := Name[2L],
            by = .(Project, grp = cumsum(NewValue == "Open" |
                                           shift(NewValue == "System Declined", fill = TRUE)))]
Data
# Project Name Value OldValue NewValue EditedBy
# 1: 123 Harry 1 Open David
# 2: 123 David 4 Open In Progress David
# 3: 123 David 7 In Progress Complete David
# 4: 123 Harry 3 Complete Open Peter
# 5: 123 Peter 8 Open In Progress Peter
# 6: 123 Peter 9 In Progress Complete Peter
# 7: 124 John 8 Complete Open Alex
# 8: 124 Alex 3 Open In Progress Alex
# 9: 124 Alex 2 In Progress System Declined Alex
#10: 124 Mary 5 System Declined In Progress Mary
#11: 124 Mary 6 In Progress Complete Mary
#12: 125 Dan 2 Open Joe
#13: 125 Joe 2 Open In Progress Joe
#14: 125 Joe 1 In Progress Complete Joe
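For readers who prefer dplyr, a roughly equivalent sketch (my translation, not part of the original answer); lag() with default = TRUE plays the role of data.table's shift(..., fill = TRUE):
library(dplyr)

Data %>%
  group_by(Project,
           grp = cumsum(NewValue == "Open" |
                          lag(NewValue == "System Declined", default = TRUE))) %>%
  mutate(EditedBy = Name[2]) %>%   # second name in each Project/cycle group
  ungroup() %>%
  select(-grp)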

cbind for multiple table() functions

I'm trying to count the frequency of values in multiple columns of a data.frame.
I used the table function on each column and bound the results together with cbind, and was going to use the aggregate function afterwards to calculate the means by my identifier.
Example:
df1
V1 V2 V3
George Mary Mary
George Mary Mary
George Mary George
Mary Mary George
Mary George George
Mary
Frequency<- as.data.frame(cbind(table(df1$V1), table(df1$V2), table(df1$V3)))
row.names V1
George 3
Mary 3
1
George 1
Mary 4
1
George 3
Mary 2
The result I get (visually) looks like a 2-column data frame, but when I check the dimensions of Frequency, the result implies that only one column exists.
This causes trouble when I try to rename the columns and run the aggregate function. The error I get for the rename:
colnames(Frequency) <- c("Name", "Frequency")
Error in names(Frequency) <- c("Name", "Frequency") :
'names' attribute [2] must be the same length as the vector [1]
The Final purpose is to run an aggregate command and get the mean by name:
Name.Mean <- aggregate(Frequency$Frequency, list(Frequency$Name), mean)
Desired output:
Name Mean
George Value
Mary Value
Using mtabulate (data from @user3169080's post):
library(qdapTools)
d1 <- mtabulate(df1)      # counts of each name within each column of df1
is.na(d1) <- d1 == 0      # treat zero counts as missing so they don't drag down the mean
colMeans(d1, na.rm = TRUE)
# Alice George Mary
# 4.0 3.0 2.5
I hope this is what you were looking for:
> df1
V1 V2 V3
1 George George George
2 Mary Mary Alice
3 George George George
4 Mary Mary Alice
5 <NA> George George
6 <NA> Mary Alice
7 <NA> <NA> George
8 <NA> <NA> Alice
> ll=unlist(lapply(df1,table))
> nn=names(ll)
> nn1=sapply(nn,function(x) substr(x,4,nchar(x)))
> mm=data.frame(ll)
> mm$names=nn1
> tapply(mm$ll,mm$names,mean)
> Mean=tapply(mm$ll,mm$names,mean)
> data.frame(Mean)
Mean
Alice 4.0
George 3.0
Mary 2.5
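For completeness, a base-R-only sketch of the same idea (assuming the blanks really are NA, as in the df1 printed just above): stack the columns into long form, count the name/column combinations, and average the non-zero counts per name.
# long format: one row per (value, source column) pair
long <- na.omit(stack(lapply(df1, as.character)))

# counts of each name within each original column; a column only counts if the name appears in it
counts <- as.data.frame(table(Name = long$values, Column = long$ind))
counts <- counts[counts$Freq > 0, ]

aggregate(Freq ~ Name, data = counts, FUN = mean)
#     Name Freq
# 1  Alice  4.0
# 2 George  3.0
# 3   Mary  2.5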
