Transposing data and sequence mining the most common patterns in rows - R

I have a data frame that looks like this:
SFOpID Number MAGroupID
1 0032A00002cgs3XQAQ 1 99
2 0032A00002cgs3XQAQ 1 79
3 003F000001vyUGKIA2 2 8
4 0032A00002btWE6QAM 3 97
5 0032A00002btWE6QAM 3 86
6 0032A00002btWE6QAM 3 35
I need to transpose it so that it looks like this:
SFOpID Number MAGroupID
1 0032A00002cgs3XQAQ 1 99 79
3 003F000001vyUGKIA2 2 8
Then I need to generate counts for the five most common sequences. For example: 12 people (SFOpID) have the 97 86 35 sequence, but only 4 people have the 99 79 sequence. I think this may be possible with the arulesSequences package, doing something like the following:
x <- read_baskets(con = system.file("misc", "zaki.txt",
                                    package = "arulesSequences"),
                  info = c("sequenceID", "eventID", "SIZE"))
as(x, "data.frame")
The goal is to have output that looks like this:
items sequenceID eventID SIZE
1 {C,D} 1 10 2
2 {A,B,C} 1 15 3
3 {A,B,F} 1 20 3
4 {A,C,D,F} 1 25 4
5 {A,B,F} 2 15 3
Except that, for items, it would be a sequence like {99, 79} or {97, 86, 35}.

You can use group_by() and nest() to collect the values into a list column. The list column can then be converted to text. Here is an example:
library(dplyr)
library(tidyr)  # nest() comes from tidyr

code <- read.csv("code.csv", stringsAsFactors = FALSE)
output <- code[, 2:4] %>%
  group_by(SFOpID, Number) %>%  # nest the MAGroupID values per person
  nest()
# flatten the nested list column to text
output$data <- as.character(output$data)
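If the arulesSequences route feels heavy, the top-five counts can also be produced directly with dplyr. The sketch below is an assumption, not part of the original answer: it rebuilds the example rows in a hypothetical data frame named df, collapses each SFOpID's MAGroupID values (in row order) into one sequence string, and tallies the five most common.
library(dplyr)

# rebuild the example data (hypothetical df, mirroring the question)
df <- data.frame(
  SFOpID    = c("0032A00002cgs3XQAQ", "0032A00002cgs3XQAQ", "003F000001vyUGKIA2",
                "0032A00002btWE6QAM", "0032A00002btWE6QAM", "0032A00002btWE6QAM"),
  Number    = c(1, 1, 2, 3, 3, 3),
  MAGroupID = c(99, 79, 8, 97, 86, 35),
  stringsAsFactors = FALSE
)

# collapse each person's MAGroupIDs into one {..} sequence string
sequences <- df %>%
  group_by(SFOpID) %>%
  summarise(items = paste0("{", paste(MAGroupID, collapse = ","), "}"))

# count the five most common sequences
sequences %>%
  count(items, sort = TRUE) %>%
  head(5)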

Populate column by adding onto row above using lag() in R

I want to populate an existing column with values that continually add onto the row above.
This is easy in Excel, but I haven't figured out a good way to automate it in R.
If we had two columns in Excel, A and B, cell B2 would be =B1+A2, and cell B3 would be =B2+A3. How can I do this in R?
#example dataframe
df <- data.frame(A = 0:9, B = c(50,0,0,0,0,0,0,0,0,0))
#desired output
desired <- data.frame(A = 0:9, B = c(NA, 51, 53, 56, 60, 65, 71, 78, 86, 95))
I tried using the lag() function, but it didn't give the correct output.
df <- df %>%
mutate(B = B + lag(A))
So I made a for loop that works, but I feel like there's a better solution.
for (i in 2:nrow(df)) {
  df$B[i] <- df$B[i - 1] + df$A[i]
}
Eventually, I want to iterate this function over every n rows of the whole dataframe, essentially so the summation resets every n rows. (Any tips on how to do that would be greatly appreciated!)
This might be close to what you need; it uses the tidyverse, specifically accumulate from purrr.
If you want the running sum to restart every n rows, you can group_by a block index ahead of time.
It was not entirely clear how you'd like to handle the first row; here, each group starts from its first B value and ignores its first A value, which looked similar to what you had in the post.
library(tidyverse)

n <- 5
df %>%
  group_by(grp = ceiling(row_number() / n)) %>%
  mutate(B = accumulate(A[-1], sum, .init = B[1]))
Output
A B grp
<int> <dbl> <dbl>
1 0 50 1
2 1 51 1
3 2 53 1
4 3 56 1
5 4 60 1
6 5 0 2
7 6 6 2
8 7 13 2
9 8 21 2
10 9 30 2
cumsum() can be used to get the result you need.
df$B <- cumsum(df$B + df$A)
df
A B
1 0 50
2 1 51
3 2 53
4 3 56
5 4 60
6 5 65
7 6 71
8 7 78
9 8 86
10 9 95
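The cumsum() idea also extends to the reset-every-n requirement in base R, by running it within blocks via ave(). A minimal sketch, assuming the same df and a block size n; note that, unlike the accumulate() answer above, this includes each block's first A value in the running sum:
n <- 5
df <- data.frame(A = 0:9, B = c(50, rep(0, 9)))  # the example data again
blocks <- ceiling(seq_len(nrow(df)) / n)         # 1 1 1 1 1 2 2 2 2 2
df$B <- ave(df$B + df$A, blocks, FUN = cumsum)   # cumsum restarts in each block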

Subset specific row and last row from data frame

I have a data frame which contains scoring data for different events; there can be a number of scoring events for one game. What I would like to do is subset the occasions when the score goes above 5 or below -5. I would also like to get the last row for each ID. So for each ID, I would have one or more rows depending on whether the score goes above 5 or below -5. My actual data set contains many other columns of information, but if I learn how to do this then I'll be able to apply it to anything else I may want to do.
Here is a data set
ID Score Time
1 0 0
1 3 5
1 -2 9
1 -4 17
1 -7 31
1 -1 43
2 0 0
2 -3 15
2 0 19
2 4 25
2 6 29
2 9 33
2 3 37
3 0 0
3 5 3
3 2 11
So for this data set, I would hopefully get this output:
ID Score Time
1 -7 31
1 -1 43
2 6 29
2 9 33
2 3 37
3 2 11
So at the very least, for each ID there will be one line printed with the last score for that ID, regardless of whether the score goes above 5 or below -5 during the event (this occurs for ID 3).
My attempt can subset when the value goes above 5 or below -5; I just don't know how to write code to get the last line for each ID:
Data[Data$Score > 5 | Data$Score < -5, ]
Let me know if you need any more information.
You can use rle to grab the last row for each ID. Check out ?rle for more information about this useful function.
Data2 <- Data[cumsum(rle(Data$ID)$lengths), ]
Data2
# ID Score Time
#6 1 -1 43
#13 2 3 37
#16 3 2 11
To combine the two conditions, use rbind.
Data2 <- rbind(Data[Data$Score > 5 | Data$Score < -5, ],
               Data[cumsum(rle(Data$ID)$lengths), ])
To get rid of rows that satisfy both conditions, you can use duplicated and rownames.
Data2 <- Data2[!duplicated(rownames(Data2)), ]
You can also sort if desired, of course.
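An equivalent base R idiom, offered here as a quick sketch, grabs the last occurrence of each ID with duplicated() scanned from the end, which avoids computing run lengths:
Data[!duplicated(Data$ID, fromLast = TRUE), ]  # keeps each ID's final row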
Here's a go at it in data.table, where df is your original data frame.
library(data.table)
setDT(df)
df[df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
We are grouping by ID. The between function finds the values between -5 and 5, and we negate that to get our desired values outside that range. We then use a .I subset to get the indices per group for those. Then .I[.N] gives us the row number of the last entry, per group. We use the V1 column of that result as our row subset for the entire table. You can take unique values if unique rows are desired.
Note: .I[c(which(!between(Score, -5, 5)), .N)] could also be used in the j entry of the first operation. Not sure if it's more or less efficient.
Addition: Another method, one that uses only logical values and will never produce duplicate rows in the output, is
df[df[, .I == .I[.N] | !between(Score, -5, 5), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
Here is another base R solution.
df[as.logical(ave(df$Score, df$ID,
                  FUN = function(i) abs(i) > 5 | seq_along(i) == length(i))), ]
ID Score Time
5 1 -7 31
6 1 -1 43
11 2 6 29
12 2 9 33
13 2 3 37
16 3 2 11
abs(i) > 5 | seq_along(i) == length(i) constructs a logical vector that returns TRUE for each element that fits your criteria. ave applies this function to each ID. The resulting logical vector is used to select the rows of the data.frame.
Here's a tidyverse solution. Not as concise as some of the above, but easier to follow.
library(tidyverse)
lastrows <- Data %>% group_by(ID) %>% top_n(1, Time)
scorerows <- Data %>% group_by(ID) %>% filter(!between(Score, -5, 5))
bind_rows(scorerows, lastrows) %>% arrange(ID, Time) %>% unique()
# A tibble: 6 x 3
# Groups: ID [3]
# ID Score Time
# <int> <int> <int>
# 1 1 -7 31
# 2 1 -1 43
# 3 2 6 29
# 4 2 9 33
# 5 2 3 37
# 6 3 2 11

Perform operations on a data frame based on a factor

I'm having a hard time describing this, so it's best explained with an example (as can probably be seen from the poor question title).
Using dplyr, I have the result of a group_by and summarize: a data frame that I want to do some further manipulation on, by factor.
As an example, here's a data frame that looks like the result of my dplyr operations:
> df <- data.frame(run = as.factor(c(rep(1, 3), rep(2, 3))),
                   group = as.factor(rep(c("a", "b", "c"), 2)),
                   sum = c(1, 8, 34, 2, 7, 33))
> df
run group sum
1 1 a 1
2 1 b 8
3 1 c 34
4 2 a 2
5 2 b 7
6 2 c 33
I want to divide sum by a value that depends on run. For example, if I have:
> total <- data.frame(run=as.factor(c(1,2)),
total=c(45,47))
> total
run total
1 1 45
2 2 47
Then my final data frame will look like this:
> df
run group sum percent
1 1 a 1 1/45
2 1 b 8 8/45
3 1 c 34 34/45
4 2 a 2 2/47
5 2 b 7 7/47
6 2 c 33 33/47
Where I wrote the fractions in the percent column by hand to show the operation I want to do.
I know there is probably some dplyr way to do this with mutate, but I can't seem to figure it out right now. How would this be accomplished?
(In base R)
You can use total as a look-up table to fetch the total for each run of df:
total[df$run,'total']
[1] 45 45 45 47 47 47
And you simply use it to divide the sum and assign the result to a new column:
df$percent <- df$sum / total[df$run,'total']
run group sum percent
1 1 a 1 0.02222222
2 1 b 8 0.17777778
3 1 c 34 0.75555556
4 2 a 2 0.04255319
5 2 b 7 0.14893617
6 2 c 33 0.70212766
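One caution on the lookup above: total[df$run, 'total'] works because run is a factor whose integer codes (1, 2) happen to line up with the row order of total. If the levels and row order could ever disagree, match() makes the lookup explicit; a sketch:
df$percent <- df$sum / total$total[match(df$run, total$run)]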
If your "run" values are 1,2...n then this will work
divisor <- c(45,47) # c(45,47,...up to n divisors)
df$percent <- df$sum/divisor[df$run]
First, you want to merge the total values into your df:
df2 <- merge(df, total, by = "run")
Then you can call mutate (the %<>% assignment pipe comes from magrittr):
library(dplyr)
library(magrittr)
df2 %<>% mutate(percent = sum / total)
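The merge-then-mutate steps can also be collapsed into one dplyr pipeline with left_join; a quick sketch using the df and total from the question (inside mutate, sum and total refer to the columns, not the base functions):
library(dplyr)

df %>%
  left_join(total, by = "run") %>%
  mutate(percent = sum / total)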
Convert to data.table in-place, then merge and add new column, again in-place:
library(data.table)
setDT(df)[total, on = 'run', percent := sum/total]
df
# run group sum percent
#1: 1 a 1 0.02222222
#2: 1 b 8 0.17777778
#3: 1 c 34 0.75555556
#4: 2 a 2 0.04255319
#5: 2 b 7 0.14893617
#6: 2 c 33 0.70212766

Randomly select rows based on limited random numbers

Seems simple but I can't figure it out.
I have a bunch of animal location data (217 individuals) as a single dataframe. I'm trying to randomly select X locations per individual for further analysis with the caveat that X is within the range of 6-156.
So I'm trying to set up a loop that first randomly selects a value within the range of 6-156 then use that value (say 56) to randomly extract 56 locations from the first individual animal and so on.
for(i in unique(ANIMALS$ID)){
sub<-sample(6:156,1)
sub2<-i([sample(nrow(i),sub),])
}
This approach didn't seem to work so I tried tweaking it...
for(i in unique(ANIMALS$ID)){
sub<-sample(6:156,1)
rand<-i[sample(1:nrow(i),sub,replace=FALSE),]
}
This did not work either.. Any suggestions or previous postings would be helpful!
Head of the datafile...ANIMALS is the name of the df, ID indicates unique individuals
> FID X Y MONTH DAY YEAR HOUR MINUTE SECOND ELKYR SOURCE ID animalid
1 0 510313 4813290 9 5 2008 22 30 0 342008 FG 1 1
2 1 510382 4813296 9 6 2008 1 30 0 342008 FG 1 1
3 2 510385 4813311 9 6 2008 2 0 0 342008 FG 1 1
4 3 510385 4813394 9 6 2008 3 30 0 342008 FG 1 1
5 4 510386 4813292 9 6 2008 2 30 0 342008 FG 1 1
6 5 510386 4813431 9 6 2008 4 1 0 342008 FG 1 1
Here's one way using mapply. This function takes two lists (or something that can be coerced into a list) and applies function FUN to corresponding elements.
# simulate some data
xy <- data.frame(animal = rep(1:10, each = 10), loc = runif(100))
# calculate number of samples for individual animal
num.samples.per.animal <- sample(3:6, length(unique(xy$animal)), replace = TRUE)
num.samples.per.animal
[1] 6 3 4 4 6 3 3 6 3 5
# subset random x number of rows from each animal
result <- do.call("rbind",
  mapply(num.samples.per.animal, split(xy, f = xy$animal),
         FUN = function(x, y) {
           y[sample(1:nrow(y), x), ]
         }, SIMPLIFY = FALSE))
result
animal loc
7 1 0.99483999
1 1 0.50951321
10 1 0.36505294
6 1 0.34058842
8 1 0.26489107
9 1 0.47418823
13 2 0.27213396
12 2 0.28087775
15 2 0.22130069
23 3 0.33646632
21 3 0.02395097
28 3 0.53079981
29 3 0.85287600
35 4 0.84534073
33 4 0.87370167
31 4 0.85646813
34 4 0.11642335
46 5 0.59624723
48 5 0.15379729
45 5 0.57046122
42 5 0.88799675
44 5 0.62171858
49 5 0.75014593
60 6 0.86915983
54 6 0.03152932
56 6 0.66128549
64 7 0.85420774
70 7 0.89262455
68 7 0.40829671
78 8 0.19073661
72 8 0.20648832
80 8 0.71778913
73 8 0.77883677
75 8 0.37647108
74 8 0.65339300
82 9 0.39957202
85 9 0.31188471
88 9 0.10900795
100 10 0.55282999
95 10 0.10145296
96 10 0.09713218
93 10 0.64900866
94 10 0.76099256
EDIT
Here is another (more straightforward) approach that also handles cases where the number of available rows is smaller than the number of samples to be drawn.
set.seed(357)
result <- do.call("rbind",
  by(xy, INDICES = xy$animal, FUN = function(x) {
    avail.obs <- nrow(x)
    num.rows <- sample(3:15, 1)
    while (num.rows > avail.obs) {
      message("Sample larger than available data points, repeating sampling.")
      num.rows <- sample(3:15, 1)
    }
    x[sample(1:avail.obs, num.rows), ]
  }))
result
I like Stack Overflow because I learn so much. @RomanLustrik provided a simple solution; mine is straightforward as well:
# simulate some data
xy <- data.frame(animal = rep(1:10, each = 10), loc = runif(100))
newVec <- NULL  # start empty; rbind() builds the data frame up
for (i in unique(xy$animal)) {
  # sample a number between 1 and 10 (or 6 and 156, if you need)
  samp <- sample(1:10, 1)
  # find the rows of xy that belong to animal i
  rows <- which(xy$animal == i)
  # draw samp of those rows (replace = TRUE allows repeats;
  # use replace = FALSE if each location may appear only once)
  newVec1 <- xy[sample(rows, samp, replace = TRUE), ]
  # append everything to the same new data frame
  newVec <- rbind(newVec, newVec1)
}
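The same per-animal sampling can also be written as a dplyr pipeline; this is a sketch against the ANIMALS data frame from the question (group_modify and slice_sample are assumed from a recent dplyr, and min() caps each draw at the rows an animal actually has):
library(dplyr)

sampled <- ANIMALS %>%
  group_by(ID) %>%
  group_modify(~ slice_sample(.x, n = min(sample(6:156, 1), nrow(.x)))) %>%
  ungroup()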

Return rows of data frame that meet multiple criteria in R (panel data random sample)

I am hoping to create a random sample from panel data based on the unique id.
For instance if you start with:
e = data.frame(id=c(1,1,1,2,2,3,3,3,4,4,4,4), data=c(23,34,45,1,23,45,6,2,9,39,21,1))
And you want a random sample of 2 unique ids:
out = data.frame(id=c(1,1,1,3,3,3), data=c(23,34,45,45,6,2))
Although sample gives me random unique ids
sample(e$id, 2)  # gives c(1, 3)
I can't figure out how to use logical calls to return all the desired data.
I have tried a number of things including:
e[ e$id == sample( e$id ,2) ] # only returns 1/2 the data
Any ideas? It's killing me.
I'm not entirely sure what your expected result should be, but does this work for what you're trying to do?
> e[e$id %in% sample(e$id, 2), ]
id data
6 3 45
7 3 6
8 3 2
9 4 9
10 4 39
11 4 21
12 4 1
Or maybe you want this:
> e[e$id %in% sample(unique(e$id), 2), ]
id data
1 1 23
2 1 34
3 1 45
9 4 9
10 4 39
11 4 21
12 4 1
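The second form also reads naturally as a dplyr one-liner; a quick sketch:
library(dplyr)

e %>% filter(id %in% sample(unique(id), 2))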
