I have a large data set which contains a time column and a column identifying each row as a saccade or a fixation of the eye (saccade = fast eye movement, fixation = relatively stable eye movement). I want to calculate how long each period of fixations and saccades lasts, by taking the time from the start of the first "f" until the first "s" and so on. So if there are 3 consecutive rows with "s", I want to take the time in the row where the first "s" appeared and the time in the row where the last "s" appeared before the next "f". By subtracting these 2 times I know the duration of each fixation and saccade period.
The time scale is not continuous, since sometimes rows are deleted because of blinks in the data.
example.df <- data.frame(time = seq(1:100),
saccade = sample(letters[c(6, 19)], 100, replace = T))
Is there an easy way to do this?
Thanks a lot
We can create a run index using rle() and then group_by() this index to take the difference between the last and the first time of each run:
library(tidyverse)
example.df <- data.frame(time = seq(1:100),
saccade = sample(letters[c(6, 19)], 100, replace = T))
test <- rle(example.df$saccade == "s")
example.df$indexer <- rep(1:length(test$lengths), test$lengths)
example.df <- example.df %>%
group_by(indexer) %>%
mutate(period = time[n()] - time[1])
# A tibble: 100 x 4
# Groups: indexer [53]
time saccade indexer period
<int> <fctr> <int> <int>
1 1 s 1 1
2 2 s 1 1
3 3 f 2 0
4 4 s 3 0
5 5 f 4 3
6 6 f 4 3
7 7 f 4 3
8 8 f 4 3
9 9 s 5 1
10 10 s 5 1
# ... with 90 more rows
# drop indexer column
example.df <- example.df[setdiff(names(example.df),"indexer")]
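Result as a data.frame with one row per run (note that the duration here is derived from row positions, so it matches elapsed time only if the time column has no gaps):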
example.df <- data.frame(time = seq(1:100),
saccade = sample(letters[c(6, 19)], 100, replace = T),
stringsAsFactors = FALSE)
run_len_encoding <- rle(example.df$saccade)
length_of_runs <- run_len_encoding$length
index_of_changes <- cumsum(length_of_runs)
duration <- diff(c(1,index_of_changes),1)
result.df <- data.frame(duration, state = run_len_encoding$values)
result.df
duration state
1 1 s
2 2 f
3 1 s
4 4 f
5 1 s
6 3 f
7 3 s
8 2 f
9 3 s
10 1 f
11 2 s
12 1 f
13 1 s
14 2 f
15 4 s
16 1 f
17 2 s
18 1 f
19 1 s
20 1 f
21 1 s
22 1 f
23 2 s
24 1 f
25 2 s
26 3 f
27 1 s
28 1 f
29 2 s
30 1 f
31 1 s
32 1 f
33 6 s
34 1 f
35 3 s
36 3 f
37 1 s
38 2 f
39 2 s
40 4 f
41 1 s
42 1 f
43 1 s
44 1 f
45 1 s
46 2 f
47 1 s
48 3 f
49 2 s
50 1 f
51 4 s
52 1 f
53 1 s
54 1 f
55 2 s
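Because rows removed for blinks leave gaps in the time column, counting rows per run can differ from the elapsed time. The following is a minimal base R sketch, not a definitive implementation, that reads the first and last time stamp of each run instead. The toy data and the names runs, start_idx, end_idx and periods are made up for illustration.

```r
# Hypothetical toy data: time has gaps (4, 7 and 8 are missing), as after blink removal
example.df <- data.frame(
  time    = c(1, 2, 3, 5, 6, 9, 10, 11),
  saccade = c("s", "s", "f", "f", "f", "s", "f", "f"),
  stringsAsFactors = FALSE
)

# Run-length encode the state column: one entry per consecutive run
runs <- rle(example.df$saccade)

# Row index of the last and of the first row of every run
end_idx   <- cumsum(runs$lengths)
start_idx <- end_idx - runs$lengths + 1

# One row per run, with the duration taken from the time column itself
periods <- data.frame(
  state = runs$values,
  start = example.df$time[start_idx],
  end   = example.df$time[end_idx],
  stringsAsFactors = FALSE
)
periods$duration <- periods$end - periods$start
periods
```

Under this definition a run consisting of a single row gets duration 0; whether to add one sampling interval per run is a choice the question leaves open.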
This may have a simple answer, but after a few hours of searching I still cannot find it. Basically I need to turn a wide dataset into a long-format dataset, but with multiple variables. My dataset structure looks like this:
df1 <- data.frame(id = c(1,2,3),
sex = c("M","F","M"),
day0s = c(21,25,15),
day1s = c(20,30,18),
day2s = c(18,18,17),
day0t = c(2,5,7),
day1t = c(3,6,5),
day2t = c(3,8,7))
df1
id sex day0s day1s day2s day0t day1t day2t
1 M 21 20 18 2 3 3
2 F 25 30 18 5 6 8
3 M 15 18 17 7 5 7
Basically 3 subjects have done a math test (s) and history test (t) every day for 3 days.
I tried to use gather from tidyr to turn it into long form, but I don't know how to assign the s and t variables to the same day. I also coded a new variable day with just day0 = 0, day1 = 1 and day2 = 2.
dfl <- df1 %>%
gather(day, value, - c(id,sex))
dfl
id sex variable value day
1 M day0s 21 0
1 M day1s 20 1
1 M day2s 18 2
1 M day0t 2 0
1 M day1t 3 1
1 M day2t 3 2
2 F day0s 25 0
2 F day1s 30 1
2 F day2s 18 2
2 F day0t 5 0
2 F day1t 6 1
2 F day2t 8 2
3 M day0s 15 0
3 M day1s 18 1
3 M day2s 17 2
3 M day0t 7 0
3 M day1t 5 1
3 M day2t 7 2
Ideally in the end it should look like this.
id sex day s t
1 M 0 21 2
1 M 1 20 3
1 M 2 18 3
2 F 0 25 5
2 F 1 30 6
2 F 2 18 8
3 M 0 15 7
3 M 1 18 5
3 M 2 17 7
Do you please have any suggestions on how to achieve this?
You can use {tidyr}'s pivot_longer here.
If your actual variables are named a bit differently, you can adapt the regex to your case; try it out in a regex tester and adapt accordingly. (Note that in R the backslash has to be escaped, hence the double backslash in \\d+ and \\w+.)
In general, the names_pattern argument works by matching the regex groups within the parentheses, in order, to the entries of the names_to argument, so that here:
(\\d+) -> becomes variable day. Regex \d+ matches 1 or more digits.
(\\w+) -> becomes ".value". Regex \w+ matches 1 or more word characters. Thanks to r2evans for pointing out the ".value" argument that spares one further reshape. The documentation states that .value "tells pivot_longer() that that part of the column name specifies the “value” being measured (which will become a variable in the output)." While I don't fully grasp the documentation's explanation, the result is that the text matched by this group becomes the column names in the output data.
library(dplyr)
library(tidyr)
df1 <- data.frame(id = c(1,2,3),
sex = c("M","F","M"),
day0mt = c(21,25,15),
day1mt = c(20,30,18),
day2mt = c(18,18,17),
day0ht = c(2,5,7),
day1ht = c(3,6,5),
day2ht = c(3,8,7))
df1
#> id sex day0mt day1mt day2mt day0ht day1ht day2ht
#> 1 1 M 21 20 18 2 3 3
#> 2 2 F 25 30 18 5 6 8
#> 3 3 M 15 18 17 7 5 7
df1 %>%
pivot_longer(cols = starts_with("day"),
names_pattern = "day(\\d+)(\\w+)",
names_to = c("day", ".value"))
#> # A tibble: 9 x 5
#> id sex day mt ht
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 1 M 0 21 2
#> 2 1 M 1 20 3
#> 3 1 M 2 18 3
#> 4 2 F 0 25 5
#> 5 2 F 1 30 6
#> 6 2 F 2 18 8
#> 7 3 M 0 15 7
#> 8 3 M 1 18 5
#> 9 3 M 2 17 7
Created on 2021-06-20 by the reprex package (v2.0.0)
Note that in newer versions of tidyr, gather and spread are superseded by pivot_longer and pivot_wider.
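To see how the two capture groups of the pattern split each column name, here is a small base R sketch. It uses regexec() and regmatches() purely for illustration and makes no claim about how tidyr implements names_pattern internally; the names cols, matches and parts are hypothetical.

```r
# Column names as in the mt/ht example
cols <- c("day0mt", "day1mt", "day2mt", "day0ht", "day1ht", "day2ht")

# regexec() returns the full match followed by one entry per capture group
matches <- regmatches(cols, regexec("day(\\d+)(\\w+)", cols))

# Keep only the two capture groups and label them like names_to
parts <- do.call(rbind, lapply(matches, function(m) m[2:3]))
colnames(parts) <- c("day", ".value")
parts
```

The first group supplies the values of the day column, and the second group supplies the names of the output columns (mt and ht).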
Using the latest development version of data.table (1.14.1), which adds some new melt features.
Use data.table::update.dev.pkg() to install the development version.
library(data.table)
# data.table 1.14.1 IN DEVELOPMENT built 2021-06-22 09:38:23 UTC
dcast(
melt(setDT(df1), measure.vars = measure(day, type, pattern="^day(.)(.)")),
... ~ type, value.var = "value")
id sex day s t
1: 1 M 0 21 2
2: 1 M 1 20 3
3: 1 M 2 18 3
4: 2 F 0 25 5
5: 2 F 1 30 6
6: 2 F 2 18 8
7: 3 M 0 15 7
8: 3 M 1 18 5
9: 3 M 2 17 7
Here is a way. It first reshapes to long format, separates the day* column into day number and suffix columns and reshapes back to wide format.
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
pivot_longer(cols = starts_with("day")) %>%
mutate(day = str_extract(name, "\\d+"),
suffix = str_extract(name, "[^[:digit:]]+$")) %>%
select(-name) %>%
pivot_wider(
id_cols = -c(value, suffix),
names_from = suffix,
values_from = value
)
## A tibble: 9 x 5
# id sex day s t
# <dbl> <chr> <chr> <dbl> <dbl>
#1 1 M 0 21 2
#2 1 M 1 20 3
#3 1 M 2 18 3
#4 2 F 0 25 5
#5 2 F 1 30 6
#6 2 F 2 18 8
#7 3 M 0 15 7
#8 3 M 1 18 5
#9 3 M 2 17 7
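For comparison, the same reshape can be sketched in base R with reshape(), without any extra packages. This is one possible alternative under the assumption that the columns are named exactly as in the question (day0s ... day2t); the object name long is made up for illustration.

```r
df1 <- data.frame(id = c(1, 2, 3),
                  sex = c("M", "F", "M"),
                  day0s = c(21, 25, 15), day1s = c(20, 30, 18), day2s = c(18, 18, 17),
                  day0t = c(2, 5, 7),    day1t = c(3, 6, 5),    day2t = c(3, 8, 7))

# Each element of `varying` lists the wide columns that feed one long column
long <- reshape(
  df1,
  direction = "long",
  varying   = list(c("day0s", "day1s", "day2s"),
                   c("day0t", "day1t", "day2t")),
  v.names   = c("s", "t"),
  timevar   = "day",
  times     = 0:2,
  idvar     = "id"
)

# reshape() stacks by day; sort by subject and day to match the desired layout
long <- long[order(long$id, long$day), ]
rownames(long) <- NULL
long
```

Unlike the pivot_longer approach, the column groups have to be spelled out by hand here, so this is mainly practical when there are only a few measured variables.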
I would like to get multiple frequency tables within a single table.
Here is my data:
df<-read.table(text=" group score night ticket book gender course
A Y 1 0 0 Male M
A Y 1 0 0 Female N
A N 1 1 1 Female N
A Y 2 1 1 Female M
A Y 2 1 1 Male N
A Y 2 0 0 Female N
A N 3 1 0 Male N
B N 3 1 1 Female N
B N 1 0 1 Female M
B Y 1 0 1 Female M
",header=TRUE)
and the output would be :
Frequency Percent
Group
A 7 70
B 3 30
score
Y 6 60
N 4 40
night
1 5 50
2 3 30
3 2 20
book
0 4 40
1 6 60
gender
Female 7 70
Male 3 30
course
M 4 40
N 6 60
I have used the following code:
df %>%
group_by(group, score, night, ticket, book, gender, course) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
But it did not work.
A general solution would be to apply the table function on each column of your frame. Typically table returns a named-vector, but you want a more frame-like presentation, so we'll augment that with as.data.frame.table.
lst2 <- lapply(df, function(x) {
out <- as.data.frame.table(table(x))
out$Pct <- 100*out$Freq/sum(out$Freq)
out
})
# or code-golf:
# lapply(df, function(x) transform(as.data.frame.table(table(x)), Pct = 100*Freq/sum(Freq)))
lst2
# $group
# x Freq Pct
# 1 A 7 70
# 2 B 3 30
# $score
# x Freq Pct
# 1 N 4 40
# 2 Y 6 60
# $night
# x Freq Pct
# 1 1 5 50
# 2 2 3 30
# 3 3 2 20
# $ticket
# x Freq Pct
# 1 0 5 50
# 2 1 5 50
# $book
# x Freq Pct
# 1 0 4 40
# 2 1 6 60
# $gender
# x Freq Pct
# 1 Female 7 70
# 2 Male 3 30
# $course
# x Freq Pct
# 1 M 4 40
# 2 N 6 60
You can combine all of these elements with something like:
do.call(rbind, c(Map(cbind, nm=names(lst2), lst2), list(make.row.names = FALSE)))
# nm x Freq Pct
# 1 group A 7 70
# 2 group B 3 30
# 3 score N 4 40
# 4 score Y 6 60
# 5 night 1 5 50
# 6 night 2 3 30
# 7 night 3 2 20
# 8 ticket 0 5 50
# 9 ticket 1 5 50
# 10 book 0 4 40
# 11 book 1 6 60
# 12 gender Female 7 70
# 13 gender Male 3 30
# 14 course M 4 40
# 15 course N 6 60
Edited to remove the row names by default.
An option with tidyverse would be
library(purrr)
library(dplyr)
map(names(df), ~ df %>%
count(!!rlang::sym(.x)) %>%
mutate(Pct = 100 * n/sum(n)))
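The map() call returns a list with one table per column. If a single stacked table is wanted, one possible sketch with dplyr is shown below; the output column names variable, value, Freq and Pct and the object name freq_all are my own choices, not part of the answers above.

```r
library(dplyr)

df <- read.table(text = "group score night ticket book gender course
A Y 1 0 0 Male M
A Y 1 0 0 Female N
A N 1 1 1 Female N
A Y 2 1 1 Female M
A Y 2 1 1 Male N
A Y 2 0 0 Female N
A N 3 1 0 Male N
B N 3 1 1 Female N
B N 1 0 1 Female M
B Y 1 0 1 Female M
", header = TRUE)

# Count each column separately, convert the levels to character so that
# numeric and text columns can be stacked, then bind everything together
freq_all <- bind_rows(lapply(names(df), function(col) {
  df %>%
    count(value = as.character(.data[[col]]), name = "Freq") %>%
    mutate(variable = col, Pct = 100 * Freq / sum(Freq)) %>%
    select(variable, value, Freq, Pct)
}))
freq_all
```

Grouping by all columns at once, as in the question, counts combinations of values rather than each column on its own, which is why the counting is done column by column here.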
I am trying to split column values separated by comma(,) into new rows based on id's. I know how to do this in R using dplyr and tidyr. But I am looking to solve same problem in sparklyr.
id <- c(1,1,1,1,1,2,2,2,3,3,3)
name <- c("A,B,C","B,F","C","D,R,P","E","A,Q,W","B,J","C","D,M","E,X","F,E")
value <- c("1,2,3","2,4,43,2","3,1,2,3","1","1,2","26,6,7","3,3,4","1","1,12","2,3,3","3")
dt <- data.frame(id,name,value)
R solution:
separate_rows(dt, name, sep=",") %>%
separate_rows(value, sep=",")
Desired Output from sparkframe(sparklyr package)-
> final_result
id name value
1 1 A 1
2 1 A 2
3 1 A 3
4 1 B 1
5 1 B 2
6 1 B 3
7 1 C 1
8 1 C 2
9 1 C 3
10 1 B 2
11 1 B 4
12 1 B 43
13 1 B 2
14 1 F 2
15 1 F 4
16 1 F 43
17 1 F 2
18 1 C 3
19 1 C 1
20 1 C 2
21 1 C 3
22 1 D 1
23 1 R 1
24 1 P 1
25 1 E 1
26 1 E 2
27 2 A 26
28 2 A 6
29 2 A 7
30 2 Q 26
31 2 Q 6
32 2 Q 7
33 2 W 26
34 2 W 6
35 2 W 7
36 2 B 3
37 2 B 3
38 2 B 4
39 2 J 3
40 2 J 3
41 2 J 4
42 2 C 1
43 3 D 1
44 3 D 12
45 3 M 1
46 3 M 12
47 3 E 2
48 3 E 3
49 3 E 3
50 3 X 2
51 3 X 3
52 3 X 3
53 3 F 3
54 3 E 3
Note-
I have approx 1000 columns with nested values. so, I need a function which can loop in for each column.
I know we have sdf_unnest() function from package sparklyr.nested. But, I am not sure how to split strings of multiple columns and apply this function. I am quite new in sparklyr.
Any help would be much appreciated.
You have to combine explode and split (sdt is assumed here to be the Spark copy of dt, e.g. created with copy_to()):
sdt %>%
mutate(name = explode(split(name, ","))) %>%
mutate(value = explode(split(value, ",")))
# Source: lazy query [?? x 3]
# Database: spark_connection
id name value
<dbl> <chr> <chr>
1 1.00 A 1
2 1.00 A 2
3 1.00 A 3
4 1.00 B 1
5 1.00 B 2
6 1.00 B 3
7 1.00 C 1
8 1.00 C 2
9 1.00 C 3
10 1.00 B 2
# ... with more rows
Please note that lateral views have to be expressed as separate subqueries, so this:
sdt %>%
mutate(
name = explode(split(name, ",")),
value = explode(split(value, ",")))
won't work, because Spark allows only one generator such as explode per SELECT clause.
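Since the question mentions roughly 1000 such columns, the one-column-at-a-time pattern has to be looped. The sketch below is plain local R, not Spark code: it only mimics what a single explode(split()) step does to a data frame and shows the loop over column names with Reduce(). The helper explode_column and the toy data are hypothetical, and the same loop has not been adapted to or tested with sparklyr here.

```r
# Small local stand-in for the data in the question
dt <- data.frame(id = c(1, 2),
                 name = c("A,B", "C"),
                 value = c("1,2,3", "4"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: split one column on `sep` and repeat each row
# once per resulting piece, like one explode(split()) step
explode_column <- function(df, col, sep = ",") {
  pieces <- strsplit(df[[col]], sep, fixed = TRUE)
  out <- df[rep(seq_len(nrow(df)), lengths(pieces)), , drop = FALSE]
  out[[col]] <- unlist(pieces)
  rownames(out) <- NULL
  out
}

# Apply the helper to every nested column in turn
result <- Reduce(explode_column, c("name", "value"), dt)
result
```

Each step multiplies the number of rows, so with many nested columns the result can grow very quickly.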
I have the following dataset:
df<- as.data.frame(c(rep("a", times = 9), rep("b", times = 18), rep("c", times = 27)))
colnames(df)<-"Location"
Year<-c(rep(1:3,times = 3), rep(1:6, times = 3), rep(1:9, times = 3))
df$Year<-Year
library(dplyr)
df <- df %>%
mutate(Predictor = seq_along(Location))
print(df)
Location Year Predictor
a 1 1
a 2 2
a 3 3
a 1 4
a 2 5
a 3 6
a 1 7
a 2 8
a 3 9
b 1 10
b 2 11
b 3 12
b 4 13
b 5 14
... 40 more rows
I want to split the above dataframe into training and test sets. For the test set, I want to randomly sample a third of the number of years in each Location, while keeping the years together. So if year "1" is selected for location "a", I want all three "1's" in the test set and so on. My test set should look something like this:
Location Year Predictor
a 1 1
a 1 4
a 1 7
b 3 12
b 3 18
b 3 24
b 5 14
b 5 20
b 5 26
c 3 30
c 3 39
c 3 48
c 6 33
c 6 42
c 6 51
c 7 34
c 7 43
c 7 52
I found a similar question here, but this procedure would sample the same year and the same number of years from every location (and YEAR is numeric, not a factor). I want a different random sample of years from each location and a proportional number of samples.
Would like to do this in dplyr if possible
You can first create a distinct set of year/location combinations, then sample some of them for each location and use that in a semi_join on the original data. This could be done as:
df %>%
distinct(Location, Year) %>%
group_by(Location) %>%
sample_frac(.3) %>%
semi_join(df, .)
# Location Year Predictor
# 1 a 3 3
# 2 a 3 6
# 3 a 3 9
# 4 b 4 13
# 5 b 4 19
# 6 b 4 25
# 7 b 5 14
# 8 b 5 20
# 9 b 5 26
# 10 c 8 35
# 11 c 8 44
# 12 c 8 53
# 13 c 1 28
# 14 c 1 37
# 15 c 1 46
# 16 c 2 29
# 17 c 2 38
# 18 c 2 47
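A variation that also builds the training set can be sketched as follows. It assumes every location has at least three distinct years, draws exactly one third of them (integer division) per location, and uses the hypothetical names test_set and train_set; the seed is arbitrary.

```r
library(dplyr)

df <- data.frame(
  Location = rep(c("a", "b", "c"), times = c(9, 18, 27)),
  Year     = c(rep(1:3, times = 3), rep(1:6, times = 3), rep(1:9, times = 3))
)
df$Predictor <- seq_len(nrow(df))

set.seed(123)  # arbitrary seed, only for reproducibility

# Within each location, keep all rows whose year is among the sampled years
test_set <- df %>%
  group_by(Location) %>%
  filter(Year %in% sample(unique(Year), n_distinct(Year) %/% 3)) %>%
  ungroup()

# Everything that is not in the test set becomes the training set
train_set <- anti_join(df, test_set, by = c("Location", "Year"))
```

Because whole years are moved to the test set, no Location/Year combination ends up split between the two sets.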
I have this data.frame called dum
dummy <- data.frame(label = "a", x = c(1,1,1,1,0,1,1,1,1,1,1,1,1))
dummy1 <- data.frame(label = "b", x = c(1,1,1,1,1,1,1,1,0,1,1,1,1))
dum <- rbind(dummy,dummy1)
What I am trying to do is take the cumulative sum starting at 0 in the x column of dum. The summing would be grouped by the label column, which can be implemented in dplyr or plyr. The part that I am struggling with is how to start the cumulative sum from the 0 position in x and go outward.
The resulting data.frame should look like this :
>dum
label x output
1 a 1 4
2 a 1 3
3 a 1 2
4 a 1 1
5 a 0 0
6 a 1 1
7 a 1 2
8 a 1 3
9 a 1 4
10 a 1 5
11 a 1 6
12 a 1 7
13 a 1 8
14 b 1 8
15 b 1 7
16 b 1 6
17 b 1 5
18 b 1 4
19 b 1 3
20 b 1 2
21 b 1 1
22 b 0 0
23 b 1 1
24 b 1 2
25 b 1 3
26 b 1 4
This would need to be iterated thousands of times over millions of rows of data.
As usual, thanks for any and all help
It seems more like you just want to find the distance to a zero, rather than any sort of cumulative sum. If that's the case, then
#find zeros for each group
zeros <- tapply(seq.int(nrow(dum)) * as.numeric(dum$x==0), dum$label, max)
#calculate distance from zero for each point
dist <- abs(zeros[dum$label]-seq.int(nrow(dum)))
And that gives
cbind(dum, dist)
# label x dist
# 1 a 1 4
# 2 a 1 3
# 3 a 1 2
# 4 a 1 1
# 5 a 0 0
# 6 a 1 1
# 7 a 1 2
# 8 a 1 3
# 9 a 1 4
# 10 a 1 5
# 11 a 1 6
# 12 a 1 7
# 13 a 1 8
# 14 b 1 8
# 15 b 1 7
# 16 b 1 6
# 17 b 1 5
# 18 b 1 4
# 19 b 1 3
# 20 b 1 2
# 21 b 1 1
# 22 b 0 0
# 23 b 1 1
# 24 b 1 2
# 25 b 1 3
# 26 b 1 4
Or even ave will let you do it in one step
dist <- with(dum, ave(x,label,FUN=function(x) abs(seq_along(x)-which.min(x))))
cbind(dum, dist)
You can do this with by but also with plyr, data.table, etc. The function that is used on each subset is
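f <- function(d) {
x <- d$x
# position of the first zero in this group
i <- match(0, x)
# cumulative sum running backwards from the zero to the start
v1 <- rev(cumsum(rev(x[1:i])))
# cumulative sum running forwards after the zero;
# x[-seq_len(i)] is empty (not NA) when the zero is the last element
v2 <- cumsum(x[-seq_len(i)])
transform(d, output = c(v1, v2))
}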
To call it on each subset e.g. with by
res <- by(dum, list(dum$label), f)
do.call(rbind, res)
If you want to use ddply
library(plyr)
ddply(dum, .(label), f)
May be faster with data.table
library(data.table)
dumdt <- as.data.table(dum)
setkey(dumdt, label)
dumdt[, f(.SD), by = key(dumdt)]
Using dplyr
library(dplyr)
dum%>%
group_by(label)%>%
mutate(dist=abs(row_number()-which.min(x)))