Row-wise expansion of data.frame [duplicate] - r

I basically want to do the opposite of ddply(df, columns.to.preserve, numcolwise(FUNCTION)).
Suppose I have
d <- data.frame(
count=c(2,1,3),
summed.value=c(50,20,30),
averaged.value=c(35,80,20)
)
count summed.value averaged.value
1 2 50 35
2 1 20 80
3 3 30 20
I want to do a row expansion of this data.frame based on the count column while specifying what kind of operation I want to apply to the other columns.
Here is the kind of result I'm looking for:
> d2
count summed.value averaged.value
1 1 25 35
2 1 25 35
3 1 20 80
4 1 10 20
5 1 10 20
6 1 10 20
Are there any built-in functions in dplyr or other packages that do this kind of operation?
Edit: This is different from the De-aggregate / reverse-summarise / expand a dataset in R question because I want to go further and actually apply different functions to columns within the table I wish to expand. There are also more useful answers on this post.

Using dplyr and tidyr, you can do a rowwise transformation of summed.value that produces a list for each cell; unnesting that column then gives you what you need:
library(dplyr); library(tidyr)
d %>% rowwise() %>% summarise(summed.value = list(rep(summed.value/count, count)),
averaged.value = averaged.value, count = 1) %>% unnest()
# Source: local data frame [6 x 3]
# averaged.value count summed.value
# <dbl> <dbl> <dbl>
# 1 35 1 25
# 2 35 1 25
# 3 80 1 20
# 4 20 1 10
# 5 20 1 10
# 6 20 1 10
Another way is to use data.table, where you can specify the row number as group variable, and the data table will automatically expand it:
library(data.table)
setDT(d)
d[, .(summed.value = rep(summed.value/count, count), averaged.value, count = 1), .(1:nrow(d))][, nrow := NULL][]
# summed.value averaged.value count
#1: 25 35 1
#2: 25 35 1
#3: 20 80 1
#4: 10 20 1
#5: 10 20 1
#6: 10 20 1

There is a function untable in package reshape for getting the inverse of a table. Then divide the variables that need dividing by count via mutate_at (or mutate_each). mutate_at was introduced in dplyr_0.5.0.
First the untable:
library(reshape)
untable(d, num = d$count)
count summed.value averaged.value
1 2 50 35
1.1 2 50 35
2 1 20 80
3 3 30 20
3.1 3 30 20
3.2 3 30 20
Then the mutate_at for dividing summed.value and count by count:
library(dplyr)
untable(d, num = d$count) %>%
mutate_at(vars(summed.value, count), funs(./count))
count summed.value averaged.value
1 1 25 35
2 1 25 35
3 1 20 80
4 1 10 20
5 1 10 20
6 1 10 20

Here's a simple and fully vectorized base R approach:
transform(d[rep(1:nrow(d), d$count), ],
count = 1,
summed.value = summed.value/count)
# count summed.value averaged.value
# 1 1 25 35
# 1.1 1 25 35
# 2 1 20 80
# 3 1 10 20
# 3.1 1 10 20
# 3.2 1 10 20
Or similarly, using data.table
library(data.table)
res <- setDT(d)[rep(1:.N, count)][, `:=`(count = 1, summed.value = summed.value / count)]
res
# count summed.value averaged.value
# 1: 1 25 35
# 2: 1 25 35
# 3: 1 20 80
# 4: 1 10 20
# 5: 1 10 20
# 6: 1 10 20
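The row-replication idea above can be wrapped in a small helper that takes one function per column, which is closer to what the question asks for. This is only a sketch: expand_rows and its ops argument are hypothetical names, not part of any package, and each function is assumed to take the replicated column values and the original count.

```r
# Hypothetical helper (not from any package): repeat each row `count` times,
# then apply a user-supplied function(value, count) to the chosen columns.
expand_rows <- function(df, count_col, ops = list()) {
  n <- df[[count_col]]
  out <- df[rep(seq_len(nrow(df)), n), , drop = FALSE]
  n_rep <- rep(n, n)              # original count, aligned with the expanded rows
  for (col in names(ops)) {
    out[[col]] <- ops[[col]](out[[col]], n_rep)
  }
  out[[count_col]] <- 1           # every expanded row now stands for one unit
  rownames(out) <- NULL
  out
}

d <- data.frame(count = c(2, 1, 3),
                summed.value = c(50, 20, 30),
                averaged.value = c(35, 80, 20))

# Split the summed column evenly; columns not listed in `ops` are copied as-is
d2 <- expand_rows(d, "count",
                  ops = list(summed.value = function(x, n) x / n))
d2
```

Columns that should simply be carried over, such as averaged.value here, need no entry in ops.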

A base R solution: it replicates each row by the value of its count column and then divides the count and summed.value columns by count.
mytext <- 'count,summed.value,averaged.value
2,50,35
1,20,80
3,30,20'
mydf <- read.table(text=mytext,header=T,sep = ",")
mydf <- do.call(rbind,apply(mydf, 1, function(x) {
tempdf <- t(replicate(x[1],x,simplify = T))
tempdf[,1] <- tempdf[,1]/x[1]
tempdf[,2] <- tempdf[,2]/x[1]
return(data.frame(tempdf))
}))
count summed.value averaged.value
1 25 35
1 25 35
1 20 80
1 10 20
1 10 20
1 10 20

Related

Drop rows in a data frame that are in-between two integer values in R

I have a data frame recording a participant's behaviour in an episodic task. An episode starts at trigger 90 and finishes at a trigger somewhere in the 40s. Below is a sample data frame with one column holding the row number and the other holding the actual triggers.
ex1 <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
ex2 <- c(41,1,1,90,1,1,1,44,1,90,1,2,42,1,1,1,1,90,1,41)
df <- data.frame(ex1,ex2)
> df
ex1 ex2
1 1 41
2 2 1
3 3 1
4 4 90
5 5 1
6 6 1
7 7 1
8 8 44
9 9 1
10 10 90
11 11 1
12 12 2
13 13 42
14 14 1
15 15 1
16 16 1
17 17 1
18 18 90
19 19 1
20 20 41
Now, what I am trying to do is remove all the rows that are outside the beginning and the end of the episode, as they are recordings of typed behaviour that is not interesting as it falls outside of the episode. Therefore, I want to end up with a dataframe like this:
ex1 <- c(1,4,5,6,7,8,10,11,12,13,18,19,20)
ex2 <- c(41,90,1,1,1,44,90,1,2,42,90,1,41)
df <- data.frame(ex1,ex2)
> df
ex1 ex2
1 1 41
2 4 90
3 5 1
4 6 1
5 7 1
6 8 44
7 10 90
8 11 1
9 12 2
10 13 42
11 18 90
12 19 1
13 20 41
I have been trying to use subset but I cannot make it work between a range and a number.
Thanks in advance!
Setting the values:
ex1 <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
ex2 <- c(41,1,1,90,1,1,1,44,1,90,1,2,42,1,1,1,1,90,1,41)
before <- data.frame(ex1,ex2)
before
ex1 ex2
1 1 41
2 2 1
3 3 1
4 4 90
5 5 1
6 6 1
7 7 1
8 8 44
9 9 1
10 10 90
11 11 1
12 12 2
13 13 42
14 14 1
15 15 1
16 16 1
17 17 1
18 18 90
19 19 1
20 20 41
I have built a function that should do the work.
The function is constructed based on my understanding of your problem so there is a chance that my function would not work perfectly to your setting.
However I believe you can do your task by adjusting the function a little bit to satisfy your needs.
library(dplyr)
episode <- function(start = 90, end = 40, data){#the default value of start is 90 and the default value of end is 40
#retrieving all the row indices that correspond to values that indicates an end
end_idx <- which(data$ex2>=end & data$ex2<=end+10)
#retrieving all the row indices that correspond to values that indicates a start
start_idx <- which(data$ex2==start)
#declaring a list that would contain the extracted sub samples in your liking
sub_sample_list <- vector("list", length(start_idx))
#looping through the start indices
for(i in 1:length(start_idx)){
#extracting the minimum among those have values larger than the i-th start_idx value
temp_end <- min(end_idx[end_idx>start_idx[i]])
#extracting the rows between the i-th start index and the minimum end index that is larger than the i-th start index
temp_sub_sample <- data[start_idx[i]:temp_end,]
#saving the sub-sample in the list
sub_sample_list[[i]] <- temp_sub_sample
}
#now row binding all the extracted sub samples
clean.df <- do.call(rbind.data.frame, sub_sample_list)
#if there is an end index that is smaller than the minimum start index
if(min(end_idx)< min(start_idx)){
#only retrieve those corresponding rows and add to the clean.df
clean.df <- rbind(data[end_idx[end_idx<min(start_idx)],], clean.df)
}
#cleaning up the row numbers a bit
rownames(clean.df) <- 1:nrow(clean.df)
#sort the clean.df by ex1
clean.df <- clean.df %>% arrange(ex1)
#returning the clean.df
return(clean.df)
}
Generating the after data set by using the episode function.
after <- episode(start = 90, end = 40, before)
after
ex1 ex2
1 1 41
2 4 90
3 5 1
4 6 1
5 7 1
6 8 44
7 10 90
8 11 1
9 12 2
10 13 42
11 18 90
12 19 1
13 20 41
And base:
ex1 <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
ex2 <- c(41,1,1,90,1,1,1,44,1,90,1,2,42,1,1,1,1,90,1,41)
df <- data.frame(ex1,ex2)
Index the starts of the series (90) and, if the first start is not in row 1, drop the rows before it as incomplete:
start_idx <- which(df$ex2 == 90)
df <- df[start_idx[1]:nrow(df), ]
Re-index the starts, and index the ends (values >= 40 and < 90):
start_idx <- which(df$ex2 == 90)
end_idx <- which(df$ex2 >= 40 & df$ex2 < 90)
Make an empty list and loop through with for, subsetting out the start:end sections:
df_lst <- list()
for (k in 1:length(start_idx)) {
df_lst[[k]] <- df[start_idx[k]:end_idx[k], ]
}
Bring them all together:
df2 <- do.call('rbind', df_lst)
df2
ex1 ex2
4 4 90
5 5 1
6 6 1
7 7 1
8 8 44
10 10 90
11 11 1
12 12 2
13 13 42
18 18 90
19 19 1
20 20 41
Fairly compact.
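For comparison, the same selection can be sketched without an explicit loop by carrying the most recent trigger forward. The function name keep_episodes, the end range 40 to 49, and the inside_at_start flag (which assumes the recording begins in the middle of an episode, so that the leading 41 is kept as in the question's expected output) are all assumptions made for illustration.

```r
# Hypothetical helper: flag rows that belong to an episode.
# An episode opens at `start` and closes at the next value in [end_lo, end_hi].
keep_episodes <- function(x, start = 90, end_lo = 40, end_hi = 49,
                          inside_at_start = TRUE) {
  is_start <- x == start
  is_end <- x >= end_lo & x <= end_hi
  # TRUE opens an episode, FALSE closes it, NA leaves the state unchanged
  event <- ifelse(is_start, TRUE, ifelse(is_end, FALSE, NA))
  # Position of the most recent event at or before each row (0 if none yet)
  last_event <- cummax(ifelse(is.na(event), 0L, seq_along(x)))
  inside_after <- c(inside_at_start, event)[last_event + 1L]
  inside_before <- c(inside_at_start, head(inside_after, -1))
  # Keep start rows, plus every row reached while an episode was open
  is_start | inside_before
}

ex1 <- 1:20
ex2 <- c(41,1,1,90,1,1,1,44,1,90,1,2,42,1,1,1,1,90,1,41)
triggers <- data.frame(ex1, ex2)

keep <- keep_episodes(triggers$ex2)
triggers[keep, ]
```

Setting inside_at_start = FALSE drops the rows before the first 90 instead, which corresponds to the loop-based answer above.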

using intervals in a column to populate values for another column

I have a dataframe:
dataframe <- data.frame(Condition = rep(c(1,2,3), each = 5, times = 2),
Time = sort(sample(1:60, 30)))
Condition Time
1 1 1
2 1 3
3 1 4
4 1 7
5 1 9
6 2 11
7 2 12
8 2 14
9 2 16
10 2 18
11 3 19
12 3 24
13 3 25
14 3 28
15 3 30
16 1 31
17 1 34
18 1 35
19 1 38
20 1 39
21 2 40
22 2 42
23 2 44
24 2 47
25 2 48
26 3 49
27 3 54
28 3 55
29 3 57
30 3 59
I want to divide the total length of Time (i.e., max(Time) - min(Time)) per Condition by a constant 'x' (e.g., 3). Then I want to use that quotient to add a new variable Trial such that my dataframe looks like this:
Condition Time Trial
1 1 1 A
2 1 3 A
3 1 4 B
4 1 7 C
5 1 9 C
6 2 11 A
7 2 12 A
8 2 14 B
9 2 16 C
10 2 18 C
... and so on
As you can see, for Condition 1, Trial is populated with unique identifying values (e.g., A, B, C) every 2.67 seconds = 8 (total time) / 3. For Condition 2, Trial is populated every 2.33 seconds = 7 (total time) /3.
I am not getting what I want with my current code:
dataframe %>%
group_by(Condition) %>%
mutate(Trial = LETTERS[cut(Time, 3, labels = F)])
# Groups: Condition [3]
Condition Time Trial
<dbl> <int> <chr>
1 1 1 A
2 1 3 A
3 1 4 A
4 1 7 A
5 1 9 A
6 2 11 A
7 2 12 A
8 2 14 A
9 2 16 A
10 2 18 A
# ... with 20 more rows
Thanks!
We can take the difference of the range (range returns the min and max as a vector), divide it by the constant (here 3), and pass the result as the breaks argument of cut. Then use the integer index (labels = FALSE) to pick the corresponding letter from R's built-in LETTERS constant.
library(dplyr)
dataframe %>%
group_by(Condition) %>%
mutate(Trial = LETTERS[cut(Time, diff(range(Time))/3,
labels = FALSE)])
If the grouping should be based on adjacent values in 'Condition', use rleid from data.table on the 'Condition' column to create the grouping, and apply the same code as above
library(data.table)
dataframe %>%
group_by(grp = rleid(Condition)) %>%
mutate(Trial = LETTERS[cut(Time, diff(range(Time))/3,
labels = FALSE)])
Here's a one-liner using my santoku package. The rleid line is the same as in @akrun's solution.
dataframe %<>%
group_by(grp = data.table::rleid(Condition)) %>%
mutate(
Trial = chop_evenly(Time, intervals = 3, labels = lbl_seq("A"))
)
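As a rough base R illustration of binning per run of Condition, the sketch below builds run ids by hand and cuts Time into three equal-width intervals within each run. The run_id helper is a hypothetical stand-in for data.table::rleid, and the fixed Time values replace the random sample from the question so that the result is reproducible.

```r
# Hypothetical stand-in for data.table::rleid: a new id whenever the value changes
run_id <- function(x) cumsum(c(TRUE, x[-1] != x[-length(x)]))

trials <- data.frame(
  Condition = rep(c(1, 2, 1), each = 5),
  Time = c(1, 3, 4, 7, 9, 11, 12, 14, 16, 18, 31, 34, 35, 38, 39)
)

trials$grp <- run_id(trials$Condition)

# Within each run, cut Time into 3 equal-width intervals and keep the bin number
bin <- ave(trials$Time, trials$grp,
           FUN = function(t) cut(t, breaks = 3, labels = FALSE))
trials$Trial <- LETTERS[bin]
trials
```

Grouping by grp rather than by Condition keeps the second block of Condition 1 from being binned together with the first.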

R: Aggregate column of values to multiple new columns, each based on index column

Assume I have data:
data.frame(Plot = rep(1:2,3),Index = rep(1:3, each = 2), Val = c(1:6)*10)
Plot Index Val
1 1 1 10
2 2 1 20
3 1 2 30
4 2 2 40
5 1 3 50
6 2 3 60
I want to create new columns combining/aggregating all Val that share a common Index for a given Plot. I want to do this for each Index.
Plot Val1 Val2 Val3
1 1 10 30 50
2 2 20 40 60
I would like any remaining columns (e.g., just Plot in this simplified example) to remain in my final data.frame.
My Attempt
I know I can do this step-wise using aggregate() and merge(), but is there a way to do this using a single (or minimal) call(s)?
Any approach is great, but I always like to see an elegant base R approach if one exists...
Update:
I'm looking for a solution that also works well when other columns are involved:
dat2 = data.frame(Plot = rep(1:2,each = 8),Year = rep(rep(2010:2011, each = 4),2),
Index = rep(rep(1:2,2),4), Val = rep(c(1:4)*10,4))
Plot Year Index Val
1 1 2010 1 10
2 1 2010 2 20
3 1 2010 1 30
4 1 2010 2 40
5 1 2011 1 10
6 1 2011 2 20
7 1 2011 1 30
8 1 2011 2 40
9 2 2010 1 10
10 2 2010 2 20
11 2 2010 1 30
12 2 2010 2 40
13 2 2011 1 10
14 2 2011 2 20
15 2 2011 1 30
16 2 2011 2 40
#Resulting in (if aggregating by sum, for example):
Plot Year Val1 Val2
1 1 2010 40 60
2 1 2011 40 60
3 2 2010 40 60
4 2 2011 40 60
Also, ideally, the new columns could be named based on the Index value.
So if my index were instead A:C, my new columns would be ValA, ValB, and ValC
It seems you want a base R solution: then you can do something like:
m = aggregate(Val~.,dat2,sum)
reshape(m,v.names = "Val",idvar = c("Plot","Year"),timevar = "Index",direction = "wide")
Plot Year Val.1 Val.2
1 1 2010 40 60
2 2 2010 40 60
3 1 2011 40 60
4 2 2011 40 60
But you can use other functions:
do.call(data.frame,aggregate(Val~Plot+Year,m,I))
Plot Year Val.1 Val.2
1 1 2010 40 60
2 2 2010 40 60
3 1 2011 40 60
4 2 2011 40 60
Or using the reshape2 library, you can tackle the problem as:
library(reshape2)
dcast(dat2,Plot+Year~Index,sum,value.var = "Val")
Plot Year 1 2
1 1 2010 40 60
2 1 2011 40 60
3 2 2010 40 60
4 2 2011 40 60
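To get the ValN-style names the question asks for while staying in base R, one option is to rename the reshaped columns afterwards. A minimal sketch, assuming sum is the aggregation wanted and that renaming with sub is acceptable:

```r
dat2 <- data.frame(Plot = rep(1:2, each = 8),
                   Year = rep(rep(2010:2011, each = 4), 2),
                   Index = rep(rep(1:2, 2), 4),
                   Val = rep(c(1:4) * 10, 4))

# One row per Plot/Year/Index, summing Val
m <- aggregate(Val ~ Plot + Year + Index, dat2, sum)

# One column per Index value
wide <- reshape(m, v.names = "Val", idvar = c("Plot", "Year"),
                timevar = "Index", direction = "wide")

# Turn Val.1, Val.2 into Val1, Val2 and tidy the row order
names(wide) <- sub("^Val\\.", "Val", names(wide))
wide <- wide[order(wide$Plot, wide$Year), ]
rownames(wide) <- NULL
wide
```

The same renaming works if Index holds letters instead of numbers, giving ValA, ValB and so on.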
One can think of using gather, unite and spread functions to get the desired result as mentioned by OP.
library(tidyverse)
df <- data.frame(Plot = rep(1:2,3),Index = rep(1:3, each = 2), Val = c(1:6)*10)
df %>% gather(key, value, -Plot, -Index) %>%
unite("key", c(key,Index), sep="") %>%
spread(key, value)
# Plot Val1 Val2 Val3
# 1 1 10 30 50
# 2 2 20 40 60
Note: There are other, shorter options (as correctly pointed out by @Onyambu), but the column names would then still need to be changed to match what the OP asked for.
spread(df, Index, Val)
# Plot 1 2 3
# 1 1 10 30 50
# 2 2 20 40 60
aggregate(Val~Plot,df,I)
# Plot Val.1 Val.2 Val.3
# 1 1 10 30 50
# 2 2 20 40 60
Updated: Based on 2nd data frame from OP.
dat2 = data.frame(Plot = rep(1:2,each = 8),Year = rep(rep(2010:2011, each = 4),2),
Index = rep(rep(1:2,2),4), Val = rep(c(1:4)*10,4))
library(tidyverse)
library(reshape2)
dat2 %>% gather(key, value, -Plot, -Index, -Year) %>%
unite("key", c(key,Index), sep="") %>%
dcast(Plot+Year~key, fun.aggregate = sum, value.var = "value")
# Plot Year Val1 Val2
# 1 1 2010 40 60
# 2 1 2011 40 60
# 3 2 2010 40 60
# 4 2 2011 40 60

Select row meeting condition and all subsequent rows by group

Let's assume I have a data frame consisting of a categorical variable and a numerical one.
df <- data.frame(group=c(1,1,1,1,1,2,2,2,2,2),days=floor(runif(10, min=0, max=101)))
df
group days
1 1 54
2 1 61
3 1 31
4 1 52
5 1 21
6 2 22
7 2 18
8 2 50
9 2 46
10 2 35
I would like to select the row corresponding to the maximum number of days by group as well as all the following/subsequent group rows. For the example above, my subset df2 should look as follows:
df2
group days
2 1 61
3 1 31
4 1 52
5 1 21
8 2 50
9 2 46
10 2 35
Please note that the groups could have different lengths.
For a base R solution, aggregate days by group using a function that keeps the elements with index greater than or equal to the maximum, and then reshape as a long data.frame
df0 = aggregate(days ~ group, df, function(x) x[seq_along(x) >= which.max(x)])
data.frame(group=rep(df0$group, lengths(df0$days)),
days=unlist(df0$days, use.names=FALSE))
leading to
group days
1 1 84
2 1 31
3 1 65
4 1 23
5 2 94
6 2 69
7 2 45
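A variant that avoids the intermediate list column is to compute a per-group flag with ave and subset on it. This is only a sketch; the days values are copied from the question's printed example rather than drawn with runif, so the result is reproducible.

```r
df <- data.frame(group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                 days = c(54, 61, 31, 52, 21, 22, 18, 50, 46, 35))

# Within each group, flag rows at or after the position of the maximum.
# ave() returns the flags as 0/1 because they are stored in a numeric vector.
from_max <- ave(df$days, df$group,
                FUN = function(x) seq_along(x) >= which.max(x))

df2 <- df[from_max == 1, ]
df2
```

Because which.max returns the first maximum, ties within a group keep everything from the earliest occurrence onward.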
You can use which.max to find out the index of the maximum of the days and then use slice from dplyr to select all the rows after that, where n() gives the number of rows in each group:
library(dplyr)
df %>% group_by(group) %>% slice(which.max(days):n())
#Source: local data frame [7 x 2]
#Groups: group [2]
# group days
# <int> <int>
#1 1 61
#2 1 31
#3 1 52
#4 1 21
#5 2 50
#6 2 46
#7 2 35
data.table syntax would be similar, .N is similar to n() in dplyr and gives the number of rows in each group:
library(data.table)
setDT(df)[, .SD[which.max(days):.N], group]
# group days
#1: 1 61
#2: 1 31
#3: 1 52
#4: 1 21
#5: 2 50
#6: 2 46
#7: 2 35
We can use a faster option with data.table where we find the row index (.I) and then subset the rows based on that.
library(data.table)
setDT(df)[df[ , .I[which.max(days):.N], by = group]$V1]
# group days
#1: 1 61
#2: 1 31
#3: 1 52
#4: 1 21
#5: 2 50
#6: 2 46
#7: 2 35

how to use apply-like function on data frame? [please see details below]

I have a dataframe with columns A, B and C.
I want to apply a function on each row of a dataframe in which a function will check the value of row$A and row$B and will update row$C based on those values. How can I achieve that?
Example:
A B C
1 1 10 10
2 2 20 20
3 NA 30 30
4 NA 40 40
5 5 50 50
Now I want to update all rows in C column to B/2 value in that same row if value in A column for that row is NA.
So the dataframe after changes would look like:
A B C
1 1 10 10
2 2 20 20
3 NA 30 15
4 NA 40 20
5 5 50 50
I would like to know if this can be done without using a for loop.
If you want to update the column by reference (without copying the whole data set when updating the column), you could also try data.table:
library(data.table)
setDT(dat)[is.na(A), C := B/2]
dat
# A B C
# 1: 1 10 10
# 2: 2 20 20
# 3: NA 30 15
# 4: NA 40 20
# 5: 5 50 50
Edit:
Regarding @arun's comment: the address is the same before and after the change, which implies the column was still updated by reference.
library(pryr)
address(dat$C)
## [1] "0x2f85a4f0"
setDT(dat)[is.na(A), C := B/2]
address(dat$C)
## [1] "0x2f85a4f0"
Try this:
your_data <- within(your_data, C[is.na(A)] <- B[is.na(A)] / 2)
Try
indx <- is.na(df$A)
df$C[indx] <- df$B[indx]/2
df
# A B C
#1 1 10 10
#2 2 20 20
#3 NA 30 15
#4 NA 40 20
#5 5 50 50
Here is a simple example using dplyr.
Fictional dataset:
library(dplyr)
df <- data.frame(a=c(1, NA, NA, 2), b=c(10, 20, 50, 50))
You want to change only the rows where a is NA, so you can use ifelse:
df <- mutate(df, c=ifelse(is.na(a), b/2, b))
Another approach:
dat <- transform(dat, C = B / 2 * (i <- is.na(A)) + C * !i)
# A B C
# 1 1 10 10
# 2 2 20 20
# 3 NA 30 15
# 4 NA 40 20
# 5 5 50 50
Try:
> ddf$C = with(ddf, ifelse(is.na(A), B/2, C))
>
> ddf
A B C
1 1 10 10
2 2 20 20
3 NA 30 15
4 NA 40 20
5 5 50 50
