How can I conditionally expand rows in my R dataframe?

I have a dataframe that I would like to expand based on a few conditions. If the Activity is "Repetitive", I would like to explode that row into one row per 0.5-second event, i.e. twice as many rows as the duration in seconds. The rest of the information stays the same, except that the expanded rows alternate between the object given in the original dataframe (e.g. "Food") and "Nothing".
Location <- c("Kitchen", "Living Room", "Living Room", "Garage")
Object <- c("Food", "Toy", "Clothes", "Floor")
Duration <- c(6,3,2,5)
CumDuration <- c(6,9,11,16)
Activity <- c("Repetitive", "Constant", "Constant", "Repetitive")
df <- data.frame(Location, Object, Duration, CumDuration, Activity)
So it looks like this:
| Location | Object | Duration | CumDuration | Activity |
| ----------- | -------- | -------- | ----------- | ---------- |
| Kitchen | Food | 6 | 6 | Repetitive |
| Living Room | Toy | 3 | 9 | Constant |
| Living Room | Clothes | 2 | 11 | Constant |
| Garage | Floor | 5 | 16 | Repetitive |
And I want it to look like this:
| Location | Object | Duration | CumDuration | Activity |
| ----------- | -------- | -------- | ----------- | ---------- |
| Kitchen | Food | 0.5 | 0.5 | Repetitive |
| Kitchen | Nothing | 0.5 | 1 | Repetitive |
| Kitchen | Food | 0.5 | 1.5 | Repetitive |
| Kitchen | Nothing | 0.5 | 2 | Repetitive |
| Kitchen | Food | 0.5 | 2.5 | Repetitive |
| Kitchen | Nothing | 0.5 | 3 | Repetitive |
| Kitchen | Food | 0.5 | 3.5 | Repetitive |
| Kitchen | Nothing | 0.5 | 4 | Repetitive |
| Kitchen | Food | 0.5 | 4.5 | Repetitive |
| Kitchen | Nothing | 0.5 | 5 | Repetitive |
| Kitchen | Food | 0.5 | 5.5 | Repetitive |
| Kitchen | Nothing | 0.5 | 6 | Repetitive |
| Living Room | Toy | 3 | 9 | Constant |
| Living Room | Clothes | 2 | 11 | Constant |
| Garage | Floor | 0.5 | 11.5 | Repetitive |
| Garage | Nothing | 0.5 | 12 | Repetitive |
| Garage | Floor | 0.5 | 12.5 | Repetitive |
| Garage | Nothing | 0.5 | 13 | Repetitive |
| Garage | Floor | 0.5 | 13.5 | Repetitive |
| Garage | Nothing | 0.5 | 14 | Repetitive |
| Garage | Floor | 0.5 | 14.5 | Repetitive |
| Garage | Nothing | 0.5 | 15 | Repetitive |
| Garage | Floor | 0.5 | 15.5 | Repetitive |
| Garage | Nothing | 0.5 | 16 | Repetitive |
Thanks so much in advance!

Here is a dplyr option to achieve this:
library(dplyr)
df$CumDuration <- as.numeric(df$CumDuration)
df %>%
  filter(Activity == "Repetitive") %>%
  group_by(Location) %>%
  slice(rep(1:n(), each = Duration / 0.5)) %>%  # create one row per 0.5 s event
  mutate(Duration = 0.5) %>%                    # each expanded event lasts 0.5 s
  ungroup() %>%
  arrange(CumDuration) %>%
  mutate(Object = ifelse(row_number() %% 2 == 0, "Nothing", Object),  # alternate the Object with "Nothing"
         ID = 1:n()) %>%                        # ID keeps the expanded rows in the right order
  full_join(filter(df, Activity != "Repetitive")) %>%  # merge back the unmodified rows
  arrange(CumDuration, ID) %>%                  # put the rows in the correct order
  mutate(CumDuration = cumsum(Duration)) %>%    # recalculate the cumulative sum
  select(-ID)                                   # drop the helper ID column
# A tibble: 24 x 5
Location Object Duration CumDuration Activity
<chr> <chr> <dbl> <dbl> <chr>
1 Kitchen Food 0.5 0.5 Repetitive
2 Kitchen Nothing 0.5 1 Repetitive
3 Kitchen Food 0.5 1.5 Repetitive
4 Kitchen Nothing 0.5 2 Repetitive
5 Kitchen Food 0.5 2.5 Repetitive
6 Kitchen Nothing 0.5 3 Repetitive
7 Kitchen Food 0.5 3.5 Repetitive
8 Kitchen Nothing 0.5 4 Repetitive
9 Kitchen Food 0.5 4.5 Repetitive
10 Kitchen Nothing 0.5 5 Repetitive
# ... with 14 more rows
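If you prefer to avoid the filter/slice/join round trip, here is a minimal alternative sketch, assuming the same df as above and that tidyr is available; it uses tidyr::uncount() to do the row expansion in place:
library(dplyr)
library(tidyr)
df %>%
  mutate(n = ifelse(Activity == "Repetitive", Duration * 2, 1),          # number of 0.5 s events per row
         Duration = ifelse(Activity == "Repetitive", 0.5, Duration)) %>% # expanded events last 0.5 s
  uncount(n, .id = "rep") %>%                                            # repeat each row n times
  mutate(Object = ifelse(Activity == "Repetitive" & rep %% 2 == 0,
                         "Nothing", Object),                             # every second event becomes "Nothing"
         CumDuration = cumsum(Duration)) %>%                             # recompute the running total
  select(-rep)
Because the rows are expanded in their original order, a single cumsum(Duration) at the end rebuilds CumDuration without needing a helper ID column.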

Related

How to summarize data in R (dplyr) and avoid duplicate identifiers? [duplicate]

I'm trying to identify the lowest rate over a range of years for a number of items (ID).
In addition, I would like to know the Year the lowest rate was pulled from.
I'm grouping by ID, but I run into an issue when rates are duplicated across years.
sample data
df <- data.frame(ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4),
                 Year = rep(2010:2012, 4),
                 Rate = c(0.3, 0.6, 0.9,
                          0.8, 0.5, 0.2,
                          0.8, 0.4, 0.9,
                          0.7, 0.7, 0.7))
sample data as table
| ID | Year | Rate |
|:------:|:------:|:------:|
| 1 | 2010 | 0.3 |
| 1 | 2011 | 0.6 |
| 1 | 2012 | 0.9 |
| 2 | 2010 | 0.8 |
| 2 | 2011 | 0.5 |
| 2 | 2012 | 0.2 |
| 3 | 2010 | 0.8 |
| 3 | 2011 | 0.4 |
| 3 | 2012 | 0.9 |
| 4 | 2010 | 0.7 |
| 4 | 2011 | 0.7 |
| 4 | 2012 | 0.7 |
Using dplyr I grouped by ID, then found the lowest rate.
df.Summarise <- df %>%
  group_by(ID) %>%
  summarise(LowestRate = min(Rate))
This gives me the following
| ID | LowestRate |
| --- | --- |
| 1 | 0.3 |
| 2 | 0.2 |
| 3 | 0.4 |
| 4 | 0.7 |
However, I also need to know the year that data was pulled from.
This is what I would like my final result to look like:
| ID | Rate | Year |
| --- | --- | --- |
| 1 | 0.3 | 2010 |
| 2 | 0.2 | 2012 |
| 3 | 0.4 | 2011 |
| 4 | 0.7 | 2012 |
Here's where I ran into some issues.
Attempt #1: Include "Year" in the original dplyr code
df.Summarise2 <- df %>%
  group_by(ID) %>%
  summarise(LowestRate = min(Rate),
            Year = Year)
Error: Column `Year` must be length 1 (a summary value), not 3
Makes sense. I'm not summarizing "Year" at all. I just want to include that row's value for Year!
Attempt #2: Use mutate instead of summarise
df.Mutate <- df %>%
  group_by(ID) %>%
  mutate(LowestRate = min(Rate))
So that essentially returns my original dataframe, but with an extra column for LowestRate attached.
How would I go from this to what I want?
I tried to left_join / merge based on ID and LowestRate, but there are multiple matches for ID #4. Is there any way to only pick one match (row)?
df.joined <- left_join(df.Summarise,df,by = c("ID","LowestRate" = "Rate"))
df.joined as table
| ID | LowestRate | Year |
| --- | --- | --- |
| 1 | 0.3 | 2010 |
| 2 | 0.2 | 2012 |
| 3 | 0.4 | 2011 |
| 4 | 0.7 | 2010 |
| 4 | 0.7 | 2011 |
| 4 | 0.7 | 2012 |
I've tried looking online, but I can't really find anything that strikes this exactly.
Using ".drop = FALSE" for group_by() didn't help, as it seems to be intended for empty values?
The dataset I'm working with is large, so I'd really like to find how to make this work and avoid hard-coding anything :)
Thanks for any help!
You can group by ID and then filter without summarizing, and that way you'll preserve all columns but still only keep the min value:
df %>%
  group_by(ID) %>%
  filter(Rate == min(Rate))
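Note that filter(Rate == min(Rate)) keeps every tied row, so ID 4 (rate 0.7 in all three years) still returns three rows. If you need exactly one row per ID, a minimal sketch using slice_min() with ties disabled (dplyr 1.0.0+) is shown below; it keeps the first tied row in the data, so arrange() first if you prefer a different tie-break:
library(dplyr)
df %>%
  group_by(ID) %>%
  slice_min(Rate, n = 1, with_ties = FALSE) %>%  # exactly one row per ID, even when rates tie
  ungroup()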

Relabel of rowname column in R dataframe

When I bind multiple dataframes together using Out2 = do.call(rbind.data.frame, Out), I obtain the following output. How do I relabel the first column such that it only contains the numbers within the square brackets, i.e. 1 to 5 for each trial number? Is there a way to add a column name to the first column too?
|        | V1 | V2 | Trial |
| ------ | -- | -- | ----- |
| [1,]   | 0.130880519 | 0.02085533 | 1 |
| [2,]   | 0.197243133 | -0.000502744 | 1 |
| [3,]   | -0.045241653 | 0.106888902 | 1 |
| [4,]   | 0.328759949 | -0.106559163 | 1 |
| [5,]   | 0.040894969 | 0.114073454 | 1 |
| [1,]1  | 0.103130056 | 0.013655756 | 2 |
| [2,]1  | 0.133080106 | 0.038049071 | 2 |
| [3,]1  | 0.067975054 | 0.03036033 | 2 |
| [4,]1  | 0.132437217 | 0.022887103 | 2 |
| [5,]1  | 0.124950463 | 0.007144698 | 2 |
| [1,]2  | 0.202996317 | 0.004181205 | 3 |
| [2,]2  | 0.025401354 | 0.045672932 | 3 |
| [3,]2  | 0.169469266 | 0.002551237 | 3 |
| [4,]2  | 0.2303046 | 0.004936579 | 3 |
| [5,]2  | 0.085702254 | 0.020814191 | 3 |
We can use parse_number to extract the first occurrence of a number from the row names:
library(dplyr)
df1 %>%
  mutate(newcol = readr::parse_number(row.names(df1)))
Or in base R, use sub to capture the digits after the [ in the row names
df1$newcol <- sub("^\\[(\\d+).*", "\\1", row.names(df1))
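As a small follow-up sketch, you can store the extracted number in a named column and then drop the old row names entirely. The df1 built here is a toy stand-in for the bound data frame, since the real one is Out2 from do.call(rbind.data.frame, Out):
# toy stand-in for the bound data frame; the real df1 is Out2
df1 <- data.frame(V1 = c(0.13, 0.19), V2 = c(0.02, -0.0005),
                  row.names = c("[1,]", "[2,]1"))
df1$Row <- as.integer(sub("^\\[(\\d+).*", "\\1", rownames(df1)))  # number inside the brackets
rownames(df1) <- NULL                                             # discard the old row names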

How do you assign groups to larger groups in dplyr?

I would like to assign groups to larger groups in order to assign them to cores for processing. I have 16 cores. This is what I have so far:
test<-data_extract%>%group_by(group_id)%>%sample_n(16,replace = TRUE)
This takes samples of 16 from each group.
This is an example of what I would like the final product to look like (with two clusters). All I really want is for every row with the same group_id to belong to the same cluster, with a set number of clusters:
| balance | group_id | cluster |
| ------- | -------- | ------- |
| 454452 | a | 1 |
| 5450441 | a | 1 |
| 5444531 | b | 1 |
| 5404051 | b | 1 |
| 5404501 | b | 1 |
| 5404041 | b | 1 |
| 544251 | b | 1 |
| 254252 | b | 1 |
| 541254 | c | 2 |
| 54123254 | d | 1 |
| 542541 | d | 1 |
| 5442341 | e | 2 |
| 541 | f | 1 |
test <- data %>%
  group_by(group_id) %>%
  mutate(group = sample(1:16, 1))
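The sample() call above gives every row of a group_id the same random cluster, but the 16 clusters can end up unevenly sized. A minimal deterministic sketch (assuming data has a group_id column as in the question, and dplyr 1.0.0+) spreads the group_ids round-robin over 16 clusters instead:
library(dplyr)
data %>%
  group_by(group_id) %>%
  mutate(cluster = (cur_group_id() - 1) %% 16 + 1) %>%  # same cluster for every row of a group_id
  ungroup()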

How to remove empty cells and reduce columns

I have a table, that looks roughly like this:
| variable | observer1 | observer2 | observer3 | final |
| -------- | --------- | --------- | --------- | ----- |
| case1 | | | | |
| var1 | 1 | 1 | | |
| var2 | 3 | 3 | | |
| var3 | 4 | 5 | | 5 |
| case2 | | | | |
| var1 | 2 | | 2 | |
| var2 | 5 | | 5 | |
| var3 | 1 | | 1 | |
| case3 | | | | |
| var1 | | 2 | 3 | 2 |
| var2 | | 2 | 2 | |
| var3 | | 1 | 1 | |
| case4 | | | | |
| var1 | 1 | | 1 | |
| var2 | 5 | | 5 | |
| var3 | 3 | | 3 | |
There are three columns for the observers, but only two of them are filled for any given case.
First I want to compute the IRR (inter-rater reliability), so I need a table with just two observer columns and no empty cells, like this:
| variable | observer1 | observer2 |
| -------- | --------- | --------- |
| case1 | | |
| var1 | 1 | 1 |
| var2 | 3 | 3 |
| var3 | 4 | 5 |
| case2 | | |
| var1 | 2 | 2 |
| var2 | 5 | 5 |
| var3 | 1 | 1 |
| case3 | | |
| var1 | 2 | 3 |
| var2 | 2 | 2 |
| var3 | 1 | 1 |
| case4 | | |
| var1 | 1 | 1 |
| var2 | 5 | 5 |
| var3 | 3 | 3 |
I'm trying to use the tidyverse packages, but I'm not sure how to approach this. Maybe some ifelse() magic would be easier.
Is there a clean and easy method to do something like this? Can anybody point me to the right function to use? Or just to a keyword to search for on stackoverflow? I found a lot of methods to remove whole empty columns or rows.
Edit: I removed the link to the original data. It was unnecessary. Thanks to Lamia for his working answer.
Out of your 3 columns observer1, observer2 and observer3, you sometimes have 2 non-NA values, 1 non-NA value, or 3 NA values.
If you want to merge your 3 columns, you could do:
res <- data.frame(df$coding,
                  t(apply(df[paste0("observer", 1:3)], 1,
                          function(x) x[!is.na(x)][1:2])))
For each row, the apply() call returns the two non-NA values if there are two, one value plus an NA if there is only one, and two NAs if the row is empty.
We then put this result in a dataframe with the first column (coding).
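A dplyr alternative is sketched below. It assumes, as in the example, that every filled variable row has exactly two observer values, and it uses variable as the first column name shown in the table (the df$coding above refers to the original linked data):
library(dplyr)
res <- df %>%
  transmute(variable,
            observer1 = coalesce(observer1, observer2, observer3),  # first non-NA, scanning left to right
            observer2 = coalesce(observer3, observer2, observer1))  # first non-NA, scanning right to left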

How to subset a dataframe using a column from another dataframe in r?

I have 2 dataframes
Dataframe1:
|    | Cue | Ass_word | Condition | Freq | Cue_Ass_word |
| -- | --- | -------- | --------- | ---- | ------------ |
| 1 | ACCENDERE | ACCENDINO | A | 1 | ACCENDERE_ACCENDINO |
| 2 | ACCENDERE | ALLETTARE | A | 0 | ACCENDERE_ALLETTARE |
| 3 | ACCENDERE | APRIRE | A | 1 | ACCENDERE_APRIRE |
| 4 | ACCENDERE | ASCENDERE | A | 1 | ACCENDERE_ASCENDERE |
| 5 | ACCENDERE | ATTIVARE | A | 0 | ACCENDERE_ATTIVARE |
| 6 | ACCENDERE | AUTO | A | 0 | ACCENDERE_AUTO |
| 7 | ACCENDERE | ACCENDINO | B | 2 | ACCENDERE_ACCENDINO |
| 8 | ACCENDERE | ALLETTARE | B | 3 | ACCENDERE_ALLETTARE |
| 9 | ACCENDERE | ACCENDINO | C | 2 | ACCENDERE_ACCENDINO |
| 10 | ACCENDERE | ALLETTARE | C | 0 | ACCENDERE_ALLETTARE |
Dataframe2:
|    | Group.1 | x |
| -- | ------- | - |
| 1 | ACCENDERE_ACCENDINO | 5 |
| 13 | ACCENDERE_FUOCO | 22 |
| 16 | ACCENDERE_LUCE | 10 |
| 24 | ACCENDERE_SIGARETTA | 6 |
....
I want to exclude from Dataframe1 all the rows that contain words (Cue_Ass_word) that are not reported in the column Group.1 in Dataframe2.
In other words, how can I subset Dataframe1 using the strings reported in Dataframe2$Group.1?
It's not quite clear what you mean, but is this what you need?
Dataframe1[Dataframe1$Cue_Ass_word %in% Dataframe2$Group.1, ]
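An equivalent dplyr sketch (assuming the column names Cue_Ass_word and Group.1 as shown) keeps only the Dataframe1 rows whose Cue_Ass_word appears in Dataframe2:
library(dplyr)
Dataframe1 %>%
  semi_join(Dataframe2, by = c("Cue_Ass_word" = "Group.1"))  # keep only rows with a match in Dataframe2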
