I have data collected over multiple days, with timestamps recording when food was eaten. Example data frames:
head(Day3)
==================================================================
Day3.time Day3.Pellet_Count
1 18:05:30 1
2 18:06:03 2
3 18:06:34 3
4 18:06:40 4
5 18:06:52 5
6 18:07:03 6
head(Day4)
==================================================================
Day4.time Day4.Pellet_Count
1 18:00:21 1
2 18:01:34 2
3 18:02:22 3
4 18:03:35 4
5 18:03:54 5
6 18:05:06 6
Given the variability, the timestamps don't line up and therefore aren't matched. I've done a "full join" with merge from all of the data from two of the days, in the following way:
pellets <- merge(Day3, Day4, by = 'time', all=TRUE)
This results in the following:
head(pellets)
==================================================================
pellets.time pellets.Pellet_Count.x pellets.Pellet_Count.y
1 02:40:18 39 NA
2 18:00:21 NA 1
3 18:01:34 NA 2
4 18:02:22 NA 3
5 18:03:35 NA 4
6 18:03:54 NA 5
I would like to plot Pellet_Count as one line per day on a single graph, but this structure is making it very difficult to group the data. My approach thus far has been:
pelletday <- ggplot() + geom_line(data=pellets, aes(x=time, y=Pellet_Count.x)) +
geom_line(data=pellets, aes(x=time, y=Pellet_Count.y))
But, I get this error:
geom_path: Each group consists of only one observation. Do you need to adjust the group
aesthetic?
I would also like to be able to merge all days (I often have up to 9) and plot them on the same graph.
I believe my goal is to ultimately get the following dataframe output:
==================================================================
pellets.time Pellet_Count Day
1 02:40:18 39 3
2 18:00:21 1 4
3 18:01:34 2 4
4 18:02:22 3 4
5 18:03:35 4 4
6 18:03:54 5 4
and to use this to graph:
ggplot(pellets, aes(time, Pellet_Count, group = Day)) + geom_line()
Any ideas?
There are a couple of issues here.
First, have you tried using rbind() or bind_rows() rather than merge()?
This seems like a more natural fit for what you're trying to do. With a merge or some other join, you are effectively trying to bring new information into your data table, most often new columns.
But here you are really trying to append the days' data together; you're not actually adding a new column.
So this is my attempt at replicating what you're describing above:
library(dplyr)  # provides tibble(), mutate(), rename() and %>%

Day3 <- tibble(
Day3.time = c('18:05:30', '18:06:03', '18:06:34',
'18:06:40', '18:06:52', '18:07:03'),
Day3.Pellet_Count = c(1, 2, 3, 4, 5, 6)) %>%
mutate(day = '3') %>%
rename(time = Day3.time)
Day4 <- tibble(
Day4.time = c('18:00:21', '18:01:34', '18:02:22',
'18:03:35', '18:03:54', '18:05:06'),
Day4.Pellet_Count = c(1, 2, 3, 4, 5, 6)) %>%
mutate(day = '4') %>%
rename(time = Day4.time)
pellets <- merge(Day3, Day4, by = 'time', all=TRUE)
time Day3.Pellet_Count day.x Day4.Pellet_Count day.y
1 18:00:21 NA <NA> 1 4
2 18:01:34 NA <NA> 2 4
3 18:02:22 NA <NA> 3 4
4 18:03:35 NA <NA> 4 4
5 18:03:54 NA <NA> 5 4
6 18:05:06 NA <NA> 6 4
7 18:05:30 1 3 NA <NA>
8 18:06:03 2 3 NA <NA>
9 18:06:34 3 3 NA <NA>
10 18:06:40 4 3 NA <NA>
11 18:06:52 5 3 NA <NA>
12 18:07:03 6 3 NA <NA>
And here is how you would work with bind_rows() (rbind() works much the same); this should give you more useful data to work with:
pellets <- bind_rows(Day3 %>%
                       rename(Pellet_Count = Day3.Pellet_Count),
                     Day4 %>%
                       rename(Pellet_Count = Day4.Pellet_Count))
pellets
# A tibble: 12 x 3
time Pellet_Count day
<chr> <dbl> <chr>
1 18:05:30 1 3
2 18:06:03 2 3
3 18:06:34 3 3
4 18:06:40 4 3
5 18:06:52 5 3
6 18:07:03 6 3
7 18:00:21 1 4
8 18:01:34 2 4
9 18:02:22 3 4
10 18:03:35 4 4
11 18:03:54 5 4
12 18:05:06 6 4
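If you have many days (you mention up to 9), the same idea generalizes: put the per-day data frames in a named list and let bind_rows() create the day column for you. A minimal sketch, assuming the raw DayN data frames from your question (with DayN.time and DayN.Pellet_Count columns, not the already-renamed versions above); the clean_day helper here is hypothetical:

library(dplyr)
library(purrr)

# Hypothetical helper: strip the "DayN." prefix so every data frame
# shares the same column names (time, Pellet_Count) before stacking
clean_day <- function(d) {
  names(d) <- sub("^Day\\d+\\.", "", names(d))
  d
}

days <- list(`3` = Day3, `4` = Day4)  # add `5` = Day5, ... as needed
pellets <- days %>%
  map(clean_day) %>%
  bind_rows(.id = "day")  # .id turns the list names into the day column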
Secondly, you probably need to find a way to handle the dates. A big problem with your ggplot code is that you are passing character strings where you want date/time data, and to get a useful datetime format I think you'll need the date as well as the time.
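As a hedged sketch of what that could look like with the combined pellets data from above (assuming a dummy shared date is acceptable for overlaying days, since only times are recorded):

library(dplyr)
library(ggplot2)

pellets %>%
  # parse the clock time; as.POSIXct fills in today's date, which is
  # fine here because every day is overlaid on the same 24-hour axis
  mutate(time = as.POSIXct(time, format = "%H:%M:%S", tz = "UTC")) %>%
  ggplot(aes(time, Pellet_Count, colour = day, group = day)) +
  geom_line() +
  scale_x_datetime(date_labels = "%H:%M")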
You first need to convert your data from 'wide' to 'long' format. After this, you should be able to use ggplot (it looks like you tried base R plot logic with lines here, but that doesn't carry over to ggplot).
For example:
library(dplyr)
library(tidyr)

pellets %>% gather("day", "count", -pellets.time) %>% na.omit()
All together it will be:
pellets %>%
  rename(Day3 = pellets.Pellet_Count.x, Day4 = pellets.Pellet_Count.y) %>%
  gather("day", "count", -pellets.time) %>%
  na.omit() %>%
  ggplot() +
  geom_point(aes(x = pellets.time, y = count, col = day))
(I added rename to match your preferred output)
Suppose I have a dataframe that looks like this:
> data <- data.frame(x = c(1,1,2,2,3,4,5,6), y = c(1,2,3,4,5,6,7,8))
> data
x y
1 1 1
2 1 2
3 2 3
4 2 4
5 3 5
6 4 6
7 5 7
8 6 8
I want to use mutate and case_when to create a new id variable that identifies rows by the variable x and gives rows missing x a unique id. In other words, rows one and two should share an id, as should rows three and four, while rows 5-8 should each have their own unique id. Suppose I want to generate these id values with a function:
id_function <- function(x, n){
set.seed(x)
res <- character(n)
for(i in seq(n)){
res[i] <- paste0(sample(c(letters, LETTERS, 0:9), 32), collapse="")
}
res
}
id_function(1, 1)
[1] "4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf"
I am trying to use this function on the RHS of a case_when expression like this:
library(dplyr)

data %>%
mutate(my_id = id_function(1234, nrow(.)),
my_id = dplyr::case_when(!is.na(x) ~ id_function(x, 1),
TRUE ~ my_id))
But the RHS does not seem to be vectorized and I get the same value for all non-missing values of x:
x y my_id
1 1 1 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
2 1 2 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
3 2 3 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
4 2 4 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
5 NA 5 0vnws5giVNIzp86BHKuOZ9ch4dtL3Fqy
6 NA 6 IbKU6DjvW9ypitl7qc25Lr4sOwEfghdk
7 NA 7 8oqQMPx6IrkGhXv4KlUtYfcJ5Z1RCaDy
8 NA 8 BRsjumlCEGS6v4ANrw1bxLynOKkF90ao
I'm sure there's a way to vectorize the RHS. What am I doing wrong? Is there an easier approach to solving this problem?
I guess rowwise() would do the trick:
data %>%
rowwise() %>%
mutate(my_id = id_function(x, 1))
x y my_id
1 1 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
1 2 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
2 3 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
2 4 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
3 5 e5lMJNQEhtj4VY1KbCR9WUiPrpy7vfXo
4 6 3kYcgR7109DLbxatQIAKXFeovN8pnuUV
5 7 bQ4ok7OuDgscLUlpzKAivBj2T3m6wrWy
6 8 0jSn3Jcb2HDA5uhvG8g1ytsmRpl6CQWN
purrr map functions can be used for non-vectorized functions. The following will give you a similar result; map2_chr takes the two arguments expected by your id_function and returns a character vector (plain map2 would give a list-column).
library(tidyverse)
data %>%
  mutate(my_id = map2_chr(x, 1, id_function))
Output
x y my_id
1 1 1 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
2 1 2 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
3 2 3 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
4 2 4 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
5 3 5 e5lMJNQEhtj4VY1KbCR9WUiPrpy7vfXo
6 4 6 3kYcgR7109DLbxatQIAKXFeovN8pnuUV
7 5 7 bQ4ok7OuDgscLUlpzKAivBj2T3m6wrWy
8 6 8 0jSn3Jcb2HDA5uhvG8g1ytsmRpl6CQWN
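If you'd rather avoid per-row iteration altogether, a vectorized alternative (my sketch, not part of the answers above) is to generate one id per unique value of x and look them up with match(), then fill any rows with missing x from a single batch keyed to a fixed seed, mirroring your id_function(1234, ...) idea:

ux <- unique(na.omit(data$x))
# one id per unique x; set.seed(x) inside id_function keeps them deterministic
id_by_x <- vapply(ux, id_function, character(1), n = 1)

data$my_id <- id_by_x[match(data$x, ux)]
missing_x <- is.na(data$x)
if (any(missing_x)) {
  # rows with missing x each get their own id from one batch
  data$my_id[missing_x] <- id_function(1234, sum(missing_x))
}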
I'm having trouble figuring out how to do the opposite of the answer to this question (and in R, not Python):
Count the amount of times value A occurs with value B
Basically I have a dataframe with a lot of combinations of pairs of columns like so:
df <- data.frame(id1 = c("1","1","1","1","2","2","2","3","3","4","4"),
id2 = c("2","2","3","4","1","3","4","1","4","2","1"))
I want to count how often each value in column A occurs anywhere in the whole data frame without the corresponding value from column B. So the result for this small example would be the output of:
df_result <- data.frame(id1 = c("1","1","1","2","2","2","3","3","4","4"),
id2 = c("2","3","4","1","3","4","1","4","2","1"),
count = c("4","5","5","3","5","4","2","3","3","3"))
The important criterion is that the final results data frame is collapsed by pairs (so in my example rows 1 and 2 are duplicates, and they are collapsed and summed into the total frequency with which 1 is observed without 2). For tallying the count of occurrences, both columns are examined; i.e., the order of the columns doesn't matter for the frequency: a 1 in column A with a 2 in column B counts the same as a 2 in column A with a 1 in column B.
I can do this very slowly by filtering for each pair, but it's not really feasible for my real data where I have many many different pairs.
Any guidance is greatly appreciated.
First paste the two id columns together into id12 for later matching. Then use sapply to go through all rows and count the records where id1 appears in id12 but id2 doesn't, sum that value, and only output the distinct records. Finally, remove the id12 column. (Note that grepl does substring matching, so this relies on the ids being single characters as in the example; multi-character ids that contain one another would miscount.)
library(dplyr)
df %>% mutate(id12 = paste0(id1, id2),
count = sapply(1:nrow(.),
function(x)
sum(grepl(id1[x], id12) & !grepl(id2[x], id12)))) %>%
distinct() %>%
select(-id12)
Or in base R completely:
id12 <- paste0(df$id1, df$id2)
df$count <- sapply(1:nrow(df), function(x) sum(grepl(df$id1[x], id12) & !grepl(df$id2[x], id12)))
df <- df[!duplicated(df),]
Output
id1 id2 count
1 1 2 4
2 1 3 5
3 1 4 5
4 2 1 3
5 2 3 5
6 2 4 4
7 3 1 2
8 3 4 3
9 4 2 3
10 4 1 3
A full tidyverse version:
library(tidyverse)
df %>%
  mutate(id = paste(id1, id2),
         count = map_int(row_number(),
                         ~ sum(str_detect(id, id1[.x]) &
                                 str_detect(id, id2[.x], negate = TRUE)))) %>%
  distinct() %>%
  select(-id)
A more efficient approach would be to work on a tabulation format:
tab = crossprod(table(rep(seq_len(nrow(df)), ncol(df)), c(df$id1, df$id2)))
#tab
#
# 1 2 3 4
# 1 7 3 2 2
# 2 3 6 1 2
# 3 2 1 4 1
# 4 2 2 1 5
So now we have the number of times each value appears with another (irrespective of their order in the two columns). From here, we need a way to subset the above table by each pair and subtract the value of their co-occurrence from the value of each id's total appearances.
Make a grid of all combinations:
gr = expand.grid(id1 = colnames(tab), id2 = rownames(tab), stringsAsFactors = FALSE)
Create 2-column matrices to subset the table:
id1.ij = cbind(match(gr$id1, colnames(tab)),
match(gr$id1, rownames(tab)))
id2.ij = cbind(match(gr$id1, colnames(tab)),
match(gr$id2, rownames(tab)))
Subtract the respective values:
cbind(gr, count = tab[id1.ij] - tab[id2.ij])
# id1 id2 count
#1 1 1 0
#2 2 1 3
#3 3 1 2
#4 4 1 3
#5 1 2 4
#6 2 2 0
#7 3 2 3
#8 4 2 3
#9 1 3 5
#10 2 3 5
#11 3 3 0
#12 4 3 4
#13 1 4 5
#14 2 4 4
#15 3 4 3
#16 4 4 0
Of course, if we do not need the full grid of values, we can set:
gr = unique(df)
which results in:
# id1 id2 count
#1 1 2 4
#3 1 3 5
#4 1 4 5
#5 2 1 3
#6 2 3 5
#7 2 4 4
#8 3 1 2
#9 3 4 3
#10 4 2 3
#11 4 1 3
I have a long data frame with the decisions of players who worked in groups.
I need to convert the data in such a way that each row (individual observation) contains all group members' decisions (so we can see whether they are interdependent).
Let's say the generating code is:
group_id <- c(rep(1, 3), rep(2, 3))
player_id <- c(rep(seq(1, 3), 2))
player_decision <- seq(10,60,10)
player_contribution <- seq(6,1,-1)
df <-
data.frame(group_id, player_id, player_decision, player_contribution)
So the initial data looks like:
group_id player_id player_decision player_contribution
1 1 1 10 6
2 1 2 20 5
3 1 3 30 4
4 2 1 40 3
5 2 2 50 2
6 2 3 60 1
But I need to convert it to wide format per group, and only for some of the variables (in this example, specifically player_contribution), in such a way that the rest of the data remains. So the head of the converted data would be:
data.frame(group_id = c(1, 1),
           player_id = c(1, 2),
           player_decision = c(10, 20),
           player_1_contribution = c(6, 6),
           player_2_contribution = c(5, 5),
           player_3_contribution = c(4, 4)
)
  group_id player_id player_decision player_1_contribution player_2_contribution player_3_contribution
1        1         1              10                     6                     5                     4
2        1         2              20                     6                     5                     4
I suspect I need to group_by in dplyr and then somehow gather per group but only for player_contribution (or a vector of variables). But I really have no clue how to approach it. Any hints would be welcome!
Here is a solution using tidyr and dplyr.
Make a data frame with columns for the players' contributions, then join it back onto the columns of interest from the original data frame.
library(tidyr)
library(dplyr)
wide <- pivot_wider(df, id_cols = -player_decision,
                    names_from = player_id,
                    values_from = player_contribution,
                    names_prefix = "player_contribution_")
answer <- left_join(df[, c("group_id", "player_id", "player_decision")], wide,
                    by = "group_id")
answer
group_id player_id player_decision player_contribution_1 player_contribution_2 player_contribution_3
1 1 1 10 6 5 4
2 1 2 20 6 5 4
3 1 3 30 6 5 4
4 2 1 40 3 2 1
5 2 2 50 3 2 1
6 2 3 60 3 2 1
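If you would rather stay in a single grouped pipeline and the number of players is small and known (three, in this example), a hedged alternative sketch uses a grouped mutate with match() to pull each player's contribution into its own column; note the hard-coded player ids:

library(dplyr)

df %>%
  group_by(group_id) %>%
  mutate(player_1_contribution = player_contribution[match(1, player_id)],
         player_2_contribution = player_contribution[match(2, player_id)],
         player_3_contribution = player_contribution[match(3, player_id)]) %>%
  ungroup()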
I have a dataset that looks like the following:
id <- c(1,1,1,2,2,2,3,3,4)
cycle <- c(1,2,3,1,2,3,1,3,2)
value <- 1:9
data.frame(id, cycle, value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 3 8
9 4 2 9
So basically there is a variable called id that identifies the sample, a variable called cycle that identifies the timepoint, and a variable called value that gives the value at that timepoint.
As you can see, sample 3 does not have cycle 2 data, and sample 4 is missing cycle 1 and 3 data. What I want to know is whether there is a way, without writing a loop, to place NAs where there is no data. So I would like my dataset to look like the following:
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
I am able to solve this problem with a lot of loops and if statements but the code is extremely long and cumbersome (I have many more columns in my real dataset).
Also, the number of samples I have is very large so I need something that is generalizable.
Using merge and expand.grid, we can come up with a solution. expand.grid creates a data.frame with all combinations of the supplied vectors (so you'd supply it with the id and cycle variables). By merging it to your original data (and using all.x = TRUE, which is like a left join in SQL), we can fill the rows missing from dat with NA.
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
dat <- data.frame(id,cycle,value)
grid_dat <- expand.grid(id = 1:4,
cycle = 1:3)
# or you could do (HT #jogo):
# grid_dat <- expand.grid(id = unique(dat$id),
# cycle = unique(dat$cycle))
merge(x = grid_dat, y = dat, by = c('id', 'cycle'), all.x = TRUE)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
A solution based on the tidyverse.
library(tidyverse)
# Create example data frame
id <- c(1, 1, 1, 2, 2, 2, 3, 3, 4)
cycle <- c(1, 2, 3, 1, 2, 3, 1, 3, 2)
value <- 1:9
dt <- data.frame(id, cycle, value)
# Complete the combination between id and cycle
dt2 <- dt %>% complete(id, cycle)
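dt2 should then hold every id/cycle combination, with NA filled in for the missing values, matching the merge()-based result above (printed here without class-specific formatting):

dt2
#    id cycle value
# 1   1     1     1
# 2   1     2     2
# 3   1     3     3
# 4   2     1     4
# 5   2     2     5
# 6   2     3     6
# 7   3     1     7
# 8   3     2    NA
# 9   3     3     8
# 10  4     1    NA
# 11  4     2     9
# 12  4     3    NA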
Here is a solution with data.table doing a cross join:
library("data.table")
d <- data.table(id = c(1,1,1,2,2,2,3,3,4), cycle = c(1,2,3,1,2,3,1,3,2), value = 1:9)
# CJ() builds the cross join of all unique id and cycle values;
# joining d onto it keeps every combination, with NA where value is missing
d[CJ(id = id, cycle = cycle, unique = TRUE), on = .(id, cycle)]
Apologies if this is a simple question, but I haven't been able to find a simple solution after searching. I'm fairly new to R and am having trouble converting wide format to long format using either the melt (reshape2) or gather (tidyr) functions. The dataset I'm working with contains 22 different time variables, each spanning 3 time periods. The problem occurs when I try to convert all of these from wide to long format at once. I have had success converting them individually, but that is very inefficient and long, so I was wondering if anyone could suggest a simpler solution. Below is a sample dataset I created that is formatted in a similar way to the dataset I am working with:
Subject <- c(1, 2, 3)
BlueTime1 <- c(2, 5, 6)
BlueTime2 <- c(4, 6, 7)
BlueTime3 <- c(1, 2, 3)
RedTime1 <- c(2, 5, 6)
RedTime2 <- c(4, 6, 7)
RedTime3 <- c(1, 2, 3)
GreenTime1 <- c(2, 5, 6)
GreenTime2 <- c(4, 6, 7)
GreenTime3 <- c(1, 2, 3)
sample.df <- data.frame(Subject, BlueTime1, BlueTime2, BlueTime3,
RedTime1, RedTime2, RedTime3,
GreenTime1,GreenTime2, GreenTime3)
A solution that has worked for me is to use the gather function from tidyr, arranging the data by Subject (so that each subject's data is grouped together), and then selecting only the subject, time period, and rating. This was done for each variable (in my case 22).
install.packages("dplyr")
install.packages("tidyr")
library(dplyr)
library(tidyr)
BlueGather <- gather(sample.df, Time_Blue, Rating_Blue, c(BlueTime1,
BlueTime2,
BlueTime3))
BlueSorted <- arrange(BlueGather, Subject)
BlueSubtracted <- select(BlueSorted, Subject, Time_Blue, Rating_Blue)
After this code, I combine everything into one data frame. This seems very slow and inefficient to me, and I was hoping that someone could help me find a simpler solution. Thank you!
The idea here is to gather() all the time variables (all variables but Subject), use separate() on key to split them into a label and a time and then spread() the label and value to obtain your desired output.
library(dplyr)
library(tidyr)
sample.df %>%
gather(key, value, -Subject) %>%
separate(key, into = c("label", "time"), "(?<=[a-z])(?=[0-9])") %>%
spread(label, value)
Which gives:
# Subject time BlueTime GreenTime RedTime
#1 1 1 2 2 2
#2 1 2 4 4 4
#3 1 3 1 1 1
#4 2 1 5 5 5
#5 2 2 6 6 6
#6 2 3 2 2 2
#7 3 1 6 6 6
#8 3 2 7 7 7
#9 3 3 3 3 3
Note
Here we use a regex in separate() (from an answer by @RichardScriven) to split the column at the first encountered digit.
Edit
I understand from your comments that your dataset's column names are actually of the form ColorTime_Pre, ColorTime_Post, ColorTime_Final. If that is the case, you don't have to specify a regex in separate(): the default sep = "[^[:alnum:]]+" will match your _ and split the key into label and time accordingly:
sample.df %>%
gather(key, value, -Subject) %>%
separate(key, into = c("label", "time")) %>%
spread(label, value)
Will give:
# Subject time BlueTime GreenTime RedTime
#1 1 Final 1 1 1
#2 1 Post 4 4 4
#3 1 Pre 2 2 2
#4 2 Final 2 2 2
#5 2 Post 6 6 6
#6 2 Pre 5 5 5
#7 3 Final 3 3 3
#8 3 Post 7 7 7
#9 3 Pre 6 6 6
We can use melt from data.table, which can take multiple measure columns as a regex pattern:
library(data.table)
melt(setDT(sample.df), measure = patterns("^Blue", "^Red", "^Green"),
value.name = c("BlueTime", "RedTime", "GreenTime"), variable.name = "time")
# Subject time BlueTime RedTime GreenTime
#1: 1 1 2 2 2
#2: 2 1 5 5 5
#3: 3 1 6 6 6
#4: 1 2 4 4 4
#5: 2 2 6 6 6
#6: 3 2 7 7 7
#7: 1 3 1 1 1
#8: 2 3 2 2 2
#9: 3 3 3 3 3
Or, as @StevenBeaupré mentioned in the comments, if there are many patterns, one option is to use the names of the dataset, after extracting the unique substrings, as the patterns argument:
melt(setDT(sample.df),
     measure = patterns(as.list(unique(sub("\\d+", "", names(sample.df)[-1])))),
     value.name = c("BlueTime", "RedTime", "GreenTime"),
     variable.name = "time")
If your goal is to convert the three colors to long this can be accomplished with the base R reshape function:
reshape(sample.df, idvar="subject", varying=2:length(sample.df), sep="", direction="long")
Subject time BlueTime RedTime GreenTime subject
1.1 1 1 2 2 2 1
2.1 2 1 5 5 5 2
3.1 3 1 6 6 6 3
1.2 1 2 4 4 4 1
2.2 2 2 6 6 6 2
3.2 3 2 7 7 7 3
1.3 1 3 1 1 1 1
2.3 2 3 2 2 2 2
3.3 3 3 3 3 3 3
The time variable captures the 1, 2, 3 in the names of the wide variables. The varying argument tells reshape which variables should be converted to long. The sep argument tells reshape to look for numbers at the end of the varying variables that are not separated by any character, while the direction argument tells the function to attempt a long conversion.
I always add an id variable for future reference, even when it is not strictly necessary (hence the extra lowercase subject column in the output above, since idvar = "subject" does not match the existing Subject column).
If your data.frame doesn't have actually have the numbers for the time variable, a fairly simple solution is to change the variable names so that they do. For example, the following would replace "_Pre" with "1" at the end of any such variables.
names(df) <- sub("_Pre$", "1", names(df))  # non-matching names are left unchanged
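If all three time periods need converting, the same idea extends to a small suffix-to-number map (the _Pre/_Post/_Final names and their ordering are assumed from the comments):

# hypothetical mapping from name suffix to time period
suffix_map <- c("_Pre$" = "1", "_Post$" = "2", "_Final$" = "3")
for (pat in names(suffix_map)) {
  names(df) <- sub(pat, suffix_map[[pat]], names(df))
}
# names like BlueTime_Pre now read BlueTime1, ready for reshape(sep = "")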