How to make multiple plots using data in one column in R? - r

Lets say I have the following data frame:
ID amount_ID timespan change
3 1 20 2
3 2 40 3
3 3 60 6
3 4 80 4
3 5 100 5
9 1 25 1
9 2 50 -2
9 3 75 0
9 4 100 -1
3 1 33.33 4
3 2 66.67 8
3 3 100 7
9 1 33.33 1
9 2 66.67 3
9 3 100 4
I want to make 2 plots with this data, one for ID 3 and one for ID 9. The timespan should be on the x-axis and the change should be on the y-axis. As you can see the maximum length of the x-axis is 100 per ID. But I want to make a graph where the change is the average of all the previous changes in the past from that same ID. So essentially I need to add up all the changes per timespan per individual ID and divide that by the times a particular ID number is present. The problem is that the timespan can be different within a particular ID (here ID 3 first has 5 amounts and then has 3 amounts, ID 9 first has 4 amounts and then has 3 amounts).
Here is a visual example
I hope you can help me!!! Thanks!

We can use cummean to calculate the running average at each timespan. Also facetsare useful to show every ID in separate plot.
library(ggplot2)
library(dplyr)
df %>%
arrange(ID, timespan) %>%
group_by(ID) %>%
mutate(change = cummean(change)) %>%
ggplot() + aes(timespan, change) +
geom_line() +
facet_wrap(.~ID, scales = "free_y")

Related

How do I mutate numbers to a different scale in R?

x <- read.csv(text="Group,X,Y
1,1,2
1,2,5
1,3,8
1,4,9
2,1,1
2,2,4
2,3,8
2,4,10
2,5,16
3,1,6
3,2,7
3,3,8
3,4,12
3,5,13")
I want to shift the y-scale from what it currently is to one that starts from 0. Right now, I have grouped this dataset by the "Group" variable. And I am trying to find the minimum "Y" value from each group and subtracting it from each value in the "Y" column of each group. However, this isn't working as intended. Any ideas on how I can do this? Thank you.
This is a pretty simple use case for dplyr::group_by. Once you group the data.frame by your grouping variable, when you use functions like min or max in mutate, they will be called (and applied to) that group individually. So if we say Y=Y-min(Y) it will find the minimum value of Y in each group, and subtract it from the values of Y in that group
library(dplyr)
x %>%
group_by(Group) %>%
mutate(Y=Y-min(Y))
Group X Y
<int> <int> <int>
1 1 1 0
2 1 2 3
3 1 3 6
4 1 4 7
5 2 1 0
6 2 2 3
7 2 3 7
8 2 4 9
9 2 5 15
10 3 1 0
11 3 2 1
12 3 3 2
13 3 4 6
14 3 5 7

How to extract a list of columns name based on the means of their data?

I'm pretty new to R and hope i'll make myself clear enough.
I have a table of several columns which are factors. I want to make a score for each of these columns. Then I want to calculate the mean of each score, and display the list of columns ranked by their mean scores, is that possible ?
Table would be:
head(musico[,69:73])
AVIS1 AVIS2 AVIS3 AVIS4 AVIS5
1 2 1 2 3 2
2 2 5 2 3 2
3 3 2 5 5 1
4 1 2 5 5 5
5 1 5 1 3 1
6 4 1 4 5 4
I want to make a score for each:
musico$score1<-0
musico$score1[musico$AVIS1==1]<-1
musico$score1[musico$AVIS1==2]<-0.5
then do the mean of each column score: mean of score1, mean of score2, ...:
mean(musico$score1), mean(musico$score2), ...
My goal is to have a list of titles (avis1, avis2,...) ranked by their mean score.
Any advice appreciated !
Here's one way using base although it is somewhat unclear what you want. What does score1 have to do with AVIS1? I think you may be missing some of the data from musico.
Based on the example provided, here's a base R solution. vapply loops through the data.frame and produces the mean for each column. Then the stack and order are only there to make the output a dataframe that looks nice.
music <- read.table(text = "
AVIS1 AVIS2 AVIS3 AVIS4 AVIS5
1 2 1 2 3 2
2 2 5 2 3 2
3 3 2 5 5 1
4 1 2 5 5 5
5 1 5 1 3 1
6 4 1 4 5 4", header = TRUE)
means <- vapply(music, mean, 1)
stack(means[order(means, decreasing = TRUE)])
values ind
4 4.000000 AVIS4
3 3.166667 AVIS3
2 2.666667 AVIS2
5 2.500000 AVIS5
1 2.166667 AVIS1
This is how I would do it by first introducing a scores vector to be used as a lookup. I assume that scores are decreasing by 0.5 and that the number of scores needed are according to the maximum number of levels found in your columns (i.e. 6 seen in AVIS1).
Then using tidyr you can organise your data set such that you have to variables (i.e. AVIS and Value) containing the respective levels. Then add a score variable with the mutate function from dplyr in which the position of the score in the score vector matches the value in the Value variable. From here you can find the mean scores corresponding to the AVIS levels, arrange them accordingly and put them in a list.
music <- read.table(text = "
AVIS1 AVIS2 AVIS3 AVIS4 AVIS5
1 2 1 2 3 2
2 2 5 2 3 2
3 3 2 5 5 1
4 1 2 5 5 5
5 1 5 1 3 1
6 4 1 4 5 4", header = TRUE) # your data
scores <- seq(1, by = -0.5, length.out = 6) # vector of scores
library(tidyr)
library(dplyr)
music2 <- music %>%
gather(AVIS, Value) %>% # here you tidy the data
mutate(score = scores[Value]) %>% # match score to value
group_by(AVIS) %>% # group AVIS levels
summarise(score.mean = mean(score)) %>% # find mean scores for AVIS levels
arrange(desc(score.mean))
list <- list(AVIS = music2$AVIS) # here is the list
> list$AVIS
[1] "AVIS1" "AVIS5" "AVIS2" "AVIS3" "AVIS4"

gather() per grouped variables in R for specific columns

I have a long data frame with players' decisions who worked in groups.
I need to convert the data in such a way that each row (individual observation) would contain all group members decisions (so we basically can see whether they are interdependent).
Let's say the generating code is:
group_id <- c(rep(1, 3), rep(2, 3))
player_id <- c(rep(seq(1, 3), 2))
player_decision <- seq(10,60,10)
player_contribution <- seq(6,1,-1)
df <-
data.frame(group_id, player_id, player_decision, player_contribution)
So the initial data looks like:
group_id player_id player_decision player_contribution
1 1 1 10 6
2 1 2 20 5
3 1 3 30 4
4 2 1 40 3
5 2 2 50 2
6 2 3 60 1
But I need to convert it to wide per each group, but only for some of these variables, (in this example specifically for player_contribution, but in such a way that the rest of the data remains. So the head of the converted data would be:
data.frame(group_id=c(1,1),
player_id=c(1,2),
player_decision=c(10,20),
player_1_contribution=c(6,6),
player_2_contribution=c(5,5),
player_3_contribution=c(4,6)
)
group_id player_id player_decision player_1_contribution player_2_contribution player_3_contribution
1 1 1 10 6 5 4
2 1 2 20 6 5 6
I suspect I need to group_by in dplyr and then somehow gather per group but only for player_contribution (or a vector of variables). But I really have no clue how to approach it. Any hints would be welcome!
Here is solution using tidyr and dplyr.
Make a dataframe with the columns for the players contributions. Then join this dataframe back onto the columns of interest from the original Dataframe.
library(tidyr)
library(dplyr)
wide<-pivot_wider(df, id_cols= - player_decision,
names_from = player_id,
values_from = player_contribution,
names_prefix = "player_contribution_")
answer<-left_join(df[, c("group_id", "player_id", "player_decision") ], wide)
answer
group_id player_id player_decision player_contribution_1 player_contribution_2 player_contribution_3
1 1 1 10 6 5 4
2 1 2 20 6 5 4
3 1 3 30 6 5 4
4 2 1 40 3 2 1
5 2 2 50 3 2 1
6 2 3 60 3 2 1

Reuse value of previous row during dplyr::mutate

I am trying to group events based on their time of occurrence. To achieve this, I simply calculate a diff over the timestamps and want to essentially start a new group if the diff is larger than a certain value. I would have tried like the code below. However, this is not working since the dialog variable is not available during the mutate it is created by.
library(tidyverse)
df <- data.frame(time = c(1,2,3,4,5,510,511,512,513), id = c(1,2,3,4,5,6,7,8,9))
> df
time id
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 510 6
7 511 7
8 512 8
9 513 9
df <- df %>%
mutate(t_diff = c(NA, diff(time))) %>%
# This generates an error as dialog is not available as a variable at this point
mutate(dialog = ifelse(is.na(t_diff), id, ifelse(t_diff >= 500, id, lag(dialog, 1))))
# This is the desired result
> df
time id t_diff dialog
1 1 1 NA 1
2 2 2 1 1
3 3 3 1 1
4 4 4 1 1
5 5 5 1 1
6 510 6 505 6
7 511 7 1 6
8 512 8 1 6
9 513 9 1 6
In words, I want to add a column that points to the first element of each group. Thereby, the groups are distinguished at points at which the diff to the previous element is larger than 500.
Unfortunately, I have not found a clever workaround to achieve this in an efficient way using dplyr. Obviously, iterating over the data.frame with a loop would work, but would be very inefficient.
Is there a way to achieve this in dplyr?

Panel Data in R: Get complete cases of data based on individuals

I'm working on an unbalanced panel dataset. Data came from a game and for every user (user_id) in the record I have data for every level (level) of the game. As recording data started some time after introduction of the game, for some users I don't have data regarding the first levels, that's why I want to throw them out in a first step.
I've tried the complete.cases-function, but it only excludes the rows with the missing values (NAs), but not data for the whole user with missing values in level 1.
panel <- panel[complete.cases(panel), ]
That's why I need a code that excludes every user who has no record in level 1 (which in my dataset means he has an "NA" at one of the dependent variables, i.e. number of activities).
Update #1:
Data looks like this (thanks to thc):
> game_data <- data.frame(player = c(1,1,1,2,2,2,3,3,3), level = c(1,2,3,1,2,3,1,2,3), score=c(0,150,170,80,100,110,75,100,0))
> game_data
player level score
1 1 1 0
2 1 2 150
3 1 3 170
4 2 1 80
5 2 2 100
6 2 3 110
7 3 1 75
8 3 2 100
9 3 3 0
I now want to exclude data from player 1, because he has a score of 0 in level 1.
Here is one approach
Example data:
game_data <- data.frame(player = c(1,1,2,2,2,3,3,3), level = c(2,3,1,2,3,1,2,3), score=sample(100, 8))
> game_data
player level score
1 1 2 19
2 1 3 13
3 2 1 65
4 2 2 32
5 2 3 22
6 3 1 98
7 3 2 58
8 3 3 84
library(dplyr)
game_data %>% group_by(player) %>% filter(any(level == 1)) %>% as.data.frame
player level score
1 2 1 65
2 2 2 32
3 2 3 22
4 3 1 98
5 3 2 58
6 3 3 84
I think I now find a solution with your help:
game_data %>% group_by(player) %>% filter(any(level == 1 & score > 0)) %>% as.data.frame
This seems to work and I just needed a little adjustment from your code thc, thank you very much for your help!

Resources