Elegant way to determine the last observation among timepoints? - r

To start, here's some example data called df1:
ID Time Score1 Score2 SumScore
1 Baseline 1 2 3
1 Midpoint 2 2 4
1 Final 3 2 5
2 Baseline 2 2 4
2 Midpoint 5 2 7
2 Final 6 2 8
I should mention now that some of my 'Final' timepoint scores in these data are missing. I am interested only in those observations with missing Final timepoints. Let's select these observations and call the new data frame df2:
df2 <- df1 %>% filter(is.na(SumScore), Time == "Final")
From here, I spread the data using tidyr::spread() to create a new data frame (df3) that looks like this:
df3<-spread(df,ID,SumScore)
ID Baseline Midpoint
1 3 NA
1 NA 4
1 NA NA
2 4 NA
2 NA 7
2 NA NA
What I would like to accomplish is to determine the last observation (among the baseline and midpoint timepoints) and then carry that observation forward for the observations in df1 that are missing the Final timepoint score. It is possible that for some observations, the midpoint scores are missing as well.
Thanks

Using dplyr and tidyr, something like this might be what you are looking for...
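library(dplyr)
library(tidyr)
df4 <- df1 %>% select(-c(Score1, Score2)) %>%
  spread(key = Time, value = SumScore) %>%
  # take the first non-missing score, checking the latest timepoint first
  mutate(finalScore = coalesce(Final, Midpoint, Baseline))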
df4
ID Baseline Final Midpoint finalScore
1 1 3 5 4 5
2 2 4 8 7 8
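For the case the question actually describes, where Final (and sometimes Midpoint) is missing, here is a minimal sketch. The example data below are made up for illustration, and it uses pivot_wider(), the newer replacement for spread(), so treat it as one possible approach rather than the only one:

```r
library(dplyr)
library(tidyr)

# Made-up data: ID 2 is missing Final, ID 3 is missing Midpoint and Final
df1 <- data.frame(
  ID = rep(1:3, each = 3),
  Time = rep(c("Baseline", "Midpoint", "Final"), times = 3),
  SumScore = c(3, 4, 5,  4, 7, NA,  6, NA, NA)
)

locf <- df1 %>%
  pivot_wider(names_from = Time, values_from = SumScore) %>%
  # coalesce() returns the first non-missing value, so list the latest timepoint first
  mutate(finalScore = coalesce(Final, Midpoint, Baseline))

locf
```

ID 1 keeps its own Final score, ID 2 falls back to Midpoint, and ID 3 falls back to Baseline.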

Related

Creating a count variable for NA cases in data frame

I have an R data frame that includes a few columns of numerical data with NA values. See the example with the first 2 columns below. I want to create a new column (the 3rd one below, called Output) that shows an incremental count of NA values within each of my groups. For example, Region A has 2 NA values, so it shows 1 and 2 next to the relevant rows. Region B has only one NA value, so it shows 1 next to it. If a region X has 10 NA values, it should show 1, 2, 3, ..., 10 next to each NA case as I move down the data frame.
Region    Value  Output
Region A  5      0
Region B  2      0
Region B  NA     1
Region A  NA     1
Region A  9      0
Region A  NA     2
Region A  4      0
I am familiar with dplyr, so I'd be happy to see a solution built around it. Ideally I don't want to use a for loop, but I could if that is the best solution. In my example above I used zeros for the non-NA cases; that can be anything and doesn't have to be 0.
thanks! :)
You can use cumsum to count up the NA values within each group. The ifelse assigns these running counts only to rows where Value is NA; all other rows get 0 in the output.
library(dplyr)
df %>%
group_by(Region) %>%
mutate(Output = ifelse(is.na(Value), cumsum(is.na(Value)), 0))
Output
Region Value Output
<chr> <int> <dbl>
1 A 5 0
2 B 2 0
3 B NA 1
4 A NA 1
5 A 9 0
6 A NA 2
7 A 4 0
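A self-contained version of this, as a sketch: the data frame is rebuilt from the question's table with the regions shortened to A and B, and if_else() is swapped in for ifelse() so the new column stays integer. The name counted is just for illustration.

```r
library(dplyr)

# Data rebuilt from the question's table
df <- data.frame(
  Region = c("A", "B", "B", "A", "A", "A", "A"),
  Value  = c(5, 2, NA, NA, 9, NA, 4)
)

counted <- df %>%
  group_by(Region) %>%
  # running count of NAs within each region, shown only on the NA rows
  mutate(Output = if_else(is.na(Value), cumsum(is.na(Value)), 0L)) %>%
  ungroup()

counted
```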
You could create a new column from is.na(Value), then group by Region and that column, and then use cumsum() to create your desired output:
df %>%
  mutate(output = ifelse(!is.na(Value), 0, 1)) %>%
  group_by(Region, output) %>%
  mutate(output = cumsum(output))
# A tibble: 7 x 3
# Groups: Region, output [5]
Region Value output
<fct> <dbl> <dbl>
1 A 5 0
2 B 2 0
3 B NA 1
4 A NA 1
5 A 9 0
6 A NA 2
7 A 4 0

How to extract a list of columns name based on the means of their data?

I'm pretty new to R and hope I'll make myself clear enough.
I have a table with several columns that are factors. I want to compute a score for each of these columns, then calculate the mean of each score and display the list of columns ranked by their mean scores. Is that possible?
Table would be:
head(musico[,69:73])
AVIS1 AVIS2 AVIS3 AVIS4 AVIS5
1 2 1 2 3 2
2 2 5 2 3 2
3 3 2 5 5 1
4 1 2 5 5 5
5 1 5 1 3 1
6 4 1 4 5 4
I want to make a score for each:
musico$score1<-0
musico$score1[musico$AVIS1==1]<-1
musico$score1[musico$AVIS1==2]<-0.5
then do the mean of each column score: mean of score1, mean of score2, ...:
mean(musico$score1), mean(musico$score2), ...
My goal is to have a list of titles (avis1, avis2,...) ranked by their mean score.
Any advice appreciated !
Here's one way using base although it is somewhat unclear what you want. What does score1 have to do with AVIS1? I think you may be missing some of the data from musico.
Based on the example provided, here's a base R solution. vapply loops through the data.frame and produces the mean for each column. Then the stack and order are only there to make the output a dataframe that looks nice.
music <- read.table(text = "
AVIS1 AVIS2 AVIS3 AVIS4 AVIS5
1 2 1 2 3 2
2 2 5 2 3 2
3 3 2 5 5 1
4 1 2 5 5 5
5 1 5 1 3 1
6 4 1 4 5 4", header = TRUE)
means <- vapply(music, mean, numeric(1))
stack(means[order(means, decreasing = TRUE)])
values ind
4 4.000000 AVIS4
3 3.166667 AVIS3
2 2.666667 AVIS2
5 2.500000 AVIS5
1 2.166667 AVIS1
This is how I would do it: first introduce a scores vector to be used as a lookup. I assume that the scores decrease by 0.5 and that the number of scores needed equals the maximum number of levels found in your columns (i.e. 6 seen in AVIS1).
Then, using tidyr, you can organise your data set so that you have two variables (i.e. AVIS and Value) containing the respective levels. Next, add a score variable with the mutate function from dplyr, in which the position of the score in the scores vector matches the value in the Value variable. From here you can find the mean scores corresponding to the AVIS levels, arrange them accordingly and put them in a list.
music <- read.table(text = "
AVIS1 AVIS2 AVIS3 AVIS4 AVIS5
1 2 1 2 3 2
2 2 5 2 3 2
3 3 2 5 5 1
4 1 2 5 5 5
5 1 5 1 3 1
6 4 1 4 5 4", header = TRUE) # your data
scores <- seq(1, by = -0.5, length.out = 6) # vector of scores
library(tidyr)
library(dplyr)
music2 <- music %>%
gather(AVIS, Value) %>% # here you tidy the data
mutate(score = scores[Value]) %>% # match score to value
group_by(AVIS) %>% # group AVIS levels
summarise(score.mean = mean(score)) %>% # find mean scores for AVIS levels
arrange(desc(score.mean))
ranked <- list(AVIS = music2$AVIS) # here is the list (named ranked to avoid masking base::list)
ranked$AVIS
[1] "AVIS1" "AVIS5" "AVIS2" "AVIS3" "AVIS4"
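If the scoring rule is exactly the one from the question (1 scores 1, 2 scores 0.5, everything else scores 0), a base R sketch could look like this. The helper name score_one is made up for illustration, and the assumption that the same rule applies to every AVIS column is mine, not something the question states:

```r
music <- read.table(text = "
AVIS1 AVIS2 AVIS3 AVIS4 AVIS5
1 2 1 2 3 2
2 2 5 2 3 2
3 3 2 5 5 1
4 1 2 5 5 5
5 1 5 1 3 1
6 4 1 4 5 4", header = TRUE)

# Hypothetical helper: 1 -> 1, 2 -> 0.5, anything else -> 0
score_one <- function(x) ifelse(x == 1, 1, ifelse(x == 2, 0.5, 0))

# Mean score per column, then column names ordered from highest to lowest mean
mean_scores <- vapply(music, function(col) mean(score_one(col)), numeric(1))
ranked <- names(sort(mean_scores, decreasing = TRUE))

mean_scores
ranked
```

With this rule several columns can tie on their mean score, so the relative order of tied columns should not be relied on.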

gather() per grouped variables in R for specific columns

I have a long data frame with players' decisions who worked in groups.
I need to convert the data in such a way that each row (individual observation) would contain all group members decisions (so we basically can see whether they are interdependent).
Let's say the generating code is:
group_id <- c(rep(1, 3), rep(2, 3))
player_id <- c(rep(seq(1, 3), 2))
player_decision <- seq(10,60,10)
player_contribution <- seq(6,1,-1)
df <-
data.frame(group_id, player_id, player_decision, player_contribution)
So the initial data looks like:
group_id player_id player_decision player_contribution
1 1 1 10 6
2 1 2 20 5
3 1 3 30 4
4 2 1 40 3
5 2 2 50 2
6 2 3 60 1
But I need to convert it to wide format within each group, but only for some of these variables (in this example, specifically for player_contribution), in such a way that the rest of the data remains. So the head of the converted data would be:
data.frame(group_id=c(1,1),
player_id=c(1,2),
player_decision=c(10,20),
player_1_contribution=c(6,6),
player_2_contribution=c(5,5),
player_3_contribution=c(4,6)
)
group_id player_id player_decision player_1_contribution player_2_contribution player_3_contribution
1 1 1 10 6 5 4
2 1 2 20 6 5 6
I suspect I need to group_by in dplyr and then somehow gather per group but only for player_contribution (or a vector of variables). But I really have no clue how to approach it. Any hints would be welcome!
Here is a solution using tidyr and dplyr.
Make a data frame with the columns for the players' contributions. Then join this data frame back onto the columns of interest from the original data frame.
library(tidyr)
library(dplyr)
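# one row per group, one column per player's contribution
wide <- pivot_wider(df, id_cols = -player_decision,
                    names_from = player_id,
                    values_from = player_contribution,
                    names_prefix = "player_contribution_")
# join the wide contributions back onto the per-player columns
answer <- left_join(df[, c("group_id", "player_id", "player_decision")], wide, by = "group_id")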
answer
group_id player_id player_decision player_contribution_1 player_contribution_2 player_contribution_3
1 1 1 10 6 5 4
2 1 2 20 6 5 4
3 1 3 30 6 5 4
4 2 1 40 3 2 1
5 2 2 50 3 2 1
6 2 3 60 3 2 1
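If you want the column names in the form asked for in the question (player_1_contribution and so on), pivot_wider() has a names_glue argument. A sketch, assuming a tidyr version recent enough to support names_glue, and using the question's generating code for the data:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  group_id = rep(1:2, each = 3),
  player_id = rep(1:3, times = 2),
  player_decision = seq(10, 60, 10),
  player_contribution = 6:1
)

# One row per group, one column per player, named player_<id>_contribution
wide <- df %>%
  select(group_id, player_id, player_contribution) %>%
  pivot_wider(
    names_from = player_id,
    values_from = player_contribution,
    names_glue = "player_{player_id}_contribution"
  )

# Attach the group-level columns to every player row
answer <- df %>%
  select(-player_contribution) %>%
  left_join(wide, by = "group_id")

answer
```

Selecting only the needed columns before pivoting avoids having to spell out id_cols.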

value of certain column based on multiple conditions in two data frames R

As shown above, there are two data frames, df1 and df2.
If you look at btime in df1, there are NAs.
I want to fill the btime NAs for every unique where stnseq = 1, so only the first NA of each unique will be filled.
The value I would like to fill in is in df2. The condition would be: for every unique with boardstation = 8501970, add the value in the departure column.
I have tried the aggregate function, but I do not know how to restrict the condition to boardstation 8501970 only.
Thanks to anyone for any help.
If I understood the question correctly then this might help.
library(dplyr)
df2 %>%
group_by(unique) %>%
summarise(departure_sum = sum(departure[boardstation==8501970])) %>%
right_join(df1, by="unique") %>%
mutate(btime = ifelse(is.na(btime) & stnseq==1, departure_sum, btime)) %>%
select(-departure_sum) %>%
data.frame()
Since the sample data was posted as an image, I made up my own data, shown below:
df1
unique stnseq btime
1 1 1 NA
2 1 2 NA
3 2 1 NA
4 2 2 200
df2
unique boardstation departure
1 1 8501970 1
2 1 8501970 2
3 1 123 3
4 2 8501970 4
5 2 456 5
6 3 900 6
Output is:
unique stnseq btime
1 1 1 3
2 1 2 NA
3 2 1 4
4 2 2 200
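To make this reproducible, here is a sketch that builds the made-up data frames above in code. It filters df2 before summarising and uses left_join() from df1 (rather than right_join()) so the rows stay in df1's order; those are my substitutions, and the names dep and filled are just for illustration:

```r
library(dplyr)

df1 <- data.frame(
  unique = c(1, 1, 2, 2),
  stnseq = c(1, 2, 1, 2),
  btime  = c(NA, NA, NA, 200)
)
df2 <- data.frame(
  unique       = c(1, 1, 1, 2, 2, 3),
  boardstation = c(8501970, 8501970, 123, 8501970, 456, 900),
  departure    = 1:6
)

# Sum of departure per unique, restricted to boardstation 8501970
dep <- df2 %>%
  filter(boardstation == 8501970) %>%
  group_by(unique) %>%
  summarise(departure_sum = sum(departure))

# Fill btime only where it is missing and stnseq is 1
filled <- df1 %>%
  left_join(dep, by = "unique") %>%
  mutate(btime = if_else(is.na(btime) & stnseq == 1,
                         as.numeric(departure_sum), btime)) %>%
  select(-departure_sum)

filled
```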

Assign ID across 2 columns of variable

I have a data frame in which each individual (row) has two data points per variable.
Example data:
df1 <- read.table(text = "IID L1.1 L1.2 L2.1 L2.2
1 1 38V1 38V1 48V1 52V1
2 2 36V1 38V2 50V1 48Y1
3 3 37Y1 36V1 50V2 48V1
4 4 38V2 36V2 52V1 50V2",
stringsAsFactors = FALSE, header = TRUE)
I have many more columns than this in the full dataset and would like to recode these values to label unique identifiers across the two columns. I know how to get identifiers and relabel a single column from previous questions (Creating a unique ID and How to assign a unique ID number to each group of identical values in a column) but I don't know how to include the information for two columns, as R identifies and labels factors per column.
Ultimately I want something that would look like this for the above data:
(df2)
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 1 5 4 3
It doesn't really matter what the numbers are, as long as they indicate unique values across both columns. I've tried creating a function based on the output from:
unique(df1[,1:2])
but am struggling as this still looks at unique entries per column, not across the two.
Something like this would work...
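pairs <- (ncol(df1) - 1) / 2 # number of column pairs after IID
for (i in seq_len(pairs)) {
  # unique values across both columns of pair i, in order of first appearance
  refs <- unique(c(df1[, 2 * i], df1[, 2 * i + 1]))
  # replace each value with its position in refs
  df1[, 2 * i] <- match(df1[, 2 * i], refs)
  df1[, 2 * i + 1] <- match(df1[, 2 * i + 1], refs)
}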
df1
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 4 5 4 3
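If the columns are not always in neat adjacent pairs, the same match() idea can be keyed on the column-name prefix instead of the column position. A sketch, assuming every variable is named prefix.number (L1.1, L1.2, ...) as in the example; it writes to a new data frame df2 rather than overwriting df1:

```r
df1 <- read.table(text = "IID L1.1 L1.2 L2.1 L2.2
1 1 38V1 38V1 48V1 52V1
2 2 36V1 38V2 50V1 48Y1
3 3 37Y1 36V1 50V2 48V1
4 4 38V2 36V2 52V1 50V2",
stringsAsFactors = FALSE, header = TRUE)

# Prefixes such as "L1" and "L2", taken from every column except IID
prefixes <- unique(sub("\\.[0-9]+$", "", names(df1)[-1]))

df2 <- df1
for (p in prefixes) {
  # all columns that belong to this prefix
  cols <- grep(paste0("^", p, "\\."), names(df1), value = TRUE)
  # unique values across those columns, in order of first appearance
  refs <- unique(unlist(df1[cols], use.names = FALSE))
  # replace each value with its position in refs
  df2[cols] <- lapply(df1[cols], match, table = refs)
}

df2
```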
You could reshape it to long format, assign the groups and then recast it to wide:
library(data.table)
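# convert first so that data.table's melt method is used
df_m <- melt(as.data.table(df1), id.vars = "IID")
# one id per unique (column prefix, value) combination
df_m[, id := .GRP, by = .(gsub("(.*).", "\\1", variable), value)]
dcast(df_m, IID ~ variable, value.var = "id")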
# IID L1.1 L1.2 L2.1 L2.2
#1 1 1 1 6 9
#2 2 2 4 7 10
#3 3 3 2 8 6
#4 4 1 5 9 8
This should also be easily expandable to multiple groups of columns. I.e. if you have L3. it should work with that as well.
