I have a matrix with numerous cases and their answers to a questionnaire. The heavily reduced example below (raw_responses) contains the answers of 10 persons to 5 items. Assume these are multiple-choice items with 4 possible answers each. If an item was not processed, the person received the code 9.
raw_responses <- data.frame('id' = 1:10,
'item_1' = sample(c(1:4,9), 10, replace = TRUE),
'item_2' = sample(c(1:4,9), 10, replace = TRUE),
'item_3' = sample(c(1:4,9), 10, replace = TRUE),
'item_4' = sample(c(1:4,9), 10, replace = TRUE),
'item_5' = sample(c(1:4,9), 10, replace = TRUE))
The correct answers are stored in a separate table that reflects the entire test design. Below is again a heavily reduced variant (design) with only the item names and the corresponding correct answers.
design <- data.frame('item' = c('item_1','item_2','item_3','item_4','item_5'),
'key' = sample(1:4, 5, replace = TRUE))
Finally, the goal is a table with scored answers. A correct answer is coded with 1, a wrong one with 0 and an "empty" answer with 99. This works for example with the for loop below.
scored_responses <- raw_responses
for(item in colnames(raw_responses)[2:6]) {
scored_responses[, item] <- ifelse(scored_responses[, item] == design[design$item == item, 'key'], 1,
ifelse(scored_responses[, item] == 9, 99, 0))
}
However, I was wondering whether there is a more efficient variant with dplyr (including case_when) and possibly purrr. The very extensive answer table is already cleaned up with a longer dplyr pipe, so it would be an advantage if the scoring could be built in there.
I thank you in advance for all ideas and hints.
Get the data in long format, join, recode the values and get the data back in wide format.
library(dplyr)
library(tidyr)
raw_responses %>%
pivot_longer(cols = -id, names_to = 'item') %>%
left_join(design, by = 'item') %>%
mutate(value = case_when(value == 9 ~ 99,
value == key ~ 1,
TRUE ~ 0)) %>%
select(-key) %>%
pivot_wider(names_from = 'item')
# A tibble: 10 x 6
# id item_1 item_2 item_3 item_4 item_5
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 99 99 1 0 0
# 2 2 99 99 99 0 0
# 3 3 1 99 0 99 99
# 4 4 0 1 1 99 1
# 5 5 99 0 1 0 1
# 6 6 0 1 0 0 1
# 7 7 0 0 0 1 99
# 8 8 1 99 0 0 0
# 9 9 0 99 99 0 1
#10 10 99 1 99 1 0
Another approach, without reshaping the data into long format, is to use map2_dfc from purrr:
library(purrr)
map2_dfc(raw_responses[-1], design$key, ~case_when(.x == 9 ~ 99,
.x == .y ~ 1,
TRUE ~ 0))
However, this only works if the item columns in raw_responses are in the same order as design$item. In this example they already are; if they are not in the real data, reorder the columns first:
raw_responses[-1] <- raw_responses[-1][design$item]
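If the scoring should sit inside an existing dplyr pipe and not depend on column order, one option is to look up each key by column name. The following is a minimal sketch, assuming dplyr >= 1.0.0 (for across() and cur_column()); the name key_lookup and the fixed example values are illustrative and not part of the original data.

```r
library(dplyr)

# Small fixed example (values chosen for illustration only)
raw_responses <- data.frame(id     = 1:4,
                            item_1 = c(1, 2, 9, 4),
                            item_2 = c(3, 3, 1, 9),
                            item_3 = c(9, 2, 2, 1))
# Deliberately listed in a different order than the columns
design <- data.frame(item = c("item_3", "item_1", "item_2"),
                     key  = c(2, 1, 3))

# Named vector: item name -> correct answer (hypothetical helper name)
key_lookup <- setNames(design$key, design$item)

scored_responses <- raw_responses %>%
  mutate(across(all_of(design$item),
                ~ case_when(.x == 9 ~ 99,                         # not processed
                            .x == key_lookup[[cur_column()]] ~ 1, # correct
                            TRUE ~ 0)))                           # wrong
scored_responses
```

Because the key is found by name, the rows of design do not have to match the column order of raw_responses, and the id column is kept.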
Related
I am trying to multiply the first column with every second subsequent column, subject to a condition.
The condition is that the first column (Stores) equals 10. Below you can see my data.
df<-data.frame(
Stores=c(10,30,10,0,10),
Value1=c(10,10,0,100,0),
Value2=c(10,10,0,100,0),
Value3=c(10,0,0,0,0),
Value4=c(10,10,0,0,0)
)
df
Multiplying the values works with this command, but without any condition:
df[,1] * df[seq(3,ncol(df), by = 2)]
Now I want to add a condition on the first column. I tried the command below, but it does not work:
ifelse(df[,1]==10,1,0) * df[seq(3,ncol(df), by = 2)]
Can anybody help me solve this, so that the values are multiplied only where the first column (Stores) equals 10?
You can extend your selection with df$Stores == 10 e.g.
df[df$Stores == 10, 1] * df[df$Stores == 10, seq(3, ncol(df), by = 2)]
Value2 Value4
1 100 100
3 0 0
5 0 0
If you want to keep the structure of the data frame, try this approach using dplyr. Because columns can be selected by name, the column-wise condition can be expressed as the variable name plus the variable number, i.e. paste0("Value", seq(2, 4, 2)), which equals "Value2" and "Value4".
library(dplyr)
library(tidyr)
df %>%
mutate(n = row_number()) %>%
pivot_longer(Value1:Value4) %>%
mutate(is = Stores == 10,
value = if_else(is & name %in% paste0("Value", seq(2, 4, 2)),
value * Stores[is][1], value)) %>%
pivot_wider(id_cols = c(Stores, n)) %>%
select(-n)
# A tibble: 5 × 5
Stores Value1 Value2 Value3 Value4
<dbl> <dbl> <dbl> <dbl> <dbl>
1 10 10 100 10 100
2 30 10 10 0 10
3 10 0 0 0 0
4 0 100 100 0 0
5 10 0 0 0 0
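The same table can also be produced in base R without reshaping, by assigning into the selected rows and columns. This is a minimal sketch using the df from the question; rows and cols are illustrative helper names.

```r
df <- data.frame(
  Stores = c(10, 30, 10, 0, 10),
  Value1 = c(10, 10, 0, 100, 0),
  Value2 = c(10, 10, 0, 100, 0),
  Value3 = c(10, 0, 0, 0, 0),
  Value4 = c(10, 10, 0, 0, 0)
)

rows <- df$Stores == 10            # rows that meet the condition
cols <- seq(3, ncol(df), by = 2)   # every second column: Value2, Value4

# Multiply only the selected cells by Stores; all other cells stay untouched
df[rows, cols] <- df[rows, cols] * df$Stores[rows]
df
```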
I am trying to create a variable for a data frame in which I count the number of observations between two observations that meet a criterion. Here it is the number of games since the last win.
Say I have a dataframe like this:
df <- data.frame(player = c(10,10,10,10,10,10,10,10,10,10,10),win = c(1,0,0,0,1,1,0,1,0,0,1))
I want to create a new variable that counts the number of games it has been since the player has won.
Summarized in a vector, the result should be (setting a Not Applicable for the first observation):
c(NA,0,1,2,3,0,0,1,0,1,2)
I want to be able to do this easily and create it as a variable in the data.frame using dplyr (or any other suitable approach)
I am not quite sure why the first value should be NA. Because the elapsed time is 0 since the last "win" and not NA.
For purely logical reasons, I would take the following approach:
seq = with(df, ave(win, cumsum(win == 1), FUN = seq_along)-1)
This gives the cumulative number of games since the last win:
c(0,1,2,3,0,0,1,0,1,2,0)
But if you still want the result you described, you can get it by shifting the vector by one position:
c(NA, head(seq, -1))
It is not nice, but it works ;)
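If the data holds several players, the same idea can be applied per player. This is a minimal base R sketch, assuming the rows are already in game order; the helper games_since_win and the second player are made up for illustration. Games before a player's first win are counted from the start of that player's data, which may or may not be what you want.

```r
df <- data.frame(player = c(rep(10, 11), rep(20, 4)),
                 win    = c(1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1,
                            1, 0, 0, 1))

# Hypothetical helper: games since the last win, as known before each game
games_since_win <- function(win) {
  # counter restarts at 0 on every win
  since <- ave(win, cumsum(win == 1), FUN = seq_along) - 1
  # shift by one position so each row only reflects earlier games
  c(NA, head(since, -1))
}

# Apply the helper separately within each player
df$counter <- ave(df$win, df$player, FUN = games_since_win)
df
```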
With {tidyverse}, try:
library(tidyverse)
df <- data.frame(player = c(10,10,10,10,10,10,10,10,10,10,10),
win = c(1,0,0,0,1,1,0,1,0,0,1))
df %>%
group_by(player, group = cumsum(win != lag(win, default = first(win)))) %>%
mutate(counter = row_number(),
counter = if_else(win == 1, true = 0L, false = counter)) %>%
ungroup() %>%
group_by(player) %>%
mutate(counter = if_else(row_number() == 1, true = NA_integer_, false = counter)) %>%
ungroup() %>%
select(-group)
player win counter
<dbl> <dbl> <int>
1 10 1 NA
2 10 0 1
3 10 0 2
4 10 0 3
5 10 1 0
6 10 1 0
7 10 0 1
8 10 1 0
9 10 0 1
10 10 0 2
11 10 1 0
Given a larger data frame with around 300k+ rows and 14 columns in the following form:
df <- data.frame(team_id = c(rep(1,10),rep(2,10),rep(3,10),rep(4,10),rep(5,10)),
year = rep(c(1954:1963), 5), members= c(0,0,0,1,1,1,2,0,0,0,0,0,2,1,1,1,0,0,0,0, 1,1,1,1,1,1,1,1,1,1,0,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,0,0),
size = c(rep(60,8),50,50,rep(40,7),50,50,70,rep(30,10),rep(99,6),110,101,101,101,rep(80,9),66) )
The aim is to create a new vector containing, for each team, the change in size once all members have left (members changes from 2 or 1 to 0): the size in the year of the last departure is subtracted from the next different size.
The direction of change should be shown so absolute values are not necessary.
What I achieved so far is:
df2 <- df %>% arrange(team_id,year) %>%
group_by(team_id) %>%
mutate(sizediff = if_else(members == 1 & lead(members) == 0 | members == 2 & lead(members) == 0,1,0, missing = 0) )
However, instead of the value 1 in the sizediff vector I want the difference to the future size. Maybe a change from long to wide format or a conditional re-arrangement of the year vector could help, but I am stuck. What I want to achieve looks like:
aim <- data.frame(team_id = c(rep(1,10),rep(2,10),rep(3,10),rep(4,10),rep(5,10)),
year = rep(c(1954:1963), 5), members= c(0,0,0,1,1,1,2,0,0,0, 0,0,2,1,1,1,0,0,0,0, 1,1,1,1,1,1,1,1,1,1, 0,1,1,1,0,0,0,0,0,0, 1,1,1,1,1,1,1,1,0,0 ) ,
size = c(57,rep(60,7),50,50,rep(40,7),50,50,70,rep(30,10),rep(99,6),110,101,101,101,88,rep(80,8),66),
sizediff = c(rep(0,6),-10,rep(0,3),rep(0,5),10,rep(0,4),rep(0,10),rep(0,3),11,rep(0,6),rep(0,7),-14,rep(0,2)) )
Is this something you are looking for?
df %>%
arrange(team_id, year) %>%
mutate(diff = if_else((members> 0 & dplyr::lead(members, n=1)==0), size, 0)) %>%
group_by(team_id) %>%
mutate(diff = ifelse(diff>0, dplyr::last(size)-size, NA))
Try this custom approach :
library(dplyr)
df %>%
group_by(team_id) %>%
mutate(sizediff = {
sizediff = rep(0, n())
inds <- which(members %in% c(1, 2) & lead(members) == 0)[1]
sizediff[inds] <- size[which(row_number() > inds & size != size[inds])[1]] - size[inds]
sizediff
}) -> result
result
# team_id year members size sizediff
# <dbl> <int> <dbl> <dbl> <dbl>
# 1 1 1954 0 60 0
# 2 1 1955 0 60 0
# 3 1 1956 0 60 0
# 4 1 1957 1 60 0
# 5 1 1958 1 60 0
# 6 1 1959 1 60 0
# 7 1 1960 2 60 -10
# 8 1 1961 0 60 0
# 9 1 1962 0 50 0
#10 1 1963 0 50 0
# … with 40 more rows
We first initialise sizediff to 0. inds is the first row where the members left (members is 1 or 2 and the next value is 0). We then take the next size value that differs from the size at inds, subtract the size at inds from it, and store the result at position inds.
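To make the steps easier to follow, here is the same logic for a single team's vectors in base R. It is a minimal sketch under the assumption that the vectors are sorted by year; the function name size_change_after_departure is hypothetical, and like the answer above it only handles the first departure per team.

```r
# Hypothetical helper for one team's vectors, already sorted by year
size_change_after_departure <- function(members, size) {
  out <- numeric(length(members))
  next_members <- c(members[-1], NA)
  # first year with members present and none the year after
  dep <- which(members > 0 & next_members == 0)[1]
  if (is.na(dep)) return(out)   # nobody ever left
  # first later year whose size differs from the departure year
  later <- which(seq_along(size) > dep & size != size[dep])[1]
  if (!is.na(later)) out[dep] <- size[later] - size[dep]
  out
}

# Team 1 from the question
members <- c(0, 0, 0, 1, 1, 1, 2, 0, 0, 0)
size    <- c(rep(60, 8), 50, 50)
team1   <- size_change_after_departure(members, size)
team1
```

Such a helper could then be called once per team, for example inside group_by(team_id) followed by mutate().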
shift = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3)
count =c(1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7)
test <- cbind(shift,count)
I am trying to mark the last row of every shift (the rows with count = c(8,10,7)) with a binary 1 and every other row with 0. Right now I am thinking that might be possible with a left join, but I am not quite sure. I would prefer not to work with loops but rather to use some techniques from dplyr. Thanks guys!
Assuming that you want to add a new 0/1 column last that contains a 1 in the last row of each shift and that the shifts are contiguous, here are two base R approaches:
transform(test, last = ave(count, shift, FUN = function(x) x == max(x)))
transform(test, last = +!duplicated(shift, fromLast = TRUE))
or with dplyr use mutate:
test %>%
as.data.frame %>%
group_by(shift) %>%
mutate(last = +(1:n() == n())) %>%
ungroup
test %>%
as.data.frame %>%
mutate(last = +!duplicated(shift, fromLast = TRUE))
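If a shift value can reappear later, so that the shifts are not contiguous, duplicated() would only mark the very last occurrence of each value. A minimal base R sketch that marks the last row of every consecutive run instead, using the vectors from the question:

```r
shift <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3)
count <- c(1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7)
test  <- as.data.frame(cbind(shift, count))

# A row is the last of its run if the next shift value differs;
# the final row is always the last of its run
test$last <- +c(diff(test$shift) != 0, TRUE)
test[test$last == 1, ]
```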
Try this one
library(dplyr)
test %>%
as_tibble() %>%
group_by(shift) %>%
mutate(is_last = ifelse( row_number() == max(row_number()), 1, 0)) %>%
ungroup()
# A tibble: 25 x 3
shift count is_last
<dbl> <dbl> <dbl>
1 1 1 0
2 1 2 0
3 1 3 0
4 1 4 0
5 1 5 0
6 1 6 0
7 1 7 0
8 1 8 1
9 2 1 0
10 2 2 0
# … with 15 more rows
I have a dataframe (df1) containing many records. Each record has up to three trials, and each trial can be repeated up to five times. Below is an example of some of the data I have:
Record Trial Start End Speed Number
1 2 1 4 12 9
1 2 4 6 11 10
1 3 1 3 10 17
2 1 1 5 14 5
I have the following code that calculates the longest 'Distance' and 'Maximum Number' for each Record:
library(dplyr)
getInfo <- function(race_df) {
# longest distance and largest Number per Record and Trial
race_distance <- as.data.frame(race_df %>% group_by(Record, Trial) %>% summarise(max.distance = max(End - Start)))
race_max_number <- as.data.frame(race_df %>% group_by(Record, Trial) %>% summarise(max.N = max(Number)))
rd_rmn_merge <- as.data.frame(merge(x = race_distance, y = race_max_number))
total_summary <- as.data.frame(rd_rmn_merge[order(rd_rmn_merge$Trial), ])
return(list(race_distance, race_max_number, total_summary))
}
list_summary <- getInfo(race_df)
total_summary <- list_summary[[3]]
list_summary gives me an output like this:
[[1]]
Record Trial Max.Distance
1 2 3
1 3 2
2 1 4
[[2]]
Record Trial Max.Number
1 2 10
1 3 17
2 1 5
[[3]]
Record Trial Max.Distance Max.Number
1 2 3 10
1 3 2 17
2 1 4 5
I am now trying to find the longest distance together with its corresponding 'Number', regardless of whether that Number is the maximum. So Record 1, Trial 2 should look like this instead:
Record Trial Max.Distance Corresponding Number
1 2 3 9
Eventually I would like to be able to create a function that is able to take arguments 'Record' and 'Trial' through the 'race_df' dataframe to make searching for a specific record and trial's longest distance easier.
Any help on this would be much appreciated.
The data (in case anyone else wants to offer their solution):
df <- data.frame( Record = c(1,1,1,2),
Trial = c(2,2,3,1),
Start = c(1,4,1,1),
End = c(4,6,3,5),
Speed = c(12,11,10,14),
Number = c(9,10,17,5))
Here's a tidyverse solution:
library(tidyverse)
df %>%
mutate( Max.Distance = End - Start) %>%
select(-Start,-End,-Speed) %>%
group_by(Record) %>%
nest() %>%
mutate( data = map( data, ~ filter(.x, Max.Distance == max(Max.Distance)) )) %>%
unnest(cols = data)
The output:
Record Trial Number Max.Distance
<dbl> <dbl> <dbl> <dbl>
1 1 2 9 3
2 2 1 5 4
Note: if you want to keep all of your columns in the final data frame, just remove the select() call.
I hope I get right what your function is supposed to do. In the end it should take a record and a trial and put out the row(s) where we have the maximum distance, right?
So, it boils down to two filters:
filter rows for the record and trial.
filter the row inside that subset that has the maximum distance
Between those two filters, we have to calculate the distance although I suggest you move that outside the function because it is basically a one time operation.
race_df <- data.frame(Record = c(1, 1, 1, 2), Trial = c(2, 2, 3, 1),
Start = c(1, 4, 1, 1), End = c(4, 6, 3, 5), Speed = c(12, 11, 10, 14),
Number = c(9, 10, 17, 5))
get_longest <- function(df, record, trial){
df %>%
filter(Record == record & Trial == trial) %>%
mutate(Distance = End - Start) %>%
filter(Distance == max(Distance)) %>%
select(Number, Distance)
}
get_longest(race_df, 1, 2)
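To get the longest distance and its corresponding Number for every Record and Trial combination at once, rather than one pair per call, a grouped slice_max() may be enough. This is a minimal sketch, assuming dplyr >= 1.0.0; the name longest_per_trial is illustrative.

```r
library(dplyr)

race_df <- data.frame(Record = c(1, 1, 1, 2), Trial = c(2, 2, 3, 1),
                      Start = c(1, 4, 1, 1), End = c(4, 6, 3, 5),
                      Speed = c(12, 11, 10, 14),
                      Number = c(9, 10, 17, 5))

longest_per_trial <- race_df %>%
  mutate(Distance = End - Start) %>%
  group_by(Record, Trial) %>%
  slice_max(Distance, n = 1) %>%   # keeps all tied rows, if any
  ungroup() %>%
  select(Record, Trial, Distance, Number)
longest_per_trial
```

For Record 1, Trial 2 this keeps the row with Distance 3 and Number 9, not the larger Number 10 from the shorter repeat.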