I have a data frame with a few variables to reverse code, and a separate vector that lists all the variables to reverse code. I'd like to use mutate_at(), or some other tidy way, to reverse code them all in one line of code. Here's the dataset and the vector of items to reverse:
library(tidyverse)
mock_data <- tibble(id = 1:5,
item_1 = c(1, 5, 3, 5, 5),
item_2 = c(4, 4, 4, 1, 1),
item_3 = c(5, 5, 5, 5, 1))
reverse <- c("item_2", "item_3")
Here's what I want it to look like with only items 2 and 3 reverse coded:
library(tidyverse)
solution <- tibble(id = 1:5,
item_1 = c(1, 5, 3, 5, 5),
item_2 = c(2, 2, 2, 5, 5),
item_3 = c(1, 1, 1, 1, 5))
I've tried the code below. I know that the recode is correct because I've used it for other datasets, but I know something is off with the %in% operator.
library(tidyverse)
mock_data %>%
mutate_at(vars(. %in% reverse), ~(recode(., "1=5; 2=4; 3=3; 4=2; 5=1")))
Error: `. %in% reverse` must evaluate to column positions or names, not a logical vector
Any help would be appreciated!
You can pass reverse directly to mutate_at, which accepts a character vector of column names, so there is no need for vars(. %in% reverse). I would also simplify the reversing to 6 minus the current value.
mock_data %>% mutate_at(reverse, ~6 - .)
# # A tibble: 5 x 4
# id item_1 item_2 item_3
# <int> <dbl> <dbl> <dbl>
# 1 1 1 2 1
# 2 2 5 2 1
# 3 3 3 2 1
# 4 4 5 5 1
# 5 5 5 5 5
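If there's a possibility that reverse includes columns that are not in mock_data, and you want to skip those, use mutate_at(vars(one_of(reverse)), ...). In current versions of tidyselect, one_of() is superseded and any_of(reverse) is the recommended replacement.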
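In dplyr 1.0.0 and later, the scoped verbs such as mutate_at() are superseded by across(). Here is a minimal sketch of the same recode, assuming dplyr >= 1.0.0 is installed. The column name item_9 and the result names reversed and reversed_lenient are hypothetical, used only to show that any_of() skips names that are absent:

```r
library(dplyr)

mock_data <- tibble(id = 1:5,
                    item_1 = c(1, 5, 3, 5, 5),
                    item_2 = c(4, 4, 4, 1, 1),
                    item_3 = c(5, 5, 5, 5, 1))
reverse <- c("item_2", "item_3")

# all_of() is strict: it errors if a listed column is missing
reversed <- mock_data %>%
  mutate(across(all_of(reverse), ~ 6 - .x))

# any_of() is lenient: names not present (item_9 is hypothetical) are skipped
reversed_lenient <- mock_data %>%
  mutate(across(any_of(c(reverse, "item_9")), ~ 6 - .x))
```

Both calls leave id and item_1 untouched and only rewrite the columns named in reverse.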
I want to make a function that can detect if there is a matching pair of numbers. I want to simulate x and y many times to see the # of matches occurring using a function.
x<-sample(1:6,6)
y<-sample(1:6,6)
x;y
For example, I have x <- c(2, 5, 6, 4, 3, 1) and y <- c(2, 1, 6, 5, 4, 3). The numbers 2 and 6 match at the same position, so there are 2 pairs. If there is no match between x and y, the result should just be 0. I can use sum(x==y) to find the count for one example of x and y.
How can I make a function that finds number of identical pairs for many x and y?
You can just use
f<-function(n,k) {
sapply(1:k, \(i) sum(sample(n) == sample(n)))
}
where k is the number of iterations and n is the range (in your case 6)
Example Usage:
f(n=6, k=100)
In base R the following function would do the trick. The length of each vector is given by the size argument, and the number of trials is given by n.
n_pairs <- function(size, n) {
colSums(replicate(n, sample(size)) == replicate(n, sample(size)))
}
So, for example we can see:
set.seed(1)
n_pairs(size = 6, n = 5)
#> [1] 2 0 1 1 1
hist(n_pairs(6, 100), breaks = 0:6)
mean(n_pairs(6, 1000))
#> [1] 1.013
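Note though that R already has the function rbinom, which gives a quick approximation. Both have mean 1, but the number of matches between two random permutations is not exactly binomial (for instance, exactly size - 1 matches can never occur), so treat this as a rough stand-in rather than the same distribution: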
rbinom(n, size, 1/size)
Created on 2022-04-26 by the reprex package (v2.0.1)
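To make the comparison between the simulation and a binomial draw concrete, here is a small base R sketch. The helper names count_matches and simulate_matches are hypothetical, not taken from the answers above, and the seed is arbitrary:

```r
# Number of positions at which two equal-length vectors agree
count_matches <- function(x, y) sum(x == y)

# Repeat the experiment n times with two fresh permutations of 1:size
simulate_matches <- function(size, n) {
  replicate(n, count_matches(sample(size), sample(size)))
}

# The example from the question: 2 and 6 match in position
example_pairs <- count_matches(c(2, 5, 6, 4, 3, 1), c(2, 1, 6, 5, 4, 3))

set.seed(42)
m <- simulate_matches(size = 6, n = 2000)  # permutation-based counts
b <- rbinom(2000, 6, 1 / 6)                # binomial stand-in

# Side-by-side frequency tables; the permutation counts never equal size - 1
table(factor(m, levels = 0:6))
table(factor(b, levels = 0:6))
```

Comparing the two tables (or var(m) against var(b)) gives a feel for how close the binomial shortcut is for a given size.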
Maybe this one (removed first answer):
x<- c(2, 5, 6, 4, 3, 1)
y<- c(2, 1, 6, 5, 4, 3)
lst = list(x,y)
pairs <- outer(lst,lst,Vectorize(function(x,y){x[x==y]}))
pairs[1,2]
[[1]]
[1] 2 6
A possible solution with dplyr package
require(tidyverse)
x <- c(2, 5, 6, 4, 3, 1)
y <- c(2, 1, 6, 5, 4, 3)
df <- tibble(x = x,
y = y) %>%
mutate(pair = case_when(x == y ~ "PAIR",
TRUE ~ "NOT"))
The dataset:
# A tibble: 6 x 3
x y pair
<dbl> <dbl> <chr>
1 2 2 PAIR
2 5 1 NOT
3 6 6 PAIR
4 4 5 NOT
5 3 4 NOT
6 1 3 NOT
Filtering:
df %>%
filter(pair == "PAIR")
Output:
# A tibble: 2 x 3
x y pair
<dbl> <dbl> <chr>
1 2 2 PAIR
2 6 6 PAIR
Will this give you what you want? Make a table out of the values that are paired.
table(x[x==y])
x <- sample(1:6,1000, TRUE)
y <- sample(1:6,1000, TRUE)
table(x[x==y])
# 1 2 3 4 5 6
# 37 26 32 28 30 33
A MWE is as follows:
I have 3 groups with 2, 4, and 3 subjects respectively. So I have:
library(dplyr)
Group <- c(1, 1, 2, 2, 2, 2, 3, 3, 3)
Subject_ID <- c(1, 2, 1, 2, 3, 4, 1, 2, 3)
df <- rbind(Group, Subject_ID)
Since the subjects in different groups are different people, I want the subject ID to be unique for each subject in the dataset. What I did was as follows:
Num_Subjects <- c(length(unique(filter(df, Group == 1)$Subject)),
length(unique(filter(df, Group == 2)$Subject)),
length(unique(filter(df, Group == 3)$Subject))
)
# Then I defined a summation function to calculate how many subjects there are in all previous groups.
sumfun <- function(x,start,end){
return(sum(x[start:end]))
}
# Then I defined another function that generates a new subject ID for each subject in each group.
SubjIDFn <- function(x, i) {
x %>% filter(Session == i) %>% mutate(
Sujbect = Subject + sumfun(Num_Subjects, 1, i-1)
)
}
# Then I loop this from group 2 to group 3,
for (i in 2:3) {
df.Corruption.WithoutS1 <- SubjIDFn(df.Corruption.WithoutS1, i)
}
After this, the data set has zero observations. I don't know where it went wrong, and I don't know what a smarter solution to this problem would be. Thanks for your help!
I think you're overcomplicating it a bit... If Subject_ID is unique within groups, you can simply go with:
library(dplyr)
Group <- c(1, 1, 2, 2, 2, 2, 3, 3, 3)
Subject_ID <- c(1, 2, 1 ,2, 3, 4, 1, 2, 3)
df <- bind_cols(Group=Group, Subject_ID=Subject_ID)
df %>% mutate(unique_id = paste(Group, Subject_ID, sep="."))
# A tibble: 9 x 3
Group Subject_ID unique_id
<dbl> <dbl> <chr>
1 1 1 1.1
2 1 2 1.2
3 2 1 2.1
4 2 2 2.2
5 2 3 2.3
6 2 4 2.4
7 3 1 3.1
8 3 2 3.2
9 3 3 3.3
Note that I used bind_cols instead of rbind to have a dataframe instead of a matrix.
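As for why the loop in the question ends with zero observations: each iteration overwrites the data with only the rows where Session == i, so the second pass filters the Session == 2 rows for Session == 3 and nothing is left. If a numeric running ID is preferred over a pasted label, the offset idea from the question can be sketched without a loop. This assumes dplyr is installed; the names offsets, n_subj, offset, new_id and df_ids are hypothetical:

```r
library(dplyr)

df <- tibble(Group      = c(1, 1, 2, 2, 2, 2, 3, 3, 3),
             Subject_ID = c(1, 2, 1, 2, 3, 4, 1, 2, 3))

# Number of subjects per group, and how many subjects precede each group
offsets <- df %>%
  group_by(Group) %>%
  summarise(n_subj = n_distinct(Subject_ID)) %>%
  mutate(offset = cumsum(n_subj) - n_subj)

# Shift each within-group ID by the number of subjects in earlier groups
df_ids <- df %>%
  left_join(offsets, by = "Group") %>%
  mutate(new_id = Subject_ID + offset)
```

With the example data the offsets are 0, 2 and 6, so new_id runs from 1 to 9 across the three groups.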
I am trying to create a new column that counts each column where criteria are met. That is because I want to summarize the number of correct answers by each participant in my master thesis. I am really new to R and in desperate need for help, even on easy tasks.
For Example:
(Participant, Task1, Task2, Task3; COUNT)
1 4 8 1 ; 1|
2 3 8 7 ; 1|
3 1 3 4 ; 2|
4 5 6 4 ; 1|
5 1 8 4 ; 3
The column COUNT should count all correct answers in the columns Task1-Task3. If the correct answers are (1, 8, 4), the COUNT column should contain the numbers shown in the example above.
Can anybody tell me how to create such a variable?
Really appreciated, thanks
Luca
We can use rowSums: expand the vector c(1, 8, 4) so that each element lines up with its 'Task' column (via col()), compare with ==, and take the rowSums of the result.
i1 <- startsWith(names(df1), 'Task')
df1$COUNT <- rowSums(df1[i1] == c(1, 8, 4)[col(df1[i1])])
df1$COUNT
#[1] 1 1 2 1 3
Or with sweep
rowSums(sweep(df1[i1], 2, c(1, 8, 4), `==`))
Or another option is apply
df1$COUNT <- apply(df1[i1], 1, function(x) sum(x == c(1, 8, 4)))
NOTE: None of the solutions require any external package
data
df1 <- data.frame(Participant = 1:5, Task1 = c(4, 3, 1, 5, 1),
Task2 = c(8, 8, 3, 6, 8), Task3 = c(1, 7, 4, 4, 4))
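The reason for the col() indexing (or sweep) is that R recycles a short vector down the columns of a matrix, not across its rows. A small base R sketch of the difference, where key, tasks, naive and by_col are hypothetical names introduced for illustration:

```r
df1 <- data.frame(Participant = 1:5, Task1 = c(4, 3, 1, 5, 1),
                  Task2 = c(8, 8, 3, 6, 8), Task3 = c(1, 7, 4, 4, 4))

key   <- c(1, 8, 4)  # correct answers for Task1, Task2, Task3
tasks <- as.matrix(df1[startsWith(names(df1), "Task")])

# Naive comparison: key is recycled down each column, so answers are
# compared against the wrong key element and the counts are wrong
naive <- rowSums(tasks == key)

# Transposing first puts one participant per column, so recycling lines
# up key[1], key[2], key[3] with Task1, Task2, Task3
by_col <- rowSums(t(t(tasks) == key))
```

Here by_col reproduces the COUNT column from the question, while naive does not.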
We can use pmap_int from purrr to count number of correct answers.
library(dplyr)
df1 %>% mutate(COUNT = purrr::pmap_int(select(., starts_with('Task')),
~sum(c(...) == c(1, 8, 4))))
# Participant Task1 Task2 Task3 COUNT
#1 1 4 8 1 1
#2 2 3 8 7 1
#3 3 1 3 4 2
#4 4 5 6 4 1
#5 5 1 8 4 3
Another option is to get data in long format, calculate the number of correct answers for each Participant and join the data back.
df1 %>%
tidyr::pivot_longer(cols = starts_with('Task')) %>%
group_by(Participant) %>%
summarise(COUNT = sum(value == c(1, 8, 4))) %>%
left_join(df1, by = 'Participant')
I am trying to use the values from a column to extract column numbers in a data frame. My problem is similar to this topic in r-bloggers. Copying the script here:
df <- data.frame(x = c(1, 2, 3, 4),
y = c(5, 6, 7, 8),
choice = c("x", "y", "x", "z"),
stringsAsFactors = FALSE)
However, instead of having column names in choice, I have column index number, such that my data frame looks like this:
df <- data.frame(x = c(1, 2, 3, 4),
y = c(5, 6, 7, 8),
choice = c(1, 2, 1, 3),
stringsAsFactors = FALSE)
I tried using this solution:
df$newValue <-
df[cbind(
seq_len(nrow(df)),
match(df$choice, colnames(df))
)]
Instead of giving me an output that looks like this:
# x y choice newValue
# 1 1 5 1 1
# 2 2 6 2 6
# 3 3 7 1 3
# 4 4 8 3 NA
My newValue column returns all NAs.
# x y choice newValue
# 1 1 5 1 NA
# 2 2 6 2 NA
# 3 3 7 1 NA
# 4 4 8 3 NA
What should I modify in the code so that it would read my choice column as column index?
Since choice already holds column numbers, match() is not needed here. However, choice is itself a column of the data and should not be a valid target, so any index outside the range of the value columns has to be turned into NA before subsetting the data frame.
mat <- cbind(seq_len(nrow(df)), df$choice)
mat[mat[, 2] > (ncol(df) -1), ] <- NA
df$newValue <- df[mat]
df
# x y choice newValue
#1 1 5 1 1
#2 2 6 2 6
#3 3 7 1 3
#4 4 8 3 NA
data
df <- data.frame(x = c(1, 2, 3, 4),
y = c(5, 6, 7, 8),
choice = c(1, 2, 1, 3))
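A variant of the same idea, sketched in base R, names the value columns explicitly instead of relying on ncol(df) - 1, which may be easier to adapt if the data frame has other non-value columns. The names value_cols and idx are hypothetical:

```r
df <- data.frame(x = c(1, 2, 3, 4),
                 y = c(5, 6, 7, 8),
                 choice = c(1, 2, 1, 3))

value_cols <- c("x", "y")  # columns that choice is allowed to point at

# Keep an index only if it refers to one of the value columns, else NA
idx <- ifelse(df$choice %in% seq_along(value_cols), df$choice, NA)

# Matrix indexing with (row, column) pairs; a row containing NA yields NA
df$newValue <- as.matrix(df[value_cols])[cbind(seq_len(nrow(df)), idx)]
df
```

This relies on the documented behaviour that an index matrix row containing NA produces NA in the result.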
I have the following problem: in a data frame I have a lot of rows and columns, with the first column being the date. For each date I have more than 1 observation and I want to summarize them.
My df looks like that (date replaced by ID for ease of use):
df:
ID Cash Price Weight ...
1 0.4 0 0
1 0.2 0 82 ...
1 0 1 0 ...
1 0 3.2 80 ...
2 0.3 1 70 ...
... ... ... ... ...
I want to group them by the first column and then summarize all rows BUT with different functions:
The function applied to Cash and Price should be sum, so I get the sum of Cash and Price for each ID. The function applied to Weight should be max, so I only get the maximum weight for the ID.
Because I have so many columns I cannot write all the functions by hand, but only 2 columns should be summarized by max; the rest should be summarized by sum.
So I am looking for a way to group by ID and summarize everything with sum, except for 2 columns for which I need the max value.
I tried to use the dplyr package with:
df %>% group_by(ID = tolower(ID)) %>% summarise_each(funs(sum))
But I need to apply max rather than sum to the 2 specified columns. Any ideas?
To be clear, the output of the example df should be:
ID Cash Price Weight
1 0.6 4.2 82
2 0.3 1 70
As of dplyr 1.0.0 you can use across():
tribble(
~ID, ~max1, ~max2, ~sum1, ~sum2, ~sum3,
1, 1, 1, 1, 2, 3,
1, 2, 3, 1, 2, 3,
2, 1, 1, 1, 2, 3,
2, 3, 4, 2, 3, 4,
3, 1, 1, 1, 2, 3,
3, 4, 5, 3, 4, 5,
3, NA, NA, NA, NA, NA
) %>%
group_by(ID) %>%
summarize(
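# lambdas are used because passing extra arguments such as na.rm through
# across(...) is deprecated as of dplyr 1.1.0
across(matches("max1|max2"), ~ max(.x, na.rm = TRUE)),
across(!matches("max1|max2"), ~ sum(.x, na.rm = TRUE))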
)
# ID max1 max2 sum1 sum2 sum3
# 1 2 3 2 4 6
# 2 3 4 3 5 7
# 3 4 5 4 6 8
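Applied to the columns from the question, the across() pattern might look like the sketch below. It assumes dplyr >= 1.0.0 and rebuilds the five example rows shown in the question; max_cols and out are hypothetical names, and max_cols is meant to hold whichever columns need max:

```r
library(dplyr)

df <- tibble(ID     = c(1, 1, 1, 1, 2),
             Cash   = c(0.4, 0.2, 0, 0, 0.3),
             Price  = c(0, 0, 1, 3.2, 1),
             Weight = c(0, 82, 0, 80, 70))

max_cols <- c("Weight")  # columns summarised with max; all others use sum

out <- df %>%
  group_by(ID) %>%
  summarise(across(all_of(max_cols), max),
            across(!all_of(max_cols), sum))
out
```

Note that the max columns come first in the result; a final select() or relocate() can restore the original column order if that matters.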
We can use
df %>%
group_by(ID) %>%
summarise(Cash = sum(Cash), Price = sum(Price), Weight = max(Weight))
If we have many columns, one way would be to do this separately and then join the output together.
df1 <- df %>%
group_by(ID) %>%
summarise_each(funs(sum), Cash:Price)
df2 <- df %>%
group_by(ID) %>%
summarise_each(funs(max), Weight)
inner_join(df1, df2, by = "ID")
# ID Cash Price Weight
# (int) (dbl) (dbl) (int)
#1 1 0.6 4.2 82
#2 2 0.3 1.0 70
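Or do it without grouping twice. Note that in the output below the id column is summed along with cash and price (3 and 6 instead of 1 and 2), apparently because select(do_df, -weight) still contains id: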
library(dplyr)
set.seed(1492)
df <- data.frame(id=rep(c(1,2), 3),
cash=rnorm(6, 0.5, 0.1),
price=rnorm(6, 0.5, 0.1)*6,
weight=sample(100, 6))
df
## id cash price weight
## 1 1 0.4410152 2.484082 10
## 2 2 0.4101343 3.032529 93
## 3 1 0.3375889 2.305076 58
## 4 2 0.6047922 3.248851 55
## 5 1 0.4721711 3.209930 34
## 6 2 0.5362493 2.331530 99
custom_summarise <- function(do_df) {
return(bind_cols(
summarise_each(select(do_df, -weight), funs(sum)),
summarise_each(select(do_df, weight), funs(max))
))
}
group_by(df, id) %>% do(custom_summarise(.))
## Source: local data frame [2 x 4]
## Groups: id [2]
##
## id cash price weight
## (dbl) (dbl) (dbl) (int)
## 1 3 1.250775 7.999089 58
## 2 6 1.551176 8.612910 99
library(data.table)
setDT(df)
df[,.(Cash = sum(Cash),Price = sum(Price),Weight = max(Weight)),by=ID]
One way of doing this for +90 columns can be:
max_col <- 'Weight'
sum_col <- setdiff(colnames(df), c(max_col, 'ID')) # exclude the grouping column ID
query_1 <- paste0(sum_col,' = sum(',sum_col,')')
query_2 <- paste0(max_col,' = max(',max_col,')')
query_3 <- paste(query_1,collapse=',')
query_4 <- paste(query_2,collapse=',')
query_5 <- paste(query_3,query_4,sep=',')
final_query <- paste0('df[,.(',query_5,'),by = ID]')
eval(parse(text = final_query))
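The same result can probably be reached without building a string for eval(parse()), by applying the functions to subsets of .SD. This is a sketch, assuming the data.table package is installed; dt, max_cols, sum_cols and res are hypothetical names, and the rows are the five example rows from the question:

```r
library(data.table)

dt <- data.table(ID     = c(1, 1, 1, 1, 2),
                 Cash   = c(0.4, 0.2, 0, 0, 0.3),
                 Price  = c(0, 0, 1, 3.2, 1),
                 Weight = c(0, 82, 0, 80, 70))

max_cols <- "Weight"
sum_cols <- setdiff(names(dt), c("ID", max_cols))  # everything else is summed

# Within each ID group, sum one set of columns and take the max of the other
res <- dt[, c(lapply(.SD[, sum_cols, with = FALSE], sum),
              lapply(.SD[, max_cols, with = FALSE], max)),
          by = ID]
res
```

Only the two character vectors need to change when the set of columns grows, which keeps the approach workable for a wide table.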
Here is a solution based on this comment on an issue on the dplyr repo. I think it's general enough to be applied to more complicated cases.
library(tidyverse)
df <- tribble(
~ID, ~Cash, ~Price, ~Weight,
#----------------------
'a', 4, 6, 8,
'a', 7, 3, 0,
'a', 7, 9, 0,
'b', 2, 8, 8,
'b', 5, 1, 8,
'b', 8, 0, 1,
'c', 2, 1, 1,
'c', 3, 8, 0,
'c', 1, 9, 1
)
out <- list(.vars=lst(vars(-Weight), vars(Weight)),
.funs=lst(sum, max))%>%
pmap(~df%>%group_by(ID)%>%summarise_at(.x, .y)) %>%
reduce(inner_join)
out
# A tibble: 3 x 4
# ID Cash Price Weight
# <chr> <dbl> <dbl> <dbl>
# 1 a 18 18 8
# 2 b 15 9 8
# 3 c 6 18 1
You should specify the vars in the first lst (e.g. vars(-Weight), vars(Weight)) and the respective functions to be applied in the second lst (sum, max). The .x in the summarise_at call refers to the elements of the variable lst, and .y refers to the elements of the function lst.