How to rearrange dataframe to wide format? [duplicate] - r

I have a dataframe with 3 columns: participant ID, questionID, and a column indicating whether they gave the correct response (1) or not (0).
It looks like this:
> head(df)
# A tibble: 6 x 3
ID questionID correct
<dbl> <int> <dbl>
1 1 1 1
2 2 2 0
3 3 3 1
4 4 4 0
5 5 5 0
6 6 6 1
And can be recreated using:
library(tibble)
set.seed(0)
df <- tibble(ID = seq(1, 100, 1),
questionID = rep(seq(1, 10), 10),
correct = base::sample(c(0, 1), size = 100, replace = TRUE))
Now I would like each question to have its own column, with the ultimate goal of fitting a 2PL model. For that purpose the data should have 1 row per participant and 11 columns (ID and 10 question columns).
How do I achieve this?

You can use pivot_wider() from the tidyr package:
library(dplyr)  # for the %>% pipe
library(tidyr)
df %>%
pivot_wider(names_from = questionID,
values_from = correct,
names_prefix = "questionID_")
# A tibble: 100 x 11
ID questionID_1 questionID_2 questionID_3 questionID_4
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 NA NA NA
2 2 NA 0 NA NA
3 3 NA NA 0 NA
4 4 NA NA NA 1
5 5 NA NA NA NA
6 6 NA NA NA NA
7 7 NA NA NA NA
8 8 NA NA NA NA
9 9 NA NA NA NA
10 10 NA NA NA NA
# ... with 90 more rows, and 6 more variables: questionID_5 <dbl>,
# questionID_6 <dbl>, questionID_7 <dbl>, questionID_8 <dbl>,
# questionID_9 <dbl>, questionID_10 <dbl>
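The NA cells in the output above appear because every ID occurs only once in the sample data, so each participant has a value for a single question. As a minimal sketch, assuming instead 10 participants who each answer all 10 questions (the object names long and wide are made up for this illustration), the same call gives one fully populated row per participant:

```r
library(tidyr)

set.seed(0)
# Assumed layout: 10 participants x 10 questions, one row per answer
long <- data.frame(
  ID = rep(1:10, each = 10),
  questionID = rep(1:10, times = 10),
  correct = sample(c(0, 1), size = 100, replace = TRUE)
)

# One row per participant, one column per question
wide <- pivot_wider(long,
                    names_from = questionID,
                    values_from = correct,
                    names_prefix = "questionID_")

dim(wide)    # 10 rows, 11 columns
anyNA(wide)  # FALSE, because every participant answered every question
```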

Using data.table, you can use dcast():
library(data.table)
df <- data.frame(ID=c(1,2,3,4,5,6), questionID= c(1,22,13,4,35,8),correct=c(1,0,1,0,0,1))
df
ID questionID correct
1 1 1 1
2 2 22 0
3 3 13 1
4 4 4 0
5 5 35 0
6 6 8 1
setDT(df)
dcast(df,ID~questionID,value.var="correct")
ID 1 4 8 13 22 35
1: 1 1 NA NA NA NA NA
2: 2 NA NA NA NA 0 NA
3: 3 NA NA NA 1 NA NA
4: 4 NA 0 NA NA NA NA
5: 5 NA NA NA NA NA 0
6: 6 NA NA 1 NA NA NA
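# the NAs are in the reshaped result, not in df; to replace them, pass the value you want via fill
dcast(df, ID ~ questionID, value.var = "correct", fill = 0)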
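For completeness, base R can do the same reshaping without extra packages through stats::reshape(). This is only a sketch on a small made-up data frame (3 participants, 2 questions); the new columns are named after the value column plus the question number:

```r
# Small made-up long-format data: 3 participants, 2 questions each
df_long <- data.frame(
  ID = rep(1:3, each = 2),
  questionID = rep(1:2, times = 3),
  correct = c(1, 0, 0, 1, 1, 1)
)

# idvar identifies the rows, timevar supplies the new column suffixes
df_wide <- reshape(df_long,
                   idvar = "ID",
                   timevar = "questionID",
                   direction = "wide")

names(df_wide)  # "ID" "correct.1" "correct.2"
```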

Related

Keep first duplicate in a sequence across all sequences of numerical values and replace the remaining values with NA in R

I have the following dataset, where numerical values in column x are intertwined with NAs. I would like to keep the first instance of the numerical values across all numerical sequences and replace the remaining duplicated values in each sequence with NAs.
x = c(1,1,1,NA,NA,NA,3,3,3,NA,NA,1,1,1,NA)
data = data.frame(x)
> data
x
1 1
2 1
3 1
4 NA
5 NA
6 NA
7 3
8 3
9 3
10 NA
11 NA
12 1
13 1
14 1
15 NA
So that the final result should be:
> data
x
1 1
2 NA
3 NA
4 NA
5 NA
6 NA
7 3
8 NA
9 NA
10 NA
11 NA
12 1
13 NA
14 NA
15 NA
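I would appreciate some suggestions, ideally with dplyr. Thanks!
This simple solution seems to work as expected. Note that lag() here has to be dplyr::lag(); stats::lag() from base R is meant for time series and does not shift the values of a plain vector.
data$x[data$x == dplyr::lag(data$x)] <- NA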
> data
x
1 1
2 NA
3 NA
4 NA
5 NA
6 NA
7 3
8 NA
9 NA
10 NA
11 NA
12 1
13 NA
14 NA
15 NA
For those who want to stay within a dplyr workflow:
library(dplyr)
data %>%
as_tibble() %>%
mutate(x = na_if(x, lag(x)))
#> # A tibble: 15 × 1
#> x
#> <dbl>
#> 1 1
#> 2 NA
#> 3 NA
#> 4 NA
#> 5 NA
#> 6 NA
#> 7 3
#> 8 NA
#> 9 NA
#> 10 NA
#> 11 NA
#> 12 1
#> 13 NA
#> 14 NA
#> 15 NA
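If no dplyr dependency is wanted at all, the same comparison with the previous element can be sketched in base R. The helper names prev_equal and x_out are made up for this example; which() is used so that comparisons involving NA are simply skipped:

```r
x <- c(1, 1, 1, NA, NA, NA, 3, 3, 3, NA, NA, 1, 1, 1, NA)

# Compare every element (from the 2nd on) with its predecessor;
# the first element has no predecessor, so it is never a repeat
prev_equal <- c(FALSE, x[-1] == x[-length(x)])

# which() drops the NA comparisons and keeps only the TRUE positions
x_out <- x
x_out[which(prev_equal)] <- NA

x_out
```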

If statements for multiple columns in R

data <- data.frame(id=c(1,2,3,4,5,6,7),
q1=c(3,4,5,2,1,2,4),
q2=c(3,4,4,5,4,3,2),
q3=c(2,3,2,3,1,2,3),
q4=c(3,4,4,4,4,5,5))
For q1-q4, I would like to write a statement that says: if q1 = 1, generate q1_1 = 1; if q1 = 2, generate q1_2 = 2; if q1 = 3, generate q1_3 = 3; if q1 = 4, generate q1_4 = 4; and if q1 = 5, generate q1_5 = 5, and likewise for all of the questions in this dataset. I know I would probably need some sort of loop and maybe an if statement, but I am not really familiar with loops.
The output I am hoping to get looks like this (but with more columns for all the questions):
id q1 q2 q3 q4 q1_1 q1_2 q1_3 q1_4 q1_5
1 1 3 3 2 3 NA NA 3 NA NA
2 2 4 4 3 4 NA NA NA 4 NA
3 3 5 4 2 4 NA NA NA NA 5
4 4 2 5 3 4 NA 2 NA NA NA
5 5 1 4 1 4 1 NA NA NA NA
6 6 2 3 2 5 NA 2 NA NA NA
7 7 4 2 3 5 NA NA NA 4 NA
Any help is appreciated, thank you!
No loop necessary. Go long, then go wide, then join the original data.
library(tidyverse)
data <- tibble(id=c(1,2,3,4,5,6,7),
q1=c(3,4,5,2,1,2,4),
q2=c(3,4,4,5,4,3,2),
q3=c(2,3,2,3,1,2,3),
q4=c(3,4,4,4,4,5,5))
data |>
pivot_longer(-id) |>
mutate(name = paste(name, value, sep = "_")) |>
pivot_wider() |>
(\(d) left_join(data, d, by = "id"))()
#> # A tibble: 7 x 20
#> id q1 q2 q3 q4 q1_3 q2_3 q3_2 q4_3 q1_4 q2_4 q3_3 q4_4
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3 3 2 3 3 3 2 3 NA NA NA NA
#> 2 2 4 4 3 4 NA NA NA NA 4 4 3 4
#> 3 3 5 4 2 4 NA NA 2 NA NA 4 NA 4
#> 4 4 2 5 3 4 NA NA NA NA NA NA 3 4
#> 5 5 1 4 1 4 NA NA NA NA NA 4 NA 4
#> 6 6 2 3 2 5 NA 3 2 NA NA NA NA NA
#> 7 7 4 2 3 5 NA NA NA NA 4 NA 3 NA
#> # ... with 7 more variables: q1_5 <dbl>, q1_2 <dbl>, q2_5 <dbl>, q1_1 <dbl>,
#> # q3_1 <dbl>, q4_5 <dbl>, q2_2 <dbl>
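Since the question asked about loops, here is a base R sketch of the loop version for comparison. It assumes the answer values run from 1 to 5 and, unlike the pivot approach, it creates a column for every question/value pair, including pairs that never occur (those columns are all NA):

```r
data <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7),
                   q1 = c(3, 4, 5, 2, 1, 2, 4),
                   q2 = c(3, 4, 4, 5, 4, 3, 2),
                   q3 = c(2, 3, 2, 3, 1, 2, 3),
                   q4 = c(3, 4, 4, 4, 4, 5, 5))

for (q in paste0("q", 1:4)) {   # loop over the question columns
  for (v in 1:5) {              # loop over the assumed answer values
    # new column q_v holds v where the answer equals v, NA otherwise
    data[[paste0(q, "_", v)]] <- ifelse(data[[q]] == v, v, NA)
  }
}

data[, c("id", "q1", "q1_1", "q1_2", "q1_3", "q1_4", "q1_5")]
```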

How to delete a particular value from the whole dataframe in R? [duplicate]

I have a data frame that has been converted to z-scores. I want to replace with NA only those values that are greater than or equal to 4, without dropping any row or column. I would appreciate an answer.
You can use the following code:
df <- data.frame(v1 = c(1,3,6,7,3),
v2 = c(2,1,4,6,7),
v3 = c(1,2,3,4,5))
df
#> v1 v2 v3
#> 1 1 2 1
#> 2 3 1 2
#> 3 6 4 3
#> 4 7 6 4
#> 5 3 7 5
is.na(df) <- df >= 4
df
#> v1 v2 v3
#> 1 1 2 1
#> 2 3 1 2
#> 3 NA NA 3
#> 4 NA NA NA
#> 5 3 NA NA
Created on 2022-07-10 by the reprex package (v2.0.1)
You can simply use df[df >= 4] <- NA to achieve what you want.
df <- data.frame(replicate(10,sample(0:10,10,rep=TRUE)))
> df
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 2 3 4 5 6 4 3 1 10 6
2 5 7 0 4 3 10 10 3 6 10
3 5 5 0 3 1 3 5 7 2 7
4 7 0 4 1 10 0 5 2 5 0
5 8 8 7 8 4 6 6 10 10 0
6 1 4 1 3 3 8 8 0 4 8
7 6 3 3 6 7 4 10 9 7 2
8 2 1 4 0 7 8 10 1 6 3
9 0 9 6 2 9 6 2 9 0 3
10 8 2 1 0 1 4 0 6 2 8
df[df>=4] <- NA
> df
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 2 3 NA NA NA NA 3 1 NA NA
2 NA NA 0 NA 3 NA NA 3 NA NA
3 NA NA 0 3 1 3 NA NA 2 NA
4 NA 0 NA 1 NA 0 NA 2 NA 0
5 NA NA NA NA NA NA NA NA NA 0
6 1 NA 1 3 3 NA NA 0 NA NA
7 NA 3 3 NA NA NA NA NA NA 2
8 2 1 NA 0 NA NA NA 1 NA 3
9 0 NA NA 2 NA NA 2 NA 0 3
10 NA 2 1 0 1 NA 0 NA 2 NA
Here is one more, using replace_with_na_all() from the naniar package:
Use replace_with_na_all() when you want to replace ALL values that meet a condition across an entire dataset. The syntax here is a little different, and follows the rules for rlang’s expression of simple functions. This means that the function starts with ~, and when referencing a variable, you use .x.
https://cran.r-project.org/web/packages/naniar/vignettes/replace-with-na.html
library(naniar)
library(dplyr)
df %>%
replace_with_na_all(condition = ~.x >= 4)
v1 v2 v3
<dbl> <dbl> <dbl>
1 1 2 1
2 3 1 2
3 NA NA 3
4 NA NA NA
5 3 NA NA
Though the solution by @Quinten is very concise, here is another approach using the tidyverse:
library(dplyr)
set.seed(123)
df <- data.frame(
x = sample(1:10, 7),
y = sample(1:10, 7)
)
df %>%
mutate(
across(everything(), ~ if_else(.x >= 4, NA_integer_, .x))
)
#> x y
#> 1 3 NA
#> 2 NA NA
#> 3 2 1
#> 4 NA 2
#> 5 NA 3
#> 6 NA NA
#> 7 1 NA
Created on 2022-07-10 by the reprex package (v2.0.1)
In base R, we can use replace():
df <- replace(df, df >= 4, NA_real_)
Output
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 NA NA 3 NA 1 3 1 1 NA NA
2 1 NA 2 NA NA 3 NA NA 2 0
3 NA 1 NA 2 2 1 NA NA NA 1
4 NA NA 0 NA NA NA 0 2 NA NA
5 NA 1 NA 3 0 NA NA NA 2 3
6 0 3 NA 0 NA NA 1 1 NA 2
7 3 NA NA NA 2 2 NA 2 NA NA
8 NA 1 0 2 NA NA 2 NA NA NA
9 NA 3 NA 2 NA NA NA 0 1 3
10 1 3 NA 3 NA NA 3 NA NA NA
Or use replace in dplyr:
library(dplyr)
df %>%
mutate(across(everything(), ~ replace(.x, .x >= 4, NA_real_)))
Data
set.seed(321)
df <- data.frame(replicate(10, sample(0:10, 10, rep = TRUE)))
If the columns are numeric, another option is to raise NA to the power of the logical matrix (df >= 4): NA^TRUE is NA and NA^FALSE is 1. Multiplying the result by the original data then returns NA where the condition holds and the original value everywhere else.
NA^(df >= 4) * df
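As a quick sanity check, the base R variants shown in these answers can be compared on the small v1/v2/v3 example from the first answer. This is only a sketch; the object names by_index, by_is_na and by_power are made up here:

```r
df <- data.frame(v1 = c(1, 3, 6, 7, 3),
                 v2 = c(2, 1, 4, 6, 7),
                 v3 = c(1, 2, 3, 4, 5))

# 1) logical-matrix indexing
by_index <- df
by_index[by_index >= 4] <- NA

# 2) the is.na<- replacement function
by_is_na <- df
is.na(by_is_na) <- by_is_na >= 4

# 3) NA^TRUE is NA, NA^FALSE is 1, so the product masks the values
by_power <- NA^(df >= 4) * df

# all three variants should agree cell by cell
isTRUE(all.equal(by_index, by_is_na, check.attributes = FALSE))
isTRUE(all.equal(by_index, by_power, check.attributes = FALSE))
```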

How can I create lag data (all forward combinations) with data.table in R?

I need to create a dataframe with all possible combinations of a variable. I found an example using data.table that works like this:
df <- data.frame("Age"=1:10)
df <- setDT(df)
df[,lag.Age1 := c(NA,Age[-.N])]
That creates this:
Age lag.Age1
1: 1 NA
2: 2 1
3: 3 2
.. .. ..
10: 10 9
Now, I want to keep adding lagged vectors that produce something like this:
Age lag.Age1 lag.Age2 lag.Age3
1: 1 NA NA NA
2: 2 1 NA NA
3: 3 2 1 NA
.. .. .. .. ..
10: 10 9 8 7
I tried this for the third column:
df[,lag.Age2 := c(NA,NA,Age[1:8])]
But I really don't get how data.table works here. That line runs but it doesn't do anything.
EDIT: What if the dataframe has a group variable and I want the lag to be computed by group? For the first lag it is just:
df <- data.frame("Age"=1:10, "Group"=c(rep("A",4),rep("B",6)))
setDT(df)
df[,lag.Age1 := c(NA,Age[-.N]), by="Group"]
How would this work now? Note that the groups have different lengths.
data.table::shift() is very powerful because you can provide a vector of offsets. For example, if you want n lag columns (from 1 to n), you can do this:
n=3
cols = paste0("lag.Age",1:n)
df[, c(cols):=shift(Age,1:n), Group]
Output:
Age Group lag.Age1 lag.Age2 lag.Age3
<int> <char> <int> <int> <int>
1: 1 A NA NA NA
2: 2 A 1 NA NA
3: 3 A 2 1 NA
4: 4 A 3 2 1
5: 5 B NA NA NA
6: 6 B 5 NA NA
7: 7 B 6 5 NA
8: 8 B 7 6 5
9: 9 B 8 7 6
10: 10 B 9 8 7
Alternatively:
df[, c(paste0("lag.Age",1:3)):=shift(Age,1:3), Group]
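On the remark in the question that the := line "runs but doesn't do anything": := adds the column to the table by reference and returns its result invisibly, so nothing is printed even though the table has changed. A minimal sketch on the ungrouped example, using shift() for the second lag:

```r
library(data.table)

df <- data.table(Age = 1:10)

# := modifies df in place; the call itself prints nothing
df[, lag.Age2 := shift(Age, 2)]

# print the table explicitly to see the new column
df[]
```

Appending [] to the assignment, as in df[, lag.Age2 := shift(Age, 2)][], updates and prints in one step.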
If you want to have the number of lags vary by group, where the number equals the number of observations in that group-1, then one approach is to do this:
# make function to return lags based on length of x
f <- function(x) shift(x,1:(length(x)-1))
# get unique groups
grps= unique(df$Group)
# set as DT, and use lapply()
setDT(df)
grp_lags = lapply(grps, \(g) f(df[Group==g, Age]))
names(grp_lags)<-grps
Output:
$A
$A[[1]]
[1] NA 1 2 3
$A[[2]]
[1] NA NA 1 2
$A[[3]]
[1] NA NA NA 1
$B
$B[[1]]
[1] NA 5 6 7 8 9
$B[[2]]
[1] NA NA 5 6 7 8
$B[[3]]
[1] NA NA NA 5 6 7
$B[[4]]
[1] NA NA NA NA 5 6
$B[[5]]
[1] NA NA NA NA NA 5
Or, if you are okay with lots of extra columns (i.e. for the groups with fewer observations), you can do this:
n = df[, .N, Group][,max(N)]
cols = paste0("lag.Age",1:n)
df[, c(cols):=shift(Age,1:n), Group]
Output:
Age Group lag.Age1 lag.Age2 lag.Age3 lag.Age4 lag.Age5 lag.Age6
1: 1 A NA NA NA NA NA NA
2: 2 A 1 NA NA NA NA NA
3: 3 A 2 1 NA NA NA NA
4: 4 A 3 2 1 NA NA NA
5: 5 B NA NA NA NA NA NA
6: 6 B 5 NA NA NA NA NA
7: 7 B 6 5 NA NA NA NA
8: 8 B 7 6 5 NA NA NA
9: 9 B 8 7 6 5 NA NA
10: 10 B 9 8 7 6 5 NA

Running a for loop over a header in R

I am trying to switch from Stata to R and need help with a for loop.
Context:
I have data (a survey questionnaire) with 5 blocks of 10 questions each; B1B2 is the 2nd question of the first block. My rows are people, each of whom can be in only 1 block, so I have values for that block and NAs for the variables of the other blocks (e.g. a person in the 3rd block will have observations for B3B1-B3B10 and NA for B1B1-B1B10, B2B1-B2B10, etc.). I am trying to combine all the blocks into B1-B10. Here's the head of my data:
B1B1 B1B2 B1B3 B1B4 B1B5 B1B6 B1B7 B1B8 B1B9 B1B10 B2B1 B2B2 B2B3 B2B4 B2B5 B2B6 B2B7 B2B8 B2B9 B2B10
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA NA NA NA NA NA NA NA NA NA 1 2 2 2 2 1 2 1 1 2
2 NA NA NA NA NA NA NA NA NA NA 1 1 1 2 2 2 2 1 1 1
3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA NA NA 2 2 2 2 1 2 2 1 1 1
6 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
I got it working for 1 instance using the unite function:
data %>% unite("B1", B1B1,B2B1,B3B1,B4B1,B5B1, na.rm = TRUE, remove = FALSE) -> data
I want to loop this from B1 to B10 as such
for (i in (1:10)){ data %>% unite("paste0("B",i)", paste0("B1B",i),paste0("B2B",i),paste0("B3B",i),paste0("B4B",i),paste0("B5B",i), na.rm = TRUE, remove = FALSE) -> data}
but I'm getting an unexpected symbol error. I think I have a misunderstanding of how for loops work in R, and any explanation of why my code doesn't run would be greatly appreciated.
Here is my working Stata code if it helps:
forvalues i=1(1)10{
gen b`i'=B1B`i' if B1B`i' != .
replace b`i'=B2B`i' if B2B`i' != .
replace b`i'=B3B`i' if B3B`i' != .
replace b`i'=B4B`i' if B4B`i' != .
replace b`i'=B5B`i' if B5B`i' != .
}
Here is an idea. Split the dataframe into a list of questions and map over each element of the list.
Example data: Three Blocks with 2 Questions
df <- data.frame(B1B1 = c(1,2, rep(NA, 4)),
B1B2 = c(3,4, rep(NA, 4)),
B2B1 = c(NA,NA,5,6,NA,NA),
B2B2 = c(NA,NA,7,8,NA,NA),
B3B1 = c(rep(NA,4), 1,2),
B3B2 = c(rep(NA,4), 3,4))
B1B1 B1B2 B2B1 B2B2 B3B1 B3B2
1 1 3 NA NA NA NA
2 2 4 NA NA NA NA
3 NA NA 5 7 NA NA
4 NA NA 6 8 NA NA
5 NA NA NA NA 1 3
6 NA NA NA NA 2 4
Code:
library(tidyverse)
# group the columns by question suffix; "B\\d+$" also handles two-digit questions such as B10
split.default(df, str_extract(names(df), "B\\d+$")) %>%
map_df(~ coalesce(!!! .x))
Result:
# A tibble: 6 x 2
B1 B2
<dbl> <dbl>
1 1 3
2 2 4
3 5 7
4 6 8
5 1 3
6 2 4
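As for why the loop in the question fails: in "paste0("B",i)" the inner double quotes end the string early, so R sees "paste0(" followed by the bare symbol B, hence the unexpected symbol error. For readers who prefer a loop that mirrors the Stata code, here is a base R sketch on the example data above. The variable names n_blocks, n_questions and combined are made up for this illustration:

```r
# Example data from above: three blocks with two questions each
df <- data.frame(B1B1 = c(1, 2, rep(NA, 4)),
                 B1B2 = c(3, 4, rep(NA, 4)),
                 B2B1 = c(NA, NA, 5, 6, NA, NA),
                 B2B2 = c(NA, NA, 7, 8, NA, NA),
                 B3B1 = c(rep(NA, 4), 1, 2),
                 B3B2 = c(rep(NA, 4), 3, 4))

n_blocks <- 3
n_questions <- 2

for (i in seq_len(n_questions)) {
  combined <- rep(NA_real_, nrow(df))
  for (b in seq_len(n_blocks)) {
    # build the column name as a string, e.g. "B2B1", and look it up with [[ ]]
    block_col <- df[[paste0("B", b, "B", i)]]
    # keep values already found, fill the remaining gaps from this block
    combined <- ifelse(is.na(combined), block_col, combined)
  }
  df[[paste0("B", i)]] <- combined
}

df[, c("B1", "B2")]
```

The key difference from Stata macros is that column names built with paste0() are plain strings, so they are used with [[ ]] rather than pasted into the code itself.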
