I want to combine two records of the data frame "df", with IDs "A" and "B", each of which lacks some data (NA), into one row with ID "C" (goal). I know matrix indexing with [ , ] can do this kind of work, but in the data frame no row number is available.
Below is my data.
df
ID Y1 Y2 Y3 Y4 Y5 Y6
A 7 4 NA NA NA NA
B NA NA 5 5 4 4
goal:
ID Y1 Y2 Y3 Y4 Y5 Y6
C 7 4 5 5 4 4
We can use
df1 %>%
summarise(ID = 'C', across(where(is.numeric), na.omit))
# ID Y1 Y2 Y3 Y4 Y5 Y6
#1 C 7 4 5 5 4 4
data
df1 <- structure(list(ID = c("A", "B"), Y1 = c(7L, NA), Y2 = c(4L, NA
), Y3 = c(NA, 5L), Y4 = c(NA, 5L), Y5 = c(NA, 4L), Y6 = c(NA,
4L)), class = "data.frame", row.names = c(NA, -2L))
We could use adorn_totals from the janitor package (the new row is labelled "Total" rather than "C"):
library(dplyr)
library(janitor)
df1 %>%
adorn_totals("row") %>%
slice(3)
Output:
ID Y1 Y2 Y3 Y4 Y5 Y6
Total 7 4 5 5 4 4
Does this work:
as.data.frame(cbind(ID = 'C',t(apply(df[-1], 2, sum, na.rm = TRUE))))
ID Y1 Y2 Y3 Y4 Y5 Y6
1 C 7 4 5 5 4 4
Some base R options
colSums
> cbind(ID = "C", data.frame(t(colSums(df[-1], na.rm = TRUE))))
ID Y1 Y2 Y3 Y4 Y5 Y6
1 C 7 4 5 5 4 4
na.omit + list2DF
> list2DF(c(ID = "C", Map(na.omit, df[-1])))
ID Y1 Y2 Y3 Y4 Y5 Y6
1 C 7 4 5 5 4 4
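The sum-based answers above give the expected row only because each column holds a single non-missing value; if both rows had a value in the same column, the two values would be added together. A minimal base R sketch that instead keeps the first non-missing value per column (the helper name first_non_na is made up for illustration):

```r
# Example data from the question
df <- data.frame(ID = c("A", "B"),
                 Y1 = c(7L, NA), Y2 = c(4L, NA), Y3 = c(NA, 5L),
                 Y4 = c(NA, 5L), Y5 = c(NA, 4L), Y6 = c(NA, 4L))

# Hypothetical helper: first non-missing value of a vector (NA if there is none)
first_non_na <- function(x) x[!is.na(x)][1]

# Apply it to every column except ID, then prepend the new ID
res <- cbind(ID = "C", as.data.frame(lapply(df[-1], first_non_na)))
res
```

Whether "first non-missing" or "sum" is the right rule depends on what overlapping values would mean in your data.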
If you have pairs of consecutive rows that you want to coalesce into one another, you can follow this simple strategy:
df <- structure(list(ID = c("A", "B", "C", "E"), Y1 = c(7L, NA, NA,
7L), Y2 = c(4L, NA, 5L, NA), Y3 = c(NA, 5L, NA, 5L), Y4 = c(NA,
5L, NA, 5L), Y5 = c(NA, 4L, 14L, NA), Y6 = c(NA, 4L, 5L, NA)), row.names = c(NA,
-4L), class = "data.frame")
df
#> ID Y1 Y2 Y3 Y4 Y5 Y6
#> 1 A 7 4 NA NA NA NA
#> 2 B NA NA 5 5 4 4
#> 3 C NA 5 NA NA 14 5
#> 4 E 7 NA 5 5 NA NA
library(dplyr)
df %>% group_by(ID = (row_number() + 1) %/% 2) %>%
summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))
#> # A tibble: 2 x 7
#> ID Y1 Y2 Y3 Y4 Y5 Y6
#> <dbl> <int> <int> <int> <int> <int> <int>
#> 1 1 7 4 5 5 4 4
#> 2 2 7 5 5 5 14 5
Created on 2021-05-30 by the reprex package (v2.0.0)
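The grouping key in the answer above comes from integer division: (row_number() + 1) %/% 2 maps rows 1 and 2 to group 1, rows 3 and 4 to group 2, and so on. A small base R sketch of the same idea, assuming the rows to be merged are always adjacent (pair_id is a name chosen here for illustration, and only three of the Y columns are used to keep it short):

```r
df <- data.frame(ID = c("A", "B", "C", "E"),
                 Y1 = c(7L, NA, NA, 7L),
                 Y2 = c(4L, NA, 5L, NA),
                 Y3 = c(NA, 5L, NA, 5L))

# Rows 1-2 get key 1, rows 3-4 get key 2, ...
pair_id <- (seq_len(nrow(df)) + 1) %/% 2
pair_id

# Sum each pair column-wise, ignoring NAs
res <- aggregate(df[-1], by = list(pair = pair_id), FUN = sum, na.rm = TRUE)
res
```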
Related
I have a lot of columns in 1 dataframe that identify different timepoints of the same variable. Basically, within my data, if there's no response at timepoint X-1, there will be no response at time point X or beyond (after an NA appears in a row, it will continue). I currently have a column that shows which row the last response came from and what that response is. The dataframe currently looks like this:
id X1 X2 X3 X4 X_final X_final_location
1 1 5 5 6 NA 6 X3
2 2 4 NA NA NA 4 X1
3 3 7 1 3 5 5 X4
4 4 8 2 4 2 2 X4
5 5 1 5 NA NA 5 X2
6 6 5 7 7 7 7 X4
My goal is to be able to conduct a regression using the last response of each row as the outcome variable. However, I don't want it to repeat twice in the "X_final" column and also in the column that the response actually comes from. Therefore, I am hoping to find a way to put a "." in for the cell where that value originally came from so it looks like this:
id X1 X2 X3 X4 X_final X_final_location
1 1 5 5 6 NA 6 X3
2 2 . <NA> NA NA 4 X1
3 3 7 1 3 5 5 X4
4 4 8 2 4 2 2 X4
5 5 1 . NA NA 5 X2
6 6 5 7 7 7 7 X4
Any suggestions would be appreciated - thank you!
Another method, since you already have the locations in $X_final_location. As mentioned in the question comments, NA values are preferable to "." if the goal is regression analysis, because they keep the columns numeric.
data_orig <- data.frame(
id = c(1, 2, 3, 4, 5, 6),
X1 = c(5, 4, 7, 8, 1, 5),
X2 = c(5, NA, 1, 2, 5, 7),
X3 = c(6, NA, 3, 4, NA, 7),
X4 = c(NA, NA, 5, 2, NA, 7),
X_final = c(6, 4, 5, 2, 5, 7),
X_final_location = c("X3", "X1", "X4", "X4", "X2", "X4")
)
data_new <- data_orig
for (i in seq_len(nrow(data_new))) {
data_new[i, data_new$X_final_location[i]] <- NA
}
data_new
# id X1 X2 X3 X4 X_final X_final_location
# 1 1 5 5 NA NA 6 X3
# 2 2 NA NA NA NA 4 X1
# 3 3 7 1 3 NA 5 X4
# 4 4 8 2 4 NA 2 X4
# 5 5 1 NA NA NA 5 X2
# 6 6 5 7 7 NA 7 X4
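The loop above can also be written without explicit iteration by using a two-column (row, column) index matrix. This is only a sketch on the same data; x_cols, m and idx are illustrative names:

```r
data_orig <- data.frame(
  id = c(1, 2, 3, 4, 5, 6),
  X1 = c(5, 4, 7, 8, 1, 5),
  X2 = c(5, NA, 1, 2, 5, 7),
  X3 = c(6, NA, 3, 4, NA, 7),
  X4 = c(NA, NA, 5, 2, NA, 7),
  X_final = c(6, 4, 5, 2, 5, 7),
  X_final_location = c("X3", "X1", "X4", "X4", "X2", "X4")
)

x_cols <- c("X1", "X2", "X3", "X4")
m <- as.matrix(data_orig[x_cols])

# One (row, column) pair per row: the cell that X_final was taken from
idx <- cbind(seq_len(nrow(m)), match(data_orig$X_final_location, x_cols))
m[idx] <- NA

# Write the modified columns back into a copy of the original data
data_new <- data_orig
data_new[x_cols] <- as.data.frame(m)
data_new
```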
One way to do this (NA instead of . to preserve the data type):
match finds the first NA position, and replace sets the value in that position - 1 (the previous one) to NA. Rows that contain no NA are left unchanged, because match then returns NA.
apply(data, 1, \(x) ...) applies that function to each row. Finally, t transposes the result (since apply returns the per-row results as columns).
data = data.frame(id = 1:6, X1 = c(5L, 4L, 7L, 8L, 1L, 5L), X2 = c(5L,
NA, 1L, 2L, 5L, 7L), X3 = c(6L, NA, 3L, 4L, NA, 7L), X4 = c(NA,
NA, 5L, 2L, NA, 7L), X_final = c(6L, 4L, 5L, 2L, 5L, 7L), X_final_location = c("X3",
"X1", "X4", "X4", "X2", "X4"))
data[,2:5] <- t(apply(data[,2:5], 1 , function(x) replace(x, match(NA, x) - 1, NA)))
data
#> id X1 X2 X3 X4 X_final X_final_location
#> 1 1 5 5 NA NA 6 X3
#> 2 2 NA NA NA NA 4 X1
#> 3 3 7 1 3 5 5 X4
#> 4 4 8 2 4 2 2 X4
#> 5 5 1 NA NA NA 5 X2
#> 6 6 5 7 7 7 7 X4
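To see what the per-row function does, it can help to run it on single vectors. A short sketch (drop_last is a name made up here) with one row that has trailing NAs and one that has none:

```r
# Same function as in the apply() call, given a name for illustration
drop_last <- function(x) replace(x, match(NA, x) - 1, NA)

with_na <- c(5, 5, 6, NA)
match(NA, with_na)   # position of the first NA
drop_last(with_na)   # the value just before that position becomes NA

no_na <- c(7, 1, 3, 5)
match(NA, no_na)     # NA, because there is no NA to find
drop_last(no_na)     # returned unchanged
```

So if the last response can sit in the final column, as in rows 3, 4 and 6 of the example, those rows would need a separate rule.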
Another way using split (grouping by row):
split(data, row.names(data)) <-
lapply(split(data, row.names(data)), \(x) replace(x, x$X_final_location, "."))
I know there is a simple solution to this problem, as I solved it a couple of months ago, but have since lost the relevant file, and cannot for the life of me work out how I did it.
My data is in a long form, where each row represents a participant's answer to one question, with all rows for one participant sharing a common participant ID - e.g.
ParticipantID Question Resp
1 Age x1
1 Gender x2
1 Education x3
1 Q1 x4
1 Q2 x5
...
2 Age y1
2 Gender y2
...
etc
I want to add new columns to the data to associate the various demographic values with each answer provided by a given participant. So in the example above, I would have a new column "Age" which would take the value x1 for all rows where ParticipantID = 1, y1 for all rows where ParticipantID = 2, etc., like so:
ParticipantID Question Resp Age Gender ...
1 Age x1 x1 x2
1 Gender x2 x1 x2
1 Education x3 x1 x2
1 Q1 x4 x1 x2
1 Q2 x5 x1 x2
...
2 Age y1 y1 y2
2 Gender y2 y1 y2
...
etc
Importantly, I can't just rotate the table from long to wide, because I need the study questions (represented as Q1, Q2, ... above) to remain in long form.
Any help that can be offered is greatly appreciated!
As long as each participant has the same questions in the same order, you can do
cbind(df, do.call(rbind, lapply(split(df, df$ParticipantID), function(x) {
setNames(as.data.frame(t(x[-1])[rep(2, nrow(x)),]), x[[2]])
})), row.names = NULL)
#> ParticipantID Question Resp Age Gender Education Q1 Q2
#> 1 1 Age x1 x1 x2 x3 x4 x5
#> 2 1 Gender x2 x1 x2 x3 x4 x5
#> 3 1 Education x3 x1 x2 x3 x4 x5
#> 4 1 Q1 x4 x1 x2 x3 x4 x5
#> 5 1 Q2 x5 x1 x2 x3 x4 x5
#> 6 2 Age y1 y1 y2 y3 y4 y5
#> 7 2 Gender y2 y1 y2 y3 y4 y5
#> 8 2 Education y3 y1 y2 y3 y4 y5
#> 9 2 Q1 y4 y1 y2 y3 y4 y5
#> 10 2 Q2 y5 y1 y2 y3 y4 y5
Data used
df <- structure(list(ParticipantID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L), Question = c("Age", "Gender", "Education", "Q1",
"Q2", "Age", "Gender", "Education", "Q1", "Q2"), Resp = c("x1",
"x2", "x3", "x4", "x5", "y1", "y2", "y3", "y4", "y5")), class = "data.frame",
row.names = c(NA, -10L))
df
#> ParticipantID Question Resp
#> 1 1 Age x1
#> 2 1 Gender x2
#> 3 1 Education x3
#> 4 1 Q1 x4
#> 5 1 Q2 x5
#> 6 2 Age y1
#> 7 2 Gender y2
#> 8 2 Education y3
#> 9 2 Q1 y4
#> 10 2 Q2 y5
Created on 2022-09-19 with reprex v2.0.2
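If only the demographic answers are needed as extra columns, or the question order may differ between participants, a lookup by ParticipantID avoids relying on row order. A minimal sketch, assuming the demographic questions are exactly Age, Gender and Education and that each participant answers each of them once (demo_q and answers are illustrative names):

```r
df <- data.frame(
  ParticipantID = rep(1:2, each = 5),
  Question = rep(c("Age", "Gender", "Education", "Q1", "Q2"), times = 2),
  Resp = c("x1", "x2", "x3", "x4", "x5", "y1", "y2", "y3", "y4", "y5")
)

demo_q <- c("Age", "Gender", "Education")  # assumed list of demographic questions

for (q in demo_q) {
  # Rows holding this demographic answer, one per participant
  answers <- df[df$Question == q, c("ParticipantID", "Resp")]
  # Look up each row's participant and copy that participant's answer
  df[[q]] <- answers$Resp[match(df$ParticipantID, answers$ParticipantID)]
}
df
```

The study questions (Q1, Q2, ...) stay in long form, and only the listed demographics become columns.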
In the following data frame, I want to collect the members of B1 whose value in B2 is equal to or greater than the value of "b" in B2 within the same ID. Then, from this filtered data, I want to count how many times each B1 member occurs.
dataframe:
ID B1 B2
z1 a 2.5
z1 b 1.7
z1 c 170
z1 c 9
z1 d 3
y2 a 0
y2 b 21
y2 c 15
y2 c 101
y2 d 30
y2 d 3
y2 d 15.5
x3 a 30.8
x3 a 54
x3 a 0
x3 b 30.8
x3 c 30.8
x3 d 7
so the result would be:
ID B1 B2
z1 a 2.5
z1 c 170
z1 c 9
z1 d 3
y2 c 101
y2 d 30
x3 a 30.8
x3 a 54
x3 c 30.8
and
ID B1 count
z1 a 1
z1 c 2
z1 d 1
y2 a 0
y2 c 1
y2 d 1
x3 a 2
x3 c 1
x3 d 0
Grouped by 'ID', filter the rows where 'B2' is greater than or equal to the 'B2' value where 'B1' is 'b', and add a second condition that 'B1' is not equal to 'b'.
library(dplyr)
out1 <- df1 %>%
group_by(ID) %>%
filter(any(B1 == "b") & B2 >= min(B2[B1 == "b"]), B1 != 'b')
-output
> out1
# A tibble: 9 × 3
# Groups: ID [3]
ID B1 B2
<chr> <chr> <dbl>
1 z1 a 2.5
2 z1 c 170
3 z1 c 9
4 z1 d 3
5 y2 c 101
6 y2 d 30
7 x3 a 30.8
8 x3 a 54
9 x3 c 30.8
For the second output, group by 'B1' as well, use summarise to get the number of rows, and then fill in the missing combinations with complete.
library(tidyr)
out1 %>%
group_by(B1, .add = TRUE) %>%
summarise(count = n(), .groups = "drop_last") %>%
complete(B1 = unique(.$B1), fill = list(count = 0)) %>%
ungroup
# A tibble: 9 × 3
ID B1 count
<chr> <chr> <int>
1 x3 a 2
2 x3 c 1
3 x3 d 0
4 y2 a 0
5 y2 c 1
6 y2 d 1
7 z1 a 1
8 z1 c 2
9 z1 d 1
data
df1 <- structure(list(ID = c("z1", "z1", "z1", "z1", "z1", "y2", "y2",
"y2", "y2", "y2", "y2", "y2", "x3", "x3", "x3", "x3", "x3", "x3"
), B1 = c("a", "b", "c", "c", "d", "a", "b", "c", "c", "d", "d",
"d", "a", "a", "a", "b", "c", "d"), B2 = c(2.5, 1.7, 170, 9,
3, 0, 21, 15, 101, 30, 3, 15.5, 30.8, 54, 0, 30.8, 30.8, 7)),
class = "data.frame", row.names = c(NA,
-18L))
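For comparison, the same two steps can be sketched in base R. This assumes every ID has exactly one "b" row, which holds for the example data; b_rows, thr, keep and counts are illustrative names:

```r
df1 <- data.frame(
  ID = rep(c("z1", "y2", "x3"), times = c(5, 7, 6)),
  B1 = c("a", "b", "c", "c", "d", "a", "b", "c", "c", "d", "d", "d",
         "a", "a", "a", "b", "c", "d"),
  B2 = c(2.5, 1.7, 170, 9, 3, 0, 21, 15, 101, 30, 3, 15.5,
         30.8, 54, 0, 30.8, 30.8, 7)
)

# Threshold for each row: the B2 value of the "b" row with the same ID
b_rows <- df1[df1$B1 == "b", ]
thr <- b_rows$B2[match(df1$ID, b_rows$ID)]

# Step 1: keep rows at or above the threshold, excluding the "b" rows themselves
keep <- df1[df1$B2 >= thr & df1$B1 != "b", ]

# Step 2: table() reports every ID/B1 combination, including those with zero rows
counts <- as.data.frame(table(ID = keep$ID, B1 = keep$B1), responseName = "count")
counts
```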
Using tidyverse:
library(tidyverse)
df %>%
group_by(ID) %>%
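# >= (not >) keeps ties with the "b" value; the "b" rows themselves are excluded
filter(B2 >= B2[B1 == "b"], B1 != "b") %>%
group_by(ID, B1) %>%
count(name = "count") %>%
as.data.frame()
#> ID B1 count
#> 1 x3 a 2
#> 2 x3 c 1
#> 3 y2 c 1
#> 4 y2 d 1
#> 5 z1 a 1
#> 6 z1 c 2
#> 7 z1 d 1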
Created on 2022-04-26 by the reprex package (v2.0.1)
I have a df that includes multiple columns; you can find my template below. I would like to reshape its columns into rows in R. I am sure it is possible with the tidyr::gather() function, but I cannot manage it.
If someone could help me I would be glad!
Best wishes
# Df I have
A1 A2 A3 A4 B1 B2 B3 B4 C1 C2 C3 C4 D1 D2 D3 D4
X1 X2 X3 X4 a b c d e f g h i j k l
Y1 Y2 Y3 Y4 m n o p
Z1 Z2 Z3 Z4 r s t u w v y z
# Df I would like to reshape
Col1 Col2 Col3 Col4
X1 X2 X3 X4 a b c d
X1 X2 X3 X4 e f g h
X1 X2 X3 X4 i j k l
Y1 Y2 Y3 Y4 m n o p
Z1 Z2 Z3 Z4 r s t u
Z1 Z2 Z3 Z4 w v y z
We could also do this with a single pivot_longer
library(dplyr)
library(tidyr)
library(stringr)
df %>%
pivot_longer(cols = -id, names_to = c("grp", ".value"),
names_sep="(?<=[A-Z])(?=[0-9])", values_drop_na = TRUE) %>%
select(-grp) %>%
rename_at(-1, ~ str_c('Col', .))
# A tibble: 7 x 5
# id Col1 Col2 Col3 Col4
# <int> <chr> <chr> <chr> <chr>
#1 1 a b c d
#2 1 e f g h
#3 1 i j k l
#4 2 m n o p
#5 2 q <NA> <NA> <NA>
#6 3 r s t u
#7 3 w v y z
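The names_sep pattern "(?<=[A-Z])(?=[0-9])" is a zero-width match: it marks the position between a capital letter and a digit, so a name such as "A1" is split into the group "A" and the new column name "1". A small base R sketch of where that position falls (nms, pos, grp and value are illustrative names):

```r
nms <- c("A1", "B2", "D4")

# Start of the zero-width match, i.e. the index of the first digit
pos <- as.integer(regexpr("(?<=[A-Z])(?=[0-9])", nms, perl = TRUE))

grp   <- substr(nms, 1, pos - 1)       # part before the split point
value <- substr(nms, pos, nchar(nms))  # part after the split point

data.frame(name = nms, grp = grp, value = value)
```

In the pivot_longer call, the second piece is mapped to ".value", which is why the digits end up as the output column names.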
data
df <- structure(list(id = 1:3, A1 = c("a", "m", "r"), A2 = c("b", "n",
"s"), A3 = c("c", "o", "t"), A4 = c("d", "p", "u"), B1 = c("e",
"q", "w"), B2 = c("f", NA, "v"), B3 = c("g", NA, "y"), B4 = c("h",
NA, "z"), C1 = c("i", NA, NA), C2 = c("j", NA, NA), C3 = c("k",
NA, NA), C4 = c("l", NA, NA), D1 = c(NA, NA, NA), D2 = c(NA,
NA, NA), D3 = c(NA, NA, NA), D4 = c(NA, NA, NA)), class = "data.frame",
row.names = c("1",
"2", "3"))
I bet there are more elegant solutions, but this one uses tidyr and dplyr:
Suppose your data looks like
> df
# A tibble: 3 x 17
id A1 A2 A3 A4 B1 B2 B3 B4 C1 C2 C3 C4 D1 D2 D3 D4
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 a b c d e f g h i j k l NA NA NA NA
2 2 m n o p q NA NA NA NA NA NA NA NA NA NA NA
3 3 r s t u w v y z NA NA NA NA NA NA NA NA
I replaced your X1 X2 X3 X4, ... with an indexing column and added a q in column B1.
Using
df %>%
pivot_longer(cols=matches("\\d$"),
names_to = c("set"),
names_pattern = ".(.)") %>%
pivot_wider(names_from="set",
names_prefix="Col",
values_fn = list) %>%
unnest(matches("\\d$")) %>%
rowwise() %>%
filter(sum(is.na(c_across(matches("\\d$")))) != ncol(.) - 1) # -1 because of the indexing column
returns
# A tibble: 7 x 5
# Rowwise:
id Col1 Col2 Col3 Col4
<dbl> <chr> <chr> <chr> <chr>
1 1 a b c d
2 1 e f g h
3 1 i j k l
4 2 m n o p
5 2 q NA NA NA
6 3 r s t u
7 3 w v y z
I have the following data
> df
X1 X2 X3
1 3 4
1 0 0
1 1 0
and I want to merge all the columns so that the final output will be
new colName
1 X1
1 X1
1 X1
3 X2
0 X2
1 X2
4 X3
0 X3
0 X3
You can try stack
> setNames(stack(df),c("new","colName"))
new colName
1 1 X1
2 1 X1
3 1 X1
4 3 X2
5 0 X2
6 1 X2
7 4 X3
8 0 X3
9 0 X3
Data
> dput(df)
structure(list(X1 = c(1L, 1L, 1L), X2 = c(3L, 0L, 1L), X3 = c(4L,
0L, 0L)), class = "data.frame", row.names = c(NA, -3L))
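What stack does here can be spelled out by hand: unlist puts the columns one after another, and rep repeats each column name once per row. A short sketch on the same data (res is an illustrative name):

```r
df <- data.frame(X1 = c(1L, 1L, 1L), X2 = c(3L, 0L, 1L), X3 = c(4L, 0L, 0L))

res <- data.frame(
  new = unlist(df, use.names = FALSE),        # all values, column by column
  colName = rep(names(df), each = nrow(df))   # matching column name for each value
)
res
```

One difference to be aware of: stack returns the column names as a factor, whereas this version keeps them as plain strings in recent R versions.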
library(tidyverse)
pivot_longer(df, X1:X3, names_to = "colName", values_to = "new") %>%
arrange(colName)
You can try gathering the column names with tidyr
library(tidyr)
X1 <- c(1,1,1)
X2 <- c(3,0,1)
X3 <- c(4,0,0)
df <- data.frame(X1, X2, X3)
df <- df %>%
gather(key = "colName", value = "new", X1, X2, X3)
print(df)
colName new
1 X1 1
2 X1 1
3 X1 1
4 X2 3
5 X2 0
6 X2 1
7 X3 4
8 X3 0
9 X3 0