R - cleaning data with repeated columns for different locations


Edited to make my data more similar to the data I'm working with, and to add an example of what I have tried.
I am working with a Qualtrics survey in which blocks of questions repeat based on answers to previous questions, using a survey-build feature called "loop and merge". I'm trying to pull out like questions and then use rbind so that each question shows up only once, as a single column. I have a basic example below; in my actual data, the repeats happen 36 times.
Example data frame:
capacity_1 <- data.frame("1_q1" = 1:4,
"1_q2" = c("a", "b", "c", "d"),
'1_q3' = 10:13,
'1_q4' = 100:103,
'1_q5' = 110:113,
'1_q6' = 11:14,
"2_q1" = 22:25,
"2_q2" = c("i", "j", "k", "l"),
'2_q3' = 20:23,
'2_q4' = 200:203,
'2_q5' = 210:213,
'2_q6' = 21:24,
"3_q1" = 90:93,
"3_q2" = c("p", "q", "r", "s"),
'3_q3' = 10:13,
'3_q4' = 300:303,
'3_q5' = 310:313,
'3_q6' = 31:34,check.names = FALSE)
Note that the "1_" at the start of "1_q1" is the county's reference number.
Here is what I could do, but it is inefficient, especially since my actual data repeats these questions 36 times:
dat_1 <- dat %>%
select(1:2) %>%
rename(q = 1:2) %>%
mutate("county" = 1)
dat_2 <- dat %>%
select(3:4) %>%
rename(q = 1:2) %>%
mutate("county" = 2)
dat_3 <- dat %>%
select(5:6) %>%
rename(q = 1:2)%>%
mutate("county" = 3)
dat_final <- rbind(dat_1, dat_2, dat_3)
The "dat_final" data frame is what I'd like the data to look like; the same result is written out explicitly here:
dat_clean <- data.frame("q1" = c(1:4, 22:25, 90:93),
"q2" = c("a", "b", "c", "d",
"i", "j", "k", "l",
"p", "q", "r", "s"),
"county" = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3))
Update: I tried the suggestion below and get the error "Error in set_names(): the size of 'nm' (6) must be compatible with the size of 'x' (2)":
do.call(
rbind,
lapply(seq(1,ncol(capacity_1),6), \(i) {
capacity_1 %>%
select(c(i,i+5)) %>%
rename_all(~c("capacity_outpatient", "capacity_inpatient", "capacity_housing",
"capacity_recovery", "capacity_demand", "capacity_notes")) %>%
mutate(county=(i+5)/6)
})
)
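The error message is consistent with select(c(i, i+5)) picking only two columns (the first and the last of each block) while six new names are supplied. A minimal sketch of one possible fix, assuming each county occupies six adjacent columns, is to select the whole range i:(i+5). The data frame capacity_demo and its values are made up for illustration:

```r
library(dplyr)

# Hypothetical stand-in for capacity_1: 3 counties x 6 questions, 2 rows
capacity_demo <- as.data.frame(matrix(1:36, nrow = 2))
names(capacity_demo) <- paste0(rep(1:3, each = 6), "_q", 1:6)

new_names <- c("capacity_outpatient", "capacity_inpatient", "capacity_housing",
               "capacity_recovery", "capacity_demand", "capacity_notes")

stacked <- do.call(
  rbind,
  lapply(seq(1, ncol(capacity_demo), by = 6), function(i) {
    capacity_demo %>%
      select(all_of(i:(i + 5))) %>%   # all six columns of this county, not just two
      setNames(new_names) %>%         # six names for six columns
      mutate(county = (i + 5) / 6)    # 1, 2, 3 for i = 1, 7, 13
  })
)

stacked
```

The only substantive change from the attempt above is the column range; setNames() is used here in place of rename_all() purely for brevity.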

You can do the following, which uses a seq from 1 to ncol(dat), by 2:
do.call(
rbind,
lapply(seq(1,ncol(dat),2), \(i) {
dat %>% select(c(i,i+1)) %>% rename_all(~c("q1","q2")) %>% mutate(county=(i+1)/2)
})
)
Output:
q1 q2 county
1 1 a 1
2 2 b 1
3 3 c 1
4 4 d 1
5 22 i 2
6 23 j 2
7 24 k 2
8 25 l 2
9 90 p 3
10 91 q 3
11 92 r 3
12 93 s 3
Another approach, with data.table
library(data.table)
setDT(dat)
rbindlist(lapply(seq(1,ncol(dat),2), \(i) {
setnames(dat[,i:(i+1)],c("q1","q2"))
}), use.names=F,idcol = "county")
Output:
county q1 q2
1: 1 1 a
2: 1 2 b
3: 1 3 c
4: 1 4 d
5: 2 22 i
6: 2 23 j
7: 2 24 k
8: 2 25 l
9: 3 90 p
10: 3 91 q
11: 3 92 r
12: 3 93 s

A solution using dplyr, purrr and stringr. This solution is not affected by the column order or by the number of q columns; it uses only the county prefix as the basis for processing the data.
library(dplyr)
library(purrr)
library(stringr)
dat <- data.frame("1_q1" = 1:4,
"1_q2" = c("a", "b", "c", "d"),
"2_q1" = 22:25,
"2_q2" = c("i", "j", "k", "l"),
"3_q1" = 90:93,
"3_q2" = c("p", "q", "r", "s"), check.names = FALSE)
# Indexes of the counties to extract from the data frame
county_index <- c("1", "2", "3")
# Function that takes a county index and extracts that county's columns from `dat`
edit_df <- function(index) {
dat %>%
# select columns that start with the index prefix; the "^" anchor keeps "1_" from also matching "11_" or "21_"
select(matches(paste0("^", index, "_"))) %>%
# remove the index prefix from the column names
rename_all(~ str_replace(., regex("^\\d+_", ignore_case = TRUE), "")) %>%
# add a county column holding the input index
mutate("county" = as.numeric(index))
}
Result using purrr::map_dfr
# apply edit_df to each county index and row-bind the results
dat_clean <- map_dfr(.x = county_index, .f = edit_df)
dat_clean
#> q1 q2 county
#> 1 1 a 1
#> 2 2 b 1
#> 3 3 c 1
#> 4 4 d 1
#> 5 22 i 2
#> 6 23 j 2
#> 7 24 k 2
#> 8 25 l 2
#> 9 90 p 3
#> 10 91 q 3
#> 11 92 r 3
#> 12 93 s 3
Created on 2022-05-25 by the reprex package (v2.0.1)
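The prefix-based idea can also be sketched as a single reshape with tidyr::pivot_longer, which none of the answers above use. This assumes every column name has the form "<county>_<question>" with exactly one underscore; the special ".value" entry in names_to tells pivot_longer to keep the question part as column names:

```r
library(dplyr)
library(tidyr)

dat <- data.frame("1_q1" = 1:4,
                  "1_q2" = c("a", "b", "c", "d"),
                  "2_q1" = 22:25,
                  "2_q2" = c("i", "j", "k", "l"),
                  "3_q1" = 90:93,
                  "3_q2" = c("p", "q", "r", "s"), check.names = FALSE)

dat_long <- dat %>%
  pivot_longer(
    cols = everything(),
    names_to = c("county", ".value"),            # prefix -> county, rest -> column names
    names_sep = "_",
    names_transform = list(county = as.integer)  # "1" -> 1L
  ) %>%
  arrange(county)                                # group rows by county, as in dat_clean

dat_long
```

If the question names themselves contain underscores, names_sep = "_" would no longer split cleanly; names_pattern with a regular expression would probably be the better fit in that case.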

Related

Merge three Variables to one and replicate observations

I have a Dataframe which looks like the following:
B <- data.frame(
nr=c(1,2,3,4,5),
A=c('a','b','c','d','e'),
B=c("s", "t", "i", "u", "z"),
B1=c("", "v", "", "", ""),
B2 =c("", "g", "", "", ""))
B <- B %>% mutate_all(na_if,"")
Since my variables B1 and B2 each have only one value, I would like to merge B1 and B2 into the variable B. This should create two new observations, replicating every other variable of that observation.
It should look like the following:
B <- data.frame(
nr=c(1,2,2, 2, 3,4,5),
A=c("a","b", "b", "b", "c","d","e"),
B=c("s", "v", "g", "t", "i", "u", "z"))
Thanks for your help!!

Reshape to 'long' format with pivot_longer on the 'B' columns and remove the NA values with values_drop_na = TRUE:
library(dplyr)
library(tidyr)
B %>%
pivot_longer(cols = starts_with("B"), values_to = "B",
values_drop_na = TRUE, names_to = NULL)
-output
# A tibble: 7 × 3
nr A B
<dbl> <chr> <chr>
1 1 a s
2 2 b t
3 2 b v
4 2 b g
5 3 c i
6 4 d u
7 5 e z
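For reference, here is a self-contained sketch of the whole pipeline. It converts empty strings to NA only in the character columns, on the assumption that a recent dplyr is installed: na_if() is strict about types there, so applying it with "" to the numeric nr column via mutate_all may fail. The name B_long is made up for illustration:

```r
library(dplyr)
library(tidyr)

B <- data.frame(
  nr = 1:5,
  A  = c("a", "b", "c", "d", "e"),
  B  = c("s", "t", "i", "u", "z"),
  B1 = c("", "v", "", "", ""),
  B2 = c("", "g", "", "", ""))

B_long <- B %>%
  # turn "" into NA in character columns only
  mutate(across(where(is.character), ~ na_if(.x, ""))) %>%
  # stack B, B1 and B2 into one column and drop the NA rows
  pivot_longer(cols = starts_with("B"), names_to = NULL,
               values_to = "B", values_drop_na = TRUE)

B_long
```

Row nr = 2 appears three times in the result, once for each non-missing value among B, B1 and B2.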

Assign a value to a column in R based on a percentage within each group

I need to create column C in a data frame where 30% of the rows within each group (column B) get a value 0.
How do I do this in R?

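We may use rbinom after grouping by the 'category' column, specifying prob as a vector of values. Note that prob is recycled across the rows of each group and the draws are random, so the share of zeros per group is not guaranteed to be exactly 30%.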
library(dplyr)
df1 %>%
group_by(category) %>%
mutate(value = rbinom(n(), 1, c(0.7, 0.3))) %>%
ungroup
-output
# A tibble: 9 x 3
sno category value
<int> <chr> <int>
1 1 A 1
2 2 A 0
3 3 A 1
4 4 B 1
5 5 B 0
6 6 B 1
7 7 C 1
8 8 C 0
9 9 C 0
data
df1 <- structure(list(sno = 1:9, category = c("A", "A", "A", "B", "B",
"B", "C", "C", "C")), class = "data.frame", row.names = c(NA,
-9L))

If your data already exist (assuming this is a simplified example), and if you want the value to be randomly assigned within each group:
library(dplyr)
d <- data.frame(sno = 1:9,
category = rep(c("A", "B", "C"), each = 3))
d %>%
group_by(category) %>%
mutate(value = sample(c(rep(1, floor(n()*.7)), rep(0, n() - floor(n()*.7)))))

Base R
set.seed(42)
d$value <- ave(
rep(0, nrow(d)), d$category,
FUN = function(z) sample(0:1, size = length(z), prob = c(0.3, 0.7), replace = TRUE)
)
d
# sno category value
# 1 1 A 0
# 2 2 A 0
# 3 3 A 1
# 4 4 B 0
# 5 5 B 1
# 6 6 B 1
# 7 7 C 0
# 8 8 C 1
# 9 9 C 1
Data copied from Brigadeiro's answer:
d <- structure(list(sno = 1:9, category = c("A", "A", "A", "B", "B", "B", "C", "C", "C")), class = "data.frame", row.names = c(NA, -9L))
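The answers above differ in one respect worth spelling out: rbinom and sample(..., prob = ) make each row 0 with some probability, so the share of zeros in a group varies from run to run, while shuffling a vector with a fixed number of zeros gives an exact share. A minimal base R sketch of the fixed-count variant, using made-up data with 10 rows per group and a hypothetical helper exact_zeros:

```r
set.seed(1)
d <- data.frame(sno = 1:30,
                category = rep(c("A", "B", "C"), each = 10))

# Build round(p * n) zeros and the rest ones for a group, then shuffle them
exact_zeros <- function(z, p = 0.3) {
  n0 <- round(length(z) * p)
  sample(rep(c(0, 1), times = c(n0, length(z) - n0)))
}

# ave() applies exact_zeros within each category and keeps the row order
d$value <- ave(rep(0, nrow(d)), d$category, FUN = exact_zeros)

# Number of zeros per group
tapply(d$value == 0, d$category, sum)
```

With 10 rows per group this yields 3 zeros in every group. When the group size times 0.3 is not a whole number, the count is rounded, so the share is only approximately 30%.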

How to find the highest value in a row which is not a distinct variable

I have this dataframe
mydf <- structure(list(POS = c("1", "2", "3", "4"), A = c("10", "10",
"6", "1"), C = c("1", "8", "2", "7"), T = c("6", "2", "10", "8"
), G = c("0", "0", "2", "11"), Ref = c("A", "A", "T", "C")), class = "data.frame", row.names = c(NA,
-4L))
which looks like this
POS A C T G Ref
1 10 1 6 0 A
2 10 8 2 0 A
3 6 2 10 2 T
4 1 7 8 11 C
My aim is to extract the maximum value of each row, excluding the column named in Ref. In the first row I want the value of T, since it is the highest value that does not belong to the Ref column A. In the second row I want the value of C, and so on.
The POS column does not count here; it is all about A, T, G and C.
Unfortunately, I have to do this on quite a number of rows, so I need an automated solution.
I would be happy with a dplyr solution, since I am trying to focus on dplyr :)
Thanks a lot!
Thank you for all the answers. There are multiple correct solutions; I just accepted the one I am currently using. The other answers work as well!

You can try max in apply:
apply(sapply(c("A", "C", "T", "G"), function(i)
`[<-`(as.numeric(mydf[[i]]), mydf$Ref == i, NA)), 1, max, na.rm=TRUE)
#[1] 6 8 6 11
Or using pmax:
do.call(pmax, c(lapply(c("A", "C", "T", "G"), function(i)
`[<-`(as.numeric(mydf[[i]]), mydf$Ref == i, NA)), na.rm=TRUE))
#[1] 6 8 6 11
Benchmark:
library(dplyr)
bench::mark(check = FALSE
, apply = apply(sapply(c("A", "C", "T", "G"), function(i)
`[<-`(as.numeric(mydf[[i]]), mydf$Ref == i, NA)), 1, max, na.rm=TRUE)
, do.call = do.call(pmax, c(lapply(c("A", "C", "T", "G"), function(i)
`[<-`(as.numeric(mydf[[i]]), mydf$Ref == i, NA)), na.rm=TRUE))
, mapply = mapply(function(x, i) max(as.numeric(unlist(x))[-i]),
x = split(mydf[, 2:5], seq(nrow(mydf))),
i = match(mydf$Ref, names(mydf)[-1]))
, sapply = sapply(split(mydf, seq(nrow(mydf))),
function(x) max(as.numeric(x[, setdiff(c("A", "C", "T", "G"), x$Ref)])))
, dplyr = {mydf %>%
rowwise() %>%
mutate(Res = Reduce(pmax, across(A:G, ~ as.numeric(.) * (. != get(Ref)))))}
)
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
#1 apply 103.7µs 111.06µs 8861. 4.13KB 14.5 4291 7
#2 do.call 63.3µs 68.56µs 14072. 4.13KB 14.4 6825 7
#3 mapply 323.3µs 355.44µs 2747. 14.55KB 12.4 1329 6
#4 sapply 469.4µs 516.12µs 1855. 16.5KB 12.5 892 6
#5 dplyr 7.6ms 8.26ms 120. 23.35KB 11.1 54 5
Using pmax through do.call appears to be the fastest; it allocates the same memory as the apply version (4.13KB) and less than the other approaches.

You can set the value in each row's Ref column to NA and then use pmax to get the row-wise maximum, ignoring NA values.
mydf <- type.convert(mydf, as.is = TRUE)
tmp <- mydf
tmp[cbind(1:nrow(tmp), match(tmp$Ref, names(tmp)))] <- NA
mydf$max_value <- do.call(pmax, c(tmp[2:5], na.rm = TRUE))
mydf
# POS A C T G Ref max_value
#1 1 10 1 6 0 A 6
#2 2 10 8 2 0 A 8
#3 3 6 2 10 2 T 6
#4 4 1 7 8 11 C 11

A base R solution is
sapply(split(mydf, seq(nrow(mydf))),
function(x) max(x[, setdiff(c("A", "C", "T", "G"), x$Ref)]))
#R> 1 2 3 4
#R> 6 8 6 11
Or
mapply(function(x, i) max(x[-i]),
x = split(as.matrix(mydf[, 2:5]), seq(nrow(mydf))),
i = match(mydf$Ref, names(mydf)[-1]))
#R> 1 2 3 4
#R> 6 8 6 11
Or like GKi's answer
x <- as.matrix(mydf[, c("A", "C", "T", "G")])
x[rep(c("A", "C", "T", "G"), each = NROW(mydf)) == mydf$Ref] <- NA_real_
apply(x, 1, max, na.rm = TRUE)
#R> [1] 6 8 6 11
# in R 4.1.0 or greater
as.matrix(mydf[, c("A", "C", "T", "G")]) |>
(\(x){
x[rep(c("A", "C", "T", "G"), each = NROW(mydf)) == mydf$Ref] <- NA_real_
x
})() |>
apply(1, max, na.rm = TRUE)
#R> [1] 6 8 6 11
I have first transformed the columns to numeric variables as follows as I assume that this is what you intended:
mydf[, c("A", "C", "T", "G")] <-
lapply(mydf[, c("A", "C", "T", "G")], as.numeric)

One dplyr option could be:
mydf %>%
rowwise() %>%
mutate(Res = Reduce(pmax, across(A:G, ~ . * (. != get(Ref)))))
POS A C T G Ref Res
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 10 1 6 0 A 6
2 2 10 8 2 0 A 8
3 3 6 2 10 2 T 6
4 4 1 7 8 11 C 11
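Another dplyr-style route, not taken from the answers above, is to reshape to long format, drop each row's Ref column by name, and take the maximum of what remains. Because it compares column names rather than values, it is not affected when another column happens to hold the same count as the Ref column. A minimal sketch, with the counts already entered as numbers and base, count and max_non_ref as made-up names:

```r
library(dplyr)
library(tidyr)

mydf <- data.frame(POS = 1:4,
                   A = c(10, 10, 6, 1),
                   C = c(1, 8, 2, 7),
                   T = c(6, 2, 10, 8),
                   G = c(0, 0, 2, 11),
                   Ref = c("A", "A", "T", "C"))

res <- mydf %>%
  # one row per POS and base
  pivot_longer(cols = all_of(c("A", "C", "T", "G")),
               names_to = "base", values_to = "count") %>%
  # drop the reference base of each position
  filter(base != Ref) %>%
  group_by(POS) %>%
  summarise(max_non_ref = max(count), .groups = "drop")

res
```

The result can be joined back to mydf with left_join(mydf, res, by = "POS") if the original columns are needed alongside the maximum. On large data this reshape is likely slower than the pmax approach.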

Comparing two dataframes in R and extract the values from one dataframe

I have two data frames with different numbers of rows and columns: one has two columns and the other has multiple columns.
The first data frame (df) and the second data frame (df2) are given as dput output below.
I need to replace the letters in the second data frame (A, B, C, etc.) with the corresponding values from the second column of the first data frame.
Please help me solve this problem.
dput:
df
structure(list(col1 = c("A", "B", "C", "D", "E", "F", "G", "H",
"I", "J", "K", "L"), col2 = c(10, 1, 2, 3, 4, 3, 1, 8, 19, 200,
12, 112)), row.names = c(NA, -12L), class = c("tbl_df", "tbl",
"data.frame"))
df2
structure(list(col1 = c("A", "F", "W", "E", "F", "G"), col2 = c(NA,
NA, "J", "K", "L", NA), col3 = c(NA, "H", "I", NA, "A", "B")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))

A one-liner:
as_tibble(`colnames<-`(matrix(df$col2[match(as.matrix(df2),df$col1)], ncol=3), names(df2)))
#> # A tibble: 6 x 3
#> col1 col2 col3
#> <dbl> <dbl> <dbl>
#> 1 10 NA NA
#> 2 3 NA 8
#> 3 NA 200 19
#> 4 4 12 NA
#> 5 3 112 10
#> 6 1 NA 1

You can accomplish this with a little data manipulation. Make the data in df2 long, then join to df, then make the data wide again.
The rowid_to_column is necessary to make the transition from long to wide work. You can easily remove that column by adding select(-rowid) at the end of the chain.
library(tidyverse)
df2 %>%
rowid_to_column() %>%
pivot_longer(cols = -rowid) %>%
left_join(df, by = c("value" = "col1")) %>%
select(-value) %>%
pivot_wider(names_from = name, values_from = col2)
# rowid col1 col2 col3
# <int> <dbl> <dbl> <dbl>
# 1 1 10 NA NA
# 2 2 3 NA 8
# 3 3 NA 200 19
# 4 4 4 12 NA
# 5 5 3 112 10
# 6 6 1 NA 1

A one-liner in base R:
df2 <- as.data.frame(lapply(df2, function(x) ifelse(!is.na(x), setNames(df$col2, df$col1)[x], NA)))
Output
> df2
col1 col2 col3
1 10 NA NA
2 3 NA 8
3 NA 200 19
4 4 12 NA
5 3 112 10
6 1 NA 1
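The one-liner above relies on a named vector acting as a lookup table: indexing it by a character vector returns the matching values, and NA wherever a key is missing or is itself NA. A step-by-step sketch of the same idea, using plain data frames in place of the tibbles from the question and lookup and out as made-up names:

```r
df <- data.frame(col1 = LETTERS[1:12],
                 col2 = c(10, 1, 2, 3, 4, 3, 1, 8, 19, 200, 12, 112))
df2 <- data.frame(col1 = c("A", "F", "W", "E", "F", "G"),
                  col2 = c(NA, NA, "J", "K", "L", NA),
                  col3 = c(NA, "H", "I", NA, "A", "B"))

# Named vector: names are the keys (letters), values are the numbers
lookup <- setNames(df$col2, df$col1)

# Replace every column of df2 by its looked-up values;
# keys that are NA or absent from df (such as "W") become NA
out <- df2
out[] <- lapply(df2, function(x) unname(lookup[x]))

out
```

Because the lookup already returns NA for NA keys, the explicit ifelse(!is.na(x), ...) guard is not strictly needed in this sketch.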

Another short one liner in base. You can use match and assign the result to df2[]:
df2[] <- df[match(unlist(df2), df[,1]), 2]
df2
# col1 col2 col3
#1 10 NA NA
#2 3 NA 8
#3 NA 200 19
#4 4 12 NA
#5 3 112 10
#6 1 NA 1

Find rows in a data frame where certain columns are duplicated, then combine the elements in other columns [duplicate]

I have a data frame in which I want to find the rows where both columns A and B are duplicated, and then merge those rows by combining the elements of column C.
My example:
DF = cbind.data.frame(A = c(1, 1, 2, 3, 3),
B = c("a", "b", "a", "c", "c"),
C = c("M", "N", "X", "M", "N"))
My expected result:
DFE = cbind.data.frame(A = c(1, 1, 2, 3),
B = c("a", "b", "a", "c"),
C = c("M", "N", "X", "M; N"))
Thanks a lot

Without packages:
DF <- aggregate(C ~ A + B, FUN = function(x) paste(x, collapse = "; "), data = DF)
Output:
A B C
1 1 a M
2 2 a X
3 1 b N
4 3 c M; N
Or with data.table:
library(data.table)
setDT(DF)[, .(C = paste(C, collapse = "; ")), by = .(A, B)]

This is a tidyverse based solution where you can use paste with collapse after grouping it.
library(dplyr)
DF = cbind.data.frame(A = c(1, 1, 2, 3, 3),
B = c("a", "b", "a", "c", "c"),
C = c("M", "N", "X", "M", "N"))
DFE = cbind.data.frame(A = c(1, 1, 2, 3),
B = c("a", "b", "a", "c"),
C = c("M", "N", "X", "M; N"))
DF %>%
group_by(A,B) %>%
summarise(C = paste(C, collapse = "; "))
#> # A tibble: 4 x 3
#> # Groups: A [3]
#> A B C
#> <dbl> <fct> <chr>
#> 1 1 a M
#> 2 1 b N
#> 3 2 a X
#> 4 3 c M; N
Created on 2019-03-19 by the reprex package (v0.2.1)
