I have some data such as this:
structure(list(id = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3), dead = c(1,
1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0), futime = c(2062, 2062, 2062,
2062, 2151, 2151, 388, 388, 388, 388, 388, 388), event = c("hosp",
"out", "hosp", "out", "hosp", "out", "hosp", "out", "hosp", "out",
"hosp", "out"), event_time = c(36, 52, 775, 776, 1268, 1283,
178, 192, 271, 272, 387, 377.9)), class = "data.frame", row.names = c(NA,
-12L))
and I would like to make it look like this
structure(list(id2 = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3,
3, 3), dead2 = c(NA, NA, NA, NA, 1, NA, NA, 1, NA, NA, NA, NA,
NA, NA, NA), futime2 = c(NA, NA, NA, NA, 2062, NA, NA, 2151,
NA, NA, NA, NA, NA, NA, 388), event2 = c("hosp", "out", "hosp",
"out", "death", "hosp", "out", "death", "hosp", "out", "hosp",
"out", "hosp", "out", "censored"), event_time2 = c(36, 52, 775,
776, 2062, 1268, 1283, 2151, 178, 192, 271, 272, 387, 377.9,
388)), class = "data.frame", row.names = c(NA, -15L))
So basically, I want dead == 1 and the value from the futime column to appear only in the last observation for each id, and I want to add a final row per id where the event column records "death" (or "censored" when dead == 0) with futime as its event time. Thanks
I didn't put the "2"s in the column names of the result, but you can make that change easily enough if you need it.
library(dplyr)

last_rows = df %>%
  select(id, dead, futime) %>%
  group_by(id) %>%
  slice(1) %>%
  mutate(
    event = ifelse(dead == 1, "death", "censored"),
    event_time = futime
  )
result = df %>%
  mutate(
    dead = NA,
    futime = NA
  ) %>%
  bind_rows(last_rows) %>%
  arrange(id, event_time)
result
# id dead futime event event_time
# 1 1 NA NA hosp 36.0
# 2 1 NA NA out 52.0
# 3 1 NA NA hosp 775.0
# 4 1 NA NA out 776.0
# 5 1 1 2062 death 2062.0
# 6 2 NA NA hosp 1268.0
# 7 2 NA NA out 1283.0
# 8 2 1 2151 death 2151.0
# 9 3 NA NA hosp 178.0
# 10 3 NA NA out 192.0
# 11 3 NA NA hosp 271.0
# 12 3 NA NA out 272.0
# 13 3 NA NA out 377.9
# 14 3 NA NA hosp 387.0
# 15 3 0 388 censored 388.0
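If you do want the "2" suffixes from the desired output, a minimal sketch (using the result object built above) is:
result %>% rename_with(~ paste0(.x, "2"))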
Here is one method using group_modify and add_row
library(dplyr)
library(tibble)

df1 %>%
  group_by(id, futime) %>%
  group_modify(~ .x %>%
    add_row(dead = NA^!last(.x$dead), event_time = last(.y$futime),
            event = if (last(.x$dead) == 1) "death" else "censored")) %>%
  mutate(across(c(dead), ~ replace(., row_number() != n(), NA))) %>%
  group_by(id) %>%
  mutate(futime = replace(futime, duplicated(futime, fromLast = TRUE), NA)) %>%
  ungroup
-output
# A tibble: 15 × 5
id futime dead event event_time
<dbl> <dbl> <dbl> <chr> <dbl>
1 1 NA NA hosp 36
2 1 NA NA out 52
3 1 NA NA hosp 775
4 1 NA NA out 776
5 1 2062 1 death 2062
6 2 NA NA hosp 1268
7 2 NA NA out 1283
8 2 2151 1 death 2151
9 3 NA NA hosp 178
10 3 NA NA out 192
11 3 NA NA hosp 271
12 3 NA NA out 272
13 3 NA NA hosp 387
14 3 NA NA out 378.
15 3 388 NA censored 388
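A quick note on the NA^!last(.x$dead) trick above, since it is easy to miss: in R, NA^0 is 1 while NA^1 is NA, so negating dead flips between keeping a 1 and producing NA:
NA^!c(1, 0)
#> [1]  1 NA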
I need a conditional way to lag back to the last row whose value is one number or "level" lower than the current row. Whenever type == "yes", I want to go back one level to the last "no" row and get its quantity. For example, rows 2 and 3 here are type "yes" and level 5, so I'd like to go back to the last level 4 "no" row, get the quantity, and assign it to a new column. When type is "no", no lagging needs to be done.
Data:
row_id level type quantity
     1     4   no      100
     2     5  yes      110
     3     5  yes      115
     4     2   no      500
     5     2   no      375
     6     3  yes      250
     7     3  yes      260
     8     3  yes      420
Desired output:
row_id level type quantity lagged_quantity
     1     4   no      100              NA
     2     5  yes      110             100
     3     5  yes      115             100
     4     2   no      500              NA
     5     2   no      375              NA
     6     3  yes      250             375
     7     3  yes      260             375
     8     3  yes      420             375
Data:
structure(list(row_id = c(1, 2, 3, 4, 5, 6, 7, 8), level = c(4,
5, 5, 2, 2, 3, 3, 3), type = c("no", "yes", "yes", "no", "no",
"yes", "yes", "yes"), quantity = c(100, 110, 115, 500, 375, 250,
260, 420)), row.names = c(NA, -8L), class = c("tbl_df", "tbl",
"data.frame"))
Desired output:
structure(list(row_id = c(1, 2, 3, 4, 5, 6, 7, 8), level = c(4,
5, 5, 2, 2, 3, 3, 3), type = c("no", "yes", "yes", "no", "no",
"yes", "yes", "yes"), quantity = c(100, 110, 115, 500, 375, 250,
260, 420), lagged_quantity = c("NA", "100", "100", "NA", "NA",
"375", "375", "375")), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"))
@Mossa
A direct solution would be:
library(dplyr)
library(tidyr)

df1 %>%
  mutate(level_id = 1 + cumsum(c(1, diff(level)) < 0)) %>%
  mutate(lagged_quantity = if_else(type == "yes", NA_real_, quantity)) %>%
  fill(lagged_quantity) %>%
  mutate(lagged_quantity = if_else(type == "no", NA_real_, lagged_quantity))
First we keep only the quantities we want to carry forward (blanking the "yes" rows), then fill() fills the missing entries with the last known value, and finally the "no" rows, which need no lag, are set back to NA.
An option with data.table
library(data.table)

# Summarise the last quantity per (level, type), lag those summary values within
# blocks that start at each 'no' row, then join them back onto df1 by level and type.
setDT(df1)[df1[, .(lagged_qty = last(quantity)), .(level, type)][,
             lagged_qty := shift(lagged_qty), .(grp = cumsum(type == 'no'))],
           lagged_qty := lagged_qty, on = .(level, type)]
-output
> df1
row_id level type quantity lagged_qty
<int> <int> <char> <int> <int>
1: 1 4 no 100 NA
2: 2 5 yes 110 100
3: 3 5 yes 115 100
4: 4 2 no 500 NA
5: 5 2 no 375 NA
6: 6 3 yes 250 375
7: 7 3 yes 260 375
8: 8 3 yes 420 375
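The cumsum(type == 'no') part is what keeps shift() from reaching back into an unrelated block: on the summarised (level, type) table it builds a group id that increases at every 'no' row, for example:
type <- c("no", "yes", "no", "yes")
cumsum(type == "no")
#> [1] 1 1 2 2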
I have a few large dataframes in RStudio that have this structure:
Original data structure
structure(list(CHROM = c("scaffold1000|size223437", "scaffold1000|size223437",
"scaffold1000|size223437", "scaffold1000|size223437"), POS = c(666,
1332, 3445, 4336), REF = c("A", "TA", "CTTGA", "GCTA"), RO = c(20,
14, 9, 25), ALT_1 = c("GAT", "TGC", "AGC", "T"), ALT_2 = c("CAG",
"TGA", "CGC", NA), ALT_3 = c("G", NA, "TGA", NA), ALT_4 = c("AGT",
NA, NA, NA), AO_1 = c(13, 4, 67, 120), AO_2 = c(12, 5, 34, NA
), AO_3 = c(6, NA, 18, NA), AO_4 = c(101, NA, NA, NA), AOF_1 = c(8.55263157894737,
17.3913043478261, 52.34375, 82.7586206896552), AOF_2 = c(7.89473684210526,
21.7391304347826, 26.5625, NA), AOF_3 = c(3.94736842105263, NA,
14.0625, NA), AOF_4 = c(66.4473684210526, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-4L))
But for an analysis I need it to look like this:
Desired output
structure(list(CHROM = c("scaffold1000|size223437", "scaffold1000|size223437",
"scaffold1000|size223437", "scaffold1000|size223437"), POS = c(666,
1332, 3445, 4336), REF = c("A", "TA", "CTTGA", "GCTA"), RO = c(20,
14, 9, 25), ALT_1 = c("AGT", "TGA", "AGC", "T"), ALT_2 = c("CAG",
"TGC", "CGC", NA), ALT_3 = c("G", NA, "TGA", NA), ALT_4 = c("GAT",
NA, NA, NA), AO_1 = c(101, 5, 67, 120), AO_2 = c(12, 4, 34, NA
), AO_3 = c(6, NA, 18, NA), AO_4 = c(13, NA, NA, NA), AOF_1 = c(66.4473684210526,
21.7391304347826, 52.34375, 82.7586206896552), AOF_2 = c(7.89473684210526,
17.3913043478261, 26.5625, NA), AOF_3 = c(3.94736842105263, NA,
14.0625, NA), AOF_4 = c(8.55263157894737, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-4L))
So what I would like to do is rearrange the content of each row so that the columns ALT_1, ALT_2, ALT_3, ALT_4 are alphabetically sorted, while at the same time rearranging the corresponding AO and AOF columns so that the values still match.
(The value of AO_1 should still match with the sequence that was in ALT_1.
So if ALT_1 becomes ALT_2 in the sorted dataframe, AO_1 should also become AO_2)
What I tried so far, but didn't work:
Pasting the values of ALT_1, AO_1, and AOF_1 into one field, so that I had them together:
for (i in seq_len(nrow(X))) {
  if (!is.na(X[i, 6])) {
    X[i, 6] <- paste(X[i, 6], X[i, 10], X[i, 14], sep = " ")
  }
}
And then I wanted to extract every row as a vector to sort the values and put it back in the dataframe, but I didn't manage to do this.
So the question would be how I can order the dataframe to get the desired output?
(I need to apply this to 32 dataframes, each having >100,000 values.)
Here is a dplyr solution. It took me some time, and I needed some help because pivot_wider dissolves the order from arrange:
library(dplyr)
library(tidyr)
df1 %>%
  mutate(id = row_number()) %>%
  unite("conc1", c(ALT_1, AO_1, AOF_1), sep = "_") %>%
  unite("conc2", c(ALT_2, AO_2, AOF_2), sep = "_") %>%
  unite("conc3", c(ALT_3, AO_3, AOF_3), sep = "_") %>%
  unite("conc4", c(ALT_4, AO_4, AOF_4), sep = "_") %>%
  pivot_longer(starts_with("conc")) %>%
  mutate(value = ifelse(value == "NA_NA_NA", NA_character_, value)) %>%
  group_by(id) %>%
  mutate(value = sort(value, na.last = TRUE)) %>%
  ungroup() %>%
  pivot_wider(
    names_from = name,
    values_from = value,
    values_fill = "0"
  ) %>%
  separate(conc1, c("ALT_1", "AO_1", "AOF_1"), sep = "_") %>%
  separate(conc2, c("ALT_2", "AO_2", "AOF_2"), sep = "_") %>%
  separate(conc3, c("ALT_3", "AO_3", "AOF_3"), sep = "_") %>%
  separate(conc4, c("ALT_4", "AO_4", "AOF_4"), sep = "_") %>%
  select(CHROM, POS, REF, RO, starts_with("ALT"), starts_with("AO_"), starts_with("AOF_")) %>%
  type.convert(as.is = TRUE)
CHROM POS REF RO ALT_1 ALT_2 ALT_3 ALT_4 AO_1 AO_2 AO_3 AO_4 AOF_1 AOF_2 AOF_3 AOF_4
<chr> <int> <chr> <int> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 scaffold1000|size223437 666 A 20 AGT CAG G GAT 101 12 6 13 66.4 7.89 3.95 8.55
2 scaffold1000|size223437 1332 TA 14 TGA TGC NA NA 5 4 NA NA 21.7 17.4 NA NA
3 scaffold1000|size223437 3445 CTTGA 9 AGC CGC TGA NA 67 34 18 NA 52.3 26.6 14.1 NA
4 scaffold1000|size223437 4336 GCTA 25 T NA NA NA 120 NA NA NA 82.8 NA NA NA
Here is a data.table approach:
library(data.table)
# Set to data.table format
setDT(mydata)
# Melt to long format
DT.melt <- melt(mydata, measure.vars = patterns(ALT = "^ALT_", AO = "^AO_", AOF = "^AOF_"))
# order by groups, na's at the end
setorderv(DT.melt, cols = c("CHROM", "POS", "ALT"), na.last = TRUE)
# cast to wide again, use rowid() for numbering
dcast(DT.melt, CHROM + POS + REF + RO ~ rowid(REF), value.var = list("ALT", "AO", "AOF"))
# CHROM POS REF RO ALT_1 ALT_2 ALT_3 ALT_4 AO_1 AO_2 AO_3 AO_4 AOF_1 AOF_2 AOF_3 AOF_4
# 1: scaffold1000|size223437 666 A 20 AGT CAG G GAT 101 12 6 13 66.44737 7.894737 3.947368 8.552632
# 2: scaffold1000|size223437 1332 TA 14 TGA TGC <NA> <NA> 5 4 NA NA 21.73913 17.391304 NA NA
# 3: scaffold1000|size223437 3445 CTTGA 9 AGC CGC TGA <NA> 67 34 18 NA 52.34375 26.562500 14.062500 NA
# 4: scaffold1000|size223437 4336 GCTA 25 T <NA> <NA> <NA> 120 NA NA NA 82.75862 NA NA NA
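For completeness, the row-by-row reordering the question attempted can also be sketched in base R with order(). This is only a sketch, assuming the original dput data is assigned to X:
alt_cols <- paste0("ALT_", 1:4)
ao_cols  <- paste0("AO_", 1:4)
aof_cols <- paste0("AOF_", 1:4)
for (i in seq_len(nrow(X))) {
  o <- order(unlist(X[i, alt_cols]), na.last = TRUE)  # alphabetical order of the ALT values
  X[i, alt_cols] <- X[i, alt_cols][o]
  X[i, ao_cols]  <- X[i, ao_cols][o]                  # keep AO and AOF aligned with their ALT
  X[i, aof_cols] <- X[i, aof_cols][o]
}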
I'm a fairly new R user, trying to teach myself from forums, videos, and trial and error. I have a very large dataset and would like to calculate the number of members in each household who are considered children (aged under 18). I have a column for the number of household members, as well as 11 columns for each household member's age. My initial thought was to select those who are under 18 and subtract from total household members. I've tried a few different lines of code unsuccessfully and I'm not sure how best to go about this. Any help is greatly appreciated!
There are a few ways to do this. I'm using something called a datastep from the libr package.
First, here is your data:
df <- data.frame(num_hhmem = c(6, 4, 4, 5, 4, NA, 8, NA),
ChildAge = c(9, 8, 10, 10, 9, NA, 8, NA),
hhm2_Age = c(36, 44, 52, 40, 33, NA, 37, NA),
hhm3_Age = c(34, 16, 53, 15, 15, NA, 39, NA),
hhm4_Age = c(15, 10, 92, 17, 11, NA, NA, NA),
hhm5_Age = c(7, NA, NA, 20, NA, NA, 10, NA),
hhm6_Age = c(11, NA, NA, NA, NA, NA, 6, NA),
hhm7_Age = c(NA, NA, NA, NA, NA, NA, 68, NA),
hhm8_Age = c(NA, NA, NA, NA, NA, NA, 78, NA),
hhm9_Age = c(NA, NA, NA, NA, NA, NA, NA, NA))
Then I set up the datastep with an array for the columns you want to iterate over. I also set up a childCount variable starting at 0. The datastep loops through the dataframe row by row, so you just iterate through the array and add any children to the childCount variable.
library(libr)

res <- datastep(df,
  arrays = list(ages = dsarray("ChildAge", "hhm2_Age", "hhm3_Age",
                               "hhm4_Age", "hhm5_Age", "hhm6_Age",
                               "hhm7_Age", "hhm8_Age", "hhm9_Age")),
  calculate = { childCount <- 0 },
  drop = "age",
  {
    for (age in ages) {
      if (!is.na(ages[age])) {
        if (ages[age] < 18)
          childCount <- childCount + 1
      }
    }
  })
Here are the results:
res
# num_hhmem ChildAge hhm2_Age hhm3_Age hhm4_Age hhm5_Age hhm6_Age hhm7_Age hhm8_Age hhm9_Age childCount
# 1 6 9 36 34 15 7 11 NA NA NA 4
# 2 4 8 44 16 10 NA NA NA NA NA 3
# 3 4 10 52 53 92 NA NA NA NA NA 1
# 4 5 10 40 15 17 20 NA NA NA NA 3
# 5 4 9 33 15 11 NA NA NA NA NA 3
# 6 NA NA NA NA NA NA NA NA NA NA 0
# 7 8 8 37 39 NA 10 6 68 78 NA 3
# 8 NA NA NA NA NA NA NA NA NA NA 0
Here is another potential solution using tidyverse functions and the data formatted by @David J. Bosak:
df1 <- data.frame(num_hhmem = c(6, 4, 4, 5, 4, NA, 8, NA),
ChildAge = c(9, 8, 10, 10, 9, NA, 8, NA),
hhm2_Age = c(36, 44, 52, 40, 33, NA, 37, NA),
hhm3_Age = c(34, 16, 53, 15, 15, NA, 39, NA),
hhm4_Age = c(15, 10, 92, 17, 11, NA, NA, NA),
hhm5_Age = c(7, NA, NA, 20, NA, NA, 10, NA),
hhm6_Age = c(11, NA, NA, NA, NA, NA, 6, NA),
hhm7_Age = c(NA, NA, NA, NA, NA, NA, 68, NA),
hhm8_Age = c(NA, NA, NA, NA, NA, NA, 78, NA),
hhm9_Age = c(NA, NA, NA, NA, NA, NA, NA, NA))
df2 <- df1 %>%
  rowwise() %>%
  mutate(total_kids = rowSums(across(-c(num_hhmem), ~ sum(.x < 18, na.rm = TRUE))))
df2
#> # A tibble: 8 × 11
#> # Rowwise:
#> num_hhmem ChildAge hhm2_Age hhm3_Age hhm4_Age hhm5_Age hhm6_Age hhm7_Age hhm8_Age hhm9_Age total_kids
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl>
#> 1 6 9 36 34 15 7 11 NA NA NA 4
#> 2 4 8 44 16 10 NA NA NA NA NA 3
#> 3 4 10 52 53 92 NA NA NA NA NA 1
#> 4 5 10 40 15 17 20 NA NA NA NA 3
#> 5 4 9 33 15 11 NA NA NA NA NA 3
#> 6 NA NA NA NA NA NA NA NA NA NA 0
#> 7 8 8 37 39 NA 10 6 68 78 NA 3
#> 8 NA NA NA NA NA NA NA NA NA NA 0
Or, if you just want the counts in a dataframe on their own:
df3 <- df1 %>%
  rowwise() %>%
  summarise(total_kids = rowSums(across(-c(num_hhmem), ~ sum(.x < 18, na.rm = TRUE))))
df3
#> # A tibble: 8 × 1
#> total_kids
#> <dbl>
#> 1 4
#> 2 3
#> 3 1
#> 4 3
#> 5 3
#> 6 0
#> 7 3
#> 8 0
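For comparison, a plain base R sketch of the same count, assuming the df1 data above (every column ending in "Age" holds an age):
age_cols <- grep("Age$", names(df1), value = TRUE)
df1$total_kids <- rowSums(df1[age_cols] < 18, na.rm = TRUE)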
I have a dataframe with the following structure:
x <- data.frame(
x1_dte = as.Date(c("2001-01-01", "2001-01-02", "2002-01-02"), format = "%Y-%m-%d"),
x1_val1 = c(10, 12, 13),
x1_val2 = c(200, 250, 300),
x2_dte = as.Date(c("2003-01-01", "2003-04-02", "2003-04-02"), format = "%Y-%m-%d"),
x2_val1 = c(9, 11, 14),
x2_val2 = c(110, 140, 200),
x3_dte = as.Date(c(NA, NA, NA), format = "%Y-%m-%d"),
x3_val1 = c(NA, NA, NA),
x3_val2 = c(NA, NA, NA)
)
x1_dte x1_val1 x1_val2 x2_dte x2_val1 x2_val2 x3_dte x3_val1 x3_val2
1 2001-01-01 10 200 2003-01-01 9 110 <NA> NA NA
2 2001-01-02 12 250 2003-04-02 11 140 <NA> NA NA
3 2002-01-02 13 300 2003-04-02 14 200 <NA> NA NA
I want to transform it to a dataframe with the following structure:
x_longer <- data.frame(var = c("x1", "x1", "x1", "x2", "x2", "x2","x3", "x3", "x3"),
date = as.Date(c("2001-01-01", "2001-01-02", "2002-01-02", "2003-01-01", "2003-04-02", "2003-04-02", NA, NA, NA), format = "%Y-%m-%d"),
val1 = c(10, 12, 13, 9, 11, 14, NA, NA, NA),
val2 = c(200, 250, 300, 110, 140, 200, NA, NA, NA)
)
var date val1 val2
1 x1 2001-01-01 10 200
2 x1 2001-01-02 12 250
3 x1 2002-01-02 13 300
4 x2 2003-01-01 9 110
5 x2 2003-04-02 11 140
6 x2 2003-04-02 14 200
7 x3 <NA> NA NA
8 x3 <NA> NA NA
9 x3 <NA> NA NA
I don't understand how to take the multiple columns from dataframe x to create x_longer. Can anyone help me?
This works:
library(tidyr)

x %>%
  pivot_longer(everything(), names_to = c('var', '.value'), names_sep = '_')
library(dplyr)
library(tidyr)

x %>%
  pivot_longer(everything(),
               names_to = c("var", ".value"),
               names_pattern = "^(x[0-9]+)_(.*)") %>%
  rename(date = "dte") %>%
  arrange(var, date)
# # A tibble: 9 x 4
# var date val1 val2
# <chr> <date> <dbl> <dbl>
# 1 x1 2001-01-01 10 200
# 2 x1 2001-01-02 12 250
# 3 x1 2002-01-02 13 300
# 4 x2 2003-01-01 9 110
# 5 x2 2003-04-02 11 140
# 6 x2 2003-04-02 14 200
# 7 x3 NA NA NA
# 8 x3 NA NA NA
# 9 x3 NA NA NA
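In both answers the ".value" sentinel is what turns the dte/val1/val2 pieces of each name into their own columns instead of values in a names column. A small toy sketch of that mechanic (hypothetical data with just two stems):
library(tidyr)
toy <- data.frame(a_dte = 1:2, a_val1 = 3:4, b_dte = 5:6, b_val1 = 7:8)
pivot_longer(toy, everything(), names_to = c("var", ".value"), names_sep = "_")
#> # A tibble: 4 x 3
#>   var     dte  val1
#>   <chr> <int> <int>
#> 1 a         1     3
#> 2 b         5     7
#> 3 a         2     4
#> 4 b         6     8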
I am currently trying to analyse a data set in which I have one column that gives me the value of interest for each row (column called value_needed) and then a bunch of columns (in reality around 150) that have values and also a lot of NA's. For each row I would like to count the number of occurrences of that value from column value_needed in all the other columns, here position_1:position_6.
Here is some fake data:
position_1 <- c(6, -8, 8, 0, 0, -6)
position_2 <- c(NA, 6, -8, 8, 8, 0)
position_3 <- c(NA, NA, 6, -8, 0, 8)
position_4 <- c(NA, NA, NA, 6, -8, -8)
position_5 <- c(NA, NA, NA, NA, 6, 8)
position_6 <- c(NA, NA, NA, NA, NA, 6)
value_needed <- c(0, 6, -8, 8, 0, 8)
df <- data.frame(position_1, position_2, position_3,position_4, position_5, position_6,value_needed)
In the ideal case I would need to create a new column (name it occ) that counts the occurrences of the value in column value_needed from all position columns in that particular row.
The output for this fake data set above would be then:
occ = c(0,1,1,1,2,1)
If anyone has any hints, I really appreciate that.
Thanks
base solution
df$occ <- rowSums(df[1:6] == df$value_needed, na.rm = T)
dplyr solution
library(dplyr)

df %>%
  rowwise() %>%
  mutate(occ = sum(c_across(position_1:position_6) == value_needed, na.rm = TRUE)) %>%
  ungroup()
output
# # A tibble: 6 x 8
#   position_1 position_2 position_3 position_4 position_5 position_6 value_needed   occ
#        <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>        <dbl> <int>
# 1          6         NA         NA         NA         NA         NA            0     0
# 2         -8          6         NA         NA         NA         NA            6     1
# 3          8         -8          6         NA         NA         NA           -8     1
# 4          0          8         -8          6         NA         NA            8     1
# 5          0          8          0         -8          6         NA            0     2
# 6         -6          0          8         -8          8          6            8     2
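A note on why the base one-liner is row-wise at all: the comparison recycles value_needed column-major (the same way a matrix comparison does), so element i of value_needed lines up with row i of every position column. A tiny check:
m <- matrix(c(1, 2,
              3, 2), nrow = 2)
v <- c(1, 2)
m == v
#>       [,1]  [,2]
#> [1,]  TRUE FALSE
#> [2,]  TRUE  TRUE
rowSums(m == v)
#> [1] 1 2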