Problem with piping for revalue in RStudio

I would like to revalue 13 different variables. Their levels are currently character labels, and these should be recoded to numeric codes.
For a single variable, this works:
x$eins <- revalue(x$eins, c("Nie Thema" = "1",
"Selten Thema" = "2",
"Manchmal Thema" = "3",
"Häufig Thema" = "4",
"Sehr häufig Thema" = "5",
"Fast immer Thema" = "6"))
With the piping, I guess it would look something like this
x %>%
dplyr::select(., eins:dreizehn) %>%
revalue(., c("Nie Thema" = "1",
"Selten Thema" = "2",
"Manchmal Thema" = "3",
"Häufig Thema" = "4",
"Sehr häufig Thema" = "5",
"Fast immer Thema" = "6"))
With this, I get the warning message from revalue, that x is not a factor or a character vector.
What am I doing wrong?
Thanks in advance.

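Use across to apply a function to multiple columns. revalue expects a single factor or character vector, not a data frame, which is why piping the selected columns straight into it fails.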
library(dplyr)
# revalue() comes from plyr; calling it as plyr::revalue avoids attaching plyr after dplyr
x <- x %>%
dplyr::mutate(across(eins:dreizehn, ~plyr::revalue(., c("Nie Thema" = "1",
"Selten Thema" = "2",
"Manchmal Thema" = "3",
"Häufig Thema" = "4",
"Sehr häufig Thema" = "5",
"Fast immer Thema" = "6"))))
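For comparison, the same column-wise recoding can be sketched in base R with a named lookup vector and lapply. This is only an illustration on made-up toy data, not the asker's data; the names lookup and cols are assumptions introduced here. Note one difference from revalue: indexing by name turns values that are missing from the lookup into NA instead of leaving them unchanged.

```r
# Toy data standing in for the real data frame (values are made up)
x <- data.frame(
  eins = c("Nie Thema", "Selten Thema", "Manchmal Thema"),
  zwei = c("Manchmal Thema", "Nie Thema", "Selten Thema"),
  id = 1:3,
  stringsAsFactors = FALSE
)

# Named vector: names are the old labels, values are the new codes
lookup <- c("Nie Thema" = "1", "Selten Thema" = "2", "Manchmal Thema" = "3",
            "Häufig Thema" = "4", "Sehr häufig Thema" = "5",
            "Fast immer Thema" = "6")

# Recode every listed column; columns not in `cols` are left untouched
cols <- c("eins", "zwei")
x[cols] <- lapply(x[cols], function(col) unname(lookup[as.character(col)]))
```

The recoded columns are still character vectors; wrap the lookup result in as.integer() if numeric codes are needed.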

Related

Create (many) columns conditional on similarly named columns

I want to create a new column that take the value of one of two similarly named columns, depending on a third column. There are many such columns to create. Here's my data.
dt <- structure(list(malvol_left_1_w1 = c("1", "1", "4", "3", "4",
"4", "1", "4", "4", "3", "1", "4", "4", "3", "4", "4", "5", "2",
"4", "2"), malvol_left_2_w1 = c("1", "1", "4", "3", "4", "4",
"1", "3", "4", "2", "2", "2", "4", "1", "5", "4", "5", "2", "4",
"2"), malvol_right_1_w1 = c("1", "1", "4", "3", "4", "4", "1",
"3", "4", "2", "1", "4", "4", "5", "5", "4", "2", "6", "4", "1"
), malvol_right_2_w1 = c("1", "1", "4", "3", "4", "4", "1", "3",
"4", "2", "1", "2", "4", "5", "5", "4", "5", "5", "4", "5"),
malvol_left_1_w2 = c("1", "1", "3", "3", "4", "4", "1", "5",
"4", "4", "4", "2", "1", "4", "5", "4", "3", "2", "4", "4"
), malvol_left_2_w2 = c("1", "1", "3", "3", "4", "4", "7",
"5", "4", "2", "3", "1", "1", "4", "4", "4", "3", "4", "4",
"4"), malvol_right_1_w2 = c("1", "3", "3", "3", "4", "4",
"1", "4", "4", "3", "2", "2", "4", "1", "4", "4", "5", "5",
"4", "4"), malvol_right_2_w2 = c("1", "2", "3", "3", "4",
"4", "1", "2", "4", "2", "3", "2", "4", "1", "4", "4", "5",
"4", "4", "3"), leftright_w1 = c("right", "right", "left",
"right", "right", "right", "left", "right", "right", "left",
"left", "left", "left", "right", "left", "left", "right",
"right", "right", "left"), leftright_w2 = c("right", "right",
"left", "left", "right", "left", "left", "right", "right",
"left", "left", "left", "left", "right", "left", "left",
"right", "right", "left", "left")), class = "data.frame", row.names = c("12",
"15", "69", "77", "95", "96", "112", "122", "150", "163", "184",
"216", "221", "226", "240", "298", "305", "354", "370", "379"
))
Now I can do this in dplyr like:
dt <- dt %>%
mutate(
malvol_1_w1 = case_when(
leftright_w1 == "left" ~ malvol_right_1_w1,
leftright_w1 == "right" ~ malvol_left_1_w1),
malvol_2_w1 = case_when(
leftright_w1 == "left" ~ malvol_right_2_w1,
leftright_w1 == "right" ~ malvol_left_2_w1),
malvol_1_w2 = case_when(
leftright_w2 == "left" ~ malvol_right_1_w2,
leftright_w2 == "right" ~ malvol_left_1_w2),
malvol_2_w2 = case_when(
leftright_w2 == "left" ~ malvol_right_2_w2,
leftright_w2 == "right" ~ malvol_left_2_w2))
However, it's not really a feasible solution, because there will be more of both numbers defining a variable (e.g. both malvol_3_w1 and malvol_1_w3 will need to be created).
One solution is to do this with a loop:
for (wave in 1:2) {
for (var in 1:2) {
dt[, paste0("malvol_", var, "_w", wave)] <- dt[, paste0("malvol_right_", var, "_w", wave)]
dt[dt[[paste0("leftright_w", wave)]] == "right", paste0("malvol_", var, "_w", wave)] <-
dt[dt[[paste0("leftright_w", wave)]] == "right", paste0("malvol_left_", var, "_w", wave)]
}
}
However, what is a tidyverse solution?
UPDATE:
I came up with a tidyverse solution myself, though it is not very elegant. I am still looking for a more canonical solution.
dt <- dt %>%
mutate(
malvol_1_w1 = NA, malvol_2_w1 = NA,
malvol_1_w2 = NA, malvol_2_w2 = NA) %>%
mutate(
across(matches("malvol_\\d"),
~ case_when(
eval(parse(text = paste0("leftright_", str_extract(cur_column(), "w.")))) == "left" ~
eval(parse(text = paste0(str_split(cur_column(), "_\\d", simplify = T)[1],
"_right", str_split(cur_column(), "malvol", simplify = T)[2]))),
eval(parse(text = paste0("leftright_", str_extract(cur_column(), "w.")))) == "right" ~
eval(parse(text = paste0(str_split(cur_column(), "_\\d", simplify = T)[1],
"_left", str_split(cur_column(), "malvol", simplify = T)[2]))))))
What makes your problem difficult is that a lot of information is hidden in the variable names rather than in data cells. Hence, you need a few steps to transform your data into "tidy" format. In the code below, the crucial parts are (1) to split the variables [malvol]_[lr]_[num]_[w] into four separate columns malvol, lr, num, w (all prefixed with m_), and (2) to extract the variable w (prefixed with l_) from the variables leftright_[w], in both cases using pivot_longer and then separate.
library(dplyr)
library(tidyr)
# Add a row id to the data for joining later
dt <- dt %>% mutate(id = row_number())
df <- dt %>%
# Tidy the column "malvol"
pivot_longer(cols = starts_with('malvol'), names_to = "m_var", values_to = "m_val") %>%
separate(m_var, into = c("m_malvol", "m_lr", "m_num", "m_w")) %>%
# Tidy the column "leftright"
pivot_longer(cols = starts_with('leftright'), names_to = 'l_var', values_to = 'l_lr') %>%
separate(l_var, into = c(NA, "l_w")) %>%
# Implement the logic
filter(l_w == m_w) %>%
filter(l_lr != m_lr) %>%
# Pivot into original wide format
select(-c(l_w, l_lr, m_lr)) %>%
pivot_wider(names_from = c(m_malvol, m_num, m_w), values_from = m_val)
# Merging back results to original data
dt <- dt %>% mutate(id = row_number()) %>% inner_join(df, by="id")
Although I pivoted the data back into your desired format at the end (to check that the results match what you expect), I would suggest leaving the data in the long format, which is "tidy" and easier to work with than your "wide" format. So you may want to skip the last pivot_wider operation.
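To make the long format concrete, here is a minimal sketch on a made-up two-row data frame. It uses the names_pattern argument of pivot_longer to split the column names in one step instead of a separate call to separate; the toy data and the column names side, num, wave, lr_wave and lr are assumptions introduced for illustration.

```r
library(dplyr)
library(tidyr)

# Toy data: one left/right pair of columns and one selector column
toy <- data.frame(
  id = 1:2,
  malvol_left_1_w1  = c("a", "b"),
  malvol_right_1_w1 = c("c", "d"),
  leftright_w1      = c("left", "right"),
  stringsAsFactors = FALSE
)

long <- toy %>%
  # Split e.g. "malvol_left_1_w1" into side = "left", num = "1", wave = "1"
  pivot_longer(
    cols = starts_with("malvol"),
    names_to = c("side", "num", "wave"),
    names_pattern = "malvol_(left|right)_(\\d+)_w(\\d+)",
    values_to = "value"
  ) %>%
  # Turn "leftright_w1" into lr_wave = "1" with the selector in column lr
  pivot_longer(
    cols = starts_with("leftright"),
    names_to = "lr_wave",
    names_prefix = "leftright_w",
    values_to = "lr"
  ) %>%
  # Keep matching waves and take the value from the opposite side
  filter(wave == lr_wave, side != lr) %>%
  select(id, num, wave, value)
```

Each remaining row holds one selected value per id, variable number and wave, so further variables or waves need no extra code.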

Recode a factor variable, dropping N/A

I have a factor variable with 14 levels, which I'm trying to collapse into only 3 levels. It also contains two N/A values, which I want to remove.
My code looks like this:
job <- fct_collapse(E$occupation, other = c("7","9", "10", "13" "14"), 1 = c("1", "2", "3", "12"), 2 = c("4", "5", "6", "8", "11"))
However, it just gives me lots of errors. Can anyone help me here?
We could also do this with a named list
library(forcats)
lst1 <- setNames(list(as.character(c(7, 9, 10, 13, 14)),
as.character(c(1, 2, 3, 12)), as.character(c(4, 5, 6, 8, 11))), c('other', 1, 2))
fct_collapse(df$occupation, !!!lst1)
data
df <- structure(list(occupation = c("1", "3", "5", "7", "9", "10",
"12", "14", "13", "4", "7", "6", "5")), class = "data.frame", row.names = c(NA,
-13L))
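For level names that are numbers, wrap them in backticks in fct_collapse; a bare 1 = ... is not valid R syntax. The original call was also missing a comma between "13" and "14".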
job <- forcats::fct_collapse(df$occupation,
other = c("7","9", "10", "13", "14"),
`1` = c("1", "2", "3", "12"),
`2` = c("4", "5", "6", "8", "11"))
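Neither call above drops the missing values mentioned in the question. As a hedged sketch, the whole task can also be done in base R rather than forcats: remove the NA entries first, then assign a named list to levels(), which merges the old levels listed under each new name. The toy vector below is made up, with two NA entries added for illustration.

```r
# Toy factor with two missing values (made-up data)
occupation <- factor(c("1", "3", "5", "7", "9", "10", "12",
                       "14", "13", "4", NA, "6", NA))

# Drop the NA entries, then drop any levels that are no longer used
occupation <- droplevels(occupation[!is.na(occupation)])

# Names are the new levels, values are the old levels to merge
levels(occupation) <- list(
  other = c("7", "9", "10", "13", "14"),
  "1"   = c("1", "2", "3", "12"),
  "2"   = c("4", "5", "6", "8", "11")
)
```

Old levels listed here but absent from the data (such as "2" or "8") are simply ignored.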

Regarding the merge of two dataframes

I have a lot of data in the form shown below. In total there are 13 data frames like the one shown below. All have the same columns.
Example of data
There are about 500,000 rows and 106 columns in each data frame. I want to combine them in the following way:
If the first AND second columns of a row in df1 are equal to the first and second columns of a row in df2, I want to add the two rows together; otherwise I want to append the row to the data frame.
I have created the following code for a minimal example (it gives me the wanted result, but will not work at the scale I am working at):
dput(df1[,1:5 ])
structure(list(C5id = c("100110", "100110", "100110", "100110",
"100100", "100100", "100100", "100100", "100100", "100100"),
Retnavn = c("Braiserede kæber af gris, tomat-skysovs, kartofler, ovnbagte bønner med bacon",
"Braiseret okseinderlår, skysovs, kartofler, marinerede rødløg med hyldeblomst",
"Cremet champignonsuppe", "Forårsfrikassé med kalv, asparges og forårsløg, kartofler, broccoli",
"Hakkebøf, bearnaisesauce, kartofler, ærter", "Farsbrød med gulerødder og ærter, legeret sovs, kartofler og romanescokål",
"Fiskefrikadeller med persillesovs, kartofler og juliennegrønt",
"Fiskefrikadeller med remouladesovs, kartofler og juliennegrønt",
"Forloren hare med vildtsovs, kartofler og tyttebærsylt",
"Frikadeller med skysovs, kartofler og sellerichutney"),
a2018uge2 = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2), a2018uge3 = c("2",
"2", "2", "2", "2", "2", "2", "2", "2", "2"), a2018uge4 = c("2",
"2", "2", "2", "2", "2", "2", "2", "2", "2")), class = "data.frame", row.names = 4:13)
dput(df2[,1:5 ])
structure(list(C5id = c("100110", "100110", "100100", "100100",
"100100", "100100", "100100", "100100", "100100", "100100", "100110",
"100110", "100100", "100100", "100100", "100100", "100100"),
Retnavn = c("Braiserede kæber af gris, tomat-skysovs, kartofler, ovnbagte bønner med bacon",
"Braiseret okseinderlår, skysovs, kartofler, marinerede rødløg med hyldeblomst",
"Cremet champignonsuppe", "Forårsfrikassé med kalv, asparges og forårsløg, kartofler, broccoli",
"Hakkebøf, bearnaisesauce, kartofler, ærter", "Hamburgerryg, flødekartofler, blomkål, broccoli og romanesco",
"Kylling i karrysovs med æbler og ingefær, kartofler, cherrytomater med løg",
"Kylling i sur-sød sovs med peberfugt, kartofler og broccoli",
"Kyllingefrikassé med kartofler", "Lammesteg, flødekartofler, ovnbagte grønne bønner med bacon",
"Cremet champignonsuppe", "Forårsfrikassé med kalv, asparges og forårsløg, kartofler, broccoli",
"Farsbrød med gulerødder og ærter, legeret sovs, kartofler og romanescokål",
"Fiskefrikadeller med persillesovs, kartofler og juliennegrønt",
"Fiskefrikadeller med remouladesovs, kartofler og juliennegrønt",
"Forloren hare med vildtsovs, kartofler og tyttebærsylt",
"Frikadeller med skysovs, kartofler og sellerichutney"),
a2018uge2 = c(3, 3, 1, 1, 3, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
2, 2), a2018uge3 = c("3", "3", "1", "1", "3", "1", "1", "1",
"1", "1", "2", "2", "2", "2", "2", "2", "2"), a2018uge4 = c("3",
"3", "1", "1", "3", "1", "1", "1", "1", "1", "2", "2", "2",
"2", "2", "2", "2")), class = "data.frame", row.names = c("5",
"6", "7", "8", "9", "10", "11", "12", "13", "14", "61", "71",
"91", "101", "111", "121", "131"))
df2_before = df2
hej=c()
for (i in 1:length(df2$C5id)) {
for (j in 1:length(df1$C5id)) {
if (df2$C5id[i] == df1$C5id[j] && df2$Retnavn[i] == df1$Retnavn[j]) {
df2[j, 3:8 ] <- as.numeric(df2[i,3:8 ]) + as.numeric(df1[j,3:8 ])
hej=c(hej,j)
#df1 = df1[-i, ]
}
}
cat("vi er kommet til:",i,",",j,"\n")
}
df2=rbind(df2,df1[-hej,])
where df1 and df2 are the two data frames. My problem is that this has to loop through 500,000 * 500,000 different combinations. I have 13 data frames of this size in total that have to be combined, so it would take an absolute eternity.
I was hoping that there would be some sort of vectorized way to do this that might finish before the fall of 2030.
Best regards
PS: I understand that the way I inserted the data in this post might not be the best, but it is the best I could think of.
PPS: I have edited the question in response to MKR's comment.
I suggest the following:
library(data.table)
df1 <- data.table::setDT(df1)
df2 <- data.table::setDT(df2)
data.table::setkeyv(df1, c("C5id","Retnavn"))
data.table::setkeyv(df2, c("C5id","Retnavn"))
# all = TRUE keeps rows that occur in only one of the two tables
new_df2 <- merge(df1,df2, all = TRUE)
cols <- names(new_df2[,3:ncol(new_df2)])
new_df2[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]
new_df2[, (cols) := lapply(.SD, function(i)
tidyr::replace_na(i,0)), .SDcols = cols]
sapply(new_df2, class)
The value columns have now been converted to numeric:
C5id Retnavn a2018uge2.x a2018uge3.x a2018uge4.x a2018uge2.y a2018uge3.y a2018uge4.y
"character" "character" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
Then, building on the question "R: merging columns and the values if they have the same column name" and @bgoldst's solution there:
# First, strip the ".x" / ".y" suffixes so that matching columns share the same name:
names(new_df2) <- stringr::str_replace(names(new_df2), "\\.[xy]$", "")
temp = do.call(cbind,lapply(split(as.list(new_df2[,3:ncol(new_df2)]),
names(new_df2[,3:ncol(new_df2)])),
function(x) Reduce(`+`,x)));
new_df2 <- cbind(new_df2[,1:2],temp)
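An alternative that avoids the suffix handling is to stack the data frames and sum within each key. Below is a minimal base R sketch of that idea on made-up toy data; the column names uge2 and uge3 and the variables stacked, value_cols and combined are assumptions introduced here, and aggregate may be slow at the asker's scale.

```r
# Toy data frames with the same columns (values are made up)
df1 <- data.frame(C5id = c("100110", "100100"), Retnavn = c("A", "B"),
                  uge2 = c(2, 2), uge3 = c("2", "2"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(C5id = c("100110", "100100"), Retnavn = c("A", "C"),
                  uge2 = c(3, 1), uge3 = c("3", "1"),
                  stringsAsFactors = FALSE)

# Stack the rows; with 13 data frames, do.call(rbind, list_of_dfs) does the same
stacked <- rbind(df1, df2)

# Convert the value columns to numeric so they can be summed
value_cols <- setdiff(names(stacked), c("C5id", "Retnavn"))
stacked[value_cols] <- lapply(stacked[value_cols], as.numeric)

# Sum all value columns within each (C5id, Retnavn) pair;
# pairs that occur only once are kept unchanged
combined <- aggregate(stacked[value_cols],
                      by = stacked[c("C5id", "Retnavn")],
                      FUN = sum)
```

Because all data frames are stacked before summing, this needs a single pass regardless of how many data frames there are.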

Using msSurv package in R

I'm trying to use msSurv for a multi-state modelling problem that looks at an individual's transitions between different stages. Part of that is creating a tree object, which is where I think I'm making a mistake, but I can't work out what it is. I'll include a minimal working example here.
Nodes <- c("1", "2", "3", "4", "5", "6")
Edges <- list("1" = list(edges = c("2", "3", "4", "5", "6")),
"2" = list(edges = c("1", "3", "4", "5", "6")),
"3" = list(edges = c("1", "2", "4", "5", "6")),
"4" = list(edges = c("1", "2", "3", "5", "6")),
"5" = list(edges = c("3", "4", "6")),
"6" = list(edges = NULL))
treeobj <- new("graphNEL", nodes = Nodes, edgeL = Edges, edgemode = "directed")
fit3 <- msSurv(df, treeobj, bs = TRUE, LT = TRUE)
The error I'm getting is as follows.
No states eligible for exit distribution calculation.
Entry distributions calculated for states 6 .
Error in bs.IA[, , j, b] : subscript out of bounds
The dataset in question can be found here.
Any help is sincerely appreciated.
I may be misunderstanding, but state 6 has no edges to any of the other states, so in essence you are telling the program that 6 is not connected to the calculation, and it returns an error. As for a solution, I believe 6 should have edges, so this line may need to be changed: "6" = list(edges = NULL))

Lapply to execute command for a list of variables

I intend to change the order of levels of some factors.
The intention is to apply this command
Df$X1 <- ordered(Df$X1, levels = c("5", "4", "3", "2", "1"))
to a list of variables (X1 to X2)
Df <- data.frame(
X1 = ordered(sample(1:5,30,r=T)),
X2 = ordered(sample(1:5,30,r=T)),
X3 = as.factor(sample(1:5,30,r=T)),
Y = as.factor(sample(1:5,30,r=T))
)
tmplistporadove <- as.list(paste("Df$",names(Df)[1:2],sep=""))
zmena <- lapply(tmplistporadove, function(x) substitute(x <- ordered(x, levels = c("5", "4", "3", "2", "1"))) )
eval(zmena)
But R just prints this:
X[[i]] <- ordered(X[[i]], levels = c("5", "4", "3", "2", "1"))
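One possible way to get the intended result, sketched here under the assumption that only X1 and X2 should be changed: the strings "Df$X1" are plain text rather than code, and eval on a list of unevaluated calls simply returns that list, which is why it gets printed. Indexing the data frame by column name and assigning the result of lapply back avoids building code from strings altogether. The seed and the names vars and before are introduced here for illustration.

```r
set.seed(1)  # only to make the toy data reproducible
Df <- data.frame(
  X1 = ordered(sample(1:5, 30, replace = TRUE)),
  X2 = ordered(sample(1:5, 30, replace = TRUE)),
  X3 = as.factor(sample(1:5, 30, replace = TRUE)),
  Y  = as.factor(sample(1:5, 30, replace = TRUE))
)

before <- as.character(Df$X1)  # kept only to check the values afterwards

# Apply ordered() with the new level order to each selected column
vars <- c("X1", "X2")
Df[vars] <- lapply(Df[vars], ordered, levels = c("5", "4", "3", "2", "1"))
```

Only the level order changes; the values in each column stay the same, and X3 and Y are left untouched.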
