Lapply to execute command for a list of variables - r

I intend to change the order of levels of some factors.
The intention is to apply this command
Df$X1 <- ordered(Df$X1, levels = c("5", "4", "3", "2", "1"))
to a list of variables (X1 to X2)
Df <- data.frame(
  X1 = ordered(sample(1:5, 30, replace = TRUE)),
  X2 = ordered(sample(1:5, 30, replace = TRUE)),
  X3 = as.factor(sample(1:5, 30, replace = TRUE)),
  Y = as.factor(sample(1:5, 30, replace = TRUE))
)
tmplistporadove <- as.list(paste("Df$",names(Df)[1:2],sep=""))
zmena <- lapply(tmplistporadove, function(x) substitute(x <- ordered(x, levels = c("5", "4", "3", "2", "1"))) )
eval(zmena)
But R just prints this:
X[[i]] <- ordered(X[[i]], levels = c("5", "4", "3", "2", "1"))
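One possible way around this, sketched under a few assumptions: the strings "Df$X1" and "Df$X2" are never parsed into code, substitute() inside the anonymous function only sees lapply's internal argument X[[i]], and eval() on a list simply returns the list, which is why the unevaluated calls get printed. Subsetting the data frame by column name and assigning the lapply() result back avoids building code from strings altogether. The names cols and new_levels, and the set.seed() call, are illustrative additions, not part of the question.

```r
set.seed(42)  # assumption: seed added only so the sketch is reproducible

Df <- data.frame(
  X1 = ordered(sample(1:5, 30, replace = TRUE)),
  X2 = ordered(sample(1:5, 30, replace = TRUE)),
  X3 = as.factor(sample(1:5, 30, replace = TRUE)),
  Y  = as.factor(sample(1:5, 30, replace = TRUE))
)

cols <- names(Df)[1:2]                      # columns to re-level (X1 and X2)
new_levels <- c("5", "4", "3", "2", "1")    # desired level order

# lapply() returns a list of re-levelled columns; assigning it back to
# Df[cols] replaces only those columns and leaves X3 and Y untouched.
Df[cols] <- lapply(Df[cols], function(x) ordered(x, levels = new_levels))

levels(Df$X1)
```

Because the columns are selected by name, the same pattern should extend to any other set of columns by changing cols.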

Related

Remove Na from one column of response variables at a time and replace data when done and remove all na values from predictor variables using lapply()

I am trying to build models with lapply, iterating over each column of response variables and fitting a linear model on the predictor variables. I then pass each linear model to the stepAIC function and, after that, to the stepVIF function. This works on a dataset with no NA values, but my dataset is full of NA values, which gives me a "variable lengths differ" error when I pass the linear models to stepAIC.
This is what I have tried so far. I wrote multiple.func in an attempt to remove NA values from one column at a time, but it does not work, and I think it would end up removing all rows of data except for the fourth column because of how complete.cases() works. This is why I want to remove the NA values from only one response column at a time (the column being used in the model), and all of the NA values from the predictor variables (columns d, e and f).
data_dummy <- data.frame(first_column = c("A", "B", "c", "d", "e", "f"),
second_column = c("1", "Na", "3", "4", "5", "6"),
third_column = c("Na", "7", "3", "Na", "2", "6"),
fourth_column = c("5", "8", "3", "4", "5", "1"),
fith_column = c("2", "Na", "3", "na", "2", "6"),
sixth_column = c("5", "9", "3", "4", "na", "1")
)
g <- 3
multiple.func <- function(g) {
c(data34[complete.cases(data_dummy[[,c(g)]]),], lm(reformulate(names(data34)[4:7], response=names(data_dummy)[g]), data_dummy)) #trying to remove NA
}
full.model <- lapply(data_dummy, multiple.func)
step.model <- lapply(full.model, function(x) MASS::stepAIC(x, direction = "both", trace = FALSE)) #Fit stepwise regression model
stepmod3 <- lapply(step.model, function(x)pedometrics::stepVIF(model = x, threshold = 10, verbose = TRUE))
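A minimal sketch of the per-response filtering described above, under stated assumptions: the data here are hypothetical numeric columns (y1, y2 as responses; d, e, f as predictors) with real NA values, whereas the dummy data in the question stores the strings "Na" and "na", which R does not treat as missing. The helper name fit_one is also made up for illustration. For each response, only rows that are complete in that response and in all predictors are kept before fitting.

```r
# Hypothetical data: two response columns and three predictor columns
dat <- data.frame(
  y1 = c(1, NA, 3, 4, 5, 6, 7, 8),
  y2 = c(NA, 7, 3, NA, 2, 6, 5, 9),
  d  = c(5, 8, 3, 4, 5, 1, 2, 7),
  e  = c(2, 4, 3, 6, 2, 6, 1, NA),
  f  = c(5, 9, 3, 4, 8, 1, 6, 2)
)
responses  <- c("y1", "y2")
predictors <- c("d", "e", "f")

fit_one <- function(resp) {
  # Keep rows that are complete in this response and in every predictor;
  # NAs in the other response columns are ignored.
  keep <- complete.cases(dat[, c(resp, predictors)])
  sub  <- dat[keep, ]
  lm(reformulate(predictors, response = resp), data = sub)
}

# Iterate over response names (not over the columns themselves)
models <- lapply(setNames(responses, responses), fit_one)
sapply(models, nobs)  # number of rows actually used per model
```

Iterating over column names rather than over the data frame lets each model use its own subset of rows. Whether stepAIC can later re-evaluate these calls depends on where the subset data lives when it refits the model, which this sketch does not address.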

Loop for creating multiple new 3 level variables from another 5 level variable

I'm looking for a way to generate multiple 3-level variables from an older 5-level variable, while keeping the old variables.
This is how it is now:
structure(list(Quesiton1 = c("I", "5", "4", "4"), Question2 = c("I",
"5", "4", "4"), Question3 = c("I", "3", "2", "4")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L))
I would like this:
structure(list(Quesiton1 = c("I", "5", "4", "4"), Question2 = c("I",
"5", "4", "4"), Question3 = c("I", "3", "2", "4"), Question1_3l = c("NA",
"3", "3", "3"), Question2_3l = c("NA", "3", "3", "3"), Question3_3l = c("NA",
"2", "1", "3")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
I have this code to recode the 5-level variable
df2 %>%
mutate_at(vars(Question1, Question2, Question3), recode,'1'=1, '2'=1, '3'=3, '4'=5, '5'=5, 'l' = NA)
But what I want to do is to keep the old variable and generate the 3 level variable into something like Question1_3l, Question2_3l, Question3_3l.
It shouldn't be too difficult. In Stata it looks something like this:
foreach i of varlist ovsat-not_type_number {
local lbl : variable label `i'
recode `i' (1/2=1)(3=2)(4/5=3), gen(`i'_3l)
}
Thank you.
Not the most elegant, not the fastest (but still pretty fast), not the most idiomatic, but this does what you want (I think) and should be easy to read and customize.
dt <- structure(list(Quesiton1 = c("I", "5", "4", "4"),
Question2 = c("I", "5", "4", "4"),
Question3 = c("I", "3", "2", "4")),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -4L))
#transform your data into a data.table (requires the data.table package)
library(data.table)
setDT(dt)
#define the names of the columns that you want to recode
vartoconv <- names(dt)
#define the names of the recoded columns
newnames <- paste0(vartoconv, "_3l")
#define an index along the vector of the names of the columns to recode
for(varname_loopid in seq_along(vartoconv)){
#identify the name of the column to recode for each iteration
varname_loop <- vartoconv[varname_loopid]
#identify the name of the recoded column for each iteration
newname_loop <- newnames[varname_loopid]
#create the recoded variable by using conditionals on the variable to recode
dt[get(varname_loop) %in% c(1, 2), (newname_loop) := 1]
dt[get(varname_loop) == 3, (newname_loop) := 2]
dt[get(varname_loop) %in% c(4, 5), (newname_loop) := 3]
}
Try:
library(tidyverse)
library(stringr)
df2 <- replicate(6, sample(as.character(1:5), 50, replace = TRUE), simplify = "matrix") %>%
as_tibble(.name_repair = ~str_c("Question", 1:6))
df2 %>%
mutate_at(vars(Question1:Question3),
~case_when(.x %in% c('1', '2') ~ 1L, # 1L means integer 1
.x %in% c('3') ~ 3L,
.x %in% c('4', '5') ~ 5L,
TRUE ~ as.integer(NA)))
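If the goal is to keep the original columns and add the recoded ones alongside them, one option is the .names argument of dplyr's across(). The sketch below assumes dplyr 1.0 or later, a data frame whose columns are actually named Question1 to Question3, and the 1/2 -> 1, 3 -> 2, 4/5 -> 3 mapping from the Stata snippet in the question; df3 is just an illustrative name.

```r
library(dplyr)

# Hypothetical data mirroring the question's example
df2 <- tibble(
  Question1 = c("I", "5", "4", "4"),
  Question2 = c("I", "5", "4", "4"),
  Question3 = c("I", "3", "2", "4")
)

df3 <- df2 %>%
  mutate(across(
    Question1:Question3,
    ~ case_when(
      .x %in% c("1", "2") ~ 1L,
      .x == "3"           ~ 2L,
      .x %in% c("4", "5") ~ 3L,
      TRUE                ~ NA_integer_   # anything else (e.g. "I") becomes NA
    ),
    .names = "{.col}_3l"   # write results to new columns, keep the originals
  ))

df3
```

The original Question columns are left unchanged, and the new columns are named Question1_3l, Question2_3l and Question3_3l.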

Problem with Piping for revalue in R Studio

I would like to revalue 13 different variables. They all have character as levels right now and are supposed to be changed to values.
Individually it would work to use
x$eins <- revalue(x$eins, c("Nie Thema" = "1",
"Selten Thema" = "2",
"Manchmal Thema" = "3",
"Häufig Thema" = "4",
"Sehr häufig Thema" = "5",
"Fast immer Thema" = "6"))
With the piping, I guess it would look something like this
x %>%
dplyr::select(., eins:dreizehn) %>%
revalue(., c("Nie Thema" = "1",
"Selten Thema" = "2",
"Manchmal Thema" = "3",
"Häufig Thema" = "4",
"Sehr häufig Thema" = "5",
"Fast immer Thema" = "6"))
With this, I get the warning message from revalue, that x is not a factor or a character vector.
What am I doing wrong?
Thanks in advance.
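Use across() to apply a function to multiple columns. revalue() comes from plyr and expects a single vector, so it has to be called once per column rather than on the whole data frame.
library(dplyr)
x <- x %>%
  dplyr::mutate(across(eins:dreizehn, ~plyr::revalue(., c("Nie Thema" = "1",
                                                           "Selten Thema" = "2",
                                                           "Manchmal Thema" = "3",
                                                           "Häufig Thema" = "4",
                                                           "Sehr häufig Thema" = "5",
                                                           "Fast immer Thema" = "6"))))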

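For comparison, the same per-column recoding can be sketched in base R with a named lookup vector, which makes it clearer why passing the whole data frame to a function that expects one vector fails. The two-column data frame and the names lookup and cols are assumptions for illustration only; the real data has 13 columns.

```r
# Hypothetical two-column stand-in for the real data
x <- data.frame(
  eins = c("Nie Thema", "Selten Thema", "Fast immer Thema"),
  zwei = c("Häufig Thema", "Manchmal Thema", "Sehr häufig Thema"),
  stringsAsFactors = FALSE
)

# Named vector: names are the old labels, values are the new codes
lookup <- c("Nie Thema" = "1",
            "Selten Thema" = "2",
            "Manchmal Thema" = "3",
            "Häufig Thema" = "4",
            "Sehr häufig Thema" = "5",
            "Fast immer Thema" = "6")

cols <- c("eins", "zwei")

# Recode one column (one vector) at a time, then assign the list back
x[cols] <- lapply(x[cols], function(col) unname(lookup[as.character(col)]))
x
```

Labels that are missing from lookup would come back as NA, so it may be worth checking for unexpected NA values afterwards.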
Reorder Stacked Bar Chart

Newbie R coder here. I have a stacked bar chart in base R that I'd like to reorder numerically by question type (Question 1 Pre, Question 1 Post, Question 2 Pre, Question 2 Post, etc.).
It's probably a fairly simple fix but I can't seem to get the reorder function to work. The other questions on reordering don't quite get to my solution. Maybe reorder isn't the right way to go about it?
Attached my graph and base code. Thank you so much! I appreciate your kind help.
if(!require(psych)){install.packages("psych")}
if(!require(likert)){install.packages("likert")}
library(readxl)
setwd("MSSE 507 Capstone Data Analysis/")
read_xls("ProcessDataMSSE.xls")
Data = read_xls("ProcessDataMSSE.xls")
str(Data) # tbl_df, tbl, and data.frame classes
### Change Likert scores to factor and specify levels; factors because numeric values are ordinal
Data <- Data[, c(3:26)] # Get rid of the other columns! (Drop multiple columns)
# Build the 24 column names (`1Pre`, `1Post`, ..., `12Pre`, `12Post`)
# and convert each of those columns to an ordered factor with levels 1 to 4
likert_cols <- paste0(rep(1:12, each = 2), c("Pre", "Post"))
Data[likert_cols] <- lapply(Data[likert_cols], factor,
                            levels = c("1", "2", "3", "4"),
                            ordered = TRUE)
Data <- factor(Data,levels=Data[3:26])
Data
### Double check the data frame
library(psych) # Loads psych package
headTail(Data) # Displays last few and first few data
str(Data) # Shows structure of an object (observations and variables, etc.) - in this case, ordinal factors with 4 levels (1 through 4)
summary(Data) # Summary of the number of times you see a data point
Data$`1Pre` # This allows us to check how many data points are really there
str(Data)
### Remove unnecessary objects, removing the data frame in this case (we've converted that data frame into a table with the read.table function above)
library(likert)
Data <- as.data.frame(Data) # Makes the tibble a data frame
likert(Data) # This will give the percentage responses for each level and group
Result = likert(Data)
summary(Result) # This will give the mean and SD
plot(Result,
main = "Pre and Post Treatment Percentage Responses",
ylab="Questions",
type="bar")
I largely agree with @DzimitryM's solution. It is unclear to me, however, whether it really works as written. In my solution, I need to use the names of the items variable of the likert result, not of the data.frame as such. A comment at the bottom of the code below highlights this.
Anyway, this is why I made a working example with executable code.
I am aware that it could be improved; my focus was on executability.
library(likert)
### mimic some of your data, with 'accepted' naming
Data <- data.frame(
C01Pre = as.character(c( rep(1, 10), rep(2, 60), rep(3, 25), rep(4, 5) )),
C01Post = as.character(c( rep(1, 25), rep(2, 52), rep(3, 21), rep(4, 2) )),
C02Pre = as.character(c( rep(1, 25), rep(2, 68), rep(3, 5), rep(4, 2) )),
C02Post = as.character(c( rep(1, 30), rep(2, 53), rep(3, 13), rep(4, 4) )),
C03Pre = as.character(c( rep(1, 20), rep(2, 52), rep(3, 25), rep(4, 3) )),
C03Post = as.character(c( rep(1, 20), rep(2, 39), rep(3, 35), rep(4, 6) ))
)
### coerce to ordered factor
Data$C01Pre <- factor(Data$C01Pre, levels = c("1", "2", "3", "4"), ordered = TRUE)
Data$C01Post <- factor(Data$C01Post, levels = c("1", "2", "3", "4"), ordered = TRUE)
Data$C02Pre <- factor(Data$C02Pre, levels = c("1", "2", "3", "4"), ordered = TRUE)
Data$C02Post <- factor(Data$C02Post, levels = c("1", "2", "3", "4"), ordered = TRUE)
Data$C03Pre <- factor(Data$C03Pre, levels = c("1", "2", "3", "4"), ordered = TRUE)
Data$C03Post <- factor(Data$C03Post, levels = c("1", "2", "3", "4"), ordered = TRUE)
Result = likert(Data)
### show the "natural" order when processed by likert()
summary(Result)
# Item low neutral high mean sd
# 6 C03Post 59 0 41 2.27 0.8510837
# 1 C01Pre 70 0 30 2.25 0.7017295
# 5 C03Pre 72 0 28 2.11 0.7506899
# 2 C01Post 77 0 23 2.00 0.7385489
# 4 C02Post 83 0 17 1.91 0.7666667
# 3 C02Pre 93 0 7 1.84 0.5983141
plot(Result,
group.order = names(Result$items)) ## this is the key!
## difference with other answer is:
## names of the "items" variable of the df
## not the data.frame itself
This results in the following graph:
A group.order argument can be passed to plot() to get a plot that is ordered by the column names of the initial dataset:
plot(Result,
group.order = names(Data),
type="bar")
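If the columns are not already in the desired order, the order vector itself can be built from the names. This is a small base R sketch, assuming item names of the form number followed by Pre or Post (as in the question); item_names here is a made-up, shuffled example. The resulting vector could then be passed as group.order, as in the answers above.

```r
# Hypothetical, shuffled item names in the question's naming scheme
item_names <- c("10Post", "1Pre", "2Post", "1Post", "10Pre", "2Pre")

# Question number: strip the trailing "Pre"/"Post" and convert to integer,
# so that 10 sorts after 2 (plain character sorting would not do this)
q_num <- as.integer(sub("(Pre|Post)$", "", item_names))

# Phase: strip the leading digits; factor levels put "Pre" before "Post"
phase <- factor(sub("^[0-9]+", "", item_names), levels = c("Pre", "Post"))

ordered_names <- item_names[order(q_num, phase)]
ordered_names
```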

Regarding the merge of two dataframes

I have a lot of data structured as shown below. In total there are 13 data frames like the one shown, all with the same columns.
Example of data
Each data frame has about 500,000 rows and 106 columns. I want to combine them in the following way:
If the first AND second columns of a row in df1 equal the first and second columns of a row in df2, I want to add the two rows together; otherwise I want to append the row to the data frame.
I have written the following code for a minimal example (it gives me the wanted result, but it will not work at the scale I am working at):
dput(df1[,1:5 ])
structure(list(C5id = c("100110", "100110", "100110", "100110",
"100100", "100100", "100100", "100100", "100100", "100100"),
Retnavn = c("Braiserede kæber af gris, tomat-skysovs, kartofler, ovnbagte bønner med bacon",
"Braiseret okseinderlår, skysovs, kartofler, marinerede rødløg med hyldeblomst",
"Cremet champignonsuppe", "Forårsfrikassé med kalv, asparges og forårsløg, kartofler, broccoli",
"Hakkebøf, bearnaisesauce, kartofler, ærter", "Farsbrød med gulerødder og ærter, legeret sovs, kartofler og romanescokål",
"Fiskefrikadeller med persillesovs, kartofler og juliennegrønt",
"Fiskefrikadeller med remouladesovs, kartofler og juliennegrønt",
"Forloren hare med vildtsovs, kartofler og tyttebærsylt",
"Frikadeller med skysovs, kartofler og sellerichutney"),
a2018uge2 = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2), a2018uge3 = c("2",
"2", "2", "2", "2", "2", "2", "2", "2", "2"), a2018uge4 = c("2",
"2", "2", "2", "2", "2", "2", "2", "2", "2")), class = "data.frame", row.names = 4:13)
> dput(df2[,1:5 ])
structure(list(C5id = c("100110", "100110", "100100", "100100",
"100100", "100100", "100100", "100100", "100100", "100100", "100110",
"100110", "100100", "100100", "100100", "100100", "100100"),
Retnavn = c("Braiserede kæber af gris, tomat-skysovs, kartofler, ovnbagte bønner med bacon",
"Braiseret okseinderlår, skysovs, kartofler, marinerede rødløg med hyldeblomst",
"Cremet champignonsuppe", "Forårsfrikassé med kalv, asparges og forårsløg, kartofler, broccoli",
"Hakkebøf, bearnaisesauce, kartofler, ærter", "Hamburgerryg, flødekartofler, blomkål, broccoli og romanesco",
"Kylling i karrysovs med æbler og ingefær, kartofler, cherrytomater med løg",
"Kylling i sur-sød sovs med peberfugt, kartofler og broccoli",
"Kyllingefrikassé med kartofler", "Lammesteg, flødekartofler, ovnbagte grønne bønner med bacon",
"Cremet champignonsuppe", "Forårsfrikassé med kalv, asparges og forårsløg, kartofler, broccoli",
"Farsbrød med gulerødder og ærter, legeret sovs, kartofler og romanescokål",
"Fiskefrikadeller med persillesovs, kartofler og juliennegrønt",
"Fiskefrikadeller med remouladesovs, kartofler og juliennegrønt",
"Forloren hare med vildtsovs, kartofler og tyttebærsylt",
"Frikadeller med skysovs, kartofler og sellerichutney"),
a2018uge2 = c(3, 3, 1, 1, 3, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
2, 2), a2018uge3 = c("3", "3", "1", "1", "3", "1", "1", "1",
"1", "1", "2", "2", "2", "2", "2", "2", "2"), a2018uge4 = c("3",
"3", "1", "1", "3", "1", "1", "1", "1", "1", "2", "2", "2",
"2", "2", "2", "2")), class = "data.frame", row.names = c("5",
"6", "7", "8", "9", "10", "11", "12", "13", "14", "61", "71",
"91", "101", "111", "121", "131"))
df2_before = df2
hej=c()
for (i in 1:length(df2$C5id)) {
for (j in 1:length(df1$C5id)) {
if (df2$C5id[i] == df1$C5id[j] && df2$Retnavn[i] == df1$Retnavn[j]) {
df2[j, 3:8 ] <- as.numeric(df2[i,3:8 ]) + as.numeric(df1[j,3:8 ])
hej=c(hej,j)
#df1 = df1[-i, ]
}
}
cat("vi er kommet til:",i,",",j,"\n")
}
df2=rbind(df2,df1[-hej,])
where df1 and df2 are the two data frames. My problem is that this has to loop through 500,000 * 500,000 combinations. I have 13 data frames of this size in total that have to be combined, so it would take an absolute eternity.
I was hoping there would be some vectorized way to do this that might finish before the fall of 2030.
Best regards
PS: I understand that the way I inserted the data in this post might not be ideal, but it was the best I could think of.
PPS: I have edited the question in response to MKR's comment.
I suggest the following :
library(data.table)
df1 <- data.table::setDT(df1)
df2 <- data.table::setDT(df2)
data.table::setkeyv(df1, c("C5id","Retnavn"))
data.table::setkeyv(df2, c("C5id","Retnavn"))
new_df2 <- merge(df1,df2, all.y = TRUE)
cols <- names(new_df2[,3:ncol(new_df2)])
new_df2[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]
new_df2[, (cols) := lapply(.SD, function(i)
tidyr::replace_na(i,0)), .SDcols = cols]
sapply(new_df2, class)
The value columns have now been converted to numeric:
C5id Retnavn a2018uge2.x a2018uge3.x a2018uge4.x a2018uge2.y a2018uge3.y a2018uge4.y
"character" "character" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
Then, building on this question: "R: merging columns and the values if they have the same column name", using @bgoldst's solution:
# First I strip the ".x" / ".y" suffixes so that matching columns share the same name
# (the dot is escaped and anchored to the end so only the merge suffix is removed):
names(new_df2) <- stringr::str_replace(names(new_df2), "\\.[xy]$", "")
temp = do.call(cbind,lapply(split(as.list(new_df2[,3:ncol(new_df2)]),
names(new_df2[,3:ncol(new_df2)])),
function(x) Reduce(`+`,x)));
new_df2 <- cbind(new_df2[,1:2],temp)
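An alternative way to think about the problem, offered as a sketch rather than a tested solution for the full data: "add matching rows together, otherwise append" is the same as stacking the data frames and summing every value column within each (C5id, Retnavn) group. The tiny data frames below are hypothetical stand-ins with shortened Retnavn values, and stacked, value_cols and combined are illustrative names. Note that a key pair that occurs more than once within a single data frame would also be summed, which may or may not be wanted.

```r
# Hypothetical miniature versions of df1 and df2
df1 <- data.frame(C5id = c("100110", "100100"),
                  Retnavn = c("A", "B"),
                  a2018uge2 = c(2, 2),
                  a2018uge3 = c("2", "2"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(C5id = c("100110", "100100"),
                  Retnavn = c("A", "C"),
                  a2018uge2 = c(3, 1),
                  a2018uge3 = c("3", "1"),
                  stringsAsFactors = FALSE)

# Stack the data frames; with 13 of them, do.call(rbind, list_of_dfs) would do the same
stacked <- rbind(df1, df2)

# Some value columns are stored as character, so convert them all to numeric
value_cols <- setdiff(names(stacked), c("C5id", "Retnavn"))
stacked[value_cols] <- lapply(stacked[value_cols], as.numeric)

# Sum every value column within each (C5id, Retnavn) group
combined <- aggregate(stacked[value_cols],
                      by = stacked[c("C5id", "Retnavn")],
                      FUN = sum)
combined
```

On data of the size described, base aggregate() may be slow; the same stack-then-sum idea can be expressed with data.table's grouping, which is likely to scale better.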
