I've been struggling with finding a way to evaluate an R expression in an environment constructed from data. I have a dataframe:
head(DATA1)
COD_CLI ENDEUD_FINAL
1 01002901 Mediana Empresa
2 01002932 No Sobreendeudado
3 04203409 No Sobreendeudado
...
and I am trying to fill in a second data frame (DATA2):
head(DATA2)
COD_CLI_W ENDEUD_FINAL
1 01002190
2 01002913
3 04203401
...
DATA2 is larger than DATA1. If the same COD_CLI/COD_CLI_W appears in both data frames, I take the second column of DATA1; if not, I must look the client up in another data frame, "wallet":
> str(wallet)
'data.frame': 81101 obs. of 8 variables:
$ COD_CREDITO : chr "0040410166104" "00000363393" "0060030164135" "004023854M" ...
$ COD_CLI : chr "00402037" "00166750" "00178607" "40097700" ...
$ TIPO_DE_CREDITO : chr "12.-CONSUMO NO REVOLVENTE" "10.-MICROEMPRESA" "10.-MICROEMPRESA" "10.-MICROEMPRESA" ...
$ SITUACION_SAFI : chr "CASTIGADO" "CASTIGADO" "CASTIGADO" "CASTIGADO" ...
$ COD_TIP_PRESTAMO: chr "0747" "0748" "0748" "0747" ...
$ ATR_SOL : num 0 0 0 0 0 0 0 0 0 0 ...
$ CAP_SOL : num 313.37 3.16 1670.51 3010 2327.71 ...
$ NUM_ENT : num 3 1 2 1 1 3 2 1 4 2 ...
Now the code I run is:
DATA2 <- within(DATA2,{
CALIF_RCD <- ifelse(COD_CLI_W %in% DATA1$COD_CLI,DATA1$ENDEUD_FINAL[which(DATA1$COD_CLI %in% COD_CLI_W)],
ifelse(wallet$TIPO_DE_CREDITO[which(wallet$COD_CLI %in% COD_CLI_W)[1]] == "08.-MEDIANA EMPRESA","Mediana Empresa",
ifelse(wallet$NUM_ENT[which(wallet$COD_CLI %in% COD_CLI_W)[1]]<5,"No Sobreendeudado","Sobreendeudado")))
}
)
The output is wrong in most cases. I'm new to R and would like to know how to code this properly. Any help would be much appreciated.
I took the first approach using "merge":
DATA3 <- merge(DATA1, DATA2, by.x = "COD_CLI", by.y = "COD_CLI_W", all.y=TRUE)
DATA3 <- DATA3[!complete.cases(DATA3),]
After that, I analysed the left outer side in DATA3:
w = NULL
for(i in 1:length(DATA3$ENDEUD_FINAL))
{
w = which(wallet$COD_CLI %in% DATA3$COD_CLI[i])[1]
DATA3$ENDEUD_FINAL[i] <- ifelse(wallet$TIPO_DE_CREDITO[w] == "08.-MEDIANA EMPRESA","Mediana Empresa",
ifelse(wallet$NUM_ENT[w]<5,"No Sobreendeudado","Sobreendeudado"
))
}
and finally "rbind" DATA1 and DATA3:
DATA2 <- rbind(DATA1, DATA3)
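For reference, the same fill-in logic can be sketched without a loop using match(), which returns the position of the first match (or NA) for each key. The tiny data frames below are invented for illustration and only reuse the column names from the question; this is a sketch under those assumptions, not a tested replacement for the code above.

```r
# Toy data, invented for illustration; only the column names follow the question.
DATA1 <- data.frame(COD_CLI = c("01002901", "01002932"),
                    ENDEUD_FINAL = c("Mediana Empresa", "No Sobreendeudado"),
                    stringsAsFactors = FALSE)
DATA2 <- data.frame(COD_CLI_W = c("01002932", "00402037", "00166750", "01002901"),
                    stringsAsFactors = FALSE)
wallet <- data.frame(COD_CLI = c("00402037", "00166750", "00166750"),
                     TIPO_DE_CREDITO = c("08.-MEDIANA EMPRESA", "10.-MICROEMPRESA",
                                         "10.-MICROEMPRESA"),
                     NUM_ENT = c(3, 6, 1),
                     stringsAsFactors = FALSE)

# For each DATA2 key: row of its first match in the lookup table, NA if absent.
i1 <- match(DATA2$COD_CLI_W, DATA1$COD_CLI)
iw <- match(DATA2$COD_CLI_W, wallet$COD_CLI)

# Fallback classification taken from the first matching wallet row.
from_wallet <- ifelse(wallet$TIPO_DE_CREDITO[iw] == "08.-MEDIANA EMPRESA",
                      "Mediana Empresa",
                      ifelse(wallet$NUM_ENT[iw] < 5,
                             "No Sobreendeudado", "Sobreendeudado"))

# Prefer DATA1 where the client exists there, otherwise use the wallet result.
DATA2$CALIF_RCD <- ifelse(!is.na(i1), DATA1$ENDEUD_FINAL[i1], from_wallet)
```

Clients found in neither DATA1 nor wallet end up as NA, which makes them easy to spot afterwards.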
I am attempting to do a survival analysis, with "time to loss to follow up".
I have tried to fix the error by ensuring that the column is numeric (see stringsAsFactors and colClasses in the .csv read call below), but that has not solved it.
I have trawled Stack Overflow and other sites for answers, but I am stuck.
Can anyone help, please?
library(tidyverse)
library(gtsummary)
library(data.table)
library(tidyr)
library(dplyr)
library(survival)
survdat <- fread("221121_HBV_Followup_survivalanalysis.csv", stringsAsFactors=FALSE,
colClasses = c("Time to LTFU"="numeric"))
#Create censoring variable (right censoring)
survdat$censored[survdat$`LTFU confirmed` == 'Yes']<- 1
survdat$censored[survdat$`LTFU confirmed` == 'No'] <-0
#specify KM analysis model
km1 <- survfit(Surv('Time to LTFU', censored) ~ 1,
data=survdat,
type="kaplan-meier")
#I get the following error
> km1 <- survfit(Surv('Time to LTFU', censored) ~ 1,
+ data=survdat,
+ type="kaplan-meier")
Error in Surv("Time to LTFU", censored) : Time variable is not numeric
str(survdat)
NB Have removed some of the variables for confidentiality
Classes ‘data.table’ and 'data.frame': 43 obs. of 10 variables:
$ Date screened : chr "19/10/2021" "07/07/2021" "18/01/2022" "07/05/2021" ...
$ Last date seen : chr "21/11/2022" "21/11/2022" "21/11/2022" "21/11/2022" ...
$ Time to LTFU : num 398 502 307 563 564 605 516 29 118 118 ...
$ LTFU confirmed : chr "No" "No" "No" "No" ...
$ censored : num 0 0 0 0 0 0 0 1 1 0 ...
As you can see, the "Time to LTFU" variable IS numeric!
Please help!
Thanks
Time to LTFU needs to be between backticks, not single quotes, otherwise you are supplying a string (character variable) to the function.
km1 <- survfit(Surv(`Time to LTFU`, censored) ~ 1,
data=survdat,
type="kaplan-meier")
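The difference between the two quoting styles can be seen without the survival package. The sketch below uses a made-up three-row data frame (only the column name comes from the question) and checks what each form evaluates to:

```r
# Toy data with a non-syntactic column name (values chosen for illustration).
survdat <- data.frame(`Time to LTFU` = c(398, 502, 29), check.names = FALSE)

# Single quotes create a length-one character string, not a column reference.
quoted <- with(survdat, 'Time to LTFU')
is.numeric(quoted)      # FALSE

# Backticks refer to the column itself.
backticked <- with(survdat, `Time to LTFU`)
is.numeric(backticked)  # TRUE
```

Surv() sees the same thing: with single quotes it receives the string, which is why it reports that the time variable is not numeric.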
I have a list that contains multiple data.frames. I want to select every nth data.frame from the list and combine them into a single data.frame which can be written to a csv.
Here is an example of the list structure:
one.title <- data.frame(id = '1a', title = 'first title')
one.author <- data.frame(first_name = c('Susan', 'Alice'),
last_name = c('Smith', 'Johnson') )
second.title <- data.frame(id = '2b', title = 'second_title')
second.author <- data.frame(first_name = c('Sarah', 'Mary'),
last_name = c('Davis', 'Proctor') )
one.list <- list()
one.list[[1]]$title <- one.title
one.list[[1]]$author <- one.author
one.list[[2]]$title <- second.title
one.list[[2]]$author <- second.author
Here's my current solution that produces a single data frame for the 'authors' fields:
build_author_table <- function(result.l){
list_to_df <- function(i){
x <- result.l[[i]]$author
return(x)
}
authors_df_l <-(lapply(1:length(result.l), FUN = list_to_df))
authors_df <- do.call("rbind", lapply(authors_df_l, as.data.frame))
return(authors_df)
}
This produces the output I want:
first_name last_name
1 Susan Smith
2 Alice Johnson
3 Sarah Davis
4 Mary Proctor
But as you can probably imagine, when scaled to thousands of records with much larger text fields in the data.frame, it is painfully slow.
Can anyone suggest a faster, more efficient way to produce the final data.frame?
Your construction code didn't work, but I built one that I think resembles what you're shooting at.
List of 2
$ :List of 2
..$ title :'data.frame': 1 obs. of 2 variables:
.. ..$ id : Factor w/ 1 level "1a": 1
.. ..$ title: Factor w/ 1 level "first title": 1
..$ author:'data.frame': 2 obs. of 2 variables:
.. ..$ first_name: Factor w/ 2 levels "Alice","Susan": 2 1
.. ..$ last_name : Factor w/ 2 levels "Johnson","Smith": 2 1
$ :List of 2
..$ title :'data.frame': 1 obs. of 2 variables:
.. ..$ id : Factor w/ 1 level "2b": 1
.. ..$ title: Factor w/ 1 level "second_title": 1
..$ author:'data.frame': 2 obs. of 2 variables:
.. ..$ first_name: Factor w/ 2 levels "Mary","Sarah": 2 1
.. ..$ last_name : Factor w/ 2 levels "Davis","Proctor": 1 2
If this is what you had in mind, the following works splendidly. You do get warnings because the character strings are factors; these can be ignored, or you can pass stringsAsFactors = FALSE when building the initial data frames.
library(purrr)
map_dfr(one.list, "author")
Here's a better solution (benchmarked):
data.table::rbindlist(lapply(one.list, "[[", "author"))
The purrr solution is pretty, but not that fast. Benchmark results:
microbenchmark(build_author_table(one.list),
data.table::rbindlist(lapply(one.list, "[[", "author")),
map_dfr(one.list, "author"))
Unit: microseconds
expr min lq mean median uq max neval cld
build_author_table(one.list) 170.693 190.9460 239.2987 206.4505 272.3815 494.477 100 a
data.table::rbindlist(lapply(one.list, "[[", "author")) 69.562 88.5590 270.4926 99.1750 152.6735 15068.116 100 a
map_dfr(one.list, "author") 214.832 245.2825 2374.5980 281.3210 340.1270 206562.846 100 a
Try this:
one.title <- data.frame(id = '1a', title = 'first title')
one.author <- data.frame(first_name = c('Susan', 'Alice'),
last_name = c('Smith', 'Johnson') )
second.title <- data.frame(id = '2b', title = 'second_title')
second.author <- data.frame(first_name = c('Sarah', 'Mary'),
last_name = c('Davis', 'Proctor') )
one.list <- list(
list(title = one.title, author = one.author),
list(title = second.title, author = second.author)
)
authors_df_l = lapply(one.list, function(item) item$author)
do.call("rbind",authors_df_l)
I am at a loss. Googling has failed me because I'm not sure I know the right question to ask.
I have a data frame (df1) and my goal is to use a function to get a moving average using forecast::ma.
Here is str(df1)
'data.frame': 934334 obs. of 6 variables:
$ clname : chr ...
$ dos : Date, format: "2011-10-05" ...
$ subpCode: chr
$ ch1 : chr "
$ prov : chr
$ ledger : chr
I have a function that I am trying to write.
process <- function(df, y, sub, ...) {
prog <- df %>%
filter(subpCode == sub) %>%
group_by(dos, subpCode) %>%
summarise(services = n())
prog$count_ts <- ts(prog[ , c('services')])
}
The problem is that when I run the function, my final result is a data object that is 1x1798 and is just a time series. If I run the code line by line I get what I need, but my function, which should do the same thing, won't work.
Here is my desired result
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 1718 obs. of 4 variables:
$ dos : Date, format: "2010-09-21" "2010-11-18" "2010-11-19" "2010-11-30" ...
$ subpCode: chr "CII " "CII " "CII " "CII " ...
$ services: int 1 1 2 2 2 2 1 2 1 3 ...
$ count_ts: Time-Series [1:1718, 1] from 1 to 1718: 1 1 2 2 2 2 1 2 1 3 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "services"
- attr(*, "vars")= chr "dos"
- attr(*, "drop")= logi TRUE
And here is the code that gets it.
CII <- df1 %>%
filter(subpCode == "CII ") %>%
group_by(dos, subpCode) %>%
summarise(services = n())
CII$count_ts <- ts(CII[ , c('services')])
Could someone point me in the right direction? I've exhausted my usual places.
Thanks!
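Following up on the vignette pointed out by @CalumYou: because sub is passed as an ordinary character value, it needs no quoting inside filter(). The function also has to return prog explicitly, since its last expression is an assignment. Something like this should work:
process <- function(df, sub) {
## Piping stuff; sub is a plain string, so it can be compared directly
prog <- df %>%
filter(subpCode == sub) %>%
group_by(dos, subpCode) %>%
summarise(services = n())
prog$count_ts <- ts(prog[ , c('services')])
## Returning the prog object
return(prog)
}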
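Why the explicit return matters can be shown in base R alone. In this small sketch (f, g and d are hypothetical names used only for illustration), a function whose last expression is an assignment returns the assigned value, not the object that was modified:

```r
# f() ends with an assignment, so it returns the assigned value (invisibly),
# not the whole data frame.
f <- function(df) {
  df$doubled <- df$x * 2
}

# g() ends with the data frame itself, so that is what it returns.
g <- function(df) {
  df$doubled <- df$x * 2
  df
}

d <- data.frame(x = 1:3)
res_f <- f(d)   # just the numeric vector of doubled values
res_g <- g(d)   # data frame with columns x and doubled
```

This matches the symptom in the question: the original process() handed back only the time series assigned on its last line.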
I have a set of data frames belonging to many countries consisting of 3 variables (year, AI, OAD). The example for Zimbabwe is shown as below,
>str(dframe_Zimbabwe_1955_1970)
'data.frame': 16 obs. of 3 variables:
$ year: chr "1955" "1956" "1957" "1958" ...
$ AI : chr "11.61568161" "11.34114927" "11.23639317" "11.18841409" ...
$ OAD : chr "5.740789488" "5.775882473" "5.800441036" "5.822536579" ...
I am trying to change the data type of the variables in data frame to below so that I can model the linear fit using lm(dframe_Zimbabwe_1955_1970$AI ~ dframe_Zimbabwe_1955_1970$year).
>str(dframe_Zimbabwe_1955_1970)
'data.frame': 16 obs. of 3 variables:
$ year: int 1955 1956 1957 1958 ...
$ AI : num 11.61568161 11.34114927 11.23639317 11.18841409 ...
$ OAD : num 5.740789488 5.775882473 5.800441036 5.822536579 ...
The static code below is able to change AI from character (chr) to numeric (num).
dframe_Zimbabwe_1955_1970$AI <- as.numeric(dframe_Zimbabwe_1955_1970$AI)
However when I tried to automate the code as below, AI still remains as character (chr)
countries <- c('Zimbabwe', 'Afghanistan', ...)
for (country in countries) {
assign(paste('dframe_',country,'_1955_1970$AI', sep=''), eval(parse(text = paste('as.numeric(dframe_',country,'_1955_1970$AI)', sep=''))))
}
Can you advise what I could have done wrong?
Thanks.
42: Your code doesn't work as written, but with some edits it will. In addition to the missing parentheses and wrong sep, you can't use $'column name' in assign, but you don't need it anyway:
for (country in countries) {
new_val <- get(paste( 'dframe_',country,'_1955_1970', sep=''))
new_val[] <- lapply(new_val, as.numeric) # the '[]' on LHS keeps dataframe
assign(paste('dframe_',country,'_1955_1970', sep=''), new_val)
remove(new_val)
}
Proof that it works:
dframe_Zimbabwe_1955_1970 <- data.frame(year = c("1955", "1956", "1957"),
AI = c("11.61568161", "11.34114927", "11.23639317"),
OAD = c("5.740789488", "5.775882473", "5.800441036"),
stringsAsFactors = F)
str(dframe_Zimbabwe_1955_1970)
'data.frame': 3 obs. of 3 variables:
$ year: chr "1955" "1956" "1957"
$ AI : chr "11.61568161" "11.34114927" "11.23639317"
$ OAD : chr "5.740789488" "5.775882473" "5.800441036"
countries <- 'Zimbabwe'
for (country in countries) {
new_val <- get(paste( 'dframe_',country,'_1955_1970', sep=''))
new_val[] <- lapply(new_val, as.numeric) # the '[]' on LHS keeps dataframe
assign(paste('dframe_',country,'_1955_1970', sep=''), new_val)
remove(new_val)
}
str(dframe_Zimbabwe_1955_1970)
'data.frame': 3 obs. of 3 variables:
$ year: num 1955 1956 1957
$ AI : num 11.6 11.3 11.2
$ OAD : num 5.74 5.78 5.8
It's going to be considered fairly ugly code by the purists, but perhaps this:
for (country in countries) {
new_val <- get(paste('dframe_',country,'_1955_1970', sep=''))
new_val[] <- lapply(new_val, as.numeric) # the '[]' on LHS keeps dataframe
assign(paste('dframe_',country,'_1955_1970', sep=''), new_val)
}
Using the get('obj_name') function is considered cleaner than eval(parse(text=...)). It would get handled more R-naturally had you assembled these dataframes in a list.
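A minimal sketch of the list-based alternative mentioned above, assuming the data frames are stored in a named list (here called dframes, a hypothetical name). The Zimbabwe values come from the question; the Afghanistan rows are invented for illustration.

```r
# One named list instead of many separately named data frames.
dframes <- list(
  Zimbabwe = data.frame(year = c("1955", "1956", "1957"),
                        AI = c("11.61568161", "11.34114927", "11.23639317"),
                        OAD = c("5.740789488", "5.775882473", "5.800441036"),
                        stringsAsFactors = FALSE),
  # Invented values, for illustration only.
  Afghanistan = data.frame(year = c("1955", "1956", "1957"),
                           AI = c("10.5", "10.9", "11.0"),
                           OAD = c("5.1", "5.2", "5.4"),
                           stringsAsFactors = FALSE)
)

# Convert every column of every data frame; `df[] <-` keeps the data frame.
dframes <- lapply(dframes, function(df) {
  df[] <- lapply(df, as.numeric)
  df
})

# Fit one linear model per country, with no get()/assign() needed.
fits <- lapply(dframes, function(df) lm(AI ~ year, data = df))
```

Individual results are then available by name, for example fits$Zimbabwe.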
I have a data frame whose str is:
'data.frame': 334 obs. of 6 variables:
$ Patient_ID : int 524451 517060 518025 515768 499994
$ Camp_Start_Date : Date, format: "2003-08-16" "2005-02-15" "2005-02-15" ...
$ Camp_End_Date : Date, format: "2003-08-20" "2005-02-18" "2005-02-18" ...
$ First_Interaction: Date, format: "2003-08-16" "2004-10-03" "2005-02-17" ...
I am using this to create a new column pRegDate
RegDatelogicLUT <- RegDatelogicLUT %>%
mutate(pRegDate = if_else(between(First_Interaction, Camp_Start_Date, Camp_End_Date), First_Interaction, Camp_Start_Date)
)
This gives the error:
Error: expecting a single value
Any help will be appreciated.
Thanks
There is a nice lubridate solution for this problem:
library(lubridate)
RegDatelogicLUT <- RegDatelogicLUT %>%
mutate(pRegDate = if_else(First_Interaction %within% c(Camp_Start_Date %--% Camp_End_Date),
First_Interaction, Camp_Start_Date))
# Patient_ID Camp_Start_Date Camp_End_Date First_Interaction pRegDate
#1 524451 2003-08-16 2003-08-20 2003-08-16 2003-08-16
#2 517060 2005-02-15 2005-02-18 2004-10-03 2005-02-15
#3 518025 2005-02-15 2005-02-18 2005-02-17 2005-02-17
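The same result can also be sketched in base R with plain vectorised comparisons, which sidesteps between() entirely. The error is likely due to the between() in the dplyr version used expecting single-value bounds, though that is an assumption. The toy data frame d below mirrors the three rows shown above.

```r
# Toy data mirroring the three rows in the output above.
d <- data.frame(
  Camp_Start_Date   = as.Date(c("2003-08-16", "2005-02-15", "2005-02-15")),
  Camp_End_Date     = as.Date(c("2003-08-20", "2005-02-18", "2005-02-18")),
  First_Interaction = as.Date(c("2003-08-16", "2004-10-03", "2005-02-17"))
)

# Element-wise test: does the first interaction fall inside the camp period?
inside <- d$First_Interaction >= d$Camp_Start_Date &
  d$First_Interaction <= d$Camp_End_Date

# Start from the camp start date and overwrite where the interaction is inside.
# Assigning by index keeps the Date class, which base ifelse() would drop.
d$pRegDate <- d$Camp_Start_Date
d$pRegDate[inside] <- d$First_Interaction[inside]
```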