R: How do you run a function to get multiple columns?

So my data looks like this
id first middle last Age
1 Carol Jenny Smith 15
2 Sarah Carol Roberts 20
3 Josh David Richardson 22
I have a function that creates a new column indicating, for each row, whether a given name was found in the columns I specified (the 2nd-4th, i.e. 'first' through 'last' columns). It produces the result below,
funname <- function(df, cols, value, newcolumn) {
  df$newcolumn <- as.integer(rowSums(df[cols] == value) > 0)
}
id first middle last Age Carol
1 Carol Jenny Smith 15 1
2 Sarah Carol Roberts 20 1
3 Josh David Richardson 22 0
But my real data is more complicated and I want to create at least 20 new, different columns (ex: Carol, Robert, Jenny, Anna, Richard, Daniel, Eric...)
So how can I incorporate multiple new columns into the existing function?
The only thing I can think of is adding arguments like function(df, cols, value, newcolumn1, newcolumn2, newcolumn3, ...), but that would be impossible if I later want, say, a hundred columns. Any help? Thank you in advance! :)
EDIT:
function(df, cols, value, newcol) {
  df[[newcol]] <- as.integer(rowSums(df[cols] == value) > 0)  # [[ ]] so the name passed in newcol is used
  df
}
I read the comments below, but let me rephrase my question:
How would I map this function so that I can generate multiple columns with the new names I want to assign?

I think this is just one giant table operation if you get your data converted to two long vectors, one representing row number and the other the value:
tab <- as.data.frame.matrix(table(row(dat[2:4]), unlist(dat[2:4])))
cbind(dat, tab)
# id first middle last Age Carol David Jenny Josh Richardson Roberts Sarah Smith
#1 1 Carol Jenny Smith 15 1 0 1 0 0 0 0 1
#2 2 Sarah Carol Roberts 20 1 0 0 0 0 1 1 0
#3 3 Josh David Richardson 22 0 1 0 1 1 0 0 0
This method would also allow you to map the new output columns to variations of the names if required:
tab <- as.data.frame.matrix(table(row(dat[2:4]), unlist(dat[2:4])))
dat[paste0(colnames(tab),"_n")] <- tab
dat
# id first middle last Age Carol_n David_n Jenny_n Josh_n Richardson_n Roberts_n Sarah_n Smith_n
#1 1 Carol Jenny Smith 15 1 0 1 0 0 0 0 1
#2 2 Sarah Carol Roberts 20 1 0 0 0 0 1 1 0
#3 3 Josh David Richardson 22 0 1 0 1 1 0 0 0
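If you would rather keep your own function and map it over many names (as asked in the edit), here is one sketch, not the only way, assuming the df[[newcol]] form of the function and a character vector of names:
# loop over a vector of names and add one 0/1 indicator column per name
funname <- function(df, cols, value, newcol) {
  df[[newcol]] <- as.integer(rowSums(df[cols] == value) > 0)
  df
}
names_wanted <- c("Carol", "Jenny", "Josh")   # extend to your 20+ names
for (nm in names_wanted) {
  dat <- funname(dat, c("first", "middle", "last"), nm, nm)
}
# the same thing without an explicit loop:
# dat[names_wanted] <- sapply(names_wanted, function(nm)
#   as.integer(rowSums(dat[c("first", "middle", "last")] == nm) > 0))
The table() approach above is still likely the simpler choice when you want a column for every distinct name.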

Related

Sequential Count of Treatment Episodes in R Dataframe

I have a data set of different patient IDs, clinic visit dates, and attendance (see example data below, separated by patient ID for clarity).
I am interested in sequentially counting treatment episodes, which are defined as attending >= 4 visits in the starting month, followed by >= 1 visit every month afterwards. If a patient attends < 1 visit in a month after starting (i.e., after completing their initial >= 4 visits in the starting month), that treatment episode is considered ended. A new treatment episode subsequently starts the next time the patient attends >= 4 visits in a given month, and that same episode continues as long as the patient attends >= 1 visit/month thereafter. Where patients either do not meet or break this pattern, I'd like to record 0.
Example data (note: I've excluded each day's date to prevent the example from being excessively long and re-produced dates to give a clearer picture of the desired data):
Patient ID   Visit Date   Attendance
1            01/01/2023   Yes
1            01/02/2023   Yes
1            01/03/2023   Yes
1            01/04/2023   Yes
1            02/01/2023   Yes
1            03/01/2023   Yes
1            04/01/2023   No
1            05/01/2023   Yes
1            06/01/2023   No
1            07/01/2023   Yes
1            07/02/2023   Yes
1            07/03/2023   Yes
1            07/04/2023   Yes
1            08/01/2023   Yes

Patient ID   Visit Date   Attendance
2            01/01/2023   Yes
2            02/01/2023   Yes
2            03/01/2023   Yes
2            03/02/2023   Yes
2            03/03/2023   Yes
2            03/04/2023   Yes
2            04/01/2023   Yes
2            05/01/2023   Yes
2            07/01/2023   Yes
Desired data:
Patient ID   Visit Date   Attendance   Tx Episode
1            01/01/2023   Yes          1
1            01/02/2023   Yes          1
1            01/03/2023   Yes          1
1            01/04/2023   Yes          1
1            02/01/2023   Yes          1
1            03/01/2023   Yes          1
1            04/01/2023   No           0
1            05/01/2023   Yes          0
1            06/01/2023   No           0
1            07/01/2023   Yes          2
1            07/02/2023   Yes          2
1            07/03/2023   Yes          2
1            07/04/2023   Yes          2
1            08/01/2023   Yes          2

Patient ID   Visit Date   Attendance   Tx Episode
2            01/01/2023   Yes          0
2            02/01/2023   Yes          0
2            03/01/2023   Yes          1
2            03/02/2023   Yes          1
2            03/03/2023   Yes          1
2            03/04/2023   Yes          1
2            04/01/2023   Yes          1
2            05/01/2023   Yes          1
2            07/01/2023   Yes          0
I am somewhat new to programming in R. I initially attempted to use ifelse() but wasn't able to come up with logicals that worked. I've also attempted to write loops, which have failed to run.
Any help would be greatly appreciated and I'm happy to provide more detail if the above isn't clear.
Thanks in advance for your time/effort!
This seems fairly complex, and I'm not sure of the entire logic, but I thought this may help. It uses the lubridate library, but otherwise base R functions. A helper function elapsed_months was borrowed from here.
First, an empty list enc_list is created that will store results for the final data.frame.
We construct two loops - the first to analyze data for each Patient_ID, and the second to evaluate encounters for that given patient.
Note that I subset to rows where Attendance is "Yes" - if a visit was not attended, we would not want to include it in the evaluation. This is an assumption on my part.
A table of months for the Visit_Date is made so that we know which months have >= 4 encounters.
enc_active is a simple flag for whether, row by row, we are in an active treatment episode. enc_num is the treatment-episode number, which is incremented when a new treatment episode is discovered.
Going row-by-row through encounter data, first check if in an active encounter. If it is, check if the number of elapsed months is 0 (same month) or 1 (consecutive month). If true, then record that encounter. If not true, then the treatment encounter is over.
If not in an active episode, check whether the visit's month has 4+ encounters, and if it does, start a new active treatment episode. In cases where that is not true, it records 0 for Tx_Episode and the flag stays unset.
The final results are stored back in the list which will be bound together with rbind (row bind) in the end.
The merge combines the results with the original data.frame, which is needed since the rows with Attendance of "No" were removed early on. Since the merge leaves Tx_Episode missing for those "No" rows, we replace the NAs with 0.
Some example data was adapted from your comment. Please let me know if you have questions - happy to do a Stack Overflow chat to go over it as well. I do have an interest in this form of data from my own experiences.
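For reference, here is a sketch of how the question's example data could be set up so the code below runs (the column names and the Date-class Visit_Date are assumptions based on the code):
# Visit_Date must be Date class for cut()/floor_date() below to work
df <- data.frame(
  Patient_ID = c(rep(1, 14), rep(2, 9)),
  Visit_Date = as.Date(c(
    "01/01/2023", "01/02/2023", "01/03/2023", "01/04/2023", "02/01/2023",
    "03/01/2023", "04/01/2023", "05/01/2023", "06/01/2023", "07/01/2023",
    "07/02/2023", "07/03/2023", "07/04/2023", "08/01/2023",
    "01/01/2023", "02/01/2023", "03/01/2023", "03/02/2023", "03/03/2023",
    "03/04/2023", "04/01/2023", "05/01/2023", "07/01/2023"
  ), format = "%m/%d/%Y"),
  Attendance = c("Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "No",
                 "Yes", "Yes", "Yes", "Yes", "Yes", rep("Yes", 9))
)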
library(lubridate)
elapsed_months <- function(end_date, start_date) {
  ed <- as.POSIXlt(end_date)
  sd <- as.POSIXlt(start_date)
  12 * (ed$year - sd$year) + (ed$mon - sd$mon)
}
enc_list <- list()
for (id in unique(df$Patient_ID)) {
  enc_data <- df[df$Patient_ID == id & df$Attendance == "Yes", ]
  enc_month <- table(cut(enc_data$Visit_Date, 'month'))
  enc_active <- F
  enc_num <- 0
  for (i in 1:nrow(enc_data)) {
    if (enc_active) {
      if (elapsed_months(enc_data$Visit_Date[i], enc_data$Visit_Date[i - 1]) <= 1) {
        enc_data[i, "Tx_Episode"] <- enc_num
      } else {
        enc_active = F
        enc_data[i, "Tx_Episode"] <- 0
      }
    } else {
      if (enc_month[as.character(floor_date(enc_data$Visit_Date[i], unit = "month"))] >= 4) {
        enc_active = T
        enc_num <- enc_num + 1
        enc_data[i, "Tx_Episode"] <- enc_num
      } else {
        enc_data[i, "Tx_Episode"] <- 0
      }
    }
  }
  enc_list[[id]] <- enc_data
}
df_final <- merge(
  do.call('rbind', enc_list),
  df,
  all.y = T
)
df_final$Tx_Episode[is.na(df_final$Tx_Episode)] <- 0
Output
Patient_ID Visit_Date Attendance Tx_Episode
1 1 2023-01-01 Yes 1
2 1 2023-01-02 Yes 1
3 1 2023-01-03 Yes 1
4 1 2023-01-04 Yes 1
5 1 2023-02-01 Yes 1
6 1 2023-03-01 Yes 1
7 1 2023-04-01 No 0
8 1 2023-05-01 Yes 0
9 1 2023-06-01 No 0
10 1 2023-07-01 Yes 2
11 1 2023-07-02 Yes 2
12 1 2023-07-03 Yes 2
13 1 2023-07-04 Yes 2
14 1 2023-08-01 Yes 2
15 2 2023-01-01 Yes 0
16 2 2023-02-01 Yes 0
17 2 2023-03-01 Yes 1
18 2 2023-03-02 Yes 1
19 2 2023-03-03 Yes 1
20 2 2023-03-04 Yes 1
21 2 2023-04-01 Yes 1
22 2 2023-04-02 Yes 1
23 2 2023-04-03 Yes 1
24 2 2023-04-04 Yes 1
25 2 2023-06-12 Yes 0

Is there a way to convert all NAs in certain dataframe column only if two separate conditions are met in R

Any help with this R problem would be appreciated beyond belief! Apologies in advance: i) for not using correct syntax/formatting; ii) this has probably been asked in other ways I couldn't find; iii) I'm finding it hard to describe in words (or with simplified expressions) what I'm trying to achieve.
In a nutshell, converting NAs to 0s is really straightforward; however, converting NAs in a specific dataframe column under two or more "if" conditions is a different story.
My dataframe is a log of how many times analysts have been late with their report type in the past year (all made up):
LateReporters
Analyst Name Report Type Secured-Late Unsecured-Late
1 Nick Unsecured NA NA
2 John Unsecured NA 3
3 Emily Secured 4 NA
4 Karen Secured NA NA
5 Leslie Unsec&Sec NA 4
6 Joanna Unsec&Sec 2 NA
7 Scott Unsec&Sec NA NA
8 Stuart Unsec&Sec 1 3
This is what I am trying to achieve:
LateReporters
Analyst Name Report Type Secured-Late Unsecured-Late
1 Nick Unsecured NA 0
2 John Unsecured NA 3
3 Emily Secured 4 NA
4 Karen Secured 0 NA
5 Leslie Unsec&Sec 0 4
6 Joanna Unsec&Sec 2 0
7 Scott Unsec&Sec 0 0
8 Stuart Unsec&Sec 1 3
Rules:
1) If analyst reports "Unsecured" 'Report Type' only, then any NAs in the 'Unsecured-Late' column should be changed to a 0 as they have genuinely never been late - and any NAs in 'Secured-Late' column should remain as NA (as they don't report it).
2) If analyst reports "Secured" 'Report Type' only, then any NAs in the 'Secured-Late' column should be changed to a 0 as they have genuinely never been late - and any NAs in the 'Unsecured-Late' column should remain as NA (as they don't report it).
3) If analyst reports "Unsec&Sec" 'Report Type', then any NAs in the 'Unsecured-Late' and the 'Secured-Late' columns should be changed to 0 as they have genuinely never been late.
Attempted Code (and many many many others):
LateReporters <- if (LateReporters$'Report Type' == "Unsecured" & LateReporters$'Unsecured-Late == is.na(LateReporters$Unsecured-Late)) { LateReporter$Unsecured-Late [which(is.na(LateReporters$Unsecured-Late))] <- 0
}
LateReporters <- if (LateReporters$'Report Type' == "Secured" & LateReporters$'Secured-Late == is.na(LateReporters$Secured-Late)) { LateReporter$Secured-Late [which(is.na(LateReporters$Secured-Late))] <- 0
}
LateReporters <- if (LateReporters$'Report Type' == "Unsec&Sec" & LateReporters$'Unsecured-Late == is.na(LateReporters$secured) | LateReporters$'Secured-Late == is.na(LateReporters$secured) ) { LateReporter$Unsecured-Late [which(is.na(LateReporters$Unsecured-Late))] & LateReporter$Secured-Late [which(is.na(LateReporters$Secured-Late))] <- 0
}
Any suggestions where I'm going wrong would be great.
Many thanks
In base R, a little clunky-looking:
LateReports <- within(LateReports, {
Unsecured_Late[is.na(Unsecured_Late) & Report_Type == "Unsecured"] <- 0
Secured_Late[ is.na(Secured_Late) & Report_Type == "Secured"] <- 0
Unsecured_Late[is.na(Unsecured_Late) & Report_Type == "Unsec&Sec"] <- 0
Secured_Late[ is.na(Secured_Late) & Report_Type == "Unsec&Sec"] <- 0
})
LateReports
# Analyst_Name Report_Type Secured_Late Unsecured_Late
# 1 Nick Unsecured NA 0
# 2 John Unsecured NA 3
# 3 Emily Secured 4 NA
# 4 Karen Secured 0 NA
# 5 Leslie Unsec&Sec 0 4
# 6 Joanna Unsec&Sec 2 0
# 7 Scott Unsec&Sec 0 0
# 8 Stuart Unsec&Sec 1 3
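For reproducibility, here is a sketch of the example data with the underscore-style column names this base R answer assumes (the question's table uses spaces and hyphens):
LateReports <- data.frame(
  Analyst_Name   = c("Nick", "John", "Emily", "Karen", "Leslie", "Joanna", "Scott", "Stuart"),
  Report_Type    = c("Unsecured", "Unsecured", "Secured", "Secured",
                     "Unsec&Sec", "Unsec&Sec", "Unsec&Sec", "Unsec&Sec"),
  Secured_Late   = c(NA, NA, 4, NA, NA, 2, NA, 1),
  Unsecured_Late = c(NA, 3, NA, NA, 4, NA, NA, 3)
)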
With dplyr::case_when
library(dplyr)
df %>%
mutate(`Secured-Late` = case_when(`Report Type` == "Secured" & is.na(`Secured-Late`) ~ 0,
`Report Type` == "Unsec&Sec" & is.na(`Secured-Late`) ~ 0,
TRUE ~ `Secured-Late`),
`Unsecured-Late` = case_when(`Report Type` == "Unsecured" & is.na(`Unsecured-Late`) ~ 0,
`Report Type` == "Unsec&Sec" & is.na(`Unsecured-Late`) ~ 0,
TRUE ~ `Unsecured-Late`))
# A tibble: 8 x 5
X1 Analyst_Name `Report Type` `Secured-Late` `Unsecured-Late`
<dbl> <chr> <chr> <dbl> <dbl>
1 1 Nick Unsecured NA 0
2 2 John Unsecured NA 3
3 3 Emily Secured 4 NA
4 4 Karen Secured 0 NA
5 5 Leslie Unsec&Sec 0 4
6 6 Joanna Unsec&Sec 2 0
7 7 Scott Unsec&Sec 0 0
8 8 Stuart Unsec&Sec 1 3
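A slightly more compact variant of the same idea, keeping the question's original column names (a sketch; it assumes the late-count columns are numeric so if_else can return 0 for them):
library(dplyr)
df %>%
  mutate(`Secured-Late`   = if_else(is.na(`Secured-Late`) &
                                      `Report Type` %in% c("Secured", "Unsec&Sec"),
                                    0, `Secured-Late`),
         `Unsecured-Late` = if_else(is.na(`Unsecured-Late`) &
                                      `Report Type` %in% c("Unsecured", "Unsec&Sec"),
                                    0, `Unsecured-Late`))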

How to do this?

I have a problem, and I would like to ask if there is a function or an easy way to do the operation below.
I have a data.frame like this
customer item
-------------------
smith a
smith b
smith c
johnson a
bush NA
regan d
How to create matrix like this
customer a b c d
--------------------------------------
smith 1 1 1 0
johnson 1 0 0 0
bush 0 0 0 0
regan 0 0 0 1
Is a loop obligatory? Is there an easier way to create this?
Thank you in advance!
You should use the table function. The call would look something like this. It goes x, y, but depending on what the full data.frame looks like, you may want to add some more parameters to handle NA values and such:
table(df$customer, df$item)
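For example, with the data typed in by hand (a sketch; the column names customer and item are assumed):
df <- data.frame(
  customer = c("smith", "smith", "smith", "johnson", "bush", "regan"),
  item     = c("a", "b", "c", "a", NA, "d")
)
tab <- table(df$customer, df$item)   # the NA item is dropped by default,
tab                                  # so bush simply comes out as an all-zero row
as.data.frame.matrix(tab)            # if you need a data.frame rather than a table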

counting occurrence of strings across multiple columns in R [duplicate]

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 5 years ago.
I have a dataset in R which looks like the following (only relevant columns shown). It has sex-disaggregated data on what crops respondents wanted more information about and how much of a priority each crop is for them.
sex wantcropinfo1 priority1 wantcropinfo2 priority2
m wheat high eggplant medium
m rice low cabbage high
m rice high
f eggplant medium
f cotton low
...
I want to be able to (a) count the total occurrences of each crop across all the wantcropinfoX columns; (b) get the same count but disaggregated by sex; and (c) do the same but broken down by both priority and sex.
(a) output should look like this:
crop count
wheat 1
eggplant 2
rice 2
...
(b) output should look like this:
crop countm countf
wheat 1 0
eggplant 1 1
rice 2 0
...
(c) should look like this:
crop high_m med_m low_m high_f med_f low_f
wheat 1 0 0 0 0 0
eggplant 0 1 0 0 1 0
rice 1 0 1 0 0 0
...
I'm a bit of an R newbie and the manuals are slightly bewildering. I've googled a lot but couldn't find anything that was quite like this even though it seems like a fairly common thing one might want to do. Similar questions on stackoverflow seemed to be asking something a bit different.
We can use melt from data.table to convert from 'wide' to 'long' format. It can take multiple measure columns.
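Assuming the example is read in as follows (a sketch; the blank cells in the question are treated as empty strings, which is why the answer filters crop != ''):
df1 <- data.frame(
  sex = c("m", "m", "m", "f", "f"),
  wantcropinfo1 = c("wheat", "rice", "rice", "eggplant", "cotton"),
  priority1 = c("high", "low", "high", "medium", "low"),
  wantcropinfo2 = c("eggplant", "cabbage", "", "", ""),
  priority2 = c("medium", "high", "", "", ""),
  stringsAsFactors = FALSE
)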
library(data.table)
dM <- melt(setDT(df1), measure = patterns("^want", "priority"),
value.name = c("crop", "priority"))[crop!='']
In the 'long' format, we get the 3 expected results either by grouping by 'crop' and counting the rows, or by converting to 'wide' with dcast, specifying length as the fun.aggregate.
dM[,.(count= .N) , crop]
# crop count
#1: wheat 1
#2: rice 2
#3: eggplant 2
#4: cotton 1
#5: cabbage 1
dcast(dM, crop~sex, value.var='sex', length)
# crop f m
#1: cabbage 0 1
#2: cotton 1 0
#3: eggplant 1 1
#4: rice 0 2
#5: wheat 0 1
dcast(dM, crop~priority+sex, value.var='priority', length)
# crop high_m low_f low_m medium_f medium_m
#1: cabbage 1 0 0 0 0
#2: cotton 0 1 0 0 0
#3: eggplant 0 0 0 1 1
#4: rice 1 0 1 0 0
#5: wheat 1 0 0 0 0
Use the ddply function in the plyr package.
The structure of how you use this function is the following:
ddply(dataframe,.(var1,var2,...), summarize, function)
In this case you might do the following for:
a) ddply(df, .(wantcropinfo1), summarize, count = length(wantcropinfo1))
b) ddply(df, .(wantcropinfo1, priority), summarize, count = length(wantcropinfo1))
c) ddply(df, .(wantcropinfo1, priority, sex), summarize, count = length(wantcropinfo1))
Note that the output will not have the same structure you mention in your question, but the information will be the same. For the mentioned structure, use the table function, as sketched below.
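For instance, building on the long-format dM from the data.table answer above, a sketch of that table call giving the layout in (c):
with(dM, table(crop, paste(priority, sex, sep = "_")))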

Optimization of an R loop taking 18 hours to run

I've got R code that works and does what I want, but it takes a huge amount of time to run. Here is an explanation of what the code does and the code itself.
I've got a vector of 200000 lines containing street addresses (strings): data.
Example :
> data[150000,]
address
"15 rue andre lalande residence marguerite yourcenar 91000 evry france"
And I have a 131x2 matrix of string elements, which are 5-grams (parts of words) and the ids of the bags of n-grams (example of a 5-gram bag: ["stack", "tacko", "ackov", "ckove", "kover", ...]): list_ngrams
Example of list_ngrams :
idSac ngram
1 4 stree
2 4 tree_
3 4 _stre
4 4 treet
5 5 avenu
6 5 _aven
7 5 venue
8 5 enue_
I also have a 200000x31 numerical matrix initialized with 0: idv_x_bags
In total I have 131 5-grams and 31 bags of 5-grams.
I want to loop over the string addresses and check whether each one contains one of the n-grams in my list. If it does, I put a 1 in the corresponding column, which represents the id of the bag that contains the 5-gram.
Example :
In this address: "15 rue andre lalande residence marguerite yourcenar 91000 evry france", the word "residence" exists in the bag ["resid","eside","dence",...] whose id is 5, so I put a 1 in the column called 5. Therefore the corresponding line of the idv_x_bags matrix will look like the following:
> idv_x_bags[150000,]
4 5 6 8 10 12 13 15 17 18 22 26 29 34 35 36 42 43 45 46 47 48 52 55 81 82 108 114 119 122 123
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Here is the code that does this:
idv_x_bags <- matrix(rep(0, nrow(data) * 31), nrow = nrow(data), ncol = 31)
colnames(idv_x_bags) <- as.vector(sqldf("select distinct idSac from list_ngrams order by idSac"))$idSac
for (i in 1:nrow(idv_x_bags)) {
  for (ngram in list_ngrams$ngram) {
    if (grepl(ngram, data[i, ]) == TRUE) {
      idSac <- sqldf(sprintf("select idSac from list_ngrams where ngram='%s'", ngram))[[1]]
      idv_x_bags[i, as.character(idSac)] <- 1
    }
  }
}
The code does exactly what I aim to do, but it takes about 18 hours, which is huge. I tried to recode it in C++ using the Rcpp library but encountered many problems. I also tried to recode it using apply, but I couldn't get it to work.
Here is what I did:
apply(cbind(data, 1:nrow(data)), 1, function(x) {
  apply(list_ngrams, 1, function(y) {
    if (grepl(y[2], x[1]) == TRUE) { idv_x_bags[x[2], str_trim(as.character(y[1]))] <- 1 }
  })
})
I need some help recoding my loop using apply or some other method that runs faster than the current one. Thank you very much.
Check this one and run the simple example step by step to see how it works.
My n-grams don't make much sense, but it will work with actual n-grams as well.
library(dplyr)
library(reshape2)
# your example dataset
dt_sen = data.frame(sen = c("this is a good thing", "this is bad"), stringsAsFactors = F)
dt_ngr = data.frame(id_ngr = c(2,2,2,3,3,3),
ngr = c("th","go","tt","drf","ytu","bad"), stringsAsFactors = F)
# sentence dataset
dt_sen
sen
1 this is a good thing
2 this is bad
#ngrams dataset
dt_ngr
id_ngr ngr
1 2 th
2 2 go
3 2 tt
4 3 drf
5 3 ytu
6 3 bad
# create table of matches
expand.grid(unique(dt_sen$sen), unique(dt_ngr$id_ngr)) %>%
data.frame() %>%
rename(sen = Var1,
id_ngr = Var2) %>%
left_join(dt_ngr, by = "id_ngr") %>%
group_by(sen, id_ngr,ngr) %>%
do(data.frame(match = grepl(.$ngr,.$sen))) %>%
group_by(sen,id_ngr) %>%
summarise(sum_success = sum(match)) %>%
mutate(match = ifelse(sum_success > 0,1,0)) -> dt_full
dt_full
Source: local data frame [4 x 4]
Groups: sen
sen id_ngr sum_success match
1 this is a good thing 2 2 1
2 this is a good thing 3 0 0
3 this is bad 2 1 1
4 this is bad 3 1 1
# reshape table
dt_full %>% dcast(., sen~id_ngr, value.var = "match")
sen 2 3
1 this is a good thing 1 0
2 this is bad 1 1
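As a side note, much of the original 18 hours likely comes from calling sqldf and grepl once per address. A base R sketch (assuming the objects named in the question: a data frame data with an address column, and list_ngrams with columns idSac and ngram) that loops over the 131 n-grams instead of the 200000 addresses:
# one fixed-string grepl call per n-gram, vectorised over all addresses at once
bag_ids <- sort(unique(list_ngrams$idSac))
idv_x_bags <- matrix(0L, nrow = nrow(data), ncol = length(bag_ids),
                     dimnames = list(NULL, as.character(bag_ids)))
for (k in seq_len(nrow(list_ngrams))) {
  hits <- grepl(list_ngrams$ngram[k], data$address, fixed = TRUE)
  idv_x_bags[hits, as.character(list_ngrams$idSac[k])] <- 1L
}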
