Merge columns with the same name R - r

I'm fairly new to R. I'm working with a data set that is incredibly redundant, with a lot of columns (~400). There are several duplicate column names, but the data in them is not duplicated, so I need to sum the columns when collapsing them.
The columns all have a similar name that allows easy identification, so I'm hoping I can use that to my advantage.
I attempted to perform the following:
ColNames <- unique(colnames(df))
CombinedDf <- data.frame(sapply(ColNames, function(i) rowSums(df[, colnames(df) == i, drop = FALSE])))
This works if I sum over the range of columns that only contain integers, but the issue is that other columns have strings and such in them, so rowSums throws a fit.
Assuming that the identifier is "XXX", how can I aggregate all the columns that are of the same name leaving the other columns as is?
Thank you for your time.
Edit: Sample data has been asked for, I cannot give the exact data as it is sensitive, but I will give an example:
Name   COL1XXX  COL2XXX  COL1XXX  COL3XXX  COL2XXX  Type
Henry  5        15       25       31       1        Orange
Tom    8        16       12       4        3        Green
Should return
Name   COL1XXX  COL2XXX  COL3XXX  Type
Henry  30       16       31       Orange
Tom    20       19       4        Green

I'm not really sure, but you may try transposing the data and then aggregating by unique names.
t_df=as.data.frame(t(df))
new_df=aggregate(t_df, by=list(rownames(t_df)),sum)
Again, without sample data I'm unsure if it'll work, but based on what you said, that might work.
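One caveat with transposing: t() on a data frame that mixes text and numbers coerces everything to character, so sum will fail. Here is a rough base R sketch that avoids the transpose and only touches the identifier columns. It assumes the sample data from the question's edit and that "XXX" appears in every column that should be summed; the names is_xxx, xxx_names, summed and result are made up for illustration.

```r
# Sample data from the question; check.names = FALSE keeps the duplicate names.
df <- data.frame(
  Name = c("Henry", "Tom"),
  COL1XXX = c(5, 8),
  COL2XXX = c(15, 16),
  COL1XXX = c(25, 12),
  COL3XXX = c(31, 4),
  COL2XXX = c(1, 3),
  Type = c("Orange", "Green"),
  check.names = FALSE
)

# Assumption: only columns whose name contains "XXX" should be summed.
is_xxx <- grepl("XXX", names(df), fixed = TRUE)
xxx_names <- unique(names(df)[is_xxx])

# One summed vector per distinct identifier name.
summed <- lapply(xxx_names, function(nm) {
  rowSums(df[, names(df) == nm, drop = FALSE])
})
names(summed) <- xxx_names

# Re-attach the untouched columns and restore the original column order.
result <- cbind(df[, !is_xxx, drop = FALSE], as.data.frame(summed))
result <- result[, unique(names(df))]
print(result)
```

On the sample data this should give the table from the question (Henry 30, 16, 31 and Tom 20, 19, 4), with Name and Type left as they are.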

Related

R - Using Stringr to identify a string across hundreds of rows

I have a database where some people have multiple diagnoses. I posted a similar question in the past, but now have some more nuances I need to work through:
R- How to test multiple 100s of similar variables against a condition
I have this dataset (which was an import of a SAS file)
ID dx1 dx2 dx3 dx4 dx5 dx6 .... dx200
1 343 432 873 129 12 123 3445
2 34 12 44
3 12
4 34 56
Initially, I wanted to be able to create a new variable if any of the "dxs" equals a certain number without using hundreds of if statements? All the different variables have the same format (dx#). So I used the following code:
Ex:
dataset$highbloodpressure <- rowSums(screen[0:832] == "410") > 0
This worked great. However, there are many different codes for the same diagnosis. For example, a heart attack can be defined as:
410.1,
410.71,
410.62,
410.42,
...this goes on for 20 additional codes. BUT! They all start with 410.
I thought about using stringr (the variable is a string), to identify the common code components (410, for the example above), but am not sure how to use it in the context of rowsums.
If anyone has any suggestions for this, please let me know!
Thanks for all the help!
You can use the grepl() function, which returns TRUE if a pattern is found in a string. To check all columns at once, collapse them into one character string per row:
df$dx.410 = NA
for(i in seq_len(nrow(df))){
  # paste all diagnosis columns of row i into a single string and search it
  if(grepl('410', paste(df[i,2:200], collapse=' '))){
    df$dx.410[i] = "Present"
  }
}
This loops through all rows, builds one long string containing all diagnoses for that case, and writes "Present" in column dx.410 if any column contains a 410 diagnosis. Note that the pattern '410' matches anywhere in the string, so a code such as 1410 would also count.
(The solution expects the data structure you have here with the dx-variables in columns 2 to 200. If there are some other columns, just adjust these numbers)
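If the row-by-row loop is too slow, a vectorised variant might look like the sketch below. It is only a sketch: the small data frame is invented, the dx columns are assumed to be character vectors named dx1, dx2, ..., and the pattern is anchored with ^ so that only codes starting with 410 count.

```r
# Invented sample data; the real data has dx1 ... dx200.
df <- data.frame(
  ID  = 1:4,
  dx1 = c("343", "410.71", "12", "34"),
  dx2 = c("432", "12", NA, "56"),
  dx3 = c("410.1", "44", NA, "1410"),
  stringsAsFactors = FALSE
)

# Assumption: every diagnosis column is named "dx" followed by digits.
dx_cols <- grep("^dx[0-9]+$", names(df), value = TRUE)

# grepl() is vectorised over a column and returns FALSE for NA values.
hits <- lapply(df[dx_cols], function(col) grepl("^410", col))

# Combine the per-column results with OR: TRUE if any column matched.
df$dx.410 <- Reduce(`|`, hits)
print(df)
```

Because of the ^ anchor, the value "1410" in row 4 is not counted, which the unanchored pattern would do.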

Recommender Split Returning Empty Dataset

I'm using a "Split Data" module set to recommender split to split data for training and testing a matchbox recommender. The input data is a valid user-item-rating tuple (for example, 575978 - 157381 - 3) and I've left the parameters for the recommender split as default (0s for everything), besides changing it to a .75 and .25 split. However, when this module finishes, it returns the complete, unsplit dataset for dataset1 and a completely empty (but labelled) dataset for dataset2. This also happens when doing a stratified split using the "Split Rows" mode. Any idea what's going on?
Thanks.
Edit: Including a sample of my data.
UserID ItemID Rating
835793 165937 3
154738 11214 3
938459 748288 3
819375 789768 6
738571 98987 3
847509 153777 3
991757 124458 3
968685 288070 2
236349 8337 3
127299 545885 3
Figured it out. In my "Remove Duplicate Rows" module a bit further up the chain, I was removing duplicates by UserID alone instead of by UserID and ItemID together. This still left quite a few rows, but I assume it interfered with the stratification.

R: Subsetting rows by group based on time difference

I have the following data frame:
group_id date_show date_med
1 1976-02-07 1971-04-14
1 1976-02-09 1976-12-11
1 2011-03-02 1970-03-22
2 1993-08-04 1997-06-13
2 2008-07-25 2006-09-01
2 2009-06-18 2005-11-12
3 2009-06-18 1999-11-03
I want to subset my data frame in such a way that the new data frame only shows the rows in which the values of date_show are further than 10 days apart but this condition should only be applied per group. I.e. if the values in the date_show column are less than 10 days apart but the group_ids are different, I need to keep both entries. What I want my result to look like based on the above table is:
group_id date_show date_med
1 1976-02-07 1971-04-14
1 2011-03-02 1970-03-22
2 1993-08-04 1997-06-13
2 2008-07-25 2006-09-01
2 2009-06-18 2005-11-12
3 2009-06-18 1999-11-03
Which row gets deleted isn't important, because the reason I'm subsetting in the first place is to count the rows I am left with after applying this criterion.
I've tried playing around with the diff function but I'm not sure how to go about it in the simplest possible way because this problem is already within another sapply function so I'm trying to avoid any kind of additional loop (in this case by group_id).
The df I'm working with has around 100 000 rows. Ideally, I would like to do this with base R because I have no rights to install any additional packages on the machine I'm working on but if this is not possible (or if solving this with an additional package would be significantly better), I can try and ask my admin to install it.
Any tips would be appreciated!
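One possible base R approach, offered as a sketch rather than a definitive answer, is to sort by group and date and use ave() with diff() to get the gap to the previous row within each group. It assumes date_show is (or can be converted to) a Date, and the names gap and kept are made up. Note the simplification: each row is compared with the previous row, not with the previous row that was kept, so a chain of closely spaced dates is reduced to its first entry.

```r
# Sample data from the question.
df <- data.frame(
  group_id  = c(1, 1, 1, 2, 2, 2, 3),
  date_show = as.Date(c("1976-02-07", "1976-02-09", "2011-03-02",
                        "1993-08-04", "2008-07-25", "2009-06-18",
                        "2009-06-18")),
  date_med  = as.Date(c("1971-04-14", "1976-12-11", "1970-03-22",
                        "1997-06-13", "2006-09-01", "2005-11-12",
                        "1999-11-03"))
)

# Sort so that diff() compares neighbouring dates within a group.
df <- df[order(df$group_id, df$date_show), ]

# Days since the previous row in the same group; Inf for each group's first row.
gap <- ave(as.numeric(df$date_show), df$group_id,
           FUN = function(d) c(Inf, diff(d)))

# Keep rows that are more than 10 days after their predecessor.
kept <- df[gap > 10, ]
print(nrow(kept))
```

On the sample data this drops only the 1976-02-09 row of group 1. The two 2009-06-18 rows stay because they belong to different groups.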

R: iterating through unique values of a vector in for loop

I'm new to R and I am having some trouble iterating through the unique element of a vector. I have a dataframe "School" with 700 different teachers. Each teacher has around 40 students.
I want to be able to loop through each teacher, create a graph of the mean score of his/her students over time, save the graphs in a folder and automatically email that folder to that teacher.
I'm just getting started and am having trouble setting up the for-loop. In Stata, I know how to loop through each unique element in a list, but am having trouble doing that in R. Any help would be appreciated.
School$Teacher School$Student School$ScoreNovember School$ScoreDec School$TeacherEmail
A 1 35 45 A@school.org
A 2 43 65 A@school.org
B 1 66 54 B@school.org
A 3 97 99 A@school.org
C 1 23 45 C@school.org
Your question seems a bit vague and it looks like you want us to write your whole project. Could you share what you have done so far and where exactly you are struggling?
see ?subset
School=data.frame(Teacher=c("A","B"), ScoreNovember=10:11, ScoreDec=13:14)
for (teacher in unique(School$Teacher)) {
  teacher_df=subset(School, Teacher==teacher)
  MeanScoreNovember=mean(teacher_df$ScoreNovember)
  MeanScoreDec =mean(teacher_df$ScoreDec)
  # do your plot
  # send your email
}
I think you have three questions here, each of which needs its own post. How do I:
Create graphs
Automatically email output
Compute a subset mean based on group
For the 3rd one, I like using the plyr package; other people will recommend the data.table or dplyr packages. You can also use aggregate from base. To get a teacher's mean:
library(plyr)
ddply(School,.(Teacher),summarise,Nov_m=mean(ScoreNovember))
If you want the mean per student per teacher, etc., just add the extra column to the grouping variables, like:
library(plyr)
ddply(School,.(Teacher,Student),summarise,Nov_m=mean(ScoreNovember))
You could do that for each score column (and then chart it). If your data were long rather than wide, you could also add the date ('November', 'Dec') as a grouping variable in the brackets. Otherwise, compute both means in one call:
library(plyr)
ddply(School,.(Teacher,Student),summarise,Nov_m=mean(ScoreNovember),Dec_m=mean(ScoreDec))
See if that helps with the 3rd, but look at splitting your questions up too.
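For anyone who cannot install plyr, the base aggregate route mentioned above might look roughly like this. It is a sketch using the sample rows from the question (without the email column); teacher_means is a made-up name.

```r
# Sample rows from the question, without the email column.
School <- data.frame(
  Teacher       = c("A", "A", "B", "A", "C"),
  Student       = c(1, 2, 1, 3, 1),
  ScoreNovember = c(35, 43, 66, 97, 23),
  ScoreDec      = c(45, 65, 54, 99, 45)
)

# Mean of both score columns per teacher; cbind() lists the columns to summarise.
teacher_means <- aggregate(cbind(ScoreNovember, ScoreDec) ~ Teacher,
                           data = School, FUN = mean)
print(teacher_means)
```

The result has one row per teacher, so it can be looped over or plotted without subsetting the original data again.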

How do I generate a dataframe displaying the number of unique pairs between two vectors, for each unique value in one of the vectors?

First of all, I apologize for the title. I really don't know how to succinctly explain this issue in one sentence.
I have a dataframe where each row represents some aspect of a hospital visit by a patient. A single patient might have thousands of rows for dozens of hospital visits, and each hospital visit could account for several rows.
One column is Medical.Record.Number, which corresponds to Patient IDs, and the other is Patient.ID.Visit, which corresponds to an ID for an individual hospital visit. I am trying to calculate the number of hospital visits each patient has had.
For example:
Medical.Record.Number    Patient.ID.Visit
AAAXXX           1111
AAAXXX           1112
AAAXXX           1113
AAAZZZ           1114
AAAZZZ           1114
AAABBB           1115
AAABBB           1116
would produce the following:
Medical.Record.Number   Number.Of.Visits
AAAXXX          3
AAAZZZ          1
AAABBB          2
The solution I am currently using is the following, where "data" is my dataframe:
#this function returns the number of unique hospital visits associated with the
#supplied record number
countVisits <- function(record.number){
  visits.by.number <- data$Patient.ID.Visit[which(data$Medical.Record.Number
                                                  == record.number)]
  return(length(unique(visits.by.number)))
}
recordNumbers <- unique(data$Medical.Record.Number)
visits <- integer()
for (record in recordNumbers){
  visits <- c(visits, countVisits(record))
}
visit.counts <- data.frame(recordNumbers, visits)
This works, but it is pretty slow. I am dealing with potentially millions of rows of data, so I'd like something efficient. From what little I know about R, I know there's usually a faster way to do things without using a for-loop.
This essentially looks like a table() operation after you take out duplicates. First, some sample data
#sample data
dd<-read.table(text="Medical.Record.Number Patient.ID.Visit
AAAXXX 1111
AAAXXX 1112
AAAXXX 1113
AAAZZZ 1114
AAAZZZ 1114
AAABBB 1115
AAABBB 1116", header=T)
then you could do
tt <- table(Medical.Record.Number=unique(dd)$Medical.Record.Number)
as.data.frame(tt, responseName="Number.Of.Visits") #to get a data.frame rather than named vector (table)
# Medical.Record.Number Number.Of.Visits
# 1 AAABBB 2
# 2 AAAXXX 3
# 3 AAAZZZ 1
Or you could also think of this as an aggregation problem
aggregate(Patient.ID.Visit~Medical.Record.Number, dd, function(x) length(unique(x)))
# Medical.Record.Number Patient.ID.Visit
# 1 AAABBB 2
# 2 AAAXXX 3
# 3 AAAZZZ 1
There are many ways to do this, and @MrFlick provided a handful of perfectly valid approaches. Personally I'm fond of the data.table package. It's faster on large data frames and I find the logic to be more intuitive than the base functions. I'd check it out if you are having problems with execution time.
library(data.table)
# med_tbl stands for the data frame that holds the visit rows
med.dt <- data.table(med_tbl)
# j must be wrapped in .() to create a named column
num.visits.dt <- med.dt[ , .(num_visits = length(unique(Patient.ID.Visit))),
                        by = Medical.Record.Number]
data.table should be much faster than data.frame on large tables.
