Order R dataframe columns by using second dataframe as a reference. - r

I am working on developing a statistical program using R, this program accepts two dataFrames. The first dataFrame carries demographics information of patients and the second carries their clinical information. The key column in the demographics dataFrame is the patientID column. While in the clinical dataFrame each patientID is a column. I wish to arrange/sort my demographics dataFrame by patientID, based upon the order of patientID's(ind columns) in the clinical dataFrame. Also the ID's could numeric or alphanumeric or could just be some-alphabet sequence. I was able to write some code, but would need help/guidance to come up with a better way to sort columns irrespective of their datatype(character, factor, numeric etc).
demogr = read.csv(mydemoFile, header = T, stringsAsFactors
=TRUE,colClasses=c('factor','factor','factor','factor','factor'))
demogr=demogr[order(as.numeric(demogr$Patient_ID)),]
myClinicalFrame=fread(myInputFile,header=T,data.table=FALSE,sep=",")
rowNames=myClinicalFrame[,1]
myClinicalFrame[,1]<-NULL
rownames(myClinicalFrame)=rowNames
names(myClinicalFrame)=sort((names(myClinicalFrame)))
The above works for certain types but fails for others. eg: Patient_ID in
demoFrame is numerically sorted above, in some situations R changes patient_ID like
109999345554545465 to 1.09e+18, which doesn't match with the second dataFrame.
Thanks

Let's start by creating two example data frames:
patientID = c(123456789012345,1234,1234567890,123)
state = c("FL","NJ","CA","TX")
demog = data.frame(ID = patientID,state = state)
clinical = data.frame(col1 = c(1,2,3),
col2 = c(3,4,5),
col2 = c(1,7,9),
col2 = c(6,4,2))
colnames(clinical) = c("1234567890","123","123456789012345","1234")
This gives us:
> demog
ID state
1 1.234568e+14 FL
2 1.234000e+03 NJ
3 1.234568e+09 CA
4 1.230000e+02 TX
and
> clinical
1234567890 123 123456789012345 1234
1 1 3 1 6
2 2 4 7 4
3 3 5 9 2
As you can see the rows in demog are in a different order than the columns in clinical.
To sort the rows in demog do:
rownames(demog) = demog$ID
demog = demog[colnames(clinical),]
This works even for IDs that are factors or characters, because rownames() will convert them to character.
Result:
> demog
ID state
1234567890 1.234568e+09 CA
123 1.230000e+02 TX
123456789012345 1.234568e+14 FL
1234 1.234000e+03 NJ

Related

Select columns from a data frame

I have a Data Frame made up of several columns, each corresponding to a different industry per country. I have 56 industries and 43 countries and I'd select only industries from 5 to 22 per country (18 industries). The big issue is that each industry per country is named as: AUS1, AUS2 ..., AUS56. What I shall select is AUS5 to AUS22, AUT5 to AUT22 ....
A viable solution could be to select columns according to the following algorithm: the first column of interest, i.e., AUS5 corresponds to column 10 and then I select up to AUS22 (corresponding to column 27). Then, I should skip all the remaining column for AUS (i.e. AUS23 to AUS56), and the first 4 columns for the next country (from AUT1 to AUT4). Then, I select, as before, industries from 5 to 22 for AUT. Basically, the algorithm, starting from column 10 should be able to select 18 columns(including column 10) and then skip the next 38 columns, and then select the next 18 columns. This process should be repeated for all the 43 countries.
How can I code that?
UPDATE, Example:
df=data.frame(industry = c("C10","C11","C12","C13"),
country = c("USA"),
AUS3 = runif(4),
AUS4 = runif(4),
AUS5 = runif(4),
AUS6 = runif(4),
DEU5 = runif(4),
DEU6 = runif(4),
DEU7 = runif(4),
DEU8 = runif(4))
#I'm interested only in C10-c11:
df_a=df %>% filter(grepl('C10|C11',industry))
df_a
#Thus, how can I select columns AUS10,AUS11, DEU10,DEU11 efficiently, considering that I have a huge dataset?
Demonstrating the paste0 approach.
ctr <- unique(gsub('\\d', '', names(df[-(1:2)])))
# ctr <- c("AUS", "DEU") ## alternatively hard-coded
ind <- c(10, 11)
subset(df, industry == paste0('C', 10:11),
select=c('industry', 'country', paste0(rep(ctr, each=length(ind)), ind)))
# industry country AUS10 AUS11 DEU10 DEU11
# 1 C10 USA 0.3376674 0.1568496 0.5033433 0.7327734
# 2 C11 USA 0.7421840 0.6808892 0.9050158 0.3689741
Or, since you appear to like grep you could do.
df[grep('10|11', df$industry), grep('industry|country|[A-Z]{3}1[01]', names(df))]
# industry country AUS10 AUS11 DEU10 DEU11
# 1 C10 USA 0.3376674 0.1568496 0.5033433 0.7327734
# 2 C11 USA 0.7421840 0.6808892 0.9050158 0.3689741
If you have a big data set in memory, data.table could be ideal and much faster than alternatives. Something like the following could work, though you will need to play with select_ind and select_ctr as desired on the real dataset.
It might be worth giving us a slightly larger toy example, if possible.
library(data.table)
setDT(df)
select_ind <- paste0(c("C"), c("11","10"))
select_ctr <- paste0(rep(c("AUS", "DEU"), each = 2), c("10","11"))
df[grepl(paste0(select_ind, collapse = "|"), industry), # select rows
..select_ctr] # select columns
AUS10 AUS11 DEU10 DEU11
1: 0.9040223 0.2638725 0.9779399 0.1672789
2: 0.6162678 0.3095942 0.1527307 0.6270880
For more information, see Introduction to data.table.

How to count the occurrence of a word in multiple variables in R and sort them from highest to lowest?

I have a huge dataset with over 3 million obs and 108 columns. There are 14 variables I'm interested in: DIAG_PRINC, DIAG_SECUN, DIAGSEC1:DIAGSEC9, CID_ASSO, CID_MORTE and CID_NOTIF (they're in different positions). These variables contain ICD-10 codes.
I'm interested in counting how many times certain ICD-10 codes appear and then sort them from highest to lowest in a dataframe. Here's some reproductible data:
data <- data.frame(DIAG_PRINC = c("O200", "O200", "O230"),
DIAG_SECUN = c("O555", "O530", "O890"),
DIAGSEC1 = c("O766", "O876", "O899"),
DIAGSEC2 = c("O200", "I520", "O200"),
DIAGSEC3 = c("O233", "O200", "O620"),
DIAGSEC4 = c("O060", "O061", "O622"),
DIAGSEC5 = c("O540", "O123", "O344"),
DIAGSEC6 = c("O876", "Y321", "S333"),
DIAGSEC7 = c("O450", "X900", "O541"),
DIAGSEC8 = c("O222", "O111", "O123"),
DIAGSEC9 = c("O987", "O123", "O622"),
CID_MORTE = c("O066", "O699", "O555"),
CID_ASSO = c("O600", "O060", "O068"),
CID_NOTIF = c("O069", "O066", "O065"))
I also have a list of ICD-10 codes that I'm interested in counting.
GRUPO1 <- c("O00", "O000", "O001", "O002", "O008", "O009",
"O01", "O010", "O011", "O019",
"O02", "O020", "O021", "O028", "O029",
"O03", "O030", "O031", "O032", "O033", "O034", "O035", "O036", "O037",
"O038", "O039",
"O04", "O040", "O041", "O042", "O043", "O044", "O045", "O046", "O047",
"O048", "O049",
"O05", "O050", "O051", "O052", "O053", "O054", "O055", "O056", "O057",
"O058", "O059",
"O06", "O060", "O061", "O062", "O063", "O064", "O065", "O066", "O067",
"O068", "O069",
"O07", "O070", "O071", "O072", "O073", "O074", "O075", "O076", "O077",
"O078", "O079",
"O08", "O080", "O081", "O082", "O083", "O084", "O085", "O086", "O087",
"O088", "O089")
What I need is a dataframe counting how many times the ICD-10 codes from "GRUPO1" appear in any row/column from DIAG_PRINC, DIAG_SECUN, DIAGSEC1:DIAGSEC9, CID_ASSO, CID_MORTE and CID_NOTIF variables. For example, on my reproductible data ICD-10 cod "O066" appears twice.
Thank you in advance!
We can unlist the data into a vector, use %in% to subset the values from 'GRUPO1' and get the frequency count with table in base R
v1 <- unlist(data)
out <- table(v1[v1 %in% GRUPO1])
out[order(-out)]
O060 O066 O061 O065 O068 O069
2 2 1 1 1 1
Here is a tidyverse solution using tidyr and dplyr:
library(tidyverse)
pivot_longer(data, everything()) %>%
filter(value %in% GRUPO1) %>%
count(value)
Output
value n
<chr> <int>
1 O060 2
2 O061 1
3 O065 1
4 O066 2
5 O068 1
6 O069 1

In R, left join two tables whose 2 potential keys contain missing data

Background:
I'm working with a fairly large (>10,000 rows) dataset of individual cars, and I need to do some analysis on it. I need to keep this dataset d intact, but I'm only going to be analyzing cars made by Japanese companies (e.g. Nissan, Honda, etc.). d contains information like VIN_prefix (the first two letters of a VIN number that indicates the "World Manufacturer Number"), model year, and make, but no explicit indicator of whether the car is made by a Japanese firm. Here's d:
d <- data.frame(
make = c("GMC","Dodge","NA","Subaru","Nissan","Chrysler"),
model_yr = c("1999","2004","1989","1999","2006","2012"),
VIN_prefix = c("1G","1D","JH","JF","NA","2C"),
stringsAsFactors=FALSE)
Here, rows 3, 4, and 5 correspond to Japanese cars: the NA in row 3 is actually an Acura whose make is missing. See below when I get to the other dataset about why this is.
d also lacks some attributes (columns) about cars that I need for my analysis, e.g. the current CEO of Japanese car firms.
Enter another dataset, a, a dataset about Japanese car firms which contains those extra attributes as well as columns that could be used to identify whether a given car (row) in d is made by a Japanese firm. One of those is VIN_prefix; the other is jp_makes, a list of Japanese auto firms. Here's a:
a <- data.frame(
VIN_prefix = c("JH","JF","1N"),
jp_makes = c("Acura","Subaru","Nissan"),
current_ceo = c("Toshihiro Mibe","Tomomi Nakamura","Makoto Ushida"),
stringsAsFactors=FALSE)
Here, we can see that the "Acura" make, missing in the car from row 3 in d, could be identified by its VIN_prefix "JH", which in row 3 of d is not NA.
Goal:
Left join a onto d so that each of the 3 Japanese cars in d gets the relevant corresponding attributes from a - mainly, current_ceo. (Non-Japanese cars in d would have NA for columns joined from a; this is fine.)
Problem:
As you can tell, the two relevant variables in d that could be used as keys in a join - make and VIN_prefix - have missing data in d. The "matching rules" we could use are imperfect: I could match on d$make == a$jp_makes or on d$VIN_prefix == a$VIN_prefix, but they'd each be wrong due to the missing data in d.
What to do?
What I've tried:
I can try left joining on either one of these potential keys, but not all 3 of the Japanese cars in d wind up with their correct information from a:
try1 <- left_join(d, a, by = c("make" = "jp_makes"))
try2 <- left_join(d, a, by = c("VIN_prefix" = "VIN_prefix"))
I can successfully generate an logical 'indicator' variable in d that tells me whether a car is Japanese or not:
entries_make <- a$jp_makes
entries_vin_prefix <- a$VIN_prefix
d<- d %>%
mutate(is_jp = ifelse(d$VIN_prefix %in% entries_vin_prefix | d$make %in% entries_make, 1, 0)
%>% as.logical())
But that only gets me halfway: I still need those other columns from a to sit next to those Japanese cars in d. It's unfeasible to manually fill all the missing data in some other way; the real datasets these toy examples correspond to are too big for that and I don't have the manpower or time.
Ideally, I'd like a dataset that looks something like this:
ideal <- data.frame(
make = c("GMC","Dodge","NA","Subaru","Nissan","Chrysler"),
model_yr = c("1999","2004","1989","1999","2006","2012"),
VIN_prefix = c("1G","1D","JH","JF","NA","2C"),
current_ceo = c("NA", "NA", "Toshihiro Mibe","Tomomi Nakamura","Makoto Ushida", "NA"),
stringsAsFactors=FALSE)
What do you all think? I've looked at other posts (e.g. here) but their solutions don't really apply. Any help is much appreciated!
Left join on an OR of the two conditions.
library(sqldf)
sqldf("select d.*, a.current_ceo
from d
left join a on d.VIN_prefix = a.VIN_prefix or d.make = a.jp_makes")
giving:
make model_yr VIN_prefix current_ceo
1 GMC 1999 1G <NA>
2 Dodge 2004 1D <NA>
3 NA 1989 JH Toshihiro Mibe
4 Subaru 1999 JF Tomomi Nakamura
5 Nissan 2006 NA Makoto Ushida
6 Chrysler 2012 2C <NA>
Use a two pass method. First fill in the missing make (or VIN values). I'll illustrate by filling in make valuesDo notice taht "NA" is not the same as NA. The first is a character value while the latter is a true R missing value, so I'd first convert those to a true missing value. In natural language I am replacing the missing values in d (note correction of df) with values of 'jp_makes' that are taken from a on the basis of matching VIN_prefix values:
is.na( d$make) <- df$make=="NA"
d$make[is.na(df$make)] <- a$jp_makes[
match( d$VIN_prefix[is.na(d$make)], a$VIN_prefix) ]
Now you have the make values filled in on the basis of the table look up in a. It should be trivial to do the match you wanted by using by.x='make', by.y='jp_make'
merge(d, a, by.x='make', by.y='jp_makes', all.x=TRUE)
make model_yr VIN_prefix.x VIN_prefix.y current_ceo
1 Acura 1989 JH JH Toshihiro Mibe
2 Chrysler 2012 2C <NA> <NA>
3 Dodge 2004 1D <NA> <NA>
4 GMC 1999 1G <NA> <NA>
5 Nissan 2006 NA 1N Makoto Ushida
6 Subaru 1999 JF JF Tomomi Nakamura
You can then use the values in VIN_prefix.y to replace the values the =="NA" in VIN_prefix.x.

Carrying out a simple dataframe subset with dplyr

Consider the following dataframe slice:
df = data.frame(locations = c("argentina","brazil","argentina","denmark"),
score = 1:4,
row.names = c("a091", "b231", "a234", "d154"))
df
locations score
a091 argentina 1
b231 brazil 2
a234 argentina 3
d154 denmark 4
sorted = c("a234","d154","a091") #in my real task these strings are provided from an exogenous function
df2 = df[sorted,] #quick and simple subset using rownames
EDIT: Here I'm trying to subset AND order the data according to sorted - sorry that was not clear before. So the output, importantly, is:
locations score
a234 argentina 1
d154 denmark 4
a091 argentina 3
And not as you would get from a simple subset operation:
locations score
a091 argentina 1
a234 argentina 3
d154 denmark 4
I'd like to do the exactly same thing in dplyr. Here is an inelegant hack:
require(dplyr)
dt = as_tibble(df)
rownames(dt) = rownames(df)
Warning message:
Setting row names on a tibble is deprecated.
dt2 = dt[sorted,]
I'd like to do it properly, where the rownames are an index in the data table:
dt_proper = as_tibble(x = df,rownames = "index")
dt_proper2 = dt_proper %>% ?some_function(index, sorted)? #what would this be?
dt_proper2
# A tibble: 3 x 3
index locations score
<chr> <fct> <int>
1 a091 argentina 1
2 d154 denmark 4
3 a234 argentina 3
But I can't for the life of me figure out how to do this using filter or some other dplyr function, and without some convoluted conversion to factor, re-order factor levels, etc.
Hy,
you can simply use mutate and filter to get the row.names of your data frame into a index column and filter to the vector "sorted" and sort the data frame due to the vector "sorted":
df2 <- df %>% mutate(index=row.names(.)) %>% filter(index %in% sorted)
df2 <- df2[order(match(df2[,"index"], sorted))]
I think I've figured it out:
dt_proper2 = dt_proper[match(sorted,dt_proper$index),]
Seems to be shortest implementation of what df[sorted,] will do.
Functions in the tidyverse (dplyr, tibble, etc.) are built around the concept (as far as I know), that rows only contain attributes (columns) and no row names / labels / indexes. So in order to sort columns, you have to introduce a new column containing the ranks of each row.
The way I would do it is to create another tibble containing your "sorting information" (sorting attribute, rank) and inner join it to your original tibble. Then I could order the rows by rank.
library(tidyverse)
# note that I've changed the third column's name to avoid confusion
df = tibble(
locations = c("argentina","brazil","argentina","denmark"),
score = 1:4,
custom_id = c("a091", "b231", "a234", "d154")
)
sorted_ids = c("a234","d154","a091")
sorting_info = tibble(
custom_id = sorted_ids,
rank = 1:length(sorted_ids)
)
ordered_ids = df %>%
inner_join(sorting_info) %>%
arrange(rank) %>%
select(-rank)

Using R, Randomly Assigning Students Into Groups Of 4

I'm still learning R and have been given the task of grouping a long list of students into groups of four based on another variable. I have loaded the data into R as a data frame. How do I sample entire rows without replacement, one from each of 4 levels of a variable and have R output the data into a spreadsheet?
So far I have been tinkering with a for loop and the sample function but I'm quickly getting over my head. Any suggestions? Here is sample of what I'm attempting to do. Given:
Last.Name <- c("Picard","Troi","Riker","La Forge", "Yar", "Crusher", "Crusher", "Data")
First.Name <- c("Jean-Luc", "Deanna", "William", "Geordi", "Tasha", "Beverly", "Wesley", "Data")
Email <- c("a#a.com","b#b.com", "c#c.com", "d#d.com", "e#e.com", "f#f.com", "g#g.com", "h#h.com")
Section <- c(1,1,2,2,3,3,4,4)
df <- data.frame(Last.Name,First.Name,Email,Section)
I want to randomly select a Star Trek character from each section and end up with 2 groups of 4. I would want the entire row's worth of information to make it over to a new data frame containing all groups with their corresponding group number.
I'd use the wonderful package 'dplyr'
require(dplyr)
random_4 <- df %>% group_by(Section) %>% slice(sample(c(1,2),1))
random_4
Source: local data frame [4 x 4]
Groups: Section
Last.Name First.Name Email Section
1 Troi Deanna b#b.com 1
2 La Forge Geordi d#d.com 2
3 Crusher Beverly f#f.com 3
4 Data Data h#h.com 4
random_4
Source: local data frame [4 x 4]
Groups: Section
Last.Name First.Name Email Section
1 Picard Jean-Luc a#a.com 1
2 Riker William c#c.com 2
3 Crusher Beverly f#f.com 3
4 Data Data h#h.com 4
%>% means 'and then'
The code is read as:
Take DF AND THEN for all 'Section', select by position (slice) 1 or 2. Voila.
I suppose you have 8 students: First.Name <- c("Jean-Luc", "Deanna", "William", "Geordi", "Tasha", "Beverly", "Wesley", "Data").
If you wish to randomly assign a section number to the 8 students, and assuming you would like each section to have 2 students, then you can either permute Section <- c(1, 1, 2, 2, 3, 3, 4, 4) or permute the list of the students.
First approach, permute the sections:
> assigned_section <- print(sample(Section))
[1] 1 4 3 2 2 3 4 1
Then the following data frame gives the assignments:
assigned_students <- data.frame(First.Name, assigned_section)
Second approach, permute the students:
> assigned_students <- print(sample(First.Name))
[1] "Data" "Geordi" "Tasha" "William" "Deanna" "Beverly" "Jean-Luc" "Wesley"
Then, the following data frame gives the assignments:
assigned_students <- data.frame(assigned_students, Section)
Alex, Thank You. Your answer wasn't exactly what I was looking for, but it inspired the correct one for me. I had been thinking about the process from a far too complicated point of view. Instead of having R select rows and put them into a new data frame, I decided to have R assign a random number to each of the students and then sort the data frame by the number:
First, I broke up the data frame into sections:
df1<- subset(df, Section ==1)
df2<- subset(df, Section ==2)
df3<- subset(df, Section ==3)
df4<- subset(df, Section ==4)
Then I randomly generated a group number 1 through 4.
Groupnumber <-sample(1:4,4, replace=F)
Next, I told R to bind the columns:
Assigned1 <- cbind(df1,Groupnumber)
*Ran the group number generator and cbind in alternating order until I got through the whole set. (Wanted to make sure the order of the numbers was unique for each section).
Finally row binding the data set back together:
Final_List<-rbind(Assigned1,Assigned2,Assigned3,Assigned4)
Thank you everyone who looked this over. I am new to data science, R, and stackoverflow, but as I learn more I hope to return the favor.
I'd suggest the randomizr package to "block assign" according to section. The block_ra function lets you do this in a easy-to-read one-liner.
install.packages("randomizr")
library(randomizr)
df$group <- block_ra(block_var = df$Section,
condition_names = c("group_1", "group_2"))
You can inspect the resulting sets in a variety of ways. Here's with base r subsetting:
df[df$group == "group_1",]
Last.Name First.Name Email Section group
2 Troi Deanna b#b.com 1 group_1
3 Riker William c#c.com 2 group_1
6 Crusher Beverly f#f.com 3 group_1
7 Crusher Wesley g#g.com 4 group_1
df[df$group == "group_2",]
Last.Name First.Name Email Section group
1 Picard Jean-Luc a#a.com 1 group_2
4 La Forge Geordi d#d.com 2 group_2
5 Yar Tasha e#e.com 3 group_2
8 Data Data h#h.com 4 group_2
If you want to roll your own:
set <- tapply(1:nrow(df), df$Section, FUN = sample, size = 1)
df[set,] # show the sampled set
df[-set,] # show the complimentary set

Resources