dplyr lookup table / pattern matching [duplicate] - r

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 5 years ago.
I was looking for a smart, or "tidier" way, to make use of a lookup table in the tidyverse, but could not find a satisfying solution.
I have a dataset and lookup table:
# Sample data
data <- data.frame(patients = 1:5,
treatment = letters[1:5],
hospital = c("yyy", "yyy", "zzz", "www", "uuu"),
response = rnorm(5))
# Lookup table
lookup <- tibble(hospital = c("yyy", "uuu"), patients = c(1,5))
... where each row in the lookup table is the exact pattern for which I want to filter the first tibble (data).
The wanted result would look like this:
# A tibble: 3 x 4
patients treatment hospital response
<dbl> <chr> <chr> <dbl>
1 1.00 a yyy -0.275
2 5.00 e uuu -0.0967
The easiest solution I came up with is something like this:
as.tibble(dat) %>%
filter(paste(hospital, patients) %in% paste(lookup$hospital, lookup$patients))
However, this must be something that a lot of people regularly do - is there a cleaner and more convienent way to do this (i.e. for more than two columns in your lookup table)?

Since the default behavior of dplyr::inner_join() is to match on common columns between the two tibbles passed to the function and the lookup table consists of only the 2 key columns, the shortest code is as follows:
library(dplyr)
# Sample data
data <- tibble(patients = 1:5,
treatment = letters[1:5],
hospital = c("yyy", "yyy", "zzz", "www", "uuu"),
response = rnorm(5))
# Lookup table
lookup <- tibble(hospital = c("yyy", "uuu"), patients = c(1,5))
data %>% inner_join(.,lookup)
...and the output:
> data %>% inner_join(.,lookup)
Joining, by = c("patients", "hospital")
# A tibble: 2 x 4
patients treatment hospital response
<dbl> <chr> <chr> <dbl>
1 1.00 a yyy -1.44
2 5.00 e uuu -0.313
>
Because the desired output can be accomplished by a join on key columns across the tibbles, the paste() code in the OP is unnecessary.
Also note that inner_join() is the right type of join because the desired output is rows that match across both incoming tibbles, and the lookup table does not have duplicate rows. If the lookup table contained duplicate rows, then semi_join() would be the appropriate function, per the comments on the OP.

Related

Select columns from a data frame

I have a Data Frame made up of several columns, each corresponding to a different industry per country. I have 56 industries and 43 countries and I'd select only industries from 5 to 22 per country (18 industries). The big issue is that each industry per country is named as: AUS1, AUS2 ..., AUS56. What I shall select is AUS5 to AUS22, AUT5 to AUT22 ....
A viable solution could be to select columns according to the following algorithm: the first column of interest, i.e., AUS5 corresponds to column 10 and then I select up to AUS22 (corresponding to column 27). Then, I should skip all the remaining column for AUS (i.e. AUS23 to AUS56), and the first 4 columns for the next country (from AUT1 to AUT4). Then, I select, as before, industries from 5 to 22 for AUT. Basically, the algorithm, starting from column 10 should be able to select 18 columns(including column 10) and then skip the next 38 columns, and then select the next 18 columns. This process should be repeated for all the 43 countries.
How can I code that?
UPDATE, Example:
df=data.frame(industry = c("C10","C11","C12","C13"),
country = c("USA"),
AUS3 = runif(4),
AUS4 = runif(4),
AUS5 = runif(4),
AUS6 = runif(4),
DEU5 = runif(4),
DEU6 = runif(4),
DEU7 = runif(4),
DEU8 = runif(4))
#I'm interested only in C10-c11:
df_a=df %>% filter(grepl('C10|C11',industry))
df_a
#Thus, how can I select columns AUS10,AUS11, DEU10,DEU11 efficiently, considering that I have a huge dataset?
Demonstrating the paste0 approach.
ctr <- unique(gsub('\\d', '', names(df[-(1:2)])))
# ctr <- c("AUS", "DEU") ## alternatively hard-coded
ind <- c(10, 11)
subset(df, industry == paste0('C', 10:11),
select=c('industry', 'country', paste0(rep(ctr, each=length(ind)), ind)))
# industry country AUS10 AUS11 DEU10 DEU11
# 1 C10 USA 0.3376674 0.1568496 0.5033433 0.7327734
# 2 C11 USA 0.7421840 0.6808892 0.9050158 0.3689741
Or, since you appear to like grep you could do.
df[grep('10|11', df$industry), grep('industry|country|[A-Z]{3}1[01]', names(df))]
# industry country AUS10 AUS11 DEU10 DEU11
# 1 C10 USA 0.3376674 0.1568496 0.5033433 0.7327734
# 2 C11 USA 0.7421840 0.6808892 0.9050158 0.3689741
If you have a big data set in memory, data.table could be ideal and much faster than alternatives. Something like the following could work, though you will need to play with select_ind and select_ctr as desired on the real dataset.
It might be worth giving us a slightly larger toy example, if possible.
library(data.table)
setDT(df)
select_ind <- paste0(c("C"), c("11","10"))
select_ctr <- paste0(rep(c("AUS", "DEU"), each = 2), c("10","11"))
df[grepl(paste0(select_ind, collapse = "|"), industry), # select rows
..select_ctr] # select columns
AUS10 AUS11 DEU10 DEU11
1: 0.9040223 0.2638725 0.9779399 0.1672789
2: 0.6162678 0.3095942 0.1527307 0.6270880
For more information, see Introduction to data.table.

R Beginner struggling with extremely messy XLSX

I got an XLSX with data from a questionnaire for my master thesis.
The questions and answers for an interviewee are in one row in the second column. The first column contains the date.
The data of the second column comes in a form like this:
"age":"52","height":"170","Gender":"Female",...and so on
I started with:
test12 <- read_xlsx("Testdaten.xlsx")
library(splitstackshape)
test13 <- concat.split(data = test12, split.col= "age", sep =",")
Then I got the questions and the answers as a column divided by a ":".
For e.g. column 1: "age":"52" and column2:"height":"170".
But the data is so messy that sometimes in the column of the age question and answer there is a height question and answer and for some questionnaires questions and answers double.
I would need the questions as variables and the answers as observations. But I have no clue how to get there. I could clean the data in excel first, but with the fact that columns are not constant and there are for e.g. some height questions in the age column I see no chance to do it as I will get new data regularly, formated the same way.
Here is an example of the data:
A tibble: 5 x 2
partner.createdAt partner.wphg.info
<chr> <chr>
1 2019-11-09T12:13:11.099Z "{\"age_years\":\"50\",\"job_des\":\"unemployed\",\"height_cm\":\"170\",\"Gender\":\"female\",\"born_in\":\"Italy\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"200000\""
2 2019-11-01T06:43:22.581Z "{\"age_years\":\"34\",\"job_des\":\"self-employed\",\"height_cm\":\"158\",\"Gender\":\"male\",\"born_in\":\"Germany\",\"Alcoholic\":\"true\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"10000\""
3 2019-11-10T07:59:46.136Z "{\"age_years\":\"24\",\"height_cm\":\"187\",\"Gender\":\"male\",\"born_in\":\"England\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"150000\""
4 2019-11-11T13:01:48.488Z "{\"age_years\":\"59\",\"job_des\":\"employed\",\"height_cm\":\"167\",\"Gender\":\"female\",\"born_in\":\"United States\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"2\",\"total_wealth\":\"1000000~
5 2019-11-08T14:54:26.654Z "{\"age_years\":\"36\",\"height_cm\":\"180\",\"born_in\":\"Germany\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"170000\",\"job_des\":\"employed\",\"Gender\":\"male\""
Thank you so much for your time!
You can loop through each entry, splitting at , as you did. Then you can loop through them all again, splitting at :.
The result will be a bunch of variable/value pairings. This can be all done stacked. Then you just want to pivot back into columns.
data
Updated the data based on your edit.
data <- tribble(~partner.createdAt, ~partner.wphg.info,
'2019-11-09T12:13:11.099Z', '{\"age_years\":\"50\",\"job_des\":\"unemployed\",\"height_cm\":\"170\",\"Gender\":\"female\",\"born_in\":\"Italy\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"200000\"',
'2019-11-01T06:43:22.581Z', '{\"age_years\":\"34\",\"job_des\":\"self-employed\",\"height_cm\":\"158\",\"Gender\":\"male\",\"born_in\":\"Germany\",\"Alcoholic\":\"true\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"10000\"',
'2019-11-10T07:59:46.136Z', '{\"age_years\":\"24\",\"height_cm\":\"187\",\"Gender\":\"male\",\"born_in\":\"England\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"150000\"',
'2019-11-11T13:01:48.488Z', '{\"age_years\":\"59\",\"job_des\":\"employed\",\"height_cm\":\"167\",\"Gender\":\"female\",\"born_in\":\"United States\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"2\",\"total_wealth\":\"1000000\"',
'2019-11-08T14:54:26.654Z', '{\"age_years\":\"36\",\"height_cm\":\"180\",\"born_in\":\"Germany\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"170000\",\"job_des\":\"employed\",\"Gender\":\"male\"')
libraries
We need a few here. Or you can just call tidyverse.
library(stringr)
library(purrr)
library(dplyr)
library(tibble)
library(tidyr)
function
This function will create a data frame (or tibble) for each question. The first column is the date, the second is the variable, the third is the value.
clean_record <- function(date, text) {
clean_records <- str_split(text, pattern = ",", simplify = TRUE) %>%
str_remove_all(pattern = "\\\"") %>% # remove double quote
str_remove_all(pattern = "\\{|\\}") %>% # remove curly brackets
str_split(pattern = ":", simplify = TRUE)
tibble(date = as.Date(date), variable = clean_records[,1], value = clean_records[,2])
}
iteration
Now we use pmap_dfr from purrr to loop over the rows, outputting each row with an id variable named record.
This will stack the data as described in the function. The mutate() line converts all variable names to lowercase. The distinct() line will filter out rows that are exact duplicates.
What we do then is just pivot on the variable column. Of course, replace data with whatever you name your data frame.
data_clean <- pmap_dfr(data, ~ clean_record(..1, ..2), .id = "record") %>%
mutate(variable = tolower(variable)) %>%
distinct() %>%
pivot_wider(names_from = variable, values_from = value)
result
The result is something like this. Note how I had reordered some of the columns, but it still works. You are probably not done just yet. All columns are now of type character. You need to figure out the desired type for each and convert.
# A tibble: 5 x 10
record date age_years job_des height_cm gender born_in alcoholic knowledge_selfass total_wealth
<chr> <date> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 2019-11-09 50 unemployed 170 female Italy false 5 200000
2 2 2019-11-01 34 self-employed 158 male Germany true 3 10000
3 3 2019-11-10 24 NA 187 male England false 3 150000
4 4 2019-11-11 59 employed 167 female United States false 2 1000000
5 5 2019-11-08 36 employed 180 male Germany false 5 170000
For example, convert age_years to numeric.
data_clean %>%
mutate(age_years = as.numeric(age_years))
I am sure you may run into other things, but this should be a start.

Carrying out a simple dataframe subset with dplyr

Consider the following dataframe slice:
df = data.frame(locations = c("argentina","brazil","argentina","denmark"),
score = 1:4,
row.names = c("a091", "b231", "a234", "d154"))
df
locations score
a091 argentina 1
b231 brazil 2
a234 argentina 3
d154 denmark 4
sorted = c("a234","d154","a091") #in my real task these strings are provided from an exogenous function
df2 = df[sorted,] #quick and simple subset using rownames
EDIT: Here I'm trying to subset AND order the data according to sorted - sorry that was not clear before. So the output, importantly, is:
locations score
a234 argentina 1
d154 denmark 4
a091 argentina 3
And not as you would get from a simple subset operation:
locations score
a091 argentina 1
a234 argentina 3
d154 denmark 4
I'd like to do the exactly same thing in dplyr. Here is an inelegant hack:
require(dplyr)
dt = as_tibble(df)
rownames(dt) = rownames(df)
Warning message:
Setting row names on a tibble is deprecated.
dt2 = dt[sorted,]
I'd like to do it properly, where the rownames are an index in the data table:
dt_proper = as_tibble(x = df,rownames = "index")
dt_proper2 = dt_proper %>% ?some_function(index, sorted)? #what would this be?
dt_proper2
# A tibble: 3 x 3
index locations score
<chr> <fct> <int>
1 a091 argentina 1
2 d154 denmark 4
3 a234 argentina 3
But I can't for the life of me figure out how to do this using filter or some other dplyr function, and without some convoluted conversion to factor, re-order factor levels, etc.
Hy,
you can simply use mutate and filter to get the row.names of your data frame into a index column and filter to the vector "sorted" and sort the data frame due to the vector "sorted":
df2 <- df %>% mutate(index=row.names(.)) %>% filter(index %in% sorted)
df2 <- df2[order(match(df2[,"index"], sorted))]
I think I've figured it out:
dt_proper2 = dt_proper[match(sorted,dt_proper$index),]
Seems to be shortest implementation of what df[sorted,] will do.
Functions in the tidyverse (dplyr, tibble, etc.) are built around the concept (as far as I know), that rows only contain attributes (columns) and no row names / labels / indexes. So in order to sort columns, you have to introduce a new column containing the ranks of each row.
The way I would do it is to create another tibble containing your "sorting information" (sorting attribute, rank) and inner join it to your original tibble. Then I could order the rows by rank.
library(tidyverse)
# note that I've changed the third column's name to avoid confusion
df = tibble(
locations = c("argentina","brazil","argentina","denmark"),
score = 1:4,
custom_id = c("a091", "b231", "a234", "d154")
)
sorted_ids = c("a234","d154","a091")
sorting_info = tibble(
custom_id = sorted_ids,
rank = 1:length(sorted_ids)
)
ordered_ids = df %>%
inner_join(sorting_info) %>%
arrange(rank) %>%
select(-rank)

Read in a table that meets specific requirements in R

I am reading in data from a .txt file that contains over thousands of records
table1 <- read.table("teamwork.txt", sep ="|", fill = TRUE)
Looks like:
f_name l_name hours_worked code
Jim Baker 8.5 T
Richard Copton 4.5 M
Tina Bar 10 S
However I only want to read in data that has a 'S' or 'M' code:
I tried to concat the columns:
newdata <- subset(table1, code = 'S' |'M')
However I get this issue:
operations are possible only for numeric, logical or complex types
If there are thousands or tens of thousands of records (maybe not for millions), you should just be able to filter after you read in all the data:
> library(tidyverse)
> df %>% filter(code=="S"|code=="M")
# A tibble: 2 x 4
f_name l_name hours_worked code
<fct> <fct> <dbl> <fct>
1 Richard Copton 4.50 M
2 Tina Bar 10.0 S
If you really want to just pull in the rows that meet your condition, try sqldf package as in example here: How do i read only lines that fulfil a condition from a csv into R?
You can try
cols_g <- table1[which(table1$code == "S" | table1$code == "M",]
OR
cols_g <- subset(table1, code=="S" | code=="M")
OR
library(dplyr)
cols_g <- table1 %>% filter(code=="S" | code=="M")
If you want to add column cols_g on table1, you can use table1$cols_g assigned anything from these 3 methods instead of cols_g.

Order R dataframe columns by using second dataframe as a reference.

I am working on developing a statistical program using R, this program accepts two dataFrames. The first dataFrame carries demographics information of patients and the second carries their clinical information. The key column in the demographics dataFrame is the patientID column. While in the clinical dataFrame each patientID is a column. I wish to arrange/sort my demographics dataFrame by patientID, based upon the order of patientID's(ind columns) in the clinical dataFrame. Also the ID's could numeric or alphanumeric or could just be some-alphabet sequence. I was able to write some code, but would need help/guidance to come up with a better way to sort columns irrespective of their datatype(character, factor, numeric etc).
demogr = read.csv(mydemoFile, header = T, stringsAsFactors
=TRUE,colClasses=c('factor','factor','factor','factor','factor'))
demogr=demogr[order(as.numeric(demogr$Patient_ID)),]
myClinicalFrame=fread(myInputFile,header=T,data.table=FALSE,sep=",")
rowNames=myClinicalFrame[,1]
myClinicalFrame[,1]<-NULL
rownames(myClinicalFrame)=rowNames
names(myClinicalFrame)=sort((names(myClinicalFrame)))
The above works for certain types but fails for others. eg: Patient_ID in
demoFrame is numerically sorted above, in some situations R changes patient_ID like
109999345554545465 to 1.09e+18, which doesn't match with the second dataFrame.
Thanks
Let's start by creating two example data frames:
patientID = c(123456789012345,1234,1234567890,123)
state = c("FL","NJ","CA","TX")
demog = data.frame(ID = patientID,state = state)
clinical = data.frame(col1 = c(1,2,3),
col2 = c(3,4,5),
col2 = c(1,7,9),
col2 = c(6,4,2))
colnames(clinical) = c("1234567890","123","123456789012345","1234")
This gives us:
> demog
ID state
1 1.234568e+14 FL
2 1.234000e+03 NJ
3 1.234568e+09 CA
4 1.230000e+02 TX
and
> clinical
1234567890 123 123456789012345 1234
1 1 3 1 6
2 2 4 7 4
3 3 5 9 2
As you can see the rows in demog are in a different order than the columns in clinical.
To sort the rows in demog do:
rownames(demog) = demog$ID
demog = demog[colnames(clinical),]
This works even for IDs that are factors or characters, because rownames() will convert them to character.
Result:
> demog
ID state
1234567890 1.234568e+09 CA
123 1.230000e+02 TX
123456789012345 1.234568e+14 FL
1234 1.234000e+03 NJ

Resources