Look-up table with different values for each column - r

I've got one table that is a set of all of my columns, their possible corresponding values, and the description for each one of those values. For example, the table looks like this:
ID Column Value Description
1 Age A Age_20-30
2 Age B Age_30-50
3 Age C Age_50-75
4 Geo A Big_City
5 Geo B Suburbs
6 Geo C Rural_Town
And so on. Next, I have my main data frame, which is populated with the column values. What I'd like to do is replace all values in each column with their corresponding description.
Old:
ID Age Geo
1 A B
2 A A
3 C A
4 B C
5 C C
New:
ID Age Geo
1 Age_20-30 Suburbs
2 Age_20-30 Big_City
3 Age_50-75 Big_City
4 Age_30-50 Rural_Town
5 Age_50-75 Rural_Town
Now I know how I can do this for one column using the following (where lookup_df is a table for only one of my columns):
old <- lookup_df$Value
new <- lookup_df$Description
df$Age <- new[match(df$Age, old, nomatch = 0)]
But I am struggling to do this for all columns. My full set of data has >100 columns so doing this manually for each column isn't really an option (at least in terms of efficiency). Any help or pointers in the right direction would be a huge help.

We can split the first dataset into a list of named vectors, then use that to match and replace values in the second dataset:
lst1 <- lapply(split(df1[c('Value', 'Description')], df1$Column),
               function(x) setNames(x$Description, x$Value))
df2[-1] <- Map(function(x, y) y[x], df2[-1], lst1)
Output:
df2
# ID Age Geo
#1 1 Age_20-30 Suburbs
#2 2 Age_20-30 Big_City
#3 3 Age_50-75 Big_City
#4 4 Age_30-50 Rural_Town
#5 5 Age_50-75 Rural_Town
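For clarity, lst1 is a named list of lookup vectors, one per column, so y[x] in the Map() call is a plain named-vector lookup:
lst1
#$Age
#          A           B           C
#"Age_20-30" "Age_30-50" "Age_50-75"
#
#$Geo
#           A            B            C
#  "Big_City"    "Suburbs" "Rural_Town"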
Data
df1 <- structure(list(ID = 1:6, Column = c("Age", "Age", "Age", "Geo",
"Geo", "Geo"), Value = c("A", "B", "C", "A", "B", "C"),
Description = c("Age_20-30",
"Age_30-50", "Age_50-75", "Big_City", "Suburbs", "Rural_Town"
)), class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(ID = 1:5, Age = c("A", "A", "C", "B", "C"), Geo = c("B",
"A", "A", "C", "C")), class = "data.frame", row.names = c(NA,
-5L))

To do this on data with a lot of columns, you can get the data into long format, join it with the first data frame, and (if needed) convert it back to wide format.
library(dplyr)
library(tidyr)
df2 %>%
  pivot_longer(cols = -ID) %>%
  left_join(df1 %>% select(-ID),
            by = c('name' = 'Column', 'value' = 'Value')) %>%
  select(-value) %>%
  pivot_wider(names_from = name, values_from = Description)
# ID Age Geo
# <int> <chr> <chr>
#1 1 Age_20-30 Suburbs
#2 2 Age_20-30 Big_City
#3 3 Age_50-75 Big_City
#4 4 Age_30-50 Rural_Town
#5 5 Age_50-75 Rural_Town
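One caveat for both approaches: codes with no entry in the lookup table come back as NA. If you would rather keep the original value in that case, a small sketch of a variant of the Map() step from the first answer (assuming its lst1) is:
df2[-1] <- Map(function(x, y) ifelse(is.na(y[x]), x, y[x]), df2[-1], lst1)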


How to choose the most common value in a group related to other group in R?

I have in R the following data frame:
ID = c(rep(1,5),rep(2,3),rep(3,2),rep(4,6));ID
VAR = c("A","A","A","A","B","C","C","D",
"E","E","F","A","B","F","C","F");VAR
CATEGORY = c("ANE","ANE","ANA","ANB","ANE","BOO","BOA","BOO",
"CAT","CAT","DOG","ANE","ANE","DOG","FUT","DOG");CATEGORY
DATA = data.frame(ID,VAR,CATEGORY);DATA
It looks like the table below:
ID VAR CATEGORY
1  A   ANE
1  A   ANE
1  A   ANA
1  A   ANB
1  B   ANE
2  C   BOO
2  C   BOA
2  D   BOO
3  E   CAT
3  E   CAT
4  F   DOG
4  A   ANE
4  B   ANE
4  F   DOG
4  C   FUT
4  F   DOG
The ideal output, given the data frame above, would be:
ID VAR CATEGORY
1  A   ANE
2  C   BOO
3  E   CAT
4  F   DOG
More specifically: for ID 1, I want to find the most common value in the column VAR (which is A) and then the most common value in the column CATEGORY among the rows with that VAR value (which is ANE), and so forth.
How can I do this in R?
Note that this is only a small sample; my real data frame contains 850,000 rows and 14,000 unique IDs.
Another dplyr strategy using count and slice:
library(dplyr)
DATA %>%
  group_by(ID) %>%
  count(VAR, CATEGORY) %>%
  slice(which.max(n)) %>%
  select(-n)
ID VAR CATEGORY
<dbl> <chr> <chr>
1 1 A ANE
2 2 C BOA
3 3 E CAT
4 4 F DOG
(Note: for ID 2 the three VAR/CATEGORY combinations are tied at n = 1, and which.max() picks the first maximum, which is why BOA rather than BOO appears here.)
dplyr
library(dplyr)
DATA %>%
group_by(ID) %>%
filter(VAR == names(sort(table(VAR), decreasing=TRUE))[1]) %>%
group_by(ID, VAR) %>%
summarize(CATEGORY = names(sort(table(CATEGORY), decreasing=TRUE))[1]) %>%
ungroup()
# # A tibble: 4 x 3
# ID VAR CATEGORY
# <dbl> <chr> <chr>
# 1 1 A ANE
# 2 2 C BOA
# 3 3 E CAT
# 4 4 F DOG
Data
DATA <- structure(list(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4), VAR = c("A", "A", "A", "A", "B", "C", "C", "D", "E", "E", "F", "A", "B", "F", "C", "F"), CATEGORY = c("ANE", "ANE", "ANA", "ANB", "ANE", "BOO", "BOA", "BOO", "CAT", "CAT", "DOG", "ANE", "ANE", "DOG", "FUT", "DOG")), class = "data.frame", row.names = c(NA, -16L))
We could modify the usual Mode function to return the index and use that in slice after grouping by 'ID':
Modeind <- function(x) {
  ux <- unique(x)
  which.max(tabulate(match(x, ux)))
}
library(dplyr)
DATA %>%
  group_by(ID) %>%
  slice(Modeind(VAR)) %>%
  ungroup()
Output:
# A tibble: 4 x 3
ID VAR CATEGORY
<dbl> <chr> <chr>
1 1 A ANE
2 2 C BOO
3 3 E CAT
4 4 F DOG
A base R option with nested subset + ave
subset(
  subset(
    DATA,
    !!ave(ave(ID, ID, VAR, FUN = length), ID, FUN = function(x) x == max(x))
  ),
  !!ave(ave(ID, ID, VAR, CATEGORY, FUN = length), ID, VAR,
        FUN = function(x) seq_along(x) == which.max(x))
)
gives
ID VAR CATEGORY
1 1 A ANE
6 2 C BOO
9 3 E CAT
11 4 F DOG
Explanation
The inner subset + ave keeps the rows with the most common VAR value within each ID.
Based on the trimmed data frame from the previous step, the outer subset + ave keeps the first row with the most common CATEGORY value within each ID + VAR group; a step-by-step equivalent is sketched below.
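For readers less familiar with ave, here is a step-by-step sketch of the same logic (assuming the DATA object defined above); as.logical() plays the role of !! in the nested version:
# Step 1: count rows per (ID, VAR), then keep rows whose VAR is the most common within each ID
n_var <- ave(DATA$ID, DATA$ID, DATA$VAR, FUN = length)
step1 <- DATA[as.logical(ave(n_var, DATA$ID, FUN = function(x) x == max(x))), ]
# Step 2: within each ID + VAR, keep the single first row of the most common CATEGORY
n_cat <- ave(step1$ID, step1$ID, step1$VAR, step1$CATEGORY, FUN = length)
step1[as.logical(ave(n_cat, step1$ID, step1$VAR,
                     FUN = function(x) seq_along(x) == which.max(x))), ]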

Transforming a dataframe in r to apply pivot table

I have a data frame like below:
Red Green Black
John A B C
Sean A D C
Tim B C C
How can I transform it into the form below to apply a pivot table (or can this be done directly in R without transforming the data)?
Names Code Type
John Red A
John Green B
John Black C
Sean Red A
Sean Green D
Sean Black C
Tim Red B
Tim Green C
Tim Black C
So then my ultimate goal is to count the types as below by a pivot table on the transformed dataframe:
Count of Code for each type:
Row Labels   A  B  C  D  Grand Total
John         1  1  1     3
Sean         1     1  1  3
Tim             1  2     3
Grand Total  2  2  4  1  9
Reading similar topics did not help much.
Thanks in advance!
Regards
Using a literal dump from your first matrix-like frame above:
dat <- structure(list(Red = c("A", "A", "B"), Green = c("B", "D", "C"
), Black = c("C", "C", "C")), class = "data.frame", row.names = c("John",
"Sean", "Tim"))
I can do this:
library(dplyr)
library(tidyr)
tibble::rownames_to_column(dat, var = "Names") %>%
  gather(Code, Type, -Names)
# Names Code Type
# 1 John Red A
# 2 Sean Red A
# 3 Tim Red B
# 4 John Green B
# 5 Sean Green D
# 6 Tim Green C
# 7 John Black C
# 8 Sean Black C
# 9 Tim Black C
We can extend that to get your next goal:
tibble::rownames_to_column(dat, var = "Names") %>%
  gather(Code, Type, -Names) %>%
  xtabs(~ Names + Type, data = .)
# Type
# Names A B C D
# John 1 1 1 0
# Sean 1 0 1 1
# Tim 0 1 2 0
which then just needs marginals:
tibble::rownames_to_column(dat, var = "Names") %>%
  gather(Code, Type, -Names) %>%
  xtabs(~ Names + Type, data = .) %>%
  addmargins()
# Type
# Names A B C D Sum
# John 1 1 1 0 3
# Sean 1 0 1 1 3
# Tim 0 1 2 0 3
# Sum 2 2 4 1 9
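A side note: gather() still works but has been superseded in tidyr; a sketch of the same first step with pivot_longer() (assuming the same dat) would be:
tibble::rownames_to_column(dat, var = "Names") %>%
  tidyr::pivot_longer(-Names, names_to = "Code", values_to = "Type")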
You can use reshape(). I'm not sure about your data structure, i.e. whether the names are in a column or are the row names, so I've added both versions.
reshape(dat1, idvar = "Names",
        varying = 2:4,
        v.names = "Type", direction = "long",
        timevar = "Code", times = c("red", "green", "black"),
        new.row.names = 1:9)
reshape(transform(dat2, Names = rownames(dat2)), idvar = "Names",
        varying = 1:3,
        v.names = "Type", direction = "long",
        timevar = "Code", times = c("red", "green", "black"),
        new.row.names = 1:9)
# Names Code Type
# 1 John red A
# 2 Sean red A
# 3 Tim red B
# 4 John green B
# 5 Sean green D
# 6 Tim green C
# 7 John black C
# 8 Sean black C
# 9 Tim black C
To get a raw version, with numeric time codes, you could do:
res <- reshape(transform(dat2, Names = rownames(dat2)), idvar = "Names",
               varying = 1:3,
               v.names = "Type", direction = "long",
               timevar = "Code")
res
# Names Code Type
# John.1 John 1 A
# Sean.1 Sean 1 A
# Tim.1 Tim 1 B
# John.2 John 2 B
# Sean.2 Sean 2 D
# Tim.2 Tim 2 C
# John.3 John 3 C
# Sean.3 Sean 3 C
# Tim.3 Tim 3 C
After that you may assign labels at will to the "Code" column by converting it to a factor, like so:
res$Code <- factor(res$Code, labels=c("red", "green", "black"))
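After the relabelling, res matches the earlier output apart from the row names:
head(res, 3)
#        Names Code Type
# John.1  John  red    A
# Sean.1  Sean  red    A
# Tim.1    Tim  red    B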
Data
dat1 <- structure(list(Names = c("John", "Sean", "Tim"), Red = c("A",
"A", "B"), Green = c("B", "D", "C"), Black = c("C", "C", "C")), row.names = c(NA,
-3L), class = "data.frame")
dat2 <- structure(list(Red = c("A", "A", "B"), Green = c("B", "D", "C"
), Black = c("C", "C", "C")), row.names = c("John", "Sean", "Tim"
), class = "data.frame")
What you aim to do is (1) create a contingency table and then (2) compute the sums of the table entries over rows and columns.
Step 1: Create a contingency table
I first pivoted the data using pivot_longer() rather than gather() because it's more intuitive. Then apply table() to the two variables of interest.
# Toy example
df <- structure(list(Red = c("A", "A", "B"), Green = c("B", "D", "C"
), Black = c("C", "C", "C")), class = "data.frame", row.names = c("John",
"Sean", "Tim"))
# Pivot the data
long_df <- tibble::rownames_to_column(df, var = "Names") %>%
  tidyr::pivot_longer(cols = -Names,
                      names_to = "Code",
                      values_to = "Type")
# Create a contingency table
df_table <- table(long_df$Names, long_df$Type)
Step 2: Compute the sums of entries over rows and columns.
Here I used only the base R function margin.table(). This approach also lets you save the row and column sums for further analysis.
# Grand total (margin = 1 indicates rows)
df_table %>%
margin.table(margin = 1)
# Grand total (margin = 2 indicates columns)
df_table %>%
margin.table(margin = 2)
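If you want the full pivot with grand totals as a single object rather than separate row and column sums, base R's addmargins() (also used in the earlier answer) applies directly to the contingency table:
addmargins(df_table)
#        A B C D Sum
#  John  1 1 1 0   3
#  Sean  1 0 1 1   3
#  Tim   0 1 2 0   3
#  Sum   2 2 4 1   9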

dplyr reordering rows by string

I have the following data:
library(tidyverse)
d1 <- data_frame(Nat = c("UK", "UK", "UK", "NONUK", "NONUK", "NONUK"),
                 Type = c("a", "b", "c", "a", "b", "c"))
I would like to rearrange the rows so the dataframe looks like this:
d2 <- data_frame(
  Nat = c("UK", "UK", "UK", "NONUK", "NONUK", "NONUK"),
  Type = c("b", "c", "a", "b", "c", "a"))
So the UK and Non UK grouping remains, but the 'Type' rows have shifted. This question is quite like this one: Reorder rows conditional on a string variable
However, the answer there depends on the rows being reordered being in alphabetical order (excluding London). Is there a way to reorder rows by a string column where you choose the order yourself, rather than relying on alphabetical order? Is there a way to do this using dplyr?
Thanks!
You could use match:
string_order <- c("b", "c", "a")
d1 %>%
  group_by(Nat) %>%
  mutate(Type = Type[match(string_order, Type)]) %>%
  ungroup()
# A tibble: 6 x 2
# Nat Type
# <chr> <chr>
#1 UK b
#2 UK c
#3 UK a
#4 NONUK b
#5 NONUK c
#6 NONUK a
What about making the levels explicit in a dplyr chain, to choose your order:
library(dplyr)
d1 %>%
  arrange(factor(.$Nat, levels = c("UK", "NONUK")),
          factor(.$Type, levels = c("c", "b", "a")))
# A tibble: 6 x 2
Nat Type
<chr> <chr>
1 UK c
2 UK b
3 UK a
4 NONUK c
5 NONUK b
6 NONUK a
Another example:
d1 %>%
  arrange(factor(.$Nat, levels = c("UK", "NONUK")),
          factor(.$Type, levels = c("b", "c", "a")))
# A tibble: 6 x 2
Nat Type
<chr> <chr>
1 UK b
2 UK c
3 UK a
4 NONUK b
5 NONUK c
6 NONUK a

Subsetting if contains multiple variables in a certain order

In my data frame, I have two columns of interest: id and name. My goal is to keep only records of ids that have more than one distinct value in name and whose final value in name is 'B'.
The sample data would look like this:
> test
id name
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 6 A
7 7 A
8 2 B
9 1 B
10 2 A
and the output would look like this:
> output
id name
1 1 A
9 1 B
How would one filter to get these rows in R? I know that you can filter to ids with multiple values using the %in% operator, but I am not sure how to add the condition that 'B' must be the last record. I am not opposed to using a package like dplyr, but a solution in base R would be ideal. Any suggestions?
Here is the sample data:
> dput(test)
structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 2, 1, 2), name = c("A",
"A", "A", "A", "A", "A", "A", "B", "B", "A")), .Names = c("id",
"name"), row.names = c(NA, -10L), class = "data.frame")
Using dplyr,
test %>%
  group_by(id) %>%
  filter(n_distinct(name) > 1 & last(name) == 'B')
#Source: local data frame [2 x 2]
#Groups: id [1]
# A tibble: 2 x 2
# id name
# <dbl> <chr>
#1 1 A
#2 1 B
In data.table:
library(data.table)
setDT(test)[, .SD[length(unique(name)) >= 2 & name[.N] == "B"], by = .(id)]
# id name
#1: 1 A
#2: 1 B
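Since the question says a base R solution would be ideal, here is a minimal base R sketch of the same idea (assuming the test data above, with rows already in order so that "last" means the final row per id):
# split name values by id, flag ids meeting both conditions, then subset
grp <- split(test$name, test$id)
ok <- vapply(grp, function(x) length(unique(x)) > 1 && x[length(x)] == "B",
             logical(1))
test[test$id %in% names(ok)[ok], ]
#  id name
#1  1    A
#9  1    B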

Select minimum data of grouped data - keeping all columns [duplicate]

I am running into a wall here.
I have a data frame with many rows.
Here is a schematic example:
#myDf
ID c1 c2 myDate
A 1 1 01.01.2015
A 2 2 02.02.2014
A 3 3 03.01.2014
B 4 4 09.09.2009
B 5 5 10.10.2010
C 6 6 06.06.2011
....
I need to group my data frame by ID, select the row with the oldest date within each group, and write the output into a new data frame, keeping all columns.
ID c1 c2 myDate
A 3 3 03.01.2014
B 4 4 09.09.2009
C 6 6 06.06.2011
....
That is how I approach it:
test <- myDf %>%
group_by(ID) %>%
mutate(date == as.Date(myDate, format = "%d.%m.%Y")) %>%
filter(date == min(b2))
To verify: the nrow of my resulting data frame should equal the number of unique IDs.
unique(myDf$ID) %>% length == nrow(test)
FALSE
Does not work. I tried this:
newDf <- ddply(.data = myDf,
               .variables = "ID",
               .fun = function(piece){
                 take.this.row <- piece$myDate %>% as.Date(format = "%d.%m.%Y") %>% which.min
                 piece[take.this.row, ]
               })
That does run forever. I terminated it.
Why is the first approach not working and what would be a good way to approach the problem?
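As an aside on why the first approach fails: mutate(date == as.Date(...)) uses the comparison operator == where the assignment = is needed, so no date column is ever created, and filter(date == min(b2)) references b2, a column that does not exist. A corrected sketch of that same dplyr approach (assuming myDf as above) would be:
library(dplyr)
test <- myDf %>%
  group_by(ID) %>%
  mutate(date = as.Date(myDate, format = "%d.%m.%Y")) %>%
  filter(date == min(date)) %>%
  select(-date)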
Considering you have a pretty large dataset, I think using data.table will be better! Here is a data.table version to solve your problem; it should be quicker than the dplyr approach:
library(data.table)
df <- data.table(ID = c("A", "A", "A", "B", "B", "C"), c1 = 1:6, c2 = 1:6,
                 myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
                            "09.09.2009", "10.10.2010", "06.06.2011"))
df[, myDate := as.Date(myDate, '%d.%m.%Y')]
df_new <- df[df[, .I[myDate == min(myDate)], by = ID]$V1]
df_new
ID c1 c2 myDate
1: A 3 3 2014-01-03
2: B 4 4 2009-09-09
3: C 6 6 2011-06-06
PS: you can use setDT(mydf) to transform data.frame to data.table.
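For readers new to data.table, here is a commented sketch of what the indexing step above does (assuming the df built earlier in this answer):
# .I holds the row numbers of the original table, so the inner call returns,
# for each ID, the row numbers where myDate equals the group minimum
idx <- df[, .I[myDate == min(myDate)], by = ID]$V1
idx
#[1] 3 4 6
df_new <- df[idx]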
After grouping by 'ID', we can use which.min to get the index of the minimum 'myDate' (after converting it to Date class), and extract that row with slice.
library(dplyr)
df1 %>%
  group_by(ID) %>%
  slice(which.min(as.Date(myDate, '%d.%m.%Y')))
# ID c1 c2 myDate
# (chr) (int) (int) (chr)
#1 A 3 3 03.01.2014
#2 B 4 4 09.09.2009
#3 C 6 6 06.06.2011
Data
df1 <- structure(list(ID = c("A", "A", "A", "B", "B", "C"), c1 = 1:6,
c2 = 1:6, myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
"09.09.2009", "10.10.2010", "06.06.2011")), .Names = c("ID",
"c1", "c2", "myDate"), class = "data.frame", row.names = c(NA,
-6L))
If you wanted to just use the base functions you can also go with the aggregate and merge functions.
# data (from response above)
df1 <- structure(list(ID = c("A", "A", "A", "B", "B", "C"), c1 = 1:6,
c2 = 1:6, myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
"09.09.2009", "10.10.2010", "06.06.2011")),
.Names = c("ID","c1", "c2", "myDate"),
class = "data.frame", row.names = c(NA,-6L))
# convert your date column to POSIXct object
df1$myDate = as.POSIXct(df1$myDate,format="%d.%m.%Y")
# Use the aggregate function to look for the minimum dates by group.
# In this case our variable of interest in the myDate column and the
# group to sort by is the "ID" column.
# The function will sort out the minimum date and create a new data frame
# with names "myDate" and "ID"
df2 = aggregate(list(myDate = df1$myDate), list(ID = df1$ID),
                function(x){x[which(x == min(x))]})
df2
# Use the merge function to merge your original data frame with the
# data from the aggregate function
merge(df1,df2)
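For reference, merge() joins on the shared ID and myDate columns here, so only the minimum-date row of each ID survives; the result should look like:
#  ID     myDate c1 c2
#1  A 2014-01-03  3  3
#2  B 2009-09-09  4  4
#3  C 2011-06-06  6  6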
