Unnest range of years when there are dummy variables in R

I'm working on a dataset containing information about individuals' place of residence and occupation. Originally, it says that someone resides at an address from one year to another, e.g. from 1920 to 1925. If the individual moved to that address in 1920, there is a dummy variable with the value of 1. Similarly, if the individual moved out from that address in 1925, there is also a dummy with the value of 1.
Now, the problem is that when I unnest the "from year - to year", there will be a value of 1 for all observations, both moved out and moved in, from 1920 to 1925.
Example data:
library(tidyr)
library(dplyr)
individual <- c('John Doe','Peter Gynn','Jolie Hope', 'Jolie Hope')
occupation <- c('banker', 'butcher', 'clerk', 'clerk')
first_obs <- c(1920, 1920, 1920, 1925)
last_obs <- c(1925, 1925, 1925, 1926)
moved_in <- c(1, 0, 1, 1)
moved_out <- c(0, 0, 1, 0)
address <- c('king street', 'market street', 'montgomery road', 'princes ave')
df <- data.frame(individual, occupation, address, first_obs, last_obs, moved_in, moved_out)
df$year <- mapply(seq, df$first_obs, df$last_obs, SIMPLIFY = FALSE)
new_df <- df %>%
  unnest(year) %>%
  select(-first_obs, -last_obs)
As you can see, it seems that Jolie Hope, for example, moved in and moved out of her address every year between 1920 and 1925, but she is supposed to have moved in in 1920 and moved out in 1925. Is there a solution for this?
Additionally, I have some problems with duplicated values due to people moving in and out in the same year. For instance, Jolie Hope moved out from Montgomery Road in 1925 and moved in at Princes Avenue in 1925. I think the best solution would be to only use the "moved in" row. Is it possible to systematically remove all the "moved out" rows where there are duplicated values?

We can group_by each individual and their address and assign 1 to the first year, when they moved in, and 1 to the last year, when they moved out.
library(dplyr)
df %>%
  tidyr::unnest(year) %>%
  select(-first_obs, -last_obs) %>%
  group_by(individual, address) %>%
  mutate(moved_in = if (any(moved_in == 1)) replace(moved_in, row_number() != 1, 0) else moved_in,
         moved_out = if (any(moved_out == 1)) replace(moved_out, row_number() != n(), 0) else moved_out)
# individual occupation address moved_in moved_out year
# <fct> <fct> <fct> <dbl> <dbl> <int>
# 1 John Doe banker king street 1 0 1920
# 2 John Doe banker king street 0 0 1921
# 3 John Doe banker king street 0 0 1922
# 4 John Doe banker king street 0 0 1923
# 5 John Doe banker king street 0 0 1924
# 6 John Doe banker king street 0 0 1925
# 7 Peter Gynn butcher market street 0 0 1920
# 8 Peter Gynn butcher market street 0 0 1921
# 9 Peter Gynn butcher market street 0 0 1922
#10 Peter Gynn butcher market street 0 0 1923
#11 Peter Gynn butcher market street 0 0 1924
#12 Peter Gynn butcher market street 0 0 1925
#13 Jolie Hope clerk montgomery road 1 0 1920
#14 Jolie Hope clerk montgomery road 0 0 1921
#15 Jolie Hope clerk montgomery road 0 0 1922
#16 Jolie Hope clerk montgomery road 0 0 1923
#17 Jolie Hope clerk montgomery road 0 0 1924
#18 Jolie Hope clerk montgomery road 0 1 1925
#19 Jolie Hope clerk princes ave 1 0 1925
#20 Jolie Hope clerk princes ave 0 0 1926
As for the duplicated-year issue, I think it is better to keep both rows with the same year: together they indicate that the person moved out of the old address and moved in to the new address in that year.
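If you do prefer to drop the "moved out" row in those duplicated years, a minimal sketch (assuming the result of the pipeline above is saved as, say, fixed_df, and that the new address always carries moved_in == 1) could be:
fixed_df %>%
  group_by(individual, year) %>%
  filter(n() == 1 | moved_in == 1) %>%  # keep the only row, or the moved-in row when the year is duplicated
  ungroup()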

Related

R Dataframe Group-Level Pattern

I have a dataframe that looks like the one below:
person year Office Job rank
Harry 2002 Los Angeles CEO 0
Harry 2006 Boston CEO 0
Harry 2006 Los Angeles Advisor 1
Harry 2006 Chicago Chairman 2
Peter 2001 New York Director 0
Peter 2001 Chicago CFO 1
Peter 2002 Chicago CEO 0
Lily 2005 Springfield CEO 0
Lily 2007 New York CFO 0
Lily 2008 Boston COO 0
Lily 2011 Chicago Advisor 0
Lily 2011 New York board 1
I want to know at a person level, who has at least one of the following two patterns:
in a previous available year, an office has rank 0 and in the next available year the office still exists but the rank is greater than 0 (the job does not matter). For example, Los Angeles for Harry.
in a next available year, an office has rank 0 and in the previous available year the office still exists but the rank is greater than 0 (for example, Chicago for Peter).
Note that New York for Lily does not have either of the above situations, as 2007 is not the previous available year for Lily (2008 is).
Thus, the output should look like:
person yes/no
Harry 1
Peter 1
Lily 0
We can use
library(dplyr)
df1 %>%
  group_by(person, Office) %>%
  summarise(yes_no = n_distinct(rank) > 1) %>%
  summarise(yes_no = +(any(yes_no)), .groups = 'drop')
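Note that this flags any office whose rank ever changes, which would also flag Lily (her New York rank changes from 0 in 2007 to 1 in 2011, even though 2008 is her next available year after 2007). A sketch that only compares consecutive available years per person (assuming df1 is the data shown above) could be:
library(dplyr)
df1 %>%
  group_by(person) %>%
  mutate(year_idx = match(year, sort(unique(year)))) %>%      # index of this person's available years
  group_by(person, Office) %>%
  arrange(year_idx, .by_group = TRUE) %>%
  mutate(flag = lead(year_idx) == year_idx + 1 &              # next available year, same office
                ((rank == 0 & lead(rank) > 0) |               # pattern 1: 0 -> >0
                 (rank > 0 & lead(rank) == 0))) %>%           # pattern 2: >0 -> 0
  group_by(person) %>%
  summarise(yes_no = +any(flag, na.rm = TRUE))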

summarizing and grouping rows based on rank or order of data values in R

My data looks like this:
EMPLOYEE_ID  LAST_NAME  FIRST_NAME  UNIT  CITY     STATE  DATA_RANK
221          SMITH      JILL        X1    DALLAS   TX     2
221          SMITH-WU   JILL                       TX     1
331          DEVIN      MARY        X2    HOUSTON         2
331          TRUNG      MARY              HOUSTON  TX     1
441          SWAN       ANNA-BELLE  X2    AUBURN   CA     1
441          DUCK       ANNA        X3    AUBURN          2
I am trying to get the output to look like this (one row per EMPLOYEE_ID), picking the row with DATA_RANK = 1 where an EMPLOYEE_ID is duplicated.
EMPLOYEE_ID  LAST_NAME  FIRST_NAME  UNIT  CITY     STATE  DATA_RANK
221          SMITH-WU   JILL                       TX     1
331          TRUNG      MARY              HOUSTON  TX     1
441          SWAN       ANNA-BELLE  X2    AUBURN   CA     1
I tried using the following code:
data <- data %>%
  group_by(EMPLOYEE_ID, substr(LAST_NAME, 0, 4), substr(FIRST_NAME, 0, 3)) %>%
  mutate_at(vars(-group_cols()), funs(na.locf(., na.rm = FALSE, fromLast = FALSE))) %>%
  filter(row_number() == n())
But that's not quite getting me here. Any thoughts? Thank you!
Is there a reason why you're using substr()?
I believe this code should work.
data %>%
  group_by(EMPLOYEE_ID) %>%
  filter(DATA_RANK == 1)
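If some EMPLOYEE_IDs might not have a row with DATA_RANK = 1, a variant (a sketch, assuming dplyr 1.0 or later) that keeps the lowest-ranked row per ID instead:
data %>%
  group_by(EMPLOYEE_ID) %>%
  slice_min(DATA_RANK, n = 1, with_ties = FALSE) %>%  # one row per ID, smallest DATA_RANK
  ungroup()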

How to compare two strings word by word in R

I have a dataset, let's call it "ORIGINALE", composed of several rows and only two columns, the first called "DESCRIPTION" and the second "CODICE". The description column has the right information while the column codice, which is the key, is almost always empty; therefore I'm trying to look up the corresponding codice in another dataset, let's call it "REFERENCE". I am using the column description, which is in natural language, and trying to match it with the description in the second dataset. I have to match word by word since there may be a different order of words, synonyms or abbreviations. Then I calculate the similarity score to keep only the best match and accept those above a certain score. Is there a way to improve it? I'm working with around 300000 rows and, even though I know it is always going to take time, perhaps there is a way to make it even just slightly faster.
ORIGINALE <- data.frame(DESCRIPTION = c("mr peter 123 rose street 3b LA", " 4c flower str jenny jane Chicago", "washington miss sarah 430f name strt"), CODICE = c(NA, NA, NA))
REFERENCE <- data.frame(DESCRIPTION = c("sarah brown name street 430f washington", "peter green 123 rose street 3b LA", "jenny jane flower street 4c Chicago"), CODICE = c("135tg67", "aw56", "83776250"))
library(stringdist)  # for stringsim()
library(tibble)
algoritmo <- function(ORIGINALE, REFERENCE) {
  split1 <- strsplit(ORIGINALE$DESCRIPTION, " ")
  split2 <- strsplit(REFERENCE$DESCRIPTION, " ")
  risultato <- vector()
  distanza <- vector()
  for (i in 1:NROW(split1)) {
    best_dist <- -5
    closest_match <- -5
    for (j in 1:NROW(split2)) {
      dist <- stringsim(as.character(split1[i]), as.character(split2[j]))
      if (dist > best_dist) {
        closest_match <- REFERENCE$DESCRIPTION[j]
        best_dist <- dist
      }
    }
    distanza <- append(distanza, best_dist)
    risultato <- append(risultato, closest_match)
  }
  confronto <<- tibble(ORIGINALE$DESCRIPTION, risultato, distanza)
}
match <- subset.data.frame(confronto, confronto$distanza >= 0.6)
missing <- subset.data.frame(confronto, confronto$distanza < 0.6)
The R tm (text mining) library can help here:
library(tm)
library(proxy) # for computing cosine similarity
library(data.table)
ORIGINALE = data.table(DESCRIPTION = c("mr peter 123 rose street 3b LA"," 4c flower str jenny jane Chicago", "washington miss sarah 430f name strt"), CODICE = c(NA, NA, NA))
REFERENCE = data.table(DESCRIPTION = c("sarah brown name street 430f washington", "peter green 123 rose street 3b LA", "jenny jane flower street 4c Chicago"), CODICE = c("135tg67","aw56", "83776250"))
# combine ORIGINALE and REFERENCE into one data.table
both = rbind(ORIGINALE,REFERENCE)
# create "doc_id" and "text" columns (required by tm)
both[,doc_id:=1:.N]
names(both)[1] = 'text'
# convert to tm corpus
corpus = SimpleCorpus(DataframeSource(both))
# convert to a tm document term matrix
dtm = DocumentTermMatrix(corpus)
# convert to a regular matrix
dtm = as.matrix(dtm)
# look at it (t() transpose for readability)
t(dtm)
Docs
Terms 1 2 3 4 5 6
123 1 0 0 0 1 0
peter 1 0 0 0 1 0
rose 1 0 0 0 1 0
street 1 0 0 1 1 1
chicago 0 1 0 0 0 1
flower 0 1 0 0 0 1
jane 0 1 0 0 0 1
jenny 0 1 0 0 0 1
str 0 1 0 0 0 0
430f 0 0 1 1 0 0
miss 0 0 1 0 0 0
name 0 0 1 1 0 0
sarah 0 0 1 1 0 0
strt 0 0 1 0 0 0
washington 0 0 1 1 0 0
brown 0 0 0 1 0 0
green 0 0 0 0 1 0
# compute similarity between each combination of documents 1:3 and documents 4:6
similarity = proxy::dist(dtm[1:3,], dtm[4:6,], method="cosine")
# result:
                     REFERENCE document
ORIGINALE document           4         5         6
                 1   0.7958759 0.1055728 0.7763932   <-- difference (smaller = more similar)
                 2   1.0000000 1.0000000 0.2000000
                 3   0.3333333 1.0000000 1.0000000
# make a table of which REFERENCE document is most similar
most_similar = rbindlist(
  apply(similarity, 1, function(x){
    data.table(i = which.min(x), distance = min(x))
  })
)
# result:
i distance
1: 2 0.1055728
2: 3 0.2000000
3: 1 0.3333333
# rows 1, 2, 3 are rows of ORIGINALE; i: 2 3 1 are rows of REFERENCE
# add the results back to ORIGINALE
ORIGINALE1 = cbind(ORIGINALE,most_similar)
REFERENCE[,i:=1:.N]
ORIGINALE2 = merge(ORIGINALE1,REFERENCE,by='i',all.x=T,all.y=F)
# result:
i DESCRIPTION.x CODICE.x distance DESCRIPTION.y CODICE.y
1: 1 washington miss sarah 430f name strt NA 0.3333333 sarah brown name street 430f washington 135tg67
2: 2 mr peter 123 rose street 3b LA NA 0.1055728 peter green 123 rose street 3b LA aw56
3: 3 4c flower str jenny jane Chicago NA 0.2000000 jenny jane flower street 4c Chicago 83776250
# now the documents are in a different order than in ORIGINALE2.
# this is caused by merging by i (=REFERENCE document row).
# if order is important, then add these two lines around the merge line:
ORIGINALE1[,ORIGINALE_i:=1:.N]
ORIGINALE2 = merge(...
ORIGINALE2 = ORIGINALE2[order(ORIGINALE_i)]
Good question. For loops are slow in R:
for(i in 1:NROW(split1)) {
for(j in 1:NROW(split2)) {
For fast R, you need to vectorize your algorithm. I'm not that handy with data.frame anymore, so I'll use its successor, data.table.
library(data.table)
ORIGINALE = data.table(DESCRIPTION = c("mr peter 123 rose street 3b LA"," 4c flower str jenny jane Chicago", "washington miss sarah 430f name strt"), CODICE = c(NA, NA, NA))
REFERENCE = data.table(DESCRIPTION = c("sarah brown name street 430f washington", "peter green 123 rose street 3b LA", "jenny jane flower street 4c Chicago"), CODICE = c("135tg67","aw56", "83776250"))
# split DESCRIPTION to make tables that have one word per row
ORIGINALE_WORDS = ORIGINALE[,.(word=unlist(strsplit(DESCRIPTION,' ',fixed=T))),.(DESCRIPTION,CODICE)]
REFERENCE_WORDS = REFERENCE[,.(word=unlist(strsplit(DESCRIPTION,' ',fixed=T))),.(DESCRIPTION,CODICE)]
# remove empty words introduced by extra spaces in your DESCRIPTIONS
ORIGINALE_WORDS = ORIGINALE_WORDS[word!='']
REFERENCE_WORDS = REFERENCE_WORDS[word!='']
# merge the tables by word
merged = merge(ORIGINALE_WORDS,REFERENCE_WORDS,by='word',all=F,allow.cartesian=T)
# count matching words for each combination of ORIGINALE DESCRIPTION and REFERENCE DESCRIPTION and CODICE
counts = merged[,.N,.(DESCRIPTION.x,DESCRIPTION.y,CODICE.y)]
# keep only the highest N CODICE.y for each DESCRIPTION.x
topcounts = counts[order(-N)][!duplicated(DESCRIPTION.x)]
# merge the counts back to ORIGINALE
result = merge(ORIGINALE,topcounts,by.x='DESCRIPTION',by.y='DESCRIPTION.x',all.x=T,all.y=F)
Here is the result:
DESCRIPTION CODICE DESCRIPTION.y CODICE.y N
1: 4c flower str jenny jane Chicago NA jenny jane flower street 4c Chicago 83776250 5
2: mr peter 123 rose street 3b LA NA peter green 123 rose street 3b LA aw56 6
3: washington miss sarah 430f name strt NA sarah brown name street 430f washington 135tg67 4
PS: There are more memory-efficient ways to do this, and this code could cause your machine to crash due to an out-of-memory error or go slowly due to needing virtual memory, but if not, it should be faster than the for loops.
What about:
library(stringdist)
library(dplyr)
library(tidyr)
data_o <- ORIGINALE %>% mutate(desc_o = DESCRIPTION) %>% select(desc_o)
data_r <- REFERENCE %>% mutate(desc_r = DESCRIPTION) %>% select(desc_r)
data <- crossing(data_o, data_r)
data %>%
  mutate(dist = stringsim(as.character(desc_o), as.character(desc_r))) %>%
  group_by(desc_o) %>%
  filter(dist == max(dist))
desc_o desc_r dist
<chr> <chr> <dbl>
1 " 4c flower str jenny jane Chicago" jenny jane flower street 4c Chicago 0.486
2 "mr peter 123 rose street 3b LA" peter green 123 rose street 3b LA 0.758
3 "washington miss sarah 430f name strt" sarah brown name street 430f washington 0.385

Value of a variable matching the first row of another variable by group [duplicate]

This question already has answers here:
Using mutate to create a new column with the first value of each group in R
(3 answers)
Closed 3 years ago.
As in the title, I would like a process that allows me to assign, to each unique value of a first variable, the value of a second variable taken from the first row when ordered by a third variable. Example:
Name Year Job
Alicia 1990 Butcher
Alicia 1991 Baker
George 1989 Scientist
George 1990 Banker
George 1991 Banker
I would like to easily identify the first job each unique Name did:
Name Year Job First Job
Alicia 1990 Butcher Butcher
Alicia 1991 Baker Butcher
George 1989 Scientist Scientist
George 1990 Banker Scientist
George 1991 Banker Scientist
We can use data.table for this:
library(data.table)
setDT(df1)[order(Year),FirstJob:=Job[1],.(Name)][]
## or using which.min instead of ordering as akrun suggested:
# setDT(df1)[,FirstJob:=Job[which.min(Year)], .(Name)][]
#> Name Year Job FirstJob
#> 1: Alicia 1990 Butcher Butcher
#> 2: Alicia 1991 Baker Butcher
#> 3: George 1989 Scientist Scientist
#> 4: George 1990 Banker Scientist
#> 5: George 1991 Banker Scientist
Data:
read.table(text="Name Year Job
Alicia 1990 Butcher
Alicia 1991 Baker
George 1989 Scientist
George 1990 Banker
George 1991 Banker",
header=T, stringsAsFactors=F) -> df1
We can group by 'Name' and extract the first 'Job' to create the new column 'FirstJob'
library(dplyr)
df1 %>%
group_by(Name) %>%
mutate(FirstJob = first(Job))
# A tibble: 5 x 4
# Groups: Name [2]
# Name Year Job FirstJob
# <chr> <int> <chr> <chr>
#1 Alicia 1990 Butcher Butcher
#2 Alicia 1991 Baker Butcher
#3 George 1989 Scientist Scientist
#4 George 1990 Banker Scientist
#5 George 1991 Banker Scientist
If the 'Year' is not ordered
df1 %>%
group_by(Name) %>%
mutate(FirstJob = Job[which.min(Year)])
data
df1 <- structure(list(Name = c("Alicia", "Alicia", "George", "George",
"George"), Year = c(1990L, 1991L, 1989L, 1990L, 1991L), Job = c("Butcher",
"Baker", "Scientist", "Banker", "Banker")), class = "data.frame",
row.names = c(NA,
-5L))

abind error - arg 'XXX' has dims=1912, 35, 1; but need dims=0, 35, X

I am trying to use abind to create a 3-D array out of a large 2D array. The source data is structured like this
Firstname Lastname Country City Measure Wk1 Wk2... Wkn
foo bar UK London Height 23 34 34
foo bar UK London Weight 67 67 67
foo bar UK London Fat 6 7 9
John doe US NY Height 546 776 978
John doe US NY Weight 123 656 989
John doe US NY Fat 34 45 67
There are 1912 rows per Measure and 25 weeks of data. I am trying to create a 3D array such that I can measure city-wise trends of the Measures (height, weight, etc.).
When I use abind(split(df,df$city), along =3), it gives me the error:
abind error - arg 'XXX' has dims=1912, 35, 1; but need dims=0, 35, X
I have verified that the number of rows is 1912 per measure and that the number of columns is also homogeneous. Any help will be greatly appreciated.
Are you sure that you want to use arrays to measure city trends?
Usually the right approach to analysing data like yours is to unpivot the weeks into long format.
I'll start by importing your data into R...
tc <- textConnection("Firstname Lastname Country City Measure Wk1 Wk2 Wk3
foo bar UK London Height 23 34 34
foo bar UK London Weight 67 67 67
foo bar UK London Fat 6 7 9
John doe US NY Height 546 776 978
John doe US NY Weight 123 656 989
John doe US NY Fat 34 45 67")
df <- read.table(tc, header = TRUE)
Then install and load a couple of useful packages.
install.packages("tidyr")
install.packages("dplyr")
library(tidyr)
library(dplyr)
Now to unpivot your data using the gather command from tidyr.
> long_df <- gather(df, Week, Value, -c(1:5))
> long_df
Firstname Lastname Country City Measure Week Value
1 foo bar UK London Height Wk1 23
2 foo bar UK London Weight Wk1 67
3 foo bar UK London Fat Wk1 6
4 John doe US NY Height Wk1 546
5 John doe US NY Weight Wk1 123
6 John doe US NY Fat Wk1 34
7 foo bar UK London Height Wk2 34
8 foo bar UK London Weight Wk2 67
9 foo bar UK London Fat Wk2 7
10 John doe US NY Height Wk2 776
11 John doe US NY Weight Wk2 656
12 John doe US NY Fat Wk2 45
13 foo bar UK London Height Wk3 34
14 foo bar UK London Weight Wk3 67
15 foo bar UK London Fat Wk3 9
16 John doe US NY Height Wk3 978
17 John doe US NY Weight Wk3 989
18 John doe US NY Fat Wk3 67
Now you can use dplyr to produce any summaries of the data that you please...
> long_df %>%
+ group_by(Country, City, Measure) %>%
+ summarise(mean_val = mean(Value))
Source: local data frame [6 x 4]
Groups: Country, City
Country City Measure mean_val
1 UK London Fat 7.333333
2 UK London Height 30.333333
3 UK London Weight 67.000000
4 US NY Fat 48.666667
5 US NY Height 766.666667
6 US NY Weight 589.333333
Or summaries by Country and Measure...
> long_df %>%
+ group_by(Country, Measure) %>%
+ summarise(mean_val = mean(Value), med_val = median(Value), count = n())
Source: local data frame [6 x 5]
Groups: Country
Country Measure mean_val med_val count
1 UK Fat 7.333333 7 3
2 UK Height 30.333333 34 3
3 UK Weight 67.000000 67 3
4 US Fat 48.666667 45 3
5 US Height 766.666667 776 3
6 US Weight 589.333333 656 3
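If you still want a 3-D array for the original abind-style analysis, one sketch (assuming the long_df built above) is to cross-tabulate the long data, which sidesteps the dimension mismatch from abind(split(...)):
# one cell per Measure x Week x City; with a single observation per cell,
# the "sum" is just the observed value
arr <- xtabs(Value ~ Measure + Week + City, data = long_df)
dim(arr)           # 3 x 3 x 2 for the example data above
arr[, , "London"]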
