I have a dataframe that looks like this:
person year Office Job rank
Harry 2002 Los Angeles CEO 0
Harry 2006 Boston CEO 0
Harry 2006 Los Angeles Advisor 1
Harry 2006 Chicago Chairman 2
Peter 2001 New York Director 0
Peter 2001 Chicago CFO 1
Peter 2002 Chicago CEO 0
Lily 2005 Springfield CEO 0
Lily 2007 New York CFO 0
Lily 2008 Boston COO 0
Lily 2011 Chicago Advisor 0
Lily 2011 New York board 1
I want to know, at the person level, who has at least one of the following two patterns:
in a previous available year an office has rank 0, and in the next available year the office still exists but its rank is bigger than 0 (the job does not matter). For example, Los Angeles for Harry.
in a next available year an office has rank 0, and in the previous available year the office still exists but its rank is bigger than 0. For example, Chicago for Peter.
Note that New York for Lily fits neither of these situations, because 2007 is not the previous available year for Lily relative to 2011 (2008 is).
Thus, the output should look like:
person yes/no
Harry 1
Peter 1
Lily 0
We can use
library(dplyr)
df1 %>%
  group_by(person, Office) %>%
  summarise(yes_no = n_distinct(rank) > 1) %>%
  summarise(yes_no = +(any(yes_no)), .groups = 'drop')
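Note that this flags any office whose rank ever changes for a person, so it would also count New York for Lily. If the next/previous available-year restriction needs to be enforced, a rough sketch (only checked against the sample data above; year_idx is a helper column introduced here, and one row per person/office/year is assumed) that indexes each person's available years first could be:

library(dplyr)

df1 %>%
  group_by(person) %>%
  mutate(year_idx = match(year, sort(unique(year)))) %>%     # 1, 2, ... over each person's available years
  group_by(person, Office) %>%
  arrange(year_idx, .by_group = TRUE) %>%
  summarise(hit = any(diff(year_idx) == 1 &                  # consecutive available years
                        xor(head(rank, -1) == 0, tail(rank, -1) == 0)),  # rank flips between 0 and > 0
            .groups = "drop_last") %>%
  summarise(yes_no = +any(hit), .groups = "drop")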
My data looks like this:
EMPLOYEE_ID LAST_NAME FIRST_NAME UNIT CITY    STATE DATA_RANK
221         SMITH     JILL       X1   DALLAS  TX    2
221         SMITH-WU  JILL                    TX    1
331         DEVIN     MARY       X2   HOUSTON       2
331         TRUNG     MARY            HOUSTON TX    1
441         SWAN      ANNA-BELLE X2   AUBURN  CA    1
441         DUCK      ANNA       X3   AUBURN        2
I am trying to get the output to look like this: group rows by EMPLOYEE_ID and, where an EMPLOYEE_ID is duplicated, keep the row that has DATA_RANK = 1.
EMPLOYEE_ID LAST_NAME FIRST_NAME UNIT CITY    STATE DATA_RANK
221         SMITH-WU  JILL                    TX    1
331         TRUNG     MARY            HOUSTON TX    1
441         SWAN      ANNA-BELLE X2   AUBURN  CA    1
I tried using the following code:
library(zoo)   # for na.locf

data <- data %>%
  group_by(EMPLOYEE_ID, substr(LAST_NAME, 0, 4), substr(FIRST_NAME, 0, 3)) %>%
  mutate_at(vars(-group_cols()), funs(na.locf(., na.rm = FALSE, fromLast = FALSE))) %>%
  filter(row_number() == n())
But that's not quite getting me there. Any thoughts? Thank you!
Is there a reason why you're using substr()?
I believe this code should work.
data %>%
  group_by(EMPLOYEE_ID) %>%
  filter(DATA_RANK == 1)
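Note that filtering on DATA_RANK == 1 silently drops any EMPLOYEE_ID that has no rank-1 row. That never happens in the sample, but if it can happen in the full data (an assumption on my part), keeping the lowest-ranked row per ID is a safer variant (requires dplyr 1.0 or later):

library(dplyr)

data %>%
  group_by(EMPLOYEE_ID) %>%
  slice_min(DATA_RANK, n = 1, with_ties = FALSE) %>%   # lowest DATA_RANK per employee
  ungroup()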
I have a dataset, let's call it "ORIGINALE", composed of several rows and only two columns, the first called "DESCRIPTION" and the second "CODICE". The DESCRIPTION column has the right information, while the CODICE column, which is the key, is almost always empty, so I'm trying to look up the corresponding CODICE in another dataset, let's call it "REFERENCE". I use the DESCRIPTION column, which is in natural language, and try to match it with the description in the second dataset. I have to match word by word, since there may be a different order of words, synonyms, or abbreviations. Then I calculate a similarity score to keep only the best match and accept those above a certain threshold. Is there a way to improve it? I'm working with around 300,000 rows and, even though I know it is always going to take time, perhaps there is a way to make it even just slightly faster.
ORIGINALE <- data.frame(DESCRIPTION = c("mr peter 123 rose street 3b LA", " 4c flower str jenny jane Chicago", "washington miss sarah 430f name strt"), CODICE = c(NA, NA, NA))
REFERENCE <- data.frame(DESCRIPTION = c("sarah brown name street 430f washington", "peter green 123 rose street 3b LA", "jenny jane flower street 4c Chicago"), CODICE = c("135tg67", "aw56", "83776250"))
library(stringdist)   # for stringsim
library(tibble)

algoritmo <- function(ORIGINALE, REFERENCE) {
  split1 <- strsplit(ORIGINALE$DESCRIPTION, " ")
  split2 <- strsplit(REFERENCE$DESCRIPTION, " ")
  risultato <- vector()
  distanza <- vector()
  for (i in 1:NROW(split1)) {
    best_dist <- -5
    closest_match <- -5
    for (j in 1:NROW(split2)) {
      # split1[i] / split2[j] are length-1 lists, so as.character() deparses
      # each word vector into a single string before comparing
      dist <- stringsim(as.character(split1[i]), as.character(split2[j]))
      if (dist > best_dist) {
        closest_match <- REFERENCE$DESCRIPTION[j]
        best_dist <- dist
      }
    }
    distanza <- append(distanza, best_dist)
    risultato <- append(risultato, closest_match)
  }
  confronto <<- tibble(ORIGINALE$DESCRIPTION, risultato, distanza)
}

algoritmo(ORIGINALE, REFERENCE)   # fills the global 'confronto'
match <- subset.data.frame(confronto, confronto$distanza >= 0.6)
missing <- subset.data.frame(confronto, confronto$distanza < 0.6)
The R tm (text mining) library can help here:
library(tm)
library(proxy) # for computing cosine similarity
library(data.table)
ORIGINALE = data.table(DESCRIPTION = c("mr peter 123 rose street 3b LA"," 4c flower str jenny jane Chicago", "washington miss sarah 430f name strt"), CODICE = c(NA, NA, NA))
REFERENCE = data.table(DESCRIPTION = c("sarah brown name street 430f washington", "peter green 123 rose street 3b LA", "jenny jane flower street 4c Chicago"), CODICE = c("135tg67","aw56", "83776250"))
# combine ORIGINALE and REFERENCE into one data.table
both = rbind(ORIGINALE,REFERENCE)
# create "doc_id" and "text" columns (required by tm)
both[,doc_id:=1:.N]
names(both)[1] = 'text'
# convert to tm corpus
corpus = SimpleCorpus(DataframeSource(both))
# convert to a tm document term matrix
dtm = DocumentTermMatrix(corpus)
# convert to a regular matrix
dtm = as.matrix(dtm)
# look at it (t() transpose for readability)
t(dtm)
Docs
Terms 1 2 3 4 5 6
123 1 0 0 0 1 0
peter 1 0 0 0 1 0
rose 1 0 0 0 1 0
street 1 0 0 1 1 1
chicago 0 1 0 0 0 1
flower 0 1 0 0 0 1
jane 0 1 0 0 0 1
jenny 0 1 0 0 0 1
str 0 1 0 0 0 0
430f 0 0 1 1 0 0
miss 0 0 1 0 0 0
name 0 0 1 1 0 0
sarah 0 0 1 1 0 0
strt 0 0 1 0 0 0
washington 0 0 1 1 0 0
brown 0 0 0 1 0 0
green 0 0 0 0 1 0
# compute similarity between each combination of documents 1:3 and documents 4:6
similarity = proxy::dist(dtm[1:3,], dtm[4:6,], method="cosine")
# result:
           REFERENCE document
ORIGINALE          4         5         6
 document
        1  0.7958759 0.1055728 0.7763932   <-- distance (smaller = more similar)
        2  1.0000000 1.0000000 0.2000000
        3  0.3333333 1.0000000 1.0000000
# make a table of which REFERENCE document is most similar
most_similar = rbindlist(
apply(
similarity,1,function(x){
data.table(i=which.min(x),distance=min(x))
}
)
)
# result:
i distance
1: 2 0.1055728
2: 3 0.2000000
3: 1 0.3333333
# rows 1, 2, 3 are rows of ORIGINALE; i = 2, 3, 1 are rows of REFERENCE
# add the results back to ORIGINALE
ORIGINALE1 = cbind(ORIGINALE,most_similar)
REFERENCE[,i:=1:.N]
ORIGINALE2 = merge(ORIGINALE1,REFERENCE,by='i',all.x=T,all.y=F)
# result:
i DESCRIPTION.x CODICE.x distance DESCRIPTION.y CODICE.y
1: 1 washington miss sarah 430f name strt NA 0.3333333 sarah brown name street 430f washington 135tg67
2: 2 mr peter 123 rose street 3b LA NA 0.1055728 peter green 123 rose street 3b LA aw56
3: 3 4c flower str jenny jane Chicago NA 0.2000000 jenny jane flower street 4c Chicago 83776250
# note: in ORIGINALE2 the documents are in a different order than in ORIGINALE.
# this is caused by merging by i (=REFERENCE document row).
# if order is important, then add these two lines around the merge line:
ORIGINALE1[,ORIGINALE_i:=1:.N]
ORIGINALE2 = merge(...
ORIGINALE2 = ORIGINALE2[order(ORIGINALE_i)]
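The original post also keeps only matches above a similarity threshold. Since the values above are cosine distances (smaller = more similar), the analogous filter on the merged result would be something like the sketch below; the 0.5 cutoff is just an illustrative assumption, not a recommended value.

# keep matches below a distance cutoff; the rest stay unmatched
# (0.5 is an arbitrary illustrative cutoff)
matched   = ORIGINALE2[distance <= 0.5]
unmatched = ORIGINALE2[distance >  0.5]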
Good question. for loops are slow in R:
for(i in 1:NROW(split1)) {
for(j in 1:NROW(split2)) {
For fast R, you need to vectorize your algorithm. I'm not that handy with data.frame anymore, so I'll use its successor, data.table.
library(data.table)
ORIGINALE = data.table(DESCRIPTION = c("mr peter 123 rose street 3b LA"," 4c flower str jenny jane Chicago", "washington miss sarah 430f name strt"), CODICE = c(NA, NA, NA))
REFERENCE = data.table(DESCRIPTION = c("sarah brown name street 430f washington", "peter green 123 rose street 3b LA", "jenny jane flower street 4c Chicago"), CODICE = c("135tg67","aw56", "83776250"))
# split DESCRIPTION to make tables that have one word per row
ORIGINALE_WORDS = ORIGINALE[,.(word=unlist(strsplit(DESCRIPTION,' ',fixed=T))),.(DESCRIPTION,CODICE)]
REFERENCE_WORDS = REFERENCE[,.(word=unlist(strsplit(DESCRIPTION,' ',fixed=T))),.(DESCRIPTION,CODICE)]
# remove empty words introduced by extra spaces in your DESCRIPTIONS
ORIGINALE_WORDS = ORIGINALE_WORDS[word!='']
REFERENCE_WORDS = REFERENCE_WORDS[word!='']
# merge the tables by word
merged = merge(ORIGINALE_WORDS,REFERENCE_WORDS,by='word',all=F,allow.cartesian=T)
# count matching words for each combination of ORIGINALE DESCRIPTION and REFERENCE DESCRIPTION and CODICE
counts = merged[,.N,.(DESCRIPTION.x,DESCRIPTION.y,CODICE.y)]
# keep only the highest N CODICE.y for each DESCRIPTION.x
topcounts = counts[order(-N)][!duplicated(DESCRIPTION.x)]
# merge the counts back to ORIGINALE
result = merge(ORIGINALE,topcounts,by.x='DESCRIPTION',by.y='DESCRIPTION.x',all.x=T,all.y=F)
Here is result:
DESCRIPTION CODICE DESCRIPTION.y CODICE.y N
1: 4c flower str jenny jane Chicago NA jenny jane flower street 4c Chicago 83776250 5
2: mr peter 123 rose street 3b LA NA peter green 123 rose street 3b LA aw56 6
3: washington miss sarah 430f name strt NA sarah brown name street 430f washington 135tg67 4
PS: There are more memory-efficient ways to do this, and this code could cause your machine to crash due to an out-of-memory error or go slowly due to needing virtual memory, but if not, it should be faster than the for loops.
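For instance (just a sketch, under the assumption that ORIGINALE itself fits in memory but the full cartesian merge might not), the merge-and-count step can be run per chunk of ORIGINALE so that only one slice of the merged table exists at a time; the chunk size of 10000 descriptions is arbitrary.

library(data.table)

# process ORIGINALE in chunks so each allow.cartesian merge stays small
ORIGINALE[, chunk := ceiling(.I / 10000)]   # 10000 descriptions per chunk (arbitrary)
topcounts = rbindlist(lapply(split(ORIGINALE, by = 'chunk'), function(chnk){
  w = chnk[, .(word = unlist(strsplit(DESCRIPTION, ' ', fixed = TRUE))), .(DESCRIPTION, CODICE)]
  w = w[word != '']
  m = merge(w, REFERENCE_WORDS, by = 'word', all = FALSE, allow.cartesian = TRUE)
  cnt = m[, .N, .(DESCRIPTION.x, DESCRIPTION.y, CODICE.y)]
  cnt[order(-N)][!duplicated(DESCRIPTION.x)]   # best REFERENCE match per description in this chunk
}))
ORIGINALE[, chunk := NULL]
result = merge(ORIGINALE, topcounts, by.x = 'DESCRIPTION', by.y = 'DESCRIPTION.x', all.x = TRUE, all.y = FALSE)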
What about:
library(stringdist)
library(dplyr)
library(tidyr)
data_o <- ORIGINALE %>% mutate(desc_o = DESCRIPTION) %>% select(desc_o)
data_r <- REFERENCE %>% mutate(desc_r = DESCRIPTION) %>% select(desc_r)
data <- crossing(data_o, data_r)
data %>%
  mutate(dist = stringsim(as.character(desc_o), as.character(desc_r))) %>%
  group_by(desc_o) %>%
  filter(dist == max(dist))
desc_o desc_r dist
<chr> <chr> <dbl>
1 " 4c flower str jenny jane Chicago" jenny jane flower street 4c Chicago 0.486
2 "mr peter 123 rose street 3b LA" peter green 123 rose street 3b LA 0.758
3 "washington miss sarah 430f name strt" sarah brown name street 430f washington 0.385
As in the title, I would like a process that, for each unique value of a first variable (Name), assigns the value of a second variable (Job) taken from the row with the earliest value of a third variable (Year). Example:
Name Year Job
Alicia 1990 Butcher
Alicia 1991 Baker
George 1989 Scientist
George 1990 Banker
George 1991 Banker
I would like to easily identify what is the first job each unique Name did:
Name Year Job First Job
Alicia 1990 Butcher Butcher
Alicia 1991 Baker Butcher
George 1989 Scientist Scientist
George 1990 Banker Scientist
George 1991 Banker Scientist
We can use data.table for this:
library(data.table)
setDT(df1)[order(Year),FirstJob:=Job[1],.(Name)][]
## or using which.min instead of ordering as akrun suggested:
# setDT(df1)[,FirstJob:=Job[which.min(Year)], .(Name)][]
#> Name Year Job FirstJob
#> 1: Alicia 1990 Butcher Butcher
#> 2: Alicia 1991 Baker Butcher
#> 3: George 1989 Scientist Scientist
#> 4: George 1990 Banker Scientist
#> 5: George 1991 Banker Scientist
Data:
read.table(text="Name Year Job
Alicia 1990 Butcher
Alicia 1991 Baker
George 1989 Scientist
George 1990 Banker
George 1991 Banker",
header=T, stringsAsFactors=F) -> df1
We can group by 'Name' and extract the first 'Job' to create the new column 'FirstJob'
library(dplyr)
df1 %>%
group_by(Name) %>%
mutate(FirstJob = first(Job))
# A tibble: 5 x 4
# Groups: Name [2]
# Name Year Job FirstJob
# <chr> <int> <chr> <chr>
#1 Alicia 1990 Butcher Butcher
#2 Alicia 1991 Baker Butcher
#3 George 1989 Scientist Scientist
#4 George 1990 Banker Scientist
#5 George 1991 Banker Scientist
If the 'Year' is not ordered
df1 %>%
group_by(Name) %>%
mutate(FirstJob = Job[which.min(Year)])
data
df1 <- structure(list(Name = c("Alicia", "Alicia", "George", "George",
"George"), Year = c(1990L, 1991L, 1989L, 1990L, 1991L), Job = c("Butcher",
"Baker", "Scientist", "Banker", "Banker")), class = "data.frame",
row.names = c(NA,
-5L))
I am trying to use abind to create a 3-D array out of a large 2D array. The source data is structured like this
Firstname Lastname Country City Measure Wk1 Wk2... Wkn
foo bar UK London Height 23 34 34
foo bar UK London Weight 67 67 67
foo bar UK London Fat 6 7 9
John doe US NY Height 546 776 978
John doe US NY Weight 123 656 989
John doe US NY Fat 34 45 67
There are 1912 rows per Measure and 25 weeks of data. I am trying to create a 3D array so that I can measure city-wise trends of the measures (height, weight, etc.).
When I use abind(split(df,df$city), along =3), it gives me the error:
abind error - arg 'XXX' has dims=1912, 35, 1; but need dims=0, 35, X
I have verified that there are 1912 rows per measure and that the number of columns is the same throughout. Any help will be greatly appreciated.
Are you sure that you want to use arrays to measure city trends?
Usually the right approach to analysing data like yours is to unpivot the weeks into long format.
I'll start by importing your data into R...
tc <- textConnection("Firstname Lastname Country City Measure Wk1 Wk2 Wk3
foo bar UK London Height 23 34 34
foo bar UK London Weight 67 67 67
foo bar UK London Fat 6 7 9
John doe US NY Height 546 776 978
John doe US NY Weight 123 656 989
John doe US NY Fat 34 45 67")
df <- read.table(tc, header = TRUE)
Then install and load a couple of useful packages.
install.packages("tidyr")
install.packages("dplyr")
library(tidyr)
library(dplyr)
Now to unpivot your data using the gather command from tidyr.
> long_df <- gather(df, Week, Value, -c(1:5))
> long_df
Firstname Lastname Country City Measure Week Value
1 foo bar UK London Height Wk1 23
2 foo bar UK London Weight Wk1 67
3 foo bar UK London Fat Wk1 6
4 John doe US NY Height Wk1 546
5 John doe US NY Weight Wk1 123
6 John doe US NY Fat Wk1 34
7 foo bar UK London Height Wk2 34
8 foo bar UK London Weight Wk2 67
9 foo bar UK London Fat Wk2 7
10 John doe US NY Height Wk2 776
11 John doe US NY Weight Wk2 656
12 John doe US NY Fat Wk2 45
13 foo bar UK London Height Wk3 34
14 foo bar UK London Weight Wk3 67
15 foo bar UK London Fat Wk3 9
16 John doe US NY Height Wk3 978
17 John doe US NY Weight Wk3 989
18 John doe US NY Fat Wk3 67
Now you can use dplyr to produce any summaries of the data that you please...
> long_df %>%
+ group_by(Country, City, Measure) %>%
+ summarise(mean_val = mean(Value))
Source: local data frame [6 x 4]
Groups: Country, City
Country City Measure mean_val
1 UK London Fat 7.333333
2 UK London Height 30.333333
3 UK London Weight 67.000000
4 US NY Fat 48.666667
5 US NY Height 766.666667
6 US NY Weight 589.333333
Or summaries by Country and Measure...
> long_df %>%
+ group_by(Country, Measure) %>%
+ summarise(mean_val = mean(Value), med_val = median(Value), count = n())
Source: local data frame [6 x 5]
Groups: Country
Country Measure mean_val med_val count
1 UK Fat 7.333333 7 3
2 UK Height 30.333333 34 3
3 UK Weight 67.000000 67 3
4 US Fat 48.666667 45 3
5 US Height 766.666667 776 3
6 US Weight 589.333333 656 3
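That said, if a genuine 3-D array is still wanted, one option (a sketch only, building on the long_df created above rather than on split() + abind) is to cross-tabulate it, e.g. City x Measure x Week. Note that xtabs sums values that fall into the same cell, so with several people per city you would first aggregate or add a person dimension.

# reshape the long data into a City x Measure x Week array
arr <- xtabs(Value ~ City + Measure + Week, data = long_df)
dim(arr)                      # 2 x 3 x 3 for the sample data
arr["London", "Height", ]     # Wk1, Wk2, Wk3 height values for London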