How to align Dates from different columns in R or Excel - r

I have a dataset of variables with each variable having a different date span. They are presented as in the following example (taken the first two cases out of 500):
DatesV1 DatesV2
29/12/1995 19/07/2001
02/01/1996 20/07/2001
03/01/1996 23/07/2001
04/01/1996 24/07/2001
05/01/1996 25/07/2001
08/01/1996 26/07/2001
09/01/1996 27/07/2001
10/01/1996 30/07/2001
11/01/1996 31/07/2001
12/01/1996 01/08/2001
15/01/1996 02/08/2001
16/01/1996 03/08/2001
17/01/1996 06/08/2001
What I want to happen is for the dates in DatesV2 to align with the dates in DatesV1. This means that DatesV2 will start with a few NA until the row that the dates align. Like this:
DatesV1 DatesV2
... ...
17/07/2001 NA
18/07/2001 NA
19/07/2001 19/07/2001
20/07/2001 20/07/2001
... ...
In the Example Set, I have the example of exactly what I am trying to do. I can't find of a fast computational way to do it in either R or Excel for the 500 variables that I have.
Example Set
I have tried something like this:
nhat<-which(Example$DatesV2[1]==Example$DatesV1)
nend<-which(Example$DatesV1[length(Example$DatesV1)-1]==Example$DatesV2)
Example$Apotelesma<- c(rep(NA,nhat-1),Example$DatesV2[1:nend],NA)
Which is an initial test that works for two dates. Only thing is that dates appear as numbers.

Here's a possible solution using some re-shaping. I'm using a simple example:
df = data.frame(DatesV1 = c("24/07/2001","25/07/2001","26/07/2001"),
DatesV2 = c("25/07/2001","26/07/2001","27/07/2001"),
DatesV3 = c("26/07/2001","27/07/2001","28/07/2001"),
stringsAsFactors = F)
library(tidyverse)
library(lubridate)
# update to date columns (only if needed)
df = df %>% mutate_all(dmy)
df %>%
gather() %>% # reshape dataset
mutate(id = value) %>% # use date values as row ids
spread(key, value) %>% # reshape again
select(-id) # remove ids
# DatesV1 DatesV2 DatesV3
# 1 2001-07-24 <NA> <NA>
# 2 2001-07-25 2001-07-25 <NA>
# 3 2001-07-26 2001-07-26 2001-07-26
# 4 <NA> 2001-07-27 2001-07-27
# 5 <NA> <NA> 2001-07-28

This an ugly/messy method if you like, but it gets the job done. Any faster and tidier way would be better.
n<-nrow(DataAlignment)
Newdata<-matrix(0,5148,ncol(DataAlignment))
loops<-ncol(DataAlignment)-1
for(i in 1:loops){
nhat<-which(DataAlignment[1,i+1]==DataAlignment[,1]) #finds the position of the first date in column 2 according to the first column
nend<-which(DataAlignment[n,1]==DataAlignment[,i+1]) #finds the position of last date in col 2 according to the first column
if(nhat==1 | nend != 5148){ #takes into account when they start at the same time but end in different dates
Newdata[,i+1]<-c(DataAlignment[c(1:nend),i+1],rep(NA,n-nend))
}
else{if(nhat==1| nend==5148){Newdata[,i+1]<-c(DataAlignment[,i+1])} #this takes account when they start and end at the same time
else{if(nhat!=1){
Newdata[,i+1]<-c(rep(NA,nhat-1),DataAlignment[c(1:nend),i+1])}}} #creates the new data
}

Related

How to count the occurrence of a word in multiple variables in R and sort them from highest to lowest?

I have a huge dataset with over 3 million obs and 108 columns. There are 14 variables I'm interested in: DIAG_PRINC, DIAG_SECUN, DIAGSEC1:DIAGSEC9, CID_ASSO, CID_MORTE and CID_NOTIF (they're in different positions). These variables contain ICD-10 codes.
I'm interested in counting how many times certain ICD-10 codes appear and then sort them from highest to lowest in a dataframe. Here's some reproductible data:
data <- data.frame(DIAG_PRINC = c("O200", "O200", "O230"),
DIAG_SECUN = c("O555", "O530", "O890"),
DIAGSEC1 = c("O766", "O876", "O899"),
DIAGSEC2 = c("O200", "I520", "O200"),
DIAGSEC3 = c("O233", "O200", "O620"),
DIAGSEC4 = c("O060", "O061", "O622"),
DIAGSEC5 = c("O540", "O123", "O344"),
DIAGSEC6 = c("O876", "Y321", "S333"),
DIAGSEC7 = c("O450", "X900", "O541"),
DIAGSEC8 = c("O222", "O111", "O123"),
DIAGSEC9 = c("O987", "O123", "O622"),
CID_MORTE = c("O066", "O699", "O555"),
CID_ASSO = c("O600", "O060", "O068"),
CID_NOTIF = c("O069", "O066", "O065"))
I also have a list of ICD-10 codes that I'm interested in counting.
GRUPO1 <- c("O00", "O000", "O001", "O002", "O008", "O009",
"O01", "O010", "O011", "O019",
"O02", "O020", "O021", "O028", "O029",
"O03", "O030", "O031", "O032", "O033", "O034", "O035", "O036", "O037",
"O038", "O039",
"O04", "O040", "O041", "O042", "O043", "O044", "O045", "O046", "O047",
"O048", "O049",
"O05", "O050", "O051", "O052", "O053", "O054", "O055", "O056", "O057",
"O058", "O059",
"O06", "O060", "O061", "O062", "O063", "O064", "O065", "O066", "O067",
"O068", "O069",
"O07", "O070", "O071", "O072", "O073", "O074", "O075", "O076", "O077",
"O078", "O079",
"O08", "O080", "O081", "O082", "O083", "O084", "O085", "O086", "O087",
"O088", "O089")
What I need is a dataframe counting how many times the ICD-10 codes from "GRUPO1" appear in any row/column from DIAG_PRINC, DIAG_SECUN, DIAGSEC1:DIAGSEC9, CID_ASSO, CID_MORTE and CID_NOTIF variables. For example, on my reproductible data ICD-10 cod "O066" appears twice.
Thank you in advance!
We can unlist the data into a vector, use %in% to subset the values from 'GRUPO1' and get the frequency count with table in base R
v1 <- unlist(data)
out <- table(v1[v1 %in% GRUPO1])
out[order(-out)]
O060 O066 O061 O065 O068 O069
2 2 1 1 1 1
Here is a tidyverse solution using tidyr and dplyr:
library(tidyverse)
pivot_longer(data, everything()) %>%
filter(value %in% GRUPO1) %>%
count(value)
Output
value n
<chr> <int>
1 O060 2
2 O061 1
3 O065 1
4 O066 2
5 O068 1
6 O069 1

Problem formatting spreadsheets in R, how can I read and write to tables using R?

I'm working with R for the first time for a class in college. To preface this: I don't know enough to know what I don't know, so I'm sorry if this question has been asked before. I am trying to predict the results of the Texas state house elections in 2020, and I think the best prior for that is the results of the 2018 state house elections. There are 150 races, so I can't bare to input them all by hand, but I can't find any spreadsheet that has data formatted how I want it. I want it in a pretty standard table format:
My desired table format. However, the table from the Secretary of state I have looks like the following:
Gross ugly table.
I wrote some psuedo code:
Here's the Psuedo Code, basically we want to construct a new CSV:
'''%First, we want to find a district, the house races are always preceded by a line of dashes, so I will need a function like this:
Create a New CSV;
for(x=1; x<151 ; x +=1){
Assign x to the cell under the district number cloumn;
Find "---------------" ;
Go down one line;
Go over two lines;
% We should now be in the third column and now want to read in which party got how many votes. The number of parties is not consistant, so we need to account for uncontested races, libertarians, greens, and write ins. I want totals for Republicans, Democrats, and Other.
while(cell is not empty){
Party <- function which reads cell (but I want to read a string);
go right one column;
Votes <- function which reads cell (but I want to read an integer);
if(Party = Rep){
put this data in place in new CSV;
else if (Party = Dem)
put this data in place in new CSV;
else
OtherVote += Votes;
};
};
Assign OtherVote to the column for other party;
OtherVote <- 0;
%Now I want to assign 0 to null cells (ones where no rep, or no Dem, or no other party contested
read through single row 4 spaces, if its null assign it 0;
Party <- null
};'''
But I don't know enough to google what to do! Here's what I need help with: Can I create a new CSV in Rstudio, how? How can I read specific cells in a table, hopefully indexing? Lastly, how do I write to a table in R. Any help is appreciated! Thank you!
Can I create a new CSV in Rstudio, how?
Yes you can. Use the "write.csv" function.
write.csv(df, file = "df.csv") #see help for more information.
How can I read specific cells in a table?
Use the brackets after df,example below.
df <- data.frame(x = c(1,2,3), y = c("A","B","C"), z = c(15,25,35))
df[1,1]
#[1] 1
df[1,1:2]
# x y
#1 1 A
How do I write to a table in R?
If you want to write a table in xlsx use the function write.xlsx from openxlsx package.
Wikipedia seems to have a table that is closer to the format you are looking for.
In order to get to the table you are looking for we need a few steps:
Download data from Wikipedia and extract table.
Clean up table.
Select columns.
Calculate margins.
1. Download data from wikipedia and extract table.
The rvest table helps with downloading and parsing websites into R objects.
First we download the HTML of the whole website.
library(dplyr)
library(rvest)
wiki_html <-
read_html(
"https://en.wikipedia.org/wiki/2018_United_States_House_of_Representatives_elections_in_Texas"
)
There are a few ways to get a specific object from an HTML file in this case
I dedided to look for the table that has the class name “wikitable plainrowheaders sortable”,
as I learned from inspecting the code, that the only table with that class is
the one we want to extract.
library(purrr)
html_nodes(wiki_html, "table") %>%
map_lgl( ~ html_attr(., "class") == "wikitable plainrowheaders sortable") %>%
which()
#> [1] 20
Then we can select table number 20 and convert it to a dataframe with html_table()
raw_table <-
html_nodes(wiki_html, "table")[[20]] %>%
html_table(fill = TRUE)
2. Clean up table.
The table has duplicated names, we can change that by using as_tibble() and its .name_repair argument. We then usedplyr::select() to get the columns. Furthermore we usedplyr::filter() to delete the first two rows, that have "District" as a value in theDistrictcolumn. Now the columns are still characters
vectors, but we need them to be numeric, therefore we first delete commas from
all columns and then transform columns 2 to 4 to numeric.
clean_table <-
raw_table %>%
as_tibble(.name_repair = "unique") %>%
filter(District != "District") %>%
mutate_all( ~ gsub(",", "", .)) %>%
mutate_at(2:4, as.numeric)
3. Select columns and 4. Calculate margins.
We use dplyr::select() to select the columns you are interested in and give them more helpful names.
Finally we calculate the margin between democratic and republican votes by first adding up there votes
as total_votes and then dividing the difference by total_votes.
clean_table %>%
select(District,
RepVote = Republican...2,
DemVote = Democratic...4,
OthVote = Others...6) %>%
mutate(
total_votes = RepVote + DemVote,
margin = abs(RepVote - DemVote) / total_votes * 100
)
#> # A tibble: 37 x 6
#> District RepVote DemVote OthVote total_votes margin
#> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 District 1 168165 61263 3292 229428 46.6
#> 2 District 2 139188 119992 4212 259180 7.41
#> 3 District 3 169520 138234 4604 307754 10.2
#> 4 District 4 188667 57400 3178 246067 53.3
#> 5 District 5 130617 78666 224 209283 24.8
#> 6 District 6 135961 116350 3731 252311 7.77
#> 7 District 7 115642 127959 0 243601 5.06
#> 8 District 8 200619 67930 4621 268549 49.4
#> 9 District 9 0 136256 16745 136256 100
#> 10 District 10 157166 144034 6627 301200 4.36
#> # … with 27 more rows
Edit: In case you want to go with the data provided by the state, it looks to me as if the data you are looking for is in the first, third and fourth column. So what you want to do is.
(All the code below is not tested, as I do not have the original data.)
read data into R
library(readr)
tx18 <- read_csv("filename.csv")
select relevant columns
tx18 <- tx18 %>%
select(c(1,3,4))
clean table
tx18 <- tx18 %>%
filter(!is.na(X3),
X3 != "Party",
X3 != "Race Total")
Group and summarize data by party
tx18 <- tx18 %>%
group_by(X3) %>%
summarise(votes = sum(X3))
Pivot/ Reshape data to wide format
tx18 %>$
pivot_wider(names_from = X3,
values_from = votes)
After this you could then calculate the margin similarly as I did with the Wikipedia data.

Melt Data and fill new column with desired data

Hello coding community
I have a two part question that is 1/2 answered
transpose, aka melt data frame, to my liking - done
add rows of data based on results found in "removed" column, a column created in the transposing step - stuck here
df<- read.table("https://pastebin.com/raw/NEPcUG01",header=T, sep="\t")
df_transformed<-tidyr::gather(df, day, removed, -(1:2), na.rm = TRUE) # melted data
In my example here (df), I have an experiment ran over 8 days. On certain days, I remove data points, and I am only interested in these days (hence why I added na.rm = TRUE in the transposing process). I sometimes remove 1 data point, or 4 (but this could be any number really)
I would like the removed data points to be called "individuals", and for them to be counted in chronological order. Therefore, I first need to add a column called "individuals"
df_transformed$individual <- ""
I would like to fill in the "individual" column based on the results in the "removed" column.
example: cage 2 had only 1 data point removed, and it was on day_8. I would therefore like to add, in the "individual" column, a 1. Cage 4, on the other hand, had data points removed on day_5 (1 data point) and day_7 (3 data points), for a total of 4 data points , aka , 4 "individuals". Therefore, Cage 4, when starting with day_5, I would like to add a 1 in the "individuals" column, and for day 7, create 3 total rows of data, and continue my "individual count" with 2,3,4. IF day_8 had 3 more data points removed, the individual count would continue with 5,6,7.
My desired result for my example data set today would be this:
desired_results <- read.table("https://pastebin.com/raw/r7QrC0y3", header=T, sep="\t") # 68 total rows of data
Interesting piece of information: The total number of rows in my final data set should equal the sum of all removed data points:
sum(df_transformed$removed) # 68
Thank you StackOverflow community. Looking forward to seeing the results.
We can use complete to create a sequence from 1 to each individual grouped by cage and day. We then fill the NA values in columns experiment and removed.
library(dplyr)
library(tidyr)
df_transformed %>%
mutate(individual = removed) %>%
group_by(cage, day) %>%
complete(individual = seq_len(individual)) %>%
fill(experiment, removed, .direction = "up")
# cage day individual experiment removed
#1 2 day_8 1 sugar 1
#2 3 day_5 1 sugar 1
#3 4 day_5 1 sugar 3
#4 4 day_5 2 sugar 3
#5 4 day_5 3 sugar 3
#6 4 day_7 1 sugar 1
#7 7 day_7 1 sugar 1
#8 7 day_8 1 sugar 1
#9 8 day_5 1 sugar 2
#10 8 day_5 2 sugar 2
# … with 58 more rows
To update individual only based on cage we can do
df_transformed %>%
mutate(individual = removed) %>%
group_by(cage, day) %>%
complete(individual = seq_len(individual)) %>%
group_by(cage) %>%
mutate(individual = row_number()) %>%
fill(experiment, removed, .direction = "up")
I think the following bit of code does what you need:
library(tidyverse)
read.table("https://pastebin.com/raw/NEPcUG01",header=T, sep="\t") %>%
pivot_longer(starts_with("day_"), names_to = "day", values_to = "removed") %>%
# drop_na() %>%
group_by(cage) %>%
summarize(individual = sum(removed, na.rm = TRUE))
I have used the pipe operator (%>%), which enables cleaner syntax. I have also used the newer pivot_longer function instead of gather. Then, grouping by cage and later summing over the individual column with summarize you get how many individuals were removed per cage.
I checked the sum of all the individuals and it seems to work:
read.table("https://pastebin.com/raw/NEPcUG01",header=T, sep="\t") %>%
pivot_longer(starts_with("day_"), names_to = "day", values_to = "removed") %>%
# drop_na() %>%
group_by(cage) %>%
summarize(individual = sum(removed, na.rm = TRUE)) %>%
pull(individual) %>%
sum()
#> [1] 68
The result is slightly different to your desired result. I am not 100% your desired result is actually correct... From your question, I understand that cage 4 should have 4 individuals, but in your desired_result it appears 4 times with values 1, 2, 3 and 4. The code I sent you generates a data frame where each appears in a single row.

R Beginner struggling with extremely messy XLSX

I got an XLSX with data from a questionnaire for my master thesis.
The questions and answers for an interviewee are in one row in the second column. The first column contains the date.
The data of the second column comes in a form like this:
"age":"52","height":"170","Gender":"Female",...and so on
I started with:
test12 <- read_xlsx("Testdaten.xlsx")
library(splitstackshape)
test13 <- concat.split(data = test12, split.col= "age", sep =",")
Then I got the questions and the answers as a column divided by a ":".
For e.g. column 1: "age":"52" and column2:"height":"170".
But the data is so messy that sometimes in the column of the age question and answer there is a height question and answer and for some questionnaires questions and answers double.
I would need the questions as variables and the answers as observations. But I have no clue how to get there. I could clean the data in excel first, but with the fact that columns are not constant and there are for e.g. some height questions in the age column I see no chance to do it as I will get new data regularly, formated the same way.
Here is an example of the data:
A tibble: 5 x 2
partner.createdAt partner.wphg.info
<chr> <chr>
1 2019-11-09T12:13:11.099Z "{\"age_years\":\"50\",\"job_des\":\"unemployed\",\"height_cm\":\"170\",\"Gender\":\"female\",\"born_in\":\"Italy\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"200000\""
2 2019-11-01T06:43:22.581Z "{\"age_years\":\"34\",\"job_des\":\"self-employed\",\"height_cm\":\"158\",\"Gender\":\"male\",\"born_in\":\"Germany\",\"Alcoholic\":\"true\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"10000\""
3 2019-11-10T07:59:46.136Z "{\"age_years\":\"24\",\"height_cm\":\"187\",\"Gender\":\"male\",\"born_in\":\"England\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"150000\""
4 2019-11-11T13:01:48.488Z "{\"age_years\":\"59\",\"job_des\":\"employed\",\"height_cm\":\"167\",\"Gender\":\"female\",\"born_in\":\"United States\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"2\",\"total_wealth\":\"1000000~
5 2019-11-08T14:54:26.654Z "{\"age_years\":\"36\",\"height_cm\":\"180\",\"born_in\":\"Germany\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"170000\",\"job_des\":\"employed\",\"Gender\":\"male\""
Thank you so much for your time!
You can loop through each entry, splitting at , as you did. Then you can loop through them all again, splitting at :.
The result will be a bunch of variable/value pairings. This can be all done stacked. Then you just want to pivot back into columns.
data
Updated the data based on your edit.
data <- tribble(~partner.createdAt, ~partner.wphg.info,
'2019-11-09T12:13:11.099Z', '{\"age_years\":\"50\",\"job_des\":\"unemployed\",\"height_cm\":\"170\",\"Gender\":\"female\",\"born_in\":\"Italy\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"200000\"',
'2019-11-01T06:43:22.581Z', '{\"age_years\":\"34\",\"job_des\":\"self-employed\",\"height_cm\":\"158\",\"Gender\":\"male\",\"born_in\":\"Germany\",\"Alcoholic\":\"true\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"10000\"',
'2019-11-10T07:59:46.136Z', '{\"age_years\":\"24\",\"height_cm\":\"187\",\"Gender\":\"male\",\"born_in\":\"England\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"150000\"',
'2019-11-11T13:01:48.488Z', '{\"age_years\":\"59\",\"job_des\":\"employed\",\"height_cm\":\"167\",\"Gender\":\"female\",\"born_in\":\"United States\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"2\",\"total_wealth\":\"1000000\"',
'2019-11-08T14:54:26.654Z', '{\"age_years\":\"36\",\"height_cm\":\"180\",\"born_in\":\"Germany\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"170000\",\"job_des\":\"employed\",\"Gender\":\"male\"')
libraries
We need a few here. Or you can just call tidyverse.
library(stringr)
library(purrr)
library(dplyr)
library(tibble)
library(tidyr)
function
This function will create a data frame (or tibble) for each question. The first column is the date, the second is the variable, the third is the value.
clean_record <- function(date, text) {
clean_records <- str_split(text, pattern = ",", simplify = TRUE) %>%
str_remove_all(pattern = "\\\"") %>% # remove double quote
str_remove_all(pattern = "\\{|\\}") %>% # remove curly brackets
str_split(pattern = ":", simplify = TRUE)
tibble(date = as.Date(date), variable = clean_records[,1], value = clean_records[,2])
}
iteration
Now we use pmap_dfr from purrr to loop over the rows, outputting each row with an id variable named record.
This will stack the data as described in the function. The mutate() line converts all variable names to lowercase. The distinct() line will filter out rows that are exact duplicates.
What we do then is just pivot on the variable column. Of course, replace data with whatever you name your data frame.
data_clean <- pmap_dfr(data, ~ clean_record(..1, ..2), .id = "record") %>%
mutate(variable = tolower(variable)) %>%
distinct() %>%
pivot_wider(names_from = variable, values_from = value)
result
The result is something like this. Note how I had reordered some of the columns, but it still works. You are probably not done just yet. All columns are now of type character. You need to figure out the desired type for each and convert.
# A tibble: 5 x 10
record date age_years job_des height_cm gender born_in alcoholic knowledge_selfass total_wealth
<chr> <date> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 2019-11-09 50 unemployed 170 female Italy false 5 200000
2 2 2019-11-01 34 self-employed 158 male Germany true 3 10000
3 3 2019-11-10 24 NA 187 male England false 3 150000
4 4 2019-11-11 59 employed 167 female United States false 2 1000000
5 5 2019-11-08 36 employed 180 male Germany false 5 170000
For example, convert age_years to numeric.
data_clean %>%
mutate(age_years = as.numeric(age_years))
I am sure you may run into other things, but this should be a start.

How can I group by one variable in terms of status of a different variable in a longitudinal situation in R?

I'm new to R, so please go easy on me... I have some longitudinal data that looks like
Basically, I'm trying to find a way to get a table with a) the number of unique cases that have all complete data and b) the number of unique cases that have at least one incomplete or missing data. The end results would ideally be
df<- df %>% group_by(Location)
df1<- df %>% group_by(any(Completion_status=='Incomplete' | 'Missing'))
Not sure about what you want, because it seems there are something of inconsistent between your request and the desired output, however lets try, it seems you need a kind of frequency table, that you can manage with basic R. At the bottom of the answer you can find some data similar to yours.
# You have two cases, the Complete, and the other, so here a new column about it:
data$case <- ifelse(data$Completion_status =='Complete','Complete', 'MorIn')
# now a frequency table about them: if you want a data.frame, here we go
result <- as.data.frame.matrix(table(data$Location,data$case))
# now the location as a new column rather than the rownames
result$Location <- rownames(result)
# and lastly a data.frame with the final results: note that you can change the names
# of the columns but if you want spaces maybe a tibble is better
result <- data.frame(Location = result$Location,
`Number.complete` = result$Complete,
`Number.incomplete.missing` = result$MorIn)
result
Location Number.complete Number.incomplete.missing
1 London 0 1
2 Los Angeles 0 1
3 Paris 3 1
4 Phoenix 0 2
5 Toronto 1 1
Or if you prefere a dplyr chain:
data %>%
mutate(case = ifelse(data$Completion_status =='Complete','Complete', 'MorIn')) %>%
do( as.data.frame.matrix(table(.$Location,.$case))) %>%
mutate(Location = rownames(.)) %>%
select(3,1,2) %>%
`colnames<-`(c("Location","Number of complete ", "Number of incomplete or"))
Location Number of complete Number of incomplete or
1 London 0 1
2 Los Angeles 0 1
3 Paris 3 1
4 Phoenix 0 2
5 Toronto 1 1
With data:
# here your data (next time try to put them in an usable way in the question)
data <- data.frame( ID = c("A1","A1","A2","A2","B1","C1","C2","D1","D2","E1"),
Location = c('Paris','Paris','Paris','Paris','London','Toronto','Toronto','Phoenix','Phoenix','Los Angeles'),
Completion_status = c('Complete','Complete','Incomplete','Complete','Incomplete','Missing',
'Complete','Incomplete','Incomplete','Missing'))

Resources