Counting the number of specific strings in an R data frame

I have a data frame with many columns, but the two columns I am interested in are major and dept. I need a way to count the number of specific entries in a column. My data frame looks something like this:
student_num major dept
123 child education
124 child education
125 special education
126 justice administration
127 justice administration
128 justice administration
129 police administration
130 police administration
What I want is a student count for each major and department. Something like
education child special administration justice police
3 2 1 5 3 2
I have tried several methods, but nothing is quite what I need. I tried the aggregate() function and ddply() from plyr, but they give me a department count of two, i.e. the number of unique entries (education and administration). How can I count the occurrences of each unique entry rather than how many unique entries there are?

You can try:
library(dplyr)
count(my_dataframe, major)
count(my_dataframe, dept)
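These return per-column tallies as data frames. If you want a single combined tally across both columns, like the desired output above, one possibility (a sketch, assuming tidyr is also available and the data frame is named my_dataframe) is to stack the two columns first:
library(dplyr)
library(tidyr)
# Stack 'major' and 'dept' into one column, then tally every entry
my_dataframe %>%
  pivot_longer(c(major, dept), values_to = "entry") %>%
  count(entry)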

# Create example data frame
dt <- read.table(text = "student_num major dept
123 child education
124 child education
125 special education
126 justice administration
127 justice administration
128 justice administration
129 police administration
130 police administration",
header = TRUE, stringsAsFactors = FALSE)
# Select columns
dt <- dt[, c("major", "dept")]
# Flatten the two columns into a single character vector
dt_vec <- unlist(dt)
# Tabulate the occurrences of each entry
table(dt_vec)
dt_vec
administration          child      education        justice         police        special
             5              2              3              3              2              1

Related

Constrained K-means, R

I am currently using k-means to cluster my data; however, I want each cluster to appear only once in each given year. I have searched for answers all night with no result. Does anyone have ideas on this problem in R, or is there a package I should look at? Thanks.
More background info:
I am trying to replicate the clustering of relationships, using the reported gender, education level and birth year. I am doing this because this is survey data whose respondents are elderly people who sometimes report inaccurate age or education information. My main challenge is that I want to have only one cluster label in each survey year. For example, I do not want to see two cluster 3s in survey year 2000. My data looks like this:
survey year  relationship          gender  education level  birth year  k-means cluster
2000         41 (first daughter)   0       3                1997        1
2003         41 (first daughter)   0       3                1997        1
2000         42 (second daughter)  0       4                1999        2
2003         42 (second daughter)  0       4                1999        2
2000         42 (third daughter)   0       5                1999        2
2003         42 (third daughter)   0       5                2001        3
Thanks in advance.
--Update--
A more detailed description of the task:
The data set is panel survey data asking elders about their health status and their relationships (incl. sons, daughters, neighbors). Since these older people are sometimes imprecise about their family's demographic information, such as birth year or education level, we might otherwise need to delete a large part of the data when records do not match.
(e.g., a respondent reported his first son as 30 years old in 1997 but as 29 years old in 1999; this record would therefore be problematic). My task is to save as much data as possible when the imprecision is not that high.
Therefore I first mutated columns to check the precision of each family member (e.g., birth year error %in% c(-1, 2)). Next, I run k-means on the family members detected to be imprecise. In this way, I save much of the data. Although I did not solve the above problem, it occurs rarely enough that I can almost ignore it or drop these observations.
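No answer is quoted here, but one possible direction (a sketch, not a verified solution) is to run ordinary k-means first and then, within each survey year, re-assign rows to cluster centres one-to-one with the Hungarian algorithm, so that no label repeats inside a year. This assumes the clue package; dat is an illustrative stand-in for the real data:
library(clue)  # solve_LSAP(): Hungarian algorithm for one-to-one assignment

# Illustrative data modelled on the table above
dat <- data.frame(
  survey_year = c(2000, 2003, 2000, 2003, 2000, 2003),
  education   = c(3, 3, 4, 4, 5, 5),
  birth_year  = c(1997, 1997, 1999, 1999, 1999, 2001)
)

feats <- as.matrix(dat[, c("education", "birth_year")])
km    <- kmeans(feats, centers = 3)

# Within each survey year, assign rows to distinct clusters by minimising
# total distance to the cluster centres (needs rows per year <= clusters)
dat$cluster <- NA_integer_
for (yr in unique(dat$survey_year)) {
  idx  <- which(dat$survey_year == yr)
  cost <- as.matrix(dist(rbind(feats[idx, , drop = FALSE], km$centers)))
  cost <- cost[seq_along(idx), length(idx) + seq_len(nrow(km$centers)), drop = FALSE]
  dat$cluster[idx] <- as.integer(solve_LSAP(cost))
}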

How to find duplicates using Lat-Long data and make them a unique identifier in a big dataset

My dataset looks something like this (note: the data below are hypothetical).
Objective: a sales employee has to go to a particular location and verify the houses/stores/buildings, and a device captures the information below.
Sr.No.  Store_Name    Phone-No.  Agent_id  Area            Lat-Long
1       ABC Stores    89099090   121       Bay Area        23.909090,89.878798
2       Wuhan Masks   45453434   122       Santa Fe        24.452134,78.123243
3       Twitter Cafe  67556090   123       Middle East     11.889766,23.334483
4       abc           33445569   121       Santa Cruz      23.345678,89.234213
5       Silver Gym    11004110   234       Worli Sea Link  56.564311,78.909087
6       CK Clothings  00908876   223       90th Street     34.445887,12.887654
Facts:
#1 Unique identifier for finding duplicates – see Sr.No. 1 & 4, which are basically the same outlet.
In this dummy dataset all the columns can be manipulated, i.e. for the same store/house/building:
a) Since the name is entered manually, the same house/store can be entered under different names across multiple visits.
b) The mobile number can also be manipulated; a different number can be associated with the same outlet.
c) The device capturing the lat-long can also be fudged, by the agent moving closer to or just near the building.
Problem:
How can the Lat-Long data be made the unique identifier for finding duplicates in the huge dataset, keeping point c) above in mind?
Deploying QR codes is not very helpful either, as these can also be tweaked.
The aim is to stop fraudulent practice by employees (the same employee can visit the same store/outlet, or a different employee can visit the same outlet again, to inflate the visit count).
Right now I can only think of the Lat-Long column for building a UID; please feel free to suggest anything else.
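No answer is quoted here, but a hedged sketch of one common approach: since coordinates for the same outlet jitter slightly between visits, treat two rows as the same outlet when they fall within a small radius, and use the resulting cluster id as the UID. This assumes the geosphere package; visits and the 100 m threshold are illustrative:
library(geosphere)  # distm()/distHaversine(): great-circle distances

# Illustrative visits; note geosphere expects coordinates as (lon, lat)
visits <- data.frame(
  store = c("ABC Stores", "abc", "Wuhan Masks"),
  lat   = c(23.909090, 23.909091, 24.452134),
  lon   = c(89.878798, 89.878800, 78.123243)
)

# Pairwise distances in metres
d <- distm(visits[, c("lon", "lat")], fun = distHaversine)

# Single-linkage clustering cut at 100 m: visits closer than the threshold
# share a cluster, and the cluster id serves as the unique identifier
visits$uid <- cutree(hclust(as.dist(d), method = "single"), h = 100)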

Categorizing types of duplicates in R

Let's say I have the following data frame:
df <- data.frame(
  address = c('654 Peachtree St','890 River Rd','890 River Rd','890 River Rd',
              '1234 Main St','1234 Main St','567 1st Ave','567 1st Ave'),
  city    = c('Atlanta','Eugene','Eugene','Eugene','Portland','Portland','Pittsburgh','Etna'),
  state   = c('GA','OR','OR','OR','OR','OR','PA','PA'),
  zip5    = c('30308','97404','97404','97404','97201','97201','15223','15223'),
  zip9    = c('30308-1929','97404-3253','97404-3253','97404-3253','97201-5717',
              '97201-5000','15223-2105','15223-2105'),
  stringsAsFactors = FALSE
)
  address          city       state zip5  zip9
1 654 Peachtree St Atlanta    GA    30308 30308-1929
2 890 River Rd     Eugene     OR    97404 97404-3253
3 890 River Rd     Eugene     OR    97404 97404-3253
4 890 River Rd     Eugene     OR    97404 97404-3253
5 1234 Main St     Portland   OR    97201 97201-5717
6 1234 Main St     Portland   OR    97201 97201-5000
7 567 1st Ave      Pittsburgh PA    15223 15223-2105
8 567 1st Ave      Etna       PA    15223 15223-2105
I'm considering any rows with a matching address and zip5 to be duplicates.
Filtering out or keeping duplicates based on these two columns is simple enough in R. What I'm trying to do is create a new column with a conditional label for each set of duplicates, ending up with something similar to this:
  address      city       state zip5  zip9       type
1 890 River Rd Eugene     OR    97404 97404-3253 Exact Match
2 890 River Rd Eugene     OR    97404 97404-3253 Exact Match
3 890 River Rd Eugene     OR    97404 97404-3253 Exact Match
4 1234 Main St Portland   OR    97201 97201-5717 Different Zip9
5 1234 Main St Portland   OR    97201 97201-5000 Different Zip9
6 567 1st Ave  Pittsburgh PA    15223 15223-2105 Different City
7 567 1st Ave  Etna       PA    15223 15223-2105 Different City
(I'd also be fine with a True/False column for each type of duplicate.)
I'm assuming the solution will be in some mutate+ifelse+boolean code, but I think it's the comparing within each duplicate subset that has me stuck...
Any advice?
Edit:
I don't believe this is a duplicate of Find duplicated rows (based on 2 columns) in Data Frame in R. I can use that solution to create a T/F column for each type of duplicate/group_by match, but I'm trying to create exclusive categories. How could my conditions also take differences into account? The exact match rows should show true only on the "exact match" column, and false for every other column. If I define my columns simply by feeding different combinations of columns to group_by, the exact match rows will never return a False.
I think the key is grouping by a "reference" variable--here address makes sense--and then counting the number of unique items in each column within the group. It's not a perfect solution, since case_when prioritizes earlier conditions (i.e. if one address has both two different cities AND two different zip codes, you'll only see that there are two different cities; you would need additional case_when branches if that matters). However, the length of unique items is a reasonable heuristic if you don't need a perfectly granular solution.
library(dplyr)

df %>%
  group_by(address) %>%
  mutate(
    match_type = case_when(
      all(
        length(unique(city))  == 1,
        length(unique(state)) == 1,
        length(unique(zip5))  == 1,
        length(unique(zip9))  == 1) ~ "Exact Match",
      length(unique(city))  > 1 ~ "Different City",
      length(unique(state)) > 1 ~ "Different State",
      length(unique(zip5))  > 1 ~ "Different Zip5",
      length(unique(zip9))  > 1 ~ "Different Zip9"
    )
  )
Otherwise, you'll have to do iterative grouping (address + other variable) and mutate in a Boolean column as you alluded to.
Edit
One additional approach I just thought of, if you need a more granular solution, is to add an id column (df %>% rowid_to_column("ID")), full-join the table to itself by address with suffixes (e.g. suffix = c("a","b")), filter out same-ID rows and call distinct() (since each comparison appears twice), and then mutate in Boolean columns for the pairwise comparisons, as sketched below. It may be too computationally intensive depending on the size of your dataset, but it should work on the scale of a few thousand rows given a reasonable amount of RAM.
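A rough sketch of that self-join idea (assuming dplyr and tibble; the same_* column names are illustrative):
library(dplyr)
library(tibble)

pairs <- df %>%
  rowid_to_column("ID") %>%
  full_join(df %>% rowid_to_column("ID"), by = "address",
            suffix = c("_a", "_b")) %>%
  filter(ID_a < ID_b) %>%  # drops self-pairs and keeps each comparison once
  mutate(same_city = city_a == city_b,
         same_zip9 = zip9_a == zip9_b)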

Subsetting to Output only certain CHAR names in a single column

I want to keep all rows where the conm column contains certain bank names. As you can tell from the code, I am trying to use subset() to do this, but to no avail.
CMPSTPRFT12 <- subset(CMPSPRFT11, conm = MORGUARD CORP | conm = LEHMAN BROTHERS HOLDINGS INC)
I expect the output in RStudio to show only the rows where the column containing the names of banks includes certain banks, not all banks. I want SunTrust, Lehman Brothers, Morgan Stanley, Goldman Sachs, PennyMac, Bank of America, and Fannie Mae.
Please see other posts on how to phrase your questions more helpfully for others: How to make a great R reproducible example.
You can use dplyr and filter.
library(dplyr)

df <- data.frame(bank  = letters[1:10],
                 value = 10:19)

df %>% filter(bank == 'a' | bank == 'b')
bank value
1 a 10
2 b 11
banks <- c('d','g','j')
df %>% filter(bank %in% banks)
bank value
1 d 13
2 g 16
3 j 19
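Applied to the question's own objects, the same %in% pattern also fixes the broken subset() call (the bank strings below are placeholders; they must match how the names actually appear in conm):
# Bank names are illustrative -- check the exact spelling used in conm
banks <- c("SUNTRUST BANKS INC", "LEHMAN BROTHERS HOLDINGS INC",
           "MORGAN STANLEY", "GOLDMAN SACHS GROUP INC",
           "PENNYMAC FINANCIAL SERVICES", "BANK OF AMERICA CORP",
           "FANNIE MAE")
CMPSTPRFT12 <- subset(CMPSPRFT11, conm %in% banks)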

Comparing rows in the same R dataframe

I want to take the nth row in a dataframe, compare it to all rows that are not the nth row, and return how many of its columns match and/or mismatch.
I tried the match function and ifelse for single observations but I haven't been able to replicate it for the entire dataframe.
The dataset Superstore contains the order priority, customer name, ship mode, customer segment and product category. It looks like this:
> head(df2)
Order.Priority Customer.Name Ship.Mode Customer.Segment Product.Category
1 Not Specified Dana Teague Regular Air Corporate Office Supplies
2 Critical Vanessa Boyer Regular Air Consumer Office Supplies
3 Critical Wesley Tate Regular Air Corporate Technology
4 High Brian Grady Delivery Truck Corporate Furniture
5 Medium Kristine Connolly Delivery Truck Corporate Furniture
6 High Emily Britt Regular Air Corporate Office Supplies
The code I tried (extracting relevant columns):
df <- read.csv("Superstore.csv", header = TRUE)
df2 <- df[,c(2,4,5,6,7)]
match(df2[2,],df2[1,],nomatch = 0)
This returns:
> match(df2[2,],df2[1,],nomatch = 0)
[1] 0 0 3 0 5
Using ifelse I get:
> ifelse(df2[1,]==df2[2,],1,0)
Order.Priority Customer.Name Ship.Mode Customer.Segment Product.Category
1 0 0 1 0 1
Like I said, this is exactly the result I need, but I haven't been able to replicate it for the whole dataframe.
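No answer is quoted here, but one hedged way to extend this single comparison to all rows is to build an n x n matrix of pairwise match counts (a sketch; m and matches are illustrative names):
# Compare every row of the five selected columns against every other row
m <- as.matrix(df2)  # character matrix, so == compares cell by cell
n <- nrow(m)
matches <- outer(seq_len(n), seq_len(n),
                 Vectorize(function(i, j) sum(m[i, ] == m[j, ])))
diag(matches) <- NA  # ignore each row's comparison with itself
# matches[i, j] is the number of columns in which rows i and j agree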
