I'm currently finding the duplicates, but the output doesn't show the row number, name and number, and it isn't formatted correctly (see below for the expected output).
Use df.duplicated with keep=False to get a boolean mask of the duplicated rows, then use it to extract those rows:
import pandas as pd

# split name / number from your csv file
df = pd.read_csv('names_dup2.csv', quoting=1, header=None)[0] \
       .str.split('\t', expand=True)
# increment index to match line number
df.index += 1
# keep duplicate entries
out = df[df[0].duplicated(keep=False)]
# export to duplicated_data.csv
out.to_csv('duplicated_data.csv', header=False)
Content of output file:
15,ANDREW ZHAO CHONG,83091746
19,ANDREW ZHAO CHONG,83091746
26,ANDREW ZHAO CHONG,83091746
48,ANDREW ZHAO CHONG,83091746
53,KOH KANG RI,89943392
56,KOH KANG RI,89943392
63,ENOS ZHAO KANG SONG,80746554
66,ENOS ZHAO KANG SONG,80746554
80,ENOS ZHAO KANG SONG,80746554
One-line version:
pd.read_csv('names_dup2.csv', quoting=1, header=None)[0] \
.str.split('\t', expand=True) \
.assign(index=lambda x: x.index+1) \
.set_index('index') \
[lambda x: x[0].duplicated(keep=False)] \
.to_csv('duplicated_data.csv', header=False)
This is happening because .duplicated returns a boolean series (True/False), which you are saving directly.
But you should be using this to subset the data, like so:
import pandas as pd

df_state = pd.DataFrame(
    [["3 Liu Yu,876"],
     ["4 Koh chong,123"],
     ["3 Liu Yu,876"]])
df_state = df_state[0].str.split(" ", expand=True)
print(df_state, "\n")
print(df_state, "\n")
duplicated = df_state.duplicated() # just a boolean series
print(duplicated, "\n")
print(df_state[duplicated], "\n") ## <- subset and save with .to_csv
# as Anders Källmar points out, you can also do this:
all_duplicated = df_state.duplicated(keep=False)
print(df_state[all_duplicated])
Output:
   0    1          2
0  3  Liu     Yu,876
1  4  Koh  chong,123
2  3  Liu     Yu,876

0    False
1    False
2     True
dtype: bool

   0    1       2
2  3  Liu  Yu,876

   0    1       2
0  3  Liu  Yu,876
2  3  Liu  Yu,876
I'm currently having an issue importing a data set of tweets so that every observation ends up in one column.
This is the data before import; there are three rows for each tweet, with a blank line in between.
T 2009-06-11 00:00:03
U http://twitter.com/imdb
W No Post Title
T 2009-06-11 16:37:14
U http://twitter.com/ncruralhealth
W No Post Title
T 2009-06-11 16:56:23
U http://twitter.com/boydjones
W listening to "Big Lizard - The Dead Milkmen" ♫ http://blip.fm/~81kwz
library(tidyverse)
tweets1 <- read_csv("tweets.txt.gz", col_names = F,
skip_empty_rows = F)
This is the output:
Parsed with column specification:
cols(
X1 = col_character()
)
Warning message:
“71299 parsing failures.
row col expected actual file
35 -- 1 columns 2 columns 'tweets.txt.gz'
43 -- 1 columns 2 columns 'tweets.txt.gz'
59 -- 1 columns 2 columns 'tweets.txt.gz'
71 -- 1 columns 5 columns 'tweets.txt.gz'
107 -- 1 columns 3 columns 'tweets.txt.gz'
... ... ......... ......... ...............
See problems(...) for more details.
”
# A tibble: 1,220,233 x 1
X1
<chr>
1 "T\t2009-06-11 00:00:03"
2 "U\thttp://twitter.com/imdb"
3 "W\tNo Post Title"
4 NA
5 "T\t2009-06-11 16:37:14"
6 "U\thttp://twitter.com/ncruralhealth"
7 "W\tNo Post Title"
8 NA
9 "T\t2009-06-11 16:56:23"
10 "U\thttp://twitter.com/boydjones"
# … with 1,220,223 more rows
The only issue is the many parsing failures, where problems(tweets1) shows that R expected one column but got multiple. Any ideas on how to fix this? My output should contain 1.4 million rows according to my professor, so I'm unsure whether this parsing issue is the key here. Any help is appreciated!
Maybe something like this will work for you.
data
data <- 'T 2009-06-11 00:00:03
U http://twitter.com/imdb
W No Post Title
T 2009-06-11 16:37:14
U http://twitter.com/ncruralhealth
W No Post Title
T 2009-06-11 16:56:23
U http://twitter.com/boydjones
W listening to "Big Lizard - The Dead Milkmen" ♫ http://blip.fm/~81kwz'
For a large file, fread() should be quick. Setting sep = NULL basically says "just read in full lines". You will replace input = data with file = "tweets.txt.gz".
library(data.table)
read_rows <- fread(input = data, header = FALSE, sep = NULL, blank.lines.skip = TRUE)
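If you would rather stay in the tidyverse for this step too, a hedged readr equivalent (reading whole lines with read_lines(), which also handles the .gz file, then dropping the blank separator lines) could look like this:
library(readr)
library(tibble)

# Hedged readr alternative to the fread() call above: read whole lines,
# drop the blank separator lines, keep a single column named V1.
lines <- read_lines("tweets.txt.gz")
read_rows <- tibble(V1 = lines[lines != ""])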
processing
You could just stay with data.table, but I noticed you are already in the tidyverse.
library(dplyr)
library(stringr)
library(tidyr)
Basically I am grabbing the first character (T, U, W) and storing it in a variable called Column. I am adding another column called Content for the rest of the string, with whitespace trimmed on both ends. I also added an ID column so I know how to group the clusters of 3 rows.
Then you basically just pivot on the Column. I am not sure if you wanted this last step or not, so remove as needed.
read_rows %>%
  mutate(ID = rep(1:(n() / 3), each = 3),          # one ID per block of 3 rows (T, U, W)
         Column = str_sub(V1, 1, 1),               # leading letter: T, U or W
         Content = str_trim(str_sub(V1, 2))) %>%   # rest of the line, trimmed
  select(-V1) %>%
  pivot_wider(names_from = Column, values_from = Content)
result
# A tibble: 3 x 4
ID T U W
<int> <chr> <chr> <chr>
1 1 2009-06-11 00:00:03 http://twitter.com/imdb No Post Title
2 2 2009-06-11 16:37:14 http://twitter.com/ncruralhealth No Post Title
3 3 2009-06-11 16:56:23 http://twitter.com/boydjones "listening to \"Big Lizard - The Dead Milkmen\" ♫ http://blip.fm/~81kwz"
I have a .csv file with the following type of data:
Day Item
1 12,19,24,31,48,
1 1,19,
1 16,28,32,45,
1 19,36,41,43,44,
1 7,24,27,
1 21,31,33,41,
1 46
1 50
2 12,31,36,48,
2 17,29,47,
2 2,18,20,29,38,39,40,41
2 17,29,47,
And I can't get read.transactions() to read it properly.
The data set is based on several item selections for each day (more than one per day, if necessary). For instance, the third selection on day 1 returned items 16, 28, 32, and 45.
Shouldn't this be enough?
library(arules)
dataset <- read.transactions("file.csv", format = 'basket')
I have tried to create sample data using the data provided by you:
data <- read.table(text="Day Item
1 12,19,24,31,48,
1 1,19,
1 16,28,32,45,
1 19,36,41,43,44,
1 7,24,27,
1 21,31,33,41,
1 46
1 50
2 12,31,36,48,
2 17,29,47,
2 2,18,20,29,38,39,40,41
2 17,29,47",header = T)
data <- as(data[-1], "transactions") ##removing 1st header column for the transactional data
inspect(data)
## apply apriori algorithm ###
rules <- apriori(data, parameter = list(supp = 0.001, conf = 0.80))
### arrange the rules by lift and inspect the top 10 ###
inspect(sort(rules, by = "lift")[1:10])
Please try this method; I hope it helps.
Edit: providing further clarification on the requirement.
I'm fairly new to R and I've currently hit a roadblock while tidying up my data.
My current data looks like this:
Data
1 AAA TEXT Here
2 ZX
3 YX
4 ****
5 BBB Text Here
6 AL
7 TP
8 XY
9 ******
10 CCC Text Here
11 PP
12 QV
13 ******
AAA, BBB and CCC are like my 'identifiers', and the *** marks the end of the lines related to each identifier. In this sample output, I only want to extract BBB and the next 3 lines after it. I need to select the in-between rows and transform my table to just this:
Data
1 BBB Text Here
2 AL
3 TP
4 XY
Can you please help? Thanks!
Hmm. Your method of data storage is not what any of us would recommend, but if what you have written is indeed how you have stored your data, then you can use a method outlined in this answer to find the row number of the line matching your specified identifier.
# Set up test 'identifier' value
WantedIdentifier = "BBB Text Here"

# Get matching row number
RowNo = which(Text == WantedIdentifier, arr.ind = TRUE)[1]

# Return from that row to the third row beyond it
ReturnedText = if (!is.na(RowNo)) data.frame(Data = Text[RowNo:(RowNo + 3), ]) else NA
# Value returned
> ReturnedText
Data
1 BBB Text Here
2 AL
3 TP
4 XY
Test data setup
Text = read.table(text = "Data
'AAA TEXT Here'
'ZX'
'YX'
'****'
'BBB Text Here'
'AL'
'TP'
'XY'
'******'
'CCC Text Here'
'PP'
'QV'
'******'", header = TRUE, stringsAsFactors = FALSE)
I'm looking for an easy fix to read a .txt file that looks like this when opened in Excel:
IDmaster By_uspto App_date Grant_date Applicant Cited
2 1 19671106 19700707 Motorola Inc 1052446
2 1 19740909 19751028 Gen Motors Corp 1062884
2 1 19800331 19820817 Amp Incorporated 1082369
2 1 19910515 19940719 Dell Usa L.P. 389546
2 1 19940210 19950912 Schueman Transfer Inc. 1164239
2 1 19940217 19950912 Spacelabs Medical Inc. 1164336
EDIT: Opening the txt file in notepad looks like this (with commas). The last two rows exhibit the problem.
IDmaster,By_uspto,App_date,Grant_date,Applicant,Cited
2,1,19671106,19700707,Motorola Inc,1052446
2,1,19740909,19751028,Gen Motors Corp,1062884
2,1,19800331,19820817,Amp Incorporated,1082369
2,1,19910515,19940719,Dell Usa L.P.,389546
2,1,19940210,19950912,Schueman Transfer, Inc.,1164239
2,1,19940217,19950912,Spacelabs Medical, Inc.,1164336
The problem is that some of the Applicant names contain commas so that they are read as if they belong in a different column, which they actually don't.
Is there a simple way to
a) "teach" R to keep string variables together, regardless of commas in between, or
b) read in the first 4 columns and then add an extra column for everything after the last comma? (a sketch of this idea follows below)
Given the length of the data, I can't open it entirely in Excel, which would otherwise be a simple alternative.
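A hedged sketch along the lines of approach (b), assuming the raw file looks exactly like the Notepad view above (a header line plus comma-separated rows; the file name "data.txt" is made up here): split each line on the first four commas and on the last comma, and keep whatever sits in between as the Applicant, commas and all.
# Sketch of approach (b): 4 leading fields, then the Applicant
# (greedy, may contain commas), then the final Cited field.
lines  <- readLines("data.txt")
header <- strsplit(lines[1], ",")[[1]]
body   <- lines[-1]

m <- regmatches(body, regexec("^([^,]*),([^,]*),([^,]*),([^,]*),(.*),([^,]*)$", body))
parsed <- do.call(rbind, lapply(m, function(x) x[-1]))
df <- setNames(as.data.frame(parsed, stringsAsFactors = FALSE), header)
df
The greedy (.*) is what lets the Applicant keep its internal commas while the four leading fields and the trailing Cited field stay clean.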
If your example is written in a "Test.csv" file, try with:
read.csv(text = gsub(', ', ' ', paste0(readLines("Test.csv"), collapse = "\n")),
         quote = "'",
         stringsAsFactors = FALSE)
It returns:
# IDmaster By_uspto App_date Grant_date Applicant Cited
# 1 2 1 19671106 19700707 Motorola Inc 1052446
# 2 2 1 19740909 19751028 Gen Motors Corp 1062884
# 3 2 1 19800331 19820817 Amp Incorporated 1082369
# 4 2 1 19910515 19940719 Dell Usa L.P. 389546
# 5 2 1 19940210 19950912 Schueman Transfer Inc. 1164239
# 6 2 1 19940217 19950912 Spacelabs Medical Inc. 1164336
This is a very silly workaround, but it does the trick for me (because I don't really care about the Applicant names at the moment). However, I'm hoping for a better solution.
Step 1: Open the .txt file in Notepad and add five extra column names V1, V2, V3, V4, V5 (to be sure to capture names with multiple commas).
bc <- read.table("data.txt", header = T, na.strings = T, fill = T, sep = ",", stringsAsFactors = F)
library(data.table)
sapply(bc, class)
unique(bc$V5) # only NA so can be deleted
setDT(bc)
bc <- bc[,1:10, with = F]
bc$Cited <- as.numeric(bc$Cited)
bc$Cited[is.na(bc$Cited)] <- 0
bc$V1 <- as.numeric(bc$V1)
bc$V2 <- as.numeric(bc$V2)
bc$V3 <- as.numeric(bc$V3)
bc$V4 <- as.numeric(bc$V4)
bc$V1[is.na(bc$V1)] <- 0
bc$V2[is.na(bc$V2)] <- 0
bc$V3[is.na(bc$V3)] <- 0
bc$V4[is.na(bc$V4)] <- 0
head(bc, 10)
bc$Cited <- with(bc, Cited + V1 + V2 + V3 + V4)
It's a silly patch, but it does the trick in this particular context.
I have a CSV document with 2 columns, which contain the Commodity Category and the Commodity Name.
Ex:
Sl.No. Commodity Category Commodity Name
1 Stationary Pencil
2 Stationary Pen
3 Stationary Marker
4 Office Utensils Chair
5 Office Utensils Drawer
6 Hardware Monitor
7 Hardware CPU
And I have another CSV file which contains various commodity names.
Ex:
Sl.No. Commodity Name
1 Pancil
2 Pencil-HB 02
3 Pencil-Apsara
4 Pancil-Nataraj
5 Pen-Parker
6 Pen-Reynolds
7 Monitor-X001RL
The output I would like is to standardise and categorise the commodity names and classify them into their respective Commodity Categories, as shown below:
Sl.No. Commodity Name Commodity Category
1 Pencil Stationary
2 Pencil Stationary
3 Pencil Stationary
4 Pancil Stationary
5 Pen Stationary
6 Pen Stationary
7 Monitor Hardware
Step 1) I first have to use NLTK (text mining methods) and clean the data so as to separate "Pencil" from "Pencil-HB 02".
Step 2) After cleaning, I have to use an approximate string matching technique, i.e. agrep(), to match patterns like "Pencil *" or to correct "Pancil" to "Pencil".
Step 3) Once the pattern is corrected, I have to categorise. No idea how.
This is what I have thought of. I started with step 2 and I'm stuck there; I'm not finding an exact method to code this.
Is there any way to get the output as required?
If yes, please suggest a method I can proceed with.
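For steps 1 and 2 specifically, a regular expression plus base R's agrep() can get you reasonably far. Below is a minimal hedged sketch with made-up vectors rather than your real files; the known names, raw names and max.distance value are all assumptions for illustration.
# Minimal sketch of steps 1 and 2 with made-up vectors.
known_names <- c("Pencil", "Pen", "Marker", "Chair", "Drawer", "Monitor", "CPU")
raw_names   <- c("Pancil", "Pencil-HB 02", "Pen-Parker", "Monitor-X001RL")

# step 1: strip everything from the first "-" or space onwards
cleaned <- sub("[- ].*$", "", raw_names)

# step 2: keep exact matches, otherwise let agrep() pick the closest known name
standardise <- function(x) {
  if (x %in% known_names) return(x)
  hit <- agrep(x, known_names, max.distance = 1, value = TRUE)
  if (length(hit) > 0) hit[1] else NA_character_   # NA when nothing is close enough
}
sapply(cleaned, standardise)
Step 3 is then essentially a join of the standardised names against the category table, which is what the stringdist-based answers below do.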
You could use the stringdist package. The correct() function below corrects the Commodity.Name in file2 based on the distances of each item to the different CName values.
Then a join is used to combine the two tables.
I also noticed that there are some misclassifications if I use the default options for stringdistmatrix. You can try changing the weight argument of stringdistmatrix for a better correction result.
> library(dplyr)
> library(stringdist)
>
> file1 <- read.csv("/Users/Randy/Desktop/file1.csv")
> file2 <- read.csv("/Users/Randy/Desktop/file2.csv")
>
> head(file1)
Sl.No. Commodity.Category Commodity.Name
1 1 Stationary Pencil
2 2 Stationary Pen
3 3 Stationary Marker
4 4 Office Utensils Chair
5 5 Office Utensils Drawer
6 6 Hardware Monitor
> head(file2)
Sl.No. Commodity.Name
1 1 Pancil
2 2 Pencil-HB 02
3 3 Pencil-Apsara
4 4 Pancil-Nataraj
5 5 Pen-Parker
6 6 Pen-Reynolds
>
> CName <- levels(file1$Commodity.Name)
> correct <- function(x){
+ factor(sapply(x, function(z) CName[which.min(stringdistmatrix(z, CName, weight=c(1,0.1,1,1)))]), CName)
+ }
>
> correctedfile2 <- file2 %>%
+ transmute(Commodity.Name.Old = Commodity.Name, Commodity.Name = correct(Commodity.Name))
>
> correctedfile2 %>%
+ inner_join(file1[,-1], by="Commodity.Name")
Commodity.Name.Old Commodity.Name Commodity.Category
1 Pancil Pencil Stationary
2 Pencil-HB 02 Pencil Stationary
3 Pencil-Apsara Pencil Stationary
4 Pancil-Nataraj Pencil Stationary
5 Pen-Parker Pen Stationary
6 Pen-Reynolds Pen Stationary
7 Monitor-X001RL Monitor Hardware
If you need the "Others" category, you just need to play with the weights.
I added a row "Diesel" in file2. Then compute the score using stringdist with customized weights (you should try varying the values). If the score is large than 2 (this value is related to how the weights are assigned), it doesn't correct anything.
PS: as we don't know all the possible labels, we have to do as.character to convect factor to character.
PS2: I am also using tolower for case insensitive scoring.
> head(file2)
Sl.No. Commodity.Name
1 1 Diesel
2 2 Pancil
3 3 Pencil-HB 02
4 4 Pencil-Apsara
5 5 Pancil-Nataraj
6 6 Pen-Parker
>
> CName <- levels(file1$Commodity.Name)
> CName.lower <- tolower(CName)
> correct_1 <- function(x){
+ scores = stringdistmatrix(tolower(x), CName.lower, weight=c(1,0.001,1,0.5))
+ if (min(scores)>2) {
+ return(x)
+ } else {
+ return(as.character(CName[which.min(scores)]))
+ }
+ }
> correct <- function(x) {
+ sapply(as.character(x), correct_1)
+ }
>
> correctedfile2 <- file2 %>%
+ transmute(Commodity.Name.Old = Commodity.Name, Commodity.Name = correct(Commodity.Name))
>
> file1$Commodity.Name = as.character(file1$Commodity.Name)
> correctedfile2 %>%
+ left_join(file1[,-1], by="Commodity.Name")
Commodity.Name.Old Commodity.Name Commodity.Category
1 Diesel Diesel <NA>
2 Pancil Pencil Stationary
3 Pencil-HB 02 Pencil Stationary
4 Pencil-Apsara Pencil Stationary
5 Pancil-Nataraj Pencil Stationary
6 Pen-Parker Pen Stationary
7 Pen-Reynolds Pen Stationary
8 Monitor-X001RL Monitor Hardware
There is an 'approximate string matching' function amatch() in {stringdist} (at least in 0.9.4.6) that returns the most probable match from a pre-defined set of words. It has a maxDist parameter that sets the maximum distance allowed for a match, and a nomatch parameter that can be used for the 'other' category. Otherwise, method, weights, etc. can be set similarly to stringdistmatrix().
So, your original problem can be solved like this with a tidyverse-compatible solution:
library(dplyr)
library(stringdist)
# Reading the files
file1 <- readr::read_csv("file1.csv")
file2 <- readr::read_csv("file2.csv")
# Getting the commodity names in a vector
commodities <- file1 %>% distinct(`Commodity Name`) %>% pull()
# Finding the closest string match of the commodities, and joining the file containing the categories
file2 %>%
mutate(`Commodity Name` = commodities[amatch(`Commodity Name`, commodities, maxDist = 5)]) %>%
left_join(file1, by = "Commodity Name")
This will return a data frame that contains the corrected commodity name and the category. If the original commodity name is more than 5 characters away (a simplified reading of string distance) from every possible commodity name, the corrected name will be NA.
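If you would rather see "Others" than NA for those unmatched rows, a hedged continuation of the block above (reusing file1, file2 and commodities from that block, and filling the gaps with dplyr::coalesce()) could be:
# Keep unmatched names as they are and label their category "Others".
file2 %>%
  mutate(matched = commodities[amatch(`Commodity Name`, commodities, maxDist = 5)],
         `Commodity Name` = coalesce(matched, `Commodity Name`)) %>%
  left_join(file1, by = "Commodity Name") %>%
  mutate(`Commodity Category` = coalesce(`Commodity Category`, "Others")) %>%
  select(-matched)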