Removing rows where the first entries are duplicate (R data frames) [duplicate] - r

Hi, I have a small problem with a data frame that has duplicates in a column. I would like to remove the rows where a column contains duplicate values. For example, my data frame looks like this:
Value City Card.Type ID
100 Michigan Silver 001
120 Angeles Gold 002
NA Kansas Gold 002
500 Michigan Silver 001
800 Texas Basic 005
You can see that the ID column contains two duplicates, one for 001 and one for 002. I tried the unique function but couldn't remove those duplicates. I would like to get something like this:
Value City Card.Type ID
100 Michigan Silver 001
120 Angeles Gold 002
800 Texas Basic 005
Thanks for your help.

The use of which should only be done with its "positive" version. The danger in using the construction -which() is that when none of the rows or items match the test, the result of which() is integer(0), and indexing with -integer(0) returns 'nothing', when the correct result is 'everything'. Use:
dat[!duplicated(dat), ]
In this case there were no fully duplicated rows, but the OP clearly expected some to be removed, so presumably only two or three columns were under consideration. This is easy to accommodate: just do the duplication test on those columns:
dat[ !duplicated(dat[ , 2:3] ) , ]

Use the function duplicated.
Something like:
data.subset <- data[!duplicated(data$ID),]
duplicated returns a TRUE/FALSE vector: the second and any later occurrences of a value are marked TRUE, so negating it keeps only the first occurrence of each value.
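A minimal sketch of this on the question's data (the data frame is rebuilt here from the table shown above):

```r
# Rebuild the example data frame from the question
dat <- data.frame(
  Value     = c(100, 120, NA, 500, 800),
  City      = c("Michigan", "Angeles", "Kansas", "Michigan", "Texas"),
  Card.Type = c("Silver", "Gold", "Gold", "Silver", "Basic"),
  ID        = c("001", "002", "002", "001", "005"),
  stringsAsFactors = FALSE
)

# duplicated() marks second and later occurrences of each ID
duplicated(dat$ID)
# FALSE FALSE TRUE TRUE FALSE

# Keep only the first row seen for each ID
data.subset <- dat[!duplicated(dat$ID), ]
data.subset
```

This keeps rows 1, 2, and 5, matching the desired output in the question.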

how to use R to modify the CHAR length of a column(type) to 1

Referring to this question: Finding the length of each string within a column of a data-frame in R, I saw the code data.frame(names=temp$name,chr=apply(temp,2,nchar)[,2]), which uses 'temp' — what does temp mean here?
I'm trying to figure out how to truncate the values in the column 'type' of a data frame 'df1' to a length of 1.
Could someone help me how to accomplish this in R?
df1
date name expenditure type
23MAR2013 KOSH ENTRP 4000 COMPANY
23MAR2013 JOHN DOE 800 INDIVIDUAL
24MAR2013 S KHAN 300 INDIVIDUAL
24MAR2013 JASINT PVT LTD 8000 COMPANY
25MAR2013 KOSH ENTRPRISE 2000 COMPANY
25MAR2013 JOHN S DOE 220 INDIVIDUAL
25MAR2013 S KHAN 300 INDIVIDUAL
26MAR2013 S KHAN 300 INDIVIDUAL
temp here is the name of the object to look in. apply is basically a glorified for-loop over the margins (rows or columns) of an array or data.frame. So in this case it is applying nchar to each column (margin = 2) of the data.frame stored in temp.
What you're looking for instead is substr(df1$type, 1, 1), which extracts the first character of each value in the type column:
df1$type <- substr(df1$type, 1, 1)
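A short sketch of this on a subset of the question's data (only the first three rows are rebuilt here):

```r
df1 <- data.frame(
  date        = c("23MAR2013", "23MAR2013", "24MAR2013"),
  name        = c("KOSH ENTRP", "JOHN DOE", "S KHAN"),
  expenditure = c(4000, 800, 300),
  type        = c("COMPANY", "INDIVIDUAL", "INDIVIDUAL"),
  stringsAsFactors = FALSE
)

# substr() is vectorised: it takes the first character of every value
df1$type <- substr(df1$type, 1, 1)
df1$type
# "C" "I" "I"
```

Because substr operates element-wise, no loop or apply call is needed.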

Make only numeric entries blank

I have a dataframe with UK postcodes in it. Unfortunately some of the postcode data is incorrect — i.e. some entries are purely numeric (all UK postcodes should start with a letter).
I have done some research and found the grepl command that I've used to generate a TRUE/FALSE vector if the entry is only numeric,
Data$NewPostCode <- grepl("^.*[0-9]+[A-Za-z]+.*$|.*[A-Za-z]+[0-9]+.*$",Data$PostCode)
however, what I really want to do is blank out the postcode wherever it starts with a number.
Note, I don't want to remove the rows with an incorrect postcode, as I would lose information from the other variables. I simply want to remove that postcode.
Example data
Area Postcode
Birmingham B1 1AA
Manchester M1 2BB
Bristol BS1 1LM
Southampton 1254
London 1290C
Newcastle N1 3DC
Desired output
Area Postcode
Birmingham B1 1AA
Manchester M1 2BB
Bristol BS1 1LM
Southampton
London
Newcastle N1 3DC
There are a few ways to go between TRUE/FALSE vectors and the kind of task you want, but I prefer ifelse. A simpler way to generate the type of logical vector you're looking for is
grepl("^[0-9]", Data$PostCode)
which will be TRUE whenever PostCode starts with a number, and FALSE otherwise. You may need to adjust the regex if your needs are more complex.
You can then define a new column which is blank whenever the vector is TRUE and the old value whenever the vector is FALSE, as follows:
Data$NewPostCode <- ifelse(grepl("^[0-9]", Data$PostCode), "", Data$PostCode)
(May I suggest using NA instead of blank?)
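A minimal sketch on a few rows of the example data (rebuilt here from the question's table):

```r
Data <- data.frame(
  Area     = c("Birmingham", "Southampton", "Newcastle"),
  PostCode = c("B1 1AA", "1254", "N1 3DC"),
  stringsAsFactors = FALSE
)

# Blank out any postcode that starts with a digit; NA_character_ would
# arguably be a better sentinel than ""
Data$PostCode <- ifelse(grepl("^[0-9]", Data$PostCode), "", Data$PostCode)
Data$PostCode
# "B1 1AA" "" "N1 3DC"
```

Here the result overwrites the PostCode column directly rather than creating NewPostCode; either works.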

Identifying, reviewing, and deduplicating records in R

I'm looking to identify duplicate records in my data set based on multiple columns, review the records, and keep the ones with the most complete data in R. I would like to keep the row(s) associated with each name that have the maximum number of data points populated. In the case of date columns, I would also like to treat invalid dates as missing. My data looks like this:
df<-data.frame(Record=c(1,2,3,4,5),
First=c("Ed","Sue","Ed","Sue","Ed"),
Last=c("Bee","Cord","Bee","Cord","Bee"),
Address=c(123,NA,NA,456,789),
DOB=c("12/6/1995","0056/12/5",NA,"12/5/1956","10/4/1980"))
Record First Last Address DOB
1 Ed Bee 123 12/6/1995
2 Sue Cord 0056/12/5
3 Ed Bee
4 Sue Cord 456 12/5/1956
5 Ed Bee 789 10/4/1980
So in this case I would keep records 1, 4, and 5. There are approximately 85000 records and 130 variables, so if there is a way to do this systematically, I'd appreciate the help. Also, I'm a total R newbie (as if you couldn't tell), so any explanation is also appreciated. Thanks!
#Add a new column to the dataframe containing the number of NA values in each row.
df$nMissing <- apply(df, MARGIN = 1, FUN = function(x) sum(is.na(x)))
#Using ave, find the indices of the rows for each name with min nMissing
#value and use them to filter your data
deduped_df <-
df[which(df$nMissing==ave(df$nMissing,paste(df$First,df$Last),FUN=min)),]
#If you like, remove the nMissing column
df$nMissing <- deduped_df$nMissing <- NULL
deduped_df
Record First Last Address DOB
1 1 Ed Bee 123 12/6/1995
4 4 Sue Cord 456 12/5/1956
5 5 Ed Bee 789 10/4/1980
Edit: Per your comment, if you also want to filter on invalid DOBs, you can start by converting the column to date format, which will automatically treat invalid dates as NA (missing data).
df$DOB<-as.Date(df$DOB,format="%m/%d/%Y")
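Putting the pieces together, here is a sketch of the whole pipeline on the question's data (using rowSums(is.na(...)) as a simpler equivalent of the apply call above):

```r
df <- data.frame(Record = c(1, 2, 3, 4, 5),
                 First = c("Ed", "Sue", "Ed", "Sue", "Ed"),
                 Last = c("Bee", "Cord", "Bee", "Cord", "Bee"),
                 Address = c(123, NA, NA, 456, 789),
                 DOB = c("12/6/1995", "0056/12/5", NA, "12/5/1956", "10/4/1980"),
                 stringsAsFactors = FALSE)

# Invalid dates such as "0056/12/5" fail to parse and become NA
df$DOB <- as.Date(df$DOB, format = "%m/%d/%Y")

# Count NAs per row, then keep the least-incomplete row(s) per name
df$nMissing <- rowSums(is.na(df))
deduped <- df[df$nMissing == ave(df$nMissing, paste(df$First, df$Last), FUN = min), ]
deduped$Record
# 1 4 5
```

Records 1, 4, and 5 survive, matching the expected result in the question.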

Change column in dataframe where

I am trying to change one column in an R data frame wherever another column has specific content (not exact content, but a pattern I match with a regex).
For example:
df:
Name City Age
Peter Fort Wayne 15
John South Bend 20
Christopher Boston 25
Andy Boston 30
Johnathan Los Angeles 35
Now, if I want to change the age of all people whose names begin with John, I would usually select them by saying:
subset(df, grepl("^John", Name))
Which would give me
Name City Age
John South Bend 20
Johnathan Los Angeles 35
However, apparently I cannot change it using
subset(df, grepl("^John", Name))$Age <- 20
Is there an easy way to do this? I'd hate to drop the rows from the dataframe and then re-insert them, which is what I've been doing so far.
Thanks for your help,
Oliver
Try:
df$Age[grepl("^John", df$Name)] <- 20
subset returns a copy of the matching rows, so assigning to its result does not modify the original data frame. Instead, index the column directly as shown above.
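A minimal sketch on a subset of the question's data (only the name and age columns are rebuilt here):

```r
df <- data.frame(
  Name = c("Peter", "John", "Christopher", "Johnathan"),
  Age  = c(15, 20, 25, 35),
  stringsAsFactors = FALSE
)

# Logical indexing on the left-hand side assigns in place:
# every row whose Name starts with "John" gets Age 20
df$Age[grepl("^John", df$Name)] <- 20
df$Age
# 15 20 25 20
```

The same pattern generalises to any regex and any replacement value.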
