Change column in dataframe where - r

I am trying to change one column in an R data frame for the rows where another column matches a pattern (not an exact value, but something I find with a regex).
For example:
df:
Name         City         Age
Peter        Fort Wayne   15
John         South Bend   20
Christopher  Boston       25
Andy         Boston       30
Johnathan    Los Angeles  35
Now, if I want to change the age of all people whose names begin with John, I would usually select them by saying:
subset(df, grepl("^John", Name))
Which would give me
Name       City         Age
John       South Bend   20
Johnathan  Los Angeles  35
However, apparently I cannot change it using
subset(df, grepl("^John", Name))$Age <- 20
Is there an easy way to do this? I'd hate to drop the rows from the dataframe and then re-insert them, which is what I've been doing so far.
Thanks for your help,
Oliver

Try:
df$Age[grepl("^John", df$Name)] <- 20
subset returns a new data.frame (a copy of the matching rows), so assigning into its result cannot change df. Instead, index the column directly with the logical vector, as shown above.
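As a quick check, here is a minimal sketch that rebuilds the example data frame from the question and applies the indexed assignment. The intermediate name is_john is only introduced here for readability.

```r
# Rebuild the example data from the question.
df <- data.frame(
  Name = c("Peter", "John", "Christopher", "Andy", "Johnathan"),
  City = c("Fort Wayne", "South Bend", "Boston", "Boston", "Los Angeles"),
  Age  = c(15, 20, 25, 30, 35),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE where Name starts with "John".
is_john <- grepl("^John", df$Name)

# Assign into the Age column of df itself, only at the matching positions.
df$Age[is_john] <- 20
```

Because the assignment targets df$Age directly, the rows stay in place and nothing has to be dropped and re-inserted.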

Categorizing types of duplicates in R

Let's say I have the following data frame:
df <- data.frame(
  address = c('654 Peachtree St', '8910 River Rd', '8910 River Rd', '8910 River Rd',
              '1234 Main St', '1234 Main St', '567 1st Ave', '567 1st Ave'),
  city = c('Atlanta', 'Eugene', 'Eugene', 'Eugene',
           'Portland', 'Portland', 'Pittsburgh', 'Etna'),
  state = c('GA', 'OR', 'OR', 'OR', 'OR', 'OR', 'PA', 'PA'),
  zip5 = c('30308', '97404', '97404', '97404', '97201', '97201', '15223', '15223'),
  zip9 = c('30308-1929', '97404-3253', '97404-3253', '97404-3253',
           '97201-5717', '97201-5000', '15223-2105', '15223-2105'),
  stringsAsFactors = FALSE)
  address           city        state  zip5   zip9
1 654 Peachtree St  Atlanta     GA     30308  30308-1929
2 8910 River Rd     Eugene      OR     97404  97404-3253
3 8910 River Rd     Eugene      OR     97404  97404-3253
4 8910 River Rd     Eugene      OR     97404  97404-3253
5 1234 Main St      Portland    OR     97201  97201-5717
6 1234 Main St      Portland    OR     97201  97201-5000
7 567 1st Ave       Pittsburgh  PA     15223  15223-2105
8 567 1st Ave       Etna        PA     15223  15223-2105
I'm considering any rows with a matching address and zip5 to be duplicates.
Filtering out or keeping duplicates based on these two columns is simple enough in R. What I'm trying to do is create a new column with a conditional label for each set of duplicates, ending up with something similar to this:
  address        city        state  zip5   zip9        type
1 8910 River Rd  Eugene      OR     97404  97404-3253  Exact Match
2 8910 River Rd  Eugene      OR     97404  97404-3253  Exact Match
3 8910 River Rd  Eugene      OR     97404  97404-3253  Exact Match
4 1234 Main St   Portland    OR     97201  97201-5717  Different Zip9
5 1234 Main St   Portland    OR     97201  97201-5000  Different Zip9
6 567 1st Ave    Pittsburgh  PA     15223  15223-2105  Different City
7 567 1st Ave    Etna        PA     15223  15223-2105  Different City
(I'd also be fine with a True/False column for each type of duplicate.)
I'm assuming the solution will be in some mutate+ifelse+boolean code, but I think it's the comparing within each duplicate subset that has me stuck...
Any advice?
Edit:
I don't believe this is a duplicate of Find duplicated rows (based on 2 columns) in Data Frame in R. I can use that solution to create a T/F column for each type of duplicate/group_by match, but I'm trying to create exclusive categories. How could my conditions also take differences into account? The exact match rows should show true only on the "exact match" column, and false for every other column. If I define my columns simply by feeding different combinations of columns to group_by, the exact match rows will never return a False.
I think the key is grouping by a "reference" variable (here address makes sense) and then counting the number of unique items in each of the other columns within the group. It's not a perfect solution, since case_when returns the label of the first condition that is true: if one address has two different cities AND two different zip codes, you'll only see "Different City". If that matters, you will need additional case_when conditions for the combinations. Still, the count of unique items is a reasonable heuristic if you don't need a perfectly granular solution.
library(dplyr)

df %>%
  group_by(address) %>%
  mutate(
    # conditions are checked in order; the first TRUE one sets the label
    match_type = case_when(
      all(
        length(unique(city)) == 1,
        length(unique(state)) == 1,
        length(unique(zip5)) == 1,
        length(unique(zip9)) == 1) ~ "Exact Match",
      length(unique(city)) > 1 ~ "Different City",
      length(unique(state)) > 1 ~ "Different State",
      length(unique(zip5)) > 1 ~ "Different Zip5",
      length(unique(zip9)) > 1 ~ "Different Zip9"
    ))
Otherwise, you'll have to do iterative grouping (address + other variable) and mutate in a Boolean column as you alluded to.
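For the Boolean-column route, here is a rough base-R sketch (swapping group_by + mutate for ave, so it runs without dplyr). The helper name varies_within and the columns diff_city, diff_zip9 and exact are made up for this illustration, and the data is only the duplicated rows from the question's printed table.

```r
# Duplicated rows only, copied from the question's printed table.
df <- data.frame(
  address = c("8910 River Rd", "8910 River Rd", "1234 Main St", "1234 Main St",
              "567 1st Ave", "567 1st Ave"),
  city    = c("Eugene", "Eugene", "Portland", "Portland", "Pittsburgh", "Etna"),
  zip5    = c("97404", "97404", "97201", "97201", "15223", "15223"),
  zip9    = c("97404-3253", "97404-3253", "97201-5717", "97201-5000",
              "15223-2105", "15223-2105"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for every row whose group contains more than
# one distinct value of x.
varies_within <- function(x, group) {
  n <- ave(seq_along(x), group, FUN = function(i) length(unique(x[i])))
  n > 1
}

# The question defines duplicates by address + zip5, so group on both.
key <- paste(df$address, df$zip5, sep = "|")

df$diff_city <- varies_within(df$city, key)
df$diff_zip9 <- varies_within(df$zip9, key)
df$exact     <- !df$diff_city & !df$diff_zip9
```

Unlike the case_when version, a group that differs in both city and zip9 would show TRUE in both diff columns, so nothing is hidden by the order of the conditions.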
Edit
One more approach I just thought of, if you need a more granular solution: add an ID column (df %>% rowid_to_column("ID"), from the tibble package), then do a full join of the table to itself by address with suffixes (e.g. suffix = c("a","b")), filter out the rows where both IDs are the same, and call distinct (since each comparison is there twice). You can then make Boolean columns with mutate for the pairwise comparisons. It may be too computationally intensive, depending on the size of your dataset, but it should work on the scale of a few thousand rows if you have a reasonable amount of RAM.
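A minimal sketch of that pairwise idea, written with base R's merge instead of the dplyr/tibble calls named above so that it has no package dependencies. The suffixes _a and _b and the columns diff_city and diff_zip9 are illustrative names, and keeping only ID_a < ID_b stands in for both "filter out same IDs" and the de-duplication of mirrored pairs.

```r
# Two duplicate groups taken from the question's data.
df <- data.frame(
  address = c("1234 Main St", "1234 Main St", "567 1st Ave", "567 1st Ave"),
  city    = c("Portland", "Portland", "Pittsburgh", "Etna"),
  zip9    = c("97201-5717", "97201-5000", "15223-2105", "15223-2105"),
  stringsAsFactors = FALSE
)
df$ID <- seq_len(nrow(df))

# Self-join on address; the non-key columns get the suffixes _a and _b.
pairs <- merge(df, df, by = "address", suffixes = c("_a", "_b"))

# Drop self-comparisons and keep each unordered pair only once.
pairs <- pairs[pairs$ID_a < pairs$ID_b, ]

# Pairwise Boolean comparisons.
pairs$diff_city <- pairs$city_a != pairs$city_b
pairs$diff_zip9 <- pairs$zip9_a != pairs$zip9_b
```

The join produces one row per ordered pair within an address, so its size grows quadratically with the number of rows sharing an address, which is where the memory concern comes from.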

Make only numeric entries blank

I have a dataframe with UK postcodes in it. Unfortunately, some of the postcode data is incorrect, i.e. the entries are only numeric (all UK postcodes should start with an alphabetic character).
I have done some research and found the grepl command, which I've used to generate a TRUE/FALSE vector flagging whether an entry is only numeric:
Data$NewPostCode <- grepl("^.*[0-9]+[A-Za-z]+.*$|.*[A-Za-z]+[0-9]+.*$",Data$PostCode)
However, what I really want is to make the postcode blank wherever the entry starts with a number.
Note that I don't want to remove the rows with an incorrect postcode, as I would lose the information in the other variables. I simply want to remove that postcode.
Example data
Area         Postcode
Birmingham   B1 1AA
Manchester   M1 2BB
Bristol      BS1 1LM
Southampton  1254
London       1290C
Newcastle    N1 3DC
Desired output
Area         Postcode
Birmingham   B1 1AA
Manchester   M1 2BB
Bristol      BS1 1LM
Southampton
London
Newcastle    N1 3DC
There are a few ways to go between TRUE/FALSE vectors and the kind of task you want, but I prefer ifelse. A simpler way to generate the type of logical vector you're looking for is
grepl("^[0-9]", Data$PostCode)
which will be TRUE whenever PostCode starts with a number, and FALSE otherwise. You may need to adjust the regex if your needs are more complex.
You can then define a new column which is blank whenever the vector is TRUE and the old value whenever the vector is FALSE, as follows:
Data$NewPostCode <- ifelse(grepl("^[0-9]", Data$PostCode), "", Data$PostCode)
(May I suggest using NA instead of blank?)
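To illustrate the NA suggestion, here is a small sketch on the example data from the question. The name starts_with_digit is introduced only for this example, and NA_character_ is used so the column stays a character vector.

```r
# Example data from the question.
Data <- data.frame(
  Area     = c("Birmingham", "Manchester", "Bristol",
               "Southampton", "London", "Newcastle"),
  PostCode = c("B1 1AA", "M1 2BB", "BS1 1LM", "1254", "1290C", "N1 3DC"),
  stringsAsFactors = FALSE
)

# TRUE where the postcode begins with a digit, i.e. is invalid.
starts_with_digit <- grepl("^[0-9]", Data$PostCode)

# Keep valid postcodes, mark the invalid ones as missing.
Data$NewPostCode <- ifelse(starts_with_digit, NA_character_, Data$PostCode)
```

A missing value is easier to detect later (is.na(Data$NewPostCode)) than an empty string, and all rows and the other columns are kept.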

Removing rows where the first entries are duplicate (R data frames) [duplicate]

Hi, I have a small problem with a dataframe that has duplicates in one column. I would like to remove the rows where that column contains a duplicate. For example, my dataframe looks like this:
Value  City      Card.Type  ID
100    Michigan  Silver     001
120    Angeles   Gold       002
NA     Kansas    Gold       002
500    Michigan  Silver     001
800    Texas     Basic      005
You can see that the ID column has two duplicates, one for 001 and one for 002. I tried the unique function, but it does not remove those duplicates. I would like to get something like this:
Value  City      Card.Type  ID
100    Michigan  Silver     001
120    Angeles   Gold       002
800    Texas     Basic      005
Thanks for your help.
which should only be used in its "positive" form. The danger of the construction -which() is that when none of the rows or items match the test, which() returns integer(0), and indexing with -integer(0) returns 'nothing' when the correct result is 'everything'. Use:
dat[!duplicated(dat), ]
In this case there were no fully duplicated rows, but the OP thought that some should be removed, so evidently only two or three columns were under consideration. This is easy to accommodate: just run the duplication test on those two or three columns:
dat[ !duplicated(dat[ , 2:3] ) , ]
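The -which() pitfall described above is easy to reproduce. The sketch below uses a small made-up data frame with no duplicated IDs; the names dup, dropped_all and kept_all are only for illustration.

```r
# A data frame in which no ID is duplicated.
dat <- data.frame(
  ID    = c("001", "002", "005"),
  Value = c(100, 120, 800),
  stringsAsFactors = FALSE
)

dup <- duplicated(dat$ID)          # all FALSE here

# which(dup) is integer(0); negating it is still integer(0),
# so this selects zero rows instead of all of them.
dropped_all <- dat[-which(dup), ]

# Logical negation behaves correctly: every row is kept.
kept_all <- dat[!dup, ]
```

Here nrow(dropped_all) is 0 while nrow(kept_all) is 3, which is why the logical form is the safer habit.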
Use the function duplicated.
Something like:
data.subset <- data[!duplicated(data$ID),]
duplicated returns a TRUE/FALSE vector. The first occurrence of each value is FALSE; the second and any later occurrences are TRUE.
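Applied to the example data from the question, a minimal sketch looks like this (the result name dat_unique is chosen only for this example, and ID is stored as character to keep the leading zeros):

```r
# Example data from the question.
dat <- data.frame(
  Value     = c(100, 120, NA, 500, 800),
  City      = c("Michigan", "Angeles", "Kansas", "Michigan", "Texas"),
  Card.Type = c("Silver", "Gold", "Gold", "Silver", "Basic"),
  ID        = c("001", "002", "002", "001", "005"),
  stringsAsFactors = FALSE
)

# TRUE marks the second and later occurrences of an ID.
is_dup <- duplicated(dat$ID)

# Keep only the first row for each ID.
dat_unique <- dat[!is_dup, ]
```

This keeps the rows with Value 100, 120 and 800, matching the desired output in the question.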

What's the smart way to aggregate data?

Suppose there is a dataset of different regions, each region a subset of a state, and some outcome variable:
regions <- c("Michigan, Eastern",
"Michigan, Western",
"Minnesota",
"Mississippi, Northern",
"Mississippi, Southern",
"Missouri, Eastern",
"Missouri, Western")
set.seed(123)
outcome <- rpois(7, 12)
testset <- data.frame(regions,outcome)
  regions                outcome
1 Michigan, Eastern      10
2 Michigan, Western      11
3 Minnesota              17
4 Mississippi, Northern  12
5 Mississippi, Southern  12
6 Missouri, Eastern      17
7 Missouri, Western      13
A useful tool would aggregate each region and add, or take the mean or maximum, etc. of outcome by region and generate a new data frame for state. A sum, for example, would output this:
  state        outcome
1 Michigan     21
3 Minnesota    17
4 Mississippi  24
6 Missouri     30
The aggregate() function won't solve this problem. Is there something else in R that is built for this? It seems like grep could be used to generate the new column "states" as part of an application specific program. Seems like this would already be out there somewhere though.
The reason this isn't straightforward is that the structure of your data is not consistent, so nobody could build a general-purpose library function for it.
Your "state, region" column is basically an index column, and you want to index on only part of it. tapply is designed for this kind of grouped computation, but there's no reason to build in a function that does the splitting automatically for this specific scenario. You can do it without creating the column, though:
tapply(outcome,gsub(",.*$","",testset$regions),sum)
The gsub call strips the comma and everything after it, leaving only the state name to serve as the index.
PS: you have a slight typo in your example, your data.frame should be
testset <- data.frame(regions,outcome)
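If you do want the result as a data frame, here is a rough sketch: once a state column exists, aggregate handles the rest. The column name state and the result name by_state are assumptions for this example, and the outcome values are copied from the printed table in the question rather than regenerated with rpois.

```r
regions <- c("Michigan, Eastern", "Michigan, Western", "Minnesota",
             "Mississippi, Northern", "Mississippi, Southern",
             "Missouri, Eastern", "Missouri, Western")
# Values copied from the question's printed table.
outcome <- c(10, 11, 17, 12, 12, 17, 13)
testset <- data.frame(regions, outcome, stringsAsFactors = FALSE)

# Strip the comma and everything after it to get the state name.
testset$state <- sub(",.*$", "", testset$regions)

# Sum the outcome within each state; swap in mean or max as needed.
by_state <- aggregate(outcome ~ state, data = testset, FUN = sum)
```

The tapply call above gives the same sums as a named vector; aggregate returns them as a data frame with one row per state.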
