I have a data frame of postcodes with a regional/metro classification assigned. In some instances, due to the datasource, the same postcode will occur with both a regional and metro classification.
POSTCODE REGON
1 3000 METRO
2 3000 REGIONAL
3 3256 METRO
4 3145 METRO
I am wondering how to remove the duplicate row and replace the region with "SPLIT" in these instances.
I have tried using the code below; however, this reassigns the entire dataset to either "METRO" or "REGIONAL":
test <- within(PC_ACTM, REGION <- ifelse(duplicated("Postcode"), "SPLIT", REGION))
The desired output would be
POSTCODE REGON
1 3000 SPLIT
2 3256 METRO
3 3145 METRO
Example data:
dput(PC_ACTM)
structure(list(POSTCODE = c(3000L, 3000L, 3256L, 3145L), REGON = c("METRO",
"REGIONAL", "METRO", "METRO")), class = "data.frame", row.names = c("1",
"2", "3", "4"))
Based on your title, you're looking for an ifelse() solution; perhaps this will suit?
PC_ACTM <- structure(list(POSTCODE = c(3000L, 3000L, 3256L, 3145L),
REGION = c("METRO", "REGIONAL", "METRO", "METRO")),
class = "data.frame",
row.names = c("1", "2", "3", "4"))
PC_ACTM$REGION <- ifelse(duplicated(PC_ACTM$POSTCODE), "SPLIT", PC_ACTM$REGION)
PC_ACTM[!duplicated(PC_ACTM$POSTCODE, fromLast = TRUE),]
#> POSTCODE REGION
#> 2 3000 SPLIT
#> 3 3256 METRO
#> 4 3145 METRO
Created on 2022-04-07 by the reprex package (v2.0.1)
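As a hedged illustration of why the original attempt overwrote the whole column, and of one possible variation: the sketch below assumes the column is named REGION (as in the answer above) and only marks a postcode as "SPLIT" when its rows actually carry different classifications. The names collapsed, n_regions and result are made up for this example.

```r
PC_ACTM <- data.frame(
  POSTCODE = c(3000L, 3000L, 3256L, 3145L),
  REGION   = c("METRO", "REGIONAL", "METRO", "METRO")
)

# duplicated("Postcode") tests the literal string, a length-1 vector,
# so ifelse() returns a single value that is then recycled down the column.
collapsed <- ifelse(duplicated("Postcode"), "SPLIT", PC_ACTM$REGION)

# Count the distinct classifications within each postcode.
n_regions <- ave(
  seq_along(PC_ACTM$REGION), PC_ACTM$POSTCODE,
  FUN = function(i) length(unique(PC_ACTM$REGION[i]))
)

# Flag genuine conflicts, then keep the first row of each postcode.
PC_ACTM$REGION[n_regions > 1] <- "SPLIT"
result <- PC_ACTM[!duplicated(PC_ACTM$POSTCODE), ]
row.names(result) <- NULL
result
```

Unlike a plain duplicated() check, this leaves a postcode untouched if it merely appears twice with the same classification.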
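Consider ave to compute a sequential count and a total count within each postcode group, use ifelse to replace the value with "SPLIT" on the last row of any group with more than one row, and then subset to the last row of each group. The code below uses the native pipe |> introduced in base R 4.1.0: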
test <- within(
  PC_ACTM, {
    # Index by the rows of PC_ACTM itself; `test` does not exist yet at this point
    PC_SEQ <- ave(seq_along(POSTCODE), POSTCODE, FUN = seq_along)  # position within postcode
    PC_COUNT <- ave(seq_along(POSTCODE), POSTCODE, FUN = length)   # rows per postcode
    REGION <- ifelse(
      (PC_SEQ == PC_COUNT) & (PC_COUNT > 1), "SPLIT", REGION
    )
  }
) |> subset(
  subset = PC_SEQ == PC_COUNT,      # SUBSET ROWS
  select = c(POSTCODE, REGION)      # SELECT COLUMNS
) |> `row.names<-`(NULL)            # RESET ROW NAMES
I have a data set with many observations for X individuals, and some individuals have more than one row. For each row, I have assigned a classification (the variable clinical_significance) that takes three values, in order of priority: definite disease, possible, colonization. I would now like to keep only one row per individual, carrying the "highest" classification found across that individual's rows, i.e. definite disease if present, otherwise possible, otherwise colonization. Any good suggestions on how to do this?
For instance, in the example below, I would like clinical_significance for ID #23 to be 'definite disease', as this outranks 'possible'.
id id_row number_of_samples species_ny clinical_significa…
18 1 2 MAC possible
18 2 2 MAC possible
20 1 2 scrofulaceum possible
20 2 2 scrofulaceum possible
23 1 2 MAC possible
23 2 2 MAC definite disease
Making a reproducible example:
df <- structure(
list(
id = c("18", "18", "20", "20", "23", "23"),
id_row = c("1","2", "1", "2", "1", "2"),
number_of_samples = c("2", "2", "2","2", "2", "2"),
species_ny = c("MAC", "MAC", "scrofulaceum", "scrofulaceum", "MAC", "MAC"),
clinical_significance = c("possible", "possible", "possible", "possible", "possible", "definite disease")
),
row.names = c(NA, -6L), class = c("data.frame")
)
The idea is to turn clinical significance into a factor, which is stored as an integer instead of character (i.e. 1 = definite, 2 = possible, 3 = colonization). Then, for each ID, take the row with lowest number.
library(dplyr)

df_prio <- df |>
  mutate(
    fct_clin_sig = factor(
      clinical_significance,
      levels = c("definite disease", "possible", "colonization")
    )
  ) |>
  group_by(id) |>
  slice_min(fct_clin_sig, with_ties = FALSE)  # with_ties = FALSE keeps exactly one row per id
I fixed it using
df <- df %>%
group_by(id) %>%
mutate(clinical_significance_new = ifelse(any(clinical_significance == "definite disease"), "definite disease", as.character(clinical_significance)))
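For comparison, here is a minimal base R sketch of the same "highest priority per id" idea, with no packages required. It assumes the three priority labels listed in the question; priority, prio_rank, ordered_df and best are names invented for this example.

```r
df <- data.frame(
  id = c("18", "18", "20", "20", "23", "23"),
  clinical_significance = c("possible", "possible", "possible",
                            "possible", "possible", "definite disease")
)

# Lower position in `priority` means higher clinical priority.
priority  <- c("definite disease", "possible", "colonization")
prio_rank <- match(df$clinical_significance, priority)

# Sort so that the best classification comes first within each id,
# then keep only the first row per id.
ordered_df <- df[order(df$id, prio_rank), ]
best <- ordered_df[!duplicated(ordered_df$id), ]
row.names(best) <- NULL
best
```

Any label that is not in priority gets a rank of NA and sorts last within its id, so an unexpected value never outranks a known one.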
I have the following dataframe:
library(dplyr)
library(tidyverse)
library(concordance)
Year <- c(2016,2016,2017,2019,2020,2020,2020,2013,2010,2010)
Pf <- c("HS4","HS4","HS4","HS5","HS5","HS5","HS5","HS4","HS3","HS3")
Code <- c("391890","440929","851660","732399","720839","050510","830241","321590","010210","010210")
Slen <- c("6","6","6","6","6","6","6","6","6","6")
df <- data.frame(Year,Pf,Code,Slen)
The 'Pf' column contains three different values: "HS3", "HS4" and "HS5". I want to perform a vectorized operation and apply the concord() function to the 'Code' column; however, to do that, 'Pf' must be unique, so I first subset the data into data frames where the 'Pf' column is unique:
# Subset data where Pf column is unique
df.H5 <- subset(df, Pf == "HS5")
df.H4 <- subset(df, Pf == "HS4")
df.H3 <- subset(df, Pf == "HS3")
Now I apply the function to each data frame. Here concord() is applied to the 'Code' column and converts the codes to different ones. However, if the destination argument and the values in the 'Pf' column are the same, it does not work: for instance, if Pf = "HS3" (in df) and destination = "HS3", the code does not run. That is why I don't apply it to df.H3.
# Apply function to df.H5
df.H5<- df.H5 %>%
group_by(Pf, Slen) %>%
mutate(
Code2 = concord(Code, origin = unique(Pf), dest.digit = unique(Slen), destination = "HS3", all = FALSE)
) %>%
ungroup()
# Apply function to df.H4
df.H4<- df.H4 %>%
group_by(Pf, Slen) %>%
mutate(
Code2 = concord(Code, origin = unique(Pf), dest.digit = unique(Slen), destination = "HS3", all = FALSE)
) %>%
ungroup()
# add the column to df.H3 so that the three data frames can be merged
df.H3$Code2 <- df.H3$Code
#merge
df2 <- rbind(df.H4, df.H5, df.H3)
My goal is to automate this process. For instance, with destination = "HS3", the code should run on the whole data set without pre-subsetting; where the destination argument and the value in 'Pf' match, it should skip the conversion and simply copy the values from 'Code' into the generated 'Code2' column.
You could put the logic in a function and use it in a by approach, which splits the data and applies the function to each piece. In the function you can handle the case where Pf == 'HS3', which should not be processed. Finally, unsplit.
cf <- \(x) {
Code2 <- if (!any(x$Pf == 'HS3')) {
concordance::concord(x$Code, x$Pf[1], x$Slen[1],
destination="HS3", all=FALSE)
} else {
x$Code
}
cbind(x, Code2)
}
by(df, df$Pf, cf) |>
unsplit(df$Pf)
# Year Pf Code Slen Code2
# 1 2016 HS4 391890 6 391890
# 2 2016 HS4 440929 6 440929
# 3 2017 HS4 851660 6 851660
# 4 2019 HS5 732399 6 732399
# 5 2020 HS5 720839 6 720839
# 6 2020 HS5 050510 6 050510
# 7 2020 HS5 830241 6 830241
# 8 2013 HS4 321590 6 321590
# 9 2010 HS3 010210 6 010210
# 10 2010 HS3 010210 6 010210
Data:
df <- structure(list(Year = c(2016, 2016, 2017, 2019, 2020, 2020, 2020,
2013, 2010, 2010), Pf = c("HS4", "HS4", "HS4", "HS5", "HS5",
"HS5", "HS5", "HS4", "HS3", "HS3"), Code = c("391890", "440929",
"851660", "732399", "720839", "050510", "830241", "321590", "010210",
"010210"), Slen = c("6", "6", "6", "6", "6", "6", "6", "6", "6",
"6")), class = "data.frame", row.names = c(NA, -10L))
I'm having some trouble when I try to merge two data frames. Here is an example:
Number <- c("1", "2", "3")
Letter <- factor(c("a", "b", "c"))
map <- data.frame(Number, Letter, row.names = c("Belgium", "Italy", "Senegal"))
This is my first data frame called "map", it looks like this:
Number Letter
Belgium 1 a
Italy 2 b
Senegal 3 c
And if I try to select by row and column I don't have any problem:
map["Belgium", "Number"]
[1] "1"
Here I have my second data frame called "calendar":
Month <- c("January", "February", "March")
calendar <- data.frame(Month, row.names = c("Belgium", "Italy", "Senegal"))
It looks like this:
Month
Belgium January
Italy February
Senegal March
The problem comes when I try to merge both data frames:
map.amp = merge(map, calendar, by = 0)
Row.names Number Letter Month
1 Belgium 1 a January
2 Italy 2 b February
3 Senegal 3 c March
Now, when I try to select a cell using rows and columns, the outcome is always NA
map.amp["Italy", "Month"]
[1] NA
map.amp["Belgium", "Number"]
[1] NA
How can I merge both data frames so I can keep using that kind of select function?
You have to re-set the row names:
row.names(map.amp) <- map.amp$Row.names
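Put together with the data from the question, a minimal base R sketch of this fix looks as follows. Dropping the helper column afterwards is optional and is an addition of this example, not part of the answer above.

```r
map <- data.frame(
  Number = c("1", "2", "3"),
  Letter = factor(c("a", "b", "c")),
  row.names = c("Belgium", "Italy", "Senegal")
)
calendar <- data.frame(
  Month = c("January", "February", "March"),
  row.names = c("Belgium", "Italy", "Senegal")
)

# merge(by = 0) joins on row names but moves them into a Row.names column
# and replaces the actual row names with 1, 2, 3.
map.amp <- merge(map, calendar, by = 0)

# Restore the row names, then drop the now redundant helper column.
row.names(map.amp) <- map.amp$Row.names
map.amp$Row.names <- NULL

map.amp["Italy", "Month"]
map.amp["Belgium", "Number"]
```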
If you want to keep using those row names you have to set the Row.names column back to row names. tibble::column_to_rownames is a nice option for this:
map.amp <- merge(map, calendar, by = 0) %>% tibble::column_to_rownames(var = "Row.names")
map.amp[map.amp$Row.names =='Italy', 'Month']
This works because the row names are now also stored in the Row.names column.
You could use the answer in the comment by @thelatemail. Or use
subset(map.amp, Row.names =='Italy')[[ 'Month']] # first get the matching rows, then narrow to the named column
or
subset(map.amp, Row.names =='Italy', 'Month') # third argument is for column selection
I have a data frame with the approximate structure:
C1 C2 C3
1 c("XXX", "Y3") "XXX" "Y31"
2 c("SFM", "DD31", "DSDW") "SFF" "DD31"
The column C1 is a list. It was a string which I split into separate words. The other 2 columns are character.
I need to match C2 and C3 against C1 so that, when there is a match (and there is always a match), the matching value in C1 is replaced with another value. For example:
The first row has 2 matches, because a fuzzy match also counts as a match:
C1~C2: replace "XXX" in C1 with the modified value from C2, "XXX[TAG]"
C1~C3: replace "Y3" in C1 with the modified value from C3, "Y31[TAG]"
In general I understand how to do that: with a for loop, the match function and regex, but I don't know enough to combine everything. Thank you in advance!
EDITED
What I have:
x <- structure(list(Description = list(c("2012", "Deere", "544K",
"Wheel", "Loader,"), c("Caterpillar","Model", "988", "Year", "1972")),
Manufacturer = c("john deere", "caterpillar"),
Model = c("544k", "988")), .Names = c("Description", "Manufacturer", "Model"), row.names = 4:5, class = "data.frame")
#> Description Manufacturer Model
#> 4 2012, Deere, 544K, Wheel, Loader, john deere 544k
#> 5 Caterpillar, Model, 988, Year, 1972 caterpillar 988
What I want to have:
x.new <- structure(list(Description = list(c("2012", "john deere[Manufacturer]", "544k[Model]",
"Wheel", "Loader,"), c("caterpillar[Manufacturer]","Model", "988[Model]", "Year", "1972")),
Manufacturer = c("john deere", "caterpillar"),
Model = c("544k", "988")), .Names = c("Description", "Manufacturer", "Model"), row.names = 4:5, class = "data.frame")
#> Description Manufacturer Model
#> 4 2012, john deere[Manufacturer], 544k[Model], Wheel, Loader, john deere 544k
#> 5 caterpillar[Manufacturer], Model, 988[Model], Year, 1972 caterpillar 988
With list columns, you'll need a lot of lapply and its multivariate equivalent, Map, which allow you to iterate over the list column and return a list which can be reassigned as a column. For example,
df <- structure(list(C1 = list(c("XXX", "Y3"), c("SFM", "DD31", "DSDW")),
C2 = c("XXX", "SFF"),
C3 = c("Y31", "DD31")),
.Names = c("C1", "C2", "C3"), row.names = c(NA, -2L), class = "data.frame")
df$C1_new <- Map(function(c1, c2, c3){
sapply(c1, function(x){
mtch <- grepl(x, c(c2, c3));
if (any(mtch)) {paste0(c(c2, c3)[mtch], '[', names(df)[-1][mtch], ']')} else {x}
})},
df$C1, df$C2, df$C3)
df
#> C1 C2 C3 C1_new
#> 1 XXX, Y3 XXX Y31 XXX[C2], Y31[C3]
#> 2 SFM, DD31, DSDW SFF DD31 SFM, DD31[C3], DSDW
There are many other ways to set this up, including using packages like purrr and stringr that make the syntax simpler and more uniform. Vary as you like.
To apply to the second dataset listed, it works with some slight edits:
x <- structure(list(Description = list(c("2012", "Deere", "544K", "Wheel", "Loader,"),
c("Caterpillar","Model", "988", "Year", "1972")),
Manufacturer = c("john deere", "caterpillar"),
Model = c("544k", "988")),
.Names = c("Description", "Manufacturer", "Model"), row.names = 4:5, class = "data.frame")
x$Description <- Map(function(desc, mfr, mdl){
sapply(desc, function(wrd){
mtch <- grepl(wrd, c(mfr, mdl), ignore.case = TRUE);
if (any(mtch)) {paste0(c(mfr, mdl)[mtch], '[', names(x)[-1][mtch], ']')} else {wrd}
})},
x$Description, x$Manufacturer, x$Model)
x
#> Description Manufacturer Model
#> 4 2012, john deere[Manufacturer], 544k[Model], Wheel, Loader, john deere 544k
#> 5 caterpillar[Manufacturer], Model, 988[Model], Year, 1972 caterpillar 988
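One thing to be aware of in the code above is that grepl() treats each word as a regular expression, so a word containing characters such as ( or [ could misbehave. The sketch below is one possible variation that matches literally and case-insensitively instead; tag_words is a hypothetical helper name, and taking only the first matching field is an assumption made for this example.

```r
# Replace each word that occurs inside one of the named fields with
# "<field value>[<field name>]"; other words are returned unchanged.
tag_words <- function(words, fields) {
  vapply(words, function(w) {
    # fixed = TRUE matches the word literally rather than as a regex
    hit <- grepl(tolower(w), tolower(fields), fixed = TRUE)
    if (any(hit)) {
      paste0(fields[hit][1], "[", names(fields)[hit][1], "]")
    } else {
      w
    }
  }, character(1), USE.NAMES = FALSE)
}

x <- data.frame(
  Manufacturer = c("john deere", "caterpillar"),
  Model = c("544k", "988")
)
x$Description <- list(
  c("2012", "Deere", "544K", "Wheel", "Loader,"),
  c("Caterpillar", "Model", "988", "Year", "1972")
)

# Apply the helper row by row over the list column.
x$Description <- Map(
  function(desc, mfr, mdl) tag_words(desc, c(Manufacturer = mfr, Model = mdl)),
  x$Description, x$Manufacturer, x$Model
)
x$Description
```

Note that an empty string would match every field under literal matching, so empty words should be filtered out beforehand if they can occur.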
I have a list that's 1314 elements long. Each element is a data frame consisting of two rows and four columns.
Game.ID Team Points Victory
1 201210300CLE CLE 94 0
2 201210300CLE WAS 84 0
I would like to use the lapply function to compare points for each team in each game, and change Victory to 1 for the winning team.
I'm trying to use this function:
test_vic <- lapply(all_games, function(x) {if (x[1,3] > x[2,3]) {x[1,4] = 1}})
But the result it produces is a list 1314 elements long with just the Game ID and either a 1 or a null, a la:
$`201306200MIA`
[1] 1
$`201306160SAS`
NULL
How can I fix my code so that each data frame maintains its shape? (I'm guessing that solving the NULL part involves if-else, but I need to figure out the right syntax.)
Thanks.
Try
lapply(all_games, function(x) {x$Victory[which.max(x$Points)] <- 1; x})
Or another option would be to convert the list to data.table by using rbindlist and then do the conversion
library(data.table)
rbindlist(all_games)[,Victory:= +(Points==max(Points)) ,Game.ID][]
data
all_games <- list(structure(list(Game.ID = c("201210300CLE",
"201210300CLE"
), Team = c("CLE", "WAS"), Points = c(94L, 84L), Victory = c(0L,
0L)), .Names = c("Game.ID", "Team", "Points", "Victory"),
class = "data.frame", row.names = c("1",
"2")), structure(list(Game.ID = c("201210300CME", "201210300CME"
), Team = c("CLE", "WAS"), Points = c(90, 92), Victory = c(0L,
0L)), .Names = c("Game.ID", "Team", "Points", "Victory"),
row.names = c("1", "2"), class = "data.frame"))
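To spell out why the original attempt lost the data frames: a function returns the value of its last expression, which there was the if statement, yielding 1 when the condition held and NULL otherwise. A minimal sketch of the fix is to modify the data frame and then return it explicitly. The list names g1 and g2 and the variable fixed are made up for this example.

```r
all_games <- list(
  g1 = data.frame(Game.ID = "201210300CLE", Team = c("CLE", "WAS"),
                  Points = c(94L, 84L), Victory = 0L),
  g2 = data.frame(Game.ID = "201210300CME", Team = c("CLE", "WAS"),
                  Points = c(90L, 92L), Victory = 0L)
)

fixed <- lapply(all_games, function(x) {
  # 1 for the row(s) with the highest score, 0 otherwise
  x$Victory <- as.integer(x$Points == max(x$Points))
  x  # return the whole data frame, not the result of the assignment
})
fixed
```

In a tied game this version marks both teams with 1, whereas which.max() would pick only the first row; choose whichever behaviour fits the data.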
You could try dplyr:
library(dplyr)
all_games %>%
bind_rows() %>%
group_by(Game.ID) %>%
mutate(Victory = row_number(Points)-1)
Which gives:
#Source: local data frame [4 x 4]
#Groups: Game.ID
#
# Game.ID Team Points Victory
#1 201210300CLE CLE 94 1
#2 201210300CLE WAS 84 0
#3 201210300CME CLE 90 0
#4 201210300CME WAS 92 1