R: How to remove duplicates with some conditions in a complex data frame?

I have a dataset for a factory that produces Gold and Silver products (pens). We check quality by assigning employees to test the products from every machine in the factory. A data sample is given below.
Every machine is in a specific Room/Section/Building, and two columns hold the IDs of the employees testing Gold and Silver pens.
The issue is that the same employee can appear more than once for the same machine. I would like to remove these duplicates and group the IDs that are not duplicates.
Sample:
Bld.No <- c(1,1,1,1,1,1,2,2,2,2)
Section <- c("A","A","A","A","B","B","C","C","D","D")
Room.No <- c(100,100,100,100,200,200,300,300,400,400)
Gold <- c(8,6,4,0,6,0,7,2,2,1)
Silver <- c(1,0,0,1,2,3,4,0,4,0)
Total <- c(9,6,4,1,8,3,11,2,6,1)
Emp.Gold.ID <- c("A11, A09, B22, E12, A04, C09, D33, A01", "A11, A09, B22, E12, A04, A01", "A09, 822, E12, A04", NA, "A71, A09, B12, E32, A04, C19", NA, "B22, E12, A04, C09, D33, A01, M11", "E12, Z09", "C09, D33", "D18")
Emp.Silver.ID <- c("A17", NA, NA, "D33", "B22, E12", "A09, B12, E32", "A44, C02, D03, A71", NA, "A12, A01, M11, D18", NA)
df <- data.frame(Bld.No, Section, Room.No, Gold, Silver, Total, Emp.Gold.ID, Emp.Silver.ID)
Note: if an employee ID already appears in a previous record, in either the Gold or the Silver column, it should be removed, so that each ID appears only once. For example, the last record in the sample (2, D, 400, 1, 0, 1, D18, NA) is dropped from the output table because D18 already appears in the previous record, even though there it is in the Silver column.
Sample Data and Output:

To do this, I would use separate_rows to put every ID on its own row, so that duplicates can be removed later with distinct.
After removing the duplicate IDs, I would rebuild comma-separated strings of the IDs for Gold and Silver. You can summarise the Gold and Silver totals either before or after this step.
Note that to get the same results as in your sample data and output, I changed 822 to B22.
Please let me know if this is what you had in mind.
library(dplyr)
library(tidyr)
df$Emp.Gold.ID <- as.character(df$Emp.Gold.ID)
df$Emp.Silver.ID <- as.character(df$Emp.Silver.ID)
df %>%
  separate_rows(Emp.Gold.ID) %>%
  separate_rows(Emp.Silver.ID) %>%
  pivot_longer(cols = starts_with("Emp."), names_to = "ID", values_drop_na = TRUE) %>%
  group_by(Bld.No, Section, Room.No) %>%
  distinct(value, .keep_all = TRUE) %>%
  group_by(Bld.No, Section, Room.No, ID) %>%
  summarise(NewID = toString(value)) %>%
  pivot_wider(names_from = ID, values_from = NewID) %>%
  mutate(Gold = length(unlist(strsplit(Emp.Gold.ID, ", "))),
         Silver = length(unlist(strsplit(Emp.Silver.ID, ", "))),
         Total = Gold + Silver)
# A tibble: 4 x 8
# Groups: Bld.No, Section, Room.No [8]
Bld.No Section Room.No Emp.Gold.ID Emp.Silver.ID Gold Silver Total
<dbl> <fct> <dbl> <chr> <chr> <int> <int> <int>
1 1 A 100 A11, A09, B22, E12, A04, C09, D33, A01 A17 8 1 9
2 1 B 200 A71, A09, B12, E32, A04, C19 B22, E12 6 2 8
3 2 C 300 B22, E12, A04, C09, D33, A01, M11, Z09 A44, C02, D03, A71 8 4 12
4 2 D 400 C09, D33 A12, A01, M11, D18 2 4 6
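The note in the question asks for an ID to be dropped whenever it has appeared in any earlier record, in either column. A minimal base R sketch of that cross-record rule is given below; the toy data frame `recs` and the helper `split_ids` are hypothetical names made up for illustration, and this is only one possible reading of the requirement, not the answer's method.

```r
# Toy data (hypothetical): one row per record, IDs as comma-separated strings
recs <- data.frame(
  room   = c(100, 100, 200),
  gold   = c("A11, A09", "A09, B22", NA),
  silver = c("A17", NA, "B22, C01"),
  stringsAsFactors = FALSE
)

# Split one cell into a vector of IDs (NA becomes an empty vector)
split_ids <- function(x) {
  if (is.na(x)) character(0) else trimws(strsplit(x, ",")[[1]])
}

seen <- character(0)  # every ID kept so far, across rows and columns
for (i in seq_len(nrow(recs))) {
  for (col in c("gold", "silver")) {
    ids  <- split_ids(recs[i, col])
    # keep an ID only if it is new overall and not repeated within the cell
    keep <- ids[!ids %in% seen & !duplicated(ids)]
    seen <- c(seen, keep)
    recs[i, col] <- if (length(keep)) toString(keep) else NA_character_
  }
}
recs
```

Rows whose two ID columns both end up as NA could then be dropped with something like `recs[!(is.na(recs$gold) & is.na(recs$silver)), ]`.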

Related

How to pivot_wider the n unique values of variable A grouped_by variable B?

I am trying to pivot_wider() column X of a data frame, which contains various people's names. Within each group of another variable Y there are always 2 of these names. I would like R to take the 2 unique X values within each unique value of Y and put them in 2 new columns, ex_X_Name_1 and ex_X_Name_2.
My data frame looks like this:
df <- data.frame(Student = rep(c(17383, 16487, 17646, 2648, 3785), each = 2),
Referee = c("Paul Severe", "Cathy Nice", "Jean Exigeant", "Hilda Ehrlich", "John Rates",
"Eva Luates", "Fred Notebien", "Aldous Grading", "Hans Streng", "Anna Filaktic"),
Rating = format(round(x = sqrt(sample(15:95, 10, replace = TRUE)), digits = 3), nsmall = 3)
)
df
I would like to transform the Referee column into 2 new columns, Referee_1 and Referee_2, holding the 2 unique referees assigned to each student, and end up with this result:
even_row_df <- as.logical(seq_len(length(df$Referee)) %% 2)
df_wanted <- data.frame(
Student = unique(df$Student),
Referee_1 = df$Referee[even_row_df],
Rating_Ref_1 = df$Rating[even_row_df],
Referee_2 = df$Referee[!even_row_df],
Rating_Ref_2 = df$Rating[!even_row_df]
)
df_wanted
I guess I could achieve this by subsetting the unique student/referee combinations and joining them, but is there a way to handle this in a single call to pivot_wider?
You should create a row id per group first:
library(dplyr)
library(tidyr)
df %>%
  group_by(Student) %>%
  mutate(row_n = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = "row_n", values_from = c("Referee", "Rating"))
# A tibble: 5 × 5
Student Referee_1 Referee_2 Rating_1 Rating_2
<dbl> <chr> <chr> <chr> <chr>
1 17383 Paul Severe Cathy Nice 9.165 7.810
2 16487 Jean Exigeant Hilda Ehrlich 5.196 6.557
3 17646 John Rates Eva Luates 7.211 5.568
4 2648 Fred Notebien Aldous Grading 4.000 8.124
5 3785 Hans Streng Anna Filaktic 7.937 6.325
Using data.table:
library(data.table)
setDT(df)
merge(df[, .SD[1], Student], df[, .SD[2], Student], by = "Student", suffixes = c("_1", "_2"))
# Student Referee_1 Rating_1 Referee_2 Rating_2
# 1: 2648 Fred Notebien 6.708 Aldous Grading 9.747
# 2: 3785 Hans Streng 6.245 Anna Filaktic 8.775
# 3: 16487 Jean Exigeant 7.681 Hilda Ehrlich 4.359
# 4: 17383 Paul Severe 4.583 Cathy Nice 7.616
# 5: 17646 John Rates 6.708 Eva Luates 8.246
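Both answers above depend on a counter that numbers the rows within each student. For comparison, the same idea can be sketched in base R with `ave()` and `reshape()`; the small data frame `toy` below is made up for illustration and is not the question's data.

```r
# Hypothetical toy data: two referees per student
toy <- data.frame(
  Student = rep(c(1, 2), each = 2),
  Referee = c("a", "b", "c", "d"),
  Rating  = c(1.5, 2.5, 3.5, 4.5),
  stringsAsFactors = FALSE
)

# Number the rows within each student: 1, 2, 1, 2
toy$row_n <- ave(seq_along(toy$Student), toy$Student, FUN = seq_along)

# Spread every non-id column by that counter; sep controls the suffix style
wide <- reshape(toy, idvar = "Student", timevar = "row_n",
                direction = "wide", sep = "_")
wide
```

The counter plays the same role as `row_number()` in the dplyr answer: without it there is no column that distinguishes the first referee from the second.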

dplyr: how to pivot_wider with multiple columns while creating new covariates at the same time?

I have a proteomics dataset that currently has ~60 columns (patients, plus information such as protein names) and ~1800 rows (the specific proteins).
I need to reshape it so that each row corresponds to a patient and each column to a protein. I can do (very) simple conversions, but this example has many columns and also requires some data management, as new covariates need to be created/extracted from the raw proteomics output below. I simply do not know how to proceed and have not found a solution in the many available walk-throughs on converting large datasets like this.
I prefer dplyr-based input, hints or solutions.
The raw output from the proteomic-software looks something like this:
> head(Heat_BT)
# A tibble: 11 x 6
protein gene Intensity_10 Intensity_11 Intensity_MB_1 Intensity_Ref1
<chr> <chr> <chr> <chr> <chr> <chr>
1 NA NA Bruschi Bruschi Reichl Reichl
2 NA NA Ctrl Ctrl Tumor Ctrl
3 NA NA Hydro Hydro Malignant Hydro
4 NA NA Ctrl Ctrl MB Ctrl
5 von Willebrand factor VWF 0.674627721 0.255166769 0.970489979 0.215972215
6 Sex hormone-binding globulin SHBG 0.516914487 0.476843655 0.88173753 0.306484252
7 Glyceraldehyde-3-phosphate dehydrogenase GAPDH 0.622163594 0.231107563 0.71856463 0.204625234
8 Nestin NES 0.868476391 0.547319174 0.832109928 0.440162212
9 Heat shock 70 kDa protein 13 HSPA13 0.484973907 0.435322136 0.539334834 0.28678757
10 Isocitrate dehydrogenase [NADP], mitochondrial IDH2 1.017596364 0.107395157 0.710225344 0.251976997
11 Mannan-binding lectin serine protease 1 MASP1 0.491321206 0.434995681 0.812500775 0.403583705
Expected output:
id lab malig diag VWF SHBG GAPDH NES HSPA13 IDH2 MASP1
1 Intensity_10 Bruschi Ctrl Hydro 0.6746277 0.5169145 0.6221636 0.8684764 0.4849739 1.0175964 0.4913212
2 Intensity_11 Bruschi Ctrl Hydro 0.2551668 0.4768437 0.2311076 0.5473192 0.4353221 0.1073952 0.4349957
3 Intensity_MB_1 Reichl Tumor Malignant 0.9704900 0.8817375 0.7185646 0.8321099 0.5393348 0.7102253 0.8125008
4 Intensity_Ref1 Reichl Ctrl Hydro 0.2159722 0.3064843 0.2046252 0.4401622 0.2867876 0.2519770 0.4035837
The proteomic-software automatically prints the first four rows as categories, which each patient belongs to.
Based on these first four rows:
Four new covariates must be added to the wide format: (1) Heat_BT$id, the study name of each patient; (2) Heat_BT$lab, the lab that produced the data; (3) Heat_BT$malig, whether the patient is a control case or a tumor case; and (4) Heat_BT$diag, the underlying diagnosis.
Data
Heat_BT <- structure(list(protein = c(NA, NA, NA, NA, "von Willebrand factor",
"Sex hormone-binding globulin", "Glyceraldehyde-3-phosphate dehydrogenase",
"Nestin", "Heat shock 70 kDa protein 13", "Isocitrate dehydrogenase [NADP], mitochondrial",
"Mannan-binding lectin serine protease 1"), gene = c(NA, NA,
NA, NA, "VWF", "SHBG", "GAPDH", "NES", "HSPA13", "IDH2", "MASP1"
), Intensity_10 = c("Bruschi", "Ctrl", "Hydro", "Ctrl", "0.674627721",
"0.516914487", "0.622163594", "0.868476391", "0.484973907", "1.017596364",
"0.491321206"), Intensity_11 = c("Bruschi", "Ctrl", "Hydro",
"Ctrl", "0.255166769", "0.476843655", "0.231107563", "0.547319174",
"0.435322136", "0.107395157", "0.434995681"), Intensity_MB_1 = c("Reichl",
"Tumor", "Malignant", "MB", "0.970489979", "0.88173753", "0.71856463",
"0.832109928", "0.539334834", "0.710225344", "0.812500775"),
Intensity_Ref1 = c("Reichl", "Ctrl", "Hydro", "Ctrl", "0.215972215",
"0.306484252", "0.204625234", "0.440162212", "0.28678757",
"0.251976997", "0.403583705")), row.names = c(NA, -11L), class = c("tbl_df",
"tbl", "data.frame"))
Here is a dplyr solution for you. It takes two steps, because the intensity variables need to be collected first.
library(dplyr)
library(tidyr)
Heat_BT <- Heat_BT %>% na.exclude()
Heat_BT[, -1] %>%
  pivot_longer(cols = Intensity_10:Intensity_Ref1, names_to = "id") %>%
  pivot_wider(names_from = gene) %>%
  mutate(across(.cols = -"id", as.numeric))
Which gives the following output
# A tibble: 4 x 8
id VWF SHBG GAPDH NES HSPA13 IDH2 MASP1
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Intensity_10 0.674627721 0.516914487 0.622163594 0.868476391 0.484973907 1.017596364 0.491321206
2 Intensity_11 0.255166769 0.476843655 0.231107563 0.547319174 0.435322136 0.107395157 0.434995681
3 Intensity_MB_1 0.970489979 0.88173753 0.71856463 0.832109928 0.539334834 0.710225344 0.812500775
4 Intensity_Ref1 0.215972215 0.306484252 0.204625234 0.440162212 0.28678757 0.251976997 0.403583705
I had trouble seeing the connection between the data and the variables you wanted to add, so I assumed that once you were able to pivot your data correctly, you would be able to fill in the rest.
I'll happily revise my answer if you can explain more plainly how these variables are related.
Best
EDIT: Notice that I removed the first four rows from the data, as I didn't immediately see the connection to the variables that you wanted added.
EDIT 2: I assumed that the first 3 rows are the covariates that you want to add, so that rows 1 to 3 are lab, malig and diag respectively.
# Extract the relevant information
# from the data.
id_cols <- bind_cols(
var = c("lab", "malig", "diag"),
Heat_BT[1:3,-c(1,2)]
) %>% group_by(var) %>% pivot_longer(
cols = Intensity_10:Intensity_Ref1, names_to = "id"
) %>% pivot_wider(
names_from = var,
)
# Remove these identifiers;
Heat_BT <- Heat_BT %>% na.exclude()
# Pivot the table;
pivoted_table <- Heat_BT[,-1] %>% pivot_longer(
cols = Intensity_10:Intensity_Ref1,names_to = "id"
) %>% pivot_wider(
names_from = gene,
) %>% mutate(
across(.cols = -"id", as.numeric)
)
# Join with the ID columns
left_join(
id_cols,
pivoted_table
)
Which gives the output,
# A tibble: 4 x 11
id lab malig diag VWF SHBG GAPDH NES HSPA13 IDH2 MASP1
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Intensity_10 Bruschi Ctrl Hydro 0.674627721 0.516914487 0.622163594 0.868476391 0.484973907 1.017596364 0.491321206
2 Intensity_11 Bruschi Ctrl Hydro 0.255166769 0.476843655 0.231107563 0.547319174 0.435322136 0.107395157 0.434995681
3 Intensity_MB_1 Reichl Tumor Malignant 0.970489979 0.88173753 0.71856463 0.832109928 0.539334834 0.710225344 0.812500775
4 Intensity_Ref1 Reichl Ctrl Hydro 0.215972215 0.306484252 0.204625234 0.440162212 0.28678757 0.251976997 0.403583705
This will work with the data that you have, regardless of size. Clearly, you can make the approach more bullet-proof by replacing, for example, cols = Intensity_10:Intensity_Ref1 with contains("intensity").
Edit 3
You have many more variables than are shown here, and with the column range hard-coded they are not included in the pivot.
So we can take a more robust approach, assuming all the variables not shown are similar to the ones provided, by changing the cols argument accordingly.
# Extract the relevant information
# from the data.
id_cols <- bind_cols(
var = c("lab", "malig", "diag"),
Heat_BT[1:3,-c(1,2)]
) %>% group_by(var) %>% pivot_longer(
cols = -"var", names_to = "id"
) %>% pivot_wider(
names_from = var,
)
# Remove these identifiers;
Heat_BT <- Heat_BT[-(1:4),]
# Pivot the table;
pivoted_table <- Heat_BT[,-1] %>% pivot_longer(
cols = -"gene",
names_to = "id"
) %>% pivot_wider(
names_from = gene,
) %>% mutate(
across(.cols = -"id", as.numeric)
)
# Join with the ID columns
left_join(
id_cols,
pivoted_table
)
Which gives the same output as above.
You could do:
Heat_BT$gene[1:3] <- c('lab', 'malig', 'diag')
data.table::transpose(Heat_BT[,-1],keep.names = 'gene',make.names = TRUE)
gene lab malig diag NA VWF SHBG GAPDH NES HSPA13 IDH2 MASP1
1 Intensity_10 Bruschi Ctrl Hydro Ctrl 0.674627721 0.516914487 0.622163594 0.868476391 0.484973907 1.017596364 0.491321206
2 Intensity_11 Bruschi Ctrl Hydro Ctrl 0.255166769 0.476843655 0.231107563 0.547319174 0.435322136 0.107395157 0.434995681
3 Intensity_MB_1 Reichl Tumor Malignant MB 0.970489979 0.88173753 0.71856463 0.832109928 0.539334834 0.710225344 0.812500775
4 Intensity_Ref1 Reichl Ctrl Hydro Ctrl 0.215972215 0.306484252 0.204625234 0.440162212 0.28678757 0.251976997 0.403583705
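The answers above all split the table into "header" rows (where gene is NA) and measurement rows, then turn the sample columns into rows. A minimal base R sketch of that split-and-transpose idea follows; the data frame `raw` is a cut-down, hypothetical stand-in for Heat_BT, and the assumption that the first three header rows are lab, malig and diag is taken from EDIT 2, not confirmed by the question.

```r
# Hypothetical cut-down input: 3 header rows, then 2 genes, for 2 samples
raw <- data.frame(
  gene         = c(NA, NA, NA, "VWF", "SHBG"),
  Intensity_10 = c("Bruschi", "Ctrl", "Hydro", "0.67", "0.52"),
  Intensity_11 = c("Reichl", "Tumor", "Malignant", "0.26", "0.48"),
  stringsAsFactors = FALSE
)

meta_rows   <- is.na(raw$gene)                              # header rows
sample_cols <- grep("^Intensity", names(raw), value = TRUE) # one per patient

# Header rows become one row per sample, with assumed covariate names
meta <- as.data.frame(t(raw[meta_rows, sample_cols]), stringsAsFactors = FALSE)
names(meta) <- c("lab", "malig", "diag")

# Measurement rows become numeric columns named after the genes
vals <- t(raw[!meta_rows, sample_cols])
mode(vals) <- "numeric"
colnames(vals) <- raw$gene[!meta_rows]

wide <- data.frame(id = sample_cols, meta, vals,
                   row.names = NULL, stringsAsFactors = FALSE)
wide
```

Selecting the sample columns by pattern rather than by an explicit range is what keeps the sketch independent of how many patients the real table has.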

Interactive join in r based on different variables

I have two data frames as follows:
df<-data.frame(
id=c("1-1","2-2","3-3","4-4","5-5","6-6"),
identifer=c(1,2,3,4,5,6),
key=c("A","B","C","D","E","F"),
product=c("productA","productB","productC","productD","productE","productF"),
ingredient=c("ingredientA","ingredientB","ingredientC","ingredientD","ingredientE","ingredientF"),
DF=c("Tablet","Powder","Suspension","System","Capsule","Capsule"))
df_2<-data.frame(
identifer=c(1,2,2,3,4,6),
key=c("A","B","B","C","D","F"),
product=c("productA","productB","productB","productCC","productDD","productFF"),
ingredient=c("ingredientA","ingredientBB","ingredientB","ingredientC","ingredientDD","ingredeintFF"),
DF=c("Tablet","Powder","Powder","Suspension","injection","tablet"),
Route=c("ORAL","INHALATION","INHALATION","topical","injecatable","oral")
)
I want to join these two datasets first on the following variables + create a new column called "match" that describes the join:
1) identifier, key, product, ingredient, DF
match = "identifier, key, product, ingredient, DF"
Then, I want to join the REMAINING rows on these variables:
2) identifier, key, product, DF
match = "identifier, key, product, DF"
Then the remaining rows from step 2 on these variables, and so on and so forth:
3) identifier, key, ingredient, DF
4) identifier, key, DF
5) identifier, key, product, ingredient
6) identifier, key, product
7) identifier, key, ingredient
8) identifier, key
And I want to return the rows that do not have a match as well. I know how to do this stepwise but I'm wondering if there is an easier way to do this?
this is the expected output:
df_out<-data.frame(
identifer=c(1,2,3,4,5,6),
key=c("A","B","C","D","E","F"),
product_1=c("productA","productB","productC","productD","productE","productF"),
ingredient_1=c("ingredientA","ingredientB","ingredientC","ingredientD","ingredientE","ingredientF"),
DF_1=c("Tablet","Powder","Suspension","System","Capsule","Capsule"),
product_2=c("productA","productB","productCC","productDD",NA,"productFF"),
ingredient_2=c("ingredientA","ingredientB","ingredientC","ingredientDD",NA,"ingredeintFF"),
DF_2=c("Tablet","Powder","Suspension","injection",NA,"tablet"),
Route_2=c("ORAL","INHALATION",'topical',"injecatable",NA,"oral"),
Match=c("identifer+key+product+ingredient+DF","identifier+key+product+ingredient+DF","identifier+key+ingredient+DF","identifer+key","None","identifer+key+product+ingredient"))
Here is an option using data.table:
library(data.table)
setDT(df)
setDT(df_2)
keyord <- list(
  c("product", "ingredient", "DF"),
  c("product", "DF"),
  c("ingredient", "DF"),
  "DF",
  c("product", "ingredient"),
  "product",
  "ingredient",
  c()
)
cols <- c("product", "ingredient", "DF", "Route")
df[, Match := NA_character_]
for (v in keyord) {
  k <- c("identifier", "key", v)
  df[df_2, on=k, c(paste0(cols, "_2"), "check") := c(mget(paste0("i.", cols)), .(TRUE))]
  df[is.na(Match) & check, Match := toString(k)]
}
setnames(df, cols, paste0(cols, "_1"), skip_absent=TRUE)
output:
id identifier key product_1 ingredient_1 DF_1 Match product_2 ingredient_2 DF_2 Route_2 check
1: 1-1 1 A productA ingredientA Tablet identifier, key, product, ingredient, DF productA ingredientA Tablet ORAL TRUE
2: 2-2 2 B productB ingredientB Powder identifier, key, product, ingredient, DF productB ingredientB Powder INHALATION TRUE
3: 3-3 3 C productC ingredientC Suspension identifier, key, ingredient, DF productCC ingredientC Suspension topical TRUE
4: 4-4 4 D productD ingredientD System identifier, key productDD ingredientDD injection injecatable TRUE
5: 5-5 5 E productE ingredientE Capsule <NA> <NA> <NA> <NA> <NA> NA
6: 6-6 6 F productF ingredientF Capsule identifier, key, product, ingredient productF ingredientF tablet oral TRUE
data after fixing some typos in OP:
df <- data.frame(
id=c("1-1","2-2","3-3","4-4","5-5","6-6"),
identifier=c(1,2,3,4,5,6),
key=c("A","B","C","D","E","F"),
product=c("productA","productB","productC","productD","productE","productF"),
ingredient=c("ingredientA","ingredientB","ingredientC","ingredientD","ingredientE","ingredientF"),
DF=c("Tablet","Powder","Suspension","System","Capsule","Capsule"))
df_2 <- data.frame(
identifier=c(1,2,2,3,4,6),
key=c("A","B","B","C","D","F"),
product=c("productA","productB","productB","productCC","productDD","productF"),
ingredient=c("ingredientA","ingredientBB","ingredientB","ingredientC","ingredientDD","ingredientF"),
DF=c("Tablet","Powder","Powder","Suspension","injection","tablet"),
Route=c("ORAL","INHALATION","INHALATION","topical","injecatable","oral")
)
edit for multiple matches:
df_2 <- data.frame(
  identifier=c(1,2,2,3,4,4,6),
  key=c("A","B","B","C","D","D","F"),
  product=c("productA","productB","productB","productCC","productDD","productDd","productF"),
  ingredient=c("ingredientA","ingredientBB","ingredientB","ingredientC","ingredientDD",NA,"ingredientF"),
  DF=c("Tablet","Powder","Powder","Suspension","injection",NA,"tablet"),
  Route=c("ORAL","INHALATION","INHALATION","topical","injecatable",NA,"oral")
)
setDT(df_2)
df[, c("Match", "check") := .(NA_character_, FALSE)]
ocols <- unique(unlist(keyord))
rbindlist(lapply(keyord, function(v) {
  k <- c("identifier", "key", v)
  a <- df_2[df[(!check)], on=k, nomatch=0L, c(.(id=id),
    setNames(mget(paste0("i.", ocols)), paste0(ocols, "_1")),
    setNames(mget(paste0("x.", c(ocols, "Route"))), paste0(c(ocols, "Route"), "_2")))
  ]
  df[id %chin% a$id, check := TRUE]
  a
}), use.names=TRUE)
output:
id product_1 ingredient_1 DF_1 product_2 ingredient_2 DF_2 Route_2
1: 1-1 productA ingredientA Tablet productA ingredientA Tablet ORAL
2: 2-2 productB ingredientB Powder productB ingredientB Powder INHALATION
3: 3-3 productC ingredientC Suspension productCC ingredientC Suspension topical
4: 6-6 productF ingredientF Capsule productF ingredientF tablet oral
5: 4-4 productD ingredientD System productDD ingredientDD injection injecatable
6: 4-4 productD ingredientD System productDd <NA> <NA> <NA>
Here is a solution that might feel slightly over-engineered but achieves the expected outcome:
library(dplyr)
library(purrr)
library(stringr)
get_match=function(data, cols, keys){
  rtn = ifelse(rowSums(is.na(data[paste0(cols, "_1")]))==rowSums(is.na(data[paste0(cols, "_2")])), paste(keys, collapse="+"), "None")
  rtn2 = cols %>%
    map(~{
      case_when(as.character(data[[paste0(.x, "_1")]])==as.character(data[[paste0(.x, "_2")]])~.x)
    }) %>%
    reduce(paste, sep="+") %>% str_replace_all("\\+?NA\\+?", "")
  paste(rtn, rtn2, sep="+") %>% str_replace_all("\\+$", "")
}
df_out = left_join(df, df_2, by=c("identifer", "key"), suffix=c("_1", "_2")) %>%
  mutate(Match = get_match(., cols=c("product", "ingredient", "DF"), keys=c("identifer", "key")),
         match_strength = str_count(Match, "\\+")) %>%
  group_by(id) %>%
  filter(match_strength==max(match_strength, na.rm=TRUE))
dplyr::left_join does not keep separate copies of the by keys, so the only way I found to account for them was to check whether all the _1 or all the _2 columns were missing. I could have used the keep=TRUE option and removed/renamed them afterwards, though...
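The stepwise idea from the question (try the strictest key set first, then retry only the rows that are still unmatched with a looser one) can also be sketched in plain base R. The data frames `left` and `right` and the two key sets below are hypothetical and much smaller than the question's data; the sketch only labels each row with the first key set that matches and does not bring over the columns of the second table.

```r
# Hypothetical toy tables
left <- data.frame(id = c(1, 2, 3), key = c("A", "B", "C"),
                   product = c("p1", "p2", "p3"), stringsAsFactors = FALSE)
right <- data.frame(id = c(1, 2), key = c("A", "B"),
                    product = c("p1", "pX"), stringsAsFactors = FALSE)

# Key sets ordered from strictest to loosest
key_sets <- list(c("id", "key", "product"), c("id", "key"))

left$match <- NA_character_
for (k in key_sets) {
  todo <- is.na(left$match)  # only rows not matched by an earlier key set
  # Build one composite key per row by pasting the key columns together
  lk <- do.call(paste, c(left[todo, k, drop = FALSE], sep = "|"))
  rk <- do.call(paste, c(right[, k, drop = FALSE], sep = "|"))
  hit <- lk %in% rk
  left$match[todo][hit] <- paste(k, collapse = "+")
}
left$match[is.na(left$match)] <- "None"
left
```

Pasting with a separator assumes that "|" never occurs inside the key values; a real implementation would need a separator that is guaranteed to be absent from the data.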

How to tidy the data set with column containing multiple information-Sample data put?

Please help me make my data tidy. Thanks.
There are 394 observations in total, with 26 columns. The data is exported from MS Excel.
A data sample is given below. In this sample there should actually be only three observations/rows.
In the vectors d1..d2..no and Farmer.Name, the values on rows where v1 is NA should be moved out of those rows and attached to the preceding row.
The d1..d2..no vector holds three values per observation (two dates and one unique identification number), and so does the Farmer.Name vector.
The sample is
v1<-c(1,NA,NA,2,NA,NA,3,NA,NA)
d1..d2..no<-c("27/01/2020", "43832", "KE004421", "43832", "43832",
"KE003443", "31/12/2019", "43832", "KE0001512")
Farmer.Name<-c("S Jacob Gender:male","farmer type :marginal","farmer category :general",
"J Isac Gender :Female","farmer type: large","farmer category :general",
"P Kumar Gender :Male","farmer type:small","farmer category :general")
adress<-c("k11",NA,NA,"k12",NA,NA,"k13",NA,NA)
amount<-c(25,NA,NA,25,NA,NA,32,NA,NA)
mydata<-data.frame(v1=v1, d1..d2..no=d1..d2..no, Farmer.Name=Farmer.Name,
adress=adress, amount=amount)
That is, the result I expect is the one produced by this code:
v1<-c(1,2,3)
d1<-c("27/01/2020","43832","31/12/2019")
d2<-c("43832","43832","43832")
no<-c("KE004421","KE003443","KE0001512")
Farmer.Name1<-c("S Jacob","J Isac","P Kumar")
Gender<-c("male","female","male")
farmer_type <-c("marginal","large","small")
farmer_category <-c("general", "general", "general")
adress<-c("k11","k12","k13")
amount<-c(25,25,32)
myfinaldata<-data.frame(v1=v1,d1=d1,d2=d2,no=no,
Farmer.Name1=Farmer.Name1,
farmer_type=farmer_type,
farmer_category=farmer_category,
adress=adress,amount=amount)
The result should be
v1 d1 d2 no Farmer.Name1 farmer_type farmer_category adress amount
1 1 27/01/2020 43832 KE004421 S Jacob marginal general k11 25
2 2 43832 43832 KE003443 J Isac large general k12 25
3 3 31/12/2019 43832 KE0001512 P Kumar small general k13 32
I am a novice at programming and R, learning through online resources. This is also my first post on this platform, so please forgive any mistakes.
I have made quite a mess trying spread, separate, etc. from the tidyverse, but I am stuck on how to proceed.
Untidy data can be a challenge. Here is a tidyverse approach.
First, add a column with the names expected for d1, d2, and no. This assumes the rows always come in this order.
Column Farmer.Name is separated into two columns at the colon (:).
The name itself is split off just before the word Gender.
fill carries values shared by the same individual (such as v1, adress, amount, and Name) down to the following rows.
pivot_wider then spreads the data wide, first by d1, d2, and no, and then by the other columns, including Gender, farmer_type, and farmer_category.
library(tidyverse)
df1 <- mydata %>%
mutate(d_var = rep(c("d1", "d2", "no"), times = 3)) %>%
separate(Farmer.Name, into = c("Var", "Val"), sep = ":") %>%
separate(Var, into = c("Name", "Var"), sep = "(?=Gender)", fill = "left") %>%
mutate_at(c("Name", "Var"), trimws) %>%
fill(v1, adress, amount, Name, .direction = "down") %>%
mutate(Var = gsub(" ", "_", Var))
df1 %>%
pivot_wider(id_cols = c(v1, Name, adress, amount), names_from = d_var, values_from = d1..d2..no) %>%
left_join(pivot_wider(df1, id_cols = c(v1, Name, adress, amount), names_from = Var, values_from = Val))
Output
# A tibble: 3 x 10
v1 Name adress amount d1 d2 no Gender farmer_type farmer_category
<dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 S Jacob k11 25 27/01/2020 43832 KE004421 male "marginal" general
2 2 J Isac k12 25 43832 43832 KE003443 Female " large" general
3 3 P Kumar k13 32 31/12/2019 43832 KE0001512 Male "small" general
The dates in your data set are not in date format. Consider formatting them after this.
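As a sketch of that formatting step: the date columns mix "dd/mm/yyyy" strings with bare numbers such as 43832. The code below assumes that those numbers are Excel serial dates (days counted from 1899-12-30), which is a guess based on the data being exported from Excel; the helper name `parse_mixed_date` is made up for illustration.

```r
# Convert a character vector that mixes "dd/mm/yyyy" strings and
# (assumed) Excel serial day numbers into Date values
parse_mixed_date <- function(x) {
  out <- as.Date(rep(NA_character_, length(x)))
  is_serial <- grepl("^[0-9]+$", x)
  # Assumption: all-digit values are Excel serials with origin 1899-12-30
  out[is_serial]  <- as.Date(as.numeric(x[is_serial]), origin = "1899-12-30")
  out[!is_serial] <- as.Date(x[!is_serial], format = "%d/%m/%Y")
  out
}

dates <- parse_mixed_date(c("27/01/2020", "43832", "31/12/2019"))
dates
```

If the serial-number assumption is wrong for this export, only the `origin` branch needs to change.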
library(reshape)
df.new <- cbind(mydata[seq(1, nrow(mydata), 3), ], mydata[seq(2, nrow(mydata), 3), ][2:3], mydata[seq(3, nrow(mydata), 3), ][2:3])
colnames(df.new) <- c("v1", "d1", "Farmer.Name1", "adress", "amount", "d2", "farmer_type", "no", "farmer_category")
df.new <- df.new[c(1,2,6, 8,3, 7,9, 4,5)]
library(stringr)
df.new$Farmer.Name1 <- word(df.new$Farmer.Name1,1,sep = "\\ Gender")
df.new$farmer_type <- word(df.new$farmer_type,2,sep = "\\:")
df.new$farmer_category <- word(df.new$farmer_category,2,sep = "\\:")
Final table:
> df.new
v1 d1 d2 no Farmer.Name1 farmer_type farmer_category adress amount
1 1 27/01/2020 43832 KE004421 S Jacob marginal general k11 25
4 2 43832 43832 KE003443 J Isac large general k12 25
7 3 31/12/2019 43832 KE0001512 P Kumar small general k13 32
P.S.: I have not renamed the row numbers.
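The raw "label : value" strings have inconsistent spacing around the colon, which is why values such as " large" can keep stray whitespace after splitting. A small base R sketch of splitting and trimming such strings is shown below; the vector `raw` is a hypothetical excerpt in the same style as Farmer.Name.

```r
# Hypothetical excerpt with inconsistent spacing around the colon
raw <- c("farmer type :marginal", "farmer type: large", "farmer type:small")

# Split at the literal colon, then trim both halves
parts <- strsplit(raw, ":", fixed = TRUE)
key   <- gsub(" ", "_", trimws(vapply(parts, `[`, character(1), 1)))
value <- trimws(vapply(parts, `[`, character(1), 2))

data.frame(key = key, value = value, stringsAsFactors = FALSE)
```

The same `trimws()` call could be applied to the value column produced by either answer above.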

Loop through 2 dataframes to identify common columns

I have 2 reproducible data frames here. I am trying to identify which columns contain values that also appear in a column of the other data frame. I want my code to take every column of df1 and loop through every single column in df2. My code below works, but it needs fine-tuning to allow multiple matches for the same column.
df1 <- data.frame(fruit=c("Apple", "Orange", "Pear"), location = c("Japan", "China", "Nigeria"), price = c(32,53,12))
df2 <- data.frame(grocery = c("Durian", "Apple", "Watermelon"),
place=c("Korea", "Japan", "Malaysia"),
name = c("Mark", "John", "Tammy"),
favourite.food = c("Apple", "Wings", "Cakes"),
invoice = c("XD1", "XD2", "XD3"))
df <- sapply(names(df1), function(x) {
temp <- sapply(names(df2), function(y)
if(any(match(df1[[x]], df2[[y]], nomatch = FALSE))) y else NA)
ifelse(all(is.na(temp)), NA, temp[which.max(!is.na(temp))])
}
)
t1 <- data.frame(lapply(df, type.convert), stringsAsFactors=FALSE)
t1 <- data.frame(t(t1))
t1 <- cbind(newColName = rownames(t1), t1)
rownames(t1) <- 1:nrow(t1)
colnames(t1) <- c("Columns from df1", "Columns from df2")
df1
fruit location price
1 Apple Japan 32
2 Orange China 53
3 Pear Nigeria 12
df2
grocery place name favourite.food invoice
1 Durian Korea Mark Apple XD1
2 Apple Japan John Wings XD2
3 Watermelon Malaysia Tammy Cakes XD3
t1 #(OUTPUT FROM CODE ABOVE)
Columns from df1 Columns from df2
1 fruit grocery
2 location place
3 price <NA>
This is the output I hope to obtain instead:
Columns from df1 Columns from df2
1 fruit grocery, favourite.food
2 location place
3 price <NA>
Notice that the columns "grocery" and "favourite.food" both match the column "fruit", whereas my code only returns one column.
We can change the code to return all the matches instead and wrap them in one string using toString:
vec <- sapply(names(df1), function(x) {
  temp <- sapply(names(df2), function(y)
    if(any(match(df1[[x]], df2[[y]], nomatch = FALSE))) y else NA)
  ifelse(all(is.na(temp)), NA, toString(temp[!is.na(temp)]))
}
)
vec
# fruit location price
#"grocery, favourite.food" "place" NA
To convert it into a data frame, we can do:
data.frame(columns_from_df1 = names(vec), columns_from_df2 = vec, row.names = NULL)
# columns_from_df1 columns_from_df2
#1 fruit grocery, favourite.food
#2 location place
#3 price <NA>
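An alternative way to express the same search, sketched here with `%in%` and `vapply()` instead of `match()`, is to keep the matches as a list first and collapse them to strings only at the end. The data frames repeat the question's data; the names `matches` and `res` are introduced for illustration.

```r
df1 <- data.frame(fruit = c("Apple", "Orange", "Pear"),
                  location = c("Japan", "China", "Nigeria"),
                  price = c(32, 53, 12), stringsAsFactors = FALSE)
df2 <- data.frame(grocery = c("Durian", "Apple", "Watermelon"),
                  place = c("Korea", "Japan", "Malaysia"),
                  name = c("Mark", "John", "Tammy"),
                  favourite.food = c("Apple", "Wings", "Cakes"),
                  invoice = c("XD1", "XD2", "XD3"), stringsAsFactors = FALSE)

# For each column of df1, the names of the df2 columns sharing any value
matches <- lapply(df1, function(col) {
  names(df2)[vapply(df2, function(other) any(col %in% other), logical(1))]
})

# Collapse to one string per df1 column; no match becomes NA
res <- vapply(matches, function(m) {
  if (length(m)) toString(m) else NA_character_
}, character(1))
res
```

Keeping the intermediate list can be convenient when the individual matching column names are needed later rather than a single comma-separated string.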
