Let's start with the example of the data:
structure(list(P1 = structure(c(1L, 1L, 3L, 3L, 5L, 5L, 5L, 5L,
4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 2L, 2L), .Label = c("Apple",
"Grape", "Orange", "Peach", "Tomato"), class = "factor"), P2 = structure(c(4L,
4L, 3L, 3L, 5L, 5L, 5L, 5L, 6L, 6L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 6L, 6L), .Label = c("Banana", "Cucumber", "Lemon", "Orange",
"Potato", "Tomato"), class = "factor"), P1_location_subacon = structure(c(2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("Fridge", "Table"), class = "factor"),
P1_location_all_predictors = structure(c(2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L), .Label = c("Table,Desk,Bag,Fridge,Bed,Shelf,Chair",
"Table,Shelf,Cupboard,Bed,Fridge", "Table,Shelf,Fridge"), class = "factor"),
P2_location_subacon = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Fridge",
"Shelf"), class = "factor"), P2_location_all_predictors = structure(c(3L,
3L, 2L, 2L, 1L, 1L, 1L, 1L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L), .Label = c("Shelf,Fridge", "Shelf,Fridge,Bed",
"Table,Shelf,Fridge"), class = "factor")), .Names = c("P1",
"P2", "P1_location_subacon", "P1_location_all_predictors", "P2_location_subacon",
"P2_location_all_predictors"), class = "data.frame", row.names = c(NA,
-20L))
I would like to compare the two pairs of column. First pair which I would like to comapre is P1_location_subacon with P2_location_subacon. The second pair is P1_location_all_predictors with P2_location_all_predictors.
How I want to compare them ? In each column you have different "locations" of the fruit/vegetable. So:
if the location is the same in the first pair (P1/2_location_subacon) I would like to put number 2 in the additional column.
if the location is the same in the second pair (P1/2_location_all_predictors) I would like to put number 1 in the additional column. That one is a bit more complicated because not all of the locations have to be the same. At least one of them has to be the same for both fruits/vegetables.
if in both cases they are different put 0. You won't see such situation in the example data.
To summarize I show you the output which I would like to achieve:
structure(list(P1 = structure(c(1L, 1L, 3L, 3L, 5L, 5L, 5L, 5L,
4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 2L, 2L), .Label = c("Apple",
"Grape", "Orange", "Peach", "Tomato"), class = "factor"), P2 = structure(c(4L,
4L, 3L, 3L, 5L, 5L, 5L, 5L, 6L, 6L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 6L, 6L), .Label = c("Banana", "Cucumber", "Lemon", "Orange",
"Potato", "Tomato"), class = "factor"), P1_location_subacon = structure(c(2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("Fridge", "Table"), class = "factor"),
P1_location_all_predictors = structure(c(2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L), .Label = c("Table,Desk,Bag,Fridge,Bed,Shelf,Chair",
"Table,Shelf,Cupboard,Bed,Fridge", "Table,Shelf,Fridge"), class = "factor"),
P2_location_subacon = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Fridge",
"Shelf"), class = "factor"), P2_location_all_predictors = structure(c(3L,
3L, 2L, 2L, 1L, 1L, 1L, 1L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L), .Label = c("Shelf,Fridge", "Shelf,Fridge,Bed",
"Table,Shelf,Fridge"), class = "factor"), X = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA), Correct = c(1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L)), .Names = c("P1",
"P2", "P1_location_subacon", "P1_location_all_predictors", "P2_location_subacon",
"P2_location_all_predictors", "X", "Correct"), class = "data.frame", row.names = c(NA,
-20L))
EDIT: using feedback from here Test two columns of strings for match row-wise in R I have improved my answer.
Where DT is your table:
library(data.table)
setDT(DT)
DT <- data.table(sapply(DT,as.character))
DT[, P1_location_all_predictors := gsub(",","|",P1_location_all_predictors)]
DT[, P1_location_subacon := gsub(",","|",P1_location_subacon)]
DT[, match_all_pred := grepl(P1_location_all_predictors, P2_location_all_predictors) + 0, by = P1_location_all_predictors]
DT[, match_subacon := grepl(P1_location_subacon, P2_location_subacon), by = P1_location_subacon]
DT[, P1_location_all_predictors := gsub("\\|",",",P1_location_all_predictors)]
DT[, P1_location_subacon := gsub("\\|",",",P1_location_subacon)]
I instead opted for two columns vs your 0/1/2 notation; it makes the code less straightforward as you have to rely on nested ifs. I also think that multiple columns is better as you can clearly see the F/F, T/F, F/T, and T/T cases.
If you must create the 0/1/2, you can call
DT[, MyCol := match_all_pred - match_subacon*match_all_pred+match_subacon*2]
which assumes that subacon supersedes the all location.
Here is another way:
myData <- data.frame(sapply(myData, as.character), stringsAsFactors=FALSE)
doesIntersect <- function(setA, setB) {length(intersect(setA,setB)) > 0}
myData$Correct <- 0
myData$Correct[mapply(doesIntersect, strsplit(myData$P1_location_all_predictors, ","), strsplit(myData$P2_location_all_predictors, ","))] <- 1
myData$Correct[mapply(setequal, strsplit(myData$P1_location_subacon, ","), strsplit(myData$P2_location_subacon, ","))] <- 2
> myData$Correct
[1] 1 1 2 2 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
Related
Hello :) I am desperately trying to change the colors and font of my emmip plot (plot from the emmeans package in R) but none of my codes are working.
Currently my code for the plot looks like this:
emmip(Model, group ~ gend, CIs=TRUE, nuisance = c("known", "age_dup", "edu"),
xlab = "",
ylab = "Intention to use the platform")
I read in the manueal from the R-package that the code from emmip() can be combined with ggplot2 codes. But when I add the following two codes (that I successfully use in another ggplot) - nothing changes in my plot:
+ theme(text=element_text(family="serif", size=13)
+ scale_fill_brewer(palette="Blues"))
I varied them already, for example "," instead of "+"
Does anyone have an idea how I can make these two modifications work in emmip? Thank you all in advance!
Here is the dput of my data (first 30 rows):
structure(list(dv = c(1, 5, 5, 1, 3, 5, 2, 1, 5, 5, 2, 4, 6,
7, 3, 5, 5, 6, 7, 1, 7, 6, 2, 4, 7, 6, 5, 1, 6, 6), gend = structure(c(1L,
2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, NA, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 2L), .Label = c("Male",
"Female"), class = "factor"), group = structure(c(5L, 3L, 5L,
3L, 2L, 1L, 3L, 4L, 2L, 1L, 3L, 2L, 3L, 3L, 4L, 4L, 2L, 4L, 5L,
5L, 1L, 4L, 1L, 4L, 2L, 1L, 2L, 3L, 1L, 4L), .Label = c("Default",
"Visual element", "Verbal content", "Visual design", "Combined",
"DesignZH"), class = "factor"), ISFregscores = c(0.984372106429775,
-0.383676865152824, -0.816194838031774, -0.408554787302724, -0.0416530380928891,
0.998088756156888, 0.216609251327447, 0.83416518546863, 1.00178246600492,
-0.496215251116934, -1.34559758838579, NA, 0.707838661016661,
1.05815783619489, -0.314855036376305, 0.617674358967702, -0.56862344822269,
0.0589354712707628, 0.31998903974822, -0.511084756816837, -0.171121724458495,
0.532699047600051, 0.196311893993997, -2.09902298349596, 1.04422334581248,
-0.132687312769232, 1.05733961165571, 0.541606480874359, 0.440296538856025,
0.895064902672922), age_dup = structure(c(2L, 1L, 1L, 2L, 1L,
2L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 3L, 1L, 2L, 3L,
1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L), .Label = c("under34", "age35_49",
"over50"), class = "factor"), edu = structure(c(5L, 4L, 5L, 2L,
5L, 5L, 5L, 5L, 5L, 4L, 5L, NA, 5L, 4L, 1L, 5L, 4L, 5L, 3L, 5L,
2L, 5L, 3L, 5L, 6L, 5L, 6L, 1L, 3L, 5L), .Label = c("oblig. Schulzeit",
"Berufsausbildung", "Berufsmatura", "Gymnasiale Matura", "BA/MA",
"Doktorat", "Andere"), class = "factor"), empl = structure(c(1L,
6L, 1L, 2L, 8L, 2L, 5L, 2L, 2L, 6L, 2L, NA, 1L, 1L, 6L, 2L, 6L,
1L, 1L, 4L, 2L, 6L, 1L, 1L, 3L, 6L, 2L, 2L, 4L, 1L), .Label = c("Privatsektor",
"öffentlicher Sektor", "Non-Profit Sektor", "selbstständig",
"Rentner/in", "Student/in", "Hausfrau/Hausmann", "arbeitssuchend"
), class = "factor"), civ_dup = structure(c(2L, 1L, 1L, 3L, 2L,
1L, 2L, 2L, 2L, 1L, 2L, NA, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 3L, 1L,
2L, 2L, 2L, 2L, 1L, 2L, 3L, 2L, 1L), .Label = c("single", "Partnerschaft",
"keine Angabe"), class = "factor"), kids = structure(c(2L, 1L,
1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, NA, 2L, 2L, 1L, 1L, 1L, 1L,
2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("Nein",
"Ja"), class = "factor"), known = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 1L, 1L, NA, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L), .Label = c("Nein", "Ja"
), class = "factor"), device = structure(c(1L, 2L, 1L, 2L, 2L,
1L, 3L, 1L, 2L, 2L, 1L, NA, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L,
1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L), .Label = c("Smartphone / Tablet iOS (iPhone/iPad)",
"Smartphone / Tablet (Android)", "Computer / Laptop"), class = "factor")), row.names = c(NA,
-30L), class = c("tbl_df", "tbl", "data.frame"))
And this is the code for my regression that I then use for the interaction (graph):
Model <- lm(dv ~ gend * group + ISFregscores + age_dup + edu + empl + civ_dup + kids + known + device, data=)
You've got the right approach to change the font but you also have to make sure the font is actually available to the graphics device. This step can be tricky; I use the showtext package which makes this a bit easier.
To change the color palette, specify the color scale (rather than the fill scale).
library("showtext")
#> Loading required package: sysfonts
#> Loading required package: showtextdb
library("emmeans")
library("tidyverse")
showtext_auto()
# 30 data points are too few to fit the original model, so I drop `device`
model <- lm(
dv ~ gend * group + ISFregscores + age_dup + edu + empl + civ_dup + kids + known,
data = data
)
p <- emmip(
model, group ~ gend,
CIs = TRUE,
nuisance = c("known", "age_dup", "edu"),
xlab = "",
ylab = "Intention to use the platform"
)
p +
scale_color_brewer(
palette = "Blues"
) +
guides(
color = guide_legend(title = "New Legend Title")
) +
theme(
text = element_text(family = "serif", face = "bold.italic", size = 16)
)
Created on 2023-01-11 with reprex v2.0.2
I have a folder which serves as a container for a standardized report from a system. This report is run on a daily basis. However, the report may require re-run for a certain date or range of dates depending on user preferences and asks. Thus file content may change significantly.
I would like to create a script that would group the unique dates together in one dataframe based on the latest run time, and another dataframe for the dates that are being revised.
Here is a simplified version of the table:
structure(list(Source = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L), Date = structure(c(1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("11-Feb-20", "12-Feb-20"
), class = "factor"), FarmType = structure(c(3L, 4L, 5L, 1L,
2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L), .Label = c("AJSKJA",
"ASKJKA", "GHDGH", "KLKIUK", "KLSAKJ"), class = "factor"), FarmName = structure(c(1L,
2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L), .Label = c("",
"JJHGH", "JKJKK", "JUISO", "SDLLS"), class = "factor"), Perform = c(13.04144378,
1.230474165, 1.230474165, 13.9407486, 13.9407486, 13.04144378,
1.230474165, 1.230474165, 13.9407486, 13.9407486, 13.04144378,
15.26566, 1.230474165, 13.9407486), RunDate = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("02/14/2020",
"02/15/2020"), class = "factor")), class = "data.frame", row.names = c(NA,
-14L))
Please note that the number of columns does not change, however, after each re-run the number of rows may increase/decrease.
The idea is -- the first group of data that is based on the most recent run would represent the up-to-date information (corrections, revisions, etc.), while the second group essentially looks at what is being revised and how the numbers and data are changing.
Expected output for the first group:
structure(list(Source = c(3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L),
Date = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("11-Feb-20",
"12-Feb-20"), class = "factor"), FarmType = structure(c(3L,
4L, 5L, 1L, 3L, 4L, 5L, 1L, 2L), .Label = c("AJSKJA", "ASKJKA",
"GHDGH", "KLKIUK", "KLSAKJ"), class = "factor"), FarmName = structure(c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L), .Label = c("", "JJHGH",
"JKJKK", "JUISO", "SDLLS"), class = "factor"), Perform = c(13.04144378,
15.26566, 1.230474165, 13.9407486, 13.04144378, 1.230474165,
1.230474165, 13.9407486, 13.9407486), RunDate = structure(c(2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("02/14/2020",
"02/15/2020"), class = "factor")), class = "data.frame", row.names = c(NA,
-9L))
Expected output for the second group:
structure(list(Source = c(1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L),
Date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "11-Feb-20", class = "factor"),
FarmType = structure(c(3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L
), .Label = c("AJSKJA", "ASKJKA", "GHDGH", "KLKIUK", "KLSAKJ"
), class = "factor"), FarmName = structure(c(1L, 2L, 3L,
4L, 5L, 1L, 2L, 3L, 4L), .Label = c("", "JJHGH", "JKJKK",
"JUISO", "SDLLS"), class = "factor"), Perform = c(13.04144378,
1.230474165, 1.230474165, 13.9407486, 13.9407486, 13.04144378,
15.26566, 1.230474165, 13.9407486), RunDate = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("02/14/2020",
"02/15/2020"), class = "factor")), class = "data.frame", row.names = c(NA,
-9L))
Thank you for your time. Please let me know if you have questions.
We could group by 'Date' and filter those groups where the 'RunDate' is the latest after converting to Date class
library(lubridate)
library(dplyr)
new1 <- df1 %>%
group_by(Date) %>%
filter(mdy(RunDate) == max(mdy(RunDate)))
and for the second set, we can check if the number of distinct elements of 'RunDate' is more than 1
new2 <- df1 %>%
group_by(Date) %>%
filter(n_distinct(RunDate) > 1)
I have data that I want to know the number of specific rows that are with specific character. The data looks like the following
df<-structure(list(Gene.refGene = structure(c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 4L), .Label = c("A1BG", "A1BG-AS1", "A1CF",
"A1CF;PRKG1"), class = "factor"), Chr = structure(c(2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("chr10", "chr19"
), class = "factor"), Start = c(58858232L, 58858615L, 58858676L,
58859052L, 58859055L, 58859066L, 58859510L, 58863162L, 58864479L,
58864150L, 58864867L, 58864879L, 58865857L, 52566433L, 52569637L,
52571047L, 52573510L, 52576068L, 52580561L, 52603659L, 52619845L,
52625849L, 52642500L, 52650951L, 52675605L, 52703952L, 52723140L,
52723638L), End = c(58858232L, 58858615L, 58858676L, 58859052L,
58859055L, 58859066L, 58859510L, 58863166L, 58864479L, 58864150L,
58864867L, 58864879L, 58865857L, 52566433L, 52569637L, 52571047L,
52573510L, 52576068L, 52580561L, 52603659L, 52619845L, 52625849L,
52642500L, 52650958L, 52675605L, 52703952L, 52723140L, 52723638L
), Ref = structure(c(3L, 5L, 2L, 2L, 3L, 2L, 5L, 7L, 6L, 6L,
2L, 1L, 5L, 6L, 5L, 3L, 2L, 5L, 6L, 3L, 3L, 6L, 3L, 4L, 3L, 6L,
6L, 3L), .Label = c("-", "A", "C", "CTCTCTCT", "G", "T", "TTTTT"
), class = "factor"), Alt_df1 = structure(c(1L, 1L, 4L, 4L, 1L,
4L, 5L, 1L, 3L, 3L, 4L, 4L, 3L, 1L, 2L, 5L, 1L, 2L, 1L, 5L, 5L,
2L, 5L, 1L, 4L, 3L, 4L, 2L), .Label = c("-", "A", "C", "G", "T"
), class = "factor")), class = "data.frame", row.names = c(NA,
-28L))
I want to know how many rows of the column named "alt_df1" is missing or - or NA
Here is an answer using which and utilising base R's LETTERS data:
length(which(!df$Alt_df1%in%LETTERS))
#[1] 8
Or using just which:
length(which(df$Alt_df1=="-"))
#[1] 8
One way would be to create a logical vector using %in% and then sum over them to count the number of occurrences.
sum(df$Alt_df1 %in% c("-", NA))
#[1] 8
Or we can also subset and count the number of rows.
nrow(subset(df, Alt_df1 %in% c("-", NA)))
which can also be done in dplyr by
library(dplyr)
df %>% filter(Alt_df1 %in% c("-", NA)) %>% nrow
Another option using grepl
with(df, sum(grepl("-", Alt_df1)) + sum(is.na(Alt_df1)))
and I am sure there are multiple other ways.
I have a data frame of 2511 rows and 6 columns with candy and color items. Please see the first 15 rows as below:
structure(list(x = 1:15, iteml = structure(c(2L, 1L, 1L, 1L,
5L, 4L, 4L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("{dulce1_rojo",
"{dulce2_verde", "{dulce7_plata", "{miel21_amarillo", "{miel30_azul"
), class = "factor"), item2 = structure(c(4L, 2L, 2L, 2L, 1L,
5L, 5L, 4L, 3L, 3L, 4L, 1L, 4L, 4L, 1L), .Label = c("chocolate2l_amarillo",
"dulce2_verde", "dulce7_plata", "miel21_amarillo", "miel30_azul"
), class = "factor"), item3 = structure(c(1L, 1L, 3L, 3L, 2L,
2L, 1L, 2L, 2L, 3L, 2L, 2L, 2L, 1L, 2L), .Label = c("chocolate2l_amarillo",
"chocolate30_azul", "miel21_amarillo"), class = "factor"), item4 = structure(c(2L,
2L, 2L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("chocolate2l_amarillo",
"chocolate32_violeta", "cookie30_azul"), class = "factor"), item5 = structure(c(2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("cookie2l_amarillo}",
"cookie32_violeta}"), class = "factor"), item6 = structure(c(4L,
6L, 1L, 3L, 6L, 1L, 2L, 4L, 6L, 2L, 5L, 6L, 1L, 2L, 4L), .Label = c(">{chocolate2l_amarillo}",
">{chocolate30_azul}", ">{chocolate32_violeta}", ">{dulce1_rojo}",
">{dulce7_plata}", ">{miel21_amarillo}"), class = "factor")), class = "data.frame", row.names = c(NA,
-15L))
I don`t know how can I count in new columns only the kind of candy that each row has. This first line as an expected ouput of the resulting data frame:
x iteml item2 item3 item4 item5 item6 dulce miel chocolate cookie
1 1 {dulce2_verde miel21_amarillo chocolate2l_amarillo chocolate32_violeta cookie32_violeta} >{dulce1_rojo} 2 1 2 1
I'm stuck and I'd appreciate a little help.
you can use apply function to apply grepl function by row for the initial data frame. Then you use sapply to iterate through four ingridients you indicated. Then use cbind to concatentate the initial data frame and the data frame with ingedients into one. Please see the code below:
# initialize data frame
df <- structure(list(x = 1:15, iteml = structure(c(2L, 1L, 1L, 1L,
5L, 4L, 4L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("{dulce1_rojo",
"{dulce2_verde", "{dulce7_plata", "{miel21_amarillo", "{miel30_azul"
), class = "factor"), item2 = structure(c(4L, 2L, 2L, 2L, 1L,
5L, 5L, 4L, 3L, 3L, 4L, 1L, 4L, 4L, 1L), .Label = c("chocolate2l_amarillo",
"dulce2_verde", "dulce7_plata", "miel21_amarillo", "miel30_azul"
), class = "factor"), item3 = structure(c(1L, 1L, 3L, 3L, 2L,
2L, 1L, 2L, 2L, 3L, 2L, 2L, 2L, 1L, 2L), .Label = c("chocolate2l_amarillo",
"chocolate30_azul", "miel21_amarillo"), class = "factor"), item4 = structure(c(2L,
2L, 2L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("chocolate2l_amarillo",
"chocolate32_violeta", "cookie30_azul"), class = "factor"), item5 = structure(c(2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("cookie2l_amarillo}",
"cookie32_violeta}"), class = "factor"), item6 = structure(c(4L,
6L, 1L, 3L, 6L, 1L, 2L, 4L, 6L, 2L, 5L, 6L, 1L, 2L, 4L), .Label = c(">{chocolate2l_amarillo}",
">{chocolate30_azul}", ">{chocolate32_violeta}", ">{dulce1_rojo}",
">{dulce7_plata}", ">{miel21_amarillo}"), class = "factor")), class = "data.frame", row.names = c(NA,
-15L))
# counting ingridients
ingridients <- c("dulce", "miel", "chocolate", "cookie")
x <- sapply(ingridients, function(y) apply(df, 1, function(x) sum(grepl(y, x))))
df_res <- cbind(df, x)
head(df_res)
Output:
x iteml item2 item3 item4 item5 item6 dulce miel chocolate cookie
1 1 {dulce2_verde miel21_amarillo chocolate2l_amarillo chocolate32_violeta cookie32_violeta} >{dulce1_rojo} 2 1 2 1
2 2 {dulce1_rojo dulce2_verde chocolate2l_amarillo chocolate32_violeta cookie32_violeta} >{miel21_amarillo} 2 1 2 1
3 3 {dulce1_rojo dulce2_verde miel21_amarillo chocolate32_violeta cookie32_violeta} >{chocolate2l_amarillo} 2 1 2 1
4 4 {dulce1_rojo dulce2_verde miel21_amarillo chocolate2l_amarillo cookie32_violeta} >{chocolate32_violeta} 2 1 2 1
5 5 {miel30_azul chocolate2l_amarillo chocolate30_azul cookie30_azul cookie2l_amarillo} >{miel21_amarillo} 0 2 2 2
6 6 {miel21_amarillo miel30_azul chocolate30_azul cookie30_azul cookie2l_amarillo} >{chocolate2l_amarillo} 0 2 2 2
I have two data.frames df.1 and df.2 that I would merge or otherwise select data from to create a new data.frame. df.1 contains information about each individual (ID), sampling event (Event), Site and sample number (Sample). The tricky part for me is that Site and the corresponding Sample for each ID-Event pairing is different. For example, F3-3 has Site "plum" for Sample "1" and M6-3 has Site "pear" for Sample "1".
df.2 has Sample1 and Sample2 which corresponds to the Sample information in df.1 by way of the ID-Event pairing.
I'd like to match/merge the information between these two data.frames. Essentially, get the "word" from Site in df.1 that matches the Sample number. An example (df.3) is below.
Each ID-Event pairing will only have one Site and corresponding Sample (e.g. "Apple" will correspond to "1" not to "1" and "4"). I know I could use merge if I was only matching, for example, Sample1 or Sample2 I am not sure how to do this with both to populate Site1 and Site2 with the correctly matched word.
df.1 <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("F1",
"F3", "M6"), class = "factor"), Sex = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), .Label = c("F", "M"), class = "factor"), Event = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L,
4L, 4L, 4L, 4L), Site = structure(c(1L, 3L, 9L, 7L, 8L, 10L,
2L, 6L, 4L, 5L, 1L, 9L, 7L, 8L, 10L, 5L, 10L, 2L, 6L, 4L, 5L,
1L, 9L, 2L, 6L, 4L, 5L, 1L, 8L, 3L, 10L, 4L, 2L, 6L, 4L, 5L,
1L), .Label = c("Apple", "Banana", "Grape", "Guava", "Kiwi",
"Mango", "Orange", "Peach", "Pear", "Plum"), class = "factor"),
Sample = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L,
3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L)), .Names = c("ID",
"Sex", "Event", "Site", "Sample"), class = "data.frame", row.names = c(NA,
-37L))
#
df.2 <- structure(list(Sample1 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L), Sample2 = c(2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L,
3L, 4L, 5L), V1 = c(0.12, 0.497, 0.715, 0, 0.001, 0, 0.829, 0,
0, 0.001, 0, 0.829), V2 = c(0.107, 0.273, 0.595, 0, 0.004, 0,
0.547, 0.001, 0.001, 0.107, 0.273, 0.595), ID = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("F1",
"M6"), class = "factor"), Sex = structure(c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("F", "M"), class = "factor"),
Event = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L)), .Names = c("Sample1",
"Sample2", "V1", "V2", "ID", "Sex", "Event"), class = "data.frame", row.names = c(NA,
-12L))
#
df.3 <- structure(list(Sample1 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L), Sample2 = c(2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L,
3L, 4L, 5L), V1 = c(0.12, 0.497, 0.715, 0, 0.001, 0, 0.829, 0,
0, 0.001, 0, 0.829), V2 = c(0.107, 0.273, 0.595, 0, 0.004, 0,
0.547, 0.001, 0.001, 0.107, 0.273, 0.595), Site1 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("Apple",
"Banana"), class = "factor"), Site2 = structure(c(2L, 8L, 6L,
7L, 9L, 1L, 5L, 3L, 4L, 5L, 3L, 4L), .Label = c("Banana", "Grape",
"Guava", "Kiwi", "Mango", "Orange", "Peach", "Pear", "Plum"), class = "factor"),
ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L), .Label = c("F1", "M6"), class = "factor"), Sex = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("F",
"M"), class = "factor"), Event = c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 3L, 3L, 3L)), .Names = c("Sample1", "Sample2",
"V1", "V2", "Site1", "Site2", "ID", "Sex", "Event"), class = "data.frame", row.names = c(NA, -12L))
Two merges should do it:
first <- merge(df.2, unique(df.1[,3:5]), by.x=c("Sample1","Event"), by.y=c("Sample","Event"), all.x=TRUE)
second <- merge(first, unique(df.1[,3:5]),by.x=c("Sample2","Event"), by.y=c("Sample","Event"), all.x=TRUE)
print(second)
Sample2 Event Sample1 V1 V2 ID Sex Site.x Site.y
1 10 1 1 0.000 0.001 F1 F Apple Kiwi
2 2 1 1 0.120 0.107 F1 F Apple Grape
3 3 1 1 0.497 0.273 F1 F Apple Pear
4 3 3 2 0.001 0.107 M6 M Banana Mango
5 4 1 1 0.715 0.595 F1 F Apple Orange
6 4 3 2 0.000 0.273 M6 M Banana Guava
7 5 1 1 0.000 0.000 F1 F Apple Peach
8 5 3 2 0.829 0.595 M6 M Banana Kiwi
9 6 1 1 0.001 0.004 F1 F Apple Plum
10 7 1 1 0.000 0.000 F1 F Apple Banana
11 8 1 1 0.829 0.547 F1 F Apple Mango
12 9 1 1 0.000 0.001 F1 F Apple Guava