Can someone help me how to count from another dataframe?
df1(out)
structure(list(Item = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), class = "factor", .Label = "0S1576"), LC = structure(c(1L,
1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L), class = "factor", .Label = c("MW92",
"OY01", "RM11")), Fiscal.Month = c("2019-M06", "2019-M07", "2019-M06",
"2019-M07", "2019-M08", "2019-M09", "2019-M06", "2019-M07", "2019-M08"
)), row.names = c(NA, -9L), class = "data.frame")
df2(tempdf)
structure(list(Item = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "0S1576", class = "factor"),
LC = structure(c(1L, 1L, 1L, 1L, 2L, 3L, 4L, 6L, 5L, 1L,
2L, 2L, 3L, 3L), .Label = c("MW92", "OY01", "RM11", "RS11",
"WK14", "WK15"), class = "factor"), Fiscal.Month = structure(c(1L,
2L, 3L, 4L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("2019-M06",
"2019-M07", "2019-M08", "2019-M09"), class = "factor"), fcst = c(22L,
21L, 20L, 19L, 12L, 10L, 10L, 12L, 10L, 12L, 10L, 10L, 10L,
10L)), row.names = c(NA, -14L), class = "data.frame")
I want to count the frequency of Item,LC,Fiscal.month of df1 from df2
You can count using table and merge df1 with df2 by using factor and you need interaction as you use more than one column to merge.
table(factor(interaction(df2[c("Item","LC","Fiscal.Month")]), levels=interaction(df1)))
#0S1576.MW92.2019-M06 0S1576.MW92.2019-M07 0S1576.OY01.2019-M06
# 2 1 3
#0S1576.OY01.2019-M07 0S1576.OY01.2019-M08 0S1576.OY01.2019-M09
# 0 0 0
#0S1576.RM11.2019-M06 0S1576.RM11.2019-M07 0S1576.RM11.2019-M08
# 3 0 0
Or a speed improved version using match and tabulate:
(df1$freq <- tabulate(match(interaction(df2[c("Item","LC","Fiscal.Month")]), interaction(df1)), nrow(df1)))
#[1] 2 1 3 0 0 0 3 0 0
Or sometimes even faster using fastmatch:
library(fastmatch)
df1$freq <- tabulate(fmatch(interaction(df2[c("Item","LC","Fiscal.Month")]), interaction(df1)), nrow(df1))
I'm writing a function that attempts to add values in a single row of a data.frame in several columns at once:
require(stringr)
addPointsToKeyRow = function(df, keyRowNum, searchStringForPointColNames, pointsVector){
colsWithMatchingSearchResults = str_match(colnames(df), searchStringForPointColNames)
pointColNums = (which(!is.na(colsWithMatchingSearchResults)))
pointsVectorCleaned = pointsVector[!is.na(pointsVector)]
print(is.vector(pointsVectorCleaned)) #Returns TRUE
print(is.data.frame(pointsVectorCleaned)) #Returns FALSE
print(pointsVectorCleaned)
if(length(pointsVectorCleaned) == length(pointColNums)){
newDf = data.frame(df, stringsAsFactors = FALSE)
newDf[keyRowNum, pointColNums] = as.character(pointsVectorCleaned)
#for(i in 1:length(pointColNums)){
# newDf[keyRowNum,pointColNums[i]]=as.character(pointsVectorCleaned[i])
#}
print(newDf[keyRowNum,])
}
}
When I apply the function to my data (addPointsToKeyRow(finalDf, which(finalDf[,1]=="key"), "points_q", pointVals)), I get the following warnings:
In [<-.factor(*tmp*, iseq, value = "2") :
invalid factor level, NA generated
I've looked for the error on SO and other sites, and the recommendation always seems to be to make sure your data.frame has stringsAsFactors = FALSE.
I think my issue might be that when I subset the data.frame (newDf[keyRowNum, pointColNums]), it no longer keeps stringsAsFactors = FALSE.
Regardless of whether that's the issue or not, I'd very much welcome some help solving this weird issue. Many thanks in advance!
For the sake of an example, let's say df is:
df = structure(list(first = structure(c(7L, 9L, 5L, 4L, 10L, 2L, 3L,
6L, 1L, 8L), .Label = c("autumn", "spring", "summer", "winter",
"july", "betty", "november", "echo", "victor", "tango"), class = "factor"),
last = structure(c(6L, 2L, 4L, 5L, 1L, 8L, 3L, 9L, 10L, 7L
), .Label = c("brummett1", "do", "drorbaugh", "galeno", "gerber",
"key", "lyons", "pecsok", "perezfranco", "swatt"), class = "factor"),
question1 = structure(c(1L, 1L, 1L, 4L, 6L, 2L, 5L, 3L, 5L,
5L), .Label = c("0", "0.25", "1:02:01", "1:2 50%", "2-Jan",
"50%"), class = "factor"), points_q1 = structure(c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "", class = "factor"),
question2 = structure(c(8L, 10L, 6L, 5L, 2L, 3L, 7L, 1L,
4L, 9L), .Label = c(" a | b; A| Aa | Ab; b| ab | bb; the possibility that the offspring will be heterozygous is about 25%. The same goes for the homozygous recessive it is a 1:1:1:1",
"1/4 heterozygous for \xf1a\xee and 0 recessive for \xf1b\xee",
"16-Mar", "2-Jan", "3:1 25%", "4-Jan", "Male=aabb Female=AAbb Heterozygous is going to be 1/2. Homozygous is going to be 1/4.",
"possible offspring genotypes (each with probability of 0.25): AABb AaBb AAbb Aabb. Question is asking about probability of Aabb_ which is 0.25.",
"The square shows Ab Ab_ Bb Bb so 50% or 1/2. ", "Xa Yb (father) crossed with XA Xb (mother) = 1/2 "
), class = "factor"), points_q2 = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "", class = "factor"),
question3 = structure(c(4L, 5L, 3L, 5L, 5L, 5L, 7L, 2L, 6L,
1L), .Label = c("Codominance", "coheritance", "incomplete dominance",
"Incomplete dominance", "Incomplete dominance ", "Incomplete dominance. ",
"Independent Assortment"), class = "factor"), points_q3 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "", class = "factor"),
question4 = structure(c(3L, 4L, 2L, 3L, 6L, 3L, 7L, 1L, 5L,
4L), .Label = c("", "co-dominance", "Codominance", "Codominance ",
"Codominance. ", "Codominant ", "Independent Assortment? (Wrong)"
), class = "factor"), points_q4 = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "", class = "factor"),
question5 = structure(c(2L, 10L, 6L, 4L, 5L, 3L, 8L, 1L,
7L, 9L), .Label = c(" X | Y; X| XX | XY; x| Xx | xY; the percentage will be 25 % or 1/4 the same applies to the son ",
"0 for daughter_ because male can only give non-colorblind X chromosome (because he's not colorblind an only has one X chromosome). 0.25 for both son and colorblind.",
"0.25", "25% for son and 25% for daughter", "25% for the son and 25% for the daughter ",
"4-Jan", "50%", "Father=XY Mother=X2Y Therefore_ by using the punnet square_ I was able to show/understand that the probability of them having a son AND him being colorblind is 1/4.",
"To have a son or daughter is 50/50. To have a colorblind daughter is .25 whereas to have a colorblind son is .75 because it is carried on the X chromosome and the son is much more likely to inherit this because he has less x to work with",
"XcY (father) XC Xc (mother) Daughter is 1/4 son 1/4"), class = "factor"),
points_q5 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "", class = "factor"), question6 = structure(c(3L,
6L, 7L, 8L, 5L, 2L, 10L, 9L, 4L, 1L), .Label = c("Chromatids ",
"Chromosomes (diploids)", "homologous chromosome pairs",
"Homologous chromosome pairs are being separated. ", "Homologous chromosomes ",
"Homologous pairs ", "homologous pairs of chromosomes", "Homologus Chromosomes ",
"sister chromatids ", "Sister Chromatids?"), class = "factor"),
points_q6 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "", class = "factor"), question7 = structure(c(6L,
8L, 5L, 7L, 8L, 2L, 3L, 1L, 9L, 4L), .Label = c("", "Chromatids (haploids)",
"Daughter Chromosomes?", "One cell to 2", "sister chromatids",
"Sister chromatids", "Sister Chromatids", "Sister chromatids ",
"Sister chromatids within daughter cells are separating. "
), class = "factor"), points_q7 = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "", class = "factor"),
question8 = structure(c(1L, 4L, 1L, 2L, 4L, 2L, 3L, 6L, 5L,
3L), .Label = c("sister chromatids", "Sister chromatids",
"Sister Chromatids", "Sister chromatids ", "Sister chromatids are held together by the centromeres. In prophase chromosomes become visible. During metaphase chromosomes attach to spindles. During Anaphase the chromosomes are split apart and in telophase the cells start to create cleavage. ",
"sisters chromatides"), class = "factor"), points_q8 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "", class = "factor"),
question9 = structure(c(2L, 4L, 1L, 3L, 4L, 3L, 3L, 2L, 5L,
3L), .Label = c("prohase ", "prophase", "Prophase", "Prophase ",
"They condense during prophase before the rest of the phases. "
), class = "factor"), points_q9 = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "", class = "factor"),
question10 = structure(c(1L, 3L, 1L, 2L, 3L, 2L, 2L, 1L,
4L, 2L), .Label = c("anaphase", "Anaphase", "Anaphase ",
"During anaphase. "), class = "factor"), points_q10 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "", class = "factor"),
question11 = structure(c(3L, 4L, 3L, 4L, 4L, 4L, 4L, 3L,
1L, 2L), .Label = c("During prophase. ", "Telephase ", "telophase",
"Telophase"), class = "factor"), points_q11 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "", class = "factor"),
question12 = structure(c(1L, 3L, 1L, 2L, 3L, 2L, 3L, 1L,
4L, 2L), .Label = c("metaphase", "Metaphase", "Metaphase ",
"Metaphase. "), class = "factor"), points_q12 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "", class = "factor"),
question13 = structure(c(1L, 4L, 1L, 4L, 2L, 4L, 2L, 5L,
3L, 6L), .Label = c("centromere", "Centromere", "Centromere. ",
"Centromeres", "centromeres ", "Cleavage"), class = "factor"),
points_q13 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "", class = "factor")), .Names = c("first",
"last", "question1", "points_q1", "question2", "points_q2", "question3",
"points_q3", "question4", "points_q4", "question5", "points_q5",
"question6", "points_q6", "question7", "points_q7", "question8",
"points_q8", "question9", "points_q9", "question10", "points_q10",
"question11", "points_q11", "question12", "points_q12", "question13",
"points_q13"), row.names = c(NA, -10L), class = "data.frame")
which(finalDf[,1]=="key") is 1.
pointVals is c(NA, "2", "2", "2", "2", "2", "2", "2", "1", "1", "1", "1",
"1", "1")
For clarification, I'd want the final table to look something like:
First Last question1 points_q1 question2 points_q2 etc.
key key 0 2 "possible_offspring_genotypes..." 1 etc.
I have reduced your function based on my understanding , let me know if it gives what you want or if I have misunderstood something
addPointsToKeyRow = function(df, keyRowNum, searchString, pointsVector) {
#Find columns which has searchString in it
cols <- grepl(searchString, colnames(df))
#Check if the columns with searchString and length of pointsVector is the same
if (sum(cols) == length(pointsVector)) {
#Assign the value
df[keyRowNum,cols] <- pointsVector
}
#Return the updated dataframe
df
}
#Convert all the variables in the column from factor to character
df[] <- lapply(df, as.character)
#define the values to be replaced
pointVals <- c("2", "2", "2", "2", "2", "2", "2", "1", "1", "1", "1","1", "1")
#Call the function
df <- addPointsToKeyRow(df, 1, "points_q", pointsval)
#Check the dataframe
df
This question already has answers here:
Count number of rows within each group
(17 answers)
Closed 5 years ago.
I have a dataframe like this:
df <- structure(list(col1 = structure(c(1L, 1L, 2L, 3L, 1L, 3L, 1L,
3L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 4L), .Label = c("stock1",
"stock2", "stock3", "stock4"), class = "factor"), col2 = structure(c(4L,
5L, 7L, 6L, 5L, 5L, 5L, 6L, 6L, 8L, 8L, 4L, 3L, 3L, 1L, 2L, 3L
), .Label = c("comapny1", "comapny1+comapny4", "comapny4", "company1",
"company2", "company2+company1", "company3", "company4"), class = "factor"),
col3 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("predictor1", "predictor2"
), class = "factor")), .Names = c("col1", "col2", "col3"), class = "data.frame", row.names = c(NA,
-17L))
I would like to take the frequency from the three columns.
Expected output
df2 <- structure(list(col1 = structure(c(1L, 1L, 1L, 2L, 4L, 1L, 1L,
3L, 3L, 1L, 2L, 1L), .Label = c("stock1", "stock2", "stock3",
"stock4"), class = "factor"), col2 = structure(c(1L, 2L, 3L,
3L, 3L, 4L, 5L, 5L, 6L, 6L, 7L, 8L), .Label = c("comapany1",
"comapany1+comapany4", "comapany4", "company1", "company2", "company2+company1",
"company3", "company4"), class = "factor"), col3 = structure(c(2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("predictor1",
"predictor2"), class = "factor"), frequency = c(1L, 1L, 1L, 1L,
1L, 2L, 3L, 1L, 2L, 1L, 1L, 2L)), .Names = c("col1", "col2",
"col3", "frequency"), class = "data.frame", row.names = c(NA,
-12L))
How is it possible to make it?
We can use count
library(dplyr)
count(df, col1, col2, col3)
# A tibble: 12 x 4
# col1 col2 col3 n
# <fctr> <fctr> <fctr> <int>
# 1 stock1 comapny1 predictor2 1
# 2 stock1 comapny1+comapny4 predictor2 1
# 3 stock1 comapny4 predictor2 1
# 4 stock1 company1 predictor1 2
# 5 stock1 company2 predictor1 3
# 6 stock1 company2+company1 predictor1 1
# 7 stock1 company4 predictor1 2
# 8 stock2 comapny4 predictor2 1
# 9 stock2 company3 predictor1 1
#10 stock3 company2 predictor1 1
#11 stock3 company2+company1 predictor1 2
#12 stock4 comapny4 predictor2 1
Or with data.table
library(data.table)
setDT(df)[, .N, .(col1, col2, col3)]
(I have modified this question to make it more explicit.)
I have a dataset as follows:
data <- structure(list(id = 1:12, personID = c(1L, 2L, 3L, 4L, 4L, 3L,
2L, 1L, 1L, 2L, 3L, 4L), lastName = structure(c(1L, 2L, 3L, 4L,
4L, 3L, 2L, 1L, 1L, 2L, 3L, 4L), .Label = c("james", "joan",
"lucy", "mary"), class = "factor"), date = structure(c(5L, 5L,
8L, 9L, 6L, 1L, 3L, 11L, 4L, 2L, 7L, 10L), .Label = c("1/01/2012",
"10/04/2011", "11/01/2012", "11/08/2011", "12/01/2012", "12/04/2012",
"12/12/2011", "14/01/2012", "16/01/2012", "24/06/2010", "24/06/2011"
), class = "factor"), status = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 1L, 2L, 1L)), .Names = c("id", "personID", "lastName",
"date", "status"), class = "data.frame", row.names = c(NA, -12L
))
I need to extract a subset from the data frame to include records where each row occured more than once in a period of greater than 8 weeks.
The extraction needs to search from the oldest record and then select the next (more recent) additional record for the same personID that was greater then 8 weeks since the previous record. Upon finding another record older then 8 weeks it should repeat the process using what the more recent second record as the new starting point.
Thanks.
How about:
maxDiff <- tapply(data$date,data$personID,function(x) max(dist(x)))
subset(data,personID %in% names(maxDiff[maxDiff>(8*7)]))
id personID lastName date status
1 1 1 james 2012-01-12 1
4 4 4 mary 2012-01-16 1
5 5 4 mary 2012-04-12 1
8 8 1 james 2011-06-24 1
This will do the trick, though I'm sure someone else can give you a better answer.
require(plyr)
diffWeek <- function (df) {
abs(df$date[1] - df$date[2])}
eightWeeks <- 7*8 # 56 days
aux.data <- ddply(data, "lastName", function (df) diffWeek(df) > eightWeeks)
data[data$lastName %in% aux.data[aux.data[,2]==T,1],] # this willreturn the data.frame.
Note that my answer doesn't generalize well. If I have more time I'll try to generalize it. But it should work for now.