An R function to work out the preceding value - r

I'm trying to create a table of staff, who they report to and what level they are.
I've been working with a similar table, and #TonakShah was kind enough to help me with calculating the lowest level location is and the level above is using the solution below.
My employee table looks like this:
input = structure(list(Level.1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "Board", class = "factor"), Level.2 = structure(c(2L,
2L, 2L, 1L, 1L, 3L, 3L), .Label = c("Aasha", "Grace", "Marisol"
), class = "factor"), Level.3 = structure(c(4L, 4L, 3L, 1L, 1L,
2L, 2L), .Label = c("Alex", "Chandler", "Millie", "Tushad"), class = "factor"),
Level.4 = structure(c(2L, 2L, 6L, 1L, 5L, 3L, 4L), .Label = c("#",
"Frank", "Joey", "Rachel", "Sarah", "Tony"), class = "factor"),
Level.5 = structure(c(3L, 2L, 1L, 1L, 1L, 4L, 1L), .Label = c("#",
"Lela", "Millie", "Ross"), class = "factor"), Level.6 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = "#", class = "factor")), class = "data.frame", row.names = c(NA,
-7L))
and using the technique described here by Ronak (stackoverflow.com/questions/56903188/create-a-table-from-a-hierarchy/)
which is,
as.data.frame(t(apply(input, 1, function(x)
{new_x = x[x != "###"]; c(rev(tail(new_x, 2)), length(new_x)) })))
I can get most of the required table. But I'm having trouble trying to get "the bosses" (eg. those with employees but are not "the board").
My ideal output would look something like this (I've added colnames to make it easier to understand):
structure(list(Subordinate = structure(c(9L, 4L, 14L, 5L, 7L,
13L, 9L, 2L, 1L, 12L, 11L, 6L, 3L, 8L, 10L), .Label = c("Aasha",
"Alex", "Chandler", "Frank", "Grace", "Joey", "Lela", "Marisol",
"Millie", "Rachel", "Ross", "Sarah", "Tony", "Tushad"), class = "factor"),
Boss = structure(c(5L, 10L, 6L, 3L, 5L, 9L, 6L, 1L, 3L, 2L,
7L, 4L, 8L, 3L, 4L), .Label = c("Aasha", "Alex", "Board",
"Chandler", "Frank", "Grace", "Joey", "Marisol", "Millie",
"Tushad"), class = "factor"), Level = c(5L, 4L, 3L, 2L, 5L,
4L, 3L, 3L, 2L, 4L, 5L, 4L, 3L, 2L, 4L)), class = "data.frame", row.names = c(NA,
-15L))
I think I maybe do it with a loop, but this doesn't seem to be the best answer. Can anyone offer any other tips?

Couldn't come up with a prettier solution but this works. Using a while loop in the apply call used previously, we can do
output <- do.call(rbind.data.frame, apply(input, 1, function(x) {
new_x = as.character(x[x != "#"])
list_df <- list()
i = 1
while(length(new_x) >= 2) {
#Get last 2 eneteries
list_df[[i]] <- c(rev(tail(new_x, 2)), length(new_x))
#Go one level deeper
new_x = head(new_x, -1)
i = i +1
}
do.call(rbind, list_df)
}))
#To remove duplicate enteries
output[!duplicated(output), ]
# V1 V2 V3
#1 Millie Frank 5
#2 Frank Tushad 4
#3 Tushad Grace 3
#4 Grace Board 2
#5 Lela Frank 5
#9 Tony Millie 4
#10 Millie Grace 3
#12 Alex Aasha 3
#13 Aasha Board 2
#14 Sarah Alex 4
#17 Ross Joey 5
#18 Joey Chandler 4
#19 Chandler Marisol 3
#20 Marisol Board 2
#21 Rachel Chandler 4

Related

Frequency of one dataframe rows from another dataframe

Can someone help me how to count from another dataframe?
df1(out)
structure(list(Item = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), class = "factor", .Label = "0S1576"), LC = structure(c(1L,
1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L), class = "factor", .Label = c("MW92",
"OY01", "RM11")), Fiscal.Month = c("2019-M06", "2019-M07", "2019-M06",
"2019-M07", "2019-M08", "2019-M09", "2019-M06", "2019-M07", "2019-M08"
)), row.names = c(NA, -9L), class = "data.frame")
df2(tempdf)
structure(list(Item = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "0S1576", class = "factor"),
LC = structure(c(1L, 1L, 1L, 1L, 2L, 3L, 4L, 6L, 5L, 1L,
2L, 2L, 3L, 3L), .Label = c("MW92", "OY01", "RM11", "RS11",
"WK14", "WK15"), class = "factor"), Fiscal.Month = structure(c(1L,
2L, 3L, 4L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("2019-M06",
"2019-M07", "2019-M08", "2019-M09"), class = "factor"), fcst = c(22L,
21L, 20L, 19L, 12L, 10L, 10L, 12L, 10L, 12L, 10L, 10L, 10L,
10L)), row.names = c(NA, -14L), class = "data.frame")
I want to count the frequency of Item,LC,Fiscal.month of df1 from df2
You can count using table and merge df1 with df2 by using factor and you need interaction as you use more than one column to merge.
table(factor(interaction(df2[c("Item","LC","Fiscal.Month")]), levels=interaction(df1)))
#0S1576.MW92.2019-M06 0S1576.MW92.2019-M07 0S1576.OY01.2019-M06
# 2 1 3
#0S1576.OY01.2019-M07 0S1576.OY01.2019-M08 0S1576.OY01.2019-M09
# 0 0 0
#0S1576.RM11.2019-M06 0S1576.RM11.2019-M07 0S1576.RM11.2019-M08
# 3 0 0
Or a speed improved version using match and tabulate:
(df1$freq <- tabulate(match(interaction(df2[c("Item","LC","Fiscal.Month")]), interaction(df1)), nrow(df1)))
#[1] 2 1 3 0 0 0 3 0 0
Or sometimes even faster using fastmatch:
library(fastmatch)
df1$freq <- tabulate(fmatch(interaction(df2[c("Item","LC","Fiscal.Month")]), interaction(df1)), nrow(df1))

Warning message In `[<-.factor`(`*tmp*`, iseq, value = foo) : invalid factor level, NA generated when trying to add vector to row subset

I'm writing a function that attempts to add values in a single row of a data.frame in several columns at once:
require(stringr)
addPointsToKeyRow = function(df, keyRowNum, searchStringForPointColNames, pointsVector){
colsWithMatchingSearchResults = str_match(colnames(df), searchStringForPointColNames)
pointColNums = (which(!is.na(colsWithMatchingSearchResults)))
pointsVectorCleaned = pointsVector[!is.na(pointsVector)]
print(is.vector(pointsVectorCleaned)) #Returns TRUE
print(is.data.frame(pointsVectorCleaned)) #Returns FALSE
print(pointsVectorCleaned)
if(length(pointsVectorCleaned) == length(pointColNums)){
newDf = data.frame(df, stringsAsFactors = FALSE)
newDf[keyRowNum, pointColNums] = as.character(pointsVectorCleaned)
#for(i in 1:length(pointColNums)){
# newDf[keyRowNum,pointColNums[i]]=as.character(pointsVectorCleaned[i])
#}
print(newDf[keyRowNum,])
}
}
When I apply the function to my data (addPointsToKeyRow(finalDf, which(finalDf[,1]=="key"), "points_q", pointVals)), I get the following warnings:
In [<-.factor(*tmp*, iseq, value = "2") :
invalid factor level, NA generated
I've looked for the error on SO and other sites, and the recommendation always seems to be to make sure your data.frame has stringsAsFactors = FALSE.
I think my issue might be that when I subset the data.frame (newDf[keyRowNum, pointColNums]), it no longer keeps stringsAsFactors = FALSE.
Regardless of whether that's the issue or not, I'd very much welcome some help solving this weird issue. Many thanks in advance!
For the sake of an example, let's say df is:
df = structure(list(first = structure(c(7L, 9L, 5L, 4L, 10L, 2L, 3L,
6L, 1L, 8L), .Label = c("autumn", "spring", "summer", "winter",
"july", "betty", "november", "echo", "victor", "tango"), class = "factor"),
last = structure(c(6L, 2L, 4L, 5L, 1L, 8L, 3L, 9L, 10L, 7L
), .Label = c("brummett1", "do", "drorbaugh", "galeno", "gerber",
"key", "lyons", "pecsok", "perezfranco", "swatt"), class = "factor"),
question1 = structure(c(1L, 1L, 1L, 4L, 6L, 2L, 5L, 3L, 5L,
5L), .Label = c("0", "0.25", "1:02:01", "1:2 50%", "2-Jan",
"50%"), class = "factor"), points_q1 = structure(c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "", class = "factor"),
question2 = structure(c(8L, 10L, 6L, 5L, 2L, 3L, 7L, 1L,
4L, 9L), .Label = c(" a | b; A| Aa | Ab; b| ab | bb; the possibility that the offspring will be heterozygous is about 25%. The same goes for the homozygous recessive it is a 1:1:1:1",
"1/4 heterozygous for \xf1a\xee and 0 recessive for \xf1b\xee",
"16-Mar", "2-Jan", "3:1 25%", "4-Jan", "Male=aabb Female=AAbb Heterozygous is going to be 1/2. Homozygous is going to be 1/4.",
"possible offspring genotypes (each with probability of 0.25): AABb AaBb AAbb Aabb. Question is asking about probability of Aabb_ which is 0.25.",
"The square shows Ab Ab_ Bb Bb so 50% or 1/2. ", "Xa Yb (father) crossed with XA Xb (mother) = 1/2 "
), class = "factor"), points_q2 = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "", class = "factor"),
question3 = structure(c(4L, 5L, 3L, 5L, 5L, 5L, 7L, 2L, 6L,
1L), .Label = c("Codominance", "coheritance", "incomplete dominance",
"Incomplete dominance", "Incomplete dominance ", "Incomplete dominance. ",
"Independent Assortment"), class = "factor"), points_q3 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "", class = "factor"),
question4 = structure(c(3L, 4L, 2L, 3L, 6L, 3L, 7L, 1L, 5L,
4L), .Label = c("", "co-dominance", "Codominance", "Codominance ",
"Codominance. ", "Codominant ", "Independent Assortment? (Wrong)"
), class = "factor"), points_q4 = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "", class = "factor"),
question5 = structure(c(2L, 10L, 6L, 4L, 5L, 3L, 8L, 1L,
7L, 9L), .Label = c(" X | Y; X| XX | XY; x| Xx | xY; the percentage will be 25 % or 1/4 the same applies to the son ",
"0 for daughter_ because male can only give non-colorblind X chromosome (because he's not colorblind an only has one X chromosome). 0.25 for both son and colorblind.",
"0.25", "25% for son and 25% for daughter", "25% for the son and 25% for the daughter ",
"4-Jan", "50%", "Father=XY Mother=X2Y Therefore_ by using the punnet square_ I was able to show/understand that the probability of them having a son AND him being colorblind is 1/4.",
"To have a son or daughter is 50/50. To have a colorblind daughter is .25 whereas to have a colorblind son is .75 because it is carried on the X chromosome and the son is much more likely to inherit this because he has less x to work with",
"XcY (father) XC Xc (mother) Daughter is 1/4 son 1/4"), class = "factor"),
points_q5 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "", class = "factor"), question6 = structure(c(3L,
6L, 7L, 8L, 5L, 2L, 10L, 9L, 4L, 1L), .Label = c("Chromatids ",
"Chromosomes (diploids)", "homologous chromosome pairs",
"Homologous chromosome pairs are being separated. ", "Homologous chromosomes ",
"Homologous pairs ", "homologous pairs of chromosomes", "Homologus Chromosomes ",
"sister chromatids ", "Sister Chromatids?"), class = "factor"),
points_q6 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "", class = "factor"), question7 = structure(c(6L,
8L, 5L, 7L, 8L, 2L, 3L, 1L, 9L, 4L), .Label = c("", "Chromatids (haploids)",
"Daughter Chromosomes?", "One cell to 2", "sister chromatids",
"Sister chromatids", "Sister Chromatids", "Sister chromatids ",
"Sister chromatids within daughter cells are separating. "
), class = "factor"), points_q7 = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "", class = "factor"),
question8 = structure(c(1L, 4L, 1L, 2L, 4L, 2L, 3L, 6L, 5L,
3L), .Label = c("sister chromatids", "Sister chromatids",
"Sister Chromatids", "Sister chromatids ", "Sister chromatids are held together by the centromeres. In prophase chromosomes become visible. During metaphase chromosomes attach to spindles. During Anaphase the chromosomes are split apart and in telophase the cells start to create cleavage. ",
"sisters chromatides"), class = "factor"), points_q8 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "", class = "factor"),
question9 = structure(c(2L, 4L, 1L, 3L, 4L, 3L, 3L, 2L, 5L,
3L), .Label = c("prohase ", "prophase", "Prophase", "Prophase ",
"They condense during prophase before the rest of the phases. "
), class = "factor"), points_q9 = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "", class = "factor"),
question10 = structure(c(1L, 3L, 1L, 2L, 3L, 2L, 2L, 1L,
4L, 2L), .Label = c("anaphase", "Anaphase", "Anaphase ",
"During anaphase. "), class = "factor"), points_q10 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "", class = "factor"),
question11 = structure(c(3L, 4L, 3L, 4L, 4L, 4L, 4L, 3L,
1L, 2L), .Label = c("During prophase. ", "Telephase ", "telophase",
"Telophase"), class = "factor"), points_q11 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "", class = "factor"),
question12 = structure(c(1L, 3L, 1L, 2L, 3L, 2L, 3L, 1L,
4L, 2L), .Label = c("metaphase", "Metaphase", "Metaphase ",
"Metaphase. "), class = "factor"), points_q12 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "", class = "factor"),
question13 = structure(c(1L, 4L, 1L, 4L, 2L, 4L, 2L, 5L,
3L, 6L), .Label = c("centromere", "Centromere", "Centromere. ",
"Centromeres", "centromeres ", "Cleavage"), class = "factor"),
points_q13 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "", class = "factor")), .Names = c("first",
"last", "question1", "points_q1", "question2", "points_q2", "question3",
"points_q3", "question4", "points_q4", "question5", "points_q5",
"question6", "points_q6", "question7", "points_q7", "question8",
"points_q8", "question9", "points_q9", "question10", "points_q10",
"question11", "points_q11", "question12", "points_q12", "question13",
"points_q13"), row.names = c(NA, -10L), class = "data.frame")
which(finalDf[,1]=="key") is 1.
pointVals is c(NA, "2", "2", "2", "2", "2", "2", "2", "1", "1", "1", "1",
"1", "1")
For clarification, I'd want the final table to look something like:
First Last question1 points_q1 question2 points_q2 etc.
key key 0 2 "possible_offspring_genotypes..." 1 etc.
I have reduced your function based on my understanding , let me know if it gives what you want or if I have misunderstood something
addPointsToKeyRow = function(df, keyRowNum, searchString, pointsVector) {
#Find columns which has searchString in it
cols <- grepl(searchString, colnames(df))
#Check if the columns with searchString and length of pointsVector is the same
if (sum(cols) == length(pointsVector)) {
#Assign the value
df[keyRowNum,cols] <- pointsVector
}
#Return the updated dataframe
df
}
#Convert all the variables in the column from factor to character
df[] <- lapply(df, as.character)
#define the values to be replaced
pointVals <- c("2", "2", "2", "2", "2", "2", "2", "1", "1", "1", "1","1", "1")
#Call the function
df <- addPointsToKeyRow(df, 1, "points_q", pointsval)
#Check the dataframe
df

How can I use R to fill rows based on column?

I have the following table
Code Name Class
1
2 Monday day
5 green color
9
6
1 red color
1
2
9 Tuesday day
6
5
Goal is to the fill the Name and Class columns based on the Code column of a filled row. For example, the second row is filled and the code is 2. I would like to fill all the rows where code = 2 with Name=Monday and Class=day.
I tried using fill() from tidyR but that seems to require ordered data.
structure(list(Code = c(1L, 2L, 5L, 9L, 6L, 1L, 1L, 2L, 9L, 6L,
5L), Name = structure(c(1L, 3L, 2L, 1L, 1L, 4L, 1L, 1L, 5L, 1L,
1L), .Label = c("", "green", "Monday", "red", "Tuesday"), class = "factor"),
Class = structure(c(1L, 3L, 2L, 1L, 1L, 2L, 1L, 1L, 3L, 1L,
1L), .Label = c("", "color", "day"), class = "factor")), .Names = c("Code",
"Name", "Class"), class = "data.frame", row.names = c(NA, -11L
))
library(dplyr)
final_df <- left_join(df, df[df$Name!='',], by='Code')[,c(1,4:5)]
colnames(final_df) <- colnames(df)
final_df

Taking the frequency of three different columns [duplicate]

This question already has answers here:
Count number of rows within each group
(17 answers)
Closed 5 years ago.
I have a dataframe like this:
df <- structure(list(col1 = structure(c(1L, 1L, 2L, 3L, 1L, 3L, 1L,
3L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 4L), .Label = c("stock1",
"stock2", "stock3", "stock4"), class = "factor"), col2 = structure(c(4L,
5L, 7L, 6L, 5L, 5L, 5L, 6L, 6L, 8L, 8L, 4L, 3L, 3L, 1L, 2L, 3L
), .Label = c("comapny1", "comapny1+comapny4", "comapny4", "company1",
"company2", "company2+company1", "company3", "company4"), class = "factor"),
col3 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("predictor1", "predictor2"
), class = "factor")), .Names = c("col1", "col2", "col3"), class = "data.frame", row.names = c(NA,
-17L))
I would like to take the frequency from the three columns.
Expected output
df2 <- structure(list(col1 = structure(c(1L, 1L, 1L, 2L, 4L, 1L, 1L,
3L, 3L, 1L, 2L, 1L), .Label = c("stock1", "stock2", "stock3",
"stock4"), class = "factor"), col2 = structure(c(1L, 2L, 3L,
3L, 3L, 4L, 5L, 5L, 6L, 6L, 7L, 8L), .Label = c("comapany1",
"comapany1+comapany4", "comapany4", "company1", "company2", "company2+company1",
"company3", "company4"), class = "factor"), col3 = structure(c(2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("predictor1",
"predictor2"), class = "factor"), frequency = c(1L, 1L, 1L, 1L,
1L, 2L, 3L, 1L, 2L, 1L, 1L, 2L)), .Names = c("col1", "col2",
"col3", "frequency"), class = "data.frame", row.names = c(NA,
-12L))
How is it possible to make it?
We can use count
library(dplyr)
count(df, col1, col2, col3)
# A tibble: 12 x 4
# col1 col2 col3 n
# <fctr> <fctr> <fctr> <int>
# 1 stock1 comapny1 predictor2 1
# 2 stock1 comapny1+comapny4 predictor2 1
# 3 stock1 comapny4 predictor2 1
# 4 stock1 company1 predictor1 2
# 5 stock1 company2 predictor1 3
# 6 stock1 company2+company1 predictor1 1
# 7 stock1 company4 predictor1 2
# 8 stock2 comapny4 predictor2 1
# 9 stock2 company3 predictor1 1
#10 stock3 company2 predictor1 1
#11 stock3 company2+company1 predictor1 2
#12 stock4 comapny4 predictor2 1
Or with data.table
library(data.table)
setDT(df)[, .N, .(col1, col2, col3)]

extract a subset of a data frame where records are separated by a specific time period

(I have modified this question to make it more explicit.)
I have a dataset as follows:
data <- structure(list(id = 1:12, personID = c(1L, 2L, 3L, 4L, 4L, 3L,
2L, 1L, 1L, 2L, 3L, 4L), lastName = structure(c(1L, 2L, 3L, 4L,
4L, 3L, 2L, 1L, 1L, 2L, 3L, 4L), .Label = c("james", "joan",
"lucy", "mary"), class = "factor"), date = structure(c(5L, 5L,
8L, 9L, 6L, 1L, 3L, 11L, 4L, 2L, 7L, 10L), .Label = c("1/01/2012",
"10/04/2011", "11/01/2012", "11/08/2011", "12/01/2012", "12/04/2012",
"12/12/2011", "14/01/2012", "16/01/2012", "24/06/2010", "24/06/2011"
), class = "factor"), status = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 1L, 2L, 1L)), .Names = c("id", "personID", "lastName",
"date", "status"), class = "data.frame", row.names = c(NA, -12L
))
I need to extract a subset from the data frame to include records where each row occured more than once in a period of greater than 8 weeks.
The extraction needs to search from the oldest record and then select the next (more recent) additional record for the same personID that was greater then 8 weeks since the previous record. Upon finding another record older then 8 weeks it should repeat the process using what the more recent second record as the new starting point.
Thanks.
How about:
maxDiff <- tapply(data$date,data$personID,function(x) max(dist(x)))
subset(data,personID %in% names(maxDiff[maxDiff>(8*7)]))
id personID lastName date status
1 1 1 james 2012-01-12 1
4 4 4 mary 2012-01-16 1
5 5 4 mary 2012-04-12 1
8 8 1 james 2011-06-24 1
This will do the trick, though I'm sure someone else can give you a better answer.
require(plyr)
diffWeek <- function (df) {
abs(df$date[1] - df$date[2])}
eightWeeks <- 7*8 # 56 days
aux.data <- ddply(data, "lastName", function (df) diffWeek(df) > eightWeeks)
data[data$lastName %in% aux.data[aux.data[,2]==T,1],] # this willreturn the data.frame.
Note that my answer doesn't generalize well. If I have more time I'll try to generalize it. But it should work for now.

Resources