I'm wondering if it is possible to repeat a column value into sublevel records. The link between the main dataset (A) and the second dataset (B) is the hierarchy value.
For example:
A.Code   A.hierarchy   A.level   B.valuetoRepeat
John     1/            1         Senior
Smith    1/2           2         --> no data by default, need "Senior"
Wesson   1/2/3         3         --> no data by default, need "Senior"
Syl      2/            1         Junior
Ves      2/1           2         --> no data by default, need "Junior"
Ter      2/1/1         3         --> no data by default, need "Junior"
Another approach to get the same result could be to use parent-child key columns (still with datasets A and B):
A.Key   A.parentKey   A.Code   B.valuetoRepeat
1                     John     Senior
2       1             Smith    --> no data by default, need "Senior"
3       2             Wesson   --> no data by default, need "Senior"
4                     Syl      Junior
5       4             Ves      --> no data by default, need "Junior"
6       5             Ter      --> no data by default, need "Junior"
Thank you.
Regards,
You can do this in DAX using the PATH functions. It seems like you're trying to flatten a hierarchy, which is exactly what they're for.
https://learn.microsoft.com/en-us/dax/parent-and-child-functions-dax
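For the parent-child layout, here is a minimal sketch as calculated columns on table A (the column names, and the assumption that B[Key] holds the root-level key that valuetoRepeat is attached to, are taken from your example, so adjust them to your actual model):

-- build the key path for each row, e.g. "1|2|3" for Wesson
Hierarchy Path = PATH ( A[Key], A[parentKey] )

-- key of the top-level ancestor (level 1 of the path)
Root Key = PATHITEM ( A[Hierarchy Path], 1, INTEGER )

-- repeat the root row's value down to every descendant
Value To Repeat = LOOKUPVALUE ( B[valuetoRepeat], B[Key], A[Root Key] )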
I am trying to capture the number of times keywords show up after "not" words in a large number of comments, to gauge sentiment. To capture the words following the not-words I used quanteda's kwic() and created a dtm for the keywords based on the window after the match in the KWIC. My issue is that the KWIC dataframe is smaller than the original dataframe, and therefore I can't find the corresponding occurrences.
I have this:
library(dplyr)
library(quanteda)
text_column <- c("not safe","not safe and not listening","not safe never patient", "safe","not welcoming","nice people","corporate culture school tacos","successful words words coding","not scary")
test.df <- as.data.frame(text_column)
notwords <- c("not", "never", "don't", "seldom", "won't")
dict <- dictionary(list(possafety = c("open", "open-minded", "listen*", "safe*", "patien*", "underst*", "willing to help", "helpful", "tight-knit", "hear*", "engage*", "support*", "comfortable", "belong*", "welcom*", "inclu*", "value", "respect*", "always someone you can go to for questions", "accept*")))
rownumber   text
1           not safe
2           not safe and not listening
3           not safe and never patient
4           safe
5           not welcoming
6           nice people
7           corporate culture school tacos
8           successful words words coding
9           not scary
and I want to get this:
rownumber   text                             notpossafe
1           not safe                         1
2           not safe and not listening       2
3           not safe and never patient       2
4           safe                             0
5           not welcoming                    1
6           nice people                      0
7           corporate culture school tacos   0
8           successful words words coding    0
9           not scary                        0
I tried creating a row number variable, filtering the KWIC dataframe for occurrences, and using an ifelse statement to check whether the row number was found in that dataframe, but that still only gives me a 0 or a 1, and I need a count for instances like rows two and three where there is more than one occurrence.
not.df <- as.data.frame(kwic(test.df$text_column, pattern = notwords, window = 2))
not.df$rownumber <- as.numeric(gsub(".*?([0-9]+).*", "\\1", not.df$docname))
corptextnot <- corpus(not.df, text_field = "post")
dtmtextnot <- dfm(corptextnot)
dict_dtmtextnot = dfm_lookup(dtmtextnot, dict, exclusive = TRUE)
nottextdict.df <- as.data.frame(dict_dtmtextnot)
not.df$safe <- nottextdict.df$possafety
not.df <- filter(not.df, safe > 0)
test.df$rownumber <- seq_len(nrow(test.df))  # row id, to match against not.df$rownumber
test.df$notpossafe <- ifelse(test.df$rownumber %in% not.df$rownumber, 1, 0)
This only gives me:
rownumber   text                             notpossafe
1           not safe                         1
2           not safe and not listening       1
3           not safe and never patient       1
4           safe                             0
5           not welcoming                    1
6           nice people                      0
7           corporate culture school tacos   0
8           successful words words coding    0
9           not scary                        0
Is there a way to count the number of times an ifelse test is positive and use that count, or is there a way to match values between two dataframes of different sizes, or, more fundamentally, is there a better tool for what I am trying to do?
You can do this with the stringr package using stringr::str_count and paste with collapse = "|":
test.df$notpossafe <- stringr::str_count(test.df$text_column,
paste(notwords, collapse = "|"))
Output:
#                      text_column notpossafe
# 1                       not safe          1
# 2     not safe and not listening          2
# 3         not safe never patient          2
# 4                           safe          0
# 5                  not welcoming          1
# 6                    nice people          0
# 7 corporate culture school tacos          0
# 8  successful words words coding          0
# 9                      not scary          1
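Note that this counts the not-words themselves, so "not scary" also yields 1 even though no positive-safety word follows. If the count should additionally require a dictionary term shortly after the not-word (reproducing the 0 for row 9), one rough sketch is a combined regex; safety_stems below is an assumed, abbreviated stand-in for your possafety dictionary entries:

library(stringr)

# abbreviated stand-ins for the possafety dictionary patterns; extend as needed
safety_stems <- c("safe", "listen", "patien", "welcom")

# a not-word, at most one intervening token, then a safety stem
pattern <- paste0(
  "\\b(", paste(notwords, collapse = "|"), ")\\b",
  "(\\s+\\S+)?\\s+(", paste(safety_stems, collapse = "|"), ")"
)

test.df$notpossafe <- str_count(test.df$text_column,
                                regex(pattern, ignore_case = TRUE))
# "not scary" now yields 0; "not safe and not listening" still yields 2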
I would like to show the departments that use the same vendor, via the vendor code, in a very big dataset. I guess I will need a loop for that, but I am not really sure how to start.
For example, for each vendor code I want to see all the departments that use it, but only if it is used by 3 or more departments.
see the sample of data here
Here's a base R solution.
# get the repeated values
dat_tb <- table(dat$vendor_code)
# select for the condition and print from the whole data set
dat[ dat$vendor_code %in% names(dat_tb[ dat_tb > 2 ]), ]
vendor_code department
2 9966 dept2
3 9966 dept3
8 9966 dept8
9 9966 dept9
Data:
dat <- data.frame( vendor_code=rep(c(3344,9966,9966,3444,5566,3388),2),
department=paste0("dept",1:12))
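If you prefer dplyr, the same condition can be expressed per group (assuming "used by 3 or more departments" means 3 or more distinct departments per vendor code):

library(dplyr)

dat %>%
  group_by(vendor_code) %>%
  filter(n_distinct(department) >= 3) %>%
  ungroup()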
I generated a dataset holding two distinct columns: an ID column associated with a customer and another column listing his/her active products:
head(df_itemList)
ID PRD_LISTE
1 1 A,B,C
3 2 C,D
4 3 A,B
5 4 A,B,C,D,E
7 5 B,A,D
8 6 A,C,D
I only selected customers that own more than one product. In total I have 589,454 rows and there are 16 different products.
Next, I wrote the data.frame to a csv-file like this:
df_itemList$ID <- NULL
colnames(df_itemList) <- c("itemList")
write.csv(df_itemList, "Basket_List_13-08-2020.csv", row.names = TRUE)
Then, I converted the csv-file into a basket format in order to apply the apriori algorithm as implemented in the arules-package.
library(arules)
txn <- read.transactions(file="Basket_List_13-08-2020.csv",
rm.duplicates= TRUE, format="basket",sep=",",cols=1)
txn@itemInfo$labels <- gsub("\"", "", txn@itemInfo$labels)
The summary-function yields the following output:
summary(txn)
transactions as itemMatrix in sparse format with
589455 rows (elements/itemsets/transactions) and
1737 columns (items) and a density of 0.0005757052
most frequent items:
A,C A,B C,F C,D
57894 32150 31367 29434
A,B,C (Other)
29035 409575
element (itemset/transaction) length distribution:
sizes
1
589455
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 1 1 1 1 1
includes extended item information - examples:
labels
1 G,H,I,A,B,C,D,F,J
2 G,H,I,A,B,C,F
3 G,H,I,A,B,K,D
includes extended transaction information - examples:
transactionID
1
2 1
3 3
Now, I tried to run the apriori-algorithm:
basket_rules <- apriori(txn, parameter = list(sup = 1e-15,
conf = 1e-15, minlen = 2, target="rules"))
This is the output:
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen maxlen target ext
0.01 0.1 1 none FALSE TRUE 5 1e-15 2 10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 0
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[1737 item(s), 589455 transaction(s)] done [0.20s].
sorting and recoding items ... [1737 item(s)] done [0.00s].
creating transaction tree ... done [0.16s].
checking subsets of size 1 done [0.00s].
writing ... [0 rule(s)] done [0.00s].
creating S4 object ... done [0.04s].
Even with a ridiculously low support and confidence, no rules are generated...
summary(basket_rules)
set of 0 rules
Is this really because of my dataset? Or was there a mistake in my code?
Your summary shows that the data is not read in correctly:
most frequent items:
A,C A,B C,F C,D
57894 32150 31367 29434
A,B,C (Other)
29035 409575
Looks like "A,C" is read as an item, but it should be two items "A" and "C". The separating character does not work. I assume that could be because of quotation marks in the file. Make sure that Basket_List_13-08-2020.csv looks correct. Also, you need to skip the first line (headers) using skip = 1 when you read the transactions.
@Michael I am quite positive now that there is something wrong with the .csv-file I am reading in. Since others have experienced similar problems, my guess is that this is the common reason for the error. Can you please describe what the .csv-file should look like when read in?
When typing in data <- read.csv("file.csv", header = TRUE, sep = ",") I get the following data.frame:
X Prd
1 A
2 A,B
3 B,A
4 B
5 C
Is it correct that, if there are multiple products for a customer X, these products are all written in a single column? Or should they be written in different columns?
Furthermore, when running txn <- read.transactions(file="Versicherungen2_ItemList_Short.csv", rm.duplicates= TRUE, format="basket", sep=",", cols=1, skip=1) followed by summary(txn), I see the following problem:
most frequent items:
A B C A,B B,A
1256 1235 456 235 125
(numbers are chosen randomly)
So the read.transactions function differentiates between "A,B" and "B,A"... so I am guessing there is something wrong with the .csv-file.
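For reference, a basket file that read.transactions parses as intended contains one transaction per line, items separated by the sep character, with no quotes and no header, for example (illustrative content):

A,B,C
C,D
A,B

If "A,B" still shows up as a single item, the entries are most likely wrapped in quotation marks in the file, so the comma is never seen as a separator.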
I have a perhaps simple problem, but I can't solve it.
I have two lists. List A is empty and list B has several named elements. Now I want to select an element of B by a variable and put it into list A, somehow like in the example shown below:
A<-list()
B<-list()
VAR<-"a"
B$a<-c(1:10)
B$b<-c(10:20)
B$c<-c(20:30)
# This of course doesn't work...
A$VAR<-B$VAR
You can extract a list entry by variable with B[[VAR]] and assign a new entry to a list the same way (A[[VAR]] <- newEntry):
A[[VAR]] <- B[[VAR]]
A
# $a
#  [1]  1  2  3  4  5  6  7  8  9 10
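For what it's worth, the reason A$VAR <- B$VAR fails is that $ treats its argument as a literal name, so it looks for an element of B literally called "VAR", while [[ evaluates the variable first:

B$VAR     # NULL: B has no element literally named "VAR"
B[[VAR]]  # 1 2 3 ... 10, because VAR evaluates to "a"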
I've got some problems filtering duplicate elements in a string. My data look similar to this:
idvisit path
1 1,16,23,59,16
2 2,14,19,14
3 5,19,23
4 10,21
5 23,27,29,23
I have a column containing a unique ID and a column containing a path for web page navigation. The path column contains cases where pages were accessed twice or more, but with some different pages between these accesses. I want to filter() the rows where a page occurs twice or more and at least one other page lies in between the two accesses, so the data should look like this:
idvisit path
1 1,16,23,59,16
2 2,14,19,14
5 23,27,29,23
I just want to keep the rows that match these conditions. I really don't know how to handle a string like this when the numbers it contains vary.
You can filter based on the number of elements in each string: strings with duplicated entries have more elements than unique elements, i.e.
df1[sapply(strsplit(as.character(df1$path), ','), function(i) length(unique(i)) != length(i)),]
# idvisit path
#1 1 1,16,23,59,16
#2 2 2,14,19,14
#5 5 23,27,29,23
We can try
library(data.table)
lst <- strsplit(df1$path, ",")
df1[lengths(lst) != sapply(lst, uniqueN),]
# idvisit path
#1 1 1,16,23,59,16
#2 2 2,14,19,14
#5 5 23,27,29,23
Or an option using tidyverse
library(tidyverse)
separate_rows(df1, path) %>%
group_by(idvisit) %>%
filter(n_distinct(path) != n()) %>%
summarise(path = toString(path))
You could try regular expressions too with grepl:
df[grepl('.*([0-9]+),.*,\\1', as.character(df$path)),]
# idvisit path
#1 1 1,16,23,59,16
#2 2 2,14,19,14
#5 5 23,27,29,23
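One caveat with that pattern, for what it's worth: the backreference is not anchored to comma boundaries, so a path like "2,14,19,4" would also match ("4" against the tail of "14"). A boundary-anchored sketch of the same idea:

# anchor the captured number and its repeat to whole comma-separated tokens
df[grepl('(^|,)([0-9]+)(,.*)?,\\2(,|$)', as.character(df$path)),]
# idvisit          path
#1       1 1,16,23,59,16
#2       2    2,14,19,14
#5       5   23,27,29,23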