Bugs in converting Categorical dataset to Transaction dataset in R - r

I have been working on feature engineering and have come up with the following code to convert a categorical dataset into a binary transaction dataset. However, I am not able to get the desired output: only three columns get populated with ones, and all the rest remain zero.
Here's my code:
`binarize<-function(dataset,names){
drops<-NA
colnames(dataset)<- paste("{",names,"}:",sep = "")
dataset<-dataset[,!(names(dataset)%in%drops)]
dataset$id = 1:nrow(dataset)
xt<-dataset %>% mutate_each(funs(as.character))%>% gather(key,values,starts_with("{"))
xt$values <- apply( xt[ , c("key","values") ] , 1 , paste , collapse = "" )
xt<-xt %>% mutate(present = 1)%>% spread(values,present,fill = 0)
xt<-xt%>% distinct(id)
rmCols<-c("id","key")
xt<-xt[,-which(names(xt) %in% rmCols)]
return(xt)
}`
Can someone help me debug this? names in the argument is a character vector.
Thanks,
Prerit
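One possible way to approach the conversion described above, as a minimal base R sketch rather than a fix of the posted function: build one 0/1 column per "{column}:value" item. The helper name binarize_sketch and the toy data are hypothetical and only serve as an illustration.

```r
# Hypothetical helper: one indicator (0/1) column per "{column}:value" item.
binarize_sketch <- function(dataset) {
  cols <- lapply(names(dataset), function(nm) {
    values <- as.character(dataset[[nm]])
    lv <- sort(unique(values[!is.na(values)]))
    # 1 where the row has this value, 0 otherwise (NA counts as absent)
    out <- sapply(lv, function(v) as.integer(!is.na(values) & values == v))
    out <- matrix(out, nrow = nrow(dataset))  # stay a matrix even for one row
    colnames(out) <- paste0("{", nm, "}:", lv)
    out
  })
  # check.names = FALSE keeps the "{...}:" style column names untouched
  data.frame(do.call(cbind, cols), check.names = FALSE)
}

# Toy data, made up for this example
toy <- data.frame(colour = c("red", "blue", "red"),
                  size   = c("S", "M", "S"),
                  stringsAsFactors = FALSE)
result <- binarize_sketch(toy)
print(result)
```

Each row of the result has exactly one 1 per original column, which is a quick sanity check to run against your own output.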

Related

Error when doing Panel VAR (panelvar package) in R

I tried to run a panel VAR on a dataset I got from Statistics Sweden, and here is what I get:
df<- read_excel("Inkfördelning per kommun.xlsx")
nujavlar <- pvarfeols(dependent_vars = c("Kvintil-1", "Kvintil-4", "Kvintil-5"),
lags = 1,
transformation = "demean",
data = df,
panel_identifier = c("Kommun", "Year")
)
Error: Can't subset columns that don't exist.
x Column `Kvintil-1` doesn't exist.
I often get this message too:
Warning in xtfrm.data.frame(x) : cannot xtfrm data frames
Error: Can't subset columns that don't exist.
x Location 2 doesn't exist.
ℹ There are only 1 column.
I have made sure that all data is numeric. I have also tried cleaning my workspace and restarting the program, converting the data into a panel data frame with the plm package, and converting my entity variable "Kommun" (Municipality) into a factor. It still doesn't work.
Here's the data if someone wants to give it a go.
https://docs.google.com/spreadsheets/d/16Ak_Z2n6my-5wEw69G29_NLryQKcrYZC/edit?usp=sharing&ouid=113164216369677216623&rtpof=true&sd=true
The column names in your data frame are `Kvintil 1`, not `Kvintil-1`, so the column you are referring to really does not exist. Be aware that in R, syntactic variable names cannot contain hyphens or spaces; such names only work when wrapped in backticks, so it is good practice to avoid them. I have included a reproducible example below.
library(tidyverse)
library(gsheet)
library(panelvar)
url <- 'docs.google.com/spreadsheets/d/16Ak_Z2n6my-5wEw69G29_NLryQKcrYZC'
df <- gsheet2tbl(url) %>%
rename(Kvintil1 = `Kvintil 1`) %>%
rename(Kvintil2 = `Kvintil 2`) %>%
rename(Kvintil3 = `Kvintil 3`) %>%
rename(Kvintil4 = `Kvintil 4`) %>%
rename(Kvintil5 = `Kvintil 5`) %>%
as.data.frame()
nujavlar <- pvarfeols(
dependent_vars = c("Kvintil1", "Kvintil4", "Kvintil5"),
lags = 1,
transformation = "demean",
data = df,
panel_identifier = c("Kommun", "Year"))
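If there are many such columns, renaming them one by one gets tedious. As a rough sketch (the toy data frame df_demo is made up and only mimics the column layout), the spaces can be stripped from all column names in one step with base R:

```r
# Toy data frame with non-syntactic column names (values are invented)
df_demo <- data.frame(Kommun = c("A", "B"),
                      Year = c(2019, 2019),
                      `Kvintil 1` = c(1.2, 3.4),
                      `Kvintil 4` = c(5.6, 7.8),
                      check.names = FALSE)

# Remove every space from every column name: "Kvintil 1" becomes "Kvintil1"
names(df_demo) <- gsub(" ", "", names(df_demo), fixed = TRUE)
print(names(df_demo))
```

The resulting names can then be passed to dependent_vars as plain strings.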

Power BI - creating custom r visuals relationship error

I desperately need help!
I am trying to predict drug use based on 5 characteristics: Age, Gender, Education, Ethnicity, Country. I have already built a tree model in R with rpart
DrugTree3 <- rpart(formula = DrugUser ~ Age+Gender+Education+Ethnicity+Country, data = traindata)
, a logistic regression model
DrugLog <- glm(formula = DrugUser ~ Age+Gender+Ethnicity+Education+Country,data = traindata, family = binomial)
, and a knn model
KnnModel <- train(form = DrugUser~., data = ModelData,method ='knn',tuneGrid=expand.grid(.k=1:100),metric='Accuracy',trControl=trainControl(method='repeatedcv',number=10,repeats=10)) .
I saved those as RDS files and uploaded them successfully in Power BI.
I then created a table for each characteristic and created OKViz filters for them.
Then I tried to predict whether a customer is classified as a drug user or a non-drug user based on the selections in the OKViz filters. This is when everything went horribly wrong:
I created a custom R visual for each model prediction and inserted the following code in each visual:
# The following code to create a dataframe and remove duplicated rows is always executed and acts as a preamble for your script:
# dataset <- data.frame(chunk_id, model_id, model_str, AgeLabel, GenderLabel, CountryLabel, EducationLabel, EthnicityLabel)
# dataset <- unique(dataset)
# Paste or type your script code here:
library(dplyr)
from_byte_string = function(x) {
xcharvec = strsplit(x, " ")[[1]]
xhex = as.hexmode(xcharvec)
xraw = as.raw(xhex)
unserialize(xraw)
}
# R Visual imports tables with read.csv but no argument for strings_as_factors = F.
# This means some of the chunks are truncated (ie if they had a " " at the end).
# If you convert to a character and add a space if nchar == 9999 the deserialization works.
# (Thanks to Danny Shah)
dataset <- dataset %>%
mutate( model_str = as.character(model_str) ) %>%
mutate( model_str = ifelse(nchar(model_str) == 9999, paste0(model_str, " "), model_str) )
model_vct <- dataset %>%
filter(model_id == 1) %>%
distinct(model_id, chunk_id, model_str) %>%
arrange(model_id, chunk_id) %>%
pull(model_str)
finalfit.str <- paste( model_vct, collapse = "" )
finalfit <- from_byte_string(finalfit.str)
# get the user parameters
userdata <- dataset %>% select(AgeLabel,GenderLabel,CountryLabel,EducationLabel,EthnicityLabel) %>% unique()
# and then using them to make a prediction
myprediction <- predict(finalfit,newdata=data.frame(Age=userdata$AgeLabel,Gender=userdata$GenderLabel,Country=userdata$CountryLabel, Education=userdata$EducationLabel,Ethnicity=userdata$EthnicityLabel))
maxpred <- which(myprediction==max(myprediction))
myclass <- maxpred - 1
myprob <- myprediction[[maxpred]]
plot.new()
text(0.5,0.5,labels=sprintf("P(class = %s) = %s",myclass,as.character(round(myprob,2))),cex=3.5)
Error: Can't determine relationship between fields.
What has gone wrong here?
When I then click on the diagonal arrow to open the script in RStudio, I get this: Unable to construct R script data for use in external R IDE.
I need help as I am literally going crazy over this and I don't know how to resolve the issue! I would be really happy if you can help me
You made an error in lines 25 and 34.
Below is a fixed version of your code.
# The following code to create a dataframe and remove duplicated rows is always executed and acts as a preamble for your script:
# dataset <- data.frame(chunk_id, model_id, model_str, AgeLabel, GenderLabel, CountryLabel, EducationLabel, EthnicityLabel)
# dataset <- unique(dataset)
# Paste or type your script code here:
library(dplyr)
from_byte_string = function(x) {
xcharvec = strsplit(x, " ")[[1]]
xhex = as.hexmode(xcharvec)
xraw = as.raw(xhex)
unserialize(xraw)
}
# R Visual imports tables with read.csv but no argument for strings_as_factors = F.
# This means some of the chunks are truncated (ie if they had a " " at the end).
# If you convert to a character and add a space if nchar == 9999 the deserialization works.
# (Thanks to Danny Shah)
dataset <- dataset %>%
mutate( model_str = as.character(model_str) ) %>%
mutate( model_str = ifelse(nchar(model_str) == 9999, paste0(model_str, " "), model_str) )
model_vct <- dataset %>%
filter(model_id == 1) %>%
distinct(model_id, chunk_id, model_str) %>%
arrange(model_id, chunk_id) %>%
pull(model_str)
finalfit.str <- paste( model_vct, collapse = "" )
finalfit <- from_byte_string(finalfit.str)
# get the user parameters
userdata <- dataset %>% select(AgeLabel,GenderLabel,CountryLabel,EducationLabel,EthnicityLabel) %>% unique()
# and then using them to make a prediction
myprediction <- predict(finalfit,newdata=data.frame(Age=userdata$AgeLabel,Gender=userdata$GenderLabel,Country=userdata$CountryLabel, Education=userdata$EducationLabel,Ethnicity=userdata$EthnicityLabel))
maxpred <- which(myprediction==max(myprediction))
myclass <- maxpred - 1
myprob <- myprediction[[maxpred]]
plot.new()
text(0.5,0.5,labels=sprintf("P(class =
Good Luck!
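For readers unfamiliar with the from_byte_string helper used in the script: it rebuilds an R object from a space-separated string of hex bytes. The round trip can be checked outside Power BI with a small sketch. The encoder to_byte_string below is a hypothetical counterpart written for this illustration, not something taken from the original script.

```r
# Hypothetical encoder: serialize an object and write its bytes as hex pairs
to_byte_string <- function(obj) {
  paste(as.character(serialize(obj, connection = NULL)), collapse = " ")
}

# Decoder with the same logic as the one in the script above
from_byte_string <- function(x) {
  xcharvec <- strsplit(x, " ")[[1]]
  xhex <- as.hexmode(xcharvec)
  xraw <- as.raw(xhex)
  unserialize(xraw)
}

# Any R object works; a small list stands in for a fitted model here
original <- list(a = 1:3, b = "x")
restored <- from_byte_string(to_byte_string(original))
print(identical(restored, original))
```

If the round trip fails for your real model string, the chunks were probably altered or reordered on import, which is what the nchar == 9999 workaround in the script is meant to address.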

How to run a function (many times) that changes variable (tibble) in global env

I'm a newbie in R, so please have some patience and... tips are most welcome.
My goal is to create a tibble that holds a "Full Name" (of a person, which may consist of 2 to 4 names) and his/her gender. I must start from a tibble that contains typical male and female names.
Below I present a minimal working example.
My problem: I can call get_name() multiple times (in a for loop with 10,000 iterations!) and get the right answer. But I was looking for a more 'elegant' way of doing it. replicate() unfortunately returns a vector, which makes it unusable.
My doubts: I know I have some issues, like the if statement, which is evaluated on every call (which is redundant), but I can't find another way to do it. Any suggestions?
Any other suggestions about code structure are also welcome.
Thank you very much in advance for your help.
# Dummy name list
unit_names <- tribble(
~Women, ~Man,
"fem1", "male1",
"fem2", "male2",
"fem3", "male3",
"fem4", "male4",
"fem5", "male5",
"fem6", NA,
"fem7", NA
)
set.seed(12345) # seed for test
# Create a tibble with the full names
full_name <- tibble("Full Name" = character(), "Gender" = character() )
get_name <- function() {
# Get the Number of 'Unit-names' to compose a 'Full-name'
nbr_names <- sample(2:4, 1, replace = TRUE)
# Randomize the Gender
gender <- sample(c("Women", "Man"), 1, replace = TRUE)
if (gender == "Women") {
lim_names <- sum( !is.na(unit_names$"Women"))
} else {
lim_names <- sum( !is.na(unit_names$"Man"))
}
# Sample the Fem/Man List names (may have duplicate)
sample(unlist(unit_names[1:lim_names, gender]), nbr_names, replace = TRUE) %>%
# Form a Full-name
paste ( . , collapse = " ") %>%
# Add it to the tibble (INCLUDE the Gender)
add_row(full_name, "Full Name" = . , "Gender" = gender)
}
# How can I make 10k of this?
full_name <- get_name()
If you pass a larger number than 1 to sample this problem becomes easier to vectorise.
One thing that currently makes your problem much harder is the layout of your unit_names table: you are effectively treating male and female names as individually paired, but they clearly aren’t: hence they shouldn’t be in columns of the same table. Use a list of two vectors, for instance:
unit_names = list(
Women = c("fem1", "fem2", "fem3", "fem4", "fem5", "fem6", "fem7"),
Men = c("male1", "male2", "male3", "male4", "male5")
)
Then you can generate random names to your heart’s delight:
generate_names = function (n, unit_names) {
name_length = sample(2 : 4, n, replace = TRUE)
genders = sample(c('Women', 'Men'), n, replace = TRUE)
names = Map(sample, unit_names[genders], name_length, replace = TRUE) %>%
lapply(paste, collapse = ' ') %>%
unlist()
tibble(`Full name` = names, Gender = genders)
}
A note on style, unlike your function the above doesn’t use any global variables. Furthermore, don’t "quote" variable names (you do this in unit_names$"Women" and for the arguments of add_row). R allows this, but this is arguably a mistake in the language specification: these are not strings, they’re variable names, making them look like strings is misleading. You don’t quote your other variable names, after all. You do need to backtick-quote the `Full name` column name, since it contains a space. However, the use of backticks, rather than quotes, signifies that this is a variable name.
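The same vectorised idea can also be written without any packages. The following is only a sketch under that assumption; the function name generate_names_base is made up, and it returns a plain data frame instead of a tibble.

```r
# Name pools as a list of two vectors, as suggested above
unit_names <- list(
  Women = paste0("fem", 1:7),
  Men   = paste0("male", 1:5)
)

# Hypothetical base R variant of generate_names()
generate_names_base <- function(n, unit_names) {
  name_length <- sample(2:4, n, replace = TRUE)
  genders <- sample(names(unit_names), n, replace = TRUE)
  # For each person: draw name_length[i] unit names from the matching pool
  full <- mapply(
    function(pool, k) paste(sample(pool, k, replace = TRUE), collapse = " "),
    unit_names[genders], name_length,
    USE.NAMES = FALSE
  )
  data.frame(`Full name` = full, Gender = genders,
             check.names = FALSE, stringsAsFactors = FALSE)
}

set.seed(12345)
full_name <- generate_names_base(10000, unit_names)
print(head(full_name))
```

Everything the function needs is passed in as an argument, so it can be called any number of times without touching a global variable.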
I am not 100% sure what you are trying to get, but if I understood it right: did you try mutate from dplyr? For example:
result= mutate(data.frame,
concated_column = paste(column1, column2, column3, column4, sep = '_'))
With a LITTLE help from Konrad Rudolph, I arrived at the elegant (and vectorized, and fast) solution that I was looking for. map2 does the necessary trick.
Here is the full working example if someone needs it:
(Just a side note: I kept the initial conversion from tibble to list because the data arrives to me as a tibble...)
Once again thanks to Konrad.
# Dummy name list
unit_names <- tribble(
~Women, ~Men,
"fem1", "male1",
"fem2", "male2",
"fem3", "male3",
"fem4", "male4",
"fem5", "male5",
"fem6", NA,
"fem7", NA
)
name_list <- list(
Women = unit_names$Women[!is.na(unit_names$Women)],
Men = unit_names$Men[!is.na(unit_names$Men)]
)
generate_names = function (n, name_list) {
name_length = sample(2 : 4, n, replace = TRUE)
genders = sample(c('Women', 'Men'), n, replace = TRUE)
#names = lapply(name_list[genders], sample, name_length) %>%
names = map2(name_list[genders], name_length, sample, replace = TRUE) %>%
lapply(paste, collapse = ' ') %>%
unlist()
tibble(`Full name` = names, Gender = genders)
}
full_name <- generate_names(10000, name_list)

dplyr mutate inside for loop - Issue

I am performing Data Analysis and cleaning in R using tidyverse.
I have a Data Frame with 23 columns containing values 'NO','STEADY','UP' and 'down'.
I want to change all the values in these 23 columns to 0 in the case of 'NO' or 'STEADY' and to 1 otherwise.
What I did: I created a vector named keys that holds all my column names, and then used a for loop, ifelse and mutate.
Please have a look at the code below
# Column names are kept in the list by name keys
keys = c('metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
'glipizide', 'glyburide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol',
'insulin', 'glyburide-metformin', 'tolazamide', 'metformin-pioglitazone',
'metformin-rosiglitazone', 'glimepiride-pioglitazone', 'glipizide-metformin',
'troglitazone', 'tolbutamide', 'acetohexamide')
After that, i used following code to get the desired result :
for (col in keys){
Dataset = Dataset %>%
mutate(col = ifelse(col %in% c('No','Steady'),0,1)) }
I was expecting this to make the changes I require, but nothing happens. (NO ERROR MESSAGE AND NO DESIRED RESULT)
After that, i researched further and executed following code
for (col in keys){
print(col)}
It gives me elements of list as characters like - "metformin"
So I thought this might be the issue. Hence, I used the code below to cast the keys to symbols:
keys_new = sym(keys)
After that i again ran the same code:
for (col in keys_new){
Dataset = Dataset %>%
mutate(col = ifelse(col %in% c('No','Steady'),0,1))}
It gives me following Error -
Error in match(x, table, nomatch = 0L) :
'match' requires vector arguments
After all this, I also tried to write a function to get the desired result, but that didn't work either:
change = function(name){
Dataset = Dataset %>%
mutate(name = ifelse(name %in% c('No','Steady'),0,1),
name = as.factor(name))
return(Dataset)}
for (col in keys){
change(col)}
This didn't perform any action. (NO ERROR MESSAGE AND NO DESIRED RESULT)
When keys_new is placed in this code:
for (col in keys_new){
change(col)}
I got the same Error :
Error in match(x, table, nomatch = 0L) :
'match' requires vector arguments
PLEASE GUIDE
There's no need to loop or keep track of column names. You can use mutate_all -
Dataset %>%
mutate_all(~ifelse(. %in% c('No','Steady'), 0, 1))
Another way, thanks to Rui Barradas -
Dataset %>%
mutate_all(~as.integer(!. %in% c('No','Steady')))
There's a simpler way using mutate_at and case_when.
Dataset %>% mutate_at(keys, ~case_when(. %in% c("NO", "STEADY") ~ 0, TRUE ~ 1))
mutate_at will only mutate the columns specified in the keys variable. case_when then lets you replace one value by another by some condition.
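For comparison, the same recoding restricted to the columns in keys can be sketched in base R. The toy Dataset below is invented and uses only two of the drug columns; the labels 'No' and 'Steady' follow the spelling used in the question's code. In dplyr 1.0.0 and later the equivalent would be mutate(across(all_of(keys), ...)).

```r
# Toy data, invented for this example
Dataset <- data.frame(
  metformin = c("No", "Steady", "Up", "Down"),
  insulin   = c("Up", "No", "No", "Steady"),
  age       = c(50, 60, 70, 80),
  stringsAsFactors = FALSE
)
keys <- c("metformin", "insulin")

# Recode only the columns named in keys: 0 for "No"/"Steady", 1 otherwise
Dataset[keys] <- lapply(Dataset[keys],
                        function(x) as.integer(!x %in% c("No", "Steady")))
print(Dataset)
```

Columns not listed in keys, such as age here, are left untouched.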
This answer shows how to use mutate inside a for loop.
I don't have your data, so I made my own: I turned the keys into a tibble using enframe, spread it into columns with the row number as the value of each column, and then checked whether each value is higher than 10 or not.
Inside mutate(), writing col = ... creates a column literally named "col" instead of using the name stored in col. To use the stored column name, you have to use !! and := in the mutate call:
df <- enframe(c('metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
'glipizide', 'glyburide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol',
'insulin', 'glyburide-metformin', 'tolazamide', 'metformin-pioglitazone',
'metformin-rosiglitazone', 'glimepiride-pioglitazone', 'glipizide-metformin',
'troglitazone', 'tolbutamide', 'acetohexamide')
) %>% spread(key = value,value = name)
keys = c('metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
'glipizide', 'glyburide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol',
'insulin', 'glyburide-metformin', 'tolazamide', 'metformin-pioglitazone',
'metformin-rosiglitazone', 'glimepiride-pioglitazone', 'glipizide-metformin',
'troglitazone', 'tolbutamide', 'acetohexamide')
for (col in keys){
df = df %>%
mutate(!!as.character(col) := ifelse( df[[col]] > 10,0,100) )
}

How do I avoid 'NA' values when coercing a .tsv column into numeric via as.numeric?

I have a dataframe with several columns from a .tsv file and want to transform one of them into the 'numeric' type for analysis. However, I keep getting the 'NAs' introduced by coercion warning all the time and do not know exactly why. There is some unnecessary info at the beginning of another column, which is pretty much the only formatting I did.
Originally, I thought the file might have added some extra tabs or spaces, which is why I tried to delete these via giving sub() as an argument.
I should also mention that I get the NA errors also when I do not replace the values and run the dataframe as is:
library(tidyverse)
data_2018 <- read_tsv('teina230.tsv')
data_1995 <- read_csv('OECD_1995.csv')
#get rid of long colname & select only columns containing %GDP
clean_data_2018 <- data_2018 %>%
select('na_item,sector,unit,geo','2018Q1','2018Q2','2018Q3','2018Q4') %>%
rename(country = 'na_item,sector,unit,geo')
clean_data_2018 <- clean_data_2018[grep("PC_GDP", clean_data_2018$'country'), ]
#remove unnecessary info
clean_data_2018 <- clean_data_2018 %>%
mutate(country=gsub('\\GD,S13,PC_GDP,','',country))
clean_data_2018 <- clean_data_2018 %>%
mutate(
'2018Q1'=as.numeric(sub("", "", '2018Q1', fixed = TRUE)),
'2018Q2'=as.numeric(sub(" ", "", '2018Q2', fixed = TRUE)),
'2018Q3'=as.numeric(sub(" ", "", '2018Q3', fixed = TRUE)),
'2018Q4'=as.numeric(sub(" ", "", '2018Q4', fixed = TRUE))
)
Is there another way to get around the problem and convert the column without replacing all the values with 'NA'?
Thanks guys :)
Thanks for the hint, @divibisan!
Renaming the columns via rename() actually solved the problem. Here is the code that finally worked:
library(tidyverse)
data_2018 <- read_tsv('teina230.tsv')
#get rid of long colname & select only columns containing %GDP
clean_data_2018 <- data_2018 %>%
select('na_item,sector,unit,geo','2018Q1','2018Q2','2018Q3','2018Q4') %>%
rename(country = 'na_item,sector,unit,geo',
quarter_1 = '2018Q1',
quarter_2 = '2018Q2',
quarter_3 = '2018Q3',
quarter_4 = '2018Q4')
clean_data_2018 <- clean_data_2018[grep("PC_GDP", clean_data_2018$'country'), ]
#remove unnecessary info
clean_data_2018 <- clean_data_2018 %>%
mutate(country=gsub('\\GD,S13,PC_GDP,','',country))
clean_data_2018 <- clean_data_2018 %>%
mutate(
quarter_1 = as.numeric(quarter_1),
quarter_2 = as.numeric(quarter_2),
quarter_3 = as.numeric(quarter_3),
quarter_4 = as.numeric(quarter_4)
)
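The reason renaming helps: in the original attempt, '2018Q1' in quotes is a string literal, so as.numeric() was asked to convert the text "2018Q1" itself, which gives NA for every row. A column whose name starts with a digit has to be referenced with backticks instead (this also works inside mutate). A small base R sketch with made-up values:

```r
# Toy column with a non-syntactic name (values are invented)
demo <- data.frame(`2018Q1` = c("1.5", " 2.5"),
                   check.names = FALSE, stringsAsFactors = FALSE)

# Quoted: this converts the literal string "2018Q1", not the column
literal <- suppressWarnings(as.numeric("2018Q1"))
print(literal)

# Backticks: this refers to the column itself
converted <- as.numeric(trimws(demo$`2018Q1`))
print(converted)
```

With syntactic names such as quarter_1, neither quotes nor backticks are needed.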
