vcdExtra::datasets not working on some packages

R3.6.1, vcdExtra 0.7.1
vcdExtra::datasets("caret")
Error in get(x) : object 'GermanCredit' not found
vcdExtra::datasets fails on some packages like "caret".
Am I missing something?
thanks

If you only require the dataset of German Credit, try this code:
library(caret)
data("GermanCredit")
GermanCredit
And you will get:
Duration Amount InstallmentRatePercentage ResidenceDuration Age NumberExistingCredits NumberPeopleMaintenance Telephone
1 6 1169 4 4 67 2 1 0
2 48 5951 2 2 22 1 1 1
3 12 2096 2 3 49 1 2 1
4 42 7882 2 4 45 1 2 1
5 24 4870 3 4 53 2 2 1
Please comment if this is what you need.
Regards,
Alexis

This is the sequence of commands I need to run for vcdExtra::datasets("caret") to work correctly:
library(evtree)
library(caret)
data(Sacramento)
data(tecator)
data(BloodBrain)
data(cox2)
data(dhfr)
data(oil)
data(mdrr)
data(pottery)
data(scat)
data(segmentationData)
vcdExtra::datasets("caret")
The output is
Item class dim Title
1 GermanCredit data.frame 1000x21 German Credit Data
2 Sacramento data.frame 932x9 Sacramento CA Home Prices
3 absorp matrix 215x100 Fat, Water and Protein Content of Meat Samples
4 bbbDescr data.frame 208x134 Blood Brain Barrier Data
5 cars data.frame 50x2 Kelly Blue Book resale data for 2005 model year GM cars
6 cox2Class factor 462 COX-2 Activity Data
7 cox2Descr data.frame 462x255 COX-2 Activity Data
8 cox2IC50 numeric 462 COX-2 Activity Data
9 dhfr data.frame 325x229 Dihydrofolate Reductase Inhibitors Data
10 endpoints matrix 215x3 Fat, Water and Protein Content of Meat Samples
11 fattyAcids data.frame 96x7 Fatty acid composition of commercial oils
12 logBBB numeric 208 Blood Brain Barrier Data
13 mdrrClass factor 528 Multidrug Resistance Reversal (MDRR) Agent Data
14 mdrrDescr data.frame 528x342 Multidrug Resistance Reversal (MDRR) Agent Data
15 oilType factor 96 Fatty acid composition of commercial oils
16 potteryClass factor 58 Pottery from Pre-Classical Sites in Italy
17 scat data.frame 110x19 Morphometric Data on Scat
18 scat_orig data.frame 122x20 Morphometric Data on Scat
19 segmentationData data.frame 2019x61 Cell Body Segmentation
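The manual data() calls above can probably be automated. The following is a minimal sketch, assuming the failure happens because the package's datasets are not loaded into the workspace until data() is called. The helper name load_package_data is hypothetical, and the base "datasets" package is used only as a stand-in so the example runs without caret installed.

```r
# Hypothetical helper: load every dataset a package lists via data().
# Items such as "absorp (tecator)" are stored in the file named in
# parentheses, so that name is extracted before calling data().
load_package_data <- function(pkg, envir = globalenv()) {
  items <- utils::data(package = pkg)$results[, "Item"]
  files <- unique(sub("^.*\\((.*)\\)$", "\\1", items))
  utils::data(list = files, package = pkg, envir = envir)
  invisible(files)
}

# Demonstration with the base "datasets" package as a stand-in
demo_env <- new.env()
loaded <- load_package_data("datasets", envir = demo_env)
exists("mtcars", envir = demo_env, inherits = FALSE)

# With caret installed, the same idea would be:
# load_package_data("caret")
# vcdExtra::datasets("caret")
```

Loading into a separate environment in the demonstration keeps the global workspace clean; for the vcdExtra call itself the datasets would need to be loaded where get() can find them, which is why the helper defaults to the global environment.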

Related

Is there an R function to help me plot the network connections for a single node?

This is my original dataset. R1, R2 and R3 are word association responses for the cue word. tf and df are the total frequency and document frequency of the cue word, respectively.
[1]: https://i.stack.imgur.com/wpfZy.png [Image shows original dataframe]
I have cleaned up a dataset into a nodes list and an edge list. I have over a million rows in both lists. Plotting this as a network graph would take too long, and also be very dense, i.e. not understandable.
[2]: https://i.stack.imgur.com/mfSfN.png [Image shows node-list]
[3]: https://i.stack.imgur.com/l60Eu.png [Image shows edge-list]
I want to be able to make a network graph for the cue words, such that upon entering a cue word, I get a network of words that are either responses to it, or are words that the cue word is a response for.
For example, I want to see all the connections for the word 'money'. Using filter(nword == "money") only shows the node 'money' as an output, but I want all nodes connected to the cue word (in this case, 'money').
[4]: https://i.stack.imgur.com/1bKrr.png [Image shows filter()]
Is there a function or a chunk of code that would help me resolve this issue?
from to
1 1
1 6
1 8
1 17
1 18
1 22
1 23
1 38
1 67
1 80
2 82736
2 88035
2 103428
3 11
3 27
3 45
node_id nword n
1 money 13633
2 food 12338
3 water 12276
4 car 8907
5 music 8351
6 green 7890
7 red 7623
8 love 7406
9 sex 6552
10 happy 6432
11 cold 6333
12 bad 6132
13 sad 5958
14 dog 5940
15 white 5910
16 school 5832
17 fun 5594
18 time 5467
19 black 5233
20 hair 5219
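One possible approach is to filter the edge list rather than the node list: keep every edge that has the cue word at either end, then keep the nodes those edges touch. The sketch below uses base R and small made-up data frames with the same column names as the tables above (node_id, nword, from, to); the helper name ego_edges is hypothetical.

```r
# Toy node and edge lists with the same columns as in the question
nodes <- data.frame(
  node_id = 1:6,
  nword = c("money", "food", "water", "car", "music", "green"),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  from = c(1, 1, 2, 3, 4),
  to   = c(6, 2, 3, 1, 5)
)

# Keep every edge where the cue word is the source or the target
ego_edges <- function(word, nodes, edges) {
  id <- nodes$node_id[nodes$nword == word]
  edges[edges$from %in% id | edges$to %in% id, , drop = FALSE]
}

money_edges <- ego_edges("money", nodes, edges)

# Nodes that appear in at least one of the retained edges
money_ids <- unique(c(money_edges$from, money_edges$to))
money_nodes <- nodes[nodes$node_id %in% money_ids, ]
money_nodes$nword
```

The resulting money_edges and money_nodes are small enough to pass to a plotting package. If the data is already an igraph object, igraph::make_ego_graph() extracts the same kind of one-step neighbourhood.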

Remove NA's from a stacked bar chart created using likertplot function from the HH package

I am creating stacked-bar-charts using the likertplot function from the HH package to display summary results from a recent student survey.
The code I have used to produce this plot is:
likertplot(Subgroup ~ . | Group, data = SOCIETIES_DATA,
           as.percent = TRUE,
           main = 'Did you attend the City Societies Fair?',
           ylab = NULL,
           scales = list(y = list(relation = "free")),
           between = list(y = 0),
           layout = c(1, 5))
Here SOCIETIES_DATA is my data frame containing frequency data for the number of students from particular demographics who selected each answer to a single question (in this case, whether they attended the societies fair). Group is a column holding the names of the demographic categories (e.g. Age, Accommodation) and Subgroup holds the categories within each group (e.g. for Age: <18, 18-20, 21-24 etc.).
Unfortunately, I am getting unwanted NA values on the second Y axis of the chart for particular variables (in my example, Age and Fee Status).
[Image shows the likert plot output from R]
My data is formatted the same way as other data I have used to create likertplots, for which I have had no issues. I therefore suspect the problem lies in the likertplot call rather than in the data.
Most likely, the error occurs in the scales = argument, since changing it affects the number of NA levels shown in each section of the stacked bar chart.
I have read through the documentation for the likertplot function in the HH package as well as Heiberger and Robbins (2014) Design of Diverging Stacked Bar Charts for Likert Scales and Other Applications, but have found no solutions to this issue.
The data I have used is presented below.
Did not attend Yes and poor range of stalls Yes and good range of stalls Subgroup Group
1 107 23 155 Halls Accommodation
2 81 7 54 Home Accommodation
3 10 2 5 Prefer not to answer Accommodation
4 71 13 90 Rented private accommodation Accommodation
5 9 1 4 <18 Age
6 192 33 220 18-20 Age
7 37 6 64 21-24 Age
8 27 4 17 25-39 Age
9 6 1 1 40 and over Age
10 2 0 1 Prefer not to answer Age
11 29 6 57 EU Fee Status
12 195 31 198 Home Fee Status
13 34 8 43 International Fee Status
14 15 0 9 Prefer not to answer Fee Status
15 48 10 59 Arts, Design and Social Sciences Faculty
16 75 10 86 Business and Law Faculty
17 34 12 64 Engineering and Environment Faculty
18 53 8 59 Health and Life Sciences - City Campus Faculty
19 59 5 36 Health and Life Sciences - Coach Lane Campus Faculty
20 52 6 61 Foundation Study Mode
21 1 1 1 Postgraduate Research Study Mode
22 13 2 18 Postgraduate Taught Study Mode
23 207 36 227 Undergraduate Study Mode
Any help would be greatly appreciated.
I was able to solve this myself, and the answer was actually pretty simple. The subgroup labels must be unique across groups. I had the option 'Prefer not to answer' under both Age and Fee Status, which was causing the error.
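One way to apply that fix without renaming categories by hand is to append the group name to any subgroup label that occurs more than once. This is only a sketch on a cut-down, made-up version of SOCIETIES_DATA with the Subgroup and Group columns; the bracketed suffix is an arbitrary choice.

```r
# Cut-down stand-in for SOCIETIES_DATA (only the two label columns)
SOCIETIES_DATA <- data.frame(
  Subgroup = c("<18", "Prefer not to answer", "EU", "Prefer not to answer"),
  Group    = c("Age", "Age", "Fee Status", "Fee Status"),
  stringsAsFactors = FALSE
)

# Flag every label that appears in more than one row
repeated <- SOCIETIES_DATA$Subgroup[duplicated(SOCIETIES_DATA$Subgroup)]
is_dup <- SOCIETIES_DATA$Subgroup %in% repeated

# Make the flagged labels unique by appending the group name
SOCIETIES_DATA$Subgroup[is_dup] <- paste0(
  SOCIETIES_DATA$Subgroup[is_dup], " (", SOCIETIES_DATA$Group[is_dup], ")"
)

SOCIETIES_DATA$Subgroup
```

After this step each subgroup label belongs to exactly one group, so the free y scales should no longer need placeholder NA rows.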

Observations with low frequency go all in train set and produce error in predict()

I have a dataset (~14410 rows) with observations including the country. I divide this set into train and test set and train my data using decision tree with the rpart() function. When it comes to predicting, sometimes I get the error that test set has countries which are not in train set.
At first I excluded/deleted the countries which appeared only once:
# Get countries with frequency one
var.names <- names(which(table(mydata1$country) == 1))
# Drop all rows for those countries (also safe when var.names is empty)
mydata1 <- mydata1[!mydata1$country %in% var.names, ]
When rerunning my code, I get the same error at the same code line, saying that I have new countries in test which are not in train.
Now I did a count to see how often a country appears.
count <- as.data.frame(count(mydata1, vars=mydata1$country))
count[rev(order(count$n)),]
vars n
3 Bundesrep. Deutschland 7616
9 Grossbritannien 1436
12 Italien 930
2 Belgien 731
22 Schweden 611
23 Schweiz 590
13 Japan 587
19 Oesterreich 449
17 Niederlande 354
8 Frankreich 276
18 Norwegen 238
7 Finnland 130
21 Portugal 105
5 Daenemark 65
26 Spanien 57
4 China 55
20 Polen 51
27 Taiwan 31
14 Korea Süd 30
11 Irland 26
29 Tschechien 13
16 Litauen 9
10 Hong Kong 7
30 <NA> 3
6 Estland 3
24 Serbien 2
1 Australien 2
28 Thailand 1
25 Singapur 1
15 Kroatien 1
From this I can see, I also have NA's in my data.
My question now is, how can I proceed with this problem?
Should I exclude/delete all countries with, e.g., fewer than 7 observations, or should I duplicate the rows for those countries so that predict() always works, also for other data sets?
It's somehow not "fancy" just to delete the rows...is there any other possibility?
You need to convert every chr variable to a factor:
mydata1$country <- as.factor(mydata1$country)
Then you can simply proceed with the train/test split. You won't need to remove anything (except NAs).
With type factor, the model knows the set of possible levels that a country observation can take:
Example:
country <- factor("Italy", levels = c("Italy", "USA", "UK")) # just 3 levels for example
country
[1] Italy
Levels: Italy USA UK
# note that as.factor() takes care of defining the levels for you
See the difference with:
country <- "Italy"
country
[1] "Italy"
By using a factor, the model knows all the possible levels. Because of this, even if the train data has no "Italy" observation, the model knows that "Italy" can appear in the test data.
A factor is usually the appropriate type for categorical character variables in models.
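The point about levels can be checked on a small example. In this sketch the data frame is made up (only the country column mirrors the question), and the split deliberately puts the single "Thailand" row into the test set.

```r
set.seed(1)

# Toy data: one country appears only once
mydata1 <- data.frame(
  country = c(rep("Italien", 5), rep("Japan", 4), "Thailand"),
  y = rnorm(10),
  stringsAsFactors = FALSE
)

# Convert before splitting so both parts share the full set of levels
mydata1$country <- as.factor(mydata1$country)

test_idx <- which(mydata1$country == "Thailand")
train <- mydata1[-test_idx, ]
test  <- mydata1[test_idx, ]

# "Thailand" has no rows in train, but is still a known level there
levels(train$country)
identical(levels(train$country), levels(test$country))
```

Subsetting a data frame keeps unused factor levels, which is why the conversion has to happen before the split; calling droplevels() on the train set afterwards would undo the effect.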

Algorithm to optimally define groups based on multiple responses in R

I have a scheduling puzzle that I am looking for suggestions/solutions using R.
Context
I am coordinating a series of live online group discussions where registered participants will be grouped according to their availability. In a survey, 28 participants (id) indicated morning, afternoon, or evening (am, after, pm) availability on days Monday through Saturday (18 possibilities). I need to generate groups of 4-6 participants who are available at the same time, without replacement (meaning they can only be assigned to one group). Once assigned, groups will meet weekly at the same time (i.e. Group A members will always meet Monday mornings).
Problem
Currently, group assignment is done manually (by a human), but with more participants, optimizing group assignment will become increasingly challenging. I am interested in finding an algorithm that efficiently achieves relatively equal group placements and respects other factors such as a person's timezone.
Sample Data
Sample data are in long-format located in an R-script here.
> str(x)
'data.frame': 504 obs. of 4 variables:
$ id : Factor w/ 28 levels "1","10","11",..: 1 12 22 23 24 25 26 27 28 2 ...
$ timezone: Factor w/ 4 levels "Central","Eastern",..: 2 1 3 4 2 1 3 4 2 1 ...
$ day.time: Factor w/ 18 levels "Fri.after","Fri.am",..: 5 5 5 5 5 5 5 5 5 5 ...
$ avail : num 0 0 1 0 1 1 0 1 0 0 ...
The first 12 rows of the data look like this:
> head(x, 12)
id timezone day.time avail
1 1 Eastern Mon.am 0
2 2 Central Mon.am 0
3 3 Mountain Mon.am 1
4 4 Pacific Mon.am 0
5 5 Eastern Mon.am 1
6 6 Central Mon.am 1
7 7 Mountain Mon.am 0
8 8 Pacific Mon.am 1
9 9 Eastern Mon.am 0
10 10 Central Mon.am 0
11 11 Mountain Mon.am 0
12 12 Pacific Mon.am 1
Ideal Solution
An algorithm to optimally define groups (size = 4 to 6) that exactly match on day.time and avail while minimizing differences on other more flexible factors (in this case timezone). In the final result, a participant should only exist in a single group.
Okay, so I am not the most knowledgeable when it comes to this, but have you looked at the K-Means clustering algorithm? You can specify the number of clusters you want and the variables for the algorithm to consider. It will then cluster the data into the specified number of clusters, i.e. groups.
What do you think?
References:
https://datascienceplus.com/k-means-clustering-in-r/
http://www.sthda.com/english/wiki/cluster-analysis-in-r-unsupervised-machine-learning
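To illustrate the K-means suggestion, here is a rough sketch that reshapes long-format availability data into one row per participant and clusters it with stats::kmeans(). The data is simulated (28 participants, 18 slots, random availability) and the choice of 5 clusters is an assumption. Note that K-means does not enforce group sizes of 4 to 6, nor does it guarantee that all members of a cluster share a common available slot, so the result would still need checking.

```r
set.seed(42)

# Simulated long-format data with the same columns as in the question
slots <- as.vector(outer(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
                         c("am", "after", "pm"), paste, sep = "."))
x <- expand.grid(id = 1:28, day.time = slots, stringsAsFactors = FALSE)
x$avail <- rbinom(nrow(x), size = 1, prob = 0.4)

# One row per participant, one 0/1 column per time slot
wide <- tapply(x$avail, list(x$id, x$day.time), sum)

# Cluster participants by availability pattern (5 clusters is an assumption)
km <- kmeans(wide, centers = 5, nstart = 10)

# Participants per cluster, and the resulting (unconstrained) group sizes
groups <- split(names(km$cluster), km$cluster)
table(km$cluster)
```

Because the constraints in the question are hard requirements, a clustering like this is probably best treated as a starting point rather than a final assignment.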

Subset of data with repetitive names

I have a subset of cricket data with repeated player names and their runs. My question is: how many players have scored more than 5000 total runs? I want to form the subset of those players along with their total runs. A glimpse of the data is below.
"Player" "Runs"
SM Gavaskar 28
SS Naik 18
AL Wadekar 67
GR Viswanath 4
FM Engineer 32
BP Patel 82
ED Solkar 3
S Abid Ali 17
S Madan Lal 2
S Venkataraghavan 1
BS Bedi 0
SM Gavaskar 20
SS Naik 20
GK Bose 13
AL Wadekar 6
GR Viswanath 32
FM Engineer 4
BP Patel 12
AV Mankad 44
ED Solkar 0
S Abid Ali 6
S Madan Lal 3
SM Gavaskar 36
ED Solkar 8
AD Gaekwad 22
GR Viswanath 37
BP Patel 16
S Abid Ali
KD Ghavri
M Amarnath
FM Engineer
S Madan Lal
S Venkataraghavan
SM Gavaskar 65
FM Engineer 54
Please suggest a method. In Excel we would have removed the duplicates and applied a SUMIF. How can this be done in R?
Assuming you have the data in a CSV file whose first column, named 'player', holds the player name and whose second column, named 'runs', holds the number of runs:
dat <- read.csv("cricket.csv", header=TRUE) # read in the data
dat.nodup <- tapply(dat$runs, dat$player, function(x) sum(x, na.rm=TRUE)) # sum runs for each player with duplicate observations
dat.gt5000 <- dat.nodup[which(dat.nodup > 5000)] # keep only records with > 5000 runs
length(dat.gt5000) # Number of players with > 5000 runs
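An alternative that keeps the result as a data frame is aggregate(). The sketch below uses a few made-up rows instead of cricket.csv and a threshold of 40 rather than 5000 so that the toy data produces a non-empty result; the column names player and runs follow the assumption above.

```r
# Small made-up sample with repeated players and one missing score
dat <- data.frame(
  player = c("SM Gavaskar", "SS Naik", "SM Gavaskar",
             "BP Patel", "SM Gavaskar", "BP Patel"),
  runs = c(28, 18, 20, 82, NA, 12),
  stringsAsFactors = FALSE
)

# Total runs per player; rows with missing runs are dropped by the formula method
totals <- aggregate(runs ~ player, data = dat, FUN = sum)

# Players above the threshold, with their totals
threshold <- 40
top <- totals[totals$runs > threshold, ]

top
nrow(top)  # number of players above the threshold
```

For the real data, setting threshold to 5000 gives the subset asked for in the question.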
