Subset of data with repetitive names in R

I have a subset of cricket data with repetitive player names and runs. My question is: how many players have scored more than 5000 total runs? I want to form the subset of those players along with their runs. A glimpse of the data is below.
"Player" "Runs"---
SM Gavaskar 28
SS Naik 18
AL Wadekar 67
GR Viswanath 4
FM Engineer 32
BP Patel 82
ED Solkar 3
S Abid Ali 17
S Madan Lal 2
S Venkataraghavan 1
BS Bedi 0
SM Gavaskar 20
SS Naik 20
GK Bose 13
AL Wadekar 6
GR Viswanath 32
FM Engineer 4
BP Patel 12
AV Mankad 44
ED Solkar 0
S Abid Ali 6
S Madan Lal 3
SM Gavaskar 36
ED Solkar 8
AD Gaekwad 22
GR Viswanath 37
BP Patel 16
S Abid Ali
KD Ghavri
M Amarnath
FM Engineer
S Madan Lal
S Venkataraghavan
SM Gavaskar 65
FM Engineer 54
Please suggest a method. In Excel we would have removed the duplicates and applied a SUMIF. How about in R?

Assuming you have the data in a CSV file, where the first column, named 'player', identifies the player and the second column, named 'runs', holds the number of runs:
dat <- read.csv("cricket.csv", header=TRUE) # read in the data
dat.nodup <- tapply(dat$runs, dat$player, function(x) sum(x, na.rm=TRUE)) # sum runs for each player with duplicate observations
dat.gt5000 <- dat.nodup[which(dat.nodup > 5000)] # keep only records with > 5000 runs
length(dat.gt5000) # Number of players with > 5000 runs
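If you prefer the result as a data frame in one step, aggregate() offers an equivalent route (a sketch assuming the same 'player' and 'runs' columns; the formula interface drops rows with NA runs automatically):
totals <- aggregate(runs ~ player, data = dat, FUN = sum) # total runs per player
subset(totals, runs > 5000) # the qualifying players with their totals
nrow(subset(totals, runs > 5000)) # how many such players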


vcdExtra::datasets not working on some packages

R 3.6.1, vcdExtra 0.7.1
vcdExtra::datasets("caret")
Error in get(x) : object 'GermanCredit' not found
vcdExtra::datasets fails on some packages like "caret".
Am I missing something?
thanks
If you only require the German Credit dataset, try this code:
library(caret)
data("GermanCredit")
GermanCredit
And you will get:
Duration Amount InstallmentRatePercentage ResidenceDuration Age NumberExistingCredits NumberPeopleMaintenance Telephone
1 6 1169 4 4 67 2 1 0
2 48 5951 2 2 22 1 1 1
3 12 2096 2 3 49 1 2 1
4 42 7882 2 4 45 1 2 1
5 24 4870 3 4 53 2 2 1
Please comment if this is what you need.
Regards,
Alexis
This is the sequence of commands I need to run for vcdExtra::datasets("caret") to work correctly; apparently datasets() looks each object up with get(), so the datasets must be loaded into the workspace first:
library(evtree)
library(caret)
data(Sacramento)
data(tecator)
data(BloodBrain)
data(cox2)
data(dhfr)
data(oil)
data(mdrr)
data(pottery)
data(scat)
data(segmentationData)
vcdExtra::datasets("caret")
The output is
Item class dim Title
1 GermanCredit data.frame 1000x21 German Credit Data
2 Sacramento data.frame 932x9 Sacramento CA Home Prices
3 absorp matrix 215x100 Fat, Water and Protein Content of Meat Samples
4 bbbDescr data.frame 208x134 Blood Brain Barrier Data
5 cars data.frame 50x2 Kelly Blue Book resale data for 2005 model year GM cars
6 cox2Class factor 462 COX-2 Activity Data
7 cox2Descr data.frame 462x255 COX-2 Activity Data
8 cox2IC50 numeric 462 COX-2 Activity Data
9 dhfr data.frame 325x229 Dihydrofolate Reductase Inhibitors Data
10 endpoints matrix 215x3 Fat, Water and Protein Content of Meat Samples
11 fattyAcids data.frame 96x7 Fatty acid composition of commercial oils
12 logBBB numeric 208 Blood Brain Barrier Data
13 mdrrClass factor 528 Multidrug Resistance Reversal (MDRR) Agent Data
14 mdrrDescr data.frame 528x342 Multidrug Resistance Reversal (MDRR) Agent Data
15 oilType factor 96 Fatty acid composition of commercial oils
16 potteryClass factor 58 Pottery from Pre-Classical Sites in Italy
17 scat data.frame 110x19 Morphometric Data on Scat
18 scat_orig data.frame 122x20 Morphometric Data on Scat
19 segmentationData data.frame 2019x61 Cell Body Segmentation
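If loading every dataset by hand is tedious, a more generic sketch of the same workaround is possible (my assumption, not part of the vcdExtra API): pull the list of data files from data(package = ) and load them all before querying. Items such as "absorp (tecator)" name the object with its data file in parentheses, and data() wants the file name, hence the regex.
library(caret)
items <- data(package = "caret")$results[, "Item"]
# use the parenthesized data-file name when present, e.g. "absorp (tecator)" -> "tecator"
files <- ifelse(grepl("\\(", items), sub(".*\\((.*)\\)", "\\1", items), items)
data(list = unique(files), package = "caret")
vcdExtra::datasets("caret")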

Mapping a dataframe (with NA) to an n by n adjacency matrix (as a data.frame object)

I have a three-column dataframe object recording bilateral trade data between 161 countries. The data are in dyadic format, containing 19687 rows and three columns: reporter (rid), partner (pid), and their bilateral trade flow (TradeValue) in a given year. rid and pid take values from 1 to 161, and a country is assigned the same rid and pid. For any pair (rid, pid) with rid != pid, TradeValue(rid, pid) = TradeValue(pid, rid).
The data (run in R) look like this:
#load the data from dropbox folder
library(foreign)
example_data <- read.csv("https://www.dropbox.com/s/hf0ga22tdjlvdvr/example_data.csv?dl=1")
head(example_data, n = 10)
rid pid TradeValue
1 2 3 500
2 2 7 2328
3 2 8 2233465
4 2 9 81470
5 2 12 572893
6 2 17 488374
7 2 19 3314932
8 2 23 20323
9 2 25 10
10 2 29 9026220
The data were sourced from the UN Comtrade database; each rid is paired with multiple pids. As can be seen, not every pid has a numeric id value, because I only assigned a rid or pid to a country if a list of relevant economic indicators for that country is available. This is why there are NAs in the data even though a TradeValue exists between that country and the reporting country (rid). The same applies when a country becomes a "reporter": in that situation the country did not report any TradeValue with partners, and its id number is absent from the rid column. (Hence the rid column begins with 2, because country 1, i.e. Afghanistan, did not report any bilateral trade data with partners.) A quick check with summary statistics helps confirm this:
length(unique(example_data$rid))
[1] 139
# only 139 countries reported bilateral trade statistics with partners
length(unique(example_data$pid))
[1] 162
# that extra pid is NA (161 + NA = 162)
Most countries report bilateral trade data with partners, and those that don't tend to be small economies. Hence, I want to preserve the complete list of 161 countries and transform this example_data dataframe into a 161 x 161 adjacency matrix in which:
for those countries that are absent from the rid column (e.g., rid == 1), I create a row for each of them and set the entire row (in the 161 x 161 matrix) to 0;
for those countries (pid) that do not share TradeValue entries with a particular rid, set those cells to 0.
For example, suppose that in a 5 x 5 case country 1 did not report any trade statistics with partners, while the other four reported their bilateral trade statistics with each other (but not with country 1). The original dataframe looks like
rid pid TradeValue
2 3 223
2 4 13
2 5 9
3 2 223
3 4 57
3 5 28
4 2 13
4 3 57
4 5 82
5 2 9
5 3 28
5 4 82
which I want to convert to a 5 x 5 adjacency matrix (as a data.frame object); the desired output should look like this
V1 V2 V3 V4 V5
1 0 0 0 0 0
2 0 0 223 13 9
3 0 223 0 57 28
4 0 13 57 0 82
5 0 9 28 82 0
I would then use the same method on example_data to create a 161 x 161 adjacency matrix. However, after several trial-and-error attempts with reshape and other methods, I still could not work out this conversion, not even the first step.
It would be really appreciated if anyone could enlighten me on this.
I cannot read the Dropbox file, but I have tried to work off your 5-country example dataframe:
country_num = 5
# check countries missing in rid and pid
rid_miss = setdiff(1:country_num, example_data$rid)
# if no pid is missing, use a single placeholder id; otherwise keep all missing pids
pid_miss = if (length(setdiff(1:country_num, example_data$pid)) == 0) 1 else
  setdiff(1:country_num, example_data$pid)
# create dummy dataframe with missing rid and pid
add_data = as.data.frame(do.call(cbind, list(rid_miss, pid_miss, NA)))
colnames(add_data) = colnames(example_data)
# add dummy dataframe to original
example_data = rbind(example_data, add_data)
# the dcast now takes the missing rid and pid into account
library(reshape2) # dcast() lives in reshape2
mat = dcast(example_data, rid ~ pid, value.var = "TradeValue")
# can remove first column without setting colnames but this is more failproof
rownames(mat) = mat[, 1]
mat = as.matrix(mat[, -1])
# fill in upper triangular matrix with missing values of lower triangular matrix
# and vice-versa since TradeValue(rid, pid) = TradeValue(pid, rid)
mat[is.na(mat)] = t(mat)[is.na(mat)]
# change NAs to 0 according to preference - would keep as NA to differentiate
# from actual zeros
mat[is.na(mat)] = 0
Does this help?
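For completeness, a compact base-R alternative to the dcast route is xtabs(), where explicit factor levels force one row and column per country (a sketch against the 5-country example; use n = 161 for the full data, and it assumes rows with missing ids may simply be dropped):
n <- 5
# factor levels guarantee a row/column for every country; absent pairs become 0
mat2 <- xtabs(TradeValue ~ factor(rid, levels = 1:n) + factor(pid, levels = 1:n),
              data = example_data)
mat2 <- pmax(mat2, t(mat2)) # trade is symmetric, so fill each direction from the other
mat2 <- as.data.frame.matrix(mat2) # as a data.frame, per the desired output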

Observations with low frequency go all in train set and produce error in predict ()

I have a dataset (~14410 rows) with observations that include the country. I divide this set into a train and a test set and train my model using a decision tree with the rpart() function. When it comes to predicting, I sometimes get the error that the test set has countries which are not in the train set.
At first I excluded/deleted the countries which appeared only once:
# Get orderland with frequency one
var.names <- names(table(mydata1$country))[table(mydata1$country) == 1]
loss <- match(var.names, mydata1$country)
names(which(table(mydata1$country) == 1))
mydata1 <- mydata1[-loss, ]
When rerunning my code, I get the same error at the same line, saying that there are new countries in the test set which are not in the train set.
Then I did a count to see how often each country appears:
count <- as.data.frame(count(mydata1, vars=mydata1$country))
count[rev(order(count$n)),]
vars n
3 Bundesrep. Deutschland 7616
9 Grossbritannien 1436
12 Italien 930
2 Belgien 731
22 Schweden 611
23 Schweiz 590
13 Japan 587
19 Oesterreich 449
17 Niederlande 354
8 Frankreich 276
18 Norwegen 238
7 Finnland 130
21 Portugal 105
5 Daenemark 65
26 Spanien 57
4 China 55
20 Polen 51
27 Taiwan 31
14 Korea Süd 30
11 Irland 26
29 Tschechien 13
16 Litauen 9
10 Hong Kong 7
30 <NA> 3
6 Estland 3
24 Serbien 2
1 Australien 2
28 Thailand 1
25 Singapur 1
15 Kroatien 1
From this I can see that I also have NAs in my data.
My question now is: how can I proceed with this problem?
Should I exclude/delete all countries with, say, fewer than 7 observations, or should I take the data with fewer than 7 observations and replicate them two times, so that my predict() function will always work, also for other data sets?
It's somehow not "fancy" just to delete the rows... is there any other possibility?
You need to convert every character variable to a factor:
mydata1$country <- as.factor(mydata1$country)
Then you can simply proceed with the train/test split. You won't need to remove anything (except the NAs).
By using the factor type, your model will know that a country observation has a fixed set of possible levels:
Example:
country <- factor("Italy", levels = c("Italy", "USA", "UK")) # just 3 levels for example
country
[1] Italy
Levels: Italy USA UK
# note that as.factor() takes care of defining the levels for you
See the difference with:
country <- "Italy"
country
[1] "Italy"
By using factor, the model knows all the possible levels. Because of this, even if the train data contain no "Italy" observation, the model still knows that "Italy" can appear in the test data.
factor is always the correct type for characters in models.
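A minimal sketch of that workflow (the split fraction and seed are illustrative; mydata1 is the dataset from the question):
mydata1$country <- as.factor(mydata1$country) # levels computed from the full data
set.seed(1)
idx <- sample(nrow(mydata1), 0.7 * nrow(mydata1))
train <- mydata1[idx, ]
test <- mydata1[-idx, ]
# subsetting keeps the complete level set, so predict() meets no unknown countries
identical(levels(train$country), levels(test$country)) # TRUE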

How to resample data by clusters (block sampling) with replacement in R using Sampling package

This is my dummy data:
income <- sample(1000:10000, 1000, replace = TRUE)
individuals <- sample(1:50, 1000, replace = TRUE)
datatest <- data.frame(income, individuals)
I know I can sample individual rows with this code:
sample <- datatest[sample(nrow(datatest), replace=TRUE),]
Now, I want to extract random samples with replacement and equal probabilities of the dataset but sampling complete blocks of observations with the same individual code.
Note that there are 50 individuals, but 1000 observations. Some observations belong to the same individual, so I want to sample by individuals (clusters, in this case), not observations. I don't mind if the extracted samples differ slightly in the number of observations. How can I do that?
I have tried:
library(sampling)
samplecluster <- cluster(datatest, clustername = c("individuals"), size = 50, method = "srswr")
But the outcome is not the sampled data. Am I missing something?
Well, it seems I was indeed missing something. After the cluster command you need to apply the getdata command (both from the sampling package). This way I do get the sample I wanted, plus some additional columns.
samplecluster <- cluster(datatest, clustername = c("individuals"), size = 50, method = "srswr")
Gives you:
head(samplecluster)
individuals ID_unit Replicates Prob
1 1 259 2 0.63583
2 1 178 2 0.63583
3 1 110 2 0.63583
4 1 153 2 0.63583
5 1 941 2 0.63583
6 1 667 2 0.63583
Then, using getdata, I also get the original income data, sampled by whole clusters:
datasample <- getdata(datatest, samplecluster)
head(datasample)
income individuals ID_unit Replicates Prob
1 8567 1 259 2 0.63583
2 2701 1 178 2 0.63583
3 4998 1 110 2 0.63583
4 3556 1 153 2 0.63583
5 2893 1 941 2 0.63583
6 7581 1 667 2 0.63583
I am not sure if I am missing something. If you just want some of your individuals, you can create a smaller sample of them:
ind.sample <- sample(1:50, size = 10)
print(ind.sample)
# [1] 17 43 38 39 28 23 35 47 9 13
my.sample <- datatest[datatest$individuals %in% ind.sample, ]
head(my.sample)
# income individuals
#21 9072 17
#97 5928 35
#122 9130 43
#252 4388 43
#285 8083 28
#287 1065 35
I guess a more generic approach would be to generate random indexes:
ind.unique <- unique(individuals)
ind.sample.index <- sample(1:length(ind.unique), size = 10)
ind.sample <- ind.unique[ind.sample.index]
print(ind.sample[order(ind.sample)])
my.sample <- datatest[datatest$individuals %in% ind.sample, ]
ind.counts <- aggregate(income ~ individuals, my.sample, FUN = length)
print(ind.counts)
I think it's important to note that the dataset still needs to be expanded to include all the replicates:
sw <- data.frame(datasample[rep(seq_len(nrow(datasample)), datasample$Replicates), , drop = FALSE], row.names = NULL) # repeat each sampled row according to its Replicates count
Might be helpful to someone
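For reference, whole-cluster resampling with replacement can also be sketched in base R without the sampling package (using the datatest frame from above); a cluster drawn more than once simply contributes its block of rows that many times:
ids <- unique(datatest$individuals)
boot.ids <- sample(ids, size = length(ids), replace = TRUE) # 50 cluster draws with replacement
# stack the complete block of rows for every drawn cluster
boot.sample <- do.call(rbind, lapply(boot.ids, function(i) datatest[datatest$individuals == i, ]))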

Replace values in one data frame from values in another data frame

I need to change individual identifiers that are currently alphabetical to numerical. I have created a data frame (g4) where each alphabetical identifier is associated with a number:
individuals num.individuals g4
1 ZYO 64
2 KAO 24
3 MKU 32
4 SAG 42
What I need is to replace ZYO with the number 64 in my main data frame (g3), and likewise for all the other codes.
My main data frame (g3) looks like this
SAG YOG GOG BES ATR ALI COC CEL DUN EVA END GAR HAR HUX ISH INO JUL
1 2
2 2 EVA
3 SAG 2 EVA
4 2
5 SAG 2
6 2
Now, on a small scale, I can write code to change it, as I did with ATR:
g3$ATR <- as.character(g3$ATR)
g3[g3$target == "ATR" | g3$ATR == "ATR","ATR"] <- 2
But this is time consuming and increases the chance of human error.
I know there are ways to do this on a broad scale with NAs.
I think maybe a for loop could do this, but I am not good enough to write one myself.
I have also been trying to use this function, which I feel may work, but I am not sure how to logically build the argument; it was posted on the questions board here:
Fast replacing values in dataframe in R
df <- as.data.frame(lapply(df, function(x) replace(x, x < 0, 0)))
I have tried to work my data into this by
df <- as.data.frame(lapply(g4, function(g3){replace(x, x < 0, 0)}))
Here is one approach using the data.table package:
First, create a reproducible example similar to your data:
require(data.table)
ref <- data.table(individuals=1:4,num.individuals=c("ZYO","KAO","MKU","SAG"),g4=c(64,24,32,42))
g3 <- data.table(SAG=c("","SAG","","SAG"),KAO=c("KAO","KAO","",""))
Here is the ref table:
individuals num.individuals g4
1: 1 ZYO 64
2: 2 KAO 24
3: 3 MKU 32
4: 4 SAG 42
And here is your g3 table:
SAG KAO
1: KAO
2: SAG KAO
3:
4: SAG
And now we do our find and replacing:
g3[, lapply(.SD, function(x) ref$g4[chmatch(x, ref$num.individuals)])]
And the final result:
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA
And if you need more speed, the fastmatch package might help with its fmatch function:
require(fastmatch)
g3[, lapply(.SD, function(x) ref$g4[fmatch(x, ref$num.individuals)])]
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA
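Without data.table, the same lookup can be sketched in base R with match() over the columns of an ordinary data frame (assuming the same ref table as above):
g3.df <- data.frame(SAG = c("", "SAG", "", "SAG"), KAO = c("KAO", "KAO", "", ""))
# look each cell's code up in ref$num.individuals and pull the matching g4 number
as.data.frame(lapply(g3.df, function(x) ref$g4[match(x, ref$num.individuals)]))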
