Calculating Gini by Row in R - r

I'm trying to calculate the gini coefficient within each row of my dataframe, which is 1326 rows long, by 6 columns (1326 x 6).
My current code...
attacks$attack_gini <- gini(x = c(attacks$attempts_open_play,
attacks$attempts_corners,attacks$attempts_throws,
attacks$attempts_fk,attacks$attempts_set_play,attacks$attempts_penalties))
... fills all the rows with the same figure of 0.7522439 - which is evidently wrong.
Note: I'm using the gini function from the reldist package.
Is there a way that I can calculate the gini for the 6 columns in each row?
Thanks in advance.

The gini function from reldist does not accept a dataframe as input. You could easily get the coefficient of the first column of your dataframe like this:
> gini(attacks$attempts_open_play)
[1] 0.1124042
However, when you do c(attacks$attempts_open_play, attacks$attempts_corners, ...) you are actually concatenating all the columns of your dataframe, one after the other, into a single vector, so your gini call gives back a single number, e.g.:
> gini(c(attacks$attempts_open_play, attacks$attempts_corners))
[1] 0.112174
And that's why you are assigning the same single number to every row of attacks$attack_gini. If I understood properly, you want to calculate the Gini coefficient of the column values within each row. For that you can use apply, something like
attacks$attack_gini <- apply(attacks[,c('attempts_open_play', 'attempts_corners', ...)], 1, gini)
where the second argument (MARGIN), with value 1, tells apply to call gini once per row.
head(apply(attacks[,c('attempts_open_play', 'attempts_corners')], 1, gini))
[1] 0.026315789 0.044247788 0.008928571 0.053459119 0.019148936 0.007537688
Hope it helps.
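A minimal, self-contained sketch of the row-wise apply pattern, for readers who want to try it without the original data. The data frame values are made up, and gini_simple is a hypothetical stand-in (mean absolute difference divided by twice the mean) so the example runs without reldist; it is not claimed to reproduce reldist's gini exactly.

```r
# Hypothetical stand-in for reldist::gini(): sum of absolute pairwise
# differences divided by 2 * n^2 * mean(x).
gini_simple <- function(x) {
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (2 * n^2 * mean(x))
}

# Made-up data using three of the column names from the question
attacks <- data.frame(
  attempts_open_play = c(10, 4, 0),
  attempts_corners   = c(10, 2, 0),
  attempts_throws    = c(10, 0, 9)
)

cols <- c("attempts_open_play", "attempts_corners", "attempts_throws")

# MARGIN = 1 passes each row to the function as a numeric vector,
# so the result has one value per row
attacks$attack_gini <- apply(attacks[, cols], 1, gini_simple)
attacks$attack_gini
```

The first row has equal values in every column, so its coefficient is 0, while the other rows get their own, different values. With reldist installed, gini_simple can be swapped for gini in the apply call.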

Related

R: Separating several observations of a variable and building a matrix

I have a multiple-response-variable with seven possible observations: "Inhalt", "Arbeit", "Verhindern Koalition", "Ermöglichen Koalition", "Verhindern Kanzlerschaft", "Ermöglichen Kanzlerschaft", "Spitzenpolitiker".
If one chose more than one observation, the answers however are not separated in the data (Data)
My goal is to create a matrix with all possible observations as variables and marked with 1 (yes) and 0 (No). Currently I am using this command:
einzeln_strategisch_2021 <- data.frame(strategisch_2021[, ! colnames (strategisch_2021) %in% "Q12"], model.matrix(~ Q12 - 1, strategisch_2021)) %>%
This gives me the matrix I want but it does not separate the observations, so now I have a matrix with 20 variables instead of the seven (variables).
I also tried separate() like this:
separate(Q12, into = c("Inhalt", "Arbeit", "Verhindern Koalition", "Ermöglichen Koalition", "Verhindern Kanzlerschaft", "Ermöglichen Kanzlerschaft", "Spitzenpolitiker"), ";") %>%
This does separate the observations, but not in the right order and without the matrix.
How do I separate my observations and create a matrix with the possible observations as variables akin to the third picture (Matrix)?
Thank you very much in advance ;)

Apply/lapply function to all columns in matrix

I have a matrix called seq$num, consisting of 100 columns and 30k rows. Each column corresponds to the name of a specific sample (e.g. CAGTCA), and every row holds a numeric value. With this type of object, I can access the first row writing seq$num[[1]] and so on for the other rows. This is a brief example of my database:
CAGTCA   AGATCA   GCTCGA   GCTCGA
-0.4930  -2.0330   0.7100   0.1560
 1.0030   0.0120  -1.0433   0.6701
 0.0013   1.0013   1.2451  -1.3421
I would like to loop through all the samples using the lapply function and for each sample classify:
the numbers above 1.5 as "high".
the numbers below 0 as "low".
the numbers between 0 and 1.5 as "medium".
Then I also need to record how many high, low and medium numbers each sample has.
How can this be done? I've tried applying the lapply function, but I don't get the output I want.
You can write a function which divides data into categories.
classify <- function(x) ifelse(x >= 1.5, 'high', ifelse(x < 0, 'low', 'medium'))
Then apply classify to each dataframe in seq$num and use table to count the categories.
res <- lapply(seq$num, function(x) table(classify(as.matrix(x))))
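A small sketch of the same idea on made-up numbers, assuming each sample is a plain numeric vector in a named list (the real structure of seq$num is not shown in the question). Wrapping the labels in factor() with fixed levels is an addition here: it keeps a category in the table even when its count is zero.

```r
classify <- function(x) ifelse(x >= 1.5, "high", ifelse(x < 0, "low", "medium"))

# Made-up samples; the names are borrowed from the question's example
samples <- list(
  CAGTCA = c(-0.4930, 1.0030, 0.0013, 1.7),
  AGATCA = c(-2.0330, 0.0120, 1.0013, -0.2)
)

lvls <- c("low", "medium", "high")

# Count the categories per sample, then transpose so that each row
# is a sample and each column is a category
counts <- t(sapply(samples, function(x) table(factor(classify(x), levels = lvls))))
counts
```

Without the fixed levels, a sample with no "high" values would return a shorter table, and the per-sample results could no longer be combined into one matrix.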

Summing a specific vector index

I'm having trouble figuring out how vectors are formatted. I need to find the average height of participants in the cystfibr package of the ISwR library. When printing the entire height data set it appears to be a 21x2 matrix with height values and a 1 or 2 to indicate sex. However, ncol returns a value of NA suggesting it is a vector. Trying to get specific indexes of the matrix (heightdata[1,]) also returns an incorrect number of dimensions error.
I'm looking to sum up only the height values in the vector but when I run the code I get the sum of the male and female integers. (25)
install.packages("ISwR")
library(ISwR)
attach(cystfibr)
heightdata = table(height)
print(heightdata)
print(sum(heightdata))
This is what the output looks like.
You can convert cystfibr to a dataframe and then compute the sum of each column in the data.
install.packages("ISwR")
library(ISwR)
data <- data.frame(cystfibr) # convert to dataframe format
As there is no unique identifier in the data, the sum is taken across all observations:
apply(data[, "height", drop = FALSE], 2, sum) # sum of the height column
height
3820
unlist(lapply(data , sum))
age sex height weight bmp fev1 rv frc tlc pemax
362.0 11.0 3820.0 960.1 1957.0 868.0 6380.0 3885.0 2850.0 2728.0
sapply(data, sum)
age sex height weight bmp fev1 rv frc tlc pemax
362.0 11.0 3820.0 960.1 1957.0 868.0 6380.0 3885.0 2850.0 2728.0
table gives you the count of values in the vector.
If you want to sum the output of height from heightdata, they are stored in names of heightdata but it is in character format, convert it to numeric and sum.
sum(as.numeric(names(heightdata)))
#[1] 3177
which is similar to summing the unique values of height.
sum(unique(cystfibr$height))
#[1] 3177
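To make the difference between these sums concrete, here is a tiny sketch on made-up heights (not the cystfibr values). It shows that summing a table adds up the counts, which gives the number of observations, and that summing the table's names counts each distinct height only once.

```r
# Made-up heights (not the cystfibr data): 150 occurs twice
height <- c(150, 150, 160)

heightdata <- table(height)

sum(heightdata)                      # 3: the counts add up to the number of observations
sum(as.numeric(names(heightdata)))   # 310: each distinct height counted once
sum(height)                          # 460: every observation counted
mean(height)                         # the average height
```

So if the goal is the average height of all participants, sum(height) or mean(height) on the original column is the quantity to use; the names-based sum only matches it when no height is repeated.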

Combining row vectors for data frame after using quantile function

Novice problem. I ran following command:
CI_95_outcomes_male <- data.frame(do.call(cbind,lapply(1:ncol(outcomes_male_dt), function(r) quantile(outcomes_male_dt[,r],c(.95)))))
and end up with this output:
CI_95_outcomes_male
X1 X2 X3 X4
95% 9629902039 0 2.968924e+15 2.968924e+15
I would like to combine this vector with following vector to end up with 2X4 matrix:
#
mean_outcomes_male
ylg_smoking_simS deaths_averted total_cig total_tax_
9.62990 0.0000 2.78248 2.782480
I tried:
CI_95_outcomes_male<-colnames(mean_outcomes_male)
data.frame(mean_outcomes_male,CI_95_outcomes_male)
Error in data.frame(mean_outcomes_male, CI_95_outcomes_male) :
arguments imply differing number of rows: 4, 0
Any guidance appreciated, thanks!
CI_95_outcomes_male<-colnames(mean_outcomes_male)
I think you forgot to put colnames around CI_95_outcomes_male, i.e. you probably meant colnames(CI_95_outcomes_male) <- colnames(mean_outcomes_male). But there's another problem here. I'm assuming that mean_outcomes_male is a vector, in which case colnames(mean_outcomes_male) is NULL.
data.frame(mean_outcomes_male,CI_95_outcomes_male)
Even if CI_95_outcomes_male were correct, the above command would result in a 4x5 data frame, with the first column being the mean_outcomes_male vector, the second column being the CI_95_outcomes_male value for your first variable (repeated for each row), and so on up to the fifth column, which would be the CI_95_outcomes_male value for your fourth variable (repeated for each row).
You need to do something like this:
set.seed(42)
# Generate a random dataset for outcomes_male_dt with 4 variables and n rows
n <- 100
outcomes_male_dt <- data.frame(x1=runif(n),x2=runif(n),x3=runif(n),x4=runif(n))
# I'm assuming you want the 95th percentile of each variable in outcomes_male_dt and store them in CI_95_outcomes_male
ptl <- .95 # if you want to add other percentiles you can replace this with something like "ptl <- c(.10,.50,.90,.95)"
CI_95_outcomes_male <- apply(outcomes_male_dt,2,quantile,probs=ptl)
# I'm going to assume that mean_outcomes_male is a vector of means for all the variables in outcomes_male_dt
mean_outcomes_male <- colMeans(outcomes_male_dt)
# You want to end up with a 2x4 matrix - I'm assuming you meant row 1 will be the means, and row 2 will be the 95th percentiles, and the columns will be the variables
want <- rbind(mean_outcomes_male, CI_95_outcomes_male)
colnames(want) <- colnames(outcomes_male_dt)
row.names(want) <- c('Mean',paste0("p",ptl*100)) # paste0("p",ptl*100) is equivalent to paste("p",ptl*100,sep="")
want # Resulting matrix

sample() command is too slow in R

I want to create a random subset of a data.table df that is very large (around 2 million lines).
The data table has a weight column, wgt, that indicates how many observations each line represents.
To generate the vector of row numbers I want to extract, I proceed as follows:
I get the exact number of observations :
ns<- length(df$wgt)
I get the number of desired lines (30% of the sample):
lines<-round(0.3*ns)
I compute the vector of probabilities:
pr<-df$wgt/sum(df$wgt)
And then I compute the vector of line numbers to get the subsample:
ssout<-sample(1:ns, size=lines, probs=pr)
The final aim is to subset the data using df[ssout,]. However, R gets stuck when computing ssout.
Is there a faster/more efficient way to do this?
Thank you!
I'm guessing that df is a summary description of a data set that has repeated observations (with wgt being the count of repetitions). In that case, the only useful way to sample from it would be with replacement; and a proper 30% sample would be 30% of the real population, .3*sum(wgt):
# example data
wgt <- sample(10,2e6,replace=TRUE)
nobs<- sum(wgt)
pr <- wgt/sum(wgt)
# select rows
system.time(x <- sample.int(2e6,size=.3*nobs,prob=pr,replace=TRUE))
# user system elapsed
# 0.20 0.02 0.22
Sampling rows without replacement takes forever on my computer, but is also something that I don't think one needs to do here.
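A scaled-down sketch of the full workflow, from weights to the final subset, assuming a plain data frame with made-up id and wgt columns (the question uses a data.table, where the df[idx, ] step works the same way). The weights are passed to prob unnormalised, since sample.int does not require them to sum to one.

```r
set.seed(1)

# Made-up summary data: 1000 lines, each standing for 1 to 10 observations
df <- data.frame(id = 1:1000, wgt = sample(10, 1000, replace = TRUE))

nobs <- sum(df$wgt)          # size of the underlying population
size <- round(0.3 * nobs)    # 30% of that population

# Weighted sampling with replacement: a line can be drawn more than once
idx <- sample.int(nrow(df), size = size, replace = TRUE, prob = df$wgt)

sub <- df[idx, ]
nrow(sub) == size
```

Because lines are drawn with replacement, sub contains repeated rows; tabulating idx gives the number of times each line was drawn if a compact form is preferred.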
