R Sample By Minimum Cell Size - r

set.seed(1)
data <- data.frame(SCHOOL = rep(1:10, each = 1000), GRADE = sample(7:12, replace = TRUE, size = 10000), SCORE = sample(1:100, replace = TRUE, size = 10000))
I have 'data' that contains information about student test scores. I wish to count how many rows there are for each GRADE within each SCHOOL, and then take the smallest of those counts for each GRADE across all SCHOOLs. Like this:
1. For each SCHOOL, count the number of rows for a specific GRADE.
2. Then, for each GRADE, find the smallest count across all SCHOOLs.
3. Finally, take a random sample whose size is the smallest count found in step 2.
So, in a basic example with two SCHOOLs and two grades, GRADE 7 and GRADE 8:
SCHOOL 1 has 2 SCOREs for GRADE 7 and 3 SCOREs for GRADE 8.
SCHOOL 2 has 1 SCORE for GRADE 7 and 4 SCOREs for GRADE 8.
So the new data contains one SCORE for GRADE 7 from each of SCHOOL 1 and SCHOOL 2, and three SCOREs for GRADE 8 from each of SCHOOL 1 and SCHOOL 2, and the SCOREs that are picked are RANDOMLY SAMPLED.
like this:
My attempt:
data[, .SD[sample(x = .N, size = min(sum(GRADE), .N))], by = .(SCHOOL, GRADE)]

This follows your description of how to do it step-by-step.
library(data.table)
setDT(data)
data[, N := .N, .(SCHOOL, GRADE)]
data[, N := min(N), GRADE]
data[, .(SCORE = sample(SCORE, N)), .(SCHOOL, GRADE, N)][, -'N']
If you have multiple SCORE-like columns and you want to keep the same rows from each, then you can use .SD like in your attempt:
data[, .SD[sample(.N, N)], .(SCHOOL, GRADE, N)][, -'N']
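The same three steps can also be checked by hand on the toy example from the question. This is only a minimal base R sketch: it swaps in ave() and split() for data.table, and the toy data frame and its SCORE values are made up to match the cell sizes described above.

```r
# Hypothetical toy data with the cell sizes from the question:
# SCHOOL 1 has 2 rows for GRADE 7 and 3 for GRADE 8,
# SCHOOL 2 has 1 row for GRADE 7 and 4 for GRADE 8 (SCORE values are made up).
toy <- data.frame(
  SCHOOL = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  GRADE  = c(7, 7, 8, 8, 8, 7, 8, 8, 8, 8),
  SCORE  = c(55, 60, 70, 75, 80, 65, 40, 45, 50, 85)
)

set.seed(1)
rows <- seq_len(nrow(toy))
# Step 1: size of each SCHOOL x GRADE cell, repeated on every row of the cell
cell_n <- ave(rows, toy$SCHOOL, toy$GRADE, FUN = length)
# Step 2: smallest cell size per GRADE across all schools
min_n <- ave(cell_n, toy$GRADE, FUN = min)
# Step 3: draw min_n row indices at random within each cell
cells <- split(rows, list(toy$SCHOOL, toy$GRADE), drop = TRUE)
keep <- unlist(lapply(cells, function(idx) {
  idx[sample.int(length(idx), min_n[idx[1]])]
}))
balanced <- toy[sort(keep), ]
table(balanced$SCHOOL, balanced$GRADE)
```

Sampling row indices with sample.int() rather than sampling SCORE directly sidesteps the case where a cell holds a single number, which sample() would treat as the upper end of a range.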

Related

R using melt() and dcast() with categorical and numerical variables at the same time

I am a newbie in programming with R, and this is my first question ever here on Stackoverflow.
Let's say that I have a data frame with 4 columns:
(1) Individual ID (numeric);
(2) Morality of the individual (factor);
(3) The city (factor);
(4) Numbers of books possessed (numeric).
Person_ID <- c(1,2,3,4,5,6,7,8,9,10)
Morality <- c("Bad guy","Bad guy","Bad guy","Bad guy","Bad guy",
"Good guy","Good guy","Good guy","Good guy","Good guy")
City <- c("NiceCity", "UglyCity", "NiceCity", "UglyCity", "NiceCity",
"UglyCity", "NiceCity", "UglyCity", "NiceCity", "UglyCity")
Books <- c(0,3,6,9,12,15,18,21,24,27)
mydf <- data.frame(Person_ID, City, Morality, Books)
I am using this code in order to get the counts by each category for the variable Morality in each city:
mycounts<-melt(mydf,
idvars = c("City"),
measure.vars = c("Morality"))%>%
dcast(City~variable+value,
value.var="value",fill=0,fun.aggregate=length)
The code gives this kind of table with the counts:
names(mycounts)<-gsub("Morality_","",names(mycounts))
mycounts
City Bad guy Good guy
1 NiceCity 3 2
2 UglyCity 2 3
I wonder if there is a similar way to use dcast() for numerical variables (inside the same script), e.g. in order to get the sum of the Books possessed by all individuals living in each city:
#> City Bad guy Good guy Books
#>1 NiceCity 3 2 [Total number of books in NiceCity]
#>2 UglyCity 2 3 [Total number of books in UglyCity]
Do you mean something like this:
mydf %>%
melt(
idvars = c("City"),
measure.vars = c("Morality")
) %>%
dcast(
City ~ variable + value,
value.var = "Books",
fill = 0,
fun.aggregate = sum
)
#> City Morality_Bad guy Morality_Good guy
#> 1 NiceCity 18 42
#> 2 UglyCity 12 63
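If the goal is the table sketched in the question (counts per Morality plus a single Books total per city), one possible route that avoids melt() and dcast() altogether is base R's table() and tapply(). This is a sketch under that assumption, rebuilding mydf from the question; the names counts, books and result are introduced here for illustration.

```r
# Rebuild the example data from the question
mydf <- data.frame(
  Person_ID = 1:10,
  City      = rep(c("NiceCity", "UglyCity"), times = 5),
  Morality  = rep(c("Bad guy", "Good guy"), each = 5),
  Books     = c(0, 3, 6, 9, 12, 15, 18, 21, 24, 27),
  stringsAsFactors = FALSE
)

# Counts of each Morality category per City (one row per City)
counts <- as.data.frame.matrix(table(mydf$City, mydf$Morality))
# Total number of Books per City
books <- tapply(mydf$Books, mydf$City, sum)

# Combine both, matching the totals to the rows of the count table by City name
result <- data.frame(
  City = rownames(counts),
  counts,
  Books = as.numeric(books[rownames(counts)]),
  check.names = FALSE,   # keep "Bad guy" / "Good guy" as column names
  row.names = NULL
)
result
```

The Books totals agree with the row sums of the dcast() output above (18 + 42 and 12 + 63).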

How to find the difference of ratio in multiple columns using R

I have a data frame with values for 4 different years. I need to find how these values change across the years, i.e. which city's value changes the most and which changes the least.
City Ratio1 Ratio2 Ratio3 Ratio4
A 1.0177722 1.0173251 1.0133026 1.0140027
B 1.0132619 1.0122653 1.0128473 1.0111068
C 1.0689484 1.0640355 1.0625305 1.0544790
..... other 1000 entries
I have tried to do it by taking differences, but with no luck. The question is which city's ratio changed the most between Ratio1 and Ratio4, and which changed the least.
I have tried using the mutate function to calculate the variance, but it throws a warning:
DF<- DF%>% mutate(vari = var(Ratio1:Ratio4,na.rm = T))
Warning messages:
1: In POP_2013_ratio:POP_2016_ratio :
numerical expression has 439 elements: only the first used
2: In POP_2013_ratio:POP_2016_ratio :
numerical expression has 439 elements: only the first used
R's data.table package has a pretty neat way to create new columns based on existing ones:
dt <- data.table(City = c("A", "B", "C"),
Ratio1 = c(1.0177722, 1.0132619, 1.0689484),
Ratio2 = c(1.0173251, 1.0122653, 1.0640355),
Ratio3 = c(1.0133026, 1.0128473, 1.0625305),
Ratio4 = c(1.0140027,1.0111068, 1.0544790))
>dt
City Ratio1 Ratio2 Ratio3 Ratio4
1: A 1.017772 1.017325 1.013303 1.014003
2: B 1.013262 1.012265 1.012847 1.011107
3: C 1.068948 1.064035 1.062531 1.054479
You can play around with some functions and then see what suits you best:
dt[, diff := Ratio4-Ratio1
][, abs_diff := abs(Ratio4-Ratio1)
][, range:= max(c(Ratio1, Ratio2, Ratio3, Ratio4))- min(c(Ratio1, Ratio2, Ratio3, Ratio4)), by = City
][,variance:=var(c(Ratio1, Ratio2, Ratio3, Ratio4)), by = City]
>dt
City Ratio1 Ratio2 Ratio3 Ratio4 diff abs_diff range variance
1: A 1.017772 1.017325 1.013303 1.014003 -0.0037695 0.0037695 0.0044696 5.174612e-06
2: B 1.013262 1.012265 1.012847 1.011107 -0.0021551 0.0021551 0.0021551 8.766456e-07
3: C 1.068948 1.064035 1.062531 1.054479 -0.0144694 0.0144694 0.0144694 3.609233e-05
Once you have decided on the criterion to use (let's say, variance), you can select the top City using:
dt[order(-variance)][1]
>dt
City Ratio1 Ratio2 Ratio3 Ratio4 diff abs_diff range variance
1: C 1.068948 1.064035 1.062531 1.054479 -0.0144694 0.0144694 0.0144694 3.609233e-05
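The warning in the question arises because Ratio1:Ratio4 inside var() is evaluated as the sequence operator on two numeric columns, not as a range of columns. As a rough base R alternative, the per-city statistics can be computed row by row without naming each column. The sketch assumes that all ratio columns start with "Ratio"; DF and ratio_cols are illustrative names.

```r
# The three example rows from the question
DF <- data.frame(
  City   = c("A", "B", "C"),
  Ratio1 = c(1.0177722, 1.0132619, 1.0689484),
  Ratio2 = c(1.0173251, 1.0122653, 1.0640355),
  Ratio3 = c(1.0133026, 1.0128473, 1.0625305),
  Ratio4 = c(1.0140027, 1.0111068, 1.0544790),
  stringsAsFactors = FALSE
)

# Pick the ratio columns by name pattern (assumption: they all start with "Ratio")
ratio_cols <- grep("^Ratio", names(DF), value = TRUE)

# apply(..., 1, f) runs f once per row, i.e. once per city
DF$variance <- apply(DF[ratio_cols], 1, var)
DF$range    <- apply(DF[ratio_cols], 1, function(r) max(r) - min(r))

most_changed  <- DF$City[which.max(DF$variance)]
least_changed <- DF$City[which.min(DF$variance)]
```

With these three rows, most_changed is "C" and least_changed is "B", in line with the variance column printed above.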

How do I find the percentage of something in R?

I am really new to R and this is probably a really basic question, but let's say I have a data set with 2 columns describing students who are male or female. One column has the student, and the other column is gender. How do I find the percentage of each?
Another way using data.table:
students <- data.frame( names = c( "Bill", "Stacey", "Fred", "Jane", "Sarah" ),
gender = c( "M", "F", "M", "F", "F" ),
stringsAsFactors = FALSE )
library( data.table )
setDT( students )[ , 100 * .N / nrow( students ), by = gender ]
# gender V1
# 1: M 40
# 2: F 60
Or dplyr:
library( dplyr )
students %>%
group_by( gender ) %>%
summarise( percent = 100 * n() / nrow( students ) )
# A tibble: 2 × 2
# gender percent
# <chr> <dbl>
# 1 F 60
# 2 M 40
These are both popular packages for operations like these but, as has already been pointed out, you can also stick with base R if you prefer.
You can use the table() function to produce a table telling you how many males and how many females there are among the students. Then just divide this table by the total number of students (you can get this by using the length() function). Finally, multiply the result by 100.
Your code should be something like:
proportions <- table(your_data_frame$gender_column)/length(your_data_frame$gender_column)
percentages <- proportions*100
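Applied to the five example students used in the other answers, the table() approach might look like the sketch below. prop.table() is shown as an equivalent shortcut for the division step.

```r
# Example data, same students as in the other answers
students <- data.frame(
  names  = c("Bill", "Stacey", "Fred", "Jane", "Sarah"),
  gender = c("M", "F", "M", "F", "F"),
  stringsAsFactors = FALSE
)

# Count each gender, divide by the number of students, scale to percent
counts      <- table(students$gender)
percentages <- 100 * counts / length(students$gender)
percentages

# Same result in one step: prop.table() divides the counts by their total
percentages_alt <- 100 * prop.table(counts)
```

Both objects hold 60 for F and 40 for M.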
There are already some good answers to this question, but as the original submitter admits to being new to R, I wanted to provide a very long form answer. The answer below takes more than the minimum necessary number of steps and doesn't use helpers like pipes.
Hopefully, providing an answer in this way helps the original submitter understand what is happening with each step.
# Load the dplyr library
library("dplyr")
# Create an example data frame
students <-
data.frame(
names = c("Bill", "Stacey", "Fred", "Jane", "Sarah"),
gender = c("M", "F", "M", "F", "F"),
stringsAsFactors = FALSE
)
# Count the total number of students.
total_students <- nrow(students)
# Use dplyr filter to obtain just Female students
all_female_students <- dplyr::filter(students, gender %in% "F")
# Count total number of female students
total_female <- nrow(all_female_students)
# Repeat to find total number of male students
all_male_students <- dplyr::filter(students, gender %in% "M")
total_male <- nrow(all_male_students)
# Divide total female students by total students
# and multiply result by 100 to obtain a percentage
percent_female <- (total_female / total_students) * 100
# Repeat for males
percent_male <- (total_male / total_students) * 100
> percent_female
[1] 60
> percent_male
[1] 40
This is probably not the most efficient way to do this, but this is one way to solve the problem.
First you have to create a data.frame. Here is an artificial one:
students <- data.frame(student = c("Carla", "Josh", "Amanda","Gabriel", "Shannon", "Tiffany"), gender = c("Female", "Male", "Female", "Male", "Female", "Female"))
View(students)
Then I use prop.table, which gives me a table of proportions for the cells of the matrix. I coerce it to a data.frame because I love data.frames, and I multiply by 100 to turn the proportions into percentages.
tablature <- as.data.frame.matrix(prop.table(table(students)) * 100)
tablature
I decided to call my data frame table tablature.
So it says "Amanda" is 16 + (2 / 3) % on the female column. Basically that means that she is a Female and thus 0 for male, and my data.frame has 6 students so (1 / 6) * 100 makes her 16.667 percent of the set.
Now what percentage of females and males are there?
Two ways: 1) get the totals for both columns at the same time with the apply function, or 2) get the total for each column one at a time. Either way, we use the sum function.
apply(tablature, 2, FUN = sum)
Female Male
66.66667 33.33333
Imagine that in terms of percentages.
Here tablature is the proportion-table data frame that I am applying the sum function to, and 2 means across the columns (1 would mean across the rows).
So if you just eyeball the small amount of data, you can see that there are 2 / 6 = 33.3333% males in the data.frame students, and 4 / 6 = 66.66667 % females in the data.frame so I did the calculation correctly.
Alternatively,
sum(tablature$Female)
[1] 66.66667
sum(tablature$Male)
[1] 33.33333
And you can make a barplot. As I formatted it, you would have to refer to it as a matrix to get a barplot.
And from here you can make a stacked visual comparison of Gender barplot.
barplot(as.matrix(tablature), xlab = "Gender", main = "Barplot comparison of Gender Among Students", ylab = "Percentages of Student Group")
It's stacking because R made each student a box of 16.6667%.
To be honest it looks better if you just plot the output of the apply function. Of course you could save it to a variable. But naahhh ...
barplot(apply(tablature, 2, FUN = sum), col = c("green", "blue"),xlab = "Gender", ylab = "Percentage of Total Students", main = "Barplot showing the Percentages of Gender Represented Among Students", cex.main = 1)
Now it doesn't stack.

Q-How to fill a new column in data.frame based on row values by two conditions in R

I am trying to figure out how to generate a new column in R that records whether a politician "i" remains in the same party or defects in a given legislature "l". Politicians and parties are identified by indexes. Here is an example of what my data originally looks like:
## example of data
names <- c("Jesus Martinez", "Anrita blabla", "Paco Pico", "Reiner Steingress", "Jesus Martinez Porras")
Parti.affiliation <- c("Winner","Winner","Winner", "Loser", NA)#NA, "New party", "Loser", "Winner", NA
Legislature <- c(rep(1, 5), rep(2,5), rep(3,5), rep(4,5), rep(5,5), rep(6,5))
selection <- c(rep("majority", 15), rep("PR", 15))
sex<- c("Male", "Female", "Male", "Female", "Male")
Election<- c(rep(1955, 5), rep(1960, 5), rep(1965, 5), rep(1970,5), rep(1975,5), rep(1980,5))
d<- data.frame(names =factor(rep(names, 6)), party.affiliation = c(rep(Parti.affiliation,5), NA, "New party", "Loser", "Winner", NA), legislature = Legislature, selection = selection, gender =rep(sex, 6), Election.date = Election)
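## generating ids for politician and party.affiliation
library(dplyr) # for arrange()
d$id_pers<- paste(d$names, sep="")
d <- arrange(d, id_pers)
d <- transform(d, id_pers = as.numeric(factor(id_pers)))
## factor() first, so this also works when party.affiliation is stored as character
d$party.affiliation1<- as.numeric(factor(d$party.affiliation))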
The expected outcome should show the following: if a politician (showed through the column "id_pers") has changed their values in the column "party.affiliation1", a value 1 will be assigned in a new column called "switch", otherwise 0. The same procedure should be done with every politician in the dataset, so the expected outcome should be like this:
d["switch"]<- c(1, rep(0,4), NA, rep(0,6), rep(NA, 6),1, rep(0,5), rep (0,5),1) # 0= remains in the same party / 1= switch party affiliation.
As an example, you can see in this data.frame that the first politician, "Anrita blabla", was a candidate of party '3' from the 1st to the 5th legislature. In the 6th legislature, however, "Anrita" changes her party affiliation and becomes a candidate for party '2'. The new column "switch" should therefore contain '1' to reflect Anrita's change of party affiliation, and '0' for the first 5 legislatures, in which she did not change party.
I have tried several approaches to do that (e.g. loops). I have found this strategy the simplest one, but it does not work :(
## add a new column based on raw values
ind <- c(FALSE, party.affiliation1[-1L]!= party.affiliation1[-length(party.affiliation1)] & party.affiliation1!= 'Null')
d <- d %>% group_by(id_pers) %>% mutate(this = ifelse(ind, 1, 0))
I hope you find this explanation clear. Thanks in advance!!!
I think you could do:
library(tidyverse)
d%>%
group_by(id_pers)%>%
mutate(switch=as.numeric((party.affiliation1-lag(party.affiliation1)!=0)))
The first entry will be NA as we don't have information on whether their previous, if any, party affiliation was different.
Edit: We use the default= parameter of lag() with ifelse() nested to differentiate the first values.
df=d%>%
group_by(id_pers)%>%
mutate(switch=ifelse((party.affiliation1-lag(party.affiliation1,default=-99))>90,99,ifelse(party.affiliation1-lag(party.affiliation1)!=0,1,0)))
Another approach, using data.table:
library(data.table)
# Convert to data.table
d <- as.data.table(d)
# Order by election date
d <- d[order(Election.date)]
# Get the previous affiliation, for each id_pers
d[, previous_party_affiliation := shift(party.affiliation), by = id_pers]
# If the current affiliation is different from the previous one, set to 1
d[, switch := ifelse(party.affiliation != previous_party_affiliation, 1, 0)]
# Remove the column
d[, previous_party_affiliation := NULL]
As Haboryme has pointed out, the first entry of each person will be NA, due to the lack of information on previous elections. And the result would give this:
names party.affiliation legislature selection gender Election.date id_pers party.affiliation1 switch
1: Anrita blabla Winner 1 majority Female 1955 1 NA NA
2: Anrita blabla Winner 2 majority Female 1960 1 NA 0
3: Anrita blabla Winner 3 majority Female 1965 1 NA 0
4: Anrita blabla Winner 4 PR Female 1970 1 NA 0
5: Anrita blabla Winner 5 PR Female 1975 1 NA 0
6: Anrita blabla New party 6 PR Female 1980 1 NA 1
(...)
EDITED
In order to identify the first entry of the political affiliation and assign the value 99 to them, you can use this modified version:
# Note the "fill" parameter passed to the function shift
d[, previous_party_affiliation := shift(party.affiliation, fill = "First"), by = id_pers]
# Set 99 to the first occurrence
d[, switch := ifelse(party.affiliation != previous_party_affiliation, ifelse(previous_party_affiliation == "First", 99, 1), 0)]
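For comparison, the same previous-versus-current check can be sketched in base R with ave(), which applies a function within each id_pers group. The data below is a hypothetical miniature (d_small and its values are made up for illustration), not the question's data.

```r
# Hypothetical miniature: two politicians, three legislatures each
d_small <- data.frame(
  id_pers     = c(1, 1, 1, 2, 2, 2),
  legislature = c(1, 2, 3, 1, 2, 3),
  party       = c("Winner", "Winner", "New party", "Loser", "Loser", "Loser"),
  stringsAsFactors = FALSE
)

# Make sure rows are in chronological order within each politician
d_small <- d_small[order(d_small$id_pers, d_small$legislature), ]

# Previous party within each politician; the first row of each gets NA
prev_party <- ave(d_small$party, d_small$id_pers,
                  FUN = function(p) c(NA, head(p, -1)))

# 1 = switched party, 0 = stayed, NA = no earlier legislature to compare with
d_small$switch <- as.integer(d_small$party != prev_party)
d_small
```

Here the switch column comes out as NA, 0, 1 for the first politician and NA, 0, 0 for the second.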

Sliding window over data.frame with nested hierarchy

Description of the data
My data.frame represents the salary of people living in different cities (city) in different countries (country). city names, country names and salaries are integers. In my data.frame, the variable country is ordered, the variable city is ordered within each country and the variable salary is ordered within each city (and country). There are two additional columns called arg1 and arg2, which contain floats/doubles.
Goal
For each country and each city, I want to consider a window of size WindowSize of salaries and calculate D = sum(arg1)/sum(arg2) over this window. Then the window slides by WindowStep and D should be recalculated, and so on. For example, let's consider WindowSize = 1000 and WindowStep = 10. Within each country and within each city, I would like to get D for the range of salaries between 0 and 1000, for the range between 10 and 1010, for the range between 20 and 1020, etc.
At the end the output should be a data.frame associating a D statistic to each window. If a given window has no entry (for example nobody has a salary between 20 and 1020 in country 1, city 3), then the D statistic should be NA.
Note on performance
I will have to run this algorithm about 10000 times on pretty big tables (that have nothing to do with countries, cities and salaries; I don't yet have a good estimate of the size of these tables), so performance is of concern.
Example data
set.seed(84)
country = rep(1:3, c(30, 22, 51))
city = c(rep(1:5, c(5,5,5,5,10)), rep(1:5, c(1,1,10,8,2)), rep(c(1,3,4,5), c(20, 7, 3, 21)))
tt = paste0(city, country)
salary = c()
for (i in unique(tt)) salary = append(salary, sort(round(runif(sum(tt==i), 0,100000))))
arg1 = rnorm(length(country), 1, 1)
arg2 = rnorm(length(country), 1, 1)
dt = data.frame(country = country, city = city, salary = salary, arg1 = arg1, arg2 = arg2)
head(dt)
country city salary arg1 arg2
1 1 1 22791 -1.4606212 1.07084528
2 1 1 34598 0.9244679 1.19519158
3 1 1 76411 0.8288587 0.86737330
4 1 1 76790 1.3013056 0.07380115
5 1 1 87297 -1.4021137 1.62395596
6 1 2 12581 1.3062181 -1.03360620
With this example, if windowSize = 70000 and windowStep = 30000, the first values of D are -0.236604 and 0.439462 which are the results of sum(dt$arg1[1:2])/sum(dt$arg2[1:2]) and sum(dt$arg1[2:5])/sum(dt$arg2[2:5]), respectively.
Unless I've misunderstood something, the following might be helpful.
Define a simple function regardless of hierarchical groupings:
ff = function(salary, wSz, wSt, arg1, arg2)
{
    froms = (wSt * (0:ceiling(max(salary) / wSt)))
    tos = froms + wSz
    Ds = mapply(function(from, to, salaries, args1, args2) {
                    inds = salaries > from & salaries < to
                    sum(args1[inds]) / sum(args2[inds])
                },
                from = froms, to = tos,
                MoreArgs = list(salaries = salary, args1 = arg1, args2 = arg2))
    list(from = froms, to = tos, D = Ds)
}
Compute on the groups with, for example, data.table:
library(data.table)
dt2 = as.data.table(dt)
ans = dt2[, ff(salary, 70000, 30000, arg1, arg2), by = c("country", "city")]
head(ans, 10)
# country city from to D
# 1: 1 1 0 70000 -0.2366040
# 2: 1 1 30000 100000 0.4394620
# 3: 1 1 60000 130000 0.2838260
# 4: 1 1 90000 160000 NaN
# 5: 1 2 0 70000 1.8112196
# 6: 1 2 30000 100000 0.6134090
# 7: 1 2 60000 130000 0.5959344
# 8: 1 2 90000 160000 NaN
# 9: 1 3 0 70000 1.3216255
#10: 1 3 30000 100000 1.8812397
I.e. a faster equivalent of
lapply(split(dt[-c(1, 2)], interaction(dt$country, dt$city, drop = TRUE)),
function(x) as.data.frame(ff(x$salary, 70000, 30000, x$arg1, x$arg2)))
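The window logic can be checked by hand on a tiny input. The sketch below uses a compact stand-in for ff() (window_ratio is a hypothetical name, and the three salaries are made up) and also turns the NaN of an empty window into NA, which is what the question asked for.

```r
# Compact, hypothetical stand-in for ff(): one D value per window
window_ratio <- function(salary, arg1, arg2, wSz, wSt) {
  froms <- wSt * (0:ceiling(max(salary) / wSt))
  tos <- froms + wSz
  D <- vapply(seq_along(froms), function(i) {
    inds <- salary > froms[i] & salary < tos[i]
    sum(arg1[inds]) / sum(arg2[inds])   # 0 / 0 gives NaN for an empty window
  }, numeric(1))
  D[is.nan(D)] <- NA                    # report NA instead of NaN
  data.frame(from = froms, to = tos, D = D)
}

# Made-up values, small enough to verify by hand
res <- window_ratio(salary = c(5, 15, 25),
                    arg1 = c(1, 2, 3),
                    arg2 = c(1, 1, 2),
                    wSz = 20, wSt = 10)
res
```

The windows (0, 20), (10, 30), (20, 40) and (30, 50) give D = 1.5, 5/3, 1.5 and NA. Note that the is.nan() step would also replace a NaN produced by a non-empty window whose two sums are both zero.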
Without your expected outcome it is a bit hard to guess whether my result is correct but it should give you a head start for the first step. From a performance point of view the data.table package is very fast. Much faster than loops.
set.seed(84)
country <- rep(1:3, c(30, 22, 51))
city <- c(rep(1:5, c(5,5,5,5,10)), rep(1:5, c(1,1,10,8,2)), rep(c(1,3,4,5), c(20, 7, 3, 21)))
tt <- paste0(city, country)
salary <- c()
for (i in unique(tt)) salary <- append(salary, sort(round(runif(sum(tt==i), 0,100000))))
arg1 <- rnorm(length(country), 1, 1)
arg2 <- rnorm(length(country), 1, 1)
dt <- data.frame(country = country, city = city, salary = salary, arg1 = arg1, arg2 = arg2)
head(dt)
# For data table
require(data.table)
# For rollapply
require(zoo)
setDT(dt)
WindowSize <- 10
WindowStep <- 3
dt[, .(D = rollapply(arg1, width = WindowSize, FUN = sum, by = WindowStep) /
           rollapply(arg2, width = WindowSize, FUN = sum, by = WindowStep)),
   by = .(country, city)]
You can achieve the latter part of your goal by melting the data and writing a custom summary function that you use to dcast your data together again.
Table = NULL
StepNumber = 100
WindowSize = 1000
WindowRange = c(0,WindowSize)
WindowStep = 100
for(x in unique(dt$country)){
  #subset of data for that country
  CountrySubset = dt[dt$country == x,,drop=F]
  for(y in unique(CountrySubset$city)){
    #subset of data for cities within that country
    CitySubset = CountrySubset[CountrySubset$city == y,,drop=F]
    for(z in 1:StepNumber){
      #the first window starts at 0, each later one is shifted by WindowStep
      WinRange = WindowRange + ((z-1)*WindowStep)
      #subset of salaries within that city via WinRange
      WindowData = subset(CitySubset, salary > WinRange[1] & salary < WinRange[2])
      CalcD = sum(WindowData$arg1)/sum(WindowData$arg2)
      Output = c(Country = x, City = y, WinStart = WinRange[1], WinEnd = WinRange[2], D = CalcD)
      Table = rbind(Table,Output)
    }
  }
}
Using your example code this should work; it's just a series of nested loops that write to Table. The loops run over unique() values because iterating over dt$country or CountrySubset$city directly visits each group once per row and duplicates lines in Table.
WindowStep is the offset between the starts of consecutive windows.
StepNumber is how many steps you want to take in total; it might be best to find out what the maximum salary is and then adjust for that.
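Growing a table with rbind inside a loop copies it on every iteration. One common alternative is to collect the per-window results in a preallocated list and bind them once at the end. The following is only a sketch on hypothetical data (dt_small and its values are made up); it also returns NA for empty windows, as the question requested.

```r
# Hypothetical small input
dt_small <- data.frame(
  country = c(1, 1, 1, 2, 2),
  city    = c(1, 1, 2, 1, 1),
  salary  = c(50, 150, 50, 80, 260),
  arg1    = c(1, 2, 3, 4, 5),
  arg2    = c(2, 2, 1, 4, 1)
)
WindowSize <- 200
WindowStep <- 100
StepNumber <- 2

# One entry per country/city pair that actually occurs in the data
groups <- unique(dt_small[c("country", "city")])
pieces <- vector("list", nrow(groups) * StepNumber)   # preallocated result list
k <- 0
for (g in seq_len(nrow(groups))) {
  sub <- dt_small[dt_small$country == groups$country[g] &
                  dt_small$city == groups$city[g], ]
  for (z in seq_len(StepNumber)) {
    from <- (z - 1) * WindowStep
    to <- from + WindowSize
    inside <- sub$salary > from & sub$salary < to
    # NA when nobody falls inside the window
    D <- if (any(inside)) sum(sub$arg1[inside]) / sum(sub$arg2[inside]) else NA_real_
    k <- k + 1
    pieces[[k]] <- data.frame(country = groups$country[g], city = groups$city[g],
                              WinStart = from, WinEnd = to, D = D)
  }
}
Table <- do.call(rbind, pieces)   # bind once, after the loops
Table
```

With this input Table has six rows, and the second window of country 1, city 2 is the only one with D equal to NA.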

Resources