Using the median function to create a new variable in R

I would like to use the median function to create a new variable. If an observation (a polled person) in the sample has an education level greater than the median of the sample, they will be considered "high_educ". If their level of education is lower than the median, they will be considered "low_educ".
There are 1247 observations (n) in my sample so I should end up with something like 623 low_educ and 624 high_educ.
In the code below I didn't include the median function, as I don't know how to. Instead I hard-coded a threshold (here: 13.5), but of course this way my sample population is not split evenly in two, as it would be if I were using the median function.
For your info, "educ" values are integers, and my query only seems to work when I wrap educ in as.numeric() inside the ifelse() function.
educ12 <- gss %>%
  filter(year == 2012, !is.na(educ), !is.na(abany)) %>%
  mutate(education = ifelse(as.numeric(educ) >= 13.5, "high_educ", "low_educ"))
Could you please help me to understand how to use the median function in my code?
Thanks a lot
Michael
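A minimal sketch of the median-based version, using a toy stand-in for the real gss data (the column names are taken from the question, the values are made up). Note that because educ takes integer values, observations exactly at the median all land in one group, so the split is usually only approximately 50/50:

```r
library(dplyr)

# toy stand-in for the real gss data frame (columns assumed from the question)
gss <- data.frame(year  = 2012,
                  educ  = c(8, 10, 12, 14, 16),
                  abany = "YES")

educ12 <- gss %>%
  filter(year == 2012, !is.na(educ), !is.na(abany)) %>%
  mutate(education = ifelse(as.numeric(educ) > median(as.numeric(educ)),
                            "high_educ", "low_educ"))

table(educ12$education)
```

Here median(educ) is computed on the already-filtered sample, which matches the question's intent; ties at the median fall into "low_educ".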

Related

Finding summary statistics: struggling to get anything to work after importing data into R from Excel

Very new to R here, also very new to the idea of coding and computer stuff.
Second week of class and I need to find some summary statistics from a set of data my professor provided. I downloaded the data and tried to follow along with his verbal instructions during class, but I am one of the only people without a computer-science background in my degree program (I am an RN going for a degree in Health Informatics), so he went way too fast for me.
I was hoping for some input on just where to start with his list of tasks. I downloaded his data into an Excel file and then uploaded it into R, where it is now a matrix. However, everything I try for getting the mean and standard deviation of the columns he wants comes up with an error. I understand that I need to convert these column names into some sort of vector, but every website online tells me to do these tasks differently. I don't even know where to start with this assignment.
Any help on how to get started would be greatly appreciated. I've included a screenshot of his instructions and of my matrix. Please excuse my ignorance/lack of familiarity compared to most of you here; this is my second week into my master's and I hope to begin picking this up soon.
the instructions include:
# * Import the dataset
# * Summarize the dataset; compute the mean and standard deviation for the three variables (columns): age, height, weight
# * Tabulate smokers and age.level data with the variable and its frequency. How many smokers are in each age category?
# * Subset the dataset by the mothers that smoke and weigh less than 100 kg. How many mothers meet these requirements?
# * Compute the mean and standard deviation for the three variables (columns): age, height, weight
# * Plot a histogram
Stack Overflow is not a place for homework, but I feel your pain. Let's go piece by piece.
First let's use a package that helps us do those tasks:
library(data.table) # if not installed, install it with install.packages("data.table")
Then, let's load the data:
library(readxl) #again, install it if not installed
dt = setDT(read_excel("path/to/your/file/here.xlsx"))
Now to the calculations:
1 summarize the dataset. Here you'll see the ranges, means, medians and other interesting data of your table.
summary(dt)
1A mean and standard deviation of age, height and weight (replace age with height or weight to get those):
dt[, .(meanValue = mean(age, na.rm = TRUE), stdDev = sd(age, na.rm = TRUE))]
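To avoid repeating that line once per column, data.table's .SD/.SDcols can compute all three at once. A sketch on toy data (the real columns are assumed to be named age, height and weight, per the assignment):

```r
library(data.table)

# toy data standing in for the real dt (column names assumed from the assignment)
dt <- data.table(age    = c(20, 30, 40),
                 height = c(150, 160, 170),
                 weight = c(50, 60, 70))

cols <- c("age", "height", "weight")
dt[, lapply(.SD, mean, na.rm = TRUE), .SDcols = cols]  # one mean per column
dt[, lapply(.SD, sd,   na.rm = TRUE), .SDcols = cols]  # one sd per column
```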
2 tabulate smokers and age.level. get the counts for each combination:
dt[, .N, by = .(smoke, age.level)]
3 subset smoking mothers weighing less than 100 (I'm assuming non-pregnant mothers have NA in the gestation field; adjust as necessary):
dt[smoke == 1 & weight < 100 & !is.na(gestation), .N]
4 is the same as 1A.
5 Plot a histogram (you don't specify which variable, so let's say it's age):
hist(dt$age)
Keep on studying R, it's not that difficult. The book recommended in the comments is a very good start.

Why is my R function showing Length/Class/Mode instead of frequencies?

I am currently taking a course in statistics, and they asked us to install R and an R environment to use during the course. This is my first time using R.
Our first task is to examine some CSV files using these commands:
ex.1 <- read.csv('ex1.csv')
summary(ex.1)
colnames(ex.1)
The results should look like this:
id sex height
Min. :1538611 FEMALE:54 Min. :117.0
1st Qu.:3339583 MALE :46 1st Qu.:158.0
Median :5105620 Median :171.0
Mean :5412367 Mean :170.1
3rd Qu.:7622236 3rd Qu.:180.2
Max. :9878130 Max. :208.0
However, I am getting this (I included my entire code):
> getwd()
[1] "C:/Users/hp/Documents"
> setwd("C:/Users/hp/Documents/R")
> dir()
[1] "ex1.csv" "ex2.csv" "flowers.csv" "pop1.csv" "pop2.csv"
[6] "pop3.csv" "win-library"
> ex.1 <- read.csv('ex1.csv')
> summary(ex.1)
id sex height
Min. :1538611 Length:100 Min. :117.0
1st Qu.:3339583 Class :character 1st Qu.:158.0
Median :5105620 Mode :character Median :171.0
Mean :5412367 Mean :170.1
3rd Qu.:7622236 3rd Qu.:180.2
Max. :9878130 Max. :208.0
> colnames(ex.1)
[1] "id" "sex" "height"
What is the problem?
Do this: ex.1$sex <- as.factor(ex.1$sex). Then try the summary command again.
The thing is, when you read the file as CSV, the sex column comes in as character. You have to convert it to a factor.
What is happening is that R is reading the sex column as a string of text and not acknowledging that all cells with "MALE" in them should be grouped, as should all "FEMALE" cells. So it reports that the cells contain text, whereas you want a special form of text, a factor. as.factor() forces R to recognise identical text strings as belonging to the same group.
How to stop it from happening next time? One way is to add the argument stringsAsFactors = TRUE to your read.csv command, e.g.
read.csv('ex1.csv', stringsAsFactors = TRUE)
This coerces the strings which are read by R to become factors.
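A self-contained illustration of the difference (no file needed):

```r
sex <- c("MALE", "FEMALE", "FEMALE")

summary(sex)             # character vector: reports Length / Class / Mode
summary(as.factor(sex))  # factor: reports frequencies per level
```

The first summary only describes the vector's type; the second counts FEMALE and MALE occurrences, which is the output the course expects.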

Table of average score of peer per percentile

I'm quite a newbie in R, so I'm interested in whether my solution is optimal. Even though it works, it may be (a bit) long, and I'd like your advice on whether the way I solved it is "the best"; it could help me learn new techniques and functions in R.
I have a dataset on students identified by their id and I have the school where they are matched and the score they obtained at a specific test (so for short: 3 variables id,match and score).
I need to construct the following table: for students between two percentiles of score, I need to compute the average (across students) of the average score of the school they are matched to. So for each school I take the average score of the students matched to it, and then I average those school averages within each percentile class (yes, a school's average can appear more than once in this calculation). In plain English it lets me answer: "A student belonging to the x-th percentile in terms of score will on average be matched to a school of this average quality."
Here is an example in the picture:
So in that case, if I take the median (15) for the split (rather than percentiles) I would like to obtain:
[0,15] : 9.5
(15,24] : 20.25
So for students with a score between 0 and 15, I take the average of the average score of the school they are matched to (note that b's average will appear twice, but that's OK).
Here is how I did it:
match <- c("a", "b", "a", "b", "c")
score <- c(18, 4, 15, 8, 24)
scoreQuant <- cut(score, quantile(score, probs = seq(0, 1, 0.1), na.rm = TRUE))
AvgeSchScore <- tapply(score, match, mean, na.rm = TRUE)
AvgScore <- 0
for (i in 1:length(score)) {
  AvgScore[i] <- AvgeSchScore[match[i]]
}
results <- tapply(AvgScore, scoreQuant, mean, na.rm = TRUE)
Do you have a more direct way of doing it? I think the weak point is step 3, using a loop; maybe apply() would be better? I'm not sure how to use it here (I tried to write my own function but it crashed, so I brute-forced it).
Thanks :)
The main fix is to eliminate the for loop with:
AvgScore <- AvgeSchScore[match]
R allows you to subset in ways that you cannot in many other languages. The tapply function returns a vector named by the levels of the factor you grouped by. We are using those names, via match, to subset AvgeSchScore.
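A minimal illustration of that named-vector subsetting, using the school averages from the example:

```r
# a named vector like the one tapply returns (school averages from the example)
AvgeSchScore <- c(a = 16.5, b = 6, c = 24)
match <- c("a", "b", "a", "b", "c")

# indexing a named vector by a character vector repeats lookups as needed,
# giving one school average per student with no loop
AvgScore <- AvgeSchScore[match]
```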
data.table
If you would like to try data.table you may see speed improvements.
library(data.table)
match <- c("a","b","a","b","c")
score <- c(18,4,15,8,24)
dt <- data.table(id=1:5, match, score)
scoreQuant <- cut(dt$score,quantile(dt$score,probs=seq(0,1,0.1),na.rm=TRUE))
dt[, AvgeScore := mean(score), match][, mean(AvgeScore), scoreQuant]
# scoreQuant V1
#1: (17.4,19.2] 16.5
#2: NA 6.0
#3: (12.2,15] 16.5
#4: (7.2,9.4] 6.0
#5: (21.6,24] 24.0
It may be faster than base R. If the NA row bothers you, you can delete it afterwards.
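One way to drop that NA row afterwards, sketched on a small table shaped like the output above (the values here are illustrative):

```r
library(data.table)

# result table shaped like the output above
res <- data.table(scoreQuant = c("(7.2,9.4]", NA, "(21.6,24]"),
                  V1         = c(6, 16.5, 24))

res <- res[!is.na(scoreQuant)]  # keep only rows with a defined quantile bin
```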

Optimizing R code that uses the rep function

I'm working with data from an income/expense per home poll.
The 9,002 observations in the sample database represent 3,155,937 homes through an expansion factor, like this:
Homeid Income Factor
001 23456 678
002 42578 1073
.. .. ..
9002 62333 987
I'm trying to get an exact summary of the total income per decile by expanding each income value by its factor, which gives a 3,155,937-observation vector, and then using a 'for' loop to assign each value the decile it belongs to.
Three <- Nal %>% select(income,factor)
Five <- data.frame(income=rep(Three$income,Three$factor))
for(i in 1:31559379) {
  if      (i <= 3155937)  Five$Decil[i] <- 1
  else if (i <= 6311874)  Five$Decil[i] <- 2
  else if (i <= 9467811)  Five$Decil[i] <- 3
  else if (i <= 12623748) Five$Decil[i] <- 4
  else if (i <= 15779685) Five$Decil[i] <- 5
  else if (i <= 18935622) Five$Decil[i] <- 6
  else if (i <= 22091559) Five$Decil[i] <- 7
  else if (i <= 25247496) Five$Decil[i] <- 8
  else if (i <= 28403433) Five$Decil[i] <- 9
  else                    Five$Decil[i] <- 10
}
for(i in 1:10) {
  Two <- filter(Five, Decil == i)
  TotDecil$inctot[i] <- sum(Two$income)
}
rm(Five); rm(Three); rm(Two); gc()
I want to know if you can help me optimize this code; it has been running for hours and still hasn't finished.
The ntile function from the dplyr package worked better:
Three <- Nal %>% select(income,factor)
Five <- data.frame(income=rep(Three$income,Three$factor))
Five$Decil <- ntile(Five$income, 10)
# ^ This line works instead of that 'for' loop & it only takes seconds to run
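A self-contained sketch of the expand-then-ntile approach on toy data (the real column names may differ), including the per-decile totals computed with group_by instead of the second loop:

```r
library(dplyr)

# toy stand-in for the real data: income and expansion factor per home
Three <- data.frame(income = c(100, 200, 300), factor = c(2, 3, 5))

# expand each income by its factor, then assign deciles in one vectorized call
Five <- data.frame(income = rep(Three$income, Three$factor))
Five$Decil <- ntile(Five$income, 10)

# total income per decile, replacing the filter-and-sum loop
TotDecil <- Five %>% group_by(Decil) %>% summarise(inctot = sum(income))
```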

Group a continuous variable in R

My aim is to check, in a pivot table, whether there is a link between the presence of one particular shop and the population density of the areas where these shops are found. For that, I have a CSV file with 600 examples of areas where the shop is OR is not present. The file has 600 lines and two columns: 1) a number representing the population density of the area, and 2) the quantity of this particular shop in the area (0, 1 or 2).
In order to build a pivot table, I need to group the densities into 10 groups of 60 lines each (the first group holding the 60 highest densities, down to the last group with the 60 lowest). Then I'll easily be able to see how many shops are built where density is low or high. Is that understandable (I hope)? :)
Nothing really difficult, I suppose, but there are so many ways (and packages) that could work for this that I'm a little bit lost.
My main issue: what is the simplest way to group my variable into ten groups of 60 lines each? I've tried cut()/cut2() and hist() without success; I've heard about bin_var() and reshape() but I don't understand how they would help in this case.
For example (as Justin asked).
With cut():
data <- read.csv("data.csv", sep = ";")
groups <- cut(as.numeric(data$densit_pop2), breaks=10)
summary(groups)
(0.492,51.4] (51.4,102] (102,153] (153,204] (204,255] (255,306]
53 53 52 52 52 54
(306,357] (357,408] (408,459] (459,510]
52 59 53 54
OK, good: 'groups' indeed contains 10 groups with almost the same number of lines. But certain values in the intervals don't make any sense to me. Here are the first lines of the density column (sorted in increasing order):
> head(data$densit_pop2)
[1] 14,9 16,7 17,3 18,3 20,2 20,5
509 Levels: 100 1013,2 102,4 102,6 10328 103,6 10375 10396,8 104,2 ... 99,9
I mean, look at the first group: why 0.492 when 14.9 is my smallest value? And if I count manually how many lines there are between the first one and the value 51.4, I find 76. Why does it say 53 lines? Note that the data frame is correctly sorted from lowest to highest.
I certainly miss something... but what ?
I think you'll be happy with cut2 once you have a numeric variable to work with. Since commas are used as your decimal separator, use read.csv2 or pass the argument dec = "," when reading in the dataset.
y = runif(600, 14.9, 10396.8)
require(Hmisc)
summary(cut2(y, m = 60))
You can do the same thing with cut, but you would need to set your breaks at the appropriate quantiles to get equal groups which takes a bit more work.
summary(cut(y, breaks = quantile(y, probs = seq(0, 1, 1/10)), include.lowest = TRUE))
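A quick self-contained check (re-simulating y as above, with a seed for reproducibility) that the quantile-break version of cut really yields ten equal-sized groups:

```r
set.seed(1)
y <- runif(600, 14.9, 10396.8)

# deciles as breakpoints => 10 bins of 60 observations each
groups <- cut(y, breaks = quantile(y, probs = seq(0, 1, 1/10)),
              include.lowest = TRUE)
table(groups)
```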
Responding to your data: you need to correct errors in data entry:
data$densit_pop3 <- as.numeric(sub('\\,', '.', as.character(data$densit_pop2)))
Then, something along these lines (assuming this is not really a question about loading data from text files):
with(dfrm, by(dens, factor(shops), summary) )
As an example of the output one might get:
with(BNP, by( proBNP.A, Sex, summary))
Sex: Female
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
5.0 55.7 103.6 167.9 193.6 5488.0 3094899
---------------------------------------------------------------------
Sex: Male
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
5 30 63 133 129 5651 4013760
If you are trying to plot this to look at the density of densities (which in this case seems like a reasonable request) then try this:
require(lattice)
densityplot( ~dens|shops, data=dfrm)
(And please do stop calling these "pivot tables". That is an aggregation strategy from Excel and one should really learn to describe the desired output in standard statistical or mathematical jargon.)