How to split data up using an if statement? - R

I have a data set in R that involves students and GPAs, for example:
Student  GPA
Jim     3.00
Tom     3.29
Ana     3.99
and so on.
I want a column that puts them in a bin, for example:
Student GPASplit
Jim     3.0-3.5
Tom     3.0-3.5
Ana     3.5-4.0
When I compute statistics on the GPA column, the results are separated by each distinct GPA value. For example, I am trying to find the percentage of students with a GPA above 3.5, between 3.0 and 3.5, and so forth. But I get percentages in terms of the individual GPAs, and with 4000 data points, nearly all with different GPAs, it is hard to figure out how many have a GPA above 3.5 and so forth. Does this make sense? Sorry if it doesn't.

You can use the cut() function to split data into bins that you define. You have to be careful about values that fall exactly on the boundaries though, and make sure they're being treated how you want. With your example data:
> df$GPA_split = cut(df$GPA, breaks = c(3.0, 3.5, 4.0), include.lowest = TRUE)
> df
  Student  GPA GPA_split
1     Jim 3.00   [3,3.5]
2     Tom 3.29   [3,3.5]
3     Ana 3.99   (3.5,4]
# Count the values in each bin
> table(df$GPA_split)

[3,3.5] (3.5,4] 
      2       1
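Since the goal is percentages, prop.table() turns those counts into proportions; a small sketch using the same df as above:
> prop.table(table(df$GPA_split))

  [3,3.5]   (3.5,4] 
0.6666667 0.3333333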

Joining two dataframes by different values | Plotting two dataframes in one plot

I have this dataframe:
# A tibble: 6 x 4
  Full.Name            Year      freq     Ra
  <chr>                <chr>    <dbl>  <dbl>
1 A. Patrick Beharelle 2019  0.000713  0.110
2 A. Patrick Beharelle 2020 -0.0946   -0.116
3 Aaron P. Graft       2019  0.835     0.276
4 Aaron P. Graft       2020 -0.276     0.376
5 Aaron P. Jagdfeld    2019 -1.20      0.745
6 Aaron P. Jagdfeld    2020 10.7       0.889
This dataframe describes a certain topic. Now, I want to visualize the freq column by Full.Name with a plot. That's not that hard; I can do that. But here comes the tricky part, which I am not able to do:
I have another dataframe, which is exactly the same structure (same columns, but different values), dealing with another topic and I want to include this dataframe into the other one's plot so that I can compare them.
I tried merging both dataframes, but they have a different number of observations, so it's hard to merge them. I tried inner_join(), but because the Full.Name values don't match, that was not successful for me. Maybe there is another way to join the two dataframes.
Any suggestions how to include both dataframes into one plot or even some kind of merged table, distinguishing between both topics, would be great. Any help is appreciated. Thanks in advance!
My understanding of your problem is that you have two dataframes and you want to compare their values in one plot. You can achieve that by appending the two dataframes. Here is an example:
## Sample DataFrame1
df1 = data.frame(Names = c("Alpha", "Alpha", "Rome", "Victor", "Victor"),
                 Year = c(2019, 2020, 2019, 2020, 2019),
                 Freq = c(0.000713, -0.000713, 0.01724, -0.0760713, 0.00213),
                 Dataframe = "df1")
## Sample DataFrame2
df2 = data.frame(Names = c("Gamma", "Gamma", "Tango", "Pan", "Beta"),
                 Year = c(2019, 2020, 2019, 2020, 2019),
                 Freq = c(0.0713, -0.090713, 0.1724, -0.013, 0.0299),
                 Dataframe = "df2")
## Appending the two DataFrames
rbind(df1, df2)
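To then compare the two topics in one plot, here is a minimal ggplot2 sketch on the combined data (the Dataframe column added above distinguishes the topics; dodged bars faceted by Year are just one reasonable choice):
library(ggplot2)
combined <- rbind(df1, df2)
ggplot(combined, aes(x = Names, y = Freq, fill = Dataframe)) +
  geom_col(position = "dodge") + # side-by-side bars, one fill colour per topic
  facet_wrap(~ Year)             # one panel per year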
Hope this helps!

For Loop for Correlations

I want to get correlation values between two variables for each county.
I have subset my data as shown below and get the appropriate value for the individual Adams county, but now I want to do the same for all the other counties:
CorrData<-read.csv("H://Correlation
Datasets/CorrelationData_Master_Regression.csv")
CorrData2<-subset(CorrData, CountyName=="Adams")
dzCases<-(cor.test(CorrData2$NumVisit, CorrData2$dzdx,
method="kendall"))
dzCases
I want a for loop or something similar to make the process more efficient, so that I don't have to write 20 different variable correlations for each of the 93 counties.
When I run the following in R, it doesn't give an error, but it doesn't give the result I was hoping for either. Rather than the Spearman correlation for each county, it seems to ignore the loop portion and just gives me the correlation between the two variables across ALL counties.
CorrData<-read.csv("H:\\CorrelationData_Master_Regression.csv")
for (i in CorrData$CountyName)
{
dzCasesYears<-cor.test(CorrData$NumVisit, CorrData$dzdx,
method="spearman")
}
A very small sample of my data looks similar to this:
CountyName Year NumVisits   dzdx
Adams      2010 4.545454545  1.19
Adams      2011 20.83333333  0.20
Elmore     2010 26.92307692  0.24
Elmore     2011 0            0.61
Brown      2010 0           -1.16
Brown      2011 17.14285714 -1.28
Clark      2010 25          -1.02
Clark      2011 0            1.13
Cass       2010 17.85714286  0.50
Cass       2011 27.55102041  0.11
I have tried to find a similar example online, but am not having luck!
Thank you in advance for all your help!
You are looping, but you never use the iterator i inside the loop body, so each pass runs the same correlation on the full dataset. You are also not storing each cor.test result as you iterate, so only the last one survives. (Based on the comments, you might also want to make sure the columns are numeric.) I'm not sure a loop is the most efficient way to do this, but it will work just fine, and since you started with a loop, you could have something like this:
dzCasesYears <- list() # prep a list to store the cor.test results
counter <- 0           # index into the list as we iterate
for (i in unique(CorrData$CountyName))
{
  counter <- counter + 1
  # Creating new variables makes the code clearer
  x <- as.numeric(CorrData[CorrData$CountyName == i, ]$NumVisit)
  y <- as.numeric(CorrData[CorrData$CountyName == i, ]$dzdx)
  dzCasesYears[[counter]] <- cor.test(x, y, method = "spearman")
}
It's also good practice to wrap the loop values in unique() when iterating like this, so each county is processed exactly once.
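As an aside, the counter can be avoided entirely; a base-R sketch under the same column-name assumptions:
# split() breaks the data into one data.frame per county;
# lapply() runs cor.test on each piece, keeping the county names on the list
dzCasesYears <- lapply(split(CorrData, CorrData$CountyName), function(d)
  cor.test(as.numeric(d$NumVisit), as.numeric(d$dzdx), method = "spearman"))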
data.table makes operations like this very simple.
library('data.table')
CorrData <- as.data.table(read.csv("H:\\CorrelationData_Master_Regression.csv"))
CorrData[, cor(dzdx, NumVisits), CountyName]
With the sample data, the result is all negative ones because there are only two points per county, so the correlation is perfect. The full dataset should be more interesting!
CountyName V1
1: Adams -1
2: Elmore -1
3: Brown -1
4: Clark -1
5: Cass -1
Edit: to include p values from cor.test, as the OP asked in the comments.
This is also quite simple!
CorrData[, .(cor = cor(dzdx, NumVisits),
             p = cor.test(dzdx, NumVisits)$p.value),
         CountyName]
...But it won't work with your sample data, as two points per county is not enough for cor.test to compute a p value. Perhaps you could take @smci's advice and dput a larger subset of the data to make your question truly reproducible.

Table of average score of peer per percentile

I'm quite a newbie in R, so I am interested in the optimality of my solution. It works, but it may be (a bit) long, and I would like your advice on whether the way I solved it is the best; it could also help me learn new techniques and functions in R.
I have a dataset on students identified by their id, with the school they are matched to and the score they obtained on a specific test (so, for short, 3 variables: id, match and score).
I need to construct the following table: for students between two percentiles of score, I calculate the average (across those students) of the average score of the school they are matched to (that is, for each school I take the average score of the students matched to it, then average those school averages within each percentile class; yes, a school's average can appear more than once in this calculation). In English, it answers: "a student belonging to the x-th percentile in terms of score will, on average, be matched to a school of this average quality".
Here is an example (the same data as in the code below):
So in that case, if I take the median (15) for the split (rather than percentiles) I would like to obtain:
[0,15] : 9.5
(15,24] : 20.25
So for students with a score between 0 and 15, I take the average of the average score of the school they are matched to (note that b's average appears twice, but that's fine).
Here is how I did it:
match <- c("a", "b", "a", "b", "c")
score <- c(18, 4, 15, 8, 24)
scoreQuant <- cut(score, quantile(score, probs = seq(0, 1, 0.1), na.rm = TRUE))
AvgeSchScore <- tapply(score, match, mean, na.rm = TRUE)
AvgScore <- 0
for (i in 1:length(score)) {
  AvgScore[i] <- AvgeSchScore[match[i]]
}
results <- tapply(AvgScore, scoreQuant, mean, na.rm = TRUE)
Is there a more direct way of doing it? I think the weak point is the for loop; maybe apply() would be better? But I'm not sure how to use it here (I tried to write my own function, but it crashed, so I brute-forced it).
Thanks :)
The main fix is to eliminate the for loop with:
AvgScore <- AvgeSchScore[match]
R allows you to subset in ways that you cannot in other languages. The tapply function returns a vector named by the levels of the factor you grouped by; we use match to subset AvgeSchScore by those names.
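Putting it together, the whole computation collapses to two tapply calls (same variables as above):
AvgeSchScore <- tapply(score, match, mean, na.rm = TRUE)               # average per school
results <- tapply(AvgeSchScore[match], scoreQuant, mean, na.rm = TRUE) # average per percentile bin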
data.table
If you would like to try data.table you may see speed improvements.
library(data.table)
match <- c("a","b","a","b","c")
score <- c(18,4,15,8,24)
dt <- data.table(id=1:5, match, score)
scoreQuant <- cut(dt$score,quantile(dt$score,probs=seq(0,1,0.1),na.rm=TRUE))
dt[, AvgeScore := mean(score), match][, mean(AvgeScore), scoreQuant]
# scoreQuant V1
#1: (17.4,19.2] 16.5
#2: NA 6.0
#3: (12.2,15] 16.5
#4: (7.2,9.4] 6.0
#5: (21.6,24] 24.0
It may be faster than base R. If the value in the NA row bothers you, you can delete it afterwards.
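For instance, a one-line sketch of dropping that row (res holding the result of the chain above):
res <- dt[, AvgeScore := mean(score), match][, mean(AvgeScore), scoreQuant]
res[!is.na(scoreQuant)] # keep only rows with a defined quantile bin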

R Programming - Sum Elements of Rows with Common Values

Hello and thank you in advance for your assistance,
(PLEASE note the comments section for additional insight: the COST column in the example below was added to this question after Simon's original answer, but the function he provides handles it.)
I have a data set, let's call it 'data', which looks like this:
NAME DATE     COLOR PAID COST
Jim  1/1/2013 GREEN  150  100
Jim  1/2/2013 GREEN   50   25
Joe  1/1/2013 GREEN  200  150
Joe  1/2/2013 GREEN   25   10
What I would like to do is sum the PAID (and COST) elements of the records with the same NAME value and reduce the number of rows (as in this example) to 2, such that my new data frame looks like this:
NAME DATE     COLOR PAID COST
Jim  1/2/2013 GREEN  200  125
Joe  1/2/2013 GREEN  225  160
As far as the dates are concerned, I don't really care about which one survives the summation process.
I've gotten as far as rowSums(data), but I'm not exactly certain how to use it. Any help would be greatly appreciated.
aggregate is the function you are looking for:
aggregate( cbind( PAID , COST ) ~ NAME + COLOR , data = data , FUN = sum )
#   NAME COLOR PAID COST
# 1  Jim GREEN  200  125
# 2  Joe GREEN  225  160
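For comparison, a dplyr sketch of the same grouped sum (assuming the dplyr package; keeping the last DATE per group is an arbitrary choice, since the question doesn't care which date survives):
library(dplyr)
data %>%
  group_by(NAME, COLOR) %>%
  summarise(DATE = last(DATE), PAID = sum(PAID), COST = sum(COST))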

In R and ddply, is it possible to avoid enumerating all columns I need when using ddply?

Other posts suggest that ddply is a good workhorse.
I am trying to learn the xxply functions and I cannot solve this problem.
This is my data:
library(ggplot2)
(df= tips[1:5,])
  total_bill  tip    sex smoker day   time size
1      16.99 1.01 Female     No Sun Dinner    2
2      10.34 1.66   Male     No Sun Dinner    3
3      21.01 3.50   Male     No Sun Dinner    3
4      23.68 3.31   Male     No Sun Dinner    2
5      24.59 3.61 Female     No Sun Dinner    4
and I need to do something like this:
ddply(df
      , .(<do I have to enumerate all the columns I need to operate on here?>)
      , function(x) { if (size >= 3) return(size) else return(total_bill + tip) }
      )
(the example is a fake problem (does not make real life sense) and only demonstrates my problem with larger data)
I could not get the ddply code right from reading just the help files. Any advice is appreciated, or even a good ddply tutorial.
I like that with ddply I can just pass my dataframe as input, but in the second argument, it is not nice that I am forced to enumerate all columns that I need later. Is there a way to pass the whole row (all columns)?
I like defining the function on the fly, but I am not sure how to make my pseudocode correct in R (my last argument).
Based on your code, it doesn't look like you need plyr here at all. It seems to me you are calculating a new value for each row of the data.frame. If that's the case, just use some base R functions:
dat <- transform(dat, newval = ifelse(size >= 3, size, total_bill + tip))
total_bill tip sex smoker day time size newval
1 16.99 1.01 Female No Sun Dinner 2 18.00
2 10.34 1.66 Male No Sun Dinner 3 3.00
3 21.01 3.50 Male No Sun Dinner 3 3.00
4 23.68 3.31 Male No Sun Dinner 2 26.99
5 24.59 3.61 Female No Sun Dinner 4 4.00
Sorry if I misunderstood what you are doing. If you do in fact need to pass the entire row of a data.frame into plyr with no grouping variable, perhaps you can treat it as an array with margin = 1, i.e. adply(dat, 1, ...).
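A sketch of that adply route on the df from the question (each row arrives in the function as a one-row data.frame, so a plain if works):
library(plyr)
adply(df, 1, function(row) {
  row$newval <- if (row$size >= 3) row$size else row$total_bill + row$tip
  row
})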
A great introduction to plyr is here: www.jstatsoft.org/v40/i01/paper
The second argument is the "splitting" variable. So with your sample data set, if you're looking at the difference in spending habits between the sexes, you would supply .(sex); and if you want all combinations of your categorical variables, then yes, you would have to supply them all: .(sex, smoker, day, time).
On a separate note, when using ddply your function should take a data.frame and return a data.frame; currently it returns a vector. Also, if is not vectorized, so you should use ifelse:
ddply(df, .(sex), function(x) {
  x$new.var <- ifelse(x$size >= 3, x$size, x$total_bill + x$tip)
  return(x)
})
If you don't specify the return value, R returns the last thing calculated, which here is a vector.
My only other suggestion is to keep playing with plyr. Eventually it will click and you'll love it!
I don't know if this is still useful. I am not sure whether this is the best approach, but I usually solve tasks similar to yours as follows:
ddply(df
      , as.quoted(colnames(df))
      , function(x) { if (x$size >= 3) return(x$size) else return(x$total_bill + x$tip) }
      )
