I am working on a problem for a statistics class that uses baseball team data such as attendance, wins/losses, and other team stats. The problem statement calls for creating variables for winning teams (those with 81 or more wins), losing teams (those with fewer than 81 wins), and attendance in three categories: less than 2 million, between 2 and 3 million, and more than 3 million.
The raw data is keyed by team name, with one team per row and then the stats in each column.
I then need to create a table with counts of the number of teams along those dimensions, like:
Winning Season   Low Attendance   Med. Attendance   High Attendance
Yes              3                12                3
No               2                10                2
We can use whatever tool we'd like, and I am attempting to use R and RStudio to create the table so I can learn about statistics and R at the same time. However, I can't figure out how to make it happen or what function(s) to use to build a table with those aggregate counts.
I have looked at data.table and dplyr and others, but I cannot seem to figure out how to get counts along those dimensions. If it were SQL, I would write
select count(*) from table where attend < 2000000 and wins < 81
and then programmatically create the table. I can't figure out how to do the same in R.
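The SQL-style count above translates fairly directly into R. This is only a sketch: the data frame name `teams` and the column names `wins` and `attend` are assumptions, and will likely differ in the real data. The key tools are `ifelse` (or a comparison) for the win/loss split, `cut` for the attendance bins, and `table` to cross-tabulate:

```r
# Sketch with made-up data: assumes columns `wins` and `attend`
teams <- data.frame(
  wins   = c(85, 70, 95, 60, 90),
  attend = c(1.5e6, 2.5e6, 3.2e6, 1.8e6, 2.1e6)
)

# Derive the two categorical variables
winning    <- ifelse(teams$wins >= 81, "Yes", "No")
attendance <- cut(teams$attend,
                  breaks = c(-Inf, 2e6, 3e6, Inf),
                  labels = c("Low", "Med.", "High"))

# Cross-tabulate counts along both dimensions
counts <- table(winning, attendance)
print(counts)

# The single SQL-style count is just a logical sum:
sum(teams$attend < 2e6 & teams$wins < 81)
```

`cut` with `-Inf`/`Inf` endpoints handles the open-ended first and last bins, and `table(winning, attendance)` produces the full 2x3 grid in one call rather than six separate `count(*)` queries.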
Thank you for any help.
Disclaimer: I can't include data because it's confidential student data.
I have an R dataframe "data" with a column "StateResidence" for the state each student is from, and a column "Enrolled" containing a 0 or 1 that tells whether or not they enrolled in the school I go to.
I'm trying to make a dataframe with three columns: column 1 should list each of the 69 unique states in the data (I've already done this one), column 2 should show how many students from that state enrolled, and column 3 should show what percentage of the total students from that state enrolled.
The reason for this is so I can do some exploratory data analysis by plotting barplots with the number on the Y axis and the state on the X axis to analyze enrollment trends geographically.
I really don't have much else to include - I'm completely lost here, and I'm not very familiar with R. Any help is greatly appreciated, even just some helpful functions or something. Thank you.
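One common approach is dplyr's `group_by` plus `summarise`. The column names `StateResidence` and `Enrolled` come from the question; the few rows below are made up, since the real data is confidential:

```r
# Sketch: a tiny fake dataset stands in for the confidential one
library(dplyr)

data <- data.frame(
  StateResidence = c("OH", "OH", "PA", "PA", "PA", "NY"),
  Enrolled       = c(1,    0,    1,    1,    0,    0)
)

enrollment_by_state <- data %>%
  group_by(StateResidence) %>%
  summarise(enrolled     = sum(Enrolled),          # column 2: how many enrolled
            pct_enrolled = 100 * mean(Enrolled))   # column 3: percent enrolled

# For the barplot described above:
# barplot(enrollment_by_state$enrolled,
#         names.arg = enrollment_by_state$StateResidence)
```

Because `Enrolled` is 0/1, `sum` gives the enrolled count and `mean` gives the enrollment rate directly, so no separate division step is needed.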
Edit: using the advice from one of the answers, I was able to use "table(ArrestData$CHARGE)", but since there are over 2,400 entries, many of them are omitted from the output. I am looking for the top 5 charges; is there code for this? Additionally, I am looking at a particular council district (another variable, titled "CITY_COUNCIL_DIST"). I want to see the top 5 charges given out within a specific council district. Is there code for this?
Thanks for the help!
Original post follows
Just like I can use "names(MyData)" to see the names of my variables, I am wondering if there is a command to see the names/responses/data points of a specific column.
In other words, I am attempting to see the values in the rows of a specific column of data. I would like to see which values are being used overall.
After I find this, I would like to know how many times each value appears, whether as a count or a percentage. After this, I would like to see how many times each value appears with the condition that it meets a certain value of another column/variable.
Apologies if this, in any way, is confusing.
To go further in depth, I am playing around with the Los Angeles Police Data that I got via the Office of the Mayor's website. From 2017-2018, I am attempting to see what charges and the amount of each specific charge were given out in Council District 5. CHARGE and CITY_COUNCIL_DIST are the two variables I am looking at.
Any and all help will be appreciated.
To get all the distinct values, you can use the unique function, as in:
> x <- c(1,1,2,3,3,4,5,5,5,6)
> unique(x)
[1] 1 2 3 4 5 6
To count how many times each distinct value occurs, you can use table, as in:
> x <- c(1,1,2,3,3,4,5,5,5,6)
> table(x)
x
1 2 3 4 5 6
2 1 2 1 3 1
The first row gives you the distinct values and the second row the counts for each of them.
EDIT
This edit is aimed to answer your second question following with my previous example.
In order to look for the top five most repeated values of a variable we can use base R. To do so, I would first create a dataframe from your table of frequencies:
df <- as.data.frame(table(x))
Having this, now you just have to order the column Freq in descending order and keep the first five rows:
head(df[order(-df$Freq), ], 5)
In order to look for the top five most repeated values of a variable within a group, however, we need to go beyond base R. I would use dplyr to create an augmented dataframe with frequencies for each value of the variable of interest, let it be count_variable:
library(dplyr)
x_or <- x %>%
  group_by(group_variable, count_variable) %>%
  summarise(freq = n())
where x is your original dataframe, group_variable is the variable for your groups and count_variable is the variable you want to count. Now you just have to order the object so that, within each group, the most frequent values of count_variable come first, and keep the top five:
x_or %>%
  arrange(group_variable, desc(freq)) %>%
  slice_head(n = 5)
(After summarise, x_or is still grouped by group_variable, so slice_head(n = 5) keeps five rows per group.)
I have basic knowledge of R, but I'm not sure how to go about the following programming:
I have a large data frame of data from 25 participants. Each participant has a different total number of sound files (e.g., #1 might have 105 and #2 might have 98). For each participant, we've coded whether or not they're in a specific location (1 if they're in their apartment, 0 if not).
I want to get the count of how many times Apartment=1 for each participant. This is where I'm confused: my Excel sheet is organized such that all of the participants are stacked on top of each other. So the first 105 rows are for Participant #1, and the next 98 rows are for Participant #2.
Do I need to aggregate my data by participant number, and then get the count?
I've never written a for loop, but is that another possible solution?
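No loop or manual aggregation is needed: because the 0/1 column sums to the number of 1s, a single grouped sum does it. This is a sketch with made-up column names (`Participant`, `Apartment`), which will likely differ in the real sheet:

```r
# Sketch: the stacked sheet read into a data frame, names assumed
df <- data.frame(
  Participant = c(1, 1, 1, 2, 2),
  Apartment   = c(1, 0, 1, 1, 0)
)

# One row per participant with the number of Apartment == 1 rows
counts <- aggregate(Apartment ~ Participant, data = df, FUN = sum)

# dplyr equivalent:
# library(dplyr)
# df %>% group_by(Participant) %>% summarise(in_apartment = sum(Apartment))
```

`aggregate` handles the "stacked on top of each other" layout directly, since it groups by the Participant values rather than by row position.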
I simulated a dataset for an online retail market. Customers can purchase products in different stores in Germany (e.g. Munich, Berlin, Hamburg...) and in online stores. To get the latitude/longitude data for the cities I use geocode from the ggmap package. But customers who purchase online can be anywhere in the country. Now I want to generate random latitude/longitude data within Germany for the online purchases, to map them later with Shiny and leaflet. Is there any way to do this?
My df looks like this:
View(df)
ClientId Store ... lat lon
1 Berlin 52 13
2 Munich 48 11
3 Online x x
4 Online x x
But my aim is a data frame for example like this:
ClientId Store ... lat lon
1 Berlin 52 13
2 Munich 48 11
3 Online 50 12
4 Online 46 10
Is there any way to get these random latitude/longitude data and integrate it to my data frame?
Your problem is twofold. First of all, as a newbie to R, you are not yet used to the semantics required to do what you need. Fundamentally, what you are asking R to do is:
First, identify which orders are sourced from Online.
Second, generate a random lat and lon for those orders.
First, to identify elements of your data frame which fit a criterion, you use the which function. Thus, to find the rows in your data frame which have the Store column equal to "Online", you do:
df[which(df$Store == "Online"), ]
To update the lat or lon for a particular row, we need to be able to access the column. To get values of a particular column, we use $. For example, to get the lat values for the online orders you use:
df$lat[which(df$Store=="Online")]
Great! The problem now diverges and increases in complexity. For the new values, do you want to generate simple values to accomplish your demo, or do you want to come up with new logic to generate spatial results in a given region? You indicate you would like to generate data points within Germany itself; however, that is beyond the scope of this question. For now, we will consider the easy example of generating values in a bounding box and updating your data.frame accordingly.
To generate integer values in a given range, we can use the sample function. Assuming that you would want lat values in the range of 45 and 55 and lon values in the range of 9 to 14 we can do the following:
df$lat[which(df$Store=="Online")] <- sample(45:55, length(which(df$Store=="Online")), replace = TRUE)
df$lon[which(df$Store=="Online")] <- sample(9:14,  length(which(df$Store=="Online")), replace = TRUE)
Reading this code, we have updated the lat values of the "Online" orders in df with a vector of random numbers from 45:55 of the proper length (the number of "Online" orders). The replace = TRUE is needed because sample draws without replacement by default, which would fail as soon as there are more online orders than candidate values.
If you wanted more decimal precision, you can use similar logic with the runif function which samples from the uniform distribution and round to get the appropriate amount of precision. Good luck!
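Here is a sketch of that runif-based variant, using a small fake data frame shaped like the one in the question (the NA placeholders stand in for the "x" values):

```r
# Sketch: continuous coordinates with two decimal places via runif + round
set.seed(1)  # only so the example is reproducible

df <- data.frame(
  ClientId = 1:4,
  Store    = c("Berlin", "Munich", "Online", "Online"),
  lat      = c(52, 48, NA, NA),
  lon      = c(13, 11, NA, NA)
)

online <- df$Store == "Online"
df$lat[online] <- round(runif(sum(online), min = 45, max = 55), 2)
df$lon[online] <- round(runif(sum(online), min = 9,  max = 14), 2)
```

Unlike sample over integers, runif never collides on repeated draws, so no replace argument is needed; round(x, 2) keeps two decimals of precision.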
I am working with a large dataset (10 million + cases) where each case represents a shop's monthly transactions for a given product (there are 17 products). As such, each shop is potentially represented across 204 cases (12 months * 17 products; note that not all shops sell all 17 products throughout the year).
I need to restructure the data so that there is one case for each product transaction. This would result in each shop being represented by only 17 cases.
Ideally, I would like to create the mean value of the transactions over the 12 months.
To be more specific, the dataset currently has 5 variables:
Shop Location — a unique 6-digit sequence
Month — 2013_MM (data is only from 2013)
Number of Units Sold
Total Profit (£)
Product Type — 17 different product types (a string variable)
I am working in R. It would be ideal to save this restructured dataset into a data frame.
I'm thinking an if/for loop could work, but I'm unsure how to get this to work.
Any suggestions or ideas are greatly appreciated. If you need further information, please just ask!
Kind regards,
R
There really wasn't much here to work with, but this is where my interpretation leads: you're looking to summarise your data set, grouped by shop_location and product_type.
# install.packages('dplyr')
library(dplyr)
your_data_set <- xxx
your_data_set %>%
  group_by(shop_location, product_type) %>%
  summarise(profit = sum(total_profit),
            count = n(),
            avg_profit = profit / count)