I have imported an Excel spreadsheet into RStudio and I need to write R commands for the data. I need a command to display how many units of an item have been sold. The data looks a little something like this:
PRODUCT ------------------- UNITS
eye liner ----------------------- 10
lip gloss ----------------------- 5
eye liner ----------------------- 10
lip gloss ----------------------- 5
I do not know how to count how many units of lip gloss have been sold. The best I can do is display how many times lip gloss shows up in the data with the command:
nrow(mySales[mySales$Product=="lip gloss",])
This command doesn't count how many units of lip gloss were sold (10); it only counts how many times lip gloss appears in the data (2). This is a beginner course and this is the first exercise, so I assume it is a simple problem, but I am completely lost.
You are almost there. If you look at your code:
nrow(mySales[mySales$Product=="lip gloss",])
this part here:
mySales[mySales$Product=="lip gloss",]
subsets the data down to the rows whose Product is lip gloss.
When you add nrow you are counting the number of rows in that subset, which is why you get 2 instead of the units.
Hence what you need to do next is replace nrow with sum, applied to the UNITS column of the subset:
sum(mySales[mySales$Product=="lip gloss",]$UNITS)
Here's a step-by-step version:
lipGlossSales <- mySales[mySales$Product=="lip gloss",]
lipGlossUnits <- lipGlossSales$UNITS
totalLipGloss <- sum(lipGlossUnits)
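As a side note, once you want the totals for every product at once rather than one product at a time, base R's aggregate can do it in a single call. A minimal sketch, with the example data recreated inline so it runs standalone (column names Product and UNITS as above):

```r
# Recreate the example data from the question
mySales <- data.frame(Product = c("eye liner", "lip gloss", "eye liner", "lip gloss"),
                      UNITS   = c(10, 5, 10, 5))

# Total units sold per product, in one call
totals <- aggregate(UNITS ~ Product, data = mySales, FUN = sum)
totals
#     Product UNITS
# 1 eye liner    20
# 2 lip gloss    10
```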
Happy R-ing
cheers,
This is called the split-apply-combine approach and is well-documented and very common in data analysis. In this case I would try the plyr library which allows for making a nice summary of the data as such:
fakedata <- data.frame(Product=c('eye liner', 'lip gloss', 'eye liner', 'lip gloss'),
count=c(10,5,10,5))
library(plyr)
product.counts <- ddply(fakedata, "Product", function(x) data.frame(Productcount = sum(x$count)))
R> product.counts
Product Productcount
1 eye liner 20
2 lip gloss 10
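plyr has since been largely superseded by dplyr; the same split-apply-combine summary can be sketched there as follows (reusing the fakedata definition above):

```r
library(dplyr)

fakedata <- data.frame(Product = c('eye liner', 'lip gloss', 'eye liner', 'lip gloss'),
                       count   = c(10, 5, 10, 5))

# group by product, then sum the counts within each group
product.counts <- fakedata %>%
  group_by(Product) %>%
  summarise(Productcount = sum(count))
```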
Related question:
Initialize the city of Boston earnings dataset as shown below:
boston <- read.csv( "https://people.bu.edu/kalathur/datasets/bostonCityEarnings.csv", colClasses = c("character", "character", "character", "integer", "character"))
Generate a subset of the dataset from Boston earnings dataset with only the top 5 departments based on the number of employees working in that department. The top 5 departments should be computed using R code. Then, use %in% operator to create the required subset.
Use a sample size of 50 for each of the following.
Set the start seed for random numbers as the last 4 digits of 1000
a) Show the sample drawn using simple random sampling without replacement.
Show the frequencies for the selected departments.
Show the percentages of these with respect to sample size.
I've tried to write the code, but I still don't know how to create the subset with the top 5, and I don't know how to turn the counts into percentages either.
Thank you all for the help!
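For what it's worth, here is one possible sketch of the top-5 subset and the sampling steps. The department column name and the small stand-in data frame are assumptions (swap in the real read.csv call from the question), and the sample size is shrunk to 10 so the toy data suffices; the assignment asks for 50:

```r
# Stand-in for the Boston earnings data; the department column is assumed
# to be named "Department" - adjust to match the actual CSV.
boston <- data.frame(Department = rep(c("Police", "Fire", "Schools",
                                        "Library", "Parks", "Transport"),
                                      times = c(6, 5, 4, 3, 2, 1)),
                     Earnings   = 1:21)

# Top 5 departments by number of employees
top5 <- names(sort(table(boston$Department), decreasing = TRUE))[1:5]

# Subset using the %in% operator
boston.top5 <- boston[boston$Department %in% top5, ]

# a) simple random sampling without replacement
set.seed(1000)                                    # seed per the assignment
s <- boston.top5[sample(nrow(boston.top5), 10), ] # use 50 with the real data

# Frequencies of the sampled departments, and percentages of sample size
freqs <- table(s$Department)
percentages <- 100 * freqs / nrow(s)
```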
Very new to R here, also very new to the idea of coding and computer stuff.
Second week of class and I need to find some summary statistics from a set of data my professor provided. I downloaded the chart of data and tried to follow along with his verbal instructions during class, but I am one of the only people without a computer-science background in my degree program (I am an RN going for a degree in Health Informatics), so he went way too fast for me.
I was hoping for some input on just where to start with his list of tasks. I downloaded his data into an Excel file and then loaded it into R, where it is now a matrix. However, everything I try for getting the mean and standard deviation of the columns he wants comes up with an error. I understand that I need to convert these columns into some sort of vector, but every website online tells me to do these tasks differently. I don't even know where to start with this assignment.
Any help on how to get myself started would be greatly appreciated. I've included a screenshot of his instructions and of my matrix. And please excuse my ignorance/lack of familiarity compared to most of you here... this is my second week into my master's, and I hope I begin to pick this up soon; I am just not there yet.
the instructions include:
# * Import the dataset
# * Summarize the dataset; compute the mean and standard deviation for the three variables (columns): age, height, weight
# * Tabulate the smokers and age.level data with the variable and its frequency. How many smokers are in each age category?
# * Subset the dataset by the mothers that smoke and weigh less than 100kg. How many mothers meet these requirements?
# * Compute the mean and standard deviation for the three variables (columns): age, height, weight
# * Plot a histogram
Stack Overflow is not a place for homework, but I feel your pain. Let's go piece by piece.
First let's use a package that helps us do those tasks:
library(data.table) # if not installed, install it with install.packages("data.table")
Then, let's load the data:
library(readxl) #again, install it if not installed
dt = setDT(read_excel("path/to/your/file/here.xlsx"))
Now to the calculations:
1 summarize the dataset. Here you'll see the ranges, means, medians and other summary statistics of your table.
summary(dt)
1A mean and standard deviation of age, height and weight (replace age below with height or weight to get the others)
dt[, .(meanValue = mean(age, na.rm = TRUE), stdDev = sd(age, na.rm = TRUE))]
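To avoid repeating that line for each column, the three columns can also be handled in one call via .SD (a sketch with a made-up stand-in table; the column names are assumptions, so match them to your data):

```r
library(data.table)

# Made-up stand-in for the imported table
dt <- data.table(age    = c(25, 30, NA, 40),
                 height = c(160, 170, 165, NA),
                 weight = c(60, 70, 80, 90))

# mean and sd of all three columns at once
stats <- dt[, .(column    = names(.SD),
                meanValue = sapply(.SD, mean, na.rm = TRUE),
                stdDev    = sapply(.SD, sd, na.rm = TRUE)),
            .SDcols = c("age", "height", "weight")]
```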
2 tabulate smokers and age.level. get the counts for each combination:
dt[, .N, by = .(smoke, age.level)]
3 subset smoking mothers with weight < 100 (I'm assuming non-pregnant mothers have NA in the gestation field; adjust as necessary):
dt[smoke == 1 & weight < 100 & !is.na(gestation), .N]
4 Is the same as 1A.
5 Plot a histogram (but you don't specify of what variable, so let's say it's age):
hist(dt$age)
Keep on studying R, it's not that difficult. The book recommended in the comments is a very good start.
I am using the 'pivottabler' package to create some pivot tables in R.
Basically, the pivot tables I create have similar structure, only the column header changes.
For example, I have a data set containing the prices of fruits based on region and month.
So I will create one pivot that will look like this:
Fruits Nigeria Laos England
Prices Prices Prices
Apple 1$ 2$ 3$
Mango 4$ 5$ 6$
Orange 7$ 8$ 9$
And another pivot table that will look like this:
Fruits Jan Feb March
Prices Prices Prices
Apple 1$ 1.5$ 2$
Mango 4$ 4.5$ 5$
Orange 7$ 7.5$ 8$
Right now I am using two separate blocks of code to create the two pivots.
pt_country <- PivotTable$new()
pt_country$addData(Fruit_Prices) #Fruit_Prices is the data frame containing the data
pt_country$addColumnDataGroups("Countries")
pt_country$addRowDataGroups("Fruits")
pt_country$defineCalculation(CalculationName = "Prices")
pt_country$renderPivot()
pt_month <- PivotTable$new()
pt_month$addData(Fruit_Prices) #Fruit_Prices is the data frame containing the data
pt_month$addColumnDataGroups("Months")
pt_month$addRowDataGroups("Fruits")
pt_month$defineCalculation(CalculationName = "Prices")
pt_month$renderPivot()
I want to shorten the code length, since there will be multiple such pivot tables.
So, ideally I was looking for a solution that allows me to replace one column group with another without changes to other structures of code.
Any help will be appreciated.
I am the author of the pivottabler package.
There are currently only limited options to amend a pivot table after it has been calculated.
More detail
In your example, removing the columns would also remove the calculations, since in your R script the calculations are added after the columns. Reapplying the calculations is then not possible, because the pivot table recognises that the calculations were added already (you get an error). I will look at options to add flexibility in the future.
Alternative approach
One option to reduce the amount of code is to create a function which takes as a parameter the variable to show on the columns. This function can then be easily called to create different variations of the pivot table:
createPivot <- function(columnGroupVariableName)
{
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups(columnGroupVariableName)
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$renderPivot()
}
# create pivot tables with different variables on the columns
createPivot("TrainCategory")
createPivot("PowerType")
I had found a sort of workaround for this problem; mentioning it here for the sake of completeness.
The answer provided by @cbailiss is better in that it achieves the desired result in fewer lines of code and is easier to follow, so it is marked as the accepted answer.
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$addRowDataGroups("TOC")
## Adding the required column for the pivot
Col <- pt$addColumnGroup() # Step1
Col_Data <- Col$addDataGroups("TrainCategory") # Step 2
pt$renderPivot()
## Removing the 'Col' group, thus deleting the added columns
Col$removeGroup()
## Repeating Step 1 and Step 2 for another column variable
Col <- pt$addColumnGroup() # Step1
Col_Data <- Col$addDataGroups("PowerType") # Step 2
pt$renderPivot()
The above lines of code worked for me; the method is described in the 'Irregular layout' vignette at:
http://www.pivottabler.org.uk/articles/v11-irregularlayout.html
Just earlier today I received a very helpful answer to a problem I was running into, which allowed me to move on to the next step of one of my projects. However, I got stuck again later in the project, and I'm wondering if any of you can help me move forward.
Context
Currently, I have a list of data frames that are full of soccer matches called wc_match_dataframes. Here is what one of the data frames looks like:
type_id tourn_id day month year team_A score_A score_B team_B win loss
f wc_1934 27 5 1934 Germany 5 2 Belgium Germany Belgium
I wasn't able to fit in the data for the final three columns, draw, drawA, and drawB, but basically the draw column is TRUE if the match was a draw and FALSE otherwise. In the case of a draw, the win and loss columns are just filled with Draw. The drawA column is filled with team_A if the match was a draw, and likewise drawB is filled with team_B.
The type_id is either f or q depending on if the match was a World Cup qualifier or a World Cup finals match. The tourn_id refers to the tournament the match was for, whether it was a qualifier or finals.
There are a total of 39 of these data frames, with a "finals" data frame for each of the 20 World Cup tournaments, and a "qualifiers" data frame for 19 tournaments (the first World Cup did not have qualifying).
What I Want To Do
I'm trying to populate a different list of data frames wc_dataframes with data for each of the 20 World Cups at the country level as opposed to the match level. Each of these twenty data frames will have the countries that made it to the finals of said tournament and their data like so:
Country
Wins in qualifying
Wins in finals
Losses in qualifying
Losses in finals
... and so on.
I have been able to populate the first country column for every World Cup no problem, but I'm running into issues for the rest of the columns.
Here is what I'm doing
This is the unlooped (only works for one World Cup) version of my code that works successfully:
wc_dataframes$wc_1930$fw <- apply(wc_dataframes$wc_1930, MARGIN = 1, function(country)
sum(wc_match_dataframes$`wc_1930 f`$w == country, na.rm = TRUE))
This is successfully populating the finals win column in the wc_dataframes$wc_1930 data frame by counting the number of wins.
Now, when I try and nest this under lapply to do it across all World Cup years like so:
lapply(names(wc_dataframes), function(year)
wc_dataframes$year$fw <- apply(wc_dataframes$year, MARGIN = 1, function(country)
sum(wc_match_dataframes$`year f`$w == country, na.rm = TRUE)))
It does not work for me. I suspect the issue has to do with how I am using year inside the function, and with the sum portion of my code. I come from a STATA background, so I am more used to running for loops and the like. I'm still getting used to R and lists and everything, so I really appreciate the help.
Thank you!
Thank you so much in advance for the help, and happy holidays! :)
lapply does not modify wc_dataframes in place: the assignment inside the function only changes a local copy. You need to return the modified data frame from the function and capture lapply's result (using setNames to keep the year names on the list):
wc_dataframes <- setNames(lapply(names(wc_dataframes), function(year) {
  df <- wc_dataframes[[year]]
  df$fw <- apply(df, MARGIN = 1, function(country)
    sum(wc_match_dataframes[[paste(year, 'f')]]$w == country, na.rm = TRUE))
  df
}), names(wc_dataframes))
I want to count the number of words in each row:
Review_ID Review_Date Review_Content Listing_Title Star Hotel_Name
1 1/25/2016 I booked both the Crosby and Four Seasons but decided to cancel the Four Seasons closer to the arrival date based on reviews. Glad I did. The Crosby is an outstanding hotel. The rooms are immaculate and luxurious, with real attention to detail and none of the bland furnishings you find in even the top chain hotels. Staff on the whole were extremely attentive and seemed to enjoy being there. Breakfast was superb and facilities at ground level gave an intimate and exclusive feel to the hotel. It's a fairly expensive place to stay but is one of those hotels where you feel you're getting what you pay for, helped by an excellent location. Hope to be back! Outstanding 5 Crosby Street Hotel
2 1/18/2016 We've stayed many times at the Crosby Street Hotel and always have an incredible, flawless experience! The staff couldn't be more accommodating, the housekeeping is immaculate, the location's awesome and the rooms are the coolest combination of luxury and chic. During our most recent trip over The New Years holiday, we stayed in the stunning Crosby Suite which has the most extraordinary, gorgeous decor. The Crosby remains our absolute favorite in NYC. Can't wait to return! Always perfect! 5 Crosby Street Hotel
I was thinking of something like:
WordFreqRowWise %>%
rowwise() %>%
summarise(n = n())
To get results something like this:
Review_ID Review_Content total_Words Min_occrd_word Max Average
1 .... 230 great: 1 the: 25 total_unique/total_words in the row
But I have no idea how to do it...
Here is a method in base R using strsplit and sapply. Let's say the data is stored in a data.frame df and the reviews are stored in the variable Review_Content
# break up the strings in each row by " "
temp <- strsplit(df$Review_Content, split=" ")
# count the number of words as the length of the vectors
df$wordCount <- sapply(temp, length)
In this instance, sapply will return a vector of the counts for each row.
Since the word count is now an object, you can perform analysis you want on it. Here are some examples:
summarize the distribution of word counts: summary(df$wordCount)
maximum word count: max(df$wordCount)
mean word count: mean(df$wordCount)
range of word counts: range(df$wordCount)
interquartile range of word counts: IQR(df$wordCount)
Adding to @lmo's answer above:
The code below builds a data frame of all the words, row by row, together with their frequencies:
temp2 <- data.frame()
for (i in seq_along(temp)) {
  temp1 <- as.data.frame(table(temp[[i]]))
  temp1$ID <- paste0("Row_", i)
  temp2 <- rbind(temp2, temp1)
}
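And to get all the way to the per-row summary mocked up in the question (total words, most and least frequent word, unique-to-total ratio), base R suffices. A sketch, with a tiny df recreated inline so it runs standalone; the column names simply mirror the question's mock-up:

```r
# Tiny stand-in for the review data
df <- data.frame(Review_Content = c("the cat sat on the mat",
                                    "a b b c c c"),
                 stringsAsFactors = FALSE)
temp <- strsplit(df$Review_Content, split = " ")

# One summary row per review
rowStats <- do.call(rbind, lapply(temp, function(words) {
  freq <- table(words)
  data.frame(total_Words    = length(words),
             Max_occrd_word = names(freq)[which.max(freq)],
             Min_occrd_word = names(freq)[which.min(freq)],
             Average        = length(freq) / length(words),
             stringsAsFactors = FALSE)
}))
```

Note that which.max and which.min return the first maximum or minimum, so ties are broken by the alphabetical order that table imposes on the words.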