I simulated a dataset for an online retail market. Customers can purchase products in physical stores in Germany (e.g. Munich, Berlin, Hamburg) or in the online store. To get the latitude/longitude data for the cities I use geocode from the ggmap package. But online customers can order from anywhere in the country. Now I want to generate random latitude/longitude data within Germany for the online purchases, to map them later with Shiny and Leaflet. Is there any way to do this?
My df looks like this:
View(df)
ClientId  Store   ...  lat  lon
1         Berlin  ...  52   13
2         Munich  ...  48   11
3         Online  ...  x    x
4         Online  ...  x    x
But my aim is a data frame like this, for example:
ClientId  Store   ...  lat  lon
1         Berlin  ...  52   13
2         Munich  ...  48   11
3         Online  ...  50   12
4         Online  ...  46   10
Is there any way to generate this random latitude/longitude data and integrate it into my data frame?
Your problem is twofold. First of all, as a newbie to R, you are not yet used to the semantics required to do what you need. Fundamentally, what you are asking to do is:
First, identify which orders are sourced from Online.
Second, generate a random lat and lon for these orders.
First, to identify elements of your data frame which fit a criterion, you use the which function. Thus, to find the rows in your data frame which have the Store column equal to "Online", you do:
df[which(df$Store=="Online"), ]
To update the lat or lon for a particular row, we need to be able to access the column. To get values of a particular column, we use $. For example, to get the lat values for the online orders you use:
df$lat[which(df$Store=="Online")]
Great! The problem now diverges and increases in complexity. For the new values, do you want to generate simple values to accomplish your demo, or do you want to come up with new logic to generate spatial results in a given region? You indicate you would like to generate data points within Germany itself; however, accomplishing that is beyond the scope of this question. For now, we will consider the easy example of generating values in a bounding box and updating your data.frame accordingly.
To generate integer values in a given range, we can use the sample function. Assuming that you want lat values in the range of 45 to 55 and lon values in the range of 9 to 14, we can do the following:
online <- which(df$Store == "Online")
df$lat[online] <- sample(45:55, length(online), replace = TRUE)
df$lon[online] <- sample(9:14, length(online), replace = TRUE)
Reading this code, we first store the row indices of the "Online" orders, then update the lat values in df with a vector of random numbers from 45:55 of the proper length (the number of "Online" orders). Note replace = TRUE, which allows values to repeat, so the sampling still works when there are more online orders than values in the range.
If you want more decimal precision, you can use similar logic with the runif function, which samples from the uniform distribution, and round the result to the precision you need.
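For instance, a minimal sketch of that runif approach, using the same assumed bounding box as above:

online <- which(df$Store == "Online")
# draw uniformly within the box and keep 4 decimal places
df$lat[online] <- round(runif(length(online), min = 45, max = 55), 4)
df$lon[online] <- round(runif(length(online), min = 9, max = 14), 4)

Good luck!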
Edit: using the aid from one of the users, I was able to use "table(ArrestData$CHARGE)"; yet, since there are over 2400 entries, many of the entries are being omitted. I am looking for the top 5 charges; is there code for this? Additionally, I am looking at a particular council district (another variable, titled "CITY_COUNCIL_DIST"). I want to see which are the top 5 charges given out within a specific council district. Is there code for this?
Thanks for the help!
Original post follows
Just like how I can use "names(MyData)" to see the names of my variables, I am wondering if there is code to see the names/responses/data points of a specific column.
In other words, I am attempting to see the names in my rows for a specific column of data. I would like to see what names are cumulatively being used.
After I find this, I would like to know how many times each name within the rows is used, whether that's as a count or a percentage. After this, I would like to see how many times each name is used under the condition that it meets a numeric value of another column/variable.
Apologies if this, in any way, is confusing.
To go further in depth, I am playing around with the Los Angeles Police Data that I got via the Office of the Mayor's website. For 2017-2018, I am attempting to see which charges, and how many of each, were given out in Council District 5. CHARGE and CITY_COUNCIL_DIST are the two variables I am looking at.
Any and all help will be appreciated.
To get all the distinct values, you can use the unique function, as in:
> x <- c(1,1,2,3,3,4,5,5,5,6)
> unique(x)
[1] 1 2 3 4 5 6
To count the number of distinct values you can use table, as in:
> x <- c(1,1,2,3,3,4,5,5,5,6)
> table(x)
x
1 2 3 4 5 6
2 1 2 1 3 1
The first row gives you the distinct values and the second row the counts for each of them.
EDIT
This edit aims to answer your second question, following on from my previous example.
In order to look for the top five most repeated values of a variable we can use base R. To do so, I would first create a dataframe from your table of frequencies:
df <- as.data.frame(table(x))
Having this, now you just have to order the column Freq in descending order:
df[order(-df$Freq),]
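To keep only the top five, wrap the ordered data frame in head. Applied to the arrest data from your edit (a sketch; charge_freq is just a throwaway name):

charge_freq <- as.data.frame(table(ArrestData$CHARGE))
# sort by frequency, descending, and keep the first five rows
head(charge_freq[order(-charge_freq$Freq), ], 5)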
In order to look for the top five most repeated values of a variable within a group, however, we need to go beyond base R. I would use dplyr to create an augmented dataframe with frequencies for each value of the variable of interest, let it be count_variable:
library(dplyr)
x_or <- x %>%
  group_by(group_variable, count_variable) %>%
  summarise(freq = n())
where x is your original dataframe, group_variable is the variable for your groups, and count_variable is the variable you want to count. Now you just have to order the object so that the frequencies of your count_variable are sorted within each group_variable:
x_or %>%
  arrange(group_variable, desc(freq))
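Putting this together for your data, a sketch (assuming your data frame is ArrestData and you want District 5, as in your edit):

library(dplyr)

ArrestData %>%
  filter(CITY_COUNCIL_DIST == 5) %>%   # keep one council district (use "5" if the column is character)
  count(CHARGE, sort = TRUE) %>%       # frequency of each charge, descending
  head(5)                              # top 5 charges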
I am working on a problem for a statistics class that uses baseball team data such as attendance, wins/losses, and other team stats. The problem statement calls for variables to be created for winning teams (those with 81 or more wins), losing teams (fewer than 81 wins), and attendance figures in three categories: less than 2 million, between 2 and 3 million, and more than 3 million.
The raw data is keyed by team name, with one team per row and then the stats in each column.
I then need to create a table with counts of the number of teams along those dimensions, like:
Winning Season   Low Attendance   Med. Attendance   High Attendance
Yes              3                12                3
No               2                10                2
We can use whatever tool we'd like to complete it and I am attempting to use R and RStudio to create the table in order to gain knowledge about stats and R at the same time. However, I can't figure out how to make it happen or what function(s) to use to create a table with those aggregate numbers.
I have looked at data.table and dplyr and others, but I cannot seem to figure out how to get counts grouped along those dimensions. If it was SQL, I would be able to do
select count(*) from table where attend < 2000000 and wins < 81
and then programmatically create the table. I can't figure out how to do the same in R.
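For what it's worth, I think that single SQL count translates to something like this in R (assuming a data frame teams with columns attend and wins), but I don't see how to build the whole summary table that way:

# count of teams with low attendance and a losing record
sum(teams$attend < 2000000 & teams$wins < 81)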
Thank you for any help.
I have some classified raster layers, i.e. categorical land cover maps. All the layers have exactly the same categories (let's say "water", "trees", "urban", "bare soil"), but they are from different time points (e.g. 2005 and 2015).
I load them into memory using the raster function like this:
comp <- raster("C:/workingDirectory4R/rasterproject/2005marsh3.rst")
ref <- raster("C:/workingDirectory4R/rasterproject/2013marsh3.rst")
"comp" is the comparison map at time t+1 and "ref" is the reference map from time t. Then I used the crosstab function to generate the confusion table. This table can be used to explore the changes in categories through the time interval.
contingency.Matrix <- crosstab(comp, ref)
The result is in matrix format, with the "comp" categories in the columns and "ref" in the rows, and the column and row names labeled with the numbers 1 to 4.
Now I have 2 questions and I really appreciate any help on how to solve them.
1- I want to assign the category names to the columns and rows of the matrix to facilitate its interpretation.
2- Now let's say I have three raster layers, for 2005, 2010 and 2015. This means I would have two confusion tables, one for 2005-2010 and another one for 2010-2015. What's the best procedure to automate this process with minimal interaction from the user?
I thought I would ask the user to load the raster layers and have the code save them in a list, and then ask the user for a vector of years. But the problem is: how can I make sure the order of the raster layers and the years is the same? And is there a more elegant way to do this?
Thanks
I found a partial answer to my first question. If the categorical map is created in the TerrSet (IDRISI) software with the ".rst" extension, then I can extract the category names like this:
comp <- raster("C:/rasterproject/2005subset.rst")
attributes <- data.frame(comp@data@attributes)
categories <- as.character(attributes[,8])
and I get a vector with the names of the categories. However, if the raster layers are created with a different extension, the code won't work. For instance, if the raster is created in ENVI, then the third line of the code should be changed to:
categories <- as.character(attributes[,2])
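With that vector in hand, the names can be attached to the crosstab from above; a sketch, assuming categories is ordered to match the numeric codes 1 to 4 and both layers share the same legend:

# rows hold the "ref" categories, columns the "comp" categories
dimnames(contingency.Matrix) <- list(ref = categories, comp = categories)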
I am working with a large dataset (10 million+ cases) where each case represents a shop's monthly transactions for a given product (there are 17 products). As such, each shop is potentially represented across 204 cases (12 months * 17 products; note that not all shops sell all 17 products throughout the year).
I need to restructure the data so that there is one case for each product transaction. This would result in each shop being represented by only 17 cases.
Ideally, I would like to create the mean value of the transactions over the 12 months.
To be more specific, the dataset currently has 5 variables:
Shop Location - a unique 6-digit sequence
Month - 2013_MM (data is only from 2013)
Number of Units Sold
Total Profit (£)
Product Type - 17 different product types (this is a string variable)
I am working in R. It would be ideal to save this restructured dataset into a data frame.
I'm thinking an if/for loop could work, but I'm unsure how to get this to work.
Any suggestions or ideas are greatly appreciated. If you need further information, please just ask!
Kind regards,
R
There really wasn't much here to work with, but this is what my interpretation leads to: you're looking to summarise your data set, grouped by shop_location and product_type.
# install.packages('dplyr')
library(dplyr)
your_data_set <- xxx
your_data_set %>%
  group_by(shop_location, product_type) %>%
  summarise(profit = sum(total_profit),
            count = n(),
            avg_profit = profit / count)
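Since you mention wanting the mean transaction value over the 12 months, mean() could be used directly as well; a sketch with the same assumed column names (plus a hypothetical number_of_units column):

your_data_set %>%
  group_by(shop_location, product_type) %>%
  summarise(avg_units = mean(number_of_units),   # mean monthly units sold
            avg_profit = mean(total_profit))     # mean monthly profit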
I am working with NDVI3g data sets. My problem is that I am trying to create monthly composite data sets from the bi-monthly original data sets using the maximum value composite method in R. Please, I need your help; I tried my best but couldn't figure it out. The issue with the data is that the two composites in a month are named like this, for example:
AF99sep15a.n14-VI3g: first 15 days
AF99sep15b.n14-VI3g: last 15 days
I have 31 years of data sets (i.e. 1982-2012).
I kindly need your help on how to combine the whole data set into monthly composites.
Given a RasterStack gimms, and given that you want to combine sequential pairs of layers, I think you can do the following (using max, since you want a maximum value composite; swap in mean if you want an average instead):

i <- rep(1:(nlayers(gimms)/2), each = 2)
x <- stackApply(gimms, i, max)
Make sure to also check out the gimms package, which includes the function monthlyComposite (with optional parallel support) to create monthly maximum value composites from the initial half-monthly layers. Needless to say, the function is heavily based on stackApply from the raster package.
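A sketch of that route, assuming the files on disk still carry their original NDVI3g names (so the month of each half-monthly layer can be parsed from the file name; the directory path is a placeholder):

# install.packages("gimms")
library(gimms)

files <- list.files("path/to/ndvi3g", pattern = "VI3g$", full.names = TRUE)
gimms_stack <- rasterizeGimms(files)   # import the half-monthly binary files
indices <- monthlyIndices(files)       # one month index per layer, parsed from file names
mvc <- monthlyComposite(gimms_stack, indices = indices)  # fun defaults to max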