R data subset restructuring [closed]

I am fairly new to R/RStudio and I am still learning how to do certain operations.
I have the following data set. The columns are Operating Region, type of element (CA, OBU), sub-element, and Net Revenue (NR).
Currently the data is quite big (50,000 rows) and I want to get a summary of Operating Region by element, sub-element, and NR.
Example
Operating Region Element Sub-Element NR
Asia CA CA123 50 000
America OBU EFK456 35 000
Could someone please guide me on how to accomplish this?
Any relevant readings/examples would be much appreciated.

Using the data below to build the data frame object "data", you can use the dplyr package to organize results in many different ways. Here is one example:
data <- data.frame(OperatingRegion = c("Asia", "America"),
                   Element = c("CA", "OBU"),
                   SubElement = c("CA123", "EFK456"),
                   NR = c(50000, 35000))
require(dplyr)
results <- data %>%
  group_by(OperatingRegion) %>%
  summarise(TotalNR = sum(NR, na.rm = TRUE))
results
# A tibble: 2 x 2
  OperatingRegion TotalNR
  <chr>             <dbl>
1 America           35000
2 Asia              50000
After loading the package, you pass dplyr the data frame and then, using the pipe operator %>% (older dplyr versions also accepted %.%), group_by whatever single or multiple variables you want. Then call summarise to create sums, medians, averages, or whatever computation you need.
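For instance, grouping by several columns at once and computing more than one statistic might look like the sketch below (it reuses the toy "data" object built above; swap in the column names from your real 50,000-row table):
library(dplyr)
summary_tbl <- data %>%
  group_by(OperatingRegion, Element, SubElement) %>%
  summarise(TotalNR  = sum(NR, na.rm = TRUE),     # total Net Revenue per group
            MedianNR = median(NR, na.rm = TRUE),  # median Net Revenue per group
            Rows     = n())                       # number of rows in each group
summary_tbl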

Related

How to get the percentile from the dataset? [closed]

I processed the dataset.
df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv')
df$countryName = as.character(df$countryName)
Considering the cases announced, what percentage of the total cases do the top 3 countries (those reporting the most cases) account for? Can we find that?
Here is one way to do this with base R. Since the statistics are cumulative for each country by day, we subset to the most recent day's data with the [ form of the extract operator, sort by descending confirmed cases, calculate each country's share of the total, and sum the percentages for the first 3 rows.
df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv')
df$countryName = as.character(df$countryName)
# subset to max(day)
today <- df[df$day == max(df$day),]
today <- today[order(today$confirmed,decreasing=TRUE),]
today$pct <- today$confirmed / sum(today$confirmed)
paste("top 3 countries percentage as of",today$day[1],"is:",
sprintf("%3.2f%%",sum(today$pct[1:3]*100)))
...and the output:
> paste("top 3 countries percentage as of",today$day[1],"is:",
+ sprintf("%3.2f%%",sum(today$pct[1:3]*100)))
[1] "top 3 countries percentage as of 2020/05/30 is: 44.09%"
We can print selected data for the top 3 countries as follows.
colList <- c("countryName", "day", "confirmed", "pct")
today[1:3, colList]
countryName day confirmed pct
26000 United States 2020/05/30 1816117 0.29531640
3640 Brazil 2020/05/30 498440 0.08105067
21710 Russia 2020/05/30 396575 0.06448654
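For comparison, a dplyr version of the same calculation might look like this sketch (it assumes the same df loaded above, with the day and confirmed columns):
library(dplyr)
df %>%
  filter(day == max(day)) %>%                     # most recent day only
  mutate(pct = confirmed / sum(confirmed)) %>%    # each country's share of the total
  arrange(desc(confirmed)) %>%
  slice(1:3) %>%                                  # top 3 countries
  summarise(top3_pct = 100 * sum(pct))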

How do I break up a dataset in R so that particular values of a categorical variable are together, and I can then perform analysis of those values? [closed]

I'm doing a project for my stats class in which I need to make an infographic, and I have chosen a dataset involving all flights in the US in June 2016. All 500000 of them. I am trying to sort all of the flights by airport, then calculate the average delay and the percentage of cancelled flights by airport to create a "consistency" statistic to see which airports are the most reliable, and I need to do this with R because the dataset is too large to do anything in Excel. I can't use the by() function, because I cannot perform functions on other variables after categorizing the data. Help?
A useful package for this may be dplyr. I'll use a couple of bogus variables from the included iris dataset to give an indication of how this would work. In your case, replace iris with your dataset, and change the various calculations to what you need.
require(dplyr)
iris %>%
  mutate(# Calculate required variables at the level of your raw data
    Sepal.Area = Sepal.Length * Sepal.Width,
    Petal.Area = Petal.Length * Petal.Width
  ) %>%
  group_by(# Choose variable to group by
    Species
  ) %>%
  summarize(# Perform some grouping calculations
    Mean.Sepal.Area = mean(Sepal.Area),
    Mean.Petal.Area = mean(Petal.Area),
    count = n()
  ) %>%
  mutate(# Calculate required variables at the level of your summarized data
    Sepal.Times.Petal = Mean.Sepal.Area * Mean.Petal.Area) ->
  output
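Applied to the flights data, the same pattern might look like the sketch below. The data frame name (flights) and the column names (airport, dep_delay, and a 0/1 cancelled flag) are assumptions; substitute whatever your June 2016 file actually uses:
require(dplyr)
airport_summary <- flights %>%
  group_by(airport) %>%
  summarize(
    mean_delay    = mean(dep_delay, na.rm = TRUE),        # average delay per airport
    pct_cancelled = 100 * mean(cancelled, na.rm = TRUE),  # share of cancelled flights
    n_flights     = n()
  ) %>%
  arrange(mean_delay)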
This would be much easier to answer if you gave an example of your data set.
If you have an airport name you can filter to that by running something like this. I'll just use msp as an example.
Dat[Dat$airport == "msp", ]
If you have a variable called delay you can calculate the average delay by running
mean(Dat[Dat$airport == "msp", "delay"], na.rm = TRUE)
You could create a list of all airports and use a for loop to go over it to calculate all means.
Or, an easier way would be to turn it into a data.table (library(data.table); setDT(Dat)) and run
Dat[, .(mean_delay = mean(delay, na.rm = TRUE)), by = .(airport)]
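Mirroring the dplyr sketch above, the full per-airport summary in data.table might look like this (again, delay and cancelled are assumed column names):
library(data.table)
DT <- as.data.table(Dat)
DT[, .(
  mean_delay    = mean(delay, na.rm = TRUE),            # average delay per airport
  pct_cancelled = 100 * mean(cancelled, na.rm = TRUE),  # share of cancelled flights
  n_flights     = .N
), by = airport][order(mean_delay)]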

R programming- find lowest value [closed]

I've just started learning R. I wanted to know how I can find the lowest value in a column for each unique value in another column. For example, in this case I want to know the lowest average price per year.
I have a data frame with about 7 columns, two of them being average price and year. The years recur and range from 2000 to 2009. The data also has various NAs in different columns.
I have very little experience with loops or anything similar in this regard.
Thank you :)
my data set looks something like this:
avgprice year
332 2002
NA 2009
5353 2004
1234 NA
...and so on.
To break my problem down: I want to find the first five lowest values for the years 2000-2004.
s<-subset(tx.house.sales,na.rm=TRUE,select=c(avgprice,year)
s2<-subset(s,year==2000)
s3<-arrange(s2)
tail(s2,5)
I know the code fails miserably. I wanted to first subset my data frame down to avgprice and year, then subset by each year from 2000 through 2004, arrange it, and print the lowest five with tail(). However, I also wanted to ignore the NAs.
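For reference, a corrected base R version of that plan might look like the sketch below (it assumes the data frame is called tx.house.sales, as in the attempt above, with columns avgprice and year):
s <- tx.house.sales[, c("avgprice", "year")]
s <- s[complete.cases(s), ]           # drop rows with NAs
s2 <- s[s$year == 2000, ]             # one year at a time
head(s2[order(s2$avgprice), ], 5)     # five lowest average prices in 2000
# repeat for the other years, e.g. with lapply over 2000:2004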
You could try
aggregate(averageprice~year, df1, FUN=min)
Update
If you need to get 5 lowest "averageprice" per "year"
library(dplyr)
df1 %>%
  group_by(year) %>%
  arrange(averageprice) %>%
  slice(1:5)
Or you could use rank in place of arrange
df1 %>%
  group_by(year) %>%
  filter(rank(averageprice, ties.method = 'min') %in% 1:5)
This could be also done with aggregate, but the 2nd column will be a list
aggregate(averageprice ~ year, df1, FUN = function(x)
  head(sort(x), 5), na.action = na.pass)
data
set.seed(24)
df1 <- data.frame(year = sample(2002:2008, 50, replace = TRUE),
                  averageprice = sample(c(NA, 80:160), 50, replace = TRUE))
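As a quick check against the simulated df1 above, you can run the aggregate() version and inspect its second column; depending on whether every year has at least five non-NA prices, it comes back as a matrix or a list:
agg <- aggregate(averageprice ~ year, df1,
                 FUN = function(x) head(sort(x), 5), na.action = na.pass)
str(agg$averageprice)   # shows whether the aggregated column is a matrix or a list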

Analyze CSV data in R [closed]

I have CSV data as follows:
code, label, value
ABC, len, 10
ABC, count, 20
ABC, data, 102
ABC, data, 212
ABC, data, 443
...
XYZ, len, 11
XYZ, count, 25
XYZ, data, 782
...
The number of data entries is different for each code. (This doesn't matter for my question; I'm just pointing it out.)
I need to analyze the data entries for each code. This would include calculating the median, plotting graphs, etc. Does this mean I should separate out the data for each code and make it numeric?
Is there a better way of doing this than this kind of thing:
x = read.csv('dataFile.csv', header=T)
...
median(as.numeric(subset(x, x$code=='ABC' & x$label=='data')$value))
boxplot(median(as.numeric(subset(x, x$code=='ABC' & x$label=='data')$value)))
split and list2env allow you to separate your data.frame x by code, generating one data.frame for each level of code:
list2env(split(x, x$code), envir=.GlobalEnv)
or just
my.list <- split(x, x$code)
if you prefer to work with lists.
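From the list form you can then compute a per-code statistic directly, for example the median of the "data" rows for each code (a sketch that assumes the column names shown in the question):
sapply(my.list, function(d) median(as.numeric(d$value[d$label == "data"]), na.rm = TRUE))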
I'm not sure I totally understand the final objective of your question. Do you just want some pointers on what you could do? Because there are a lot of possible solutions.
When you ask: I need to analyze the data entries for each code. This would include calculating the median, plotting graphs, etc. This means I should separate out the data for each code and make it numeric?
The answer would be no, you don't strictly have to. You could use R functions that do this task for you, for example:
x = read.csv('dataFile.csv', header=T)
# is it numeric?
class(x$value)
# If it is already numeric you shouldn't have to convert it.
# If it is strictly numeric there is no obvious reason why it
# should be read as strings, but that does happen.
aggregate(value ~ code, data = x, FUN = median)
boxplot(value ~ code, data = x)
# and you can do ?boxplot to look into its options.

I found ways to plot a graph in R using the plot function, but I am looking to create a plot with part of the data [closed]

I am trying to plot a graph in R for data containing the years 1900 through 2010, with an output value for each month of the year. I need to select the years 1950-2001 and only the months of November-February. How can I select that part of the data for plotting this graph?
Since I am a rookie at R programming, or any programming, an easy-to-follow example would be of great help.
thanks
GRV
I am not sure exactly what you mean by
select years between 1950-2001 against months of Nov-february
But the following should get you started on a reproducible example...
#create a vector of months from 1900 through 2010
months <- seq(as.Date("1900/1/1"), as.Date("2010/12/31"), "months")
#assign a random vector of equal length
output <- rnorm(length(months))
#assign both vectors to a data.frame
data <- data.frame(months = months, output = output)
Based on your description, your data should look something like the dataframe, called data.
From here, you can make use of the subset function to help you on your way. The first example subsets to data from 1950 through 2001. The next further restricts that subset to the months of November through February.
#subset to just 1950 through 2001
data_sub <- subset(data, months >= as.Date("1950-01-01") & months <= as.Date("2001-12-31"))
#subset the 1950 to 2001 data to just Nov-feb months (i.e. c(11,12,1,2))
data_sub_nf <- subset(data_sub, as.numeric(format(data_sub$months, "%m")) %in% c(11,12,1,2))
You should also read Why is `[` better than `subset`? to move beyond subset.
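For reference, here are the same two subsets written with [ indexing rather than subset(), using the simulated data frame built above:
#1950 through 2001
data_sub <- data[data$months >= as.Date("1950-01-01") &
                 data$months <= as.Date("2001-12-31"), ]
#Nov-Feb months only
data_sub_nf <- data_sub[as.numeric(format(data_sub$months, "%m")) %in% c(11, 12, 1, 2), ]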
As stated, after the data has been subset, you can use plot or any other plotting function to graph your data.
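For example, a minimal plot of the restricted data might be (again using the simulated data frame, so the labels are placeholders):
plot(output ~ months, data = data_sub_nf, type = "p",
     xlab = "Month", ylab = "Output")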
