I am attempting to count elements within a variable (column) and group them by elements in another variable. The table below is the current data I am working with:
. Company.Name Sales.Team Product.Family
1 example1 Global Accounts FDS
2 example2 Americas RDS
3 Example3 WEMEA 2 Research
4 Example4 WEMEA 2 Research
5 Example5 CEE Research
6 Example6 CEE Research
What I am trying to do is aggregate the count of company names by product family. So it would look something like:
FDS RDS Research
Americas 0 1 0
CEE 0 0 2
Global Accounts 1 0 0
WEMEA 2 0 0 2
I have been messing around with the aggregate function, but this has not yielded the needed data. I am having trouble with determining how to have columns based on elements in a row.
Any help would be appreciated.
You can solve this using the base table function in R. Using an example table:
table(example_table$Sales.Team, example_table$Product.Family)
A basic run through for frequency tables can be found here at quick-R
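As a minimal sketch, assuming the question's six rows are stored in a data frame called example_table (the placeholder name used above, rebuilt here by hand), the call gives the cross-tabulation directly:

```r
# Rebuild the question's sample rows; example_table is only a placeholder name.
example_table <- data.frame(
  Company.Name   = c("example1", "example2", "Example3", "Example4", "Example5", "Example6"),
  Sales.Team     = c("Global Accounts", "Americas", "WEMEA 2", "WEMEA 2", "CEE", "CEE"),
  Product.Family = c("FDS", "RDS", "Research", "Research", "Research", "Research")
)

# Rows are sales teams, columns are product families; each cell counts companies.
counts <- table(example_table$Sales.Team, example_table$Product.Family)
print(counts)
```

Combinations that never occur show up as 0, which is what the desired output in the question asks for.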
If you need your output to be a dataframe, this is really easy using dplyr.
library(dplyr)
my_df <- data.frame("Product.Family" = c("FDS", "RDS", rep("Research", 4)), "Company.Name" = paste0("Example", 1:6), "Sales.Team" = c("Global Accounts", "Americas", rep("WEMEA 2", 2), rep("CEE", 2)))
summary_df <- my_df %>%
group_by(Sales.Team) %>%
summarize(FDS = sum(Product.Family == "FDS"), RDS = sum(Product.Family == "RDS"), Research = sum(Product.Family == "Research"))
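The summarize call above names each product family by hand. If there are many families, one possible alternative (a sketch, not the answer's own method) is to count the pairs and then widen the result with tidyr's pivot_wider. The object name wide_df is made up for this illustration.

```r
library(dplyr)
library(tidyr)

my_df <- data.frame(
  Product.Family = c("FDS", "RDS", rep("Research", 4)),
  Company.Name   = paste0("Example", 1:6),
  Sales.Team     = c("Global Accounts", "Americas", rep("WEMEA 2", 2), rep("CEE", 2))
)

# count() gives one row per (team, family) pair that occurs;
# pivot_wider() turns the families into columns and fills absent pairs with 0.
wide_df <- my_df %>%
  count(Sales.Team, Product.Family) %>%
  pivot_wider(names_from = Product.Family, values_from = n, values_fill = 0)

print(wide_df)
```

The column order follows the order in which families first appear after counting, so it may differ from the hand-written version.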
Good morning community, I would like to ask for suggestions on the following problem with a dataset. In the column "Municipio" of the image on the left, I want the numerical value to increase each time the name of the municipality changes, so that I can group all the data and classify it according to the "codigo municipio" shown in the image on the right. I am not doing this manually because there are more than 1000 municipalities and it would take me more than a whole day, so I would like to hear if anyone has a proposal. Thank you very much.
I have used the package dplyr in R, but you could also just do this in Excel if you wanted to.
library(dplyr)
# Mockup approximating your data
df <- data.frame(
EM = c("ABEJORRAL", "AEXSAT S.A.", "AZTECA COM", "ABREGO", "AXESAT S.A.", "ABRIAQUI"),
Numero = c(890,2,0,259,4,64)
)
municipios <- data.frame(
Municipios = c("ABEJORRAL", "ABREGO", "ABRIAQUI"),
Validacion = c("Municipio")
)
# create a new column with the Municipios ID by just counting up from 1.
df <- df %>% mutate(
Municipio = cumsum(EM %in% municipios$Municipios)
)
This solution assumes the municipios are in the same order in both tables, and none are missing from the main data set, as it's just creating a grouping variable.
output:
EM Numero Municipio
1 ABEJORRAL 890 1
2 AEXSAT S.A. 2 1
3 AZTECA COM 0 1
4 ABREGO 259 2
5 AXESAT S.A. 4 2
6 ABRIAQUI 64 3
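To see why this works: EM %in% municipios$Municipios is TRUE only on rows that name a municipality, and cumsum adds 1 at each TRUE, so every row inherits the number of the most recent municipality above it. The sketch below repeats that step in base R and also attaches the municipality name to each row. The column name Municipio_name is made up here, and the sketch assumes the first row of the data is a municipality.

```r
df <- data.frame(
  EM = c("ABEJORRAL", "AEXSAT S.A.", "AZTECA COM", "ABREGO", "AXESAT S.A.", "ABRIAQUI"),
  Numero = c(890, 2, 0, 259, 4, 64),
  stringsAsFactors = FALSE
)
municipios <- data.frame(
  Municipios = c("ABEJORRAL", "ABREGO", "ABRIAQUI"),
  stringsAsFactors = FALSE
)

# TRUE on rows that start a new municipality block
is_header <- df$EM %in% municipios$Municipios

# Running count of header rows seen so far: 1 1 1 2 2 3
df$Municipio <- cumsum(is_header)

# Names of the header rows in the order they occur, looked up by group number
header_names <- df$EM[is_header]
df$Municipio_name <- header_names[df$Municipio]

print(df)
```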
I have a dataframe as below:
**df**
Cust_name time freq
Andrew 0 4
Dillain 1 2
Alma 2 3
Andrew 1 4
Kiko 2 1
Sarah 2 8
Sarah 0 3
I want to calculate the sum of frequency by the time range provided for each cust_name. Example: If I select time range 0 to 2 for Andrew, it will give me sum of freq: 4+4= 8. And for Sarah, it will give me 8+3=11. I have tried it in the following ways just to get the time range, but do not know how to do the rest, as I am very new to R:
df[(df$time>=0 & df$time<=2),]
You can do this with dplyr.
To make your code reproducible, you should add the creation of your dataframe in your post. Copy and pasting everything is time consuming.
library(dplyr)
df <- data.frame(
cust_name = c('Andrew', 'Dillain', 'Alma', 'Andrew', 'Kiko', 'Sarah', 'Sarah'),
time = c(0,1,2,1,2,2,0),
freq = c(4,2,3,4,1,8,3)
)
df %>%
filter(time >=0, time <=2) %>%
group_by(cust_name) %>%
summarise(sum_freq = sum(freq))
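If dplyr is not available, the same filter-then-sum idea can be sketched in base R. The helper name sum_freq_in_range and its arguments are made up for this illustration.

```r
df <- data.frame(
  cust_name = c("Andrew", "Dillain", "Alma", "Andrew", "Kiko", "Sarah", "Sarah"),
  time = c(0, 1, 2, 1, 2, 2, 0),
  freq = c(4, 2, 3, 4, 1, 8, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose time lies in [from, to],
# then sum freq within each customer.
sum_freq_in_range <- function(data, from, to) {
  in_range <- data[data$time >= from & data$time <= to, ]
  aggregate(freq ~ cust_name, data = in_range, FUN = sum)
}

res <- sum_freq_in_range(df, 0, 2)
print(res)
```

Calling sum_freq_in_range(df, 1, 2) instead narrows the window, so Andrew's row at time 0 is left out of his total.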
I'm new to R, so please go easy on me... I have some longitudinal data that looks like
Basically, I'm trying to find a way to get a table with a) the number of unique cases that have all complete data and b) the number of unique cases that have at least one incomplete or missing data. The end results would ideally be
df<- df %>% group_by(Location)
df1<- df %>% group_by(any(Completion_status=='Incomplete' | 'Missing'))
I'm not sure exactly what you want, because your request and the desired output seem inconsistent, but let's try. It seems you need a kind of frequency table, which you can build with base R. At the bottom of the answer you can find some data similar to yours.
# You have two cases, the Complete, and the other, so here a new column about it:
data$case <- ifelse(data$Completion_status =='Complete','Complete', 'MorIn')
# now a frequency table about them: if you want a data.frame, here we go
result <- as.data.frame.matrix(table(data$Location,data$case))
# now the location as a new column rather than the rownames
result$Location <- rownames(result)
# and lastly a data.frame with the final results: note that you can change the names
# of the columns but if you want spaces maybe a tibble is better
result <- data.frame(Location = result$Location,
`Number.complete` = result$Complete,
`Number.incomplete.missing` = result$MorIn)
result
Location Number.complete Number.incomplete.missing
1 London 0 1
2 Los Angeles 0 1
3 Paris 3 1
4 Phoenix 0 2
5 Toronto 1 1
Or if you prefer a dplyr chain:
data %>%
mutate(case = ifelse(Completion_status == 'Complete', 'Complete', 'MorIn')) %>%
do( as.data.frame.matrix(table(.$Location,.$case))) %>%
mutate(Location = rownames(.)) %>%
select(3,1,2) %>%
`colnames<-`(c("Location","Number of complete ", "Number of incomplete or"))
Location Number of complete Number of incomplete or
1 London 0 1
2 Los Angeles 0 1
3 Paris 3 1
4 Phoenix 0 2
5 Toronto 1 1
With data:
# here is your data (next time try to put it in a usable form in the question)
data <- data.frame( ID = c("A1","A1","A2","A2","B1","C1","C2","D1","D2","E1"),
Location = c('Paris','Paris','Paris','Paris','London','Toronto','Toronto','Phoenix','Phoenix','Los Angeles'),
Completion_status = c('Complete','Complete','Incomplete','Complete','Incomplete','Missing',
'Complete','Incomplete','Incomplete','Missing'))
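The tables above count rows. If the goal is instead to count unique IDs, as the question's wording suggests, one possible reading is sketched below: an ID counts as complete only when all of its records are Complete. This assumes each ID belongs to a single Location, and the names per_id and all_complete are made up for the illustration.

```r
data <- data.frame(
  ID = c("A1", "A1", "A2", "A2", "B1", "C1", "C2", "D1", "D2", "E1"),
  Location = c("Paris", "Paris", "Paris", "Paris", "London", "Toronto", "Toronto",
               "Phoenix", "Phoenix", "Los Angeles"),
  Completion_status = c("Complete", "Complete", "Incomplete", "Complete", "Incomplete",
                        "Missing", "Complete", "Incomplete", "Incomplete", "Missing"),
  stringsAsFactors = FALSE
)

# One row per ID: TRUE only if every record for that ID is "Complete".
per_id <- aggregate(Completion_status ~ ID + Location, data = data,
                    FUN = function(s) all(s == "Complete"))
names(per_id)[3] <- "all_complete"

# Unique IDs per location, split by whether all their records are complete.
id_counts <- table(per_id$Location, per_id$all_complete)
print(id_counts)
```

With this reading Paris has one fully complete ID (A1) and one ID with at least one incomplete record (A2), which differs from the row counts shown earlier.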
Sorry, I've tried my best but I didn't find the answer. As a beginner, I'm not sure that I'm able to put the question clearly. Thanks in advance.
So I have a dataframe with data about consumption with 24000 rows.
In this dataframe, there is a series of variables about the number of objects bought within the last two months:
NumberOfCoat, NumberOfShirt, NumberOfPants, NumberOfShoes...
And there is a variable "profession" registered by number.
So now the data looks like this
profession NumberOfCoat NumberOfShirt NumberOfShoes
individu1 1 1 1 1
individu2 3 2 4 1
individu3 2 2 0 0
individu4 6 0 3 2
individu5 5 0 2 3
individu6 7 1 0 5
individu7 4 3 1 2
I would like to know the structure of consumption by profession and get something like this :
ProportionOfCoat ProportionOfShirt ProportionOfShoe...
profession1 0.3 0.5 0.1
profession2 0.1 0.2 0.4
profession3 0.2 0.6 0.1
profession4 0.1 0.1 0.2
I don't know if it is clear, but finally I want to be able to say :
10% of clothing products that doctors bought are Tshirts whereas 20% of what teachers bought are T-shirts.
And finally, I'd like to draw a stacked barplot where each stack is scaled to sum to 100%.
I suppose that we can use dplyr?
Thank you very much !!
temp <- aggregate( . ~ profession, data=zzz, FUN=sum)
cbind(temp[1],temp[-1]/rowSums(temp[-1]))
or also using prop.table
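A minimal sketch of the prop.table variant, assuming zzz holds the seven rows shown in the question (rebuilt by hand here):

```r
# zzz mirrors the question's sample rows; the name comes from the snippet above.
zzz <- data.frame(
  profession    = c(1, 3, 2, 6, 5, 7, 4),
  NumberOfCoat  = c(1, 2, 2, 0, 0, 1, 3),
  NumberOfShirt = c(1, 4, 0, 3, 2, 0, 1),
  NumberOfShoes = c(1, 1, 0, 2, 3, 5, 2)
)

# Total items per profession, one row per profession.
temp <- aggregate(. ~ profession, data = zzz, FUN = sum)

# margin = 1 divides each row by its own total, so every row sums to 1.
props <- prop.table(as.matrix(temp[-1]), margin = 1)
result <- cbind(temp[1], props)
print(result)
```

prop.table expects a matrix or table, which is why the count columns are converted with as.matrix first.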
As other people noted, it is always better to post a reproducible example. I'll try to post one with my solution, which is longer than the ones already posted but, for the same reason, maybe clearer.
First you should create an example dataframe:
set.seed(10) # I set a seed because I'll use the sample() function
n <- 1:100 # vector from 1 to 100 to obtain the number of products bought
p <- 1:8 # vector for obtaining id of professions
profession <- sample(p,50, replace = TRUE)
NumberOfCoat <- sample(n,50, replace = TRUE)
NumberOfShirt <- sample(n,50, replace = TRUE)
NumberOfShoes <- sample(n,50, replace = TRUE)
df <- as.data.frame(cbind(profession, NumberOfCoat,
NumberOfShirt, NumberOfShoes))
Once you got the dataframe, you can explain what you have tried so far, or a possible solution. Here I used dplyr.
library(dplyr)
df <- df %>% group_by(profession) %>% summarize(coats = sum(NumberOfCoat),
shirts = sum(NumberOfShirt),
shoes = sum(NumberOfShoes)) %>%
mutate(tot_prod = coats + shirts + shoes,
ProportionOfCoat = coats/tot_prod,
ProportionOfShirt = shirts/tot_prod,
ProportionofShoes = shoes/tot_prod) %>%
select(profession, ProportionOfCoat, ProportionOfShirt,
ProportionofShoes)
df corresponds to the second dataframe you show, where you have the proportion of each product bought by each profession. In my example it looks like this:
profession ProportionOfCoat ProportionOfShirt ProportionofShoes
<int> <dbl> <dbl> <dbl>
1 1 0.3910483 0.2343934 0.3745583
2 2 0.4069641 0.3525571 0.2404788
3 3 0.3330804 0.3968134 0.2701062
4 4 0.2740657 0.3952435 0.3306908
5 5 0.2573991 0.3784753 0.3641256
6 6 0.2293814 0.3543814 0.4162371
7 7 0.2245841 0.3955638 0.3798521
8 8 0.2861635 0.3490566 0.3647799
If you want to produce a stacked barplot, you have to reshape your data to a long format in order to be able to use ggplot2. As @alistaire noted, you can do it with the gather function from the tidyr package.
library(tidyr)
df <- df %>% gather(product, proportion, -profession)
And finally you can plot it with ggplot2.
library(ggplot2)
ggplot(df, aes(x=profession, y=proportion, fill=product)) +
geom_bar(stat="identity")
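The plot above stacks to 100% because the proportions were computed beforehand. As an alternative sketch, ggplot2 can do the scaling itself from raw counts with position = "fill". The data frame counts_long and its numbers are invented for this illustration.

```r
library(ggplot2)

# Invented long-format counts: one row per (profession, product) pair.
counts_long <- data.frame(
  profession = factor(rep(c(1, 2, 3), each = 3)),
  product    = rep(c("Coat", "Shirt", "Shoes"), times = 3),
  n          = c(3, 5, 1, 1, 2, 4, 2, 6, 1)
)

# position = "fill" rescales each bar so its segments sum to 1.
p <- ggplot(counts_long, aes(x = profession, y = n, fill = product)) +
  geom_col(position = "fill") +
  ylab("proportion")
```

Calling print(p) draws the chart. Converting profession to a factor keeps the x axis discrete, so each profession gets its own bar.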
Using R, I am trying to take a csv file, loop through it, extract values, and dump them into a data frame. There are four columns in the csv: ID, UG_inst, Freq, and Year. Specifically, I want to loop through the UG_inst column by institution name for each year (2010-11,2011-12,2012-13,and 2013-14) and put the value at that cell into the respective "cell" in the R data frame. Right now, the csv just has a Year column, but the data frame I've created has a column for each year. The ultimate idea is to be able to create bar graphs representing the frequency per institution per year. Currently, the code below throws up NO errors, but appears to do nothing to the R data frame "j".
A couple of caveats: 1) Doing a nested for loop was making my head spin, so I decided to just use 2010-11 for now and just loop through the institution name. Since there are only 4 years, I can rewrite this four times, each time with a different year. 2) Also, in the csv, there are repeat names. So, if an institution name appears twice (will be adjacent rows in the csv due to alphabetical arrangement), is there a way to dump the SUM of these into the data frame in R?
All relevant info is below. Thanks so much for any help!!!!
Here is a link to the .csv file: https://www.dropbox.com/s/9et7muchkrgtgz7/UG_inst_ALL.csv
And here is the R code I am trying:
abc <- read.csv(insert file path to above csv here)
inst_string <- unique(abc$UG_inst)
j <- data.frame("UG_inst"=inst_string,"2010-11"=NA,"2011-12"=NA,"2012-13"=NA,"2013-14"=NA)
for (i in inst_string) {
inst.index <- which(abc$UG_inst == i && abc$Year == "2010-11")
j$X2010.11[j$Ug_inst==i] <- abc$Freq[inst.index]
}
Instead of using a nested loop (or a loop at all) I suggest using the reshape() function in base R.
abc <- read.csv("UG_inst_ALL.csv")
abc <- abc[2:4]
reshape(data = abc,
v.names = "Freq",
timevar = "Year",
idvar = "UG_inst",
direction = "wide")
This is known as "reshaping" your data, and you are going from a "long" format to a "wide" format.
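The question also mentions institutions that appear more than once within a year. reshape() does not add such rows together, so one option is to sum them with aggregate() first. The sketch below uses a small invented data frame in place of the CSV, and the names summed and wide are made up.

```r
# Invented stand-in for the CSV; institution "A" appears twice in 2010-11.
abc <- data.frame(
  UG_inst = c("A", "A", "B", "A", "B"),
  Freq    = c(1, 2, 0, 4, 5),
  Year    = c("2010-11", "2010-11", "2010-11", "2011-12", "2011-12"),
  stringsAsFactors = FALSE
)

# Collapse duplicates: one row per institution and year, with Freq summed.
summed <- aggregate(Freq ~ UG_inst + Year, data = abc, FUN = sum)

# Long to wide: one row per institution, one Freq column per year.
wide <- reshape(data = summed,
                v.names = "Freq",
                timevar = "Year",
                idvar = "UG_inst",
                direction = "wide")
print(wide)
```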
In addition to base R's reshape function, here are a few other options to consider.
I'll assume that we are starting with data read in like the following.
abc <- read.csv("~/Downloads/UG_inst_ALL.csv", row.names = 1)
head(abc)
# UG_inst Freq Year
# 1 Abilene Christian University 0 2010-11
# 2 Adams State University 0 2010-11
# 3 Adrian College 1 2010-11
# 4 Agnes Scott College 0 2010-11
# 5 Alabama A&M University 1 2010-11
# 6 Albion College 1 2010-11
Option 1: xtabs
out <- as.data.frame.matrix(xtabs(Freq ~ UG_inst + Year, abc))
head(out)
# 2010-11 2011-12 2012-13 2013-14
# Abilene Christian University 0 1 0 0
# Adams State University 0 0 0 1
# Adrian College 1 0 0 0
# Agnes Scott College 0 0 1 0
# Alabama A&M University 1 3 1 2
# Albion College 1 0 0 0
Option 2: dcast from "reshape2"
library(reshape2)
# sum Freq when an institution appears more than once in a year
head(dcast(abc, UG_inst ~ Year, value.var = "Freq", fun.aggregate = sum))
Option 3: spread from "tidyr"
library(dplyr)
library(tidyr)
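# Sum repeated institution/year rows first; spread() needs one row per pair.
# (With row.names = 1 above there is no X column to drop.)
abc %>% group_by(UG_inst, Year) %>% summarise(Freq = sum(Freq)) %>%
  ungroup() %>% spread(Year, Freq)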