I have a train set here and I need you to help me with something.
This is the df.
Jobs Agency Location Date RXH HS TMM Payed
14 Netapp Gitex F1 Events House DWTC 2015-10-19 100 8.0 800 TRUE
5 RWC Heineken Lightblue EGC 2015-10-09 90 4.0 360 FALSE
45 Rugby 7s CEO Seven Stadium 2015-12-04 100 10.0 1000 FALSE
29 Playstation Lightblue Mirdiff CC 2015-11-11 90 7.0 630 FALSE
24 RWC Heineken Lightblue EGC 2015-10-31 90 4.5 405 FALSE
33 Playstation Lightblue Mirdiff CC 2015-11-15 90 10.0 900 FALSE
46 Rugby 7s CEO Seven Stadium 2015-12-05 100 10.0 1000 FALSE
44 Rugby 7s CEO Seven Stadium 2015-12-03 100 10.0 1000 FALSE
I want to know, for example: if the total number of rows is 10 and I worked for the "CEO" agency 3 times, then CEO should get a value of 30% for that month, if that makes sense?
In other words, depending on the number of observations, I want to know what percentage of the time I've worked for each agency.
That's just a demo df to show what I'm talking about.
Thanks!
If I understand correctly, you want to summarize by Agency and by month. Here's how to do it with dplyr:
library(dplyr)

df %>%
  mutate(Month = format(Date, "%m-%Y")) %>%   # assumes Date is of class Date
  group_by(Month, Agency) %>%
  summarise(Total = n()) %>%
  mutate(Pct = round(Total / sum(Total) * 100))
Source: local data frame [4 x 4]
Groups: Month [3]
Month Agency Total Pct
(chr) (chr) (int) (dbl)
1 10-2015 Events House 1 33
2 10-2015 Lightblue 2 67
3 11-2015 Lightblue 2 100
4 12-2015 CEO 3 100
This is just a simple approach, and I suspect you might be looking for more. However, here's some code that answers your sample question:
length(df$Agency[df$Agency == "CEO"]) / length(df$Agency)
The first length() call counts how many entries in df$Agency equal "CEO"; the second counts the total number of entries in that column. Dividing one by the other gives the proportion (multiply by 100 for a percentage).
This gets more complicated if you want to do it automatically for each agency in the column, but those are the basics; see the one-liner below.
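To cover every agency at once, base R's table() and prop.table() do the counting in one line (my addition, not part of the original answer):

# proportion of rows per agency, expressed as a percentage
round(prop.table(table(df$Agency)) * 100)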
I am having difficulty importing my data the way I would like from a .csv file into tidy data.
My data set is made up of descriptive data (age, country, etc.) followed by 15 condition columns that I would like to collapse into a single column (long format). I have tried 'melting' the data in a few ways, but it does not turn out the way I intended. Below are a few things I have tried; I know it is kind of messy. There are quite a few NAs in the data, which seem to be causing an issue. I am trying to create a column "Vignette" to serve as the collective column for the 15 vignette columns I would like in long format.
head(dat)
ID Frequency Gender Country Continent Age
1 5129615189 At least weekly female France Europe 30-50 years
2 5128877943 At least daily female Spain Europe > 50 years
3 5126775994 At least weekly female Spain Europe 30-50 years
4 5126598863 At least daily male Albania Europe 30-50 years
5 5124909744 At least daily female Ireland Europe > 50 years
6 5122047758 At least weekly female Denmark Europe 30-50 years
Practice Specialty Seniority AMS
1 University public hospital centre Infectious diseases 6-10 years Yes
2 Other public hospital Infectious diseases > 10 years Yes
3 University public hospital centre Intensive care > 10 years Yes
4 University public hospital centre Infectious diseases > 10 years No
5 Private hospial/clinic Clinical microbiology > 10 years Yes
6 University public hospital centre Infectious diseases 0-5 years Yes
Durations V01 V02 V03 V04 V05 V06 V07 V08 V09 V10 V11 V12 V13 V14 V15
1 range 7 2 7 7 7 5 7 14 7 42 42 90 7 NA 5
2 range 7 10 10 5 14 5 7 14 10 42 21 42 14 14 14
3 range 7 5 5 7 14 5 5 13 10 42 42 42 5 0 7
4 range 10 7 7 5 7 10 7 5 7 28 14 42 10 10 7
5 range 7 5 7 7 14 7 7 14 10 42 42 90 10 0 7
6 fixed duration 7 3 3 7 10 10 7 14 7 90 90 90 10 7 7
dat_long %>%
gather(Days, Age, -Vignette)
dat$new_sp = NULL
names(dat) <- gsub("new_sp", "", names(dat))
dat_tidy<-melt(
data=dat,
id=0:180,
variable.name="Vignette",
value.name="Days",
na.rm=TRUE
)
dat_tidy<- mutate(dat_tidy,
Days= sub("^V", "", Days)
)
It keeps saying "Error: id variables not found in data: NA".
I have tried to get rid of the NAs, but that doesn't seem to change anything.
I am guessing you are loading the melt function from reshape2. I recommend trying tidyr, which is basically the next generation of reshape2.
Your error presumably comes from the argument id=0:180. This asks melt to keep columns 0-180 as "identifier" columns and to melt the rest (i.e. create a new row for each value in each remaining column).
When you ask for more column indices than the data.frame has, the non-existent ones come back as NA - you asked for them, so you get them! The snippet below shows this directly.
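A quick way to see where the NA in the error message comes from (a minimal illustration; your dat has far fewer than 180 columns):

names(dat)[0:180]   # indices past ncol(dat) return NA, and melt() chokes on those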
tidyr also has some newer verbs that are more intuitive (see the pivot_longer sketch after the code below), but I'll give you a solution with the older semantics:
library(tidyr)

dat_tidy <- dat %>% gather('Vignette', 'Days', starts_with('V'))
# or, spelling out the range of columns
dat_tidy <- dat %>% gather('Vignette', 'Days', V01:V15)
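On current tidyr (version 1.0 and later), the newer verb alluded to above is pivot_longer(); here is a sketch of the same reshape (my addition), which can also drop the NA rows in the same step:

library(tidyr)

dat_tidy <- dat %>%
  pivot_longer(cols = starts_with("V"),
               names_to = "Vignette",
               values_to = "Days",
               values_drop_na = TRUE)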
And check out the comments by @heck1 for tips on asking even better questions.
I have been searching for this information since yesterday, but so far I could not find a nice solution to my problem.
I have the following dataframe:
CODE CONCEPT P. NR. NAME DEPTO. PRICE
1 Lunch 11 John SALES 160
1 Lunch 11 John SALES 120
1 Lunch 11 John SALES 10
1 Lunch 13 Frank IT 200
2 Internet 13 Frank IT 120
and I want to add a column with the sum of rows by group, placed on the last row of each group; for instance, the total amount for concept Lunch, code 1, by name, in order to get an output like this:
CODE CONCEPT P. NR. NAME DEPTO. PRICE TOTAL
1 Lunch 11 John SALES 160 NA
1 Lunch 11 John SALES 120 NA
1 Lunch 11 John SALES 10 290
1 Lunch 13 Frank IT 200 200
2 Internet 13 Frank IT 120 120
So far, I tried with:
aggregate(PRICE~NAME+CODE, data = df, FUN = sum)
But this retrieves just the totals per group, like this:
NAME CODE TOTAL
John 1 290
Frank 1 200
Frank 2 120
And not the full table with the rest of the data, as I would like to have it.
I also tried adding an extra column of NAs, but somehow I cannot paste the total into a specific row position.
Any suggestions? I would like something I can do in base R.
Thanks!!
In base R you can use ave to add the new column. We insert the group sum only on the last row of each group.
df$TOTAL <- with(df, ave(PRICE, CODE, CONCEPT, PNR, NAME,
                         FUN = function(x) ifelse(seq_along(x) == length(x), sum(x), NA)))
df
# CODE CONCEPT PNR NAME DEPTO. PRICE TOTAL
#1 1 Lunch 11 John SALES 160 NA
#2 1 Lunch 11 John SALES 120 NA
#3 1 Lunch 11 John SALES 10 290
#4 1 Lunch 13 Frank IT 200 200
#5 2 Internet 13 Frank IT 120 120
Similar logic using dplyr:
library(dplyr)

df %>%
  group_by(CODE, CONCEPT, PNR, NAME) %>%
  mutate(TOTAL = ifelse(row_number() == n(), sum(PRICE), NA))
For another base R option, you may try merging the original data frame with the aggregate result:
df2 <- aggregate(PRICE~NAME+CODE, data = df, FUN = sum)
out <- merge(df[ , !(names(df) %in% c("PRICE"))], df2, by=c("NAME", "CODE"))
out[with(out, order(CODE, NAME)), ]
NAME CODE CONCEPT PNR DEPTO. PRICE
1 Frank 1 Lunch 13 IT 200
3 John 1 Lunch 11 SALES 290
4 John 1 Lunch 11 SALES 290
5 John 1 Lunch 11 SALES 290
2 Frank 2 Internet 13 IT 120
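If you would rather keep the per-row PRICE and carry the group sum in its own TOTAL column on every row, rename the aggregated column before merging (my variation, not from the original answer):

df2 <- aggregate(PRICE ~ NAME + CODE, data = df, FUN = sum)
names(df2)[names(df2) == "PRICE"] <- "TOTAL"
out <- merge(df, df2, by = c("NAME", "CODE"))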
We have a daily meeting at which participants nominate each other to speak. The first person is chosen randomly.
I have a dataframe that consists of names and the order of speech every day.
The columns are day1, day2, day3, etc.
The rows contain numbers giving the order of speech on that particular day.
NA means that the person did not participate on that day.
Name day1 day2 day3 day4 ...
Albert 1 3 1 ...
Josh 2 2 NA
Veronica 3 5 3
Tim 4 1 2
Stew 5 4 4
...
I want to do two analyses. First, I want to create a dataframe of who has chosen whom the most times. (I know the result depends on whether a participant was already nominated on a given day and therefore cannot be nominated again that day; I will handle that later, but for now this is enough.)
It should look like this:
Name Favorite
Albert Stew
Josh Veronica
Veronica Tim
Tim Stew
...
My questions (feel free to answer only one if you can):
1. What code should I use for this without having to manually put the names in a different dataframe?
2. How should I handle a tie, for example if Josh chose Veronica and Tim first the same number of times? Later I want to visualise the results, and I have no idea how to handle ties.
I would also like to analyse the results to visualise strong connections,
e.g. to show that there are people who usually choose each other, etc.
Is there a good package specialised for this? Or how should I approach it?
I do not need anything as heavy as DNA-sequence analysis, only simple relations like these, but I have not found a suitable package yet.
Thanks for your help!
If I am not misunderstanding your problem, here is some code to count the occurrences of who chose whom as the next speaker. I added a fourth day so that some counts are greater than 1. There are ties in the result; picking the first pair of each speaker ('who') group may be one way to resolve them:
df <- read.table(textConnection(
"Name,day1,day2,day3,day4
Albert,1,3,1,3
Josh,2,2,,2
Veronica,3,5,3,1
Tim,4,1,2,4
Stew,5,4,4,5"),header=TRUE,sep=",",stringsAsFactors=FALSE)
library(dplyr)   # for lead(), filter(), group_by(), summarise(), arrange()

purrr::map(colnames(df)[-1],
           function(x) {
             # speakers in the order they spoke that day; NAs drop out
             who <- df$Name[order(df[[x]], na.last = NA)]
             # pair each speaker with the one they nominated next
             data.frame(who, lead(who), stringsAsFactors = FALSE)
           }) %>%
  replyr::replyr_bind_rows() %>%   # dplyr::bind_rows() works here too
  filter(!is.na(lead.who.)) %>%
  group_by(who, lead.who.) %>%
  summarise(n = n()) %>%
  arrange(who, desc(n))
Input:
Name day1 day2 day3 day4
1 Albert 1 3 1 3
2 Josh 2 2 NA 2
3 Veronica 3 5 3 1
4 Tim 4 1 2 4
5 Stew 5 4 4 5
Result:
# A tibble: 12 x 3
# Groups: who [5]
who lead.who. n
<chr> <chr> <int>
1 Albert Tim 2
2 Albert Josh 1
3 Albert Stew 1
4 Josh Albert 2
5 Josh Veronica 1
6 Stew Veronica 1
7 Tim Stew 2
8 Tim Josh 1
9 Tim Veronica 1
10 Veronica Josh 1
11 Veronica Stew 1
12 Veronica Tim 1
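For the visualisation part of the question (strong connections, mutual choices), the summarised counts map directly onto a directed graph. A minimal sketch with the igraph package (my addition, not from the original answer; `counts` stands for the tibble printed above):

library(igraph)

# the first two columns become the edge list; n becomes an edge attribute
g <- graph_from_data_frame(counts, directed = TRUE)
plot(g, edge.width = E(g)$n, edge.arrow.size = 0.5)

Mutual nominations show up as pairs of opposing arrows, and scaling edge width by n makes the strong connections stand out.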
This website has helped me with so much over the years, but I can't seem to figure this part out. I am modeling terrorist attacks in Afghanistan and want to create a new variable to reflect the clustering of attacks. For each attack I want to count how many attacks fall within two range criteria: distance and time.
head(timedist_terr_talib, 15)
eventid lat lon event1 Cluster_Num
1 20110104 32.07333 64.83389 2011-01-04 NA
2 20110107 31.00806 66.39806 2011-01-07 NA
3 20110112 34.53306 69.16611 2011-01-12 NA
4 20110112 34.87417 71.15278 2011-01-12 NA
5 20110114 31.65003 65.65002 2011-01-14 1
6 20110115 33.42977 66.21314 2011-01-15 0
7 20110116 35.95000 68.70000 2011-01-16 0
8 20110119 32.68556 68.23778 2011-01-19 0
9 20110119 34.08056 68.51917 2011-01-19 1
10 20110123 34.89000 71.18000 2011-01-23
11 20110128 34.53306 69.16611 2011-01-28
12 20110129 31.61767 65.67594 2011-01-29
13 20110131 35.03924 69.00633 2011-01-31
14 20110201 31.61767 65.67594 2011-02-01
15 20110207 31.48623 64.32139 2011-02-07
I want to create a new column whose values are the number of attacks that happened within the last 14 days and 100 km of that attack.
event1 <- strptime(timedist_terr_talib$eventid,
format="%Y%m%d", tz="UTC")
I found code that makes a matrix with the distance between each point:
http://eurekastatistics.com/calculating-a-distance-matrix-for-geographic-points-using-r/
#find dist in meters / 1000 to get km
#dis_talib_mat<-round(GeoDistanceInMetresMatrix(timedist_terr_talib) / 1000)
dis_talib_mat1 <- (GeoDistanceInMetresMatrix(timedist_terr_talib) / 1000)
And I have a matrix that calculates the time distance between every pair:
timediff_talib1<-t(outer(timedist_terr_talib$event1,
timedist_terr_talib$event1, difftime))
timediff_talib1<-timediff_talib1/(60*60*24)
So, for example, attacks 1-4 are NA because the data does not yet span a complete 14 days. For attack 5, I look at attacks 1-4 because they happened within the previous 14 days; the distance matrix shows that one of those attacks was within 100 km, so I manually count 1 attack that is under 100 km away.
My current data set is 2813 attacks, so everything runs slowly, but if I could get the code working for these 15 rows and then apply it to my full set, I would be so happy!
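One way to combine the two matrices built above (a sketch under my own assumptions, not from the thread: dis_talib_mat1 is in km as computed earlier, and I recompute the day differences with as.Date to keep the orientation explicit):

# day-difference matrix: entry [i, j] = days from attack j to attack i
ev <- as.Date(as.character(timedist_terr_talib$eventid), format = "%Y%m%d")
daydiff <- outer(ev, ev, function(a, b) as.numeric(a - b))

within_time <- daydiff > 0 & daydiff <= 14   # attack j was 1-14 days earlier
within_dist <- dis_talib_mat1 <= 100         # and no more than 100 km away

# count, for each attack (row), the earlier nearby attacks
timedist_terr_talib$Cluster_Num <- rowSums(within_time & within_dist)

This is fully vectorised, so it should also be tolerable on 2813 rows (two 2813 x 2813 logical matrices).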
I have the following data frames:
library(data.table)
Required <- data.table(Country = c("AT Iron", "AT Energy", "BE Iron", "BE Energy", "BG Iron", "BG Energy"),
                       Prod1 = c(5, 10, 0, 5, 0, 5),
                       Prod2 = c(25, 5, 10, 0, 0, 5))
Supplied <- data.table(Country = c("AT Iron", "AT Energy", "BE Iron", "BE Energy", "BG Iron", "BG Energy"),
                       Prod1 = c(10, 5, 5, 10, 5, 10),
                       Prod2 = c(20, 20, 20, 0, 15, 10))
> Required
Country Prod1 Prod2
1: AT Iron 5 25
2: AT Energy 10 5
3: BE Iron 0 10
4: BE Energy 5 0
5: BG Iron 0 0
6: BG Energy 5 5
> Supplied
Country Prod1 Prod2
1: AT Iron 10 20
2: AT Energy 5 20
3: BE Iron 5 20
4: BE Energy 10 0
5: BG Iron 5 15
6: BG Energy 10 10
"Required" shows the initial material and energy requirements to manufacture two products, and the materials and energy are supplied by three different countries. For example, product 1 would require, for Energy, 10 units from AT, 5 units from BE and 5 units from BG. "Supplied" shows the actual supply capacity of the countries. Following the example, AT cannot supply 10 units of energy but 5 units, so another country must supply the remaining units. I assume that the country with the most net supply capacity (that is, once discounted the initial requirement) will provide the remaining units. In this case, both BE and BG have 5 units of net supply capacity, so both will provide with equal units, 2.5.
I am looking for an optimization algorithm that creates a new "Required" table, "RequiredNew", taking into account the supply constraints and the assumption above. The resulting table should look like this:
> RequiredNew
Country Prod1 Prod2
1: AT Iron 5 20
2: AT Energy 5 5
3: BE Iron 0 10
4: BE Energy 7.5 0
5: BG Iron 0 5
6: BG Energy 7.5 5
In the link below I posted a similar question, which was solved by user digEmAll, so a similar approach would be suitable. However, I have rephrased the question here so that it is clearer and closer to my actual data.
Mathematical optimization in R
I apologise for the multiple posts. Thank you in advance.
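Here is a sketch of the greedy rule as I read it (my own code, not from the thread): per product column and per resource type, cap each country at its supply, hand the deficit to the countries with the largest net spare capacity, splitting ties equally, and repeat in case a recipient overflows in turn. The helper reallocate() is a hypothetical name of mine:

library(data.table)

reallocate <- function(req, sup) {
  repeat {
    deficit <- sum(pmax(req - sup, 0))
    if (deficit == 0) return(req)
    req   <- pmin(req, sup)                       # cap over-required countries
    spare <- sup - req                            # net spare capacity
    top   <- which(spare > 0 & spare == max(spare))
    if (length(top) == 0) stop("total supply is insufficient")
    req[top] <- req[top] + deficit / length(top)  # split the deficit equally
  }
}

RequiredNew <- copy(Required)
type <- sub("^\\S+\\s+", "", Required$Country)    # "Iron" / "Energy" from Country
for (col in c("Prod1", "Prod2")) {
  for (tp in unique(type)) {
    rows <- which(type == tp)
    set(RequiredNew, i = rows, j = col,
        value = reallocate(Required[[col]][rows], Supplied[[col]][rows]))
  }
}

On the sample data this reproduces the RequiredNew table above, with AT Energy capped at its supply of 5 for Prod1.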