This question already has answers here:
Find consecutive values in vector in R [duplicate]
(2 answers)
Closed 6 years ago.
I am fairly new to the art of programming (loops etc.), and I would be grateful for an opinion on whether my approach is fine, or whether it would definitely need to be optimized if it were to be used on a much bigger sample.
Currently I have approximately 20,000 observations, and one of the columns is the receipt ID. What I would like to achieve is to assign each row to a group consisting of IDs that ascend in steps of n+1. Whenever this rule is broken, a new group should begin, running until the rule is broken again.
To illustrate, let's say I have this table (an important note: the IDs are not necessarily unique and can repeat, like ID 10 in my example):
MyTable <- data.frame(ID = c(1,2,3,4,6,7,8,10,10,11,17,18,19,200,201,202,2010,2011,2013))
MyTable
ID
1
2
3
4
6
7
8
10
10
11
17
18
19
200
201
202
2010
2011
2013
The result of my grouping should be the following:
ID GROUP
1 1
2 1
3 1
4 1
6 2
7 2
8 2
10 3
10 3
11 3
17 4
18 4
19 4
200 5
201 5
202 5
2010 6
2011 6
2013 7
I used dplyr to order the ID column ascending. Then I created the variable MyTable$GROUP and simply filled it with 1's:
MyTable$GROUP <- rep(1, length(MyTable$ID))
for (i in 2:length(MyTable$ID)) {
  # same value, or an increase of exactly 1: stay in the current group
  if (MyTable$ID[i] == MyTable$ID[i-1] + 1 | MyTable$ID[i] == MyTable$ID[i-1]) {
    MyTable$GROUP[i] <- MyTable$GROUP[i-1]
  } else {
    # the sequence breaks: start a new group
    MyTable$GROUP[i] <- MyTable$GROUP[i-1] + 1
  }
}
This code worked for me and I got the results fairly easily. However, I wonder whether, in the eyes of more experienced programmers, this piece of code would be considered "bad", "average", "good", or whatever rating you come up with.
EDIT: I am sure this topic has been touched on already; I am not arguing against that. The main difference is that I would like to focus on optimization here and see whether my approach meets standards.
Thanks!
To make a long story short:
MyTable$Group <- cumsum(c(1, diff(MyTable$ID) != 1))
(The output below was produced from an earlier version of the example data, without the repeated ID; see the update at the end for data with repeats.)
# ID Group
#1 1 1
#2 2 1
#3 3 1
#4 4 1
#5 6 2
#6 7 2
#7 8 2
#8 10 3
#9 11 3
#10 12 3
#11 17 4
#12 18 4
#13 19 4
#14 200 5
#15 201 5
#16 202 5
#17 2010 6
#18 2011 6
#19 2013 7
diff() computes the differences between adjacent elements of MyTable$ID; every difference that is not 1 marks a "break", i.e. the start of a new group. cumsum() then accumulates those break indicators into group numbers. If you do not know cumsum(), type ?cumsum.
That's all!
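To see the intermediate steps on the data with the repeated ID (note the 0 produced by the duplicated 10, which is why the update below is needed):
diff(MyTable$ID)
# 1 1 1 2 1 1 2 0 1 6 1 1 181 1 1 1808 1 2
diff(MyTable$ID) != 1
# FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE
cumsum(c(1, diff(MyTable$ID) != 1))
# 1 1 1 1 2 2 2 3 4 4 5 5 5 6 6 6 7 7 8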
UPDATE: with repeating IDs, you can use this:
MyTable <- data.frame(ID = c(1,2,3,4,6,7,8,10,10,11,17,18,19,200,201,202,2010,2011,2013))
MyTable$Group <- cumsum(c(1, !diff(MyTable$ID) %in% c(0,1) ))
# ID Group
#1 1 1
#2 2 1
#3 3 1
#4 4 1
#5 6 2
#6 7 2
#7 8 2
#8 10 3
#9 10 3
#10 11 3
#11 17 4
#12 18 4
#13 19 4
#14 200 5
#15 201 5
#16 202 5
#17 2010 6
#18 2011 6
#19 2013 7
This question already has answers here:
Summarizing by subgroup percentage in R
(2 answers)
Closed 9 months ago.
I am wrangling with a huge dataset and my R skills are very new. I am really trying to understand the terminology and processes, but I find it a struggle, as the R documentation often makes no sense to me. So apologies if this is a dumb question.
I have data for plant species at different sites with different percentages of ground cover. I want to create a new column, PROP-COVER, which gives each species' cover as a percentage of the total cover of all species at a particular site. This is slightly different from calculating percentage cover by site area, as it disregards bare ground with no vegetation. It is an easy calculation with just one site, but I have over a hundred sites and need to perform the calculation on species ground cover grouped by site. The desired output column is PROP-COVER.
SPECIES SITE COVER PROP-COVER(%)
1 1 10 7.7
2 1 20 15.4
3 1 10 7.7
4 1 20 15.4
5 1 30 23.1
6 1 40 30.8
2 2 20 22.2
3 2 50
5 2 10
6 2 10
1 3 5
2 3 25
3 3 40
5 3 10
I have looked at for loops and repeat, but I can't see where the arguments should go; every attempt I make returns NULL.
Below is an example of something I tried, which I am sure is totally wide of the mark, but I just can't work out where to begin or whether what I want is even possible.
a<- for (i in data1$COVER) {
sum(data1$COVER[data1$SITE=="i"],na.rm = TRUE)
}
a
NULL
I have a major brain-blockage when it comes to how 'for' loops work, and no amount of reading about them seems to help. But perhaps what I am trying to do isn't possible? :(
Many thanks for looking.
In base R, you can build a SPECIES-by-SITE table of COVER with xtabs(), convert each site's column to percentages with prop.table(..., margin = 2), and merge() the result back onto the original data (the percentage column comes back named Freq):
merge(df, prop.table(xtabs(COVER~SPECIES+SITE, df), 2)*100)
SPECIES SITE COVER Freq
1 1 1 10 7.692308
2 1 3 5 6.250000
3 2 1 20 15.384615
4 2 2 20 22.222222
5 2 3 25 31.250000
6 3 1 10 7.692308
7 3 2 50 55.555556
8 3 3 40 50.000000
9 4 1 20 15.384615
10 5 1 30 23.076923
11 5 2 10 11.111111
12 5 3 10 12.500000
13 6 1 40 30.769231
14 6 2 10 11.111111
In tidyverse you can do:
df %>%
group_by(SITE) %>%
mutate(n = proportions(COVER) * 100)
# A tibble: 14 x 4
# Groups: SITE [3]
SPECIES SITE COVER n
<int> <int> <int> <dbl>
1 1 1 10 7.69
2 2 1 20 15.4
3 3 1 10 7.69
4 4 1 20 15.4
5 5 1 30 23.1
6 6 1 40 30.8
7 2 2 20 22.2
8 3 2 50 55.6
9 5 2 10 11.1
10 6 2 10 11.1
11 1 3 5 6.25
12 2 3 25 31.2
13 3 3 40 50
14 5 3 10 12.5
The n column could also be written as n = COVER/sum(COVER) * 100 or with n = prop.table(COVER) * 100
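Since the question also asked why the for loop returned NULL: for itself always returns NULL, so assigning the loop to a can never work (and data1$SITE=="i" compares against the literal string "i", not the loop variable). Results must instead be written into a column inside the loop. A minimal sketch, assuming data1 has SITE and COVER columns as in the example:
data1$PROP_COVER <- NA
for (s in unique(data1$SITE)) {
  rows <- data1$SITE == s
  site_total <- sum(data1$COVER[rows], na.rm = TRUE)  # total cover at this site
  data1$PROP_COVER[rows] <- data1$COVER[rows] / site_total * 100
}
The same can be written in one line of base R, using ave() to compute the group-wise sums:
data1$PROP_COVER <- data1$COVER / ave(data1$COVER, data1$SITE, FUN = sum) * 100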
This question already has answers here:
Create new column in dataframe based on partial string matching other column
(2 answers)
Closed 1 year ago.
I would like to make a new variable, unsure, which contains the word "unsure" if any of the following phrases are found in the freetext column: "too soon", "to tell". The freetext column itself should be left unchanged, and the new column should be NA when freetext does not contain those phrases. Currently the data look like:
id freetext date
1 1 its too soon 1
2 2 I'm not sure 2
3 3 pink 12
4 4 yellow 15
5 5 too soon to tell 20
6 6 I think it is too soon 2
7 7 5 days 6
8 8 red 7
9 9 its been 2 days 3
10 10 too soon to tell 11
The data:
structure(list(id = c("1","2","3","4","5","6","7","8","9","10"),
               freetext = c("its too soon", "I'm not sure", "pink", "yellow",
                            "too soon to tell", "I think it is too soon",
                            "5 days", "red", "its been 2 days", "too soon to tell"),
               date = c("1","2","12","15","20","2","6","7","3","11")),
          class = "data.frame", row.names = c(NA, -10L))
And I would like it to look like:
id freetext unsure date
1 1 its too soon unsure 1
2 2 I'm not sure <NA> 2
3 3 pink <NA> 12
4 4 yellow <NA> 15
5 5 too soon to tell unsure 20
6 6 I think it is too soon unsure 2
7 7 5 days <NA> 6
8 8 red <NA> 7
9 9 its been 2 days <NA> 3
10 10 too soon to tell unsure 11
You can use if_else with str_detect for pattern matching -
library(tidyverse)
df %>% mutate(unsure = if_else(str_detect(freetext, 'too soon|to tell'), 'unsure', NA_character_))
# id freetext date unsure
#1 1 its too soon 1 unsure
#2 2 I'm not sure 2 <NA>
#3 3 pink 12 <NA>
#4 4 yellow 15 <NA>
#5 5 too soon to tell 20 unsure
#6 6 I think it is too soon 2 unsure
#7 7 5 days 6 <NA>
#8 8 red 7 <NA>
#9 9 its been 2 days 3 <NA>
#10 10 too soon to tell 11 unsure
In base R -
transform(df, unsure = ifelse(grepl('too soon|to tell', freetext), 'unsure', NA))
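Both grepl() and str_detect() are case-sensitive by default. If the match should ignore case (an assumption; the sample text happens to be all lower-case), pass ignore.case = TRUE:
transform(df, unsure = ifelse(grepl('too soon|to tell', freetext, ignore.case = TRUE), 'unsure', NA))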
I have a seemingly easy question, but I am really stuck! I have a set of dates (in dd/mm/yyyy format) and I am aiming to assign an index value to those dates. Specifically, I want each month of each year to have a unique index value. In the sample below, January 1996 has the value 1, February 1996 is 2, March 1996 is 3, and so on. Note: the data spans many years, from 1996 to 2018, which means that January 1997 should have the value 13 in the index.
Date Index
01/01/1996 1
02/01/1996 1
03/01/1996 1
04/01/1996 1
05/01/1996 1
08/01/1996 1
01/02/1996 2
02/02/1996 2
05/02/1996 2
06/02/1996 2
07/02/1996 2
08/02/1996 2
09/02/1996 2
01/03/1996 3
04/03/1996 3
05/03/1996 3
06/03/1996 3
07/03/1996 3
08/03/1996 3
11/03/1996 3
I am trying to achieve this in either R or Excel.
Assuming that you read your dates into R as character (not Date) and that the rows are sorted chronologically, this works:
my <- substring(mydf$Date, 4)  # drop the day, keep "mm/yyyy"
mydf$Index2 <- as.numeric(factor(my, levels = unique(my)))
mydf
# Date Index Index2
#1 01/01/1996 1 1
#2 02/01/1996 1 1
#3 03/01/1996 1 1
#4 04/01/1996 1 1
#5 05/01/1996 1 1
#6 08/01/1996 1 1
#7 01/02/1996 2 2
#8 02/02/1996 2 2
#9 05/02/1996 2 2
#10 06/02/1996 2 2
#11 07/02/1996 2 2
#12 08/02/1996 2 2
#13 09/02/1996 2 2
#14 01/03/1996 3 3
#15 04/03/1996 3 3
#16 05/03/1996 3 3
#17 06/03/1996 3 3
#18 07/03/1996 3 3
#19 08/03/1996 3 3
#20 11/03/1996 3 3
mydf is your data.frame name. The code above extracts the month and year from each date, converts them to a factor whose levels follow their order of first appearance (plain as.factor() would sort the levels alphabetically, which puts "01/1997" before "02/1996" and breaks the index across years), and then converts the factor to numeric, which creates the indices.
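If the rows are not guaranteed to be sorted, a sketch that computes the index directly from the date components (assuming the dd/mm/yyyy format shown and 1996 as the first year):
d <- as.Date(mydf$Date, format = "%d/%m/%Y")
mydf$Index2 <- (as.integer(format(d, "%Y")) - 1996) * 12 + as.integer(format(d, "%m"))
# January 1996 -> 1, February 1996 -> 2, ..., January 1997 -> 13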
This is my data frame as given below
rd <- data.frame(
Customer = rep("A",15),
date_num = c(3,3,9,11,14,14,15,16,17,20,21,27,28,29,31),
exp_cumsum_col = c(1,1,2,3,4,4,4,4,4,5,5,6,6,6,7))
I am trying to get column 3 (exp_cumsum_col), but am unable to get the correct values after trying many times. This is the code I used:
rd<-as.data.frame(rd %>%
group_by(customer) %>%
mutate(exp_cumsum_col = cumsum(row_number(ifelse(date_num[i]==date_num[i+1],1)))))
If date_num is consecutive (each value either equal to or one more than the previous one), then I treat that entire run as one group; whenever there is a break in date_num, exp_cumsum_col should increase by 1. exp_cumsum_col starts at 1.
We can take the difference of adjacent elements, check whether it is greater than 1, and take the cumulative sum:
library(dplyr)
rd %>%
  group_by(Customer) %>%
  mutate(newexp_col = cumsum(c(TRUE, diff(date_num) > 1)))
# Customer date_num exp_cumsum_col newexp_col
#1 A 3 1 1
#2 A 3 1 1
#3 A 9 2 2
#4 A 11 3 3
#5 A 14 4 4
#6 A 14 4 4
#7 A 15 4 4
#8 A 16 4 4
#9 A 17 4 4
#10 A 20 5 5
#11 A 21 5 5
#12 A 27 6 6
#13 A 28 6 6
#14 A 29 6 6
#15 A 31 7 7
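To see why this reproduces exp_cumsum_col, look at the intermediate vectors:
diff(rd$date_num)
# 0 6 2 3 0 1 1 1 3 1 6 1 1 2
diff(rd$date_num) > 1
# TRUE wherever a gap starts a new group; a repeated value (difference 0) stays in its group
cumsum(c(TRUE, diff(rd$date_num) > 1))
# 1 1 2 3 4 4 4 4 4 5 5 6 6 6 7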
I am an R noob and hope some of you can help me.
I have two data sets:
- store (containing store data, including location coordinates (x,y). The location are integer values, corresponding to GridIds)
- grid (containing all gridIDs (x,y) as well as a population variable TOT_P for each grid point)
What I want to achieve is this:
For each store I want to loop over the grid data and sum the population of the grid IDs close to the store's grid ID.
I.e. basically SUMIF the grid population variable, with the condition that
grid(x) <= store(x) + 1 &
grid(x) >= store(x) - 1 &
grid(y) <= store(y) + 1 &
grid(y) >= store(y) - 1
How can I accomplish that? My own attempts have involved things like merge, sapply, etc., but my R inexperience stops me from getting it right.
Thanks in advance!
Edit:
Sample data:
StoreName StoreX StoreY
Store1 3 6
Store2 5 2
TOT_P GridX GridY
8 1 1
7 2 1
3 3 1
3 4 1
22 5 1
20 6 1
9 7 1
28 1 2
8 2 2
3 3 2
12 4 2
12 5 2
15 6 2
7 7 2
3 1 3
3 2 3
3 3 3
4 4 3
13 5 3
18 6 3
3 7 3
61 1 4
25 2 4
5 3 4
20 4 4
23 5 4
72 6 4
14 7 4
178 1 5
407 2 5
26 3 5
167 4 5
58 5 5
113 6 5
73 7 5
76 1 6
3 2 6
3 3 6
3 4 6
4 5 6
13 6 6
18 7 6
3 1 7
61 2 7
25 3 7
26 4 7
167 5 7
58 6 7
113 7 7
The output I am looking for is
StoreName StoreX StoreY SUM_P
Store1 3 6 721
Store2 5 2 119
I.e. for Store1 it is the sum of TOT_P for grid fields X = [2-4] and Y = [5-7].
One approach would be to use dplyr to calculate the difference between each store and all grid points and then group and sum based on these new columns.
#import library
library(dplyr)
#create example store table
StoreName<-paste0("Store",1:2)
StoreX<-c(3,5)
StoreY<-c(6,2)
df.store<-data.frame(StoreName,StoreX,StoreY)
#create example population data (the table from the question; X runs 1-7 within each Y)
df.pop <- data.frame(
  TOT_P = c(8, 7, 3, 3, 22, 20, 9,
            28, 8, 3, 12, 12, 15, 7,
            3, 3, 3, 4, 13, 18, 3,
            61, 25, 5, 20, 23, 72, 14,
            178, 407, 26, 167, 58, 113, 73,
            76, 3, 3, 3, 4, 13, 18,
            3, 61, 25, 26, 167, 58, 113),
  GridX = rep(1:7, times = 7),
  GridY = rep(1:7, each = 7)
)
#add dummy column to each table to enable cross join
df.store$k=1
df.pop$k=1
#dplyr to join, calculate absolute distance, filter and sum
df.store %>%
inner_join(df.pop, by='k') %>%
mutate(x.diff = abs(StoreX-GridX), y.diff=abs(StoreY-GridY)) %>%
filter(x.diff<=1, y.diff<=1) %>%
group_by(StoreName) %>%
summarise(StoreX=max(StoreX), StoreY=max(StoreY), tot.pop = sum(TOT_P) )
#output:
StoreName StoreX StoreY tot.pop
<fctr> <dbl> <dbl> <int>
1 Store1 3 6 721
2 Store2 5 2 119
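For reference, the same neighbourhood sum can be done in base R with sapply(), assuming the df.store and df.pop frames built above:
df.store$SUM_P <- sapply(seq_len(nrow(df.store)), function(i) {
  near <- abs(df.pop$GridX - df.store$StoreX[i]) <= 1 &
          abs(df.pop$GridY - df.store$StoreY[i]) <= 1
  sum(df.pop$TOT_P[near])  # population of the 3x3 block around store i
})
df.store[, c("StoreName", "StoreX", "StoreY", "SUM_P")]
#  StoreName StoreX StoreY SUM_P
#1    Store1      3      6   721
#2    Store2      5      2   119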