Subset by first and last value per group - r

I have a data frame in R with two columns, temp and timeStamp. The temp readings arrive at regular intervals, and the value often stays the same across many consecutive timestamps.
I have to create a line chart showing changes in temp over time. Keeping all the repeated values inflates the size of the data file, so I want to remove them and keep just the rows where the value changes.
I cannot think of a way to get this done in R. Any pointers in the right direction would be really helpful.

Here's a dplyr solution:
library(dplyr)
# Toy data
df <- data.frame(time = seq(20), temp = c(rep(60, 5), rep(61, 7), rep(59, 3), rep(60, 5)))
# Keep the first and last rows plus the rows bracketing a temperature change
df %>% filter(temp != lag(temp) | temp != lead(temp) | time == min(time) | time == max(time))
time temp
1 1 60
2 5 60
3 6 61
4 12 61
5 13 59
6 15 59
7 16 60
8 20 60
If the data are grouped by a third column (id), just add group_by(id) %>% before the filtering step.
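For instance, a minimal sketch with a made-up id column:
# Hypothetical grouped data: two ids, each with its own runs of temp
df_g <- data.frame(id = rep(c("a", "b"), each = 10),
                   time = rep(seq(10), 2),
                   temp = c(rep(60, 4), rep(61, 6), rep(59, 7), rep(60, 3)))
df_g %>%
  group_by(id) %>%
  filter(temp != lag(temp) | temp != lead(temp) |
           time == min(time) | time == max(time))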

One option would be using data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)). Grouped by 'temp', we subset the first and last observations (.SD[c(1L, .N)]) of each group. If a group has only a single row, we keep that row as is (else .SD).
library(data.table)
setDT(df1)[, if (.N > 1) .SD[c(1L, .N)] else .SD, by = temp]
# temp val
#1: 22.50 1
#2: 22.50 4
#3: 22.37 5
#4: 22.42 6
#5: 22.42 7
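A roughly equivalent dplyr sketch, grouping on 'temp' and slicing the first and last row of each group:
library(dplyr)
df1 %>%
  group_by(temp) %>%
  slice(unique(c(1, n())))  # unique() keeps single-row groups intact
(Unlike the data.table version, dplyr returns the groups sorted by temp rather than in their original order.)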
Or a base R option with duplicated. duplicated(df1$temp) flags repeated 'temp' values scanning from the top, and duplicated(df1$temp, fromLast = TRUE) flags them scanning from the bottom; a row that is TRUE in both (&) is an interior duplicate. Negate (!) and use the result to subset the rows of 'df1'.
df1[!(duplicated(df1$temp) & duplicated(df1$temp,fromLast=TRUE)),]
# temp val
#1 22.50 1
#4 22.50 4
#5 22.37 5
#6 22.42 6
#7 22.42 7
data
df1 <- data.frame(temp=c(22.5, 22.5, 22.5, 22.5, 22.37,22.42, 22.42), val=1:7)

Related

Replace value from updated dataset based on number of instances it appears in a second dataset

I have a simple two-column dataset containing the variables cluster_size and index. Originally all values of index were assigned the value 1. Subsequently, I received a second dataset containing only a few clusters for which index should be updated with different integer values.
I simply need to replace the index value from the updated dataset. My specific issue is that the values of cluster_size can repeat multiple times, but I only need to replace it for the number of instances it appears in the updated dataset. For instance, in the example data below, the cluster_size value of 34 appears three times, but only once in the updated data, with an index of 6. This means that only one of those three rows should update to 6 (it doesn't matter which one).
Code to recreate a 20-row sample of the original data (have), the updated subset (updated), and the desired dataset (want) is below. The actual data has tens of thousands of rows. I've tried several merge and loop functions (all too pathetic to waste your time by posting here), but can't seem to find an elegant solution.
# Data with original index cases
set.seed(03151813)
have <- data.frame(clust_size=sample(1:50,20,replace=TRUE),index=rep(1,times=20))
have <- have[order(have$clust_size),]
# Updated data only contains clusters that need updating of index
updated <- data.frame(clust_size = c(30, 34, 42, 44, 44, 46),
                      index = c(2, 6, 4, 8, 9, 4))
# Desired dataset
want <- data.frame(clust_size = have$clust_size,
                   index = c(rep(1, times = 9), 2, 1, 6,
                             1, 1, 1, 4, 1, 8, 9, 4))
Here is a base R approach. Add row numbers to have and updated for each clust_size. So the clust_size of 34 will have rows numbered consecutively 1, 2, and 3.
Then, you can merge the two together on both clust_size and row number. If you include all.x you will get all rows from the first data frame have.
Final step is to replace the missing NA values in your new index column with the original index.
# Number the rows within each clust_size
have$rn <- with(have, ave(seq_along(clust_size), clust_size, FUN = seq_along))
updated$rn <- with(updated, ave(seq_along(clust_size), clust_size, FUN = seq_along))
# Left join on clust_size and the within-group row number
want <- merge(have, updated, all.x = TRUE, by = c("clust_size", "rn"))
# Fall back to the original index where there was no update
want$index.y <- ifelse(is.na(want$index.y), want$index.x, want$index.y)
want[, c("clust_size", "index.y")]
An alternate version using dplyr would be something like this:
library(dplyr)
have2 <- have %>%
  group_by(clust_size) %>%
  mutate(rn = row_number())
updated2 <- updated %>%
  group_by(clust_size) %>%
  mutate(rn = row_number())
left_join(have2, updated2, by = c("clust_size", "rn")) %>%
  mutate(index.y = coalesce(index.y, index.x)) %>%
  select(clust_size, index.y)
Output
clust_size index.y
1 1 1
2 5 1
3 8 1
4 10 1
5 16 1
6 20 1
7 22 1
8 27 1
9 29 1
10 30 2
11 30 1
12 34 6
13 34 1
14 34 1
15 35 1
16 42 4
17 43 1
18 44 8
19 44 9
20 46 4
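The same row-numbering idea also works in data.table; a sketch (rowid() and fcoalesce() are data.table helpers, not part of the answers above):
library(data.table)
# Within-group row numbers, analogous to the ave()/row_number() steps
setDT(have)[, rn := rowid(clust_size)]
setDT(updated)[, rn := rowid(clust_size)]
want <- merge(have, updated, by = c("clust_size", "rn"), all.x = TRUE)
want[, index := fcoalesce(index.y, index.x)][, .(clust_size, index)]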

Multiple rows to single cell space delimited values in pandas with group by

I have a data set similar to df1 here:
import pandas as pd
df1 = pd.DataFrame({'id': [1, 1, 2, 2, 2],
                    'value': [67, 45, 7, 5, 9]})
id value
1 67
1 45
2 7
2 5
2 9
I want to bring it to this form: all the values corresponding to an id in one cell, separated by spaces.
id values
1 67 45
2 7 5 9
Here is my code:
import numpy as np

df2 = pd.DataFrame(df1['id'].unique())
df2.columns = ['id']
df2['values'] = np.nan
for i in df2['id']:
    s = ''
    for k in df1[df1['id'] == i]['value']:
        s = s + ' ' + str(k)
    df2.loc[df2['id'] == i, 'values'] = s.lstrip()
print(df2)
Is there a more pythonic way of doing this? I have 70,000 unique ids, and each id may have anywhere from 1 to 20 values.
I am using
Anaconda python 3.5
pandas 0.20.1
numpy 1.12.1
windows 10
Also, how can we replicate the same in R?
Convert the 'value' column from int to string, then perform a groupby on 'id' and apply the str.join function:
# Convert 'value' column to string.
df1['value'] = df1['value'].astype(str)
# Perform a groupby and apply a string join.
df1 = df1.groupby('id')['value'].apply(' '.join).reset_index()
The resulting output:
id value
0 1 67 45
1 2 7 5 9
Here is how to do it in R; it is the same approach.
df <- data.frame(id = c(1, 1, 2, 2, 2), value = c(67, 45, 7, 5, 9))
aggregate(cbind(values = value) ~ id,
          data = df,
          FUN = function(x) paste(x, collapse = ' '))
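If you'd rather stay in the tidyverse, a roughly equivalent dplyr sketch:
library(dplyr)
df %>%
  group_by(id) %>%
  summarise(values = paste(value, collapse = ' '))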

Combine or Sum rows based on partial match and other rules

I have a dataframe df1:
df1 <- data.frame(
  Lot = c("13VC011", "13VC018", "13VC011A", "13VC011B", "13VC018A", "13VC018C", "13VC018B"),
  Date = c("2013-07-12", "2013-07-11", "2013-07-13", "2013-07-14", "2013-07-16", "2013-07-18", "2013-07-19"),
  Step = c("A", "A", "B", "B", "C", "C", "C"),
  kg = c(31, 32, 14, 16, 10, 11, 10))
Sometimes at a particular 'Step' a 'Lot' gets split into A,B or C as indicated. I'd like to sum those and get a dataframe that tells me the total kg at each step, for each lot.
For example the output should look like this:
df2 <- data.frame(
  Lot = c("13VC011", "13VC011", "13VC018", "13VC018"),
  Step = c("A", "B", "A", "C"),
  kg = c(31, 30, 32, 31))
So there are two requirements: if the 'Lot' matches, regardless of the trailing letter, and the 'Step' matches, then the rows are summed. Otherwise, the line item is carried over into df2 as is.
Part2:
So I would like to introduce a 3rd requirement. In some cases, the Lot was split in two or 3 parts, however not all the data is present. In this case, using these solutions masks this and makes it appear that one lot has much lower kg than it actually has.
What I would like to do is find a way to indicate if the dataset contains 13VC011A for example, but no 13VC011B. Or if we see a 'B' but no 'A' or a 'C' but no 'B' or 'A'.
So now the original dataframe is:
df1 <- data.frame(
  Lot = c("13VC011", "13VC018", "13VC011A", "13VC011B", "13VC018A", "13VC018C", "13VC018B", "13VC020B"),
  Date = c("2013-07-12", "2013-07-11", "2013-07-13", "2013-07-14", "2013-07-16", "2013-07-18", "2013-07-19", "2013-07-22"),
  Step = c("A", "A", "B", "B", "C", "C", "C", "B"),
  kg = c(31, 32, 14, 16, 10, 11, 10, 18))
And the resultant df2 should look something like:
df2 <- data.frame(
  Lot = c("13VC011", "13VC011", "13VC018", "13VC018", "13VC020B"),
  Step = c("A", "B", "A", "C", "B"),
  kg = c(31, 30, 32, 31, 18),
  Partial = c(F, F, F, F, T))
df1$Lot <- gsub("[[:alpha:]]$", "", df1$Lot) # strip a trailing letter, if any, from each Lot
aggregate(kg~Lot+Step,df1, FUN=sum)
# Lot Step kg
#1 13VC011 A 31
#2 13VC011 B 30
#3 13VC018 A 32
#4 13VC018 C 31
Or using dplyr
library(stringr)
library(dplyr)
df1 %>%
  group_by(Lot = str_extract(Lot, perl('.*\\d(?=[A-Z]?$)')), Step) %>%
  summarize(kg = sum(kg))
#Source: local data frame [4 x 3]
#Groups: Lot
# Lot Step kg
#1 13VC011 A 31
#2 13VC011 B 30
#3 13VC018 A 32
#4 13VC018 C 31
Explanation of the regex:
.* : match any characters
\\d : up to a final digit
(?=[A-Z]?$) : with a lookahead requiring that at most one optional uppercase letter remains before the end ($) of the string.
> aggregate(kg ~Lot + Step, data=df1, FUN=sum)
Lot Step kg
1 13VC011 A 31
2 13VC018 A 32
3 13VC011A B 14
4 13VC011B B 16
5 13VC018A C 10
6 13VC018B C 10
7 13VC018C C 11
At that point I finally understood what you meant by "regardless of the trailing letter" and wondered if the formula method of aggregate could handle an R-function in one of the terms:
> aggregate(kg ~substr(Lot,1,7) + Step, data=df1, FUN=sum)
substr(Lot, 1, 7) Step kg
1 13VC011 A 31
2 13VC018 A 32
3 13VC011 B 30
4 13VC018 C 31
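Neither answer covers Part 2's Partial flag. Here is one rough base R sketch, run on the original df1 before any answer strips the suffixes (the helper names lot, last, suf, base, and partial are mine, not from the answers): treat a lot as partial when it has split suffixes that do not run consecutively from A.
lot <- as.character(df1$Lot)
last <- substring(lot, nchar(lot))          # final character of each Lot
suf <- ifelse(last %in% LETTERS, last, "")  # trailing split letter, if any
base <- ifelse(suf == "", lot, substr(lot, 1, nchar(lot) - 1))
# a base lot is partial when it has suffixes that aren't exactly A, B, C, ...
partial <- tapply(suf, base, function(s) {
  s <- sort(s[s != ""])
  length(s) > 0 && !identical(s, LETTERS[seq_along(s)])
})
df2 <- aggregate(kg ~ base + Step, data = cbind(df1, base), FUN = sum)
df2$Partial <- unname(partial[as.character(df2$base)])
With the Part 2 data this flags 13VC020 as partial (only a B part is present); note the split lot is reported under its base name 13VC020 rather than 13VC020B.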

Subsetting an integer vector based on a vector of corresponding dates

Elementary question:
I'm trying to subset a vector of a data frame based on a vector of dates that correspond with the vector that I wish to subset. Consider the following data frame as an example:
Date Time Axis1 Day Sum.A1.Daily
1 6/12/10 5:00:00 20 1 NA
2 6/12/10 5:01:00 40 1 NA
3 6/12/10 5:02:00 50 1 NA
4 6/13/10 5:03:00 10 2 NA
5 6/13/10 5:04:00 20 2 NA
6 6/13/10 5:05:00 30 2 NA
I want to fill the rightmost column with the sum of values for each day. Basically, rows 1 through 3 of column 5 should each be 110, and rows 4 through 6 should each be 60.
I know there are many ways to do this that are smarter/faster/better than what I'm attempting to do (e.g., my date variable is a factor split into "levels" that I don't know how to access), but I'm trying to build my skills from the ground up, and want to figure out how to:
Take a subset of data$Axis1 that will only grab the values for the 1st day
Take a subset of the values of data$Axis1 that will only grab the values for the 2nd day
Sum the values for each day, and place them in column 5, overwriting the "NA"
I successfully used a similar approach to auto-fill the 'Day' column, which was originally full of NA values (below). But I'm getting stuck thinking about how to a) subset by dates, and b) sum while subsetting.
Thanks in advance for your help - also, let me know if my question could be clearer/I'm violating cardinal stackoverflow rules. I'm very new to R and the coding community in general; I appreciate your help!
dates <- c("6/12/10", "6/13/10")
counts <- c(1:2)
x <- nrow(data)
for (i in 1:x) {
  for (j in seq_along(dates)) {  # loop over the date lookup table
    if (data[i, 1] == dates[j]) {
      data[i, 4] <- counts[j]
    }
  }
}
Using ave :
transform(dat,Sum.A1.Daily=ave(dat$Axis1,dat$Date,FUN=sum))
Date Time Axis1 Day Sum.A1.Daily
1 6/12/10 5:00:00 20 1 110
2 6/12/10 5:01:00 40 1 110
3 6/12/10 5:02:00 50 1 110
4 6/13/10 5:03:00 10 2 60
5 6/13/10 5:04:00 20 2 60
6 6/13/10 5:05:00 30 2 60
Another way would be using data.table
#Let's say df is your dataset
library(data.table)
dt = as.data.table(df)
dt = dt[, Sum.A1.Daily := sum(Axis1), by = Date]
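And to spell out the subset-and-sum steps from the question in base R, a minimal sketch (assuming the data frame is called data, as in the question):
# For each day, subset Axis1 by date, sum it, and write the total back
for (d in unique(as.character(data$Date))) {
  rows <- data$Date == d  # logical index for one day's rows
  data$Sum.A1.Daily[rows] <- sum(data$Axis1[rows])
}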

Row aggregation when values are close enough in a column

I have a dataframe with 2 columns
time x
1306247226 5
1306247236 10
1306248127 20
1306248187 36
1306249248 28
1306249258 24
1306249259 20
...
I'd like to aggregate the rows whose values in the 'time' column are close enough (say, their difference is less than 60) and sum their 'x' values in the aggregated row. The 'time' value in the aggregated row will be that of the first row of the aggregation. ('time' is a unix timestamp.)
The goal is to have as output of this example:
time x
1306247226 15
1306248127 20
1306248187 36
1306249248 72
...
The dataset is quite big, so a 'for' loop would take a long time, but if it is the only option I can deal with it and wait.
Any idea?
Thanks a lot!
You can use something like this.
First, create a new grouping column:
dat$gg <- cumsum(c(0, diff(dat$time)) > 60)
Then use the plyr package to aggregate within each group:
library(plyr)
ddply(dat,.(gg),summarise,time = head(time,1),res = sum(x))
gg time res
1 0 1306247226 15
2 1 1306248127 56
3 2 1306249248 72
Edit after comment
The OP wanted rows grouped only when the difference is strictly less than 60, so a gap of exactly 60 must start a new group: change the > to >=.
dat$gg <- cumsum(c(0,diff(dat$time)) >= 60)
ddply(dat,.(gg),summarise,time = head(time,1),res = sum(x))
gg time res
1 0 1306247226 15
2 1 1306248127 20
3 2 1306248187 36
4 3 1306249248 72
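plyr is no longer actively developed; a roughly equivalent dplyr sketch:
library(dplyr)
dat %>%
  mutate(gg = cumsum(c(0, diff(time)) >= 60)) %>%
  group_by(gg) %>%
  summarise(time = first(time), res = sum(x))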
