Replace values from an updated dataset based on the number of instances they appear in a second dataset - r

I have a simple 2-column dataset containing the variables clust_size and index. Originally, all values of index were assigned a value of 1. Subsequently, I received a second dataset containing only a few clusters where index should be updated with different integer values.
I simply need to replace the index values using the updated dataset. My specific issue is that the values for clust_size can repeat multiple times, but I only need to replace index for the number of instances the value appears in the updated dataset. For instance, in the example data below, the clust_size value of 34 appears three times, but only once in the updated data with an index of 6. This means that only one of these three rows should update to 6 (it doesn't matter which one).
Code to recreate a 20-row sample of the original data (have), the updated subset (updated), and the desired dataset (want) is below. The actual data has tens of thousands of rows. I've tried several merge and loop functions (all too pathetic to waste your time by posting here), but can't seem to find an elegant solution.
# Data with original index cases
set.seed(03151813)
have <- data.frame(clust_size=sample(1:50,20,replace=TRUE),index=rep(1,times=20))
have <- have[order(have$clust_size),]
# Updated data only contains clusters that need updating of index
updated <- data.frame(clust_size=c(30,34,42,44,44,46),
                      index=c(2,6,4,8,9,4))
# Desired dataset
want <- data.frame(clust_size=have$clust_size,
                   index=c(rep(1,times=9),2,1,6,
                           1,1,1,4,1,8,9,4))

Here is a base R approach. Add row numbers to have and updated within each clust_size, so that the clust_size of 34 will have rows numbered consecutively 1, 2, and 3.
Then you can merge the two together on both clust_size and row number. Including all.x = TRUE keeps all rows from the first data frame, have.
The final step is to replace the missing NA values in the new index column with the original index.
have$rn <- with(have, ave(seq_along(clust_size), clust_size, FUN = seq_along))
updated$rn <- with(updated, ave(seq_along(clust_size), clust_size, FUN = seq_along))
want <- merge(have, updated, all.x = TRUE, by = c("clust_size", "rn"))
want$index.y <- ifelse(is.na(want$index.y), want$index.x, want$index.y)
want[, c("clust_size", "index.y")]
An alternate version using dplyr would be something like this:
library(dplyr)
have2 <- have %>%
  group_by(clust_size) %>%
  mutate(rn = row_number())
updated2 <- updated %>%
  group_by(clust_size) %>%
  mutate(rn = row_number())
left_join(have2, updated2, by = c("clust_size", "rn")) %>%
  mutate(index.y = coalesce(index.y, index.x))
Output
clust_size index.y
1 1 1
2 5 1
3 8 1
4 10 1
5 16 1
6 20 1
7 22 1
8 27 1
9 29 1
10 30 2
11 30 1
12 34 6
13 34 1
14 34 1
15 35 1
16 42 4
17 43 1
18 44 8
19 44 9
20 46 4
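Since the real data has tens of thousands of rows, a data.table version of the same row-number-join idea may also be worth a look. This is just a sketch (it assumes the data.table package is installed), not a tested drop-in:
library(data.table)
# rowid() numbers rows within each clust_size, like rn above
setDT(have)[, rn := rowid(clust_size)]
setDT(updated)[, rn := rowid(clust_size)]
# Update join: overwrite index only where (clust_size, rn) matches
have[updated, index := i.index, on = .(clust_size, rn)]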


Multiply various subsets of a data frame by different elements of a vector R

I have a data frame:
df <- data.frame(id=rep(1:10,each=10),
                 Room1=rnorm(100,0.4,0.5),
                 Room2=rnorm(100,0.3,0.5),
                 Room3=rnorm(100,0.7,0.5))
And a vector:
vals <- sample(7:100, 10)
I want to multiply cols Room1, Room2 and Room3 by a different element of the vector for every unique ID number and output a new data frame (df2).
I managed to multiply each column per id by EVERY element of the vector using the following:
samp_func <- function(x) {
  x * vals[i]
}

for (i in vals) {
  df2 <- df %>% mutate_at(c("Room1", "Room2", "Room3"), samp_func)
}
But in the resulting df (df2), each Room column is multiplied by the same element of the vector (vals) for all of the different ids, when what I want is each Room column (per id) multiplied by a different element of vals. Sorry in advance if this is not clear; I am a beginner and still getting to grips with the terminology.
Thanks!
EDIT: The desired output should look like the below, where the columns for each ID have been multiplied by a different element of the vector vals.
id Room1 Room2 Room3
1 1 24.674826880 60.1942571 46.81276141
2 1 21.970270107 46.0461779 35.09928150
3 1 26.282357614 -3.5098880 38.68400541
4 1 29.614182061 -39.3025587 25.09146592
5 1 33.030886472 46.0354881 42.68209027
6 1 41.362699668 -23.6624632 26.93845129
7 1 5.429031042 26.7657577 37.49086963
8 1 18.733422977 -42.0620572 23.48992138
9 1 -17.144070723 9.9627315 55.43999326
10 1 45.392182468 20.3959968 -16.52166621
11 2 30.687978299 -11.7194020 27.67351631
12 2 -4.559185345 94.9256561 9.26738357
13 2 86.165076849 -1.2821515 29.36949423
14 2 -12.546711562 47.1763755 152.67588456
15 2 18.285856423 60.5679496 113.85971720
16 2 72.074929648 47.6509398 139.69051486
17 2 -12.332519694 67.8890324 20.73189965
18 2 80.889634991 69.5703581 98.84404415
19 2 87.991093995 -20.7918559 106.13610773
20 2 -2.685594148 71.0611693 47.40278949
21 3 4.764445589 -7.6155681 12.56546664
22 3 -1.293867841 -1.1092243 13.30775785
23 3 16.114831628 -5.4750642 8.58762550
24 3 -0.309470950 7.0656088 10.07624289
25 3 11.225609780 4.2121241 16.59168866
26 3 -3.762529113 6.4369973 15.82362705
27 3 -5.103277731 0.9215625 18.20823042
28 3 -10.623165177 -5.2896293 33.13656839
29 3 -0.002517872 5.0861361 -0.01966699
30 3 -2.183752881 24.4644310 13.55572730
This should solve your problem. You can build a small lookup table with one row per id, giving that id's multiplier, join it onto df, and then use mutate to scale the Room columns.
Also, in the future I'd recommend setting a seed when asking questions with random data as it's easier for someone to replicate your output.
library(dplyr)
set.seed(0)
df <- data.frame(id=rep(1:10,each=10),
                 Room1=rnorm(100,0.4,0.5),
                 Room2=rnorm(100,0.3,0.5),
                 Room3=rnorm(100,0.7,0.5))
vals <- sample(7:100, 10)
# One row per id, pairing each id with its multiplier.
# (Building this with rep() duplicates each (id, val) pair, which
# makes the join explode to 1000 rows; one row per id is enough.)
other_df <- data.frame(id = 1:10, val = vals)
df2 <- inner_join(other_df, df, by = "id")
df2 <- df2 %>%
  mutate(Room1 = Room1*val,
         Room2 = Room2*val,
         Room3 = Room3*val)
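For what it's worth, because the ids here are simply the integers 1 through 10, you could also skip the join entirely and index vals directly. A base R sketch of that idea:
# vals[df$id] picks out each row's multiplier by its id
df2 <- df
df2[, -1] <- df[, -1] * vals[df$id]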

Multiple rows to single cell space delimited values in pandas with group by

I have a data set similar to df1 here
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'id': [1, 1, 2, 2, 2],
                    'value': [67, 45, 7, 5, 9]})
id value
1 67
1 45
2 7
2 5
2 9
I want to bring it to this form: all the values corresponding to an id in one cell, separated by spaces.
id values
1 67 45
2 7 5 9
Here is my code
df2 = pd.DataFrame(df1['id'].unique())
df2.columns = ['id']
df2['values'] = np.nan
for i in df2['id']:
    s = ''
    for k in df1[df1['id'] == i]['value']:
        s = s + ' ' + str(k)
    df2.loc[df2['id'] == i, 'values'] = s.lstrip()
print(df2)
Is there a more pythonic way of doing this? I have 70,000 unique ids, and each id may have anywhere from 1 to 20 values.
I am using:
Anaconda Python 3.5
pandas 0.20.1
numpy 1.12.1
Windows 10
Also, how can we replicate the same in R?
Convert the 'value' column from int to string, then perform a groupby on 'id' and apply the str.join function:
# Convert 'value' column to string.
df1['value'] = df1['value'].astype(str)
# Perform a groupby and apply a string join.
df1 = df1.groupby('id')['value'].apply(' '.join).reset_index()
The resulting output:
id value
0 1 67 45
1 2 7 5 9
Here is how to do it in R; it is the same approach:
df = data.frame('id'=c(1,1,2,2,2), 'value'=c(67,45,7,5,9))
aggregate(cbind(values=value) ~ id,
          data = df,
          FUN = function(x){paste(x, collapse=' ')})
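If you prefer dplyr over aggregate, the same collapse could be sketched as:
library(dplyr)
df %>%
  group_by(id) %>%
  summarise(values = paste(value, collapse = ' '))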

R: Randomly sampling (with replacement) each column of a data frame independently

I am trying to create a new data frame by randomly sampling an existing data frame. Specifically, I want to create a data frame that is the same size as the original data frame, but where each column of the new data frame is a random sample (with replacement) of the corresponding column in the original data frame. My first attempt looked like this:
# Create toy data set
data.set <- as.data.frame(matrix(1:50, ncol = 5))
# Change names
colnames(data.set) <- c("Stuff", "Things", "Foo", "Bar", "Guff")
# Try to create randomly sampled data frame (sample_n comes from dplyr)
library(dplyr)
data.set %>% sample_n(replace = TRUE, size = nrow(data.set))
The problem here is that it just randomly samples rows, but not elements within each column individually. For example, here is some output.
Stuff Things Foo Bar Guff
2 2 12 22 32 42
10 10 20 30 40 50
2.1 2 12 22 32 42
3 3 13 23 33 43
5 5 15 25 35 45
3.1 3 13 23 33 43
8 8 18 28 38 48
9 9 19 29 39 49
1 1 11 21 31 41
6 6 16 26 36 46
Notice that the first and third rows are exactly the same, as are the fourth and sixth rows. What I would like is for each and every column to be randomly sampled independently. So, I tried this.
apply(data.set, MARGIN = 2, sample_n, replace = TRUE, size = nrow(data.set))
which produced the following error:
Error: Don't know how to sample from objects of class integer
although, I don't see what I did incorrectly. Can anyone offer a concise way of achieving my goal?
The error occurs because sample_n expects a data frame, while apply passes each column to the function as a bare vector; base R's sample works on vectors. Since MARGIN is 2, the function is applied over columns:
apply(data.set, MARGIN = 2, function(x) sample(x, replace = TRUE, size = length(x)))
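Note that apply returns a matrix rather than a data frame. If you need a data frame back, one option (a sketch of the same idea using lapply) is:
# lapply samples each column independently; as.data.frame reassembles
# the resulting list into a data frame
sampled <- as.data.frame(lapply(data.set, function(x) sample(x, size = length(x), replace = TRUE)))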

Avoid using a loop to get sum of rows in R, where I want to start and stop the sum on different columns for each row

I am relatively new to R, coming from Stata. I have a data frame that has 100+ columns and thousands of rows. Each row has a start value, a stop value, and 100+ columns of numerical values. The goal is to get, for each row, the sum of the values from the column that corresponds to the start value through the column that corresponds to the stop value. This is straightforward to do in a loop, which looks like this (the data frame is df, start is the start column, stop is the stop column):
for(i in 1:nrow(df)) {
  df$out[i] <- rowSums(df[i, df$start[i]:df$stop[i]])
}
This works great, but it is taking 15 minutes or so. Does anyone have any suggestions on a faster way to do this?
You can do this using some algebra (if you have a sufficient amount of memory):
DF <- data.frame(start=3:7, end=4:8)
DF <- cbind(DF, matrix(1:50, nrow=5, ncol=10))
# start end 1 2 3 4 5 6 7 8 9 10
#1 3 4 1 6 11 16 21 26 31 36 41 46
#2 4 5 2 7 12 17 22 27 32 37 42 47
#3 5 6 3 8 13 18 23 28 33 38 43 48
#4 6 7 4 9 14 19 24 29 34 39 44 49
#5 7 8 5 10 15 20 25 30 35 40 45 50
take <- outer(seq_len(ncol(DF)-2)+2, DF$start-1, ">") &
        outer(seq_len(ncol(DF)-2)+2, DF$end+1, "<")
diag(as.matrix(DF[,-(1:2)]) %*% take)
#[1] 7 19 31 43 55
If you are dealing with values that are all the same type, you typically want to work with matrices. Here is a solution in matrix form:
rows <- 10^3
cols <- 10^2
start <- sample(1:cols, rows, replace=T)
end <- pmin(cols, start + sample(1:(cols/2), rows, replace=T))
# first 2 cols of matrix are start and end, the rest are
# random data
mx <- matrix(c(start, end, runif(rows * cols)), nrow=rows)
# use `apply` to apply a function to each row, here the
# function sums each row excluding the first two values
# from the value in the start column to the value in the
# end column
apply(mx, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
# df version
df <- as.data.frame(mx)
df$out <- apply(df, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
You can convert your data.frame to a matrix with as.matrix. You can also run the apply directly on your data.frame as shown, which should still be reasonably fast. The real problem with your code is that you are modifying a data frame nrow times, and modifying data frames is very slow. By using apply you get around that by generating your answer (the $out column) in one go, which you can then cbind back to your data frame (and that means you modify your data frame just once).
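Another formulation of the same computation, sketched under the mx layout above (not benchmarked here), is to take row-wise cumulative sums once and difference them; matrixStats::rowCumsums could replace the apply call if that package is available:
dat <- mx[, -(1:2)]
cs  <- t(apply(dat, 1, cumsum))          # running totals per row
hi  <- cs[cbind(seq_len(rows), mx[, 2])] # cumulative sum at the end column
lo  <- ifelse(mx[, 1] > 1,
              cs[cbind(seq_len(rows), pmax(mx[, 1] - 1, 1))], 0)
out2 <- hi - lo                          # sum from start through end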

Row aggregation when values are close enough in a column

I have a dataframe with 2 columns
time x
1306247226 5
1306247236 10
1306248127 20
1306248187 36
1306249248 28
1306249258 24
1306249259 20
...
I'd like to aggregate the rows whose values in the 'time' column are close enough
(e.g., their difference is less than 60) and sum their 'x' values in the aggregated row. The 'time' value in the aggregated row will be the one from the first row of the aggregation ('time' is a Unix timestamp).
The goal is to have as output of this example:
time x
1306247226 15
1306248127 20
1306248187 36
1306249248 72
...
The dataset is quite big, so a 'for' loop would take a long time... but if that is the only option I can deal with it and wait.
Any idea?
Thanks a lot!
You can use something like this.
First, create a new grouping column: the cumulative sum increments, starting a new group, whenever the gap to the previous row exceeds 60.
dat$gg <- cumsum(c(0,diff(dat$time)) > 60)
Then use the plyr package to aggregate within each group:
library(plyr)
ddply(dat,.(gg),summarise,time = head(time,1),res = sum(x))
gg time res
1 0 1306247226 15
2 1 1306248127 56
3 2 1306249248 72
Edit after comment
The OP wants rows merged only when the difference is strictly less than 60, so a gap of exactly 60 should start a new group. That means changing the > to >=:
dat$gg <- cumsum(c(0,diff(dat$time)) >= 60)
ddply(dat,.(gg),summarise,time = head(time,1),res = sum(x))
gg time res
1 0 1306247226 15
2 1 1306248127 20
3 2 1306248187 36
4 3 1306249248 72
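For what it's worth, plyr has largely been superseded by dplyr these days; a dplyr sketch of the same aggregation would be:
library(dplyr)
dat %>%
  group_by(gg = cumsum(c(0, diff(time)) >= 60)) %>%
  summarise(time = first(time), res = sum(x))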
