how to replace values in a dataframe from a second smaller dataframe? - r

I am relatively new to R and I have a problem with a data frame.
I have a very long data frame (df1) with coordinates x, y and a value z. I have a shorter data frame (df2) with the same columns but fewer rows. I want to replace the values of z in df1 wherever its x and y match a row in df2.
x = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4)
y = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)
z = c(8, 5, 3, 1, 2, 6, 8, 5, 3, 2, 8, 4, 4, 6, 2, 1)
df1 = data.frame(x, y, z)
x1 = c(1, 3, 4)
y1 = c(2, 1, 4)
z1 = c(58, 37, 23)
df2 = data.frame(x1, y1, z1)
names(df2) <- c("x", "y", "z")
I thought that I might use the ifelse function as:
df1$znew <- ifelse((df1[,1] == df2[,1]) & (df1[,2] == df2[,2]), df2[,3], df1[,3])
But the two objects do not have the same dimensions.
I have also tried loops that go over each row, compare x and y, and then decide which z to use, but I can't make them work.
In the end I would like to have a data frame with a new z variable so that I can compare the values and confirm that they really changed. My final data frame would look like:
znew = c(8,58,3,1,2,6,8,5,37,2,8,4,4,6,2,23)
I really appreciate any help, and I am sorry if somebody else has posted a similar question; I have been trying to figure this out all day and I can't find any example that suits my case.

Assuming the two data frames do in fact have the same column names (which the names(df2) <- c("x", "y", "z") line takes care of), you might do this with merge:
tmp <- merge(df1,df2,all.x = TRUE,by = c('x','y'))
tmp$z.x[!is.na(tmp$z.y)] <- tmp$z.y[!is.na(tmp$z.y)]
> tmp
   x y z.x z.y
1  1 1   8  NA
2  1 2  58  58
3  1 3   3  NA
4  1 4   1  NA
5  2 1   2  NA
6  2 2   6  NA
7  2 3   8  NA
8  2 4   5  NA
9  3 1  37  37
10 3 2   2  NA
11 3 3   8  NA
12 3 4   4  NA
13 4 1   4  NA
14 4 2   6  NA
15 4 3   2  NA
16 4 4  23  23
Then just remove the extra column and rename the columns.
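For example (a small sketch, not part of the original answer):
tmp$z.y <- NULL                         # drop the column that came from df2
names(tmp)[names(tmp) == "z.x"] <- "z"  # rename z.x back to z
An alternative that skips merge entirely is to look up matching (x, y) pairs with match(); pasting the coordinates into a single key is just one illustrative way to do the lookup:
key1 <- paste(df1$x, df1$y)
key2 <- paste(df2$x, df2$y)
idx  <- match(key1, key2)               # matching row of df2 for each row of df1, NA if none
df1$znew <- ifelse(is.na(idx), df1$z, df2$z[idx])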

Related

Vector of repeated index values

I have a vector of the following form:-
a <- c(4, 6, 3, 6, 1)
What I want is to build a vector in which each index of a is repeated as many times as the value at that index. For example, the first element of a is 4, so there should be four 1s, followed by six 2s, followed by three 3s, and so on.
The resulting vector should be of the following form:
b <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5)
Thanks in advance.
We can use rep:
a <- c(4, 6, 3, 6, 1)
rep(seq_along(a), a)
#[1] 1 1 1 1 2 2 2 2 2 2 3 3 3 4 4 4 4 4 4 5
We can also use sequence: sequence(a) restarts counting from 1 for each element of a, so cumsum of the positions that equal 1 recovers the index.
cumsum(sequence(a) == 1)
#[1] 1 1 1 1 2 2 2 2 2 2 3 3 3 4 4 4 4 4 4 5
Or using uncount from tidyr:
library(dplyr)
library(tidyr)
tibble(a) %>%
  mutate(rn = row_number()) %>%
  uncount(a)
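If you want a plain vector rather than a tibble, you can pull the column out at the end (a small sketch building on the uncount approach above; rn is just the helper column created there):
library(dplyr)
library(tidyr)
tibble(a = c(4, 6, 3, 6, 1)) %>%
  mutate(rn = row_number()) %>%  # index of each element of a
  uncount(a) %>%                 # repeat each row a[i] times (drops the weight column a)
  pull(rn)
#  [1] 1 1 1 1 2 2 2 2 2 2 3 3 3 4 4 4 4 4 4 5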

Aggregate rows in data.frame containing same values over different columns [duplicate]

This question already has answers here:
Aggregating regardless of the order of columns
(4 answers)
Closed 3 years ago.
The following works as expected:
m <- matrix(c(1, 2, 3,
              1, 2, 4,
              2, 1, 4,
              2, 1, 4,
              2, 3, 4,
              2, 3, 6,
              3, 2, 3,
              3, 2, 2), byrow = TRUE, ncol = 3)
df <- data.frame(m)
aggdf <- aggregate(df$X3, list(df$X1, df$X2), FUN=sum)
colnames(aggdf) <- c("A", "B", "value")
and results in:
  A B value
1 2 1     8
2 1 2     7
3 3 2     5
4 2 3    10
But I would like to treat rows 1/2 and 3/4 as equal, not caring whether observation A is 1 and B is 2 or vice versa.
I also do not care about how the aggregation is sorting A/B in the final data.frame, so both of the following results would be fine:
A B value
1 2 1 15
2 3 2 15
A B value
1 1 2 15
2 2 3 15
How can that be achieved?
You need to get them in a consistent order. For just 2 columns, pmin and pmax work nicely:
df$A = with(df, pmin(X1, X2))
df$B = with(df, pmax(X1, X2))
aggregate(df$X3, df[c("A", "B")], FUN = sum)
#   A B  x
# 1 1 2 15
# 2 2 3 15
For more columns, use sort, as akrun recommends:
df[1:2] <- t(apply(df[1:2], 1, sort))
By changing 1:2 to all the key columns, this generalizes up easily.
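Putting the sort-based approach together end to end (a small sketch, assuming df is the freshly built data frame from the question):
df[1:2] <- t(apply(df[1:2], 1, sort))   # make (2, 1) and (1, 2) the same key
aggdf <- aggregate(df$X3, df[1:2], FUN = sum)
colnames(aggdf) <- c("A", "B", "value")
aggdf
#   A B value
# 1 1 2    15
# 2 2 3    15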

calculating simple retention in R

For the dataset test, my objective is to find out how many unique users carried over from one period to the next on a period-by-period basis.
> test
   user_id period
1        1      1
2        5      1
3        1      1
4        3      1
5        4      1
6        2      2
7        3      2
8        2      2
9        3      2
10       1      2
11       5      3
12       5      3
13       2      3
14       1      3
15       4      3
16       5      4
17       5      4
18       5      4
19       4      4
20       3      4
For example, in the first period there were four unique users (1, 3, 4, and 5), two of which were active in the second period. Therefore the retention rate would be 0.5. In the second period there were three unique users, two of which were active in the third period, and so the retention rate would be 0.666, and so on. How would one find the percentage of unique users that are active in the following period? Any suggestions would be appreciated.
The output would be the following:
> output
  period retention
1      1        NA
2      2     0.500
3      3     0.666
4      4     0.500
The test data:
> dput(test)
structure(list(user_id = c(1, 5, 1, 3, 4, 2, 3, 2, 3, 1, 5, 5,
2, 1, 4, 5, 5, 5, 4, 3), period = c(1, 1, 1, 1, 1, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4)), .Names = c("user_id", "period"
), row.names = c(NA, -20L), class = "data.frame")
How about this? First split the users by period, then write a function that calculates the proportion carryover between any two periods, then loop it through the split list with mapply.
splt <- split(test$user_id, test$period)
carryover <- function(x, y) {
  length(unique(intersect(x, y))) / length(unique(x))
}
mapply(carryover, splt[1:(length(splt) - 1)], splt[2:length(splt)])
        1         2         3
0.5000000 0.6666667 0.5000000
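To get the result in the same shape as the desired output (with an NA for the first period), you could wrap the mapply result like this (a small sketch):
rates <- mapply(carryover, splt[-length(splt)], splt[-1])
output <- data.frame(period    = as.integer(names(splt)),
                     retention = c(NA, rates))
output
#   period retention
# 1      1        NA
# 2      2 0.5000000
# 3      3 0.6666667
# 4      4 0.5000000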
Here is an attempt using dplyr, though it also relies on some base R subsetting inside the summarise:
test %>%
  group_by(period) %>%
  summarise(retention = length(intersect(user_id, test$user_id[test$period == (period + 1)])) / n_distinct(user_id)) %>%
  mutate(retention = lag(retention))
This returns:
  period retention
   <dbl>     <dbl>
1      1        NA
2      2 0.5000000
3      3 0.6666667
4      4 0.5000000
This isn't so elegant but it seems to work. Assuming df is the data frame:
# make a list to hold the unique IDs in each period
uniques = list()
for(i in 1:max(df$period)){
  uniques[[i]] = unique(df$user_id[df$period == i])
}
# hold the retention rates
retentions = rep(NA, times = max(df$period))
for(j in 2:max(df$period)){
  retentions[j] = mean(uniques[[j-1]] %in% uniques[[j]])
}
Basically the %in% creates a logical of whether or not each element of the first argument is in the second. Taking a mean gives us the proportion.
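For instance, with made-up vectors (values are illustrative only):
# which of these four period-1 users show up in period 2, and what fraction?
c(1, 5, 3, 4) %in% c(2, 3, 1)        # TRUE FALSE TRUE FALSE
mean(c(1, 5, 3, 4) %in% c(2, 3, 1))  # 0.5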

Create all pairs within groups and maintaining variables

I have a dataframe with around 30k observations, divided into 300 groups. For example:
id, group, x, y
1, 1, 2, 3
2, 1, 4, 3
3, 1, 2, 4
4, 2, 5, 4
5, 2, 5, 3
6, 2, 6, 4
I want to make it so
pair, group, x_i, x_j, y_i, y_j
12, 1, 2, 4, 3, 3
13, 1, 2, 2, 3, 4
23, 1, 4, 2, 3, 4
45, 2, 5, 5, 4, 3
and so on. I've found a few topics, but they don't seem to apply exactly to my problem.
The combn function can be used to generate each corresponding pair of x and y values. We operate by group using lapply. lapply returns a list so we use rbind to put each list element (the results for each group) back together in a single data frame.
new.dat = lapply(unique(dat$group), function(g) {
  data.frame(pairs = apply(t(combn(dat$id[dat$group == g], 2)), 1, paste, collapse = ""),
             group = g,
             x = t(combn(dat$x[dat$group == g], 2)),
             y = t(combn(dat$y[dat$group == g], 2)))
})
do.call(rbind, new.dat)
  pairs group x.1 x.2 y.1 y.2
1    12     1   2   4   3   3
2    13     1   2   2   3   4
3    23     1   4   2   3   4
4    45     2   5   5   4   3
5    46     2   5   6   4   4
6    56     2   5   6   3   4
You could also use split, which saves some typing, but is about 10% slower on my machine:
lapply(split(dat, dat$group), function(df) {
  data.frame(pairs = apply(t(combn(df$id, 2)), 1, paste, collapse = ""),
             group = df$group[1],
             x = t(combn(df$x, 2)),
             y = t(combn(df$y, 2)))
})
I won't say this is an optimal solution, but it should work:
df <- read.table(text = "id,group,x,y
1,1,2,3
2,1,4,3
3,1,2,4
4,2,5,4
5,2,5,3
6,2,6,4", header = TRUE, sep = ",")
df.new <- do.call(rbind, lapply(tapply(df$id, df$group, combn, m = 2),
                                FUN = function(x) data.frame(pairi = x[1, ], pairj = x[2, ])))
df.new <- do.call(rbind, apply(df.new, 1, FUN = function(x)
  data.frame(pair = paste0(x[1], x[2]),
             group = df[df$id == x[1], 'group'],
             x_i = df[df$id == x[1], 'x'], x_j = df[df$id == x[2], 'x'],
             y_i = df[df$id == x[1], 'y'], y_j = df[df$id == x[2], 'y'])))
df.new
    pair group x_i x_j y_i y_j
1.1   12     1   2   4   3   3
1.2   13     1   2   2   3   4
1.3   23     1   4   2   3   4
2.1   45     2   5   5   4   3
2.2   46     2   5   6   4   4
2.3   56     2   5   6   3   4
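If the 30k rows make the row-by-row apply too slow, a data.table version of the same combn idea might be worth trying (a sketch, not from the answers above; it assumes df has the columns id, group, x, y as in the question):
library(data.table)
dt <- as.data.table(df)
pairs_dt <- dt[, {
  idx <- combn(.N, 2)                           # all index pairs within this group
  .(pair = paste0(id[idx[1, ]], id[idx[2, ]]),
    x_i = x[idx[1, ]], x_j = x[idx[2, ]],
    y_i = y[idx[1, ]], y_j = y[idx[2, ]])
}, by = group]
pairs_dt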

New variable with values depending on combination of other variables

I'm very inexperienced in R, and although this site has been tremendously helpful, I have a very specific situation and cannot find a solution. I imagine I need to write a function to accomplish this, but my current time frame does not allow me to spend much time on trial and error. (I apologize in advance for anything unclear.)
Here is an example of my current data:
UniqueID, Time1.Feel1, Time2.Feel1.1, Time2.Feel1.2, Time2Num
1, 9, 5, 6, 1
1, 9, 7, 5, 2
2, 4, 3, 4, 1
2, 4, 5, 6, 2
3, 7, 4, 7, 1
3, 7, 6, 5, 2
I want to create a new variable: Time2.Feel1, which consists of the values of either Time2.Feel1.1 OR Time2.Feel1.2, depending on the value of Time2Num.
So, this:
UniqueID, Time1.Feel1, Time2.Feel1.1, Time2.Feel1.2, Time2Num, Time2.Feel1
1, 9, 5, 6, 1, 5
1, 9, 7, 5, 2, 5
2, 4, 3, 4, 1, 3
2, 4, 5, 6, 2, 6
3, 7, 4, 7, 1, 4
3, 7, 6, 5, 2, 5
I need to do this 30 times (i.e., Time2Num takes values 1:30 and there are 30 different variables, Time2.Feel1.1 through Time2.Feel1.30).
I then want to calculate a correlation between Time1.Feel1 and Time2.Feel1 for EACH UniqueID, creating a new data frame with the variables UniqueID and the new correlations. This part is less of a concern; I think I've figured out how to that, but if the combined steps could be done more simply, I'd prefer that.
Thanks in advance!
To expound on #thelatemail's comment, you could do this
dat <- read.csv(text="UniqueID, Time1.Feel1, Time2.Feel1.1, Time2.Feel1.2, Time2Num
1, 9, 5, 6, 1
1, 9, 7, 5, 2
2, 4, 3, 4, 1
2, 4, 5, 6, 2
3, 7, 4, 7, 1
3, 7, 6, 5, 2")
dat$Time2.Feel1 <- dat[c("Time2.Feel1.1","Time2.Feel1.2")][cbind(seq(nrow(dat)),dat$Time2Num)]
#   UniqueID Time1.Feel1 Time2.Feel1.1 Time2.Feel1.2 Time2Num Time2.Feel1
# 1        1           9             5             6        1           5
# 2        1           9             7             5        2           5
# 3        2           4             3             4        1           3
# 4        2           4             5             6        2           6
# 5        3           7             4             7        1           4
# 6        3           7             6             5        2           5
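The cbind(...) there builds a two-column (row, column) index matrix, so each row picks exactly one of the two Feel columns. A tiny illustration of that matrix-indexing idea with made-up numbers:
m <- data.frame(a = c(10, 20, 30), b = c(40, 50, 60))
m[cbind(1:3, c(1, 2, 1))]   # row 1 -> column 1, row 2 -> column 2, row 3 -> column 1
# [1] 10 50 30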
Typing that out 30 times isn't very efficient, so you can automate it:
## creating some example data which I think matches your format
nr <- nrow(dat)
set.seed(1)
dat1 <- lapply(1:15, function(ii)
  matrix(c(sample(1:9, nr * 2, replace = TRUE),
           sample(1:2, nr, replace = TRUE)), nrow = nr,
         dimnames = list(NULL, c(paste0('Time2.Feel1.', 1 + 2 * (ii - 1)),
                                 paste0('Time2.Feel1.', 2 + 2 * (ii - 1)),
                                 sprintf('Time%sNum', 2 + 2 * (ii - 1))))))
dat1 <- data.frame(do.call('cbind', dat1))
# Time2.Feel1.1 Time2.Feel1.2 Time2Num Time2.Feel1.3 Time2.Feel1.4 Time4Num
# 1 3 9 2 4 3 1
# 2 4 6 1 7 4 2
# 3 6 6 2 9 1 1
# 4 9 1 1 2 4 1
# 5 2 2 2 6 8 2
# 6 9 2 2 2 4 2
# Time2.Feel1.5 Time2.Feel1.6 Time6Num Time2.Feel1.7 Time2.Feel1.8 Time8Num
# 1 8 8 2 1 9 1
# 2 1 5 2 1 3 2
# 3 7 5 1 3 5 1
# 4 4 8 2 5 3 2
# 5 8 1 1 6 6 1
# 6 6 5 1 4 3 2
# Time2.Feel1.9 Time2.Feel1.10 Time10Num Time2.Feel1.11 Time2.Feel1.12 Time12Num
# 1 4 7 2 3 5 1
# 2 4 9 1 1 4 2
# 3 5 4 2 6 8 2
# 4 9 7 1 8 6 1
# 5 8 4 1 8 6 1
# 6 4 3 1 8 4 1
etc, etc
So you can start here. First you make the input vectors:
xx, the first set of "choice" columns: Time2.Feel1.1, Time2.Feel1.3, Time2.Feel1.5, etc.;
yy, the second set of "choice" columns: Time2.Feel1.2, Time2.Feel1.4, Time2.Feel1.6, etc.;
and zz, the "decision" columns: Time2Num, Time4Num, Time6Num, etc.
Then use mapply to do the same indexing as above, but in a 1-1 mapping across those three input vectors. Note that xx, yy, and zz are all the same length.
n <- 30
xx <- paste0('Time2.Feel1.', seq(1, n - 1, by = 2))
yy <- paste0('Time2.Feel1.', seq(2, n, by = 2))
zz <- sprintf('Time%sNum', seq(2, n, by = 2))
nn <- sprintf('Time%s.Feel1', seq(2, n, by = 2))
res <- mapply(function(x, y, z) dat1[, c(x, y)][cbind(1:nr, dat1[, z])],
              xx, yy, zz, SIMPLIFY = FALSE)
res <- `colnames<-`(do.call('cbind', res), nn)
# Time2.Feel1 Time4.Feel1 Time6.Feel1 Time8.Feel1 Time10.Feel1 Time12.Feel1
# [1,] 9 4 8 1 7 3
# [2,] 4 4 5 3 4 4
# [3,] 6 9 7 3 4 8
# [4,] 9 2 8 3 9 8
# [5,] 2 8 8 6 8 8
# [6,] 2 4 6 3 4 8
And then you can combine the results back. You would need to reorder the columns if that is important to you.
## combine results into original data
cbind(dat1, res)
When searching for the error I received when trying the answer from #user12202013, I came across this solution using ifelse, found here: Conditional assignment of one variable to the value of one of two other variables
Time2.Feel1 <- ifelse(Time2Num == 1, Time2.Feel1.1,
                      ifelse(Time2Num == 2, Time2.Feel1.2, ""))
Although it is definitely not the most efficient solution, particularly because I need to nest it 30 times and I need to do it for 9 items, it solved my problem. A simpler answer is still welcome, though!
Thanks for your answers!
You want to do something like:
Time2.Feel1 <- rep(NA, length(Time2Num))
Time2.Feel1[Time2Num == 1] <- Time2.Feel1.1[Time2Num == 1]
Time2.Feel1[Time2Num == 2] <- Time2.Feel1.2[Time2Num == 2]
This creates a vector called Time2.Feel1, initialized with NA values. Then where Time2Num is 1 we fill in the values from Time2.Feel1.1, and where Time2Num is 2 we fill in the values from Time2.Feel1.2. Anywhere Time2Num is neither 1 nor 2, Time2.Feel1 keeps its NA value.
Edit:
Not sure what the error message is referring to since I am able to do this
# reproducible example
set.seed(1)
A <- letters
B <- sample(c(0, 1, NA), 26, TRUE)
A[B == 1] <- '5' # assignment where subscript contains NAs
A[B == 0] <- NA # assigning NA values
A
[1] NA "5" "5" "d" NA "f" "g" "5" "5" NA NA NA "m" "5" "o" "5" "q" "r" "5" "t" "u" NA "5" NA NA "5"
I would need to see more complete code to know what is causing the error.
