Replace values in one dataset with values in another dataset in R

I have a seemingly simple problem that has me stumped. I have a data frame, df1, say:
x y z
0 1 2
3 5 4
1 0 5
0 5 0
and another, df2:
x y z
1 5 6
2 4 5
4 5 7
5 8 5
I want to replace the zero values in df1 with the corresponding values in df2. E.g., the first cell of df1 would be 1 instead of zero. I want this for all columns in the data frame. Can you help me code this? I can't seem to figure it out. Thanks!

First, you can locate the positions of the zeros using which():
zero_locations <- which(df1 == 0, arr.ind=TRUE)
Then, you can use the locations to make the replacements:
df1[zero_locations] <- df2[zero_locations]
As David Arenburg pointed out in the comments, the call to which() isn't strictly necessary; a logical matrix can be used as the index directly:
zero_locations <- df1 == 0
Will work as well.
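For reference, here is a minimal, self-contained sketch of the logical-matrix variant applied to the two example data frames from the question. The data frames are rebuilt by hand from the tables above; nothing else is assumed.

```r
# Rebuild the two example data frames from the question.
df1 <- data.frame(x = c(0, 3, 1, 0), y = c(1, 5, 0, 5), z = c(2, 4, 5, 0))
df2 <- data.frame(x = c(1, 2, 4, 5), y = c(5, 4, 5, 8), z = c(6, 5, 7, 5))

# Logical matrix with the same dimensions as df1; TRUE marks a zero.
zero_locations <- df1 == 0

# Copy the cells at those positions from df2 into df1.
df1[zero_locations] <- df2[zero_locations]

print(df1)
```

This relies on df1 and df2 having the same dimensions and column order, as they do in the question.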

Related

Use if-else function on data frame with multiple values

I have a data frame that contains multiple values in each spot, like this:
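library(plyr)
ID <- c(1,1,1,2,2,2,2,3,3,4,4,4,5,6,6)
W <- c(29,72,32,33,34,44,42,78,32,42,18,26,10,34,39)
df1 <- data.frame(ID, W)
# collapse the unique W values of each ID into one comma-separated string
df <- ddply(df1, .(ID), summarize,
            X = paste(unique(W), collapse = ","))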
ID X
1 1 29,72,32
2 2 33,34,44,42
3 3 78,32
4 4 42,18,26
5 5 10
6 6 34,39
I am trying to generate another column using an if-else function so that every ID that has an X value greater than 70 will show a 1, and all others will show a 0, like this:
ID X Y
1 1 29,72,32 1
2 2 33,34,44,42 0
3 3 78,32 1
4 4 42,18,26 0
5 5 10 0
6 6 34,39 0
This is the code that I tried:
df$Y <- ifelse(df$X>=70, 1, 0)
But it doesn't work; it only seems to put the first value of each spot through the function:
ID X Y
1 1 29,72,32 0
2 2 33,34,44,42 0
3 3 78,32 1
4 4 42,18,26 0
5 5 10 0
6 6 34,39 0
It worked fine on my one column that has only one value per spot. Is there a way to get the if-else function to evaluate every value in each spot and assign a 1 if any of them fits the condition?
Thank you. I'm sorry that I do not know a lot of R vocabulary yet.
As 'X' is a string, we can split it at the commas to create a list of vectors, loop over the list with map_int, and check whether any of the values, once converted to numeric, is greater than 70:
library(dplyr)
library(purrr)
df %>%
mutate(Y = map_int(strsplit(X, ","), ~ +(any(as.numeric(.x) > 70))))
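The same split-then-test idea can be sketched in base R, without dplyr or purrr. This is an alternative, not the answer's code: the data frame is typed in by hand from the printed output above, and the name pieces is only illustrative. The original attempt misbehaves because X is a character column, so df$X >= 70 is a string comparison.

```r
# Rebuild the collapsed data frame from the question without plyr.
df <- data.frame(
  ID = 1:6,
  X = c("29,72,32", "33,34,44,42", "78,32", "42,18,26", "10", "34,39"),
  stringsAsFactors = FALSE
)

# Split each string on commas, convert the pieces to numbers, and flag
# the row with 1 if any piece is greater than 70, otherwise 0.
pieces <- strsplit(df$X, ",", fixed = TRUE)
df$Y <- vapply(pieces, function(p) as.integer(any(as.numeric(p) > 70)), integer(1))

print(df)
```

vapply() is used instead of sapply() so that the result is guaranteed to be one integer per row.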

R: loop matrix sort columns individually for specific rows

I want to sort my matrix (U) column-wise within each block of rows that share the same name. My (very large) matrix looks similar to this:
1 2
1 5 6
1 -4 4
1 6 -2
2 7 -2
2 -2 3
Now I want to loop through the matrix, find the rows that share a row name, and sort each column within those rows, resulting in this matrix:
1 2
1 -4 -2
1 5 4
1 6 6
2 -2 -2
2 7 3
My code until now looks like this:
First step was the row count, which works:
z <- 1
for(i in (1:nrow(U))){
if(row.names(U)[i] != row.names(U)[i-1]){
z = (sum(row.names(U) == row.names(U)[i]))+1}}
Now I wanted to add after the row count a sorting function and I tried this for the first set of rows manually:
x <- 1
for(x in (1:ncol(U))){
U[1:3,x]<- U[do.call(order, lapply(x:NCOL(U), function(x) U[1:3, x]
However, this loop is very slow, and it only fills in the first column correctly.
Do you have a recommendation for how I could improve my sorting function while taking the performance issues into account?
EDIT: I guess this was confusing in my first version. The first "column" of my matrix holds the row names, so in this example I have a 5x2 matrix.
Here's an approach which just uses order() first by row name, then by each column in turn. Is this what you're after?
U <- matrix(c(5,6,-4,4,6,-2,7,-2,-2,3), byrow=TRUE, ncol=2, dimnames=list(c(1,1,1,2,2), c(1,2)))
apply(U, 2, function(j) j[order(rownames(U), j)])
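We can use data.table: convert the data to a data.table, group by the first column ('U'), then loop through the remaining columns and sort each one. Here m1 is presumably the data with the row names held in a column named U: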
library(data.table)
as.data.table(m1)[, lapply(.SD, sort), by = U]
An alternative using dplyr
df = read.table(textConnection("U 1 2
1 5 6
1 -4 4
1 6 -2
2 7 -2
2 -2 3"), header= TRUE)
library(dplyr)
df %>% group_by(U) %>% transmute(sort(X1),sort(X2))
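If neither package is available, the per-group sort can also be sketched in base R with ave(). This is a different technique from the answers above, shown only as an illustration; the variable name sorted is made up for the example.

```r
# Example matrix from the question; the row names are the group labels.
U <- matrix(c(5, 6, -4, 4, 6, -2, 7, -2, -2, 3), byrow = TRUE, ncol = 2,
            dimnames = list(c("1", "1", "1", "2", "2"), c("1", "2")))

# Sort each column separately within each block of identical row names.
# ave() applies sort() to each group and writes the result back in place.
sorted <- apply(U, 2, function(col) ave(col, rownames(U), FUN = sort))
rownames(sorted) <- rownames(U)

print(sorted)
```

Because ave() groups by label rather than by position, this does not require rows with the same name to be adjacent.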

remove duplicate row based only on previous row

I'm trying to remove duplicate rows from a data frame, based only on the previous row. The duplicate and unique functions will remove all duplicates, leaving you only with unique rows, which is not what I want.
I've illustrated the problem here with a loop. I need to vectorize this because my actual data set is much too large to use a loop on.
x <- c(1,1,1,1,3,3,3,4)
y <- c(1,1,1,1,3,3,3,4)
z <- c(1,2,1,1,3,2,2,4)
xy <- data.frame(x,y,z)
xy
x y z
1 1 1 1
2 1 1 2
3 1 1 1
4 1 1 1 #this should be removed
5 3 3 3
6 3 3 2
7 3 3 2 #this should be removed
8 4 4 4
# loop that produces desired output
toRemove <- NULL
for (i in 2:nrow(xy)){
test <- as.vector(xy[i,] == xy[i-1,])
if (!(FALSE %in% test)){
toRemove <- c(toRemove, i) #build a vector of rows to remove
}
}
xy[-toRemove,] #exclude rows
x y z
1 1 1 1
2 1 1 2
3 1 1 1
5 3 3 3
6 3 3 2
8 4 4 4
I've tried using dplyr's lag function, but it only works on single columns. When I try to run it over all 3 columns, it doesn't work.
ifelse(xy[,1:3] == lag(xy[,1:3],1), NA, xy[,1:3])
Any advice on how to accomplish this?
It looks like we want to remove a row if it is the same as the row above:
# make an index, if cols not same as above
ix <- c(TRUE, rowSums(tail(xy, -1) == head(xy, -1)) != ncol(xy))
# filter
xy[ix, ]
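As a worked check, the following sketch applies the index to the example data from the question. The intermediate name same_as_above and the variable result are added here only for readability.

```r
xy <- data.frame(x = c(1, 1, 1, 1, 3, 3, 3, 4),
                 y = c(1, 1, 1, 1, 3, 3, 3, 4),
                 z = c(1, 2, 1, 1, 3, 2, 2, 4))

# Compare rows 2..n with rows 1..(n-1), cell by cell.
same_as_above <- tail(xy, -1) == head(xy, -1)

# Keep the first row, plus every row where at least one cell differs
# from the row directly above it.
ix <- c(TRUE, rowSums(same_as_above) != ncol(xy))
result <- xy[ix, ]

print(result)
```

Rows 4 and 7 are dropped, which matches the output of the loop in the question.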
Why don't you just iterate over the rows while keeping track of the previous row, so you can compare it to the next one?
If they match at some point, remember that row's position, remove it from the list, then start iterating again from the beginning.
Don't delete a row while iterating, because you will get a concurrent modification error.

R data.table selecting the previous row within group blocks

I have the following example data frame.
id value
a 3
a 4
a 8
b 9
b 8
I want to convert it so that I can calculate differences in the column "value" between successive rows. So the expected result is
id value prevValue
a 3 0
a 4 3
a 8 4
b 9 0
b 8 9
Notice that within each group I want prevValue to start with 0, and each later entry to be the value from the row before it. I tried the following
x = x[,list(
prevValue = c(0,value[1:(.N-1)])
),by=id]
but no luck.
Thanks in advance.
Use negative indexing, something like:
x[,prev.value := c(0,value[-.N]) ,by=id]
Without data.table:
with(dat,ave(value,id,FUN=function(x) c(0,head(x,-1))))
[1] 0 3 4 0 9
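To connect this back to the stated goal of computing differences between successive rows, here is a small base R sketch. The data frame dat is rebuilt from the example table, and the diff column is an addition for illustration, not part of the answers above.

```r
dat <- data.frame(id = c("a", "a", "a", "b", "b"),
                  value = c(3, 4, 8, 9, 8))

# Previous value within each id, with 0 for the first row of each group.
dat$prevValue <- with(dat, ave(value, id, FUN = function(x) c(0, head(x, -1))))

# Difference between each value and the one before it in the same group.
dat$diff <- dat$value - dat$prevValue

print(dat)
```

Note that with 0 as the starting prevValue, the difference in the first row of each group is simply that row's own value.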

Data frame "expand" procedure in R?

This is not a real statistical question, but rather a data preparation question before performing the actual statistical analysis. I have a data frame which consists of sparse data. I would like to "expand" this data to include zeroes for missing values, group by group.
Here is an example of the data (a and b are two factors defining the group, t is the sparse timestamp and x is the value):
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
Assuming I would like to expand the values between t=0 and t=9, this is the result I'm hoping for:
test.expanded <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
t=c(0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9),
x=c(1,0,2,1,2,0,0,2,0,0,0,0,0,1,1,0,2,1,1,3))
Zeroes have been inserted for all missing values of t. This makes it easier to use.
I have a quick and dirty implementation which sorts the dataframe and loops through each of its lines, adding missing lines one at a time. But I'm not entirely satisfied by the solution. Is there a better way to do it?
For those who are familiar with SAS, it is similar to the proc expand.
Thanks!
As you noted in a comment to the other answer, doing it by group is easy with plyr which just leaves how to "fill in" the data sets. My approach is to use merge.
library("plyr")
test.expanded <- ddply(test, c("a","b"), function(DF) {
DF <- merge(data.frame(t=0:9), DF[,c("t","x")], all.x=TRUE)
DF[is.na(DF$x),"x"] <- 0
DF
})
merge with all.x=TRUE will make the missing values NA, so the second line of the function is needed to replace those NAs with 0's.
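The same merge idea can be sketched without plyr by first building the full grid of groups and timestamps. This is an alternative under the assumption that only the (a, b) groups present in the data should be expanded; the names groups and full_grid are illustrative.

```r
test <- data.frame(
  a = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  b = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  t = c(0, 2, 3, 4, 7, 3, 4, 6, 7, 8, 9),
  x = c(1, 2, 1, 2, 2, 1, 1, 2, 1, 1, 3))

# Every (a, b) group that occurs in the data, crossed with every t in 0..9.
# merge() on two data frames with no shared columns gives a cross join.
groups <- unique(test[, c("a", "b")])
full_grid <- merge(groups, data.frame(t = 0:9))

# Left join the observed values onto the grid; unobserved rows get NA.
test.expanded <- merge(full_grid, test, by = c("a", "b", "t"), all.x = TRUE)

# Replace the NAs introduced by the join with zeroes.
test.expanded$x[is.na(test.expanded$x)] <- 0

# Put the rows in a, b, t order explicitly rather than relying on merge().
test.expanded <- test.expanded[order(test.expanded$a, test.expanded$b, test.expanded$t), ]
rownames(test.expanded) <- NULL

print(test.expanded)
```

The result has 20 rows, one per group and timestamp, with x matching the expected data frame in the question.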
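This is convoluted, and as the output shows it only appends the t values that are missing from the data as a whole (here 1 and 5), leaving a, b and x as NA in the new rows: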
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
my.seq <- seq(0,9)
not.t <- !(my.seq %in% test$t)
test[nrow(test)+seq(length(my.seq[not.t])),"t"] <- my.seq[not.t]
test
#------------
a b t x
1 1 1 0 1
2 1 1 2 2
3 1 1 3 1
4 1 1 4 2
5 1 1 7 2
6 1 2 3 1
7 1 2 4 1
8 1 2 6 2
9 1 2 7 1
10 1 2 8 1
11 1 2 9 3
12 NA NA 1 NA
13 NA NA 5 NA
Not sure if you want it sorted by t afterwards or not. If so, easy enough to do:
https://stackoverflow.com/a/6871968/636656
