In the following dataframe I want to create a new column called D2 that matches the corresponding A, B, or C column. For example, if D == A, I want D2 == A2.
A A2 B B2 C  C2 D
1 10 2 90 3   9 1
1 11 2 99 3  15 1
1 42 2  2 3   9 2
1  5 2 54 3 235 2
1 13 2 20 3  10 3
1  6 2  1 3   4 3
This is what I want the new data frame to look like:
A A2 B B2 C  C2 D D2
1 10 2 90 3   9 1 10
1 11 2 99 3  15 1 11
1 42 2  2 3   9 2  2
1  5 2 54 3 235 2 54
1 13 2 20 3  10 3 10
1  6 2  1 3   4 3  4
I have succeeded in doing this with ifelse statements using dplyr, but because I am doing this with many columns, it gets tedious after a while. I was wondering if there is a more clever way to accomplish the same task.
library(dplyr)
newdata <- olddata %>% mutate(D2=ifelse(D==A,A2,ifelse(D==B,B2,C2)))
We can do this efficiently with max.col from base R. Subset 'olddata' to just the 'A', 'B', 'C' columns ('d1') and check where it equals 'D' (the 'D' column is recycled across the columns of 'd1'). Use max.col to find the column index of the maximum element per row (i.e. the TRUE, assuming there is a single TRUE per row), and multiply by 2 because the 'A2', 'B2', 'C2' columns alternate with 'A', 'B', 'C' in 'olddata'. Finally, cbind that with the row sequence to create a row/column index matrix and extract the elements it points to as the new 'D2' column.
d1 <- olddata[c("A", "B", "C")]
olddata$D2 <- olddata[cbind(1:nrow(d1),
                            max.col(d1 == olddata$D, "first") * 2)]
olddata$D2
#[1] 10 11 2 54 10 4
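For illustration, this is the row/column index matrix that the cbind call builds for this data (column positions 2, 4 and 6 of 'olddata' are 'A2', 'B2', 'C2'):
cbind(1:nrow(d1), max.col(d1 == olddata$D, "first") * 2)
#     [,1] [,2]
#[1,]    1    2
#[2,]    2    2
#[3,]    3    4
#[4,]    4    4
#[5,]    5    6
#[6,]    6    6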
A slightly different approach is to compare the columns one at a time in a loop with lapply (this should be easier on memory for very big datasets, since it avoids building one large logical matrix), and then use those logical vectors with mapply to subset the corresponding 'A2', 'B2', 'C2' columns. Note that c() flattens the matches column by column (A2's matches first, then B2's, then C2's), which happens to coincide with the row order in this data.
i1 <- grep("^[^D]", names(olddata)) # index of the columns that are not 'D'
i2 <- seq(1, ncol(olddata[i1]), by = 2) # positions of A, B, C
i3 <- seq(2, ncol(olddata[i1]), by = 2) # positions of A2, B2, C2
olddata$D2 <- c(mapply(`[`, olddata[i3], lapply(olddata[i2], `==`, olddata$D)))
olddata$D2
#[1] 10 11 2 54 10 4
I am facing another problem in R. I have two data frames (with the same number of rows and columns). I want to merge the two into one, but the 6 columns of data frame 1 should become columns 1, 3, 5, 7, 9, 11 in the new data frame, while those of data frame 2 should become columns 2, 4, 6, 8, 10, 12 in the merged result.
I can do it with a for loop, but is there any smarter way/function to do it? Thank you in advance ; )
You can cbind them and then reorder the columns accordingly:
df1 <- as.data.frame(sapply(LETTERS[1:6], function(x) 1:3))
df2 <- as.data.frame(sapply(letters[1:6], function(x) 1:3))
cbind(df1, df2)[, matrix(seq_len(2*ncol(df1)), 2, byrow=TRUE)]
# A a B b C c D d E e F f
# 1 1 1 1 1 1 1 1 1 1 1 1 1
# 2 2 2 2 2 2 2 2 2 2 2 2 2
# 3 3 3 3 3 3 3 3 3 3 3 3 3
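To see why this interleaves: filling the index matrix by row puts the df1 positions (1:6) in the first row and the df2 positions (7:12) in the second, and using it as a column index reads it back column by column, pairing position i with position i + 6:
matrix(seq_len(2*ncol(df1)), 2, byrow=TRUE)
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    2    3    4    5    6
# [2,]    7    8    9   10   11   12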
The code below will produce your required result, and will also work if one data frame has more columns than the other
# create sample data
df1 <- data.frame(
a1 = 1:10,
a2 = 2:11,
a3 = 3:12,
a4 = 4:13,
a5 = 5:14,
a6 = 6:15
)
df2 <- data.frame(
b1 = 11:20,
b2 = 12:21,
b3 = 13:22,
b4 = 14:23,
b5 = 15:24,
b6 = 16:25
)
# join by interleaving columns
want <- cbind(df1,df2)[,order(c(1:length(df1),1:length(df2)))]
Explanation:
cbind(df1,df2) combines the data frames with all the df1 columns first, then all the df2 columns.
The [,...] element re-orders these columns.
c(1:length(df1), 1:length(df2)) gives 1 2 3 4 5 6 1 2 3 4 5 6 - i.e. the column positions in df1, followed by those in df2 (length() of a data frame is its number of columns)
order() of this gives 1 7 2 8 3 9 4 10 5 11 6 12 which is the required column order
So [, order(c(1:length(df1), 1:length(df2)))] re-orders the columns so that the columns of the original data frames are interleaved as required.
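You can verify the index vector directly:
order(c(1:6, 1:6))
# [1]  1  7  2  8  3  9  4 10  5 11  6 12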
Following on from a previous question here I extracted the following data.frame
DF <- data.frame(A =c("One","Two","Three","Four","Five"),
B=c(1,1,2,2,3),
D=c(10,2,3,-5,5))
subset(DF, B %in% c(1,3))
     A B  D
1  One 1 10
2  Two 1  2
5 Five 3  5
but now I want to (for example) multiply the numbers by (say) five and replace them in the original data.frame
The following code
subset(DF, B %in% c(1,3))[,2:3] * 5
   B  D
1  5 50
2  5 10
5 15 25
gives me the numbers I want, but how do I get them back into the original data frame to end up with
      A  B  D
1   One  5 50
2   Two  5 10
3 Three  2  3
4  Four  2 -5
5  Five 15 25
The answer is staring me in the face (i.e. the index numbers ... but how do I get to them)?
You can do
DF[DF$B %in% c(1, 3), 2:3] <- DF[DF$B %in% c(1, 3), 2:3] * 5
DF
# A B D
#1 One 5 50
#2 Two 5 10
#3 Three 2 3
#4 Four 2 -5
#5 Five 15 25
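If you use the condition more than once, it is a little cleaner to store the logical index first and reuse it (idx is just an illustrative name):
idx <- DF$B %in% c(1, 3) # logical index of the rows to change
DF[idx, 2:3] <- DF[idx, 2:3] * 5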
Say I have the following dataframe, df:
  A  B C
1 4 25 a
2 3 79 b
3 4 25 c
4 6 17 d
5 4 21 e
6 5 25 f
How do I index the rows where elements in certain columns match a vector, e.g. df$A == 4 & df$B == 25?
I would have thought something like: df[df[,c("A", "B")] == c(4, 25),] would have worked, but this doesn't give me the right answer (it returns no lines).
I would like a method that uses a vector of column names to match on, and a vector of values to match.
Try this:
df[colSums(t(df[,c("A", "B")]) == c(4, 25))==2,]
  A  B C
1 4 25 a
3 4 25 c
This works by recycling the vector c(4, 25) down the transposed matrix: after t(), each column holds one row's A and B values, so 4 lines up with A and 25 with B, and colSums(...) == 2 keeps the rows where both match.
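The same idea generalizes to a vector of column names and a vector of values, as asked for (cols and vals are illustrative names):
cols <- c("A", "B")
vals <- c(4, 25)
df[colSums(t(df[cols]) == vals) == length(cols), ]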
You could also use a simple merge, which would have analogies for data.table and dplyr and in sql as well:
merge(df, setNames(list(4,25),c("A","B")))
# A B C
#1 4 25 a
#2 4 25 c
Suppose I have data which looks like this:
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
I want to compare these values with each other: if an ID has changed its value of the A variable over the period of the B variable (which runs from 1 to 4), it goes into data frame K, and if it hasn't, it goes into data frame L.
so in this data set K will look like
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
and L will look like
ID A B C
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
In terms of loops and if/else statements it can be solved along these lines:
K <- L <- df[0, ] # empty copies with the same columns
for (i in unique(df$ID)) {
  sub <- df[df$ID == i, ]
  if (length(unique(sub$A)) > 1) {
    K <- rbind(K, sub) # A changed within this ID
  } else {
    L <- rbind(L, sub) # A stayed constant
  }
}
I have read in some posts that in R nested loops can be replaced by the apply and outer functions. Can someone help me understand how they can be used in circumstances like this?
So basically you don't need a loop with conditions here. All you need to do is check, per ID, whether there is any variance in A over the cycle of B, and then split accordingly. With base R we can use ave for such a task: take the per-ID variance of A converted to a numeric value (I'm assuming it's a factor in your real data set; if it's not a factor, you can use FUN = function(x) length(unique(x)) within ave instead) and negate it with !, so that groups where A never changes (variance 0) come out TRUE. For example
indx <- !with(df, ave(as.numeric(A), ID, FUN = var))
Or (if A is a character rather than a factor)
indx <- with(df, ave(A, ID, FUN = function(x) length(unique(x)))) == 1L
Then simply run split
split(df, indx)
# $`FALSE`
# ID A B C
# 1 1 X 1 10
# 2 1 X 2 10
# 3 1 Z 3 15
# 4 1 Y 4 12
# 5 2 Y 1 15
# 6 2 X 2 13
# 7 2 X 3 13
# 8 2 Y 4 13
#
# $`TRUE`
# ID A B C
# 9 3 Y 1 16
# 10 3 Y 2 18
# 11 3 Y 3 19
# 12 3 Y 4 10
This will return a list with two data frames.
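For reference, the intermediate indx for this data is
indx
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
i.e. FALSE for IDs 1 and 2 (where A changes) and TRUE for ID 3 (where A is constant), which is what split groups on.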
Similarly with data.table
library(data.table)
setDT(df)[, indx := !var(as.numeric(A)), by = ID]
split(df, df$indx)
Or dplyr
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(indx = !var(as.numeric(A))) %>%
  split(.$indx)
Since you want to understand apply rather than simply getting it done, you can consider tapply. As a demonstration:
> tapply(df$A, df$ID, function(x) ifelse(length(unique(x))>1, "K", "L"))
1 2 3
"K" "K" "L"
In a bit plainer English: go through df$A grouped by df$ID, and apply the function to df$A within each grouping (the x in the anonymous function): if the number of unique values is more than 1, it's "K", otherwise it's "L".
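To turn those labels into the two data frames, one way is to map them back onto the rows (grp is just an illustrative name):
grp <- tapply(df$A, df$ID, function(x) ifelse(length(unique(x))>1, "K", "L"))
K <- df[grp[as.character(df$ID)] == "K", ]
L <- df[grp[as.character(df$ID)] == "L", ]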
We can do this using data.table. We convert the 'data.frame' to 'data.table' (setDT(df1)). Grouped by 'ID', we check whether the number of unique elements in 'A' (uniqueN(A)) is greater than 1, and create a column 'ind' based on that. We can then split the dataset on that 'ind' column.
library(data.table)
setDT(df1)[, ind:= uniqueN(A)>1, by = ID]
setDF(df1)
split(df1[-5], df1$ind)
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
Or similarly using dplyr, we can use n_distinct to create a logical column and then split by that column.
library(dplyr)
df2 <- df1 %>%
group_by(ID) %>%
mutate(ind= n_distinct(A)>1)
split(df2, df2$ind)
Or a base R option with table. We take the table of the first two columns of 'df1', i.e. 'ID' and 'A'. Double negating (!!) the output converts the non-zero counts to TRUE and the zeros to FALSE, so rowSums counts how many distinct 'A' values each 'ID' has, and 'indx' is TRUE where that count is greater than 1. We then match the 'ID' column of 'df1' against the names of 'indx' to expand it to one value per row, and split the dataset with that.
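For this data, the intermediate table of counts looks like this:
table(df1[1:2])
#   A
#ID  X Y Z
#  1 2 1 1
#  2 2 2 0
#  3 0 4 0
so rowSums(!!table(df1[1:2])) gives 3, 2 and 1 distinct 'A' values for IDs 1, 2 and 3.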
indx <- rowSums(!!table(df1[1:2]))>1
lst <- split(df1, indx[match(df1$ID, names(indx))])
lst
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
If we need to get individual datasets on the global environment, change the names of the list elements to the object names we wanted and use list2env (not recommended though)
list2env(setNames(lst, c('L', 'K')), envir=.GlobalEnv)
I have two data frames, A and B.
a1 <- c(12, 12, 12, 23, 23, 23, 34, 34, 34)
a2 <- c(1, 2, 3 , 2, 4 , 5 , 2 , 3 , 4)
A <- as.data.frame(cbind(a1, a2))
b1 <- c(12, 23, 34)
b2 <- c(1, 2, 2)
B <- as.data.frame(cbind(b1, b2))
> A
  a1 a2
1 12  1
2 12  2
3 12  3
4 23  2
5 23  4
6 23  5
7 34  2
8 34  3
9 34  4
> B
  b1 b2
1 12  1
2 23  2
3 34  2
Basically, B contains the rows in A, with the lowest values of a2 for each unique a1.
What I need to do is simple. Find the row indexes (or row numbers?), let's call this index.vector, such that A[index.vector, ] equals B.
For this particular problem there will be only one solution, because for each unique value of a1, no two values of a2 are the same.
The faster the routine, the better; I need to apply this to data frames with anywhere between 500 and millions of rows.
I'd make sure my data is ordered first (in your example the data is ordered correctly, but I guess this might not always be the case), then use match, which returns the index of the first match of its first argument in its second argument (or NA if there is no match). Because A is ordered by a1 and then a2, the first match for each b1 value is the row with the lowest a2, which is exactly the row in B.
A <- A[ order( A$a1 , A$a2 ) , ]
A
# a1 a2
#1 12 1
#2 12 2
#3 12 3
#4 23 2
#5 23 4
#6 23 5
#7 34 2
#8 34 3
#9 34 4
# Get row indices for required values
match( B$b1 , A$a1 )
#[1] 1 4 7
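As a quick check, these indices recover the rows of B:
A[ match( B$b1 , A$a1 ) , ]
#  a1 a2
#1 12  1
#4 23  2
#7 34  2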
And here is a data.table solution which should be far faster for large tables
require(data.table)
A <- data.table( A )
B <- data.table( B )
# Use setkeyv to order the tables by the values in the first column, then the second column
setkeyv( A , c("a1","a2") )
setkeyv( B , c("b1","b2") )
# Create a new column that is the row index of A
A[ , ID:=(1:nrow(A)) ]
# Join A and B on the key columns (this works because you have unique values in your second column for each grouping of the first), giving the relevant ID
A[B]
# a1 a2 ID
#1: 12 1 1
#2: 23 2 4
#3: 34 2 7
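If you want the bare index vector rather than the joined table, it is just the ID column of that result:
index.vector <- A[B]$ID
index.vector
#[1] 1 4 7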