Matching dataframe columns: one int and another is list - r

I am trying to create a column in data frame df1 based on a match in another data frame df2, where df1 is much bigger than df2:
df1$val2 <- df2$val2[match(df1$id, df2$IDs)]
This doesn't quite work because the df2$IDs column is a list:
> df2
IDs val2
1 0 1
2 1, 2 2
3 3, 4 3
4 5, 6 4
5 7, 8 5
6 9, 10 6
7 11, 12, 13, 14 7
It only works where the list element holds a single value (row 1, shown as ..$ : int 0 in the str() output). For all other rows, match(df1$id, df2$IDs) returns NA.
Testing individual numbers with double brackets works just fine:
2 %in% df2[[2,'IDs']]
So I either need to restructure the df2$IDs column or perform the match differently. Both df1 and df2 have many other columns, but df2 has far fewer rows.
The case can be reproduced with the following:
IDs <- c("[0]", "[1, 2]", "[3, 4]", "[5, 6]", "[7, 8]", "[9, 10]", "[11, 12, 13, 14]")
val2 <- c(1,2,3,4,5,6,7)
df2 <- data.frame(IDs, val2)
df2$IDs <- lapply(strsplit(as.character(df2$IDs), ','), function (x) as.integer(gsub("\\s|\\[|\\]", "", x)))
id <- floor(runif(100, min=0, max=15))
df1 <- data.frame(id)
str(df1)
str(df2)
df1$val2 <- df2$val2[match(df1$id, df2$IDs)]

List columns are clumsy to work with. If you convert df2 to a more vanilla format, it works:
DF2 = with(df2, data.frame(ID = unlist(IDs), val2 = rep(val2, lengths(IDs))))
df1$m = DF2$val2[ match(df1$id, DF2$ID) ]
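To make the flattening step concrete, here is a minimal self-contained sketch on a smaller table. The three-row df2, the four-row df1 and the name lookup are made up for illustration; only the unlist()/rep()/lengths() pattern comes from the answer above.

```r
# Small stand-in for the question's data (values are hypothetical)
df2 <- data.frame(val2 = c(1, 2, 3))
df2$IDs <- list(0L, c(1L, 2L), c(3L, 4L))   # list column, one element per row
df1 <- data.frame(id = c(0, 2, 3, 9))

# One row per individual ID; val2 is repeated once per ID in each list element
lookup <- data.frame(
  ID   = unlist(df2$IDs),
  val2 = rep(df2$val2, lengths(df2$IDs))
)

# An ordinary match() now works; ids missing from the lookup (9 here) give NA
df1$val2 <- lookup$val2[match(df1$id, lookup$ID)]
df1
```

The lookup table has as many rows as there are individual IDs, so it stays small as long as df2 is small.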
If you want list columns just for browsing, they are quick to rebuild:
aggregate(ID ~ ., DF2, list)
val2 ID
1 1 0
2 2 1, 2
3 3 3, 4
4 4 5, 6
5 5 7, 8
6 6 9, 10
7 7 11, 12, 13, 14
Note that the match approach does not extend naturally to joining on several columns, so you may eventually want to learn data.table and its "update join" syntax for this case:
library(data.table)
setDT(df1); setDT(df2)
DT2 = df2[, .(ID = unlist(IDs)), by=setdiff(names(df2), "IDs")]
df1[DT2, on=.(id = ID), v := i.val2 ]

Related

Assign value to a specific rows (R)

I have a df of 16k+ items. I want to assign the values A, B and C to those items based on their row numbers.
Example:
I have the following df with 10 unique items
df <- c(1:10)
Now I have three separate vectors (A, B, C) that contain row numbers of the df with values A, B or C.
A <- c(3, 9)
B <- c(2, 6, 8)
C <- c(1, 4, 5, 7, 10)
Now I want to add a new category column to the df and assign values A, B and C based on the row numbers that are in the three vectors that I have. For example, I would like to assign value C to rows 1, 4, 5, 7 and 10 of the df.
I tried to experiment with for loops and if statements to match the value of the vector with the row number of the df but I didn't succeed. Can anybody help out?
Here is a way to assign the new column.
Create the data frame and a list of vectors:
df <- data.frame(n=1:10)
dat <- list( A=c(3, 9), B=c(2, 6, 8), C=c(1, 4, 5, 7, 10) )
Put the data in the desired rows:
df$new[unlist(dat)] <- sub("[0-9].*$","",names(unlist(dat)))
Result:
df
n new
1 1 C
2 2 B
3 3 A
4 4 C
5 5 C
6 6 B
7 7 C
8 8 B
9 9 A
10 10 C
You could iterate over the names of a list and assign those names to the positions indexed by the successive sets of numeric values:
dat <- list(A=A,B=B,C=C)
df$new <- NA_character_ # create the full-length column first, so each partial assignment fits
for(i in names(dat)){ df$new[ dat[[i]] ] <- i}
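Another way to avoid both the loop and the regular expression is stack(), which turns the named list into a two-column table of row numbers and labels. This is a sketch of an alternative, not what either answer above uses; the name long is just for illustration.

```r
df  <- data.frame(n = 1:10)
dat <- list(A = c(3, 9), B = c(2, 6, 8), C = c(1, 4, 5, 7, 10))

# stack() returns a data frame with columns `values` (row numbers)
# and `ind` (the list name each value came from, as a factor)
long <- stack(dat)

df$new <- NA_character_                       # rows not listed anywhere stay NA
df$new[long$values] <- as.character(long$ind)
df
```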

Use a for loop to rename variables in data frames

I have the following data which I have split by name into separate data frames. After I run the following code, the variables in each data set are automatically named "X..i..".
I would like to rename the variable of each separate data frame so it matches the data set.
library(purrr) # provides map(), used below
# load data
df1_raw <- data.frame(name = c("A", "B", "C", "A", "C", "B"),
start = c(1, 3, 4, 5, 2, 1),
end = c(6, 5, 7, 8, 6, 7))
df1 <- split(x = df1_raw, f = df1_raw$name) # split data by name
df1 <- lapply(df1, function(x) Map(seq.int, x$start, x$end)) # generate sequence intervals
df1 <- map(df1, unlist) # unlist sequences
df1 <- lapply(df1, data.frame) # convert to df
# rename variables
name <- c("A", "B", "C")
for (i in seq_along(df1)) {
names(df1[i]) <- name[i]
}
The last for loop does not work to rename variables. When I type names(df1$A) I still get "X..i..". The output I would like from names(df1$A) is "A".
Does anyone have any thoughts on how to rename these variables? Thanks!
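You need to use [[ ]] when extracting an element from a list. df1[i] returns a one-element list, so names(df1[i]) refers to the name of the list element, not to the column names of the data frame inside it: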
for (i in seq_along(df1)) {
names(df1[[i]]) <- name[i]
}
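The difference between the two indexing operators can be checked on a small list. The list lst and its contents below are hypothetical and only serve to illustrate the point.

```r
lst <- list(A = data.frame(x = 1:2), B = data.frame(x = 3:4))

class(lst[1])     # "list": a sub-list that still wraps the data frame
class(lst[[1]])   # "data.frame": the element itself

names(lst[1])     # "A": name of the list element
names(lst[[1]])   # "x": column name of the data frame

# Renaming through [[ ]] changes the column name inside each data frame
for (i in seq_along(lst)) {
  names(lst[[i]]) <- names(lst)[i]
}
names(lst$A)      # "A"
```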
Alternatively you could change how you create the list so you don't have to rename after the fact
df1 <- split(x = df1_raw, f = df1_raw$name) # split data by name
df1 <- lapply(df1, function(x) Map(seq.int, x$start, x$end)) # generate sequence intervals
df1 <- map(df1, unlist) # unlist sequences
df1 <- Map(function(x,name) {as.data.frame(setNames(list(x), name))}, df1, names(df1))
I think the solution by @MrFlick is enough to address the issue of renaming within a for loop.
Here is a base R workaround that may work for you:
lapply(
split(df1_raw, df1_raw$name),
function(x) {
with(
x,
setNames(
data.frame(unlist(mapply(seq, start, end))),
unique(name)
)
)
}
)
which gives
$A
A
1 1
2 2
3 3
4 4
5 5
6 6
7 5
8 6
9 7
10 8
$B
B
1 3
2 4
3 5
4 1
5 2
6 3
7 4
8 5
9 6
10 7
$C
C
1 4
2 5
3 6
4 7
5 2
6 3
7 4
8 5
9 6

Conditional Statement in R (indicator) based off matching values to another dataset

I have two datasets
dataset1 with column fruit, customer_num
dataset2 with column fruit2, customer_num
Say I do a left join of dataset1 to dataset2, using customer_num as the join key. Now I have a dataset with fruit and fruit2 as columns.
How can I create an indicator that is 1 if fruit == fruit2 and 0 otherwise?
You could do it like this (my example):
# Example customer numbers, assumed here to be numeric
fruit <- data.frame(customer_num = c(1, 2, 3, 4, 5, 6))
fruit2 <- data.frame(customer_num = c(1, 2, 3, 10, 11, 12))
# Combine into one data frame; the duplicated name becomes customer_num.1
df <- data.frame(fruit, fruit2)
# Indicator: 1 where the two columns match, 0 otherwise
dat <- within(df, match <- ifelse(customer_num == customer_num.1, 1, 0))
# Output
customer_num customer_num.1 match
1 1 1 1
2 2 2 1
3 3 3 1
4 4 10 0
5 5 11 0
6 6 12 0
ifelse would be easiest, assuming both columns are in the same data frame (here dataset1 stands for the joined data). Example using the dplyr package:
library(dplyr)
dataset1 %>%
mutate(Match = ifelse(fruit == fruit2, 1, 0))
This creates a column called Match that is 1 where the values match and 0 where they do not.
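For completeness, here is a base R sketch of the whole flow, join included. The fruit values are invented for illustration. One thing to watch after a left join: customers missing from dataset2 get NA in fruit2, and a plain fruit == fruit2 comparison would then return NA rather than 0, so the sketch treats a missing fruit2 as "no match" (that choice is an assumption about what you want).

```r
dataset1 <- data.frame(customer_num = c(1, 2, 3),
                       fruit = c("apple", "pear", "kiwi"),
                       stringsAsFactors = FALSE)
dataset2 <- data.frame(customer_num = c(1, 2),
                       fruit2 = c("apple", "plum"),
                       stringsAsFactors = FALSE)

# Left join: keep every row of dataset1, fill fruit2 with NA when unmatched
joined <- merge(dataset1, dataset2, by = "customer_num", all.x = TRUE)

# 1 when both fruits are present and equal, 0 otherwise (including NA)
joined$match <- as.integer(!is.na(joined$fruit2) & joined$fruit == joined$fruit2)
joined
```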

Conditional statement in R dataframe

I have dataframe df as below.
dput(df)
structure(list(X = c(1, 2, 5, 7, 8), Y = c(3, 5, 8, 7, 2), Z = c(2,
8, 7, 4, 3), R = c(6, 6, 6, 6, 66)), .Names = c("X", "Y", "Z",
"R"), row.names = c(NA, -5L), class = "data.frame")
df
class(df)
I have to modify df under two conditions.
First case:
For each row, find the minimum of X, Y and Z, and replace that minimum with the corresponding value of R.
Second case:
For each row, find the minimum of X, Y, Z and R, replace it with the row maximum of X, Y, Z and R, and store the result in a new data frame.
How can I do that?
I tried ifelse and if/else but could not get what I want.
Any help would be appreciated.
For the first case, create a new dataset "df1" from the first three columns of "df". Multiply "df1" by "-1" so that the row minimum becomes the row maximum (the answer assumes there are no negative values). In this example each row has a unique minimum, so you can use max.col with ties.method='first' to get the column index of the maximum (that is, the original minimum) per row. cbind that index with 1:nrow(df) to build a "row/column" index matrix, use it to select the elements of "df1" (df1[cbind(...)]), and replace them with the "R" column values (<- df$R). Then write the new values back to the original columns ("df[1:3]"). If a row can have more than one "minimum" value, use the "loop" method described for the second case instead.
df1 <- df[1:3]
df1[cbind(1:nrow(df),max.col(-1*df1, 'first'))] <- df$R
df[1:3] <- df1
df
# X Y Z R
#1 6 3 2 6
#2 6 5 8 6
#3 6 8 7 6
#4 7 7 6 6
#5 8 66 3 66
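The same first-case replacement can also be sketched with which.min, which avoids the negation trick. This is an alternative to the max.col approach above rather than the answer's own method; min_col, idx and m are illustrative names, and like max.col(..., 'first') it only replaces the first minimum in a row.

```r
df <- data.frame(X = c(1, 2, 5, 7, 8), Y = c(3, 5, 8, 7, 2),
                 Z = c(2, 8, 7, 4, 3), R = c(6, 6, 6, 6, 66))

# Column position (1 = X, 2 = Y, 3 = Z) of the first minimum in each row
min_col <- apply(df[1:3], 1, which.min)

# Two-column (row, column) index matrix, one entry per row of df
idx <- cbind(seq_len(nrow(df)), min_col)

m <- as.matrix(df[1:3])
m[idx] <- df$R        # overwrite each row's minimum with that row's R
df[1:3] <- m
df
```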
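For the second case, create a copy of "df" (df2), get the maximum value per row using pmax, loop over the rows of "df2" (sapply(seq_len...)), change every "minimum" value in each row to the corresponding "max" value ("MaxV"), transpose (t) and assign the result back to "df2" (df2[]). Note that the output shown below was produced from the "df" already modified in the first case, not from the original data.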
df2 <- df
#only use this if there is only a single "minimum" value per row
# and no negative values in the data
#df2[cbind(1:nrow(df), max.col(-1*df2, 'first'))] <-
# do.call(pmax, df2)
MaxV <- do.call(pmax, df2)
df2 [] <- t(sapply(seq_len(nrow(df2)), function(i) {
x <- unlist(df2[i,])
ifelse(x==min(x), MaxV[i], x)}))
df2
# X Y Z R
#1 6 3 6 6
#2 6 8 8 6
#3 8 8 7 8
#4 7 7 7 7
#5 8 66 66 66
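If the row loop is too slow on a large table, the second case can also be written with matrix operations. The sketch below is an alternative, not the answer's code, and it starts from the original "df" of the question (so its result differs from the output above, which started from the already modified data). Names such as row_min and is_min are illustrative.

```r
df <- data.frame(X = c(1, 2, 5, 7, 8), Y = c(3, 5, 8, 7, 2),
                 Z = c(2, 8, 7, 4, 3), R = c(6, 6, 6, 6, 66))

m <- as.matrix(df)
row_min <- apply(m, 1, min)
row_max <- apply(m, 1, max)

# Comparing a matrix with a vector of length nrow(m) recycles down the
# columns, so element [i, j] is compared with row_min[i]
is_min <- m == row_min

# Replace every minimum (ties included) with the maximum of its own row
m[is_min] <- row_max[row(m)[is_min]]
df2 <- as.data.frame(m)
df2
```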

comparing two files and outputting common elements

I have 2 files, each with 3 columns and hundreds of rows. I want to compare the first two columns of the two files and list the rows they have in common. To that list I then have to add the third column of the second file, that is, the value that the second file holds for each pair of numbers common to both files.
For example, consider two files of 6 rows and 3 columns
First file -
1 2 3
2 3 4
4 6 7
3 8 9
11 10 5
19 6 14
second file -
1 4 1
2 1 4
4 6 10
3 7 2
11 10 3
19 6 5
As I said, I have to compare the first two columns and then add the third column of the second file to that list. Therefore, the output must be:
4 6 10
11 10 3
19 6 5
I have the following code, but it gives an "object not found" error, and I am also not able to add the third column. Please help :)
Here df2 is the first file and df3 is the second file, as read into R.
s1 = 1
for(i in 1:nrow(df2)){
for(j in 1:nrow(df3)){
if(df2[i,1] == df3[j,1]){
if(df2[i,2] == df3[j,2]){
common.rows1[s1,1] <- df2[i,1]
common.rows1[s1,2] <- df2[i,2]
s1 = s1 + 1
}
}
}
}
You can use the %in% operator twice to subset your second data.frame (I call it df2):
df2[df2$V1 %in% df1$V1 & df2$V2 %in% df1$V2,]
# V1 V2 V3
#3 4 6 10
#5 11 10 3
#6 19 6 5
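V1 and V2 in my example are the column names of df1 and df2. Note that this tests each column on its own: a row is kept if its V1 occurs anywhere in df1$V1 and its V2 occurs anywhere in df1$V2, even when the two values never occur together in the same row of df1. It gives the right result for the example data, but not in general.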
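Whether two separate %in% tests are enough depends on the data. The small tables below are made up to show a case where they are not: every V1 and every V2 of d2 occurs somewhere in d1, but never as the same pair. Matching on both columns jointly, sketched here with merge and an explicit by argument, returns no rows in that case.

```r
d1 <- data.frame(V1 = c(1, 2), V2 = c(10, 20))
d2 <- data.frame(V1 = c(1, 2), V2 = c(20, 10), V3 = c(7, 8))

# Column-by-column test: keeps both rows of d2, although neither
# (V1, V2) pair actually appears in d1
loose <- d2[d2$V1 %in% d1$V1 & d2$V2 %in% d1$V2, ]
nrow(loose)    # 2

# Joint test on the (V1, V2) pair: no common rows
strict <- merge(d1, d2, by = c("V1", "V2"))
nrow(strict)   # 0
```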
It seems that this is the perfect use-case for merge, e.g.
merge(d1[c('V1','V2')],d2)
results in:
V1 V2 V3
1 11 10 3
2 19 6 5
3 4 6 10
In which 'V1' and 'V2' are the column names of interest.
A data.table approach:
library(data.table)
setDT(df1)
setDT(df2)
setkey(df1, V1, V2)
setkey(df2, V1, V2)
df2[df1[, -3, with = FALSE], nomatch = 0]
## V1 V2 V3
## 1: 4 6 10
## 2: 11 10 3
## 3: 19 6 5
If your two tables are d1 and d2,
d1<-data.frame(
V1 = c(1, 2, 4, 3, 11, 19),
V2 = c(2, 3, 6, 8, 10, 6),
V3 = c(3, 4, 7, 9, 5, 14)
)
d2<-data.frame(
V1 = c(1, 2, 4, 3, 11, 19),
V2 = c(4, 1, 6, 7, 10, 6),
V3 = c(1, 4, 10, 2, 3, 5)
)
then you can subset d2 (in order to keep the third column) with
d2[interaction(d2$V1, d2$V2) %in% interaction(d1$V1, d1$V2),]
The interaction() treats the first two columns as a combined key.

Resources