Changing values between two data.frames in R

I have the following data.frames (sample):
>df1
number ACTION
1 1 this
2 2 that
3 3 theOther
4 4 another
>df2
id VALUE
1 1 3
2 2 4
3 3 2
4 4 1
4 5 4
4 6 2
4 7 3
. . .
. . .
I would like df2 to become like the following:
>df2
id VALUE
1 1 theOther
2 2 another
3 3 that
4 4 this
4 5 another
4 6 that
4 7 theOther
. . .
. . .
It can be done manually by using the following for each value:
df2[df2==1] <- 'this'
df2[df2==2] <- 'that'
.
.
and so on, but is there a way to do it without handling each value manually?

Try
df2$VALUE <- setNames(df1$ACTION, df1$number)[as.character(df2$VALUE)]
df2
# id VALUE
#1 1 theOther
#2 2 another
#3 3 that
#4 4 this
#5 5 another
#6 6 that
#7 7 theOther
Or use match
df2$VALUE <- df1$ACTION[match(df2$VALUE, df1$number)]
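One thing worth checking with the match approach is what happens when df2 holds a value that does not occur in df1$number: match returns NA for it, so the label becomes NA as well. A small sketch, assuming you would rather keep the original value in that case (the extra value 9 and the LABEL column are made up for illustration):

```r
df1 <- data.frame(number = 1:4,
                  ACTION = c("this", "that", "theOther", "another"),
                  stringsAsFactors = FALSE)
# hypothetical df2 with one value (9) that has no entry in df1
df2 <- data.frame(id = 1:5, VALUE = c(3L, 4L, 2L, 1L, 9L))

# position of each VALUE in the lookup column; NA where there is no match
idx <- match(df2$VALUE, df1$number)

# use the label where one exists, otherwise fall back to the original value
df2$LABEL <- ifelse(is.na(idx), as.character(df2$VALUE), df1$ACTION[idx])
df2
```

Writing to a separate column first makes it easy to inspect the unmatched rows before overwriting VALUE.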
data
df1 <- structure(list(number = 1:4, ACTION = c("this", "that",
"theOther",
"another")), .Names = c("number", "ACTION"), class = "data.frame",
row.names = c("1", "2", "3", "4"))
df2 <- structure(list(id = 1:7, VALUE = c(3L, 4L, 2L, 1L, 4L, 2L, 3L
)), .Names = c("id", "VALUE"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7"))

You could do:
library(qdapTools)
df2$VALUE <- lookup(terms = df2$VALUE, key.match = df1)
Note that for this to work, the columns of df1 must be in the proper order (match key first, then the reassignment column). From ?lookup:
key.match
Takes one of the following: (1) a two column data.frame of a match key
and reassignment column, (2) a named list of vectors (Note: if
data.frame or named list supplied no key reassign needed) or (3) a
single vector match key.
Which gives:
# id VALUE
#1 1 theOther
#2 2 another
#3 3 that
#4 4 this
#5 5 another
#6 6 that
#7 7 theOther


How to remove rows if values from a specified column in data set 1 does not match the values of the same column from data set 2 using dplyr

I have 2 data sets, both include ID columns with the same IDs. I have already removed rows from the first data set. For the second data set, I would like to remove any rows associated with IDs that do not match the first data set by using dplyr.
In other words, any ID in DF2 must also be in DF1; if it is not, that row must be removed from DF2.
For example:
DF1
ID X Y Z
1 1 1 1
2 2 2 2
3 3 3 3
5 5 5 5
6 6 6 6
DF2
ID A B C
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
DF2 once rows have been removed
ID A B C
1 1 1 1
2 2 2 2
3 3 3 3
5 5 5 5
6 6 6 6
I used anti_join() which shows me the difference in rows but I cannot figure out how to remove any rows associated with IDs that do not match the first data set by using dplyr.
Try with paste, which compares whole rows:
i1 <- do.call(paste, DF2) %in% do.call(paste, DF1)
# or, if only the 'ID' columns need to be compared
i1 <- DF2$ID %in% DF1$ID
DF3 <- DF2[i1,]
DF3
ID A B C
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 5 5 5 5
5 6 6 6 6
DF4 <- DF2[!i1,]
DF4
ID A B C
4 4 4 4 4
7 7 7 7 7
data
DF1 <- structure(list(ID = c(1L, 2L, 3L, 5L, 6L), X = c(1L, 2L, 3L,
5L, 6L), Y = c(1L, 2L, 3L, 5L, 6L), Z = c(1L, 2L, 3L, 5L, 6L)), class = "data.frame", row.names = c(NA,
-5L))
DF2 <- structure(list(ID = 1:7, A = 1:7, B = 1:7, C = 1:7), class = "data.frame", row.names = c(NA,
-7L))
# Load package
library(dplyr)
# Load dataframes
df1 <- data.frame(
ID = 1:6,
X = 1:6,
Y = 1:6,
Z = 1:6
)
df2 <- data.frame(
ID = 1:7,
X = 1:7,
Y = 1:7,
Z = 1:7
)
# Include all rows in df1
df1 %>%
left_join(df2)
Joining, by = c("ID", "X", "Y", "Z")
ID X Y Z
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
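Since the question asks for dplyr specifically, a filtering join may be closer to what is wanted than left_join. A minimal sketch, assuming the DF1 and DF2 from the question: semi_join keeps the rows of DF2 whose ID appears in DF1 without adding any of DF1's columns, and anti_join returns the complement (the names kept and dropped are only illustrative).

```r
library(dplyr)

DF1 <- data.frame(ID = c(1L, 2L, 3L, 5L, 6L),
                  X  = c(1L, 2L, 3L, 5L, 6L),
                  Y  = c(1L, 2L, 3L, 5L, 6L),
                  Z  = c(1L, 2L, 3L, 5L, 6L))
DF2 <- data.frame(ID = 1:7, A = 1:7, B = 1:7, C = 1:7)

# rows of DF2 whose ID is present in DF1
kept <- semi_join(DF2, DF1, by = "ID")

# rows of DF2 whose ID is absent from DF1
dropped <- anti_join(DF2, DF1, by = "ID")

kept
dropped
```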

Create a dummy variable in a dataframe based on spreadsheet matrix in R

I have a dataframe that looks like the following:
ID V1 V2 V3 V4 V5
1 a 6 3 5 3
2 c 4 1 2 1
3 g 8 2 4 2
4 h 7 9 8 1
5 a 4 6 2 1
6 b 4 2 1 2
7 j 8 7 1 4
I need to create a new dummy variable and add it to this dataframe as column "V6". I need to do it based on a matrix from an external spreadsheet such as the following:
V1 1 2 3 4 5 6 7 8 9
a 1 1 1 1 1
b 1 1 1 1 1 1 1
c 1
d 1 1 1 1 1 1 1 1 1
g 1 1
h 1 1
i 1 1 1
j
k 1 1 1 1 1
In the above matrix, the V1 column is the value of the V1 variable in the original dataframe, and the other columns correspond with possible values of the V5 variable. All the empty spaces are blank in the spreadsheet. I need the new dummy variable, V6 to represent 1 if the unit is a 1 on the matrix based on the intersection of values. The result would therefore be the following:
ID V1 V2 V3 V4 V5 V6
1 a 6 3 5 3 0
2 c 4 1 2 1 0
3 g 8 2 4 2 1
4 h 7 9 8 1 0
5 a 4 6 2 1 1
6 b 4 2 1 2 0
7 j 8 7 1 4 0
ID 1 is a 0 for the V6 variable because, in the matrix, a and the value 3 intersect at a blank (or 0). The dummy variable for row 1 is therefore 0, since its V1 is a and its V5 is 3. Conversely, the third row generates a 1, because its V1 is g and its V5 value is 2. That intersection on the matrix, g-2, is a 1, so V6 for that combination is a "hit", or a 1 in the dummy variable.
I recognize this is an odd method of dummy variable creation, but how can one use an externally created spreadsheet like this to create dummy variables based on the intersection of values most efficiently? What would be a flexible way to code this, so that it could be adapted depending on if the variables are character or numeric?
I think it's best to approach this by pivoting/reshaping df2 (the 1s and blanks), and joining it on df1 (original data).
Note: it isn't abundantly clear if your df2 has empty strings or NA values. If the latter, then replace the nzchar(V6) with !is.na(V6) or !V6 %in% c(NA, "") (for both possibilities).
base R (with reshape2 for melt)
out <- reshape2::melt(df2, "V1", variable.name = "V5", value.name = "V6") |>
subset(nzchar(V6)) |>
merge(df1, by = c("V1", "V5"), all.y = TRUE) |>
transform(V6 = +(!is.na(V6)))
out
# V1 V5 V6 ID V2 V3 V4
# 1 a 1 1 5 4 6 2
# 2 a 3 0 1 6 3 5
# 3 b 2 0 6 4 2 1
# 4 c 1 0 2 4 1 2
# 5 g 2 1 3 8 2 4
# 6 h 1 0 4 7 9 8
# 7 j 4 0 7 8 7 1
The rows and columns are out of order, but we can restore them fairly easily:
out <- out[order(out$ID), c("ID", sort(setdiff(names(out), "ID")))]
out
# ID V1 V2 V3 V4 V5 V6
# 2 1 a 6 3 5 3 0
# 4 2 c 4 1 2 1 0
# 5 3 g 8 2 4 2 1
# 6 4 h 7 9 8 1 0
# 1 5 a 4 6 2 1 1
# 3 6 b 4 2 1 2 0
# 7 7 j 8 7 1 4 0
dplyr/tidyr
library(dplyr)
library(tidyr) # pivot_longer
df2 %>%
pivot_longer(-V1, names_to = "V5", values_to = "V6") %>%
filter(nzchar(V6)) %>%
# dplyr requires the join columns to be the same class, but the
# column names from `df2` are still character, as all column names are
mutate(V5 = as.integer(V5)) %>%
left_join(df1, ., by = c("V1", "V5")) %>%
mutate(V6 = +(!is.na(V6)))
# ID V1 V2 V3 V4 V5 V6
# 1 1 a 6 3 5 3 0
# 2 2 c 4 1 2 1 0
# 3 3 g 8 2 4 2 1
# 4 4 h 7 9 8 1 0
# 5 5 a 4 6 2 1 1
# 6 6 b 4 2 1 2 0
# 7 7 j 8 7 1 4 0
Data
df1 <- structure(list(ID = 1:7, V1 = c("a", "c", "g", "h", "a", "b", "j"), V2 = c(6L, 4L, 8L, 7L, 4L, 4L, 8L), V3 = c(3L, 1L, 2L, 9L, 6L, 2L, 7L), V4 = c(5L, 2L, 4L, 8L, 2L, 1L, 1L), V5 = c(3L, 1L, 2L, 1L, 1L, 2L, 4L)), class = "data.frame", row.names = c(NA, -7L))
df2 <- structure(list(V1 = c("a", "b", "c", "d", "g", "h", "i", "j", "k"), "1" = c("1", "1", "", "1", "1", "", "", "", "1"), "2" = c("1", "", "", "1", "1", "1", "", "", "1"), "3" = c("", "", "", "1", "", "", "", "", "1"), "4" = c("1", "1", "", "1", "", "", "", "", "1"), "5" = c("1", "1", "", "1", "", "", "1", "", "1"), "6" = c("1", "1", "", "1", "", "", "1", "", ""), "7" = c("", "1", "", "1", "", "", "", "", ""), "8" = c("", "1", "", "1", "", "", "", "", ""), "9" = c("", "1", "1", "1", "", "1", "1", "", "")), row.names = c(NA, -9L), class = "data.frame")
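As an alternative to reshaping and joining, the spreadsheet can also be treated as a lookup matrix and indexed directly by (row, column) pairs. This is a sketch rather than the method above; it assumes the blanks are empty strings, and for brevity it uses a trimmed copy of the data (only the columns of df1 that matter and only the "1" to "4" columns of df2).

```r
df1 <- data.frame(ID = 1:7,
                  V1 = c("a", "c", "g", "h", "a", "b", "j"),
                  V5 = c(3L, 1L, 2L, 1L, 1L, 2L, 4L),
                  stringsAsFactors = FALSE)
df2 <- data.frame(
  V1  = c("a", "b", "c", "d", "g", "h", "i", "j", "k"),
  "1" = c("1", "1", "", "1", "1", "", "", "", "1"),
  "2" = c("1", "", "", "1", "1", "1", "", "", "1"),
  "3" = c("", "", "", "1", "", "", "", "", "1"),
  "4" = c("1", "1", "", "1", "", "", "", "", "1"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# character matrix with V1 values as row names and V5 values as column names
m <- as.matrix(df2[-1])
rownames(m) <- df2$V1

# one (row, column) index pair per row of df1; NA if a key is not in the matrix
idx <- cbind(match(df1$V1, rownames(m)),
             match(as.character(df1$V5), colnames(m)))

# a hit is a cell that holds "1"; blanks and missing keys become 0
df1$V6 <- as.integer(m[idx] %in% "1")
df1
```

Converting V5 with as.character before matching is what lets the same code work whether the variables are character or numeric.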

R function for collapsing multiple ranges of different columns from wide to long format?

I've a dataset with multiple different ranges of columns in each row (each row corresponds to one individual), as below. Each instance of the different column types have 3 levels (0,1 and 2).
id col1_0 col1_1 col1_2 col2_0 col2_1 col2_2 col3_0 col3_1 col3_2
1 0 1 3 2 2 3 3 4 5
2 1 1 2 2 4 7 4 5 5
.
.
etc.
What I would need is to collapse all col1 into one column, all col2 into another and all col3's into another, for each id. As below.
id x col1 col2 col3
1 0 0 2 3
1 1 1 2 4
1 2 3 3 5
2 0 1 2 4
2 1 1 4 5
2 2 2 7 5
.
.
etc.
In addition, I would also need to create an x-column with values 0,1 and 2, for each id. However, I only manage to collapse the first range of columns (col1) with the code below.
library(tidyverse)
longer_data <- dataframe %>%
group_by(id) %>%
pivot_longer(col1_0:col1_2, names_to = "x1", values_to = "col1")
x1 here creates a column with the original column names, so I would need an additional x column that keeps only the trailing number of each original column name.
Is there a way to achieve this? Many thanks in advance!
We don't need any group_by. This can be done directly with pivot_longer by specifying names_sep and the special .value entry in names_to. Note the order of .value and x: the part of each column name before the _ (col1, col2, col3) becomes an output column holding the values, and the suffix after the _ goes into the new column 'x'.
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = -id, names_to = c('.value', 'x'), names_sep = "_")
Output
# A tibble: 6 x 5
# id x col1 col2 col3
# <int> <chr> <int> <int> <int>
#1 1 0 0 2 3
#2 1 1 1 2 4
#3 1 2 3 3 5
#4 2 0 1 2 4
#5 2 1 1 4 5
#6 2 2 2 7 5
data
df1 <- structure(list(id = 1:2, col1_0 = 0:1, col1_1 = c(1L, 1L), col1_2 = 3:2,
col2_0 = c(2L, 2L), col2_1 = c(2L, 4L), col2_2 = c(3L, 7L
), col3_0 = 3:4, col3_1 = 4:5, col3_2 = c(5L, 5L)),
class = "data.frame", row.names = c(NA,
-2L))
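In the output above x comes back as character, because it is taken from the column names. If a numeric x is wanted, one option is the names_transform argument of pivot_longer. A short sketch on the same data (the name long is only illustrative):

```r
library(tidyr)

df1 <- data.frame(id = 1:2,
                  col1_0 = 0:1, col1_1 = c(1L, 1L), col1_2 = 3:2,
                  col2_0 = c(2L, 2L), col2_1 = c(2L, 4L), col2_2 = c(3L, 7L),
                  col3_0 = 3:4, col3_1 = 4:5, col3_2 = c(5L, 5L))

long <- pivot_longer(
  df1,
  cols = -id,
  names_to = c(".value", "x"),            # prefix -> value columns, suffix -> x
  names_sep = "_",
  names_transform = list(x = as.integer)  # convert the suffix to integer
)
long
```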
Here is a base R option using reshape, where timevar="x" creates a column named x, and sep="_" helps to fetch the last numbers of the original column names.
res <- reshape(
df,
direction = "long",
idvar = "id",
varying = -1,
timevar = "x",
sep = "_"
)
res <- res[order(res$id), ]
Output
> res
id x col1 col2 col3
1.0 1 0 0 2 3
1.1 1 1 1 2 4
1.2 1 2 3 3 5
2.0 2 0 1 2 4
2.1 2 1 1 4 5
2.2 2 2 2 7 5
Data
> dput(df)
structure(list(id = 1:2, col1_0 = 0:1, col1_1 = c(1L, 1L), col1_2 = 3:2,
col2_0 = c(2L, 2L), col2_1 = c(2L, 4L), col2_2 = c(3L, 7L
), col3_0 = 3:4, col3_1 = 4:5, col3_2 = c(5L, 5L)), class = "data.frame", row.names = c(NA,
-2L))

Fill a column's blank spaces contingent on a second column in R

I'd appreciate some help with this one. I have something similar to the data below.
df$A df$B
1 .
1 .
1 .
1 6
2 .
2 .
2 7
What I need to do is fill in df$B with each value that corresponds to the end of the run of values in df$A. Example below.
df$A df$B
1 6
1 6
1 6
1 6
2 7
2 7
2 7
Any help would be welcome.
It seems that the missing values are denoted by ".". It is better to read the dataset with na.strings="." so that the missing values become NA. For the current dataset, the 'B' column would be of character or factor class, depending on whether stringsAsFactors was FALSE or TRUE (the default before R 4.0.0) in read.table/read.csv.
Using data.table, we convert the data.frame to data.table (setDT(df1)), change the 'character' class to 'numeric' (B:= as.numeric(B)). This will also result in coercing the . to NA (a warning will appear). Grouped by "A", we change the "B" values to the last element (B:= B[.N])
library(data.table)
setDT(df1)[,B:= as.numeric(B)][,B:=B[.N] , by = A]
# A B
#1: 1 6
#2: 1 6
#3: 1 6
#4: 1 6
#5: 2 7
#6: 2 7
#7: 2 7
Or with dplyr
library(dplyr)
df1 %>%
group_by(A) %>%
mutate(B= as.numeric(tail(B,1)))
Or using ave from base R
df1$B <- with(df1, as.numeric(ave(B, A, FUN=function(x) tail(x,1))))
data
df1 <- structure(list(A = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), B = c(".",
".", ".", "6", ".", ".", "7")), .Names = c("A", "B"),
class = "data.frame", row.names = c(NA, -7L))
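To illustrate the na.strings suggestion, here is a small sketch that reads the sample as text so that "." becomes NA straight away, then fills each group with its last value using ave. The inline txt string stands in for the real file, which is an assumption for the sake of a self-contained example.

```r
# stand-in for the real file; "." marks a missing value
txt <- "A B
1 .
1 .
1 .
1 6
2 .
2 .
2 7"

# na.strings = "." turns the dots into NA, so B is read as a number
df1 <- read.table(text = txt, header = TRUE, na.strings = ".")

# within each run of A, replace every B with the last element of the group
df1$B <- ave(df1$B, df1$A, FUN = function(x) x[length(x)])
df1
```

With this approach no as.numeric coercion (and no coercion warning) is needed afterwards.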

reshape data frame in R

I have a data frame that I need to reshape, transforming repeated values in a single column into a single row with several data columns. I know this should be simple but I can't figure out how to do this, and which of the many reshape/cast functions available I need to use.
Part of my data looks like this:
Source ID info
1 In 842701 1
2 Out 842701 1
3 In 21846591 2
4 Out 21846591 2
5 In 22181760 3
6 In 39338740 4
7 Out 9428 5
I want to make it look like this:
ID In Out info
1 842701 1 1 1
2 21846591 1 1 2
3 22181760 1 0 3
4 39338740 1 0 4
5 9428 0 1 5
and so on, while preserving all the remaining columns (which are identical for a given entry).
I would really appreciate some help. TIA.
Here is a way using reshape2
library(reshape2)
res <- dcast(transform(df, indx=1, ID=factor(ID, levels=unique(ID))),
ID~Source, value.var="indx", fill=0)
res
# ID In Out
#1 842701 1 1
#2 21846591 1 1
#3 22181760 1 0
#4 39338740 1 0
#5 9428 0 1
Or
res1 <- as.data.frame.matrix(table(transform(df,
ID=factor(ID, levels=unique(ID)))[,2:1]))
Update
dcast(transform(df1, indx=1, ID=factor(ID, levels=unique(ID))),
...~Source, value.var="indx", fill=0)
# ID info In Out
#1 842701 1 1 1
#2 21846591 2 1 1
#3 22181760 3 1 0
#4 39338740 4 1 0
#5 9428 5 0 1
You could also use reshape from base R
res2 <- reshape(transform(df1, indx=1), idvar=c("ID", "info"),
timevar="Source", direction="wide")
res2[,3:4][is.na(res2)[,3:4]] <- 0
res2
# ID info indx.In indx.Out
#1 842701 1 1 1
#3 21846591 2 1 1
#5 22181760 3 1 0
#6 39338740 4 1 0
#7 9428 5 0 1
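For readers using tidyr rather than reshape2, roughly the same result can be sketched with pivot_wider. This is an alternative under the assumption that every column other than Source identifies an entry; the helper column n and the name wide are made up for illustration.

```r
library(tidyr)

df1 <- data.frame(Source = c("In", "Out", "In", "Out", "In", "In", "Out"),
                  ID = c(842701L, 842701L, 21846591L, 21846591L,
                         22181760L, 39338740L, 9428L),
                  info = c(1L, 1L, 2L, 2L, 3L, 4L, 5L),
                  stringsAsFactors = FALSE)

# indicator column: each existing row counts as one occurrence
df1$n <- 1L

# one column per Source value; combinations that never occur are filled with 0
wide <- pivot_wider(df1, names_from = Source, values_from = n, values_fill = 0L)
wide
```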
data
df <- structure(list(Source = c("In", "Out", "In", "Out", "In", "In",
"Out"), ID = c(842701L, 842701L, 21846591L, 21846591L, 22181760L,
39338740L, 9428L)), .Names = c("Source", "ID"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7"))
df1 <- structure(list(Source = c("In", "Out", "In", "Out", "In", "In",
"Out"), ID = c(842701L, 842701L, 21846591L, 21846591L, 22181760L,
39338740L, 9428L), info = c(1L, 1L, 2L, 2L, 3L, 4L, 5L)), .Names = c("Source",
"ID", "info"), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7"))
