Reference column name from another table to insert value from shared ID - r

I am attempting to reference a column in another dataframe (df2) based on a column of column names in my primary dataframe (df1$reference) to return the exact value where the ID and column name match in a new column (df1$new_col). Here is a simplified example of my data and the desired outcome:
I have tried rbind but ran into errors due to differences in the number of rows/columns. I also tried bind_rows/bind_cols, but am having a hard time joining only the referenced data (I would like to avoid a large join because my real data has many more columns). I feel like indexing would be able to accomplish this but I am not as familiar with indexing outside of simple tasks. I'm open to any and all suggestions / approaches!

We may use row/column indexing
DF1$new_col <- DF2[-1][cbind(seq_len(nrow(DF1)),
match(DF1$reference, names(DF2)[-1]))]
-output
> DF1
ID value reference new_col
1 1 4 colD no
2 2 5 colD no
3 3 6 colE no
data
DF1 <- structure(list(ID = 1:3, value = 4:6, reference = c("colD", "colD",
"colE")), class = "data.frame", row.names = c(NA, -3L))
DF2 <- structure(list(ID = 1:3, colD = c("no", "no", "yes"), colE = c("yes",
"no", "no"), colF = c("no", "yes", "no")),
class = "data.frame", row.names = c(NA,
-3L))
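The indexing above relies on DF1 and DF2 having their rows in the same order. Here is a minimal sketch of the same matrix-indexing idea that also matches on ID, in case the rows are not aligned. The shuffled copy DF2s and the names row_idx and col_idx are made up for illustration:

```r
# Same values as the data above, but DF2 is deliberately shuffled (hypothetical DF2s)
DF1 <- data.frame(ID = 1:3, value = 4:6,
                  reference = c("colD", "colD", "colE"))
DF2s <- data.frame(ID = c(3L, 1L, 2L),
                   colD = c("yes", "no", "no"),
                   colE = c("no", "yes", "no"),
                   colF = c("no", "no", "yes"))

# Row position of each DF1 ID in DF2s, and column position of each reference
row_idx <- match(DF1$ID, DF2s$ID)
col_idx <- match(DF1$reference, names(DF2s))

# A two-column matrix index picks one cell per (row, column) pair
DF1$new_col <- as.matrix(DF2s)[cbind(row_idx, col_idx)]
DF1
```

Because both lookups go through match, an ID or column name that is missing from DF2s should simply give NA in new_col.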

Maybe something like this?
library(dplyr)
left_join(DF1, DF2, by="ID") %>%
mutate(New_col = case_when(reference=="colD" ~ colD,
reference=="colE" ~ colE,
reference=="colF" ~ colF)) %>%
select(ID, value, reference, New_col)
ID value reference New_col
1 1 4 colD no
2 2 5 colD no
3 3 6 colE no

Related

R: adding matching vector values from two dataframes in one column

I have a data frame which is configured roughly like this:
df <- cbind(c('hello', 'yes', 'example'),c(7,8,5),c(0,0,0))
words    frequency  count
hello    7          0
yes      8          0
example  5          0
What I'm trying to do is add values to the third column from a different data frame, which is similar but looks like this:
df2 <- cbind(c('example','hello') ,c(5,6))
words    frequency
example  5
hello    6
My goal is to find matching values for the first column in both data frames (they have the same column name) and add matching values from the second data frame to the third column of the first data frame.
The result should look like this:
df <- cbind(c('hello', 'yes', 'example'),c(7,8,5),c(6,0,5))
words    frequency  count
hello    7          6
yes      8          0
example  5          5
What I've tried so far is:
df <- merge(df,df2, by = "words", all.x=TRUE)
However, it doesn't work.
I could use some help understanding how could it be done. Any help will be welcome.
This is an "update join". My favorite way to do it is in dplyr:
library(dplyr)
df %>% rows_update(rename(df2, count = frequency), by = "words")
In base R you could do the same thing like this:
names(df2)[2] = "count2"
df = merge(df, df2, by = "words", all.x=TRUE)
df$count = ifelse(is.na(df$count2), df$count, df$count2)
df$count2 = NULL
Here is an option with data.table:
library(data.table)
setDT(df)[setDT(df2), on = "words", count := i.frequency]
Output
words frequency count
<char> <num> <num>
1: hello 7 6
2: yes 8 0
3: example 5 5
Or using match in base R:
df$count[match(df2$words, df$words)] <- df2$frequency
Or another tidyverse option using left_join and pmax, which keeps the larger of the existing and the joined value:
library(tidyverse)
left_join(df, df2 %>% rename(count.y = frequency), by = "words") %>%
mutate(count = pmax(count.y, count, na.rm = TRUE)) %>%
select(-count.y)
Data
df <- structure(list(words = c("hello", "yes", "example"), frequency = c(7,
8, 5), count = c(0, 0, 0)), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(words = c("example", "hello"), frequency = c(5, 6)), class = "data.frame", row.names = c(NA,
-2L))
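The match one-liner above assumes every word in df2 also occurs in df. A minimal base R sketch that looks up in the other direction, so words found only in df2 are ignored; the extra key "absent" and the names idx and hit are made up for illustration:

```r
df  <- data.frame(words = c("hello", "yes", "example"),
                  frequency = c(7, 8, 5), count = c(0, 0, 0))
# "absent" is a hypothetical key that does not occur in df
df2 <- data.frame(words = c("example", "hello", "absent"),
                  frequency = c(5, 6, 9))

idx <- match(df$words, df2$words)  # position of each df word in df2, NA if none
hit <- !is.na(idx)                 # rows of df that have a partner in df2
df$count[hit] <- df2$frequency[idx[hit]]
df
```

Rows of df without a partner in df2 keep their existing count, which is the behaviour the question asks for.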

I have two columns, if one column has a certain word in a row, I'd like the other column to add 1 point to the corresponding row

I have a column with string values and a column with numeric values. I want to add to the numeric value of each row if the string column has a certain word in it
For example:
stringColumn numericColumn
----------------------------
yes 5
no 7
no 3
yes 4
The numericColumn already has random numbers in it, but after running the code it should add 1 point to the numericColumn if the stringColumn = 'yes'.
So the dataset would end up looking like this
stringColumn numericColumn
----------------------------
yes 6
no 7
no 3
yes 5
You can edit the numericColumn by using an ifelse statement inside mutate. So, if yes is detected (via str_detect) in the stringColumn, then add 1 to the number in the numericColumn and if not (i.e., no), then just return numericColumn.
library(tidyverse)
df %>%
mutate(numericColumn = ifelse(
str_detect(stringColumn, "yes"),
numericColumn + 1,
numericColumn
))
Output
stringColumn numericColumn
1 yes 6
2 no 7
3 no 3
4 yes 5
Or in base R:
df$numericColumn <-
ifelse(grepl("yes", df$stringColumn),
df$numericColumn + 1,
df$numericColumn)
Data
df <- structure(list(stringColumn = c("yes", "no", "no", "yes"), numericColumn = c(5L,
7L, 3L, 4L)), class = "data.frame", row.names = c(NA, -4L))
There are lots of ways of getting to the answer you want, but here is my take on a tidyverse version. The conditional statements are made within case_when() that is used inside mutate(). It's worth reading into what case_when() does since it'll come in handy for various uses.
library(tidyverse)
example_df <- tibble(
stringColumn = c("yes", "no", "no", "yes"),
numericColumn = c(5,7,3,4)
)
results_table <- example_df %>%
mutate(
Updated_column = case_when(
stringColumn == "yes" ~ numericColumn + 1,
TRUE ~ numericColumn
)
)
# option 1: print to console
results_table
# option 1.2: a tidier way to view on the console
glimpse(results_table)
# option 2: view on RStudio
View(results_table)
# option 3: save as file (eg. .csv format)
write_csv(results_table, "path/to/folder/my_results.csv")
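Since a logical value counts as 1 or 0 in arithmetic, the same update can also be sketched without ifelse or case_when. Note that this assumes an exact match on "yes", whereas str_detect and grepl above also match strings that merely contain "yes":

```r
df <- data.frame(stringColumn = c("yes", "no", "no", "yes"),
                 numericColumn = c(5L, 7L, 3L, 4L))

# TRUE is treated as 1 and FALSE as 0 when added to a number
df$numericColumn <- df$numericColumn + (df$stringColumn == "yes")
df
```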

create new variables in a dataframe from existing variables using lists of names

I was wondering if there was a way to construct a function to do the following:
Original dataframe, df:
Obs Col1 Col2
1 Y NA
2 NA Y
3 Y Y
Modified dataframe, df:
Obs Col1 Col2 Col1_YN Col2_YN
1 Y NA “Yes” “No”
2 NA Y “No” “Yes”
3 Y Y “Yes” “Yes”
The following code works just fine to create the new variables but I have lots of original columns with this structure and the “Yes” “No” format works better when constructing tables.
df$Col1_YN <- as.factor(ifelse(is.na(df$Col1), "No", "Yes"))
df$Col2_YN <- as.factor(ifelse(is.na(df$Col2), "No", "Yes"))
I was thinking along the lines of defining lists of input and output columns to be passed to a function, or using lapply but haven’t figured out how to do this.
We can use across to loop over the columns and create the new columns by modifying the .names
library(dplyr)
df1 <- df %>%
mutate(across(-Obs,
~ case_when(. %in% "Y" ~ "Yes", TRUE ~ "No"), .names = "{.col}_YN"))
-output
df1
Obs Col1 Col2 Col1_YN Col2_YN
1 1 Y <NA> Yes No
2 2 <NA> Y No Yes
3 3 Y Y Yes Yes
If we want to use lapply, loop over the columns of interest, apply the ifelse and assign it back to new columns by creating a vector of new names with paste
df[paste0(names(df)[-1], "_YN")] <-
lapply(df[-1], \(x) ifelse(is.na(x), "No", "Yes"))
data
df <- structure(list(Obs = 1:3, Col1 = c("Y", NA, "Y"), Col2 = c(NA,
"Y", "Y")), class = "data.frame", row.names = c(NA, -3L))
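The question asks for a function that takes a list of column names. A minimal sketch of the lapply approach wrapped that way; add_yn and its suffix argument are hypothetical names, and the result is kept as a factor as in the question's own code:

```r
# Hypothetical helper: add "<col>_YN" factor columns for the given columns
add_yn <- function(data, cols, suffix = "_YN") {
  data[paste0(cols, suffix)] <- lapply(
    data[cols],
    function(x) factor(ifelse(is.na(x), "No", "Yes"), levels = c("No", "Yes"))
  )
  data
}

df <- data.frame(Obs = 1:3, Col1 = c("Y", NA, "Y"), Col2 = c(NA, "Y", "Y"))
df <- add_yn(df, c("Col1", "Col2"))
df
```

Fixing the levels to c("No", "Yes") keeps both levels present even when a column happens to contain only one of them, which tends to matter when building tables.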

How to remove NA from certain columns only using R

My data has many columns, but I want to remove NA values from specific columns only.
If I have a df such as the one below
structure(list(id = c(1, 1, 2, 2), admission = c("2001/01/01",
"2001/03/01", "NA", "2005/01/01"), discharged = c("2001/01/07",
"NA", "NA", "2005/01/03")), class = "data.frame", row.names = c(NA,
-4L))
and I want to exclude the records for each id which has NAs in it and get the df such as below
structure(list(id2 = c(1, 2), admission2 = c("2001/01/01", "2005/01/01"
), discharged2 = c("2001/01/07", "2005/01/03")), class = "data.frame", row.names = c(NA,
-2L))
The NAs in the data are the string "NA", not real missing values. Convert them to NA with is.na and then use na.omit
is.na(df1) <- df1 == "NA"
df1 <- na.omit(df1)
df1
id admission discharged
1 1 2001/01/01 2001/01/07
4 2 2005/01/01 2005/01/03
If it is specific columns, use
df1[complete.cases(df1[c("admission", "discharged")]),]
Base R option: convert the character "NA" to NA, then use complete.cases
df[df=="NA"] <- NA
dfNA <- complete.cases(df)
df[dfNA,]
id admission discharged
1 1 2001/01/01 2001/01/07
4 2 2005/01/01 2005/01/03
We can also do it like this:
library(tidyr)
df1[df1=="NA"] <- NA
df1_new <- df1 %>% drop_na()
df1_new
id admission discharged
1 1 2001/01/01 2001/01/07
2 2 2005/01/01 2005/01/03
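To restrict both steps to specific columns, as the title asks, a minimal base R sketch could look like this. The choice of columns in cols and the name res are illustrative:

```r
df1 <- data.frame(id = c(1, 1, 2, 2),
                  admission = c("2001/01/01", "2001/03/01", "NA", "2005/01/01"),
                  discharged = c("2001/01/07", "NA", "NA", "2005/01/03"))

cols <- c("admission", "discharged")   # columns to check (illustrative choice)
df1[cols][df1[cols] == "NA"] <- NA      # turn the string "NA" into a real NA
res <- df1[complete.cases(df1[cols]), ] # keep rows complete in those columns
res
```

Any other column is left untouched, so an NA elsewhere would not cause a row to be dropped.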

How do I aggregate data in R in a way that returns the entire row that satisfies the aggregation condition? [no dplyr]

I have data that looks like this:
ID FACTOR_VAR INT_VAR
1 CAT 1
1 DOG 0
I want to aggregate by ID such that the resulting dataframe contains the entire row that satisfies my aggregate condition. So if I aggregate by the max of INT_VAR, I want to return the whole first row:
ID FACTOR_VAR INT_VAR
1 CAT 1
The following will not work because FACTOR_VAR is a factor:
new_data <- aggregate(data[,c("ID", "FACTOR_VAR", "INT_VAR")], by=list(data$ID), FUN=max)
How can I do this? I know dplyr has a group by function, but unfortunately I am working on a computer for which downloading packages takes a long time. So I'm looking for a way to do this with just vanilla R.
If you want to keep all the columns, use ave instead :
subset(df, as.logical(ave(INT_VAR, ID, FUN = function(x) x == max(x))))
You can use aggregate for this. If you want to retain all the columns, merge can be used with it.
merge(aggregate(INT_VAR ~ ID, data = df, max), df, all.x = TRUE)
# ID INT_VAR FACTOR_VAR
#1 1 1 CAT
data
df <- structure(list(ID = c(1L, 1L), FACTOR_VAR = structure(1:2, .Label = c("CAT", "DOG"), class = "factor"), INT_VAR = 1:0), class = "data.frame", row.names = c(NA,-2L))
We can do this in dplyr
library(dplyr)
df %>%
group_by(ID) %>%
filter(INT_VAR == max(INT_VAR))
Or using data.table
library(data.table)
setDT(df)[, .SD[INT_VAR == max(INT_VAR)], by = ID]
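The approaches above return every row that ties for the maximum. If exactly one row per ID is wanted, a base R sketch that needs no extra packages is to sort and then drop duplicates; the names sorted and top are made up for illustration:

```r
df <- data.frame(ID = c(1L, 1L),
                 FACTOR_VAR = factor(c("CAT", "DOG")),
                 INT_VAR = c(1L, 0L))

# Sort so the largest INT_VAR comes first within each ID ...
sorted <- df[order(df$ID, -df$INT_VAR), ]
# ... then keep only the first row of each ID
top <- sorted[!duplicated(sorted$ID), ]
top
```

With ties, this keeps whichever tied row appears first in the data, which may or may not be what is wanted.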
