I need to update a column in one data frame wherever its employee ID matches an ID in another data frame.
For example:
df1
  empID salary
1     1  10000
2     2  15000
3     3      0

df2
  empID salary2
1     1   10000
2     2   15000
3     3   20000
I need to update df1$salary only where df1$salary = 0, taking the new value from df2 where df1$empID = df2$empID.
I tried this but received a "No such column: salary2" error:
df1$salary <- ifelse(df1$salary == 0,sqldf("UPDATE df1 SET salary = salary2 WHERE df1.empID = df2.empID"),df1$salary)
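The "No such column: salary2" error happens because the UPDATE statement only sees df1, so SQLite has no salary2 column to read from. One way around it in sqldf (a sketch of a possible fix, not from the original answers; it assumes every zero salary has a matching empID in df2) is to build the result with a SELECT that joins the two tables:

library(sqldf)

# take salary2 wherever df1's salary is 0, otherwise keep df1's salary
df1 <- sqldf("SELECT df1.empID,
                     CASE WHEN df1.salary = 0 THEN df2.salary2
                          ELSE df1.salary END AS salary
              FROM df1 LEFT JOIN df2 ON df1.empID = df2.empID")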
Here is another option with merge:

transform(merge(df1, df2, by = 'empID'),
          salary = replace(salary, salary == 0, salary2[salary == 0]),
          salary2 = NULL)
# empID salary
#1 1 10000
#2 2 15000
#3 3 20000
You can also use ifelse instead of replace for salary, i.e.
salary = ifelse(salary == 0, salary2, salary)
We could do
#find the rows in df1 where salary is 0
inds <- which(df1$salary == 0)
#match their empIDs with df2 and pull in the corresponding salary2
df1$salary[inds] <- df2$salary2[match(df1$empID[inds], df2$empID)]
df1
# empID salary
#1 1 10000
#2 2 15000
#3 3 20000
This should also work if you have multiple entries with 0 in df1.
We can do the same using ifelse like
ifelse(df1$salary == 0, df2$salary2[match(df1$empID, df2$empID)], df1$salary)
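For completeness, the same update can be written with dplyr (my sketch, not part of the original answers; like the options above, it assumes every zero salary has a matching empID in df2):

library(dplyr)

df1 %>%
  left_join(df2, by = "empID") %>%
  mutate(salary = if_else(salary == 0, salary2, salary)) %>%
  select(empID, salary)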
I want to join a data frame to another data frame, first by one column and then, if there are no "matches", by another column. The problem is similar to this question, but I'm trying to get a slightly different output.
Here are my 'observations'
#my df
my_plants <- data.frame(scientific_name = c("Abelmoschus esculentus",
"Abies balsamea",
"Ammophila breviligulata",
"Zigadenus glaucus"),
percent_cover = c(90, 80, 10, 60))
and here is the main list with some data that I want to extract for each of my observations. Obviously this is simplified.
#hypothetical database
plant_database <- data.frame(scientific_name = c("Abelmoschus esculentus",
"Abies balsamea",
"Ammophila breviligulata",
"Anticlea elegans"),
synonym = c(NA_character_,
NA_character_,
NA_character_,
"Zigadenus glaucus"),
score = c(1, 1, 2, 6))
Here is a function to join my observations to the main list. Note: I'm using a left_join because I want to know which observations were not matched.
#joining function
joining_fun <- function(plants, database) {
  database_long <- database %>%
    dplyr::mutate(ID = row.names(.)) %>%
    tidyr::pivot_longer(., cols = c(scientific_name, synonym),
                        values_to = "scientific_name")
  join <- dplyr::left_join(plants, database_long, by = "scientific_name") %>%
    dplyr::select(-name)
  return(join)
}
Which gets me here:
joining_fun(my_plants, plant_database)
scientific_name percent_cover score ID
1 Abelmoschus esculentus 90 1 1
2 Abies balsamea 80 1 2
3 Ammophila breviligulata 10 2 3
4 Zigadenus glaucus 60 6 4
but I want something like this:
scientific_name synonym percent_cover score ID
Abelmoschus esculentus NA 90 1 1
Abies balsamea NA 80 1 2
Ammophila breviligulata NA 10 2 3
Anticlea elegans Zigadenus glaucus 60 6 4
Thanks!
Use inner_join() to create a df of only cases that match on scientific_name.
Use anti_join() to get a version of plants that don't match on scientific_name.
Do another inner_join() of database with these unmatched cases, using key "synonym" = "scientific_name".
Do one more anti_join() to get cases without a match in either column.
Finally, bind all results together.
library(dplyr)
# add test case with no match in either column
my_plants <- add_row(
  my_plants,
  scientific_name = "Stackus overflovius",
  percent_cover = 0
)

joining_fun <- function(plants, database) {
  by_sci_name <- inner_join(plants, database, by = "scientific_name")
  no_sci_match <- anti_join(plants, database, by = "scientific_name")
  by_syn <- inner_join(database, no_sci_match, by = c("synonym" = "scientific_name"))
  no_match <- anti_join(no_sci_match, database, by = c("scientific_name" = "synonym"))
  bind_rows(by_syn, by_sci_name, no_match)
}
joining_fun(my_plants, plant_database)
scientific_name synonym score percent_cover
1 Anticlea elegans Zigadenus glaucus 6 60
2 Abelmoschus esculentus <NA> 1 90
3 Abies balsamea <NA> 1 80
4 Ammophila breviligulata <NA> 2 10
5 Stackus overflovius <NA> NA 0
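If you also want the database ID and the column order from the desired output, one option (my addition, assuming the database rows can simply be numbered in their current order) is to attach an ID before joining and reorder afterwards; bind_rows() fills the missing columns with NA for the unmatched test case:

plant_database <- mutate(plant_database, ID = row_number())

joining_fun(my_plants, plant_database) %>%
  select(scientific_name, synonym, percent_cover, score, ID)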
I was working on the following problem. I've got monthly data from a survey; let's call it df1:
df1 = tibble(ID = c('1','2'), reported_value = c(1200, 31000), anchor_month = c(3,5))
ID reported_value anchor_month
1 1200 3
2 31000 5
So, the first row was reported in March, but there's no way to know whether it reports March or February values, and it may also be an approximation of the real value. I've also got a table with actual values for each ID; let's call it df2:
df2 = tibble( ID = c('1', '2') %>% rep(4) %>% sort,
real_value = c(1200,1230,11000,10,25000,3100,100,31030),
month = c(1,2,3,4,2,3,4,5))
ID real_value month
1 1200 1
1 1230 2
1 11000 3
1 10 4
2 25000 2
2 3100 3
2 100 4
2 31030 5
So there are two challenges: first, for each ID I only care about the anchor month OR the month before the anchor month; second, I want to match to the closest value (sounds like a fuzzy join). My first step was to filter the second table so it only keeps the anchor month or the previous one, which I did as follows:
filter_aux = df1 %>%
  bind_rows(df1 %>% mutate(anchor_month = if_else(anchor_month == 1, 12, anchor_month - 1)))

df2 = df2 %>%
  inner_join(filter_aux, by = c('ID', 'month' = 'anchor_month')) %>%
  distinct(ID, real_value, month)
Reducing df2 to:
ID real_value month
1 1230 2
1 11000 3
2 100 4
2 31030 5
Now I tried to do a difference_inner_join by ID and reported_value = real_value (df1 %>% difference_inner_join(df2, by = c('ID', 'reported_value' = 'real_value'))), but it throws a "non-numeric argument to binary operator" error, I'm guessing because ID is a string in my actual data. What gives? I'm no expert in fuzzy joins, so I guess I'm missing something.
My final dataframe would look like this:
ID reported_value anchor_month closest_value month
1 1200 3 1230 2
2 31000 5 31030 5
Thanks!
It was easier without fuzzy_join:
df3 = df1 %>%
  left_join(df2, by = 'ID') %>%
  mutate(dif = abs(real_value - reported_value)) %>%
  group_by(ID) %>%
  filter(dif == min(dif))
Output:
ID reported_value anchor_month real_value month dif
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1200 3 1230 2 30
2 2 31000 5 31030 5 30
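A small variation (my note, not part of the answer above): slice_min() keeps exactly one row per ID even when two real_value entries are tied, and a rename gives the closest_value column from the desired output:

df1 %>%
  left_join(df2, by = 'ID') %>%
  group_by(ID) %>%
  slice_min(abs(real_value - reported_value), n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  rename(closest_value = real_value)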
I'm new to R and trying to use it in place of Excel (where I have more experience). I'm still working out the full 'for' loop logic, but not having the values to check whether it's working the way I think it should is stopping me in my tracks. The goal is to generate what will be used as a factor with 3 levels: 0 = no duplicates, 1 = if duplicated, the oldest record, 2 = if duplicated, the newest record.
I have a dataframe that looks like this
Person <- c("A", "B", "C", "C", "D", "E","E")
Date <- c(1/1/20, 1/1/20,12/25/19, 1/1/20, 1/1/20, 12/25/19, 1/1/20)
ID <- c(1,2,3,4,5,6,7)
DuplicateStatus <- c(0,0,0,0,0,0,0)
IdealResult <- c(0,0,1,2,0,1,2)
mydata <- cbind(Person, Date, ID, DuplicateStatus, IdealResult)
I am trying to use a for loop to evaluate whether a person is duplicated. If a person is not duplicated, the value is 0; if they are, they should have a 1 for the oldest value and a 2 for the newest value (see IdealResult). NOTE: I have already sorted the data by person and then date, so if duplicated, the first appearance is the oldest.
Previous VLOOKUP-in-R answers here are aimed at merging datasets based on identical values across multiple datasets. Here, I am attempting to modify a column based on the relationship between columns within a single dataset.
currentID = 0
nextID =0
for(i in mydata$ID){
currentID = i
nextID = currentID++1
CurrentPerson ##Vlookup function that does - find currentID in ID, return associated value in Person column in same position.
NextPerson ##Vlookup function that does - find nextID in ID, return associated value in Person column in same position.
if CurrentPerson = NextPerson, then DuplicateStatus at ID associated with current person should be 1, and DuplicateStatus at ID associated with NextPerson = 2.
**This should end when current person = total number of people
Thanks!
You really need to spend some time with a simple tutorial on R. Your cbind() function converts all of your data to a character matrix which is probably not what you want. Look at the results of str(mydata). Instead of looping, this creates an index number within each Person group and then zeros out the groups with a single observation:
mydata <- data.frame(Person, Date, ID, DuplicateStatus, IdealResult)
IR <- ave(mydata$ID, mydata$Person, FUN=seq_along)
IR
# [1] 1 1 1 2 1 1 2
tbl <- table(mydata$Person)
tozero <- mydata$Person %in% names(tbl[tbl == 1])
IR[tozero] <- 0
IR
# [1] 0 0 1 2 0 1 2
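If you need the result as the three-level factor described in the question (my addition, assuming at most two rows per person, as in the example), you can attach it back to the data frame:

mydata$DuplicateStatus <- factor(IR, levels = c(0, 1, 2))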
Is what you are looking for just a count of the number of observations per person, in one column (like an ID column)? If so, this will work using the tidyverse:
Person <- c("A", "B", "C", "C", "D", "E","E")
Date <- c(1/1/20, 1/1/20,12/25/19, 1/1/20, 1/1/20, 12/25/19, 1/1/20)
ID <- c(1,2,3,4,5,6,7)
DuplicateStatus <- c(0,0,0,0,0,0,0)
IdealResult <- c(0,0,1,2,0,1,2)
mydata <- data.frame(Person, Date, ID, DuplicateStatus, IdealResult)
library(tidyverse)
mydata <- mydata %>%
group_by(Person) %>%
mutate(Duplicate = seq_along(Person))
mydata
# A tibble: 7 x 6
# Groups: Person [5]
Person Date ID DuplicateStatus IdealResult Duplicate
<fct> <dbl> <dbl> <dbl> <dbl> <int>
1 A 0.05 1 0 0 1
2 B 0.05 2 0 0 1
3 C 0.0253 3 0 1 1
4 C 0.05 4 0 2 2
5 D 0.05 5 0 0 1
6 E 0.0253 6 0 1 1
7 E 0.05 7 0 2 2
You could assign a row number within each group, provided the group has more than one row.
This can be implemented in base R, dplyr, as well as data.table.
In base R :
mydata$ans <- with(mydata, ave(ID, Person, FUN = function(x)
seq_along(x) * (length(x) > 1)))
# Person Date ID IdealResult ans
#1 A 0.0500000 1 0 0
#2 B 0.0500000 2 0 0
#3 C 0.0252632 3 1 1
#4 C 0.0500000 4 2 2
#5 D 0.0500000 5 0 0
#6 E 0.0252632 6 1 1
#7 E 0.0500000 7 2 2
Using dplyr:
library(dplyr)
mydata %>% group_by(Person) %>% mutate(ans = row_number() * (n() > 1))
and with data.table
library(data.table)
setDT(mydata)[, ans := seq_along(ID) * (.N > 1), Person]
data
mydata <- data.frame(Person, Date, ID, IdealResult)
I would argue that n() is the ideal function for your problem:
library(tidyverse)
mydata <- mydata %>%
group_by(Person) %>%
mutate(Duplicate = n())
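Note that n() returns the group size on every row (2 for both rows of a duplicated person), not the 1/2 oldest/newest coding from IdealResult. If you need that exact coding, one option (my sketch, not part of the answer above) combines n() with row_number():

mydata %>%
  group_by(Person) %>%
  mutate(Duplicate = if_else(n() > 1, row_number(), 0L))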
I have a data frame df1 with information on acquisitions by ID. Each acquirer A and target B has its four-digit SIC codes on one line, separated by "/".
df1 <- data.frame(ID = c(1,2,3,4),
A = c("1230/1344/2334/2334","3322/3344/3443", "1112/9099", "3332/4483"),
B = c("1333/2334","3344/8840", "4454", "9988/2221/4483"))
ID A B
1 1230/1344/2334/2334 1333/2334
2 3322/3344/3443 3344/8840
3 1112/9099 4454
4 3332/4483 9988/2221/4483
I would need to classify each transaction ID as follows:
If the primary code (i.e. the first four digits) of A or B matches any non-primary code of B or A, then the Primary.other.match column takes the value 1, and 0 otherwise.
If any non-primary code of A matches any non-primary code of B (while the primary codes match nothing), then the Other.other.match column takes the value 1, and 0 otherwise.
The desired output is shown below in the updated df1.
df1 <- data.frame(ID = c(1,2,3,4),
                  A = c("1230/1344/2334/2334","3322/3344/3443", "1112/9099", "3332/4483"),
                  B = c("1333/2334","3344/8840", "4454", "9988/2221/4483"),
                  Primary.other.match = c(0,1,0,0), # 1 only if the primary code of A or B matches any other code of B or A
                  Other.other.match = c(1,0,0,1))   # 1 only if the primary codes match nothing, but other codes match
ID A B Primary.other.match Other.other.match
1 1230/1344/2334/2334 1333/2334 0 1
2 3322/3344/3443 3344/8840 1 0
3 1112/9099 4454 0 0
4 3332/4483 9988/2221/4483 0 1
Thank you for your help!
Here is a solution within the tidyverse.
You first create a function which checks whether there is a primary match or an other-other match, and then apply this function row-wise with purrr::map2:
library(tidyverse)
fun1 <- function(str1, str2){
  str1 <- str1 %>% str_split("/") %>% unlist()
  str2 <- str2 %>% str_split("/") %>% unlist()
  str1p <- str1[1]
  str2p <- str2[1]
  pom <- ifelse(str1p %in% str2 | str2p %in% str1, 1, 0)
  oom <- ifelse(pom == 0 & length(intersect(str1, str2)) > 0, 1, 0)
  tibble(pom = pom, oom = oom)
}
df1 %>% as_tibble() %>%
mutate(result = map2(A, B, fun1)) %>%
unnest(result)
# A tibble: 4 x 5
ID A B pom oom
<dbl> <fct> <fct> <dbl> <dbl>
1 1 1230/1344/2334/2334 1333/2334 0 1
2 2 3322/3344/3443 3344/8840 1 0
3 3 1112/9099 4454 0 0
4 4 3332/4483 9988/2221/4483 0 1
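For reference, the same logic in base R (my sketch, not part of the original answer):

# split the "/"-separated codes into character vectors
a <- strsplit(as.character(df1$A), "/")
b <- strsplit(as.character(df1$B), "/")
# primary-other match: the first code of A appears anywhere in B, or vice versa
pom <- mapply(function(x, y) as.integer(x[1] %in% y | y[1] %in% x), a, b)
# other-other match: any overlap at all, but only when there is no primary match
oom <- as.integer(pom == 0 & mapply(function(x, y) length(intersect(x, y)) > 0, a, b))
df1$Primary.other.match <- pom
df1$Other.other.match <- oom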
I have two tables and I need to update the pro_sales column in the first table with the pro_sales values from the second.
df1 <- data.frame(storecode = c(100,100,100,200,200),
productcode = c(1,2,3,1,2), pro_sales = c(0,0,0,0,0))
df2 <- data.frame(storecode = c(100,100,200),
productcode = c(1,2,1), pro_sales = c(0,1,0))
I need to left join on the columns storecode and productcode. Below should be my final table:
storecode productcode pro_sales
1 100 1 0
2 100 2 1
3 100 3 0
4 200 1 0
5 200 2 0
I was able to do the left join in dplyr, but I need help with what comes after:
df1 %>%
left_join(df2,c("storecode"="storecode","productcode"="productcode")) %>%
mutate( ???? ) %>%
select(names, match, value = value.x)
Thank you.
Another option is to use an update join with the data.table-package:
library(data.table)
setDT(df1)
setDT(df2)
df1[df2, on = .(storecode, productcode), pro_sales := i.pro_sales][]
which gives:
storecode productcode pro_sales
1: 100 1 0
2: 100 2 1
3: 100 3 0
4: 200 1 0
5: 200 2 0
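(For readers new to data.table: the i. prefix refers to columns of the table supplied in i, here df2, so pro_sales := i.pro_sales copies df2's values into the matching rows of df1 by reference, and the trailing [] just prints the result.)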
df1 <- data.frame(storecode=c(100,100,100,200,200),
productcode=c(1,2,3,1,2),pro_sales=c(0,0,0,0,0))
df2 <- data.frame(storecode=c(100,100,200),
productcode=c(1,2,1),pro_sales=c(0,1,0))
library(dplyr)
df1 %>%
  left_join(df2, by = c("storecode", "productcode")) %>%
  mutate(pro_sales.y = coalesce(pro_sales.y, 0)) %>%
  select(storecode, productcode, pro_sales = pro_sales.y)
# storecode productcode pro_sales
# 1 100 1 0
# 2 100 2 1
# 3 100 3 0
# 4 200 1 0
# 5 200 2 0
I assume that, since you want to update values in the first table from the second table, as you mentioned, the NA values from the join (rows with no match) should become zeros rather than whatever you have in your first table.
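If you would rather keep df1's original values for the unmatched rows (a different assumption than the one above), you could coalesce with the original column instead, or use dplyr's own update join, rows_update(), available since dplyr 1.0.0:

df1 %>%
  left_join(df2, by = c("storecode", "productcode")) %>%
  mutate(pro_sales = coalesce(pro_sales.y, pro_sales.x)) %>%
  select(storecode, productcode, pro_sales)

# or, an update join directly in dplyr
rows_update(df1, df2, by = c("storecode", "productcode"))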