Updating character column based on another column in R - r

I have the following dataframe
df
ID Timestamp Package Genre Name
 1        01 com.abc    NA   NA
 1        02 com.xyz    NA   NA
 2        04 com.abc    NA   NA
Now the Package column has about 1000 unique packages, on the basis of which I need to fill in the Genre and Name columns.
I know how to update these with a vectorized approach or with within, but that would mean iterating over all unique package names manually, and I was hoping for a sleeker solution.
Looking at "switch function for column values" and "R Apply function depending on element in a vector", I was trying to write a switch-based function that takes two arguments (the Package field and the Genre field) and updates via switch statements. Not sure if that's the right way to go.

Create a data.frame that contains the package information and merge the two together on Package. First drop the Genre and Name columns, as they will be populated by the merge:
df[, c("Genre", "Name")] <- NULL
df2 <- data.frame(Package = c("com.abc", "com.xyz"),
                  Genre = c("g1", "g2"),
                  Name = c("n1", "n2"))
merge(df, df2, by = "Package")
Package ID Timestamp Genre Name
1 com.abc 1 1 g1 n1
2 com.abc 2 4 g1 n1
3 com.xyz 1 2 g2 n2
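If you want to keep the original row order of df, a dplyr left join is an alternative sketch (assuming the dplyr package is available); unlike merge, left_join preserves the row order of its left table:

```r
library(dplyr)

# The original data, with the empty Genre/Name columns already dropped
df <- data.frame(ID = c(1, 1, 2),
                 Timestamp = c("01", "02", "04"),
                 Package = c("com.abc", "com.xyz", "com.abc"),
                 stringsAsFactors = FALSE)
df2 <- data.frame(Package = c("com.abc", "com.xyz"),
                  Genre = c("g1", "g2"),
                  Name = c("n1", "n2"),
                  stringsAsFactors = FALSE)

# Rows of df stay in place; packages missing from df2 would get NA
left_join(df, df2, by = "Package")
#   ID Timestamp Package Genre Name
# 1  1        01 com.abc    g1   n1
# 2  1        02 com.xyz    g2   n2
# 3  2        04 com.abc    g1   n1
```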

recode in dplyr giving Error: Argument 2 must be named, not unnamed

I have a dataframe "employee" like this:
Emp_Id,Name,Dept_Id
20203,Sam,1
20301,Rodd,2
30321,Mike,3
40403,Derik,4
Now I want to transform this dataframe so that Dept_Id holds department names instead of IDs.
I am trying to use recode from dplyr for this; since my transformation logic comes from a csv, I have to use a variable in place of the literal transformation logic.
I used read.csv to get a dataframe df where my logic (1=HR, 2=IT and so on) sits, and then pulled it out into a variable:
df:
Source,Target,Transformation
Employee,Emp,"1=HR,2=Sales,3=Finance,4=IT"
To get the transformation logic from df:
myList <- as.character(df[1,3])
Now replacing the data in employee as per the logic
employee$Dept_Id <- recode(employee$Dept_Id,myList)
On this line it is giving me:
Error: Argument 2 must be named, not unnamed
There are multiple ways to do this. One way is:
Method 1:
df$Dept_Id <- name[match(df$Dept_Id, names(name))]
Emp_Id Name Dept_Id
1: 20203 Sam HR
2: 20301 Rodd IT
Method 2:
df <- df %>%
  mutate(Dept_Id_2 = case_when(
    Dept_Id == 1 ~ 'HR',
    Dept_Id == 2 ~ 'IT'
  ))
Method 3:
codes <- list("1" = "HR", "2" = "IT")
df %>%
  mutate(d2 = recode(Dept_Id, !!!codes))
Setup
library(data.table)
df <- fread("
Emp_Id Name Dept_Id
20203 Sam 1
20301 Rodd 2
")
name <- c("1" = "HR", "2"="IT")
Your dataframe df has a different structure, which makes it difficult to apply functions directly. We need to clean it and bring it into a better format so that it is easy to query.
One way to do that is to split the data on , and = to create a new dataframe (lookup) with department ID and Name.
lookup <- data.frame(t(sapply(strsplit(as.character(df[1, 3]), ",")[[1]],
                              function(x) strsplit(x, "=")[[1]])),
                     row.names = NULL)
lookup
# X1 X2
#1 1 HR
#2 2 Sales
#3 3 Finance
#4 4 IT
Once we have lookup then it is easy to match by ID and get corresponding name.
employee$Dept_Name <- lookup$X2[match(employee$Dept_Id, lookup$X1)]
employee
# Emp_Id Name Dept_Id Dept_Name
#1 20203 Sam 1 HR
#2 20301 Rodd 2 Sales
#3 30321 Mike 3 Finance
#4 40403 Derik 4 IT
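The two ideas can also be combined: split the transformation string into a named vector and splice it into recode with !!!, which supplies exactly the named arguments recode was complaining about. A sketch assuming dplyr is available (the pairs and codes names are mine):

```r
library(dplyr)

employee <- data.frame(Emp_Id = c(20203, 20301, 30321, 40403),
                       Name = c("Sam", "Rodd", "Mike", "Derik"),
                       Dept_Id = c(1, 2, 3, 4),
                       stringsAsFactors = FALSE)

# The transformation string exactly as it comes out of the csv
myList <- "1=HR,2=Sales,3=Finance,4=IT"

# Split on "," then "=" to build a named vector: c("1" = "HR", ...)
pairs <- strsplit(strsplit(myList, ",")[[1]], "=")
codes <- setNames(vapply(pairs, `[`, character(1), 2),
                  vapply(pairs, `[`, character(1), 1))

# recode needs named arguments, so splice the vector in with !!!
employee$Dept_Name <- recode(as.character(employee$Dept_Id), !!!codes)
employee$Dept_Name
# [1] "HR"      "Sales"   "Finance" "IT"
```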
Another way, if you don't want to change your existing data and the department list is not too big.
Assumption: there is no missing data in your "employee" data; if there is, you would need to add one more level of conditions.
ifelse is a simple way to apply your logic, as in the code below:
New_DF = ifelse(employee$Dept_Id == 1, "HR",
         ifelse(employee$Dept_Id == 2, "Sales",
         ifelse(employee$Dept_Id == 3, "Finance", "IT")))
New_DF = cbind(employee, New_DF)

cbind a dynamic column name from a string in R

I want to cbind a column to the data frame with the column name dynamically assigned from a string
y_attribute = "Survived"
cbind(test_data, y_attribute = NA)
this results in a new column named y_attribute instead of the required Survived, which is provided as a string in the y_attribute variable. What needs to be done to get a column in the data frame with its name taken from the variable?
You don't actually need cbind to add a new column. Any of these will work:
test_data[, y_attribute] = NA # data frame row,column syntax
test_data[y_attribute] = NA # list syntax (would work for multiple columns at once)
test_data[[y_attribute]] = NA # list single item syntax (single column only)
New columns are added after the existing columns, just like cbind.
We can use tidyverse to do this
library(dplyr)
test_data %>%
  mutate(!!y_attribute := NA)
# col1 Survived
#1 1 NA
#2 2 NA
#3 3 NA
#4 4 NA
#5 5 NA
data
test_data <- data.frame(col1 = 1:5)
Not proud of this, but I usually do something like this:
dyn.col <- "XYZ"
test.data <- cbind(test.data, UNIQUE_NAMEXXX = NA)
colnames(test.data)[colnames(test.data) == 'UNIQUE_NAMEXXX'] <- dyn.col
We can also do it with data.table
library(data.table)
setDT(test_data)[, (y_attribute) := NA]

merging multiple dataframes with duplicate rows in R

Relatively new with R for this kind of thing, searched quite a bit and couldn't find much that was helpful.
I have about 150 .csv files with 40,000 - 60,000 rows each and I am trying to merge 3 columns from each into 1 large data frame. I have a small script that extracts the 3 columns of interest ("id", "name" and "value") from each file and merges by "id" and "name" with the larger data frame "MergedData". Here is my code (I'm sure this is a very inefficient way of doing this and that's ok with me for now, but of course I'm open to better options!):
file_list <- list.files()
for (file in file_list){
  if(!exists("MergedData")){
    MergedData <- read.csv(file, skip=5)[, c("id", "name", "value")]
    colnames(MergedData) <- c("id", "name", file)
  } else if(exists("MergedData")){
    temp_data <- read.csv(file, skip=5)[, c("id", "name", "value")]
    colnames(temp_data) <- c("id", "name", file)
    MergedData <- merge(MergedData, temp_data, by=c("id", "name"), all=TRUE)
    rm(temp_data)
  }
}
Not every file has the same number of rows, though many rows are common to many files. I don't have an inclusive list of rows, so I included all=TRUE to append new rows that don't yet exist in the MergedData file.
My problem is: many of the files contain 2-4 rows with identical "id" and "name" entries, but different "value" entries. So, when I merge them I end up adding rows for every possible combination, which gets out of hand fast. Most frustrating is that none of these duplicates are of any interest to me whatsoever. Is there a simple way to take the value for the first entry and just ignore any further duplicate entries?
Thanks!
Based on your comment, we could stack each file and then cast the resulting data frame from "long" to "wide" format:
library(dplyr)
library(readr)
library(reshape2)
df = lapply(file_list, function(file) {
  dat = read_csv(file)
  dat$source.file = file
  return(dat)
})
df = bind_rows(df)
df = dcast(df, id + name ~ source.file, value.var="value")
In the code above, after reading in each file, we add a new column source.file containing the file name (or a modified version thereof).* Then we use dcast to cast the data frame from "long" to "wide" format to create a separate column for the value from each file, with each new column taking one of the names we just created in source.file.
Note also that depending on what you're planning to do with this data frame, you may find it more convenient to keep it in long format (i.e., skip the dcast step) for further analysis.
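As an aside, the dcast step can also be sketched with tidyr's pivot_wider (an assumption on my part that tidyr >= 1.0 is available; it is not what this answer uses). Its values_fn argument plays the role of dcast's fun.aggregate when more than one value maps to a cell:

```r
library(tidyr)

df = data.frame(id = rep(1:2, 2),
                name = rep(c("A", "B"), 2),
                source.file = rep(c("001", "002"), each = 2),
                value = 11:14)

# One row per id/name pair, one column per source.file
pivot_wider(df, id_cols = c(id, name),
            names_from = source.file, values_from = value)
```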
Addendum: Dealing with Aggregation function missing: defaulting to length warning. This happens when you have more than one row with the same id, name and source.file. That means there are multiple values that have to get mapped to the same cell, resulting in aggregation. The default aggregation function is length (i.e., a count of the number of values in that cell). The only ways around this that I know of are (a) keep the data in long format, (b) use a different aggregation function (e.g., mean), or (c) add an extra counter column to differentiate cases with multiple values for the same combination of id, name, and source.file. We demonstrate these below.
First, let's create some fake data:
df = data.frame(id=rep(1:2, 2),
                name=rep(c("A","B"), 2),
                source.file=rep(c("001","002"), each=2),
                value=11:14)
df
id name source.file value
1 1 A 001 11
2 2 B 001 12
3 1 A 002 13
4 2 B 002 14
Only one value per combination of id, name and source.file, so dcast works as desired.
dcast(df, id + name ~ source.file, value.var="value")
id name 001 002
1 1 A 11 13
2 2 B 12 14
Add an additional row with the same id, name and source.file. Since there are now two values getting mapped to a single cell, dcast must aggregate. The default aggregation function is to provide a count of the number of values.
df = rbind(df, data.frame(id=1, name="A", source.file="002", value=50))
dcast(df, id + name ~ source.file, value.var="value")
Aggregation function missing: defaulting to length
id name 001 002
1 1 A 1 2
2 2 B 1 1
Instead, use mean as the aggregation function.
dcast(df, id + name ~ source.file, value.var="value", fun.aggregate=mean)
id name 001 002
1 1 A 11 31.5
2 2 B 12 14.0
Add a new counter column to differentiate cases where there are multiple rows with the same id, name and source.file and include that in dcast. This gets us back to a single value per cell, but at the expense of having more than one column for some source.files.
# Add counter column
df = df %>%
  group_by(id, name, source.file) %>%
  mutate(counter=1:n())
As you can see, the counter value only has a value of 1 in cases where there's only one combination of id, name, and source.file, but has values of 1 and 2 for one case where there are two rows with the same id, name, and source.file (rows 3 and 5 below).
df
id name source.file value counter
1 1 A 001 11 1
2 2 B 001 12 1
3 1 A 002 13 1
4 2 B 002 14 1
5 1 A 002 50 2
Now we dcast with counter included, so we get two columns for source.file "002".
dcast(df, id + name ~ source.file + counter, value.var="value")
id name 001_1 002_1 002_2
1 1 A 11 13 50
2 2 B 12 14 NA
* I'm not sure what your file names look like, so you'll probably need to adjust this to create a naming format with a unique file identifier. For example, if your file names follow the pattern "file001.csv", "file002.csv", etc., you could do this: dat$source.file = paste0("Value", gsub("file([0-9]{3})\\.csv", "\\1", file)).

dplyr or base R. How to select (or delete) rows that have identical values (column 1 and column 2) while keeping column 3 values

In a data.frame class object, with {dplyr} or base R:
how to select (or delete) rows that have identical values in column 1 and column 2 (while keeping column 3's values)?
I have no idea (use the distinct function?)
test <- data.frame(column1 = c("paris", "moscou", "rennes"),
                   column2 = c("paris", "lima", "rennes"),
                   column3 = c(12, 56, 78))
> print (test)
column1 column2 column3
1 paris paris 12
2 moscou lima 56
3 rennes rennes 78
Example:
row 1: paris paris
row 3: rennes rennes
library(dplyr)
test2 <- test %>%
  filter(column1 == column2)
print (test2)
Error: level sets of factors are different
We can use subset from base R
subset(test, as.character(column1) == as.character(column2))
In dplyr, use filter to retrieve specific rows and select to retrieve specific columns.
Because data.frame turns strings into factors by default, you need as.character to compare the values as strings:
library(dplyr)
test %>%
filter(as.character(column1) == as.character(column2))
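The as.character calls are only needed because data.frame converted the string columns to factors (the default before R 4.0). A sketch of the same filter with stringsAsFactors = FALSE, where the plain comparison works directly:

```r
library(dplyr)

test <- data.frame(column1 = c("paris", "moscou", "rennes"),
                   column2 = c("paris", "lima", "rennes"),
                   column3 = c(12, 56, 78),
                   stringsAsFactors = FALSE)  # keep strings as character

# No factor levels involved, so no "level sets of factors" error
test %>% filter(column1 == column2)
#   column1 column2 column3
# 1   paris   paris      12
# 2  rennes  rennes      78
```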

How to create ID (by) before merging in R?

I have two dataframes df.o and df.m as defined below. I need to find which observation in df.o (dimension table) corresponds to which observations in df.m (fact table) based on two criteria: 1) df.o$Var1 == df.m$Var1 and 2) df.o$date1 < df.m$date2 < df.o$date3, such that I get the correct value of df.o$oID into df.m$oID (the correct values are manually entered in df.m$CORRECToID). I need the ID to complete a merge afterwards.
df.o <- data.frame(oID=1:4,
                   Var1=c("a","a","b","c"),
                   date3=c(2015,2011,2014,2015),
                   date1=c(2013,2009,2012,2013),
                   stringsAsFactors=FALSE)
df.m <- data.frame(mID=1:3,
                   Var1=c("a","a","b"),
                   date2=c(2014,2010,2013),
                   oID=NA,
                   CORRECToID=c(1,2,3),
                   points=c(5,10,15),
                   stringsAsFactors=FALSE)
I have tried various combinations like the code below, but without luck:
df.m$oID[df.m$date2 < df.o$date3 & df.m$date2 > df.o$date1 & df.o$Var1==df.m$Var1] <- df.o$oID
I have also tried experimenting with various combinations of ifelse, which and match, but none seem to do the trick.
The problem I keep encountering is that the replacement has a different number of rows than the data, and that "longer object length is not a multiple of shorter object length".
What you are looking for is called an "overlap join", you could try the data.table::foverlaps function in order to achieve this.
The idea is simple:
1. create the columns to overlap on (add an additional column, date4, to df.m)
2. key by these columns
3. run foverlaps and select the columns you want back
library(data.table)
setkey(setDT(df.m)[, date4 := date2], Var1, date2, date4)
setkey(setDT(df.o), Var1, date1, date3)
foverlaps(df.m, df.o)[, names(df.m), with = FALSE]
# mID Var1 date2 oID CORRECToID points date4
# 1: 2 a 2010 2 2 10 2010
# 2: 1 a 2014 1 1 5 2014
# 3: 3 b 2013 3 3 15 2013
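For completeness, newer dplyr can express the same match directly as a non-equi join with join_by (this assumes dplyr >= 1.1.0; the placeholder oID and CORRECToID columns are left out of df.m here to avoid name clashes in the result):

```r
library(dplyr)  # needs dplyr >= 1.1.0 for join_by()

df.o <- data.frame(oID = 1:4,
                   Var1 = c("a", "a", "b", "c"),
                   date3 = c(2015, 2011, 2014, 2015),
                   date1 = c(2013, 2009, 2012, 2013),
                   stringsAsFactors = FALSE)
df.m <- data.frame(mID = 1:3,
                   Var1 = c("a", "a", "b"),
                   date2 = c(2014, 2010, 2013),
                   points = c(5, 10, 15),
                   stringsAsFactors = FALSE)

# Strict inequalities match the question's date1 < date2 < date3 criterion
left_join(df.m, df.o,
          by = join_by(Var1, date2 > date1, date2 < date3))
```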
