For Each Loop to convert into numeric values [duplicate] - r

I have a question about the manipulation of a data frame. If I have this data frame as an example:
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(21000, 23400, 26800)
startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))
location <- c('New York', 'Alabama','New York')
employ.data <- data.frame(employee, salary, startdate, location)
employ.data
employee salary startdate location
1 John Doe 21000 2010-11-01 New York
2 Peter Gynn 23400 2008-03-25 Alabama
3 Jolie Hope 26800 2007-03-14 New York
Now I want to transform the location into numeric values. I know that I can do something like this:
transformlocation <- function(x) {
x <- as.character(x)
if (x =='New York'){
return('1')
}else if (x=='Alabama'){
return('2')
}else if (x=='Florida'){
return('3')
}else
return('0')
}
employ.data$location <- sapply(employ.data$location, transformlocation)
employ.data
employee salary startdate location
1 John Doe 21000 2010-11-01 1
2 Peter Gynn 23400 2008-03-25 2
3 Jolie Hope 26800 2007-03-14 1
But my final dataset has hundreds of different values. Is it possible to use a for-each statement here instead?
Thanks for your help!

If it is already a factor variable, then simply convert it to integer, i.e.
employ.data$location <- as.integer(employ.data$location)
employ.data
# employee salary startdate location
#1 John Doe 21000 2010-11-01 2
#2 Peter Gynn 23400 2008-03-25 1
#3 Jolie Hope 26800 2007-03-14 2
Otherwise, convert to factor and then to integer, i.e.
employ.data$location <- as.integer(as.factor(employ.data$location))
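Note that factor levels are sorted alphabetically by default, so the IDs above follow alphabetical order (Alabama = 1, New York = 2) rather than the order used in the question. A minimal sketch, assuming you want IDs numbered by first appearance instead; the variable names ids_alpha and ids_first are illustrative only:

```r
location <- c("New York", "Alabama", "New York")

# Default factor levels are sorted, so IDs follow alphabetical order
ids_alpha <- as.integer(factor(location))

# match() against the unique values numbers each value by first appearance
ids_first <- match(location, unique(location))

print(ids_alpha)
print(ids_first)
```

Both approaches are vectorised, so no explicit loop is needed regardless of how many distinct values the column holds.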

Related

R - Creating/subset dataframes within loop or apply family

I would like to subset a R dataframe if multiple variables (ending in same suffix) do NOT contain NAs.
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary1 <- c(NA, 20400, 26800)
salary2 <- c(29045, NA, 78765)
date1 <- as.Date(c(NA,'2008-3-25','2007-3-14'))
date2 <- as.Date(c('2010-11-1',NA,'2007-3-14'))
employ.data <- data.frame(employee, salary1, salary2, date1, date2)
for (i in 1:2) {
assign("employ.data", i, subset(employ.data, !is.na(employ.data$date[i]) & !is.na(employ.data$salary[i])))
}
Final result would hopefully produce two separate dataframes looking something like:
employ.data1:
| employee | salary1 | salary2 | date1 | date2 |
| Peter Gynn | 20400 | NA | 2008-03-25 | NA |
| Jolie Hope | 26800 | 78765 | 2007-03-14 | 2007-03-14 |
employ.data2:
| employee | salary1 | salary2 | date1 | date2 |
| John Doe | NA | 29045 | NA | 2010-11-01 |
| Jolie Hope | 26800 | 78765 | 2007-03-14 | 2007-03-14 |
Thanks in advance!
(Up front: it's generally better to deal with a list of frames instead of using assign to dynamically create objects.)
numfields <- grep("[0-9]+$", colnames(employ.data), value = TRUE)
split(numfields, gsub(".*?([0-9]+)$", "\\1", numfields))
# $`1`
# [1] "salary1" "date1"
# $`2`
# [1] "salary2" "date2"
out <- lapply(split(numfields, gsub(".*?([0-9]+)$", "\\1", numfields)),
function(flds) employ.data[ complete.cases(subset(employ.data, select = flds)), ])
out
# $`1`
# employee salary1 salary2 date1 date2
# 2 Peter Gynn 20400 NA 2008-03-25 <NA>
# 3 Jolie Hope 26800 78765 2007-03-14 2007-03-14
# $`2`
# employee salary1 salary2 date1 date2
# 1 John Doe NA 29045 <NA> 2010-11-01
# 3 Jolie Hope 26800 78765 2007-03-14 2007-03-14
This will dynamically find all numbered fields, though admittedly it does not enforce "pairs".
The non_na function returns the rows that have no missing values in the columns whose names end in i. It first builds those column names as cn, then uses complete.cases to produce a logical vector marking the rows with no NAs in those columns, and finally returns those rows from employ.data.
Assume that the only digits in the column names are at the end, and that the unique such numbers define the possible suffixes. The stems are the unique names left after removing the suffix.
We assume that column 1 is to be excluded from this calculation. We use gsub/unique to define suffixes and stems. Alternatively, hard-code suffixes <- as.character(1:2) and stems <- c("salary", "date").
Finally, run non_na on each suffix, creating a list whose i-th component contains the data frame with the rows corresponding to suffix i. L[[i]] will be the i-th data frame.
non_na <- function(i) {
cn <- paste0(stems, i)
employ.data[complete.cases(employ.data[cn]), ]
}
nms <- names(employ.data)[-1]
suffixes <- unique(gsub("\\D", "", nms)) # c("1", "2")
stems <- unique(gsub("\\d", "", nms)) # c("salary", "date")
L <- Map(non_na, suffixes); L
giving:
$`1`
employee salary1 salary2 date1 date2
2 Peter Gynn 20400 NA 2008-03-25 <NA>
3 Jolie Hope 26800 78765 2007-03-14 2007-03-14
$`2`
employee salary1 salary2 date1 date2
1 John Doe NA 29045 <NA> 2010-11-01
3 Jolie Hope 26800 78765 2007-03-14 2007-03-14
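If the names employ.data1 and employ.data2 from the question matter, the list elements can simply be named that way instead of creating separate objects with assign. A minimal sketch, assuming hard-coded suffixes and stems; the list name frames is illustrative only:

```r
employ.data <- data.frame(
  employee = c("John Doe", "Peter Gynn", "Jolie Hope"),
  salary1 = c(NA, 20400, 26800),
  salary2 = c(29045, NA, 78765),
  date1 = as.Date(c(NA, "2008-3-25", "2007-3-14")),
  date2 = as.Date(c("2010-11-1", NA, "2007-3-14"))
)

suffixes <- c("1", "2")        # assumed known in advance
stems <- c("salary", "date")   # assumed known in advance

# One data frame per suffix, keeping rows complete in that suffix's columns
frames <- lapply(suffixes, function(i) {
  cols <- paste0(stems, i)
  employ.data[complete.cases(employ.data[cols]), ]
})
names(frames) <- paste0("employ.data", suffixes)

frames$employ.data1
frames[["employ.data2"]]
```

Keeping the results in one named list makes it easy to loop over them later with lapply or Map.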

Merge dataframe with a key value that is contained within a string in a separate dataframe

employee <- c('John','Peter', 'Gynn', 'Jolie', 'Hope', 'Sue', 'Jane', 'Sarah')
salary <- c('VT020', 'VT126', 'VT027', 'VT667', 'VC120', 'VT000', 'VA120', 'VA020')
emp <- data.frame(employee, salary)
benefit <- c('Health', 'Time', 'Bonus')
benefit_id <- c('VT020 VT126 VT667 VA020', 'VT667', 'VT126 VT667 VT000')
ben <- data.frame(benefit, benefit_id)
Above we have two data frames: one contains names and a unique ID, the other contains a category and a list of unique IDs.
What is the most efficient way to merge the ben dataframe with the emp dataframe such that we get the appropriate benefit assigned to each employee?
tidyverse
library(dplyr)
library(tidyr) # for unnest()
ben %>%
mutate(benefit_id = strsplit(benefit_id, "\\s+")) %>%
unnest(benefit_id) %>%
left_join(emp, ., by = c(salary = "benefit_id"))
# employee salary benefit
# 1 John VT020 Health
# 2 Peter VT126 Health
# 3 Peter VT126 Bonus
# 4 Gynn VT027 <NA>
# 5 Jolie VT667 Health
# 6 Jolie VT667 Time
# 7 Jolie VT667 Bonus
# 8 Hope VC120 <NA>
# 9 Sue VT000 Bonus
# 10 Jane VA120 <NA>
# 11 Sarah VA020 Health
Depending on your needs, you may also prefer a different join. For instance, use a full_join if you want all pairings, where NA in employee indicates a benefit sans employee.
FYI: if you are running R before 4.0, then you might have factors in your data. To fix that, just convert the factor columns with as.character first. (This can be determined with sapply(ben, inherits, "factor").)
data.table
library(data.table)
setDT(emp)
ben_long <- setDT(ben)[, list(benefit_id = unlist(strsplit(x = benefit_id, split = " "))), by = benefit]
merge(x = emp, y = ben_long, by.x = "salary", by.y = "benefit_id", all.x = TRUE)
salary employee benefit
1: VA020 Sarah Health
2: VA120 Jane <NA>
3: VC120 Hope <NA>
4: VT000 Sue Bonus
5: VT020 John Health
6: VT027 Gynn <NA>
7: VT126 Peter Health
8: VT126 Peter Bonus
9: VT667 Jolie Health
10: VT667 Jolie Time
11: VT667 Jolie Bonus
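For comparison, the same split-then-join idea can be sketched in base R without extra packages. This assumes the ID columns are character (hence stringsAsFactors = FALSE); the names ids, ben_long and res are illustrative only:

```r
emp <- data.frame(
  employee = c("John", "Peter", "Gynn", "Jolie", "Hope", "Sue", "Jane", "Sarah"),
  salary = c("VT020", "VT126", "VT027", "VT667", "VC120", "VT000", "VA120", "VA020"),
  stringsAsFactors = FALSE
)
ben <- data.frame(
  benefit = c("Health", "Time", "Bonus"),
  benefit_id = c("VT020 VT126 VT667 VA020", "VT667", "VT126 VT667 VT000"),
  stringsAsFactors = FALSE
)

# Split each ID string on whitespace, then repeat the benefit once per ID
ids <- strsplit(ben$benefit_id, "\\s+")
ben_long <- data.frame(
  benefit = rep(ben$benefit, lengths(ids)),
  benefit_id = unlist(ids),
  stringsAsFactors = FALSE
)

# Left join: keep every employee, with NA where no benefit matches
res <- merge(emp, ben_long, by.x = "salary", by.y = "benefit_id", all.x = TRUE)
res
```

As with the data.table version, merge sorts the result by the key column, so the row order differs from emp.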

Add multiple new columns to the dataset, based on another dataset's elements

I have the following products list
> products
# A tibble: 311 x 1
value
<fct>
1 NA
2 Alternativ Economy
3 Ambulant Balance
4 Ambulant Economy
5 Ambulant Premium
6 Ambulant 2
7 Ambulant 3
8 Ambulant 1
9 COMPLETA
10 HOSPITAL ECO
# ... with 301 more rows
and the following df
> df <- data.frame(employee = c('John Doe','Peter Gynn','Jolie Hope'),
+ salary = c(21000, 23400, 26800),
+ startdate = as.Date(c('2010-11-1','2008-3-25','2007-3-14')))
> df
employee salary startdate
1 John Doe 21000 2010-11-01
2 Peter Gynn 23400 2008-03-25
3 Jolie Hope 26800 2007-03-14
Now, I want to add the elements of the former (i.e. products) as variables of the latter (i.e. the df). I use
cbind(df, setNames(lapply(products, function(x) x = NA), products))
but I get an error. Can you suggest another way of doing this? What is wrong with my solution? Thanks in advance.
Here is one solution.
df <- data.frame(employee = c('John Doe','Peter Gynn','Jolie Hope'),
salary = c(21000, 23400, 26800),
startdate = as.Date(c('2010-11-1','2008-3-25','2007-3-14')))
products <- data.frame(value = c(NA, "Alternativ Economy", "COMPLETA"))
#products$value <- ifelse(is.na(products$value), "not_available", as.character(products$value))
cbind(df, `colnames<-`(data.frame(matrix(ncol = nrow(products), nrow = nrow(df))), products$value))
employee salary startdate NA Alternativ Economy COMPLETA
1 John Doe 21000 2010-11-01 NA NA NA
2 Peter Gynn 23400 2008-03-25 NA NA NA
3 Jolie Hope 26800 2007-03-14 NA NA NA
I question the wisdom of having NAs as column names, so I'd uncomment that one line of code in there to replace NAs with some character string instead.
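A shorter variant, offered as a sketch rather than a drop-in replacement: assigning NA to a vector of new column names adds them all at once. It assumes the missing product name is replaced by the placeholder "not_available"; new_cols is an illustrative name:

```r
df <- data.frame(
  employee = c("John Doe", "Peter Gynn", "Jolie Hope"),
  salary = c(21000, 23400, 26800),
  stringsAsFactors = FALSE
)
products <- data.frame(
  value = c(NA, "Alternativ Economy", "COMPLETA"),
  stringsAsFactors = FALSE
)

# Build the new column names, replacing the missing one with a placeholder
new_cols <- as.character(products$value)
new_cols[is.na(new_cols)] <- "not_available"

# Assigning NA to several new names creates one NA-filled column per name
df[new_cols] <- NA
df
```

Column names containing spaces, such as "Alternativ Economy", then need backticks or [[ ]] to be accessed.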

Trim leading/trailing whitespaces from a data frame column where the column name comes as a variable

I have a data frame where the name of the column to be trimmed of whitespace comes in as a variable, and I am not able to resolve the variable to the column so that it can be trimmed.
salary <- c(21000, 23400, 26800)
startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))
employee <- c(' John Doe ',' Peter Gynn ',' Jolie Hope')
employ.data <- data.frame(employee, salary, startdate)
Here I try to trim the employee column, and I have tried dplyr:
employ.data %>% mutate(employee = trimws(employee))
which works.
However, if I say:
abc <- "employee"
and then
employ.data %>% mutate(abc= trimws(abc))
It doesn't work.
I have tried using get(abc) in this function, but this doesn't work either.
I understand I can't use abc as employ.data$abc when abc is a variable column name.
INITIAL DATAFRAME
employee salary startdate
John Doe 21000 2010-11-01
Peter Gynn 23400 2008-03-25
Jolie Hope 26800 2007-03-14
FINAL DATAFRAME
employee salary startdate
John Doe 21000 2010-11-01
Peter Gynn 23400 2008-03-25
Jolie Hope 26800 2007-03-14
You can also use str_trim from stringr in the tidyverse.
library(dplyr)
library(stringr)
employ.data %>%
mutate(abc = str_trim(employee)) # adds a new column literally named abc
Which is:
employee salary startdate abc
1 John Doe 21000 2010-11-01 John Doe
2 Peter Gynn 23400 2008-03-25 Peter Gynn
3 Jolie Hope 26800 2007-03-14 Jolie Hope
Use mutate_at
library(dplyr)
employ.data %>% mutate_at(abc, trimws)
# employee salary startdate
#1 John Doe 21000 2010-11-01
#2 Peter Gynn 23400 2008-03-25
#3 Jolie Hope 26800 2007-03-14
Or, if you have only one column, you can assign directly:
employ.data[[abc]] <- trimws(employ.data[[abc]])
If there are multiple columns you can use lapply
employ.data[abc] <- lapply(employ.data[abc], trimws)
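In more recent dplyr versions (1.0 or later is assumed here), mutate_at is superseded by across. A minimal sketch of two ways to refer to a column whose name is stored in a string; the result names trimmed and trimmed2 are illustrative only:

```r
library(dplyr)

employ.data <- data.frame(
  employee = c(" John Doe ", " Peter Gynn ", " Jolie Hope"),
  salary = c(21000, 23400, 26800),
  stringsAsFactors = FALSE
)
abc <- "employee"

# across() with all_of() selects columns by the names held in a character vector
trimmed <- employ.data %>% mutate(across(all_of(abc), trimws))

# Alternatively, inject the name on the left and use the .data pronoun on the right
trimmed2 <- employ.data %>% mutate(!!abc := trimws(.data[[abc]]))

trimmed
```

The across form extends naturally to several columns, since abc can hold more than one name.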

Create unique list of names

I have a list of actors:
name <- c('John Doe','Peter Gynn','Jolie Hope')
age <- c(26 , 32, 56)
postcode <- c('4011', '5600', '7700')
actors <- data.frame(name, age, postcode)
name age postcode
1 John Doe 26 4011
2 Peter Gynn 32 5600
3 Jolie Hope 56 7700
I also have an edge list of relations:
from <- c('John Doe','John Doe','John Doe', 'Peter Gynn', 'Peter Gynn', 'Jolie Hope')
to <- c('John Doe', 'John Doe', 'Peter Gynn', 'Jolie Hope', 'Peter Gynn', 'Frank Smith')
edge <- data.frame(from, to)
from to
1 John Doe John Doe
2 John Doe John Doe
3 John Doe Peter Gynn
4 Peter Gynn Jolie Hope
5 Peter Gynn Peter Gynn
6 Jolie Hope Frank Smith
First, I want to eliminate self references in my edge list i.e. rows 1,2,5 in my 'edge' dataframe.
non.self.ref <- edge[!(edge$from == edge$to),]
does not produce the desired result.
Second, edge includes a name not in the 'actor' dataframe ('Frank Smith'). I want to add 'Frank Smith' to my 'actor' dataframe, even though I do not have age or postcode data for 'Frank Smith'. For example:
name age postcode
1 John Doe 26 4011
2 Peter Gynn 32 5600
3 Jolie Hope 56 7700
4 Frank Smith NA NA
I would be grateful for a tidy solution!
Here is a tidyverse solution to both parts, though in general try not to ask multiple questions per question.
The first part is fairly simple. filter allows a very intuitive syntax that just specifies you want to keep rows where from isn't equal to to.
The second part is a little more complicated. First we gather up the from and to columns, so all the actors are in one column. Then we use distinct to leave us with a one column tbl with unique actor names. Finally, we can use full_join to combine the tables. A full_join keeps all rows and columns from both tables, matching on shared name column by default, and fills NA if there is no data (as there isn't for Frank).
library(tidyverse)
actors <- tibble(
name = c('John Doe','Peter Gynn','Jolie Hope'),
age = c(26 , 32, 56),
postcode = c('4011', '5600', '7700')
)
edge <- tibble(
from = c('John Doe','John Doe','John Doe', 'Peter Gynn', 'Peter Gynn', 'Jolie Hope'),
to = c('John Doe', 'John Doe', 'Peter Gynn', 'Jolie Hope', 'Peter Gynn', 'Frank Smith')
)
edge %>%
filter(from != to)
#> # A tibble: 3 x 2
#> from to
#> <chr> <chr>
#> 1 John Doe Peter Gynn
#> 2 Peter Gynn Jolie Hope
#> 3 Jolie Hope Frank Smith
edge %>%
gather("to_from", "name", from, to) %>%
distinct(name) %>%
full_join(actors)
#> Joining, by = "name"
#> # A tibble: 4 x 3
#> name age postcode
#> <chr> <dbl> <chr>
#> 1 John Doe 26.0 4011
#> 2 Peter Gynn 32.0 5600
#> 3 Jolie Hope 56.0 7700
#> 4 Frank Smith NA <NA>
Created on 2018-03-02 by the reprex package (v0.2.0).
I discovered that including stringsAsFactors = FALSE fixes the first part, e.g.
edge <- data.frame(from, to, stringsAsFactors = FALSE)
then:
non.self.ref <- edge[!(edge$from == edge$to),]
works!
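The reason is that from and to were stored as factors with different level sets, and R refuses to compare such factors with ==. If the columns are already factors, a sketch of a workaround is to compare them as character; the name keep is illustrative only:

```r
# Factor columns, as data.frame() created by default before R 4.0
edge <- data.frame(
  from = c("John Doe", "John Doe", "John Doe", "Peter Gynn", "Peter Gynn", "Jolie Hope"),
  to = c("John Doe", "John Doe", "Peter Gynn", "Jolie Hope", "Peter Gynn", "Frank Smith"),
  stringsAsFactors = TRUE
)

# Comparing the factors directly would fail because their level sets differ,
# so convert both sides to character before comparing
keep <- as.character(edge$from) != as.character(edge$to)
non.self.ref <- edge[keep, ]
non.self.ref
```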
An option with dplyr is to filter the rows by comparing 'from' and 'to' (this gives the first output; it is not needed if we are only interested in the second output), then unlist, get the unique values, convert them to a tibble, and do a left_join.
library(dplyr)
edge %>%
filter(from != to) %>% #get the results for the first question
unlist %>%
unique %>%
tibble(name = .) %>%
left_join(actors) # second output
# A tibble: 4 x 3
# name age postcode
# <chr> <dbl> <fctr>
#1 John Doe 26.0 4011
#2 Peter Gynn 32.0 5600
#3 Jolie Hope 56.0 7700
#4 Frank Smith NA <NA>
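For readers who prefer to avoid extra packages, both steps can also be sketched in base R. This assumes character columns (stringsAsFactors = FALSE); the names non_self_ref, all_names and actors_full are illustrative only:

```r
actors <- data.frame(
  name = c("John Doe", "Peter Gynn", "Jolie Hope"),
  age = c(26, 32, 56),
  postcode = c("4011", "5600", "7700"),
  stringsAsFactors = FALSE
)
edge <- data.frame(
  from = c("John Doe", "John Doe", "John Doe", "Peter Gynn", "Peter Gynn", "Jolie Hope"),
  to = c("John Doe", "John Doe", "Peter Gynn", "Jolie Hope", "Peter Gynn", "Frank Smith"),
  stringsAsFactors = FALSE
)

# Part 1: drop self references
non_self_ref <- edge[edge$from != edge$to, ]

# Part 2: every name seen in either column, left-joined to the actor attributes
all_names <- data.frame(name = union(edge$from, edge$to), stringsAsFactors = FALSE)
actors_full <- merge(all_names, actors, by = "name", all.x = TRUE)
actors_full
```

Note that merge sorts the result by name, so the row order differs from the tidyverse output above.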
