R - Creating/subsetting data frames within a loop or the apply family

I would like to subset an R data frame, keeping the rows where multiple variables (ending in the same suffix) do NOT contain NAs.
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary1 <- c(NA, 20400, 26800)
salary2 <- c(29045, NA, 78765)
date1 <- as.Date(c(NA,'2008-3-25','2007-3-14'))
date2 <- as.Date(c('2010-11-1',NA,'2007-3-14'))
employ.data <- data.frame(employee, salary1, salary2, date1, date2)
for (i in 1:2) {
assign("employ.data", i, subset(employ.data, !is.na(employ.data$date[i]) & !is.na(employ.data$salary[i])))
}
Final result would hopefully produce two separate dataframes looking something like:
employ.data1:
| employee | salary1 | salary2 | date1 | date2 |
| Peter Gynn | 20400 | NA | 2008-03-25 | NA |
| Jolie Hope | 26800 | 78765 | 2007-03-14 | 2007-03-14 |
employ.data2:
| employee | salary1 | salary2 | date1 | date2 |
| John Doe | NA | 29045 | NA | 2010-11-01 |
| Jolie Hope | 26800 | 78765 | 2007-03-14 | 2007-03-14 |
Thanks in advance!

(Up front: it's generally better to deal with a list of frames instead of using assign to dynamically create objects.)
numfields <- grep("[0-9]+$", colnames(employ.data), value = TRUE)
split(numfields, gsub(".*?([0-9]+)$", "\\1", numfields))
# $`1`
# [1] "salary1" "date1"
# $`2`
# [1] "salary2" "date2"
out <- lapply(split(numfields, gsub(".*?([0-9]+)$", "\\1", numfields)),
function(flds) employ.data[ complete.cases(subset(employ.data, select = flds)), ])
out
# $`1`
# employee salary1 salary2 date1 date2
# 2 Peter Gynn 20400 NA 2008-03-25 <NA>
# 3 Jolie Hope 26800 78765 2007-03-14 2007-03-14
# $`2`
# employee salary1 salary2 date1 date2
# 1 John Doe NA 29045 <NA> 2010-11-01
# 3 Jolie Hope 26800 78765 2007-03-14 2007-03-14
This will dynamically find all numbered fields, though admittedly it does not enforce "pairs".
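If pairs do need to be enforced, one option is to drop any suffix that lacks one of the expected stems before subsetting. The following is a minimal sketch under that assumption; the `stems` vector and the `employ.data<suffix>` list names are illustrative choices, not part of the answer above.

```r
employ.data <- data.frame(
  employee = c("John Doe", "Peter Gynn", "Jolie Hope"),
  salary1  = c(NA, 20400, 26800),
  salary2  = c(29045, NA, 78765),
  date1    = as.Date(c(NA, "2008-3-25", "2007-3-14")),
  date2    = as.Date(c("2010-11-1", NA, "2007-3-14"))
)

# Assumption: these are the stems that must all be present for a suffix.
stems <- c("salary", "date")

# Columns ending in digits, grouped by their numeric suffix.
numfields <- grep("[0-9]+$", colnames(employ.data), value = TRUE)
groups <- split(numfields, sub(".*[^0-9]", "", numfields))

# Keep only the suffixes for which every expected stem has a column.
has_all_stems <- vapply(
  names(groups),
  function(sfx) all(paste0(stems, sfx) %in% groups[[sfx]]),
  logical(1)
)
groups <- groups[has_all_stems]

# One data frame per suffix, holding the rows with no NA in that suffix's columns.
out <- lapply(groups, function(flds) {
  employ.data[complete.cases(employ.data[flds]), ]
})
names(out) <- paste0("employ.data", names(out))
```

With the example data both suffixes have a salary and a date column, so `out` has the two elements `employ.data1` and `employ.data2`.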

The non_na function returns the rows that have no missing values in the columns whose names end in i. It first builds those column names as cn, then uses complete.cases to get a logical vector indicating which rows have no NAs in those columns, and finally returns those rows from employ.data.
Assume that the only digits in the column names are at the end of the names and that the unique such numbers define the possible suffixes. The stems are the unique names that remain after removing the suffix.
We assume that column 1 is to be excluded from this calculation. We use gsub/unique to define suffixes and stems. Alternatively, hard code suffixes <- as.character(1:2) and stems <- c("salary", "date").
Finally, run non_na on each suffix, creating a list whose component for suffix i contains the data frame with the rows corresponding to i. L[[i]] will be the data frame for suffix i.
non_na <- function(i) {
cn <- paste0(stems, i)
employ.data[complete.cases(employ.data[cn]), ]
}
nms <- names(employ.data)[-1]
suffixes <- unique(gsub("\\D", "", nms)) # c("1", "2")
stems <- unique(gsub("\\d", "", nms)) # c("salary", "date")
L <- Map(non_na, suffixes); L
giving:
$`1`
employee salary1 salary2 date1 date2
2 Peter Gynn 20400 NA 2008-03-25 <NA>
3 Jolie Hope 26800 78765 2007-03-14 2007-03-14
$`2`
employee salary1 salary2 date1 date2
1 John Doe NA 29045 <NA> 2010-11-01
3 Jolie Hope 26800 78765 2007-03-14 2007-03-14
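Because the list is named by suffix, its elements can be looked up either by position or by name. If separate objects really are wanted, as in the question, `list2env` can turn a named list into variables. The sketch below assumes the hard-coded stems and suffixes mentioned above; the `employ.data<suffix>` names and the use of a fresh environment `env` are illustrative choices.

```r
employ.data <- data.frame(
  employee = c("John Doe", "Peter Gynn", "Jolie Hope"),
  salary1  = c(NA, 20400, 26800),
  salary2  = c(29045, NA, 78765),
  date1    = as.Date(c(NA, "2008-3-25", "2007-3-14")),
  date2    = as.Date(c("2010-11-1", NA, "2007-3-14"))
)

stems    <- c("salary", "date")
suffixes <- c("1", "2")

# Same idea as non_na: rows with no NA in the columns for suffix i.
L <- lapply(setNames(suffixes, suffixes), function(i) {
  employ.data[complete.cases(employ.data[paste0(stems, i)]), ]
})

# Look up by name (the suffix as a string) or by position.
second_by_name <- L[["2"]]
second_by_pos  <- L[[2]]

# Optional: create one object per list element.
# A new environment is used here to avoid cluttering the workspace;
# pass envir = globalenv() instead to create them at top level.
names(L) <- paste0("employ.data", names(L))
env <- new.env()
list2env(L, envir = env)
ls(env)
```

Keeping the list is usually easier to work with afterwards, since further steps can be applied to all frames with lapply.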

Related

Add multiple new columns to the dataset, based on another dataset's elements

I have the following products list
> products
# A tibble: 311 x 1
value
<fct>
1 NA
2 Alternativ Economy
3 Ambulant Balance
4 Ambulant Economy
5 Ambulant Premium
6 Ambulant 2
7 Ambulant 3
8 Ambulant 1
9 COMPLETA
10 HOSPITAL ECO
# ... with 301 more rows
and the following df
> df <- data.frame(employee = c('John Doe','Peter Gynn','Jolie Hope'),
+ salary = c(21000, 23400, 26800),
+ startdate = as.Date(c('2010-11-1','2008-3-25','2007-3-14')))
> df
employee salary startdate
1 John Doe 21000 2010-11-01
2 Peter Gynn 23400 2008-03-25
3 Jolie Hope 26800 2007-03-14
Now, I want to add the elements of the former (i.e. products) as variables of the latter (i.e. the df). I use
cbind(df, setNames(lapply(products, function(x) x = NA), products))
but I get an error. Can you suggest another way of doing this? What is wrong with my solution? Thanks in advance.
Here is one solution.
df <- data.frame(employee = c('John Doe','Peter Gynn','Jolie Hope'),
salary = c(21000, 23400, 26800),
startdate = as.Date(c('2010-11-1','2008-3-25','2007-3-14')))
products <- data.frame(value = c(NA, "Alternativ Economy", "COMPLETA"))
#products$value <- ifelse(is.na(products$value), "not_available", as.character(products$value))
cbind(df, `colnames<-`(data.frame(matrix(ncol = nrow(products), nrow = nrow(df))), products$value))
employee salary startdate NA Alternativ Economy COMPLETA
1 John Doe 21000 2010-11-01 NA NA NA
2 Peter Gynn 23400 2008-03-25 NA NA NA
3 Jolie Hope 26800 2007-03-14 NA NA NA
I question the wisdom of having NAs as column names, so I'd uncomment that one line of code in there to replace NAs with some character string instead.
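As for what goes wrong in the question's attempt: one likely cause is that lapply over a data frame (or tibble) iterates over its columns, not over the values in a column, so `lapply(products, ...)` yields a single element, and setNames is then handed the whole data frame as the names. A shorter alternative, sketched here with the three-row `products` stand-in from the answer, is to assign NA to a character vector of new column names. Dropping the NA product name is an assumption of this sketch.

```r
df <- data.frame(
  employee  = c("John Doe", "Peter Gynn", "Jolie Hope"),
  salary    = c(21000, 23400, 26800),
  startdate = as.Date(c("2010-11-1", "2008-3-25", "2007-3-14"))
)
products <- data.frame(value = c(NA, "Alternativ Economy", "COMPLETA"))

# Work on the values of the column, not on the data frame itself.
new_cols <- as.character(products$value)

# Assumption: an NA product should not become a column, so drop it.
new_cols <- new_cols[!is.na(new_cols)]

# Assigning to names that do not exist yet creates the columns, filled with NA.
df[new_cols] <- NA
names(df)
```

Column names containing spaces are kept as-is here, so they need backticks or `[[` when accessed later, e.g. df[["Alternativ Economy"]].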

Trim leading/trailing whitespaces from a data frame column where the column name comes as a variable

I have a data frame where the name of the column to be trimmed of whitespace comes in as a variable, and I am not able to resolve the variable to the column so that it can be trimmed.
salary <- c(21000, 23400, 26800)
startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))
employee <- c(' John Doe ',' Peter Gynn ',' Jolie Hope')
employ.data <- data.frame(employee, salary, startdate)
Here I try to trim the employee column, and I have tried dplyr:
employ.data %>% mutate(employee = trimws(employee))
which works.
However, if I say:
abc <- "employee"
and then
employ.data %>% mutate(abc= trimws(abc))
It doesn't work.
I have tried using get(abc) in this function but this doesn't work either.
I understand I can't use employ.data$abc when abc is a variable holding the column name.
INITIAL DATAFRAME
employee salary startdate
John Doe 21000 2010-11-01
Peter Gynn 23400 2008-03-25
Jolie Hope 26800 2007-03-14
FINAL DATAFRAME
employee salary startdate
John Doe 21000 2010-11-01
Peter Gynn 23400 2008-03-25
Jolie Hope 26800 2007-03-14
You can also use str_trim from stringr in the tidyverse.
employ.data %>%
mutate(abc = str_trim(employee))
Which is:
employee salary startdate abc
1 John Doe 21000 2010-11-01 John Doe
2 Peter Gynn 23400 2008-03-25 Peter Gynn
3 Jolie Hope 26800 2007-03-14 Jolie Hope
Use mutate_at
library(dplyr)
employ.data %>% mutate_at(abc, trimws)
# employee salary startdate
#1 John Doe 21000 2010-11-01
#2 Peter Gynn 23400 2008-03-25
#3 Jolie Hope 26800 2007-03-14
Or you can directly do, if you have only one column
employ.data[[abc]] <- trimws(employ.data[[abc]])
If there are multiple columns you can use lapply
employ.data[abc] <- lapply(employ.data[abc], trimws)
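The reason `mutate(abc = trimws(abc))` does not do what was intended is that the left-hand `abc` is taken literally as the name of a new column, while the right-hand `abc` resolves to the string "employee" rather than to the column. In more recent dplyr versions, mutate_at is superseded by across. The sketch below assumes dplyr 1.0.0 or later and shows two ways to refer to a column whose name is stored in a variable; the result names `trimmed` and `trimmed2` are illustrative.

```r
library(dplyr)

employ.data <- data.frame(
  employee  = c(" John Doe ", " Peter Gynn ", " Jolie Hope"),
  salary    = c(21000, 23400, 26800),
  startdate = as.Date(c("2010-11-1", "2008-3-25", "2007-3-14")),
  stringsAsFactors = FALSE
)
abc <- "employee"

# across() + all_of(): apply trimws to the column(s) named in abc.
trimmed <- employ.data %>%
  mutate(across(all_of(abc), trimws))

# .data[[abc]] reads the column by name; !!abc := sets the output name.
trimmed2 <- employ.data %>%
  mutate(!!abc := trimws(.data[[abc]]))
```

The across form also works unchanged when abc holds several column names.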

For Each Loop to convert into numeric values [duplicate]

I have a question about the manipulation of a data frame. If I have this data frame as an example:
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(21000, 23400, 26800)
startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))
location <- c('New York', 'Alabama','New York')
employ.data <- data.frame(employee, salary, startdate, location)
employ.data
employee salary startdate location
1 John Doe 21000 2010-11-01 New York
2 Peter Gynn 23400 2008-03-25 Alabama
3 Jolie Hope 26800 2007-03-14 New York
Now I want to transform the location into numeric values. I know that I can do something like this:
transformlocation <- function(x) {
x <- as.character(x)
if (x =='New York'){
return('1')
}else if (x=='Alabama'){
return('2')
}else if (x=='Florida'){
return('3')
}else
return('0')
}
employ.data$location <- sapply(employ.data$location, transformlocation)
employ.data
employee salary startdate location
1 John Doe 21000 2010-11-01 1
2 Peter Gynn 23400 2008-03-25 2
3 Jolie Hope 26800 2007-03-14 1
But in my final dataset there are hundreds of different values. For example, is it possible to work with a for each statement here?
Thanks for your help!
If it is already a factor variable, then simply convert it to integer, i.e.
employ.data$location <- as.integer(employ.data$location)
employ.data
# employee salary startdate location
#1 John Doe 21000 2010-11-01 2
#2 Peter Gynn 23400 2008-03-25 1
#3 Jolie Hope 26800 2007-03-14 2
Otherwise convert to factor and then integer, i.e.
employ.data$location <- as.integer(as.factor(employ.data$location))
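Note that the factor route numbers the values in sorted order of the levels (Alabama = 1, New York = 2), which differs from the hand-written mapping in the question. If the numbering matters, the following sketch shows two alternatives: numbering by order of first appearance with match, and an explicit lookup vector with a fallback of 0. The extra "Texas" value and the `lookup` vector are illustrative assumptions.

```r
location <- c("New York", "Alabama", "New York", "Texas")

# Codes follow the sorted factor levels: Alabama, New York, Texas.
codes_alpha <- as.integer(factor(location))

# Codes follow the order in which each value first appears.
codes_first <- match(location, unique(location))

# Explicit mapping, as in the question; unknown values become 0.
lookup <- c("New York" = 1L, "Alabama" = 2L, "Florida" = 3L)
codes_lookup <- unname(lookup[location])
codes_lookup[is.na(codes_lookup)] <- 0L
```

All three are vectorised, so no loop or sapply over the rows is needed, even with hundreds of distinct values.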

How to groupby column value using R programming

I have a table
Employee Details:
EmpID | WorkingPlaces | Salary
1001 | Bangalore | 5000
1001 | Chennai | 6000
1002 | Bombay | 1000
1002 | Chennai | 500
1003 | Pune | 2000
1003 | Mangalore | 1000
The same employee works at different places in a month. How can I find the top 2 highest-paid employees?
The result table should look like
EmpID | WorkingPlaces | Salary
1001 | Chennai | 6000
1001 | Bangalore | 5000
1003 | Pune | 2000
1003 | Mangalore | 1000
My code: in R language
knime.out <- aggregate(x= $"EmpID", by = list(Thema = $"WorkingPlaces", Project = $"Salary"), FUN = "length") [2]
But this doesn't give me the expected result. Kindly help me to correct the code.
We can try with dplyr
library(dplyr)
df1 %>%
group_by(EmpID) %>%
mutate(SumSalary = sum(Salary)) %>%
arrange(-SumSalary, EmpID) %>%
head(4) %>%
select(-SumSalary)
A base R solution, taking your data frame as df. We first aggregate the data by EmpID and calculate the sum of Salary for each employee. Then we select the 2 EmpIDs with the highest total salary and subset the original data frame to those IDs using %in%.
temp <- aggregate(Salary~EmpID, df, sum)
df[df$EmpID %in% temp$EmpID[tail(order(temp$Salary), 2)], ]
# EmpID WorkingPlaces Salary
#1 1001 Bangalore 5000
#2 1001 Chennai 6000
#5 1003 Pune 2000
#6 1003 Mangalore 1000
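The dplyr answer's head(4) relies on each employee having exactly two rows, and the base R output keeps the original row order rather than the order asked for. A minimal base R sketch that avoids both, assuming the table from the question is stored in `df` (the names `totals`, `top_ids` and `res` are illustrative):

```r
df <- data.frame(
  EmpID         = c(1001, 1001, 1002, 1002, 1003, 1003),
  WorkingPlaces = c("Bangalore", "Chennai", "Bombay", "Chennai", "Pune", "Mangalore"),
  Salary        = c(5000, 6000, 1000, 500, 2000, 1000),
  stringsAsFactors = FALSE
)

# Total salary per employee.
totals <- aggregate(Salary ~ EmpID, df, sum)

# IDs of the two employees with the highest totals, best first.
top_ids <- head(totals$EmpID[order(-totals$Salary)], 2)

# All rows for those employees, however many each one has.
res <- df[df$EmpID %in% top_ids, ]

# Order by employee rank, then by salary within each employee (descending).
res <- res[order(match(res$EmpID, top_ids), -res$Salary), ]
res
```

With the example data this lists employee 1001 (Chennai, then Bangalore) followed by employee 1003 (Pune, then Mangalore).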

Concatenate rows in R depending on specific row value range

I have two data frames:
df
set.seed(10)
df <- data.frame(Name = c("Bob","John","Jane","John","Bob","Jane","Jane"),
Date=as.Date(c("2014-06-04", "2013-12-04", "2013-11-04" , "2013-12-06" ,
"2014-01-09", "2014-03-21", "2014-09-24")), Degrees= rnorm(7, mean=32, sd=32))
Name | Date | Degrees
Bob | 2014-06-04 | 50.599877
John | 2013-12-04 | 44.103919
Jane | 2013-11-04 | 6.117422
John | 2013-12-06 | 30.826633
Bob | 2014-01-09 | 59.425444
Jane | 2014-03-21 | 62.473418
Jane | 2014-09-24 | 11.341562
df2
df2 <- data.frame(Name = c("Bob","John","Jane"),
Date=as.Date(c("2014-03-01", "2014-01-20", "2014-06-07")),
Weather = c("Good weather","Bad weather", "Good weather"))
Name | Date | Weather
Bob | 2014-03-01 | Good weather
John | 2014-01-20 | Bad weather
Jane | 2014-06-07 | Good weather
I would like to extract the following:
Name | Date | Weather | Degrees (until this Date) | Other measures
Bob | 2014-03-01 | Good weather | 59.425444 | 50.599877
John | 2014-01-20 | Bad weather | 44.103919, 30.826633 |
Jane | 2014-06-07 | Good weather | 6.117422, 62.473418 | 11.341562
Which is a merge between both df and df2, with:
"Degrees (until this Date)" concatenates from df$Degrees up until the date of df2$Date;
the value of "Other measures" is whatever measures are on df$Degrees after the date of df2$Date.
Another alternative:
#a grouping variable to use for identical splitting
nms = unique(c(as.character(df$Name), as.character(df2$Name)))
#split data
dates = split(df$Date, factor(df$Name, nms))
degrees = split(df$Degrees, factor(df$Name, nms))
thresholds = split(df2$Date, factor(df2$Name, nms))
#mapply the condition
res = do.call(rbind.data.frame,
Map(function(date, thres, deg)
tapply(deg, factor(date <= thres, c(TRUE, FALSE)),
paste0, collapse = ", "),
dates, thresholds, degrees))
#bind with df2
cbind(df2, setNames(res[match(df2$Name, row.names(res)), ], c("Degrees", "Other")))
# Name Date Weather Degrees Other
#Bob Bob 2014-03-01 Good weather 41.4254440501603 32.5998774701384
#John John 2014-01-20 Bad weather 26.10391865379, 12.826633094921 <NA>
#Jane Jane 2014-06-07 Good weather -11.8825775975204, 44.4734176224054 -6.65843761374357
Here's one approach:
library(dplyr)
library(tidyr)
library(magrittr)
res <-
left_join(df, df2 %>% select(Name, Date, Weather), by = "Name") %>%
mutate(paste = factor(Date.x <= Date.y, labels = c("before", "other"))) %>%
group_by(Name, paste) %>%
mutate(Degrees = paste(Degrees, collapse = ", ")) %>%
distinct() %>%
spread(paste, Degrees) %>%
group_by(Name, Date.y, Weather) %>%
summarise(other = other[1], before = before[2]) %>%
set_names(c("Name", "Date" , "Weather", "Degrees (until this Date)" , "Other measures"))
res[is.na(res)] <- ""
res
# Name Date Weather Degrees (until this Date) Other measures
# 1 Bob 2014-03-01 Good weather 41.4254440501603 32.5998774701384
# 2 Jane 2014-06-07 Good weather -11.8825775975204, 44.4734176224054 -6.65843761374357
# 3 John 2014-01-20 Bad weather 26.10391865379, 12.826633094921
There may be room for improvements, but anyway.
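For comparison, the same split at the cutoff date can be sketched in base R with merge and a small helper. This is only one possible formulation; the helper `collapse_by_name` and the output column names `Degrees` and `Other` are illustrative, and the order of values within a cell follows the row order of the merged data.

```r
set.seed(10)
df <- data.frame(
  Name    = c("Bob", "John", "Jane", "John", "Bob", "Jane", "Jane"),
  Date    = as.Date(c("2014-06-04", "2013-12-04", "2013-11-04", "2013-12-06",
                      "2014-01-09", "2014-03-21", "2014-09-24")),
  Degrees = rnorm(7, mean = 32, sd = 32)
)
df2 <- data.frame(
  Name    = c("Bob", "John", "Jane"),
  Date    = as.Date(c("2014-03-01", "2014-01-20", "2014-06-07")),
  Weather = c("Good weather", "Bad weather", "Good weather")
)

# Attach each person's cutoff date to every measurement.
m <- merge(df, df2, by = "Name", suffixes = c(".obs", ".cut"))
is_before <- m$Date.obs <= m$Date.cut

# For each name in df2, paste together the selected measurements.
# An empty selection yields "".
collapse_by_name <- function(keep) {
  vapply(
    as.character(df2$Name),
    function(nm) paste(m$Degrees[keep & m$Name == nm], collapse = ", "),
    character(1)
  )
}

res <- data.frame(
  df2,
  Degrees = collapse_by_name(is_before),
  Other   = collapse_by_name(!is_before),
  row.names = NULL,
  stringsAsFactors = FALSE
)
res
```

The result has one row per row of df2, in the same order, with an empty string where no measurement falls after the cutoff date.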
