How do I update data from an incomplete lookup table? - r

I have a table that uses unique IDs but inconsistent readable names for those IDs. It is more complex than month names, but for the sake of a simpler example, let's say it looks something like this:
demo_frame <- read.table(text=" Month_id Month_name Number
1 Jan 37
2 Feb 63
3 March 9
3 Mar 150
2 February 49", header=TRUE)
Except that they might have spelled "Feb" or "March" eight different ways. I also have a clean data frame that contains consistent names for the names that have variations:
month_lookup <- read.table(text=" Month_id Month_name
2 Feb
3 Mar", header=TRUE)
I want to get to this:
1 Jan 37
2 Feb 63
3 Mar 9
3 Mar 150
2 Feb 49
I tried merge(month_lookup, demo_frame, by = "Month_id") but that dropped all the January values because "Jan" doesn't exist in the lookup table:
Month_id Month_name.x Month_name.y Number
1 2 Feb Feb 63
2 2 Feb February 49
3 3 Mar March 9
4 3 Mar Mar 150
My read of "How to replace data.frame column names with string in corresponding lookup table in R" is that I ought to be able to use plyr::mapvalues, but I'm unclear from the examples and documentation on how I'd map the ID to the name. I don't just want to say "Replace 'March' with 'Mar'"; I need to say SET month_name = 'Mar' WHERE month_id = 3 for each value in the lookup.

I think you want this.
library(dplyr)
demo_frame <- read.table(text=" Month_id Month_name Number
1 Jan 37
2 Feb 63
3 March 9
3 Mar 150
2 February 49", header=TRUE, stringsAsFactors = FALSE)
month_lookup <- read.table(text=" Month_id Month_name
2 Feb
3 Mar", header=TRUE, stringsAsFactors = FALSE)
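# Keep the original spelling as bad_month, pull in the clean name by ID,
# and fall back to the original where the lookup has no entry (e.g. Jan).
result <-
  demo_frame %>%
  rename(bad_month = Month_name) %>%
  left_join(month_lookup, by = "Month_id") %>%
  mutate(month_fix = ifelse(is.na(Month_name), bad_month, Month_name))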
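If you would rather not add a dependency, the same "set the name where the ID matches" update can be sketched in base R with match(). This is only a minimal sketch using the two data frames from the question; idx and found are helper names introduced here for illustration.

```r
demo_frame <- read.table(text = "Month_id Month_name Number
1 Jan 37
2 Feb 63
3 March 9
3 Mar 150
2 February 49", header = TRUE, stringsAsFactors = FALSE)

month_lookup <- read.table(text = "Month_id Month_name
2 Feb
3 Mar", header = TRUE, stringsAsFactors = FALSE)

# Position of each row's ID in the lookup; NA where the ID is absent (Jan).
idx <- match(demo_frame$Month_id, month_lookup$Month_id)

# Overwrite only the rows that have a match, so unmatched rows keep their name.
found <- !is.na(idx)
demo_frame$Month_name[found] <- month_lookup$Month_name[idx[found]]

demo_frame
```

Because only matched rows are overwritten, IDs missing from the lookup keep whatever name they already had.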

Related

how to sum conditional functions to grouped rows in R

I have the following data frame:
customerid  payment_month  payment_date  bill_month  charges
1           January        22            January     30
1           February       15            February    21
1           March          2             March       33
1           May            4             April       43
1           May            4             May         23
1           June           13            June        32
2           January        12            January     45
2           February       15            February    56
2           March          2             March       67
2           April          4             April       65
2           May            4             May         54
2           June           13            June        68
3           January        25            January     45
3           February       26            February    56
3           March          30            March       67
3           April          1             April       65
3           June           1             May         54
3           June           1             June        68
(The real data has many more IDs.) I want to calculate payment efficiency using the following formula:
efficiency = (amount paid not late / total bill amount)*100
"Not late" means paying no later than the 21st day of the bill's month (paying January's bill on the 22nd of January is considered late).
I want to calculate the efficiency of each customer, with the expected output of
customerid  effectivity
1           59.90
2           100
3           37.46
I have tried the following code to calculate it for one ID, and it works. But I want to apply it to every ID and summarize the result into one column (effectivity) with one row per ID. I have tried group_by, aggregate and ifelse, but nothing works. What should I do?
# late payments for one customer
df1 <- filter(df, customerid == 1 & (payment_month != bill_month | payment_date > 21))
# all bills for the same customer
df2 <- filter(df, customerid == 1)
x <- sum(df1$charges)
y <- sum(df2$charges)
100 - (x/y)*100
An option using dplyr
library(dplyr)
df %>%
group_by(customerid) %>%
summarise(
effectivity = sum(
charges[payment_date <= 21 & payment_month == bill_month]) / sum(charges) * 100,
.groups = "drop")
## A tibble: 3 x 2
#customerid effectivity
# <int> <dbl>
#1 1 59.9
#2 2 100
#3 3 37.5
# Alternative: convert the month names to numbers first, then flag on-time rows
df %>%
  group_by(customerid) %>%
  mutate(pay_month_number = match(payment_month, month.name),
         bill_month_number = match(bill_month, month.name),
         # on time: paid in the bill's month, no later than the 21st
         nolate = pay_month_number == bill_month_number & payment_date <= 21) %>%
  summarise(efficiency = sum(charges[nolate]) / sum(charges) * 100)
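For readers who want to check the numbers without dplyr, here is a rough base R sketch of the same calculation. It rebuilds the table from the question by hand; on_time, paid_on_time and total are names introduced here for illustration, and a payment counts as on time only if it is made in the bill's month by the 21st, as in the question's own code.

```r
months <- c("January", "February", "March", "April", "May", "June")

df <- data.frame(
  customerid    = rep(1:3, each = 6),
  payment_month = c("January", "February", "March", "May", "May", "June",
                    "January", "February", "March", "April", "May", "June",
                    "January", "February", "March", "April", "June", "June"),
  payment_date  = c(22, 15, 2, 4, 4, 13,
                    12, 15, 2, 4, 4, 13,
                    25, 26, 30, 1, 1, 1),
  bill_month    = rep(months, times = 3),
  charges       = c(30, 21, 33, 43, 23, 32,
                    45, 56, 67, 65, 54, 68,
                    45, 56, 67, 65, 54, 68),
  stringsAsFactors = FALSE
)

# TRUE where the bill was paid in its own month, on or before the 21st.
on_time <- df$payment_month == df$bill_month & df$payment_date <= 21

# Sum on-time charges and all charges per customer, then take the ratio.
paid_on_time <- tapply(df$charges * on_time, df$customerid, sum)
total        <- tapply(df$charges, df$customerid, sum)
effectivity  <- paid_on_time / total * 100

data.frame(customerid = names(effectivity), effectivity = as.numeric(effectivity))
```

Multiplying charges by the logical on_time zeroes out late payments, so a single tapply() per quantity is enough.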

Giving month names to a variable of numbers in R

I have a data set with the variable 'months' from 1 to 12, but I need to change them to the month names, i.e. "1" needs to be January and so on. What's the easiest way to do this?
R has a built-in vector called month.name. For your purpose, you could do something like the following:
# Some dummy data
set.seed(1)
df <- data.frame(
month = sample(1:12, size = 10)
)
# Now use your integer month to subset month.name
df$month2 <- month.name[df$month] # Also has month.abb
df
month month2
1 9 September
2 4 April
3 7 July
4 1 January
5 2 February
6 5 May
7 3 March
8 8 August
9 6 June
10 11 November

How to add a vector to be a column using dplyr (examples given)?

I've got some data. I want to add a column, but not in the regular way.
data <- data.frame(month_num = 1:12, month_name = month.abb)
data
month_num month_name
1 1 Jan
2 2 Feb
3 3 Mar
4 4 Apr
5 5 May
6 6 Jun
7 7 Jul
8 8 Aug
9 9 Sep
10 10 Oct
11 11 Nov
12 12 Dec
Now, I want to add a third column to this data. For example I want to make the following vector a column within data:
sentiment = c(rep("cold", 3), rep("hot", 6), rep("cold", 3))
What I would normally do (in base R) is one of the following:
Add it using $
data$sentiment <- sentiment
Add it via column index creation
data[,3] <- sentiment
Add it in initial creation
data.frame(month_num = 1:12, month_name = month.abb, sentiment = sentiment)
Yes, data.table also has this nicely done within its reference semantics.
library(data.table)
data <- data.table(month_num = 1:12, month_name = month.abb)
data[,`:=`(sentiment = sentiment)]
data
month_num month_name sentiment
1: 1 Jan cold
2: 2 Feb cold
3: 3 Mar cold
4: 4 Apr hot
5: 5 May hot
6: 6 Jun hot
7: 7 Jul hot
8: 8 Aug hot
9: 9 Sep hot
10: 10 Oct cold
11: 11 Nov cold
12: 12 Dec cold
However, I don't want to add it in this way. I want to use dplyr related functions to do this task. Is there any function within dplyr that will let me perform this task of column creation?
NOTE: mutate() does not seem to work (at least as far as I know right now).
data%>%
mutate(sentiment = sentiment)
month_num month_name V3 sentiment
1 1 Jan cold cold
2 2 Feb cold cold
3 3 Mar cold cold
4 4 Apr hot hot
5 5 May hot hot
6 6 Jun hot hot
7 7 Jul hot hot
8 8 Aug hot hot
9 9 Sep hot hot
10 10 Oct cold cold
11 11 Nov cold cold
12 12 Dec cold cold
As you can see the column is duplicated and I'm not really sure why that's happening. Perhaps it has to do with the number of unique values in sentiment?
All in all, is there a way to accomplish this within dplyr using mutate() or other related functions?
The simplest way I know is using the function case_when:
data <- data.frame(month_num = 1:12, month_name = month.abb)
data
sentiment = c(rep("cold", 3), rep("hot", 6), rep("cold", 3))
data <- data %>%
mutate(sentiment=case_when(
month_num<=3 | month_num>=10 ~ "cold",
month_num>=4 & month_num<=9 ~ "hot"
))
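As for the duplicated column in the question: it appears to come from the earlier data[,3] <- sentiment line rather than from mutate() itself. The sketch below assumes the base R assignment was run before the mutate() call (the question does not say so explicitly) and uses fresh as an illustrative name for a clean copy.

```r
library(dplyr)

sentiment <- c(rep("cold", 3), rep("hot", 6), rep("cold", 3))

# Reproduce the earlier base R step: assigning to a third column that does
# not exist yet creates one with the default name V3.
data <- data.frame(month_num = 1:12, month_name = month.abb)
data[, 3] <- sentiment
names(data)

# Starting again from a fresh two-column data frame, mutate() adds the
# vector once, as a column called sentiment.
fresh <- data.frame(month_num = 1:12, month_name = month.abb) %>%
  mutate(sentiment = sentiment)
names(fresh)
```

So on a data frame that has not already been modified, mutate(sentiment = sentiment) should be enough, as long as the vector has one element per row.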

arrange one below the other every 2 columns from data frame in R

Hi, I have a df as below, which shows dates and their respective values:
date 1_val date 2_val . . . . date n_val
2014 23 2014 33 . . . . 2014 34
2015 22 2016 12 . . . . 2016 99
I have tried hard-coding this to arrange the columns one below the other.
For columns 1 & 2:
a=1
b=2
names_2<-df[,c(a,b)]
colnames(names_2)[1]<-"Date"
names_2 <- names_2[!apply(is.na(names_2) | names_2 == "", 1, all),]
names_2<-melt(names_2,id=colnames(names_2)[1])
samp_out<-names_2
For columns 3 & 4:
a=3
b=4
names_2<-df[,c(a,b)]
colnames(names_2)[1]<-"Date"
names_2 <- names_2[!apply(is.na(names_2) | names_2 == "", 1, all),]
names_2<-melt(names_2,id=colnames(names_2)[1])
samp_out1<-names_2
and so on, up to n columns:
df1= rbind(samp_out,samp_out1,......samp_out_n)
output
date variable value
2014 1_val 23
2015 1_val 22
2014 2_val 33
2016 2_val 12
.
.
2014 n_val 34
2016 n_val 99
Thanks in advance
The function melt in the package data.table does that:
melt(df, id = "Date", measure = patterns("_val"))
You can specify the name of the variable to use as the id (Date in this case) and a pattern matching the variables whose values you want to keep. You can also supply a vector with all the variable names instead.
> DT <- data.table(Date = c(2014,2013), `1_val` = c(33, 32), Date = c(2014, 2013), `2_val` = c(65, 34))
> DT
Date 1_val Date 2_val
1: 2014 33 2014 65
2: 2013 32 2013 34
> melt(DT, id = "Date", measure = patterns("_val"))
Date variable value
1: 2014 1_val 33
2: 2013 1_val 32
3: 2014 2_val 65
4: 2013 2_val 34
You can use stack from base R,
setNames(data.frame(stack(df[c(TRUE, FALSE)])[1],
stack(df[c(FALSE, TRUE)])),
c('date', 'value', 'variable'))
# date value variable
#1 2014 33 1_val
#2 2013 32 1_val
#3 2014 65 2_val
#4 2013 34 2_val
Define the untidy rectangle
library(magrittr)
csv <- "date,1_val,date,2_val,date,3_val
2014,23,2014,33,2014,34
2015,22,2016,12,2016,99"
Read into a data frame, then transform into a long/eav rectangle.
ds_eav <- csv %>%
readr::read_csv() %>%
tibble::rownames_to_column(var="height") %>%
tidyr::gather(key=key, value=value, -height)
output:
# A tibble: 12 x 4
key index value height
<chr> <int> <int> <int>
1 date 1 2014 1
2 date 1 2015 2
3 value 1 23 1
4 value 1 22 2
5 date 2 2014 1
6 date 2 2016 2
7 value 2 33 1
8 value 2 12 2
9 date 3 2014 1
10 date 3 2016 2
11 value 3 34 1
12 value 3 99 2
Identify which rows are dates/values. Then shift up dates' index by 1.
ds_eav <- ds_eav %>%
dplyr::mutate(
index_val = sub("^(\\d+)_val$" , "\\1", key),
index_date = sub("^date_(\\d+)$", "\\1", key),
index_date = dplyr::if_else(key=="date", "0", index_date),
key = dplyr::if_else(grepl("^date(_\\d+)*", key), "date", "value"),
index = dplyr::if_else(key=="date", index_date, index_val),
index = as.integer(index),
index = index + dplyr::if_else(key=="date", 1L, 0L)
) %>%
dplyr::select(key, index, value, height)
Follow the advice of @jarko-dubbeldam and use spread/gather on the last step too:
ds_eav %>%
tidyr::spread(key=key, value=value)
output:
# A tibble: 6 x 4
index height date value
* <int> <int> <int> <int>
1 1 1 2014 23
2 1 2 2015 22
3 2 1 2014 33
4 2 2 2016 12
5 3 1 2014 34
6 3 2 2016 99
You can use paste0(index, "_val") to get your exact output. But I'd prefer to keep them as integers, so you can do math on them if necessary (e.g., max()).
edit 1: incorporate the advice & corrections of @jarko-dubbeldam and @hnskd.
edit 2: use rownames_to_column() in case the input isn't a balanced rectangle (e.g., one column doesn't have all the rows).
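For comparison with the hard-coded attempt in the question, the "take columns two at a time and stack them" idea can also be sketched as a plain base R loop over column positions. This is an illustrative sketch on a small two-pair version of the example data; pair_starts and long are names introduced here, and it assumes the columns always alternate date, value.

```r
# Small stand-in for the question's data; check.names = FALSE keeps the
# repeated "date" names and the "1_val"-style names as they are.
df <- data.frame(date = c(2014, 2015), `1_val` = c(23, 22),
                 date = c(2014, 2016), `2_val` = c(33, 12),
                 check.names = FALSE)

# Index of the first column of each (date, value) pair: 1, 3, 5, ...
pair_starts <- seq(1, ncol(df), by = 2)

# Build one long block per pair, then bind the blocks together.
long <- do.call(rbind, lapply(pair_starts, function(i) {
  data.frame(date     = df[[i]],
             variable = names(df)[i + 1],
             value    = df[[i + 1]],
             stringsAsFactors = FALSE)
}))

long
```

Working by position rather than by name sidesteps the duplicated date column names, at the cost of relying on the column order.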

How to complete missing values with Na in a list?

I have a data frame that has the following columns: Tree ID, month, values. For some months, there is no recorded data, therefore those months do not exist in the data frame. I have completed the list with the missing months but now I do not know how to insert NA in the value column for the added months.
Example:
Tree.Id: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Month: Jan, Feb, Mar, May, Jun, Jul, Sept, Oct, Nov, Dec
Values: 1,0,1,1,0,2,1,1,0,2
The following months are missing: Apr, Aug.
I added them with the code below, and now I want to put NA in the value column for those two added months.
Here is what I tried:
tree_ls <- list()
for (i in unique(data$Tree.ID)){
mon1 <- data$month[data$Tree.ID == i] ### extract the month for every Tree iD
mon <- min(mon1, na.rm=T):max(mon1, na.rm=T) # completes the numbers with the missing month
dat1 <- data$value[data$Tree.ID == i]
......
After this step, I do not know how to create a list that will add NA for all the added months that were missing, so I will have lists of the same length.
Thanks
This is an old post, but I have a pretty good solution for this:
To begin, your small reproducible code should probably be the following:
id <- 1:10
month <- c("Jan", "Feb", "Mar", "May", "Jun", "Jul", "Sept", "Oct", "Nov", "Dec")
value <- c(1,0,1,1,0,2,1,1,0,2)
df <- data.frame(id=id, month=month,value=value)
> head(df)
id month value
1 1 Jan 1
2 2 Feb 0
3 3 Mar 1
4 4 May 1
5 5 Jun 0
6 6 Jul 2
Now simply build the complete list of your domain, i.e. every month that should appear, so that the missing ones come out as NA.
completeMonths <- c("Jan", "Feb", "Mar", "Apr","May", "Jun", "Jul","Aug", "Sept", "Oct", "Nov", "Dec")
df2 <- data.frame(month=completeMonths)
> df2
month
1 Jan
2 Feb
3 Mar
4 Apr
5 May
6 Jun
7 Jul
8 Aug
9 Sept
10 Oct
11 Nov
12 Dec
Now we have a column with all the underlying values, so when we merge, we can fill the missing rows as NA with the following syntax:
merge(df, df2, by = "month", all = TRUE)
With our results as follows:
month id value
1 Dec 10 2
2 Feb 2 0
3 Jan 1 1
4 Jul 6 2
5 Jun 5 0
6 Mar 3 1
7 May 4 1
8 Nov 9 0
9 Oct 8 1
10 Sept 7 1
11 Apr NA NA
12 Aug NA NA
Hope this helps, data wrangling sucks.
When you say that you have a data frame with some months that have "no recorded data" and therefore "do not exist", the fact that they are in the data frame at all means they have some representation. I'm going to guess that by "do not exist" you mean that they are blank strings, such as "". If that's the case, you can replace the blank strings with NA values using mutate in the dplyr package and ifelse in the base package as follows:
library(dplyr);
data_with_nas <- mutate(data, value = ifelse(value=="", NA, value));
That reads as "change the data data frame such that its value cells are replaced with NA if they were a blank string, or kept as is otherwise."
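The loop in the question works per tree and fills the range between each tree's first and last month, so here is a rough sketch of that idea using merge(). The data below is made up for illustration (numeric months, two trees), and completed and full are names introduced here, not taken from the question.

```r
# Hypothetical input: tree 1 is missing month 3, tree 2 is missing month 4.
data <- data.frame(
  Tree.ID = c(1, 1, 1, 2, 2),
  month   = c(1, 2, 4, 3, 5),
  value   = c(1, 0, 1, 2, 1)
)

# For each tree, build every month between its first and last observation,
# then left-join the recorded values; months without a record get NA.
completed <- do.call(rbind, lapply(split(data, data$Tree.ID), function(d) {
  full <- data.frame(Tree.ID = d$Tree.ID[1],
                     month   = seq(min(d$month), max(d$month)))
  merge(full, d, by = c("Tree.ID", "month"), all.x = TRUE)
}))
rownames(completed) <- NULL

completed
```

Using all.x = TRUE keeps every row of the full month grid, which is what turns a missing month into an NA in the value column.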
