How to merge 2 columns within the same dataset in R - r

I am trying to merge 2 columns within the same dataset in order to condense the number of columns.
The dataset currently looks like this:
Year Var1 Var2
2014 NA 123
2014 NA 155
2015 541 NA
2015 432 NA
2016 NA 124
etc
I wish the dataset to look like
Year Var1/2
2014 123
2014 155
2015 541
2015 432
2016 124
Any help is greatly appreciated.

You should be able to just use with(mydf, pmax(Var1, Var2, na.rm = TRUE)).
Here's a sample data.frame. Note row 5.
mydf <- structure(list(Year = c(2014L, 2014L, 2015L, 2015L, 2016L), Var1 = c(NA,
NA, 541L, 432L, NA), Var2 = c(123L, 155L, NA, NA, NA)), .Names = c("Year",
"Var1", "Var2"), row.names = c(NA, 5L), class = "data.frame")
mydf
## Year Var1 Var2
## 1 2014 NA 123
## 2 2014 NA 155
## 3 2015 541 NA
## 4 2015 432 NA
## 5 2016 NA NA
with(mydf, pmax(Var1, Var2, na.rm = TRUE))
## [1] 123 155 541 432 NA
Assign it to a column and you're good to go.
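For completeness, here is a minimal sketch of that last step. The column name Var12 and the object names condensed and coalesced are made up for illustration. Note that pmax returns the larger value when both columns are filled, so the ifelse line shows a hedged alternative that prefers Var1 and falls back to Var2.

```r
# Sample data in the same shape as the question (row 5 has NA in both columns).
mydf <- data.frame(
  Year = c(2014L, 2014L, 2015L, 2015L, 2016L),
  Var1 = c(NA, NA, 541L, 432L, NA),
  Var2 = c(123L, 155L, NA, NA, NA)
)

# Store the row-wise maximum (ignoring NA) in a new column.
# "Var12" is a hypothetical name chosen for this sketch.
mydf$Var12 <- with(mydf, pmax(Var1, Var2, na.rm = TRUE))

# Keep only the condensed columns.
condensed <- mydf[c("Year", "Var12")]

# Alternative: take Var1 where it is present, otherwise Var2.
coalesced <- with(mydf, ifelse(is.na(Var1), Var2, Var1))
```

On this data both approaches give the same result, because no row has a value in both columns.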

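Can the paste function help? Note that paste() returns character strings that include the NA (e.g. "NA 123"), and the column name needs backticks because of the slash:
df$`Var1/2` <- paste(df$Var1, df$Var2)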

Related

selecting the column with the maximum value

I have a dataframe in the wide format such as below:
Subject Volume.1 Volume.2 Volume.3 Volume.4
1 77 22 1 NA
2 65 182 NA NA
3 98 NA NA NA
4 66 76 145 677
I want to keep Volume.1 and add a column holding the largest of the later volumes (Volume.2 to Volume.4 in the example), irrespective of which column it came from, but I am struggling to code this correctly. Some of the columns are NA when a subject does not have a recording at that point.
For instance with the above example the table would look like:
Subject Volume.1 Worst volume
1 77 22
2 65 182
3 98 NA
4 66 677
I was wondering if anyone could help?
We may use pmax
cbind(df[1:2], WorseVolume = do.call(pmax, c(df[3:5], na.rm = TRUE)))
-output
Subject Volume.1 WorseVolume
1 1 77 22
2 2 65 182
3 3 98 NA
4 4 66 677
data
df <- structure(list(Subject = 1:4, Volume.1 = c(77L, 65L, 98L, 66L
), Volume.2 = c(22L, 182L, NA, 76L), Volume.3 = c(1L, NA, NA,
145L), Volume.4 = c(NA, NA, NA, 677L)), class = "data.frame", row.names = c(NA,
-4L))
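If the number of volume columns can change, selecting them by position (df[3:5]) becomes fragile. The sketch below picks them by name instead; it assumes every volume column starts with "Volume." and that Volume.1 is the one to keep as-is. The names later_cols, worst and result are made up for illustration.

```r
df <- data.frame(
  Subject  = 1:4,
  Volume.1 = c(77L, 65L, 98L, 66L),
  Volume.2 = c(22L, 182L, NA, 76L),
  Volume.3 = c(1L, NA, NA, 145L),
  Volume.4 = c(NA, NA, NA, 677L)
)

# All volume columns except the first one, selected by name.
later_cols <- setdiff(grep("^Volume\\.", names(df), value = TRUE), "Volume.1")

# Row-wise maximum across those columns; rows that are all NA stay NA.
worst <- do.call(pmax, c(df[later_cols], na.rm = TRUE))

result <- cbind(df[c("Subject", "Volume.1")], WorstVolume = worst)
```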

How should I impute NA with mean of that column, not just replacing the na value? [closed]

I have a dataset that looks like this:
TYPE YEAR NUMBERS
A 2020 60
A 2019 NA
A 2018 88
A 2017 NA
A 2016 90
I want to impute the missing values with the mean of the values in the NUMBERS column.
I have searched a lot of tutorials, but they just directly replace the missing value with the mean, which is not what I want. I tried using mice and Hmisc, but they produced errors. Is there a better way to do this? Thanks!
I'd have done this :
df <- read.table(text = 'TYPE YEAR NUMBERS
A 2020 60
A 2019 NA
A 2018 88
A 2017 NA
A 2016 90', header = TRUE)
a <- mean(df$NUMBERS, na.rm = TRUE)
df[is.na(df$NUMBERS), "NUMBERS"] <- a
df
Output:
TYPE YEAR NUMBERS
1 A 2020 60.00000
2 A 2019 79.33333
3 A 2018 88.00000
4 A 2017 79.33333
5 A 2016 90.00000
Is it what you wanted?
I'm inferring from the presence of the TYPE column that you should be imputing based on the group's mean, not the population's mean.
Modified data:
dat <- structure(list(TYPE = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"), YEAR = c(2020L, 2019L, 2018L, 2017L, 2016L, 2020L, 2019L, 2018L, 2017L, 2016L), NUMBERS = c(60L, NA, 88L, NA, 90L, 160L, NA, 188L, NA, 190L)), class = "data.frame", row.names = c(NA, -10L))
base R
do.call(rbind, by(dat, dat$TYPE,
function(z) { z$NUMBERS[is.na(z$NUMBERS)] <- mean(z$NUMBERS, na.rm = TRUE); z}))
# TYPE YEAR NUMBERS
# A.1 A 2020 60.00000
# A.2 A 2019 79.33333
# A.3 A 2018 88.00000
# A.4 A 2017 79.33333
# A.5 A 2016 90.00000
# B.6 B 2020 160.00000
# B.7 B 2019 179.33333
# B.8 B 2018 188.00000
# B.9 B 2017 179.33333
# B.10 B 2016 190.00000
or
do.call(rbind, by(dat, dat$TYPE,
function(z) transform(z, NUMBERS = ifelse(is.na(NUMBERS), mean(NUMBERS, na.rm = TRUE), NUMBERS))))
dplyr
library(dplyr)
dat %>%
group_by(TYPE) %>%
mutate(NUMBERS = if_else(is.na(NUMBERS), mean(NUMBERS, na.rm = TRUE), as.numeric(NUMBERS))) %>%
ungroup()
# # A tibble: 10 x 3
# TYPE YEAR NUMBERS
# <chr> <int> <dbl>
# 1 A 2020 60
# 2 A 2019 79.3
# 3 A 2018 88
# 4 A 2017 79.3
# 5 A 2016 90
# 6 B 2020 160
# 7 B 2019 179.
# 8 B 2018 188
# 9 B 2017 179.
# 10 B 2016 190
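Another base R option, offered only as a sketch: ave() applies a function within each group and returns the result in the original row order, so the row names are not changed the way they are with do.call(rbind, by(...)). The helper name impute_mean is made up for illustration, and the data is the same modified data as above with NUMBERS stored as numeric.

```r
dat <- data.frame(
  TYPE    = rep(c("A", "B"), each = 5),
  YEAR    = rep(2020:2016, times = 2),
  NUMBERS = c(60, NA, 88, NA, 90, 160, NA, 188, NA, 190)
)

# Hypothetical helper: replace the NAs of a vector with its own mean.
impute_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# Apply the helper separately within each TYPE group.
dat$NUMBERS <- ave(dat$NUMBERS, dat$TYPE, FUN = impute_mean)
```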

How to remove duplicates if specific column has value in r

I need to delete some rows in my dataset based on the given condition.
Kindly go through the sample data for reference.
ID Date Dur
123 01/05/2000 3
123 08/04/2002 6
564 04/04/2012 2
741 01/08/2011 5
789 02/03/2009 1
789 08/01/2010 NA
789 05/05/2011 NA
852 06/06/2015 3
852 03/02/2016 NA
155 03/02/2008 NA
155 01/01/2009 NA
159 07/07/2008 NA
My main concern is the Dur column. Within each ID group that has more than one row, I have to delete the rows where Dur is not NA.
That is, IDs 123, 789 and 852 have more than one row and contain Dur values, so I need to remove their rows with a Dur value: every row of ID 123, and the first row of IDs 789 and 852.
I don't want to delete IDs that have a single row with a Dur value (564, 741), or any rows where Dur is NA.
Expected Output:
ID Date Dur
564 04/04/2012 2
741 01/08/2011 5
789 08/01/2010 NA
789 05/05/2011 NA
852 03/02/2016 NA
155 03/02/2008 NA
155 01/01/2009 NA
159 07/07/2008 NA
Kindly suggest a code to solve the issue.
Thanks in Advance!
One way would be to keep the rows where the group has only one row or where Dur is NA.
This can be written in dplyr as :
library(dplyr)
df %>% group_by(ID) %>% filter(n() == 1 | is.na(Dur))
# ID Date Dur
# <int> <chr> <int>
#1 564 04/04/2012 2
#2 741 01/08/2011 5
#3 789 08/01/2010 NA
#4 789 05/05/2011 NA
#5 852 03/02/2016 NA
#6 155 03/02/2008 NA
#7 155 01/01/2009 NA
#8 159 07/07/2008 NA
Using data.table :
library(data.table)
setDT(df)[, .SD[.N == 1 | is.na(Dur)], ID]
and base R :
subset(df, ave(is.na(Dur), ID, FUN = function(x) length(x) == 1 | x))
data
df <- structure(list(ID = c(123L, 123L, 564L, 741L, 789L, 789L, 789L,
852L, 852L, 155L, 155L, 159L), Date = c("01/05/2000", "08/04/2002",
"04/04/2012", "01/08/2011", "02/03/2009", "08/01/2010", "05/05/2011",
"06/06/2015", "03/02/2016", "03/02/2008", "01/01/2009", "07/07/2008"
), Dur = c(3L, 6L, 2L, 5L, 1L, NA, NA, 3L, NA, NA, NA, NA)),
class = "data.frame", row.names = c(NA, -12L))
We can use .I in data.table
library(data.table)
setDT(df)[df[, .I[.N == 1 | is.na(Dur)], ID]$V1]
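To make the condition easier to follow, here is a step-by-step base R sketch of the same rule (keep a row if its ID occurs only once, or if its Dur is NA). The Date column is left out for brevity, and the names group_size, keep and result are made up for illustration.

```r
df <- data.frame(
  ID  = c(123L, 123L, 564L, 741L, 789L, 789L, 789L, 852L, 852L, 155L, 155L, 159L),
  Dur = c(3L, 6L, 2L, 5L, 1L, NA, NA, 3L, NA, NA, NA, NA)
)

# Number of rows in each ID group, repeated for every row of that group.
group_size <- ave(seq_along(df$ID), df$ID, FUN = length)

# Keep rows from single-row groups, plus any row where Dur is NA.
keep <- group_size == 1 | is.na(df$Dur)

result <- df[keep, ]
```

All rows of ID 123 are dropped because the group has two rows and neither Dur is NA.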

Problems of joining datasets on R

I have a dataset containing variables and a quantity of goods sold: for some days, however, there are no values.
I created a dataset with all 0 values in sales and all NA in the rest. How can I add those lines to the initial dataset?
At the moment, I have this:
sales
day month year employees holiday sales
1 1 2018 14 0 1058
2 1 2018 25 1 2174
4 1 2018 11 0 987
sales.NA
day month year employees holiday sales
1 1 2018 NA NA 0
2 1 2018 NA NA 0
3 1 2018 NA NA 0
4 1 2018 NA NA 0
I would like to create a new dataset, inserting the days where I have no observations, value 0 to sales, and NA on all other variables. Like this
new.data
day month year employees holiday sales
1 1 2018 14 0 1058
2 1 2018 25 1 2174
3 1 2018 NA NA 0
4 1 2018 11 0 987
I tried something like this:
merge(sales.NA,sales, all.y=T, by = c("day","month","year"))
But it does not work.
Using dplyr, you could use a "right_join". For example:
library(dplyr)
sales <- data.frame(day = c(1,2,4),
month = c(1,1,1),
year = c(2018, 2018, 2018),
employees = c(14, 25, 11),
holiday = c(0,1,0),
sales = c(1058, 2174, 987)
)
sales.NA <- data.frame(day = c(1,2,3,4),
month = c(1,1,1,1),
year = c(2018,2018,2018, 2018)
)
right_join(sales, sales.NA)
This leaves you with
day month year employees holiday sales
1 1 1 2018 14 0 1058
2 2 1 2018 25 1 2174
3 3 1 2018 NA NA NA
4 4 1 2018 11 0 987
This leaves NA in sales where you want 0. That can be fixed by including the sales data in sales.NA, or by using replace_na from "tidyr":
library(tidyr)
right_join(sales, sales.NA) %>% mutate(sales = replace_na(sales, 0))
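The same idea can be sketched in base R with merge(). In the attempt from the question, all.y = T keeps every row of the second table (sales), so the missing day never appears; keeping every row of the table that lists all days is what is needed. The names keys and all_days are made up for illustration, and all_days holds only the key columns so that merge() does not create duplicated employees/holiday/sales columns.

```r
sales <- data.frame(
  day = c(1, 2, 4), month = 1, year = 2018,
  employees = c(14, 25, 11), holiday = c(0, 1, 0),
  sales = c(1058, 2174, 987)
)

keys <- c("day", "month", "year")

# One row per day that should appear in the result (key columns only).
all_days <- data.frame(day = 1:4, month = 1, year = 2018)

# all.x = TRUE keeps every row of all_days, matched or not.
new.data <- merge(all_days, sales, by = keys, all.x = TRUE)

# Days without observations get 0 sales; the other variables stay NA.
new.data$sales[is.na(new.data$sales)] <- 0
```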
Here is another data.table solution:
library(data.table)
jvars <- c("day","month","year")
merge(sales.NA[, ..jvars], sales, by = jvars, all.x = TRUE)[is.na(sales), sales := 0L][]
day month year employees holiday sales
1: 1 1 2018 14 0 1058
2: 2 1 2018 25 1 2174
3: 3 1 2018 NA NA 0
4: 4 1 2018 11 0 987
Or with some neater syntax:
sales[sales.NA[, ..jvars], on = jvars][is.na(sales), sales := 0][]
Reproducible data:
sales <- structure(list(day = c(1L, 2L, 4L), month = c(1L, 1L, 1L), year = c(2018L,
2018L, 2018L), employees = c(14L, 25L, 11L), holiday = c(0L,
1L, 0L), sales = c(1058L, 2174L, 987L)), row.names = c(NA, -3L
), class = c("data.table", "data.frame"))
sales.NA <- structure(list(day = 1:4, month = c(1L, 1L, 1L, 1L), year = c(2018L,
2018L, 2018L, 2018L), employees = c(NA, NA, NA, NA), holiday = c(NA,
NA, NA, NA), sales = c(0L, 0L, 0L, 0L)), row.names = c(NA, -4L
), class = c("data.table", "data.frame"))
This answer uses the data.table package, since I am more familiar with its syntax, but regular data.frames should work in much the same way. I would also switch to a proper date format, which will make life easier for you down the line.
With this approach you would not need the sales.NA table at all: days without sales simply show up as rows of NAs after the join.
library(data.table)
dt.dates <- data.table(Date = seq.Date(from = as.Date("2018-01-01"), to = as.Date("2018-12-31"),by = "day" ))
dt.sales <- data.table(day = c(1,2,4)
, month = c(1,1,1)
, year = c(2018,2018,2018)
, employees = c(14, 25, 11)
, holiday = c(0,1,0)
, sales = c(1058, 2174, 987)
)
dt.sales[, Date := as.Date(paste(year,month,day, sep = "-")) ]
merge( x = dt.dates
, y = dt.sales
, by.x = "Date"
, by.y = "Date"
, all.x = TRUE
)
> Date day month year employees holiday sales
1: 2018-01-01 1 1 2018 14 0 1058
2: 2018-01-02 2 1 2018 25 1 2174
3: 2018-01-03 NA NA NA NA NA NA
4: 2018-01-04 4 1 2018 11 0 987
...

Transform Year-to-date to Quarterly data with data.table

Quarterly data from a data provider has the issue that for some variables the quarterly data values are actually Year-to-date figures. That means the values are the sum of all previous quarters (Q2 = Q1 + Q2 , Q3 = Q1 + Q2 + Q3, ...).
The structure of the original data looks like the following:
library(data.table)
library(plyr)
dt.quarter.test <- structure(list(Year = c(2000L, 2000L, 2000L, 2000L, 2001L, 2001L, 2001L, 2001L)
, Quarter = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L)
, Data.Year.to.Date = c(162, 405, 610, 938, 331, 1467, 1981, 2501))
, .Names = c("Year", "Quarter", "Data.Year.to.Date"), class = c("data.table", "data.frame"), row.names = c(NA, -8L))
In order to calculate the quarterly values I therefore need to subtract the previous Quarter from Q2, Q3 and Q4.
I've managed to get the desired results by using the ddply function from the plyr package.
dt.quarter.result <- ddply(dt.quarter.test, "Year"
, transform
, Data.Quarterly = Data.Year.to.Date - shift(Data.Year.to.Date, n = 1L, type = "lag", fill = 0))
dt.quarter.result
Year Quarter Data.Year.to.Date Data.Quarterly
1 2000 1 162 162
2 2000 2 405 243
3 2000 3 610 205
4 2000 4 938 328
5 2001 1 331 331
6 2001 2 1467 1136
7 2001 3 1981 514
8 2001 4 2501 520
But I am not really happy with the command, since it seems quite clumsy. I would like some input on how to improve it and, especially, how to do it directly within the data.table.
Here is the data.table syntax; you might also find the data.table cheat sheet helpful:
library(data.table)
dt.quarter.test[, Data.Quarterly := Data.Year.to.Date - shift(Data.Year.to.Date, fill = 0), Year][]
# Year Quarter Data.Year.to.Date Data.Quarterly
# 1: 2000 1 162 162
# 2: 2000 2 405 243
# 3: 2000 3 610 205
# 4: 2000 4 938 328
# 5: 2001 1 331 331
# 6: 2001 2 1467 1136
# 7: 2001 3 1981 514
# 8: 2001 4 2501 520
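For comparison, a base R sketch of the same calculation without any package: within each year, the first quarter keeps its value and every later quarter is the difference to the previous one. This assumes the rows are ordered by Quarter within Year, which the explicit order() call enforces. The object name qdata is made up for illustration.

```r
qdata <- data.frame(
  Year    = rep(c(2000L, 2001L), each = 4),
  Quarter = rep(1:4, times = 2),
  Data.Year.to.Date = c(162, 405, 610, 938, 331, 1467, 1981, 2501)
)

# Make sure the quarters are in order within each year.
qdata <- qdata[order(qdata$Year, qdata$Quarter), ]

# Per year: first value as-is, then successive differences.
qdata$Data.Quarterly <- ave(
  qdata$Data.Year.to.Date, qdata$Year,
  FUN = function(x) c(x[1], diff(x))
)
```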

Resources