R - dynamically detect Excel column names formatted as dates (without slicing the data frame)

I am trying to detect columns whose names are Excel date serial numbers:
library(openxlsx)
df <- read.xlsx('path/df.xlsx', sheet=1, detectDates = T)
Which reads the data as follows:
# a b c 44197 44228 d
#1 1 1 NA 1 1 1
#2 2 2 NA 2 2 2
#3 3 3 NA 3 3 3
#4 4 4 NA 4 4 4
#5 5 5 NA 5 5 5
I tried to specify a fixed index slice and then transform those specific columns as follows:
names(df)[4:5] <- format(as.Date(as.numeric(names(df)[4:5]),
origin = "1899-12-30"), "%m/%d/%Y")
This works well when the df is sliced for those specific columns. Unfortunately, the column indices could change, say from names(df)[4:5] to names(df)[2:3] for example, in which case the fixed slice would coerce non-numeric names to NA instead of producing dates.
data:
Note: for this data the column names are read as X44197 and X44228 (data.frame() mangles numeric names by default), while read.xlsx() reads them as 44197 and 44228
df <- data.frame(a=rep(1:5), b=rep(1:5), c=NA, "44197"=rep(1:5), '44228'=rep(1:5), d=rep(1:5))
Expected Output:
Note: this is the original Excel formatting for the columns above:
# a b c 01/01/2021 01/02/2021 d
#1 1 1 NA 1 1 1
#2 2 2 NA 2 2 2
#3 3 3 NA 3 3 3
#4 4 4 NA 4 4 4
#5 5 5 NA 5 5 5
How can I detect these Excel serial-number column names directly and convert them to dates, without having to slice the data frame?

We only need to identify the column names that are numbers (note that as.integer() emits a harmless coercion warning for the non-numeric names):
i1 <- !is.na(as.integer(names(df)))
and then use
names(df)[i1] <- format(as.Date(as.numeric(names(df)[i1]),
origin = "1899-12-30"), "%m/%d/%Y")
Or with dplyr
library(dplyr)
df %>%
rename_with(~ format(as.Date(as.numeric(.),
origin = "1899-12-30"), "%m/%d/%Y"), matches('^\\d+$'))
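Putting the pieces together, a minimal end-to-end sketch on the sample data. Two assumptions here: check.names = FALSE keeps the numeric names that read.xlsx() would return, and grepl("^\\d+$", ...) is an alternative all-digits test that avoids the coercion warning from as.integer():

```r
# Sample data with numeric column names preserved, as read.xlsx() would return them
df <- data.frame(a = 1:5, b = 1:5, c = NA,
                 "44197" = 1:5, "44228" = 1:5, d = 1:5,
                 check.names = FALSE)

# Flag columns whose names consist entirely of digits
i1 <- grepl("^\\d+$", names(df))

# Convert Excel serial numbers to dates (Excel's day 0 is 1899-12-30)
names(df)[i1] <- format(as.Date(as.numeric(names(df)[i1]),
                                origin = "1899-12-30"), "%m/%d/%Y")
names(df)
# [1] "a" "b" "c" "01/01/2021" "02/01/2021" "d"
```

The same i1 works regardless of where the serial-number columns sit, so no slice positions are hard-coded.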

Related

R - Merging rows with numerous NA values to another column

I would like to ask the R community for help with finding a solution for my data, where consecutive rows with multiple NA values are combined and placed into a new column.
For example:
df <- data.frame(A = 1:6, B = c(2, NA, NA, 5, NA, NA), C = c(1, 2, NA, 4, 5, NA), D = c(3, NA, 5, NA, NA, NA))
A B C D
1 1 2 1 3
2 2 NA 2 NA
3 3 NA NA 5
4 4 5 4 NA
5 5 NA 5 NA
6 6 NA NA NA
Must be transformed to this:
A B C D E
1 1 2 1 3 2 NA 2 NA 3 NA NA 5
2 4 5 4 NA 5 NA 5 NA 6 NA NA NA
I would like to do the following:
Identify consecutive rows that have more than 1 NA value -> combine entries from those consecutive rows into a single combined entry
Place the above combined entry in new column "E" on the prior row
This is quite complex (for me!) and I am wondering if anyone can offer any help with this. I have searched for some similar problems, but have been unable to find one that produces a similar desired output.
Thank you very much for your thoughts--
Using tidyr and dplyr:
Concatenate values for each row.
Keep the concatenated values only for rows with more than one NA.
Group each “good” row with all following “bad” rows.
Use a grouped summarize() to concatenate “bad” row values to a single string.
library(dplyr)
library(tidyr)

df %>%
  unite("E", everything(), remove = FALSE, sep = " ") %>%
  mutate(
    E = if_else(
      rowSums(across(!E, is.na)) > 1,
      E,
      ""
    ),
    new_row = cumsum(E == "")
  ) %>%
  group_by(new_row) %>%
  summarize(
    across(A:D, first),
    E = trimws(paste(E, collapse = " "))
  ) %>%
  select(!new_row)
# A tibble: 2 × 5
A B C D E
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 2 1 3 2 NA 2 NA 3 NA NA 5
2 4 5 4 NA 5 NA 5 NA 6 NA NA NA
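The cumsum() grouping step may be clearer in isolation: a logical flag marks each "good" row (at most one NA), and accumulating it assigns every following "bad" row the id of the preceding good row. A minimal sketch with the flag pattern from the example data:

```r
# TRUE marks a "good" row (rows 1 and 4 in the example data);
# cumsum() then gives each bad row the group id of the good row above it
flag <- c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)
cumsum(flag)
# [1] 1 1 1 2 2 2
```

Grouping by this id is what lets summarize() fold rows 2-3 into row 1 and rows 5-6 into row 4.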

Merge multiple data frames with partially matching rows

I have data frames with lists of elements such as NAMES. There are different names in the data frames, but most of them match. I'd like to combine all of them into one list in which I'd see whether some names are missing from any of the data frames.
DATA sample for df1:
X x
1 1 rh_Structure/Focus_S
2 2 rh_Structure/Focus_C
3 3 lh_Structure/Focus_S
4 4 lh_Structure/Focus_C
5 5 RH_Type-Function-S
6 6 RH_REFERENT-S
and for df2
X x
1 1 rh_Structure/Focus_S
2 2 rh_Structure/Focus_C
3 3 lh_Structure/Focus_S
4 4 lh_Structure/Focus_C
5 5 UCZESTNIK
6 6 COACH
and expected result would be:
NAME. df1 df2
1 COACH NA 6
2 lh_Structure/Focus_C 4 4
3 lh_Structure/Focus_S 3 3
4 RH_REFERENT-S 6 NA
5 rh_Structure/Focus_C 2 2
6 rh_Structure/Focus_S 1 1
7 RH_Type-Function-S 5 NA
8 UCZESTNIK NA 5
I can do that with merge.data.frame(df1, df2, by = "x", all = T),
but then I can't extend it to more data frames with a similar structure. Any help would be appreciated.
It might be easier to work with this in a long form. Just rbind all the datasets below one another with a flag for which dataset they came from. Then it's relatively straightforward to get a tabulation of all the missing values (and as an added bonus, you can see if you have any duplicates in any of the source datasets):
dfs <- c("df1","df2")
dfall <- do.call(rbind, Map(cbind, mget(dfs), src=dfs))
table(dfall$x, dfall$src)
# df1 df2
# COACH 0 1
# lh_Structure/Focus_C 1 1
# lh_Structure/Focus_S 1 1
# RH_REFERENT-S 1 0
# rh_Structure/Focus_C 1 1
# rh_Structure/Focus_S 1 1
# RH_Type-Function-S 1 0
# UCZESTNIK 0 1
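If the wide merged table from the question is wanted for more than two frames, the pairwise merge() generalizes with Reduce(). A sketch on two small hypothetical frames (the df1/df2 names and the X index column are assumptions mirroring the question's layout):

```r
# Toy frames: X is a row index, x holds the names to align on
df1 <- data.frame(X = 1:3, x = c("a", "b", "c"))
df2 <- data.frame(X = 1:3, x = c("b", "c", "d"))

# Rename each frame's X column to the frame's own name,
# then fold merge() over the whole list
dfs <- list(df1 = df1, df2 = df2)
dfs <- Map(function(d, nm) setNames(d, c(nm, "x")), dfs, names(dfs))
merged <- Reduce(function(a, b) merge(a, b, by = "x", all = TRUE), dfs)
merged
#   x df1 df2
# 1 a   1  NA
# 2 b   2   1
# 3 c   3   2
# 4 d  NA   3
```

Adding a third or fourth frame only means adding it to the list; the Reduce() call stays the same.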

How to put dates on the top column of my output in R

I have three columns of data
Category Date Value
A 10/12/2018 1
A 10/14/2018 2
B 10/12/2018 3
B 10/13/2018 4
C 10/12/2018 5
C 10/14/2018 6
How can I transform my output so that the output has dates on the top like this and groups the Categories?
10/12/2018 10/13/2018 10/14/2018
A 1 2
B 3 4
C 5 6
I've tried searching for crosstab and some basic R functions and appreciate your thoughts on this.
What you want is called "wide" format. There are many packages and methods in R that do this kind of formatting. Bing Sun pointed to the dplyr method; I prefer the data.table method.
## loading your data here
library(readr)
x <- read_delim("Category Date Value
A 10/12/2018 1
A 10/14/2018 2
B 10/12/2018 3
B 10/13/2018 4
C 10/12/2018 5
C 10/14/2018 6", delim = " ")
## casting your data to wide format
library(data.table)
xcast <- dcast(as.data.table(x), Category ~ Date, value.var = "Value")
xcast
returns...
Category 10/12/2018 10/13/2018 10/14/2018
1 A 1 NA 2
2 B 3 4 NA
3 C 5 NA 6
This is a reshape problem.
library(tidyr)
df %>% spread(Date,Value)
Category 10/12/2018 10/13/2018 10/14/2018
1 A 1 NA 2
2 B 3 4 NA
3 C 5 NA 6
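As a side note, spread() still works but is superseded in current tidyr; pivot_wider() is the maintained equivalent. A sketch on the same data:

```r
library(tidyr)

df <- data.frame(Category = c("A", "A", "B", "B", "C", "C"),
                 Date = c("10/12/2018", "10/14/2018", "10/12/2018",
                          "10/13/2018", "10/12/2018", "10/14/2018"),
                 Value = 1:6)

# One column of names, one column of values; missing cells become NA
wide <- pivot_wider(df, names_from = Date, values_from = Value)
wide
```

Columns come out in order of first appearance in Date; missing Category/Date combinations are filled with NA, matching the spread() output above.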

Access specific instances in list in dataframe column, and also count list length - R

I have an R dataframe composed of columns. One column contains lists: i.e.
Column
1,2,4,7,9,0
5,3,8,9,0
3,4
5.8,9,3.5
6
NA
7,4,3
I would like to create a column which counts the length of these lists:
Column Count
1,2,4,7,9,0 6
5,3,8,9,0 5
3,4 2
5.8,9,3.5 3
6 1
NA NA
7,4,3 3
Also, is there a way to access specific instances in these lists? i.e. make a new column with only the first instances of each list? or the last instances of each?
One solution is to use strsplit to split each element of the character vector and sapply to get the desired count:
df$count <- sapply(strsplit(df$Column, ","), function(x) {
  if (all(is.na(x))) {
    NA
  } else {
    length(x)
  }
})
df
# Column count
# 1 1,2,4,7,9,0 6
# 2 5,3,8,9,0 5
# 3 3,4 2
# 4 5.8,9,3.5 3
# 5 6 1
# 6 <NA> NA
# 7 7,4,3 3
If it is desired to count NA as 1, then the solution could be even simpler:
df$count <- sapply(strsplit(df$Column, ","),length)
Data:
df <- read.table(text = "Column
'1,2,4,7,9,0'
'5,3,8,9,0'
'3,4'
'5.8,9,3.5'
'6'
NA
'7,4,3'",
header = TRUE, stringsAsFactors = FALSE)
count.fields serves this purpose for a text file, and can be coerced to work with a column too:
df$Count <- count.fields(textConnection(df$Column), sep=",")
df$Count[is.na(df$Column)] <- NA
df
# Column Count
#1 1,2,4,7,9,0 6
#2 5,3,8,9,0 5
#3 3,4 2
#4 5.8,9,3.5 3
#5 6 1
#6 <NA> NA
#7 7,4,3 3
On a more general note, you're probably better off converting your column to a list, or stacking the data to a long form, to make it easier to work with:
df$Column <- strsplit(df$Column, ",")
lengths(df$Column)
#[1] 6 5 2 3 1 1 3
sapply(df$Column, `[`, 1)
#[1] "1" "5" "3" "5.8" "6" NA "7"
stack(setNames(df$Column, seq_along(df$Column)))
# values ind
#1 1 1
#2 2 1
#3 4 1
#4 7 1
#5 9 1
#6 0 1
#7 5 2
#8 3 2
#9 8 2
# etc
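For the "last instance" part of the question, the same indexing idea works with length(). A self-contained sketch on the sample column (NA rows stay NA):

```r
# Sample column as character strings, then split on commas
col <- c("1,2,4,7,9,0", "5,3,8,9,0", "3,4", "5.8,9,3.5", "6", NA, "7,4,3")
parts <- strsplit(col, ",")

# x[length(x)] picks the last element of each split vector;
# for the NA row the split result is a single NA, which passes through
last <- sapply(parts, function(x) x[length(x)])
last
# [1] "0"   "0"   "4"   "3.5" "6"   NA    "3"
```

Any other position works the same way, e.g. sapply(parts, `[`, 2) for the second element (which is NA where a list is too short).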
Here's a slightly faster way to achieve the same result:
df$Count <- nchar(gsub('[^,]', '', df$Column)) + 1
This one works by counting how many commas there are and adding 1.

Setting values to NA in a dataframe in R

Here is some reproducible code that shows the problem I am trying to solve in another dataset. Suppose I have a dataframe df with some NULL values in it. I would like to replace these with NAs, as I attempt to do below. But when I print this, it comes out as <NA>. See the second dataframe, which is the dataframe I would like to produce from df, in which the NA is a regular old NA without the angle brackets.
> df = data.frame(a=c(1,2,3,"NULL"),b=c(1,5,4,6))
> df[4,1] = NA
> print(df)
a b
1 1 1
2 2 5
3 3 4
4 <NA> 6
>
> d = data.frame(a=c(1,2,3,NA),b=c(1,5,4,6))
> print(d)
a b
1 1 1
2 2 5
3 3 4
4 NA 6
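For completeness, the <NA> display appears because the "NULL" string forces column a to be character; recoding "NULL" to NA and converting back to numeric restores the plain NA printing. A sketch, assuming R >= 4.0 where data.frame() no longer creates factors by default:

```r
# Mixing "NULL" with numbers makes column a character
df <- data.frame(a = c(1, 2, 3, "NULL"), b = c(1, 5, 4, 6))

# Recode the "NULL" strings, then restore the numeric type;
# as.numeric() passes existing NAs through without a warning
df$a[df$a == "NULL"] <- NA
df$a <- as.numeric(df$a)
print(df)
#    a b
# 1  1 1
# 2  2 5
# 3  3 4
# 4 NA 6
```

Once the column is numeric again, the missing value prints as NA rather than the character placeholder <NA>.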
