R dataframe, expand rows by string variable [duplicate] - r

This question already has answers here:
R semicolon delimited a column into rows
(3 answers)
Closed 6 years ago.
Can anyone please help with this little data.frame expansion problem?
Thanks in advance!
# I have
data.frame(rbind(c("1", "2", "3", "a/b/c"),
c("11", "0", "5", "c/d"),
c("3", "33", "0", "a"))
)
# X1 X2 X3 X4
# 1 1 2 3 a/b/c
# 2 11 0 5 c/d
# 3 3 33 0 a
# I want
data.frame(rbind(c("1", "2", "3", "a"),
c("1", "2", "3", "b"),
c("1", "2", "3", "c"),
c("11", "0", "5", "c"),
c("11", "0", "5", "d"),
c("3", "33", "0", "a"))
)
# X1 X2 X3 X4
# 1 1 2 3 a
# 2 1 2 3 b
# 3 1 2 3 c
# 4 11 0 5 c
# 5 11 0 5 d
# 6 3 33 0 a

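We can use data.table: group by the columns to keep and split X4 on "/" within each group. Here df1 is the data.frame from the question.
library(data.table)
# unlist() the split pieces and name the result so the new column is X4 rather than V1
setDT(df1)[, .(X4 = unlist(strsplit(as.character(X4), "/", fixed = TRUE))), by = .(X1, X2, X3)]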
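For comparison, here is a minimal base R sketch of the same expansion. It assumes the question's data is stored in a data.frame called df1 (a name the question itself never defines); parts and out are illustrative names. Each row is repeated once per piece of X4, and X4 is then overwritten with the pieces.

```r
# Assumed input: the data from the question, stored as df1
df1 <- data.frame(
  X1 = c("1", "11", "3"),
  X2 = c("2", "0", "33"),
  X3 = c("3", "5", "0"),
  X4 = c("a/b/c", "c/d", "a"),
  stringsAsFactors = FALSE
)

# Split X4 into its pieces, one character vector per row
parts <- strsplit(df1$X4, "/", fixed = TRUE)

# Repeat each row as many times as it has pieces, then fill in the pieces
out <- df1[rep(seq_len(nrow(df1)), lengths(parts)), ]
out$X4 <- unlist(parts)
rownames(out) <- NULL
out
```

This avoids any package dependency, at the cost of being more verbose than the data.table one-liner.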

Related

Intricate variable generation with conditionals against multiple factor variables in R

I'm trying to generate a new variable using multiple conditions that are evaluated against factor variables.
So, let's say I have this data.frame of factor variables:
x<-c("1", "2", "1","NA", "1", "2", "NA", "1", "2", "2", "NA" )
y<-c("1","NA", "2", "1", "1", "NA", "2", "1", "2", "1", "1" )
z<-c("1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3")
w<- c("01", "02", "03", "04","05", "06", "07", "01", "02", "03", "04")
df<-data.frame(x,y,z,w)
df$x<-as.factor(df$x)
df$y<-as.factor(df$y)
df$z<-as.factor(df$z)
df$w<-as.factor(df$w)
str(df)
So I need a new column v in my data frame that takes the value 1, 0 or NA under the following conditions:
Takes value 1 if: x = "1", y = "1", z = "1" or "2", and w = "01" to "06".
Takes value 0 if at least one of those conditions is not met.
Takes value NA if any of x, y, z, or w is NA.
I have tried using a pipe %>% with mutate and case_when but have been unable to make it work.
So my desired result would be a new column v in df which would look like this:
[1] 1 NA 0 NA 1 NA NA 0 0 0 NA
Here I also use mutate with case_when. Since the NA values in your dataset are the literal string "NA", we cannot use functions like is.na() to identify them. I would recommend changing them to real NA values (by removing the double quotes in your input).
As I've pointed out in the comment, I'm not sure why the eighth entry is "1" when the corresponding z is not "1" or "2".
library(dplyr)
df %>% mutate(v = case_when(x == "1" & y == "1" & z %in% c("1", "2") & w %in% paste0("0", 1:6) ~ "1",
                            x == "NA" | y == "NA" | z == "NA" | w == "NA" ~ NA_character_,
                            TRUE ~ "0"))
x y z w v
1 1 1 1 01 1
2 2 NA 2 02 <NA>
3 1 2 3 03 0
4 NA 1 4 04 <NA>
5 1 1 1 05 1
6 2 NA 2 06 <NA>
7 NA 2 3 07 <NA>
8 1 1 4 01 0
9 2 2 1 02 0
10 2 1 2 03 0
11 NA 1 3 04 <NA>
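If the "NA" strings are replaced by real NA values as recommended, the rule can also be written in base R. This is only a sketch under that assumption: any_missing and meets_all are illustrative names, and v comes out as an integer column rather than a character one.

```r
# Same data as in the question, but with real NA values instead of "NA" strings
x <- c("1", "2", "1", NA, "1", "2", NA, "1", "2", "2", NA)
y <- c("1", NA, "2", "1", "1", NA, "2", "1", "2", "1", "1")
z <- c("1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3")
w <- c("01", "02", "03", "04", "05", "06", "07", "01", "02", "03", "04")
df <- data.frame(x, y, z, w, stringsAsFactors = TRUE)

# TRUE for rows where at least one of x, y, z, w is NA
any_missing <- !complete.cases(df)

# TRUE only when every condition for v = 1 holds
meets_all <- df$x == "1" & df$y == "1" &
  df$z %in% c("1", "2") & df$w %in% sprintf("%02d", 1:6)

# NA wins over everything else; otherwise 1 if all conditions hold, else 0
df$v <- ifelse(any_missing, NA_integer_, as.integer(meets_all))
df$v
```

Checking for missing values first keeps the NA rule separate from the 1/0 rule, so the order of the conditions no longer matters.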

How to format data from excel containing two rows of column headers to be able to use in R?

I am importing the following table 1 into R but am struggling with the formatting, as each column has two headers. My desired output is the second table 2. I plan to use tidyr to gather the data.
Another obstacle I have is the merged cells. I have been using fillMergedCells=TRUE to duplicate this.
read.xlsx(xlsxFile ="C:/Users/X/X/Desktop/X.xlsx",fillMergedCells = TRUE)
One option would be to:
1. Read your excel file with option colNames = FALSE.
2. Paste the first two rows together and use the result as the column names. Here I use an underscore as the separator, which makes it easy to split the names later on.
3. Get rid of the first two rows.
4. Use tidyr::pivot_longer to convert to long format.
# df <- openxlsx::read.xlsx(xlsxFile ="data/test2.xlsx", fillMergedCells = TRUE, colNames = FALSE)
# Use first two rows as names
names(df) <- paste(df[1, ], df[2, ], sep = "_")
names(df)[1] <- "category"
# Get rid of first two rows and columns containing year average
df <- df[-c(1:2), ]
df <- df[, !grepl("^Year", names(df))]
library(tidyr)
library(dplyr)
df %>%
pivot_longer(-category, names_to = c("Time", ".value"), names_pattern = "^(.*?)_(.*)$") %>%
arrange(Time)
#> # A tibble: 16 × 4
#> category Time Y Z
#> <chr> <chr> <chr> <chr>
#> 1 Total Feb-21 1 1
#> 2 A Feb-21 2 2
#> 3 B Feb-21 3 3
#> 4 C Feb-21 4 4
#> 5 D Feb-21 5 5
#> 6 E Feb-21 6 6
#> 7 F Feb-21 7 7
#> 8 G Feb-21 8 8
#> 9 Total Jan-21 1 1
#> 10 A Jan-21 2 2
#> 11 B Jan-21 3 3
#> 12 C Jan-21 4 4
#> 13 D Jan-21 5 5
#> 14 E Jan-21 6 6
#> 15 F Jan-21 7 7
#> 16 G Jan-21 8 8
DATA
df <- structure(list(X1 = c(
NA, NA, "Total", "A", "B", "C", "D", "E",
"F", "G"
), X2 = c(
"Year Rolling Avg.", "Share", NA, "1", "1",
"1", "1", "1", "1", "1"
), X3 = c(
"Year Rolling Avg.", "Y", "1",
"2", "3", "4", "5", "6", "7", "8"
), X4 = c(
"Year Rolling Avg.",
"Z", "1", "2", "3", "4", "5", "6", "7", "8"
), X5 = c(
"Jan-21",
"Y", "1", "2", "3", "4", "5", "6", "7", "8"
), X6 = c(
"Jan-21",
"Z", "1", "2", "3", "4", "5", "6", "7", "8"
), X7 = c(
"Feb-21",
"Y", "1", "2", "3", "4", "5", "6", "7", "8"
), X8 = c(
"Feb-21",
"Z", "1", "2", "3", "4", "5", "6", "7", "8"
)), row.names = c(
NA,
10L
), class = "data.frame")
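To see how the names_pattern splits the pasted column names, the same regular expression can be tried on its own with base R's sub(). This is just an illustrative check, not part of the answer's pipeline: the example names are built from the headers in the data above, and perl = TRUE is an assumption made here so that the lazy quantifier is handled by the PCRE engine.

```r
# Column names as produced by paste(df[1, ], df[2, ], sep = "_")
col_names <- c("Jan-21_Y", "Jan-21_Z", "Feb-21_Y", "Feb-21_Z")

# Same pattern as in names_pattern: lazily match up to the first underscore
pattern <- "^(.*?)_(.*)$"

# First capture group becomes the Time column
time_part <- sub(pattern, "\\1", col_names, perl = TRUE)

# Second capture group supplies the new column names (the ".value" part)
value_part <- sub(pattern, "\\2", col_names, perl = TRUE)

data.frame(col_names, time_part, value_part)
```

Because the first group is lazy, the split happens at the first underscore, which matters only if a header itself contains an underscore.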

combining rows based on a condition in R

I am trying to remove some redundant rows from the df below. Each ID can have types 1:5, and the yes_no variable indicates whether a value was recorded. As you can see, I would like to remove the 3rd and 5th rows, because other rows with the same ID and type have a recorded value (yes_no = y).
df <- data.frame(ID = c("1", "1", "1", "1", "1", "1", "1", "1"), type = c("1", "2", "3", "3", "4", "4", "4", "5"), yes_no = c("n", "n", "n", "y", "n", "y", "y", "n"), value = c(NA, NA, NA, "2", NA, "5", "6", NA))
ID type yes_no value
1 1 n <NA>
1 2 n <NA>
1 3 n <NA>
1 3 y 2
1 4 n <NA>
1 4 y 5
1 4 y 6
1 5 n <NA>
The desired output is as follows:
df2 <- data.frame(ID = c("1", "1", "1", "1", "1", "1"), type = c("1", "2", "3", "4", "4", "5"), yes_no = c("n", "n", "y", "y", "y", "n"), value = c(NA, NA, "2", "5", "6", NA))
ID type yes_no value
1 1 n <NA>
1 2 n <NA>
1 3 y 2
1 4 y 5
1 4 y 6
1 5 n <NA>
There are IDs other than 1 that have types 1:5, so it looks like I have to group_by(ID). A dplyr solution would be great too.
Any help would be appreciated, thanks!
You may use an if condition to check if yes_no has any y value.
library(dplyr)
df %>%
group_by(ID, type) %>%
filter(if(any(yes_no == 'y')) yes_no == 'y' else TRUE) %>%
ungroup
# ID type yes_no value
# <chr> <chr> <chr> <chr>
#1 1 1 n NA
#2 1 2 n NA
#3 1 3 y 2
#4 1 4 y 5
#5 1 4 y 6
#6 1 5 n NA
A base R option using subset + ave
subset(
df,
ave(yes_no == "y", ID, type, FUN = max) == (yes_no == "y")
)
gives
ID type yes_no value
1 1 1 n <NA>
2 1 2 n <NA>
4 1 3 y 2
6 1 4 y 5
7 1 4 y 6
8 1 5 n <NA>
After grouping by 'ID' and 'type', we may use an OR (|) condition in filter to keep the rows where yes_no is 'y', or every row of a group in which no element is 'y'.
library(dplyr)
df %>%
group_by(ID, type) %>%
filter(yes_no == 'y'|all(yes_no != 'y')) %>%
ungroup
-output
# A tibble: 6 x 4
ID type yes_no value
<chr> <chr> <chr> <chr>
1 1 1 n <NA>
2 1 2 n <NA>
3 1 3 y 2
4 1 4 y 5
5 1 4 y 6
6 1 5 n <NA>
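The logic shared by these answers can be spelled out in two steps in base R. This is a rough sketch using the df from the question: first flag the (ID, type) groups that contain a 'y', then keep a row if it is a 'y' row or if its group has no 'y' at all. has_y, keep and result are illustrative names.

```r
df <- data.frame(
  ID = c("1", "1", "1", "1", "1", "1", "1", "1"),
  type = c("1", "2", "3", "3", "4", "4", "4", "5"),
  yes_no = c("n", "n", "n", "y", "n", "y", "y", "n"),
  value = c(NA, NA, NA, "2", NA, "5", "6", NA),
  stringsAsFactors = FALSE
)

# For every row: does its (ID, type) group contain at least one "y"?
has_y <- ave(df$yes_no == "y", df$ID, df$type, FUN = any)

# Keep "y" rows, plus all rows of groups that have no "y" at all
keep <- df$yes_no == "y" | !has_y

result <- df[keep, ]
result
```

The original row names are preserved, so it is easy to see that rows 3 and 5 are the ones dropped.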

Select first row per run by group [duplicate]

This question already has answers here:
Select first row in each contiguous run by group
(4 answers)
Closed 1 year ago.
I have data with a grouping variable (ID) and some values (type):
ID <- c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3")
type <- c("1", "3", "3", "2", "3", "3", "1", "1", "1", "2", "2", "1")
dat <- data.frame(ID,type)
Within each ID, I want to delete repeated values: not every duplicate, only a value that is the same as the one immediately before it. I have annotated some examples:
# ID type
# 1 1 1
# 2 1 3 # first value in a run of 3s within ID 1: keep
# 3 1 3 # 2nd value: remove
# 4 1 2
# 5 2 3
# 6 2 3
# 7 2 1
# 8 2 1
# 9 3 1
# 10 3 2 # first value in a run of 2s within ID 3: keep
# 11 3 2 # 2nd value: remove
# 12 3 1
For example, ID 3 has the sequence of values 1, 2, 2, 1. The third value is the same as the second value, so it should be deleted, leaving 1, 2, 1.
Thus, the desired output is:
data.frame(ID = c("1", "1", "1", "2", "2", "3", "3", "3"),
type = c("1", "3", "2", "3", "1", "1", "2", "1"))
ID type
1 1 1
2 1 3
3 1 2
4 2 3
5 2 1
6 3 1
7 3 2
8 3 1
I've tried
dat[!duplicated(dat), ]
however what I got was
ID <- c("1", "1", "1", "2", "2", "3", "3")
type<- c("1", "3", "2", "3", "1", "1", "2")
I know duplicated() only keeps the first occurrence of each distinct row. How can I get the values I want?
Thanks for the help in advance!
Does this work:
library(dplyr)
dat %>% group_by(ID) %>%
mutate(flag = case_when(type == lag(type) ~ TRUE, TRUE ~ FALSE)) %>%
filter(!flag) %>% select(-flag)
# A tibble: 8 x 2
# Groups: ID [3]
ID type
<chr> <chr>
1 1 1
2 1 3
3 1 2
4 2 3
5 2 1
6 3 1
7 3 2
8 3 1
Using data.table rleid and duplicated -
library(data.table)
setDT(dat)[!duplicated(rleid(ID, type))]
# ID type
#1: 1 1
#2: 1 3
#3: 1 2
#4: 2 3
#5: 2 1
#6: 3 1
#7: 3 2
#8: 3 1
Improved answer including a suggestion from @Henrik.
Base R way, if you want to eliminate consecutive duplicate rows only (8 rows of output):
ID <- c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3")
type<- c("1", "3", "3", "2", "3", "3", "1", "1", "1", "2", "2", "1")
dat <- data.frame(ID,type)
subset(dat, !duplicated(with(rle(paste(dat$ID, dat$type)), rep(seq_len(length(lengths)), lengths))))
#> ID type
#> 1 1 1
#> 2 1 3
#> 4 1 2
#> 5 2 3
#> 7 2 1
#> 9 3 1
#> 10 3 2
#> 12 3 1
Created on 2021-05-22 by the reprex package (v2.0.0)
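Another way to express "drop a value that equals the previous one within the same ID" in base R is to build the lagged value per group and compare. This is a sketch only: prev and keep are illustrative names, and it assumes type itself contains no NA values.

```r
ID <- c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3")
type <- c("1", "3", "3", "2", "3", "3", "1", "1", "1", "2", "2", "1")
dat <- data.frame(ID, type, stringsAsFactors = FALSE)

# Previous value of type within each ID (NA for the first row of each ID)
prev <- ave(dat$type, dat$ID, FUN = function(v) c(NA, head(v, -1)))

# Keep the first row of each ID and every row that differs from its predecessor
keep <- is.na(prev) | dat$type != prev

dat[keep, ]
```

Since the lag is computed per ID, a run never carries over from the last row of one ID to the first row of the next.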

count_if (EXPSS) with multiple conditions in R

I am using expss::count_if.
While something like this works fine (i.e., counting values only where value is equal to "1"):
(number_unemployed = count_if("1",unemployed_field,na.rm = TRUE)),
This does not (i.e., counting values only where value is equal to "1" or "2" or "3"):
(number_unemployed = count_if("1", "2", "3", unemployed_field,na.rm = TRUE)),
What is the correct syntax for using multiple conditions for count_if? I cannot find anything in the expss package documentation.
You need to put them into a vector. This works:
(number_unemployed = count_if(c("1", "2", "3"), unemployed_field, na.rm = TRUE)),
Example (sample data is provided below):
library(expss)
count_if(c("1","2","3"),dt$Encounter)
#> 9
Data:
dt <- structure(list(Location = c("A", "B", "A", "A", "C", "B", "A", "B", "A", "A", "A"),
Encounter = c("1", "2", "3", "1", "2", "3", "4", "1", "2", "3", "4")),
row.names = c(NA, -11L), class = "data.frame")
# Location Encounter
# 1 A 1
# 2 B 2
# 3 A 3
# 4 A 1
# 5 C 2
# 6 B 3
# 7 A 4
# 8 B 1
# 9 A 2
# 10 A 3
# 11 A 4
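As a cross-check that does not depend on expss, the same count can be obtained in base R with %in% and sum(). This is an equivalent written for comparison, not the expss API; n_matching is an illustrative name.

```r
dt <- data.frame(
  Location = c("A", "B", "A", "A", "C", "B", "A", "B", "A", "A", "A"),
  Encounter = c("1", "2", "3", "1", "2", "3", "4", "1", "2", "3", "4"),
  stringsAsFactors = FALSE
)

# %in% tests each Encounter against the whole set of accepted values;
# summing the resulting logical vector counts the matches
n_matching <- sum(dt$Encounter %in% c("1", "2", "3"))
n_matching
#> [1] 9
```

Because %in% returns FALSE for NA, missing values are simply not counted here.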
