How to dynamically intercalate columns with a pattern in R?

This is a follow-up question. How can I dynamically intercalate the columns in a bigger data set?
Rationale: I used a for-loop to import 16 dataframes. After that, I merged them all like this:
### Merge all dataframes (PS: I got this code here on SO :)
mergefun <- function(x, y) merge(x, y, by = "ID", all = TRUE)
merged_DF <- Reduce(mergefun, dataList)
Each dataframe has an "ID" column (the same in every one), but they have different column names (which I created based on the answers to the other posts). Hence,
I have, in total (the head() of each dataframe):
ID NARR_G1_50_AAA NARR_G1_50_AAC NARR_G1_50_AC NARR_G1_50_AB
ID NARR_G1_100_AAA NARR_G1_100_AAC NARR_G1_100_AC NARR_G1_100_AB
ID NARR_G1_150_AAA NARR_G1_150_AAC NARR_G1_150_AC NARR_G1_150_AB
ID NARR_G1_200_AAA NARR_G1_200_AAC NARR_G1_200_AC NARR_G1_200_AB
ID NARR_G2_50_AAA NARR_G2_50_AAC NARR_G2_50_AC NARR_G2_50_AB
ID NARR_G2_100_AAA NARR_G2_100_AAC NARR_G2_100_AC NARR_G2_100_AB
ID NARR_G2_150_AAA NARR_G2_150_AAC NARR_G2_150_AC NARR_G2_150_AB
ID NARR_G2_200_AAA NARR_G2_200_AAC NARR_G2_200_AC NARR_G2_200_AB
ID ARG_G1_50_AAA ARG_G1_50_AAC ARG_G1_50_AC ARG_G1_50_AB
ID ARG_G1_100_AAA ARG_G1_100_AAC ARG_G1_100_AC ARG_G1_100_AB
ID ARG_G1_150_AAA ARG_G1_150_AAC ARG_G1_150_AC ARG_G1_150_AB
ID ARG_G1_200_AAA ARG_G1_200_AAC ARG_G1_200_AC ARG_G1_200_AB
ID ARG_G2_50_AAA ARG_G2_50_AAC ARG_G2_50_AC ARG_G2_50_AB
ID ARG_G2_100_AAA ARG_G2_100_AAC ARG_G2_100_AC ARG_G2_100_AB
ID ARG_G2_150_AAA ARG_G2_150_AAC ARG_G2_150_AC ARG_G2_150_AB
ID ARG_G2_200_AAA ARG_G2_200_AAC ARG_G2_200_AC ARG_G2_200_AB
I need to arrange the columns of the joined dataframe in these two orders:
SET 1 :
###Desired output 1:
NARR_G1_50_AAA, NARR_G2_50_AAA,
NARR_G1_50_AAC, NARR_G2_50_AAC,
NARR_G1_50_AC, NARR_G2_50_AC,
NARR_G1_50_AB, NARR_G2_50_AB,
ARG_G1_50_AAA, ARG_G2_50_AAA,
ARG_G1_50_AAC, ARG_G2_50_AAC,
ARG_G1_50_AC, ARG_G2_50_AC,
ARG_G1_50_AB, ARG_G2_50_AB........then with 100,150 and 200
SET 2 :
###Desired output 2:
NARR_G1_50_AAA, ARG_G1_50_AAA,
NARR_G2_50_AAA, ARG_G2_50_AAA,
NARR_G1_50_AAC, ARG_G1_50_AAC,
NARR_G2_50_AAC, ARG_G2_50_AAC,
NARR_G1_50_AC, ARG_G1_50_AC,
NARR_G2_50_AC, ARG_G2_50_AC,
NARR_G1_50_AB, ARG_G1_50_AB,
NARR_G2_50_AB, ARG_G2_50_AB,........then with 100,150 and 200
I've tried many things, but I can't get the desired orders. The closest I got was this:
dfPaired <- merged_DF %>% ### still doesn't produce the desired output
# dplyr::select(sort(names(.))) %>%
dplyr::select(order(gsub("G1", "G2", names(.))))
Question:
How can I get the desired orders (set 1 and set 2) without manually intercalating the columns in select() ?
Further notes:
SET 1:
I need to intercalate "G1" and "G2" within each variable (in increasing order: 50, then 100, then 150, then 200). Ex: NARR_G1_50_AAA, NARR_G2_50_AAA... There are 4 per number (AAA, AAC, AC and AB)
SET 2:
I need to intercalate "NARR" and "ARG" (in increasing order: 50, then 100, then 150, then 200), comparing G1 and G2. Such as: NARR_G1_50_AAA, ARG_G1_50_AAA... Thanks in advance :)

If a custom order is needed, one option is to split the column names at _, convert the non-numeric pieces to factor with levels given in the desired order, and then order on those pieces.
lvls1 <- c("NARR", "ARG")
lvls2 <- c("G1", "G2")
lvls3 <- c("AAA", "AAC", "AC", "AB")
#v1 <- names(merged_DF)[-1] # assuming 'ID' is the first column
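# split each name into four pieces: V1 = NARR/ARG, V2 = G1/G2, V3 = number, V4 = suffix
d1 <- read.table(text = v1, header = FALSE, sep = "_")
i1 <- !sapply(d1, is.numeric)
# factors make the non-numeric pieces sort in the custom order
d1[i1] <- Map(factor, d1[i1], levels = list(lvls1, lvls2, lvls3))
# sort by number, then NARR/ARG, then suffix, then G1/G2 (Set 1)
v2 <- v1[do.call(order, d1[c(3, 1, 4, 2)])]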
library(dplyr)
merged_DF %>%
select(ID, all_of(v2))
where v2 is
> v2
[1] "NARR_G1_50_AAA" "NARR_G2_50_AAA" "NARR_G1_50_AAC" "NARR_G2_50_AAC" "NARR_G1_50_AC" "NARR_G2_50_AC" "NARR_G1_50_AB" "NARR_G2_50_AB"
[9] "ARG_G1_50_AAA" "ARG_G2_50_AAA" "ARG_G1_50_AAC" "ARG_G2_50_AAC" "ARG_G1_50_AC" "ARG_G2_50_AC" "ARG_G1_50_AB" "ARG_G2_50_AB"
[17] "NARR_G1_100_AAA" "NARR_G2_100_AAA" "NARR_G1_100_AAC" "NARR_G2_100_AAC" "NARR_G1_100_AC" "NARR_G2_100_AC" "NARR_G1_100_AB" "NARR_G2_100_AB"
[25] "ARG_G1_100_AAA" "ARG_G2_100_AAA" "ARG_G1_100_AAC" "ARG_G2_100_AAC" "ARG_G1_100_AC" "ARG_G2_100_AC" "ARG_G1_100_AB" "ARG_G2_100_AB"
[33] "NARR_G1_150_AAA" "NARR_G2_150_AAA" "NARR_G1_150_AAC" "NARR_G2_150_AAC" "NARR_G1_150_AC" "NARR_G2_150_AC" "NARR_G1_150_AB" "NARR_G2_150_AB"
[41] "ARG_G1_150_AAA" "ARG_G2_150_AAA" "ARG_G1_150_AAC" "ARG_G2_150_AAC" "ARG_G1_150_AC" "ARG_G2_150_AC" "ARG_G1_150_AB" "ARG_G2_150_AB"
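The ordering above reproduces Set 1. For Set 2, the same idea should work with a different sort priority: number first, then suffix, then G1/G2, then NARR/ARG. The following is a minimal base R sketch under that assumption. The short v1 vector and the names parts, type, group, size, code and v2_set2 are made up for illustration and are not part of the original answer.

```r
# Hypothetical subset of the column names, in arbitrary order
v1 <- c("ARG_G2_50_AAA", "NARR_G1_100_AAA", "NARR_G2_50_AAA", "ARG_G1_50_AAA",
        "NARR_G1_50_AAA", "ARG_G1_100_AAA", "NARR_G2_100_AAA", "ARG_G2_100_AAA",
        "NARR_G1_50_AB", "ARG_G1_50_AB", "NARR_G2_50_AB", "ARG_G2_50_AB")

# Split every name at "_" into a 4-column character matrix
parts <- do.call(rbind, strsplit(v1, "_", fixed = TRUE))

# Factors carry the custom order; the number is sorted numerically
d1 <- data.frame(
  type  = factor(parts[, 1], levels = c("NARR", "ARG")),
  group = factor(parts[, 2], levels = c("G1", "G2")),
  size  = as.integer(parts[, 3]),
  code  = factor(parts[, 4], levels = c("AAA", "AAC", "AC", "AB"))
)

# Set 2 priority: number, suffix, G1/G2, NARR/ARG
v2_set2 <- v1[order(d1$size, d1$code, d1$group, d1$type)]
v2_set2
```

The resulting vector can then be passed to select(ID, all_of(v2_set2)) in the same way as v2.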
data
# it is a random order of the column names which is ordered in the code
v1 <- c("NARR_G1_100_AB", "NARR_G1_150_AAC", "NARR_G2_50_AB", "NARR_G1_150_AB",
"NARR_G2_100_AAA", "NARR_G1_100_AAC", "ARG_G1_150_AC", "ARG_G2_50_AAA",
"ARG_G2_150_AAA", "ARG_G1_50_AAA", "ARG_G2_100_AC", "NARR_G1_150_AAA",
"NARR_G2_100_AC", "ARG_G1_50_AC", "NARR_G1_100_AAA", "ARG_G2_50_AB",
"NARR_G1_150_AC", "ARG_G2_50_AAC", "ARG_G2_150_AB", "NARR_G2_100_AAC",
"NARR_G2_150_AAA", "NARR_G1_100_AC", "ARG_G1_150_AB", "ARG_G1_50_AAC",
"NARR_G1_50_AC", "ARG_G2_150_AAC", "NARR_G1_50_AAA", "NARR_G2_150_AB",
"NARR_G2_150_AAC", "ARG_G1_150_AAA", "ARG_G2_50_AC", "NARR_G2_50_AC",
"ARG_G1_150_AAC", "ARG_G1_100_AC", "ARG_G1_100_AAA", "NARR_G1_50_AAC",
"NARR_G2_150_AC", "ARG_G1_100_AAC", "ARG_G2_100_AAA", "ARG_G2_100_AAC",
"NARR_G1_50_AB", "NARR_G2_100_AB", "ARG_G2_100_AB", "ARG_G1_50_AB",
"NARR_G2_50_AAA", "ARG_G1_100_AB", "ARG_G2_150_AC", "NARR_G2_50_AAC"
)

Related

How to initialize an empty data frame where the number of columns depends on multiple input parameters in R

I need to create an empty data.table with a variable number of columns. For example, given n = 2 and m = 12, I want the table to have columns such as:
name, ID, nickname, start_1, start_2, count_1, count_2, value_<n>_month_<m>
(value_<n>_month_<m> is repeated m x n times)
I get the columns name, ID, nickname, start_1, start_2, start_3, count_1, count_2, count_3 by using the following code:
VarStart <- setnames(setDF(lapply(integer(K), function(...) character(0L))),
paste0("start", 1:K))
NewTable <- cbind(NewTable, VarStart)
VarCount <- setnames(setDF(lapply(integer(K), function(...) character(0L))),
paste0("count", 1:K))
NewTable <- cbind(NewTable, VarCount)
borrowed from here.
But how can I create columns for variables with both n and m in the column name?
Is there a nicer way than what I already have?
You can use:
n = 2
m = 12
const_col <- c('name', 'ID', 'nickname')
sc_cols <- c(t(outer(c('start', 'count'), seq_len(n),paste0)))
vm_cols <- c(t(outer(seq_len(n), seq_len(m), function(x, y)
sprintf('value_%d_month_%d', x, y))))
all_cols <- c(const_col, sc_cols, vm_cols)
NewTable <- data.table::data.table(matrix(ncol = length(all_cols),
dimnames = list(NULL, all_cols)))
Specify nrow as 0 if you want an empty data.table with 0 rows.
names(NewTable)
# [1] "name" "ID" "nickname"
# [4] "start1" "start2" "count1"
# [7] "count2" "value_1_month_1" "value_1_month_2"
#[10] "value_1_month_3" "value_1_month_4" "value_1_month_5"
#[13] "value_1_month_6" "value_1_month_7" "value_1_month_8"
#[16] "value_1_month_9" "value_1_month_10" "value_1_month_11"
#[19] "value_1_month_12" "value_2_month_1" "value_2_month_2"
#[22] "value_2_month_3" "value_2_month_4" "value_2_month_5"
#[25] "value_2_month_6" "value_2_month_7" "value_2_month_8"
#[28] "value_2_month_9" "value_2_month_10" "value_2_month_11"
#[31] "value_2_month_12"
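For the zero-row case mentioned above, here is a minimal sketch in base R, so it runs without data.table installed. It assumes the underscore style from the question (start_1, count_1), which differs from the start1/count1 names produced by the answer. The name empty_tbl is illustrative only.

```r
n <- 2
m <- 12

const_col <- c("name", "ID", "nickname")
# passing sep = "_" through outer() to paste() gives start_1, start_2, count_1, count_2
sc_cols <- c(t(outer(c("start", "count"), seq_len(n), paste, sep = "_")))
# value_1_month_1 ... value_1_month_12, value_2_month_1 ... value_2_month_12
vm_cols <- c(t(outer(seq_len(n), seq_len(m),
                     function(x, y) sprintf("value_%d_month_%d", x, y))))
all_cols <- c(const_col, sc_cols, vm_cols)

# nrow = 0 yields a table with every column present but no rows
empty_tbl <- as.data.frame(
  matrix(character(0), nrow = 0, ncol = length(all_cols),
         dimnames = list(NULL, all_cols))
)
dim(empty_tbl)
```

If a data.table is required, the resulting data frame can presumably be converted with data.table::as.data.table().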

Separating a column using big spaces in strings in R

This is my data frame, composed of only one observation. It is a long string in which 4 different parts are identifiable:
example <- "4.6 (19 ratings) Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately. 151 students enrolled "
df <- data.frame(example)
As you can see, the first observation is a string with 4 different parts: rating (4.6), number of ratings (19 ratings), a sentence (Course...accurately), and students enrolled (151).
I used the separate() function to split that column into 4:
df1 <- separate(df, example, c("Rating", "Number of rating", "Sentence", "Students"), sep = " ")
However, this does not behave as expected.
Any ideas?
UPDATE:
This is what I get with your comment, @nicola:
> df1 <- separate(df, example, c("Rating", "Number of rating", "Sentence", "Students"), sep=" {4,}")
Warning message:
Expected 4 pieces. Additional pieces discarded in 1 rows [1].
How about this:
x <- str_split(example, " ") %>%
unlist()
x <- x[x != ""]
df <- tibble("a", "b", "c", "d")
df[1, ] <- x
colnames(df) <- c("Rating", "Number of rating", "Sentence", "Students")
> str(df)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1 obs. of 4 variables:
$ Rating : chr "4.6"
$ Number of rating: chr " (19 ratings)"
$ Sentence : chr " Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of ra"| __truncated__
$ Students : chr "151 students enrolled"
There are two keys to the answer. The first is the correct regex to use as the separator: sep = "[[:space:]]{2,}", which means two or more whitespace characters (\\s{2,} would be a more common alternative). The second is that your example actually has a lot of trailing whitespace, which separate() tries to put into another column. It can simply be removed using trimws(). The solution therefore looks like this:
library(tidyr)
library(dplyr)
example <- "4.6 (19 ratings) Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately. 151 students enrolled "
df <- data.frame(example)
df_new <- df %>%
mutate(example = trimws(example)) %>%
separate(col = "example",
into = c("rating", "number_of_ratings", "sentence", "students_enrolled"),
sep = "[[:space:]]{2,}")
as_tibble(df_new)
# A tibble: 1 x 4
rating number_of_ratings sentence students_enrolled
<chr> <chr> <chr> <chr>
1 4.6 (19 ratings) Course Ratings are calculated from individual students’ ratings and a vari~ 151 students enr~
tibble is only used for formatting the output.
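The same two steps (trim the trailing whitespace, then split on runs of two or more whitespace characters) can also be sketched in base R without tidyr. The string below is a shortened, hypothetical stand-in for the original one, written with explicit runs of spaces between the four parts.

```r
# Hypothetical shortened example with several spaces between the parts
example <- "4.6    (19 ratings)    Course ratings reflect course quality.    151 students enrolled    "

# 1) trimws() drops the trailing run of spaces
# 2) the regex splits only where two or more whitespace characters meet
parts <- strsplit(trimws(example), "[[:space:]]{2,}")[[1]]

df_new <- data.frame(
  rating            = as.numeric(parts[1]),
  number_of_ratings = as.integer(gsub("[^0-9]", "", parts[2])),
  sentence          = parts[3],
  students_enrolled = as.integer(gsub("[^0-9]", "", parts[4])),
  stringsAsFactors  = FALSE
)
df_new
```

Single spaces inside the sentence are left alone because the pattern requires at least two whitespace characters in a row.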
Certainly possible with the stringr package and a bit of regular expressions:
rating_mean n_ratings n_students descr
1 4.65 19 151 "Course (...) accurately."
Code
library(stringr)
# create result data frame
result <- data.frame(cbind(rating_mean = 0, n_ratings = 0, n_students = 0, descr = 0))
# loop through rows of example data frame
# `example` is the one-column data frame defined under Data below
for (i in seq_len(nrow(example))){
# replace spaces
example[i, 1] <- gsub("\\s+", " ", example[i, 1])
# match and extract mean rating
result[i, 1] <- as.numeric(str_match(example[i, 1], "^[0-9]+\\.[0-9]+"))
# match and extract number of ratings
result[i, 2] <- as.numeric(str_match(str_match(example[i, 1], "\\(.+\\)"), "[0-9]+"))
# match and extract number of enrolled students
result[i, 3] <- as.numeric(str_match(str_match(example[i, 1], "\\s[0-9].+$"), "[0-9]+"))
# match and extract sentence
result[i, 4] <- str_match(example[i, 1], "[A-Z].+\\.")
}
Data
example <- "4.65 (19 ratings) Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately. 151 students enrolled "
example <- data.frame(example, stringsAsFactors = FALSE)

Selecting specific rows from a dataframe based on a column match in another dataframe in R

I have a question. Please help.
I have two dataframes, data1 and data2.
data1 has the following data
HHID  blockid  serial_number  name
100   1        1              xxx
100   2        2              yyy
100   1        3              zzz
200   1        1              sss
200   1        2              ddd
data2 is as below
HHID  serial  hospital
100   3       Delhi
200   2       paris
Now, I want to select rows in data1 based on HHID and serial in data2. For example, in data2 we can see a row with HHID 100 and serial 3, so I want to select only the row from data1 where HHID is 100 and serial is 3. Similarly for HHID 200 and serial 2. Also, when I select a row from data1, I don't want any extra columns from data2. All I care about is whether HHID and serial in data2 match in data1. If they do, then I need the complete row from data1. So the output should be as follows
HHID  blockid  serial  name
100   1        3       zzz
200   1        2       ddd
Can somebody help?
Thank you
You can create a unique ID for each data frame like so:
#data frame definitions
data1 <- data.frame(HHID = c(100,100,100,200,200), blockid = c(1,2,1,1,1), serial_number = c(1,2,3,1,2), name = c('xxx', 'yyy', 'zzz', 'sss', 'ddd'))
data2 <- data.frame(HHID = c(100,200), serial = c(3,2), hospital = c('Delhi', 'paris'))
#unique identifier (the separator avoids collisions, e.g. 10 + 11 vs 101 + 1)
data1$unique <- paste(data1$HHID, data1$serial_number, sep = '_')
data2$unique <- paste(data2$HHID, data2$serial, sep = '_')
Then, you can use the subset function to isolate rows in data1, like so:
result <- subset(data1, unique %in% data2$unique)
I would suggest the following.
First I recreate the data:
library(tidyverse)
data1 <- read.table(text="HHID blockid serial_number name
100 1 1 xxx
100 2 2 yyy
100 1 3 zzz
200 1 1 sss
200 1 2 ddd", sep = " ", stringsAsFactors = F, header = T)
data2 <- read.table(text="HHID serial hospital
100 3 Delhi
200 2 paris", sep = " ", stringsAsFactors = F, header = T)
That's my suggestion
results <- data1 %>%
rename(serial=serial_number) %>%
right_join(data2, by=c("HHID", "serial")) %>%
select(-hospital) # get rid of the hospital column
results
If you are not familiar with the tidyverse, you can execute every line step by step until the %>% to see the single steps. That's the output:
HHID blockid serial name
1 100 1 3 zzz
2 200 1 2 ddd
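For readers who prefer base R, the same filtering can be sketched with merge(), keeping only the key columns of data2 so that no extra columns come along. This is an alternative to the answers above, not part of them. With dplyr, semi_join(data1, data2, by = c("HHID", "serial_number" = "serial")) should give the same rows.

```r
data1 <- data.frame(HHID = c(100, 100, 100, 200, 200),
                    blockid = c(1, 2, 1, 1, 1),
                    serial_number = c(1, 2, 3, 1, 2),
                    name = c("xxx", "yyy", "zzz", "sss", "ddd"),
                    stringsAsFactors = FALSE)
data2 <- data.frame(HHID = c(100, 200),
                    serial = c(3, 2),
                    hospital = c("Delhi", "paris"),
                    stringsAsFactors = FALSE)

# Inner join on (HHID, serial); only the two key columns of data2 are used,
# so hospital never enters the result
result <- merge(data1, data2[c("HHID", "serial")],
                by.x = c("HHID", "serial_number"),
                by.y = c("HHID", "serial"))
result
```

Note that merge() places the key columns first, so the column order of the result differs from that of data1.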

select duplicates from two columns by row and create a new variable in R

I have a data frame in which I have different duplicates of ID and dates. I just want to detect the duplicates of one column that are also in the other so I can say:
1. Remove the rows with a duplicate id, a duplicate datee and a missing T (the second record in this table).
2. And then say: if there is a duplicate id and a duplicate date, choose the row with T=="high".
id<-c("a", "a", "a", "a", "b", "c")
datee<-c("12/02/10", "12/02/10", "12/02/10","10/03/11", "10/04/18","1/04/18" )
T<-c("high", NA, "low","high", "low", "medium")
mydata<-data.frame(id, datee, T)
It looks like this:
id datee T
a 12/02/10 high
a 12/02/10 <NA>
a 12/02/10 low
a 10/03/11 high
b 10/04/18 low
c 1/04/18 medium
A step by step solution
Step 1 - Remove missing
mydata<-mydata[!is.na(mydata[,3]),]
Step 2 - Identify duplicated rows on ID
dup_rows_ID<-duplicated(mydata[,c(1)],fromLast = TRUE) | duplicated(mydata[,c(1)],fromLast = FALSE)
mydata_dup<-mydata[dup_rows_ID,]
Step 3 - Identify duplicated rows on ID and datee
dup_rows_ID_datee<-duplicated(mydata_dup[,c(1,2)],fromLast = TRUE) | duplicated(mydata_dup[,c(1,2)],fromLast = FALSE)
Step 4 - Select T="high" among the rows duplicated on ID and datee
mydata_dup2<-mydata_dup[!dup_rows_ID_datee | mydata_dup$T=="high",]
Your output
rbind(mydata_dup[rownames(mydata_dup) %in% rownames(mydata_dup2),],
mydata[!dup_rows_ID,])
id datee T
1 a 12/02/10 high
4 a 10/03/11 high
5 b 10/04/18 low
6 c 1/04/18 medium
For ID==a you have two dates with T=="high", so you have to choose whether you want the one with the higher or the lower datee.
You can start like this:
is_duplicate <- lapply(X = mydata, FUN = duplicated, incomparables = FALSE)
is_na <- lapply(X = mydata, FUN = is.na)
and then use the resulting lists to, for example, remove the rows with a duplicate id, a duplicate datee and a missing T like this:
drop_idx <- which(is_duplicate$id & is_duplicate$datee & is_na$T)
mydata[-drop_idx, ] # assumes drop_idx is not empty
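Both rules from the question can also be applied with a combined id/datee key. The following is a minimal sketch, not a definitive solution. It renames the T column to level (an assumption made here so that the built-in shorthand T for TRUE is not masked), and the names key, key1, step1 and result are illustrative.

```r
id    <- c("a", "a", "a", "a", "b", "c")
datee <- c("12/02/10", "12/02/10", "12/02/10", "10/03/11", "10/04/18", "1/04/18")
level <- c("high", NA, "low", "high", "low", "medium")  # called T in the question
mydata <- data.frame(id, datee, level, stringsAsFactors = FALSE)

# A row is a duplicate when its (id, datee) pair occurs more than once
key    <- paste(mydata$id, mydata$datee, sep = "_")
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Rule 1: drop duplicated rows whose level is missing
keep1 <- !(is_dup & is.na(mydata$level))
step1 <- mydata[keep1, ]

# Rule 2: among pairs that are still duplicated, keep only level == "high"
key1    <- key[keep1]
is_dup1 <- duplicated(key1) | duplicated(key1, fromLast = TRUE)
result  <- step1[!is_dup1 | step1$level == "high", ]
result
```

On the sample data this keeps the original rows 1, 4, 5 and 6, which matches the output of the step-by-step answer.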

Match 2 data.frames when one value is doubled - NA occurs

dat is a data.frame with a location column called area.
map is a data.frame which contains a location column called area and a salary column.
I want to add the salary column from map to dat according to the locations.
map <- data.frame(
area = c(
"MAZOWIECKIE",
"WIELKOPOLSKIE",
"MALOPOLSKIE",
"LUBELSKIE",
"SLASKIE",
"POMORZE",
"DOLNOSLASKIE",
"KUJAWSKOPOMORSKIE",
"LUBUSKIE",
"PODKARPACKIE",
"SWIETOKRZYSKIE"
),
salary = c(12962, 10449, 11204, 10626, 8375, 10399, 12883, 9136, 6000, 12843, 7800)
)
dat <- data.frame(area = c("OPOLSKIE",
"SWIETOKRZYSKIE",
"KUJAWSKOPOMORSKIE",
"MAZOWIECKIE",
"POMORZE",
"SLASKIE",
"WARMINSKOMAZURSKIE",
"POMORZE",
"DOLNOSLASKIE",
"WIELKOPOLSKIE",
"LODZKIE",
"PODLASKIE",
"MALOPOLSKIE",
"LUBUSKIE",
"PODKARPACKIE",
"LUBELSKIE"
)
)
matched <- match(map$area, dat$area)
stopifnot(all(!is.na(matched)))
dat$salary <- NA
dat$salary[matched] <- map$salary
# why POMORZE = NA ?!
dat[dat$area == "POMORZE",]
# the value should be doubled
map[map$area == "POMORZE",]
The problem is that one location in dat is doubled. In this case I want the salary for this location to appear twice in dat as well.
I can rewrite map$salary with the mean aggregated by map$area and the option is.na = F, but maybe there is a better way to do this?
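The NA appears because match(map$area, dat$area) returns only the first position in dat for each area in map, so the second POMORZE row is never assigned. One common way around this is to reverse the lookup: find each dat$area in map and index the salaries with the result. A minimal sketch on a hypothetical subset of the data:

```r
# Hypothetical subset of the data from the question
map <- data.frame(area = c("MAZOWIECKIE", "POMORZE", "SLASKIE"),
                  salary = c(12962, 10399, 8375),
                  stringsAsFactors = FALSE)
dat <- data.frame(area = c("OPOLSKIE", "POMORZE", "SLASKIE", "POMORZE"),
                  stringsAsFactors = FALSE)

# For every row of dat, find its position in map, then pull the salary;
# areas missing from map (here OPOLSKIE) get NA
dat$salary <- map$salary[match(dat$area, map$area)]
dat
```

Every repeated area in dat now receives the same salary, and no aggregation of map is needed as long as each area occurs only once in map.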
