I have a dataframe containing N=11 variables. The X's are column names defined by R.
filenames=gtools::mixedsort(c("a1", "a10", "a11", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9"))
filenames<-data.frame(t(filenames))
list(filenames)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11
1 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11
How do I insert 6 blank columns before the first filled column (a1.csv), and subsequently 5 blank columns between each of the filled columns? The output should look something like:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18.... X67 X68 X69 X70 X71 X72
1 a1 a2 .... a11
I need a way that can automatically perform the same operation of adding blank columns whatever the value of N.
A bit primitive but this should do what you want:
library(tidyverse)
library(stringr)
library(gtools)
filenames=gtools::mixedsort(c("a1", "a10", "a11", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9"))
filenames <- data.frame(t(filenames))
# Note: I am not putting filenames into a list - I use it in the form of a data frame
empty_col <- data.frame(matrix(ncol = 5, nrow = nrow(filenames), data = ""))
for(i in seq_along(filenames)){
filenames <- filenames %>%
add_column(empty_col, .before=(i + (i-1)*ncol(empty_col))) # was .after
}
filenames <- cbind(filenames, empty_col) # this adds 5 columns at the end since .before and .after can't be used in the same function
names(filenames) <- ifelse(str_starts(names(filenames), 'X'), "", names(filenames))
Result:
1 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11
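As an alternative to the loop, here is a minimal base R sketch that computes the target position of every filled column directly and fills a blank vector in one step, so it works for any N. The helper name insert_blank_columns and its lead, gap and trail arguments are made up for this illustration, and the resulting columns get R's default names (V1, V2, ...) rather than X1, X2, ...

```r
# Hypothetical helper: spread `values` over one row, with `lead` blank
# columns before the first value, `gap` blanks between neighbours and
# `trail` blanks after the last value.
insert_blank_columns <- function(values, lead = 6, gap = 5, trail = 0) {
  n <- length(values)
  # Target column of each value: lead + 1, then every (gap + 1) columns.
  pos <- lead + 1 + (seq_len(n) - 1) * (gap + 1)
  out <- rep("", lead + n + max(n - 1, 0) * gap + trail)
  out[pos] <- values
  as.data.frame(t(out), stringsAsFactors = FALSE)
}

filenames <- c("a1", "a2", "a3")
wide <- insert_blank_columns(filenames)
ncol(wide)                       # 6 + 3 + 2 * 5 = 19 columns
which(unlist(wide[1, ]) != "")   # positions of the filled columns
```

With the 11 file names from the question and trail = 5, the same call yields 72 columns, which matches the X1 ... X72 layout sketched in the question.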
I have a data frame (data2) to which I want to add columns ("M1", "M2", "M3") based on the presence of certain strings in another table (data1). For example, for "M1", I want to create a column named "M1" in data2 that contains a row's TOT value if that value is associated with M1 in data1.
In some cases data2 holds a value such as "t1;t4"; for each row, the elements separated by ; should be compared individually.
For clarity, the output I want is shown below.
To do that, I tried looping over a vector containing all the module values ("M1"..."M3"): filter data1 for the current value, store the corresponding GNE values in a vector, and add the column to data2 with mutate. I would like to know how to name the new column "M1", "M2" or "M3" inside the loop.
If someone has a better way to obtain this output, that would also be useful, as I doubt this is the simplest approach. In my real data I have 32 new columns to add this way.
data1=data.frame(Module=c("M1", "M1", "M2", "M3", "M3", "M3", "M3"), GNE= c("t1", "t3", "t5", "t8", "t2", "t9", "t12"))
data2=data.frame(ID=c("A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8", "A9", "A10", "A11"), TOT=c("t1;t4", "t2", "t3", "t4", "t5", "t6", "t7", "t8;t9", "t9", "t10", "t8-8"))
list=c("M1", "M2", "M3")
> data1
Module GNE
1 M1 t1
2 M1 t3
3 M2 t5
4 M3 t8
5 M3 t2
6 M3 t9
7 M3 t12
> data2
ID TOT
1 A1 t1;t4
2 A2 t2
3 A3 t3
4 A4 t4
5 A5 t5
6 A6 t6
7 A7 t7
8 A8 t8;t9
9 A9 t9
10 A10 t10
11 A11 t8-8
for (a in list) {
data1_int=data1 %>% filter(Module=={{a}})
data1_int=data1_int$GNE
data2_int=data2 %>% mutate(New1 = map_chr(strsplit(TOT, ";"), ~ str_c(intersect(., data1_int), collapse = ";"))) %>% select(New1)
data2=cbind(data2, data2_int)
}
out=data.frame(ID=c("A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8", "A9", "A10", "A11"),
TOT=c("t1;t4", "t2", "t3", "t4", "t5", "t6", "t7", "t8;t9", "t9", "t10", "t8-8"),
M1=c("t1", "", "t3", "", "", "", "", "", "", "", ""),
M2=c("", "", "", "", "t5", "", "", "", "", "", ""),
M3=c("", "t2", "", "", "", "", "", "t8;t9", "t9", "", ""))
    ID   TOT M1 M2    M3
1   A1 t1;t4 t1
2   A2    t2          t2
3   A3    t3 t3
4   A4    t4
5   A5    t5    t5
6   A6    t6
7   A7    t7
8   A8 t8;t9       t8;t9
9   A9    t9          t9
10 A10   t10
11 A11  t8-8
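On the naming part of the question: one possible way to name the column inside the loop is to assign with data2[[m]], where m holds the module name as a string. The sketch below is base R only and uses the sample data from the question; the variable names modules, genes and parts are chosen just for this illustration. With dplyr, mutate(!!m := ...) should give the same naming inside the loop.

```r
data1 <- data.frame(
  Module = c("M1", "M1", "M2", "M3", "M3", "M3", "M3"),
  GNE    = c("t1", "t3", "t5", "t8", "t2", "t9", "t12"),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  ID  = paste0("A", 1:11),
  TOT = c("t1;t4", "t2", "t3", "t4", "t5", "t6", "t7", "t8;t9", "t9", "t10", "t8-8"),
  stringsAsFactors = FALSE
)

modules <- unique(data1$Module)
for (m in modules) {
  # GNE values that belong to the current module
  genes <- data1$GNE[data1$Module == m]
  # Split each TOT on ";", keep the pieces found in `genes`,
  # and store the result in a column named after the module.
  data2[[m]] <- vapply(
    strsplit(data2$TOT, ";", fixed = TRUE),
    function(parts) paste(intersect(parts, genes), collapse = ";"),
    character(1)
  )
}
data2
```

Because the column name is taken from m, the same loop covers 32 modules without any extra code.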
Here's a different approach without using a for loop.
library(dplyr)
library(tidyr)
data1 %>%
filter(GNE %in% data2$TOT) %>%
mutate(col = GNE) %>%
complete(Module, GNE = data2$TOT, fill = list(col = '')) %>%
pivot_wider(names_from = Module, values_from = col) %>%
left_join(data2, by = c('GNE' = 'TOT')) %>%
select(ID, GNE, everything()) %>%
arrange(order(gtools::mixedorder(ID)))
# ID GNE M1 M2 M3
# <chr> <chr> <chr> <chr> <chr>
# 1 A1 t1 "t1" "" ""
# 2 A2 t2 "" "" "t2"
# 3 A3 t3 "t3" "" ""
# 4 A4 t4 "" "" ""
# 5 A5 t5 "" "t5" ""
# 6 A6 t6 "" "" ""
# 7 A7 t7 "" "" ""
# 8 A8 t8 "" "" "t8"
# 9 A9 t9 "" "" "t9"
#10 A10 t10 "" "" ""
Try using dcast from data.table (or reshape2):
library(dplyr)
library(data.table)
data1 <- data1 %>% as.data.table()
data2 <- data2 %>% as.data.table()
data2 %>%
left_join(data1, by = c("TOT" = "GNE")) %>%
dcast(ID + TOT ~ Module, value.var = "TOT", fill = "")
out:
     ID TOT  NA M1 M2 M3
 1:  A1  t1     t1
 2: A10 t10 t10
 3:  A2  t2           t2
 4:  A3  t3     t3
 5:  A4  t4  t4
 6:  A5  t5        t5
 7:  A6  t6  t6
 8:  A7  t7  t7
 9:  A8  t8           t8
10:  A9  t9           t9
If you need to keep the order of ID (and remove the NA column):
data2 %>%
left_join(data1, by = c("TOT" = "GNE")) %>%
dcast(ID + TOT ~ Module, value.var = "TOT", fill = "") %>%
select(-`NA`) %>%
arrange(order(gtools::mixedorder(ID)))
out:
     ID TOT M1 M2 M3
 1:  A1  t1 t1
 2:  A2  t2       t2
 3:  A3  t3 t3
 4:  A4  t4
 5:  A5  t5    t5
 6:  A6  t6
 7:  A7  t7
 8:  A8  t8       t8
 9:  A9  t9       t9
10: A10 t10
After the edit to the question, it's a bit more complicated:
data2 %>%
tidyr::separate(TOT, into = c("TOT1","TOT2"), sep = "\\;", remove = F) %>%
left_join(data1, by = c("TOT1" = "GNE")) %>%
left_join(data1, by = c("TOT2" = "GNE")) %>%
mutate(TOT2 = ifelse(is.na(Module.y), NA, TOT2),
TOT1 = ifelse(is.na(Module.x), NA, TOT1)) %>%
tidyr::unite("tmp", c(TOT1, TOT2), na.rm = T, sep = ";") %>%
mutate(Module = coalesce(Module.x, Module.y)) %>%
dcast(ID + TOT ~ Module, value.var = "tmp", fill = "") %>%
select(-`NA`) %>%
arrange(order(gtools::mixedorder(ID)))
out:
     ID   TOT M1 M2    M3
 1:  A1 t1;t4 t1
 2:  A2    t2          t2
 3:  A3    t3 t3
 4:  A4    t4
 5:  A5    t5    t5
 6:  A6    t6
 7:  A7    t7
 8:  A8 t8;t9       t8;t9
 9:  A9    t9          t9
10: A10   t10
11: A11  t8-8
I have a huge data frame with the following structure (the four variables are just an example; there are many more variables):
Date. Ticker. Revenue. Price.
a1 b1 c1 d1
a2 b1 c2 d2
a3 b1 c3 d3
a4 b1 c4 d4
a5 b1 c5 d5
a1 b2 c6 d6
a2 b2 c7 d7
a3 b2 c8 d8
a4 b2 c9 d9
a5 b2 c10 d10
...
The tickers b1 and b2 are in order in the example, but in the real df they might be mixed up.
What I want is to create a new data frame in which the prices are shifted back by t intervals within each ticker. For example, if I need to go 3 years back, the result will be:
Date. Ticker. Revenue. Price.
a1 b1 c1
a2 b1 c2
a3 b1 c3
a4 b1 c4 d1
a5 b1 c5 d2
a1 b2 c6
a2 b2 c7
a3 b2 c8
a4 b2 c9 d6
a5 b2 c10 d7
...
We can use lag in dplyr to go back t intervals.
library(dplyr)
t <- 3
df %>% group_by(Ticker) %>% mutate(Price= lag(Price, t))
# Date Ticker Revenue Price
# <chr> <chr> <chr> <chr>
# 1 a1 b1 c1 NA
# 2 a2 b1 c2 NA
# 3 a3 b1 c3 NA
# 4 a4 b1 c4 d1
# 5 a5 b1 c5 d2
# 6 a1 b2 c6 NA
# 7 a2 b2 c7 NA
# 8 a3 b2 c8 NA
# 9 a4 b2 c9 d6
#10 a5 b2 c10 d7
Or shift in data.table:
library(data.table)
setDT(df)[, Price := shift(Price, t), Ticker]
data
df <- structure(list(Date = c("a1", "a2", "a3", "a4", "a5", "a1", "a2",
"a3", "a4", "a5"), Ticker = c("b1", "b1", "b1", "b1", "b1", "b2",
"b2", "b2", "b2", "b2"), Revenue = c("c1", "c2", "c3", "c4",
"c5", "c6", "c7", "c8", "c9", "c10"), Price = c("d1", "d2", "d3",
"d4", "d5", "d6", "d7", "d8", "d9", "d10")),
class = "data.frame", row.names = c(NA, -10L))
We can use data.table methods
library(data.table)
setDT(df)[, Price. := shift(Price., 3, fill = ""), Ticker.]
or with dplyr
library(dplyr)
df %>%
group_by(Ticker.) %>%
mutate(Price = lag(Price., 3, default = ""))
-output
# A tibble: 10 x 5
# Groups: Ticker. [2]
# Date. Ticker. Revenue. Price. Price
# <chr> <chr> <chr> <chr> <chr>
# 1 a1 b1 c1 d1 ""
# 2 a2 b1 c2 d2 ""
# 3 a3 b1 c3 d3 ""
# 4 a4 b1 c4 d4 "d1"
# 5 a5 b1 c5 d5 "d2"
# 6 a1 b2 c6 d6 ""
# 7 a2 b2 c7 d7 ""
# 8 a3 b2 c8 d8 ""
# 9 a4 b2 c9 d9 "d6"
#10 a5 b2 c10 d10 "d7"
Or using base R with ave
df$Price <- with(df, ave(Price., Ticker., FUN =
function(x) c(rep('', 3), head(x, -3))))
data
df <- structure(list(Date. = c("a1", "a2", "a3", "a4", "a5", "a1",
"a2", "a3", "a4", "a5"), Ticker. = c("b1", "b1", "b1", "b1",
"b1", "b2", "b2", "b2", "b2", "b2"), Revenue. = c("c1", "c2",
"c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10"), Price. = c("d1",
"d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10")), class = "data.frame",
row.names = c(NA,
-10L))
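Since the question mentions that tickers may be mixed up in the real data, here is a small base R sketch of the same grouped lag on shuffled rows. It assumes the rows can be put in time order within each ticker by sorting on Date (plain string sorting, so real dates should be of class Date or similar); the helper name lag_by_group and the sample rows are made up for this illustration.

```r
# Hypothetical helper: shift `x` back by `k` positions within each group.
lag_by_group <- function(x, group, k, fill = NA) {
  # ave() applies FUN to each group and returns the values in the
  # original row order, so the groups do not need to be contiguous.
  ave(x, group, FUN = function(v) {
    n <- length(v)
    s <- min(k, n)                 # never pad more than the group size
    c(rep(fill, s), head(v, n - s))
  })
}

# Rows deliberately shuffled across tickers and dates
df <- data.frame(
  Date   = c("a2", "a1", "a1", "a3", "a2", "a3"),
  Ticker = c("b1", "b2", "b1", "b2", "b2", "b1"),
  Price  = c("d2", "d6", "d1", "d8", "d7", "d3"),
  stringsAsFactors = FALSE
)

# The lag only makes sense once rows are in time order within a ticker.
df <- df[order(df$Ticker, df$Date), ]
df$PriceLag <- lag_by_group(df$Price, df$Ticker, k = 1)
df
```

The grouping itself does not depend on row order, but the lag does, which is why the sort comes first.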
I have two data sets, data1 and data2:
data1 <- data.frame(ID = 1:6,
A = c("a1", "a2", NA, "a4", "a5", NA),
B = c("b1", "b2", "b3", NA, "b5", NA),
stringsAsFactors = FALSE)
data1
ID A B
1 a1 b1
2 a2 b2
3 NA b3
4 a4 NA
5 a5 b5
6 NA NA
and
data2 <- data.frame(ID = 1:6,
A = c(NA, "a2", "a3", NA, "a5", "a6"),
B = c(NA, "b2.wrong", NA, "b4", "b5", "b6"),
stringsAsFactors = FALSE)
data2
ID A B
1 NA NA
2 a2 b2.wrong
3 a3 NA
4 NA b4
5 a5 b5
6 a6 b6
I would like to merge them by ID so that the resulting merged dataset, data.merged, populates fields from both datasets but takes the value from data1 whenever both datasets have one.
I.e., I would like the final dataset, data.merged, to be:
ID A B
1 a1 b1
2 a2 b2
3 a3 b3
4 a4 b4
5 a5 b5
6 a6 b6
I have looked around, finding similar but not exact answers.
You can join the data and use coalesce to select the first non-NA value.
library(dplyr)
data1 %>%
inner_join(data2, by = 'ID') %>%
mutate(A = coalesce(A.x, A.y),
B = coalesce(B.x, B.y)) %>%
select(names(data1))
# ID A B
#1 1 a1 b1
#2 2 a2 b2
#3 3 a3 b3
#4 4 a4 b4
#5 5 a5 b5
#6 6 a6 b6
Or in base R, comparing values with NA:
transform(merge(data1, data2, by = 'ID'),
A = ifelse(is.na(A.x), A.y, A.x),
B = ifelse(is.na(B.x), B.y, B.x))[names(data1)]
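If there are many shared columns, writing one line per column gets tedious. A possible generalisation in base R loops over every column the two data sets have in common; the helper name coalesce_merge is hypothetical, and the sketch assumes that the shared columns have compatible types in both data sets.

```r
# Hypothetical helper: merge x and y by `by`, preferring values from x
# and falling back to y where x is NA.
coalesce_merge <- function(x, y, by = "ID") {
  merged <- merge(x, y, by = by, suffixes = c(".x", ".y"))
  shared <- setdiff(intersect(names(x), names(y)), by)
  for (col in shared) {
    first  <- merged[[paste0(col, ".x")]]
    second <- merged[[paste0(col, ".y")]]
    merged[[col]] <- ifelse(is.na(first), second, first)
  }
  merged[names(x)]   # same columns, same order as x
}

data1 <- data.frame(ID = 1:6,
                    A = c("a1", "a2", NA, "a4", "a5", NA),
                    B = c("b1", "b2", "b3", NA, "b5", NA),
                    stringsAsFactors = FALSE)
data2 <- data.frame(ID = 1:6,
                    A = c(NA, "a2", "a3", NA, "a5", "a6"),
                    B = c(NA, "b2.wrong", NA, "b4", "b5", "b6"),
                    stringsAsFactors = FALSE)

data.merged <- coalesce_merge(data1, data2)
data.merged
```

Like inner_join, merge with its default arguments keeps only IDs present in both data sets; passing all = TRUE to merge would be the place to start if unmatched IDs should be kept.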
I have a DF$vector which looks like this:
A10 A50
C1 C4
B1
A7
C3
B1 B4
I am looking for a way to order it as follows:
A10 A50
A7
B1 B4
B1
C1 C4
C3
I tried to use gsub:
vector[order(gsub("([A-Z]+)([0-9]+)", "\\1", vector),
as.numeric(gsub("([A-Z]+)([0-9]+)", "\\2", vector)))]
But it didn't return what I want.
Thank you for any suggestions.
We can use order from base R
df1[order(sub("\\d+", "", df1[,1]), as.numeric(sub("\\D+", "", df1[,1])), df1[,2] == ""),]
# A10 A50
#3 A7
#5 B1 B4
#2 B1
#1 C1 C4
#4 C3
data
df1 <-structure(list(A10 = c("C1", "B1", "A7", "C3", "B1"), A50 = c("C4",
"", "", "", "B4")), .Names = c("A10", "A50"), class = "data.frame",
row.names = c(NA, -5L))
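In programming languages, strings are compared character by character, with letters increasing in alphabetical order, so A is considered less than B, and so on. This is a plain string comparison, so it works here because every numeric part has a single digit (as a string, "A10" would sort before "A7"). To order the above, just use the code: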
df1$r=rank(df1$A10,ties.method = "last")
df1[order(df1$r),-ncol(df1)]
A10 A50
3 A7
5 B1 B4
2 B1
1 C1 C4
4 C3
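To see why the letter and the number need to be ordered separately once the numbers have more than one digit, here is a small sketch on a made-up vector (codes is not from the question). It assumes every value is a run of letters followed by a run of digits.

```r
codes <- c("A10", "A7", "B1", "A2")

# Plain string sorting compares character by character,
# so "A10" comes before "A2" and "A7".
sort(codes)

# Split each code into its letter part and its numeric part,
# then order by letters first and by number second.
letter_part <- sub("\\d+$", "", codes)
number_part <- as.numeric(sub("^\\D+", "", codes))
natural <- codes[order(letter_part, number_part)]
natural
```

Here natural is "A2" "A7" "A10" "B1", which is the order most people expect for such labels.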
I have a dataset that has following format:
id1 a1 b2 x1;x2;x3
id2 a2 b3 x4;x5
id3 a4 b5 x6
id4 a7 b7 x7;x8
The first 3 columns (id, a, b) each hold a single value, but the last column holds multiple values separated by ;. How can I "wrap" these into new rows? Such as:
id1 a1 b2 x1
id1 a1 b2 x2
id1 a1 b2 x3
id2 a2 b3 x4
id2 a2 b3 x5
id3 a4 b5 x6
id4 a7 b7 x7
id4 a7 b7 x8
We can use cSplit
library(splitstackshape)
cSplit(df1, "v4", ";", "long")
# v1 v2 v3 v4
#1: id1 a1 b2 x1
#2: id1 a1 b2 x2
#3: id1 a1 b2 x3
#4: id2 a2 b3 x4
#5: id2 a2 b3 x5
#6: id3 a4 b5 x6
#7: id4 a7 b7 x7
#8: id4 a7 b7 x8
data
df1 <- structure(list(v1 = c("id1", "id2", "id3", "id4"), v2 = c("a1",
"a2", "a4", "a7"), v3 = c("b2", "b3", "b5", "b7"), v4 = c("x1;x2;x3",
"x4;x5", "x6", "x7;x8")), .Names = c("v1", "v2", "v3", "v4"),
class = "data.frame", row.names = c(NA, -4L))
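If installing a package is not an option, the same reshaping can be sketched in base R: split the last column, repeat each row once per piece, and replace the column with the pieces. The variable names parts and long are chosen for this illustration only. Note that a row whose v4 is an empty string would be dropped by this approach, since strsplit returns no pieces for it.

```r
df1 <- data.frame(
  v1 = c("id1", "id2", "id3", "id4"),
  v2 = c("a1", "a2", "a4", "a7"),
  v3 = c("b2", "b3", "b5", "b7"),
  v4 = c("x1;x2;x3", "x4;x5", "x6", "x7;x8"),
  stringsAsFactors = FALSE
)

# One character vector of pieces per original row
parts <- strsplit(df1$v4, ";", fixed = TRUE)

# Repeat each row as many times as it has pieces, then fill in the pieces
long <- df1[rep(seq_len(nrow(df1)), lengths(parts)), ]
long$v4 <- unlist(parts)
rownames(long) <- NULL
long
```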