for loop acting weird - r

I have two dataframes:
df_1 <- data.frame(c("a_b", "a_c", "a_d"))
df_2 <- data.frame(matrix(ncol = 2))
And I would like to loop over df_1 in order to fill df_2:
for (i in (1:(length(df_1[,1])))){
for (j in (1:2)) {
df_2[i*j,] <-str_split_fixed(df_1[i,1], "_", 2)
}
}
I would like df_2 to look like:
col1 col2
a b
a b
a c
a c
a d
a d
But instead I get:
col1 col2
a b
a c
a d
a c
NA NA
a d
I must be doing something wrong, but cannot figure it out.
I would also like to use apply (or something like it), but I am pretty new to R and not yet comfortable with the apply family.
Thanks for your help!

Another way would be
df_1 <- data.frame(col1 = c("a_b", "a_c", "a_d"))
df_2 <- as.data.frame(do.call(rbind, strsplit(as.character(df_1$col1), split = "_", fixed = TRUE)))
df_2[rep(1:nrow(df_2), each = 2), ]
V1 V2
1 a b
1.1 a b
2 a c
2.1 a c
3 a d
3.1 a d
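Since the question also asks about the apply family, here is a minimal base R sketch of the same idea using vapply. It repeats the strings first and then splits them. The result column names col1 and col2 are my own choice, not something taken from the question.

```r
df_1 <- data.frame(col1 = c("a_b", "a_c", "a_d"))

# Repeat every string twice, then split each copy on the literal "_".
# as.character() guards against col1 being a factor (the default before R 4.0).
parts <- strsplit(rep(as.character(df_1$col1), each = 2), "_", fixed = TRUE)

# vapply pulls the first and the second piece out of every split string.
df_2 <- data.frame(
  col1 = vapply(parts, `[`, character(1), 1),
  col2 = vapply(parts, `[`, character(1), 2)
)
df_2
```

Repeating before splitting avoids the odd row names (1, 1.1, 2, ...) that row indexing with rep() produces.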

We can use cSplit with data.table approach
library(splitstackshape)
cSplit(df_1, 'col1', '_')[rep(seq_len(.N), each =2)]
# col1_1 col1_2
#1: a b
#2: a b
#3: a c
#4: a c
#5: a d
#6: a d
Or another option is tidyverse
library(tidyverse)
separate(df_1, col1, into=c("col_1", "col_2")) %>%
map_df(~rep(., each = 2))
# A tibble: 6 × 2
# col_1 col_2
# <chr> <chr>
#1 a b
#2 a b
#3 a c
#4 a c
#5 a d
#6 a d
NOTE: Both the answers are one-liners.
data
df_1 <- data.frame(col1 = c("a_b", "a_c", "a_d"))

This is a combination of two answers. With cSplit we split the column on _ and then repeat each row twice, assuming your column is named V1.
library(splitstackshape)
df_2 <- cSplit(df_1, "V1", "_")
df_2[rep(seq_len(nrow(df_2)),each = 2), ]
# V1_1 V1_2
#1: a b
#2: a b
#3: a c
#4: a c
#5: a d
#6: a d
Or, as @Sotos mentioned in the comments, we can use expandRows to accommodate everything in one line.
expandRows(cSplit(df_1, "V1", "_"), 2, count.is.col = FALSE)
# V1_1 V1_2
#1: a b
#2: a b
#3: a c
#4: a c
#5: a d
#6: a d
data
df_1 <- data.frame(V1 = c("a_b", "a_c", "a_d"))

OK, I started learning R this week, but if you want the result you presented, you can use your code with this fix:
library(stringr) # provides str_split_fixed
for (i in (1:(length(df_1[,1])))){
for (j in (1:2)) {
df_2[(i-1)*2+j,] <- str_split_fixed(df_1[i,1], "_", 2)
}
}
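I changed the index of df_2: (i-1)*2+j visits rows 1 to 6 exactly once, whereas i*j writes row 2 twice and never writes row 5.
I guess there is a better way than two for loops, but that is all I can do for the moment.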
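To see why i*j misbehaves, it helps to list the row index that each loop iteration writes to. A small sketch (the variable names are mine, not from the question):

```r
n_rows <- 3  # rows in df_1
n_rep  <- 2  # copies wanted per row

# expand.grid varies its first column fastest, so this lists (i, j)
# in the same order the nested loops visit them.
grid <- expand.grid(j = seq_len(n_rep), i = seq_len(n_rows))

buggy_index <- grid$i * grid$j                 # index used in the question
fixed_index <- (grid$i - 1) * n_rep + grid$j   # index used in the fix

buggy_index
fixed_index
```

The buggy sequence contains 2 twice and no 5, which matches the overwritten second row and the NA row in the question's output.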

I was trying to post a solution I found right after posting but it was misunderstood and was deleted:
"sometimes posting a question helps:
I was reading the right position in df_1, but I was saving the result in the wrong cell.
the answer to my original question should be something like this:
n <- 1
for (i in (1:(length(df_1[,1])))){
for (j in (1:2)) {
df_2[n,] <-str_split_fixed(df_1[i,1], "_", 2)
n <- n+1
}
}"

Related

How can I fill NA-values in a data frame column based on the values from an other column? [duplicate]

I'd like to fill the NA values in the F2 column with the most common F2 value within each F1 group.
F1 F2
1 A C
2 B D
3 A NA
4 A C
5 B NA
Desired outcome:
F1 F2
1 A C
2 B D
3 A C
4 A C
5 B D
Thank you for help
Here is a base R solution. First define a function for Mode (taken from here) and then apply it to your data frame, i.e.
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
df$F2 <- with(df, ave(F2, F1, FUN = function(i) replace(i, is.na(i), Mode(i))))
df
# F1 F2
#1 A C
#2 B D
#3 A C
#4 A C
#5 B D
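One caveat with the Mode function above: it counts NA like any other value, so a group where NA is the most frequent entry stays NA. A hedged sketch of a variant that ignores NA when counting follows. The names mode_non_na and fill_with_group_mode are hypothetical helpers of mine, and the extra sixth row is made up to trigger the edge case.

```r
# Most frequent non-NA value; returns NA of the same type if nothing is left.
mode_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(x[NA_integer_])
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Fill NA entries of `values` with the per-group mode.
fill_with_group_mode <- function(values, groups) {
  ave(values, groups, FUN = function(v) replace(v, is.na(v), mode_non_na(v)))
}

# Group B has one "D" and two NAs, so NA would win a plain count.
F1 <- c("A", "B", "A", "A", "B", "B")
F2 <- c("C", "D", NA, "C", NA, NA)
filled <- fill_with_group_mode(F2, F1)
filled
```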
Here is one way using dplyr :
library(dplyr)
df %>%
group_by(F1) %>%
mutate(F2 = replace(F2, is.na(F2),
names(sort(table(F2), decreasing = TRUE)[1])))
# F1 F2
# <chr> <chr>
#1 A C
#2 B D
#3 A C
#4 A C
#5 B D
In case of ties, preference is given to lexicographic order.
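A small illustration of the tie-breaking point, using which.max(table(x)) rather than the exact dplyr call above. Because table() sorts its names, the alphabetically first value wins a tie, while a first-seen count (as in the Mode function from the base R answer) can give a different result. The vector x is made up for the example.

```r
x <- c("D", "C", "D", "C")  # "C" and "D" both occur twice

# table() orders its names alphabetically, so which.max() picks "C".
table_mode <- names(which.max(table(x)))

# Counting in order of first appearance picks "D" instead.
ux <- unique(x)
first_seen_mode <- ux[which.max(tabulate(match(x, ux)))]

table_mode
first_seen_mode
```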
Try this:
First, in df2, I count the F2 values within each F1 group and keep the non-missing value with the highest count. That gives the most common F2 value per F1 group. I then join it back onto the original data frame, use mutate to fill the missing F2 values from the new variable F2_fill, and finally remove F2_fill from the data frame.
library(tidyverse)
df <- tribble(
~F1, ~F2,
'A', 'C',
'B' , 'D',
'A' ,NA,
'A', 'C',
'B', NA)
df2 <- df %>%
group_by(F1) %>%
count(F2) %>%
filter(!is.na(F2)) %>% # drop the NA counts first so they cannot win max(n)
filter(n == max(n)) %>%
select(-n) %>%
rename(F2_fill = F2)
df3 <- left_join(df,df2, by="F1") %>%
mutate(F2 = ifelse(is.na(F2), F2_fill,F2)) %>%
select(-F2_fill)
You can use ave with table and which.max, subsetting with is.na, when F2 is a character column.
i <- is.na(x$F2)
x$F2[i] <- ave(x$F2, x$F1, FUN=function(y) names(which.max(table(y))))[i]
x
# F1 F2
#1 A C
#2 B D
#3 A C
#4 A C
#5 B D
Data:
x <- data.frame(F1 = c("A", "B", "A", "A", "B")
, F2 = c("C", "D", NA, "C", NA))

Function that ignores missing columns

Say I have the following two data frames:
col1 <- c("a","b","c","d","e")
col2 <- c("A","B","C","D","E")
col1a <- c("a","b","c","d","e")
col2a <- c("A","B","C","D","E")
df1 <- data.frame(col1, col2)
df2 <- data.frame(col1a, col2a)
colnames(df1) <- c("c1","c2")
colnames(df2) <- c("c1","c3")
And I have the following function to rename column headers:
library(dplyr)
col_rename <- function(x) x %>% rename(new_c1 = c1, new_c2 = c2, new_c3 = c3)
When I run this function, I get an error because the columns named in the function do not all exist in the data frame.
df1 <- col_rename(df1)
Error: `c3` contains unknown variables
How can I make the function run only on the present columns, and ignore the ones not present, without removing or changing the column names specified in the function?
EDIT:
I can see how the example was a bit confusing. I have many dataframes with many columns. These columns are shared by some dataframes but not all. However, I want to rename all columns specified by the function, regardless of what is present in the dataframe. It looks something like this:
col1 <- c(1:5)
col2 <- c(1:5)
col3 <- c(1:5)
col4 <- c(1:5)
df1 <- data.frame(col1,col2,col3,col4)
df2 <- data.frame(col1,col2,col3,col4)
colnames(df1) <- c("c1","c2","c6","c8")
colnames(df2) <- c("c1","c3","c2","c8")
AB_rename <- function(x) x %>% rename(aa=c1,bb=c2,
cc=c3,dd=c4,
ee=c5,ff=c6,
gg=c7,hh=c8)
Therefore I cannot follow the example of @Ycw, as the columns do not all follow the same rename rule. How do I make the function ignore columns that are not present?
Here is a workaround to use setNames for the col_rename function.
col_rename <- function(x) setNames(x, paste0("new_", names(x)))
col_rename(df1)
new_c1 new_c2
1 a A
2 b B
3 c C
4 d D
5 e E
col_rename(df2)
new_c1 new_c3
1 a A
2 b B
3 c C
4 d D
5 e E
Or use the select_all function from the dplyr.
library(dplyr)
df1 %>% select_all(function(x) paste0("new_", x))
new_c1 new_c2
1 a A
2 b B
3 c C
4 d D
5 e E
This (~) also works for select_all
df2 %>% select_all(~paste0("new_", .))
new_c1 new_c3
1 a A
2 b B
3 c C
4 d D
5 e E
rename_all also works well
library(dplyr)
df1 %>% rename_all(~paste0("new_", .))
new_c1 new_c2
1 a A
2 b B
3 c C
4 d D
5 e E
Update
This is an update to address OP's updated question.
We can create a named vector that maps the old column names to the new ones, and define a function that changes the names with setNames.
# Create name vector
vec <- paste0("c", 1:8)
names(vec) <- c("aa", "bb", "cc", "dd", "ee", "ff", "gg", "hh")
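# Create the function
AB_rename <- function(x, name_vec){
old_colname <- names(x)
# Look up each existing column in name_vec, in the data frame's own column order
idx <- match(old_colname, name_vec)
# Columns without a match keep their old name
new_colname <- ifelse(is.na(idx), old_colname, names(name_vec)[idx])
x2 <- setNames(x, new_colname)
return(x2)
}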
AB_rename(df1, vec)
aa bb ff hh
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
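If staying with dplyr::rename is preferred, a sketch along the following lines should also work, assuming dplyr 1.0.0 or later, where rename accepts tidyselect helpers with a named character vector. The object names lookup and renamed are mine.

```r
library(dplyr)

# Lookup in the form new_name = "old_name"; the eight pairs mirror the question.
lookup <- setNames(paste0("c", 1:8),
                   c("aa", "bb", "cc", "dd", "ee", "ff", "gg", "hh"))

# Same column layout as df2 in the question's edit.
df2 <- data.frame(c1 = 1:5, c3 = 1:5, c2 = 1:5, c8 = 1:5)

# any_of() skips lookup entries that are absent from the data frame,
# so no error is raised for c4, c5, c6 and c7.
renamed <- rename(df2, any_of(lookup))
names(renamed)
```

Because the lookup is matched by name rather than by position, the column order of the data frame does not matter.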

separate() in tidyr with NA

I have a question related to separate() in the tidyr package. When there is no NA in a data frame, separate() works. I have been using this function a lot. But, today I had a case in which there were NAs in a data frame. separate() returned an error message. I could be very silly. But, I wonder if tidyr may not be designed for this kind of data cleaning. Or is there any way separate() can work with NAs? Thank you very much for taking your time.
Here is an updated sample based on the comments. Say I want to separate characters in y and create new columns. If I remove the row with NA, separate() will work. But, I do not want to delete the row, what could I do?
library(dplyr) # for %>%
library(tidyr) # for separate()
x <- c("a-1","b-2","c-3")
y <- c("d-4","e-5", NA)
z <- c("f-6", "g-7", "h-8")
foo <- data.frame(x,y,z, stringsAsFactors = FALSE)
ana <- foo %>%
separate(y, c("part1", "part2"))
# > foo
# x y z
# 1 a-1 d-4 f-6
# 2 b-2 e-5 g-7
# 3 c-3 <NA> h-8
# > ana <- foo %>%
# + separate(y, c("part1", "part2"))
# Error: Values not split into 2 pieces at 3
One way would be:
res <- foo %>%
mutate(y=ifelse(is.na(y), paste0(NA,"-", NA), y)) %>%
separate(y, c('part1', 'part2'))
res[res=='NA'] <- NA
res
# x part1 part2 z
#1 a-1 d 4 f-6
#2 b-2 e 5 g-7
#3 c-3 <NA> <NA> h-8
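For comparison, here is a base R sketch that keeps the NA row without going through a placeholder string. It relies on strsplit returning NA for an NA input. The names part1 and part2 follow the question; everything else is my own naming.

```r
y <- c("d-4", "e-5", NA)

# strsplit() leaves an NA input as a single NA element.
parts <- strsplit(y, "-", fixed = TRUE)

# Indexing past the end of a vector yields NA, so the NA row
# gives NA for both pieces.
part1 <- vapply(parts, function(p) p[1], character(1))
part2 <- vapply(parts, function(p) p[2], character(1))

data.frame(part1, part2)
```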
You can use extra option in separate.
Here's an example from hadley's github issue page
> df <- data.frame(x = c("a", "a b", "a b c", NA))
> df
x
1 a
2 a b
3 a b c
4 <NA>
> df %>% separate(x, c("a", "b"), extra = "merge")
a b
1 a <NA>
2 a b
3 a b c
4 <NA> <NA>
> df %>% separate(x, c("a", "b"), extra = "drop")
a b
1 a <NA>
2 a b
3 a b
4 <NA> <NA>

Reshape R: split a column

A really simple question, but I couldn't find a solution:
I have a data.frame like
V1 <- c("A","A","B","B","C","C")
V2 <- c("D","D","E","E","F","F")
V3 <- c(10:15)
df <- data.frame(cbind(V1,V2,V3))
i.e.
V1 V2 V3
A D 10
A D 11
B E 12
B E 13
C F 14
C F 15
And I would like
V1 V2 V3.1 V3.2
A D 10 11
B E 12 13
C F 14 15
I tried reshape {stats} and reshape2.
As I had mentioned, all that you need is a "time" variable and you should be fine.
Mark Miller shows the base R approach, and creates the time variable manually.
Here's a way to automatically create the time variable, and the equivalent command for dcast from the "reshape2" package:
## Creating the "time" variable. This does not depend
## on the rows being in a particular order before
## assigning the variables
df <- within(df, {
A <- do.call(paste, df[1:2])
time <- ave(A, A, FUN = seq_along)
rm(A)
})
## This is the "reshaping" step
library(reshape2)
dcast(df, V1 + V2 ~ time, value.var = "V3")
# V1 V2 1 2
# 1 A D 10 11
# 2 B E 12 13
# 3 C F 14 15
Self-promotion alert
Since this type of question has cropped up several times, and since a lot of datasets don't always have a unique ID, I have implemented a variant of the above as a function called getanID in my "splitstackshape" package. In its present version, it hard-codes the name of the "time" variable as ".id". If you were using that, the steps would be:
library(splitstackshape)
library(reshape2)
df <- getanID(df, id.vars=c("V1", "V2"))
dcast(df, V1 + V2 ~ .id, value.var = "V3")
V1 <- c("A","A","B","B","C","C")
V2 <- c("D","D","E","E","F","F")
V3 <- c(10:15)
time <- rep(c(1,2), 3)
df <- data.frame(V1,V2,V3,time)
df
reshape(df, idvar = c('V1','V2'), timevar='time', direction = 'wide')
V1 V2 V3.1 V3.2
1 A D 10 11
3 B E 12 13
5 C F 14 15
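The two ideas above can be combined in base R only: build the time variable with ave instead of typing it by hand, then call reshape. This is a sketch on the question's data; the object name wide is mine.

```r
df <- data.frame(
  V1 = c("A", "A", "B", "B", "C", "C"),
  V2 = c("D", "D", "E", "E", "F", "F"),
  V3 = 10:15
)

# Number the rows within each V1/V2 combination: 1, 2, 1, 2, 1, 2.
df$time <- ave(seq_len(nrow(df)), df$V1, df$V2, FUN = seq_along)

# One row per V1/V2 pair, one V3.<time> column per time value.
wide <- reshape(df, idvar = c("V1", "V2"), timevar = "time",
                direction = "wide")
wide
```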

data frame manipulation in R

I have a data frame that looks like this:
id = c("A","B","C","A","C","C")
val = c(5,4,6,7,10,99)
df = data.frame(id, val)
df
id val
A 5
B 4
C 6
A 7
C 10
C 99
Now I would like to re-arrange the id column (A, B, C...), keep their corresponding val, and then add a new column of newid starting with letter E, followed by three digits counting the number of id in the first column. The code is here:
id2 = c("A","A","B","C","C","C")
val2 = c(5,7,4,6,10,99)
newid = c("E001","E002","E001","E001","E002","E003")
df2 = data.frame(id2, val2, newid)
df2
and the final result is this:
id2 val2 newid
A 5 E001
A 7 E002
B 4 E001
C 6 E001
C 10 E002
C 99 E003
Is there an efficient way to do this?
library(data.table)
dt = data.table(df)
dt[, newid := paste0('E', gsub(' ', '0', format(1:.N, width = 3))), keyby = id]
dt
# id val newid
#1: A 5 E001
#2: A 7 E002
#3: B 4 E001
#4: C 6 E001
#5: C 10 E002
#6: C 99 E003
keyby here does the sorting, so no need to do it explicitly
Here is one way to do that, using the order() function to arrange the data, and the sprintf(), sapply() and table() functions to define newid.
df2 <- df[order(df$id, df$val), ]
df2$newid <- paste0("E", sprintf("%03d", unlist(sapply(table(df$id), function(x) 1:x)))) # %03d pads to three digits, e.g. E001
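A slightly shorter base R variant of the same idea, sketched with ave to build the per-id counter directly on the sorted data. The name counter is my own.

```r
df <- data.frame(id  = c("A", "B", "C", "A", "C", "C"),
                 val = c(5, 4, 6, 7, 10, 99))

# Sort by id, then by val within id.
df2 <- df[order(df$id, df$val), ]

# Running count within each id: 1, 2, 1, 1, 2, 3.
counter <- ave(seq_along(df2$id), df2$id, FUN = seq_along)

# "E" followed by the counter, zero-padded to three digits.
df2$newid <- sprintf("E%03d", counter)
df2
```

Because the counter is computed on df2 itself, it cannot drift out of step with the row order.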

Resources