How can I organize my CSV file in R

I have a .csv file with 53,000 rows as follows:
s
1
2
3
m
4
5
6
7
r
8
9
10
11
I would like to convert it to the following format using R or Excel:
s 1 2 3
m 4 5 6 7
r 8 9 10 11

Three alternative implementations using base R and data.table:
1: with base R (plus reshape2 for the final dcast)
df$id <- cumsum(grepl("\\D", df$x))
df$name <- ave(df$x, df$id, FUN = function(x) rep(x[1],length(x)))
df <- df[!grepl("\\D", df$x),]
df$pos <- ave(df$x, df$name, FUN = function(x) paste0("p",1:length(x)))
library(reshape2)
dcast(df, name ~ pos, value.var = "x")
this gives:
name p1 p2 p3 p4
1 m 4 5 6 7
2 r 8 9 10 11
3 s 1 2 3 <NA>
2: first approach with data.table
library(data.table)
dcast(setDT(df)[, id := cumsum(grepl("\\D", x))
][, `:=` (name = x[1], pos = 0:(.N-1)), id
][!grepl("\\D", x), .(name, x, pos=paste0("p",pos))],
name ~ pos, value.var = "x")
3: second approach with data.table, this time using the rowid function introduced in the development version (v1.9.7):
library(data.table) # v1.9.7+
dcast(setDT(df)[, id := cumsum(grepl("\\D", x))
][, name := x[1], id
][!grepl("\\D", x), .(name, x)],
name ~ rowid(name, prefix="p"), value.var = "x")
both data.table approaches result in:
name p1 p2 p3 p4
1: m 4 5 6 7
2: r 8 9 10 11
3: s 1 2 3 NA
Used data:
df <- data.frame(x = c("s", 1:3, "m", 4:7, "r", 8:11), stringsAsFactors = FALSE)
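All three variants rely on the same grouping idea: a running count of the non-numeric header rows labels the values that follow each header. A minimal sketch of that idea in base R only, assuming the headers are unique; the names is_header, hdr, grp, vals and wide are illustrative and not part of the answer above:

```r
df <- data.frame(x = c("s", 1:3, "m", 4:7, "r", 8:11), stringsAsFactors = FALSE)

is_header <- grepl("\\D", df$x)          # TRUE for "s", "m", "r"
id <- cumsum(is_header)                  # 1 for the s block, 2 for m, 3 for r
hdr <- df$x[is_header]                   # headers in order of appearance

# Label every value row with its header, keeping the original header order
grp <- factor(hdr[id[!is_header]], levels = hdr)
vals <- split(as.numeric(df$x[!is_header]), grp)

# Pad shorter groups with NA so all rows have the same width
width <- max(lengths(vals))
wide <- t(sapply(vals, function(v) c(v, rep(NA, width - length(v)))))
colnames(wide) <- paste0("p", seq_len(width))
wide
```

Unlike the dcast results, this keeps the rows in the order s, m, r because the factor levels are set explicitly.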

Assuming that the new row names always contain at least one letter and the values in the rows are always numeric, this reformats the data into the data frame you may be looking for.
library(dplyr)
library(tidyr)
data.frame(x = c("s", 1:3, "m", 4:7, "r", 8:11),
stringsAsFactors = FALSE) %>%
mutate(var_id = cumsum(grepl("[[:alpha:]]", x))) %>%
group_by(var_id) %>%
mutate(row_name = x[1]) %>%
filter(!grepl("[[:alpha:]]", x)) %>%
mutate(var_index = 1:n()) %>%
ungroup() %>%
select(-var_id) %>%
spread(var_index, x)
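In current tidyr, spread() is superseded by pivot_wider(). A sketch of the same pipeline with that function, assuming tidyr 1.0.0 or later; the "p" prefix and the name wide are illustrative choices, not part of the answer above:

```r
library(dplyr)
library(tidyr)

df <- data.frame(x = c("s", 1:3, "m", 4:7, "r", 8:11), stringsAsFactors = FALSE)

wide <- df %>%
  mutate(var_id = cumsum(grepl("[[:alpha:]]", x))) %>%  # block number per header
  group_by(var_id) %>%
  mutate(row_name = x[1]) %>%                           # header of the block
  filter(!grepl("[[:alpha:]]", x)) %>%                  # drop the header rows
  mutate(var_index = row_number()) %>%                  # position within the block
  ungroup() %>%
  select(-var_id) %>%
  pivot_wider(names_from = var_index, values_from = x, names_prefix = "p")
wide
```

The values stay character here because x is a character column; wrap x in as.numeric() before reshaping if numbers are needed.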


R - applying calculation pairwise on columns of data frame/data table

Let's say I have two data frames with overlapping column names
DF1 = data.frame(a = c(0,1), b = c(2,3), c = c(4,5))
DF2 = data.frame(a = c(6,7), c = c(8,9))
and want to apply some basic calculation to them, for example adding the columns that share a name.
Since I also want the goal data frame to display missing data, I appended such a column to DF2, so I have
> DF2
a c b
1 6 8 NA
2 7 9 NA
What I tried here now is to create the data frame
for(i in names(DF2)){
DF3 = data.frame(i = DF1[i] + DF2[i])
}
(and then bind the results together), but this obviously doesn't work since the order of the columns is mixed up.
So, what's the best way to do this pairwise calculation when the order of the columns is not the same, without reordering them?
I also tried doing (since this is what I thought would be a fix)
for(i in names(DF2)){
DF3 = data.frame(i = DF1$i + DF2$i)
}
but this doesn't work because DF1$i is NULL for all i.
Conclusion: I want the data frame
>DF3
a b c
1 6+0 NA 4+8
2 1+7 NA 5+9
Any help would be appreciated.
This may help -
#Get column names from DF1 and DF2
all_cols <- union(names(DF1), names(DF2))
#Fill missing columns with NA in both data frames
DF1[setdiff(all_cols, names(DF1))] <- NA
DF2[setdiff(all_cols, names(DF2))] <- NA
#add the two dataframes arranging the columns
DF1[all_cols] + DF2[all_cols]
# a b c
#1 6 NA 12
#2 8 NA 14
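The loops in the question fail because DF1$i looks for a column literally named "i". Indexing with [[ ]] uses the value of the loop variable instead. A minimal loop-based sketch of the same name-matched addition, assuming both data frames have the same number of rows; the loop variable col is illustrative:

```r
DF1 <- data.frame(a = c(0, 1), b = c(2, 3), c = c(4, 5))
DF2 <- data.frame(a = c(6, 7), c = c(8, 9))

all_cols <- union(names(DF1), names(DF2))

# Start from an empty data frame with the right number of rows
DF3 <- data.frame(row.names = seq_len(nrow(DF1)))
for (col in all_cols) {
  if (col %in% names(DF1) && col %in% names(DF2)) {
    DF3[[col]] <- DF1[[col]] + DF2[[col]]  # [[ ]] looks up the column by its name
  } else {
    DF3[[col]] <- NA_real_                 # column missing in one input
  }
}
DF3
```

Because each column is looked up by name, the column order of DF2 does not matter.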
We can use bind_rows
library(dplyr)
library(data.table)
bind_rows(DF1, DF2, .id = 'grp') %>%
group_by(grp = rowid(grp)) %>%
summarise(across(everything(), sum), .groups = 'drop') %>%
select(-grp)
-output
# A tibble: 2 x 3
a b c
<dbl> <dbl> <dbl>
1 6 NA 12
2 8 NA 14
Another base R option using aggregate + stack + reshape
aggregate(
. ~ rid,
transform(
reshape(
transform(rbind(
stack(DF1),
stack(DF2)
),
rid = ave(seq_along(ind), ind, FUN = seq_along)
),
direction = "wide",
idvar = "rid",
timevar = "ind"
),
rid = 1:nrow(DF1)
),
sum,
na.action = "na.pass"
)[-1]
gives
values.a values.b values.c
1 6 NA 12
2 8 NA 14

How, in R, can you create a cross table from a named num based on its names?

I have a numeric vector with names following a pattern. The name for each element consists of two parts. There are a fixed number of variations on the first part and a fixed number of variations on the second part per the below.
x <- c(2, 4, 3, 7, 6, 9)
names(x) <- c("a.0", "b.0", "c.0", "a.1", "b.1", "c.1")
From this I want to create and print a table where the second part of the names gives the rows and the first part gives the columns, per the below.
a b c
0 2 4 3
1 7 6 9
Here are some possibilities. The first 3 only use base R.
1) tapply Use tapply with the row and column parts specified in the second argument.
nms <- names(x)
tapply(x, list(row = sub(".*\\.", "", nms), col = sub("\\..*", "", nms)), c)
giving the following matrix with the indicated row and column names.
col
row a b c
0 2 4 3
1 7 6 9
2) xtabs Another possibility is to use xtabs:
dnms <- read.table(text = names(x), sep = ".", as.is = TRUE,
col.names = c("col", "row"))[2:1]
xtabs(x ~ ., dnms)
giving this xtabs/table object:
col
row a b c
0 2 4 3
1 7 6 9
3) reshape
long <- cbind(x, read.table(text = names(x), sep = ".", as.is = TRUE,
col.names = c("col", "row")))
r <- reshape(long, dir = "wide", idvar = "row", timevar = "col")[-1]
dimnames(r) <- lapply(long[3:2], unique)
r
giving this data.frame:
a b c
0 2 4 3
1 7 6 9
4) dplyr/tidyr/tibble Using the indicated packages we can form the following pipeline:
library(dplyr)
library(tidyr)
library(tibble)
x %>%
stack %>%
separate(ind, c("col", "rowname")) %>%
pivot_wider(names_from = col, values_from = values) %>%
column_to_rownames
giving this data.frame:
a b c
0 2 4 3
1 7 6 9
If you are using an older version of tidyr replace the pivot_wider line with
spread(col, values) %>%
As per @d.b.'s comment this would also work:
x %>%
data.frame %>%
rownames_to_column %>%
separate(rowname, c("col", "rowname")) %>%
pivot_wider(names_from = col, values_from = ".") %>%
column_to_rownames
do.call(rbind, split(x, gsub(".*\\.(.*)", "\\1", names(x))))
# a.0 b.0 c.0
#0 2 4 3
#1 7 6 9
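The approaches above share one idea: split each name into a column part and a row part, then use the two parts as coordinates. A minimal base R sketch of that idea with matrix indexing, assuming every name contains exactly one dot; the names parts and tab are illustrative:

```r
x <- c(2, 4, 3, 7, 6, 9)
names(x) <- c("a.0", "b.0", "c.0", "a.1", "b.1", "c.1")

# One row per element: column 1 is the letter part, column 2 the digit part
parts <- do.call(rbind, strsplit(names(x), ".", fixed = TRUE))

rows <- unique(parts[, 2])
cols <- unique(parts[, 1])
tab <- matrix(NA_real_, nrow = length(rows), ncol = length(cols),
              dimnames = list(rows, cols))

# A two-column character matrix indexes cells by (row name, column name)
tab[parts[, 2:1]] <- x
tab
```

Combinations that do not occur in the names simply stay NA.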

Group data by factor level, then transform to data frame with colname being levels?

Here is a problem that I can't solve:
Data:
df <- data.frame(f1=c("a", "a", "b", "b", "c", "c", "c"),
v1=c(10, 11, 4, 5, 0, 1, 2))
In the data.frame, f1 is a factor:
f1 v1
a 10
a 11
b 4
b 5
c 0
c 1
c 2
What I want is (for example, keep only the levels that have exactly 2 elements, then convert them to a data.frame):
a b
10 4
11 5
Thanks in advance!
I might be missing something simple here, but the approach below using dplyr and tidyr works.
library(dplyr)
library(tidyr) # for spread()
nlevels = 2
df1 <- df %>%
add_count(f1) %>%
filter(n == nlevels) %>%
select(-n) %>%
mutate(rn = row_number()) %>%
spread(f1, v1) %>%
select(-rn)
This gives
# a b
# <int> <int>
#1 10 NA
#2 11 NA
#3 NA 4
#4 NA 5
Now, if you want to remove NA's we can do
do.call("cbind.data.frame", lapply(df1, function(x) x[!is.na(x)]))
# a b
#1 10 4
#2 11 5
As we have filtered the data frame down to the levels with exactly nlevels observations, every column in the final data frame has the same number of rows.
split might be useful here to split df$v1 into parts corresponding to df$f1. Since you are always extracting equal length chunks, it can then simply be combined back to a data.frame:
spl <- split(df$v1, df$f1)
data.frame(spl[lengths(spl)==2])
# a b
#1 10 4
#2 11 5
Or do it all in one call by combining this with Filter:
data.frame(Filter(function(x) length(x)==2, split(df$v1, df$f1)))
# a b
#1 10 4
#2 11 5
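If the group size should not be hard-coded, the split approach can be wrapped in a small helper. A sketch under that assumption; the function name wide_by_level and its arguments are hypothetical, not part of the answer above:

```r
df <- data.frame(f1 = c("a", "a", "b", "b", "c", "c", "c"),
                 v1 = c(10, 11, 4, 5, 0, 1, 2))

# Keep only the groups with exactly n values and put them side by side
wide_by_level <- function(values, groups, n) {
  spl <- split(values, groups)
  data.frame(spl[lengths(spl) == n])
}

res2 <- wide_by_level(df$v1, df$f1, 2)  # columns a and b
res3 <- wide_by_level(df$v1, df$f1, 3)  # column c only
res2
```

All kept groups have the same length n, which is what makes the direct conversion to a data.frame safe.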
Here is a solution using unstack :
unstack(
droplevels(df[ave(df$v1, df$f1, FUN = function(x) length(x) == 2)==1,]),
v1 ~ f1)
# a b
# 1 10 4
# 2 11 5
A variant, similar to @thelatemail's solution:
data.frame(Filter(function(x) length(x) == 2, unstack(df,v1 ~ f1)))
My tidyverse solution would be:
library(tidyverse)
df %>%
group_by(f1) %>%
filter(n() == 2) %>%
mutate(i = row_number()) %>%
spread(f1, v1) %>%
select(-i)
# # A tibble: 2 x 2
# a b
# * <dbl> <dbl>
# 1 10 4
# 2 11 5
or mixing approaches :
as_tibble(keep(unstack(df,v1 ~ f1), ~length(.x) == 2))
Using all base functions (but you should use tidyverse)
x <- df # work on a copy of the question's data
# Add count of instances
x$len <- ave(x$v1, x$f1, FUN = length)
# Filter, drop the count
x <- x[x$len==2, c('f1','v1')]
# Hacky pivot
result <- data.frame(
lapply(unique(x$f1), FUN = function(y) x$v1[x$f1==y])
)
colnames(result) <- unique(x$f1)
> result
a b
1 10 4
2 11 5
I would code it like this; it may help you:
library(reshape2)
library(dplyr)
aa = data.frame(v1=c('a','a','b','b','c','c','c'),f1=c(10,11,4,5,0,1,2))
cc = aa %>% group_by(v1) %>% summarise(id = length((v1)))
dd= merge(aa,cc) #get the level
ee = dd[dd$id==2,] #select the levels that occur exactly twice
ee$id = rep(c(1,2),nrow(ee)/2) # reset index like (1,2,1,2)
dcast(ee, id~v1,value.var = 'f1')
all done!

Extract character list values from data.frame rows and reshape data

I have a variable x with character lists in each row:
dat <- data.frame(id = c(rep('a',2),rep('b',2),'c'),
x = c('f,o','f,o,o','b,a,a,r','b,a,r','b,a'),
stringsAsFactors = FALSE)
I would like to reshape the data so that each row is a unique (id, x) pair such as:
dat2 <- data.frame(id = c(rep('a',2),rep('b',3),rep('c',2)),
x = c('f','o','a','b','r','a','b'))
> dat2
id x
1 a f
2 a o
3 b a
4 b b
5 b r
6 c a
7 c b
I've attempted to do this by splitting the character lists and keeping only the unique list values in each row:
dat$x <- sapply(strsplit(dat$x, ','), sort)
dat$x <- sapply(dat$x, unique)
dat <- unique(dat)
> dat
id x
1 a f, o
3 b a, b, r
5 c a, b
However, I'm not sure how to proceed with converting the row lists into individual row entries.
How would I accomplish this? Or is there a more efficient way of converting a list of strings to reshape the data as described?
You can use tidytext::unnest_tokens:
library(tidytext)
library(dplyr)
dat %>%
unnest_tokens(x1, x) %>%
distinct()
id x1
1 a f
2 a o
3 b b
4 b a
5 b r
6 c b
7 c a
A base R method with two lines is
#get list of X potential vars
x <- strsplit(dat$x, ",")
# construct full data.frame, then use unique to return desired rows
unique(data.frame(id=rep(dat$id, lengths(x)), x=unlist(x)))
This returns
id x
1 a f
2 a o
6 b b
7 b a
9 b r
13 c b
14 c a
If you don't want to write out the variable names yourself, you can use setNames.
setNames(unique(data.frame(rep(dat$id, lengths(x)), unlist(x))), names(dat))
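The key step in this base R method is rep(dat$id, lengths(x)), which repeats each id once per split element so that ids and values line up. A step-by-step sketch that also sorts the result to match the row order requested in the question; the names parts, long and res are illustrative:

```r
dat <- data.frame(id = c(rep('a', 2), rep('b', 2), 'c'),
                  x = c('f,o', 'f,o,o', 'b,a,a,r', 'b,a,r', 'b,a'),
                  stringsAsFactors = FALSE)

parts <- strsplit(dat$x, ",")   # list with one character vector per row
lengths(parts)                  # number of elements per row: 2 3 4 3 2

# Repeat each id once per element, then flatten the list of elements
long <- data.frame(id = rep(dat$id, lengths(parts)),
                   x = unlist(parts),
                   stringsAsFactors = FALSE)

res <- unique(long)                   # drop repeated (id, x) pairs
res <- res[order(res$id, res$x), ]    # sort as in the desired output
rownames(res) <- NULL
res
```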
We could use separate_rows
library(tidyverse)
dat %>%
separate_rows(x) %>%
distinct()
# id x
#1 a f
#2 a o
#3 b b
#4 b a
#5 b r
#6 c b
#7 c a
A solution can be achieved using splitstackshape::cSplit to split the x column into multiple columns. Then gather and filter help to achieve the desired output.
library(tidyverse)
library(splitstackshape)
dat %>% cSplit("x", sep=",") %>%
mutate_if(is.factor, as.character) %>%
gather(key, value, -id) %>%
filter(!is.na(value)) %>%
select(-key) %>% unique()
# id value
# 1 a f
# 3 b b
# 5 c b
# 6 a o
# 8 b a
# 10 c a
# 13 b r
Base solution:
temp <- do.call(rbind, apply( dat, 1,
function(z){ data.frame(
id=z[1],
x = scan(text=z['x'], what="",sep=","),
stringsAsFactors=FALSE)} ) )
Read 2 items
Read 3 items
Read 4 items
Read 3 items
Read 2 items
Warning messages:
1: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
row names were found from a short variable and have been discarded
2: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
row names were found from a short variable and have been discarded
3: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
row names were found from a short variable and have been discarded
4: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
row names were found from a short variable and have been discarded
5: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
row names were found from a short variable and have been discarded
temp[!duplicated(temp),]
#------
id x
1 a f
2 a o
6 b b
7 b a
9 b r
13 c b
14 c a
To get rid of all the messages and warnings:
temp <- do.call(rbind, apply( dat, 1,
function(z){ suppressWarnings(data.frame(id=z[1],
x = scan(text=z['x'], what="",sep=",", quiet=TRUE), stringsAsFactors=FALSE)
)} ) )
temp[!duplicated(temp),]

Loop by variable names

I want to create a for loop by variable names.
For each pair of variables, I calculate the max and store it as a new variable in the data frame df. The new variables are named like this: var1_2, var1_3, ... Here is my code:
df=data.frame(matrix(c(1:6), nrow = 2))
colnames(df) = c("x", "y", "z")
for(i in length(names(df))-1){
df = df %>% mutate(paste0("var", i, "_", i+1) = max(names(df)[i], names(df)[i+1]))
}
But this gives an error.
Expected output:
>df
x y z var1_2 var1_3 var2_3
1 3 5 3 5 5
2 4 6 4 6 6
One way via base R,
m1 <- sapply(combn(names(df),2, simplify = FALSE), function(i) do.call(pmax, df[i]))
nms <- combn(ncol(df), 2, function(i) paste0('Var', i[1], '_', i[2]))
cbind(df, setNames(data.frame(m1), nms))
# x y z Var1_2 Var1_3 Var2_3
#1 1 3 5 3 5 5
#2 2 4 6 4 6 6
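This answer uses pmax rather than max because max collapses all of its arguments into a single number, while pmax compares them position by position. A small illustration with made-up vectors u and v:

```r
u <- c(1, 5)
v <- c(4, 2)

overall <- max(u, v)    # one value: the largest element across both vectors
pairwise <- pmax(u, v)  # one value per position: max(1, 4) and max(5, 2)

overall
pairwise
```

For a row-wise maximum of two data frame columns, pmax is therefore the function to reach for.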
If you really want to use a loop, you can try:
ind <- combn(ncol(df), 2) # each column of ind holds one pair of column positions
for(k in seq_len(ncol(ind))){
  i <- ind[, k]
  name <- paste0("var", i[1], "_", i[2])
  # element-wise maximum of the two columns
  df[[name]] <- pmax(df[[i[1]]], df[[i[2]]])
}
