Splitting data in a cell - r

I have a data set that looks like this
Code Product
1 A|B
2 A|B|C
3 A|B|C|D|E
When I split the column Product using colsplit function, duplication occurs. The output of colsplit function looks like this:
Code Product.1 Product.2 Product.3 Product.4 Product.5
1 A B A B A
2 A B C A B
3 A B C D E
This happens because one of the cells had five elements. Is there any way to avoid this duplication?
Thanks and regards
Jayaram

Update (21 Oct 2013)
The concepts below have been rolled into a family of functions called concat.split.* in my "splitstackshape" package. Here is a very straightforward solution using concat.split.multiple:
library(splitstackshape)
concat.split.multiple(temp, "Product", "|", "long")
# Code time Product
# 1 1 1 A
# 2 2 1 A
# 3 3 1 A
# 4 1 2 B
# 5 2 2 B
# 6 3 2 B
# 7 1 3 <NA>
# 8 2 3 C
# 9 3 3 C
# 10 1 4 <NA>
# 11 2 4 <NA>
# 12 3 4 D
# 13 1 5 <NA>
# 14 2 5 <NA>
# 15 3 5 E
Remove the "long" argument if you want the wide format, but your comments indicated that ultimately you wanted a long format for your output.
Original answer (17 Dec 2012)
You can do this with strsplit and sapply as follows:
# Your data
temp <- structure(list(Code = 1:3, Product = c("A|B", "A|B|C", "A|B|C|D|E"
)), .Names = c("Code", "Product"), class = "data.frame", row.names = c(NA, -3L))
temp1 <- strsplit(temp$Product, "\\|") # Split the product cell
temp1 <- data.frame(Code = temp$Code,
t(sapply(temp1,
function(x) {
temp <- matrix(NA,
nrow = max(sapply(temp1, length)));
temp[1:length(x)] <- x; temp})))
temp1
# Code X1 X2 X3 X4 X5
# 1 1 A B <NA> <NA> <NA>
# 2 2 A B C <NA> <NA>
# 3 3 A B C D E
Or... use rbind.fill from the "plyr" package, after making each of your rows into a single column data.frame:
temp1 <- strsplit(temp$Product, "\\|")
library(plyr)
data.frame(Code = temp$Code,
rbind.fill(lapply(temp1, function(x) data.frame(t(x)))))
# Code X1 X2 X3 X4 X5
# 1 1 A B <NA> <NA> <NA>
# 2 2 A B C <NA> <NA>
# 3 3 A B C D E
Or... inspired by #DWin's great answer here, re-read the second column as a data.frame in itself.
newcols <- max(sapply(strsplit(temp$Product, "\\|"), length))
temp2 <- data.frame(Code = temp$Code,
read.table(text = as.character(temp$Product),
sep="|", fill=TRUE,
col.names=paste("Product", seq(newcols))))
temp2
# Code Product.1 Product.2 Product.3 Product.4 Product.5
# 1 1 A B
# 2 2 A B C
# 3 3 A B C D E

Related

From axis values to coodinates pairs [duplicate]

I have two vectors of integers, say v1=c(1,2) and v2=c(3,4), I want to combine and obtain this as a result (as a data.frame, or matrix):
> combine(v1,v2) <--- doesn't exist
1 3
1 4
2 3
2 4
This is a basic case. What about a little bit more complicated - combine every row with every other row? E.g. imagine that we have two data.frames or matrices d1, and d2, and we want to combine them to obtain the following result:
d1
1 13
2 11
d2
3 12
4 10
> combine(d1,d2) <--- doesn't exist
1 13 3 12
1 13 4 10
2 11 3 12
2 11 4 10
How could I achieve this?
For the simple case of vectors there is expand.grid
v1 <- 1:2
v2 <- 3:4
expand.grid(v1, v2)
# Var1 Var2
#1 1 3
#2 2 3
#3 1 4
#4 2 4
I don't know of a function that will automatically do what you want to do for dataframes(See edit)
We could relatively easily accomplish this using expand.grid and cbind.
df1 <- data.frame(a = 1:2, b=3:4)
df2 <- data.frame(cat = 5:6, dog = c("a","b"))
expand.grid(df1, df2) # doesn't work so let's try something else
id <- expand.grid(seq(nrow(df1)), seq(nrow(df2)))
out <-cbind(df1[id[,1],], df2[id[,2],])
out
# a b cat dog
#1 1 3 5 a
#2 2 4 5 a
#1.1 1 3 6 b
#2.1 2 4 6 b
Edit: As Joran points out in the comments merge does this for us for data frames.
df1 <- data.frame(a = 1:2, b=3:4)
df2 <- data.frame(cat = 5:6, dog = c("a","b"))
merge(df1, df2)
# a b cat dog
#1 1 3 5 a
#2 2 4 5 a
#3 1 3 6 b
#4 2 4 6 b

merge columns that have the same name r

I am working in R with a dataset that is created from mongodb with the use of mongolite.
I am getting a list that looks like so:
_id A B A B A B NA NA
1 a 1 b 2 e 5 NA NA
2 k 4 l 3 c 3 d 4
I would like to merge the datasetto look like this:
_id A B
1 a 1
2 k 4
1 b 2
2 l 3
1 e 5
2 c 3
1 NA NA
2 d 4
The NAs in the last columns are there because the columns are named from the first entry and if a later entry has more columns than that they don't get names assigned to them, (if I get help for this as well it would be awesome but it's not the reason I am here).
Also the number of columns might differ for different subsets of the dataset.
I have tried melt() but since it is a list and not a dataframe it doesn't work as expected, I have tried stack() but it dodn't work because the columns have the same name and some of them don't even have a name.
I know this is a very weird situation and appreciate any help.
Thank you.
using library(magrittr)
data:
df <- fread("
_id A B A B A B NA NA
1 a 1 b 2 e 5 NA NA
2 k 4 l 3 c 3 d 4 ",header=T)
setDF(df)
Code:
df2 <- df[,-1]
odds<- df2 %>% ncol %>% {(1:.)%%2} %>% as.logical
even<- df2 %>% ncol %>% {!(1:.)%%2}
cbind(df[,1,drop=F],
A=unlist(df2[,odds]),
B=unlist(df2[,even]),
row.names=NULL)
result:
# _id A B
# 1 1 a 1
# 2 2 k 4
# 3 1 b 2
# 4 2 l 3
# 5 1 e 5
# 6 2 c 3
# 7 1 <NA> NA
# 8 2 d 4
We can use data.table. Assuming A and B are always following each other. I created an example with 2 sets of NA's in the header. With grep we can find the ones fread has named V8 etc. Using R's recycling of vectors, you can rename multiple headers in one go. If in your case these are named differently change the pattern in the grep command. Then we melt the data in via melt
library(data.table)
df <- fread("
_id A B A B A B NA NA NA NA
1 a 1 b 2 e 5 NA NA NA NA
2 k 4 l 3 c 3 d 4 e 5",
header = TRUE)
df
_id A B A B A B A B A B
1: 1 a 1 b 2 e 5 <NA> NA <NA> NA
2: 2 k 4 l 3 c 3 d 4 e 5
# assuming A B are always following each other. Can be done in 1 statement.
cols <- names(df)
cols[grep(pattern = "^V", x = cols)] <- c("A", "B")
names(df) <- cols
# melt data (if df is a data.frame replace df with setDT(df)
df_melted <- melt(df, id.vars = 1,
measure.vars = patterns(c('A', 'B')),
value.name=c('A', 'B'))
df_melted
_id variable A B
1: 1 1 a 1
2: 2 1 k 4
3: 1 2 b 2
4: 2 2 l 3
5: 1 3 e 5
6: 2 3 c 3
7: 1 4 <NA> NA
8: 2 4 d 4
9: 1 5 <NA> NA
10: 2 5 e 5
Thank you for your help, they were great inspirations.
Even though #Andre Elrico gave a solution that worked in the reproducible example better #phiver gave a solution that worked better on my overall problem.
By using both those I came up with the following.
library(data.table)
#The data were in a list of lists called list for this example
temp <- as.data.table(matrix(t(sapply(list, '[', seq(max(sapply(list, lenth))))),
nrow = m))
# m here is the number of lists in list
cols <- names(temp)
cols[grep(pattern = "^V", x = cols)] <- c("B", "A")
#They need to be the opposite way because the first column is going to be substituted with id, and this way they fall on the correct column after that
cols[1] <- "id"
names(temp) <- cols
l <- melt.data.table(temp, id.vars = 1,
measure.vars = patterns(c("A", "B")),
value.name = c("A", "B"))
That way I can use this also if I have more than 2 columns that I need to manipulate like that.

Parsing Delimited Data In a DataFrame Into Separate Columns in R

I have a data frame which looks as such
A B C
1 3 X1=7;X2=8;X3=9
2 4 X1=10;X2=11;X3=12
5 6 X1=13;X2=14
I would like to parse the C column into separate columns as such...
A B X1 X2 X3
1 3 7 8 9
2 4 10 11 12
5 6 13 14 NA
How would one go about doing this in R?
First, here's the sample data in data.frame form
dd<-data.frame(
A = c(1L, 2L, 5L),
B = c(3L, 4L, 6L),
C = c("X1=7;X2=8;X3=9",
"X1=10;X2=11;X3=12", "X1=13;X2=14"),
stringsAsFactors=F
)
Now I define a small helper function to take vectors like c("A=1","B=2") and changed them into named vectors like c(A="1", B="2").
namev<-function(x) {
a<-strsplit(x,"=")
setNames(sapply(a,'[',2), sapply(a,'[',1))
}
and now I perform the transformations
#turn each row into a named vector
vv<-lapply(strsplit(dd$C,";"), namev)
#find list of all column names
nm<-unique(unlist(sapply(vv, names)))
#extract data from all rows for every column
nv<-do.call(rbind, lapply(vv, '[', nm))
#convert everything to numeric (optional)
class(nv)<-"numeric"
#rejoin with original data
cbind(dd[,-3], nv)
and that gives you
A B X1 X2 X3
1 1 3 7 8 9
2 2 4 10 11 12
3 5 6 13 14 NA
My cSplit function makes solving problems like these fun. Here it is in action:
## Load some packages
library(data.table)
library(devtools) ## Just for source_gist, really
library(reshape2)
## Load `cSplit`
source_gist("https://gist.github.com/mrdwab/11380733")
First, split your values up and create a "long" dataset:
ddL <- cSplit(cSplit(dd, "C", ";", "long"), "C", "=")
ddL
# A B C_1 C_2
# 1: 1 3 X1 7
# 2: 1 3 X2 8
# 3: 1 3 X3 9
# 4: 2 4 X1 10
# 5: 2 4 X2 11
# 6: 2 4 X3 12
# 7: 5 6 X1 13
# 8: 5 6 X2 14
Next, use dcast.data.table (or just dcast) to go from "long" to "wide":
dcast.data.table(ddL, A + B ~ C_1, value.var="C_2")
# A B X1 X2 X3
# 1: 1 3 7 8 9
# 2: 2 4 10 11 12
# 3: 5 6 13 14 NA
Here's one possible approach:
dat <- read.table(text="A B C
1 3 X1=7;X2=8;X3=9
2 4 X1=10;X2=11;X3=12
5 6 X1=13;X2=14", header=TRUE, stringsAsFactors = FALSE)
library(qdapTools)
dat_C <- strsplit(dat$C, ";")
dat_C2 <- sapply(dat_C, function(x) {
y <- strsplit(x, "=")
rep(sapply(y, "[", 1), as.numeric(sapply(y, "[", 2)))
})
data.frame(dat[, -3], mtabulate(dat_C2))
## A B X1 X2 X3
## 1 1 3 7 8 9
## 2 2 4 10 11 12
## 3 5 6 13 14 0
EDIT To obtain the NA values
m <- mtabulate(dat_C2)
m[m==0] <- NA
data.frame(dat[, -3], m)
Here's a nice, somewhat hacky way to get you there.
## read your data
> dat <- read.table(h=T, text = "A B C
1 3 X1=7;X2=8;X3=9
2 4 X1=10;X2=11;X3=12
5 6 X1=13;X2=14", stringsAsFactors = FALSE)
## ---
> s <- strsplit(dat$C, ";|=")
> xx <- unique(unlist(s)[grepl('[A-Z]', unlist(s))])
> sap <- t(sapply(seq(s), function(i){
wh <- which(!xx %in% s[[i]]); n <- suppressWarnings(as.numeric(s[[i]]))
nn <- n[!is.na(n)]; if(length(wh)){ append(nn, NA, wh-1) } else { nn }
})) ## see below for explanation
> data.frame(dat[1:2], sap)
# A B X1 X2 X3
# 1 1 3 7 8 9
# 2 2 4 10 11 12
# 3 5 6 13 14 NA
Basically what's happening in sap is
check which values are missing
change each list element of s to numeric
remove the NA values from (2)
insert NA into the correct position with append
transpose the result

Convert datafile from wide to long format to fit ordinal mixed model in R

I am dealing with a dataset that is in wide format, as in
> data=read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
> data
factor1 factor2 count_1 count_2 count_3
1 a a 1 2 0
2 a b 3 0 0
3 b a 1 2 3
4 b b 2 2 0
5 c a 3 4 0
6 c b 1 1 0
where factor1 and factor2 are different factors which I would like to take along (in fact I have more than 2, but that shouldn't matter), and count_1 to count_3 are counts of aggressive interactions on an ordinal scale (3>2>1). I would now like to convert this dataset to long format, to get something like
factor1 factor2 aggression
1 a a 1
2 a a 2
3 a a 2
4 a b 1
5 a b 1
6 a b 1
7 b a 1
8 b a 2
9 b a 2
10 b a 3
11 b a 3
12 b a 3
13 b b 1
14 b b 1
15 b b 2
16 b b 2
17 c a 1
18 c a 1
19 c a 1
20 c a 2
21 c a 2
22 c a 2
23 c a 2
24 c b 1
25 c b 2
Would anyone happen to know how to do this without using for...to loops, e.g. using package reshape2? (I realize it should work using melt, but I just haven't been able to figure out the right syntax yet)
Edit: For those of you that would also happen to need this kind of functionality, here is Ananda's answer below wrapped into a little function:
widetolong.ordinal<-function(data,factors,responses,responsename) {
library(reshape2)
data$ID=1:nrow(data) # add an ID to preserve row order
dL=melt(data, id.vars=c("ID", factors)) # `melt` the data
dL=dL[order(dL$ID), ] # sort the molten data
dL[,responsename]=match(dL$variable,responses) # convert reponses to ordinal scores
dL[,responsename]=factor(dL[,responsename],ordered=T)
dL=dL[dL$value != 0, ] # drop rows where `value == 0`
out=dL[rep(rownames(dL), dL$value), c(factors, responsename)] # use `rep` to "expand" `data.frame` & drop unwanted columns
rownames(out) <- NULL
return(out)
}
# example
data <- read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
widetolong.ordinal(data,c("factor1","factor2"),c("count_1","count_2","count_3"),"aggression")
melt from "reshape2" will only get you part of the way through this problem. To go the rest of the way, you just need to use rep from base R:
data <- read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
library(reshape2)
## Add an ID if the row order is importantt o you
data$ID <- 1:nrow(data)
## `melt` the data
dL <- melt(data, id.vars=c("ID", "factor1", "factor2"))
## Sort the molten data, if necessary
dL <- dL[order(dL$ID), ]
## Extract the numeric portion of the "variable" variable
dL$aggression <- gsub("count_", "", dL$variable)
## Drop rows where `value == 0`
dL <- dL[dL$value != 0, ]
## Use `rep` to "expand" your `data.frame`.
## Drop any unwanted columns at this point.
out <- dL[rep(rownames(dL), dL$value), c("factor1", "factor2", "aggression")]
This is what the output finally looks like. If you want to remove the funny row names, just use rownames(out) <- NULL.
out
# factor1 factor2 aggression
# 1 a a 1
# 7 a a 2
# 7.1 a a 2
# 2 a b 1
# 2.1 a b 1
# 2.2 a b 1
# 3 b a 1
# 9 b a 2
# 9.1 b a 2
# 15 b a 3
# 15.1 b a 3
# 15.2 b a 3
# 4 b b 1
# 4.1 b b 1
# 10 b b 2
# 10.1 b b 2
# 5 c a 1
# 5.1 c a 1
# 5.2 c a 1
# 11 c a 2
# 11.1 c a 2
# 11.2 c a 2
# 11.3 c a 2
# 6 c b 1
# 12 c b 2

Condensing Data Frame in R

I just have a simple question, I really appreciate everyones input, you have been a great help to my project. I have an additional question about data frames in R.
I have data frame that looks similar to something like this:
C <- c("","","","","","","","A","B","D","A","B","D","A","B","D")
D <- c(NA,NA,NA,2,NA,NA,1,1,4,2,2,5,2,1,4,2)
G <- list(C=C,D=D)
T <- as.data.frame(G)
T
C D
1 NA
2 NA
3 NA
4 2
5 NA
6 NA
7 1
8 A 1
9 B 4
10 D 2
11 A 2
12 B 5
13 D 2
14 A 1
15 B 4
16 D 2
I would like to be able to condense all the repeat characters into one, and look similar to this:
J B C E
1 2 1
2 A 1 2 1
3 B 4 5 4
4 D 2 2 2
So of course, the data is all the same, it is just that it is condensed and new columns are formed to hold the data. I am sure there is an easy way to do it, but from the books I have looked through, I haven't seen anything for this!
EDIT I edited the example because it wasn't working with the answers so far. I wonder if the NA's, blanks, and unevenness from the blanks are contributing??
here´s a reshape solution:
require(reshape)
cast(T, C ~ ., function(x) x)
Changed T to df to avoid a bad habit. Returns a list, which my not be what you want but you can convert from there.
C <- c("A","B","D","A","B","D","A","B","D")
D <- c(1,4,2,2,5,2,1,4,2)
my.df <- data.frame(id=C,val=D)
ret <- function(x) x
by.df <- by(my.df$val,INDICES=my.df$id,ret)
This seems to get the results you are looking for. I'm assuming it's OK to remove the NA values since that matches the desired output you show.
T <- na.omit(T)
T$ind <- ave(1:nrow(T), T$C, FUN = seq_along)
reshape(T, direction = "wide", idvar = "C", timevar = "ind")
# C D.1 D.2 D.3
# 4 2 1 NA
# 8 A 1 2 1
# 9 B 4 5 4
# 10 D 2 2 2
library(reshape2)
dcast(T, C ~ ind, value.var = "D", fill = "")
# C 1 2 3
# 1 2 1
# 2 A 1 2 1
# 3 B 4 5 4
# 4 D 2 2 2

Resources