How to transform a key/value string into separate columns in R?

I've got a data.frame with a key/value string column containing information about features and their values for a set of users. Something like this:
data<-data.frame(id=1:3,statid=c("s003e","s093u","s085t"),str=c("a:1,7:2","a:1,c:4","a:3,b:5,c:33"))
data
# id statid str
# 1 1 s003e a:1,7:2
# 2 2 s093u a:1,c:4
# 3 3 s085t a:3,b:5,c:33
What I'm trying to do is create a data.frame containing a column for every feature, like this:
data_after<-data.frame(id=1:3,statid=c("s003e","s093u","s085t"),
a=c(1,1,3),b=c(0,0,5),c=c(0,4,33),"7"=c(2,0,0))
data_after
# id statid a b c X7
# 1 1 s003e 1 0 0 2
# 2 2 s093u 1 0 4 0
# 3 3 s085t 3 5 33 0
I was trying to use str_split from the stringr package and then transform the elements of the created list into data.frames (and later bind them using, for example, rbind.fill from plyr), but couldn't get it done. Any help will be appreciated!
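For reference, a minimal sketch of the str_split/rbind.fill route described above could look like this (the intermediate steps are illustrative, not taken from the answers below):
library(stringr)
library(plyr)
rows <- lapply(str_split(data$str, ","), function(pairs) {
  kv <- str_split_fixed(pairs, ":", 2)                    # split each "key:value" pair
  as.data.frame(t(setNames(as.numeric(kv[, 2]), kv[, 1])))
})
wide <- rbind.fill(rows)             # bind rows, filling missing features with NA
wide[is.na(wide)] <- 0               # replace NA with 0 as in the desired output
cbind(data[, c("id", "statid")], wide)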

You can use dplyr and tidyr:
library(dplyr); library(tidyr)
data %>% mutate(str = strsplit(str, ",")) %>% unnest(str) %>%
separate(str, into = c('var', 'val'), sep = ":") %>% spread(var, val, fill = 0)
# id statid 7 a b c
# 1 1 s003e 2 1 0 0
# 2 2 s093u 0 1 0 4
# 3 3 s085t 0 3 5 33
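spread() has since been superseded in tidyr; if you prefer the newer verbs, roughly the same pipeline (a sketch, not part of the original answer) would be:
library(dplyr); library(tidyr)
data %>%
  separate_rows(str, sep = ",") %>%
  separate(str, into = c("var", "val"), sep = ":", convert = TRUE) %>%
  pivot_wider(names_from = var, values_from = val, values_fill = 0)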

We can use cSplit to do this in a cleaner way: convert the data to 'long' format by splitting at ",", then split at ":" and dcast from 'long' to 'wide'.
library(splitstackshape)
library(data.table)
dcast(cSplit(cSplit(data, "str", ",", "long"), "str", ":"),
id+statid~str_1, value.var="str_2", fill = 0)
# id statid 7 a b c
#1: 1 s003e 2 1 0 0
#2: 2 s093u 0 1 0 4
#3: 3 s085t 0 3 5 33

Related

How can I transform and count data?

I have a data.frame with 4 categorical variables on a 1-5 scale.
df <- data.frame(
first=c(2,3,3,2,2),
second=c(5,5,4,5,5),
third=c(5,5,5,4,4),
fourth=c(2,1,1,1,2))
first second third fourth
2 5 5 2
3 5 5 1
3 4 5 1
2 5 4 1
2 5 4 2
I want to turn the variable names into one column and count the occurrences of each value, so that the categorical scale (1-5) forms the new columns.
newvar 1 2 3 4 5
first 0 3 2 0 0
second 0 0 0 1 4
third 0 0 0 2 3
fourth 3 2 0 0 0
Using data.table:
library(data.table)
dcast(melt(df), variable~value)
# variable 1 2 3 4 5
#1 first 0 3 2 0 0
#2 second 0 0 0 1 4
#3 third 0 0 0 2 3
#4 fourth 3 2 0 0 0
This returns some warnings since we are relying on the default options of melt and dcast; they are safe to ignore in this case. To avoid the warnings, you can use this extended version:
library(data.table)
dcast(melt(setDT(df), measure.vars = names(df)), variable~value, fun.aggregate = length)
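As a side note (not part of the original answer), the same counts can also be produced with base R's table(), assuming every column uses the 1-5 scale:
t(sapply(df, function(x) table(factor(x, levels = 1:5))))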
Not the cleanest method, but it works. We use pivot_longer to transform the data into long format, then count how many occurrences there are of each value for each of the original columns. We transform the data back into wide format using pivot_wider, and the last two lines rearrange the data to match the desired output.
library(dplyr); library(tidyr)
df %>%
  pivot_longer(c(first:fourth)) %>%
  count(name, value) %>%
  pivot_wider(names_from = "value",
              values_from = "n",
              values_fill = 0) %>%   # fill missing counts with 0 instead of NA
  select(name, `1`, `2`, `3`, `4`, `5`) %>%
  arrange(match(name, c("first", "second", "third", "fourth")))

Generate a matrix of unique combinations using 2 variables from a data frame in R

I have a data frame:
df<- as.data.frame(expand.grid(0:1, 0:4, 0:3,0:7, 2:7))
I want to get all unique combinations of values for every pair of the 5 variables in the data frame df.
Apply a function f (extracting unique couple) to each couple of columns:
f<-function(col,df)
{
return(unique(df[,col]))
}
#All combinantions
comb_col<-combn(colnames(df),2)
Your output
apply(comb_col,2,f,df=df)
[[1]]
Var1 Var2
1 0 0
2 1 0
3 0 1
4 1 1
5 0 2
6 1 2
7 0 3
8 1 3
9 0 4
10 1 4
[[2]]
Var1 Var3
1 0 0
2 1 0
11 0 1
12 1 1
21 0 2
22 1 2
31 0 3
32 1 3
...
You can use the distinct function from the dplyr package:
df <- as.data.frame(expand.grid(0:1, 0:4, 0:3,0:7, 2:7))
library(dplyr)
df %>%
distinct(Var1, Var2)
You also have the option to keep the rest of your columns with the .keep_all = TRUE parameter.
If you want to get all the possible combinations:
# Generate matrix with all combinations of variables
comb <- combn(names(df), 2)
# Generate a list with the unique value pairs for each combination of columns
apply(comb, 2, function(x) df %>% distinct_(.dots = x))
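Note that distinct_() is deprecated in current dplyr; a rough equivalent using across() and all_of() (an assumption, not part of the original answer) is:
apply(comb, 2, function(x) df %>% distinct(across(all_of(x))))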

Grouping and Counting instances?

Is it possible to group and count instances of all other columns using R (dplyr)? For example, the following dataframe
x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1
Turns into this (note: y is the value being counted):
EDIT: to explain the transformation: x is what I'm grouping by. For each group of x I want to count how many times the values 0, 1 and 2 occur across the other columns. For example, the first row of the transformed dataframe counts how often the value 0 (y) appears when x = 1: once in column a, twice in column b and once in column c.
x y a b c
1 0 1 2 1
1 1 1 0 2
1 2 1 1 0
2 1 1 0 1
2 2 0 1 0
An approach with a combination of the melt and dcast functions of data.table or reshape2:
library(data.table) # v1.9.5+
dt.new <- dcast(melt(setDT(df), id.vars="x"), x + value ~ variable)
this gives:
dt.new
# x value a b c
# 1: 1 0 1 2 1
# 2: 1 1 1 0 2
# 3: 1 2 1 1 0
# 4: 2 1 1 0 1
# 5: 2 2 0 1 0
In dcast you can specify which aggregation function to use, but that is not necessary here, as the default aggregation function is length. Without specifying an aggregation function, you will get a warning about that:
Aggregation function missing: defaulting to length
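To silence the warning, you can also pass the aggregation function explicitly (a small variation on the code above):
dt.new <- dcast(melt(setDT(df), id.vars = "x"), x + value ~ variable, fun.aggregate = length)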
Furthermore, if you do not explicitly convert the dataframe to a data table, data.table will redirect to reshape2 (see the explanation from #Arun in the comments). Consequently this method can be used with reshape2 as well:
library(reshape2)
df.new <- dcast(melt(df, id.vars="x"), x + value ~ variable)
Used data:
df <- read.table(text="x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1", header=TRUE)
I'd use a combination of gather and spread from the tidyr package, and count from dplyr:
library(dplyr)
library(tidyr)
df = data.frame(x = c(1,1,1,2), a = c(0,1,2,1), b = c(0,0,2,2), c = c(0,1,1,1))
res = df %>%
gather(variable, value, -x) %>%
count(x, variable, value) %>%
spread(variable, n, fill = 0)
# Source: local data frame [5 x 5]
#
# x value a b c
# 1 1 0 1 2 1
# 2 1 1 1 0 2
# 3 1 2 1 1 0
# 4 2 1 1 0 1
# 5 2 2 0 1 0
Essentially, you first change the format of the dataset to:
head(df %>%
gather(variable, value, -x))
# x variable value
#1 1 a 0
#2 1 a 1
#3 1 a 2
#4 2 a 1
#5 1 b 0
#6 1 b 0
This allows you to use count to get the information about how often certain values occur in columns a to c. After that, you reformat the dataset into the required format using spread.
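For completeness, gather() and spread() are superseded in current tidyr; a roughly equivalent pipeline with the pivot verbs (a sketch, not part of the original answer) would be:
df %>%
  pivot_longer(-x, names_to = "variable") %>%
  count(x, variable, value) %>%
  pivot_wider(names_from = variable, values_from = n, values_fill = 0)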

Reshaping a data.frame so a column containing multiple features becomes multiple binary columns

I have a dataframe like this
df <-data.frame(id = c(1,2),
value = c(25,24),
features = c("A,B,D,F","C,B,E"))
print(df)
id,value,features
1,25,"A,B,D,F"
2,24,"C,B,E"
I want to reshape it into this:
id,value,A,B,C,D,E,F
1,25,1,1,0,1,0,1
2,24,0,1,1,0,1,0
I'm guessing that the first step would be to identify the unique values in the df$features column, but once I have that list, I'm not sure what an efficient (i.e. vectorized) way to create the final dataset would be.
This feels like an operation for dplyr or reshape2 but I'm not sure how to approach this.
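For reference, a rough base-R sketch along the lines the question suggests (find the unique features first, then test membership row by row; the intermediate names are illustrative, not taken from the answers below):
feats <- sort(unique(unlist(strsplit(as.character(df$features), ","))))
flags <- t(sapply(strsplit(as.character(df$features), ","),
                  function(x) as.integer(feats %in% x)))    # 1 if the feature is present
colnames(flags) <- feats
cbind(df[, c("id", "value")], flags)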
This is yet another use case for merge after suitable transformation.
library(reshape2)
f <- with(df, stack(setNames(strsplit(as.character(features), ","), id)))
d <- dcast(f, ind ~ values, length, value.var = "ind")
out <- merge(df[, 1:2], d, by.x = "id", by.y = "ind")
print(out)
id value A B C D E F
1 1 25 1 1 0 1 0 1
2 2 24 0 1 1 0 1 0
This can also be done using only default libraries (without reshape2) in a variety of slightly messier ways. In the above, you can substitute the d and out lines with the following instead:
d <- xtabs(count ~ ind + values, transform(f, count = 1))
out <- merge(df[, 1:2], as.data.frame.matrix(d), by.x = "id", by.y = "row.names")
You can do:
library(splitstackshape)
library(qdapTools)
df1 = data.frame(cSplit(df, 'features', sep=',', type.convert=F))
cbind(df1[1:2], mtabulate(as.data.frame(t(df1[-c(1,2)]))))
# id value A B C D E F
#1: 1 25 1 1 0 1 0 1
#2: 2 24 0 1 1 0 1 0
Another one using splitstackshape and data.table:
require(splitstackshape)
require(data.table) # v1.9.5+
ans <- cSplit(df, 'features', sep = ',', 'long')
dcast(ans, id + value ~ features, fun.aggregate = length)
# id value A B C D E F
# 1: 1 25 1 1 0 1 0 1
# 2: 2 24 0 1 1 0 1 0
If you're using data.table v1.9.4, then replace dcast with dcast.data.table.
Alternatively, you can use cSplit_e, like this:
cSplit_e(df, "features", ",", type = "character", fill = 0)
## id value features features_A features_B features_C features_D features_E features_F
## 1 1 25 A,B,D,F 1 1 0 1 0 1
## 2 2 24 C,B,E 0 1 1 0 1 0
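If you want to match the requested output exactly, a small follow-up (not part of the original answer) is to drop the features column and strip the "features_" prefix:
res <- cSplit_e(df, "features", ",", type = "character", fill = 0, drop = TRUE)
names(res) <- sub("^features_", "", names(res))   # rename features_A, features_B, ... to A, B, ...
res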
A dplyr/tidyr solution
library(dplyr)
library(tidyr)
separate(df, features, into = as.character(1:4), sep = ",", extra = "merge") %>%
  gather(key, letter, -id, -value) %>%
  filter(!is.na(letter)) %>%
  select(-key) %>%
  mutate(n = 1) %>%
  spread(letter, n) %>%
  mutate_each(funs(ifelse(is.na(.), 0, 1)), A:F)

Splitting one column into multiple columns in R and giving a logical value if true

I am trying to split one column in a data frame into multiple columns that use the values from the original column as new column names. Then, if there was an occurrence of that value in the original column, give it a 1 in the new column, or 0 if there is no match. I realize this is not the best way to explain it, so, for example:
df <- data.frame(subject = c(1:4), Location = c('A', 'A/B', 'B/C/D', 'A/B/C/D'))
# subject Location
# 1 1 A
# 2 2 A/B
# 3 3 B/C/D
# 4 4 A/B/C/D
and would like to expand it to wide format, something like this, with 1's and 0's (or TRUE and FALSE):
# subject A B C D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
I have looked into tidyr's separate function and reshape2's cast function, but seem to be getting hung up on producing the logical values. Any help on the issue would be greatly appreciated. Thank you.
You may try cSplit_e from package splitstackshape:
library(splitstackshape)
cSplit_e(data = df, split.col = "Location", sep = "/",
type = "character", drop = TRUE, fill = 0)
# subject Location_A Location_B Location_C Location_D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
You could take the following step-by-step approach.
## get the unique values after splitting
u <- unique(unlist(strsplit(as.character(df$Location), "/")))
## compare 'u' with 'Location' (one logical column per unique value)
m <- vapply(u, grepl, logical(nrow(df)), x = df$Location)
## coerce to integer representation
m[] <- as.integer(m)
## bind 'm' to 'subject'
cbind(df["subject"], m)
# subject A B C D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
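The tidyr route mentioned in the question can also work with the newer verbs; a possible sketch (an assumption, not one of the answers above):
library(dplyr); library(tidyr)
df %>%
  separate_rows(Location, sep = "/") %>%
  mutate(flag = 1L) %>%
  pivot_wider(names_from = Location, values_from = flag, values_fill = 0L)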
