R data frame format group - r

I have a data frame in this format:
ABC 2
ABC 4
ABC 6
DEF 10
DEF 20
How can I get it into this form:
ABC 2 4 6
DEF 10 20
I tried the aggregate function, but it needs a function such as mean or sum as a parameter. How can I just display the values directly in the row?

df <- read.table(sep=" ", header=F, text="
ABC 2
ABC 4
ABC 6
DEF 10
DEF 20")
unstack(df, form=V2~V1)
# $ABC
# [1] 2 4 6
#
# $DEF
# [1] 10 20
unstack produces a list in this case because the groups don't have the same length. When the groups do have the same length:
df <- read.table(sep=" ", header=F, text="
ABC 2
ABC 4
ABC 6
DEF 10
DEF 20
DEF 20")
t(unstack(df, form=V2~V1))
# [,1] [,2] [,3]
# ABC 2 4 6
# DEF 10 20 20

Well, what are the observations? Are they supposed to measure the same thing for each category?
You can't actually get a data frame exactly as you have posted, because the number of observations differs between categories. But you can get one if you pad "DEF" with an NA.
Like this:
ABC 2 4 6
DEF 10 20 NA
If that is what you want, you could just use reshape2's dcast.
But you have to name the observations:
library(reshape2)
df <- data.frame(obs =c(1:3, 1:2),
categories = c(rep("ABC", 3), rep("DEF",2)),
values=c(2,4,6,10,20), stringsAsFactors=FALSE)
df2 <- dcast(df, categories~obs)
df2
# categories 1 2 3
# 1 ABC 2 4 6
# 2 DEF 10 20 NA

To add to your alternatives:
This seems to be a basic "long to wide" reshape problem, but it is missing a "time" variable. It's easy to recreate one by using ave:
ave(as.character(df$V1), df$V1, FUN = seq_along)
# [1] "1" "2" "3" "1" "2"
df$time <- ave(as.character(df$V1), df$V1, FUN = seq_along)
Once you have a "time" variable, using reshape is pretty straightforward:
reshape(df, idvar="V1", timevar="time", direction = "wide")
# V1 V2.1 V2.2 V2.3
# 1 ABC 2 4 6
# 4 DEF 10 20 NA
If, instead, you wanted a list, there is no need for the time variable. Just use split:
split(df$V2, df$V1)
# $ABC
# [1] 2 4 6
#
# $DEF
# [1] 10 20
#
Similarly, if your data were balanced, split plus rbind could get you what you need. Using the sample data from @lukeA:
df <- read.table(sep=" ", header=F, text="
ABC 2
ABC 4
ABC 6
DEF 10
DEF 20
DEF 20")
do.call(rbind, split(df$V2, df$V1))
# [,1] [,2] [,3]
# ABC 2 4 6
# DEF 10 20 20
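If the groups are unbalanced but a matrix is still wanted, one option is to pad the shorter groups with NA before binding them. This is a minimal sketch, assuming the same V1/V2 columns as above; the names groups, n_max and wide are only illustrative:

```r
# Unbalanced sample data: ABC has three values, DEF has two
df <- data.frame(V1 = c("ABC", "ABC", "ABC", "DEF", "DEF"),
                 V2 = c(2, 4, 6, 10, 20),
                 stringsAsFactors = FALSE)

# One numeric vector per group
groups <- split(df$V2, df$V1)

# Length of the longest group
n_max <- max(lengths(groups))

# Indexing past the end of a vector yields NA, which pads the short groups;
# sapply returns one column per group, so transpose to get one row per group
wide <- t(sapply(groups, function(x) x[seq_len(n_max)]))
wide
```

Row names carry the group labels, and the missing third value for DEF shows up as NA.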

You want to obtain a sparse matrix? The two rows in your example have different lengths. Try a function producing a list:
mat<-cbind(
c("ABC","ABC","ABC","DEF","DEF"),
c(2,4,6,10,20)
)
count<-function(mat){
values<-unique(mat[,1])
outlist<-list()
for(v in values){
outlist[[v]]<-mat[mat[,1]==v,2]
}
return(outlist)
}
count(mat)
Which will give you this result:
$ABC
[1] "2" "4" "6"
$DEF
[1] "10" "20"
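Since the question mentions aggregate: it accepts any function, not only mean or sum, so paste can be used to show the values directly. This is a rough sketch, assuming the V1/V2 column names used in the other answers; note that the result holds one string per group rather than separate numeric columns:

```r
df <- data.frame(V1 = c("ABC", "ABC", "ABC", "DEF", "DEF"),
                 V2 = c(2, 4, 6, 10, 20),
                 stringsAsFactors = FALSE)

# Extra arguments (here collapse) are passed on to FUN,
# so each group's values are pasted into a single string
res <- aggregate(V2 ~ V1, data = df, FUN = paste, collapse = " ")
res
```

This suits display purposes; for further computation the list or matrix forms above are probably easier to work with.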

Related

Obtaining the index of elements that belong to common group in R

I have the following data frame,
>df
Label
0 control1
1 control1
2 control2
3 control2
4 control1
To get the index of the elements with label control1 and control2, I do the following
Index1 <- grep("control1",df[,1])
Index2 <- grep("control2",df[,1])
In the above syntax, the labels control1 and control2 are explicitly mentioned in the command.
Is there a way to find the labels automatically? The reason is that the contents of the data frame df are parsed from different input files.
For instance, I could have another data frame that reads
>df2
Label
0 trol1
1 trol1
2 trol2
3 trol3
4 trol2
Is there a way to create a list of unique labels present in the column of df?
We can use split to get a list of row indices for each unique Label
split(1:nrow(df), df$Label)
#$control1
#[1] 1 2 5
#$control2
#[1] 3 4
With df2
split(1:nrow(df2), df2$Label)
#$trol1
#[1] 1 2
#$trol2
#[1] 3 5
#$trol3
#[1] 4
Using unique and which you can do:
df <- data.frame(Label = c("trol1", "trol1", "trol2", "trol3", "trol2"), stringsAsFactors=FALSE)
label_idx = list()
for(lbl in unique(df$Label)){
label_idx[[lbl]] = which(df$Label == lbl)
}
label_idx
$trol1
[1] 1 2
$trol2
[1] 3 5
$trol3
[1] 4
You can try also
lapply(unique(df$Label), function(x) which(df$Label%in% x))
#with df
[[1]]
[1] 1 2 5
[[2]]
[1] 3 4
lapply(unique(df2$Label), function(x) which(df2$Label%in% x))
#with df2
[[1]]
[1] 1 2
[[2]]
[1] 3 5
[[3]]
[1] 4
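The lapply results above are unnamed, so it is not obvious which element belongs to which label. One way around that, sketched here with the control1/control2 data from the question (labels and idx are illustrative names), is to name the vector being iterated over:

```r
df <- data.frame(Label = c("control1", "control1", "control2",
                           "control2", "control1"),
                 stringsAsFactors = FALSE)

# The unique labels, found automatically
labels <- unique(df$Label)

# lapply keeps the names of its input, so name each label after itself
idx <- lapply(setNames(labels, labels), function(x) which(df$Label == x))
idx
```

After this, idx$control1 or idx[["control2"]] returns the row indices for that label.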

R split array into Data frame

VERY new to R and struggling with knowing exactly what to ask, have found a similar question here
How to split a character vector into data frame?
but this has fixed length, and I've been unable to adjust for my problem
I've got some data in an array in R
TEST <- c("Value01:100|Value02:200|Value03:300|","Value04:1|Value05:2|",
"StillAValueButNamesAreNotConsistent:12345.6789|",
"AlsoNotAllLinesAreTheSameLength:1|")
The data is stored in pairs, and I'm looking to split out into a dataframe as such:
Variable Value
Value01 100
Value02 200
Value03 300
Value04 1
Value05 2
StillAValueButNamesAreNotConsistent 12345.6789
AlsoNotAllLinesAreTheSameLength 1
The Variable name is a string and the value will always be a number
Any help would be great!
Thanks
One can use tidyr based solution. Convert vector TEST to a data.frame and remove the last | from each row as that doesn't carry any meaning as such.
Now, use tidyr::separate_rows to expand rows based on | and then separate data in 2 columns using tidyr::separate function.
library(dplyr)
library(tidyr)
data.frame(TEST) %>%
mutate(TEST = gsub("\\|$","",TEST)) %>%
separate_rows(TEST, sep = "[|]") %>%
separate(TEST, c("Variable", "Value"), ":")
# Variable Value
# 1 Value01 100
# 2 Value02 200
# 3 Value03 300
# 4 Value04 1
# 5 Value05 2
# 6 StillAValueButNamesAreNotConsistent 12345.6789
# 7 AlsoNotAllLinesAreTheSameLength 1
We can do it in base R with one line. Just change the | characters to line breaks then use : as the sep value in read.table(). You can also set column names there too.
read.table(text = gsub("\\|", "\n", TEST), sep = ":",
col.names = c("Variable", "Value"))
# Variable Value
# 1 Value01 100.00
# 2 Value02 200.00
# 3 Value03 300.00
# 4 Value04 1.00
# 5 Value05 2.00
# 6 StillAValueButNamesAreNotConsistent 12345.68
# 7 AlsoNotAllLinesAreTheSameLength 1.00
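The 12345.68 in the printed output is only a display matter: read.table parses the column as numeric and the default print width rounds what is shown. A small check, assuming the TEST vector from the question (res is an illustrative name):

```r
TEST <- c("Value01:100|Value02:200|Value03:300|", "Value04:1|Value05:2|",
          "StillAValueButNamesAreNotConsistent:12345.6789|",
          "AlsoNotAllLinesAreTheSameLength:1|")

res <- read.table(text = gsub("\\|", "\n", TEST), sep = ":",
                  col.names = c("Variable", "Value"))

# The Value column is numeric, not character
is.numeric(res$Value)

# Ask for more significant digits to see the stored value
print(res$Value[6], digits = 10)
```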
With Base R:
(I've broken out each step to hopefully make the code clear)
# your data
myvec <- c("Value01:100|Value02:200|Value03:300|","Value04:1|Value05:2|",
"StillAValueButNamesAreNotConsistent:12345.6789|",
"AlsoNotAllLinesAreTheSameLength:1|")
# convert into one long string
all_text_str <- paste0(myvec, collapse="")
# split the string by "|"
all_text_vec <- unlist(strsplit(all_text_str, split="\\|"))
# split each "|"-group by ":"
data_as_list <- strsplit(all_text_vec, split=":")
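# collect into a data frame (rbind alone returns a character matrix,
# on which names<- and $ would not work as intended)
df <- as.data.frame(do.call(rbind, data_as_list), stringsAsFactors = FALSE)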
# clean up the dataframe by adding names and converting value to numeric
names(df) <- c("variable", "value")
df$value <- as.numeric(df$value)
With the help of the strsplit and unlist functions. Each command is shown with its output below.
Input
TEST
# [1] "Value01:100|Value02:200|Value03:300|"
# [2] "Value04:1|Value05:2|"
# [3] "StillAValueButNamesAreNotConsistent:12345.6789|"
# [4] "AlsoNotAllLinesAreTheSameLength:1|"
Splitting by | and then by :
my_list <- strsplit(unlist(strsplit(TEST, "|", fixed = TRUE)), ":", fixed = TRUE)
my_list
# [[1]]
# [1] "Value01" "100"
# [[2]]
# [1] "Value02" "200"
# [[3]]
# [1] "Value03" "300"
# [[4]]
# [1] "Value04" "1"
# [[5]]
# [1] "Value05" "2"
# [[6]]
# [1] "StillAValueButNamesAreNotConsistent" "12345.6789"
# [[7]]
# [1] "AlsoNotAllLinesAreTheSameLength" "1"
Converting above list to data.frame
df <- data.frame(matrix(unlist(my_list), ncol = 2, byrow=TRUE))
df
# X1 X2
# 1 Value01 100
# 2 Value02 200
# 3 Value03 300
# 4 Value04 1
# 5 Value05 2
# 6 StillAValueButNamesAreNotConsistent 12345.6789
# 7 AlsoNotAllLinesAreTheSameLength 1
Colnames to dataframe
names(df) <- c("Variable", "Value")
df
# Variable Value
# 1 Value01 100
# 2 Value02 200
# 3 Value03 300
# 4 Value04 1
# 5 Value05 2
# 6 StillAValueButNamesAreNotConsistent 12345.6789
# 7 AlsoNotAllLinesAreTheSameLength 1
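In the matrix-based result above both columns are still text, because a matrix holds a single type. Since the question says the value is always a number, a variant that builds the columns separately and converts Value along the way might look like this (pairs and out are illustrative names):

```r
TEST <- c("Value01:100|Value02:200|Value03:300|", "Value04:1|Value05:2|",
          "StillAValueButNamesAreNotConsistent:12345.6789|",
          "AlsoNotAllLinesAreTheSameLength:1|")

# Split on "|" first, then split each "name:value" pair on ":"
pairs <- strsplit(unlist(strsplit(TEST, "|", fixed = TRUE)), ":", fixed = TRUE)

# Pull out the first and second element of every pair;
# vapply checks that each extraction yields exactly one string
out <- data.frame(
  Variable = vapply(pairs, `[`, character(1), 1),
  Value    = as.numeric(vapply(pairs, `[`, character(1), 2)),
  stringsAsFactors = FALSE
)
str(out)
```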

Storing frequencies returned from table function in R

I have a vector of size 5 which stores random digits 0-9 so that there can be multiple occurrences of the same digit. Here is an example vector:
nums <- c(5,2,5,9,2)
If I print the results of running the table function on this vector, I get the following output:
nums
2 5 9
2 2 1
I would like to know what the highest and second highest frequencies are that are returned from table(nums). How can I store all of the frequencies that are returned from an iteration of the table function?
table returns an array that can be saved to a variable. If you convert it to a data.frame using as.data.frame, you get an object that is easier to work with:
nums <- c(5,2,5,9,2)
tab <- as.data.frame(table(nums))
tab
nums Freq
1 2 2
2 5 2
3 9 1
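If "highest and second highest frequencies" means the two largest distinct counts (2 and 1 in this example, rather than the tied 2 and 2), the data frame form makes that straightforward. A sketch under that reading, with top_freqs as an illustrative name:

```r
nums <- c(5, 2, 5, 9, 2)
tab <- as.data.frame(table(nums), stringsAsFactors = FALSE)

# Order the rows from most to least frequent
tab <- tab[order(tab$Freq, decreasing = TRUE), ]

# Distinct counts in decreasing order; keep at most the first two
top_freqs <- head(unique(tab$Freq), 2)
top_freqs
```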
You can use plyr; it's lightning fast.
library(plyr)
nums <- c(5,2,5,9,2)
count(nums)
Result
x freq
2 2
5 2
9 1
To shrink the table only to the two most frequent options you would want
sort(table(nums), dec = TRUE)[1:2]
# nums
# 2 5
# 2 2
Just to get their names you could do
names(sort(table(nums), dec = TRUE))[1:2]
# [1] "2" "5"
If it may happen that there are not that many unique values, you could use na.omit, as in
names(sort(table(nums), dec = TRUE))[1:4]
# [1] "2" "5" "9" NA
na.omit(names(sort(table(nums), dec = TRUE))[1:4])
# [1] "2" "5" "9"
# attr(,"na.action")
# [1] 4
# attr(,"class")
# [1] "omit"
As for storing the results, using a list should be pretty convenient:
tabs <- list()
tabs[[1]] <- sort(table(nums), dec = TRUE)[1:2]
tabs[[2]] <- sort(table(c(1, 1, 2, 3, 3)), dec = TRUE)[1:2]
tabs
# [[1]]
# nums
# 2 5
# 2 2
#
# [[2]]
#
# 1 3
# 2 2
In particular, a list still works when the number of distinct values differs from one table to the next.
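As an alternative to na.omit, the index can be capped at the number of distinct values so that no NA is produced in the first place. A minimal sketch (freqs, k and top are illustrative names):

```r
nums <- c(5, 2, 5, 9, 2)

# Frequencies from most to least common
freqs <- sort(table(nums), decreasing = TRUE)

# Never ask for more entries than there are distinct values
k <- min(2, length(freqs))
top <- freqs[seq_len(k)]

names(top)      # the most frequent digits, as character strings
as.vector(top)  # their counts
```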

How to split character value properly

I have a data frame which consists of some composite information. I would like to split the vector a into the vectors "a" and "d", where "a" corresponds only to the numeric ID 898, 3467 ,234 ,222 and vector "d" contains the corresponding character values.
Data:
a<-c("898_Me","3467_You or ", "234_Hi-hi", "222_what")
b<-c(1,8,3,8)
c<-c(2,4,6,2)
df<-data.frame(a,b,c)
What I tried so far:
a<-str(df$a)
a<-strsplit(df$a, split)
But that just doesn't work out with my regular expression skills.
The required output table might have the form:
a d b c
898 Me 1 2
3467 You or 8 4
234 Hi-hi 3 6
222 what 8 2
library(tidyr)
a<-c("898_Me","3467_You or ", "234_Hi-hi", "222_what")
b<-c(1,8,3,8)
c<-c(2,4,6,2)
df <-data.frame(a,b,c)
final_df <- separate(df , a , c("a" , "d") , sep = "_")
# a d b c
#1 898 Me 1 2
#2 3467 You or 8 4
#3 234 Hi-hi 3 6
#4 222 what 8 2
final_df$d
# [1] "Me" "You or " "Hi-hi" "what"
strsplit is right, but you need to pass the character to split with:
do.call(rbind, strsplit(as.character(df$a), "_"))
# [,1] [,2]
# [1,] "898" "Me"
# [2,] "3467" "You or "
# [3,] "234" "Hi-hi"
# [4,] "222" "what"
Or
library(stringi)
stri_split_fixed(df$a, "_", simplify = TRUE)
With your example, here is my solution in base R:
df$a2 <- gsub("[^0-9]", "", a)
df$d <- gsub("[0-9]", "", a)
That gives:
> df
a b c a2 d
1 898_Me 1 2 898 _Me
2 3467_You or 8 4 3467 _You or
3 234_Hi-hi 3 6 234 _Hi-hi
4 222_what 8 2 222 _what
Not elegant, but it preserves the original data and is easy to apply.
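Note that the gsub version leaves the leading underscore in d and would also strip any digits that occur inside the text part. A base R sketch that splits only at the first underscore, assuming every value has the form <digits>_<text> (id, d and out are illustrative names):

```r
a <- c("898_Me", "3467_You or ", "234_Hi-hi", "222_what")
df <- data.frame(a, b = c(1, 8, 3, 8), c = c(2, 4, 6, 2),
                 stringsAsFactors = FALSE)

# Everything before the first underscore is the numeric ID
id <- as.integer(sub("_.*$", "", df$a))

# Everything after the first underscore is the text;
# trimws drops the trailing space in "You or "
d <- trimws(sub("^[^_]*_", "", df$a))

out <- data.frame(a = id, d = d, b = df$b, c = df$c,
                  stringsAsFactors = FALSE)
out
```

Drop the trimws call if the trailing whitespace is meaningful.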

How to turn variable names into factors in a data frame in R

Say I have a data frame containing time-series data, where the first column is the index, and the remaining columns all contain different data streams, and are named descriptively, as in the following example:
temps = data.frame(matrix(1:20,nrow=2,ncol=10))
names(temps) <- c("flr1_dirN_areaA","flr1_dirS_areaA","flr1_dirN_areaB","flr1_dirS_areaB","flr2_dirN_areaA","flr2_dirS_areaA","flr2_dirN_areaB","flr2_dirS_areaB","flr3_dirN_areaA","flr3_dirS_areaA")
temps$Index <- as.Date(2013,7,1:2)
temps
flr1_dirN_areaA flr1_dirS_areaA ... Index
1 1 3 ... 1975-07-15
2 2 4 ... 1975-07-16
Now I want to prep the data frame for plotting with ggplot2, and I want to include the three factors: flr, dir, and area.
I can achieve this for this simple example as follows:
temps.m <- melt(temps,"Index")
temps.m$flr <- factor(rep(1:3,c(8,8,4)))
temps.m$dir <- factor(rep(c("N","S"),each=2,len=20))
temps.m$area <- factor(rep(c("A","B"),each=4,len=20))
temps.m
Index variable value flr dir area
1 1975-07-15 flr1_dirN_areaA 1 1 N A
2 1975-07-16 flr1_dirN_areaA 2 1 N A
3 1975-07-15 flr1_dirS_areaA 3 1 S A
4 1975-07-16 flr1_dirS_areaA 4 1 S A
5 1975-07-15 flr1_dirN_areaB 5 1 N B
6 1975-07-16 flr1_dirN_areaB 6 1 N B
7 1975-07-15 flr1_dirS_areaB 7 1 S B
8 1975-07-16 flr1_dirS_areaB 8 1 S B
9 1975-07-15 flr2_dirN_areaA 9 2 N A
10 1975-07-16 flr2_dirN_areaA 10 2 N A
11 1975-07-15 flr2_dirS_areaA 11 2 S A
12 1975-07-16 flr2_dirS_areaA 12 2 S A
13 1975-07-15 flr2_dirN_areaB 13 2 N B
14 1975-07-16 flr2_dirN_areaB 14 2 N B
15 1975-07-15 flr2_dirS_areaB 15 2 S B
16 1975-07-16 flr2_dirS_areaB 16 2 S B
17 1975-07-15 flr3_dirN_areaA 17 3 N A
18 1975-07-16 flr3_dirN_areaA 18 3 N A
19 1975-07-15 flr3_dirS_areaA 19 3 S A
20 1975-07-16 flr3_dirS_areaA 20 3 S A
In reality, I have data streams (columns) of varying lengths - each of which comes from its own file, has missing data, more than 3 factors encoded in the column (file) names, so this simple method of applying factors won't work. I need something more robust, and I'm inclined to parse the variable names into the different factors, and populate the factor-columns of the melted data frame.
My end goal is to plot something like this:
ggplot(temps.m,aes(x=Index,y=value,color=area,linetype=dir))+geom_line()+facet_grid(flr~.)
I imagine that the reshape, reshape2, plyr, or some other package can do this in one or two statements - but I struggle with melt/cast/ddply and the rest of them. Any suggestions?
Also, if you can suggest an entirely different [better] approach to structuring my data, I'm all ears.
Thanks in advance
You can use regular expressions to create your factors:
res <- do.call(rbind,strsplit(gsub('flr([0-9]+).*dir([A-Z]).*area([A-Z])',
'\\1,\\2,\\3',
temps.m$variable),
','))
[,1] [,2] [,3]
[1,] "1" "N" "A"
[2,] "1" "N" "A"
[3,] "1" "S" "A"
[4,] "1" "S" "A"
[5,] "1" "N" "B"
[6,] "1" "N" "B"
[7,] "1" "S" "B"
[8,] "1" "S" "B"
........
You may need a further step to convert the columns to factors (colwise is from plyr).
res <- colwise(as.factor)(data.frame(res))
X1 X2 X3
1 1 N A
2 1 N A
3 1 S A
4 1 S A
........
To combine the result with your melted data you can use cbind
temps.m <- cbind(temps.m,res)
Here's a way to turn a bunch of appropriately-formatted strings into a data frame of factor variables. This assumes the factors are split by _, and the last character in each substring is the desired level.
require(plyr)
v <- do.call(rbind, strsplit(as.character(temps.m$variable), "_"))
v <- alply(v, 2, function(x) {
n <- nchar(x)
name <- substr(x, 1, n - 1)[1]
lev <- substr(x, n, n)
structure(factor(lev), name=name)
})
names(v) <- sapply(v, attr, "name")
temps.m <- cbind(temps.m, as.data.frame(v))
Adding more generality is left as an exercise for the reader.
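One possible step toward that generality, in base R only: instead of taking the last character as the level, strip the lowercase name prefix from each piece, so multi-character levels such as a floor "10" would survive. This sketch assumes every piece has the form <lowercase name><level> and that all names have the same number of pieces; vars, parts, lev and factors are illustrative names:

```r
vars <- c("flr1_dirN_areaA", "flr2_dirS_areaB", "flr3_dirN_areaA")

# One row per variable name, one column per "_"-separated piece
parts <- do.call(rbind, strsplit(vars, "_", fixed = TRUE))

# Factor names: the leading lowercase letters of each piece (taken from row 1)
nm <- sub("^([a-z]+).*$", "\\1", parts[1, ])

# Levels: whatever remains after removing that prefix;
# assigning into lev[] keeps the matrix shape
lev <- parts
lev[] <- sub("^[a-z]+", "", parts)

factors <- as.data.frame(lev, stringsAsFactors = TRUE)
names(factors) <- nm
factors
```

Applied to as.character(temps.m$variable) instead of vars, the result could be combined with the melted data via cbind, as in the answers above.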
