I'm very new to R and struggling to know exactly what to ask. I have found a similar question here:
How to split a character vector into data frame?
but it assumes a fixed length, and I've been unable to adapt it to my problem.
I've got some data in a character vector in R:
TEST <- c("Value01:100|Value02:200|Value03:300|","Value04:1|Value05:2|",
"StillAValueButNamesAreNotConsistent:12345.6789|",
"AlsoNotAllLinesAreTheSameLength:1|")
The data is stored in name:value pairs, and I'm looking to split it out into a data frame like this:
Variable Value
Value01 100
Value02 200
Value03 300
Value04 1
Value05 2
StillAValueButNamesAreNotConsistent 12345.6789
AlsoNotAllLinesAreTheSameLength 1
The Variable name is a string and the value will always be a number.
Any help would be great!
Thanks
One can use a tidyr-based solution. Convert the vector TEST to a data.frame and remove the trailing | from each row, as it doesn't carry any meaning.
Then use tidyr::separate_rows to expand the rows on | and tidyr::separate to split each entry into 2 columns on :.
library(dplyr)
library(tidyr)
data.frame(TEST) %>%
mutate(TEST = gsub("\\|$","",TEST)) %>%
separate_rows(TEST, sep = "[|]") %>%
separate(TEST, c("Variable", "Value"), ":")
# Variable Value
# 1 Value01 100
# 2 Value02 200
# 3 Value03 300
# 4 Value04 1
# 5 Value05 2
# 6 StillAValueButNamesAreNotConsistent 12345.6789
# 7 AlsoNotAllLinesAreTheSameLength 1
We can do it in base R with one line. Just change the | characters to line breaks then use : as the sep value in read.table(). You can also set column names there too.
read.table(text = gsub("\\|", "\n", TEST), sep = ":",
col.names = c("Variable", "Value"))
# Variable Value
# 1 Value01 100.00
# 2 Value02 200.00
# 3 Value03 300.00
# 4 Value04 1.00
# 5 Value05 2.00
# 6 StillAValueButNamesAreNotConsistent 12345.68
# 7 AlsoNotAllLinesAreTheSameLength 1.00
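Note that the 12345.68 in the printed output is only a display effect: R prints seven significant digits by default, and the stored value keeps all its decimals. A minimal sketch to check this, assuming the same TEST vector as in the question (the name parsed is only for illustration):

```r
TEST <- c("Value01:100|Value02:200|Value03:300|", "Value04:1|Value05:2|",
          "StillAValueButNamesAreNotConsistent:12345.6789|",
          "AlsoNotAllLinesAreTheSameLength:1|")

# Same approach as above; fixed = TRUE treats "|" as a literal character
parsed <- read.table(text = gsub("|", "\n", TEST, fixed = TRUE), sep = ":",
                     col.names = c("Variable", "Value"),
                     stringsAsFactors = FALSE)

# Value was read as a numeric column, not as text
is.numeric(parsed$Value)

# Print with more digits to see the full stored value
print(parsed$Value[6], digits = 10)
```

If the rounded display is a concern, options(digits = ...) changes the default for the session.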
With Base R:
(I've broken out each step to hopefully make the code clear)
# your data
myvec <- c("Value01:100|Value02:200|Value03:300|","Value04:1|Value05:2|",
"StillAValueButNamesAreNotConsistent:12345.6789|",
"AlsoNotAllLinesAreTheSameLength:1|")
# convert into one long string
all_text_str <- paste0(myvec, collapse="")
# split the string by "|"
all_text_vec <- unlist(strsplit(all_text_str, split="\\|"))
# split each "|"-group by ":"
data_as_list <- strsplit(all_text_vec, split=":")
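# collect into a data frame (rbind alone returns a character matrix,
# which supports neither names<- for columns nor $)
df <- as.data.frame(do.call(rbind, data_as_list), stringsAsFactors = FALSE)
# clean up the data frame by adding names and converting value to numeric
names(df) <- c("variable", "value")
df$value <- as.numeric(df$value)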
With the help of the strsplit and unlist functions. Each command is shown with its output below.
Input
TEST
# [1] "Value01:100|Value02:200|Value03:300|"
# [2] "Value04:1|Value05:2|"
# [3] "StillAValueButNamesAreNotConsistent:12345.6789|"
# [4] "AlsoNotAllLinesAreTheSameLength:1|"
Splitting by | and then by :
my_list <- strsplit(unlist(strsplit(TEST, "|", fixed = TRUE)), ":", fixed = TRUE)
my_list
# [[1]]
# [1] "Value01" "100"
# [[2]]
# [1] "Value02" "200"
# [[3]]
# [1] "Value03" "300"
# [[4]]
# [1] "Value04" "1"
# [[5]]
# [1] "Value05" "2"
# [[6]]
# [1] "StillAValueButNamesAreNotConsistent" "12345.6789"
# [[7]]
# [1] "AlsoNotAllLinesAreTheSameLength" "1"
Converting above list to data.frame
df <- data.frame(matrix(unlist(my_list), ncol = 2, byrow=TRUE))
df
# X1 X2
# 1 Value01 100
# 2 Value02 200
# 3 Value03 300
# 4 Value04 1
# 5 Value05 2
# 6 StillAValueButNamesAreNotConsistent 12345.6789
# 7 AlsoNotAllLinesAreTheSameLength 1
Adding column names to the data frame
names(df) <- c("Variable", "Value")
df
# Variable Value
# 1 Value01 100
# 2 Value02 200
# 3 Value03 300
# 4 Value04 1
# 5 Value05 2
# 6 StillAValueButNamesAreNotConsistent 12345.6789
# 7 AlsoNotAllLinesAreTheSameLength 1
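Going through matrix() leaves both columns as text. Since the question says the value is always a number, a possible variation builds the data frame with typed columns directly. This is only a sketch; the name pairs is illustrative and not part of the answer above:

```r
TEST <- c("Value01:100|Value02:200|Value03:300|", "Value04:1|Value05:2|",
          "StillAValueButNamesAreNotConsistent:12345.6789|",
          "AlsoNotAllLinesAreTheSameLength:1|")

# Split on "|" first, then split each "name:value" piece on ":"
pairs <- strsplit(unlist(strsplit(TEST, "|", fixed = TRUE)), ":", fixed = TRUE)

# Take element 1 of every pair as the name and element 2 as the number
df <- data.frame(
  Variable = vapply(pairs, `[`, character(1), 1),
  Value = as.numeric(vapply(pairs, `[`, character(1), 2)),
  stringsAsFactors = FALSE
)
```

vapply is used instead of sapply so that a malformed pair (for example one without a ":") raises an error rather than silently changing the shape of the result.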
Edited to add more details and clarify.
Basically, I have a list of data frames. They all have the same number of rows but different numbers of columns, so the dimensions of each data frame differ. What I want to do is select the first row of each data frame, put those rows into a new data frame and use it as the first element of a new list, then do the same for the second rows, third rows, and so on.
I have considered using two for loops to reassign the rows, but that seems like a bad way to go about it, given that nested for loops are slow and my data is huge. I would truly appreciate sound insight and help.
myList <- list()
df1 <- as.data.frame(matrix(1:6, nrow=3, ncol=2))
df2 <- as.data.frame(matrix(7:15, nrow=3, ncol=3))
myList[[1]]<-df1
myList[[2]]<-df2
print(myList)
Current example data -
> print(myList)
[[1]]
V1 V2
1 1 4
2 2 5
3 3 6
[[2]]
V1 V2 V3
1 7 10 13
2 8 11 14
3 9 12 15
Desired Outcome
> print(myList2)
[[1]]
V1 V2 V3
1 1 4 0
2 7 10 13
[[2]]
V1 V2 V3
1 2 5 0
2 8 11 14
[[3]]
V1 V2 V3
1 3 6 0
2 9 12 15
The different dimensions of the current data frames makes it tricky.
Here's a base R method that works by:
Adding all the column names to each list item.
Converting the list to an array.
Transposing the array using aperm to match your intended output.
Optionally turning the array into a list using apply.
myListBase <- myList #added because we modify the original list
#get all of the unique names from the list of dataframes
##default ordering is by ordering in list
all_cols <- Reduce(base::union, lapply(myListBase, names))
#loop, add new columns, and then re-order them so all data.frames
# have the same order
myListBase <- lapply(myListBase,
                     function(DF){
                       DF[, base::setdiff(all_cols, names(DF))] <- 0 #initialize columns
                       DF[, all_cols] #reorder columns
                     }
)
#create 3D array - could be simplified using abind::abind(myListBase, along = 3)
myArrayBase <- array(unlist(myListBase, use.names = F),
dim = c(nrow(myListBase[[1]]), #rows
length(all_cols), #columns
length(myListBase) #3rd dimension
),
dimnames = list(NULL, all_cols, NULL))
#rows and 3rd dimension are transposed
myPermBase <- aperm(myArrayBase, c(3,2,1))
myPermBase
#, , 1
#
# V1 V2 V3
#[1,] 1 4 0
#[2,] 7 10 13
#
#, , 2
#
# V1 V2 V3
#[1,] 2 5 0
#[2,] 8 11 14
#
#, , 3
#
# V1 V2 V3
#[1,] 3 6 0
#[2,] 9 12 15
#make list of dataframes - likely not necessary
apply(myPermBase, 3, data.frame)
#[[1]]
# V1 V2 V3
#1 1 4 0
#2 7 10 13
#
#[[2]]
# V1 V2 V3
#1 2 5 0
#2 8 11 14
#
#[[3]]
# V1 V2 V3
#1 3 6 0
#2 9 12 15
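As a cross-check on the array approach, the same result can be assembled row by row with two nested lapply calls. This is a sketch only, not the benchmarked method, and it is likely slower on large data; the names padded and myList2 are illustrative:

```r
myList <- list(
  as.data.frame(matrix(1:6, nrow = 3, ncol = 2)),
  as.data.frame(matrix(7:15, nrow = 3, ncol = 3))
)

# Union of all column names, in order of first appearance
all_cols <- Reduce(union, lapply(myList, names))

# Pad every data frame with zero-filled columns and align the column order
padded <- lapply(myList, function(DF) {
  DF[, setdiff(all_cols, names(DF))] <- 0
  DF[, all_cols]
})

# For each row index, stack that row from every padded data frame
myList2 <- lapply(seq_len(nrow(padded[[1]])), function(i) {
  out <- do.call(rbind, lapply(padded, function(DF) DF[i, , drop = FALSE]))
  rownames(out) <- NULL
  out
})
```

Each element of myList2 is a data frame with one row per input data frame, which makes it easy to compare against the output of the aperm version.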
Performance
The first version of the answer had a data.table and abind method but I've removed it - the base version is much faster and there's not much additional clarity gained.
Unit: microseconds
expr min lq mean median uq max neval
camille_purrr_dplyr 7910.9 8139.25 8614.956 8246.30 8387.20 60159.5 1000
cole_DT_abind 2555.8 2804.75 3012.671 2917.95 3061.55 6602.3 1000
cole_base 600.3 634.40 697.987 663.00 733.10 3761.6 1000
Complete code for reference:
library(dplyr)
library(purrr)
library(data.table)
library(abind)
library(microbenchmark)
myList <- list()
df1 <- as.data.frame(matrix(1:6, nrow=3, ncol=2))
df2 <- as.data.frame(matrix(7:15, nrow=3, ncol=3))
myList[[1]]<-df1
myList[[2]]<-df2
microbenchmark(
camille_purrr_dplyr = {
myList %>%
map_dfr(tibble::rownames_to_column, var = "id") %>%
mutate_at(vars(-id), ~ifelse(is.na(.), 0, .)) %>%
split(.$id) %>%
map(select, -id)
}
,
cole_DT_abind = {
myListDT <- copy(myList)
all_cols <- Reduce(base::union, lapply(myListDT, names))
# data.table used for side effects of updating-by-reference in lapply
lapply(myListDT, setDT)
# add non-existing columns
lapply(myListDT,
function(DT) {
DT[, base::setdiff(all_cols, names(DT)) := 0]
setcolorder(DT, all_cols) # reorder columns (setorderv would sort rows instead)
})
# abind is used to make an array
myArray <- abind(myListDT, along = 3)
# aperm is used to transpose the array to the preferred route
myPermArray <- aperm(myArray, c(3,2,1))
# myPermArray
#or as a list of data.frames
apply(myPermArray, 3, data.frame)
}
,
cole_base = {
myListBase <- myList
all_cols <- Reduce(base::union, lapply(myListBase, names))
myListBase <- lapply(myListBase,
function(DF){
DF[, base::setdiff(all_cols, names(DF))] <- 0
DF[, all_cols]
}
)
myArrayBase <- array(unlist(myListBase, use.names = F),
dim = c(nrow(myListBase[[1]]), length(all_cols), length(myListBase)),
dimnames = list(NULL, all_cols, NULL))
myPermBase <- aperm(myArrayBase, c(3,2,1))
apply(myPermBase, 3, data.frame)
}
# ,
# cole_base_aperm = {
# myListBase <- myList
#
# all_cols <- Reduce(base::union, lapply(myListBase, names))
#
# myListBase <- lapply(myListBase,
# function(DF){
# DF[, base::setdiff(all_cols, names(DF))] <- 0
# DF[, all_cols]
# }
# )
#
# myArrayABind <- abind(myListBase, along = 3)
#
# myPermBase <- aperm(myArrayABind, c(3,2,1))
# apply(myPermBase, 3, data.frame)
# }
, times = 1000
)
One way with a few dplyr & purrr functions is to add an ID column to each row in each data frame, bind them all, then split by that ID. The base rbind would throw an error because of the mismatched column names, but dplyr::bind_rows takes a list of any number of data frames and adds NA columns for anything missing.
First step gets you one data frame:
library(dplyr)
library(purrr)
myList %>%
map_dfr(tibble::rownames_to_column, var = "id")
#> id V1 V2 V3
#> 1 1 1 4 NA
#> 2 2 2 5 NA
#> 3 3 3 6 NA
#> 4 1 7 10 13
#> 5 2 8 11 14
#> 6 3 9 12 15
Fill in the NAs with 0 in all columns except the ID—this could also be adjusted if need be. Split by ID, and drop the ID column since you no longer need it.
myList %>%
map_dfr(tibble::rownames_to_column, var = "id") %>%
mutate_at(vars(-id), ~ifelse(is.na(.), 0, .)) %>%
split(.$id) %>%
map(select, -id)
#> $`1`
#> V1 V2 V3
#> 1 1 4 0
#> 4 7 10 13
#>
#> $`2`
#> V1 V2 V3
#> 2 2 5 0
#> 5 8 11 14
#>
#> $`3`
#> V1 V2 V3
#> 3 3 6 0
#> 6 9 12 15
I have a data table like
sample1 sample2 sample3
fruit1 10 20 30
fruit2 1 5 6
fruit3 3 7 8
etc.
I want to find the top 1 percentile of fruits in each sample in R (according to the number in each sample). Is there a simple way to do this?
You can lapply over your data and, for each column, subset the rownames of df with a logical vector that is TRUE when the corresponding value in the column is in the top 1 percent (i.e. above the 99th percentile, computed as 100 - 1).
Create example data
set.seed(2019)
df <- as.data.frame(matrix(sample(1e4, replace = T), 1e3, 10))
names(df) <- paste0('sample', seq_along(df))
rownames(df) <- paste0('fruit', seq_len(nrow(df)))
Step described above:
lapply(df, function(x) rownames(df)[x > quantile(x, (100 - 1)/100)])
# $`sample1`
# [1] "fruit57" "fruit76" "fruit149" "fruit471" "fruit520" "fruit682" "fruit805"
# [8] "fruit949" "fruit966" "fruit975"
#
# $sample2
# [1] "fruit49" "fruit109" "fruit232" "fruit274" "fruit312" "fruit795" "fruit883"
# [8] "fruit884" "fruit955" "fruit958"
#
# $sample3
# [1] "fruit37" "fruit189" "fruit231" "fruit256" "fruit473" "fruit654" "fruit729"
# [8] "fruit742" "fruit820" "fruit979"
#
# ...
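The same step can be wrapped in a small helper and tried on the three-row table from the question. This is a sketch under assumptions: the function name top_percent and its pct argument are hypothetical, and with only three rows the top 1 percent is simply the largest value in each sample:

```r
fruits <- data.frame(
  sample1 = c(10, 1, 3),
  sample2 = c(20, 5, 7),
  sample3 = c(30, 6, 8),
  row.names = c("fruit1", "fruit2", "fruit3")
)

# Hypothetical helper: row names whose value lies strictly above the
# (100 - pct)th percentile of each column
top_percent <- function(data, pct = 1) {
  lapply(data, function(x) rownames(data)[x > quantile(x, (100 - pct) / 100)])
}

top <- top_percent(fruits)
top
```

Because the comparison is strict (>), a value that sits exactly on the cut-off is excluded; switch to >= if it should be kept.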
Assuming your data frame is called "fruit":
fruit <- fruit[order(fruit$sample1, decreasing = TRUE), ]
top.1.percent <- fruit[seq_len(ceiling(nrow(fruit) / 100)), ]
This should do the trick for sample1.
I have a vector of size 5 which stores random digits 0-9 so that there can be multiple occurrences of the same digit. Here is an example vector:
nums <- c(5,2,5,9,2)
If I print the results of running the table function on this vector, I get the following output:
nums
2 5 9
2 2 1
I would like to know what the highest and second highest frequencies are that are returned from table(nums). How can I store all of the frequencies that are returned from an iteration of the table function?
table returns an array that can be saved to a variable. If you convert it to a data.frame using as.data.frame you get an easier to work with object:
nums <- c(5,2,5,9,2)
tab <- as.data.frame(table(nums))
tab
nums Freq
1 2 2
2 5 2
3 9 1
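If only the counts themselves are needed, one possible sketch strips the table down to a plain vector and sorts it. The names freqs, top_two and distinct_top_two are illustrative and not from the question:

```r
nums <- c(5, 2, 5, 9, 2)

# Counts as a plain integer vector, in the order table() reports them
freqs <- as.vector(table(nums))

# Two largest counts, ties included
top_two <- sort(freqs, decreasing = TRUE)[1:2]

# Two largest distinct counts, in case ties should collapse
distinct_top_two <- sort(unique(freqs), decreasing = TRUE)[1:2]
```

Which of the two is wanted depends on whether "second highest frequency" should count a tie for first place as second.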
You can use plyr; it's lightning fast.
library(plyr)
nums <- c(5,2,5,9,2)
count(nums)
Result
x freq
2 2
5 2
9 1
To shrink the table only to the two most frequent options you would want
sort(table(nums), dec = TRUE)[1:2]
# nums
# 2 5
# 2 2
Just to get their names you could do
names(sort(table(nums), dec = TRUE))[1:2]
# [1] "2" "5"
If there may be fewer unique values than you ask for, you could use na.omit, as in
names(sort(table(nums), dec = TRUE))[1:4]
# [1] "2" "5" "9" NA
na.omit(names(sort(table(nums), dec = TRUE))[1:4])
# [1] "2" "5" "9"
# attr(,"na.action")
# [1] 4
# attr(,"class")
# [1] "omit"
As for storing the results, using a list should be pretty convenient:
tabs <- list()
tabs[[1]] <- sort(table(nums), dec = TRUE)[1:2]
tabs[[2]] <- sort(table(c(1, 1, 2, 3, 3)), dec = TRUE)[1:2]
tabs
# [[1]]
# nums
# 2 5
# 2 2
#
# [[2]]
#
# 1 3
# 2 2
In particular, a list copes well when the number of distinct values varies from one table to the next.
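When the vectors come from repeated iterations, the list can be filled in one pass with lapply instead of by index. A minimal sketch, assuming the iterations' vectors are already collected in a list (here called samples, a made-up name with made-up values):

```r
samples <- list(c(5, 2, 5, 9, 2), c(1, 1, 2, 3, 3), c(7, 7, 7, 7, 7))

# For each vector keep at most the two most frequent values;
# seq_len(min(...)) avoids NA entries when there are fewer than two
tabs <- lapply(samples, function(x) {
  s <- sort(table(x), decreasing = TRUE)
  s[seq_len(min(2, length(s)))]
})

# Number of entries kept per iteration
lengths(tabs)
```

The third vector has a single distinct value, so its table has one entry rather than two, which is the varying-length case a list handles without padding.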
Consider the following:
df <- data.frame(a = 1, b = 2, c = 3)
names(df[1]) <- "d" ## First method
## a b c
##1 1 2 3
names(df)[1] <- "d" ## Second method
## d b c
##1 1 2 3
Neither method returned an error, but the first didn't change the column name, while the second did.
I thought it had something to do with the fact that I'm operating only on a subset of df, but then why does the following, for example, work fine?
df[1] <- 2
## a b c
##1 2 2 3
What I think is happening is that replacement into a data frame ignores the attributes of the data frame the new values are drawn from. I am not 100% sure of this, but the following experiments appear to back it up:
df <- data.frame(a = 1:3, b = 5:7)
# a b
# 1 1 5
# 2 2 6
# 3 3 7
df2 <- data.frame(c = 10:12)
# c
# 1 10
# 2 11
# 3 12
df[1] <- df2[1] # in this case `df[1] <- df2` is equivalent
Which produces:
# a b
# 1 10 5
# 2 11 6
# 3 12 7
Notice how the values changed for df, but not the names. Basically the replacement operator `[<-` only replaces the values. This is why the name was not updated. I believe this explains all the issues.
In the scenario:
names(df[2]) <- "x"
You can think of the assignment as follows (this is a simplification, see end of post for more detail):
tmp <- df[2]
# b
# 1 5
# 2 6
# 3 7
names(tmp) <- "x"
# x
# 1 5
# 2 6
# 3 7
df[2] <- tmp # `tmp` has "x" for names, but it is ignored!
# a b
# 1 10 5
# 2 11 6
# 3 12 7
The last step of which is an assignment with `[<-`, which doesn't respect the names attribute of the RHS.
But in the scenario:
names(df)[2] <- "x"
you can think of the assignment as (again, a simplification):
tmp <- names(df)
# [1] "a" "b"
tmp[2] <- "x"
# [1] "a" "x"
names(df) <- tmp
# a x
# 1 10 5
# 2 11 6
# 3 12 7
Notice how we directly assign to names, instead of assigning to df which ignores attributes.
df[2] <- 2
works because we are assigning directly to the values, not the attributes, so there are no problems here.
EDIT: based on some commentary from @AriB.Friedman, here is a more elaborate version of what I think is going on (note I'm omitting the S3 dispatch to `[.data.frame`, etc., for clarity):
Version 1 names(df[2]) <- "x" translates to:
df <- `[<-`(
df, 2,
value=`names<-`( # `names<-` here returns a re-named one column data frame
`[`(df, 2),
value="x"
) )
Version 2 names(df)[2] <- "x" translates to:
df <- `names<-`(
df,
`[<-`(
names(df), 2, "x"
) )
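Both translations can be run as ordinary function calls to see the difference. This is a sketch of the explanation above rather than a statement of what R does internally; the results are stored in v1 and v2 (illustrative names) instead of being assigned back to df:

```r
df <- data.frame(a = 1:3, b = 5:7)

# Version 1: rename a one-column copy, then put it back with `[<-`;
# the new name on the replacement value is dropped
v1 <- `[<-`(df, 2, value = `names<-`(df[2], "x"))
names(v1)

# Version 2: modify the names vector, then assign it with `names<-`
v2 <- `names<-`(df, `[<-`(names(df), 2, "x"))
names(v2)
```

Only the second form hands the modified names to `names<-` on the whole data frame, which is why it is the one that sticks.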
Also, it turns out this is "documented" in R Inferno Section 8.2.34 (thanks @Frank):
right <- wrong <- c(a=1, b=2)
names(wrong[1]) <- 'changed'
wrong
# a b
# 1 2
names(right)[1] <- 'changed'
right
# changed b
# 1 2
I have a data frame in this format:
ABC 2
ABC 4
ABC 6
DEF 10
DEF 20
How can I convert it to this:
ABC 2 4 6
DEF 10 20
I tried the aggregate function, but it needs functions like mean/sum as params. How can I just display the values directly in the row?
df <- read.table(sep=" ", header=F, text="
ABC 2
ABC 4
ABC 6
DEF 10
DEF 20")
unstack(df, form=V2~V1)
# $ABC
# [1] 2 4 6
#
# $DEF
# [1] 10 20
unstack produces a list in this case as the columns don't have the same length. In case of the same length:
df <- read.table(sep=" ", header=F, text="
ABC 2
ABC 4
ABC 6
DEF 10
DEF 20
DEF 20")
t(unstack(df, form=V2~V1))
# [,1] [,2] [,3]
# ABC 2 4 6
# DEF 10 20 20
Well, what are the observations? Are they supposed to measure the same thing for each category?
You can't actually get a data frame exactly as you have posted, because the number of observations differs between categories. But you could if you pad "DEF" with an NA.
Like this:
ABC 2 4 6
DEF 10 20 NA
If that is what you want, you could just use reshape2's dcast.
But you have to name the observations:
library(reshape2)
df <- data.frame(obs =c(1:3, 1:2),
categories = c(rep("ABC", 3), rep("DEF",2)),
values=c(2,4,6,10,20), stringsAsFactors=FALSE)
df2 <- dcast(df, categories~obs, value.var = "values")
df2
# categories 1 2 3
# 1 ABC 2 4 6
# 2 DEF 10 20 NA
To add to your alternatives:
This seems to be a basic "long to wide" reshape problem, but it is missing a "time" variable. It's easy to recreate one by using ave:
ave(as.character(df$V1), df$V1, FUN = seq_along)
# [1] "1" "2" "3" "1" "2"
df$time <- ave(as.character(df$V1), df$V1, FUN = seq_along)
Once you have a "time" variable, using reshape is pretty straightforward:
reshape(df, idvar="V1", timevar="time", direction = "wide")
# V1 V2.1 V2.2 V2.3
# 1 ABC 2 4 6
# 4 DEF 10 20 NA
If, instead, you wanted a list, there is no need for the time variable. Just use split:
split(df$V2, df$V1)
# $ABC
# [1] 2 4 6
#
# $DEF
# [1] 10 20
#
Similarly, if your data were balanced, split plus rbind could get you what you need. Using the sample data from @lukeA:
df <- read.table(sep=" ", header=F, text="
ABC 2
ABC 4
ABC 6
DEF 10
DEF 20
DEF 20")
do.call(rbind, split(df$V2, df$V1))
# [,1] [,2] [,3]
# ABC 2 4 6
# DEF 10 20 20
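For unbalanced data, the split result can still be turned into a matrix by padding the shorter groups with NA first. A minimal sketch, using the original five-row data rebuilt inline; the names groups, width and wide are illustrative:

```r
df <- data.frame(V1 = c("ABC", "ABC", "ABC", "DEF", "DEF"),
                 V2 = c(2, 4, 6, 10, 20))

# One numeric vector per category
groups <- split(df$V2, df$V1)

# Length of the longest group
width <- max(lengths(groups))

# Pad each group to that length with NA, then put categories in rows
wide <- t(sapply(groups, function(v) c(v, rep(NA, width - length(v)))))
wide
```

The result has one row per category and NA in the positions where a category has fewer observations.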
You want to obtain a sparse matrix? The two rows in your example have different lengths. Try a function producing a list:
mat<-cbind(
c("ABC","ABC","ABC","DEF","DEF"),
c(2,4,6,10,20)
)
count<-function(mat){
  values<-unique(mat[,1])
  outlist<-list()
  for(v in values){
    outlist[[v]]<-mat[mat[,1]==v,2]
  }
  return(outlist)
}
count(mat)
Which will give you this result:
$ABC
[1] "2" "4" "6"
$DEF
[1] "10" "20"