Splitting column of a data.frame into more columns - r

I want to split the Out column of the Test data.frame into several columns, separating on blank space. Here is my MWE. I tried the separate function from the tidyr package and the strsplit function from base R but couldn't figure out the problem.
Test <-
structure(list(Out = structure(1:2, .Label = c("t1* -0.4815861 0.3190424 0.2309631",
"t2* 0.9189246 -0.1998455 0.2499412"), class = "factor")),
.Names = "Out", row.names = c(NA, -2L), class = "data.frame")
library(dplyr)
library(tidyr)
Test %>% separate(Out, c("A", "B", "C", "D"), sep = " ")
Error: Values not split into 4 pieces at 1, 2
strsplit(Test$Out, " ")
Error in strsplit(Test$Out, " ") : non-character argument

Try
Test %>% separate(Out, c("A", "B", "C", "D"), sep = "\\s+")
The regular expression \\s+ matches one or more whitespace characters, so a run of several spaces is treated as a single separator.
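The strsplit error in the question comes from Out being a factor rather than a character vector. A minimal base R sketch, assuming the values are separated by runs of spaces (the double spaces in this sample data are an assumption, as are the names parts and res):

```r
# Hypothetical copy of the question's data; the double spaces are assumed
Test <- data.frame(
  Out = factor(c("t1*  -0.4815861  0.3190424  0.2309631",
                 "t2*  0.9189246  -0.1998455  0.2499412"))
)

# strsplit needs a character vector, so convert the factor first
parts <- strsplit(as.character(Test$Out), "\\s+")

# Bind the pieces row-wise and name the resulting columns
res <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(res) <- c("A", "B", "C", "D")

# The numeric columns arrive as text; convert them explicitly
res[c("B", "C", "D")] <- lapply(res[c("B", "C", "D")], as.numeric)
```

This only works as written when every row splits into the same number of pieces.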

Related

How to select one value of a data.frame within a list column with R?

I have a data.frame that contains a list-type column. Each list element contains a 1x3 data.frame. I only want one value from each element. This will flatten my data.frame so I can write it out as a csv.
How do I select one item from the nested data.frame (see the 2nd column)?
Here's the nested col. I'd provide the data but cannot flatten to write_csv.
result of dput:
structure(list(id = c("1386707", "1386700", "1386462", "1386340",
"1386246", "1386300"), fields.created = c("2020-05-07T02:09:27.000-0700",
"2020-05-07T01:20:11.000-0700", "2020-05-06T21:38:14.000-0700",
"2020-05-06T07:19:44.000-0700", "2020-05-06T06:11:43.000-0700",
"2020-05-06T02:26:44.000-0700"), fields.customfield_10303 = c(NA,
NA, 3, 3, NA, NA), fields.customfield_28100 = list(NULL, structure(list(
self = ".../rest/api/2/customFieldOption/76412",
value = "New Feature", id = "76412"), .Names = c("self",
"value", "id"), class = "data.frame", row.names = 1L), structure(list(
self = ".../rest/api/2/customFieldOption/76414",
value = "Technical Debt", id = "76414"), .Names = c("self",
"value", "id"), class = "data.frame", row.names = 1L), NULL,
structure(list(self = ".../rest/api/2/customFieldOption/76411",
value = "Maintenance", id = "76411"), .Names = c("self",
"value", "id"), class = "data.frame", row.names = 1L), structure(list(
self = ".../rest/api/2/customFieldOption/76412",
value = "New Feature", id = "76412"), .Names = c("self",
"value", "id"), class = "data.frame", row.names = 1L))), row.names = c(NA,
6L), class = "data.frame", .Names = c("id", "fields.created",
"fields.customfield_10303", "fields.customfield_28100"))
I found a way to do this.
First, instead of changing the data, I added a column with mutate. Then, directly selected the same column from all nested lists. Then, I converted the list column into a vector. Finally, I cleaned it up by removing the other columns.
It seems to work. I don't know yet how it will handle multiple rows within the nested df.
dat <- sample_dat %>%
mutate(cats = sapply(nested_col, `[[`, 2)) %>%
mutate(categories = sapply(cats, toString)) %>%
select(-nested_col, -cats)
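The same idea can be sketched in base R. This is only an illustration under assumptions: sample_dat and nested_col mirror the placeholder names above, the data is a cut-down hypothetical version of the dput, and NULL entries are mapped to NA rather than to an empty string. Because toString collapses a vector, a nested data.frame with several rows would yield one comma-separated string.

```r
# Hypothetical cut-down version of the question's data
sample_dat <- data.frame(id = c("1386707", "1386700", "1386246"),
                         stringsAsFactors = FALSE)
sample_dat$nested_col <- list(
  NULL,
  data.frame(self = "a", value = "New Feature", id = "76412",
             stringsAsFactors = FALSE),
  data.frame(self = "b", value = "Maintenance", id = "76411",
             stringsAsFactors = FALSE)
)

# Pull the "value" column out of each nested data.frame; NULL becomes NA
sample_dat$categories <- vapply(
  sample_dat$nested_col,
  function(x) if (is.null(x)) NA_character_ else toString(x[["value"]]),
  character(1)
)

# Drop the list column so the result is flat and can be written to csv
sample_dat$nested_col <- NULL
```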
Related
How to directly select the same column from all nested lists within a list?
r-convert list column into character vector where lists are characters
library(dplyr)
library(tidyr)
df <- tibble(Group=c("A","A","B","C","D","D"),
Batman=1:6,
Superman=c("red","blue","orange","red","blue","red"))
nested <- df %>%
nest(data=-Group)
unnested <- nested %>%
unnest(data)
Nesting and unnesting data with tidyr
library(purrr)
nested %>%
mutate(data=map(data,~select(.x,2))) %>%
unnest(data)
This uses select inside purrr's map, but lapply as you've done is fine; it's just for aesthetics ;)

R: filter %in% range not filtering values with decimals

I have a dataset e:
structure(list(num = c(23L, 23L, 23L), code = structure(1:3, .Label = c("A",
"B", "C"), class = "factor"), ranking = c(140.5, 140.5,
2662), bottom = c(-0.0207357225475016, -0.0146710913954366,
-0.019899240924872), previous = c(0.00312288516116536,
0.00207118230618904, -0.00191931365721628), mean_of_all = c(-0.000222419352160109,
-0.00107348087538642, -0.00202343390338765)), row.names = c(NA,
-3L), class = "data.frame")
code:
winner_filtered <- e %>%
group_by(code) %>%
filter(ranking %in% (winner_lower:winner_upper))
is not filtering the two values with 140.5
Any guesses? Thanks.
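As the column 'ranking' is numeric, its values may not be exactly equal to any of the values generated by the sequence: winner_lower:winner_upper advances in steps of 1, so a value such as 140.5 will generally not be one of its elements, and floating-point precision can also break exact matches. So, the filter can either use the comparison operators (>=, <=) or the convenient wrapper between, which includes both bounds.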
library(dplyr)
e %>%
group_by(code) %>%
filter(between(ranking, winner_lower, winner_upper))
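A small base R sketch of the difference. The bounds 100 and 200 are hypothetical, since the question does not give winner_lower or winner_upper:

```r
ranking <- c(140.5, 140.5, 2662)

# Hypothetical bounds; the question does not state the real ones
winner_lower <- 100
winner_upper <- 200

# The sequence holds only 100, 101, ..., 200, so 140.5 is not a member
in_seq <- ranking %in% (winner_lower:winner_upper)

# A range comparison tests the interval itself, bounds included
in_range <- ranking >= winner_lower & ranking <= winner_upper
```

Here in_seq is FALSE for every value, while in_range is TRUE for the two 140.5 entries.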

Write a Data.Table as a csv file

I have a data.table that has list values within the columns. Below is the dput:
dput(df2)
structure(list(a = list(structure(5594.05118603497, .Names = "a"),
structure(8877.42723091876, .Names = "a"), structure(2948.95666065332,
.Names = "a"),
structure(5312.77623937465, .Names = "a"), structure(676.637044992807,
.Names = "a"),
structure(323.104243007498, .Names = "a")), b =
list(structure(3.90258318853593e-06, .Names = "b"),
structure(3.89772483584672e-06, .Names = "b"), structure(3.91175458242421e-06, .Names = "b"),
structure(3.90169532031545e-06, .Names = "b"), structure(6.54536728417568e-06, .Names = "b"),
structure(6.59087917747312e-06, .Names = "b")), id = 1:6), .Names = c("a",
"b", "id"), class = c("data.table", "data.frame"), row.names = c(NA,
-6L), .internal.selfref = <pointer: 0x0000000000220788>)
Here is what the output looks like:
head(df2)
          a            b id
1: 5594.051 3.902583e-06  1
2: 8877.427 3.897725e-06  2
3: 2948.957 3.911755e-06  3
4: 5312.776 3.901695e-06  4
5:  676.637 6.545367e-06  5
6: 323.1042 6.590879e-06  6
This looks ok when you see it at first but if you look further into it, this is what it looks like when I want to select a column:
How do I change df2 to just be a normal dataframe where it doesn't have these extra values within a and b like this? I am trying to write this file to a csv but it will not allow me to because it is saying there are vectors as the values.
Thanks!
Edit:
This was the code that generated the lists:
test<-sapply( split( df , df$ID),
function(d){ dat <- list2env(d)
nlsfit <- nls( form = y ~ a * (1-exp(-b * x)), data=dat,
start= list( a=max(dat$y), b=b.start),
control= control1)
list(a = coef(nlsfit)[1], b = coef(nlsfit)[2])} )
df1<-as.data.frame(t(test))
Load the right package, look at its help page, search for "csv", follow the Usage section:
library(data.table)
help(package = "data.table")
fwrite(df2, file = "~/test.csv") # path for macOS; adjust for other operating systems
Another approach might be:
as.data.frame( lapply(df2, unlist) )
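A base R sketch of that second approach, using a shortened hypothetical df2 with rounded values and a temporary output file. It assumes every list element holds exactly one value. Dropping the element names with unname is an addition here, to keep them from being used as row names:

```r
# Hypothetical, shortened stand-in for df2 with list columns a and b
df2 <- data.frame(id = 1:3)
df2$a <- list(c(a = 5594.05), c(a = 8877.43), c(a = 2948.96))
df2$b <- list(c(b = 3.90e-06), c(b = 3.89e-06), c(b = 3.91e-06))

# Flatten each list column into a plain atomic vector
flat <- as.data.frame(lapply(df2, function(col) unname(unlist(col))))

# A flat data.frame can be written with the usual csv writer
out_file <- tempfile(fileext = ".csv")
write.csv(flat, out_file, row.names = FALSE)
```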

Replace a string from lookuptable in R

I have a txt file with a list:
name
Test_123
run_456
Test_789
I have another lookuptable that contains the "ID" and gives me a "Plate"
ID plate
123 xxx
456 zzz
789 bbb
Would love to get here
Test_xxx
run_zzz
Test_bbb
My current code does not work entirely.
Either getting <NA> as I guess it looks for values and not for a string or errors.
Thanks so much for your help!
B
A tidyverse way to do this would be:
library(tidyverse)
df1 %>%
separate(name, c("name", "ID"), convert=TRUE) %>%
left_join(df2, by="ID") %>%
mutate(new_name = paste(name, plate, sep="_"))
Using:
df1 <- structure(list(name = c("Test_123", "run_456", "Test_789")),
.Names = "name", class = "data.frame", row.names = c(NA, -3L))
df2 <- structure(list(ID = c(123L, 456L, 789L), plate = c("xxx", "zzz",
"bbb")), .Names = c("ID", "plate"), class = "data.frame", row.names = c(NA,
-3L))
Note that:
separate(..., convert=TRUE) uses some heuristics to convert character columns to integer. You can otherwise do this manually: mutate(ID=as.integer(ID))
You could use unite() (which does the opposite of separate()) instead of mutate(new_name = paste(name, plate, sep="_")), which would also remove the previous columns
An option would be gsubfn
library(gsubfn)
gsubfn("(\\d+)", setNames(as.list(df2$plate), df2$ID), df1$name)
#[1] "Test_xxx" "run_zzz" "Test_bbb"
data
df1 <- structure(list(name = c("Test_123", "run_456", "Test_789")),
.Names = "name", class = "data.frame", row.names = c(NA, -3L))
df2 <- structure(list(ID = c(123L, 456L, 789L), plate = c("xxx", "zzz",
"bbb")), .Names = c("ID", "plate"), class = "data.frame", row.names = c(NA,
-3L))
For a base R option, you could add a new column to your first data frame that holds the join key:
df1$ID <- sub(".*_(?=[0-9]+)", "", df1$name, perl=TRUE)
df1$start <- sub("_[0-9]+", "", df1$name)
Then, use merge:
result <- merge(df1, df2, by="ID")
And finally create your desired output column:
result$out <- paste0(result$start, "_", result$plate)
result$out
[1] "Test_xxx" "run_zzz" "Test_bbb"
Data:
df1 <- data.frame(name=c("Test_123", "run_456", "Test_789"), stringsAsFactors=FALSE)
df2 <- data.frame(ID=c("123", "456", "789"),
plate=c("xxx", "zzz", "bbb"), stringsAsFactors=FALSE)
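Another base R variant avoids the join by indexing a named vector, which also keeps the original row order (merge sorts by the key). A minimal sketch, assuming every name ends in an underscore followed by an ID that is present in the lookup table. The names lookup, ids, prefix and new_name are illustrative:

```r
df1 <- data.frame(name = c("Test_123", "run_456", "Test_789"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = c("123", "456", "789"),
                  plate = c("xxx", "zzz", "bbb"),
                  stringsAsFactors = FALSE)

# Named vector: names are the IDs, values are the plates
lookup <- setNames(df2$plate, df2$ID)

# Split each name into the part before and after the last underscore
ids <- sub(".*_", "", df1$name)
prefix <- sub("_[^_]*$", "", df1$name)

# Index the lookup by ID (as character) and rebuild the name
new_name <- paste0(prefix, "_", unname(lookup[ids]))
```

An ID missing from df2 would come back as NA from the lookup, so it is worth checking anyNA(lookup[ids]) before pasting.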

Arithmetic on summarized dataframe from dplyr in R

I have a large dataset that I summarise with dplyr's summarize() to generate some means.
Occasionally, I would like to perform arithmetic on that output.
For example, I would like to get the mean of means from the output below, say "m.biomass".
I've tried this mean(data.sum[,7]) and this mean(as.list(data.sum[,7])). Is there a quick and easy way to achieve this?
data.sum <-structure(list(scenario = c("future", "future", "future", "future"
), state = c("fl", "ga", "ok", "va"), m.soc = c(4090.31654013689,
3654.45350562628, 2564.33199749487, 4193.83388887064), m.npp = c(1032.244475,
821.319385, 753.401315, 636.885535), sd.soc = c(56.0344229400332,
97.8553643582118, 68.2248389927858, 79.0739969429246), sd.npp = c(34.9421782033153,
27.6443555578531, 26.0728757486901, 24.0375040705595), m.biomass = c(5322.76631158111,
3936.79457763176, 3591.0902359206, 2888.25308402464), sd.m.biomass = c(3026.59250918009,
2799.40317348016, 2515.10516340438, 2273.45510178843), max.biomass = c(9592.9303,
8105.109, 7272.4896, 6439.2259), time = c("1980-1999", "1980-1999",
"1980-1999", "1980-1999")), .Names = c("scenario", "state", "m.soc",
"m.npp", "sd.soc", "sd.npp", "m.biomass", "sd.m.biomass", "max.biomass",
"time"), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -4), vars = list(quote(scenario)), labels = structure(list(
scenario = "future"), class = "data.frame", row.names = c(NA,
-1), vars = list(quote(scenario)), drop = TRUE, .Names = "scenario"), indices = list(0:3))
We can use [[ to extract the column as a vector, because mean works on a vector or a matrix, not on a data.frame. To do this on a single column:
mean(data.sum[[7]])
#[1] 3934.726
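If data.sum had only the data.frame class, data.sum[,7] would extract the column as a vector, but the tbl_df class prevents the result from collapsing to a vector.
For multiple columns, dplyr also has specialised functions (summarise_each() and funs() are deprecated in current dplyr releases in favour of summarise() with across()):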
data.sum %>%
summarise_each(funs(mean), 3:7)
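For comparison, a base R sketch on a cut-down copy of data.sum (only three of the original columns are kept, as a plain data.frame). It shows [[ for one column and colMeans for several:

```r
# Cut-down copy of data.sum as a plain data.frame
data.sum <- data.frame(
  state = c("fl", "ga", "ok", "va"),
  m.npp = c(1032.244475, 821.319385, 753.401315, 636.885535),
  m.biomass = c(5322.76631158111, 3936.79457763176,
                3591.0902359206, 2888.25308402464)
)

# [[ always returns the column as a vector, for data.frames and tibbles alike
mean_biomass <- mean(data.sum[["m.biomass"]])

# colMeans handles several numeric columns at once
col_means <- colMeans(data.sum[c("m.npp", "m.biomass")])
```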
