Making the first row a header in a data frame in R

I've seen this asked here: Create header of a dataframe from the first row in the data frame
and here: assign headers based on existing row in dataframe in R
and the solutions offered don't work for me.
When I transpose my data frame (p1), the transposed data frame (p1t) gets a new, unwanted header (V1, V2, ...), and the first row of p1t holds what I would like to use as the header. I tried:
colnames(p1t) = p1t[1, ]
and it doesn't work!
here is how the original df appears:
File Fp1.PD_ShortSOA_FAM Fp1.PD_LongSOA_FAM Fp1.PD_ShortSOA_SEMplus_REAL Fp1.PD_ShortSOA_SEMplus_FICT
sub0001 0,446222 2,524,804 0,272959 1,281,349
sub0002 1,032,688 2,671,048 1,033,278 1,217,817
And here is how the transpose appears:
row.names V1 V2
File sub0001 sub0002
Fp1.PD_ShortSOA_FAM 0,446222 1,032,688
Fp1.PD_LongSOA_FAM 2,524,804 2,671,048
Fp1.PD_ShortSOA_SEMplus_REAL 0,272959 1,033,278
Fp1.PD_ShortSOA_SEMplus_FICT 1,281,349 1,217,817
Fp1.PD_ShortSOA_SEMminus_REAL 0,142739 1,405,100
Fp1.PD_ShortSOA_SEMminus_FICT 1,515,577 -1,990,458
How can I make "File", "sub0001", "sub0002", etc. the header?
Thanks!

Works for me (with a little trick).
x <- read.table(text = "File Fp1.PD_ShortSOA_FAM Fp1.PD_LongSOA_FAM Fp1.PD_ShortSOA_SEMplus_REAL Fp1.PD_ShortSOA_SEMplus_FICT
sub0001 0,446222 2,524,804 0,272959 1,281,349
sub0002 1,032,688 2,671,048 1,033,278 1,217,817",
header = TRUE)
x <- t(x)
colnames(x) <- x[1, ]
x <- x[-1, ]
x
sub0001 sub0002
Fp1.PD_ShortSOA_FAM "0,446222" "1,032,688"
Fp1.PD_LongSOA_FAM "2,524,804" "2,671,048"
Fp1.PD_ShortSOA_SEMplus_REAL "0,272959" "1,033,278"
Fp1.PD_ShortSOA_SEMplus_FICT "1,281,349" "1,217,817"
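A likely reason the original attempt failed: when p1t is a data frame, p1t[1, ] is a one-row data frame rather than a character vector, whereas t(x) returns a matrix whose first row is a plain character vector. Below is a minimal sketch, assuming a cut-down copy of the OP's data (p1 is rebuilt by hand here for illustration). It avoids the header row altogether by transposing only the value columns and taking the names from the File column:

```r
# Hand-built, shortened copy of the OP's data (an assumption for illustration)
p1 <- data.frame(
  File = c("sub0001", "sub0002"),
  Fp1.PD_ShortSOA_FAM = c("0,446222", "1,032,688"),
  Fp1.PD_LongSOA_FAM = c("2,524,804", "2,671,048"),
  stringsAsFactors = FALSE
)

# Transpose everything except the ID column, so no header row needs removing
p1t <- as.data.frame(t(p1[-1]), stringsAsFactors = FALSE)

# Use the ID column as the new column names
colnames(p1t) <- p1$File
p1t
```

The values stay character strings because of the comma formatting; converting them to numbers is a separate step.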

We can make use of transpose from data.table together with row_to_names from janitor:
library(janitor)
data.table::transpose(x, keep.names = 'File') %>%
row_to_names(1)
# File sub0001 sub0002
#2 Fp1.PD_ShortSOA_FAM 0,446222 1,032,688
#3 Fp1.PD_LongSOA_FAM 2,524,804 2,671,048
#4 Fp1.PD_ShortSOA_SEMplus_REAL 0,272959 1,033,278
#5 Fp1.PD_ShortSOA_SEMplus_FICT 1,281,349 1,217,817
data
x <- structure(list(File = structure(1:2, .Label = c("sub0001", "sub0002"
), class = "factor"), Fp1.PD_ShortSOA_FAM = structure(1:2, .Label = c("0,446222",
"1,032,688"), class = "factor"), Fp1.PD_LongSOA_FAM = structure(1:2, .Label = c("2,524,804",
"2,671,048"), class = "factor"), Fp1.PD_ShortSOA_SEMplus_REAL = structure(1:2, .Label = c("0,272959",
"1,033,278"), class = "factor"), Fp1.PD_ShortSOA_SEMplus_FICT = structure(2:1, .Label = c("1,217,817",
"1,281,349"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))

Related

create data frame from nested entries

I have a data frame test like this:
dput(test)
structure(list(X = 1L, entityId = structure(1L, .Label = "HOST-123", class = "factor"),
displayName = structure(1L, .Label = "server1", class = "factor"),
discoveredName = structure(1L, .Label = "server1", class = "factor"),
firstSeenTimestamp = 1593860000000, lastSeenTimestamp = 1603210000000,
tags = structure(1L, .Label = "c(\"CONTEXTLESS\", \"CONTEXTLESS\", \"CONTEXTLESS\", \"CONTEXTLESS\", \"CONTEXTLESS\", \"CONTEXTLESS\", \"CONTEXTLESS\", \"CONTEXTLESS\"), c(\"app1\", \"client\", \"org\", \"app1\", \"DATA_CENTER\", \"PURPOSE\", \"REGION\", \"Test\"), c(NA, \"NONE\", \"Host:Environment:test123\", \"111\", \"222\", \"GENERAL\", \"444\", \"555\")", class = "factor")), .Names = c("X",
"entityId", "displayName", "discoveredName", "firstSeenTimestamp",
"lastSeenTimestamp", "tags"), class = "data.frame", row.names = c(NA,
-1L))
There is a column called tags which should become a data frame. I need to get rid of the first row in tags (which keeps saying CONTEXTLESS) and expand the second column in tags (make its values columns). Lastly, I need to insert the third column's values in tags under each expanded column.
For example, it needs to look like this:
structure(list(entityId = structure(1L, .Label = "HOST-123", class = "factor"),
displayName = structure(1L, .Label = "server1", class = "factor"),
discoveredName = structure(1L, .Label = "server1", class = "factor"),
firstSeenTimestamp = 1593860000000, lastSeenTimestamp = 1603210000000,
app1 = NA, client = structure(1L, .Label = "None", class = "factor"),
org = structure(1L, .Label = "Host:Environment:test123", class = "factor"),
app1.1 = 111L, data_center = 222L, purppose = structure(1L, .Label = "general", class = "factor"),
region = 444L, test = 555L), .Names = c("entityId", "displayName",
"discoveredName", "firstSeenTimestamp", "lastSeenTimestamp",
"app1", "client", "org", "app1.1", "data_center", "purppose",
"region", "test"), class = "data.frame", row.names = c(NA, -1L
))
I need to remove the first vector, which keeps saying "CONTEXTLESS", and add the second vector as columns: each value of the second vector should be a column name. The last vector should supply the values of the newly added columns.
If you are willing to drop the first "row" of garbage and then do a little cleanup of the parse side effects, then this might be a good place to start:
read.table(text=gsub("\\),", ")\n", test$tags[1]), sep=",", skip=1, #drops line
header=TRUE)
c.app1 client org app1 DATA_CENTER PURPOSE REGION Test.
1 c(NA NONE Host:Environment:test123 111 222 GENERAL 444 555)
The read.table function uses the scan function, which doesn't know that "c(" and ")" are meaningful. The other alternative might be to try eval(parse(text= .)) (which would know that they enclose vectors) on the second and third lines, but I couldn't see a clean way to do that. I initially tried to separate the lines using strsplit, but that caused me to lose the parens.
Here's a stab at some cleanup via the addition of some more gsub operations:
read.table(text=gsub("c\\(|\\)","", # gets rid of enclosing "c(" and ")"
gsub("\\),", "\n", # inserts line breaks
test$tags[1])),
sep=",", #lets commas be parsed
skip=1, #drops line
header=TRUE) # converts to colnames
app1 client org app1.1 DATA_CENTER PURPOSE REGION Test
1 NA NONE Host:Environment:test123 111 222 GENERAL 444 555
The reason for the added ".1" in the second instance of app1 is that read.table makes column names unique by default; you can override that with check.names=FALSE.
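The renaming is easy to reproduce in isolation. Here is a small sketch with made-up input (the txt string is hypothetical) that compares the default behaviour with check.names = FALSE:

```r
# Hypothetical CSV text with a duplicated column name
txt <- "app1,client,app1\n1,2,3"

# Default: duplicated names are made unique
default_names <- names(read.table(text = txt, sep = ",", header = TRUE))

# check.names = FALSE keeps the names exactly as they appear in the header
raw_names <- names(read.table(text = txt, sep = ",", header = TRUE,
                              check.names = FALSE))

default_names
raw_names
```

Keeping duplicated names is possible, but selecting a column by name then only ever reaches the first match, so the unique names are usually easier to work with.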
Here is a tidyverse approach
library(dplyr)
library(tidyr)
str2dataframe <- function(txt, keep = "all") {
# If you can confirm that all vectors are of the same length, then we can make them into columns of a data.frame
out <- eval(parse(text = paste0("data.frame(", as.character(txt),")")))
# rename columns as X1, X2, ...
nms <- make.names(seq_along(out), unique = TRUE)
if (keep == "all")
keep <- nms
`names<-`(out, nms)[, keep]
}
df %>%
mutate(
tags = lapply(tags, str2dataframe, -1L),
tags = lapply(tags, function(d) within(d, X2 <- make.unique(X2)))
) %>%
unnest(tags) %>%
pivot_wider(names_from = "X2", values_from = "X3")
df looks like this
> df
X entityId displayName discoveredName firstSeenTimestamp lastSeenTimestamp
1 1 HOST-123 server1 server1 1.59386e+12 1.60321e+12
tags
1 c("CONTEXTLESS", "CONTEXTLESS", "CONTEXTLESS", "CONTEXTLESS", "CONTEXTLESS", "CONTEXTLESS", "CONTEXTLESS", "CONTEXTLESS"), c("app1", "client", "org", "app1", "DATA_CENTER", "PURPOSE", "REGION", "Test"), c(NA, "NONE", "Host:Environment:test123", "111", "222", "GENERAL", "444", "555")
Output looks like this
# A tibble: 1 x 14
X entityId displayName discoveredName firstSeenTimestamp lastSeenTimestamp app1 client org app1.1 DATA_CENTER PURPOSE REGION Test
<int> <fct> <fct> <fct> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 HOST-123 server1 server1 1593860000000 1603210000000 NA NONE Host:Environment:test123 111 222 GENERAL 444 555
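The eval(parse()) idea can also be sketched in base R alone. The following is only an illustration under stated assumptions: tag_string is a shortened, hand-written stand-in for one entry of the tags column, and eval(parse()) runs whatever code the string contains, so it is only reasonable for trusted input.

```r
# Shortened stand-in for one tags entry (an assumption for illustration)
tag_string <- paste0(
  'c("CONTEXTLESS", "CONTEXTLESS", "CONTEXTLESS"), ',
  'c("app1", "client", "app1"), ',
  'c(NA, "NONE", "111")'
)

# Wrap the text in list(...) so the three c(...) calls become three vectors
vecs <- eval(parse(text = paste0("list(", tag_string, ")")))

# Second vector: column names (made unique); third vector: the values
keys <- make.unique(vecs[[2]])
wide <- as.data.frame(as.list(setNames(vecs[[3]], keys)),
                      stringsAsFactors = FALSE)
wide
```

The resulting one-row data frame could then be bound to the other columns with cbind.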

Converting empty values to NULL in R - Handling date column

I have a simple dataframe as: dput(emp)
structure(list(name = structure(1L, .Label = "Alex", class = "factor"),
job = structure(1L, .Label = "", class = "factor"), Mgr = structure(1L, .Label = "", class = "factor"),
update = structure(18498, class = "Date")), class = "data.frame", row.names = c(NA,
-1L))
I want to convert all empty values to NULL.
The simplest way to achieve this is:
emp[emp==""] <- NA
which of course would have worked, but I get this error for the date column:
Error in charToDate(x) :
character string is not in a standard unambiguous format
How can I convert all other empty rows to NULL without having to deal with the date column? Please note that the actual data frame has 30000+ rows.
Try formatting the date variable as character, make the change and transform it to a date again:
#Format date
emp$update <- as.character(emp$update)
#Replace
emp[emp=='']<-NA
#Reformat date
emp$update <- as.Date(emp$update)
Output:
name job Mgr update
1 Alex <NA> <NA> 2020-08-24
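The error comes from emp == "" comparing the Date column with an empty string, which R first tries to convert to a date. A different route from the one above, sketched here with the example data rebuilt by hand, is to touch only the factor and character columns so that the Date column is never involved (is_text is a helper name introduced for this sketch):

```r
# Rebuild the example data; update is a Date column
emp <- data.frame(name = "Alex", job = "", Mgr = "",
                  update = as.Date("2020-08-24"),
                  stringsAsFactors = TRUE)

# Flag only factor/character columns, so the Date column is never compared with ""
is_text <- vapply(emp, function(col) is.factor(col) || is.character(col),
                  logical(1))

# Replace empty strings with NA in those columns only
emp[is_text] <- lapply(emp[is_text], function(col) {
  col <- as.character(col)
  col[col == ""] <- NA
  col
})
emp
```

Note that the affected columns come back as character vectors; wrap them in factor() again if factors are needed.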
You can try type.convert like below
type.convert(emp,as.is = TRUE)
such that
name job Mgr update
1 Alex NA NA 2020-08-24
You may try this using dplyr:
library(dplyr)
df %>%
mutate_at(vars(update),as.character) %>%
na_if(.,"")
As mentioned by @Duck, you have to format the date variable as character.
Afterwards you can transform it back to a date if you need it:
library(dplyr)
df %>%
mutate_at(vars(update),as.character) %>%
na_if(.,"") %>%
mutate_at(vars(update),as.Date)
See if this works:
> library(dplyr)
> library(purrr)
> emp <- structure(list(name = structure(1L, .Label = "Alex", class = "factor"),
+ job = structure(1L, .Label = "", class = "factor"), Mgr = structure(1L, .Label = "", class = "factor"),
+ update = structure(18498, class = "Date")), class = "data.frame", row.names = c(NA,
+ -1L))
> emp
name job Mgr update
1 Alex 2020-08-24
> emp %>% mutate(update = as.character(update)) %>% map_df(~gsub('^$',NA, .x)) %>% mutate(update = as.Date(update)) %>% mutate(across(1:3, as.factor))
# A tibble: 1 x 4
name job Mgr update
<fct> <fct> <fct> <date>
1 Alex NA NA 2020-08-24
>

Reshaping untidy and unbalanced dataset from wide to long [duplicate]

This question already has an answer here:
How do I convert a wide dataframe to a long dataframe for a multilevel structure with 'quadruple nesting'?
(1 answer)
Closed 6 years ago.
I have a dataset (data) that looks like this:
ID,ABC.BC,ABC.PL,DEF.BC,DEF.M,GHI.PL
SB0005,C01,D20,C01a,C01b,D20
BC0013,C05,D5,C05a,NA,D5
I want to reshape it from wide-to-long format to get something like this:
ID,FC,Type,Var
SB0005,ABC,BC,C01
SB0005,ABC,PL,D20
SB0005,DEF,BC,C01a
SB0005,DEF,M,C01b
SB0005,GHI,PL,D20
BC0013,ABC,BC,C05
BC0013,ABC,PL,D5
BC0013,DEF,BC,C05a
# BC0013,DEF,M,NA (This row need not be in the dataset as I will remove it later)
BC0013,GHI,PL,D5
The usual reshape package does not work as the dataset is unbalanced. I also tried Reshape from splitstackshape but it does not give me what I want.
library(splitstackshape)
vary <- grep("\\.BC$|\\.PL$|\\.M$", names(data))
stubs <- unique(sub("\\..*$", "", names(data[vary])))
Reshape(data, id.vars=c("ID"), var.stubs=stubs, sep=".")
ID,time,ABC,DEF,GHI
SB0005,1,C01,C01a,D20
BC0013,1,C05,C05a,D5
SB0005,2,D20,C01b,NA
BC0013,2,D5,NA,NA
SB0005,3,NA,NA,NA
BC0013,3,NA,NA,NA
Appreciate any suggestions, thanks!
Providing the output of dput(data) as requested
structure(list(ID = structure(c(2L, 1L), .Label = c("BC0013",
"SB0005"), class = "factor"), ABC.BC = structure(1:2, .Label = c("C01",
"C05"), class = "factor"), ABC.PL = structure(1:2, .Label = c("D20",
"D5"), class = "factor"), DEF.BC = structure(1:2, .Label = c("C01a",
"C05a"), class = "factor"), DEF.M = structure(1:2, .Label = c("C01b",
"NA"), class = "factor"), GHI.PL = structure(1:2, .Label = c("D20",
"D5"), class = "factor")), .Names = c("ID", "ABC.BC", "ABC.PL",
"DEF.BC", "DEF.M", "GHI.PL"), row.names = c(NA, -2L), class = "data.frame")
You need to reshape your data into long format first and then you can split the variable column into two columns. With splitstackshape you could do:
library(splitstackshape) # this will also load 'data.table' from which the 'melt' function is used
cSplit(melt(mydf, id.vars = 'ID'),
'variable',
sep = '.',
direction = 'wide')[!is.na(value)]
which results in:
ID value variable_1 variable_2
1: SB0005 C01 ABC BC
2: BC0013 C05 ABC BC
3: SB0005 D20 ABC PL
4: BC0013 D5 ABC PL
5: SB0005 C01a DEF BC
6: BC0013 C05a DEF BC
7: SB0005 C01b DEF M
8: SB0005 D20 GHI PL
9: BC0013 D5 GHI PL
An alternative with tidyr:
library(dplyr) # provides %>% and filter()
library(tidyr)
mydf %>%
gather(var, val, -ID) %>%
separate(var, c('FC','Type')) %>%
filter(!is.na(val))
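For comparison, the same melt-then-split idea can be sketched in base R without extra packages. This is an illustration only: the data are read from the CSV text in the question (so NA is parsed as a real missing value), and the names wide and long are introduced for the sketch.

```r
# Read the example from the question's CSV text; "NA" becomes a real NA
wide <- read.csv(text = "ID,ABC.BC,ABC.PL,DEF.BC,DEF.M,GHI.PL
SB0005,C01,D20,C01a,C01b,D20
BC0013,C05,D5,C05a,NA,D5", stringsAsFactors = FALSE)

value_cols <- setdiff(names(wide), "ID")

# One small data frame per column, with the column name split on the dot
long <- do.call(rbind, lapply(value_cols, function(col) {
  parts <- strsplit(col, ".", fixed = TRUE)[[1]]
  data.frame(ID = wide$ID, FC = parts[1], Type = parts[2],
             Var = wide[[col]], stringsAsFactors = FALSE)
}))

# Drop missing values and restore the original ID order
long <- long[!is.na(long$Var), ]
long <- long[order(match(long$ID, wide$ID)), ]
rownames(long) <- NULL
long
```

Because every column is handled on its own, it does not matter that the FC groups have different sets of Type suffixes.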

text cleaning in R

I have a single column in R that looks like this:
Path Column
ag.1.4->ao.5.5->iv.9.12->ag.4.35
ao.11.234->iv.345.455.1.2->ag.9.531
I want to transform this into:
Path Column
ag->ao->iv->ag
ao->iv->ag
How can I do this?
Thank you
Here is my full dput from my data:
structure(list(Rank = c(10394749L, 36749879L), Count = c(1L,
1L), Percent = c(0.001011122, 0.001011122), Path = c("ao.legacy payment.not_completed->ao.legacy payment.not_completed->ao.legacy payment.completed",
"ao.legacy payment.not_completed->agent.payment.completed")), .Names = c("Rank",
"Count", "Percent", "Path"), class = "data.frame", row.names = c(NA,
-2L))
You could use gsub to match the . and numbers following the . (\\.[0-9]+) and replace it with ''.
df1$Path.Column <- gsub('\\.[0-9]+', '', df1$Path.Column)
df1
# Path.Column
#1 ag -> ao -> iv -> ag
#2 ao -> iv -> ag
Update
For the new dataset df2
gsub('\\.[^->]+(?=(->|\\b))', '', df2$Path, perl=TRUE)
#[1] "ao->ao->ao" "ao->agent"
and for the string shown in the OP's post
str2 <- c('ag.1.4->ao.5.5->iv.9.12->ag.4.35',
'ao.11.234->iv.345.455.1.2->ag.9.531')
gsub('\\.[^->]+(?=(->|\\b))', '', str2, perl=TRUE)
#[1] "ag->ao->iv->ag" "ao->iv->ag"
data
df1 <- structure(list(Path.Column = c("ag.1 -> ao.5 -> iv.9 -> ag.4",
"ao.11 -> iv.345 -> ag.9")), .Names = "Path.Column",
class = "data.frame", row.names = c(NA, -2L))
df2 <- structure(list(Rank = c(10394749L, 36749879L), Count = c(1L,
1L), Percent = c(0.001011122, 0.001011122),
Path = c("ao.legacy payment.not_completed->ao.legacy payment.not_completed->ao.legacy payment.completed",
"ao.legacy payment.not_completed->agent.payment.completed")),
.Names = c("Rank", "Count", "Percent", "Path"), class = "data.frame",
row.names = c(NA, -2L))
It may be easier to split the strings on '->' and process the substrings separately:
# split the strings into parts
subStrings <- strsplit(df$Path,'->')
# remove everything after the **first** dot
subStrings<- lapply(subStrings,
function(x)gsub('\\..*','',x))
# paste them back together.
sapply(subStrings,paste0,collapse="->")
#> "ao->ao->ao" "ao->agent"
or
# split the strings into parts
subStrings <- strsplit(df$Path,'->')
# remove the parts of the identifiers after the dot
subStrings<- lapply(subStrings,
function(x)gsub('\\.[^ \t]*','',x))
# paste them back together.
sapply(subStrings,paste0,collapse="->")
#> "ao payment->ao payment->ao payment" "ao payment->agent"
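The split, strip and paste steps can be wrapped into one small helper. This is a sketch, with strip_suffixes as a made-up name and the sample strings taken from the question; it keeps only the text before the first dot in each step:

```r
# Hypothetical helper: keep only the part before the first dot in each step
strip_suffixes <- function(paths, sep = "->") {
  parts <- strsplit(paths, sep, fixed = TRUE)
  vapply(parts,
         function(p) paste(sub("\\..*$", "", p), collapse = sep),
         character(1))
}

cleaned <- strip_suffixes(c(
  "ag.1.4->ao.5.5->iv.9.12->ag.4.35",
  "ao.legacy payment.not_completed->agent.payment.completed"
))
cleaned
```

Using vapply with character(1) makes the function fail loudly if any element does not collapse to a single string.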

apply over dataframe

I've got two structures:
max_map <-
structure(list(name = structure(1:11, .Label = c("2-Acetylaminofluorene",
"amsacrine", "aniline", "aspartame", "cyclophosphamide", "doxorubicin",
"indomethacin", "phenacetin", "quercetin", "raloxifene", "urethane"
), class = "factor"), value = c(0.811811403850414, 0.8670680916324,
0.794704077953131, 0.652724115286456, 0.946812003911574, 0.94467294086402,
0.99210186168903, 0.965998352825426, 0.953645104970837, 0.903845608662668,
0.858610554863266)), .Names = c("name", "value"), row.names = c(NA,
-11L), class = "data.frame")
maps <-
structure(list(name = c("2-Acetylaminofluorene", "amsacrine",
"aniline", "aspartame", "cyclophosphamide", "doxorubicin", "indomethacin",
"phenacetin", "quercetin", "raloxifene", "urethane"), avg_relations_fan = c(0.596381660936706,
0.627169363301574, 0.52144016932515, 0.335756276148214, 0.710245148396949,
0.786168090022777, 0.931928694886563, 0.797790600434933, 0.836458734127729,
0.764397331494529, 0.548648356310039), baseline = c(0.441175818174093,
0.661376446637227, 0.470246408568704, 0.325159351267395, 0.664171399502648,
0.75247341151084, 0.894791275258052, 0.79447733086043, 0.791316894314006,
0.593161248492605, 0.546928771024265), baseline_mesh = c(0.511440934523423,
0.635334407445469, 0.466187120416127, 0.292197730456067, 0.712015987803737,
0.774493950979802, 0.936857915628513, 0.776404901563741, 0.786072875131457,
0.586564923115283, 0.602183350788001), standard = c(0.441269542443449,
0.656249151603696, 0.451995996997505, 0.331622681220588, 0.680778834932872,
0.742015626142688, 0.883911615393179, 0.791293422595675, 0.760673562009157,
0.559234401021581, 0.555385232882166), sum_relations_fan = c(0.593111715736251,
0.518197244570419, 0.52676186810563, 0.331234383858585, 0.697489423349489,
0.77249112456473, 0.940506641487552, 0.79946569580319, 0.82893149142568,
0.749819491774919, 0.624830313758535), total = c(0.593111715736251,
0.518197244570419, 0.52676186810563, 0.331234383858585, 0.697489423349489,
0.77249112456473, 0.940506641487552, 0.79946569580319, 0.82893149142568,
0.749819491774919, 0.624830313758535)), .Names = c("name", "avg_relations_fan",
"baseline", "baseline_mesh", "standard", "sum_relations_fan",
"total"), row.names = c(NA, 11L), class = c("cast_df", "data.frame"
), idvars = "name", rdimnames = list(structure(list(name = c("2-Acetylaminofluorene",
"amsacrine", "aniline", "aspartame", "cyclophosphamide", "doxorubicin",
"indomethacin", "phenacetin", "quercetin", "raloxifene", "urethane"
)), .Names = "name", row.names = c("2-Acetylaminofluorene", "amsacrine",
"aniline", "aspartame", "cyclophosphamide", "doxorubicin", "indomethacin",
"phenacetin", "quercetin", "raloxifene", "urethane"), class = "data.frame"),
structure(list(series = c("avg_relations_fan", "baseline",
"baseline_mesh", "standard", "sum_relations_fan", "total"
)), .Names = "series", row.names = c("avg_relations_fan",
"baseline", "baseline_mesh", "standard", "sum_relations_fan",
"total"), class = "data.frame")))
And I'd like to apply the function x/y over the maps dataframe,
where x is the current value and y is the corresponding value along
the name.
I already tried
mapply(function(x,y) {x/y}, t(maps[,!names(maps) %in% c('name')]), arrange(max_map, name)$value)
but that gives me one big list without any names associated. I'd like
the results to be similar to the maps dataframe, just with different values.
I'm just guessing here, but maybe you're looking to do something like this:
m <- merge(maps,max_map)
m[,2:7] <- m[,2:7] / m[,8]
Without the merge and without specifying how many columns you have:
maps[,-1] <- maps[,-1] / max_map$value
again, assuming that both are in identical orders.
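If the two data frames might not be in the same order, the divisor can be aligned by name first. The following sketch uses small made-up data frames (the values are chosen only so the results are easy to check) and match to look up each row's maximum:

```r
# Made-up miniature versions of max_map and maps, deliberately in different orders
max_map <- data.frame(name = c("aniline", "amsacrine"),
                      value = c(0.5, 0.25),
                      stringsAsFactors = FALSE)
maps <- data.frame(name = c("amsacrine", "aniline"),
                   baseline = c(0.125, 0.25),
                   standard = c(0.25, 0.5),
                   stringsAsFactors = FALSE)

# Position of each maps row in max_map
idx <- match(maps$name, max_map$name)

# Divide every value column by the matching maximum, row by row
scaled <- maps
scaled[-1] <- maps[-1] / max_map$value[idx]
scaled
```

The result keeps the shape and names of maps, which is what the question asks for.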
Joran's answer is definitely the better way, but this might help you understand mapply better. Each argument is a list, and the shorter of the two is recycled, in this case, the second one.
mapply(function(x,y) {x/y}, maps[,!names(maps) %in% c('name')], list(arrange(max_map, name)$value))
avg_relations_fan baseline baseline_mesh standard sum_relations_fan total
[1,] 0.7346308 0.5434462 0.6299997 0.5435616 0.7306028 0.7306028
[2,] 0.7233219 0.7627734 0.7327388 0.7568600 0.5976431 0.5976431
[3,] 0.6561438 0.5917252 0.5866172 0.5687601 0.6628403 0.6628403
[4,] 0.5143923 0.4981574 0.4476589 0.5080595 0.5074646 0.5074646
[5,] 0.7501438 0.7014818 0.7520141 0.7190222 0.7366715 0.7366715
[6,] 0.8322119 0.7965438 0.8198541 0.7854736 0.8177339 0.8177339
[7,] 0.9393478 0.9019147 0.9443163 0.8909484 0.9479940 0.9479940
[8,] 0.8258716 0.8224417 0.8037332 0.8191457 0.8276057 0.8276057
[9,] 0.8771174 0.8297813 0.8242824 0.7976485 0.8692243 0.8692243
[10,] 0.8457167 0.6562639 0.6489658 0.6187278 0.8295880 0.8295880
[11,] 0.6389956 0.6369928 0.7013463 0.6468418 0.7277226 0.7277226
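The effect of wrapping the divisor in list() is easier to see on a tiny example. In this sketch (cols and divisor are made-up names and values) the one-element list is recycled, so every column is divided by the whole vector:

```r
# Made-up data: two columns and one divisor per row
cols <- data.frame(a = c(2, 4), b = c(6, 8))
divisor <- c(2, 4)

# list(divisor) has length 1, so it is recycled for each column of cols;
# each call therefore computes column / divisor element by element
res <- mapply(function(x, y) x / y, cols, list(divisor))
res
```

Without list(), mapply would pair column a with 2 and column b with 4 instead, which is a different calculation.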
