R, import column names - r

I have a csv file (myNames) with column names. It is 1 x 66. I want to use those names to rename the 66 columns I have in another dataframe. I am trying to use colnames(df)[]<-(myNames) but I get the wrong result. I have tried to do this using as.vector, as.array, as.list, without success.
Is there a more direct way to read a csv file into an array?
or
How can I get an array from my dataframe that I can use in colnames()?
Here's myNames:
v1 v2 v3 v4 v5 v6 v7
Tom Dick Harry John Paul George Ringo
I want to make Tom, Dick, Harry my new column names in mydata.

Try this:
library(tidyverse)
# Reproducible example
df <- ("Tom Dick Harry John Paul George Ringo")
df <- read.table(text = df)
# Change column names
names(df) <- as.matrix(df[1, ])
# Remove row 1
df <- df[-1, ]
# Convert to a tibble
df %>%
as_tibble() %>%
mutate_all(parse_guess) %>%
glimpse()
The code above returns:
Observations: 0
Variables: 7
$ Tom <chr>
$ Dick <chr>
$ Harry <chr>
$ John <chr>
$ Paul <chr>
$ George <chr>
$ Ringo <chr>
You could turn this into a function:
rn_to_cn <- function(dataframe){
x <- ncol(dataframe)
# Count the distinct values in row 1, which will become the column names
y <- length(unique(as.character(as.matrix(dataframe[1, ]))))
if(x > y){
stop("Can't have duplicate column names.")
} else {
message("It worked!")
}
names(dataframe) <- as.matrix(dataframe[1, ])
dataframe <- dataframe[-1, ]
dataframe %>%
as_tibble() %>%
mutate_all(parse_guess)
}
And then do this:
rn_to_cn(df)
# A tibble: 0 x 7
# ... with 7 variables: Tom <chr>, Dick <chr>, Harry <chr>, John <chr>,
# Paul <chr>, George <chr>, Ringo <chr>
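To apply the names to a different data frame, as the question asks, a minimal base R sketch follows. It assumes myNames was read without a header, so the names sit in its first row. The stand-in `mydata` and its contents are made up for illustration.

```r
# Stand-in for the 1-row names table; in practice this might come from
# read.csv("myNames.csv", header = FALSE)  (file name is hypothetical)
myNames <- read.table(text = "Tom Dick Harry John Paul George Ringo",
                      stringsAsFactors = FALSE)

# Stand-in for the data frame whose columns need renaming
mydata <- as.data.frame(matrix(1:14, nrow = 2))

# A one-row data frame is a list of columns; unlist() flattens it into
# the plain character vector that colnames<- expects
new_names <- unlist(myNames[1, ], use.names = FALSE)
stopifnot(length(new_names) == ncol(mydata))
colnames(mydata) <- new_names
```

If the file holds nothing but the names, `scan("myNames.csv", what = "", sep = ",")` reads them straight into a character vector and skips the data frame step.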


How to write a function around data.table::tstrsplit() to separate strings into different columns?

This question is likely a duplicate, but I didn't find a solution so apologies in advance.
I want to build a speedier alternative to tidyr::separate(), which separates a string in one column into new columns. I decided to use data.table for this task.
I created the following function:
library(data.table)
library(tibble)
fast_separate <- function(.data, origin_col, into) {
my_origin_col <- deparse(substitute(origin_col))
my_dt <- as.data.table(.data)
my_dt <- my_dt[, into := tstrsplit(my_origin_col, "_", fixed = TRUE)][] # https://rdatatable.gitlab.io/data.table/reference/tstrsplit.html
as_tibble(my_dt)
}
My function fails
Let's say I have the following toy data:
df_toms <- tribble(~toms,
"tom_hanks",
"tom_bradey",
"tom_cruise",
"thomas_edison",
"thomas_fefferson"
)
df_toms
## # A tibble: 5 x 1
## toms
## <chr>
## 1 tom_hanks
## 2 tom_bradey
## 3 tom_cruise
## 4 thomas_edison
## 5 thomas_fefferson
When I call fast_separate() I get:
fast_separate(df_toms, origin_col = toms, into = c("first_name", "surname"))
## # A tibble: 5 x 2
## toms into
## <chr> <chr>
## 1 tom_hanks toms
## 2 tom_bradey toms
## 3 tom_cruise toms
## 4 thomas_edison toms
## 5 thomas_fefferson toms
This is not the desired output.
Which is strange because running the same code regularly (i.e., not inside a function) we get:
my_dt <- as.data.table(df_toms)
my_dt <- my_dt[, c("first_name", "surname") := tstrsplit(toms, "_", fixed = TRUE)][]
desired_output <- as_tibble(my_dt)
desired_output
## # A tibble: 5 x 3
## toms first_name surname
## <chr> <chr> <chr>
## 1 tom_hanks tom hanks
## 2 tom_bradey tom bradey
## 3 tom_cruise tom cruise
## 4 thomas_edison thomas edison
## 5 thomas_fefferson thomas fefferson
What's wrong with the way I wrote fast_separate()?
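Inside a data.table call, into := ... creates a column literally named "into". Wrapping the name in parentheses, (into) := ..., makes data.table use the values of the character vector as the new column names. Likewise, my_origin_col holds the string "toms", so tstrsplit() splits that string rather than the column. The version below takes the column name as a string and provides the needed output: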
library(data.table)
library(tibble)
fast_separate <- function(.data, origin_col, into) {
as_tibble(setDT(.data)[, (into) := tstrsplit(.data[[origin_col]], "_", fixed = TRUE)])
}
df_toms <- tribble(~toms,
"tom_hanks",
"tom_bradey",
"tom_cruise",
"thomas_edison",
"thomas_fefferson"
)
fast_separate(df_toms, origin_col = "toms", into = c("first_name", "surname"))
# # A tibble: 5 x 3
# toms first_name surname
# <chr> <chr> <chr>
# 1 tom_hanks tom hanks
# 2 tom_bradey tom bradey
# 3 tom_cruise tom cruise
# 4 thomas_edison thomas edison
# 5 thomas_fefferson thomas fefferson
and who is Thomas Fefferson? ;-)
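To see why the original function returned "toms" in every row, here is a small base R sketch, not data.table code. The helper `captured_name` and the two-row sample are hypothetical and only illustrate what deparse(substitute()) hands back.

```r
# Hypothetical helper: returns what the function body actually receives
captured_name <- function(origin_col) deparse(substitute(origin_col))

nm <- captured_name(toms)  # a character string, not the column

df_toms <- data.frame(toms = c("tom_hanks", "thomas_edison"),
                      stringsAsFactors = FALSE)

# Splitting the string itself finds no "_" and returns it unchanged
split_name <- strsplit(nm, "_", fixed = TRUE)[[1]]

# Looking the column up by that name gives the values to split
parts <- strsplit(df_toms[[nm]], "_", fixed = TRUE)
first_name <- vapply(parts, `[`, character(1), 1)
surname <- vapply(parts, `[`, character(1), 2)
```

The single piece in `split_name` is what got recycled down the column in the failing version.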

Count Bigrams independently of order of appearance

I'm trying to count bigrams independently of word order, so that 'John Doe' and 'Doe John' are counted together as 2.
I have already tried some text-mining examples, such as those at https://www.oreilly.com/library/view/text-mining-with/9781491981641/ch04.html, but couldn't find any counting method that ignores order of appearance.
library('widyr')
word_pairs <- austen_section_words %>%
pairwise_count(word, section, sort = TRUE)
word_pairs
It counts the two orderings separately, like this:
<chr> <chr> <dbl>
1 darcy elizabeth 144
2 elizabeth darcy 144
It should look like this:
item1 item2 n
<chr> <chr> <dbl>
1 darcy elizabeth 288
Thanks if anyone can help me.
This code works. There is probably something more efficient out there though.
# Create sample dataframe
df <- data.frame(name = c('darcy elizabeth', 'elizabeth darcy', 'John Doe', 'Doe John', 'Steve Smith'))
# Break out first and last names
library(stringr)
df$first <- word(df$name,1); df$second <- word(df$name,2);
# Reorder alphabetically
df$a <- ifelse(df$first<df$second, df$first, df$second); df$b <- ifelse(df$first>df$second, df$first, df$second)
library(dplyr)
summarize(group_by(df, a, b), n())
# Yields
# a b `n()`
# <chr> <chr> <int>
#1 darcy elizabeth 2
#2 Doe John 2
#3 Smith Steve 1
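The same reordering idea can be applied to a table that already has counts, summing n instead of counting rows. This is a base R sketch on toy data shaped like the word_pairs output above. The john/doe row and its count are made up for illustration.

```r
# Toy pair counts in the shape of the pairwise_count() output
word_pairs <- data.frame(
  item1 = c("darcy", "elizabeth", "john"),
  item2 = c("elizabeth", "darcy", "doe"),
  n     = c(144, 144, 1),
  stringsAsFactors = FALSE
)

# Put each pair in alphabetical order so both orderings share one key
word_pairs$a <- pmin(word_pairs$item1, word_pairs$item2)
word_pairs$b <- pmax(word_pairs$item1, word_pairs$item2)

# Sum the counts per unordered pair
totals <- aggregate(n ~ a + b, data = word_pairs, FUN = sum)
```

Here pmin() and pmax() compare the strings elementwise, which replaces the two ifelse() calls.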
Thanks, guys.
I considered your suggestions and tried a similar approach:
library(dplyr)
#Function to check whether 2 values are in alphabetical order.
#I got the function below from another post; I couldn't remember the author ;(.
alphabetical <- function(x,y){x < y}
#Created a sample dataframe
col1<-c("darcy","elizabeth","elizabeth","darcy","john","doe")
col2<-c("elizabeth","darcy","darcy","elizabeth","doe","john")
dfSample<-data.frame(col1,col2)
#Create an empty dataframe
dfCreated <- data.frame(col1=character(),col2=character())
#for each row, I reorder the columns and append to a new dataframe
#Tks to Gregor
for(i in 1:nrow(dfSample)) {
row <- c(as.character(dfSample[i,1]), as.character(dfSample[i,2]))
if(!alphabetical(row[1],row[2])){
row <- c(row[2],row[1])
}
dfCreated<-rbind(dfCreated,c(row[1],row[2]),stringsAsFactors=FALSE)
}
colnames(dfCreated)<-c("col1","col2")
dfCreated
#tks to Monk
summarize(group_by(dfCreated, col1, col2), n())
col1 col2 `n()`
<chr> <chr> <int>
1 darcy elizabeth 4
2 doe john 2

R JSON to tibble

I have the following data passed back from an API and I cannot change its structure. I would like to convert the following JSON into a tibble.
data <- '{ "ids":{
"00000012664":{
"state":"Indiana",
"version":"10",
"external_ids":[
{
"db":"POL",
"db_id":"18935"
},
{
"db":"CIT",
"db_id":"1100882"
}
],
"id":"00000012520",
"name":"Joe Smith",
"aliases":[
"John Smith",
"Bill Smith"
]
},
"00000103162":{
"state":"Kentucky",
"external_ids":[
{
"db":"POL",
"db_id":"69131"
},
{
"db":"CIT",
"db_id":"1098802"
}
],
"id":"00000003119",
"name":"Sue Smith",
"WIP":98203059
} ,
"0000019223":{
"state":"Ohio",
"external_ids":[
{
"db":"POL",
"db_id":"69134"
},
{
"db":"JT",
"db_id":"615234"
}
],
"id":"0000019223",
"name":"Larry Smith",
"WIP":76532172,
"aliases":[
"Test 1",
"Test 2",
"Test 3",
"Test 4"
],
"insured":1
} } }'
Please Note: This is a small subset of the data and could have thousands of "ids".
I've tried jsonlite and tidyjson with a combination of purrr.
The following gives me a tibble, but I cannot figure out how to get aliases back.
obj <- jsonlite::fromJSON(data, simplifyDataFrame = T, flatten = F)
obj$ids %>% {
data_frame(id=purrr::map_chr(., 'id'),
state=purrr::map_chr(., 'state', ''),
WIP=purrr::map_chr(., 'WIP', .default=''),
#aliases=purrr::map(purrr::map_chr(., 'aliases', .default=''), my_fun)
)
}
I cannot figure it out with tidyjson either:
data %>% enter_object(ids) %>% gather_object %>% spread_all
What I would like back is a tibble with the following fields (regardless of whether they are in the JSON or not):
id
name
state
version
aliases -> as a comma-separated string
WIP
BONUS: ;-)
Can I get external_ids as a string as well?
Instead of extracting each element with multiple map calls, one option is to convert each entry to a tibble with as_tibble, select the columns of interest, collapse the 'aliases' into a single string per 'id', and keep the distinct rows by 'id':
library(dplyr)
library(tibble)
library(purrr)
library(stringr)
map_dfr(obj$ids, ~ as_tibble(.x) %>%
select(id, one_of("name", "state", "version", "aliases", "WIP"))) %>%
group_by(id) %>%
mutate(aliases = toString(unique(aliases))) %>%
distinct(id, .keep_all = TRUE)
# A tibble: 2 x 6
# Groups: id [2]
# id name state version aliases WIP
# <chr> <chr> <chr> <chr> <chr> <int>
#1 00000012520 Joe Smith Indiana 10 John Smith, Bill Smith NA
#2 00000003119 Sue Smith Kentucky <NA> NA 98203059
If we also need the 'external_ids' (which is a data.frame):
map_dfr(obj$ids, ~ as_tibble(.x) %>%
mutate(external_ids = reduce(external_ids, str_c, sep = " "))) %>%
group_by(id) %>%
mutate_at(vars(aliases, external_ids), ~ toString(unique(.))) %>%
ungroup %>%
distinct(id, .keep_all= TRUE)
# A tibble: 2 x 7
# state version external_ids id name aliases WIP
# <chr> <chr> <chr> <chr> <chr> <chr> <int>
#1 Indiana 10 POL 18935, CIT 1100882 00000012520 Joe Smith John Smith, Bill Smith NA
#2 Kentucky <NA> POL 69131, CIT 1098802 00000003119 Sue Smith NA 98203059
Update
For the new data, we can use
obj$ids %>%
map_dfr(~ map_df(.x, reduce, str_c, collapse = ", ", sep= " ") )
# A tibble: 3 x 8
# state version external_ids id name aliases WIP insured
# <chr> <chr> <chr> <chr> <chr> <chr> <int> <int>
#1 Indiana 10 POL 18935, CIT 1100882 00000012520 Joe Smith John Smith Bill Smith NA NA
#2 Kentucky <NA> POL 69131, CIT 1098802 00000003119 Sue Smith <NA> 98203059 NA
#3 Ohio <NA> POL 69134, JT 615234 0000019223 Larry Smith Test 1 Test 2 Test 3 Test 4 76532172 1
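For comparison, here is a base R sketch of the "fixed set of fields, missing ones become NA" requirement. It assumes a parsed list shaped like obj$ids, with aliases stored as character vectors. The list below is built by hand from two of the records, and the helper `one_row` is hypothetical.

```r
# Hand-built stand-in for two entries of the parsed JSON
ids <- list(
  "00000012664" = list(state = "Indiana", version = "10",
                       id = "00000012520", name = "Joe Smith",
                       aliases = c("John Smith", "Bill Smith")),
  "00000103162" = list(state = "Kentucky", id = "00000003119",
                       name = "Sue Smith", WIP = 98203059)
)

fields <- c("id", "name", "state", "version", "aliases", "WIP")

# Build one row per record: absent fields become NA, and fields with
# several values are collapsed into a comma-separated string
one_row <- function(rec) {
  vals <- lapply(fields, function(f) {
    v <- rec[[f]]
    if (is.null(v)) NA_character_ else toString(v)
  })
  as.data.frame(setNames(vals, fields), stringsAsFactors = FALSE)
}

result <- do.call(rbind, lapply(ids, one_row))
```

Every column comes back as character in this sketch, so WIP would need as.numeric() afterwards if numbers are required.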

Using str_extract_all and unnest but losing rows from NA

I'm using str_extract() and str_extract_all() to do some look around regex. There are zero, one, or multiple results, so I want to unnest() the multiple results into multiple rows. The unnest does not give all rows in the output, because of the character(0) in ab_all (I'm assuming).
library(tidyverse)
my_tbl <- tibble(clmn = c("abcd", "abef, abgh", "xkcd"))
ab_tbl <- my_tbl %>%
mutate(ab = str_extract(clmn, "(?<=ab)[:alpha:]*\\b"),
ab_all = str_extract_all(clmn, "(?<=ab)[:alpha:]*\\b"),
cd = str_extract(clmn, "[:alpha:]*(?=cd)"))
ab_tbl %>% unnest(ab_all, .drop = FALSE)
# A tibble: 3 x 4
clmn ab cd ab_all
<chr> <chr> <chr> <chr>
1 abcd cd ab cd
2 abef, abgh ef NA ef
3 abef, abgh ef NA gh
Edit: Expected output:
# A tibble: 3 x 4
clmn ab cd ab_all
<chr> <chr> <chr> <chr>
1 abcd cd ab cd
2 abef, abgh ef NA ef
3 abef, abgh ef NA gh
4 xkcd NA xk NA
The row with xkcd is missing from the output. Is it something to do with str_extract_all or unnest, or should I change my approach?
Maybe we can replace the zero-length elements with NA and then unnest:
library(tidyverse)
ab_tbl %>%
mutate(ab_all = map(ab_all, ~ if(length(.x) == 0) NA_character_ else .x)) %>%
unnest(ab_all)
NOTE: Assuming that the patterns in str_extract are correct
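The mechanics of that fix can be sketched in base R. The list below is written out by hand to mirror what ab_all holds for the three rows, and it is not produced by the regex here.

```r
clmn <- c("abcd", "abef, abgh", "xkcd")

# Hand-written stand-in for the ab_all list column: the third element
# is character(0), which is what makes unnest() drop the row
ab_all <- list("cd", c("ef", "gh"), character(0))

# Replace zero-length elements with a single NA so every row survives
ab_all[lengths(ab_all) == 0] <- list(NA_character_)

# Expand: repeat each clmn value once per extracted match
out <- data.frame(clmn = rep(clmn, lengths(ab_all)),
                  ab_all = unlist(ab_all),
                  stringsAsFactors = FALSE)
```

In tidyr 1.0.0 and later, `unnest(ab_all, keep_empty = TRUE)` should give the same effect without the manual replacement.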

Use merge with one data frame in R [duplicate]

This question already has answers here:
Transpose / reshape dataframe without "timevar" from long to wide format
(9 answers)
Closed 4 years ago.
I have one data frame in R with duplicate indexes stored in the first column.
df <- data.frame("Index" = c(1,2,1), "Age" = c("Jane Doe","John Doe","Jane Doe"),
"Address" = c("123 Fake Street","780 York Street","456 Elm Street"),
"Telephone" = c("xxx-xxx-xxxx","zzz-zzz-zzzz","yyy-yyy-yyyy"))
Index Name Address Telephone
1 Jane Doe 123 Fake Street xxx-xxx-xxxx
2 John Doe 780 York Street zzz-zzz-zzzz
1 Jane Doe 456 Elm Street yyy-yyy-yyyy
I would like to combine the above data frame to look like:
Index Name Address Telephone Address 2 Telephone 2
1 Jane Doe 123 Fake Street xxx-xxx-xxxx 456 Elm Street yyy-yyy-yyyy
2 John Doe 780 York Street zzz-zzz-zzzz NA NA
Can I use "merge" on the same data frame or is there another command in R that would accomplish this task? Thank you.
With tidyverse:
df %>%
group_by(Age) %>%
summarize_at(vars(Telephone,Address),paste, collapse="|") %>%
separate(Address,into=c("Address1","Address2"),sep="\\|") %>%
separate(Telephone,into=c("Telephone1","Telephone2"),sep="\\|")
# # A tibble: 2 x 5
# Age Telephone1 Telephone2 Address1 Address2
# <fct> <chr> <chr> <chr> <chr>
# 1 Jane Doe xxx-xxx-xxxx yyy-yyy-yyyy 123 Fake Street 456 Elm Street
# 2 John Doe zzz-zzz-zzzz <NA> 780 York Street <NA>
To be more general, we can nest the values using summarize and list, and reformat the content to unnest it with the right format:
df %>%
group_by(Age) %>%
summarize_at(vars(Telephone,Address),
~lst(setNames(invoke(tibble,.),seq_along(.)))) %>%
unnest(.sep = "")
# # A tibble: 2 x 5
# Age Telephone1 Telephone2 Address1 Address2
# <fct> <fct> <fct> <fct> <fct>
# 1 Jane Doe xxx-xxx-xxxx yyy-yyy-yyyy 123 Fake Street 456 Elm Street
# 2 John Doe zzz-zzz-zzzz <NA> 780 York Street <NA>
The function inside of summarize is a bit scary but you can wrap it into a friendlier name if you want to use it again (I added a names parameter just in case):
nest2row <- function(x,names = seq_along(x))
lst(setNames(invoke(tibble,x),names[seq_along(x)]))
df %>%
group_by(Age) %>%
summarize_at(vars(Telephone,Address), nest2row) %>%
unnest(.sep = "")
And this would be the recommended tidy way, I suppose:
df %>%
group_by(Age) %>%
mutate(id=row_number()) %>%
gather(key,value,Address,Telephone) %>%
unite(key,key,id,sep="") %>%
spread(key,value)
# # A tibble: 2 x 6
# # Groups: Age [2]
# Index Age Address1 Address2 Telephone1 Telephone2
# <dbl> <fct> <chr> <chr> <chr> <chr>
# 1 1 Jane Doe 123 Fake Street 456 Elm Street xxx-xxx-xxxx yyy-yyy-yyyy
# 2 2 John Doe 780 York Street <NA> zzz-zzz-zzzz <NA>
With my second solution you keep your factors and there's not this awkward forcing different types of variables in the same column that the idiomatic way has.
Try something like this:
df <- data.frame("Index" = c(1,2,1), "Age" = c("Jane Doe","John Doe","Jane Doe"),
"Address" = c("123 Fake Street","780 York Street","456 Elm Street"),
"Telephone" = c("xxx-xxx-xxxx","zzz-zzz-zzzz","yyy-yyy-yyyy"),
stringsAsFactors = FALSE)
df$unindex=paste(df$Index,df$Age)
sapply(unique(df$unindex),function(li){ # li="1 Jane Doe"
dft=df[li==df$unindex,3:4]
if(nrow(dft)==1)dft else c(t(dft))
})
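A base R alternative, in the spirit of the linked duplicate, is to add a within-group counter and pass it to reshape() as the time variable. This is a sketch that keeps the question's column name Age for the name column. The counter column `n` is an assumption introduced here.

```r
df <- data.frame(Index = c(1, 2, 1),
                 Age = c("Jane Doe", "John Doe", "Jane Doe"),
                 Address = c("123 Fake Street", "780 York Street", "456 Elm Street"),
                 Telephone = c("xxx-xxx-xxxx", "zzz-zzz-zzzz", "yyy-yyy-yyyy"),
                 stringsAsFactors = FALSE)

# Number the rows within each Index: 1 for the first address, 2 for the next
df$n <- ave(df$Index, df$Index, FUN = seq_along)

# Spread Address and Telephone into one set of columns per counter value
wide <- reshape(df, idvar = c("Index", "Age"), timevar = "n",
                direction = "wide", sep = "")
```

Groups with fewer rows than the largest group get NA in the extra columns, which matches the NA entries in the desired output.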
