I am trying to unnest two columns that do not always have the same number of values per cell and then concatenate the values that correspond between the two columns. For example:
library('dplyr')
library('tidyr')
#Sample Data
df <- data.frame(id = c(1:4),
first.names = c('Michael, Jim', 'Michael, Michael', 'Creed', 'Creed, Jim'),
last.names = c('Scott, Halpert', 'Scott, Cera', '', 'Halpert'))
Not all values in df$first.names are associated with a value in df$last.names. I am trying to get the following results:
#Desired output
df.results <- data.frame(id = c(1,1,2,2,3,4,4),
first.names = c('Michael', 'Jim', 'Michael', 'Michael', 'Creed', 'Creed', 'Jim'),
last.names = c('Scott', 'Halpert', 'Scott', 'Cera', '', '', 'Halpert'),
full.names = c('Michael Scott', 'Jim Halpert', 'Michael Scott', 'Michael Cera', 'Creed', 'Creed', 'Jim Halpert'))
I have tried using unnest, it works for first.names, but not for last.names (it drops the row where last.names is blank):
#convert to characters
df$first.names <- as.character(df$first.names)
df$last.names <- as.character(df$last.names)
#Unnest first names
df <- df %>%
transform(first.names = strsplit(first.names, ',')) %>%
unnest(first.names)%>%
transform(last.names = strsplit(last.names, ',')) %>%
unnest(last.names)
I was then going to delete duplicate lines, but that still does not solve the issue with the values in df$first.names that do not have a value in df$last.names.
Is there a better way to do this?
Check this solution:
library(tidyverse)
df %>%
as_tibble() %>%
mutate_at(2:3, ~ strsplit(as.character(.x), ',') %>% map(~ str_trim(.x))) %>%
mutate(
First = map2_chr(first.names, last.names, ~ paste(.x[1], .y[1])),
Second = map2_chr(first.names, last.names, ~ paste(.x[2], .y[2]))
) %>%
mutate_at(4:5, ~ str_remove_all(.x, 'NA') %>% str_trim()) %>%
gather('x', 'full.names', First:Second) %>%
filter(full.names != '') %>%
mutate(
first.names = map_chr(full.names, ~ str_split(.x, ' ')[[1]][1]),
last.names = map_chr(full.names, ~ str_split(.x, ' ')[[1]][2]) %>%
replace_na('')
) %>%
select(-x) %>%
arrange(id)
I could add logic so that, when a row has only one value in last.names, it is combined with the second value in first.names to give your desired result, but I don't think that is what you want in general. A vector listing the first.names that have no last.names would resolve the ambiguity.
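The lookup-vector idea mentioned above can be sketched in base R. This is only an illustration under an assumption: `no_last` is a hypothetical vector of first names known to have no last name, and `split_trim` is a helper made up for this example. Last names are handed out in order to the first names that are not in that vector.

```r
# Sample data from the question
df <- data.frame(
  id = 1:4,
  first.names = c("Michael, Jim", "Michael, Michael", "Creed", "Creed, Jim"),
  last.names = c("Scott, Halpert", "Scott, Cera", "", "Halpert"),
  stringsAsFactors = FALSE
)

# Assumption: first names that never carry a last name (hypothetical lookup)
no_last <- c("Creed")

# Hypothetical helper: split a comma-separated cell and trim whitespace
split_trim <- function(x) trimws(strsplit(x, ",", fixed = TRUE)[[1]])

rows <- lapply(seq_len(nrow(df)), function(i) {
  first <- split_trim(df$first.names[i])
  last <- split_trim(df$last.names[i])
  out <- rep("", length(first))
  # Only first names outside the lookup are eligible for a last name
  takes_last <- !(first %in% no_last)
  n <- min(sum(takes_last), length(last))
  out[which(takes_last)[seq_len(n)]] <- last[seq_len(n)]
  data.frame(id = df$id[i], first.names = first, last.names = out,
             stringsAsFactors = FALSE)
})

res <- do.call(rbind, rows)
# trimws() removes the trailing space left when last.names is empty
res$full.names <- trimws(paste(res$first.names, res$last.names))
res
```

Without such a lookup, the pairing for id 4 ('Creed, Jim' with 'Halpert') cannot be recovered from position alone.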
Related
The dataset below has columns with very similar names and some values which are NA.
library(tidyverse)
dat <- data.frame(
v1_min = c(1,2,4,1,NA,4,2,2),
v1_max = c(1,NA,5,4,5,4,6,NA),
other_v1_min = c(1,1,NA,3,4,4,3,2),
other_v1_max = c(1,5,5,6,6,4,3,NA),
y1_min = c(3,NA,2,1,2,NA,1,2),
y1_max = c(6,2,5,6,2,5,3,3),
other_y1_min = c(2,3,NA,1,1,1,NA,2),
other_y1_max = c(5,6,4,2,NA,2,NA,NA)
)
head(dat)
In this example, v1 and y1 would be what I would consider the common "categories" among the columns. To get something similar with my current dataset, I had to use gsub to tease these out:
cats<-dat %>%
names() %>%
gsub("^(.*)_(min|max)", "\\1",.) %>%
gsub("^(.*)_(.*)", "\\2",.) %>%
unique()
Now, my goal is to mutate a new min and a new max column for each of those categories. So far the code below works just fine.
dat %>%
rowwise() %>%
mutate(min_v1 = min(c_across(contains(cats[1])), na.rm=T)) %>%
mutate(max_v1 = max(c_across(contains(cats[1])), na.rm=T)) %>%
mutate(min_y1 = min(c_across(contains(cats[2])), na.rm=T)) %>%
mutate(max_y1 = max(c_across(contains(cats[2])), na.rm=T))
However, the number of categories in my current dataset is quite a bit bigger than 2. Is there a quicker way to implement this?
I've tried a few of the suggestions on this post but haven't quite been able to extend them to this problem.
You can use one of the map functions here to handle each common category.
library(dplyr)
library(purrr)
result <- bind_cols(dat, map_dfc(cats,
~dat %>%
rowwise() %>%
transmute(!!paste('min', .x, sep = '_') := min(c_across(matches(.x)), na.rm = TRUE),
!!paste('max', .x, sep = '_') := max(c_across(matches(.x)), na.rm = TRUE))))
result
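For comparison, here is a base R sketch of the same per-category row minimum and maximum using pmin() and pmax(), which are vectorised across columns and avoid rowwise(). It is an alternative technique, not the answer's method; `cats` is written out by hand here, and columns are selected by a fixed substring match.

```r
# Same sample data as in the question
dat <- data.frame(
  v1_min = c(1, 2, 4, 1, NA, 4, 2, 2),
  v1_max = c(1, NA, 5, 4, 5, 4, 6, NA),
  other_v1_min = c(1, 1, NA, 3, 4, 4, 3, 2),
  other_v1_max = c(1, 5, 5, 6, 6, 4, 3, NA),
  y1_min = c(3, NA, 2, 1, 2, NA, 1, 2),
  y1_max = c(6, 2, 5, 6, 2, 5, 3, 3),
  other_y1_min = c(2, 3, NA, 1, 1, 1, NA, 2),
  other_y1_max = c(5, 6, 4, 2, NA, 2, NA, NA)
)
cats <- c("v1", "y1")

result <- dat
for (ct in cats) {
  # All original columns whose name contains the category
  cols <- unname(as.list(dat[grepl(ct, names(dat), fixed = TRUE)]))
  # pmin/pmax take the element-wise (row-wise) extreme across the columns
  result[[paste0("min_", ct)]] <- do.call(pmin, c(cols, list(na.rm = TRUE)))
  result[[paste0("max_", ct)]] <- do.call(pmax, c(cols, list(na.rm = TRUE)))
}
result
```

One difference to be aware of: for a row where every value in a category is NA, pmin() and pmax() return NA, whereas min() and max() with na.rm = TRUE return Inf and -Inf with a warning.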
I have the following dataset:
combined <- data.frame(
client = c('aaa','aaa','aaa','bbb','bbb','ccc','ccc','ddd','ddd','ddd'),
type = c('norm','reg','opt','norm','norm','reg','opt','opt','opt','reg'),
age = c('>50','>50','75+','<25','<25','>50','75+','25-50','25-50','75+'),
cases = c('1','2','2','1','0','1','2','0','3','2'),
IsActive = c('1','0','0','1','1','0','1','1','1','0')
)
And have identified the unique variable combinations with :
# get unique variable combinations
unique_vars <- combined %>%
select(1:3,5) %>%
distinct()
I am trying to iterate over the query combined %>% anti_join(slice(unique_vars,1)) using purrr, saving both the output of each query and a summary of cases from each output back to the unique_vars table. The slice should step through each row of unique_vars rather than stay fixed at 1.
I tried :
qry <- combined %>% anti_join(slice(unique_vars,1))
map(.x = unique_vars %>%
slice(.),
~qry %>%
summarise(CaseCnt = sum(cases)) %>%
inner_join(.x))
My desired output would be two things:
Full output of the query
the new Field CaseCnt added to the unique_vars dataframe
Is this possible?
Although I don't completely follow the intuition behind your query, it seems that for #1 you would want:
lapply(1:nrow(unique_vars), function(x) {
  # rows of combined that do NOT match the x-th unique combination
  combined %>%
    anti_join(slice(unique_vars, x))
})
And for #2 you would want:
unique_vars$CaseCnt <- lapply(1:nrow(unique_vars), function(x) {
  combined %>%
    anti_join(slice(unique_vars, x)) %>%
    summarise(CaseCnt = sum(cases %>% as.numeric))
}) %>% do.call(what = rbind.data.frame,
               args = .) %>%
  pull(CaseCnt) # keep a plain numeric vector rather than a nested data frame
Alternatively for #2 with purrr::map_df():
unique_vars$CaseCnt <- map_df(c(1:nrow(unique_vars)), function(x) {
  combined %>%
    anti_join(slice(unique_vars, x)) %>%
    summarise(CaseCnt = sum(cases %>% as.numeric))
}) %>%
  pull(CaseCnt) # keep a plain numeric vector rather than a nested data frame
Just as an aside -- you could do this directly with:
combined %>%
mutate(cases = as.numeric(cases)) %>%
mutate(tot_cases = sum(cases)) %>% # sum total cases across unique_id's
group_by(client, type, age, IsActive) %>%
summarize(CaseCnt = mean(tot_cases) - sum(cases))
Or if what you were actually looking for is the sum of cases in that group:
combined %>%
mutate(cases = as.numeric(cases)) %>%
group_by(client, type, age, IsActive) %>%
summarize(CaseCnt = sum(cases))
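Both requested outputs can also be produced in one pass. The following is a base R sketch, not the dplyr code above: `keys`, `key_of` and `outputs` are names made up for this example, and rows are compared by pasting the key columns into one string.

```r
combined <- data.frame(
  client = c('aaa','aaa','aaa','bbb','bbb','ccc','ccc','ddd','ddd','ddd'),
  type = c('norm','reg','opt','norm','norm','reg','opt','opt','opt','reg'),
  age = c('>50','>50','75+','<25','<25','>50','75+','25-50','25-50','75+'),
  cases = c('1','2','2','1','0','1','2','0','3','2'),
  IsActive = c('1','0','0','1','1','0','1','1','1','0'),
  stringsAsFactors = FALSE
)
combined$cases <- as.numeric(combined$cases)

keys <- c("client", "type", "age", "IsActive")
unique_vars <- unique(combined[keys])

# Hypothetical helper: one string per row built from the key columns
key_of <- function(d) do.call(paste, c(d[keys], sep = "\r"))

# Output 1: for each unique combination, the rows of combined that do not match it
outputs <- lapply(seq_len(nrow(unique_vars)), function(i) {
  combined[key_of(combined) != key_of(unique_vars[i, , drop = FALSE]), ]
})

# Output 2: the case total of each of those outputs, stored on unique_vars
unique_vars$CaseCnt <- vapply(outputs, function(d) sum(d$cases), numeric(1))
unique_vars
```

Here `outputs[[i]]` plays the role of the anti_join result for row i, so the full query output and the summary stay aligned by position.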
This is what I have tried so far. It works, but it only gives me the p.value for data that has no NAs. Much of my data has NA values, in some columns up to a third of the values.
normal <- apply(cor_phys, 2, function(x) shapiro.test(x)$p.value)
I want to try adding na.rm to the function, but it's not working. Help?
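shapiro.test() has no na.rm argument, so one way to handle this is to drop the missing values inside the function that is applied to each column. The sketch below uses made-up data in place of cor_phys (the real data is not shown in the question) and returns NA for any column with fewer than 3 non-missing values, the minimum that shapiro.test() accepts.

```r
set.seed(1)
# Made-up stand-in for cor_phys, with NAs in up to a third of a column
cor_phys <- data.frame(a = rnorm(30), b = runif(30), c = rexp(30))
cor_phys$a[c(2, 5, 9)] <- NA
cor_phys$b[1:10] <- NA

normal <- sapply(cor_phys, function(x) {
  x <- x[!is.na(x)]                    # drop missing values explicitly
  if (length(x) < 3) return(NA_real_)  # too few values to test
  shapiro.test(x)$p.value
})
normal
```

Using sapply() on the data frame iterates over columns, like apply(cor_phys, 2, ...), without first converting the data to a matrix.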
#calculate the correlations between all variables
corres <- cor_phys %>% #cor_phys is my data
as.matrix %>%
cor(use="complete.obs") %>% #complete.obs drops every row that contains an NA
as.data.frame %>%
rownames_to_column(var = 'var1') %>%
gather(var2, value, -var1)
#removes duplicates correlations
corres <- corres %>%
mutate(var_order = paste(var1, var2) %>%
strsplit(split = ' ') %>%
map_chr( ~ sort(.x) %>%
paste(collapse = ' '))) %>%
mutate(cnt = 1) %>%
group_by(var_order) %>%
mutate(cumsum = cumsum(cnt)) %>%
filter(cumsum != 2) %>%
ungroup %>%
select(-var_order, -cnt, -cumsum) #removes unneeded columns
I did not write this myself, but it is the answer that I used and worked for my needs. The link to the page I used is: How to compute correlations between all columns in R and detect highly correlated variables
I would like to join/merge multiple tibbles/data frames with the use of map/lapply. How would it be possible to perform that?
Reproducible example:
library(tidyverse)
set.seed(42)
df <- tibble::tibble(rank = rep(stringr::str_c("rank",1:10),10),
char_1 = sample(c("a","b","c"), size = 100, replace = TRUE),
points = sample(1:10000, size = 100)
)
my_top <- seq(10,90, by= 10) %>%
as.list() %>%
set_names(c(stringr::str_c("sample_",1:9)))
my_list_1 <- map(my_top , ~ df %>%
sample_n(.x) %>%
mutate(!!str_c(.x, "_score") := sample(1:10000, size = .x)))
I would like to perform this:
df %>% group_by(rank, char_1, points) %>%
left_join(my_list_1[[1]] ) %>%
left_join(my_list_1[[2]] ) %>%
left_join(my_list_1[[3]] )
and so on ... with map function.
I tried this:
map(as.list(names(my_top)), ~ df %>% group_by(rank, char_1, points) %>%
left_join(my_list_1[[.x]] ))
But of course, it is not saving somewhere the joined tibble in order to make a new join with it!
One option would be reduce:
library(dplyr)
library(purrr)
df %>%
group_by(rank, char_1, points) %>%
list(.) %>%
c(., my_list_1[1:3]) %>%
reduce(left_join)
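The same folding idea can be sketched in base R with Reduce() and merge(), which may help show what reduce(left_join) does: each step joins the accumulated result with the next data frame. The data below is made up for illustration (a single `key` column and three small score tables), not the tibbles from the question.

```r
# Made-up base table and a list of tables to join onto it
df <- data.frame(key = 1:6, points = c(55L, 12L, 87L, 40L, 3L, 71L))
pieces <- list(
  data.frame(key = c(1L, 3L), score_10 = c(10, 30)),
  data.frame(key = c(2L, 3L, 6L), score_20 = c(200, 300, 600)),
  data.frame(key = 4L, score_30 = 4000)
)

# Fold over the list: acc starts as df and carries the joined result forward
joined <- Reduce(
  function(acc, nxt) merge(acc, nxt, by = "key", all.x = TRUE),  # left join
  pieces,
  df
)
joined
```

The third argument to Reduce() is the starting value, which is what solves the "not saving the joined tibble" problem: the result of each join becomes the left side of the next one.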
This is my first answer, I'm new here. I had a similar problem recently, and join_all from plyr was the best solution I found.
library(plyr)
library(readr)
# list the files saved on your computer, for example in txt format
files <- list.files("path", pattern = "\\.txt$", full.names = TRUE)
# read the files and save them as a list
list_of_data_frames <- lapply(files, read_delim, delim = "\t")
# merge the files
merged_file <- join_all(list_of_data_frames, by = NULL)
I am trying to apply the code from this link to my data:
https://www.tidytextmining.com/sentiment.html#sentiment-analysis-with-inner-join
The code in the book is
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
I wrote it like the following (excluded "filter" because I have just filenames and words columns in my data)
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
abc %>%
inner_join(nrc_joy ) %>%
count(word, sort = TRUE)
I get this error:
Error: by required, because the data sources have no common variables
Any ideas how to deal with it?
After running into a similar issue this is what I found.
The complete code from the website is:
library(janeaustenr)
library(dplyr)
library(stringr)
library(tidytext) # provides unnest_tokens() and get_sentiments()
tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text,
regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
The 'abc' dataset is unspecified in the question; however, it is easy to make up a substitute dataset that stores the words in a column named 'differentColumnNameForWord'.
library(tidytext)
abc <- data.frame(differentColumnNameForWord = stop_words$word, stringsAsFactors = FALSE)
To find the name of the column that holds the words in the data frame, use the 'names' function.
> names(abc)
[1] "differentColumnNameForWord"
Once the name of the column is identified, the code needs to be modified as follows:
abc %>% inner_join(nrc_joy, by = c("differentColumnNameForWord" = "word")) %>%
  count(differentColumnNameForWord, sort = TRUE)
In my situation, one dataset had the words under the 'word' column while another had the words under the 'term' column.
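The 'word' versus 'term' case can be sketched in base R with merge(), whose by.x and by.y arguments name the key column on each side. All data here is made up: `abc` is a hypothetical token table, and `joy_words` is a tiny stand-in, not the real NRC lexicon.

```r
# Hypothetical token table whose word column is called 'term'
abc <- data.frame(
  filename = c("a.txt", "a.txt", "b.txt", "b.txt"),
  term = c("happy", "table", "happy", "delight"),
  stringsAsFactors = FALSE
)
# Made-up stand-in for the filtered sentiment lexicon
joy_words <- data.frame(
  word = c("happy", "delight", "cheer"),
  sentiment = "joy",
  stringsAsFactors = FALSE
)

names(abc)  # check which column holds the words

# Inner join on differently named key columns
matched <- merge(abc, joy_words, by.x = "term", by.y = "word")

# Count matched words, most frequent first
counts <- sort(table(matched$term), decreasing = TRUE)
counts
```

In dplyr the equivalent mapping is written by = c("term" = "word"), with the left table's column name on the left-hand side.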