I have created the following dataframe df in R
Sl NO Word
1 get
2 Free
3 Joshi
4 Hello
5 New
I have used this code to get synonyms, but the result comes back as a list:
library(qdap)
synonyms(DF$Word)
I get a list of synonyms for each word. I want the synonyms for each word in the dataframe appended row-wise to the dataframe as separate columns, like this:
DF<-
Sl NO Word Syn1 Syn2
1 get obtain receive
2 Free independent NA
3 Joshi NA NA
4 Hello Greeting NA
5 New Unused Fresh
Is there an elegant way to obtain this? Are there other dictionaries that can be used for this?
One approach is to use mapply to pass one word at a time to qdap::synonyms, then collapse each result into a single string with paste0 and collapse = "|". Once every word has its synonyms in one column, tidyr::separate splits that column into Syn 1, Syn 2, and so on.
Note: synonyms is called with return.list = FALSE and multiwords = FALSE, so it returns a plain character vector and drops multi-word phrases.
The code below keeps at most 10 synonyms per word, but the solution can be extended to handle the number dynamically.
library(tidyverse)
library(qdap)
df %>%
mutate(Synonyms =
mapply(function(x)paste0(
head(synonyms(x, return.list = FALSE, multiwords = FALSE),10), collapse = "|"),
tolower(.$Word))) %>%
separate(Synonyms, paste("Syn",1:10), sep = "\\|", extra = "drop" )
Result:
# SlNO Word Syn 1 Syn 2 Syn 3 Syn 4 Syn 5 Syn 6 Syn 7 Syn 8 Syn 9 Syn 10
# 1 1 get achieve acquire attain bag bring earn fetch gain glean inherit
# 2 2 Free buckshee complimentary gratis gratuitous unpaid footloose independent liberated loose uncommitted
# 3 3 Joshi <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 4 4 Hello <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 5 5 New advanced all-singing all-dancing contemporary current different fresh ground-breaking happening latest
Data
df <- read.table(text =
"SlNO Word
1 get
2 Free
3 Joshi
4 Hello
5 New",
header = TRUE, stringsAsFactors = FALSE)
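The 10-synonym cap mentioned above could be lifted by padding every synonym vector to the length of the longest one. This is a minimal base R sketch, assuming the synonyms have already been collected into a named list. The `syn_list` values are hand-made stand-ins rather than real qdap output, and `pad_synonyms` is a hypothetical helper name.

```r
# Hand-made stand-in for per-word synonym vectors (not real qdap output)
syn_list <- list(
  get   = c("obtain", "receive"),
  free  = "independent",
  joshi = character(0)
)

# Hypothetical helper: pad each vector with NA up to the longest length
pad_synonyms <- function(syns) {
  n <- max(lengths(syns), 1L)
  # Indexing past the end of a vector yields NA, which does the padding
  mat <- do.call(rbind, lapply(syns, function(s) s[seq_len(n)]))
  colnames(mat) <- paste0("Syn", seq_len(n))
  data.frame(Word = names(syns), mat,
             stringsAsFactors = FALSE, row.names = NULL)
}

wide <- pad_synonyms(syn_list)
print(wide)
```

The number of Syn columns then follows the data instead of being fixed in advance.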
Here is another approach with splitstackshape::cSplit.
library(tidyverse)
library(qdap)
library(splitstackshape)
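# tt is assumed to hold the question's table as a text string (it is not defined in this answer)
DF <- read.table(text = tt, header = TRUE)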
DF <- DF %>% mutate_at(vars(Word), tolower)
syns <- synonyms_frame(synonyms(tolower(DF$Word))) %>%
mutate_at(vars(x), funs(str_remove(x, "\\..*"))) %>%
mutate_at(vars(y), funs(str_extract(y, '[:alpha:]+'))) %>%
group_by(x) %>%
summarise(Syn = toString(y)) %>%
rename(Word = x) %>% cSplit('Syn')
left_join(DF, syns)
I am not sure exactly how you would like to add all the synonyms of a word: synonyms("get") returns 75 definitions of get, and I feel the desired layout will not be of much help if the values of all 75 definitions go into a single row.
So in the solution below I have selected only the very first definition.
library(qdap)
library(dplyr)
library(splitstackshape)
df %>%
rowwise() %>%
mutate(synonym_of_word = paste(synonyms(tolower(word))[[1]], collapse=",")) %>%
cSplit("synonym_of_word", ",")
Sample data:
df <- structure(list(sl_no = 1:5, word = c("get", "Free", "Joshi",
"Hello", "New")), .Names = c("sl_no", "word"), class = "data.frame", row.names = c(NA,
-5L))
I have values with the following structure: string OR string_string.integer
Example:
df<-data.frame(Objs=c("Windows","Door_XYZ.1", "Door_XYY.1", "Chair_XYYU.2" ))
Objs
Windows
Door_XYZ.1
Door_XYY.1
Chair_XYYU.2
Using the command split(), separate() or something similar I need to generate a dataframe similar to this one:
Note: the split must be performed on the characters "_" and ".".
Objs IND TAG Control
Windows NA NA NA
Door_XYZ.1 Door XYZ 1
Door_XYY.1 Door XYY 1
Chair_XYYU.2 Chair XYYU 2
The closest solution was suggested by @Tommy, in a similar context:
df %>% data.frame(.,do.call(rbind,str_split(.$Objs,"_")))
The default value of the sep argument in separate() will nearly get the result you need. A conditional mutate was also needed to remove the Windows entry from the IND column.
library(tidyverse)
df <- data.frame(Objs=c("Windows","Door_XYZ.1", "Door_XYY.1", "Chair_XYYU.2" ))
df %>%
separate(Objs, into = c("IND", "TAG", "Control"), remove = FALSE, fill = "right") %>%
mutate(IND = if_else(Objs == IND, NA_character_, IND))
#> Objs IND TAG Control
#> 1 Windows <NA> <NA> <NA>
#> 2 Door_XYZ.1 Door XYZ 1
#> 3 Door_XYY.1 Door XYY 1
#> 4 Chair_XYYU.2 Chair XYYU 2
Created on 2022-05-05 by the reprex package (v1.0.0)
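For comparison, the same split can be sketched in base R with strsplit and an explicit character class for "_" and ".". This is only a sketch under the assumption that every value is either a single token or has exactly three parts; the names `parts`, `padded` and `pieces` are illustrative.

```r
df <- data.frame(
  Objs = c("Windows", "Door_XYZ.1", "Door_XYY.1", "Chair_XYYU.2"),
  stringsAsFactors = FALSE
)

# Split on either "_" or "."
parts <- strsplit(df$Objs, "[_.]")

# Keep only complete three-part splits; anything else becomes a row of NAs
padded <- lapply(parts, function(p) {
  if (length(p) == 3) p else rep(NA_character_, 3)
})

pieces <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(pieces) <- c("IND", "TAG", "Control")

out <- cbind(df, pieces)
print(out)
```

Here Control stays a character column; wrap it in as.integer() if numbers are needed.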
Hello coding community
I have a two-part question that is half answered:
1. transpose, aka melt, the data frame to my liking - done
2. add rows of data based on results found in the "removed" column, a column created in the transposing step - stuck here
df<- read.table("https://pastebin.com/raw/NEPcUG01",header=T, sep="\t")
df_transformed<-tidyr::gather(df, day, removed, -(1:2), na.rm = TRUE) # melted data
In my example here (df), I have an experiment ran over 8 days. On certain days, I remove data points, and I am only interested in these days (hence why I added na.rm = TRUE in the transposing process). I sometimes remove 1 data point, or 4 (but this could be any number really)
I would like the removed data points to be called "individuals", and for them to be counted in chronological order. Therefore, I first need to add a column called "individuals"
df_transformed$individual <- ""
I would like to fill in the "individual" column based on the results in the "removed" column.
Example: cage 2 had only 1 data point removed, and it was on day_8. I would therefore like to add a 1 in the "individual" column. Cage 4, on the other hand, had data points removed on day_5 (1 data point) and day_7 (3 data points), for a total of 4 data points, aka 4 "individuals". Therefore, for cage 4, starting with day_5, I would like to add a 1 in the "individual" column, and for day_7 create 3 rows of data and continue my "individual" count with 2, 3, 4. If day_8 had 3 more data points removed, the individual count would continue with 5, 6, 7.
My desired result for my example data set today would be this:
desired_results <- read.table("https://pastebin.com/raw/r7QrC0y3", header=T, sep="\t") # 68 total rows of data
Interesting piece of information: The total number of rows in my final data set should equal the sum of all removed data points:
sum(df_transformed$removed) # 68
Thank you StackOverflow community. Looking forward to seeing the results.
We can use complete to create a sequence from 1 to each individual grouped by cage and day. We then fill the NA values in columns experiment and removed.
library(dplyr)
library(tidyr)
df_transformed %>%
mutate(individual = removed) %>%
group_by(cage, day) %>%
complete(individual = seq_len(individual)) %>%
fill(experiment, removed, .direction = "up")
# cage day individual experiment removed
#1 2 day_8 1 sugar 1
#2 3 day_5 1 sugar 1
#3 4 day_5 1 sugar 3
#4 4 day_5 2 sugar 3
#5 4 day_5 3 sugar 3
#6 4 day_7 1 sugar 1
#7 7 day_7 1 sugar 1
#8 7 day_8 1 sugar 1
#9 8 day_5 1 sugar 2
#10 8 day_5 2 sugar 2
# … with 58 more rows
To update individual only based on cage we can do
df_transformed %>%
mutate(individual = removed) %>%
group_by(cage, day) %>%
complete(individual = seq_len(individual)) %>%
group_by(cage) %>%
mutate(individual = row_number()) %>%
fill(experiment, removed, .direction = "up")
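The per-cage running count can also be sketched in base R: repeat each row `removed` times, then number the rows within each cage. The `toy` data frame below is a hypothetical stand-in that follows the cage 2 and cage 4 example from the question; it is not the pastebin data.

```r
# Hypothetical toy data mirroring the cage 2 / cage 4 example in the question
toy <- data.frame(
  cage    = c(2, 4, 4),
  day     = c("day_8", "day_5", "day_7"),
  removed = c(1, 1, 3),
  stringsAsFactors = FALSE
)

# Repeat each row once per removed data point
expanded <- toy[rep(seq_len(nrow(toy)), times = toy$removed), ]
rownames(expanded) <- NULL

# Running count of individuals within each cage, in row order
expanded$individual <- ave(seq_len(nrow(expanded)), expanded$cage,
                           FUN = seq_along)
print(expanded)
```

The number of rows equals sum(toy$removed), which matches the check described in the question.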
I think the following bit of code does what you need:
library(tidyverse)
read.table("https://pastebin.com/raw/NEPcUG01",header=T, sep="\t") %>%
pivot_longer(starts_with("day_"), names_to = "day", values_to = "removed") %>%
# drop_na() %>%
group_by(cage) %>%
summarize(individual = sum(removed, na.rm = TRUE))
I have used the pipe operator (%>%), which enables cleaner syntax. I have also used the newer pivot_longer function instead of gather. Grouping by cage and then summing the removed column with summarize gives how many individuals were removed per cage.
I checked the sum of all the individuals and it seems to work:
read.table("https://pastebin.com/raw/NEPcUG01",header=T, sep="\t") %>%
pivot_longer(starts_with("day_"), names_to = "day", values_to = "removed") %>%
# drop_na() %>%
group_by(cage) %>%
summarize(individual = sum(removed, na.rm = TRUE)) %>%
pull(individual) %>%
sum()
#> [1] 68
The result is slightly different from your desired result. I am not 100% sure your desired result is actually correct. From your question, I understand that cage 4 should have 4 individuals, but in your desired_results it appears 4 times with values 1, 2, 3 and 4. The code above generates a data frame where each cage appears in a single row.
I have a table called DATA_SET. This table contains one column with six different cases of data.
#DATA_SET
DATA_SET<-data.frame(
CUSTOMS_RATE=c("20","15+0,41 eur/kg","10+0,1 eur/kg max.17","0,1 eur/l max.17","0,04 eur/kg max.10","NA")
)
View(DATA_SET)
#DATA_SET1
DATA_SET1<-data.frame(
RATE="",
SPECIFIC_RATE="",
MAXIMUM_RATE=""
)
My intention is to divide this column into three different columns, as in the table DATA_SET1 above, in order to continue with other statistical operations (calculation of averages, etc.).
Can anybody help me transform this table?
Usually, separate would be the better option, but in this case the positions of the numbers are not the same in each row (and some are missing). So we use str_extract to extract each value individually:
library(tidyverse)
DATA_SET %>%
mutate(CUSTOMS_RATE = str_replace_all(CUSTOMS_RATE, ",", "."),
RATE = str_extract(CUSTOMS_RATE, "^[0-9]+(?=\\+|$)"),
SPECIFIC_RATE = str_extract(CUSTOMS_RATE, "\\d+\\.\\d+"),
MAXIMUM_RATE = str_extract(CUSTOMS_RATE, "(?<=max\\.)\\d+")) %>%
select(2:4) %>%
mutate_all(as.numeric)
# RATE SPECIFIC_RATE MAXIMUM_RATE
#1 20 <NA> <NA>
#2 15 0.41 <NA>
#3 10 0.1 17
#4 <NA> 0.1 17
#5 <NA> 0.04 10
#6 <NA> <NA> <NA>
Or use str_replace to create a single delimiter and then use separate
DATA_SET %>%
mutate(CUSTOMS_RATE = str_replace_all(CUSTOMS_RATE, ",", ".") %>%
str_replace("\\+?([0-9]+\\.[0-9]+)", "+\\1") %>%
str_replace_all("[A-Za-z/ ]+\\.?", "+")) %>%
separate(CUSTOMS_RATE, into = c("RATE", "SPECIFIC_RATE", "MAXIMUM_RATE"),
sep="\\+", convert = TRUE)
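The extraction idea can also be sketched without packages, using regexpr and regmatches with the same three patterns. This assumes the missing entry is a real NA rather than the string "NA", and `first_match` is a hypothetical helper name.

```r
rates <- c("20", "15+0,41 eur/kg", "10+0,1 eur/kg max.17",
           "0,1 eur/l max.17", "0,04 eur/kg max.10", NA)

# Use "." as the decimal mark so the pieces can be converted to numbers
rates <- gsub(",", ".", rates, fixed = TRUE)

# Hypothetical helper: first match of a pattern per element, NA when absent
first_match <- function(pattern, text) {
  m <- regexpr(pattern, text, perl = TRUE)
  out <- rep(NA_character_, length(text))
  hit <- !is.na(m) & m > 0
  out[hit] <- regmatches(text, m)  # regmatches returns only the matched elements
  as.numeric(out)
}

result <- data.frame(
  RATE          = first_match("^[0-9]+(?=\\+|$)", rates),
  SPECIFIC_RATE = first_match("\\d+\\.\\d+", rates),
  MAXIMUM_RATE  = first_match("(?<=max\\.)\\d+", rates)
)
print(result)
```

The lookahead and lookbehind in the patterns need perl = TRUE in base R.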
In the data with multiple columns, I want to exchange certain patterns in one column with NA in another column.
I have the data below:
data = data.frame(adverb=c('truly','extremely','wanted','happily','stressed'),verb=c('loved','adored','a','prayed','the'),article=c('you','the','toy',NA,'importance'),argument=c(NA,'doll',NA,NA,NA))
adverb verb article argument
1 truly loved you <NA>
2 extremely adored the doll
3 wanted a toy <NA>
4 happily prayed <NA> <NA>
5 stressed the importance <NA>
I want to relocate values in the data to the corresponding columns according to the patterns below.
adverb.pattern = '[a-z]+ly$'
verb.pattern = '[a-z]+ed$'
article.pattern = '(the)$|(a)$|(an)$'
argument.pattern = '(you)$|(importance)$|(toy)$'
This is the desired output.
adverb verb article argument
1 truly loved <NA> you
2 extremely adored the doll
3 <NA> wanted a toy
4 happily prayed <NA> <NA>
5 <NA> stressed the importance
Here's a tidyverse solution:
# example data
data = data.frame(adverb=c('truly','extremely','wanted','happily','stressed'),
verb=c('loved','adored','a','prayed','the'),
article=c('you','the','toy',NA,'importance'),
argument=c(NA,'doll',NA,NA,NA),
stringsAsFactors = F)
library(tidyverse)
# specify patterns
adverb.pattern = '[a-z]+ly$'
verb.pattern = '[a-z]+ed$'
article.pattern = '(the)$|(a)$|(an)$'
argument.pattern = '(you)$|(importance)$|(toy)$'
data %>%
mutate(id = row_number()) %>% # add row id (useful to reshape)
gather(type, value, -id) %>% # reshape data
na.omit() %>% # remove rows with NAs
mutate(type_upd = case_when(grepl(adverb.pattern, value) ~ "adverb", # check patterns sequentially
grepl(verb.pattern, value) ~ "verb",
grepl(article.pattern, value) ~ "article",
grepl(argument.pattern, value) ~ "argument"),
type_upd = ifelse(is.na(type_upd), type, type_upd)) %>% # use original / initial type if updated type is NA
select(-type) %>% # remove old type
spread(type_upd, value) %>% # reshape data
select(adverb, verb, article, argument) # select column in this order
# adverb verb article argument
# 1 truly loved <NA> you
# 2 extremely adored the doll
# 3 <NA> wanted a toy
# 4 happily prayed <NA> <NA>
# 5 <NA> stressed the importance
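The same sequential pattern check can be written as a plain loop in base R, which may make the logic easier to follow: each non-missing value goes to the first column whose pattern it matches, or stays in its original column if none matches. This is a sketch for small data; the `patterns` vector, the `classify` helper and `out` are illustrative names.

```r
data <- data.frame(
  adverb   = c("truly", "extremely", "wanted", "happily", "stressed"),
  verb     = c("loved", "adored", "a", "prayed", "the"),
  article  = c("you", "the", "toy", NA, "importance"),
  argument = c(NA, "doll", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Patterns in the order they are checked (same order as the case_when above)
patterns <- c(adverb   = "[a-z]+ly$",
              verb     = "[a-z]+ed$",
              article  = "(the)$|(a)$|(an)$",
              argument = "(you)$|(importance)$|(toy)$")

# Hypothetical helper: first matching type, else the original column name
classify <- function(value, original) {
  for (type in names(patterns)) {
    if (grepl(patterns[[type]], value)) return(type)
  }
  original
}

# Start from an all-NA frame with the same shape and column names
out <- as.data.frame(
  matrix(NA_character_, nrow(data), ncol(data),
         dimnames = list(NULL, names(data))),
  stringsAsFactors = FALSE
)

for (i in seq_len(nrow(data))) {
  for (col in names(data)) {
    value <- data[i, col]
    if (!is.na(value)) out[i, classify(value, col)] <- value
  }
}
print(out)
```

If two values in one row map to the same column, the later one overwrites the earlier one, so this sketch assumes at most one value per type per row.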
I need your help, because I got a data frame with a very difficult format. My data frame
data <- data.frame(information = c("{u'info1': u'mnfd', u'text': u'exampletext'}","{u'info2': u'332', u'text': u'lalala'}","{u'info1': u'', u'text': u'blub'}"))
has the column information (and a few other columns in the real data frame) and looks for example like that
## information
## 1 {u'info1': u'mnfd', u'text': u'exampletext'}
## 2 {u'info2': u'332', u'text': u'lalala'}
## 3 {u'info1': u'', u'text': u'blub'}
The real data frame has a few thousand rows and the strings are much longer. I would like to add columns that display the information from the strings. In the end I would like a dataframe that looks like this (the string "of_" is added before every column name):
information of_info1 of_text of_info2
1 {u'info1': u'mnfd', u'text': u'exampletext'} mnfd exampletext <NA>
2 {u'info2': u'332', u'text': u'lalala'} <NA> lalala 332
3 {u'info1': u'', u'text': u'blub'} blub <NA>
Thanks for your help
This is close to a JSON file, so do a bit of formatting to get it right, and then import via the awesome jsonlite package:
library(jsonlite)
fromJSON(paste0("[", paste(gsub("(u|)'",'"',data$information), collapse=",\n"), "]"))
# info1 text info2
#1 mnfd exampletext <NA>
#2 <NA> lalala 332
#3 blub <NA>
Here's a version with dplyr and stringr. It should not be too difficult to translate it into base R if you prefer that.
This will break, however, if there are escaped single quotation marks in the fields.
library(stringr)
library(dplyr)
data <- data$information %>% str_match_all("u'([^']+)': u'([^']*)'") %>%
lapply(function(matches) {
result <- data.frame(as.list(matches[,3]), stringsAsFactors = FALSE)
colnames(result) <- paste0("of_", matches[,2])
result
}) %>% bind_rows() %>% bind_cols(data, .)
Convert it to DCF format and then read it in using read.dcf. No packages are used.
First we remove the junk giving s0 and then split it on comma-space giving s1. Then add an empty terminating line between records giving s2. Finally use read.dcf to read that in and append it to data.
s0 <- gsub("[{}]", "", gsub("u'(.*?)'", "\\1", data$information))
s1 <- strsplit(s0, ", ")
s2 <- unlist(lapply(s1, c, ""))
cbind(data, read.dcf(textConnection(s2)))
giving:
information info1 text info2
1 {u'info1': u'mnfd', u'text': u'exampletext'} mnfd exampletext <NA>
2 {u'info2': u'332', u'text': u'lalala'} <NA> lalala 332
3 {u'info1': u'', u'text': u'blub'} blub <NA>
magrittr
This could also be expressed as a nested magrittr pipeline like this:
library(magrittr)
data %>%
cbind({.$information %>%
gsub("u'(.*?)'", "\\1", .) %>%
gsub("[{}]", "", .) %>%
strsplit(", ") %>%
lapply(c, "") %>%
unlist %>%
textConnection %>%
read.dcf
})