I have some data from a poll which looks like this:
Freetime_activities
1 Travelling, On the PC, Clubbing
2 Sports, On the PC, Clubbing
3 Clubbing
4 On the PC
5 Travelling, On the PC, Clubbing
6 On the PC
7 Watching TV, Travelling
I want to get the count of each value (how many times Travelling/On the PC/etc.), but I'm having trouble splitting the values. Is there a function in R that can do for example:
split("A,B,C") ->
1 A
2 B
3 C
Or is there a straightforward solution to counting the values directly from the column?
We can use strsplit to split the column by the delimiter ", ", unlist the list output, and then use table to get the frequency:
tbl <- table(unlist(strsplit(as.character(df1$Freetime_activities),
", ")))
as.data.frame(tbl)
# Var1 Freq
#1 Clubbing 4
#2 On the PC 5
#3 Sports 1
#4 Travelling 3
#5 Watching TV 1
NOTE: as.character is used here in case the column is a factor, as strsplit can only take character vectors.
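A quick illustration of why the conversion matters (a small made-up factor, just for demonstration):
x <- factor(c("A, B", "C"))
# strsplit(x, ", ")              # would fail: strsplit() only accepts character vectors
strsplit(as.character(x), ", ")  # works after converting the factor to character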
Another option would be to use scan to extract the elements and then get the frequency with table:
table(trimws(scan(text = as.character(df1$Freetime_activities),
what = "", sep = ",")))
Or using read.table with unlist and table
table(unlist(read.table(text = as.character(df1$Freetime_activities),
sep = ",", fill = TRUE, strip.white = TRUE)))
EDIT: Based on @David Arenburg's comments.
data
df1 <- structure(list(Freetime_activities = c("Travelling, On the PC, Clubbing",
  "Sports, On the PC, Clubbing", "Clubbing", "On the PC",
  "Travelling, On the PC, Clubbing", "On the PC", "Watching TV, Travelling")),
  .Names = "Freetime_activities", class = "data.frame",
  row.names = c("1", "2", "3", "4", "5", "6", "7"))
I have a data frame with a Review and a Text column and multiple rows. I also have a list containing words. I want a for loop that examines each row of the data frame and sums the number of words from the list that are found in the text. I want to keep each row's sum separate, by row, and place the results into a new result data frame.
#Data Frame
Review Text
1 I like to run and play.
2 I eat cookies.
3 I went to swim in the pool.
4 I like to sleep.
5 I like to run, play, swim, and eat.
#List Words
Run
Play
Eat
Swim
#Result Data Frame
Review Count
1 2
2 1
3 1
4 0
5 4
Here is a base R solution, where gregexpr is used for counting occurrences.
Given the patterns below,
pat <- c("Run", "Play", "Eat", "Swim")
the counts can be added to the data frame via:
# Count matches of any pattern word in each row; gregexpr() returns -1 when nothing matches
df$Count <- sapply(gregexpr(paste0(tolower(pat), collapse = "|"), tolower(df$Text)),
                   function(v) ifelse(-1 %in% v, 0, length(v)))
such that
> df
Review Text Count
1 1 I like to run and play 2
2 2 I eat cookies 1
3 3 I went to swim in the pool. 1
4 4 I like to sleep. 0
5 5 I like to run, play, swim, and eat. 4
We can use stringr::str_count after pasting the words together as one pattern.
df$Count <- stringr::str_count(df$Text,
paste0("\\b", tolower(words), "\\b", collapse = "|"))
df
# Review Text Count
#1 1 I like to run and play. 2
#2 2 I eat cookies. 1
#3 3 I went to swim in the pool. 1
#4 4 I like to sleep. 0
#5 5 I like to run, play, swim, and eat. 4
data
df <- structure(list(Review = 1:5, Text = structure(c(2L, 1L, 5L, 4L,
3L), .Label = c("I eat cookies.", "I like to run and play.",
"I like to run, play, swim, and eat.", "I like to sleep.",
"I went to swim in the pool."), class = "factor")), class =
"data.frame", row.names = c(NA, -5L))
words <- c("Run","Play","Eat","Swim")
Base R solution (note this solution is intentionally case insensitive):
# Create a vector of patterns to search for:
patterns <- c("Run", "Play", "Eat", "Swim")
# Split on the review number, apply a term counting function (for each review number):
df$term_count <- sapply(split(df, df$Review),
function(x){length(grep(paste0(tolower(patterns), collapse = "|"),
tolower(unlist(strsplit(x$Text, "\\s+")))))})
Data:
df <- data.frame(Review = 1:5, Text = as.character(c("I like to run and play",
"I eat cookies",
"I went to swim in the pool.",
"I like to sleep.",
"I like to run, play, swim, and eat.")),
stringsAsFactors = FALSE)
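If you specifically want the explicit for loop described in the question, a minimal sketch along the same lines (reusing the df and patterns objects above; the column name term_count_loop is just illustrative):
pat_regex <- paste0(tolower(patterns), collapse = "|")
counts <- integer(nrow(df))
for (i in seq_len(nrow(df))) {
  # gregexpr() returns -1 when the row's text contains no pattern word
  m <- gregexpr(pat_regex, tolower(df$Text[i]))[[1]]
  counts[i] <- if (m[1] == -1) 0L else length(m)
}
df$term_count_loop <- counts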
I've looked through the following pages on using regex to isolate a string:
Regular expression to extract text between square brackets
What is a non-capturing group? What does (?:) do?
Split data frame string column into multiple columns
I have a dataframe which contains protein/gene identifiers, and in some cases there are two or more of these strings (separated by a comma) because of multiple matches from a list. In this case the first string is the strongest match and I'm not necessarily interested in keeping the rest. They represent multiple matches from inferred evidence, and when they cannot be easily discriminated all of the hits get put into one column. I'm only interested in keeping the first, because the group will likely have the same type of annotation (i.e. type of protein, gene ontology, similar function, etc.). If I split the multiple entries into more rows, it would appear that I have evidence that they exist in my dataset, but at the empirical level I don't.
My dataframe:
protein
1 sp|P50213|IDH3A_HUMAN
2 sp|Q9BZ95|NSD3_HUMAN
3 sp|Q92616|GCN1_HUMAN
4 sp|Q9NSY1|BMP2K_HUMAN
5 sp|O75643|U520_HUMAN
6 sp|O15357|SHIP2_HUMAN
523 sp|P10599|THIO_HUMAN,sp|THIO_HUMAN|
524 sp|Q96KB5|TOPK_HUMAN
525 sp|P12277|KCRB_HUMAN,sp|P17540|KCRS_HUMAN,sp|P12532|KCRU_HUMAN
526 sp|O00299|CLIC1_HUMAN
527 sp|P25940|CO5A3_HUMAN
The output I am trying to create:
uniprot gene
P50213 IDH3A
Q9BZ95 NSD3
Q92616 GCN1
P12277 KCRB
I'm trying to use extract and separate functions to do this:
extract(df, protein, into = c("uniprot", "gene"),
        regex = c("sp|(.*?)|", "(.*?)_"), remove = FALSE)
results in:
Error: is_string(regex) is not TRUE
Trying separate to at least break the two apart in multiple steps:
separate(df, protein, into = c("uniprot", "gene"), sep = "|", remove = FALSE)
results in:
Warning message:
Expected 2 pieces. Additional pieces discarded in 528 rows [1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
protein uniprot gene
1 sp|P50213|IDH3A_HUMAN s
2 sp|Q9BZ95|NSD3_HUMAN s
3 sp|Q92616|GCN1_HUMAN s
4 sp|Q9NSY1|BMP2K_HUMAN s
5 sp|O75643|U520_HUMAN s
6 sp|O15357|SHIP2_HUMAN s
What is the best way to use regex in this scenario, and is extract or separate the right tool to go about this? Any suggestion would be greatly appreciated. Thanks!
Update based on feedback:
df <- structure(list(protein = c("sp|P50213|IDH3A_HUMAN", "sp|Q9BZ95|NSD3_HUMAN",
"sp|Q92616|GCN1_HUMAN", "sp|Q9NSY1|BMP2K_HUMAN", "sp|O75643|U520_HUMAN",
"sp|O15357|SHIP2_HUMAN", "sp|P10599|THIO_HUMAN,sp|THIO_HUMAN|",
"sp|Q96KB5|TOPK_HUMAN", "sp|P12277|KCRB_HUMAN,sp|P17540|KCRS_HUMAN,sp|P12532|KCRU_HUMAN",
"sp|O00299|CLIC1_HUMAN")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "523", "524", "525", "526"))
df1 <- separate(df, protein, into = "protein", sep = ",")
#i'm only interested in the first match, because science
df2 <- extract(df1, protein, into = c("uniprot", "gene"),
               regex = "sp\\|([^|]+)\\|([^_]+)", remove = FALSE)
#create new columns with uniprot code and gene id, no _HUMAN
#df2
# protein uniprot gene
#1 sp|P50213|IDH3A_HUMAN P50213 IDH3A
#2 sp|Q9BZ95|NSD3_HUMAN Q9BZ95 NSD3
#3 sp|Q92616|GCN1_HUMAN Q92616 GCN1
#4 sp|Q9NSY1|BMP2K_HUMAN Q9NSY1 BMP2K
#5 sp|O75643|U520_HUMAN O75643 U520
#6 sp|O15357|SHIP2_HUMAN O15357 SHIP2
#523 sp|P10599|THIO_HUMAN P10599 THIO
#524 sp|Q96KB5|TOPK_HUMAN Q96KB5 TOPK
#525 sp|P12277|KCRB_HUMAN P12277 KCRB
#526 sp|O00299|CLIC1_HUMAN O00299 CLIC1
#and the answer using %>% pipes (this is what I aspire to)
df_filtered <- df %>%
separate(protein, into = "protein", sep = ",") %>%
extract(protein, into = c("uniprot", "gene"), regex = "sp\\|([^|]+)\\|([^_]+)") %>%
select(uniprot, gene)
#df_filtered
# uniprot gene
#1 P50213 IDH3A
#2 Q9BZ95 NSD3
#3 Q92616 GCN1
#4 Q9NSY1 BMP2K
#5 O75643 U520
#6 O15357 SHIP2
#523 P10599 THIO
#524 Q96KB5 TOPK
#525 P12277 KCRB
#526 O00299 CLIC1
We can capture the pattern as a group ((...)) in extract. Here, we match sp at the beginning (^) of the string, followed by a | (a metacharacter, so it is escaped with \\), followed by one or more characters that are not a |, captured as the first group, followed by another | and a second set of characters captured as the second group.
library(tidyverse)
extract(df, protein, into = c("uniprot", "gene"),
regex = "^sp\\|([^|]+)\\|([^|]+).*")
If there are multiple instances of 'sp', then separate the rows into long format with separate_rows and then use extract
df %>%
separate_rows(protein, sep=",") %>%
extract(protein, into = c("uniprot", "gene"),
"^sp\\|([^|]+)\\|([^|]*).*")
There is one instance where the entry has only two fields. To make it work:
df %>%
separate_rows(protein, sep=",") %>%
extract(protein, into = "gene", "([^|]*HUMAN)", remove = FALSE) %>%
mutate(uniprot = str_extract(protein, "(?<=sp\\|)[^_]+(?=\\|)")) %>%
select(uniprot, gene)
# uniprot gene
#1 P50213 IDH3A_HUMAN
#2 Q9BZ95 NSD3_HUMAN
#3 Q92616 GCN1_HUMAN
#4 Q9NSY1 BMP2K_HUMAN
#5 O75643 U520_HUMAN
#6 O15357 SHIP2_HUMAN
#7 P10599 THIO_HUMAN
#8 <NA> THIO_HUMAN
#9 Q96KB5 TOPK_HUMAN
#10 P12277 KCRB_HUMAN
#11 P17540 KCRS_HUMAN
#12 P12532 KCRU_HUMAN
#13 O00299 CLIC1_HUMAN
data
df <- structure(list(protein = c("sp|P50213|IDH3A_HUMAN", "sp|Q9BZ95|NSD3_HUMAN",
"sp|Q92616|GCN1_HUMAN", "sp|Q9NSY1|BMP2K_HUMAN", "sp|O75643|U520_HUMAN",
"sp|O15357|SHIP2_HUMAN", "sp|P10599|THIO_HUMAN,sp|THIO_HUMAN|",
"sp|Q96KB5|TOPK_HUMAN", "sp|P12277|KCRB_HUMAN,sp|P17540|KCRS_HUMAN,sp|P12532|KCRU_HUMAN",
"sp|O00299|CLIC1_HUMAN")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "523", "524", "525", "526"))
There is a basic width: xxxx.xxxxxx (4 digits before the "." and 6 digits after it).
I have to add "0" on whichever side, before or after the ".", that does not have enough digits.
Using regexpr to find the "[.]" location in combination with str_pad, I can fix the first 4 digits, but I don't know how to add values after a specific character up to a fixed number of digits (I cannot find a function that can count the location from a specified point).
Data like this
> df
Category
1 300.030340
2 3400.040290
3 700.07011
4 1700.0901
5 700.070114
6 700.0791
7 3600.05059
8 4400.0402
Desired data
> df
Category
1 0300.030340
2 3400.040290
3 0700.070110
4 1700.090100
5 0700.070114
6 0700.079100
7 3600.050590
8 4400.040200
I am a beginner at coding and sometimes can't understand some regex such as "[", etc. Some explanation of them would be super helpful.
Also, I have a combination like this:
df$Category <- ifelse(regexpr("[.]", df$Category) == 4,
                      paste("0", df$Category, sep = ""), df$Category)
df$Category <- str_pad(df$Category, 11, side = "right", pad = "0")
I'd like to know whether there is a better way to do this, especially how to count and return the location from the END of the string until a specific character appears.
Using formatC:
df$Category <- formatC(as.numeric(df$Category), format = 'f', width = 11, flag = '0', digits = 6)
# > df
# Category
# 1 0300.030340
# 2 3400.040290
# 3 0700.070110
# 4 1700.090100
# 5 0700.070114
# 6 0700.079100
# 7 3600.050590
# 8 4400.040200
format = 'f': formatting doubles;
width = 11: 4 digits before . + 1 . + 6 digits after .;
flag = '0': pads leading zeros;
digits = 6: the desired number of digits after the decimal point (format = "f");
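For instance, checking a single value directly (the value is taken from the example data, with the expected output shown as a comment):
formatC(300.03034, format = 'f', width = 11, flag = '0', digits = 6)
# [1] "0300.030340"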
The input df seems to be a character data.frame:
structure(list(Category = c("300.030340", "3400.040290", "700.07011",
"1700.0901", "700.070114", "700.0791", "3600.05059", "4400.0402"
)), .Names = "Category", row.names = c(NA, -8L), class = "data.frame")
We can use sprintf
df$Category <- sprintf("%011.6f", df$Category)
df
# Category
#1 0300.030340
#2 3400.040290
#3 0700.070110
#4 1700.090100
#5 0700.070114
#6 0700.079100
#7 3600.050590
#8 4400.040200
data
df <- structure(list(Category = c(300.03034, 3400.04029, 700.07011,
1700.0901, 700.070114, 700.0791, 3600.05059, 4400.0402)),
.Names = "Category", class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))
There are plenty of great tricks, functions, and shortcuts to be learned, and I would encourage you to explore them all! For example, if you're trying to win code golf, you will want to use @akrun's sprintf() approach. Since you stated you're a beginner, it might be more helpful to break the problem down into its component parts. One transparent and, in my opinion, easy-to-follow approach would be to utilize the stringr package:
library(stringr)
location_of_dot <- str_locate(df$Category, "\\.")[, 1]
substring_left_of_dot <- str_sub(df$Category, end = location_of_dot - 1)
substring_right_of_dot <- str_sub(df$Category, start = location_of_dot + 1)
pad_left <- str_pad(substring_left_of_dot, 4, side = "left", pad = "0")
pad_right <- str_pad(substring_right_of_dot, 6, side = "right", pad = "0")
result <- paste0(pad_left, ".", pad_right)
result
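If the padded strings look right, the last step is just to write them back into the data frame (a one-line follow-up):
# Overwrite the original column with the zero-padded strings
df$Category <- result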
Use separate from tidyr to split Category on the decimal point. Use str_pad from stringr to add zeros at the front or back, then paste the pieces back together.
library(tidyr)   # to separate columns on the decimal
library(dplyr)   # for mutate and pipes
library(stringr) # for str_pad
input_data <- read.table(text =" Category
1 300.030340
2 3400.040290
3 700.07011
4 1700.0901
5 700.070114
6 700.0791
7 3600.05059
8 4400.0402", header = TRUE, stringsAsFactors = FALSE) %>%
separate(Category, into = c("col1", "col2")) %>%
mutate(col1 = str_pad(col1, width = 4, side= "left", pad ="0"),
col2 = str_pad(col2, width = 6, side= "right", pad ="0"),
Category = paste(col1, col2, sep = ".")) %>%
select(-col1, -col2)
I'm doing a handful of transformation steps for several dfs, so I have ventured into the beautiful world of apply, lapply, sweep, etc. Unfortunately I got stuck trying to use sweep on a list of dfs.
What I would like to do, is calculate the percentage of each value, based on the mean of each data frame's first row.
So I put my dfs into a list which ends up looking something like this;
df1 <- read.table(header = TRUE, text = "a b
1 16.26418 19.60232
2 16.09745 18.44320
3 17.25242 18.21141
4 17.61503 17.64766
5 18.35453 19.52620")
df2 <- read.table(header = TRUE, text = "a b
1 4.518654 4.346056
2 4.231176 4.175854
3 2.658694 4.999478
4 3.348019 2.345594
5 3.103378 2.556690")
list.one <- list(df1,df2)
> list.one
[[1]]
a b
1 16.26418 19.60232
2 16.09745 18.44320
3 17.25242 18.21141
4 17.61503 17.64766
5 18.35453 19.52620
[[2]]
a b
1 4.518654 4.346056
2 4.231176 4.175854
3 2.658694 4.999478
4 3.348019 2.345594
5 3.103378 2.556690
Now I calculate the mean of each first row and store it
one.hundred <- lapply(list.one, function(i)
{rowMeans(i[1,], na.rm=T)})
> one.hundred
[[1]]
1
17.93325
[[2]]
1
4.432355
Now I calculate their percentages (relative to the values stored in the second list), and the best I came up with is this rather tedious workaround:
df1.per <- sweep(list.one[[1]], 1, one.hundred[[1]],
                 function(x, y) {100 / y * x})
df2.per <- sweep(list.one[[2]], 1, one.hundred[[2]],
                 function(x, y) {100 / y * x})
list.new <- list(df1.per, df2.per)
If somebody could suggest a simpler, preferably list-based solution, that would be a great help.
Thanks a lot.
Here's another approach with sapply and Map that will also return a list of data.frames:
means <- sapply(list.one, function(df) rowMeans(df[1, ], na.rm = TRUE))
Map(function(vec, df) df/vec*100, means, list.one)
#$`1`
# a b
#1 90.69287 109.30713
#2 89.76315 102.84360
#3 96.20353 101.55109
#4 98.22553 98.40748
#5 102.34916 108.88266
#
#$`1`
# a b
#1 101.94702 98.05298
#2 95.46113 94.21299
#3 59.98378 112.79507
#4 75.53589 52.91981
#5 70.01646 57.68243
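If you would rather keep the sweep() call from your workaround, the same Map() idea wraps it directly into a list-based one-liner (a sketch reusing list.one and one.hundred from the question):
# Apply each data frame's own first-row mean via sweep(), element-wise across the list
Map(function(d, m) sweep(d, 1, m, function(x, y) 100 / y * x),
    list.one, one.hundred)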
data:
> dput(list.one)
list(structure(list(a = c(16.26418, 16.09745, 17.25242, 17.61503,
18.35453), b = c(19.60232, 18.4432, 18.21141, 17.64766, 19.5262
)), .Names = c("a", "b"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5")), structure(list(a = c(4.518654, 4.231176,
2.658694, 3.348019, 3.103378), b = c(4.346056, 4.175854, 4.999478,
2.345594, 2.55669)), .Names = c("a", "b"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5")))
Very simple question. I am using an Excel sheet that has two rows for the column headings; how can I combine these two heading rows into one? Further, these headings don't start at the top of the sheet.
Thus, I have DF1
Temp Press Reagent Yield A Conversion etc
degC bar /g % %
1 2 3 4 5
6 7 8 9 10
and I want,
Temp degC Press bar Reagent /g Yield A % Conversion etc
1 2 3 4 5
6 7 8 9 10
Using colnames(DF1) returns the upper names, but getting the second line to merge with the upper one keeps eluding me.
Using your data, modified to quote text fields that contain the separator (get whatever tool you used to generate the file to quote text fields for you!)
txt <- "Temp Press Reagent 'Yield A' 'Conversion etc'
degC bar /g % %
1 2 3 4 5
6 7 8 9 10
"
The snippet of code below reads the file in two steps:
First we read the data, where skip = 2 means skip the first 2 lines (the header rows).
Next we read the data again, but only the first two lines; this output is then further processed by sapply(), where we paste(x, collapse = " ") the strings in the columns of the labs data frame. These are assigned as the names of dat.
Here is the code:
dat <- read.table(text = txt, skip = 2)
labs <- read.table(text = txt, nrows = 2, stringsAsFactors = FALSE)
names(dat) <- sapply(labs, paste, collapse = " ")
dat
names(dat)
The code, when run, produces:
> dat <- read.table(text = txt, skip = 2)
> labs <- read.table(text = txt, nrows = 2, stringsAsFactors = FALSE)
> names(dat) <- sapply(labs, paste, collapse = " ")
>
> dat
Temp degC Press bar Reagent /g Yield A % Conversion etc %
1 1 2 3 4 5
2 6 7 8 9 10
> names(dat)
[1] "Temp degC" "Press bar" "Reagent /g"
[4] "Yield A %" "Conversion etc %"
In your case, you'll want to modify the read.table() calls to point at the file on your file system, so use file = "foo.txt" in place of text = txt in the code chunk, where "foo.txt" is the name of your file.
Also, if these headings don't start at the top of the file, then increase skip to 2+n where n is the number of lines before the two header rows. You'll also need to add skip = n to the second read.table() call which generates labs, where n is again the number of lines before the header lines.
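For example, if the file is called "foo.txt" (a placeholder name) and there are, say, 3 lines above the two header rows, the adjusted calls might look like this:
n <- 3  # assumed number of lines before the two header rows
dat  <- read.table(file = "foo.txt", skip = 2 + n)
labs <- read.table(file = "foo.txt", skip = n, nrows = 2, stringsAsFactors = FALSE)
names(dat) <- sapply(labs, paste, collapse = " ")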
This should work. You only need to set stringsAsFactors = FALSE when reading the data.
data <- structure(list(Temp = c("degC", "1", "6"), Press = c("bar", "2",
"7"), Reagent = c("/g", "3", "8"), Yield.A = c("%", "4", "9"),
Conversion = c("%", "5", "10")), .Names = c("Temp", "Press",
"Reagent", "Yield.A", "Conversion"), class = "data.frame", row.names = c(NA,
-3L)) # Your data
colnames(data) <- paste(colnames(data), data[1, ]) # Set new names
data <- data[-1, ] # Remove first line
data <- data.frame(apply(data, 2, as.numeric)) # Correct the classes (works only if all columns are numeric)
Just load your file with read.table(file, header = FALSE, stringsAsFactors = FALSE). Then you can use grep to find the position where the header rows appear.
df <- data.frame(V1=c(sample(10), "Temp", "degC"),
V2=c(sample(10), "Press", "bar"),
V3 = c(sample(10), "Reagent", "/g"),
V4 = c(sample(10), "Yield_A", "%"),
V5 = c(sample(10), "Conversion", "%"),
stringsAsFactors=F)
idx <- unique(c(grep("Temp", df$V1), grep("degC", df$V1)))
df2 <- df[-(idx), ]
names(df2) <- sapply(df[idx, ], function(x) paste(x, collapse=" "))
Here, if you want, you can then convert all the columns to numeric as follows:
df2 <- as.data.frame(sapply(df2, as.numeric))