How to extract a specific string in R and put it into another column? - r

I have data like this; below are 3 rows from my data set:
total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;
Expected output is as below:
total   free    used    shared  buffers cached
7871MB  5711MB  2159MB  0MB     304MB   1059MB
5751MB  71MB    5MB     3159MB  30MB    1059MB
5751MB  109MB   5MB     3159MB  30MB    1059MB
The problem is that I want to make separate columns from this data: total value, free value, used value, shared value, and so on.
I can do that by splitting on ;, but in other rows the values are shuffled, e.g. the first field comes as free and then total, followed by the other values.
Is there any way using regex in R so that, if we find total, we take its value up to the ; and put it into one column; if we find free, we take its value and put it into another column, and so on?

Here is one possibility using strsplit (using the first row as x):
x <- "total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;"
df <- as.data.frame(matrix(unlist(lapply(strsplit(x, ";"), strsplit, "=")), nrow = 2))
colnames(df) = df[1,]
df = df[-1,]
df
# total free used shared buffers cached
# 2 7871MB 5711MB 2159MB 0MB 304MB 1059MB
Edit
I don't know how your data are structured. But you can do something like the following:
x <- "total=7871MB;free=5711MB;used=2159MB;shared=0MB; buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;"
library(dplyr)
library(stringr)

x %>% str_split("\n") %>% unlist() %>% as_tibble() %>%
  mutate(total   = str_extract(value, "total=(.*?)MB;"),
         free    = str_extract(value, "free=(.*?)MB;"),
         used    = str_extract(value, "used=(.*?)MB;"),
         shared  = str_extract(value, "shared=(.*?)MB;"),
         buffers = str_extract(value, "buffers=(.*?)MB;"),
         cached  = str_extract(value, "cached=(.*?)MB;")) %>%
  select(-value) %>%
  mutate_all(~as.numeric(str_extract(., "[[:digit:]]+")))
# # A tibble: 3 x 6
# total free used shared buffers cached
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 7871. 5711. 2159. 0. 304. 1059.
# 2 5751. 71. 5. 3159. 30. 1059.
# 3 5751. 109. 5. 3159. 30. 1059.

We can try using strsplit followed by gsub to separate the values from the labels. Then, create a data frame using this data:
x <- 'total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;'
y <- unlist(strsplit(x, ';'))
names <- sapply(y, function(x) gsub("=.*$", "", x))
data <- sapply(y, function(x) gsub(".*=", "", x, perl=TRUE))
df <- data.frame(names=names, data=data)
df
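This handles a single row; a rough sketch for extending it to several rows (assuming the strings sit in a character vector rows) is to build one named row per string and bind them, so the labels end up as column names regardless of their order in the input:
rows <- c('total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;',
          'free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;')
wide <- do.call(rbind, lapply(rows, function(r) {
  y <- unlist(strsplit(r, ';'))
  vals <- gsub('.*=', '', y)            # the values, e.g. "7871MB"
  names(vals) <- gsub('=.*$', '', y)    # the labels, e.g. "total"
  as.data.frame(as.list(vals[c('total', 'free', 'used', 'shared', 'buffers', 'cached')]))
}))
wide
#    total   free   used shared buffers cached
# 1 7871MB 5711MB 2159MB    0MB   304MB 1059MB
# 2 5751MB   71MB    5MB 3159MB    30MB 1059MB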

Related

How do I find the most common words in a character vector in R?

I am analysing some fMRI data – in particular, I am looking at what sorts of cognitive functions are associated with coordinates from an fMRI scan (conducted while subjects were performing a task). My data can be obtained with the following function:
library(httr)
scrape_and_sort = function(neurosynth_link){
  result = content(GET(neurosynth_link), "parsed")$data
  names = c("Name", "z_score", "post_prob", "func_con", "meta_analytic")
  df = do.call(rbind, lapply(result, function(x) setNames(as.data.frame(x), names)))
  df$z_score = as.numeric(df$z_score)
  df = df[order(-df$z_score), ]
  df = df[-which(df$z_score < 3), ]
  df = na.omit(df)
  return(df)
}
RO4 = scrape_and_sort('https://neurosynth.org/api/locations/-58_-22_6_6/compare')
Now, I want to know which keywords are coming up most often, and ideally construct a list of the most common words. I tried the following:
sort(table(RO4$Name),decreasing=TRUE)
But this clearly won't work. The problem is that the names (for example: "auditory cortex") are strings containing multiple words, so results such as 'auditory' and 'auditory cortex' come out as two separate entries, whereas I want them counted as two instances of 'auditory'.
But I am not sure how to search inside each string and record individual words like that. Any ideas?
Using packages {jsonlite}, {dplyr} and {tidyr}, plus the pipe operator %>% for legibility:
library(dplyr)
library(tidyr)
## store the response as dataframe df:
url <- 'https://neurosynth.org/api/locations/-58_-22_6_6/compare/'
df <- jsonlite::fromJSON(url) %>% as.data.frame
## reshape and aggregate:
df %>%
  ## keep first column only and name it 'keywords':
  select('keywords' = 1) %>%
  ## split multiple cell values (separated by a blank)
  ## into separate rows:
  separate_rows(keywords, sep = " ") %>%
  group_by(keywords) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
result:
# A tibble: 965 x 2
   keywords count
   <chr>    <int>
 1 cortex      53
 2 gyrus       26
 3 temporal    26
 4 parietal    23
 5 task        22
 6 anterior    19
 7 frontal     18
 8 visual      17
 9 memory      16
10 motor       16
# ... with 955 more rows
Edit: or, if you want to proceed from your dataframe:
RO4 %>%
  select(Name) %>%
  ## select(everything())
  ## select(Name:func_con)
  separate_rows(Name, sep = ' ') %>%
  ## do remaining stuff
You can of course select more columns in a number of convenient ways (see the commented lines above and ?dplyr::select). Mind that values of the other variables will be repeated as many times as needed to accommodate any multi-word value in column "Name", so that will introduce some redundancy.
If you want to adopt {dplyr} style, arranging by descending z-score and excluding unwanted z-scores would read like this:
RO4 %>%
filter(z_score >= 3 & !is.na(z_score)) %>%
arrange(desc(z_score))
Not sure I understand. Can't you proceed like this:
x <- c("auditory cortex", "auditory", "auditory", "hello friend")
unlist(strsplit(x, " "))
# "auditory" "cortex" "auditory" "auditory" "hello" "friend"

Complicated string split

I have a data frame where some values for "revenue" are listed in the hundreds, say "300," and others are listed as "1.5k." Obviously this is annoying, so I need to find some way of splitting the "k" and "." characters from those values and only those values. Any thoughts?
Another way to do this is just with regex (and the tidyverse for pipes):
library(tidyverse)
string <- c("300", "1.5k")
string %>% ifelse(
  # check if string ends in k (upper/lower case)
  grepl("[kK]$", .),
  # if string ends in k, remove it and multiply by 1000
  1000 * as.numeric(gsub("[kK]$", "", .)),
  .) %>% as.numeric()
[1] 300 1500
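In case the dot placement looks odd: because . appears as a direct argument of ifelse(), magrittr substitutes string there instead of also passing it as the first argument, so the call above is equivalent to this unpiped sketch with the same toy vector:
as.numeric(ifelse(grepl("[kK]$", string),
                  1000 * as.numeric(gsub("[kK]$", "", string)),
                  string))
# [1]  300 1500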
You could create a function that removes "k", converts to a numeric vector and multiplies by 1,000 (stringr and dplyr are assumed to be loaded, e.g. via library(tidyverse)):
to_1000 <- function(x){
  x %>%
    str_remove("k") %>%
    as.numeric() %>%
    {. * 1000}
}
x <- c("3000","1.5k")
tibble(x) %>%
  mutate(x_num = if_else(str_detect(x, "k"), to_1000(x), as.numeric(x)))
# A tibble: 2 x 2
  x     x_num
  <chr> <dbl>
1 3000   3000
2 1.5k   1500

Why does add_column assign a letter to the data?

I tried reading through R's documentation on the add_column function, but I'm a little confused as to the examples it provides. See below:
# add_column ---------------------------------
df <- tibble(x = 1:3, y = 3:1)
df %>% add_column(z = -1:1, w = 0)
df %>% add_column(z = -1:1, .before = "y")
# You can't overwrite existing columns
try(df %>% add_column(x = 4:6))
# You can't create new observations
try(df %>% add_column(z = 1:5))
What is the purpose of these letters that are being assigned a range? Eg:
z = 1:5
My understanding from the documentation is that add_column() takes in a data frame and appends the new column(s) in position based on the .before and .after arguments, defaulting to the end of the data frame.
I'm a little confused here. There is also a "..." argument that takes in Name-value pairs. Is that what I'm seeing with "z = 1:5"? What is the functional purpose of this?
data.frame columns always have a name in R, no exception.
Since add_column adds new columns, you need to specify names for these columns.
… well, technically you don’t need to. The following works:
df %>% add_column(1 : 3)
But add_column auto-generates the column name based on the expression you pass it, and you might not like the result (in this case, it’s literally 1:3, which isn’t a convenient name to work with).
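For instance, with the df from above it should print something like the following (the auto-generated name is the backticked expression itself):
df %>% add_column(1:3)
# # A tibble: 3 x 3
#       x     y `1:3`
#   <int> <int> <int>
# 1     1     3     1
# 2     2     2     2
# 3     3     1     3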
Conversely, the following also works and is perfectly sensible:
z = 1 : 3
df %>% add_column(z)
Result:
# A tibble: 3 x 3
x y z
<int> <int> <int>
1 1 3 1
2 2 2 2
3 3 1 3
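As for the .before/.after arguments the question mentions: they only control where the new column is placed. A small sketch using the documentation example (output approximate):
df %>% add_column(z = -1:1, .before = "y")
# # A tibble: 3 x 3
#       x     z     y
#   <int> <int> <int>
# 1     1    -1     3
# 2     2     0     2
# 3     3     1     1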

How to count unique occurrences of data saved in a multi-column table?

I have a table with 3 columns and circa 14,000 rows. I want to count every occurrence of each distinct type of row.
I am a newbie in R, so I can't really come up with a solution to extract this from the table.
I managed to list all the different values in a single column with levels(), but can't really make it work.
The table has three columns (Type, Protocol, Protocol.1). My expected result:
IPV4|UDP|UDP: 120 times
IPV4|UDP|SSDP: 60 times
...
With some sample data that looks like this:
tst <- data.frame(Type = c("IPV4", " ", "IPV4", "IPV4"), Protocol = c("UDP", " ", "UDP", "UDP"), Protocol.1 = c("SSDP", " ", "UDP", "UDP"))
You could get tallies as follows using tools from the tidyverse (dplyr, magrittr).
tst_summary <- tst %>%
  mutate(class_var = paste(Type, Protocol, Protocol.1, sep = "|")) %>%
  group_by(class_var) %>%
  tally() %>% as.data.frame()
# # A tibble: 3 x 2
#   class_var         n
#   <chr>         <int>
# 1 " | | "           1
# 2 IPV4|UDP|SSDP     1
# 3 IPV4|UDP|UDP      2
What we're doing here is concatenating the strings from all the different columns (that you want to use to group/classify) together into the contents of a single column class_var using paste() (mutate() creates this new class_var column). Then we can group the data (group_by) with this newly created column and tally the occurrences with tally().
Getting a table with the original columns along with the generated counts can be done with a for loop and the str_split() function from stringr, as shown below.
tst_summary <- tst %>%
  mutate(class_var = paste(Type, Protocol, Protocol.1, sep = "|")) %>%
  group_by(class_var) %>%
  tally() %>% as.data.frame()
library(stringr)
for(i in 1:nrow(tst_summary)){
  parts <- unlist(str_split(tst_summary$class_var[i], "\\|"))
  tst_summary$Type[i] <- parts[1]
  tst_summary$Protocol[i] <- parts[2]
  tst_summary$Protocol.1[i] <- parts[3]
}
tst_summary <- tst_summary[, c(3,4,5,2)]
tst_summary
#   Type Protocol Protocol.1 n
# 1                          1
# 2 IPV4      UDP       SSDP 1
# 3 IPV4      UDP        UDP 2
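For what it's worth, a shorter route that keeps the original columns is dplyr::count(), which tallies unique combinations of the given columns directly; a sketch with the same tst data (output formatting approximate):
tst %>% count(Type, Protocol, Protocol.1)
#   Type Protocol Protocol.1 n
# 1                          1
# 2 IPV4      UDP       SSDP 1
# 3 IPV4      UDP        UDP 2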

Sum by aggregating complex paired names in R

In R, I'm trying to aggregate a dataframe based on unique IDs, BUT I need to use some kind of wild card value for the IDs. Meaning I have paired names like this:
lion_tiger
elephant_lion
tiger_lion
And I need the lion_tiger and tiger_lion IDs to be summed together, because the order in the pair does not matter.
Using this dataframe as an example:
df <- data.frame(pair = c("1_3","2_4","2_2","1_2","2_1","4_2","3_1","4_3","3_2"),
value = c("12","10","19","2","34","29","13","3","14"))
So the values for pair IDs, "1_2" and "2_1" need to be summed in a new table. That new row would then read:
1_2 36
Any suggestions? While my example has numbers as the pair IDs, in reality I would need this to work with text (like the "lion_tiger" example above).
We can split the 'pair' column on _, sort each pair, paste it back together, and use that as the grouping variable to get the sum:
tapply(as.numeric(as.character(df$value)),
       sapply(strsplit(as.character(df$pair), '_'), function(x)
         paste(sort(as.numeric(x)), collapse="_")), FUN = sum)
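With the sample df above, this should return a named vector along these lines:
#  1_2  1_3  2_2  2_3  2_4  3_4
#   36   25   19   14   39    3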
Or another option is gsubfn
library(gsubfn)
df$pair <- gsubfn('([0-9]+)_([0-9]+)', ~paste(sort(as.numeric(c(x, y))), collapse='_'),
                  as.character(df$pair))
df$value <- as.numeric(as.character(df$value))
aggregate(value~pair, df, sum)
Using tidyverse and purrrlyr
df <- data.frame(name = c("lion_tiger", "elephant_lion", "tiger_lion"),
                 value = c(1, 2, 3), stringsAsFactors = FALSE)
require(tidyverse)
require(purrrlyr)
df %>% separate(col = name, sep = "_", c("A", "B")) %>%
  by_row(.collate = "rows",
         ..f = function(this_row) {
           paste0(sort(c(this_row$A, this_row$B)), collapse = "_")
         }) %>%
  rename(sorted = ".out") %>%
  group_by(sorted) %>%
  summarize(sum(value)) %>% show
# # A tibble: 2 x 2
#   sorted        `sum(value)`
#   <chr>                <dbl>
# 1 elephant_lion            2
# 2 lion_tiger               4
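If you would rather avoid purrrlyr, a rough alternative under the same assumptions (dplyr and tidyr loaded, two-part names) is to sort the two parts with pmin()/pmax(), which compare character vectors element-wise:
df %>%
  separate(col = name, sep = "_", into = c("A", "B")) %>%
  mutate(sorted = paste(pmin(A, B), pmax(A, B), sep = "_")) %>%
  group_by(sorted) %>%
  summarize(value = sum(value))
# # A tibble: 2 x 2
#   sorted        value
#   <chr>         <dbl>
# 1 elephant_lion     2
# 2 lion_tiger        4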
