Extract specific text parts from a df column in R

I have a question about how to extract parts of a text and convert them to a data frame.
This is an example from my df: the content of one cell (one row of a single column).
[{"id"=>"aaaaaaaaaaaaaaaa", "effortDate"=>"2021-07-04T23:00:00.000Z", "effort"=>2, "author"=>"a:aa:a"}, {"id"=>"bbbbbbbbbbbbbb", "effortDate"=>"2021-07-11T23:00:00.000Z", "effort"=>1, "author"=>"b:bb:b"}, {"id"=>"ccccccccccccc", "effortDate"=>"2021-07-17T23:00:00.000Z", "effort"=>1, "author"=>"c:cc:c"}]
My expected output would be two columns, with as many rows as I get from this string:
effortDate
2021-07-04
2021-07-11
and a second column
effort
2
1
Any suggestion how to achieve that?
Thanks!

This looks like JSON content, but the => messes with the reading. If you replace it with :, you should be able to read it properly.
mystr <- '[{"id"=>"aaaaaaaaaaaaaaaa", "effortDate"=>"2021-07-04T23:00:00.000Z", "effort"=>2, "author"=>"a:aa:a"}, {"id"=>"bbbbbbbbbbbbbb", "effortDate"=>"2021-07-11T23:00:00.000Z", "effort"=>1, "author"=>"b:bb:b"}, {"id"=>"ccccccccccccc", "effortDate"=>"2021-07-17T23:00:00.000Z", "effort"=>1, "author"=>"c:cc:c"}]'
jsonlite::fromJSON(gsub("=>", ":", mystr))
#                 id               effortDate effort author
# 1 aaaaaaaaaaaaaaaa 2021-07-04T23:00:00.000Z      2 a:aa:a
# 2   bbbbbbbbbbbbbb 2021-07-11T23:00:00.000Z      1 b:bb:b
# 3    ccccccccccccc 2021-07-17T23:00:00.000Z      1 c:cc:c
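To get from the parsed data frame to the two columns the question asks for, one possible follow-up is sketched below. It assumes the jsonlite package is installed and that the date is always the first 10 characters of effortDate; the names parsed and result are only illustrative.

```r
library(jsonlite)

mystr <- '[{"id"=>"aaaaaaaaaaaaaaaa", "effortDate"=>"2021-07-04T23:00:00.000Z", "effort"=>2, "author"=>"a:aa:a"}, {"id"=>"bbbbbbbbbbbbbb", "effortDate"=>"2021-07-11T23:00:00.000Z", "effort"=>1, "author"=>"b:bb:b"}, {"id"=>"ccccccccccccc", "effortDate"=>"2021-07-17T23:00:00.000Z", "effort"=>1, "author"=>"c:cc:c"}]'

# Turn the Ruby-style "=>" into JSON's ":" (fixed = TRUE: literal match, no regex)
parsed <- fromJSON(gsub("=>", ":", mystr, fixed = TRUE))

# Keep only the two columns of interest; the date part is assumed to be
# the first 10 characters of the timestamp (YYYY-MM-DD)
result <- data.frame(
  effortDate = as.Date(substr(parsed$effortDate, 1, 10)),
  effort     = parsed$effort
)
result
```

Note that this keeps the calendar date exactly as written in the string; no time zone conversion is applied to the 23:00 UTC timestamps.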

Replace more than one word in a column with R

I am trying to replace all the names containing the word stocker in Job.tittle and store the result in a new column, job.title.2.
I tried to use gsub() without the expected result.
My data.frame looks like this:
x<- data.frame(Job.tittle=c("DW Overnight Stockers", "Checkers","TH Stockers", "CM Midland Stockers"), Head.counts=c(100,50,100,200))
Thank you
I tried this: x$job.tittle.2<-gsub("\bDW Overnight Stockers\w+","Stocker",x$Job.tittle)
and it did not work.
Here you go. Using a regex, this takes any string that contains the word "stocker" or "stockers", in either upper or lower case, anywhere in the string, and replaces the whole string with "Stocker".
x$job.title.2 <- gsub(".*stockers?.*", "Stocker", x$Job.tittle, ignore.case = TRUE)
x
             Job.tittle Head.counts job.title.2
1 DW Overnight Stockers         100     Stocker
2              Checkers          50    Checkers
3           TH Stockers         100     Stocker
4   CM Midland Stockers         200     Stocker
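The same result can also be reached by testing for a match first and then choosing the replacement, which some readers find easier to follow than a pattern padded with .* on both sides. This is only an alternative sketch; the helper name is_stocker is made up for illustration.

```r
x <- data.frame(
  Job.tittle  = c("DW Overnight Stockers", "Checkers", "TH Stockers", "CM Midland Stockers"),
  Head.counts = c(100, 50, 100, 200),
  stringsAsFactors = FALSE  # keep titles as character on older R versions too
)

# TRUE where the title contains "stocker" or "stockers", ignoring case
is_stocker <- grepl("stockers?", x$Job.tittle, ignore.case = TRUE)

# Replace matching titles with "Stocker", keep all others unchanged
x$job.title.2 <- ifelse(is_stocker, "Stocker", x$Job.tittle)
x
```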

Text Processing : extract fixed number of numbers from text

I am trying the following:
library(stringr)
gg <-c("delete from below 110 11031133 11 11031135 110",
"delete froml #10989431 from adfdaf 10888022 <(>&<)> 10888018",
"this is for the deletion of an incorrect numberss that is no longer used for asd09 and sd040",
"please delete the following mangoes from trey 10246211 1 10821224 1 10821248 1 10821249",
"from 11015647 helppp 1 na from 0050 - zfhhhh 10840637 1")
pattern_to_find <- c('\\d{4,}')
aa <- str_extract_all(gg, pattern_to_find)
aa
With this code I am able to extract any run of digits that is at least a fixed number of digits long. But if I want to extract two-digit numbers, the pattern below also picks up two digits at a time from the longer numbers.
pattern_to_find <- c('\\d{2}')
How can I modify my pattern to work both ways?
Regards,
R
Tidyverse solution:
library(tidyverse)
pattern_to_find <- c('\\d{2,}')
aa <- str_extract_all(gg, pattern_to_find)
Base R solution:
base_aa <- regmatches(gg, gregexpr(pattern_to_find, gg))
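If the goal is instead to pull out only the numbers that are exactly two digits long (one possible reading of the question), a lookaround pattern is one way to do it. The sketch below uses base R with perl = TRUE; the sample vector is a shortened copy of gg, and the names pattern_exact and two_digit are illustrative.

```r
# Shortened sample based on the question's data
gg_sample <- c(
  "delete from below 110 11031133 11 11031135 110",
  "delete froml #10989431 from adfdaf 10888022",
  "numberss that is no longer used for asd09 and sd040"
)

# Exactly two digits: not preceded and not followed by another digit
pattern_exact <- "(?<!\\d)\\d{2}(?!\\d)"

# perl = TRUE is required for lookbehind/lookahead in base R
two_digit <- regmatches(gg_sample, gregexpr(pattern_exact, gg_sample, perl = TRUE))
two_digit
```

With this pattern "11" is extracted from the first string and "09" from the third, while longer runs such as 110 or 11031133 are skipped entirely. If digits glued to letters (like asd09) should not count, \\b word boundaries could be used in place of the lookarounds.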

order() function gives wrong order for characters in R

xx = c("calculated_p3", "calculated_c1" ,"calculated_p2" ,"calculated_c2", "calculated_d2",
"calculated_d3", "calculated_c3", "calculated_p1" ,"calculated_d1")
order(xx)
The output is: 2 4 7 9 5 6 8 3 1
Why is the "calculated_d1" ordered as the first element? And why is "calculated_c2" ordered as the 9th element? I don't understand here. Shouldn't "calculated_c1" be the first one?
Thank you for your help
order is written such that xx[order(xx)] is the same as sort(xx).
The numbers don't refer to the position that each entry should go to but rather the position the entries should come from if they were in order.
calculated_c1 should indeed be the first one. As it is in position 2, the first number is therefore a 2.
If you want to keep your order you can use factors:
factor(xx, xx)
[1] calculated_p3 calculated_c1 calculated_p2 calculated_c2 calculated_d2 calculated_d3 calculated_c3 calculated_p1
[9] calculated_d1
9 Levels: calculated_p3 calculated_c1 calculated_p2 calculated_c2 calculated_d2 calculated_d3 ... calculated_d1
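The difference between "where each entry comes from" and "where each entry goes to" can be checked directly. In this small sketch, idx and positions are illustrative names; applying order() to the result of order() gives the inverse permutation, which is the position each original element ends up in.

```r
xx <- c("calculated_p3", "calculated_c1", "calculated_p2", "calculated_c2", "calculated_d2",
        "calculated_d3", "calculated_c3", "calculated_p1", "calculated_d1")

# order() says which original position supplies each slot of the sorted vector
idx <- order(xx)
idx
# 2 4 7 9 5 6 8 3 1

# Indexing with it reproduces sort()
identical(xx[idx], sort(xx))
# TRUE

# The inverse permutation: the sorted position each original element goes to
positions <- order(idx)
positions
# 9 1 8 2 5 6 3 7 4
```

So "calculated_c1" (element 2) goes to position 1, and "calculated_d1" (element 9) goes to position 4, which is what the question expected order() to report.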

How do I subset a list with mixed data type and data structure?

I have a list which includes a mix of data types (character) and data structures (dataframe).
I want to keep only the dataframes and remove the rest.
> head(list)
[[1]]
[1] "/Users/Jane/R/12498798.txt error"
[[2]]
match
1 Japan arrests man for taking gun
2 Extradition bill turns ugly
file
1 /Users/Jane/R/12498770.txt
2 /Users/Jane/R/12498770.txt
[[3]]
[1] "/Users/Jane/R/12498780.txt error"
I expect the final list to contain only dataframes:
[[2]]
match
1 Japan arrests man for taking gun
2 Extradition bill turns ugly
file
1 /Users/Jane/R/12498770.txt
2 /Users/Jane/R/12498770.txt
Based on the example, it is possible that the other list elements are character vectors and that the OP wants to remove any element containing the 'error' substring:
list[!sapply(list, function(x) any(grepl("error$", x)))]
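If the requirement is literally "keep only the data frames", the elements can also be filtered by type rather than by their text. A minimal sketch, using a made-up list named lst (avoiding the name list, which masks the base function):

```r
# Hypothetical list mirroring the question: error strings mixed with a data frame
lst <- list(
  "/Users/Jane/R/12498798.txt error",
  data.frame(
    match = c("Japan arrests man for taking gun", "Extradition bill turns ugly"),
    file  = c("/Users/Jane/R/12498770.txt", "/Users/Jane/R/12498770.txt")
  ),
  "/Users/Jane/R/12498780.txt error"
)

# Keep only the elements for which is.data.frame() returns TRUE
only_dfs <- Filter(is.data.frame, lst)
length(only_dfs)
```

Note that the kept elements are renumbered, so the data frame that was [[2]] becomes [[1]] in the result.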

R "read.table" function gives only odd numbered columns while merging the even numbered columns

I am trying to read a TSV file in R using the read.table function.
myTable <- read.table("file_path", sep='\t', header=T)
But when I try the command
names(myTable)
It gives me column names which are odd numbered, while merging the even numbered columns with those.
[1] "GeneSymbol" "GSM480304_JK_C_05.07.mas5.chp"
[3] "GSM480355_JK_C_05.07.mas5.chp" "GSM480480_JK_C_05.07.mas5.chp"
[5] "GSM480555_JK_C_05.07.mas5.chp" "GSM480634_JK_C_05.07.mas5.chp"
These are exact column names and you can see that two column names are separated by space while only ODD numbered column names are listed.
The output should be like this:
[1] "GeneSymbol"
[2] "GSM480304_JK_C_05.07.mas5.chp"
[3] "GSM480355_JK_C_05.07.mas5.chp"
[4] "GSM480480_JK_C_05.07.mas5.chp"
[5] "GSM480555_JK_C_05.07.mas5.chp"
[6] "GSM480634_JK_C_05.07.mas5.chp"
This is creating a problem when assigning names to another table where I want to use these column names. Any suggestions?
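As noted in the comments, R is displaying all the columns, just not in the format you expect: the bracketed number at the start of each printed line is the index of the first element on that line, so [1], [3], [5] simply mean that two names are printed per line. To print one name per row, wrap the result of names() in as.data.frame() as follows: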
rawData <- "
Number,Name,Type1,Type2,Total,HP,Attack,Defense,SpecialAtk,SpecialDef,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
5,Charmeleon,Fire,,405,58,64,58,80,65,80,1,False
6,Charizard,Fire,Flying,534,78,84,78,109,85,100,1,False
6,CharizardMega Charizard X,Fire,Dragon,634,78,130,111,130,85,100,1,False
6,CharizardMega Charizard Y,Fire,Flying,634,78,104,78,159,115,100,1,False
7,Squirtle,Water,,314,44,48,65,50,64,43,1,False
8,Wartortle,Water,,405,59,63,80,65,80,58,1,False
9,Blastoise,Water,,530,79,83,100,85,105,78,1,False"
gen01 <- read.csv(text = rawData, header = TRUE)
as.data.frame(names(gen01))
...and the output:
> as.data.frame(names(gen01))
   names(gen01)
1        Number
2          Name
3         Type1
4         Type2
5         Total
6            HP
7        Attack
8       Defense
9    SpecialAtk
10   SpecialDef
11        Speed
12   Generation
13    Legendary
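Since names() already returns every column name, the vector can be used directly to name another table; the two-per-line display only affects printing. The sketch below uses a tiny made-up TSV (raw_tsv, sample_A, sample_B and other_table are all hypothetical) to show this, with writeLines() as a lighter way to print one name per line.

```r
# Hypothetical tab-separated input with three columns
raw_tsv <- "GeneSymbol\tsample_A\tsample_B\nTP53\t1.2\t3.4\nBRCA1\t5.6\t7.8\n"
my_table <- read.table(text = raw_tsv, sep = "\t", header = TRUE)

# names() returns all column names as one character vector
col_names <- names(my_table)
length(col_names)

# The vector can be assigned to another table with the same number of columns
other_table <- data.frame(V1 = c("x", "y"), V2 = 1:2, V3 = 3:4)
names(other_table) <- col_names

# Print one name per line without building a data frame
writeLines(col_names)
```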
