Is there any method: text to table with R?

I have strings that follow a pattern, for example:
> oc
[1] "for financial company payment manufacturer company payment distributor people payment other payment total payment 1 month payment 10 20 30 40 100 2 month payment 8 14 15 30 67 1 year payment 5 9 11 15 40"
The raw material is a table containing some noise, so I decided to extract the text from the table, clean and organize it with code, and then reshape it back into table form.
The raw material table looks like this:
for financial company payment | manufacturer company payment | distributor people payment | other..
1 m..| 10 20 30 ...
2 m..| 8 14 15 ...
1 y..| 5 9 11 ...
I would appreciate any method, so please leave a comment; it would be a great help to me.
What I tried so far: first, I used the extract_text function (from the tabulizer package),
second, I used regular expressions to tidy the strings,
and finally I used the scan function.
Again, any method is okay. Please leave any help you can. Thank you!
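For reference, a minimal sketch of the pipeline described in the question (the file name is hypothetical; extract_text() is from the tabulizer package):
library(tabulizer)
txt <- extract_text("report.pdf")                              # pull the raw text out of the PDF
txt <- gsub("\\s+", " ", txt)                                  # collapse newlines and repeated spaces
tokens <- scan(text = txt, what = character(), quiet = TRUE)   # split the cleaned string into words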

Here's a solution, anything but elegant, but it works:
Your data:
oc <- "for financial company payment manufacturer company payment distributor people payment other payment total payment 1 month payment 10 20 30 40 100 2 month payment 8 14 15 30 67 1 year payment 5 9 11 15 40"
First, split the string at " payment ":
oc <- strsplit(oc, " payment ")
Prepare a matrix to fill in the data:
mt <- matrix(NA, ncol = 5, nrow = 3)
Grab the relevant elements from oc as column names:
colnames(mt) <- oc[[1]][1:5]
Define the rownames:
rownames(mt) <- c("1 month", "2 month", "1 year")
Grab the numbers from oc:
numbers <- oc[[1]][7:9]
Clean numbers:
numbers <- gsub("( 2 month| 1 year)", "", numbers)
Now break numbers into individual numbers using str_extract_all from the stringr package:
library(stringr)
numbers <- str_extract_all(numbers, "\\d+")
Iterate over the rows in mt to fill in the numbers from numbers:
for (i in 1:nrow(mt)) {
  mt[i, ] <- numbers[[i]]
}
Finally redefine mt as a dataframe:
mt <- as.data.frame(mt)
Et voilà, the result:
mt
        for financial company manufacturer company distributor people other total
1 month                    10                   20                 30    40   100
2 month                     8                   14                 15    30    67
1 year                      5                    9                 11    15    40
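One caveat worth adding (not part of the original answer): because the matrix is filled with the character output of str_extract_all, as.data.frame() yields character (or factor) columns. A small sketch to convert them to numeric:
mt[] <- lapply(mt, function(x) as.numeric(as.character(x)))
str(mt)   # all five columns should now be numeric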

Related

Remove all rows of a category if zero in a % of cases

I have the following data set of weekly retail data ordered by Category (e.g. Chocolate), Brand (e.g. Cadbury's), and Week (1-208). CBX is a unique global identifier for each brand.
Category Brand Week Sales Price CBX
33 2 1 167650. 2.20 33 - 2
33 2 2 168044. 2.18 33 - 2
33 2 3 160770 2.24 33 - 2
I now want to remove the brands that have zero sales in more than 25% of the weeks (thus keeping brands with positive sales in at least 156 of the 208 weeks).
At first I deleted all brands with any zero sales using dplyr, but it deleted too much of the data. This was the code I used:
library(dplyr)
Final_df_ <- Final_df %>%
  group_by(Final_df$CBX) %>%
  filter(!any(Sales == 0 & Price == 0))
Now I'm trying to change the code so it only deletes all rows belonging to a brand (CBX) if the sales of that brand are zero in more than 25% of the cases.
This is how far I've come:
Final_df_ <- Final_df %>%
  group_by(Final_df$CBX) %>%
  filter(!((Final_df$Sales == 0) > 0.75))
Thank you!
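A sketch of one way to express the 25% rule with dplyr (an addition, assuming Sales is never negative): mean(Sales == 0) gives each brand's share of zero-sales weeks, so keeping the groups where that share is at most 0.25 matches the requirement.
library(dplyr)
Final_df_ <- Final_df %>%
  group_by(CBX) %>%
  filter(mean(Sales == 0) <= 0.25) %>%   # keep brands with zero sales in at most 25% of weeks
  ungroup()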

switching list elements with dataframe rows

Consider my list IDs, which contains a dataframe of behaviours for each individual:
IDs <- list(
  Dave = data.frame(Behaviour = c("Aggression", "Interaction", "Nursing"),
                    number = c(20, 10, 5),
                    duration = c(60, 39, 27)),
  James = data.frame(Behaviour = c("Aggression", "Interaction"),
                     number = c(21, 30),
                     duration = c(30, 49))
)
IDs
$Dave
Behaviour number duration
1 Aggression 20 60
2 Interaction 10 39
3 Nursing 5 27
$James
Behaviour number duration
1 Aggression 21 30
2 Interaction 30 49
Note that James does not exhibit any nursing behaviour, so the two list elements have different numbers of rows.
I want to switch the list elements with the dataframe rows so that I have a list of behaviours and a dataframe of ID. So that it looks like this:
$Aggression
ID number duration
1 Dave 20 60
2 James 21 30
$Interaction
ID number duration
1 Dave 10 39
2 James 30 49
$Nursing
ID number duration
1 Dave 5 27
I thought that it could be achieved with reshape2::melt, but I wasn't able to get further than melt(IDs, id = "Behaviour").
Any ideas?
Generally you can do it in two steps:
turning the list into a single data.frame/data.table
splitting it based on Behaviour
You can do it like this, for example:
dt <- data.table::rbindlist(IDs, idcol = "ID")
# or: dt <- dplyr::bind_rows(IDs, .id = "ID")
split(dt, dt$Behaviour)
Note:
If you don't want the Behaviour column in the result and you used the data.table approach, you can modify the split to:
split(dt[,!"Behaviour"], dt$Behaviour)
Try this:
tmp <- data.frame(ID = rep(names(IDs), vapply(IDs, nrow, 1L)),
                  do.call(rbind, IDs), row.names = NULL)
split(tmp[-2], tmp$Behaviour)
#$Aggression
# ID number duration
#1 Dave 20 60
#4 James 21 30
#$Interaction
# ID number duration
#2 Dave 10 39
#5 James 30 49
#$Nursing
# ID number duration
#3 Dave 5 27
Or using base R
d1 <- do.call(rbind, Map(cbind, id = names(IDs), IDs))
split(d1, d1$Behaviour)

aggregate value with time in data frame

I have the following data frame;
Date <- as.Date(c('2006-08-23', '2006-08-30', '2006-09-06', '2006-09-13', '2006-09-20'))
order <- c("buy", "buy", "sell", "buy", "buy")
cost <- c(10, 15, 12, 13, 8)
df <- data.frame(Date, order, cost)
df
Date order cost
1 2006-08-23 buy 10
2 2006-08-30 buy 15
3 2006-09-06 sell 12
4 2006-09-13 buy 13
5 2006-09-20 buy 8
How can I sum the cost column, taking into account the date and the order column, to obtain the new balance column in a new data frame like this one?
Date order cost balance
1 2006-08-23 buy 10 10
2 2006-08-30 buy 15 25
3 2006-09-06 sell 12 13
4 2006-09-13 buy 13 26
5 2006-09-20 buy 8 34
Assuming you have a sorted data frame, "cost" is an unfortunate label, since we have to attach a sign to it based on the buy/sell flag to get the actual cash flow.
df$cost[df$order == 'sell'] <- -df$cost[df$order == 'sell']
balance is then cumsum(df$cost).
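Put together, a minimal sketch of the full computation (this variant keeps the cost column intact by signing it on the fly instead of overwriting it):
df$balance <- cumsum(ifelse(df$order == "sell", -df$cost, df$cost))
df
#         Date order cost balance
# 1 2006-08-23   buy   10      10
# 2 2006-08-30   buy   15      25
# 3 2006-09-06  sell   12      13
# 4 2006-09-13   buy   13      26
# 5 2006-09-20   buy    8      34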

Extract before and after lines based on keyword in Pdf using R programming

I want to extract information related to the keyword "cancer" from a list of PDFs using R. I want to extract the lines (or paragraph) before and after each line containing the word "cancer" in the text.
abstracts <- lapply(mytxtfiles, function(i) {
  j <- paste0(scan(i, what = character()), collapse = " ")
  regmatches(j, gregexpr("(?m)(^[^\\r\\n]*\\R+){4}[cancer][^\\r\\n]*\\R+(^[^\\r\\n]*\\R+){4}", j, perl = TRUE))
})
The above regex is not working.
Here's one approach:
library(textreadr)
library(tidyverse)
loc <- function(var, regex, n = 1, ignore.case = TRUE){
  # indices of lines that match the keyword
  locs <- grep(regex, var, ignore.case = ignore.case)
  # add the line before and the line after each match
  out <- sort(unique(c(locs - 1, locs, locs + 1)))
  # drop indices that fall outside the document
  out <- out[out > 0]
  out[out <= length(var)]
}
doc <- 'https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf' %>%
  read_pdf() %>%
  slice(loc(text, 'cancer'))
doc
## page_id element_id text
## 1 24 28 Ranjit Shahani applauds the National Pharmaceuticals Policy's proposal of public/private
## 2 24 29 partnerships (PPPs) to tackle life-threatening diseases such as cancer and HIV/AIDS, but
## 3 24 30 stresses that, in order for them to work, they should be voluntary, and the government
## 4 25 8 the availability of medicines to treat life-threatening diseases. It notes, for example, that
## 5 25 9 while an average estimate of the value of drugs to treat the country's cancer patients is
## 6 25 10 $1.11 billion, the market is in fact worth only $33.5 million. “The big gap indicates the
## 7 25 12 because of the high cost of these medicines,” says the Policy, which also calls for tax and
## 8 25 13 excise exemptions for anti-cancer drugs.
## 9 25 14 Another area for which PPPs are proposed is for drugs to treat HIV/AIDS, India's biggest health
## 10 32 19 Variegate Trading, a UB subsidiary. The firm's major products are in the anti-infective,
## 11 32 20 anti-inflammatory, cancer, diabetes and allergy market segments and, for the year ended
## 12 32 21 December 31, 2005, it reported net sales (excluding excise duty) up 9.9 percent to $181.1
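The question mentions a list of PDFs; a sketch of applying the same idea across several files (pdf_files is a hypothetical character vector of paths or URLs):
library(textreadr)
library(dplyr)
pdf_files <- c("paper1.pdf", "paper2.pdf")   # hypothetical file paths
abstracts <- lapply(pdf_files, function(f) {
  read_pdf(f) %>%
    slice(loc(text, "cancer"))               # loc() as defined above
})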

From panel data to cross-sectional data using averages

I am very new to R so I am not sure how basic my question is, but I am stuck at the following point.
I have data that has a panel structure, similar to this
Country Year Outcome Country-characteristic
A 1990 10 40
A 1991 12 40
A 1992 14 40
B 1991 10 60
B 1992 12 60
For some reason I need to put this into a cross-sectional structure such that I get averages over all years for each country; in the end it should look like this:
Country Outcome Country-Characteristic
A 12 40
B 11 60
Has anybody faced a similar problem? I was playing with lapply(table$country, table$outcome, mean), but that did not work the way I wanted.
Two tips: 1- When you ask a question, you should provide a reproducible example for the data too (as I did with read.table below). 2- It's not a good idea to use "-" in column names. You should use "_" instead.
You can get a summary using the dplyr package:
df1 <- read.table(text="Country Year Outcome Countrycharacteristic
A 1990 10 40
A 1991 12 40
A 1992 14 40
B 1991 10 60
B 1992 12 60", header=TRUE, stringsAsFactors=FALSE)
library(dplyr)
df1 %>%
  group_by(Country) %>%
  summarize(Outcome = mean(Outcome), Countrycharacteristic = mean(Countrycharacteristic))
# A tibble: 2 x 3
Country Outcome Countrycharacteristic
<chr> <dbl> <dbl>
1 A 12 40
2 B 11 60
We can do this in base R with aggregate
aggregate(.~Country, df1[-2], mean)
# Country Outcome Countrycharacteristic
#1 A 12 40
#2 B 11 60
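For completeness, a data.table sketch of the same summary (an addition; assumes the data.table package is installed):
library(data.table)
setDT(df1)[, lapply(.SD, mean), by = Country, .SDcols = c("Outcome", "Countrycharacteristic")]
#    Country Outcome Countrycharacteristic
# 1:       A      12                    40
# 2:       B      11                    60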
