Data Cleaning in R

I have a CSV file from which I want to extract only the timestamps of the sentences that contain "toward", plus the fruit name in each such sentence. How can I do this in R (or, if there's a faster way, what is it)?
rosbagTimestamp,data
1438293900729698553,robot is in motion toward [strawberry]
1438293900730571638,Found a plan for avocado in 1.36400008202 seconds
1438293900731434815,current probability is greater than EXECUTION_THRESHOLD
1438293900731554567,ready to execute am original plan of len = 33
1438293900731586463,len of sub plan 1 = 24
1438293900731633713,len of sub plan 2 = 9
1438293900732910799,put in an execution request; now updating the dict
1438293900732949576,current_prediciton_item = avocado
1438293900733070339,current_item_probability = 0.880086981207
1438293901677787230,current probability is greater than PLANNING_THRESHOLD
1438293901681590725,robot is in motion toward [avocado]
1438293902689233770,we have received verbal request [avocado]
1438293902689314002,we already have a plan for the verbal request
1438293902689377800,debug
1438293902690529516,put in the final motion request
1438293902691076051,Found a plan for avocado in 1.95595788956 seconds
1438293902691084147,current predicted item != motion target; calc a new plan
1438293902691110642,current probability is greater than EXECUTION_THRESHOLD
1438293902691885974,have existing requests
1438293904496769068,robot is in motion toward [avocado]
1438293907737142498,ready to pick up the item
Ideally I want the output to be something like this:
1438293900729698553, strawberry
1438293901681590725, avocado
1438293904496769068, avocado
So apparently I have to use subset with grep in R, but I am not really sure how!

idx <- grep("toward \\[", df$data)                       # rows mentioning "toward ["
stamps <- df$rosbagTimestamp[idx]                        # their timestamps
fruits <- gsub(".*\\[(\\w+)\\].*", "\\1", df$data[idx])  # fruit name inside the brackets
data.frame(stamps, fruits)
stamps fruits
1 1438293900729698560 strawberry
2 1438293901681590784 avocado
3 1438293904496769024 avocado
I used the pattern "toward \\[" to locate the fruit lines; if the log format varies, the pattern can be extended. The stamps variable is created by selecting the timestamps whose data column matches the pattern, and the fruits variable isolates the fruit name inside the brackets.
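One caveat the answer doesn't mention: 19-digit timestamps exceed the integer precision of R doubles (2^53), which is why the stamps printed above differ in their last digits from the input. A minimal sketch of a workaround, assuming the file is named log.csv (hypothetical name): read the timestamp column as character.
df <- read.csv("log.csv", colClasses = c(rosbagTimestamp = "character",
                                         data = "character"))
idx <- grep("toward \\[", df$data)
# the timestamps keep full precision because they were never coerced to numeric
data.frame(stamps = df$rosbagTimestamp[idx],
           fruits = gsub(".*\\[(\\w+)\\].*", "\\1", df$data[idx]))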

Related

Data Preparation In R

I have six .txt dataset files stored at '../data/csv'. All the datasets have a similar structure: X1 (the speech) and part (the part of the speech, i.e. Charlotte_part_1 ... Charlotte_part_60). I want to combine all six datasets into a single .csv file called biden.csv with the columns speech, part, location, event and date. But I'm having trouble extracting the speech and part (these two come from the file contents) and the event (from the file name) because of the files' varying naming structure.
The six datasets
"Charlotte_Sep23_2020_Racial_Equity_Discussion-1.txt",
"Cleveland_Sep30_2020_Whistle_Stop_Tour.txt",
"Milwaukee_Aug20_2020_Democratic_National_Convention.txt",
"Philadelphia_Sep20_2020_SCOTUS.txt",
"Washington_Sep26_2020_US_Conference_of_Mayors.txt",
"Wilmington_Nov25_2020_Thanksgiving.txt"
Sample content from 'Charlotte_Sep23_2020_Racial_Equity_Discussion-1.txt'
X1 part
"Folks, thanks for taking the time to be here today. I really appreciate it. And we even have an astronaut in our house and I tell you what, that’s pretty cool. Look, first of all, I want to thank Chris and the mayor for being here, and all of you for being here. You know, these are tough times. Over 200,000 Americans have passed away. Over 200,000, and the number is still rising. The impact on communities is bad across the board, but particularly bad for African-American communities. Almost four times as likely, three times as likely to catch the disease, COVID, and when it’s caught, twice as likely to die as white Americans. It’s sort of emblematic of the inequality that exists and the circumstances that exist." Charlotte_part_1
"One of the things that really matters to me, is we could do … It didn’t have to be this bad. You have 30 million people on unemployment, you have 20 million people figuring whether or not they can pay their mortgage payment this month, and what they’re going to be able to do or not do as the consequence of that, and you’ve got millions of people who are worried that they’re going to be thrown out in the street because they can’t pay their rent. Although they’ve been given a reprieve for three months, but they have to pay double the next three months when it comes around." Charlotte_part_2
Here is the code I have written, but it's not producing the output I want; it just creates the tibble with the column titles but no content in any of the variables:
biden_data <- tibble() # initialize empty tibble
# loop through all text files in the specified directory
for (file in list.files(path="./data/csv", pattern='*.txt', full.names=T)){
  filename <- strsplit(file, "[./]")[[1]][5] # extract file name from path
  # extract location from file name
  location <- strsplit(filename, split='_')[[1]][1]
  # extract raw date from file name
  raw_date <- strsplit(filename, split='_')[[1]][2]
  date <- as.Date(raw_date, "%b%d_%Y") # format as datetime
  # extract event from file name
  event <- strsplit(filename, split='_')[[1]][3]
  # extract speech and part from file
  content <- readChar(file, file.info(file)$size)
  speech <- content[grepl("^X1", content)]
  part <- content[grepl("^part", content)]
  # create a new observation (row)
  new_obs <- tibble(speech=speech, part=part, location=location, event=event, date=date)
  # append the new observation to the existing data
  biden_data <- bind_rows(biden_data, new_obs)
  rm(filename, location, raw_date, date, content, speech, part, new_obs, file) # cleanup
}
Desired Output is supposed to look like this:
## # A tibble: 128 x 5
## speech part location event date
## <chr> <chr> <chr> <chr> <date>
## 1 Folks, thanks for taking the time to be here~ Char~ Charlot~ Raci~ 2020-09-23
## 2 One of the things that really matters to me,~ Char~ Charlot~ Raci~ 2020-09-23
## 3 How people going to do that? And the way, in~ Char~ Charlot~ Raci~ 2020-09-23
## 4 In addition to that, we find ourselves in a ~ Char~ Charlot~ Raci~ 2020-09-23
## 5 If he had spoken, as I said, they said at Co~ Char~ Charlot~ Raci~ 2020-09-23
## 6 But what I want to talk to you about today i~ Char~ Charlot~ Raci~ 2020-09-23
## 7 And thirdly, if you’re a business person, le~ Char~ Charlot~ Raci~ 2020-09-23
## 8 For too many people, particularly in the Afr~ Char~ Charlot~ Raci~ 2020-09-23
## 9 It goes to education, as well as access to e~ Char~ Charlot~ Raci~ 2020-09-23
## 10 And then we’re going to talk about, I think ~ Char~ Charlot~ Raci~ 2020-09-23
## # ... with 118 more rows
Starting with a vector of file paths:
files <- c("Charlotte_Sep23_2020_Racial_Equity_Discussion-1.txt", "Cleveland_Sep30_2020_Whistle_Stop_Tour.txt", "Milwaukee_Aug20_2020_Democratic_National_Convention.txt", "Philadelphia_Sep20_2020_SCOTUS.txt", "Washington_Sep26_2020_US_Conference_of_Mayors.txt", "Wilmington_Nov25_2020_Thanksgiving.txt")
We can capture the components into a frame:
meta <- strcapture("^([^_]+)_([^_]+_[^_]+)_(.*)\\.txt$", files, list(location="", date="", event=""))
meta
# location date event
# 1 Charlotte Sep23_2020 Racial_Equity_Discussion-1
# 2 Cleveland Sep30_2020 Whistle_Stop_Tour
# 3 Milwaukee Aug20_2020 Democratic_National_Convention
# 4 Philadelphia Sep20_2020 SCOTUS
# 5 Washington Sep26_2020 US_Conference_of_Mayors
# 6 Wilmington Nov25_2020 Thanksgiving
And then iterate over that to read each file's contents, combining everything into a single frame.
out <- do.call(Map, c(list(f = function(fn, ...) cbind(..., read.table(fn, header = TRUE))),
                      list(files), meta))
out <- do.call(rbind, out)
rownames(out) <- NULL
out[1:3,]
# location date event
# 1 Charlotte Sep23_2020 Racial_Equity_Discussion-1
# 2 Charlotte Sep23_2020 Racial_Equity_Discussion-1
# 3 Cleveland Sep30_2020 Whistle_Stop_Tour
# X1
# 1 Folks, thanks for taking the time to be here today. I really appreciate it. And we even have an astronaut in our house and I tell you what, that’s pretty cool. Look, first of all, I want to thank Chris and the mayor for being here, and all of you for being here. You know, these are tough times. Over 200,000 Americans have passed away. Over 200,000, and the number is still rising. The impact on communities is bad across the board, but particularly bad for African-American communities. Almost four times as likely, three times as likely to catch the disease, COVID, and when it’s caught, twice as likely to die as white Americans. It’s sort of emblematic of the inequality that exists and the circumstances that exist.
# 2 One of the things that really matters to me, is we could do … It didn’t have to be this bad. You have 30 million people on unemployment, you have 20 million people figuring whether or not they can pay their mortgage payment this month, and what they’re going to be able to do or not do as the consequence of that, and you’ve got millions of people who are worried that they’re going to be thrown out in the street because they can’t pay their rent. Although they’ve been given a reprieve for three months, but they have to pay double the next three months when it comes around.
# 3 Charlotte_Sep23_2020_Racial_Equity_Discussion-1.txt
# part
# 1 Charlotte_part_1
# 2 Charlotte_part_2
# 3 something
(I made fake files for all but the first file.)
Brief walk-through:
strcapture takes the regex (lots of _-separation) and creates a frame of location, date, etc.
Map takes a function with 1 or more arguments (we use two: fn= for the filename, and ... for "the rest") and applies it to each of the subsequent lists/vectors. In this case, I'm using ... to cbind (column-bind/concatenate) the columns from meta to what we read from the file itself. This is useful in that it combines the 1 row of each meta row with any-number-of-rows from the file itself. (We could have hard-coded ... instead as location, date, and event, but I tend to prefer to generalize, in case you need to extract something else from the filenames.)
Because we use ..., however, we need to combine files and the columns of meta in a list and then call our anon-function with the list contents as arguments.
The result of our do.call(Map, ...) is a list, not a single frame. Each element of this list is a frame with the same column structure, so we then combine them by rows with do.call(rbind, out).
R is going to carry the names from files over into the row names, which I find unnecessary (and distracting), so I removed the row names. Optional.
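As a toy illustration of the do.call(Map, ...) pattern (hypothetical values, not the real files): f receives one filename plus one value from each column of the meta frame per call.
toy_meta <- data.frame(a = 1:2, b = c("x", "y"))
do.call(Map, c(list(f = function(fn, ...) paste(fn, ...)),
               list(c("f1", "f2")), toy_meta))
# $f1
# [1] "f1 1 x"
# $f2
# [1] "f2 2 y"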
If you're interested, this may appear much easier to digest using dplyr and friends:
library(dplyr)
# library(tidyr) # unnest
out <- strcapture("^([^_]+)_([^_]+_[^_]+)_(.*)\\.txt$", files,
                  list(location="", date="", event="")) %>%
  mutate(contents = lapply(files, read.table, header = TRUE)) %>%
  tidyr::unnest(contents)
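Note that in both variants date is still a character column like "Sep23_2020"; to match the <date> column in the desired output, you could parse it afterwards (a one-liner sketch, assuming an English locale for the month abbreviations):
out$date <- as.Date(out$date, format = "%b%d_%Y")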

find every combination of elements in a column of a dataframe, which add up to a given sum in R

I'm trying to ease my life by writing a menu creator, which is supposed to generate a weekly menu from a list of my favourite dishes, in order to get a little more variety into my life.
I gave every dish a value for how many days it approximately lasts, and tried to arrange the dishes so I end up with menus worth 7 days of food.
I've already tried knapsack solutions from here, including dynamic programming, but I'm not experienced enough to get the hang of them. This is because all of these solutions target only the single most valuable option, not every combination that fills the knapsack.
library(adagio)
#create some data
dish <-c('Schnitzel','Burger','Steak','Salad','Falafel','Salmon','Mashed potatoes','MacnCheese','Hot Dogs')
days_the_food_lasts <- c(2,2,1,1,3,1,2,2,4)
price_of_the_food <- c(20,20,40,10,15,18,10,15,15)
data <- data.frame(dish,days_the_food_lasts,price_of_the_food)
#give each dish a distinct id
data$rownumber <- (1:nrow(data))
#set limit for how many days should be covered with the dishes
food_needed_for_days <- 7
#knapsack function of the adagio library as an example, but all other solutions I found to the knapsack problem were the same
most_expensive_food <- knapsack(days_the_food_lasts,price_of_the_food,food_needed_for_days)
data[data$rownumber %in% most_expensive_food$indices, ]
#output
dish days_the_food_lasts price_of_the_food rownumber
1 Schnitzel 2 20 1
2 Burger 2 20 2
3 Steak 1 40 3
4 Salad 1 10 4
6 Salmon 1 18 6
Simplified:
I need a solution to a single-objective knapsack problem that returns all possible combinations of dishes which add up to exactly 7 days of food.
Thank you very much in advance
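A minimal brute-force sketch (not from the original thread): with only 9 dishes there are 2^9 = 512 subsets, so enumerating every combination and keeping those that sum to exactly 7 days is cheap.
# all subsets of row numbers, grouped by subset size
combos <- unlist(lapply(seq_len(nrow(data)),
                        function(k) combn(data$rownumber, k, simplify = FALSE)),
                 recursive = FALSE)
# keep only the subsets whose days add up to exactly food_needed_for_days
valid <- Filter(function(idx) sum(data$days_the_food_lasts[idx]) == food_needed_for_days,
                combos)
lapply(valid, function(idx) data$dish[idx])  # the matching menus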

fuzzy and exact match of two databases

I have two databases. The first one has about 70k rows with 3 columns; the second one has 790k rows with 2 columns. Both databases have a common variable, grantee_name. I want to match each row of the first database to one or more rows of the second database based on this grantee_name. Note that merge will not work because the grantee_name values do not match perfectly: there are different spellings, etc. So I am using the fuzzyjoin package and trying the following:
library("haven"); library("fuzzyjoin"); library("dplyr")
forfuzzy<-read_dta("/path/forfuzzy.dta")
filings <- read_dta ("/path/filings.dta")
> head(forfuzzy)
# A tibble: 6 x 3
grantee_name grantee_city grantee_state
<chr> <chr> <chr>
1 (ICS)2 MAINE CHAPTER CLEARWATER FL
2 (SUFFOLK COUNTY) VANDERBILT~ CENTERPORT NY
3 1 VOICE TREKKING A FUND OF ~ WESTMINSTER MD
4 10 CAN NEWBERRY FL
5 10 THOUSAND WINDOWS LIVERMORE CA
6 100 BLACK MEN IN CHICAGO INC CHICAGO IL
... 7 - 70000 rows to go
> head(filings)
# A tibble: 6 x 2
grantee_name ein
<chr> <dbl>
1 ICS-2 MAINE CHAPTER 123456
2 SUFFOLK COUNTY VANDERBILT 654321
3 VOICE TREKKING A FUND OF VOICES 789456
4 10 CAN 654987
5 10 THOUSAND MUSKETEERS INC 789123
6 100 BLACK MEN IN HOUSTON INC 987321
rows 7-790000 omitted for brevity
The above examples are clear enough to provide some good matches and some not-so-good matches. Note that, for example, 10 THOUSAND WINDOWS will match best with 10 THOUSAND MUSKETEERS INC but it does not mean it is a good match. There will be a better match somewhere in the filings data (not shown above). That does not matter at this stage.
So, I have tried the following:
df<-as.data.frame(stringdist_inner_join(forfuzzy, filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance"))
I'm totally new to R. This results in an error:
cannot allocate vector of size 375GB
(with the big database, of course). A sample of 100 rows from forfuzzy always works, so I thought of iterating over 100 rows at a time.
I have tried the following:
n <- 100
lst <- split(forfuzzy, cumsum((1:nrow(forfuzzy) - 1) %% n == 0))
df <- as.data.frame(lapply(lst, function(df_) {
  stringdist_inner_join(df_, filings, by = "grantee_name", method = "jw",
                        p = 0.1, max_dist = 0.1, distance_col = "distance",
                        nthread = getOption("sd_num_thread"))
}) %>% bind_rows)
I have also tried the above with mclapply instead of lapply. The same error happens even though I tried a high-performance cluster with 3 CPUs, each with 480G of memory, using mclapply with the option mc.cores=3. Perhaps a foreach command could help, but I have no idea how to implement it.
I have been advised to use the purrr and repurrrsive packages, so I try the following:
purrr::map(lst, ~stringdist_inner_join(., filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance", nthread = getOption("sd_num_thread")))
This seems to be working, after fixing a novice error in the by=grantee_name statement. However, it is taking forever and I am not sure it will finish. A sample list in forfuzzy of 100 rows, with n=10 (so 10 lists with 10 rows each), has been running for 50 minutes with still no results.
If you split (with base::split or dplyr::group_split) your uniquegrantees data frame into a list of data frames, then you can call purrr::map on the list. (map is pretty much lapply)
purrr::map(list_of_dfs, ~stringdist_inner_join(., filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance"))
Your result will be a list of data frames each fuzzyjoined with filings. You can then call bind_rows (or you could do map_dfr) to get all the results in the same data frame again.
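For example, a sketch with the variable names used above:
results <- purrr::map(list_of_dfs,
                      ~ stringdist_inner_join(., filings, by = "grantee_name",
                                              method = "jw", p = 0.1,
                                              max_dist = 0.1,
                                              distance_col = "distance"))
matched <- dplyr::bind_rows(results)
# or in one step: purrr::map_dfr(list_of_dfs, ~ stringdist_inner_join(...))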
See R - Splitting a large dataframe into several smaller dateframes, performing fuzzyjoin on each one and outputting to a single dataframe
I haven't used foreach before but maybe the variable x is already the individual rows of zz1?
Have you tried:
stringdist_inner_join(x, zz2, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance")
?
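For completeness, a hedged sketch of how that foreach loop might look, assuming zz1 is the list of chunks and zz2 is the filings table (untested at this scale):
library(foreach)
library(doParallel)
registerDoParallel(cores = 3)
df <- foreach(x = zz1, .combine = dplyr::bind_rows,
              .packages = c("fuzzyjoin", "dplyr")) %dopar% {
  stringdist_inner_join(x, zz2, by = "grantee_name", method = "jw",
                        p = 0.1, max_dist = 0.1, distance_col = "distance")
}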

How to retrieve movies' genres from wikidata using R

I would like to retrieve information from Wikidata and store it in a dataframe. For the sake of simplicity I am going to assume that I want to get the genres of the following movies and then filter those that belong to sci-fi:
movies = c("Star Wars Episode IV: A New Hope", "Interstellar",
"Happythankyoumoreplease")
I know there is a package called WikidataR. If I am not wrong, and according to its vignettes, there are two commands that may be useful: find_item and find_property allow you to retrieve a set of Wikidata items or properties whose aliases or descriptions match a particular search term. Apparently they are great for me, so I thought of doing something like:
for (i in movies) {
info = find_item(i)
}
This is what I get from each item:
> find_item("Interstellar")
Wikidata item search
Number of results: 10
Results:
1 Interstellar (Q13417189) - 2014 US science fiction film
2 Interstellar (Q6057099)
3 interstellar medium (Q41872) - matter and fields (radiation) that exist in the space between the star systems in a galaxy;includes gas in ionic, atomic or molecular form, dust and cosmic rays. It fills interstellar space and blends smoothly into the surrounding intergalactic space
4 space colonization (Q686876) - concept of permanent human habitation outside of Earth
5 rogue planet (Q167910) - planetary-mass object that orbits the galaxy directly
6 interstellar cloud (Q1054444) - accumulation of gas, plasma and dust in a galaxy
7 interstellar travel (Q834826) - term used for hypothetical manned or unmanned travel between stars
8 Interstellar Boundary Explorer (Q835898)
9 starship (Q2003852) - spacecraft designed for interstellar travel
10 interstellar object (Q2441216) - astronomical object in interstellar space, such as a comet
>
Unfortunately, the information that I get from find_item (shown above) has two problems:
it is not a dataframe with all the Wikidata information of the item I am searching for, but a list of what seems to be metadata (Wikidata's id, link...).
it does not have the information I need (the Wikidata properties of each particular Wikidata item).
Similarly, find_property provides metadata of a certain property. find_property("genre") retrieves the following information:
> find_property("genre")
Wikidata property search
Number of results: 4
Results:
1 genre (P136) - a creative work's genre or an artist's field of work (P101). Use main subject (P921) to relate creative works to their topic
2 radio format (P415) - describes the overall content broadcast on a radio station
3 sex or gender (P21) - sexual identity of subject: male (Q6581097), female (Q6581072), intersex (Q1097630), transgender female (Q1052281), transgender male (Q2449503). Animals: male animal (Q44148), female animal (Q43445). Groups of same gender use "subclass of" (P279)
4 gender of a scientific name of a genus (P2433) - determines the correct form of some names of species and subdivisions of species, also subdivisions of a genus
This has similar problems:
it is not a dataframe
it just stores metadata about the property
I don't find any way to link each property to each element of the movies vector.
Is there any way to end up with a dataframe containing the genre's of those movies? (or a dataframe with all wikidata's information which I will have to manipulate in order to filter or select my desired data?)
These are just lists. You can get a picture of their structure with str(find_item("Interstellar")), for example.
Then you can go through each element of the list and pick the items that you need. For example, getting the title and the label:
a <- find_item("Interstellar")
b <- Reduce(rbind,lapply(a, function(x) cbind(x$title,x$label)))
data.frame(b)
## X1 X2
## 1 Q13417189 Interstellar
## 2 Q6057099 Interstellar
## 3 Q41872 interstellar medium
## 4 Q686876 space colonization
## 5 Q167910 rogue planet
## 6 Q1054444 interstellar cloud
## 7 Q834826 interstellar travel
## 8 Q835898 Interstellar Boundary Explorer
## 9 Q2003852 starship
## 10 Q2441216 interstellar object
This works easily for regular data. If some element is missing, you will have to handle it; for example, some items don't have a description. You can get around that with the following:
Reduce("rbind",lapply(a,
function(x) cbind(x$title,
x$label,
ifelse(length(x$description)==0,NA,x$description))))
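To actually get the genres, which the walk-through above stops short of, one option (my addition, assuming the WikidataQueryServiceR package is available) is to query Wikidata's SPARQL endpoint directly; a minimal sketch for a single movie:
library(WikidataQueryServiceR)
# Q13417189 is the item id of the film Interstellar, as found above
sparql <- '
SELECT ?filmLabel ?genreLabel WHERE {
  VALUES ?film { wd:Q13417189 }
  ?film wdt:P136 ?genre .                     # P136 = genre
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}'
query_wikidata(sparql)  # returns a data frame of film and genre labels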

Adding additional values to tables in Lua

I have an input file with different food types:
Corn Fiber 17
Beans Protein 12
Milk Protein 15
Butter Fat 201
Eggs Fat 2
Bread Fiber 12
Eggs Cholesterol 4
Eggs Protein 8
Milk Fat 5
(Don't take these too seriously. I'm no nutrition expert.) Anyway, I have the following script that reads the input file and then puts the contents into a table:
file = io.open("food.txt")
foods = {}
nutritions = {}
for line in file:lines() do
  local f, n, v = line:match("(%a+) (%a+) (%d+)")
  nutritions[n] = {value = v}
  -- foods[f] = {}  Not sure how to implement here
end
file:close()
(It's a little messy right now)
Notice also that different foods can have different nutrients. For example, eggs have both protein and fat. I need a way to let the program know which value I am trying to access. For example:
> print(foods.Eggs.Fat)
2
> print(foods.Eggs.Protein)
8
I believe I need two tables, as shown above. The foods table will contain a table of nutritions. This way, I can have multiple food types with multiple different nutrient facts. However, I am not sure how to handle a table of tables. How can I implement this within my program?
The straightforward way is to test whether foods[f] exists, to decide whether to create a new table or add elements to the existing one.
foods = {}
for line in file:lines() do
  local f, n, v = line:match("(%a+) (%a+) (%d+)")
  if foods[f] then
    foods[f][n] = v
  else
    foods[f] = {[n] = v}
  end
end
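A common Lua idiom compresses the if/else into a single line; note also that match returns strings, so a tonumber conversion (my addition, in case you want to do arithmetic on the values) may be useful:
foods[f] = foods[f] or {}   -- create the inner table only when it's missing
foods[f][n] = tonumber(v)   -- store the number 2 rather than the string "2"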
