Screen Names from Twitter into DataFrame - R

I am downloading all the tweets (using the rtweet package, version 0.7.0) that mention #sernac (a Chilean government entity) in the text of the tweet, and then extracting all the usernames (screen names) from the body of each tweet with the following code.
library(rtweet)    # version 0.7.0
library(stringr)
Tweets <- search_tweets("#sernac", n = 50000, include_rts = FALSE)
Names <- str_extract_all(Tweets$text, "(?<=^|\\s)#[^\\s]+")
This gives me a list object with every screen name found in each tweet's text.
The first question is: how do I get a data frame with the following structure?
X1          X2            X3             X4          X5          ...    Xn
#sernac     #vtrchile     NA             NA          NA          NA     NA
#username   #playstation  #taylorswitft  #elonmusk   #instagram  NA     NA
#username2  #username5    #selenagomez   #username2  #username3  #FIFA  #xbox
#username4  #ebay         NA             NA          NA          NA     NA
where the number of columns equals the maximum number of elements in any object of the list.
I tried the following function, but it only returns 4 columns, even though the maximum number of elements in an object is 9.
df <- data.frame(matrix(unlist(Names), nrow=length(Names), byrow = T))
After this, I need to perform a left join between this table and a cluster table I created. The join should first match the first column of the newly created data frame against the cluster table; if there is no match, it should try again with the second column, and so on, until all columns have been exhausted.
This is an example of the database created by me and the final desired result:
CLUSTER DATA FRAME

screen_name   cluster
#sernac       Gov
#playstation  Videogames
#walmart      Supermarket
#SelenaGomez  Celebrity
#elonmusk     Celebrity
#xbox         Videogames
#ebay         Ecommerce
FINAL RESULT

X1          X2            X3             X4          X5          ...    Xn     cluster
#sernac     #vtrchile     NA             NA          NA          NA     NA     Gov
#username   #playstation  #taylorswitft  #elonmusk   #instagram  NA     NA     Videogames
#username2  #username5    #selenagomez   #username2  #username3  #FIFA  #xbox  Celebrity
#username4  #ebay         NA             NA          NA          NA     NA     Ecommerce
I have tried to explain myself as well as I can; English is not my first language, so I can give more details in the comments.

I would approach this differently.
First, if you are trying to download as many tweets as possible, set n = Inf and retryonratelimit = TRUE:
Tweets <- search_tweets("#sernac",
                        n = Inf,
                        include_rts = FALSE,
                        retryonratelimit = TRUE)
Second, there is no need to extract screen names from the tweet text, as this information can be found in the entities column.
One way to extract mentions is to use lapply. You can then create a data frame with just the useful columns, and convert screen names to lower case for matching.
library(dplyr)
library(stringr)   # for str_replace() further down
mentions <- lapply(Tweets$entities, function(x) x$user_mentions) %>%
  bind_rows(.id = "tweet_number") %>%
  select(tweet_number, screen_name) %>%
  mutate(screen_name_lc = tolower(screen_name))
head(mentions)
tweet_number screen_name screen_name_lc
1 1 mundo_pacifico mundo_pacifico
2 1 OIMChile oimchile
3 1 subtel_chile subtel_chile
4 1 ReclamosSubtel reclamossubtel
5 1 SERNAC sernac
6 2 mundo_pacifico mundo_pacifico
Next, add a column with the lower-case screen names to your cluster data:
cluster_df <- cluster_df %>%
  mutate(screen_name_lc = tolower(str_replace(screen_name, "#", "")))
Now we can join the data frames, just on the screen_name_lc column:
mentions_clusters <- mentions %>%
  left_join(cluster_df, by = "screen_name_lc") %>%
  select(tweet_number, screen_name = screen_name.x, cluster)
head(mentions_clusters)
tweet_number screen_name cluster
1 1 mundo_pacifico <NA>
2 1 OIMChile <NA>
3 1 subtel_chile <NA>
4 1 ReclamosSubtel <NA>
5 1 SERNAC Gov
6 2 mundo_pacifico <NA>
This "long" format is much easier to work with for subsequent analysis than the "wide" format, and can still be grouped by tweet using the tweet_number column.
Data for cluster_df:
cluster_df <- structure(list(screen_name = c("#sernac", "#playstation", "#walmart",
"#SelenaGomez", "#elonmusk", "#xbox", "#ebay"), cluster = c("Gov",
"Videogames", "Supermarket", "Celebrity", "Celebrity", "Videogames",
"Ecommerce"), screen_name_lc = c("sernac", "playstation", "walmart",
"selenagomez", "elonmusk", "xbox", "ebay")), class = "data.frame", row.names = c(NA,
-7L))

Related

How to split one column with multiple delimiters into multiple columns in R?

I have values with the following structure: string OR string_string.integer
EX:
df<-data.frame(Objs=c("Windows","Door_XYZ.1", "Door_XYY.1", "Chair_XYYU.2" ))
Objs
Windows
Door_XYZ.1
Door_XYY.1
Chair_XYYU.2
Using the command split(), separate() or something similar I need to generate a dataframe similar to this one:
Note: the split must be performed on the characters "_" and ".".
Objs          IND    TAG   Control
Windows       NA     NA    NA
Door_XYZ.1    Door   XYZ   1
Door_XYY.1    Door   XYY   1
Chair_XYYU.2  Chair  XYYU  2
The closest solution was suggested by #Tommy in a similar context.
df %>% data.frame(.,do.call(rbind,str_split(.$Objs,"_")))
The default value of the sep argument in separate() will nearly get the result you need. A conditional mutate was also needed to remove the Windows entry from the IND column.
library(tidyverse)
df <- data.frame(Objs=c("Windows","Door_XYZ.1", "Door_XYY.1", "Chair_XYYU.2" ))
df %>%
  separate(Objs, into = c("IND", "TAG", "Control"), remove = FALSE, fill = "right") %>%
  mutate(IND = if_else(Objs == IND, NA_character_, IND))
#> Objs IND TAG Control
#> 1 Windows <NA> <NA> <NA>
#> 2 Door_XYZ.1 Door XYZ 1
#> 3 Door_XYY.1 Door XYY 1
#> 4 Chair_XYYU.2 Chair XYYU 2
Created on 2022-05-05 by the reprex package (v1.0.0)
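If you prefer to spell out the delimiters rather than rely on separate()'s default, the same idea works with an explicit sep regex matching "_" and "." (a sketch, assuming the same df):
df %>%
  separate(Objs, into = c("IND", "TAG", "Control"),
           sep = "[_.]", remove = FALSE, fill = "right", convert = TRUE) %>%
  mutate(IND = if_else(is.na(TAG), NA_character_, IND))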

Compute diff of rows with NA values in a data frame using R

I have a data frame (9000 x 304), but it looks like this:
date        a         b
1997-01-01  8.720551  10.61597
1997-01-02  na        na
1997-01-03  8.774251  na
1997-01-04  8.808079  11.09641
I want to calculate row differences such as:
first <- data[i-1,] - data[i-2,]
second <- data[i,] - data[i-1,]
third <- data[i,] - data[i-2,]
I want to ignore the NA values: if a value is NA, I want to use the last non-NA value in that column.
For example, for the second diff with i = 4 in column b:
11.09641 - 10.61597 is the value of b_diff on 1997-01-04
This is what I did, but it keeps generating data with NA:
first <- NULL
for (i in 3:nrow(data)) {
  first <- rbind(first, data[i - 1, ] - data[i - 2, ])
}
second <- NULL
for (i in 3:nrow(data)) {
  second <- rbind(second, data[i, ] - data[i - 1, ])
}
third <- NULL
for (i in 3:nrow(data)) {
  third <- rbind(third, data[i, ] - data[i - 2, ])
}
There may be a way to solve this with the aggregate function, but I need a solution that scales to big data, and I can't specify each column name separately. Moreover, my column names are in a foreign language.
Thank you very much! I hope I have given you all the information you need to help me; otherwise, please let me know.
You can use fill to replace NAs with the closest preceding value, and then use across and lag to compute the new variables. It is unclear exactly what your expected output is, but you can also replace the default value of lag when it does not exist (e.g. for the first value), using lag(.x, default = ...).
library(dplyr)
library(tidyr)
data %>%
  fill(a, b) %>%
  mutate(across(a:b, ~ lag(.x) - lag(.x, n = 2), .names = "first_{.col}"),
         across(a:b, ~ .x - lag(.x), .names = "second_{.col}"),
         across(a:b, ~ .x - lag(.x, n = 2), .names = "third_{.col}"))
date a b first_a first_b second_a second_b third_a third_b
1 1997-01-01 8.720551 10.61597 NA NA NA NA NA NA
2 1997-01-02 8.720551 10.61597 NA NA 0.000000 0.00000 NA NA
3 1997-01-03 8.774251 10.61597 0.0000 0 0.053700 0.00000 0.053700 0.00000
4 1997-01-04 8.808079 11.09641 0.0537 0 0.033828 0.48044 0.087528 0.48044
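If you would rather avoid the leading NAs, the lag(.x, default = ...) variant mentioned above can help; for instance, treating the missing previous value as the first observed value (a sketch under that assumption):
data %>%
  fill(a, b) %>%
  mutate(across(a:b, ~ .x - lag(.x, default = first(.x)), .names = "second_{.col}"))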

How to convert strings containing headers and values, with one ID per string, into a dataframe in R

I need to know how to convert strings from a text file into a data frame for analysis.
I have one line per customer ID; each line contains column headings and their values, separated by semicolons ';'.
For example:
{ID=12345;TimeStamp=""2019-02-26 00:15:42"";Event=StatusEvent;Status=""WiLoMonitorStart"";Text=""mnew inactivity failure on cable"";}
The column headings are ID, TimeStamp, Event, Status, Text and any others that come before the equals sign "=".
The values under the column headings come after the equals sign "="; the picture shows the end result I want to achieve.
Statements {
"{ID=12345;TimeStamp=""2019-02-26 00:15:42"";Event=StatusEvent;Status=""WiLoMonitorStart"";Text=""mnew inactivity failure on cable"";}"
"{ID=12346;TimeStamp=""2019-02-26 00:15:43"";Event=StatusEvent;Status=""MetroCode"";Text=""AU"";}"
"{ID=12347;TimeStamp=""2019-02-26 00:15:43"";Event=StatusEvent;Status=""LoWiValidation"";Text=""Password validation 2.5GHz for AES: BigBong"";}"
"{ID=12349;TimeStamp=""2019-02-26 00:15:42"";Event=DomainEvent;MacAddress=""AB:23:34:EF:YN:OT"";LogTime=""2019-02-26 00:15:48"";Domain=""Willing ind"";SecondaryDomain=""No_Perl"";}"
"{ID=12351;TimeStamp=""2019-02-26 00:15:45"";Event=CollectionCallEvent;SerialNumber=""34121"";}"
"{ID=12352;TimeStamp=""2019-02-26 00:15:46"";Event=CollectionCallEvent;SerialNumber=""34151"";Url=""werlkdfa/vierjwerret/vre34f3/df343rsdf343+t45rf/dfgr3443"";}"
}
You can see that the semicolon ";" separates each variable. Can someone help me separate these so R can identify what is a column heading and what is a value to be placed under the respective heading, keyed by the customer ID (the primary key)?
Note that each line may not have the same column headings as the next one.
The image supplied is what I want to achieve in the end, but I am having great difficulty doing this in R. It is not a JSON or XML file; it is a dump in plain text format, and I need to extract the information into a data frame before I can do any analysis.
Any suggestions? Would there be a better way than, say, regular expressions, e.g. the stringr package?
txt <- 'Statements {
"{ID=12345;TimeStamp=""2019-02-26 00:15:42"";Event=StatusEvent;Status=""WiLoMonitorStart"";Text=""mnew inactivity failure on cable"";}"
"{ID=12346;TimeStamp=""2019-02-26 00:15:43"";Event=StatusEvent;Status=""MetroCode"";Text=""AU"";}"
"{ID=12347;TimeStamp=""2019-02-26 00:15:43"";Event=StatusEvent;Status=""LoWiValidation"";Text=""Password validation 2.5GHz for AES: BigBong"";}"
"{ID=12349;TimeStamp=""2019-02-26 00:15:42"";Event=DomainEvent;MacAddress=""AB:23:34:EF:YN:OT"";LogTime=""2019-02-26 00:15:48"";Domain=""Willing ind"";SecondaryDomain=""No_Perl"";}"
"{ID=12351;TimeStamp=""2019-02-26 00:15:45"";Event=CollectionCallEvent;SerialNumber=""34121"";}"
"{ID=12352;TimeStamp=""2019-02-26 00:15:46"";Event=CollectionCallEvent;SerialNumber=""34151"";Url=""werlkdfa/vierjwerret/vre34f3/df343rsdf343+t45rf/dfgr3443"";}"
} ' # note need for single quotes
Then read it in with readLines, drop the leading and trailing lines, remove the braces and double quotes, and finally read it with scan:
RL <- readLines(textConnection(txt))
rl <- RL[-1]
input <- scan(text=gsub('[{}"]',"", rl[1:6]), sep=';', what="")
input[1:12]
#------------------
[1] " ID=12345" "TimeStamp=2019-02-26 00:15:42"
[3] "Event=StatusEvent" "Status=WiLoMonitorStart"
[5] "Text=mnew inactivity failure on cable" ""
[7] " ID=12346" "TimeStamp=2019-02-26 00:15:43"
[9] "Event=StatusEvent" "Status=MetroCode"
[11] "Text=AU" ""
Then you can process it like any key-value pair input, with "ID" as the delimiter. Another way, which would keep the original lines together in a list, would be:
sapply( gsub('[{}"]',"", rl[1:6]), function(x) scan(text=x, sep=";", what=""))
#----------------
Read 6 items
Read 6 items
Read 6 items
Read 8 items
Read 5 items
Read 6 items
$` ID=12345;TimeStamp=2019-02-26 00:15:42;Event=StatusEvent;Status=WiLoMonitorStart;Text=mnew inactivity failure on cable;`
[1] " ID=12345" "TimeStamp=2019-02-26 00:15:42" "Event=StatusEvent"
[4] "Status=WiLoMonitorStart" "Text=mnew inactivity failure on cable" ""
# only printed the result from the first line
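From there, one rough sketch (untested, building on the objects above) for stacking the key=value pairs into a single data frame, with NA where a key is absent, could be:
library(dplyr)
kv_list <- lapply(gsub('[{}"]', "", rl[1:6]), function(x) {
  fields <- trimws(scan(text = x, sep = ";", what = "", quiet = TRUE))
  pairs  <- Filter(function(p) length(p) == 2, strsplit(fields, "="))  # drop empty trailing fields
  setNames(as.list(vapply(pairs, `[`, "", 2)), vapply(pairs, `[`, "", 1))
})
records <- bind_rows(kv_list)  # one row per statement, NA for absent keys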
See also: Convert url query key-value pairs to data frame
Here's a tidyverse solution:
library(tidyverse)
data.frame(txt) %>%
  # tidy strings:
  mutate(txt = trimws(gsub("Statements|\\s{2,}|[^\\w=; -]", "", txt, perl = TRUE))) %>%
  # separate into rows by splitting on ";":
  separate_rows(txt, sep = ";") %>%
  # separate into two columns by splitting on "=":
  separate(txt, into = c("header", "value"), sep = "=") %>%
  na.omit() %>%
  group_by(header) %>%
  # create grouped row ID:
  mutate(rowid = row_number()) %>%
  ungroup() %>%
  # cast wider:
  pivot_wider(rowid,
              names_from = "header",
              values_from = "value")
# A tibble: 6 × 12
rowid ID TimeStamp Event Status Text MacAd…¹ LogTime Domain Secon…² Seria…³ Url
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 12345 2019-02-26 001542 StatusEvent WiLoM… mnew… AB2334… 2019-0… Willi… No_Perl 34121 werl…
2 2 12346 2019-02-26 001543 StatusEvent Metro… AU NA NA NA NA 34151 NA
3 3 12347 2019-02-26 001543 StatusEvent LoWiV… Pass… NA NA NA NA NA NA
4 4 12349 2019-02-26 001542 DomainEvent NA NA NA NA NA NA NA NA
5 5 12351 2019-02-26 001545 CollectionCallEve… NA NA NA NA NA NA NA NA
6 6 12352 2019-02-26 001546 CollectionCallEve… NA NA NA NA NA NA NA NA
# … with abbreviated variable names ¹​MacAddress, ²​SecondaryDomain, ³​SerialNumber
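Everything lands as character after pivoting, so an optional follow-up is to let readr guess proper column types. A minimal sketch, assuming the pivoted result has been saved to an object, here called wide:
library(readr)
wide <- type_convert(wide)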
Data:
txt <- 'Statements {
"{ID=12345;TimeStamp=""2019-02-26 00:15:42"";Event=StatusEvent;Status=""WiLoMonitorStart"";Text=""mnew inactivity failure on cable"";}"
"{ID=12346;TimeStamp=""2019-02-26 00:15:43"";Event=StatusEvent;Status=""MetroCode"";Text=""AU"";}"
"{ID=12347;TimeStamp=""2019-02-26 00:15:43"";Event=StatusEvent;Status=""LoWiValidation"";Text=""Password validation 2.5GHz for AES: BigBong"";}"
"{ID=12349;TimeStamp=""2019-02-26 00:15:42"";Event=DomainEvent;MacAddress=""AB:23:34:EF:YN:OT"";LogTime=""2019-02-26 00:15:48"";Domain=""Willing ind"";SecondaryDomain=""No_Perl"";}"
"{ID=12351;TimeStamp=""2019-02-26 00:15:45"";Event=CollectionCallEvent;SerialNumber=""34121"";}"
"{ID=12352;TimeStamp=""2019-02-26 00:15:46"";Event=CollectionCallEvent;SerialNumber=""34151"";Url=""werlkdfa/vierjwerret/vre34f3/df343rsdf343+t45rf/dfgr3443"";}"
} '

Sorting in natural order by column in R

I've used full_join to combine two tables:
fsts = full_join(fstvarcal, fst, by = "SNP")
This had the effect of placing first the rows that have values in both datasets, followed by rows with values in the 1st dataset only (and NAs for the 2nd), followed by rows with values in the 2nd dataset only (and NAs for the 1st).
I'm now trying to order by natural order.
Looking for the equivalent of sort -V -k1 in bash.
I have tried:
library(naturalsort);
fstordered = fsts[naturalorder(fsts$SNP),]
which works, but it's very slow.
Are there any faster ways of doing this? Or of merging the two datasets without losing the natural order?
I have:
SNP fst
scaffold_0 0.186473
scaffold_9 0.186475
scaffold_10 0.186472
scaffold_11 0.186470
scaffold_99 0.186420
scaffold_100 0.186440
and
SNP fstvarcal
scaffold_0 0.186472
scaffold_8 0.186475
scaffold_20 0.186477
scaffold_21 0.186440
scaffold_999 0.186450
scaffold_1000 0.186420
and want to combine them into
SNP fstvarcal fst
scaffold_0 0.186472 0.186473
scaffold_8 0.186475 NA
scaffold_9 NA 0.186475
scaffold_10 NA 0.186472
scaffold_11 NA 0.186470
scaffold_20 0.186477 NA
scaffold_21 0.186440 NA
scaffold_99 NA 0.186420
scaffold_100 NA 0.186440
scaffold_999 0.186450 NA
scaffold_1000 0.186420 NA
Perhaps you can do the following:
I generate some representative sample data first.
set.seed(2018)
df <- data.frame(
  SNP = sprintf("scaffold_%i", 1:1000),
  val = rnorm(1000))
# put the rows in alphabetical (non-natural) order to simulate the problem
df <- df[order(df$SNP), ]
We now use tidyr::separate to separate SNP into "id" and "no", and arrange rows by "id" and "no" to ensure natural ordering (convert = T automatically converts "no" to an integer column vector).
library(tidyverse)
df %>%
  separate(SNP, into = c("id", "no"), remove = F, convert = T) %>%
  arrange(id, no) %>%
  select(-id, -no)
# SNP val
#1 scaffold_1 -0.4229839834
#2 scaffold_2 -1.5498781617
#3 scaffold_3 -0.0644293189
#4 scaffold_4 0.2708813526
#5 scaffold_5 1.7352836655
#6 scaffold_6 -0.2647112113
#7 scaffold_7 2.0994707023
#8 scaffold_8 0.8633512196
#9 scaffold_9 -0.6105871453
#10 scaffold_10 0.6370556066
#11 scaffold_11 -0.6430346953
#...
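Another, plainly different option is stringr::str_order() with numeric = TRUE, which orders strings with embedded numbers naturally. A minimal sketch, assuming the joined fsts table from the question:
library(stringr)
fstordered <- fsts[str_order(fsts$SNP, numeric = TRUE), ]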

Reading Excel file: How to find the start cell in messy spreadsheets?

I'm trying to write R code to read data from a mess of old spreadsheets. The exact location of the data varies from sheet to sheet: the only constant is that the first column is a date and the second column has "Monthly return" as the header. In this example, the data starts in cell B5:
How do I automate the search of Excel cells for my "Monthly return" string using R?
At the moment, the best idea I can come up with is to load everything into R starting at cell A1 and sort out the mess in the resulting (huge) matrices. I'm hoping for a more elegant solution.
I haven't found a way to do this elegantly, but I'm very familiar with this problem (getting data from FactSet PA reports -> Excel -> R, right?). I understand different reports have different formats, and this can be a pain.
For a slightly different version of annoyingly formatted spreadsheets, I do the following. It's not the most elegant (it requires two reads of the file) but it works. I like reading the file twice, to make sure the columns are of the correct type, and with good headers. It's easy to mess up column imports, so I'd rather have my code read the file twice than go through and clean up columns myself, and the read_excel defaults, if you start at the right row, are pretty good.
Also, it's worth noting that as of today (2017-04-20), readxl had an update. I installed the new version to see if that would make this very easy, but I don't believe that's the case, although I could be mistaken.
library(readxl)
library(stringr)
library(dplyr)
f_path <- file.path("whatever.xlsx")
if (!file.exists(f_path)) {
  f_path <- file.choose()
}
# I read this twice, temp_read to figure out where the data actually starts...
# Maybe you need something like this -
# excel_sheets <- readxl::excel_sheets(f_path)
# desired_sheet <- which(stringr::str_detect(excel_sheets,"2 Factor Brinson Attribution"))
desired_sheet <- 1
temp_read <- readxl::read_excel(f_path,sheet = desired_sheet)
skip_rows <- NULL
col_skip <- 0
search_string <- "Monthly Returns"
max_cols_to_search <- 10
max_rows_to_search <- 10
# Note, for the - 0, you may need to add/subtract a row if you end up skipping too far later.
while (length(skip_rows) == 0) {
  col_skip <- col_skip + 1
  if (col_skip == max_cols_to_search) break
  skip_rows <- which(stringr::str_detect(
    temp_read[1:max_rows_to_search, col_skip][[1]], search_string)) - 0
}
# ... now we re-read from the known good starting point.
real_data <- readxl::read_excel(
f_path,
sheet = desired_sheet,
skip = skip_rows
)
# You likely don't need this if you start at the right row
# But given that all weird spreadsheets are weird in their own way
# You may want to operate on the col_skip, maybe like so:
# real_data <- real_data %>%
# select(-(1:col_skip))
Okay, since the format was specified as xls, here is an update from the csv approach to the suggested xls loading.
library(readxl)
data <- readxl::read_excel(".../sampleData.xls", col_names = FALSE)
You would get something similar to:
data <- structure(list(V1 = structure(c(6L, 5L, 3L, 7L, 1L, 4L, 2L), .Label = c("",
"Apr 14", "GROSS PERFROANCE DETAILS", "Mar-14", "MC Pension Fund",
"MY COMPANY PTY LTD", "updated by JS on 6/4/2017"), class = "factor"),
V2 = structure(c(1L, 1L, 1L, 1L, 4L, 3L, 2L), .Label = c("",
"0.069%", "0.907%", "Monthly return"), class = "factor")), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -7L))
Then you can dynamically filter on the "Monthly return" cell and identify your matrix.
targetCell <- which(data == "Monthly return", arr.ind = T)
returns <- data[(targetCell[1] + 1):nrow(data), (targetCell[2] - 1):targetCell[2]]
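A small follow-up sketch, if you then want named, numeric columns (assuming the percentage-style strings shown above):
names(returns) <- c("Date", "MonthlyReturn")
# convert "0.907%"-style strings to numeric proportions
returns$MonthlyReturn <- as.numeric(sub("%", "", as.character(returns$MonthlyReturn))) / 100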
With a general purpose package like readxl, you'll have to read twice, if you want to enjoy automatic type conversion. I assume you have some sort of upper bound on the number of junk rows at the front? Here I assumed that was 10. I'm iterating over worksheets in one workbook, but the code would look pretty similar if iterating over workbooks. I'd write one function to handle a single worksheet or workbook then use lapply() or purrr::map(). This function will encapsulate the skip-learning read and the "real" read.
library(readxl)
two_passes <- function(path, sheet = NULL, n_max = 10) {
first_pass <- read_excel(path = path, sheet = sheet, n_max = n_max)
skip <- which(first_pass[[2]] == "Monthly return")
message("For sheet '", if (is.null(sheet)) 1 else sheet,
"' we'll skip ", skip, " rows.")
read_excel(path, sheet = sheet, skip = skip)
}
(sheets <- excel_sheets("so.xlsx"))
#> [1] "sheet_one" "sheet_two"
sheets <- setNames(sheets, sheets)
lapply(sheets, two_passes, path = "so.xlsx")
#> For sheet 'sheet_one' we'll skip 4 rows.
#> For sheet 'sheet_two' we'll skip 6 rows.
#> $sheet_one
#> # A tibble: 6 × 2
#> X__1 `Monthly return`
#> <dttm> <dbl>
#> 1 2017-03-14 0.00907
#> 2 2017-04-14 0.00069
#> 3 2017-05-14 0.01890
#> 4 2017-06-14 0.00803
#> 5 2017-07-14 -0.01998
#> 6 2017-08-14 0.00697
#>
#> $sheet_two
#> # A tibble: 6 × 2
#> X__1 `Monthly return`
#> <dttm> <dbl>
#> 1 2017-03-14 0.00907
#> 2 2017-04-14 0.00069
#> 3 2017-05-14 0.01890
#> 4 2017-06-14 0.00803
#> 5 2017-07-14 -0.01998
#> 6 2017-08-14 0.00697
In those cases it's important to know the possible conditions of your data. I'm going to assume that you only want to remove columns and rows that don't conform to your table.
I have this Excel book:
I added 3 blank columns at the left because, when I loaded the file into R with one blank column, R omitted it. That confirms that R omits empty columns at the left.
First: load data
library(xlsx)
dat <- read.xlsx('book.xlsx', sheetIndex = 1)
head(dat)
MY.COMPANY.PTY.LTD NA.
1 MC Pension Fund <NA>
2 GROSS PERFORMANCE DETAILS <NA>
3 updated by IG on 20/04/2017 <NA>
4 <NA> Monthly return
5 Mar-14 0.0097
6 Apr-14 6e-04
Second: I added some columns with NA and '' values, in case your data contains some.
dat$x2 <- NA
dat$x4 <- NA
head(dat)
MY.COMPANY.PTY.LTD NA. x2 x4
1 MC Pension Fund <NA> NA NA
2 GROSS PERFORMANCE DETAILS <NA> NA NA
3 updated by IG on 20/04/2017 <NA> NA NA
4 <NA> Monthly return NA NA
5 Mar-14 0.0097 NA NA
6 Apr-14 6e-04 NA NA
Third: remove columns where all values are NA or ''. I have had to deal with that kind of problem in the past.
colSelect <- apply(dat, 2, function(x) !(length(x) == length(which(x == '' | is.na(x)))))
dat2 <- dat[, colSelect]
head(dat2)
MY.COMPANY.PTY.LTD NA.
1 MC Pension Fund <NA>
2 GROSS PERFORMANCE DETAILS <NA>
3 updated by IG on 20/04/2017 <NA>
4 <NA> Monthly return
5 Mar-14 0.0097
6 Apr-14 6e-04
Fourth: keep only rows with complete observations (that is what I suppose from your example).
rowSelect <- apply(dat2, 1, function(x) !any(is.na(x)))
dat3 <- dat2[rowSelect, ]
head(dat3)
MY.COMPANY.PTY.LTD NA.
5 Mar-14 0.0097
6 Apr-14 6e-04
7 May-14 0.0189
8 Jun-14 0.008
9 Jul-14 -0.0199
10 Ago-14 0.00697
Finally, if you want to keep the header, you can do something like this:
colnames(dat3) <- as.matrix(dat2[which(rowSelect)[1] - 1, ])
or
colnames(dat3) <- c('Month', as.character(dat2[which(rowSelect)[1] - 1, 2]))
dat3
Month Monthly return
5 Mar-14 0.0097
6 Apr-14 6e-04
7 May-14 0.0189
8 Jun-14 0.008
9 Jul-14 -0.0199
10 Ago-14 0.00697
Here is how I would tackle it.
STEP 1
Read the excel spreadsheet in without the headers.
STEP 2
Find the row index for your string, "Monthly return" in this case.
STEP 3
Filter from the identified row (or column or both), prettify a little and done.
Here is what a sample function looks like. It works for your example no matter where it is in the spreadsheet. You can play around with regex to make it more robust.
Function Definition:
library(xlsx)
extract_return <- function(path = getwd(), filename = "Mysheet.xlsx", sheetnum = 1) {
  filepath <- paste(path, "/", filename, sep = "")
  input <- read.xlsx(filepath, sheetnum, header = FALSE)
  start_idx <- which(input == "Monthly return", arr.ind = TRUE)[1]
  output <- input[start_idx:dim(input)[1], ]
  rownames(output) <- NULL
  colnames(output) <- c("Date", "Monthly Return")
  output <- output[-1, ]
  return(output)
}
Example:
final_df <- extract_return(
path = "~/Desktop",
filename = "Apr2017.xlsx",
sheetnum = 2)
No matter how many rows or columns you may have, the idea remains the same. Give it a try and let me know.
This is a tidy alternative that avoids the multiple reads issue discussed above. However, when doing benchmarks, Rafael Zayas's answer still wins out.
library("tidyxl")
library("unpivotr")
library("tidyr")
library("dplyr")
tidy_solution <- function() {
  raw <- xlsx_cells("messyExcel.xlsx")
  start <- raw %>%
    filter_all(any_vars(. %in% c("Monthly return"))) %>%
    select(row, col)
  month.col <- raw %>%
    filter(row >= start$row + 1, col == start$col - 1) %>%
    pivot_wider(date, col)
  return.col <- raw %>%
    filter(row >= start$row + 1, col == start$col) %>%
    pivot_wider(numeric, col)
  output <- cbind(month.col, return.col)
}
# My Solution
expr min lq mean median uq max neval
tidy_solution() 29.0372 30.40305 32.13793 31.36925 32.9812 56.6455 100
# Rafael's
expr min lq mean median uq max neval
original_solution() 21.4405 23.8009 25.86874 25.10865 26.99945 59.4128 100
grep("2014",dat)[1]
This gives you the first column containing a year. Or use "-14" or whatever you have for years.
In a similar way, grep("Monthly", dat)[1] gives you the second column.
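Putting those two hints together, a rough (untested) sketch to pull out the date and return columns might be:
date_col   <- grep("-14", dat)[1]       # first column containing the dates
header_col <- grep("Monthly", dat)[1]   # column holding "Monthly return"
header_row <- which(dat[[header_col]] == "Monthly return")[1]
returns    <- dat[(header_row + 1):nrow(dat), c(date_col, header_col)]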
