R read excel until you reach a certain criterion

I have a messy Excel file that I need to read in as is, but only up to the row that says "Projects as usual".
This value will always be in the first column, and no other string in that column will ever match it. I also don't want any of the information below it in the other columns, because it causes my numeric columns to be read in as strings (see the score column in the example below).
For example, we can pretend this is the Excel file:
library(tidyverse)
messy_excel <- tibble(id = c("1", "2", NA, NA, "Projects as usual", NA),
name = c("Joe", "Justin", NA, NA, NA, "Other info I don't want"),
score = c("50", "20", NA, NA, NA, "This shouldn't show"))
And this is what I want:
library(tidyverse)
beautiful_excel <- tibble(id = c("1", "2"),
name = c("Joe", "Justin"),
score = c(50, 20))
Thank you!

Edit:
Based on @G5W's suggestion, I filtered for where the value was, getting inspiration from this answer: How to find row number of a value in R code
Specifically, I assigned each row a row number, detected where the target string was, and removed that row and the ones below it. I then used the retype() function from the hablar package to fix the column types.
library(tidyverse)
library(hablar)
messy_excel <- tibble(id = c("1", "2", NA, NA, "Projects as usual", NA),
name = c("Joe", "Justin", NA, NA, NA, "Other info I don't want"),
score = c("50", "20", NA, NA, NA, "This shouldn't show"))
#Give each line a row number
messy_excel$row_num <- seq.int(nrow(messy_excel))
#Identify the row where the garbage starts
messy_row <- which(grepl("Projects as usual", messy_excel$id))
#remove all rows below the garbage, remove the row_num column, correct column types, and remove the rows of all nas
clean_excel <- messy_excel %>%
filter(row_num < messy_row) %>%
dplyr::select(-row_num) %>%
retype() %>%
na.omit()
glimpse(clean_excel)
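The same idea can also be written without the helper row_num column. This is only a sketch in base R: truncate_at_marker() is a hypothetical helper name, and type.convert() is used here as a stand-in for hablar::retype().

```r
# Hypothetical helper: keep only the rows above the first row whose
# first column equals `marker`, then re-guess the column types.
truncate_at_marker <- function(df, marker) {
  hit <- which(df[[1]] == marker)        # which() ignores the NA comparisons
  if (length(hit) == 0) return(df)       # marker absent: return the data unchanged
  out <- df[seq_len(hit[1] - 1), , drop = FALSE]
  out <- out[rowSums(!is.na(out)) > 0, , drop = FALSE]  # drop rows that are all NA
  out[] <- lapply(out, type.convert, as.is = TRUE)      # "50" becomes 50L, text stays text
  out
}

messy_excel <- data.frame(
  id    = c("1", "2", NA, NA, "Projects as usual", NA),
  name  = c("Joe", "Justin", NA, NA, NA, "Other info I don't want"),
  score = c("50", "20", NA, NA, NA, "This shouldn't show"),
  stringsAsFactors = FALSE
)

clean <- truncate_at_marker(messy_excel, "Projects as usual")
str(clean)
```

Note that, like retype(), this also converts id to a number. When reading from a real file, readxl::read_excel() accepts an n_max argument, so once the marker row is known the file could instead be re-read with only the rows above it; that variant is not shown here.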

Related

Joining 'n' number of lists and perform a function in R

I have a data frame that contains many triplicates (sets of 3 columns), and I have split it into a list with one element per triplicate.
The example dataset is,
example_data <- structure(list(`1_3ng` = c(69648445400, 73518145600, NA, NA,
73529102400, 75481088000, NA, 73545910600, 74473949200, 77396199900
), `2_3ng` = c(71187990600, 70677690400, NA, 73675407400, 73215342700,
NA, NA, 69996254800, 69795686400, 76951318300), `3_3ng` = c(65032022000,
71248214000, NA, 72393058300, 72025550900, 71041067000, 73604692000,
NA, 73324202000, 75969608700), `4_7-5ng` = c(NA, 65845061600,
75009245100, 64021237700, 66960666600, 69055643600, NA, 64899540900,
NA, NA), `5_7-5ng` = c(65097201700, NA, NA, 69032126500, NA,
70189899800, NA, 74143529100, 69299087400, NA), `6_7-5ng` = c(71964413900,
69048485800, NA, 71281569700, 71167596500, NA, NA, 68389822800,
69322289200, NA), `7_10ng` = c(71420403700, 67552276500, 72888076300,
66491357100, NA, 68165019600, 70876631000, NA, 69174190100, 63782945300
), `8_10ng` = c(NA, 71179401200, 68959365100, 70570182700, 73032738800,
NA, 74807496700, NA, 71812102100, 73855098500), `9_10ng` = c(NA,
70403756100, NA, 70277421000, 69887731700, 69818871800, NA, 71353886700,
NA, 74115466700), `10_15ng` = c(NA, NA, 68487581700, NA, NA,
69056997400, NA, 67780479400, 66804467800, 72291939500), `11_15ng` = c(NA,
63599643700, NA, NA, 60752029700, NA, NA, 63403655600, NA, 64548492900
), `12_15ng` = c(NA, 67344750600, 61610182700, 67414425600, 65946654700,
66166118400, NA, 70830837700, 67288305700, 69911451300)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
After grouping I get four list elements, since the example dataset above contains 4 groups. I used the following R code to group the data:
grouping_data<-function(df){ #df= dataframe
df_col<-ncol(df) #calculates no. of columns in dataframe
groups<-sort(rep(0:((df_col/3)-1),3)) #creates user determined groups
id<-list() #creates empty list
for (i in 1:length(unique(groups))){
id[[i]]<-which(groups == unique(groups)[i])} #creates list of groups
names(id)<-paste0("id",unique(groups)) #assigns group based names to the list "id"
data<-list() #creates empty list
for (i in 1:length(id)){
data[[i]]<-df[,id[[i]]]} #creates list of dataframe columns sorted by groups
names(data)<-paste0("data",unique(groups)) #assigns group based names to the list "data"
return(data)}
group_data <-grouping_data(example_data)
Please suggest R code that applies a given function to all the list elements at once.
For example, I currently apply the function below in the following way:
#VSN Normalization
vsnNorm <- function(dat) {
dat<-as.data.frame(dat)
vsnNormed <- suppressMessages(vsn::justvsn(as.matrix(dat)))
colnames(vsnNormed) <- colnames(dat)
row.names(vsnNormed) <- rownames(dat)
return(as.matrix(vsnNormed))
}
And this is how I have called it:
vsn.dat0 <- vsnNorm(group_data$data0)
vsn.dat1 <- vsnNorm(group_data$data1)
vsn.dat2 <- vsnNorm(group_data$data2)
vsn.dat3 <- vsnNorm(group_data$data3)
vsn.dat <- cbind (vsn.dat0,vsn.dat1,vsn.dat2,vsn.dat3)
This works well.
But the number of triplicates (sets of 3 columns) may change from dataset to dataset, and calling every list element by hand each time will become tedious.
So please share some code that applies a function to every element of the resulting list and combines the results into a single object.
Thank you in advance.
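The shortcut you are looking for is lapply() over the list, followed by do.call() to bind the results column-wise, as your own cbind() call does:
vsn.dat <- do.call("cbind", lapply(group_data, vsnNorm))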
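As a rough illustration of that pattern on made-up data: toy, group_size and centre() below are all assumptions for the sketch, with centre() acting as a stand-in for vsnNorm() so the example runs without the vsn package. Binding with cbind mirrors the manual cbind() call in the question.

```r
# Toy stand-in for example_data: six columns, i.e. two triplicates.
toy <- data.frame(a1 = 1:4, a2 = 2:5, a3 = 3:6,
                  b1 = 11:14, b2 = 12:15, b3 = 13:16)

group_size <- 3                               # assumed replicate count; change per dataset
grp <- (seq_along(toy) - 1) %/% group_size    # 0 0 0 1 1 1
groups <- split.default(toy, grp)             # list of data frames, one per group

# Hypothetical stand-in for vsnNorm(): centre each column of a group.
centre <- function(dat) scale(as.matrix(dat), center = TRUE, scale = FALSE)

# Apply the function to every group, then bind the results side by side.
combined <- do.call(cbind, lapply(groups, centre))
dim(combined)
colnames(combined)
```

Because the grouping is derived from group_size, nothing has to be edited when the number of groups changes; only group_size needs to match the replicate count.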

Conditionally replace cells in data frame based on another data frame

In the interest of learning better coding practices, can anyone show me a more efficient way of solving my problem? Maybe one that doesn't require new columns...
Problem: I have two data frames: one is my main data table (t) and the other contains changes I need to replace in the main table (Manual_changes). Example: Sometimes the CaseID is matched with the wrong EmployeeID in the file.
I can't provide the main data table, but the Manual_changes file looks like this:
Manual_changes = structure(list(`Case ID` = c(46605, 25321, 61790, 43047, 12157,
16173, 94764, 38700, 41798, 56198, 79467, 61907, 89057, 34232,
100189), `Employee ID` = c(NA, NA, NA, NA, NA, NA, NA, NA, 906572,
164978, 145724, 874472, 654830, 846333, 256403), `Age in Days` = c(3,
3, 3, 12, 0, 0, 5, 0, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA,
-15L), class = c("tbl_df", "tbl", "data.frame"))
temp = merge(t, Manual_changes, by = "Case ID", all.x = TRUE)
temp$`Employee ID.y` = ifelse(is.na(temp$`Employee ID.y`), temp$`Employee ID.x`, temp$`Employee ID.y`)
temp$`Age in Days.y`= ifelse(is.na(temp$`Age in Days.y`), temp$`Age in Days.x`, temp$`Age in Days.y`)
temp$`Age in Days.x` = NULL
temp$`Employee ID.x` = NULL
colnames(temp) = colnames(t)
t = temp
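We could use coalesce, listing the Manual_changes column (.y) first so that a non-missing correction overrides the original value
library(dplyr)
left_join(t, Manual_changes, by = "Case ID") %>%
  mutate(`Employee ID` = coalesce(`Employee ID.y`, `Employee ID.x`),
         `Age in Days` = coalesce(`Age in Days.y`, `Age in Days.x`)) %>%
  select(-ends_with(".x"), -ends_with(".y")) # drop the suffixed helper columns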
Or with data.table, where the columns coming from Manual_changes are referenced with the i. prefix inside the join
library(data.table)
setDT(t)[Manual_changes,
  c('Employee ID', 'Age in Days') :=
    .(fcoalesce(`i.Employee ID`, `Employee ID`),
      fcoalesce(`i.Age in Days`, `Age in Days`)),
  on = "Case ID"]
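For comparison, here is a sketch of the same update in base R that creates no extra columns at all. The data frames main and changes are made-up stand-ins for t and Manual_changes, and the sketch assumes each Case ID appears at most once in changes.

```r
# Made-up stand-ins for the main table and the corrections table.
main <- data.frame(`Case ID` = c(1, 2, 3),
                   `Employee ID` = c(10, 20, 30),
                   `Age in Days` = c(5, 6, 7),
                   check.names = FALSE)
changes <- data.frame(`Case ID` = c(2, 3),
                      `Employee ID` = c(NA, 99),
                      `Age in Days` = c(0, NA),
                      check.names = FALSE)

# For every row of main, the matching row of changes (NA when there is none).
idx <- match(main$`Case ID`, changes$`Case ID`)

for (col in c("Employee ID", "Age in Days")) {
  new <- changes[[col]][idx]                          # correction, or NA
  main[[col]] <- ifelse(is.na(new), main[[col]], new) # keep the original where no correction exists
}
main
```

The lookup with match() plays the role of the left join, and the ifelse() is the same rule as in the question: use the correction when it is present, otherwise keep the original value.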

linear regression model with dplyr on specified columns by name

I have the following data frame, each row containing four dates ("y") and four measurements ("x"):
df = structure(list(x1 = c(69.772808673525, NA, 53.13125414839,
17.3033274666411,
NA, 38.6120670385487, 57.7229000792707, 40.7654208618078, 38.9010405201831,
65.7108936694177), y1 = c(0.765671296296296, NA, 1.37539351851852,
0.550277777777778, NA, 0.83037037037037, 0.0254398148148148,
0.380671296296296, 1.368125, 2.5250462962963), x2 = c(81.3285388496182,
NA, NA, 44.369872853302, NA, 61.0746827226573, 66.3965114460601,
41.4256874481852, 49.5461413070349, 47.0936997726146), y2 =
c(6.58287037037037,
NA, NA, 9.09377314814815, NA, 7.00127314814815, 6.46597222222222,
6.2462962962963, 6.76976851851852, 8.12449074074074), x3 = c(NA,
60.4976916064608, NA, 45.3575294731303, 45.159758146854, 71.8459173097114,
NA, 37.9485456227131, 44.6307631013742, 52.4523342186143), y3 = c(NA,
12.0026157407407, NA, 13.5601157407407, 16.1213657407407, 15.6431018518519,
NA, 15.8986805555556, 13.1395138888889, 17.9432638888889), x4 = c(NA,
NA, NA, 57.3383407228293, NA, 59.3921356160536, 67.4231673171527,
31.853845252547, NA, NA), y4 = c(NA, NA, NA, 18.258125, NA,
19.6074768518519,
20.9696527777778, 23.7176851851852, NA, NA)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
I would like to create an additional column containing the slope of all the y's versus all the x's, for each row (each row is a patient with these 4 measurements).
Here is what I have so far:
df <- df %>% mutate(Slope = lm(vars(starts_with("y") ~
vars(starts_with("x"), data = .)
I am getting an error:
invalid type (list) for variable 'vars(starts_with("y"))'...
What am I doing wrong, and how can I calculate the rowwise slope?
You are using a tidyverse syntax but your data is not tidy...
Maybe you should rearrange your data.frame and rethink the way you store your data.
Here is how to do it in a quick and dirty way (at least if I understood your explanations correctly):
df <- merge(reshape(df[,(1:4)*2-1], dir="long", varying = list(1:4), v.names = "x", idvar = "patient"),
reshape(df[,(1:4)*2], dir="long", varying = list(1:4), v.names = "y", idvar = "patient"))
df$patient <- factor(df$patient)
Then you could loop over the patients, perform a linear regression and get the slopes as a vector:
sapply(levels(df$patient), function(pat) {
coef(lm(y~x,df[df$patient==pat,],na.action = "na.omit"))[2]
})
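If only the slope is needed, it can also be computed row by row without reshaping, using the closed form cov(x, y) / var(x) for a simple regression of y on x. The sketch below uses made-up data; row_slope() is a hypothetical helper, and it assumes the x and y columns appear in matching order (x1 pairs with y1, and so on).

```r
# Hypothetical helper: slope of y on x over the complete (x, y) pairs of one row.
row_slope <- function(x, y) {
  ok <- complete.cases(x, y)
  if (sum(ok) < 2) return(NA_real_)   # a slope needs at least two points
  cov(x[ok], y[ok]) / var(x[ok])
}

# Made-up data: three patients with up to three measurements each.
toy <- data.frame(x1 = c(1, 1, 2), x2 = c(2, NA, 4), x3 = c(3, NA, NA),
                  y1 = c(2, 5, 1), y2 = c(4, NA, 5), y3 = c(6, NA, NA))

xs <- as.matrix(toy[startsWith(names(toy), "x")])
ys <- as.matrix(toy[startsWith(names(toy), "y")])

toy$Slope <- vapply(seq_len(nrow(toy)),
                    function(i) row_slope(xs[i, ], ys[i, ]),
                    numeric(1))
toy$Slope
```

Rows with fewer than two complete pairs get NA rather than an error, which matters for data like the example above where several patients have a single measurement.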

Multiple columns processing and dynamically naming new columns

Variables are mistakenly being entered into multiple columns (e.g. "aaa_1", "aaa_2" and "aaa_3", or "ccc_1", "ccc_2" and "ccc_3"). I need to create a single new column for each (e.g. "aaa" or "ccc"). Some variables are currently in a single column ("hhh_1"), but more columns may be added later (hhh_2 etc.).
This is what I got:
aaa_1 <- c(43, 23, 65, NA, 45)
aaa_2 <- c(NA, NA, NA, NA, NA)
aaa_3 <- c(NA, NA, 92, NA, 82)
ccc_1 <- c("fra", NA, "spa", NA, NA)
ccc_2 <- c(NA, NA, NA, "wez", NA)
ccc_3 <- c(NA, "ija", NA, "fda", NA)
ccc_4 <- c(NA, NA, NA, NA, NA)
hhh_1 <- c(183, NA, 198, NA, 182)
dataf1 <- data.frame(aaa_1,aaa_2,aaa_3,ccc_1,ccc_2, ccc_3,ccc_4,hhh_1)
This is what I want:
aaa <- c(43, 23, NA, NA, NA)
ccc <- c("fra", "ija", "spa", NA, NA)
hhh <- c(183, NA, 198, NA, 182)
dataf2 <- data.frame(aaa,ccc,hhh)
A general solution is needed, as there are ~100 variables (e.g. "aaa", "hhh", "ccc", "ttt", "eee" etc.).
Thanks!
This is a base solution, i.e. no packages.
First define get_only, which, when given a list, converts it to a data.frame and applies get_only to each row. When given a vector, it returns the single non-NA value in it, or NA if there is not exactly one.
Define root to be the column names without the suffixes.
Convert the data frame to a list of columns, group them by root and apply get_only to each such group.
Finally, convert the resulting list to a data frame.
get_only <- function(x) UseMethod("get_only")
get_only.list <- function(x) apply(data.frame(x), 1, get_only)
get_only.default <- function(x) if (sum(!is.na(x)) == 1) na.omit(x) else NA
root <- sub("_.*", "", names(dataf1))
as.data.frame(lapply(split(as.list(dataf1), root), FUN = get_only))
giving:
  aaa  ccc hhh
1  43  fra 183
2  23  ija  NA
3  NA  spa 198
4  NA <NA>  NA
5  NA <NA> 182
We may try with splitstackshape
library(splitstackshape)
nm1 <- sub("_\\d+", "", names(dataf1))
tbl <- table(nm1) > 1
merged.stack(dataf1, var.stubs = names(tbl)[tbl], sep="_")
I'm not sure your example is right. For example in the third row you've got values for both age_1 and age_3, then in the desired output NA for that row.
If I've understood what you're trying to do though, it will be much easier if you transpose columns to rows, fix them and then transpose back again. Try this as a start point using the 'tidyverse' of dplyr and tidyr.
library(tidyverse)
library(stringr)
age_1 <- c(43, 23, 65, NA, 45)
age_2 <- c(NA, NA, NA, NA, NA)
age_3 <- c(NA, NA, 92, NA, 82)
country_1 <- c("fra", NA, "spa", NA, NA)
country_2 <- c(NA, NA, NA, "wez", NA)
country_3 <- c(NA, "ija", NA, "fda", NA)
country_4 <- c(NA, NA, NA, NA, NA)
hight_1 <- c(183, NA, 198, NA, 182)
dataf1 <- data.frame(age_1,age_2,age_3,country_1,country_2, country_3,country_4,hight_1)
data <- dataf1 %>%
mutate(row_num = row_number()) %>% #create a row number to track values
gather(key, value, -row_num) %>% #flatten your data
drop_na() %>% #drop na rows
mutate(key = str_replace(key, "_\\d+$", "")) %>% #remove the numeric '_n' suffix, including multi-digit ones
group_by(row_num) %>%
top_n(1) %>%
spread(key, value) #pivot back to columns
For your example you need the group_by() and top_n() lines to make it run because you've got multiple values in the same row. If you only have one value (as I think you should?) then you can remove these two lines. It will be better without them because then it won't run if your data is wrong.
Edit following comment below. This will make any duplicated entries NA.
data <- dataf1 %>%
mutate(row_num = row_number()) %>% #create a row number to track values
gather(key, value, -row_num) %>% #flatten your data
drop_na() %>% #drop na rows
mutate(key = str_replace(key, "_\\d+$", "")) %>% #remove the numeric '_n' suffix, including multi-digit ones
group_by(row_num, key) %>%
mutate(count = n()) %>% #count how many entries for each row/key combo
mutate(value = ifelse(count > 1, NA, value)) %>% #set NA for rows with duplicates
drop_na() %>%
spread(key, value) %>% #pivot back to columns
select(-count) #drop the `count` variable

How to update values in a for-loop?

I have a for-loop that initializes 3 vectors (launch_2012, amount, and one_week_bf) and creates a data frame. Then it predicts a single week of data, inserts it into the vectors amount and one_week_bf, and recreates the data.frame; this process is looped 8 times. However, I can't seem to get the data.frame to update with the new amounts. Would anyone be able to assist please?
for (i in 1:8) {
launch_2012 <- c(rep('bf', 5), 'launch', rep('af', 7))
amount <- c(7946, 6641, 5975, 5378, 5217, NA, NA, NA, NA, NA, NA, NA, NA)
one_week_bf <- c(NA, 7946, 6641, 5975, 5378, 5217, NA, NA, NA, NA, NA, NA, NA)
newdata <- data.frame(amount = amount, one_week_bf = one_week_bf, launch = launch_2012, week = week)
predicted <- predict(model0a, newdata)
amount[i+5] <- predicted[i+5]
one_week_bf[i+6] <- predicted[i+5]
View(newdata)
}
It's difficult to be sure since your example is not reproducible, but note that predict.lm(...) by default has na.action=na.pass, which means that any row in newdata that contains an NA value generates NA for the prediction. Since your first pass of newdata has NA in rows 6-13, predicted will have NA in those same elements. This means that amount and one_week_bf will have NA in those elements, which in turn will generate the same newdata each time.
None of this should be in a for loop.
x <- data.frame("launch_2012" = c(rep('bf', 5), 'launch', rep('af', 7)),
"amount"=c(7946, 6641, 5975, 5378, 5217, NA, NA, NA, NA, NA, NA, NA, NA),
"one_week_bf"=c(NA, 7946, 6641, 5975, 5378, 5217, NA, NA, NA, NA, NA, NA, NA))
# x$new_amount      <- the replacement from your predict vector
# x$new_one_week_bf <- the replacement from your predict vector
Note that I have no idea what model0a does, so I have only indicated that the new columns should hold whatever vector your predict call returns. This will add the new data as new columns.
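If each week's prediction really does have to feed the next week's one_week_bf, a loop is still workable as long as the vectors are initialized once, outside it, and only the row being predicted is passed to predict(). The following is a sketch under loud assumptions: fit is a made-up lm standing in for model0a, it uses one_week_bf as its only predictor, and the training data is invented for illustration.

```r
# Invented training data: this week's amount against last week's amount.
train <- data.frame(one_week_bf = c(7946, 6641, 5975, 5378),
                    amount      = c(6641, 5975, 5378, 5217))
fit <- lm(amount ~ one_week_bf, data = train)   # hypothetical stand-in for model0a

# Initialize ONCE, outside the loop, so earlier updates are not overwritten.
amount <- c(7946, 6641, 5975, 5378, 5217, rep(NA, 8))

for (i in 6:13) {
  # Predict only week i; its lagged value is the (possibly predicted) week i - 1.
  step <- data.frame(one_week_bf = amount[i - 1])
  amount[i] <- predict(fit, newdata = step)
}

one_week_bf <- c(NA, head(amount, -1))          # the lag column, rebuilt at the end
newdata <- data.frame(amount = amount, one_week_bf = one_week_bf)
newdata
```

Passing a one-row newdata avoids the NA rows described above, because the only input to each prediction is a value that has already been filled in.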
