How to attach data to columns in existing dataframe - r

I'm attempting to import multiple csv files and create a new dataframe that includes specific columns (some with the same name; some different) from each of these files. So far I have been able to create the dataframe with the specific columns I want, but somewhere in my code my data gets lost and doesn't transfer over to each column.
I would also like to create a new column named status in which each cell equals Lost, Gained, or Neutral, depending on whether the value from the all_v.csv file is also found in lost_v and/or gained_v. If it is found in neither, then it is Neutral. I attempted to write a line of code for this, but I won't know if it works until I am able to attach the correct data to each column.
This would give me a total of 8 columns:
pre_contact, status, gained_variation, lost_variation, coord.lat, coord.long, country, Date
Most of these columns come from the 4 files listed below with the exception of the status column:
all_v - pre_contact
status - Lost / Gained / Neutral
gained_v - gained_variation
lost_v - lost_variation
SOUTH - coord.lat, coord.long, country, Date
Another issue I'm facing is that my data frames have different numbers of rows. When I attempt to merge or use rbind, I get an error saying that my rows do not line up because some columns are longer than others, so I would like a way to fix this by padding the shorter columns with NAs.
Here is my sample code:
folder_path<- setwd("/directory/")
setwd(folder_path)
#this creates a table with two columns: filename and file_path but I'm not sure how to utilize it
all_of_them <- fs::dir_ls(folder_path, pattern="*.csv")
file_names <- tibble(filename = list.files(folder_path))
file.paths <- file_names %>% mutate(filepath = paste0(folder_path, filename))
#Each file I want to use
gained_v <- read.csv("gained.csv", header = TRUE)
lost_v <- read.csv("lost.csv", header = TRUE)
all_v <- read.csv("all.csv", header = TRUE)
SOUTH <- read.csv("SOUTH.csv", header = TRUE)
files = list.files(pattern="*.csv", full.names = TRUE)
for (i in 1:length(files)){
data <-
files %>%
map_df(~fread(.))
}
# Set Column Names
subset_data <- data.frame(data)
subset_data$status <- with(subset_data, subset_data$pre_contact == subset_data$gained_variation | subset_data$pre_contact == subset_data$lost_variatiion)
subset_data <- subset(subset_data, select = c(pre_contact,status, gained_variation,lost_variatiion,coord.lat,coord.long, country, Date))
subset_data <- as_tibble(subset_data)
write.csv(subset_data, "subset_data.csv")
status_data = read.csv("subset_data.csv", header = TRUE)
status_data <- data.frame(subset(status_data, select = -c(X)))
status_data <- tibble(status_data)
So far my output looks like this (where the only data showing is from my pre_contact column):
pre_contact status gained_variation lost_variation coord.lat coord.long country Date
1234 NA NA NA NA NA
6543 NA NA NA NA NA
9876 NA NA NA NA NA
1233 NA NA NA NA NA
1276 NA NA NA NA NA
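One possible way to approach both parts of this (padding unequal columns with NA, then deriving status) is sketched below in base R. The vectors are hypothetical stand-ins for the columns read from all.csv, gained.csv and lost.csv, pad_na is an illustrative helper rather than an existing function, and the rule that "Lost" takes precedence when a value appears in both files is an assumption, since the question does not say.

```r
# Hypothetical stand-ins for the columns read from the CSV files
pre_contact      <- c(1234, 6543, 9876, 1233, 1276)
gained_variation <- c(6543, 1111)
lost_variation   <- c(9876, 1276, 2222)

# Illustrative helper: pad a vector with NA up to length n
pad_na <- function(x, n) c(x, rep(NA, n - length(x)))

# Pad every column to the length of the longest one so they line up
n <- max(length(pre_contact), length(gained_variation), length(lost_variation))
combined <- data.frame(
  pre_contact      = pad_na(pre_contact, n),
  gained_variation = pad_na(gained_variation, n),
  lost_variation   = pad_na(lost_variation, n)
)

# Classify by membership (%in%), not row-by-row equality (==).
# Assumption: "Lost" wins if a value appears in both files.
combined$status <- ifelse(combined$pre_contact %in% lost_variation, "Lost",
                   ifelse(combined$pre_contact %in% gained_variation, "Gained",
                          "Neutral"))
print(combined)
```

Using %in% tests whether a value occurs anywhere in the other column, whereas == only compares values that happen to sit in the same row.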


R - Exporting a List through Openxlsx with Separately Sorted Columns Placing NA Values Last

Solution found! Scroll to the end to see what I did. Hopefully, this function can help others.
TLDR: I have a list: https://i.stack.imgur.com/7t6Ej.png
and I need to do something like this to it
lapply(irabdf, function(x) c(x[!is.na(x)], x[is.na(x)]))
But I need this function to do this to each element of the list individually, and not delete the column names. Currently, I can get it to sort lowest to highest but it moves everything into a single column and deletes the column names.
I have a list in R that I am exporting as a XLS file using the Openxlsx package. I have everything that I need functionally, but my P.I has requested that I sort each column lowest to highest for reviewers, as there are a lot of empty cells that make the document look funny. I am trying to add this feature in R so that I don't need to do it manually. All columns were created from separate .csv files, and rows are unimportant.
The List: https://i.stack.imgur.com/7t6Ej.png
The generated XLSX file looks like this: https://i.stack.imgur.com/ftg00.png.
The columns are not blank, the data is just much further down.
My code for writing the file:
wb <- createWorkbook()
lapply(names(master), function(i){
addWorksheet(wb=wb, sheetName = names(master[i]))
writeData(wb, sheet = i, master[[i]])
addFilter(wb, sheet = i, rows = 1, cols = 1:(a))
})
#Save Workbook
saveWorkbook(wb, saveFile, overwrite = TRUE)
Here a is the value obtained through length(unique(x)), where x holds the levels of a variable.
What I have:
Column1, Column2, Column3, Column4
1. 1 NA NA NA
2. 2 NA NA NA
3. NA 3 NA NA
4. NA 4 NA NA
5. NA NA 5 NA
6. NA NA 6 NA
7. NA NA NA 7
8. NA NA NA 8
What I want:
Column1, Column2, Column3, Column4
1. 1 3 5 7
2. 2 4 6 8
3. NA NA NA NA
4. NA NA NA NA
5. NA NA NA NA
6. NA NA NA NA
7. NA NA NA NA
8. NA NA NA NA
The actual file has 1,000's of rows and 100's of blank cells for each column. The solution would replicate this across all tabs of the XLSX file.
What I have tried:
In a previous version of this script I was able to do this. I had separate df's which were assigned names through user-dialogue options. This is an example of the code I used to do that.
irabdf <- masterdf %>%
filter(Fluorescence == "Infrared") %>%
select(mean, Conditions) %>%
mutate(row = row_number()) %>%
spread(Conditions, mean) %>%
select(!row)
irabdf <- lapply(irabdf, function(x) c(x[!is.na(x)], x[is.na(x)])) %>% ## Move NAs to the bottom of the df
data.frame()
# Create a blank workbook
WB <- createWorkbook()
# Add some sheets to the workbook
addWorksheet(WB, gab)
addWorksheet(WB, rab)
addWorksheet(WB, irab)
# Write the data to the sheets
writeData(WB, sheet = gab, x = gabdf)
writeData(WB, sheet = rab, x = rabdf)
writeData(WB, sheet = irab, x = irabdf)
# Reorder worksheets
worksheetOrder(WB) <- c(1:3)
# Export the file
saveWorkbook(WB, saveFile)
Now that I have removed the user interface and am now using a list I can no longer do this. I have also tried a myriad of other things with most utilizing lapply.
If you need any more information just ask.
Thanks in advance for the assistance!
09/21
I think I am getting closer but I still haven't resolved the issue.
When I use this code
list <- lapply(master[[1]],
function(x) c(x[!is.na(x)], x[is.na(x)]))
I get the results I want but end up losing the first element. If I could keep the first element and apply this over my entire list that should do the trick.
09/22
I have found something that works! However, it isn't dynamic. If someone could help me loop this function across all of the elements of this list (or knows a better solution) just let me know.
list1 <- lapply(master[[1]],
function(x) c(x[!is.na(x)], x[is.na(x)]))
list1 <- data.frame(list1)
master[[1]] <- list1
I need to specify list1 as a df for me to maintain my column names in my XLSX output.
09/22 - 2
Okay, I have the script doing exactly what I want it to do. However, it isn't pretty and it isn't "very" dynamic.
+rep to anyone who can help me convert this into a pretty lapply loop!
if (b >= 1) {
list1 <- lapply(master[[1]],
function(x) c(x[!is.na(x)], x[is.na(x)]))
list1 <- data.frame(list1)
master[[1]] <- list1
}
if (b >= 2) {
list2 <- lapply(master[[2]],
function(x) c(x[!is.na(x)], x[is.na(x)]))
list2 <- data.frame(list2)
master[[2]] <- list2
}
etc. (repeated 12 times)
b has a value of 12 here. However, it could be any number practically.
09/22 - 3
Alright, I figured it out. I created the following loop to do what I needed to do and everything appears to be working perfectly. Part of me wants to scream from happiness.
for (i in 1:length(unique(masterdf$ABwant))) {
if (i >= 1)
list.i <- lapply(master[[i]],
function(x) c(x[!is.na(x)], x[is.na(x)]))
list.i<- data.frame(list.i)
master[[i]] <- list.i
}
I'll keep the thread open the rest of the week and if someone has a better solution I will accept it and give you some rep. Else, GG.
This was the code that I used to create the loop that I wanted.
for (i in 1:length(unique(masterdf$ABwant))) {
if (i >= 1)
list.i <- lapply(master[[i]],
function(x) c(x[!is.na(x)], x[is.na(x)]))
list.i<- data.frame(list.i)
master[[i]] <- list.i
}
Using OpenXLSX, I was able to use this loop to create an Excel file that has a separate tab for each antibody and has all columns sorted with NA values placed at the bottom.
### Creating the Excel file
wb <- createWorkbook()
lapply(names(master), function(i){
  addWorksheet(wb=wb, sheetName = names(master[i]))
  writeData(wb, sheet = i, master[[i]])
})
# Saving the Excel file
saveWorkbook(wb, saveFile, overwrite = TRUE)
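For anyone who wants the "pretty lapply" version asked for earlier in the thread, here is a minimal sketch. The master list below is a small hypothetical stand-in for the real one, and move_na_last is just an illustrative name for the same NA-shifting function used above.

```r
# Same idea as in the thread: keep non-NA values in order, push NAs to the end
move_na_last <- function(x) c(x[!is.na(x)], x[is.na(x)])

# Hypothetical stand-in for the real list of data frames
master <- list(
  A = data.frame(Column1 = c(1, 2, NA, NA), Column2 = c(NA, NA, 3, 4)),
  B = data.frame(Column1 = c(NA, 5),        Column2 = c(6, NA))
)

# Outer lapply walks the list, inner lapply walks the columns.
# Assigning into df[] keeps the result a data frame with its column names.
master <- lapply(master, function(df) {
  df[] <- lapply(df, move_na_last)
  df
})
print(master)
```

The outer lapply handles any number of list elements, so no counter such as b is needed. If the columns should also be sorted lowest to highest, sort(x, na.last = TRUE) could be swapped in for move_na_last.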

Vlookup/ Match function in R for continuous column in R

I have two data frames.
df1:
Dis1_SubDIs1_Village1 Dis2_SubDIs1_Village1 Dis1_SubDIs2_Village1
JODHPUR|JODHPUR|JODHPUR |JODHPUR|JODHPUR JODHPUR||JODHPUR
JHUNJHUNUN|JHUNJHUNUN|BARI |JHUNJHUNUN|BARI JHUNJHUNUN|BARI|BARI
BUNDI|HINDOLI|BUNDI |HINDOLI|BUNDI BUNDI|BUNDI|BUNDI
SIROHI|SIROHI|SIROHI |SIROHI|SIROHI SIROHI||SIROHI
ALWAR|ALWAR|BASAI |ALWAR|BASAI ALWAR||BASAI
BHARATPUR|BHARATPUR|SEEKRI |BHARATPUR|SEEKRI BHARATPUR||SEEKRI
and second data,
df2 :
High
|BHARATPUR|SEEKRI
BUNDI|HINDOLI|BUNDI
SIROHI||SIROHI
CHURU|TARANAGAR|DABRI CHHOTI
Now, I want to apply a vlookup/match to each column of df1 against the High column of df2, the same way we do in Excel.
If there is an exact match, give me the matched value, else 0.
I tried making the function in R
For match
for(i in names(df1)){
match_vector = match(df_final[,i], df$High, incomparables = NA)
df1$High = df2$High[match_vector]
}
but I am getting an error. It only shows the result for the last column and overwrites the values from the other columns.
For vlookup:
func_vlook = function(a){
for(i in 1:ncol(a)) {
lookup_df = vlookup_df(lookup_value = i,
dict = df2,
lookup_column = 1)
}
return(lookup_df)
}
lookup_df <- func_vlook(a = df1)
Still getting an error.
My final Output should be like the below attached with df file:
Dis1_SubDIs1_Village1_M1 Dis2_SubDIs1_Village1_M2 Dis1_SubDIs2_Village1_M3
NA NA NA
NA NA NA
BUNDI|HINDOLI|BUNDI NA NA
NA SIROHI||SIROHI SIROHI||SIROHI
NA NA NA
NA NA |BHARATPUR|SEEKRI
For N input columns, there should be N output columns with the matches.
Please help.
No need for any loops with this one: apply and match should work fine. apply will iterate over as many columns as you have, so the output will have the same number of columns as the input. In your example, apply will simplify the result to a matrix.
apply(X = df1,
MARGIN = 2,
FUN = function(x) df2$High[match(x, df2$High)])
If you need a dataframe as the output, then wrap the code above in as.data.frame():
as.data.frame(apply(X = df1,
MARGIN = 2,
FUN = function(x) df2$High[match(x, df2$High)]))
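As a quick check of the match-based lookup, here is a sketch on a cut-down, hypothetical version of the data (columns A and B are made-up names). It uses lapply instead of apply, which is a swapped-in variant that avoids the intermediate matrix conversion, and the _M1, _M2 suffixes are added only to mimic the output naming in the question.

```r
# Cut-down hypothetical versions of df1 and df2
df1 <- data.frame(
  A = c("JODHPUR|JODHPUR|JODHPUR", "BUNDI|HINDOLI|BUNDI"),
  B = c("|JODHPUR|JODHPUR", "|HINDOLI|BUNDI"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  High = c("BUNDI|HINDOLI|BUNDI", "SIROHI||SIROHI"),
  stringsAsFactors = FALSE
)

# Return the matched value from df2$High, or NA where there is no exact match
lookup <- function(x) df2$High[match(x, df2$High)]

# One output column per input column
result <- as.data.frame(lapply(df1, lookup), stringsAsFactors = FALSE)
names(result) <- paste0(names(df1), "_M", seq_along(df1))
print(result)
```

Unmatched cells come back as NA; if 0 is preferred, they could be replaced afterwards with result[is.na(result)] <- 0.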

Why does code throw error when looped? My code works when I increment index "by hand" but when I put in loop it fails

I want to append values from one dataframe as column names to another data frame.
I've written code that will produce one column at a time if I "manually" assign index values:
df_searchtable <- data.frame(category = c("air", "ground", "ground", "air"), wiggy = c("soar", "trot", "dive", "gallop"))
df_host <- data.frame(textcolum = c("run on the ground", "fly through the air"))
#create vector of categories
categroups <- as.character(unique(df_searchtable$category))
##### if I assign colum names one at a time using index numbers no prob:
group = categroups[1]
df_host[, group] <- NA
##### if I use a loop to assign the column names:
for (i in categroups) {
group = categroups[i]
df_host[, group] <- NA
}
the code fails, giving:
Error in [<-.data.frame(`*tmp*`, , group, value = NA) :
missing values are not allowed in subscripted assignments of data frames
How can I get around this problem?
Here's a simple base R solution:
df_host[categroups] <- NA
df_host
textcolum air ground
1 run on the ground NA NA
2 fly through the air NA NA
The problem with your loop is that you are looping through each element whereas your code assumes you are looping through 1, 2, ..., n.
For instance:
for (i in categroups) {
print(i)
print(categroups[i])
}
[1] "air"
[1] NA
[1] "ground"
[1] NA
To fix your loop, you could do one of two things:
for (group in categroups) {
df_host[, group] <- NA
}
# or
for (i in seq_along(categroups)) {
group <- categroups[i]
df_host[, group] <- NA
}
Here's a solution using purrr's map.
bind_cols(df_host,
map_dfc(categroups,
function(group) tibble(!!group := rep(NA_real_, nrow(df_host)))))
Gives:
textcolum air ground
1 run on the ground NA NA
2 fly through the air NA NA
map_dfc maps over the input categroups, creates a single-column tibble for each one, and joins the newly created tibbles into a dataframe
bind_cols joins the original dataframe to your new tibble
Alternatively you could use walk:
walk(categroups, function(group){df_host <<- mutate(df_host, !!group := rep(NA_real_, nrow(df_host)))})
Here's an ugly base R solution: create an empty matrix with the column names and cbind it to the second dataframe.
df_searchtable <- data.frame(category = c("air", "ground", "ground", "air"),
wiggy = c("soar", "trot", "dive", "gallop"),
stringsAsFactors = FALSE)
df_host <- data.frame(textcolum = c("run on the ground", "fly through the air"),
stringsAsFactors = FALSE)
cbind(df_host,
matrix(nrow = nrow(df_host),
ncol = length(unique(df_searchtable$category)),
dimnames = list(NULL, unique(df_searchtable$category))))
Result:
textcolum air ground
1 run on the ground NA NA
2 fly through the air NA NA

replacing missing values(NAs) with white space(" ") or dot(".") while exporting a dataframe using write.csv

I have a typical data frame that contains missing values. I want to export this data to a csv file or an Excel workbook, but I want to handle the missing values differently because I will use this data frame in STATA, which does not accept NA as a missing value.
I know R by default represents any missing value as NA. Is there a way to tell R to handle this differently while exporting, say by using a white space or a dot to mean a missing value in the exported file that I will use in STATA?
Thank you
From ?write.csv:
na the string to use for missing values in the data.
E.g. write.csv(x, file = "foo.csv", na='.')
Sample data:
library(data.table)
dt <- data.table("col1" = c(1,2,NA),
"col2" = c(NA,NA,0))
> dt
col1 col2
1: 1 NA
2: 2 NA
3: NA 0
Replace NAs with "."
dt[is.na(dt)] <- "."
> dt
col1 col2
1: 1 .
2: 2 .
3: . 0
write.csv(dt,"test2.csv",na=".",row.names = FALSE)
You could create a new dataset where you replace NA for a character.
E.g. data[is.na(data)]<-"."
Here is an example based on my comment:
df <- mtcars
df$miss <- NA
# NA values as empty cells
write.csv(df, file = "df.csv", na = "") # for csv file
xlsx::write.xlsx(df, file = "df.xlsx", showNA = FALSE) # for excel file
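To confirm what actually lands in the file, a small round-trip sketch with made-up data is shown below. It writes to a temporary file with na = "." and reads it back with na.strings = ".", so the dots become NA again in R.

```r
# Hypothetical data with missing values
df <- data.frame(col1 = c(1, 2, NA), col2 = c(NA, NA, 0))

# Write to a temporary file, using "." for missing values
path <- tempfile(fileext = ".csv")
write.csv(df, path, na = ".", row.names = FALSE)

# Inspect the raw lines of the file
lines <- readLines(path)
print(lines)

# Read it back, telling R that "." means missing
back <- read.csv(path, na.strings = ".")
print(back)
```

Passing na to write.csv leaves the data frame itself untouched, so the columns stay numeric in R, unlike replacing the NAs with "." before writing.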

R: Mean subset sequence from dataframe

I have been stuck on a problem for three days and I cannot work out why my code does not work. I have tried many different approaches, but I am only going to post the one I believe is closest to the solution, using a reduced example of what I want to ask.
I have 7 csv files (called 001.csv, 002.csv, ... etc), in a folder called "Myfolder".
I have been trying to write a function that merges all these .csv files into a single data.frame using a for-loop and rbind, and finally returns the mean of either column "Colour1" or "Colour2", depending on the "colour" (column) and the "Children" (rows) I choose, ignoring missing values (NA). As an example, when I merge the files I get a data frame like this:
Colour1 Colour2 Children
NA NA 1
9 NA 2
NA NA 2
NA 5 3
7 NA 4
NA NA 5
NA 8 5
2 NA 6
6 3 6
14 NA 7
This is the signature of the function I want to build: get_mean <- function(directory, colour, children)
What I have tried
get_mean <- function(directory, colour, children) {
files <- list.files(directory, full.names=TRUE)
allfiles <- data.frame()
for(i in 1:7) {
allfiles <- rbind(allfiles, read.csv(files[i]))
}
if(colour == "colour1"){
mean(allfiles$colour1[allfiles$Children == children], na.rm = TRUE)
}
if(colour == "colour2"){
mean(alllists$colour2[alllist$Children == children], na.rm = TRUE)
}
}
When I tried for example:
get_mean("Myfolder", "colour1", 3:6)
I get
In alllist$ID == id :
longer object length is not a multiple of shorter object length
and when I try:
get_mean("Myfolder", "colour1", 6)
I get:
>
Yes guys....I get back absolutely nothing.
What do you think guys? any correction to it? any other way to get the mean?
Note: the data I put in here is not the data I am actually using. This is just an example taken from a much bigger exercise. I have tried to make a really small example with different names and numbers so that the discussion is not about the exercise itself and others can reuse the solution.
Here is a corrected and more readable version of your function. I renamed your data.frame allfiles to df, added a check on colour, and used %in% so that children can be a vector such as 3:6:
get_mean <- function(directory, colour, children) {
  files <- list.files(directory, full.names = TRUE)
  # read the first 7 files and stack them into one data.frame
  df <- do.call(rbind, lapply(files[1:7], read.csv))
  # check the colour argument
  if (!is.element(colour, c('colour1', 'colour2')))
    stop(sprintf('colour argument value %s is not part of df column', colour))
  # %in% handles a vector of children; == would recycle and give the length warning
  mean(df[[colour]][df$Children %in% children], na.rm = TRUE)
}
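To try this kind of function end to end, here is a self-contained sketch that writes two small hypothetical CSV files to a temporary folder and computes the mean from them. It is a variant, not the exact function above: it reads every .csv in the folder instead of the first seven, checks colour against the actual column names, and uses %in% so that a vector such as 3:6 selects all matching rows. The folder and file contents are made up from the example table in the question.

```r
get_mean <- function(directory, colour, children) {
  files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  # Stack all files into one data frame
  df <- do.call(rbind, lapply(files, read.csv))
  # Fail early if the requested column does not exist
  if (!colour %in% names(df)) stop(sprintf("column '%s' not found", colour))
  # %in% keeps every row whose Children value is in the requested set
  mean(df[[colour]][df$Children %in% children], na.rm = TRUE)
}

# Hypothetical demo folder with two files based on the example table
dir <- tempfile("Myfolder_")
dir.create(dir)
write.csv(data.frame(Colour1  = c(NA, 9, NA, NA, 7),
                     Colour2  = c(NA, NA, NA, 5, NA),
                     Children = c(1, 2, 2, 3, 4)),
          file.path(dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(Colour1  = c(NA, NA, 2, 6, 14),
                     Colour2  = c(NA, 8, NA, 3, NA),
                     Children = c(5, 5, 6, 6, 7)),
          file.path(dir, "002.csv"), row.names = FALSE)

print(get_mean(dir, "Colour1", 3:6))  # mean of 7, 2 and 6
print(get_mean(dir, "Colour2", 6))    # only one non-missing value: 3
```

The mean is the last expression evaluated in the function, so it is returned and printed. In the original version the last expression was an if whose condition was FALSE, which is why the call appeared to return nothing.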
