When an Excel sheet has merged cells, importing the data gives generic column names for the subsequent columns, as shown in the picture below.
[Image: R data frame from an Excel sheet with merged cells]
So is it possible to copy the name of a column to the column to its right?
In this example it would be copying "Sulfur dioxide Results" to overwrite X_6 and X_7, and "Ethanol Results" to X_8 and X_9 etc.
All the column names of interest end with "Results", so I'm considering whether I can select the columns based on the "Results" in the name and copy that name to the 2 columns to its right.
There are many more columns, but they follow the same pattern. The number of columns and their names are likely to change, but "Results" will still be in the names.
This solution uses sapply over the positions of the data frame's column names. For each column, it checks whether the name one or two positions earlier ends in "results". If so, it copies that earlier name over; otherwise the column name is left unchanged.
df <- data.frame(one_results=c(1:3), blah=c(4:6), star=c(7:9), col=c(1:3))
df
names(df) <- sapply(seq_along(names(df)), function(x) {
  if (x > 1 && grepl("results$", names(df)[x-1])) {
    return(names(df)[x-1])
  }
  else if (x > 2 && grepl("results$", names(df)[x-2])) {
    return(names(df)[x-2])
  }
  else {
    return(names(df)[x]) # do not alter the column name in this case
  }
})
df
Output:
one_results blah star col
1 1 4 7 1
2 2 5 8 2
3 3 6 9 3
one_results one_results one_results col
1 1 4 7 1
2 2 5 8 2
3 3 6 9 3
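The same idea can be sketched without hard-coding the two lookback positions. This is only a sketch under assumptions: fill_right is a hypothetical helper, the column names below imitate the ones in the question, and the width of each merged block is assumed to be fixed at 2 columns to the right.

```r
# Hypothetical column names imitating an import with merged header cells
nm <- c("Sample", "Sulfur dioxide Results", "X_6", "X_7",
        "Ethanol Results", "X_8", "X_9")

# Copy every name matching `pattern` to the `width` columns on its right
fill_right <- function(nm, pattern = "Results$", width = 2) {
  hits <- grep(pattern, nm)            # positions of the "Results" columns
  for (i in hits) {
    idx <- i + seq_len(width)          # positions to the right of the hit
    idx <- idx[idx <= length(nm)]      # do not run past the last column
    idx <- setdiff(idx, hits)          # never overwrite another "Results" column
    nm[idx] <- nm[i]
  }
  nm
}

new_nm <- fill_right(nm)
```

On a real data frame this would be applied as names(df) <- fill_right(names(df)). Matching is case-sensitive, so the pattern has to be spelled the way it appears in the imported headers.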
I have a dataset with multiple variables. Each question has the actual survey answer and three other characteristics, so there are four variables for each question. I want to specify that if Q135_L == 1, Q135_RT is left as it is; otherwise it is coded as NA. I can do that with an ifelse statement.
df$Q135_RT <- ifelse(df$Q135_L == 1, df$Q135_RT, NA)
However, I have hundreds of variables and the names are not related. For example, in the picture we can see Q135, SG1_1 and so on. How can I specify, for the whole dataset, that if a variable ending in _L equals 1, the corresponding variable ending in _RT should remain as it is, and otherwise the variable ending in _RT should be coded as NA?
I tried this but it only returns NAs
ifelse(grepl("//b_L" ==1, df), "//b_RT" , NA)
If I understand your problem correctly, you have a data frame of which the columns represent survey question variables. Each column contains two identifiers, namely: a survey question number (134, 135, etc) and a variable letter (L, R, etc). Because you provide no reproducible example, I tried to make a simplified example of your data frame:
set.seed(5)
DF <- data.frame(array(sample(1:4, 24, replace = TRUE), c(4,6)))
colnames(DF) <- c("Q134_L","Q135_L", "Q134_RT", "Q135_RT", "Q_L1", "Q134_S")
DF
# Q134_L Q135_L Q134_RT Q135_RT Q_L1 Q134_S
# 1 2 3 2 3 1 1
# 2 3 1 3 2 4 4
# 3 1 1 3 2 4 3
# 4 3 1 3 3 2 1
What you want is that if Q135_L == 1, leave Q135_RT as it is, otherwise code it as NA. Here is a function that implements this recoding logic:
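recode <- function(yourdf, questnums) {
  charnum <- as.character(questnums)
  for (k in seq_along(charnum)) {
    # columns whose name contains question number k and ends in _L
    col_end_L_k <- yourdf[grepl("_L\\b", colnames(yourdf)) &
                          grepl(charnum[k], colnames(yourdf))]
    # columns whose name contains question number k and ends in _RT
    col_end_R_k <- yourdf[grepl("_RT\\b", colnames(yourdf)) &
                          grepl(charnum[k], colnames(yourdf))]
    # rows where the _L column equals 1
    row_is_1 <- which(col_end_L_k == 1)
    # every other row becomes NA; setdiff() also covers the case where no row equals 1
    not_1 <- setdiff(seq_len(nrow(yourdf)), row_is_1)
    col_end_R_k[not_1, ] <- NA
    yourdf[, colnames(col_end_R_k)] <- col_end_R_k
  }
  return(yourdf)
}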
This function takes a data frame and a vector of question numbers, and then returns the data frame that has been recoded.
What this function does:
1. It loops over the question numbers with for.
2. It uses grepl to select the column whose name contains the current number and ends in _L.
3. It does the same for the column whose name ends in _RT.
4. It uses which to find the rows of the _L column that contain 1.
5. It keeps the values of the matching _RT column in those rows and changes the values in all other rows to NA.
The result:
recode(DF, 134:135)
# Q134_L Q135_L Q134_RT Q135_RT Q_L1 Q134_S
# 1 2 3 NA NA 1 1
# 2 3 1 NA 2 4 4
# 3 1 1 3 2 4 3
# 4 3 1 NA 3 2 1
Note that the Q_L1 column is not affected, because _L in this column is not located at the end of the column name.
As for how to define questnums, the question numbers, you just need to create a numeric vector. Examples:
Your questnums are 1 to 200. Then use 1:200 or seq(200), so recode(DF, 1:200).
Your questnums are 1, 3, 134, 135. Then, use recode(DF, c(1, 3, 134, 135)).
You can also assign the question numbers to an object first, such as n = c(25, 135, 145), and then use it: recode(DF, n)
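Since the question says the variable names are unrelated (Q135, SG1_1 and so on), a variant that needs no question numbers may be useful. The following is only a sketch: recode_by_suffix and the toy data are hypothetical, and it assumes every _RT column shares its exact prefix with one _L column.

```r
# Hypothetical helper: pair each *_L column with the *_RT column of the same prefix
recode_by_suffix <- function(d) {
  l_cols <- grep("_L$", names(d), value = TRUE)   # all columns ending in _L
  for (l in l_cols) {
    rt <- sub("_L$", "_RT", l)                    # name of the matching _RT column
    if (rt %in% names(d)) {
      # keep _RT where _L is 1; everything else (including NA in _L) becomes NA
      d[[rt]][!(d[[l]] %in% 1)] <- NA
    }
  }
  d
}

# Toy data with two unrelated prefixes
toy <- data.frame(Q135_L = c(1, 2, 1), Q135_RT = c(10, 20, 30),
                  SG1_1_L = c(2, 1, NA), SG1_1_RT = c(5, 6, 7))
out <- recode_by_suffix(toy)
```

Deriving the _RT name from the _L name avoids partial matches between question numbers, such as "1" matching "Q134".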
I need to append dataframes stored in different lists in R, and separate them by an empty row (in the Excel file).
I can't bind the dataframes from the different lists into one because they have different numbers of columns.
I also can't use the packages 'xlsx' and 'XLConnect' because they give me problems related to Java.
Any help is welcome.
The first list of dataframe:
listofdfs <- list(x <- data.frame("y"=c(2009,2010,2011),"b"=c(35,30,20)), y <- data.frame("y"=c(2009,2010,2011), "b"=c(6,21,40)) )
label <- c("Red","Green")
listofdfs <- setNames(listofdfs, label)
$Red
y b
1 2009 35
2 2010 30
3 2011 20
$Green
y b
1 2009 6
2 2010 21
3 2011 40
The second list of dataframes (with more columns than the previous ones):
listofdfs_2 <- list(x <- data.frame("y"=c(2009,2010,2011),"x_1"=c(35,30,20), "x_2"=c(1,2,0), "x_3"=c(6,0,3), "x_4"=c(12,5,8)), y <- data.frame("y"=c(2009,2010,2011), "x_1"=c(6,21,40), "x_2"=c(3,5,0), "x_3"=c(6,9,12), "x_4"=c(8,5,1)) )
label <- c("Red","Green")
listofdfs_2 <- setNames(listofdfs_2, label)
$Red
y x_1 x_2 x_3 x_4
1 2009 35 1 6 12
2 2010 30 2 0 5
3 2011 20 0 3 8
$Green
y x_1 x_2 x_3 x_4
1 2009 6 3 6 8
2 2010 21 5 9 5
3 2011 40 0 12 1
I'd like to obtain on the same excel sheet the tables in this way:
Using openxlsx, which I hope works better for you than the other packages you mentioned, you can do:
library(openxlsx)
# create your workbook
mywb <- createWorkbook()
# create the sheets you need based on the first list of tables
for (sheetName in names(listofdfs)){
  addWorksheet(mywb, sheetName)
}
# get all your lists of tables in a single list
l_listOfDF <- mget(ls(pattern="listofdf"))
# initiate the index of the row where you will want to start writing (one per sheet)
startR <- rep(1, length(listofdfs)) ; names(startR) <- names(listofdfs)
# loop over the lists of tables using index and then over the elements / sheets using their names
for(N_myListOfDF in seq_along(l_listOfDF)){
  for(pageName in names(l_listOfDF[[N_myListOfDF]])){
    # write the name/number of your table in the correct sheet, at the correct row
    writeData(mywb, sheet=pageName, startRow=startR[pageName], paste0("Table ", N_myListOfDF, "."))
    # write your data in the correct sheet at the correct row (the one after the name)
    writeData(mywb, sheet=pageName, startRow=startR[pageName]+1, l_listOfDF[[N_myListOfDF]][[pageName]])
    # update the row number (the + 3 leaves space for the table name, the headers and a blank row)
    startR[pageName] <- startR[pageName]+nrow(l_listOfDF[[N_myListOfDF]][[pageName]]) + 3
  }
}
# save your workbook in a file
saveWorkbook(mywb , "myfile.xlsx")
Output file:
One problem at a time:
1. Appending data frames with different columns
rbind(df1,df2) #this fails if they do not have the same columns
library(dplyr)
df1 %>%
bind_rows(df2) #this works even with different columns
#including an empty row
df1 %>%
bind_rows(df_empty) %>% #pre-create a df with one "check" column and an empty row
bind_rows(df2)
2. Writing to Excel
If the Excel-writing packages you have tried do not work for you, simply save the result as a CSV file, which can be opened in Excel.
write.csv(df_result,"result.csv")
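The "blank row plus CSV" idea can also be sketched in base R only, keeping each table's own header row. This is a sketch under assumptions: stack_with_blank is a hypothetical helper, the two small tables stand in for the "Red" tables of the question, and all values are converted to text.

```r
# Hypothetical helper: stack tables of different widths, one blank row between them
stack_with_blank <- function(dfs) {
  width <- max(vapply(dfs, ncol, integer(1)))          # widest table decides the width
  blocks <- lapply(unname(dfs), function(d) {
    body <- do.call(cbind, lapply(d, as.character))    # values as a character matrix
    m <- rbind(names(d), body)                         # header row on top
    pad <- matrix("", nrow = nrow(m), ncol = width - ncol(m))
    rbind(cbind(m, pad), "")                           # pad to full width, add blank row
  })
  out <- do.call(rbind, blocks)
  unname(out[-nrow(out), , drop = FALSE])              # drop the trailing blank row
}

red_1 <- data.frame(y = c(2009, 2010), b = c(35, 30))
red_2 <- data.frame(y = c(2009, 2010), x_1 = c(35, 30), x_2 = c(1, 2))
stacked <- stack_with_blank(list(red_1, red_2))

# Writing without R's own header or row names keeps the layout intact:
# write.table(stacked, "red.csv", sep = ",", row.names = FALSE, col.names = FALSE)
```

Because everything is stored as text, numbers are only recognised as numbers again when the CSV is opened in Excel.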
I have a long vector of patient statuses in R that is chronologically sorted, and a label of associated patient IDs. This vector is an element of a dataframe. I would like to label consecutive rows of data for which the patient status is the same. If the status changes, then reverts to its original value, that would be three separate events. This is different from most situations I have found when searching, where duplicated or match would suffice.
An example would be along the lines of:
s <- c(0,0,0,1,1,1,0,0,2,1,1,0,0)
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2)
and the desired output would be
flag <- c(1,1,1,2,2,2,3,1,2,3,3,4,4)
or
flag <- c(1,1,1,2,2,2,3,4,5,6,6,7,7)
One inelegant approach would be to generate the sequence:
unlist(tapply(s, id, function(x) cumsum(c(T, x[-1] != rev(rev(x)[-1])))))
Is there a better way?
I think you could use rleid from data.table for this:
library(data.table)
rleid(s,id)
Output:
1 1 1 2 2 2 3 4 5 6 6 7 7
Or for the first sequence:
data.table(s,id)[,rleid(s),id]$V1
Output:
1 1 1 2 2 2 3 1 2 3 3 4 4
Run Length Encoding - rle()
tapply(s, id, function(x) {
  v <- rle(x)$lengths   # run lengths within this id
  rep(seq_along(v), v)  # number the runs and repeat each number over its run
})
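Both numbering schemes can also be sketched in base R as vectors aligned with s, which is convenient when the result has to go back into the data frame. The names changed, flag_global and flag_by_id are illustrative, and the sketch assumes the rows are already sorted by id and time.

```r
s  <- c(0,0,0,1,1,1,0,0,2,1,1,0,0)
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2)
n  <- length(s)

# TRUE wherever a new run starts: first row, change of status, or change of id
changed <- c(TRUE, s[-1] != s[-n] | id[-1] != id[-n])

# Runs numbered across the whole vector
flag_global <- cumsum(changed)

# Runs numbered from 1 again within each id
flag_by_id <- ave(as.integer(changed), id, FUN = cumsum)
```

Unlike tapply, which returns one element per id, ave returns a vector of the same length and order as its input.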
I ran into a problem while trying to rearrange my data frame into long format.
my table looks like this:
x <- data.frame("Accession"=c("AGI1","AGI2","AGI3","AGI4","AGI5","AGI6"),"wt_rep_1"=c(1,2,3,4,4,5), "wt_rep_2" = c(1,2,3,4,8,9), "mutant1_rep_1"=c(1,1,0,0,5,3), "mutant2_rep_1" = c(1,7,0,0,1,5), "mutant2_rep_2" = c(1,1,4,0,1,8) )
> x
Accession wt_rep_1 wt_rep_2 mutant1_rep_1 mutant2_rep_1 mutant2_rep_2
1 AGI1 1 1 1 1 1
2 AGI2 2 2 1 7 1
3 AGI3 3 3 0 0 4
4 AGI4 4 4 0 0 0
5 AGI5 4 8 5 1 1
6 AGI6 5 9 3 5 8
I need to create a column named "genotype" that would contain the first part of the column name, before the first "_".
How to use
strsplit(names(x), "_")
for that?
and preferably loop...
please, anyone, help.
I'll extract the part of the column names of x before the first _ in two instructions. Note that it can be done in just one line, but I'm posting like this for clarity.
sp <- strsplit(names(x), "_")
sapply(sp[-1], `[`, 1)
Now, how can this be a new column in data.frame x? There are only five elements in the resulting vector and x has six rows.
I agree with Ruy Barradas: I don't get how this vector could be a part of your original dataframe. Could you please clarify?
William Doane's response to this question suggests that using regular expressions might do the trick. I like this approach because I find it elegant and fast:
> gsub("(_.*)$", "", names(x))[-1]
[1] "wt" "wt" "mutant1" "mutant2" "mutant2"
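One possible reading of the question is that the genotype column belongs in the long format, where every measurement column contributes its own rows. The following base R sketch assumes that reading; the names value_cols, long, sample and value are illustrative, and only a shortened version of x is used.

```r
# Shortened version of the table from the question
x <- data.frame(Accession = c("AGI1", "AGI2"),
                wt_rep_1 = c(1, 2), wt_rep_2 = c(1, 2),
                mutant1_rep_1 = c(1, 1))

value_cols <- names(x)[-1]   # every column except Accession

# One row per Accession and measurement column
long <- data.frame(
  Accession = rep(x$Accession, times = length(value_cols)),
  sample    = rep(value_cols, each = nrow(x)),
  value     = unlist(x[value_cols], use.names = FALSE)
)

# Genotype is the part of the original column name before the first "_"
long$genotype <- sub("_.*$", "", long$sample)
```

In this layout the extracted names have one entry per row, so the length mismatch noted above does not arise.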
I need to pull records from a first data set (called df1 here) based on a combination of specific dates, ID#s, event start time, and event end time that match with a second data set (df2). Everything works fine when there is just 1 date, ID, and event start and end time, but some of the matching records between the data sets contain multiple IDs, dates, or times, and I can't get the records from df1 to subset properly in those cases. I ultimately want to put this in a FOR loop or independent function since I have a rather large dataset. Here's what I've got so far:
I started just by matching the dates between the two data sets as follows:
match_dates <- as.character(intersect(df1$Date, df2$Date))
Then I selected the records in df2 based on the first matching date, also keeping the other columns so I have the other ID and time information I need:
records <- df2[which(df2$Date == match_dates[1]), ]
The date, ID, start, and end time from records are then:
[1] "01-04-2009" "599091" "12:00" "17:21"
Finally I subset df1 for before and after the event based on the date, ID, and times in records and combined them into a new data frame called final to get at the data contained in df1 that I ultimately need.
before <- subset(df1, NUM==records$ID & Date==records$Date & Time<records$Start)
after <- subset(df1, NUM==records$ID & Date==records$Date & Time>records$End)
final <- rbind(before, after)
Here's the real problem - some of the matching dates have more than 1 corresponding row in df2, and return multiple IDs or times. Here is what an example of multiple records looks like:
records <- df2[which(df2$Date == match_dates[25]), ]
> records$ID
[1] 507646 680845 680845
> records$Date
[1] "04-02-2009" "04-02-2009" "04-02-2009"
> records$Start
[1] "09:43" "05:37" "11:59"
> records$End
[1] "05:19" "11:29" "16:47"
When I try to subset df1 based on this, I get warnings:
before <- subset(df1, NUM==records$ID & Date==records$Date & Time<records$Start)
Warning messages:
1: In NUM == records$ID :
longer object length is not a multiple of shorter object length
2: In Date == records$Date :
longer object length is not a multiple of shorter object length
3: In Time < records$Start :
longer object length is not a multiple of shorter object length
Trying to do it manually for each ID-date-time combination would be way too tedious. I have 9 years' worth of data, all with multiple matching dates for a given year between the data sets, so ideally I would like to set this up as a FOR loop, or a function with a FOR loop in it, but I can't get past this. Thanks in advance for any tips!
If you're asking what I think you are, the filter() function from the dplyr package, combined with the %in% operator (a wrapper around match), does what you're looking for.
> library(dplyr)
> df1 <- data.frame(A = c(rep(1,4),rep(2,4),rep(3,4)), B = c(rep(1:4,3)))
> df1
A B
1 1 1
2 1 2
3 1 3
4 1 4
5 2 1
6 2 2
7 2 3
8 2 4
9 3 1
10 3 2
11 3 3
12 3 4
> df2 <- data.frame(A = c(1,2), B = c(3,4))
> df2
A B
1 1 3
2 2 4
> filter(df1, A %in% df2$A, B %in% df2$B)
A B
1 1 3
2 1 4
3 2 3
4 2 4
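Note that the output above contains every combination of the matching A and B values, including (1, 4) and (2, 3), which are not rows of df2. If the combinations have to match row by row, a join is one option. The sketch below uses base merge; the second part applies the same idea to hypothetical data shaped like the question's (the names events, obs, m and final are illustrative, and times are assumed to be zero-padded "HH:MM" strings so that text comparison orders them correctly).

```r
# Row-wise matching on both columns at once
df1 <- data.frame(A = rep(1:3, each = 4), B = rep(1:4, times = 3))
df2 <- data.frame(A = c(1, 2), B = c(3, 4))
matched <- merge(df1, df2, by = c("A", "B"))   # keeps only (1, 3) and (2, 4)

# Hypothetical observations and events in the shape described in the question
obs <- data.frame(NUM = c(1, 1, 1, 2), Date = "01-04-2009",
                  Time = c("11:00", "13:00", "18:00", "11:00"))
events <- data.frame(ID = 1, Date = "01-04-2009", Start = "12:00", End = "17:21")

# Attach each event to the observations with the same ID and date ...
m <- merge(obs, events, by.x = c("NUM", "Date"), by.y = c("ID", "Date"))
# ... then keep the observations before the start or after the end of the event
final <- m[m$Time < m$Start | m$Time > m$End, ]
```

Because merge pairs every observation with every event of the same ID and date, multiple events per date are handled without a loop, although an observation is then kept once for each event it falls outside of.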