I extracted data from a PDF and have a small issue. Some cells got separated into two rows, so the data has NA in some cells and values in others. I simply want to merge these values into the cell above.
Interestingly, every "real" line into which I want to merge the other ones starts with the same symbol, namely a "§".
I have about 1000 observations, so an automated solution would be amazing.
first <- c("§", "3", "4")
second <- c(NA, "2", NA)
third <- c("§", "2", 5)
fourth <- c(NA, "2", "3")
... and so on
df <- as.data.frame(rbind(first, second, third, fourth))
expected output:
first_e <- c("§", "32", "4")
second_e <- c("§", "22", "53")
df_e <- as.data.frame(rbind(first_e, second_e))
It would be truly amazing if someone had an idea (:
best from berlin
One possible solution, if there is always a second line, would be this:
library(dplyr)
df %>%
# use the row number as a column
dplyr::mutate(ID = dplyr::row_number()) %>%
# subtract 1 from every even row number to build groups
dplyr::mutate(ID = ifelse(ID %% 2 == 0, ID - 1, ID)) %>%
# group by the new ID
dplyr::group_by(ID) %>%
# convert all NAs to "" (empty string)
dplyr::mutate_all(~ ifelse(is.na(.), "", .)) %>%
# concatenate all strings per group
dplyr::mutate_all( ~ paste(., collapse = "")) %>%
# select only distinct cases (to eliminate the "seconds", as they are now identical to the "firsts")
dplyr::distinct()
V1 V2 V3 ID
<chr> <chr> <chr> <dbl>
1 4 32 4 1
I left the created ID number in the result, but you can drop it after the calculations if you prefer.
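If a row can be split into more than two pieces, the even/odd grouping no longer lines up. Here is a minimal sketch of the same dplyr idea that groups on a running count of the "§" markers instead of row parity. It assumes every real row has "§" in the first column, as the question states; the column names V1 to V3 and the helper column grp are only illustrative, and across() needs dplyr 1.0 or later.

```r
library(dplyr)

# Small stand-in for the question's data (column names are assumed)
df <- data.frame(
  V1 = c("§", NA, "§", NA),
  V2 = c("3", "2", "2", "2"),
  V3 = c("4", NA, "5", "3"),
  stringsAsFactors = FALSE
)

merged <- df %>%
  # a new group starts at every row whose first column is "§"
  mutate(grp = cumsum(!is.na(V1) & V1 == "§")) %>%
  group_by(grp) %>%
  # paste the non-missing pieces of each column together
  summarise(across(everything(), ~ paste(.x[!is.na(.x)], collapse = "")),
            .groups = "drop") %>%
  select(-grp)

merged
```

Because the groups come from the markers, a real row followed by zero, one, or several continuation rows is handled the same way.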
Simply paste the odd elements of a column with even elements:
# vectors of TRUEs in odd or even positions
odd <- rep(c(TRUE, FALSE), length.out=nrow(df))
evn <- rep(c(FALSE, TRUE), length.out=nrow(df))
# for each column...
result <- lapply(df, function(col) {
paste0(ifelse(is.na(col[odd]), '', col[odd]),
ifelse(is.na(col[evn]), '', col[evn]))
})
as.data.frame(result)
Consider flagging the § rows with ifelse + cumsum to generate a grouping field, then aggregate with paste:
# BUILD DATA FRAME
df <- setNames(rbind.data.frame(first, second, third, fourth, stringsAsFactors=FALSE),
c("col1", "col2", "col3"))
# CONVERT ALL NAs TO EMPTY STRING
df[is.na(df)] <- ""
# GENERATE GROUPING COLUMN
df$section <- cumsum(ifelse(df$col1 == "§", 1, 0))
df
# col1 col2 col3 section
# 1 § 3 4 1
# 2 2 1
# 3 § 2 5 2
# 4 2 3 2
# AGGREGATE BY GROUPING COLUMNS
clean_df <- aggregate(. ~ section, df, paste, collapse="")[-1]
clean_df
# col1 col2 col3
# 1 § 32 4
# 2 § 22 53
I am trying to create a table with the headings study_id and Conflict, plus a third and fourth column containing the data that corresponds to the value in the Conflict column.
The code below achieves this aim. However, it is rather long, especially as I wish to expand it to cover several hundred different values in the Conflict column.
Thank you in advance for any pointers.
df <- data.frame(study_id=c("1", "1", "4", "4", "5"),
Conflict=c("WATER.START", "WATER.STOP", "OIL.START", "NA", "WATER.STOP"),
Result=c("TRUE", "TRUE", "TRUE", "NA", "TRUE"))
df2 <- data.frame(study_id=c("1", "2", "3", "4", "5"),
WATER.start=c(1, 1, 2, NA, 6),
WATER.truestart=c(1, 1, 2, NA, 25),
WATER.stop=c(33, 3, 2, NA, 8),
WATER.truestop= c(34, 4, 2, NA, 8))
library(dplyr)
final <- left_join(df, df2, by ='study_id')
dd <- final %>% filter(Result == "TRUE" & Conflict == "WATER.START")
dd <- dd %>% subset(., Conflict == "WATER.START",
select=c(study_id, Conflict, WATER.start, WATER.truestart))
dd <- dd %>% rename(initial=WATER.start) %>% rename(verification=WATER.truestart)
ee <- final %>% filter(Result == "TRUE" & Conflict == "WATER.STOP")
ee <- ee %>% subset(., Conflict == "WATER.STOP",
select=c(study_id, Conflict, WATER.stop, WATER.truestop))
ee <- ee %>% rename(initial=WATER.stop) %>% rename(verification=WATER.truestop)
ff <- bind_rows(dd, ee)
gg <- ff %>% select(study_id, Conflict, initial, verification)
gg
# study_id Conflict initial verification
# 1 1 WATER.START 1 1
# 2 1 WATER.STOP 33 34
# 3 5 WATER.STOP 8 8
First, since you seem to focus only on c('WATER.START', 'WATER.STOP'), subset your first df. Next, for every row (MARGIN=1), we apply an anonymous function \(x) that selects the appropriate columns of df2, using tolower to match case, and cbinds them together. Finally, rename the columns using setNames and rbind the resulting list.
subset(df, Conflict %in% c('WATER.START', 'WATER.STOP')) |>
apply(MARGIN=1, \(x) {
mt <- match(tolower(x[2]), tolower(names(df2)))
cbind(t(x[1:2]), df2[df2$study_id == x[1], c(mt, mt + 1)]) |>
setNames(c('study_id', 'Conflict', 'initial', 'verification'))
}) |> do.call(what=rbind)
# study_id Conflict initial verification
# 1 1 WATER.START 1 1
# 2 1 WATER.STOP 33 34
# 5 5 WATER.STOP 8 8
You could also use a dictionary a (which may be expanded to other levels you might be using).
a <- c(WATER.START='WATER.start', WATER.STOP='WATER.stop')
subset(df, Conflict %in% c('WATER.START', 'WATER.STOP')) |>
apply(MARGIN=1, \(x) {
mt <- match(a[match(x[2], names(a))], names(df2))
cbind(t(x[1:2]), df2[df2$study_id == x[1], c(mt, mt + 1)]) |>
setNames(c('study_id', 'Conflict', 'initial', 'verification'))
}) |> do.call(what=rbind)
# study_id Conflict initial verification
# 1 1 WATER.START 1 1
# 2 1 WATER.STOP 33 34
# 5 5 WATER.STOP 8 8
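For several hundred Conflict values, writing the dictionary by hand gets tedious. The sketch below derives both column names from the label instead. It assumes the naming pattern visible in the example data holds throughout (PREFIX.START maps to PREFIX.start and PREFIX.truestart); the helper names init_col, veri_col, and vals are mine, and labels without matching columns in df2 are simply dropped.

```r
df <- data.frame(study_id = c("1", "1", "4", "4", "5"),
                 Conflict = c("WATER.START", "WATER.STOP", "OIL.START", "NA", "WATER.STOP"),
                 Result = c("TRUE", "TRUE", "TRUE", "NA", "TRUE"),
                 stringsAsFactors = FALSE)
df2 <- data.frame(study_id = c("1", "2", "3", "4", "5"),
                  WATER.start = c(1, 1, 2, NA, 6),
                  WATER.truestart = c(1, 1, 2, NA, 25),
                  WATER.stop = c(33, 3, 2, NA, 8),
                  WATER.truestop = c(34, 4, 2, NA, 8),
                  stringsAsFactors = FALSE)

# Split "WATER.START" into "WATER" and "START", then rebuild the df2 column names
parts <- strsplit(df$Conflict, ".", fixed = TRUE)
prefix <- sapply(parts, `[`, 1)
suffix <- tolower(sapply(parts, `[`, 2))
init_col <- paste0(prefix, ".", suffix)      # e.g. "WATER.start"
veri_col <- paste0(prefix, ".true", suffix)  # e.g. "WATER.truestart"

# Keep confirmed conflicts whose two columns actually exist in df2
keep <- df$Result == "TRUE" &
  init_col %in% names(df2) & veri_col %in% names(df2)

# Look up each (row, column) pair with matrix indexing
vals <- as.matrix(df2[-1])
rows <- match(df$study_id[keep], df2$study_id)
out <- data.frame(
  study_id = df$study_id[keep],
  Conflict = df$Conflict[keep],
  initial = vals[cbind(rows, match(init_col[keep], colnames(vals)))],
  verification = vals[cbind(rows, match(veri_col[keep], colnames(vals)))],
  stringsAsFactors = FALSE
)
out
```

Matrix indexing with a two-column matrix picks one cell per row of the index, so no per-row apply call is needed.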
Say I have the following data frame:
# S/N a b
# 1 L1-S2 <blank>
# 2 T1-T3 <blank>
# 3 T1-L2 <blank>
How do I turn the above data frame into this:
# S/N a b
# 1 L1-S2 LS
# 2 T1-T3 T
# 3 T1-L2 TL
I am thinking of writing a loop, where
For x in column a,
If first character in x == L AND 4th character in x == S,
fill the corresponding cell in b with LS
and so on...
However, I am not sure how to implement it, or if there is a more elegant way of doing this.
We can extract the upper case letters and remove the repeated ones
library(stringr)
library(dplyr)
df1 %>%
mutate(b = str_replace(str_replace(a, "^([A-Z])\\d+-([A-Z])\\d+",
"\\1\\2"), "(.)\\1+", "\\1"))
-output
# S_N a b
#1 1 L1-S2 LS
#2 2 T1-T3 T
#3 3 T1-L2 TL
Or another option is str_extract_all to extract the upper case letters, loop over the list with map, paste the unique elements
library(purrr)
df1 %>%
mutate(b = str_extract_all(a, "[A-Z]") %>%
map_chr(~ str_c(unique(.x), collapse="")))
Or using a corresponding base R option for the first tidyverse option
df1$b <- sub("(.)\\1+", "\\1", gsub("[0-9-]+", "", df1$a))
Or with strsplit
df1$b <- sapply(strsplit(df1$a, "[0-9-]+"),
function(x) paste(unique(x), collapse=""))
data
df1 <- structure(list(S_N = 1:3, a = c("L1-S2", "T1-T3", "T1-L2"),
b = c(NA,
NA, NA)), class = "data.frame", row.names = c(NA, -3L))
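The positional rule sketched in the question (compare the 1st and 4th characters) can also be written without a loop. This is only a sketch and assumes every value in column a has exactly the shape letter-digit-hyphen-letter-digit, so the two letters always sit at positions 1 and 4; the names first and second are illustrative.

```r
df1 <- data.frame(S_N = 1:3, a = c("L1-S2", "T1-T3", "T1-L2"),
                  stringsAsFactors = FALSE)

# Letters at the fixed positions 1 and 4
first <- substr(df1$a, 1, 1)
second <- substr(df1$a, 4, 4)

# Keep a single letter when both are the same, otherwise join them
df1$b <- ifelse(first == second, first, paste0(first, second))
df1
```

The regex-based answers above are more forgiving if the digits can have more than one character.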
I have a grouped set of values in a column that I am trying to replace with a single value
col1
a
a;a;b;c
c;b;a
NA
b;b;b
I want to replace each entry with either Mixed or, if only one distinct value is present, that value; for example, a;a;a;a becomes a
Expected Output
col1
a
Mixed
Mixed
NA
b
Code
grouping = function(x){
y = as.list(strsplit(x, ";")[[1]])
#select first element, and test if each is the same element.
z = ""
for (i in 1:length(y)){
if (as.character(y[1]) != as.character(y[i])) {
z = 'mixed'
break
} else {
z = as.character(y[1])
}
}
return(z)
}
db %>%
select(col1) %>%
mutate(
test = grouping(col1)
)
I have tried it a few different ways, and it either does not work at all or gives the value a for everything
A base R option via defining a user function f
f <- function(x) ifelse(length(u <- unique(unlist((strsplit(x, ";"))))) > 1, "Mixed", u)
such that
> transform(df, col1 = Vectorize(f)(col1))
col1
1 a
2 Mixed
3 Mixed
4 <NA>
5 b
You can also consider this for your function and use base R:
#Function
myfun <- function(x)
{
y <- unlist(strsplit(x, ";"))
if(length(unique(y))==1)
{
z <- unique(y)
} else
{
z <- 'Mixed'
}
}
#Apply
df$New <- apply(df,1,myfun)
Output:
df
col1 New
1 a a
2 a;a;b;c Mixed
3 c;b;a Mixed
4 <NA> <NA>
5 b;b;b b
Some data used:
#Data
df <- structure(list(col1 = c("a", "a;a;b;c", "c;b;a", NA, "b;b;b")), class = "data.frame", row.names = c(NA,
-5L))
We can extract the letters from 'col1', count the distinct elements with n_distinct, and use case_when to change the entries that have more than one unique element to 'Mixed'
library(dplyr)
library(stringr)
library(purrr)
df1 %>%
mutate(col1 = case_when(map_dbl(str_extract_all(col1,
"[a-z]"), n_distinct) >1 ~ "Mixed",
is.na(col1) ~ NA_character_,
TRUE ~ substr(col1, 1, 1)))
-output
# col1
#1 a
#2 Mixed
#3 Mixed
#4 <NA>
#5 b
Or another option is to split the column by the delimiter with separate_rows, and do a group by row_number to summarise elements having more than one row (after the distinct) to be 'Mixed'
library(tidyr)
df1 %>%
mutate(rn = row_number()) %>%
separate_rows(col1) %>%
distinct() %>%
group_by(rn) %>%
summarise(col1 = case_when(n() > 1 ~ 'Mixed', TRUE ~ first(col1)),
.groups = 'drop') %>%
select(-rn)
-output
# A tibble: 5 x 1
# col1
# <chr>
#1 a
#2 Mixed
#3 Mixed
#4 <NA>
#5 b
Or using base R with a compact option
v1 <- gsub("([a-z])\\1+", "\\1", gsub(";", "", df1$col1))
replace(v1, nchar(v1) > 1, "Mixed")
#[1] "a" "Mixed" "Mixed" NA "b"
The issue in the OP's function is that it extracts only the first list element with [[1]]
as.list(strsplit(x, ";")[[1]])
strsplit returns a list whose length equals the number of rows of the input. By selecting only the first element, the function computes a single value from row 1, and that value is recycled for every row
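To see this concretely, here is a small sketch on the example values (the names x, pieces, and per_row are illustrative). Inside mutate the function receives the whole column at once, so strsplit returns one list element per row, and [[1]] keeps only the pieces of the first row.

```r
x <- c("a", "a;a;b;c", "c;b;a", NA, "b;b;b")

pieces <- strsplit(x, ";")
length(pieces)   # one list element per input string
pieces[[1]]      # only the pieces of the first string
pieces[[2]]      # the second string's pieces are never looked at by [[1]]

# Working element by element keeps one result per row
per_row <- vapply(pieces,
                  function(p) if (length(unique(p)) == 1) p[1] else "Mixed",
                  character(1))
per_row
```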
data
df1 <- structure(list(col1 = c("a", "a;a;b;c", "c;b;a", NA, "b;b;b")),
class = "data.frame", row.names = c(NA,
-5L))
You can write the grouping function as :
grouping <- function(x) {
sapply(strsplit(x, ';'), function(x)
if(length(unique(x)) == 1) unique(x) else 'Mixed')
}
db$test <- grouping(db$col1)
db
# col1 test
#1 a a
#2 a;a;b;c Mixed
#3 c;b;a Mixed
#4 <NA> <NA>
#5 b;b;b b
I have a column with two alphanumeric strings separated by '->', and I'm trying to split them into columns.
Df:
column e
1. asd1->ref2
2. fde4 ->fre4
3. dfgt-fgr ->frt5
4. ftr5 -> lkh-oiut
5. rey6->usre-lynng->usre-lkiujh->kiuj-bunny
6. dge1->fgt4->okiuj-dfet
Desired output
col 1 col 2
1. asd1 ref2
2. fde4 fre4
3. frt5
4. ftr5
5. rey6
6. dge1 fgt4
I tried using out <- strsplit(as.character(Df$column e),'_->_') with no output and used str_extract(m1$column e,"(?<=\\[)[[:alnum:]]")->m1$column f, also strsplit(as.character(Df$column e),' -> 'fixed=T)[[1]][[1]] but not getting the desired output.
The column is of integer type and all are capital letters (I'm not sure if this is important).
Here is one way with tidyverse
library(tidyverse)
df1 %>%
separate(columne, into = c('col1', 'col2'), sep = "->", extra = 'drop') %>%
# funs() is deprecated in current dplyr, so a formula lambda is used instead
mutate_all(~ replace(., str_detect(., '-'), ""))
# col1 col2
#1 asd1 ref2
#2 fde4 fre4
#3 frt5
#4 ftr5
#5 rey6
#6 dge1 fgt4
A base R solution as well, though a fair bit less concise than @akrun's tidyverse one:
# split as appropriate
out <- strsplit( as.character( Df$column.e ), '->' )
out <- lapply( out, function(x) {
# I assume you don't want the white space
y <- trimws( x )
# take the first two "columns"
y <- y[1:2]
# remove any items containing a hyphen
y[ grepl( "-", y ) ] <- ""
y
}
)
# then bind it all rowwise
out <- do.call( rbind, out )
data.frame( out )
X1 X2
1 asd1 ref2
2 fde4 fre4
3 frt5
4 ftr5
5 rey6
6 dge1 fgt4
I am working with a data frame that looks like the following which I need to transpose by group based on the common Id:
testDF = data.frame(c("Id", "1", "1", "2", "2"), c("Item", 'Milk','Eggs','Bacon', "Bread"))
testDF
#>Id Item
#>1 Milk
#>1 Eggs
#>2 Bacon
#>2 Bread
newDT <- dcast(testDF, Id ~ Item, value.var = "Item")
View(newDT)
I need the output to look like the following (excluding the header row and the Id column altogether):
Milk, Eggs (Id 1)
Bacon, Bread (Id 2)
I am receiving the error 'Error: value.var (Item) not found in input'. Can you please tell me what I am doing wrong?
Thanks,
Matt
Some good answers above, however I think this should be listed as an option as well:
library(dplyr)
library(stringr)
# df is the question's data with proper Id and Item columns
df %>%
group_by(Id) %>%
# Create string listing all items in given Id, separated by comma
summarise(Items = str_c(Item, collapse = ', '))
Returns:
# A tibble: 2 × 2
Id Items
<fctr> <chr>
1 Milk, Eggs
2 Bacon, Bread
testDF = data.frame(Id = c("1", "1", "2", "2"),
Item = c('Milk','Eggs','Bacon', "Bread"))
testDF
z <- aggregate(list(Item = testDF$Item), list(ID = testDF$Id),
function(x) paste(x, collapse = ','))
z
ID Item
1 1 Milk,Eggs
2 2 Bacon,Bread
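Rebuilding testDF with named columns is also what resolves the error from the question. In the original call the header values were passed as the first data row and the columns were left unnamed, so no column called Item exists for value.var to find. A short sketch to check this (the names bad and good are illustrative):

```r
# The question's construction: headers end up as data, columns get generated names
bad <- data.frame(c("Id", "1", "1", "2", "2"),
                  c("Item", "Milk", "Eggs", "Bacon", "Bread"),
                  stringsAsFactors = FALSE)
"Item" %in% names(bad)    # FALSE

# Named columns, no header row in the data
good <- data.frame(Id = c("1", "1", "2", "2"),
                   Item = c("Milk", "Eggs", "Bacon", "Bread"),
                   stringsAsFactors = FALSE)
"Item" %in% names(good)   # TRUE

# One comma-separated string per Id
items <- tapply(good$Item, good$Id, paste, collapse = ", ")
items
```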
With group_by and summarise functions from dplyr you have:
library("dplyr")
testDF %>%
group_by(Id) %>%
summarise(Items=paste0(Item,collapse=","))
#Source: local data frame [2 x 2]
#Groups: Id [2]
#
# Id Items
# (fctr) (chr)
#1 1 Milk,Eggs
#2 2 Bacon,Bread
Since I didn't see a green checkmark, I figured I'd take a stab at it, because I wrote a function for this exact problem.
library(dplyr)
transp <- function(input,uniq_var,compare_var,transposed_column_names = 'measurement'){
if(class(input[,uniq_var]) == "factor"){
input[uniq_var] = sapply(input[uniq_var],as.character)
}
#' input is the data frame/data.table that you want to perform the operation on,
#' uniq_var is the variable that you are grouping by,
#' compare_var is the variable that is being measured in each of the groups, and
#' transposed_column_names is an optional string used to name the new columns
#' (it is concatenated with an observation number, i.e. if you input 'distance',
#' the observations are named 'distance1', 'distance2', 'distance3', etc.)
list_df <- input %>% group_by(input[,uniq_var]) %>% do(newcol = t(.[compare_var]))
# it gets us the aggregates we want, BUT all of our columns are stored in a list
# instead of in separate columns.... so we need to create a new dataframe with the dimensions
# rows = the number of unique values that we are "grouping" by, noted here by uniq_var and the number of columns will be
# the maximum number of observations that are assigned to one of those groups.
# so first we will create the skeleton of the matrix, and then use a user defined function
# to fill it with the correct values
new_df <- matrix(rep(NA,(max(count(input,input[,uniq_var])[,2])*dim(list_df)[1])),nrow = dim(list_df)[1])
new_df <- data.frame(new_df)
new_df <- cbind(list_df[,1],new_df)
# I am writing a function inside of a function because for loops can take a while
# when doing operations on multiple columns of a data frame
func2 <- function(input,thing = new_df){
# here, we have a slightly easier case when we have the maximum number of children
# assigned to a household.
# we subtract 1 from the number of columns because the first column holds the value of the
# unique value we are looking at, so we don't count it
if(length(input[2][[1]])==dim(thing)[2]-1){
# we set the row corresponding to the specific unique value specified in our list_df of aggregated values
# equal to the de-aggregated values, so that you have a column for each value like in PROC Transpose.
thing[which(thing[,1]==input[1]),2:ncol(thing)]= input[2][[1]]
#new_df[which(new_df[,1]==input[1]),2:ncol(new_df)]= input[2][,1][[1]][[1]]
}else{
thing[which(thing[,1]==input[1]),2:(1+length(input[2][[1]]))]= input[2][[1]]
}
# if you're wondering why I have to use so many []'s it's because our list_df has 1 column
# of unique identifiers and the other column is actually a column of dataframes
# each of which only has 1 row and 1 column, and that element is a list of the transposed values
# that we want to add to our new dataframe
# so essentially the first bracket
return(thing[which(thing[,1]==input[1]),])
}
quarter_final_output <- apply(list_df,1,func2)
semi_final_output <- data.frame(matrix(unlist(quarter_final_output),nrow = length(quarter_final_output),byrow = T))
#return(apply(list_df,1,func2))
# this essentially names the columns according to the column names that a user would typically specify
# in a proc transpose.
name_trans <- function(trans_var=transposed_column_names,uniq_var = uniq_var,df){
#print(trans_var)
colnames(df)[1] = colnames(input[uniq_var])
colnames(df)[2:length(colnames(df))] = c(paste0(trans_var,seq(1,(length(colnames(df))-1),1)))
return(df)
}
final_output <- name_trans(transposed_column_names,uniq_var,semi_final_output)
return(final_output)
}
In your case, you'd apply it like this:
transp(testDF,uniq_var = 'Id',compare_var = "Item")
If you want to download it, it is on my GitHub: https://github.com/seanpili/R_PROC_TRANSPOSE
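For comparison, the same wide layout (one column per observation within each Id) can be sketched in a few lines of base R. This is not the function above, just an illustration of the idea under the assumption that Id and Item are character columns; the names items, width, and wide, and the measurement prefix, are illustrative.

```r
testDF <- data.frame(Id = c("1", "1", "2", "2"),
                     Item = c("Milk", "Eggs", "Bacon", "Bread"),
                     stringsAsFactors = FALSE)

# One character vector of items per Id
items <- split(testDF$Item, testDF$Id)

# Pad every group with NA up to the size of the largest group
width <- max(lengths(items))
padded <- lapply(items, function(v) c(v, rep(NA, width - length(v))))

# Stack the padded vectors row-wise and label the columns
wide <- data.frame(Id = names(items), do.call(rbind, padded),
                   stringsAsFactors = FALSE)
names(wide)[-1] <- paste0("measurement", seq_len(width))
rownames(wide) <- NULL
wide
```

Groups with fewer observations than the largest one end up with NA in their trailing columns.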