fread specifying separator within column - r
I am trying to parse a 2 column list that is separated using multiple spaces for columns and single spaces for words within a column. Nothing I have tried has successfully split the data into two columns. How do I do this?
library(data.table)
item.ids<-fread("http://eve-files.com/chribba/typeid.txt",sep2=" ")
Example of the dataset:
typeID typeName
----------- ----------------------------------------
0 #System
2 Corporation
3 Region
4 Constellation
5 Solar System
This seems to work:
library(readr)
url = "http://eve-files.com/chribba/typeid.txt"
df = read_fwf(url, fwf_empty(url), skip = 2)
colnames = read_table(url, n_max = 1)
names(df) = names(colnames)
df = na.omit(df)
dim(df)
# [1] 22382 2
summary(df)
# typeID typeName
# Min. : 0 Length:22382
# 1st Qu.: 13986 Class :character
# Median : 22938 Mode :character
# Mean : 53827
# 3rd Qu.: 30209
# Max. :368620
Here's one approach that uses extract from "tidyr" that should be pretty easy to follow.
First, we read the data in, and inspect the first few lines and last few lines. After inspection, we find that the data values are from lines 3 to 22384.
x <- readLines("http://eve-files.com/chribba/typeid.txt")
# Check out the data
head(x) # Let's get rid of the first two lines...
tail(x) # ... and the last 3
In the extraction stage, we're basically looking for:
A set of numbers--can be of varying lengths (([0-9]+)). It's in (), so capture it and extract it to a new column.
The numbers should be followed by 2 or more spaces ([ ]{2,}). That's not in (), so we don't need to extract that into a new column.
The set of spaces can be followed by anything else ((.*)). That's in (), so capture that and extract it into a new column.
I've also used the first value of "x" to extract the original column names.
Here's what it looks like:
library(tidyverse)
tibble(V1 = x[3:(length(x)-3)]) %>%   # tibble() replaces the deprecated data_frame()
extract(V1, into = scan(text = x[1], what = ""), regex = "([0-9]+)[ ]{2,}(.*)")
# # A tibble: 22,382 x 2
# typeID typeName
# * <chr> <chr>
# 1 0 #System
# 2 2 Corporation
# 3 3 Region
# 4 4 Constellation
# 5 5 Solar System
# 6 6 Sun G5 (Yellow)
# 7 7 Sun K7 (Orange)
# 8 8 Sun K5 (Red Giant)
# 9 9 Sun B0 (Blue)
# 10 10 Sun F0 (White)
# # ... with 22,372 more rows
Or
tibble(V1 = x[3:(length(x)-3)]) %>%   # tibble() replaces the deprecated data_frame()
separate(V1, into = scan(text = x[1], what = ""), sep = "[ ]{2,}",
extra = "merge", convert = TRUE)
Another approach might be to use strsplit with [ ]{2,} as the split value. do.call(rbind, ...) would be the idiom to follow after that, but you might want to filter only for cases where the split resulted in two values.
do.call(rbind, Filter(function(z) length(z) == 2, strsplit(x, "[ ]{2,}")))
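The split-and-filter idea can be tried without downloading anything. The following is a minimal sketch on a few lines copied from the question's excerpt; the blank line and the "(3 row(s) affected)" footer are assumptions standing in for the trailing lines of the real file, and x_sample, parts, m and sample_df are illustrative names only.

```r
# Sample lines in the same layout as the question's excerpt (no download needed)
x_sample <- c(
  "typeID       typeName",
  "-----------  ----------------------------------------",
  "0            #System",
  "2            Corporation",
  "5            Solar System",
  "",                        # assumed trailing blank line
  "(3 row(s) affected)"      # assumed footer line
)

# Split each line on runs of two or more spaces
parts <- strsplit(x_sample, "[ ]{2,}")

# Keep only the lines that split into exactly two pieces
parts <- Filter(function(z) length(z) == 2, parts)
m <- do.call(rbind, parts)

# Row 1 of m is the header and row 2 is the dashed rule; the rest is data
sample_df <- data.frame(m[-(1:2), , drop = FALSE], stringsAsFactors = FALSE)
names(sample_df) <- m[1, ]
sample_df$typeID <- as.integer(sample_df$typeID)
sample_df
```

Single spaces inside a name ("Solar System") survive because only runs of two or more spaces are treated as separators.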
Read in your text file line-by-line:
l <- list()
fileName <- "http://eve-files.com/chribba/typeid.txt"
conn <- file(fileName,open="r")
linn <-readLines(conn)
for (i in 1:length(linn)){
l[i] <- list(linn[i])
}
close(conn)
Create a list of all entries:
l_new <- list()
for (p in seq_along(l)) {
  # Collapse runs of whitespace to one space, trim both ends, then split on spaces
  new_vec <- unlist(strsplit(gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", l[[p]], perl=TRUE), " "))
  # Everything after the first token is the name, however many words it has
  # (pasting only new_vec[2:4] would truncate names such as "Sun K5 (Red Giant)")
  new_vec_t <- paste(new_vec[-1], collapse = " ")
  l_new[p] <- list(c(new_vec[1], new_vec_t))
}
Convert your list to a dataframe:
l_new_frame <- data.frame(do.call('rbind', l_new))
l_new_frame <- l_new_frame[-c(1,2),]
names(l_new_frame) <- c('typeID', 'typeName')
Check results:
print(l_new_frame[1:100,], row.names = FALSE)
Related
Convert object name to string in function
I have a list of data.frames. I want to send each data.frame to a function using lapply. Inside the function I want to check whether the name of a data.frame includes a particular string. If the string in question is present I want to perform one series of operations. Otherwise I want to perform a different series of operations. I cannot figure out how to check whether the string in question is present from within the function.
I wish to use base R. This seems to be a possible solution but I cannot get it to work: In R, how to get an object's name after it is sent to a function?
Here is an example list followed by an example function further below.
matrix.apple1 <- read.table(text = '
X3 X4 X5
1 1 1
1 1 1
', header = TRUE)
matrix.apple2 <- read.table(text = '
X3 X4 X5
1 1 1
2 2 2
', header = TRUE)
matrix.orange1 <- read.table(text = '
X3 X4 X5
10 10 10
20 20 20
', header = TRUE)
my.list <- list(matrix.apple1 = matrix.apple1, matrix.orange1 = matrix.orange1, matrix.apple2 = matrix.apple2)
This operation can check whether each object name contains the string apple, but I am not sure how to use this information inside the function further below.
grepl('apple', names(my.list), fixed = TRUE)
#[1] TRUE FALSE TRUE
Here is an example function. Based on hours of searching and trial-and-error I perhaps am supposed to use deparse(substitute(x)), but so far it only returns x or something similar.
table.function <- function(x) {
  # The three object names are:
  # 'matrix.apple1', 'matrix.orange1' and 'matrix.apple2'
  myObjectName <- deparse(substitute(x))
  print(myObjectName)
  # perform a trivial example operation on a data.frame
  my.table <- table(as.matrix(x))
  # Test whether an object name contains the string 'apple'
  contains.apple <- grep('apple', myObjectName, fixed = TRUE)
  # Use the result of the above test to perform a trivial example operation.
  # With my code 'my.binomial' is always given the value of 0 even though
  # 'apple' appears in the name of two of the data.frames.
  my.binomial <- ifelse(contains.apple == 1, 1, 0)
  return(list(my.table = my.table, my.binomial = my.binomial))
}
table.function.output <- lapply(my.list, function(x) table.function(x))
These are the results of print(myObjectName):
#[1] "x"
#[1] "x"
#[1] "x"
table.function.output
Here are the rest of the results of table.function, showing that my.binomial never becomes 1. The first and third value of my.binomial should be 1 because the names of the first and third data.frames contain the string apple.
# $matrix.apple1
# $matrix.apple1$my.table
# 1
# 6
# $matrix.apple1$my.binomial
# logical(0)
#
# $matrix.orange1
# $matrix.orange1$my.table
# 10 20
#  3  3
# $matrix.orange1$my.binomial
# logical(0)
#
# $matrix.apple2
# $matrix.apple2$my.table
# 1 2
# 3 3
# $matrix.apple2$my.binomial
# logical(0)
You could redesign your function to use the list names instead:
table_function <- function(myObjectName) {
  # The three object names are:
  # 'matrix.apple1', 'matrix.orange1' and 'matrix.apple2'
  myObject <- get(myObjectName)
  print(myObjectName)
  # perform a trivial example operation on a data.frame
  my.table <- table(as.matrix(myObject))
  # Test whether an object name contains the string 'apple'
  contains.apple <- grep('apple', myObjectName, fixed = TRUE)
  # Use the result of the above test to perform a trivial example operation.
  # With my code 'my.binomial' is always given the value of 0 even though
  # 'apple' appears in the name of two of the data.frames.
  my.binomial <- +(contains.apple == 1)
  return(list(my.table = my.table, my.binomial = my.binomial))
}
lapply(names(my.list), table_function)
This returns
[[1]]
[[1]]$my.table
1
6
[[1]]$my.binomial
[1] 1

[[2]]
[[2]]$my.table
10 20
 3  3
[[2]]$my.binomial
integer(0)

[[3]]
[[3]]$my.table
1 2
3 3
[[3]]$my.binomial
[1] 1
If you want to keep the list names, you could use sapply(names(my.list), table_function, simplify = FALSE, USE.NAMES = TRUE) instead of lapply.
Use Map and pass both the list data and its name to the function. Change your function to accept two arguments.
table.function <- function(data, name) {
  # The three object names are:
  # 'matrix.apple1', 'matrix.orange1' and 'matrix.apple2'
  print(name)
  # perform a trivial example operation on a data.frame
  my.table <- table(as.matrix(data))
  # Test whether an object name contains the string 'apple'
  contains.apple <- grep('apple', name, fixed = TRUE)
  # Use the result of the above test to perform a trivial example operation.
  # With my code 'my.binomial' is always given the value of 0 even though
  # 'apple' appears in the name of two of the data.frames.
  my.binomial <- as.integer(contains.apple == 1)
  return(list(my.table = my.table, my.binomial = my.binomial))
}
Map(table.function, my.list, names(my.list))
#[1] "matrix.apple1"
#[1] "matrix.orange1"
#[1] "matrix.apple2"
#$matrix.apple1
#$matrix.apple1$my.table
#1
#6
#$matrix.apple1$my.binomial
#[1] 1
#$matrix.orange1
#$matrix.orange1$my.table
#10 20
# 3  3
#$matrix.orange1$my.binomial
#integer(0)
#...
#...
The same functionality is provided by imap in purrr, where you don't need to pass the names explicitly.
purrr::imap(my.list, table.function)
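In the outputs shown for this question, the non-matching name yields integer(0) rather than 0, because grep returns an empty vector when nothing matches. A minimal sketch of the same Map pattern with grepl, which always returns one TRUE or FALSE per name, follows. The helper name flag_apple and the inline data frames are assumptions made so the example is self-contained; they mirror the question's data but are not taken from any answer.

```r
# Small stand-ins for the question's three data frames
my_list <- list(
  matrix.apple1  = data.frame(X3 = c(1, 1),   X4 = c(1, 1),   X5 = c(1, 1)),
  matrix.orange1 = data.frame(X3 = c(10, 20), X4 = c(10, 20), X5 = c(10, 20)),
  matrix.apple2  = data.frame(X3 = c(1, 2),   X4 = c(1, 2),   X5 = c(1, 2))
)

# Hypothetical helper: receives the data and its list name as separate arguments
flag_apple <- function(data, name) {
  list(
    my.table    = table(as.matrix(data)),
    # grepl() gives TRUE/FALSE for every name, so a non-match becomes 0
    my.binomial = as.integer(grepl("apple", name, fixed = TRUE))
  )
}

# Map() walks the list and its names in parallel and keeps the list names
res <- Map(flag_apple, my_list, names(my_list))
sapply(res, `[[`, "my.binomial")
```

Passing the name as an explicit argument avoids deparse(substitute(x)), which only ever sees the argument name used inside lapply.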
How to split a sentence in two halves in R
I have a vector of strings, and I want each string to be cut roughly in half, at the nearest space. For example, with the following data:
test <- data.frame(init = c("qsdf mqsldkfop mqsdfmlk lksdfp pqpdfm mqsdfmj mlk",
                            "qsdf",
                            "mp mlksdfm mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll",
                            "qsddddddddddddddddddddddddddddddd",
                            "qsdfmlk mlk mkljlmkjlmkjml lmj mjjmjmjm lkj"),
                   stringsAsFactors = FALSE)
I want to get something like this:
first sec
1 qsdf mqsldkfop mqsdfmlk lksdfp pqpdfm mqsdfmj mlk
2 qsdf
3 mp mlksdfm mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll
4 qsddddddddddddddddddddddddddddddd
5 lmj mjjmjmjm lkj lmj mjjmjmjm lkj
Any solution that does not cut in halves but cuts "so that the first part isn't longer than X characters" would also be great.
First, we split the strings by spaces.
a <- strsplit(test$init, " ")
Then we find the last element of each vector for which the cumulative sum of characters is at most half the sum of all characters in the vector:
b <- lapply(a, function(x) which.max(cumsum(cumsum(nchar(x)) <= sum(nchar(x))/2)))
Afterwards we combine the two halves, substituting NA if the vector was of length 1 (only one word). The test is on length(x) rather than on y, because y is also 1 for a multi-word string whose split point falls after its first word.
combined <- Map(function(x, y){
  if(length(x) == 1){
    return(c(x, NA))
  }else{
    return(c(paste(x[1:y], collapse = " "), paste(x[(y+1):length(x)], collapse = " ")))
  }
}, a, b)
Finally, we rbind the combined strings and change the column names.
newdf <- do.call(rbind.data.frame, combined)
names(newdf) <- c("first", "second")
Result:
> newdf
  first                                second
1 qsdf mqsldkfop mqsdfmlk              lksdfp pqpdfm mqsdfmj mlk
2 qsdf                                 <NA>
3 mp mlksdfm                           mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll
4 qsddddddddddddddddddddddddddddddd    <NA>
5 qsdfmlk mlk                          mkljlmkjlmkjml lmj mjjmjmjm lkj
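The question also asks for a variant where the first part is no longer than X characters. A minimal base R sketch of that variant is below. The function name split_at_width and its max_chars argument are hypothetical, and the sketch assumes non-empty strings whose words are separated by single spaces.

```r
# Hypothetical helper: split `s` at a space so that the first part has at most
# `max_chars` characters (the first word is always kept, even if it is longer)
split_at_width <- function(s, max_chars) {
  words <- strsplit(s, " ", fixed = TRUE)[[1]]
  # Length of the first part if it ended after word i: the words so far
  # plus the single spaces between them
  len_upto <- cumsum(nchar(words)) + seq_along(words) - 1
  # Number of leading words that fit; at least one
  k <- max(1, sum(len_upto <= max_chars))
  first  <- paste(words[seq_len(k)], collapse = " ")
  second <- if (k < length(words)) {
    paste(words[-seq_len(k)], collapse = " ")
  } else {
    NA_character_
  }
  c(first = first, second = second)
}

split_at_width("qsdf mqsldkfop mqsdfmlk lksdfp pqpdfm mqsdfmj mlk", 20)
```

Applied to a whole column, something like t(sapply(test$init, split_at_width, max_chars = 20)) would give a two-column character matrix, assuming test is the data frame from the question.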
You can use the function nbreak from the package that I wrote:
devtools::install_github("igorkf/breaker")
library(tidyverse)
test <- data.frame(init = c("Phrase with four words", "That phrase has five words"), stringsAsFactors = F)
#This counts the numbers of words of each row:
nwords = str_count(test$init, " ") + 1
#This is the position where break the line for each row:
break_here = ifelse(nwords %% 2 == 0, nwords/2, round(nwords/2) + 1)
test
# init
# 1 Phrase with four words
# 2 That phrase has five words
#the map2_chr is applying a function with two arguments,
#the string is "init" and the n is "break_here":
test %>%
  mutate(init = map2_chr(init, break_here, ~breaker::nbreak(string = .x, n = .y, loop = F))) %>%
  separate(init, c("first", "second"), sep = "\n")
# first second
# 1 Phrase with four words
# 2 That phrase has five words
In r, How can you read X number of fixed width flat files into Y data frames based on the file name?
So I have 80 files that come in the file name format:
P.A3588.ACO.CCLF0.D00001.TO30000
P.A3588.ACO.CCLF0.D30001.TO60000
...
P.A3588.ACO.CCLF1.D30001.TO60000
P.A3588.ACO.CCLF1.D30001.TO60000
...
P.A3588.ACO.CCLF9.D30001.TO60000
P.A3588.ACO.CCLF9.D30001.TO60000
There are 80 fixed width text files: 8 parts for each of the 10 CCLF numbers (CCLF0, CCLF1, ..., CCLF9). I want to be able to group by CCLF number, apply the column width vector, add column names, and bind the rows of the CCLF parts.
Below is what I've tried so far. It doesn't work, but gives an idea of what I'm attempting.
filenames <- list.files(dataPath)
names <- substr(filenames,13,17)
CCLF1_width <- c(13,6,11,2,10,10,1,1,7,7,2,17,1,2,2,4,1,10,10,10,10,10,2,10,10,10,11,2,2,1,1,1)
CCLF2_width <- c(13,10,11,2,10,10,4,10,5,11,6,10,10,24,17,2,2,2,2,2)
CCLF3_width <- c(13,11,2,2,7,10,11,6,10,10,1)
CCLF4_width <- c(13,11,2,1,2,7,11,6,10,10,7,1)
CCLF5_width <- c(13,10,11,2,10,10,3,2,2,1,2,10,10,5,15,1,7,10,10,2,2,2,10,10,40,11,17,24,2,2,2,2,2,2,7,7,7,7,7,7,7,7,1)
CCLF6_width <- c(13,10,11,2,10,10,1,2,10,10,5,15,1,10,10,2,2,2,10,10,40,11,17,2)
CCLF7_width <- c(13,11,11,2,10,2,20,1,1,24,9,2,20,13,2,10,10,12,9)
CCLF8_width <- c(11,2,3,5,10,1,1,3,2,2,10,10,10,30,15,40,1,1)
CCLF9_width <- c(11,11,10,10,12)
CCLF0_width <- c(11,11)
for (i in length(filenames)){
  assign(paste0(substr(filenames,13,17)), read_fwf(grepl("CCLF1",filenames), paste0(i,"_width")))
}
Consider mapping through two equal-length lists and iterating elementwise with Map (a wrapper to mapply):
cclf_files <- paste0("CCLF", 0:9)   # seq(0:9) would give 1..10, not 0..9
cclf_widths <- list(
  CCLF0_width = c(11,11),
  CCLF1_width = c(13,6,11,2,10,10,1,1,7,7,2,17,1,2,2,4,1,10,10,10,10,10,2,10,10,10,11,2,2,1,1,1),
  CCLF2_width = c(13,10,11,2,10,10,4,10,5,11,6,10,10,24,17,2,2,2,2,2),
  CCLF3_width = c(13,11,2,2,7,10,11,6,10,10,1),
  CCLF4_width = c(13,11,2,1,2,7,11,6,10,10,7,1),
  CCLF5_width = c(13,10,11,2,10,10,3,2,2,1,2,10,10,5,15,1,7,10,10,2,2,2,10,10,40,11,17,24,2,2,2,2,2,2,7,7,7,7,7,7,7,7,1),
  CCLF6_width = c(13,10,11,2,10,10,1,2,10,10,5,15,1,10,10,2,2,2,10,10,40,11,17,2),
  CCLF7_width = c(13,11,11,2,10,2,20,1,1,24,9,2,20,13,2,10,10,12,9),
  CCLF8_width = c(11,2,3,5,10,1,1,3,2,2,10,10,10,30,15,40,1,1),
  CCLF9_width = c(11,11,10,10,12)
)
proc_files <- function(f, w) {
  # RETRIEVE FILES WITH CURRENT CCLF# IN NAME
  # (full.names = TRUE so read.fwf gets a usable path outside the working directory)
  files <- list.files(path = "/path/to/cclf/files", pattern = f, full.names = TRUE)
  print(files)
  # BUILD A LIST OF DFs FROM ALL FILES WITH CURRENT CCLF#_width
  df_list <- lapply(files, function(x) {
    tryCatch(read.fwf(x, widths=w), error = function(e) return(NA))
  })
  # ROW BIND ALL DFs TO FINAL FOR RETURN
  do.call(rbind, df_list)
}
# BUILD A NAMED LIST OF DATA FRAMES FOR EACH CCLF#
df_list <- setNames(Map(proc_files, cclf_files, cclf_widths), cclf_files)
df_list$CCLF0
df_list$CCLF1
df_list$CCLF2
...
This problem can be solved with a for loop and a function. I wrote the function combine_fwf to combine all fixed width files with a given CCLF number into a single dataframe. The list.files line finds all files in the current working directory whose names contain CCLFn, where n is one of your numbers (0-9). It then uses read.fwf to read each of those files into a dataframe, cbinds the CCLF number onto it, and rbinds it onto the overall result.
combine_fwf = function(CCLF_num, colwidths) {
  filenames = list.files(pattern = paste0("CCLF", CCLF_num))
  df_list = vector("list", length(filenames))
  for (i in seq_along(filenames)) {
    df_list[[i]] = cbind.data.frame(CCLF_num, read.fwf(filenames[[i]], colwidths))
  }
  return(do.call(rbind, df_list))
}
> combine_fwf(2, c(12,6))
  CCLF_num           V1     V2
1        2 adsfasdfadsf 123123
2        2 lkjhlkjhlkjh  98098
3        2 adsfasdfadsf 123123
4        2 lkjhlkjhlkjh  98098
5        2 adsfasdfadsf 123123
6        2 lkjhlkjhlkjh  98098
> combine_fwf(1, c(12,6))
  CCLF_num           V1     V2
1        1 adsfasdfadsf 123123
2        1 lkjhlkjhlkjh  98098
3        1 adsfasdfadsf 123123
4        1 lkjhlkjhlkjh  98098
5        1 adsfasdfadsf 123123
6        1 lkjhlkjhlkjh  98098
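The question also asks for column names, which neither approach above adds. The following is a self-contained sketch of one way to read every part of one CCLF group with names attached. The files are made-up two-field records written to a temporary directory, and the column names id and value, the directory name cclf_demo and the helper read_group are all assumptions for illustration.

```r
# Throwaway directory with made-up fixed-width files; the file names mimic
# the pattern in the question, the contents are invented
dir_demo <- file.path(tempdir(), "cclf_demo")
dir.create(dir_demo, showWarnings = FALSE)
rec <- function(letter, n) paste0(strrep(letter, 11), sprintf("%011d", n))
writeLines(c(rec("A", 1L), rec("B", 2L)),
           file.path(dir_demo, "P.A3588.ACO.CCLF0.D00001.TO30000"))
writeLines(rec("C", 3L),
           file.path(dir_demo, "P.A3588.ACO.CCLF0.D30001.TO60000"))

# One width vector and one (hypothetical) name vector per CCLF number
widths    <- list(CCLF0 = c(11, 11))
col_names <- list(CCLF0 = c("id", "value"))

# Read every part of one CCLF group and stack the rows
read_group <- function(key) {
  files <- list.files(dir_demo, pattern = key, full.names = TRUE)
  parts <- lapply(files, read.fwf,
                  widths = widths[[key]],
                  col.names = col_names[[key]],
                  colClasses = "character")  # keep leading zeros
  do.call(rbind, parts)
}

# Named list with one data frame per CCLF number
dfs <- lapply(setNames(names(widths), names(widths)), read_group)
dfs$CCLF0
```

Keeping the widths and names in lists keyed by the same string as the file pattern means adding CCLF1 to CCLF9 is a matter of adding list entries, not more code.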
selecting n consequent grouped variables and apply the function in r
Here is example data:
myd <- data.frame (matrix (sample (c("AB", "BB", "AA"), 100*100, replace = T), ncol = 100))
variablenames= paste (rep (paste ("MR.", 1:10,sep = ""), each = 10), 1:100, sep = ".")
names(myd) <- variablenames
Each variable belongs to a group; here we have ten groups. The group index for each variable in this data frame is as follows:
group <- rep(1:10, each = 10)
Thus the variable names and groups are:
data.frame (group, variablenames)
    group variablenames
1       1        MR.1.1
2       1        MR.1.2
3       1        MR.1.3
4       1        MR.1.4
5       1        MR.1.5
6       1        MR.1.6
7       1        MR.1.7
8       1        MR.1.8
9       1        MR.1.9
10      1       MR.1.10
11      2       MR.2.11
<<<<<<<<<<<<<<<<<<<<<<<<
100    10     MR.10.100
Grouping means that the following steps should be applied to each group of variables separately.
My real function is longer; the following is a short example.
Function considering two variables at a time:
myfun <- function (x1, x2) {
  out <- NULL
  out <- paste(x1, x2, sep=":")
  # for other steps to be performed here
  return (out)
}
# group 1
myfun (myd[,1], myd[,2]); myfun (myd[,3], myd[,4]); myfun (myd[,5], myd[,6]);
myfun (myd[,7], myd[,8]); myfun (myd[,9], myd[,10]);
# group 2
myfun (myd[,11], myd[,12]); myfun (myd[,13], myd[,14]); .......so on to group 10 ;
In this way I need to walk through variables 1:10 (i.e. the first group) to perform the above action, then 11:20 (the second group), and so on.
The grouping causes no trouble in this case, because the number of variables in each group (10) is divisible by the number of variables taken at a time (2).
However, in the following example, where 3 variables are taken at a time, 10 is not divisible by 3, so one variable is left over at the end of each group.
Function considering three variables at a time:
myfun <- function (x1, x2, x3) {
  out <- NULL
  out <- paste(x1, x2, x3, sep=":")
  # for other steps to be performed here
  return (out)
}
# for group 1
myfun (myd[,1], myd[,2], myd[,3])
myfun (myd[,4], myd[,5], myd[,6])
myfun (myd[,7], myd[,8], myd[,9])
# As there is one variable left before proceeding to the second group, the final set will have 1 extra variable
myfun (myd[,7], myd[,8], myd[,9],myd[,10] )
# for group 2
myfun (myd[,11], myd[,12], myd[,13])
# and so on to the end of all groups and the end of the file.
I want to loop this process with a user-defined number n of variables considered at a time, where n may be from 1 to the maximum number of variables in each group.
Edit: Just an illustration to show the process (only groups 1 and 2 are demonstrated):
Create a function that will split your data up into appropriate lists, and apply whatever functions you want to your list.
This function will create your second grouping variable. (The first grouping variable (group) is provided in your question; if you change that value, you should also change DIM in the function below.)
myfun = function(LENGTH, DIM = 10) {
  PATTERN = rep(1:(DIM %/% LENGTH), each=LENGTH)
  c(PATTERN, rep(max(PATTERN), DIM %% LENGTH))
}
Here are the groups on which we will split myd. In this example, we are splitting myd first into 10-column groups, and each group into 3-column groups, except for the last group, which will have 4 columns (3+3+4 = 10).
NOTE: To change the number of columns you're grouping by, for example, grouping by two variables at a time, change group2 = rep(myfun(3), length.out=100) to group2 = rep(myfun(2), length.out=100).
group <- rep(1:10, each = 10)
# CHANGE THE FOLLOWING LINE ACCORDING
# TO THE NUMBER OF GROUPS THAT YOU WANT
group2 = rep(myfun(3), length.out=100)
This is the splitting process. We first split up just by names, and match those names with myd to create a list of data.frames.
# Extract group names for matching purposes
temp = split(names(myd), list(group, group2))
# Match the names to myd
temp = lapply(1:length(temp), function(x) myd[, which(names(myd) %in% temp[[x]])])
# Extract the names from the list for future reference
NAMES = lapply(temp, function(x) paste(names(x), collapse="_"))
Now that we have a list, we can do lots of fun things. You wanted to paste your columns together separated by a colon. Here's how you'd do that.
# Do what you want with the list
# For example, to paste the columns together:
FINAL = lapply(temp, function(x) apply(x, 1, paste, collapse=":"))
names(FINAL) = NAMES
Here's a sample of the output:
lapply(FINAL, function(x) head(x, 5))
# $MR.1.1_MR.1.2_MR.1.3
# [1] "AA:AB:AB" "AB:BB:AA" "BB:AB:AA" "BB:AA:AB" "AA:AA:AA"
#
# $MR.2.11_MR.2.12_MR.2.13
# [1] "BB:AA:AB" "BB:AB:BB" "BB:AA:AA" "AB:BB:AA" "BB:BB:AA"
#
# $MR.3.21_MR.3.22_MR.3.23
# [1] "AA:AB:BB" "BB:AA:AA" "AA:AB:BB" "AB:AA:AA" "AB:BB:BB"
#
# <<<<<<<------SNIP------>>>>>>>>
#
# $MR.1.4_MR.1.5_MR.1.6
# [1] "AB:BB:AA" "BB:BB:BB" "AA:AA:AA" "BB:BB:AB" "AB:AA:AA"
#
# $MR.2.14_MR.2.15_MR.2.16
# [1] "AA:BB:AB" "BB:BB:BB" "BB:BB:AB" "AA:BB:AB" "BB:BB:BB"
#
# $MR.3.24_MR.3.25_MR.3.26
# [1] "AA:AB:BB" "BB:AA:BB" "BB:AB:BB" "AA:AB:AA" "AB:AA:AA"
#
# <<<<<<<------SNIP------>>>>>>>>
#
# $MR.1.7_MR.1.8_MR.1.9_MR.1.10
# [1] "AB:AB:AA:AB" "AB:AA:BB:AA" "BB:BB:AA:AA" "AB:BB:AB:AA" "AB:BB:AB:BB"
#
# $MR.2.17_MR.2.18_MR.2.19_MR.2.20
# [1] "AB:AB:BB:BB" "AB:AB:BB:BB" "AB:AA:BB:BB" "AA:AA:AB:AA" "AB:AB:AB:AB"
#
# $MR.3.27_MR.3.28_MR.3.29_MR.3.30
# [1] "BB:BB:AB:BB" "BB:BB:AA:AA" "AA:BB:AB:AA" "AA:BB:AB:AA" "AA:AB:AA:BB"
#
# $MR.4.37_MR.4.38_MR.4.39_MR.4.40
# [1] "BB:BB:AB:AA" "AA:BB:AA:BB" "AA:AA:AA:AB" "AB:AA:BB:AB" "BB:BB:BB:BB"
#
# $MR.5.47_MR.5.48_MR.5.49_MR.5.50
# [1] "AB:AA:AA:AB" "AB:AA:BB:AA" "AB:BB:AA:AA" "AB:BB:BB:BB" "BB:AA:AB:AA"
#
# $MR.6.57_MR.6.58_MR.6.59_MR.6.60
# [1] "BB:BB:AB:AA" "BB:AB:BB:AA" "AA:AB:AB:BB" "BB:AB:AA:AB" "AB:AA:AB:BB"
#
# $MR.7.67_MR.7.68_MR.7.69_MR.7.70
# [1] "BB:AB:BB:AA" "BB:AB:BB:AA" "BB:AB:BB:AB" "AB:AA:AA:AA" "AA:AA:AA:AB"
#
# $MR.8.77_MR.8.78_MR.8.79_MR.8.80
# [1] "AA:AB:AA:AB" "AB:AA:AB:BB" "BB:BB:AA:AB" "AB:BB:BB:BB" "AB:AA:BB:AB"
#
# $MR.9.87_MR.9.88_MR.9.89_MR.9.90
# [1] "AA:BB:AB:AA" "AA:AB:BB:BB" "AA:BB:AA:BB" "AB:AB:AA:BB" "AB:AA:AB:BB"
#
# $MR.10.97_MR.10.98_MR.10.99_MR.10.100
# [1] "AB:AA:BB:AB" "AB:AA:AB:BB" "BB:AB:AA:AA" "BB:BB:AA:AA" "AB:AB:BB:AB"
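The two-level grouping can be checked on a smaller data set. The sketch below uses 20 columns in two groups of ten and splits the columns directly with split.default instead of matching names. The names chunk_id, toy, pieces and pasted are hypothetical, and the seed is arbitrary.

```r
# Hypothetical helper: chunk id for each of `dim` columns, `n` columns per
# chunk, with any leftover columns folded into the last chunk
chunk_id <- function(n, dim = 10) {
  k <- dim %/% n
  c(rep(seq_len(k), each = n), rep(k, dim %% n))
}

set.seed(1)  # arbitrary seed so the toy data is reproducible
toy <- data.frame(matrix(sample(c("AB", "BB", "AA"), 4 * 20, replace = TRUE),
                         ncol = 20))

group  <- rep(1:2, each = 10)                 # two groups of ten columns
group2 <- rep(chunk_id(3), length.out = 20)   # 3 + 3 + 4 inside each group

# split.default() treats the data frame as a list, so it splits by column
pieces <- split.default(toy, list(group, group2))

# Paste the columns of each piece together row by row
pasted <- lapply(pieces, function(d) do.call(paste, c(d, sep = ":")))

sapply(pieces, ncol)
```

Each element of pasted has one string per row of toy, and the last chunk of each group carries the extra column.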
I suggest recoding myfun to take a matrix and to use pasteCols from the plotrix package.
library(plotrix)
myfun = function(x){
  out = pasteCols(t(x), sep = ":")
  # some code
  return(out)
}
Then it's very easy: for each group, compute the index of the first and of the last column you want to use when you call myfun, using modulus and integer division:
rubiques_solution = function(group, myd, num_to_group){
  # loop over groups
  for(g in unique(group)){
    var_index = which(group == g)
    num_var = length(var_index)
    # test to make sure num_to_group is not larger than the number of variables
    if(num_var < num_to_group){
      stop("num_to_group > number of variable in at least one group")
    }
    # number of calls to myfun
    num_calls = num_var %/% num_to_group
    # the idea here is that we create the first and last column
    # in which we are interested for each call
    first = seq(from = var_index[1], by = num_to_group, length = num_calls)
    last = first + num_to_group -1
    # the last call may contain more variables, we adjust here:
    last[length(last)] = last[length(last)] + (num_var %% num_to_group)
    # seq_len() visits every call; for(i in num_calls) would run only the last one
    for(i in seq_len(num_calls)){
      # maybe do something with the return value of myfun ?
      myfun(myd[,first[i]:last[i]])
    }
  }
}
group = rep(1:10, each = 10) # same as yours
myd = data.frame (matrix (sample (c("AB", "BB", "AA"), 100*100, replace = T), ncol = 100)) # same as yours
num_to_group = 2 # this is your first example
rubiques_solution(group, myd, num_to_group)
I hope I understood the problem correctly.
using R to compare data frames to find first occurrence to df1$columnA in df2$columnB when df2$columnB is a single space separated list
I have a question regarding data frames in R. I want to take a data.frame, dfy, and find the first occurrence of dfy$workerId in dfx$workers, to create a new dataframe, dfz, a copy of dfy that also contains the first occurrence of dfy$workerId in dfx$workers as dfz$highestRankingGroup. It's a little tricky because dfx$workers is a single-space-separated string. My original plan was to do this in Perl, but I would like to find a way to work in R and avoid having to write out to temp files.
Thank you for your time.
y <- "name,workerId,aptitude
joe,4,34
steve,5,42
jon,7,23
nick,8,122"
x <- "workers,projectScore
1 2 3 8 ,92
1 2 5 9 ,89
3 5 7 ,85
1 8 9 10 ,82
4 5 7 8 ,83
1 3 5 7 8 ,79"
z <- "name,workerId,aptitude,highestRankingGroup
joe,4,0.34,5
steve,5,0.42,2
jon,7,0.23,3
nick,8,0.122,1"
dfy <- read.csv(textConnection(y), header=TRUE, sep=",", stringsAsFactors=FALSE)
dfx <- read.csv(textConnection(x), header=TRUE, sep=",", stringsAsFactors=FALSE)
dfz <- read.csv(textConnection(z), header=TRUE, sep=",", stringsAsFactors=FALSE)
First, add the highestRankingGroup column to your dataset dfx:
dfx$highestRankingGroup <- seq(1, length(dfx$projectScore))
Since you have mentioned Perl, you can do a familiar Perl thing and simply split the workers column on whitespace. I combined the splitting with functions from the plyr package, which are always nice to work with.
library(plyr)
df.l <- dlply(dfx, "projectScore")
f.reshape <- function(x) {
  wrk <- strsplit(x$workers, "\\s", perl = TRUE)
  data.frame(worker = wrk[[1]]
             , projectScore = x$projectScore
             , highestRankingGroup = x$highestRankingGroup
  )
}
df.tmp <- ldply(df.l, f.reshape)
df.z1 <- merge(df.tmp, dfy, by.x = "worker", by.y = "workerId")
Now you have to look for the max values in the projectScore column:
df.z2 <- ddply(df.z1, "name", function(x) x[x$projectScore == max(x$projectScore), ])
This produces:
R> df.z2
  worker .id projectScore highestRankingGroup  name aptitude
1      4  83           83                   5   joe       34
2      7  85           85                   3   jon       23
3      8  92           92                   1  nick      122
4      5  89           89                   2 steve       42
R>
You can reshape the df.z2 dataframe according to your personal taste. Simply look at the different steps and the objects they produce to see at which step the different columns are introduced.
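The same lookup can also be written in base R without reshaping. The sketch below rebuilds the question's data inline so that it runs on its own, and it assumes, as the expected output in the question suggests, that "first occurrence" means the first row of dfx in file order. The helper name first_row and the variable ids_by_row are hypothetical.

```r
# The question's data, rebuilt inline so the example is self-contained
dfy <- data.frame(name = c("joe", "steve", "jon", "nick"),
                  workerId = c(4, 5, 7, 8),
                  aptitude = c(34, 42, 23, 122),
                  stringsAsFactors = FALSE)
dfx <- data.frame(workers = c("1 2 3 8 ", "1 2 5 9 ", "3 5 7 ",
                              "1 8 9 10 ", "4 5 7 8 ", "1 3 5 7 8 "),
                  projectScore = c(92, 89, 85, 82, 83, 79),
                  stringsAsFactors = FALSE)

# One character vector of worker ids per row of dfx
ids_by_row <- strsplit(trimws(dfx$workers), " +")

# Hypothetical helper: index of the first dfx row whose id list contains
# `id`, or NA if the worker never appears
first_row <- function(id) {
  hits <- which(vapply(ids_by_row,
                       function(ids) as.character(id) %in% ids,
                       logical(1)))
  if (length(hits)) hits[1] else NA_integer_
}

dfz <- dfy
dfz$highestRankingGroup <- vapply(dfy$workerId, first_row, integer(1))
dfz
```

Comparing whole tokens with %in% rather than searching the raw string avoids matching worker 1 inside "10".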
Before I start, I recommend that you go with @mropa's answer. This answer is a bit of fun I had messing about with your question. On the plus side, it does involve a bit of fun with function closures ;)
Essentially, I create a function that returns two functions.
updateDFz = function(dfy) {
  ## Create a default dfz matrix
  dfz = dfy
  dfz$HRG = 10000 ## Big max value
  counter = 0
  ## Update the dfz matrix after every row
  update = function(x) {
    counter <<- counter + 1
    for(i in seq_along(x)) {
      if(is.element(x[i], dfz$workerId))
        dfz[dfz$workerId == x[i],]$HRG <<- min(dfz[dfz$workerId == x[i],]$HRG, counter)
    }
    return(dfz)
  }
  ## Get the dfz matrix
  getDFz = function() return(dfz)
  list(getDFz=getDFz, update=update)
}
f = updateDFz(dfy)
lapply(strsplit(dfx$workers, " "), f$update)
f$getDFz()
As I said, a bit of fun ;)
Hopefully someone finds this useful.
# Receives a data.frame and a search column
# Returns a data.frame of the first occurrences of all unique values of the "search" column
getfirsts <- function(data, searchcol){
  rows <- as.data.frame(match(unique(data[[searchcol]]), data[[searchcol]]))
  firsts = data[rows[[1]],]
  return(firsts)
}