I am trying to read a file into R that has different delimiters. The first row uses spaces as delimiters. From the 2nd row to the last, there is a space between the first column and the second, and another between the second and the third; after that, the whole block of twos, zeros and ones should become separate columns.
Any hint?!
ID Chip AX-77047182 AX-80910836 AX-80737273 AX-77048714 AX-77048779 AX-77050447
3811582 1 2002202222200202022020200200220200222200022220002200000201202000222022
3712982 1 2002202222200202022020200200220200222200022220002200000200202000222022
3712990 1 2002202211200202021011100101210200111101022121112100111110211110122122
3713019 1 2002202211200202021011100101210200111101022121112100111110211110122122
3713025 1 2002202211200202021011100101210200111101022121112100111110211110122122
3713126 1 2002202222200202022020200200220200222200022220002200000200202000222022
Certainly not the most elegant solution, but you could try the following. If I have understood your example data correctly, you have not provided all the column names (AX-77047182, ...) that would be needed for the rows of zeros/ones/twos. If my understanding is wrong, the approach below will not give the desired result, but it might still help you find a workaround - you could simply adapt the delimiter in the second split command. I hope this helps.
#read file as character vector
chipstable <- readLines(".../chips.txt")
#extract first line to be used as column names
tablehead <- unlist(strsplit(chipstable[1], " "))
#split by first delimiter, i.e., space
chipstable <- strsplit(chipstable[2:length(chipstable)], " ")
#split by second delimiter, i.e., between each character (here number)
#and merge the two split results in one line
chipstable <- lapply(chipstable, function(x) {
c(x[1:2], unlist(strsplit(x[3], "")))
})
#combine all lines to a data frame
chipstable <- do.call(rbind, chipstable)
#assign column names
colnames(chipstable) <- tablehead
#turn values to numeric (if needed)
chipstable <- apply(chipstable, 2, as.numeric)
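Note that apply() returns a matrix here; if you prefer a data frame for further work, you could add one optional step:
#convert the matrix to a data frame (optional)
chipstable <- as.data.frame(chipstable)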
You can try reading the lines and splitting them on a pattern (a plain space for the header line, " 1 " for the data lines), then bind the results afterwards.
For instance:
data <- "ID Chip AX-77047182 AX-80910836 AX-80737273 AX-77048714 AX-77048779 AX-77050447
3811582 1 2002202222200202022020200200220200222200022220002200000201202000222022
3712982 1 2002202222200202022020200200220200222200022220002200000200202000222022
3712990 1 2002202211200202021011100101210200111101022121112100111110211110122122
3713019 1 2002202211200202021011100101210200111101022121112100111110211110122122
3713025 1 2002202211200202021011100101210200111101022121112100111110211110122122
3713126 1 2002202222200202022020200200220200222200022220002200000200202000222022"
teste <- strsplit(data, split = "\n")
for (i in seq_along(teste[[1]])) {
  if (i == 1) {
    dataOut <- strsplit(teste[[1]][i], split = " ")
  } else {
    dataOut <- strsplit(teste[[1]][i], split = " 1 ")
  }
  print(dataOut)
}
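To do the bind afterwards, one possible sketch (the object names pieces/out/genotypes are just illustrative; it keeps the two parts produced by the " 1 " split as two columns):
pieces <- strsplit(teste[[1]][-1], split = " 1 ")
out <- as.data.frame(do.call(rbind, pieces), stringsAsFactors = FALSE)
names(out) <- c("ID", "genotypes")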
I want to concatenate text across 20 columns of my dataset (dat), skipping all NA values.
For example, if the first row had "cat" in column 1, "dog" in column 2, and NA in column 3, I want to compile that as "cat dog" in a new column (dat$results). Here's what I have:
m <- ""
for(i in 1:20){
if(!is.na(dat[,i])){
m <- paste(m, dat[,i], sep = " ")
}
else {
next
}
}
dat$results <- m
The loop only runs up to column 3 (which is NA for my first row). Not a problem for that first row, BUT other rows that do have text in column 3 don't get that column compiled. What can I do here?
Maybe just concatenate all the columns in one go and remove the NAs and double spaces afterwards:
dat$results <- gsub('\\s\\s', ' ', gsub('NA', '', do.call(paste, dat[, 1:20])))
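A quick check on a smaller toy example (three columns instead of 20; note that a trailing space can remain, which trimws() would remove, and that gsub('NA', '') would also hit a literal "NA" inside real words):
dat_toy <- data.frame(a = c("cat", "fish"), b = c("dog", NA), c = c(NA, "bird"),
                      stringsAsFactors = FALSE)
dat_toy$results <- gsub('\\s\\s', ' ', gsub('NA', '', do.call(paste, dat_toy[, 1:3])))
dat_toy$results
# [1] "cat dog "  "fish bird"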
I have a vector of strings, and I want each string to be cut roughly in half, at the nearest space.
For example, with the following data:
test <- data.frame(init = c("qsdf mqsldkfop mqsdfmlk lksdfp pqpdfm mqsdfmj mlk",
"qsdf",
"mp mlksdfm mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll",
"qsddddddddddddddddddddddddddddddd",
"qsdfmlk mlk mkljlmkjlmkjml lmj mjjmjmjm lkj"), stringsAsFactors = FALSE)
I want to get something like this:
  first                               sec
1 qsdf mqsldkfop mqsdfmlk             lksdfp pqpdfm mqsdfmj mlk
2 qsdf
3 mp mlksdfm                          mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll
4 qsddddddddddddddddddddddddddddddd
5 qsdfmlk mlk mkljlmkjlmkjml          lmj mjjmjmjm lkj
Any solution that does not cut in halves but instead ensures that "the first part isn't longer than X characters" would also be great.
First, we split the strings by spaces.
a <- strsplit(test$init, " ")
Then we find the last element of each vector for which the cumulative sum of characters is lower than half the sum of all characters in the vector:
b <- lapply(a, function(x) which.max(cumsum(cumsum(nchar(x)) <= sum(nchar(x))/2)))
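To see what this gives, take the first string as an example:
nchar(a[[1]])
# [1] 4 9 8 6 6 7 3
cumsum(nchar(a[[1]])) <= sum(nchar(a[[1]]))/2
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
b[[1]]
# [1] 3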
Afterwards we combine the two halves, substituting NA if the vector was of length 1 (only one word).
combined <- Map(function(x, y){
if(y == 1){
return(c(x, NA))
}else{
return(c(paste(x[1:y], collapse = " "), paste(x[(y+1):length(x)], collapse = " ")))
}
}, a, b)
Finally, we rbind the combined strings and change the column names.
newdf <- do.call(rbind.data.frame, combined)
names(newdf) <- c("first", "second")
Result:
> newdf
                               first                                  second
1            qsdf mqsldkfop mqsdfmlk               lksdfp pqpdfm mqsdfmj mlk
2                               qsdf                                    <NA>
3                         mp mlksdfm mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll
4  qsddddddddddddddddddddddddddddddd                                    <NA>
5                        qsdfmlk mlk         mkljlmkjlmkjml lmj mjjmjmjm lkj
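For the variant where the first part should not be longer than X characters, the same idea can be tweaked by comparing against a fixed threshold instead of half the total length. A sketch (the + 1 accounts for the spaces between words; X = 25 is just an example):
X <- 25
b <- lapply(a, function(x) which.max(cumsum(cumsum(nchar(x) + 1) - 1 <= X)))
The rest (the Map() call and the rbind) stays the same.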
You can use the function nbreak from the package that I wrote:
devtools::install_github("igorkf/breaker")
library(tidyverse)
test <- data.frame(init = c("Phrase with four words", "That phrase has five words"), stringsAsFactors = F)
#This counts the number of words in each row:
nwords = str_count(test$init, " ") + 1
#This is the position at which to break the line for each row:
break_here = ifelse(nwords %% 2 == 0, nwords/2, round(nwords/2) + 1)
test
# init
# 1 Phrase with four words
# 2 That phrase has five words
#map2_chr applies a function with two arguments,
#the string is "init" and the n is "break_here":
test %>%
mutate(init = map2_chr(init, break_here, ~breaker::nbreak(string = .x, n = .y, loop = F))) %>%
separate(init, c("first", "second"), sep = "\n")
#             first     second
# 1     Phrase with four words
# 2 That phrase has five words
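If you would rather not install an extra package, a similar split can be sketched with stringr::word() (stringr is loaded with the tidyverse), reusing the same break_here rule; this is untested beyond the toy example above:
test %>%
  mutate(first = stringr::word(init, 1, break_here),
         second = ifelse(break_here < nwords, stringr::word(init, break_here + 1, nwords), NA))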
The first column in my data.frame consists of strings, and the second column contains unique keys.
I want to extract all words after the nth word from each string, and if the string has <= n words, extract the entire string.
I have over 10k rows in my data.frame and was wondering if there is a quick way of doing this other than using for loops?
Thanks.
How about the following:
# Generate some sample data
library(tidyverse)
df <- data.frame(
one = c("Entries from row one", "Entries from row two", "Entries from row three"),
two = runif(3))
# Define function to extract all words after the nth word
# (or return the full string if n >= # of words in string)
crop_string <- function(ss, n) {
lapply(strsplit(as.character(ss), "\\s"), function(v)
if (length(v) > n) paste(v[(n + 1):length(v)], collapse = " ")
else paste(v, collapse = " "))
}
# Let's crop strings from column one by removing the first 3 words (n = 3)
n <- 3;
df %>%
mutate(words_after_n = crop_string(one, n))
# one two words_after_n
#1 Entries from row one 0.5120053 one
#2 Entries from row two 0.1873522 two
#3 Entries from row three 0.0725107 three
# If n > # of words, return the full string
n <- 10;
df %>%
mutate(words_after_n = crop_string(one, n))
# one two words_after_n
#1 Entries from row one 0.9363278 Entries from row one
#2 Entries from row two 0.3024628 Entries from row two
#3 Entries from row three 0.6666226 Entries from row three
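If speed matters for the 10k rows, a fully vectorised alternative is a single regex that strips the first n whitespace-separated words whenever there are more than n of them (a sketch, using the same df and column one as above):
n <- 3
df %>%
  mutate(words_after_n = sub(sprintf("^(\\S+\\s+){%d}", n), "", one))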
Here I use nchar() and substr(), so note that this cuts at the nth character rather than the nth word; also make sure your data has been converted to character first:
as.character(YOUR_DATA)
as.character(sapply(YOUR_DATA, function(x, y) {
  if (nchar(x) >= y) {
    substr(x, y, nchar(x))
  } else {
    x
  }
}, y = nth_data_you_want))
Assume the data looks like this:
"gene#seq"
"Cblb#TAGTCCCGAAGGCATCCCGA"
"Fbxo27#CCCACGTGTTCTCCGGCATC"
"Fbxo11#GGAATATACGTCCACGAGAA"
"Pwp1#GCCCGACCCAGGCACCGCCT"
Using 10 as the nth position, the result is:
"gene#seq"
"CCCGAAGGCATCCCGA"
"CACGTGTTCTCCGGCATC"
"AATATACGTCCACGAGAA"
"GACCCAGGCACCGCCT"
I know that in R for loops should be avoided and vectorized operations should be used instead.
I want to solve this with a for loop and then try to use the apply family, then also in Rcpp.
I load a dataset containing one column of passwords (alphanumeric).
Once loaded (a sample, for speed), I want to create new columns with 0/1 values based on conditions such as "contains_lower_chars", "contains_numbers" and so on.
Here is what I tried; it doesn't work - each column I create ends up with the same value.
library(tidyverse)
set.seed(123)
# load dataset from url, skip the first 16 rows
df <- read.csv('http://datashaping.com/passwords.txt', header = F, skip = 16) %>%
sample_frac(.001) %>%
rename(password = V1)
patterns = c("[a-z]","[A-Z]","[0-9]+")
df$has_lower <- 0
df$has_upper <- 0
df$has_numeric <- 0
for(i in 1:nrow(df)){
for(j in patterns){
n <- ifelse(grepl(j, df$password[i]),1,0)
}
df$has_lower[i] <- n
df$has_upper[i] <- n
df$has_numeric[i] <- n
}
Output I have in mind is:
password has_lower has_upper has_numeric
Bigmaccas 1 1 0
0127515559 0 0 1
dbqky73p 1 0 1
We can simplify things if we just name your pattern vector. For example:
patterns = c(has_lower="[a-z]",
has_upper="[A-Z]",
has_numeric="[0-9]+")
for(pattern in names(patterns)) {
df[, pattern] = as.numeric(grepl(patterns[pattern], df$password))
}
Basically we just loop through each of the names, grab the regular expression corresponding to that name, then do the matching and add the column.
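As a quick check on a toy data frame (passwords borrowed from the example output above, using the named patterns vector defined here):
df_toy <- data.frame(password = c("Bigmaccas", "0127515559", "dbqky73p"),
                     stringsAsFactors = FALSE)
for(pattern in names(patterns)) {
  df_toy[, pattern] = as.numeric(grepl(patterns[pattern], df_toy$password))
}
df_toy
#     password has_lower has_upper has_numeric
# 1  Bigmaccas         1         1           0
# 2 0127515559         0         0           1
# 3   dbqky73p         1         0           1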
A data frame is above all a list.
So, you can simply do:
df[c("has_lower", "has_upper", "has_numeric")] <-
lapply(patterns, function(pattern) grepl(pattern, df$password) + 0)
Use + 0L instead of + 0 if you want integers instead of doubles (I would recommend doing nothing and keeping the logicals).
First you need to update has_lower, has_upper and has_numeric within the j loop, otherwise your n remains the same for these 3 cases. To do so you need to be able to loop over the names of the columns has_lower, has_upper and has_numeric:
names <- c("has_lower", "has_upper", "has_numeric")
for(i in 1:nrow(df)){
  for(j in 1:length(patterns)){
    df[i, names[j]] <- as.numeric(grepl(patterns[j], df$password[i]))
  }
}
A quicker, nicer, more compact alternative using lapply and the fact that grepl is already vectorized (note this uses data.table's := syntax, so df needs to be a data.table, e.g. read with fread as below):
df[, c("has_lower", "has_upper", "has_numeric") := lapply(patterns, function(x) grepl(x, df$password))]
Note (nothing to do with your question):
I advise you to use the fread function from data.table to read your dataset, since it is quite large.
df = fread('http://datashaping.com/passwords.txt', header = F, skip = 16)%>%
sample_frac(.001) %>%
rename(password = V1)
I need to read a csv file in R, but the file contains text information in some rows instead of comma-separated values, so I cannot read it with read.csv(fileName).
The content of the file is as follows:
name:russel date:21-2-1991
abc,2,saa
anan,3,ds
ama,ds,az
,,
name:rus date:23-3-1998
snans,32,asa
asa,2,saz
I need to store only the values of each name/date pair as a data frame. How can I read the file to do that?
Actually my required output is
>dataFrame1
abc,2,saa
anan,3,ds
ama,ds,az
>dataFrame2
snans,32,asa
asa,2,saz
You can read the data with scan and use grep and sub functions to extract the important values.
The text:
text <- "name:russel date:21-2-1991
abc,2,saa
anan,3,ds
ama,ds,az
,,
name:rus date:23-3-1998
snans,32,asa
asa,2,saz"
These commands generate a data frame with name and date values.
# read the text
lines <- scan(text = text, what = character())
# find strings starting with 'name' or 'date'
nameDate <- grep("^name|^date", lines, value = TRUE)
# extract the values
values <- sub("^name:|^date:", "", nameDate)
# create a data frame
dat <- as.data.frame(matrix(values, ncol = 2, byrow = TRUE,
dimnames = list(NULL, c("name", "date"))))
The result:
> dat
name date
1 russel 21-2-1991
2 rus 23-3-1998
Update
To extract the values from the strings, which do not contain name and date information, the following commands can be used:
# read data
lines <- readLines(textConnection(text))
# split lines
splitted <- strsplit(lines, ",")
# find positions of 'name' lines
idx <- grep("^name", lines)[-1]
# create grouping variable
grp <- cut(seq_along(lines), c(0, idx, length(lines)))
# extract values
values <- tapply(splitted, grp, FUN = function(x)
lapply(x, function(y)
if (length(y) == 3) y))
# create a list of data frames
dat <- lapply(values, function(x) as.data.frame(matrix(unlist(x),
ncol = 3, byrow = TRUE)))
The result:
> dat
$`(0,7]`
V1 V2 V3
1 abc 2 saa
2 anan 3 ds
3 ama ds az
$`(7,9]`
V1 V2 V3
1 snans 32 asa
2 asa 2 saz
I would read the entire file first as a list of characters, i.e. a string for each line in the file; this can be done using readLines. Next you have to find the places where the data for a new date starts, i.e. look for ,, - see grep for that. Then take the first entry of each data block, e.g. using str_extract from the stringr package. Finally, you need to split all the remaining data strings; see strsplit for that.
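A rough sketch of that approach (untested; it assumes the file looks exactly like the example above, with blocks separated by ,, lines, and the object names are just illustrative):
library(stringr)
# read the file line by line
lines <- readLines("file.csv")
# positions of the ',,' separator lines
breaks <- grep("^,,", lines)
# first and last line of each block
starts <- c(1, breaks + 1)
ends <- c(breaks - 1, length(lines))
# build one data frame per block, keeping name/date as attributes
dataFrames <- Map(function(s, e) {
  block <- lines[s:e]
  header <- block[1]
  name <- str_extract(header, "(?<=name:)\\S+")
  date <- str_extract(header, "(?<=date:)\\S+")
  body <- strsplit(block[-1], ",")
  df <- as.data.frame(do.call(rbind, body), stringsAsFactors = FALSE)
  attr(df, "name") <- name
  attr(df, "date") <- date
  df
}, starts, ends)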