Reading csv file, having numbers and strings in one column - r

I am importing a 3 column CSV file. The final column is a series of entries which are either an integer, or a string in quotation marks.
Here are a series of example entries:
1,4,"m"
1,5,20
1,6,"Canada"
1,7,4
1,8,5
When I import this using read.csv, these are all just turned into factors.
How can I set it up such that these are read as integers and strings?
Thank you!

This is not possible, since a given vector can only have a single mode (e.g. character, numeric, or logical).
However, you could split the vector into two separate vectors, one with numeric values and the second with character values:
vec <- c("m", 20, "Canada", 4, 5)
vnum <- as.numeric(vec)
vchar <- ifelse(is.na(vnum), vec, NA)
vnum
[1] NA 20 NA 4 5
vchar
[1] "m" NA "Canada" NA NA
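A minimal sketch of how the same split could be applied straight after import, assuming the file has no header row. The names csv_text, V3_num and V3_char are illustrative and do not come from the question; the mixed column is read as character and then split as above.

```r
# Inline copy of the example rows from the question (stands in for the real file).
csv_text <- '1,4,"m"
1,5,20
1,6,"Canada"
1,7,4
1,8,5'

# Read the mixed column as character rather than factor.
dat <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE)

# Numeric part: entries that are not numbers become NA (coercion warning silenced).
dat$V3_num <- suppressWarnings(as.numeric(dat$V3))

# Character part: keep only the entries that failed the numeric conversion.
dat$V3_char <- ifelse(is.na(dat$V3_num), dat$V3, NA_character_)

dat
```

Keeping the two parts as separate columns means each column still has a single type, which is what most data frame code expects.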

EDIT Despite the OP's decision to accept this answer, @Andrie's answer is the preferred solution. My answer is meant only to inform about some odd features of data frames.
As others have pointed out, the short answer is that this isn't possible. data.frames are intended to contain columns of a single atomic type. @Andrie's suggestion is a good one, but just for kicks I thought I'd point out a way to shoehorn this type of data into a data.frame.
You can convert the offending column to a list (this code assumes you've set options(stringsAsFactors = FALSE)):
dat <- read.table(textConnection("1,4,'m'
1,5,20
1,6,'Canada'
1,7,4
1,8,5"),header = FALSE,sep = ",")
tmp <- as.list(as.numeric(dat$V3))
tmp[c(1,3)] <- dat$V3[c(1,3)]
dat$V3 <- tmp
str(dat)
'data.frame': 5 obs. of 3 variables:
$ V1: int 1 1 1 1 1
$ V2: int 4 5 6 7 8
$ V3:List of 5
..$ : chr "m"
..$ : num 20
..$ : chr "Canada"
..$ : num 4
..$ : num 5
Now, there are all sorts of reasons why this is a bad idea. For one, lots of code that you'd expect to play nicely with data.frames will not like this and either fail, or behave very strangely. But I thought I'd point it out as a curiosity.

No. A dataframe is a series of pasted together vectors (a list of vectors or matrices). Because each column is a vector, it cannot be classified as both integer and factor. It must be one or the other. You could split the vector apart into numeric and factor (a column for each), but I don't believe this is what you want.

Related

Extract all values from a vector of named numerics with the same name in R

I'm trying to handle a vector of named numerics for the first time in R. The vector itself is named p.values. It consists of p-values which are named after their corresponding variables. Through simulation I obtained a huge number of p-values that are always named after one of the five variables they correspond to. However, I'm interested in the p-values of only one variable and tried to extract them with p.values[["var_a"]], but that gives me only the p-value of var_a's last entry. p.values$var_a is invalid, and as.numeric(p.values) or unname(p.values) obviously gives me only all values without names. Any idea how I can get R to give me the 1/5 of named numerics that are named var_a?
Short example:
p.values <- as.numeric(c(rep(1:5, each = 5)))
names(p.values) <- rep(letters[1:5], 5)
str(p.values)
Named num [1:25] 1 1 1 1 1 2 2 2 2 2 ...
- attr(*, "names")= chr [1:25] "a" "b" "c" "d" ...
I'd like to get R to show me all 5 numbers named "a".
Thanks for reading my first post here and I hope some more experienced R users know how to deal with named numerics and can help me with this issue.
You can subset p.values using [ with names(p.values) == "a" to show all values named a.
p.values[names(p.values) == "a"]
#a a a a a
#1 2 3 4 5
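If the values for every name are needed at once, a possible alternative (a sketch, not part of the original answer) is base R's split(), which groups the values by name. The name by_name is illustrative.

```r
# Rebuild the example vector from the question.
p.values <- as.numeric(rep(1:5, each = 5))
names(p.values) <- rep(letters[1:5], 5)

# One list element per distinct name; unname() drops the repeated
# names inside each group so only the values remain.
by_name <- split(unname(p.values), names(p.values))

# All values that were named "a".
by_name[["a"]]
```

Here [[ is appropriate because each name occurs only once in the resulting list.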

Grouping occurrences of a string to a row

tl;dr
Is there a way to group together a large number of values to a single column without truncation of those values?
I am working on a data frame with 48,178 entries in RStudio. The data frame has 2 columns, of which the first one contains unique numeric values and the other contains repeated strings.
----------
id name
1 forest
2 forest
3 park
4 riverbank
.
.
.
.
.
48178 water
----------
I would like to group together all entries on the basis of unique entries in the 2nd column. I have used ddply (from the plyr package) to achieve the result. I now have the following derived table:
----------
type V1
forest forest,forest,forest
park park,park,park,park
riverbank riverbank,riverbank,
water water,water,water,water
----------
However, on applying the str function to the derived data frame, I find that the column contains truncated values, and not every instance of each string.
The output of str is:
'data.frame': 4 obs. of 2 variables:
$ type: chr "forest" "park" "riverbank" "water"
$ V1 : chr "forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,f"| __truncated__ "park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,pa"| __truncated__ "riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverba"| __truncated__ "water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,w"| __truncated__
How do I group together same strings and push them to a row, without truncation?
Extending the answer of HubertL, the str() function does exactly what it is supposed to but is perhaps the wrong choice for what you intend to do.
From the (rather limited) information you have given in your question, it seems that you have already achieved what you are looking for, i.e., concatenating all strings of the same type.
However, it appears that you are stuck with the output of the str() function.
Please refer to the help page ?str.
From the Description section:
Compactly display the internal structure of an R object, a diagnostic function and an alternative to summary (and to some extent, dput). Ideally, only one line for each ‘basic’ structure is displayed. It is especially well suited to compactly display the (abbreviated) contents of (possibly nested) lists. The idea is to give reasonable output for any R object.
str() has a parameter nchar.max which defaults to 128.
nchar.max maximal number of characters to show for character strings. Longer strings are truncated, see longch example below.
The longch example in the Examples section illustrates the effect of this parameter:
nchar(longch <- paste(rep(letters,100), collapse = ""))
#[1] 2600
str(longch)
# chr "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvw"| __truncated__
str(longch, nchar.max = 52)
# chr "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxy"| __truncated__
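As a small check that only the display is shortened, the following sketch compares the stored length with what str() prints. The variable names default_out and full_out are illustrative, and the value 3000 is simply an arbitrary nchar.max larger than the string.

```r
longch <- paste(rep(letters, 100), collapse = "")

# The stored string is complete, whatever str() chooses to display.
nchar(longch)

# Capture what str() prints with the default nchar.max and with a larger one.
default_out <- capture.output(str(longch))
full_out    <- capture.output(str(longch, nchar.max = 3000))

# TRUE for the default display, FALSE once nchar.max exceeds the string length.
any(grepl("__truncated__", default_out))
any(grepl("__truncated__", full_out))
```

For routine inspection, nchar() or print() on the column is usually a more direct way to confirm the contents than raising nchar.max.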
Maximum length of a character string
According to ?"Memory-limits", the number of bytes in a character string is limited to 2^31 - 1 ~ 2*10^9. Given the number of rows in your data frame and the length of name, the concatenated strings won't exceed 0.6*10^6, which is far from the limit.
Try storing the results in a list using base R split() function:
new.list <- split(df, f=df$type)
This will split the data frame into multiple data frames that can be accessed using square brackets. It keeps the character strings from being combined and truncated as the records continue to be preserved in separate cells.
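A short sketch of what the split() approach returns, using a made-up six-row data frame with the id and name columns from the question (the values and the name groups are illustrative only):

```r
# Toy stand-in for the real 48,178-row data frame.
df <- data.frame(id = 1:6,
                 name = c("forest", "forest", "park", "riverbank", "park", "water"),
                 stringsAsFactors = FALSE)

# One data frame per distinct value of name.
groups <- split(df, f = df$name)

names(groups)         # the distinct strings
groups[["forest"]]    # all rows whose name is "forest"
sapply(groups, nrow)  # number of rows in each group
```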
Your strings are not really truncated; only their display by str is truncated:
size <- 48000
df <- data.frame(1:size,
type=sample(c("forest", "park", "riverbank", "water" ),
size, replace = TRUE),
stringsAsFactors = FALSE)
res <- by(df$type , df$type, paste, collapse=",")
str(res)
'by' chr [1:4(1d)] "forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,f"| __truncated__ ...
- attr(*, "dimnames")=List of 1
..$ df$type: chr [1:4] "forest" "park" "riverbank" "water"
- attr(*, "call")= language by.default(data = df$type, INDICES = df$type, FUN = paste, collapse = ",")
lengths( strsplit(res, ','))
forest park riverbank water
11993 12017 11953 12037
sum(lengths( strsplit(res, ',')))
[1] 48000
If all you want is a count of occurrences, then why not simply use table?
df <- read.table(header = TRUE, text = "id name
1 forest
2 forest
3 park
4 riverbank")
df
# table() gives the number of times each word occurs
df1 <- as.data.frame(table(df$name))
# if for some reason you want the repetition, then
# (SIMPLIFY = FALSE keeps a list even when all counts are equal)
x <- mapply(rep, df1$Var1, df1$Freq, SIMPLIFY = FALSE)
y <- sapply(x, paste, collapse = ",")
data.frame(type = df1$Var1, V1 = y)

Extracting the numbers from the data frame

I have a data frame with a "Calculation" column, which could be reproduced by the following code:
a <- data.frame(Id = c(1:3), Calculation = c('[489]/100','[4771]+[4777]+[5127]+[5357]+[5597]+[1044])/[463]','[1044]/[463]'))
> str(a)
'data.frame': 3 obs. of 2 variables:
$ Id : int 1 2 3
$ Calculation: Factor w/ 3 levels "[1044]/[463]",..: 3 2 1
Please note that there are two types of numbers in the "Calculation" column: most of them are surrounded by brackets, but some (in this case the number 100) are not (this has a meaning in my application).
What I would like to do is to extract all the distinct numbers that appear in Calculation column to return a vector with the union of these numbers. Ideally, I would like to be able to distinguish between the numbers that are between brackets and the numbers that are not. This step is not so important (if it makes it complicated) since the numbers that are NOT between the brackets are few and I can manually detect them. So the desired output in this case would be:
b = c(489,4771,4777,5127,5357,5597,1044,463)
Thanks in advance
We can use str_extract_all from library(stringr). Using the regex lookbehind ((?<=\\[)), we match the digits (\\d+) that are preceded by [, extract them into a list, unlist to convert it to a vector, convert the character values to numeric (as.numeric), and take the unique elements.
library(stringr)
unique(as.numeric(unlist(str_extract_all(a$Calculation, '(?<=\\[)\\d+'))))
#[1] 489 4771 4777 5127 5357 5597 1044 463
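The same lookbehind can also be used without stringr. The sketch below relies on base R's gregexpr() and regmatches() with perl = TRUE, and additionally tries to pick out the numbers that are not in brackets, which the question mentions as optional. The second pattern and the names calc, in_br and not_br are assumptions made for illustration, not part of the original answer.

```r
# The Calculation column from the question, as a plain character vector.
calc <- c('[489]/100',
          '[4771]+[4777]+[5127]+[5357]+[5597]+[1044])/[463]',
          '[1044]/[463]')

# Helper: all matches of a Perl-style pattern, as unique numerics.
extract_nums <- function(pattern, x) {
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  unique(as.numeric(unlist(hits)))
}

# Digits enclosed in [ and ].
in_br <- extract_nums("(?<=\\[)\\d+(?=\\])", calc)

# Digits neither preceded by [ or a digit, nor followed by a digit or ].
not_br <- extract_nums("(?<![\\[\\d])\\d+(?![\\d\\]])", calc)

in_br
not_br
```

On the example data, in_br holds the eight bracketed numbers and not_br holds 100.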

Data.frame with both characters and numerics in one column

I have a function I'm using in R that requires input to several parameters, each either as a numeric (1) or as a character (NULL). The default is NULL.
I want to apply the function using all possible combinations of parameters, so I used expand.grid to try and create a dataframe which stores these. However, I am running into problems with creating an object that contains both numerics and characters in one column.
This is what I've tried:
comb<-expand.grid(c("NULL",1),c("NULL",1),stringsAsFactors=FALSE), which returns:
comb
Var1 Var2
1 NULL NULL
2 1 NULL
3 NULL 1
4 1 1
with all entries characters:
class(comb[1,1])
[1] "character"
If I now try and insert a numeric into a specific spot, I still receive a character:
comb[2,1]<-as.numeric(1)
class(comb[2,1])
[1] "character"
I've also tried it using stringsAsFactors=TRUE, or using expand.grid(c(0,1),c(0,1)) and then switching out the 0 for NULL but always have the exact same problem: whenever I do this, I do not get a numeric 1.
Manually creating an object using cbind and then inserting the NULL as a character also does not help. I'd be grateful for a pointer, or a work-around to running the function with all possible combinations of parameters.
As you have been told, generally speaking columns of data frames need to be a single type. It's hard to solve your specific problem, because it is likely that the solution is not really "putting multiple types into a single column" but rather re-organizing your other unseen code to work within this restriction.
As I suggested, it probably will be better to use the built in NA value as expand.grid(c(NA,1),c(NA,1)) and then modify your function to use NA as an input. Or, of course, you could just use some "special" numeric value, like -1, or -99 or something.
The related issue that I mentioned is that you really should avoid using the character string "NULL" to mean anything, since NULL is a special value in R, and confusion will ensue.
These sorts of strategies would all be preferable to mixing types, and using character strings of reserved words like NULL.
All that said, it technically is possible to get around this, but it is awkward, and not a good idea.
> d <- data.frame(x = 1:5)
> d$y <- list("a",1,2,3,"b")
> d
x y
1 1 a
2 2 1
3 3 2
4 4 3
5 5 b
> str(d)
'data.frame': 5 obs. of 2 variables:
$ x: int 1 2 3 4 5
$ y:List of 5
..$ : chr "a"
..$ : num 1
..$ : num 2
..$ : num 3
..$ : chr "b"
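For the underlying goal of running a function over every combination of parameters, one possible workaround (a sketch under assumptions, not taken from the answer above) is to keep the candidate values in lists, so that a real NULL survives, and to let expand.grid enumerate positions rather than values. The function f and the names opts, idx and results are hypothetical stand-ins for the unseen code.

```r
# Hypothetical function standing in for the real one; both parameters default to NULL.
f <- function(a = NULL, b = NULL) {
  paste0("a=", if (is.null(a)) "NULL" else a,
         ", b=", if (is.null(b)) "NULL" else b)
}

# Candidate values per parameter, stored in lists so that NULL is kept as NULL.
opts <- list(a = list(NULL, 1), b = list(NULL, 1))

# Enumerate combinations by position (1 or 2) instead of by value.
idx <- expand.grid(lapply(opts, seq_along))

results <- vapply(seq_len(nrow(idx)), function(i) {
  # Look up the actual value for each parameter; Map() returns a named list,
  # so NULL entries are preserved and passed on as real NULLs.
  args <- Map(function(o, j) o[[j]], opts, idx[i, ])
  do.call(f, args)
}, character(1))

results
```

The data frame then only ever holds integers, and no column needs to mix types.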

Extracting consecutive occurrences in R (like unix uniq)

I'm beginning to analyse data for my thesis. I first need to count consecutive occurrences of strings as one. Here's a sample vector:
test <- c("vv","vv","vv","bb","bb","bb","","cc","cc","vv","vv")
I would like to simply extract the unique values, as with the Unix command uniq. So the expected output would be a vector such as:
"vv","bb","cc","vv"
I looked at the rle function, which seems to be fine, but how would I get the output of rle as a vector? I don't seem to understand the rle class...
> rle(test)
Run Length Encoding
lengths: int [1:5] 3 3 1 2 2
values : chr [1:5] "vv" "bb" "" "cc" "vv"
How do I get one vector of the values output by rle and another one for the lengths? Hope I'm making myself clear...
Thanks again for any help!
rle() returns a two-element list of class "rle"; as @gsk points out, you can use ordinary list-indexing constructs to access the component vectors.
Also, try this, to put the results of rle into a more familiar format:
as.data.frame(rev(unclass(rle(test))))
# values lengths
# 1 vv 3
# 2 bb 3
# 3 1
# 4 cc 2
# 5 vv 2
Source: http://www.sigmafield.org/2009/09/22/r-function-of-the-day-rle
Solution: rle(test)$values
They use: coin.rle <- rle(coin) and coin.rle$values so, rle(test)$values should work.
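Putting this together for the question's vector, a minimal sketch: the two components are pulled out with $, and the empty-string run is then dropped, on the assumption that this is why "" is absent from the expected output. The names runs, vals, lens and keep are illustrative.

```r
test <- c("vv","vv","vv","bb","bb","bb","","cc","cc","vv","vv")

runs <- rle(test)

# The two components of the rle object are ordinary vectors.
vals <- runs$values
lens <- runs$lengths

# Assumption: runs of the empty string should be ignored.
keep <- vals != ""

vals[keep]
lens[keep]
```

Note that dropping "" after rle() does not merge the runs on either side of it, so two identical runs separated only by empty strings would still be reported separately.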
