Remove a character from elements in a dataframe - r

I have a set of data where some elements are preceded by "<" and I need to remove "<" so that I can perform some data analysis. The data is saved in a .txt file and I'm bringing it into R using read.table. Below is an example of what the text file looks like.
Background: 18 <10 27 22 <3
Site: 30 44 23 <16 13
I used x = read.table to make a data frame, then tried gsub("<","",x) to remove the "<", and the result is something completely unexpected, at least to me. This is what I get as a result.
[1] "1:2" "c(18, 30)" "1:2" "c(27, 23)" "c(2, 1)" "1:2"
I have no idea what that means or why it's happening. I would greatly appreciate an explanation both of what is going on here and of how I should go about accomplishing my goal.

df <- read.table(header = TRUE, text = "Background Site
18 30
<10 44
27 23
22 <16
<3 13", stringsAsFactors = FALSE)
You can use mutate_at to apply the gsub function to the variables (i.e. Background and Site) from which you wish to remove the preceding < sign.
library(dplyr)
# funs() is deprecated in recent dplyr versions; a formula lambda does the same job
df %>% mutate_at(vars(Background, Site),
~ as.numeric(gsub("^<", "", .)))
The output is:
Background Site
1 18 30
2 10 44
3 27 23
4 22 16
5 3 13

Read the file with readLines, perform the gsub and then re-read it with read.table. No packages are used:
read.table(text = gsub("<", "", readLines("myfile")), as.is = TRUE)
If the data does not come from a file but is already in a data frame DF, define a clean function which cleans one column of DF and apply it to each column that should be numeric (here, every column except the first):
clean <- function(x) as.numeric(gsub("<", "", x))
DF[-1] <- lapply(DF[-1], clean)
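To see why calling gsub on the whole data frame misbehaves, and how cleaning the text lines first avoids it, here is a minimal base R sketch. It assumes the two lines from the question as inline text in place of the real file; the names lines, raw, whole and DF are only illustrative.

```r
# Two lines in the layout shown in the question (inline instead of a file)
lines <- c("Background: 18 <10 27 22 <3",
           "Site: 30 44 23 <16 13")

# Reading as-is leaves every column that contains "<" as character
raw <- read.table(text = lines, as.is = TRUE)

# gsub() coerces its input with as.character(); for a data frame that
# turns each whole column into a single deparsed string, so the result
# has one element per column instead of one cleaned value per cell
whole <- gsub("<", "", raw, fixed = TRUE)
length(whole) == ncol(raw)

# Stripping "<" from the text lines before parsing keeps the table shape
DF <- read.table(text = gsub("<", "", lines, fixed = TRUE), as.is = TRUE)
str(DF)
```

The key point is that gsub works on character vectors, so it has to be applied either to the raw lines or to one column at a time, never to the data frame as a whole.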

Related

Dividing one column by another column in a dataset [duplicate]

I have a csv file where some of the numerical values are expressed as strings with commas as thousand separator, e.g. "1,513" instead of 1513. What is the simplest way to read the data into R?
I can use read.csv(..., colClasses="character"), but then I have to strip out the commas from the relevant elements before converting those columns to numeric, and I can't find a neat way to do that.
Not sure about how to have read.csv interpret it properly, but you can use gsub to replace "," with "", and then convert the string to numeric using as.numeric:
y <- c("1,200","20,000","100","12,111")
as.numeric(gsub(",", "", y))
# [1] 1200 20000 100 12111
This was also answered previously on R-Help (and in Q2 here).
Alternatively, you can pre-process the file, for instance with sed in unix.
You can have read.table or read.csv do this conversion for you semi-automatically. First create a new class definition, then create a conversion function and set it as an "as" method using the setAs function like so:
setClass("num.with.commas")
setAs("character", "num.with.commas",
function(from) as.numeric(gsub(",", "", from) ) )
Then run read.csv like:
DF <- read.csv('your.file.here',
colClasses=c('num.with.commas','factor','character','numeric','num.with.commas'))
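As an end-to-end illustration of the setAs approach, the sketch below reads a small inline CSV instead of a file. The column names id and amount and the sample values are made up for the example.

```r
library(methods)  # provides setClass() and setAs()

# Register a virtual class plus a character -> number conversion for it
setClass("num.with.commas")
setAs("character", "num.with.commas",
      function(from) as.numeric(gsub(",", "", from, fixed = TRUE)))

# Hypothetical data: quoted numbers with thousands separators
csv_text <- 'id,amount
a,"1,513"
b,"20,000"
c,100'

# read.csv calls the registered conversion for the second column
DF <- read.csv(text = csv_text,
               colClasses = c("character", "num.with.commas"))
str(DF)
```

This only works when the fields with commas are quoted in the file; otherwise read.csv splits them into separate columns before the conversion ever runs.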
I want to use R rather than pre-processing the data as it makes it easier when the data are revised. Following Shane's suggestion of using gsub, I think this is about as neat as I can do:
x <- read.csv("file.csv",header=TRUE,colClasses="character")
col2cvt <- 15:41
x[,col2cvt] <- lapply(x[,col2cvt],function(x){as.numeric(gsub(",", "", x))})
This question is several years old, but I stumbled upon it, which means maybe others will.
The readr library / package has some nice features to it. One of them is a nice way to interpret "messy" columns, like these.
library(readr)
read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5",
col_types = list(col_number()) # called col_numeric() in early readr versions
)
This yields
Source: local data frame [4 x 1]
numbers
(dbl)
1 800.0
2 1800.0
3 3500.0
4 6.5
An important point when reading in files: you either have to pre-process, like the comment above regarding sed, or you have to process while reading. Often, if you try to fix things after the fact, there are some dangerous assumptions made that are hard to find. (Which is why flat files are so evil in the first place.)
For instance, if I had not flagged the col_types, I would have gotten this:
> read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5")
Source: local data frame [4 x 1]
numbers
(chr)
1 800
2 1,800
3 3500
4 6.5
(Notice that it is now a chr (character) instead of a numeric.)
Or, more dangerously, if it were long enough and most of the early elements did not contain commas:
> set.seed(1)
> tmp <- as.character(sample(c(1:10), 100, replace=TRUE))
> tmp <- c(tmp, "1,003")
> tmp <- paste(tmp, collapse="\"\n\"")
(such that the last few elements look like:)
\"5\"\n\"9\"\n\"7\"\n\"1,003"
Then you'll find trouble reading that comma at all!
> tail(read_csv(tmp))
Source: local data frame [6 x 1]
3"
(dbl)
1 8.000
2 5.000
3 5.000
4 9.000
5 7.000
6 1.003
Warning message:
1 problems parsing literal data. See problems(...) for more details.
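One way to guard against this kind of silent misparse, if you convert after reading, is to fail loudly whenever a value cannot be converted. The helper below is a hypothetical base R sketch (to_number_strict is not part of any package) and assumes commas are only ever thousands separators.

```r
# Convert strings such as "1,800" to numbers, stopping on anything that
# does not parse instead of silently returning NA
to_number_strict <- function(x) {
  cleaned <- gsub(",", "", x, fixed = TRUE)
  out <- suppressWarnings(as.numeric(cleaned))
  bad <- !is.na(x) & is.na(out)   # values that were present but failed to parse
  if (any(bad)) {
    stop("Could not parse: ", paste(unique(x[bad]), collapse = ", "))
  }
  out
}

ok <- to_number_strict(c("800", "1,800", "3500", "6.5"))

# A malformed value raises an error rather than becoming NA
failed <- tryCatch(to_number_strict(c("12", "abc")),
                   error = function(e) "error")
```

Note that this still cannot tell "1,003" (one thousand and three) from a decimal comma, so the assumption about what the comma means has to come from knowing the data.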
We can also use readr::parse_number, although the columns must be character vectors. To apply it to multiple columns, we can loop through them using lapply:
df[2:3] <- lapply(df[2:3], readr::parse_number)
df
# a b c
#1 a 12234 12
#2 b 123 1234123
#3 c 1234 1234
#4 d 13456234 15342
#5 e 12312 12334512
Or use mutate_at from dplyr to apply it to specific variables.
library(dplyr)
df %>% mutate_at(2:3, readr::parse_number)
#Or
df %>% mutate_at(vars(b:c), readr::parse_number)
data
df <- data.frame(a = letters[1:5],
b = c("12,234", "123", "1,234", "13,456,234", "123,12"),
c = c("12", "1,234,123","1234", "15,342", "123,345,12"),
stringsAsFactors = FALSE)
A dplyr solution using pipes.
Say you have the following:
> dft
Source: local data frame [11 x 5]
Bureau.Name Account.Code X2014 X2015 X2016
1 Senate 110 158,000 211,000 186,000
2 Senate 115 0 0 0
3 Senate 123 15,000 71,000 21,000
4 Senate 126 6,000 14,000 8,000
5 Senate 127 110,000 234,000 134,000
6 Senate 128 120,000 159,000 134,000
7 Senate 129 0 0 0
8 Senate 130 368,000 465,000 441,000
9 Senate 132 0 0 0
10 Senate 140 0 0 0
11 Senate 140 0 0 0
and want to remove commas from the year variables X2014-X2016 and convert them to numeric. Also, let's say X2014-X2016 are read in as factors (the default before R 4.0.0).
dft %>%
mutate_at(vars(X2014:X2016), as.character) %>%
mutate_at(vars(X2014:X2016), ~ gsub(",", "", .)) %>%
mutate_at(vars(X2014:X2016), as.numeric)
mutate_at applies the function to the columns selected in vars(); mutate_all would apply it to every column, including Bureau.Name.
I did it sequentially, one function at a time (if you pass multiple
functions in a single call, you create additional, unnecessary columns).
"Preprocess" in R:
lines <- "www, rrr, 1,234, ttt \n rrr,zzz, 1,234,567,987, rrr"
You can use readLines on a textConnection, then remove only the commas that are between digits:
gsub("([0-9]+)\\,([0-9])", "\\1\\2", lines)
## [1] "www, rrr, 1234, ttt \n rrr,zzz, 1234567987, rrr"
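Putting the two steps together, a minimal sketch might look like the following. It reuses the lines string from above; the names con, raw, cleaned and DF are illustrative, and it assumes no two genuinely separate numeric fields sit next to each other without a space (the regex would merge "1,2" into "12").

```r
lines <- "www, rrr, 1,234, ttt \n rrr,zzz, 1,234,567,987, rrr"

# Read the text line by line through a connection, then close it
con <- textConnection(lines)
raw <- readLines(con)
close(con)

# Drop only the commas that sit directly between digits
cleaned <- gsub("([0-9]+),([0-9])", "\\1\\2", raw)

# Parse the cleaned lines; strip.white trims the padding around fields
DF <- read.csv(text = cleaned, header = FALSE, strip.white = TRUE)
str(DF)
```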
It's also useful to know, though not directly relevant to this question, that commas as decimal separators can be handled by read.csv2 (automagically) or by read.table (by setting the 'dec' parameter).
Edit: Later I discovered how to use colClasses by designing a new class. See:
How to load df with 1000 separator in R as numeric class?
Using the read_delim function, which is part of the readr package, you can specify an additional parameter:
locale = locale(decimal_mark = ",")
read_delim("filetoread.csv", ";", locale = locale(decimal_mark = ","))
*The semicolon in the second line tells read_delim that the values in the file are separated by semicolons.
This will read all numbers with a decimal comma as proper numbers.
Regards
Mateusz Kania
If thousands are separated by "." and decimals by "," (1.200.000,00), you must set fixed=TRUE when calling gsub, because an unescaped "." in a regular expression matches any character: as.numeric(gsub(".","",y,fixed=TRUE))
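Removing the dots alone still leaves the decimal comma in place, and as.numeric returns NA for a string like "1200000,00". A minimal sketch of the full conversion, using made-up sample values in y, could be:

```r
# Hypothetical values with "." as thousands separator and "," as decimal mark
y <- c("1.200.000,00", "3.500,75", "12,5")

# Step 1: remove the thousands separators (fixed = TRUE treats "." literally)
no_groups <- gsub(".", "", y, fixed = TRUE)

# Step 2: turn the decimal comma into a dot, then convert
values <- as.numeric(sub(",", ".", no_groups, fixed = TRUE))
values
```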
A very convenient way is the readr::read_delim family. Taking the example from the question "Importing csv with multiple separators into R", you can do it as follows:
txt <- 'OBJECTID,District_N,ZONE_CODE,COUNT,AREA,SUM
1,Bagamoyo,1,"136,227","8,514,187,500.000000000000000","352,678.813105723350000"
2,Bariadi,2,"88,350","5,521,875,000.000000000000000","526,307.288878142830000"
3,Chunya,3,"483,059","30,191,187,500.000000000000000","352,444.699742995200000"'
require(readr)
read_csv(txt) # = read_delim(txt, delim = ",")
Which results in the expected result:
# A tibble: 3 × 6
OBJECTID District_N ZONE_CODE COUNT AREA SUM
<int> <chr> <int> <dbl> <dbl> <dbl>
1 1 Bagamoyo 1 136227 8514187500 352678.8
2 2 Bariadi 2 88350 5521875000 526307.3
3 3 Chunya 3 483059 30191187500 352444.7
I think preprocessing is the way to go. You could use Notepad++ which has a regular expression replace option.
For example, if your file were like this:
"1,234","123","1,234"
"234","123","1,234"
123,456,789
Then you could use the regular expression "([0-9]+),([0-9]+)" (the double quotes are part of the pattern, so only quoted numbers match) and replace it with \1\2, which gives:
1234,"123",1234
"234","123",1234
123,456,789
Then you could use x <- read.csv(file="x.csv",header=FALSE) to read the file.
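The same replacement can be sketched in R itself rather than in an editor. This assumes, as in the example, that the double quotes belong to the pattern and that quoted numbers contain at most one comma; a value such as "1,234,567" would need a different pattern. The names raw, cleaned and x are illustrative.

```r
# The three example lines, as they would appear in the file
raw <- c('"1,234","123","1,234"',
         '"234","123","1,234"',
         '123,456,789')

# Match a quoted number with one comma and keep only its digits
cleaned <- gsub('"([0-9]+),([0-9]+)"', "\\1\\2", raw)
cleaned

# Unquoted commas are untouched, so they still act as field separators
x <- read.csv(text = cleaned, header = FALSE)
```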

Is there a quick way to replace column values in R?

Suppose we have a data frame containing numeric values which looks like:
Temperature Height
32 157
31 159
33 139
I want to replace Height values with pic_00001, pic_00002 etc. so that the end result is:
Temperature Height
32 pic_00001
31 pic_00002
33 pic_00003
There are 10,000+ rows in the full data frame, hence I need a quicker way than doing this manually.
You can use sprintf:
# create the example used by the OP
dat <- data.frame(Temperature = c(32, 31, 33),
Height = c(157, 159, 139))
# use sprintf along with seq_len
dat$Height <- sprintf("pic_%05d", seq_len(NROW(dat)))
# show the result
dat
#R> Temperature Height
#R> 1 32 pic_00001
#R> 2 31 pic_00002
#R> 3 33 pic_00003
You can change the 05d if you want more leading zeros, e.g. 07d will give a seven-digit sequence. The manual page for sprintf has further details.
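If the width should be a parameter rather than hard-coded, one option is to build the format string first. This is a small sketch; ids, width and fmt are illustrative names.

```r
ids <- c(1L, 42L, 10000L)   # integer ids, as produced by seq_len()
width <- 7L                 # number of digits to pad to

# Build a format such as "pic_%07d" and apply it to the whole vector
fmt <- paste0("pic_%0", width, "d")
labels <- sprintf(fmt, ids)
labels
```

The %d conversion expects integer values, which is why seq_len is a convenient source for the ids.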
You can do:
id <- seq_len(nrow(data))
new_values <- paste("pic_",id,sep = "")
data$Height <- new_values
To achieve the final output with leading zeros (building on the original answer by monjeanjean):
id <- seq_len(nrow(data))
new_values <- paste("pic_",formatC(id,width=5,flag="0",format="d"),sep = "")
data$Height <- new_values
You can also use the following solution:
library(dplyr)
library(stringr)
df %>%
mutate(across(Height, ~ str_c("pic_", str_pad(row_number(), 5, "left", "0"))))
Temperature Height
1 32 pic_00001
2 31 pic_00002
3 33 pic_00003

Replacing empty cells in a column with values from another column in R

I am trying to pull the cell values from the StudyID column into the empty cells of the SigmaID column, but I am running into an odd issue with the output.
This is how my data looks before running commands.
StudyID Gender Region SigmaID
LM24008 1 20 LM24008
LM82993 1 16 LM28888
ST04283 0 44
ST04238 0 50
LM04829 1 24 LM23921
ST91124 0 89
ST29001 0 55
I tried accomplishing this by writing the syntax in three ways, because I wasn't sure if there is a problem with the way the logic was set up. All three produce the same output.
df$SigmaID <- ifelse(test = df$SigmaID != "", yes = df$SigmaID, no = df$StudyID)
df$SigmaID <- ifelse(df$SigmaID == "", df$StudyID, df$SigmaID)
df %>% mutate(SigmaID = ifelse(Gender == 0, df$StudyID, df$SigmaID))
Output: instead of pulling the values from the StudyID column, it is populating one- to four-digit numbers.
StudyID Gender Region SigmaID
LM24008 1 20 LM24008
LM82993 1 16 LM28888
ST04283 0 44 5
ST04238 0 50 4908
LM04829 1 24 LM23921
ST91124 0 89 209
ST29001 0 55 4092
I have tried recoding the empty spaces to NA and then calling on NA in the logic, but this produced the same output as seen above. I'm wondering if it could have anything to do with variable type or variable attributes and something's off about how it's reading the characters in StudyID. Would appreciate feedback on this issue!
Here is how to do it:
df$SigmaID[df$SigmaID == ""] = df$StudyID[df$SigmaID == ""]
The logical index df$SigmaID == "" selects only the rows where SigmaID is empty, on both sides of the assignment.
I also recommend using data.table instead of data.frame. It is faster and has some useful syntax features:
library(data.table)
setDT(df) # setDT converts a data.frame to a data.table
df[SigmaID=="",SigmaID:=StudyID]
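Following up on this! As it turns out, by default R (before version 4.0.0) converts strings into factors when building a data frame, so the numbers in the output are the factors' internal integer codes. There are a few ways of addressing the issue above.
i <- sapply(df, is.factor)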
df[i] <- lapply(df[i], as.character)
Another method:
df <- read.csv("/insert file pathway here", stringsAsFactors = FALSE)
This is what I found to be helpful! I'm sure there are additional methods of troubleshooting this as well.
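The effect is easy to reproduce on a small example. The sketch below builds a two-row data frame with factor columns on purpose (stringsAsFactors = TRUE) and uses made-up rows in the style of the question; broken and fixed are illustrative names.

```r
# Force factor columns to mimic the old default behaviour
df <- data.frame(StudyID = c("LM24008", "ST04283"),
                 SigmaID = c("LM24008", ""),
                 stringsAsFactors = TRUE)

# ifelse() drops the factor attributes, so the result holds the
# underlying integer codes rather than the labels
broken <- ifelse(df$SigmaID == "", df$StudyID, df$SigmaID)

# Convert factor columns to character first, then the same call works
is_fct <- sapply(df, is.factor)
df[is_fct] <- lapply(df[is_fct], as.character)
fixed <- ifelse(df$SigmaID == "", df$StudyID, df$SigmaID)
fixed
```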

Read CSV in R with first column as dataframe header

I have a simple text file where the first column is names (strings) and the second column is values (floats). As an example, names and ages:
Name, Age
John, 32
Heather, 46,
Jake, 23
Sally, 19
I'd like to read this in as a dataframe (call this df) but transposed so that I can access ages by names such that df$John would return 32. How can I do this?
Previously I tried creating a new dataframe, tdf, looping through the data in a for loop, assigning each name and age and then inserting into the empty dataframe as tdf[name] = age, but this did not work as I expected.
You can read your data using read.table().
Then you can transpose it using t() and set colnames after.
Example:
If df is:
df=read.table("dummydata", header=T, sep=",")
df
Name Age
1 John 32
2 Heather 46
3 Jake 23
4 Sally 19
You transpose the age and then transform them into a dataframe:
tdf=as.data.frame(t(df$Age))
colnames(tdf)=t(df$Name)
So tdf will return:
tdf
John Heather Jake Sally
1 32 46 23 19
And, as you asked, tdf$John will return:
tdf$John
[1] 32
Now, if you have more than two columns you can do the same but instead of indicating the name of the column you can simply indicate the position using brackets.
df=read.table("dummydata", header=T, sep=",")
With t(df[2:ncol(df)]) you transpose the whole table starting from the second column, no matter the number of columns. The first column will be the names after the transpose.
tdf=as.data.frame(t(df[2:ncol(df)]))
Then you set the column names.
colnames(tdf)=t(df[1])
tdf$John
[1] 32
This should add the row as header when you read from the file:
read.csv2(filename, as.is = TRUE, header = TRUE)
Read the data into a data frame, DF (see Note).
1) Assign the names to the rows of DF in which case this will give John's age without having to create a new data structure:
rownames(DF) <- DF$Name
DF["John", "Age"]
## [1] 32
2) Alternatively, split DF into a named list in which case you can get the precise syntax requested:
ages <- with(DF, split(Age, Name))
ages$John
## [1] 32
3) This alternative would also create the same list:
ages <- with(DF, setNames(as.list(Age), Name))
Note: DF in reproducible form is as follows. (We have removed the trailing comma on one line in the question but if it is really there add fill = TRUE to the read.csv line.)
Lines <- "Name, Age
John, 32
Heather, 46
Jake, 23
Sally, 19"
DF <- read.csv(text = Lines)
A bit late but hopefully helpful. The "row.names" parameter allows you to select the desired column as header:
read.csv("df.csv", header = TRUE, row.names = 1)
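To show how row.names = 1 combines with a transpose to give the df$John syntax from the question, here is a small sketch on inline text. It assumes the trailing comma from the question has been removed; Lines, DF and tdf are illustrative names.

```r
# The example data as inline text (no trailing comma)
Lines <- "Name,Age
John,32
Heather,46
Jake,23
Sally,19"

# Use the first column as row names instead of as a regular column
DF <- read.csv(text = Lines, row.names = 1)
DF["John", "Age"]

# Transposing turns the row names into column names
tdf <- as.data.frame(t(DF))
tdf$John
```

Strictly speaking, row.names = 1 produces row labels; the transpose is what turns them into a header.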

How to combine similar elements in a data frame in R

I have a data frame consisting of
Lancaster001A 76
Lancaster001B 35
Lancaster002A 46
Lancaster002D 9
.... ...
I'd like to consolidate the dataframe into this
Lancaster001 111
Lancaster002 55
and so remove the smaller categorising. I couldn't find a way to do this with merge. Is there a general function that can group rows by similarity?
Here is a base R solution using a regex to remove all characters after three numeric characters:
DF <- read.table(text = "Lancaster001A 76
Lancaster001B 35
Lancaster002A 46
Lancaster002D 9")
setNames(aggregate(V2 ~ gsub("(?<=\\d{3}).*", "", V1, perl = TRUE),
DF, FUN = sum),
c("V1", "V2"))
# V1 V2
#1 Lancaster001 111
#2 Lancaster002 55
It would be trivial to use data.table if the aggregation is too slow on a large dataset.
Adjust the regex as needed if the structure of your data is different.
Let's assume these names for your columns, and let's assume the 'smaller categorising' means one letter at the end.
id value
Lancaster001A 76
Lancaster001B 35
Lancaster002A 46
Lancaster002D 9
.... ...
I use dplyr for everything. Install dplyr, make sure your column names are correct, and then try:
library(dplyr)
mydata %>%
mutate(id = substr(id, 1, nchar(id)-1)) %>% # removes last character
group_by(id) %>%
summarize(sum = sum(value))
Edit: An even simpler data.table solution from @Arun's helpful tip:
library(data.table)
dt[, list(sum=sum(value)), by = substr(as.character(id),1,nchar(as.character(id)) - 1)]
id sum
1: Lancaster001 111
2: Lancaster002 55
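For comparison, the same grouping can be sketched in base R without any packages. This assumes, like the dplyr answer, that the sub-category is a single trailing letter and that the columns are called id and value; mydata, group and totals are illustrative names.

```r
# The four example rows from the question
mydata <- data.frame(id = c("Lancaster001A", "Lancaster001B",
                            "Lancaster002A", "Lancaster002D"),
                     value = c(76, 35, 46, 9))

# Strip one trailing letter to get the grouping key
mydata$group <- sub("[A-Za-z]$", "", mydata$id)

# Sum the values within each group
totals <- aggregate(value ~ group, data = mydata, FUN = sum)
totals
```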
