R reading values of numeric field in file wrongly - r

R is reading the values from a file wrongly. One can check if this statement is true with the following example:
A sample picture/snapshot which explains the problem areas is here
(1) Copy paste the following 10 numbers into a test file (sample.csv)
1000522010609612
1000522010609613
1000522010609614
1000522010609615
1000522010609616
1000522010609617
971000522010609612
1501000522010819466
971000522010943717
1501000522010733490
(2) Read these contents into R using read.csv
X <- read.csv("./test.csv", header=FALSE)
(3) Print the output
print(head(X, n=10), digits=22)
The output I got was
V1
1 1000522010609612.000000
2 1000522010609613.000000
3 1000522010609614.000000
4 1000522010609615.000000
5 1000522010609616.000000
6 1000522010609617.000000
7 971000522010609664.000000
8 1501000522010819584.000000
9 971000522010943744.000000
10 1501000522010733568.000000
The problem is that rows 7,8,9,10 are not correct (check the sample 10 numbers that we considered before).
What could be the problem? Is there some setting that I am missing with my R - terminal?

You could try
library(bit64)
x <- read.csv('sample.csv', header=FALSE, colClasses='integer64')
x
# V1
#1 1000522010609612
#2 1000522010609613
#3 1000522010609614
#4 1000522010609615
#5 1000522010609616
#6 1000522010609617
#7 971000522010609612
#8 1501000522010819466
#9 971000522010943717
#10 1501000522010733490
If you load the bit64, then you can also try fread from data.table
library(data.table)
x1 <- fread('sample.csv')
x1
# V1
#1: 1000522010609612
#2: 1000522010609613
#3: 1000522010609614
#4: 1000522010609615
#5: 1000522010609616
#6: 1000522010609617
#7: 971000522010609612
#8: 1501000522010819466
#9: 971000522010943717
#10: 1501000522010733490

Related

Sorting dataframe in R in reverse order - Column name as a variable [duplicate]

I've looked and looked and the answer either does not work for me, or it's far too complex and unnecessary.
I have data, it can be any data, here is an example
chickens <- read.table(textConnection("
feathers beaks
2 3
6 4
1 5
2 4
4 5
10 11
9 8
12 11
7 9
1 4
5 9
"), header = TRUE)
I need to, very simply, sort the data for the 1st column in descending order. It's pretty straightforward, but I have found two things below that both do not work and give me an error which says:
"Error in order(var) : Object 'var' not found.
They are:
chickens <- chickens[order(-feathers),]
and
chickens <- chickens[sort(-feathers),]
I'm not sure what I'm not doing, I can get it to work if I put the df name in front of the varname, but that won't work if I put an minus sign in front of the varname to imply descending sort.
I'd like to do this as simply as possible, i.e. no boolean logic variables, nothing like that. Something akin to SPSS's
SORT BY varname (D)
The answer is probably right in front of me, I apologize for the basic question.
Thank you!
You need to use dataframe name as prefix
chickens[order(chickens$feathers),]
To change the order, the function has decreasing argument
chickens[order(chickens$feathers, decreasing = TRUE),]
The syntax in base R, needs to use dataframe name as a prefix as #dmi3kno has shown. Or you can also use with to avoid using dataframe name and $ all the time as mentioned by #joran.
However, you can also do this with data.table :
library(data.table)
setDT(chickens)[order(-feathers)]
#Also
#setDT(chickens)[order(feathers, decreasing = TRUE)]
# feathers beaks
# 1: 12 11
# 2: 10 11
# 3: 9 8
# 4: 7 9
# 5: 6 4
# 6: 5 9
# 7: 4 5
# 8: 2 3
# 9: 2 4
#10: 1 5
#11: 1 4
and dplyr :
library(dplyr)
chickens %>% arrange(desc(feathers))

efficient rbind alternative with applied function

To take a step back, my ultimate goal is to read in around 130,000 images into R with a pixel size of HxW and then to make a dataframe/datatable containing the rgb of each pixel of each image on a new row. So the output will be something like this:
> head(train_data, 10)
image_no r g b pixel_no
1: 00003e153.jpg 0.11764706 0.1921569 0.3098039 1
2: 00003e153.jpg 0.11372549 0.1882353 0.3058824 2
3: 00003e153.jpg 0.10980392 0.1843137 0.3019608 3
4: 00003e153.jpg 0.11764706 0.1921569 0.3098039 4
5: 00003e153.jpg 0.12941176 0.2039216 0.3215686 5
6: 00003e153.jpg 0.13333333 0.2078431 0.3254902 6
7: 00003e153.jpg 0.12549020 0.2000000 0.3176471 7
8: 00003e153.jpg 0.11764706 0.1921569 0.3098039 8
9: 00003e153.jpg 0.09803922 0.1725490 0.2901961 9
10: 00003e153.jpg 0.11372549 0.1882353 0.3058824 10
I currently have a piece of code to do this in which I apply a function to get the rgb for each pixel of a specified image, returning the result in a dataframe:
#function to get rgb from image file paths
get_rgb_table <- function(link){
img <- readJPEG(toString(link))
# Creating the data frame
rgb_image <- data.frame(r = as.vector(img[1:H, 1:W, 1]),
g = as.vector(img[1:H, 1:W, 2]),
b = as.vector(img[1:H, 1:W, 3]))
#add pixel id
rgb_image$pixel_no <- row.names(rgb_image)
#add image id
train_rgb <- cbind(sub('.*/', '',link),rgb_image)
colnames(train_rgb)[1] <- "image_no"
return(train_rgb)
}
I call this function on another dataframe which contains the links to all the images:
train_files <- list.files(path="~/images/", pattern=".jpg",all.files=T, full.names=T, no.. = T)
train <- data.frame(matrix(unlist(train_files), nrow=length(train_files), byrow=T))
The train dataframe looks like this:
> head(train, 10)
link
1 C:/Documents/image/00003e153.jpg
2 C:/Documents/image/000155de5.jpg
3 C:/Documents/image/00021ddc3.jpg
4 C:/Documents/image/0002756f7.jpg
5 C:/Documents/image/0002d0f32.jpg
6 C:/Documents/image/000303d4d.jpg
7 C:/Documents/image/00031f145.jpg
8 C:/Documents/image/00053c6ba.jpg
9 C:/Documents/image/00057a50d.jpg
10 C:/Documents/image/0005d01c8.jpg
I finally get the result I want with the following loop:
for(i in 1:length(train[,1])){
train_data <- rbind(train_data,get_rgb_table(train[i,1]))
}
However, this last bit of code is very inefficient. An optimization of how the function is applied and and/or the rbind would help. I think the function get_rgb_table() itself is quick but the problem is with the loop and the rbind. I have tried using apply() but can't manage to do this on each row and put the result in one dataframe without running out of memory. Any help on this would be great. Thanks!
This is very difficult to answer given the vagueness of the question, but I'll make a reproducible example of what I think you're asking and will give a solution.
Say I have a function that returns a data frame:
MyFun <- function(x)randu[1:x,]
And I have a data frame df that will act an input to the function.
# a b
# 1 1 21
# 2 2 22
# 3 3 23
# 4 4 24
# 5 5 25
# 6 6 26
# 7 7 27
# 8 8 28
# 9 9 29
# 10 10 30
From your question, it looks like only one column will be used as input. So, I apply the function to each row of this data frame using lapply then I bind the results together using do.call and rbind like this:
do.call(rbind, lapply(df$a, MyFun))

splitting variable product code into letters and numbers

I have a product code variable like:
Product Code
RMMI001,
RMMI001,
CMCM009,
ASCMOT064,
ASPMOA023,
CMCM009,
CMCM012,
CMCM001,
ASCMBW001,
RMMI001,
TMHO002,
TMSP001,
TMHO002,
TMDMST003
I need to split those and need these characters in another column.
You may try using sub here to remove all trailing numbers, leaving you with the character portion:
df <- data.frame(product_code=c("RMMI001", "RMMI001", "CMCM009"))
df$code <- sub("\\d*$", "", df$product_code)
df
product_code code
1 RMMI001 RMMI
2 RMMI001 RMMI
3 CMCM009 CMCM
Demo
What about something like this?
# Sample product codes
ss <- c("RMMI001", "RMMI001", "CMCM009", "ASCMOT064", "ASPMOA023", "CMCM009", "CMCM012", "CMCM001", "ASCMBW001", "RMMI001", "TMHO002", "TMSP001", "TMHO002", "TMDMST003")
# Separate code and numbers and store in data.frame
read.csv(text = gsub("^([a-zA-Z]+)(\\d+)$", "\\1,\\2", ss), header = F)
# V1 V2
#1 RMMI 1
#2 RMMI 1
#3 CMCM 9
#4 ASCMOT 64
#5 ASPMOA 23
#6 CMCM 9
#7 CMCM 12
#8 CMCM 1
#9 ASCMBW 1
#10 RMMI 1
#11 TMHO 2
#12 TMSP 1
#13 TMHO 2
#14 TMDMST 3
You can use tidyr::extract as well, it works with dataframes only.
tidyr::extract(data.frame(x =c("RMMI001", "CMCM009")),x, c("first", "second"), "([a-zA-Z]+)(\\d+)" )
Output:
# first second
#1 RMMI 001
#2 CMCM 009
This will extract both the alphabets and numbers in separate columns, if you choose "([a-zA-Z]+)\d+" instead of "([a-zA-Z]+)(\d+)". It will then extract only the first match represented as english words like below. Note the difference here is the capturing group represented by parenthesis.It is used here for capturing the match, in this case these are words and numbers into separate columns.
tidyr::extract(data.frame(x =c("RMMI001", "CMCM009")),x, c("first"), "([a-zA-Z]+)\\d+" )
# first
# 1 RMMI
# 2 CMCM

Load all R data files from specific folder

I've got a lot of Rdata files which I want to combine in one dataframe.
My files, as an example, are:
file1.RData
file2.RData
file3.RData
All the datafiles have the structure: datafile$a and datafile$b. From all of the files above I would like to load take the variable $aand add this to and already existing dataframe called md. My problem isn't loading the files into the global environment, but processing the data in the RData file.
My code so far, which obviously doesn't work.
library(dplyr)
files <- list.files("correct directory", pattern="*.RData")
This returns the correct list of files.
I also know I need to lapply over a function.
lapply(files, myFun)
My problem is in the function. What I've got at the moment:
myFun <- function(files) {
load(files)
df <- data.frame(datafile$a)
md <- bind_rows(md, df)
}
The code above doesn't work, any idea how I get this to work?
Try
library(dplyr)
bind_rows(lapply(files, myFun))
# a
#1 1
#2 2
#3 3
#4 4
#5 5
#6 1
#7 2
#8 3
#9 4
#10 5
#11 6
#12 7
#13 8
#14 9
#15 10
#16 11
#17 12
#18 13
#19 14
#20 15
where
myFun <- function(files) {
load(files)
df <- data.frame(a= datafile$a)
}
data
datafile <- data.frame(a=1:5, b=6:10)
save(datafile, file='file1.RData')
datafile <- data.frame(a=1:15, b=16:30)
save(datafile, file='file2.RData')
files <- list.files(pattern='file\\d+.RData')
files

Replace values in one data frame from values in another data frame

I need to change individual identifiers that are currently alphabetical to numerical. I have created a data frame where each alphabetical identifier is associated with a number
individuals num.individuals (g4)
1 ZYO 64
2 KAO 24
3 MKU 32
4 SAG 42
What I need to replace ZYO with the number 64 in my main data frame (g3) and like wise for all the other codes.
My main data frame (g3) looks like this
SAG YOG GOG BES ATR ALI COC CEL DUN EVA END GAR HAR HUX ISH INO JUL
1 2
2 2 EVA
3 SAG 2 EVA
4 2
5 SAG 2
6 2
Now on a small scale I can write a code to change it like I did with ATR
g3$ATR <- as.character(g3$ATR)
g3[g3$target == "ATR" | g3$ATR == "ATR","ATR"] <- 2
But this is time consuming and increased chance of human error.
I know there are ways to do this on a broad scale with NAs
I think maybe we could do a for loop for this, but I am not good enough to write one myself.
I have also been trying to use this function which I feel like may work but I am not sure how to logically build this argument, it was posted on the questions board here
Fast replacing values in dataframe in R
df <- as.data.frame(lapply(df, function(x){replace(x, x <0,0)})
I have tried to work my data into this by
df <- as.data.frame(lapply(g4, function(g3){replace(x, x <0,0)})
Here is one approach using the data.table package:
First, create a reproducible example similar to your data:
require(data.table)
ref <- data.table(individuals=1:4,num.individuals=c("ZYO","KAO","MKU","SAG"),g4=c(64,24,32,42))
g3 <- data.table(SAG=c("","SAG","","SAG"),KAO=c("KAO","KAO","",""))
Here is the ref table:
individuals num.individuals g4
1: 1 ZYO 64
2: 2 KAO 24
3: 3 MKU 32
4: 4 SAG 42
And here is your g3 table:
SAG KAO
1: KAO
2: SAG KAO
3:
4: SAG
And now we do our find and replacing:
g3[ , lapply(.SD,function(x) ref$g4[chmatch(x,ref$num.individuals)])]
And the final result:
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA
And if you need more speed, the fastmatch package might help with their fmatch function:
require(fastmatch)
g3[ , lapply(.SD,function(x) ref$g4[fmatch(x,ref$num.individuals)])]
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA

Resources