There are 10 numeric values, each within the range 1 - 4096.
The values may contain duplicates. The aim is to compress these 10 numeric values into 24 bits. The compression must be lossless: from the resulting 24 bits we must also be able to retrieve the 10 values in their original sequence.
Could anyone help me with some suggestions?
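For context, a quick counting check (a sketch in R) shows why no general scheme can meet these constraints; 24 bits can only suffice if the values have extra structure (a much smaller effective range, known repetition patterns, etc.):

# Each value in 1-4096 carries log2(4096) = 12 bits of information,
# so 10 unconstrained values need 120 bits -- five times the 24-bit budget:
10 * log2(4096)   # 120
2^24 < 4096^10    # TRUE: fewer 24-bit codes than possible input sequences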
In the data set I use, there is no numeric information other than the measurement values, which are encoded as 0s and 1s. The remaining columns hold values such as location and education information. How can I get numeric data from all this character data? By the way, I'm using the R language.
I have computed some frequency values, but I don't know what to do about columns like location and education.
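One common approach (a sketch only; df and the column names here are hypothetical stand-ins for your data) is to turn each character column into a factor, then use either the integer codes or 0/1 indicator columns:

# hypothetical stand-in for the real data
df <- data.frame(location = c("city", "rural", "city"),
                 education = c("BA", "HS", "PhD"))

# Option 1: integer codes (note this imposes an arbitrary ordering)
df$location_num <- as.integer(factor(df$location))

# Option 2: one 0/1 indicator (dummy) column per category
dummies <- model.matrix(~ education - 1, data = df)

Indicator columns are usually the safer choice for unordered categories like location, since integer codes invent an ordering that most models will treat as meaningful.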
I am trying to remove some outliers from my data set. I am investigating each variable in the data one at a time. I have constructed boxplots for the variables, but I don't want to remove all of the classified outliers, only the most extreme. So I note the value on the boxplot that I don't want a variable to exceed, and then try to remove the rows whose value in that column exceeds the chosen cutoff.
For example,
My data set is called milk and one of the variables is called alpha_s1_casein. I thought the following would remove all rows in the data set where the value for alpha_s1_casein is greater than 29:
milk <- milk[milk$alpha_s1_casein < 29,]
In fact it did: the number of rows in the data frame decreased from 430 to 428. However, it has introduced a lot of NA values in unrelated columns in my data set.
Before I ran the above code, the number of NAs was
sum(is.na(milk))
5909 NA values
But after running the above, the sum of NAs returned is
sum(is.na(milk))
75912 NA values.
I don't understand what is going wrong here, or why this introduces more NA values than I started with, when all I'm trying to do is remove observations whose value in one column exceeds a certain number.
Can anyone help? I'm desperate.
Without using additional packages, to remove all rows in the data set where the value for alpha_s1_casein is greater than 29, you can just do this:
milk <- milk[-which(milk$alpha_s1_casein > 29),]
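The extra NAs come from the original logical subset, not from the data itself: alpha_s1_casein contains NAs, the comparison milk$alpha_s1_casein < 29 is NA for those rows, and indexing a data frame with an NA produces a row of all NAs. A sketch of two filters that handle the NAs explicitly:

# keep only rows known to be below the cutoff; NA comparisons are excluded
milk <- milk[!is.na(milk$alpha_s1_casein) & milk$alpha_s1_casein < 29, ]

# subset() treats an NA condition as FALSE, so it drops those rows automatically
milk <- subset(milk, alpha_s1_casein < 29)

One caveat with the -which() form above: if no row matches the condition, which() returns integer(0), and milk[-integer(0), ] is the same as milk[integer(0), ], which drops every row. The two forms here are safer in general.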
I want to convert a data table containing numeric values for 305 variables and 361 observations into a data table of the same size containing dates. The data table does contain NAs.
The numeric values of the dates use an Excel origin. This is what I have tried so far:
Rep_Day_monthly <- as.data.table(sapply(Rep_Day_monthly,as.numeric))
Rep_Day_monthly <- sapply(Rep_Day_monthly,as.Date)
The problem with this is that the data table still contains numeric values, e.g. 5963 instead of 1986-04-30.
Looking forward to your help!
Cheers
as.Date needs an origin (i.e. the date corresponding to 0). Excel on Windows actually counts days from 1899-12-30, but since 5963 maps to 1986-04-30 with the Unix epoch, your values are already day counts from 1970-01-01, so you could use Rep_Day_monthly <- as.data.table(lapply(Rep_Day_monthly,as.Date,origin="1970-01-01")). Note lapply rather than sapply: sapply simplifies its result to a numeric matrix, which strips the Date class and is why you were still seeing numbers.
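A minimal runnable sketch of that fix (the column names and day counts here are made up):

library(data.table)

# hypothetical stand-in for the real 305 x 361 table; NAs survive the conversion
Rep_Day_monthly <- data.table(a = c(5963, NA), b = c(5844, 6000))

# lapply keeps each column as a Date vector
Rep_Day_monthly <- as.data.table(lapply(Rep_Day_monthly, as.Date, origin = "1970-01-01"))
Rep_Day_monthly   # a: 1986-04-30, NA; b: 1986-01-01, 1986-06-06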
I'm importing a large dataset in R and am curious whether there's a way to quickly go through the columns and identify whether each column holds categorical values, numeric values, dates, etc. When I use str(df) or class(df), the columns mostly come back mislabeled.
For example, some columns are labeled as numeric, but there are only 10 unique values in the column (ranging from 1-10), indicating that it should really be a factor. There are other columns that only have 11 unique values representing a rating, from 0-5 in 0.5 increments. Another column has country codes (172 values), which range from 1-230.
Is there a way to quickly identify if a column should be a factor without going through each of the columns to understand the nature of variable? (there are many columns in the dataset)
Thanks!
At the moment, I've been using variations of the following code to catch the first two cases:
df[,51] = as.numeric(df[,51]) #convert the column to numeric
len = length(unique(df[,51])) #find number of unique values
diff = max(df[,51]) - min(df[,51]) #calculate difference between min and max
ord = diff / (len - 1) # calculate the increment if equally spaced
#subtract the max value from second to max value to find the actual increment (only uses last two values)
step = sort(unique(df[,51]),partial=len)[len] -
sort(unique(df[,51]),partial=len-1)[len-1]
ord == step #check if the last increment equals the implied increment
However, this approach assumes that each of the variables are equally spaced (for example, incremented 0.5) and only tests the space between the last two values. This wouldn't catch a column that contains c(1,2,3.5,4.5,5,6) which has 6 unique values, but uneven spacing in the middle (not that this is common in my dataset).
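A variant that checks every gap rather than just the last one (and uses a tolerance instead of an exact == on doubles; the 1e-8 value is an arbitrary choice) would catch that case:

vals = sort(unique(df[,51]))
gaps = diff(vals)                  # increments between consecutive unique values
all(abs(gaps - gaps[1]) < 1e-8)    # TRUE only if the spacing is even throughout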
It is not obvious how many distinct values would indicate a factor vs a numeric variable, but you can examine all variables to see what is in your data with
table(sapply(df, function(x) length(unique(x))))
and if you decide that the boundary between factor and numeric is k you can identify the factors with
which(sapply(df, function(x) {length(unique(x)) < k}))
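Building on that, a sketch of the conversion step (k = 15 is a hypothetical cutoff; choose it after inspecting the table above):

k = 15   # hypothetical boundary between factor-like and truly numeric columns
fac_cols = which(sapply(df, function(x) length(unique(x)) < k))
df[fac_cols] = lapply(df[fac_cols], factor)   # convert those columns to factors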
Questions about displaying a certain number of digits have been posted before, but only for single values or vectors, so I hope someone can help me with this.
I have a data frame with several columns and want to display all values in one column (and that column only) with two decimal digits. I have tried round(), format(), and options(digits), but none of them worked on a numeric column. I wonder whether there is a way to do this without the detour of extracting the column as a vector and gluing everything back together.
Thanks a lot!
Here's an example of how to do this with the cars data.frame that comes installed with R.
First I'll add some variability so that we have numbers with decimal places:
data=cars+runif(nrow(cars))
Then to round just a single column (in this case the dist column to 2 decimal places):
data[,'dist']=round(data[,'dist'],2)
If your data contain whole numbers, then you can guarantee that all values display 2 decimal places by using format(). Note that format() returns character strings, so this is best kept for display rather than further computation:
cars[,'dist']=format(round(cars[,'dist'],2),nsmall=2)
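An equivalent one-liner uses sprintf, which also returns character, so keep a numeric copy if you still need to calculate with the column:

cars[,'dist'] = sprintf("%.2f", cars[,'dist'])   # "2.00", "10.00", ...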