Adding columns by splitting number, and removing duplicates - r

I have a dataframe like the following (this is a reduced example, I have many more rows and columns):
CH1 CH2 CH3
1 3434 282 7622
2 4442 6968 8430
3 4128 6947 478
4 6718 6716 3017
5 3735 9171 1128
6 65 4876 4875
7 9305 6944 3309
8 4283 6060 650
9 5588 2285 203
10 205 2345 9225
11 8634 4840 780
12 6383 0 1257
13 4533 7692 3760
14 9363 9846 4697
15 3892 79 4372
16 6130 5312 9651
17 7880 7386 6239
18 8515 8021 2295
19 1356 74 8467
20 9024 8626 4136
I need to create additional columns by splitting the values. For example, value 1356 would have to be split into 6, 56, and 356. I do this on a for loop splitting by string. I do this to keep the leading zeros. So far, decent.
# CREATE ADDITIONAL COLUMNS
for(col in 1:3) {
# Create a temporal variable
temp <- as.character(data[,col] )
# Save the new column
for(mod in c(-1, -2, -3)) {
# Create the column
temp <- cbind(temp, str_sub(as.character(data[,col]), mod))
}
# Merge to the row
data <- cbind(data, temp)
}
My problem is that not all cells have 4 digits: some may have 1, 2 or 3 digits. Therefore, I get repeated values when I split. For example, for 79 I get: 79 (original), 9, 79, 79, 79.
Problem: I need to remove the repeated values. Of course, I could do unique, but that gives me rows of uneven number of columns. I need to fill those missing (i.e. the removed repeated values) with NA. I can only compare this by row.
I checked CJ Yetman's answer here, but they only replace consecutive numbers. I only need to keep unique values.
Reproducible Example: Here is a fiddle with my code working: http://rextester.com/IKMP73407
Expected outcome: For example, for rows 11 & 12 of the example (see the link for the reproducible example), if this is my original:
8634 4 34 634 4840 0 40 840 780 0 80 780
6383 3 83 383 0 0 0 0 1257 7 57 257
I'd like to get this:
8634 4 34 634 4840 0 40 840 780 NA 80 NA
6383 3 83 383 0 NA NA NA 1257 7 57 257

You can use apply():
The data:
data <- structure(list(CH1 = c(3434L, 4442L, 4128L, 6718L, 3735L, 65L,
9305L, 4283L, 5588L, 205L, 8634L, 6383L, 4533L, 9363L, 3892L,
6130L, 7880L, 8515L, 1356L, 9024L), CH2 = c(282L, 6968L, 6947L,
6716L, 9171L, 4876L, 6944L, 6060L, 2285L, 2345L, 4840L, 0L, 7692L,
9846L, 79L, 5312L, 7386L, 8021L, 74L, 8626L), CH3 = c(7622L,
8430L, 478L, 3017L, 1128L, 4875L, 3309L, 650L, 203L, 9225L, 780L,
1257L, 3760L, 4697L, 4372L, 9651L, 6239L, 2295L, 8467L, 4136L
)), .Names = c("CH1", "CH2", "CH3"), row.names = c(NA, 20L), class = "data.frame")
Select row 11 and 12:
data <- data[11:12, ]
Using your code:
# CREATE ADDITIONAL COLUMNS
for(col in 1:3) {
# Create a temporal variable
temp <- data[,col]
# Save the new column
for(mod in c(10, 100, 1000)) {
# Create the column
temp <- cbind(temp, data[, col] %% mod)
}
data <- cbind(data, temp)
}
data[,1:3] <- NULL
The result is:
temp V2 V3 V4 temp V2 V3 V4 temp V2 V3 V4
11 8634 4 34 634 4840 0 40 840 780 0 80 780
12 6383 3 83 383 0 0 0 0 1257 7 57 257
Then go through the data row by row and remove duplicates and transpose the outcome:
t(apply(data, 1, function(row) {
row[duplicated(row)] <- NA
return(row)
}))
The result is:
temp V2 V3 V4 temp V2 V3 V4 temp V2 V3 V4
11 8634 4 34 634 4840 0 40 840 780 NA 80 NA
12 6383 3 83 383 0 NA NA NA 1257 7 57 257

Related

Conditional filling NA rows with comparing non-NA labeled rows

I want to fill NA rows based on checking the differences between the closest non-NA labeled rows.
For instance
data <- data.frame(sd_value=c(34,33,34,37,36,45),
value=c(383,428,437,455,508,509),
label=c(c("bad",rep(NA,4),"unable")))
> data
sd_value value label
1 34 383 bad
2 33 428 <NA>
3 34 437 <NA>
4 37 455 <NA>
5 36 508 <NA>
6 45 509 unable
I want to evaluate how to change NA rows with checking the difference between sd_value and value those close to bad and unablerows.
if we want to get differences between the rows we can do;
library(dplyr)
data%>%
mutate(diff_val=c(0,diff(value)), diff_sd_val=c(0,diff(sd_value)))
sd_value value label diff_val diff_sd_val
1 34 383 bad 0 0
2 33 428 <NA> 45 -1
3 34 437 <NA> 9 1
4 37 455 <NA> 18 3
5 36 508 <NA> 53 -1
6 45 509 unable 1 9
The condition how I want to label the NA rows is
if the diff_val<50 and diff_sd_val<9 label them with the last non-NA label else use the first non-NA label after the last NA row.
So that the expected output would be
sd_value value label diff_val diff_sd_val
1 34 383 bad 0 0
2 33 428 bad 45 -1
3 34 437 bad 9 1
4 37 455 bad 18 3
5 36 508 unable 53 -1
6 45 509 unable 1 9
The possible solution I cooked up so far:
custom_labelling <- function(x,y,label){
diff_sd_val<-c(NA,diff(x))
diff_val<-c(NA,diff(y))
label <- NA
for (i in 1:length(label)){
if(is.na(label[i])&diff_sd_val<9&diff_val<50){
label[i] <- label
}
else {
label <- label[i]
}
}
return(label)
}
which gives
data%>%
mutate(diff_val=c(0,diff(value)), diff_sd_val=c(0,diff(sd_value)))%>%
mutate(custom_label=custom_labelling(sd_value,value,label))
Error in mutate_impl(.data, dots) :
Evaluation error: missing value where TRUE/FALSE needed.
In addition: Warning message:
In if (is.na(label[i]) & diff_sd_val < 9 & diff_val < 50) { :
the condition has length > 1 and only the first element will be used
One option is to find NA and non-NA index and based on the condition select the closest label to it.
library(dplyr)
#Create a new dataframe with diff_val and diff_sd_val
data1 <- data%>% mutate(diff_val=c(0,diff(value)), diff_sd_val=c(0,diff(sd_value)))
#Get the NA indices
NA_inds <- which(is.na(data1$label))
#Get the non-NA indices
non_NA_inds <- setdiff(1:nrow(data1), NA_inds)
#For every NA index
for (i in NA_inds) {
#Check the condition
if(data1$diff_sd_val[i] < 9 & data1$diff_val[i] < 50)
#Get the last non-NA label
data1$label[i] <- data1$label[non_NA_inds[which.max(i > non_NA_inds)]]
else
#Get the first non-NA label after last NA value
data1$label[i] <- data1$label[non_NA_inds[i < non_NA_inds]]
}
data1
# sd_value value label diff_val diff_sd_val
#1 34 383 bad 0 0
#2 33 428 bad 45 -1
#3 34 437 bad 9 1
#4 37 455 bad 18 3
#5 36 508 unable 53 -1
#6 45 509 unable 1 9
You can remove diff_val and diff_sd_val columns later if not needed.
We can also create a function
custom_label <- function(label, diff_val, diff_sd_val) {
NA_inds <- which(is.na(label))
non_NA_inds <- setdiff(1:length(label), NA_inds)
new_label = label
for (i in NA_inds) {
if(diff_sd_val[i] < 9 & diff_val[i] < 50)
new_label[i] <- label[non_NA_inds[which.max(i > non_NA_inds)]]
else
new_label[i] <- label[non_NA_inds[i < non_NA_inds]]
}
return(new_label)
}
and then apply it
data%>%
mutate(diff_val = c(0, diff(value)),
diff_sd_val = c(0, diff(sd_value)),
new_label = custom_label(label, diff_val, diff_sd_val))
# sd_value value label diff_val diff_sd_val new_label
#1 34 383 bad 0 0 bad
#2 33 428 <NA> 45 -1 bad
#3 34 437 <NA> 9 1 bad
#4 37 455 <NA> 18 3 bad
#5 36 508 <NA> 53 -1 unable
#6 45 509 unable 1 9 unable
If we want to apply it by group we can add a group_by statement and it should work.
data%>%
group_by(group) %>%
mutate(diff_val = c(0, diff(value)),
diff_sd_val = c(0, diff(sd_value)),
new_label = custom_label(label, diff_val, diff_sd_val))

How to subtract values of a first column from all columns by function in R

I wonder how to make a function to subtract values present in column A01 from columns A02, A03 etc.
example data frame:
A01 A02 A03 A04 A05 (...)
1 158 297 326 354 357
2 252 131 341 424 244
3 ...
4 ...
I can manually subtract each column for example:
sampledata[1]-sampledata[1]
sampledata[2]-sampledata[1]
sampledata[3]-sampledata[1]
sampledata[4]-sampledata[1] ... etc.
But how to make a nice function to do this calculation for each of column ? As a result I suppose to have this:
A01 A02 A03 A04 A05 (...)
1 0 139 168 196 199
2 0 -121 89 171 -8
3 ...
4 ...
After subtraction, if some value would be negative, then I want to convert it to zero.
I assume that my problem is easy to solve, but I'm newbie in R.
Thank you all for different solutions.
It seems that the simplest and still perfectly working is that suggested by #DavidArenburg:
new_sample_data = (sampledata - sampledata[,1]) * (sampledata > sampledata[,1])
It makes two transformations in one formula (subtracting first column, and converting negatives to zeroes).
Thank you!
Here's how:
# Your data
A01 <- c(158, 252)
A02 <- c(297, 131)
A03 <- c(326, 341)
A04 <- c(354, 424)
A05 <- c(357, 244)
df <- data.frame(A01, A02, A03, A04, A05, stringsAsFactors = FALSE)
df
# Define the function
f_minus <- function(first_col, other_col) {
other_col - first_col
}
df_output <- as.data.frame(matrix(ncol=ncol(df), nrow=nrow(df)))
for (i in 1:ncol(df)) {
df_output[,c(i)] <- f_minus(df[,1], df[,i])
}
df_output
# V1 V2 V3 V4 V5
# 1 0 139 168 196 199
# 2 0 -121 89 172 -8

Reading unkown file type with strange entries into R

I am completely new at this and here, so please have mercy.
I want to open an ASCII data file in R.
After several different attempts, I have tried df=read.csv("C:MyDirectory" ,header=FALSE, sep="").
This has produced a table with several variables, but some rows clearly contain the wrong information, some cells are blank, some contain NA values.
Any ideas what has gone wrong? I have gotten the file from an offical Spanish research institute:
http://www.cis.es/cis/opencm/ES/2_bancodatos/estudios/listaTematico.jsp?tema=1&todos=si
Then BARĂ“METRO DE OCTUBRE 2017, to the right is a small link entitled "fichero de datos", which allows you to download after providing them with some info. The file giving the trouble is DA3191. If anyone could go through the trouble of helping me with this, it would be awesome. Thank you.
Part 1
This looks like a fixed width format, so you need read.fwf instead of read.csv and friends. I made a screen shot of an almost random place of that file: my hypothesis is that the 99's and 98's etc are missing data codes, so the first 99 marked in yellow would belong to the same column with 4, 2, 0, etc, and the immediately following 99 (not marked) is in the same column with 0, 5, 7, etc.
Part 2
And then look at the file ES3191 -- this looks like SPSS code (pardon my French!) containing the rules about reading in the data file. You can probably figure out the width of each column and what's in there from that file:
DATA LIST FILE= 'DA3191'
/ESTU 1-4 CUES 5-9 CCAA 10-11 PROV 12-13 MUN 14-16 TAMUNI 17 CAPITAL 18 DISTR 19-20 SECCION 21-23
ENTREV 24-27 P0 28 P0A 29-31 P1 32 P2 33 P3 34 P4 35 P5 36 P6 37 P701 38-39 P702 40-41 P703 42-43
P801 44-45 P802 46-47 P803 48-49 P901 50-51 P902 52-53 P903 54-55 P904 56-57 P905 58-59 P906 60-61
P907 62-63 P1001 64 P1002 65 P1003 66 P1101 67 P1102 68 P1103 69 P1104 70 P1201 71 P1202 72
P1203 73 P1204 74 P1205 75 P1206 76 P1207 77 P1208 78 P1209 79 P13 80-81 P13A 82-83 P1401 84-85
P1402 86-87 P1403 88-89 P1404 90-91 P1405 92-93 P1406 94-95 P1407 96-97 P1408 98-99 P1409 100-101
P1410 102-103 P1411 104-105 P1412 106-107 P1413 108-109 P1414 110-111 P1415 112-113 P1416 114-115
I'm not an SPSS expert but I would guess that what it is trying to tell us is that
columns 1-4 contain the variable "ESTU"
columns 5-9 contain the variable "CUES"
etc
For read.fwf you have to calculate each variable's "width" i.e. 4 characters for ESTU (if my reading was right) 5 characters for CUES etc.
Part 3
Using the guesses above, I used the following code to read in your data, and it looks like it works:
# this is copy/pasted SPSS code from file "ES3191"
txt <- "ESTU 1-4 CUES 5-9 CCAA 10-11 PROV 12-13 MUN 14-16 TAMUNI 17 CAPITAL 18 DISTR 19-20 SECCION 21-23
ENTREV 24-27 P0 28 P0A 29-31 P1 32 P2 33 P3 34 P4 35 P5 36 P6 37 P701 38-39 P702 40-41 P703 42-43
P801 44-45 P802 46-47 P803 48-49 P901 50-51 P902 52-53 P903 54-55 P904 56-57 P905 58-59 P906 60-61
P907 62-63 P1001 64 P1002 65 P1003 66 P1101 67 P1102 68 P1103 69 P1104 70 P1201 71 P1202 72
P1203 73 P1204 74 P1205 75 P1206 76 P1207 77 P1208 78 P1209 79 P13 80-81 P13A 82-83 P1401 84-85
P1402 86-87 P1403 88-89 P1404 90-91 P1405 92-93 P1406 94-95 P1407 96-97 P1408 98-99 P1409 100-101
P1410 102-103 P1411 104-105 P1412 106-107 P1413 108-109 P1414 110-111 P1415 112-113 P1416 114-115
P1501 116-117 P1502 118-119 P1503 120-121 P1504 122-123 P1505 124-125 P1506 126-127 P1507 128-129
P1508 130-131 P1509 132-133 P1510 134-135 P1511 136-137 P1512 138-139 P1513 140-141 P1514 142-143
P1515 144-145 P1516 146-147 P16 148 P17 149 P1801 150-151 P1802 152-153 P1803 154-155 P1804 156-157
P1805 158-159 P1806 160-161 P1807 162-163 P1808 164-165 P1809 166-167 P1810 168-169 P1811 170-171
P1812 172-173 P1813 174-175 P19 176 P20 177 P21 178-179 P22 180-181 P23 182-183 P2401 184-185
P2402 186-187 P2403 188-189 P2404 190-191 P2405 192-193 P2406 194-195 P2407 196-197 P2408 198-199
P2409 200-201 P2410 202-203 P2411 204-205 P2412 206-207 P2413 208-209 P2414 210-211 P2415 212-213
P2416 214-215 P25 216 P26 217 P27 218 P27A 219-220 P28 221-222 P29 223 P30 224-225 P31 226 P31A 227-228
P32 229 P32A 230 P33 231 P34 232 P35 233 P35A 234 P36 235 P37 236 P37A 237 P37B 238 P38 239-241
P39 242 P39A 243 P40 244-246 P41 247-248 P42 249-250 P43 251 P43A 252 P43B 253 P44 254 P4501 255
P4502 256 P4503 257 P4504 258 P4601 259-261(A) P4602 262-264(A) P4603 265-267(A) P4604 268-270(A)
P4605 271-273(A) P4701 274-276(A) P4702 277-279(A) P4703 280-282(A) P4704 283-285(A) P4705 286-288(A)
P48 289 P49 290 P50 291 P51 292 I1 293-295 I2 296-298 I3 299-301 I4 302-304 I5 305-307 I6 308-310
I7 311-313 I8 314-316 I9 317-319 E101 320-321 E102 322-323 E103 324-325 E2 326 E3 327-329 E4 330
C1 331 C1A 332-333 C2 334 C2A 335 C2B 336-337 C3 338 C4 339-340 P21R 341-342 P22R 343-344 VOTOSIMG 345-346
P27AR 347-348 RECUERDO 349-350 ESTUDIOS 351 OCUMAR11 352-353 RAMA09 354 CONDICION11 355-356
ESTATUS 357 "
# making a 2-column matrix (name = left column, position = right column)
m <- matrix(scan(text=txt, what=""), ncol=2, byrow=TRUE)
m <- as.data.frame(m, stringsAsFactors=FALSE)
names(m) <- c("Var", "Pos")
pos <- sub("(A)", "", m$Pos, fixed = TRUE) # some entries contain '(A)' - no idea what it means so deleting it
pos <- strsplit(pos, "-")
starts <- as.numeric(sapply(pos, head, 1)) # get the first element from left
ends <- as.numeric(sapply(pos, tail, 1)) # get the first element from right
w <- ends - starts +1
MyData <- read.fwf("R/MD3191/DA3191", widths = w)
names(MyData) <- m$Var
head(MyData)
# ESTU CUES CCAA PROV MUN TAMUNI CAPITAL DISTR SECCION ENTREV P0 P0A P1 P2 P3 P4 P5 P6
# 1 3191 1 16 1 59 5 1 0 0 0 1 0 3 2 2 5 1 2
# 2 3191 2 16 1 59 5 1 0 0 0 1 0 4 2 3 5 2 3
# 3 3191 3 16 1 59 5 1 0 0 0 1 0 4 2 2 4 2 2

Subsetting to drop rows where df$var=0 produces NA rows where var is NA

I have a data.frame that I'm attempting to eliminate some observations on. I want to drop any row in which out$SUB_AGE is equal to 0. However, when I try to subset my df based on that condition, it transforms any row that has NA for out$SUB_AGE into a row of NAs. I've provided a dput below which doesn't actually contain any rows where out$SUB_AGE=0 but it does behave exactly the same as the full dataset which does contain zeroes does.
# dput the data
> temp <- dput(droplevels(out[1:12, 1:4]))
structure(list(SUB_ID = c(5998784L, 6805295L, 318926L, 1270965L,
1635543L, 4296301L, 1001498L, 2388387L, 2190957L, 4168048L, 318926L,
4073180L), ORG_ID = c(10861L, 17361L, 10608L, 11099L, 13135L,
14803L, 12359L, 13151L, 13135L, 17252L, 10608L, 17317L), SUB_AGE = c(36,
NA, NA, 40, 60, 50, 52, 61, 56, 62, NA, NA), SUB_SEX = c(NA,
1, 2, 1, 2, 2, 1, 2, 2, NA, 2, 2)), .Names = c("SUB_ID", "ORG_ID",
"SUB_AGE", "SUB_SEX"), row.names = c(107L, 190L, 242L, 331L,
361L, 447L, 455L, 591L, 663L, 664L, 731L, 732L), class = "data.frame")
# table before subsetting
SUB_ID ORG_ID SUB_AGE SUB_SEX
107 5998784 10861 36 NA
190 6805295 17361 NA 1
242 318926 10608 NA 2
331 1270965 11099 40 1
361 1635543 13135 60 2
447 4296301 14803 50 2
455 1001498 12359 52 1
591 2388387 13151 61 2
663 2190957 13135 56 2
664 4168048 17252 62 NA
731 318926 10608 NA 2
732 4073180 17317 NA 2
# code to subset
temp <- temp[temp$SUB_AGE != 0,]
# table after subsetting
SUB_ID ORG_ID SUB_AGE SUB_SEX
107 5998784 10861 36 NA
NA NA NA NA NA
NA.1 NA NA NA NA
331 1270965 11099 40 1
361 1635543 13135 60 2
447 4296301 14803 50 2
455 1001498 12359 52 1
591 2388387 13151 61 2
663 2190957 13135 56 2
664 4168048 17252 62 NA
NA.2 NA NA NA NA
NA.3 NA NA NA NA
I'm sure there's something simple I'm missing here but I racked my brain and apparently couldn't come up with the right combination of keywords to figure it out myself.
To understand the problem, try printing temp$SUB_AGE != 0:
[1] TRUE NA NA TRUE TRUE TRUE TRUE TRUE TRUE TRUE NA NA
You're using this vector to subset temp, but that functionality only works for TRUE/FALSE values. If you want to keep all the rows with NA values, you can add an extra condition:
temp[temp$SUB_AGE != 0 | is.na(temp$SUB_AGE),]

Binning a dataframe with equal frequency of samples

I have binned my data using the cut function
breaks<-seq(0, 250, by=5)
data<-split(df2, cut(df2$val, breaks))
My split dataframe looks like
... ...
$`(15,20]`
val ks_Result c
15 60 237
18 70 247
... ...
$`(20,25]`
val ks_Result c
21 20 317
24 10 140
... ...
My bins looks like
> table(data)
data
(0,5] (5,10] (10,15] (15,20] (20,25] (25,30] (30,35]
0 0 0 7 128 2748 2307
(35,40] (40,45] (45,50] (50,55] (55,60] (60,65] (65,70]
1404 11472 1064 536 7389 1008 1714
(70,75] (75,80] (80,85] (85,90] (90,95] (95,100] (100,105]
2047 700 329 1107 399 376 323
(105,110] (110,115] (115,120] (120,125] (125,130] (130,135] (135,140]
314 79 1008 77 474 158 381
(140,145] (145,150] (150,155] (155,160] (160,165] (165,170] (170,175]
89 660 15 1090 109 824 247
(175,180] (180,185] (185,190] (190,195] (195,200] (200,205] (205,210]
1226 139 531 174 1041 107 257
(210,215] (215,220] (220,225] (225,230] (230,235] (235,240] (240,245]
72 671 98 212 70 95 25
(245,250]
494
When I mean the bins, I get on an average of ~900 samples
> mean(table(data))
[1] 915.9
I want to tell R to make irregular bins in such a way that each bin will contain on an average 900 samples (e.g. (0, 27] = 900, (27,28.5] = 900, and so on). I found something similar here, which deals with only one variable, not the whole dataframe.
I also tried Hmisc package, unfortunately the bins don't contain equal frequency!!
library(Hmisc)
data<-split(df2, cut2(df2$val, g=30, oneval=TRUE))
data<-split(df2, cut2(df2$val, m=1000, oneval=TRUE))
Assuming you want 50 equal sized buckets (based on your seq) statement, you can use something like:
df <- data.frame(var=runif(500, 0, 100)) # make data
cut.vec <- cut(
df$var,
breaks=quantile(df$var, 0:50/50), # breaks along 1/50 quantiles
include.lowest=T
)
df.split <- split(df, cut.vec)
Hmisc::cut2 has this option built in as well.
Can be done by the function provided here by Joris Meys
EqualFreq2 <- function(x,n){
nx <- length(x)
nrepl <- floor(nx/n)
nplus <- sample(1:n,nx - nrepl*n)
nrep <- rep(nrepl,n)
nrep[nplus] <- nrepl+1
x[order(x)] <- rep(seq.int(n),nrep)
x
}
data<-split(df2, EqualFreq2(df2$val, 25))

Resources