Drop a column if any of its observations equals zero - r

I have a very big data frame with 150000 rows and thousands of columns. A subset is as follows:
df <- data.frame(col1 = c('201507', '201508', '201509', '201510', '201511', '201512', '201601', '201602', '201603'),
col2 = c(12, 45, 6, 23, 17, 32, 67, 23, 12),
col3 = c(0, 0, 12, 0, 67, 34, 87, 19, 9),
col4 = c(4584, 3423, 6723, 1245, 3234, 14577, 213, 557, 5677),
col5 = c(134, 345, 0, 23, 93, 48, 12, 21, 0))
I want to drop any column where:
it has a zero value in any row (for example col3 and col5)
the first row of the column is zero (for example only col3).
I know this is a simple example, but I have thousands of columns.
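One possible way to express both conditions is to compute a logical flag per column and use it to index the data frame. The following is only a sketch on the sample data; the helper names (has_zero, first_is_zero) are made up for illustration, and it assumes there are no NA values in the columns.

```r
df <- data.frame(col1 = c('201507', '201508', '201509', '201510', '201511',
                          '201512', '201601', '201602', '201603'),
                 col2 = c(12, 45, 6, 23, 17, 32, 67, 23, 12),
                 col3 = c(0, 0, 12, 0, 67, 34, 87, 19, 9),
                 col4 = c(4584, 3423, 6723, 1245, 3234, 14577, 213, 557, 5677),
                 col5 = c(134, 345, 0, 23, 93, 48, 12, 21, 0))

# Condition 1: TRUE for columns that contain a zero in any row
has_zero <- vapply(df, function(x) any(x == 0), logical(1))
df_no_zero <- df[, !has_zero, drop = FALSE]

# Condition 2: TRUE for columns whose first row is zero
first_is_zero <- vapply(df, function(x) x[1] == 0, logical(1))
df_first_nonzero <- df[, !first_is_zero, drop = FALSE]

names(df_no_zero)
names(df_first_nonzero)
```

Because vapply works one column at a time, this avoids building a full logical copy of the data frame, which may matter with thousands of columns. If NA values can occur, any(x == 0, na.rm = TRUE) would be one way to ignore them.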

Related

Adding colors to network according to the status of respondent

I am a complete beginner when it comes to R, but I am trying to analyze a friendship network and I want to color the nodes according to each person's status: either 0 = high status (maybe in yellow) or 1 = low status (maybe in red).
I created the status variable from the wave 1 data, removed the NAs and the respondents who did not participate in both waves, and created the network from the wave 2 data:
w1$Status <- NA
w1$Status[!is.na(w1$diplomamother)|!is.na(w1$diplomafather)]<-1
w1$Status[w1$diplomamother!=2&w1$diplomafather!=2] <-0
#Is this necessary?
statusw1 <- w1 %>%
select(Status, NW_ID)
statusw1[-c(1, 10, 12, 13, 17, 44, 50, 51, 52, 53, 54, 15, 16, 20, 22, 25, 33, 38, 57, 58, 59, 60, 62),]
statusw1 <- statusw1[-c(1, 10, 12, 13, 17, 44, 50, 51, 52, 53, 54, 15, 16, 20, 22, 25, 33, 38, 57, 58, 59, 60, 62),]
friendsw2 <- friendsw2 %>%
select(!grep(c("_15_|_16_|_20_|_22_|_25_|_33_|_38_|_57_|_58_|_59_|_60_|_62_"), colnames(friendsw2)))
friends_ma2 <- as.matrix(data.frame(friendsw2, row.names = "NW_ID"))
colnames(friends_ma2) <- rownames(friends_ma2)
friends_ma2[is.na(friends_ma2)] <- 0
friends_ma2[friends_ma2 == ""] <- 0
storage.mode(friends_ma2) <- "numeric"
friends_gr2 <- graph.adjacency(friends_ma2, mode = "directed", diag = FALSE)
plot(friends_gr2)
What do I have to do next to visualize the status?
I hope what I explained is understandable! Thank you for your help!
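A common approach with igraph is to store a color as a vertex attribute, after aligning the status values with the vertex names. The sketch below uses a small made-up adjacency matrix and status table (adj, status_df and the IDs p1 to p4 are hypothetical stand-ins for friends_ma2 and statusw1), and it uses graph_from_adjacency_matrix, the current name for graph.adjacency. Respondents without a status are drawn in grey, which is an assumption about how missing values should be shown.

```r
library(igraph)

# Hypothetical toy data standing in for friends_ma2 and statusw1
ids <- c("p1", "p2", "p3", "p4")
adj <- matrix(c(0, 1, 0, 0,
                1, 0, 1, 0,
                0, 0, 0, 1,
                1, 0, 0, 0),
              nrow = 4, byrow = TRUE, dimnames = list(ids, ids))
status_df <- data.frame(NW_ID = c("p3", "p1", "p4", "p2"),
                        Status = c(1, 0, NA, 0))

g <- graph_from_adjacency_matrix(adj, mode = "directed", diag = FALSE)

# Align the status values with the vertex order by matching on the ID
vertex_status <- status_df$Status[match(V(g)$name, status_df$NW_ID)]

# 0 = high status (yellow), 1 = low status (red), missing = grey
V(g)$color <- ifelse(is.na(vertex_status), "grey",
                     ifelse(vertex_status == 0, "yellow", "red"))

plot(g)
```

Matching on the ID rather than relying on row order means the status table does not need to be sorted the same way as the matrix, and rows that were removed from one object but not the other simply end up grey.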

How to change row values in a "Map loop" in RStudio?

I have this dataframe:
df <- structure(list(a = c(2, 5, 90, 77, 56, 65, 85, 75, 12, 24, 52,
32), b = c(45, 78, 98, 55, 63, 12, 23, 38, 75, 68, 99, 73), c = c(77,
85, 3, 22, 4, 69, 86, 39, 78, 36, 96, 11), d = c(52, 68, 4, 25,
79, 120, 97, 20, 7, 19, 37, 67), e = c(14, 73, 91, 87, 94, 38,
1, 685, 47, 102, 666, 74)), class = "data.frame", row.names = c(NA,
-12L))
and the script:
R <- Map(`+`, list(1:3), 0:3)
df_cum <- as.matrix(rep(NA, ncol(df)))
for (r in seq(R)) {
for (f in seq(ncol(df))) {
df_cum <- sapply(df[R[[r]],], function(x) (cumprod(1 + x) - 1)*100)
}
}
I want to change all the first-row values to 0 in each iteration (rows 1:3, 2:4, 3:5, ...), before running
df_cum <- sapply(df[R[[r]],], function(x) (cumprod(1 + x) - 1)*100)
I.e. for the first cycle, 1:3 (df rows), the first-row values change from "2, 45, 77, 52, 14" to "0, 0, 0, 0, 0".
How can I do this?
Thx
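One way this might be done is to copy each window of rows, overwrite the first row of the copy with 0, and only then apply the cumulative product. The sketch below replaces the nested for loops with lapply so that every window's result is kept instead of being overwritten; the names windows and window are introduced here for illustration only.

```r
df <- data.frame(
  a = c(2, 5, 90, 77, 56, 65, 85, 75, 12, 24, 52, 32),
  b = c(45, 78, 98, 55, 63, 12, 23, 38, 75, 68, 99, 73),
  c = c(77, 85, 3, 22, 4, 69, 86, 39, 78, 36, 96, 11),
  d = c(52, 68, 4, 25, 79, 120, 97, 20, 7, 19, 37, 67),
  e = c(14, 73, 91, 87, 94, 38, 1, 685, 47, 102, 666, 74)
)

# Row windows 1:3, 2:4, 3:5, 4:6
windows <- Map(`+`, list(1:3), 0:3)

df_cum <- lapply(windows, function(rows) {
  window <- df[rows, ]   # copy of the rows in this window
  window[1, ] <- 0       # set the first row of the copy to 0 in every column
  sapply(window, function(x) (cumprod(1 + x) - 1) * 100)
})

df_cum[[1]]
```

The original df is left unchanged because the assignment is made on the copy. Each element of df_cum is a matrix with one row per window row and one column per column of df, and its first row is always 0 since (cumprod(1 + 0) - 1) * 100 is 0.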

R rewriting a for loop

I've got a loop in my code that I would like to rewrite so that the code takes a little less time to complete. I know you should generally avoid loops, but I can't think of another way to accomplish my goal.
So I've got a dataset "df_1531" containing a lot of data that I need to cut into pieces using subset() (if anyone knows a better way, let me know ;) ). I've got a vector with 21 variable names, to each of which I'd like to assign a subset of df_1531. Furthermore, the script contains 22 variables with constraints (shift_XY_time).
So, this is my code now...
# list containing different slots
shift_time_list<- c(startdate, shift_1m_time, shift_1a_time, shift_1n_time,
shift_2m_time, shift_2a_time, shift_2n_time,
shift_3m_time, shift_3a_time, shift_3n_time,
shift_4m_time, shift_4a_time, shift_4n_time,
shift_5m_time, shift_5a_time, shift_5n_time,
shift_6m_time, shift_6a_time, shift_6n_time,
shift_7m_time, shift_7a_time, shift_7n_time)
# List with subset names
shift_sub_list <- c("shift_1m_sub", "shift_1a_sub", "shift_1n_sub",
"shift_2m_sub", "shift_2a_sub", "shift_2n_sub",
"shift_3m_sub", "shift_3a_sub", "shift_3n_sub",
"shift_4m_sub", "shift_4a_sub", "shift_4n_sub",
"shift_5m_sub", "shift_5a_sub", "shift_5n_sub",
"shift_6m_sub", "shift_6a_sub", "shift_6n_sub",
"shift_7m_sub", "shift_7a_sub", "shift_7n_sub")
# The actual loop that I'd like to rewrite
for (i in 1:21) {
assign(shift_sub_list[i], subset(df_1531, df_1531$'PLS FFM' >= shift_time_list[i] & df_1531$'PLS FFM' < shift_time_list[i+1]))
}
Running the loop takes approximately 6 or 7 seconds. So, if anyone knows a better, cleaner, or quicker way to write my code, I would love to hear your suggestions.
**Reproducible example**
mydata <- cars
dput(cars)
structure(list(speed = c(4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11,
12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 16,
16, 17, 17, 17, 18, 18, 18, 18, 19, 19, 19, 20, 20, 20, 20, 20,
22, 23, 24, 24, 24, 24, 25), dist = c(2, 10, 4, 22, 16, 10, 18,
26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34, 46, 26, 36, 60, 80,
20, 26, 54, 32, 40, 32, 40, 50, 42, 56, 76, 84, 36, 46, 68, 32,
48, 52, 56, 64, 66, 54, 70, 92, 93, 120, 85)), class = "data.frame", row.names = c(NA,
-50L))
dist_interval_list <- c( 0, 5, 10, 15,
20, 25, 30, 35,
40, 45, 50, 55,
60, 65, 70, 75,
80, 85, 90, 95,
100, 105, 110, 115, 120)
var_name_list <- c("var_name_1a", "var_name_1b", "var_name_1c", "var_name_1d",
"var_name_2a", "var_name_2b", "var_name_2c", "var_name_2d",
"var_name_3a", "var_name_3b", "var_name_3c", "var_name_3d",
"var_name_4a", "var_name_4b", "var_name_4c", "var_name_4d",
"var_name_5a", "var_name_5b", "var_name_5c", "var_name_5d",
"var_name_6a", "var_name_6b", "var_name_6c", "var_name_6d")
for (i in 1:24){
assign(var_name_list[i], subset(mydata,
mydata$dist >= dist_interval_list[i] &
mydata$dist < dist_interval_list[i+1]))
}
Starting with the 'reproducible' part and the information that the final aim is to summarize another column, it is possible to exploit the fact that the intervals are non-overlapping and simply use the cut function.
library(tidyverse)
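mydata %>%
# right = FALSE gives [lower, upper) intervals, matching the >= and < tests in the loop
mutate(interval = cut(dist, breaks = dist_interval_list, right = FALSE)) %>%
group_by(interval) %>%
summarise(sum = sum(speed))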
This should be much faster and will also help you not to get lost in a messy environment full of variables (which are actually part of your data). You want to keep all your data in a single data frame as long as possible;) You probably want to follow with something like purrrlyr::invoke_rows at the final modeling step, if your function does not work with data frames.
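If the separate subsets are still needed, a base R alternative to the assign() loop is to combine cut with split, which returns one named list instead of many variables. This is a sketch on the cars example only; it assumes the intervals should be closed on the left, as in the loop's >= and < comparisons, hence right = FALSE (the default for cut is intervals closed on the right).

```r
mydata <- cars
dist_interval_list <- seq(0, 120, by = 5)

# Label each row with its [lower, upper) interval
interval <- cut(mydata$dist, breaks = dist_interval_list, right = FALSE)

# One data frame per interval, kept together in a named list
subsets <- split(mydata, interval)

length(subsets)    # one element per interval, including empty ones
subsets[[3]]       # rows with 10 <= dist < 15

# Summaries can then be computed over the list
speed_sums <- vapply(subsets, function(s) sum(s$speed), numeric(1))
```

As in the original loop, a row with dist equal to the last break (120) falls outside every interval and is dropped by split. Individual subsets can be accessed by position or by label, so the environment stays free of 24 separate variables.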

removing columns with similar variance

I have a data frame of 3500 x 4000. I am trying to write a professional command in R to remove any columns that show the same variance. I am able to do this with a long, complicated command such as
datavar <- apply(data, 2, var)
datavar <- datavar[!duplicated(datavar)]
then assemble my data by matching the remaining column names, but this is SAD! I was hoping to do this in a single go. I was thinking of something like
data <- data[, which(apply(data, 2, function(col) !any(var(data) = any(var(data)) )))]
I know the last part of the above command is nonsense, but I also know there is some way it can be done in some... smart command!
Here's some data that applies to the question
data <- structure(list(V1 = c(3, 213, 1, 135, 5, 2323, 1231, 351, 1,
33, 2, 213, 153, 132, 1321, 53, 1, 231, 351, 3135, 13), V2 = c(1,
1, 1, 2, 3, 5, 13, 33, 53, 132, 135, 153, 213, 213, 231, 351,
351, 1231, 1321, 2323, 3135), V3 = c(65, 41, 1, 53132, 1, 6451,
3241, 561, 321, 534, 31, 135, 1, 1351, 31, 351, 31, 31, 3212,
3132, 1), V4 = c(2, 2, 5, 4654, 5641, 21, 21, 1, 1, 465, 31,
4, 651, 35153, 13, 132, 123, 1231, 321, 321, 5), V5 = c(23, 13,
213, 135, 15341, 564, 564, 8, 464, 8, 484, 6546, 132, 165, 123,
135, 132, 132, 123, 123, 2), V6 = c(2, 1, 84, 86468, 464, 18,
45, 55, 2, 5, 12, 4512, 5, 123, 132465, 12, 456, 15, 45, 123213,
12), V7 = c(1, 2, 2, 5, 5, 12, 12, 12, 15, 18, 45, 45, 55, 84,
123, 456, 464, 4512, 86468, 123213, 132465)), .Names = c("V1",
"V2", "V3", "V4", "V5", "V6", "V7"), row.names = c(NA, 21L), class = "data.frame")
Would I be able to keep one of the "similar variance" columns too?
Thanks,
I might go a more cautious route, like
data[, !duplicated(round(sapply(data,var),your_precision_here))]
This is pretty similar to what you've come up with:
vars <- lapply(data,var)
data[,which(sapply(1:length(vars), function(x) !vars[x] %in% vars[-x]))]
One thing to think about though is whether you want to match variances exactly (as in this example) or just variances that are close. The latter would be a significantly more challenging problem.
... or as alternative:
data[ , !c(duplicated(apply(data, 2, var)) | duplicated(apply(data, 2, var), fromLast=TRUE))]
...but also not shorter :)
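The answers above differ in what they keep: using duplicated alone keeps the first column of each group with equal variance, while combining it with fromLast = TRUE drops every column in such a group. A small sketch on a made-up data frame (toy is hypothetical, and comparing variances at 8 significant digits is an assumption chosen for illustration) shows the difference:

```r
# Hypothetical toy data: a and b are permutations of each other,
# so they share the same variance
toy <- data.frame(a = c(1, 2, 3),
                  b = c(3, 1, 2),
                  c = c(1, 5, 9),
                  d = c(2, 2, 8))

# Round to a fixed number of significant digits so that tiny
# floating-point differences do not hide equal variances
col_var <- signif(vapply(toy, var, numeric(1)), 8)

# Keep the first column of each equal-variance group
keep_one <- toy[, !duplicated(col_var), drop = FALSE]

# Drop every column whose variance occurs more than once
is_dup <- duplicated(col_var) | duplicated(col_var, fromLast = TRUE)
drop_all <- toy[, !is_dup, drop = FALSE]

names(keep_one)
names(drop_all)
```

signif rounds relative to the magnitude of each variance, which may be preferable to round when the variances span several orders of magnitude, as they do in the sample data.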

Creating a sequence object from SPELL data

I am trying to create a sequence object with seqdef using SPELL format. Here is an example of my data:
spell <- structure(list(ID = c(1, 3, 3, 4, 5, 5, 6, 8, 9, 10, 11, 11,
12, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15,
15, 15, 15, 15, 15, 16, 16, 16, 16, 17, 17, 17, 18, 18, 18, 19,
19), status = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 2, 3, 1, 2, 3, 2, 3, 1, 1, 1, 3, 1, 3, 3, 1, 3, 1, 1, 1,
1, 1, 3, 3, 1, 3, 1, 1, 1), time1 = c(1, 1, 57, 1, 1, 91, 1,
1, 1, 1, 1, 104, 1, 1, 60, 109, 121, 1, 42, 47, 54, 64, 72, 78,
85, 116, 1, 29, 39, 69, 74, 78, 88, 1, 16, 40, 68, 1, 30, 123,
1, 39, 51, 1, 61), time2 = c(125, 57, 125, 125, 91, 125, 125,
125, 125, 125, 104, 125, 125, 60, 109, 121, 125, 42, 47, 54,
64, 72, 78, 85, 116, 125, 29, 39, 69, 74, 78, 88, 125, 16, 40,
68, 125, 30, 123, 125, 39, 51, 125, 61, 125)), .Names = c("ID",
"status", "time1", "time2"), row.names = c(NA, 45L), class = "data.frame")
When I try to define the sequence object, a strange error is thrown:
spell.seq <- seqdef(data=spell, informat="SPELL", id="ID", begin="time1", end="time2",
status="status", limit=125,process=FALSE)
[>] time axis: 1 -> 125
[>] SPELL data converted into 17 STS sequences
[>] 3 distinct states appear in the data:
1 = 1
2 = 2
3 = 3
[>] state coding:
[alphabet] [label] [long label]
1 1 1 1
2 2 2 2
3 3 3 3
[>] 17 sequences in the data set
[>] min/max sequence length: 125/125
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
invalid 'row.names' length
However, if I do the same indirectly via seqformat, preserving the same arguments, no error is thrown:
sts <- seqformat(data=spell,from="SPELL",to="STS",
id="ID",begin="time1",end="time2",status="status",
limit=125,process=FALSE)
seqs <- seqdef(sts,right="DEL")
Using TraMineR 1.8-5 with R 3.0.0 Windows 7 64-bit. Is this a bug or am I doing something wrong? Thanks in advance.
A quick look at the source of seqdef() for how the row.names are set shows they are set based on the value of the id argument.
Looking in ?seqdef for id shows
id
optional argument for setting the rownames of the sequence object. If NULL (default), the rownames are taken from the input data. If set to "auto", sequences are numbered from 1 to the number of sequences. A vector of rownames of length equal to the number of sequences may be specified as well.
In the example in the question you are passing id="ID", which does not meet these criteria. Changing this to id=NULL allows the command to complete as expected, and a check for equality using identical(spell.seq, seqs) yields TRUE.
